r - Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

Tags : r, merge, compare, rows, dataframe

Top 5 Answers for r - Compare two data.frames to find the rows in data.frame 1 that are not present in data.frame 2

93 votes

sqldf provides a nice solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2') 

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons:

require(dplyr)

anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2

semi_join(a1,a2) 
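
When `by` is omitted, both joins match on every column the two data frames have in common and print a message naming the columns they used. A minimal sketch, assuming the same a1 and a2 as above, that names the key columns explicitly so the match does not change silently if a column is added later:

```r
library(dplyr)

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# Rows of a1 that have no match in a2, matching on both columns
only_in_a1 <- anti_join(a1, a2, by = c("a", "b"))

# Rows of a1 that do have a match in a2
in_both <- semi_join(a1, a2, by = c("a", "b"))
```

Here only_in_a1 should hold the rows with a = 4 and a = 5, and in_both the rows with a = 1, 2, 3.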

90 votes

In dplyr:

setdiff(a1,a2) 

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a Left Excluding Join.

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the setdiff() code when using the OP's data:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
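
The two calls agree on the OP's data because no row is repeated. As far as I know, dplyr's setdiff() treats data frames as sets and drops duplicate rows, while anti_join() keeps every unmatched row of the first table. A small sketch with made-up data (big and small are hypothetical names, not from the question) to show the difference:

```r
library(dplyr)

# Hypothetical data: the row (4, "d") appears twice in big
big   <- data.frame(a = c(1L, 4L, 4L), b = c("a", "d", "d"))
small <- data.frame(a = 1L, b = "a")

# Set semantics: the duplicated row is reported once
diff_set <- dplyr::setdiff(big, small)

# Filtering join: both copies of the unmatched row are kept
diff_join <- anti_join(big, small, by = c("a", "b"))
```

If repeated rows matter for your comparison, anti_join() is probably the safer choice.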

76 votes

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
   data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e
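
The column-by-column setdiff above works on this data because each column happens to have the same number of leftover values; in general the columns can end up with different lengths. A rough row-wise alternative in base R, which is a different technique from the compare package and compares a1 against a2 directly, is to build one key string per row. The separator "\r" is an arbitrary choice, assumed not to occur in the data:

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

# One key string per row, built by pasting all columns together
key1 <- do.call(paste, c(a1, sep = "\r"))
key2 <- do.call(paste, c(a2, sep = "\r"))

# Keep the rows of a1 whose key does not appear among the keys of a2
difference <- a1[!(key1 %in% key2), ]
```

This assumes both data frames have the same columns in the same order, and it compares values by their printed form, so treat it as a sketch rather than a robust solution.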

65 votes

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

Missing values (NA) in included_a1 mark the rows that are missing from a1; likewise, NA in included_a2 marks the rows that are missing from a2.
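
To pull out the rows that are in a1 but not in a2, filter the merged result on the indicator column. A minimal sketch using the question's data (only_in_a1 is a name chosen here for illustration):

```r
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE

# Full outer join on the shared columns a and b
res <- merge(a1, a2, all = TRUE)

# Rows that came from a1 only: the a2 indicator was never set
only_in_a1 <- res[is.na(res$included_a2), c("a", "b")]
```

With this data, only_in_a1 should contain the rows with a = 4 and a = 5.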

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where rows are coded as the same when in fact they are different. The advantage of using merge is that you get, for free, all the error checking that a good solution needs.

59 votes

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

> df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
> df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
> df_compare = compare_df(df1, df2, "row")

> df_compare$comparison_df
  row chng_type a b
1   4         + 4 d
2   5         + 5 e

A more complicated example:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
  grp chng_type                id1 id2  hp cyl  qsec
1   1         -  Hornet Sportabout Dus 175   8 17.02
2   2         +         Datsun 710 Dat 181   4 33.00
3   2         -         Datsun 710 Dat  93   4 18.61
4   3         +         Duster 360 Dus 245   8 15.84
5   7         +          Merc 240D Mer  62   4 20.00
6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking

df_compare$html_output
