r - Simultaneously merge multiple data.frames in a list

ID : 10287

viewed : 24

Tags : rlistmergedataframer-faqr

Top 5 Answer for r - Simultaneously merge multiple data.frames in a list

vote vote

93

Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE) y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE) z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE) 

Update June 2018: I divided the answer in three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.


1) Join them with reduce from the purrr package:

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse) list(x, y, z) %>% reduce(left_join, by = "i") #  A tibble: 3 x 4 #  i       j     k     l #  <chr> <int> <int> <int> # 1 a      1    NA     9 # 2 b      2     4    NA # 3 c      3     5     7 

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i") # A tibble: 4 x 4 # i       j     k     l # <chr> <int> <int> <int> # 1 a     1     NA     9 # 2 b     2     4      NA # 3 c     3     5      7 # 4 d     NA    6      8  list(x, y, z) %>% reduce(inner_join, by = "i") # A tibble: 1 x 4 # i       j     k     l # <chr> <int> <int> <int> # 1 c     3     5     7 

2) dplyr::left_join() with base R Reduce():

list(x,y,z) %>%     Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)  #   i j  k  l # 1 a 1 NA  9 # 2 b 2  4 NA # 3 c 3  5  7 

3) Base R merge() with base R Reduce():

And for comparison purposes, here is a base R version of the left join based on Charles's answer.

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),         list(x,y,z)) #   i j  k  l # 1 a 1 NA  9 # 2 b 2  4 NA # 3 c 3  5  7 
vote vote

84

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames) 

Here's a fully example using some mock data:

set.seed(1) list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10))) merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames) tail(merged.data.frame) #    x  a  b         y #12 12 NA 18        NA #13 13 NA 19        NA #14 14 NA 20 0.4976992 #15 15 NA NA 0.7176185 #16 16 NA NA 0.3841037 #17 19 NA NA 0.3800352 

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list) merged.data.frame[, 1:12]  #  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y #1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA> #2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA> #3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA> 

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y' merge(my.list[[1]], my.list[[2]], by=match.by, all=T) # [1] matchname    party        st           district     chamber      senate1993   name.x       # [8] votes.year.x senate1994   name.y       votes.year.y #<0 rows> (or 0-length row.names) # as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T) # [1] matchname    party        st           district     chamber      senate1993   name.x       # [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year   #<0 rows> (or 0-length row.names) # the next merge will fail as 'name' will get renamed to a pre-existing field. 

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,       names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list)) 

The merge/Reduce will then work fine.

vote vote

74

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument

reshape::merge_all(list_of_dataframes, ...) 

Here is an excellent resource on different methods to merge data frames.

vote vote

61

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... ) {     if ( length( data ) == 2 )      {         return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )     }         return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) ) } 
vote vote

50

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3) y <- data_frame(i = c("b","c","d"), k = 4:6) z <- data_frame(i = c("c","d","a"), l = 7:9) 

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)   list(x, y, z) %>%    map_df(gather, key=key, value=value, -i) %>%    spread(key, value) 

Top 3 video Explaining r - Simultaneously merge multiple data.frames in a list

Related QUESTION?