Merging data frames: An example with artificial data


df1 <- data.frame(
    x = c(1,3,2,4,6,5),
    y = c(1,1,2,2,2,4)
)
df1
  x y
1 1 1
2 3 1
3 2 2
4 4 2
5 6 2
6 5 4

df2 <- data.frame(
    a = c(51,42,22),
    b = c(1,2,3)
)
df2
  a  b
1 51 1
2 42 2
3 22 3

In this first attempt at merging, the data frames do not share any variables, hence there is no way of determining which of the rows of the two data frames “belong together”. In such a case each row of the first data frame is matched with each of the second data frame. Hence the number of rows of the result equals the products of the numbers of rows of the two data frames.


df12 <- merge(df1,df2)
df12
   x y a  b
1  1 1 51 1
2  3 1 51 1
3  2 2 51 1
4  4 2 51 1
5  6 2 51 1
6  5 4 51 1
7  1 1 42 2
8  3 1 42 2
9  2 2 42 2
10 4 2 42 2
11 6 2 42 2
12 5 4 42 2
13 1 1 22 3
14 3 1 22 3
15 2 2 22 3
16 4 2 22 3
17 6 2 22 3
18 5 4 22 3

nrow(df1)
[1] 6

nrow(df2)
[1] 3

nrow(df12)
[1] 18

By explicitly specifying the variables used for matching, the result is different: It contains only rows for which matches can be found in both data frames


merge(df1,df2,by.x="y",by.y="b")
  y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42

With the optional argument all.x=TRUE the result has a row for each row from the first data frame, whether or not a match is find for it: Missing information (from non-existing rows of the second data frame) is filled up with NA.


merge(df1,df2,by.x="y",by.y="b",
      all.x=TRUE)
  y x a
1 1 1 51
2 1 3 51
3 2 2 42
4 2 4 42
5 2 6 42
6 4 5 NA

With all.y=TRUE the result contains all rows from the second data frame:


merge(df1,df2,by.x="y",by.y="b",
      all.y=TRUE)
  y x  a
1 1  1 51
2 1  3 51
3 2  2 42
4 2  4 42
5 2  6 42
6 3 NA 22

The argument setting all=TRUE is equivalent with all.x=TRUE and all.y=TRUE


merge(df1,df2,by.x="y",by.y="b",
      all=TRUE)
  y x  a
1 1  1 51
2 1  3 51
3 2  2 42
4 2  4 42
5 2  6 42
6 3 NA 22
7 4  5 NA

Downloadable R script and interactive version

Explanation

The link with the “jupyterhub” icon directs you to an interactive Jupyter1 notebook, which runs inside a Docker container2. There are two variants of the interative notebook. One shuts down after 60 seconds and does not require a sign it. The other requires sign in using your ORCID3 credentials, yet shuts down only after 24 hours. (There is no guarantee that such a container persists that long, it may be shut down earlier for maintenance purposes.) After shutdown all data within the container will be reset, i.e. all files created by the user will be deleted.4

Above you see a rendered version of the Jupyter notebook.5

1

For more information about Jupyter see http://jupyter.org. The Jupyter notebooks make use of the IRKernel package.

2

For more information about Docker see https://docs.docker.com/. The container images were created with repo2docker, while containers are run with docker spawner.

3

ORCID is a free service for the authentication of researchers. It also allows to showcase publications and contributions to the academic community such as peer review.. See https://info.orcid.org/what-is-orcid/ for more information.

4

The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.

5

The notebook is rendered with the help of the nbsphinx extension.