Merging data frames: An example with data from the British Election Study

Here we merge data from the British Election Study. Here we use data from the British Election Study 2010. The data set bes2010feelings-prepost-for-merge.RData is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data.


load("bes2010feelings-prepost-for-merge.RData")

A peek into a data frame about respondents’ feelings about parties:


head(bes2010flngs_parties_pre)
        id    refno vote   flng.labour flng.cons flng.libdem flng.snp flng.pcym
40103.1 40103 312   NA     5           6         4           NA       NA
40107.1 40107 312   NA     1           6         7           NA       NA
40109.1 40109 312   NA     3           4         5           NA       NA
40110.1 40110 312   Labour 6           6         5           NA       NA
40111.1 40111 312   Labour 8           4         5           NA       NA
40112.1 40112 312   Labour 5           1         4           NA       NA
        flng.green flng.ukip flng.bnp region
40103.1 7           3        0        England
40107.1 6           0        0        NA
40109.1 5           0        0        England
40110.1 5           3        2        England
40111.1 4          NA        2        NA
40112.1 4           0        0        England

And anotehr peek into a data frame about respondents’ feelings about party leaders:


head(bes2010flngs_leaders_pre)
        id    flng.brown flng.cameron flng.clegg flng.salmond flng.jones
40103.1 40103 6          3            3          NA            5
40107.1 40107 3          7            5          NA            3
40109.1 40109 8          7            4          NA           10
40110.1 40110 4          4            3          NA            7
40111.1 40111 5          5            5          NA            5
40112.1 40112 5          0            4          NA            1

The variable that identifies individual respondents in both data frames is ‘id’, so we use this variable to match the rows in both data frames:


bes2010flngs_pre_merged <- merge(
    bes2010flngs_parties_pre,
    bes2010flngs_leaders_pre,
    by="id"
)

merge() also allows for identifier variables that may have different names in the two data frame. In such cases one can use the named arguments by.x= and by.y=:


bes2010flngs_pre_merged <- merge(
    bes2010flngs_parties_pre,
    bes2010flngs_leaders_pre,
    by.x="id",
    by.y="id"
)

It is not absolutely necessary to provide a by= argument, if the merged data frames share a variable (with the same name in both) that idenfies cases or observations. Therefore, we can call merge() here without any by=, by.x=, or by.y= arguments:


bes2010flngs_pre_merged <- merge(
    bes2010flngs_parties_pre,
    bes2010flngs_leaders_pre
)
head(bes2010flngs_pre_merged)
  id    refno vote   flng.labour flng.cons flng.libdem flng.snp flng.pcym
1 40103 312   NA     5           6         4           NA       NA
2 40107 312   NA     1           6         7           NA       NA
3 40109 312   NA     3           4         5           NA       NA
4 40110 312   Labour 6           6         5           NA       NA
5 40111 312   Labour 8           4         5           NA       NA
6 40112 312   Labour 5           1         4           NA       NA
  flng.green flng.ukip flng.bnp region  flng.brown flng.cameron flng.clegg
1 7           3        0        England 6          3            3
2 6           0        0        NA      3          7            5
3 5           0        0        England 8          7            4
4 5           3        2        England 4          4            3
5 4          NA        2        NA      5          5            5
6 4           0        0        England 5          0            4
  flng.salmond flng.jones
1 NA            5
2 NA            3
3 NA           10
4 NA            7
5 NA            5
6 NA            1

The data frame constwin contains data about relectoral districts, that is which party won the respective district seat in 2005 and 2010. The variable that identifies the electoral district is both in the individual-level data frame and the district-level data frame is named refno, so we use this as a matching variable.


bes2010pre_merged <- merge(
    bes2010flngs_pre_merged,
    constwin,
    by = "refno" # Not necessary in the present case, because
)                # it is the same in both data frames.

As can be glimpsed from the output of str, the result of merge is sorted by the matching variable, i.e. “refno”


head(bes2010pre_merged)
  refno id    vote          flng.labour flng.cons flng.libdem flng.snp
1 1     77920 Plaid Cymru    6          5         5           NA
2 1     57911 NA             5          3         3           NA
3 1     57905 Labour        10          0         3           NA
4 1     57906 Labour        10          0         4           NA
5 1     57910 Conservatives  0          9         3           NA
6 1     57902 Conservatives  8          9         6           NA
  flng.pcym flng.green flng.ukip ⋯ flng.brown flng.cameron flng.clegg
1 7         7          5         ⋯  0         8            NA
2 3         4          0         ⋯  6         5             5
3 5         3          4         ⋯  8         0             4
4 3         0          6         ⋯ 10         0             0
5 3         6          2         ⋯  0         9             0
6 5         5          6         ⋯  4         8             6
  flng.salmond flng.jones seat     win05  win10  maj05 maj10
1 NA            4         Aberavon Labour Labour 46.3  35.7
2 NA            6         Aberavon Labour Labour 46.3  35.7
3 NA            0         Aberavon Labour Labour 46.3  35.7
4 NA           10         Aberavon Labour Labour 46.3  35.7
5 NA            0         Aberavon Labour Labour 46.3  35.7
6 NA            7         Aberavon Labour Labour 46.3  35.7

Downloadable R script and interactive version

Explanation

The link with the “jupyterhub” icon directs you to an interactive Jupyter1 notebook, which runs inside a Docker container2. There are two variants of the interative notebook. One shuts down after 60 seconds and does not require a sign it. The other requires sign in using your ORCID3 credentials, yet shuts down only after 24 hours. (There is no guarantee that such a container persists that long, it may be shut down earlier for maintenance purposes.) After shutdown all data within the container will be reset, i.e. all files created by the user will be deleted.4

Above you see a rendered version of the Jupyter notebook.5

1

For more information about Jupyter see http://jupyter.org. The Jupyter notebooks make use of the IRKernel package.

2

For more information about Docker see https://docs.docker.com/. The container images were created with repo2docker, while containers are run with docker spawner.

3

ORCID is a free service for the authentication of researchers. It also allows to showcase publications and contributions to the academic community such as peer review.. See https://info.orcid.org/what-is-orcid/ for more information.

4

The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.

5

The notebook is rendered with the help of the nbsphinx extension.