Data Frames and their Management

This chapter describes how a typical data set used in multivariate analysis is composed, i.e. as a rectangular arrangement of variables and observations. It further describes ways to manipulate data within data frames and how data frames can be restricted, combined, and reshaped. It also discusses how data in various formats, including CSV, TAB-delimited, and fixed-column files, can be imported.

Below is the supporting material for the various sections of the chapter.

The Structure of Data Frames

  • Script file: structure-of-data-frames.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    Data frame construction

    In [2]:
    # First create a few vectors from which we construct the data frame:
    population  <- c(55619400,1885400,5424800,3125000)
    area.sq.m   <- c(50301,5460,30090,8023)
    GVA.cap     <- c(28096,20000,24800,19900)
    # then we use 'data.frame' to construct the data frame:
    UK <- data.frame(population,area.sq.m,GVA.cap)
    UK
    
      population area.sq.m GVA.cap
    1 55619400   50301     28096  
    2  1885400    5460     20000  
    3  5424800   30090     24800  
    4  3125000    8023     19900  
    In [3]:
    names(UK)
    names(UK) <- c("Population","Area","GVA")
    UK
    
    [1] "population" "area.sq.m"  "GVA.cap"   
      Population Area  GVA  
    1 55619400   50301 28096
    2  1885400    5460 20000
    3  5424800   30090 24800
    4  3125000    8023 19900
    In [4]:
    row.names(UK)
    
    [1] "1" "2" "3" "4"
    In [5]:
    row.names(UK) <- c("England",
                       "Northern Ireland",
                       "Scotland",
                       "Wales")
    UK
    
                     Population Area  GVA  
    England          55619400   50301 28096
    Northern Ireland  1885400    5460 20000
    Scotland          5424800   30090 24800
    Wales             3125000    8023 19900
    In [6]:
    # It is also possible to set the names and row names in the data frame explicitly, when this
    # appears more convenient:
    UK <- data.frame(
               Population = c(55619400,1885400,5424800,3125000),
               Area = c(50301,5460,30090,8023),
               GVA = c(28096,20000,24800,19900),
               row.names = c("England",
                             "Northern Ireland",
                             "Scotland",
                             "Wales"))
    UK
    
                     Population Area  GVA  
    England          55619400   50301 28096
    Northern Ireland  1885400    5460 20000
    Scotland          5424800   30090 24800
    Wales             3125000    8023 19900
    In [7]:
    nrow(UK)
    
    [1] 4
    In [8]:
    ncol(UK)
    
    [1] 3
    In [9]:
    dim(UK)
    
    [1] 4 3

    In what follows we treat the data frame 'UK' as a list:

    In [10]:
    # Here we get the variable 'Population':
    UK$Population
    
    [1] 55619400  1885400  5424800  3125000
    In [11]:
    # Analogously, one can use the double bracket-operator ('[[]]')
    # to get the variable 'Population':
    UK[["Population"]]
    
    [1] 55619400  1885400  5424800  3125000
    In [12]:
    # Also the single bracket-operator works as with lists.
    # We get a data frame of the first two variables in
    # the data frame
    UK[1:2]
    
                     Population Area 
    England          55619400   50301
    Northern Ireland  1885400    5460
    Scotland          5424800   30090
    Wales             3125000    8023
    In [13]:
    # Now we get a data frame with the variables named 'Population' and
    # 'GVA'
    UK[c("Population","GVA")]
    
                     Population GVA  
    England          55619400   28096
    Northern Ireland  1885400   20000
    Scotland          5424800   24800
    Wales             3125000   19900
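
The list-like behaviour shown above can be checked directly. The following is a minimal sketch that rebuilds the 'UK' data frame from the text so that it runs on its own; the names 'col_means' and 'density' are introduced here for illustration only.

```r
# Rebuild the small 'UK' data frame from the text
UK <- data.frame(
    Population = c(55619400, 1885400, 5424800, 3125000),
    Area       = c(50301, 5460, 30090, 8023),
    GVA        = c(28096, 20000, 24800, 19900),
    row.names  = c("England", "Northern Ireland", "Scotland", "Wales"))

# A data frame is a list whose elements are the variables (columns)
is.list(UK)
length(UK)      # number of list elements, i.e. number of variables

# List tools therefore work column by column,
# for example to get the mean of each variable:
col_means <- sapply(UK, mean)
col_means

# Variables extracted with '$' are ordinary vectors and can be combined
# (hypothetical derived quantity: population per unit of area)
density <- UK$Population / UK$Area
names(density) <- row.names(UK)
density
```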

    The next few lines show the selection of rows and columns of a data frame:

    In [14]:
    # We select the first two rows of the
    # data frame 'UK' by just using their numbers:
    UK[1:2,]
    
                     Population Area  GVA  
    England          55619400   50301 28096
    Northern Ireland  1885400    5460 20000
    In [15]:
    # By referring to row names, we select Scotland and Wales:
    UK[c("Scotland","Wales"),]
    
             Population Area  GVA  
    Scotland 5424800    30090 24800
    Wales    3125000     8023 19900
    In [16]:
    # As in a previous example, we select the first two columns ...
    UK[,1:2]
    
                     Population Area 
    England          55619400   50301
    Northern Ireland  1885400    5460
    Scotland          5424800   30090
    Wales             3125000    8023
    In [17]:
    # and the variables named 'Population' and 'GVA'
    UK[,c("Population","GVA")]
    
                     Population GVA  
    England          55619400   28096
    Northern Ireland  1885400   20000
    Scotland          5424800   24800
    Wales             3125000   19900
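
One detail that the selections above do not show is what happens when only a single column is selected. The sketch below, which again rebuilds 'UK' so that it is self-contained, suggests that the matrix-style form with a comma returns a plain vector unless drop=FALSE is given, whereas the list-style form keeps a data frame. The object names 'v', 'd1', 'd2', and 'r' are illustrative only.

```r
UK <- data.frame(
    Population = c(55619400, 1885400, 5424800, 3125000),
    Area       = c(50301, 5460, 30090, 8023),
    GVA        = c(28096, 20000, 24800, 19900),
    row.names  = c("England", "Northern Ireland", "Scotland", "Wales"))

# Matrix-style selection of a single column gives a vector ...
v <- UK[, "Population"]
class(v)

# ... whereas list-style selection keeps a one-column data frame
d1 <- UK["Population"]
class(d1)

# 'drop = FALSE' keeps the data frame in matrix-style selection, too
d2 <- UK[, "Population", drop = FALSE]
dim(d2)

# Selecting a single row always gives a data frame with one row
r <- UK["Wales", ]
dim(r)
```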

Accessing and Changing Variables in Data Frames

  • Script file: accessing-and-changing-variables.R

    Required data file: bes2010feelings-prepost.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    load("bes2010feelings-prepost.RData")
    

    with() versus attach()

    In [3]:
    c(
        Brown   = mean(bes2010flngs_pre$flng.brown,na.rm=TRUE),
        Cameron = mean(bes2010flngs_pre$flng.cameron,na.rm=TRUE),
        Clegg   = mean(bes2010flngs_pre$flng.clegg,na.rm=TRUE),
        Salmond = mean(bes2010flngs_pre$flng.salmond,na.rm=TRUE),
        Jones   = mean(bes2010flngs_pre$flng.jones,na.rm=TRUE)
    )
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.339703 5.090708 4.557366 4.505660 4.235949 
    In [4]:
    ## Use of 'attach'
    
    # The following code shows how the use of 'attach' can lead to confusion
    Mean <- function(x,...) mean(x,na.rm=TRUE,...)
    attach(bes2010flngs_pre)
    c(
        Brown   = Mean(flng.brown),
        Cameron = Mean(flng.cameron),
        Clegg   = Mean(flng.clegg),
        Salmond = Mean(flng.salmond),
        Jones   = Mean(flng.jones)
    )
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.339703 5.090708 4.557366 4.505660 4.235949 
    In [5]:
    attach(bes2010flngs_post)
    c(
        Brown   = Mean(flng.brown),
        Cameron = Mean(flng.cameron),
        Clegg   = Mean(flng.clegg),
        Salmond = Mean(flng.salmond),
        Jones   = Mean(flng.jones)
    )
    
    The following objects are masked from bes2010flngs_pre:
    
        flng.bnp, flng.brown, flng.cameron, flng.clegg, flng.cons,
        flng.green, flng.jones, flng.labour, flng.libdem, flng.pcym,
        flng.salmond, flng.snp, flng.ukip, region
    
    
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.448116 5.206120 5.001756 4.228707 4.509317 
    In [6]:
    detach(bes2010flngs_post)
    
    In [7]:
    c(
        Brown   = Mean(flng.brown),
        Cameron = Mean(flng.cameron),
        Clegg   = Mean(flng.clegg),
        Salmond = Mean(flng.salmond),
        Jones   = Mean(flng.jones)
    )
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.339703 5.090708 4.557366 4.505660 4.235949 
    In [8]:
    detach(bes2010flngs_pre)
    
    In [9]:
    # 'with()' is a better alternative, because it is clear where the data in the variables come from:
    
    with(bes2010flngs_pre,c(
        Brown   = Mean(flng.brown),
        Cameron = Mean(flng.cameron),
        Clegg   = Mean(flng.clegg),
        Salmond = Mean(flng.salmond),
        Jones   = Mean(flng.jones)
    ))
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.339703 5.090708 4.557366 4.505660 4.235949 
    In [10]:
    with(bes2010flngs_post,c(
        Brown   = Mean(flng.brown),
        Cameron = Mean(flng.cameron),
        Clegg   = Mean(flng.clegg),
        Salmond = Mean(flng.salmond),
        Jones   = Mean(flng.jones)
    ))
    
       Brown  Cameron    Clegg  Salmond    Jones 
    4.448116 5.206120 5.001756 4.228707 4.509317 
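
The same pattern can be tried without the BES data file. The following minimal sketch uses two small artificial data frames (the names 'pre', 'post', and 'score' are made up for this illustration) that share a variable name, as the pre- and post-election data frames do.

```r
# Two artificial "waves" with an identically named variable
pre  <- data.frame(score = c(4, 5, NA, 6))
post <- data.frame(score = c(5, 7, 6, NA))

# Same helper as in the notebook above
Mean <- function(x, ...) mean(x, na.rm = TRUE, ...)

# Each call names the data frame in which 'score' is looked up,
# so nothing is attached to the search path and nothing gets masked
means <- c(
    pre  = with(pre,  Mean(score)),
    post = with(post, Mean(score))
)
means
```

Because the data frame is an explicit argument of every with() call, the order in which the calls are made does not affect the results.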

    Changing variables within a data frame

    In [12]:
    bes2010flngs_pre <- within(bes2010flngs_pre,{
        ave_flng <- (flng.brown + flng.cameron + flng.clegg)/3
        rel_flng.brown   <- flng.brown - ave_flng
        rel_flng.cameron <- flng.cameron - ave_flng
        rel_flng.clegg   <- flng.clegg - ave_flng
    })
    
    In [13]:
    with(bes2010flngs_pre,c(
        Brown   = Mean(rel_flng.brown),
        Cameron = Mean(rel_flng.cameron),
        Clegg   = Mean(rel_flng.clegg)
    ))
    
         Brown    Cameron      Clegg 
    -0.3960328  0.5068399 -0.1108071 
    In [14]:
    # It is also possible without 'within()' but this is terribly tedious:
    bes2010flngs_pre$ave_flng <- (bes2010flngs_pre$flng.brown +
                                  bes2010flngs_pre$flng.cameron +
                                  bes2010flngs_pre$flng.clegg)/3
    bes2010flngs_pre$rel_flng.brown   <- (bes2010flngs_pre$flng.brown
                                          - bes2010flngs_pre$ave_flng)
    bes2010flngs_pre$rel_flng.cameron <- (bes2010flngs_pre$flng.cameron
                                          - bes2010flngs_pre$ave_flng)
    bes2010flngs_pre$rel_flng.clegg   <- (bes2010flngs_pre$flng.clegg
                                          - bes2010flngs_pre$ave_flng)
    
    In [15]:
    with(bes2010flngs_pre,c(
        Brown   = Mean(rel_flng.brown),
        Cameron = Mean(rel_flng.cameron),
        Clegg   = Mean(rel_flng.clegg)
    ))
    
         Brown    Cameron      Clegg 
    -0.3960328  0.5068399 -0.1108071 
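
For readers without access to the data file, here is a small sketch of the same within() pattern on artificial data. The data frame 'ratings' and its variables are hypothetical. It also illustrates that within() returns a modified copy, so the result has to be assigned for the change to persist.

```r
# Artificial ratings of three objects by three respondents
ratings <- data.frame(a = c(2, 4, 6),
                      b = c(4, 4, 4),
                      d = c(6, 10, 2))

# Later statements in the block can use variables created earlier in it
ratings2 <- within(ratings, {
    avg   <- (a + b + d) / 3
    rel_a <- a - avg
})

names(ratings)    # the original data frame is unchanged
ratings2$avg
ratings2$rel_a
```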

Manipulating Data Frames

Subsetting

  • Script file: subsetting.R

    Required data file: bes2010feelings-prepost.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    First we load an R data file that contains data from the 2010 British election study.

    In [2]:
    load("bes2010feelings-prepost.RData")
    

    We then create a subset with only observations from Scotland and with parties and party leaders that run in Scotland:

    In [3]:
    bes2010flngs_pre_scotland <- subset(bes2010flngs_pre,
                                        region=="Scotland",
                                        select=c(
                                            flng.brown,
                                            flng.cameron,
                                            flng.clegg,
                                            flng.salmond,
                                            flng.labour,
                                            flng.cons,
                                            flng.libdem,
                                            flng.snp,
                                            flng.green))
    

    We can now compare the average feeling about Gordon Brown in the whole sample and in the subsample from Scotland. First the whole UK:

    In [4]:
    with(bes2010flngs_pre,mean(flng.brown,na.rm=TRUE))
    
    [1] 4.339703

    then the Scotland subsample:

    In [5]:
    with(bes2010flngs_pre_scotland,mean(flng.brown,na.rm=TRUE))
    
    [1] 5.395

    It is also possible to create a subset of cases and variables with the bracket operator, but this is pretty tedious:

    In [6]:
    bes2010flngs_pre_scotland <- bes2010flngs_pre[
        bes2010flngs_pre$region=="Scotland",c(
                                 "flng.labour",
                                 "flng.cons",
                                 "flng.libdem",
                                 "flng.snp",
                                 "flng.green",
                                 "flng.brown",
                                 "flng.cameron",
                                 "flng.clegg",
                                 "flng.salmond")]
    
    In [7]:
    with(bes2010flngs_pre_scotland,mean(flng.brown,na.rm=TRUE))
    
    [1] 5.395
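
Apart from being more tedious, the bracket operator also treats missing values in the condition differently from subset(). The sketch below uses a small artificial data frame (the names 'resp', 's1', 's2', and 's3' are made up) in which 'region' contains an NA, as it does in the BES data. Assuming this behaviour carries over, the two subsets above need not have the same number of rows, although the means agree because of na.rm=TRUE.

```r
resp <- data.frame(
    region = c("Scotland", "England", NA, "Scotland"),
    flng   = c(6, 3, 5, 4))

# subset() drops rows for which the condition evaluates to NA
s1 <- subset(resp, region == "Scotland", select = flng)

# The bracket operator keeps such rows, filled with NA
s2 <- resp[resp$region == "Scotland", "flng", drop = FALSE]

# Wrapping the condition in which() drops them again
s3 <- resp[which(resp$region == "Scotland"), "flng", drop = FALSE]

c(subset = nrow(s1), bracket = nrow(s2), bracket.which = nrow(s3))

# With na.rm = TRUE the extra NA row does not change the mean
c(mean(s1$flng, na.rm = TRUE), mean(s2$flng, na.rm = TRUE))
```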

Merging

  • Merging with data from the British Election Study

    • Script file: merging-BES.R

      Required data file: bes2010feelings-prepost-for-merge.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      Here we merge data from the British Election Study

      In [2]:
      load("bes2010feelings-prepost-for-merge.RData")
      

      A peek into a data frame about respondents' feelings about parties:

      In [3]:
      head(bes2010flngs_parties_pre)
      
              id    refno vote   flng.labour flng.cons flng.libdem flng.snp flng.pcym
      40103.1 40103 312   NA     5           6         4           NA       NA       
      40107.1 40107 312   NA     1           6         7           NA       NA       
      40109.1 40109 312   NA     3           4         5           NA       NA       
      40110.1 40110 312   Labour 6           6         5           NA       NA       
      40111.1 40111 312   Labour 8           4         5           NA       NA       
      40112.1 40112 312   Labour 5           1         4           NA       NA       
              flng.green flng.ukip flng.bnp region 
      40103.1 7           3        0        England
      40107.1 6           0        0        NA     
      40109.1 5           0        0        England
      40110.1 5           3        2        England
      40111.1 4          NA        2        NA     
      40112.1 4           0        0        England

      And another peek into a data frame about respondents' feelings about party leaders:

      In [4]:
      head(bes2010flngs_leaders_pre)
      
              id    flng.brown flng.cameron flng.clegg flng.salmond flng.jones
      40103.1 40103 6          3            3          NA            5        
      40107.1 40107 3          7            5          NA            3        
      40109.1 40109 8          7            4          NA           10        
      40110.1 40110 4          4            3          NA            7        
      40111.1 40111 5          5            5          NA            5        
      40112.1 40112 5          0            4          NA            1        

      The variable that identifies individual respondents in both data frames is 'id', so we use this variable to match the rows in both data frames:

      In [5]:
      bes2010flngs_pre_merged <- merge(
          bes2010flngs_parties_pre,
          bes2010flngs_leaders_pre,
          by="id"
      )
      

      merge() also allows for identifier variables that have different names in the two data frames. In such cases one can use the named arguments by.x= and by.y=:

      In [6]:
      bes2010flngs_pre_merged <- merge(
          bes2010flngs_parties_pre,
          bes2010flngs_leaders_pre,
          by.x="id",
          by.y="id"
      )
      

      It is not absolutely necessary to provide a by= argument if the merged data frames share a variable (with the same name in both) that identifies cases or observations. Therefore, we can call merge() here without any by=, by.x=, or by.y= arguments:

      In [7]:
      bes2010flngs_pre_merged <- merge(
          bes2010flngs_parties_pre,
          bes2010flngs_leaders_pre
      )
      head(bes2010flngs_pre_merged)
      
        id    refno vote   flng.labour flng.cons flng.libdem flng.snp flng.pcym
      1 40103 312   NA     5           6         4           NA       NA       
      2 40107 312   NA     1           6         7           NA       NA       
      3 40109 312   NA     3           4         5           NA       NA       
      4 40110 312   Labour 6           6         5           NA       NA       
      5 40111 312   Labour 8           4         5           NA       NA       
      6 40112 312   Labour 5           1         4           NA       NA       
        flng.green flng.ukip flng.bnp region  flng.brown flng.cameron flng.clegg
      1 7           3        0        England 6          3            3         
      2 6           0        0        NA      3          7            5         
      3 5           0        0        England 8          7            4         
      4 5           3        2        England 4          4            3         
      5 4          NA        2        NA      5          5            5         
      6 4           0        0        England 5          0            4         
        flng.salmond flng.jones
      1 NA            5        
      2 NA            3        
      3 NA           10        
      4 NA            7        
      5 NA            5        
      6 NA            1        

      The data frame constwin contains data about electoral districts, namely which party won the respective district seat in 2005 and 2010. The variable that identifies the electoral district is named refno in both the individual-level data frame and the district-level data frame, so we use it as the matching variable.

      In [8]:
      bes2010pre_merged <- merge(
          bes2010flngs_pre_merged,
          constwin,
          by = "refno" # Not necessary in the present case, because
      )                # it is the same in both data frames.
      

      As can be seen from the output of head(), the result of merge() is sorted by the matching variable, i.e. "refno":

      In [9]:
      head(bes2010pre_merged)
      
        refno id    vote          flng.labour flng.cons flng.libdem flng.snp
      1 1     77920 Plaid Cymru    6          5         5           NA      
      2 1     57911 NA             5          3         3           NA      
      3 1     57905 Labour        10          0         3           NA      
      4 1     57906 Labour        10          0         4           NA      
      5 1     57910 Conservatives  0          9         3           NA      
      6 1     57902 Conservatives  8          9         6           NA      
        flng.pcym flng.green flng.ukip ⋯ flng.brown flng.cameron flng.clegg
      1 7         7          5         ⋯  0         8            NA        
      2 3         4          0         ⋯  6         5             5        
      3 5         3          4         ⋯  8         0             4        
      4 3         0          6         ⋯ 10         0             0        
      5 3         6          2         ⋯  0         9             0        
      6 5         5          6         ⋯  4         8             6        
        flng.salmond flng.jones seat     win05  win10  maj05 maj10
      1 NA            4         Aberavon Labour Labour 46.3  35.7 
      2 NA            6         Aberavon Labour Labour 46.3  35.7 
      3 NA            0         Aberavon Labour Labour 46.3  35.7 
      4 NA           10         Aberavon Labour Labour 46.3  35.7 
      5 NA            0         Aberavon Labour Labour 46.3  35.7 
      6 NA            7         Aberavon Labour Labour 46.3  35.7 
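
The merge of individual-level with district-level data can be imitated with artificial data. This is a minimal sketch only: the data frames 'people' and 'districts' and their contents are invented, and only the variable name 'refno' is taken from the text. Each district row is repeated for every individual in that district, and the result is sorted by the matching variable.

```r
# Individual-level data: five respondents in three districts
people <- data.frame(id    = c(11, 12, 13, 14, 15),
                     refno = c(3, 1, 2, 1, 3))

# District-level data: one row per district
districts <- data.frame(refno = c(1, 2, 3),
                        seat  = c("A", "B", "C"))

merged <- merge(people, districts, by = "refno")
merged

# One row per respondent, with the matching variable as first column
nrow(merged) == nrow(people)
names(merged)
```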
  • Merging with artificial data

    • Script file: merging-artificial.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      df1 <- data.frame(
          x = c(1,3,2,4,6,5),
          y = c(1,1,2,2,2,4)
      )
      df1
      
        x y
      1 1 1
      2 3 1
      3 2 2
      4 4 2
      5 6 2
      6 5 4
      In [3]:
      df2 <- data.frame(
          a = c(51,42,22),
          b = c(1,2,3)
      )
      df2
      
        a  b
      1 51 1
      2 42 2
      3 22 3

      In this first attempt at merging, the data frames do not share any variables, hence there is no way of determining which of the rows of the two data frames "belong together". In such a case each row of the first data frame is matched with each row of the second data frame. Hence the number of rows of the result equals the product of the numbers of rows of the two data frames.

      In [4]:
      df12 <- merge(df1,df2)
      df12
      
         x y a  b
      1  1 1 51 1
      2  3 1 51 1
      3  2 2 51 1
      4  4 2 51 1
      5  6 2 51 1
      6  5 4 51 1
      7  1 1 42 2
      8  3 1 42 2
      9  2 2 42 2
      10 4 2 42 2
      11 6 2 42 2
      12 5 4 42 2
      13 1 1 22 3
      14 3 1 22 3
      15 2 2 22 3
      16 4 2 22 3
      17 6 2 22 3
      18 5 4 22 3
      In [5]:
      nrow(df1)
      
      [1] 6
      In [6]:
      nrow(df2)
      
      [1] 3
      In [7]:
      nrow(df12)
      
      [1] 18

      If the variables used for matching are specified explicitly, the result is different. It contains only rows for which matches can be found in both data frames:

      In [8]:
      merge(df1,df2,by.x="y",by.y="b")
      
        y x a 
      1 1 1 51
      2 1 3 51
      3 2 2 42
      4 2 4 42
      5 2 6 42

      With the optional argument all.x=TRUE the result has a row for each row from the first data frame, whether or not a match is found for it. Missing information (from non-existing rows of the second data frame) is filled up with NA.

      In [9]:
      merge(df1,df2,by.x="y",by.y="b",
            all.x=TRUE)
      
        y x a 
      1 1 1 51
      2 1 3 51
      3 2 2 42
      4 2 4 42
      5 2 6 42
      6 4 5 NA

      With all.y=TRUE the result contains all rows from the second data frame:

      In [10]:
      merge(df1,df2,by.x="y",by.y="b",
            all.y=TRUE)
      
        y x  a 
      1 1  1 51
      2 1  3 51
      3 2  2 42
      4 2  4 42
      5 2  6 42
      6 3 NA 22

      The argument setting all=TRUE is equivalent to setting both all.x=TRUE and all.y=TRUE:

      In [11]:
      merge(df1,df2,by.x="y",by.y="b",
            all=TRUE)
      
        y x  a 
      1 1  1 51
      2 1  3 51
      3 2  2 42
      4 2  4 42
      5 2  6 42
      6 3 NA 22
      7 4  5 NA
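
The four variants of merge() shown above can be summarised by their row counts. The sketch below recreates 'df1' and 'df2' from the text; the labels "inner", "left", "right", and "full" are borrowed from database terminology and are not used by merge() itself.

```r
df1 <- data.frame(x = c(1, 3, 2, 4, 6, 5),
                  y = c(1, 1, 2, 2, 2, 4))
df2 <- data.frame(a = c(51, 42, 22),
                  b = c(1, 2, 3))

n <- c(
    inner = nrow(merge(df1, df2, by.x = "y", by.y = "b")),
    left  = nrow(merge(df1, df2, by.x = "y", by.y = "b", all.x = TRUE)),
    right = nrow(merge(df1, df2, by.x = "y", by.y = "b", all.y = TRUE)),
    full  = nrow(merge(df1, df2, by.x = "y", by.y = "b", all = TRUE))
)
n

# Rows that exist only because of all = TRUE have NA in 'x' or in 'a'
full <- merge(df1, df2, by.x = "y", by.y = "b", all = TRUE)
full[is.na(full$x) | is.na(full$a), ]
```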

Appending

  • Script file: appending.R

    Required data file: bes2010feelings-for-append.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    First we load some example data from the British Election Study 2010

    In [3]:
    load("bes2010feelings-for-append.RData")
    

    We now have two BES data frames, one from the pre-election wave and another from the post-election wave. They contain the same variables, but in a different order:

    In [4]:
    str(bes2010flngs_pre)
    
    'data.frame':	1935 obs. of  14 variables:
     $ flng.brown  : num  6 3 8 4 5 5 5 4 7 4 ...
     $ flng.cameron: num  3 7 7 4 5 0 3 6 2 2 ...
     $ flng.clegg  : num  3 5 4 3 5 4 2 7 4 8 ...
     $ flng.salmond: num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.jones  : num  5 3 10 7 5 1 7 1 6 4 ...
     $ flng.labour : num  5 1 3 6 8 5 6 2 8 3 ...
     $ flng.cons   : num  6 6 4 6 4 1 3 3 3 3 ...
     $ flng.libdem : num  4 7 5 5 5 4 0 5 4 9 ...
     $ flng.snp    : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.pcym   : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.green  : num  7 6 5 5 4 4 1 5 5 5 ...
     $ flng.ukip   : num  3 0 0 3 NA 0 NA 2 3 1 ...
     $ flng.bnp    : num  0 0 0 2 2 0 0 0 0 0 ...
     $ region      : Factor w/ 3 levels "England","Scotland",..: 1 NA 1 1 NA 1 1 1 1 1 ...
    
    In [5]:
    str(bes2010flngs_post)
    
    'data.frame':	3075 obs. of  14 variables:
     $ flng.jones  : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.labour : num  5 2 9 7 0 2 6 5 7 2 ...
     $ flng.ukip   : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.libdem : num  4 5 4 4 6 NA 4 4 7 7 ...
     $ flng.brown  : num  5 2 5 7 0 2 3 2 5 2 ...
     $ flng.bnp    : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.snp    : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.salmond: num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.pcym   : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.cons   : num  5 5 3 10 10 3 3 8 7 7 ...
     $ flng.cameron: num  5 6 5 3 8 10 7 8 8 7 ...
     $ flng.green  : num  NA NA NA NA NA NA NA NA NA NA ...
     $ flng.clegg  : num  NA 4 3 NA 6 3 5 4 7 6 ...
     $ region      : Factor w/ 3 levels "England","Scotland",..: 1 1 1 1 1 1 1 1 1 1 ...
    

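    If the variables in the two data frames differ, trying to use rbind() to append the data frames fails. Here the first variable is dropped from each data frame, so that the remaining sets of variables no longer match: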

    In [6]:
    bes2010flngs_prepost <- rbind(bes2010flngs_pre[-1],
                                  bes2010flngs_post[-1])
    
    Error in match.names(clabs, names(xi)): names do not match previous names
    Traceback:
    
    1. rbind(bes2010flngs_pre[-1], bes2010flngs_post[-1])
    2. rbind(deparse.level, ...)
    3. match.names(clabs, names(xi))
    4. stop("names do not match previous names")

    If the variables in the two data frames are the same but differ in their order, rbind() succeeds. The variables are matched by name and brought into the order of the first data frame before the data frames are combined into a single one:

    In [7]:
    bes2010flngs_prepost <- rbind(bes2010flngs_pre,
                                  bes2010flngs_post)
    

    We compare the tail ends of the resulting data frame bes2010flngs_prepost and of the data frame given as second argument to rbind(). The tail ends are identical except for the order of the variables.

    In [8]:
    tail(bes2010flngs_prepost)
    
            flng.brown flng.cameron flng.clegg flng.salmond flng.jones flng.labour
    79219.2 2          8            7          NA           5          3          
    79220.2 0          5            5          NA           4          3          
    79621.2 8          4            7          NA           5          7          
    79622.2 8          5            3          NA           6          8          
    80019.2 5          8            6          NA           5          3          
    80020.2 7          6            8          NA           6          7          
            flng.cons flng.libdem flng.snp flng.pcym flng.green flng.ukip flng.bnp
    79219.2 8         7           NA       6         6          4         0       
    79220.2 5         4           NA       3         1          0         1       
    79621.2 4         6           NA       3         4          4         4       
    79622.2 5         4           NA       6         7          4         1       
    80019.2 7         5           NA       5         3          6         5       
    80020.2 6         7           NA       5         5          2         0       
            region
    79219.2 Wales 
    79220.2 Wales 
    79621.2 Wales 
    79622.2 Wales 
    80019.2 Wales 
    80020.2 Wales 
    In [9]:
    tail(bes2010flngs_post)
    
            flng.jones flng.labour flng.ukip flng.libdem flng.brown flng.bnp
    79219.2 5          3           4         7           2          0       
    79220.2 4          3           0         4           0          1       
    79621.2 5          7           4         6           8          4       
    79622.2 6          8           4         4           8          1       
    80019.2 5          3           6         5           5          5       
    80020.2 6          7           2         7           7          0       
            flng.snp flng.salmond flng.pcym flng.cons flng.cameron flng.green
    79219.2 NA       NA           6         8         8            6         
    79220.2 NA       NA           3         5         5            1         
    79621.2 NA       NA           3         4         4            4         
    79622.2 NA       NA           6         5         5            7         
    80019.2 NA       NA           5         7         8            3         
    80020.2 NA       NA           5         6         6            5         
            flng.clegg region
    79219.2 7          Wales 
    79220.2 5          Wales 
    79621.2 7          Wales 
    79622.2 3          Wales 
    80019.2 6          Wales 
    80020.2 8          Wales 
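
Both behaviours of rbind() can be reproduced without the data file. In this minimal sketch the data frames 'wave1', 'wave2', and 'wave3' are artificial, and renaming the offending variable is just one possible way of making the names agree.

```r
wave1 <- data.frame(a = c(1, 2), b = c("x", "y"))
wave2 <- data.frame(b = "z", a = 3)       # same variables, other order

# Columns are matched by name, not by position
both <- rbind(wave1, wave2)
both

# A data frame with a differently named variable cannot be appended
wave3 <- data.frame(a = 4, other = "w")
res <- try(rbind(wave1, wave3), silent = TRUE)
inherits(res, "try-error")

# After renaming the variable, appending works
names(wave3)[names(wave3) == "other"] <- "b"
all3 <- rbind(both, wave3)
all3
```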

Reshaping

  • Reshaping artificial data

    • Script file: reshaping-artificial.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      Here we construct the data frame that contains the first two rows of the data in wide format in the previous illustration.

      In [2]:
      example.data.wide <- data.frame(
          id = 1:2,
          v  = c(35,42),
          x1 = c(1.1,2.1),
          x2 = c(1.2,2.2),
          x3 = c(1.3,2.3),
          y1 = c(2.5,3.5),
          y2 = c(2.7,3.7),
          y3 = c(2.9,3.9))
      example.data.wide
      
        id v  x1  x2  x3  y1  y2  y3 
      1 1  35 1.1 1.2 1.3 2.5 2.7 2.9
      2 2  42 2.1 2.2 2.3 3.5 3.7 3.9

      We now call reshape() to cast the data into long format

      In [3]:
      example.data.long <- reshape(data=example.data.wide,
                                   varying=list(
                                       # The first group of variables 
                                       # in wide format
                                       c("x1","x2","x3"),
                                       # The second group of variables
                                       # in wide format
                                       c("y1","y2","y3")
                                   ),
                                   v.names=c("x","y"),
                                   timevar="t",
                                   times=1:3,
                                   direction="long")
      example.data.long
      
          id v  t x   y  
      1.1 1  35 1 1.1 2.5
      2.1 2  42 1 2.1 3.5
      1.2 1  35 2 1.2 2.7
      2.2 2  42 2 2.2 3.7
      1.3 1  35 3 1.3 2.9
      2.3 2  42 3 2.3 3.9

      In order to change the data from long into wide format, we can use almost the same function call, the only difference being the direction= argument.

      In [4]:
      example.data.wide.a <- reshape(data=example.data.long,
                                     varying=list(
                                       # The first group of variables 
                                       # in wide format
                                       c("x1","x2","x3"),
                                       # The second group of variables
                                       # in wide format
                                       c("y1","y2","y3")
                                     ),
                                     v.names=c("x","y"),
                                     timevar="t",
                                     times=1:3,
                                     direction="wide")
      

      The second call of reshape does not completely revert the first call, because the order of the variables now is different:

      In [5]:
      example.data.wide.a
      
          id v  x1  y1  x2  y2  x3  y3 
      1.1 1  35 1.1 2.5 1.2 2.7 1.3 2.9
      2.1 2  42 2.1 3.5 2.2 3.7 2.3 3.9
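
If the original column order matters, it can be restored by indexing with the original variable names. The sketch below also relies on the documented feature that a data frame produced by reshape() stores the reshape specification as an attribute, so that a bare reshape() call reverses the operation. The names 'wide', 'long', and 'wide.again' are illustrative only.

```r
wide <- data.frame(
    id = 1:2,
    v  = c(35, 42),
    x1 = c(1.1, 2.1), x2 = c(1.2, 2.2), x3 = c(1.3, 2.3),
    y1 = c(2.5, 3.5), y2 = c(2.7, 3.7), y3 = c(2.9, 3.9))

long <- reshape(data = wide,
                varying = list(c("x1", "x2", "x3"),
                               c("y1", "y2", "y3")),
                v.names = c("x", "y"),
                timevar = "t",
                times = 1:3,
                direction = "long")

# 'long' remembers how it was created, so no further arguments are needed
wide.again <- reshape(long)

# Bring the variables back into the order of the original data frame
wide.again <- wide.again[names(wide)]
wide.again
```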
  • Reshaping data from the British Election Study

    • Script file: reshaping-BES.R

      Required data file: bes2010feelings-prepost.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

      The script makes use of the memisc package, which is available from https://cran.r-project.org/package=memisc

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      First we load an R data file that contains data from the 2010 British election study.

      In [2]:
      load("bes2010feelings-prepost.RData")
      
      In [3]:
      names(bes2010flngs_pre)
      
       [1] "flng.brown"   "flng.cameron" "flng.clegg"   "flng.salmond" "flng.jones"  
       [6] "flng.labour"  "flng.cons"    "flng.libdem"  "flng.snp"     "flng.pcym"   
      [11] "flng.green"   "flng.ukip"    "flng.bnp"     "region"      

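      A sensible way to bring these data into long format is to treat the feelings towards the parties and towards their leaders as multiple measurements, one per party. We therefore reshape the data into the appropriate long format, using an all-NA variable (na) as a placeholder for those parties for which no leader rating is available: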

      In [4]:
      bes2010flngs_pre_long <- reshape(
                    within(bes2010flngs_pre,
                           na <- NA),
                    varying=list(
                        # Parties
                        c("flng.cons","flng.labour","flng.libdem",
                          "flng.snp","flng.pcym",
                          "flng.green","flng.ukip","flng.bnp"),
                        # Party leaders
                        c("flng.cameron","flng.brown","flng.clegg",
                          "flng.salmond","flng.jones",
                          "na","na","na")
                    ),
                    v.names=c("flng.parties",
                              "flng.leaders"),
                    times=c("Conservative","Labour","LibDem",
                            "SNP","Plaid Cymru",
                            "Green","UKIP","BNP"),
                    timevar="party",
                    direction="long")
      head(bes2010flngs_pre_long,n=14)
      
                      region  party        flng.parties flng.leaders id
      1.Conservative  England Conservative 6            3             1
      2.Conservative  NA      Conservative 6            7             2
      3.Conservative  England Conservative 4            7             3
      4.Conservative  England Conservative 6            4             4
      5.Conservative  NA      Conservative 4            5             5
      6.Conservative  England Conservative 1            0             6
      7.Conservative  England Conservative 3            3             7
      8.Conservative  England Conservative 3            6             8
      9.Conservative  England Conservative 3            2             9
      10.Conservative England Conservative 3            2            10
      11.Conservative NA      Conservative 6            4            11
      12.Conservative England Conservative 3            2            12
      13.Conservative England Conservative 0            4            13
      14.Conservative England Conservative 5            5            14
      In [5]:
      library(memisc)
      
      Loading required package: lattice
      
      Loading required package: MASS
      
      
      Attaching package: ‘memisc’
      
      
      The following objects are masked from ‘package:stats’:
      
          contr.sum, contr.treatment, contrasts
      
      
      The following object is masked from ‘package:base’:
      
          as.array
      
      
      

      With the Reshape() function the syntax is a bit simpler than with reshape() from the "stats" package:

      In [6]:
      bes2010flngs_pre_long <- Reshape(bes2010flngs_pre,
             # Note that "empty" places designate measurement
             # occasions that are to be filled with NAs.
             # In the present case these are feelings about
             # party leaders that were not asked about in
             # the BES 2010 questionnaires.
             flng.leaders=c(flng.cameron,flng.brown,
                            flng.clegg,flng.salmond,
                            flng.jones,,,),
             flng.parties=c(flng.cons,flng.labour,
                            flng.libdem,flng.snp,
                            flng.pcym,flng.green,
                            flng.ukip,flng.bnp),
             party=c("Conservative","Labour","LibDem",
                     "SNP","Plaid Cymru",
                     "Green","UKIP","BNP"),
             direction="long")
      

      In long format the observations are sorted such that the variable that distinguishes measurement occasions (the party variable) changes faster than the variable that distinguishes individuals:

      In [7]:
      head(bes2010flngs_pre_long)
      
                     region  party        flng.leaders flng.parties id
      1.Conservative England Conservative  3            6           1 
      1.Labour       England Labour        6            5           1 
      1.LibDem       England LibDem        3            4           1 
      1.SNP          England SNP          NA           NA           1 
      1.Plaid Cymru  England Plaid Cymru   5           NA           1 
      1.Green        England Green        NA            7           1 

      As with reshape(), reshaping back from long into wide format takes (almost) the same syntax as reshaping from wide into long format:

      In [8]:
      bes2010flngs_pre_wide <- Reshape(bes2010flngs_pre_long,
             # Note that "empty" places designate measurement
             # occasions that are to be filled with NAs.
             # In the present case these are feelings about
             # party leaders that were not asked about in
             # the BES 2010 questionnaires.
             flng.leaders=c(flng.cameron,flng.brown,
                            flng.clegg,flng.salmond,
                            flng.jones,,,),
             flng.parties=c(flng.cons,flng.labour,
                            flng.libdem,flng.snp,
                            flng.pcym,flng.green,
                            flng.ukip,flng.bnp),
             party=c("Conservative","Labour","LibDem",
                     "SNP","Plaid Cymru",
                     "Green","UKIP","BNP"),
             direction="wide")
      

      After reshaping into wide format, the variables that correspond to multiple measures of the same variable are grouped together:

      In [9]:
      head(bes2010flngs_pre_wide)
      
                     region  id flng.cameron flng.cons flng.brown flng.labour
      1.Conservative England 1  3            6         6          5          
      2.Conservative NA      2  7            6         3          1          
      3.Conservative England 3  7            4         8          3          
      4.Conservative England 4  4            6         4          6          
      5.Conservative NA      5  5            4         5          8          
      6.Conservative England 6  0            1         5          5          
                     flng.clegg flng.libdem flng.salmond flng.snp flng.jones
      1.Conservative 3          4           NA           NA        5        
      2.Conservative 5          7           NA           NA        3        
      3.Conservative 4          5           NA           NA       10        
      4.Conservative 3          5           NA           NA        7        
      5.Conservative 5          5           NA           NA        5        
      6.Conservative 4          4           NA           NA        1        
                     flng.pcym flng.green flng.ukip flng.bnp
      1.Conservative NA        7           3        0       
      2.Conservative NA        6           0        0       
      3.Conservative NA        5           0        0       
      4.Conservative NA        5           3        2       
      5.Conservative NA        4          NA        2       
      6.Conservative NA        4           0        0       
      In [10]:
      save(bes2010flngs_pre_long,file="bes2010flngs-pre-long.RData")
      
      
      

Sorting

  • Script file: sorting.R

    Required data file: bes2010feelings-pre-long.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the memisc package, which is available from https://cran.r-project.org/package=memisc

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    load("bes2010feelings-pre-long.RData")
    

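    Here we use order() to obtain the row indices that sort the data by id and, within each id, by party. Indexing the data frame with these indices gives the sorted data: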

    In [3]:
    ii <- with(bes2010flngs_pre_long,order(id,party))
    bes2010flngs_pre_long_sorted <- bes2010flngs_pre_long[ii,]
    
    In [4]:
    head(bes2010flngs_pre_long_sorted[c("party","id",
                                        "flng.leaders","flng.parties")],n=15)
    
                   party        id flng.leaders flng.parties
    1.Conservative Conservative 1   3            6          
    1.Labour       Labour       1   6            5          
    1.LibDem       LibDem       1   3            4          
    1.SNP          SNP          1  NA           NA          
    1.Plaid Cymru  Plaid Cymru  1   5           NA          
    1.Green        Green        1  NA            7          
    1.UKIP         UKIP         1  NA            3          
    1.BNP          BNP          1  NA            0          
    2.Conservative Conservative 2   7            6          
    2.Labour       Labour       2   3            1          
    2.LibDem       LibDem       2   5            7          
    2.SNP          SNP          2  NA           NA          
    2.Plaid Cymru  Plaid Cymru  2   3           NA          
    2.Green        Green        2  NA            6          
    2.UKIP         UKIP         2  NA            0          
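
    In the output above SNP precedes Plaid Cymru and Green, which suggests that party is a factor: order() sorts a factor by the order of its levels, not alphabetically. A minimal sketch with hypothetical toy data (the names toy and toy.sorted are assumptions made for illustration):

    ```r
    # Hypothetical toy data; the factor levels are deliberately
    # not in alphabetical order
    toy <- data.frame(id    = c(2, 1, 2, 1),
                      party = factor(c("Labour", "Labour",
                                       "Conservative", "Conservative"),
                                     levels = c("Labour", "Conservative")))

    # Sort by id first and by party (in level order) within id
    ii <- with(toy, order(id, party))
    toy.sorted <- toy[ii, ]
    toy.sorted
    ```

    Within each id, "Labour" comes before "Conservative" here because it is the first factor level.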

    Some more convenient alternatives. Using a Sort() function:

    In [5]:
    Sort <- function(data,...){
        ii <- eval(substitute(order(...)),
                              envir=data,
                              enclos=parent.frame())
        data[ii,]
    }
    
    In [6]:
    bes2010flngs_pre_long_sorted <- Sort(bes2010flngs_pre_long,
                                         id,party)
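
    Since Sort() evaluates its arguments inside the data frame, any expression that order() accepts can be passed on, for example a negated numeric variable for descending order. A short usage sketch, repeating the definition from above and using hypothetical toy data (toy and its variable score are assumptions made for illustration):

    ```r
    # Definition as given above
    Sort <- function(data,...){
        ii <- eval(substitute(order(...)),
                              envir=data,
                              enclos=parent.frame())
        data[ii,]
    }

    # Hypothetical toy data
    toy <- data.frame(id    = c(2, 1, 2, 1),
                      score = c(10, 30, 20, 40))

    # Ascending by id, descending by score within id
    toy.sorted <- Sort(toy, id, -score)
    toy.sorted
    ```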
    

    Using the sort() method function from the 'memisc' package:

    In [7]:
    library(memisc)
    
    Loading required package: lattice
    
    Loading required package: MASS
    
    
    Attaching package: ‘memisc’
    
    
    The following objects are masked from ‘package:stats’:
    
        contr.sum, contr.treatment, contrasts
    
    
    The following object is masked from ‘package:base’:
    
        as.array
    
    
    
    In [8]:
    bes2010flngs_pre_long_sorted <- sort(bes2010flngs_pre_long,
                                         by=~party+id)
    
    In [9]:
    head(bes2010flngs_pre_long_sorted[c("party","id",
                                        "flng.leaders","flng.parties")],n=15)
    
                   party        id flng.leaders flng.parties
    1.Conservative Conservative 1   3            6          
    1.Labour       Labour       1   6            5          
    1.LibDem       LibDem       1   3            4          
    1.SNP          SNP          1  NA           NA          
    1.Plaid Cymru  Plaid Cymru  1   5           NA          
    1.Green        Green        1  NA            7          
    1.UKIP         UKIP         1  NA            3          
    1.BNP          BNP          1  NA            0          
    2.Conservative Conservative 2   7            6          
    2.Labour       Labour       2   3            1          
    2.LibDem       LibDem       2   5            7          
    2.SNP          SNP          2  NA           NA          
    2.Plaid Cymru  Plaid Cymru  2   3           NA          
    2.Green        Green        2  NA            6          
    2.UKIP         UKIP         2  NA            0          
    
    

Aggregating Data Frames

  • Script file: aggregating.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the memisc package, which is available from https://cran.r-project.org/package=memisc

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    In the following we aggregate data from the British Election Study 2010:

    In [2]:
    load("bes2010feelings.RData")
    

    Here we obtain the average feelings towards the leaders of the three major parties and of the SNP, using an 'old-style' call of the function aggregate().

    In [3]:
    Mean <- function(x,...)mean(x,...,na.rm=TRUE)
    aggregate(bes2010feelings[c("flng.brown","flng.cameron",
                                "flng.clegg","flng.salmond")],
              with(bes2010feelings,
                   list(Region=region,Wave=wave)),
              Mean)
    
      Region   Wave flng.brown flng.cameron flng.clegg flng.salmond
    1 England  Pre  4.092674   5.284810     4.618690        NaN    
    2 Scotland Pre  5.395000   4.502591     4.405229   4.412371    
    3 Wales    Pre  4.328244   4.774194     4.592233        NaN    
    4 England  Post 4.140990   5.441454     5.160313        NaN    
    5 Scotland Post 5.510769   4.539075     4.513793   4.228707    
    6 Wales    Post 4.307692   4.855895     4.814480        NaN    

    More recent versions of R also provide a slightly more convenient way of calling aggregate() using a formula argument:

    In [4]:
    aggregate(cbind(flng.brown,
                    flng.cameron,
                    flng.clegg,
                    flng.salmond
                    )~region+wave,
              data=bes2010feelings,
              Mean)
    
      region   wave flng.brown flng.cameron flng.clegg flng.salmond
    1 Scotland Pre  5.466667   4.500000     4.460000   4.480000    
    2 Scotland Post 5.513986   4.513986     4.498252   4.270979    
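
    The formula call above returns rows for Scotland only, and with slightly different means. A likely explanation, stated here as an assumption rather than something the source confirms, is the default na.action=na.omit of the formula method: it drops every row that has a missing value in any of the variables, and flng.salmond appears to be observed only in Scotland. The following sketch with hypothetical toy data (toy, res.omit, and res.pass are illustrative names) shows the difference that na.action=na.pass makes:

    ```r
    # Hypothetical toy data: 'b' is missing throughout one region
    toy <- data.frame(region = c("England", "England",
                                 "Scotland", "Scotland"),
                      a = c(1, 3, 5, 7),
                      b = c(NA, NA, 2, 4))

    Mean <- function(x,...) mean(x,...,na.rm=TRUE)

    # Default: rows with any NA are dropped before aggregation,
    # so only the Scotland rows remain
    res.omit <- aggregate(cbind(a, b) ~ region, data = toy, FUN = Mean)

    # With na.pass all rows are kept and Mean() handles the NAs itself
    res.pass <- aggregate(cbind(a, b) ~ region, data = toy, FUN = Mean,
                          na.action = na.pass)
    res.omit
    res.pass
    ```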

    The memisc package has a somewhat more flexible variant of aggregate(), the function Aggregate(). Here we reproduce the results of aggregate().

    In [5]:
    library(memisc)
    
    Loading required package: lattice
    
    Loading required package: MASS
    
    
    Attaching package: ‘memisc’
    
    
    The following object is masked _by_ ‘.GlobalEnv’:
    
        Mean
    
    
    The following objects are masked from ‘package:stats’:
    
        contr.sum, contr.treatment, contrasts
    
    
    The following object is masked from ‘package:base’:
    
        as.array
    
    
    
    In [6]:
    Aggregate(c(Brown=Mean(flng.brown),
                Cameron=Mean(flng.cameron),
                Clegg=Mean(flng.clegg),
                Salmond=Mean(flng.salmond))~region+wave,
                data=bes2010feelings)
    
      region   wave Brown    Cameron  Clegg    Salmond 
    1 England  Pre  4.092674 5.284810 4.618690      NaN
    2 Scotland Pre  5.395000 4.502591 4.405229 4.412371
    3 Wales    Pre  4.328244 4.774194 4.592233      NaN
    4 NA       Pre  4.507143 4.929870 4.426573 4.760563
    5 England  Post 4.140990 5.441454 5.160313      NaN
    6 Scotland Post 5.510769 4.539075 4.513793 4.228707
    7 Wales    Post 4.307692 4.855895 4.814480      NaN
    8 NA       Post       NA       NA       NA       NA

    However, it also allows the use of different summary functions in a single call.

    In [7]:
    Var <- function(x,...) var(x,...,na.rm=TRUE)
    Aggregate(c(Mean(flng.brown),Var(flng.brown))~region+wave,
              data=bes2010feelings)
    
      region   wave Mean(flng.brown) Var(flng.brown)
    1 England  Pre  4.092674         7.287340       
    2 Scotland Pre  5.395000         8.210025       
    3 Wales    Pre  4.328244         8.776042       
    4 NA       Pre  4.507143         7.754125       
    5 England  Post 4.140990         7.109491       
    6 Scotland Post 5.510769         6.376617       
    7 Wales    Post 4.307692         7.647408       
    8 NA       Post       NA               NA       
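
    For comparison, base R's aggregate() can also return several summaries per group if the summary function returns a named vector. This is only a sketch with hypothetical toy data (toy and res are illustrative names), not a replacement for Aggregate(); note that the summaries end up in a single matrix-valued column:

    ```r
    # Hypothetical toy data
    toy <- data.frame(group = c("a", "a", "b", "b"),
                      score = c(1, 3, 10, 14))

    # The summary function returns mean and variance together
    res <- aggregate(score ~ group, data = toy,
                     FUN = function(x) c(mean = mean(x), var = var(x)))

    # 'score' is a matrix with the columns "mean" and "var"
    res$score[, "mean"]
    res$score[, "var"]
    ```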
    
    

Groupwise computations within Data Frames

  • Script file: groupwise-computations.R

    Required data file: bes2010feelings-pre-long.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the memisc package, which is available from https://cran.r-project.org/package=memisc

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    load("bes2010feelings-pre-long.RData")
    

    Groupwise computations using split():

    In [3]:
    bes2010flngs_pre_long.splt <- split(bes2010flngs_pre_long,
                                        bes2010flngs_pre_long$id)
    
    str(bes2010flngs_pre_long.splt[[1]])
    
    'data.frame':	8 obs. of  5 variables:
     $ region      : Factor w/ 3 levels "England","Scotland",..: 1 1 1 1 1 1 1 1
     $ party       : Factor w/ 8 levels "Conservative",..: 1 2 3 4 5 6 7 8
     $ flng.leaders: num  3 6 3 NA 5 NA NA NA
     $ flng.parties: num  6 5 4 NA NA 7 3 0
     $ id          : int  1 1 1 1 1 1 1 1
     - attr(*, "reshapeLong")=List of 4
      ..$ varying:List of 2
      .. ..$ flng.leaders: chr [1:8] "flng.cameron" "flng.brown" "flng.clegg" "flng.salmond" ...
      .. ..$ flng.parties: chr [1:8] "flng.cons" "flng.labour" "flng.libdem" "flng.snp" ...
      ..$ v.names: chr [1:2] "flng.leaders" "flng.parties"
      ..$ idvar  : chr "id"
      ..$ timevar: chr "party"
    
    In [4]:
    Mean <- function(x,...) mean(x,...,na.rm=TRUE)
    
    In [5]:
    bes2010flngs_pre_long.splt <- lapply(
        bes2010flngs_pre_long.splt,
        within,expr={
            rel.flng.parties <- flng.parties - Mean(flng.parties)
            rel.flng.leaders <- flng.leaders - Mean(flng.leaders)
        })
    
    str(bes2010flngs_pre_long.splt[[1]])
    
    'data.frame':	8 obs. of  7 variables:
     $ region          : Factor w/ 3 levels "England","Scotland",..: 1 1 1 1 1 1 1 1
     $ party           : Factor w/ 8 levels "Conservative",..: 1 2 3 4 5 6 7 8
     $ flng.leaders    : num  3 6 3 NA 5 NA NA NA
     $ flng.parties    : num  6 5 4 NA NA 7 3 0
     $ id              : int  1 1 1 1 1 1 1 1
     $ rel.flng.leaders: num  -1.25 1.75 -1.25 NA 0.75 NA NA NA
     $ rel.flng.parties: num  1.833 0.833 -0.167 NA NA ...
     - attr(*, "reshapeLong")=List of 4
      ..$ varying:List of 2
      .. ..$ flng.leaders: chr [1:8] "flng.cameron" "flng.brown" "flng.clegg" "flng.salmond" ...
      .. ..$ flng.parties: chr [1:8] "flng.cons" "flng.labour" "flng.libdem" "flng.snp" ...
      ..$ v.names: chr [1:2] "flng.leaders" "flng.parties"
      ..$ idvar  : chr "id"
      ..$ timevar: chr "party"
    
    In [6]:
    bes2010flngs_pre_long <- unsplit(bes2010flngs_pre_long.splt,
                                     bes2010flngs_pre_long$id)
    str(bes2010flngs_pre_long)
    
    'data.frame':	15480 obs. of  7 variables:
     $ region          : Factor w/ 3 levels "England","Scotland",..: 1 1 1 1 1 1 1 1 NA NA ...
     $ party           : Factor w/ 8 levels "Conservative",..: 1 2 3 4 5 6 7 8 1 2 ...
     $ flng.leaders    : num  3 6 3 NA 5 NA NA NA 7 3 ...
     $ flng.parties    : num  6 5 4 NA NA 7 3 0 6 1 ...
     $ id              : int  1 1 1 1 1 1 1 1 2 2 ...
     $ rel.flng.leaders: num  -1.25 1.75 -1.25 NA 0.75 NA NA NA 2.5 -1.5 ...
     $ rel.flng.parties: num  1.833 0.833 -0.167 NA NA ...
     - attr(*, "reshapeLong")=List of 4
      ..$ varying:List of 2
      .. ..$ flng.leaders: chr [1:8] "flng.cameron" "flng.brown" "flng.clegg" "flng.salmond" ...
      .. ..$ flng.parties: chr [1:8] "flng.cons" "flng.labour" "flng.libdem" "flng.snp" ...
      ..$ v.names: chr [1:2] "flng.leaders" "flng.parties"
      ..$ idvar  : chr "id"
      ..$ timevar: chr "party"
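
    The split-apply-combine steps above can be checked on a small example. The sketch below uses hypothetical toy data (toy, parts, toy.a, and toy.b are illustrative names) and compares the split()/unsplit() approach with base R's ave(), which applies a function within groups in a single step:

    ```r
    # Hypothetical toy data with one missing value
    toy <- data.frame(id = c(1, 1, 1, 2, 2),
                      x  = c(6, 5, NA, 1, 7))

    # Variant 1: split, transform each piece, and recombine
    parts <- split(toy, toy$id)
    parts <- lapply(parts, function(d)
        within(d, rel.x <- x - mean(x, na.rm = TRUE)))
    toy.a <- unsplit(parts, toy$id)

    # Variant 2: ave() applies the function within each group
    toy.b <- within(toy,
                    rel.x <- ave(x, id,
                                 FUN = function(v) v - mean(v, na.rm = TRUE)))

    # Both variants give the same group-centred values
    all.equal(toy.a$rel.x, toy.b$rel.x)
    ```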
    

    Groupwise computations using withinGroups():

    In [7]:
    library(memisc)
    
    Loading required package: lattice
    
    Loading required package: MASS
    
    
    Attaching package: ‘memisc’
    
    
    The following object is masked _by_ ‘.GlobalEnv’:
    
        Mean
    
    
    The following objects are masked from ‘package:stats’:
    
        contr.sum, contr.treatment, contrasts
    
    
    The following object is masked from ‘package:base’:
    
        as.array
    
    
    
    In [8]:
    Mean <- function(x,...) mean(x,...,na.rm=TRUE)
    bes2010flngs_pre_long <- withinGroups(bes2010flngs_pre_long,
                                          ~id,{
         rel.flng.parties <- flng.parties - Mean(flng.parties)
         rel.flng.leaders <- flng.leaders - Mean(flng.leaders)
        })
    

    We use head() to look at the first 14 rows of the re-combined data frame, leaving out the first two variables:

    In [9]:
    head(bes2010flngs_pre_long[-(1:2)],n=14)
    
                   flng.leaders flng.parties id rel.flng.leaders rel.flng.parties
    1.Conservative  3            6           1  -1.25             1.8333333      
    1.Labour        6            5           1   1.75             0.8333333      
    1.LibDem        3            4           1  -1.25            -0.1666667      
    1.SNP          NA           NA           1     NA                    NA      
    1.Plaid Cymru   5           NA           1   0.75                    NA      
    1.Green        NA            7           1     NA             2.8333333      
    1.UKIP         NA            3           1     NA            -1.1666667      
    1.BNP          NA            0           1     NA            -4.1666667      
    2.Conservative  7            6           2   2.50             2.6666667      
    2.Labour        3            1           2  -1.50            -2.3333333      
    2.LibDem        5            7           2   0.50             3.6666667      
    2.SNP          NA           NA           2     NA                    NA      
    2.Plaid Cymru   3           NA           2  -1.50                    NA      
    2.Green        NA            6           2     NA             2.6666667      
    
    

Importing Data into Data Frames

  • Script file: importing-data.R

    Required data files:

    Currently, these data files are available from https://www.pippanorris.com/data. (Previously they were available from http://www.hks.harvard.edu/fs/pnorris/Data/Data.htm.)

    The script makes use of the memisc package, which is available from https://cran.r-project.org/package=memisc

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    Importing data from text files

    Importing CSV data:

    In [2]:
    # We inspect the text file using 'readLines()'
    readLines("ConstituencyResults2010.csv",n=5)
    
    [1] "refno,cons,lab,libdem,snp,plcym,green,bnp,ukip"
    [2] "1,14.3,51.9,16.3,,7.1,,4.1,1.6"                
    [3] "2,35.8,24.5,19.3,,17.8,,,2.1"                  
    [4] "3,12.4,44.4,18.6,22.2,,,1.7,"                  
    [5] "4,20.7,36.5,28.4,11.9,,1.0,1.2,"               
    In [3]:
    # For the actual import we use 'read.csv()'
    ConstRes2010 <- read.csv("ConstituencyResults2010.csv")
    ConstRes2010[1:5,]
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    In [4]:
    # A CSV file without a variable name header
    readLines("ConstituencyResults2010-nohdr.csv",n=5)
    
    [1] "1,14.3,51.9,16.3,,7.1,,4.1,1.6"  "2,35.8,24.5,19.3,,17.8,,,2.1"   
    [3] "3,12.4,44.4,18.6,22.2,,,1.7,"    "4,20.7,36.5,28.4,11.9,,1.0,1.2,"
    [5] "5,30.3,13.6,38.4,15.7,,,1.1,0.9"
    In [5]:
    ConstRes2010 <- read.csv("ConstituencyResults2010-nohdr.csv",
                             header=FALSE)
    ConstRes2010[1:5,]
    
      V1 V2   V3   V4   V5   V6   V7 V8  V9 
    1 1  14.3 51.9 16.3   NA  7.1 NA 4.1 1.6
    2 2  35.8 24.5 19.3   NA 17.8 NA  NA 2.1
    3 3  12.4 44.4 18.6 22.2   NA NA 1.7  NA
    4 4  20.7 36.5 28.4 11.9   NA  1 1.2  NA
    5 5  30.3 13.6 38.4 15.7   NA NA 1.1 0.9
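
    Without a header line the variables receive the default names V1 to V9. One way to supply proper names at import time is the col.names argument. The sketch below is self-contained: it reads the first three lines shown above through the text argument instead of a file, and the object names csv.lines and toy.res are assumptions made for illustration:

    ```r
    # The first three data lines of the header-less CSV file
    csv.lines <- c("1,14.3,51.9,16.3,,7.1,,4.1,1.6",
                   "2,35.8,24.5,19.3,,17.8,,,2.1",
                   "3,12.4,44.4,18.6,22.2,,,1.7,")

    # 'text' replaces the file name; 'col.names' supplies the names
    # that the missing header line would otherwise have provided
    toy.res <- read.csv(text = csv.lines,
                        header = FALSE,
                        col.names = c("refno", "cons", "lab", "libdem",
                                      "snp", "plcym", "green", "bnp",
                                      "ukip"))
    toy.res
    ```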
    In [6]:
    # Importing tab-delimited data:
    readLines("ConstituencyResults2010.tsv",n=5)
    
    [1] "refno\tcons\tlab\tlibdem\tsnp\tplcym\tgreen\tbnp\tukip"
    [2] "1\t14.3\t51.9\t16.3\t\t7.1\t\t4.1\t1.6"                
    [3] "2\t35.8\t24.5\t19.3\t\t17.8\t\t\t2.1"                  
    [4] "3\t12.4\t44.4\t18.6\t22.2\t\t\t1.7\t"                  
    [5] "4\t20.7\t36.5\t28.4\t11.9\t\t1.0\t1.2\t"               
    In [7]:
    ConstRes2010 <- read.delim("ConstituencyResults2010.tsv")
    ConstRes2010[1:5,]
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 

    Importing fixed-width data:

    In [8]:
    readLines("ConstituencyResults2010-fwf.txt",n=5)
    
    [1] "  114.351.916.3     7.1     4.1 1.6" "  235.824.519.3    17.8         2.1"
    [3] "  312.444.418.622.2         1.7"     "  420.736.528.411.9     1.0 1.2"    
    [5] "  530.313.638.415.7         1.1 0.9"
    In [10]:
    ConstRes2010 <- read.fwf("ConstituencyResults2010-fwf.txt",
                             widths=c(3,4,4,4,4,4,4,4,4))
    ConstRes2010[1:5,]
    
      V1 V2   V3   V4   V5   V6   V7 V8  V9 
    1 1  14.3 51.9 16.3   NA  7.1 NA 4.1 1.6
    2 2  35.8 24.5 19.3   NA 17.8 NA  NA 2.1
    3 3  12.4 44.4 18.6 22.2   NA NA 1.7  NA
    4 4  20.7 36.5 28.4 11.9   NA  1 1.2  NA
    5 5  30.3 13.6 38.4 15.7   NA NA 1.1 0.9
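
    The widths argument states how many characters each variable occupies: three for the first variable and four for each of the remaining eight. The following self-contained sketch writes three lines in this layout to a temporary file and reads them back, again using col.names to name the variables. The lines are assembled from their fields to make the column layout visible; fwf.lines, fwf.file, and toy.fwf are illustrative names, not part of the original script:

    ```r
    # Each line: a 3-character field followed by eight 4-character
    # fields; fields consisting only of blanks are missing values
    fwf.lines <- c(
        paste0("  1", "14.3", "51.9", "16.3", "    ",
               " 7.1", "    ", " 4.1", " 1.6"),
        paste0("  2", "35.8", "24.5", "19.3", "    ",
               "17.8", "    ", "    ", " 2.1"),
        paste0("  3", "12.4", "44.4", "18.6", "22.2",
               "    ", "    ", " 1.7", "    "))

    fwf.file <- tempfile(fileext = ".txt")
    writeLines(fwf.lines, fwf.file)

    toy.fwf <- read.fwf(fwf.file,
                        widths = c(3, 4, 4, 4, 4, 4, 4, 4, 4),
                        col.names = c("refno", "cons", "lab", "libdem",
                                      "snp", "plcym", "green", "bnp",
                                      "ukip"))
    toy.fwf
    ```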

    Importing data from other statistics packages

    Importing data using the foreign package

    In [11]:
    library(foreign)
    
    # An SPSS 'system' file
    ConstRes2010 <- read.spss("ConstituencyResults2010.sav",
                              to.data.frame=TRUE)
    ConstRes2010[1:5,]
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    In [12]:
    # An SPSS 'portable' file
    ConstRes2010 <- read.spss("ConstituencyResults2010.por",
                              to.data.frame=TRUE)
    ConstRes2010[1:5,]
    
      REFNO CONS LAB  LIBDEM SNP  PLCYM GREEN BNP UKIP
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    In [13]:
    # A Stata file
    ConstRes2010 <- read.dta("ConstituencyResults2010.dta")
    ConstRes2010[1:5,]
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    In [15]:
    # The following does not work - newer Stata format is not supported
    ConstRes2010 <- read.dta("ConstResults2010-stata-new.dta")
    
    Error in read.dta("ConstResults2010-stata-new.dta"): not a Stata version 5-12 .dta file
    Traceback:
    
    1. read.dta("ConstResults2010-stata-new.dta")