Data Tables and the “Tidyverse”

This chapter discusses the improvements and new features introduced by the package data.table and some notable packages in the “tidyverse” collection of packages. In particular, the packages data.table and dplyr are compared in terms of the features they provide and how computationally efficient they are. Also the features for “tidying” data using the tidyr package are discussed.

Below is the supporting material for the various sections of the chapter.

Data Tables

Creating data tables

  • Script file: creating-data-tables.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the data.table package, which is available from https://cran.r-project.org/package=data.table

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    library(data.table)
    
    In [3]:
    UK <- data.table(
               Population = c(55619400,1885400,5424800,3125000),
               Area = c(50301,5460,30090,8023),
               GVA = c(28096,20000,24800,19900),
               country = c("England",
                             "Northern Ireland",
                             "Scotland",
                             "Wales"))
    UK
    
      Population Area  GVA   country         
    1 55619400   50301 28096 England         
    2  1885400    5460 20000 Northern Ireland
    3  5424800   30090 24800 Scotland        
    4  3125000    8023 19900 Wales           
    In [4]:
    class(UK)
    
    [1] "data.table" "data.frame"
    In [5]:
    DT <- data.table(
        x = rnorm(1000000),
        y = rnorm(1000000))
    DT
    
            x           y          
    1       -0.15279818 -0.47783172
    2       -1.70539944 -0.22764698
    3       -0.83157336  0.54272547
    4        1.22603817  0.38108578
    5        0.30521037  0.31583837
    6        1.13718083  0.69197497
    7        0.51842513 -1.40680405
    8       -0.71917822  1.02680513
    9        0.39633064 -1.36946391
    10       0.02244016  1.35067205
    11      -1.41041245  1.43463172
    12       0.15424717  1.30960037
    13      -0.64834247  2.79065757
    14      -0.49553384 -2.05155283
    15      -0.14673379  0.50991242
    16      -0.75601675  0.15452828
    17       0.47866941 -0.08568474
    18      -1.45425531 -2.20447058
    19       0.73526244 -0.72707865
    20      -0.56519050 -1.85391521
    21       1.51541619  0.82828098
    22       0.69653522 -0.16544687
    23       0.10405480 -0.94939379
    24      -0.18835395 -2.62117304
    25       0.05922234  0.23836024
    26       0.55200631  1.03270843
    27       0.47921075  0.32368910
    28      -1.87895045  0.33458745
    29      -0.52417448 -0.02576781
    30       1.10593164 -0.53994036
    ⋮       ⋮           ⋮          
    999971   1.03720964  0.25952205
    999972   1.85559105 -0.64376335
    999973   0.30896363  0.90888194
    999974  -0.19120411 -0.13836427
    999975   0.38456941 -0.41448315
    999976   0.19796241  0.18025090
    999977  -0.59655868 -1.08715656
    999978   1.52224757 -0.39968983
    999979   0.97208779 -0.99119966
    999980  -0.60255442  0.45956398
    999981  -1.10311491 -0.30234050
    999982  -1.82785989 -0.04869525
    999983   2.02033730 -1.58348097
    999984  -0.31190704 -0.34873065
    999985   1.78569431  2.14918477
    999986  -0.29153104  0.93791961
    999987   0.10206375  0.28002396
    999988   0.77630696  0.02731268
    999989   0.36156982 -0.52059235
    999990  -0.36892781 -0.43718632
    999991   1.08813807 -0.66400660
    999992  -0.03645137 -0.12748503
    999993   1.16919951 -0.08199737
    999994  -1.17768845 -0.85591426
    999995   1.74096286 -1.70328723
    999996  -0.41303043 -0.02196034
    999997  -0.20724831 -0.98494480
    999998  -0.20431187 -0.10192041
    999999  -0.33621017 -0.76541307
    1000000  1.07567218 -0.24330753
    In [6]:
    load("bes2010feelings.RData")
    setDT(bes2010feelings)
    class(bes2010feelings)
    
    [1] "data.table" "data.frame"
    In [ ]:
    
    

Subsetting data tables

  • Script file: subsetting-data-tables.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the data.table package, which is available from https://cran.r-project.org/package=data.table

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE, width=200) # Create output as usual in R, but make it wider
    
    In [2]:
    library(data.table)
    
    In [3]:
    load("bes2010feelings.RData")
    setDT(bes2010feelings)
    
    In [4]:
    sctl2010feelings <- bes2010feelings[region=="Scotland"]
    head(sctl2010feelings)
    
      flng.brown flng.cameron flng.clegg flng.salmond flng.jones flng.labour flng.cons flng.libdem flng.snp flng.pcym flng.green flng.ukip flng.bnp wave region  
    1 8          3            NA         7             6          9           7         5           9       NA         5         NA        NA       Pre  Scotland
    2 0          3            NA         2            NA         NA          NA        NA          NA       NA        NA         NA        NA       Pre  Scotland
    3 4          5             5         7             3          4           4         5           7       NA         3          2         0       Pre  Scotland
    4 6          5             6         6             4          5           4         4           5       NA         1          0         0       Pre  Scotland
    5 8          6             7         0             9          6           7         5           0       NA         2          3         0       Pre  Scotland
    6 9          3             3         4             7          7           5         4           6       NA         7          5         7       Pre  Scotland
    In [5]:
    bes2010feelings.srtd <- bes2010feelings[order(wave,region)]
    
    In [6]:
    bes2010feelings.sub <- bes2010feelings[,.(flng.brown,wave,region)]
    names(bes2010feelings.sub)
    
    [1] "flng.brown" "wave"       "region"    
    In [7]:
    str(bes2010feelings.sub)
    
    Classes ‘data.table’ and 'data.frame':	5010 obs. of  3 variables:
     $ flng.brown: num  6 3 8 4 5 5 5 4 7 4 ...
     $ wave      : Factor w/ 2 levels "Pre","Post": 1 1 1 1 1 1 1 1 1 1 ...
     $ region    : Factor w/ 3 levels "England","Scotland",..: 1 NA 1 1 NA 1 1 1 1 1 ...
     - attr(*, ".internal.selfref")=<externalptr> 
    
    In [8]:
    head(bes2010feelings.sub)
    
      flng.brown wave region 
    1 6          Pre  England
    2 3          Pre  NA     
    3 8          Pre  England
    4 4          Pre  England
    5 5          Pre  NA     
    6 5          Pre  England
    In [9]:
    sctl2010feelings <- bes2010feelings[region=="Scotland",
                                        .(wave,
                                          flng.brown,
                                          flng.cameron,
                                          flng.clegg,
                                          flng.salmond)]
    str(sctl2010feelings)
    
    Classes ‘data.table’ and 'data.frame':	872 obs. of  5 variables:
     $ wave        : Factor w/ 2 levels "Pre","Post": 1 1 1 1 1 1 1 1 1 1 ...
     $ flng.brown  : num  8 0 4 6 8 9 NA 1 7 10 ...
     $ flng.cameron: num  3 3 5 5 6 3 NA 3 5 6 ...
     $ flng.clegg  : num  NA NA 5 6 7 3 NA NA 3 6 ...
     $ flng.salmond: num  7 2 7 6 0 4 NA 4 5 8 ...
     - attr(*, ".internal.selfref")=<externalptr> 
    
    In [10]:
    head(sctl2010feelings)
    
      wave flng.brown flng.cameron flng.clegg flng.salmond
    1 Pre  8          3            NA         7           
    2 Pre  0          3            NA         2           
    3 Pre  4          5             5         7           
    4 Pre  6          5             6         6           
    5 Pre  8          6             7         0           
    6 Pre  9          3             3         4           
    In [ ]:
    
    

Summarizing data tables

  • Script file: summarizing-data-tables.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the data.table package, which is available from https://cran.r-project.org/package=data.table

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    load("bes2010feelings.RData")
    
    In [3]:
    library(data.table)
    
    In [4]:
    setDT(bes2010feelings)
    
    In [5]:
    Mean <- function(x) mean(x,na.rm=TRUE)
    bes2010feelings[,.(Brown=Mean(flng.brown),
                       Cameron=Mean(flng.cameron),
                       Clegg=Mean(flng.clegg),
                       N=.N)]
    
      Brown    Cameron  Clegg    N   
    1 4.406517 5.162484 4.850231 5010
    In [6]:
    bes2010feelings[,.(Brown=Mean(flng.brown),
                       Cameron=Mean(flng.cameron),
                       Clegg=Mean(flng.clegg),
                       N=.N),
                    by=.(wave,region)]
    
      wave region   Brown    Cameron  Clegg    N   
    1 Pre  England  4.092674 5.284810 4.618690 1159
    2 Pre  NA       4.507143 4.929870 4.426573  437
    3 Pre  Scotland 5.395000 4.502591 4.405229  207
    4 Pre  Wales    4.328244 4.774194 4.592233  132
    5 Post England  4.140990 5.441454 5.160313 2175
    6 Post Scotland 5.510769 4.539075 4.513793  665
    7 Post Wales    4.307692 4.855895 4.814480  235
    In [ ]:
    
    

Modifying data tables

  • Script file: modifying-data-tables.R

    The script makes use of the data.table package, which is available from https://cran.r-project.org/package=data.table

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    library(data.table)
    UK <- data.table(
               Population = c(55619400,1885400,5424800,3125000),
               Area = c(50301,5460,30090,8023),
               GVA = c(28096,20000,24800,19900),
               country = c("England",
                             "Northern Ireland",
                             "Scotland",
                             "Wales"))
    
    In [3]:
    UK1 <- UK
    UK[,Density := Population/Area]
    
    In [4]:
    UK
    
      Population Area  GVA   country          Density  
    1 55619400   50301 28096 England          1105.7315
    2  1885400    5460 20000 Northern Ireland  345.3114
    3  5424800   30090 24800 Scotland          180.2858
    4  3125000    8023 19900 Wales             389.5052
    In [5]:
    UK1
    
      Population Area  GVA   country          Density  
    1 55619400   50301 28096 England          1105.7315
    2  1885400    5460 20000 Northern Ireland  345.3114
    3  5424800   30090 24800 Scotland          180.2858
    4  3125000    8023 19900 Wales             389.5052
    In [ ]:
    
    

The Tidyverse

Importing data using haven

  • Script file: importing-data-with-haven.R

    Required data files:

    These Data files used to be available from http://www.hks.harvard.edu/fs/pnorris/Data/Data.htm. Updated data are now available from https://www.pippanorris.com/data.

    The script makes use of the haven package, which is available from https://cran.r-project.org/package=haven

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    library(haven)
    
    In [3]:
    # Reading in an SPSS "system" file
    ConstRes2010 <- read_sav("ConstituencyResults2010.sav")
    head(ConstRes2010)
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    6 6      8.7 58.2  8.1   23.6   NA  NA     NA  NA 
    In [4]:
    # Reading in an SPSS "portable" file 
    ConstRes2010 <- read_por("ConstituencyResults2010.por") # This works with 'foreign', but not with 'haven'
    head(ConstRes2010)
    
    Read bytes: 0   String: ⁸/  Ending state: 0
    Error parsing double string (length=4): ⁸/ [8/]
    
    Error in df_parse_por_file(spec, encoding = "", user_na = user_na, cols_skip, : Failed to parse /home/elff/R/SageBook/dataman-r/Data Tables and the Tidyverse/ConstituencyResults2010.por: Invalid file, or file has unsupported features.
    Traceback:
    
    1. read_por("ConstituencyResults2010.por")
    2. df_parse_por_file(spec, encoding = "", user_na = user_na, cols_skip, 
     .     n_max, skip, name_repair = .name_repair)
    In [5]:
    # Reading in a Stata file
    ConstRes2010 <- read_dta("ConstituencyResults2010.dta")
    head(ConstRes2010)
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    6 6      8.7 58.2  8.1   23.6   NA  NA     NA  NA 
    In [6]:
    # Reading in a new format Stata file
    ConstRes2010 <- read_dta("ConstResults2010-stata-new.dta")
    head(ConstRes2010)
    
      refno cons lab  libdem snp  plcym green bnp ukip
    1 1     14.3 51.9 16.3     NA  7.1  NA    4.1 1.6 
    2 2     35.8 24.5 19.3     NA 17.8  NA     NA 2.1 
    3 3     12.4 44.4 18.6   22.2   NA  NA    1.7  NA 
    4 4     20.7 36.5 28.4   11.9   NA   1    1.2  NA 
    5 5     30.3 13.6 38.4   15.7   NA  NA    1.1 0.9 
    6 6      8.7 58.2  8.1   23.6   NA  NA     NA  NA 
    In [ ]:
    
    

Subsetting data with dplyr

  • Script file: subsetting-data-with-dplyr.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the dplyr package, which is available from https://cran.r-project.org/package=dplyr

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    load("bes2010feelings.RData")
    
    In [3]:
    library(dplyr)
    
    Attaching package: ‘dplyr’
    
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    
    
    In [4]:
    bes2010feelings.sub <- bes2010feelings %>%
                                filter(region == "Scotland") %>%
                                select(wave,
                                       flng.brown,
                                       flng.cameron,
                                       flng.clegg,
                                       flng.salmond)
    
    In [5]:
    head(bes2010feelings.sub)
    
            wave flng.brown flng.cameron flng.clegg flng.salmond
    55002.1 Pre  8          3            NA         7           
    55003.1 Pre  0          3            NA         2           
    55006.1 Pre  4          5             5         7           
    55007.1 Pre  6          5             6         6           
    55010.1 Pre  8          6             7         0           
    55015.1 Pre  9          3             3         4           
    In [ ]:
    
    

Summarizing data with dplyr

  • Script file: summarizing-data-with-dplyr.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the dplyr package, which is available from https://cran.r-project.org/package=dplyr

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    library(dplyr)
    
    Attaching package: ‘dplyr’
    
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    
    
    In [3]:
    load("bes2010feelings.RData")
    Mean <- function(x,...) mean(x,na.rm=TRUE,...)
    
    In [4]:
    bes2010feelings %>% group_by(wave,region) %>%
                        summarize(Brown=Mean(flng.brown),
                                  Cameron=Mean(flng.cameron),
                                  Clegg=Mean(flng.clegg),
                                  N=n())
    
    `summarise()` regrouping output by 'wave' (override with `.groups` argument)
    
    
      wave region   Brown    Cameron  Clegg    N   
    1 Pre  England  4.092674 5.284810 4.618690 1159
    2 Pre  Scotland 5.395000 4.502591 4.405229  207
    3 Pre  Wales    4.328244 4.774194 4.592233  132
    4 Pre  NA       4.507143 4.929870 4.426573  437
    5 Post England  4.140990 5.441454 5.160313 2175
    6 Post Scotland 5.510769 4.539075 4.513793  665
    7 Post Wales    4.307692 4.855895 4.814480  235
    In [ ]:
    
    

Modifying data with dplyr

  • Script file: modifying-data-with-dplyr.R

    Required data file: bes2010feelings.RData (This data set is prepared from the original available at https://www.britishelectionstudy.com/data-object/2010-bes-cross-section/ by removing identifying information and scrambling the data)

    The script makes use of the dplyr package, which is available from https://cran.r-project.org/package=dplyr

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    library(dplyr)
    
    Attaching package: ‘dplyr’
    
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    
    
    In [3]:
    UK <- data.frame(
               Population = c(55619400,1885400,5424800,3125000),
               Area = c(50301,5460,30090,8023),
               GVA = c(28096,20000,24800,19900),
               country = c("England",
                             "Northern Ireland",
                             "Scotland",
                             "Wales"))
    
    In [4]:
    UK %>% mutate(Density = Population/Area)
    
      Population Area  GVA   country          Density  
    1 55619400   50301 28096 England          1105.7315
    2  1885400    5460 20000 Northern Ireland  345.3114
    3  5424800   30090 24800 Scotland          180.2858
    4  3125000    8023 19900 Wales             389.5052

    Equivalent code using 'base' R:

    In [6]:
    within(UK, Density <- Population/Area)
    
      Population Area  GVA   country          Density  
    1 55619400   50301 28096 England          1105.7315
    2  1885400    5460 20000 Northern Ireland  345.3114
    3  5424800   30090 24800 Scotland          180.2858
    4  3125000    8023 19900 Wales             389.5052
    In [ ]:
    
    

A Comparison between the Approaches

Creating group summaries

Modifying data within groups

Comparison summary

  • Script file: comparison-summary-tables.R

    The following data sets are required:

    The script makes use of the following add-on packages:

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=TRUE) # Create formatted output
    
    In [2]:
    library(data.table)
    
    In [3]:
    library(dplyr)
    
    Attaching package: ‘dplyr’
    
    
    The following objects are masked from ‘package:data.table’:
    
        between, first, last
    
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    
    
    In [4]:
    library(memisc)
    
    Loading required package: lattice
    
    Loading required package: MASS
    
    
    Attaching package: ‘MASS’
    
    
    The following object is masked from ‘package:dplyr’:
    
        select
    
    
    
    Attaching package: ‘memisc’
    
    
    The following objects are masked from ‘package:dplyr’:
    
        collect, recode, rename, syms
    
    
    The following objects are masked from ‘package:stats’:
    
        contr.sum, contr.treatment, contrasts
    
    
    The following object is masked from ‘package:base’:
    
        as.array
    
    
    
    In [5]:
    library(rbenchmark)
    
    In [6]:
    bench_matrix <- function(x){
        rn <- x$test
        x <- as.matrix(x[,-1])
        rownames(x) <- rn
        x
    }
    
    In [7]:
    load("grouped-summary-benchmark.RData")
    
    In [8]:
    grouped_summary_benchmark_1 <- bench_matrix(grouped_summary_benchmark_1)
    grouped_summary_benchmark_2 <- bench_matrix(grouped_summary_benchmark_2)
    
    In [9]:
    grouped_summary_benchmark <- memisc::collect(
        "`Big data'"    = grouped_summary_benchmark_1,
        "`Survey data'" = grouped_summary_benchmark_2)
    grouped_summary_benchmark <- grouped_summary_benchmark[-5,,]
    colnames(grouped_summary_benchmark) <- c("abs.","rel.")
    names(dimnames(grouped_summary_benchmark)) <- c("Method","Timing","Data")
    
    In [10]:
    ftable(grouped_summary_benchmark,col.vars=3:2) %>% memisc::show_html(digits=2)
    
    Data: `Big data' `Survey data'
    Method Timing: abs. rel. abs. rel.
    aggregate 54 . 46 11 . 96 0 . 55 4 . 55
    with + tapply 4 . 55 1 . 00 0 . 12 1 . 00
    data.table 17 . 04 3 . 74 0 . 89 7 . 29
    group_by + summarize 14 . 00 3 . 08 0 . 36 2 . 97
    withGroups 22 . 70 4 . 99 1 . 41 11 . 56
    In [11]:
    load("grouped-modification-benchmark.RData")
    
    grouped_modification_benchmark_1 <- bench_matrix(grouped_modification_benchmark_1)
    grouped_modification_benchmark_2 <- bench_matrix(grouped_modification_benchmark_2)
    
    grouped_modification_benchmark <- collect(
        "`Big data'"    = grouped_modification_benchmark_1,
        "`Survey data'" = grouped_modification_benchmark_2)
    colnames(grouped_modification_benchmark) <- c("abs.","rel.")
    names(dimnames(grouped_modification_benchmark)) <- c("Method","Timing","Data")
    
    In [12]:
    ftable(grouped_modification_benchmark,col.vars=3:2) %>% memisc::show_html(digits=2)
    
    Data: `Big data' `Survey data'
    Method Timing: abs. rel. abs. rel.
    within 26 . 91 1 . 08 2 . 37 1 . 58
    data.table 24 . 85 1 . 00 2 . 66 1 . 77
    group_by + mutate 27 . 18 1 . 09 3 . 26 2 . 17
    withinGroups 33 . 94 1 . 37 1 . 50 1 . 00
    In [ ]:
    
    

Tidying Data Using the tidyr Package

Reshaping data with gather() and pivot_longer() into a long arrangement

  • Script file: reshaping-to-long-with-tidyr.R

    Required data file: gini-oecd.tsv (scraped from https://data.oecd.org at 9 Dec 2019)

    The script makes use of the following add-on packages:

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE, width=120) # Create output as usual in R
    
    In [2]:
    # Inspecting the file
    substr(readLines("gini-oecd.tsv",n=5),start=1,stop=50)
    
    [1] "Data table for: Income inequality, Gini coefficien"        
    [2] "Location \t 2007\t2008\t 2009\t2010\t2011\t2012\t2013\t201"
    [3] "Australia\t\t\t\t\t\t0.326\t\t0.337\t\t0.330\t"            
    [4] "Austria\t0.284\t0.281\t0.289\t0.280\t0.281\t0.275\t0.279\t"
    [5] "Belgium\t0.277\t0.266\t0.272\t0.267\t0.270\t0.265\t0.265\t"
    In [3]:
    library(readr)
    
    In [4]:
    gini.oecd <- read_tsv("gini-oecd.tsv",
                          skip=1)
    
    Parsed with column specification:
    cols(
      Location = col_character(),
      `2007` = col_double(),
      `2008` = col_double(),
      `2009` = col_double(),
      `2010` = col_double(),
      `2011` = col_double(),
      `2012` = col_double(),
      `2013` = col_double(),
      `2014` = col_double(),
      `2015` = col_double(),
      `2016` = col_double(),
      `2017` = col_double()
    )
    
    
    In [5]:
    gini.oecd
    
       Location                     2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017 
    1  Australia                       NA    NA    NA    NA    NA 0.326    NA 0.337    NA 0.330    NA
    2  Austria                      0.284 0.281 0.289 0.280 0.281 0.275 0.279 0.274 0.276 0.284    NA
    3  Belgium                      0.277 0.266 0.272 0.267 0.270 0.265 0.265 0.266 0.268 0.266    NA
    4  Brazil                          NA    NA 0.485    NA 0.483    NA 0.470    NA    NA    NA    NA
    5  Canada                       0.317 0.315 0.316 0.316 0.313 0.317 0.320 0.313 0.318 0.307 0.310
    6  Chile                           NA    NA 0.480    NA 0.471    NA 0.465    NA 0.454    NA 0.460
    7  China (People's Republic of)    NA    NA    NA    NA 0.514    NA    NA    NA    NA    NA    NA
    8  Costa Rica                      NA    NA    NA 0.472 0.480 0.483 0.494 0.485 0.479 0.484 0.480
    9  Czech Republic               0.256 0.259 0.257 0.259 0.257 0.253 0.259 0.257 0.258 0.253    NA
    10 Denmark                         NA    NA    NA    NA 0.251 0.249 0.254 0.256 0.263 0.261    NA
    11 Estonia                         NA    NA    NA    NA    NA    NA 0.357 0.346 0.330 0.314    NA
    12 Finland                      0.269 0.264 0.259 0.264 0.264 0.260 0.262 0.257 0.260 0.259 0.266
    13 France                          NA    NA    NA    NA    NA 0.305 0.291 0.293 0.295 0.291    NA
    14 Germany                         NA 0.285    NA    NA 0.291 0.289 0.292 0.289 0.293 0.294    NA
    15 Greece                       0.329 0.328 0.330 0.336 0.333 0.338 0.342 0.339 0.340 0.333    NA
    16 Hungary                      0.257 0.246 0.241 0.266 0.270 0.280 0.277 0.282 0.284 0.282    NA
    17 Iceland                      0.286 0.305 0.266 0.249 0.252 0.253 0.241 0.246 0.255    NA    NA
    18 India                           NA    NA    NA    NA 0.495    NA    NA    NA    NA    NA    NA
    19 Ireland                      0.304 0.295 0.312 0.298 0.307 0.310 0.308 0.298 0.297 0.309    NA
    20 Israel                          NA    NA    NA    NA 0.371 0.371 0.360 0.365 0.360 0.346 0.344
    21 Italy                        0.313 0.317 0.315 0.327 0.327 0.330 0.325 0.326 0.333 0.328    NA
    22 Japan                           NA    NA 0.336    NA    NA 0.330    NA    NA 0.339    NA    NA
    23 Korea                           NA    NA    NA    NA    NA    NA    NA    NA 0.352 0.355 0.355
    24 Latvia                       0.374 0.373 0.354 0.347 0.352 0.347 0.351 0.350 0.346 0.346    NA
    25 Lithuania                    0.338 0.358 0.366 0.329 0.322 0.350 0.352 0.381 0.372 0.378    NA
    26 Luxembourg                      NA    NA    NA    NA    NA    NA    NA    NA 0.306 0.304    NA
    27 Mexico                          NA    NA    NA    NA    NA 0.457    NA 0.459    NA 0.458    NA
    28 Netherlands                     NA    NA    NA    NA 0.289 0.288 0.287 0.303 0.288 0.285    NA
    29 New Zealand                     NA    NA    NA    NA 0.323 0.333    NA 0.349    NA    NA    NA
    30 Norway                          NA 0.250 0.245 0.249 0.250 0.253 0.252 0.257 0.272 0.262 0.262
    31 Poland                       0.316 0.308 0.303 0.305 0.301 0.297 0.299 0.298 0.292 0.284    NA
    32 Portugal                     0.361 0.355 0.337 0.341 0.337 0.337 0.341 0.338 0.336 0.331    NA
    33 Russia                          NA    NA    NA    NA 0.376    NA    NA    NA    NA 0.331    NA
    34 Slovak Republic              0.245 0.256 0.264 0.263 0.261 0.250 0.269 0.247 0.251 0.241    NA
    35 Slovenia                     0.239 0.234 0.245 0.244 0.244 0.249 0.254 0.251 0.250 0.244    NA
    36 South Africa                    NA    NA    NA    NA    NA    NA    NA    NA 0.620    NA    NA
    37 Spain                        0.324 0.327 0.333 0.339 0.341 0.334 0.345 0.344 0.345 0.341    NA
    38 Sweden                          NA    NA    NA    NA    NA    NA 0.268 0.274 0.278 0.282 0.282
    39 Switzerland                  0.312 0.306 0.297 0.298 0.289 0.285 0.295 0.297 0.296    NA    NA
    40 Turkey                          NA    NA    NA    NA 0.403 0.399 0.390 0.398 0.404    NA    NA
    41 United Kingdom               0.373 0.369 0.374 0.351 0.354 0.351 0.358 0.356 0.360 0.351 0.357
    42 United States                   NA    NA    NA    NA    NA 0.396 0.394 0.390 0.391 0.390    NA
    In [6]:
    library(tidyr)
    
    In [7]:
    gini.oecd %>% gather(`2007`,`2008`,`2009`,`2010`,`2011`,`2012`,
                          `2013`,`2014`,`2015`,`2016`,`2017`,
                          key="year",value="gini") -> gini.oecd.long
    gini.oecd.long
    
        Location                     year gini 
    1   Australia                    2007    NA
    2   Austria                      2007 0.284
    3   Belgium                      2007 0.277
    4   Brazil                       2007    NA
    5   Canada                       2007 0.317
    6   Chile                        2007    NA
    7   China (People's Republic of) 2007    NA
    8   Costa Rica                   2007    NA
    9   Czech Republic               2007 0.256
    10  Denmark                      2007    NA
    11  Estonia                      2007    NA
    12  Finland                      2007 0.269
    13  France                       2007    NA
    14  Germany                      2007    NA
    15  Greece                       2007 0.329
    16  Hungary                      2007 0.257
    17  Iceland                      2007 0.286
    18  India                        2007    NA
    19  Ireland                      2007 0.304
    20  Israel                       2007    NA
    21  Italy                        2007 0.313
    22  Japan                        2007    NA
    23  Korea                        2007    NA
    24  Latvia                       2007 0.374
    25  Lithuania                    2007 0.338
    26  Luxembourg                   2007    NA
    27  Mexico                       2007    NA
    28  Netherlands                  2007    NA
    29  New Zealand                  2007    NA
    30  Norway                       2007    NA
    ⋮   ⋮                            ⋮    ⋮    
    433 France                       2017    NA
    434 Germany                      2017    NA
    435 Greece                       2017    NA
    436 Hungary                      2017    NA
    437 Iceland                      2017    NA
    438 India                        2017    NA
    439 Ireland                      2017    NA
    440 Israel                       2017 0.344
    441 Italy                        2017    NA
    442 Japan                        2017    NA
    443 Korea                        2017 0.355
    444 Latvia                       2017    NA
    445 Lithuania                    2017    NA
    446 Luxembourg                   2017    NA
    447 Mexico                       2017    NA
    448 Netherlands                  2017    NA
    449 New Zealand                  2017    NA
    450 Norway                       2017 0.262
    451 Poland                       2017    NA
    452 Portugal                     2017    NA
    453 Russia                       2017    NA
    454 Slovak Republic              2017    NA
    455 Slovenia                     2017    NA
    456 South Africa                 2017    NA
    457 Spain                        2017    NA
    458 Sweden                       2017 0.282
    459 Switzerland                  2017    NA
    460 Turkey                       2017    NA
    461 United Kingdom               2017 0.357
    462 United States                2017    NA
    In [8]:
    gini.oecd %>% gather(-Location,
                          key="year",value="gini") -> gini.oecd.long
    gini.oecd.long
    
        Location                     year gini 
    1   Australia                    2007    NA
    2   Austria                      2007 0.284
    3   Belgium                      2007 0.277
    4   Brazil                       2007    NA
    5   Canada                       2007 0.317
    6   Chile                        2007    NA
    7   China (People's Republic of) 2007    NA
    8   Costa Rica                   2007    NA
    9   Czech Republic               2007 0.256
    10  Denmark                      2007    NA
    11  Estonia                      2007    NA
    12  Finland                      2007 0.269
    13  France                       2007    NA
    14  Germany                      2007    NA
    15  Greece                       2007 0.329
    16  Hungary                      2007 0.257
    17  Iceland                      2007 0.286
    18  India                        2007    NA
    19  Ireland                      2007 0.304
    20  Israel                       2007    NA
    21  Italy                        2007 0.313
    22  Japan                        2007    NA
    23  Korea                        2007    NA
    24  Latvia                       2007 0.374
    25  Lithuania                    2007 0.338
    26  Luxembourg                   2007    NA
    27  Mexico                       2007    NA
    28  Netherlands                  2007    NA
    29  New Zealand                  2007    NA
    30  Norway                       2007    NA
    ⋮   ⋮                            ⋮    ⋮    
    433 France                       2017    NA
    434 Germany                      2017    NA
    435 Greece                       2017    NA
    436 Hungary                      2017    NA
    437 Iceland                      2017    NA
    438 India                        2017    NA
    439 Ireland                      2017    NA
    440 Israel                       2017 0.344
    441 Italy                        2017    NA
    442 Japan                        2017    NA
    443 Korea                        2017 0.355
    444 Latvia                       2017    NA
    445 Lithuania                    2017    NA
    446 Luxembourg                   2017    NA
    447 Mexico                       2017    NA
    448 Netherlands                  2017    NA
    449 New Zealand                  2017    NA
    450 Norway                       2017 0.262
    451 Poland                       2017    NA
    452 Portugal                     2017    NA
    453 Russia                       2017    NA
    454 Slovak Republic              2017    NA
    455 Slovenia                     2017    NA
    456 South Africa                 2017    NA
    457 Spain                        2017    NA
    458 Sweden                       2017 0.282
    459 Switzerland                  2017    NA
    460 Turkey                       2017    NA
    461 United Kingdom               2017 0.357
    462 United States                2017    NA
    In [9]:
    gini.oecd %>% pivot_longer(-Location,
                               names_to="year",
                               values_to="gini") -> gini.oecd.long
    gini.oecd.long
    
        Location       year gini 
    1   Australia      2007    NA
    2   Australia      2008    NA
    3   Australia      2009    NA
    4   Australia      2010    NA
    5   Australia      2011    NA
    6   Australia      2012 0.326
    7   Australia      2013    NA
    8   Australia      2014 0.337
    9   Australia      2015    NA
    10  Australia      2016 0.330
    11  Australia      2017    NA
    12  Austria        2007 0.284
    13  Austria        2008 0.281
    14  Austria        2009 0.289
    15  Austria        2010 0.280
    16  Austria        2011 0.281
    17  Austria        2012 0.275
    18  Austria        2013 0.279
    19  Austria        2014 0.274
    20  Austria        2015 0.276
    21  Austria        2016 0.284
    22  Austria        2017    NA
    23  Belgium        2007 0.277
    24  Belgium        2008 0.266
    25  Belgium        2009 0.272
    26  Belgium        2010 0.267
    27  Belgium        2011 0.270
    28  Belgium        2012 0.265
    29  Belgium        2013 0.265
    30  Belgium        2014 0.266
    ⋮   ⋮              ⋮    ⋮    
    433 Turkey         2010    NA
    434 Turkey         2011 0.403
    435 Turkey         2012 0.399
    436 Turkey         2013 0.390
    437 Turkey         2014 0.398
    438 Turkey         2015 0.404
    439 Turkey         2016    NA
    440 Turkey         2017    NA
    441 United Kingdom 2007 0.373
    442 United Kingdom 2008 0.369
    443 United Kingdom 2009 0.374
    444 United Kingdom 2010 0.351
    445 United Kingdom 2011 0.354
    446 United Kingdom 2012 0.351
    447 United Kingdom 2013 0.358
    448 United Kingdom 2014 0.356
    449 United Kingdom 2015 0.360
    450 United Kingdom 2016 0.351
    451 United Kingdom 2017 0.357
    452 United States  2007    NA
    453 United States  2008    NA
    454 United States  2009    NA
    455 United States  2010    NA
    456 United States  2011    NA
    457 United States  2012 0.396
    458 United States  2013 0.394
    459 United States  2014 0.390
    460 United States  2015 0.391
    461 United States  2016 0.390
    462 United States  2017    NA
    In [ ]:
    
    

Reshaping data with spread() and pivot_wider() into a wides arrangement

  • Script file: reshaping-to-wide-with-tidyr.R

    Required data file: inequality-oecd-downloaded.csv (manually downloaded from https://data.oecd.org at 9 Dec 2019)

    The script makes use of the following add-on packages:

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    substr(readLines("inequality-oecd-downloaded.csv",n=5),
           start=1,stop=40)
    
    [1] "\"LOCATION\",\"INDICATOR\",\"SUBJECT\",\"MEASUR"    
    [2] "\"AUS\",\"INCOMEINEQ\",\"P50P10\",\"RT\",\"A\",\"20"
    [3] "\"AUS\",\"INCOMEINEQ\",\"P50P10\",\"RT\",\"A\",\"20"
    [4] "\"AUS\",\"INCOMEINEQ\",\"P50P10\",\"RT\",\"A\",\"20"
    [5] "\"AUS\",\"INCOMEINEQ\",\"P90P10\",\"RT\",\"A\",\"20"
    In [3]:
    library(readr)
    
    In [4]:
    inequality.oecd.dld <- read_csv("inequality-oecd-downloaded.csv")
    
    Parsed with column specification:
    cols(
      LOCATION = col_character(),
      INDICATOR = col_character(),
      SUBJECT = col_character(),
      MEASURE = col_character(),
      FREQUENCY = col_character(),
      TIME = col_double(),
      Value = col_double(),
      `Flag Codes` = col_character()
    )
    
    
    In [5]:
    inequality.oecd.dld
    
         LOCATION INDICATOR  SUBJECT MEASURE FREQUENCY TIME Value Flag Codes
    1    AUS      INCOMEINEQ P50P10  RT      A         2012 2.200 NA        
    2    AUS      INCOMEINEQ P50P10  RT      A         2014 2.200 NA        
    3    AUS      INCOMEINEQ P50P10  RT      A         2016 2.100 NA        
    4    AUS      INCOMEINEQ P90P10  RT      A         2012 4.400 NA        
    5    AUS      INCOMEINEQ P90P10  RT      A         2014 4.300 NA        
    6    AUS      INCOMEINEQ P90P10  RT      A         2016 4.300 NA        
    7    AUS      INCOMEINEQ P90P50  RT      A         2012 2.000 NA        
    8    AUS      INCOMEINEQ P90P50  RT      A         2014 2.000 NA        
    9    AUS      INCOMEINEQ P90P50  RT      A         2016 2.100 NA        
    10   AUS      INCOMEINEQ GINI    INEQ    A         2012 0.326 NA        
    11   AUS      INCOMEINEQ GINI    INEQ    A         2014 0.337 NA        
    12   AUS      INCOMEINEQ GINI    INEQ    A         2016 0.330 NA        
    13   AUT      INCOMEINEQ P50P10  RT      A         2007 2.000 NA        
    14   AUT      INCOMEINEQ P50P10  RT      A         2008 1.900 NA        
    15   AUT      INCOMEINEQ P50P10  RT      A         2009 2.000 NA        
    16   AUT      INCOMEINEQ P50P10  RT      A         2010 1.900 NA        
    17   AUT      INCOMEINEQ P50P10  RT      A         2011 1.900 NA        
    18   AUT      INCOMEINEQ P50P10  RT      A         2012 2.000 NA        
    19   AUT      INCOMEINEQ P50P10  RT      A         2013 1.900 NA        
    20   AUT      INCOMEINEQ P50P10  RT      A         2014 1.900 NA        
    21   AUT      INCOMEINEQ P50P10  RT      A         2015 1.900 NA        
    22   AUT      INCOMEINEQ P50P10  RT      A         2016 2.000 NA        
    23   AUT      INCOMEINEQ P90P10  RT      A         2007 3.600 NA        
    24   AUT      INCOMEINEQ P90P10  RT      A         2008 3.400 NA        
    25   AUT      INCOMEINEQ P90P10  RT      A         2009 3.600 NA        
    26   AUT      INCOMEINEQ P90P10  RT      A         2010 3.500 NA        
    27   AUT      INCOMEINEQ P90P10  RT      A         2011 3.500 NA        
    28   AUT      INCOMEINEQ P90P10  RT      A         2012 3.500 NA        
    29   AUT      INCOMEINEQ P90P10  RT      A         2013 3.400 NA        
    30   AUT      INCOMEINEQ P90P10  RT      A         2014 3.400 NA        
    ⋮    ⋮        ⋮          ⋮       ⋮       ⋮         ⋮    ⋮     ⋮         
    2286 LTU      INCOMEINEQ PALMA   RT      A         2006 1.26  NA        
    2287 LTU      INCOMEINEQ PALMA   RT      A         2007 1.33  NA        
    2288 LTU      INCOMEINEQ PALMA   RT      A         2008 1.48  NA        
    2289 LTU      INCOMEINEQ PALMA   RT      A         2009 1.53  NA        
    2290 LTU      INCOMEINEQ PALMA   RT      A         2010 1.24  NA        
    2291 LTU      INCOMEINEQ PALMA   RT      A         2011 1.20  NA        
    2292 LTU      INCOMEINEQ PALMA   RT      A         2012 1.42  NA        
    2293 LTU      INCOMEINEQ PALMA   RT      A         2013 1.42  NA        
    2294 LTU      INCOMEINEQ PALMA   RT      A         2014 1.67  NA        
    2295 LTU      INCOMEINEQ PALMA   RT      A         2015 1.59  NA        
    2296 LTU      INCOMEINEQ PALMA   RT      A         2016 1.65  NA        
    2297 RUS      INCOMEINEQ PALMA   RT      A         2011 1.59  NA        
    2298 RUS      INCOMEINEQ PALMA   RT      A         2016 1.28  NA        
    2299 SVN      INCOMEINEQ PALMA   RT      A         2004 0.80  NA        
    2300 SVN      INCOMEINEQ PALMA   RT      A         2005 0.80  NA        
    2301 SVN      INCOMEINEQ PALMA   RT      A         2006 0.79  NA        
    2302 SVN      INCOMEINEQ PALMA   RT      A         2007 0.79  NA        
    2303 SVN      INCOMEINEQ PALMA   RT      A         2008 0.77  NA        
    2304 SVN      INCOMEINEQ PALMA   RT      A         2009 0.82  NA        
    2305 SVN      INCOMEINEQ PALMA   RT      A         2010 0.81  NA        
    2306 SVN      INCOMEINEQ PALMA   RT      A         2011 0.81  NA        
    2307 SVN      INCOMEINEQ PALMA   RT      A         2012 0.83  NA        
    2308 SVN      INCOMEINEQ PALMA   RT      A         2013 0.86  NA        
    2309 SVN      INCOMEINEQ PALMA   RT      A         2014 0.85  NA        
    2310 SVN      INCOMEINEQ PALMA   RT      A         2015 0.84  NA        
    2311 SVN      INCOMEINEQ PALMA   RT      A         2016 0.81  NA        
    2312 ZAF      INCOMEINEQ PALMA   RT      A         2015 7.03  P         
    2313 KOR      INCOMEINEQ PALMA   RT      A         2015 1.42  NA        
    2314 KOR      INCOMEINEQ PALMA   RT      A         2016 1.45  NA        
    2315 KOR      INCOMEINEQ PALMA   RT      A         2017 1.44  NA        
    In [7]:
    library(tidyr)
    
    In [8]:
    inequality.oecd.dld %>% spread(key="SUBJECT",value="Value") ->
                                                   inequality.oecd
    
    In [9]:
    inequality.oecd[-c(2,4,6)]
    
        LOCATION MEASURE TIME GINI  P50P10 P90P10 P90P50 PALMA S80S20
    1   AUS      INEQ    2012 0.326  NA     NA     NA      NA   NA   
    2   AUS      INEQ    2014 0.337  NA     NA     NA      NA   NA   
    3   AUS      INEQ    2016 0.330  NA     NA     NA      NA   NA   
    4   AUS      RT      2012    NA 2.2    4.4    2.0    1.24  5.5   
    5   AUS      RT      2014    NA 2.2    4.3    2.0    1.34  5.7   
    6   AUS      RT      2016    NA 2.1    4.3    2.1    1.26  5.5   
    7   AUT      INEQ    2007 0.284  NA     NA     NA      NA   NA   
    8   AUT      INEQ    2008 0.281  NA     NA     NA      NA   NA   
    9   AUT      INEQ    2009 0.289  NA     NA     NA      NA   NA   
    10  AUT      INEQ    2010 0.280  NA     NA     NA      NA   NA   
    11  AUT      INEQ    2011 0.281  NA     NA     NA      NA   NA   
    12  AUT      INEQ    2012 0.275  NA     NA     NA      NA   NA   
    13  AUT      INEQ    2013 0.279  NA     NA     NA      NA   NA   
    14  AUT      INEQ    2014 0.274  NA     NA     NA      NA   NA   
    15  AUT      INEQ    2015 0.276  NA     NA     NA      NA   NA   
    16  AUT      INEQ    2016 0.284  NA     NA     NA      NA   NA   
    17  AUT      RT      2007    NA 2.0    3.6    1.8    1.00  4.4   
    18  AUT      RT      2008    NA 1.9    3.4    1.8    1.00  4.3   
    19  AUT      RT      2009    NA 2.0    3.6    1.8    1.03  4.5   
    20  AUT      RT      2010    NA 1.9    3.5    1.8    0.98  4.3   
    21  AUT      RT      2011    NA 1.9    3.5    1.8    0.99  4.4   
    22  AUT      RT      2012    NA 2.0    3.5    1.8    0.96  4.2   
    23  AUT      RT      2013    NA 1.9    3.4    1.8    0.99  4.2   
    24  AUT      RT      2014    NA 1.9    3.4    1.7    0.96  4.1   
    25  AUT      RT      2015    NA 1.9    3.3    1.7    0.96  4.2   
    26  AUT      RT      2016    NA 2.0    3.5    1.8    1.00  4.5   
    27  BEL      INEQ    2004 0.287  NA     NA     NA      NA   NA   
    28  BEL      INEQ    2005 0.277  NA     NA     NA      NA   NA   
    29  BEL      INEQ    2006 0.268  NA     NA     NA      NA   NA   
    30  BEL      INEQ    2007 0.277  NA     NA     NA      NA   NA   
    ⋮   ⋮        ⋮       ⋮    ⋮     ⋮      ⋮      ⋮      ⋮     ⋮     
    742 SWE      INEQ    2015 0.278  NA      NA    NA      NA    NA  
    743 SWE      INEQ    2016 0.282  NA      NA    NA      NA    NA  
    744 SWE      INEQ    2017 0.282  NA      NA    NA      NA    NA  
    745 SWE      RT      2013    NA 1.9     3.2   1.7    0.94   4.0  
    746 SWE      RT      2014    NA 1.9     3.3   1.7    0.97   4.1  
    747 SWE      RT      2015    NA 2.0     3.3   1.7    0.99   4.2  
    748 SWE      RT      2016    NA 1.9     3.3   1.7    1.02   4.2  
    749 SWE      RT      2017    NA 2.0     3.3   1.7    1.02   4.2  
    750 TUR      INEQ    2011 0.403  NA      NA    NA      NA    NA  
    751 TUR      INEQ    2012 0.399  NA      NA    NA      NA    NA  
    752 TUR      INEQ    2013 0.390  NA      NA    NA      NA    NA  
    753 TUR      INEQ    2014 0.398  NA      NA    NA      NA    NA  
    754 TUR      INEQ    2015 0.404  NA      NA    NA      NA    NA  
    755 TUR      RT      2011    NA 2.5     6.1   2.5    1.89   8.0  
    756 TUR      RT      2012    NA 2.4     6.0   2.5    1.86   7.7  
    757 TUR      RT      2013    NA 2.4     5.9   2.5    1.76   7.5  
    758 TUR      RT      2014    NA 2.4     5.9   2.5    1.84   7.7  
    759 TUR      RT      2015    NA 2.3     5.7   2.5    1.91   7.8  
    760 USA      INEQ    2013 0.396  NA      NA    NA      NA    NA  
    761 USA      INEQ    2014 0.394  NA      NA    NA      NA    NA  
    762 USA      INEQ    2015 0.390  NA      NA    NA      NA    NA  
    763 USA      INEQ    2016 0.391  NA      NA    NA      NA    NA  
    764 USA      INEQ    2017 0.390  NA      NA    NA      NA    NA  
    765 USA      RT      2013    NA 2.7     6.4   2.3    1.82   8.6  
    766 USA      RT      2014    NA 2.7     6.4   2.3    1.79   8.7  
    767 USA      RT      2015    NA 2.7     6.1   2.3    1.75   8.3  
    768 USA      RT      2016    NA 2.7     6.3   2.3    1.77   8.5  
    769 USA      RT      2017    NA 2.7     6.2   2.3    1.76   8.4  
    770 ZAF      INEQ    2015 0.620  NA      NA    NA      NA    NA  
    771 ZAF      RT      2015    NA 4.8    25.6   5.3    7.03  37.6  
    In [11]:
    library(dplyr)
    
    Attaching package: ‘dplyr’
    
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    
    
    
    In [12]:
    inequality.oecd.sub <- select(inequality.oecd.dld,
                                  LOCATION,SUBJECT,TIME,Value)
    inequality.oecd.sub
    
         LOCATION SUBJECT TIME Value
    1    AUS      P50P10  2012 2.200
    2    AUS      P50P10  2014 2.200
    3    AUS      P50P10  2016 2.100
    4    AUS      P90P10  2012 4.400
    5    AUS      P90P10  2014 4.300
    6    AUS      P90P10  2016 4.300
    7    AUS      P90P50  2012 2.000
    8    AUS      P90P50  2014 2.000
    9    AUS      P90P50  2016 2.100
    10   AUS      GINI    2012 0.326
    11   AUS      GINI    2014 0.337
    12   AUS      GINI    2016 0.330
    13   AUT      P50P10  2007 2.000
    14   AUT      P50P10  2008 1.900
    15   AUT      P50P10  2009 2.000
    16   AUT      P50P10  2010 1.900
    17   AUT      P50P10  2011 1.900
    18   AUT      P50P10  2012 2.000
    19   AUT      P50P10  2013 1.900
    20   AUT      P50P10  2014 1.900
    21   AUT      P50P10  2015 1.900
    22   AUT      P50P10  2016 2.000
    23   AUT      P90P10  2007 3.600
    24   AUT      P90P10  2008 3.400
    25   AUT      P90P10  2009 3.600
    26   AUT      P90P10  2010 3.500
    27   AUT      P90P10  2011 3.500
    28   AUT      P90P10  2012 3.500
    29   AUT      P90P10  2013 3.400
    30   AUT      P90P10  2014 3.400
    ⋮    ⋮        ⋮       ⋮    ⋮    
    2286 LTU      PALMA   2006 1.26 
    2287 LTU      PALMA   2007 1.33 
    2288 LTU      PALMA   2008 1.48 
    2289 LTU      PALMA   2009 1.53 
    2290 LTU      PALMA   2010 1.24 
    2291 LTU      PALMA   2011 1.20 
    2292 LTU      PALMA   2012 1.42 
    2293 LTU      PALMA   2013 1.42 
    2294 LTU      PALMA   2014 1.67 
    2295 LTU      PALMA   2015 1.59 
    2296 LTU      PALMA   2016 1.65 
    2297 RUS      PALMA   2011 1.59 
    2298 RUS      PALMA   2016 1.28 
    2299 SVN      PALMA   2004 0.80 
    2300 SVN      PALMA   2005 0.80 
    2301 SVN      PALMA   2006 0.79 
    2302 SVN      PALMA   2007 0.79 
    2303 SVN      PALMA   2008 0.77 
    2304 SVN      PALMA   2009 0.82 
    2305 SVN      PALMA   2010 0.81 
    2306 SVN      PALMA   2011 0.81 
    2307 SVN      PALMA   2012 0.83 
    2308 SVN      PALMA   2013 0.86 
    2309 SVN      PALMA   2014 0.85 
    2310 SVN      PALMA   2015 0.84 
    2311 SVN      PALMA   2016 0.81 
    2312 ZAF      PALMA   2015 7.03 
    2313 KOR      PALMA   2015 1.42 
    2314 KOR      PALMA   2016 1.45 
    2315 KOR      PALMA   2017 1.44 
    In [13]:
    inequality.oecd.sub %>% spread(key=SUBJECT,
                                   value=Value) -> inequality.oecd
    
    In [14]:
    inequality.oecd
    
        LOCATION TIME GINI  P50P10 P90P10 P90P50 PALMA S80S20
    1   AUS      2012 0.326 2.2     4.4   2.0    1.24   5.5  
    2   AUS      2014 0.337 2.2     4.3   2.0    1.34   5.7  
    3   AUS      2016 0.330 2.1     4.3   2.1    1.26   5.5  
    4   AUT      2007 0.284 2.0     3.6   1.8    1.00   4.4  
    5   AUT      2008 0.281 1.9     3.4   1.8    1.00   4.3  
    6   AUT      2009 0.289 2.0     3.6   1.8    1.03   4.5  
    7   AUT      2010 0.280 1.9     3.5   1.8    0.98   4.3  
    8   AUT      2011 0.281 1.9     3.5   1.8    0.99   4.4  
    9   AUT      2012 0.275 2.0     3.5   1.8    0.96   4.2  
    10  AUT      2013 0.279 1.9     3.4   1.8    0.99   4.2  
    11  AUT      2014 0.274 1.9     3.4   1.7    0.96   4.1  
    12  AUT      2015 0.276 1.9     3.3   1.7    0.96   4.2  
    13  AUT      2016 0.284 2.0     3.5   1.8    1.00   4.5  
    14  BEL      2004 0.287 1.9     3.3   1.7    1.05   4.2  
    15  BEL      2005 0.277 2.0     3.4   1.7    0.99   4.1  
    16  BEL      2006 0.268 2.0     3.4   1.7    0.93   4.0  
    17  BEL      2007 0.277 1.9     3.3   1.7    0.98   4.1  
    18  BEL      2008 0.266 2.0     3.4   1.7    0.91   3.9  
    19  BEL      2009 0.272 2.0     3.4   1.7    0.95   4.1  
    20  BEL      2010 0.267 2.0     3.4   1.7    0.92   4.0  
    21  BEL      2011 0.270 2.0     3.4   1.7    0.93   4.0  
    22  BEL      2012 0.265 2.0     3.4   1.7    0.91   3.9  
    23  BEL      2013 0.265 2.0     3.4   1.7    0.90   3.9  
    24  BEL      2014 0.266 1.9     3.4   1.7    0.92   3.9  
    25  BEL      2015 0.268 2.0     3.4   1.7    0.93   4.0  
    26  BEL      2016 0.266 2.0     3.3   1.7    0.92   3.9  
    27  BRA      2006 0.510 3.1    10.4   3.4    3.39  15.3  
    28  BRA      2009 0.485 2.8     8.9   3.1    2.96  13.1  
    29  BRA      2011 0.483 3.1     9.3   3.0    2.95  13.9  
    30  BRA      2013 0.470 3.0     8.7   2.9    2.71  12.5  
    ⋮   ⋮        ⋮    ⋮     ⋮      ⋮      ⋮      ⋮     ⋮     
    357 SVK      2016 0.241 1.9     3.1   1.6    0.79   3.7  
    358 SVN      2004 0.241 1.8     3.0   1.7    0.80   3.5  
    359 SVN      2005 0.240 1.8     3.0   1.7    0.80   3.5  
    360 SVN      2006 0.237 1.8     3.0   1.7    0.79   3.4  
    361 SVN      2007 0.239 1.8     3.0   1.6    0.79   3.5  
    362 SVN      2008 0.234 1.8     3.0   1.6    0.77   3.4  
    363 SVN      2009 0.245 1.9     3.1   1.7    0.82   3.6  
    364 SVN      2010 0.244 1.9     3.2   1.6    0.81   3.6  
    365 SVN      2011 0.244 1.9     3.2   1.6    0.81   3.6  
    366 SVN      2012 0.249 2.0     3.2   1.7    0.83   3.7  
    367 SVN      2013 0.254 2.0     3.3   1.7    0.86   3.8  
    368 SVN      2014 0.251 2.0     3.3   1.7    0.85   3.7  
    369 SVN      2015 0.250 1.9     3.2   1.6    0.84   3.7  
    370 SVN      2016 0.244 1.9     3.1   1.7    0.81   3.6  
    371 SWE      2013 0.268 1.9     3.2   1.7    0.94   4.0  
    372 SWE      2014 0.274 1.9     3.3   1.7    0.97   4.1  
    373 SWE      2015 0.278 2.0     3.3   1.7    0.99   4.2  
    374 SWE      2016 0.282 1.9     3.3   1.7    1.02   4.2  
    375 SWE      2017 0.282 2.0     3.3   1.7    1.02   4.2  
    376 TUR      2011 0.403 2.5     6.1   2.5    1.89   8.0  
    377 TUR      2012 0.399 2.4     6.0   2.5    1.86   7.7  
    378 TUR      2013 0.390 2.4     5.9   2.5    1.76   7.5  
    379 TUR      2014 0.398 2.4     5.9   2.5    1.84   7.7  
    380 TUR      2015 0.404 2.3     5.7   2.5    1.91   7.8  
    381 USA      2013 0.396 2.7     6.4   2.3    1.82   8.6  
    382 USA      2014 0.394 2.7     6.4   2.3    1.79   8.7  
    383 USA      2015 0.390 2.7     6.1   2.3    1.75   8.3  
    384 USA      2016 0.391 2.7     6.3   2.3    1.77   8.5  
    385 USA      2017 0.390 2.7     6.2   2.3    1.76   8.4  
    386 ZAF      2015 0.620 4.8    25.6   5.3    7.03  37.6  
    In [15]:
    inequality.oecd.dld %>% pivot_wider(names_from=SUBJECT,
                                        values_from=Value,
                                        id_cols=c(LOCATION,TIME)) ->
                                                    inequality.oecd
    
    In [16]:
    inequality.oecd
    
        LOCATION TIME P50P10 P90P10 P90P50 GINI  S80S20 PALMA
    1   AUS      2012 2.2    4.4    2.0    0.326 5.5    1.24 
    2   AUS      2014 2.2    4.3    2.0    0.337 5.7    1.34 
    3   AUS      2016 2.1    4.3    2.1    0.330 5.5    1.26 
    4   AUT      2007 2.0    3.6    1.8    0.284 4.4    1.00 
    5   AUT      2008 1.9    3.4    1.8    0.281 4.3    1.00 
    6   AUT      2009 2.0    3.6    1.8    0.289 4.5    1.03 
    7   AUT      2010 1.9    3.5    1.8    0.280 4.3    0.98 
    8   AUT      2011 1.9    3.5    1.8    0.281 4.4    0.99 
    9   AUT      2012 2.0    3.5    1.8    0.275 4.2    0.96 
    10  AUT      2013 1.9    3.4    1.8    0.279 4.2    0.99 
    11  AUT      2014 1.9    3.4    1.7    0.274 4.1    0.96 
    12  AUT      2015 1.9    3.3    1.7    0.276 4.2    0.96 
    13  AUT      2016 2.0    3.5    1.8    0.284 4.5    1.00 
    14  BEL      2004 1.9    3.3    1.7    0.287 4.2    1.05 
    15  BEL      2005 2.0    3.4    1.7    0.277 4.1    0.99 
    16  BEL      2006 2.0    3.4    1.7    0.268 4.0    0.93 
    17  BEL      2007 1.9    3.3    1.7    0.277 4.1    0.98 
    18  BEL      2008 2.0    3.4    1.7    0.266 3.9    0.91 
    19  BEL      2009 2.0    3.4    1.7    0.272 4.1    0.95 
    20  BEL      2010 2.0    3.4    1.7    0.267 4.0    0.92 
    21  BEL      2011 2.0    3.4    1.7    0.270 4.0    0.93 
    22  BEL      2012 2.0    3.4    1.7    0.265 3.9    0.91 
    23  BEL      2013 2.0    3.4    1.7    0.265 3.9    0.90 
    24  BEL      2014 1.9    3.4    1.7    0.266 3.9    0.92 
    25  BEL      2015 2.0    3.4    1.7    0.268 4.0    0.93 
    26  BEL      2016 2.0    3.3    1.7    0.266 3.9    0.92 
    27  CAN      1976 2.3    4.2    1.8    0.303 5.1    1.09 
    28  CAN      1977 2.4    4.3    1.8    0.289 4.9    0.99 
    29  CAN      1978 2.3    4.2    1.8    0.294 4.9    1.04 
    30  CAN      1979 2.3    4.1    1.8    0.289 4.8    1.00 
    ⋮   ⋮        ⋮    ⋮      ⋮      ⋮      ⋮     ⋮      ⋮    
    357 LTU      2004 2.3     5.0   2.2    0.350  6.3   1.41 
    358 LTU      2005 2.3     5.2   2.2    0.352  6.4   1.43 
    359 LTU      2006 2.2     4.6   2.1    0.329  5.6   1.26 
    360 LTU      2007 2.3     4.6   2.0    0.338  5.8   1.33 
    361 LTU      2008 2.3     4.8   2.1    0.358  6.3   1.48 
    362 LTU      2009 2.4     5.5   2.3    0.366  7.2   1.53 
    363 LTU      2010 2.2     4.7   2.1    0.329  5.7   1.24 
    364 LTU      2011 2.1     4.5   2.1    0.322  5.3   1.20 
    365 LTU      2012 2.2     4.8   2.1    0.350  6.2   1.42 
    366 LTU      2013 2.2     4.9   2.3    0.352  6.1   1.42 
    367 LTU      2014 2.4     5.4   2.2    0.381  7.4   1.67 
    368 LTU      2015 2.5     5.5   2.3    0.372  7.1   1.59 
    369 LTU      2016 2.6     5.8   2.2    0.378  7.5   1.65 
    370 BRA      2006 3.1    10.4   3.4    0.510 15.3   3.39 
    371 BRA      2009 2.8     8.9   3.1    0.485 13.1   2.96 
    372 BRA      2011 3.1     9.3   3.0    0.483 13.9   2.95 
    373 BRA      2013 3.0     8.7   2.9    0.470 12.5   2.71 
    374 CHN      2011 7.8    23.0   2.9    0.514 28.3   3.86 
    375 CRI      2010 2.9     9.3   3.2    0.472 12.1   2.73 
    376 CRI      2011 3.0    10.2   3.4    0.480 13.1   2.87 
    377 CRI      2012 3.1     9.9   3.2    0.483 13.5   2.93 
    378 CRI      2013 3.2    10.8   3.4    0.494 14.3   3.09 
    379 CRI      2014 3.2    10.4   3.3    0.485 13.7   2.98 
    380 CRI      2015 3.1    10.3   3.3    0.479 13.4   2.85 
    381 CRI      2016 3.1    10.2   3.3    0.484 13.5   2.94 
    382 CRI      2017 3.0    10.0   3.4    0.480 13.0   2.88 
    383 CRI      2018 3.1    10.2   3.3    0.479 13.2   2.85 
    384 IND      2004 2.9     9.1   3.2    0.482 12.5   2.88 
    385 IND      2011 2.9     9.4   3.2    0.495 13.4   3.09 
    386 ZAF      2015 4.8    25.6   5.3    0.620 37.6   7.03 
    In [ ]:
    
    

Filling missing values with fill() and completing data by missing values with complete()

  • Script file: filling-and-completing-with-tidyr.R

    The script makes use of the following add-on packages:

  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    

    Filling missing values with fill()

    In [2]:
    library(tidyr)
    library(readr)
    
    In [3]:
    messy_data_str <- "
    country,  year,var1, var2
    Rodinia,  1297,  67, -3.0
    ,         1298,  69, -2.9
    ,         1299,  70, -2.8
    Pannotia, 1296,  73, -4.1
    ,         1297,  74, -3.9
    ,         1298,  75, -3.9
    Pangaea,  1296,  54, -1.2
    ,         1297,  53, -1.1
    ,         1298,  52, -1.0
    ,         1299,  51, -0.9
    "
    
    In [4]:
    messy_data_str %>% read_csv() -> messy_data
    messy_data
    
       country  year var1 var2
    1  Rodinia  1297 67   -3.0
    2  NA       1298 69   -2.9
    3  NA       1299 70   -2.8
    4  Pannotia 1296 73   -4.1
    5  NA       1297 74   -3.9
    6  NA       1298 75   -3.9
    7  Pangaea  1296 54   -1.2
    8  NA       1297 53   -1.1
    9  NA       1298 52   -1.0
    10 NA       1299 51   -0.9
    In [5]:
    messy_data %>% fill(country) -> filled_data
    filled_data
    
       country  year var1 var2
    1  Rodinia  1297 67   -3.0
    2  Rodinia  1298 69   -2.9
    3  Rodinia  1299 70   -2.8
    4  Pannotia 1296 73   -4.1
    5  Pannotia 1297 74   -3.9
    6  Pannotia 1298 75   -3.9
    7  Pangaea  1296 54   -1.2
    8  Pangaea  1297 53   -1.1
    9  Pangaea  1298 52   -1.0
    10 Pangaea  1299 51   -0.9

    Completing data by missing values with complete()

    In [6]:
    filled_data %>% complete(crossing(country,year))
    
       country  year var1 var2
    1  Pangaea  1296 54   -1.2
    2  Pangaea  1297 53   -1.1
    3  Pangaea  1298 52   -1.0
    4  Pangaea  1299 51   -0.9
    5  Pannotia 1296 73   -4.1
    6  Pannotia 1297 74   -3.9
    7  Pannotia 1298 75   -3.9
    8  Pannotia 1299 NA     NA
    9  Rodinia  1296 NA     NA
    10 Rodinia  1297 67   -3.0
    11 Rodinia  1298 69   -2.9
    12 Rodinia  1299 70   -2.8
    In [ ]: