Data Tables and the “Tidyverse”¶
This chapter discusses the improvements and new features introduced by the package data.table and some notable packages in the “tidyverse” collection of packages. In particular, the packages data.table and dplyr are compared in terms of the features they provide and how computationally efficient they are. Also the features for “tidying” data using the tidyr package are discussed.
Below is the supporting material for the various sections of the chapter.
Data Tables¶
Creating data tables¶
-
Script file:
creating-data-tables.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the data.table package, which is available from
https://cran.r-project.org/package=data.table
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(data.table)
In [3]:UK <- data.table( Population = c(55619400,1885400,5424800,3125000), Area = c(50301,5460,30090,8023), GVA = c(28096,20000,24800,19900), country = c("England", "Northern Ireland", "Scotland", "Wales")) UK
In [4]:class(UK)
In [5]:DT <- data.table( x = rnorm(1000000), y = rnorm(1000000)) DT
In [6]:load("bes2010feelings.RData") setDT(bes2010feelings) class(bes2010feelings)
In [ ]:
Subsetting data tables¶
-
Script file:
subsetting-data-tables.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the data.table package, which is available from
https://cran.r-project.org/package=data.table
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE, width=200) # Create output as usual in R, but make it wider
In [2]:library(data.table)
In [3]:load("bes2010feelings.RData") setDT(bes2010feelings)
In [4]:sctl2010feelings <- bes2010feelings[region=="Scotland"] head(sctl2010feelings)
In [5]:bes2010feelings.srtd <- bes2010feelings[order(wave,region)]
In [6]:bes2010feelings.sub <- bes2010feelings[,.(flng.brown,wave,region)] names(bes2010feelings.sub)
In [7]:str(bes2010feelings.sub)
In [8]:head(bes2010feelings.sub)
In [9]:sctl2010feelings <- bes2010feelings[region=="Scotland", .(wave, flng.brown, flng.cameron, flng.clegg, flng.salmond)] str(sctl2010feelings)
In [10]:head(sctl2010feelings)
In [ ]:
Summarizing data tables¶
-
Script file:
summarizing-data-tables.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the data.table package, which is available from
https://cran.r-project.org/package=data.table
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:load("bes2010feelings.RData")
In [3]:library(data.table)
In [4]:setDT(bes2010feelings)
In [5]:Mean <- function(x) mean(x,na.rm=TRUE) bes2010feelings[,.(Brown=Mean(flng.brown), Cameron=Mean(flng.cameron), Clegg=Mean(flng.clegg), N=.N)]
In [6]:bes2010feelings[,.(Brown=Mean(flng.brown), Cameron=Mean(flng.cameron), Clegg=Mean(flng.clegg), N=.N), by=.(wave,region)]
In [ ]:
Modifying data tables¶
-
Script file:
modifying-data-tables.R
The script makes use of the data.table package, which is available from
https://cran.r-project.org/package=data.table
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(data.table) UK <- data.table( Population = c(55619400,1885400,5424800,3125000), Area = c(50301,5460,30090,8023), GVA = c(28096,20000,24800,19900), country = c("England", "Northern Ireland", "Scotland", "Wales"))
In [3]:UK1 <- UK UK[,Density := Population/Area]
In [4]:UK
In [5]:UK1
In [ ]:
The Tidyverse¶
Importing data using haven¶
-
Script file:
importing-data-with-haven.R
Required data files:
ConstituencyResults2010.sav
ConstituencyResults2010.por
ConstituencyResults2010.dta
ConstResults2010-stata-new.dta
These Data files used to be available from
http://www.hks.harvard.edu/fs/pnorris/Data/Data.htm
. Updated data are now available fromhttps://www.pippanorris.com/data
.The script makes use of the haven package, which is available from
https://cran.r-project.org/package=haven
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(haven)
In [3]:# Reading in an SPSS "system" file ConstRes2010 <- read_sav("ConstituencyResults2010.sav") head(ConstRes2010)
In [4]:# Reading in an SPSS "portable" file ConstRes2010 <- read_por("ConstituencyResults2010.por") # This works with 'foreign', but not with 'haven' head(ConstRes2010)
In [5]:# Reading in a Stata file ConstRes2010 <- read_dta("ConstituencyResults2010.dta") head(ConstRes2010)
In [6]:# Reading in a new format Stata file ConstRes2010 <- read_dta("ConstResults2010-stata-new.dta") head(ConstRes2010)
In [ ]:
Subsetting data with dplyr¶
-
Script file:
subsetting-data-with-dplyr.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the dplyr package, which is available from
https://cran.r-project.org/package=dplyr
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:load("bes2010feelings.RData")
In [3]:library(dplyr)
In [4]:bes2010feelings.sub <- bes2010feelings %>% filter(region == "Scotland") %>% select(wave, flng.brown, flng.cameron, flng.clegg, flng.salmond)
In [5]:head(bes2010feelings.sub)
In [ ]:
Summarizing data with dplyr¶
-
Script file:
summarizing-data-with-dplyr.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the dplyr package, which is available from
https://cran.r-project.org/package=dplyr
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(dplyr)
In [3]:load("bes2010feelings.RData") Mean <- function(x,...) mean(x,na.rm=TRUE,...)
In [4]:bes2010feelings %>% group_by(wave,region) %>% summarize(Brown=Mean(flng.brown), Cameron=Mean(flng.cameron), Clegg=Mean(flng.clegg), N=n())
In [ ]:
Modifying data with dplyr¶
-
Script file:
modifying-data-with-dplyr.R
Required data file:
bes2010feelings.RData
(This data set is prepared from the original available athttps://www.britishelectionstudy.com/data-object/2010-bes-cross-section/
by removing identifying information and scrambling the data)The script makes use of the dplyr package, which is available from
https://cran.r-project.org/package=dplyr
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(dplyr)
In [3]:UK <- data.frame( Population = c(55619400,1885400,5424800,3125000), Area = c(50301,5460,30090,8023), GVA = c(28096,20000,24800,19900), country = c("England", "Northern Ireland", "Scotland", "Wales"))
In [4]:UK %>% mutate(Density = Population/Area)
Equivalent code using 'base' R:
In [6]:within(UK, Density <- Population/Area)
In [ ]:
A Comparison between the Approaches¶
Creating group summaries¶
-
Script file:
comparison-group-summaries.R
The script makes use of the following add-on packages:
-
data.table available from
https://cran.r-project.org/package=data.table
-
dplyr available from
https://cran.r-project.org/package=dplyr
-
memisc available from
https://cran.r-project.org/package=memisc
-
rbenchmark available from
https://cran.r-project.org/package=rbenchmark
-
data.table available from
- No interactive notebook is provided, due to the computational demands of the benchmarking script.
Modifying data within groups¶
-
Script file:
comparison-modifying-data-within-groups.R
The script makes use of the following add-on packages:
-
data.table available from
https://cran.r-project.org/package=data.table
-
dplyr available from
https://cran.r-project.org/package=dplyr
-
memisc available from
https://cran.r-project.org/package=memisc
-
rbenchmark available from
https://cran.r-project.org/package=rbenchmark
-
data.table available from
- No interactive notebook is provided, due to the computational demands of the benchmarking script.
Comparison summary¶
-
Script file:
comparison-summary-tables.R
The following data sets are required:
The script makes use of the following add-on packages:
-
data.table available from
https://cran.r-project.org/package=data.table
-
dplyr available from
https://cran.r-project.org/package=dplyr
-
memisc available from
https://cran.r-project.org/package=memisc
-
rbenchmark available from
https://cran.r-project.org/package=rbenchmark
-
data.table available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=TRUE) # Create formatted output
In [2]:library(data.table)
In [3]:library(dplyr)
In [4]:library(memisc)
In [5]:library(rbenchmark)
In [6]:bench_matrix <- function(x){ rn <- x$test x <- as.matrix(x[,-1]) rownames(x) <- rn x }
In [7]:load("grouped-summary-benchmark.RData")
In [8]:grouped_summary_benchmark_1 <- bench_matrix(grouped_summary_benchmark_1) grouped_summary_benchmark_2 <- bench_matrix(grouped_summary_benchmark_2)
In [9]:grouped_summary_benchmark <- memisc::collect( "`Big data'" = grouped_summary_benchmark_1, "`Survey data'" = grouped_summary_benchmark_2) grouped_summary_benchmark <- grouped_summary_benchmark[-5,,] colnames(grouped_summary_benchmark) <- c("abs.","rel.") names(dimnames(grouped_summary_benchmark)) <- c("Method","Timing","Data")
In [10]:ftable(grouped_summary_benchmark,col.vars=3:2) %>% memisc::show_html(digits=2)
In [11]:load("grouped-modification-benchmark.RData") grouped_modification_benchmark_1 <- bench_matrix(grouped_modification_benchmark_1) grouped_modification_benchmark_2 <- bench_matrix(grouped_modification_benchmark_2) grouped_modification_benchmark <- collect( "`Big data'" = grouped_modification_benchmark_1, "`Survey data'" = grouped_modification_benchmark_2) colnames(grouped_modification_benchmark) <- c("abs.","rel.") names(dimnames(grouped_modification_benchmark)) <- c("Method","Timing","Data")
In [12]:ftable(grouped_modification_benchmark,col.vars=3:2) %>% memisc::show_html(digits=2)
In [ ]:
Tidying Data Using the tidyr Package¶
Reshaping data with gather()
and pivot_longer()
into a long arrangement¶
-
Script file:
reshaping-to-long-with-tidyr.R
Required data file:
gini-oecd.tsv
(scraped fromhttps://data.oecd.org
at 9 Dec 2019)The script makes use of the following add-on packages:
-
tidyr available from
https://cran.r-project.org/package=tidyr
-
readr available from
https://cran.r-project.org/package=readr
-
tidyr available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE, width=120) # Create output as usual in R
In [2]:# Inspecting the file substr(readLines("gini-oecd.tsv",n=5),start=1,stop=50)
In [3]:library(readr)
In [4]:gini.oecd <- read_tsv("gini-oecd.tsv", skip=1)
In [5]:gini.oecd
In [6]:library(tidyr)
In [7]:gini.oecd %>% gather(`2007`,`2008`,`2009`,`2010`,`2011`,`2012`, `2013`,`2014`,`2015`,`2016`,`2017`, key="year",value="gini") -> gini.oecd.long gini.oecd.long
In [8]:gini.oecd %>% gather(-Location, key="year",value="gini") -> gini.oecd.long gini.oecd.long
In [9]:gini.oecd %>% pivot_longer(-Location, names_to="year", values_to="gini") -> gini.oecd.long gini.oecd.long
In [ ]:
Reshaping data with spread()
and pivot_wider()
into a wides arrangement¶
-
Script file:
reshaping-to-wide-with-tidyr.R
Required data file:
inequality-oecd-downloaded.csv
(manually downloaded fromhttps://data.oecd.org
at 9 Dec 2019)The script makes use of the following add-on packages:
-
tidyr available from
https://cran.r-project.org/package=tidyr
-
readr available from
https://cran.r-project.org/package=readr
-
tidyr available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:substr(readLines("inequality-oecd-downloaded.csv",n=5), start=1,stop=40)
In [3]:library(readr)
In [4]:inequality.oecd.dld <- read_csv("inequality-oecd-downloaded.csv")
In [5]:inequality.oecd.dld
In [7]:library(tidyr)
In [8]:inequality.oecd.dld %>% spread(key="SUBJECT",value="Value") -> inequality.oecd
In [9]:inequality.oecd[-c(2,4,6)]
In [11]:library(dplyr)
In [12]:inequality.oecd.sub <- select(inequality.oecd.dld, LOCATION,SUBJECT,TIME,Value) inequality.oecd.sub
In [13]:inequality.oecd.sub %>% spread(key=SUBJECT, value=Value) -> inequality.oecd
In [14]:inequality.oecd
In [15]:inequality.oecd.dld %>% pivot_wider(names_from=SUBJECT, values_from=Value, id_cols=c(LOCATION,TIME)) -> inequality.oecd
In [16]:inequality.oecd
In [ ]:
Filling missing values with fill()
and completing data by missing values with complete()
¶
-
Script file:
filling-and-completing-with-tidyr.R
The script makes use of the following add-on packages:
-
tidyr available from
https://cran.r-project.org/package=tidyr
-
readr available from
https://cran.r-project.org/package=readr
-
tidyr available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
Filling missing values with
fill()
In [2]:library(tidyr) library(readr)
In [3]:messy_data_str <- " country, year,var1, var2 Rodinia, 1297, 67, -3.0 , 1298, 69, -2.9 , 1299, 70, -2.8 Pannotia, 1296, 73, -4.1 , 1297, 74, -3.9 , 1298, 75, -3.9 Pangaea, 1296, 54, -1.2 , 1297, 53, -1.1 , 1298, 52, -1.0 , 1299, 51, -0.9 "
In [4]:messy_data_str %>% read_csv() -> messy_data messy_data
In [5]:messy_data %>% fill(country) -> filled_data filled_data
Completing data by missing values with
complete()
In [6]:filled_data %>% complete(crossing(country,year))
In [ ]: