Data from Complex Samples¶
Data from social science surveys often come from complex samples. In order to achieve efficient or at least accurate inference one may need to take into account the sampling design in the computation of sample summaries, e.g. by the application of sampling weights. The chapter shows how the survey package can help to take into account the sampling design in data management and data analysis.
Below is the supporting material for the various sections of the chapter.
Survey Design Objects¶
-
Constructing a survey design object from an example data that is included in the
survey
package:-
Script file:
survey-design-objects-NHANES.R
The script makes use of the survey package, which is available from
https://cran.r-project.org/package=survey
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
First we activate the survey package
In [2]:library(survey)
Next we make the
NHANES
data set available (which comes with the survey package)In [3]:data(nhanes)
Now we can take a look at the data set, it has 7 variables and 8581 observations
In [4]:head(nhanes)
In [5]:nrow(nhanes)
According to the documentation, the primary sampling units are identified by the variable
SDMVPSU
, the strata from which the PSUs are drawn are identified by the variableSDMVSTRA
, and the sample weights areWTMEC2YR
. The sum of the sample weights corresponds to the population size:In [6]:with(nhanes,sum(WTMEC2YR))
Here were create "ordinary" sampling weights as they are common in social science data sets:
In [7]:nhanes <- within(nhanes,{ smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes) })
We now create two survey design objects, one with the "population size" weights and one with the "sample size" weights:
In [8]:design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, data = nhanes)
First attempt fails: the cluster numbers are not nested in the strata We try again with
nest = TRUE
In [9]:design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, data = nhanes, nest = TRUE) design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~smplw, data = nhanes, nest = TRUE)
We first estimate the population proportions with high cholesterol
(HI_CHOL==1)
In [10]:svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
In [11]:svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
We secondly estimate the number of those with high cholesterol
(HI_CHOL==1)
in the populationIn [12]:svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
In [13]:svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
Obviously, we need the "population size" weights for an unbiased estimate of the population total, otherwise we just get a sample total.
Using the "population size" weights, we can reconstruct the sizes of the population strata:
In [14]:nhanes <- within(nhanes,{ strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum) }) design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, data = nhanes, nest = TRUE, fpc = ~ strata_size) design_pop
Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.
-
-
Constructing a survey design object from data of the 2016 American Election Study.
-
Script file:
survey-design-objects-ANES2016.R
Required data file:
anes_timeseries_2016.sav
which is available fromhttps://electionstudies.org/data-center/2016-time-series-study/
. You will need to register first in order to be able to download the data.The script makes use of the following add-on packages:
-
survey available from
https://cran.r-project.org/package=survey
-
memisc available from
https://cran.r-project.org/package=memisc
-
survey available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(memisc)
The the code makes used of the data file "
anes_timeseries_2016.sav
", which is not included in the supporting material. In order to obtain this data file (and run this notebook successufully), you need to download them from the ANES website for 2016 and upload them to the virtual machine that runs this notebook. To do this,- pull down the "File" menu item and select "Open"
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the ANES website.
Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).
In [3]:anes_2016_sav <- spss.file("anes_timeseries_2016.sav")
Loading a subset: Only pre-election waves and only face-to-face interviews
In [4]:anes_2016_pre_work_ds <- subset(anes_2016_sav, V160501 == 1, select=c( # According to docs, these are the # sample weights for the # face-to-face component pre_w_f2f = V160101f, # Face-to-face strata strat_f2f = V160201f, psu_f2f = V160202f, pre_voted12 = V161005, pre_recall12 = V161006, pre_voted = V161026, pre_vote = V161027, pre_intov = V161030, pre_voteint = V161031#, ))
In [5]:library(magrittr) # For the '%<>%' operator
In [6]:anes_2016_pre_work_ds %<>% within({ # Setting up recalled votes of 2012 # Since a "default" value for the remaining conditions # is used, we use 'check.xor = FALSE' to avoid warnings. recall12 <- cases( 'Did not vote' = 9 <- pre_voted12 == 2, 'Obama' = 1 <- pre_recall12 == 1, 'Romney' = 2 <- pre_recall12 == 2, 'Other' = 3 <- pre_recall12 == 5, 'Inap' = 99 <- TRUE, check.xor = FALSE ) # Early voters vote16_1 <- cases( 'Clinton' = 1 <- pre_voted == 1 & pre_vote == 1, 'Trump' = 2 <- pre_voted == 1 & pre_vote == 2, 'Other' = 3 <- pre_voted == 1 & pre_vote %in% 3:5, 'Inap' = 99 <- TRUE, check.xor = FALSE) # Vote intentions vote16 <- cases( 'Clinton' = 1 <- pre_intov == 1 & pre_voteint == 1, 'Trump' = 2 <- pre_intov == 1 & pre_voteint == 2, 'Other' = 3 <- pre_intov == 1 & pre_voteint %in% 3:6, 'Will not vote/Not registered' = 8 <- pre_intov %in% c(-1,2), 'Inap' = 99 <- TRUE, check.xor = FALSE) vote16[] <- ifelse(vote16 == 99 & vote16_1 != 99, vote16_1, vote16) measurement(pre_w_f2f) <- "ratio" })
In [7]:anes_2016_prevote <- as.data.frame(anes_2016_pre_work_ds) save(anes_2016_prevote,file="anes-2016-prevote.RData")
In [8]:#Unweighted crosstable xtabs(~ vote16 + recall12, data=anes_2016_prevote)
In [9]:library(survey)
In [10]:anes_2016_prevote_desgn <- svydesign(id = ~psu_f2f, strata = ~strat_f2f, weights = ~pre_w_f2f, data = anes_2016_prevote, nest = TRUE) anes_2016_prevote_desgn
In order to later make use of the survey design object, we save it into a file.
In [11]:save(anes_2016_prevote_desgn,file="anes-2016-prevote-desgn.RData")
We reduce the digits after dot ...
In [12]:ops <- options(digits=2) (tab <- svytable(~ vote16 + recall12, design = anes_2016_prevote_desgn))
and drop counts of non-valid responses before we compute percentages.
In [13]:percentages(vote16 ~ recall12, data=tab[-6,-5])
In [14]:options(ops) # To undo the change in the options.
Here we compute a F-test of independence with the table, which uses the Rao-Scott second-order correction with a Satterthwaite approximation of the denominator degrees of freedom is used.
In [15]:summary(tab)
The more conventional Pearson-Chi-squared test adjusted with a design-effect estimate is obtained by a slight modification.
In [16]:summary(tab, statistic="Chisq")
-
Survey Replication Weights¶
-
Using the replicate weights provided with CHIS data.
-
Script file:
survey-replication-weights-CHIS.R
Required data file:
adult.dta
which is available fromhttp://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx
. You will need to register first and declare that you do not redistribute the data in order to be able to download them.The script makes use of the survey package, which is available from
https://cran.r-project.org/package=survey
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(survey)
In [3]:library(foreign)
The file "
adult.dta
" is downloaded from http://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx and contains the 2005 wave of the California Health Interview Survey. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,- pull down the "File" menu item and select "Open"
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the website.
Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).
In [4]:adult_chis <- read.dta("adult.dta", warn.missing.labels=FALSE)
The data set contains 80 set of (raked) replicate weights. They are in the variables named "rakedw1" through "rakedw80". Raked sampling weights are in "raked0".
We obtain the column numbers of these variables, making use of our knowledge of the name pattern
In [5]:repw <- which(names(adult_chis) %in% paste0("rakedw",1:80))
To apecify replicate weights, we call the function
svrepdesgin
The first argument specifies the variables that will be used for data analysis. Theweights
argument specifies sampling weights, while the functionrepweights
specifies the replicate weights. Thedata=
argument specifies the data frame where the data all come from. Thecombined.weights=
argument is needed here, because the replicate weights were constructed from sampling weights and "pure" replicate weights. Since we do not know the way the replicate weights were constructed we have to specifytype="other"
.In [6]:adult_chis_rd <- svrepdesign(adult_chis[-repw], weights=~rakedw0, repweights=adult_chis[repw], data=adult_chis, combined.weights=TRUE, type="other", scale=1,rscales=1)
With
svymean()
we get the estimated proportions of the various categories of health insurance status in California 2005, along with standard errors, multiplying by 100, we get percentages.In [7]:100*svymean(~instyp_p, design=adult_chis_rd)
With
svytotal()
we obtain estimates of how many people have which health insurance status.In [8]:svytotal(~instyp_p, design=adult_chis_rd)
-
-
Creating replicate weights for the survey design of the 2016 American National Election Study.
-
Script file:
survey-replication-weights-ANES2016.R
Required data file:
anes-2016-prevote-desgn.RData
created by the previous scriptsurvey-design-objects-ANES2016.R
.The script makes use of the following add-on packages:
-
survey available from
https://cran.r-project.org/package=survey
-
memisc available from
https://cran.r-project.org/package=memisc
-
survey available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:load("anes-2016-prevote-desgn.RData") library(survey)
The 'automatic' type gives jackknife replicates
In [3]:anes_2016_prevote_jk <- as.svrepdesign(anes_2016_prevote_desgn, type="auto")
The number of replicates is determined by the number of clusters
In [4]:anes_2016_prevote_jk
Here we select the multistage rescaled bootstrap
In [5]:anes_2016_prevote_boo <- as.svrepdesign(anes_2016_prevote_desgn, type="mrbbootstrap")
In [6]:anes_2016_prevote_boo
By default the number of bootstrap replicates is 50, we can change it to 200
In [7]:anes_2016_prevote_boo <- as.svrepdesign(anes_2016_prevote_desgn, type="mrbbootstrap", replicates=200)
In [8]:anes_2016_prevote_boo
A function to compute the percentage of 2012 Democratic and Republican voters who vote for a candidate of the same party in 2016:
In [9]:StayerPerc <- function(weights,data){ tab <- xtabs(weights~vote16+recall12,data=data) # Remove 'inap' responses tab <- tab[-6,-5] # Column percentages ptab <- 100*prop.table(tab,2) # The diagonal are the percentages of 'stayers' # among the voters of 2012. # The first two elements of the diagonal are # the Democratic and Republican stayers. structure(diag(ptab)[1:2], names=c("Democratic", "Republican")) }
Estimates and replication based standard errors based on jackknife
In [10]:withReplicates(anes_2016_prevote_jk, StayerPerc)
In [11]:withReplicates(anes_2016_prevote_boo, StayerPerc)
-
Poststratification, Raking, and Calibration¶
-
Poststratification of the 2016 American National Election Study data.
-
Script file:
poststratification-ANES2016.R
Required data file:
anes-2016-prevote.RData
created by the previous scriptsurvey-design-objects-ANES2016.R
.The script makes use of the following add-on packages:
-
survey available from
https://cran.r-project.org/package=survey
-
memisc available from
https://cran.r-project.org/package=memisc
-
survey available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R load("anes-2016-prevote.RData")
There must not be any missing values in the stratifying variables.
In [2]:anes_2016_vprevote <- subset(anes_2016_prevote, vote16 != "Inap" & recall12 != "Inap" )
In order to make poststratification possible, we need to make sure that the levels of the stratification variables match the population information. Therefore we relabel the variables "recall12" and "vote16".
In [3]:library(memisc)
In [4]:anes_2016_vprevote <- within(anes_2016_vprevote,{ recall12 <- recall12[,drop=TRUE] vote16 <- vote16[,drop=TRUE] recall12 <- relabel(recall12,"Did not vote"="No vote") vote16 <- relabel(vote16, "Will not vote/Not registered"="No vote") }) save(anes_2016_vprevote,file="anes-2016-vprevote.RData")
Finally, we set up a survey design object:
In [5]:library(survey)
In [6]:anes_2016_vprevote_desgn <- svydesign(id = ~psu_f2f, strata = ~strat_f2f, weights = ~pre_w_f2f, data = anes_2016_vprevote, nest = TRUE) save(anes_2016_vprevote_desgn,file="anes-2016-vprevote-design.RData")
We collect the electoral results of 2012 to prepare poststratification.
In [7]:result.2012 = c(Obama = 65915795, Romney = 60933504, # Other candidates are combined Other = sum(c( Johson = 1275971, Stein = 469627, Others = 490510 )))
The number of non-voters is computed from the sum of the results and census data on the population in voting age.
In [8]:result.2012 <- c(result.2012, "No vote" = 235248000 - sum(result.2012))
In [9]:# Here we collect the results for 2016 result.2016 <- c(Clinton = 65853514, Trump = 62984828, Other = sum(c( Johnson = 4489341, Stein = 1457218, McMullin = 731991, Others = 1154084 )))
In [10]:result.2016 <- c(result.2016, "No vote" = 250056000 - sum(result.2016))
The poststratification function expects population data to be in the form of data frames:
In [11]:pop.vote16 <- data.frame( vote16=names(result.2016), Freq=result.2016)
In [12]:pop.recall12 <- data.frame( recall12=names(result.2012), Freq=result.2012/sum(result.2012)*sum(result.2016) )
In [13]:save(pop.recall12,pop.vote16,file="popl-results.RData")
We poststratify the sample design object by recalled vote in 2012
In [14]:anes_2016_prevote_desgn_post <- postStratify( anes_2016_vprevote_desgn,~recall12,population=pop.recall12)
We compare the estimated percentages of 2012 votes:
In [15]:100*svymean(~recall12,design=anes_2016_vprevote_desgn)
In [16]:100*svymean(~recall12,design=anes_2016_prevote_desgn_post)
As should be expected, post-stratification eliminates the uncertainty about 2012 votes. It also corrects for turnout overreporting.
We now compare the estimated percentages of 2016 votes
In [17]:100*svymean(~vote16,design=anes_2016_vprevote_desgn)
In [18]:100*svymean(~vote16,design=anes_2016_prevote_desgn_post)
The percentages of Clinton voters and Trump voters are closer after poststratification.
We save the poststratified data for later use.
In [19]:save(anes_2016_prevote_desgn_post,file="anes-2016-prevote-desgn-post.RData")
-
-
Raking the 2016 American National Election Study data.
-
Script file:
raking-ANES2016.R
Required data file:
anes-2016-vprevote-design.RData
created by the previous scriptpoststratification-ANES2016.R
.The script makes use of the survey package, which is available from
https://cran.r-project.org/package=survey
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R library(survey)
Here we rake the sample design object by recalled vote in 2012 and vote in 2016
In [2]:load("anes-2016-vprevote-design.RData") load("popl-results.RData") anes_2016_prevote_desgn_rake <- rake( anes_2016_vprevote_desgn,list(~recall12,~vote16), population=list(pop.recall12,pop.vote16), control=list(maxit=20))
Of course, the marginal distributions of the vote in 2012 and the vote in 2016 are no longer estimated, so that standard errors now vanish.
In [3]:100*svymean(~recall12,design=anes_2016_prevote_desgn_rake)
In [4]:100*svymean(~vote16,design=anes_2016_prevote_desgn_rake)
The actual percentage of non-voters in 2016 is obviously much higher than the one estimated from the sample, be it post-stratified or not.
We save the raked data for later use.
In [5]:save(anes_2016_prevote_desgn_rake,file="anes-2016-prevote-desgn-rake.RData")
-
-
Calibrating the 2016 American National Election Study data by linear regression.
-
Script file:
calibrating-ANES2016.R
Required data files:
anes-2016-vprevote-design.RData
andpopl-results.RData
created by the previous scriptpoststratifiation-ANES2016.R
.The script makes use of the survey package, which is available from
https://cran.r-project.org/package=survey
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R library(survey)
In [2]:load("anes-2016-vprevote-design.RData")
calibrate
expects the names of the calibration vectors to be like those of regression coefficents.In [3]:cal_names(~recall12+vote16,anes_2016_vprevote_desgn)
The following code defines a function creates a vector suitable for calibration from the data frames that
postStratify()
orrake()
useIn [4]:calib_counts <- function(formula,frames){ dframe2coef <- function(data){ fname <- names(data)[1] flevels <- as.character(data[[1]]) Freq <- data$Freq coefs <- c(sum(Freq),Freq[-1]) names(coefs) <- c("(Intercept)", paste0(fname,flevels[-1])) coefs } vars <- all.vars(formula) for(i in seq_along(vars)){ var_i <- vars[i] frame_i <- frames[[var_i]] coef_i <- dframe2coef(frame_i) if(i==1) res <- coef_i else res <- c(res,coef_i[-1]) } res }
We now apply this function to get the calibration criteria.
In [5]:load("popl-results.RData") calib_anes16 <- calib_counts(~recall12+vote16, list(recall12=pop.recall12, vote16=pop.vote16))
Finally we calibrate the ANES sample.
In [6]:anes_2016_prevote_desgn_calib <- calibrate( anes_2016_vprevote_desgn,~recall12+vote16, population=calib_anes16)
In [7]:100*svymean(~recall12,design=anes_2016_prevote_desgn_calib)
In [8]:100*svymean(~vote16,design=anes_2016_prevote_desgn_calib)
In [9]:save(anes_2016_prevote_desgn_calib,file="anes-2016-prevote-desgn-calib.RData")
-
-
Comparing poststratification, raking, and calibration with regards to the 2016 ANES data.
-
Script file:
comparing-ANES2016.R
Required data files:
anes-2016-vprevote-design.RData
,anes-2016-prevote-desgn-post.RData
,anes-2016-prevote-desgn-rake.RData
,anes-2016-prevote-desgn-calib.RData
, created by the previous scripts in this section.The script makes use of the following add-on packages:
-
survey available from
https://cran.r-project.org/package=survey
-
memisc available from
https://cran.r-project.org/package=memisc
-
survey available from
-
Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R library(survey)
In [2]:library(memisc)
In [3]:load("anes-2016-vprevote-design.RData") load("anes-2016-prevote-desgn-post.RData") load("anes-2016-prevote-desgn-rake.RData") load("anes-2016-prevote-desgn-calib.RData")
Let's compare the effect of poststratification and raking on the relation between variables.
First, we create a table from the data with valid responses about voting behaviour in 2012 and 2016.
In [4]:tab <- svytable(~ vote16 + recall12, design = anes_2016_vprevote_desgn) percentages(vote16 ~ recall12, data=tab)
Second, we create a table from the poststatified data.
In [5]:tab_post <- svytable(~ vote16 + recall12, design = anes_2016_prevote_desgn_post) percentages(vote16 ~ recall12, data=tab_post)
Third, we create a table from the raked data.
In [6]:tab_rak <- svytable(~ vote16 + recall12, design = anes_2016_prevote_desgn_rake) percentages(vote16 ~ recall12, data=tab_rak)
Fourth, we create a table from the calibrated data
In [7]:tab_calib <- svytable(~ vote16 + recall12, design = anes_2016_prevote_desgn_calib) percentages(vote16 ~ recall12, data=tab_calib)
Poststratification does not alter percentages that are conditional on the variable used for poststratification. Yet raking does change the conditional percentages.
To examine whether raking affects relations between recalled vote in 2012 and vote in 2016 we compute log-odds ratios:
In [8]:log.odds <- function(x) log((x[1,1]/x[1,2])/(x[2,1]/x[2,2]))
Log-odds ratios are a way to describe the relation between two dichotomous variables. Like correlations between continuous variables they are not affected by the marginal distribution.
In [9]:log.odds(tab)
In [10]:log.odds(tab_post)
In [11]:log.odds(tab_rak)
In [12]:log.odds(tab_calib)
Clearly, both poststratfication and raking leaves log-odds ratios unaffected. Calibration has an effect, but this appears to be minor (at least in the present case).
-