Data from Complex Samples

Data from social science surveys often come from complex samples. In order to achieve efficient or at least accurate inference one may need to take into account the sampling design in the computation of sample summaries, e.g. by the application of sampling weights. The chapter shows how the survey package can help to take into account the sampling design in data management and data analysis.

Below is the supporting material for the various sections of the chapter.

Survey Design Objects

  • Constructing a survey design object from an example data that is included in the survey package:

    • Script file: survey-design-objects-NHANES.R

      The script makes use of the survey package, which is available from https://cran.r-project.org/package=survey

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      First we activate the survey package

      In [2]:
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      

      Next we make the NHANES data set available (which comes with the survey package)

      In [3]:
      data(nhanes)
      

      Now we can take a look at the data set, it has 7 variables and 8581 observations

      In [4]:
      head(nhanes)
      
        SDMVPSU SDMVSTRA WTMEC2YR HI_CHOL race agecat   RIAGENDR
      1 1       83       81528.77 0       2    (19,39]  1       
      2 1       84       14509.28 0       3    (0,19]   1       
      3 2       86       12041.64 0       3    (0,19]   1       
      4 2       75       21000.34 0       3    (59,Inf] 2       
      5 1       88       22633.58 0       1    (19,39]  1       
      6 2       85       74112.49 1       2    (39,59]  2       
      In [5]:
      nrow(nhanes)
      
      [1] 8591

      According to the documentation, the primary sampling units are identified by the variable SDMVPSU, the strata from which the PSUs are drawn are identified by the variable SDMVSTRA, and the sample weights are WTMEC2YR. The sum of the sample weights corresponds to the population size:

      In [6]:
      with(nhanes,sum(WTMEC2YR))
      
      [1] 276536446

      Here were create "ordinary" sampling weights as they are common in social science data sets:

      In [7]:
      nhanes <- within(nhanes,{
          smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes)
      })
      

      We now create two survey design objects, one with the "population size" weights and one with the "sample size" weights:

      In [8]:
      design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                                  weights = ~WTMEC2YR, data = nhanes)
      
      Error in svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, : Clusters not nested in strata at top level; you may want nest=TRUE.
      Traceback:
      
      1. svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
       .     data = nhanes)
      2. svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
       .     data = nhanes)
      3. stop("Clusters not nested in strata at top level; you may want nest=TRUE.")

      First attempt fails: the cluster numbers are not nested in the strata We try again with nest = TRUE

      In [9]:
      design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                              weights = ~WTMEC2YR, data = nhanes,
                              nest = TRUE)
      design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                               weights = ~smplw, data = nhanes,
                               nest = TRUE)
      

      We first estimate the population proportions with high cholesterol (HI_CHOL==1)

      In [10]:
      svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
      
                 mean     SE
      HI_CHOL 0.11214 0.0054
      In [11]:
      svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
      
                 mean     SE
      HI_CHOL 0.11214 0.0054

      We secondly estimate the number of those with high cholesterol (HI_CHOL==1) in the population

      In [12]:
      svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
      
                 total      SE
      HI_CHOL 28635245 2020711
      In [13]:
      svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
      
               total     SE
      HI_CHOL 889.59 62.776

      Obviously, we need the "population size" weights for an unbiased estimate of the population total, otherwise we just get a sample total.

      Using the "population size" weights, we can reconstruct the sizes of the population strata:

      In [14]:
      nhanes <- within(nhanes,{
          strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum)
      })
      design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                              weights = ~WTMEC2YR, data = nhanes,
                              nest = TRUE, fpc = ~ strata_size)
      design_pop
      
      Stratified 1 - level Cluster Sampling design
      With (31) clusters.
      svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
          data = nhanes, nest = TRUE, fpc = ~strata_size)

      Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.

  • Constructing a survey design object from data of the 2016 American Election Study.

    • Script file: survey-design-objects-ANES2016.R

      Required data file: anes_timeseries_2016.sav which is available from https://electionstudies.org/data-center/2016-time-series-study/. You will need to register first in order to be able to download the data.

      The script makes use of the following add-on packages:

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      library(memisc)
      
      Loading required package: lattice
      
      Loading required package: MASS
      
      
      Attaching package: ‘memisc’
      
      
      The following objects are masked from ‘package:stats’:
      
          contr.sum, contr.treatment, contrasts
      
      
      The following object is masked from ‘package:base’:
      
          as.array
      
      
      

      The the code makes used of the data file "anes_timeseries_2016.sav", which is not included in the supporting material. In order to obtain this data file (and run this notebook successufully), you need to download them from the ANES website for 2016 and upload them to the virtual machine that runs this notebook. To do this,

      1. pull down the "File" menu item and select "Open"
      2. An overview of the folder that contains the notebook opens.
      3. The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the ANES website.

      Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).

      In [3]:
      anes_2016_sav <- spss.file("anes_timeseries_2016.sav")
      
      File character set is 'UTF-8'.
      
      Converting character set to the local 'utf-8'.
      
      

      Loading a subset: Only pre-election waves and only face-to-face interviews

      In [4]:
      anes_2016_pre_work_ds <- subset(anes_2016_sav,
                                      V160501 == 1,
                                      select=c(
                                      # According to docs, these are the
                                      # sample weights for the
                                      # face-to-face component
                                      pre_w_f2f     = V160101f,
                                      # Face-to-face strata    
                                      strat_f2f     = V160201f,
                                      psu_f2f       = V160202f,
                                      pre_voted12   = V161005,
                                      pre_recall12  = V161006,
                                      pre_voted     = V161026,
                                      pre_vote      = V161027,
                                      pre_intov     = V161030,
                                      pre_voteint   = V161031#,
                                 ))
      
      In [5]:
      library(magrittr) # For the '%<>%' operator
      
      In [6]:
      anes_2016_pre_work_ds %<>% within({
          # Setting up recalled votes of 2012
          # Since a "default" value for the remaining conditions
          # is used, we use 'check.xor = FALSE' to avoid warnings.
          recall12 <- cases(
              'Did not vote' = 9 <- pre_voted12  == 2,
              'Obama'        = 1 <- pre_recall12 == 1,
              'Romney'       = 2 <- pre_recall12 == 2,
              'Other'        = 3 <- pre_recall12 == 5,
              'Inap'         = 99 <- TRUE, check.xor = FALSE
          )
          # Early voters 
          vote16_1 <- cases(
              'Clinton' = 1 <- pre_voted == 1 & pre_vote == 1,
              'Trump'   = 2 <- pre_voted == 1 & pre_vote == 2,
              'Other'   = 3 <- pre_voted == 1 & pre_vote %in% 3:5,
              'Inap'    = 99 <- TRUE, check.xor = FALSE)
          # Vote intentions
          vote16 <- cases(
              'Clinton' = 1 <- pre_intov == 1 & pre_voteint == 1,
              'Trump'   = 2 <- pre_intov == 1 & pre_voteint == 2,
              'Other'   = 3 <- pre_intov == 1 & pre_voteint %in% 3:6,
              'Will not vote/Not registered' = 8 <- pre_intov %in% c(-1,2),
              'Inap'    = 99 <- TRUE, check.xor = FALSE)
          vote16[] <- ifelse(vote16 == 99 & vote16_1 != 99,
                             vote16_1,
                             vote16)
          measurement(pre_w_f2f) <- "ratio"
      })
      
      In [7]:
      anes_2016_prevote <- as.data.frame(anes_2016_pre_work_ds)
      save(anes_2016_prevote,file="anes-2016-prevote.RData")
      
      In [8]:
      #Unweighted crosstable
      xtabs(~ vote16 + recall12,
            data=anes_2016_prevote)
      
                                    recall12
      vote16                         Obama Romney Other Did not vote Inap
        Clinton                        326     12     2           59    6
        Trump                           29    242     5           70    8
        Other                           30     28     7           16    4
        Will not vote/Not registered    28     41     0          139    5
        Inap                            46     27     2           31   17
      In [9]:
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      
      In [10]:
      anes_2016_prevote_desgn <- svydesign(id = ~psu_f2f,
                                           strata = ~strat_f2f,
                                           weights = ~pre_w_f2f,
                                           data = anes_2016_prevote,
                                           nest = TRUE)
      anes_2016_prevote_desgn
      
      Stratified 1 - level Cluster Sampling design (with replacement)
      With (65) clusters.
      svydesign(id = ~psu_f2f, strata = ~strat_f2f, weights = ~pre_w_f2f, 
          data = anes_2016_prevote, nest = TRUE)

      In order to later make use of the survey design object, we save it into a file.

      In [11]:
      save(anes_2016_prevote_desgn,file="anes-2016-prevote-desgn.RData")
      

      We reduce the digits after dot ...

      In [12]:
      ops <- options(digits=2)
      (tab <- svytable(~ vote16 + recall12,
                       design = anes_2016_prevote_desgn))
      
                                    recall12
      vote16                         Obama Romney Other Did not vote  Inap
        Clinton                      316.0   11.7   1.1         69.9   8.6
        Trump                         35.9  228.8   4.2         73.0   5.1
        Other                         34.1   24.4   6.6         13.9   5.3
        Will not vote/Not registered  28.8   41.4   0.0        150.2   4.3
        Inap                          44.8   25.0   1.9         28.3  16.0

      and drop counts of non-valid responses before we compute percentages.

      In [13]:
      percentages(vote16 ~ recall12, data=tab[-6,-5])
      
                                    recall12
      vote16                         Obama Romney Other Did not vote
        Clinton                       68.8    3.5   8.0         20.8
        Trump                          7.8   69.1  30.6         21.8
        Other                          7.4    7.4  47.6          4.1
        Will not vote/Not registered   6.3   12.5   0.0         44.8
        Inap                           9.7    7.5  13.9          8.4
      In [14]:
      options(ops) # To undo the change in the options.
      

      Here we compute a F-test of independence with the table, which uses the Rao-Scott second-order correction with a Satterthwaite approximation of the denominator degrees of freedom is used.

      In [15]:
      summary(tab)
      
                                    recall12
      vote16                         Obama Romney Other Did not vote Inap
        Clinton                        316     12     1           70    9
        Trump                           36    229     4           73    5
        Other                           34     24     7           14    5
        Will not vote/Not registered    29     41     0          150    4
        Inap                            45     25     2           28   16
      
      	Pearson's X^2: Rao & Scott adjustment
      
      data:  svychisq(~vote16 + recall12, design = anes_2016_prevote_desgn,     statistic = "F")
      F = 29.235, ndf = 9.3968, ddf = 310.0952, p-value < 2.2e-16
      

      The more conventional Pearson-Chi-squared test adjusted with a design-effect estimate is obtained by a slight modification.

      In [16]:
      summary(tab, statistic="Chisq")
      
                                    recall12
      vote16                         Obama Romney Other Did not vote Inap
        Clinton                        316     12     1           70    9
        Trump                           36    229     4           73    5
        Other                           34     24     7           14    5
        Will not vote/Not registered    29     41     0          150    4
        Inap                            45     25     2           28   16
      
      	Pearson's X^2: Rao & Scott adjustment
      
      data:  svychisq(~vote16 + recall12, design = anes_2016_prevote_desgn,     statistic = "Chisq")
      X-squared = 778.41, df = 16, p-value < 2.2e-16
      

Survey Replication Weights

  • Using the replicate weights provided with CHIS data.

    • Script file: survey-replication-weights-CHIS.R

      Required data file: adult.dta which is available from http://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx. You will need to register first and declare that you do not redistribute the data in order to be able to download them.

      The script makes use of the survey package, which is available from https://cran.r-project.org/package=survey

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      
      In [3]:
      library(foreign)
      

      The file "adult.dta" is downloaded from http://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx and contains the 2005 wave of the California Health Interview Survey. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,

      1. pull down the "File" menu item and select "Open"
      2. An overview of the folder that contains the notebook opens.
      3. The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the website.

      Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).

      In [4]:
      adult_chis <- read.dta("adult.dta",
                             warn.missing.labels=FALSE)
      

      The data set contains 80 set of (raked) replicate weights. They are in the variables named "rakedw1" through "rakedw80". Raked sampling weights are in "raked0".

      We obtain the column numbers of these variables, making use of our knowledge of the name pattern

      In [5]:
      repw <- which(names(adult_chis) %in% paste0("rakedw",1:80))
      

      To apecify replicate weights, we call the function svrepdesgin The first argument specifies the variables that will be used for data analysis. The weights argument specifies sampling weights, while the function repweights specifies the replicate weights. The data= argument specifies the data frame where the data all come from. The combined.weights= argument is needed here, because the replicate weights were constructed from sampling weights and "pure" replicate weights. Since we do not know the way the replicate weights were constructed we have to specify type="other".

      In [6]:
      adult_chis_rd <- svrepdesign(adult_chis[-repw],
                                   weights=~rakedw0,
                                   repweights=adult_chis[repw],
                                   data=adult_chis,
                                   combined.weights=TRUE,
                                   type="other",
                                   scale=1,rscales=1)
      

      With svymean() we get the estimated proportions of the various categories of health insurance status in California 2005, along with standard errors, multiplying by 100, we get percentages.

      In [7]:
      100*svymean(~instyp_p, design=adult_chis_rd)
      
                                          mean     SE
      instyp_pUNINSURED                16.1204 0.0027
      instyp_pMEDICARE & MEDICAID       4.0544 0.0011
      instyp_pMEDICARE & OTHERS         9.5286 0.0010
      instyp_pMEDICARE ONLY             2.0639 0.0007
      instyp_pMEDICAID                  8.5105 0.0018
      instyp_pEMPLOYMENT-BASED         51.9316 0.0030
      instyp_pPRIVATELY PURCHASED       6.0567 0.0017
      instyp_pHEALTHY FAM/OTHER PUBLIC  1.7339 0.0011

      With svytotal() we obtain estimates of how many people have which health insurance status.

      In [8]:
      svytotal(~instyp_p, design=adult_chis_rd)
      
                                          total    SE
      instyp_pUNINSURED                 4253792 72494
      instyp_pMEDICARE & MEDICAID       1069871 28764
      instyp_pMEDICARE & OTHERS         2514367 25892
      instyp_pMEDICARE ONLY              544612 19018
      instyp_pMEDICAID                  2245709 48474
      instyp_pEMPLOYMENT-BASED         13703511 79679
      instyp_pPRIVATELY PURCHASED       1598225 45184
      instyp_pHEALTHY FAM/OTHER PUBLIC   457527 27854
  • Creating replicate weights for the survey design of the 2016 American National Election Study.

    • Script file: survey-replication-weights-ANES2016.R

      Required data file: anes-2016-prevote-desgn.RData created by the previous script survey-design-objects-ANES2016.R.

      The script makes use of the following add-on packages:

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      load("anes-2016-prevote-desgn.RData")
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      

      The 'automatic' type gives jackknife replicates

      In [3]:
      anes_2016_prevote_jk <- as.svrepdesign(anes_2016_prevote_desgn,
                                             type="auto")
      

      The number of replicates is determined by the number of clusters

      In [4]:
      anes_2016_prevote_jk
      
      Call: as.svrepdesign(anes_2016_prevote_desgn, type = "auto")
      Stratified cluster jackknife (JKn) with 65 replicates.

      Here we select the multistage rescaled bootstrap

      In [5]:
      anes_2016_prevote_boo <- as.svrepdesign(anes_2016_prevote_desgn,
                                              type="mrbbootstrap")
      
      Warning message in mrbweights(design$cluster, design$strata, design$fpc, ...):
      “Design is sampled with replacement: only first stage used”
      
      In [6]:
      anes_2016_prevote_boo
      
      Call: as.svrepdesign(anes_2016_prevote_desgn, type = "mrbbootstrap")
      Multistage rescaled bootstrap with 50 replicates.

      By default the number of bootstrap replicates is 50, we can change it to 200

      In [7]:
      anes_2016_prevote_boo <- as.svrepdesign(anes_2016_prevote_desgn,
                                              type="mrbbootstrap",
                                              replicates=200)
      
      Warning message in mrbweights(design$cluster, design$strata, design$fpc, ...):
      “Design is sampled with replacement: only first stage used”
      
      In [8]:
      anes_2016_prevote_boo
      
      Call: as.svrepdesign(anes_2016_prevote_desgn, type = "mrbbootstrap", 
          replicates = 200)
      Multistage rescaled bootstrap with 200 replicates.

      A function to compute the percentage of 2012 Democratic and Republican voters who vote for a candidate of the same party in 2016:

      In [9]:
      StayerPerc <- function(weights,data){
          tab <- xtabs(weights~vote16+recall12,data=data)
          # Remove 'inap' responses
          tab <- tab[-6,-5]
          # Column percentages
          ptab <- 100*prop.table(tab,2)
          # The diagonal are the percentages of 'stayers'
          # among the voters of 2012.
          # The first two elements of the diagonal are
          # the Democratic and Republican stayers.
          structure(diag(ptab)[1:2],
                    names=c("Democratic",
                            "Republican"))
      }
      

      Estimates and replication based standard errors based on jackknife

      In [10]:
      withReplicates(anes_2016_prevote_jk,
                      StayerPerc)
      
                  theta     SE
      Democratic 68.765 3.8462
      Republican 69.058 3.6052
      In [11]:
      withReplicates(anes_2016_prevote_boo,
                      StayerPerc)
      
                  theta     SE
      Democratic 68.765 3.8286
      Republican 69.058 3.6551

Poststratification, Raking, and Calibration

  • Poststratification of the 2016 American National Election Study data.

    • Script file: poststratification-ANES2016.R

      Required data file: anes-2016-prevote.RData created by the previous script survey-design-objects-ANES2016.R.

      The script makes use of the following add-on packages:

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      load("anes-2016-prevote.RData")
      

      There must not be any missing values in the stratifying variables.

      In [2]:
      anes_2016_vprevote <- subset(anes_2016_prevote,
                                          vote16 != "Inap" &
                                          recall12 != "Inap"
                                          )
      

      In order to make poststratification possible, we need to make sure that the levels of the stratification variables match the population information. Therefore we relabel the variables "recall12" and "vote16".

      In [3]:
      library(memisc)
      
      Loading required package: lattice
      
      Loading required package: MASS
      
      
      Attaching package: ‘memisc’
      
      
      The following objects are masked from ‘package:stats’:
      
          contr.sum, contr.treatment, contrasts
      
      
      The following object is masked from ‘package:base’:
      
          as.array
      
      
      
      In [4]:
      anes_2016_vprevote <- within(anes_2016_vprevote,{
          recall12 <- recall12[,drop=TRUE]
          vote16 <- vote16[,drop=TRUE]
          recall12 <- relabel(recall12,"Did not vote"="No vote")
          vote16 <- relabel(vote16,
                            "Will not vote/Not registered"="No vote")
      })
      save(anes_2016_vprevote,file="anes-2016-vprevote.RData")
      

      Finally, we set up a survey design object:

      In [5]:
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      
      In [6]:
      anes_2016_vprevote_desgn <- svydesign(id = ~psu_f2f,
                                             strata = ~strat_f2f,
                                             weights = ~pre_w_f2f,
                                             data = anes_2016_vprevote,
                                             nest = TRUE)
      save(anes_2016_vprevote_desgn,file="anes-2016-vprevote-design.RData")
      

      We collect the electoral results of 2012 to prepare poststratification.

      In [7]:
      result.2012 = c(Obama  = 65915795,
                      Romney = 60933504,
                      # Other candidates are combined
                      Other = sum(c(
                          Johson = 1275971,
                          Stein  =  469627,
                          Others =  490510
                      )))
      

      The number of non-voters is computed from the sum of the results and census data on the population in voting age.

      In [8]:
      result.2012 <- c(result.2012,
                       "No vote" = 235248000 - sum(result.2012))
      
      In [9]:
      # Here we collect the results for 2016
      result.2016 <- c(Clinton = 65853514,
                       Trump   = 62984828,
                       Other   = sum(c(
                          Johnson  = 4489341, 
                          Stein    = 1457218,
                          McMullin =  731991,
                          Others   = 1154084 
                       )))
      
      In [10]:
      result.2016 <- c(result.2016,
                       "No vote" = 250056000 - sum(result.2016))
      

      The poststratification function expects population data to be in the form of data frames:

      In [11]:
      pop.vote16 <- data.frame(
          vote16=names(result.2016),
          Freq=result.2016)
      
      In [12]:
      pop.recall12 <- data.frame(
          recall12=names(result.2012),
          Freq=result.2012/sum(result.2012)*sum(result.2016)
      )
      
      In [13]:
      save(pop.recall12,pop.vote16,file="popl-results.RData")
      

      We poststratify the sample design object by recalled vote in 2012

      In [14]:
      anes_2016_prevote_desgn_post <- postStratify(
          anes_2016_vprevote_desgn,~recall12,population=pop.recall12)
      

      We compare the estimated percentages of 2012 votes:

      In [15]:
      100*svymean(~recall12,design=anes_2016_vprevote_desgn)
      
                         mean     SE
      recall12Obama   39.8844 0.0233
      recall12Romney  29.4551 0.0198
      recall12Other    1.1429 0.0035
      recall12No vote 29.5176 0.0222
      In [16]:
      100*svymean(~recall12,design=anes_2016_prevote_desgn_post)
      
                          mean SE
      recall12Obama   28.01970  0
      recall12Romney  25.90182  0
      recall12Other    0.95053  0
      recall12No vote 45.12795  0

      As should be expected, post-stratification eliminates the uncertainty about 2012 votes. It also corrects for turnout overreporting.

      We now compare the estimated percentages of 2016 votes

      In [17]:
      100*svymean(~vote16,design=anes_2016_vprevote_desgn)
      
                       mean     SE
      vote16Clinton 38.3334 0.0291
      vote16Trump   32.8720 0.0222
      vote16Other    7.5954 0.0104
      vote16No vote 21.1992 0.0200
      In [18]:
      100*svymean(~vote16,design=anes_2016_prevote_desgn_post)
      
                       mean     SE
      vote16Clinton 32.6932 0.0222
      vote16Trump   32.8370 0.0152
      vote16Other    6.9345 0.0101
      vote16No vote 27.5352 0.0228

      The percentages of Clinton voters and Trump voters are closer after poststratification.

      We save the poststratified data for later use.

      In [19]:
      save(anes_2016_prevote_desgn_post,file="anes-2016-prevote-desgn-post.RData")
      
  • Raking the 2016 American National Election Study data.

    • Script file: raking-ANES2016.R

      Required data file: anes-2016-vprevote-design.RData created by the previous script poststratification-ANES2016.R.

      The script makes use of the survey package, which is available from https://cran.r-project.org/package=survey

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      

      Here we rake the sample design object by recalled vote in 2012 and vote in 2016

      In [2]:
      load("anes-2016-vprevote-design.RData")
      load("popl-results.RData")
      anes_2016_prevote_desgn_rake <- rake(
          anes_2016_vprevote_desgn,list(~recall12,~vote16),
          population=list(pop.recall12,pop.vote16),
          control=list(maxit=20))
      

      Of course, the marginal distributions of the vote in 2012 and the vote in 2016 are no longer estimated, so that standard errors now vanish.

      In [3]:
      100*svymean(~recall12,design=anes_2016_prevote_desgn_rake)
      
                          mean SE
      recall12Obama   28.01970  0
      recall12Romney  25.90182  0
      recall12Other    0.95053  0
      recall12No vote 45.12795  0
      In [4]:
      100*svymean(~vote16,design=anes_2016_prevote_desgn_rake)
      
                       mean SE
      vote16Clinton 26.3355  0
      vote16Trump   25.1883  0
      vote16Other    3.1324  0
      vote16No vote 45.3439  0

      The actual percentage of non-voters in 2016 is obviously much higher than the one estimated from the sample, be it post-stratified or not.

      We save the raked data for later use.

      In [5]:
      save(anes_2016_prevote_desgn_rake,file="anes-2016-prevote-desgn-rake.RData")
      
  • Calibrating the 2016 American National Election Study data by linear regression.

    • Script file: calibrating-ANES2016.R

      Required data files: anes-2016-vprevote-design.RData and popl-results.RData created by the previous script poststratifiation-ANES2016.R.

      The script makes use of the survey package, which is available from https://cran.r-project.org/package=survey

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      
      In [2]:
      load("anes-2016-vprevote-design.RData")
      

      calibrate expects the names of the calibration vectors to be like those of regression coefficents.

      In [3]:
      cal_names(~recall12+vote16,anes_2016_vprevote_desgn)
      
      [1] "(Intercept)"     "recall12Romney"  "recall12Other"   "recall12No vote"
      [5] "vote16Trump"     "vote16Other"     "vote16No vote"  

      The following code defines a function creates a vector suitable for calibration from the data frames that postStratify() or rake() use

      In [4]:
      calib_counts <- function(formula,frames){
          dframe2coef <- function(data){
              fname <- names(data)[1]
              flevels <- as.character(data[[1]])
              Freq <- data$Freq
              coefs <- c(sum(Freq),Freq[-1])
              names(coefs) <- c("(Intercept)",
                                paste0(fname,flevels[-1]))
              coefs
          }
          vars <- all.vars(formula)
          for(i in seq_along(vars)){
              var_i <- vars[i]
              frame_i <- frames[[var_i]]
              coef_i <- dframe2coef(frame_i)
              if(i==1)
                  res <- coef_i
              else
                  res <- c(res,coef_i[-1])
          }
          res
      }
      

      We now apply this function to get the calibration criteria.

      In [5]:
      load("popl-results.RData")
      calib_anes16 <- calib_counts(~recall12+vote16,
                                   list(recall12=pop.recall12,
                                        vote16=pop.vote16))
      

      Finally we calibrate the ANES sample.

      In [6]:
      anes_2016_prevote_desgn_calib <- calibrate(
          anes_2016_vprevote_desgn,~recall12+vote16,
          population=calib_anes16)
      
      In [7]:
      100*svymean(~recall12,design=anes_2016_prevote_desgn_calib)
      
                          mean SE
      recall12Obama   28.01970  0
      recall12Romney  25.90182  0
      recall12Other    0.95053  0
      recall12No vote 45.12795  0
      In [8]:
      100*svymean(~vote16,design=anes_2016_prevote_desgn_calib)
      
                       mean SE
      vote16Clinton 26.3355  0
      vote16Trump   25.1883  0
      vote16Other    3.1324  0
      vote16No vote 45.3439  0
      In [9]:
      save(anes_2016_prevote_desgn_calib,file="anes-2016-prevote-desgn-calib.RData")
      
  • Comparing poststratification, raking, and calibration with regards to the 2016 ANES data.

    • Script file: comparing-ANES2016.R

      Required data files: anes-2016-vprevote-design.RData, anes-2016-prevote-desgn-post.RData, anes-2016-prevote-desgn-rake.RData, anes-2016-prevote-desgn-calib.RData, created by the previous scripts in this section.

      The script makes use of the following add-on packages:

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      library(survey)
      
      Loading required package: grid
      
      Loading required package: Matrix
      
      Loading required package: survival
      
      
      Attaching package: ‘survey’
      
      
      The following object is masked from ‘package:graphics’:
      
          dotchart
      
      
      
      In [2]:
      library(memisc)
      
      Loading required package: lattice
      
      Loading required package: MASS
      
      
      Attaching package: ‘memisc’
      
      
      The following object is masked from ‘package:Matrix’:
      
          as.array
      
      
      The following objects are masked from ‘package:stats’:
      
          contr.sum, contr.treatment, contrasts
      
      
      The following object is masked from ‘package:base’:
      
          as.array
      
      
      
      In [3]:
      load("anes-2016-vprevote-design.RData")
      load("anes-2016-prevote-desgn-post.RData")
      load("anes-2016-prevote-desgn-rake.RData")
      load("anes-2016-prevote-desgn-calib.RData")
      

      Let's compare the effect of poststratification and raking on the relation between variables.

      First, we create a table from the data with valid responses about voting behaviour in 2012 and 2016.

      In [4]:
      tab <- svytable(~ vote16 + recall12,
                       design = anes_2016_vprevote_desgn)
      percentages(vote16 ~ recall12, data=tab)
      
               recall12
      vote16        Obama    Romney     Other   No vote
        Clinton 76.187161  3.813877  9.241746 22.757762
        Trump    8.644019 74.683608 35.531586 23.783194
        Other    8.228826  7.974947 55.226668  4.516499
        No vote  6.939995 13.527568  0.000000 48.942545

      Second, we create a table from the poststatified data.

      In [5]:
      tab_post <- svytable(~ vote16 + recall12,
                       design = anes_2016_prevote_desgn_post)
      percentages(vote16 ~ recall12, data=tab_post)
      
               recall12
      vote16        Obama    Romney     Other   No vote
        Clinton 76.187161  3.813877  9.241746 22.757762
        Trump    8.644019 74.683608 35.531586 23.783194
        Other    8.228826  7.974947 55.226668  4.516499
        No vote  6.939995 13.527568  0.000000 48.942545

      Third, we create a table from the raked data.

      In [6]:
      tab_rak <- svytable(~ vote16 + recall12,
                          design = anes_2016_prevote_desgn_rake)
      percentages(vote16 ~ recall12, data=tab_rak)
      
               recall12
      vote16        Obama    Romney     Other   No vote
        Clinton 70.403656  3.152370 12.125475 12.579417
        Trump    8.177831 63.198219 47.727460 13.458918
        Other    4.213195  3.652234 40.147065  1.383226
        No vote 17.205318 29.997177  0.000000 72.578439

      Fourth, we create a table from the calibrated data

      In [7]:
      tab_calib <- svytable(~ vote16 + recall12,
                          design = anes_2016_prevote_desgn_calib)
      percentages(vote16 ~ recall12, data=tab_calib)
      
               recall12
      vote16        Obama    Romney     Other   No vote
        Clinton 69.137748  3.114927 11.193203 13.406539
        Trump    8.016145 62.183500 43.631304 14.227998
        Other    3.637356  3.547990 45.175493  1.694680
        No vote 19.208751 31.153583  0.000000 70.670783

      Poststratification does not alter percentages that are conditional on the variable used for poststratification. Yet raking does change the conditional percentages.

      To examine whether raking affects relations between recalled vote in 2012 and vote in 2016 we compute log-odds ratios:

      In [8]:
      log.odds <- function(x) log((x[1,1]/x[1,2])/(x[2,1]/x[2,2]))
      

      Log-odds ratios are a way to describe the relation between two dichotomous variables. Like correlations between continuous variables they are not affected by the marginal distribution.

      In [9]:
      log.odds(tab)
      
      [1] 5.15094
      In [10]:
      log.odds(tab_post)
      
      [1] 5.15094
      In [11]:
      log.odds(tab_rak)
      
      [1] 5.15094
      In [12]:
      log.odds(tab_calib)
      
      [1] 5.148527

      Clearly, both poststratfication and raking leaves log-odds ratios unaffected. Calibration has an effect, but this appears to be minor (at least in the present case).