Constructing a survey design object from a data frame

First we activate the survey package. You may need to install it from CRAN using the code install.packages("survey") before this if you try running this on your computer. (Package is already installed on the notebook container, however.)


library(survey)
Loading required package: grid

Loading required package: Matrix

Loading required package: survival


Attaching package: ‘survey’


The following object is masked from ‘package:graphics’:

    dotchart


Next we make the NHANES data set available (which comes with the survey package)


data(nhanes)

Now we can take a look at the data set, it has 7 variables and 8581 observations


head(nhanes)
  SDMVPSU SDMVSTRA WTMEC2YR HI_CHOL race agecat   RIAGENDR
1 1       83       81528.77 0       2    (19,39]  1
2 1       84       14509.28 0       3    (0,19]   1
3 2       86       12041.64 0       3    (0,19]   1
4 2       75       21000.34 0       3    (59,Inf] 2
5 1       88       22633.58 0       1    (19,39]  1
6 2       85       74112.49 1       2    (39,59]  2

nrow(nhanes)
[1] 8591

According to the documentation, the primary sampling units are identified by the variable SDMVPSU, the strata from which the PSUs are drawn are identified by the variable SDMVSTRA, and the sample weights are WTMEC2YR. The sum of the sample weights corresponds to the population size:


with(nhanes,sum(WTMEC2YR))
[1] 276536446

Here were create “ordinary” sampling weights as they are common in social science data sets:


nhanes <- within(nhanes,{
    smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes)
})

We now create two survey design objects, one with the “population size” weights and one with the “sample size” weights:


design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                            weights = ~WTMEC2YR, data = nhanes)
Error in svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, : Clusters not nested in strata at top level; you may want nest=TRUE.
Traceback:

1. svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
 .     data = nhanes)
2. svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
 .     data = nhanes)
3. stop("Clusters not nested in strata at top level; you may want nest=TRUE.")

First attempt fails: the cluster numbers are not nested in the strata. We try again with nest = TRUE


design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                        weights = ~WTMEC2YR, data = nhanes,
                        nest = TRUE)
design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                         weights = ~smplw, data = nhanes,
                         nest = TRUE)

We first estimate the population proportions with high cholesterol (HI_CHOL==1)


svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
           mean     SE
HI_CHOL 0.11214 0.0054

svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
           mean     SE
HI_CHOL 0.11214 0.0054

We secondly estimate the number of those with high cholesterol (HI_CHOL==1) in the population:


svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
           total      SE
HI_CHOL 28635245 2020711

svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
         total     SE
HI_CHOL 889.59 62.776

Obviously, we need the “population size” weights for an unbiased estimate of the population total, otherwise we just get a sample total.

Using the “population size” weights, we can reconstruct the sizes of the population strata:


nhanes <- within(nhanes,{
    strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum)
})
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                        weights = ~WTMEC2YR, data = nhanes,
                        nest = TRUE, fpc = ~ strata_size)
design_pop
Stratified 1 - level Cluster Sampling design
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
    data = nhanes, nest = TRUE, fpc = ~strata_size)

Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.

Downloadable R script and interactive version

Explanation

The link with the “jupyterhub” icon directs you to an interactive Jupyter1 notebook, which runs inside a Docker container2. There are two variants of the interative notebook. One shuts down after 60 seconds and does not require a sign it. The other requires sign in using your ORCID3 credentials, yet shuts down only after 24 hours. (There is no guarantee that such a container persists that long, it may be shut down earlier for maintenance purposes.) After shutdown all data within the container will be reset, i.e. all files created by the user will be deleted.4

Above you see a rendered version of the Jupyter notebook.5

1

For more information about Jupyter see http://jupyter.org. The Jupyter notebooks make use of the IRKernel package.

2

For more information about Docker see https://docs.docker.com/. The container images were created with repo2docker, while containers are run with docker spawner.

3

ORCID is a free service for the authentication of researchers. It also allows to showcase publications and contributions to the academic community such as peer review.. See https://info.orcid.org/what-is-orcid/ for more information.

4

The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.

5

The notebook is rendered with the help of the nbsphinx extension.