Constructing a survey design object from a data frame¶
First we activate the survey package. You may need to install it from CRAN using the code install.packages("survey")
before this if you try running this on your computer. (Package is already installed on the notebook container, however.)
library(survey)
Loading required package: grid
Loading required package: Matrix
Loading required package: survival
Attaching package: ‘survey’
The following object is masked from ‘package:graphics’:
dotchart
Next we make the NHANES
data set available (which comes with the survey package)
data(nhanes)
Now we can take a look at the data set, it has 7 variables and 8581 observations
head(nhanes)
SDMVPSU SDMVSTRA WTMEC2YR HI_CHOL race agecat RIAGENDR
1 1 83 81528.77 0 2 (19,39] 1
2 1 84 14509.28 0 3 (0,19] 1
3 2 86 12041.64 0 3 (0,19] 1
4 2 75 21000.34 0 3 (59,Inf] 2
5 1 88 22633.58 0 1 (19,39] 1
6 2 85 74112.49 1 2 (39,59] 2
nrow(nhanes)
[1] 8591
According to the documentation, the primary sampling units are identified by the variable SDMVPSU
, the strata from which the PSUs are drawn are identified by the variable SDMVSTRA
, and the sample weights are WTMEC2YR
. The sum of the sample weights corresponds to the population size:
with(nhanes,sum(WTMEC2YR))
[1] 276536446
Here were create “ordinary” sampling weights as they are common in social science data sets:
nhanes <- within(nhanes,{
smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes)
})
We now create two survey design objects, one with the “population size” weights and one with the “sample size” weights:
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes)
Error in svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, : Clusters not nested in strata at top level; you may want nest=TRUE.
Traceback:
1. svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
. data = nhanes)
2. svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
. data = nhanes)
3. stop("Clusters not nested in strata at top level; you may want nest=TRUE.")
First attempt fails: the cluster numbers are not nested in the strata. We try again with nest = TRUE
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes,
nest = TRUE)
design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~smplw, data = nhanes,
nest = TRUE)
We first estimate the population proportions with high cholesterol (HI_CHOL==1)
svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
mean SE
HI_CHOL 0.11214 0.0054
svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
mean SE
HI_CHOL 0.11214 0.0054
We secondly estimate the number of those with high cholesterol (HI_CHOL==1)
in the population:
svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
total SE
HI_CHOL 28635245 2020711
svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
total SE
HI_CHOL 889.59 62.776
Obviously, we need the “population size” weights for an unbiased estimate of the population total, otherwise we just get a sample total.
Using the “population size” weights, we can reconstruct the sizes of the population strata:
nhanes <- within(nhanes,{
strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum)
})
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes,
nest = TRUE, fpc = ~ strata_size)
design_pop
Stratified 1 - level Cluster Sampling design
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
data = nhanes, nest = TRUE, fpc = ~strata_size)
Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.
Downloadable R script and interactive version
- R Script: survey-design-objects-NHANES.R
- Interactive version (shuts down after 60s):
- Interactive version (sign in required):
Explanation
The link with the “jupyterhub” icon directs you to an interactive Jupyter1 notebook, which runs inside a Docker container2. There are two variants of the interative notebook. One shuts down after 60 seconds and does not require a sign it. The other requires sign in using your ORCID3 credentials, yet shuts down only after 24 hours. (There is no guarantee that such a container persists that long, it may be shut down earlier for maintenance purposes.) After shutdown all data within the container will be reset, i.e. all files created by the user will be deleted.4
Above you see a rendered version of the Jupyter notebook.5
- 1
-
For more information about Jupyter see
http://jupyter.org
. The Jupyter notebooks make use of the IRKernel package. - 2
-
For more information about Docker see
https://docs.docker.com/
. The container images were created with repo2docker, while containers are run with docker spawner. - 3
-
ORCID is a free service for the authentication of researchers. It also allows to showcase publications and contributions to the academic community such as peer review.. See
https://info.orcid.org/what-is-orcid/
for more information. - 4
-
The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.
- 5
-
The notebook is rendered with the help of the nbsphinx extension.