Constructing a survey design object from a data frame¶
First we activate the survey package. You may need to install it from CRAN using the code install.packages("survey")
before this if you try running this on your computer. (Package is already installed on the notebook container, however.)
library(survey)
Loading required package: grid
Loading required package: Matrix
Loading required package: survival
Attaching package: 'survey'
The following object is masked from 'package:graphics':
dotchart
Next we make the NHANES
data set available (which comes with the survey
package)
data(nhanes)
Now we can take a look at the data set, it has 7 variables and 8581 observations
head(nhanes)
SDMVPSU SDMVSTRA WTMEC2YR HI_CHOL race agecat RIAGENDR
1 1 83 81528.77 0 2 (19,39] 1
2 1 84 14509.28 0 3 (0,19] 1
3 2 86 12041.64 0 3 (0,19] 1
4 2 75 21000.34 0 3 (59,Inf] 2
5 1 88 22633.58 0 1 (19,39] 1
6 2 85 74112.49 1 2 (39,59] 2
nrow(nhanes)
[1] 8591
According to the documentation, the primary sampling units are identified by
the variable SDMVPSU
, the strata from which the PSUs are drawn are
identified by the variable SDMVSTRA
, and the sample weights are WTMEC2YR
.
The sum of the sample weights corresponds to the population size:
with(nhanes,sum(WTMEC2YR))
[1] 276536446
Here were create “ordinary” sampling weights as they are common in social science data sets:
nhanes <- within(nhanes,{
smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes)
})
We now create two survey design objects, one with the “population size” weights and one with the “sample size” weights:
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes)
Error in svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, :
Clusters not nested in strata at top level; you may want nest=TRUE.
First attempt fails: the cluster numbers are not nested in the strata.
We try again with nest = TRUE
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes,
nest = TRUE)
design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~smplw, data = nhanes,
nest = TRUE)
We first estimate the population proportions with high cholesterol (HI_CHOL==1)
svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
mean SE
HI_CHOL 0.11214 0.0054
svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
mean SE
HI_CHOL 0.11214 0.0054
We secondly estimate the number of those with high cholesterol (HI_CHOL==1)
in
the population:
svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
total SE
HI_CHOL 28635245 2020711
svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
total SE
HI_CHOL 889.59 62.776
Obviously, we need the “population size” weights for an unbiased estimate of the population total, otherwise we just get a sample total.
Using the “population size” weights, we can reconstruct the sizes of the population strata:
nhanes <- within(nhanes,{
strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum)
})
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC2YR, data = nhanes,
nest = TRUE, fpc = ~ strata_size)
design_pop
Stratified 1 - level Cluster Sampling design
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
data = nhanes, nest = TRUE, fpc = ~strata_size)
Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.
- R file: survey-design-objects-NHANES.R
- Rmarkdown file: survey-design-objects-NHANES.Rmd
- Jupyter notebook file: survey-design-objects-NHANES.ipynb
- Interactive version of the Jupyter notebook (shuts down after 60s):
- Interactive version of the Jupyter notebook (sign in required):