Constructing a survey design object from a data frame

First we activate the survey package. You may need to install it from CRAN using the code install.packages("survey") before this if you try running this on your computer. (Package is already installed on the notebook container, however.)

library(survey)
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

Next we make the NHANES data set available (which comes with the survey package)

data(nhanes)

Now we can take a look at the data set, it has 7 variables and 8581 observations

head(nhanes)
  SDMVPSU SDMVSTRA WTMEC2YR HI_CHOL race   agecat RIAGENDR
1       1       83 81528.77       0    2  (19,39]        1
2       1       84 14509.28       0    3   (0,19]        1
3       2       86 12041.64       0    3   (0,19]        1
4       2       75 21000.34       0    3 (59,Inf]        2
5       1       88 22633.58       0    1  (19,39]        1
6       2       85 74112.49       1    2  (39,59]        2
nrow(nhanes)
[1] 8591

According to the documentation, the primary sampling units are identified by the variable SDMVPSU, the strata from which the PSUs are drawn are identified by the variable SDMVSTRA, and the sample weights are WTMEC2YR. The sum of the sample weights corresponds to the population size:

with(nhanes,sum(WTMEC2YR))
[1] 276536446

Here were create “ordinary” sampling weights as they are common in social science data sets:

nhanes <- within(nhanes,{
    smplw <- WTMEC2YR/sum(WTMEC2YR)*nrow(nhanes)
})

We now create two survey design objects, one with the “population size” weights and one with the “sample size” weights:

design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                            weights = ~WTMEC2YR, data = nhanes)
Error in svydesign.default(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, :
Clusters not nested in strata at top level; you may want nest=TRUE.

First attempt fails: the cluster numbers are not nested in the strata. We try again with nest = TRUE

design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                        weights = ~WTMEC2YR, data = nhanes,
                        nest = TRUE)
design_smpl <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                         weights = ~smplw, data = nhanes,
                         nest = TRUE)

We first estimate the population proportions with high cholesterol (HI_CHOL==1)

svymean(~HI_CHOL, design=design_pop, na.rm=TRUE)
           mean     SE
HI_CHOL 0.11214 0.0054
svymean(~HI_CHOL, design=design_smpl, na.rm=TRUE)
           mean     SE
HI_CHOL 0.11214 0.0054

We secondly estimate the number of those with high cholesterol (HI_CHOL==1) in the population:

svytotal(~HI_CHOL, design=design_pop, na.rm=TRUE)
           total      SE
HI_CHOL 28635245 2020711
svytotal(~HI_CHOL, design=design_smpl, na.rm=TRUE)
         total     SE
HI_CHOL 889.59 62.776

Obviously, we need the “population size” weights for an unbiased estimate of the population total, otherwise we just get a sample total.

Using the “population size” weights, we can reconstruct the sizes of the population strata:

nhanes <- within(nhanes,{
    strata_size <- ave(WTMEC2YR,SDMVSTRA,FUN=sum)
})
design_pop <- svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA,
                        weights = ~WTMEC2YR, data = nhanes,
                        nest = TRUE, fpc = ~ strata_size)
design_pop
Stratified 1 - level Cluster Sampling design
With (31) clusters.
svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR, 
    data = nhanes, nest = TRUE, fpc = ~strata_size)

Of course, usually its the other way round: starting with the strata and the cluster IDs, the sampling probabilities and sampling weights are computed.