data.set memisc 0.99.20.1

Data Set Objects

Description

"data.set" objects are collections of "item" objects, with similar semantics as data frames. They are distinguished from data frames so that coercion by as.data.fame leads to a data frame that contains only vectors and factors. Nevertheless most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the details section.

Thus data preparation using data sets retains all information about item annotations, labels, missing values, etc., while (mostly automatic) conversion of data sets into data frames makes the data amenable to R's statistical functions.
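
As a minimal sketch of this division of labour, the following toy example shows a value declared as missing that stays visible in the data set but becomes NA after conversion into a data frame. The variable names, values, and labels are made up for illustration and are not taken from the package's own examples.

```r
library(memisc)

# Toy data: the code 9 stands for a refused answer (an assumption of this sketch).
ds <- data.set(
  vote   = c(1, 2, 9, 1, 2),
  income = c(1800, 2500, 3100, 900, 4200)
)
ds <- within(ds, {
  measurement(vote) <- "nominal"
  labels(vote) <- c(Conservatives = 1, Labour = 2, "Answer refused" = 9)
  missing.values(vote) <- 9
})

ds                        # the data set still shows the refused answer
df <- as.data.frame(ds)   # conversion into an ordinary data frame
df                        # the refused answer is now NA
```

The annotated object ds can be kept for further data preparation, while df is what would typically be passed to modelling functions.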

dsView is a function that displays data sets in much the same way as View displays data frames. (View works with data sets as well, but converts them into data frames first.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = default.stringsAsFactors(),
                 document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

## S4 method for signature 'data.set'
head(x,n=20,...)
## S4 method for signature 'data.set'
tail(x,n=20,...)

Arguments

...

For the data.set function, several vectors or items; for within, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

n

integer; the number of rows to be shown by head or tail
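
A short sketch of how n might be used with head and tail, on a hypothetical data set of ten rows (the variable names id and score are invented for this illustration):

```r
library(memisc)

# A small made-up data set with ten observations.
ds <- data.set(id = 1:10, score = seq(0.5, 5, by = 0.5))

head(ds, n = 3)  # the first three observations
tail(ds, n = 3)  # the last three observations
```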

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.
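
The return types can be checked with a small sketch such as the following; the object and variable names are placeholders chosen for this illustration.

```r
library(memisc)

ds  <- data.set(a = 1:3, b = c(2.5, 3.5, 4.5))
ds2 <- within(ds, a2 <- c(10, 20, 30))
df  <- as.data.frame(ds)

is.data.set(ds)    # TRUE: data.set() returns a data set
is.data.set(ds2)   # TRUE: so does the within method
is.data.set(df)    # FALSE: as.data.frame() returns a plain data frame
is.data.frame(df)  # TRUE
```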

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see item and measurement.
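
The following sketch illustrates this coercion with made-up variables. It assumes the mapping described on the help page for measurement: nominal items become factors, ordinal items become ordered factors, and interval or ratio items become numeric vectors.

```r
library(memisc)

# Toy variables; names, values, and labels are invented for this sketch.
ds <- data.set(
  party  = c(1, 2, 1, 3),
  agree  = c(1, 3, 2, 3),
  income = c(1800, 2500, 3100, 900)
)
ds <- within(ds, {
  measurement(party)  <- "nominal"
  measurement(agree)  <- "ordinal"
  measurement(income) <- "ratio"
  labels(party) <- c(A = 1, B = 2, C = 3)
  labels(agree) <- c(Low = 1, Medium = 2, High = 3)
})

df <- as.data.frame(ds)
str(df)  # inspect the column types chosen for each measurement level
```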

The within method for data sets has the same effect as the within method for data frames, apart from two differences: all results of the computations are coerced into items if they have the appropriate length; otherwise, they are automatically dropped.
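
A minimal sketch of both differences, using hypothetical variable names: a result with one value per observation is kept, while a result of the wrong length is removed with a warning.

```r
library(memisc)

ds <- data.set(x = 1:4)
ds <- within(ds, {
  y    <- c(10, 20, 30, 40)  # four values: kept and coerced into an item
  junk <- 1:3                # three values: dropped, with a warning
})

names(ds)  # 'junk' should not be among the variable names
```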

Currently only one method for the generic function as.data.set is defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election would take place next tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in the
 # resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only a
# part of them is shown
Data
Data set with 300 observations and 3 variables

                   vote               region    income
 1 *Not asked in survey                Wales 2480.3938
 2               Labour              England 1294.0506
 3          *Don't know              England 5331.7628
 4               Labour             Scotland 3628.6942
 5    Liberal Democrats             Scotland 2847.6935
 6               Labour              England 2830.0589
 7    Liberal Democrats *Not asked in survey 4186.1611
 8 *Not asked in survey *Not asked in survey 2134.4661
 9      *Not applicable *Not asked in survey 1004.4691
10      *Not applicable                Wales 1959.5258
11        Conservatives             Scotland 2030.1007
12               Labour                Wales 5272.8193
13               Labour              England 5132.2473
14               Labour *Not asked in survey 3062.9538
15      *Not applicable *Not asked in survey  691.0322
16          *Don't know              England 1691.9867
17    Liberal Democrats             Scotland 6548.7332
18               Labour              England  308.3800
19      *Not applicable *Not asked in survey  844.3661
20          *Don't know             Scotland  272.8527
21      *Not applicable              England 2025.4756
22      *Not applicable              England 1208.9411
23        Conservatives              England  915.5184
24      *Answer refused *Not asked in survey  547.6606
25        Conservatives *Not asked in survey  433.3502
(25 of 300 observations shown)
## Not run:
## # If we insist on seeing all, we can use 'print' instead
## print(Data)
## End(Not run)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 99 2 8 2 3 2 3 99 97 97 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 3 1 1 2 2 1 99 99 99 3 ...
 $ income: Rto. item  num  2480 1294 5332 3629 2848 ...
summary(Data)
                  vote                     region        income
Conservatives       :40   England             :138   Min.   :  272.9
Labour              :46   Scotland            : 80   1st Qu.: 1290.7
Liberal Democrats   :36   Wales               : 42   Median : 2037.8
*Don't know         :36   *Not asked in survey: 40   Mean   : 2543.6
*Answer refused     :39                              3rd Qu.: 3136.2
*Not applicable     :41                              Max.   :15168.3
*Not asked in survey:62
## Not run:
## # If we want to 'View' a data set we can use 'dsView'
## dsView(Data)
## # Works also, but changes the data set into a data frame first:
## View(Data)
## End(Not run)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

 [1:300] *Not asked in survey Labour *Don't know Labour Liberal Democrats ...
Data[1,]
Data set with 1 observations and 3 variables

                  vote region   income
1 *Not asked in survey  Wales 2480.394
head(as.data.frame(Data))
               vote   region   income
1              <NA>    Wales 2480.394
2            Labour  England 1294.051
3              <NA>  England 5331.763
4            Labour Scotland 3628.694
5 Liberal Democrats Scotland 2847.693
6            Labour  England 2830.059
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 138 observations and 3 variables

                   vote  region    income
 1               Labour England 1294.0506
 2          *Don't know England 5331.7628
 3               Labour England 2830.0589
 4               Labour England 5132.2473
 5          *Don't know England 1691.9867
 6               Labour England  308.3800
 7      *Not applicable England 2025.4756
 8      *Not applicable England 1208.9411
 9        Conservatives England  915.5184
10      *Answer refused England 2716.2459
11 *Not asked in survey England 1376.9450
12      *Answer refused England 3758.4926
13        Conservatives England  518.9432
14               Labour England 7285.8555
15    Liberal Democrats England 8599.7544
16      *Not applicable England 1947.9748
17    Liberal Democrats England  900.3834
18               Labour England  545.5856
19          *Don't know England 2509.6463
20    Liberal Democrats England 1748.0666
21               Labour England 1618.1024
22 *Not asked in survey England  956.7359
23 *Not asked in survey England 1329.8740
24    Liberal Democrats England 1688.9110
25          *Don't know England 1209.1445
(25 of 138 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          24        7     5
  Labour                 22        9    10
  Liberal Democrats      14       12     5
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             24        7     5
  Labour                    22        9    10
  Liberal Democrats         14       12     5
  *Don't know               16       10     4
  *Answer refused           17       11     3
  *Not applicable           17       12     7
  *Not asked in survey      28       19     8