Data Set Objects

Description

"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are distinguished from data frames so that coercion by as.data.frame leads to a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the Details section.

Thus, data preparation using data sets retains all information about item annotations, labels, missing values, etc., while (mostly automatic) conversion of data sets into data frames makes the data amenable to R’s statistical functions.

dsView is a function that displays data sets in much the same way as View displays data frames. (View works with data sets as well, but converts them into data frames first.)

Usage

data.set(..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
         stringsAsFactors = default.stringsAsFactors(),
         document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

Arguments

...

For the data.set function, several vectors or items; for within, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see item and measurement.
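
The coercion described above can be tried on a small data set. The following is a minimal sketch, not part of the original examples. It assumes that data.set and the item accessors are provided by the memisc package, and the variables party and age are made up for illustration. A nominal item should come out as a factor with its missing-value codes turned into NA, and a ratio item as a plain numeric vector.

```r
library(memisc)  # assumption: the package that provides data.set()

# Hypothetical toy data: 'party' uses 9 as a missing-value code
ds <- data.set(
  party = c(1, 2, 9, 1),
  age   = c(34, 51, 27, 45)
)
ds <- within(ds, {
  measurement(party) <- "nominal"
  labels(party) <- c(Left = 1, Right = 2, "No answer" = 9)
  missing.values(party) <- c(9)
  measurement(age) <- "ratio"
})

# Convert to an ordinary data frame
df <- as.data.frame(ds)

# Inspect how each item was coerced
str(df)
```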

The within method for data sets has the same effect as the within method for data frames, with two differences: results of the computations that have the appropriate length are coerced into items, and results of any other length are automatically dropped.
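
A compact way to see this behaviour is sketched below. It assumes the memisc package is the source of data.set and within for data sets; the variable names x, y and z are hypothetical. The result y has one value per observation and is kept, while z has the wrong length and is expected to be dropped with a warning.

```r
library(memisc)  # assumption: the package that provides data.set()

ds <- data.set(x = 1:4)

ds2 <- within(ds, {
  y <- rep(1, 4)  # one value per observation: kept
  z <- 1:3        # wrong length: expected to be dropped
})

# List the variables that survived
names(ds2)
```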

Currently only one method for the generic function as.data.set is defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election would take place next Tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in the
 # resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only
# a part of them is shown
Data
Data set with 300 observations and 3 variables

                   vote               region    income
 1          *Don't know              England 1144.1352
 2      *Answer refused             Scotland 4723.0492
 3        Conservatives             Scotland  711.3674
 4      *Not applicable                Wales 1634.3065
 5      *Not applicable              England 2075.3244
 6               Labour             Scotland  624.6963
 7    Liberal Democrats                Wales 4803.0513
 8    Liberal Democrats             Scotland 4489.8271
 9      *Not applicable             Scotland 1561.9337
10    Liberal Democrats             Scotland 3413.3040
11        Conservatives              England 4041.6287
12          *Don't know              England 1124.7814
13          *Don't know             Scotland 1792.7268
14      *Answer refused             Scotland 3879.4189
15 *Not asked in survey                Wales 4234.7365
16    Liberal Democrats             Scotland  479.3504
17          *Don't know             Scotland 4183.5685
18    Liberal Democrats              England 1850.0802
19        Conservatives              England 5017.8963
20 *Not asked in survey             Scotland 2564.6258
21          *Don't know                Wales  339.8707
22      *Not applicable             Scotland 2330.9037
23      *Not applicable             Scotland 1066.9105
24               Labour              England 1587.7572
25               Labour *Not asked in survey 3549.3210
(25 of 300 observations shown)
## Not run:
##
##
## # If we insist on seeing all, we can use 'print' instead
## print(Data)
## End(Not run)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 8 9 1 97 97 2 3 3 97 3 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 1 2 2 3 1 2 3 2 2 2 ...
 $ income: Rto. item  num  1144 4723 711 1634 2075 ...
summary(Data)
                  vote                     region        income
Conservatives       :42   England             :124   Min.   :  238.9
Labour              :42   Scotland            :102   1st Qu.: 1124.1
Liberal Democrats   :45   Wales               : 37   Median : 1865.3
*Don't know         :46   *Not asked in survey: 37   Mean   : 2380.8
*Answer refused     :53                              3rd Qu.: 3090.1
*Not applicable     :29                              Max.   :10987.7
*Not asked in survey:43
## Not run:
##
## # If we want to 'View' a data set we can use 'dsView'
## dsView(Data)
## # Works also, but changes the data set into a data frame first:
## View(Data)
## End(Not run)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

 [1:300] *Don't know *Answer refused Conservatives *Not applicable ...
Data[1,]
Data set with 1 observations and 3 variables

         vote  region   income
1 *Don't know England 1144.135
head(as.data.frame(Data))
           vote   region    income
1          <NA>  England 1144.1352
2          <NA> Scotland 4723.0492
3 Conservatives Scotland  711.3674
4          <NA>    Wales 1634.3065
5          <NA>  England 2075.3244
6        Labour Scotland  624.6963
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 124 observations and 3 variables

                   vote  region    income
 1          *Don't know England 1144.1352
 2      *Not applicable England 2075.3244
 3        Conservatives England 4041.6287
 4          *Don't know England 1124.7814
 5    Liberal Democrats England 1850.0802
 6        Conservatives England 5017.8963
 7               Labour England 1587.7572
 8      *Not applicable England 2887.2592
 9    Liberal Democrats England  354.1545
10          *Don't know England  872.1741
11 *Not asked in survey England 2676.4985
12        Conservatives England  834.5003
13          *Don't know England 2644.1519
14               Labour England 1563.7189
15               Labour England 1277.6320
16          *Don't know England  896.9744
17          *Don't know England  541.0061
18               Labour England  720.5664
19 *Not asked in survey England  862.6967
20      *Not applicable England  821.4815
21    Liberal Democrats England 4942.4375
22        Conservatives England 4786.7917
23 *Not asked in survey England 4557.2759
24        Conservatives England 1481.9483
25          *Don't know England 4901.3798
(25 of 124 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          17       20     3
  Labour                 15       14     6
  Liberal Democrats      21       15     4
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             17       20     3
  Labour                    15       14     6
  Liberal Democrats         21       15     4
  *Don't know               22       15     3
  *Answer refused           19       16     9
  *Not applicable           12        9     5
  *Not asked in survey      18       13     7