data.set memisc 0.99.22

Data Set Objects

Description

"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are distinguished from data frames so that coercion by as.data.frame leads to a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the Details section.

Thus, data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R’s statistical functions.

dsView is a function that displays data sets in much the same way as View displays data frames. (View works with data sets as well, but converts them into data frames first.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = default.stringsAsFactors(),
                 document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

## S4 method for signature 'data.set'
head(x,n=20,...)
## S4 method for signature 'data.set'
tail(x,n=20,...)

Arguments

...

for the data.set function, several vectors or items; for within, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

n

integer; the number of rows to be shown by head or tail

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.
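
As a minimal sketch of these return values, assuming the memisc package is installed and attached (the variable names id and score are made up for illustration):

```r
library(memisc)

# Two plain numeric vectors; the names 'id' and 'score' are illustrative only.
ds <- data.set(
  id    = c(1, 2, 3, 4),
  score = c(10.5, 11.2, 9.8, 12.1)
)

is.data.set(ds)          # logical value: is 'ds' a data set?
df <- as.data.frame(ds)  # conversion into an ordinary data frame
is.data.frame(df)
is.data.set(df)          # the converted object is no longer a data set
```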

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see item and measurement.
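
The following sketch illustrates this coercion on a small made-up data set. It assumes that memisc is attached and uses only functions that also appear in the Examples section; the values and labels are invented.

```r
library(memisc)

# Invented data: a coded vote variable and a numeric income variable.
ds <- data.set(
  vote   = c(1, 2, 9, 1),
  income = c(2000, 3500, 1200, 4100)
)

ds <- within(ds, {
  measurement(vote)   <- "nominal"
  measurement(income) <- "ratio"
  labels(vote) <- c(
    Conservatives    = 1,
    Labour           = 2,
    "Answer refused" = 9)
  missing.values(vote) <- 9
})

df <- as.data.frame(ds)

is.factor(df$vote)     # the nominal item arrives as a factor
is.numeric(df$income)  # the ratio item arrives as a numeric vector
is.na(df$vote[3])      # the code declared missing (9) arrives as NA
```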

The within method for data sets has the same effect as the within method for data frames, apart from two differences: first, all results of the computations that have the appropriate length are coerced into items; second, results of any other length are automatically dropped, with a warning.
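
A compact sketch of both behaviours, assuming memisc is attached; the variable names group and too_short are hypothetical:

```r
library(memisc)

ds <- data.set(income = c(2000, 3500, 1200, 4100))

# suppressWarnings() only hides the "wrong length" warning for this demo.
ds <- suppressWarnings(within(ds, {
  group     <- c(1, 2, 1, 2)  # one value per observation: kept as an item
  too_short <- 1:2            # wrong length: removed from the result
}))

names(ds)  # 'too_short' is not among the variables
```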

Currently only one method for the generic function as.data.set is defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election would take place next tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in
 # the resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only a
# part of them is 'show'n
Data
Data set with 300 observations and 3 variables

                   vote               region     income
 1               Labour *Not asked in survey  2795.9558
 2               Labour *Not asked in survey  5119.4167
 3      *Answer refused              England  2109.7892
 4               Labour              England 12559.9636
 5    Liberal Democrats              England  4997.4559
 6    Liberal Democrats                Wales   729.8586
 7        Conservatives                Wales  1643.4866
 8      *Answer refused                Wales  3053.1220
 9      *Answer refused                Wales   535.6466
10          *Don't know              England  3100.3083
11    Liberal Democrats              England  1299.9554
12    Liberal Democrats             Scotland  1920.5327
13               Labour              England   570.0341
14        Conservatives             Scotland  1876.3520
15    Liberal Democrats *Not asked in survey  3315.3440
16    Liberal Democrats *Not asked in survey   822.0119
17    Liberal Democrats                Wales   626.2846
18        Conservatives             Scotland  2147.0000
19 *Not asked in survey              England  4714.1003
20      *Not applicable *Not asked in survey   877.3359
21    Liberal Democrats              England   586.0908
22               Labour              England  1583.2009
23    Liberal Democrats                Wales  2444.1696
24    Liberal Democrats              England  1762.6464
25      *Answer refused              England  4176.7636
(25 of 300 observations shown)
# If we insist on seeing all, we can use 'print' instead
print(Data)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 2 2 9 2 3 3 1 9 9 8 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 99 99 1 1 1 3 3 3 3 1 ...
 $ income: Rto. item  num  2796 5119 2110 12560 4997 ...
summary(Data)
                  vote                     region        income
Conservatives       :45   England             :120   Min.   :  397.5
Labour              :38   Scotland            : 85   1st Qu.: 1360.7
Liberal Democrats   :51   Wales               : 46   Median : 2173.6
*Don't know         :52   *Not asked in survey: 49   Mean   : 2708.6
*Answer refused     :36                              3rd Qu.: 3271.0
*Not applicable     :38                              Max.   :13066.4
*Not asked in survey:40
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

[1:300] Labour Labour *Answer refused Labour Liberal Democrats Liberal
  Democrats Conservatives ...
Data[1,]
Data set with 1 observations and 3 variables

    vote               region   income
1 Labour *Not asked in survey 2795.956
head(as.data.frame(Data))
               vote  region     income
1            Labour    <NA>  2795.9558
2            Labour    <NA>  5119.4167
3              <NA> England  2109.7892
4            Labour England 12559.9636
5 Liberal Democrats England  4997.4559
6 Liberal Democrats   Wales   729.8586
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 120 observations and 3 variables

                   vote  region     income
 1      *Answer refused England  2109.7892
 2               Labour England 12559.9636
 3    Liberal Democrats England  4997.4559
 4          *Don't know England  3100.3083
 5    Liberal Democrats England  1299.9554
 6               Labour England   570.0341
 7 *Not asked in survey England  4714.1003
 8    Liberal Democrats England   586.0908
 9               Labour England  1583.2009
10    Liberal Democrats England  1762.6464
11      *Answer refused England  4176.7636
12      *Not applicable England  1137.0075
13      *Answer refused England  1703.4688
14    Liberal Democrats England  3505.9389
15 *Not asked in survey England  1822.3907
16      *Answer refused England  3813.5718
17        Conservatives England  2524.7746
18               Labour England   946.8063
19    Liberal Democrats England  3458.6275
20        Conservatives England  2426.3013
21 *Not asked in survey England   450.9450
22        Conservatives England  3484.7972
23      *Not applicable England  2240.7120
24    Liberal Democrats England  8870.0405
25          *Don't know England  1581.9868
(25 of 120 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          16       16     6
  Labour                 16       12     3
  Liberal Democrats      24       10    10
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             16       16     6
  Labour                    16       12     3
  Liberal Democrats         24       10    10
  *Don't know               24       12     6
  *Answer refused           15        9     7
  *Not applicable           12       11     4
  *Not asked in survey      13       15    10