Building Blocks of Data

This chapter describes the most basic data types, on which all other data structures build. It starts with simple numeric vectors, which may, for example, contain series of measurements. It then discusses character vectors (sequences of character strings), logical vectors (sequences of TRUE/FALSE values), and finally lists. The chapter also covers how simple computations can be carried out on such data and how simple summaries can be obtained from elementary data types. Finally, the chapter discusses how data can be stored on disk in an R-specific format.

Below is the supporting material for the various sections of the chapter.

Basic Data Types

Numeric vectors

  • Script file: numeric-vectors.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    c(1.2,3.5,5.0,6.7,1.09e-3)
    
    [1] 1.20000 3.50000 5.00000 6.70000 0.00109
    In [3]:
    x <- c(1.2,3.5,5.0,6.7,1.09e-3)
    length(x)
    
    [1] 5
    In [4]:
    1:100
    
      [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
     [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
     [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
     [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
     [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
     [91]  91  92  93  94  95  96  97  98  99 100
    In [5]:
    x <- c(1,2,3,4,5)
    y <- c(3,2,3,2,3)
    z <- x + y
    print(z)
    
    [1] 4 4 6 6 8
    
    In [6]:
    x <- c(3,2,4,8,7)
    y <- x + 1
    print(y)
    
    [1] 4 3 5 9 8
    
    In [7]:
    x <- c(3,2,4,8,7)
    y <- x + c(1,1,1,1,1)
    print(y)
    
    [1] 4 3 5 9 8
    
    In [8]:
    1 + NA
    
    [1] NA
    In [9]:
    x <- c(-2,-1,0,1,2)
    1/x
    
    [1] -0.5 -1.0  Inf  1.0  0.5
    In [10]:
    x/0
    
    [1] -Inf -Inf  NaN  Inf  Inf
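
The overview mentions simple summaries of elementary data types, and the cells above show how NA, Inf and NaN arise. The following is a minimal sketch of how the two interact. The vector x is made up for illustration and is not taken from the chapter's script file.

```r
# Hypothetical measurement series with one missing value (made-up data)
x <- c(1.2, 3.5, NA, 6.7, 5.0)

# Elementwise tests for special values
is.na(x)                          # TRUE only at position 3
is.finite(c(1, Inf, NaN, NA))     # TRUE only for the ordinary number

# Summaries return NA as soon as one element is missing ...
mean(x)
# ... unless missing values are dropped explicitly
mean(x, na.rm = TRUE)
range(x, na.rm = TRUE)
```

Most summary functions in base R (sum, mean, min, max, range) accept the na.rm argument in the same way.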

Logical vectors

  • Script file: logical-vectors.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    ## Comparisons
    x <- -3:3
    x
    
    [1] -3 -2 -1  0  1  2  3
    In [3]:
    x == 0
    
    [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
    In [4]:
    x <- -3:3
    y <- c(1:3,0,1:3)
    x == y
    
    [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
    In [5]:
    ## Logical operators
    a <- c(TRUE,FALSE,TRUE,FALSE)
    b <- c(TRUE,TRUE,FALSE,FALSE)
    
    In [6]:
    a & b
    
    [1]  TRUE FALSE FALSE FALSE
    In [7]:
    a | b
    
    [1]  TRUE  TRUE  TRUE FALSE
    In [8]:
    !a
    
    [1] FALSE  TRUE FALSE  TRUE
    In [9]:
    a & !b
    
    [1] FALSE FALSE  TRUE FALSE
    In [10]:
    !(a | b)
    
    [1] FALSE FALSE FALSE  TRUE
    In [11]:
    x <- -3:3
    
    In [12]:
    x > 1 & x < -1
    
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    In [13]:
    x > 1 | x < -1
    
    [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
    In [14]:
    a <- c(TRUE,FALSE,NA,TRUE,FALSE,NA,TRUE,FALSE,NA)
    b <- c(TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,NA,NA,NA)
    
    In [15]:
    a & b
    
    [1]  TRUE FALSE    NA FALSE FALSE FALSE    NA FALSE    NA
    In [16]:
    a | b
    
    [1]  TRUE  TRUE  TRUE  TRUE FALSE    NA  TRUE    NA    NA
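
Logical vectors are often summarised rather than printed in full. The sketch below is one possible illustration, reusing x <- -3:3 from the notebook; the particular conditions are arbitrary choices, not part of the chapter's script.

```r
x <- -3:3

sum(x > 1)      # number of elements greater than 1 (TRUE counts as 1)
mean(x < 0)     # proportion of negative elements
which(x > 1)    # positions of the TRUE values
any(x == 0)     # is at least one element zero?
all(x > -4)     # are all elements greater than -4?

# The three-valued logic shown above carries over to any() and all()
any(c(FALSE, NA))   # NA: the missing value might be TRUE
any(c(TRUE, NA))    # TRUE regardless of the missing value
```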

Character vectors

  • Script file: character-vectors.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    Beatles <- c("John", "Paul", "George", "Ringo")
    Beatles
    
    [1] "John"   "Paul"   "George" "Ringo" 
    In [3]:
    paste("one","and","only")
    
    [1] "one and only"
    In [4]:
    paste(Beatles, collapse=" & ")
    
    [1] "John & Paul & George & Ringo"
    In [5]:
    First <- c("Mick","Keith","Ronnie","Charlie")
    Last <- c("Jagger","Richards","Wood","Watts")
    
    In [6]:
    paste(First,Last)
    
    [1] "Mick Jagger"    "Keith Richards" "Ronnie Wood"    "Charlie Watts" 
    In [7]:
    paste(First,Last,sep="_")
    
    [1] "Mick_Jagger"    "Keith_Richards" "Ronnie_Wood"    "Charlie_Watts" 
    In [8]:
    Beatles <- c("John", "Paul", "George", "Ringo")
    
    In [9]:
    substr(Beatles,1,2)
    
    [1] "Jo" "Pa" "Ge" "Ri"
    In [10]:
    substr(Beatles,1:4,2:5)
    
    [1] "Jo" "au" "or" "go"
    In [11]:
    Led.Zeppelin.song <- "Whole Lotta Love"
    ACDC.song <- sub("Love","Rosie",Led.Zeppelin.song)
    print(ACDC.song)
    
    [1] "Whole Lotta Rosie"
    
    In [12]:
    onetofour <- 1:4
    names(onetofour) <- c("first","second","third","fourth")
    
    In [13]:
    names(onetofour)
    
    [1] "first"  "second" "third"  "fourth"
    In [14]:
    onetofour
    
     first second  third fourth 
         1      2      3      4 
    In [15]:
    print(onetofour)
    
     first second  third fourth 
         1      2      3      4 
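
The call substr(Beatles,1:4,2:5) in cell In [10] relies on substr() being vectorised over its start and stop arguments as well as over the strings. A small sketch that spells out the same result element by element (the name by.hand is introduced here only for illustration):

```r
Beatles <- c("John", "Paul", "George", "Ringo")

# Element i is cut from position i to position i + 1
by.hand <- c(substr("John",   1, 2),
             substr("Paul",   2, 3),
             substr("George", 3, 4),
             substr("Ringo",  4, 5))

identical(substr(Beatles, 1:4, 2:5), by.hand)

nchar(Beatles)   # string lengths, useful for choosing valid positions
```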
    

Basic Data Manipulation

Extracting and replacing elements of a vector

  • Script file: extracting-and-replacing-elements.R
  • Interactive notebook:

    In [15]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [16]:
    x <- c(10, 12, 30, 14, 50)
    
    In [17]:
    x[1]
    
    [1] 10
    In [18]:
    x[5]
    
    [1] 50
    In [19]:
    x[c(2,4,6)]
    
    [1] 12 14 NA
    In [20]:
    x[c(1,1,1,2,2)]
    
    [1] 10 10 10 12 12
    In [21]:
    x[-c(1,3,5)]
    
    [1] 12 14
    In [22]:
    x[c(FALSE,TRUE,FALSE,TRUE,FALSE)]
    
    [1] 12 14
    In [23]:
    x[x>=20]
    
    [1] 30 50
    In [24]:
    names(x) <- c("a","b","c","d","e")
    
    In [25]:
    x[c("a","c")]
    
     a  c 
    10 30 
    In [26]:
    set.seed(231)
    y <- rnorm(n=12)
    
    In [27]:
    y[1:4] <- 0
    y
    
     [1]  0.00000000  0.00000000  0.00000000  0.00000000 -0.47335746  0.21739728
     [7]  0.06292205 -0.87782986  0.56368979 -0.03432728 -0.22631292  1.38657787
    In [28]:
    y <- rnorm(n=12)
    
    In [29]:
    y[y < 0] <- 0
    y
    
     [1] 0.0000000 0.0000000 1.4013312 0.3196224 1.0058453 0.0000000 0.0000000
     [8] 1.6502536 1.4374338 0.0000000 0.0000000 0.0000000
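
Replacing all negative values by zero, as in cell In [29], can also be written without indexing. The following sketch assumes that pmax() (the elementwise maximum in base R) is an acceptable alternative; it is not taken from the chapter's script, and the names y.replaced and y.pmax are hypothetical.

```r
set.seed(231)                     # seed chosen to match the notebook; any seed works
y <- rnorm(n = 12)

y.replaced <- y
y.replaced[y.replaced < 0] <- 0   # replacement via a logical index

y.pmax <- pmax(y, 0)              # elementwise maximum of y and 0

all(y.replaced == y.pmax)         # both approaches give the same values
```

The indexing form modifies an existing vector in place, whereas pmax() returns a new vector and leaves y untouched.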

Reordering and sorting elements of a vector

  • Script file: reordering-and-sorting.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    set.seed(231)
    
    In [3]:
    x <- rnorm(n=10)
    x
    
     [1] -0.53310192 -2.31166378 -0.95419786  0.26251575 -0.47335746  0.21739728
     [7]  0.06292205 -0.87782986  0.56368979 -0.03432728
    In [4]:
    x.srt <- sort(x)
    x.srt
    
     [1] -2.31166378 -0.95419786 -0.87782986 -0.53310192 -0.47335746 -0.03432728
     [7]  0.06292205  0.21739728  0.26251575  0.56368979
    In [5]:
    sort(x,decreasing=TRUE)
    
     [1]  0.56368979  0.26251575  0.21739728  0.06292205 -0.03432728 -0.47335746
     [7] -0.53310192 -0.87782986 -0.95419786 -2.31166378
    In [6]:
    stex <- c("1","11","A","a","Ab","AB","ab","aB","B","b","bb")
    sort(stex)
    
     [1] "1"  "11" "a"  "A"  "ab" "aB" "Ab" "AB" "b"  "B"  "bb"
    In [7]:
    set.seed(2134)
    x <- rnorm(6)
    x
    
    [1]  0.6549052 -0.2099869 -0.6148580 -0.2740271 -0.7234317  1.4371483
    In [8]:
    y <- rnorm(6)
    y
    
    [1] -0.09385485 -0.05070594  0.77188553  0.36295090  1.12152639  0.72011916
    In [9]:
    ii <- order(x)
    
    In [10]:
    x.ordered <- x[ii]
    y.ordered <- y[ii]
    
    In [11]:
    x.ordered
    
    [1] -0.7234317 -0.6148580 -0.2740271 -0.2099869  0.6549052  1.4371483
    In [12]:
    y.ordered
    
    [1]  1.12152639  0.77188553  0.36295090 -0.05070594 -0.09385485  0.72011916
    In [13]:
    jj <- order(ii)
    
    In [14]:
    all(x.ordered[jj] == x)
    
    [1] TRUE
    In [15]:
    all(y.ordered[jj] == y)
    
    [1] TRUE
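
Cells In [13] to In [15] use order(ii) to undo a reordering. This works because ii is a permutation and order() applied to a permutation returns its inverse. A minimal sketch that builds the inverse explicitly for comparison (the seed and the name inv are arbitrary choices made here):

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
x  <- rnorm(6)
ii <- order(x)               # positions that sort x
jj <- order(ii)              # claimed inverse permutation of ii

# Explicit construction: element ii[k] of the original sits at position k
inv <- integer(length(ii))
inv[ii] <- seq_along(ii)

identical(jj, inv)           # order(ii) is the inverse permutation
identical(x[ii][jj], x)      # applying it restores the original order
```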

Regular sequences and repetitions

  • Script file: regular-sequences-and-repetitions.R
  • Interactive notebook:

    In [3]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [4]:
    1:10
    
     [1]  1  2  3  4  5  6  7  8  9 10
    In [5]:
    seq(from=1,to=10)
    
     [1]  1  2  3  4  5  6  7  8  9 10
    In [6]:
    seq(from=2,to=10,by=2)
    
    [1]  2  4  6  8 10
    In [7]:
    seq(to=49,length.out=5,by=7)
    
    [1] 21 28 35 42 49
    In [8]:
    rep(1:5,3)
    
     [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
    In [9]:
    rep(1:5,each=3)
    
     [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
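
In cell In [7] the starting value is not given; seq() works backwards from the end point. The arithmetic can be sketched as follows, with variable names chosen here for illustration:

```r
to <- 49
by <- 7
n  <- 5

from <- to - (n - 1) * by     # 49 - 4 * 7 = 21
from

# Both calls describe the same sequence
seq(from = from, by = by, length.out = n)
seq(to = to, by = by, length.out = n)

# rep() also accepts one repetition count per element
rep(1:3, times = c(3, 2, 1))
```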

Sampling from a vector

  • Script file: sampling-from-vectors.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    set.seed(143)
    
    In [3]:
    sample(1:9)
    
    [1] 8 6 4 1 7 2 9 5 3
    In [4]:
    sample(1:1000,size=20)
    
     [1] 658 171 191 428 806 768 307 120 506 340 190 962 437 274 477 935 363 469 933
    [20]  79
    In [5]:
    sample(6,size=10,replace=TRUE)
    
     [1] 4 5 3 2 1 5 1 6 1 2
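
The call to set.seed() makes the random draws above reproducible. A short sketch of what that means in practice; the checks are illustrative and the exact numbers drawn may differ between R versions, so none are shown:

```r
set.seed(143)
a <- sample(1:9)
set.seed(143)
b <- sample(1:9)
identical(a, b)            # same seed, same permutation

# Without replacement, a full sample is a permutation of the input
identical(sort(a), 1:9)

# With replace=TRUE values may repeat, so size may exceed the population
rolls <- sample(6, size = 10, replace = TRUE)
all(rolls %in% 1:6)
```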

Complex Data Types

Lists

  • Script file: lists.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    AList <- list(1:5,
                  letters[1:6],
                  c(TRUE,FALSE,FALSE,TRUE))
    AList
    
    [[1]]
    [1] 1 2 3 4 5
    
    [[2]]
    [1] "a" "b" "c" "d" "e" "f"
    
    [[3]]
    [1]  TRUE FALSE FALSE  TRUE
    
    In [3]:
    AList[1:2]
    
    [[1]]
    [1] 1 2 3 4 5
    
    [[2]]
    [1] "a" "b" "c" "d" "e" "f"
    
    In [4]:
    AList[1]
    
    [[1]]
    [1] 1 2 3 4 5
    
    In [5]:
    AList[[2]]
    
    [1] "a" "b" "c" "d" "e" "f"
    In [6]:
    AList[[1:2]]
    
    [1] 2
    In [7]:
    AList[[1:3]]
    
    Error in AList[[1:3]]: recursive indexing failed at level 2
    
    In [8]:
    length(AList)
    
    [1] 3
    In [9]:
    FDR <- list(c("Franklin","Delano"),
                c("Roosevelt"))
    
    In [10]:
    names(FDR) <- c("first.name","last.name")
    FDR
    
    $first.name
    [1] "Franklin" "Delano"  
    
    $last.name
    [1] "Roosevelt"
    
    In [11]:
    FDR <- list(first.name=c("Franklin","Delano"),
                last.name=c("Roosevelt"))
    FDR
    
    $first.name
    [1] "Franklin" "Delano"  
    
    $last.name
    [1] "Roosevelt"
    
    In [12]:
    FDR$last.name
    
    [1] "Roosevelt"
    In [13]:
    FDR[["last.name"]]
    
    [1] "Roosevelt"
    In [14]:
    UK <- list(
        country.name = c("England","Northern Ireland","Scotland",
                                                      "Wales"),
        population   = c(54786300,1851600,5373000,3099100),
        area.sq.km   = c(130279,13562,77933,20735),
        GVA.cap      = c(26159,18584,23685,18002))
    UK
    
    $country.name
    [1] "England"          "Northern Ireland" "Scotland"         "Wales"           
    
    $population
    [1] 54786300  1851600  5373000  3099100
    
    $area.sq.km
    [1] 130279  13562  77933  20735
    
    $GVA.cap
    [1] 26159 18584 23685 18002
    
    In [16]:
    data.frame(UK)
    
      country.name     population area.sq.km GVA.cap
    1 England          54786300   130279     26159  
    2 Northern Ireland  1851600    13562     18584  
    3 Scotland          5373000    77933     23685  
    4 Wales             3099100    20735     18002  
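
Cells In [6] and In [7] of this notebook put a vector inside double brackets, which R interprets as recursive indexing rather than as a selection of several elements. The sketch below illustrates that reading; it catches the error with tryCatch() because the wording of the message differs between R versions and locales.

```r
AList <- list(1:5,
              letters[1:6],
              c(TRUE, FALSE, FALSE, TRUE))

# [[1:2]] means: take element 1, then element 2 of that result
identical(AList[[1:2]], AList[[1]][[2]])

# [[1:3]] would need a third level, but AList[[1]][[2]] is a single number
res <- tryCatch(AList[[1:3]], error = function(e) e)
inherits(res, "error")

# Single brackets select several elements and always return a list
is.list(AList[1:2])
```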

Attributes

  • Script file: attributes.R
  • Interactive notebook:

    In [1]:
    options(jupyter.rich_display=FALSE) # Create output as usual in R
    
    In [2]:
    onetofour <- c(first=1,second=2,third=3,fourth=4)
    
    In [3]:
    attributes(onetofour)
    
    $names
    [1] "first"  "second" "third"  "fourth"
    
    In [4]:
    set.seed(42)
    satisfaction <- sample(1:4,size=20,replace=TRUE)
    
    In [5]:
    satisfaction <- ordered(satisfaction,
                           levels=1:4,
                           labels=c(
                               "not at all",
                               "low",
                               "medium",
                               "high"))
    attributes(satisfaction)
    
    $levels
    [1] "not at all" "low"        "medium"     "high"      
    
    $class
    [1] "ordered" "factor" 
    
    In [6]:
    attr(satisfaction,"levels")
    
    [1] "not at all" "low"        "medium"     "high"      
    In [7]:
    levels(satisfaction)
    
    [1] "not at all" "low"        "medium"     "high"      
    In [8]:
    attr(satisfaction,"class")
    
    [1] "ordered" "factor" 
    In [9]:
    class(satisfaction)
    
    [1] "ordered" "factor" 
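
The chapter overview mentions storing data on disk in an R-specific format, for which no notebook is listed here. The following is a minimal sketch, assuming saveRDS() and readRDS() as the functions in question (the chapter itself may use save() and load() instead). It also shows that attributes such as levels and class survive the round trip. The data values and the temporary file are made up for illustration.

```r
satisfaction <- ordered(c(1, 4, 2, 2),
                        levels = 1:4,
                        labels = c("not at all", "low", "medium", "high"))

path <- tempfile(fileext = ".rds")   # temporary file, hypothetical location
saveRDS(satisfaction, path)          # write the object in R's own format
restored <- readRDS(path)            # read it back under a new name

identical(restored, satisfaction)    # values and attributes are unchanged
attributes(restored)

unlink(path)                         # remove the temporary file
```

Unlike save(), saveRDS() stores a single object without its name, so the result of readRDS() can be assigned to any variable.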