Text as Data

Textual data have rapidly gained attention in the communication, social, and political sciences. Without their discussion, a companion to data management would be incomplete. This chapter starts with a discussion of basic operations on character strings, such as concatenation, search and replace. It then moves on to the management of corpora of text. It also discusses routine issues in the management of textual data, such as stemming, stop-word deletion, and the creation of term-frequency matrices.

Below is the supporting material for the various sections of the chapter.

Character Strings

  • length() versus nchar()

    • Script file: length-vs-nchar.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      some_great_rock_bands <- c("Led Zeppelin","Pink Floyd","Queen")
      
      In [3]:
      length(some_great_rock_bands)
      
      [1] 3
      In [4]:
      nchar(some_great_rock_bands)
      
      [1] 12 10  5
  • Character vector subsets versus substrings

    • Script file: subsets-vs-substrings.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      some_great_rock_bands <- c("Led Zeppelin","Pink Floyd","Queen")
      
      In [3]:
      some_great_rock_bands[1:2]
      
      [1] "Led Zeppelin" "Pink Floyd"  
      In [4]:
      substr(some_great_rock_bands,start=1,stop=2)
      
      [1] "Le" "Pi" "Qu"
      In [5]:
      substr(some_great_rock_bands,start=6,stop=15)
      
      [1] "eppelin" "Floyd"   ""       
  • Finding patterns within character strings and character vectors

    • Script file: finding-patterns.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      some_great_rock_bands <- c("Led Zeppelin","Pink Floyd","Queen")
      
      In [3]:
      grep("Zeppelin",some_great_rock_bands) # Just the indices
      
      [1] 1
      In [4]:
      grep("Zeppelin",some_great_rock_bands, value=TRUE) # the elements
      
      [1] "Led Zeppelin"
      In [5]:
      grepl("Zeppelin",some_great_rock_bands)
      
      [1]  TRUE FALSE FALSE
      In [6]:
      grep("[ei]n$",some_great_rock_bands,value=TRUE)
      
      [1] "Led Zeppelin" "Queen"       
  • Replacing patterns within character strings and character vectors

    • Script file: replacing-patterns.R
    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      some_great_rock_bands <- c("Led Zeppelin","Pink Floyd","Queen")
      
      In [3]:
      sub("e","i",some_great_rock_bands)
      
      [1] "Lid Zeppelin" "Pink Floyd"   "Quien"       
      In [4]:
      gsub("e","i",some_great_rock_bands)
      
      [1] "Lid Zippilin" "Pink Floyd"   "Quiin"       
      In [5]:
      gsub("([aeiouy]+)","[\\1]",some_great_rock_bands)
      
      [1] "L[e]d Z[e]pp[e]l[i]n" "P[i]nk Fl[oy]d"       "Q[uee]n"             
  • Counting words in the UK party manifesto on occasion of the 2017 election

    • Script file: counting-words-in-a-manifesto.R

      Data file used in the script: UKLabourParty_201706.csv which was downloaded from the Manifesto Project website. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled “Corpus & Documents”, and choose the UK Labour Party manifesto of 2017. This data file is not provided here, because the data are not freely redistributable. Instead one will need to register with the Manifesto Project to be able to download the data.

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      The file "UKLabourParty_201706.csv" was downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,

      1. pull down the "File" menu item and select "Open"
      2. An overview of the folder that contains the notebook opens.
      3. The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the Manifesto Project website.

      Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).

      In [2]:
      # First, the data are read in
      Labour.2017 <- read.csv("UKLabourParty_201706.csv",
                              stringsAsFactors=FALSE)
      
      In [3]:
      # Second, some non-ascii characters are substituted 
      Labour.2017$content <- gsub("\xE2\x80\x99","'",Labour.2017$content)
      str(Labour.2017)
      
      'data.frame':	1396 obs. of  3 variables:
       $ content : chr  "CREATING AN ECONOMY THAT WORKS FOR ALL" "Labour's economic strategy is about delivering a fairer, more prosperous society for the many, not just the few." "We will measure our economic success not by the number of billionaires, but by the ability of our people to live richer lives." "Labour understands that the creation of wealth is a collective endeavour between workers, entrepreneurs, invest"| __truncated__ ...
       $ cmp_code: chr  "H" "503" "503" "405" ...
       $ eu_code : logi  NA NA NA NA NA NA ...
      
      In [4]:
      # The variable 'content' contains the text of the manifesto 
      Labour.2017 <- Labour.2017$content
      Labour.2017[1:5]
      
      [1] "CREATING AN ECONOMY THAT WORKS FOR ALL"                                                                                            
      [2] "Labour's economic strategy is about delivering a fairer, more prosperous society for the many, not just the few."                  
      [3] "We will measure our economic success not by the number of billionaires, but by the ability of our people to live richer lives."    
      [4] "Labour understands that the creation of wealth is a collective endeavour between workers, entrepreneurs, investors and government."
      [5] "Each contributes and each must share fairly in the rewards."                                                                       
      In [5]:
      # The headings in the manifesto are all-uppercase, this helps
      # to identify them:
      Labour.2017.hlno <- which(Labour.2017==toupper(Labour.2017))
      Labour.2017.headings <- Labour.2017[Labour.2017.hlno]
      Labour.2017.headings[1:4]
      
      [1] "CREATING AN ECONOMY THAT WORKS FOR ALL"
      [2] "A FAIR TAXATION SYSTEM"                
      [3] "BALANCING THE BOOKS"                   
      [4] "INFRASTRUCTURE INVESTMENT"             
      In [6]:
      # All non-heading text is changed to lowercase
      labour.2017 <- tolower(Labour.2017[-Labour.2017.hlno])
      labour.2017[1:5]
      
      [1] "labour's economic strategy is about delivering a fairer, more prosperous society for the many, not just the few."                                                           
      [2] "we will measure our economic success not by the number of billionaires, but by the ability of our people to live richer lives."                                             
      [3] "labour understands that the creation of wealth is a collective endeavour between workers, entrepreneurs, investors and government."                                         
      [4] "each contributes and each must share fairly in the rewards."                                                                                                                
      [5] "this manifesto sets out labour's plan to upgrade our economy and rewrite the rules of a rigged system, so that our economy really works for the many, and not only the few."
      In [7]:
      # All lines that contain the pattern 'econom' are collected
      ecny.labour.2017 <- grep("econom",labour.2017,value=TRUE)
      ecny.labour.2017[1:5]
      
      [1] "labour's economic strategy is about delivering a fairer, more prosperous society for the many, not just the few."                                                           
      [2] "we will measure our economic success not by the number of billionaires, but by the ability of our people to live richer lives."                                             
      [3] "this manifesto sets out labour's plan to upgrade our economy and rewrite the rules of a rigged system, so that our economy really works for the many, and not only the few."
      [4] "britain is the only major developed economy where earnings have fallen even as growth has returned after the financial crisis."                                             
      [5] "we will upgrade our economy, breaking down the barriers that hold too many of us back,"                                                                                     
      In [8]:
      # Using 'strsplit()' the lines are split into words
      labour.2017.words <- strsplit(labour.2017,"[ ,.;:]+")
      str(labour.2017.words[1:5])
      
      List of 5
       $ : chr [1:18] "labour's" "economic" "strategy" "is" ...
       $ : chr [1:23] "we" "will" "measure" "our" ...
       $ : chr [1:17] "labour" "understands" "that" "the" ...
       $ : chr [1:10] "each" "contributes" "and" "each" ...
       $ : chr [1:32] "this" "manifesto" "sets" "out" ...
      
      In [9]:
      # The result is a list. We change it into a character vector.
      labour.2017.words <- unlist(labour.2017.words)
      labour.2017.words[1:20]
      
       [1] "labour's"   "economic"   "strategy"   "is"         "about"     
       [6] "delivering" "a"          "fairer"     "more"       "prosperous"
      [11] "society"    "for"        "the"        "many"       "not"       
      [16] "just"       "the"        "few"        "we"         "will"      
      In [10]:
      # We now count the words and look at the 20 most common ones.
      labour.2017.nwords <- table(labour.2017.words)
      labour.2017.nwords <- sort(labour.2017.nwords,decreasing=TRUE)
      labour.2017.nwords[1:20]
      
      labour.2017.words
         the    and     to   will     of      a     we     in labour    for    our 
        1202    947    832    664    625    438    418    369    313    312    244 
        that     on   with     by     is    are     as   have ensure 
         232    212    185    161    161    134    112    108    104 

Text Corpora with the tm Package

  • A simple example for the usage of the tm package

    • Script file: tm-simple.R

      The script makes use of the tm package, which is available from https://cran.r-project.org/package=tm

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      # Activating the 'tm' package
      library(tm)
      
      Loading required package: NLP
      
      
      In [3]:
      # We activate the 'acq' data, a corpus of 50 example news articles
      data(acq)
      acq
      
      <<VCorpus>>
      Metadata:  corpus specific: 0, document level (indexed): 0
      Content:  documents: 50
      In [4]:
      # We take a look at the first element in the corpus, a text document:
      class(acq[[1]])
      
      [1] "PlainTextDocument" "TextDocument"     
      In [5]:
      acq[[1]]
      
      <<PlainTextDocument>>
      Metadata:  15
      Content:  chars: 1287
      In [6]:
      inspect(acq[[1]])
      
      <<PlainTextDocument>>
      Metadata:  15
      Content:  chars: 1287
      
      Computer Terminal Systems Inc said
      it has completed the sale of 200,000 shares of its common
      stock, and warrants to acquire an additional one mln shares, to
      <Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.
          The company said the warrants are exercisable for five
      years at a purchase price of .125 dlrs per share.
          Computer Terminal said Sedio also has the right to buy
      additional shares and increase its total holdings up to 40 pct
      of the Computer Terminal's outstanding common stock under
      certain circumstances involving change of control at the
      company.
          The company said if the conditions occur the warrants would
      be exercisable at a price equal to 75 pct of its common stock's
      market price at the time, not to exceed 1.50 dlrs per share.
          Computer Terminal also said it sold the technolgy rights to
      its Dot Matrix impact technology, including any future
      improvements, to <Woodco Inc> of Houston, Tex. for 200,000
      dlrs. But, it said it would continue to be the exclusive
      worldwide licensee of the technology for Woodco.
          The company said the moves were part of its reorganization
      plan and would help pay current operation costs and ensure
      product delivery.
          Computer Terminal makes computer generated labels, forms,
      tags and ticket printers and terminals.
       Reuter
      
      In [7]:
      # We take a look at the document metadata
      meta(acq[[1]])
      
        author       : character(0)
        datetimestamp: 1987-02-26 15:18:06
        description  : 
        heading      : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
        id           : 10
        language     : en
        origin       : Reuters-21578 XML
        topics       : YES
        lewissplit   : TRAIN
        cgisplit     : TRAINING-SET
        oldid        : 5553
        places       : usa
        people       : character(0)
        orgs         : character(0)
        exchanges    : character(0)
      In [8]:
      DublinCore(acq[[1]])
      
        contributor: character(0)
        coverage   : character(0)
        creator    : character(0)
        date       : 1987-02-26 15:18:06
        description: 
        format     : character(0)
        identifier : 10
        language   : en
        publisher  : character(0)
        relation   : character(0)
        rights     : character(0)
        source     : character(0)
        subject    : character(0)
        title      : COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE
        type       : character(0)
  • A more complicated example involving data from the Manifesto Project

    • Script file: tm-ManifestoProject.R

      The script makes use of the tm package, which is available from https://cran.r-project.org/package=tm

      Data file used in the script: 51420_196410.csv, 51420_196603.csv, 51420_197006.csv, 51420_197402.csv, 51420_197410.csv, 51420_197905.csv, 51420_198306.csv, 51420_198706.csv, 51420_199204.csv, 51420_199705.csv, 51420_200106.csv, 51420_200505.csv, 51420_201505.csv, and 51420_201706.csv. The data files were downloaded from the Manifesto Project website and put into the directory ManifestoProject. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled “Corpus & Documents”, and choose the UK Labour Party manifesto of 2017. This data file is not provided here, because the data are not freely redistributable. Instead one will need to register with the Manifesto Project to be able to download the data.

      Additionally, the script makes use of the file documents_MPDataset_MPDS2019b.csv, which was converted from the Excel file documents_MPDataset_MPDS2019b.xlsx available from the Manifesto Project website. Since this file is not subject to any restrictions, it is available for download from the given link.

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      

      The file CSV-files in folder "Manifesto Project" were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,

      1. pull down the "File" menu item and select "Open"
      2. An overview of the folder that contains the notebook opens.
      3. The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the Manifesto Project website.

      Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).

      In [2]:
      # The Manifesto Project data is contained in a collection of CSV files
      csv.files <- dir("ManifestoProject",full.names=TRUE,
                       pattern="*.csv")
      csv.files
      
       [1] "ManifestoProject/51420_196410.csv" "ManifestoProject/51420_196603.csv"
       [3] "ManifestoProject/51420_197006.csv" "ManifestoProject/51420_197402.csv"
       [5] "ManifestoProject/51420_197410.csv" "ManifestoProject/51420_197905.csv"
       [7] "ManifestoProject/51420_198306.csv" "ManifestoProject/51420_198706.csv"
       [9] "ManifestoProject/51421_199204.csv" "ManifestoProject/51421_199705.csv"
      [11] "ManifestoProject/51421_200106.csv" "ManifestoProject/51421_200505.csv"
      [13] "ManifestoProject/51421_201505.csv" "ManifestoProject/51421_201706.csv"
      In [3]:
      # This file contains the relevant metadata:
      # It is available (withou registration) from
      # https://manifesto-project.wzb.eu/down/data/2019b/codebooks/documents_MPDataset_MPDS2019b.xlsx
      # in Excel format
      manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv",
                                     stringsAsFactors=FALSE)
      
      In [4]:
      library(tm)
      
      Loading required package: NLP
      
      
      In [5]:
      # The following code does not work, due to the peculiar structure of the CSV files
      manifesto.corpus <- VCorpus(DirSource("ManifestoProject"))
      
      In [6]:
      # To deal with the problem created by the peculiar structure of the files, we
      # define a helper function:
      getMDoc <- function(file,metadata.file){
          df <- read.csv(file,
                         stringsAsFactors=FALSE)
          content <- paste(df[,1],collapse="\n")
          
          fn <- basename(file)
          fn <- sub(".csv","",fn,fixed=TRUE)
          fn12 <- unlist(strsplit(fn,"_"))
      
          partycode <- as.numeric(fn12[1])
          datecode <- as.numeric(fn12[2])
          year <- datecode %/% 100
          month <- datecode %% 100
          datetime <- ISOdate(year=year,month=month,day=1)
      
          mf.meta <- subset(metadata.file,
                            party==partycode & date == datecode)
          if(!length(mf.meta$language))
              mf.meta$language <- "english"
              
          PlainTextDocument(
              content,
              id = fn,
              heading = mf.meta$title,
              datetimestamp = as.POSIXlt(datetime),
              language = mf.meta$language,
              partyname = mf.meta$partyname,
              partycode = partycode,
              datecode = datecode
          )
      }
      
      In [7]:
      # With the helper function we now create a corpus of UK manifestos:
      UKLib.docs <- lapply(csv.files,getMDoc,
                           metadata.file=manifesto.metadata)
      UKLib.Corpus <- as.VCorpus(UKLib.docs)
      UKLib.Corpus
      
      <<VCorpus>>
      Metadata:  corpus specific: 0, document level (indexed): 0
      Content:  documents: 14
      In [8]:
      UKLib.Corpus[[14]]
      
      <<PlainTextDocument>>
      Metadata:  10
      Content:  chars: 130585
      In [9]:
      # We need to deal with the non-ASCII characters, so we define yet another helper
      # function:
      handleUTF8quotes <- function(x){
          cx <- content(x)
          cx <- gsub("\xe2\x80\x98","'",cx)
          cx <- gsub("\xe2\x80\x99","'",cx)
          cx <- gsub("\xe2\x80\x9a",",",cx)
          cx <- gsub("\xe2\x80\x9b","`",cx)
          cx <- gsub("\xe2\x80\x9c","\"",cx)
          cx <- gsub("\xe2\x80\x9d","\"",cx)
          cx <- gsub("\xe2\x80\x9e","\"",cx)
          cx <- gsub("\xe2\x80\x9f","\"",cx)
          content(x) <- cx
          x
      }
      
      In [10]:
      # Another helper function is needed to change the texts into lowercase:
      toLower <- function(x) {
          content(x) <- tolower(content(x))
          x
      }
      
      In [11]:
      # We overwrite the 'inspect' method for "TextDocument" objects to a variant that shows only the first
      # 20 lines:
      inspect.TextDocument <- function(x){
          print(x)
          cat("\n")
          str <- as.character(x)
          str <- substr(x,start=0,stop=500)
          str <- paste(str,"... ...")
          writeLines(str)
          invisible(x)
      }
      
      In [12]:
      UKLib.Corpus.processed <- tm_map(UKLib.Corpus,handleUTF8quotes)
      UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,toLower)
      inspect(UKLib.Corpus.processed[[14]])
      
      <<PlainTextDocument>>
      Metadata:  10
      Content:  chars: 130585
      
      1 protect britain's place in europe
      1.1 giving the people the final say
      liberal democrats are open and outward-looking.
      we passionately believe that britain's relationship with its neighbours is stronger as part of the european union.
      whatever its imperfections, the eu remains the best framework for working effectively and co-operating in the pursuit of our shared aims.
      it has led directly to greater prosperity,
      increased trade,
      investment and jobs,
      better security
      and a greener environment.
      bri ... ...
      
      In [13]:
      UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removeNumbers)
      UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removePunctuation)
      inspect(UKLib.Corpus.processed[[14]])
      
      <<PlainTextDocument>>
      Metadata:  10
      Content:  chars: 127677
      
       protect britains place in europe
       giving the people the final say
      liberal democrats are open and outwardlooking
      we passionately believe that britains relationship with its neighbours is stronger as part of the european union
      whatever its imperfections the eu remains the best framework for working effectively and cooperating in the pursuit of our shared aims
      it has led directly to greater prosperity
      increased trade
      investment and jobs
      better security
      and a greener environment
      britain is better o ... ...
      
      In [14]:
      UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,stemDocument)
      inspect(UKLib.Corpus.processed[[14]])
      
      <<PlainTextDocument>>
      Metadata:  10
      Content:  chars: 112157
      
      protect britain place in europ give the peopl the final say liber democrat are open and outwardlook we passion believ that britain relationship with it neighbour is stronger as part of the european union whatev it imperfect the eu remain the best framework for work effect and cooper in the pursuit of our share aim it has led direct to greater prosper increas trade invest and job better secur and a greener environ britain is better off in the eu liber democrat campaign for the uk to remain in the ... ...
      
      In [15]:
      # After preprocessing the text documents we obtain a document-term matrix:
      UKLib.dtm <- DocumentTermMatrix(UKLib.Corpus.processed)
      UKLib.dtm
      
      <<DocumentTermMatrix (documents: 14, terms: 5940)>>
      Non-/sparse entries: 24547/58613
      Sparsity           : 70%
      Maximal term length: 27
      Weighting          : term frequency (tf)
      In [16]:
      # The various preprocessing steps can be combined into a single step:
      UKLib.dtm <- DocumentTermMatrix(
          tm_map(UKLib.Corpus,handleUTF8quotes),
          control=list(
              tolower=TRUE,
              removePunctuation=TRUE,
              removeNumber=TRUE,
              stopwords=TRUE,
              language="en",
              stemming=TRUE
          ))
      UKLib.dtm
      
      <<DocumentTermMatrix (documents: 14, terms: 6289)>>
      Non-/sparse entries: 24105/63941
      Sparsity           : 73%
      Maximal term length: 27
      Weighting          : term frequency (tf)

Improvements Provided by the quanteda Package

  • A basic example for the usage of the quanteda package

    • Script file: quanteda-basic.R

      The script makes use of the quanteda package, which is available from https://cran.r-project.org/package=quanteda

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      library(quanteda)
      
      Package version: 2.1.1
      
      Parallel computing: 2 of 12 threads used.
      
      See https://quanteda.io for tutorials and examples.
      
      
      Attaching package: ‘quanteda’
      
      
      The following object is masked from ‘jupyter:irkernel’:
      
          View
      
      
      The following object is masked from ‘package:utils’:
      
          View
      
      
      
      In [3]:
      quanteda_options(print_corpus_max_ndoc=3)
      
      In [4]:
      # This is an example corpus contained in the 'quanteda' package
      data_corpus_inaugural
      
      Corpus consisting of 58 documents and 4 docvars.
      1789-Washington :
      "Fellow-Citizens of the Senate and of the House of Representa..."
      
      1793-Washington :
      "Fellow citizens, I am again called upon by the voice of my c..."
      
      1797-Adams :
      "When it was first perceived, in early times, that no middle ..."
      
      [ reached max_ndoc ... 55 more documents ]
      In [5]:
      mode(data_corpus_inaugural)
      
      [1] "character"
      In [6]:
      class(data_corpus_inaugural)
      
      [1] "corpus"    "character"
      In [7]:
      data_corpus_inaugural[1:3]
      
      Corpus consisting of 3 documents and 4 docvars.
      1789-Washington :
      "Fellow-Citizens of the Senate and of the House of Representa..."
      
      1793-Washington :
      "Fellow citizens, I am again called upon by the voice of my c..."
      
      1797-Adams :
      "When it was first perceived, in early times, that no middle ..."
      
      In [8]:
      str(docvars(data_corpus_inaugural))
      
      'data.frame':	58 obs. of  4 variables:
       $ Year     : int  1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 ...
       $ President: chr  "Washington" "Washington" "Adams" "Jefferson" ...
       $ FirstName: chr  "George" "George" "John" "Thomas" ...
       $ Party    : Factor w/ 6 levels "Democratic","Democratic-Republican",..: 4 4 3 2 2 2 2 2 2 2 ...
      
      In [9]:
      docvars(data_corpus_inaugural,"Year")
      
       [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
      [16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
      [31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
      [46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017
      In [10]:
      data_corpus_inaugural$Year
      
       [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
      [16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
      [31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
      [46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017
      In [11]:
      corpus_subset(data_corpus_inaugural, Year > 1945)
      
      Corpus consisting of 18 documents and 4 docvars.
      1949-Truman :
      "Mr. Vice President, Mr. Chief Justice, and fellow citizens, ..."
      
      1953-Eisenhower :
      "My friends, before I begin the expression of those thoughts ..."
      
      1957-Eisenhower :
      "The Price of Peace Mr. Chairman, Mr. Vice President, Mr. Chi..."
      
      [ reached max_ndoc ... 15 more documents ]
      In [12]:
      subset.corpus <- function(x,...) corpus_subset(x,...)
      
      In [13]:
      subset(data_corpus_inaugural, Year > 1945)
      
      Corpus consisting of 18 documents and 4 docvars.
      1949-Truman :
      "Mr. Vice President, Mr. Chief Justice, and fellow citizens, ..."
      
      1953-Eisenhower :
      "My friends, before I begin the expression of those thoughts ..."
      
      1957-Eisenhower :
      "The Price of Peace Mr. Chairman, Mr. Vice President, Mr. Chi..."
      
      [ reached max_ndoc ... 15 more documents ]
      In [14]:
      docs_containing <- function(x,pattern,...) x[grep(pattern,x,...)]
      
      In [15]:
      c_sub <- docs_containing(data_corpus_inaugural,"[Cc]arnage")
      c_sub$President
      
      [1] "Trump"
      In [16]:
      inaugural_sntc <- corpus_reshape(data_corpus_inaugural,
                                       to="sentences")
      inaugural_sntc
      
      Corpus consisting of 5,018 documents and 4 docvars.
      1789-Washington.1 :
      "Fellow-Citizens of the Senate and of the House of Representa..."
      
      1789-Washington.2 :
      "On the one hand, I was summoned by my Country, whose voice I..."
      
      1789-Washington.3 :
      "On the other hand, the magnitude and difficulty of the trust..."
      
      [ reached max_ndoc ... 5,015 more documents ]
      In [17]:
      sntcl <- cbind(docvars(inaugural_sntc),
                     len=nchar(inaugural_sntc))
      head(sntcl)
      
                        Year President  FirstName Party len
      1789-Washington.1 1789 Washington George    none  278
      1789-Washington.2 1789 Washington George    none  478
      1789-Washington.3 1789 Washington George    none  436
      1789-Washington.4 1789 Washington George    none  179
      1789-Washington.5 1789 Washington George    none  515
      1789-Washington.6 1789 Washington George    none  654
      In [18]:
      sntcl.year <- aggregate(len~Year,data=sntcl,mean)
      with(sntcl.year,
           scatter.smooth(Year,len,ylab="Average length of sentences in characters"))
      
      In [19]:
      inaugural_ <- corpus_reshape(data_corpus_inaugural,
                                   to="documents")
      all(inaugural_$Year == data_corpus_inaugural$Year)
      
      [1] TRUE
  • Obtaining tokens in text documents using quanteda

    • Script file: quanteda-tokens.R

      The script makes use of the quanteda package, which is available from https://cran.r-project.org/package=quanteda

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      
      In [2]:
      library(quanteda)
      
      In [2]:
      quanteda_options(print_tokens_max_ndoc=3,
                       print_tokens_max_ntoken=6)
      
      Package version: 2.1.1
      
      Parallel computing: 2 of 12 threads used.
      
      See https://quanteda.io for tutorials and examples.
      
      
      Attaching package: ‘quanteda’
      
      
      The following object is masked from ‘jupyter:irkernel’:
      
          View
      
      
      The following object is masked from ‘package:utils’:
      
          View
      
      
      
      In [3]:
      inaugural_toks <- tokens(data_corpus_inaugural)
      inaugural_toks
      
      Tokens consisting of 58 documents and 4 docvars.
      1789-Washington :
      [1] "Fellow-Citizens" "of"              "the"             "Senate"         
      [5] "and"             "of"             
      [ ... and 1,531 more ]
      
      1793-Washington :
      [1] "Fellow"   "citizens" ","        "I"        "am"       "again"   
      [ ... and 141 more ]
      
      1797-Adams :
      [1] "When"      "it"        "was"       "first"     "perceived" ","        
      [ ... and 2,571 more ]
      
      [ reached max_ndoc ... 55 more documents ]
      In [4]:
      inaugural_ntoks <- sapply(inaugural_toks,
                                length)
      inaugural_ntoks <- cbind(docvars(inaugural_toks),
                               ntokens = inaugural_ntoks)
      
      In [5]:
      with(inaugural_ntoks,
           scatter.smooth(Year,ntokens,
                          ylab="Number of tokens per speech"))
      
      In [7]:
      inaugural_sntc <- corpus_reshape(data_corpus_inaugural,
                                       to="sentences")
      inaugural_sntc_toks <- tokens(inaugural_sntc)
      inaugural_sntc_ntoks <- sapply(inaugural_sntc_toks,
                                     length)
      inaugural_sntc_ntoks <- cbind(docvars(inaugural_sntc_toks),
                                    ntokens = inaugural_sntc_ntoks)
      
      In [9]:
      inaugural_sntc_ntoks <- aggregate (ntokens~Year,mean,
                                      data = inaugural_sntc_ntoks)
      
      In [10]:
      with(inaugural_sntc_ntoks,
           scatter.smooth(Year,ntokens,
                          ylab="Number of tokens per sentence"))
      
  • Preparing Manifesto Project data using quanteda

    • Script file: quanteda-ManifestoProject.R

      The script makes use of

      Data file used in the script: 51420_196410.csv, 51420_196603.csv, 51420_197006.csv, 51420_197402.csv, 51420_197410.csv, 51420_197905.csv, 51420_198306.csv, 51420_198706.csv, 51420_199204.csv, 51420_199705.csv, 51420_200106.csv, 51420_200505.csv, 51420_201505.csv, and 51420_201706.csv. The data files were downloaded from the Manifesto Project website and put into the directory ManifestoProject. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled “Corpus & Documents”, and choose the UK Labour Party manifesto of 2017. This data file is not provided here, because the data are not freely redistributable. Instead one will need to register with the Manifesto Project to be able to download the data.

      Additionally, the script makes use of the file documents_MPDataset_MPDS2019b.csv, which was converted from the Excel file documents_MPDataset_MPDS2019b.xlsx available from the Manifesto Project website. Since this file is not subject to any restrictions, it is available for download from the given link.

    • Interactive notebook:

      In [1]:
      options(jupyter.rich_display=FALSE) # Create output as usual in R
      options(width=120)
      

      The file CSV-files in folder "Manifesto Project" were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,

      1. pull down the "File" menu item and select "Open"
      2. An overview of the folder that contains the notebook opens.
      3. The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the Manifesto Project website.

      Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).

      In [2]:
      csv.files <- dir("ManifestoProject",
                       full.names=TRUE,
                       pattern="*.csv")
      length(csv.files)
      
      [1] 14
      In [3]:
      # 'readtext' (a companion package for 'quanteda') is somewhat better able to
      # deal with the Manfisto Project CSV files than 'tm':
      library(readtext)
      UKLib.rt <- readtext("ManifestoProject/*.csv",
                     text_field=1,
                     docvarsfrom="filenames",
                     docvarnames=c("party","date"))
      nrow(UKLib.rt)
      
      [1] 4228
      In [4]:
      # Here we create an index of documents in the corpus:
      UKLib.rta <- aggregate(text~party+date,
                             FUN=function(x)paste(x,collapse=" "),
                             data=UKLib.rt)
      nrow(UKLib.rta)
      
      [1] 14
      In [5]:
      UKLib.rta <- within(UKLib.rta,
                    doc_id <- paste(party,date,sep="_"))
      
      In [6]:
      library(quanteda)
      
      Package version: 2.1.1
      
      Parallel computing: 2 of 12 threads used.
      
      See https://quanteda.io for tutorials and examples.
      
      
      Attaching package: ‘quanteda’
      
      
      The following object is masked from ‘jupyter:irkernel’:
      
          View
      
      
      The following object is masked from ‘package:utils’:
      
          View
      
      
      
      In [7]:
      UKLib.corpus <- corpus(UKLib.rta)
      UKLib.corpus
      
      Corpus consisting of 14 documents and 2 docvars.
      51420_196410 :
      """THINK FOR YOURSELF""  The Liberal Party offers the elector..."
      
      51420_196603 :
      "For All the People: the Liberal Plan of 1966  BRITAIN DEMAND..."
      
      51420_197006 :
      "What a Life!  There must surely be a better way to run a cou..."
      
      51420_197402 :
      "'Change the face of Britain'  THE CRISIS OF GOVERNMENT  This..."
      
      51420_197410 :
      "Why Britain Needs Liberal Government  A PERSONAL MESSAGE FRO..."
      
      51420_197905 :
      "'The Real Fight is for Britain'  INTRODUCTION  With your sup..."
      
      [ reached max_ndoc ... 8 more documents ]
      In [8]:
      # Here we combine metadata with the text documents:
      manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv",stringsAsFactors=FALSE)
      str(manifesto.metadata)
      
      'data.frame':	4492 obs. of  6 variables:
       $ country    : int  11 11 11 11 11 11 11 11 11 11 ...
       $ countryname: chr  "Sweden" "Sweden" "Sweden" "Sweden" ...
       $ party      : int  11110 11110 11110 11110 11110 11110 11110 11110 11110 11220 ...
       $ partyname  : chr  "Green Ecology Party" "Green Ecology Party" "Green Ecology Party" "Green Ecology Party" ...
       $ date       : int  198809 199109 199409 199809 200209 200609 201009 201409 201809 194409 ...
       $ title      : chr  "Valmanifest" "Valmanifest ‘91" "Valmanifest" "Valmanifest 98" ...
      
      In [9]:
      docvars(UKLib.corpus) <- merge(docvars(UKLib.corpus),
                                     manifesto.metadata,
                                     by=c("party","date"))
      str(docvars(UKLib.corpus))
      
      'data.frame':	14 obs. of  6 variables:
       $ party      : int  51420 51420 51420 51420 51420 51420 51420 51420 51421 51421 ...
       $ date       : int  196410 196603 197006 197402 197410 197905 198306 198706 199204 199705 ...
       $ country    : int  51 51 51 51 51 51 51 51 51 51 ...
       $ countryname: chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
       $ partyname  : chr  "Liberal Party" "Liberal Party" "Liberal Party" "Liberal Party" ...
       $ title      : chr  "Think for Yourself - Vote Liberal" "For all the People: The Liberal Plan of 1966" "What a Life!" "Change the Face of Britain" ...
      
      In [10]:
      # Finally we create a document-feature matrix, without punctuation, numbers,
      # symbols and stopwords:
      UKLib.dfm <- dfm(UKLib.corpus,
                       remove_punct=TRUE,
                       remove_numbers=TRUE,
                       remove_symbols=TRUE,
                       remove=stopwords("english"),
                       stem=TRUE)
      str(docvars(UKLib.dfm))
      
      'data.frame':	14 obs. of  6 variables:
       $ party      : int  51420 51420 51420 51420 51420 51420 51420 51420 51421 51421 ...
       $ date       : int  196410 196603 197006 197402 197410 197905 198306 198706 199204 199705 ...
       $ country    : int  51 51 51 51 51 51 51 51 51 51 ...
       $ countryname: chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
       $ partyname  : chr  "Liberal Party" "Liberal Party" "Liberal Party" "Liberal Party" ...
       $ title      : chr  "Think for Yourself - Vote Liberal" "For all the People: The Liberal Plan of 1966" "What a Life!" "Change the Face of Britain" ...
      
      In [11]:
      # A more fine-grained control is possible using 'tokens()'
      UKLib.toks <- tokens(UKLib.corpus,
                           remove_punct=TRUE,
                           remove_numbers=TRUE)
      UKLib.toks
      
      Tokens consisting of 14 documents and 6 docvars.
      51420_196410 :
       [1] "THINK"         "FOR"           "YOURSELF"      "The"           "Liberal"       "Party"         "offers"       
       [8] "the"           "electorate"    "a"             "radical"       "non-Socialist"
      [ ... and 8,853 more ]
      
      51420_196603 :
       [1] "For"     "All"     "the"     "People"  "the"     "Liberal" "Plan"    "of"      "BRITAIN" "DEMANDS" "A"      
      [12] "NEW"    
      [ ... and 31,786 more ]
      
      51420_197006 :
       [1] "What"   "a"      "Life"   "There"  "must"   "surely" "be"     "a"      "better" "way"    "to"     "run"   
      [ ... and 23,962 more ]
      
      51420_197402 :
       [1] "Change"     "the"        "face"       "of"         "Britain"    "THE"        "CRISIS"     "OF"        
       [9] "GOVERNMENT" "This"       "country"    "has"       
      [ ... and 13,764 more ]
      
      51420_197410 :
       [1] "Why"        "Britain"    "Needs"      "Liberal"    "Government" "A"          "PERSONAL"   "MESSAGE"   
       [9] "FROM"       "THE"        "RT"         "HON"       
      [ ... and 10,485 more ]
      
      51420_197905 :
       [1] "The"          "Real"         "Fight"        "is"           "for"          "Britain"      "INTRODUCTION"
       [8] "With"         "your"         "support"      "this"         "election"    
      [ ... and 11,438 more ]
      
      [ reached max_ndoc ... 8 more documents ]
      In [12]:
      UKLib.dfm <- dfm(UKLib.toks)
      UKLib.dfm
      
      Document-feature matrix of: 14 documents, 9,908 features (73.8% sparse) and 6 docvars.
                    features
      docs           think for yourself  the liberal party offers electorate   a radical
        51420_196410     2 132        2  530      32     4      1          1 195       3
        51420_196603     1 489        1 1824      81    12      3          1 597       6
        51420_197006     1 373        0 1336     132    19      1          0 414       1
        51420_197402     2 174        0  852      38    29      1          0 291       4
        51420_197410     1 166        0  634      30    24      0          6 232       4
        51420_197905     1 174        0  700      36    10      1          6 237       7
      [ reached max_ndoc ... 8 more documents, reached max_nfeat ... 9,898 more features ]
      In [13]:
      UKLib.dfm <- dfm_remove(UKLib.dfm,
                              pattern=stopwords("english"))
      UKLib.dfm
      
      Document-feature matrix of: 14 documents, 9,765 features (74.4% sparse) and 6 docvars.
                    features
      docs           think liberal party offers electorate radical non-socialist alternative long run
        51420_196410     2      32     4      1          1       3             1           5    4   4
        51420_196603     1      81    12      3          1       6             0           7   26   6
        51420_197006     1     132    19      1          0       1             0           8    6   6
        51420_197402     2      38    29      1          0       4             0           1   11   4
        51420_197410     1      30    24      0          6       4             0           4    9   3
        51420_197905     1      36    10      1          6       7             0           3   10   4
      [ reached max_ndoc ... 8 more documents, reached max_nfeat ... 9,755 more features ]
      In [14]:
      UKLib.dfm <- dfm_wordstem(UKLib.dfm,language="english")
      UKLib.dfm
      
      Document-feature matrix of: 14 documents, 6,166 features (70.6% sparse) and 6 docvars.
                    features
      docs           think liber parti offer elector radic non-socialist altern long run
        51420_196410     3    46     7     1       1     3             1      6    4   7
        51420_196603     3    98    19    20       7     6             0     10   26  12
        51420_197006     2   149    31     6       2     1             0     15    6  10
        51420_197402     2    67    42     9       3     4             0      2   11   5
        51420_197410     3    44    31     1      11     5             0      5    9   4
        51420_197905     1    64    26     7      18     7             0      6   10   7
      [ reached max_ndoc ... 8 more documents, reached max_nfeat ... 6,156 more features ]
      In [15]:
      # 'quanteda' provides support for dictionaries:
      milecondict <- dictionary(list(
                      Military=c("military","forces","war","defence","victory","victorious","glory"),
                      Economy=c("economy","growth","business","enterprise","market")
      ))
      
      In [16]:
      # Here we extract the frequency of tokens belonging to certain dictionaries:
      UKLib.milecon.dfm <- dfm(UKLib.corpus,
                               dictionary=milecondict)
      UKLib.milecon.dfm
      
      Document-feature matrix of: 14 documents, 2 features (0.0% sparse) and 6 docvars.
                    features
      docs           Military Economy
        51420_196410       11      29
        51420_196603       40      83
        51420_197006       31      77
        51420_197402       14      50
        51420_197410        5      31
        51420_197905       23      34
      [ reached max_ndoc ... 8 more documents ]
      In [17]:
      time <- with(docvars(UKLib.milecon.dfm),
                   ISOdate(year=date%/%100,
                           month=date%%100,
                           day=1))
      time
      
       [1] "1964-10-01 12:00:00 GMT" "1966-03-01 12:00:00 GMT" "1970-06-01 12:00:00 GMT" "1974-02-01 12:00:00 GMT"
       [5] "1974-10-01 12:00:00 GMT" "1979-05-01 12:00:00 GMT" "1983-06-01 12:00:00 GMT" "1987-06-01 12:00:00 GMT"
       [9] "1992-04-01 12:00:00 GMT" "1997-05-01 12:00:00 GMT" "2001-06-01 12:00:00 GMT" "2005-05-01 12:00:00 GMT"
      [13] "2015-05-01 12:00:00 GMT" "2017-06-01 12:00:00 GMT"
      In [18]:
      UKLib.ntok <- ntoken(UKLib.corpus)
      
      In [19]:
      milit.freq <- as.vector(UKLib.milecon.dfm[,"Military"])
      econ.freq <- as.vector(UKLib.milecon.dfm[,"Economy"])
      milit.prop <- milit.freq/UKLib.ntok
      econ.prop <- econ.freq/UKLib.ntok
      
      In [20]:
      # We plot the frequency of tokens over time
      op <- par(mfrow=c(2,1),mar=c(3,4,0,0))
      plot(time,milit.prop,type="p",ylab="Military")
      lines(time,lowess(time,milit.prop)$y)
      plot(time,econ.prop,type="p",ylab="Economy")
      lines(time,lowess(time,econ.prop)$y)
      par(op)