Text as Data¶
Textual data have rapidly gained attention in the communication, social, and political sciences, and a companion to data management would be incomplete without discussing them. This chapter starts with basic operations on character strings, such as concatenation and search-and-replace. It then moves on to the management of corpora of text, and also discusses routine tasks in the management of textual data, such as stemming, stop-word deletion, and the creation of term-frequency matrices.
Below is the supporting material for the various sections of the chapter.
Character Strings¶
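The script files below cover lengths, substrings, and patterns. Concatenation, which the chapter also mentions, has no separate script here; the following is a minimal illustrative sketch (not taken from the chapter's materials):

some_words <- c("text", "as", "data")
paste(some_words, collapse=" ")   # "text as data" -- join a vector into one string
paste("document", 1:3, sep="-")   # "document-1" "document-2" "document-3"
paste0("term", 1:3)               # "term1" "term2" "term3" -- no separator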
- length() versus nchar()
  - Script file: length-vs-nchar.R
  - Interactive notebook: not reproduced here; see the sketch below.
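As an illustrative sketch of the distinction (inferred from the script's title, not copied from it): length() counts the elements of a character vector, while nchar() counts the characters within each element:

some_great_rock_bands <- c("Led Zeppelin", "Pink Floyd", "Queen")
length(some_great_rock_bands)  # 3 -- the vector has three elements
nchar(some_great_rock_bands)   # 12 10 5 -- characters in each element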
- Character vector subsets versus substrings
  - Script file: subsets-vs-substrings.R
  - Interactive notebook: not reproduced here; see the sketch below.
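Again, a minimal sketch of what the script's title refers to (illustrative only): the bracket operator selects elements of a character vector, whereas substr() extracts characters from within a string:

some_great_rock_bands <- c("Led Zeppelin", "Pink Floyd", "Queen")
some_great_rock_bands[2]                # "Pink Floyd" -- an element (subset)
substr(some_great_rock_bands[2], 1, 4)  # "Pink" -- a substring of that element
substr(some_great_rock_bands, 1, 3)     # "Led" "Pin" "Que" -- substr() is vectorised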
- Finding patterns within character strings and character vectors
  - Script file: finding-patterns.R
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:some_great_rock_bands <- c("Led Zeppelin","Pink Floyd","Queen")
In [3]:grep("Zeppelin",some_great_rock_bands) # Just the indices
In [4]:grep("Zeppelin",some_great_rock_bands, value=TRUE) # the elements
In [5]:grepl("Zeppelin",some_great_rock_bands)
In [6]:grep("[ei]n$",some_great_rock_bands,value=TRUE)
- Replacing patterns within character strings and character vectors
  - Script file: replacing-patterns.R
  - Interactive notebook: not reproduced here; see the sketch below.
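As an illustrative sketch of the replacement functions the script's title refers to (assumed, since the notebook cells are not shown here): sub() replaces the first match of a pattern, while gsub() replaces all matches:

x <- "data management, data analysis, data archiving"
sub("data", "text", x)   # "text management, data analysis, data archiving"
gsub("data", "text", x)  # "text management, text analysis, text archiving"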
- Counting words in the UK Labour Party manifesto for the 2017 election
  - Script file: counting-words-in-a-manifesto.R
    Data file used in the script: UKLabourParty_201706.csv, which was downloaded from the Manifesto Project website. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled "Corpus & Documents", and choose the UK Labour Party manifesto of 2017. This data file is not provided here, because the data are not freely redistributable; instead, one needs to register with the Manifesto Project to be able to download the data.
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
The file "UKLabourParty_201706.csv" was downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to preproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this,
- pull down the "File" menu item and select "Open"
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled "Upload". Use this to upload the file that you downloaded from the Manifesto Project website.
Note that the uploaded data will disappear, once you "Quit" the notebook (and the Jupyter instance).
In [2]:
# First, the data are read in
Labour.2017 <- read.csv("UKLabourParty_201706.csv", stringsAsFactors=FALSE)
In [3]:
# Second, some non-ASCII characters are substituted
Labour.2017$content <- gsub("\xE2\x80\x99","'",Labour.2017$content)
str(Labour.2017)
In [4]:
# The variable 'content' contains the text of the manifesto
Labour.2017 <- Labour.2017$content
Labour.2017[1:5]
In [5]:
# The headings in the manifesto are all-uppercase; this helps
# to identify them:
Labour.2017.hlno <- which(Labour.2017 == toupper(Labour.2017))
Labour.2017.headings <- Labour.2017[Labour.2017.hlno]
Labour.2017.headings[1:4]
In [6]:
# All non-heading text is changed to lowercase
labour.2017 <- tolower(Labour.2017[-Labour.2017.hlno])
labour.2017[1:5]
In [7]:
# All lines that contain the pattern 'econom' are collected
ecny.labour.2017 <- grep("econom", labour.2017, value=TRUE)
ecny.labour.2017[1:5]
In [8]:
# Using 'strsplit()' the lines are split into words
labour.2017.words <- strsplit(labour.2017, "[ ,.;:]+")
str(labour.2017.words[1:5])
In [9]:
# The result is a list; we change it into a character vector
labour.2017.words <- unlist(labour.2017.words)
labour.2017.words[1:20]
In [10]:
# We now count the words and look at the 20 most common ones
labour.2017.nwords <- table(labour.2017.words)
labour.2017.nwords <- sort(labour.2017.nwords, decreasing=TRUE)
labour.2017.nwords[1:20]
Text Corpora with the tm Package¶
- A simple example for the usage of the tm package
  - Script file: tm-simple.R
    The script makes use of the tm package, which is available from https://cran.r-project.org/package=tm
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:
# Activating the 'tm' package
library(tm)
In [3]:
# We activate the 'acq' data, a corpus of 50 example news articles
data(acq)
acq
In [4]:
# We take a look at the first element in the corpus, a text document:
class(acq[[1]])
In [5]:acq[[1]]
In [6]:inspect(acq[[1]])
In [7]:
# We take a look at the document metadata
meta(acq[[1]])
In [8]:DublinCore(acq[[1]])
- A more complicated example involving data from the Manifesto Project
  - Script file: tm-ManifestoProject.R
    The script makes use of the tm package, which is available from https://cran.r-project.org/package=tm
    Data files used in the script: 51420_196410.csv, 51420_196603.csv, 51420_197006.csv, 51420_197402.csv, 51420_197410.csv, 51420_197905.csv, 51420_198306.csv, 51420_198706.csv, 51420_199204.csv, 51420_199705.csv, 51420_200106.csv, 51420_200505.csv, 51420_201505.csv, and 51420_201706.csv. The data files were downloaded from the Manifesto Project website and put into the directory ManifestoProject. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled "Corpus & Documents", and choose the relevant party manifestos. These data files are not provided here, because the data are not freely redistributable; instead, one needs to register with the Manifesto Project to be able to download the data.
    Additionally, the script makes use of the file documents_MPDataset_MPDS2019b.csv, which was converted from the Excel file documents_MPDataset_MPDS2019b.xlsx available from the Manifesto Project website. Since this file is not subject to any restrictions, it is available for download from the given link.
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
The CSV files in the folder "ManifestoProject" were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to reproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this:
- Pull down the "File" menu and select "Open".
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled "Upload". Use it to upload the files that you downloaded from the Manifesto Project website.
Note that the uploaded data will disappear once you "Quit" the notebook (and the Jupyter instance).
In [2]:
# The Manifesto Project data is contained in a collection of CSV files
csv.files <- dir("ManifestoProject", full.names=TRUE, pattern="*.csv")
csv.files
In [3]:
# This file contains the relevant metadata. It is available (without
# registration) from
# https://manifesto-project.wzb.eu/down/data/2019b/codebooks/documents_MPDataset_MPDS2019b.xlsx
# in Excel format
manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv", stringsAsFactors=FALSE)
In [4]:library(tm)
In [5]:
# The following code does not work, due to the peculiar structure of the CSV files
manifesto.corpus <- VCorpus(DirSource("ManifestoProject"))
In [6]:
# To deal with the problem created by the peculiar structure of the files,
# we define a helper function:
getMDoc <- function(file, metadata.file){
    df <- read.csv(file, stringsAsFactors=FALSE)
    content <- paste(df[,1], collapse="\n")
    fn <- basename(file)
    fn <- sub(".csv", "", fn, fixed=TRUE)
    fn12 <- unlist(strsplit(fn, "_"))
    partycode <- as.numeric(fn12[1])
    datecode <- as.numeric(fn12[2])
    year <- datecode %/% 100
    month <- datecode %% 100
    datetime <- ISOdate(year=year, month=month, day=1)
    mf.meta <- subset(metadata.file, party==partycode & date==datecode)
    if(!length(mf.meta$language)) mf.meta$language <- "english"
    PlainTextDocument(
        content,
        id = fn,
        heading = mf.meta$title,
        datetimestamp = as.POSIXlt(datetime),
        language = mf.meta$language,
        partyname = mf.meta$partyname,
        partycode = partycode,
        datecode = datecode
    )
}
In [7]:
# With the helper function we now create a corpus of UK manifestos:
UKLib.docs <- lapply(csv.files, getMDoc, metadata.file=manifesto.metadata)
UKLib.Corpus <- as.VCorpus(UKLib.docs)
UKLib.Corpus
In [8]:UKLib.Corpus[[14]]
In [9]:
# We need to deal with the non-ASCII characters, so we define yet another
# helper function:
handleUTF8quotes <- function(x){
    cx <- content(x)
    cx <- gsub("\xe2\x80\x98","'",cx)
    cx <- gsub("\xe2\x80\x99","'",cx)
    cx <- gsub("\xe2\x80\x9a",",",cx)
    cx <- gsub("\xe2\x80\x9b","`",cx)
    cx <- gsub("\xe2\x80\x9c","\"",cx)
    cx <- gsub("\xe2\x80\x9d","\"",cx)
    cx <- gsub("\xe2\x80\x9e","\"",cx)
    cx <- gsub("\xe2\x80\x9f","\"",cx)
    content(x) <- cx
    x
}
In [10]:
# Another helper function is needed to change the texts into lowercase:
toLower <- function(x){
    content(x) <- tolower(content(x))
    x
}
In [11]:
# We overwrite the 'inspect' method for "TextDocument" objects with a variant
# that shows only the first 500 characters:
inspect.TextDocument <- function(x){
    print(x)
    cat("\n")
    str <- as.character(x)
    str <- substr(str, start=0, stop=500)
    str <- paste(str, "... ...")
    writeLines(str)
    invisible(x)
}
In [12]:
UKLib.Corpus.processed <- tm_map(UKLib.Corpus, handleUTF8quotes)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed, toLower)
inspect(UKLib.Corpus.processed[[14]])
In [13]:
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed, removeNumbers)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed, removePunctuation)
inspect(UKLib.Corpus.processed[[14]])
In [14]:
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed, stemDocument)
inspect(UKLib.Corpus.processed[[14]])
In [15]:
# After preprocessing the text documents we obtain a document-term matrix:
UKLib.dtm <- DocumentTermMatrix(UKLib.Corpus.processed)
UKLib.dtm
In [16]:
# The various preprocessing steps can be combined into a single step:
UKLib.dtm <- DocumentTermMatrix(
    tm_map(UKLib.Corpus, handleUTF8quotes),
    control=list(
        tolower=TRUE,
        removePunctuation=TRUE,
        removeNumbers=TRUE,
        stopwords=TRUE,
        language="en",
        stemming=TRUE
    ))
UKLib.dtm
Improvements Provided by the quanteda Package¶
- A basic example for the usage of the quanteda package
  - Script file: quanteda-basic.R
    The script makes use of the quanteda package, which is available from https://cran.r-project.org/package=quanteda
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(quanteda)
In [3]:quanteda_options(print_corpus_max_ndoc=3)
In [4]:
# This is an example corpus contained in the 'quanteda' package
data_corpus_inaugural
In [5]:mode(data_corpus_inaugural)
In [6]:class(data_corpus_inaugural)
In [7]:data_corpus_inaugural[1:3]
In [8]:str(docvars(data_corpus_inaugural))
In [9]:docvars(data_corpus_inaugural,"Year")
In [10]:data_corpus_inaugural$Year
In [11]:corpus_subset(data_corpus_inaugural, Year > 1945)
In [12]:subset.corpus <- function(x,...) corpus_subset(x,...)
In [13]:subset(data_corpus_inaugural, Year > 1945)
In [14]:docs_containing <- function(x,pattern,...) x[grep(pattern,x,...)]
In [15]:
c_sub <- docs_containing(data_corpus_inaugural, "[Cc]arnage")
c_sub$President
In [16]:
inaugural_sntc <- corpus_reshape(data_corpus_inaugural, to="sentences")
inaugural_sntc
In [17]:
sntcl <- cbind(docvars(inaugural_sntc), len=nchar(inaugural_sntc))
head(sntcl)
In [18]:
sntcl.year <- aggregate(len~Year, data=sntcl, mean)
with(sntcl.year, scatter.smooth(Year, len,
     ylab="Average length of sentences in characters"))
In [19]:
inaugural_ <- corpus_reshape(data_corpus_inaugural, to="documents")
all(inaugural_$Year == data_corpus_inaugural$Year)
- Obtaining tokens in text documents using quanteda
  - Script file: quanteda-tokens.R
    The script makes use of the quanteda package, which is available from https://cran.r-project.org/package=quanteda
  - Interactive notebook:
In [1]:options(jupyter.rich_display=FALSE) # Create output as usual in R
In [2]:library(quanteda)
In [2]:quanteda_options(print_tokens_max_ndoc=3, print_tokens_max_ntoken=6)
In [3]:
inaugural_toks <- tokens(data_corpus_inaugural)
inaugural_toks
In [4]:
inaugural_ntoks <- sapply(inaugural_toks, length)
inaugural_ntoks <- cbind(docvars(inaugural_toks), ntokens=inaugural_ntoks)
In [5]:with(inaugural_ntoks, scatter.smooth(Year,ntokens, ylab="Number of tokens per speech"))
In [7]:
inaugural_sntc <- corpus_reshape(data_corpus_inaugural, to="sentences")
inaugural_sntc_toks <- tokens(inaugural_sntc)
inaugural_sntc_ntoks <- sapply(inaugural_sntc_toks, length)
inaugural_sntc_ntoks <- cbind(docvars(inaugural_sntc_toks), ntokens=inaugural_sntc_ntoks)
In [9]:
inaugural_sntc_ntoks <- aggregate(ntokens~Year, data=inaugural_sntc_ntoks, FUN=mean)
In [10]:with(inaugural_sntc_ntoks, scatter.smooth(Year,ntokens, ylab="Number of tokens per sentence"))
- Preparing Manifesto Project data using quanteda
  - Script file: quanteda-ManifestoProject.R
    The script makes use of
    - the quanteda package, which is available from https://cran.r-project.org/package=quanteda
    - the readtext package, which is available from https://cran.r-project.org/package=readtext
    Data files used in the script: 51420_196410.csv, 51420_196603.csv, 51420_197006.csv, 51420_197402.csv, 51420_197410.csv, 51420_197905.csv, 51420_198306.csv, 51420_198706.csv, 51420_199204.csv, 51420_199705.csv, 51420_200106.csv, 51420_200505.csv, 51420_201505.csv, and 51420_201706.csv. The data files were downloaded from the Manifesto Project website and put into the directory ManifestoProject. To obtain the data, one has to open https://manifesto-project.wzb.eu/datasets, select the link entitled "Corpus & Documents", and choose the relevant party manifestos. These data files are not provided here, because the data are not freely redistributable; instead, one needs to register with the Manifesto Project to be able to download the data.
    Additionally, the script makes use of the file documents_MPDataset_MPDS2019b.csv, which was converted from the Excel file documents_MPDataset_MPDS2019b.xlsx available from the Manifesto Project website. Since this file is not subject to any restrictions, it is available for download from the given link.
  - Interactive notebook:
In [1]:
options(jupyter.rich_display=FALSE) # Create output as usual in R
options(width=120)
The CSV files in the folder "ManifestoProject" were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to reproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this:
- Pull down the "File" menu and select "Open".
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled "Upload". Use it to upload the files that you downloaded from the Manifesto Project website.
Note that the uploaded data will disappear once you "Quit" the notebook (and the Jupyter instance).
In [2]:
csv.files <- dir("ManifestoProject", full.names=TRUE, pattern="*.csv")
length(csv.files)
In [3]:
# 'readtext' (a companion package of 'quanteda') is somewhat better able to
# deal with the Manifesto Project CSV files than 'tm':
library(readtext)
UKLib.rt <- readtext("ManifestoProject/*.csv",
                     text_field=1,
                     docvarsfrom="filenames",
                     docvarnames=c("party","date"))
nrow(UKLib.rt)
In [4]:
# The individual text lines are combined into complete documents,
# one per party and election date:
UKLib.rta <- aggregate(text~party+date,
                       FUN=function(x) paste(x, collapse=" "),
                       data=UKLib.rt)
nrow(UKLib.rta)
In [5]:UKLib.rta <- within(UKLib.rta, doc_id <- paste(party,date,sep="_"))
In [6]:library(quanteda)
In [7]:
UKLib.corpus <- corpus(UKLib.rta)
UKLib.corpus
In [8]:
# Here we combine metadata with the text documents:
manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv", stringsAsFactors=FALSE)
str(manifesto.metadata)
In [9]:
docvars(UKLib.corpus) <- merge(docvars(UKLib.corpus),
                               manifesto.metadata,
                               by=c("party","date"))
str(docvars(UKLib.corpus))
In [10]:
# Finally we create a document-feature matrix, without punctuation, numbers,
# symbols, and stopwords:
UKLib.dfm <- dfm(UKLib.corpus,
                 remove_punct=TRUE,
                 remove_numbers=TRUE,
                 remove_symbols=TRUE,
                 remove=stopwords("english"),
                 stem=TRUE)
str(docvars(UKLib.dfm))
In [11]:
# More fine-grained control is possible using 'tokens()'
UKLib.toks <- tokens(UKLib.corpus, remove_punct=TRUE, remove_numbers=TRUE)
UKLib.toks
In [12]:
UKLib.dfm <- dfm(UKLib.toks)
UKLib.dfm
In [13]:
UKLib.dfm <- dfm_remove(UKLib.dfm, pattern=stopwords("english"))
UKLib.dfm
In [14]:
UKLib.dfm <- dfm_wordstem(UKLib.dfm, language="english")
UKLib.dfm
In [15]:
# 'quanteda' provides support for dictionaries:
milecondict <- dictionary(list(
    Military=c("military","forces","war","defence","victory","victorious","glory"),
    Economy=c("economy","growth","business","enterprise","market")
))
In [16]:
# Here we extract the frequency of tokens belonging to the dictionary categories:
UKLib.milecon.dfm <- dfm(UKLib.corpus, dictionary=milecondict)
UKLib.milecon.dfm
In [17]:
time <- with(docvars(UKLib.milecon.dfm),
             ISOdate(year=date%/%100, month=date%%100, day=1))
time
In [18]:UKLib.ntok <- ntoken(UKLib.corpus)
In [19]:
milit.freq <- as.vector(UKLib.milecon.dfm[,"Military"])
econ.freq <- as.vector(UKLib.milecon.dfm[,"Economy"])
milit.prop <- milit.freq/UKLib.ntok
econ.prop <- econ.freq/UKLib.ntok
In [20]:
# We plot the proportion of dictionary tokens over time
op <- par(mfrow=c(2,1), mar=c(3,4,0,0))
plot(time, milit.prop, type="p", ylab="Military")
lines(time, lowess(time, milit.prop)$y)
plot(time, econ.prop, type="p", ylab="Economy")
lines(time, lowess(time, econ.prop)$y)
par(op)