An example for the use of tm on data from the Manifesto Project

The CSV files in the folder “ManifestoProject” were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to reproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this:

  1. Pull down the “File” menu and select “Open”.
  2. An overview of the folder that contains the notebook opens.
  3. The folder view has a button labelled “Upload”. Use it to upload the files that you downloaded from the Manifesto Project website.

Note that the uploaded data will disappear once you “Quit” the notebook (and the Jupyter instance).


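# The Manifesto Project data is contained in a collection of CSV files.
# 'pattern' is a regular expression (not a shell wildcard), so the file
# extension is matched explicitly:
csv.files <- dir("ManifestoProject",full.names=TRUE,
                 pattern="\\.csv$")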
csv.files
 [1] "ManifestoProject/51420_196410.csv" "ManifestoProject/51420_196603.csv"
 [3] "ManifestoProject/51420_197006.csv" "ManifestoProject/51420_197402.csv"
 [5] "ManifestoProject/51420_197410.csv" "ManifestoProject/51420_197905.csv"
 [7] "ManifestoProject/51420_198306.csv" "ManifestoProject/51420_198706.csv"
 [9] "ManifestoProject/51421_199204.csv" "ManifestoProject/51421_199705.csv"
[11] "ManifestoProject/51421_200106.csv" "ManifestoProject/51421_200505.csv"
[13] "ManifestoProject/51421_201505.csv" "ManifestoProject/51421_201706.csv"
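
The following is a minimal sketch of how the `pattern` argument of `dir()` selects files. It works on a scratch directory with made-up file names (assumptions for illustration, not part of the Manifesto Project data) and compares an anchored regular expression with a wildcard translated by `glob2rx()`.

```r
# Create a scratch directory with a few empty files; the names are hypothetical
tmp <- file.path(tempdir(), "pattern-demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("51420_196410.csv",
                                       "notes.txt",
                                       "csv_readme.md"))))

# Anchored regular expression: only names that end in ".csv"
csv.only <- dir(tmp, pattern = "\\.csv$")

# The same selection written as a wildcard and translated by glob2rx()
csv.glob <- dir(tmp, pattern = glob2rx("*.csv"))

print(csv.only)
print(identical(csv.only, csv.glob))
```

Either form should return only the one CSV file. The wildcard variant may be easier to read if you are used to shell patterns.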

The file documents_MPDataset_MPDS2019b.csv contains the relevant metadata. The original in Excel format is available (without registration) from the Manifesto Project website.


manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv",
                               stringsAsFactors=FALSE)

The following makes use of the tm package. You may need to install it from CRAN using the code install.packages("tm") if you want to run this on your computer. (The package is already installed on the notebook container, however.)


library(tm)
Loading required package: NLP


# The following code does not work, due to the peculiar structure of the CSV files
manifesto.corpus <- VCorpus(DirSource("ManifestoProject"))

# To deal with the problem created by the peculiar structure of the files, we
# define a helper function:
getMDoc <- function(file,metadata.file){
    df <- read.csv(file,
                   stringsAsFactors=FALSE)
    content <- paste(df[,1],collapse="\n")

    fn <- basename(file)
    fn <- sub(".csv","",fn,fixed=TRUE)
    fn12 <- unlist(strsplit(fn,"_"))

    partycode <- as.numeric(fn12[1])
    datecode <- as.numeric(fn12[2])
    year <- datecode %/% 100
    month <- datecode %% 100
    datetime <- ISOdate(year=year,month=month,day=1)

    mf.meta <- subset(metadata.file,
                      party==partycode & date == datecode)
    if(!length(mf.meta$language))
        mf.meta$language <- "english"

    PlainTextDocument(
        content,
        id = fn,
        heading = mf.meta$title,
        datetimestamp = as.POSIXlt(datetime),
        language = mf.meta$language,
        partyname = mf.meta$partyname,
        partycode = partycode,
        datecode = datecode
    )
}
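
To illustrate how the helper function derives the party code and the date from a file name, here is a short sketch that applies the same arithmetic to one file name from the listing above. It uses only base R and does not need the data files.

```r
fn <- "51421_201706"                           # file name without ".csv"
parts <- strsplit(fn, "_", fixed = TRUE)[[1]]  # split at the underscore

partycode <- as.numeric(parts[1])              # party code
datecode  <- as.numeric(parts[2])              # year and month as one number

year  <- datecode %/% 100                      # integer division drops the month
month <- datecode %% 100                       # remainder is the month

# The file name carries no day, so the first day of the month is assumed
datetime <- ISOdate(year = year, month = month, day = 1)
print(format(datetime, "%Y-%m-%d"))
```

The party code and the date code are also the keys used to look up the matching row in the metadata.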

# With the helper function we now create a corpus of UK manifestos:
UKLib.docs <- lapply(csv.files,getMDoc,
                     metadata.file=manifesto.metadata)
UKLib.Corpus <- as.VCorpus(UKLib.docs)
UKLib.Corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 14

UKLib.Corpus[[14]]
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 130585

# We need to deal with the non-ASCII characters, so we define yet another helper
# function:
handleUTF8quotes <- function(x){
    cx <- content(x)
    cx <- gsub("\xe2\x80\x98","'",cx)
    cx <- gsub("\xe2\x80\x99","'",cx)
    cx <- gsub("\xe2\x80\x9a",",",cx)
    cx <- gsub("\xe2\x80\x9b","`",cx)
    cx <- gsub("\xe2\x80\x9c","\"",cx)
    cx <- gsub("\xe2\x80\x9d","\"",cx)
    cx <- gsub("\xe2\x80\x9e","\"",cx)
    cx <- gsub("\xe2\x80\x9f","\"",cx)
    content(x) <- cx
    x
}
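
The byte sequences in the helper function are the UTF-8 encodings of the typographic quotation marks U+2018 to U+201F. As a sketch of the same idea on a plain character string (rather than a tm document), the following uses Unicode escapes instead of raw bytes; the example sentence is made up.

```r
# A made-up string with curly single and double quotes
x <- "\u2018Britain\u2019s place\u2019 \u201cin Europe\u201d"

# Replace curly single quotes by the ASCII apostrophe
for (q in c("\u2018", "\u2019"))
    x <- gsub(q, "'", x, fixed = TRUE)

# Replace curly double quotes by the ASCII double quote
for (q in c("\u201c", "\u201d", "\u201e", "\u201f"))
    x <- gsub(q, "\"", x, fixed = TRUE)

print(x)
```

Using `fixed = TRUE` treats each quotation mark as a literal string, so no regular expression syntax is involved.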

# Another helper function is needed to change the texts into lowercase:
toLower <- function(x) {
    content(x) <- tolower(content(x))
    x
}

# We override the 'inspect' method for "TextDocument" objects with a variant
# that shows only the first 500 characters:
inspect.TextDocument <- function(x){
    print(x)
    cat("\n")
    str <- as.character(x)
    str <- substr(str,start=1,stop=500)
    str <- paste(str,"... ...")
    writeLines(str)
    invisible(x)
}

UKLib.Corpus.processed <- tm_map(UKLib.Corpus,handleUTF8quotes)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,toLower)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 130585

1 protect britain's place in europe
1.1 giving the people the final say
liberal democrats are open and outward-looking.
we passionately believe that britain's relationship with its neighbours is stronger as part of the european union.
whatever its imperfections, the eu remains the best framework for working effectively and co-operating in the pursuit of our shared aims.
it has led directly to greater prosperity,
increased trade,
investment and jobs,
better security
and a greener environment.
bri ... ...

UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removeNumbers)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removePunctuation)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 127677

 protect britains place in europe
 giving the people the final say
liberal democrats are open and outwardlooking
we passionately believe that britains relationship with its neighbours is stronger as part of the european union
whatever its imperfections the eu remains the best framework for working effectively and cooperating in the pursuit of our shared aims
it has led directly to greater prosperity
increased trade
investment and jobs
better security
and a greener environment
britain is better o ... ...
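
The effect of these two steps can be imitated with base R regular expressions. This is only a rough sketch, under the assumption that the default behaviour of `removeNumbers` and `removePunctuation` amounts to deleting digits and punctuation characters; it is applied to two lines from the output above.

```r
x <- c("1.1 giving the people the final say",
       "liberal democrats are open and outward-looking.")

x <- gsub("[[:digit:]]+", "", x)   # delete all digits
x <- gsub("[[:punct:]]+", "", x)   # delete all punctuation characters

print(x)
```

Note that the leading blank of the first line survives and that the hyphenated word is fused into one token, just as in the output of `inspect()` above.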

UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,stemDocument)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata:  10
Content:  chars: 112157

protect britain place in europ give the peopl the final say liber democrat are open and outwardlook we passion believ that britain relationship with it neighbour is stronger as part of the european union whatev it imperfect the eu remain the best framework for work effect and cooper in the pursuit of our share aim it has led direct to greater prosper increas trade invest and job better secur and a greener environ britain is better off in the eu liber democrat campaign for the uk to remain in the ... ...

# After preprocessing the text documents we obtain a document-term matrix:
UKLib.dtm <- DocumentTermMatrix(UKLib.Corpus.processed)
UKLib.dtm
<<DocumentTermMatrix (documents: 14, terms: 5940)>>
Non-/sparse entries: 24547/58613
Sparsity           : 70%
Maximal term length: 27
Weighting          : term frequency (tf)
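
To make the structure of a document-term matrix concrete, here is a small base R sketch that builds one by hand for three made-up "documents". It is an illustration only and not how tm works internally; for example, tm by default keeps only terms with at least three characters, which this sketch does not do.

```r
# Three tiny made-up documents
docs <- c(d1 = "britain is better off in europe",
          d2 = "europe is open",
          d3 = "britain is open and better")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one token vector per document
terms  <- sort(unique(unlist(tokens)))        # the vocabulary

# Count how often each term occurs in each document
dtm <- t(vapply(tokens,
                function(tok) tabulate(match(tok, terms),
                                       nbins = length(terms)),
                integer(length(terms))))
colnames(dtm) <- terms
print(dtm)

# Sparsity: the share of cells that are zero
sparsity <- mean(dtm == 0)
print(round(sparsity, 2))
```

The sparsity reported by tm above is computed in the same way: 58613 of the 14 × 5940 = 83160 cells are zero, which is about 70%.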

# The various preprocessing steps can be combined into a single step:
UKLib.dtm <- DocumentTermMatrix(
    tm_map(UKLib.Corpus,handleUTF8quotes),
    control=list(
        tolower=TRUE,
        removePunctuation=TRUE,
        removeNumbers=TRUE,   # the option name is 'removeNumbers' (plural)
        stopwords=TRUE,
        language="en",
        stemming=TRUE
    ))
UKLib.dtm
<<DocumentTermMatrix (documents: 14, terms: 6289)>>
Non-/sparse entries: 24105/63941
Sparsity           : 73%
Maximal term length: 27
Weighting          : term frequency (tf)

Downloadable R script and interactive version

Explanation

The link with the “jupyterhub” icon directs you to an interactive Jupyter [1] notebook, which runs inside a Docker container [2]. There are two variants of the interactive notebook. One shuts down after 60 seconds and does not require a sign-in. The other requires signing in with your ORCID [3] credentials, but shuts down only after 24 hours. (There is no guarantee that such a container persists that long; it may be shut down earlier for maintenance purposes.) After shutdown all data within the container will be reset, i.e. all files created by the user will be deleted. [4]

Above you see a rendered version of the Jupyter notebook. [5]

[1] For more information about Jupyter see http://jupyter.org. The Jupyter notebooks make use of the IRKernel package.

[2] For more information about Docker see https://docs.docker.com/. The container images were created with repo2docker, while containers are run with docker spawner.

[3] ORCID is a free service for the authentication of researchers. It also allows researchers to showcase publications and contributions to the academic community, such as peer review. See https://info.orcid.org/what-is-orcid/ for more information.

[4] The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.

[5] The notebook is rendered with the help of the nbsphinx extension.