An example of using tm on data from the Manifesto Project
The CSV files in the folder “ManifestoProject” were downloaded from the Manifesto Project website. Redistribution of the data is prohibited, so readers who want to reproduce the following will need to download their own copy of the data set and upload it to the virtual machine that runs this notebook. To do this:
- Pull down the “File” menu and select “Open”.
- An overview of the folder that contains the notebook opens.
- The folder view has a button labelled “Upload”. Use it to upload the files that you downloaded from the Manifesto Project website.
Note that the uploaded data will disappear once you “Quit” the notebook (and the Jupyter instance).
# The Manifesto Project data is contained in a collection of CSV files
csv.files <- dir("ManifestoProject",full.names=TRUE,
                 pattern="\\.csv$") # 'pattern' is a regular expression, not a glob
csv.files
[1] "ManifestoProject/51420_196410.csv" "ManifestoProject/51420_196603.csv"
[3] "ManifestoProject/51420_197006.csv" "ManifestoProject/51420_197402.csv"
[5] "ManifestoProject/51420_197410.csv" "ManifestoProject/51420_197905.csv"
[7] "ManifestoProject/51420_198306.csv" "ManifestoProject/51420_198706.csv"
[9] "ManifestoProject/51421_199204.csv" "ManifestoProject/51421_199705.csv"
[11] "ManifestoProject/51421_200106.csv" "ManifestoProject/51421_200505.csv"
[13] "ManifestoProject/51421_201505.csv" "ManifestoProject/51421_201706.csv"
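The file names encode the party code and the election date, which the helper function further below relies on. A minimal sketch of that decoding, using one of the file names listed above (the variable names are illustrative and not part of the notebook):

```r
# Decode a file name of the form <partycode>_<yyyymm>.csv
file <- "ManifestoProject/51420_196410.csv"

stem  <- sub(".csv", "", basename(file), fixed = TRUE)  # "51420_196410"
parts <- strsplit(stem, "_", fixed = TRUE)[[1]]

partycode <- as.numeric(parts[1])   # party identifier
datecode  <- as.numeric(parts[2])   # election date as yyyymm
year  <- datecode %/% 100           # integer division gives the year
month <- datecode %% 100            # remainder gives the month

cat(partycode, year, month, "\n")
```

For this file name the sketch yields party code 51420, year 1964 and month 10.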
The file documents_MPDataset_MPDS2019b.csv contains the relevant metadata. The original in Excel format is available (without registration) from the Manifesto Project website.
manifesto.metadata <- read.csv("documents_MPDataset_MPDS2019b.csv",
                               stringsAsFactors=FALSE)
The following makes use of the tm package. If you want to run this on your own computer, you may need to install it from CRAN first:
install.packages("tm")
(The package is already installed in the notebook container.)
library(tm)
Loading required package: NLP
# The following code does not work, due to the peculiar structure of the CSV files
manifesto.corpus <- VCorpus(DirSource("ManifestoProject"))
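To see why reading the files as plain text is not enough, it may help to look at a toy file. The sketch below assumes that each manifesto file is a table with the text in the first column and a coding column next to it; the column names `text` and `cmp_code` and the toy content are assumptions made for illustration only.

```r
# Toy stand-in for one manifesto file; column names and content are assumptions
toy <- data.frame(
  text     = c("We will protect trade.", "We will create jobs."),
  cmp_code = c(407, 408),
  stringsAsFactors = FALSE
)
path <- file.path(tempdir(), "51420_196410.csv")
write.csv(toy, path, row.names = FALSE)

# Reading the raw lines keeps the header, the quoting and the codes ...
raw_lines <- readLines(path)
print(raw_lines)

# ... whereas reading the file as a table and pasting the first column
# together yields only the text itself
df  <- read.csv(path, stringsAsFactors = FALSE)
txt <- paste(df[, 1], collapse = "\n")
cat(txt, "\n")
```

The second variant is the approach taken by the helper function defined next.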
# To deal with the problem created by the peculiar structure of the files, we
# define a helper function:
getMDoc <- function(file,metadata.file){
    # Read the CSV file and join its first column into a single string
    df <- read.csv(file,
                   stringsAsFactors=FALSE)
    content <- paste(df[,1],collapse="\n")
    # The file name encodes the party code and the date (yyyymm)
    fn <- basename(file)
    fn <- sub(".csv","",fn,fixed=TRUE)
    fn12 <- unlist(strsplit(fn,"_"))
    partycode <- as.numeric(fn12[1])
    datecode <- as.numeric(fn12[2])
    year <- datecode %/% 100
    month <- datecode %% 100
    datetime <- ISOdate(year=year,month=month,day=1)
    # Look up the matching row in the metadata
    mf.meta <- subset(metadata.file,
                      party==partycode & date == datecode)
    # Fall back to English if the metadata has no language entry
    if(!length(mf.meta$language))
        mf.meta$language <- "english"
    PlainTextDocument(
        content,
        id = fn,
        heading = mf.meta$title,
        datetimestamp = as.POSIXlt(datetime),
        language = mf.meta$language,
        partyname = mf.meta$partyname,
        partycode = partycode,
        datecode = datecode
    )
}
# With the helper function we now create a corpus of UK manifestos:
UKLib.docs <- lapply(csv.files,getMDoc,
                     metadata.file=manifesto.metadata)
UKLib.Corpus <- as.VCorpus(UKLib.docs)
UKLib.Corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 14
UKLib.Corpus[[14]]
<<PlainTextDocument>>
Metadata: 10
Content: chars: 130585
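The ten metadata entries reported above include the extra fields set by the helper function (`partyname`, `partycode`, `datecode`). A minimal sketch of how such fields can be read back and used for filtering, on a toy corpus with made-up texts rather than the real manifestos:

```r
library(tm)

# Two toy documents carrying the same kind of extra metadata fields
docs <- list(
  PlainTextDocument("first toy text",  id = "51420_196410",
                    partycode = 51420, datecode = 196410),
  PlainTextDocument("second toy text", id = "51421_201706",
                    partycode = 51421, datecode = 201706)
)
toy.corpus <- as.VCorpus(docs)

# Look up a single metadata field of one document
meta(toy.corpus[[2]], "datecode")

# Collect one field across the whole corpus
datecodes <- sapply(toy.corpus, meta, "datecode")
print(datecodes)

# Keep only the documents dated January 1990 or later
recent <- tm_filter(toy.corpus, function(d) meta(d, "datecode") >= 199001)
length(recent)
```

The same calls should work on `UKLib.Corpus`, for example `meta(UKLib.Corpus[[14]], "partyname")`.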
# We need to deal with the non-ASCII characters, so we define yet another helper
# function:
handleUTF8quotes <- function(x){
    cx <- content(x)
    cx <- gsub("\xe2\x80\x98","'",cx)
    cx <- gsub("\xe2\x80\x99","'",cx)
    cx <- gsub("\xe2\x80\x9a",",",cx)
    cx <- gsub("\xe2\x80\x9b","`",cx)
    cx <- gsub("\xe2\x80\x9c","\"",cx)
    cx <- gsub("\xe2\x80\x9d","\"",cx)
    cx <- gsub("\xe2\x80\x9e","\"",cx)
    cx <- gsub("\xe2\x80\x9f","\"",cx)
    content(x) <- cx
    x
}
# Another helper function is needed to change the texts into lowercase:
toLower <- function(x) {
    content(x) <- tolower(content(x))
    x
}
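As an alternative to writing document-level helpers by hand, tm provides `content_transformer()`, which wraps a function that works on character vectors so that it can be passed to `tm_map()`. The following is a sketch on a toy corpus; `straightenQuotes` is a hypothetical helper introduced here, and it uses Unicode escapes (`\u2018` etc.) instead of the UTF-8 byte sequences used above, and covers only four of the eight quote characters.

```r
library(tm)

# Hypothetical character-level helper: replace curly quotes by ASCII quotes
straightenQuotes <- function(s) {
  for (q in c("\u2018", "\u2019")) s <- gsub(q, "'",  s, fixed = TRUE)
  for (q in c("\u201c", "\u201d")) s <- gsub(q, "\"", s, fixed = TRUE)
  s
}

toy.corpus <- VCorpus(VectorSource(
  "\u2018Open\u2019 and \u201cOutward-Looking\u201d"
))

# content_transformer() turns a character function into a document transformer
toy.corpus <- tm_map(toy.corpus, content_transformer(straightenQuotes))
toy.corpus <- tm_map(toy.corpus, content_transformer(tolower))

result <- content(toy.corpus[[1]])
print(result)
```

Both approaches leave the document metadata untouched and only modify the content.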
# We override the 'inspect' method for "TextDocument" objects with a variant that shows only the first
# 500 characters:
inspect.TextDocument <- function(x){
    print(x)
    cat("\n")
    str <- as.character(x)
    str <- substr(str,start=1,stop=500)
    str <- paste(str,"... ...")
    writeLines(str)
    invisible(x)
}
UKLib.Corpus.processed <- tm_map(UKLib.Corpus,handleUTF8quotes)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,toLower)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata: 10
Content: chars: 130585
1 protect britain's place in europe
1.1 giving the people the final say
liberal democrats are open and outward-looking.
we passionately believe that britain's relationship with its neighbours is stronger as part of the european union.
whatever its imperfections, the eu remains the best framework for working effectively and co-operating in the pursuit of our shared aims.
it has led directly to greater prosperity,
increased trade,
investment and jobs,
better security
and a greener environment.
bri ... ...
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removeNumbers)
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,removePunctuation)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata: 10
Content: chars: 127677
protect britains place in europe
giving the people the final say
liberal democrats are open and outwardlooking
we passionately believe that britains relationship with its neighbours is stronger as part of the european union
whatever its imperfections the eu remains the best framework for working effectively and cooperating in the pursuit of our shared aims
it has led directly to greater prosperity
increased trade
investment and jobs
better security
and a greener environment
britain is better o ... ...
UKLib.Corpus.processed <- tm_map(UKLib.Corpus.processed,stemDocument)
inspect(UKLib.Corpus.processed[[14]])
<<PlainTextDocument>>
Metadata: 10
Content: chars: 112157
protect britain place in europ give the peopl the final say liber democrat are open and outwardlook we passion believ that britain relationship with it neighbour is stronger as part of the european union whatev it imperfect the eu remain the best framework for work effect and cooper in the pursuit of our share aim it has led direct to greater prosper increas trade invest and job better secur and a greener environ britain is better off in the eu liber democrat campaign for the uk to remain in the ... ...
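To make the effect of stemming easier to see, the sketch below stems a few single words taken from the text above. It assumes that the SnowballC package is installed and calls its `wordStem()` function directly with the English stemmer; this is an illustration, not a statement about how the corpus above was processed internally.

```r
library(SnowballC)

# A few words from the manifesto excerpt shown above
words <- c("liberal", "democrats", "people", "europe", "passionately")
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(setNames(stems, words))
```

The stems are truncated word forms such as "liber" or "peopl", not dictionary words, which is why the stemmed text above looks clipped.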
# After preprocessing the text documents we obtain a document-term matrix:
UKLib.dtm <- DocumentTermMatrix(UKLib.Corpus.processed)
UKLib.dtm
<<DocumentTermMatrix (documents: 14, terms: 5940)>>
Non-/sparse entries: 24547/58613
Sparsity : 70%
Maximal term length: 27
Weighting : term frequency (tf)
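A document-term matrix has one row per document and one column per term; the sparsity reported above is the share of cells that are zero (here 58613 of 14 × 5940 = 83160 cells, about 70%). A minimal sketch on a toy corpus of two short made-up documents shows the same quantities at a readable scale:

```r
library(tm)

toy.corpus <- VCorpus(VectorSource(c(
  "increased trade investment and jobs",
  "better security and jobs"
)))
toy.dtm <- DocumentTermMatrix(toy.corpus)

# Dense view: one row per document, one column per term
m <- as.matrix(toy.dtm)
print(m)

# Terms that occur at least twice across the corpus
frequent <- findFreqTerms(toy.dtm, lowfreq = 2)
print(frequent)

# Share of zero cells, the quantity that print() reports as sparsity
sparsity <- mean(m == 0)
print(sparsity)
```

Converting to a dense matrix is fine for a toy example, but for larger corpora it is usually better to stay with the sparse representation.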
# The various preprocessing steps can be combined into a single step
# (this variant also removes stop words):
UKLib.dtm <- DocumentTermMatrix(
    tm_map(UKLib.Corpus,handleUTF8quotes),
    control=list(
        tolower=TRUE,
        removePunctuation=TRUE,
        removeNumber=TRUE, # NB: tm names this option 'removeNumbers'; as spelled here it is not recognised
        stopwords=TRUE,
        language="en",
        stemming=TRUE
    ))
UKLib.dtm
<<DocumentTermMatrix (documents: 14, terms: 6289)>>
Non-/sparse entries: 24105/63941
Sparsity : 73%
Maximal term length: 27
Weighting : term frequency (tf)
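The effect of the individual `control` entries is easiest to check on a toy document. The sketch below uses a made-up sentence and the option name `removeNumbers` (plural), which is how tm names it; stemming is left out so that the example does not depend on SnowballC.

```r
library(tm)

toy.corpus <- VCorpus(VectorSource(
  "In 2017 the Liberal Democrats want Britain to remain in Europe."
))

toy.dtm <- DocumentTermMatrix(
  toy.corpus,
  control = list(
    tolower = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,   # drops "2017"
    stopwords = TRUE        # drops English stop words such as "the"
  )
)

toy.terms <- Terms(toy.dtm)
print(toy.terms)
```

Neither "2017" nor "the" should appear among the resulting terms, while "europe" is kept without its trailing full stop.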
- R file: tm-ManifestoProject.R
- Rmarkdown file: tm-ManifestoProject.Rmd
- Jupyter notebook file: tm-ManifestoProject.ipynb