A basic example of using the quanteda package

The following example uses the quanteda package. To run it on your own computer, you may first need to install the package from CRAN with install.packages("quanteda"). (The package is already installed in the notebook container.)


library(quanteda)
Package version: 2.1.1

Parallel computing: 2 of 20 threads used.

See https://quanteda.io for tutorials and examples.


Attaching package: ‘quanteda’


The following object is masked from ‘jupyter:irkernel’:

    View


The following object is masked from ‘package:utils’:

    View



quanteda_options(print_corpus_max_ndoc=3)

# This is an example corpus contained in the 'quanteda' package
data_corpus_inaugural
Corpus consisting of 58 documents and 4 docvars.
1789-Washington :
"Fellow-Citizens of the Senate and of the House of Representa..."

1793-Washington :
"Fellow citizens, I am again called upon by the voice of my c..."

1797-Adams :
"When it was first perceived, in early times, that no middle ..."

[ reached max_ndoc ... 55 more documents ]

mode(data_corpus_inaugural)
[1] "character"

class(data_corpus_inaugural)
[1] "corpus"    "character"
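
The two results above suggest that, in this version of quanteda, a corpus is stored as a character vector that carries additional class information. The following is a minimal base-R sketch of that idea only; the class name `toy_corpus` and the texts are invented for illustration and do not reproduce quanteda's actual internals.

```r
# A named character vector with an extra S3 class on top of "character".
# 'toy_corpus' and the texts are hypothetical, for illustration only.
toy <- structure(
  c(doc1 = "First toy text.", doc2 = "Second toy text."),
  class = c("toy_corpus", "character")
)

mode(toy)                   # the storage mode is still that of a character vector
class(toy)                  # the S3 class vector set above
inherits(toy, "character")  # TRUE, because "character" is part of the class vector
nchar(unclass(toy))         # character functions work on the underlying vector
```

Because the underlying data are plain character strings, functions such as `nchar()` and `grep()` can be applied to such an object, which is what the later steps of this example rely on.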

data_corpus_inaugural[1:3]
Corpus consisting of 3 documents and 4 docvars.
1789-Washington :
"Fellow-Citizens of the Senate and of the House of Representa..."

1793-Washington :
"Fellow citizens, I am again called upon by the voice of my c..."

1797-Adams :
"When it was first perceived, in early times, that no middle ..."


str(docvars(data_corpus_inaugural))
'data.frame':   58 obs. of  4 variables:
 $ Year     : int  1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 ...
 $ President: chr  "Washington" "Washington" "Adams" "Jefferson" ...
 $ FirstName: chr  "George" "George" "John" "Thomas" ...
 $ Party    : Factor w/ 6 levels "Democratic","Democratic-Republican",..: 4 4 3 2 2 2 2 2 2 2 ...

docvars(data_corpus_inaugural,"Year")
 [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
[16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
[31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
[46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017

data_corpus_inaugural$Year
 [1] 1789 1793 1797 1801 1805 1809 1813 1817 1821 1825 1829 1833 1837 1841 1845
[16] 1849 1853 1857 1861 1865 1869 1873 1877 1881 1885 1889 1893 1897 1901 1905
[31] 1909 1913 1917 1921 1925 1929 1933 1937 1941 1945 1949 1953 1957 1961 1965
[46] 1969 1973 1977 1981 1985 1989 1993 1997 2001 2005 2009 2013 2017

corpus_subset(data_corpus_inaugural, Year > 1945)
Corpus consisting of 18 documents and 4 docvars.
1949-Truman :
"Mr. Vice President, Mr. Chief Justice, and fellow citizens, ..."

1953-Eisenhower :
"My friends, before I begin the expression of those thoughts ..."

1957-Eisenhower :
"The Price of Peace Mr. Chairman, Mr. Vice President, Mr. Chi..."

[ reached max_ndoc ... 15 more documents ]

subset.corpus <- function(x,...) corpus_subset(x,...)

subset(data_corpus_inaugural, Year > 1945)
Corpus consisting of 18 documents and 4 docvars.
1949-Truman :
"Mr. Vice President, Mr. Chief Justice, and fellow citizens, ..."

1953-Eisenhower :
"My friends, before I begin the expression of those thoughts ..."

1957-Eisenhower :
"The Price of Peace Mr. Chairman, Mr. Vice President, Mr. Chi..."

[ reached max_ndoc ... 15 more documents ]
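
The one-line definition of `subset.corpus` above relies on S3 method dispatch: `subset()` is a generic function, so a function named `subset.<class>` is called for objects of that class. The sketch below shows the mechanism in base R without quanteda. The class `toy_corpus`, its `year` attribute, and the texts are all hypothetical.

```r
# Invented toy object: texts plus a 'year' attribute (not quanteda's structure).
toy <- structure(
  c(a = "old text", b = "new text", c = "newer text"),
  year = c(1900, 1950, 2000),
  class = c("toy_corpus", "character")
)

# S3 method: called whenever subset() receives a 'toy_corpus' object.
subset.toy_corpus <- function(x, subset, ...) {
  # Evaluate the unevaluated condition with 'year' taken from the attribute
  keep <- eval(substitute(subset), list(year = attr(x, "year")), parent.frame())
  out <- unclass(x)[keep]
  # Re-attach the filtered attribute and the original class
  structure(out, year = attr(x, "year")[keep], class = class(x))
}

recent <- subset(toy, year > 1945)
names(recent)         # names of the elements that were kept
attr(recent, "year")  # years of the elements that were kept
```

In the example above, the method simply forwards its arguments to `corpus_subset()`, so none of this evaluation logic has to be written by hand.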

docs_containing <- function(x,pattern,...) x[grep(pattern,x,...)]

c_sub <- docs_containing(data_corpus_inaugural,"[Cc]arnage")
c_sub$President
[1] "Trump"
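
`docs_containing()` works because `grep()` returns the positions of the matching elements, which are then used as an index into `x`. Here is a base-R sketch of the same idiom on a made-up named character vector; the texts are invented, and `ignore.case` is one of the `grep()` arguments that can be passed through `...`.

```r
docs_containing <- function(x, pattern, ...) x[grep(pattern, x, ...)]

# Hypothetical texts, for illustration only
toy_texts <- c(
  first  = "The carnage must end.",
  second = "We the people.",
  third  = "Carnage is a rare word."
)

# Case-insensitive matching instead of the character class [Cc]
hits <- docs_containing(toy_texts, "carnage", ignore.case = TRUE)
names(hits)

# grepl() returns a logical vector instead, which is convenient for counting
sum(grepl("[Cc]arnage", toy_texts))
```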

inaugural_sntc <- corpus_reshape(data_corpus_inaugural,
                                 to="sentences")
inaugural_sntc
Corpus consisting of 5,018 documents and 4 docvars.
1789-Washington.1 :
"Fellow-Citizens of the Senate and of the House of Representa..."

1789-Washington.2 :
"On the one hand, I was summoned by my Country, whose voice I..."

1789-Washington.3 :
"On the other hand, the magnitude and difficulty of the trust..."

[ reached max_ndoc ... 5,015 more documents ]

sntcl <- cbind(docvars(inaugural_sntc),
               len=nchar(inaugural_sntc))
head(sntcl)
                  Year President  FirstName Party len
1789-Washington.1 1789 Washington George    none  278
1789-Washington.2 1789 Washington George    none  478
1789-Washington.3 1789 Washington George    none  436
1789-Washington.4 1789 Washington George    none  179
1789-Washington.5 1789 Washington George    none  515
1789-Washington.6 1789 Washington George    none  654

sntcl.year <- aggregate(len~Year,data=sntcl,mean)
with(sntcl.year,
     scatter.smooth(Year,len,ylab="Average length of sentences in characters"))
[Figure: scatter plot of the average sentence length in characters against Year, with a smoothed curve]
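
The call `aggregate(len~Year,data=sntcl,mean)` above computes the mean of `len` within each value of `Year`. A small sketch on an invented data frame (the numbers are made up) shows the shape of the result, together with an equivalent `tapply()` call:

```r
# Hypothetical sentence lengths for two years
toy <- data.frame(
  Year = c(1789, 1789, 1793, 1793, 1793),
  len  = c(100, 200, 30, 60, 90)
)

# One row per Year, with the mean of len in that year
toy.year <- aggregate(len ~ Year, data = toy, FUN = mean)
toy.year

# The same means as a vector named by year
by_year <- tapply(toy$len, toy$Year, mean)
by_year
```

The data frame returned by `aggregate()` is convenient for plotting with `with()`, as done above, whereas `tapply()` is handy when only the vector of means is needed.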

inaugural_ <- corpus_reshape(data_corpus_inaugural,
                             to="documents")
all(inaugural_$Year == data_corpus_inaugural$Year)
[1] TRUE

Downloadable R script and interactive version

Explanation

The link with the “jupyterhub” icon directs you to an interactive Jupyter [1] notebook, which runs inside a Docker container [2]. There are two variants of the interactive notebook. One shuts down after 60 seconds and does not require a sign-in. The other requires you to sign in with your ORCID [3] credentials, but shuts down only after 24 hours. (There is no guarantee that such a container persists that long; it may be shut down earlier for maintenance purposes.) After shutdown, all data within the container will be reset, i.e. all files created by the user will be deleted. [4]

Above you see a rendered version of the Jupyter notebook. [5]

[1] For more information about Jupyter, see http://jupyter.org. The Jupyter notebooks make use of the IRKernel package.

[2] For more information about Docker, see https://docs.docker.com/. The container images were created with repo2docker, while containers are run with docker spawner.

[3] ORCID is a free service for the authentication of researchers. It also allows researchers to showcase publications and contributions to the academic community, such as peer review. See https://info.orcid.org/what-is-orcid/ for more information.

[4] The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.

[5] The notebook is rendered with the help of the nbsphinx extension.