Obtaining tokens in text documents using quanteda
This example uses the quanteda package. To run it on your own computer, you may first need to install the package from CRAN with
install.packages("quanteda"). (The package is already installed in the notebook container.)
library(quanteda)
Package version: 2.1.1
Parallel computing: 2 of 20 threads used.
See https://quanteda.io for tutorials and examples.

Attaching package: ‘quanteda’

The following object is masked from ‘jupyter:irkernel’:

    View

The following object is masked from ‘package:utils’:

    View
The following makes use of an example corpus which is part of the quanteda package.
inaugural_toks <- tokens(data_corpus_inaugural)
inaugural_toks
Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
 "Fellow-Citizens" "of" "the" "Senate" "and" "of"
[ ... and 1,531 more ]
1793-Washington :
 "Fellow" "citizens" "," "I" "am" "again"
[ ... and 141 more ]
1797-Adams :
 "When" "it" "was" "first" "perceived" ","
[ ... and 2,571 more ]
[ reached max_ndoc ... 55 more documents ]
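# Count the tokens in each speech
inaugural_ntoks <- sapply(inaugural_toks, length)
# Attach the counts to the document variables (which include Year)
inaugural_ntoks <- cbind(docvars(inaugural_toks), ntokens = inaugural_ntoks)
# Plot speech length against year, with a smoothed trend line
with(inaugural_ntoks, scatter.smooth(Year, ntokens, ylab = "Number of tokens per speech"))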
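The token counts above include punctuation marks, because tokens() keeps them by default (note the "," tokens in the printed output). The following is a minimal sketch, not part of the original notebook, of how the counts could be obtained with quanteda's ntoken() function and how they change when punctuation is dropped. The two short texts and all object names are made up for illustration.

```r
library(quanteda)

# Two short, made-up texts standing in for speeches (hypothetical data)
txt <- c(doc1 = "Fellow citizens, I am again called upon.",
         doc2 = "When it was first perceived.")

toks <- tokens(txt)

# ntoken() returns one token count per document
n_builtin <- ntoken(toks)

# The same counts, obtained as in the text above via sapply()
n_sapply <- sapply(toks, length)

# Punctuation marks count as tokens by default;
# remove_punct = TRUE drops them so that only words are counted
n_words <- ntoken(tokens(txt, remove_punct = TRUE))

print(n_builtin)
print(n_words)
```

Whether punctuation should be counted depends on the question being asked; for a rough measure of speech length, either choice gives a similar picture.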
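# Split each speech into sentences, so that every sentence becomes a document
inaugural_sntc <- corpus_reshape(data_corpus_inaugural, to = "sentences")
inaugural_sntc_toks <- tokens(inaugural_sntc)
# Count the tokens in each sentence and attach the document variables
inaugural_sntc_ntoks <- sapply(inaugural_sntc_toks, length)
inaugural_sntc_ntoks <- cbind(docvars(inaugural_sntc_toks), ntokens = inaugural_sntc_ntoks)
# Average the sentence lengths within each year
inaugural_sntc_ntoks <- aggregate(ntokens ~ Year, data = inaugural_sntc_ntoks, FUN = mean)
# Plot mean sentence length against year, with a smoothed trend line
with(inaugural_sntc_ntoks, scatter.smooth(Year, ntokens, ylab = "Number of tokens per sentence"))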
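The aggregation step may be easier to follow on a small example. The sketch below uses base R only and a made-up data frame (the name sntc and its values are hypothetical) with one row per sentence; aggregate() then reduces it to one row per year holding the mean sentence length.

```r
# Hypothetical sentence-level data: one row per sentence
sntc <- data.frame(
  Year    = c(1789, 1789, 1793, 1793, 1793),
  ntokens = c(30, 50, 10, 20, 30)
)

# Mean number of tokens per sentence within each year
per_year <- aggregate(ntokens ~ Year, data = sntc, FUN = mean)

print(per_year)
```

The result has as many rows as there are distinct years, which is why the second plot shows one point per inaugural speech even though the underlying data are sentences.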
Downloadable R script and interactive version
The link with the “jupyterhub” icon directs you to an interactive Jupyter [1] notebook, which runs inside a Docker container [2]. There are two variants of the interactive notebook. One shuts down after 60 seconds and does not require a sign-in. The other requires you to sign in with your ORCID [3] credentials, but shuts down only after 24 hours. (There is no guarantee that such a container persists that long; it may be shut down earlier for maintenance.) After shutdown, all data within the container is reset, i.e. all files created by the user are deleted. [4]
Above you see a rendered version of the Jupyter notebook. [5]
[3] ORCID is a free service for the authentication of researchers. It also allows researchers to showcase publications and contributions to the academic community, such as peer review. See https://info.orcid.org/what-is-orcid/ for more information.
[4] The Jupyter notebooks come with NO WARRANTY whatsoever. They are provided for educational and illustrative purposes only. Do not use them for production work.
[5] The notebook is rendered with the help of the nbsphinx extension.