Obtaining tokens in text documents using quanteda
The following makes use of the quanteda package. If you want to run this on
your own computer, you may need to install it from CRAN first:
install.packages("quanteda")
(The package is already installed on the notebook container.)
library(quanteda)
Package version: 3.2.1
Unicode version: 13.0
ICU version: 67.1
Parallel computing: 20 of 20 threads used.
See https://quanteda.io for tutorials and examples.
quanteda_options(print_tokens_max_ndoc = 3,
                 print_tokens_max_ntoken = 6)
The following uses data_corpus_inaugural, an example corpus of US presidential inaugural addresses that is part of the quanteda package.
inaugural_toks <- tokens(data_corpus_inaugural)
inaugural_toks
Tokens consisting of 59 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens" "of" "the" "Senate"
[5] "and" "of"
[ ... and 1,531 more ]
1793-Washington :
[1] "Fellow" "citizens" "," "I" "am" "again"
[ ... and 141 more ]
1797-Adams :
[1] "When" "it" "was" "first" "perceived" ","
[ ... and 2,571 more ]
[ reached max_ndoc ... 56 more documents ]
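The output above shows that, by default, tokens() keeps punctuation marks as separate tokens and leaves hyphenated words such as "Fellow-Citizens" intact. The following minimal sketch illustrates this on a single made-up sentence (the text and the object names are assumptions for illustration, not part of the original notebook), using the remove_punct and split_hyphens arguments of tokens():

```r
library(quanteda)

# Hypothetical one-sentence example, not taken from the corpus
txt <- c(demo = "Fellow-Citizens of the Senate, welcome.")

# Default settings: punctuation marks are kept as separate tokens
toks_default <- tokens(txt)
# Drop punctuation tokens
toks_nopunct <- tokens(txt, remove_punct = TRUE)
# Split hyphenated words into their parts
toks_split <- tokens(txt, split_hyphens = TRUE)

print(toks_default)
print(toks_nopunct)
print(toks_split)
```

Which of these settings is appropriate depends on the analysis; the token counts further down use the defaults, so punctuation marks are included in them.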
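# Count the tokens in each speech
inaugural_ntoks <- sapply(inaugural_toks, length)
# Attach the counts to the document variables (Year, President, ...)
inaugural_ntoks <- cbind(docvars(inaugural_toks),
                         ntokens = inaugural_ntoks)
# Plot speech length against year, with a smoothed trend line
with(inaugural_ntoks,
     scatter.smooth(Year, ntokens,
                    ylab = "Number of tokens per speech"))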
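The per-document counts above are obtained by applying length() to each element of the tokens object. quanteda also provides ntoken() for the same purpose. The sketch below compares the two on a small made-up tokens object (the texts and names are assumptions for illustration):

```r
library(quanteda)

# Hypothetical two-document example
toy_toks <- tokens(c(d1 = "Fellow citizens, I am again called.",
                     d2 = "When it was first perceived."))

# Token counts via sapply(), as in the code above
counts_sapply <- sapply(toy_toks, length)
# Token counts via quanteda's ntoken()
counts_ntoken <- ntoken(toy_toks)

print(counts_sapply)
print(counts_ntoken)
```

Both approaches should yield one count per document, named by document, so either can be combined with docvars() as shown above.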

# Split each speech into sentences; every sentence becomes its own document
inaugural_sntc <- corpus_reshape(data_corpus_inaugural,
                                 to = "sentences")
inaugural_sntc_toks <- tokens(inaugural_sntc)
# Count the tokens in each sentence
inaugural_sntc_ntoks <- sapply(inaugural_sntc_toks, length)
inaugural_sntc_ntoks <- cbind(docvars(inaugural_sntc_toks),
                              ntokens = inaugural_sntc_ntoks)
# Average sentence length (in tokens) per year
inaugural_sntc_ntoks <- aggregate(ntokens ~ Year,
                                  data = inaugural_sntc_ntoks,
                                  FUN = mean)
with(inaugural_sntc_ntoks,
     scatter.smooth(Year, ntokens,
                    ylab = "Number of tokens per sentence"))
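To see what the reshaping and aggregation steps above do, it can help to run them on a corpus small enough to check by hand. The following is a minimal sketch with a made-up two-document corpus (the texts, the Year values and all toy_ names are assumptions for illustration):

```r
library(quanteda)

# Hypothetical corpus with a Year document variable
toy_corp <- corpus(c(a = "One two three. Four five.",
                     b = "Six seven eight nine."),
                   docvars = data.frame(Year = c(1900, 1904)))

# Each sentence becomes a document and inherits the docvars of its source
toy_sntc <- corpus_reshape(toy_corp, to = "sentences")
toy_sntc_toks <- tokens(toy_sntc)

# Tokens per sentence (the final full stop counts as a token)
toy_df <- cbind(docvars(toy_sntc_toks),
                ntokens = sapply(toy_sntc_toks, length))

# Mean sentence length per year
toy_mean <- aggregate(ntokens ~ Year, data = toy_df, FUN = mean)
print(toy_mean)
```

Document "a" contributes two sentences and document "b" one, so the mean for the first year averages over two sentence lengths while the second year has a single value.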

- R file: quanteda-tokens.R
- Rmarkdown file: quanteda-tokens.Rmd
- Jupyter notebook file: quanteda-tokens.ipynb