Obtaining tokens in text documents using quanteda

This notebook uses the quanteda package. To run it on your own computer, you may first need to install the package from CRAN with install.packages("quanteda"). (It is already installed in the notebook container.)

library(quanteda)
Package version: 3.2.1
Unicode version: 13.0
ICU version: 67.1
Parallel computing: 20 of 20 threads used.
See https://quanteda.io for tutorials and examples.
quanteda_options(print_tokens_max_ndoc=3,
                 print_tokens_max_ntoken=6)
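
The two options set above only control how tokens objects are printed; they do not change the tokens themselves. As a minimal sketch, one might inspect a setting and later restore the defaults as follows (the variable name max_ndoc is introduced here for illustration only):

```r
library(quanteda)

# Limit printing to 3 documents and 6 tokens per document
quanteda_options(print_tokens_max_ndoc = 3,
                 print_tokens_max_ntoken = 6)

# Query a single option by name
max_ndoc <- quanteda_options("print_tokens_max_ndoc")
max_ndoc

# Restore all quanteda options to their package defaults
quanteda_options(reset = TRUE)
```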

The following uses data_corpus_inaugural, an example corpus of US presidential inaugural addresses that ships with the quanteda package.

inaugural_toks <- tokens(data_corpus_inaugural)
inaugural_toks
Tokens consisting of 59 documents and 4 docvars.
1789-Washington :
[1] "Fellow-Citizens" "of"              "the"             "Senate"         
[5] "and"             "of"             
[ ... and 1,531 more ]

1793-Washington :
[1] "Fellow"   "citizens" ","        "I"        "am"       "again"   
[ ... and 141 more ]

1797-Adams :
[1] "When"      "it"        "was"       "first"     "perceived" ","        
[ ... and 2,571 more ]

[ reached max_ndoc ... 56 more documents ]
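
As the printed output shows, tokens() keeps punctuation marks as separate tokens by default. The following sketch, using a made-up one-sentence text rather than the inaugural corpus, illustrates the effect of the remove_punct argument:

```r
library(quanteda)

# A toy sentence, made up for illustration
txt <- c(toy = "Fellow citizens, I am again called upon.")

toks <- tokens(txt)                               # punctuation kept as tokens
toks_nopunct <- tokens(txt, remove_punct = TRUE)  # punctuation dropped

as.character(toks)
as.character(toks_nopunct)

# Token counts with and without punctuation
ntoken(toks)
ntoken(toks_nopunct)
```

Whether punctuation should count towards the length of a speech is a design choice; the analysis in this notebook keeps it.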
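# Count the tokens in each speech
inaugural_ntoks <- sapply(inaugural_toks,
                          length)
# Attach the counts to the document variables (Year, President, ...)
inaugural_ntoks <- cbind(docvars(inaugural_toks),
                         ntokens = inaugural_ntoks)
# Plot speech length against year, with a loess smoother
with(inaugural_ntoks,
     scatter.smooth(Year, ntokens,
                    ylab = "Number of tokens per speech"))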
[Figure quanteda-tokens_7_0.png: number of tokens per speech plotted against year, with a smoothed trend line]
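
The token counts per speech can also be obtained with quanteda's own ntoken() function. This is only a sketch of an alternative; the names n_a and n_b are introduced here for illustration, and the two approaches should give the same counts:

```r
library(quanteda)

inaugural_toks <- tokens(data_corpus_inaugural)

# ntoken() returns a named integer vector of token counts per document
n_a <- ntoken(inaugural_toks)

# The base R approach used in this notebook
n_b <- sapply(inaugural_toks, length)

head(n_a, 3)
```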
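# Split each speech into sentences; every sentence becomes a document
inaugural_sntc <- corpus_reshape(data_corpus_inaugural,
                                 to = "sentences")
inaugural_sntc_toks <- tokens(inaugural_sntc)
# Count the tokens in each sentence
inaugural_sntc_ntoks <- sapply(inaugural_sntc_toks,
                               length)
inaugural_sntc_ntoks <- cbind(docvars(inaugural_sntc_toks),
                              ntokens = inaugural_sntc_ntoks)
# Average sentence length (in tokens) for each year
inaugural_sntc_ntoks <- aggregate(ntokens ~ Year,
                                  data = inaugural_sntc_ntoks,
                                  FUN = mean)
# Plot mean sentence length against year, with a loess smoother
with(inaugural_sntc_ntoks,
     scatter.smooth(Year, ntokens,
                    ylab = "Number of tokens per sentence"))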
[Figure quanteda-tokens_10_0.png: mean number of tokens per sentence plotted against year, with a smoothed trend line]
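
To make the sentence-level steps easier to follow, here is a minimal sketch on a made-up two-document corpus. The texts, the document names and the Year values are invented for illustration and are not part of the inaugural corpus:

```r
library(quanteda)

# Two made-up documents with a Year docvar, purely for illustration
toy <- corpus(c(a = "One two three. Four five.",
                b = "Six seven eight nine."),
              docvars = data.frame(Year = c(2000, 2001)))

# Each sentence becomes its own document and inherits the docvars
toy_sntc <- corpus_reshape(toy, to = "sentences")
ndoc(toy_sntc)

# Tokens per sentence, then the mean per Year
sntc_ntoks <- cbind(docvars(toy_sntc),
                    ntokens = ntoken(tokens(toy_sntc)))
res <- aggregate(ntokens ~ Year, data = sntc_ntoks, FUN = mean)
res
```

Because the reshaped corpus carries the docvars of the original documents, sentence-level counts can be grouped by Year without any extra bookkeeping.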