get-started.Rmd
This document walks you through the general concepts of textanalysis and demonstrates the broad workflow of the package. First, you will need to have the package installed; instructions are on the homepage.
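For reference, a typical installation of the development version from GitHub might look like the following; the repository name (news-r/textanalysis) is an assumption here, so defer to the homepage instructions.
# install the development version from GitHub
# (repository name assumed -- see the homepage)
# install.packages("remotes")
remotes::install_github("news-r/textanalysis")
Once installed, the package can be loaded.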
library(textanalysis)
Because the package depends on Julia, we must first initialise the session.
init_textanalysis()
#> Julia version 1.1.1 at location /home/jp/Downloads/julia-1.1.1-linux-x86_64/julia-1.1.1/bin will be used.
#> Loading setup script for JuliaCall...
#> Finish loading setup script for JuliaCall.
#> ✔ textanalysis backend initialised.
Once the package is installed and loaded, and the session initialised, we can start using it. The most basic object type in textanalysis is the document; it can be created with any of the *_document functions, though you will likely only need string_document to create a document from a character string.
str <- "This is a very simple document!"
(doc <- string_document(str))
#> ℹ A single document.
You can always get the content of the document with get_text.
get_text(doc)
#> [1] "This is a very simple document!"
Turning a character string into a document allows you to easily clean it, or prepare it in textanalysis jargon. There are a multitude of ways to clean text in textanalysis; they are further detailed in the preprocessing vignettes. Here we use the straightforward prepare function, leaving all arguments at their defaults.
prepare(doc)
#> ⚠ This function changes `document` in place!
Notice the warning: the prepare function changes the object doc in place. We did not assign the (nonexistent) output of prepare to a new object, and yet the object doc changed. Let’s demonstrate.
get_text(doc) # see the document changed!
#> [1] " simple document"
This is somewhat puzzling for us R users, but it happens for good reason: textanalysis does not need to make a copy of the object, which allows processing more data.
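If you do need the untouched text later, one option is to hold on to the original character string and rebuild a fresh document from it; a minimal sketch using only the functions seen above (the names raw_text and doc_fresh are illustrative).
# documents are modified in place, so keep the raw string
# and rebuild an unprocessed document from it when needed
raw_text <- str
doc_fresh <- string_document(raw_text)
get_text(doc_fresh) # the raw, unprocessed text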
However, you will rarely use the package in this manner, as you will generally have multiple documents to process at once. You can do so with the to_documents function, which builds multiple documents from a vector or a data.frame. All the functions used previously also work on objects of class documents.
texts <- c(
  str,
  "This is another document"
)
(docs <- to_documents(texts))
#> ℹ 2 documents.
prepare(docs)
#> ⚠ This function changes `documents` in place!
get_text(docs)
#> # A tibble: 2 x 2
#>   text               document
#>   <chr>                 <int>
#> 1 " simple document"        1
#> 2 " document"               2
With our documents constructed and cleaned, we can build a corpus. You could do so with corpus(docs); here, however, we will use a sample dataset included in the package: a set of 1,045 Reuters articles on 10 different commodities.
data("reuters")
dplyr::count(reuters, category, sort = TRUE)
#> # A tibble: 10 x 2
#>    category     n
#>    <chr>    <int>
#>  1 corn       237
#>  2 sugar      162
#>  3 coffee     139
#>  4 gold       124
#>  5 soybean    111
#>  6 cotton      59
#>  7 rice        59
#>  8 gas         54
#>  9 barley      51
#> 10 rubber      49
We’ll use the to_documents function, which has methods for character vectors and data.frames, to build multiple documents at once; we’ll just take the first 3 documents.
docs <- reuters %>%
  dplyr::slice(1:3) %>%
  to_documents(text = text)
prepare(docs)
#> ⚠ This function changes `documents` in place!
(crps <- corpus(docs))
#> ℹ A corpus of 3 documents.
We can already do a lot more with a corpus than with mere documents, like extracting the lexicon, computing its size (the number of unique words), and more.
lexicon_size(crps)
#> [1] 196
lexicon <- lexicon(crps)
lexicon %>%
  dplyr::arrange(-n) %>%
  dplyr::slice(1:5)
#> # A tibble: 5 x 2
#>   words        n
#>   <chr>    <int>
#> 1 exchange    13
#> 2 coffee      12
#> 3 rubber       8
#> 4 trading      8
#> 5 trade        5
You can also fetch the lexical frequency of a specific word.
lexical_frequency(crps, "corn")
#> [1] 0
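The word "corn" does not occur in these three documents, hence the zero. You can of course query a word we know is in the lexicon, e.g.:
lexical_frequency(crps, "coffee")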
You can even assess the sentiment of documents.
sentiment(crps)
#> [1] 0.5596958 0.4871549 0.5085609
Now that we have a corpus, we can compute the document-term matrix, which is at the core of a multitude of computations and models.
(dtm <- document_term_matrix(crps))
#> Julia Object of type DocumentTermMatrix.
#> A DocumentTermMatrix
A document-term matrix enables you to compute various frequency-related matrices, such as tf-idf (term frequency-inverse document frequency) or even Okapi BM-25.
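The matrices bound below (tf, tfidf, and okapi) are computed from the document-term matrix. A minimal sketch follows, assuming the R functions mirror the names of the underlying Julia API (tf, tf_idf, and bm_25); check the package reference for the exact names.
# compute frequency-related matrices from the document-term matrix
# (function names assumed to mirror the Julia API)
tf <- tf(dtm)        # term frequencies
tfidf <- tf_idf(dtm) # term frequency-inverse document frequency
okapi <- bm_25(dtm)  # Okapi BM-25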
Note that the functions run above return bare matrices and do not include the words themselves; you can obtain these with the lexicon function. Binding the two can be done with a function such as bind_lexicon, declared below: very straightforward, it simply adds a column for the words (the lexicon).
# function to bind the lexicon to our matrices
bind_lexicon <- function(data){
  data %>%
    as.data.frame() %>%
    dplyr::bind_cols(
      lexicon %>%
        dplyr::select(-n),
      .
    )
}
bind_lexicon(tf)
#> # A tibble: 196 x 4
#>    words           V1     V2     V3
#>    <chr>        <dbl>  <dbl>  <dbl>
#>  1 introduced 0       0.0185 0
#>  2 producers  0       0.0185 0
#>  3 crude      0.00518 0      0
#>  4 move       0       0.0185 0
#>  5 indonesian 0       0      0.0244
#>  6 korea      0.00518 0      0
#>  7 production 0       0.0185 0
#>  8 physicals  0.00518 0      0
#>  9 months     0.00518 0      0
#> 10 chairman   0.0104  0      0
#> # … with 186 more rows
bind_lexicon(tfidf)
#> # A tibble: 196 x 4
#>    words           V1     V2     V3
#>    <chr>        <dbl>  <dbl>  <dbl>
#>  1 introduced 0       0.0203 0
#>  2 producers  0       0.0203 0
#>  3 crude      0.00569 0      0
#>  4 move       0       0.0203 0
#>  5 indonesian 0       0      0.0268
#>  6 korea      0.00569 0      0
#>  7 production 0       0.0203 0
#>  8 physicals  0.00569 0      0
#>  9 months     0.00569 0      0
#> 10 chairman   0.0114  0      0
#> # … with 186 more rows
bind_lexicon(okapi)
#> # A tibble: 196 x 4
#>    words         V1    V2    V3
#>    <chr>      <dbl> <dbl> <dbl>
#>  1 introduced  0     2.94  0
#>  2 producers   0     2.94  0
#>  3 crude       1.67  0     0
#>  4 move        0     2.94  0
#>  5 indonesian  0     0     3.24
#>  6 korea       1.67  0     0
#>  7 production  0     2.94  0
#>  8 physicals   1.67  0     0
#>  9 months      1.67  0     0
#> 10 chairman    1.19  0     0
#> # … with 186 more rows
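As a quick illustration of what these matrices enable, plain dplyr and tidyr are enough to, say, rank the three documents for a given word by tf-idf; this is a hypothetical usage example, not a textanalysis function.
# which document is most characterised by "coffee" according to tf-idf?
bind_lexicon(tfidf) %>%
  dplyr::filter(words == "coffee") %>%
  tidyr::pivot_longer(-words, names_to = "document", values_to = "tfidf") %>%
  dplyr::arrange(-tfidf)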