Setup

This document walks you through the general concepts of textanalysis and demonstrates the broad workflow of the package. First you will need to have the package installed, of course; instructions are on the homepage.
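If you are installing from GitHub, the call typically looks like the sketch below; note that the news-r/textanalysis repository path is an assumption, and the homepage has the authoritative instructions.

# install.packages("remotes")                  # if not already installed
remotes::install_github("news-r/textanalysis") # repository path assumed; see the homepage

Once installed, the package can be loaded.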

library(textanalysis)

Because the package depends on Julia, we must initialise the session.

init_textanalysis()
#> Julia version 1.1.1 at location /home/jp/Downloads/julia-1.1.1-linux-x86_64/julia-1.1.1/bin will be used.
#> Loading setup script for JuliaCall...
#> Finish loading setup script for JuliaCall.
#> ✔ textanalysis backend initialised.

Document

Once the package is installed, loaded, and the session initialised, we can start using it. The most basic object type in textanalysis is the document; it can be created with any of the *_document functions, though you will likely only need string_document to create a document from a character string.

str <- "This is a very simple document!"
(doc <- string_document(str))
#> ℹ A single document.

You can always get the content of the document with get_text.

get_text(doc)
#> [1] "This is a very simple document!"

Turning a character string into a document lets you easily clean it, or prepare it in textanalysis jargon. There is a multitude of ways to clean text in textanalysis; they are further detailed in the preprocessing vignettes. Here we use the straightforward prepare function, leaving all arguments at their defaults.

prepare(doc)
#> ⚠ This function changes `document` in place!

Notice the warning: the prepare function changes the object doc in place. We did not assign the (nonexistent) output of prepare to a new object, yet the object doc changed. Let’s demonstrate.

get_text(doc) # see the document changed!
#> [1] "    simple document"

This is somewhat puzzling for us R users, but it happens for a good reason: textanalysis does not need to make a copy of the object, which allows it to process more data.
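If you need to keep an untouched version of a document, the simplest pattern is to keep the original character string around and rebuild a fresh document from it whenever needed. A quick sketch using only the functions shown above:

original <- "This is a very simple document!"
doc <- string_document(original)       # this copy will be cleaned in place
prepare(doc)                           # modifies doc, not `original`
pristine <- string_document(original)  # rebuild an untouched document at any time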

In practice, however, you will rarely use the package in this manner, since you will usually have multiple documents to process at once. You can do so with the to_documents function, which builds multiple documents from a vector or a data.frame. All the functions used previously also work on objects of class documents.

texts <- c(
  str,
  "This is another document"
)

(docs <- to_documents(texts))
#> ℹ 2 documents.

prepare(docs)
#> ⚠ This function changes `documents` in place!
get_text(docs)
#> # A tibble: 2 x 2
#>   text                  document
#>   <chr>                    <int>
#> 1 "    simple document"        1
#> 2 "   document"                2

Corpus

With our documents constructed and cleaned, we can build a corpus. You could do so with corpus(docs); however, we will use a sample dataset from the package: a set of 1,045 Reuters articles on 10 different commodities.

data("reuters")

dplyr::count(reuters, category, sort = TRUE)
#> # A tibble: 10 x 2
#>    category     n
#>    <chr>    <int>
#>  1 corn       237
#>  2 sugar      162
#>  3 coffee     139
#>  4 gold       124
#>  5 soybean    111
#>  6 cotton      59
#>  7 rice        59
#>  8 gas         54
#>  9 barley      51
#> 10 rubber      49

We’ll use the to_documents function, which has methods for character vectors and data.frames, to build multiple documents at once; we’ll just take 3 documents.

docs <- reuters %>%
  dplyr::slice(1:3) %>% 
  to_documents(text = text)
prepare(docs)
#> ⚠ This function changes `documents` in place!
(crps <- corpus(docs))
#> ℹ A corpus of 3 documents.

We can already do a lot more with a corpus than with mere documents, such as extracting the lexicon, computing its size (the number of unique words), and more.

lexicon_size(crps)
#> [1] 196
lexicon <- lexicon(crps)

lexicon %>% 
  dplyr::arrange(-n) %>% 
  dplyr::slice(1:5)
#> # A tibble: 5 x 2
#>   words        n
#>   <chr>    <int>
#> 1 exchange    13
#> 2 coffee      12
#> 3 rubber       8
#> 4 trading      8
#> 5 trade        5

You can also fetch the lexical frequency of a specific word.

lexical_frequency(crps, "corn")
#> [1] 0

You can even assess the sentiment of documents.

sentiment(crps)
#> [1] 0.5596958 0.4871549 0.5085609

Document-term Matrix

Now that we have a corpus, we can compute the document-term matrix, which is at the core of a multitude of computations and models.

(dtm <- document_term_matrix(crps))
#> Julia Object of type DocumentTermMatrix.
#> A DocumentTermMatrix

A document-term matrix enables you to compute various frequency-related matrices, such as tf-idf (term frequency-inverse document frequency) or Okapi BM-25.

tf <- tf(dtm)
tfidf <- tf_idf(dtm)
okapi <- bm_25(dtm)
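
For intuition, tf-idf weights a term's frequency within a document by how rare that term is across documents. Below is a rough base R sketch of the textbook formula on a toy count matrix; the package's own implementation may differ in normalisation details.

counts <- matrix(c(2, 0, 1,
                   0, 3, 1), nrow = 2, byrow = TRUE) # 2 toy documents, 3 terms
tf_manual <- counts / rowSums(counts)                # term frequency within each document
idf <- log(nrow(counts) / colSums(counts > 0))       # rarity of each term across documents
tfidf_manual <- sweep(tf_manual, 2, idf, `*`)        # weight frequency by rarity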

Note that tf, tf_idf, and bm_25 do not return the words themselves; you can obtain these with the lexicon function. We can bind them with a helper such as bind_lexicon, declared below: it is very straightforward, simply adding a column of words (the lexicon) to the matrices.

# function to bind the lexicon to our matrices
bind_lexicon <- function(data){
  data %>% 
    as.data.frame() %>% 
    dplyr::bind_cols(
      lexicon %>% 
        dplyr::select(-n),
      .
    )
}

bind_lexicon(tf)
#> # A tibble: 196 x 4
#>    words           V1     V2     V3
#>    <chr>        <dbl>  <dbl>  <dbl>
#>  1 introduced 0       0.0185 0     
#>  2 producers  0       0.0185 0     
#>  3 crude      0.00518 0      0     
#>  4 move       0       0.0185 0     
#>  5 indonesian 0       0      0.0244
#>  6 korea      0.00518 0      0     
#>  7 production 0       0.0185 0     
#>  8 physicals  0.00518 0      0     
#>  9 months     0.00518 0      0     
#> 10 chairman   0.0104  0      0     
#> # … with 186 more rows
bind_lexicon(tfidf)
#> # A tibble: 196 x 4
#>    words           V1     V2     V3
#>    <chr>        <dbl>  <dbl>  <dbl>
#>  1 introduced 0       0.0203 0     
#>  2 producers  0       0.0203 0     
#>  3 crude      0.00569 0      0     
#>  4 move       0       0.0203 0     
#>  5 indonesian 0       0      0.0268
#>  6 korea      0.00569 0      0     
#>  7 production 0       0.0203 0     
#>  8 physicals  0.00569 0      0     
#>  9 months     0.00569 0      0     
#> 10 chairman   0.0114  0      0     
#> # … with 186 more rows
bind_lexicon(okapi)
#> # A tibble: 196 x 4
#>    words         V1    V2    V3
#>    <chr>      <dbl> <dbl> <dbl>
#>  1 introduced  0     2.94  0   
#>  2 producers   0     2.94  0   
#>  3 crude       1.67  0     0   
#>  4 move        0     2.94  0   
#>  5 indonesian  0     0     3.24
#>  6 korea       1.67  0     0   
#>  7 production  0     2.94  0   
#>  8 physicals   1.67  0     0   
#>  9 months      1.67  0     0   
#> 10 chairman    1.19  0     0   
#> # … with 186 more rows