Text is inherently dirty and removes extensive cleaning, thankfully textanalysis has all the functions you need to obtain clean text ready to model. The core function to preprocess data is prepare as it always efficiently applying multiple strip_* functions.

  • prepare
  • remove_case
  • remove_corrupt_utf8
  • remove_words
  • strip_articles
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_preposition
  • strip_pronouns
  • strip_stopwords
  • strip_numbers
  • strip_non_letters
  • strip_spares_terms
  • strip_frequent_terms
  • strip_html_tags

The functions listed above apply to objects of class document, documents, and corpus, so we’ll create a dirty document to clean.

str <- "ThIs Is SoMe <span>vEry</span> Dirty TeXT!.!!"

(doc <- string_document(str))
#> ℹ A single document.

Now we can apply the prepare function to the document.

prepare(doc)
#> ⚠ This function changes `document` in place!
get_text(doc)
#> [1] "    dirty text"

Notice something important, the document (doc object in our case) is cleanded in place. This means the function prepare does not return the cleaned input and rather cleans the object in the environment. This is useful when dealing with large amount of text as these are not copied.

Though it may look like the document is now somewhat messed up with all the whitespaces inserted (as replacement to punctuation and other things) it is not, e.g.: textanalysis will still extract tokens correctly.

get_tokens(doc)
#> [1] "dirty" "text"

Note that textanalysis also comes with a stemmer.

stem_words(doc)
get_text(doc)
#> [1] "dirti text"

Also, though there are numerous functions to clean text all can be applied efficiently using prepare: the corresponsding Julia function applies them much more efficiently than using individually. You will likely not see a difference when working with a small number of documents. If you do use them separately ensure you apply them in the correct order, i.g.: do not strip the punctuation before the html tags (see example below).

str <- "This is <span>html</span>."
doc <- string_document(str)

# DONT DO THAT
strip_punctuation(doc)
get_text(doc)
#> [1] "This is spanhtmlspan"