Text is inherently dirty and requires extensive cleaning. Thankfully, textanalysis has all the functions you need to obtain clean text ready to model. The core function to preprocess data is
prepare, as it allows efficiently applying multiple cleaning functions at once.
The functions listed above apply to objects of class
corpus, so we’ll create a dirty document to clean.
str <- "ThIs Is SoMe <span>vEry</span> Dirty TeXT!.!!"
(doc <- string_document(str))
#> ℹ A single document.
Now we can apply the
prepare function to the document.
Notice something important: the document (the
doc object in our case) is cleaned in place. This means the function
prepare does not return the cleaned input but instead modifies the object in the environment. This is useful when dealing with large amounts of text, as the text is not copied.
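The difference between cleaning in place and returning a cleaned copy can be sketched in base R. This is only an analogy: `new_doc`, `clean_in_place` and `clean_copy` are hypothetical helpers written for this illustration, and an R environment stands in for the document. It does not show how textanalysis stores documents internally.

```r
# Hypothetical stand-in for a document: an environment holding some text.
# Environments are passed by reference in R, so functions can modify them.
new_doc <- function(text) {
  doc <- new.env()
  doc$text <- text
  doc
}

# Hypothetical in-place cleaner: mutates its argument and returns nothing useful.
clean_in_place <- function(doc) {
  doc$text <- tolower(doc$text)
  invisible(NULL)
}

# Ordinary R function: works on a copy and returns the result.
clean_copy <- function(text) tolower(text)

doc <- new_doc("Dirty TeXT")
clean_in_place(doc)   # no assignment needed
doc$text              # "dirty text"

s <- "Dirty TeXT"
cleaned <- clean_copy(s)
s                     # still "Dirty TeXT"; only `cleaned` holds the result
```

With the in-place style there is never a second copy of the text in memory, which is the benefit described above for large corpora. The trade-off is that the original text is gone once the cleaner has run.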
Though it may look like the document is now somewhat messed up, with all the whitespace inserted as a replacement for punctuation and other removed elements, it is not: textanalysis will still extract tokens correctly.
get_tokens(doc)
#> "dirty" "text"
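To see why the extra whitespace is harmless, here is a minimal base R sketch. It assumes a tokenizer that splits on runs of whitespace, which is an assumption made for illustration and not a description of the tokenizer textanalysis actually uses. The `cleaned` string is a made-up example of text padded with spaces.

```r
# Made-up cleaned text with leading, trailing and repeated spaces.
cleaned <- "   dirty    text  "

# Trim the ends, then split on one or more whitespace characters.
tokens <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
tokens   # "dirty" "text"
```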
Note that textanalysis also comes with a stemmer.
Also, though there are numerous functions to clean text, all of them can be applied using
prepare: the corresponding Julia function applies them much more efficiently than calling each one individually. You will likely not see a difference when working with a small number of documents. If you do use them separately, ensure you apply them in the correct order, i.e. do not strip the punctuation before the HTML tags (see example below).
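The ordering problem can be reproduced with plain regular expressions. The sketch below uses base R only: `drop_tags` and `drop_punct` are hypothetical helpers written for this illustration, not the package's own cleaning functions, and their regexes are deliberately simplistic.

```r
str <- "ThIs Is SoMe <span>vEry</span> Dirty TeXT!.!!"

# Hypothetical helpers built on base R regex, for illustration only.
drop_tags  <- function(x) gsub("<[^>]+>", " ", x)       # remove <...> tags
drop_punct <- function(x) gsub("[[:punct:]]", " ", x)   # remove punctuation

# Correct order: tags first, then punctuation.
right <- drop_punct(drop_tags(str))

# Wrong order: punctuation removal deletes "<", ">" and "/",
# so the tag pattern no longer matches and "span" is left behind.
wrong <- drop_tags(drop_punct(str))

grepl("span", right)  # FALSE
grepl("span", wrong)  # TRUE
```

Once the angle brackets are gone, nothing distinguishes the tag name from an ordinary word, so the leftover "span" would end up among the tokens.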