In this example we use the first names corpus from NLTK to build a Naive Bayes classifier that predicts gender. The example is adapted from the NLTK book. It is deliberately basic: the only feature is the last letter of each name, yet the classifier still achieves over 70% accuracy.

First we load the packages, initialise the backend and the model, then load the names, shuffle them, and extract the last letter of each one.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(textanalysis)

init_textanalysis()
#> Julia version 1.1.1 at location /home/jp/Downloads/julia-1.1.1-linux-x86_64/julia-1.1.1/bin will be used.
#> Loading setup script for JuliaCall...
#> Finish loading setup script for JuliaCall.
#> ✔ textanalysis backend initialised.

classes <- factor(c("male", "female"))
model <- init_naive_classifer(classes)

# get data and shuffle
first_names <- nltk4r::first_names(to_r = TRUE) %>% 
  slice(sample(1:n())) %>% 
  mutate(last_letter = substr(name, nchar(name), nchar(name)))

head(first_names)
#> # A tibble: 6 x 3
#>   name     gender last_letter
#>   <chr>    <chr>  <chr>      
#> 1 Kerstin  female n          
#> 2 Emmett   male   t          
#> 3 Lazlo    male   o          
#> 4 Munroe   male   e          
#> 5 Alfreda  female a          
#> 6 Terri-Jo female o

We then split the data into a training set (the first 70% of the shuffled rows) and a test set (the remainder), and train the model initialised earlier.

# split
split <- floor(0.7 * nrow(first_names))
train <- first_names[1:split, ]
test <- first_names[(split + 1):nrow(first_names), ]

# train
train_naive_classifier(model, train, last_letter, gender)
#> ⚠ This function changes `model` in place!
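
To make concrete what a Naive Bayes model with a single categorical feature estimates, here is a minimal base R sketch that computes the class posteriors by hand. It is not the code that `train_naive_classifier()` runs: the toy data, the helper name `posterior_by_hand`, and the add-one smoothing are all assumptions made for illustration, and the Julia backend may smooth or weight counts differently.

```r
# Toy training data, invented for illustration (not the NLTK names)
toy <- data.frame(
  last_letter = c("a", "a", "a", "n", "n", "o", "d", "a"),
  gender      = c("female", "female", "female", "male",
                  "female", "male", "male", "male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: posterior over classes for one last letter,
# with add-one (Laplace) smoothing over the letters seen in training
posterior_by_hand <- function(data, letter, alpha = 1) {
  letters_seen <- unique(data$last_letter)
  classes <- unique(data$gender)
  scores <- sapply(classes, function(cl) {
    in_class <- data$last_letter[data$gender == cl]
    # P(class): share of training rows in this class
    prior <- length(in_class) / nrow(data)
    # P(letter | class): smoothed share of this letter within the class
    likelihood <- (sum(in_class == letter) + alpha) /
      (length(in_class) + alpha * length(letters_seen))
    prior * likelihood
  })
  scores / sum(scores)  # normalise so the posteriors sum to 1
}

posterior_by_hand(toy, "a")
```

With only one feature, the "naive" independence assumption plays no role; the model reduces to Bayes' rule applied to per-class letter frequencies, which is why the predicted probabilities are identical for every name ending in the same letter.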

Now we can predict and measure accuracy.

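# predict class probabilities for the test set
# (stored under a new name so the `classes` factor is not overwritten)
probabilities <- predict_class(model, test, last_letter)

# bind test and predictions
predicted <- test %>% 
  bind_cols(probabilities) %>% 
  mutate(
    # label a name as male when its male probability exceeds 0.5
    predicted_gender = case_when(
      male > .5 ~ "male",
      TRUE ~ "female"
    ),
    # TRUE where the prediction matches the true label
    accuracy = predicted_gender == gender
  )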

# predictions
head(predicted)
#> # A tibble: 6 x 7
#>   name    gender last_letter   male female predicted_gender accuracy
#>   <chr>   <chr>  <chr>        <dbl>  <dbl> <chr>            <lgl>   
#> 1 Maynord male   d           0.906  0.0941 male             TRUE    
#> 2 Irina   female a           0.0264 0.974  female           TRUE    
#> 3 Junia   female a           0.0264 0.974  female           TRUE    
#> 4 Shanda  female a           0.0264 0.974  female           TRUE    
#> 5 Niki    female i           0.234  0.766  female           TRUE    
#> 6 Karol   female l           0.636  0.364  male             FALSE

# accuracy: share of correct predictions
mean(predicted$accuracy)
#> [1] 0.7407718
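
A single accuracy figure hides which kind of error the model makes, so a confusion matrix and a majority-class baseline are useful companions. The sketch below uses only base R and, so that it runs on its own, retypes the six rows printed by `head(predicted)` above; the object names `shown` and `confusion` are introduced here for illustration. Six rows are far too few to draw conclusions from, so treat this as a pattern to apply to the full `predicted` data frame.

```r
# The six rows printed by head(predicted), retyped so the block is self-contained
shown <- data.frame(
  gender           = c("male", "female", "female", "female", "female", "female"),
  predicted_gender = c("male", "female", "female", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Confusion matrix: rows are true labels, columns are predictions
confusion <- table(actual = shown$gender, predicted = shown$predicted_gender)
confusion

# Accuracy is the share of counts on the diagonal
# (valid here because both dimensions have the same levels in the same order)
sum(diag(confusion)) / sum(confusion)

# Majority-class baseline: always predict the most frequent true label
max(table(shown$gender)) / nrow(shown)
```

On the full test set the same calls would be `table(actual = predicted$gender, predicted = predicted$predicted_gender)` and `max(table(predicted$gender)) / nrow(predicted)`; comparing the baseline with the reported accuracy shows how much the last letter adds over always guessing the more common gender.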