naive-bayes.Rmd
This example uses the first names corpus from NLTK to build a Naive Bayes classifier that predicts gender, following an example from the NLTK book. It is deliberately basic: the only feature is the last letter of each name, yet the classifier still achieves over 70% accuracy.
First we load the packages, initialise the classifier with its two classes, then load the data and extract the last letter of each first name.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(textanalysis)
init_textanalysis()
#> Julia version 1.1.1 at location /home/jp/Downloads/julia-1.1.1-linux-x86_64/julia-1.1.1/bin will be used.
#> Loading setup script for JuliaCall...
#> Finish loading setup script for JuliaCall.
#> ✔ textanalysis backend initialised.
classes <- factor(c("male", "female"))
model <- init_naive_classifer(classes)
# get data and shuffle
first_names <- nltk4r::first_names(to_r = TRUE) %>%
  slice(sample(1:n())) %>%
  mutate(last_letter = substr(name, nchar(name), nchar(name)))
head(first_names)
#> # A tibble: 6 x 3
#>   name     gender last_letter
#>   <chr>    <chr>  <chr>
#> 1 Kerstin  female n
#> 2 Emmett   male   t
#> 3 Lazlo    male   o
#> 4 Munroe   male   e
#> 5 Alfreda  female a
#> 6 Terri-Jo female o
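To make the feature extraction concrete, here is a minimal base R sketch of the same `substr()` call on a handful of names. The vector `toy_names` is hypothetical and only for illustration; the lower-casing step is an extra precaution that the pipeline above does not apply.

```r
# Hypothetical names for illustration; not read from the NLTK corpus.
toy_names <- c("Anna", "Emmett", "Terri-Jo", "Lazlo")

# substr() with start == stop == nchar() keeps only the final character.
last_letter <- substr(toy_names, nchar(toy_names), nchar(toy_names))

# Optional precaution (an assumption, not part of the original pipeline):
# lower-case so that "A" and "a" would count as the same feature value.
last_letter <- tolower(last_letter)

print(last_letter)
```

Hyphenated names such as "Terri-Jo" need no special handling, since only the final character is used.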
We can then split the data into training and test sets and train the model we initialised above.
# split
split <- floor(0.7 * nrow(first_names))
train <- first_names[1:split, ]
test <- first_names[(split + 1):nrow(first_names), ]
# train
train_naive_classifier(model, train, last_letter, gender)
#> ⚠ This function changes `model` in place!
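To give an intuition for what training on a single categorical feature amounts to, the sketch below computes a Naive Bayes posterior by hand in base R. It is not the textanalysis implementation and makes no claim about how the backend works internally: the data frame `toy`, the helper `posterior()`, and the add-one smoothing are all assumptions made for illustration.

```r
# Hypothetical toy training data; not taken from the NLTK corpus.
toy <- data.frame(
  last_letter = c("a", "a", "a", "e", "n", "n", "o", "d"),
  gender = c("female", "female", "female", "female",
             "male", "female", "male", "male"),
  stringsAsFactors = FALSE
)

# Class priors P(gender), estimated from relative frequencies.
prior <- prop.table(table(toy$gender))

# Letter counts per class: rows are genders, columns are last letters.
counts <- table(toy$gender, toy$last_letter)

# Likelihoods P(letter | gender) with add-one smoothing (an assumption),
# so a letter never seen with a class does not get probability zero.
likelihood <- (counts + 1) / (rowSums(counts) + ncol(counts))

# Posterior P(gender | letter) via Bayes' rule, normalised to sum to 1.
# Hypothetical helper; `letter` must be a letter seen in the training data.
posterior <- function(letter) {
  unnorm <- as.numeric(prior) * as.numeric(likelihood[, letter])
  setNames(unnorm / sum(unnorm), names(prior))
}

print(posterior("a"))
```

With several features the per-feature likelihoods would simply be multiplied together; with only the last letter, the model reduces to the single product shown here.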
Now we can predict and measure accuracy.
# predict test
classes <- predict_class(model, test, last_letter)
# bind test and predictions
predicted <- test %>%
  bind_cols(classes) %>%
  mutate(
    predicted_gender = case_when(
      male > .5 ~ "male",
      TRUE ~ "female"
    ),
    accuracy = case_when(
      predicted_gender == gender ~ TRUE,
      TRUE ~ FALSE
    )
  )
# predictions
head(predicted)
#> # A tibble: 6 x 7
#>   name    gender last_letter   male female predicted_gender accuracy
#>   <chr>   <chr>  <chr>        <dbl>  <dbl> <chr>            <lgl>
#> 1 Maynord male   d           0.906  0.0941 male             TRUE
#> 2 Irina   female a           0.0264 0.974  female           TRUE
#> 3 Junia   female a           0.0264 0.974  female           TRUE
#> 4 Shanda  female a           0.0264 0.974  female           TRUE
#> 5 Niki    female i           0.234  0.766  female           TRUE
#> 6 Karol   female l           0.636  0.364  male             FALSE
# accuracy
sum(predicted$accuracy) / nrow(predicted)
#> [1] 0.7407718
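A single accuracy figure hides which class the errors fall on, and it is easier to judge next to a naive baseline. The sketch below shows one way to compute a confusion matrix and a majority-class baseline in base R. The vectors `actual` and `predicted_label` are hypothetical stand-ins for `predicted$gender` and `predicted$predicted_gender`, not results from the model above.

```r
# Hypothetical labels and predictions for illustration only.
actual          <- c("female", "female", "female", "male", "male", "male")
predicted_label <- c("female", "female", "male", "male", "male", "female")

# Confusion matrix: rows are true labels, columns are predictions.
confusion <- table(actual = actual, predicted = predicted_label)
print(confusion)

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Baseline: accuracy obtained by always predicting the most frequent class.
baseline <- max(table(actual)) / length(actual)

print(c(accuracy = accuracy, baseline = baseline))
```

If the classes are imbalanced, the baseline can be well above 50%, so it is worth checking before reading too much into the accuracy.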