Topic 14 Evaluating Classification Models (Part 2)
Learning Goals
- Contextually interpret overall accuracy, sensitivity, specificity, and AUC
- Appropriately use and interpret the no-information rate to evaluate accuracy metrics
- Implement LASSO logistic regression in tidymodels
Slides from today are available here.
Exercises
You can download a template RMarkdown file to start from here.
Context
We’ll continue working with the spam dataset from last time.
- spam: Either spam or not spam (outcome)
- word_freq_WORD: percentage of words in the e-mail that match WORD (0-100)
- char_freq_CHAR: percentage of characters in the e-mail that match CHAR (e.g., exclamation points, dollar signs)
- capital_run_length_average: average length of uninterrupted sequences of capital letters
- capital_run_length_longest: length of longest uninterrupted sequence of capital letters
- capital_run_length_total: sum of length of uninterrupted sequences of capital letters
Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!
library(dplyr)
library(readr)
library(ggplot2)
library(tidymodels)
library(probably) # install.packages("probably")
tidymodels_prefer()
spam <- read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")

# A little data cleaning to remove the space in "not spam"
spam <- spam %>%
    mutate(spam = ifelse(spam=="spam", "spam", "not_spam"))
Exercise 1: Implementing LASSO logistic regression in tidymodels
Open up this tidymodels note sheet as a reference for writing your code from scratch.
Fit a LASSO logistic regression model for the spam outcome, and allow all possible predictors to be considered (spam ~ . for the model formula).
- Use 10-fold CV.
- Use the roc_auc and accuracy (overall accuracy) metrics when tuning.
- Initially try a sequence of 100 λ’s from 1 to 10.
- Diagnose whether this sequence should be updated by looking at the plot of test AUC vs. λ.
- If needed, adjust the max value in the sequence up or down by a factor of 10. (You’ll be able to determine from the plot whether to adjust up or down.)
set.seed(123)
# Need to set reference level (to the outcome you are NOT interested in)
spam <- spam %>%
    mutate(spam = relevel(factor(spam), ref="not_spam"))
# Set up CV folds
data_cv <-
# LASSO logistic regression model specification
logistic_lasso_spec <-
# Recipe
logistic_lasso_rec <-
# Workflow (Recipe + Model)
log_lasso_wf <-
# Tune model: specify grid of parameters and tune
penalty_grid <-
tune_output <-
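If you get stuck, here is one possible completion of the scaffold. This is a sketch following the standard tidymodels pattern from the note sheet; the step_normalize() preprocessing choice, the control settings, and the exact grid range are assumptions you may need to adjust.

# Set up CV folds
data_cv <- vfold_cv(spam, v = 10)

# LASSO logistic regression model specification (mixture = 1 specifies LASSO)
logistic_lasso_spec <- logistic_reg() %>%
    set_engine("glmnet") %>%
    set_args(mixture = 1, penalty = tune()) %>%
    set_mode("classification")

# Recipe (all predictors here are numeric, so normalizing is the key step)
logistic_lasso_rec <- recipe(spam ~ ., data = spam) %>%
    step_normalize(all_numeric_predictors())

# Workflow (Recipe + Model)
log_lasso_wf <- workflow() %>%
    add_recipe(logistic_lasso_rec) %>%
    add_model(logistic_lasso_spec)

# penalty() uses a log10 scale: range = c(0, 1) gives 100 lambdas from 10^0 = 1 to 10^1 = 10.
# To adjust by a factor of 10, change the exponents (e.g., c(-1, 0) for 0.1 to 1).
penalty_grid <- grid_regular(penalty(range = c(0, 1)), levels = 100)

# Tune the model over the penalty grid, saving CV predictions for later use;
# event_level = "second" makes "spam" (the second factor level) the event of interest
tune_output <- tune_grid(
    log_lasso_wf,
    resamples = data_cv,
    metrics = metric_set(roc_auc, accuracy),
    control = control_grid(save_pred = TRUE, event_level = "second"),
    grid = penalty_grid
)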
Exercise 2: Inspecting the LASSO logistic model
- Use autoplot() to inspect the plot of CV AUC vs. λ once more (after adjusting the penalty grid).
Is anything surprising about the results relative to your expectations from Exercise 1? Brainstorm some possible explanations in consideration of the data context.
# Visualize evaluation metrics as a function of tuning parameters
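A minimal way to fill in this chunk, assuming the tune_output object from Exercise 1:

autoplot(tune_output) # plots CV accuracy and AUC as a function of the penalty lambda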
- Choose a final model whose CV AUC is within one standard error of the overall best metric. Comment on the variables that are removed from the model.
# Select "best" penalty
best_se_penalty <-
# Define workflow with "best" penalty value
final_wf <-
# Use final_wf to fit final model with "best" penalty value
final_fit_se <-
final_fit_se %>% tidy()
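One way to complete these steps, sketched with tune's select_by_one_std_err() helper; sorting by desc(penalty) picks the simplest (most penalized) model within one standard error of the best CV AUC.

# Select "best" penalty via the one-standard-error rule
best_se_penalty <- select_by_one_std_err(tune_output, metric = "roc_auc", desc(penalty))

# Define workflow with "best" penalty value
final_wf <- finalize_workflow(log_lasso_wf, best_se_penalty)

# Use final_wf to fit final model with "best" penalty value
final_fit_se <- fit(final_wf, data = spam)

In the tidy() output above, predictors whose coefficient estimate is exactly 0 have been removed from the model by the LASSO penalty.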
- Comment on the variable importance based on how long a variable stayed in the model. Connect the output to the data context.
glmnet_output <- final_fit_se %>% extract_fit_engine()

# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- glmnet_output$beta==0

# Loop over each variable: score = how many lambda values come at or after the
# first lambda where the variable entered the model (0 if it was always excluded)
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
    this_coeff_path <- bool_predictor_exclude[row,]
    if(sum(this_coeff_path) == ncol(bool_predictor_exclude)){
        return(0)
    } else {
        return(ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1)
    }
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
    var_name = rownames(bool_predictor_exclude),
    var_imp = var_imp
)
var_imp_data %>% arrange(desc(var_imp))
Exercise 3: Interpreting evaluation metrics
Inspect the overall CV results for the “best” λ, and compute the no-information rate (NIR) to give context to the overall accuracy:
# CV results for "best lambda"
tune_output %>%
    collect_metrics() %>%
    filter(penalty == best_se_penalty %>% pull(penalty))
# Count up number of spam and not_spam emails in the training data
spam %>%
    count(spam) # Name of the outcome variable goes inside count()
# Compute the NIR
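One way to compute it, as a sketch: the NIR is the accuracy of always predicting the majority class, i.e., the largest class proportion in the training data.

spam %>%
    count(spam) %>%
    mutate(prop = n / sum(n)) # the larger proportion is the NIR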
Why is an AUC of 1 the best possible value for this metric? How does the AUC for our spam model look relative to this best value?
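To connect this question to output, you could plot the CV ROC curve for the chosen penalty. This is a sketch; it relies on save_pred = TRUE having been set in the control options when tuning, and on the best_se_penalty object from Exercise 2.

tune_output %>%
    collect_predictions(parameters = best_se_penalty) %>%
    roc_curve(spam, .pred_spam, event_level = "second") %>%
    autoplot() # a perfect classifier's curve reaches the top-left corner, giving AUC = 1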