Topic 14 Evaluating Classification Models (Part 2)
Learning Goals
- Contextually interpret overall accuracy, sensitivity, specificity, and AUC
- Appropriately use and interpret the no-information rate to evaluate accuracy metrics
- Implement LASSO logistic regression in tidymodels
Slides from today are available here.
Exercises
You can download a template RMarkdown file to start from here.
Context
We’ll continue working with the spam dataset from last time.
- spam: Either spam or not spam (outcome)
- word_freq_WORD: percentage of words in the e-mail that match WORD (0-100)
- char_freq_CHAR: percentage of characters in the e-mail that match CHAR (e.g., exclamation points, dollar signs)
- capital_run_length_average: average length of uninterrupted sequences of capital letters
- capital_run_length_longest: length of longest uninterrupted sequence of capital letters
- capital_run_length_total: sum of length of uninterrupted sequences of capital letters
Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!
library(dplyr)
library(readr)
library(ggplot2)
library(tidymodels)
library(probably) # install.packages("probably")
tidymodels_prefer()
read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")
spam <-
# A little data cleaning to remove the space in "not spam"
spam %>%
spam <- mutate(spam = ifelse(spam=="spam", "spam", "not_spam"))
Exercise 1: Implementing LASSO logistic regression in tidymodels
Open up this tidymodels note sheet as a reference for writing your code from scratch.
Fit a LASSO logistic regression model for the spam outcome, and allow all possible predictors to be considered (spam ~ . for the model formula).
- Use 10-fold CV.
- Use the roc_auc and accuracy (overall accuracy) metrics when tuning.
- Initially try a sequence of 100 \(\lambda\)’s from 1 to 10.
- Diagnose whether this sequence should be updated by looking at the plot of test AUC vs. \(\lambda\).
- If needed, adjust the max value in the sequence up or down by a factor of 10. (You’ll be able to determine from the plot whether to adjust up or down.)
set.seed(123)
# Need to set reference level (to the outcome you are NOT interested in)
spam <- spam %>%
    mutate(spam = relevel(factor(spam), ref = "not_spam"))
# Set up CV folds
data_cv <-
# LASSO logistic regression model specification
logistic_lasso_spec <-
# Recipe
logistic_lasso_rec <-
# Workflow (Recipe + Model)
log_lasso_wf <-
# Tune model: specify grid of parameters and tune
penalty_grid <-
tune_output <-
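If you get stuck, below is one possible way to fill in the scaffolding above. Treat it as a sketch, not the definitive answer: the step_normalize() step in the recipe and the penalty range (100 values from 1 to 10, i.e., 0 to 1 on the log10 scale that penalty() uses) are starting-point choices that you may need to adjust after diagnosing the AUC plot.

# One possible completion (a sketch; adapt the choices to your analysis)
data_cv <- vfold_cv(spam, v = 10)

logistic_lasso_spec <- logistic_reg() %>%
    set_engine("glmnet") %>%
    set_args(mixture = 1, penalty = tune()) %>%
    set_mode("classification")

logistic_lasso_rec <- recipe(spam ~ ., data = spam) %>%
    step_normalize(all_numeric_predictors())

log_lasso_wf <- workflow() %>%
    add_recipe(logistic_lasso_rec) %>%
    add_model(logistic_lasso_spec)

# penalty() uses the log10 scale: c(0, 1) gives 100 lambdas from 10^0 = 1 to 10^1 = 10
penalty_grid <- grid_regular(penalty(range = c(0, 1)), levels = 100)

tune_output <- tune_grid(
    log_lasso_wf,
    resamples = data_cv,
    metrics = metric_set(roc_auc, accuracy),
    grid = penalty_grid
)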
Exercise 2: Inspecting the LASSO logistic model
- Use autoplot() to inspect the plot of CV AUC vs. \(\lambda\) once more (after adjusting the penalty grid).
Is anything surprising about the results relative to your expectations from Exercise 1? Brainstorm some possible explanations in consideration of the data context.
# Visualize evaluation metrics as a function of tuning parameters
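A minimal call to produce this plot, assuming tune_output is the tuning result from Exercise 1:

autoplot(tune_output)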
- Choose a final model whose CV AUC is within one standard error of the overall best metric. Comment on the variables that are removed from the model.
# Select "best" penalty
best_se_penalty <-
# Define workflow with "best" penalty value
final_wf <-
# Use final_wf to fit final model with "best" penalty value
final_fit_se <-
final_fit_se %>% tidy()
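One way to complete these steps is sketched below, using select_by_one_std_err() to pick the largest penalty (i.e., the simplest model) whose CV AUC is within one standard error of the best:

best_se_penalty <- select_by_one_std_err(tune_output, desc(penalty), metric = "roc_auc")

final_wf <- finalize_workflow(log_lasso_wf, best_se_penalty)

final_fit_se <- fit(final_wf, data = spam)

Predictors whose coefficient estimates are exactly 0 in the tidy() output have been removed from the model by the LASSO penalty.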
- Comment on the variable importance based on how long a variable stayed in the model. Connect the output to the data context.
glmnet_output <- final_fit_se %>% extract_fit_engine()

# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- glmnet_output$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
    this_coeff_path <- bool_predictor_exclude[row, ]
    if (sum(this_coeff_path) == ncol(bool_predictor_exclude)) {
        return(0)
    } else {
        return(ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1)
    }
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
    var_name = rownames(bool_predictor_exclude),
    var_imp = var_imp
)

var_imp_data %>% arrange(desc(var_imp))
Exercise 3: Interpreting evaluation metrics
Inspect the overall CV results for the “best” \(\lambda\), and compute the no-information rate (NIR) to give context to the overall accuracy:
# CV results for "best lambda"
tune_output %>%
    collect_metrics() %>%
    filter(penalty == best_se_penalty %>% pull(penalty))
# Count up number of spam and not_spam emails in the training data
spam %>%
    count(spam) # Name of the outcome variable goes inside count()
# Compute the NIR
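One way to compute the NIR from those counts is sketched below; it is just the proportion of the majority class (the accuracy you would get by always predicting the more common class):

spam %>%
    count(spam) %>%
    mutate(prop = n / sum(n)) %>%
    summarize(nir = max(prop))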
Why is an AUC of 1 the best possible value for this metric? How does the AUC for our spam model look relative to this best value?
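If it helps to visualize where the AUC comes from, the sketch below plots an ROC curve from the final fit’s predictions on the training data (an in-sample curve, so it will look somewhat optimistic relative to the CV AUC); event_level = "second" is needed because not_spam was set as the reference level.

# In-sample ROC curve for the final LASSO fit
# (the CV AUC from tune_output remains the better estimate of test performance)
final_fit_se %>%
    augment(new_data = spam) %>%
    roc_curve(truth = spam, .pred_spam, event_level = "second") %>%
    autoplot()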