Topic 10 Evaluating Classification Models

Learning Goals

  • Calculate (by hand from confusion matrices) and contextually interpret overall accuracy, sensitivity, and specificity
  • Construct and interpret plots of predicted probabilities across classes
  • Explain how a ROC curve is constructed and the rationale behind AUC as an evaluation metric
  • Appropriately use and interpret the no-information rate to evaluate accuracy metrics

Slides from today are available here.

LASSO for logistic regression in caret

To build LASSO models for logistic regression in caret, first load the package and set the seed for the random number generator to ensure reproducible results:

library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses

If we would like to use estimates of test (overall) accuracy to choose our final model (based on a probability threshold of 0.5), we can adapt the following:

lasso_logistic_mod <- train(
    y ~ x,
    data = ___,
    method = "glmnet",
    family = "binomial",
    tuneGrid = data.frame(alpha = 1, lambda = seq(___, ___, length.out = ___)),
    trControl = trainControl(method = "cv", number = ___, selectionFunction = ___),
    metric = "Accuracy",
    na.action = na.omit
)

The arguments are as follows:

  • y ~ x: y must be a character or factor
  • data: Sample data
  • method & family: The glmnet method fits penalized “generalized” linear models. When we specify family = "binomial", glmnet fits a penalized logistic regression.
  • trControl: Use cross-validation to estimate test performance for each model fit. selectionFunction can be "best" or "oneSE" as before.
  • tuneGrid: Tuning parameters for LASSO: alpha = 1 indicates LASSO (as opposed to another regularization method). Specify a sequence of lambda values for the penalty.
  • metric: Evaluate and compare competing models with respect to their CV-Accuracy. (Uses a default probability threshold of 0.5 to make hard predictions.)
  • na.action: Set na.action = na.omit to prevent errors if the data has missing values.

If we would like to choose our final model based on estimates of test sensitivity, specificity, or AUC, we can adapt the following:

lasso_logistic_mod <- train(
    y ~ x,
    data = ___,
    method = "glmnet",
    family = "binomial",
    tuneGrid = data.frame(alpha = 1, lambda = seq(___, ___, length.out = ___)),
    trControl = trainControl(method = "cv", number = ___, selectionFunction = ___, classProbs = TRUE, summaryFunction = twoClassSummaryCustom),
    metric = "AUC",
    na.action = na.omit
)

The two new arguments to trainControl() are as follows:

  • classProbs: Set to TRUE if soft (probability) predictions should be computed
  • summaryFunction: Use twoClassSummaryCustom in order to compute overall accuracy, sensitivity, and specificity (based on a threshold of 0.5) and to compute AUC

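The metric argument now has four options, each naming the CV estimate used to compare models:

  • "AUC": Compare models by AUC
  • "Sens": Compare models by sensitivity
  • "Spec": Compare models by specificity
  • "Accuracy": Compare models by overall accuracy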
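To make the threshold-based metrics concrete, here is a minimal sketch in base R that computes sensitivity, specificity, and overall accuracy from a confusion matrix at a threshold of 0.5. The data are made up for illustration (they are not the spam dataset), and "spam" is treated as the positive class by assumption. Note that twoClassSummaryCustom computes sensitivity for the first level of the outcome, so which class counts as "positive" in your own output depends on the factor level order.

```r
# Toy data (hypothetical): observed classes and predicted probabilities of "spam"
obs <- factor(c("spam", "spam", "spam", "not_spam", "not_spam",
                "not_spam", "not_spam", "spam", "not_spam", "not_spam"),
              levels = c("spam", "not_spam"))
prob_spam <- c(0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8, 0.05, 0.45)

# Hard predictions at a probability threshold of 0.5
pred <- factor(ifelse(prob_spam >= 0.5, "spam", "not_spam"),
               levels = levels(obs))

# Confusion matrix: rows = predicted class, columns = observed class
conf_mat <- table(predicted = pred, observed = obs)

TP <- conf_mat["spam", "spam"]          # spam correctly flagged
FP <- conf_mat["spam", "not_spam"]      # real email flagged as spam
FN <- conf_mat["not_spam", "spam"]      # spam that slipped through
TN <- conf_mat["not_spam", "not_spam"]  # real email correctly kept

sens <- TP / (TP + FN)            # P(predict spam | actually spam)
spec <- TN / (TN + FP)            # P(predict not spam | actually not spam)
acc  <- (TP + TN) / sum(conf_mat) # overall proportion correct

print(conf_mat)
print(c(Sens = sens, Spec = spec, Accuracy = acc))
```

Changing the 0.5 in the ifelse() call changes all three metrics at once, which is the trade-off that a ROC curve summarizes.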

You’ll need to run the code below to create the twoClassSummaryCustom function (don’t worry about how this is written):

twoClassSummaryCustom <- function (data, lev = NULL, model = NULL) {
    if (length(lev) > 2) {
        stop(paste("Your outcome has", length(lev), "levels. The twoClassSummary() function isn't appropriate."))
    }
    if (!all(levels(data[, "pred"]) == lev)) {
        stop("levels of observed and predicted data do not match")
    }
    # AUC from the predicted probabilities of the first outcome level (NA if the ROC computation fails)
    rocObject <- try(pROC::roc(data$obs, data[, lev[1]], direction = ">", 
        quiet = TRUE), silent = TRUE)
    rocAUC <- if (inherits(rocObject, "try-error")) 
        NA
    else rocObject$auc
    # Sensitivity and specificity from the hard predictions
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    # Overall accuracy is the first element returned by postResample()
    out2 <- postResample(data[, "pred"], data[, "obs"])
    out <- c(out, out2[1])
    names(out) <- c("AUC", "Sens", "Spec", "Accuracy")
    out
}


You can download a template RMarkdown file to start from here.


Before proceeding, install the pROC package (utilities for evaluating classification models with ROC curves) by entering install.packages("pROC") in the Console.
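As a rough illustration of what a ROC curve is, the sketch below builds one by hand in base R: it sweeps the probability threshold, records the false positive rate and true positive rate at each threshold, and then approximates the area under the curve with the trapezoid rule. The data and the helper names roc_points() and auc_trapezoid() are hypothetical, introduced only for this example. This is not how pROC is implemented internally.

```r
# Toy data (hypothetical, not the spam dataset)
is_spam   <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
prob_spam <- c(0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8, 0.05, 0.45)

# One ROC point per threshold: predict "spam" whenever prob >= threshold.
# Starting at Inf (nothing predicted spam) puts the first point at (0, 0).
roc_points <- function(is_positive, prob) {
    thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))
    data.frame(
        threshold = thresholds,
        fpr = sapply(thresholds, function(t) mean(prob[!is_positive] >= t)), # 1 - specificity
        tpr = sapply(thresholds, function(t) mean(prob[is_positive] >= t))   # sensitivity
    )
}

# Area under the ROC curve via the trapezoid rule
auc_trapezoid <- function(roc) {
    sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(is_spam, prob_spam)
auc <- auc_trapezoid(roc)
print(roc)
print(auc)
```

Running plot(roc$fpr, roc$tpr, type = "l") afterwards draws the curve. A model that ranks every spam email above every non-spam email would reach the top-left corner and have an AUC of 1.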

We’ll continue working with the spam dataset from last time.

  • spam: Either spam or not spam (outcome)
  • word_freq_WORD: percentage of words in the e-mail that match WORD (0-100)
  • char_freq_CHAR: percentage of characters in the e-mail that match CHAR (e.g., exclamation points, dollar signs)
  • capital_run_length_average: average length of uninterrupted sequences of capital letters
  • capital_run_length_longest: length of longest uninterrupted sequence of capital letters
  • capital_run_length_total: sum of length of uninterrupted sequences of capital letters

Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!


library(readr) # read_csv()
library(dplyr) # mutate(), count(), and %>%

spam <- read_csv("")

# A little data cleaning to remove the space in "not spam"
spam <- spam %>%
    mutate(spam = ifelse(spam=="spam", "spam", "not_spam"))

Exercise 1: Conceptual warmup

LASSO for the logistic regression setting works analogously to the regression setting. How would you expect a plot of test accuracy vs. \(\lambda\) to look, and why? (Draw it!)

Exercise 2: Implementing LASSO logistic regression in caret

Fit a LASSO logistic regression model for the spam outcome, and allow all possible predictors to be considered (~ . in the model formula).

  • Use 10-fold CV.
  • Choose a final model whose test AUC is within one standard error of the overall best metric.
  • Initially try a sequence of 100 \(\lambda\)’s from 0 to 10.
    • Diagnose whether this sequence should be updated by looking at the plot of test AUC vs. \(\lambda\) (plot(lasso_logistic_mod)).
    • If needed, adjust the max value in the sequence up or down by a factor of 10. (You’ll be able to determine from the plot whether to adjust up or down.)

lasso_logistic_mod <- train(
    ___ # Fill in the arguments, adapting the AUC template above
)
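The "within one standard error" rule can be sketched in base R as follows. The CV results below are invented for illustration, and the code shows only the idea behind the rule (pick the simplest model, meaning the largest \(\lambda\), whose mean CV metric is within one standard error of the best). It is not caret's actual implementation of selectionFunction = "oneSE".

```r
# Hypothetical CV summary: mean and SD of AUC across folds for each lambda
cv_results <- data.frame(
    lambda   = c(0, 0.01, 0.02, 0.05, 0.10),
    mean_auc = c(0.970, 0.968, 0.960, 0.930, 0.850),
    sd_auc   = c(0.012, 0.012, 0.015, 0.020, 0.030)
)
n_folds <- 10

# "best" rule: the lambda with the highest mean CV AUC
best_row    <- which.max(cv_results$mean_auc)
lambda_best <- cv_results$lambda[best_row]

# Standard error of the best model's mean AUC
se_best <- cv_results$sd_auc[best_row] / sqrt(n_folds)

# "one SE" rule: among models within one SE of the best, take the largest lambda
cutoff        <- cv_results$mean_auc[best_row] - se_best
eligible      <- cv_results[cv_results$mean_auc >= cutoff, ]
lambda_one_se <- max(eligible$lambda)

print(c(best = lambda_best, one_se = lambda_one_se))
```

In this toy table the one-SE rule trades a small drop in estimated AUC for a larger penalty, and so a simpler model.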

Exercise 3: Inspecting the model

Inspect the $bestTune part of your fitted lasso_logistic_mod in conjunction with the plot of test AUC vs. \(\lambda\).

Is anything surprising about the results relative to your expectations from Exercise 1? Brainstorm some possible explanations in consideration of the data context.

Exercise 4: Interpreting evaluation metrics

Inspect the overall CV results for the “best” \(\lambda\), and compute the no-information rate (NIR):

# CV results for "best lambda"
lasso_logistic_mod$results %>%
    filter(lambda == lasso_logistic_mod$bestTune$lambda)

# Count up number of spam and not_spam emails in the training data
spam %>%
    count(spam) # Name of the outcome variable goes inside count()

# Compute the NIR

  • Interpret the estimates of test sensitivity and specificity - what do these numbers mean? Do you think higher sensitivity or specificity would be more important in designing a spam filter?
  • Interpret overall accuracy - does this seem high? How can the no-information rate (NIR) help us interpret the overall accuracy?
  • Why is an AUC of 1 the best possible value for this metric? How does the AUC for our spam model look relative to this best value?
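For reference, the no-information rate is the accuracy of a "model" that ignores the predictors and always predicts the most common class. A minimal sketch with made-up class counts (substitute the counts from your own count() output):

```r
# Hypothetical class counts (NOT the actual spam data)
class_counts <- c(not_spam = 600, spam = 400)

# NIR: proportion of cases in the majority class
nir <- max(class_counts) / sum(class_counts)
print(nir)

# A model is only useful to the extent that its accuracy exceeds the NIR
model_accuracy <- 0.90 # hypothetical CV accuracy
print(model_accuracy - nir)
```

With these toy counts, an overall accuracy of 0.90 is 30 percentage points above what always guessing "not_spam" would achieve.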

Exercise 5: Algorithmic understanding for evaluation metrics

Inspect the iteration-specific information from CV for the “best” \(\lambda\):

# CV metrics for each fold at the selected lambda
lasso_logistic_mod$resample

How is one row of information computed? Carefully describe the CV process for a single iteration to estimate each of AUC, Sens, Spec, and Accuracy (overall accuracy). Use a generic confusion matrix (filled with variables instead of hard numbers) to illustrate the underlying computations.