Topic 9 Logistic Regression
Learning Goals
- Use a logistic regression model to make hard (class) and soft (probability) predictions
- Interpret non-intercept coefficients from logistic regression models in the data context
Slides from today are available here.
Logistic regression in caret
To build logistic regression models in caret, first load the package and set the seed for the random number generator to ensure reproducible results:
library(caret)
set.seed(___) # Pick your favorite number to fill in the parenthesesThen adapt the following code:
logistic_mod <- train(
y ~ x,
data = ___,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = ___),
metric = "Accuracy",
na.action = na.omit
)| Argument | Meaning |
|---|---|
y ~ x |
y must be a character or factor |
data |
Sample data |
method & family |
The glm method implements various “generalized” linear models. When we specify family = "binomial", glm performs logistic regression. (BTW: glm with family = "gaussian" is equivalent to using lm!) |
trControl |
Use cross-validation to estimate test performance for each model fit. |
metric |
Evaluate and compare competing models with respect to their CV-Accuracy. |
na.action |
Set na.action = na.omit to prevent errors if the data has missing values. |
Examining the logistic model
# Model summary table
summary(logistic_mod)
# Coefficients
coef(logistic_mod$finalModel)
# Exponentiated coefficients
coef(logistic_mod$finalModel) %>% exp()
# CV accuracy metrics
logistic_mod$results
logistic_mod$resampleMaking predictions from the logistic model
# Make soft (probability) predictions
predict(logistic_mod, newdata = ___, type = "prob")
# Make hard (class) predictions (using a default 0.5 probability threshold)
predict(logistic_mod, newdata = ___, type = "raw")Exercises
You can download a template RMarkdown file to start from here.
Context
Before proceeding, install the e1071 package (utilities for evaluating classification models) by entering install.packages("e1071") in the Console.
We’ll be working with a spam dataset that contains information on different features of emails and whether or not the email was spam. The variables are as follows:
spam: Eitherspamornot spamword_freq_WORD: percentage of words in the e-mail that matchWORD(0-100)char_freq_CHAR: percentage of characters in the e-mail that matchCHAR(e.g., exclamation points, dollar signs)capital_run_length_average: average length of uninterrupted sequences of capital letterscapital_run_length_longest: length of longest uninterrupted sequence of capital letterscapital_run_length_total: sum of length of uninterrupted sequences of capital letters
Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!
library(readr)
library(ggplot2)
library(dplyr)
library(caret)
spam <- read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")Hello, how are things?
We’re nearing the halfway point of the module - how are you holding up?
Exercise 1: Visualization warmup
Let’s take a look at the frequency of the word “George” (the email recipient’s name is George) (word_freq_george) and the frequency of exclamation points (char_freq_exclam).
Create appropriate visualizations to assess the predictive ability of these two predictors.
# If you want to adjust the axis limits, you can add the following to your plot:
# + coord_cartesian(ylim = c(0,1))
# + coord_cartesian(xlim = c(0,1))
ggplot(spam, aes(x = ___, y = ___)) +
geom_???()Exercise 2: Implementing logistic regression in caret
Our goal is to fit a logistic regression model with word_freq_george and char_freq_exclam as predictors.
Write down the corresponding logistic regression model formula using general notation.
Use
caretto fit this logistic regression model. Use 10-fold CV to estimate test accuracy. (We’ll focus more on interpreting model evaluation metrics next time.)
Note: If you get warning (not error) messages like fitted probabilities numerically 0 or 1 occurred, this means that in some of the CV iterations, one or more of the predictors perfectly predicted spam classification.
set.seed(___)
logistic_mod <- train(
)Exercise 3: Interpreting the model
Take a look at the log-scale coefficients with
summary(logistic_mod). Do the signs of the coefficients for the 2 predictors agree with your visual inspection from Exercise 1?Display the exponentiated coefficients, and provide contextual interpretations for them (not the intercept).
Exercise 4: Making predictions
Consider a new email where the frequency of “George” is 0.25% and the frequency of exclamation points is 1%.
Use the model summary to make both a soft (probability) and hard (class) prediction for this test case by hand. Use a default probability threshold of 0.5. (You can use math expressions to use R as a calculator. The
exp()function exponentiates a number.)Check your work from part a by using
predict().predict(___, newdata = data.frame(word_freq_george = 0.25, char_freq_exclam = 1), ___)