Topic 12 Logistic Regression
Learning Goals
- Use a logistic regression model to make hard (class) and soft (probability) predictions
- Interpret non-intercept coefficients from logistic regression models in the data context
Slides from today are available here.
Logistic regression in tidymodels
To build logistic regression models in tidymodels, first load the packages and set the seed for the random number generator to ensure reproducible results:
library(dplyr)
library(ggplot2)
library(tidymodels)
library(probably) # install.packages("probably")
tidymodels_prefer()
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code to fit a logistic regression model:
# Make sure you set the reference level (the outcome you are NOT interested in)
data <- data %>%
  mutate(outcome = relevel(outcome, ref = "failure")) # set reference level

data_cv <- vfold_cv(data, v = 10)

# Logistic Regression Model Spec
logistic_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Recipe
logistic_rec <- recipe(outcome ~ ., data = data) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Workflow (Recipe + Model)
logistic_wf <- workflow() %>%
  add_recipe(logistic_rec) %>%
  add_model(logistic_spec)

# Fit Model to Training Data
logistic_fit <- fit(logistic_wf, data = data)
Examining the logistic model
# Display coefficient estimates
logistic_fit %>% tidy()

# Get exponentiated coefficients (odds ratios) and confidence intervals
logistic_fit %>% tidy() %>%
  mutate(
    OR.conf.low = exp(estimate - 1.96*std.error),
    OR.conf.high = exp(estimate + 1.96*std.error)
  ) %>%
  mutate(OR = exp(estimate))
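Why exponentiate? Logistic regression is linear on the log-odds scale:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

so $e^{\beta_j}$ is an odds ratio: the multiplicative change in the odds of the outcome for a 1-unit increase in $x_j$, holding the other predictors fixed. The interval $\left(e^{\hat{\beta}_j - 1.96\,SE_j},\; e^{\hat{\beta}_j + 1.96\,SE_j}\right)$ computed above is the corresponding approximate 95% confidence interval for the odds ratio.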
Making predictions from the logistic model
# Make soft (probability) predictions
predict(logistic_fit, new_data = ___, type = "prob")
# Make hard (class) predictions (using a default 0.5 probability threshold)
predict(logistic_fit, new_data = ___, type = "class")
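Behind these two prediction types: the soft prediction inverts the log-odds to get a probability, and the hard prediction applies the threshold to that probability:

$$\hat{p} = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k}}, \qquad \text{predict the class of interest if } \hat{p} \geq 0.5$$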
Exercises
You can download a template RMarkdown file to start from here.
Context
We’ll be working with a spam dataset that contains information on different features of emails and whether or not the email was spam. The variables are as follows:
- `spam`: either `spam` or `not spam`
- `word_freq_WORD`: percentage of words in the e-mail that match `WORD` (0-100)
- `char_freq_CHAR`: percentage of characters in the e-mail that match `CHAR` (e.g., exclamation points, dollar signs)
- `capital_run_length_average`: average length of uninterrupted sequences of capital letters
- `capital_run_length_longest`: length of the longest uninterrupted sequence of capital letters
- `capital_run_length_total`: sum of the lengths of uninterrupted sequences of capital letters
Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!
library(dplyr)
library(readr)
library(ggplot2)
library(tidymodels)
tidymodels_prefer()
spam <- read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")

# A little data cleaning to remove the space in "not spam"
spam <- spam %>%
  mutate(spam = ifelse(spam == "spam", "spam", "not_spam"))
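Before modeling, it can help to visually inspect how a predictor relates to spam status. A minimal sketch, assuming the cleaned spam data above (the predictor and binwidth shown are just one choice):

```r
# Proportion of spam vs. not_spam emails across values of word_freq_george
ggplot(spam, aes(x = word_freq_george, fill = spam)) +
  geom_histogram(position = "fill", binwidth = 0.25) +
  labs(y = "Proportion of emails")
```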
Exercise 1: Implementing logistic regression in tidymodels
Our goal is to fit a logistic regression model with word_freq_george
and char_freq_exclam
as predictors.
Write down the corresponding logistic regression model formula using mathematical notation.
Use tidymodels to fit this logistic regression model to the training data. Let's try to do this from scratch (almost). Open up this tidymodels note sheet, and we'll work through the thought process piece by piece.

- Work with your group to figure out what phase of the analysis is happening in each row. What do you think needs to be modified to implement logistic regression? (For now, we're just fitting a model with a fixed set of predictors, not trying to estimate test performance with CV.)
- Key changes for implementing logistic regression: the model name is `logistic_reg()`, the model-building engine is `"glm"`, and we are now performing `"classification"` rather than `"regression"`.
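If you get stuck on the mathematical notation in part (a): with $p = P(\text{spam})$, a logistic regression model with these two predictors has the general form

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot \text{word\_freq\_george} + \beta_2 \cdot \text{char\_freq\_exclam}$$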
# Need to set reference level (to the outcome you are NOT interested in)
spam <- spam %>%
  mutate(spam = relevel(factor(spam), ref = "not_spam"))

# Logistic regression model specification
logistic_spec <- 

# Recipe
logistic_rec <- 

# Workflow (Recipe + Model)
log_wf <- 

# Fit Model
log_fit <- fit(log_wf, data = spam)
Exercise 3: Interpreting the model
Take a look at the log-scale coefficients with `tidy(log_fit)`. Do the signs of the coefficients for the 2 predictors agree with your visual inspection from Exercise 1?

Display the exponentiated coefficients, and provide contextual interpretations for them (not the intercept). (Use the output of `tidy()` with `mutate()` and `exp()`.)
Exercise 4: Making predictions
Consider a new email where the frequency of “George” is 0.25% and the frequency of exclamation points is 1%.
Use the model summary to make both a soft (probability) and hard (class) prediction for this test case by hand. Use a default probability threshold of 0.5. (You can use math expressions to use R as a calculator. The `exp()` function exponentiates a number.)

Check your work from part a by using `predict()`.
predict(log_fit, new_data = data.frame(word_freq_george = 0.25, char_freq_exclam = 1), type = "prob")
predict(log_fit, new_data = data.frame(word_freq_george = 0.25, char_freq_exclam = 1), type = "class")
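The by-hand calculation in part (a) can be sketched as follows. The coefficient values here are placeholders, not real estimates; substitute the numbers you get from `tidy(log_fit)`:

```r
# Placeholder coefficients -- replace with the estimates from tidy(log_fit)
b0 <- 0        # intercept (placeholder)
b_george <- 0  # word_freq_george coefficient (placeholder)
b_exclam <- 0  # char_freq_exclam coefficient (placeholder)

# Soft prediction: invert the log-odds to get a probability
log_odds <- b0 + b_george * 0.25 + b_exclam * 1
prob_spam <- exp(log_odds) / (1 + exp(log_odds))

# Hard prediction: apply the 0.5 threshold
class_pred <- ifelse(prob_spam >= 0.5, "spam", "not_spam")
```

With all placeholders at 0, the log-odds are 0 and the probability is exactly 0.5, so this is only a template for plugging in your fitted values.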