Topic 9 Logistic Regression
Learning Goals
- Use a logistic regression model to make hard (class) and soft (probability) predictions
- Interpret non-intercept coefficients from logistic regression models in the data context
Slides from today are available here.
Logistic regression in caret
To build logistic regression models in caret, first load the package and set the seed for the random number generator to ensure reproducible results:
library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code:
logistic_mod <- train(
    y ~ x,
    data = ___,
    method = "glm",
    family = "binomial",
    trControl = trainControl(method = "cv", number = ___),
    metric = "Accuracy",
    na.action = na.omit
)
Argument | Meaning |
---|---|
y ~ x | y must be a character or factor |
data | Sample data |
method & family | The glm method implements various “generalized” linear models. When we specify family = "binomial", glm performs logistic regression. (BTW: glm with family = "gaussian" is equivalent to using lm!) |
trControl | Use cross-validation to estimate test performance for each model fit. |
metric | Evaluate and compare competing models with respect to their CV accuracy. |
na.action | Set na.action = na.omit to prevent errors if the data has missing values. |
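The remark that glm with family = "gaussian" matches lm can be checked directly. The following is a minimal sketch in base R, assuming the built-in mtcars data purely for illustration (it is not the spam data used in the exercises):

```r
# Illustration on R's built-in mtcars data (an assumption, not the spam data)
lm_fit  <- lm(mpg ~ wt, data = mtcars)
glm_fit <- glm(mpg ~ wt, data = mtcars, family = "gaussian")

# Compare the estimated coefficients from the two fits
coef_match <- isTRUE(all.equal(coef(lm_fit), coef(glm_fit)))
coef_match
```

Switching to family = "binomial" changes the link function and the assumed distribution of the outcome, which is what turns the same glm call into logistic regression.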
Examining the logistic model
# Model summary table
summary(logistic_mod)
# Coefficients
coef(logistic_mod$finalModel)
# Exponentiated coefficients (the %>% pipe requires dplyr or magrittr to be loaded)
coef(logistic_mod$finalModel) %>% exp()
# CV accuracy metrics
logistic_mod$results
logistic_mod$resample
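To see why the exponentiated coefficients are useful, the sketch below shows that exp() of a slope equals the multiplicative change in the odds for a 1-unit increase in the predictor. It assumes base R's glm() and the built-in mtcars data (predicting transmission type am from weight wt) for illustration only; the helper odds() is a name introduced here.

```r
# Illustration on built-in mtcars data (an assumption, not the spam data)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Hypothetical helper: convert a probability to odds
odds <- function(p) p / (1 - p)

# Predicted probabilities at two weights that differ by exactly 1 unit
p_low  <- predict(fit, newdata = data.frame(wt = 3), type = "response")
p_high <- predict(fit, newdata = data.frame(wt = 4), type = "response")

# Ratio of the odds versus the exponentiated slope
odds_ratio <- unname(odds(p_high) / odds(p_low))
exp_slope  <- unname(exp(coef(fit)["wt"]))
c(odds_ratio = odds_ratio, exp_slope = exp_slope)
```

The two numbers agree, which is why an exponentiated coefficient is read as "the odds are multiplied by this factor for each 1-unit increase in the predictor, holding other predictors fixed."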
Making predictions from the logistic model
# Make soft (probability) predictions
predict(logistic_mod, newdata = ___, type = "prob")
# Make hard (class) predictions (using a default 0.5 probability threshold)
predict(logistic_mod, newdata = ___, type = "raw")
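The link between soft and hard predictions can be sketched by hand: compute the log-odds from the coefficients, convert to a probability, then apply the 0.5 threshold. This example assumes base R's glm() and the built-in mtcars data rather than a caret model, and the new case (wt = 2.5) is made up for illustration.

```r
# Illustration on built-in mtcars data (an assumption, not the spam data)
# am is coded 0 = automatic, 1 = manual; glm models the probability that am = 1
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
b <- coef(fit)

# Made-up new case
new_wt <- 2.5

# Soft prediction by hand: log-odds, then probability
log_odds <- unname(b["(Intercept)"] + b["wt"] * new_wt)
soft_by_hand <- exp(log_odds) / (1 + exp(log_odds))

# Soft prediction from predict() for comparison
soft_predict <- unname(predict(fit, newdata = data.frame(wt = new_wt), type = "response"))

# Hard prediction using a 0.5 probability threshold
hard <- ifelse(soft_by_hand > 0.5, "manual", "automatic")
c(by_hand = soft_by_hand, from_predict = soft_predict)
hard
```

Changing the threshold changes only the last step; the soft prediction stays the same.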
Exercises
You can download a template RMarkdown file to start from here.
Context
Before proceeding, install the e1071 package (utilities for evaluating classification models) by entering install.packages("e1071") in the Console.
We’ll be working with a spam dataset that contains information on different features of emails and whether or not the email was spam. The variables are as follows:
- spam: Either spam or not spam
- word_freq_WORD: percentage of words in the e-mail that match WORD (0-100)
- char_freq_CHAR: percentage of characters in the e-mail that match CHAR (e.g., exclamation points, dollar signs)
- capital_run_length_average: average length of uninterrupted sequences of capital letters
- capital_run_length_longest: length of longest uninterrupted sequence of capital letters
- capital_run_length_total: sum of lengths of uninterrupted sequences of capital letters
Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!
library(readr)
library(ggplot2)
library(dplyr)
library(caret)
spam <- read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")
Hello, how are things?
We’re nearing the halfway point of the module - how are you holding up?
Exercise 1: Visualization warmup
Let’s take a look at the frequency of the word “George” (word_freq_george; the email recipient’s name is George) and the frequency of exclamation points (char_freq_exclam).
Create appropriate visualizations to assess the predictive ability of these two predictors.
# If you want to adjust the axis limits, you can add the following to your plot:
# + coord_cartesian(ylim = c(0,1))
# + coord_cartesian(xlim = c(0,1))
ggplot(spam, aes(x = ___, y = ___)) +
geom_???()
Exercise 2: Implementing logistic regression in caret
Our goal is to fit a logistic regression model with word_freq_george and char_freq_exclam as predictors.
a. Write down the corresponding logistic regression model formula using general notation.
b. Use caret to fit this logistic regression model. Use 10-fold CV to estimate test accuracy. (We’ll focus more on interpreting model evaluation metrics next time.)
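Note: If you get warning (not error) messages like "fitted probabilities numerically 0 or 1 occurred", this means that in some of the CV iterations the model produced predicted probabilities extremely close to 0 or 1 for some emails. This typically happens when one or more of the predictors (nearly) perfectly separates spam from non-spam emails.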
set.seed(___)
logistic_mod <- train(
)
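The warning mentioned in the note above can be reproduced on a tiny made-up dataset in which the predictor separates the two classes perfectly. This is a sketch using base R's glm() rather than caret; the toy data and the warnings_seen variable are assumptions introduced for illustration.

```r
# Made-up toy data: every case with x >= 4 has y = 1, every case below has y = 0
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6), y = c(0, 0, 0, 1, 1, 1))

# Collect warning messages instead of printing them
warnings_seen <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = "binomial"),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warnings_seen
```

The model is still returned, which is why this is a warning rather than an error, but the coefficient estimates in such a fit are very large and unstable.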
Exercise 3: Interpreting the model
a. Take a look at the log-scale coefficients with summary(logistic_mod). Do the signs of the coefficients for the 2 predictors agree with your visual inspection from Exercise 1?
b. Display the exponentiated coefficients, and provide contextual interpretations for them (not the intercept).
Exercise 4: Making predictions
Consider a new email where the frequency of “George” is 0.25% and the frequency of exclamation points is 1%.
a. Use the model summary to make both a soft (probability) and hard (class) prediction for this test case by hand. Use a default probability threshold of 0.5. (You can use math expressions to use R as a calculator. The exp() function exponentiates a number.)
b. Check your work from part a by using predict().
predict(___, newdata = data.frame(word_freq_george = 0.25, char_freq_exclam = 1), ___)