Topic 9 Logistic Regression

Learning Goals

Use a logistic regression model to make hard (class) and soft (probability) predictions
Interpret non-intercept coefficients from logistic regression models in the data context

Slides from today are available here.

Logistic regression in `caret`

To build logistic regression models in caret, first load the package and set the seed for the random number generator to ensure reproducible results:

library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses

Then adapt the following code:

logistic_mod <- train(
    y ~ x,
    data = ___,
    method = "glm",
    family = "binomial",
    trControl = trainControl(method = "cv", number = ___),
    metric = "Accuracy",
    na.action = na.omit
)

Argument	Meaning
`y ~ x`	`y` must be a `character` or `factor`
`data`	Sample data
`method` & `family`	The `glm` method implements various “generalized” linear models. When we specify `family = "binomial"`, `glm` performs logistic regression. (BTW: `glm` with `family = "gaussian"` is equivalent to using `lm`!)
`trControl`	Use cross-validation to estimate test performance for each model fit.
`metric`	Evaluate and compare competing models with respect to their CV-`Accuracy`.
`na.action`	Set `na.action = na.omit` to prevent errors if the data has missing values.

Examining the logistic model

# Model summary table
summary(logistic_mod)

# Coefficients
coef(logistic_mod$finalModel)

# Exponentiated coefficients
coef(logistic_mod$finalModel) %>% exp()

# CV accuracy metrics
logistic_mod$results
logistic_mod$resample

Making predictions from the logistic model

# Make soft (probability) predictions
predict(logistic_mod, newdata = ___, type = "prob")

# Make hard (class) predictions (using a default 0.5 probability threshold)
predict(logistic_mod, newdata = ___, type = "raw")

Exercises

You can download a template RMarkdown file to start from here.

Context

Before proceeding, install the e1071 package (utilities for evaluating classification models) by entering install.packages("e1071") in the Console.

We’ll be working with a spam dataset that contains information on different features of emails and whether or not the email was spam. The variables are as follows:

spam: Either spam or not spam
word_freq_WORD: percentage of words in the e-mail that match WORD (0-100)
char_freq_CHAR: percentage of characters in the e-mail that match CHAR (e.g., exclamation points, dollar signs)
capital_run_length_average: average length of uninterrupted sequences of capital letters
capital_run_length_longest: length of longest uninterrupted sequence of capital letters
capital_run_length_total: sum of length of uninterrupted sequences of capital letters

Our goal will be to use email features to predict whether or not an email is spam - essentially, to build a spam filter!

library(readr)
library(ggplot2)
library(dplyr)
library(caret)

spam <- read_csv("https://www.dropbox.com/s/leurr6a30f4l32a/spambase.csv?dl=1")

Hello, how are things?

We’re nearing the halfway point of the module - how are you holding up?

Exercise 1: Visualization warmup

Let’s take a look at the frequency of the word “George” (the email recipient’s name is George) (word_freq_george) and the frequency of exclamation points (char_freq_exclam).

Create appropriate visualizations to assess the predictive ability of these two predictors.

# If you want to adjust the axis limits, you can add the following to your plot:
# + coord_cartesian(ylim = c(0,1))
# + coord_cartesian(xlim = c(0,1))

ggplot(spam, aes(x = ___, y = ___)) +
    geom_???()

Exercise 2: Implementing logistic regression in `caret`

Our goal is to fit a logistic regression model with word_freq_george and char_freq_exclam as predictors.

Write down the corresponding logistic regression model formula using general notation.
Use caret to fit this logistic regression model. Use 10-fold CV to estimate test accuracy. (We’ll focus more on interpreting model evaluation metrics next time.)

Note: If you get warning (not error) messages like fitted probabilities numerically 0 or 1 occurred, this means that in some of the CV iterations, one or more of the predictors perfectly predicted spam classification.

set.seed(___)
logistic_mod <- train(
    
)

Exercise 3: Interpreting the model

Take a look at the log-scale coefficients with summary(logistic_mod). Do the signs of the coefficients for the 2 predictors agree with your visual inspection from Exercise 1?
Display the exponentiated coefficients, and provide contextual interpretations for them (not the intercept).

Exercise 4: Making predictions

Consider a new email where the frequency of “George” is 0.25% and the frequency of exclamation points is 1%.

Use the model summary to make both a soft (probability) and hard (class) prediction for this test case by hand. Use a default probability threshold of 0.5. (You can use math expressions to use R as a calculator. The exp() function exponentiates a number.)

Check your work from part a by using predict().

predict(___, newdata = data.frame(word_freq_george = 0.25, char_freq_exclam = 1), ___)