Topic 16 Simple Logistic Regression
Learning Goals
- Construct simple logistic regression models in R
- Interpret coefficients in simple logistic regression models
- Use simple logistic regression models to make predictions
- Describe the form (shape) of relationships on the log odds, odds, and probability scales
Warm-up
Navigate to:
Warm up questions and answers:
If I want to estimate P(Vote = yes | Democrat = yes), who should be counted in the numerator and denominator?
A. Denominator: all voters. Numerator: Democrats who are voters.
B. Denominator: all Democrats who are voters. Numerator: all voters.
C. Denominator: all Democrats. Numerator: all voters.
D. Denominator: all Democrats. Numerator: Democrats who are voters.
Answer: D
Which of the following is a logistic regression model describing how voting is related to age?
A. P(Vote = Yes) = B0 + B1 Age
B. Odds(Vote = Yes) = B0 + B1 Age
C. ln(Odds(Vote = Yes)) = B0 + B1 Age
D. e^P(Vote = Yes) = B0 + B1 Age
E. e^Odds(Vote = Yes) = B0 + B1 Age
Answer: C
Suppose that the odds of an event increase as a predictor X increases. What can be said about the probability of the event as X increases?
A. Probability increases
B. Probability increases initially but then decreases
C. Probability decreases
Answer: A
Suppose the log odds of an event decrease as X increases. What can be said about the corresponding probability?
A. Probability increases
B. Probability increases initially but then decreases
C. Probability decreases
Answer: C
Consider the logistic regression model log(odds(disease = Yes)) = -2 + 0.5*ExposedYes. What are the odds of disease in those not exposed to an environmental hazard?
A. -2
B. 0.5
C. -1.5
D. \(e^{-2}\)
E. \(e^{0.5}\)
F. \(e^{-1.5}\)
Answer: D
Same model: log(odds(disease = Yes)) = -2 + 0.5*ExposedYes. How do the odds of disease compare in the exposed and unexposed?
A. Exposed have 0.5 times the odds of the unexposed
B. Exposed have 0.5 lower odds than the unexposed (difference)
C. Exposed have e^0.5 times the odds of the unexposed
D. Exposed have e^0.5 lower odds than the unexposed (difference)
Answer: C
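The arithmetic behind the last two answers can be checked directly. This is a small sketch using only the coefficients stated in the question; the variable names are made up for illustration:

```r
# Coefficients from the warm-up model: log(odds) = -2 + 0.5 * ExposedYes
b0 <- -2
b1 <- 0.5

odds_unexposed <- exp(b0)        # ExposedYes = 0
odds_exposed   <- exp(b0 + b1)   # ExposedYes = 1

# The ratio of the two odds equals exp(b1)
odds_ratio <- odds_exposed / odds_unexposed

c(odds_unexposed = odds_unexposed,
  odds_exposed = odds_exposed,
  odds_ratio = odds_ratio)
```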
Consider the model log(odds(Disease = Yes)) = B0 + B1*ExposedYes. If the exposed and unexposed have the same chance of having disease, what would you expect about B1?
Answer: B1 = 0, e^B1 = 1
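The reasoning behind this answer can be written out in a few lines, with B0 and B1 written as \(\beta_0\) and \(\beta_1\):

```latex
\begin{align*}
\text{Odds}(\text{Disease} \mid \text{unexposed}) &= e^{\beta_0} \\
\text{Odds}(\text{Disease} \mid \text{exposed}) &= e^{\beta_0 + \beta_1} \\
\frac{\text{Odds}(\text{Disease} \mid \text{exposed})}{\text{Odds}(\text{Disease} \mid \text{unexposed})}
  &= \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}
\end{align*}
```

Equal chances of disease in the two groups imply equal odds, so the ratio is 1. That gives \(e^{\beta_1} = 1\), and therefore \(\beta_1 = 0\).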
Exercises
A template RMarkdown document that you can start from is available here.
We’ll look at the O-ring data described in the video. Load the data and necessary packages as follows:
library(readr)
library(dplyr)
library(ggplot2)
oring <- read_csv("https://www.macalester.edu/~ajohns24/data/NASA.csv")
Let’s get acquainted with the data and make a few exploratory plots:
dim(oring)
head(oring)
# Univariate visualization of Broken
# Univariate visualization of Temp
# Visualization of Broken and Temp
Exercise 1
We want to model Broken as a function of Temp. Write down the logistic regression model formula. Try to do these without referring to notes.
We can fit logistic regression models in R using the glm() function. The “lm” part of “glm” stands for “linear model” (just like the lm() function), and the “g” stands for “generalized”. (The left hand side of the model has been made more general than just \(E[Y]\).) Also note that we need to supply the argument family = "binomial".
oring_mod <- glm(Broken ~ Temp, data = oring, family = "binomial")
summary(oring_mod)
We’ll focus primarily on the output in the coefficients table. Interpret the intercept and the temperature coefficient on the log odds scale in a contextually meaningful way. What concern arises in interpreting the intercept?
Interpret the coefficients on the natural scale by exponentiating. Use the exp() function to do this. (e.g., exp(3) gives \(e^3\).)
Use this model to make predictions about a 60 degree day:
- Predict the log odds of O-ring failure
- Predict the odds of O-ring failure
- Predict the probability of O-ring failure. Also write this probability using conditional probability notation.
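The mechanics of these steps can be sketched on simulated data. Everything about the data below is an assumption made for illustration (the sample size, the temperature range, and the "true" coefficients 10 and -0.17), so the numbers will not match the real O-ring fit. Only the workflow carries over.

```r
# Simulated stand-in for the O-ring data (hypothetical values, not the NASA data)
set.seed(16)
sim <- data.frame(Temp = round(runif(200, min = 50, max = 85)))
# Assumed relationship: colder days have a higher chance of failure
sim$Broken <- rbinom(200, size = 1, prob = plogis(10 - 0.17 * sim$Temp))

sim_mod <- glm(Broken ~ Temp, data = sim, family = "binomial")

# Coefficients on the log odds scale, then exponentiated to the odds scale
coef(sim_mod)
exp(coef(sim_mod))

# Predictions for a 60 degree day
new_day <- data.frame(Temp = 60)
pred_log_odds <- predict(sim_mod, newdata = new_day, type = "link")      # log odds
pred_odds     <- exp(pred_log_odds)                                      # odds
pred_prob     <- predict(sim_mod, newdata = new_day, type = "response")  # probability
```

Here type = "link" returns predictions on the log odds scale and type = "response" returns probabilities. The probability could equally be computed by hand as pred_odds / (1 + pred_odds).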
Exercise 2
When you use glm(), R by default will use Broken = 1 as the event of interest (as opposed to Broken = 0). Suppose that we were to fit the logistic regression model where Broken = 0 became the event of interest. Mathematically work out what the new values of the model coefficients will be.
Check your work by replacing the glm() model formula with Broken==0 ~ Temp.
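One way to check this kind of reasoning is to try it on simulated data before the real data. In the sketch below the data are hypothetical, generated only for illustration. Because the odds of not breaking are the reciprocal of the odds of breaking, the two fits should mirror each other in sign.

```r
# Simulated data for illustration only (hypothetical values)
set.seed(2)
sim <- data.frame(Temp = round(runif(200, min = 50, max = 85)))
sim$Broken <- rbinom(200, size = 1, prob = plogis(10 - 0.17 * sim$Temp))

# Event of interest: Broken = 1
mod_broken <- glm(Broken ~ Temp, data = sim, family = "binomial")
# Event of interest: Broken = 0
mod_intact <- glm(Broken == 0 ~ Temp, data = sim, family = "binomial")

# Side-by-side comparison of the two sets of coefficients
cbind(broken = coef(mod_broken), intact = coef(mod_intact))
```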
Exercise 3
Let’s visualize the model’s predictions.
The code below adds the predicted log odds to the original oring dataset. Complete the code to also add the predicted odds and probabilities.
oring <- oring %>%
    mutate(
        log_odds = predict(oring_mod),
        odds = ???, # Hint: you can use log_odds in this expression
        prob = ???
    )
Construct plots of the log odds, odds, and probability as a function of temperature.
We can also zoom out to see a broader temperature range. Describe the shape of the relationship between the probability of an O-ring breaking and temperature.
ggplot(oring, aes(x = Temp, y = Broken)) +
    geom_point() +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, fullrange = TRUE) +
    labs(y = "probability of breaking") +
    lims(x = c(0, 100))
Note: the function shown in this plot is called the logistic function, which is where logistic regression gets its name.
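For reference, the logistic function can be written out by hand. In this minimal sketch, the helper name logistic and the coefficients 10 and -0.17 are hypothetical, chosen only to show the S shape over the same 0 to 100 degree range. Base R's plogis() computes the same quantity.

```r
# Logistic function: maps any log odds value to a probability in (0, 1)
logistic <- function(log_odds) {
  exp(log_odds) / (1 + exp(log_odds))
}

# Hypothetical coefficients chosen only to illustrate the shape
b0 <- 10
b1 <- -0.17
temp <- seq(0, 100, by = 10)

# Probabilities flatten out near 1 and near 0 at the extremes
prob <- logistic(b0 + b1 * temp)
round(prob, 3)
```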
Exercise 4
We have the following data on whether or not individuals were exposed to an environmental hazard and whether or not they developed a certain disease within 10 years.
Exposure | Disease | No disease | Total |
---|---|---|---|
Exposed | 5 | 495 | 500 |
Unexposed | 5 | 995 | 1000 |
Total | 10 | 1490 | 1500 |
Write down the corresponding logistic regression model formula and estimate the coefficients by hand using data from the table. (Use a variable called exposed that is equal to 1 if exposed and 0 if not.)
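After working the estimates out by hand, one possible way to check them in R is sketched below. The data frame tab and its column names are made up for illustration; the counts come from the table above. The sketch computes the group odds directly and compares them with a glm() fit to the grouped counts.

```r
# Counts from the table above
tab <- data.frame(
  exposed    = c(1, 0),
  disease    = c(5, 5),
  no_disease = c(495, 995)
)

# By hand: odds of disease in each group
odds_unexposed <- 5 / 995
odds_exposed   <- 5 / 495

b0_hand <- log(odds_unexposed)                  # intercept: log odds when exposed = 0
b1_hand <- log(odds_exposed / odds_unexposed)   # slope: log of the odds ratio

# Check against glm() fit to the grouped (disease, no disease) counts
tab_mod <- glm(cbind(disease, no_disease) ~ exposed, data = tab, family = "binomial")
rbind(by_hand = c(b0_hand, b1_hand), glm = coef(tab_mod))
```

With a single binary predictor the model has one parameter per group, so the fitted coefficients should reproduce the hand calculations up to numerical tolerance.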