Topic 16 Simple Logistic Regression

Learning Goals

Construct simple logistic regression models in R
Interpret coefficients in simple logistic regression models
Use simple logistic regression models to make predictions
Describe the form (shape) of relationships on the log odds, odds, and probability scales

Warm-up

Navigate to:

Warm up questions and answers:

If I want to estimate P(Vote = yes | Democrat = yes), who should be counted in the numerator and denominator?

A. Denominator: all voters. Numerator: Democrats who are voters.
B. Denominator: all Democrats who are voters. Numerator: all voters.
C. Denominator: all Democrats. Numerator: all voters.
D. Denominator: all Democrats. Numerator: Democrats who are voters.

Answer: D

Which of the following is a logistic regression model describing how voting is related to age?

A. P(Vote = Yes) = B0 + B1 Age
B. Odds(Vote = Yes) = B0 + B1 Age
C. ln(Odds(Vote = Yes)) = B0 + B1 Age
D. e^P(Vote = Yes) = B0 + B1 Age
E. e^Odds(Vote = Yes) = B0 + B1 Age

Answer: C

Suppose that the odds of an event increases as a predictor X increases. What can be said about the probability of the event as X increases?

A. Probability increases
B. Probability increases initially but then decreases
C. Probability decreases

Answer: A

Suppose the log odds of an event decreases as X increases. What can be said about the corresponding probability?

A. Probability increases
B. Probability increases initially but then decreases
C. Probability decreases

Answer: C

Consider the logistic regression model log(odds(disease = Yes)) = -2 + 0.5*ExposedYes. What are the odds of disease in those not exposed to an environmental hazard?

A. -2
B. 0.5
C. -1.5
D. $e^{-2}$
E. $e^{0.5}$
F. $e^{-1.5}$

Answer: D

Same model: log(odds(disease = Yes)) = -2 + 0.5*ExposedYes. How do the odds of disease compare in the exposed and unexposed?

A. Exposed have 0.5 times the odds of the unexposed
B. Exposed have 0.5 lower odds than the unexposed (difference)
C. Exposed have e^0.5 times the odds of the unexposed
D. Exposed have e^0.5 lower odds than the unexposed (difference)

Answer: C

Consider the model log(odds(Disease = Yes)) = B0 + B1*ExposedYes. If the exposed and unexposed have the same chance of having disease, what would you expect about B1?

Answer: B1 = 0, e^B1 = 1

Exercises

A template RMarkdown document that you can start from is available here.

We’ll look at the O-ring data described in the video. Load the data and necessary packages as follows:

library(readr)
library(dplyr)
library(ggplot2)

oring <- read_csv("https://www.macalester.edu/~ajohns24/data/NASA.csv")

Let’s get acquainted with the data and make a few exploratory plots:

dim(oring)
head(oring)

# Univariate visualization of Broken


# Univariate visualization of Temp


# Visualization of Broken and Temp

Exercise 1

We want to model Broken as a function of Temp. Write down the logistic regression model formula. Try to do these without referring to notes.
We can fit logistic regression models in R using the glm() function. The “lm” part of “glm” stands for “linear model” (just like the lm() function), and the “g” stands for “generalized”. (The left hand side of the model has been made more general than just $E[Y]$ .) Also note that we need to supply the argument family = "binomial".
```
oring_mod <- glm(Broken ~ Temp, data = oring, family = "binomial")
summary(oring_mod)
```
We’ll focus primarily on the output in the coefficients table. Interpret the intercept and the temperature coefficient on the log scale in a contextually meaningful way. What concern arises in interpreting the intercept?
Interpret the coefficients on the natural scale by exponentiating. Use the exp() function to do this. (e.g., exp(3) gives $e^3$ .)
Use this model to make predictions about a 60 degree day:
- Predict the log odds of O-ring failure
- Predict the odds of O-ring failure
- Predict the probability of O-ring failure. Also write this probability using conditional probability notation.

Exercise 2

When you use glm(), R by default will use Broken = 1 as the event of interest (as opposed to Broken = 0). Suppose that we were to fit the logistic regression model where Broken = 0 became the event of interest. Mathematically work out what the new values of the model coefficients will be.
Check your work by replacing the glm() model formula with Broken==0 ~ Temp.

Exercise 3

Let’s visualize the model’s predictions.

The code below adds the predicted log odds to the original oring dataset. Complete the code to also add the predicted odds and probabilities.

oring <- oring %>%
    mutate(
        log_odds = predict(oring_mod),
        odds = ???, # Hint: you can use log_odds in this expression
        prob = ???
    )

Construct plots of the log odds, odds, and probability as a function of temperature.
We can also zoom out to see a broader temperature range. Describe the shape of the relationship between the probability of an O-ring breaking and temperature.
```
ggplot(oring, aes(x = Temp, y = Broken)) +
    geom_point() +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, fullrange = TRUE) + 
    labs(y = "probability of breaking") + 
    lims(x = c(0,100))
```
Note: the function shown in this plot is called the logistic function which is where logistic regression gets its name.

Exercise 4

We have the following data on whether or not individuals were exposed to an environmental hazard and whether or not they developed a certain disease within 10 years.

	Disease	No disease	Total
Exposed	5	495	500
Unexposed	5	995	1000
Total	10	1490	1500

Write down the logistic regression model formula corresponding and estimate the coefficients by hand using data from the table. (Use a variable called exposed that is equal to 1 if exposed and 0 if not.)