Simple logistic regression

Notes and in-class exercises

Notes

  • Slides for today are here.
  • You can download a template file to work with here.
  • File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Learning goals

By the end of this lesson, you should be able to:

  • Explain the differences between linear regression and logistic regression for modeling binary outcomes
  • Construct simple logistic regression models in R
  • Interpret coefficients in simple logistic regression models
  • Use simple logistic regression models to make predictions
  • Describe the form (shape) of relationships on the log odds, odds, and probability scales

Readings and videos

Before class you should have read and/or watched:

Exercises

Context: The Titanic was a British passenger ship that famously sank in 1912 after hitting an iceberg in the North Atlantic Ocean. Approximately 2200 passengers were on board the Titanic, and it’s estimated that 1500 of them did not survive the sinking. Historians have worked diligently to collect data on the passengers who were aboard the Titanic.

We have data for 1313 passengers, where the following information is available for each passenger:

  • Name: name
  • PClass: ticket class (1st, 2nd, 3rd)
  • Age: age (years)
  • Sex: binary sex (female, male)
  • Survived: indicator that the passenger survived (1 = survived, 0 = died)

Our question of interest is: how do different factors relate to survival?

In the Console, run install.packages("ggmosaic") to install the ggmosaic package that we’ll be using to make a specialized type of plot.

library(readr)
library(ggplot2)
library(ggmosaic)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
titanic <- read_csv("https://mac-stat.github.io/data/titanic.csv")
Rows: 1313 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Name, PClass, Sex
dbl (2): Age, Survived

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 1: Exploring age

Did younger passengers tend to have higher survival rates than older passengers?

Visualizing the relationship between a binary response and a quantitative predictor can be tricky. We will take a few approaches here.

  1. Create a boxplot where one box corresponds to the age distribution of survivors and the second to that of non-survivors.
  2. Create density plots with separate colors for the survivors and non-survivors.
  3. The remainder of the code below creates a plot of the fraction who survived at each age. (Since we have a large data set with multiple, though sometimes not many, observations at most ages, we can manually calculate the survival fraction at each age.)

After inspecting the plots, summarize what you learn.

# Create a boxplot
# Note that you'll need to force R to view Survived as a binary categorical variable by using x = factor(Survived) instead of just x = Survived in the aes() part of your plot


# Create a density plot (you'll need to use factor(Survived) again)


# Use the code below to create a plot of the fraction who survived at each age
titanic_summ <- titanic %>% 
    group_by(Age) %>%
    summarize(frac_survived = mean(Survived))

ggplot(titanic_summ, aes(x = Age, y = frac_survived)) +
    geom_point() +
    geom_smooth(se = FALSE)

Exercise 2: Exploring sex and ticket class

Were males or females more likely to survive? Did 1st class passengers tend to survive more than 2nd and 3rd class passengers?

The code below creates plots that allow us to explore how Sex and PClass relate to survival. The first two plots are standard bar plots that use color to indicate what fraction of each group survived. The last two plots are mosaic plots that are much like the standard bar plots, but the width of the bars reflects the distribution of the x-axis variable. (The widest bar is the most prevalent category.)

Summarize what you learn about the relationship between sex, ticket class, and survival.

# Standard bar plots
ggplot(titanic, aes(x = Sex, fill = factor(Survived))) +
    geom_bar(position = "fill")

ggplot(titanic, aes(x = PClass, fill = factor(Survived))) +
    geom_bar(position = "fill")

# Mosaic plots
ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
    geom_mosaic(aes(x = product(Sex), fill = Survived))

ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
    geom_mosaic(aes(x = product(PClass), fill = Survived))

Exercise 3: Linear regression model

For now we will focus on exploring the relationship between (ticket) class and survival.

Let’s tabulate survival across classes. We can tabulate across two variables by providing both variables to count():

titanic %>% 
    count(PClass, Survived)
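If you would like R to arrange these counts in a two-way layout for you, one option (not required for the exercise) is a base R cross-tabulation:

# A base R cross-tabulation of ticket class by survival status
table(titanic$PClass, titanic$Survived)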
  a. Use the count() output to fill in the following contingency table:

Class       Died   Survived   Total
1st Class   ___    ___        ___
2nd Class   ___    ___        ___
3rd Class   ___    ___        ___
Total       ___    ___        ___
  b. Using your table, estimate the following:
    • the probability of surviving among 1st class passengers
    • the probability of surviving among 2nd class passengers
    • the probability of surviving among 3rd class passengers
    • the difference in the probability of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how much lower is the survival probability for 2nd class passengers than for 1st class passengers?)
    • the difference in the probability of surviving, comparing 3rd class passengers to 1st class passengers (i.e., how much lower is the survival probability for 3rd class passengers than for 1st class passengers?)
  c. After fitting the linear regression model below, write out the model formula using correct notation. Explain carefully what it means to talk about the expected/average value of a binary variable.
lin_mod <- lm(Survived ~ PClass, data = titanic)
summary(lin_mod)
  d. Write an interpretation of each of the coefficients in your linear regression model. How do your coefficient estimates compare to your answers in part b?

Exercise 4: Logistic regression model

  a. Refer back to your contingency table from Exercise 3a. Using your table, estimate the following:
    • the odds of surviving among 1st class passengers
    • the odds of surviving among 2nd class passengers
    • the odds of surviving among 3rd class passengers
    • the ratio of the odds of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how many times higher/lower is the odds of survival among 2nd class passengers as compared to 1st class passengers?)
    • the ratio of the odds of surviving, comparing 3rd class passengers to 1st class passengers
  b. After fitting the logistic regression model below, write out the model formula using correct notation.
log_mod <- glm(Survived ~ PClass, data = titanic, family = "binomial")
coef(summary(log_mod))
  c. Write an interpretation of each of the exponentiated coefficients in your logistic regression model. Think carefully about what we are modeling when we fit a logistic regression model. How do these exponentiated coefficient estimates compare to your answers in part a?

Exercise 5: Linear vs. logistic modeling

To highlight a key difference between linear vs. logistic modeling, consider the following linear and logistic regression models of survival with sex and age as predictors in addition to ticket class.

lin_mod2 <- lm(Survived ~ PClass + Sex + Age, data = titanic)
coef(summary(lin_mod2))

log_mod2 <- glm(Survived ~ PClass + Sex + Age, data = titanic, family = "binomial")
coef(summary(log_mod2))
  a. Use the linear regression model to predict the probability of survival for Rose (a 17-year-old female in 1st class) and Jack (a 20-year-old male in 3rd class). Show your work.

  b. Now use the logistic regression model to predict the survival probability for Rose and Jack. Show your work. (Hint: use the logistic regression model to obtain the predicted log odds, exponentiate to get the odds, and then convert the odds to a probability via \(p = \text{odds}/(1 + \text{odds})\).)

  c. Comment on differences that you notice in the predictions from parts a and b.

Reflection

What binary outcomes might be relevant in your project? What predictor(s) could be relevant in a logistic regression model for that outcome?

Response: Put your response here.

Render your work

  • Click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.
  • Scroll through and inspect the document to check that your work translated to the HTML format correctly.
  • Close the browser tab.
  • Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
  • Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.







Solutions

Exercise 1: Exploring age

The boxplot doesn’t clearly indicate a difference in the age distributions across survivors and non-survivors, but we do notice from the density plot that there is a greater density of younger passengers among the survivors. We also see from the last plot that younger passengers tend to have a higher survival chance.

# Create a boxplot
ggplot(titanic, aes(x = factor(Survived), y = Age)) +
    geom_boxplot()
Warning: Removed 557 rows containing non-finite outside the scale range
(`stat_boxplot()`).

# Can flip the boxplot on its side too
ggplot(titanic, aes(y = factor(Survived), x = Age)) +
    geom_boxplot()
Warning: Removed 557 rows containing non-finite outside the scale range
(`stat_boxplot()`).

# Create a density plot
ggplot(titanic, aes(x = Age, color = factor(Survived))) +
    geom_density()
Warning: Removed 557 rows containing non-finite outside the scale range
(`stat_density()`).

# Use the code below to create a plot of the fraction who survived at each age
titanic_summ <- titanic %>% 
    group_by(Age) %>%
    summarize(frac_survived = mean(Survived))

ggplot(titanic_summ, aes(x = Age, y = frac_survived)) +
    geom_point() +
    geom_smooth(se = FALSE)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).

Exercise 2: Exploring sex and ticket class

  • Females were more likely to survive than males.
  • 1st class was most likely to survive, followed by 2nd then 3rd class.
# Standard bar plots
ggplot(titanic, aes(x = Sex, fill = factor(Survived))) +
    geom_bar(position = "fill")

ggplot(titanic, aes(x = PClass, fill = factor(Survived))) +
    geom_bar(position = "fill")

# Mosaic plots
ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
    geom_mosaic(aes(x = product(Sex), fill = Survived))
Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2
3.5.0.
Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
ℹ Please use the `transform` argument instead.
Warning: `unite_()` was deprecated in tidyr 1.2.0.
ℹ Please use `unite()` instead.
ℹ The deprecated feature was likely used in the ggmosaic package.
  Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.

ggplot(data = titanic %>% mutate(Survived = as.factor(Survived))) +
    geom_mosaic(aes(x = product(PClass), fill = Survived))

Exercise 3: Linear regression model

titanic %>% 
    count(PClass, Survived)
# A tibble: 7 × 3
  PClass Survived     n
  <chr>     <dbl> <int>
1 1st           0   129
2 1st           1   193
3 2nd           0   160
4 2nd           1   119
5 3rd           0   573
6 3rd           1   138
7 <NA>          0     1
Class       Died   Survived   Total
1st Class   129    193        322
2nd Class   160    119        279
3rd Class   573    138        711
Total       862    450        1312
    • the probability of surviving among 1st class passengers: 193/322 = 0.599
    • the probability of surviving among 2nd class passengers: 119/279 = 0.427
    • the probability of surviving among 3rd class passengers: 138/711 = 0.194
    • the difference in the probability of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how much lower is the survival probability for 2nd class passengers than for 1st class passengers?): 119/279 - 193/322 = -0.173
    • the difference in the probability of surviving, comparing 3rd class passengers to 1st class passengers (i.e., how much lower is the survival probability for 3rd class passengers than for 1st class passengers?): 138/711 - 193/322 = -0.405
  c. This model can be written as: \(E[Survived | PClass] = \beta_0 + \beta_1 PClass2nd + \beta_2 PClass3rd\).
    • In the context of a binary variable, the expected value/average is the same as the probability that the variable equals one. To see an example of this, calculate the average of this list of 0’s and 1’s: (0,0,1,1,0,1,0,1). Now calculate the proportion of 1’s. What do you notice? (A quick check in R follows below.)
    • This means that we can also write this model as follows: \(P[Survived = 1 | PClass] = \beta_0 + \beta_1 PClass2nd + \beta_2 PClass3rd\)
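For instance, here is a quick numerical check (a minimal sketch using a made-up vector x):

# The mean of a 0/1 variable equals the proportion of 1's
x <- c(0, 0, 1, 1, 0, 1, 0, 1)
mean(x)                   # average of the 0's and 1's: 0.5
sum(x == 1) / length(x)   # proportion of 1's: also 0.5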
lin_mod <- lm(Survived ~ PClass, data = titanic)
summary(lin_mod)

Call:
lm(formula = Survived ~ PClass, data = titanic)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5994 -0.1941 -0.1941  0.4006  0.8059 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.59938    0.02468  24.284  < 2e-16 ***
PClass2nd   -0.17286    0.03623  -4.772 2.03e-06 ***
PClass3rd   -0.40529    0.02975 -13.623  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4429 on 1309 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.1315,    Adjusted R-squared:  0.1302 
F-statistic: 99.09 on 2 and 1309 DF,  p-value: < 2.2e-16
  d. The intercept is the estimated survival probability for 1st class passengers, and the other coefficient estimates are the differences in probability from part b! (A quick numerical check appears after this list.)
    • (Intercept): the estimated probability of survival for passengers in 1st class is 0.599 (59.9%)
    • PClass2nd: the difference in the estimated probability of survival comparing passengers in 1st class to passengers in 2nd class is 0.173 (17.3%), where passengers in 1st class have the higher estimated survival probability
      • OR… comparing passengers in 1st class to passengers in 2nd class, the difference in the proportion of passengers that survived is 0.173 (17.3%), with 1st class having a higher proportion of passengers that survived
      • OR… the probability of survival is 17.3% lower among passengers in 2nd class than it is among passengers in 1st class
    • PClass3rd: the difference in the estimated probability of survival comparing passengers in 1st class to passengers in 3rd class is 0.405 (40.5%), where passengers in 1st class have the higher estimated survival probability
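As an optional check (a small sketch reusing the lin_mod object fit above), you can compare the fitted coefficients to the hand calculations from parts a and b:

# Compare the fitted coefficients to the hand-calculated probabilities/differences
coef(lin_mod)
193/322             # should match the intercept (about 0.599)
119/279 - 193/322   # should match the PClass2nd coefficient (about -0.173)
138/711 - 193/322   # should match the PClass3rd coefficient (about -0.405)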

Exercise 4: Logistic regression model

    • the odds of surviving among 1st class passengers: 193/129 = 1.496
    • the odds of surviving among 2nd class passengers: 119/160 = 0.744
    • the odds of surviving among 3rd class passengers: 138/573 = 0.241
    • the ratio of the odds of surviving, comparing 2nd class passengers to 1st class passengers (i.e., how many times higher/lower is the odds of survival among 2nd class passengers as compared to 1st class passengers?): (119/160)/(193/129) = 0.497
    • the ratio of the odds of surviving, comparing 3rd class passengers to 1st class passengers: (138/573)/(193/129) = 0.161
  b. \(\log(Odds[Survived = 1 | PClass]) = \beta_0 + \beta_1 PClass2nd + \beta_2 PClass3rd\)
log_mod <- glm(Survived ~ PClass, data = titanic, family = "binomial")

# These logistic coefficient estimates are NOT exponentiated
coef(summary(log_mod))
              Estimate Std. Error    z value     Pr(>|z|)
(Intercept)  0.4028778  0.1137246   3.542574 3.962427e-04
PClass2nd   -0.6989281  0.1660923  -4.208071 2.575600e-05
PClass3rd   -1.8265098  0.1480705 -12.335410 5.839072e-35
# Calculations for exponentiating coefficients
exp(0.4028778)
[1] 1.496124
exp(-0.6989281)
[1] 0.4971179
exp(-1.8265098)
[1] 0.1609744
  c. These exponentiated coefficient estimates match the odds and odds ratios from part a! (A shortcut for computing them appears after this list.)
    • exp(Intercept): the estimated odds of survival among passengers in 1st class are 1.496 (i.e., passengers in 1st class are 1.496 times as likely to survive as they are to die)
    • exp(PClass2nd): we estimate that the odds of survival for passengers in 2nd class are only 0.50 times as high as the odds of survival among passengers in 1st class (i.e., the odds of survival are about 2 times as high among passengers in 1st class as they are among passengers in 2nd class)
    • exp(PClass3rd): we estimate that the odds of survival for passengers in 3rd class are only 0.16 times as high as the odds of survival among passengers in 1st class (i.e., the odds of survival are 1/0.161 ≈ 6.21 times as high among passengers in 1st class as they are among passengers in 3rd class)
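Rather than exponentiating each estimate by hand, you can exponentiate all of the coefficients at once (an optional shortcut using the log_mod object fit above):

# Exponentiate all of the logistic regression coefficients in one call
exp(coef(log_mod))
# (Intercept) ~ 1.496, PClass2nd ~ 0.497, PClass3rd ~ 0.161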

Exercise 5: Linear vs. logistic modeling

To highlight a key difference between linear vs. logistic modeling, consider the following linear and logistic regression models of survival with sex and age as predictors in addition to ticket class.

lin_mod2 <- lm(Survived ~ PClass + Sex + Age, data = titanic)
coef(summary(lin_mod2))
                Estimate  Std. Error    t value     Pr(>|t|)
(Intercept)  1.130522829 0.051940872  21.765573 8.158449e-82
PClass2nd   -0.207433817 0.039239825  -5.286308 1.637737e-07
PClass3rd   -0.393344488 0.037709874 -10.430809 7.001373e-24
Sexmale     -0.501325667 0.029419802 -17.040416 2.697807e-55
Age         -0.006004789 0.001105949  -5.429536 7.633977e-08
log_mod2 <- glm(Survived ~ PClass + Sex + Age, data = titanic, family = "binomial")
coef(summary(log_mod2))
               Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  3.75966210 0.397567324   9.456668 3.179129e-21
PClass2nd   -1.29196240 0.260075781  -4.967638 6.777324e-07
PClass3rd   -2.52141915 0.276656805  -9.113888 7.948131e-20
Sexmale     -2.63135683 0.201505379 -13.058494 5.684093e-39
Age         -0.03917681 0.007616218  -5.143868 2.691392e-07
## predict for Rose
## (by hand)
1.130523 + (-0.207434)*0 + (-0.393344)*0 + (-0.501326)*0 + (-0.006005)*17
[1] 1.028438
## (using predict)
predict(lin_mod2, newdata = data.frame(PClass = "1st", Sex = "female", Age = 17))
       1 
1.028441 
## predict for Jack
## (by hand)
1.130523 + (-0.207434)*0 + (-0.393344)*1 + (-0.501326)*1 + (-0.006005)*20
[1] 0.115753
## (using predict)
predict(lin_mod2, newdata = data.frame(PClass = "3rd", Sex = "male", Age = 20))
        1 
0.1157569 
## predict for Rose
## (by hand)
log_odds_rose <- 3.75966210 + (-1.29196240)*0 + (-2.52141915)*0 + (-2.63135683)*0 + (-0.03917681)*17
odds_rose <- exp(log_odds_rose)
odds_rose/(1+odds_rose)
[1] 0.9566303
## (using predict)
predict(log_mod2, newdata = data.frame(PClass = "1st", Sex = "female", Age = 17), type = "response")
        1 
0.9566303 
## predict for Jack
## (by hand)
log_odds_jack <- 3.75966210 + (-1.29196240)*0 + (-2.52141915)*1 + (-2.63135683)*1 + (-0.03917681)*20
odds_jack <- exp(log_odds_jack)
odds_jack/(1+odds_jack)
[1] 0.101867
## (using predict)
predict(log_mod2, newdata = data.frame(PClass = "3rd", Sex = "male", Age = 20), type = "response")
       1 
0.101867 
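As an aside (an optional alternative reusing the log_odds_rose and log_odds_jack objects computed above), R’s plogis() function performs the exponentiate-then-divide conversion from log odds to probability in a single step:

# plogis(x) returns exp(x) / (1 + exp(x)), i.e., converts log odds to a probability
plogis(log_odds_rose)   # should match 0.9566303
plogis(log_odds_jack)   # should match 0.101867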
  c. Our linear model predicted that Rose’s probability of survival was over 100% (which doesn’t make sense). The predictions for Jack are fairly similar: 10.2% based on our logistic model and 11.6% based on our linear model.