Topic 20 Linear Regression Assumptions

Learning Goals

  • Use visualizations to check the assumptions behind linear regression models
  • Transform variables appropriately to help models better meet assumptions
  • Predict how properties of statistical inference procedures will vary based on the degree to which assumptions are met





Discussion

We’ve written linear regression models as:

\[ E[Y] = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p \]

This is equivalent to how it was written in the video:

\[ Y = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p + \varepsilon \]

\(\varepsilon\) stands for the error term, which we estimate with the residuals. (Since \(E[\varepsilon] = 0\), taking the expected value of the second equation recovers the first.) The 4 key assumptions of linear regression can be compactly summarized by the following statement:

\[ \varepsilon \stackrel{ind}{\sim} N(0, \sigma^2) \]

  1. Independence
  2. Trend
  3. Homoskedasticity
  4. Normality
    Normality of the residuals is actually the least important of the 4 because of the Central Limit Theorem. However, violations of normality often go hand in hand with violations of homoskedasticity.



Assumptions serve to ensure that statistical inference procedures (e.g., confidence intervals and hypothesis tests, coming up next!) “work as advertised”:

  • The process of creating a 95% CI is a procedure (add and subtract about 2 standard errors from the estimate).
    • It “works as advertised” if in 95% of possible samples it creates intervals that contain the true population value. (The other 5% of samples are unlucky ones.)
    • It does not “work as advertised” if, for example, only 90% of possible samples result in intervals (95% CIs) that contain the true population value.
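In R, those “about 2 standard errors” come straight out of the fitted model summary. A minimal sketch, assuming a hypothetical fitted lm object called mod with a coefficient named Age:

# Rough 95% CI for a coefficient: estimate +/- 2 standard errors
# (mod and the "Age" coefficient are hypothetical; confint(mod) does a more precise version of this)
est <- summary(mod)$coefficients["Age", "Estimate"]
se <- summary(mod)$coefficients["Age", "Std. Error"]
c(lower = est - 2 * se, upper = est + 2 * se)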





Exercises

A template RMarkdown document that you can start from is available here.

# Load required packages
library(readr)
library(ggplot2)
library(dplyr)

# Read in the data
homes <- read_tsv("http://sites.williams.edu/rdeveaux/files/2014/09/Saratoga.txt")

# Look at the first 6 rows
head(homes)

We will pretend that the homes dataset contains the full population of New York houses. Let’s draw a random sample of 500 houses from the “population”. We can do this with the sample_n() function available in the dplyr package:

# Randomly sample 500 homes
set.seed(155)
homes_sample <- sample_n(homes, size = 500)

Exercise 1

Do you think the independence assumption holds? Briefly explain.

Note: violations of the independence assumption are beyond the scope of our course, but appropriate methods are covered in Mac’s Correlated Data class (STAT 452).



Exercise 2

In this exercise, we’ll build an initial model of Price versus Age. We’ll aim to improve it in the next exercise.

  1. Using our sample (homes_sample), visualize the relationship between house price and house age. How would you describe the overall shape of the trend? Is it linear?

  2. Using our sample (homes_sample), fit a linear regression model where Price is predicted by Age. Call this model mod1.

  3. Check the trend and homoskedasticity assumptions by plotting the residuals versus the fitted (predicted) values. The points should be evenly scattered around the \(y = 0\) line. Do you think these assumptions are met? (One possible completion of the plotting template is sketched at the end of this exercise.)

    # Put the residuals and predicted values into a dataset
    mod1_output <- data.frame(
        residual = residuals(mod1),
        predicted = fitted.values(mod1)
    )
    
    # Plot
    ggplot(mod1_output, aes(???)) +
        ??? +
        geom_hline(yintercept = 0, color = "red") # Add the y = 0 line
  4. Check the normality assumption by making a QQ-plot of the residuals. In a QQ-plot, each residual (y-axis) is plotted against its theoretical corresponding value from a standard normal distribution (\(N(0,1^2)\)) on the x-axis. That is, the first quartile of the residuals is plotted against the first quartile of \(N(0,1^2)\), the median of the residuals is plotted against the median of \(N(0,1^2)\), and so on. If the residuals follow a normal distribution, then the points should fall on a line. Do you think the normality assumption holds?

    ggplot(mod1_output, aes(sample = residual)) +
        geom_qq() +
        geom_qq_line()
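If you get stuck on the plotting template in part 3, one possible completion is sketched below. (A scatterplot is a reasonable choice here, but not the only one.)

# One possible way to fill in the residual-vs-fitted template from part 3
ggplot(mod1_output, aes(x = predicted, y = residual)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") # Add the y = 0 line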



Exercise 3

The diagnostic plots we made above suggest that key assumptions are not being met. Let’s explore how transforming variables can help us meet those assumptions.

  1. One of the most common variable transformations for fixing an unmet homoskedasticity assumption is a logarithmic transformation of the response variable. We will also try to better capture the nonlinear shape of the Price vs. Age trend by log-transforming Age as well. The mutate() function in the dplyr package allows us to create these new transformed variables:

    homes_sample <- homes_sample %>%
        mutate(
            log_price = log(Price),
            log_age = log(Age + 1) # Some Age values are 0, so add 1 to prevent log(0), which is undefined
        )
  2. Fit a linear regression model called mod2 where log_price is predicted by log_age. Using code similar to Exercise 2, obtain the residuals and fitted values, and store these in an object called mod2_output. (A sketch of this step is provided at the end of this exercise in case you get stuck.)

  3. Check the trend, homoskedasticity, and normality assumptions for mod2. Do these assumptions seem to hold better for mod1 or mod2?
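In case you get stuck on part 2, here is a minimal sketch, assuming the transformed variables from part 1 have already been created in homes_sample:

# Fit the model on the log-transformed variables
mod2 <- lm(log_price ~ log_age, data = homes_sample)

# Put the residuals and predicted values into a dataset, as in Exercise 2
mod2_output <- data.frame(
    residual = residuals(mod2),
    predicted = fitted.values(mod2)
)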



Exercise 4

Now let’s look at the implications of our investigations for statistical inference.

Since we have the entire population of New York homes, we can investigate whether or not confidence intervals “work as advertised” for the two models we investigated.

  1. Fit the Price ~ Age and the log_price ~ log_age models in the full population (homes). You will need to create the log-transformed variables in homes first. What are the true population values of the Age and log_age coefficients?

  2. Obtain the 95% confidence intervals for the coefficients in mod1 and mod2 with the confint() function. (Good review: What computations are going on behind the scenes in confint()?)
    Does the 95% confidence interval “work as advertised” in this case?

  3. In part 2, we just looked at one sample. If the 95% CI procedure truly were “working as advertised”, 95% of samples would produce 95% CIs that contain the true population value. Behind the scenes, we have run a simulation study to see how 95% CIs “work” in 1000 different samples of 500 homes. (A sketch of what such a simulation might look like appears at the end of this exercise.)

    • We find that for mod1, 95% CIs contain the true value of the Age coefficient in 935 of the 1000 samples.
    • We find that for mod2, 95% CIs contain the true value of the log_age coefficient in 968 of the 1000 samples.
      With regard to statistical inference, what can you conclude about assumption violations and fixing those violations?
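For the curious, here is a sketch of what this kind of coverage simulation might look like for mod1. (The object names true_age_coef, n_sims, and covered are illustrative, not part of the exercise.)

# Sketch of a coverage simulation for the Age coefficient in the Price ~ Age model
true_age_coef <- coef(lm(Price ~ Age, data = homes))["Age"] # "true" value from the full population

n_sims <- 1000
covered <- logical(n_sims)
for (i in seq_len(n_sims)) {
    samp <- sample_n(homes, size = 500) # draw a new sample of 500 homes
    ci <- confint(lm(Price ~ Age, data = samp))["Age", ] # 95% CI for the Age coefficient
    covered[i] <- ci[1] <= true_age_coef & true_age_coef <= ci[2]
}
mean(covered) # proportion of samples whose 95% CI contains the true value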