Topic 2 Regression Assumptions
2.1 Discussion
Regression tasks
In regression tasks, our goal is to build a model (a function) \(f\) that predicts the quantitative response values well:
\[y = f(x) + \varepsilon\]
- \(y\): quantitative response variable
- \(x = (x_1, x_2, \ldots, x_p)\): \(p\) explanatory variables, also called predictors or features
- \(f(x)\): a function that captures the underlying trend/relationship between the response and the predictors
- \(\varepsilon\): random, unexplained error/noise
The linear regression model you learned about in MATH 155 is a special case of \(f\):
\[ \begin{align*} y &= f(x) + \varepsilon \\ &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \end{align*} \]
Evaluating regression models
- Should meet assumptions required for statistical inference
- Should explain a substantial proportion of the variation in the response
- Should produce accurate predictions
Linear regression assumptions
\[\varepsilon \stackrel{ind}{\sim} N(0, \sigma^2)\]
In words, this compact notation encodes four assumptions: the errors are independent (the \(ind\)), they are normally distributed, they are centered at 0 (so the model form captures the trend), and they have constant variance \(\sigma^2\) (homoskedasticity).
These assumptions exist to ensure that statistical inference procedures (p-values, confidence intervals) “work as advertised” (a simulation sketch of this idea follows the list below):
- The process of creating a 95% CI is a procedure (add and subtract about 2 standard errors from the estimate).
- It “works as advertised” if in 95% of possible samples it creates intervals that contain the true population value. (The other 5% of samples are unlucky ones.)
- It does not “work as advertised” if only 90% of possible samples result in intervals (95% CIs) that contain the true population value.
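As a concrete illustration of this idea, here is a minimal R sketch (not part of the exercises; all numbers are arbitrary choices for illustration): simulate many samples from a model whose errors do satisfy the assumptions, build a 95% CI for the slope in each sample, and count how often the interval contains the true slope.

# Coverage check for 95% CIs when the error assumptions hold
# (n, intercept, slope, and error SD below are arbitrary illustrative choices)
set.seed(1)
n_sims <- 1000
beta1 <- 2 # true slope
covered <- 0
for (i in 1:n_sims) {
    x <- runif(100)
    y <- 1 + beta1 * x + rnorm(100, sd = 0.5) # errors are independent N(0, 0.5^2)
    ci <- confint(lm(y ~ x))["x", ] # 95% CI for the slope
    if (ci[1] <= beta1 && beta1 <= ci[2]) covered <- covered + 1
}
covered / n_sims # should be close to 0.95 if the procedure works as advertised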
2.2 Exercises
You can download a template RMarkdown file to start from here.
We will continue looking at the New York housing data. Each case (row) in the dataset is a house in New York, and for each house we have information on several variables. We’ll focus on the response variable Price (in dollars) and a single predictor variable Age (in years).
# Load required packages
library(ggplot2)
library(dplyr)
# Read in the data
homes <- read.delim("http://sites.williams.edu/rdeveaux/files/2014/09/Saratoga.txt")
# Look at the first 6 rows
head(homes)
## Price Lot.Size Waterfront Age Land.Value New.Construct Central.Air
## 1 132500 0.09 0 42 50000 0 0
## 2 181115 0.92 0 0 22300 0 0
## 3 109000 0.19 0 133 7300 0 0
## 4 155000 0.41 0 13 18700 0 0
## 5 86060 0.11 0 0 15000 1 1
## 6 120000 0.68 0 31 14000 0 0
## Fuel.Type Heat.Type Sewer.Type Living.Area Pct.College Bedrooms
## 1 3 4 2 906 35 2
## 2 2 3 2 1953 51 3
## 3 2 3 3 1944 51 4
## 4 2 2 2 1944 51 3
## 5 2 2 3 840 51 2
## 6 2 2 2 1152 22 4
## Fireplaces Bathrooms Rooms
## 1 1 1.0 5
## 2 0 2.5 6
## 3 1 1.0 8
## 4 1 1.5 5
## 5 0 1.0 3
## 6 1 1.0 8
We will pretend that the homes dataset contains the full population of New York houses. Let’s draw a random sample of 500 houses from the “population”. We can do this with the sample_n() function available in the dplyr package:
# Randomly sample 500 homes
homes_sample <- sample_n(homes, size = 500)
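Note that sample_n() draws a different random sample each time it runs, so your results below will differ somewhat from ours. If you want reproducible results, call set.seed() with a number of your choice before sampling.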
- Checking the independence assumption
In thinking about what the cases are, do you think the independence assumption holds?
- A first try at a model
Visualize the relationship between house price and house age with a scatterplot and smooth trend line. How would you describe the overall shape of the trend? Is it linear?
ggplot(homes_sample, aes(x = ???, y = ???)) + geom_???() + geom_smooth()
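If you get stuck, one reasonable way to fill in the blanks (our choice; the exercise just requires Price and Age in a scatterplot) is:

ggplot(homes_sample, aes(x = Age, y = Price)) +
    geom_point() + # Scatterplot of price vs. age
    geom_smooth() # Smooth trend line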
Using our sample (homes_sample), fit a linear regression model where Price is predicted by Age:

# Fit the model
mod1 <- lm(Price ~ Age, data = homes_sample)

# Display the summary table
summary(mod1)
Check the trend and homoskedasticity assumptions by plotting the residuals versus the fitted (predicted) values. The points should be evenly scattered around the y = 0 line. Do you think these assumptions are met?
# Put the residuals and predicted values into a dataset
mod1_output <- data.frame(
    residual = residuals(mod1),
    predicted = fitted.values(mod1)
)

# Plot
ggplot(mod1_output, aes(x = ???, y = ???)) +
    geom_point() +
    geom_smooth(color = "blue", lwd = 3) + # Add a smooth trend
    geom_hline(yintercept = 0, color = "red") # Add the y = 0 line
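One possible completion, with the predicted values on the x-axis and the residuals on the y-axis:

ggplot(mod1_output, aes(x = predicted, y = residual)) +
    geom_point() +
    geom_smooth(color = "blue", lwd = 3) + # Add a smooth trend
    geom_hline(yintercept = 0, color = "red") # Add the y = 0 line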
Check the normality assumption by making a QQ-plot of the residuals. In a QQ-plot, each residual (y-axis) is plotted against its corresponding theoretical value from a standard normal distribution (\(N(0,1^2)\)) on the x-axis. That is, the first quartile of the residuals is plotted against the first quartile of \(N(0,1^2)\), the median of the residuals is plotted against the median of \(N(0,1^2)\), and so on. If the residuals follow a normal distribution, the points should fall approximately along a straight line. Do you think the normality assumption holds?
ggplot(mod1_output, aes(sample = residual)) + geom_qq() + geom_qq_line()
- A second model

The diagnostic plots we made above suggest that key assumptions are not being met. Let’s explore how transforming variables can help us meet those assumptions. One of the most common transformations for fixing an unmet homoskedasticity assumption is a logarithmic transformation of the response variable. We will also try to better capture the nonlinear shape of the Price vs. Age trend by log-transforming the predictor. The mutate() function in the dplyr package allows us to create these new transformed variables:

# log() = natural logarithm
# The %>% is a "pipe": it takes the dataset on its left and provides it as
# input to the code that follows
homes_sample <- homes_sample %>%
    mutate(
        log_price = log(Price),
        log_age = log(Age + 1) # Some Age values are 0, so add 1 to prevent log(0), which is undefined
    )
Fit a linear regression model where log_price is predicted by log_age and obtain the residuals and fitted values:

mod2 <- lm(???, data = homes_sample)
mod2_output <- data.frame(
    residual = residuals(???),
    predicted = fitted.values(???)
)
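One possible completion, mirroring the mod1 code above:

mod2 <- lm(log_price ~ log_age, data = homes_sample)
mod2_output <- data.frame(
    residual = residuals(mod2),
    predicted = fitted.values(mod2)
)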
Check the trend, homoskedasticity, and normality assumptions for mod2. Do these assumptions seem to hold better for mod1 or for mod2? (A sketch of the diagnostic plotting code follows.)
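If you need a starting point, here is a sketch of those diagnostics for mod2, reusing the same plotting code we used for mod1:

# Residuals vs. fitted values for mod2
ggplot(mod2_output, aes(x = predicted, y = residual)) +
    geom_point() +
    geom_smooth(color = "blue") +
    geom_hline(yintercept = 0, color = "red")

# QQ-plot of the mod2 residuals
ggplot(mod2_output, aes(sample = residual)) +
    geom_qq() +
    geom_qq_line()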
- Implications for statistical inference

Since we have the entire population of New York homes, we can investigate whether or not confidence intervals “work as advertised” for the two models we fit. Display the 95% confidence intervals for the coefficients in mod1 and mod2 with the confint() function:

confint(mod1)
confint(mod2)

By fitting the Price ~ Age model in the full population, we know that the true population value of the Age coefficient is -636.2551305. And by fitting the log_price ~ log_age model in the full population, we know that the true population value of the log_age coefficient is -0.134966. Does the 95% confidence interval “work as advertised” in this case?

But what we did in part a just looked at one sample. If the 95% CI truly were working as advertised, 95% of samples would create 95% CIs that contain the true population value. We can run some simulations to see how 95% CIs “work” in 1000 different samples.
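The simulation code itself isn’t shown in these notes, but a minimal sketch of the idea looks like the following (the seed and loop structure are our own illustrative choices; the true coefficient values come from fitting each model to the full “population”):

# Sketch of the coverage simulation: true coefficients come from fitting
# each model to the full "population" of homes
homes <- homes %>%
    mutate(log_price = log(Price), log_age = log(Age + 1))
true_age <- coef(lm(Price ~ Age, data = homes))["Age"]
true_log_age <- coef(lm(log_price ~ log_age, data = homes))["log_age"]

set.seed(155) # arbitrary seed, for reproducibility
cover1 <- 0
cover2 <- 0
for (i in 1:1000) {
    s <- sample_n(homes, size = 500)
    ci1 <- confint(lm(Price ~ Age, data = s))["Age", ]
    ci2 <- confint(lm(log_price ~ log_age, data = s))["log_age", ]
    if (ci1[1] <= true_age && true_age <= ci1[2]) cover1 <- cover1 + 1
    if (ci2[1] <= true_log_age && true_log_age <= ci2[2]) cover2 <- cover2 + 1
}
c(cover1, cover2) # counts (out of 1000) of CIs containing the true value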
We find that for mod1, 95% CIs contain the true value of the Age coefficient in 899 of the 1000 samples. We find that for mod2, 95% CIs contain the true value of the log_age coefficient in 964 of the 1000 samples.
With regard to statistical inference, what can you conclude about assumption violations and about fixing those violations?