Topic 6 Modeling Concepts (Part 2)
Learning Goals
- Practice simple linear regression modeling concepts: model formula, coefficient interpretations, predicted values, residuals
- Develop two ideas of model quality: \(R^2\) and residual standard error
- Understand how categorical predictors are incorporated in linear regression models
Warm-up
mod1 <- lm(Price ~ Age, data = homes)
summary(mod1)
##
## Call:
## lm(formula = Price ~ Age, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -222183 -66299 -22232 43147 564995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 229728.46 3218.18 71.385 < 2e-16 ***
## Age -636.26 79.66 -7.987 2.5e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96700 on 1726 degrees of freedom
## Multiple R-squared: 0.03564, Adjusted R-squared: 0.03508
## F-statistic: 63.79 on 1 and 1726 DF, p-value: 2.502e-15
- Write the regression model formula using numbers from this output.
- Interpret all coefficients in this model.
- For a 50-year-old house whose price is $100,000, what is the residual? (One way to check your answer in R is sketched below.)
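As a minimal sketch, assuming mod1 is the Age model fit above, the check could look like this in R:
# Residual = observed price - predicted price (sketch; assumes mod1 exists)
predicted <- predict(mod1, newdata = data.frame(Age = 50))
predicted            # roughly 229728.46 - 636.26 * 50
100000 - predicted   # the residual for this house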
Discussion
What can we quantify about residuals to measure model quality?
- Not the sum or the mean of the residuals (these are always zero for a least squares fit with an intercept)
- Residual standard error: essentially equal to the standard deviation of the residuals
- Scale of residual standard error changes with the scale of the data (e.g., house prices versus strength-to-weight ratio)
- Can we put the variance of the residuals on a nicer scale? Say from 0 to 1? Yes, we can.
- Some facts:
\[ \hbox{Var}(\hbox{response}) = \hbox{Var}(\hbox{residuals}) + \hbox{Var}(\hbox{predicted values}) \] \[ \hbox{Total variation} = \hbox{Unexplained variation} + \hbox{Explained variation} \]
- \(R^2\): What fraction of total variation in the response is explained by the model?
- Hopefully a lot, which would mean that there is relatively little unexplained variation.
- Ranges from 0 to 1
\[ \begin{align*} R^2 &= \frac{\hbox{Var}(\hbox{predicted values})}{\hbox{Var}(\hbox{response})} \\ &= 1 - \frac{\hbox{Var}(\hbox{residuals})}{\hbox{Var}(\hbox{response})} \end{align*} \]
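These identities are easy to verify numerically. Here is a minimal R sketch, assuming mod1 is a fitted lm object such as the Age model above and homes is the data frame:
# Variance decomposition and R-squared by hand (sketch; assumes mod1 and homes exist)
var(homes$Price)                          # total variation in the response
var(fitted(mod1)) + var(resid(mod1))      # explained + unexplained (matches the line above)
1 - var(resid(mod1)) / var(homes$Price)   # R-squared, computed from the residuals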
mod1 <- lm(Price ~ Living.Area, data = homes)
summary(mod1)
##
## Call:
## lm(formula = Price ~ Living.Area, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -277022 -39371 -7726 28350 553325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13439.394 4992.353 2.692 0.00717 **
## Living.Area 113.123 2.682 42.173 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69100 on 1726 degrees of freedom
## Multiple R-squared: 0.5075, Adjusted R-squared: 0.5072
## F-statistic: 1779 on 1 and 1726 DF, p-value: < 2.2e-16
- Residual standard error: $69,100
- This describes the typical amount of spread in the residuals.
- What qualifies as “high”? Imagine your prediction being off by roughly that much. Is that a lot?
- \(R^2\) (Multiple R-squared): 0.5075
- 50.75% of the variation in house prices is explained by a simple linear regression model with square footage as a predictor
- What qualifies as “high”? Context helps determine if the response variable simply varies a lot. (e.g., stocks)
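Both quantities can be pulled directly out of the model summary; a small sketch, again assuming mod1 is the Living.Area model fit above:
s <- summary(mod1)
s$sigma          # residual standard error (about 69100)
s$r.squared      # multiple R-squared (about 0.5075)
sd(resid(mod1))  # essentially the same as s$sigma, as claimed earlier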
How do we incorporate categorical predictors?
In our housing dataset, there is a Heat.Type variable that indicates whether the heating system of the house is of type 2, 3, or 4.
Including a categorical predictor variable creates \(L-1\) indicator variables where \(L\) is the number of levels of the categorical variable.
- Type 2 is chosen as the reference category by default in R because it is first in alphanumeric order.
- Heat.Type3 and Heat.Type4 get created as indicator variables by taking the original variable name (Heat.Type) and pasting the name of the category (3 or 4) afterward.
- Heat.Type3 equals 1 if this case is of heating type 3, and equals 0 otherwise.
- Heat.Type4 equals 1 if this case is of heating type 4, and equals 0 otherwise.
The table below shows a few example cases; the sketch after the table shows how to ask R for these indicator columns directly.
Case   Heat.Type   Heat.Type3   Heat.Type4
----   ---------   ----------   ----------
1      3           1            0
2      4           0            1
3      4           0            1
4      2           0            0
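A minimal sketch using model.matrix(), assuming Heat.Type is stored as a factor in homes:
# Inspect the indicator (dummy) variables R creates for a factor predictor
head(model.matrix(~ Heat.Type, data = homes))
# Columns: (Intercept), Heat.Type3, Heat.Type4 -- type 2 is the reference level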
mod2 <- lm(Price ~ Heat.Type, data = homes)
summary(mod2)
##
## Call:
## lm(formula = Price ~ Heat.Type, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -221355 -63355 -17644 43895 548645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 226355 2853 79.348 < 2e-16 ***
## Heat.Type3 -17223 6192 -2.781 0.00547 **
## Heat.Type4 -64467 6168 -10.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95510 on 1725 degrees of freedom
## Multiple R-squared: 0.05972, Adjusted R-squared: 0.05863
## F-statistic: 54.78 on 2 and 1725 DF, p-value: < 2.2e-16
From this output, I can see that the regression model formula is:
\[ \begin{align*} E[\hbox{Price}] &= \beta_0 + \beta_1\,\hbox{Heat.Type3} + \beta_2\,\hbox{Heat.Type4} \\ &= 226355 - 17223\,\hbox{Heat.Type3} - 64467\,\hbox{Heat.Type4} \end{align*} \]
- When a house is of heating type 2, what are the values of the indicator variables? Thus what is the expected (average) price for a house of heating type 2?
- Same questions for types 3 and 4
- This leads us to the interpretation of the coefficients in this model (see the sketch below for a quick numerical check).
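For a model with a single categorical predictor, the predicted price for each heating type is just the group mean of the observed prices. A minimal sketch, assuming Heat.Type is a factor and mod2 is the model fit above:
# Group means of Price by heating type equal the model's predicted values
tapply(homes$Price, homes$Heat.Type, mean)
# Compare with 226355, 226355 - 17223, and 226355 - 64467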
Exercises
We won’t be working in R today. Instead, look at the output from the R code below and answer the following questions.
Exercise 1
Let’s look at a model that describes Price in terms of Fuel.Type, which can be of types 2, 3, or 4.
mod3 <- lm(Price ~ Fuel.Type, data = homes)
summary(mod3)
##
## Call:
## lm(formula = Price ~ Fuel.Type, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -223535 -60535 -19652 42811 546465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 228535 2748 83.160 < 2e-16 ***
## Fuel.Type3 -63598 6021 -10.563 < 2e-16 ***
## Fuel.Type4 -39801 7029 -5.663 1.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95080 on 1725 degrees of freedom
## Multiple R-squared: 0.06823, Adjusted R-squared: 0.06715
## F-statistic: 63.16 on 2 and 1725 DF, p-value: < 2.2e-16
Interpret all coefficients in this model.
Interpret the \(R^2\) and residual standard error to evaluate the quality of the model.
What is the residual for a $250,000 house that is of fuel type 2? What about a $250,000 house that is of fuel type 3?
Exercise 2
Let’s look at a model that describes Price in terms of Sewer.Type, which can be of types 1, 2, or 3.
mod4 <- lm(Price ~ Sewer.Type, data = homes)
summary(mod4)
##
## Call:
## lm(formula = Price ~ Sewer.Type, data = homes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211426 -66426 -21426 45574 574716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 250952 28340 8.855 <2e-16 ***
## Sewer.Type2 -50668 28676 -1.767 0.0774 .
## Sewer.Type3 -34527 28479 -1.212 0.2255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98170 on 1725 degrees of freedom
## Multiple R-squared: 0.006633, Adjusted R-squared: 0.005481
## F-statistic: 5.759 on 2 and 1725 DF, p-value: 0.003215
Interpret all coefficients in this model.
Interpret the \(R^2\) and residual standard error to evaluate the quality of the model.
What is the residual for a $200,000 house that is of sewer type 3? What about a $200,000 house that is of sewer type 1?