Topic 6 Modeling Concepts (Part 2)
Learning Goals
- Practice simple linear regression modeling concepts: model formula, coefficient interpretations, predicted values, residuals
- Develop two ideas of model quality: $R^2$ and residual standard error
- Understand how categorical predictors are incorporated in linear regression models
Warm-up
mod1 <- lm(Price ~ Age, data = homes)
summary(mod1)
##
## Call:
## lm(formula = Price ~ Age, data = homes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -222183  -66299  -22232   43147  564995
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 229728.46    3218.18  71.385  < 2e-16 ***
## Age           -636.26      79.66  -7.987  2.5e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96700 on 1726 degrees of freedom
## Multiple R-squared: 0.03564, Adjusted R-squared: 0.03508
## F-statistic: 63.79 on 1 and 1726 DF, p-value: 2.502e-15
- Write the regression model formula using numbers from this output.
- Interpret all coefficients in this model.
- For a 50-year-old house whose price is $100,000, what is the residual?
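A residual is the observed price minus the predicted price. The sketch below plugs in the rounded coefficients from the output above. The house in it (20 years old, sold for $250,000) is hypothetical and chosen only for illustration, and the variable names are assumptions rather than anything in the dataset.

```r
# Coefficients copied from the summary(mod1) output above
b0 <- 229728.46   # intercept
b1 <- -636.26     # slope for Age

# Hypothetical house (not taken from the homes dataset)
age   <- 20
price <- 250000

predicted <- b0 + b1 * age      # price predicted by the model
residual  <- price - predicted  # observed minus predicted

predicted
residual
```

With `mod1` in memory, `predict(mod1, newdata = data.frame(Age = 20))` should give the same prediction up to rounding of the coefficients.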
Discussion
What can we quantify about residuals to measure model quality?
- Not the sum or the mean of residuals (will always be zero)
- Residual standard error: essentially equal to the standard deviation of the residuals
- Scale of residual standard error changes with the scale of the data (e.g., house prices versus strength-to-weight ratio)
- Can we put the variance of the residuals on a nicer scale? Say from 0 to 1? Yes, we can.
- Some facts:
$$\text{Var}(\text{response}) = \text{Var}(\text{residuals}) + \text{Var}(\text{predicted values})$$
$$\text{Total variation} = \text{Unexplained variation} + \text{Explained variation}$$
- $R^2$: What fraction of total variation in the response is explained by the model?
- Hopefully a lot, which would mean that there is relatively little unexplained variation.
- Ranges from 0 to 1
$$R^2 = \frac{\text{Var}(\text{predicted values})}{\text{Var}(\text{response})} = 1 - \frac{\text{Var}(\text{residuals})}{\text{Var}(\text{response})}$$
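The variance decomposition and the two forms of $R^2$ can be checked numerically. This is a minimal sketch on simulated data, which is an assumption made so the code runs on its own; the variable names and the numbers used to generate the data are made up and are not the `homes` dataset.

```r
# Simulated data (an assumption; the notes use the homes dataset instead)
set.seed(1)
x <- runif(200, 500, 4000)
y <- 15000 + 110 * x + rnorm(200, sd = 60000)

fit <- lm(y ~ x)

var_total     <- var(y)            # total variation in the response
var_explained <- var(fitted(fit))  # variation in the predicted values
var_resid     <- var(resid(fit))   # variation in the residuals

# Total variation = explained variation + unexplained variation
all.equal(var_total, var_explained + var_resid)

# Two equivalent ways to compute R-squared, next to the value R reports
r2_a <- var_explained / var_total
r2_b <- 1 - var_resid / var_total
c(r2_a, r2_b, summary(fit)$r.squared)
```

The decomposition holds exactly here because the model is fit by least squares with an intercept, so the residuals are uncorrelated with the predicted values.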
mod1 <- lm(Price ~ Living.Area, data = homes)
summary(mod1)
##
## Call:
## lm(formula = Price ~ Living.Area, data = homes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -277022  -39371   -7726   28350  553325
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13439.394   4992.353   2.692  0.00717 **
## Living.Area   113.123      2.682  42.173  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69100 on 1726 degrees of freedom
## Multiple R-squared: 0.5075, Adjusted R-squared: 0.5072
## F-statistic: 1779 on 1 and 1726 DF, p-value: < 2.2e-16
- Residual standard error: $69100
- This describes the amount of spread in the residuals.
- What qualifies as “high”? Imagine a prediction that misses the actual price by that much. Is that a lot?
- $R^2$ (Multiple R-squared): 0.5075
- 50.75% of the variation in house prices is explained by a simple linear regression model with square footage as a predictor
- What qualifies as “high”? Context helps determine if the response variable simply varies a lot. (e.g., stocks)
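The claim that the residual standard error is essentially the standard deviation of the residuals can be checked directly. The sketch below again uses simulated data (an assumption, not the `homes` dataset). The only difference between the two quantities is the divisor: the residual degrees of freedom versus n - 1.

```r
# Simulated data (an assumption; not the homes dataset)
set.seed(2)
x <- runif(200, 500, 4000)
y <- 15000 + 110 * x + rnorm(200, sd = 60000)
fit <- lm(y ~ x)

rse      <- sigma(fit)                                  # residual standard error reported by R
rse_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))  # same quantity, divisor is n - 2 here
sd_resid <- sd(resid(fit))                              # divisor is n - 1

c(rse = rse, rse_hand = rse_hand, sd_resid = sd_resid)
```

With 1726 residual degrees of freedom, as in the output above, the two values differ by a factor of only sqrt(1727 / 1726), which is negligible in practice.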
How do we incorporate categorical predictors?
In our housing dataset, there is a Heat.Type variable that indicates whether the heating type of the house is 2, 3, or 4.
Including a categorical predictor in a model creates L−1 indicator variables, where L is the number of levels of the categorical variable.
- Type 2 is chosen as the reference category by default in R because it is first in alphanumeric order.
- Heat.Type3 and Heat.Type4 get created as indicator variables by taking the original variable name (Heat.Type) and pasting the name of the category (3 or 4) afterward.
- Heat.Type3 equals 1 if this case is of heating type 3. Equals 0 otherwise.
- Heat.Type4 equals 1 if this case is of heating type 4. Equals 0 otherwise.
Case  Heat.Type  Heat.Type3  Heat.Type4
----  ---------  ----------  ----------
1     3          1           0
2     4          0           1
3     4          0           1
4     2          0           0
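The indicator columns in this table can be reproduced with `model.matrix()`, which builds the design matrix that `lm()` uses internally. The sketch assumes that Heat.Type is stored as a factor (an assumption about the dataset; a numeric column would be treated as a single quantitative predictor). The data frame `cases` is made up to match the four example cases.

```r
# The four example cases from the table above, with Heat.Type as a factor
cases <- data.frame(Heat.Type = factor(c(3, 4, 4, 2)))

# Levels are sorted; the first one (2) becomes the reference category
levels(cases$Heat.Type)

# Design matrix: an intercept column plus L - 1 = 2 indicator columns
X <- model.matrix(~ Heat.Type, data = cases)
X
```

The reference category gets no column of its own: a case of heating type 2 is identified by both indicators being 0.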
mod2 <- lm(Price ~ Heat.Type, data = homes)
summary(mod2)
##
## Call:
## lm(formula = Price ~ Heat.Type, data = homes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -221355  -63355  -17644   43895  548645
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   226355       2853  79.348  < 2e-16 ***
## Heat.Type3    -17223       6192  -2.781  0.00547 **
## Heat.Type4    -64467       6168 -10.451  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95510 on 1725 degrees of freedom
## Multiple R-squared: 0.05972, Adjusted R-squared: 0.05863
## F-statistic: 54.78 on 2 and 1725 DF, p-value: < 2.2e-16
From this output, I can see that the regression model formula is:
$$E[\text{Price}] = \beta_0 + \beta_1 \text{Heat.Type3} + \beta_2 \text{Heat.Type4} = 226355 - 17223\,\text{Heat.Type3} - 64467\,\text{Heat.Type4}$$
- When a house is of heating type 2, what are the values of the indicator variables? Thus what is the expected (average) price for a house of heating type 2?
- Same questions for types 3 and 4
- This leads us to the interpretation of the coefficients in this model.
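Plugging the indicator values for each heating type into the formula gives one expected price per category. A minimal sketch using the rounded coefficients from the output above; the helper names (`b0`, `b3`, `b4`, `is3`, `is4`) are assumptions made for illustration.

```r
# Coefficients copied from the summary(mod2) output above
b0 <- 226355   # intercept: expected price for the reference category (type 2)
b3 <- -17223   # Heat.Type3: type 3 minus type 2
b4 <- -64467   # Heat.Type4: type 4 minus type 2

# Indicator values for one house of each heating type
types <- c("2", "3", "4")
is3 <- as.numeric(types == "3")
is4 <- as.numeric(types == "4")

expected <- b0 + b3 * is3 + b4 * is4
names(expected) <- types
expected
```

In a model with a single categorical predictor, these expected values coincide with the sample mean price within each category, which is why each indicator coefficient reads as a difference in average price relative to type 2.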
Exercises
We won’t be working in R today. Instead, look at the output from R code below, and answer the following questions.
Exercise 1
Let’s look at a model that describes Price in terms of Fuel.Type, which can be of types 2, 3, or 4.
mod3 <- lm(Price ~ Fuel.Type, data = homes)
summary(mod3)
##
## Call:
## lm(formula = Price ~ Fuel.Type, data = homes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -223535  -60535  -19652   42811  546465
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   228535       2748  83.160  < 2e-16 ***
## Fuel.Type3    -63598       6021 -10.563  < 2e-16 ***
## Fuel.Type4    -39801       7029  -5.663 1.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 95080 on 1725 degrees of freedom
## Multiple R-squared: 0.06823, Adjusted R-squared: 0.06715
## F-statistic: 63.16 on 2 and 1725 DF, p-value: < 2.2e-16
- Interpret all coefficients in this model.
- Interpret the $R^2$ and residual standard error to evaluate the quality of the model.
- What is the residual for a $250,000 house that is of fuel type 2? What about a $250,000 house that is of fuel type 3?
Exercise 2
Let’s look at a model that describes Price in terms of Sewer.Type, which can be of types 1, 2, or 3.
mod4 <- lm(Price ~ Sewer.Type, data = homes)
summary(mod4)
##
## Call:
## lm(formula = Price ~ Sewer.Type, data = homes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -211426  -66426  -21426   45574  574716
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   250952      28340   8.855   <2e-16 ***
## Sewer.Type2   -50668      28676  -1.767   0.0774 .
## Sewer.Type3   -34527      28479  -1.212   0.2255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 98170 on 1725 degrees of freedom
## Multiple R-squared: 0.006633, Adjusted R-squared: 0.005481
## F-statistic: 5.759 on 2 and 1725 DF, p-value: 0.003215
- Interpret all coefficients in this model.
- Interpret the $R^2$ and residual standard error to evaluate the quality of the model.
- What is the residual for a $200,000 house that is of sewer type 3? What about a $200,000 house that is of sewer type 1?