Topic 8 Local Regression & GAMs
Learning Goals
- Clearly describe the local regression algorithm for making a prediction
- Explain how bandwidth (span) relates to the bias-variance tradeoff
- Describe some different formulations for a GAM (how the arbitrary functions are represented)
- Explain how to make a prediction from a GAM
- Interpret the output from a GAM
Slides from today are available here.
GAMs in caret
To build GAMs in caret, first load the package and set the seed for the random number generator to ensure reproducible results:
library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code:
gam_mod <- train(
    y ~ x,
    data = ___,
    method = "gamLoess",
    tuneGrid = data.frame(degree = 1, span = seq(___, ___, by = ___)),
    trControl = trainControl(method = "cv", number = ___, selectionFunction = ___),
    metric = "MAE",
    na.action = na.omit
)
Argument | Meaning |
---|---|
y ~ x | Model formula for specifying response and predictors |
data | Sample data |
method | "gamLoess" builds GAMs with LOESS components |
tuneGrid | A mini-dataset (data.frame) of tuning parameters. degree is the degree of the local polynomial fit (1 = linear is just fine). span is the fraction of data used in the local fit: supply a sequence as seq(begin, end, by = size of step). |
trControl | Use cross-validation to estimate test performance for each model fit. The process used to pick a final model from among these is indicated by selectionFunction, with options including "best" and "oneSE". |
metric | Evaluate and compare competing models with respect to their CV-MAE. |
na.action | Set na.action = na.omit to prevent errors if the data has missing values. |
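To make the roles of degree and span concrete, here is a minimal hand-rolled sketch of a local linear fit (degree = 1) at a single target point, using only base R. The function name local_linear_predict, the tricube weighting, the simulated data, and the example span values are all assumptions made for illustration. This is not the exact routine that caret or the gam package runs internally, and it handles only one predictor.

```r
# Illustrative sketch only: predict at one point x0 with a local linear fit.
# `local_linear_predict` is a hypothetical helper, not part of caret or gam.
local_linear_predict <- function(x, y, x0, span) {
    n <- length(x)
    k <- ceiling(span * n)          # number of nearest neighbors in the local fit
    d <- abs(x - x0)                # distance of each case from the target point
    h <- sort(d)[k]                 # distance to the k-th nearest neighbor
    # Tricube weights: close points count more, points beyond h get weight 0
    w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0)
    fit <- lm(y ~ x, weights = w)   # weighted least squares line (degree = 1)
    unname(predict(fit, newdata = data.frame(x = x0)))
}

# Simulated toy data (made up for this sketch)
set.seed(1)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)

# A tuning grid shaped like the tuneGrid argument, with example span values
tune_grid <- data.frame(degree = 1, span = seq(0.2, 0.8, by = 0.2))

# Prediction at x0 = 5 for each span in the grid
tune_grid$pred_at_5 <- sapply(
    tune_grid$span,
    function(s) local_linear_predict(x, y, x0 = 5, span = s)
)
print(tune_grid)
```

Changing x0 or the span values shows how the prediction at one point responds to the fraction of data used in the local fit.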
Identifying the “best” GAM
The “best” model in the sequence of models fit is defined relative to the chosen selectionFunction and metric.
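As a rough illustration of how the two selectionFunction options can disagree, the sketch below applies both rules by hand to a made-up table of CV results. The numbers are hypothetical, the column names only mimic the shape of a caret results table, and treating a larger span as the "simpler" model is an assumption of this sketch; caret's own complexity ordering for gamLoess may differ.

```r
# Hypothetical CV summary (all numbers are made up for illustration)
cv_results <- data.frame(
    span  = c(0.1, 0.3, 0.5, 0.7, 0.9),
    MAE   = c(10.9, 10.4, 10.2, 10.3, 10.6),   # mean MAE across folds
    MAESD = c(0.8, 0.7, 0.6, 0.6, 0.7)         # SD of MAE across folds
)
n_folds <- 8

# "best": the span with the smallest CV MAE
best_row <- cv_results[which.min(cv_results$MAE), ]

# "oneSE": the simplest model whose CV MAE is within one standard error
# of the best model's CV MAE
se_best   <- best_row$MAESD / sqrt(n_folds)
threshold <- best_row$MAE + se_best
within_1se <- cv_results[cv_results$MAE <= threshold, ]

# Assumption: a larger span gives a smoother, hence simpler, fit
one_se_row <- within_1se[which.max(within_1se$span), ]

print(best_row)
print(one_se_row)
```

With a real fit, the analogous comparison would start from gam_mod$results instead of a made-up table.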
# Plot CV-estimated test performance versus the tuning parameter
plot(gam_mod)
# CV metrics for each model
gam_mod$results
# Identify which tuning parameter is "best"
gam_mod$bestTune
# Get information from all CV iterations for the "best" model
gam_mod$resample
# Use the best model to make predictions
# newdata should be a data.frame with required predictors
predict(gam_mod, newdata = ___)
Inspecting the “best” GAM
# Plot functions for each predictor
# Dashed lines are +/- 2 SEs
plot(gam_mod$finalModel, se = TRUE)
# If the functions are splines (fit with lm), plot them for each predictor with plot.Gam
library(splines)
library(gam) # Provides plot.Gam()
gam_mod_spline <- lm(
    Grad.Rate ~ ns(quant_x1, df) + ns(quant_x2, df) + ...,
    data = ___
)
plot.Gam(gam_mod_spline, se = TRUE)
Exercises
You can download a template RMarkdown file to start from here.
Before proceeding, install the gam package by entering install.packages("gam") in the Console.
We’ll continue using the College dataset in the ISLR package to explore splines. You can use ?College in the Console to look at the data codebook.
library(caret)
library(ggplot2)
library(dplyr)
library(ISLR)
library(splines)
library(gam)
data(College)
# A little data cleaning
college_clean <- College %>%
    mutate(school = rownames(College)) %>%
    filter(Grad.Rate <= 100) # Remove one school with grad rate of 118%
rownames(college_clean) <- NULL # Remove school names as row names
Exercise 1: Conceptual warmup
Do you think that a GAM with all possible predictors will have better or worse performance than an ordinary (fully linear) least squares model with all possible predictors? Explain your thoughts.
How does high/low span relate to bias and variance of a LOESS model?
How should we choose predictors to be in a GAM? How could forward and backward stepwise selection and LASSO help with variable selection before a GAM?
Exercise 2: Building a GAM in caret
Suppose that our initial variable selection investigations lead us to using the predictors indicated below in our GAM. Fit a GAM with the following specifications:
- Use 8-fold CV.
- Select the model which has the lowest MAE. (Hint: options are “oneSE” or “best”).
- Use the sequence of span values: 0.1, 0.2, …, 0.9.
What do you expect that the plot of test MAE versus span will look like, and why?
set.seed(___)
gam_mod <- train(
    Grad.Rate ~ Private + Apps + Top10perc + Top25perc + P.Undergrad + Outstate + Room.Board + Books + Personal + PhD + perc.alumni,
    ___
)
Exercise 3: Identifying the “best” GAM
The code below has been common to all of our methods so far, so it is provided for convenience.
- Inspect the output to identify the “best” span for our GAM. (Was your prediction from Exercise 2 about the plot correct?)
- Contextually interpret the CV MAE with units.
# Plot CV-estimated test performance versus the tuning parameter
plot(gam_mod)
# Identify which tuning parameter is "best"
gam_mod$bestTune
# CV metrics for each model
gam_mod$results
# CV metrics for just the "best" model
gam_mod$results %>%
    filter(span == gam_mod$bestTune$span)
Exercise 4: Interpreting the GAM
We can plot the function for each predictor as below.
par(mfrow = c(3,4)) # Sets up a grid of plots
plot(gam_mod$finalModel, se = TRUE) # Dashed lines are +/- 2 SEs
- What about these plots indicates that using GAM instead of ordinary linear regression was probably a good choice?
- Pick 1 or 2 of these plots, and interpret your findings. Anything surprising or interesting?
- The PrivateYes plot might look odd. Not to worry - the GAM is treating this as a categorical (indicator) variable. What do you learn from this plot?
In case you find it useful, you can also build a GAM using spline components with lm() and plot the nonlinear functions for each predictor with plot.Gam() from the gam package.
library(splines)
gam_mod_spline <- lm(
    Grad.Rate ~ Private + ns(Apps,3) + ns(Top10perc,3) + ns(Top25perc,3) + ns(P.Undergrad,3) + ns(Outstate,3) + ns(Room.Board,3) + ns(Books,3) + ns(Personal,3) + ns(PhD,3) + ns(perc.alumni,3),
    data = college_clean
)
par(mfrow = c(3,4))
plot.Gam(gam_mod_spline, se = TRUE)
Exercise 5: Comparison of methods
Brainstorm the pros/cons of the different methods that we’ve explored. You may find it helpful to refer to the portfolio themes for each method.
(Soon, as part of the Portfolio, you’ll be doing a similar synthesis of our regression unit, so this brainstorming session might help!)
Just for fun!
In case you want a (silly!) take on the curse of dimensionality, check out this video. (“Relevant” parts are from 0:28 - 4:16.)