Topic 10: Local Regression & GAMs
Learning Goals
- Clearly describe the local regression algorithm for making a prediction
- Explain how bandwidth (span) relates to the bias-variance tradeoff
- Describe some different formulations for a GAM (how the arbitrary functions are represented)
- Explain how to make a prediction from a GAM
- Interpret the output from a GAM
Slides from today are available here.
GAMs - Options for Fitting
GAMs (splines + OLS)
We’ve already talked about how to fit GAM models with splines (step_ns()) using the lm engine (ordinary least squares).
GAMs (LOESS)
The gam package provides tools for building GAMs with local regression (LOESS). We won’t explore this option further (via code) in class because there isn’t a tidymodels interface.
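For reference only, the sketch below shows roughly what that approach looks like with the gam package directly, written in the same placeholder style as the templates that follow (span = 0.5 is an arbitrary illustration, not a recommendation):

# For reference: a GAM with a LOESS term, fit directly with the gam package (not used in class)
library(gam)
gam_loess <- gam(YOUR_OUTCOME ~ CATEGORICAL_PREDICTOR + lo(QUANTITATIVE_PREDICTOR, span = 0.5), data = ___)
summary(gam_loess)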
GAMs (smoothing splines) in tidymodels
Today, we’ll try fitting GAM models with smoothing splines in tidymodels.
To build GAMs (using smoothing splines) in tidymodels, first load the packages:
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions
Then adapt the following code:
gam_spec <- gen_additive_mod() %>%
    set_args(select_features = TRUE, adjust_deg_free = tune()) %>%
    set_engine(engine = "mgcv") %>%
    set_mode("regression")
# The implementation of GAMs in tidymodels is different than other methods...
# ...we don't specify a recipe
gam_wf <- workflow() %>%
    add_variables(outcomes = YOUR_OUTCOME, predictors = c(PREDICTOR1, PREDICTOR2)) %>%
    add_model(gam_spec, formula = YOUR_OUTCOME ~ CATEGORICAL_PREDICTORS + s(QUANTITATIVE_PREDICTOR, k = 10))
# s(x1, k = 10): k sets the dimension of the spline basis (roughly, the maximum
# wiggliness allowed); the fitted smoothness is then chosen via penalization
tuning_param_grid <- grid_regular(
    adjust_deg_free(range = c(0.25, 4)),
    levels = 10
)
data_cv <- vfold_cv(___, v = 10)
tune_output <- tune_grid(
    gam_wf,
    resamples = data_cv,
    metrics = metric_set(mae),
    grid = tuning_param_grid
)
Picking the best tuning parameter and visualizing the GAM
# Select best model & fit to full training data
best_param <- tune_output %>% select_best(metric = "mae")
best_by_1se_param <- tune_output %>% select_by_one_std_err(metric = "mae", desc(adjust_deg_free))

gam_mod_best <- finalize_workflow(gam_wf, best_param) %>% fit(data = ___)
gam_mod_best1se <- finalize_workflow(gam_wf, best_by_1se_param) %>% fit(data = ___)
# Plot functions for each predictor
# Dashed lines are +/- 2 SEs
gam_mod_best$fit$fit$fit %>% plot()
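To then make a prediction from the fitted GAM, predict() works as with any fitted tidymodels workflow; new_data must contain all of the predictors listed in add_variables() (the blank is yours to fill in):

# Make predictions from the final fitted GAM workflow
predict(gam_mod_best, new_data = ___)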
Exercises
You can download a template RMarkdown file to start from here.
Before proceeding, install the mgcv package by entering install.packages("mgcv") in the Console.
We’ll continue using the College dataset in the ISLR2 package to explore models for graduation rate. You can use ?College in the Console to look at the data codebook.
library(ISLR2)
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions
data(College)
# A little data cleaning
college_clean <- College %>%
    mutate(school = rownames(College)) %>%
    filter(Grad.Rate <= 100) # Remove one school with grad rate of 118%
rownames(college_clean) <- NULL # Remove school names as row names
Exercise 1: Conceptual warmup
How does high/low span relate to bias and variance of a local regression (LOESS) model?
How does high/low lambda relate to bias and variance of a smoothing spline in a GAM model?
Do you think that a GAM with all possible predictors will have better or worse performance than an ordinary (fully linear) least squares model with all possible predictors? Explain your thoughts.
Why might we want to perform variable selection before fitting a GAM? How could stepwise selection or LASSO help with this? (A reminder sketch of a LASSO specification follows below.)
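As a reminder from our earlier topic on LASSO, a model specification for variable selection looks roughly like the sketch below (the glmnet engine and tuned penalty follow our earlier code; the exact settings here are illustrative):

# Reminder sketch: a LASSO specification that could screen predictors before a GAM
lasso_spec <- linear_reg() %>%
    set_args(mixture = 1, penalty = tune()) %>% # mixture = 1 specifies LASSO
    set_engine(engine = "glmnet") %>%
    set_mode("regression")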
Exercise 2: Local regression (LOESS)
Use LOESS (geom_smooth()) to explore the relationship between Apps and Grad.Rate for different values of span (between 0 and 1).
- How is varying the span like varying the number of neighbors in KNN?
- How would you describe the relationship between number of applications and graduation rate?
college_clean %>%
    ggplot(aes(x = Apps, y = Grad.Rate)) +
    geom_point(alpha = 0.2) + # Adjust point transparency with alpha
    geom_smooth(span = 0.4, method = "loess", se = FALSE) + # vary span
    xlim(c(0, 20000)) +
    theme_classic()
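To connect this plot to the prediction side of local regression, here is a small sketch using base R's loess() directly; the span and the Apps value of 5000 are illustrative choices, not part of the exercise:

# Fit a local regression directly and predict graduation rate at 5000 applications
loess_mod <- loess(Grad.Rate ~ Apps, data = college_clean, span = 0.4)
predict(loess_mod, newdata = data.frame(Apps = 5000))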
Exercise 3: Building a GAM in tidymodels
Suppose that initial variable selection investigations have given us a set of predictors to include in our GAM. The code below for building a GAM is complete (nothing to fill in), but before looking at output, we are going to closely examine this code in comparison to the code for our previous methods: LASSO and KNN.
- The adjust_deg_free argument is like lambda in LASSO. Higher values of this tuning parameter mean that “wiggliness” is penalized more.
- The select_features = TRUE argument allows a predictor to be eliminated entirely via penalization (more likely with higher adjust_deg_free).
set.seed(123)
gam_spec <- gen_additive_mod() %>%
    set_args(select_features = TRUE, adjust_deg_free = tune()) %>%
    set_engine(engine = "mgcv") %>%
    set_mode("regression")
# The implementation of GAMs in tidymodels is different than other methods...
# ...we don't specify a recipe
gam_wf <- workflow() %>%
    add_variables(outcomes = Grad.Rate, predictors = c(Private, Apps, Top10perc, P.Undergrad, Outstate, perc.alumni)) %>%
    add_model(gam_spec, formula = Grad.Rate ~ Private + s(Apps, k = 10) + s(Top10perc, k = 10) + s(P.Undergrad, k = 10) + s(Outstate, k = 10) + s(perc.alumni, k = 10))
tuning_param_grid <- grid_regular(
    adjust_deg_free(range = c(0.25, 4)),
    levels = 10
)
data_cv8 <- vfold_cv(college_clean, v = 8)
# This takes a few seconds
tune_output <- tune_grid(
    gam_wf,
    resamples = data_cv8,
    metrics = metric_set(mae),
    grid = tuning_param_grid
)
We can take a look at the test error metrics from CV and choose an optimal tuning parameter:
tune_output %>% collect_metrics()
tune_output %>% show_best(metric = "mae")
autoplot(tune_output) + theme_classic()
# Select best model & fit to full training data
best_param <- tune_output %>% select_best(metric = "mae")
best_by_1se_param <- tune_output %>% select_by_one_std_err(metric = "mae", desc(adjust_deg_free))

gam_mod_best <- finalize_workflow(gam_wf, best_param) %>% fit(data = college_clean)
gam_mod_best1se <- finalize_workflow(gam_wf, best_by_1se_param) %>% fit(data = college_clean)
Let’s visualize the estimated nonlinear functions.
- What about these plots indicates that using a GAM instead of ordinary linear regression was probably a good choice?
- Compare the plots resulting from the two different choices for “best” tuning parameter. Which choice would you prefer and why?
- For each of the plots, write a sentence describing what you learn about the relationship between graduation rate and that predictor.
# Plot the estimated nonlinear functions for all predictors...
# ...for the GAM resulting from the adjust_deg_free parameter value that gave the lowest error
gam_mod_best$fit$fit$fit %>% plot(all.terms = TRUE, pages = 1)

# ...for the GAM resulting from the adjust_deg_free parameter value chosen by the one-standard-error rule
gam_mod_best1se$fit$fit$fit %>% plot(all.terms = TRUE, pages = 1)
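Beyond the plots, a numerical summary of the underlying mgcv fit can also help with interpretation; the edf (effective degrees of freedom) reported for each smooth term indicates how wiggly its estimated function is (edf near 1 suggests a nearly linear relationship):

# Numerical summary of the underlying mgcv fit (edf per smooth term, etc.)
gam_mod_best$fit$fit$fit %>% summary()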