Topic 5 LASSO: Shrinkage/Regularization
Learning Goals
- Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function and (2) the goal of variable selection
- Explain why variable scaling is important for the performance of shrinkage methods
- Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
- Describe how output from LASSO models can give a measure of variable importance
Slides from today are available here.
LASSO models in caret
To build LASSO models in caret, first load the package and set the seed for the random number generator to ensure reproducible results:
library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code:
# Perform LASSO
lasso_mod <- train(
    y ~ .,
    data = ___,
    method = "glmnet",
    tuneGrid = data.frame(alpha = 1, lambda = seq(___, ___, length.out = ___)),
    trControl = trainControl(method = "cv", number = ___, selectionFunction = ___),
    metric = "MAE",
    na.action = na.omit
)
Argument | Meaning |
---|---|
y ~ . | Shorthand for the model of y by all possible predictors x in the data (every other variable in the dataset except y) |
data | Sample data, which should only include y and potential predictors x. (Make sure to remove unneeded variables with select(-variable) first.) |
method | "glmnet" implements the LASSO and other regularization methods |
tuneGrid | A mini-dataset (data.frame) of tuning parameters. alpha = 1 specifies that we’re interested in the LASSO (not a different regularization method, like ridge regression). lambda = seq(begin, end, length.out = # values in sequence) specifies that LASSO should be run for each tuning parameter value in the specified sequence. |
trControl | Use cross-validation to estimate test performance for each LASSO model fit (corresponding to each tuning parameter in tuneGrid). The process used to pick a final model from among these is indicated by the selectionFunction argument, with options including "best" and "oneSE". The "best" option picks the model with the best metric; the "oneSE" option picks the simplest model whose metric is within one standard error of the best metric. |
metric | Evaluate and compare competing LASSO models with respect to their CV-MAE. |
na.action | Set na.action = na.omit to prevent errors if the data has missing values. |
Note: caret automatically implements variable standardization (scaling variables to have mean 0 and standard deviation 1) when using the "glmnet" method.
Examining the LASSO model for each \(\lambda\)
# Plot coefficient paths as a function of lambda
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
# Codebook for which variables the numbers correspond to
rownames(lasso_mod$finalModel$beta)
Identifying the “best” LASSO model
The “best” model in the sequence of models fit is defined relative to the chosen selectionFunction and metric.
# Plot CV-estimated test performance versus the tuning parameter (lambda)
plot(lasso_mod)
# CV metrics for each LASSO model
lasso_mod$results
# Identify which tuning parameter (lambda) is "best"
lasso_mod$bestTune
# Obtain the predictors and coefficients of the "best" model
# The .'s indicate that the coefficient is 0
coef(lasso_mod$finalModel, lasso_mod$bestTune$lambda)
# Obtain the predictors and coefficients of LASSO model w/ different lambda
# e.g., lambda = 1 (roughly)
coef(lasso_mod$finalModel, 1)
# Get information from all CV iterations for the "best" model
lasso_mod$resample
Exercises
You can download a template RMarkdown file to start from here.
We’ll continue using the body fat dataset to explore LASSO modeling.
library(caret)
library(ggplot2)
library(dplyr)
library(readr)
read_csv("http://www.macalester.edu/~ajohns24/data/bodyfatsub.csv")
bodyfat <-
# Take out the redundant Density and HeightFt variables
bodyfat %>%
bodyfat <- select(-Density, -HeightFt)
Exercise 1: A least squares model
Let’s start by building an ordinary (not penalized) least squares model to review important concepts. We’ll fit a model with all possible predictors.
ls_mod <- lm(BodyFat ~ ., data = bodyfat)
Use caret to perform 10-fold cross-validation to estimate the test MAE for this model. (A possible sketch appears after the questions below.) How do you think the estimated test error would change with fewer predictors?
This model fit with ordinary least squares corresponds to a special case of penalized least squares. What is the value of \(\lambda\) in this special case?
As \(\lambda\) increases, what would you expect to happen to the number of predictors that remain in the model?
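One possible approach to the cross-validation step (a sketch, not the official solution): reuse caret's train() with method = "lm". The object name ls_mod_cv and the seed are placeholders of our choosing.
# Estimate test MAE for the ordinary least squares model with 10-fold CV
set.seed(253) # placeholder seed; pick your own
ls_mod_cv <- train(
    BodyFat ~ .,
    data = bodyfat,
    method = "lm",
    trControl = trainControl(method = "cv", number = 10),
    metric = "MAE",
    na.action = na.omit
)
ls_mod_cv$results # the MAE column gives the CV-estimated test MAE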
Exercise 2: Fitting a LASSO model in caret
Adapt our general LASSO code to fit a set of LASSO models with the following parameters:
- Use 10-fold CV.
- Use mean absolute error (MAE) to select a final model.
- Select the simplest model for which the metric is within one standard error of the best metric.
- Use a sequence of 100 \(\lambda\) values from 0 to 10.
Before running the code, enter install.packages("glmnet") in the Console.
We’ll explore output from lasso_mod in the next exercises.
# Fit LASSO models for a grid of lambda values
set.seed(74)
lasso_mod <- train(

)
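If you want to check your adaptation afterwards, here is one way the blanks might be filled in under the parameters listed above (a sketch under those stated settings, not the only valid answer):
# Fit LASSO models for a grid of lambda values (one possible adaptation)
set.seed(74)
lasso_mod <- train(
    BodyFat ~ .,
    data = bodyfat,
    method = "glmnet",
    tuneGrid = data.frame(alpha = 1, lambda = seq(0, 10, length.out = 100)),
    trControl = trainControl(method = "cv", number = 10, selectionFunction = "oneSE"),
    metric = "MAE",
    na.action = na.omit
)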
Exercise 3: Examining output: plot of coefficient paths
A useful first plot allows us to examine coefficient paths resulting from the fitted LASSO models: coefficient estimates as a function of \(\lambda\).
# Plot coefficient paths as a function of lambda
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
# Codebook for which variables the numbers correspond to
rownames(lasso_mod$finalModel$beta)
# e.g., What are variables 2 and 4?
rownames(lasso_mod$finalModel$beta)[c(2,4)]
There’s a lot of information in this plot!
- Each colored line corresponds to a different predictor. The small number to the left of each line indicates a predictor by its position in rownames(lasso_mod$finalModel$beta).
- The x-axis reflects the range of different \(\lambda\) values (on the log scale) considered in lasso_mod (in tuneGrid).
- At each \(\lambda\), the y-axis reflects the coefficient estimates for the predictors in the corresponding LASSO model.
- At each \(\lambda\), the numbers at the top of the plot indicate how many predictors remain in the corresponding model.
Very roughly eyeball the coefficient estimates at the smallest value of \(\lambda\). Do they look like they correspond to the coefficient estimates from ordinary least squares in Exercise 1?
Why do all of the lines head toward y = 0 on the far right of the plot?
We can zoom in on the plot by setting the y-axis limits to go from -0.5 to 1 with ylim, as below. Compare the lines for variables 6 and 12. What are variables 6 and 12? Which seems to be a more “important” or “persistent” (persistently present in the model) variable? Does this make sense in context?
# Zoom in
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20), ylim = c(-0.5, 1))
Which predictor seems least “persistent”? In general, how might we use these coefficient paths to measure the predictive importance of our predictors?
Note: If you’re curious about code to automate this visual inspection of variable importance, look at Digging Deeper Exercise 2.
Exercise 4: Tuning \(\lambda\)
In order to pick which \(\lambda\) (hence LASSO model) is “best”, we can plot CV error estimates for the different models:
# Plot a summary of the performance of the different models
plot(lasso_mod)
Inspect the shape of the plot. The errors go down at the very beginning then start going back up. Based on this, what are the consequences of picking a \(\lambda\) that is too small or too large? (This is an example of a very important idea that we’ll see shortly: the bias-variance tradeoff.)
Based on visual inspection, roughly what value of \(\lambda\) results in the best model? (Remind yourself of the interpretation of “best” as defined by our selectionFunction of "oneSE".)
Identify the “best” \(\lambda\).
# Identify which tuning parameter (lambda) is "best"
lasso_mod$bestTune
Note: If you’re curious about making plots that show both test error estimates and their uncertainty, look at Digging Deeper Exercise 1.
Exercise 5: Examining and evaluating the best LASSO model
Take a look at the predictors and coefficients for the “best” LASSO model. Are the predictors that remain in the model sensible? Do the coefficient signs make sense?
# Obtain the predictors and coefficients of the "best" model
# The .'s indicate that the coefficient is 0
coef(lasso_mod$finalModel, lasso_mod$bestTune$lambda)

# Obtain the predictors and coefficients of LASSO model w/ different lambda
# In case it's of interest to look at a slightly different model
# e.g., lambda = 1 (roughly)
coef(lasso_mod$finalModel, 1)
Evaluate the best LASSO model:
- Contextually interpret (with units) the CV error for the model by inspecting lasso_mod$resample or lasso_mod$results.
- Make residual plots for the model by creating a dataset called lasso_mod_out which contains the original data as well as predicted values and residuals (fitted and resid). A sketch of one residual plot appears after the code below.
lasso_mod_out <- bodyfat %>%
    mutate(
        fitted = predict(lasso_mod, newdata = bodyfat),
        resid = BodyFat - fitted
    )
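For the residual plots themselves, a minimal ggplot2 sketch (residuals versus fitted values; which residual plots you make is up to you):
# A possible residual plot: residuals vs. fitted values
ggplot(lasso_mod_out, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") +
    labs(x = "Fitted values", y = "Residuals")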
Digging deeper
These exercises are recommended for further exploring code useful in an applied analysis.
1. plot(lasso_mod) only shows estimates of test error but doesn’t show the uncertainty in those estimates. Use lasso_mod$results to plot both MAE and its standard deviation (MAESD). Using this ggplot2 cheat sheet might help. (A possible starting point appears after the code at the end of this section.)
2. In Exercise 3, we used the plot of coefficient paths to evaluate the variable importance of our predictors. The code below does this systematically for each predictor so that we don’t have to eyeball it. Step through and work out what each part is doing. It may help to look up function documentation with ?function_name in the Console.
# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- lasso_mod$finalModel$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
    # Extract coefficient path (sorted from highest to lowest lambda)
    this_coeff_path <- bool_predictor_exclude[row, ]
    # Compute and return the # of lambdas until this variable is out forever
    ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
    var_name = rownames(bool_predictor_exclude),
    var_imp = var_imp
) %>%
    arrange(desc(var_imp))

var_imp_data
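For the first Digging Deeper exercise, one possible starting point (geom_pointrange is just one of several reasonable ways to show the uncertainty):
# Plot CV-estimated MAE with +/- 1 standard deviation bars across lambda values
ggplot(lasso_mod$results, aes(x = lambda, y = MAE)) +
    geom_pointrange(aes(ymin = MAE - MAESD, ymax = MAE + MAESD)) +
    labs(x = "lambda", y = "CV MAE (+/- 1 SD)")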