Topic 5 LASSO: Shrinkage/Regularization
Learning Goals
- Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function and (2) the goal of variable selection
- Explain why variable scaling is important for the performance of shrinkage methods
- Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
- Describe how output from LASSO models can give a measure of variable importance
Slides from today are available here.
LASSO models in caret
To build LASSO models in caret, first load the package and set the seed for the random number generator to ensure reproducible results:
library(caret)
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code:
# Perform LASSO
lasso_mod <- train(
    y ~ .,
    data = ___,
    method = "glmnet",
    tuneGrid = data.frame(alpha = 1, lambda = seq(___, ___, length.out = ___)),
    trControl = trainControl(method = "cv", number = ___, selectionFunction = ___),
    metric = "MAE",
    na.action = na.omit
)
Argument | Meaning |
---|---|
y ~ . | Shorthand for the model of y by all possible predictors x in the data (every other variable in the dataset except y) |
data | Sample data, which should only include y and potential predictors x. (Make sure to remove unneeded variables with select(-variable) first.) |
method | "glmnet" implements the LASSO and other regularization methods |
tuneGrid | A mini-dataset (data.frame) of tuning parameters. alpha = 1 specifies that we’re interested in the LASSO (not a different regularization method, like ridge regression). lambda = seq(begin, end, length.out = # values in sequence) specifies that LASSO should be run for each tuning parameter value in the specified sequence. |
trControl | Use cross-validation to estimate test performance for each LASSO model fit (corresponding to each tuning parameter in tuneGrid). The process used to pick a final model from among these is indicated by the selectionFunction argument, with options including "best" and "oneSE". The "best" option picks the model with the best metric; the "oneSE" option picks the simplest model whose metric is within one standard error of the best metric. |
metric | Evaluate and compare competing LASSO models with respect to their CV-MAE. |
na.action | Set na.action = na.omit to prevent errors if the data has missing values. |
Note: caret automatically implements variable standardization (scaling variables to have mean 0 and standard deviation 1) when using the "glmnet" method.
Examining the LASSO model for each \(\lambda\)
# Plot coefficient paths as a function of lambda
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
# Codebook for which variables the numbers correspond to
rownames(lasso_mod$finalModel$beta)
Identifying the “best” LASSO model
The “best” model in the sequence of models fit is defined relative to the chosen selectionFunction and metric.
# Plot CV-estimated test performance versus the tuning parameter (lambda)
plot(lasso_mod)
# CV metrics for each LASSO model
lasso_mod$results
# Identify which tuning parameter (lambda) is "best"
lasso_mod$bestTune
# Obtain the predictors and coefficients of the "best" model
# The .'s indicate that the coefficient is 0
coef(lasso_mod$finalModel, lasso_mod$bestTune$lambda)
# Obtain the predictors and coefficients of LASSO model w/ different lambda
# e.g., lambda = 1 (roughly)
coef(lasso_mod$finalModel, 1)
# Get information from all CV iterations for the "best" model
lasso_mod$resample
Exercises
You can download a template RMarkdown file to start from here.
We’ll continue using the body fat dataset to explore LASSO modeling.
library(caret)
library(ggplot2)
library(dplyr)
library(readr)
read_csv("http://www.macalester.edu/~ajohns24/data/bodyfatsub.csv")
bodyfat <-
# Take out the redundant Density and HeightFt variables
bodyfat %>%
bodyfat <- select(-Density, -HeightFt)
Exercise 1: A least squares model
Let’s start by building an ordinary (not penalized) least squares model to review important concepts. We’ll fit a model with all possible predictors.
ls_mod <- lm(BodyFat ~ ., data = bodyfat)
Use caret to perform 10-fold cross-validation to estimate the test MAE for this model. (A possible sketch appears after the questions below.) How do you think the estimated test error would change with fewer predictors?
This model fit with ordinary least squares corresponds to a special case of penalized least squares. What is the value of \(\lambda\) in this special case?
As \(\lambda\) increases, what would you expect to happen to the number of predictors that remain in the model?
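One possible approach to the cross-validation step (a sketch, not the official solution): reuse caret's train() with method = "lm". The object name ls_mod_cv and the seed are placeholders of our choosing.
# Estimate test MAE for the ordinary least squares model with 10-fold CV
set.seed(253) # placeholder seed; pick your own
ls_mod_cv <- train(
    BodyFat ~ .,
    data = bodyfat,
    method = "lm",
    trControl = trainControl(method = "cv", number = 10),
    metric = "MAE",
    na.action = na.omit
)
ls_mod_cv$results # the MAE column gives the CV-estimated test MAE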
Exercise 2: Fitting a LASSO model in caret
Adapt our general LASSO code to fit a set of LASSO models with the following parameters:
- Use 10-fold CV.
- Use mean absolute error (MAE) to select a final model.
- Select the simplest model for which the metric is within one standard error of the best metric.
- Use a sequence of 100 \(\lambda\) values from 0 to 10.
Before running the code, enter install.packages("glmnet") in the Console.
We’ll explore output from lasso_mod in the next exercises.
# Fit LASSO models for a grid of lambda values
set.seed(74)
lasso_mod <- train(

)
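If you want to check your adaptation afterwards, here is one way the blanks might be filled in under the parameters listed above (a sketch under those stated settings, not the only valid answer):
# Fit LASSO models for a grid of lambda values (one possible adaptation)
set.seed(74)
lasso_mod <- train(
    BodyFat ~ .,
    data = bodyfat,
    method = "glmnet",
    tuneGrid = data.frame(alpha = 1, lambda = seq(0, 10, length.out = 100)),
    trControl = trainControl(method = "cv", number = 10, selectionFunction = "oneSE"),
    metric = "MAE",
    na.action = na.omit
)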
Exercise 3: Examining output: plot of coefficient paths
A useful first plot allows us to examine coefficient paths resulting from the fitted LASSO models: coefficient estimates as a function of \(\lambda\).
# Plot coefficient paths as a function of lambda
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
# Codebook for which variables the numbers correspond to
rownames(lasso_mod$finalModel$beta)
# e.g., What are variables 2 and 4?
rownames(lasso_mod$finalModel$beta)[c(2,4)]
There’s a lot of information in this plot!
- Each colored line corresponds to a different predictor. The small number to the left of each line indicates a predictor by its position in rownames(lasso_mod$finalModel$beta).
- The x-axis reflects the range of different \(\lambda\) values (on the log scale) considered in lasso_mod (in tuneGrid).
- At each \(\lambda\), the y-axis reflects the coefficient estimates for the predictors in the corresponding LASSO model.
- At each \(\lambda\), the numbers at the top of the plot indicate how many predictors remain in the corresponding model.
Very roughly eyeball the coefficient estimates at the smallest value of \(\lambda\). Do they look like they correspond to the coefficient estimates from ordinary least squares in Exercise 1?
Why do all of the lines head toward y = 0 on the far right of the plot?
We can zoom in on the plot by setting the y-axis limits to go from -0.5 to 1 with ylim, as below. Compare the lines for variables 6 and 12. What are variables 6 and 12? Which seems to be a more “important” or “persistent” (persistently present in the model) variable? Does this make sense in context?
# Zoom in
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20), ylim = c(-0.5, 1))
Which predictor seems least “persistent”? In general, how might we use these coefficient paths to measure the predictive importance of our predictors?
Note: If you’re curious about code to automate this visual inspection of variable importance, look at Digging Deeper Exercise 2.
Exercise 4: Tuning \(\lambda\)
In order to pick which \(\lambda\) (hence LASSO model) is “best”, we can plot CV error estimates for the different models:
# Plot a summary of the performance of the different models
plot(lasso_mod)
Inspect the shape of the plot. The errors go down at the very beginning then start going back up. Based on this, what are the consequences of picking a \(\lambda\) that is too small or too large? (This is an example of a very important idea that we’ll see shortly: the bias-variance tradeoff.)
Based on visual inspection, roughly what value of \(\lambda\) results in the best model? (Remind yourself of the interpretation of “best” as defined by our selectionFunction of "oneSE".)
Identify the “best” \(\lambda\).
# Identify which tuning parameter (lambda) is "best"
lasso_mod$bestTune
Note: If you’re curious about making plots that show both test error estimates and their uncertainty, look at Digging Deeper Exercise 1.
Exercise 5: Examining and evaluating the best LASSO model
Take a look at the predictors and coefficients for the “best” LASSO model. Are the predictors that remain in the model sensible? Do the coefficient signs make sense?
# Obtain the predictors and coefficients of the "best" model
# The .'s indicate that the coefficient is 0
coef(lasso_mod$finalModel, lasso_mod$bestTune$lambda)

# Obtain the predictors and coefficients of LASSO model w/ different lambda
# In case it's of interest to look at a slightly different model
# e.g., lambda = 1 (roughly)
coef(lasso_mod$finalModel, 1)
Evaluate the best LASSO model:
- Contextually interpret (with units) the CV error for the model by inspecting lasso_mod$resample or lasso_mod$results.
- Make residual plots for the model by creating a dataset called lasso_mod_out which contains the original data as well as predicted values and residuals (fitted and resid). A sketch of one residual plot appears after the code below.
lasso_mod_out <- bodyfat %>%
    mutate(
        fitted = predict(lasso_mod, newdata = bodyfat),
        resid = BodyFat - fitted
    )
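For the residual plots themselves, a minimal ggplot2 sketch (residuals versus fitted values; which residual plots you make is up to you):
# A possible residual plot: residuals vs. fitted values
ggplot(lasso_mod_out, aes(x = fitted, y = resid)) +
    geom_point() +
    geom_hline(yintercept = 0, color = "red") +
    labs(x = "Fitted values", y = "Residuals")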
Digging deeper
These exercises are recommended for further exploring code useful in an applied analysis.
1. plot(lasso_mod) only shows estimates of test error but doesn’t show the uncertainty in those estimates. Use lasso_mod$results to plot both MAE and its standard deviation (MAESD). Using this ggplot2 cheat sheet might help. (A possible starting point appears after the code at the end of this section.)
2. In Exercise 3, we used the plot of coefficient paths to evaluate the variable importance of our predictors. The code below does this systematically for each predictor so that we don’t have to eyeball it. Step through and work out what each part is doing. It may help to look up function documentation with ?function_name in the Console.
# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- lasso_mod$finalModel$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
    # Extract coefficient path (sorted from highest to lowest lambda)
    this_coeff_path <- bool_predictor_exclude[row, ]
    # Compute and return the # of lambdas until this variable is out forever
    ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
    var_name = rownames(bool_predictor_exclude),
    var_imp = var_imp
) %>%
    arrange(desc(var_imp))

var_imp_data
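For the first Digging Deeper exercise, one possible starting point (geom_pointrange is just one of several reasonable ways to show the uncertainty):
# Plot CV-estimated MAE with +/- 1 standard deviation bars across lambda values
ggplot(lasso_mod$results, aes(x = lambda, y = MAE)) +
    geom_pointrange(aes(ymin = MAE - MAESD, ymax = MAE + MAESD)) +
    labs(x = "lambda", y = "CV MAE (+/- 1 SD)")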