Topic 6 LASSO: Shrinkage/Regularization

Learning Goals

  • Explain how ordinary and penalized least squares are similar and different with regard to (1) the form of the objective function and (2) the goal of variable selection
  • Explain why variable scaling is important for the performance of shrinkage methods
  • Explain how the lambda tuning parameter affects model performance and how this is related to overfitting
  • Describe how output from LASSO models can give a measure of variable importance
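For reference, here is a standard way to write the two objective functions compared in the first goal (using predictions \(\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\)). Ordinary least squares chooses the coefficients that minimize

\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

while the LASSO (penalized least squares) minimizes

\[ \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j| \]

so larger values of \(\lambda\) shrink more coefficients toward (and eventually exactly to) zero.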


Slides from today are available here.




LASSO models in tidymodels

To build LASSO models in tidymodels, first load the package and set the seed for the random number generator to ensure reproducible results:

library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels) 
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions
set.seed(___) # Pick your favorite number to fill in the parentheses

Then adapt the following code to fit a LASSO model to a data set using several lambda values:

# Lasso Model Spec
lm_lasso_spec_tune <- 
    linear_reg() %>%
    set_args(mixture = 1, penalty = tune()) %>% # mixture = 1 indicates LASSO
    set_engine(engine = "glmnet") %>% # note different engine than "lm"
    set_mode("regression") 

# Recipe with standardization (!)
data_rec <- recipe(___ ~ ___ , data = ___) %>% # formula: outcome ~ predictors (generally predictors is . for all predictors)
    step_nzv(all_predictors()) %>% # removes predictors with near-zero variance (almost all the same value)
    step_novel(all_nominal_predictors()) %>% # important if you have rare categorical variables 
    step_normalize(all_numeric_predictors()) %>%  # important standardization step for LASSO
    step_dummy(all_nominal_predictors()) # creates indicator variables for categorical variables

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>% 
    add_recipe(data_rec) %>%
    add_model(lm_lasso_spec_tune)

# Create CV folds
data_cv10 <- vfold_cv(___, v = 10)

# Tune Model (trying a variety of values of lambda penalty)
penalty_grid <- grid_regular(
    penalty(range = c(-5, 3)), # log10 transformed 10^-5 to 10^3
    levels = 30
)

lasso_tune_out <- tune_grid( # new function for tuning parameters
    lasso_wf_tune, # workflow
    resamples = data_cv10, # cv folds
    metrics = metric_set(rmse, mae),
    grid = penalty_grid # penalty grid defined above
)

Examining the LASSO model for each \(\lambda\)

The glmnet engine fits models for a whole sequence of \(\lambda\) values automatically, so once we have a fitted LASSO workflow (lasso_fit below), we can visualize the coefficient estimates at each penalty value.
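One way to obtain such a fit is sketched below (assuming the lasso_wf_tune workflow from above; any single penalty value works here because glmnet computes the entire path internally):

# Fit the LASSO workflow once; glmnet fits the full lambda path internally
lasso_fit <- lasso_wf_tune %>%
    finalize_workflow(tibble(penalty = 1)) %>% # plug in any single penalty value
    fit(data = ___) # fill in your data set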

plot(lasso_fit %>% extract_fit_parsnip() %>% pluck('fit'), # way to get the original glmnet output
     xvar = "lambda") # glmnet fits the model with a variety of lambda penalty values

Identifying the “best” LASSO model

To identify the best model, we need to tune the model using cross-validation. Adapt the following code to tune a LASSO model and choose \(\lambda\):

# Visualize model evaluation metrics from tuning
autoplot(lasso_tune_out) + theme_classic()

# Summarize model evaluation metrics
collect_metrics(lasso_tune_out) %>%
    filter(.metric == "mae") %>% # or choose rmse
    select(penalty, mae = mean) # rename to match the metric you filtered on

# Choose penalty value based on lowest mae (or choose rmse)
best_penalty <- select_best(lasso_tune_out, metric = "mae")

# Choose largest penalty value within 1 SE of the lowest CV MAE
best_se_penalty <- select_by_one_std_err(lasso_tune_out, metric = "mae", desc(penalty))

# Fit final model
final_wf <- finalize_workflow(lasso_wf_tune, best_penalty) # incorporates penalty value to workflow (or replace best_penalty with best_se_penalty)

final_fit <- fit(final_wf, data = ___)

tidy(final_fit)
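Since LASSO sets some coefficients exactly to zero, a quick follow-up sketch (using final_fit from above) shows which predictors remain in the chosen model:

# Predictors with nonzero coefficients in the final model
tidy(final_fit) %>%
    filter(estimate != 0)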




Exercises

You can download a template RMarkdown file to start from here.

We’ll use a new data set to explore LASSO modeling. This data comes from the US Department of Energy. You will predict the fuel efficiency of modern cars from characteristics of these cars, like transmission and engine displacement. Fuel efficiency is a numeric value that ranges smoothly from about 15 to 40 miles per gallon.

library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels) 
tidymodels_prefer() # Resolves conflicts, prefers tidymodel functions

set.seed(123)

cars2018 <- read_csv("https://raw.githubusercontent.com/juliasilge/supervised-ML-case-studies-course/master/data/cars2018.csv")

head(cars2018)

# Cleaning
cars2018 <- cars2018 %>%
    select(-model_index)

Exercise 1: A least squares model

Let’s start by building an ordinary (not penalized) least squares model to review important concepts. We’ll fit a model to predict fuel efficiency measured in miles per gallon (mpg) with all possible predictors.

lm_spec <-
    linear_reg() %>% 
    set_engine(engine = "lm") %>% 
    set_mode("regression")

full_rec <- recipe(mpg ~ ., data = cars2018) %>%
    update_role(model, new_role = "ID") %>% # we want to keep the name of the car model but not as a predictor or outcome
    step_nzv(all_predictors()) %>% # removes predictors with near-zero variance (almost all the same value)
    step_normalize(all_numeric_predictors()) %>% # important standardization step for LASSO
    step_dummy(all_nominal_predictors())  # creates indicator variables for categorical variables

full_lm_wf <- workflow() %>%
    add_recipe(full_rec) %>%
    add_model(lm_spec)
    
full_model <- fit(full_lm_wf, data = cars2018) 

full_model %>% tidy()
  1. Use tidymodels to perform 10-fold cross-validation to estimate the test MAE for this model. (The code below comes from our earlier body fat modeling. Adapt it for our context here.)
# Do we need to use set.seed()?

bodyfat_cv <- vfold_cv(bodyfat_train, v = 10)

mod1_cv <- fit_resamples(model_wf,
    resamples = bodyfat_cv, 
    metrics = metric_set(rmse, mae)
)
  2. How do you think the estimated test error would change with fewer predictors?

  3. This model fit with ordinary least squares corresponds to a special case of penalized least squares. What is the value of \(\lambda\) in this special case?

  4. As \(\lambda\) increases, what would you expect to happen to the number of predictors that remain in the model?

  5. Notice that our data preprocessing recipe (full_rec) contained a step to normalize all numeric predictors (step_normalize(all_numeric_predictors())).

    • Why is this an important step for LASSO?
    • Do you think that this is an important step for ordinary linear regression? (Hint: think about two models for body weight: one with height in inches as a predictor and one with height in feet as a predictor. Do the predictions from these two models differ? A small sketch to experiment with follows below.)
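If it helps, here is a tiny illustrative sketch (with made-up numbers, not from our data) comparing an lm fit that uses height in inches to the same fit using height in feet:

# Illustrative only: rescaling a predictor changes its coefficient,
# but does it change the fitted values?
toy <- tibble(
    weight = c(150, 160, 180, 200, 210),
    height_in = c(62, 65, 68, 71, 74)
) %>%
    mutate(height_ft = height_in / 12)

mod_in <- lm(weight ~ height_in, data = toy)
mod_ft <- lm(weight ~ height_ft, data = toy)

coef(mod_in) # slope per inch
coef(mod_ft) # slope per foot
all.equal(fitted(mod_in), fitted(mod_ft)) # are the predictions the same?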



Exercise 2: Fitting a LASSO model in tidymodels

  1. The code below (and in part 4) fits a set of LASSO models with the following parameters:
  • Use 10-fold CV.
  • Use mean absolute error (MAE) to select a final model.
  • Select the simplest model for which the metric is within one standard error of the best metric.
  • Use a sequence of 30 \(\lambda\) values from 0.001 to 10.

Before running the code, run install.packages("glmnet") in the Console.

set.seed(74)

# Create CV folds
data_cv10 <- vfold_cv(cars2018, v = 10)

# LASSO model specification where the `penalty` parameter needs to be tuned
lm_lasso_spec_tune <- 
    linear_reg() %>%
    set_args(mixture = 1, penalty = tune()) %>% ## mixture = 1 indicates Lasso
    set_engine(engine = "glmnet") %>% # note we are using a different engine
    set_mode("regression") 

# Workflow (Recipe + Model)
lasso_wf_tune <- workflow() %>% 
    add_recipe(full_rec) %>% # recipe defined above
    add_model(lm_lasso_spec_tune) 

# Tuning the model (trying a variety of values of Lambda penalty)
penalty_grid <- grid_regular(
    penalty(range = c(-3, 1)), # log10 transformed: 10^-3 to 10^1 (0.001 to 10)
    levels = 30)

tune_output <- tune_grid( # new function for tuning parameters
    lasso_wf_tune, # workflow
    resamples = data_cv10, # CV folds
    metrics = metric_set(rmse, mae),
    grid = penalty_grid # penalty grid defined above
)
  2. Let’s visualize the model evaluation metrics from tuning. We can use autoplot().
# Visualize Model Evaluation Metrics from Tuning
autoplot(tune_output) + theme_classic()

# Zoomed-in version to focus on curvature of MAE plot
autoplot(tune_output) + theme_classic() + coord_cartesian(ylim = c(1.96, 2.01))
  3. Inspect the shape of the plot. The errors go down at the very beginning, then start going back up. Based on this, what are the consequences of picking a \(\lambda\) that is too small or too large? (This is an example of a very important idea that we’ll see shortly: the bias-variance tradeoff.)

  4. Next, we need to choose the lambda that leads to the best model. We can choose the lambda penalty value that leads to the lowest CV MAE, or we can take into account the variation of the CV MAE and choose the largest lambda penalty value that is within 1 standard error of the lowest CV MAE. How might the models that result from these two penalties differ?

best_penalty <- select_best(tune_output, metric = "mae") # choose penalty value based on lowest CV MAE
best_penalty

best_se_penalty <- select_by_one_std_err(tune_output, metric = "mae", desc(penalty)) # choose largest penalty value within 1 se of the lowest CV MAE
best_se_penalty
  5. Now check your understanding by fitting both “final” models and comparing the coefficients. How are these two models different?
# Fit Final Model
final_wf <- finalize_workflow(lasso_wf_tune, best_penalty) # incorporates penalty value to workflow
final_wf_se <- finalize_workflow(lasso_wf_tune, best_se_penalty) # incorporates penalty value to workflow

final_fit <- fit(final_wf, data = cars2018)
final_fit_se <- fit(final_wf_se, data = cars2018)

tidy(final_fit)
tidy(final_fit_se)

Going forward, we’ll examine output from the model chosen by select_by_one_std_err() (final_fit_se).



Exercise 3: Examining output: plot of coefficient paths

A useful plot allows us to examine coefficient paths resulting from the final fitted LASSO models: coefficient estimates as a function of \(\lambda\).

glmnet_output <- final_fit_se %>% extract_fit_parsnip() %>% pluck("fit") # get the original glmnet output

# Plot coefficient paths as a function of lambda
plot(glmnet_output, xvar = "lambda", label = TRUE, col = rainbow(20))

# Codebook for which variables the numbers correspond to
rownames(glmnet_output$beta)

# e.g., What are variables 2 and 4?
rownames(glmnet_output$beta)[c(2,4)]

There’s a lot of information in this plot!

  • Each colored line corresponds to a different predictor. (Note that categorical variables have been split into different predictors via indicator variable creation.)
  • The small number to the left of each line indicates a predictor by its position in rownames(glmnet_output$beta).
  • The x-axis reflects the range of different \(\lambda\) values (on the log-scale).
  • At each \(\lambda\), the y-axis reflects the coefficient estimates for the predictors in the corresponding LASSO model.
  • At each \(\lambda\), the numbers at the top of the plot indicate how many predictors remain in the corresponding model.
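If you would rather read off exact values than eyeball the plot, the underlying glmnet object can report the coefficient estimates at any single penalty value (a sketch, assuming best_se_penalty from Exercise 2):

# Coefficient estimates at one particular lambda value
coef(glmnet_output, s = best_se_penalty$penalty)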


  1. Consider the coefficient estimates at the smallest value of \(\lambda\). How closely should they correspond to the coefficient estimates from ordinary least squares in exercise 1?

  2. Why do all of the lines head toward y = 0 on the far right of the plot?

  3. Which variables seem to be more “important” or “persistent” (persistently present in the model)? Does this make sense in context?

  4. In general, how might we use these “coefficient paths” to measure the relative importance of our predictors?

Note: If you’re curious about code to automate this visual inspection of variable importance, look at the Digging Deeper exercise at the end.



Exercise 4: Examining and evaluating the best LASSO model

  1. Take a look at the predictors and coefficients for the “best” LASSO model. Are the predictors that remain in the model sensible? Do the coefficient signs make sense?
# Obtain the predictors and coefficients of the "best" model
# Filter out the coefficients that are 0
final_fit_se %>% tidy() %>% filter(estimate != 0)
  2. Evaluate the best LASSO model:
    • Contextually interpret (with units) the CV MAE for the best model.
    • Make residual plots for the model by creating a dataset called lasso_mod_out which contains the original data as well as predicted values and residuals (.pred and resid). (A plotting sketch follows the code below.)
# Evaluation metrics
tune_output %>%
    collect_metrics() %>%
    filter(penalty == (best_se_penalty %>% pull(penalty)))

# Residual plots
lasso_mod_out <- final_fit_se %>%
    predict(new_data = cars2018) %>%
    bind_cols(cars2018) %>%
    mutate(resid = mpg - .pred)
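For the plots themselves, one possible sketch (residuals versus predicted values; adapt the aesthetics as you see fit):

# Residuals vs. predicted values
ggplot(lasso_mod_out, aes(x = .pred, y = resid)) +
    geom_point() +
    geom_smooth(se = FALSE) +
    geom_hline(yintercept = 0, color = "red") +
    theme_classic()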



Digging deeper

These exercises are recommended for further exploring code useful in an applied analysis.

We used the plot of coefficient paths to evaluate the variable importance of our predictors. The code below does this systematically for each predictor so that we don’t have to eyeball it. Step through the code and work out what each part is doing. It may help to look up function documentation with ?function_name in the Console.

# Create a boolean matrix (predictors x lambdas) of variable exclusion
bool_predictor_exclude <- glmnet_output$beta == 0

# Loop over each variable
var_imp <- sapply(seq_len(nrow(bool_predictor_exclude)), function(row) {
    this_coeff_path <- bool_predictor_exclude[row, ]
    if (sum(this_coeff_path) == ncol(bool_predictor_exclude)) {
        return(0)
    } else {
        return(ncol(bool_predictor_exclude) - which.min(this_coeff_path) + 1)
    }
})

# Create a dataset of this information and sort
var_imp_data <- tibble(
    var_name = rownames(bool_predictor_exclude),
    var_imp = var_imp
)
var_imp_data %>% arrange(desc(var_imp))

If you want more practice, the Hitters data in the ISLR package (be sure to install and load it) contains the salaries and performance measures for 322 Major League Baseball players. Use LASSO to determine the “best” predictive model of Salary.
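A possible starting point (a sketch; it assumes the ISLR package is installed and drops players whose Salary is missing):

library(ISLR)
hitters <- Hitters %>%
    filter(!is.na(Salary)) # Salary is missing for some players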