Topic 6 Shrinkage/Regularization
6.1 Discussion
Exploring LASSO by hand
I took a subset of the Hitters dataset to focus on only the Walks and Assists variables. I wrote a penalized_rss() function to compute the penalized sum of squared residuals given a value for the lambda penalty and guesses for beta1 (the coefficient for Walks) and beta2 (the coefficient for Assists). In symbols, it computes \(\text{RSS} + \lambda(|\beta_1| + |\beta_2|)\), the criterion that LASSO minimizes (with no intercept term here).
penalized_rss <- function(lambda, beta1, beta2) {
# Predict salary using beta1, beta2 and predictor info
pred <- hitters_subs$Walks*beta1 + hitters_subs$Assists*beta2
# Compute residuals
resid <- hitters_subs$Salary - pred
# Compute RSS
rss <- sum(resid^2, na.rm = TRUE)
# Compute penalized RSS
prss <- rss + lambda*(abs(beta1) + abs(beta2))
prss
}
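As a quick sanity check, a single call looks like this (this particular combination of inputs also appears in the tables below):

# Evaluate the penalized RSS for one guess: lambda = 100, beta1 = 3.2, beta2 = 4.4
penalized_rss(lambda = 100, beta1 = 3.2, beta2 = 4.4)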
I wanted to compute the penalized RSS for many different combinations of lambda, beta1, and beta2. Here are some of those combinations:
## lambda beta1 beta2
## 1e+03 1.4 7.2
## 1e+02 3.2 4.4
## 0e+00 2.0 2.0
## 1e+05 1.0 7.8
## 1e+03 1.6 5.2
## 1e+02 2.8 0.4
I then computed the penalized RSS for these combinations using the penalized_rss() function:
## lambda beta1 beta2 pen_rss
## 1e+03 1.4 7.2 53150907
## 1e+02 3.2 4.4 52964196
## 0e+00 2.0 2.0 53098255
## 1e+05 1.0 7.8 54062874
## 1e+03 1.6 5.2 53133494
## 1e+02 2.8 0.4 53024238
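If you'd like to reproduce this kind of table yourself, here is a minimal sketch of one way to evaluate penalized_rss() over a grid of candidate values. The combos name and the particular grid ranges are illustrative, not the exact ones used above.

# Build a grid of candidate values (illustrative ranges)
combos <- expand.grid(
    lambda = c(0, 1e2, 1e3, 1e5),
    beta1 = seq(0, 10, by = 0.2),
    beta2 = seq(0, 10, by = 0.2)
)

# Evaluate the penalized RSS at every combination
combos$pen_rss <- mapply(penalized_rss, combos$lambda, combos$beta1, combos$beta2)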
For each lambda value that I tried, I manually fit a LASSO model by finding the beta1 and beta2 values that minimized the penalized RSS. These are shown here:
## # A tibble: 6 x 4
## lambda beta1 beta2 pen_rss
## <dbl> <dbl> <dbl> <dbl>
## 1 0 10 9.8 52264853.
## 2 100 10 9.6 52266813.
## 3 1000 10 8 52283711.
## 4 10000 10 0 52392761.
## 5 100000 10 0 53292761.
## 6 1000000 0 0 53319113.
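The “manual fit” above amounts to a grid search: for each lambda, keep the (beta1, beta2) pair with the smallest penalized RSS. Here is a sketch using dplyr, assuming the combos data frame from the previous sketch:

library(dplyr)

# For each lambda, retain the (beta1, beta2) combination minimizing pen_rss
combos %>%
    group_by(lambda) %>%
    slice_min(pen_rss, n = 1) %>%
    ungroup()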
Thought exercise: How do the results above demonstrate the shrinkage property of LASSO?
6.2 Exercises
You can download a template RMarkdown file to start from here.
We’ll explore LASSO modeling using the Hitters dataset in the ISLR package (associated with the optional textbook). You’ll need to install the ISLR package in the Console first. You should also install the glmnet package, as we’ll be using it subsequently for fitting LASSO models.
install.packages(c("ISLR", "glmnet"))
# Load the data
library(ISLR)
data(Hitters)
# Examine the data codebook
?Hitters
The Hitters dataset contains a number of stats on major league baseball players in 1987. Our goal will be to build a regression model that predicts player Salary.
- Get to know the Hitters data (a starter sketch follows this list):
  - Peek at the first few rows.
  - How many players are in the dataset?
  - How many possible predictors of salary are there?
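A few functions that can help with these questions (one approach among several):

head(Hitters)      # peek at the first few rows
dim(Hitters)       # number of rows (players) and columns
ncol(Hitters) - 1  # every column except Salary is a possible predictor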
Developing some intuition
A natural model to start with is one with all possible predictors. The following model is fit with ordinary (not penalized) least squares:

least_squares_mod <- lm(Salary ~ ., data = Hitters)
coefficients(least_squares_mod)
- Use caret to perform 7-fold cross-validation to estimate the test error of this model. Use the straight average of the RMSE column instead of squaring the values first. (Why 7? Think about the number of cases in the folds.) A starter sketch follows this list.
- How do you think the estimated test error would change with fewer predictors?
- Briefly describe how the output of a stepwise selection procedure could help you choose a smaller model (a model with fewer predictors).
- This model fit with ordinary least squares corresponds to a special case of penalized least squares. What is the value of \(\lambda\) in this special case?
- As \(\lambda\) increases, what would you expect to happen to the number of predictors that remain in the model?
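For the cross-validation question above, here is a sketch of how the setup might look with caret (the object name least_squares_cv is just illustrative); the per-fold errors are stored in the resample element of the fitted object:

library(caret)

set.seed(74)
least_squares_cv <- train(
    Salary ~ ., data = Hitters,
    method = "lm",
    trControl = trainControl(method = "cv", number = 7),
    metric = "RMSE",
    na.action = na.omit
)

# Straight average of the 7 fold-level RMSEs
mean(least_squares_cv$resample$RMSE)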
LASSO for a specific \(\lambda\)
The code below fits a LASSO model with \(\lambda = 10\). This value of \(\lambda\) is specified in the tuneGrid argument. The alpha = 1 specifies the LASSO method specifically (the glmnet method has other purposes).

set.seed(74)
lasso_mod_lambda10 <- train(
    Salary ~ ., data = Hitters,
    method = "glmnet",
    trControl = trainControl(method = "cv", number = 7),
    tuneGrid = data.frame(alpha = 1, lambda = 10),
    metric = "RMSE",
    na.action = na.omit
)

# Model coefficients for lambda = 10
# The .'s indicate that the coefficient is 0
coefficients(lasso_mod_lambda10$finalModel, 10)
- How many variables remain in the LASSO model with \(\lambda=10\)? How do their coefficients compare to the corresponding variables in the least squares model?
Fit the LASSO using \(\lambda=100\):

set.seed(74)
lasso_mod_lambda100 <- train(
    Salary ~ ., data = Hitters,
    method = "glmnet",
    trControl = trainControl(method = "cv", number = 7),
    tuneGrid = data.frame(alpha = 1, lambda = 100),
    metric = "RMSE",
    na.action = na.omit
)

# Model coefficients for lambda = 100
coefficients(lasso_mod_lambda100$finalModel, 100)
How many variables remain in the LASSO model with \(\lambda=100\)? Is this model “bigger” or “smaller” than the LASSO with \(\lambda=10\)? How do the variables’ coefficients compare to the corresponding variables in the least squares model and the LASSO with \(\lambda=10\)?
LASSO for a variety of \(\lambda\)
There are infinitely many \(\lambda\) values we could use. It would be too tedious to examine them one at a time. The following code fits LASSO models across a grid of \(\lambda\) values and makes a summary plot of the coefficient estimates as a function of \(\lambda\). There’s a lot of information in this plot!

# Create a grid of lambda values
lambdas <- 10^seq(-3, 3, length.out = 100)

# Fit LASSO models for all of the lambdas
set.seed(74)
lasso_mod <- train(
    Salary ~ ., data = Hitters,
    method = "glmnet",
    trControl = trainControl(method = "cv", number = 7),
    tuneGrid = data.frame(alpha = 1, lambda = lambdas),
    metric = "RMSE",
    na.action = na.omit
)

# Plot summary of coefficient estimates
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))

# What variables do the numbers correspond to?
rownames(lasso_mod$finalModel$beta)
- Each colored line corresponds to a different predictor. The small number to the left of each line indicates a predictor by its position in rownames(lasso_mod$finalModel$beta).
- The x-axis reflects the range of different \(\lambda\) values considered in lasso_mod (the lambdas vector that we created).
- At each \(\lambda\), the y-axis reflects the coefficient estimates for the predictors in the corresponding LASSO model.
- At each \(\lambda\), the numbers at the top of the plot indicate how many predictors remain in the corresponding model.
- Very roughly eyeball the coefficient estimates when \(\log(\lambda) = -2\). Do they look like they correspond to the coefficient estimates from ordinary least squares in exercise 2?
- Why do all of the lines head toward y = 0 on the far right of the plot?
We can zoom in on the plot by setting the y-axis limits to go from -10 to 10 with ylim, as below. Compare the lines for variables 6 and 15. What are variables 6 and 15? Which seems to be a more “important” or “persistent” variable? Does this make sense in context?

# Zoom in
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20), ylim = c(-10, 10))

# What are variables 6 and 15?
rownames(lasso_mod$finalModel$beta)[c(6, 15)]
Picking \(\lambda\)
In order to pick which \(\lambda\) (and hence which LASSO model) is “best”, we can compare the 7-fold CV error rate for each model. caret has actually done that for us when it train()ed the model. We can look at a plot of those results. The figure below plots cross-validation estimates of the RMSE (y-axis) versus the value of \(\lambda\) (the regularization parameter).

# Plot a summary of the performance of the different models
plot(lasso_mod)
- Comment on the shape of the plot. The RMSEs go down at the very beginning and then start going back up. Why do you think that is?
- Roughly, what value of \(\lambda\) results in the best model?
This plot indicates that we tried many \(\lambda\) values that were pretty bad. (Why?) Let’s fit LASSO models over a better grid of \(\lambda\) values. Modify the previous code to use the following grid, and remake lasso_mod and the previous plot (one possible modification is sketched below):

lambdas <- seq(0, 50, length.out = 100)
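One way to carry out that modification, mirroring the earlier train() call with only the tuning grid changed:

set.seed(74)
lasso_mod <- train(
    Salary ~ ., data = Hitters,
    method = "glmnet",
    trControl = trainControl(method = "cv", number = 7),
    tuneGrid = data.frame(alpha = 1, lambda = lambdas),
    metric = "RMSE",
    na.action = na.omit
)

# Remake the coefficient plot and the RMSE summary plot
plot(lasso_mod$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
plot(lasso_mod)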
Picking \(\lambda\): accounting for uncertainty
Each of the points on the previous plot arose from taking the mean RMSE over 7 cross-validation iterations. Those 7 RMSE estimates have a standard deviation and standard error too. You can use the custom best_lambdas() function below to make a plot of estimated test RMSE versus \(\lambda\) that also shows information about the standard errors.

In particular, the plot shows points that exactly correspond to the previous plot. The additional lines show 1 standard error above and below the RMSE estimate. In essence, the span of the lines indicates a confidence interval.

The best_lambdas() function also prints information about some reasonable choices for good \(\lambda\) values.

library(ggplot2)  # needed for the plot inside best_lambdas()

best_lambdas <- function(model) {
    # Extract the results table
    res <- model$results
    # Extract the K in K-fold CV
    k <- model$control$number
    # Compute the standard error (SE) of the RMSE estimate
    res$rmse_se <- res$RMSESD/sqrt(k)
    # Which lambda resulted in the lowest RMSE?
    index_lambda_min <- which.min(res$RMSE)
    lambda_min <- res$lambda[index_lambda_min]
    # Compute 1 SE below and above each RMSE estimate
    res$rmse_lower <- res$RMSE - res$rmse_se
    res$rmse_upper <- res$RMSE + res$rmse_se
    # 1 SE below and above the minimum RMSE
    rmse_lower <- res$RMSE[index_lambda_min] - res$rmse_se[index_lambda_min]
    rmse_upper <- res$RMSE[index_lambda_min] + res$rmse_se[index_lambda_min]
    # Flag lambdas whose RMSE is within 1 SE of the minimum,
    # and take the largest such lambda
    res$within_1se <- res$RMSE >= rmse_lower & res$RMSE <= rmse_upper
    index_lambda_1se <- max(which(res$within_1se))
    lambda_1se <- res$lambda[index_lambda_1se]
    # Plot RMSE estimates with +/- 1 SE bars
    p <- ggplot(res, aes(x = lambda, y = RMSE)) +
        geom_pointrange(aes(ymin = rmse_lower, ymax = rmse_upper))
    print(p)
    # Return the two candidate lambdas
    output <- res[c(index_lambda_min, index_lambda_1se), c("lambda", "RMSE")]
    rownames(output) <- c("lambda_min", "lambda_1se")
    output
}

lambda_choices <- best_lambdas(lasso_mod)
lambda_choices
- The first row of printed output shows a choice for \(\lambda\) called lambda_min, the \(\lambda\) at which the observed CV error was smallest. The second row shows a choice called lambda_1se, the largest \(\lambda\) for which the corresponding LASSO model has a CV error that’s still within 1 standard error of that for the LASSO using lambda_min. Explain why we might use the LASSO with lambda_1se instead of lambda_min.
- How does the CV-estimated RMSE of these models compare to that of the original ordinary least squares model in exercise 2?
Look at the coefficients of the LASSO models corresponding to both choices of \(\lambda\). How do the coefficients differ between lambda_min and lambda_1se? Do one model’s coefficients seem more sensible contextually? The instructor does not have a deep enough understanding of baseball, but you might!

# Coefficients for the lambda_min LASSO model
coefficients(lasso_mod$finalModel, lambda_choices["lambda_min", "lambda"])

# Coefficients for the lambda_1se LASSO model
coefficients(lasso_mod$finalModel, lambda_choices["lambda_1se", "lambda"])