Topic 7 KNN Regression and the Bias-Variance Tradeoff
Learning Goals
- Clearly describe / implement by hand the KNN algorithm for making a regression prediction
- Explain how the number of neighbors relates to the bias-variance tradeoff
- Explain the difference between parametric and nonparametric methods
- Explain how the curse of dimensionality relates to the performance of KNN (see the short simulation sketch after this list)
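As a rough illustration of that last goal (a simulation sketch for intuition, not part of today's assignment), the average distance to a case's nearest neighbor grows quickly as predictors are added, which is one way to see why KNN struggles in high dimensions:
# Rough sketch: average nearest-neighbor distance vs. number of predictors p
set.seed(123)
sapply(c(1, 2, 10, 100), function(p) {
    X <- matrix(runif(100 * p), nrow = 100) # 100 cases with p predictors, each uniform on [0, 1]
    D <- as.matrix(dist(X))                 # pairwise Euclidean distances
    diag(D) <- Inf                          # ignore each case's distance to itself
    mean(apply(D, 1, min))                  # average distance to the nearest neighbor
})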
Slides from today are available here.
KNN models in tidymodels
To build KNN models in tidymodels, first load the package and set the seed for the random number generator to ensure reproducible results:
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions
set.seed(___) # Pick your favorite number to fill in the parentheses
Then adapt the following code:
# CV Folds
data_cv10 <- vfold_cv(___, v = 10)
# Model Specification
knn_spec <- nearest_neighbor() %>% # new type of model!
    set_args(neighbors = tune()) %>% # the tuning parameter is neighbors; tune() marks it for tuning
    set_engine(engine = "kknn") %>% # new engine
    set_mode("regression")
# Recipe with standardization (!)
data_rec <- recipe(___ ~ ___, data = ___) %>%
    step_nzv(all_predictors()) %>% # removes predictors that take (nearly) the same value for all cases
    step_novel(all_nominal_predictors()) %>% # important if you have rare categorical variables
    step_normalize(all_numeric_predictors()) %>% # important standardization step for KNN
    step_dummy(all_nominal_predictors()) # creates indicator variables for categorical variables (important for KNN!)
# Workflow (Recipe + Model)
knn_wf <- workflow() %>%
    add_model(knn_spec) %>%
    add_recipe(data_rec)
# Tune model trying a variety of values for neighbors (using 10-fold CV)
neighbors_grid <- grid_regular(
    neighbors(range = c(1, 50)), # min and max of values for neighbors
    levels = 50 # number of neighbors values
)
knn_fit_cv <- tune_grid(
    knn_wf, # workflow
    resamples = data_cv10, # CV folds
    grid = neighbors_grid, # grid specified above
    metrics = metric_set(rmse, mae)
)
Note: tidymodels defines neighbors as the cases that are the closest in terms of the Euclidean distance of the predictor values:
\[ d(\text{case}_i, \text{case}_j) = \sqrt{(x_{i1} - x_{j1})^2 + \cdots + (x_{ip} - x_{jp})^2} \]
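To connect this distance to the first learning goal, here is a minimal base-R sketch of a by-hand KNN regression prediction (the names train_x, train_y, and new_x are illustrative stand-ins, not part of the course code):
# Minimal by-hand KNN regression prediction (illustrative; the kknn engine handles this for us)
knn_predict <- function(train_x, train_y, new_x, k = 5) {
    dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2)) # Euclidean distance to each training case
    mean(train_y[order(dists)[1:k]]) # average the outcome over the k nearest training cases
}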
Identifying the “best” KNN model
The “best” model in the sequence of models fit is defined relative to the chosen metric and the choice of select_best() or select_by_one_std_err().
knn_fit_cv %>% autoplot() # Visualize trained model using CV

knn_fit_cv %>% show_best(metric = "mae") # Show evaluation metrics for different values of neighbors, ordered

# Choose value of tuning parameter (neighbors)
tuned_knn_wf <- knn_fit_cv %>%
    select_by_one_std_err(metric = "mae", desc(neighbors)) %>% # Choose the largest neighbors value within 1 SE of the lowest CV MAE
    finalize_workflow(knn_wf, .)
# Fit final KNN model to data
knn_fit_final <- tuned_knn_wf %>%
    fit(data = ___)
# Use the best model to make predictions
# new_data should be a data.frame with required predictors
predict(knn_fit_final, new_data = ___)
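As a usage sketch (my_data and y below are hypothetical stand-ins for whatever dataset and outcome variable you trained on), you could compare the final model's predictions to the observed outcomes; keep in mind that this training error will be optimistic relative to the CV metrics above:
# Hypothetical sketch: predictions vs. observed outcomes on the training data
my_data %>%
    mutate(.pred = predict(knn_fit_final, new_data = my_data)$.pred) %>%
    summarize(training_mae = mean(abs(y - .pred))) # optimistic compared to CV MAE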
Exercises
You can download a template RMarkdown file to start from here.
We’ll explore KNN regression using the College dataset in the ISLR2 package (install it with install.packages("ISLR2") in the Console). You can use ?College in the Console to look at the data codebook.
library(ISLR2)
library(dplyr)
library(readr)
library(broom)
library(ggplot2)
library(tidymodels)
tidymodels_prefer() # Resolves conflicts, prefers tidymodels functions
data(College)
# Data cleaning
college_clean <- College %>%
    mutate(school = rownames(College)) %>% # creates variable with school name
    filter(Grad.Rate <= 100) # Remove one school with grad rate of 118%

rownames(college_clean) <- NULL # Remove school names as row names
Exercise 1: Bias-variance tradeoff warmup
- Think back to the LASSO algorithm, which depends upon the tuning parameter \(\lambda\).
  - For which values of \(\lambda\) (small or large) will LASSO be the most biased, and why?
  - For which values of \(\lambda\) (small or large) will LASSO be the most variable, and why?
- The bias-variance tradeoff also comes into play when comparing across algorithms, not just within algorithms. Consider LASSO vs. least squares:
  - Which will tend to be more biased?
  - Which will tend to be more variable?
  - When will LASSO outperform least squares in the bias-variance tradeoff?
Exercise 2: Impact of variable scale and distance measure
Consider the 1-nearest neighbor algorithm to predict Grad.Rate on the basis of two predictors: Apps and Private. Let Yes for Private be represented with the value 1 and No with 0.
- We have a test case: a private school with 13,530 applications. Suppose that we have the tiny 2-case training set below. What would the 1-nearest neighbor prediction be using Euclidean distance?
college_clean %>%
    filter(school %in% c("Princeton University", "SUNY at Albany")) %>%
    select(Apps, Private, Grad.Rate, school)

sqrt((13530 - ?)^2 + (1 - ?)^2) # Euclidean distance between test case and Princeton
sqrt((13530 - ?)^2 + (1 - ?)^2) # Euclidean distance between test case and SUNY
- Do you have any concerns about the resulting prediction? Based on this, comment on the impact of variable scaling and the distance measure on KNN performance. How might you change the distance calculation (or correspondingly rescale the data) to generate a more sensible prediction in this situation?
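If you want to experiment after answering, here is one hedged sketch of rescaling before computing distances; the tiny_train name and the 0/1 recoding of Private are illustrative choices, and the ? blanks above are still yours to fill in:
# Illustrative sketch: standardize Apps before computing distances
tiny_train <- college_clean %>%
    filter(school %in% c("Princeton University", "SUNY at Albany")) %>%
    mutate(PrivateNum = if_else(Private == "Yes", 1, 0),
           AppsScaled = (Apps - mean(Apps)) / sd(Apps)) # center and scale within this tiny training set

test_apps_scaled <- (13530 - mean(tiny_train$Apps)) / sd(tiny_train$Apps) # apply the same rescaling to the test case

sqrt((test_apps_scaled - tiny_train$AppsScaled)^2 + (1 - tiny_train$PrivateNum)^2) # distances on the standardized scale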
Exercise 3: Implementing KNN in tidymodels
Before continuing, install the kknn package by entering install.packages("kknn") in the Console.
We will write code, step by step, to “fit” a set of KNN models to predict Grad.Rate with the following specifications:
- Use the predictors Private, Top10perc (% of new students from top 10% of high school class), and S.F.Ratio (student/faculty ratio).
- Use 8-fold CV. (Why 8? Take a look at the sample size.)
- Use mean absolute error (MAE) to select a final model.
- Select the simplest model for which the metric is within one standard error of the best metric.
- Use a sequence of neighbor values from 1 to 100 in increments of 5 (20 values in total).
Step 1: Describe the model that we want to fit and how (i.e., describe its specification).
- The general model type we are using is called nearest_neighbor.
- The KNN model has a parameter called neighbors (the number of nearest neighbors used to make predictions).
- The “engine” used to build the model is called "kknn".
- We are using KNN for a regression task (quantitative outcome). (Nearly all of our methods can be used in both a regression and classification setting.)
knn_spec <- ___() %>% # Insert the name of the general model type
    set_args(___ = tune()) %>% # Insert the name of the parameter that we will tune
    set_engine(engine = ___) %>% # Insert the engine name (in quotes)
    set_mode(___) # Indicate "regression" or "classification"
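If you're unsure which engines exist for this model type, parsnip (loaded with tidymodels) has a lookup function; this is a quick aside, not required for the exercise:
show_engines("nearest_neighbor") # lists the available engines and modes for nearest_neighbor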
Step 2: Divide data into folds for cross-validation.
- Use 8-fold CV. (Why 8? Take a look at the sample size with dim(college_clean) or nrow(college_clean).)
set.seed(2023) # Why do we need this?
college_cv <- vfold_cv(___, v = ___) # Supply dataset and # of folds
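As a quick sanity check on the fold choice (worth running yourself rather than taking my word for it):
nrow(college_clean) # 8 folds evenly divide the cleaned data (it should have 776 rows, i.e., 97 schools per fold)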
Step 3: Create our data preprocessing recipe.
- We first have to specify the outcome and predictors.
  - We’re predicting Grad.Rate (graduation rate).
  - Use the predictors Private, Top10perc (% of new students from top 10% of high school class), and S.F.Ratio (student/faculty ratio).
- Include step_dummy(all_nominal_predictors()).
  - Because KNN needs to compute distances between cases, all predictors should be in numeric form. We can convert categorical variables (all_nominal_predictors()) to indicator (dummy) variables.
- Include step_normalize(all_numeric_predictors()).
  - Why is this important based on Exercise 2?
college_rec <- recipe(___ ~ ___, data = ___) %>% # Outcome, predictors, and dataset
    step_???() %>%
    step_???()
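If you're curious what the recipe actually produces, you can prep and bake it on the training data (a quick aside; the workflow will apply the recipe for you automatically):
# Peek at the preprocessed predictors the recipe creates (optional)
college_rec %>%
    prep() %>%
    bake(new_data = NULL) %>%
    head()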
Step 4: Define our analysis workflow, combining our model specification (Step 1) and our data preprocessing recipe (Step 3).
college_wf <- workflow() %>%
    add_model(___) %>% # Model specification object
    add_recipe(___) # Data preprocessing recipe object
Step 5: Set up a “grid” of tuning parameter values, and fit models for each value to find the optimal number of neighbors.
- Use 20 values between 1 and 100 for the number of neighbors.
- Compute rmse and mae in the CV iterations.
tuning_param_grid <- grid_regular(
    neighbors(range = c(___, ___)), # min and max of values for neighbors
    levels = ___ # number of neighbors values
)

knn_fit_cv <- tune_grid(
    ___, # workflow object
    resamples = ___, # CV folds object
    grid = ___, # tuning parameter grid object
    metrics = metric_set(___, ___) # evaluation metric names (no quotes)
)
After adapting the code (but before inspecting any output, which will happen in the next exercise), answer the following conceptual questions:
- Explain your choice for your recipe.
- Does KNN actually “fit” a model as part of training? (This feature of KNN is known as “lazy learning.”)
- How is test MAE estimated? (The MAE formula is shown after this list for reference.) What are the steps of the KNN algorithm with cross-validation?
- Draw a picture of how you expect test MAE to vary with \(K\), the number of neighbors. In terms of the bias-variance tradeoff, why do you expect the plot to look this way?
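For reference while thinking about the MAE questions, the mean absolute error averages the absolute prediction errors over the \(n\) held-out cases:
\[ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i| \]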
Exercise 4: Inspecting the results
The code below allows us to inspect our results. (It’s complete; there is nothing to fill in, but it’s helpful to look back at our LASSO code to see the similarities.)
- Use autoplot() to verify your expectations about the plot of test MAE vs. \(K\), the number of neighbors.
- Contextually interpret the test MAE for the “best” model. (There are two versions of “best” shown; what is the difference between them?)
- How else could you evaluate the KNN model?
- Does your KNN model help you understand which predictors of graduation rate are most important? Why or why not?
autoplot(knn_fit_cv) + theme_classic()
knn_fit_cv %>% show_best(metric = "mae") # Show evaluation metrics for different values of neighbors, ordered

# Choose value of tuning parameter (# of neighbors)
## Overall lowest error
knn_fit_cv %>%
    select_best(metric = "mae")

## Choose the largest neighbors value within 1 std. err. of the lowest CV MAE
knn_fit_cv %>%
    select_by_one_std_err(metric = "mae", desc(neighbors)) ## desc(neighbors) sorts from highest to lowest # of neighbors (most simple -> most complex)
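Beyond show_best(), you can also pull the full grid of CV results if you want to inspect them directly (a quick aside using tune's collect_metrics()):
knn_fit_cv %>% collect_metrics() # one row per neighbors value and metric, with the CV estimate and its std. err.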