Topic 8 Splines

8.1 Discussion

Goal: model nonlinear trends in data

  • We’ve seen that KNN regression can learn nonlinear functions.
  • We’ll see that splines can do the same.




Splines are a means of performing variable transformations

  • Let’s say that \(y\) and \(x\) are related with a logarithmic trend: \(y = \log(x) + \varepsilon\).
  • We transform \(x\) to make a new variable called \(x_{\text{new}}\), with \(x_{\text{new}}=\log(x)\).
  • Then the relationship between \(y\) and \(x_{\text{new}}\) will be linear.
  • We do the same thing with splines.
    • But with splines, we create multiple transformed variables (see the sketch below).
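
A minimal sketch of this idea with simulated data (all variable names here are made up for illustration):

set.seed(8)
x <- runif(100, min = 1, max = 10)
y <- log(x) + rnorm(100, sd = 0.2)  # logarithmic trend plus noise
x_new <- log(x)                     # the transformed variable
lm(y ~ x_new)                       # linear in x_new, even though y is nonlinear in x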




How does the ns() function work?

  • If I want to split my quantitative predictor into \(r\) regions, I’ll make \(r-1\) cuts (\(r-1\) knots).
    • Usually the knots are placed at regularly spaced quantiles (e.g., the 25th, 50th, and 75th percentiles for 3 knots); see the sketch after this list.
  • This number of regions \(r\) is called the degrees of freedom (df) parameter. This is a tuning parameter of splines.
  • Using only the quantitative predictor and df as inputs, ns() knows:
    • The exact number of transformation functions to use (this number is equal to df)
    • The exact formulas of the transformation functions (they’re polynomial functions)
  • How does R know the exact formulas of the transformation functions?
    • Calculus and linear algebra
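
Here’s a small sketch (with simulated data and made-up names) showing that ns() places its knots at regularly spaced quantiles:

library(splines)
set.seed(8)
x <- runif(500, min = 0, max = 100)       # simulated quantitative predictor
basis <- ns(x, df = 4)                    # 4 regions, so 3 interior knots
attr(basis, "knots")                      # the knots ns() chose
quantile(x, probs = c(0.25, 0.50, 0.75))  # matches the 25th/50th/75th percentiles
dim(basis)                                # 500 rows and df = 4 transformed variables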




Example

Suppose we break our predictor variable into 3 regions (2 knots).

  • We want to find coefficients for the following polynomials in the 3 regions:
    • Region 1: \(a_0 + a_1x + a_2x^2 + a_3x^3\)
    • Region 2: \(a_4 + a_5x + a_6x^2 + a_7x^3\)
    • Region 3: \(a_8 + a_9x + a_{10}x^2 + a_{11}x^3\)
  • We need to make sure that, at the knots, the polynomials are continuous and have continuous first and second derivatives (3 constraints at each of the 2 knots, so 6 in all).
  • We also make sure that at the minimum and maximum \(x\), the function is linear (2 more constraints).
  • That leaves \(12 - 6 - 2 = 4\) free coefficients: an intercept plus \(\text{df} = 3\) spline functions (df = number of knots + 1). The linearity constraint is sketched in code after this list.
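
Here’s a sketch (simulated data, made-up names) of that boundary-linearity constraint in action: predictions beyond the boundary knots fall on a straight line, so their second differences are essentially zero:

library(splines)
set.seed(8)
x <- seq(0, 10, by = 0.1)
y <- sin(x) + rnorm(length(x), sd = 0.2)
fit <- lm(y ~ ns(x, df = 3))                   # 3 regions, 2 interior knots
x_beyond <- seq(11, 15, by = 0.5)              # values beyond the data range
p <- predict(fit, newdata = data.frame(x = x_beyond))
diff(p, differences = 2)                       # all ~0: the fit is linear out there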



When you run ns(), R doesn’t return these region polynomials directly, but it does return a mathematically “nice” representation of them as 3 spline functions:

library(splines)
# college_subs is the subset of the College data created in the exercises below
x <- sort(college_subs$PhD)
spline_terms <- ns(x, df = 3)
matplot(x, spline_terms, type = "l", lty = "solid", lwd = 6,
        xlab = "PhD", ylab = "Spline function", cex.axis = 1.5, cex.lab = 1.5)
legend("topleft", legend = paste("Spline function", 1:3),
       col = 1:3, lty = "solid", bty = "n", cex = 1.5)

(If you’ve taken linear algebra, this is a basis representation.)

To use these transformation functions, we plug in the original PhD values and get out 3 transformed versions of PhD. You can think of these transformations as corresponding to the polynomials in the 3 regions.
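
For example, assuming the spline_terms object created above, we can plug in a few new PhD values:

predict(spline_terms, newx = c(40, 60, 80))  # one row per input, one column per spline function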




Some predictions made from splines

  • Red line: linear fit
  • Blue line: LOESS fit
  • Orange line: spline fit with degrees of freedom (regions) = 4 (3 knots)
  • Green line: spline fit with degrees of freedom (regions) = 20 (19 knots)
library(ggplot2)
library(gridExtra)
p1 <- ggplot(college_subs, aes(x = PhD, y = Outstate)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, col = "red", lwd = 3) +
    geom_smooth(method = "loess", se = FALSE, col = "blue", lwd = 3)
p2 <- p1 + geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 4), se = FALSE, col = "orange", lwd = 3)
p3 <- p2 + geom_smooth(method = "lm", formula = y ~ splines::ns(x, df = 20), se = FALSE, col = "limegreen", lwd = 3)
grid.arrange(p1, p2, p3, ncol = 3)





8.2 Exercises

You can download a template RMarkdown file to start from here.

We’ll explore KNN regression and modeling with splines using the College dataset in the ISLR package (associated with our optional textbook). We’ll also use the splines package; it ships with base R, so it doesn’t need to be installed, only loaded.

Our goal will be to build a model that predicts out-of-state tuition (Outstate).

# Load packages and data
library(dplyr)
library(ggplot2)
library(caret)
library(ISLR)
data(College)

# Examine the data codebook
?College





  1. Get to know the College data (some handy commands are sketched after this exercise)
    1. Peek at the first few rows.
    2. How many colleges are in the dataset?
    3. How many possible predictors of Outstate are there?
    4. Is Mac one of the colleges in this data? The code to check this quickly is below:

      "Macalester College" %in% rownames(College)

      If you are curious about working with strings in R, you might check out the stringr package. Example code is below:

      library(stringr)
      rownames(College)[str_detect(rownames(College), "Mac")]
      rownames(College)[str_detect(rownames(College), "Olaf")]
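
For the first three parts, commands along these lines might help (just a sketch):

head(College)      # peek at the first few rows
nrow(College)      # the number of colleges (one per row)
ncol(College) - 1  # possible predictors: every column except Outstate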



We’ll actually work with a subset of the predictors for the rest of the exercises:

college_subs <- College %>%
    mutate(adm_rate = Accept/Apps) %>%
    select(Outstate, Private, Top10perc, Room.Board, PhD, S.F.Ratio, perc.alumni, Expend, Grad.Rate, adm_rate)





  2. Baseline parametric vs. nonparametric
    (Throughout, include set.seed(2019) in each code chunk just before you use train().)
    1. Fit a parametric ordinary least squares (OLS) model of Outstate as a function of all predictors, and use 10-fold cross-validation to estimate the test error of this model. Use the straight average of the RMSE column. (One way to set this up is sketched after this exercise.)
    2. We’ll fit nonparametric KNN models to this data and compare performance. The code below fits KNN models for \(k = 1,6,\ldots,96\). (seq(1,100,5) generates a regular sequence starting at 1 and jumping by 5; it stops at 96 because the next value would exceed 100.)

      set.seed(2019)
      knn_mod_allpred <- train(
          Outstate ~ .,
          data = college_subs,
          method = "knn",
          tuneGrid = data.frame(k = seq(1,100,5)),
          trControl = trainControl(method = "cv", number = 10),
          metric = "RMSE",
          na.action = na.omit
      )
    3. Models fit with caret’s train() share several convenient features. We can use plot() on the object returned by train() to plot test RMSE versus the tuning parameter.
      Comment on the shape of this curve and how it is related to the bias-variance tradeoff.

      plot(knn_mod_allpred)
    4. We can also look at estimated test errors and their associated uncertainty with $results, and see the optimal value of the tuning parameter with $bestTune.
      How does the best KNN model compare to the OLS model in terms of estimated test error?

      knn_mod_allpred$results
      knn_mod_allpred$bestTune
    5. In machine learning, nonparametric methods tend to suffer from something called the curse of dimensionality. Do some Googling or search our ISLR textbook for this term.
      In your own words, explain what the curse of dimensionality is and why KNN suffers from it.
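
For reference, part 1 of this exercise might be set up as in the sketch below; it mirrors the KNN call above with method = "lm" swapped in (ols_mod is a made-up name):

set.seed(2019)
ols_mod <- train(
    Outstate ~ .,
    data = college_subs,
    method = "lm",
    trControl = trainControl(method = "cv", number = 10),
    metric = "RMSE",
    na.action = na.omit
)
mean(ols_mod$resample$RMSE)  # straight average of the RMSE column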





  3. Considering splines
    1. We should first visually explore the data. Make scatterplots of the response versus the following predictors: PhD, S.F.Ratio, perc.alumni. Add a smooth trend line in blue and a linear trend line in red, as below.

      ggplot(college_subs, aes(x = ???, y = ???)) +
          geom_point() +
          geom_smooth(color = "blue") +
          geom_smooth(method = "lm", color = "red")
    2. Based on these plots, do you think that a spline model would improve test error?
    3. Let’s fit a spline model “by hand”. Nothing for you to modify here, but step through the code below and make sure you understand the logic of what we’re trying to show.

      library(splines)
      # Create 4 new spline "variables" (corresponds to 3 knots)
      spline_terms_phd <- ns(college_subs$PhD, df = 4)
      spline_terms_phd <- as.data.frame(spline_terms_phd)
      # Give the variables the names spline1, ..., spline4
      colnames(spline_terms_phd) <- paste0("spline", 1:4)
      # Add in a new column for the response variable
      spline_terms_phd$Outstate <- college_subs$Outstate
      # Peek at this new "dataset"
      head(spline_terms_phd)
      # Fit an OLS model with these spline variables
      manual_spline_mod <- lm(Outstate ~ spline1+spline2+spline3+spline4, data = spline_terms_phd)
      # Create a dataset with the original PhD variable and the predictions from the spline model
      manual_spline_mod_output <- data.frame(
          PhD = college_subs$PhD,
          Outstate = fitted.values(manual_spline_mod)
      )
      # Overlay the scatterplot with our spline model's predictions
      ggplot(college_subs, aes(x = PhD, y = Outstate)) +
          geom_point() +
          geom_smooth(color = "blue") +
          geom_smooth(method = "lm", color = "red") +
          geom_point(data = manual_spline_mod_output, color = "green", aes(x = PhD, y = Outstate))

Note: df in the ns() function stands for “degrees of freedom” and is equal to the number of new spline variables created; it equals the number of interior knots plus 1.





  4. Do splines help?
    1. We can actually use ns() to create spline variables automatically when we write model formulas. Use the code below to fit a spline model that uses 3 knots for all of the quantitative predictors.

      set.seed(2019)
      spline_mod <- train(
          Outstate ~ Private+ns(Top10perc, df=4)+ns(Room.Board, df=4)+ns(PhD, df=4)+ns(S.F.Ratio, df=4)+ns(perc.alumni, df=4)+ns(Expend, df=4)+ns(Grad.Rate, df=4)+ns(adm_rate, df=4),
          data = college_subs,
          method = "lm",
          trControl = trainControl(method = "cv", number = 10),
          na.action = na.omit
      )
    2. Why do we use method = "lm"?
    3. Estimate the test RMSE of spline_mod. How does this model compare to the original OLS model? (Pulling the CV RMSE out of spline_mod is sketched after this exercise.)
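
As a sketch for part 3, the cross-validated metrics live inside the fitted object:

spline_mod$results               # CV metrics for the (untuned) lm fit
mean(spline_mod$resample$RMSE)   # straight average of the per-fold RMSEs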





  5. Splines and the bias-variance tradeoff
    What tuning parameter is associated with splines? Add the spline tuning parameter to your BVT diagram from last time.