A Cross-Validation
If you’re keen on learning more about R programming and want to try implementing the methods we’ve talked about in class, you’re in the right place!
You’ll learn the programming tools needed to implement cross-validation here.
The goal will be to write an R function that performs cross-validation for ordinary least squares linear regression models. Along the way, you'll learn about objects, subsetting operations, `for`-loops, and writing functions in R.
A.1 Objects
Read the Vectors section of the free online Advanced R book by Hadley Wickham.
A.2 Subsetting
Read the Subsetting section of Advanced R.
A.3 Writing R functions
Read the Functions chapter of R for Data Science by Garrett Grolemund and Hadley Wickham.
A.4 `for`-loops and control flow
Read the Iteration chapter of R for Data Science.
Also read the Control flow chapter of Advanced R to learn about `if`-statements.
A.5 Building our cross-validation function!
Work through the steps below to build up our cross-validation function.
Step 1: Create the skeleton body for a function called `cross_validation` that takes the following input arguments:

- `data`: the training dataset (a `data.frame` object)
- `formula`: the model formula for the linear regression model (e.g. `resp ~ x1 + x2`). This is a special type of R object called (reasonably) a `formula` object.
- `k`: the \(k\) in \(k\)-fold cross-validation
- `response`: a `character` object giving the name of the response variable
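As a concrete starting point, here's one possible skeleton (the comments just mark where the later steps will go):

```r
# Skeleton: the body will be filled in over Steps 2-4
cross_validation <- function(data, formula, k, response) {
  # Step 2: randomly split the data into k folds
  # Step 3: set up an output container and loop over the folds
  # Step 4: fit and evaluate the model within each fold
}
```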
Step 2: Devise a way to randomly split the data into \(k\) folds. There are many potential ways to do this, but here's an idea that I have in mind:
- Randomly reorder the data.
- Then add a new variable (column) that repeats 1, 2, 3, …, \(k\), 1, 2, 3, …, \(k\) until the last row.
If you want to try my idea, look at the help pages for the `sample_n()` function in the `dplyr` package and the `rep()` function. (You can enter `?dplyr::sample_n` and `?rep` into the Console.) You'll need to be comfortable subsetting `data.frame`s. Use either the earlier reading, or look at the `filter()` function available in `dplyr`.
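To make the idea concrete, here's a minimal sketch using the built-in `mtcars` dataset (the dataset, the seed, and `k = 5` are placeholders of my own choosing):

```r
library(dplyr)

set.seed(302)  # arbitrary seed, just so the fold assignments are reproducible
k <- 5

# Randomly reorder the rows, then label them 1, 2, ..., k, 1, 2, ... down the rows
shuffled <- sample_n(mtcars, size = nrow(mtcars))
shuffled$fold <- rep(1:k, length.out = nrow(shuffled))

# The rows with fold == 1 form the holdout set for the first iteration
holdout <- filter(shuffled, fold == 1)
```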
If you come up with another idea and want help implementing it, feel free to ask!
Step 3: Set up an output container object to hold the evaluation metric computed in each iteration of cross-validation. Also write the skeleton of a `for`-loop.
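Continuing the sketch, the scaffolding might look like this (the body comes in Step 4):

```r
# One test-error estimate per fold
errors <- numeric(k)

for (i in 1:k) {
  # Step 4: fit on the rows where fold != i, evaluate on the rows where fold == i
}
```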
Step 4: Write the `for`-loop body to perform the steps needed to estimate the test error in this iteration. Feel free to use whatever evaluation metric you desire. It may be helpful to write a function for computing that evaluation metric.
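Continuing the `mtcars` sketch from Step 2, here's one way the body could look. RMSE, the `rmse()` helper, and the example formula are all placeholders of mine, not required choices:

```r
# Example stand-ins for the function arguments from Step 1
formula <- mpg ~ wt + hp
response <- "mpg"

# Hypothetical helper for the evaluation metric: root mean squared error
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

errors <- numeric(k)
for (i in 1:k) {
  test <- filter(shuffled, fold == i)    # the held-out fold
  train <- filter(shuffled, fold != i)   # everything else

  fit <- lm(formula, data = train)       # fit OLS on the training folds
  preds <- predict(fit, newdata = test)  # predict on the held-out fold

  errors[i] <- rmse(test[[response]], preds)
}

mean(errors)  # the k-fold CV estimate of test error
```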
If you want to add more capability to your function, try the following for some extra challenge:

- Remove the `response` argument. Try to extract it from the formula. To help with this, look into the `as.character()` function, and the `str_split()` function in the `stringr` package. (There's a sketch of this after the list.)
- Add an argument called `metric` that will be a character object specifying what evaluation metric to use. You should create functions for each of the metrics that you'll allow the user to specify.
- Allow the `k` argument to be a character object where the user inputs "loocv" instead of a number. Your function should still support numerical `k`, though.
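For the first challenge, here's one possible route using the functions mentioned above (a sketch, not the only way):

```r
library(stringr)

f <- mpg ~ wt + hp

# as.character() on a two-sided formula gives c("~", "mpg", "wt + hp"),
# so the response name is the second element
as.character(f)[2]
#> [1] "mpg"

# Alternatively, deparse the formula to a string and split on " ~ "
str_split(deparse(f), " ~ ")[[1]][1]
#> [1] "mpg"
```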
A.6 Aside: `apply()` functions
R provides a family of `apply()` functions (`apply()`, `lapply()`, `sapply()`, `tapply()`, `mapply()`) that do similar things to `for`-loops but tackle specialized tasks. The main difference between these functions and `for`-loops is that these functions create the output objects from looping automatically. In contrast, `for`-loops require you to set up a vector container beforehand to store the outputs created in the loop.
You can learn more about the `apply()` functions here.
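As a small illustration (toy data of my own choosing), compare the two approaches:

```r
x <- list(a = 1:5, b = 6:10)

# for-loop version: the output container must be set up beforehand
means_loop <- numeric(length(x))
for (i in seq_along(x)) {
  means_loop[i] <- mean(x[[i]])
}

# sapply() version: the output vector is created automatically
means_apply <- sapply(x, mean)
```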
It is a common misconception in the R community that these functions are faster than `for`-loops. This isn't true. People often like the `apply()` functions for their readability, but everyone has their personal preferences. You can see more of the discussion in this issue of R News and on this StackOverflow thread.