Topic 18 Bagging and Random Forests (Coding)
Exercises
You can download a template RMarkdown file to start from here.
Before proceeding, install the ranger and vip packages.
Our goal will be to classify types of urban land cover in small subregions within a high resolution aerial image of a land region. Data from the UCI Machine Learning Repository include the observed type of land cover (determined by human eye) and “spectral, size, shape, and texture information” computed from the image. See this page for the data codebook.
Source: https://ncap.org.uk/sites/default/files/EK_land_use_0.jpg
library(dplyr)
library(readr)
library(ggplot2)
library(tidymodels)
tidymodels_prefer()
# Read in the data
land <- read_csv("https://www.dropbox.com/s/r59esfepjw7qsg0/land_cover_training.csv?dl=1")
# There are 9 land types, but we'll focus on 3 of them
land <- land %>%
    filter(class %in% c("asphalt", "grass", "tree")) %>%
    mutate(class = factor(class))
Exercise 1: Building a random forest in tidymodels
If you want to use the land dataset, the code is complete, but you should step through it to understand what each line is doing.
If you are using your project dataset, you don’t need to do anything here if you finished implementing a random forest from last class.
set.seed(123)
# Model Specification
rf_spec <- rand_forest() %>%
    set_engine(engine = "ranger") %>%
    set_args(
        mtry = NULL, # size of random subset of variables; default is floor(sqrt(number of total predictors))
        trees = 1000, # Number of trees
        min_n = 2, # minimum number of cases in a node for it to be split further
        probability = FALSE, # FALSE: get hard predictions (not needed for regression)
        importance = "impurity" # record impurity-based variable importance
    ) %>%
    set_mode("classification") # change this for regression
# Recipe
data_rec <- recipe(class ~ ., data = land)
# Workflows
data_wf <- workflow() %>%
    add_model(rf_spec) %>%
    add_recipe(data_rec)
# Note how we're not using tune_grid() or vfold_cv() information here
rf_fit <- fit(data_wf, data = land)
Exercise 2: Evaluate model performance
Printing the rf_fit object displays information about the random forest model.
Report and interpret the OOB prediction error value in the context of your data. (Misclassification rate is reported for classification, and MSE is reported for regression.)
rf_fit
Classification
We can look at a confusion matrix resulting from the out-of-bag (OOB) predictions.
- Check your conceptual understanding: How are OOB predictions made?
- Based on the confusion matrix, what classes are predicted most accurately, and does this make sense contextually?
rf_output <- land %>%
    mutate(OOB_pred_class = rf_fit %>% extract_fit_engine() %>% pluck("predictions")) # This extracts the OOB predictions
conf_mat(
data = rf_output,
truth = class,
estimate = OOB_pred_class
)
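As a rough aid for the conceptual question about OOB predictions, the sketch below uses base R only to show which cases end up out-of-bag for each tree. It assumes each tree is grown on a bootstrap sample drawn with replacement and of the same size as the data; the values of n and n_trees are hypothetical and are not taken from the land data or from rf_fit.

```r
# Toy illustration of which cases are out-of-bag (OOB) for each tree.
# Assumption: each tree uses a bootstrap sample of size n drawn with replacement.
set.seed(1)
n <- 200        # hypothetical number of cases
n_trees <- 500  # hypothetical number of trees

# For each tree, draw bootstrap row indices and flag the rows that were
# NOT drawn: those rows are OOB for that tree.
is_oob <- replicate(n_trees, {
    boot_rows <- sample(n, size = n, replace = TRUE)
    !(seq_len(n) %in% boot_rows)
})
# is_oob is an n x n_trees logical matrix

# Average fraction of cases that are OOB for a tree (theory: about exp(-1))
oob_fraction <- mean(is_oob)

# Number of trees for which each case is OOB; a case's OOB prediction
# aggregates only the predictions from these trees
trees_per_case <- rowSums(is_oob)

oob_fraction
summary(trees_per_case)
```

Each case is left out of roughly a third of the trees, so its OOB prediction comes from trees that never saw it during training. This is why the OOB error can be read as an estimate of test error without a separate cross-validation step.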
We can also explore how misclassification rates relate to predictors. The code below creates a new variable is_misclass indicating whether or not a case was misclassified in the OOB predictions.
- Make visualizations of is_misclass and each of your predictors individually to explore how misclassification rate might vary across predictors.
rf_output <- rf_output %>%
    mutate(is_misclass = class != OOB_pred_class)
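One possible numeric companion to the visualizations is to bin a predictor and compute the misclassification rate within each bin. The sketch below uses base R on a simulated stand-in for rf_output: toy_output, its predictor column, and the simulated is_misclass values are all hypothetical and are not the land data.

```r
# Toy stand-in for rf_output: one hypothetical numeric predictor and a
# simulated misclassification indicator (NOT the land data)
set.seed(3)
toy_output <- data.frame(predictor = runif(300, min = 0, max = 10))
toy_output$is_misclass <- runif(300) < 0.05 + 0.03 * toy_output$predictor

# Bin the predictor into 5 equal-width intervals
toy_output$bin <- cut(
    toy_output$predictor,
    breaks = seq(0, 10, by = 2),
    include.lowest = TRUE
)

# Misclassification rate within each bin (mean of a logical is a proportion)
misclass_by_bin <- tapply(toy_output$is_misclass, toy_output$bin, mean)
misclass_by_bin
```

With your own data, the same idea applies by replacing toy_output with rf_output and predictor with one of your predictors. Bins with noticeably higher rates point to regions of the predictor where the forest struggles.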
Regression
We can compute residuals resulting from OOB predictions.
- Check your conceptual understanding: How are OOB predictions made?
- Make residual plots to visualize how errors might relate to your predictors.
rf_output <- land %>%
    mutate(
        OOB_pred = rf_fit %>% extract_fit_engine() %>% pluck("predictions"), # This extracts the OOB predictions
        resid = YOUR_OUTCOME - OOB_pred
    )
# Residual plots vs individual predictors
Exercise 3: Variable importance
(You’ll need to install the vip package before proceeding.)
Because bagging and random forests use many trees, the nice interpretability of single decision trees is lost. However, we can still get a measure of how important the different predictors were in predicting the outcome.
For each of the predictors, the code below gives the “total decrease in node impurities from splitting on the variable, averaged over all trees” (package documentation).
- Do the variable importance results make sense contextually? (If you’re using the land dataset, check out the codebook for these variables here.)
- Conceptual question: It has been found that measures of variable importance from random forests can tend to favor predictors with a lot of unique values. Explain briefly why this makes sense by thinking about the recursive binary splitting algorithm for a single tree. (Note: similar cautions arise for variable importance in single trees.)
library(vip)
# Plot of the variable importance information
rf_fit %>%
    extract_fit_engine() %>%
    vip(num_features = 30) + theme_classic()
# Extract the numerical information on variable importance and display the most and least important predictors
rf_var_imp <- rf_fit %>%
    extract_fit_engine() %>%
    vip::vi()
head(rf_var_imp)
tail(rf_var_imp)
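The caution about predictors with many unique values can be explored with a small base R simulation. The sketch below is not how ranger computes importance. It only mimics one step of recursive binary splitting, using Gini impurity, on an outcome that is unrelated to both predictors. The helpers gini and best_split_gain, and the predictors x_many and x_few, are hypothetical names introduced for illustration.

```r
set.seed(2)

# Gini impurity of a vector of class labels
gini <- function(y) {
    p <- table(y) / length(y)
    1 - sum(p^2)
}

# Largest decrease in Gini impurity achievable by one binary split x <= cutoff
best_split_gain <- function(x, y) {
    cutoffs <- head(sort(unique(x)), -1)  # candidate split points
    if (length(cutoffs) == 0) return(0)   # no split possible if x is constant
    gains <- sapply(cutoffs, function(cut) {
        left <- y[x <= cut]
        right <- y[x > cut]
        weighted_child <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
        gini(y) - weighted_child
    })
    max(gains)
}

n <- 100
n_sims <- 200
gains <- replicate(n_sims, {
    y <- sample(c("a", "b"), n, replace = TRUE)  # outcome unrelated to both predictors
    x_many <- runif(n)                           # noise predictor with many unique values
    x_few <- sample(c(0, 1), n, replace = TRUE)  # noise predictor with 2 unique values
    c(many = best_split_gain(x_many, y), few = best_split_gain(x_few, y))
})

# Average best-split impurity decrease for each noise predictor
rowMeans(gains)
```

Neither predictor carries any signal, yet the one with many unique values offers many more candidate cutoffs, so by chance alone one of them tends to produce a larger impurity decrease. Impurity-based importance accumulates these decreases, which is one way to see where the bias comes from.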