Bootstrapping

Notes and in-class exercises

Notes

  • You can download a template file to work with here.
  • File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Learning goals

Let \(\beta\) be some population parameter and \(\hat{\beta}\) be a sample estimate of \(\beta\). In order to study the potential error in \(\hat{\beta}\), you will…

  • explore two approaches to approximating the sampling distribution of \(\hat{\beta}\):

    • Central Limit Theorem (CLT)
    • bootstrapping
  • identify the difference between sampling and resampling

  • intuit how bootstrapping results can be used to make inferences about \(\beta\)

Readings and videos

You can watch the following video after class to reinforce ideas:

Exercises

Context: Rivers contain small concentrations of mercury, which can accumulate in fish. Scientists studied this phenomenon among largemouth bass in the Wacamaw and Lumber rivers of North Carolina. One goal was to explore the relationship of a fish’s mercury concentration (Concen) with its size, specifically its Length:

\(E[\text{Concen} \mid \text{Length}] = \beta_0 + \beta_1 \text{Length}\)

To this end, they caught and evaluated 171 fish, recording the following:

variable  meaning
--------  --------------------------------------------------------
River     Lumber or Wacamaw
Station   Station number where the fish was caught (0, 1, …, 15)
Length    Fish’s length (in centimeters)
Weight    Fish’s weight (in grams)
Concen    Fish’s mercury concentration (in parts per million; ppm)
# Load the data & packages
library(tidyverse)
fish <- read_csv("https://mac-stat.github.io/data/Mercury.csv")
Rows: 171 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): River
dbl (4): Station, Length, Weight, Concen

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Plot and model the relationship of mercury concentration with length:

fish %>% 
    ggplot(aes(y = Concen, x = Length)) + 
    geom_point() + 
    geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula = 'y ~ x'

fish_model <- lm(Concen ~ Length, data = fish)
coef(summary(fish_model))
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
Length       0.05812749 0.005227593 11.119359 6.641225e-22

Exercise 1: sample vs population

  1. In the summary table, is the Length coefficient, 0.058, the population slope \(\beta_1\) or a sample estimate \(\hat{\beta}_1\)?

  2. If it’s a sample estimate, how accurate do you think it is?

Exercise 2: The rub

Since we don’t know \(\beta_1\), we can’t know the exact error in \(\hat{\beta}_1\)!

This is where sampling distributions come in. They describe how estimates \(\hat{\beta}_1\) might vary from sample to sample, thus how far these estimates might fall from \(\beta_1\):

In past activities, we used simulations to approximate the sampling distribution.

For example, we took and evaluated 500 different samples of 10 counties from the population of 3142 counties.
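
As a reminder of what that looked like, here is a toy sketch (with a made-up normal population standing in for the county data; the mean plays the role of the estimate):

```r
set.seed(155)
population <- rnorm(3142, mean = 50, sd = 10)  # made-up population values

# Sampling: draw many samples of size 10 from the POPULATION,
# recomputing the estimate each time
sample_means <- replicate(500, mean(sample(population, size = 10)))

# The spread of these estimates is the standard error
sd(sample_means)
```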

Why can’t we do that here in our fish example?

Exercise 3: CLT

In practice, we can’t observe the sampling distribution and its corresponding standard error. But we can approximate them.

When our sample size n is “large enough”, we might approximate the sampling distribution using the CLT:

\[\hat{\beta}_1 \sim N(\beta_1, \text{standard error}^2)\]

The standard error in the CLT is approximated from our sample via some formula \(c / \sqrt{n}\), where \(c\) is complicated.

Obtain and interpret this standard error from the model summary table:

coef(summary(fish_model))

REFLECT

Great! We can approximate the sampling distribution and standard error using the CLT.

BUT:

  • the quality of this approximation hinges on the validity of the Central Limit Theorem, which in turn hinges on the validity of the theoretical model assumptions
  • the CLT uses complicated formulas for its standard error estimates, so it can feel a little mysterious

Let’s explore how we can use bootstrapping to complement (not entirely replace) the CLT.

The saying “to pull oneself up by the bootstraps” is often attributed to Rudolf Erich Raspe’s 1781 The Surprising Adventures of Baron Munchausen in which the character pulls himself out of a swamp by his hair (not bootstraps). In short, it means to get something from nothing, through your own effort:

In this spirit, statistical bootstrapping doesn’t make any probability model assumptions. It uses only the information from our one sample to approximate standard errors.

Exercise 4: Challenge

Recall that we have a sample size of 171 fish:

nrow(fish)

We’ll obtain a bootstrapping distribution of \(\hat{\beta}_1\) by taking many (500) different samples of 171 fish and exploring the degree to which \(\hat{\beta}_1\) varies from sample to sample. Let’s try doing this as we did in past activities:

# Build 500 models using samples of size 171
fish_models_bad_simulation <- mosaic::do(500)*(
  fish %>% 
    sample_n(size = 171, replace = FALSE) %>% 
    with(lm(Concen ~ Length))
)

head(fish_models_bad_simulation)
  • What’s funny about the results?
  • Why do you think this happened?
  • How might you adjust the code to “fix” things?

Exercise 5: Resampling

In practice, we take one sample of size n from the population. To obtain a bootstrapping distribution of some sample estimate \(\hat{\beta}\), we…

  • take many resamples of size n with replacement from the sample
  • calculate \(\hat{\beta}\) using each resample
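
The two steps above can be sketched in base R. This toy example (hypothetical lengths, with the sample mean standing in for \(\hat{\beta}\)) illustrates the recipe only; it is not the code we'll use for the fish model below:

```r
set.seed(155)
x <- c(44, 43, 54, 52, 40)  # pretend this is our one observed sample

# Step 1 & 2: take 500 resamples of size n WITH replacement,
# recomputing the estimate from each resample
boot_estimates <- replicate(500, {
  resample <- sample(x, size = length(x), replace = TRUE)
  mean(resample)
})

# The spread of the bootstrap estimates approximates the standard error
sd(boot_estimates)
```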

Let’s wrap our minds around the idea of resampling using a small example of 5 fish:

# Define data
small_sample <- data.frame(
    id = 1:5,
    Length = c(44, 43, 54, 52, 40)
)

small_sample

This sample has a mean Length of 46.6 cm:

small_sample %>% 
    summarize(mean(Length))
  a. The chunk below samples 5 fish without replacement from our small_sample of 5 fish, and calculates their mean length. Run it several times. How do the sample and resulting mean change?
sample_1 <- sample_n(small_sample, size = 5, replace = FALSE)
sample_1

sample_1 %>% 
    summarize(mean(Length))
  b. Sampling our sample without replacement merely returns our original sample. Instead, resample 5 fish from our small_sample with replacement. Run it several times. What do you notice about the samples? About their mean lengths?
sample_2 <- sample_n(small_sample, size = 5, replace = TRUE)
sample_2

sample_2 %>% 
    summarize(mean(Length))
  c. Resampling our sample provides insight into the variability, hence potential error, in our sample estimates. (This works better when we have a sample bigger than 5!) As you observed in part b, each resample might include some fish from the original sample several times and others not at all. Why is this ok?

Exercise 6: Bootstrapping

We’re ready to bootstrap!

Fix one line of the code below to obtain 500 bootstrap estimates of the model of Concen by Length:

# Set the seed so we get the same results
set.seed(155)

# Build 500 bootstrap models using REsamples of size 171
fish_models_bootstrap <- mosaic::do(500)*(
  fish %>% 
    sample_n(size = 171, replace = FALSE) %>% 
    with(lm(Concen ~ Length))
)

head(fish_models_bootstrap)

Exercise 7: Bootstrap results

Recall that we started with 1 sample, thus 1 estimate of the model:

coef(summary(fish_model))
  a. We now have 500 (resample) bootstrap estimates of the model. These vary around the red line in the plot below. What does the red line represent: the actual population model or fish_model (the estimated model calculated from our original fish sample)?
fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
  b. Now focus on just the 500 (resample) bootstrap estimates of the slope Length coefficient. Before plotting the distribution of these resampled slopes, what do you anticipate? What shape do you expect the distribution will have? Around what value do you expect it to be centered?

  c. Check your intuition with the plot below. Was your intuition right?

fish_models_bootstrap %>% 
    ggplot(aes(x = Length)) + 
    geom_density()

Exercise 8: Bootstrap standard errors

Since they’re calculated from resamples of our sample, not different samples from the population, the 500 bootstrap estimates of the slope are centered around our original sample estimate. Importantly:

The degree to which the bootstrap estimates vary from the original sample estimate provides insight into the degree to which our original sample estimate might vary from the actual population slope (i.e., its standard error)!

  a. Use the bootstrap estimates of the Length slope coefficient to approximate the standard error of our original sample estimate, 0.05813. (HINT: a standard error is a standard deviation)
fish_models_bootstrap %>%
    ___(___(Length))
  b. How does this bootstrap approximation of the standard error compare to the one calculated via (a complicated mystery) formula and reported in the model summary table?
coef(summary(fish_model))

Pause: Powerful stuff!

Just pause here to appreciate how awesome it is that you approximated the potential error in our sample estimates using simulation and your sample data alone – no “theorems” or complicated formulas.

You might say we pulled ourselves up by the bootstraps.

Exercise 9: Looking ahead at intervals

In the past few activities, we’ve been exploring sampling variability and error. These tools are critical in using our sample to make inferences about the broader population. We’ll explore inference more formally in the weeks ahead.

Here, use your intuition to apply our bootstrapping results:

fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
fish_models_bootstrap %>% 
    ggplot(aes(x = Length)) + 
    geom_density()
  a. Our original sample estimate of the Length coefficient, 0.0581, was simply our best guess of the actual coefficient among all fish. But we know it’s wrong. Based on the plots above, provide a range (interval) of plausible values for the actual coefficient among all fish.

  b. We can do better than visual approximations. Use the fish_models_bootstrap results to provide a more specific interval of plausible values for the actual Length coefficient. Do you think we should use the full range of observed bootstrap estimates? Just a fraction? (Hint: the quantile() function will be useful. For example, quantile(x, 0.25) gives the 25th percentile of x, and quantile(x, 0.5) gives the 50th percentile.)

fish_models_bootstrap %>%
    summarize(___(Length, ___), ___(Length, ___))
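
If quantile() is new to you, here's a tiny self-contained demo on toy values (not the fish data):

```r
v <- 1:10                    # toy vector
quantile(v, 0.25)            # 25th percentile
quantile(v, c(0.05, 0.95))   # 5th and 95th percentiles at once
```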

Exercise 10: Looking ahead at hypothesis testing

Some researchers claim that mercury content is associated with the length of a fish. Let’s use our bootstrapping results to test this hypothesis.

  a. Based only on the plot below of our bootstrap models, do you think our sample data supports this hypothesis?
fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)
  b. What about numerical evidence? Based on the interval you calculated in part b of the previous exercise, do you think our sample data supports this hypothesis?







Solutions

Exercise 1: sample vs population

  1. sample estimate

  2. no idea (at this point at least)! we don’t know the actual population slope

Exercise 2: The rub

We only have access to a sample of fish, not the entire population.

Exercise 3: CLT

coef(summary(fish_model))
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
Length       0.05812749 0.005227593 11.119359 6.641225e-22

0.0052 is the standard error for the Length coefficient. It tells us how much slope estimates typically deviate from the true population value, i.e., how much our slope estimate of 0.058 might jump around from sample to sample. In short, it quantifies the uncertainty in that estimate.
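
As a rough sketch (not part of the original activity), a common rule of thumb reads the standard error this way: sample slopes typically land within about 2 standard errors of the estimate. Plugging in the numbers from the summary table:

```r
# Rule-of-thumb range: estimate plus/minus 2 standard errors
# (numbers taken from the model summary table above)
estimate <- 0.0581
se <- 0.0052
estimate + c(-2, 2) * se   # roughly 0.048 to 0.069
```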

Exercise 4: Challenge

# Build 500 models using samples of size 171
fish_models_bad_simulation <- mosaic::do(500)*(
  fish %>% 
    sample_n(size = 171, replace = FALSE) %>% 
    with(lm(Concen ~ Length))
)
head(fish_models_bad_simulation)
  Intercept     Length     sigma r.squared        F numdf dendf .row .index
1 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      1
2 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      2
3 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      3
4 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      4
5 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      5
6 -1.131645 0.05812749 0.5805245 0.4224989 123.6401     1   169    1      6

The results are all the same. Since we sampled without replacement, we got the same sample, hence the same sample estimates, every time.
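
To see why, note that sampling all n rows without replacement just reshuffles the rows, and reshuffling leaves any estimate computed from the sample unchanged. A quick sketch with toy values:

```r
set.seed(1)
x <- c(44, 43, 54, 52, 40)
reshuffled <- sample(x, size = length(x), replace = FALSE)  # a permutation of x

mean(x) == mean(reshuffled)           # TRUE: same values, same estimate
identical(sort(x), sort(reshuffled))  # TRUE: exactly the same values
```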

Exercise 5: Resampling

# Define data
small_sample <- data.frame(
    id = 1:5,
    Length = c(44, 43, 54, 52, 40)
)

small_sample
  id Length
1  1     44
2  2     43
3  3     54
4  4     52
5  5     40
small_sample %>% 
    summarize(mean(Length))
  mean(Length)
1         46.6
  a. We always get the same sample (the original 5 fish), just in a different order. Thus the mean is always the same: 46.6 cm.
sample_1 <- sample_n(small_sample, size = 5, replace = FALSE)
sample_1
  id Length
1  1     44
2  2     43
3  3     54
4  4     52
5  5     40
sample_1 %>% 
    summarize(mean(Length))
  mean(Length)
1         46.6
  b. The resamples differ from the original sample (some fish appear multiple times, others not at all), and their mean lengths vary.
sample_2 <- sample_n(small_sample, size = 5, replace = TRUE)
sample_2
  id Length
1  4     52
2  5     40
3  5     40
4  2     43
5  4     52
sample_2 %>% 
    summarize(mean(Length))
  mean(Length)
1         45.4
  c. Sampling the same fish more than once is like observing different fish with similar characteristics.

Exercise 6: Bootstrapping

# Set the seed so we get the same results
set.seed(155)

# Build 500 bootstrap models using REsamples of size 171
fish_models_bootstrap <- mosaic::do(500)*(
  fish %>% 
    sample_n(size = 171, replace = TRUE) %>% 
    with(lm(Concen ~ Length))
)

head(fish_models_bootstrap)
   Intercept     Length     sigma r.squared         F numdf dendf .row .index
1 -1.1802375 0.05827521 0.5834448 0.4127396 118.77693     1   169    1      1
2 -1.2902917 0.06296431 0.5667005 0.4661328 147.55814     1   169    1      2
3 -1.0449945 0.05645232 0.5836492 0.4094752 117.18610     1   169    1      3
4 -1.2721884 0.06065800 0.5253640 0.5043690 171.97948     1   169    1      4
5 -0.8324804 0.04929179 0.5820092 0.3695009  99.04160     1   169    1      5
6 -0.8811652 0.05088282 0.5786866 0.3671583  98.04942     1   169    1      6

Exercise 7: Bootstrap results

  a. The red line represents fish_model.
fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)

  b. Intuition will vary.

  c. They are normally distributed around our original sample estimate.

fish_models_bootstrap %>% 
    ggplot(aes(x = Length)) + 
    geom_density()

Exercise 8: Bootstrap standard errors

  a. A standard error is a standard deviation:
fish_models_bootstrap %>%
    summarize(sd(Length))
   sd(Length)
1 0.005664586
  b. Very close! The bootstrap standard error (0.00566) nearly matches the 0.00523 reported in the model summary table:
coef(summary(fish_model))
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -1.13164542 0.213614796 -5.297598 3.617750e-07
Length       0.05812749 0.005227593 11.119359 6.641225e-22

Exercise 9: Looking ahead at intervals

fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)

fish_models_bootstrap %>% 
    ggplot(aes(x = Length)) + 
    geom_density()

  a. Answers will vary. Maybe 0.0375ish to 0.07ish.

  b. Answers will vary. For example, if we lop off the smallest 2.5% and largest 2.5% of estimates and keep the middle 95%:

fish_models_bootstrap %>%
    summarize(quantile(Length, 0.025), quantile(Length, 0.975))
  quantile(Length, 0.025) quantile(Length, 0.975)
1              0.04677145              0.06905934

Exercise 10: Looking ahead at hypothesis testing

  a. Yes. All of the lines reflect a positive association.
fish %>% 
    ggplot(aes(y = Concen, x = Length)) +
    geom_abline(data = fish_models_bootstrap, aes(intercept = Intercept, slope = Length), color = "gray") + 
    geom_smooth(method = "lm", color = "red", se = FALSE)

  b. Yes. The interval falls entirely above 0 (a slope of 0 would indicate no association).