Sampling distributions & the CLT

Notes and in-class exercises

Notes

  • You can download a template file to work with here.
  • File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Learning goals

Let \(\beta\) be some population parameter and \(\hat{\beta}\) be a sample estimate of \(\beta\). Our goals for the day are to:

  • use simulation to solidify our understanding of sampling distributions and standard errors
  • explore the appropriateness of the Central Limit Theorem in approximating a sampling distribution
  • explore the impact of sample size on sampling distributions and standard errors

Readings and videos

Before class you should have read and watched:

Exercises

Recall the complete population data on the 2019 and 2017 per capita income in each U.S. county:

# Load the data & handy packages
library(tidyverse) # this includes dplyr, ggplot2, & other related packages
library(usdata)
data(county_complete)

# Wrangle the data
county_clean <- county_complete %>% 
  mutate(pci_2019 = per_capita_income_2019 / 1000, 
         pci_2017 = per_capita_income_2017 / 1000) %>% 
  select(state, name, fips, pci_2019, pci_2017)

We can obtain the finite population parameters \(\beta_0\) and \(\beta_1\) for the linear model relating these 2 variables:

\[ E[pci\_2019 \mid pci\_2017] = \beta_0 + \beta_1 pci\_2017 = 1.294 + 1.027 pci\_2017 \]

# Model the relationship
pop_model <- lm(pci_2019 ~ pci_2017, data = county_clean)
coef(summary(pop_model))
            Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 1.293897 0.157901343   8.194339 3.628557e-16
pci_2017    1.026675 0.005883489 174.501046 0.000000e+00
# Plot the relationship
county_clean %>% 
  ggplot(aes(y = pci_2019, x = pci_2017)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

Now PRETEND we don’t have data on all counties.

Thus we need to estimate the population parameters using sample data.

In the previous activity, each student took 1 sample of 10 counties.

This gave us a quick sense of how these estimates could vary from sample to sample.

But let’s dream bigger.

Let’s each take 500 samples of 10 counties each!

Exercise 1: 500 samples of size 10

Recall that we can sample 10 counties using sample_n():

# Run this chunk a few times to explore the different samples you get
county_clean %>% 
  sample_n(size = 10, replace = FALSE)

We can also take a sample and then use the data to estimate the model:

# Run this chunk a few times to explore the different sample models you get
county_clean %>% 
  sample_n(size = 10, replace = FALSE) %>% 
  with(lm(pci_2019 ~ pci_2017))

We can also take multiple unique samples and build a sample model from each.

The code below obtains 500 separate samples of 10 counties, and stores the model estimates from each:

# Set the seed so that we all get the same results
set.seed(155)

# Store the sample models
sample_models_10 <- mosaic::do(500)*(
  county_clean %>% 
    sample_n(size = 10, replace = FALSE) %>% 
    with(lm(pci_2019 ~ pci_2017))
)

# Check it out
head(sample_models_10)
dim(sample_models_10)

Reflect

  1. What’s the point of the do() function?!? If you’ve taken any COMP classes, what process do you think do() is a shortcut for?

  2. What is stored in the Intercept, pci_2017, and r.squared columns of the results?

Exercise 2: Sampling distribution

Check out the resulting 500 sample models:

county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(data = sample_models_10, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

Let’s focus on the slopes of these 500 sample models. A plot of the 500 slopes approximates the sampling distribution of the sample slopes.

sample_models_10 %>% 
  ggplot(aes(x = pci_2017)) + 
  geom_density() + 
  geom_vline(xintercept = 1.027, color = "red") + 
  xlim(0.3, 1.7)

Reflect: Describe the sampling distribution. What’s its general shape? Where is it centered? Roughly what’s its spread, i.e. what range of estimates did you observe?

Exercise 3: Standard error

For a more rigorous assessment of the spread among the sample slopes, let’s calculate their standard deviation:

sample_models_10 %>% 
  summarize(sd(pci_2017))

Recall: The standard deviation of sample estimates is called a “standard error”.

It measures the typical distance of a sample estimate from the actual population value.
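As a minimal self-contained sketch of this idea (using a hypothetical toy population and sample means in place of slopes, not the county data), a standard error is just the standard deviation of many repeated sample estimates:

```r
# Toy sketch: the standard error is the sd of repeated sample estimates.
# The population and estimator here are hypothetical stand-ins, not the county data.
set.seed(1)
population <- rexp(10000, rate = 1)                        # a skewed toy population
estimates  <- replicate(500, mean(sample(population, 10))) # 500 sample means, n = 10
se_hat <- sd(estimates)                                    # approximate standard error
se_hat
```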

Exercise 4: Central Limit Theorem (CLT)

Recall that the CLT tells us that, so long as our sample size is “big enough”, the sampling distribution of the sample slope will be approximately Normal.

Specifically, all possible sample slopes will vary Normally around the population slope.
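As a quick toy illustration of this idea (a simulation with sample means of a hypothetical skewed population, not the county slopes), averaging makes even a strongly skewed distribution look roughly symmetric:

```r
# Toy CLT check: averages of a skewed population look far more symmetric (Normal-ish)
set.seed(2)
population <- rexp(10000)                                   # strongly right-skewed
means_n30  <- replicate(2000, mean(sample(population, 30))) # 2000 sample means, n = 30
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)         # quick skewness measure
skew(population)  # large and positive (skewed)
skew(means_n30)   # much closer to 0 (more symmetric, as the CLT predicts)
```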

Exercise 5: Using the CLT

Let \(\hat{\beta}_1\) be an estimate of the population slope parameter \(\beta_1\) calculated from a sample of 10 counties.

In exercise 3, you approximated that \(\hat{\beta}_1\) has a standard error of roughly 0.16.

Thus, by the CLT, the sampling distribution of \(\hat{\beta}_1\) is:

\[\hat{\beta}_1 \sim N(\beta_1, 0.16^2)\]

Use this result with the 68-95-99.7 property of the Normal model to understand the potential error in a slope estimate.

  1. There are many possible samples of 10 counties. What percent of these will produce an estimate \(\hat{\beta}_1\) that’s within 0.32, i.e. 2 standard errors, of the actual population slope \(\beta_1\)?

  2. More than 2 standard errors from \(\beta_1\)?

  3. More than 0.48, i.e. 3 standard errors, above \(\beta_1\)?

Exercise 6: CLT and the 68-95-99.7 Rule

Fill in the blanks below to complete some general properties implied by the CLT:

  • ___% of samples will produce \(\hat{\beta}_1\) estimates within 1 st. err. of \(\beta_1\)

  • ___% of samples will produce \(\hat{\beta}_1\) estimates within 2 st. err. of \(\beta_1\)

  • ___% of samples will produce \(\hat{\beta}_1\) estimates within 3 st. err. of \(\beta_1\)

Exercise 7: Increasing sample size

Now that we have a sense of the potential variability and error in sample estimates, let’s consider the impact of sample size. Suppose we were to increase our sample size from n = 10 to n = 50 or n = 200 counties. What impact do you anticipate this having on our sample estimates of the population parameters:

  • Do you expect there to be more or less variability among the sample model lines?

  • Around what value would you expect the sampling distribution of sample slopes to be centered?

  • What general shape would you expect that sampling distribution to have?

  • In comparison to estimates based on the samples of size 10, do you think the estimates based on samples of size 50 will be closer to or farther from the true slope (on average)?

Exercise 8: 500 samples of size n

Let’s increase the sample size in our simulation. Fill in the blanks to take 500 samples of size 50, and build a sample model from each.

set.seed(155)
sample_models_50 <- mosaic::do(___)*(
  county_clean %>% 
    ___(size = ___, replace = FALSE) %>% 
    ___(___(pci_2019 ~ pci_2017))
)

# Check it out
head(sample_models_50)

Similarly, take 500 samples of size 200, and build a sample model from each.

set.seed(155)
sample_models_200 <- mosaic::do(___)*(
  county_clean %>% 
    ___(size = ___, replace = FALSE) %>% 
    ___(___(pci_2019 ~ pci_2017))
)

# Check it out
head(sample_models_200)

Exercise 9: Impact of sample size (part I)

Compare and contrast the 500 sets of sample models when using samples of size 10, 50, and 200.

# 500 sample models using samples of size 10
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_10, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)
# 500 sample models using samples of size 50
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_50, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)
# 500 sample models using samples of size 200
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_200, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

Reflect: What happens to our sample models as sample size increases? Was this what you expected?

Exercise 10: Impact of sample size (part II)

Let’s focus on just the sampling distributions of our 500 slope estimates \(\hat{\beta}_1\). For easy comparison, plot the estimates based on samples of size 10, 50, and 200 on the same frame:

# Don't think too hard about this code!
# Combine the estimates & sample size into a new data set
# Then plot it
data.frame(estimates = c(sample_models_10$pci_2017, sample_models_50$pci_2017, sample_models_200$pci_2017),
           sample_size = rep(c("10","50","200"), each = 500)) %>% 
  mutate(sample_size = fct_relevel(sample_size, c("10", "50", "200"))) %>% 
  ggplot(aes(x = estimates, color = sample_size)) + 
  geom_density() + 
  geom_vline(xintercept = 1.027, color = "red", linetype = "dashed") + 
  labs(title = "Sampling distributions of the sample slope")

Reflect: How do the shapes, centers, and spreads of these sampling distributions compare? Was this what you expected?

Exercise 11: Properties of sampling distributions

In light of your observations, complete the following statements about the sampling distribution of the sample slope.

  1. For all sample sizes, the shape of the sampling distribution is roughly ___ and the sampling distribution is roughly centered around ___, the true population slope.

  2. As sample size increases:
    The average sample slope estimate INCREASES / DECREASES / IS FAIRLY STABLE.
    The standard error of the sample slopes INCREASES / DECREASES / IS FAIRLY STABLE.

  3. Thus, as sample size increases, our sample slopes become MORE RELIABLE / LESS RELIABLE.







Solutions

Exercise 1: 500 samples of size 10

Reflect

  1. do() repeats the code within the parentheses as many times as you tell it. do() is a shortcut for a for loop.

  2. The Intercept and pci_2017 columns store the 500 sample estimates of \(\hat{\beta}_0\) and \(\hat{\beta}_1\); the r.squared column stores each sample model’s R-squared value.
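For the curious, a hand-rolled version of what do() automates might look like the for loop below (a sketch with a hypothetical toy population and sample means rather than the county models):

```r
# What mosaic::do() is shorthand for: a for loop that stores one estimate per repeat
set.seed(155)
population <- rnorm(1000)        # hypothetical toy population
estimates  <- numeric(500)       # pre-allocate storage for 500 estimates
for (i in 1:500) {
  s <- sample(population, size = 10)  # take one sample of size 10
  estimates[i] <- mean(s)             # compute & store one estimate
}
length(estimates)  # 500, one estimate per repetition
```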

Exercise 2: Sampling distribution

set.seed(155)
sample_models_10 <- mosaic::do(500)*(
  county_clean %>% 
    sample_n(size = 10, replace = FALSE) %>% 
    with(lm(pci_2019 ~ pci_2017))
)
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(data = sample_models_10, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

sample_models_10 %>% 
  ggplot(aes(x = pci_2017)) + 
  geom_density() + 
  geom_vline(xintercept = 1.027, color = "red") + 
  xlim(0.3, 1.7)

The 500 sample slopes are normally distributed around the population slope and range from roughly 0.4 to 1.6.

Exercise 3: Standard error

sample_models_10 %>% 
  summarize(sd(pci_2017))
  sd(pci_2017)
1    0.1580363

Thus, for samples of size 10, we expect estimates of the sample slope (the expected change in pci_2019 per $1k increase in pci_2017) to typically be off from the true slope by roughly 0.16. These standard errors decrease as sample size increases.

Exercise 4: Central Limit Theorem (CLT)

Our simulation results do support the CLT: the sample slopes vary roughly Normally around the population slope.

Exercise 5: Using the CLT

  1. 95%
  2. 5%
  3. 0.15% ((100% - 99.7%) / 2)

Exercise 6: CLT and the 68-95-99.7 Rule

68%, 95%, 99.7%
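These percentages come straight from the Normal model; one way to double-check them is with R’s pnorm():

```r
# The 68-95-99.7 rule, computed from the standard Normal CDF
round(pnorm(1) - pnorm(-1), 3)  # 0.683: within 1 standard error
round(pnorm(2) - pnorm(-2), 3)  # 0.954: within 2 standard errors
round(pnorm(3) - pnorm(-3), 3)  # 0.997: within 3 standard errors
```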

Exercise 7: Increasing sample size

These questions ask for your intuition; there is no wrong answer.

Exercise 8: 500 samples of size n

set.seed(155)
sample_models_50 <- mosaic::do(500)*(
  county_clean %>% 
    sample_n(size = 50, replace = FALSE) %>% 
    with(lm(pci_2019 ~ pci_2017))
)

# Check it out
head(sample_models_50)
  Intercept  pci_2017    sigma r.squared        F numdf dendf .row .index
1 1.2445147 1.0206258 1.996879 0.8827145 361.2578     1    48    1      1
2 2.0351324 0.9993395 1.578092 0.9248192 590.4612     1    48    1      2
3 3.3140440 0.9590203 1.457254 0.9446582 819.3372     1    48    1      3
4 1.5031974 1.0400601 2.173192 0.8890306 384.5518     1    48    1      4
5 2.2960437 0.9795596 1.759872 0.9320475 658.3761     1    48    1      5
6 0.1674985 1.0487638 2.425859 0.9095534 482.6998     1    48    1      6
set.seed(155)
sample_models_200 <- mosaic::do(500)*(
  county_clean %>% 
    sample_n(size = 200, replace = FALSE) %>% 
    with(lm(pci_2019 ~ pci_2017))
)

# Check it out
head(sample_models_200)
    Intercept  pci_2017    sigma r.squared        F numdf dendf .row .index
1  1.55222839 1.0133282 2.013702 0.9160050 2159.282     1   198    1      1
2  1.67853251 0.9994033 2.288979 0.8658330 1277.772     1   198    1      2
3  0.85803178 1.0394611 1.811912 0.9301205 2635.448     1   198    1      3
4  0.79740594 1.0414248 2.597108 0.8769687 1404.219     1   197    1      4
5 -0.05186957 1.0751463 2.563725 0.8845190 1516.568     1   198    1      5
6  1.06340006 1.0337143 2.094655 0.9126476 2068.681     1   198    1      6

Exercise 9: Impact of sample size (part I)

# 500 sample models using samples of size 10
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_10, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

# 500 sample models using samples of size 50
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_50, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

# 500 sample models using samples of size 200
county_clean %>% 
  ggplot(aes(x = pci_2017, y = pci_2019)) + 
  geom_smooth(method = "lm", se = FALSE) + 
  geom_abline(data = sample_models_200, 
              aes(intercept = Intercept, slope = pci_2017), 
              color = "gray", linewidth = 0.25) + 
  geom_smooth(method = "lm", color = "red", se = FALSE)

The sample model lines become less and less variable from sample to sample.

Exercise 10: Impact of sample size (part II)

No matter the sample size, the sample estimates are normally distributed around the population slope. But as sample size increases, the variability of the sample estimates decreases.

# Don't think too hard about this code!
# Combine the estimates & sample size into a new data set
# Then plot it
data.frame(estimates = c(sample_models_10$pci_2017, sample_models_50$pci_2017, sample_models_200$pci_2017),
           sample_size = rep(c("10","50","200"), each = 500)) %>% 
  mutate(sample_size = fct_relevel(sample_size, c("10", "50", "200"))) %>% 
  ggplot(aes(x = estimates, color = sample_size)) + 
  geom_density() + 
  geom_vline(xintercept = 1.027, color = "red", linetype = "dashed") + 
  labs(title = "Sampling distributions of the sample slope")
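The shrinking spread follows a general pattern: standard errors scale roughly like \(1/\sqrt{n}\). A self-contained toy check of this pattern (using sample means of a hypothetical Normal population, not the county slopes):

```r
# Toy check: the standard error shrinks roughly like 1 / sqrt(n)
set.seed(3)
population <- rnorm(10000)  # hypothetical toy population with sd = 1
se_for_n <- function(n) sd(replicate(1000, mean(sample(population, n))))
se_10  <- se_for_n(10)   # close to 1 / sqrt(10)
se_200 <- se_for_n(200)  # close to 1 / sqrt(200), much smaller
se_10 / se_200           # roughly sqrt(200 / 10), i.e. about 4.5
```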

Exercise 11: Properties of sampling distributions

  1. For all sample sizes, the shape of the sampling distribution is roughly normal and the sampling distribution is roughly centered around 1.027, the true population slope.

  2. As sample size increases:
    The average sample slope estimate IS FAIRLY STABLE.
    The standard error of the sample slopes DECREASES.

  3. Thus, as sample size increases, our sample slopes become MORE RELIABLE.