# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Ideal E SI2 61.5 55 326 4 4 2.4
2 0.2 Premium E SI1 59.8 61 326 3.9 3.8 2.3
3 0.2 Good E VS1 56.9 65 327 4 4.1 2.3
4 0.3 Premium I VS2 62.4 58 334 4.2 4.2 2.6
5 0.3 Good J SI2 63.3 58 335 4.3 4.3 2.8
6 0.2 Very Good J VVS2 62.8 57 336 3.9 4 2.5
7 0.2 Very Good I VVS1 62.3 57 336 4 4 2.5
8 0.3 Very Good H SI1 61.9 55 337 4.1 4.1 2.5
9 0.2 Fair E VS2 65.1 61 337 3.9 3.8 2.5
10 0.2 Very Good H VS1 59.4 61 338 4 4 2.4
# ℹ 53,930 more rows
dplyr provides the across() function for performing these repeated function calls:
# Option 1: Create our own named functionround_to_one <-function(x) {round(x, digits =1)}diamonds %>%mutate(across(.cols =c(carat, x, y, z), .fns = round_to_one))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Ideal E SI2 61.5 55 326 4 4 2.4
2 0.2 Premium E SI1 59.8 61 326 3.9 3.8 2.3
3 0.2 Good E VS1 56.9 65 327 4 4.1 2.3
4 0.3 Premium I VS2 62.4 58 334 4.2 4.2 2.6
5 0.3 Good J SI2 63.3 58 335 4.3 4.3 2.8
6 0.2 Very Good J VVS2 62.8 57 336 3.9 4 2.5
7 0.2 Very Good I VVS1 62.3 57 336 4 4 2.5
8 0.3 Very Good H SI1 61.9 55 337 4.1 4.1 2.5
9 0.2 Fair E VS2 65.1 61 337 3.9 3.8 2.5
10 0.2 Very Good H VS1 59.4 61 338 4 4 2.4
# ℹ 53,930 more rows
# Option 2: Use an "anonymous" or "lambda" function that isn't nameddiamonds %>%mutate(across(.cols =c(carat, x, y, z), .fns =function(x) {round(x, digits =1)} ))
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.2 Ideal E SI2 61.5 55 326 4 4 2.4
2 0.2 Premium E SI1 59.8 61 326 3.9 3.8 2.3
3 0.2 Good E VS1 56.9 65 327 4 4.1 2.3
4 0.3 Premium I VS2 62.4 58 334 4.2 4.2 2.6
5 0.3 Good J SI2 63.3 58 335 4.3 4.3 2.8
6 0.2 Very Good J VVS2 62.8 57 336 3.9 4 2.5
7 0.2 Very Good I VVS1 62.3 57 336 4 4 2.5
8 0.3 Very Good H SI1 61.9 55 337 4.1 4.1 2.5
9 0.2 Fair E VS2 65.1 61 337 3.9 3.8 2.5
10 0.2 Very Good H VS1 59.4 61 338 4 4 2.4
# ℹ 53,930 more rows
When we look at the documentation for across(), we see that the .cols argument specifies which variables we want to transform, and it has a <tidy-select> tag. This means that the syntax we use for .cols follows certain rules. Let’s click this link to explore the possibilities for selecting variables.
Read through the “Overview of selection features” section to get an overall sense of the many ways to select variables.
Navigate back to the across() documentation and read through the Examples section at the bottom. Click the “Run examples” link to view the output for all of the examples.
Exercises
(Not a pair programming exercise, but check in with each other as you work)
Using the diamonds dataset:
Transform the x, y, and z columns so that the units of millimeters are displayed (e.g., “4.0 mm”).
Convert all numeric columns into character columns.
Hint: type is. and hit Tab in the Console. Scroll through the function options. Do the same with as.
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <chr> <chr> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 mm 3.98 mm 2.43 mm
2 0.21 Premium E SI1 59.8 61 326 3.89 mm 3.84 mm 2.31 mm
3 0.23 Good E VS1 56.9 65 327 4.05 mm 4.07 mm 2.31 mm
4 0.29 Premium I VS2 62.4 58 334 4.2 mm 4.23 mm 2.63 mm
5 0.31 Good J SI2 63.3 58 335 4.34 mm 4.35 mm 2.75 mm
6 0.24 Very Good J VVS2 62.8 57 336 3.94 mm 3.96 mm 2.48 mm
7 0.24 Very Good I VVS1 62.3 57 336 3.95 mm 3.98 mm 2.47 mm
8 0.26 Very Good H SI1 61.9 55 337 4.07 mm 4.11 mm 2.53 mm
9 0.22 Fair E VS2 65.1 61 337 3.87 mm 3.78 mm 2.49 mm
10 0.23 Very Good H VS1 59.4 61 338 4 mm 4.05 mm 2.39 mm
# ℹ 53,930 more rows
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<chr> <ord> <ord> <ord> <chr> <chr> <chr> <chr> <chr> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
What if we wanted to perform multiple transformations on each of many variables?
Within the different values of diamond cut, let’s summarize the mean, median, and standard deviation of the numeric variables. When we look at the .fns argument in the across() documentation, we see that we can provide a list of functions:
List of 3
$ mean:function (x, ...)
$ med :function (x, na.rm = FALSE, ...)
$ sd :function (x, na.rm = FALSE)
We’ll be working more with lists soon.
for loops
In R, for loops have the following general structure:
for (i in some_vector) {# Code to do stuff with i}
some_vector can be any vector, including:
An indexing vector: 1:3
A character vector: c("group1", "group2", "group3")
A vector of any other class
groups <-c("group1", "group2", "group3")for (i in1:3) {print(groups[i])}
[1] "group1"
[1] "group2"
[1] "group3"
for (g in groups) {print(g)}
[1] "group1"
[1] "group2"
[1] "group3"
The seq_along() function generates an integer sequence from 1 to the length of the vector supplied. A nice feature of seq_along() is that it generates an empty iteration vector if the vector you’re iterating over itself has length 0.
seq_along(groups)
[1] 1 2 3
no_groups <-c()seq_along(no_groups)
integer(0)
for (i inseq_along(groups)) {print(groups[i])}
[1] "group1"
[1] "group2"
[1] "group3"
for (i inseq_along(no_groups)) {print(no_groups[i])}
Closely related to seq_along() is seq_len(). While seq_along(x) generates an integer sequence from 1 to length(x), seq_len(x) takes x itself to be a length:
seq_len(3)
[1] 1 2 3
seq_len(0)
integer(0)
for (i inseq_len(length(groups))) {print(groups[i])}
[1] "group1"
[1] "group2"
[1] "group3"
for (i inseq_len(length(no_groups))) {print(no_groups[i])}
seq_len() is useful for iterating over the rows of a data frame because seq_along() would iterate over columns:
small_data <-tibble(a =1:2, b =2:3, c =4:5)for (col in small_data) {print(col)}
[1] 1 2
[1] 2 3
[1] 4 5
for (r inseq_len(nrow(small_data))) {print(r)}
[1] 1
[1] 2
Often we’ll want to store output created during a for loop. We can create storage containers with the vector() function:
for (i inseq_len(3)) { char_storage[i] <-str_c("Number: ", i) num_storage[i] <-2*i list_storage[[i]] <- i # Note the [[ for subsetting here}char_storage
[1] "Number: 1" "Number: 2" "Number: 3"
num_storage
[1] 2 4 6
list_storage
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Exercises
Pair programming exercises: Whoever has most recently eaten dessert (broadly interpreted) should be the driver first. Switch after Exercise 2.
Write for-loops that do each of the following:
Prints the even numbers from 2:20
Produce the same output with the seq() function
Iterates over the month.name vector and stores a character vector of output containing strings like “Month 1: January”, “Month 2: February”.
Produce the same output with str_c() only.
On the diamonds dataset, fit models of price vs. carat separately for each value of cut, and store the fitted models in a list storage container.
data(diamonds)# Fit models of price vs. carat separately for each value of cutunique_cuts <- diamonds %>%pull(cut) %>%levels()lin_mod_results <-vector(mode ="list", length =length(unique_cuts))for (i inseq_along(unique_cuts)) { this_cut <- unique_cuts[i] diamonds_sub <- diamonds %>%filter(cut==this_cut)# The double square brackets [[i]] accesses the ith element of a list lin_mod_results[[i]] <-lm(price ~ carat, data = diamonds_sub)}
Case study: performing many different versions of an analysis
One of my most common use cases for writing functions and iteration/looping is to perform some exploration or modeling repeatedly for several different “settings”. For example, our broad goal might be to fit a linear regression model to our data. However, there are often multiple choices that we have to make in practice:
Keep missing values or fill them in (imputation)?
Fit the model only on a certain group of cases?
Filter out outliers in one or more variables?
Transform certain variables? (e.g., log transformation)
We can map any number of these choices to arguments in a custom model-fitting function:
impute: TRUE or FALSE
filter_to: This could be a string description like “All cases”, “Group 1”, or “Groups 1 and 2”
fit_model <-function(data, impute, filter_to) {if (impute) { data <-some_imputation_function(data) }if (filter_to=="Group 1") { data <- data %>%filter(group==1) } elseif (filter_to=="Groups 1 and 2") { data <- data %>%filter(group %in%c(1,2)) }lm(y ~ x1 + x2, data = data)}
The tidyr package has a useful function called crossing() that is useful for generating argument combinations. For each argument, we specify all possible values for that argument. crossing() generates all combinations:
# A tibble: 6 × 2
impute filter_to
<lgl> <chr>
1 FALSE All cases
2 FALSE Group 1
3 FALSE Groups 1 and 2
4 TRUE All cases
5 TRUE Group 1
6 TRUE Groups 1 and 2
# Another examplecrossing(option1 =1:3,option2 =c(TRUE, FALSE),option3 =c("low", "medium", "high"))
# A tibble: 18 × 3
option1 option2 option3
<int> <lgl> <chr>
1 1 FALSE high
2 1 FALSE low
3 1 FALSE medium
4 1 TRUE high
5 1 TRUE low
6 1 TRUE medium
7 2 FALSE high
8 2 FALSE low
9 2 FALSE medium
10 2 TRUE high
11 2 TRUE low
12 2 TRUE medium
13 3 FALSE high
14 3 FALSE low
15 3 FALSE medium
16 3 TRUE high
17 3 TRUE low
18 3 TRUE medium
With loops in our toobox, we can iterate the fit_model() function over the combinations in df_arg_combos.
Exercises
Goal: In the diamonds dataset, we want to understand the relationship between price and size (carat). We want to explore variation along two choices:
The variables included in the model. We’ll explore 3 sets of variables:
No further variables (just price and carat)
Adjusting for cut
Adjusting for cut and clarity
Adjusting for cut, clarity, and color
Whether or not to remove outliers in the carat variable. We’ll define outliers as cases whose carat is over 3 SDs away from the mean.
Work with your partner on the following exercises (not in a pair-programming fashion). As you work, make note of what is challenging and any helpful thought processes/strategies that arise from the collaboration.
Exercise 1: Use crossing() to create the data frame of argument combinations for our analyses. Call it df_arg_combos. Note that you can create a list of formula objects in R with c(y ~ x1, y ~ x1 + x2). (Something like this will be the right hand side of an argument to crossing().)
Exercise 2: Write a function called remove_outliers that removes outliers in a dataset. The user should be able to supply the dataset (data), the variable to remove outliers in (what_var), and a threshold on the number of SDs away from the mean used to define outliers (sd_thresh). Write your function so that it runs as follows: remove_outliers(diamonds, what_var = carat, sd_thresh = 3).
Exercise 3: Write a function called fit_model that implements the different settings for our diamonds analysis.
fit_model <-function(data, mod_formula, remove_outliers) {# remove_outliers is a TRUE/FALSE flag of whether or not to remove outliers# This function implements our specific use case: outliers are cases that are 3 SDs away from the mean for the carat variable# Use mod_formula as the first argument of lm()}
Exercise 4: Write a for loop that stores the fitted linear models from all versions of the analysis.
Recall that you can pull out the contents of a single data frame column in many ways. For a data frame df with a variable named x:
df$x
df %>% pull(x)
df[["x"]]
Note that in df_arg_combos:
mod_formula: this column is a list and you can extract the 1st element with [[1]]
remove_outliers: this column is a vector and you can extract the 1st element with [1]
Solution
lin_mod_res_for <-vector(mode ="list", length =nrow(df_arg_combos))for (i inseq_along(lin_mod_res_for)) { this_formula <- df_arg_combos$mod_formula[[i]] # Double [[ for the **list** of formulas this_remove_outliers <- df_arg_combos$remove_outliers[i] # Single [ for the **atomic vector** of logicals lin_mod_res_for[[i]] <-fit_model(data = diamonds,mod_formula = this_formula,remove_outliers = this_remove_outliers )}