# Create a character vector to hold one class per column
col_classes <- vector("character", ncol(mtcars))
# Data frames are lists of columns, so this loop iterates over the columns
for (i in seq_along(mtcars)) {
    col_classes[i] <- class(mtcars[[i]])
}
col_classes
Loops and iteration - Part 2
Learning goals
After this lesson, you should be able to:
- Use the map() family of functions in the purrr package to handle repeated tasks
You can download a template Quarto file to start from here. Put this file in a folder called iteration within a folder for this course.
Iteration with purrr
purrr is a tidyverse package that provides several useful functions for iteration. Open up the purrr cheatsheet.
The main advantages of purrr include:
- Improved readability of R code
- Reduction in the “overhead” in writing a for loop (creating storage containers and writing the for (i in ...) line)
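To make the "overhead" point concrete, here is a minimal sketch that does the same job twice: once with a for loop and once with map_int(). The list scores is a made-up input used only for illustration.

```r
library(purrr)

# Hypothetical input: a small named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for loop: allocate storage, fill it element by element, restore the names
n_loop <- vector("integer", length(scores))
for (i in seq_along(scores)) {
  n_loop[i] <- length(scores[[i]])
}
names(n_loop) <- names(scores)

# purrr: one call, names are kept, and the output type is in the function name
n_map <- map_int(scores, length)

identical(n_loop, n_map)
```

Both versions produce the same named integer vector; the purrr version simply has no storage container or index variable to manage.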
Exercise: Write a for loop that stores the class() (type) of every column in the mtcars data frame. Recall that:
- vector("numeric/logical/character/list", length) creates a storage container.
- mtcars[[1]] gets the first column of the data frame.
Solution
In purrr, we can use the map() family of functions to apply a function to each element of a list or vector. We can think of this as mapping an input (a list or vector) to a new output via a function. The purrr cheatsheet has graphical representations of how these functions work.
- map() returns a list
- map_chr() returns a character vector
- map_lgl() returns a logical vector
- map_int() returns an integer vector
- map_dbl() returns a numeric (double) vector
- map_vec() returns a vector of another type (like dates or factors)
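The list above can be checked on a toy input. This is a small sketch, assuming purrr 1.0.0 or later for map_vec(); the list x and the date used are invented for illustration.

```r
library(purrr)

# Hypothetical input: a named list of two integer vectors
x <- list(a = 1:3, b = 4:6)

as_list <- map(x, sum)             # list of length 2
as_int  <- map_int(x, sum)         # integer vector: sum() of integers is an integer
as_dbl  <- map_dbl(x, mean)        # double vector
as_chr  <- map_chr(x, typeof)      # character vector
as_lgl  <- map_lgl(x, is.numeric)  # logical vector

# map_vec() keeps richer vector types such as Date
as_date <- map_vec(x, function(v) as.Date("2024-01-01") + v[1])
class(as_date)
```

In each case the suffix of the function name states the type of the result, so the reader does not need to run the code to know what comes back.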
To get the class() of each data frame column, map_chr() is sensible because classes are strings:
map_chr(mtcars, class)
mpg cyl disp hp drat wt qsec vs
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
am gear carb
"numeric" "numeric" "numeric"
Let’s get the class of each variable in diamonds:
map_chr(diamonds, class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.
What happened? map_chr() was expecting to create a character vector with one element per column in diamonds, but the result for column 2 (the cut variable) had length 2 instead of 1. Let’s figure out why:
class(diamonds$cut)
[1] "ordered" "factor"
Ah! cut has 2 classes. In this case, map() (which returns a list) is the best option because some variables have multiple classes:
map(diamonds, class)
$carat
[1] "numeric"
$cut
[1] "ordered" "factor"
$color
[1] "ordered" "factor"
$clarity
[1] "ordered" "factor"
$depth
[1] "numeric"
$table
[1] "numeric"
$price
[1] "integer"
$x
[1] "numeric"
$y
[1] "numeric"
$z
[1] "numeric"
The error we encountered with map_chr() is a nice feature of purrr because it lets us be very sure of the type of output we are getting. (Failing loudly is vastly preferable to silently getting unexpected output because we catch errors earlier.)
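As a rough illustration of "failing loudly", the sketch below compares base R's sapply() with map_chr() on a made-up data frame df that has one ordered factor column (so it does not need the diamonds data). The collapsing step at the end is one possible workaround, not the only one.

```r
library(purrr)

# Hypothetical data frame with one ordered factor column
df <- data.frame(
  size = c(1.5, 2.0, 3.2),
  grade = factor(c("low", "high", "mid"),
                 levels = c("low", "mid", "high"), ordered = TRUE)
)

# sapply() quietly returns a list instead of a character vector
base_result <- sapply(df, class)
is.list(base_result)

# map_chr() raises an error instead; we catch it here only to show that it fails
purrr_result <- tryCatch(
  map_chr(df, class),
  error = function(e) "map_chr() failed"
)
purrr_result

# If one string per column is really wanted, collapse the classes first
collapsed <- map_chr(df, function(col) paste(class(col), collapse = "/"))
collapsed
```

With sapply(), code further down that expects a character vector would break later and less obviously; with map_chr(), the problem surfaces at the call that caused it.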
Recall that last time we explored syntax and functions for selecting variables in a data frame via the tidy-select documentation (run ?dplyr_tidy_select in the Console). We can combine map_*() functions with tidy selection for some powerful variable summaries that require much less code than for loops.
map_dbl(diamonds %>% select(where(is.numeric)), mean)
carat depth table price x y
0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572 5.7345260
z
3.5387338
map_int(diamonds %>% select(!where(is.numeric)), n_distinct)
cut color clarity
5 7 8
Exercise: We want to compute several summary statistics on each quantitative variable in a data frame and organize the results in a new data frame (rows = variables, columns = summary statistics).
- Write a function called summ_stats() that takes a numeric vector x as input and returns the mean, median, standard deviation, and IQR as a data frame. You can use tibble() to create the data frame.
    - Example: tibble(a = 1:2, b = 2:3) creates a data frame with variables a and b.
- Use a map*() function from purrr to get the summary statistics for each quantitative variable in diamonds.
- Look up the bind_rows() documentation from dplyr to combine summary statistics for all quantitative variables into one data frame.
    - Note: You’ll notice that the variable names are not present in the output. Try to figure out a way to add variable names back in with mutate() and colnames().
- (We’ll review solutions for the above parts before you tackle this on your own.) Write a for loop to achieve the same result. Which do you prefer in terms of ease of code writing and readability?
Solution
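# Return the four summary statistics for one numeric vector as a one-row tibble
summ_stats <- function(x) {
    tibble(
        mean = mean(x, na.rm = TRUE),
        median = median(x, na.rm = TRUE),
        sd = sd(x, na.rm = TRUE),
        iqr = IQR(x, na.rm = TRUE)
    )
}
# Keep only the numeric columns, summarize each one, and stack the results
diamonds_num <- diamonds %>% select(where(is.numeric))
diamonds_summ_stats <- map(diamonds_num, summ_stats) %>% bind_rows()
diamonds_summ_stats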
# A tibble: 7 × 4
mean median sd iqr
<dbl> <dbl> <dbl> <dbl>
1 0.798 0.7 0.474 0.64
2 61.7 61.8 1.43 1.5
3 57.5 57 2.23 3
4 3933. 2401 3989. 4374.
5 5.73 5.7 1.12 1.83
6 5.73 5.71 1.14 1.82
7 3.54 3.53 0.706 1.13
diamonds_summ_stats %>% mutate(variable = colnames(diamonds_num))
# A tibble: 7 × 5
mean median sd iqr variable
<dbl> <dbl> <dbl> <dbl> <chr>
1 0.798 0.7 0.474 0.64 carat
2 61.7 61.8 1.43 1.5 depth
3 57.5 57 2.23 3 table
4 3933. 2401 3989. 4374. price
5 5.73 5.7 1.12 1.83 x
6 5.73 5.71 1.14 1.82 y
7 3.54 3.53 0.706 1.13 z
# Another way: name the list elements, then let bind_rows() store the names in a column
diamonds_summ_stats <- map(diamonds_num, summ_stats) %>% set_names(colnames(diamonds_num)) %>% bind_rows(.id = "variable")
diamonds_summ_stats
# A tibble: 7 × 5
variable mean median sd iqr
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.7 0.474 0.64
2 depth 61.7 61.8 1.43 1.5
3 table 57.5 57 2.23 3
4 price 3933. 2401 3989. 4374.
5 x 5.73 5.7 1.12 1.83
6 y 5.73 5.71 1.14 1.82
7 z 3.54 3.53 0.706 1.13
# with a for loop
diamonds_summ_stats2 <- vector("list", length = ncol(diamonds_num))
for (i in seq_along(diamonds_num)) {
    diamonds_summ_stats2[[i]] <- summ_stats(diamonds_num[[i]])
}
diamonds_summ_stats2 %>%
    set_names(colnames(diamonds_num)) %>%
    bind_rows(.id = "variable")
# A tibble: 7 × 5
variable mean median sd iqr
<chr> <dbl> <dbl> <dbl> <dbl>
1 carat 0.798 0.7 0.474 0.64
2 depth 61.7 61.8 1.43 1.5
3 table 57.5 57 2.23 3
4 price 3933. 2401 3989. 4374.
5 x 5.73 5.7 1.12 1.83
6 y 5.73 5.71 1.14 1.82
7 z 3.54 3.53 0.706 1.13
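For comparison, purrr 1.0.0 and later also provides list_rbind(), which can do the stacking and the naming in one step. The sketch below is an alternative to the solutions above, not a replacement for them; it uses a shortened two-statistic version of summ_stats() and two mtcars columns so that it runs without the diamonds data.

```r
library(purrr)
library(tibble)

# Shortened version of summ_stats() for illustration (two statistics only)
summ_stats_short <- function(x) {
  tibble(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE)
  )
}

# map() keeps the column names as list names;
# list_rbind() stacks the one-row tibbles and stores those names in `variable`
cars_summ <- list_rbind(
  map(mtcars[c("mpg", "wt")], summ_stats_short),
  names_to = "variable"
)
cars_summ
```

Because map() already carries the column names through, no separate set_names() step is needed here.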
purrr also offers the pmap() family of functions, which take multiple inputs and loop over them simultaneously. The purrr cheatsheet has graphical representations of how these functions work.
# Each column supplies one argument of str_replace_all()
string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)
string_data
# A tibble: 3 × 3
string pattern replacement
<chr> <chr> <chr>
1 apple p P
2 banana n N
3 cherry h H
pmap_chr(string_data, str_replace_all)
[1] "aPPle" "baNaNa" "cHerry"
Note how the column names in string_data exactly match the argument names in str_replace_all(). The iteration that is happening is across rows, and the multiple arguments in str_replace_all() are being matched by name.
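To see the name matching in isolation, here is a small sketch with a made-up helper, describe(), whose arguments are deliberately listed in a different order from the columns of the input. The helper and the data are hypothetical.

```r
library(purrr)
library(tibble)

# Hypothetical helper with three named arguments
describe <- function(name, n, unit) {
  paste(n, unit, "of", name)
}

# Column order differs from the argument order; matching is by name
args_named <- tibble(
  unit = c("kg", "bags"),
  name = c("rice", "flour"),
  n = c(2, 3)
)
by_name <- pmap_chr(args_named, describe)

# With an unnamed list, arguments are matched by position instead
args_unnamed <- list(c("rice", "flour"), c(2, 3), c("kg", "bags"))
by_position <- pmap_chr(args_unnamed, describe)

by_name
identical(by_name, by_position)
```

Naming the inputs is usually the safer choice because reordering the columns then cannot change the result.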
We can also use pmap() to specify variations in some arguments but leave some arguments constant across the iterations:
# Only pattern and replacement vary; string is supplied once in the pmap_chr() call
string_data <- tibble(
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)
pmap_chr(string_data, str_replace_all, string = "ppp nnn hhh")
[1] "PPP nnn hhh" "ppp NNN hhh" "ppp nnn HHH"
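One way to read a pmap() call with a fixed argument is as the for loop it replaces. The sketch below writes both versions side by side using a made-up helper, repeat_word(); the helper and its inputs are assumptions for illustration only.

```r
library(purrr)

# Hypothetical helper: repeat `word` `times` times, joined by `sep`
repeat_word <- function(word, times, sep) {
  paste(rep(word, times), collapse = sep)
}

# The arguments that vary across iterations
args <- list(word = c("ha", "ho"), times = c(2, 3))

# `sep` is passed after the function and is reused on every iteration
via_pmap <- pmap_chr(args, repeat_word, sep = "-")

# The same work written as a for loop
via_loop <- vector("character", length(args$word))
for (i in seq_along(args$word)) {
  via_loop[i] <- repeat_word(word = args$word[i], times = args$times[i], sep = "-")
}

via_pmap
identical(via_pmap, via_loop)
```

Anything placed after the function name in the pmap() call behaves like the constant sep = "-" inside the loop body.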
Exercise: Create 2 small examples that show how pmap() works with str_sub(). Your examples should:
- Use different arguments for string, start, and end
- Use different arguments for start and end but a fixed string
Solution
# Example 1: string, start, and end all vary by row
string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)
pmap_chr(string_data, str_sub)
[1] "ap" "an" "rr"
# Example 2: start and end vary by row, string is fixed
string_data <- tibble(
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)
pmap_chr(string_data, str_sub, string = "abcde")
[1] "ab" "bc" "de"
Exercise: Last class we worked on an extended exercise where our goal was to write a series of functions and a for loop to repeat linear model fitting under different “settings” (removal of outliers, model formula choice). Repeat this exercise using pmap(). You’ll need to use the df_arg_combos object, your remove_outliers() function, and your fit_model() function.
Solution
pmap(df_arg_combos, fit_model, data = diamonds)
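The objects from last class are not reproduced on this page, so the sketch below uses hypothetical stand-ins to show the shape of the call: arg_combos_sketch plays the role of df_arg_combos, and fit_model_sketch() plays the role of fit_model(). The argument names mod_formula and remove_outliers, the 1.5 * IQR outlier rule, and the use of mtcars are all assumptions; your own versions may differ.

```r
library(purrr)

# Hypothetical stand-in for fit_model(): optionally drops outliers in the
# response (1.5 * IQR rule), then fits a linear model
fit_model_sketch <- function(data, mod_formula, remove_outliers) {
  mod_formula <- as.formula(mod_formula)
  if (remove_outliers) {
    response <- data[[all.vars(mod_formula)[1]]]
    q <- quantile(response, c(0.25, 0.75))
    fence <- 1.5 * (q[2] - q[1])
    data <- data[response >= q[1] - fence & response <= q[2] + fence, ]
  }
  lm(mod_formula, data = data)
}

# Hypothetical stand-in for df_arg_combos: one row per combination of settings.
# The column names must match the argument names of the fitting function.
arg_combos_sketch <- expand.grid(
  mod_formula = c("mpg ~ wt", "mpg ~ wt + hp"),
  remove_outliers = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One model per row; `data` is held constant across iterations
fits <- pmap(arg_combos_sketch, fit_model_sketch, data = mtcars)
length(fits)
```

The result is a list with one fitted model per row of the settings data frame, which can then be passed to another map() call to extract coefficients or other summaries.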