Loops and iteration - Part 2

Learning goals

After this lesson, you should be able to:

  • Use the map() family of functions in the purrr package to handle repeated tasks


You can download a template Quarto file to start from here. Put this file in a folder called iteration within a folder for this course.





Iteration with purrr

purrr is a tidyverse package that provides several useful functions for iteration. Open up the purrr cheatsheet.

The main advantages of purrr include:

  • Improved readability of R code
  • Less “overhead” when writing a for loop (no need to create storage containers or write out the for (i in ...) scaffolding)

Exercise: Write a for loop that stores the class() (type) of every column in the mtcars data frame. Recall that:

  • vector("numeric/logical/character/list", length) creates a storage container.
  • mtcars[[1]] gets the first column of the data frame.
Solution
col_classes <- vector("character", ncol(mtcars))

# Data frames are **lists** of columns, so this loop iterates over the columns
for (i in seq_along(mtcars)) {
    col_classes[i] <- class(mtcars[[i]])
}

In purrr, we can use the map() family of functions to apply a function to each element of a list or vector. We can think of this as mapping an input (a list or vector) to a new output via a function. The purrr cheatsheet has graphical representations of how these functions work. The variants differ only in the type of output they return:

  • map() returns a list
  • map_chr() returns a character vector
  • map_lgl() returns a logical vector
  • map_int() returns an integer vector
  • map_dbl() returns a numeric (double) vector
  • map_vec() returns a vector of another type (like dates or factors), determined from the results
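As a quick illustration of the list above, here is a minimal sketch that applies several map variants to one small list. The list x and the result names are made up for this example, and map_vec() assumes purrr 1.0.0 or later.

```r
library(purrr)

# A small named list to iterate over (made up for this example)
x <- list(a = 1:3, b = 4:6, c = 7:10)

lengths_list <- map(x, length)      # a list with one integer per element
lengths_int  <- map_int(x, length)  # named integer vector: a = 3, b = 3, c = 4
means_dbl    <- map_dbl(x, mean)    # named double vector: a = 2, b = 5, c = 8.5
classes_chr  <- map_chr(x, class)   # named character vector, all "integer"
is_num_lgl   <- map_lgl(x, is.numeric)  # named logical vector, all TRUE

# map_vec() keeps richer vector types such as Date
dates <- map_vec(x, function(v) as.Date("2024-01-01") + length(v))
```

Each call does the same iteration; only the container for the results changes, and the typed variants raise an error if a result does not fit.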

To get the class() of each data frame column, map_chr() is sensible because classes are strings:

map_chr(mtcars, class)
      mpg       cyl      disp        hp      drat        wt      qsec        vs 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       am      gear      carb 
"numeric" "numeric" "numeric" 

Let’s get the class of each variable in diamonds:

map_chr(diamonds, class)
Error in `map_chr()`:
ℹ In index: 2.
ℹ With name: cut.
Caused by error:
! Result must be length 1, not 2.

What happened!? map_chr() was expecting to create a character vector with exactly one string per column in diamonds. But something went wrong at column 2, the cut variable. Let’s figure out what happened:

class(diamonds$cut)
[1] "ordered" "factor" 

Ah! cut has 2 classes. In this case, map() (which returns a list) is the best option because some variables have multiple classes:

map(diamonds, class)
$carat
[1] "numeric"

$cut
[1] "ordered" "factor" 

$color
[1] "ordered" "factor" 

$clarity
[1] "ordered" "factor" 

$depth
[1] "numeric"

$table
[1] "numeric"

$price
[1] "integer"

$x
[1] "numeric"

$y
[1] "numeric"

$z
[1] "numeric"

The error we encountered with map_chr() is a nice feature of purrr because it allows us to be very sure of the type of output we are getting. (Failing loudly is vastly preferable to getting unexpected outputs silently because we can catch errors earlier.)
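If one string per column is still the goal, one possible workaround is to collapse multiple classes into a single string before map_chr() sees them. This is only a sketch: the small data frame df and the "/" separator are choices made for this example, not part of the lesson's data.

```r
library(purrr)

# A made-up data frame with one ordered factor (two classes) and one numeric column
df <- data.frame(
    size = factor(c("S", "M", "L"), levels = c("S", "M", "L"), ordered = TRUE),
    weight = c(1.2, 3.4, 5.6)
)

# paste(..., collapse = "/") always returns a single string,
# so map_chr() gets exactly one value per column
class_labels <- map_chr(df, ~ paste(class(.x), collapse = "/"))
class_labels
```

Here size is labeled "ordered/factor" and weight is labeled "numeric", so the result is a plain character vector rather than a list.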

Recall that last time we explored syntax and functions for selecting variables in a data frame via the tidy-select documentation (run ?dplyr_tidy_select in the Console). We can combine map_*() functions with tidy selection for some powerful variable summaries that require much less code than for loops.

map_dbl(diamonds %>% select(where(is.numeric)), mean)
       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338 
map_int(diamonds %>% select(!where(is.numeric)), n_distinct)
    cut   color clarity 
      5       7       8 

Exercise: We want to compute several summary statistics on each quantitative variable in a data frame and organize the results in a new data frame (rows = variables, columns = summary statistics).

  • Write a function called summ_stats() that takes a numeric vector x as input and returns the mean, median, standard deviation, and IQR as a data frame. You can use tibble() to create the data frame.
    • Example: tibble(a = 1:2, b = 2:3) creates a data frame with variables a and b.
  • Use a map*() function from purrr to get the summary statistics for each quantitative variable in diamonds.
  • Look up the bind_rows() documentation from dplyr to combine summary statistics for all quantitative variables into one data frame.
    • Note: You’ll notice that the variable names are not present in the output. Try to figure out a way to add variable names back in with mutate() and colnames().
  • (We’ll review solutions for the above parts before you tackle this on your own.) Write a for loop to achieve the same result. Which do you prefer in terms of ease of code writing and readability?
Solution
summ_stats <- function(x) {
    tibble(
        mean = mean(x, na.rm = TRUE),
        median = median(x, na.rm = TRUE),
        sd = sd(x, na.rm = TRUE),
        iqr = IQR(x, na.rm = TRUE)
    )
}

diamonds_num <- diamonds %>% select(where(is.numeric))
diamonds_summ_stats <- map(diamonds_num, summ_stats) %>% bind_rows()
diamonds_summ_stats
# A tibble: 7 × 4
      mean  median       sd     iqr
     <dbl>   <dbl>    <dbl>   <dbl>
1    0.798    0.7     0.474    0.64
2   61.7     61.8     1.43     1.5 
3   57.5     57       2.23     3   
4 3933.    2401    3989.    4374.  
5    5.73     5.7     1.12     1.83
6    5.73     5.71    1.14     1.82
7    3.54     3.53    0.706    1.13
diamonds_summ_stats %>% mutate(variable = colnames(diamonds_num))
# A tibble: 7 × 5
      mean  median       sd     iqr variable
     <dbl>   <dbl>    <dbl>   <dbl> <chr>   
1    0.798    0.7     0.474    0.64 carat   
2   61.7     61.8     1.43     1.5  depth   
3   57.5     57       2.23     3    table   
4 3933.    2401    3989.    4374.   price   
5    5.73     5.7     1.12     1.83 x       
6    5.73     5.71    1.14     1.82 y       
7    3.54     3.53    0.706    1.13 z       
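# Another way: bind_rows(.id = ...) turns the list names into a column.
# map() already keeps the column names, so set_names() is optional here.
diamonds_summ_stats <- map(diamonds_num, summ_stats) %>%
    set_names(colnames(diamonds_num)) %>%
    bind_rows(.id = "variable")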
diamonds_summ_stats
# A tibble: 7 × 5
  variable     mean  median       sd     iqr
  <chr>       <dbl>   <dbl>    <dbl>   <dbl>
1 carat       0.798    0.7     0.474    0.64
2 depth      61.7     61.8     1.43     1.5 
3 table      57.5     57       2.23     3   
4 price    3933.    2401    3989.    4374.  
5 x           5.73     5.7     1.12     1.83
6 y           5.73     5.71    1.14     1.82
7 z           3.54     3.53    0.706    1.13
# with a for loop
diamonds_summ_stats2 <- vector("list", length = ncol(diamonds_num))
for (i in seq_along(diamonds_num)) {
    diamonds_summ_stats2[[i]] <- summ_stats(diamonds_num[[i]])
}
diamonds_summ_stats2 %>%
    set_names(colnames(diamonds_num)) %>%
    bind_rows(.id = "variable")
# A tibble: 7 × 5
  variable     mean  median       sd     iqr
  <chr>       <dbl>   <dbl>    <dbl>   <dbl>
1 carat       0.798    0.7     0.474    0.64
2 depth      61.7     61.8     1.43     1.5 
3 table      57.5     57       2.23     3   
4 price    3933.    2401    3989.    4374.  
5 x           5.73     5.7     1.12     1.83
6 y           5.73     5.71    1.14     1.82
7 z           3.54     3.53    0.706    1.13
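As a possible alternative to bind_rows(), purrr 1.0.0 and later also provides list_rbind(), which row-binds a list of data frames and can store the list names in a column. The sketch below uses the built-in mtcars data and a shortened, hypothetical summ_stats_sketch() function so that it runs on its own.

```r
library(purrr)
library(tibble)

# Shortened stand-in for summ_stats() (hypothetical name, two statistics only)
summ_stats_sketch <- function(x) {
    tibble(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# map() returns a named list of one-row tibbles, one per column
stats_list <- map(mtcars[c("mpg", "wt")], summ_stats_sketch)

# names_to stores the list names (the column names) in a new variable
mtcars_summ <- list_rbind(stats_list, names_to = "variable")
mtcars_summ
```

The result has one row per variable, with the variable name in the first column.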

purrr also offers the pmap() family of functions, which take multiple inputs and loop over them simultaneously. The purrr cheatsheet has graphical representations of how these functions work.

string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)
string_data
# A tibble: 3 × 3
  string pattern replacement
  <chr>  <chr>   <chr>      
1 apple  p       P          
2 banana n       N          
3 cherry h       H          
pmap_chr(string_data, str_replace_all)
[1] "aPPle"  "baNaNa" "cHerry"

Note how the column names in string_data exactly match the argument names in str_replace_all(). The iteration that is happening is across rows, and the multiple arguments in str_replace_all() are being matched by name.
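To make the matching-by-name point concrete, here is a small sketch with a hypothetical helper, describe_fruit(), whose argument names match the columns of a made-up tibble. The columns are deliberately listed in a different order than the function's arguments.

```r
library(purrr)
library(tibble)

# Hypothetical helper; its argument names must match the column names below
describe_fruit <- function(name, count, unit) {
    paste(count, unit, "of", name)
}

# Columns appear in a different order than the function's arguments
fruit_data <- tibble(
    unit = c("kg", "boxes"),
    name = c("apples", "cherries"),
    count = c(2, 5)
)

# Each row becomes one call, e.g. describe_fruit(unit = "kg", name = "apples", count = 2)
fruit_labels <- pmap_chr(fruit_data, describe_fruit)
fruit_labels
```

This gives "2 kg of apples" and "5 boxes of cherries": because the arguments are matched by name, the column order does not matter.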

We can also use pmap() to specify variations in some arguments but leave some arguments constant across the iterations:

string_data <- tibble(
    pattern = c("p", "n", "h"),
    replacement = c("P", "N", "H")
)

pmap_chr(string_data, str_replace_all, string = "ppp nnn hhh")
[1] "PPP nnn hhh" "ppp NNN hhh" "ppp nnn HHH"
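When the column names do not match the function's argument names, one option is to wrap the call in an anonymous function whose arguments match the columns. A minimal sketch, where the tibble swaps and its column names from and to are made up for this example:

```r
library(purrr)
library(tibble)
library(stringr)

# Column names here do not match str_replace_all()'s argument names
swaps <- tibble(
    from = c("p", "n", "h"),
    to = c("P", "N", "H")
)

# The wrapper's arguments match the columns and pass them on explicitly;
# the string is held constant inside the wrapper
swapped <- pmap_chr(swaps, function(from, to) {
    str_replace_all("ppp nnn hhh", pattern = from, replacement = to)
})
swapped
```

This reproduces the result above, "PPP nnn hhh" "ppp NNN hhh" "ppp nnn HHH", without renaming any columns.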

Exercise: Create 2 small examples that show how pmap() works with str_sub(). Your examples should:

  • Use different arguments for string, start, and end
  • Use different arguments for start and end but a fixed string
Solution
string_data <- tibble(
    string = c("apple", "banana", "cherry"),
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)

pmap_chr(string_data, str_sub)
[1] "ap" "an" "rr"
string_data <- tibble(
    start = c(1, 2, 4),
    end = c(2, 3, 5)
)
pmap_chr(string_data, str_sub, string = "abcde")
[1] "ab" "bc" "de"

Exercise: Last class we worked on an extended exercise where our goal was to write a series of functions and a for loop to repeat linear model fitting under different “settings” (removal of outliers, model formula choice). Repeat this exercise using pmap(). You’ll need to use the df_arg_combos object, your remove_outliers() function, and your fit_model() function.

Solution
pmap(df_arg_combos, fit_model, data = diamonds)
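The call above only works if the column names of df_arg_combos match the argument names of fit_model(). Those definitions come from last class and are not reproduced here, so the sketch below uses hypothetical stand-ins (remove_outliers_sketch(), fit_model_sketch(), combos) and the built-in mtcars data purely to show the pattern. Your own functions and argument names may differ.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for remove_outliers(): drops rows whose `var` value
# lies more than 1.5 * IQR beyond the first or third quartile
remove_outliers_sketch <- function(data, var) {
    x <- data[[var]]
    q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
    fence <- 1.5 * (q[2] - q[1])
    data[x >= q[1] - fence & x <= q[2] + fence, , drop = FALSE]
}

# Hypothetical stand-in for fit_model(): the argument names mod_formula and
# drop_outliers match the column names of combos below
fit_model_sketch <- function(data, mod_formula, drop_outliers) {
    if (drop_outliers) {
        data <- remove_outliers_sketch(data, "mpg")
    }
    lm(as.formula(mod_formula), data = data)
}

# One row per "setting" (assumed layout, for illustration only)
combos <- tibble(
    mod_formula = c("mpg ~ wt", "mpg ~ wt", "mpg ~ wt + hp", "mpg ~ wt + hp"),
    drop_outliers = c(FALSE, TRUE, FALSE, TRUE)
)

# data = mtcars is held constant; the other arguments vary by row
fits <- pmap(combos, fit_model_sketch, data = mtcars)
length(fits)
```

The result is a list with one fitted model per row of combos, which can then be passed to another map() call to extract coefficients or other summaries.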