average <- function(x, remove_nas) {
    sum(x, na.rm = remove_nas)/length(x)
}
average2 <- function(x, remove_nas) {
    return(sum(x, na.rm = remove_nas)/length(x))
}
average3 <- function(x, remove_nas = TRUE) {
    sum(x, na.rm = remove_nas)/length(x)
}
Functions
Learning goals
After this lesson, you should be able to:
- Recognize when it would be useful to write a function
- Identify the core components of a function definition and explain their role (the function() directive, arguments, argument defaults, function body, return value)
- Describe the difference between argument matching by position and by name
- Write if-else and if-else if-else statements to conditionally execute code
- Write your own function to carry out a repeated task
- Provide feedback on functions written by others
You can download a template Quarto file to start from here. Put this file in a folder called functions within a folder for this course.
Functions and control structures
Why functions?
Getting really good at writing useful and reusable functions is one of the best ways to increase your expertise in data science. It requires a lot of practice.
If you’ve copied and pasted code 3 or more times, it’s time to write a function.
- Reducing errors: Copy+paste+modify is prone to errors (e.g., forgetting to change a variable name)
- Efficiency: If you need to update code, you only need to do it in one place. This allows reuse of code within and across projects.
- Readability: Encapsulating code within a function with a descriptive name makes code more readable.
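To make the copy+paste risk concrete, here is a minimal sketch in base R. The data frame df and the function name center are made up for this illustration and are not part of the lesson materials.

```r
# Hypothetical data, made up for this illustration
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 8))

# Copy+paste+modify: the same computation written three times.
# It is easy to forget to change one of the variable names.
a_centered <- df$a - mean(df$a, na.rm = TRUE)
b_centered <- df$b - mean(df$b, na.rm = TRUE)
c_centered <- df$c - mean(df$c, na.rm = TRUE)

# The same task as a function: the logic lives in one place
center <- function(x) {
    x - mean(x, na.rm = TRUE)
}

center(df$a)
center(df$b)
center(df$c)
```

If the centering rule ever changes (say, to use the median), only the body of center() needs to be edited.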
Core parts of a function
What does a function look like?
The core parts of a function include:
- The function() directive
  - This is what tells R to create a function.
- Arguments: x and remove_nas are the function inputs.
  - In average3, the remove_nas argument has a default value of TRUE.
- Function body
  - The code inside the curly braces {} is where all the work happens. This code uses the function arguments to perform computations.
- Return value
  - The value of the last expression evaluated in the function body is what the function returns. (This is generally the last line, written without an assignment operator <-.)
  - As in average2(), we can also explicitly return an object by putting it inside return().
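The two ways of returning a value can be sketched side by side. The function names add_one and safe_log are hypothetical and chosen only for this illustration.

```r
# Implicit return: the value of the last evaluated expression is returned
add_one <- function(x) {
    x + 1
}

# Explicit return: return() exits the function immediately with that value
safe_log <- function(x) {
    if (x <= 0) {
        return(NA)  # exit early for input that has no real logarithm
    }
    log(x)
}

add_one(2)    # 3
safe_log(-1)  # NA
safe_log(1)   # 0
```

An explicit return() is mostly useful for exiting early; at the end of a function, the implicit form is the more common R style.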
When a function has default values for arguments, those arguments don’t have to be supplied if you want to use the default value:
# Both give the same result
average3(c(1, 2, 3, NA))
[1] 1.5
average3(c(1, 2, 3, NA), remove_nas = TRUE)
[1] 1.5
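For contrast, a short sketch of overriding the default. The definition of average3 is repeated so that the block runs on its own.

```r
average3 <- function(x, remove_nas = TRUE) {
    sum(x, na.rm = remove_nas)/length(x)
}

# Overriding the default: the NA is no longer dropped before summing,
# so the sum (and therefore the result) is NA
average3(c(1, 2, 3, NA), remove_nas = FALSE)
```

Note that length(x) counts the NA in both cases, which is why the default call gives 6/4 = 1.5 rather than 2.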
Pair programming exercise: There are two function-writing exercises coming up. You’ll swap driver and navigator roles between exercises. (The driver writes the code. The navigator oversees and provides guidance.) For the first exercise, the person whose birthday is coming up sooner will be the driver first. Swap for the second exercise.
Exercise: Write a function that rescales a numeric vector to be between 0 and 1. Test out your function on the following inputs:
- x = 2:4. Expected output: 0.0 0.5 1.0
- x = c(-1, 0, 5). Expected output: 0.0000000 0.1666667 1.0000000
- x = -3:-1. Expected output: 0.0 0.5 1.0
Solution
rescale01 <- function(x) {
    range_x <- range(x, na.rm = TRUE)
    # The [2] and [1] below extract the second and first element of a vector
    (x - min(x, na.rm = TRUE)) / (range_x[2]-range_x[1])
}
rescale01(2:4)
[1] 0.0 0.5 1.0
rescale01(c(-1, 0, 5))
[1] 0.0000000 0.1666667 1.0000000
rescale01(-3:-1)
[1] 0.0 0.5 1.0
Exercise: Write a function that formats a 10-digit phone number nicely as (###) ###-####. Your function should work on the following input: c("651-330-8661", "6516966000", "800 867 5309"). It may help to refer to the stringr cheatsheet.
Solution
library(stringr)
format_phone_number <- function(nums) {
    # Strip everything that is not a letter or digit (dashes, spaces, ...)
    nums <- str_remove_all(nums, "[^[:alnum:]]")
    area <- str_sub(nums, 1, 3)
    num3 <- str_sub(nums, 4, 6)
    num4 <- str_sub(nums, 7, 10)
    # str_glue("({area}) {num3}-{num4}")
    str_c("(", area, ") ", num3, "-", num4)
}
format_phone_number(c("651-330-8661", "6516966000", "800 867 5309"))
[1] "(651) 330-8661" "(651) 696-6000" "(800) 867-5309"
Argument matching
When you supply arguments to a function, they can be matched by position and/or by name.
When you call a function without argument = value inside the parentheses, you are using positional matching.
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
The above works because the first argument of ggplot is data and the second is mapping. (Pull up the documentation on ggplot with ?ggplot in the Console.) So the following doesn’t work:
ggplot(aes(x = carat, y = price), diamonds) + geom_point()
Error in `ggplot()`:
! `mapping` should be created with `aes()`.
✖ You've supplied a <tbl_df> object
But if we named the arguments (name matching), we would be fine:
ggplot(mapping = aes(x = carat, y = price), data = diamonds) + geom_point()
Somewhat confusingly, we can name some arguments and not others. Below, mapping is named, but data isn’t. This works because when an argument is matched by name, it is “removed” from the argument list, and the remaining unnamed arguments are matched in the order that they are listed in the function definition. Just because this is possible doesn’t mean it’s a good idea. Don’t do this!
ggplot(mapping = aes(x = carat, y = price), diamonds) + geom_point()
In general, it is safest to match arguments by name and position for your peace of mind. For functions that you are very familiar with (and know the argument order), it’s ok to just use positional matching.
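The matching rules described above can be tried out on a small function. The function power is hypothetical and exists only for this sketch.

```r
# Hypothetical function used only to illustrate argument matching
power <- function(base, exponent) {
    base^exponent
}

power(2, 3)                    # positional: base = 2, exponent = 3, gives 8
power(exponent = 3, base = 2)  # by name: order does not matter, gives 8
power(3, 2)                    # positional, swapped values: base = 3, gives 9
power(exponent = 3, 2)         # mixed: exponent matched by name, 2 fills base, gives 8
```

The third call runs without an error but silently computes something different, which is the main risk of relying on position alone.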
Exercise: Diagnose the error message in the example below:
ggplot() +
    geom_sf(census_data, aes(fill = population))
Error in `layer_sf()`:
! `mapping` must be created by `aes()`
Solution
Use ?geom_sf to look up the function documentation. We see that the order of the arguments is first mapping and second data. The error is due to R thinking that census_data is supplying aesthetics. This is an example of positional matching gone wrong.
The if-else if-else control structure
Often in functions, you will want to execute code conditionally. In a programming language, control structures are parts of the language that allow you to control what code is executed. By far the most common is the if-else if-else structure.
if (logical_condition) {
    # some code
} else if (other_logical_condition) {
    # some code
} else {
    # some code
}
middle <- function(x) {
    mean_x <- mean(x, na.rm = TRUE)
    median_x <- median(x, na.rm = TRUE)
    seems_skewed <- (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x)
    if (seems_skewed) {
        median_x
    } else {
        mean_x
    }
}
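As a usage sketch, the block below calls middle() on two made-up vectors and adds a three-branch example. The definition of middle is repeated so the block runs on its own; describe_sign is a hypothetical helper, not part of the lesson.

```r
middle <- function(x) {
    mean_x <- mean(x, na.rm = TRUE)
    median_x <- median(x, na.rm = TRUE)
    seems_skewed <- (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x)
    if (seems_skewed) {
        median_x
    } else {
        mean_x
    }
}

middle(c(1, 2, 3, 4, 5))  # mean and median are both 3, so the mean (3) is returned
middle(c(1, 2, 3, 100))   # mean 26.5 is far above median 2.5, so the median is returned

# Hypothetical helper showing all three branches of if-else if-else
describe_sign <- function(x) {
    if (x > 0) {
        "positive"
    } else if (x < 0) {
        "negative"
    } else {
        "zero"
    }
}

describe_sign(-4)  # "negative"
```

Both functions assume the condition inside if () is a single TRUE or FALSE, so describe_sign() is meant for one number at a time rather than a whole vector.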
Pair programming exercise: Whoever was driver most recently should start as navigator. Switch for the second exercise.
Exercise: Write a function for converting temperatures that takes as input a numeric value and a unit (either “C” for Celsius or “F” for Fahrenheit). The function should convert the temperature from one unit to the other based on the following formulas:
- To convert Celsius to Fahrenheit: (Celsius * 9/5) + 32
- To convert Fahrenheit to Celsius: (Fahrenheit - 32) * 5/9
Solution
convert_temp <- function(temp, unit) {
    if (unit=="F") {
        (temp - 32) * 5/9
    } else if (unit=="C") {
        (temp * 9/5) + 32
    }
}
convert_temp(0, unit = "C")
[1] 32
convert_temp(32, unit = "F")
[1] 0
Exercise: Write a function that extracts the domain name of a supplied email address. The function should return the domain name (e.g., “gmail.com”). If the input is not a valid email address, return “Invalid Email”. (A valid email ends in “dot something”.)
Solution
extract_domain <- function(email) {
    is_valid <- str_detect(email, "@.+\\..+")
    if (is_valid) {
        str_extract(email, "@.+$") %>%
            str_remove("@")
    } else {
        "Invalid Email"
    }
}
extract_domain(email = "les@mac.edu")
[1] "mac.edu"
extract_domain(email = "les@mac.net")
[1] "mac.net"
extract_domain(email = "les@mac.")
[1] "Invalid Email"
extract_domain(email = "les@macedu")
[1] "Invalid Email"
Writing functions with tidyverse verbs
Perhaps we are using group_by() and summarize() a lot to compute group means. We might write this function:
group_means <- function(df, group_var, mean_var) {
    df %>%
        group_by(group_var) %>%
        summarize(mean = mean(mean_var))
}
Let’s use it on the diamonds dataset to compute the mean size (carat) by diamond cut:
group_means(diamonds, group_var = cut, mean_var = carat)
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.
What if the problem is that the variable names need to be in quotes?
group_means(diamonds, group_var = "cut", mean_var = "carat")
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.
What’s going on? The tidyverse uses something called tidy evaluation: this allows you to refer to a variable by typing it directly (e.g., no need to put it in quotes). So group_by(group_var) is expecting a variable that is actually called group_var, and mean(mean_var) is expecting a variable that is actually called mean_var.
To fix this we need to embrace the variables inside the function with {{ var }}:
group_means <- function(df, group_var, mean_var) {
    df %>%
        group_by({{ group_var }}) %>%
        summarize(mean = mean({{ mean_var }}))
}
The {{ var }} tells R to look at the value supplied for the argument var rather than look for a column that is literally called var.
group_means(diamonds, group_var = cut, mean_var = carat)
# A tibble: 5 × 2
cut mean
<ord> <dbl>
1 Fair 1.05
2 Good 0.849
3 Very Good 0.806
4 Premium 0.892
5 Ideal 0.703
Let’s group by both cut and color:
group_means(diamonds, group_var = c(cut, color), mean_var = carat)
Error in `group_by()`:
ℹ In argument: `c(cut, color)`.
Caused by error:
! `c(cut, color)` must be size 53940 or 1, not 107880.
Oh no! What now?! When c(cut, color) is put inside {{ c(cut, color) }} within the function, R is actually running the code inside {{ }}. This combines the columns for those 2 variables into one long vector. What we really meant by c(cut, color) is “group by both cut and color”.
To fix this, we need the pick() function to get R to see {{ group_var }} as a list of separate variables (like the way select() works).
group_means <- function(df, group_var, mean_var) {
    df %>%
        group_by(pick({{ group_var }})) %>%
        summarize(mean = mean({{ mean_var }}))
}
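A short usage sketch of this final version follows. It uses the built-in mtcars dataset instead of diamonds so that only dplyr is needed (that swap is a choice made for this sketch), and it assumes dplyr 1.1.0 or later, where pick() is available. The function definition is repeated so the block runs on its own.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for pick()

group_means <- function(df, group_var, mean_var) {
    df %>%
        group_by(pick({{ group_var }})) %>%
        summarize(mean = mean({{ mean_var }}))
}

# One grouping variable: one row per value of cyl
by_cyl <- group_means(mtcars, group_var = cyl, mean_var = mpg)
by_cyl

# Several grouping variables, supplied with c() as in select():
# one row per observed combination of cyl and gear
by_cyl_gear <- group_means(mtcars, group_var = c(cyl, gear), mean_var = mpg)
by_cyl_gear
```

With more than one grouping variable, summarize() may print a message about how the result is grouped; that message is informational and not an error.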
Pair programming exercise: Partner with the person next to you again. Whoever was driver most recently should start as navigator. Switch for the second exercise.
Exercise: Create a new version of dplyr::count() that also shows proportions instead of just sample sizes. The function should be able to handle counting by multiple variables. Test your function with two different sets of arguments using the diamonds dataset.
Solution
count_prop <- function(df, count_vars) {
    df %>%
        count(pick({{ count_vars }})) %>%
        mutate(prop = n/sum(n))
}
count_prop(diamonds, count_vars = cut)
# A tibble: 5 × 3
cut n prop
<ord> <int> <dbl>
1 Fair 1610 0.0298
2 Good 4906 0.0910
3 Very Good 12082 0.224
4 Premium 13791 0.256
5 Ideal 21551 0.400
count_prop(diamonds, count_vars = c(cut, color))
# A tibble: 35 × 4
cut color n prop
<ord> <ord> <int> <dbl>
1 Fair D 163 0.00302
2 Fair E 224 0.00415
3 Fair F 312 0.00578
4 Fair G 314 0.00582
5 Fair H 303 0.00562
6 Fair I 175 0.00324
7 Fair J 119 0.00221
8 Good D 662 0.0123
9 Good E 933 0.0173
10 Good F 909 0.0169
# ℹ 25 more rows
Exercise: Create a function that creates a scatterplot from a user-supplied dataset with user-supplied x and y variables. The plot should also show a curvy smoothing line in blue, and a linear smoothing line in red. Test your function using the diamonds dataset.
Solution
scatter_with_smooths <- function(df, x, y) {
    ggplot(df, aes(x = {{ x }}, y = {{ y }})) +
        geom_point() +
        geom_smooth(color = "blue") +
        geom_smooth(method = "lm", color = "red")
}
scatter_with_smooths(diamonds, x = carat, y = price)
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula = 'y ~ x'
Reflection
In your personal class journal, write a few observations about pair programming today. In terms of learning and the collaboration, what went well, and what could go better? Why?