Wrangling: numerics, logicals, dates

Learning goals

After this lesson, you should be able to:

  • Determine the class of a given object and identify concerns to be wary of when manipulating an object of that class (numerics, logicals, factors, dates, strings, data.frames)
  • Explain what vector recycling is, when it is used, when it can be a problem, and how to avoid those problems
  • Explain the difference between implicit and explicit coercion
  • Extract date-time information using the lubridate package
  • Write R code to wrangle data from these different types
  • Recognize several new R errors and warnings related to data types


Slides for today are available here. (For our main activity, we will be using the rest of the webpage below.)


You can download a template Quarto file to start from here. Save this template within the following directory structure:

  • your_course_folder
    • wrangling_data_types
      • code
        • 05-data-types-1.qmd
      • data
        • You’ll download data from Moodle later today.




Object classes

We got a taste of object classes when we worked with maps. Reading in spatial data from different data sources created objects of different classes.

For example, the data_by_dist object from our Shiny app is an sf object, which means that we can use the convenient geom_sf() within ggplots to make maps.

class(data_by_dist)
[1] "sf"         "data.frame"

An important object class in R is the vector. Each column of a data frame is actually a vector. In data science, the most common vector classes are numeric, integer, logical, character, factor and Date.

class(lakers)
[1] "data.frame"
class(lakers %>% select(opponent))
[1] "data.frame"
class(lakers %>% pull(opponent))
[1] "character"
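
The same distinction between a one-column data frame and the column vector itself can be seen in base R. The sketch below uses a small made-up data frame (toy is a hypothetical name, and its values are not from the real lakers data):

```r
# Hypothetical toy data, not the real lakers data
toy <- data.frame(
    opponent = c("POR", "MIN"),
    points = c(2L, 3L),
    stringsAsFactors = FALSE
)

class(toy["opponent"])    # "data.frame": single brackets keep the data frame
class(toy[["opponent"]])  # "character": double brackets return the column vector
class(toy$points)         # "integer": $ also returns the column vector
```

Single brackets behave like select() here, while double brackets and $ behave like pull().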

Why is this important? When looking at the documentation for a function, you will see that arguments are expected to be of a specific type (class). Examples:

  • ggplot2::scale_x_continuous(): One option for the breaks argument is “A numeric vector of positions”
  • shiny::sliderInput(): The value argument. “The initial value of the slider, either a number, a date (class Date), or a date-time (class POSIXt). A length one vector will create a regular slider; a length two vector will create a double-ended range slider. Must lie between min and max.”




Numeric data

Numeric and integer classes

Numbers that we see in R are generally of the numeric class, which holds numbers that may have decimal parts. The c() function below combines multiple numbers into a vector.

numbers <- c(1, 2, 3)
class(numbers)
[1] "numeric"

R also has an integer class which will most often be formed when using the : operator to form regularly spaced sequences.

integers <- 1:3
class(integers)
[1] "integer"

It will be important to know how to check whether a number is a numeric or an integer because we’ll be using the purrr package very shortly, and it checks types very strictly (e.g., 1 as an integer cannot be combined with 1 as a numeric).
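
A few base R checks can help tell the two classes apart. This is a minimal sketch; x_num and x_int are illustrative names only:

```r
x_num <- 1    # a plain number is stored as a double ("numeric")
x_int <- 1L   # the L suffix creates an integer

class(x_num)              # "numeric"
class(x_int)              # "integer"
typeof(x_num)             # "double"

x_num == x_int            # TRUE: the values are equal
identical(x_num, x_int)   # FALSE: the types differ

is.integer(1:3)           # TRUE
class(1:3 + 0.5)          # "numeric": integer plus double gives double
class(as.integer(x_num))  # "integer": explicit conversion
```

Note that is.numeric() returns TRUE for both doubles and integers, so is.integer() or typeof() is the more precise check.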

Vector recycling

head(lakers %>% select(date, opponent, team, points))
      date opponent team points
1 20081028      POR  OFF      0
2 20081028      POR  LAL      0
3 20081028      POR  LAL      0
4 20081028      POR  LAL      0
5 20081028      POR  LAL      0
6 20081028      POR  LAL      2

Suppose that we wanted to update just the first two points values (e.g., we learned of a typo).

point_update <- c(2,3)
lakers2 <- lakers %>%
    mutate(points = points + point_update)
head(lakers$points)
[1] 0 0 0 0 0 2
head(lakers2$points)
[1] 2 3 2 3 2 5

Uh oh! It looks like the point update vector c(2, 3) got repeated multiple times. This is called vector recycling. If you are trying to combine or compare vectors of different lengths, R will repeat (recycle) the shorter one as many times as it takes to match the length of the longer one. When the longer vector’s length isn’t a multiple of the shorter one’s, we’ll get a warning.

point_update <- c(2,3,2)
lakers2 <- lakers %>%
    mutate(points = points + point_update)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `points = points + point_update`.
Caused by warning in `points + point_update`:
! longer object length is not a multiple of shorter object length
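
Recycling is easier to see on a short vector outside of mutate(). A minimal base R sketch, where pts is a made-up vector copying the first six points values shown earlier:

```r
pts <- c(0, 0, 0, 0, 0, 2)

pts + c(2, 3)   # 2 3 2 3 2 5: c(2, 3) is recycled three times
pts + 10        # 10 10 10 10 10 12: a length-1 vector is recycled silently

# Lengths 5 and 2: 5 is not a multiple of 2, so R warns.
# tryCatch() captures the warning text instead of printing it.
msg <- tryCatch(1:5 + c(2, 3), warning = function(w) conditionMessage(w))
msg
```

Recycling a length-1 value (like adding 10 to every element) is usually what we want. Problems tend to arise with lengths in between, where no warning appears as long as one length divides the other.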

In this case, the safest way to do the points update is to make sure that point_update has the same length as points:

lakers2 <- lakers %>%
    mutate(
        play_id = 1:nrow(lakers),
        point_update = case_when(
            play_id==1 ~ 2,
            play_id==2 ~ 3,
            .default = 0
        ),
        points = points + point_update
    )

head(lakers2 %>% select(date, opponent, team, points))
      date opponent team points
1 20081028      POR  OFF      2
2 20081028      POR  LAL      3
3 20081028      POR  LAL      0
4 20081028      POR  LAL      0
5 20081028      POR  LAL      0
6 20081028      POR  LAL      2
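
Another way to avoid recycling is to update only the targeted positions by index. This is a base R sketch on a made-up vector (pts, idx, and pt_update are illustrative names), not a replacement for the case_when() approach above:

```r
pts <- c(0, 0, 0, 0, 0, 2)
idx <- c(1, 2)         # positions to fix
pt_update <- c(2, 3)   # one update per position

# Guard against accidental recycling: lengths must match exactly
stopifnot(length(idx) == length(pt_update))

pts[idx] <- pts[idx] + pt_update
pts   # 2 3 0 0 0 2
```

The explicit length check makes a mismatch fail loudly instead of being recycled.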

Recycling will very often come up when working with logical objects (Booleans):

class(diamonds)=="data.frame"
[1] FALSE FALSE  TRUE
class(diamonds)
[1] "tbl_df"     "tbl"        "data.frame"
"data.frame" %in% class(diamonds)
[1] TRUE
any(class(diamonds)=="data.frame")
[1] TRUE
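
Base R also provides inherits() for this kind of check. The sketch below uses Sys.time() only because its result has two classes without loading any package:

```r
now <- Sys.time()

class(now)               # "POSIXct" "POSIXt"
class(now) == "POSIXt"   # FALSE TRUE: "POSIXt" is recycled across both classes
inherits(now, "POSIXt")  # TRUE: a single answer, safe to use inside if()
```

A single TRUE or FALSE matters because if() expects a length-one condition. In R 4.2.0 and later, a condition of length greater than one is an error.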

Explicit coercion

In R there is a family of coercion functions that force a variable to be represented as a particular type. We have as.numeric() and as.integer() for numbers.

Most commonly we will use these when numbers have accidentally been read in as a character or a factor. (More on factors later.)

In the example below we have a set of 4 points values, but the last entry was mistakenly typed as a space in the spreadsheet (instead of being left as an empty cell). Because a vector can hold only one type, c() implicitly coerces the numbers to character to match the " " entry. When we display points, all of the values have quotes around them, and class() confirms that points is a character vector. (More on working with character objects next time.)

points <- c(2, 3, 0, " ")
points
[1] "2" "3" "0" " "
class(points)
[1] "character"
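
The contrast between implicit and explicit coercion can be sketched with a few small vectors. Implicit coercion happens automatically when types are mixed; explicit coercion is when we call an as.*() function ourselves:

```r
# Implicit coercion: c() converts everything to the most flexible type
class(c(1, TRUE))           # "numeric": TRUE becomes 1
class(c(1, "a"))            # "character": 1 becomes "1"
sum(c(TRUE, FALSE, TRUE))   # 2: logicals are treated as 1 and 0

# Explicit coercion: we request the conversion
points <- c(2, 3, 0, " ")
points_num <- as.numeric(points)
points_num                  # 2 3 0 NA: the blank entry becomes NA
class(points_num)           # "numeric"
```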

A related case is numeric data that was read in as character because of stray characters. After cleaning up the strings, we can use as.numeric() to coerce the vector to a numeric vector. (More on strings and regular expressions later.) Example:

x <- c("2.3", "3.4", "4.5", "5.6.")
as.numeric(x)
Warning: NAs introduced by coercion
[1] 2.3 3.4 4.5  NA
x <- str_remove(x, "\\.$")
as.numeric(x)
[1] 2.3 3.4 4.5 5.6

Other topics

The Numbers chapter in R4DS covers a lot of useful functions and ideas related to wrangling numbers. It would be very useful to read this chapter. A glossary of useful functions:

  • n(), n_distinct()
  • sum(is.na())
  • pmin(), pmax() vs min(), max()
  • Integer division: %/%. Remainder: %%
  • round(), floor(), ceiling()
  • cut()
  • cumsum(), dplyr::cummean(), cummin(), cummax()
  • dplyr::min_rank()
  • lead(), lag(): shift a vector by padding with NAs
  • Numerical summaries: mean, median, min, max, quantile, sd, IQR
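
A few of the base R functions in this glossary can be tried on small made-up vectors. This sketch leaves out the dplyr helpers (n(), min_rank(), lead(), lag()):

```r
a <- c(1, 5, 3)
b <- c(4, 2, 6)

min(a, b)    # 1: one overall minimum across both vectors
pmin(a, b)   # 1 2 3: element-wise ("parallel") minimum
pmax(a, b)   # 4 5 6: element-wise maximum

7 %/% 2      # 3: integer division
7 %% 2       # 1: remainder

floor(2.567)      # 2
ceiling(2.567)    # 3
round(2.567, 1)   # 2.6

cumsum(c(1, 2, 3, 4))   # 1 3 6 10: running total

# cut() bins numbers into intervals and returns a factor
bins <- cut(c(1, 5, 9), breaks = c(0, 3, 6, 10))
as.character(bins)      # "(0,3]" "(3,6]" "(6,10]"
```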




Logical data

(The Logical vectors chapter in R4DS is a good supplemental reference.)

Logical data result from comparisons:

  • Comparing numeric data uses the following operators
    • >, <: greater and less than
    • >=, <=: greater than or equal to, less than or equal to
    • ==: exactly equal
  • Characters
    • variable=="specific choice"
    • variable %in% c("choice1", "choice2", "choice3")
  • Combining logical statements
    • &: and
    • |: or

Example: filter to 3-pointers made by the Minnesota Timberwolves (team=="MIN").

lakers_subs <- lakers %>%
    filter(points==3 & team=="MIN")
lakers_subs
       date opponent game_type  time period etype team            player result
1  20081214      MIN      home 00:04      1  shot  MIN        Ryan Gomes   made
2  20081214      MIN      home 05:56      2  shot  MIN    Rashad McCants   made
3  20081214      MIN      home 03:46      2  shot  MIN    Rashad McCants   made
4  20081214      MIN      home 08:34      3  shot  MIN       Mike Miller   made
5  20081214      MIN      home 00:30      3  shot  MIN        Ryan Gomes   made
6  20081214      MIN      home 11:16      4  shot  MIN Sebastian Telfair   made
7  20081214      MIN      home 03:21      4  shot  MIN       Mike Miller   made
8  20081214      MIN      home 02:28      4  shot  MIN       Mike Miller   made
9  20090130      MIN      away 09:10      1  shot  MIN        Randy Foye   made
10 20090130      MIN      away 03:25      1  shot  MIN        Ryan Gomes   made
11 20090130      MIN      away 06:30      2  shot  MIN        Randy Foye   made
12 20090130      MIN      away 08:22      4  shot  MIN     Rodney Carney   made
13 20090222      MIN      away 10:13      1  shot  MIN Sebastian Telfair   made
14 20090222      MIN      away 09:32      1  shot  MIN Sebastian Telfair   made
15 20090222      MIN      away 07:44      1  shot  MIN       Mike Miller   made
16 20090222      MIN      away 04:02      2  shot  MIN        Ryan Gomes   made
17 20090222      MIN      away 02:51      2  shot  MIN        Randy Foye   made
18 20090222      MIN      away 11:38      3  shot  MIN        Ryan Gomes   made
19 20090222      MIN      away 11:25      4  shot  MIN     Rodney Carney   made
20 20090222      MIN      away 02:07      4  shot  MIN       Mike Miller   made
21 20090222      MIN      away 01:36      4  shot  MIN        Ryan Gomes   made
22 20090222      MIN      away 00:09      4  shot  MIN        Randy Foye   made
23 20090306      MIN      home 07:55      1  shot  MIN Sebastian Telfair   made
24 20090306      MIN      home 08:42      2  shot  MIN     Rodney Carney   made
25 20090306      MIN      home 06:31      3  shot  MIN        Ryan Gomes   made
26 20090306      MIN      home 00:03      3  shot  MIN        Ryan Gomes   made
27 20090306      MIN      home 06:14      4  shot  MIN Sebastian Telfair   made
28 20090306      MIN      home 03:39      4  shot  MIN       Bobby Brown   made
   points type  x  y
1       3  3pt  2  7
2       3  3pt 48 16
3       3  3pt 43 24
4       3  3pt 47 23
5       3  3pt 27 33
6       3  3pt  9 26
7       3  3pt 44 23
8       3  3pt 49  5
9       3  3pt 23 32
10      3  3pt  2  7
11      3  3pt 47  8
12      3  3pt 48 12
13      3  3pt 48  7
14      3  3pt 47 18
15      3  3pt  2  6
16      3  3pt 20 31
17      3  3pt 47  9
18      3  3pt 47 19
19      3  3pt 48  8
20      3  3pt 47  9
21      3  3pt 35 29
22      3  3pt  8 25
23      3  3pt  2  6
24      3  3pt  2  8
25      3  3pt  3 18
26      3  3pt  5 22
27      3  3pt 25 32
28      3  3pt 24 34
# These give the same results
lakers %>%
    filter(points==3, team=="MIN")
lakers %>%
    filter(points==3) %>%
    filter(team=="MIN")

In HW1, I saw a lot of the following code:

weather %>% filter(RecordP == "TRUE")
weather %>% filter(RecordP == TRUE)

Because RecordP was already of the logical class, the most concise way to perform the filtering is:

weather %>% filter(RecordP)
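
A minimal sketch of why all three versions run, using a made-up tibble (weather_toy and its values are hypothetical, not the HW1 data):

```r
library(dplyr)

# Hypothetical stand-in for the weather data
weather_toy <- tibble(
    day = 1:4,
    RecordP = c(TRUE, FALSE, NA, TRUE)
)

# A logical column can be used directly as the filter condition
record_days <- weather_toy %>% filter(RecordP)
record_days$day          # 1 4: rows with NA are dropped

# Negate with ! to keep the FALSE rows
weather_toy %>% filter(!RecordP)

# Comparing to the string "TRUE" works only through implicit coercion:
# the logical TRUE is converted to the character "TRUE" before comparing
TRUE == "TRUE"           # TRUE
```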

A caution about using == for numeric data:

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x
[1] 1 2
x == c(1, 2)
[1] FALSE FALSE
print(x, digits = 16)
[1] 0.9999999999999999 2.0000000000000004
dplyr::near(x, c(1,2))
[1] TRUE TRUE
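
The same idea can be sketched in base R by comparing within a tolerance. The tolerance 1e-8 below is an arbitrary choice for illustration:

```r
x <- 0.1 + 0.2

x == 0.3                    # FALSE: floating-point rounding error
abs(x - 0.3) < 1e-8         # TRUE: equal within a small tolerance
isTRUE(all.equal(x, 0.3))   # TRUE: all.equal() also compares with a tolerance
```

all.equal() returns a description of the difference rather than FALSE when values differ, which is why it is wrapped in isTRUE().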

A caution about using == for NAs (use is.na() instead):

x <- NA
x == NA
[1] NA
is.na(x)
[1] TRUE
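
Missing values propagate through most comparisons and summaries. A short sketch on a made-up vector:

```r
x <- c(1, NA, 3)

x == NA                 # NA NA NA: comparing with NA is always NA
is.na(x)                # FALSE TRUE FALSE
sum(is.na(x))           # 1: number of missing values

x > 2                   # FALSE NA TRUE
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2: drop NAs before summarizing
```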




Dates

(The Dates and times chapter in R4DS is a good supplemental reference.)

The lubridate package contains useful functions for working with dates and times. The lubridate function reference is a useful resource for finding the functions you need.
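
Before working with the full data, here is a small sketch of parsing and extracting from a single date. The value 20081028 is the first date in the lakers data; everything else is illustrative:

```r
library(lubridate)

d <- ymd(20081028)        # also accepts strings such as "2008-10-28"
class(d)                  # "Date"

year(d)                   # 2008
month(d)                  # 10
day(d)                    # 28
wday(d)                   # 3 with the default week start (Sunday = 1)
wday(d, label = TRUE)     # Tue (label text depends on the locale)

mdy("10/28/2008") == d    # TRUE: the parser name follows the component order
d + days(7)               # "2008-11-04"
```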

lakers <- as_tibble(lakers)
head(lakers)
# A tibble: 6 × 13
     date opponent game_type time  period etype team  player result points type 
    <int> <chr>    <chr>     <chr>  <int> <chr> <chr> <chr>  <chr>   <int> <chr>
1  2.01e7 POR      home      12:00      1 jump… OFF   ""     ""          0 ""   
2  2.01e7 POR      home      11:39      1 shot  LAL   "Pau … "miss…      0 "hoo…
3  2.01e7 POR      home      11:37      1 rebo… LAL   "Vlad… ""          0 "off"
4  2.01e7 POR      home      11:25      1 shot  LAL   "Dere… "miss…      0 "lay…
5  2.01e7 POR      home      11:23      1 rebo… LAL   "Pau … ""          0 "off"
6  2.01e7 POR      home      11:22      1 shot  LAL   "Pau … "made"      2 "hoo…
# ℹ 2 more variables: x <int>, y <int>
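lakers <- lakers %>%
    mutate(
        date = ymd(date),
        # Caution: hm() parses "11:39" as 11 hours 39 minutes, which is why
        # the output below shows hours and minutes. If the game clock is really
        # minutes:seconds, ms(time) is the matching parser.
        time = hm(time)
    )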
lakers_clean <- lakers %>%
    mutate(
        year = year(date),
        month = month(date),
        day = day(date),
        day_of_week = wday(date, label = TRUE),
        minute = minute(time),
        second = second(time)
    )
lakers_clean %>% select(year:second)
# A tibble: 34,624 × 6
    year month   day day_of_week minute second
   <dbl> <dbl> <int> <ord>        <dbl>  <dbl>
 1  2008    10    28 Tue              0      0
 2  2008    10    28 Tue             39      0
 3  2008    10    28 Tue             37      0
 4  2008    10    28 Tue             25      0
 5  2008    10    28 Tue             23      0
 6  2008    10    28 Tue             22      0
 7  2008    10    28 Tue             22      0
 8  2008    10    28 Tue             22      0
 9  2008    10    28 Tue              0      0
10  2008    10    28 Tue             53      0
# ℹ 34,614 more rows
lakers_clean <- lakers_clean %>%
    group_by(date, opponent, period) %>%
    arrange(date, opponent, period, desc(time)) %>%
    mutate(
        diff_btw_plays_sec = as.numeric(time - lag(time, 1))
    )
lakers_clean %>% select(date, opponent, time, period, diff_btw_plays_sec)
# A tibble: 34,624 × 5
# Groups:   date, opponent, period [314]
   date       opponent time       period diff_btw_plays_sec
   <date>     <chr>    <Period>    <int>              <dbl>
 1 2008-10-28 POR      12H 0M 0S       1                 NA
 2 2008-10-28 POR      11H 39M 0S      1              -1260
 3 2008-10-28 POR      11H 37M 0S      1               -120
 4 2008-10-28 POR      11H 25M 0S      1               -720
 5 2008-10-28 POR      11H 23M 0S      1               -120
 6 2008-10-28 POR      11H 22M 0S      1                -60
 7 2008-10-28 POR      11H 22M 0S      1                  0
 8 2008-10-28 POR      11H 22M 0S      1                  0
 9 2008-10-28 POR      11H 0M 0S       1              -1320
10 2008-10-28 POR      10H 53M 0S      1               -420
# ℹ 34,614 more rows
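
The choice of parser changes what the differences mean. The sketch below assumes the time column is a game clock in minutes:seconds (an assumption, not something the data documentation here confirms) and compares hm() with ms() on the first three clock values:

```r
library(lubridate)

clock <- c("12:00", "11:39", "11:37")

# hm() reads hours:minutes
secs_hm <- period_to_seconds(hm(clock))
secs_hm                  # 43200 41940 41820
c(NA, diff(secs_hm))     # NA -1260 -120 (matches the output above)

# ms() reads minutes:seconds
secs_ms <- period_to_seconds(ms(clock))
secs_ms                  # 720 699 697
c(NA, diff(secs_ms))     # NA -21 -2
```

Under the minutes:seconds assumption, the first two plays are 21 seconds apart rather than 1260. Converting to plain seconds with period_to_seconds() before sorting or subtracting also keeps the arithmetic on ordinary numbers.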

Open-ended challenge: Is Hillary the most-poisoned baby name in US history?

Inspired by this viral post by an alumna of my PhD program, Hilary Parker, we will use baby name data from the Social Security Administration to investigate whether “Hilary” (or “Hillary”) is the most poisoned. Hilary defines “poisoning” as a sudden, rapid, and large decrease in the popularity of a name.

Data: Download the baby names data from Moodle and save this in the data folder for today.

Your goal: Using whatever data wrangling tools necessary, try to answer this question with a combination of visualizations and numerical results.

Codebook:

  • year
  • sex
  • name
  • n: Number of babies with the given name and sex that year
  • prop: Proportion of all US babies with the given name and sex that year

Stop to Reflect

As you progress through this open-ended challenge, make note of your process. How do you get started, get unstuck, and take the next step in the analysis? In what areas do you tend to share insights with your peers? In what areas do your peers tend to contribute insights for you?