Wrangling: numerics, logicals, dates
Learning goals
After this lesson, you should be able to:
- Determine the class of a given object and identify concerns to be wary of when manipulating an object of that class (numerics, logicals, factors, dates, strings, data.frames)
- Explain what vector recycling is, when it is used, when it can be a problem, and how to avoid those problems
- Explain the difference between implicit and explicit coercion
- Extract date-time information using the lubridate package
- Write R code to wrangle data from these different types
- Recognize several new R errors and warnings related to data types
Slides for today are available here. (For our main activity, we will be using the rest of the webpage below.)
You can download a template Quarto file to start from here. Save this template within the following directory structure:
your_course_folder
    wrangling_data_types
        code
            05-data-types-1.qmd
        data
            - You’ll download data from Moodle later today.
Object classes
We got a taste of object classes when we worked with maps. Reading in spatial data from different data sources created objects of different classes.
For example, the data_by_dist object from our Shiny app is an sf object, which means that we can use the convenient geom_sf() within ggplots to make maps.
class(data_by_dist)
[1] "sf" "data.frame"
An important object class in R is the vector. Each column of a data frame is actually a vector. In data science, the most common vector classes are numeric, integer, logical, character, factor, and Date.
class(lakers)
[1] "data.frame"
class(lakers %>% select(opponent))
[1] "data.frame"
class(lakers %>% pull(opponent))
[1] "character"
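A base R sketch of the same distinction, assuming a small made-up data frame (toy is hypothetical and not part of the lakers data): single brackets keep the data frame, much like select(), while double brackets or $ extract the column as a vector, much like pull().

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
    opponent = c("POR", "POR", "MIN"),
    points = c(0, 2, 3),
    stringsAsFactors = FALSE
)

class(toy["opponent"])    # single brackets keep the data frame: "data.frame"
class(toy[["opponent"]])  # double brackets extract the vector: "character"
class(toy$points)         # $ also extracts the vector: "numeric"
```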
Why is this important? When looking at the documentation for a function, you will see that arguments are expected to be of a specific type (class). Examples:
- ggplot2::scale_x_continuous(): One option for the breaks argument is “A numeric vector of positions”
- shiny::sliderInput(): The value argument: “The initial value of the slider, either a number, a date (class Date), or a date-time (class POSIXt). A length one vector will create a regular slider; a length two vector will create a double-ended range slider. Must lie between min and max.”
Numeric data
Numeric and integer classes
Numbers that we see in R are generally of the numeric class, which are numbers with decimals. The c() function below is a way to create a vector of multiple numbers.
numbers <- c(1, 2, 3)
class(numbers)
[1] "numeric"
R also has an integer class, which will most often be formed when using the : operator to form regularly spaced sequences.
integers <- 1:3
class(integers)
[1] "integer"
It will be important to know how to check whether a number is a numeric or an integer because we’ll be using the purrr package very shortly, which checks types very strictly (e.g., 1 as an integer cannot be combined with 1 as a numeric).
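A minimal sketch of how to check and control the two number types in base R. The object names x_num and x_int are made up for illustration.

```r
x_num <- 1    # plain number literals are stored as doubles (class "numeric")
x_int <- 1L   # the L suffix creates an integer

class(x_num)              # "numeric"
class(x_int)              # "integer"
is.integer(x_num)         # FALSE
is.numeric(x_int)         # TRUE: is.numeric() is TRUE for integers and doubles
x_num == x_int            # TRUE: the values are equal
identical(x_num, x_int)   # FALSE: the types differ
class(as.integer(x_num))  # "integer" after explicit coercion
```

Note that == compares values, while identical() also compares types, which is closer to the strict checking described above.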
Vector recycling
head(lakers %>% select(date, opponent, team, points))
date opponent team points
1 20081028 POR OFF 0
2 20081028 POR LAL 0
3 20081028 POR LAL 0
4 20081028 POR LAL 0
5 20081028 POR LAL 0
6 20081028 POR LAL 2
Suppose that we wanted to update just the first two points values (e.g., we learned of a typo).
point_update <- c(2,3)
lakers2 <- lakers %>%
    mutate(points = points + point_update)
head(lakers$points)
[1] 0 0 0 0 0 2
head(lakers2$points)
[1] 2 3 2 3 2 5
Uh oh! It looks like the 2,3 point update vector got repeated multiple times. This is called vector recycling. If you are trying to combine or compare vectors of different lengths, R will repeat (recycle) the shorter one as many times as it takes to make them the same length. When the longer vector’s length isn’t a multiple of the smaller one, we’ll get a warning.
point_update <- c(2,3,2)
lakers2 <- lakers %>%
    mutate(points = points + point_update)
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `points = points + point_update`.
Caused by warning in `points + point_update`:
! longer object length is not a multiple of shorter object length
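Recycling can also be seen on a small vector in base R. The sketch below uses a made-up vector pts that mirrors the first six points values; the names pts and warn_msg are for illustration only.

```r
pts <- c(0, 0, 0, 0, 0, 2)

pts + c(2, 3)  # length 2 is recycled 3 times: 2 3 2 3 2 5
pts + 10       # length 1 is recycled 6 times: 10 10 10 10 10 12

# 6 is not a multiple of 4, so R still recycles but raises a warning.
# tryCatch() captures the warning text instead of printing it.
warn_msg <- tryCatch(
    pts + c(1, 2, 3, 4),
    warning = function(w) conditionMessage(w)
)
warn_msg
```

Recycling a length-1 value is usually what we want (e.g., adding a constant); the trouble starts when a longer vector is recycled silently.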
In this case, the safest way to do the points update is to make sure that point_update has the same length as points:
lakers2 <- lakers %>%
    mutate(
        play_id = 1:nrow(lakers),
        point_update = case_when(
            play_id==1 ~ 2,
            play_id==2 ~ 3,
            .default = 0
        ),
        points = points + point_update
    )
head(lakers2 %>% select(date, opponent, team, points))
date opponent team points
1 20081028 POR OFF 2
2 20081028 POR LAL 3
3 20081028 POR LAL 0
4 20081028 POR LAL 0
5 20081028 POR LAL 0
6 20081028 POR LAL 2
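An alternative sketch in base R, assuming the rows to fix can be identified by position: assign into just those elements so that both sides of the addition have the same length and nothing is recycled. The names pts, fix_rows, and fix_vals are hypothetical.

```r
pts <- c(0, 0, 0, 0, 0, 2)  # stand-in for the first six points values
fix_rows <- c(1, 2)         # positions to update
fix_vals <- c(2, 3)         # amounts to add at those positions

# Guard against accidental recycling: lengths must match exactly
stopifnot(length(fix_rows) == length(fix_vals))

pts[fix_rows] <- pts[fix_rows] + fix_vals
pts  # 2 3 0 0 0 2
```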
Recycling will very often come up when working with logical objects (Booleans):
class(diamonds)=="data.frame"
[1] FALSE FALSE TRUE
class(diamonds)
[1] "tbl_df" "tbl" "data.frame"
"data.frame" %in% class(diamonds)
[1] TRUE
any(class(diamonds)=="data.frame")
[1] TRUE
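In the comparison above, the single string "data.frame" is recycled across all three class names, which is why three logical values come back. A sketch of the safer checks on a plain character vector (cls is a made-up stand-in for class(diamonds)); inherits() is a base R function that answers the same question directly.

```r
cls <- c("tbl_df", "tbl", "data.frame")  # stand-in for class(diamonds)

cls == "data.frame"       # one result per class name: FALSE FALSE TRUE
"data.frame" %in% cls     # a single TRUE
any(cls == "data.frame")  # a single TRUE

df <- data.frame(a = 1:3)
inherits(df, "data.frame")  # TRUE: checks class membership directly
```

A single TRUE or FALSE is what conditions such as if () expect, so the last three forms are the ones to prefer there.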
Explicit coercion
In R there is a family of coercion functions that force a variable to be represented as a particular type. We have as.numeric() and as.integer() for numbers.
Most commonly we will use these when numbers have accidentally been read in as a character or a factor. (More on factors later.)
In the example below we have a set of 4 points values, but the last entry was mistakenly typed as a space in the spreadsheet (instead of as an empty cell). We can see when we display points that all of the values have quotes around them and that the class of the points object is a character vector. (More on working with character objects next time.)
points <- c(2, 3, 0, " ")
points
[1] "2" "3" "0" " "
class(points)
[1] "character"
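What happened to points is implicit coercion: when c() combines different types, R silently converts everything to the most flexible type present. A short sketch contrasting that with explicit coercion; the names pts_chr and pts_num are made up for illustration.

```r
# Implicit coercion: c() picks the most flexible type on its own
class(c(TRUE, 1L))         # "integer"
class(c(1L, 2.5))          # "numeric"
class(c(2, 3, 0, " "))     # "character"
sum(c(TRUE, FALSE, TRUE))  # 2: logicals are treated as 1 and 0

# Explicit coercion: we ask for the conversion ourselves
pts_chr <- c("2", "3", "0", " ")
# suppressWarnings() keeps the sketch quiet in case R warns about NAs
pts_num <- suppressWarnings(as.numeric(pts_chr))
pts_num         # 2 3 0 NA
class(pts_num)  # "numeric"
```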
Most commonly we will have numeric data that happens to be read in as a character. After cleaning up the strings, we can use as.numeric() to coerce the vector to a numeric vector. (More on strings and regular expressions later.) Example:
x <- c("2.3", "3.4", "4.5", "5.6.")
as.numeric(x)
Warning: NAs introduced by coercion
[1] 2.3 3.4 4.5 NA
x <- str_remove(x, "\\.$")
as.numeric(x)
[1] 2.3 3.4 4.5 5.6
Other topics
The Numbers chapter in R4DS covers a lot of useful functions and ideas related to wrangling numbers. It would be very useful to read this chapter. A glossary of the functions covered there:
- n(), n_distinct()
- sum(is.na())
- pmin(), pmax() vs min(), max()
- Integer division: %/%. Remainder: %%
- round(), floor(), ceiling()
- cut()
- cumsum(), dplyr::cummean(), cummin(), cummax()
- dplyr::min_rank()
- lead(), lag(): shift a vector by padding with NAs
- Numerical summaries: mean, median, min, max, quantile, sd, IQR
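A quick sketch of a few of the base R functions from this glossary on made-up vectors a and b, mainly to show how the element-wise pmin()/pmax() differ from min()/max().

```r
a <- c(1, 5, 3)
b <- c(4, 2, 6)

min(a, b)   # one overall minimum: 1
pmin(a, b)  # element-wise ("parallel") minimum: 1 2 3
pmax(a, b)  # element-wise maximum: 4 5 6

7 %/% 2  # integer division: 3
7 %% 2   # remainder: 1

floor(-1.5)      # rounds down: -2
ceiling(-1.5)    # rounds up: -1
round(2.567, 1)  # one decimal place: 2.6

cumsum(1:4)            # running total: 1 3 6 10
cummax(c(1, 3, 2, 5))  # running maximum: 1 3 3 5
```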
Logical data
(The Logical vectors chapter in R4DS is a good supplemental reference.)
Logical data result from comparisons:
- Comparing numeric data uses the following operators
    - >, <: greater and less than
    - >=, <=: greater and less than or equal to
    - ==: exactly equal
- Characters
    - variable=="specific choice"
    - variable %in% c("choice1", "choice2", "choice3")
- Combining logical statements
    - &: and
    - |: or
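These operators can be tried out on small vectors before using them inside filter(). The vectors pts and team below are made up for illustration.

```r
pts  <- c(0, 2, 3, 3)
team <- c("LAL", "MIN", "MIN", "POR")

pts >= 2                   # FALSE TRUE TRUE TRUE
team %in% c("MIN", "POR")  # FALSE TRUE TRUE TRUE

pts == 3 & team == "MIN"   # both must hold: FALSE FALSE TRUE FALSE
pts == 3 | team == "MIN"   # either may hold: FALSE TRUE TRUE TRUE

# TRUE counts as 1, so sum() counts how many elements satisfy a condition
sum(pts == 3 & team == "MIN")  # 1
```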
Example: filter to 3-pointers made by the Minnesota Timberwolves (team is MIN).
lakers_subs <- lakers %>%
    filter(points==3 & team=="MIN")
lakers_subs
date opponent game_type time period etype team player result
1 20081214 MIN home 00:04 1 shot MIN Ryan Gomes made
2 20081214 MIN home 05:56 2 shot MIN Rashad McCants made
3 20081214 MIN home 03:46 2 shot MIN Rashad McCants made
4 20081214 MIN home 08:34 3 shot MIN Mike Miller made
5 20081214 MIN home 00:30 3 shot MIN Ryan Gomes made
6 20081214 MIN home 11:16 4 shot MIN Sebastian Telfair made
7 20081214 MIN home 03:21 4 shot MIN Mike Miller made
8 20081214 MIN home 02:28 4 shot MIN Mike Miller made
9 20090130 MIN away 09:10 1 shot MIN Randy Foye made
10 20090130 MIN away 03:25 1 shot MIN Ryan Gomes made
11 20090130 MIN away 06:30 2 shot MIN Randy Foye made
12 20090130 MIN away 08:22 4 shot MIN Rodney Carney made
13 20090222 MIN away 10:13 1 shot MIN Sebastian Telfair made
14 20090222 MIN away 09:32 1 shot MIN Sebastian Telfair made
15 20090222 MIN away 07:44 1 shot MIN Mike Miller made
16 20090222 MIN away 04:02 2 shot MIN Ryan Gomes made
17 20090222 MIN away 02:51 2 shot MIN Randy Foye made
18 20090222 MIN away 11:38 3 shot MIN Ryan Gomes made
19 20090222 MIN away 11:25 4 shot MIN Rodney Carney made
20 20090222 MIN away 02:07 4 shot MIN Mike Miller made
21 20090222 MIN away 01:36 4 shot MIN Ryan Gomes made
22 20090222 MIN away 00:09 4 shot MIN Randy Foye made
23 20090306 MIN home 07:55 1 shot MIN Sebastian Telfair made
24 20090306 MIN home 08:42 2 shot MIN Rodney Carney made
25 20090306 MIN home 06:31 3 shot MIN Ryan Gomes made
26 20090306 MIN home 00:03 3 shot MIN Ryan Gomes made
27 20090306 MIN home 06:14 4 shot MIN Sebastian Telfair made
28 20090306 MIN home 03:39 4 shot MIN Bobby Brown made
points type x y
1 3 3pt 2 7
2 3 3pt 48 16
3 3 3pt 43 24
4 3 3pt 47 23
5 3 3pt 27 33
6 3 3pt 9 26
7 3 3pt 44 23
8 3 3pt 49 5
9 3 3pt 23 32
10 3 3pt 2 7
11 3 3pt 47 8
12 3 3pt 48 12
13 3 3pt 48 7
14 3 3pt 47 18
15 3 3pt 2 6
16 3 3pt 20 31
17 3 3pt 47 9
18 3 3pt 47 19
19 3 3pt 48 8
20 3 3pt 47 9
21 3 3pt 35 29
22 3 3pt 8 25
23 3 3pt 2 6
24 3 3pt 2 8
25 3 3pt 3 18
26 3 3pt 5 22
27 3 3pt 25 32
28 3 3pt 24 34
# These give the same results
lakers %>%
    filter(points==3, team=="MIN")
lakers %>%
    filter(points==3) %>%
    filter(team=="MIN")
In HW1, I saw a lot of the following code:
weather %>% filter(RecordP == "TRUE")
weather %>% filter(RecordP == TRUE)
Because RecordP was already of the logical class, the most concise way to perform the filtering is:
weather %>% filter(RecordP)
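A small base R sketch of why the shorter form is enough, using made-up vectors record and temps rather than the weather data: a logical vector already is the condition, so comparing it to TRUE returns the same vector, and comparing it to "TRUE" only works through implicit coercion to character.

```r
record <- c(TRUE, FALSE, TRUE)  # hypothetical logical column
temps  <- c(71, 64, 80)         # hypothetical values to subset

temps[record]                      # keeps rows where record is TRUE: 71 80
identical(record == TRUE, record)  # TRUE: the comparison changes nothing
record == "TRUE"                   # TRUE FALSE TRUE, via coercion to character
```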
A caution about using == for numeric data:
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
x
[1] 1 2
x == c(1, 2)
[1] FALSE FALSE
print(x, digits = 16)
[1] 0.9999999999999999 2.0000000000000004
dplyr::near(x, c(1,2))
[1] TRUE TRUE
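The idea behind a tolerance-based comparison can be sketched in base R. The helper near_base() below is hypothetical, and its default tolerance is an assumption chosen to be small relative to typical data values; all.equal() is a base R function that takes a similar approach.

```r
# Hypothetical helper: TRUE when x and y differ by less than a small tolerance
near_base <- function(x, y, tol = sqrt(.Machine$double.eps)) {
    abs(x - y) < tol
}

x <- c(1 / 49 * 49, sqrt(2)^2)

near_base(x, c(1, 2))          # TRUE TRUE
isTRUE(all.equal(x, c(1, 2)))  # TRUE: equal up to a small tolerance
near_base(1, 1.1)              # FALSE: a real difference is still detected
```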
A caution about using == for NAs (use is.na() instead):
x <- NA
x == NA
[1] NA
is.na(x)
[1] TRUE
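The same caution applies to vectors that contain some missing values. A minimal sketch on a made-up vector v:

```r
v <- c(1, NA, 3)

v == NA    # comparing with NA is always NA: NA NA NA
is.na(v)   # FALSE TRUE FALSE
sum(is.na(v))  # number of missing values: 1

mean(v)                # NA: missing values propagate through summaries
mean(v, na.rm = TRUE)  # 2: drop missing values first

v[!is.na(v) & v > 2]   # combine is.na() with other conditions: 3
```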
Dates
(The Dates and times chapter in R4DS is a good supplemental reference.)
The lubridate package contains useful functions for working with dates and times. The lubridate function reference is a useful resource for finding the functions you need.
lakers <- as_tibble(lakers)
head(lakers)
# A tibble: 6 × 13
date opponent game_type time period etype team player result points type
<int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int> <chr>
1 2.01e7 POR home 12:00 1 jump… OFF "" "" 0 ""
2 2.01e7 POR home 11:39 1 shot LAL "Pau … "miss… 0 "hoo…
3 2.01e7 POR home 11:37 1 rebo… LAL "Vlad… "" 0 "off"
4 2.01e7 POR home 11:25 1 shot LAL "Dere… "miss… 0 "lay…
5 2.01e7 POR home 11:23 1 rebo… LAL "Pau … "" 0 "off"
6 2.01e7 POR home 11:22 1 shot LAL "Pau … "made" 2 "hoo…
# ℹ 2 more variables: x <int>, y <int>
lakers <- lakers %>%
    mutate(
        date = ymd(date),
        # The game clock is recorded as minutes:seconds, so parse with ms()
        time = ms(time)
    )
lakers_clean <- lakers %>%
    mutate(
        year = year(date),
        month = month(date),
        day = day(date),
        day_of_week = wday(date, label = TRUE),
        minute = minute(time),
        second = second(time)
    )
lakers_clean %>% select(year:second)
# A tibble: 34,624 × 6
year month day day_of_week minute second
<dbl> <dbl> <int> <ord> <dbl> <dbl>
1 2008 10 28 Tue 12 0
2 2008 10 28 Tue 11 39
3 2008 10 28 Tue 11 37
4 2008 10 28 Tue 11 25
5 2008 10 28 Tue 11 23
6 2008 10 28 Tue 11 22
7 2008 10 28 Tue 11 22
8 2008 10 28 Tue 11 22
9 2008 10 28 Tue 11 0
10 2008 10 28 Tue 10 53
# ℹ 34,614 more rows
lakers_clean <- lakers_clean %>%
    group_by(date, opponent, period) %>%
    arrange(date, opponent, period, desc(time)) %>%
    mutate(
        diff_btw_plays_sec = as.numeric(time - lag(time, 1))
    )
lakers_clean %>% select(date, opponent, time, period, diff_btw_plays_sec)
# A tibble: 34,624 × 5
# Groups: date, opponent, period [314]
date opponent time period diff_btw_plays_sec
<date> <chr> <Period> <int> <dbl>
1 2008-10-28 POR 12M 0S 1 NA
2 2008-10-28 POR 11M 39S 1 -21
3 2008-10-28 POR 11M 37S 1 -2
4 2008-10-28 POR 11M 25S 1 -12
5 2008-10-28 POR 11M 23S 1 -2
6 2008-10-28 POR 11M 22S 1 -1
7 2008-10-28 POR 11M 22S 1 0
8 2008-10-28 POR 11M 22S 1 0
9 2008-10-28 POR 11M 0S 1 -22
10 2008-10-28 POR 10M 53S 1 -7
# ℹ 34,614 more rows
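The parsing functions are easier to inspect on single values. The sketch below assumes lubridate is installed and uses values copied from the first rows of lakers; the object names d, clock_a, and clock_b are made up. It also shows why the choice of parser matters for a game clock: ms() reads "11:39" as minutes and seconds, while hm() reads the same string as hours and minutes.

```r
library(lubridate)

d <- ymd(20081028)  # parse a year-month-day number into a Date
class(d)            # "Date"
year(d)             # 2008
month(d)            # 10
day(d)              # 28
wday(d)             # 3: a Tuesday, counting from Sunday = 1 by default

clock_a <- ms("12:00")  # 12 minutes, 0 seconds left in the period
clock_b <- ms("11:39")  # 11 minutes, 39 seconds left

# Seconds elapsed between the two plays
period_to_seconds(clock_a) - period_to_seconds(clock_b)  # 21

# The same string read as hours:minutes is a very different duration
period_to_seconds(ms("11:39"))  # 699
period_to_seconds(hm("11:39"))  # 41940
```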
Open-ended challenge: Is Hillary the most-poisoned baby name in US history?
Inspired by this viral post by an alumna of my PhD program, Hilary Parker, we will use baby name data from the Social Security Administration to investigate whether “Hilary” (or “Hillary”) is the most poisoned. Hilary defines “poisoning” as a sudden, rapid, and large decrease in the popularity of a name.
Data: Download the baby names data from Moodle and save this in the data folder for today.
Your goal: Using whatever data wrangling tools necessary, try to answer this question with a combination of visualizations and numerical results.
Codebook:
- year
- sex
- name
- n: Number of babies with the given name and sex that year
- prop: Proportion of all US babies with the given name and sex that year
As you progress through this open-ended challenge, make note of your process. How do you get started, get unstuck, take the next step in the analysis? In what areas do you tend to share insights for your peers? In what areas do your peers tend to contribute insights for you?