some_string <- "banana"
some_string
[1] "banana"
After this lesson, you should be able to work with strings using the stringr package.
To shape how we hold class today, go to my PollEverywhere page for some survey questions.
You can download a template Quarto file to start from here. Save this template within the following directory structure:
your_course_folder
    wrangling_data_types
        code
            05-data-types-1.qmd
            06-data-types-2.qmd
In 2018 the data journalism organization The Pudding featured a story called 30 Years of American Anxieties about themes in 30 years of posts to the Dear Abby column (an American advice column).
One way to understand themes in text data is to conduct a qualitative analysis, a methodology in which multiple readers read through instances of text several times to reach a consensus about themes.
Another way to understand themes in text data is to explore the text computationally with data science tools. This is what we will explore today. Both qualitative analysis and computational tools can be used in tandem. Often, using computational tools can help focus a close reading of select texts, which parallels the spirit of a qualitative analysis.
To prepare ourselves for a computational analysis, let’s learn about strings.
Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:
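As a minimal sketch of this behavior (the object name fruit_name is only an illustrative assumption), class() confirms the type, print() shows the quoted form, and writeLines() shows the raw contents:

```r
# Illustrative object name; any string would work here
fruit_name <- "banana"

class(fruit_name)       # the class of a string is "character"
print(fruit_name)       # printing displays the string with double quotes
writeLines(fruit_name)  # writeLines() displays the raw contents without quotes
```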
Working with strings generally will involve the use of regular expressions, a tool for finding patterns in strings. Regular expressions (regex, for short) look like the following:
"^the" (Strings that start with "the")
"end$" (Strings that end with "end")
Before getting to regular expressions, let’s go over some fundamentals about working with strings. The stringr package (available within tidyverse) is great for working with strings.
Creating strings by hand is useful for testing out regular expressions.
To create a string, type any text in either double quotes (") or single quotes ('). Using double or single quotes doesn’t matter unless your string itself has single or double quotes.
We can view these strings “naturally” (without the opening and closing quotes) with str_view():
[1] │ This is a string
[1] │ If I want to include a "quote" inside a string, I use single quotes
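The output above can be reproduced with a sketch like the following. The string contents are taken from the displayed output; the variable names string1 and string2 are assumptions.

```r
library(stringr)

# Variable names are assumptions; the contents come from the output shown above
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

str_view(string1)
str_view(string2)
```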
Exercise: Create the string It's Thursday. What happens if you put the string inside single quotes? Double quotes?
Because " and ' are special characters in the creation of strings, R offers another way to put them inside a string. We can escape these special characters by putting a \ in front of them:
string1 <- "This is a string with \"double quotes\""
string2 <- "This is a string with \'single quotes\'"
str_view(string1)
[1] │ This is a string with "double quotes"
str_view(string2)
[1] │ This is a string with 'single quotes'
Given that \ is a special character, how can we put the \ character in strings? We have to escape it with \\.
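A small sketch of the escaped backslash, using a made-up string (the name slashed and its contents are assumptions for illustration only):

```r
library(stringr)

# Two backslashes in the source code produce one backslash in the string
slashed <- "left\\right"

print(slashed)       # print() shows the escaped form, as typed
writeLines(slashed)  # writeLines() shows the actual contents
str_length(slashed)  # the backslash counts as a single character
```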
Exercise: Create the string C:\Users. What happens when you don’t escape the \?
Other special characters include:
\t (Creates a tab)
\n (Creates a newline)
Both can be useful in plots to more neatly arrange text.
[1] │ Record temp:{\t}102
[1] │ Record temp:
│ 102
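The two displays above are what str_view() would show for strings like these. The contents are inferred from that output, and the variable names are assumptions.

```r
library(stringr)

# Contents inferred from the output above; names are assumptions
string1 <- "Record temp:\t102"
string2 <- "Record temp:\n102"

str_view(string1)  # the tab is displayed as {\t}
str_view(string2)  # the newline splits the display across two lines
```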
Exercise (Exploring function documentation): Can we get str_view() to show the tab instead of {\t}? Enter ?str_view in the Console to pull up the documentation for this function. Look through the arguments to see how we might do this.
Reflection: In your Process and Reflection Log, record any strategies that you learned about reading function documentation.
Often we will want to create new strings within data frames. We can use str_c() or str_glue():
With str_c(), the strings to be combined are all separate arguments separated by commas.
With str_glue(), the desired string is written as a template with variable names inside curly braces {}.
df <- tibble(
    first_name = c("Arya", "Olenna", "Tyrion", "Melisandre"),
    last_name = c("Stark", "Tyrell", "Lannister", NA)
)
df
# A tibble: 4 × 2
  first_name last_name
  <chr>      <chr>
1 Arya       Stark
2 Olenna     Tyrell
3 Tyrion     Lannister
4 Melisandre <NA>
df %>%
    mutate(
        full_name1 = str_c(first_name, " ", last_name),
        full_name2 = str_glue("{first_name} {last_name}")
    )
# A tibble: 4 × 4
  first_name last_name full_name1       full_name2
  <chr>      <chr>     <chr>            <glue>
1 Arya       Stark     Arya Stark       Arya Stark
2 Olenna     Tyrell    Olenna Tyrell    Olenna Tyrell
3 Tyrion     Lannister Tyrion Lannister Tyrion Lannister
4 Melisandre <NA>      <NA>             Melisandre NA
Exercise: In the following data frame, create a full date string in month-day-year format using both str_c() and str_glue().
The str_length() function counts the number of characters in a string.
comments <- tibble(
    name = c("Alice", "Bob"),
    comment = c("The essay was well organized around the core message and had good transitions.", "Good job!")
)
comments %>%
    mutate(
        comment_length = str_length(comment)
    )
# A tibble: 2 × 3
name comment comment_length
<chr> <chr> <int>
1 Alice The essay was well organized around the core message and… 78
2 Bob Good job! 9
The str_sub() function gets a substring of a string. The 2nd and 3rd arguments indicate the beginning and ending positions to extract.
If the ending position goes beyond the end of the string, str_sub() will just go as far as possible.
[1] "App" "Ban" "Pea"
[1] "ple" "ana" "ear"
[1] "a"
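The three results above are consistent with calls like the following. The input vector is inferred from the output, and the name x is an assumption.

```r
library(stringr)

# Input inferred from the output above; the name x is an assumption
x <- c("Apple", "Banana", "Pear")

str_sub(x, 1, 3)    # first three characters of each string
str_sub(x, -3, -1)  # last three characters (negative positions count from the end)
str_sub("a", 1, 5)  # the range runs past the end, so only "a" is returned
```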
Exercise: Find the middle letter of each word in the data frame below. (Challenge: How would you handle words with an even number of letters?)
Suppose that you’re exploring text data looking for places where people describe happiness. There are many ways to search. We could search for the word “happy” but that excludes “happiness” so we might search for “happi”.
Regular expressions (regex) are a powerful language for describing patterns within strings.
We can use str_view() with the pattern argument to see which parts of a string match the regex supplied in that argument. (Matches are enclosed in <>.)
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
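The matches above look like the result of searching the fruit vector (a character vector that ships with stringr) for the literal pattern "berry". A sketch under that assumption:

```r
library(stringr)

# fruit is a built-in stringr vector of fruit names;
# only the strings containing a match are displayed
str_view(fruit, "berry")
```

Because "berry" contains no metacharacters, every character is matched literally.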
Essentials of forming a regex
Punctuation characters such as ., +, *, [, ], and ? have special meanings and are called metacharacters.
?: match the preceding pattern 0 or 1 times
+: match the preceding pattern at least once
*: match the preceding pattern at least 0 times (any number of times)
Exercise: Before running the code below, predict what matches will be made. Run the code to check your guesses. Note that in all of the regexes below, the ?, +, or * applies to the b only (not the a).
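str_view(c("a", "ab", "abb"), "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
str_view(c("a", "ab", "abb"), "ab+")
[2] │ <ab>
[3] │ <abb>
str_view(c("a", "ab", "abb"), "ab*")
[1] │ <a>
[2] │ <ab>
[3] │ <abb>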
# This regex matches "a" followed by at most one "b" (in "abb", only "ab" is matched)
str_view(c("a", "ab", "abb"), "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
[] (called a character class) matches any one of the characters listed inside it, e.g., [abcd] matches “a”, “b”, “c”, or “d”.
Starting the class with ^ inverts the match: [^abcd] matches anything except “a”, “b”, “c”, or “d”.
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
[836] │ <sys>tem
[901] │ <typ>e
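The matches above are consistent with two character-class patterns applied to the words vector from stringr: a vowel on each side of "x", and a non-vowel on each side of "y". The exact patterns are an inference from the output.

```r
library(stringr)

# An "x" with a vowel on each side
str_view(words, "[aeiou]x[aeiou]")

# A "y" with a non-vowel on each side (the ^ inside [] inverts the class)
str_view(words, "[^aeiou]y[^aeiou]")
```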
Exercise: Using the words data, find words that have two vowels in a row followed by an “m”.
The alternation operator | can be read just like the logical operator | (“OR”) to pick between one or more alternative patterns, e.g., apple|banana searches for “apple” or “banana”.
[1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
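The matches above are consistent with an alternation over three literal patterns. A sketch, assuming the pattern was "apple|melon|nut":

```r
library(stringr)

# Match any fruit name containing "apple", "melon", or "nut"
str_view(fruit, "apple|melon|nut")
```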
Exercise: Using the fruit data, find fruits that have a repeated vowel (“aa”, “ee”, “ii”, “oo”, or “uu”).
The ^ operator indicates the beginning of a string, and the $ operator indicates the end of a string, e.g., ^a matches strings that start with “a”, and a$ matches strings that end with “a”.
ab+ is a little confusing. Does it match “ab” one or more times? Or does it match “a” first, then just “b” one or more times? (The latter, as we saw in an earlier example.) We can be very explicit by using parentheses: a(b)+.
Exercise: Using the words data, find (1) words that start with “y” and (2) words that don’t start with “y”.
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
[980] │ <y>oung
[1] │ <a>
[2] │ <a>ble
[3] │ <a>bout
[4] │ <a>bsolute
[5] │ <a>ccept
[6] │ <a>ccount
[7] │ <a>chieve
[8] │ <a>cross
[9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
[12] │ <a>dd
[13] │ <a>ddress
[14] │ <a>dmit
[15] │ <a>dvertise
[16] │ <a>ffect
[17] │ <a>fford
[18] │ <a>fter
[19] │ <a>fternoon
[20] │ <a>gain
... and 954 more
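The two outputs above are consistent with anchored patterns like the ones below. The patterns are inferred from the output rather than taken from the original code. The last three lines are a small added illustration of how parentheses change what a quantifier applies to.

```r
library(stringr)

str_view(words, "^y")     # words that start with "y"
str_view(words, "^[^y]")  # words whose first character is anything except "y"

# Grouping: a quantifier applies to whatever immediately precedes it
str_extract("abab", "ab+")    # + applies to "b" only
str_extract("abab", "a(b)+")  # same meaning, written explicitly
str_extract("abab", "(ab)+")  # + applies to the whole group "ab"
```

Note that ^ means "start of string" outside square brackets but "not" at the start of a character class.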
The following are core stringr functions that use regular expressions:
str_view() - View the parts of each string that match the regex
str_count() - Count the number of times a regex matches within a string
str_detect() - Determine whether (TRUE/FALSE) the regex is found within a string
str_subset() - Return the subset of strings that match the regex
str_extract(), str_extract_all() - Return the portion of each string that matches the regex. str_extract() extracts the first instance of the match. str_extract_all() extracts all matches.
str_replace(), str_replace_all() - Replace the portion of a string that matches the regex with something else. str_replace() replaces the first instance of the match. str_replace_all() replaces all instances of the match.
str_remove(), str_remove_all() - Remove the portion of the string that matches the pattern. Equivalent to str_replace(x, "THE REGEX PATTERN", "")
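A minimal sketch of these functions side by side, using a made-up vector (the strings in x are hypothetical and chosen only for illustration):

```r
library(stringr)

# Hypothetical example strings
x <- c("apple pie", "banana bread", "cherry tart")

str_count(x, "a")          # number of "a"s in each string
str_detect(x, "an")        # TRUE/FALSE: does each string contain "an"?
str_subset(x, "e$")        # only the strings that end in "e"
str_extract(x, "[aeiou]")  # first vowel in each string
str_extract_all(x, "[aeiou]")  # all vowels, returned as a list
str_replace(x, "a", "A")       # replace the first "a" in each string
str_replace_all(x, "a", "A")   # replace every "a"
str_remove(x, " .*")           # drop everything from the first space onward
```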
Exercise: Each person at your table should explore a different one of the functions (other than str_view()). Pull up the documentation page using ?function_name. Explore the arguments and create a small example that demonstrates its usage. Share with your group members.
Read in the “Dear Abby” data underlying The Pudding’s 30 Years of American Anxieties article.
posts <- read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")
Rows: 20034 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): day, url, title, question_only
dbl (3): year, month, letterId
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Take a couple of minutes to scroll through the 30 Years of American Anxieties article to get ideas for themes that you might want to search for using regular expressions.
Using the stringr and regular expression tools that we talked about today, explore a theme of interest to you in the Dear Abby data.
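One possible starting point is sketched below. To keep it self-contained, it uses a tiny stand-in data frame called posts_demo: the column names year and question_only come from the column specification above, but the rows themselves are invented for illustration and are not from the real data.

```r
library(dplyr)
library(stringr)

# Stand-in for the posts data: real column names, invented rows
posts_demo <- tibble(
  year = c(1990, 1990, 2005),
  question_only = c(
    "my husband and i are happy, but my mother disapproves.",
    "i am worried about money and my job.",
    "nothing brings me happiness since i lost my job."
  )
)

theme_counts <- posts_demo %>%
  mutate(
    # TRUE if the question mentions "happy" or "happi..." anywhere
    mentions_happi = str_detect(question_only, "happ(y|i)"),
    # number of times "job" appears in the question
    n_job = str_count(question_only, "job")
  ) %>%
  group_by(year) %>%
  summarize(
    n_posts = n(),
    n_happi = sum(mentions_happi),
    n_job = sum(n_job)
  )

theme_counts
```

Swapping posts_demo for posts would run the same summary on the full data, assuming the columns are read in as shown above.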
Reflection: Make note of your exploration process. When do you tend to get stuck? How do you get unstuck? How do you notice “mistakes” or a need to change direction in the analysis? Record observations in your Process and Reflection Log.
I have one more poll question. Navigate to my PollEverywhere page.
Get together with your tentative project teammates from Tuesday.
If you have already located a dataset relevant to one or more of your research questions and can read it into R, start to explore that data in working towards Project Milestone 2.
Otherwise, peruse the Tidy Tuesday GitHub repository to find a dataset that is roughly (perhaps very roughly) related to your project domain. Start exploring this data in working towards Project Milestone 2.