library(rvest) # install.packages("rvest")
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
After this lesson, you should be able to:

Use the html_elements() and html_text() functions within the rvest package to scrape data from a webpage using CSS selectors

You can download a template Quarto file to start from here. Save this template within the following directory structure:
your_course_folder
    scraping
        code
            11-scraping.qmd
We have talked about how to acquire data from APIs. Whenever an API is available for your project, you should default to getting data from the API. Sometimes an API will not be available, and web scraping is another means of getting data.
Web scraping describes the use of code to extract information displayed on a web page. In R, the rvest package offers tools for scraping. (rvest is meant to sound like “harvest”.)
Additional readings:
In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the NIH News Releases page, we can see that the data is represented in a consistent pattern of image + title + abstract.
We will identify data in a web page using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages.
For example:
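A few common selector patterns (illustrative; only .teaser-title below comes from the NIH page itself):

- h2 matches every h2 heading element
- .teaser-title matches every element with class="teaser-title"
- #main-content matches the single element with id="main-content"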
Warning: Websites change often! If you are going to scrape a lot of data, it is probably worthwhile to save and date a copy of the website. Otherwise, you may return after some time and find that your scraping code references CSS selectors that no longer exist on the page.
Although you can learn how to use CSS Selectors by hand, we will use a shortcut by installing the Selector Gadget tool.
Let’s watch the Selector Gadget tutorial video before proceeding.
Head over to the NIH News Releases page. Click the Selector Gadget extension icon or bookmark button. As you mouse over the webpage, different parts will be highlighted in orange. Click on the title of the first news release. You’ll notice that the Selector Gadget information in the lower right describes what you clicked on.
Scroll through the page to verify that only the information you intend (the article titles) is selected. The selector panel shows the CSS selector (.teaser-title) and the number of matches for that CSS selector (10). (You may have to be careful with your clicking: there are two overlapping boxes, and clicking on the link of the title can lead to the CSS selector of “a”.)
Repeat the process above to find the correct selectors for the following fields. Make sure that each matches 10 results:
rvest and CSS Selectors

Now that we have identified CSS selectors for the information we need, let’s fetch the data using the rvest package.
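Before selecting elements, the page must be read into R once and stored. A minimal sketch, assuming the NIH News Releases listing lives at the URL below (confirm the exact address in your browser):

```r
library(rvest)

# Assumed URL for the NIH News Releases listing; verify in your browser
nih <- read_html("https://www.nih.gov/news-events/news-releases")
```

Storing the parsed page in `nih` means we only download it once, even if we extract several different fields from it.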
Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The following code retrieves the article titles:
# Retrieve and inspect the article titles
article_titles <- nih %>%
html_elements(".teaser-title") %>%
html_text()
head(article_titles)
[1] "Reduced blood lead levels linked to lower blood pressure in American Indians"
[2] "NIH-supported researchers create single-cell atlas of the placenta during term labor"
[3] "Reduced drug use is a meaningful treatment outcome for people with stimulant use disorders"
[4] "A common marker of neurological diseases may play role in healthy brains"
[5] "Residential addiction treatment for adolescents is scarce and expensive"
[6] "Semaglutide associated with lower risk of suicidal ideations compared to other treatments prescribed for obesity or type 2 diabetes"
Pair programming exercise: Whoever can write down more stringr
functions in 10 seconds will drive first and do the first exercise. Switch driver for the second exercise.
Exercise: Write a function that puts the article title, publication date, and abstract text from a single page of news results into a data frame.
get_text_from_page <- function(page, css_selector) {
  page %>%
    html_elements(css_selector) %>%
    html_text()
}

scrape_page <- function(url) {
  page <- read_html(url)
  article_titles <- get_text_from_page(page, ".teaser-title")
  article_dates <- get_text_from_page(page, ".date-display-single")
  article_abstracts <- get_text_from_page(page, ".teaser-description")
article_abstracts <- str_remove(article_abstracts, "^.+—") %>% trimws()
tibble(
title = article_titles,
date = article_dates,
abstract = article_abstracts
)
}
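A quick sanity check of scrape_page(), assuming rvest, stringr, and tibble are loaded and using the same NIH News Releases URL as before:

```r
first_page <- scrape_page("https://www.nih.gov/news-events/news-releases")
head(first_page)  # expect a tibble with title, date, and abstract columns
```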
In your Process and Reflection Log write a few observations about your experience writing the function for this first exercise. What is starting to get easier about function writing? What is still challenging? What practices or strategies would you like to try out next?
Exercise: Use iteration to get article information for the first 5 pages of news results. You can use either a for-loop or purrr::map()
. For a challenge, do both.
Using a for-loop:
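A sketch of the for-loop version, reusing scrape_page() and assuming the listing paginates with a ?page= query parameter starting at 0 (check the actual URLs in your browser before running this):

```r
library(dplyr)  # for bind_rows()

pages <- vector("list", length = 5)
for (i in 1:5) {
  # Assumed pagination pattern; page indexing on the NIH site may differ
  url <- paste0("https://www.nih.gov/news-events/news-releases?page=", i - 1)
  pages[[i]] <- scrape_page(url)
}
nih_articles <- bind_rows(pages)
```

Growing a pre-allocated list and binding rows once at the end is faster and cleaner than appending to a data frame inside the loop.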
Using purrr::map()
:
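A sketch of the purrr::map() version, under the same ?page= pagination assumption:

```r
library(purrr)
library(dplyr)

# Assumed pagination pattern; verify the real URLs in your browser
urls <- paste0("https://www.nih.gov/news-events/news-releases?page=", 0:4)
nih_articles <- map(urls, scrape_page) %>% bind_rows()
```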
In your Process and Reflection Log write a few observations about your experience working on this second exercise. What is starting to get easier about writing code for iteration? What is still challenging? What practices or strategies would you like to try out next?
A more complex example with the NIH STEM Teaching Resources Webpage.
Selecting the resource titles ends up being tricky. Using Selector Gadget, we can only get one resource title at a time. In Chrome, you can right click part of a web page and click “Inspect”. This opens up Chrome’s Developer Tools. Mousing over the HTML in the top right panel highlights the corresponding part of the web page.
The underlying HTML used to create a web page is also called the page source code or page source. We learn from this that the resource titles are <h4> headings that have the class resource-title. We can infer from this that .resource-title would be the CSS selector for the resource titles.
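Under that inference, a sketch of pulling the resource titles (the URL below is a placeholder; substitute the actual address of the NIH STEM Teaching Resources page):

```r
library(rvest)

stem_url <- "REPLACE_WITH_THE_STEM_PAGE_URL"  # placeholder, not a real URL
stem_page <- read_html(stem_url)

resource_titles <- stem_page %>%
  html_elements(".resource-title") %>%
  html_text()
head(resource_titles)
```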