# Packages used by the functions below
library(rvest)    # read_html(), html_elements(), html_text(), and the %>% pipe
library(stringr)  # str_remove()
library(tibble)   # tibble()

# Helper function to reduce html_elements() %>% html_text() code duplication
get_text_from_page <- function(page, css_selector) {
  page %>%
    html_elements(css_selector) %>%
    html_text()
}

scrape_page <- function(url) {
  # Pause for 2 seconds before each request
  Sys.sleep(2)
  page <- read_html(url)
  article_titles <- get_text_from_page(page, ".teaser-title")
  article_dates <- get_text_from_page(page, ".date-display-single")
  article_abstracts <- get_text_from_page(page, ".teaser-description")
  # Remove the leading text up to and including the last em dash, then trim whitespace
  article_abstracts <- str_remove(article_abstracts, "^.+—") %>% trimws()
  tibble(
    title = article_titles,
    date = article_dates,
    abstract = article_abstracts
  )
}

Coding strategies and principles
Core principle of writing good functions
A function should complete a single small task
Applying this principle takes hard work, but it will make your code easier to read, debug, and reuse.
Open up the Homework 8 page alongside this for a reflection exercise.
“Single” and “small” are hard to get a sense of, so let’s look at solutions from our previous scraping activity:
get_text_from_page() does a single small task: it gets the text of the elements matching a single CSS selector.
- This facilitates its reuse: we can easily use this function in another context.
- It’s easily debugged and understandable because it’s so short.
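To illustrate the reuse point, here is a minimal sketch that applies get_text_from_page() to a completely different page. The HTML snippet and the `.author` and `.note` class names are made up for this example; only the helper itself comes from the scraping activity.

```r
library(rvest)

# Same helper as in the scraping activity
get_text_from_page <- function(page, css_selector) {
  page %>%
    html_elements(css_selector) %>%
    html_text()
}

# A small made-up page (hypothetical markup and class names),
# built in memory so that no network access is needed
page <- minimal_html('
  <ul>
    <li class="author">Ada</li>
    <li class="author">Grace</li>
  </ul>
  <p class="note">Not an author</p>
')

# The helper works unchanged: only the selector differs
authors <- get_text_from_page(page, ".author")
authors
```

Because the function knows nothing about news articles, the only thing that changes between contexts is the selector that is passed in.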
If you call a function only once, it is probably doing too much: revise your code so that each function does a smaller task.
- Looping probably should not be done inside a function. Rather, we will generally want to call functions from inside loops.
- Why? When you are looping, you are doing a task over and over. That task can be turned into a function.
- If you find yourself writing a loop within a function, you likely need to write a helper function.
Example of how this might come about:
get_all_news_info <- function(url, title_selector, date_selector, abstract_selector, num_pages) {
  for (i in seq_len(num_pages)) {
    # Construct url for page i
    # page <- read_html()
    # get_text_from_page(page, title_selector)
    # get_text_from_page(page, date_selector)
    # get_text_from_page(page, abstract_selector)
  }
}

- This function does many tasks (gets many pages) instead of a single task (just one page).
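One possible way to restructure this is to keep the loop outside the function and call a single-page function from inside it. The sketch below rests on several assumptions: `make_page_url()` is a hypothetical helper, the `?page=` URL pattern and the example.com address are invented, and `scrape_page_stub()` stands in for scrape_page() so that the code runs without network access.

```r
library(tibble)
library(dplyr)

# Hypothetical helper: builds the URL for page i.
# The "?page=" pattern is an assumption, not the real site's scheme.
make_page_url <- function(base_url, i) {
  paste0(base_url, "?page=", i)
}

# Offline stand-in for scrape_page(): returns a one-row tibble
# with the same column names, without making a request.
scrape_page_stub <- function(url) {
  tibble(
    title = paste("Title from", url),
    date = NA_character_,
    abstract = NA_character_
  )
}

base_url <- "https://example.com/news"  # placeholder address
num_pages <- 3

# The loop lives outside the functions; each iteration calls
# functions that handle exactly one page.
results <- vector("list", num_pages)
for (i in seq_len(num_pages)) {
  page_url <- make_page_url(base_url, i)
  results[[i]] <- scrape_page_stub(page_url)
}

# Combine the per-page tibbles into one
all_news <- bind_rows(results)
```

In real use, scrape_page() would replace the stub. Each function can then be tested on one page before the loop is run over all of them.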