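library(tidyverse) # provides %>%, str_remove(), tibble()
library(rvest)     # provides read_html(), html_elements(), html_text()

# Helper function to reduce html_elements() %>% html_text() code duplication
get_text_from_page <- function(page, css_selector) {
    page %>%
        html_elements(css_selector) %>%
        html_text()
}

# Scrape the titles, dates, and abstracts from a single page of articles
scrape_page <- function(url) {
    Sys.sleep(2) # Pause between requests to avoid overloading the server
    page <- read_html(url)
    article_titles <- get_text_from_page(page, ".teaser-title")
    article_dates <- get_text_from_page(page, ".date-display-single")
    article_abstracts <- get_text_from_page(page, ".teaser-description")
    # Drop everything up to and including the dash that precedes each abstract
    article_abstracts <- str_remove(article_abstracts, "^.+—") %>% trimws()

    tibble(
        title = article_titles,
        date = article_dates,
        abstract = article_abstracts
    )
}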
Coding strategies and principles
Core principle of writing good functions
A function should complete a single small task
Applying this principle takes hard work, but it will make your code easier to read, debug, and reuse.
Open up the Homework 8 page alongside this for a reflection exercise.
“Single” and “small” are hard to get a sense of, so let’s look at solutions from our previous scraping activity:
get_text_from_page() does a single small task: it gets the text contents of the elements matching a single CSS selector.
- This facilitates its reuse: we can easily use this function in another context
- It’s easily debugged and understandable because it’s so short.
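To make the reuse point concrete, here is a minimal sketch that applies the same helper to a completely different page. The recipe page and its class names (.recipe-name, .cook-time) are made up for illustration and built in memory with rvest's minimal_html(), so nothing is downloaded.

```r
library(rvest)

# Same helper as in the scraping activity
get_text_from_page <- function(page, css_selector) {
    page %>%
        html_elements(css_selector) %>%
        html_text()
}

# Hypothetical page: the content and class names are assumptions for this sketch
recipe_page <- minimal_html('
  <h2 class="recipe-name">Lentil soup</h2>
  <h2 class="recipe-name">Flatbread</h2>
  <span class="cook-time">40 min</span>
')

# The helper works unchanged because it only knows about "a page" and "a selector"
recipe_names <- get_text_from_page(recipe_page, ".recipe-name")
cook_times <- get_text_from_page(recipe_page, ".cook-time")
```

Nothing in the helper mentions news articles, which is why it carries over to a new context without edits.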
If you only call your function once, it is probably doing too much: revise your code so that your functions do smaller tasks.
- Looping probably should not be done inside a function. Rather, we will generally want to call functions from inside loops.
- Why? When you are looping, you are doing a task over and over. That task can be turned into a function.
- If you find yourself writing a loop within a function, you likely need to write a helper function.
Example of how this might come about:
get_all_news_info <- function(url, title_selector, date_selector, abstract_selector, num_pages) {
    for (i in seq_len(num_pages)) {
        # Construct url for page i
        # page <- read_html()
        # get_text_from_page(page, title_selector)
        # get_text_from_page(page, date_selector)
        # get_text_from_page(page, abstract_selector)
    }
}
- This function does many tasks (gets many pages) instead of a single task (just one page).
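One possible restructuring, sketched under some assumptions: keep a function that handles exactly one page, and put the loop outside it. In this sketch build_page_url() and scrape_one_page() are hypothetical names, the "?page=" URL pattern is invented, and scrape_one_page() is an offline stand-in that returns a placeholder row instead of actually scraping.

```r
library(tibble)
library(dplyr)

# Hypothetical helper: one small task, building the URL for page i
# (the "?page=" pattern is an assumption, not a real site's scheme)
build_page_url <- function(base_url, page_number) {
    paste0(base_url, "?page=", page_number)
}

# Offline stand-in for a single-page scraper such as scrape_page():
# it takes one URL and returns one tibble
scrape_one_page <- function(url) {
    tibble(
        title = paste("Placeholder title from", url),
        source_url = url
    )
}

# The loop lives outside the functions and calls them once per page
num_pages <- 3
results <- vector("list", num_pages)
for (i in seq_len(num_pages)) {
    page_url <- build_page_url("https://example.com/news", i)
    results[[i]] <- scrape_one_page(page_url)
}

# Stack the per-page tibbles into one data frame
all_news <- bind_rows(results)
```

Each function can now be tested on a single page before it is ever run in the loop, and swapping the stand-in for a real single-page scraper does not change the loop.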