Homework 8

Required parts

Due by Wednesday, April 3 at midnight on Moodle.

You can download a template Quarto file to start from here.

Homework 8 - Challenge 1

Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process.

Part 1: Reflect on your work scraping NIH news releases from last class.

How confident did you feel during the planning phase? During the implementation phase?
What challenges or uncertainties did you face during implementation? How did you address them?
Review your implementation in terms of the core principle of good function writing. To what extent does your implementation follow these principles?

Part 2: Locate and examine the robots.txt file for this website. Summarize what you learn from it.

Part 3: Learn about the html_attr() function from rvest. Describe how this function works with a small example.

Part 4: (Do this alongside Part 5 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:

Write R code using the rvest package that allows me to scrape cheese information from cheese.com.

Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful/not useful.

# Load required libraries
library(rvest)
library(dplyr)

# Define the URL
url <- "https://cheese.com/browse/"

# Read the HTML content from the webpage
webpage <- read_html(url)

# Extract the cheese names and URLs
cheese_data <- webpage %>%
  html_nodes(".cheese-item") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)

cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# Create a data frame to store the results
cheese_df <- data.frame(Name = cheese_names,
                        URL = cheese_data,
                        stringsAsFactors = FALSE)

# Print the data frame
print(cheese_df)

Part 5: Obtain the following information for all cheeses in the database:

Cheese name
URL for the cheese’s webpage (e.g., https://www.cheese.com/gouda/)
Whether or not the cheese has a picture (e.g., gouda has a picture, but bianco does not)

To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)

Part 6: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information: milk information, country of origin, family, type, and flavour. (Just 10 to avoid overtaxing the website. Continue adding a 1 second pause between page queries.)

Part 7: Evaluate the code that you wrote in terms of the core principle of good function writing. To what extent does your implementation follow these principles? What are you learning about what is easy/challenging for you about approaching complex tasks requiring functions and loops?

Optional

Homework 8 - Challenge 2

This challenge is fully optional–I know that things are very busy right now. You can still earn an A without completing Challenge 2.

Option 1: Scrape Macalester’s Fall 2024 class schedule

Obtaining the data: Because the Registrar uses Javascript to dynamically generate the page, there is actually no HTML source underlying the tables with course data when you use read_html(). You’ll need to do the following as a workaround:
- Google Chrome users: Use Command-S, Ctrl-S, or File > Save Page As… to bring up options for saving the webpage. For Format, choose Webpage, Complete, and save this file as fall24_schedule.html.
- All other browsers: Navigate to Moodle to download the webpage HTML source that the instructor downloaded. (Note: Safari has a Web Archive format, but unfortunately, the file type can’t be read by rvest.)
Goal: Obtain a data frame of all courses with the following columns: course department (e.g., AMST or PE), course number (e.g., 102, 200), section number (e.g., 01, 02), course name, meeting days (e.g., M W F, T R), meeting time, meeting room, instructor(s), open seats, max enrollment.
Hint: All of your CSS selectors should have the text col. (You might encounter other CSS selectors via SelectorGadget, but you’ll run into issues with the number of elements selected.)

Option 2: Scrape any page of interest to you (possibly related to your project).

Make sure that you check the robots.txt file for your website (if it exists) and describe what you learn from it.

Tidy Tuesday

If you are aiming for an A in the course, recall from our syllabus that participating in 5 Tidy Tuesday challenges can move you toward this goal.

For further all-around practice, I encourage you to participate in the most recent Tidy Tuesday challenge.