File organization, GitHub

Learning goals

After this lesson, you should be able to:

  • Set up an organized directory structure for data science projects
  • Explain the difference between absolute and relative file paths and why relative file paths are preferred for reading in data
  • Construct relative file paths to read in data
  • Use the following git commands within RStudio (whether via the GUI or by the Terminal command line): clone, add, commit, push





File organization and navigation

Folder/directory structure

Minimal directory structure for a data science project: (Sub-bullets indicate folders that are inside other folders.) At minimum, a data science project should have a code, data, and results folder. Mixing code, data, and results files in a single folder quickly becomes hard to navigate, even for small projects.

  • Documents (This should be some place you can find easily through your Finder (Mac) or File Explorer (Windows).)
    • descriptive_project_name
      • code: All code files (.R, .Rmd, .qmd) should go here. Recommendation:
        • cleaning.qmd: for data acquisition and wrangling. Save (write) the cleaned dataset at the end of this file with readr::write_csv().
        • visualizations.qmd: for exploratory and final plots
        • modeling.qmd: for any statistical or predictive modeling
      • data: All data files go here
      • results: e.g., plots saved as images, results tables
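
The minimal structure above can also be created from R rather than by clicking through Finder or File Explorer. The following is a minimal sketch using base R only. The `project_root` location is a hypothetical stand-in (a temporary folder, so the code runs on any computer); in practice it would point to a folder inside Documents.

```r
# Sketch: create the minimal project skeleton with base R.
# `project_root` is a hypothetical location; tempdir() is used only so the
# example can run anywhere. In practice this would be a folder in Documents.
project_root <- file.path(tempdir(), "descriptive_project_name")

subfolders <- c("code", "data", "results")
for (folder in subfolders) {
    # recursive = TRUE also creates project_root itself if it is missing
    dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist inside the project
list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

Using `file.path()` to build paths avoids typing separators by hand.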


More involved directory structure for a data science project: If you have a larger-scale project that involves a lot more code and data, the expanded directory structure below is useful:

  • Documents
    • descriptive_project_name
      • code
        • raw: For messy code that you’re actively working on
        • clean: For code that you have cleaned up, documented, organized, and tested to run as expected
      • data
        • raw: Original data that hasn’t been cleaned
        • clean: Any non-original data that has been processed in some way
      • results
        • figures: Plots that will be used in communicating your project should go here. (Using screenshots of output in RStudio is not a good practice.)
        • tables: Any sort of plain text file results (e.g., CSVs)
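
As an alternative to screenshots, plots can be written to the figures folder from code. The sketch below assumes the ggplot2 package is installed and uses a toy plot of the built-in `mtcars` data. The folder `fig_dir` and the file name are hypothetical; a temporary folder is used only so the example runs anywhere.

```r
library(ggplot2)

# Hypothetical output folder. tempdir() keeps the sketch runnable anywhere;
# from code/clean/ in a real project the folder would be "../../results/figures".
fig_dir <- file.path(tempdir(), "results", "figures")
dir.create(fig_dir, recursive = TRUE, showWarnings = FALSE)

# Toy plot using a dataset that ships with R
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point()

# Save the plot as an image file with an explicit size (in inches by default)
ggsave(filename = file.path(fig_dir, "mpg_vs_weight.png"), plot = p,
       width = 6, height = 4)
```

Saving from code means the figure can be regenerated exactly whenever the data or plot code changes.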

File paths

What are file paths?

A file path is a text string that tells a computer how to navigate from one location to another. We use file paths to read in (and write out) data.

Essentially, file paths are what go inside read_csv().

There are two types of paths: absolute and relative.

Absolute file paths start at the “root” directory in a computer system. Examples:

  • Mac: /Users/lesliemyint/Desktop/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv
    • On a Mac the tilde ~ in a file path refers to the “Home” directory, which is /Users/lesliemyint. In this case, the path becomes ~/Desktop/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv
  • Windows: C:/Users/lesliemyint/Documents/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv
    • Note: Windows accepts both / (forward slash) and \ (backslash) to separate folders in a file path.
DON’T use absolute paths

For reading in data, absolute paths are not a good idea because if the code file is shared, the path will not work on a different computer.
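
To see how the tilde and the two kinds of slashes behave in R, here is a small sketch using base R only. The Windows path is a made-up example (`some_user` is a placeholder), and the output of `path.expand()` depends on the computer it is run on.

```r
# "~" is shorthand for the home directory; path.expand() shows the full path
path.expand("~")

# Hypothetical Windows path written two equivalent ways in R.
# A backslash must be doubled inside an R string; forward slashes need no escaping.
win_forward  <- "C:/Users/some_user/Documents/data.csv"
win_backward <- "C:\\Users\\some_user\\Documents\\data.csv"

# Converting backslashes to forward slashes gives the same string
identical(gsub("\\\\", "/", win_backward), win_forward)
```

Because forward slashes work on both Mac and Windows in R, they are the simpler choice when typing paths.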


Relative file paths start wherever you are right now (the working directory (WD)). The WD when you’re working in a code file (.Rmd, .qmd) may be different from the working directory in the Console.
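
One way to check where relative paths will start from is to ask R directly. The sketch below uses base R only. Running it once in a code chunk and once in the Console shows whether the two working directories differ. The file name `data.csv` is a placeholder.

```r
# Where will relative paths start from right now?
getwd()

# What files and folders are visible from there?
list.files()

# Check a relative path before trying to read it ("data.csv" is a placeholder name)
file.exists("data.csv")
```

If `file.exists()` returns `FALSE`, the relative path does not match the current working directory.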

Directory setup 1:

  • some_folder
    • your_code_file.qmd
    • data.csv

To read in the data:

data <- read_csv("data.csv")

Directory setup 2:

  • some_folder
    • your_code_file.qmd
    • data
      • data.csv

To read in the data:

data <- read_csv("data/data.csv")

Directory setup 3:

  • some_folder
    • data.csv
    • code
      • your_code_file.qmd

To go “up” a folder in a relative path we use ../. To read in the data:

data <- read_csv("../data.csv")

Directory setup 4:

  • some_folder
    • data
      • data.csv
    • code
      • your_code_file.qmd

To read in the data:

data <- read_csv("../data/data.csv")
DO use relative paths

For reading in data, relative paths are preferred because if the project directory structure is used on a different computer, the relative paths will still work.
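
The following sketch illustrates this by rebuilding directory setup 4 inside a temporary folder, assuming the readr package is installed. The temporary folder's absolute location differs from computer to computer, yet the same relative path works. The `setwd()` call only simulates working from a code file in the code folder; it is not something to put in project code.

```r
library(readr)

# Recreate directory setup 4 inside a temporary folder
root <- file.path(tempdir(), "some_folder")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)
write_csv(data.frame(x = 1:3), file.path(root, "data", "data.csv"))

# Pretend we are working in a code file stored in some_folder/code
old_wd <- setwd(file.path(root, "code"))

# Same relative path as in setup 4: up one folder, then into data
data <- read_csv("../data/data.csv", show_col_types = FALSE)

# Restore the original working directory
setwd(old_wd)

nrow(data)
```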

Exercise

Download this Zip file from Moodle, and save it to your class folder. After unzipping, open the code/clean/cleaning.qmd file in RStudio. Follow the instructions in that file.

Exercise goals:

  • Practice using relative paths in a realistic project context
  • Practice data wrangling
Solution

Load packages and read in data.

library(tidyverse)
weather <- read_csv("../../data/raw/weather.csv")

Add dateInYear variable.

weather_clean <- weather %>% 
    arrange(Month, Day) %>% 
    mutate(dateInYear = 1:365)
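
The code above assumes the data has exactly 365 rows. A possible alternative, sketched here on a made-up three-row dataset and assuming the dplyr package is installed, is `row_number()`, which numbers the rows without hard-coding how many there are.

```r
library(dplyr)

# Toy stand-in for the weather data (3 rows instead of 365), deliberately unsorted
toy_weather <- tibble(
    Month = c(2, 1, 1),
    Day = c(1, 2, 1)
)

toy_clean <- toy_weather %>%
    arrange(Month, Day) %>%
    # row_number() counts rows in their current order, whatever the total
    mutate(dateInYear = row_number())

toy_clean
```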

Add in 3-letter month abbreviations.

# Option 1: via joins
months <- tibble(
    Month = 1:12,
    month_name = month.abb
)
weather_clean <- weather_clean %>% 
    left_join(months, by = "Month")

# Option 2: via vector subsetting (an alternative to Option 1)
weather_clean <- weather_clean %>% 
    mutate(month_name = month.abb[Month])

Write out clean data to a CSV file.

write_csv(weather_clean, file = "../../data/clean/weather_clean.csv")





GitHub for working on skills challenges

We’ll step through the process from Lisa Lendway’s GitHub-RStudio setup video for Advanced visualization with ggplot2 - Challenge 1.


The remainder of class time today is yours to work on any of the following:

Reminder: Record and observe

As you work to finalize your plots, try as best you can to observe your strategies for getting unstuck. What sorts of things do you try? Some ideas:

  • Draw the part of your plot that you expect your code will create. Then compare to what actually does happen.
  • Ask peers what Google queries they would try
  • Look at function documentation by entering ?function_name in the Console. See what arguments might be tweaked to get what you want.
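
As a small illustration of the last idea, here is a sketch using `read.csv()`, a function that ships with R (chosen only as an example). The help-page commands are shown as comments because they are meant to be typed in the Console.

```r
# In the Console, either of these opens the help page:
# ?read.csv
# help("read.csv")

# args() prints a function's arguments and their default values
args(read.csv)

# names(formals()) lists just the argument names
names(formals(read.csv))
```

Scanning the argument names is often a quick way to spot the setting that controls the behavior you want to change.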





Announcements

  • Homework 1 is due Monday, January 29 at midnight.
  • Skills Session 1 will be happening 2/5-2/9. Keep practicing the keyboard shortcuts described here.
  • Check our Schedule for next week: there is one video to watch before class on Tuesday (with Guiding Questions) and packages to install.