Set up an organized directory structure for data science projects
Explain the difference between absolute and relative file paths and why relative file paths are preferred for reading in data
Construct relative file paths to read in data
Use the following git commands within RStudio (whether via the GUI or by the Terminal command line): clone, add, commit, push
File organization and navigation
Folder/directory structure
Minimal directory structure for a data science project: (Sub-bullets indicate folders that are inside other folders.) At minimum, a data science project should have a code, data, and results folder. Mixing code, data, and results files in a single folder quickly becomes hard to navigate, even for small projects.
- Documents (This should be some place you can find easily through your Finder (Mac) or File Explorer (Windows).)
  - descriptive_project_name
    - code: All code files (.R, .Rmd, .qmd) should go here. Recommendation:
      - cleaning.qmd: for data acquisition and wrangling. Save (write) the cleaned dataset at the end of this file with readr::write_csv().
      - visualizations.qmd: for exploratory and final plots
      - modeling.qmd: for any statistical or predictive modeling
    - data: All data files go here
    - results: e.g., plots saved as images, results tables
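The minimal layout above can also be set up from R instead of by hand. The following is a small sketch using base R only. It builds the project inside a temporary folder, which is an assumption made so the example is safe to run anywhere; in practice tempdir() would be replaced by the path to your Documents folder, and descriptive_project_name by your own project name.

```r
# tempdir() stands in for your Documents folder (an assumption for this sketch)
project <- file.path(tempdir(), "descriptive_project_name")
dir.create(project, showWarnings = FALSE)

# One folder each for code, data, and results
for (folder in c("code", "data", "results")) {
  dir.create(file.path(project, folder), showWarnings = FALSE)
}

# Empty placeholders for the three recommended code files.
# Careful: file.create() empties a file that already exists at that path.
code_files <- file.path(
  project, "code",
  c("cleaning.qmd", "visualizations.qmd", "modeling.qmd")
)
file.create(code_files)

# List every file in the project, relative to the project folder
list.files(project, recursive = TRUE)
```

The data and results folders do not show up in the final listing because list.files(recursive = TRUE) reports files only, and those folders are still empty.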
More involved directory structure for a data science project: If you have a larger-scale project that involves a lot more code and data, the expanded directory structure below is useful:
- Documents
  - descriptive_project_name
    - code
      - raw: For messy code that you’re actively working on
      - clean: For code that you have cleaned up, documented, organized, and tested to run as expected
    - data
      - raw: Original data that hasn’t been cleaned
      - clean: Any non-original data that has been processed in some way
    - results
      - figures: Plots that will be used in communicating your project should go here. (Using screenshots of output in RStudio is not a good practice.)
      - tables: Any sort of plain text file results (e.g., CSVs)
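For the expanded layout, the nested folders can be created in one pass. This is a sketch under the same assumption as before: tempdir() is a stand-in for your Documents folder, and descriptive_project_name is a placeholder. The point of interest is recursive = TRUE, which creates any missing parent folder (such as code) together with the subfolder.

```r
# tempdir() stands in for your Documents folder (an assumption for this sketch)
project <- file.path(tempdir(), "descriptive_project_name")

subfolders <- c(
  "code/raw", "code/clean",
  "data/raw", "data/clean",
  "results/figures", "results/tables"
)

# recursive = TRUE also creates missing parent folders along the way
for (sub in subfolders) {
  dir.create(file.path(project, sub), recursive = TRUE, showWarnings = FALSE)
}

# Show the folder tree, relative to the project folder
list.dirs(project, full.names = FALSE, recursive = TRUE)
```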
File paths
What are file paths?
A file path is a text string that tells a computer how to navigate from one location to another. We use file paths to read in (and write out) data.
Essentially, file paths are what go inside read_csv().
There are two types of paths: absolute and relative.
Absolute file paths start at the “root” directory in a computer system. Examples:
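On a Mac, the tilde ~ in a file path refers to the “Home” directory, which in this example is /Users/lesliemyint. The absolute path /Users/lesliemyint/Desktop/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv can therefore be shortened to ~/Desktop/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv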
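Note: Windows accepts both / (forward slash) and \ (backslash) to separate folders in a file path. Inside an R string a backslash has to be written as \\, so forward slashes are usually the easier choice.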
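To make the idea of an absolute path concrete, here is a short sketch in base R. path.expand() is a real base R function that replaces a leading ~ with the current user's home directory. The helper is_absolute_path() is hypothetical (it is not part of base R or of these notes) and is only a rough check; the Windows user name someone is made up as well.

```r
# path.expand() replaces a leading ~ with the current user's home directory.
# The result depends on the computer the code runs on.
path.expand("~/Desktop")

# Hypothetical helper: a rough check for absolute paths.
# It treats paths that start with /, ~, or a drive letter (e.g., C:) as absolute.
is_absolute_path <- function(path) {
  grepl("^(/|~|[A-Za-z]:)", path)
}

is_absolute_path("~/Desktop/teaching/STAT212/2024_spring/activities/adv_ggplot/sfo_weather.csv")  # TRUE
is_absolute_path("C:/Users/someone/data.csv")  # TRUE
is_absolute_path("data/data.csv")              # FALSE
```

Because the expanded home directory differs from one computer (and one user) to the next, any path for which this check returns TRUE is tied to a single machine.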
DON’T use absolute paths
For reading in data, absolute paths are not a good idea: if the code file is shared, the path will not work on a different computer.
Relative file paths start wherever you are right now: the working directory (WD). When a code file (.Rmd, .qmd) is rendered, the WD is by default the folder that contains the file, which may be different from the WD in the Console.
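One way to see what “wherever you are right now” means is to ask R for the working directory and build by hand the location that a relative path points to. This is a sketch in base R; the file names are placeholders and no file needs to exist for the code to run.

```r
# Where is R "standing" right now?
wd <- getwd()
wd

# A relative path is resolved against the working directory.
# "data/data.csv" therefore points to this location:
file.path(wd, "data", "data.csv")

# "../" steps up one folder. dirname(wd) is the folder that contains the WD,
# so "../data/data.csv" points to this location:
file.path(dirname(wd), "data", "data.csv")
```

Running these lines once in the Console and once inside a code chunk is a quick way to check whether the two working directories differ on your setup.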
Directory setup 1:
- some_folder
  - your_code_file.qmd
  - data.csv
To read in the data:
data <- read_csv("data.csv")
Directory setup 2:
- some_folder
  - your_code_file.qmd
  - data
    - data.csv
To read in the data:
data <- read_csv("data/data.csv")
Directory setup 3:
- some_folder
  - data.csv
  - code
    - your_code_file.qmd
To go “up” a folder in a relative path, we use ../. To read in the data:
data <- read_csv("../data.csv")
Directory setup 4:
- some_folder
  - data
    - data.csv
  - code
    - your_code_file.qmd
To read in the data:
data <- read_csv("../data/data.csv")
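Directory setup 4 can be tried out end to end without touching your own files. The sketch below recreates that setup inside a temporary folder, writes a tiny made-up dataset as data.csv, and then reads it back with the same relative path as above. The dataset contents and the use of setwd() to imitate working inside your_code_file.qmd are assumptions for illustration only.

```r
library(readr)

# Recreate directory setup 4 inside a temporary folder (safe to run anywhere)
root <- file.path(tempdir(), "some_folder")
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "code"), recursive = TRUE, showWarnings = FALSE)

# A tiny made-up dataset standing in for data.csv
write_csv(
  data.frame(id = 1:3, value = c(10, 20, 30)),
  file.path(root, "data", "data.csv")
)

# Imitate working inside code/your_code_file.qmd:
# the working directory becomes the code folder
old_wd <- setwd(file.path(root, "code"))

# "../" goes up to some_folder, then "data/" goes down into the data folder
data <- read_csv("../data/data.csv", show_col_types = FALSE)

# Put the working directory back where it was
setwd(old_wd)

data
```

In a real project you would not normally call setwd() yourself; it appears here only so that the example has a known working directory.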
DO use relative paths
For reading in data, relative paths are preferred because they keep working when the whole project folder, with its directory structure intact, is copied to a different computer.
Exercise
Download this Zip file from Moodle, and save it to your class folder. After unzipping, open the code/clean/cleaning.qmd file in RStudio. Follow the instructions in that file.
Exercise goals:
Practice using relative paths in a realistic project context