Project work
Peer review of code
Class date: April 9, 2024
Plan: You’ll walk through your code so far step by step and demonstrate to another team that all steps are accurate.
Process:
- One team will talk through each section of code in their project so far and explain how they checked the accuracy of those steps.
- The other team will listen, ask questions, and provide suggestions. The listening team members should keep in mind the following characteristics from the Code quality and documentation section of the project rubric:
- Are functions used effectively to reduce code duplication?
- Do functions do a single small task?
- Are loops/purrr used appropriately to handle repeated tasks? (See the short sketch below.)
- Is there clear documentation in the text before and after code chunks of what is happening in each code chunk?
- Are code comments used to add clarity inside code chunks?
- When using a clean dataset for modeling and/or visualization, is it clear what the cases represent and what portion (possibly a subset) of the data is being used?
- Take notes on what you learn from peer feedback on your code and what you learn from observing your peer’s code. This will be part of the reflection for Skills Session 2.
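To make the "functions + purrr" criteria concrete, here is a minimal sketch of replacing copy-pasted blocks with one small function mapped over inputs. The summarize_year() helper, the my_data data frame, and its year and value columns are hypothetical stand-ins for your own project code.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: does one small task (summarize a single year)
summarize_year <- function(data, which_year) {
  data |>
    filter(year == which_year) |>
    summarize(mean_value = mean(value, na.rm = TRUE), n = n())
}

# Instead of copy-pasting the same summary code for each year,
# map the helper over the years and row-bind the results
years <- 2020:2023
yearly_summaries <- map(years, \(y) summarize_year(my_data, y)) |>
  list_rbind()
```

Functions that each do one small task also make the walkthrough easier: the presenting team can check, and explain, one piece at a time.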
Data storytelling
Class date: April 11, 2024
Plan: We’ll look at some materials to provide inspiration and guidance for your projects.
Improving your data visualizations
Section 4.2.1 (Guidelines for good plots) presents 6 guidelines for creating great plots:
- Aim for high data density.
- Use clear, meaningful labels.
- Provide useful references.
- Highlight interesting aspects of the data.
- Consider using small multiples.
- Make order meaningful.
Although it’s not explicitly stated, an overarching theme is to facilitate comparisons.
- When you present your visualizations, what aspects is the viewer drawn to, and what do they want to compare?
- Make it as easy as possible to compare those things.
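As a sketch of how several of these guidelines can work together, here is an illustrative (not prescriptive) plot using the mpg data that ships with ggplot2:

```r
library(ggplot2)
library(forcats)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +                     # high data density, light on decoration
  geom_hline(yintercept = mean(mpg$hwy),
             linetype = "dashed") +             # useful reference: overall mean
  facet_wrap(~ fct_reorder(class, hwy, .fun = median)) +  # small multiples, ordered meaningfully
  labs(
    x = "Engine displacement (liters)",         # clear, meaningful labels
    y = "Highway mileage (mpg)",
    title = "Highway mileage decreases with engine size across vehicle classes"
  )
```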
Example: grid of scatterplots
- Scatterplot grids are considered in Guideline 5: Consider using small multiples.
- What if the viewer ultimately wants to focus on the slope and the strength of the correlation?
- Then it might be more effective to summarize each scatterplot with the correlation coefficient, slope estimate, and slope confidence interval.
- geom_point() and geom_linerange()/geom_errorbar() would be effective here (see the sketch below).
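A minimal sketch of that summary plot, using mpg as stand-in data with class as the grouping variable (swap in your own data, grouping, and model):

```r
library(dplyr)
library(broom)
library(ggplot2)

# One slope estimate and confidence interval per group,
# instead of one full scatterplot per group
slopes <- mpg |>
  group_by(class) |>
  group_modify(~ tidy(lm(hwy ~ displ, data = .x), conf.int = TRUE)) |>
  ungroup() |>
  filter(term == "displ")

ggplot(slopes, aes(x = estimate, y = reorder(class, estimate))) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high), width = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +  # reference: slope of 0
  labs(x = "Estimated slope of hwy vs. displ (95% CI)", y = NULL)
```

The per-group correlation coefficient could be computed similarly (e.g., summarize(r = cor(hwy, displ))) and shown as a text label or a second panel.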
Example: Majority of Biden’s 2020 Voters Now Say He’s Too Old to Be Effective
- What aspects of the plot reinforce the intended message?
Resources for sparking creativity and imagination in your plots
- Blog post: The 30 Best Data Visualizations of 2023
- The Pudding is a great data journalism site. Examples of articles with unique visualizations:
- Pockets: On the sizes of men’s and women’s pockets
- Making it Big: Exploring the trajectory of bands
- The Differences in How CNN, MSNBC, & FOX Cover the News
- The Physical Traits that Define Men and Women in Literature
- How News Media Covers Trump and Clinton: An analysis of images in news media
- Where Slang Comes From: Exploring the emergence of slang over time
- Visualizations from the New York Times
Your project in 3 visuals
Exercise: If your digital deliverable (whether blog post or Shiny app) could only show 3 visuals, what would they be? Why?
- What ideas do you have about the order of your visuals?
- What might you do to combine multiple visuals into one?
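One option for combining multiple visuals into a single figure is the patchwork package. A minimal sketch, where p1 and p2 stand in for plots you have already built:

```r
library(ggplot2)
library(patchwork)

# Stand-ins for two of your own plots
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p2 <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()

# patchwork composes ggplots: + places them side by side, / stacks them
(p1 + p2) +
  plot_annotation(
    title = "Two related views in one figure",
    tag_levels = "A"   # label panels A and B so your text can refer to them
  )
```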
Human-centered data science
Let’s take a moment to explore The Pudding’s 30 Years of American Anxieties.
- In what ways do these letters reveal essential context that would never be found in a dataset?
- What hidden context can you imagine for your dataset?
- What additional information could accompany your dataset to provide a fuller picture of the lived experiences of all those who may have been connected to the data?
- Who collected this data? Why? What might have been their agenda?
- How might the agendas of the data collectors have affected what data are available? In terms of:
- What cases are present in and absent from the data?
- What variables are available and in what format (e.g., categories)?
- Think about the labor involved in collecting your data. Whose labor is most visible and applauded? Whose labor is invisible?