Univariate visualization and summaries

Notes and in-class exercises

Notes

  • You can download a template file to work with here.
  • File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Learning goals

By the end of this lesson, you should be able to:

  • Describe what a case (or unit of analysis) represents in a dataset.
  • Describe what a variable represents in a dataset.
  • Identify whether a variable is categorical or quantitative and what summarizations and visualizations are appropriate for that variable.
  • Write R code to read in data and to summarize and visualize a single variable at a time.
  • Interpret key features of barplots, boxplots, histograms, and density plots.
  • Describe information about the distribution of a quantitative variable using the concepts of shape, center, spread, and outliers.
  • Relate summary statistics of data to the concepts of shape, center, spread, and outliers.

Readings and videos

Before class you should have read and/or watched:

Review

Cases

When Macalester advertises an average class size of 17, what do you think the cases in this dataset represent?

Suppose we had just 2 classes: one with a class size of 20 and the other with a class size of 28. If the cases are classes, the average class size is 24.

However, if the cases are students, the average class size looks like this:

(20*20 + 28*28)/(20+28)
[1] 24.66667

This is another viewpoint for what “average class size” means from the student perspective.

Note: From the student perspective (when cases are students), average class size is never lower than when cases are classes, and it is strictly higher whenever class sizes differ.

(Thanks to this post for the idea for this example.)
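Both viewpoints can be reproduced in R. This is a minimal sketch using the two class sizes from the example; the object names class_sizes and student_view are our own choices.

```r
# Class sizes from the example above
class_sizes <- c(20, 28)

# Cases are classes: each class counts once
mean(class_sizes)

# Cases are students: each student reports the size of their own class,
# so the class of 20 contributes twenty 20s and the class of 28 contributes twenty-eight 28s
student_view <- rep(class_sizes, times = class_sizes)
mean(student_view)

# Equivalent shortcut: weight each class size by the number of students in it
weighted.mean(class_sizes, w = class_sizes)
```

The larger class is counted more often from the student perspective, which is what pulls the average up.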





Univariate visualization

In August 2018, the data journalism group The Pudding published an article about the size of men’s and women’s jeans pockets (called Women’s Pockets are Inferior).

We’ll explore this data to review univariate visualization.

Tips for navigating code
  • Whenever you see a parenthesis (, you are using a function. Look at the text to the left of the ( to see the function name. You can think of function names as verbs. They do things to data.
  • Whenever you see an arrow <-, whatever is happening on the right is being stored in a “box” with the label on the left. (So the result of read_csv() is being stored in a “box” called pockets. pockets is the name/label of our dataset.)
library(readr)

pockets <- read_csv("https://raw.githubusercontent.com/the-pudding/data/master/pockets/measurements.csv")

The documentation (codebook) for this data is available here.

The menWomen variable is a categorical variable. We can use the code below to make a barplot to explore how many men’s and women’s jeans were examined.

Tips for navigating code
  • Inside parentheses (which indicate that a function is being called), look for lists of things separated by commas. The items in that list are inputs (also called arguments) for the function. These inputs supply essential information for the function.
  • For plots, look for + signs. These indicate “layers” of a plot, kind of like layers of a painting.
library(ggplot2)

ggplot(pockets, aes(x = menWomen)) +
    geom_bar()

We can also tabulate categorical variables with the count() function from the dplyr data wrangling/manipulation package.

Tips for navigating code
  • The %>% symbol is called a pipe. It takes the object before it and feeds it in as the first input to the function after it. Whenever you see the pipe symbol, you can replace it in your mind with the words “and then”.
library(dplyr)

pockets %>% 
    count(menWomen)
# A tibble: 2 × 2
  menWomen     n
  <chr>    <int>
1 men         40
2 women       40
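To see what the pipe is doing, it can help to write the same command with and without it. The sketch below uses a tiny made-up dataset called toy (3 rows, not the real pockets measurements) so that it runs on its own.

```r
library(dplyr)

# Made-up stand-in for the pockets data (not the real measurements)
toy <- tibble(menWomen = c("men", "women", "women"))

# With the pipe: "take toy, and then count the values of menWomen"
piped <- toy %>%
    count(menWomen)

# Without the pipe: toy is written directly as the first input to count()
unpiped <- count(toy, menWomen)

# Both versions produce the same table of counts
all.equal(piped, unpiped)
```

The piped version reads left to right in the order the steps happen, which becomes more helpful as more steps are chained together.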

The minHeightFront variable gives the minimum height of the front pocket and is quantitative. For a single quantitative variable, we can make a boxplot, density plot, or histogram. Let’s look at a histogram:

ggplot(pockets, aes(x = minHeightFront)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
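The message printed with the histogram is a reminder that the default of 30 bins is rarely the best choice. The sketch below shows one way to set the bin width yourself, along with the density plot and boxplot versions of the same plot. The data frame toy holds made-up heights (not the real measurements), and the bin width of 2 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Made-up pocket heights in cm (not the real measurements)
toy <- data.frame(
    minHeightFront = c(9.5, 12, 13, 13.5, 14, 15, 15, 15.5, 16, 17, 17.5, 19, 21, 25)
)

# Histogram where each bar covers 2 cm (setting binwidth also silences the message)
hist_plot <- ggplot(toy, aes(x = minHeightFront)) +
    geom_histogram(binwidth = 2)

# Density plot: a smoothed version of the histogram
dens_plot <- ggplot(toy, aes(x = minHeightFront)) +
    geom_density()

# Boxplot: shows the median and quartiles
box_plot <- ggplot(toy, aes(x = minHeightFront)) +
    geom_boxplot()

# Print each plot
hist_plot
dens_plot
box_plot
```

Only the last layer changes between the three plots; the ggplot() line that sets up the data and the x-axis variable stays the same.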

When you interpret a plot of a quantitative variable, there are 4 aspects to keep in mind:

  1. Shape
  2. Center
  3. Spread
  4. Outliers

Shape: How are values distributed along the observed range? What does the distribution of the variable look like?


Center: What is a typical value of the variable?

  • Quantified with summary statistics like the mean and median.
summary(pockets$minHeightFront)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9.50   13.00   15.00   15.65   17.00   25.00 


Spread: How spread out are the values? Are most values very close together or far apart?

  • Quantified with summary statistics like the variance, standard deviation, range (difference between min and max), interquartile range (IQR) (difference between the 75th percentile and 25th percentile)
  • Interpretation of the variance: it is the average (roughly) squared distance of each value to the mean. Units are the squared version of the original variable.
  • Interpretation of the standard deviation: square root of the variance. Measures spread on the same scale as the original variable (same units as the original variable).
var(pockets$minHeightFront)
[1] 12.27745
sd(pockets$minHeightFront)
[1] 3.50392
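To connect these interpretations to the numbers, the variance and standard deviation can be computed by hand and compared with var() and sd(). This is a sketch on a small made-up vector x (not the pockets data).

```r
# Made-up values in cm (not the pockets data)
x <- c(10, 13, 15, 17, 20)
n <- length(x)

# Variance: sum of squared distances to the mean, divided by n - 1
# (dividing by n - 1 rather than n is why it is only "roughly" an average)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance, back in the original units
manual_sd <- sqrt(manual_var)

# These match the built-in functions
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))

# Two other measures of spread
diff(range(x))  # range: max minus min
IQR(x)          # interquartile range: 75th percentile minus 25th percentile
```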


Outliers: Are there any values that are particularly high or low relative to the rest?
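There is no single definition of an outlier. One common convention, and the default used to draw individual points on a ggplot2 boxplot, is to flag values more than 1.5 times the IQR beyond the quartiles. The sketch below applies that rule to a made-up vector x; the cutoff of 1.5 is a convention, not a law.

```r
# Made-up values (not the pockets data); 30 is deliberately far from the rest
x <- c(9.5, 12, 13, 14, 15, 15, 16, 17, 18, 30)

# First and third quartiles, and the IQR between them
q <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: 1.5 IQRs below the first quartile and above the third quartile
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
flagged <- x[x < lower | x > upper]
flagged
```

A flagged value is a prompt to look more closely at that case, not an automatic reason to remove it.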


A good paragraph putting all of these aspects together:

The distribution of the minimum height of front pockets seems right skewed, with values ranging from 9.5 to 25cm. A typical pocket height is about 15cm (median). Pocket heights tend to deviate from the mean by about 3.5cm (SD), and there don’t seem to be extreme outliers.





Exercises

Guiding question: What anxieties have been on Americans’ minds over the decades?

Context: Dear Abby is America’s longest running advice column. Started in 1956 by Pauline Phillips under the pseudonym Abigail van Buren, the column continues to this day under the stewardship of her daughter Jeanne. Each column features one or more letters to Abby from anonymous individuals, all signed with a pseudonym. Abby’s response follows each letter.

In 2018, the data journalism site The Pudding published a visual article called 30 Years of American Anxieties in which the authors explored themes in Dear Abby letters from 1985 to 2017. (We only have the questions, not Abby’s responses.) The codebook is available here.

Exercise 1: Get curious

  • Hypothesize with each other: what themes do you think might come up often in Dear Abby letters?
  • After brainstorming, take a quick glance at the original article from The Pudding to see what themes they explored.
  • Go to the very end of the Pudding article to the section titled “Data and Method”. In thinking about the who, what, when, where, why, and how of data context, what concerns/limitations surface with regards to using this data to learn about Americans’ concerns over the decades?

Exercise 2: Importing and getting to know the data

# Load package
library(readr)

# Read in the Dear Abby data
abby <- read_csv("https://mac-stat.github.io/data/dear_abby.csv")
  1. Click on the Environment tab (generally in the upper right hand pane in RStudio). Then click the abby line. The abby data will pop up as a separate pane (like viewing a spreadsheet) – check it out.

  2. In this tidy dataset, what is the unit of observation? That is, what is represented in each row of the dataset?

  3. What term do we use for the columns of the dataset?

  4. Try out each function below. Identify what each function tells you about the abby data and note this in the ???:

# ??? [what do both numbers mean?]
dim(abby)

# ???
nrow(abby)

# ???
ncol(abby)

# ???
head(abby)

# ???
names(abby)
  5. We can learn what functions do by pulling up help pages. To do this, click inside the Console pane, and enter ?function_name. For example, to pull up a help page for the dim() function, we can type ?dim and hit Enter. Pull up the help page for the head() function.
    • Read the Description.
    • Challenge: Look at the Arguments and Examples sections to figure out how to display the first 10 rows of the abby data (instead of the default first 6 rows).

Exercise 3: Preparing to summarize and visualize the data

In the next exercises, we will be exploring themes in the Dear Abby questions and the overall “mood” or sentiment of the questions. Before continuing, read the codebook for this dataset for some context about sentiment analysis, which gives us a measure of the mood/sentiment of a text.

  1. What sentiment variables do we have in the dataset? Are they quantitative or categorical?

  2. If we were able to create a theme variable that took values like “friendship”, “marriage”, and “relationships”, would theme be quantitative or categorical?

  3. What visualizations are appropriate for looking at the distribution of a single quantitative variable? What about a single categorical variable?

Exercise 4: Exploring themes in the letters

The dplyr package provides many useful functions for managing data (like creating new variables, summarizing information). The stringr package provides tools for working with strings (text). We’ll use these packages to search for words in the questions in order to (roughly) identify themes/subjects.

The code below searches for words related to mothers, fathers, marriage, and money and combines them into a single theme variable.

  • Inside mutate() the line moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom") creates a new variable called moms. If any of the text “mother”, “mama”, or “mom” (which covers “mommy”) is found, then the variable takes the value “mom”. Otherwise, the variable takes the value “no mom”.
  • The dads, marriage, and money variables are created similarly.
  • The themes = str_c(moms, dads, marriage, money, sep = "|") line takes the 4 created variables and combines the text of those variables separated with a |. For example, one value of the themes variable is “mom|no dad|no marriage|no money” (which contains words about moms but not dads, marriage, or money).
library(dplyr)
library(stringr)

abby <- abby %>% 
    mutate(
        moms = ifelse(str_detect(question_only, "mother|mama|mom"), "mom", "no mom"),
        dads = ifelse(str_detect(question_only, "father|papa|dad"), "dad", "no dad"),
        marriage = ifelse(str_detect(question_only, "marriage|marry|married"), "marriage", "no marriage"),
        money = ifelse(str_detect(question_only, "money|finance"), "money", "no money"),
        themes = str_c(moms, dads, marriage, money, sep = "|")
    )
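To see how str_detect(), ifelse(), and str_c() fit together, here is a sketch on four made-up sentences (the vector questions is hypothetical, not taken from the abby data). It also shows why this approach only roughly identifies themes: the pattern "mom" matches any text containing those three letters in a row, including “moment”.

```r
library(stringr)

# Made-up sentences for illustration (not from the abby data)
questions <- c(
    "my mother called",
    "dad and i argued",
    "just a moment",
    "we need money"
)

# TRUE wherever the text contains "mother", "mama", or "mom"
has_mom <- str_detect(questions, "mother|mama|mom")
has_mom

# Turn TRUE/FALSE into labels
moms <- ifelse(has_mom, "mom", "no mom")
dads <- ifelse(str_detect(questions, "father|papa|dad"), "dad", "no dad")

# Glue the labels together, separated by |
themes <- str_c(moms, dads, sep = "|")
themes
```

The third sentence is labeled “mom” even though it is not about mothers, so theme counts built this way should be read as approximate.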
  1. Modify the code above however you wish to replace themes (e.g., replace “moms” with something else) or add new themes to search for. If you want to add a new subject to search for, copy and paste a line for an existing subject above the themes line, and modify the code like this:
    • If your subject is captured by multiple words: YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD1|WORD2|ETC"), "SUBJECT", "NO SUBJECT"),
    • If your subject is captured by a single word: YOUR_SUBJECT = ifelse(str_detect(question_only, "WORD"), "SUBJECT", "NO SUBJECT"),
    • Try to have no more than 6 subjects—otherwise we’ll have too many themes, which will complicate exploration.
  2. The code below makes a barplot of the themes variable using the ggplot2 visualization package. Before making the plot, make note of what you expect the plot might look like. (This might be hard–just do your best!) Then compare to what you observe when you run the code chunk to make the plot. (Clearly defining your expectations first is good scientific practice to avoid confirmation bias.)
# Load package
library(ggplot2)

# barplot
ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
  3. We can follow up on the barplot with a simple numerical summary. Whereas the ggplot2 package is great for visualizations, dplyr is great for numerical summaries. The code below constructs a table of the number of questions with each theme. Make sure that these numerical summaries match up with what you saw in the barplot.
# Construct a table of counts
abby %>% 
    count(themes)
  4. Before proceeding, let’s break down the plotting code above. Run each chunk to see how the lines of code above build up the plot in “layers”. Add comments (on the lines starting with #) to document what you notice.
# ???
ggplot(abby, aes(x = themes))
# ???
ggplot(abby, aes(x = themes)) +
    geom_bar()
# ???
ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# ???
ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    theme_classic()

Exercise 5: Exploring sentiment

We’ll look at the distribution of the afinn_overall sentiment variable and associated summary statistics.

  1. The code below creates a boxplot of this variable. In the comment, make note of how this code is similar to the code for the barplot above. As in the previous exercise, before running the code chunk to create the plot, make note of what you expect the boxplot to look like.
# ???
ggplot(abby, aes(x = afinn_overall)) +
    geom_boxplot()
  2. Challenge: Using the code for the barplot and boxplot as a guide, try to make a histogram and a density plot of the afinn_overall sentiment scores.
    • What information is given by the tallest bar of the histogram?
    • How would you describe the shape of the distribution?
# Histogram

# Density plot
  3. We can compute summary statistics (numerical summaries) for a quantitative variable using the summary() function or with the summarize() function from the dplyr package. (1st Qu. and 3rd Qu. stand for first and third quartile.) After inspecting these summaries, look back to your boxplot, histogram, and density plot. Which plots show which summaries most clearly?
# Summary statistics
# Using summary() - convenient for computing many summaries in one command
# Does not show the standard deviation
summary(abby$afinn_overall)

# Using summarize() from dplyr
# Note that we use %>% to pipe the data into the summarize() function
# We need to use na.rm = TRUE because there are missing values (NAs)
abby %>% 
    summarize(mean(afinn_overall, na.rm = TRUE), median(afinn_overall, na.rm = TRUE), sd(afinn_overall, na.rm = TRUE))
  4. Write a good paragraph describing the information in the histogram (or density plot) by discussing shape, center, spread, and outliers. Incorporate the numerical summaries from the previous part.
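The role of na.rm = TRUE in the summarize() code is easiest to see on a small example. The sketch below uses a made-up tibble toy with one missing score (not the abby data). It also names each summary (mean_score, median_score, sd_score are our own labels), which is optional but makes the output columns easier to read.

```r
library(dplyr)

# Made-up sentiment scores with one missing value (not the abby data)
toy <- tibble(score = c(-6, -1, 3, NA, 12))

# Without na.rm = TRUE, a single NA makes the result NA
mean(toy$score)

# With na.rm = TRUE, the NA is dropped before computing each summary
toy_summary <- toy %>%
    summarize(
        mean_score = mean(score, na.rm = TRUE),
        median_score = median(score, na.rm = TRUE),
        sd_score = sd(score, na.rm = TRUE)
    )
toy_summary
```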

Exercise 6: Box plots vs. histograms vs. density plots

We took 3 different approaches to plotting the quantitative afinn_overall variable above. They all have pros and cons.

  1. What is one pro about the boxplot in comparison to the histogram and density plot?
  2. What is one con about the boxplot in comparison to the histogram and density plots?
  3. In this example, which plot do you prefer and why?

Exercise 7: Explore outliers

Given that Dear Abby is an advice column, it seems natural that the sentiment of the questions would lean more negative. What’s going on with the questions that have particularly positive sentiments?

We can use the filter() function in the dplyr package to look at these questions. Based on the plots of afinn_overall that you made in Exercise 5, pick a threshold for the afinn_overall variable. We’ll say that questions with an overall sentiment score above this threshold are high outliers. Fill in this number where it says YOUR_THRESHOLD below.

abby %>% 
    filter(afinn_overall > YOUR_THRESHOLD) %>% 
    pull(question_only)

What do you notice? Why might these questions have such high sentiment scores?
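If the filter() and pull() steps are unfamiliar, the sketch below shows what each one does on a made-up tibble toy (not the abby data). The threshold of 20 is arbitrary and chosen only to suit these made-up scores.

```r
library(dplyr)

# Made-up letters and scores (not the abby data)
toy <- tibble(
    question_only = c("letter a", "letter b", "letter c", "letter d"),
    afinn_overall = c(-12, 4, 35, NA)
)

# filter() keeps only the rows where the condition is TRUE
# (rows where the score is missing are dropped too)
high_rows <- toy %>%
    filter(afinn_overall > 20)
high_rows

# pull() extracts a single column as a plain vector, which makes long text easier to read
high_text <- high_rows %>%
    pull(question_only)
high_text
```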

Exercise 8: Returning to our context, looking ahead

In this activity, we explored data on Dear Abby questions, with a focus on exploring a single variable at a time.

  • In big picture terms, what have we learned about Dear Abby questions?
  • What further curiosities do you have about the data?

Exercise 9: Different ways to think about data visualization

In working with and visualizing data, it’s important to keep in mind what a data point represents. It can reflect the experience of a real person. It might reflect the sentiment in a piece of art. It might reflect history. We’ve taken one very narrow and technical approach to data visualization. Check out the following examples, and write some notes about anything you find interesting.

Exercise 10: Rendering your work

Save this file, and then click the “Render” button in the menu bar for this pane (blue arrow pointing right). This will create an HTML file containing all of the directions, code, and responses from this activity. A preview of the HTML will appear in the browser.

  • Scroll through and inspect the document to see how your work was translated into this HTML format. Neat!
  • Close the browser tab.
  • Go to the “Background Jobs” pane in RStudio and click the Stop button to end the rendering process.
  • Navigate to your “Activities” subfolder within your “STAT155” folder and locate the HTML file. You can open it again in your browser to double check.

Reflection

Go to the top of this file and review the learning objectives for this lesson. Which objectives do you have a good handle on, are at least familiar with, or are struggling with? What feels challenging right now? What are some wins from the day?

Response: Put your response here.

Additional Practice

If you have time and want additional practice, try out the following exercises.

Exercise 11: Read in and get to know the weather data

Daily weather data are available for 3 locations in Perth, Australia.

  • View the codebook here.
  • Complete the code below to read in the data.
# Replace the ??? with your own name for the weather data
# Replace the ___ with the correct function
??? <- ___("https://mac-stat.github.io/data/weather_3_locations.csv")

Exercise 12: Exploring the data structure

Check out the basic features of the weather data.

# Examine the first six cases

# Find the dimensions of the data

What does a case represent in this data?

Exercise 13: Exploring rainfall

The raintoday variable contains information about rainfall.

  • Is this variable quantitative or categorical?
  • Create an appropriate visualization, and compute appropriate numerical summaries.
  • What do you learn about rainfall in Perth?
# Visualization

# Numerical summaries

Exercise 14: Exploring temperature

The maxtemp variable contains information on the daily high temperature.

  • Is this variable quantitative or categorical?
  • Create an appropriate visualization, and compute appropriate numerical summaries.
  • What do you learn about high temperatures in Perth?
# Visualization

# Numerical summaries

Exercise 15: Customizing! (CHALLENGE)

Though you will naturally absorb some R code throughout the semester, being an effective statistical thinker and “programmer” does not require that we memorize all code. That would be impossible! Instead, using the foundation you built today, do some digging online to learn how to customize your visualizations.

  1. For the histogram below, add a title and more meaningful axis labels. Specifically, title the plot “Distribution of max temperatures in Perth”, change the x-axis label to “Maximum temperature” and y-axis label to “Number of days”. HINT: Do a Google search for something like “add axis labels ggplot”.
# Add a title and axis labels
ggplot(weather, aes(x = maxtemp)) + 
    geom_histogram()
  2. Adjust the code below in order to color the bars green. NOTE: Color can be an effective tool, but here it is simply gratuitous.
# Make the bars green
ggplot(weather, aes(x = raintoday)) + 
    geom_bar()
  3. Check out the ggplot2 cheat sheet. Try making some of the other kinds of univariate plots outlined there.

  4. What else would you like to change about your plot? Try it!







Solutions

Exercise 1: Get curious

  • Results of brainstorming themes will vary
  • From the “Data and Method” section at the end of the Pudding article, we see this paragraph:

The writers of these questions likely skew roughly 2/3 female (according to Pauline Phillips, who mentions the demographics of responses to a survey she disseminated in 1987), and consequently, their interests are overrepresented; we’ve been unable to find other demographic data surrounding their origins. There is, doubtless, a level of editorializing here: only a fraction of the questions that people have written in have seen publication, because agony aunts (the writers of advice columns) must selectively filter what gets published. Nevertheless, the concerns of the day seem to be represented, such as the HIV/AIDS crisis in the 1980s. Additionally, we believe that the large sample of questions in our corpus (20,000+) that have appeared over recent decades gives a sufficient directional sense of broad trends.

  • Writers of the questions are predominately female. The 2/3 proportion was estimated in 1987, so it would be useful to understand shifts in demographics over time.
  • What questions were chosen to be answered on the column? Likely a small fraction of what got submitted. What themes tended to get cut out?

Exercise 2: Importing and getting to know the data

  1. Note how clicking the abby data causes both a popup pane and the command View(abby) to appear in the Console. In fact, the View() function is the underlying command that opens a dataset pane. (View() should always be entered in the Console and NOT your Quarto document.)

  2. Each row / case corresponds to a single question.

  3. Columns = variables

  4. Try out each function below. Identify what each function tells you about the abby data and note this in the ???:

# First number = number of rows / cases
# Second number = number of columns / variables
dim(abby)
[1] 20034    16
# Number of rows (cases)
nrow(abby)
[1] 20034
# Number of columns (variables)
ncol(abby)
[1] 16
# View first few rows of the dataset (6 rows, by default)
head(abby)
# A tibble: 6 × 16
   year month day   url     title letterId question_only afinn_overall afinn_pos
  <dbl> <dbl> <chr> <chr>   <chr>    <dbl> <chr>                 <dbl>     <dbl>
1  1985     1 01    proque… WOMA…        1 "i have been…           -30         5
2  1985     1 01    proque… WOMA…        1 "this is for…           -30         5
3  1985     1 02    proque… LAME…        1 "our 16-year…             1         3
4  1985     1 03    proque… 'NOR…        1 "i was a hap…            -3         7
5  1985     1 04    proque… IT'S…        1 "you be the …            13        31
6  1985     1 04    proque… IT'S…        1 "a further w…            13        31
# ℹ 7 more variables: afinn_neg <dbl>, bing_pos <dbl>, moms <chr>, dads <chr>,
#   marriage <chr>, money <chr>, themes <chr>
# Get all column (variable) names
names(abby)
 [1] "year"          "month"         "day"           "url"          
 [5] "title"         "letterId"      "question_only" "afinn_overall"
 [9] "afinn_pos"     "afinn_neg"     "bing_pos"      "moms"         
[13] "dads"          "marriage"      "money"         "themes"       
  5. We can display the first 10 rows with head(abby, n = 10).

Exercise 3: Preparing to summarize and visualize the data

  1. The sentiment variables are afinn_overall, afinn_pos, afinn_neg, and bing_pos, and they are quantitative. The afinn variables don’t have units but we can still get a sense of the scale by remembering that each word gets a score between -5 and 5. The bing_pos variable doesn’t have units because it’s a fraction, but we know that it ranges from 0 to 1.

  2. theme would be categorical.

  3. Appropriate visualizations:

    • single quantitative variable: boxplot, histogram, density plot
    • single categorical variable: barplot

Exercise 4: Exploring themes in the letters

  1. Code will vary

  2. Expectations about the plot will vary

ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

  3. Counts in the table below match the barplot
# Construct a table of counts
abby %>% 
    count(themes)
# A tibble: 16 × 2
   themes                                 n
   <chr>                              <int>
 1 mom|dad|marriage|money                67
 2 mom|dad|marriage|no money            567
 3 mom|dad|no marriage|money            109
 4 mom|dad|no marriage|no money         906
 5 mom|no dad|marriage|money            121
 6 mom|no dad|marriage|no money         839
 7 mom|no dad|no marriage|money         293
 8 mom|no dad|no marriage|no money     2462
 9 no mom|dad|marriage|money             41
10 no mom|dad|marriage|no money         350
11 no mom|dad|no marriage|money          96
12 no mom|dad|no marriage|no money      760
13 no mom|no dad|marriage|money         360
14 no mom|no dad|marriage|no money     2967
15 no mom|no dad|no marriage|money      865
16 no mom|no dad|no marriage|no money  9231
  4. What do the plot layers do?
# Just sets up the "canvas" of the plot with axis labels
ggplot(abby, aes(x = themes))

# Adds the bars
ggplot(abby, aes(x = themes)) +
    geom_bar()

# Rotates the x axis labels
ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# Changes the visual theme of the plot (white background, no gridlines)
# Because this complete theme is added last, it also resets the rotated x axis labels
ggplot(abby, aes(x = themes)) +
    geom_bar() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
    theme_classic()
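One detail about layer order: theme_classic() is a complete theme, so when it is added after theme(), it replaces the earlier adjustments and the x axis labels are no longer rotated. If you want both the classic look and rotated labels, one option is to put theme_classic() first. The sketch below uses a made-up data frame toy (not the abby data) so that it runs on its own.

```r
library(ggplot2)

# Made-up theme labels (not the abby data)
toy <- data.frame(
    themes = c("mom|no dad", "no mom|dad", "no mom|no dad", "no mom|no dad")
)

# Complete theme first, then the specific tweak, so the tweak is not overwritten
rotated_classic <- ggplot(toy, aes(x = themes)) +
    geom_bar() +
    theme_classic() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

rotated_classic
```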

Exercise 5: Exploring sentiment

Now we’ll look at the distribution of the afinn_overall variable and associated summary statistics.

    • We might expect the mean of this variable to be less than zero, given that more negative words might appear in questions on an advice column.
    • The code has a similar structure to the barplot in that there is an initial ggplot() layer which sets the canvas, then a + to add a layer, then the final layer geom_boxplot() (like geom_bar()) which tells R what type of plot to make.
ggplot(abby, aes(x = afinn_overall)) +
    geom_boxplot()
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_boxplot()`).

  2. We replace geom_boxplot() with geom_histogram() and geom_density().
    • The tallest bar of the histogram indicates that over 7500 questions had an overall afinn sentiment score between around -8 and 0. (The -8 to 0 comes from eyeballing where the tallest bar is placed on the x-axis, and the height of this bar indicates how many cases fall into that bin.)
    • The shape of the distribution: roughly symmetric
# Histogram
ggplot(abby, aes(x = afinn_overall)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_bin()`).

# Density plot
ggplot(abby, aes(x = afinn_overall)) +
    geom_density()
Warning: Removed 490 rows containing non-finite outside the scale range
(`stat_density()`).

    • Boxplot shows min, max, median, 1st and 3rd quartile easily. (It shows median, 1st and 3rd quartile directly as lines)
    • Histogram and density plot show min and max but the mean and median aren’t shown directly–we have to roughly guess based on the peak of the distribution
# Summary statistics
summary(abby$afinn_overall)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-140.000   -6.000   -1.000   -1.401    3.000  100.000      490 
abby %>% 
    summarize(mean(afinn_overall, na.rm = TRUE), median(afinn_overall, na.rm = TRUE), sd(afinn_overall, na.rm = TRUE))
# A tibble: 1 × 3
  mean(afinn_overall, na.rm = TR…¹ median(afinn_overall…² sd(afinn_overall, na…³
                             <dbl>                  <dbl>                  <dbl>
1                            -1.40                     -1                   11.1
# ℹ abbreviated names: ¹​`mean(afinn_overall, na.rm = TRUE)`,
#   ²​`median(afinn_overall, na.rm = TRUE)`, ³​`sd(afinn_overall, na.rm = TRUE)`
  4. The distribution of sentiment scores is roughly symmetric, with a mean of -1.4 and a similar median of -1. The standard deviation is about 11.08, which tells us how much scores typically vary around the center. This is fairly large given that the middle 50% of scores fall between -6 and 3 (an IQR of 9 units), which reflects the long tails: there are extreme outliers on both ends, with scores as low as -140 and as high as 100.

Exercise 6: Box plots vs. histograms vs. density plots

  1. Boxplots very clearly show key summary statistics like median, 1st and 3rd quartile
  2. Boxplots can oversimplify by not showing the shape of the distribution.

Exercise 7: Explore outliers

There are some positive words in the questions that seem to pull up the sentiment score a lot despite the negative overall tone. From this we can see the limitations of a basic sentiment analysis in which the sentiment of each word is considered in isolation.
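The limitation can be sketched with a toy version of word-by-word scoring. This assumes the overall score is built by adding up word-level scores; the lexicon below is made up for illustration and does not contain the actual AFINN values, and the sentence is our own.

```r
# Made-up word scores for illustration; these are NOT the actual AFINN values
lexicon <- c(love = 3, sweet = 2, generous = 2, troubles = -2, unworthy = -2)

# A worried sentence that still contains several positive words
text <- "we dearly love our sweet generous daughter but her unworthy husband troubles us"
words <- strsplit(text, " ")[[1]]

# Look up each word; words missing from the lexicon come back as NA
word_scores <- lexicon[words]

# Overall score = sum of the scores of the words that were found
overall <- sum(word_scores, na.rm = TRUE)
overall
```

The total comes out positive even though the sentence expresses a worry, because each word is scored without regard to the words around it.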

abby %>% 
    filter(afinn_overall > 50) %>% 
    pull(question_only) %>% 
    head() # Just to look at first 6
[1] "i am a 36-year-old college dropout whose lifelong ambition was to be a physician. i have a very good job selling pharmaceutical supplies, but my heart is still in the practice of medicine. i do volunteer work at the local hospital on my time off, and people tell me i would have made a wonderful doctor.\nif i go back to college and get my degree, then go to medical school, do my internship and finally get into the actual practice of medicine, it will take me seven years! but, abby, in seven years i will be 43 years old. what do you think?\nunfulfilled in philly"
[2] "we have an only child--a grown daughter we dearly love--and when we pass on, we want to leave her our entire estate, which is considerable.\nthe thing that troubles us is this: our daughter is married to a very unworthy character. for years he has taken advantage of her sweet, forgiving, generous nature because he knows she worships him. we are sure that whatever we leave our daughter will be spent on this dirty dog.\nhow can we prevent this from happening?\nbewildered"
[3] "both of our sons have been married for about 15 years. their wives were of normal weight when they married our sons, but one daughter-in- law weighs about 300 pounds and the other weighs about 225. their ages are 35 and 37. both our sons are good-looking, and neither is fat.\nour daughters-in-law seem to have no pride in their appearance, which upsets everyone in the family, except themselves. they are fat, they know it and they don't care! when they first began to put on weight, they tried various diets, pills, doctors, etc., but they both gave up and decided to \"accept\" themselves as they are.\nthey wear the wrong kind of clothes (shorts and blue jeans) without any apologies.\nour problem (my husband's and mine) is how do we cope with this? we are ashamed to be around them. our sons have accepted the situation, but we seem unable to.\nperhaps we need more help than the girls. any suggestions?\nupset in florida"
[4] "the letter from \"concerned mom,\" who was trying to teach her 5-year-old not to accept gifts from strangers, prompts this letter.\na gentleman friend of mine recently stood in line behind a mother and her young daughter at a bank. the child remarked on the visor he was wearing, as it had the name of a popular pizza imprinted on it.\nmy friend, who is the public relations director for this pizza firm, wanted the child to have the visor but, instead of giving it to the child, he handed the visor to her mother and said to the child: \"i'm giving this to your mother to give to you, because she's probably told you never to accept gifts from a stranger. you won't ever do that, will you?\"\nwhat a thoughtful way to be friendly while reinforcing a message mothers cannot stress enough.\nsue in wichita, kan."
[5] "in january, i sent an original manuscript as a gift to woody allen. i had hand-bound the pages, and decorated the binding with baroque pearls and amethyst. i enclosed my name, address and telephone number. i had hoped that woody would send me a note or call me, or at the very least, instruct his secretary to do so.\nto date, i haven't received even an acknowledgment that my gift was received. is it unrealistic of me to expect a thank-you from a famous person?\ndisappointed in california."
[6] "will you please, please discourage high school and college graduates from sending graduation invitations to every distant relative they and their parents ever heard of? we all know that sending \"invitations\" to people we hardly know is a flagrant, shameless bid for a gift. and if, in a moment of weakness, one does send a gift, a barrage of birth announcements and invitations to weddings, showers and more graduations is sure to follow.\ni am a 75-year-old widow, living on social security and very little else. i just received a high school graduation invitation from the granddaughter of a third cousin whom i have not seen in so long i wouldn't even recognize her. (i have never even met her granddaughter.)\ni have many relatives in this town, but i never hear from them unless they are celebrating something that requires a gift. i have no car, yet they \"invite\" me to every imaginable event, knowing full well i can't possibly attend. this is just shameless begging.\ni am not cheap. i just sent a generous graduation gift to a neighbor girl who used to stop by every day to bring in my mail and newspaper and ask if i needed any errands run.\ndon't suggest that i send \"a nice card\" to the relatives who send me invitations to events they know i can't attend. we both know a card is not what these spongers want.\nsick of them in iowa city"

Exercise 8: Returning to our context, looking ahead

  • Answers will vary

Exercise 11: Read in and get to know the weather data

weather <- read_csv("https://raw.githubusercontent.com/Mac-STAT/data/main/weather_3_locations.csv")
Rows: 2367 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): location, windgustdir, winddir9am, winddir3pm, raintoday, raintom...
dbl  (17): mintemp, maxtemp, rainfall, evaporation, sunshine, windgustspeed,...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exercise 12: Exploring the data structure

Check out the basic features of the weather data.

# Examine the first six cases
head(weather)
# A tibble: 6 × 24
  date       location  mintemp maxtemp rainfall evaporation sunshine windgustdir
  <date>     <chr>       <dbl>   <dbl>    <dbl>       <dbl>    <dbl> <chr>      
1 2020-01-01 Wollongo…    17.1    23.1        0          NA       NA SSW        
2 2020-01-02 Wollongo…    17.7    24.2        0          NA       NA SSW        
3 2020-01-03 Wollongo…    19.7    26.8        0          NA       NA NE         
4 2020-01-04 Wollongo…    20.4    35.5        0          NA       NA SSW        
5 2020-01-05 Wollongo…    19.8    21.4        0          NA       NA SSW        
6 2020-01-06 Wollongo…    18.3    22.9        0          NA       NA NE         
# ℹ 16 more variables: windgustspeed <dbl>, winddir9am <chr>, winddir3pm <chr>,
#   windspeed9am <dbl>, windspeed3pm <dbl>, humidity9am <dbl>,
#   humidity3pm <dbl>, pressure9am <dbl>, pressure3pm <dbl>, cloud9am <dbl>,
#   cloud3pm <dbl>, temp9am <dbl>, temp3pm <dbl>, raintoday <chr>,
#   risk_mm <dbl>, raintomorrow <chr>
# Find the dimensions of the data
dim(weather)
[1] 2367   24

Each case represents a single day at a single location (Hobart, Uluru, or Wollongong, as recorded in the location variable).
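One way to check what a case represents is to count rows per location and confirm that no date-location pair repeats. The sketch below uses a small made-up data frame (toy_weather is hypothetical, not the course data); the same two count() calls could be run on weather.

```r
library(dplyr)

# Hypothetical stand-in for the weather data: one row per day per location
toy_weather <- tibble(
    date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01",
                     "2020-01-02", "2020-01-01")),
    location = c("Hobart", "Hobart", "Uluru", "Uluru", "Wollongong")
)

# Number of cases (days) recorded at each location
location_counts <- toy_weather %>% 
    count(location)

# If a case is one day at one location, no date-location pair should repeat
duplicate_cases <- toy_weather %>% 
    count(date, location) %>% 
    filter(n > 1)

location_counts
duplicate_cases
```

An empty duplicate_cases table is consistent with "one row per day per location".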

Exercise 13: Exploring rainfall

The raintoday variable contains information about rainfall.

  • raintoday is categorical (No, Yes)
  • Days without rain are far more common: 1864 days had no rain and 446 had rain. Another 57 days have a missing value (NA).
# Visualization
ggplot(weather, aes(x = raintoday)) +
    geom_bar()

# Numerical summaries
weather %>% 
    count(raintoday)
# A tibble: 3 × 2
  raintoday     n
  <chr>     <int>
1 No         1864
2 Yes         446
3 <NA>         57
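Counts are easier to compare as proportions. As a minimal sketch, the code below re-enters the three counts from the output above by hand (rain_counts and rain_props are illustrative names) and computes the share of non-missing days in each category.

```r
library(dplyr)

# Counts copied from the count(raintoday) output above
rain_counts <- tibble(
    raintoday = c("No", "Yes", NA),
    n = c(1864, 446, 57)
)

# Drop the missing category, then divide each count by the total
rain_props <- rain_counts %>% 
    filter(!is.na(raintoday)) %>% 
    mutate(prop = n / sum(n))

rain_props
```

Among the 2310 days with a recorded value, about 81% had no rain. With the full data loaded, the same mutate() step could be added directly after count(raintoday).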

Exercise 14: Exploring temperature

The maxtemp variable contains information on the daily high temperature.

  • maxtemp is quantitative
  • The typical max temperature is around 23 degrees Celsius (mean 23.62, median 22). Max temperatures ranged from 8.6 to 45.4 degrees. On a typical day, the max temperature falls about 7.8 degrees from the mean (the standard deviation). The distribution has multiple modes, which likely reflects the different locations in the dataset.
# Visualization
ggplot(weather, aes(x = maxtemp)) + 
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 34 rows containing non-finite outside the scale range
(`stat_bin()`).

# Numerical summaries
summary(weather$maxtemp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   8.60   18.10   22.00   23.62   27.40   45.40      34 
# There are missing values (NAs) in this variable, so we add
# the na.rm = TRUE argument
weather %>% 
    summarize(sd(maxtemp, na.rm = TRUE))
# A tibble: 1 × 1
  `sd(maxtemp, na.rm = TRUE)`
                        <dbl>
1                        7.80
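The center and spread summaries can also be collected in a single summarize() call with named columns, which gives a tidier table than the default column name shown above. This is a sketch on a small made-up vector (toy_temps is hypothetical); the column names are illustrative choices.

```r
library(dplyr)

# Hypothetical temperatures, including one missing value
toy_temps <- tibble(maxtemp = c(18.1, 22.0, 27.4, NA, 35.5))

# na.rm = TRUE drops the NA before each calculation
temp_summary <- toy_temps %>% 
    summarize(
        mean_maxtemp = mean(maxtemp, na.rm = TRUE),
        median_maxtemp = median(maxtemp, na.rm = TRUE),
        sd_maxtemp = sd(maxtemp, na.rm = TRUE),
        n_missing = sum(is.na(maxtemp))
    )

temp_summary
```

Reporting n_missing alongside the statistics makes it clear how many cases were left out.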

Exercise 15: Customizing! (CHALLENGE)

ggplot(weather, aes(x = maxtemp)) + 
    geom_histogram() + 
    labs(x = "Maximum temperature", y = "Number of days", title = "Distribution of max temperatures in Hobart, Uluru, and Wollongong")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 34 rows containing non-finite outside the scale range
(`stat_bin()`).

# Make the bars green
ggplot(weather, aes(x = raintoday)) + 
    geom_bar(fill = "green")
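Two further customizations may be worth trying. Setting binwidth explicitly replaces the default of 30 bins that the `stat_bin()` message refers to, and facet_wrap() draws one histogram per location, which is one way to check whether the multiple modes come from the different locations. The sketch below uses made-up temperatures (toy_weather is hypothetical), and the 2-degree binwidth is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical max temperatures for three locations
toy_weather <- data.frame(
    location = rep(c("Hobart", "Uluru", "Wollongong"), each = 4),
    maxtemp = c(12.5, 15.0, 17.2, 19.8,
                30.1, 34.6, 38.9, 41.3,
                20.4, 22.9, 23.1, 26.8)
)

# binwidth = 2 makes each bar cover 2 degrees
# facet_wrap() splits the plot into one panel per location
p <- ggplot(toy_weather, aes(x = maxtemp)) + 
    geom_histogram(binwidth = 2, color = "white") + 
    facet_wrap(~ location, ncol = 1) + 
    labs(x = "Maximum temperature (degrees Celsius)", y = "Number of days")

p
```

Stacking the panels in one column keeps the x-axis aligned, so the locations can be compared directly.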