SIDS Ch 3 - Data Wrangling

tidyverse
R
Author: Colin Madland

Published: May 18, 2025

Ismay, C., Kim, A. Y., & Valdivia, A. (2025). Statistical Inference via Data Science: A ModernDive into R and the Tidyverse (2nd ed.). Chapman and Hall/CRC.

library(nycflights23)
library(ggplot2)
library(moderndive)
library(tibble)
library(viridis)
Loading required package: viridisLite
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

3 components of the Grammar of Graphics

data
the dataset containing the variables of interest.
geom
the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
aes
aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
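
As a rough sketch of how the three components come together in a single call, the example below uses `mpg`, a dataset bundled with ggplot2. That choice is an assumption made for illustration; `mpg` is not one of the datasets used in these notes.

```r
library(ggplot2)

# data: the mpg data frame that ships with ggplot2 (illustrative only)
# aes:  engine displacement -> x, highway mileage -> y, vehicle class -> color
# geom: points, i.e. a scatterplot
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p
```

Swapping `geom_point()` for another geom changes the type of plot while the data and aesthetic mappings stay the same.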

Other components

facet
to break up a plot into several plots split by the values of another variable (Section 2.6)
position
position adjustments for barplots (Section 2.8)

Five Named graphs - 5NG

  1. scatterplots
  2. linegraphs
  3. bar graphs
  4. histograms
  5. boxplots

Scatterplots

Used to see the relationship between two numerical variables.

View(envoy_flights)
ggplot(data = envoy_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point()
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

  • overplotting: many points overlap, which hides how many flights share similar values
    • change transparency with the alpha argument
ggplot(data = envoy_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2)
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).

  • add jitter
ggplot(data = envoy_flights, mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_jitter(width = 30, height = 30)
Warning: Removed 3 rows containing missing values or values outside the scale range
(`geom_point()`).
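
To make the `width` and `height` arguments concrete: jittering shifts each point by a small random amount, so `width = 30` can move a point up to 30 units (here, minutes) left or right. The base R sketch below mimics that idea on made-up delays. It is an illustration under that assumption, not ggplot2's actual code, and `dep_delay` here is a hypothetical vector rather than the `envoy_flights` column.

```r
set.seed(76)  # arbitrary seed so the random shifts are reproducible

# Made-up departure delays: five flights sharing the same value
dep_delay <- c(10, 10, 10, 10, 10)
width <- 30

# Shift each value by a random amount between -width and +width
jittered <- dep_delay + runif(length(dep_delay), min = -width, max = width)

range(jittered - dep_delay)  # every shift lies within [-30, 30]
```

Because the delays are measured in minutes, a 30-unit shift is large relative to much of the data, so a smaller value may preserve the underlying pattern better.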

Linegraphs

Show the relationship between two variables when the x-axis (explanatory variable) is sequential.

View(weather)
glimpse(weather)
Rows: 26,207
Columns: 15
$ origin     <chr> "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JFK", "JF…
$ year       <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023,…
$ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ hour       <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ temp       <dbl> 48.0, 48.2, 49.0, 49.0, 49.0, 48.0, 46.4, 46.0, 48.0, 47.0,…
$ dewp       <dbl> 48.0, 48.2, 49.0, 49.0, 49.0, 48.0, 46.4, 46.0, 48.0, 47.0,…
$ humid      <dbl> 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 100.00, 100…
$ wind_dir   <dbl> 0, 190, 190, 250, 170, 0, 250, 230, 260, 250, 240, 260, 260…
$ wind_speed <dbl> 0.00000, 4.60312, 5.75390, 5.75390, 8.05546, 0.00000, 9.206…
$ wind_gust  <dbl> 0.000000, 5.297178, 6.621473, 6.621473, 9.270062, 0.000000,…
$ precip     <dbl> 1e-02, 1e-02, 1e-04, 2e-02, 1e-04, 1e-04, 0e+00, 0e+00, 0e+…
$ pressure   <dbl> 1010.2, 1009.2, 1009.0, 1008.0, 1007.8, 1007.6, 1007.3, 100…
$ visib      <dbl> 0.25, 2.50, 0.25, 4.00, 0.75, 0.75, 0.24, 0.50, 8.00, 5.00,…
$ time_hour  <dttm> 2023-01-01 00:00:00, 2023-01-01 01:00:00, 2023-01-01 02:00…
ggplot(data = early_january_2023_weather, 
       mapping = aes(x = time_hour, y = wind_speed)) +
  geom_line()

Histograms

Show the distribution of one numerical variable.

ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_bin()`).

ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(color = "white", fill = "steelblue")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_bin()`).

adjusting bins
  • number of bins
ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(bins = 20, color = "white")
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_bin()`).

  • binwidth
ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(binwidth = 5, color = "white")
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_bin()`).
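
The two arguments are two ways of setting the same thing: fixing the bin width determines how many bins are needed to cover the data, and fixing the number of bins determines their width. A minimal base R sketch on made-up wind speeds (not the `weather` data), using `cut()` rather than ggplot2, whose default bin alignment may differ:

```r
# Made-up wind speeds in mph (illustrative values only)
wind_speed <- c(0, 3.5, 4.6, 5.8, 8.1, 9.2, 11.5, 12.7, 17.3, 24.2, 38)

binwidth <- 5
# Bin edges every 5 mph, from 0 up to the first edge at or above the maximum
breaks <- seq(0, ceiling(max(wind_speed) / binwidth) * binwidth, by = binwidth)

# Count the observations in each bin: [0,5], (5,10], ..., (35,40]
counts <- table(cut(wind_speed, breaks = breaks, include.lowest = TRUE))
counts
length(counts)  # 8 bins of width 5 cover the range 0 to 40
```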

Facets

Dividing plots by subcategories in the data.

ggplot(data = weather, mapping = aes(x = wind_speed)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~ month)
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_bin()`).

Boxplots

Display the distribution of data and include five numbers to summarize the data:

  • minimum
  • first quartile (25th percentile)
  • median (2nd quartile, 50th percentile)
  • third quartile (75th percentile)
  • maximum
  • also shows the IQR (middle 50%)
    • whiskers extend no more than 1.5 IQR units beyond the 25th and 75th percentiles.
    • points beyond 1.5 IQR units may be considered outliers.
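
The quantities listed above can be worked out by hand. The sketch below does so in base R on a made-up vector `x` (not the `weather` data), using R's default `quantile()` method; other quartile definitions can give slightly different values.

```r
# Made-up values with one unusually large observation (illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, probs = c(0.25, 0.50, 0.75))  # quartiles (R's default method)
iqr <- unname(q[3] - q[1])                     # interquartile range

# Fences: the furthest a whisker may reach
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Points beyond the fences are drawn individually as potential outliers
outliers <- x[x < lower_fence | x > upper_fence]

iqr       # 4.5
whiskers  # 1 9
outliers  # 30
```

Here the upper whisker ends at 9, the largest value inside the upper fence of 14.5, and 30 is flagged as a potential outlier.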
ggplot(data = weather, mapping = aes(x = month, y = wind_speed)) +
  geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Note

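The code above returns an invalid plot and a warning: month is numeric, so ggplot2 treats the x-axis as continuous and does not split the data into one box per month. Mapping group = month fixes this.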

ggplot(data = weather, mapping = aes(x = month, y = wind_speed, group = month)) +
  geom_boxplot() 
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_boxplot()`).

OR

  • convert month to a factor with factor(month) and use geom_violin() for a more detailed view of each distribution.
ggplot(data = weather, mapping = aes(x = factor(month), y = wind_speed)) +
  geom_violin()
Warning: Removed 1041 rows containing non-finite outside the scale range
(`stat_ydensity()`).

Barplots

Show the distribution of a categorical variable as frequencies (the count in each category). Which geom to use depends on whether the items have already been counted.

fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))
fruits_counted <- tibble(
  fruit = c("apple", "orange"),
  number = c(3, 2))
ggplot(data = fruits, mapping = aes(x = fruit)) +
  geom_bar()

ggplot(data = fruits_counted, mapping = aes(x = fruit)) +
  geom_bar()

ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
  geom_col()

  • when items are not counted, use geom_bar() with fruit mapped to the x aes
  • when items are counted, we add number to the y aes and use geom_col()
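
The two forms of the data are linked: counting the rows of the uncounted table produces the counted one. A minimal sketch, assuming dplyr's `count()` is an acceptable way to do the counting (these notes build `fruits_counted` by hand instead):

```r
library(tibble)
library(dplyr)
library(ggplot2)

# Uncounted data: one row per piece of fruit
fruits <- tibble(fruit = c("apple", "apple", "orange", "apple", "orange"))

# count() collapses the rows to one per fruit; `name` sets the count column
fruits_counted <- count(fruits, fruit, name = "number")
fruits_counted  # apple 3, orange 2

# Same bar heights as geom_bar() on `fruits`, now drawn from the counted table
p <- ggplot(data = fruits_counted, mapping = aes(x = fruit, y = number)) +
  geom_col()
p
```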
ggplot(flights,  aes(x = carrier)) +
  geom_bar()

  • geom_histogram() has bars that touch, but geom_bar() has bars with white space between
View(airlines)

Stacked Barplot
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar()

Dodged Barplot
ggplot(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "plasma")

Summary

| | Named graph | Shows | Geometric object | Notes |
|---|---|---|---|---|
| 1 | Scatterplot | Relationship between 2 numerical variables | geom_point() | |
| 2 | Linegraph | Relationship between 2 numerical variables | geom_line() | Used when there is a sequential order to the x-variable, e.g., time |
| 3 | Histogram | Distribution of 1 numerical variable | geom_histogram() | Faceted histograms show the distribution of 1 numerical variable split by the values of another variable |
| 4 | Boxplot | Distribution of 1 numerical variable split by the values of another variable | geom_boxplot() | |
| 5 | Barplot | Distribution of 1 categorical variable | geom_bar() when counts are not pre-counted, geom_col() when counts are pre-counted | Stacked, side-by-side, and faceted barplots show the joint distribution of 2 categorical variables |