discovr_02 - Summarizing Data

Categories: frequency, histograms, variance, standard deviation, IQR, R, discovr

Author: Colin Madland
Published: April 18, 2024

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ice_tib <- here::here("data/ice_bucket.csv") |> readr::read_csv()
Rows: 2323000 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): upload_day

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Frequency tables

  • use the group_by(), summarise(), and n() functions from dplyr
group_by()

groups data by whatever variable(s) you name within the function

summarise()

creates a summary table based on the variables named in the function

n()

counts the number of scores

To count frequencies:

  • tell R to treat values that are the same as being in the same category
    • group_by(upload_day) tells R that scores that are the same within upload_day are in the same group
    • subsequent operations are conducted on the groups
  • count how many scores fall into each category
    • summarise() creates a variable called frequency that counts how many items are in each group created by group_by()
freq_tbl <- ice_tib |>
  dplyr::group_by(upload_day) |> 
  dplyr::summarise(
    frequency = n()
  )
freq_tbl
# A tibble: 56 × 2
   upload_day frequency
        <dbl>     <int>
 1         21      2000
 2         22      2000
 3         23      4000
 4         24      4000
 5         25      8000
 6         26      8000
 7         27     10000
 8         28     16000
 9         29     20000
10         30     29000
# ℹ 46 more rows
  • this is a large table and a bit unwieldy
  • use a grouped frequency distribution
    • place values of upload_days into bins
  • if we want to split the variable upload_day into bins of 4 days…
ggplot2::cut_width(upload_day, 4)
  • combine this with dplyr::mutate() to create a new variable called days_group
gp_freq_dist <- ice_tib |> 
  dplyr::mutate(
    days_group = ggplot2::cut_width(upload_day, 4)
    )
gp_freq_dist
# A tibble: 2,323,000 × 2
   upload_day days_group
        <dbl> <fct>     
 1         34 (30,34]   
 2         36 (34,38]   
 3         31 (30,34]   
 4         30 (26,30]   
 5         33 (30,34]   
 6         38 (34,38]   
 7         36 (34,38]   
 8         46 (42,46]   
 9         45 (42,46]   
10         31 (30,34]   
# ℹ 2,322,990 more rows
  • this creates a new object called gp_freq_dist that contains each value in ice_tib, plus an extra column/variable called days_group indicating which bin each value of upload_day falls in
Set notation
  • the value of upload_day now has a corresponding value of days_group containing the bin
  • the first score of 34 has been assigned to the bin labelled (30,34], which is the bin containing any score above 30, up to and including 34
  • the label uses standard mathematical notation for sets where ( or ) means ‘not including’ and [ or ] means ‘including’
  • now we can use summarise() and n() to count scores as before, except grouping by days_group instead of upload_day

Coding challenge

Create a grouped frequency table called gp_freq_dist by starting with the code in the code example and then using the code we used to create freq_tbl to create a pipe that summarizes the grouped scores.

gp_freq_dist <- ice_tib |> 
  dplyr::mutate(
    days_group = ggplot2::cut_width(upload_day, 4)
    ) |>
  dplyr::group_by(days_group) |> 
  dplyr::summarise(
    frequency = n()
  )
gp_freq_dist
# A tibble: 15 × 2
   days_group frequency
   <fct>          <int>
 1 [18,22]         4000
 2 (22,26]        24000
 3 (26,30]        75000
 4 (30,34]       367000
 5 (34,38]       770000
 6 (38,42]       534000
 7 (42,46]       255000
 8 (46,50]       102000
 9 (50,54]        70000
10 (54,58]        38000
11 (58,62]        26000
12 (62,66]        18000
13 (66,70]        18000
14 (70,74]        16000
15 (74,78]         6000

Relative Frequencies

  • we have an object gp_freq_dist that contains the number of days grouped into bins of 4 days and the number of videos uploaded during each of the time periods represented by those bins
  • to calculate the relative frequency we can use dplyr::mutate() to add a variable that divides the frequency by the total number of videos using sum()
... |>
    dplyr::mutate(
        relative_freq = frequency/sum(frequency) # creates a new column
    )

Efficient Code

  • rather than creating the table of relative frequencies step-by-step, it is usually more efficient to carry out the steps in one piece of code
gp_freq_dist <- ice_tib |> 
  dplyr::mutate(
    days_group = ggplot2::cut_width(upload_day, 4)
    ) |> 
  dplyr::group_by(days_group) |> 
  dplyr::summarise(
    frequency = n()
  ) |> 
  dplyr::mutate(
    relative_freq = frequency/sum(frequency),
    percent = relative_freq*100
  )
  
gp_freq_dist
# A tibble: 15 × 4
   days_group frequency relative_freq percent
   <fct>          <int>         <dbl>   <dbl>
 1 [18,22]         4000       0.00172   0.172
 2 (22,26]        24000       0.0103    1.03 
 3 (26,30]        75000       0.0323    3.23 
 4 (30,34]       367000       0.158    15.8  
 5 (34,38]       770000       0.331    33.1  
 6 (38,42]       534000       0.230    23.0  
 7 (42,46]       255000       0.110    11.0  
 8 (46,50]       102000       0.0439    4.39 
 9 (50,54]        70000       0.0301    3.01 
10 (54,58]        38000       0.0164    1.64 
11 (58,62]        26000       0.0112    1.12 
12 (62,66]        18000       0.00775   0.775
13 (66,70]        18000       0.00775   0.775
14 (70,74]        16000       0.00689   0.689
15 (74,78]         6000       0.00258   0.258

Histograms

  • ggplot2 can produce data visualizations
Tip: Always load ggplot2!

We’ve discussed elsewhere that if you include packages when you use functions (e.g., dplyr::mutate()) you don’t need to explicitly load the package (in this case dplyr). However, to create plots with ggplot2 you build them up layer by layer, which means you use a lot of ggplot2 functions. For this reason, I advise loading it at the start of your Quarto document and not worrying too much about including package references when you use functions. You can load it either with library(ggplot2) or by loading the entire tidyverse using library(tidyverse).

  • general form of ggplot2
`ggplot2::ggplot(my_tib, aes(variable_for_x_axis, variable_for_y_axis))`
ggplot2::ggplot(ice_tib, aes(upload_day))

  • something is missing because we only told ggplot2 ‘what’ to plot, not ‘how’ to plot it.
  • need to add a geom with geom_histogram()
ggplot2::ggplot(ice_tib, aes(upload_day)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Changing bin widths

ggplot2::ggplot(ice_tib, aes(upload_day)) +
  geom_histogram(binwidth = 1)

Changing colours

  • include fill = within the geom_histogram() function
ggplot2::ggplot(ice_tib, aes(upload_day)) +
  geom_histogram(binwidth = 1, fill = "#440154")

Transparency and axis labels

ggplot2::ggplot(ice_tib, aes(upload_day)) +
  geom_histogram(binwidth = 1, fill = "#440154", alpha = 0.25) +
  labs(y = "Frequency", x = "Days since first ice bucket challenge video")

Themes

ggplot2::ggplot(ice_tib, aes(upload_day)) +
  geom_histogram(binwidth = 1, fill = "#440154", alpha = 0.9) +
  labs(y = "Frequency", x = "Days since first ice bucket challenge video") +
  theme_minimal()

Summarizing data

Mean and median

mean(variable, trim = 0, na.rm = FALSE)
trim

allows you to trim scores before calculating the mean by specifying a value between 0 and 0.5. The default is 0 (no trim). To trim 10% of scores from each end of the distribution you could set trim = 0.1

na.rm

stands for NA remove. Missing values are denoted NA, for ‘not available’. By setting na.rm = TRUE (or na.rm = T), R will remove missing values before computing the mean

Missing Values

The default in many functions is not to remove missing values (e.g. na.rm = FALSE). If you have missing values in your data and don’t change this default behaviour, functions like mean() will return NA rather than a number. Therefore, if you get NA from a function like mean(), check whether you have missing values and whether you have forgotten to set na.rm = TRUE.
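As a quick illustration of the na.rm behaviour (using a hypothetical vector, not the ice bucket data):

```r
# Hypothetical scores with one missing value
scores <- c(2, 4, 6, NA)

mean(scores)                # NA, because of the missing value
mean(scores, na.rm = TRUE)  # the NA is removed first, so this returns 4
```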

The function for the median is similar, except there is no trim argument (the median is effectively the mean with a 50% trim)

median(variable, na.rm = FALSE)
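The ‘50% trim’ point can be checked directly (toy data, not from ice_tib):

```r
# Hypothetical scores with one extreme value
scores <- c(1, 2, 3, 4, 100)

median(scores)            # 3
mean(scores, trim = 0.5)  # a 50% trim reduces the mean to the median: also 3
```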

Code example

  • if the defaults are ok, there is no need to set those arguments
mean(ice_tib$upload_day)
[1] 39.678

To remove missing values:

mean(ice_tib$upload_day , na.rm = TRUE)
[1] 39.678
?mean

coding challenge

Find the median number of days after the original ice bucket video that other videos were uploaded.

median(ice_tib$upload_day)
[1] 38

Quantifying the ‘fit’ of the mean

var()
variance
var(variable_name, na.rm = FALSE)
sd()
standard deviation
sd(variable_name, na.rm = FALSE)

var() and sd() take the same syntax as mean()

coding challenge

Use what you learned in the previous section and the code example above to get the variance and standard deviation of the days since the original ice bucket video that other videos were uploaded.

var(ice_tib$upload_day)
[1] 59.94197
sd(ice_tib$upload_day)
[1] 7.74222

Inter-quartile range

IQR()
IQR(variable_name, na.rm = FALSE, type = 7)
  • the type argument selects which of R’s quantile algorithms is used to compute the quartiles; the default is type = 7
IQR(ice_tib$upload_day, type = 7)
[1] 7
IQR(ice_tib$upload_day, type = 8)
[1] 7
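Under the hood, IQR() is just the difference between the third and first quartiles, which you can verify with quantile() (toy data, not from ice_tib):

```r
# Hypothetical scores
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

IQR(scores)  # 4
# the same value computed by hand from the quartiles
unname(quantile(scores, 0.75) - quantile(scores, 0.25))  # 4
```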

Creating a summary table

  • used to combine all of the above into one table

summarise()

ice_tib |>
  dplyr::summarise(
    median =  median(upload_day),  # creates new variable called `median` from the output of `median(upload_day)`
    mean =  mean(upload_day), # creates new variable called `mean` from the output of `mean(upload_day)`
    ...
    )

coding challenge

Create a summary table containing the mean, median, IQR, variance and SD of the number of days since the original ice bucket video.

ice_tib |>
  dplyr::summarise(
    median =  median(upload_day),  # creates new variable called `median` from the output of `median(upload_day)`
    mean =  mean(upload_day), # creates new variable called `mean` from the output of `mean(upload_day)`
    IQR = IQR(upload_day),
    var = var(upload_day),
    sd = sd(upload_day)
    )
# A tibble: 1 × 5
  median  mean   IQR   var    sd
   <dbl> <dbl> <dbl> <dbl> <dbl>
1     38  39.7     7  59.9  7.74

Code example

To store the summary of stats, we assign it to a new object:

upload_summary <- ice_tib |>
  dplyr::summarise(
    median =  median(upload_day),
    mean =  mean(upload_day),
    IQR = IQR(upload_day),
    variance = var(upload_day),
    std_dev = sd(upload_day)
    ) 
upload_summary
# A tibble: 1 × 5
  median  mean   IQR variance std_dev
   <dbl> <dbl> <dbl>    <dbl>   <dbl>
1     38  39.7     7     59.9    7.74

Rounding Values

use the round() function to round individual values, and kable() from knitr to round an entire table of values for display

round()
round(x, digits = 0)
the default number of digits is 0, so values are rounded to whole numbers
round(3.211420)
[1] 3

We can also use a pipe to feed a mean, median, or variance into the round() function

var(ice_tib$upload_day) |>
  round(3)
[1] 59.942
mean(ice_tib$upload_day) |>
  round(3)
[1] 39.678
sd(ice_tib$upload_day) |>
  round(3)
[1] 7.742
upload_summary <- ice_tib |>
  dplyr::summarise(
    median =  median(upload_day),
    mean =  mean(upload_day),
    IQR = IQR(upload_day),
    variance = var(upload_day),
    std_dev = sd(upload_day)
    ) 
upload_summary |>
  round(2)
# A tibble: 1 × 5
  median  mean   IQR variance std_dev
   <dbl> <dbl> <dbl>    <dbl>   <dbl>
1     38  39.7     7     59.9    7.74
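The kable() route mentioned above might look like this (a sketch using a toy one-row data frame standing in for upload_summary):

```r
library(knitr)  # provides kable()

# toy summary table with hypothetical values
tbl <- data.frame(variance = 59.94197, std_dev = 7.74222)
kable(tbl, digits = 2)  # rounds every numeric column for display
```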

Datawizard

  • the describe_distribution() function from the datawizard package returns the mean, SD, IQR, range, skewness and kurtosis in one step
datawizard::describe_distribution(x = ice_tib,
  select = NULL,
  exclude = NULL,
  centrality = "mean",
  dispersion = TRUE,
  iqr = TRUE,
  range = TRUE,
  quartiles = FALSE,
  include_factors = FALSE,
  ci = NULL)
Variable   |  Mean |   SD | IQR |          Range | Skewness | Kurtosis |       n | n_Missing
--------------------------------------------------------------------------------------------
upload_day | 39.68 | 7.74 |   7 | [21.00, 76.00] |     1.72 |     4.43 | 2323000 |         0