Rows: 2323000 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): upload_day
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Frequency tables
use group_by() and summarise() and n() functions from dplyr
group_by()
groups data by whatever variable(s) you name within the function
summarise()
creates summary table based on the variables in the function
n()
counts the number of scores
To count frequencies:
tell R to treat values that are the same, as being in the same category
group_by(upload_day) tells R that scores that are the same within upload_day are in the same group
subsequent operations are conducted on the groups
count how many scores fall into each category
summarize() creates a variable called frequency that counts how many items are in each group created by group_by()
this creates a new object called gp_freq_dist that contains each value within ice_tib but with an extra column/variable called days_group that indicates the bin the value of upload_day is in
Set notation
the value of upload_day now has a corresponding value of days_group containing the bin
the first score of 34 has been assigned to the bin labelled `(30, 34] which is the bin containing any score above 30, up to and including 34
the label uses standard mathematical notation for sets where ( or ) means ‘not including’ and [ or ] means ‘including’
now we can use summarize() and n() to count scores like before, except to use days_group instead of upload_day
Coding challenge
Create a grouped frequency table called gp_freq_dist by starting with the code in the code example and then using the code we used to create freq_tbl to create a pipe that summarizes the grouped scores.
we have an object gp_freq_dist that contains the number of days grouped into bins of 4 days and the number of videos uploaded during each of the time periods represented by those bins
to calculate the relative frequency we can use dplyr::mutate() to add a variable that divides the frequency by the total number of videos using sum()
... |>
dplyr::mutate(
relative_freq = frequency/sum(frequency) # creates a new column
)
Efficient Code
rather than creating the table of relative frequencies step-by-step, it is usually more efficient to carry out the steps in one piece of code
We’ve discussed elsewhere that if you include packages when you use functions (e.g., dplyr::mutate()) you don’t need to explicitly load the package (in this case dplyr). However, to create plots with ggplot2 you build them up layer by layer, which means you use a lot of ggplot2 functions. For this reason, I advise loading it at the start of your Quarto document and not worrying too much about including package references when you use functions. You can load it either with library(ggplot2) or by loading the entire tidyverse using library(tidyverse).
include fill = within the geom_histogram() function
ggplot2::ggplot(ice_tib, aes(upload_day)) +geom_histogram(binwidth =1, fill ="#440154")
Transparency and axis labels
ggplot2::ggplot(ice_tib, aes(upload_day)) +geom_histogram(binwidth =1, fill ="#440154", alpha =0.25) +labs(y ="Frequency", x ="Days since first ice bucket challenge video")
Themes
ggplot2::ggplot(ice_tib, aes(upload_day)) +geom_histogram(binwidth =1, fill ="#440154", alpha =0.9) +labs(y ="Frequency", x ="Days since first ice bucket challenge video") +theme_minimal()
Summarizing data
Mean and median
mean(variable, trim = 0, na.rm = FALSE)
trim
allows you to trim scores before calculating the mean by specifying a value between 0 and 0.5. default is 0 (no trim). to trim 10% of scores fom each end of the distribution you could set trim = 0.1
na.rm
stands for NA remove. Missing values are denoted NA, for ‘not available’. by setting na.rm = TRUE (or na.rm = T), R will remove missing values before computing the mean
Missing Values
The default in many functions is not to remove missing values (e.g. na.rm = FALSE). If you have missing values in your data and don’t change this default behaviour R will throw an error. Therefore, if you get an error from a function like mean(), check whether you have missing values and whether you have forgotten to set na.rm = TRUE.
The function for median is similar, except no trim (median is effectively the mean with 50% trim)
median(variable, na.rm = FALSE)
Code example
if the defaults are ok, there is no need to set those arguments
mean(ice_tib$upload_day)
[1] 39.678
To remove missing values:
mean(ice_tib$upload_day , na.rm =TRUE)
[1] 39.678
?mean
coding challenge
Find the median number of days after the original ice bucket video that other videos were uploaded.
median(ice_tib$upload_day)
[1] 38
Quantifying the ‘fit’ of the mean
var()
variance
var(variable_name, na.rm = FALSE)
sd()
standard deviation
sd(variable_name, na.rm = FALSE)
var() and sd() take the same syntax as mean()
coding challenge
Use what you learned in the previous section and the code example above to get the variance and standard deviation of the days since the original ice bucket video that other videos were uploaded.
var(ice_tib$upload_day)
[1] 59.94197
sd(ice_tib$upload_day)
[1] 7.74222
Inter-quartile range
IQR()
IQR(variable_name, na.rm = FALSE, type = 7)
IQR(ice_tib$upload_day, type =7)
[1] 7
IQR(ice_tib$upload_day, type =8)
[1] 7
Creating a summary table
used to comine all the above into one table
summarise()
ice_tib |>
dplyr::summarise(
median = median(upload_day), # creates new variable called `median` from the output of `median(upload_day)`
mean = mean(upload_day), # creates new variable called `mean` from the output of `mean(upload_day)`
...
)
coding challenge
Create a summary table containing the mean, median, IQR, variance and SD of the number of days since the original ice bucket video.
ice_tib |> dplyr::summarise(median =median(upload_day), # creates new variable called `median` from the output of `median(upload_day)`mean =mean(upload_day), # creates new variable called `mean` from the output of `mean(upload_day)`IQR =IQR(upload_day),var =var(upload_day),sd =sd(upload_day) )
# A tibble: 1 × 5
median mean IQR var sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 38 39.7 7 59.9 7.74
Code example
To store the summary of stats, we assign it to a new object: