discovr_06 - The Beast of Bias – Learning | Assessment

library(tidyverse)  # attaches the core tidyverse, which already includes ggplot2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

download_tib <- here::here("data/download_festival.csv") |> readr::read_csv()

Rows: 810 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (4): ticket_no, day_1, day_2, day_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

download_tib <- download_tib |> 
  dplyr::mutate(
    ticket_no = as.character(ticket_no),
    gender = forcats::as_factor(gender) |>
      forcats::fct_relevel("Male", "Female", "Non-binary")
  )

download_tib

# A tibble: 810 × 5
   ticket_no gender day_1 day_2 day_3
   <chr>     <fct>  <dbl> <dbl> <dbl>
 1 2111      Male    2.64  1.35  1.61
 2 2229      Female  0.97  1.41  0.29
 3 2338      Male    0.84 NA    NA   
 4 2384      Female  3.03 NA    NA   
 5 2401      Female  0.88  0.08 NA   
 6 2405      Male    0.85 NA    NA   
 7 2467      Female  1.56 NA    NA   
 8 2478      Female  3.02 NA    NA   
 9 2490      Male    2.29 NA    NA   
10 2504      Female  1.11  0.44  0.55
# ℹ 800 more rows

Which of the following describes tidy data?

Data that are arranged such that scores on a variable appear in a single column and rows represent a combination of the attributes of those scores – the entity from which the scores came, when the score was recorded, etc. Scores from a single entity can appear over multiple rows where each row represents a combination of the attributes of the score – for example, levels of an independent variable or time point at which the score was recorded.

Are the download data in tidy or messy format?

Messy

Correct - well done! The download data are messy because the hygiene scores on different days are spread across different columns rather than being in a single column, with an additional column to indicate the day of the festival on which the hygiene score was measured.

tidyr has two functions for converting data between messy and tidy formats:
- pivot_longer() takes columns and puts them into rows to make messy data tidy
- pivot_wider() takes rows and puts them into columns to make tidy data messy

Making messy data tidy

tidyr::pivot_longer(
  data = tibble,
  cols = column_names,
  names_to = "name_of_column_to_contain_variable_names",
  values_to = "name_of_column_to_contain_values",
)

tibble: name of the messy tibble
column_names: the columns to be restructured into rows
names_to: name for the new variable that will contain the names of the original columns
values_to: name for the new variable that will contain the values

Code example

in download_tib, there are three columns/variables that need to be restructured into rows
specify these variables using day_1:day_3
the scores in these columns are hygiene scores, so we can use hygiene as the name of the variable that contains the values after restructuring
the columns we are transforming represent different days at the festival, so we can use day as the name of the variable that contains the original column names

download_tidy_tib <- download_tib |>  # create a new object called `download_tidy_tib`
  tidyr::pivot_longer(                # use the `pivot_longer()` function from `tidyr`
    cols = day_1:day_3,               # restructure columns `day_1` to `day_3`
    names_to = "day",                 # column names will be placed in a variable called `day`
    values_to = "hygiene"             # column values will be placed in a variable called `hygiene`
  )
download_tidy_tib                     # display the new object

# A tibble: 2,430 × 4
   ticket_no gender day   hygiene
   <chr>     <fct>  <chr>   <dbl>
 1 2111      Male   day_1    2.64
 2 2111      Male   day_2    1.35
 3 2111      Male   day_3    1.61
 4 2229      Female day_1    0.97
 5 2229      Female day_2    1.41
 6 2229      Female day_3    0.29
 7 2338      Male   day_1    0.84
 8 2338      Male   day_2   NA   
 9 2338      Male   day_3   NA   
10 2384      Female day_1    3.03
# ℹ 2,420 more rows
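
The restructured tibble keeps a row for every ticket and day, including days with no hygiene score (NA). As a minimal sketch, assuming a toy tibble (`toy_tib` is a hypothetical name; its values are the first three tickets shown above), the `values_drop_na` argument of `pivot_longer()` can drop those rows if they are not wanted:

```r
# Toy version of the messy data: the first three tickets from the output above
toy_tib <- tibble::tibble(
  ticket_no = c("2111", "2229", "2338"),
  day_1 = c(2.64, 0.97, 0.84),
  day_2 = c(1.35, 1.41, NA),
  day_3 = c(1.61, 0.29, NA)
)

# Default behaviour: one row per ticket per day, NA scores included
toy_long_all <- toy_tib |>
  tidyr::pivot_longer(
    cols = day_1:day_3,
    names_to = "day",
    values_to = "hygiene"
  )

# Same call, but rows whose hygiene score is NA are dropped
toy_long_complete <- toy_tib |>
  tidyr::pivot_longer(
    cols = day_1:day_3,
    names_to = "day",
    values_to = "hygiene",
    values_drop_na = TRUE
  )

nrow(toy_long_all)       # 3 tickets x 3 days
nrow(toy_long_complete)  # the two NA rows for ticket 2338 are gone
```

Whether dropping the NA rows is appropriate depends on the analysis; the notes above keep them.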

Tidying labels

the values in day match the original column names exactly (day_1)
we want sentence case (Day 1)
use stringr
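
The two string operations can be tried on a bare character vector before applying them to a tibble. This is only a sketch; `day_labels`, `sentence_case` and `tidy_labels` are hypothetical names used for illustration:

```r
# The original column names, as they appear in the `day` variable
day_labels <- c("day_1", "day_2", "day_3")

# Step 1: capitalise the first letter of each string
sentence_case <- stringr::str_to_sentence(day_labels)

# Step 2: replace the first underscore in each string with a space
tidy_labels <- stringr::str_replace(sentence_case, pattern = "_", replacement = " ")

tidy_labels
```

Note that `str_replace()` changes only the first match in each string; `stringr::str_replace_all()` would be needed if a label contained several underscores.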

download_tidy_tib <- download_tidy_tib |>  # recreate `download_tidy_tib` from itself
  dplyr::mutate(                           # use `dplyr::mutate()` to recreate the variable `day`
    day = stringr::str_to_sentence(day) |> # capitalise the d with `stringr::str_to_sentence()`
      stringr::str_replace("_", " ")       # then replace the underscore with a space using `str_replace()`
  )
download_tidy_tib

# A tibble: 2,430 × 4
   ticket_no gender day   hygiene
   <chr>     <fct>  <chr>   <dbl>
 1 2111      Male   Day 1    2.64
 2 2111      Male   Day 2    1.35
 3 2111      Male   Day 3    1.61
 4 2229      Female Day 1    0.97
 5 2229      Female Day 2    1.41
 6 2229      Female Day 3    0.29
 7 2338      Male   Day 1    0.84
 8 2338      Male   Day 2   NA   
 9 2338      Male   Day 3   NA   
10 2384      Female Day 1    3.03
# ℹ 2,420 more rows

Making tidy data messy

pivot_wider() reverses the process above

tidyr::pivot_wider(
  data = tibble,  # tibble to be restructured
  id_cols = variables_that_you_do_not_want_to_restructure,
  names_from = "variable_containing_the_names_of_columns",
  values_from = "variable_containing_the_scores",
)

download_tib <- download_tidy_tib |> 
  tidyr::pivot_wider(
    id_cols = c(ticket_no, gender),
    names_from = "day",
    values_from = "hygiene"
  )
download_tib

# A tibble: 810 × 5
   ticket_no gender `Day 1` `Day 2` `Day 3`
   <chr>     <fct>    <dbl>   <dbl>   <dbl>
 1 2111      Male      2.64    1.35    1.61
 2 2229      Female    0.97    1.41    0.29
 3 2338      Male      0.84   NA      NA   
 4 2384      Female    3.03   NA      NA   
 5 2401      Female    0.88    0.08   NA   
 6 2405      Male      0.85   NA      NA   
 7 2467      Female    1.56   NA      NA   
 8 2478      Female    3.02   NA      NA   
 9 2490      Male      2.29   NA      NA   
10 2504      Female    1.11    0.44    0.55
# ℹ 800 more rows

in this case, having the variable names in sentence case (Day 1) is inconvenient because names that contain a space always have to be wrapped in backticks
rename the columns using dplyr::rename_with()

download_tib <- download_tib |> 
  dplyr::rename_with(
    .cols = starts_with("Day"),            # select all columns in `download_tib` whose names begin with `Day`
    .fn = \(column) stringr::str_replace(  # anonymous (lambda) function applied to each selected column name
      string = column,
      pattern = "Day ",                    # find `Day ` (including the space) ...
      replacement = "day_"                 # ... and replace it with `day_`
    )
  )

download_tib

# A tibble: 810 × 5
   ticket_no gender day_1 day_2 day_3
   <chr>     <fct>  <dbl> <dbl> <dbl>
 1 2111      Male    2.64  1.35  1.61
 2 2229      Female  0.97  1.41  0.29
 3 2338      Male    0.84 NA    NA   
 4 2384      Female  3.03 NA    NA   
 5 2401      Female  0.88  0.08 NA   
 6 2405      Male    0.85 NA    NA   
 7 2467      Female  1.56 NA    NA   
 8 2478      Female  3.02 NA    NA   
 9 2490      Male    2.29 NA    NA   
10 2504      Female  1.11  0.44  0.55
# ℹ 800 more rows
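
As a quick check that pivot_wider() really does undo pivot_longer(), the round trip can be sketched on a small toy tibble. The object names (`round_trip_tib`, `round_trip_long`, `round_trip_wide`) are hypothetical, and the comparison goes through plain data frames to keep the check simple:

```r
# Toy messy tibble using the first three tickets from the output above
round_trip_tib <- tibble::tibble(
  ticket_no = c("2111", "2229", "2338"),
  day_1 = c(2.64, 0.97, 0.84),
  day_2 = c(1.35, 1.41, NA),
  day_3 = c(1.61, 0.29, NA)
)

# Messy -> tidy: day columns become rows
round_trip_long <- round_trip_tib |>
  tidyr::pivot_longer(
    cols = day_1:day_3,
    names_to = "day",
    values_to = "hygiene"
  )

# Tidy -> messy: rows go back into one column per day
round_trip_wide <- round_trip_long |>
  tidyr::pivot_wider(
    id_cols = ticket_no,
    names_from = day,
    values_from = hygiene
  )

# Compare the start and end points as ordinary data frames
round_trip_ok <- isTRUE(all.equal(
  as.data.frame(round_trip_tib),
  as.data.frame(round_trip_wide)
))
round_trip_ok
```

Because the labels are left as day_1 to day_3 here, no renaming step is needed before the comparison.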

Spotting outliers

Two ways:

visualize the data and look for unusual cases
look for values that are poorly predicted by the model, using model residuals as described in DSUR
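
The first approach might look something like the sketch below. The tibble `toy_scores` is hypothetical: ten day 1 scores copied from the output above plus one invented score of 20 that stands in for an unusual case. A boxplot is only one of several plots that could be used here.

```r
# Ten real-looking day 1 scores plus one invented extreme value (20)
toy_scores <- tibble::tibble(
  day = "Day 1",
  hygiene = c(0.84, 0.85, 0.88, 0.97, 1.11, 1.56, 2.29, 2.64, 3.02, 3.03, 20)
)

# Boxplot of the scores: unusual cases are drawn as separate points
outlier_plot <- ggplot2::ggplot(toy_scores, ggplot2::aes(x = day, y = hygiene)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x = "Day of festival", y = "Hygiene score")

outlier_plot
```

In the resulting plot the invented score sits far above the box and whiskers, which is the kind of pattern to look for.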

What are the model residuals?

The differences between the values a model predicts and the values observed in the data on which the model is based
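
The definition can be checked by hand on a toy model. This is a minimal sketch with made-up numbers (`toy_data` and `toy_model` are hypothetical, and a simple linear model is assumed purely for illustration): the residuals are the observed values minus the values the model predicts.

```r
# Made-up data: the third y value is deliberately out of line with the rest
toy_data <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 30, 7.8, 10.1)
)

# Fit a simple linear model predicting y from x
toy_model <- lm(y ~ x, data = toy_data)

# Residual = observed value - value predicted by the model
predicted <- unname(fitted(toy_model))
manual_residuals <- toy_data$y - predicted

# The hand-computed residuals match those stored in the model object
residuals_match <- isTRUE(all.equal(manual_residuals, unname(residuals(toy_model))))

# Index of the case the model predicts worst (largest absolute residual)
worst_case <- which.max(abs(manual_residuals))
```

Here `worst_case` points to the third row, the value that was made unusual on purpose, which is how residuals help flag poorly predicted cases.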