discovr_06 - The Beast of Bias

bias
R
discovr
Author

Colin Madland

Published

April 16, 2024

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
download_tib <- here::here("data/download_festival.csv") |> readr::read_csv()
Rows: 810 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (4): ticket_no, day_1, day_2, day_3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
download_tib <- download_tib |> 
dplyr::mutate(
    ticket_no = as.character(ticket_no),
    gender = forcats::as_factor(gender) |>
      forcats::fct_relevel("Male", "Female", "Non-binary")
  )
download_tib
# A tibble: 810 × 5
   ticket_no gender day_1 day_2 day_3
   <chr>     <fct>  <dbl> <dbl> <dbl>
 1 2111      Male    2.64  1.35  1.61
 2 2229      Female  0.97  1.41  0.29
 3 2338      Male    0.84 NA    NA   
 4 2384      Female  3.03 NA    NA   
 5 2401      Female  0.88  0.08 NA   
 6 2405      Male    0.85 NA    NA   
 7 2467      Female  1.56 NA    NA   
 8 2478      Female  3.02 NA    NA   
 9 2490      Male    2.29 NA    NA   
10 2504      Female  1.11  0.44  0.55
# ℹ 800 more rows

Tidy data are arranged such that scores on a variable appear in a single column and each row represents a combination of the attributes of those scores: the entity from which the scores came, when the score was recorded, and so on. Scores from a single entity can appear over multiple rows, where each row represents a combination of the attributes of the score, for example, levels of an independent variable or the time point at which the score was recorded.

The download data are messy because the hygiene scores on different days are spread across different columns rather than being in a single column, with an additional column to indicate the day of the festival on which the hygiene score was measured.

tidyr has two functions for converting data between messy and tidy formats:

  • pivot_longer() takes columns and puts them into rows to make messy data tidy
  • pivot_wider() takes rows and puts them into columns to make tidy data messy

Making messy data tidy

tidyr::pivot_longer(
  data = tibble,
  cols = column_names,
  names_to = "name_of_column_to_contain_variable_names",
  values_to = "name_of_column_to_contain_values",
)
  • tibble: name of the messy tibble
  • column_names: list of columns to be restructured into rows
  • names_to: name for the new variable that will contain the names of the original columns
  • values_to: name for the new variable that will contain the values

Code example

  • in download_tib, there are three columns/variables that need to be restructured into rows
  • specify the variables using day_1:day_3
  • scores in these columns represent hygiene scores, so we could use hygiene as the variable to contain values after restructuring
  • columns we are transforming represent different days at the festival, so we can use day as the name of the variable created to contain column names
download_tidy_tib <- download_tib |>  # create a new object called `download_tidy_tib`
  tidyr::pivot_longer(                # use the `pivot_longer()` function from `tidyr`
    cols = day_1:day_3,               # specify columns `day_1:day_3` for restructuring
    names_to = "day",                 # names of the columns to be placed in a variable called `day`
    values_to = "hygiene"             # values of the columns to be placed in a variable called `hygiene`
  )
download_tidy_tib                     # display the new object
# A tibble: 2,430 × 4
   ticket_no gender day   hygiene
   <chr>     <fct>  <chr>   <dbl>
 1 2111      Male   day_1    2.64
 2 2111      Male   day_2    1.35
 3 2111      Male   day_3    1.61
 4 2229      Female day_1    0.97
 5 2229      Female day_2    1.41
 6 2229      Female day_3    0.29
 7 2338      Male   day_1    0.84
 8 2338      Male   day_2   NA   
 9 2338      Male   day_3   NA   
10 2384      Female day_1    3.03
# ℹ 2,420 more rows
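
A quick sanity check on the restructuring: each of the 810 rows contributes one row per day column, so the tidy tibble should have 810 × 3 = 2,430 rows, which matches the output above. The sketch below shows the same arithmetic on a small made-up tibble. The object `toy_tib` and its values are hypothetical and are not the festival data.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in the same messy layout as `download_tib`
toy_tib <- tibble::tibble(
  ticket_no = c("1", "2"),
  day_1 = c(2.64, 0.97),
  day_2 = c(1.35, NA),
  day_3 = c(1.61, NA)
)

toy_tidy_tib <- toy_tib |>
  tidyr::pivot_longer(
    cols = day_1:day_3,
    names_to = "day",
    values_to = "hygiene"
  )

# One row per ticket per day: 2 tickets x 3 days = 6 rows
nrow(toy_tidy_tib) == nrow(toy_tib) * 3

# Missing scores are kept as rows containing NA by default
sum(is.na(toy_tidy_tib$hygiene))
```

If rows with missing scores are not wanted, `pivot_longer()` has a `values_drop_na = TRUE` argument that drops them during restructuring. Whether that is appropriate depends on the analysis.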

Tidying labels

  • the values in day match the original column names exactly (day_1)
  • we want sentence case (Day 1)
  • use stringr
download_tidy_tib <- download_tidy_tib |>  # recreate `download_tidy_tib` from itself
  dplyr::mutate(                           # use `dplyr::mutate()` to recreate the variable `day`
    day = stringr::str_to_sentence(day) |> # capitalize the d with `stringr::str_to_sentence()`
      stringr::str_replace("_", " ")       # then find the underscore and replace it with a space
  )
  download_tidy_tib
# A tibble: 2,430 × 4
   ticket_no gender day   hygiene
   <chr>     <fct>  <chr>   <dbl>
 1 2111      Male   Day 1    2.64
 2 2111      Male   Day 2    1.35
 3 2111      Male   Day 3    1.61
 4 2229      Female Day 1    0.97
 5 2229      Female Day 2    1.41
 6 2229      Female Day 3    0.29
 7 2338      Male   Day 1    0.84
 8 2338      Male   Day 2   NA   
 9 2338      Male   Day 3   NA   
10 2384      Female Day 1    3.03
# ℹ 2,420 more rows
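
To see what each stringr step contributes, the two functions can be applied one at a time to a plain character vector. This is a minimal sketch; `day_labels` is a hypothetical stand-in for the `day` column.

```r
library(stringr)

# Hypothetical stand-in for the values in the `day` column
day_labels <- c("day_1", "day_2", "day_3")

# Step 1: capitalize the first letter, leaving the underscore in place
capitalised <- stringr::str_to_sentence(day_labels)
capitalised

# Step 2: replace the underscore with a space
tidy_labels <- stringr::str_replace(capitalised, pattern = "_", replacement = " ")
tidy_labels
```

Note that `str_replace()` only replaces the first match in each string. That is enough here because each label contains a single underscore; for labels with several underscores, `stringr::str_replace_all()` would be the safer choice.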

Making tidy data messy

  • pivot_wider() reverses the process above
tidyr::pivot_wider(
  data = tibble,  # tibble to be restructured
  id_cols = variables_that_you_do_not_want_to_restructure,
  names_from = "variable_containing_the_names_of_columns",
  values_from = "variable_containing_the_scores",
)
download_tib <- download_tidy_tib |> 
  tidyr::pivot_wider(
  id_cols = c(ticket_no, gender),
  names_from = "day",
  values_from = "hygiene",
)
download_tib
# A tibble: 810 × 5
   ticket_no gender `Day 1` `Day 2` `Day 3`
   <chr>     <fct>    <dbl>   <dbl>   <dbl>
 1 2111      Male      2.64    1.35    1.61
 2 2229      Female    0.97    1.41    0.29
 3 2338      Male      0.84   NA      NA   
 4 2384      Female    3.03   NA      NA   
 5 2401      Female    0.88    0.08   NA   
 6 2405      Male      0.85   NA      NA   
 7 2467      Female    1.56   NA      NA   
 8 2478      Female    3.02   NA      NA   
 9 2490      Male      2.29   NA      NA   
10 2504      Female    1.11    0.44    0.55
# ℹ 800 more rows
  • in this case, having the variable names in sentence case (Day 1) is inconvenient because we would always have to wrap them in backticks
  • rename the columns using dplyr::rename_with()
download_tib <- download_tib |> 
  dplyr::rename_with(
    .cols = starts_with("Day"),            # find all columns within `download_tib` that begin with `Day`
    .fn = \(column) stringr::str_replace(  # anonymous (lambda) function applied to those column names
      string = column,
      pattern = "Day ",                    # find the text "Day " (including the space)
      replacement = "day_"                 # and replace it with "day_"
    )
  )

download_tib
# A tibble: 810 × 5
   ticket_no gender day_1 day_2 day_3
   <chr>     <fct>  <dbl> <dbl> <dbl>
 1 2111      Male    2.64  1.35  1.61
 2 2229      Female  0.97  1.41  0.29
 3 2338      Male    0.84 NA    NA   
 4 2384      Female  3.03 NA    NA   
 5 2401      Female  0.88  0.08 NA   
 6 2405      Male    0.85 NA    NA   
 7 2467      Female  1.56 NA    NA   
 8 2478      Female  3.02 NA    NA   
 9 2490      Male    2.29 NA    NA   
10 2504      Female  1.11  0.44  0.55
# ℹ 800 more rows
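
The whole round trip (messy to tidy, relabel, tidy to messy, rename) should return a tibble with the same columns and values as the one we started with. The sketch below checks this on made-up data; `toy_tib` and `round_trip_tib` are hypothetical names, and the values are not from the festival data.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

# Hypothetical messy data with two day columns
toy_tib <- tibble::tibble(
  ticket_no = c("1", "2"),
  day_1 = c(2.64, 0.97),
  day_2 = c(1.35, 1.41)
)

round_trip_tib <- toy_tib |>
  # messy -> tidy
  tidyr::pivot_longer(cols = day_1:day_2, names_to = "day", values_to = "hygiene") |>
  # day_1 -> Day 1
  dplyr::mutate(day = stringr::str_to_sentence(day) |> stringr::str_replace("_", " ")) |>
  # tidy -> messy, giving columns called `Day 1` and `Day 2`
  tidyr::pivot_wider(id_cols = ticket_no, names_from = "day", values_from = "hygiene") |>
  # `Day 1` -> day_1
  dplyr::rename_with(
    .cols = dplyr::starts_with("Day"),
    .fn = \(column) stringr::str_replace(column, pattern = "Day ", replacement = "day_")
  )

# Same column names and same values as the starting tibble
identical(names(round_trip_tib), names(toy_tib))
isTRUE(all.equal(as.data.frame(round_trip_tib), as.data.frame(toy_tib)))
```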

Spotting outliers

Two ways:

  • visualize the data and look for unusual cases
  • look for values that are poorly predicted by the model, using model residuals (the differences between the values a model predicts and the values observed in the data on which the model is based), as described in DSUR
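
As a rough illustration of the residual approach, the sketch below fits the simplest possible model (the mean) to a handful of made-up scores and flags cases with large standardized residuals. The data in `scores` are hypothetical, and the cut-off of 2 is an assumption chosen for illustration rather than a rule taken from the tutorial.

```r
# Hypothetical scores: nine plausible values and one suspicious one
scores <- c(0.84, 0.85, 0.88, 0.97, 1.11, 1.56, 2.29, 2.64, 3.02, 20)

# The mean as a model: an intercept-only linear model
mean_model <- lm(scores ~ 1)

# Raw residuals: observed value minus the value the model predicts
raw_resid <- resid(mean_model)

# Standardized residuals put the raw residuals on a common scale
std_resid <- rstandard(mean_model)

# Cases whose standardized residual is large in absolute value
# (cut-off of 2 is an illustrative assumption)
flagged <- which(abs(std_resid) > 2)
flagged
```

With these made-up values only the tenth case is flagged. A flagged case is a prompt to inspect the score (for example, for a data entry error), not an automatic reason to delete it.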

Histograms and Boxplots