Rows: 810 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gender
dbl (4): ticket_no, day_1, day_2, day_3
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 810 × 5
ticket_no gender day_1 day_2 day_3
<chr> <fct> <dbl> <dbl> <dbl>
1 2111 Male 2.64 1.35 1.61
2 2229 Female 0.97 1.41 0.29
3 2338 Male 0.84 NA NA
4 2384 Female 3.03 NA NA
5 2401 Female 0.88 0.08 NA
6 2405 Male 0.85 NA NA
7 2467 Female 1.56 NA NA
8 2478 Female 3.02 NA NA
9 2490 Male 2.29 NA NA
10 2504 Female 1.11 0.44 0.55
# ℹ 800 more rows
Which of the following describes tidy data?
Data that are arranged such that scores on a variable appear in a single column and rows represent a combination of the attributes of those scores – the entity from which the scores came, when the score was recorded, etc. Scores from a single entity can appear over multiple rows where each row represents a combination of the attributes of the score – for example, levels of an independent variable or time point at which the score was recorded.
Are the download data in tidy or messy format?
Messy
Correct - well done! The download data are messy because the hygiene scores on different days are spread across different columns rather than being in a single column with an additional column to indicate the day of the festival on which the hygiene score was measured.
tidyr has two functions for converting data between messy and tidy formats:
- pivot_longer() takes columns and puts them into rows to make messy data tidy
- pivot_wider() takes rows and puts them into columns to make tidy data messy
names_to
name for the new variable that contains names of the original columns
values_to
name for the new variable that will contain the values.
Code example
in download_tib, there are three columns/variables that need to be restructured into rows
specify the variables using day_1:day_3
scores in these columns represent hygiene scores, so we could use hygiene as the variable to contain values after restructuring
columns we are transforming represent different days at the festival, so we can use day as the name of the variable created to contain column names
download_tidy_tib <- download_tib |> # create a new object called `download_tidy_tib`
  tidyr::pivot_longer( # use the `pivot_longer()` function from `tidyr`
    cols = day_1:day_3, # specify columns `day_1:day_3` for restructuring
    names_to = "day", # names of the columns will be placed in a variable called `day`
    values_to = "hygiene" # values of the columns will be placed in a variable called `hygiene`
  )
download_tidy_tib # display the new object
# A tibble: 2,430 × 4
ticket_no gender day hygiene
<chr> <fct> <chr> <dbl>
1 2111 Male day_1 2.64
2 2111 Male day_2 1.35
3 2111 Male day_3 1.61
4 2229 Female day_1 0.97
5 2229 Female day_2 1.41
6 2229 Female day_3 0.29
7 2338 Male day_1 0.84
8 2338 Male day_2 NA
9 2338 Male day_3 NA
10 2384 Female day_1 3.03
# ℹ 2,420 more rows
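A quick check on the row count above: each of the 810 original rows contributes one row per restructured column, so 810 × 3 = 2,430. The same logic can be sketched on a made-up three-row tibble. The names `toy_tib` and `toy_tidy_tib` and the values are illustrative assumptions, not the festival data.

```r
# Toy data with the same layout as download_tib (values are made up)
toy_tib <- tibble::tibble(
  ticket_no = c("1", "2", "3"),
  day_1 = c(2.64, 0.97, 0.84),
  day_2 = c(1.35, 1.41, NA),
  day_3 = c(1.61, 0.29, NA)
)

# Same restructuring as above: three day columns become rows
toy_tidy_tib <- toy_tib |>
  tidyr::pivot_longer(
    cols = day_1:day_3,
    names_to = "day",
    values_to = "hygiene"
  )

# One row per ticket per day: 3 tickets x 3 day columns
nrow(toy_tidy_tib) == nrow(toy_tib) * 3
```

Missing scores are kept as NA rows by default, which is why the long data have exactly three rows per ticket.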
## Tidying labels
the values in day match the original column names exactly (day_1)
we want sentence case (Day 1)
use stringr
download_tidy_tib <- download_tidy_tib |> # recreates `download_tidy_tib` from itself
dplyr::mutate( # uses `dplyr::mutate` to recreate the variable `day`
day = stringr::str_to_sentence(day) |> stringr::str_replace("_", " ") # uses `stringr::str_to_sentence` to capitalize the d, then `str_replace()` to find the underscore and replace it with a space
)
# A tibble: 2,430 × 4
ticket_no gender day hygiene
<chr> <fct> <chr> <dbl>
1 2111 Male Day 1 2.64
2 2111 Male Day 2 1.35
3 2111 Male Day 3 1.61
4 2229 Female Day 1 0.97
5 2229 Female Day 2 1.41
6 2229 Female Day 3 0.29
7 2338 Male Day 1 0.84
8 2338 Male Day 2 NA
9 2338 Male Day 3 NA
10 2384 Female Day 1 3.03
# ℹ 2,420 more rows
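To see what each step of the relabelling does, the two stringr calls can be run one at a time on a plain character vector. This is only a sketch; `labels`, `step_1` and `step_2` are hypothetical names used for illustration.

```r
# The labels as they come out of pivot_longer()
labels <- c("day_1", "day_2", "day_3")

# Step 1: capitalise the first letter of each label
step_1 <- stringr::str_to_sentence(labels)

# Step 2: replace the first underscore in each label with a space
step_2 <- stringr::str_replace(step_1, "_", " ")

step_1
step_2
```

Note that `str_replace()` changes only the first match in each string, which is enough here because each label contains a single underscore.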
## Making tidy data messy
pivot_wider() reverses the process above
tidyr::pivot_wider(
data = tibble, # tibble to be restructured
id_cols = variables_that_you_do_not_want_to_restructure,
names_from = "variable_containing_the_names_of_columns",
values_from = "variable_containing_the_scores",
)
# A tibble: 810 × 5
ticket_no gender `Day 1` `Day 2` `Day 3`
<chr> <fct> <dbl> <dbl> <dbl>
1 2111 Male 2.64 1.35 1.61
2 2229 Female 0.97 1.41 0.29
3 2338 Male 0.84 NA NA
4 2384 Female 3.03 NA NA
5 2401 Female 0.88 0.08 NA
6 2405 Male 0.85 NA NA
7 2467 Female 1.56 NA NA
8 2478 Female 3.02 NA NA
9 2490 Male 2.29 NA NA
10 2504 Female 1.11 0.44 0.55
# ℹ 800 more rows
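Output of this shape could come from filling in the template along the following lines. This is a sketch under assumptions: `toy_tidy_tib` and `toy_wide_tib` are hypothetical names, and the data are just the first two tickets from the tidy output earlier in these notes.

```r
# Tidy (long) toy data: one row per ticket per day
toy_tidy_tib <- tibble::tibble(
  ticket_no = rep(c("2111", "2229"), each = 3),
  gender = factor(rep(c("Male", "Female"), each = 3)),
  day = rep(c("Day 1", "Day 2", "Day 3"), times = 2),
  hygiene = c(2.64, 1.35, 1.61, 0.97, 1.41, 0.29)
)

# Spread the values of `day` back out into separate columns
toy_wide_tib <- toy_tidy_tib |>
  tidyr::pivot_wider(
    id_cols = c(ticket_no, gender), # variables left as they are
    names_from = "day",             # variable holding the new column names
    values_from = "hygiene"         # variable holding the scores
  )

toy_wide_tib # display the wide version
```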
in this case, having the variable names in sentence case (Day 1) is inconvenient because we will always have to put them in backticks
rename using dplyr::rename_with
download_tib <- download_tib |>
  dplyr::rename_with(
    .cols = dplyr::starts_with("Day"), # finds all columns within `download_tib` that begin with the word `Day`
    .fn = \(column) stringr::str_replace( # creates a lambda or anonymous function that will be applied to the variables that begin with Day
      string = column,
      pattern = "Day ", # the text to find in each column name
      replacement = "day_" # the text to replace it with
    )
  )
download_tib # display the renamed tibble
# A tibble: 810 × 5
ticket_no gender day_1 day_2 day_3
<chr> <fct> <dbl> <dbl> <dbl>
1 2111 Male 2.64 1.35 1.61
2 2229 Female 0.97 1.41 0.29
3 2338 Male 0.84 NA NA
4 2384 Female 3.03 NA NA
5 2401 Female 0.88 0.08 NA
6 2405 Male 0.85 NA NA
7 2467 Female 1.56 NA NA
8 2478 Female 3.02 NA NA
9 2490 Male 2.29 NA NA
10 2504 Female 1.11 0.44 0.55
# ℹ 800 more rows
## Spotting outliers
Two ways:
visualize the data and look for unusual cases
look for values that are poorly predicted by the model, using model residuals as described in DSUR
What are the model residuals?
The differences between the values a model predicts and the values observed in the data on which the model is based
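As a minimal sketch of that definition, the residuals of a simple linear model can be computed by hand and compared with what R reports. The data and the names `toy_scores`, `toy_lm`, `predicted` and `residual` are made up for illustration, and a linear model is only an assumed example of "a model".

```r
# Made-up scores purely for illustration
toy_scores <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(1.1, 1.9, 3.2, 3.9, 9.0)
)

# Fit a simple linear model predicting y from x
toy_lm <- lm(y ~ x, data = toy_scores)

# Residual = observed value minus the value the model predicts
predicted <- fitted(toy_lm)
residual <- toy_scores$y - predicted

# The hand-computed values match R's own residuals()
all.equal(unname(residual), unname(residuals(toy_lm)))
```

Cases with residuals that are large relative to the others are the ones the model predicts poorly, which is what makes residuals useful for spotting outliers.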