R for Data Science 2e - Ch25 - Functions – Learning | Assessment

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science (2e).

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13)

Based on a recommendation from the Productive R Workflow course, this page works through how to create functions in R…but the Seahawks are playing and I need to go pick up the pizza…

…one week later…

As you can see at the top of the post, I am following the R4DS (2e) online book (Wickham, Çetinkaya-Rundel, and Grolemund 2023), but focusing on ch25, so if you aren’t familiar with R, you might need to back up a bit in the book.

My motivation: the data for my PhD is now coming in, and it is time to get serious about managing my workflow and R coding practices. Creating functions enhances reproducibility because each task lives in exactly one place, rather than being duplicated by my copy/pasta/edit technique to date. This will not only make the paper more reproducible; it will also keep the index.qmd doc cleaner, because I will only need to call the function rather than include the entire code inline.

Vector Functions

vector functions: take one or more vectors and return a vector as a result. Example from R4DS:

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / 
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(a, na.rm = TRUE)) / 
    (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / 
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / 
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)

# A tibble: 5 × 4
       a      b     c     d
   <dbl>  <dbl> <dbl> <dbl>
1 0      -0.406 1     0.622
2 1      -0.663 0     0    
3 0.399  -0.385 0.452 0.697
4 0.313   0.193 0.343 0.411
5 0.0783  0.337 0.375 1

However, in the df |> mutate() call, the line b = (b - min(a, na.rm = TRUE...)) should read b = (b - min(b, na.rm = TRUE...)), which is why column b in the output falls outside the 0 to 1 range. This is a super easy error to make with my current copy/pasta/edit workflow, but writing a function for this series of tasks eliminates the possibility of that particular error.

To visualize the repetitive components, we can pull the code out of mutate() and put each command on its own line:

(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))

Replace the variable part with █

(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))

Three things are needed to turn this into a function:

A name, such as rescale01, which describes what the function does: it rescales a vector to lie between 0 and 1.
Arguments, which vary across calls. Here we have only one, which we’ll call x, the conventional name for a numeric vector.
A body, which is the code repeated across calls.

The general template is:

name <- function(arguments) {
  body
}

So, in this case:

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

rescale01(c(-10, 0, 10))

[1] 0.0 0.5 1.0

rescale01(c(1, 2, 3, NA, 5))

[1] 0.00 0.25 0.50   NA 1.00
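
One refinement worth noting, since the function now lives in a single place: min() is computed twice and max() once, and range() can do both in one step. The sketch below is my own variant (the name rescale01_rng is hypothetical, chosen so it does not overwrite rescale01), and it assumes that ignoring infinite values when finding the range is the behavior you want.

```r
# Hypothetical variant name; behaves like rescale01() for finite input
rescale01_rng <- function(x) {
  # range() returns c(min, max); finite = TRUE drops NA, NaN, and +/-Inf
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01_rng(c(1, 2, 3, NA, 5))
rescale01_rng(c(1:10, Inf))
```

With finite = TRUE, an Inf in the input stays Inf in the output instead of turning every other element into 0 or NaN.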

Now we can put the function back into mutate():

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)

# A tibble: 5 × 4
       a     b     c     d
   <dbl> <dbl> <dbl> <dbl>
1 0      0.257 1     0.622
2 1      0     0     0    
3 0.399  0.278 0.452 0.697
4 0.313  0.856 0.343 0.411
5 0.0783 1     0.375 1

In Chapter 26, you’ll learn how to use across() to reduce the duplication even further so all you need is df |> mutate(across(a:d, rescale01)).
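
I have not worked through Chapter 26 yet, so treat the following as a minimal sketch of that across() call rather than a full explanation. The data frame df_small and its values are made up for illustration so the result is easy to check by eye.

```r
library(dplyr)

rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Small fixed example data (made up for illustration)
df_small <- tibble(
  a = c(1, 2, 3),
  b = c(10, 20, 40),
  c = c(-5, 0, 5)
)

# Apply rescale01() to every column from a through c in one step
rescaled <- df_small |> mutate(across(a:c, rescale01))
rescaled
```

Each column should now start at 0 and end at 1, with no column names typed more than once.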

Mutate functions

Mutate functions: work well within mutate() and filter() because they return an output the same length as the input, e.g. computing a Z-score so that a vector has a mean of 0 and an sd of 1:

z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
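
A quick sanity check of those two claims (same length as the input, mean 0 and sd 1) might look like the following. The input values are made up, and the NA is included only to show that it passes through untouched.

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)  # made-up example values
z <- z_score(x)

length(z) == length(x)  # output has the same length as the input
mean(z, na.rm = TRUE)   # approximately 0
sd(z, na.rm = TRUE)     # approximately 1
```

The mean and sd come out as 0 and 1 only up to floating-point error, so compare with a tolerance rather than ==.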

Or a function that ensures all values fall between a specified min and max, such as 3-7…

clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)

 [1] 3 3 3 4 5 6 7 7 7 7

Doesn’t have to be numeric…could manipulate a string, such as ensuring first letter is capitalized:

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper("what is this")

[1] "What is this"

or strip punctuation before converting to a number…

# https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |> 
    str_remove_all("%") |> 
    str_remove_all(",") |> 
    str_remove_all(fixed("$")) |> 
    as.numeric()
  if_else(is_pct, num / 100, num)
}

clean_number("$12,300")

[1] 12300

clean_number("45%")

[1] 0.45
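
Because every step inside clean_number() is vectorized, it should also work on a whole column at once, which is what makes it usable inside mutate(). A small check with a made-up mixed vector:

```r
library(dplyr)
library(stringr)

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all(fixed("$")) |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

# Made-up inputs mixing currency, percentages, and plain numbers
cleaned <- clean_number(c("$12,300", "45%", "1,000", "7.5"))
cleaned
```

Only the element containing % is divided by 100; the others are just stripped of punctuation and converted.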

Or if a variable is coded as 999, 998, or 997 and you want to convert those to NA…

fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
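
A quick usage example for fix_na(), with made-up values. One assumption to flag: passing a bare NA alongside a numeric x relies on if_else() casting NA to the type of x, which as far as I know needs dplyr 1.1.0 or later.

```r
library(dplyr)

fix_na <- function(x) {
  # Assumes dplyr >= 1.1.0, where the bare NA is cast to the type of x
  if_else(x %in% c(997, 998, 999), NA, x)
}

# Made-up survey-style values where 997 and 999 are missing-value codes
fixed <- fix_na(c(12, 999, 35, 997, 40))
fixed
```

The coded values become NA and everything else is returned unchanged, so the output is the same length as the input.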

Summary Functions

summary function: functions that return a single value for use in summarize()

commas <- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}

commas(c("cat", "dog", "pigeon"))

[1] "cat, dog and pigeon"

or wrap a simple calculation such as coefficient of variation which divides standard deviation by mean

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))

[1] 0.5770062

cv(runif(100, min = 0, max = 500))

[1] 0.5452706

or to automate a common task to make it easier to remember

# https://twitter.com/gbganalyst/status/1571619641390252033
n_missing <- function(x) {
  sum(is.na(x))
}
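
Since these are meant for summarize(), here is a minimal sketch of two of them used there. The scores tibble, its group labels, and its values are all made up for illustration.

```r
library(dplyr)

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

n_missing <- function(x) {
  sum(is.na(x))
}

# Made-up measurements for two hypothetical groups
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(10, 20, NA, 5, 5, 5)
)

# Each summary function collapses a group's values to a single number
summary_tbl <- scores |>
  group_by(group) |>
  summarize(
    n_miss = n_missing(value),
    value_cv = cv(value, na.rm = TRUE)
  )
summary_tbl
```

The result has one row per group, which is the point of a summary function: many values in, one value out.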

Exercises

Practice turning each of the following code snippets into a function:

mean(is.na(x))
mean(is.na(y))
mean(is.na(z))

x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)

round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)

# Despite the name, this returns a proportion between 0 and 1, not a percentage
pct_missing <- function(x) {
  mean(is.na(x))
}

# Sample data with missing values
data <- c(1, 4, NA, 4, NA, NA, 7)

# Apply the function
pct_missing(data)

[1] 0.4285714

# x / sum(x, na.rm = TRUE)
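
The last two snippets were still to do. A possible sketch is below; the function names prop_of_total and pct_of_total, the digits argument, and the test vector are my own choices, not from the book.

```r
# Hypothetical names; the exercise leaves naming up to the reader

# Each element as a share of the total, ignoring NAs in the sum
prop_of_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# Same share expressed as a percentage, rounded (1 decimal by default)
pct_of_total <- function(x, digits = 1) {
  round(x / sum(x, na.rm = TRUE) * 100, digits)
}

vals <- c(1, 3, NA, 6)  # made-up example values
prop_of_total(vals)
pct_of_total(vals)
```

Both are mutate-style functions: the output is the same length as the input, and an NA in the input stays NA in the output.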

References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science (2e). https://r4ds.hadley.nz/.