discovr_07 - Associations

correlation
R
discovr
Author

Colin Madland

Published

March 10, 2024

library(tidyverse, ggplot2)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
liar_tib <- here::here("data/biggest_liar.csv") |> readr::read_csv()
Rows: 68 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): id, novice
dbl (2): creativity, position

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
exam_tib <- here::here("data/exam_anxiety.csv") |> readr::read_csv()
Rows: 103 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (4): id, revise, exam_grade, anxiety

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
liar_tib <- liar_tib |> 
  dplyr::mutate(
    novice = forcats::as_factor(novice)
  )
exam_tib <- exam_tib |>
  dplyr::mutate(
    id = forcats::as_factor(id),
    sex = forcats::as_factor(sex)
  )
exam_tib
# A tibble: 103 × 5
   id    revise exam_grade anxiety sex   
   <fct>  <dbl>      <dbl>   <dbl> <fct> 
 1 1          4         40    86.3 Male  
 2 2         11         65    88.7 Female
 3 3         27         80    70.2 Male  
 4 4         53         80    61.3 Male  
 5 5          4         40    89.5 Male  
 6 6         22         70    60.5 Female
 7 7         16         20    81.5 Female
 8 8         21         55    75.8 Female
 9 9         25         50    69.4 Female
10 10        18         40    82.3 Female
# ℹ 93 more rows
GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety")) +
theme_minimal()
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Pearson’s correlation using R

  • correlation package -> workhorse function called correlation()
correlation::correlation(tibble,
                         method = "pearson",
                         p_adjust = "holm",
                         ci = 0.95
                         )
tibble
should be replaced with the name of tibble containing any variables to correlate
method
method of correlation coefficient, default is pearson, but can also accept spearman, kendall, biserial, polychoric, tetrachoric, and percentage
p_adjust
corrects the \(p\)-value for the number of tests you have performed using the Holm-Bonferroni method
applies the Bonferroni criterion in a slightly less strict way that controls the type I error, but with less risk of a type II error
can change to none (bad idea), bonferroni (to apply the standard Bonferroni method) or several other methods.
ci
set the confidence interval width; default is 0.95 for general use

To use the function, - pipe tibble into the select() function from dplyr to select variables to correlate, then pipe that into the correlation function - use the same syntax whether you want to correlate two variables or produce all correlations between pairs of multiple variables]

To calculate Pearson correlation btwn variables exam_grade and revise in exam_tib

exam_tib |> 
  dplyr::select(exam_grade, revise) |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)
Parameter1 Parameter2 r CI CI_low CI_high t df_error p Method n_Obs
exam_grade revise 0.397 0.95 0.22 0.548 4.343 101 0 Pearson correlation 103
exam_tib |> 
  dplyr::select(exam_grade, anxiety) |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)
Parameter1 Parameter2 r CI CI_low CI_high t df_error p Method n_Obs
exam_grade anxiety -0.441 0.95 -0.585 -0.271 -4.938 101 0 Pearson correlation 103
exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation() |> 
  knitr::kable(digits = 3)
Parameter1 Parameter2 r CI CI_low CI_high t df_error p Method n_Obs
exam_grade revise 0.397 0.95 0.220 0.548 4.343 101 0 Pearson correlation 103
exam_grade anxiety -0.441 0.95 -0.585 -0.271 -4.938 101 0 Pearson correlation 103
revise anxiety -0.709 0.95 -0.794 -0.598 -10.111 101 0 Pearson correlation 103

If this confidence interval is one of the 95% that contains the population value then the population value of r lies between 0.22 and 0.55.

The probability of getting a value of t at least as big as the value we have observed, if the value of r were, in fact, zero is less than 0.001. I’m going to assume, therefore, that the association between exam grade and revision is not zero.

  • exam grade correlates with revision - \(r\)=0.4
  • exam grade had a similar strength relationship with exam anxiety \(r\)=-0.44 but in the opposite direction
  • revision had a strong negative relationship with anxiety - \(r\)=-0.709
  • the more you revise, the better your performance
  • the more anxiety you have, the worse your performance
  • the mopre you revise, the less anxiety you have
  • all \(p\)-values are less than 0.001 and would be interpreted as the correlation coefficients being significantly different from zero
  • significance values tell us that the probability of getting correlation coefficients at least as big as this in a sample of 103 people if the null were true (that there was no relationship between the variables) is very low
  • if we assume the sample is one of the 95% of samples that will produce a confidence interval containing the population value, then the confidence intervals tell us about the uncertainty around \(r\).
Rounding

We can control the number of decimal places using knitr::kable(digits = 3)

We can also specify different columns to contain different rounding using knitr::kable(digits = c(2, 2, 2, 2, 2, 2, 2, 2, 8)) (column 9 to 8 decimal places) or knitr::kable(digits = c(rep(2, 8), 8))

Robust correlation coefficients

Given the skew in the variables, we should use a robust correlation coefficient, like the percentage bend correlation coefficient by setting method = "percentage" within correlation()

exam_tib |> 
  dplyr::select(exam_grade, revise, anxiety) |> 
  correlation::correlation(
   method = "percentage"
   ) |> 
  knitr::kable(digits = c(rep(2, 8), 8))
Parameter1 Parameter2 r CI CI_low CI_high t df_error p Method n_Obs
exam_grade revise 0.34 0.95 0.15 0.50 3.60 101 0.00049502 Percentage Bend correlation 103
exam_grade anxiety -0.40 0.95 -0.55 -0.23 -4.41 101 0.00005235 Percentage Bend correlation 103
revise anxiety -0.61 0.95 -0.72 -0.47 -7.66 101 0.00000000 Percentage Bend correlation 103
Par1 Par2 Percentage Bend \(r\) raw Pearson \(r\)
exam_grade revise 0.34 0.397
exam_grade anxiety -0.40 -0.441
revise anxiety -0.61 -0.709

All robust correlations (percentage bend) are less than raw, though all are significant at \(p<0.001\)

Spearman’s correlation coefficient

  • data from World’s Best Liar competition
  • want to know if creativity impacts lying ability
  • position data (1st, 2nd, etc) is ordinal, so Spearman’s correlation coefficient should be used
  • Data are in
liar_tib
# A tibble: 68 × 4
   id    creativity position novice          
   <chr>      <dbl>    <dbl> <fct>           
 1 lnwe          53        1 First time      
 2 vxob          36        3 Previous entrant
 3 qpli          31        4 First time      
 4 pwsq          43        2 First time      
 5 xafq          30        4 Previous entrant
 6 njra          41        1 First time      
 7 lxty          32        4 First time      
 8 dxcw          54        1 Previous entrant
 9 uxgp          47        2 Previous entrant
10 dvew          50        2 First time      
# ℹ 58 more rows
GGally::ggscatmat(liar_tib, columns = c("creativity", "position")) +
theme_minimal()

To get Spearman correlation, we use correlation() the same way as we did with Pearson, except we add method = "spearman" to the function

liar_tib |>
  dplyr::select(creativity, position) |> 
  correlation::correlation(method = "spearman") |> 
    knitr::kable(digits = c(rep(2, 7), 8))
Parameter1 Parameter2 rho CI CI_low CI_high S p Method n_Obs
creativity position -0.37 0.95 -0.57 -0.14 71948.4 0.00172042 Spearman correlation 68
  • Spearman correlation between the two is \(r_s=-0.37\) with an associated \(p\)-value of 0.002 and a sample size of 68
  • as creativity increased, position decreased
    • this might seem contrary to the hypothesis, but a position of 4 is a lower position than 1

Kendall’s tau (\(\tau\))

  • another non-parametric correlation
  • used instead of Spearman’s correlation when data set is small with a large number of tied ranks
  • correlation() function will calculate Kendall’s \(\tau\) by including method = "kendall"
liar_tib |>
  dplyr::select(creativity, position) |> 
  correlation::correlation(method = "kendall") |> 
    knitr::kable(digits = c(rep(4, 7), 8))
Parameter1 Parameter2 tau CI CI_low CI_high z p Method n_Obs
creativity position -0.3002 0.95 -0.4396 -0.1468 -3.2252 0.0012588 Kendall correlation 68
  • output shows \(\tau=-0.3\) -> closer to 0 than Spearman (-.38) therefore Kendall’s value is likely a more accurate guage of what the correlation in the population would be

Cover Photo by Zhuojun Yu on Unsplash