Information-Content-Informed Kendall Tau Correlation
Robert M Flight & Hunter NB Moseley
Source:vignettes/ici-kendalltau.Rmd
ici-kendalltau.Rmd
Problem
- How to handle missing data (i.e.
NA
’s) in calculating a correlation between two variables. - Current calculations of correlation are based on having all pairs of
observations for two variables.
- However, whether an observation is made is semi-quantitative information for many analytical measurements with sensitivity limits.
- i.e. in many cases, missing observations are not “missing-at-random”, but “missing-not-at-random” due to falling below the detection limit.
- In these cases, the
NA
is informative.
Approach
A Kendall Tau Correlation coefficient calculates correlation based on the number of concordant and discordant pairs:
- A pair are two x-y data points.
- A concordant pair has the following classical definition:
- and
- and
- A discordant pair has the following classical definition:
- and
- and
But these definitions can be expanded to handle missing observations:
- Information content informed concordant pairs:
- and
- and
- and
- and
- and
- and
- and (not used in local perspective version)
- and (not used in local perspective version)
- Information content informed discordant pairs:
- and
- and
- and
- and
- and
- and
- and (not used in local perspective version)
- and (not used in local perspective version)
- Also data points with both missing x and y values will naturally reduce the strength of the correlation value, since they can be neither concordant nor discordant with another (NA, NA) pair, but will impact the denominator.
- Alternatively, (NA,NA) points can be removed to calculate a
correlation that is specific to the two variables and does not consider
missing data from a global dataset perspective that spans a set of
variables.
- If this local perspective is used, then two data points that are
both missing data, should not be compared.
- This is equivalent to removing the last two concordant and discordant pair tests.
- If this local perspective is used, then two data points that are
both missing data, should not be compared.
Handling Tied Values
The base Kendall tau correlation must be adjusted to handle tied values, ie. the tau-b version of the equation.
where:
- - the size of the ith group of tied x values.
- - the size of the ith group of tied y values.
- From the local perspective, the number of NAs in x and y can be treated as a group of tied values in calculation of and , respectively.
Scaling by the correlation with the highest information content
When generating a correlation matrix (heatmap) for large analytical datasets, the number of observations in common can become quite low between any two variables. It becomes advantageous to scale by the pair of variables with the highest information content. One objective scaling factor is the highest possible absolute correlation at the maximum information content observed across a dataset, and dividing by this maximum possible absolute correlation would scale the whole dataset appropriately.
Where:
- Choose the two variables with the least number of missing values across the dataset.
- n is the length of the variables.
- m is the count of missing values across the two variables divided by
two rounded down.
- This formula is based on perfect correlation with a given number of (NA,NA) pairs added.
Functions
The functions that implement this include:
-
ici_kt
: the workhorse, actually calculating a correlation between X and Y vectors.- The option
perspective
will control how theNA
values influence ties. - When comparing multiple samples, you likely want to use
perspective = "global"
.
- The option
-
ici_kendallt
: Handles comparisons for a matrix.- Rows are features, columns are samples.
- Implicitly parallel, but have to call:
library(furrr)
plan(multiprocess)
- Otherwise will only use a single core.
We’ve also included a function for testing if the missingness in your
data comes from left-censorship, test_left_censorship
. We
walk through creating example data and testing it in the vignette Testing
for Left Censorship.
Implementation
It turns out, if we think about it really hard, all that is truly necessary is to replace missing values in each vector with a value smaller than the minimum value in each one. For the local version, we first remove common missing values from each vector. Our C++ implementation explicitly does this so that we can have speed, instead of wrapping the {stats::cor} function. We also use the double merge-sort algorithm, translating {scipy:: ::stats::kendalltau} function into C++ using {Rcpp}.
Speed
x = rnorm(1000)
y = rnorm(1000)
library(microbenchmark)
microbenchmark(
cor(x, y, method = "kendall"),
ici_kt(x, y, "global"),
times = 5
)
#> Unit: milliseconds
#> expr min lq mean median
#> cor(x, y, method = "kendall") 14.021754 14.192800 14.663676 14.250622
#> ici_kt(x, y, "global") 1.833214 1.836678 2.073085 1.888855
#> uq max neval
#> 15.407372 15.445833 5
#> 2.297732 2.508947 5
Running Many
Just like R’s cor
function, we can also calculate
correlations between many variables. Let’s make some fake data and try
it out.
set.seed(1234)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
s2[sample(length(s1), 100)] = s1[1:100]
s3 = s1
s3[c(1:10, sample(length(s1), 5))] = NA
matrix_1 = cbind(s1, s2, s3)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
#> s1 s2 s3
#> s1 1.0000000 0.8049209 0.9907488
#> s2 0.8049209 1.0000000 0.7956652
#> s3 0.9907488 0.7956652 0.9850000
Parallelism
If you have {future} and the {furrr} packages installed, then it is also possible to split up the a set of matrix comparisons across compute resources for any multiprocessing engine registered with {future}.
library(furrr)
future::plan(multicore, workers = 4)
r_2 = ici_kendalltau(matrix_1)
Many Many Comparisons
In the case of hundreds of thousands of comparisons to be done, the
result matrices can become very, very large, and require lots of memory
for storage. They are also inefficient, as both the lower and upper
triangular components are stored. An alternative storage format is as a
data.frame
, where there is a single row for each comparison
performed. This is actually how the results are stored internally, and
then they are converted to a matrix form if requested (the default). To
keep the data.frame
output, add the argument
return_matrix=FALSE
to the call of
ici_kendalltau
.
r_3 = ici_kendalltau(matrix_1, return_matrix = FALSE)
r_3$cor
#> s1 s2 core raw pvalue taumax completeness cor
#> 1 s1 s2 1 0.8049209 0 1.0000000 1.000 0.8049209
#> 2 s1 s3 1 0.9907488 0 0.9998949 0.985 0.9907488
#> 3 s2 s3 1 0.7956652 0 0.9998949 0.985 0.7956652
#> 4 s1 s1 0 1.0000000 0 1.0000000 1.000 1.0000000
#> 5 s2 s2 0 1.0000000 0 1.0000000 1.000 1.0000000
#> 6 s3 s3 0 0.9850000 0 1.0000000 0.985 0.9850000
Logging
It is possible to log the steps being done and how much memory is being used (on Linux at least) as correlations are calculated. This can be useful when running very large sets of correlations and making sure too much memory isn’t being used, for example.
To enable logging, the {logger} package must be installed. If a
log_file
is not supplied, one will be created with the
current date and time.
enable_logging()
enable_logging("/tmp/my_ici_run.log")
By default, ici_kendalltau
also shows progress messages,
if you want to turn them off, you can do:
show_progress(FALSE)