Skip to contents

Problem

  • How to handle missing data (i.e. NA’s) in calculating a correlation between two variables.
  • Current calculations of correlation are based on having all pairs of observations for two variables.
    • However, whether an observation is made is semi-quantitative information for many analytical measurements with sensitivity limits.
    • i.e. in many cases, missing observations are not “missing-at-random”, but “missing-not-at-random” due to falling below the detection limit.
    • In these cases, the NA is informative.

Approach

A Kendall Tau Correlation coefficient calculates correlation based on the number of concordant and discordant pairs:

  • \(\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\binom{n}{2}}\)
  • A pair are two x-y data points.
  • A concordant pair has the following classical definition:
    • \(x_i > x_j\) and \(y_i > y_j\)
    • \(x_i < x_j\) and \(y_i < y_j\)
  • A discordant pair has the following classical definition:
    • \(x_i > x_j\) and \(y_i < y_j\)
    • \(x_i < x_j\) and \(y_i > y_j\)

But these definitions can be expanded to handle missing observations:

  • Information content informed concordant pairs:
    • \(x_i > x_j\) and \(y_i > y_j\)
    • \(x_i < x_j\) and \(y_i < y_j\)
    • \(x_i > x_j\) and \(y_i \& !y_j\)
    • \(x_i < x_j\) and \(!y_i \& y_j\)
    • \(x_i \& !x_j\) and \(y_i > y_j\)
    • \(!x_i \& x_j\) and \(y_i < y_j\)
    • \(x_i \& !x_j\) and \(y_i \& !y_j\) (not used in local perspective version)
    • \(x_i \& x_j\) and \(!y_i \& y_j\) (not used in local perspective version)
  • Information content informed discordant pairs:
    • \(x_i > x_j\) and \(y_i < y_j\)
    • \(x_i < x_j\) and \(y_i > y_j\)
    • \(x_i > x_j\) and \(!y_i \& y_j\)
    • \(x_i < x_j\) and \(y_i \& !y_j\)
    • \(x_i \& !x_j\) and \(y_i < y_j\)
    • \(!x_i \& x_j\) and \(y_i > y_j\)
    • \(x_i \& !x_j\) and \(!y_i \& y_j\) (not used in local perspective version)
    • \(!x_i \& x_j\) and \(y_i \& !y_j\) (not used in local perspective version)
  • Also data points with both missing x and y values will naturally reduce the strength of the correlation value, since they can be neither concordant nor discordant with another (NA, NA) pair, but will impact the denominator.
  • Alternatively, (NA,NA) points can be removed to calculate a correlation that is specific to the two variables and does not consider missing data from a global dataset perspective that spans a set of variables.
    • If this local perspective is used, then two data points that are both missing data, should not be compared.
      • This is equivalent to removing the last two concordant and discordant pair tests.

Handling Tied Values

The base Kendall tau correlation must be adjusted to handle tied values, ie. the tau-b version of the equation.

\[\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\sqrt{ ( n_0 - n_{xtie} ) ( n_0 - n_{ytie} )}} \] where:

  • \(n_0 = \frac{n \left ( n - 1 \right )}{2}\)
  • \(n_{xtie} = \sum_{i}^{} \frac{t_{xi} \left ( t_{xi} - 1 \right )}{2}\)
  • \(n_{ytie} = \sum_{i}^{} \frac{t_{yi} \left ( t_{yi} - 1 \right )}{2}\)
  • \(t_{xi}\) - the size of the ith group of tied x values.
  • \(t_{yi}\) - the size of the ith group of tied y values.
  • From the local perspective, the number of NAs in x and y can be treated as a group of tied values in calculation of \(n_{xtie}\) and \(n_{ytie}\), respectively.

Scaling by the correlation with the highest information content

When generating a correlation matrix (heatmap) for large analytical datasets, the number of observations in common can become quite low between any two variables. It becomes advantageous to scale by the pair of variables with the highest information content. One objective scaling factor is the highest possible absolute correlation at the maximum information content observed across a dataset, and dividing by this maximum possible absolute correlation would scale the whole dataset appropriately.

\[maxcorr = \frac{\binom{n-m}{2} + n * m}{\binom{n-m}{2} + n * m + \binom{m}{2}}\] Where:

  • Choose the two variables with the least number of missing values across the dataset.
  • n is the length of the variables.
  • m is the count of missing values across the two variables divided by two rounded down.
    • This formula is based on perfect correlation with a given number of (NA,NA) pairs added.

Functions

The functions that implement this include:

  • ici_kt: the workhorse, actually calculating a correlation between X and Y vectors.
    • The option perspective will control how the NA values influence ties.
    • When comparing multiple samples, you likely want to use perspective = "global".
  • ici_kendallt: Handles comparisons for a matrix.
    • Rows are features, columns are samples.
    • Implicitly parallel, but have to call:
    • Otherwise will only use a single core.

We’ve also included a function for testing if the missingness in your data comes from left-censorship, test_left_censorship. We walk through creating example data and testing it in the vignette Testing for Left Censorship.

Implementation

It turns out, if we think about it really hard, all that is truly necessary is to replace missing values in each vector with a value smaller than the minimum value in each one. For the local version, we first remove common missing values from each vector. Our C++ implementation explicitly does this so that we can have speed, instead of wrapping the {stats::cor} function. We also use the double merge-sort algorithm, translating {scipy:: ::stats::kendalltau} function into C++ using {Rcpp}.

Speed

x = rnorm(1000)
y = rnorm(1000)

library(microbenchmark)

microbenchmark(
  cor(x, y, method = "kendall"),
  ici_kt(x, y, "global"),
  times = 5
)
#> Unit: microseconds
#>                           expr       min        lq       mean    median
#>  cor(x, y, method = "kendall") 13366.180 13440.327 14114.0554 13697.570
#>         ici_kt(x, y, "global")   262.627   307.663   346.0602   320.794
#>         uq       max neval
#>  14627.574 15438.626     5
#>    369.942   469.275     5
all.equal(ici_kt(x, y, "global")[[1]], cor(x, y, method = "kendall"))
#> [1] TRUE

Running Many

Just like R’s cor function, we can also calculate correlations between many variables. Let’s make some fake data and try it out.

set.seed(1234)

s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
s2[sample(length(s1), 100)] = s1[1:100]
s3 = s1
s3[c(1:10, sample(length(s1), 5))] = NA

matrix_1 = cbind(s1, s2, s3)

r_1 = ici_kendalltau(matrix_1)
r_1$cor
#>           s1        s2        s3
#> s1 1.0000000 0.8049209 0.9907488
#> s2 0.8049209 1.0000000 0.7956652
#> s3 0.9907488 0.7956652 0.9850000

Parallelism

If you have {future} and the {furrr} packages installed, then it is also possible to split up the a set of matrix comparisons across compute resources for any multiprocessing engine registered with {future}.

library(furrr)
future::plan(multicore, workers = 4)
r_2 = ici_kendalltau(matrix_1)

Many Many Comparisons

In the case of hundreds of thousands of comparisons to be done, the result matrices can become very, very large, and require lots of memory for storage. They are also inefficient, as both the lower and upper triangular components are stored. An alternative storage format is as a data.frame, where there is a single row for each comparison performed. This is actually how the results are stored internally, and then they are converted to a matrix form if requested (the default). To keep the data.frame output, add the argument return_matrix=FALSE to the call of ici_kendalltau.

r_3 = ici_kendalltau(matrix_1, return_matrix = FALSE)
r_3$cor
#>   s1 s2 core       raw pvalue    taumax       cor
#> 1 s1 s2    1 0.8049209      0 1.0000000 0.8049209
#> 2 s1 s3    1 0.9907488      0 0.9998949 0.9907488
#> 3 s2 s3    1 0.7956652      0 0.9998949 0.7956652
#> 4 s1 s1    0 1.0000000      0 1.0000000 1.0000000
#> 5 s2 s2    0 1.0000000      0 1.0000000 1.0000000
#> 6 s3 s3    0 0.9850000      0 1.0000000 0.9850000

Logging

It is possible to log the steps being done and how much memory is being used (on Linux at least) as correlations are calculated. This can be useful when running very large sets of correlations and making sure too much memory isn’t being used, for example.

To enable logging, the {logger} package must be installed. If a log_file is not supplied, one will be created with the current date and time.

enable_logging()
enable_logging("/tmp/my_ici_run.log")

By default, ici_kendalltau also shows progress messages, if you want to turn them off, you can do: