library(visualizationQualityControl)

Problem

  • How to handle missing data (i.e. NA’s) in calculating a correlation between two variables.
  • Current calculations of correlation are based on having all pairs of observations for two variables.
    • However, whether an observation is made is semi-quantitative information for many analytical measurements with sensitivity limits.
    • i.e. in many cases, missing observations are not “missing-at-random”, but “missing-not-at-random” due to falling below the detection limit.
    • In these cases, NA is informative.

Approach

A Kendall Tau Correlation coefficient calculates correlation based on the number of concordant and discordant pairs:

  • \(\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\binom{n}{2}}\)
  • A pair are two x-y data points.
  • A concordant pair has the following classical definition:
    • \(x_i > x_j\) and \(y_i > y_j\)
    • \(x_i < x_j\) and \(y_i < y_j\)
  • A discordant pair has the following classical definition:
    • \(x_i > x_j\) and \(y_i < y_j\)
    • \(x_i < x_j\) and \(y_i > y_j\)

But these definitions can be expanded to handle missing observations:

  • Information content informed concordant pairs:
    • \(x_i > x_j\) and \(y_i > y_j\)
    • \(x_i < x_j\) and \(y_i < y_j\)
    • \(x_i > x_j\) and \(y_i \& !y_j\)
    • \(x_i < x_j\) and \(!y_i \& y_j\)
    • \(x_i \& !x_j\) and \(y_i > y_j\)
    • \(!x_i \& x_j\) and \(y_i < y_j\)
    • \(x_i \& !x_j\) and \(y_i \& !y_j\) (not used in local perspective version)
    • \(x_i \& x_j\) and \(!y_i \& y_j\) (not used in local perspective version)
  • Information content informed discordant pairs:
    • \(x_i > x_j\) and \(y_i < y_j\)
    • \(x_i < x_j\) and \(y_i > y_j\)
    • \(x_i > x_j\) and \(!y_i \& y_j\)
    • \(x_i < x_j\) and \(y_i \& !y_j\)
    • \(x_i \& !x_j\) and \(y_i < y_j\)
    • \(!x_i \& x_j\) and \(y_i > y_j\)
    • \(x_i \& !x_j\) and \(!y_i \& y_j\) (not used in local perspective version)
    • \(!x_i \& x_j\) and \(y_i \& !y_j\) (not used in local perspective version)
  • Also data points with both missing x and y values will naturally reduce the strength of the correlation value, since they can be neither concordant nor discordant with another (NA, NA) pair, but will impact the denominator.
  • Alternatively, (NA,NA) points can be removed to calculate a correlation that is specific to the two variables and does not consider missing data from a global dataset perspective that spans a set of variables.
    • If this local perspective is used, then two data points that are both missing data, should not be compared.
      • This is equivalent to removing the last two concordant and discordant pair tests.
      • It may be easier to just count them and then remove this count from the denominator. (See handling tied values below).

Handling Tied Values

The base Kendall tau correlation must be adjusted to handle tied values, ie. the tau-b version of the equation.

\[\tau = \frac{ | pairs_{concordant} | - | pairs_{discordant} |}{\sqrt{ ( n_0 - n_{xtie} ) ( n_0 - n_{ytie} )}} \] where:

  • \(n_0 = \frac{n \left ( n - 1 \right )}{2}\)
  • \(n_{xtie} = \sum_{i}^{} \frac{t_{xi} \left ( t_{xi} - 1 \right )}{2}\)
  • \(n_{ytie} = \sum_{i}^{} \frac{t_{yi} \left ( t_{yi} - 1 \right )}{2}\)
  • \(t_{xi}\) - the size of the ith group of tied x values.
  • \(t_{yi}\) - the size of the ith group of tied y values.
  • From the local perspective, the number of NAs in x and y can be treated as a group of tied values in calculation of \(n_{xtie}\) and \(n_{ytie}\), respectively.

Alternative Tau-b

\[\tau = \frac{\left | pairs_{concordant} \right | - \left | pairs_{discordant} \right |}{\left | pairs_{concordant} \right | + \left | pairs_{discordant} \right | + \left | pairs_{xties} \right | + \left | pairs_{yties} \right |}\]

The nice thing about this approach is that it is based on the above definitions of concordant and discordant. Also, I think this is directly the local version. Therefore, to get the global version, we count the ties of NA’s and other things, and appropriately inflate the x and y tie counts in the bottom.

Scaling by the correlation with the highest information content

When generating a correlation matrix (heatmap) for large analytical datasets, the number of observations in common can become quite low between any two variables. It becomes advantageous to scale by the pair of variables with the highest information content. One objective scaling factor is the highest possible absolute correlation at the maximum information content observed across a dataset, and dividing by this maximum possible absolute correlation would scale the whole dataset appropriately.

\[maxcorr = \frac{\binom{n-m}{2} + n * m}{\binom{n-m}{2} + n * m + \binom{m}{2}}\] Where:

  • Choose the two variables with the least number of missing values across the dataset.
  • n is the length of the variables.
  • m is the count of missing values across the two variables divided by two rounded down.
    • This formula is based on perfect correlation with a given number of (NA,NA) pairs added.

Functions

The functions that implement this include:

  • ici_kendallt: the workhorse, actually calculating a correlation between an X and Y.
    • The option perspective will control how the NA values influence ties.
    • When comparing samples, you likely want to use perspective = "global".
  • visqc_ici_kendallt: Handles comparisons for a matrix.
    • Rows are samples, columns are features.
    • Implicitly parallel, but have to call:
      • library(furrr)
      • plan(multiprocess)
    • Otherwise will only use a single core.
    • On a 835 feature by 179 sample data set, using all cores took ~ 11 seconds on an 8 core laptop, while the sequential version took 90 seconds.

Examples

Speed

x = rnorm(1000)
y = rnorm(1000)

library(microbenchmark)

microbenchmark(
  cor(x, y, method = "kendall"),
  ici_kendallt(x, y, "global"),
  times = 5
)
#> Unit: milliseconds
#>                           expr       min        lq      mean    median
#>  cor(x, y, method = "kendall") 13.228134 13.315856 13.324542 13.324393
#>   ici_kendallt(x, y, "global")  7.870682  7.873501  7.963921  7.904064
#>         uq       max neval
#>  13.326874 13.427453     5
#>   7.952865  8.218495     5
all.equal(ici_kendallt(x, y, "global"), cor(x, y, method = "kendall"))
#> [1] TRUE