1 Markey Cancer Center, University of Kentucky, Lexington,
KY 40536, USA
2 Department of Molecular & Cellular Biochemistry,
University of Kentucky, Lexington, KY 40536, USA
3 Superfund Research Center, University of Kentucky,
Lexington, KY 40536, USA
4 Department of Electrical and Computer Engineering,
University of Kentucky, Lexington, KY 40506
5 Institute for Biomedical Informatics, University of
Kentucky, Lexington, KY 40536, USA
6 Department of Toxicology and Cancer Biology, University of
Kentucky, Lexington, KY 40536, USA
✉ Correspondence: Hunter NB Moseley <hunter.moseley@uky.edu>
Abstract
Background: Almost all correlation measures currently available are unable to directly handle missing values. Typically, missing values are either ignored completely by removing them or are imputed and used in the calculation of the correlation coefficient. In either case, the correlation value will be impacted based on a perspective that the missing data represents no useful information. However, missing values occur in real data sets for a variety of reasons. In metabolomics data sets, a major reason for missing values is that a specific measurable phenomenon falls below the detection limits of the analytical instrumentation (left-censored values). These missing data are not missing at random, but represent potentially useful information by virtue of their “missingness” at one end of the data distribution. Methods: To include this information due to left-censored missingness, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We developed a statistical test and then show that most missing values in metabolomics datasets are the result of left-censorship. Next, we show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion leads to an interpretation of information being added to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that add further layers of information interpretation in the methodology. Results: Using both simulated and over 700 experimental data sets from The Metabolomics Workbench, we demonstrate that the ICI-Kt methodology allows for the inclusion of left-censored missing data values as interpretable information, enabling both improved determination of outlier samples and improved feature-feature network construction. Conclusions: We provide explicitly parallel implementations in both R and Python that allow fast calculations of all the variables used when applying the ICI-Kt methodology on large numbers of samples. The ICI-Kt methods are available as an R package and Python module on GitHub at https://github.com/moseleyBioinformaticsLab/ICIKendallTau and https://github.com/moseleyBioinformaticsLab/icikt, respectively.
Keywords: Metabolomics; Correlation; Missingness; Left-Censored
Correlation as a measure of the relatedness or similarity of two or more sets of data has a long history, with the mathematical technique being used (and abused) in various scientific fields since its introduction [1,2]. More recently, correlation calculations have become a cornerstone statistical method in the analysis and integration of varied omics’ datasets, especially the big five omics: genomics, transcriptomics, proteomics, metabolomics, and epigenomics [3]. Correlation between biomolecular features (nucleotide variants, RNA transcripts, proteins, metabolites, DNA methylation, etc.) may be used to evaluate the relationship strength between pairs of the features as well as to detect and derive correlative structures between groups of features [4]. Moreover, feature-feature correlations can be used to evaluate a dataset based on expected biochemical correlations, for example higher feature-feature correlations within lipid categories versus between lipid categories [5]. Correlation is a foundational method for generating biomolecular feature-feature interaction networks, like those provided by STRING [6], Genemania [7], and WCGNA [8]. Feature-feature correlation may also be used to inform which features are used for imputation of missing values [9].
Often, the first step in omics’ level analyses is to examine the sample-sample (dis)similarities in various ways using exploratory data analysis or EDA. This can include the examination of decomposition by principal components analysis (PCA), sample-sample pairwise distances, or sample-sample pairwise correlations to highlight biological and batch groups [10–12], double check the appropriateness of planned analyses [13], and check if any samples should be removed prior to statistical analysis (outlier detection and removal) [14]. Outlier detection, in particular, is often required for successful omics’ data analysis, as any misstep during the experimentation, sample collection, sample preparation, or analytical measurement of individual samples can inject high error and/or variance into the resulting data set [10–12,14,15].
All analytical methods, and in particular the analytical methods used in omics’ where many analytes are being measured simultaneously, suffer from missing measurements. Some analytes will be missing at random because of spurious issues with either the instrument, the particular sample, or sample preparation, but a larger number of missing measurements are left-censored due to analytes being below the effective detection limit of the instrument and the given specific sample preparation procedures utilized, as shown in Figure 1. Some analytical instruments are purposely designed to floor measurements when they occur below a certain signal to noise ratio threshold. Also, imputation of missing measurements in omics’ samples is an active area of research, which we will not comprehensively cover here beyond to say that it is worthwhile and very necessary in many instances. Imputation methods rely on very similar analytical detection limits between analytical samples. When this condition does not hold, imputation methods have reduced performance and lower interpretive value. For analytical techniques requiring complex sample handling and detection, the variability in the analytical detection level can be quite high. Some differential analysis methods have been developed to directly handle missing values in the statistical testing methodology [16,17]. However, when it comes to calculating correlation, there are very few methods that explicitly account for left-censored missing data that we know of. In many cases, missing values are either ignored or imputed to zero (or another value) and then included in the correlation calculation. The two most common approaches for ignoring (i.e. dropping) values is to only use those measurements that are common across all samples (complete) or that are common between two samples being compared (pairwise complete). Both dropping or imputing missing values are likely to cause the calculated sample-sample correlation values to deviate from the real sample-sample correlation values, especially with respect to specific data interpretation perspectives.
Figure 1. Graphical description of the left-censored data problem. An example density plot of the analyte concentrations for a single experimental sample is shown as a solid black curve. The true analyte concentration range covers the full range of the density distribution, with the minimum on the left (red vertical line), and the maximum on the right (yellow vertical line). Below certain concentrations, shown by the red line, the instrument returns either missing values (NA), zeros, or some other floored values, resulting in a left-censored distribution. Above certain concentrations, highlighted by the yellow line, typically the values returned will be identical (or flagged depending on the instrument). Which analytes will have concentrations below the red detection limit line may vary from sample to sample due to the overall sample composition, as well as biological variance.
Assuming that a majority of missing values are not missing at random, but rather result from left-censored distributions due to the analyte being below the effective detection limit (see Figure 1), we propose that these missing values do in fact encode useful information that can be incorporated into correlation calculations. Thus, information content, i.e., the amount of information available and used, can be either increased or at least not lost in the calculation of the correlation.
To create a correlation measure that is capable of working with missing values, we would not be interested in creating a completely new correlation metric from scratch, but modifying an existing one. Of the three commonly used correlation measures, Pearson, Spearman, and Kendall-\(\tau\), Spearman and Kendall-\(\tau\) seem most appropriate for modification as they solely use ranks in the calculation of their coefficients. Modifying Pearson would either involve imputing new values, or finding a way to calculate the covariances with missingness included. While Spearman uses ranks, many of the modifications for handling identical ranks and ties do not seem amenable to working with missing values. In contrast, Kendall-\(\tau\)’s use of concordant and discordant pair counts seems most amenable to the creation of new definitions that incorporate missingness while still working within the original definition of the correlation coefficient, as shown in the Implementation section below.
In this work, we propose new definitions of concordant and discordant rank pairs that include missing values, as well as methods for incorporating missing values into the number of tied values for the equivalent of the modified Kendall-\(\tau\) coefficient, the information-content-informed Kendall-\(\tau\) (ICI-Kt) method. The implementation of the basic calculation of ICI-Kt involves the replacement of missing values with a value lower than the observed values (technically simple imputation), with subsequent calculation of the Kendall \(\tau_b\) statistic, as a majority of missing values are the result of left-censorship, they still provide an interpretation from an information-content perspective, which we demonstrate with the equations below. We also developed a binomial statistical test for determining if the cause for missingness is likely left-censorship. With this statistical test and experimental datasets from The Metabolomics Workbench (MW) [18], we demonstrate that left-censorship is the cause of many missing values across a large number of metabolomics datasets. We examine the effect of missing values on various collections of simulated and real data-sets, comparing the ICI-Kt methodology with other simpler methods of handling the missing values, namely removing them or imputing them to zero. Given the detrimental effects of including outlier samples, we also evaluate the application of ICI-Kt in quality control and quality assessment steps typically performed prior to differential analyses. Specifically, we compare ICI-Kt to other common correlation-based outlier detection methods. We were also curious about the utility of the ICI-Kt methodology in creating metabolomics feature-feature networks with large amounts of missing values, so we evaluated the partitioning of networks by Reactome pathways [19] after network creation using different correlation measures.
All of the code and data used for this manuscript is available on Zenodo [20].
In the simplest form, the Kendall-\(\tau\) (\(\tau_a\)) correlation can be defined as:
\[ \tau_a = \frac{ n_{concordant} - n_{discordant} }{n_{concordant} + n_{discordant}}\]
where \(n_{concordant}\) is the the number of concordant pairs and \(n_{discordant}\) is the number of discordant pairs. In this case, a pair are any two x-y points, \(x_i, y_i\) and \(x_j, y_j\), with \(i \neq j\), composed from two joint random variables X and Y, where \(x_i\) represents the ith value in X and \(y_i\) represents the ith value in Y. In a metabolomics context, X and Y can represent metabolite feature vectors for two experimental samples or two specific metabolite features across a set of samples.
A concordant pair has the following classical definition:
\(x_i \gt x_j\) and \(y_i \gt y_j\)
\(x_i \lt x_j\) and \(y_i \lt y_j\)
A discordant pair has the following classical definition [21]:
\(x_i \gt x_j\) and \(y_i \lt y_j\)
\(x_i \lt x_j\) and \(y_i \gt y_j\)
We can expand the concordant and discordant pair definitions to
include missing values (e.g. NA in R). The
information-content-informed concordant pair definitions are then:
\(x_i \gt x_j\) and \(y_i \gt y_j\)
\(x_i \lt x_j\) and \(y_i \lt y_j\)
\(x_i \gt x_j\) and \(y_i \! \neq \! \text{NA} , \ y_j \!=\! \text{NA}\)
\(x_i \lt x_j\) and \(y_i\! =\! \text{NA}, \ y_j \! \neq \! \text{NA}\)
\(x_i \! \neq \! \text{NA}, \ x_j \! = \! \text{NA}\) and \(y_i \gt y_j\)
\(x_i \! = \! \text{NA}, \ x_j \! \neq \! \text{NA}\) and \(y_i \lt y_j\)
\(x_i \! \neq \! \text{NA}, \ x_j \! = \! \text{NA}\) and \(y_i \! \neq \! \text{NA}, \ y_j \! = \! \text{NA}\)
\(x_i \! = \! \text{NA}, \ x_j \! \neq \! \text{NA}\) and \(y_i \! = \! \text{NA}, \ y_j \! \neq \! \text{NA}\)
The information content informed discordant pair definitions are then:
\(x_i \gt x_j\) and \(y_i \lt y_j\)
\(x_i \lt x_j\) and \(y_i \gt y_j\)
\(x_i \gt x_j\) and \(y_i \! = \! \text{NA}, \ y_j \! \neq \! \text{NA}\)
\(x_i \lt x_j\) and \(y_i \! \neq \! \text{NA}, \ y_j \! = \! \text{NA}\)
\(x_i \! \neq \! \text{NA},\ x_j \! = \! \text{NA}\) and \(y_i \lt y_j\)
\(x_i \! = \! \text{NA}, \ x_j \! \neq \! \text{NA}\) and \(y_i \gt y_j\)
\(x_i \! \neq \! \text{NA}, \ x_j \! = \! \text{NA}\) and \(y_i \! = \! \text{NA}, y_j \! \neq \! \text{NA}\)
\(x_i \! = \! \text{NA}, \ x_j \! \neq \! \text{NA}\) and \(y_i \! \neq \! \text{NA}, y_j \! = \! \text{NA}\)
These additional definitions make it possible to interpret a Kendall-\(\tau\) correlation from the perspective of missing values as additional information, i.e., information-content-informed Kendall-\(\tau\) (ICI-Kt) methodology.
Tied values do not contribute to either of the concordant or discordant pair counts, and the original Kendall-\(\tau\) formula for the \(\tau_a\) statistic does not consider the presence of tied values. However, the related \(\tau_b\) statistic does handle the presence of tied values by adding the tied \(x\) and \(y\) values to the denominator, and in our special case of missing data, we can add the ties that result from \((x_i=\text{NA}, x_j=\text{NA})\) and \((y_i=\text{NA}, y_j=\text{NA})\) to \(n_{xtie}\) and \(n_{ytie}\) [22,23] used in the following equation for \(\tau_b\):
\[\tau_b = \frac{n_{concordant} - n_{discordant}}{\sqrt{(n_{tot} - n_{xtie})(n_{tot} - n_{ytie})}}\]
where \(n_{tot}\) is the total number of pairs, \(n_{xtie}\) are the number of tied values in X, and \(n_{ytie}\) are the number of paired values in Y.
We can also consider commonly missing values in X and Y specially as well. In the first instance, we remove those x-y points where both values are missing, preventing their interpretation as missing information content. We refer to this case as the local ICI-Kt correlation. It is most appropriate for the comparison of only two experimental samples, where we are concerned with what values are present in the two experimental samples, with the odd case of missingness.
The other case, where we leave ties resulting from points with missing X and Y, we refer to as the global ICI-Kt correlation. In this case, every single correlation over multiple comparisons with the same set of metabolite features will consider the same number of pair comparisons. This is useful when analyzing and interpreting correlations from a large number of experimental samples, not just two samples, since x-y points where both values are missing is interpreted as missing information content.
With the calculation of the number of entries, and the numbers of ties (which may change depending on whether one is using the global or local correlations), a p-value for the correlation can be calculated using the Mann-Kendall test [24].
The global case also provides an interesting property, whereby we can calculate the theoretical maximum correlation that would be possible to observe given the lowest number of shared missing values. This value can be useful to scale the rest of the observed correlation values across many sample-sample correlations, providing a value that scales an entire dataset appropriately. For any pairwise comparison of two vectors (from experimental samples for example), we can calculate the maximum possible Kendall-tau for that comparison by defining the maximum number of concordant pairs as:
\[\tau_{max} = \frac{n_{tot} - n_{xtie} - n_{ytie} + n_{tie}}{\sqrt{(n_{tot} - n_{xtie})(n_{tot} - n_{ytie})}}\]
where \(n_{tie}\) is the number of commonly tied values in both X and Y. Calculating a set of \(\tau_{max}\) values between all experimental samples, we can take the maximum of the values, and use it to scale all of the obtained Kendall-tau values equally (\(\frac{\tau}{\max(\tau_{max})}\)).
We do note that scaling by \(\max(\tau_{max})\) changes the overall
values returned from a particular dataset, and makes comparisons
between datasets invalid. Therefore, when calculating
ICI-Kt values for comparison between datasets, we advise users to set
the scale_max option to FALSE.
As an addition to the correlation value, we also calculate the completeness between any two samples. We first measure the number of entries missing in either of the samples being compared, and subtract that from the total number of features in the samples. This defines how many features are potentially complete between the two samples. This number, over the total number of features defines the completeness fraction.
\[ \text{completness} = \frac{n_{feat} - N (miss_i \cup miss_j)}{n_{feat}} \]
where for any two samples i and j, \(n_{feat}\) is the total number of features or entries, and \(miss_i \cup miss_j\) are the metabolite features missing in either sample i or j, with \(N\) being the total number of missing entries in either sample i or j.
We produced an initial reference implementation in base
R [25], where the various
concordant and discordant pair definitions were written as simple
logical tests to allow further exploration and validation of faster
implementations. During exploration and validation of an early
implementation, we discovered that an equivalent calculation was to
replace the missing values with a value smaller than all of the values
in the two sets being compared. This simplification does not change the
interpretation of the effect of left-censored missing values, but it
does allow for the direct use of the very fast mergesort
based algorithm for calculating \(\tau_b\) [26].
We re-implemented the mergesort implementation from the
SciPy kendalltau code [27] in both R (via
Rcpp) and Python (via Cython) to
enable fast, easy parallel computations in both languages (using
furrr and multiprocessing, respectively), as
well as inclusion of the calculation of \(tau_{max}\), which is derived from the same
values needed for the calculation of \(\tau_b\) (see above). For consistency, we
also re-implemented the calculation of the p-values for the \(\tau_b\) statistic from the
SciPy implementation, which follows the description of the
Mann-Kendall test [24]. The version of the
ICIKendallTau R package used in this
manuscript is available on zenodo [28]. In
addition to use as an imported Python module, the Python
icikt module provides a command line interface (CLI) for
the ICI-Kt methodology.
Distribution parameters for simulated datasets are listed in Table 1. Simulated feature vectors (analytical samples) are generated by drawing random values from a log-normal distribution, and sorting them in ascending order to create a pair of samples with perfectly positive (1) or negative (-1) correlation values (perfect dataset). Log-normal distributions were used for the initial distribution as our experience with analyzing several mass-spectrometry datasets has shown they frequently follow a log-like distribution, and it is necessary to log-transform the data prior to further analysis. This is also supported by Abram and McCloskey [29]. Random variance is added to one of the two samples by drawing values from a uniform distribution over -0.5 to 0.5, and adding the values to the original sample, and sorting them again to maintain a correlation of 1 or -1 for Kendall-\(\tau\) correlation (noise-1). A sample with a small percentage (0.5%) outlier points at one end of the distribution is created by sampling from a uniform distribution over the range -0.5 to 0.5, and then a log-normal distribution (outlier dataset), and adding the log-normal values to the uniform values for a combined source of random variance that is added to the original sample values. The negative analytical sample has values sorted in descending order. Missing value indices are generated by randomly sampling up to 499 of the lowest values in each sample. For the negative sample, the indices are also subtracted from 1000 to cause them to be at the lower end of the feature distribution. Finally, missing indices were only inserted into one of the two samples being compared before calculating the correlation. The missing indices are replaced with \(\text{NA}\), and then correlations between the analytical samples are calculated.
Another, more realistic, simulated data set is generated by drawing from a log-normal distribution, and adding noise from a normal distribution to create two statistical samples (realistic dataset and noise-2). Missing values are created in these statistical samples via two methods: 1) by creating intensity cutoffs from 0 to 1.5 in 0.1 increments, values below the cutoff set to missing or zero depending on the calculation; 2) randomly sampling locations in the two-sample matrix ranging from zero to 300 in increments of 50 and setting the indices to missing or zero.
Table 1. Parameters used for various simulated data.
Dataset | distribution | N | Mean | SD | Range |
|---|---|---|---|---|---|
perfect | log-normal | 1,000 | 1.0 | 0.5 | |
noise-1 | uniform | 1,000 | -0.5 - 0.5 | ||
outlier | log-normal | 5 | 1.2 | 0.1 | |
realistic | log-normal | 1,000 | 1.0 | 0.5 | |
noise-2 | normal | 1,000 | 0.0 | 0.2 | |
lod | log-normal | 1,000 | 1.0 | 0.5 | |
noise-3 | normal | 1,000 | 0.0 | 0.2 |
A set of 6105 analysis datasets from The Metabolomics Workbench (MW)
were downloaded on 2025-11-12 using the mwtab Python
package [30], and repaired to fix various
issues. For a subset that had metabolite feature abundances outside the
mwtab json file, the files were downloaded on 2025-11-13. The various
pieces of each dataset were parsed and transformed into R
appropriate structures, mainly data frames of various types for
metadata, and matrices of abundances (see Data Processing). Subject
sample factors (SSF) were transformed so that each sample has a
combination of various factors to describe the unique groups of samples.
For example, if a dataset included samples with one or more genotypes
(Knockout, FLOX) and taken from different segments of the intestines,
the final factor for each sample is the combination of genotype +
intestinal segment.
For inclusion in this work, an analysis dataset had to meet these criteria:
\(\ge\) 100 metabolites, so that any degree of missingness would still allow for robust estimation of correlations between samples;
one SSF grouping with \(\ge\) 5 samples, and \(\ge\) 2 SSF groupings after removing samples that may be pooled, quality-control or blanks; this provides a greater likelihood of decent variance estimates when calculating the F-statistics across SSF after removing potential outlier samples;
a maximum metabolite feature abundance \(\ge\) 20 to exclude log-transformed values and low dynamic range datasets;
the ability to calculate a correlation between the median rank of a metabolite feature and the number of samples the metabolite was missing in within a factor, as this indicated a minimum number of missing values in each SSF.
Of the 6105 datasets initially downloaded, 711 were kept for further analysis.
For each dataset, the samples were split by SSF (see previous Methods). For each metabolite feature, the rank of the feature was calculated for each sample where the feature was present, followed by the features median rank across samples, as well as the number of samples the feature was missing from. Grouping the features by the number of missing values, we calculate the median of median ranks, as well as the minimum of median ranks, for the visualization and correlation of the relationship of rank with missing values.
For each dataset, the samples were first split by SSF. In each sample, the median abundance of features present in the sample are calculated. For any feature that is missing in any sample, the values in the present samples are compared to the median value of their corresponding sample. If the value is less than or equal to the median value in the sample, that is counted as a success in a binomial test, otherwise it is counted as a failure. The number of successes and failures are aggregated across the SSF splits for calculation in a binomial test, with the null hypothesis assuming a ratio of 0.5.
For each dataset, we calculated correlations using a variety of
methods. Across the various datasets, either zeros or empty strings
(generally resulting in NA values when read into R) are
used to represent missingness. To start, we replaced all missing values
with NA, and then either left them as NA or
set them to zero for the various methods used. ICI-Kt with
NA (icikt); and then scaled (multiplied) by the
completeness metric (icikt_complete). Kendall-tau, with NA,
and then using pairwise-complete-observations (kt_base). Pearson, with
NA’s replaced by zeros, using
pairwise-complete-observations (pearson_base). Pearson, with
NA, using pairwise-complete-observations
(pearson_base_nozero). Pearson, with a \(log(x
+ 1)\) transform applied, using pairwise-complete-observations
(pearson_log1p). Pearson, with a \(log(x)\) transform, and then setting
infinite values to NA values, using
pairwise-complete-observations (pearson_log).
For outlier detection, median sample-sample correlations within the
unique SSF (genotype, condition, and their combinations) is calculated,
and \(log(1 - cor_{median})\)
calculated to transform it into a score. Then outliers are determined
using R’s grDevices::boxplot.stats, which by default are at
1.5X the whiskers in a box-and-whisker plot. This is functionally
equivalent to determining outliers as values that are \(\ge\) 1.5X the interquartile range of the
data. As we are interested in only those correlations at the
low end of correlation (becoming the high end after the
subtraction and log-transform), we restrict to only those entries at the
high end of the score distribution (using
visualizationQualityControl::determine_outliers [31]). This is equivalent to using the
correlation component of the score described by Gierliński et al [14] and setting the other component weights to
zero.
After outlier detection, the ANOVA statistics are calculated across
SSF using limma (v 3.62.2) [32], and the fraction of metabolites with an
adjusted p-value \(\le 0.05\) recorded
(Benjamini-Hochberg adjustment [33]). The
correlation methods are compared by their fractions of significant
metabolites, using a paired t-test, and adjusting p-values across method
comparisons using the Bonferroni adjustment.
Predicted Reactome pathway annotations for analyzable MW datasets
were parsed from the previous work by Huckvale et al. [19,34]. Given the hierarchical nature of
Reactome pathways, and the use of the pathways for partitioning
feature-feature networks, we aggregated the Reactome pathways into
larger grouped pathway sets with less feature overlap. First, we
aggregated predicted pathway annotations to the same pathway identifier
across species. Second, for each pathway annotation with \(\ge\) 20 and \(\le\) 500 features, we calculated the
combined overlap metric between all pathways
(categoryCompare2 v 0.200.4 [35,36], values range between 0 and 1). Treating
the overlap metric between pathways as weighted edges of a graph, we
removed the edges with a value \(\le
0.6\), and then did community detection using the
walktrap clustering method in the igraph
package (v 2.1.4)[37–40]. Each community
of Reactome pathways identified by the walktrap clustering
were then aggregated to a new grouped pathway annotation. These grouped
pathway annotations were then used for the network partitioning
calculations.
Each MW dataset that was used for outlier detection (711) was also checked against the list of datasets used in the prediction of pathways and enrichment by Huckvale et al. (resulting in 137 datasets) [34]. The various correlation measures were calculated between all features (see Correlation Methods). For any given correlation method, we generate the feature-feature network for that dataset-correlation method combination. The dataset correlations were transformed to partial correlations. From the distribution of partial correlation values, we consider the fraction of values that make up the 2.5 % of the tail values (for a total of 5%) as the significant partial correlations that can be used as actual edges in the network. The network is then trimmed to only the edges that have a positive weight.
For each feature annotation (see Feature Annotations), we calculate three sums of the edge weights.
The total sum of edge weights for all edges with features that are annotated to one or more of the annotations (annotated).
The within annotation edge weight sum, where both start and end nodes are annotated to the same annotation.
The outer annotation edge weight sum, where the start node is part of the annotated set, and the end node is annotated to one of the other annotations.
The partitioning ratio (or q-ratio) is calculated as follows:
\[Q = \sum_{i=1}^{annot}\frac{within_i}{annotated} - \left(\frac{outer_i}{annotated}\right)^2\]
The partitioning ratio was originally designed as a method to determine the optimal clustering of networks, where each member of the network has only a single label [41,42]. In those cases, the partitioning ratio should range between 0 and 1 for non-partitioned and fully partitioned networks, respectively. The grouped Reactome pathways still have shared metabolite features, and therefore the partitioning ratios have a much wider range of values. However, we expect that better partitioning of the network will be reflected by a more positive partitioning ratios.
Partitioning ratios were compared across correlation methods using a paired t-test, where methods were paired by the dataset. P-values were adjusted using the Bonferroni correction.
We created a simulated dataset as noted in Table 1 (lod and noise-3),
starting with a sample from a random log-normal distribution (lod).
Uniform noise was added via random normal distribution (noise-3), to
create 100 samples from the base distribution, values were transformed
to normal space and then \(log_{10}\)
applied to have a representation of orders of magnitude and dynamic
range. For any maximum level of censoring to be applied, a uniform
distribution sample is generated on the range of \(0 - max\) with 100 values. Data censoring
was applied by taking the minimum observed value for a sample, and
adding the censoring value from the uniform distribution. Any values in
the sample below the censoring value are set to missing
(NA).
Correlations were calculated between samples when no missing values are present (reference), and then again after censoring (trimmed). Two different correlation methods were used: ICI-Kt; and Pearson correlation. Imputation for Pearson correlation involved replacing all missing values with ½ the lowest observed value in the dataset after censoring. Differences between the reference and trimmed correlations were calculated, as well as the difference in the absolute value of differences of ICI-Kt and Pearson imputed.
We compared the performance of our Rcpp mergesort implementation to the base R cor function using both “pearson” and “kendall” methods on a single core with increasing numbers of features, ranging from 100 to 5000, in increments of 100.
To verify the parallel implementation in R, we created a
fake dataset with 10,000 features and 400 samples via
rnorm. Using Rscript to run each iteration in
a clean R session, we ran calculated ICI-Kt correlations
between the 400 samples for this dataset with an increasing number of
cores, ranging from 1 to 12, timing the calculation for each one, and
logging the memory used from the ICIKendallTau::log_memory.
The results are returned as a data.frame to save memory and time
generating the matrix. For Python, we created a similar
dataset using numpy.random.randn, and then ran it on all
cores, as the current package does not allow setting the number of cores
in the multiprocessing.Pool creation. We logged Python
memory usage using a bash script writing memory use to a file every 5
seconds. In both cases, the memory usage prior to starting calculations
was recorded so that the increased memory usage specific to calculating
the ICI-Kt results and saving them could be noted.
All data processing and statistical analysis used R v
4.4.1 [25] and Bioconductor v 3.20 [43]. JSONized MW files were read in using
jsonlite v 2.0.0 [44].
Metabolite feature id and sample id cleaning used janitor v
2.2.1 [45] and dplyr v 1.1.4
[46]. Plots were generated using either
singly or in combination the packages: ggplot2 v 4.0.0
[47]; ggforce v 0.5.0 [48]; patchwork v 1.3.2 [49]. Analysis workflows were coordinated using
targets v 1.11.4 [50].
Reactome pathway manipulation and aggregation used
categoryCompare2 v 0.200.4 [35]. Outlier detection used
visualizationQualityControl v 0.5.6 [31].
We are aware of only one previous investigation of the causes of missingness in metabolomics datasets [41]. In Do et al. [41], the authors showed that there was a limit of detection (LOD) effect, with a dependence on the day the samples were run. Unfortunately, the KORA4 metabolomics dataset from Do et al. is not publicly available, so we could not attempt to redo their analysis of missing values with the same dataset.
Given the number of projects and analyses available in Metabolomics Workbench (MW), we sought to obtain a large number of individual datasets (MW analyses) from MW to evaluate both the phenomenon of left censorship and the information-content-informed Kendall-tau methodology.
In Figure 2, we provide a summary of the number of datasets remaining
after filtering for various attributes (see Methods). From the starting
6501, we were able to retain 711 for this study. There were 1636
datasets that had no metabolite abundance data, or the metabolite data
was not easily parseable or downloadable from external text files. A
further 2162 datasets were excluded due to having \(\lt\) 100 metabolites. 755 were removed
because they had \(\lt\) 5 samples in
at least one group of subject sample factors (SSF) or \(\lt\) 2 SSF. 72 remaining datasets had a
maximum intensity \(\lt\) 20,
indicating either being log-transformed or a very small dynamic range.
Finally, 769 datasets were removed due to either having no missing
values, or the calculation of a correlation of rank with number of
missing values across samples returned an NA value.
Figure 2. Number of datasets remaining after each level of filtering and / or checking.
One would wonder just how many missing values are present in metabolomics datasets, and if their missingness is primarily due to left censorship or some other phenomenon. For the 711 datasets examined in this work, the percentage of missingness ranged from near zero (we required at least one missing value to keep a dataset for further analysis) to 25% for nuclear magnetic resonance (NMR), and the majority of mass-spectrometry (MS) datasets had missingness in the 0 - 25% range, with some datasets having missingness \(\gt\) 80% (Figure 3A). Using a binomial test to check if missing values are more likely to have non-missing values ranked at \(\le\) 0.5 (i.e., below the median) of the non-missing values in that sample, we see that the vast majority (681 of 711) have an adjusted-p-value \(\le\) 0.05, with over 160 having an extreme adjusted p-value (Figure 3B). We also checked if there is any relationship between the percentage of missing values and the binomial test p-values, but found none (Figure S1). We have included the analysis dataset measurement and chromatography information extracted from the MW file, the number of missing values, and the results of the binomial test for left-censorship for all MW datasets investigated in a supplemental excel file for those who may want to investigate the phenomenon for subsets of the datasets.
For each set of subject sample factors (SSF) of a dataset, we calculated the median rank and number of measurement values missing across samples (i.e. N-Missing) for each metabolite. As shown in Figure 3C, there is a monotonically decreasing relationship between the median rank and the number of missing values for that metabolite. Moreover, as N-Missing decreases, there is clearly a minimum median rank below which the values do not cross as illustrated by the red points (Figure 3C). As shown in Figure 3D, median rank and N-missing are negatively correlated across the majority of experiments, although there are more positive correlations when using the minimum median rank. Given the results of the binomial test of missing ranks and this relationship of the minimum median value observed and N-Present, we believe that the majority of missing values in many metabolomics datasets are due to left censorship.
This makes the ICI-Kt methodology appropriate for use in many metabolomics datasets containing missing values through incorporation of the missing values below the LOD as useful information in the correlation calculation.
Figure 3. Missingness in datasets. A: Sina plot of the percent missing values in NMR and MS datasets. B: Histogram of log10 adjusted p-values from the binomial test testing if non-missing values are more likely to be below the median rank when the metabolite has missingness in the dataset. Red line indicates an adjusted p-value of 0.05. Adjusted p-values of 0 were replaced with the lowest observed non-zero adjusted p-value. C: Median (black) and minimum median (red) rank of metabolite abundances across factor groups of samples vs the number of samples the metabolite had a missing value for dataset AN001074. D: Sina plots of the Kendall-tau correlation of the median rank and minimum median rank with number of missing samples across all datasets.
We compared the ICI-Kt correlation to both Pearson and regular
Kendall-tau-b correlations as calculated by the built-in R
functions using simulated data sets with missing values (Figures
S2-S4).
We created two samples with 1000 observations each drawn from a log-normal distribution, added further variance using a uniform distribution, and sorted in each case to create a pair of X and Y samples with a correlation of 1 and -1 for both Pearson and Kendall-tau correlation measures. The true correlation for each of the Kendall and Pearson correlations were calculated, and then missingness was introduced in the lower range of values, up to half of the values (see Simulated Data Sets).
In Figure 4 (and Figure S5), we can see that as missing values are added, only the ICI-Kt correlation values change in any significant way as illustrated by the wider range of ICI-Kt values on the y-axes versus the much narrower range of Pearson and Kendall tau correlation values on the x-axes. As the number of missing values increase, the ICI-Kt values drop or increase by up to 0.2. Similarly, Pearson correlation is also affected, but the degree of change in the correlation values are much less (notice the orders of magnitude differences in the x-axis scales compared to the y-axis), on the order of only 0.005 for both cases.
Adding outlier points at the high end of the distribution to one of the samples causes very odd discrete patterns to appear in the negative Pearson correlations (see Figures S6, S7). Again, the scale of the differences is much smaller in the Pearson correlations versus ICI-Kt. The negative Kendall correlations are unaffected by the outliers, in large part due to being a rank-based correlation. Likewise, the negative ICI-Kt correlations appear unaffected; however, the scale of changes seen in the negative Pearson correlations, if present in the negative ICI-Kt correlations, might simply be obscured by the changes due to missingness that are orders of magnitude larger.
These results demonstrate that the ICI-Kt correlation has quantitative sensitivity to missing values over the normal Kendall-tau correlation and linear Pearson correlation where points with missing values are ignored (pairwise complete).
Figure 4. Comparing the correlation values obtained by Pearson, Kendall, and ICI-Kt correlation as an increasing number of missing values (0 - 500) in the bottom half of either sample for both positively (correlation = 1) and negatively (correlation = -1) correlated samples. Points are colored by how many points were set to missing on average between the two samples. A subset of 10,000 points was used for visualization.
Figure 5 demonstrates the effect of introducing left-censored versus random missingness in five different measures of correlation, including the ICI-Kt, the normal Kendall-tau with pairwise-complete-observations, the normal Kendall-tau replacing missing with 0, Pearson with pairwise-complete-observations, and Pearson replacing missing with 0. The ICI-Kt correlation demonstrates a slight increase from the starting 0.90 correlation value with growing left-centered missingness caused by a slight reinforcement of the correlation, while with growing random missingness, the ICI-Kt correlation drops precipitously due to the large increase in discordant pairs caused by the random missing values. The normal Kendall tau correlation with pairwise complete has a small decrease in the correlation value with growing left-centered missingness caused by a loss of supporting pairs, while this correlation has a near constant average value with growing random missingness. The normal Kendall tau correlation replacing missing with 0 has identical behavior to the ICI-Kt correlation. In contrast to ICI-Kt, the Pearson correlation calculated using only pairwise complete entries is practically constant (i.e., range of 0.004 or less) over growing left-centered and random missingness. When replacing missing values with zero, Pearson correlation demonstrates a small decrease in the correlation value with growing left-centered missingness due to the zero values causing some deviation from linearity. Pearson correlation drops precipitously with growing random missingness with a magnitude similar to the ICI-Kt and normal Kendall tau replacing missing with 0. Overall, the ICI-Kt and the normal Kendall-tau replacing missing with zero have the desirable characteristics of maintaining the correlation with growing left-centered missing while sharply dropping the correlation with growing random missingness. In this special case where zero is lower than all of the values in the dataset, ICI-Kt and Kendall-tau replaced with zero result in identical correlation values, as shown in the bottom panels of Figure S8A and Figure S8C. In a naive treatment of the left-centered missing data, if the values below the cutoff are set to missing followed by log-transforming the values and subsequently setting missing values to 0, then the Kendall tau correlation replacing missing with 0 will show some very odd patterns at low intensity cutoffs due to the introduction of discordant pairs. Likewise, Pearson correlation replacing missing with 0 shows a parabolic effect with increasing missing values.
Figure 5. Effect of introducing missing values from a cutoff (A & B) or randomly (C) on different measures of correlation, including ICI-Kt, Kendall with pairwise complete, Kendall replacing missing with 0, Pearson with pairwise complete, and Pearson replacing missing with 0. A: Missing values introduced by setting an increasing cutoff. B: Missing values introduced by setting an increasing cutoff, and then log-transforming the values before calculating correlation. C: Missing values introduced at random. For the random case, each sample of random positions was repeated 100 times. Pay attention to the different y-axis ranges across graphs, with A and B graphs having much smaller y-axis ranges compared to C.
A common way missing data is handled in correlation calculations is to ignore them completely and use the pairwise complete cases to calculate the Pearson correlation coefficient. As shown in Figure 5C, this results in a complete misestimation of the changed correlative structure introduced by random entries. ICI-Kt, in contrast, incorporates the missingness in a sensical way, and the resulting correlation values fall as random entries are introduced.
Another way that missing values appear is due to changes in dynamic range between samples, as some samples have features with higher values, and the fixed dynamic range of the instrumentation results in features with lower values to be “missing” in those samples. We created a set of 100 simulated samples with uniform noise on the log-scale, with relatively constant dynamic ranges, and introduced changes to the overall dynamic range using a random censor at varying levels (see Implementation). Possible different levels of censoring based on dynamic range were checked by first determining how many missing values would be introduced in each sample as the dynamic range was increased in increments of 0.1 (see Figure S9). Based on the number of values being censored, limits of 0.5, 1, and 1.5 were selected, representing low, medium and high variability of the dynamic range.
For each level of possible missingness introduced by changes to the dynamic range, correlation across all samples were calculated using all values (reference), as well as after missingness was added (trimmed), and using Pearson correlation with global imputation (Pearson Imputed), or ICI-Kt. Figure 6 demonstrates that it is only as the number of missing values in one of the samples approaches 50% or more (500 of 1000 features) does the Pearson correlation with global imputation give correlation values closer to the known correlation with no missing values in any appreciable amount (points below the red lines in the top panels, and to the right of the red line in the histograms in the bottom panels). Points above the lines with slopes of -1 and 1 indicate that the difference of reference - trimmed is smaller in the ICI-Kt correlations, and below the lines indicate the difference is larger in the ICI-Kt correlations. This is further emphasized by the majority of the values are to the left of the line at 0 in the difference histograms.