Given a data-matrix, computes the information-theoretic Kendall-tau-b between all samples.

Usage

ici_kendalltau(
  data_matrix,
  global_na = c(NA, Inf, 0),
  perspective = "global",
  scale_max = TRUE,
  diag_good = TRUE,
  include_only = NULL,
  check_timing = FALSE,
  return_matrix = TRUE
)

Arguments

data_matrix

matrix or data.frame of values, samples are columns, features are rows

global_na

numeric vector of values that should be treated as NA across the entire data matrix

perspective

character, how to treat missing data in the denominator and ties

scale_max

logical, should everything be scaled compared to the maximum correlation?

diag_good

logical, should the diagonal entries reflect how many entries in the sample were "good"?

include_only

only run the correlations that include the members (as a vector) or combinations (as a list or data.frame)

check_timing

logical, should the run time for the full dataset be estimated? (default is FALSE)

return_matrix

logical, should the result be returned as a matrix (TRUE) or as a data.frame (FALSE)?

Value

list with cor, raw, pval, taumax

Details

For more details, see the vignette: vignette("ici-kendalltau", package = "ICIKendallTau").

global_na defines which values in the data are replaced with NA before the Kendall-tau calculation. The default is global_na = c(NA, Inf, 0). To replace a value other than 0, for example -2, use global_na = c(NA, Inf, -2); all values of -2 will then be replaced with NA instead of 0.
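The replacement described above can be sketched in base R. This is only an illustration of the effect of global_na, not the package's internal code; the helper replace_global_na() and the matrix toy are hypothetical names introduced here.

```r
# Hypothetical helper (not part of the package): set every value
# listed in global_na to NA.
replace_global_na <- function(data_matrix, global_na = c(NA, Inf, 0)) {
  # %in% matches NA and Inf as ordinary values, so both are caught here
  data_matrix[data_matrix %in% global_na] <- NA
  data_matrix
}

# Toy data: samples are columns, features are rows
toy <- matrix(c(1, 0, Inf, -2, 5, NA), nrow = 3,
              dimnames = list(NULL, c("s1", "s2")))

# Default: NA, Inf and 0 become NA; -2 is kept
default_na <- replace_global_na(toy)

# Custom: NA, Inf and -2 become NA; 0 is kept
custom_na <- replace_global_na(toy, global_na = c(NA, Inf, -2))
```

Comparing default_na and custom_na shows that only the values named in global_na are affected, so the choice of vector decides whether 0 or -2 counts as missing.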

When check_timing = TRUE, 5 random pairwise comparisons are run on a single core to generate timings, which are then used to estimate how long the full set of comparisons will take. The estimates are returned as a data.frame. They will be on the low side, but should give you a good idea of how long your data will take.
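The idea behind the timing estimate can be sketched as follows. This is a rough base R illustration under stated assumptions, not the package's implementation: base cor(method = "kendall") stands in for the real pairwise calculation, and the column names of the timing data.frame are hypothetical.

```r
set.seed(1234)
# Toy data: 200 features (rows) by 20 samples (columns)
x <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)
colnames(x) <- paste0("s", seq_len(ncol(x)))

# All unique pairs of samples
all_pairs <- combn(colnames(x), 2)
n_pairs <- ncol(all_pairs)

# Time 5 randomly chosen pairwise comparisons
n_test <- 5
test_pairs <- all_pairs[, sample(n_pairs, n_test), drop = FALSE]
elapsed <- vapply(seq_len(n_test), function(i) {
  pair <- test_pairs[, i]
  # Stand-in calculation (assumption): base R Kendall-tau
  system.time(cor(x[, pair[1]], x[, pair[2]], method = "kendall"))[["elapsed"]]
}, numeric(1))

# Extrapolate the mean per-pair time to the full set of comparisons
timing <- data.frame(
  n_pairs = n_pairs,
  mean_seconds = mean(elapsed),
  estimated_total_seconds = mean(elapsed) * n_pairs
)
timing
```

With 20 samples there are choose(20, 2) = 190 pairs, so the estimate is simply the mean time of the 5 test pairs multiplied by 190.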

Returned is a list containing matrices with:

  • cor: scaled correlations

  • raw: raw Kendall-tau correlations

  • pval: p-values

  • taumax: the theoretical maximum Kendall-tau value possible

Eventually, we plan to provide two more parameters for replacing values, feature_na for feature specific NA values and sample_na for sample specific NA values.

If you want to know if the missing values in your data are possibly due to left-censorship, we recommend testing that hypothesis with test_left_censorship() first.

Examples

if (FALSE) {
# not run
set.seed(1234)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10 

matrix_1 = cbind(s1, s2)

r_1 = ici_kendalltau(matrix_1)
r_1$cor

#    s1 s2
# s1  1  1
# s2  1  1
names(r_1)
# "cor", "raw", "pval", "taumax", "keep", "run_time"

s3 = s1
s3[sample(100, 50)] = NA

s4 = s2
s4[sample(100, 50)] = NA

matrix_2 = cbind(s3, s4)
r_2 = ici_kendalltau(matrix_2)
r_2$cor
#           s3        s4
# s3 1.0000000 0.9944616
# s4 0.9944616 1.0000000

# using include_only
set.seed(1234)
x = t(matrix(rnorm(5000), nrow = 100, ncol = 50))
# x has 50 rows and 100 columns, so name the columns by ncol(x)
colnames(x) = paste0("s", seq(1, ncol(x)))

# only calculate correlations of other columns with "s1"
include_s1 = "s1"
s1_only = ici_kendalltau(x, include_only = include_s1)

# calculate all correlations that involve either "s1" or "s3"
include_s1s3 = c("s1", "s3")
s1s3_only = ici_kendalltau(x, include_only = include_s1s3)

# restrict to specific pairs, given either as a list
include_pairs = list(g1 = "s1", g2 = c("s2", "s3"))
s1_other = ici_kendalltau(x, include_only = include_pairs)

# or a data.frame
include_df = as.data.frame(list(g1 = "s1", g2 = c("s2", "s3")))
s1_df = ici_kendalltau(x, include_only = include_df)

}