Given a data matrix, computes the information-theoretic Kendall-tau-b between all samples.
Usage
ici_kendalltau(
data_matrix,
global_na = c(NA, Inf, 0),
perspective = "global",
scale_max = TRUE,
diag_good = TRUE,
include_only = NULL,
alternative = "two.sided",
continuity = FALSE,
check_timing = FALSE,
return_matrix = TRUE
)
Arguments
- data_matrix
matrix or data.frame of values, samples are columns, features are rows
- global_na
numeric vector defining which values should be treated as NA across the whole dataset
- perspective
character, how to treat missing data in the denominator and ties
- scale_max
logical, should everything be scaled compared to the maximum correlation?
- diag_good
logical, should the diagonal entries reflect how many entries in the sample were "good"?
- include_only
only run the correlations that include the members (as a vector) or combinations (as a list or data.frame)
- alternative
what is the alternative for the p-value test?
- continuity
should a continuity correction be applied?
- check_timing
logical, should the run time for the full dataset be estimated? (default is FALSE)
- return_matrix
logical, should the data.frame or matrix result be returned?
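The two forms of include_only can be pictured with a small base R sketch. This is only an illustration of the description above, not the package's internal code: the sample names, the object names (all_pairs, pairs_vector, pairs_list) and the assumption that a vector keeps every pair containing at least one listed member are all hypothetical.

```r
# Hypothetical illustration of which sample pairs each form of include_only
# would select; this is not the code used inside ici_kendalltau().
samples = c("s1", "s2", "s3", "s4")

# Every unique pair of samples, one pair per row.
all_pairs = as.data.frame(t(combn(samples, 2)), stringsAsFactors = FALSE)
names(all_pairs) = c("a", "b")

# Vector form (assumed meaning): keep pairs that contain any listed member.
members = "s1"
pairs_vector = all_pairs[all_pairs$a %in% members | all_pairs$b %in% members, ]

# List or data.frame form: explicit combinations of the first and second element.
include_pairs = list(g1 = "s1", g2 = c("s2", "s3"))
pairs_list = expand.grid(include_pairs, stringsAsFactors = FALSE)

pairs_vector
pairs_list
```

With four samples, the vector form keeps the three pairs that involve s1, while the list form keeps only the two named combinations.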
Details
For more details, see the vignette: vignette("ici-kendalltau", package = "ICIKendallTau").
The global_na argument defines which values in the data are replaced with NA for the Kendall-tau calculation. By default these are global_na = c(NA, Inf, 0). If you want to replace something other than 0, for example, you might use global_na = c(NA, Inf, -2), and all values of -2 will be replaced instead of 0.
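The replacement described above can be sketched in base R. The helper replace_global_na() is hypothetical and is not exported by ICIKendallTau; it only illustrates the effect of the two global_na settings on a small matrix.

```r
# Hypothetical helper, not part of ICIKendallTau: mark every value listed in
# global_na as NA in a numeric matrix.
replace_global_na = function(data_matrix, global_na = c(NA, Inf, 0)) {
  out = data_matrix
  # Match only the non-NA entries of global_na; existing NA values stay NA.
  targets = global_na[!is.na(global_na)]
  out[out %in% targets] = NA
  out
}

m = cbind(s1 = c(1, 0, 3, Inf), s2 = c(-2, 5, 0, 7))

# Default: 0 and Inf become NA, -2 is kept.
default_na = replace_global_na(m)

# Custom: -2 and Inf become NA, 0 is kept.
custom_na = replace_global_na(m, global_na = c(NA, Inf, -2))
```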
When check_timing = TRUE, 5 random pairwise comparisons are run on a single core to generate timings, and these are used to estimate how long the full set of comparisons will take. The timing estimates are returned as a data.frame. They will be on the low side, but they should give you a good idea of how long your data will take.
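The idea behind the timing estimate can be sketched in base R as follows. This is a rough illustration under stated assumptions, not the package's implementation: stats::cor() with method = "kendall" stands in for the ICI-Kt calculation, and the object names (all_pairs, test_pairs, timing_estimate) are hypothetical.

```r
# Hypothetical sketch of run-time estimation; stats::cor(method = "kendall")
# stands in for the ICI-Kt calculation.
set.seed(1234)
x = matrix(rnorm(200 * 20), nrow = 200, ncol = 20)

# All unique sample pairs (columns of x), one pair per column of all_pairs.
all_pairs = combn(ncol(x), 2)
n_pairs = ncol(all_pairs)

# Time 5 randomly chosen pairs on a single core.
test_pairs = all_pairs[, sample(n_pairs, 5), drop = FALSE]
elapsed = system.time({
  for (i_pair in seq_len(ncol(test_pairs))) {
    cor(x[, test_pairs[1, i_pair]], x[, test_pairs[2, i_pair]],
        method = "kendall")
  }
})[["elapsed"]]

# Extrapolate the mean time per pair to the full set of comparisons.
timing_estimate = data.frame(
  n_pairs = n_pairs,
  seconds_per_pair = elapsed / 5,
  estimated_seconds = elapsed / 5 * n_pairs
)
```

An extrapolation like this ignores any overhead outside the pairwise loop, which is one reason such an estimate tends to be on the low side.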
Returned is a list containing matrices with:
cor: scaled correlations
raw: raw Kendall-tau correlations
pvalue: p-values
taumax: the theoretical maximum Kendall-tau value possible
completeness: how complete the two samples are (i.e. how many entries are not missing in either sample)
Eventually, we plan to provide two more parameters for replacing values: feature_na for feature-specific NA values and sample_na for sample-specific NA values.
If you want to know if the missing values in your data are possibly due to left-censorship, we recommend testing that hypothesis with test_left_censorship() first.
Examples
if (FALSE) { # \dontrun{
# not run
set.seed(1234)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
matrix_1 = cbind(s1, s2)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
# s1 s2
# s1 1 1
# s2 1 1
names(r_1)
# "cor", "raw", "pvalue", "taumax", "completeness", "keep", "run_time"
s3 = s1
s3[sample(100, 50)] = NA
s4 = s2
s4[sample(100, 50)] = NA
matrix_2 = cbind(s3, s4)
r_2 = ici_kendalltau(matrix_2)
r_2$cor
# s3 s4
# s3 1.0000000 0.9944616
# s4 0.9944616 1.0000000
# using include_only
set.seed(1234)
x = t(matrix(rnorm(5000), nrow = 100, ncol = 50))
# x has 50 rows (features) and 100 columns (samples), so name by column count
colnames(x) = paste0("s", seq_len(ncol(x)))
# only calculate correlations of other columns with "s1"
include_s1 = "s1"
s1_only = ici_kendalltau(x, include_only = include_s1)
# calculate correlations that include either "s1" or "s3"
include_s1s3 = c("s1", "s3")
s1s3_only = ici_kendalltau(x, include_only = include_s1s3)
# only specify certain pairs either as a list
include_pairs = list(g1 = "s1", g2 = c("s2", "s3"))
s1_other = ici_kendalltau(x, include_only = include_pairs)
# or a data.frame
include_df = as.data.frame(list(g1 = "s1", g2 = c("s2", "s3")))
s1_df = ici_kendalltau(x, include_only = include_df)
} # }