Information-Content-Informed Kendall-Tau Correlation Methodology: Interpreting Missing Values in Metabolomics as Potentially Useful Information
Abstract
1. Introduction
2. Materials and Methods
2.1. Additional Definitions of Concordant and Discordant Pairs to Include Missingness
2.2. Considering Ties
2.3. p-Value
2.4. Theoretical Maxima
2.5. Completeness
2.6. Implementation Details
2.7. Simulated Datasets
2.8. Metabolomics Datasets from Metabolomics Workbench
- At least 100 metabolites, so that any degree of missingness would still allow for robust estimation of correlations between samples.
- At least one SSF grouping with at least 5 samples, and at least 2 SSF groupings after removing samples that may be pooled, quality control, or blanks; this provides a greater likelihood of reasonable variance estimates when calculating the F-statistics across SSFs after removing potential outlier samples.
- A maximum metabolite feature abundance above 20, to exclude log-transformed values and low-dynamic-range datasets.
- The ability to calculate a correlation between the median rank of a metabolite feature and the number of samples the metabolite was missing within a factor, as this indicated a minimum number of missing values in each SSF.
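The selection criteria above can be sketched as a simple predicate over per-dataset summaries. This is only an illustration, not the authors' code: the `DatasetSummary` class, its field names, the `passes_filters` function, and in particular whether each threshold is inclusive or exclusive are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetSummary:
    """Hypothetical per-dataset summary; all field names are assumptions."""
    n_metabolites: int
    # Samples per SSF grouping, after removing pooled, QC, and blank samples.
    ssf_group_sizes: List[int]
    max_abundance: float
    # Correlation of median rank vs. number of missing values; None if it
    # could not be calculated (e.g., too few missing values).
    rank_missing_correlation: Optional[float]


def passes_filters(ds: DatasetSummary,
                   min_metabolites: int = 100,
                   min_group_size: int = 5,
                   min_groups: int = 2,
                   abundance_cutoff: float = 20.0) -> bool:
    """Return True if the dataset meets all four criteria.

    The comparison operators (>= vs. >) are assumptions of this sketch.
    """
    has_large_group = any(n >= min_group_size for n in ds.ssf_group_sizes)
    return (
        ds.n_metabolites >= min_metabolites
        and has_large_group
        and len(ds.ssf_group_sizes) >= min_groups
        and ds.max_abundance > abundance_cutoff
        and ds.rank_missing_correlation is not None
    )


kept = DatasetSummary(250, [6, 8], 1.5e6, -0.4)
log_scaled = DatasetSummary(250, [6, 8], 18.2, -0.4)
print(passes_filters(kept), passes_filters(log_scaled))
```

Keeping the thresholds as keyword arguments makes it easy to check how sensitive the final dataset count is to each cutoff.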
2.9. Number of Missing Values and Median Rank
2.10. Binomial Test for Left-Censorship
2.11. Correlation Methods
2.12. Outlier Detection
2.13. Feature Annotations
2.14. Feature–Feature Networks and Partitioning
- The total sum of edge weights for all edges with features that are annotated to one or more of the annotations (annotated).
- The within annotation edge weight sum, where both start and end nodes are annotated to the same annotation.
- The outer annotation edge weight sum, where the start node is part of the annotated set, and the end node is annotated to one of the other annotations.
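The three sums listed above can be illustrated with a short sketch. It rests on assumptions that the list does not state: an edge counts toward the annotated total only when both of its nodes carry at least one annotation, "within" means the two nodes share at least one annotation, and "outer" means both are annotated but share none. The function name `annotation_edge_sums` and the input layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, float]  # (start feature, end feature, edge weight)


def annotation_edge_sums(edges: Iterable[Edge],
                         annotations: Dict[str, Set[str]]) -> Dict[str, float]:
    """Sum edge weights for annotated, within-annotation, and outer edges."""
    annotated = within = outer = 0.0
    for start, end, weight in edges:
        start_ann = annotations.get(start, set())
        end_ann = annotations.get(end, set())
        # Assumption: skip edges where either feature has no annotation.
        if not start_ann or not end_ann:
            continue
        annotated += weight
        if start_ann & end_ann:
            # Both nodes share at least one annotation.
            within += weight
        else:
            # Both nodes are annotated, but to different annotations.
            outer += weight
    return {"annotated": annotated, "within": within, "outer": outer}


edges = [("a", "b", 0.5), ("a", "c", 0.25), ("c", "d", 1.0)]
annotations = {"a": {"pathway1"}, "b": {"pathway1"}, "c": {"pathway2"}}
sums = annotation_edge_sums(edges, annotations)
print(sums)
```

Under these assumptions the within and outer sums add up to the annotated total, so the fraction `within / annotated` gives one possible summary of how well a network keeps same-annotation features together.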
2.15. Changes in Correlation Due to Changes in Dynamic Range and Imputation
2.16. Performance and Efficiency Evaluations
2.17. Data Processing
3. Results
3.1. Datasets
3.2. Left-Censoring as a Cause for Missingness
3.3. Comparison to Other Correlation Measures
3.4. Effect of Left-Censoring vs. Random Missing Data
3.5. Differences in Dynamic Range and Correlation
3.6. Utility for Metabolomics Datasets
3.7. Computational Performance and Efficiency
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ICI-Kt | Information-content-informed Kendall-tau |
| EDA | Exploratory data analysis |
| PCA | Principal component analysis |
| MW | The Metabolomics Workbench |
| SSF | Subject sample factor |
| LOD | Limit of detection |
| NMR | Nuclear magnetic resonance |
| MS | Mass spectrometry |






| Dataset | Distribution | N | Mean | SD | Range |
|---|---|---|---|---|---|
| perfect | log-normal | 1000 | 1.0 | 0.5 | |
| noise-1 | uniform | 1000 | | | −0.5–0.5 |
| outlier | log-normal | 5 | 1.2 | 0.1 | |
| realistic | log-normal | 1000 | 1.0 | 0.5 | |
| noise-2 | normal | 1000 | 0.0 | 0.2 | |
| lod | log-normal | 1000 | 1.0 | 0.5 | |
| noise-3 | normal | 1000 | 0.0 | 0.2 | |

| Method | Mean | SD | Median | MAD |
|---|---|---|---|---|
| icikt | 0.457 | 0.338 | 0.454 | 0.482 |
| icikt_complete | 0.457 | 0.336 | 0.450 | 0.480 |
| pearson_log1p | 0.455 | 0.337 | 0.450 | 0.487 |
| kt_base | 0.455 | 0.337 | 0.460 | 0.478 |
| pearson_log | 0.453 | 0.337 | 0.450 | 0.475 |
| pearson_base | 0.450 | 0.338 | 0.442 | 0.483 |
| pearson_base_nozero | 0.448 | 0.338 | 0.441 | 0.480 |
| original | 0.443 | 0.338 | 0.436 | 0.484 |

| Comparison | Difference | p-Value | p-Adjusted |
|---|---|---|---|
| icikt v original | 0.0137 | 7.9 × 10⁻¹³ | 3.5 × 10⁻¹¹ |
| icikt_complete v original | 0.013 | 1.5 × 10⁻¹¹ | 6.9 × 10⁻¹⁰ |
| kt_base v original | 0.0104 | 1.1 × 10⁻⁹ | 4.9 × 10⁻⁸ |
| pearson_log1p v original | 0.011 | 2.7 × 10⁻⁹ | 1.2 × 10⁻⁷ |
| pearson_log v original | 0.00964 | 4.1 × 10⁻⁹ | 1.8 × 10⁻⁷ |
| icikt v pearson_base | 0.00819 | 3.8 × 10⁻⁸ | 1.7 × 10⁻⁶ |
| icikt v pearson_base_nozero | 0.00949 | 5.5 × 10⁻⁸ | 2.5 × 10⁻⁶ |
| icikt_complete v pearson_base_nozero | 0.00888 | 1.5 × 10⁻⁶ | 6.8 × 10⁻⁵ |
| icikt_complete v pearson_base | 0.00758 | 3.6 × 10⁻⁶ | 1.6 × 10⁻⁴ |
| pearson_base_nozero v kt_base | −0.00625 | 2.7 × 10⁻⁵ | 1.2 × 10⁻³ |
| pearson_base_nozero v pearson_log1p | −0.00687 | 1.0 × 10⁻⁴ | 4.7 × 10⁻³ |
| pearson_base_nozero v pearson_log | −0.00547 | 2.0 × 10⁻⁴ | 9.0 × 10⁻³ |
| pearson_base v kt_base | −0.00495 | 2.3 × 10⁻⁴ | 1.0 × 10⁻² |
| pearson_base v original | 0.00547 | 2.6 × 10⁻⁴ | 1.2 × 10⁻² |
| pearson_base v pearson_log1p | −0.00558 | 2.8 × 10⁻⁴ | 1.3 × 10⁻² |

| Comparison | Difference | p-Value | p-Adjusted |
|---|---|---|---|
| icikt_complete v pearson_log | 2.3 | 2.7 × 10⁻¹² | 5.7 × 10⁻¹¹ |
| icikt v pearson_log | 2.35 | 3.9 × 10⁻¹² | 8.2 × 10⁻¹¹ |
| icikt v pearson_base_nozero | 2.51 | 4.1 × 10⁻¹² | 8.6 × 10⁻¹¹ |
| icikt_complete v pearson_base_nozero | 2.45 | 7.5 × 10⁻¹² | 1.6 × 10⁻¹⁰ |
| icikt v kt_base | 1.7 | 1.2 × 10⁻¹⁰ | 2.6 × 10⁻⁹ |
| icikt_complete v kt_base | 1.63 | 3.7 × 10⁻¹⁰ | 7.7 × 10⁻⁹ |
| icikt v pearson_log1p | 1.15 | 8.6 × 10⁻⁹ | 1.8 × 10⁻⁷ |
| icikt_complete v pearson_log1p | 1.08 | 9.2 × 10⁻⁹ | 1.9 × 10⁻⁷ |
| icikt v pearson_base | 1.05 | 6.6 × 10⁻⁸ | 1.4 × 10⁻⁶ |
| icikt_complete v pearson_base | 0.976 | 6.8 × 10⁻⁸ | 1.4 × 10⁻⁶ |
| pearson_base v pearson_base_nozero | 1.41 | 7.7 × 10⁻⁷ | 1.6 × 10⁻⁵ |
| pearson_base_nozero v pearson_log1p | −1.33 | 1.7 × 10⁻⁶ | 3.7 × 10⁻⁵ |
| pearson_base v pearson_log | 1.31 | 3.1 × 10⁻⁵ | 6.6 × 10⁻⁴ |
| pearson_log1p v pearson_log | 1.18 | 3.4 × 10⁻⁴ | 7.1 × 10⁻³ |
| pearson_base v kt_base | 0.62 | 6.5 × 10⁻⁴ | 1.4 × 10⁻² |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Flight, R.M.; Bhatt, P.S.; Moseley, H.N.B. Information-Content-Informed Kendall-Tau Correlation Methodology: Interpreting Missing Values in Metabolomics as Potentially Useful Information. Metabolites 2026, 16, 245. https://doi.org/10.3390/metabo16040245

