Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices
Abstract
1. Introduction
NCD Matrices and Their Significance
2. Methods
2.1. Datasets
2.2. NCD as an Alignment-Free Alternative
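As a minimal sketch of how an NCD value can be computed, the snippet below uses the standard definition from Li et al. (2004), NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed length. Python's built-in `lzma` module is assumed here as a stand-in for the 7-Zip compressor listed in the references. The helper names (`compressed_size`, `ncd`, `random_dna`, `mutate`) and the synthetic sequences are hypothetical and are not taken from the authors' pipeline.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Length in bytes of the LZMA-compressed input (stand-in for C(.))."""
    return len(lzma.compress(data, preset=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(n: int, seed: int) -> bytes:
    """Hypothetical helper: a reproducible random DNA string of length n."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode()


def mutate(seq: bytes, k: int, seed: int) -> bytes:
    """Hypothetical helper: substitute k distinct positions with another base."""
    rng = random.Random(seed)
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), k):
        out[pos] = rng.choice([base for base in b"ACGT" if base != out[pos]])
    return bytes(out)


a = random_dna(5000, seed=1)   # reference sequence
b = mutate(a, 50, seed=2)      # close relative of a (50 substitutions)
c = random_dna(5000, seed=3)   # unrelated sequence
print(f"NCD(a, b) = {ncd(a, b):.3f}")
print(f"NCD(a, c) = {ncd(a, c):.3f}")
```

Because no alignment is involved, the same function applies to sequences of different lengths; the result depends on the chosen compressor, so values from `lzma` need not match those from another tool.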
2.3. Models of DNA Evolution
2.4. MLE for Transition and Transversion Rates
2.5. JC69 (Jukes and Cantor 1969)
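For orientation, the standard JC69 correction maps an observed proportion of differing sites p to an estimated number of substitutions per site, d = −(3/4) ln(1 − (4/3) p). The sketch below implements only that textbook formula. Passing an NCD value in place of p is an assumption made for illustration and is not confirmed here as the authors' exact procedure; the function name `jc69_distance` is hypothetical.

```python
import math


def jc69_distance(p: float) -> float:
    """JC69-corrected distance for a proportion p of differing sites.

    The formula is defined only for 0 <= p < 0.75; at p = 0.75 the
    logarithm's argument reaches zero and the distance diverges.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Illustration: treating a dissimilarity of 0.30 as p (an assumption).
print(f"{jc69_distance(0.30):.4f}")
```

The domain check matters when the input is not a true site-difference proportion: any value of 0.75 or more has no JC69 distance under this formula.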
2.6. K80 (Kimura 1980)
2.7. K81 (Kimura 1981)
2.8. T92 (Tamura 1992)
2.9. TN93 (Tamura and Nei 1993)
3. Results
3.1. Rate of Elementary Quartets (REQ)
3.2. Robinson Foulds and Normalized Robinson Foulds (RF and nRF)
3.3. Euclidean Distances
4. Discussion
4.1. Maximum Likelihood Estimation
4.2. Models Are Non-Applicable to NCD Matrices
4.3. NCD Accuracy
4.4. Novel Evaluation Metrics
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| NCD | Normalized Compression Distance |
| NID | Normalized Information Distance |
| RF | Robinson Foulds |
| nRF | Normalized Robinson Foulds |
| RF-Norm | Normalized Robinson Foulds |
| REQ | Rate of Elementary Quartets |
| MLE | Maximum Likelihood Estimation |
| MSA | Multiple Sequence Alignment |
Appendix A
| Species | NCBI Accession IDs |
|---|---|
| Ailuropoda_melanoleuca | EF196663.1 |
| Ailurus_fulgens_styani | AB291074.1 |
| Anoura_caudifer | NC_022420.1 |
| Anourosorex_squamipes | NC_024563.1 |
| Antilope_cervicapra | NC_012098.1 |
| Artibeus_jamaicensis | AF061340.1 |
| Artibeus_lituratus | NC_016871.1 |
| Balaena_mysticetus | AP006472.1 |
| Blarina_brevicauda | NC_027902.1 |
| Bos_mutus | DQ124389.1 |
| Bos_taurus | V00654.1 |
| Bubalus_bubalis | AY488491.1 |
| Camelus_bactrianus | AJ409393.1 |
| Camelus_dromedarius | NC_009849.1 |
| Canis_lupus_familiaris | U96639.2 |
| Capra_hircus | GU295658.1 |
| Castor_canadensis | FJ959094.1 |
| Cavia_porcellus | AJ222767.1 |
| Cebus_albifrons | NC_021952.1 |
| Ceratotherium_simum | Y07726.1 |
| Chinchilla_lanigera | EF567130.1 |
| Cricetulus_griseus | NC_007936.1 |
| Crocidura_tanakae | NC_035941.1 |
| Cynopterus_sphinx | EU289411.1 |
| Dasypus_novemcinctus | Y11832.1 |
| Delphinapterus_leucas | NC_005279.1 |
| Dipodomys_ordii | NC_005314.1 |
| Elephas_maximus | DQ316068.1 |
| Eptesicus_fuscus | NC_029342.1 |
| Equus_asinus | NC_001788.1 |
| Equus_caballus | NC_001640.1 |
| Erinaceus_europaeus | X88898.1 |
| Eubalaena_japonica | AP006472.1 |
| Felis_catus | U20753.1 |
| Gorilla_gorilla | D38114.1 |
| Halichoerus_grypus | NC_001602.1 |
| Hippopotamus_amphibius | AP003425.1 |
| Homo_sapiens | NC_012920.1 |
| Hylobates_lar | NC_002082.1 |
| Loxodonta_africana | X56292.1 |
| Macaca_fascicularis | X79547.1 |
| Macaca_mulatta | AY612638.1 |
| Microtus_arvalis | EF489115.1 |
| Monodelphis_domestica | NC_006299.1 |
| Mus_musculus | V00711.1 |
| Mustela_putorius_furo | AJ544416.1 |
| Myotis_lucifugus | NC_006897.1 |
| Nomascus_leucogenys | NC_013993.1 |
| Nycticebus_coucang | NC_002765.1 |
| Ochotona_curzoniae | NC_011029.1 |
| Odocoileus_virginianus | NC_008414.1 |
| Orcinus_orca | Y13856.1 |
| Oryctolagus_cuniculus | AJ001588.1 |
| Ovis_aries | AF010406.1 |
| Pan_troglodytes | D38116.1 |
| Panthera_leo | NC_028302.1 |
| Panthera_tigris | EF551002.1 |
| Papio_hamadryas | Y18001.1 |
| Phocoena_phocoena | NC_005280.1 |
| Physeter_catodon | X72204.1 |
| Pongo_pygmaeus | X97707.1 |
| Pteropus_vampyrus | NC_009063.1 |
| Rattus_norvegicus | X14848.1 |
| Saimiri_sciureus | NC_025235.1 |
| Sus_scrofa | AJ002189.1 |
| Tarsius_syrichta | Y18001.1 |
| Trichechus_manatus | NC_005279.1 |
| Tursiops_truncatus | X72204.1 |
| Ursus_arctos | AF303110.1 |
| Ursus_maritimus | AJ428577.1 |
| Vicugna_pacos | EF397824.1 |
| Vulpes_vulpes_montana | KF387633.1 |

| Lines | NCBI Accession IDs |
|---|---|
| AUZE-A-5 | GCA_946402385.1 |
| FERR-A-8 | GCA_946403025.1 |
| BELC-C-10 | GCA_946403525.1 |
| ANGE-B-10 | GCA_946404075.1 |
| BARA-C-5 | GCA_946404385.1 |
| IP-San-9 | GCA_946404515.1 |
| BANI-C-1 | GCA_946404995.1 |
| IP-Met-6 | GCA_946405265.1 |
| FERR-A-12 | GCA_946405525.1 |
| IP-Evs-12 | GCA_946405545.1 |
| IP-Med-0 | GCA_946405725.1 |
| MONTM-B-16 | GCA_946405905.1 |
| IP-Alo-0.9506 | GCA_946406325.1 |
| Ler-0.7213 | GCA_946406525.1 |
| BANI-C-12 | GCA_946406595.1 |
| IP-Hom-4.9546 | GCA_946406625.1 |
| BARA-C-3 | GCA_946406735.1 |
| Rabacal-1.22005 | GCA_946406895.1 |
| CAMA-C-2 | GCA_946406975.1 |
| BARC-A-17 | GCA_946407145.1 |
| IP-Lor-16 | GCA_946407795.1 |
| SALE-A-10 | GCA_946408365.1 |
| CAMA-C-9 | GCA_946408575.1 |
| BELC-C-12 | GCA_946408975.1 |
| BROU-A-10 | GCA_946409395.1 |
| SALE-A-17 | GCA_946409815.1 |
| Tanz-1.10024 | GCA_946409825.1 |
| BARC-A-12 | GCA_946410165.1 |
| IP-Alo-19 | GCA_946410485.1 |
| IP-Cas-0.9831 | GCA_946411375.1 |
| IP-Hom-0 | GCA_946411425.1 |
| MERE-A-13 | GCA_946411655.1 |
| IP-Mos-9 | GCA_946411805.1 |
| IP-Mos-5 | GCA_946411885.1 |
| IP-Hum-4 | GCA_946412005.1 |
| GAIL-B-11 | GCA_946412065.1 |
| IP-Mdc-14 | GCA_946412225.1 |
| IP-Hum-2.9549 | GCA_946413285.1 |
| IP-Med-3 | GCA_946413305.1 |
| PREI-A-14 | GCA_946413405.1 |
| IP-Sln-22 | GCA_946413935.1 |
| Cvi-0.6911 | GCA_946414125.1 |
| IP-Cat-0.9832 | GCA_946414305.1 |
| ANGE-B-2 | GCA_946415005.1 |
| IP-Evs-0.9845 | GCA_946415165.1 |
| IP-Cas-6 | GCA_946415445.1 |
| MONTM-B-7 | GCA_946415625.1 |
| LACR-C-14 | GCA_946415655.1 |
| Ey15-2.9994 | GCA_946499665.1 |
| Col-0.6909 | GCA_946499705.1 |
| Pent-46.2212 | GCA_964057255.1 |
| LI-EF-011.685 | GCA_964057265.1 |
| 14INRCT07 | GCA_964057275.1 |
| MONF-A-1.22045 | GCA_965117475.1 |
| RAYR-A-17.22055 | GCA_965117485.1 |
| LANT-B-1.22039 | GCA_965117495.1 |
| LUZE-A-14.22042 | GCA_965117505.1 |
| MONT-B-14.22048 | GCA_965117515.1 |
| NAUV-B-7.22052 | GCA_965117525.1 |
| LACR-C-4.22038 | GCA_965117535.1 |
| BELL-A-1.22021 | GCA_965117545.1 |
| PREI-A-9.22054 | GCA_965117555.1 |
| MERE-A-7.22044 | GCA_965117565.1 |
| LANT-B-10.22040 | GCA_965117575.1 |
| LUZE-A-12.22041 | GCA_965117585.1 |
| MONF-A-14.22046 | GCA_965117595.1 |
| RAYR-A-9.22056 | GCA_965117605.1 |
| JUZE-A-3.22036 | GCA_965117615.1 |
| BOULO-A-16.22024 | GCA_965117625.1 |
| BELL-A-7.22022 | GCA_965117635.1 |
| BROU-A-2.22026 | GCA_965117645.1 |
| NAUV-B-14.22051 | GCA_965117655.1 |
| JUZE-A-2.22035 | GCA_965117665.1 |
| BOULO-A-1.22023 | GCA_965117675.1 |
| CARL-A-16.22030 | GCA_965117685.1 |
| CARL-A-10.22029 | GCA_965117695.1 |
| MONT-B-12.22047 | GCA_965117705.1 |
| GAIL-B-9.22034 | GCA_965117715.1 |
| AUZE-A-11.22011 | GCA_965117725.1 |
References
- Jarvis, E.D.; Mirarab, S.; Aberer, A.J.; Li, B.; Houde, P.; Li, C.; Ho, S.Y.; Faircloth, B.C.; Nabholz, B.; Howard, J.T.; et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 2014, 346, 1320–1331.
- Liu, K.; Linder, C.R.; Warnow, T. RAxML and FastTree: Comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 2011, 6, e27731.
- Chatzou, M.; Magis, C.; Chang, J.M.; Kemena, C.; Bussotti, G.; Erb, I.; Notredame, C. Multiple sequence alignment modeling: Methods and applications. Briefings Bioinform. 2015, 17, 1009–1023.
- Suchard, M.A.; Weiss, R.E.; Sinsheimer, J.S. Bayesian Selection of Continuous-Time Markov Chain Evolutionary Models. Mol. Biol. Evol. 2001, 18, 1001–1013.
- Daugelaite, J.; O’Driscoll, A.; Sleator, R.D. An overview of multiple sequence alignments and cloud computing in bioinformatics. Int. Sch. Res. Not. 2013, 2013, 615630.
- Izquierdo-Carrasco, F.; Gagneur, J.; Stamatakis, A. Trading memory for running time in phylogenetic likelihood computations. Heidelb. Inst. Theor. Stud. 2012, 86–95.
- Claros, M.G.; Bautista, R.; Guerrero-Fernández, D.; Benzerki, H.; Seoane, P.; Fernández-Pozo, N. Why assembling plant genome sequences is so challenging. Biology 2012, 1, 439–459.
- Ozan, Ş. DNA Sequence Classification with Compressors. arXiv 2024, arXiv:2401.14025.
- Wilson, D.; Rogers, J. Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting. J. Comput. Biol. 2023, 30, 250–260.
- Rogers, J.; Wilson, D. Comparing phylogeny by compression to phylogeny by NJp and Bayesian Inference. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2195–2202.
- Rogers, D.W.J. PhyloTools: A Software Package for Analyzing Phylogenetic Trees; GitHub Repository: San Francisco, CA, USA, 2021.
- Li, M.; Vitányi, P.M. An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2008.
- Li, M.; Chen, X.; Li, X.; Ma, B.; Vitányi, P.M.B. The similarity metric. IEEE Trans. Inf. Theory 2004, 50, 3250–3264.
- Pavlov, I. 7-Zip, Version 24.08 or Later; [Computer Software]; Moscow, Russia. 2025. Available online: https://www.7-zip.org/ (accessed on 2 August 2025).
- Moreno, D. NCD-Corrections: Correction Tools for Normalized Compression Distance (NCD); GitHub Repository: San Francisco, CA, USA, 2025.
- Astrom, K. Maximum Likelihood and Prediction Error Methods. IFAC Proc. Vol. 1979, 12, 551–574.
- Stoltzfus, A.; Norris, R.W. On the Causes of Evolutionary Transition: Transversion Bias. Mol. Biol. Evol. 2016, 33, 595–602.
- Jukes, T.; Cantor, C. Evolution of Protein Molecules. In Mammalian Protein Metabolism; Munro, H., Ed.; Academic Press: New York, NY, USA, 1969; pp. 21–132.
- Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980, 16, 111–120.
- Kimura, M. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 1981, 78, 454–458.
- Tamura, K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 1992, 9, 678–687.
- Tamura, K.; Nei, M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993, 10, 512–526.
- Guénoche, A.; Garreta, H. Can we have confidence in a tree representation? In Proceedings of the International Conference on Biology, Informatics, and Mathematics, Montpellier, France, 3–5 May 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 45–56.
- Robinson, D.F.; Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 1981, 53, 131–147.
- de Vienne, D.M.; Aguileta, G.; Ollier, S. Euclidean nature of phylogenetic distance matrices. Syst. Biol. 2011, 60, 826–832.
- Pinho, A.; Pratas, D. MFCompress: A compression tool for FASTA and multi-FASTA data. Bioinformatics 2014, 30, 117–118.
- Consens, M.; Dufault, C.; Wainberg, M.; Forster, D.; Karimzadeh, M.; Goodarzi, H.; Theis, F.J.; Moses, A.; Wang, B. Transformers and genome language models. Nat. Mach. Intell. 2025, 7, 346–362.
- Wen, J.; Chan, R.H.; Yau, S.C.; He, R.L.; Yau, S.S. K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene 2014, 546, 25–34.
- Moi, D.; Kilchoer, L.; Aguilar, P.S.; Dessimoz, C. Scalable phylogenetic profiling using MinHash uncovers likely eukaryotic sexual reproduction genes. PLoS Comput. Biol. 2020, 16, 1–21.

| Method | Mammal: # Branches | Mammal: Avg REQ | Tomato: # Branches | Tomato: Avg REQ | Arabidopsis: # Branches | Arabidopsis: Avg REQ |
|---|---|---|---|---|---|---|
| NCD | 150 | 0.79 | 11 | 0.62 | 76 | 0.56 |
| JC69 | 150 | 0.53 | 11 | 0.57 | 76 | 0.51 |
| K80 | 150 | 0.78 | 11 | 0.52 | 76 | 0.55 |
| K81 | 150 | 0.78 | 11 | 0.45 | 76 | 0.55 |
| T92 | 150 | 0.78 | 11 | 0.49 | 76 | 0.55 |
| TN93 | 150 | 0.78 | 11 | 0.49 | 76 | 0.55 |

| Comparison | Mammal: RF | Mammal: RF-Norm | Tomato: RF | Tomato: RF-Norm | Arabidopsis: RF | Arabidopsis: RF-Norm |
|---|---|---|---|---|---|---|
| NCD vs. JC69 | 21.90 | 0.47 | 1.36 | 0.36 | 9.94 | 0.48 |
| NCD vs. K80 | 8.41 | 0.12 | 1.46 | 0.42 | 5.87 | 0.19 |
| NCD vs. K81 | 8.41 | 0.12 | 1.46 | 0.42 | 5.87 | 0.19 |
| NCD vs. T92 | 8.41 | 0.12 | 1.46 | 0.42 | 5.87 | 0.19 |
| NCD vs. TN93 | 8.41 | 0.12 | 1.46 | 0.42 | 5.87 | 0.19 |

| Comparison | Mammal | Tomato | Arabidopsis |
|---|---|---|---|
| NCD vs. JC69 | 0.0014 | 0.018 | 0.0017 |
| NCD vs. K80 | 0.0011 | 0.020 | 0.0020 |
| NCD vs. K81 | 0.0011 | 0.020 | 0.0020 |
| NCD vs. T92 | 0.0011 | 0.020 | 0.0020 |
| NCD vs. TN93 | 0.0011 | 0.020 | 0.0020 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Moreno, D.; Hu, H.; Ramaraj, T.; Rogers, J. Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices. Mathematics 2025, 13, 3534. https://doi.org/10.3390/math13213534

