Next Article in Journal
Fuzzy MCDM Methodology Application in Analysis of Annual Operational Efficiency in Passenger and Freight Air Transport
Previous Article in Journal
Research on Path Planning for Mobile Robot Using the Enhanced Artificial Lemming Algorithm
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices

School of Computing, Jarvis College of Computing and Digital Media, DePaul University, Chicago, IL 60614, USA
*
Authors to whom correspondence should be addressed.
Mathematics 2025, 13(21), 3534; https://doi.org/10.3390/math13213534
Submission received: 21 September 2025 / Revised: 22 October 2025 / Accepted: 26 October 2025 / Published: 4 November 2025
(This article belongs to the Special Issue Heuristic Algorithms in Computational Biology)

Abstract

Recent developments of Normalized Compressed Distance (NCD) matrices show potential to become a widely used method to develop phylogenetic trees among a group of organisms. However, such NCD matrices lack the biological and evolutionary context that is important to the development of phylogenetic trees. This study applied mathematical models of DNA evolution to NCD matrices with the goal of applying evolutionary context to improve the results of NCD methods and thus create better phylogenetic trees. Mammalian mitochondrial, Arabidopsis thaliana, and Tomato Solanum lycopersicum genomes were used in this study as benchmarks and underwent validation techniques to analyze whether the NCD matrix models were of better quality than their original NCD. The mathematical models applied were not sufficient in implementing a biological context towards NCD compression algorithms for phylogenetic trees, resulting in no successful improvement of them. However, this opens opportunities to explore ways to integrate biological context internally to NCD compression algorithms, rather than external methods of correcting NCD matrices after compression has been carried out.

1. Introduction

NCD Matrices and Their Significance

Phylogenomics utilizes genomic data from a group of organisms in order to describe the evolutionary relationships between all organisms in the group. These relationships are of significant importance, as they can be used to develop an understanding of the origin of diseases, to develop new crops and the tree of life, and to even classify organisms based on a common ancestor. Phylogenetic trees are the main source of analyzing such relationships and events. They are a visual diagram describing the evolutionary history of a group of organisms which must be computed via methods that utilize FASTA files [1,2].
Traditional methods of building phylogenetic trees involve the use of Multiple Sequence Alignment (MSA) methods [3]. These methods take a group of sequences such as mitochondrial DNA, rRNA sequences, and whole genomes in FASTA format, align every sequence to one another, and calculate their similarity to produce a phylogenetic tree [4]. These are often seen as the most reliable way to calculate phylogenetic trees. However, this makes aligning complex genomes and whole genomes impossible due to their very large size, the main limitation of using MSA-based methods [5,6]. One specific example of this is Eukaryotic genomes, which cannot be sequenced entirely due to their large size and complex genetic structure [7].
Due to this limitation being easily exploited by larger, more complex genomic datasets, there has been a movement of exploration towards alignment-free methods to compute phylogenomic trees.
Compression algorithms have risen towards the possibility of processing very large genomic datasets, such as whole genomes because they are computationally inexpensive when compared to other standard methods [8]. One of them is Normalized Compression Distance (NCD), a compression algorithm that is used to find the similarity between two pieces of texts such as emails and paragraphs [9]. When applied to FASTA files, they can find the similarity between two or more biological genomes and discover how similar they are through text pattern finding [9,10].
A recent study showed that Normalized Compression Distance (NCD) matrices allow for an alternative way to create phylogenetic trees. NCD matrices are produced via Phylotools (version 1.0.0), a software that uses the 7zip compression algorithm to find patterns in DNA sequences to produce a distance matrix that describes the pairwise differences among a group of organisms [9,10]. With promising results, NCD seems to be an effective and efficient approach to creating accurate phylogenetic trees without needing alignment.
However, NCD lacks evolutionary context when creating NCD matrices [9]. The 7zip compressor built into Phylotools has no knowledge of how likely certain nucleotides are to change over time within a species and how likely they are to change from organism to organism [9]. NCD and the 7zip compressor tool simply recognize patterns from different genomic sequences to generate a number describing how similar the sequences of the organisms are. This may create a bias in the creation of the NCD matrix, leading to results that may not capture the true relationship between a group of organisms.
In an attempt to contextualize the NCD, we applied models of DNA evolution to these NCD matrices produced by Phylotools. The goal of this study was to determine whether models of DNA evolution would produce more accurate NCD matrices and thus more accurate phylogenetic trees. This is a novel application of such mathematical models, since they were originally created to compute pairwise distances via alignment-based methods. However, in this study, the models were used as a way to integrate biological context into an NCD matrix to correct potentially inaccurate pairwise distance values and mitigate potential errors from the NCD algorithm.

2. Methods

2.1. Datasets

The genomic data used in this study consists of 3 different groups of organisms: 153 Mammalian mitochondrial sequences (Table A1), 14 tomato (Solanum lycopersicum) sequences (refer to the Data Availability Statement Section to access tomato genomes used in this study), and 79 Arabidopsis thaliana sequences (Table A2). All of the genomic data were in FASTA format, which were then later converted to non-aligned Normalized Compression Distance (NCD) matrices using Phylotools [11]. The datasets had the following GC%: mitochondria—38%, tomato—36%, Arabidopsis thaliana—36%. The justification for the datasets selected for this work was as follows: To evaluate the performance of the NCD method for phylogenetic inference, we selected three diverse genomic datasets spanning distinct evolutionary, structural, and biological contexts, 153 Mammalian mitochondrial genomes, 79 Arabidopsis thaliana genomes, and 14 tomato (Solanum lycopersicum) genomes. The Mammalian mitochondrial genomes dataset represents small circular genomes (16 to 20 kb) with well-established evolutionary relationships, providing a benchmark for assessing the NCD method. The A. thaliana dataset is characterized by a medium-sized nuclear genome (125–130 Mb) with high-quality chromosome-level assemblies and extensive population-level diversity data. This enables evaluation of NCD’s resolution in detecting intraspecific and subspecies-level variation. Lastly, in contrast, the tomato dataset, with a large and more complex nuclear genome (750–800 Mb), also completes chromosome-level assemblies. This dataset will help access NCD’s ability to handle higher genomic complexity and evolutionary distances. Collectively, these datasets span a continuum of genome sizes, complexities, and divergence scales, providing a comprehensive and representative framework to assess the robustness and adaptability of the NCD-based phylogenetic approach across biological domains.

2.2. NCD as an Alignment-Free Alternative

We measure the distance between two sequences using the Normalized Compression Distance (NCD) technique. NCD is an algorithmic approximation of the Normalized Information Distance (NID) measure. Informally, a measure of the similarity between two strings, over a finite alphabet, is how much their concatenation can be losslessly compressed relative to how much each string can be individually compressed. The idea is that if similar strings share a structure, a compressor can be used to minimize the number of bits needed to represent the concatenated strings.
The theoretical minimum compressed size of a string s is its Kolmogorov complexity [12], which is the length of the shortest program p such that, when run, produces s as output. We denote this size K ( s ) .
Based on this, Li et al. [13] defined the Normalized Information Distance (NID):
N I D ( s 0 , s 1 ) = K ( s 0 s 1 ) min ( K ( s 0 ) , K ( s 1 ) ) max ( K ( s 0 ) , K ( s 1 ) )
Because the Kolmogorov complexity of a string is uncomputable [12], the NID cannot be computed exactly. But it can be approximated from above by selecting a compression algorithm A and replacing K ( s ) by C A ( s ) , which is the size of the bit string resulting from applying A to s. This new distance is called the Normalized Compression Distance (NCD):
N C D ( s 0 , s 1 ) = C ( s 0 s 1 ) min ( C ( s 0 ) , C ( s 1 ) ) max ( C ( s 0 ) , C ( s 1 ) )
Note that the NCD is always non-negative and, in general, less than or equal to 1 and greater than 0. A small NCD for two strings indicates that they are similar; values of less than 0.5 indicate greater similarity. We assume lossless compression, so, for example, one may use a compressor based on the Huffman algorithm, the various Lempel–Ziv algorithms, or a compressor designed specifically for a particular kind of data. One advantage is that a pair of sequences does not have to be aligned before determining their distance.
Wilson and Rogers [9] present a proof of concept for utilizing NCDs as an alternative approach to phylogeny estimation. In their study, they compare NCDs with traditional methods such as Maximum Likelihood (ML) and Bayesian inference using biological and simulated data. A key aspect of their analysis focuses on evaluating NCDs’ performance in the presence of incomplete lineage sorting (ILS), a common challenge in phylogenetic studies.
The results indicate that NCD performs comparably to traditional methods, particularly in scenarios with high levels of ILS, where methods such as ML and Bayesian inference often encounter difficulties. Wilson and Rogers highlight the computational efficiency of NCD, as it bypasses the need for sequence alignment, which can be computationally expensive with large datasets. This demonstrates the robustness of NCD in handling large and complex genomic data, offering a scalable and alignment-free alternative to traditional phylogeny estimation methods [9].
Phylotools utilized the 7zip compressor since it is an open-source command line tool that allowed for an effective compression ratio that is approximately 2–10% more effective than other compressors such as PKZip and WinZip [14]. Specifically, when applied to Phylogenetic data, 7zip outperformed other compressors such as MFC3 and PPMZ9 [9].
In this study, 153 Mammalian mitochondrial sequences, 14 Tomato (Solanum lycopersicum) genomes, and 79 Arabidopsis thaliana genomes were combined into one FASTA file, one for each species, for a total of three FASTA files as shown in Figure 1. Each FASTA file was used as input into Phylotools to compute a non-aligned NCD matrix representing the pairwise, evolutionary differences among all the species within the group of organisms mentioned earlier. NCD-based distances operate outside traditional model-based frameworks.

2.3. Models of DNA Evolution

The NCD matrix form Phylotools was then used as input for the NCD Corrections Program, a program that allowed for the application of five different models of DNA evolution to mathematically contextualize the evolutionary pairwise distances between each species [15]. Every model was programmed as a mathematical formula and applied to a NumPy matrix for faster and more efficient implementation of the models.
Most of the models used in this study are typically calculated via a rate matrix that allows for the representation of nucleotide rate changes between species. However, since the focus of this study was to apply these models without alignment methods, the models were applied to the NCD matrix produced by Phylotools. This was performed by taking derivations of the models that used pairwise distances instead of rate matrices. These models were chosen because they are time-continuous Markov Chain models, meaning that the next state of an environment is based on the current state [4]. This is the behavior that best mimics how genetic mutations occur throughout time.

2.4. MLE for Transition and Transversion Rates

In addition, there were some models that required the use of transition and transversion rates. Due to the lack of alignment utilization, transition and transversion rates were calculated via Maximum Likelihood Estimation (MLE) [16], which utilized the input of a very small initial arbitrary value to calculate optimal values in correspondence to transition and transversion rates. The initial arbitrary values that we used were 0.15 for transitions and 0.01 for transversions since transitions are approximately about 15 times more likely to occur than transversions [17]. MLE was programmed in the Scipy Python package using scipy.optimize.minimize and iteratively run until the sum of squared differences between the original distance matrix and the model-predicted distances was less than 10 6 . This convergence criterion allowed for the parameter estimates to be as close as possible to the parameters that would likely be used to create the original matrix. Standard errors of the parameters were not computed since the focus of the study was to find the most optimal transition and transversion rates to use in the mathematical models of DNA evolution.

2.5. JC69 (Jukes and Cantor 1969)

Jukes and Cantor’s is known as the simplest model of DNA evolution due to having a single input parameter and several assumptions when computing evolutionary distances between species. These assumptions include equal base frequencies of 25% and equal mutation rates between base pairs [18]. The Jukes and Cantor model applied to the NCD matrix is denoted as
d = 3 4 ln ( 1 4 3 p ) .
where p is the proportion of sites that differ between sequence alignment and d is the evolutionary distance between each sequence. Due to the lack of alignment knowledge, this study replaced p with every value of the NCD matrix. This is because the proportion of sites that differ is analogous to the original NCD matrix value, and d is analogous to the model-predicted distance that is calculated after the model has been applied.

2.6. K80 (Kimura 1980)

Kimura 80 requires two parameters, one of which is the rate of transitions and the other being the rate of transversions. Similar to Jukes and Cantor’s model, however, it was also constructed with the idea that each base is equally frequent at 25%. The pairwise difference formula representing this model is denoted as
K = 1 2 ln ( ( 1 2 p q ) 1 2 q ) .
where K is the output of the corrected NCD matrix, p is the rate of transitions, and q is the rate of transversions between all organisms in the dataset [19].

2.7. K81 (Kimura 1981)

K81 is similar to K80 in that it utilizes the rate of transitions and transversions in order to calculate pairwise distances, but differs in that there are two rates for transversions. Like all other previous models, each equilibrium base frequency is 25%. The pairwise difference formula has the following parameters and formulas:
λ 1 = 4 ( α + β ) . λ 2 = 4 ( α + γ ) . λ 3 = 4 ( β + γ ) .
P = 1 e λ 1 e λ 2 e λ 3 . Q = 1 e λ 1 + e λ 2 e λ 3 . R = 1 + e λ 1 e λ 2 e λ 3 .
Ultimately,
K = 1 4 l n ( ( 1 2 P 2 Q ) ( 1 2 P 2 R ) ( 1 2 Q 2 R ) ) .
where K is the output of the corrected NCD matrix, α is the rate of transitions, β is the first rate of transversions, and γ is the second rate of transversions [20].

2.8. T92 (Tamura 1992)

T92 is well suited for sequences that have a strong G and C base content. It is an extension of K80, but with the addition of a parameter that describes the frequency of the content of G and C, which is denoted as θ . Thus, T92 does not assume equal base pair frequencies among sequences.
h = 2 θ ( 1 θ ) .
d = h ln ( 1 p h q ) 1 2 ( 1 h ) ln ( 1 2 q ) .
where d is the output to the NCD matrix, p is the rate of transitions, and q is the rate of transversions [21].

2.9. TN93 (Tamura and Nei 1993)

TN93 allows two different transition rates and two different transversion rates. In addition, this model does not assume any equal base frequencies. The derivation of this model that was used was the one that allowed the computation of the number of nucleotide substitutions between sequences, which we then used to apply to the NCD matrix [22]. In using this derivation of the model, we used the following parameters:
P ¯ 1 = 2 g A g G g R g R a a + 2 ( g R α ¯ 1 + g Y β ¯ ) t a + g Y a a + 2 β ¯ t a . P ¯ 2 = 2 g T g C g Y g Y a a + 2 ( g Y α ¯ 2 + g R β ¯ ) t a + g R a a + 2 β ¯ t a . Q ¯ = 2 g R g Y 1 a a + 2 β ¯ t a .
where P ¯ 1 is the number of transition differences between purines and P ¯ 2 is the transition differences between pyrimidines. Furthermore, Q ¯ is the rate of transversion differences.
The underlying parameters of g A , g T , g C , g G all represent base frequencies. The parameter a represents the index of dispersion, which tells how dispersed data points are relative to their mean. Specifically, it is expressed as
a = m 2 ( s 2 m ) .
where m is the mean of nucleotide substitutions per site and p is the variance of nucleotide substitutions.
Additionally, β ¯ is considered to be the average transversion rate and α ¯ is considered to be the average transition rate among all organisms. This is usually calculated before applying to the overlying parameters that they are in. The parameter t is considered to be a time parameter to take into account how much evolutionary time has passed before two sequences diverged. Due to the lack of this knowledge, we decided to keep t = 1 . MLE was used to optimize the parameters mentioned (besides t) as accurately as possible and then used in the formula below.
d ^ = 2 a [ g A g G g R ( 1 g R 2 g A g G P ^ 1 1 2 g R Q ^ ) 1 / a + g T g C g Y ( 1 g Y 2 g T g C P ^ 2 1 2 g Y Q ^ ) 1 / a + ( g R g Y g A g G g Y g R g T g C g R g Y ) × ( 1 1 2 g R g Y Q ^ ) 1 / a g A g G g T g C g R g Y ] .

3. Results

3.1. Rate of Elementary Quartets (REQ)

For context, REQ is a distance-based approach for assessing branch support, which calculates the proportion of elementary quartets induced by each internal branch that align with the four-point condition applied to pairwise evolutionary distances [23]. Unlike bootstrap-based methods, REQ directly utilizes the dissimilarity matrix and provides branch confidence values ranging from 0 to 1, with higher values indicating stronger support for the branch.
Based on the results of Table 1, most REQ values are uniform and have little to no differences between all tables, with the exception of tomatoes. They have the largest disparity of values among one group of species. This is worth noting, as tomato genomes are of the same lineage and are not very different from each other genetically.
For all REQ values in general, especially those of Arabidopsis thaliana and tomatoes, the values are less than 0.9 and typically fall in the range of 0.5–0.6. This means that the confidence level of the original vs. the NCD matrices is relatively low and shows only moderate similarity.

3.2. Robinson Fould and Normalized Robinson Fould (RF and nRF)

Structural similarity among these trees was quantified using the Robinson–Foulds (RF) distance [24]. The RF metric measures topological differences between trees, with lower values indicating greater similarity.
As shown in Table 2 all values with the exception of the Jukes and Cantor model had the same value. Similar to REQ, the Normalized RF shows moderate similarity, with some scores being 0.1–0.2. This is one tier below being very confident and shows that there were a large number of similarities, however, not enough to be considered very similar. However, most of the data points across the table are 0.3 and higher, indicating that there were a small number of similarities in the distance matrices between the original NCD and the model NCD.

3.3. Euclidean Distances

Euclidean Distance is a general way to compute the distance between two points in a space. In the context of this study, the goal was to utilize the Euclidean Distance to further establish how similar the original NCD matrix was to the corrected NCD matrices. The average Euclidean Distance was taken between each original matrix and a model matrix. The lower the values, generally the more similar to matrices are [25].
The values in Table 3 give a different indication of how similar each matrix is when compared to the other validation techniques. They show relatively small values, indicating that every original matrix is very similar to the model matrix. However, this is computed with values from matrices and is given no other context. It is possible that the overall values of the distance matrices are very similar since each matrix is normalized from 0 to 1 and thus gives a small value even if the matrices result in very different phylogenetic trees.

4. Discussion

As demonstrated by the results, the NCD correction models appear to give very similar results for the input matrix, with the exception of the Jukes and Cantor model. There are several possibilities as to why this may have occurred.

4.1. Maximum Likelihood Estimation

The goal of Maximum Likelihood Estimation (MLE) in the context of this study is to optimize a set of input parameters to minimize the difference between a new and original dataset. This makes MLE very accurate for predicting missing or unknown parameters of mathematical expressions and models that would change one dataset to another. This would indicate that every optimal parameter computed from MLE will attempt to create a matrix that is as close as possible to its original.

4.2. Models Are Non-Applicable to NCD Matrices

Another possibility of why the models are producing similar results is due to the applicability of the models to NCD matrices. As noted earlier, the study was adjusted to apply the models to the NCD matrices directly instead of creating rate matrices, which involve alignment. Since these models utilize alignments to count transition, transversion, and general point mutations, the modified versions of alignments are not accurate and lead to different results. Thus, the statistics provided in Table 1, Table 2 and Table 3 represent a lack of incompatibility with NCD matrices as the models do not provide sufficient biological context to NCD algorithms.

4.3. NCD Accuracy

It is also likely that the NCD matrices are accurate enough and do not need a correction model to be applied to. This would indicate that Phylotools computes accurate distance matrices and phylogenetic trees. Therefore, another method besides applying models of DNA evolution may correct the anomalies of NCD matrices.

4.4. Novel Evaluation Metrics

Due to the novelty of NCD distance matrices, it is possible that other metrics must be used in order to evaluate their results. Since compression algorithms are not typically used to create matrices, novel validation techniques may be needed to evaluate how well the matrices represent evolutionary change. Additionally, metrics that measure how much biological context is included within the compressor may be helpful.

5. Conclusions

Based on the results, applying mathematical models of DNA evolution to NCD matrices does not give a more accurate representation of phylogenomic trees.
This result supports the rationale that phylogenetic distance matrices produced based on NCD closely represent the evolutionary relationship among a group of organisms. We can confidently use these NCD matrices to look at Phylogeny from a whole-genome perspective rather than selecting representative sequences such as mitochondrial DNA or rRNA, which are well-suited for MSA-based methods.
Although NCD corrections did not show any improvements from our original NCD matrices, there is still potential to improve NCD matrices.
One future direction includes exploring different ways to take on even longer genomic sequences than the ones used in this study. Limitations such as compression window sizes can be addressed in order to improve the quality of NCD matrices. This would allow for sequences of any length to be compressed without applying any segmentation.
In addition, the exploration of existing techniques such as MFcompress [26] or developing other compression techniques that would specifically be designed to work with DNA sequences. This is so that no segmentation would be needed during the compression phase. This would allow for generating even more reliable NCD matrices.
Other methods besides compression such as Genome Language Processing [27] and Genome Large Language Models [27] may be used to continue exploration of non-alignment based methods. This allows for the fusion of Machine Learning and genomics to better study how nucleotides change over time. K-mer-based approaches are also gaining popularity due to their ability to take multiple short sequence alignment data from multiple genomes and analyze them for similarity [28]. This approach also utilizes distance matrices to compare the relationships among a group of organisms. Similarly, MinHash approaches, which fuse k-mers with hashing algorithms, have been shown to efficiently profile large eukaryotic genomes [29].
At the moment, Phylotools and compression-based methods seem to have a promising future, with the potential to be widely used for creating phylogenetic trees.

Author Contributions

Conceptualization, T.R. and J.R.; methodology, D.M. and H.H.; software, D.M. and H.H.; validation, T.R., J.R. and D.M.; formal analysis, D.M., H.H., T.R. and J.R.; investigation, D.M.; resources, T.R. and J.R.; data curation, D.M.; writing—original draft preparation, D.M.; writing—review and editing, T.R. and J.R.; visualization, D.M. and T.R.; supervision, T.R. and J.R.; project administration, T.R. and J.R.; funding acquisition, T.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. We express our gratitude to DePaul University’s Graduate and Undergraduate Research Assistant Program (GRAP/URAP) for its funding support. These awards have provided a unique opportunity for undergraduate and graduate students to engage in research, fostering a collaborative environment between students and faculty.

Data Availability Statement

The original data of the NCD matrices produced in the study are openly available in NCD-Matrices at https://github.com/domoreno23/NCD-Matrices (accessed on 2 August 2025) The original data of the Newick files produced in this study are also openly available in NCD-Newick-Files at https://github.com/domoreno23/NCD-Newick-Files (accessed on 2 August 2025). The original data of the tomato genomes can be found at https://solgenomics.net/ftp/genomes/tomato100/March_02_2020_sv_landscape/ (accessed on 2 August 2025) and https://zenodo.org/records/17166769 (accessed on 2 August 2025).

Acknowledgments

We would like to thank the members of the Computational Biology and Applied Bioinformatics (CoBaAB) laboratory research group in the School of Computing at DePaul University for their invaluable contributions, particularly DeAngelo Wilson, Catherine Huber, and Diego Izzaguire, whose knowledge and prior work have significantly influenced this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NCDNormalized Compression Distance
NIDNormalized Information Distance
RFRobinson Foulds
nRFNormalized Robinson Foulds
RF-NormNormalized Robinson Foulds
REQRate of Elementary Quartets
MLEMaximum Likelihood Estimation
MSAMultiple Sequence Alignment

Appendix A

The Accession IDs of Mammalian mitochondrial genomes and Arabidopsis thaliana genomes are found below. These are the exact genomes used in this study.
Table A1. Mammalian mitochondrial Genomes Accession IDs.
Table A1. Mammalian mitochondrial Genomes Accession IDs.
SpeciesNCBI Accession IDs
Ailuropoda_melanoleucaEF196663.1
Ailurus_fulgens_styaniAB291074.1
Anoura_caudiferNC_022420.1
Anourosorex_squamipesNC_024563.1
Antilope_cervicapraNC_012098.1
Artibeus_jamaicensisAF061340.1
Artibeus_lituratusNC_016871.1
Balaena_mysticetusAP006472.1
Blarina_brevicaudaNC_027902.1
Bos_mutusDQ124389.1
Bos_taurusV00654.1
Bubalus_bubalisAY488491.1
Camelus_bactrianusAJ409393.1
Camelus_dromedariusNC_009849.1
Canis_lupus_familiarisU96639.2
Capra_hircusGU295658.1
Castor_canadensisFJ959094.1
Cavia_porcellusAJ222767.1
Cebus_albifronsNC_021952.1
Ceratotherium_simumY07726.1
Chinchilla_lanigeraEF567130.1
Cricetulus_griseusNC_007936.1
Crocidura_tanakaeNC_035941.1
Cynopterus_sphinxEU289411.1
Dasypus_novemcinctusY11832.1
Delphinapterus_leucasNC_005279.1
Dipodomys_ordiiNC_005314.1
Elephas_maximusDQ316068.1
Eptesicus_fuscusNC_029342.1
Equus_asinusNC_001788.1
Equus_caballusNC_001640.1
Erinaceus_europaeusX88898.1
Eubalaena_japonicaAP006472.1
Felis_catusU20753.1
Gorilla_gorillaD38114.1
Halichoerus_grypusNC_001602.1
Hippopotamus_amphibiusAP003425.1
Homo_sapiensNC_012920.1
Hylobates_larNC_002082.1
Loxodonta_africanaX56292.1
Macaca_fascicularisX79547.1
Macaca_mulattaAY612638.1
Microtus_arvalisEF489115.1
Monodelphis_domesticaNC_006299.1
Mus_musculusV00711.1
Mustela_putorius_furoAJ544416.1
Myotis_lucifugusNC_006897.1
Nomascus_leucogenysNC_013993.1
Nycticebus_coucangNC_002765.1
Ochotona_curzoniaeNC_011029.1
Odocoileus_virginianusNC_008414.1
Orcinus_orcaY13856.1
Oryctolagus_cuniculusAJ001588.1
Ovis_ariesAF010406.1
Pan_troglodytesD38116.1
Panthera_leoNC_028302.1
Panthera_tigrisEF551002.1
Papio_hamadryasY18001.1
Phocoena_phocoenaNC_005280.1
Physeter_catodonX72204.1
Pongo_pygmaeusX97707.1
Pteropus_vampyrusNC_009063.1
Rattus_norvegicusX14848.1
Saimiri_sciureusNC_025235.1
Sus_scrofaAJ002189.1
Tarsius_syrichtaY18001.1
Trichechus_manatusNC_005279.1
Tursiops_truncatusX72204.1
Ursus_arctosAF303110.1
Ursus_maritimusAJ428577.1
Vicugna_pacosEF397824.1
Vulpes_vulpes_montanaKF387633.1
Table A2. Arabidopsis thaliana Genome Accession IDs.
Table A2. Arabidopsis thaliana Genome Accession IDs.
LinesNCBI Accession IDs
AUZE-A-5GCA_946402385.1
FERR-A-8GCA_946403025.1
BELC-C-10GCA_946403525.1
ANGE-B-10GCA_946404075.1
BARA-C-5GCA_946404385.1
IP-San-9GCA_946404515.1
BANI-C-1GCA_946404995.1
IP-Met-6GCA_946405265.1
FERR-A-12GCA_946405525.1
IP-Evs-12GCA_946405545.1
IP-Med-0GCA_946405725.1
MONTM-B-16GCA_946405905.1
IP-Alo-0.9506GCA_946406325.1
Ler-0.7213GCA_946406525.1
BANI-C-12GCA_946406595.1
IP-Hom-4.9546GCA_946406625.1
BARA-C-3GCA_946406735.1
Rabacal-1.22005GCA_946406895.1
CAMA-C-2GCA_946406975.1
BARC-A-17GCA_946407145.1
IP-Lor-16GCA_946407795.1
SALE-A-10GCA_946408365.1
CAMA-C-9GCA_946408575.1
BELC-C-12GCA_946408975.1
BROU-A-10GCA_946409395.1
SALE-A-17GCA_946409815.1
Tanz-1.10024GCA_946409825.1
BARC-A-12GCA_946410165.1
IP-Alo-19GCA_946410485.1
IP-Cas-0.9831GCA_946411375.1
IP-Hom-0GCA_946411425.1
MERE-A-13GCA_946411655.1
IP-Mos-9GCA_946411805.1
IP-Mos-5GCA_946411885.1
IP-Hum-4GCA_946412005.1
GAIL-B-11GCA_946412065.1
IP-Mdc-14GCA_946412225.1
IP-Hum-2.9549GCA_946413285.1
IP-Med-3GCA_946413305.1
PREI-A-14GCA_946413405.1
IP-Sln-22GCA_946413935.1
Cvi-0.6911GCA_946414125.1
IP-Cat-0.9832GCA_946414305.1
ANGE-B-2GCA_946415005.1
IP-Evs-0.9845GCA_946415165.1
IP-Cas-6GCA_946415445.1
MONTM-B-7GCA_946415625.1
LACR-C-14GCA_946415655.1
Ey15-2.9994GCA_946499665.1
Col-0.6909GCA_946499705.1
Pent-46.2212GCA_964057255.1
LI-EF-011.685GCA_964057265.1
14INRCT07GCA_964057275.1
MONF-A-1.22045GCA_965117475.1
RAYR-A-17.22055GCA_965117485.1
LANT-B-1.22039GCA_965117495.1
LUZE-A-14.22042GCA_965117505.1
MONT-B-14.22048GCA_965117515.1
NAUV-B-7.22052GCA_965117525.1
LACR-C-4.22038GCA_965117535.1
BELL-A-1.22021GCA_965117545.1
PREI-A-9.22054GCA_965117555.1
MERE-A-7.22044GCA_965117565.1
LANT-B-10.22040GCA_965117575.1
LUZE-A-12.22041GCA_965117585.1
MONF-A-14.22046GCA_965117595.1
RAYR-A-9.22056GCA_965117605.1
JUZE-A-3.22036GCA_965117615.1
BOULO-A-16.22024GCA_965117625.1
BELL-A-7.22022GCA_965117635.1
BROU-A-2.22026GCA_965117645.1
NAUV-B-14.22051GCA_965117655.1
JUZE-A-2.22035GCA_965117665.1
BOULO-A-1.22023GCA_965117675.1
CARL-A-16.22030GCA_965117685.1
CARL-A-10.22029GCA_965117695.1
MONT-B-12.22047GCA_965117705.1
GAIL-B-9.22034GCA_965117715.1
AUZE-A-11.22011GCA_965117725.1

References

  1. Jarvis, E.D.; Mirarab, S.; Aberer, A.J.; Li, B.; Houde, P.; Li, C.; Ho, S.Y.; Faircloth, B.C.; Nabholz, B.; Howard, J.T.; et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 2014, 346, 1320–1331. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, K.; Linder, C.R.; Warnow, T. RAxML and FastTree: Comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE 2011, 6, e27731. [Google Scholar] [CrossRef] [PubMed]
  3. Chatzou, M.; Magis, C.; Chang, J.M.; Kemena, C.; Bussotti, G.; Erb, I.; Notredame, C. Multiple sequence alignment modeling: Methods and applications. Briefings Bioinform. 2015, 17, 1009–1023. [Google Scholar] [CrossRef]
  4. Suchard, M.A.; Weiss, R.E.; Sinsheimer, J.S. Bayesian Selection of Continuous-Time Markov Chain Evolutionary Models. Mol. Biol. Evol. 2001, 18, 1001–1013. [Google Scholar] [CrossRef]
  5. Daugelaite, J.; O’ Driscoll, A.; Sleator, R.D. An overview of multiple sequence alignments and cloud computing in bioinformatics. Int. Sch. Res. Not. 2013, 2013, 615630. [Google Scholar] [CrossRef]
  6. Izquierdo-Carrasco, F.; Gagneur, J.; Stamatakis, A. Trading memory for running time in phylogenetic likelihood computations. Heidelb. Inst. Theor. Stud. 2012, 86–95. [Google Scholar] [CrossRef]
  7. Claros, M.G.; Bautista, R.; Guerrero-Fernández, D.; Benzerki, H.; Seoane, P.; Fernández-Pozo, N. Why assembling plant genome sequences is so challenging. Biology 2012, 1, 439–459. [Google Scholar] [CrossRef]
  8. Ozan, Ş. DNA Sequence Classification with Compressors. arXiv 2024, arXiv:2401.14025. [Google Scholar] [CrossRef]
  9. Wilson, D.; Rogers, J. Evaluating Compression-Based Phylogeny Estimation in the Presence of Incomplete Lineage Sorting. J. Comput. Biol. 2023, 30, 250–260. [Google Scholar] [CrossRef] [PubMed]
  10. Rogers, J.; Wilson, D. Comparing phylogeny by compression to phylogeny by NJp and Bayesian Inference. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Republic of Korea, 16–19 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2195–2202. [Google Scholar]
  11. Rogers, D.W.J. PhyloTools: A Software Package for Analyzing Phylogenetic Trees; GitHub Repository: San Francisco, CA, USA, 2021. [Google Scholar]
  12. Li, M.; Vit’anyi, P.M. An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed.; Springer Publishing Company, Incorporated: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  13. Li, M.; Chen, X.; Li, X.; Ma, B.; Vit’anyi, P.M.B. The similarity metric. IEEE Trans. Inf. Theory 2004, 50, 3250–3264. [Google Scholar] [CrossRef]
  14. Pavlov, I. 7-Zip, Version 24.08 or Later; [Computer Software]; Moscow, Russia. 2025. Available online: https://www.7-zip.org/ (accessed on 2 August 2025).
  15. Moreno, D. NCD-Corrections: Correction Tools for Normalized Compression Distance (NCD); GitHub Repository: San Francisco, CA, USA, 2025. [Google Scholar]
  16. Astrom, K. Maximum Likelihood and Prediction Error Methods. Ifac Proc. Vol. 1979, 12, 551–574. [Google Scholar] [CrossRef]
  17. Stoltzfus, A.; Norris, R.W. On the Causes of Evolutionary Transition: Transversion Bias. Mol. Biol. Evol. 2016, 33, 595–602. [Google Scholar] [CrossRef]
  18. Jukes, T.; Cantor, C. Evolution of Protein Molecules. In Mammalian Protein Metabolism; Munro, H., Ed.; Academic Press: New York, NY, USA, 1969; pp. 21–132. [Google Scholar] [CrossRef]
  19. Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 1980, 16, 111–120. [Google Scholar] [CrossRef] [PubMed]
  20. Kimura, M. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 1981, 78, 454–458. [Google Scholar] [CrossRef]
  21. Tamura, K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 1992, 9, 678–687. [Google Scholar] [CrossRef] [PubMed]
  22. Tamura, K.; Nei, M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 1993, 10, 512–526. [Google Scholar] [CrossRef]
  23. Guénoche, A.; Garreta, H. Can we have confidence in a tree representation? In Proceedings of the International Conference on Biology, Informatics, and Mathematics, Montpellier, France, 3–5 May 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 45–56. [Google Scholar]
  24. Robinson, D.F.; Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 1981, 53, 131–147. [Google Scholar] [CrossRef]
  25. de Vienne, D.M.; Aguileta, G.; Ollier, S. Euclidean nature of phylogenetic distance matrices. Syst. Biol. 2011, 60, 826–832. [Google Scholar] [CrossRef]
  26. Pinho, A.; Pratas, D. MFCompress: A compression tool for FASTA and multi-FASTA data. Bioinformatics 2014, 30, 117–118. [Google Scholar] [CrossRef]
  27. Consens, M.; Dufault, C.; Wainberg, M.; Forster, D.; Karimzadeh, M.; Goodarzi, H.; Theis, F.J.; Moses, A.; Wang, B. Transformers and genome language models. Nat. Mach. Intell. 2025, 7, 346–362. [Google Scholar] [CrossRef]
  28. Wen, J.; Chan, R.H.; Yau, S.C.; He, R.L.; Yau, S.S. K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene 2014, 546, 25–34. [Google Scholar] [CrossRef] [PubMed]
  29. Moi, D.; Kilchoer, L.; Aguilar, P.S.; Dessimoz, C. Scalable phylogenetic profiling using MinHash uncovers likely eukaryotic sexual reproduction genes. PLoS Comput. Biol. 2020, 16, 1–21. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Workflow of the methods used for this study.
Figure 1. Workflow of the methods used for this study.
Mathematics 13 03534 g001
Table 1. Comparison of REQ values.
Table 1. Comparison of REQ values.
MammalTomatoArabidopsis
# BranchesAvg Req# BranchesAvg Req# BranchesAvg Req
NCD1500.79110.62760.56
JC691500.53110.57760.51
K801500.78110.52760.55
K811500.78110.45760.55
T921500.78110.49760.55
TN931500.78110.49760.55
Table 2. Comparison of Robinson Fould and Normalized Robinson Fould.
Table 2. Comparison of Robinson Fould and Normalized Robinson Fould.
MammalTomatoArabidopsis
RFRF-NormRFRF-NormRFRF-Norm
NCD vs. JC6921.900.471.360.369.940.48
NCD vs. K808.410.121.460.425.870.19
NCD vs. K818.410.121.460.425.870.19
NCD vs. T928.410.121.460.425.870.19
NCD vs. TN938.410.121.460.425.870.19
Table 3. Average Euclidean Distances.
Table 3. Average Euclidean Distances.
MammalsTomatosArabidopsis
NCD vs. JC690.00140.0180.0017
NCD vs. K800.00110.0200.0020
NCD vs. K810.00110.0200.0020
NCD vs. T920.00110.0200.0020
NCD vs. TN930.00110.0200.0020
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Moreno, D.; Hu, H.; Ramaraj, T.; Rogers, J. Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices. Mathematics 2025, 13, 3534. https://doi.org/10.3390/math13213534

AMA Style

Moreno D, Hu H, Ramaraj T, Rogers J. Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices. Mathematics. 2025; 13(21):3534. https://doi.org/10.3390/math13213534

Chicago/Turabian Style

Moreno, Damian, Hongzhi Hu, Thiruvarangan Ramaraj, and John Rogers. 2025. "Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices" Mathematics 13, no. 21: 3534. https://doi.org/10.3390/math13213534

APA Style

Moreno, D., Hu, H., Ramaraj, T., & Rogers, J. (2025). Exploring the Application of Models of DNA Evolution to Normalized Compression Distance (NCD) Matrices. Mathematics, 13(21), 3534. https://doi.org/10.3390/math13213534

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop