This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Most alignment-free correlation distance methods based on composition vectors, developed for phylogenetic analysis using complete genomes, share a shortcoming: their “distances” are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees whose topologies are the same as, or similar to, those obtained with the old “distance”, and that agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.

Whole genome sequences are generally accepted as excellent tools for studying evolutionary relationships [

A number of methods without sequence alignment for deriving species phylogeny based on overall similarities of complete genomes have been developed. These include fractal analysis [

The SVD, Markov model and dynamical language model approaches above each include a step that calculates the correlation-related distance between two genomes after removing the randomness or noise from the composition vectors. A drawback is that these correlation-related distances are not proper distance metrics in the strict mathematical sense (Professor Bailin Hao, personal communication, 2009; see also [

In this paper, we follow the second way and propose two proper correlation-related distance metrics to replace the pseudo-distance in the dynamical language approach used by Yu

Three kinds of data from the complete genomes can be analysed using the dynamical language approach proposed by Yu

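There are a total of $N = 4^K$ (for DNA sequences) or $N = 20^K$ (for protein sequences) possible types of $K$-strings $s_1 s_2 \ldots s_K$. For a sequence of length $L$, let $n(s_1 s_2 \ldots s_K)$ denote the number of occurrences of the string $s_1 s_2 \ldots s_K$; its observed frequency is $p(s_1 s_2 \ldots s_K) = n(s_1 s_2 \ldots s_K)/(L - K + 1)$. For a genome consisting of several sequences, with $n_j(s_1 s_2 \ldots s_K)$ occurrences of $s_1 s_2 \ldots s_K$ in the $j$-th sequence of length $L_j$, the frequency is $p(s_1 s_2 \ldots s_K) = \sum_j n_j(s_1 s_2 \ldots s_K) / \sum_j (L_j - K + 1)$. Writing $p_i$ for the frequency of the $i$-th string type, the composition vector of the genome is $(p_1, p_2, \ldots, p_N)$.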
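The counting step for $K$-strings can be illustrated with a minimal Python sketch. It assumes the usual sliding-window convention, in which a sequence of length $L$ contributes $L - K + 1$ overlapping windows; the function name `kstring_frequencies` and its parameters are hypothetical and not taken from the original work.

```python
from collections import Counter
from itertools import product


def kstring_frequencies(sequences, k, alphabet="ACGT"):
    """Return observed K-string frequencies for one genome.

    `sequences` is a list of strings (one genome may hold several sequences).
    Assumption: frequencies are pooled counts divided by the pooled number
    of overlapping windows, sum_j (L_j - k + 1).
    """
    counts = Counter()
    total_windows = 0
    for seq in sequences:
        n_windows = max(len(seq) - k + 1, 0)
        for i in range(n_windows):
            counts[seq[i:i + k]] += 1
        total_windows += n_windows

    # Enumerate all len(alphabet)**k string types in a fixed (lexicographic)
    # order so that vectors from different genomes are comparable.
    # Windows containing letters outside `alphabet` still count towards
    # `total_windows` but have no component in the vector.
    types = ["".join(t) for t in product(alphabet, repeat=k)]
    if total_windows == 0:
        return {s: 0.0 for s in types}
    return {s: counts[s] / total_windows for s in types}


# Usage: one short DNA sequence, K = 2 (five overlapping windows).
freqs = kstring_frequencies(["ACGTAC"], 2)
print(len(freqs), freqs["AC"])
```

For protein sequences the same sketch applies with a 20-letter `alphabet`; for realistic values of $K$ a sparse representation would be preferable to enumerating every string type.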

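Yu et al. considered that a $K$-string $s_1 s_2 \ldots s_K$ is possibly constructed by adding a letter $s_K$ to the end of the $(K-1)$-string $s_1 s_2 \ldots s_{K-1}$ or a letter $s_1$ to the beginning of the $(K-1)$-string $s_2 s_3 \ldots s_K$. With $p(s_1)$ and $p(s_K)$ denoting the frequencies of the letters $s_1$ and $s_K$, the expected frequency of the $K$-string is $q(s_1 s_2 \ldots s_K) = [p(s_1 s_2 \ldots s_{K-1})\,p(s_K) + p(s_1)\,p(s_2 s_3 \ldots s_K)]/2$, computed for each of the $4^K$ or $20^K$ kinds of $K$-strings.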

The transformation

Then we use _{1}_{2}..._{K}_{1}_{2}..._{K}_{1}, _{2},..., _{N}_{1}, _{2},...,_{N}

Then we view the

The distance _{r}_{r}

To determine a best length of strings (

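Each genome can be considered as a point in the $4^K$-dimensional (DNA) or $20^K$-dimensional (protein) space, represented by its composition vector $X = (X_1, X_2, \ldots, X_N)$.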

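A function $d(x, y)$ defined on pairs of points of a space is a distance metric if, for all points $x$, $y$ and $z$: (i) $d(x, y) \ge 0$, with equality if and only if $x = y$; (ii) $d(x, y) = d(y, x)$; (iii) $d(x, z) \le d(x, y) + d(y, z)$.

The inequality (iii) is called the triangle inequality.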
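The triangle inequality is the condition that correlation-based pseudo-distances tend to violate. The Python sketch below checks it numerically on a small set of points. It assumes, purely for illustration, a pseudo-distance of the form $(1 - C)/2$ with $C$ the cosine-type correlation of two vectors; the function names are hypothetical and this form is an assumption rather than a quotation of the original definition.

```python
import math
from itertools import permutations


def correlation(x, y):
    """Cosine-type correlation of two vectors (assumed, uncentred form)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pseudo_distance(x, y):
    """Assumed pseudo-distance (1 - C) / 2; it takes values in [0, 1]."""
    return (1.0 - correlation(x, y)) / 2.0


def triangle_violations(dist, points, tol=1e-12):
    """List ordered triples (x, y, z) with dist(x, z) > dist(x, y) + dist(y, z)."""
    bad = []
    for x, y, z in permutations(points, 3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            bad.append((x, y, z))
    return bad


# Three vectors at angles of 0, 45 and 90 degrees in the plane.
points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
violations = triangle_violations(pseudo_distance, points)
print(len(violations))
```

With these three vectors, going directly from the first to the third costs 0.5 under the assumed pseudo-distance, while the detour through the middle vector costs about 0.29, so condition (iii) fails for that triple and for its reverse.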

If we denote:
_{u}_{u}_{u}_{u}

The distance defined by _{r}

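The chord distance is defined on the set of unit vectors in a vector space as the length of the chord constructed from two unit vectors. Mathematically, let $X_u = (x_{u1}, x_{u2}, \ldots, x_{uN})$ and $Y_u = (y_{u1}, y_{u2}, \ldots, y_{uN})$ be two unit vectors; then $D_{chord}(X_u, Y_u) = \|X_u - Y_u\| = \sqrt{\sum_{i=1}^{N} (x_{ui} - y_{ui})^2}$.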
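As a minimal illustration of the chord distance, the Python sketch below normalizes two vectors to unit length and returns the Euclidean length of the chord between them. The normalization used here (division by the Euclidean norm, without centring) and the function names are assumptions made for the example. For unit vectors the squared chord length equals $2 - 2\,X_u \cdot Y_u$, so the chord distance is a monotone function of the cosine-type correlation.

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return tuple(a / norm for a in v)


def chord_distance(x, y):
    """Length of the chord between the unit vectors obtained from x and y."""
    xu, yu = unit(x), unit(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xu, yu)))


# Usage: vectors at 0, 45 and 90 degrees in the plane.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
direct = chord_distance(a, c)
detour = chord_distance(a, b) + chord_distance(b, c)
print(direct <= detour)
```

Because the chord distance is the ordinary Euclidean distance restricted to the unit sphere, it inherits the triangle inequality, which the usage lines confirm on one example.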

It is seen that _{chord}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{chord}_{u}_{u}_{chord}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{chord}_{u}_{u}

This distance metric is also defined on the set of unit vectors in a vector space. For any two unit vectors _{u}_{u}_{piecewise}_{u}_{u}

By definition, _{piecewise}_{u}_{u}_{u}_{u}_{u}_{u}_{piecewise}_{u}_{u}_{piecewise}_{u}_{u}_{u}_{u}_{piecewise}_{u}_{u}_{piecewise}_{u}_{u}_{piecewise}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{u}_{piecewise}_{u}_{u}

We propose to replace the pseudo-distance in the dynamical language approach [

We used the dynamical language approach for Datasets 1 and 2 in [

The whole DNA sequences (including protein-coding and non-coding regions), all protein-coding DNA sequences and the amino acid sequences of all protein-coding genes from genome data are used for phylogenetic analysis. For

For

For

For

There is no significant effect by the normalization of the distances and different values of

We proposed two new mathematically proper distance metrics based on the lengths of the chords constructed from unit vectors and on proportions of the sample correlation function of unit vectors to replace the pseudo-distance in the dynamical language approach [

The authors would like to thank Bailin Hao of the T-Life Research Center, Fudan University, for pointing out the distance problem and for useful discussions. They also wish to thank the Editor and the Reviewers for their insights, comments and suggestions to improve the paper. This research was supported by the Chinese Program for New Century Excellent Talents in University (grant NCET-08-0686) and the Fok Ying Tung Education Foundation (grant 101004) (Z.-G. Yu), and by the Australian Research Council (grant no. DP0559807) (V. Anh).

The plot of mean value of

Phylogeny of 109 organisms (prokaryotes and eukaryotes) using the dynamical language approach with chord distance in the case

Phylogeny of chloroplast genomes using the dynamical language approach with piecewise distance in the case

The NJ tree of mitochondrial genomes based on the whole DNA sequences using the dynamical language approach with chord distance in the case

Phylogeny of 62 alpha-proteobacteria using the dynamical language approach with chord distance in the cases