<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">ijms</journal-id>
<journal-title>International Journal of Molecular Sciences</journal-title>
<abbrev-journal-title>Int. J. Mol. Sci.</abbrev-journal-title>
<issn pub-type="epub">1422-0067</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/ijms11031141</article-id>
<article-id pub-id-type="publisher-id">ijms-11-01141</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yu</surname><given-names>Zu-Guo</given-names></name><xref ref-type="aff" rid="af1-ijms-11-01141">1</xref><xref ref-type="aff" rid="af2-ijms-11-01141">2</xref><xref ref-type="corresp" rid="c1-ijms-11-01141">*</xref></contrib>
<contrib contrib-type="author">
<name><surname>Zhan</surname><given-names>Xiao-Wen</given-names></name><xref ref-type="aff" rid="af1-ijms-11-01141">1</xref></contrib>
<contrib contrib-type="author">
<name><surname>Han</surname><given-names>Guo-Sheng</given-names></name><xref ref-type="aff" rid="af1-ijms-11-01141">1</xref></contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname><given-names>Roger W.</given-names></name><xref ref-type="aff" rid="af3-ijms-11-01141">3</xref></contrib>
<contrib contrib-type="author">
<name><surname>Anh</surname><given-names>Vo</given-names></name><xref ref-type="aff" rid="af2-ijms-11-01141">2</xref></contrib>
<contrib contrib-type="author">
<name><surname>Chu</surname><given-names>Ka Hou</given-names></name><xref ref-type="aff" rid="af4-ijms-11-01141">4</xref></contrib></contrib-group>
<aff id="af1-ijms-11-01141">
<label>1</label> School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; E-Mails: 
<email>zhan031001140604@163.com</email> (X.-W.Z.);
<email>korea10282003@163.com</email> (G.-S.H.)</aff>
<aff id="af2-ijms-11-01141">
<label>2</label> School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia; E-Mail: 
<email>v.anh@qut.edu.au</email> (V.A.)</aff>
<aff id="af3-ijms-11-01141">
<label>3</label> Department of Mathematics, Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China; E-Mail: 
<email>wwang_00@yahoo.com</email> (R.W.W.)</aff>
<aff id="af4-ijms-11-01141">
<label>4</label> Department of Biology, Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China; E-Mail: 
<email>kahouchu@cuhk.edu.hk</email> (K.H.C.)</aff>
<author-notes>
<corresp id="c1-ijms-11-01141">
<label>*</label> Author to whom correspondence should be addressed; E-Mail: 
<email>yuzg@hotmail.com</email>; Tel.: +86-731-52377625; Fax: +86-731-58293934.</corresp></author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>3</month>
<year>2010</year></pub-date>
<pub-date pub-type="collection">
<year>2010</year></pub-date>
<volume>11</volume>
<issue>3</issue>
<fpage>1141</fpage>
<lpage>1154</lpage>
<history>
<date date-type="received">
<day>4</day>
<month>2</month>
<year>2010</year></date>
<date date-type="accepted">
<day>3</day>
<month>3</month>
<year>2010</year></date></history>
<permissions>
<copyright-statement>© 2010 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland.</copyright-statement>
<copyright-year>2010</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0">
<p>This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>A shortcoming of most alignment-free correlation distance methods based on composition vectors, developed for phylogenetic analysis using complete genomes, is that the “distances” are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees whose topologies are the same as or similar to those obtained using the old “distance”, and that these trees agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.</p></abstract>
<kwd-group>
<kwd>phylogenetic analysis</kwd>
<kwd>complete genome</kwd>
<kwd>composition vector</kwd>
<kwd>correlation-related distance metric</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Whole genome sequences are generally accepted as excellent tools for studying evolutionary relationships [<xref ref-type="bibr" rid="b1-ijms-11-01141">1</xref>]. Traditional distance methods with multiple alignment or various sequence evolutionary models for phylogenetic analysis are not directly applicable to the analysis of complete genomes.</p>
<p>A number of methods without sequence alignment for deriving species phylogeny based on overall similarities of complete genomes have been developed. These include fractal analysis [<xref ref-type="bibr" rid="b2-ijms-11-01141">2</xref>–<xref ref-type="bibr" rid="b4-ijms-11-01141">4</xref>], dynamical language model [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>], information-based analysis [<xref ref-type="bibr" rid="b6-ijms-11-01141">6</xref>–<xref ref-type="bibr" rid="b8-ijms-11-01141">8</xref>], log-correlation distance and Fourier transformation with Kullback-Leibler divergence distance [<xref ref-type="bibr" rid="b9-ijms-11-01141">9</xref>], Markov model [<xref ref-type="bibr" rid="b10-ijms-11-01141">10</xref>–<xref ref-type="bibr" rid="b15-ijms-11-01141">15</xref>], principal component analysis [<xref ref-type="bibr" rid="b16-ijms-11-01141">16</xref>] and singular value decomposition (SVD) [<xref ref-type="bibr" rid="b17-ijms-11-01141">17</xref>–<xref ref-type="bibr" rid="b19-ijms-11-01141">19</xref>]. The analyses based on the Markov model and dynamical language model without sequence alignment using 103 prokaryotes and 6 eukaryotes have yielded trees separating the three domains of life, Archaea, Eubacteria and Eukarya, with the relationships among the taxa consistent with those based on traditional analyses [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>,<xref ref-type="bibr" rid="b11-ijms-11-01141">11</xref>]. These two methods were also used to analyze the complete chloroplast genomes [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>,<xref ref-type="bibr" rid="b12-ijms-11-01141">12</xref>]. The SVD method was used to analyze mitochondrial genomes of 64 selected vertebrates [<xref ref-type="bibr" rid="b19-ijms-11-01141">19</xref>]. 
A correlation-distance method without removing the random background (similar to [<xref ref-type="bibr" rid="b7-ijms-11-01141">7</xref>]) was used to analyze rRNA gene sequences as DNA barcodes [<xref ref-type="bibr" rid="b20-ijms-11-01141">20</xref>].</p>
<p>The SVD, Markov model and dynamical language model approaches above each include a step that calculates a correlation-related distance between two genomes after removing the randomness or noise from the composition vectors. A drawback is that these correlation-related distances are not proper distance metrics in the strict mathematical sense (Professor Bailin Hao, personal communication, 2009; see also [<xref ref-type="bibr" rid="b21-ijms-11-01141">21</xref>]). There are two ways to overcome this problem. One is to replace the concept of distance with that of dissimilarity, as proposed by Xu and Hao [<xref ref-type="bibr" rid="b15-ijms-11-01141">15</xref>] in the Markov model approach. The other is to replace the pseudo-distance by a proper distance metric, provided that the results are not worsened from the biological point of view. The difficulty with the first way is that there is no widely accepted mathematical definition of dissimilarity or similarity. Chen <italic>et al.</italic> [<xref ref-type="bibr" rid="b22-ijms-11-01141">22</xref>] defined a similarity metric, but unfortunately the sample correlation between two vectors in a vector space does not yield a proper similarity under their definition.</p>
<p>In this paper, we follow the second way and propose two proper correlation-related distance metrics to replace the pseudo-distance in the dynamical language approach used by Yu <italic>et al</italic>. [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. We then evaluate the effects of this replacement on the analysis of a wide range of complete genomes from the biological point of view.</p></sec>
<sec sec-type="methods">
<label>2.</label>
<title>Dynamical Language Approach for Phylogenetic Analysis</title>
<p>Three kinds of data from the complete genomes can be analysed using the dynamical language approach proposed by Yu <italic>et al</italic>. [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. They are the whole DNA sequences (including protein-coding and non-coding regions), all protein-coding DNA sequences and the amino acid sequences of all protein-coding genes. We outline this approach here.</p>
<p>There are a total of <italic>N</italic> = 4<italic><sup>K</sup></italic> (for DNA sequences) or 20<italic><sup>K</sup></italic> (for protein sequences) possible types of <italic>K</italic>-strings, that is, strings of fixed length <italic>K</italic>. We denote the length of a DNA or protein sequence by <italic>L</italic>. A window of length <italic>K</italic> is then slid along the sequence, shifting one position at a time, to determine the frequency of each of the <italic>N</italic> kinds of <italic>K</italic>-strings in this sequence. We define <italic>p</italic>(<italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>) = <italic>n</italic>(<italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>) / (<italic>L</italic> – <italic>K</italic> + 1) as the observed frequency of a <italic>K</italic>-string <italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>, where <italic>n</italic>(<italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>) is the number of times that <italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic> appears in this sequence. For the DNA or amino acid sequences of the protein-coding genes, denoting by <italic>m</italic> the number of protein-coding genes from each complete genome, we define 
<inline-formula>
<mml:math>
<mml:mo stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>m</mml:mi></mml:msubsup>
<mml:mrow>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>α</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>m</mml:mi></mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo>−</mml:mo>
<mml:mi>K</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as the observed frequency of a <italic>K</italic>-string <italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>; here <italic>n<sub>j</sub></italic>(<italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>) is the number of times that <italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic> appears in the <italic>j</italic>th protein-coding DNA sequence or protein sequence, and <italic>L<sub>j</sub></italic> is the length of the <italic>j</italic>th sequence in this complete genome. Then we can form a <italic>composition vector</italic> for a genome using <italic>p</italic>(<italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>) as components for all possible <italic>K</italic>-strings <italic>α</italic><sub>1</sub><italic>α</italic><sub>2</sub>...<italic>α<sub>K</sub></italic>. We use <italic>p<sub>i</sub></italic> to denote the <italic>i</italic>-th component corresponding to the string type <italic>i</italic>, <italic>i</italic> = 1,…,<italic>N</italic> (the <italic>N</italic> strings are arranged in a fixed order, here alphabetical order). In this way we construct a composition vector <italic>p</italic> = (<italic>p</italic><sub>1</sub>, <italic>p</italic><sub>2</sub>,..., <italic>p<sub>N</sub></italic>) for a genome.</p>
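<p>The counting step described above can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions rather than the implementation used in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]: the function name <monospace>composition_vector</monospace>, its arguments and the toy sequences are hypothetical, sequences shorter than <italic>K</italic> are assumed to contribute no windows, and windows containing letters outside the alphabet count towards the denominator but have no component in the vector.</p>
<preformat><![CDATA[
```python
from collections import Counter
from itertools import product


def composition_vector(sequences, k, alphabet="ACGT"):
    """Observed K-string frequencies p, pooled over one or more sequences."""
    counts = Counter()
    windows = 0  # sum over sequences of (L_j - K + 1)
    for seq in sequences:
        n_windows = len(seq) - k + 1
        if n_windows > 0:  # assumption: too-short sequences are skipped
            windows += n_windows
            for start in range(n_windows):
                counts[seq[start:start + k]] += 1
    if windows == 0:
        raise ValueError("no sequence is at least k letters long")
    # Fixed alphabetical order over all N = len(alphabet) ** k strings.
    kmers = ["".join(t) for t in product(sorted(alphabet), repeat=k)]
    return kmers, [counts[s] / windows for s in kmers]


# Toy example: two short "genes" and K = 2 (hypothetical data).
kmers, p = composition_vector(["ACGTAC", "GGTA"], k=2)
```
]]></preformat>
<p>Passing a single sequence reproduces the whole-genome definition with denominator <italic>L</italic> – <italic>K</italic> + 1, while passing all genes of a genome reproduces the pooled definition above.</p>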
<p>Yu <italic>et al</italic>. [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] considered an idea from the theory of dynamical language [<xref ref-type="bibr" rid="b23-ijms-11-01141">23</xref>] that a <italic>K</italic>-string <italic>s</italic><sub>1</sub><italic>s</italic><sub>2</sub>...<italic>s<sub>K</sub></italic> may be constructed by adding a letter <italic>s<sub>K</sub></italic> to the end of the (<italic>K</italic> – 1)-string <italic>s</italic><sub>1</sub><italic>s</italic><sub>2</sub>...<italic>s</italic><sub><italic>K</italic>–1</sub> or a letter <italic>s</italic><sub>1</sub> to the beginning of the (<italic>K</italic> – 1)-string <italic>s</italic><sub>2</sub><italic>s</italic><sub>3</sub>...<italic>s<sub>K</sub></italic>. After counting the observed frequencies of all strings of length (<italic>K</italic> – 1) and of the four or 20 kinds of letters, the expected frequency of each <italic>K</italic>-string is predicted by:
<disp-formula id="FD1">
<label>(1)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>3</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:math></disp-formula>where <italic>p</italic>(<italic>s</italic><sub>1</sub>) and <italic>p</italic>(<italic>s<sub>K</sub></italic>) are frequencies of nucleotides or amino acids <italic>s</italic><sub>1</sub> and <italic>s<sub>K</sub></italic> appearing in this genome. Then <italic>q</italic>(<italic>s</italic><sub>1</sub><italic>s</italic><sub>2</sub>...<italic>s<sub>K</sub></italic>) of all 4<italic><sup>K</sup></italic> or 20<italic><sup>K</sup></italic> kinds of <italic>K</italic>-strings is viewed as the noise background. We then subtract the noise background before performing a cross-correlation analysis through defining:
<disp-formula id="FD2">
<label>(2)</label>
<mml:math display="block">
<mml:mi>X</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>if</mml:mtext></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≠</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mtext>if</mml:mtext></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn></mml:msub>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn></mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>K</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula></p>
<p>The transformation <italic>X</italic> = (<italic>p</italic> / <italic>q</italic>) – 1 has the desired effect of subtracting the random background from <italic>p</italic> and rendering it a stationary time series suitable for subsequent cross-correlation analysis.</p>
<p>We then use <italic>X</italic>(<italic>s</italic><sub>1</sub><italic>s</italic><sub>2</sub>...<italic>s<sub>K</sub></italic>) for all possible <italic>K</italic>-strings <italic>s</italic><sub>1</sub><italic>s</italic><sub>2</sub>...<italic>s<sub>K</sub></italic> as components, with the <italic>K</italic>-strings arranged in a fixed alphabetical order, to form a composition vector <italic>X</italic> = (<italic>X</italic><sub>1</sub>, <italic>X</italic><sub>2</sub>,..., <italic>X<sub>N</sub></italic>) for genome <italic>X</italic>, and likewise <italic>Y</italic> = (<italic>Y</italic><sub>1</sub>, <italic>Y</italic><sub>2</sub>,...,<italic>Y<sub>N</sub></italic>) for genome <italic>Y</italic>.</p>
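<p>Equations (1) and (2) and the ordering of the components can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name <monospace>background_removed</monospace> and the dictionary-based inputs (observed frequencies of <italic>K</italic>-strings, of (<italic>K</italic> – 1)-strings and of single letters) are assumptions made for illustration, and the two-letter alphabet is a toy example.</p>
<preformat><![CDATA[
```python
from itertools import product


def background_removed(p_k, p_k1, p_1, alphabet, k):
    """Return the vector X of Eq. (2), with q taken from Eq. (1).

    p_k, p_k1 and p_1 map K-strings, (K-1)-strings and single letters
    to their observed frequencies; missing keys are treated as 0.
    """
    x = []
    for letters in product(sorted(alphabet), repeat=k):
        s = "".join(letters)
        # Eq. (1): average of the two ways of extending a (K-1)-string.
        q = (p_k1.get(s[:-1], 0.0) * p_1.get(s[-1], 0.0)
             + p_1.get(s[0], 0.0) * p_k1.get(s[1:], 0.0)) / 2
        # Eq. (2): relative deviation from the background, 0 where q = 0.
        x.append((p_k.get(s, 0.0) / q - 1) if q != 0 else 0.0)
    return x


# Toy example with K = 2, so the (K-1)-strings are the single letters.
p_1 = {"A": 0.5, "B": 0.5}
p_2 = {"AA": 0.4, "AB": 0.1, "BA": 0.1, "BB": 0.4}
x = background_removed(p_2, p_1, p_1, alphabet="AB", k=2)
```
]]></preformat>
<p>In this toy example every <italic>q</italic> equals 0.25, so the over-represented strings AA and BB receive positive components and AB and BA negative ones.</p>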
<p>Then we view the <italic>N</italic> components in the vectors <italic>X</italic> and <italic>Y</italic> as samples of two random variables respectively. The sample correlation <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>) between any two genomes <italic>X</italic> and <italic>Y</italic> is defined in the usual way in probability theory as:
<disp-formula>
<mml:math display="block">
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn></mml:msubsup></mml:mrow>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>Y</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn></mml:msubsup></mml:mrow>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
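<p>As written, the formula does not subtract the component means, so <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>) coincides with the cosine of the angle between the two vectors. A minimal Python sketch of this computation follows; the function name <monospace>correlation</monospace> and the choice to raise an error for a zero vector are assumptions of the sketch, since the text does not discuss that case.</p>
<preformat><![CDATA[
```python
import math


def correlation(x, y):
    """Uncentred sample correlation C(X, Y) of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if norm == 0:
        # Assumption: C is left undefined when either vector is zero.
        raise ValueError("C is undefined for a zero vector")
    return dot / norm
```
]]></preformat>
<p>For instance, <monospace>correlation([1, 2], [2, 4])</monospace> involves two parallel vectors and <monospace>correlation([1, 0], [0, 1])</monospace> two orthogonal ones.</p>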
<p>The distance <italic>D<sub>r</sub></italic> (<italic>X</italic>, <italic>Y</italic>) between the two genomes is then defined by <italic>D<sub>r</sub></italic> (<italic>X</italic>, <italic>Y</italic>) = (1 – <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>)) / 2. A distance matrix for all the genomes under study is then generated for the construction of phylogenetic trees. This distance method for constructing phylogenetic trees is referred to as the <italic>dynamical language model method</italic> [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. Finally, based on the distance matrices, we construct all trees using the neighbour-joining (NJ) method [<xref ref-type="bibr" rid="b24-ijms-11-01141">24</xref>] in the software <italic>SplitsTree4</italic> V4.10 [<xref ref-type="bibr" rid="b25-ijms-11-01141">25</xref>] or in the <italic>Molecular Evolutionary Genetics Analysis</italic> software (MEGA 4) [<xref ref-type="bibr" rid="b26-ijms-11-01141">26</xref>].</p>
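<p>The generation of the pairwise distance matrix can be sketched in Python as below. The names <monospace>d_r</monospace> and <monospace>distance_matrix</monospace> and the three toy vectors are hypothetical; the sketch only builds the symmetric matrix and leaves tree construction to external software such as that cited above.</p>
<preformat><![CDATA[
```python
import math


def d_r(x, y):
    """D_r(X, Y) = (1 - C(X, Y)) / 2 with the uncentred correlation C."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return (1 - dot / norm) / 2


def distance_matrix(vectors, dist=d_r):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(vectors[i], vectors[j])
    return matrix


# Three toy composition vectors (hypothetical data).
toy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
m = distance_matrix(toy)
```
]]></preformat>
<p>Because the distance function is passed as an argument, the same routine can be reused with any other distance, including the proper metrics introduced in the next section.</p>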
<p>To determine the best string length (<italic>K</italic>) for our model, we plot the mean value of <italic>X</italic> over all <italic>K</italic>-strings from a genome (whole DNA sequences or protein sequences) as a function of <italic>K</italic> (see <xref ref-type="fig" rid="f1-ijms-11-01141">Figure 1</xref> for examples from our data). The mean value of <italic>X</italic> starts to approach zero at <italic>K</italic> = 6 or 7 if we use the protein sequences from a genome and at <italic>K</italic> = 11 or 12 if we use the whole DNA sequence. A mean value of <italic>X</italic> close to zero means that the value of <italic>p</italic> (from the sequence) is almost equal to the value of <italic>q</italic> (from the model). Hence these <italic>K</italic> values are suitable for phylogeny reconstruction using our approach. This result is also confirmed later in this paper from a biological point of view.</p></sec>
<sec>
<label>3.</label>
<title>Proper Distance Metrics in Vector Spaces</title>
<p>Each genome can be considered as a point in <italic>N</italic> = 4<italic><sup>K</sup></italic> (for DNA sequences) or 20<italic><sup>K</sup></italic> (for protein sequences) dimensional space represented by its composition vector <italic>X</italic> = (<italic>X</italic><sub>1</sub>, <italic>X</italic><sub>2</sub>,..., <italic>X<sub>N</sub></italic>).</p>
<p>A function <italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) between two vectors <italic>X</italic> and <italic>Y</italic> is said to be a distance metric if it satisfies the following properties:
<list list-type="roman-lower">
<list-item>
<p><italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) ≥ 0; and <italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) = 0 if and only if <italic>X</italic> = <italic>Y</italic>;</p></list-item>
<list-item>
<p><italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) = <italic>D</italic>(<italic>Y</italic>, <italic>X</italic>);</p></list-item>
<list-item>
<p><italic>D</italic>(<italic>X</italic>, <italic>Z</italic>) ≤ <italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) + <italic>D</italic>(<italic>Y</italic>, <italic>Z</italic>) for any <italic>X</italic>, <italic>Y</italic> and <italic>Z</italic>.</p></list-item></list></p>
<p>The inequality (iii) is called the <italic>triangle inequality</italic>. A distance metric <italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) is said to be normalized if 0 ≤ <italic>D</italic>(<italic>X</italic>, <italic>Y</italic>) ≤ 1 for any <italic>X</italic> and <italic>Y</italic>.</p>
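<p>Properties (i) to (iii) can be tested numerically on a finite sample of points, as in the following Python sketch. The function name <monospace>violated_axioms</monospace>, the tolerance and the sample points are assumptions of the sketch; such a test can only refute the metric property on the chosen points and never proves it in general.</p>
<preformat><![CDATA[
```python
import math
from itertools import product


def violated_axioms(dist, points, tol=1e-12):
    """Return the subset of {"i", "ii", "iii"} that fails on the points."""
    failed = set()
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        # (i): non-negative, and zero exactly when the points coincide.
        if d < -tol or (d <= tol) != (x == y):
            failed.add("i")
        # (ii): symmetry.
        if abs(d - dist(y, x)) > tol:
            failed.add("ii")
    for x, y, z in product(points, repeat=3):
        # (iii): triangle inequality.
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            failed.add("iii")
    return failed


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
euclidean_failures = violated_axioms(math.dist, points)
```
]]></preformat>
<p>The Euclidean distance passes all three checks on these points, whereas the squared Euclidean distance, for example, fails the triangle inequality on three equally spaced collinear points.</p>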
<p>If we denote:
<disp-formula>
<mml:math display="block">
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>|</mml:mo></mml:mrow></mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Y</mml:mi>
<mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula>where |<italic>X</italic>| and |<italic>Y</italic>| are the lengths of the vectors <italic>X</italic> and <italic>Y</italic> respectively, then <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic> are unit vectors (<italic>i.e.</italic>, have length 1). Let <italic>θ</italic> be the angle between the two vectors <italic>X</italic> and <italic>Y</italic>. It is well known that <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>) = cos<italic>θ</italic>.</p>
<p>The distance defined by <italic>D<sub>r</sub></italic> (<italic>X</italic>, <italic>Y</italic>) = (1 – <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>)) / 2 is not a proper distance metric because it satisfies neither condition (i) (except when restricted to unit vectors) nor the triangle inequality (iii) [<xref ref-type="bibr" rid="b21-ijms-11-01141">21</xref>]. In the following we describe two proper distance metrics related to the sample correlation.</p>
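<p>Both failures of <italic>D<sub>r</sub></italic> can be seen in a two-dimensional example, sketched here in Python. The vectors are chosen purely for illustration and the function name <monospace>d_r</monospace> is hypothetical.</p>
<preformat><![CDATA[
```python
import math


def d_r(x, y):
    """D_r(X, Y) = (1 - C(X, Y)) / 2 with the uncentred correlation C."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return (1 - dot / norm) / 2


# Condition (i) fails: parallel vectors of different length have D_r = 0.
x, x2 = (1.0, 0.0), (2.0, 0.0)
same_direction = d_r(x, x2)

# Triangle inequality fails for unit vectors at 0, 45 and 90 degrees.
y = (math.cos(math.pi / 4), math.sin(math.pi / 4))
z = (0.0, 1.0)
direct = d_r(x, z)               # (1 - cos 90 deg) / 2
detour = d_r(x, y) + d_r(y, z)   # 2 * (1 - cos 45 deg) / 2
print(same_direction, direct, detour)
```
]]></preformat>
<p>Here the direct distance is 0.5, while the detour through the intermediate vector amounts to 1 – cos 45°, roughly 0.29, so that <italic>D<sub>r</sub></italic> (<italic>X</italic>, <italic>Z</italic>) exceeds <italic>D<sub>r</sub></italic> (<italic>X</italic>, <italic>Y</italic>) + <italic>D<sub>r</sub></italic> (<italic>Y</italic>, <italic>Z</italic>).</p>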
<sec>
<label>3.1.</label>
<title>Chord Distance</title>
<p>The chord distance is defined on the set of unit vectors in a vector space as the length of the chord joining two unit vectors. Mathematically, let <italic>X<sub>u</sub></italic> = (<italic>X<sub>u1</sub></italic>, <italic>X<sub>u2</sub></italic>,…,<italic>X<sub>uN</sub></italic>) and <italic>Y<sub>u</sub></italic> = (<italic>Y<sub>u1</sub></italic>, <italic>Y<sub>u2</sub></italic>,…, <italic>Y<sub>uN</sub></italic>) be two unit vectors; then the chord distance <italic>D<sub>chord</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) is defined as:
<disp-formula id="FD3">
<label>(3)</label>
<mml:math display="block">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">chord</mml:mtext></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mtd>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow></mml:msqrt>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow>
<mml:mn>2</mml:mn></mml:msubsup>
<mml:mo>+</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow>
<mml:mn>2</mml:mn></mml:msubsup></mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>2</mml:mn>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>N</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:mrow></mml:msqrt></mml:mtd></mml:mtr>
<mml:mtr><mml:mtd/>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msqrt>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:mi>Y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:msqrt></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
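<p>The equality between the first and last expressions in Equation (3) can be checked numerically with the following Python sketch. The function names <monospace>unit</monospace>, <monospace>chord_distance</monospace> and <monospace>chord_from_correlation</monospace> and the sample vectors are assumptions made for illustration; the clamp at zero only guards against rounding noise and is not part of the definition.</p>
<preformat><![CDATA[
```python
import math


def unit(v):
    """Scale v to length 1 (the vector X_u of the text)."""
    length = math.sqrt(sum(a * a for a in v))
    if length == 0:
        raise ValueError("cannot normalise the zero vector")
    return tuple(a / length for a in v)


def chord_distance(x, y):
    """First form of Eq. (3): Euclidean distance between unit vectors."""
    return math.dist(unit(x), unit(y))


def chord_from_correlation(x, y):
    """Last form of Eq. (3): sqrt(2 * (1 - C(X, Y)))."""
    c = sum(a * b for a, b in zip(unit(x), unit(y)))
    return math.sqrt(max(0.0, 2 * (1 - c)))  # clamp rounding noise


# Two orthogonal toy vectors of different lengths (hypothetical data).
x, y = (3.0, 4.0), (8.0, -6.0)
d_direct = chord_distance(x, y)
d_from_c = chord_from_correlation(x, y)
```
]]></preformat>
<p>For orthogonal vectors such as these, both expressions give the square root of 2 up to rounding, independently of the original lengths of the vectors.</p>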
<p>It is seen that <italic>D<sub>chord</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 0 if and only if <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 1, <italic>i.e.</italic>, cos<italic>θ</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 1, which implies that <italic>θ</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 0 because the angle <italic>θ</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) between the two vectors <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic> is in [0, <italic>π</italic>]. This means that the two unit vectors <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic> are identical. It is obvious that <italic>D<sub>chord</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = <italic>D<sub>chord</sub></italic> (<italic>Y<sub>u</sub></italic>, <italic>X<sub>u</sub></italic>). Because the three chords constructed by the pairs <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic>, <italic>X<sub>u</sub></italic> and <italic>Z<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic> and <italic>Z<sub>u</sub></italic> are the three edges of a triangle, and the sum of the lengths of any two edges of a triangle is greater than or equal to the length of the third edge, the triangle inequality for the chord distance follows. Hence the chord distance is a proper distance metric in the strict mathematical sense. The chord distance <italic>D<sub>chord</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) can be normalized by setting 
<inline-formula>
<mml:math>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">chord</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">norm</mml:mtext></mml:mrow></mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">chord</mml:mtext></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn></mml:math></inline-formula>. This distance is also called the Cavalli-Sforza chord distance [<xref ref-type="bibr" rid="b27-ijms-11-01141">27</xref>] and is described on pp. 163–166 of [<xref ref-type="bibr" rid="b28-ijms-11-01141">28</xref>]. It performed well in simulations of tree-building algorithms by Takezaki and Nei [<xref ref-type="bibr" rid="b29-ijms-11-01141">29</xref>] and has also been used to analyze microarray gene expression data [<xref ref-type="bibr" rid="b30-ijms-11-01141">30</xref>].</p></sec>
<sec>
<label>3.2.</label>
<title>Piecewise Distance</title>
<p>This distance metric is also defined on the set of unit vectors in a vector space. For any two unit vectors <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic>, we define:
<disp-formula id="FD4">
<label>(4)</label>
<mml:math display="block">
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">piecewise</mml:mtext></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>−</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>ρ</mml:mi></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>≠</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>f</mml:mi></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula>where <italic>ρ</italic> is any real number not smaller than 3. We call <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) the <italic>piecewise distance</italic>.</p>
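<p>A minimal Python sketch of the piecewise distance follows, again assuming that <italic>C</italic>(<italic>X</italic>, <italic>Y</italic>) is the cosine of the angle between the two vectors. The function names, the tolerance used to decide whether <italic>C</italic> = 1 in floating-point arithmetic, and the brute-force triangle-inequality check are all assumptions made for illustration; they are not part of the definition above.</p>
<preformat>
```python
import itertools
import math


def correlation(x, y):
    """Cosine of the angle between x and y, taken here as C(X, Y).

    Assumes both vectors are non-zero and have the same length.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def piecewise_distance(x, y, rho=3.0):
    """0 if C(x, y) = 1, otherwise 1 - C(x, y) / rho, with rho at least 3."""
    if not rho >= 3.0:
        raise ValueError("rho must be at least 3")
    c = correlation(x, y)
    # Assumed tolerance: treat c as exactly 1 when it differs only by rounding.
    if math.isclose(c, 1.0, rel_tol=0.0, abs_tol=1e-12):
        return 0.0
    return 1.0 - c / rho


def satisfies_triangle_inequality(vectors, rho=3.0):
    """Check D(x, y) + D(y, z) >= D(x, z) for every ordered triple."""
    for x, y, z in itertools.permutations(vectors, 3):
        two_sides = piecewise_distance(x, y, rho) + piecewise_distance(y, z, rho)
        if piecewise_distance(x, z, rho) > two_sides + 1e-12:
            return False
    return True
```
</preformat>
<p>With <italic>ρ</italic> = 3, any two distinct unit vectors are more than 2/3 and at most 4/3 apart under this definition, so the sum of two such distances cannot fall below a third one.</p>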
<p>By definition, <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 0 if and only if <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = 1, which means that the two vectors <italic>X<sub>u</sub></italic> and <italic>Y<sub>u</sub></italic> are identical, as shown above. It is also obvious that <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) = <italic>D<sub>piecewise</sub></italic> (<italic>Y<sub>u</sub></italic>, <italic>X<sub>u</sub></italic>). If two of the three unit vectors <italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic> and <italic>Z<sub>u</sub></italic> coincide, the triangle inequality is immediate. Otherwise, using the facts <italic>ρ</italic> ≥ 3 and −1 ≤ <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) ≤ 1 for any two unit vectors, we have <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) + <italic>D<sub>piecewise</sub></italic> (<italic>Y<sub>u</sub></italic>, <italic>Z<sub>u</sub></italic>) − <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Z<sub>u</sub></italic>) = [<italic>ρ</italic> − <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) − <italic>C</italic>(<italic>Y<sub>u</sub></italic>, <italic>Z<sub>u</sub></italic>) + <italic>C</italic>(<italic>X<sub>u</sub></italic>, <italic>Z<sub>u</sub></italic>)]/<italic>ρ</italic> ≥ 0, which gives the triangle inequality for the piecewise distance. Hence the piecewise distance is a proper distance metric in the strict mathematical sense. The piecewise distance <italic>D<sub>piecewise</sub></italic> (<italic>X<sub>u</sub></italic>, <italic>Y<sub>u</sub></italic>) can be normalized by 
<inline-formula>
<mml:math>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">piecewise</mml:mtext></mml:mrow>
<mml:mrow>
<mml:mtext mathvariant="italic">norm</mml:mtext></mml:mrow></mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mtext mathvariant="italic">piecewise</mml:mtext></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo>,</mml:mo>
<mml:mi> </mml:mi>
<mml:msub>
<mml:mi>Y</mml:mi>
<mml:mi>u</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn></mml:math></inline-formula>. Usually we may take <italic>ρ</italic> = 3.</p></sec></sec>
<sec>
<label>4.</label>
<title>Evaluation of the Proposed Distance Metrics from the Biological Point of View</title>
<p>We propose to replace the pseudo-distance in the dynamical language approach [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] by the chord distance or piecewise distance. We need to examine the effects of this replacement from the biological point of view. In order to do this, we evaluate the new distance metrics on four datasets, namely <bold>Dataset 1</bold> of 109 complete genomes of prokaryotes and eukaryotes used in [<xref ref-type="bibr" rid="b11-ijms-11-01141">11</xref>], <bold>Dataset 2</bold> of 34 prokaryote and chloroplast genomes used in [<xref ref-type="bibr" rid="b12-ijms-11-01141">12</xref>], <bold>Dataset 3</bold> of mitochondrial genomes of 64 selected vertebrates used in [<xref ref-type="bibr" rid="b19-ijms-11-01141">19</xref>], and <bold>Dataset 4</bold> of 62 complete genomes of alpha-proteobacteria used in [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>]. (<italic>Note</italic>: Chan <italic>et al.</italic> [<xref ref-type="bibr" rid="b21-ijms-11-01141">21</xref>] recently tested the chord distance with different denoising formulas on Dataset 2).</p>
<p>We used the dynamical language approach for Datasets 1 and 2 in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] and for Dataset 3 in [<xref ref-type="bibr" rid="b32-ijms-11-01141">32</xref>]. Some biological comparisons of this approach with the Markov model approach on Datasets 1 and 2 were given in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. Recently we found that incorrect data for the archaeon <italic>Pyrobaculum aerophilum</italic> (Pyrae; Crenarchaeota) from Dataset 1 were used in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. Using the correct genome data, <italic>Pyrobaculum aerophilum</italic> (Pyrae) groups correctly with the other Crenarchaeota (when we use the amino acid sequences of all protein-coding genes from genomes and <italic>K</italic> = 6). After this correction, the resulting tree is better than the one in [<xref ref-type="bibr" rid="b11-ijms-11-01141">11</xref>] from the biological point of view: all Firmicutes group together, and the other branches are similar. For Dataset 2, the trees obtained using the dynamical language approach in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] and the Markov model approach in [<xref ref-type="bibr" rid="b12-ijms-11-01141">12</xref>] have the same topology (also using the amino acid sequences of all protein-coding genes from genomes and <italic>K</italic> = 6). For Dataset 3, we reported in [<xref ref-type="bibr" rid="b32-ijms-11-01141">32</xref>] a good tree, obtained using the dynamical language approach (based on the whole DNA sequences of genomes and <italic>K</italic> = 11), in agreement with the current understanding of the phylogeny of vertebrates revealed by the traditional approaches. This tree is better than the one in [<xref ref-type="bibr" rid="b19-ijms-11-01141">19</xref>] and the one obtained by the Markov model approach. 
Hence we only need to compare the best trees obtained by the dynamical language approach using the two proper distance metrics with the best trees obtained from the pseudo-distance in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] based on the first three datasets. In 2009, Guyon <italic>et al</italic>. [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>] compared four alignment-free string distances for complete genome phylogeny using Dataset 4. We compare the method in this paper with the results in [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>] on Dataset 4.</p>
<p>The whole DNA sequences (including protein-coding and non-coding regions), all protein-coding DNA sequences and the amino acid sequences of all protein-coding genes from genome data are used for phylogenetic analysis. For <bold>Dataset 1</bold>, we have seen that the amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We evaluated the dynamical language approach with chord distance and piecewise distance on the amino acid sequences of all protein-coding genes from genomes for <italic>K</italic> = 3, 4, 5 and 6. We find that the trees using the new distance metrics have the same topology as the trees using the old “distance” for the same value of <italic>K</italic>, and that the trees for <italic>K</italic> = 6 are the best. Here we present the tree for <italic>K</italic> = 6 using the dynamical language approach with chord distance in <xref ref-type="fig" rid="f2-ijms-11-01141">Figure 2</xref>. The phylogeny shown in <xref ref-type="fig" rid="f2-ijms-11-01141">Figure 2</xref> supports the broad division into three domains and agrees with the tree of life based on 16S rRNA in a majority of basic branches. For further biological discussions, one can refer to [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] with the correction for the position of <italic>Pyrobaculum aerophilum</italic> (Pyrae).</p>
<p>For <bold>Dataset 2</bold>, we have seen that the amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We evaluated the dynamical language approach with chord distance and piecewise distance on the amino acid sequences of all protein-coding genes from genomes for <italic>K</italic> = 3, 4, 5 and 6. We find that the tree using the piecewise distance has the same topology as the tree using the old “distance” for the same value of <italic>K</italic>, whereas the tree using the chord distance has a similar topology (slightly worse, because <italic>Pinus thunbergii</italic> is separated from its correct position). The trees for <italic>K</italic> = 6 are the best. Hence we present the tree for <italic>K</italic> = 6 using the dynamical language approach with piecewise distance (<italic>ρ</italic> = 3) in <xref ref-type="fig" rid="f3-ijms-11-01141">Figure 3</xref>. We also note that the topology of the tree in <xref ref-type="fig" rid="f3-ijms-11-01141">Figure 3</xref> is the same as that of the tree obtained by the Markov model in [<xref ref-type="bibr" rid="b12-ijms-11-01141">12</xref>]. The phylogeny of <xref ref-type="fig" rid="f3-ijms-11-01141">Figure 3</xref> shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes <italic>s.l.</italic> and rhodophytes <italic>s.l.</italic> The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution. For further biological discussions, one can refer to [<xref ref-type="bibr" rid="b12-ijms-11-01141">12</xref>].</p>
<p>For <bold>Dataset 3</bold>, after comparing all the trees with the traditional classification of the 64 vertebrates (the traditional classification from the KEGG database is available under “Complete Mitochondrial Genomes” on <ext-link xlink:href="http://www.genome.jp/kegg/genes.html" ext-link-type="uri">http://www.genome.jp/kegg/genes.html</ext-link>), we find that the whole DNA sequences give better results than those given by the amino acid sequences of all protein-coding genes from genomes and all protein-coding DNA sequences. We evaluated the dynamical language approach with the proposed distance metrics on the sequences of whole genomes for <italic>K</italic> = 6 to 13. We find that the tree using the piecewise distance has the same topology as the tree using the old “distance” for the same value of <italic>K</italic>, whereas the tree using the chord distance has a similar topology (slightly better, because <italic>Dasypus novemcinctus</italic> (Dnov) is close to, but not inside, the branch of primates). The trees for <italic>K</italic> = 11 are the best. Hence we present the tree for <italic>K</italic> = 11 using the dynamical language approach with chord distance in <xref ref-type="fig" rid="f4-ijms-11-01141">Figure 4</xref>. The tree (<xref ref-type="fig" rid="f4-ijms-11-01141">Figure 4</xref>) is similar in topology to the tree obtained using the SVD method in the case <italic>K</italic> = 4 [<xref ref-type="bibr" rid="b19-ijms-11-01141">19</xref>], and is also similar to a recently generated tree of 69 species [<xref ref-type="bibr" rid="b33-ijms-11-01141">33</xref>], placing a vast majority of species into well-accepted groupings. As shown in <xref ref-type="fig" rid="f4-ijms-11-01141">Figure 4</xref>, our distance-based analysis separates the mitochondrial genomes into three major clusters. 
One group corresponds to mammals; one group corresponds to the fish; and the third one represents Archosauria (including birds and reptiles). The interrelationships among the mitochondrial genomes are roughly in agreement with the current understanding of the phylogeny of vertebrates revealed by the traditional approaches. For further biological discussion, one can refer to [<xref ref-type="bibr" rid="b32-ijms-11-01141">32</xref>].</p>
<p>For <bold>Dataset 4</bold>, Guyon <italic>et al</italic>. [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>] first reconstructed a reference tree using the Maximum Likelihood (ML) method based on the large (LSU) and small (SSU) ribosomal subunit sequences (<italic>i.e.</italic>, the traditional alignment method). They then compared the results obtained using four alignment-free string distances for complete genome phylogeny: the Maximum Significant Matches (MSM) distance, the <italic>k</italic>-word (KW) distance (<italic>i.e.</italic>, the Markov model in [<xref ref-type="bibr" rid="b11-ijms-11-01141">11</xref>]), the Average Common Substring (ACS) distance and the Compression (ZL) distance. Guyon <italic>et al</italic>. [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>] found that the MSM distance outperforms the other three distances and that the KW distance cannot give a good phylogenetic topology for the 62 alpha-proteobacteria (see Figure 3 in [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>]). We tested our dynamical language approach with the pseudo-distance in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] and the two proper distances in this paper on Dataset 4. We found that the amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We evaluated these three distances on the amino acid sequences of all protein-coding genes from genomes for <italic>K</italic> = 3, 4, 5 and 6. We found that the trees using the new distance metrics have the same topology as the trees using the old “distance” for the same value of <italic>K</italic>, and that the trees for <italic>K</italic> = 5 and 6 share the same topology, which is the best. 
Here we present the tree for <italic>K</italic> = 6 using the dynamical language approach with chord distance in <xref ref-type="fig" rid="f5-ijms-11-01141">Figure 5</xref>. As shown in <xref ref-type="fig" rid="f5-ijms-11-01141">Figure 5</xref>, all Rhizobiales (Bartonellaceae, Brucellaceae, Rhizobiaceae and Phyllobacteriaceae) (A), Rhizobiales (Bradyrhizobiaceae) (B), Rickettsiales (Rickettsiaceae and Anaplasmataceae) (C), Rhodospirillales (D), Sphingomonadales (E) and Rhodobacterales (Rhodobacteraceae) (F) group into their correct branches. Even inside each lineage (groups A to F), our phylogenetic topology is more similar to that of the ML reference tree (the right side tree in Figure 1 of [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>]) than that obtained by the MSM distance (the best result in [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>]). Comparing our <xref ref-type="fig" rid="f5-ijms-11-01141">Figure 5</xref> with the tree obtained using the KW distance (<italic>i.e.</italic>, the Markov model in [<xref ref-type="bibr" rid="b11-ijms-11-01141">11</xref>]) (the tree in Figure 3 of [<xref ref-type="bibr" rid="b31-ijms-11-01141">31</xref>]), we find that our dynamical language model performs much better than the KW distance.</p>
<p>Neither the normalization of the distances nor the choice of <italic>ρ</italic> ≥ 3 has a noticeable effect on the results. Using the proposed distance metrics, we compared the trees before and after normalization and found that their topology is the same. We then set <italic>ρ</italic> = 4, 6, 8, 10 and obtained trees with the same topology as the tree for <italic>ρ</italic> = 3.</p></sec>
<sec sec-type="conclusions">
<label>5.</label>
<title>Conclusions</title>
<p>We proposed two new mathematically proper distance metrics, one based on the lengths of the chords constructed from unit vectors and the other on proportions of the sample correlation function of unit vectors, to replace the pseudo-distance in the dynamical language approach [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>]. The results showed improvements with this replacement from a biological perspective, which supports the usefulness of the two metrics in phylogenetic analysis.</p></sec></body>
<back>
<ack>
<p>The authors would like to thank Bailin Hao of the T-Life Research Center at Fudan University for pointing out the distance problem and for useful discussions. They also wish to thank the Editor and the Reviewers for their insights, comments and suggestions to improve the paper. This research was supported by the Chinese Program for New Century Excellent Talents in University (grant NCET-08-0686) and the Fok Ying Tung Education Foundation (grant 101004) (Z.-G. Yu), and by the Australian Research Council (grant no. DP0559807) (V. Anh).</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-ijms-11-01141"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eisen</surname><given-names>JA</given-names></name><name><surname>Fraser</surname><given-names>CM</given-names></name></person-group><article-title>Phylogenomics: Intersection of evolution and genomics</article-title><source>Science</source><year>2003</year><volume>300</volume><fpage>1706</fpage><lpage>1707</lpage><pub-id pub-id-type="doi">10.1126/science.1086292</pub-id><pub-id pub-id-type="pmid">12805538</pub-id></citation></ref>
<ref id="b2-ijms-11-01141"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name><name><surname>Lau</surname><given-names>K-S</given-names></name></person-group><article-title>Multifractal and correlation analysis of protein sequences from complete genome</article-title><source>Phys. Rev. E</source><year>2003</year><volume>68</volume><fpage>021913</fpage><lpage>1</lpage><pub-id pub-id-type="doi">10.1103/PhysRevE.68.021913</pub-id></citation></ref>
<ref id="b3-ijms-11-01141"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name><name><surname>Lau</surname><given-names>K-S</given-names></name></person-group><article-title>Chaos game representation, and multifractal and correlation analysis of protein sequences from complete genome based on detailed HP model</article-title><source>J. Theor. Biol</source><year>2004</year><volume>226</volume><fpage>341</fpage><lpage>348</lpage><pub-id pub-id-type="doi">10.1016/j.jtbi.2003.09.009</pub-id><pub-id pub-id-type="pmid">14643648</pub-id></citation></ref>
<ref id="b4-ijms-11-01141"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name><name><surname>Lau</surname><given-names>K-S</given-names></name><name><surname>Chu</surname><given-names>K-H</given-names></name></person-group><article-title>The phylogenetic analysis of prokaryotes based on a fractal model of the complete genomes</article-title><source>Phys. Lett. A</source><year>2003</year><volume>317</volume><fpage>293</fpage><lpage>302</lpage><pub-id pub-id-type="doi">10.1016/j.physleta.2003.08.040</pub-id></citation></ref>
<ref id="b5-ijms-11-01141"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Zhou</surname><given-names>L-Q</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name><name><surname>Chu</surname><given-names>KH</given-names></name><name><surname>Long</surname><given-names>S-C</given-names></name><name><surname>Deng</surname><given-names>J-Q</given-names></name></person-group><article-title>Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from whole genome without sequence alignment</article-title><source>J. Mol. Evol</source><year>2005</year><volume>60</volume><fpage>538</fpage><lpage>545</lpage><pub-id pub-id-type="doi">10.1007/s00239-004-0255-9</pub-id><pub-id pub-id-type="pmid">15883888</pub-id></citation></ref>
<ref id="b6-ijms-11-01141"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>M</given-names></name><name><surname>Badger</surname><given-names>JH</given-names></name><name><surname>Chen</surname><given-names>X</given-names></name><name><surname>Kwong</surname><given-names>S</given-names></name><name><surname>Kearney</surname><given-names>P</given-names></name><name><surname>Zhang</surname><given-names>H</given-names></name></person-group><article-title>An information-based sequence distance and its application to whole mitochondrial genome phylogeny</article-title><source>Bioinformatics</source><year>2001</year><volume>17</volume><fpage>149</fpage><lpage>154</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/17.2.149</pub-id><pub-id pub-id-type="pmid">11238070</pub-id></citation></ref>
<ref id="b7-ijms-11-01141"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Jiang</surname><given-names>P</given-names></name></person-group><article-title>Distance, correlation and mutual information among portraits of organisms based on complete genomes</article-title><source>Phys. Lett. A</source><year>2001</year><volume>286</volume><fpage>34</fpage><lpage>46</lpage><pub-id pub-id-type="doi">10.1016/S0375-9601(01)00336-X</pub-id></citation></ref>
<ref id="b8-ijms-11-01141"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>ZG</given-names></name><name><surname>Mao</surname><given-names>Z</given-names></name><name><surname>Zhou</surname><given-names>LQ</given-names></name><name><surname>Anh</surname><given-names>VV</given-names></name></person-group><article-title>A mutual information based sequence distance for vertebrate phylogeny using complete mitochondrial genomes</article-title><conf-name>Proceedings of the 3rd International Conference on Natural Computation (ICNC2007)</conf-name><conf-loc>Haikou, China</conf-loc><conf-date>August 2007</conf-date><fpage>253</fpage><lpage>257</lpage></citation></ref>
<ref id="b9-ijms-11-01141"><label>9.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Zhou</surname><given-names>LQ</given-names></name><name><surname>Yu</surname><given-names>ZG</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name><name><surname>Nie</surname><given-names>PR</given-names></name><name><surname>Liao</surname><given-names>FF</given-names></name><name><surname>Chen</surname><given-names>YJ</given-names></name></person-group><article-title>Log-correlation distance and Fourier transformation with Kullback-Leibler divergence distance for construction of vertebrate phylogeny using complete mitochondrial genomes</article-title><conf-name>Proceedings of the 3rd International Conference on Natural Computation (ICNC2007)</conf-name><conf-loc>Haikou, China</conf-loc><conf-date>August 2007</conf-date><fpage>304</fpage><lpage>308</lpage></citation></ref>
<ref id="b10-ijms-11-01141"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qi</surname><given-names>J</given-names></name><name><surname>Luo</surname><given-names>H</given-names></name><name><surname>Hao</surname><given-names>B</given-names></name></person-group><article-title>CVTree: A phylogenetic tree reconstruction tool based on whole genomes</article-title><source>Nucleic Acids Res</source><year>2004</year><volume>32</volume><fpage>W45</fpage><lpage>W47</lpage><pub-id pub-id-type="doi">10.1093/nar/gkh362</pub-id><pub-id pub-id-type="pmid">15215347</pub-id></citation></ref>
<ref id="b11-ijms-11-01141"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qi</surname><given-names>J</given-names></name><name><surname>Wang</surname><given-names>B</given-names></name><name><surname>Hao</surname><given-names>B</given-names></name></person-group><article-title>Whole proteome prokaryote phylogeny without sequence alignment: A K-string composition approach</article-title><source>J. Mol. Evol</source><year>2004</year><volume>58</volume><fpage>1</fpage><lpage>11</lpage><pub-id pub-id-type="doi">10.1007/s00239-003-2493-7</pub-id><pub-id pub-id-type="pmid">14743310</pub-id></citation></ref>
<ref id="b12-ijms-11-01141"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname><given-names>KH</given-names></name><name><surname>Qi</surname><given-names>J</given-names></name><name><surname>Yu</surname><given-names>Z-G</given-names></name><name><surname>Anh</surname><given-names>V</given-names></name></person-group><article-title>Origin and phylogeny of chloroplasts: A simple correlation analysis of complete genomes</article-title><source>Mol. Biol. Evol</source><year>2004</year><volume>21</volume><fpage>200</fpage><lpage>206</lpage><pub-id pub-id-type="pmid">14595102</pub-id></citation></ref>
<ref id="b13-ijms-11-01141"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname><given-names>L</given-names></name><name><surname>Qi</surname><given-names>J</given-names></name></person-group><article-title>Whole genome molecular phylogeny of large dsDNA viruses using composition vector method</article-title><source>BMC Evol. Biol</source><year>2007</year><volume>7</volume><fpage>1</fpage><lpage>7</lpage><pub-id pub-id-type="doi">10.1186/1471-2148-7-1</pub-id><pub-id pub-id-type="pmid">17214884</pub-id></citation></ref>
<ref id="b14-ijms-11-01141"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname><given-names>L</given-names></name><name><surname>Qi</surname><given-names>J</given-names></name><name><surname>Wei</surname><given-names>H</given-names></name><name><surname>Sun</surname><given-names>Y</given-names></name><name><surname>Hao</surname><given-names>B</given-names></name></person-group><article-title>Molecular phylogeny of coronaviruses including human SARS-CoV</article-title><source>Chin. Sci. Bull</source><year>2003</year><volume>48</volume><fpage>1170</fpage><lpage>1174</lpage></citation></ref>
<ref id="b15-ijms-11-01141"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname><given-names>Z</given-names></name><name><surname>Hao</surname><given-names>B</given-names></name></person-group><article-title>CVTree update: A newly designed phylogenetic study platform using composition vectors and whole genomes</article-title><source>Nucleic Acids Res</source><year>2009</year><volume>37</volume><fpage>W174</fpage><lpage>W178</lpage><pub-id pub-id-type="doi">10.1093/nar/gkp278</pub-id><pub-id pub-id-type="pmid">19398429</pub-id></citation></ref>
<ref id="b16-ijms-11-01141"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edwards</surname><given-names>SV</given-names></name><name><surname>Fertil</surname><given-names>B</given-names></name><name><surname>Giron</surname><given-names>A</given-names></name><name><surname>Deschavanne</surname><given-names>PJ</given-names></name></person-group><article-title>A genomic schism in birds revealed by phylogenetic analysis of DNA strings</article-title><source>Syst. Biol</source><year>2002</year><volume>51</volume><fpage>599</fpage><lpage>613</lpage><pub-id pub-id-type="doi">10.1080/10635150290102285</pub-id><pub-id pub-id-type="pmid">12228002</pub-id></citation></ref>
<ref id="b17-ijms-11-01141"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stuart</surname><given-names>GW</given-names></name><name><surname>Berry</surname><given-names>MW</given-names></name></person-group><article-title>An SVD-based comparison of nine whole eukaryotic genomes supports a coelomate rather than ecdysozoan lineage</article-title><source>BMC Bioinf</source><year>2004</year><volume>5</volume><fpage>204</fpage><pub-id pub-id-type="doi">10.1186/1471-2105-5-204</pub-id></citation></ref>
<ref id="b18-ijms-11-01141"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stuart</surname><given-names>GW</given-names></name><name><surname>Moffet</surname><given-names>K</given-names></name><name><surname>Baker</surname><given-names>S</given-names></name></person-group><article-title>Integrated gene species phylogenies from unaligned whole genome protein sequences</article-title><source>Bioinformatics</source><year>2002</year><volume>18</volume><fpage>100</fpage><lpage>108</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/18.1.100</pub-id><pub-id pub-id-type="pmid">11836217</pub-id></citation></ref>
<ref id="b19-ijms-11-01141"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stuart</surname><given-names>GW</given-names></name><name><surname>Moffet</surname><given-names>K</given-names></name><name><surname>Leader</surname><given-names>JJ</given-names></name></person-group><article-title>A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes</article-title><source>Mol. Biol. Evol</source><year>2002</year><volume>19</volume><fpage>554</fpage><lpage>562</lpage><pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a004111</pub-id><pub-id pub-id-type="pmid">11919297</pub-id></citation></ref>
<ref id="b20-ijms-11-01141"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname><given-names>KH</given-names></name><name><surname>Li</surname><given-names>CP</given-names></name><name><surname>Qi</surname><given-names>J</given-names></name></person-group><article-title>Ribosomal RNA as molecular barcodes: a simple correlation analysis without sequence alignment</article-title><source>Bioinformatics</source><year>2006</year><volume>22</volume><fpage>1690</fpage><lpage>1701</lpage><pub-id pub-id-type="doi">10.1093/bioinformatics/btl146</pub-id><pub-id pub-id-type="pmid">16613905</pub-id></citation></ref>
<ref id="b21-ijms-11-01141"><label>21.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Chan</surname><given-names>RHF</given-names></name><name><surname>Wang</surname><given-names>RW</given-names></name><name><surname>Wong</surname><given-names>JCF</given-names></name></person-group><article-title>Maximum Entropy Method for Composition Vector Method</article-title><source>Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications (Wiley Series in Bioinformatics)</source><person-group person-group-type="editor"><name><surname>Elloumi</surname><given-names>M</given-names></name><name><surname>Zomaya</surname><given-names>A</given-names></name></person-group><publisher-name>Wiley-Blackwell</publisher-name><publisher-loc>Oxford, UK</publisher-loc><year>2010</year></citation></ref>
<ref id="b22-ijms-11-01141"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>S</given-names></name><name><surname>Ma</surname><given-names>B</given-names></name><name><surname>Zhang</surname><given-names>K</given-names></name></person-group><article-title>On the similarity metric and the distance metric</article-title><source>Theor. Comp. Sci</source><year>2009</year><volume>410</volume><fpage>2365</fpage><lpage>2376</lpage><pub-id pub-id-type="doi">10.1016/j.tcs.2009.02.023</pub-id></citation></ref>
<ref id="b23-ijms-11-01141"><label>23.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Xie</surname><given-names>H-M</given-names></name></person-group><source>Grammatical Complexity and One-Dimensional Dynamical Systems</source><publisher-name>World Scientific</publisher-name><publisher-loc>Singapore</publisher-loc><year>1996</year></citation></ref>
<ref id="b24-ijms-11-01141"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saitou</surname><given-names>N</given-names></name><name><surname>Nei</surname><given-names>M</given-names></name></person-group><article-title>The neighbor-joining method: a new method for reconstructing phylogenetic trees</article-title><source>Mol. Biol. Evol</source><year>1987</year><volume>4</volume><fpage>406</fpage><lpage>425</lpage><pub-id pub-id-type="pmid">3447015</pub-id></citation></ref>
<ref id="b25-ijms-11-01141"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huson</surname><given-names>DH</given-names></name><name><surname>Bryant</surname><given-names>D</given-names></name></person-group><article-title>Application of phylogenetic networks in evolutionary studies</article-title><source>Mol. Biol. Evol</source><year>2006</year><volume>23</volume><fpage>254</fpage><lpage>267</lpage><pub-id pub-id-type="pmid">16221896</pub-id></citation></ref>
<ref id="b26-ijms-11-01141"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tamura</surname><given-names>K</given-names></name><name><surname>Dudley</surname><given-names>J</given-names></name><name><surname>Nei</surname><given-names>M</given-names></name><name><surname>Kumar</surname><given-names>S</given-names></name></person-group><article-title>MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0</article-title><source>Mol. Biol. Evol</source><year>2007</year><volume>24</volume><fpage>1596</fpage><lpage>1599</lpage><pub-id pub-id-type="doi">10.1093/molbev/msm092</pub-id><pub-id pub-id-type="pmid">17488738</pub-id></citation></ref>
<ref id="b27-ijms-11-01141"><label>27.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cavalli-Sforza</surname><given-names>LL</given-names></name><name><surname>Edwards</surname><given-names>AWF</given-names></name></person-group><article-title>Phylogenetic analysis: Models and estimation procedures</article-title><source>Am. J. Hum. Gen</source><year>1967</year><volume>19</volume><fpage>233</fpage><lpage>257</lpage></citation></ref>
<ref id="b28-ijms-11-01141"><label>28.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Weir</surname><given-names>BS</given-names></name></person-group><source>Genetic Data Analysis II: Methods for Discrete Population Genetic Data</source><edition>2nd ed</edition><publisher-name>Sinauer Assoc.</publisher-name><publisher-loc>Sunderland, MA, USA</publisher-loc><year>1996</year></citation></ref>
<ref id="b29-ijms-11-01141"><label>29.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Takezaki</surname><given-names>N</given-names></name><name><surname>Nei</surname><given-names>M</given-names></name></person-group><article-title>Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA</article-title><source>Genetics</source><year>1996</year><volume>144</volume><fpage>389</fpage><lpage>399</lpage><pub-id pub-id-type="pmid">8878702</pub-id></citation></ref>
<ref id="b30-ijms-11-01141"><label>30.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Causton</surname><given-names>HC</given-names></name><name><surname>Quackenbush</surname><given-names>J</given-names></name><name><surname>Brazma</surname><given-names>A</given-names></name></person-group><source>Microarray Gene Expression Data Analysis: A Beginner’s Guide</source><publisher-name>Wiley-Blackwell</publisher-name><publisher-loc>Oxford, UK</publisher-loc><year>2003</year></citation></ref>
<ref id="b31-ijms-11-01141"><label>31.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guyon</surname><given-names>F</given-names></name><name><surname>Brochier-Armanet</surname><given-names>C</given-names></name><name><surname>Guenoche</surname><given-names>A</given-names></name></person-group><article-title>Comparison of alignment free string distances for complete genome phylogeny</article-title><source>Adv. Data Anal. Classif</source><year>2009</year><volume>3</volume><fpage>95</fpage><lpage>108</lpage><pub-id pub-id-type="doi">10.1007/s11634-009-0041-z</pub-id></citation></ref>
<ref id="b32-ijms-11-01141"><label>32.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname><given-names>ZG</given-names></name><name><surname>Chu</surname><given-names>KH</given-names></name><name><surname>Li</surname><given-names>CP</given-names></name><name><surname>Zhou</surname><given-names>LQ</given-names></name><name><surname>Anh</surname><given-names>VV</given-names></name></person-group><article-title>Simple correlation analysis for vertebrate phylogeny based on complete mitochondrial genomes</article-title><source>Sci. China Ser. C</source><year>2008</year><comment>submitted for publication.</comment></citation></ref>
<ref id="b33-ijms-11-01141"><label>33.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pollock</surname><given-names>DD</given-names></name><name><surname>Eisen</surname><given-names>JA</given-names></name><name><surname>Doggett</surname><given-names>NA</given-names></name><name><surname>Cummings</surname><given-names>MP</given-names></name></person-group><article-title>A case for evolutionary genomics and the comprehensive examination of sequence biodiversity</article-title><source>Mol. Biol. Evol</source><year>2000</year><volume>17</volume><fpage>1776</fpage><lpage>1788</lpage><pub-id pub-id-type="doi">10.1093/oxfordjournals.molbev.a026278</pub-id><pub-id pub-id-type="pmid">11110893</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures</title>
<fig id="f1-ijms-11-01141" position="float">
<label>Figure 1.</label>
<caption>
<p>Plot of the mean value of <italic>X</italic> over all <italic>K</italic>-strings as a function of <italic>K</italic>. The abbreviations “Mycge”, “PorpuC” and “Dvir” denote genomes taken from our first three datasets.</p></caption><graphic xlink:href="ijms-11-01141f1.gif"/></fig>
<fig id="f2-ijms-11-01141" position="float">
<label>Figure 2.</label>
<caption>
<p>Phylogeny of 109 organisms (prokaryotes and eukaryotes) using the dynamical language approach with chord distance in the case <italic>K</italic> = 6 based on all protein sequences.</p></caption><graphic xlink:href="ijms-11-01141f2.gif"/></fig>
<fig id="f3-ijms-11-01141" position="float">
<label>Figure 3.</label>
<caption>
<p>Phylogeny of chloroplast genomes using the dynamical language approach with piecewise distance in the case <italic>K</italic> = 6 based on all protein sequences.</p></caption><graphic xlink:href="ijms-11-01141f3.gif"/></fig>
<fig id="f4-ijms-11-01141" position="float">
<label>Figure 4.</label>
<caption>
<p>The NJ tree of mitochondrial genomes based on the whole DNA sequences using the dynamical language approach with chord distance in the case <italic>K</italic> = 11. In this tree the birds and reptiles group together as Archosauria.</p></caption><graphic xlink:href="ijms-11-01141f4.gif"/></fig>
<fig id="f5-ijms-11-01141" position="float">
<label>Figure 5.</label>
<caption>
<p>Phylogeny of 62 alpha-proteobacteria using the dynamical language approach with chord distance in the cases <italic>K</italic> = 5 and 6 based on all protein sequences. The trees obtained by the dynamical language approach with the pseudo-distance in [<xref ref-type="bibr" rid="b5-ijms-11-01141">5</xref>] and with the piecewise distance, in the cases <italic>K</italic> = 5 and 6 based on all protein sequences, have the same topology as the tree in this figure.</p></caption><graphic xlink:href="ijms-11-01141f5.gif"/></fig></sec></back></article>
