# Support Values for Genome Phylogenies


## Abstract


## 1. Introduction

`andi` [8]. It computes distances from approximate pairwise local alignments. Using suffix arrays, these approximate pairwise alignments can be computed very quickly; for example, 3085 S. pneumoniae strains are clustered on a 24-core computer in 4:37 h using 9.2 GB of RAM. However, the classical bootstrap is not applicable to pairwise alignments, and we propose two alternatives: pairwise bootstrap and quartet analysis. Pairwise bootstrap is a new variant of the Felsenstein bootstrap, while quartet analysis, which evaluates the agreement between a phylogeny and the underlying distance matrix, is taken from the literature [9]. We explore both methods by comparing them to the classical bootstrap when applied to simulated datasets, where pairwise bootstrap clearly outperforms quartet analysis. We also analyze two empirical datasets. The first comprises 53 human mitochondrial genomes, which are relatively short at only 16.6 kb each. The second dataset contains 29 complete E. coli/Shigella genomes, which are roughly 300 times longer than the mitochondrial genomes. Pairwise bootstrap outperforms quartet analysis when applied to the mitochondrial genomes. However, the converse is true for the E. coli dataset.

## 2. Methods and Data

#### 2.1. Classical Bootstrap

`dnaDist`, the sources and documentation of which are available from the website accompanying this paper:

http://evolbioinf.github.io/life2015
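As a rough illustration of the classical bootstrap, the sketch below resamples the columns of a toy multiple alignment with replacement and computes a Jukes–Cantor distance matrix for each replicate. It is not the code of `dnaDist`; the function names, the toy alignment, and the number of replicates are assumptions chosen for illustration.

```python
import math
import random


def jc_distance(a, b):
    """Jukes-Cantor distance between two aligned, gap-free sequences."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:  # the correction is undefined for saturated sequences
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; all rows share the same columns."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in cols) for row in alignment]


def distance_matrix(alignment):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(alignment)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jc_distance(alignment[i], alignment[j])
    return d


# Toy alignment (hypothetical data) and 100 bootstrap replicates.
alignment = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTTA"]
rng = random.Random(1)
replicates = [distance_matrix(bootstrap_alignment(alignment, rng))
              for _ in range(100)]
```

Each replicate matrix would then be passed to a tree-building method, and the resulting trees summarized as a consensus tree with support values.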

#### 2.2. Quartet Analysis

`PhyD*` implements quartet analysis [10].

`afra` with a view toward maximizing efficiency.
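The following sketch shows one way the agreement between a tree edge and a distance matrix might be scored with quartets. It is not the code of `afra` or `PhyD*`, and it rests on two assumptions: that an internal edge separates four subtrees with leaf sets `s1`, `s2` on one side and `s3`, `s4` on the other, and that a quartet supports the edge when the four-point condition favours the pairing induced by the tree. All names are hypothetical.

```python
from itertools import product


def supports_split(dist, a, b, c, d):
    """True if the distances favour the pairing ab|cd over ac|bd and ad|bc."""
    within = dist[a][b] + dist[c][d]
    return within < min(dist[a][c] + dist[b][d], dist[a][d] + dist[b][c])


def edge_support(dist, s1, s2, s3, s4):
    """Fraction of quartets, one leaf per subtree, that agree with the edge."""
    quartets = list(product(s1, s2, s3, s4))
    good = sum(supports_split(dist, *q) for q in quartets)
    return good / len(quartets)


# Additive toy distances for the tree ((0,1),(2,3)) with unit branch lengths.
dist = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(edge_support(dist, [0], [1], [2], [3]))  # edge present in the tree
print(edge_support(dist, [0], [2], [1], [3]))  # edge contradicting the tree
```

In this scheme every internal edge inspects all quartets with one leaf per subtree, so the work per edge grows quickly with the number of taxa.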

#### 2.3. Pairwise Bootstrap

`andi`[8].
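A minimal sketch of the idea behind a pairwise bootstrap is given below. It assumes that each pair of genomes has its own pairwise alignment, summarized here by the number of aligned columns and the number of mismatches, and that each pair is resampled independently. This is an illustration under those assumptions, not the implementation in `andi`; the genome labels and counts are hypothetical.

```python
import math
import random


def jc(p):
    """Jukes-Cantor correction of a mismatch proportion p."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_bootstrap_distance(columns, rng):
    """Resample the columns of ONE pairwise alignment with replacement.

    columns: list of booleans, True where the two aligned nucleotides differ.
    """
    sample = rng.choices(columns, k=len(columns))
    return jc(sum(sample) / len(sample))


rng = random.Random(7)
# Hypothetical pairwise alignments: (aligned columns, mismatches) per pair.
pairs = {("g1", "g2"): (1000, 10),
         ("g1", "g3"): (900, 27),
         ("g2", "g3"): (950, 30)}

replicate = {}
for pair, (n_cols, n_mis) in pairs.items():
    columns = [True] * n_mis + [False] * (n_cols - n_mis)
    replicate[pair] = pairwise_bootstrap_distance(columns, rng)
```

Unlike the classical bootstrap, no single set of resampled columns is shared across all taxa, because the pairwise alignments differ in length and content.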

#### 2.4. Simulation

- Convert the output of `ms` to an alignment of DNA sequences, A, using `ms2dna`.
- Subject A to bootstrap analysis using `dnaDist`.
- Compute the consensus tree and support values from the output of `dnaDist` using the program `consense`, which is part of the PHYLIP package [13].
- Subject A to pairwise bootstrap analysis as implemented in the latest version of `andi` [8] and also calculate the consensus tree using `consense`.
- Use `afra` to carry out quartet analysis on `andi`-distances computed from A.
- For each cluster in the consensus tree, extract the three support values (classical, pairwise, and quartet) using the program `correlation.js`.
- Repeat.

#### 2.5. Resource Consumption

`/usr/bin/time -f "elapsed\t%Es\nuser\t%Us\nmem\t%MkB\n" andi -b 1000 foo.fasta > foo.dist`

for pairwise bootstrap analysis, and

`/usr/bin/time -f "elapsed\t%Es\nuser\t%Us\nmem\t%MkB\n" java -Xmx4096m -jar PhyDstar.jar -c -i foo.dist`

for `PhyD*` [10].

#### 2.6. Data

#### 2.7. Alignment and Phylogeny Computation

`clustalw` [15]. The 29 E. coli/Shigella genomes were aligned using the fast genome aligner `mugsy` [16]. As described for the simulations (Section 2.4), Jukes–Cantor distances were computed and bootstrapped using `dnaDist`, and the distance matrices were subjected to neighbor joining as implemented in the program `clustDist`. Consensus trees were computed using the PHYLIP program `consense` [13], and the output of two `consense` runs was compared using the program `correlation.js`. Trees were midpoint-rooted using `retree`, which is also part of PHYLIP.

## 3. Results and Discussion

#### 3.1. Resource Consumption

#### Pairwise Bootstrap

#### Quartet Analysis

`afra` takes approximately 0.04 s for 100 taxa, while the reference implementation, `PhyD*` [10], requires approximately 4 s, that is, 100 times longer. However, the run time of both programs increases roughly ten-fold for a doubling of the number of taxa. This deviates substantially from the theoretical $O\left({n}^{5}\right)$ run time, according to which a doubling of sample size should result in a 32-fold increase in run time. We do not know the reason for this discrepancy, but it illustrates the importance of empirical resource measurements when analyzing software.
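The size of this discrepancy can be made concrete with a short calculation. The sketch below assumes that run time follows a power law in the number of taxa and derives the exponent implied by the observed ten-fold increase per doubling; the function name is illustrative.

```python
import math


def scaling_exponent(time_ratio, size_ratio=2.0):
    """Exponent k such that run time grows like n**k, given observed ratios."""
    return math.log(time_ratio) / math.log(size_ratio)


predicted_ratio = 2 ** 5                    # O(n^5): doubling n gives 32-fold
observed_exponent = scaling_exponent(10.0)  # ten-fold per doubling

print(predicted_ratio)              # 32
print(round(observed_exponent, 2))  # 3.32
```

Under this power-law assumption, the measurements correspond to an exponent of a little over 3 rather than 5.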

`afra` is less than 1 MB for 100 taxa, while `PhyD*` uses approximately 100 MB for the same number of taxa (Figure 6B). However, memory consumption grows at similar rates for both applications.

`andi` in 4:37 h. Quartet analysis then took 2:18 h and occupied 150 MB of memory. This shows that quartet analysis scales well to large datasets.

#### 3.2. Accuracy

#### 3.3. Application to Real Sequence Data

#### Human Mitochondrial Genomes

`andi` is designed for analyzing closely related genomes, which are increasingly often collected in the course of pathogen outbreaks. For our second and final empirical example, we therefore use a benchmark set of 29 E. coli/Shigella genomes. Figure 10A shows the tree computed from alignment-based distances. Bootstrapping this alignment yields only a single clade with support less than 100%. This clade has a bootstrap support of 53% and comprises six uropathogenic E. coli strains thought to be affected by horizontal gene transfer [17]. Interestingly, the uropathogenic clade also contains the only major topological difference to the tree computed from `andi`-distances in Figure 10B: strains 536 and ED1a have switched positions. However, pairwise bootstrap fails to flag this clade; the only clades with pairwise bootstrap values less than 100% are part of the cluster of four very similar K12 strains. Quartet analysis, on the other hand, returns non-maximal support values even outside the K12 clade. In particular, it flags the group of uropathogenic strains. Note here that classical bootstrap evaluates individual nodes, while quartets refer to edges. In the case of the uropathogenic bacteria, the two outgoing edges are flagged by quartet analysis, as desired. In addition, quartet analysis indicates that two more clades in the flexneri/sonnei clade might be problematic, with support values of 72% and 77%.

`andi` is large compared to the variance of the number of mismatches per site when bootstrapped from megabase-long genomes, such as those of E. coli. One indicator of the error in `andi` measurements is the difference between the estimates based on the two possible query/subject labellings [8]. For E. coli, this is often a few percent (not shown). Compare this to a mismatch rate of, say, 1% between two typical E. coli genomes of length 5 Mb. The variance of the number of mismatches is $5\times {10}^{6}\times 0.01\times 0.99\approx 5\times {10}^{4}$, and the standard deviation of the per-site mismatch rate is $\sqrt{5\times {10}^{4}}/(5\times {10}^{6})\approx 4.5\times {10}^{-5}$. In other words, about 95% of the bootstrapped mismatch rates fall within an interval of $0.01\pm 9\times {10}^{-5}$. Note that the numerator of the standard deviation is proportional to the square root of the sequence length, while the denominator is proportional to the untransformed sequence length. As a result, bootstrap variation decreases with sequence length.
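The binomial argument above can be evaluated numerically for both genome sizes considered in this paper. The sketch assumes independent sites and, purely for comparison, the same 1% mismatch rate for both sequence lengths; the function name is illustrative.

```python
import math


def mismatch_rate_sd(length, rate):
    """Standard deviation of the per-site mismatch rate under binomial sampling."""
    variance_of_count = length * rate * (1.0 - rate)
    return math.sqrt(variance_of_count) / length


rate = 0.01  # assumed mismatch rate, applied to both lengths for comparison
for name, length in [("mitochondrion", 16_600), ("E. coli", 5_000_000)]:
    sd = mismatch_rate_sd(length, rate)
    print(f"{name}: sd = {sd:.2e}, relative to rate = {sd / rate:.2%}")
```

Because the standard deviation scales with one over the square root of the sequence length, the relative bootstrap variation for the 5 Mb genomes is more than an order of magnitude smaller than for the 16.6 kb genomes.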

## 4. Conclusions

`andi` emulates classical bootstrap values. With simulated data, the fit is quite good. With real data, where the estimation error is greater, classical bootstrap values are still approximated for short sequences, such as human mitochondrial genomes, which comprise approximately 16.6 kb. However, for longer sequences, such as E. coli genomes (5 Mb), the error in estimating evolutionary distances using `andi` can overwhelm the sensitivity of the pairwise bootstrap. In this situation, quartet analysis may be a suitable alternative. A topological difference between data and tree remains detectable by quartet analysis regardless of the size of the dataset, making this method immune to the saturation of the bootstrap with large samples. Our implementation of quartet analysis, `afra`, is efficient enough to analyze distance matrices for thousands of taxa in a few hours.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Soltis, P.S.; Soltis, D.E. Applying the bootstrap in phylogeny reconstruction. Stat. Sci. **2003**, 18, 256–267.
2. Efron, B. Bootstrap methods: Another look at the Jackknife. Ann. Stat. **1979**, 7, 1–26.
3. Diaconis, P.; Efron, B. Computer-intensive methods in statistics. Sci. Am. **1983**, 248, 116–130.
4. Felsenstein, J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution **1985**, 39, 783–791.
5. Chewapreecha, C.; Harris, S.R.; Croucher, N.J.; Turner, C.; Marttinen, P.; Cheng, L.; Pessia, A.; Aanensen, D.M.; Mather, A.E.; Page, A.J.; et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat. Genet. **2014**, 46, 305–309.
6. Haubold, B. Alignment-free phylogenetics and population genetics. Brief. Bioinform. **2014**, 15, 407–418.
7. Vinga, S.; Almeida, J. Alignment-free sequence comparison—A review. Bioinformatics **2003**, 19, 513–523.
8. Haubold, B.; Klötzl, F.; Pfaffelhuber, P. Andi: Fast and accurate estimation of evolutionary distances between closely related genomes. Bioinformatics **2015**, 31, 1169–1175.
9. Guénoche, A.; Garreta, H. Can we have confidence in a tree representation? In JOBIM; Gascuel, O., Sagot, M., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2000; Volume 2066, pp. 45–56.
10. Criscuolo, A.; Gascuel, O. Fast NJ-like algorithms to deal with incomplete distance matrices. BMC Bioinform. **2008**, 9, 166.
11. Felsenstein, J. Inferring Phylogenies; Sinauer: Sunderland, MA, USA, 2004.
12. Hudson, R.R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics **2002**, 18, 337–338.
13. Felsenstein, J. PHYLIP (Phylogeny Inference Package) version 3.6, 2005. Available online: http://evolution.genetics.washington.edu/phylip.html (accessed on 25 February 2016).
14. Ingman, M.; Kaessmann, H.; Pääbo, S.; Gyllensten, U. Mitochondrial genome variation and the origin of modern humans. Nature **2000**, 408, 708–713.
15. Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A.; McWilliam, H.; Valentin, F.; Wallace, I.M.; Wilm, A.; Lopez, R.; Thompson, J.D.; Gibson, T.J.; Higgins, D.G. Clustal W and Clustal X version 2.0. Bioinformatics **2007**, 23, 2947–2948.
16. Angiuoli, S.V.; Salzberg, S.L. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics **2011**, 27, 334–342.
17. Domazet-Lošo, M.; Haubold, B. Alignment-free detection of local similarity among viral and bacterial genomes. Bioinformatics **2011**, 27, 1466–1472.

**Figure 1.** Cartoon of classical bootstrap. The columns of the original alignment (top row) are repeatedly resampled with replacement (second row). Distance matrices are computed from the bootstrap samples (third row) and summarized as phylogenies (fourth row). The clades in the bootstrapped phylogenies are summarized in a consensus tree with support values written next to the nodes (fifth row). A dot indicates a match to the nucleotide in the top row.

**Figure 3.** Illustration of anchors marked in red and blue for computing the anchor distance between the toy genomes ${g}_{1}$ and ${g}_{2}$.

**Figure 4.** Pairwise bootstrap samples based on the anchors shown in Figure 3. Dots indicate matching nucleotides.

**Figure 5.** Time (**A**) and memory (**B**) required by `andi` to compute 1000 bootstrap replicates as a function of the number of taxa for sequences of 100 kb and 1 Mb length, L. The vertical line in (**B**) is at 24, the number of cores on the test computer.

**Figure 6.** Comparing time (**A**) and memory (**B**) consumption between two implementations of quartet analysis, the published `PhyD*` [10] and our own program, `afra`, when applied to distance matrices of varying size.

**Figure 7.** Average support values as a function of classical bootstrap support. All simulations with sample size $n=20$, 1% polymorphisms per position, and ${10}^{4}$ iterations. The comparison along rows shows the effect of increasing the sequence length, L, from 10 kb (**A**, **C**) to 100 kb (**B**, **D**). The comparison along the columns shows the effect of increasing the rate of recombination per nucleotide, ρ, from 0 (**A**, **B**) to $2\times {10}^{-4}$ (**C**, **D**). See Section 2.4 for details.

**Figure 8.** Phylogeny of humans computed from 53 complete mitochondrial genomes [14]. Example bootstrap support values are quoted for two nodes: C: classical alignment-based; P: pairwise bootstrap of `andi` distances; Q: quartet analysis of `andi` distances.

**Figure 10.** Phylogeny of 29 strains of Escherichia coli/Shigella computed from their full genomes. (**A**) Alignment-based; (**B**) `andi`-distances; the numbers refer to bootstrap support less than 100%; P: pairwise bootstrap; unmarked values in (**B**) refer to quartet support.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Klötzl, F.; Haubold, B.
Support Values for Genome Phylogenies. *Life* **2016**, *6*, 11.
https://doi.org/10.3390/life6010011
