1. Introduction
Early phylogenies came without significance tests. It thus remained unclear whether the reconstructed tree was significantly better than an alternative tree or how reliably individual nodes would be recovered if a new set of characters were sampled. Of these two types of analyses, assessing whole trees
vs. assessing individual clades of a given tree, it is the latter that is most commonly carried out. And among the methods available for doing this, the bootstrap is the most widely used [1].
The bootstrap is a simple but highly effective method for solving the following problem in statistics: given a sample of n measurements, what is the distribution of, say, the mean of these measurements if we do not know the null distribution from which the original measurements were drawn? The solution using the bootstrap consists of drawing n measurements with replacement from the original sample and recalculating the statistic of interest, the mean in our example [2]. By repeating this many times, the null distribution of the statistic is generated, which can be compared to another sample in order to test the null hypothesis that the two samples were drawn from the same population [3].
This example shows two things: first, the bootstrap is only practical if computing is inexpensive, as it has been since the introduction of the PC in the mid-1980s. Second, in the limit of a large sample size, bootstrap samples become identical to the original sample.
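The resampling procedure for the mean can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited references; the function name `bootstrap_means`, the example measurements, the number of replicates, and the fixed seed are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_means(sample, replicates=1000, seed=0):
    """Return the means of `replicates` bootstrap samples of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(replicates):
        # Draw n measurements with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        # Recalculate the statistic of interest, here the mean.
        means.append(statistics.fmean(resample))
    return means


# Hypothetical measurements, used only for illustration.
measurements = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4]
means = bootstrap_means(measurements)

# The spread of the bootstrapped means approximates the sampling
# variation of the mean without assuming a particular distribution.
spread = statistics.stdev(means)
```

Each replicate costs one pass over the sample, which is why the method only became practical once computing was cheap.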
Felsenstein introduced the bootstrap in phylogeny reconstruction [4]: Consider an alignment of DNA sequences as an m by n matrix of nucleotides, where rows represent taxa and columns represent homologous residues (Figure 1, top row). Compute a tree from this data matrix. Then, construct a pseudo-sample by drawing with replacement n columns from the original sample. This pseudo-sample is called a bootstrap sample. Compute the tree from the bootstrap sample and repeat this many times. Record the number of times each clade of the original tree appears in the bootstrapped trees. This value is called the bootstrap support value (Figure 1, bottom row).
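The column-resampling scheme can be sketched as follows. This is an illustrative toy, not the procedure of any particular phylogenetics package: tree building is replaced by a simple UPGMA clustering on Hamming distances, which is an assumption made to keep the example self-contained, and the function names and the four toy sequences are hypothetical. Support is reported as the fraction of replicates containing a clade rather than as a raw count.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Fraction of alignment columns at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_clades(names, seqs):
    """Cluster taxa by UPGMA and return the non-trivial clades as frozensets."""
    seq = dict(zip(names, seqs))
    d = {frozenset(p): hamming(seq[p[0]], seq[p[1]])
         for p in combinations(names, 2)}

    def avg(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(d[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [frozenset([n]) for n in names]
    clades = set()
    while len(clusters) > 2:  # the final merge would only yield the root
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        clades.add(merged)
    return clades


def bootstrap_support(names, seqs, replicates=100, seed=0):
    """Fraction of bootstrap trees that contain each clade of the original tree."""
    rng = random.Random(seed)
    n = len(seqs[0])
    counts = {clade: 0 for clade in upgma_clades(names, seqs)}
    for _ in range(replicates):
        # Draw n alignment columns with replacement: one bootstrap sample.
        cols = rng.choices(range(n), k=n)
        resampled = ["".join(s[j] for j in cols) for s in seqs]
        for clade in upgma_clades(names, resampled):
            if clade in counts:
                counts[clade] += 1
    return {clade: c / replicates for clade, c in counts.items()}


# Toy alignment (m = 4 taxa, n = 20 columns), for illustration only.
names = ["A", "B", "C", "D"]
seqs = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGAACGTACGT",
    "TTGTACCTACGTAAGTACGG",
    "TTGTACCTACGTAAGTACGC",
]
support = bootstrap_support(names, seqs)
```

Note that the resampling step indexes whole columns, so it presupposes that all taxa share one multiple alignment.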
Assigning bootstrap values to individual nodes has become standard practice in alignment-based phylogeny reconstruction. However, computing alignments of very long sequences, such as the megabase-sized genomes of bacteria or the gigabase-sized genomes of mammals, is computationally demanding. Nevertheless, an increasing number of bacterial outbreaks are being tracked by whole genome sequencing. For example, 3085 strains of Streptococcus pneumoniae, each 2.2 Mb long, were sequenced during an outbreak of this human pathogen [5]. A quick way to cluster sequence samples of this magnitude is highly desirable.
Perhaps surprisingly, such clustering can be carried out without alignment [6,7]. Now, without alignment, the original bootstrap can no longer be applied as it relies on resampling columns of homologous nucleotides. However, one might argue that for megabase-long sequences and beyond, the bootstrap reaches the limit in which it cannot generate any useful variation.
Here, we investigate this problem for our recently published distance estimation program andi [8]. It computes distances from approximate pairwise local alignments. Using suffix arrays, these approximate pairwise alignments can be computed very quickly; for example, 3085 S. pneumoniae strains are clustered on a 24-core computer in 4:37 h using 9.2 GB of RAM. However, the classical bootstrap is not applicable to pairwise alignments, and we propose two alternatives: pairwise bootstrap and quartet analysis. Pairwise bootstrap is a new variant of the Felsenstein bootstrap, while quartet analysis, which evaluates the agreement between a phylogeny and the underlying distance matrix, is taken from the literature [9]. We explore both methods by comparing them to the classical bootstrap when applied to simulated datasets, where pairwise bootstrap clearly outperforms quartet analysis. We also analyze two empirical datasets. The first comprises 53 human mitochondrial genomes, which are relatively short with only 16.6 kb each. The second dataset contains 29 complete E. coli/Shigella genomes, which are roughly 300 times longer than the mitochondrial genomes. Pairwise bootstrap outperforms quartet analysis when applied to the mitochondrial genomes. However, the converse is true for the E. coli dataset.
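To illustrate what checking a tree against its distance matrix can look like, the sketch below uses the four-point condition: for four taxa a, b, c, d, the distances favor the split ab|cd when d(a,b) + d(c,d) is the smallest of the three possible pairwise sums. This is only one plausible way to score a clade and is an assumption made for illustration; it is not claimed to be the algorithm of reference [9] or of afra. The function name `quartet_support` and the toy distances are hypothetical.

```python
from itertools import combinations


def quartet_support(dist, clade, taxa):
    """Fraction of quartets (two taxa inside `clade`, two outside) whose
    distances favor the split implied by the clade."""
    inside = sorted(clade)
    outside = sorted(set(taxa) - set(clade))
    agree = total = 0
    for a, b in combinations(inside, 2):
        for c, d in combinations(outside, 2):
            split = dist[a][b] + dist[c][d]  # sum matching the tree's split
            alt1 = dist[a][c] + dist[b][d]   # the two competing pairings
            alt2 = dist[a][d] + dist[b][c]
            total += 1
            if split < min(alt1, alt2):
                agree += 1
    return agree / total if total else None


# Toy distance matrix in which A,B and C,D form tight pairs.
taxa = ["A", "B", "C", "D"]
dist = {a: {b: 0.0 for b in taxa} for a in taxa}
for a, b, v in [("A", "B", 2), ("C", "D", 2), ("A", "C", 6),
                ("A", "D", 6), ("B", "C", 6), ("B", "D", 6)]:
    dist[a][b] = dist[b][a] = v

good = quartet_support(dist, {"A", "B"}, taxa)  # clade consistent with the distances: 1.0
bad = quartet_support(dist, {"A", "C"}, taxa)   # clade contradicted by the distances: 0.0
```

Because the score depends only on which pairing has the smallest sum, it does not shrink toward a fixed value as sequences get longer, in contrast to resampling-based support.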
4. Conclusions
Our new pairwise bootstrap scheme for andi emulates classical bootstrap values. With simulated data, the fit is quite good. With real data, where the estimation error is greater, classical bootstrap values are still approximated for short sequences, such as human mitochondrial genomes, which comprise approximately 16.6 kb. However, for longer sequences, such as E. coli genomes (5 Mb), the error in estimating evolutionary distances using andi can overwhelm the sensitivity of the pairwise bootstrap. In this situation, quartet analysis may be a suitable alternative. A topological difference between data and tree remains detectable by quartet analysis regardless of the size of the dataset, making this method immune to the saturation of the bootstrap with large samples. Our implementation of quartet analysis, afra, is efficient enough to analyze distance matrices for thousands of taxa in a few hours.