1. Introduction
Phylogenetics is one of the oldest fields in biology to study the evolutionary history of organisms using a phylogenetic tree, which is a tree representation of evolutionary history among species (or taxa). In order to reconstruct a phylogenetic tree from genetic data, researchers develop many statistical methods including maximum likelihood estimators, Bayesian inference, and distance-based methods [
1]. A distance-based method is one of the most popular methods to reconstruct a phylogenetic tree for its computational speed and relatively simple two-step procedure: (1) computing all pairwise distances between all possible pair of sequences from the input alignment; and (2) reconstructing a phylogenetic tree from all pairwise distances of sequences computed in Step (1) using combinatorics.
Maximum likelihood estimators under evolutionary models produce pairwise distances between all possible pairs of sequences and they form as a
distance matrix. A distance matrix is an input for a distance-based method to reconstruct a phylogenetic tree and we can consider them as a multivariate random variable and these distance-based and probabilistic methods do not always return the true phylogenetic tree topology. Therefore, we have to measure the robustness of a method and one metric of the robustness of a distance-based method for phylogenetic tree reconstruction is called the
safety radius of the method. A safety radius is a radius of all distance matrices such that a given distance-based method returns the “true tree topology” of a phylogenetic tree. This means that all distance matrices within the safety radius satisfy a precise combinatorial condition so that the distance-based method is guaranteed to return the true tree topology [
2].
In 2015, Steel and Gascuel introduced a notion of
Stochastic safety radius in [
2] for analyzing the probability for a distance-based method to return the true tree topology from a given distance matrix. In 2017, Xi et al. worked on developing a stochastic safety radius using the neighbor-joining (NJ) method and balance minimal evolution method for trees with number of leaves equal to 4 or 5 [
3].
Phylogenomics is a new field, which applies tools from phylogenetics to genome data. In phylogenomics, we often conduct the species tree and gene trees analyses using the multi-species coalescent model [
4]. Under the multi-species coalescent model, we assume that all gene trees, phylogenetic trees reconstructed from genes, are
equidistant trees [
4]. An equidistant tree is a rooted tree whose total branch length, from its root to each leaf, is the same for all leaves (an example of an equidistant tree with three leaves is shown in
Figure 1).
In the past five years, there has been much work on the
space of all equidistant trees for phylogenomics [
5,
6,
7,
8,
9,
10,
11]. However, these recent studies assume that all given equidistant trees are true trees or close to true trees, which is often not true. Davidson and Sullivant worked on variability of a distance-based method to reconstruct an equidistant tree from all pairwise distances, called the
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) using polyhedral geometry [
12]. They study how UPGMA project a given distance matrix to the space of all equidistant trees so that their result is from the view of polyhedral geometry and deterministic.
In this paper, therefore, we focus on the stochastic safety radius of the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), one of the most popular distance-based methods and a lower bound of its probability that the UPGMA method returns the true tree topology from a random input distance matrix with noise distributed from the Gaussian distribution. Note that UPGMA is a hierarchical clustering method that builds a dendrogram from a distance matrix which records pairwise “distances” defined by a user input metric between all pairs of observations in a higher dimensional vector space [
13]. In the application to a phylogenetic tree reconstruction, we use a distance matrix which contains pairwise distances between all pairs of sequences in the input alignment via a user input evolutionary model [
1].
A phylogenetic tree is a weighted tree with labeled external nodes, called
leaves, and unlabeled internal nodes. These labels represent species or taxa at the present time and each internal node represents a common ancestor for all of the leaves below this internal node. A weight on each branch (or edge) of a phylogenetic tree represents a mutation rate combined with its evolutionary time from an ancestor to its descendent. A phylogenic tree can be rooted or unrooted. For more details, see [
1].
Example 1. Suppose we have a label set for leaves which represents a set of species at the present time. Suppose we have a rooted phylogenetic tree T shown in Figure 1. Internal nodes on T do not have labels. The internal node of the ancestor of leaves represents the most common ancestral species of species and the root of T is the most common ancestor of all species . Each branch length on a branch represents the mutation rate combined with its evolutionary time. A distance matrix computed from this tree shown in Figure 1 is a matrix such that , the th cell of the matrix d is the total branch length from a leaf to a leaf , that is, Since d is computed from a phylogenetic tree, d is a tree metric. Not all symmetric matrices with diagonal elements equal to 0 are not tree metrics. Through this paper, we assume that binary phylogenetic trees and the smallest branch length of an internal edge in a binary phylogenetic tree are strictly positive. In this paper, we focus on equidistant trees which are rooted phylogenetic trees with branch lengths such that the total branch length from its root to each leaf is the same.
Example 2. We consider a rooted phylogenetic tree on the label set of leaves shown in Figure 1 with . If , then T is an equidistant tree. In this paper, our main contribution is that we show a lower bound of the probability of UPGMA to return the true equidistant tree on the set of leaves for from a set of random pairwise distances of all possible pairs of sequences. Then we conduct some computational experiments using a statistical software R to see how tight this lower bound is in practice for .
This paper is organized as follows:
Section 2 reminds the reader of the basics of tree metrics and random variables representing pairwise distances of all possible pairs of sequences. Then it adds a notion of stochastic safety radius defined by Steel and Gascuel in [
2]. In
Section 3, we compute the stochastic safety radius of the
three point condition for equidistant trees using a lower bound of the probability of returning the tree topology based on the three point condition. Then in
Section 4, we compute a lower bound of the the probability of returning the tree topology from UPGMA and in
Section 5, we show some results from our computational experiments with
R. In
Section 6, we end this paper with some discussion.
2. Stochastic Safety Radius
Let be the set of all non-negative integers. Let be the set of labels for given species (or taxa) and let T be a rooted phylogenetic tree with leaves X.
Definition 3. Let be an matrix with non-negative elements. If D is a symmetric matrix with its diagonal equal to 0, then we call D a distance matrix or dissimilarity maps.
Let be the th element of a distance matrix D. If satisfies
for any ,
for all ,
for all ,
then we call D ametric.
If D is a metric and if there exist a phylogenetic tree with leaves X such that is the total distance of branch lengths of the path from a leaf i to a leaf j for all , then D is called a tree metric.
Suppose D is a metric on X. Then if D satisfies for distinct , then D is called an ultrametric.
Definition 4. Suppose we have a rooted phylogenetic tree T with a leaf label set X. If a distance from its root to each leaf is the same distance for all , then we call T an equidistant tree.
Theorem 5 ([
14])
. Suppose we have an equidistant tree T with a leaf label set X and suppose for all is a distance from a leaf i to a leaf j. Then, D is an ultrametric if and only if T is an equidistant tree. In this paper, we focus on equidistant trees with leaves
X. In practice, we compute a distance matrix from an observed alignment. When we compute a distance matrix from an alignment via a maximum likelihood estimation, usually a distance matrix is not a tree metric [
1]. Therefore, in this paper, we investigate a probability that we obtain the tree topology of the true phylogenetic tree from a distance matrix obtained from an input alignment using
stochastic safety radius [
2].
Definition 6 (Stochastic safety radius)
. Let for some positive . For any , we say that a distance-based tree reconstruction method M has -stochastic safety radius
if for every binary phylogenetic X-tree T on n leaves, with minimum interior edge length , and with the distance matrix δ on X described by the random errors model, we have In this paper, we focus on the stochastic safety radius of a distance-based method, Unweighted Pair Group Method with Arithmetic Mean (UPGMA).
In reality, if we obtain all pairwise distances from a genetic data set, then we rarely have an ultrametric. Instead, we usually have dissimilarity maps. In order to infer an equidistant tree from dissimilarity maps, we can use UPGMA [
15], which is a weighted least squared method to estimate the closest ultrametric in the space of ultrametrics [
16].
Example 7. In order to demonstrate Algorithm 1, consider an equidistant tree with shown in Figure 2. Algorithm 1 UPGMA [15] |
Input: Dissimilarity map on X. Output: An estimated equidistant tree T
on X. Set and . for,
do Pick smallest for all pair of with . Set x as a parent node for the node i and j, compute branch length from i to x and j to x, and then record them in a tree T. Set a new node x with for all with and . Remove from S and add x to S. end for Record the branch lengths from the root to each leaf in the two leaves. returnT.
|
Figure 2.
Example for Algorithm 1.
Figure 2.
Example for Algorithm 1.
A distance matrix computed from the tree shown in Figure 2 is Note that
For :
Therefore, d satisfies Equation (3). Thus, this matrix is an ultrametric. With UPGMA algorithm shown in Algorithm 1, we have
For , we pick a pair of leaves with . Set as the parent node of . Assign the branch length from to 1 as and assign the branch length from to 2 as . Now we add as a leaf set X. Thus we have with For , we pick a pair of leaves with . Set as the parent node of . Assign the branch length from to 3 as and assign the branch length from to 4 as . Now we add as a leaf set y. Thus we have with
After the for-loop, we record the branch length from the root to the leaf x and the leaf y as . From this, we can compute the branch length from the root to by and the branch length from the root to by .
In this paper, we use UPGMA in order to investigate their stochastic safety radius and lower bounds for the probability for UPGMA to return the true tree topology if the input distance matrix is not ultrametric. Here we assume that the multivariate random variable
is defined as follows:
where
are independently and identically distributed for fixed
and for all
and
for all
.
Remark 8. In order to make it simple, we assume that the height of an equidistant tree T on leaves X, which is the total branch length from each leaf to its root, is equal to 1.
5. Computational Experiments
In this computational experiment, we use the
ape [
17] and the
phangorn packages [
18,
19]
R packages for phylogenetic tree data structures, generating random trees, and UPGMA.
First, in order to compare Theorem 10 and the space of ultrametrics, namely Theorem 5 computationally with
so that we can visualize the results. We generated 1000 random points
where
for
and
. We vary
. The results show in
Figure 5. Black points are ultrametrics
and red points are
.
We estimate the probability for UPGMA to return the true tree topology using Algorithm 2 for
and for
, and then we compare these estimated probabilities with lower bounds which we obtained in Theorem 10. These results are shown in
Table 1 and
Table 2. These results show that lower bounds computed in Theorem 10 might not be tight. In all cases, the lower bound computed by Theorem 10 is an order of magnitude less than the estimated probability for UPGMA to return the true tree topology with a random sample. This suggests that although the presence of a lower bound defines a boundary for the likelihood of the having the true tree topology, it also presents a challenge in the magnitude of gap.
Algorithm 2: Computational experiments for estimating the probability for UPGMA to return the true tree topology |
Input: The number of leaves n, standard deviation . Output: Estimated probability for UPGMA to return the true tree topology. for, do Generate a random tree T with n leaves using a multispecies coalescent model [ 4] via the function coal from the ape package. Set . for , do Generate a random distance matrix with
where is the total branch length from a leaf i to a leaf j in T and for all . Use UPGMA to reconstruct a tree from via the function upgma from the phangorn package. Compare tree topology between T and using the function all.equal in the ape package. if T and have the same tree topology, then . end if end for . end for return .
|
6. Conclusions
UPGMA is a hierarchical clustering method to reconstruct a phylogenetic tree from a distance matrix. In general, it is unlikely that a given distance matrix is a tree metric so that, in this paper, we focus on the case when an input distance matrix is written as a linear combination of the true tree metric and an error term which is generated from the Gaussian distribution around 0 with the standard deviation
. In addition, we show a lower bound of the probability for UPGMA to return the true tree topology if we have an input distance matrix
defined by Equation (
2).
Then we conduct computational experiments so that our lower bounds are close to the empirical probabilities estimated from random samples shown in
Table 1 and
Table 2. These computational results suggest our lower bounds are not tight. Thus, for a future direction of this research, we have the following questions:
Problem 11. Can we compute tighter lower bounds for UPGMA to return the true tree topology from a distance matrix δ defined in Equation (2)? If our bounds are tight for some situations, what are the conditions that our lower bounds are tight? In addition, using the idea of computing lower bounds of the probability, we might be able to compute a “confidence interval” of the estimated phylogenetic tree from a given distance matrix via UPGMA.