Abstract
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is one of the most popular distance-based methods to reconstruct an equidistant phylogenetic tree from a distance matrix computed from an alignment of sequences. Since we use equidistant trees as gene trees for phylogenomic analyses under the multi-species coalescent model and since an input distance matrix computed from an alignment of each gene in a genome is estimated via the maximum likelihood estimators, it is important to conduct a robust analysis on UPGMA. Stochastic safety radius, introduced by Steel and Gascuel, provides a lower bound for the probability that a phylogenetic tree reconstruction method returns the true tree topology from a given distance matrix. In this article, we compute the stochastic safety radius of UPGMA for a phylogenetic tree with n leaves. Computational experiments show an improved gap between empirical probabilities estimated from random samples and the true tree topology from UPGMA, increasing confidence in phylogenic results.
1. Introduction
Phylogenetics is one of the oldest fields in biology to study the evolutionary history of organisms using a phylogenetic tree, which is a tree representation of evolutionary history among species (or taxa). In order to reconstruct a phylogenetic tree from genetic data, researchers develop many statistical methods including maximum likelihood estimators, Bayesian inference, and distance-based methods [1]. A distance-based method is one of the most popular methods to reconstruct a phylogenetic tree for its computational speed and relatively simple two-step procedure: (1) computing all pairwise distances between all possible pair of sequences from the input alignment; and (2) reconstructing a phylogenetic tree from all pairwise distances of sequences computed in Step (1) using combinatorics.
Maximum likelihood estimators under evolutionary models produce pairwise distances between all possible pairs of sequences and they form as a distance matrix. A distance matrix is an input for a distance-based method to reconstruct a phylogenetic tree and we can consider them as a multivariate random variable and these distance-based and probabilistic methods do not always return the true phylogenetic tree topology. Therefore, we have to measure the robustness of a method and one metric of the robustness of a distance-based method for phylogenetic tree reconstruction is called the safety radius of the method. A safety radius is a radius of all distance matrices such that a given distance-based method returns the “true tree topology” of a phylogenetic tree. This means that all distance matrices within the safety radius satisfy a precise combinatorial condition so that the distance-based method is guaranteed to return the true tree topology [2].
In 2015, Steel and Gascuel introduced a notion of Stochastic safety radius in [2] for analyzing the probability for a distance-based method to return the true tree topology from a given distance matrix. In 2017, Xi et al. worked on developing a stochastic safety radius using the neighbor-joining (NJ) method and balance minimal evolution method for trees with number of leaves equal to 4 or 5 [3].
Phylogenomics is a new field, which applies tools from phylogenetics to genome data. In phylogenomics, we often conduct the species tree and gene trees analyses using the multi-species coalescent model [4]. Under the multi-species coalescent model, we assume that all gene trees, phylogenetic trees reconstructed from genes, are equidistant trees [4]. An equidistant tree is a rooted tree whose total branch length, from its root to each leaf, is the same for all leaves (an example of an equidistant tree with three leaves is shown in Figure 1).
Figure 1.
Example of a phylogenetic tree T on the label set of leaves .
In the past five years, there has been much work on the space of all equidistant trees for phylogenomics [5,6,7,8,9,10,11]. However, these recent studies assume that all given equidistant trees are true trees or close to true trees, which is often not true. Davidson and Sullivant worked on variability of a distance-based method to reconstruct an equidistant tree from all pairwise distances, called the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) using polyhedral geometry [12]. They study how UPGMA project a given distance matrix to the space of all equidistant trees so that their result is from the view of polyhedral geometry and deterministic.
In this paper, therefore, we focus on the stochastic safety radius of the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), one of the most popular distance-based methods and a lower bound of its probability that the UPGMA method returns the true tree topology from a random input distance matrix with noise distributed from the Gaussian distribution. Note that UPGMA is a hierarchical clustering method that builds a dendrogram from a distance matrix which records pairwise “distances” defined by a user input metric between all pairs of observations in a higher dimensional vector space [13]. In the application to a phylogenetic tree reconstruction, we use a distance matrix which contains pairwise distances between all pairs of sequences in the input alignment via a user input evolutionary model [1].
A phylogenetic tree is a weighted tree with labeled external nodes, called leaves, and unlabeled internal nodes. These labels represent species or taxa at the present time and each internal node represents a common ancestor for all of the leaves below this internal node. A weight on each branch (or edge) of a phylogenetic tree represents a mutation rate combined with its evolutionary time from an ancestor to its descendent. A phylogenic tree can be rooted or unrooted. For more details, see [1].
Example 1.
Suppose we have a label set for leaves which represents a set of species at the present time. Suppose we have a rooted phylogenetic tree T shown in Figure 1.
Internal nodes on T do not have labels. The internal node of the ancestor of leaves represents the most common ancestral species of species and the root of T is the most common ancestor of all species . Each branch length on a branch represents the mutation rate combined with its evolutionary time. A distance matrix computed from this tree shown in Figure 1 is a matrix such that , the th cell of the matrix d is the total branch length from a leaf to a leaf , that is,
Since d is computed from a phylogenetic tree, d is a tree metric. Not all symmetric matrices with diagonal elements equal to 0 are not tree metrics.
Through this paper, we assume that binary phylogenetic trees and the smallest branch length of an internal edge in a binary phylogenetic tree are strictly positive. In this paper, we focus on equidistant trees which are rooted phylogenetic trees with branch lengths such that the total branch length from its root to each leaf is the same.
Example 2.
We consider a rooted phylogenetic tree on the label set of leaves shown in Figure 1 with . If , then T is an equidistant tree.
In this paper, our main contribution is that we show a lower bound of the probability of UPGMA to return the true equidistant tree on the set of leaves for from a set of random pairwise distances of all possible pairs of sequences. Then we conduct some computational experiments using a statistical software R to see how tight this lower bound is in practice for .
This paper is organized as follows: Section 2 reminds the reader of the basics of tree metrics and random variables representing pairwise distances of all possible pairs of sequences. Then it adds a notion of stochastic safety radius defined by Steel and Gascuel in [2]. In Section 3, we compute the stochastic safety radius of the three point condition for equidistant trees using a lower bound of the probability of returning the tree topology based on the three point condition. Then in Section 4, we compute a lower bound of the the probability of returning the tree topology from UPGMA and in Section 5, we show some results from our computational experiments with R. In Section 6, we end this paper with some discussion.
2. Stochastic Safety Radius
Let be the set of all non-negative integers. Let be the set of labels for given species (or taxa) and let T be a rooted phylogenetic tree with leaves X.
Definition 3.
Let be an matrix with non-negative elements. If D is a symmetric matrix with its diagonal equal to 0, then we call D a distance matrix or dissimilarity maps.
Let be the th element of a distance matrix D. If satisfies
- for any ,
- for all ,
- for all ,
then we call D ametric.
If D is a metric and if there exist a phylogenetic tree with leaves X such that is the total distance of branch lengths of the path from a leaf i to a leaf j for all , then D is called a tree metric.
Suppose D is a metric on X. Then if D satisfies
for distinct , then D is called an ultrametric.
Definition 4.
Suppose we have a rooted phylogenetic tree T with a leaf label set X. If a distance from its root to each leaf is the same distance for all , then we call T an equidistant tree.
Theorem 5
([14]). Suppose we have an equidistant tree T with a leaf label set X and suppose for all is a distance from a leaf i to a leaf j. Then, D is an ultrametric if and only if T is an equidistant tree.
In this paper, we focus on equidistant trees with leaves X. In practice, we compute a distance matrix from an observed alignment. When we compute a distance matrix from an alignment via a maximum likelihood estimation, usually a distance matrix is not a tree metric [1]. Therefore, in this paper, we investigate a probability that we obtain the tree topology of the true phylogenetic tree from a distance matrix obtained from an input alignment using stochastic safety radius [2].
Definition 6
(Stochastic safety radius). Let for some positive . For any , we say that a distance-based tree reconstruction method M has -stochastic safety radius if for every binary phylogenetic X-tree T on n leaves, with minimum interior edge length , and with the distance matrix δ on X described by the random errors model, we have
In this paper, we focus on the stochastic safety radius of a distance-based method, Unweighted Pair Group Method with Arithmetic Mean (UPGMA).
In reality, if we obtain all pairwise distances from a genetic data set, then we rarely have an ultrametric. Instead, we usually have dissimilarity maps. In order to infer an equidistant tree from dissimilarity maps, we can use UPGMA [15], which is a weighted least squared method to estimate the closest ultrametric in the space of ultrametrics [16].
Example 7.
In order to demonstrate Algorithm 1, consider an equidistant tree with shown in Figure 2.
| Algorithm 1 UPGMA [15] |
|
Figure 2.
Example for Algorithm 1.
A distance matrix computed from the tree shown in Figure 2 is
Note that
- For :and
- :and
- :and
- :and
Therefore, d satisfies Equation (3). Thus, this matrix is an ultrametric.
With UPGMA algorithm shown in Algorithm 1, we have
- For , we pick a pair of leaves with . Set as the parent node of . Assign the branch length from to 1 as and assign the branch length from to 2 as . Now we add as a leaf set X. Thus we have withand
- For , we pick a pair of leaves with . Set as the parent node of . Assign the branch length from to 3 as and assign the branch length from to 4 as . Now we add as a leaf set y. Thus we have with
After the for-loop, we record the branch length from the root to the leaf x and the leaf y as . From this, we can compute the branch length from the root to by and the branch length from the root to by .
In this paper, we use UPGMA in order to investigate their stochastic safety radius and lower bounds for the probability for UPGMA to return the true tree topology if the input distance matrix is not ultrametric. Here we assume that the multivariate random variable is defined as follows:
where are independently and identically distributed for fixed and for all and
for all .
Remark 8.
In order to make it simple, we assume that the height of an equidistant tree T on leaves X, which is the total branch length from each leaf to its root, is equal to 1.
3. Probability on the Three Point Condition
From Equation (1), we have the three point condition which is defined below: for all distinct leaves we have
and, by Theorem 5, if d is an ultrametric and is a tree metric of an equidistant tree T, then Equation (3) satisfies for all distinct leaves .
Figure 3.
Equidistant tree with three leaves on labels . are branch lengths.
In addition, from Equation (4), we have
Then we have
Since and , we have
Therefore, we have
Since are independently and identically distributed (i.i.d.) from the normal distribution , and are i.i.d. from the normal distribution . Therefore, we have
Let where is the smallest branch length for an internal edge in an equidistant tree T on leaves X. Then we have
where . Let for and . Let . Then we have
by Remark 8.
4. Probability Distribution of the Output Tree via Upgma
4.1. Case for
For , we have . Suppose and . from Equation (9), we have
4.2. Case for
For , we have .
- Case 1
- If we have a case in the left picture in Figure 4 for . The probability that is chosen first islet x be the new label merging leaves . Then
Figure 4. Two cases for equidistant trees without labels.Then the probability that is chosen isTherefore, the probability that UPGMA returns the true tree topology iswhere . - Case 2
- If we have a case in the right picture in Figure 4 for . The probability that is chosen first islet x be the new label merging leaves . ThenThen the probability that is chosen isTherefore, the probability that UPGMA returns the true tree topology iswhere .
Remark 9.
For , there are many different tree topologies of rooted phylogenetic trees on leaves X [1].
By Remark 9, there are . Thus, we have the probability that UPGMA returns the tree topology is
where .
4.3. General Case
For , we have . Using the same way with the case on , by recursive computations and by Remark 9, we can obtain the probability of UPGMA to return the true tree topology is
where .
Theorem 10.
Suppose we have a distance matrix δ defined by Equation (2) with a tree metric associated with the binary true phylogenetic tree T on the set of leaves with the smallest internal branch length . Then the probability for UPGMA to return the true tree topology on X is bounded by
where .
Proof.
For , it is trivial. For , then suppose the statement holds for . Then we want to show that the statement holds for n. First, we fix the tree topology. Let be subsets of and form a partition on X. Then suppose and . from Equation (9), we have
Since we assume that the statement holds for , for this particular tree topology, we have that the probability for UPGMA to return the true tree topology on X is bounded by
where . Since there are many tree topologies for n, we have the result. □
5. Computational Experiments
In this computational experiment, we use the ape [17] and the phangorn packages [18,19] R packages for phylogenetic tree data structures, generating random trees, and UPGMA.
First, in order to compare Theorem 10 and the space of ultrametrics, namely Theorem 5 computationally with so that we can visualize the results. We generated 1000 random points where for and . We vary . The results show in Figure 5. Black points are ultrametrics and red points are .
Figure 5.
Red points are randomly generated and black points are ultrametrics which are equivalent to equidistant trees. The top left figure is for . The top right figure is for . The bottom left figure is for . The bottom right figure is for .
We estimate the probability for UPGMA to return the true tree topology using Algorithm 2 for and for , and then we compare these estimated probabilities with lower bounds which we obtained in Theorem 10. These results are shown in Table 1 and Table 2. These results show that lower bounds computed in Theorem 10 might not be tight. In all cases, the lower bound computed by Theorem 10 is an order of magnitude less than the estimated probability for UPGMA to return the true tree topology with a random sample. This suggests that although the presence of a lower bound defines a boundary for the likelihood of the having the true tree topology, it also presents a challenge in the magnitude of gap.
| Algorithm 2: Computational experiments for estimating the probability for UPGMA to return the true tree topology |
|
Table 1.
Comparison between empirical probabilities with lower bounds computed by Theorem 10. The notation in each cell represents two parts: a represents the estimated probability for UPGMA to return the true tree topology with a random sample of sample size 1000 and (b) represents a lower bound computed by Theorem 10. We repeated this process 100 time and take an average for computing estimated probabilities. For example, we consider the cell for and . In this cell, 0.902 is an estimated probability computed by Algorithm 2 and 0.334 is a lower bound from Theorem 10 with and .
Table 2.
Differences estimated probabilities minus lower bounds in Theorem 10 computed in our computational experiments shown in Table 1.
6. Conclusions
UPGMA is a hierarchical clustering method to reconstruct a phylogenetic tree from a distance matrix. In general, it is unlikely that a given distance matrix is a tree metric so that, in this paper, we focus on the case when an input distance matrix is written as a linear combination of the true tree metric and an error term which is generated from the Gaussian distribution around 0 with the standard deviation . In addition, we show a lower bound of the probability for UPGMA to return the true tree topology if we have an input distance matrix defined by Equation (2).
Then we conduct computational experiments so that our lower bounds are close to the empirical probabilities estimated from random samples shown in Table 1 and Table 2. These computational results suggest our lower bounds are not tight. Thus, for a future direction of this research, we have the following questions:
Problem 11.
Can we compute tighter lower bounds for UPGMA to return the true tree topology from a distance matrix δ defined in Equation (2)? If our bounds are tight for some situations, what are the conditions that our lower bounds are tight?
In addition, using the idea of computing lower bounds of the probability, we might be able to compute a “confidence interval” of the estimated phylogenetic tree from a given distance matrix via UPGMA.
Author Contributions
R.Y. directed this research project and mainly contributed the main results. L.P. computed computational experiments on this research. P.N. worked on editing the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
R.Y. is partially funded by NSF Division of Mathematical Sciences: Statistics Program DMS 1916037.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| MDPI | Multidisciplinary Digital Publishing Institute |
| DOAJ | Directory of open access journals |
| TLA | Three letter acronym |
| LD | Linear dichroism |
| UPGMA | Unweighted Pair Group Method with Arithmetic Mean |
References
- Semple, C.; Steel, M. Phylogenetics; Number 22 in Mathematics and Its Applications Series; Oxford University Press: Oxford, UK, 2003. [Google Scholar]
- Gascuel, O.; Steel, M. A ‘Stochastic Safety Radius’ for Distance-Based Tree Reconstruction. Algorithmica 2016, 74, 1386–1403. [Google Scholar] [CrossRef][Green Version]
- Xi, J.; Xie, J.; Yoshida, R.; Forcey, S. Stochastic safety radius on Neighbor-Joining method and Balanced Minimal Evolution on small trees. arXiv 2015, arXiv:1507.08734. [Google Scholar]
- Maddison, W.P.; Maddison, D. Mesquite: A Modular System for Evolutionary Analysis. Evolution 2009, 2, 72. [Google Scholar]
- Yoshida, R.; Zhang, L.; Zhang, X. Tropical Principal Component Analysis and its Application to Phylogenetics. Bull. Math. Biol. 2019, 81, 568–597. [Google Scholar] [CrossRef] [PubMed]
- Yoshida, R.; Takamori, M.; Matsumoto, H.; Miura, K. Tropical Support Vector Machines: Evaluations and Extension to Function Spaces. Neural Netw. 2023, 157, 77–89. [Google Scholar] [CrossRef]
- Lin, B.; Sturmfels, B.; Tang, X.; Yoshida, R. Convexity in Tree Spaces. SIAM Discret. Math 2017, 3, 2015–2038. [Google Scholar] [CrossRef]
- Page, R.; Yoshida, R.; Zhang, L. Tropical principal component analysis on the space of phylogenetic trees. Bioinformatics 2020, 36, 4590–4598. [Google Scholar] [CrossRef] [PubMed]
- Yoshida, R.; Miura, K.; Barnhill, D.; Howe, D. Tropical Density Estimation of Phylogenetic Trees. arXiv 2022, arXiv:2206.04206. [Google Scholar]
- Yoshida, R.; Cox, S. Tree Topologies along a Tropical Line Segment. Vietnam. J. Math. 2022, 50, 395–419. [Google Scholar] [CrossRef]
- Monod, A.; Lin, B.; Yoshida, R. Tropical Geometric Variation of Tree Shapes. Discret. Comput. Geom. 2022, 68, 817–849. [Google Scholar]
- Davidson, R.; Sullivant, S. Polyhedral combinatorics of UPGMA cones. Adv. Appl. Math. 2013, 50, 327–338. [Google Scholar] [CrossRef][Green Version]
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning: With Applications in R; Springer: New York, NY, USA, 2013. [Google Scholar]
- Buneman, P. A note on the metric properties of trees. J. Comb. Theory Ser. B. 1974, 17, 48–50. [Google Scholar] [CrossRef]
- Sokal, R.R.; Michener, C.D. A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 1958, 38, 1409–1438. [Google Scholar]
- Bernstein, D.I.; Long, C. L-Infinity Optimization to Linear Spaces and Phylogenetic Trees. SIAM J. Discret. Math. 2017, 31, 875–889. [Google Scholar] [CrossRef]
- Paradis, E.; Schliep, K. ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2019, 35, 526–528. [Google Scholar] [CrossRef] [PubMed]
- Schliep, K. Phangorn: Phylogenetic analysis in R. Bioinformatics 2011, 27, 592–593. [Google Scholar] [CrossRef] [PubMed]
- Schliep, K.; Potts, A.A.; Morrison, D.A.; Grimm, G.W. Intertwining phylogenetic trees and networks. Methods Ecol. Evol. 2017, 8, 1212–1220. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).