Abstract
Tropical Principal Component Analysis (PCA) is an analogue of classical PCA in the setting of tropical geometry. It has been applied to visualize a set of gene trees over a space of phylogenetic trees, which is a union of lower-dimensional polyhedral cones in a Euclidean space of dimension $\binom{m}{2}$, where m is the number of leaves. In this paper, we introduce a projected gradient descent method to estimate the tropical principal polytope over the space of phylogenetic trees, and we apply it to an Apicomplexa dataset. With computational experiments against Markov Chain Monte Carlo (MCMC) samplers, we show that our projected gradient descent method yields a lower sum of tropical distances between observations and their projections onto the estimated best-fit tropical polytope, compared with the MCMC-based approach.
MSC:
14T90; 92D15; 62R01
1. Introduction
Phylogenomics is a relatively new field that applies tools from phylogenetics to genome data. One of the key tasks in phylogenomics is to analyze gene trees, which are phylogenetic trees representing the evolutionary histories of genes in the genome. In this work, we use an unsupervised learning method to visualize how gene trees are distributed over the space of phylogenetic trees, that is, the set of all possible phylogenetic trees with a fixed set of labels for all leaves.
A phylogenetic tree T on a given set of leaves X is a weighted tree in which the internal nodes are unlabeled, the leaves are labeled by X, and the branch lengths represent evolutionary time and mutation rates. In phylogenetics, a phylogenetic tree on a set of species represents their evolutionary history. In phylogenomics, we construct a phylogenetic tree from an alignment or sequence data for each gene in a given genome. A phylogenetic tree reconstructed from a gene alignment is called a gene tree. Since different genes may have distinct evolutionary histories, gene trees can vary in topology and branch lengths. Thus, it is a statistical challenge to analyze a set of phylogenetic trees.
When conducting statistical analysis on a set of phylogenetic trees, we represent each tree as a vector in a high-dimensional vector space. One common method is to compute all pairwise distances between two distinct leaves in $[m] := \{1, \ldots, m\}$, resulting in a vector in $\mathbb{R}^e$ with $e = \binom{m}{2}$. However, not every vector in $\mathbb{R}^e$ corresponds to a valid phylogenetic tree on $[m]$. In 1974, Buneman showed [1] that a vector derived from all possible pairwise distances between leaves in $[m]$ must satisfy the four-point condition to represent a phylogenetic tree. For an equidistant tree, namely a rooted phylogenetic tree in which the total edge weight from the root to each leaf in $[m]$ is equal (see Definition 9), the vector must satisfy the three-point condition to correspond to the phylogenetic tree (Theorem 1).
In 2006, Ardila and Klivans showed that the space of all phylogenetic trees on $[m]$ is a union of lower-dimensional polyhedral cones in $\mathbb{R}^e$, and that this space is not classically convex [2]. Therefore, classical statistical methods cannot be directly applied to a set of phylogenetic trees, as these methods assume a Euclidean sample space.
However, Ardila and Klivans also showed that the space of equidistant trees (rooted phylogenetic trees on $[m]$ as defined in Definition 9) is a tropical Grassmannian. So, the space of equidistant trees on $[m]$ is tropically convex and forms a tropical linear space with the max-plus algebra over the tropical projective space. Therefore, we can apply tropical linear algebra to perform statistical analysis on the space of equidistant trees on $[m]$.
In 2019, Yoshida et al. introduced tropical principal component analysis (PCA), an analogue of classical PCA from the perspective of tropical geometry, to visualize how gene trees are distributed over the space of equidistant trees on $[m]$ using max-plus algebra [3]. For $s \in \mathbb{N}$, they defined the $(s-1)$-th order tropical principal polytope, or the best-fit tropical polytope with s vertices, whose vertices serve as analogues of the classical first s principal components. They showed that computing these vertices can be formulated as a mixed integer linear programming problem, as shown in Problem 1 [3]. Later, Page et al. developed a Markov Chain Monte Carlo (MCMC) method to estimate the vertices of the tropical principal polytope from a set of gene trees [13].
In this work, inspired by the recent work on tropical gradient descent in [4], we introduce a projected gradient descent method to compute the set of vertices of the tropical principal polytope from a set of gene trees. We compute subgradients (Theorem 3) to search for an optimal solution of the mixed integer programming problem that defines the tropical principal polytope. Then, we apply our novel method to Apicomplexa data from [5], and our experiments using the R package TML version 2.3.0 [6] show that our method outperforms the existing MCMC approach in terms of computational time and cost function.
This paper is organized as follows. In Section 2, we introduce the basics of tropical geometry. In Section 3, we review the notions of metrics and ultrametrics, and discuss the isometry between the space of equidistant trees on $[m]$ and the space of ultrametrics on the finite set $[m]$, based on results by Buneman [1]. In Section 4, we present tropical PCA and the s-th order tropical principal polytope for $s \in \mathbb{N}$. Section 5 provides experimental results on the Apicomplexa dataset from [5].
2. Tropical Basics
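In this section, we introduce the basics of tropical geometry to be used for our main results. Let $\mathbf{1} := (1, \ldots, 1) \in \mathbb{R}^e$ denote the all-ones vector. Throughout this paper, we consider the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$, which is isomorphic to $\mathbb{R}^{e-1}$. This means that $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is equivalent to a hyperplane in $\mathbb{R}^e$. This implies that for a point $x := (x_1, \ldots, x_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$,
$$(x_1 + c, \ldots, x_e + c) = (x_1, \ldots, x_e) = x,$$
where $c \in \mathbb{R}$. See [7] for more details.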
Throughout this paper, we consider tropically convex sets defined by the max-plus algebra provided in Definition 1.
Definition 1 (Tropical Arithmetic Operations).
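The tropical semiring $(\mathbb{R} \cup \{-\infty\}, \oplus, \odot)$ is defined using the following tropical addition ⊕ and multiplication ⊙:
$$a \oplus b := \max\{a, b\}, \qquad a \odot b := a + b$$
for any $a, b \in \mathbb{R} \cup \{-\infty\}$.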
Remark 1.
$-\infty$ is the identity element under addition ⊕ and 0 is the identity element under multiplication ⊙ over $\mathbb{R} \cup \{-\infty\}$.
Definition 2 (Tropical Scalar Multiplication and Vector Addition).
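For any $a, b \in \mathbb{R}$ and for any $v = (v_1, \ldots, v_e), w = (w_1, \ldots, w_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical scalar multiplication and tropical vector addition are defined as:
$$a \odot v \oplus b \odot w := (\max\{a + v_1, b + w_1\}, \ldots, \max\{a + v_e, b + w_e\}).$$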
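As a quick illustration of Definitions 1 and 2, the following minimal Python sketch implements max-plus arithmetic on plain lists. The helper names (`trop_add`, `trop_mul`, `trop_lin_comb`) are hypothetical and chosen for this example only; they are not taken from the TML package.

```python
NEG_INF = float("-inf")  # identity element for tropical addition


def trop_add(a, b):
    """Tropical addition: a (+) b = max(a, b)."""
    return max(a, b)


def trop_mul(a, b):
    """Tropical multiplication: a (.) b = a + b."""
    return a + b


def trop_scale(a, v):
    """Tropical scalar multiplication: add the scalar a to every coordinate."""
    return [trop_mul(a, vi) for vi in v]


def trop_vec_add(v, w):
    """Tropical vector addition: coordinate-wise maximum."""
    return [trop_add(vi, wi) for vi, wi in zip(v, w)]


def trop_lin_comb(a, v, b, w):
    """Tropical linear combination a (.) v (+) b (.) w."""
    return trop_vec_add(trop_scale(a, v), trop_scale(b, w))


print(trop_lin_comb(1, [0, 2, 4], 0, [3, 1, 0]))
```

The combination above takes the coordinate-wise maximum of `[1, 3, 5]` and `[3, 1, 0]`.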
Definition 3 (Generalized Hilbert Projective Metric).
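For any points $v := (v_1, \ldots, v_e), w := (w_1, \ldots, w_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical metric, $d_{\rm tr}$, between v and w, is defined as:
$$d_{\rm tr}(v, w) := \max_{1 \le i \le e}\{v_i - w_i\} - \min_{1 \le i \le e}\{v_i - w_i\}.$$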
Remark 2.
The tropical metric $d_{\rm tr}$ is a metric over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.
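A minimal Python sketch of the tropical metric in Definition 3 follows. It also checks that the value does not change when one argument is shifted by a constant, which is why the metric is well defined on the tropical projective torus. The function name `tropical_distance` is an illustrative choice.

```python
def tropical_distance(v, w):
    """Generalized Hilbert projective metric: max(v - w) - min(v - w)."""
    diffs = [vi - wi for vi, wi in zip(v, w)]
    return max(diffs) - min(diffs)


v = [0, 2, 5]
w = [1, 1, 1]
shifted_v = [vi + 7 for vi in v]  # same point in the tropical projective torus

print(tropical_distance(v, w))
print(tropical_distance(shifted_v, w))
```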
Definition 4.
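A subset $S \subseteq \mathbb{R}^e$ is called tropically convex if it contains the point $a \odot x \oplus b \odot y$ for all $x, y \in S$ and all $a, b \in \mathbb{R}$. The tropical convex hull or tropical polytope, $\mathrm{tconv}(V)$, of a given finite subset $V \subset \mathbb{R}^e$ is the smallest tropically convex set containing $V$. In addition, $\mathrm{tconv}(V)$ can be written as the set of all tropical linear combinations
$$\mathrm{tconv}(V) = \{a_1 \odot v_1 \oplus a_2 \odot v_2 \oplus \cdots \oplus a_r \odot v_r : v_1, \ldots, v_r \in V,\ a_1, \ldots, a_r \in \mathbb{R}\}.$$
Any tropically convex subset S of $\mathbb{R}^e$ is closed under tropical scalar multiplication, $\mathbb{R} \odot S \subseteq S$, i.e., if $x \in S$, then $x + c \cdot \mathbf{1} \in S$ for all $c \in \mathbb{R}$. Thus, the tropically convex set S is identified with its quotient in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.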
Definition 5
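(Max-tropical Hyperplane [8]). A max-tropical hyperplane $H^{\max}_{\omega}$ is the set of points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, such that
$$\max_{1 \le i \le e}\{\omega_i + x_i\}$$
is attained at least twice, where $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.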
Definition 6
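(Min-tropical Hyperplane [8]). A min-tropical hyperplane $H^{\min}_{\omega}$ is the set of points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, such that
$$\min_{1 \le i \le e}\{\omega_i + x_i\}$$
is attained at least twice, where $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.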
Remark 3.
A min-tropical hyperplane and a max-tropical hyperplane are tropically convex over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.
Definition 7
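(Max-tropical Sectors from Section 5.5 in [8]). For $i \in \{1, \ldots, e\}$, the i-th open sector of $H^{\max}_{\omega}$ is defined as
$$S^{\max, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} : \omega_i + x_i > \omega_j + x_j \ \text{for all } j \neq i\},$$
and the i-th closed sector of $H^{\max}_{\omega}$ is defined as
$$\overline{S}^{\max, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} : \omega_i + x_i \ge \omega_j + x_j \ \text{for all } j \neq i\}.$$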
Definition 8
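(Min-tropical Sectors). For $i \in \{1, \ldots, e\}$, the i-th open sector of $H^{\min}_{\omega}$ is defined as
$$S^{\min, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} : \omega_i + x_i < \omega_j + x_j \ \text{for all } j \neq i\},$$
and the i-th closed sector of $H^{\min}_{\omega}$ is defined as
$$\overline{S}^{\min, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} : \omega_i + x_i \le \omega_j + x_j \ \text{for all } j \neq i\}.$$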
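To make the sector definitions concrete, here is a small Python sketch that lists the closed sectors containing a point. It assumes the usual convention that the i-th closed max-sector (min-sector) consists of the points where the i-th coordinate of ω + x attains the maximum (minimum). Indices are 0-based in the code, and the function names are hypothetical.

```python
def closed_max_sectors(x, omega):
    """Indices i such that omega_i + x_i attains the maximum of omega + x."""
    vals = [w + xi for w, xi in zip(omega, x)]
    top = max(vals)
    return [i for i, val in enumerate(vals) if val == top]


def closed_min_sectors(x, omega):
    """Indices i such that omega_i + x_i attains the minimum of omega + x."""
    vals = [w + xi for w, xi in zip(omega, x)]
    bottom = min(vals)
    return [i for i, val in enumerate(vals) if val == bottom]


def on_max_hyperplane(x, omega):
    """A point lies on the max-tropical hyperplane iff the maximum is attained at least twice."""
    return len(closed_max_sectors(x, omega)) >= 2


print(closed_max_sectors([0, 1, 3], [0, 0, 0]))  # a single index: interior of an open sector
print(on_max_hyperplane([2, 2, 0], [0, 0, 0]))   # maximum attained twice
```

A point belongs to exactly one closed sector precisely when it lies in the corresponding open sector, which is the "no ties" situation used later for gradients.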
3. Space of Phylogenetic Trees
A phylogenetic tree is a rooted or unrooted tree whose exterior nodes have unique labels, whose interior nodes do not have labels, and whose edges have non-negative weights. In this paper, we focus on equidistant trees, that is, rooted phylogenetic trees in which the total weight on the path from the root to each leaf is the same. Let $[m] := \{1, \ldots, m\}$ be the set of leaf labels on an equidistant tree T.
Definition 9.
An equidistant tree T on $[m]$ is a rooted phylogenetic tree on $[m]$, such that the total weight from the root to each leaf $i$ is equal to a constant $h > 0$ for all $i \in [m]$. The constant h is the height of T.
Example 1.
Figure 1 shows an equidistant tree of height 1.
Figure 1.
An equidistant tree from Example 1. The height of the tree is 1.
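Suppose $D(i, j)$ is the total weight on the unique path between a leaf $i \in [m]$ and a leaf $j \in [m]$ on a phylogenetic tree T. Then, $D := (D(i, j))$, where $i, j \in [m]$, is a metric, that is, D satisfies
$$D(i, j) \le D(i, k) + D(k, j), \qquad D(i, j) = D(j, i), \qquad D(i, i) = 0$$
for all $i, j, k \in [m]$. The metric D is the tree metric of a phylogenetic tree T.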
If, for a metric D, the maximum
$$\max\{D(i, j), D(i, k), D(j, k)\}$$
is achieved at least twice for all distinct $i, j, k \in [m]$, then D is called an ultrametric. Suppose $D(i, j)$ is the total weight of the path from leaf $i$ to leaf $j$ on an equidistant tree T; then we have the following theorem.
Theorem 1
(noted in [1]). Suppose we have a rooted phylogenetic tree T with a leaf label set $[m]$ and D as its tree metric. Then, D is an ultrametric if and only if T is an equidistant tree. In addition, we can uniquely reconstruct T from D.
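The three-point condition behind Theorem 1 is easy to check numerically. The sketch below assumes the tree metric is given as a symmetric matrix (a list of lists) with zero diagonal; the function name `is_ultrametric` and the tolerance parameter `tol` are illustrative choices.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """Three-point condition: for every triple of leaves, the maximum of the
    three pairwise distances is attained at least twice."""
    m = len(D)
    for i, j, k in combinations(range(m), 3):
        smallest, middle, largest = sorted([D[i][j], D[i][k], D[j][k]])
        if largest - middle > tol:
            return False
    return True


# Tree metric of the equidistant tree ((1,2),3) with height 1: the cherry
# {1,2} joins at height 0.5, so D(1,2) = 1 and D(1,3) = D(2,3) = 2.
D_equidistant = [[0, 1, 2],
                 [1, 0, 2],
                 [2, 2, 0]]

# Stretching a single pendant edge breaks the three-point condition.
D_not_equidistant = [[0, 1, 3],
                     [1, 0, 2],
                     [3, 2, 0]]

print(is_ultrametric(D_equidistant), is_ultrametric(D_not_equidistant))
```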
Using Theorem 1, we consider the space of ultrametrics on $[m]$ as the space of phylogenetic trees, which is the set of all possible equidistant trees with the leaf set $[m]$. Let $\mathcal{U}_m$ be the space of ultrametrics on $[m]$.
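With tropical geometry, it can be shown that $\mathcal{U}_m$ is a tropical subspace over the tropical projective space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Let $L_m$ denote the subspace of $\mathbb{R}^e$, defined by the linear equations, such that $x_{ij} - x_{ik} + x_{jk} = 0$ for $1 \le i < j < k \le m$. The tropicalization $\mathrm{Trop}(L_m) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ is the tropical linear space consisting of points $(w_{12}, w_{13}, \ldots, w_{m-1,m})$, such that, for any $1 \le i < j < k \le m$, the maximum of $\{w_{ij}, w_{ik}, w_{jk}\}$ is achieved at least twice.
In addition, it is important to note that the tropical linear space $\mathrm{Trop}(L_m)$ corresponds to the graphic matroid of the complete graph $K_m$.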
Theorem 2
(Theorem 2.18 in [3]). The image of $\mathcal{U}_m$ in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ coincides with $\mathrm{Trop}(L_m)$.
Projection onto Tree Space
In tropical geometry, it is well known that $\mathcal{U}_m$ is the support of a pointed simplicial fan of dimension $m - 2$ and it has $2^m - m - 2$ rays, defined as clade metrics [2].
Definition 10.
Suppose we have an equidistant phylogenetic tree T with leaf set $[m]$. A clade of T with leaves $\sigma \subset [m]$ is the equidistant subtree of T formed by including all common ancestral interior nodes of combinations of leaves in σ, while excluding any common ancestors that involve leaves from $[m] \setminus \sigma$ in T, along with the edges in T that connect these interior nodes to the leaves in σ.
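We note that a clade of an equidistant tree T with leaf set $[m]$ is a subtree of T with leaves $\sigma \subset [m]$. Feichtner showed that each topology of equidistant trees can be encoded by a nested set, that is, a set of clades $\{\sigma_1, \ldots, \sigma_k\}$, where $\sigma_i \subset [m]$, such that
$$\sigma_i \subseteq \sigma_j, \qquad \sigma_j \subseteq \sigma_i, \qquad \text{or} \qquad \sigma_i \cap \sigma_j = \emptyset$$
for all $i \in \{1, \ldots, k\}$ and for all $j \in \{1, \ldots, k\}$ [9].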
Definition 11
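(Clade Ultrametrics). We consider an equidistant tree T on the leaf set $[m]$. Let $\sigma$ be a proper subset of $[m]$ with at least two elements. Let $D_\sigma \in \mathbb{R}^e$, such that
$$D_\sigma(i, j) := \begin{cases} 0 & \text{if } i, j \in \sigma, \\ 1 & \text{otherwise.} \end{cases}$$
Then, $D_\sigma$ is called a clade ultrametric.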
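The following Python sketch builds clade ultrametrics as vectors indexed by pairs of leaves and checks that a sum over a nested set satisfies the three-point condition, while a sum over two overlapping, non-nested clades does not. It assumes the usual 0/1 convention (0 for pairs inside the clade, 1 otherwise); the function names and the lexicographic ordering of pairs are illustrative choices.

```python
from itertools import combinations


def clade_ultrametric(sigma, m):
    """Vector indexed by pairs (i, j), 1 <= i < j <= m, in lexicographic order:
    0 if both leaves lie in the clade sigma, 1 otherwise (assumed convention)."""
    sigma = set(sigma)
    if not 2 <= len(sigma) < m:
        raise ValueError("sigma must be a proper subset with at least two leaves")
    return [0 if (i in sigma and j in sigma) else 1
            for i, j in combinations(range(1, m + 1), 2)]


def satisfies_three_point(vec, m):
    """Check that the maximum over every triple of leaves is attained twice."""
    d = dict(zip(combinations(range(1, m + 1), 2), vec))
    for i, j, k in combinations(range(1, m + 1), 3):
        _, middle, largest = sorted([d[(i, j)], d[(i, k)], d[(j, k)]])
        if middle != largest:
            return False
    return True


m = 4
nested = [a + b for a, b in zip(clade_ultrametric({1, 2}, m),
                                clade_ultrametric({1, 2, 3}, m))]
overlapping = [a + b for a, b in zip(clade_ultrametric({1, 2}, m),
                                     clade_ultrametric({2, 3}, m))]

print(nested, satisfies_three_point(nested, m))
print(overlapping, satisfies_three_point(overlapping, m))
```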
We note that Ardila and Klivans showed that the set of clade ultrametrics forms a set of generators (i.e., rays) of a pointed simplicial fan of dimension $m - 2$, in terms of classical arithmetic in Euclidean geometry. In this paper, we use an extreme clade ultrametric, which is an analogue of a clade ultrametric defined using the max-plus algebra. This is done by replacing the identity element of classical addition with the identity of tropical addition ⊕ (namely, replacing 0 with $-\infty$), and replacing the identity element of classical multiplication with that of tropical multiplication ⊙ (namely, replacing 1 with 0).
Definition 12
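(Extreme Clade Ultrametrics). We consider an equidistant tree T on the leaf set $[m]$. Let $\sigma \subset [m]$ be a proper subset with at least two elements. Let $E_\sigma \in (\mathbb{R} \cup \{-\infty\})^e$, such that
$$E_\sigma(i, j) := \begin{cases} -\infty & \text{if } i, j \in \sigma, \\ 0 & \text{otherwise.} \end{cases}$$
Then, $E_\sigma$ is called an extreme clade ultrametric.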
Remark 4.
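In polyhedral geometry, a polyhedral cone generated by a set of rays $V = \{v_1, \ldots, v_r\}$ is defined as
$$\mathrm{cone}(V) := \{a_1 v_1 + a_2 v_2 + \cdots + a_r v_r : a_1, \ldots, a_r \ge 0\}.$$
We replace classical addition with ⊕ and classical multiplication with ⊙ for $\mathrm{cone}(V)$; thus, we have
$$\mathrm{tconv}(V) = \{a_1 \odot v_1 \oplus a_2 \odot v_2 \oplus \cdots \oplus a_r \odot v_r : a_1, \ldots, a_r \in \mathbb{R}\},$$
which is a tropical polytope defined by V.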
Proposition 1
([10]). The set of all extreme clade ultrametrics is a generating set of $\mathcal{U}_m$ in terms of max-plus algebra.
Proposition 1 is a tropical geometric analogue of the simplicial complex result by Ardila and Klivans in [2], obtained by replacing classical addition with ⊕ and classical multiplication with ⊙.
Definition 13
(Projection Map [10]). The tropical projection map to ultrametric tree space is given by
for .
Proposition 2
([10]). For all , we have
for all , and where is defined by Definition 13.
The following proposition is key to the projected gradient method we propose in Section 4.
Proposition 3
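(Lemma 19, [11]). The projection map $\pi_{\mathcal{U}_m}$ is non-expansive in terms of $d_{\rm tr}$, i.e.,
$$d_{\rm tr}(\pi_{\mathcal{U}_m}(x), \pi_{\mathcal{U}_m}(y)) \le d_{\rm tr}(x, y)$$
for all $x, y \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.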
Remark 5.
The projection map $\pi_{\mathcal{U}_m}$ is equivalent to the single linkage hierarchical clustering method [12]. Therefore, in the computational experiment described in Section 5, we use a single linkage hierarchical clustering method to project the iterate after each subgradient step.
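One way to see the connection with single linkage is through minimax paths: the cophenetic distance produced by single linkage between two leaves equals the smallest possible value of the largest edge on a path between them. The sketch below computes this subdominant ultrametric with a Floyd-Warshall style update. It is an illustrative stand-in under that assumption, not the implementation used in the experiments, and the function name is hypothetical.

```python
def subdominant_ultrametric(D):
    """Largest ultrametric lying below the dissimilarity matrix D.

    U[i][j] becomes the minimum, over all paths from i to j, of the largest
    entry of D along the path (the single linkage cophenetic distance).
    """
    m = len(D)
    U = [row[:] for row in D]  # work on a copy
    for k in range(m):
        for i in range(m):
            for j in range(m):
                if len({i, j, k}) == 3:  # only use genuine intermediate leaves
                    U[i][j] = min(U[i][j], max(U[i][k], U[k][j]))
    return U


D = [[0, 1, 4],
     [1, 0, 2],
     [4, 2, 0]]
U = subdominant_ultrametric(D)
print(U)
```

In the example, the entry between the first and third leaves drops from 4 to 2 because the path through the second leaf has largest edge 2. Applying the function to its own output leaves the matrix unchanged.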
4. Tropical Principal Component Analysis
Yoshida et al. introduced the notion of tropical principal component analysis (PCA), an analysis based on the best-fit tropical hyperplane or tropical polytope [3]. In particular, they applied tropical PCA using tropical polytopes to samples of ultrametrics within the space of ultrametrics. In this section, we consider $e = \binom{m}{2}$, where m is the number of leaves and $[m] := \{1, \ldots, m\}$. Suppose we have a sample $\mathcal{S} := \{u_1, \ldots, u_n\}$.
Definition 14
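(Definition 3.1 in [13]). The $(s-1)$-th order tropical principal polytope of a sample $\mathcal{S} = \{u_1, \ldots, u_n\}$ minimizes
$$\sum_{i=1}^{n} d_{\rm tr}(u_i, \pi_{\mathcal{P}}(u_i)),$$
where $\pi_{\mathcal{P}}$ is the projection onto a tropical polytope $\mathcal{P} = \mathrm{tconv}(v_1, \ldots, v_s)$ with s many vertices, that is
$$\pi_{\mathcal{P}}(x) := \lambda_1 \odot v_1 \oplus \lambda_2 \odot v_2 \oplus \cdots \oplus \lambda_s \odot v_s, \qquad \lambda_k := \min(x - v_k),$$
for all $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ and for $k = 1, \ldots, s$. The minimizer is called the $(s-1)$-th order tropical principal component polytope of the sample $\mathcal{S}$. The s vertices of the tropical principal component polytope are called the $(s-1)$-th order tropical principal components, or equivalently, the best-fit tropical polytope with s vertices.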
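The objective in Definition 14 can be evaluated directly once the projection onto a tropical polytope is available. The sketch below uses the standard closest-point formula for max-plus tropical polytopes, with coefficients λ_k = min(x − v_k), and sums the tropical distances from each observation to its projection. The function names and the toy vertices are assumptions made for illustration.

```python
def tropical_distance(v, w):
    """Tropical metric: max(v - w) - min(v - w)."""
    diffs = [a - b for a, b in zip(v, w)]
    return max(diffs) - min(diffs)


def project_onto_polytope(x, vertices):
    """Project x onto tconv(vertices) in the max-plus convention.

    lambda_k = min_j (x_j - v_kj) is the largest shift for which
    lambda_k + v_k stays coordinate-wise below x; the projection is the
    coordinate-wise maximum of these shifted vertices.
    """
    lambdas = [min(xj - vj for xj, vj in zip(x, v)) for v in vertices]
    return [max(lam + v[j] for lam, v in zip(lambdas, vertices))
            for j in range(len(x))]


def sum_of_residuals(sample, vertices):
    """Sum of tropical distances between observations and their projections."""
    return sum(tropical_distance(u, project_onto_polytope(u, vertices))
               for u in sample)


vertices = [[0, 0, 0], [0, 3, 1]]   # toy polytope with s = 2 vertices
sample = [[0, 1, 2], [0, 3, 1]]     # the second observation is a vertex

print(project_onto_polytope(sample[0], vertices))
print(sum_of_residuals(sample, vertices))
```

A point that already lies in the polytope, such as a vertex, is its own projection and contributes zero to the sum.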
Remark 6.
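The 0-th tropical principal polytope is a tropical Fermat–Weber point of a sample $\mathcal{S} = \{u_1, \ldots, u_n\}$ with respect to $d_{\rm tr}$. A tropical Fermat–Weber point of $\mathcal{S}$ with respect to $d_{\rm tr}$ is defined as
$$x^* \in \arg\min_{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{n} d_{\rm tr}(x, u_i).$$
A tropical Fermat–Weber point of $\mathcal{S}$ with respect to $d_{\rm tr}$ is not necessarily unique, and the set of all tropical Fermat–Weber points forms both a classical polytope and a tropical polytope [14].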
In this paper, we focus on the -th order principal components over for . Our problem can be written as follows:
Problem 1.
We seek a solution for the following optimization problem:
where
and
with
Remark 7
(Proposition 4.2 in [3]). Problem 1 can be formulated as a mixed integer programming problem.
In this section, we consider subgradients of Problem 1. Here, we are interested in computing
First, we notice that
by the product rule.
Let
$$\delta_{ij} := \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise} \end{cases}$$
be Kronecker’s Delta. Then, we have the following lemma:
Lemma 1
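(Lemma 10 in [14]). For any two points $x, p \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the gradient at x of the tropical distance between x and p is given by
$$\frac{\partial d_{\rm tr}(x, p)}{\partial x_i} = \delta_{i, j_{\max}} - \delta_{i, j_{\min}}, \qquad j_{\max} := \arg\max_{j}(x_j - p_j), \quad j_{\min} := \arg\min_{j}(x_j - p_j),$$
provided there are no ties in $\{x_j - p_j\}$ for the maximum or the minimum, which ensures that the min- and max-sectors are uniquely identifiable; that is, the point x is inside of open sectors and not on a sector boundary.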
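Lemma 1 translates into a few lines of code: the gradient has a +1 in the coordinate where x − p is largest and a −1 where it is smallest. The sketch below raises an error when the maximum or the minimum is tied, since the gradient is then not unique. The function names are illustrative.

```python
def tropical_distance(v, w):
    """Tropical metric: max(v - w) - min(v - w)."""
    diffs = [a - b for a, b in zip(v, w)]
    return max(diffs) - min(diffs)


def tropical_distance_gradient(x, p):
    """Gradient of d_tr(x, p) with respect to x, assuming no ties."""
    diffs = [xi - pi for xi, pi in zip(x, p)]
    hi, lo = max(diffs), min(diffs)
    if diffs.count(hi) > 1 or diffs.count(lo) > 1:
        raise ValueError("tie in x - p: x lies on a sector boundary")
    grad = [0] * len(x)
    grad[diffs.index(hi)] += 1   # coordinate attaining the maximum
    grad[diffs.index(lo)] -= 1   # coordinate attaining the minimum
    return grad


x, p = [0, 2, 5], [1, 1, 1]
print(tropical_distance_gradient(x, p))

# Sanity check: increasing the maximizing coordinate by 1 raises the distance by 1.
print(tropical_distance([0, 2, 6], p) - tropical_distance(x, p))
```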
Also, we notice that
for .
Lemma 2.
for , , and for .
Proof.
Direct computation from the equation in (12). □
Theorem 3.
The subgradient of Problem 1 over is
where is obtained in Lemma 3.
Proof.
Using Proposition 3, we know that is non-expanding in terms of . Therefore, we have
Therefore,
when x is at a critical point.
Suppose is an optimal solution for the Problem 1. Then, let
Since
is a subgradient in Lemma 3, we have
for . Using Proposition 3, we have
for . □
5. Computational Experiments
5.1. Simulated Dataset
We generate gene trees under the multispecies coalescent model (MCM) using a given species tree via the software Mesquite version 3.81 [15]. We fixed the effective population size $N_e$ and varied the ratio $R = \mathrm{SD}/N_e$, where SD is the species depth, defined as the number of generations from the most recent common ancestor (the root of the tree) to its leaves.
In this experiment, we simulate 1000 trees for each value of R = 0.25, 0.5, 1.0, 2.0, 5.0, 10.0. We run 100 iterations of both the MCMC method and our projected gradient method. To account for variability in performance, we repeat both methods 10 times for each R. For each run, we compute the sum of tropical distances between the observed trees and their projections onto the estimated tropical polytope. This sum, representing the total “magnitude of errors” (SE) in terms of $d_{\rm tr}$, corresponds to the estimated optimal value of the optimization problem in Problem 1. The results are shown in Figure 2. For the learning rate schedule, we start with an initial learning rate and multiply it by a fixed decay factor at each iteration. Note that this learning rate has not been optimized. Therefore, investigating optimal learning rates remains a direction for future work. As for the stopping criterion, we terminate the algorithm when the preset maximum number of iterations is reached. Each iteration has time complexity , where n is the sample size.
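The overall loop (subgradient step, projection onto tree space, decaying learning rate, fixed iteration budget) can be sketched in a simplified setting. The Python example below treats only the single-vertex case, i.e., it minimizes the sum of tropical distances to the sample over the space of ultrametrics, and uses the minimax-path closure as the projection. The starting rate `lr0=0.5`, the decay factor `decay=0.9`, and all function names are hypothetical choices for illustration, not the settings or code used in the experiments.

```python
from itertools import combinations


def tropical_distance(v, w):
    diffs = [a - b for a, b in zip(v, w)]
    return max(diffs) - min(diffs)


def objective(x, sample):
    """Sum of tropical distances from x to every observation."""
    return sum(tropical_distance(x, p) for p in sample)


def subgradient(x, sample):
    """A subgradient of the objective: +1 at an argmax and -1 at an argmin
    of x - p for each observation p (first index is used when tied)."""
    g = [0.0] * len(x)
    for p in sample:
        diffs = [a - b for a, b in zip(x, p)]
        g[diffs.index(max(diffs))] += 1.0
        g[diffs.index(min(diffs))] -= 1.0
    return g


def project_to_ultrametrics(x, m):
    """Minimax-path closure of the pairwise vector x (single linkage)."""
    pairs = list(combinations(range(m), 2))
    U = [[0.0] * m for _ in range(m)]
    for (i, j), val in zip(pairs, x):
        U[i][j] = U[j][i] = val
    for k in range(m):
        for i in range(m):
            for j in range(m):
                if len({i, j, k}) == 3:
                    U[i][j] = min(U[i][j], max(U[i][k], U[k][j]))
    return [U[i][j] for i, j in pairs]


def is_ultrametric(x, m, tol=1e-9):
    d = dict(zip(combinations(range(m), 2), x))
    return all(sorted([d[(i, j)], d[(i, k)], d[(j, k)]])[2]
               - sorted([d[(i, j)], d[(i, k)], d[(j, k)]])[1] <= tol
               for i, j, k in combinations(range(m), 3))


def projected_subgradient_descent(sample, m, x0, lr0=0.5, decay=0.9, n_iter=100):
    """Step along a subgradient, project back, shrink the learning rate,
    and keep the best iterate seen (subgradient steps need not descend)."""
    x = project_to_ultrametrics(x0, m)
    best_x, best_val = x, objective(x, sample)
    lr = lr0
    for _ in range(n_iter):
        g = subgradient(x, sample)
        x = project_to_ultrametrics([xi - lr * gi for xi, gi in zip(x, g)], m)
        val = objective(x, sample)
        if val < best_val:
            best_x, best_val = x, val
        lr *= decay
    return best_x, best_val


sample = [[1, 2, 2], [2, 1, 2], [2, 2, 1]]   # three ultrametrics on m = 3 leaves
x0 = [5, 1, 3]
start_val = objective(project_to_ultrametrics(x0, 3), sample)
best_x, best_val = projected_subgradient_descent(sample, 3, x0)
print(start_val, best_val, best_x)
```

Tracking the best iterate is a common safeguard for subgradient methods, since an individual step may increase the objective even when the learning rate decays.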
Figure 2.
Side-by-side boxplots for the SE for each method and ratio. We repeat computation 10 times for each ratio and method.
5.2. Empirical Dataset
We apply our projected gradient method to estimate the best-fit tropical polytope for an empirical dataset consisting of 268 orthologous sequences from eight protozoa species, as presented in [5]. This dataset contains gene trees reconstructed from the following sequences: Babesia bovis (Bb), Cryptosporidium parvum (Cp), Eimeria tenella (Et), Plasmodium falciparum (Pf), Plasmodium vivax (Pv), Theileria annulata (Ta), and Toxoplasma gondii (Tg). The outgroup is a free-living ciliate, Tetrahymena thermophila (Tt) (Figure 3).
Figure 3.
(Left): Estimated second-order tropical principal polytope. (Right): Each color represents a tree topology. The number inside each bracket is the frequency of the tree topology. 1 represents “Pv”, 2 represents “Pf”, 3 represents “Tg”, 4 represents “Et”, 5 represents “Cp”, 6 represents “Ta”, 7 represents “Bb” and 8 represents “Tt”.
To run this experiment, we used a MacBook Pro laptop (Apple Inc., Cupertino, CA, USA) with an Apple M4 Max chip and 128 GB of memory. We implemented our projected gradient method in R.
We set the maximum number of iterations to 100 for our projected gradient method, and to 1000 for the MCMC method. For the learning rate schedule, we initialize and multiply it by at each iteration. Note that the learning rate and scheduling have not been optimized. Using the projected gradient descent to estimate the tropical principal polytope on this dataset takes s, resulting in an estimated optimal value of for the optimization problem in Problem 1. In comparison, the Markov Chain Monte Carlo (MCMC) method implemented via the TML package [6] yields an estimated optimal value of , with a computational time of s over 1000 iterations.
6. Conclusions
In this work, we introduce a novel method to approximate the best-fit tropical polytope that explains a sample of gene trees. We show that our gradient method effectively reduces the objective function when an appropriate learning rate is used. Computational experiments show that this method achieves a lower sum of tropical distances between observations and their projections onto the estimated best-fit tropical polytope, compared to the MCMC approach proposed by Page et al. [13]. Although we implement a decreasing learning rate in our experiments, the optimal learning rate schedule for this problem remains an open question.
We implement our method in R and the source code is available at http://polytopes.net/Tropical_Gradient2.zip (accessed on 15 May 2025).
Funding
This research was funded by NSF Division of Mathematical Sciences: Statistics Program DMS 2409819.
Data Availability Statement
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The author declares no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| PCA | Principal Component Analysis |
| MCMC | Markov Chain Monte Carlo |
References
1. Buneman, P. A note on the metric properties of trees. J. Comb. Theory Ser. B 1974, 17, 48–50.
2. Ardila, F.; Klivans, C.J. The Bergman Complex of a Matroid and Phylogenetic Trees. J. Comb. Theory Ser. B 2006, 96, 38–49.
3. Yoshida, R.; Zhang, L.; Zhang, X. Tropical Principal Component Analysis and its Application to Phylogenetics. Bull. Math. Biol. 2019, 81, 568–597.
4. Talbut, R.; Monod, A. Tropical Gradient Descent. arXiv 2024, arXiv:2405.19551.
5. Kuo, C.; Wares, J.P.; Kissinger, J.C. The Apicomplexan Whole-Genome Phylogeny: An Analysis of Incongruence among Gene Trees. Mol. Biol. Evol. 2008, 25, 2689–2698.
6. Barnhill, D.; Yoshida, R.; Aliatimis, G.; Miura, K. Tropical geometric tools for machine learning: The TML package. J. Softw. Algebra Geom. 2024, 14, 133–174.
7. Maclagan, D.; Sturmfels, B. Introduction to Tropical Geometry; Graduate Studies in Mathematics; American Mathematical Society: Providence, RI, USA, 2015; Volume 161.
8. Joswig, M. Essentials of Tropical Combinatorics; Springer: New York, NY, USA, 2021.
9. Feichtner, E.M. Complexes of trees and nested set complexes. Pac. J. Math. 2004, 227, 271–286.
10. Ardila, F. Subdominant Matroid Ultrametrics. Ann. Comb. 2005, 8, 379–389.
11. Jaggi, M.; Katz, G.; Wagner, U. New results in tropical discrete geometry. Preprint 2008. Available online: https://api.semanticscholar.org/CorpusID:14825187 (accessed on 14 April 2025).
12. Gascuel, O.; McKenzie, A. Performance analysis of hierarchical clustering algorithms. J. Classif. 2004, 21, 3–18.
13. Page, R.; Yoshida, R.; Zhang, L. Tropical principal component analysis on the space of phylogenetic trees. Bioinformatics 2020, 36, 4590–4598.
14. Barnhill, D.; Sabol, J.; Yoshida, R.; Miura, K. Tropical Fermat-Weber Polytropes. arXiv 2024, arXiv:2402.14287.
15. Maddison, W.P.; Maddison, D. Mesquite: A Modular System for Evolutionary Analysis. Version 2.72. 2009. Available online: http://mesquiteproject.org (accessed on 11 August 2016).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).