1. Introduction
Phylogenomics is a relatively new field that applies tools from phylogenetics to genome data. One of the key tasks in phylogenomics is to analyze gene trees, which are phylogenetic trees representing the evolutionary histories of genes in the genome. In this work, we use an unsupervised learning method to visualize how gene trees are distributed over the space of phylogenetic trees, that is, the set of all possible phylogenetic trees with a fixed set of labels for all leaves.
A phylogenetic tree T on a given set of leaves is a weighted tree in which the internal nodes are unlabeled, the leaves X are labeled, and the branch lengths represent evolutionary time and mutation rates. In phylogenetics, a phylogenetic tree on a set of species represents their evolutionary history. In phylogenomics, we construct a phylogenetic tree from an alignment or sequence data for each gene in a given genome. A phylogenetic tree reconstructed from a gene alignment is called a gene tree. Since different genes may have distinct evolutionary histories, gene trees can vary in topology and branch lengths. Thus, analyzing a set of gene trees is a statistical challenge.
When conducting statistical analysis on a set of phylogenetic trees, we represent each tree as a vector in a high-dimensional vector space. One common method is to compute all pairwise distances between two distinct leaves in $[m] = \{1, \ldots, m\}$, resulting in a vector in $\mathbb{R}^e$ with $e = \binom{m}{2}$. However, not every vector in $\mathbb{R}^e$ corresponds to a valid phylogenetic tree on $[m]$. In 1974, Buneman showed [1] that a vector derived from all possible pairwise distances between leaves in $[m]$ must satisfy the four-point condition to represent a phylogenetic tree. For an equidistant tree, namely a rooted phylogenetic tree in which the total edge weight from the root to each leaf in $[m]$ is equal (see Definition 9), the vector must satisfy the three-point condition to correspond to the phylogenetic tree (Theorem 1).
In 2006, Ardila and Klivans showed that the space of all phylogenetic trees on
is a union of
dimensional cones in
, and that this space is not classically convex [
2]. Therefore, classical statistical methods cannot be directly applied to a set of phylogenetic trees, as these methods assume a Euclidean sample space.
However, Ardila and Klivans also showed that the space of equidistant trees (rooted phylogenetic trees on the leaf set $[m]$, as defined in Definition 9) is a tropical Grassmannian. Consequently, the space of equidistant trees on $[m]$ is tropically convex and forms a tropical linear space with the max-plus algebra over the tropical projective space. Therefore, we can apply tropical linear algebra to perform statistical analysis on the space of equidistant trees on $[m]$.
In 2019, Yoshida et al. introduced tropical principal component analysis (PCA), an analogue of classical PCA from the perspective of tropical geometry, to visualize how gene trees are distributed over the space of equidistant trees on the leaf set $[m]$ using max-plus algebra [3]. For a positive integer $s$, they defined the $(s-1)$-th order tropical principal polytope, or the best-fit tropical polytope with $s$ vertices, whose vertices serve as analogues of the first $s$ classical principal components. They showed that computing these vertices can be formulated as a mixed integer linear programming problem, as shown in Problem 1 [3]. Later, Page et al. developed a Markov Chain Monte Carlo (MCMC) method to estimate the vertices of the tropical principal polytope from a set of gene trees.
In this work, inspired by the recent work on tropical gradient descent in [4], we introduce a projected gradient descent method to compute the set of vertices of the tropical principal polytope from a set of gene trees. We compute subgradients of the mixed integer programming problem that defines the tropical principal polytope, as shown in Theorem 3. Then, we apply our novel method to Apicomplexa data from [5], and our experiments using the R package TML version 2.3.0 [6] show that our method outperforms existing approaches in terms of computational time and cost function value.
This paper is organized as follows. In Section 2, we introduce the basics of tropical geometry. In Section 3, we review the notions of metrics and ultrametrics, and discuss the isometry between the space of equidistant trees on the leaf set $[m]$ and the space of ultrametrics on the finite set $[m]$, based on results by Buneman [1]. In
Section 4, we present tropical PCA and the
s-th order tropical principal polytope for
, where
.
Section 5 provides experimental results on the Apicomplexa dataset from [
5].
2. Tropical Basics
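In this section, we introduce the basics of tropical geometry used in our main results. Let $\mathbf{1} := (1, \ldots, 1) \in \mathbb{R}^e$. Throughout this paper, we consider the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$, which is isomorphic to $\mathbb{R}^{e-1}$. This means that $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ is equivalent to a hyperplane in $\mathbb{R}^e$. This implies that for a point $x := (x_1, \ldots, x_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$,
$$(x_1 + c, \ldots, x_e + c) = (x_1, \ldots, x_e),$$
where $c \in \mathbb{R}$. See [7] for more details.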
Throughout this paper, we consider tropically convex sets defined by the max-plus algebra provided in Definition 1.
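Definition 1 (Tropical Arithmetic Operations).
The tropical semiring $(\mathbb{R} \cup \{-\infty\}, \oplus, \odot)$ is defined using the following tropical addition ⊕ and multiplication ⊙:
$$a \oplus b := \max\{a, b\}, \qquad a \odot b := a + b,$$
for any $a, b \in \mathbb{R} \cup \{-\infty\}$.
Remark 1. $-\infty$ is the identity element under addition ⊕ and 0 is the identity element under multiplication ⊙ over $\mathbb{R} \cup \{-\infty\}$.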
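Definition 2 (Tropical Scalar Multiplication and Vector Addition).
For any $a, b \in \mathbb{R}$ and for any $v = (v_1, \ldots, v_e), w = (w_1, \ldots, w_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical scalar multiplication and tropical vector addition are defined as:
$$a \odot v := (a + v_1, \ldots, a + v_e),$$
$$a \odot v \oplus b \odot w := (\max\{a + v_1, b + w_1\}, \ldots, \max\{a + v_e, b + w_e\}).$$
Definition 3 (Generalized Hilbert Projective Metric).
For any points $v, w \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, the tropical metric, $d_{\mathrm{tr}}$, between v and w, is defined as:
$$d_{\mathrm{tr}}(v, w) := \max_{1 \leq i \leq e} \{v_i - w_i\} - \min_{1 \leq i \leq e} \{v_i - w_i\}.$$
Remark 2. The tropical metric $d_{\mathrm{tr}}$ is a metric over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.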
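As a minimal sketch of Definitions 2 and 3, the max-plus operations and the tropical metric can be written in a few lines of Python. The function names are hypothetical and chosen only for illustration; vectors are plain lists of floats representing points of the tropical projective torus.

```python
from typing import List, Sequence


def trop_scalar_mult(a: float, v: Sequence[float]) -> List[float]:
    """Tropical scalar multiplication a (.) v: add a to every coordinate."""
    return [a + vi for vi in v]


def trop_add(v: Sequence[float], w: Sequence[float]) -> List[float]:
    """Tropical vector addition v (+) w: coordinate-wise maximum."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same length")
    return [max(vi, wi) for vi, wi in zip(v, w)]


def trop_dist(v: Sequence[float], w: Sequence[float]) -> float:
    """Tropical metric: max_i (v_i - w_i) - min_i (v_i - w_i)."""
    diffs = [vi - wi for vi, wi in zip(v, w)]
    return max(diffs) - min(diffs)


v = [0.0, 2.0, 5.0]
w = [1.0, 1.0, 1.0]
print(trop_add(trop_scalar_mult(1.0, v), w))   # [1.0, 3.0, 6.0]
print(trop_dist(v, w))                         # 5.0
# Adding a constant to every coordinate does not change the distance,
# which is why the metric is well defined on the quotient by R*1.
print(trop_dist(trop_scalar_mult(3.0, v), w))  # 5.0
```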
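Definition 4. A subset $S \subset \mathbb{R}^e$ is called tropically convex if it contains the point $a \odot x \oplus b \odot y$ for all $x, y \in S$ and all $a, b \in \mathbb{R}$. The tropical convex hull or tropical polytope, $\mathrm{tconv}(V)$, of a given finite subset $V = \{v^1, \ldots, v^s\} \subset \mathbb{R}^e$ is the smallest tropically convex set containing $V$. In addition, $\mathrm{tconv}(V)$ can be written as the set of all tropical linear combinations
$$\mathrm{tconv}(V) = \{a_1 \odot v^1 \oplus a_2 \odot v^2 \oplus \cdots \oplus a_s \odot v^s \mid a_1, \ldots, a_s \in \mathbb{R}\}.$$
Any tropically convex subset S of $\mathbb{R}^e$ is closed under tropical scalar multiplication, $\mathbb{R} \odot S \subseteq S$, i.e., if $x \in S$, then $c \odot x \in S$ for all $c \in \mathbb{R}$. Thus, the tropically convex set S is identified with its quotient in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.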
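Definition 5 (Max-tropical Hyperplane [8]).
A max-tropical hyperplane $H^{\max}_{\omega}$ is the set of points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, such that
$$\max_{1 \leq i \leq e} \{x_i + \omega_i\}$$
is attained at least twice, where $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.
Definition 6 (Min-tropical Hyperplane [8]).
A min-tropical hyperplane $H^{\min}_{\omega}$ is the set of points $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$, such that
$$\min_{1 \leq i \leq e} \{x_i + \omega_i\}$$
is attained at least twice, where $\omega := (\omega_1, \ldots, \omega_e) \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.
Remark 3. A min-tropical hyperplane and a max-tropical hyperplane are tropically convex over $\mathbb{R}^e/\mathbb{R}\mathbf{1}$.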
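The "attained at least twice" condition is easy to test numerically. The sketch below assumes the sign convention in which the maximum is taken over $x_i + \omega_i$; the helper names and the tolerance parameter `tol` are hypothetical. When the maximum is attained exactly once, the single returned index (0-based here) identifies the open sector containing the point.

```python
from typing import List, Sequence


def max_attained_indices(x: Sequence[float], omega: Sequence[float],
                         tol: float = 1e-9) -> List[int]:
    """Indices i (0-based) at which x_i + omega_i attains its maximum."""
    vals = [xi + wi for xi, wi in zip(x, omega)]
    top = max(vals)
    return [i for i, val in enumerate(vals) if top - val <= tol]


def on_max_tropical_hyperplane(x: Sequence[float], omega: Sequence[float],
                               tol: float = 1e-9) -> bool:
    """True if the maximum of x_i + omega_i is attained at least twice."""
    return len(max_attained_indices(x, omega, tol)) >= 2


omega = [0.0, 0.0, 0.0]
print(on_max_tropical_hyperplane([1.0, 1.0, 0.0], omega))  # True
print(on_max_tropical_hyperplane([2.0, 1.0, 0.0], omega))  # False
print(max_attained_indices([2.0, 1.0, 0.0], omega))        # [0]
```

The min-tropical case is obtained by replacing `max` with `min` and reversing the comparison.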
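Definition 7 (Max-tropical Sectors from Section 5.5 in [8]).
For $i \in \{1, \ldots, e\}$, the i-th open sector of $H^{\max}_{\omega}$ is defined as
$$S^{\max, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i > \omega_j + x_j \ \text{for all } j \neq i\},$$
and the i-th closed sector of $H^{\max}_{\omega}$ is defined as
$$\overline{S}^{\max, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i \geq \omega_j + x_j \ \text{for all } j \neq i\}.$$
Definition 8 (Min-tropical Sectors).
For $i \in \{1, \ldots, e\}$, the i-th open sector of $H^{\min}_{\omega}$ is defined as
$$S^{\min, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i < \omega_j + x_j \ \text{for all } j \neq i\},$$
and the i-th closed sector of $H^{\min}_{\omega}$ is defined as
$$\overline{S}^{\min, i}_{\omega} := \{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1} \mid \omega_i + x_i \leq \omega_j + x_j \ \text{for all } j \neq i\}.$$
3. Space of Phylogenetic Trees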
A phylogenetic tree is a rooted or unrooted tree whose exterior nodes have unique labels, whose interior nodes do not have labels, and whose edges have non-negative weights. In this paper, we focus on equidistant trees, that is, rooted phylogenetic trees in which the total weight on the path from the root to each leaf is the same. Let $[m] := \{1, \ldots, m\}$ be the set of leaf labels on an equidistant tree T.
Definition 9. An equidistant tree T on $[m]$ is a rooted phylogenetic tree on $[m]$, such that the total weight on the path from the root to each leaf $i$ is equal to a constant $h > 0$ for all $i \in [m]$. The constant h is the height of T.
Example 1. Figure 1 shows an equidistant tree with height 1.
Suppose $D(i, j)$ is the total weight on the unique path between a leaf $i$ and a leaf $j$ on a phylogenetic tree T. Then, $D := (D(i, j))_{i, j}$, where $i, j \in [m]$, is a metric, that is, D satisfies
$$D(i, j) \leq D(i, k) + D(k, j)$$
for all $i, j, k \in [m]$. The metric D is the tree metric of the phylogenetic tree T.
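If a metric D satisfies the three-point condition, namely that the maximum
$$\max\{D(i, j), D(i, k), D(j, k)\}$$
is achieved at least twice for all distinct $i, j, k \in [m]$, then D is called an ultrametric. Suppose $D(i, j)$ is the total weight of the path between leaves $i$ and $j$ in a rooted phylogenetic tree T. Then we have the following theorem.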
Theorem 1 (noted in [1]).
Suppose we have a rooted phylogenetic tree T with a leaf label set $[m]$ and D as its tree metric. Then, D is an ultrametric if and only if T is an equidistant tree. In addition, we can uniquely reconstruct T from D.
Using Theorem 1, we consider the space of ultrametrics on $[m]$ as the space of phylogenetic trees, which is the set of all possible equidistant trees with the leaf set $[m]$. Let $\mathcal{U}_m$ be the space of ultrametrics on $[m]$.
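The three-point condition behind Theorem 1 can be checked directly on a distance matrix. The following is a minimal sketch, assuming the tree metric is supplied as a symmetric matrix (a list of lists) indexed by leaves; the function name and the tolerance `tol` are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def is_ultrametric(D: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """Three-point condition: for every triple of distinct leaves, the
    maximum of the three pairwise distances is attained at least twice."""
    m = len(D)
    for i, j, k in combinations(range(m), 3):
        second, largest = sorted((D[i][j], D[i][k], D[j][k]))[1:]
        if largest - second > tol:
            return False
    return True


# Equidistant tree of height 1: leaves 0 and 1 merge at height 0.5,
# and leaf 2 joins them at the root.
D_equidistant = [[0.0, 1.0, 2.0],
                 [1.0, 0.0, 2.0],
                 [2.0, 2.0, 0.0]]
# A metric whose largest pairwise distance is attained only once.
D_other = [[0.0, 1.0, 3.0],
           [1.0, 0.0, 2.0],
           [3.0, 2.0, 0.0]]

print(is_ultrametric(D_equidistant))  # True
print(is_ultrametric(D_other))        # False
```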
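With tropical geometry, it can be shown that the space of ultrametrics $\mathcal{U}_m$ is a tropical subspace over the tropical projective space $\mathbb{R}^e/\mathbb{R}\mathbf{1}$. Let $L_m$ denote the subspace of $\mathbb{R}^e$ defined by the linear equations
$$x_{ij} - x_{ik} + x_{jk} = 0$$
for $1 \leq i < j < k \leq m$. The tropicalization $\mathrm{Trop}(L_m) \subseteq \mathbb{R}^e/\mathbb{R}\mathbf{1}$ is the tropical linear space consisting of points $(v_{12}, v_{13}, \ldots, v_{m-1,m})$, such that, for any $i, j, k \in [m]$, the maximum of $v_{ij}$, $v_{ik}$, $v_{jk}$ is achieved at least twice.
In addition, it is important to note that the tropical linear space $\mathrm{Trop}(L_m)$ corresponds to the graphic matroid of the complete graph $K_m$.
Theorem 2 (Theorem 2.18 in [3]).
The image of $\mathcal{U}_m$ in the tropical projective torus $\mathbb{R}^e/\mathbb{R}\mathbf{1}$ coincides with $\mathrm{Trop}(L_m)$.
Projection onto Tree Space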
In tropical geometry, it is well known that the space of ultrametrics $\mathcal{U}_m$ is the support of a pointed simplicial fan of dimension $m - 2$ and that it has $2^m - m - 2$ rays, defined as clade metrics [2].
Definition 10. Suppose we have an equidistant phylogenetic tree T with leaf set $[m]$. A clade of T with leaves $\sigma \subset [m]$ is the equidistant subtree of T formed by including all common ancestral interior nodes of combinations of leaves in σ, while excluding any common ancestors that involve leaves from $[m] \setminus \sigma$ in T, along with the edges in T that connect these interior nodes to the leaves in σ.
We note that a clade of an equidistant tree
T with leaf set
is a subtree of
T with leaves
. Feichtner showed that each topology of equidistant trees can be encoded by a nested set, that is, a set of clades
, where
, such that
for all
and
for all
[
9].
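Definition 11 (Clade Ultrametrics).
We consider an equidistant tree T on the leaf set $[m]$. Let $\sigma$ be a proper subset of $[m]$ with at least two elements. Let $D_{\sigma} \in \mathbb{R}^e$, such that
$$D_{\sigma}(i, j) = \begin{cases} 0 & \text{if } i, j \in \sigma, \\ 1 & \text{otherwise.} \end{cases}$$
Then, $D_{\sigma}$ is called a clade ultrametric.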
We note that Ardila and Klivans showed that the set of clade ultrametrics forms a set of generators (i.e., rays) of a pointed simplicial fan of dimension $m - 2$, in terms of classical arithmetic in Euclidean geometry. In this paper, we use an extreme clade ultrametric, which is an analogue of a clade ultrametric defined using the max-plus algebra. This is done by replacing the identity element of classical addition with the identity of tropical addition ⊕ (namely, replacing 0 with $-\infty$), and replacing the identity element of classical multiplication with that of tropical multiplication ⊙ (namely, replacing 1 with 0).
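Definition 12 (Extreme Clade Ultrametrics).
We consider an equidistant tree T on the leaf set $[m]$. Let $\sigma \subset [m]$ be a proper subset with at least two elements. Let $V_{\sigma} \in (\mathbb{R} \cup \{-\infty\})^e$, such that
$$V_{\sigma}(i, j) = \begin{cases} -\infty & \text{if } i, j \in \sigma, \\ 0 & \text{otherwise.} \end{cases}$$
Then, $V_{\sigma}$ is called an extreme clade ultrametric.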
Remark 4. In polyhedral geometry, a polyhedral cone generated by a set of rays $V = \{v^1, \ldots, v^s\}$ is defined as
$$\{a_1 v^1 + \cdots + a_s v^s \mid a_1, \ldots, a_s \geq 0\}.$$
We replace classical addition with ⊕ and classical multiplication with ⊙ for $a_1, \ldots, a_s \in \mathbb{R}$; thus, we have
$$\{a_1 \odot v^1 \oplus \cdots \oplus a_s \odot v^s \mid a_1, \ldots, a_s \in \mathbb{R}\},$$
which is a tropical polytope defined by V.
Proposition 1 ([10]).
The set of all extreme clade ultrametrics is a generating set of $\mathcal{U}_m$ in terms of max-plus algebra.
Proposition 1 is a tropical geometric analogue of the simplicial complex result by Ardila and Klivans in [2], obtained by replacing classical addition with ⊕ and classical multiplication with ⊙.
Definition 13 (Projection Map [
10]).
The tropical projection map to ultrametric tree space is given byfor . Proposition 2 ([
10]).
For all , we havefor all , and where is defined by Definition 13. The following proposition is key to the projected gradient methods we proposed in
Section 4.
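Since the projection onto the space of ultrametrics is equivalent to single-linkage hierarchical clustering (Remark 5), it can be sketched without any clustering library. The hypothetical helper below computes the subdominant ultrametric, which is the ultrametric that single linkage produces: the value for a pair of leaves is the smallest possible "largest edge" over all paths joining them. It assumes the input is a symmetric matrix with zero diagonal rather than a vector of pairwise distances, and uses a Floyd–Warshall style update.

```python
from typing import List, Sequence


def subdominant_ultrametric(D: Sequence[Sequence[float]]) -> List[List[float]]:
    """Single-linkage (minimax path) ultrametric of a symmetric matrix D.

    U[i][j] = min over paths from i to j of the largest entry on the path.
    """
    m = len(D)
    U = [list(row) for row in D]
    for k in range(m):
        for i in range(m):
            for j in range(m):
                if i != j and k != i and k != j:
                    # Allow the path from i to j to pass through leaf k.
                    U[i][j] = min(U[i][j], max(U[i][k], U[k][j]))
    return U


D = [[0.0, 2.0, 5.0],
     [2.0, 0.0, 3.0],
     [5.0, 3.0, 0.0]]
print(subdominant_ultrametric(D))
# [[0.0, 2.0, 3.0], [2.0, 0.0, 3.0], [3.0, 3.0, 0.0]]
```

In the output, the largest value among the three pairwise distances is attained twice, so the result satisfies the three-point condition, and no entry exceeds the corresponding input entry.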
Proposition 3 (Lemma 19, [11]).
The projection map $\pi_{\mathcal{U}_m}$ of Definition 13 is non-expansive in terms of $d_{\mathrm{tr}}$, i.e.,
$$d_{\mathrm{tr}}(\pi_{\mathcal{U}_m}(x), \pi_{\mathcal{U}_m}(y)) \leq d_{\mathrm{tr}}(x, y)$$
for all $x, y \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$.
Remark 5. The projection map is equivalent to the single linkage hierarchical clustering method [12]. Therefore, in the computational experiment described in Section 5, we use a single linkage hierarchical clustering method to project a subgradient.
4. Tropical Principal Component Analysis
Yoshida et al. introduced the notion of tropical principal component analysis (PCA), an analysis based on the best-fit tropical hyperplane or tropical polytope [3]. In particular, they applied tropical PCA using tropical polytopes to samples of ultrametrics within the space of ultrametrics. In this section, we consider $\mathbb{R}^e/\mathbb{R}\mathbf{1}$, where m is the number of leaves and $e = \binom{m}{2}$. Suppose we have a sample
.
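Definition 14 (Definition 3.1 in [13]).
For a sample $\{x_1, \ldots, x_n\}$, the $(s-1)$-th order tropical principal polytope minimizes
$$\sum_{i=1}^{n} d_{\mathrm{tr}}(x_i, \pi_P(x_i)),$$
where $\pi_P$ is the projection onto a tropical polytope $P = \mathrm{tconv}(v^1, \ldots, v^s)$ with s many vertices, that is,
$$\pi_P(x) = \lambda_1 \odot v^1 \oplus \cdots \oplus \lambda_s \odot v^s, \qquad \lambda_k = \min_{1 \leq j \leq e} \{x_j - v^k_j\},$$
for all $x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}$ and for $k = 1, \ldots, s$. The minimizing polytope is called the $(s-1)$-th order tropical principal component polytope of the sample, or equivalently, the best-fit tropical polytope with s vertices. The s vertices of the tropical principal component polytope are called the $(s-1)$-th order tropical principal components.
Remark 6. The 0-th order tropical principal polytope is a tropical Fermat–Weber point of the sample with respect to $d_{\mathrm{tr}}$. A tropical Fermat–Weber point of the sample with respect to $d_{\mathrm{tr}}$ is defined as
$$x^* \in \arg\min_{x \in \mathbb{R}^e/\mathbb{R}\mathbf{1}} \sum_{i=1}^{n} d_{\mathrm{tr}}(x, x_i).$$
A tropical Fermat–Weber point is not necessarily unique, and the set of all tropical Fermat–Weber points forms both a classical polytope and a tropical polytope [14]. In this paper, we focus on the $(s-1)$-th order tropical principal components over the space of ultrametrics $\mathcal{U}_m$. Our problem can be written as follows: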
Problem 1. We seek a solution for the following optimization problem:whereandwith Remark 7 (Proposition 4.2 in [
3]).
Problem 1 can be formulated as a mixed integer programming problem. In this section, we consider subgradients of Problem 1. Here, we are interested in computing
First, we notice that
by the product rule.
Let
be Kronecker’s Delta. Then, we have the following lemma:
Lemma 1 (Lemma 10 in [
14]).
For any two points , the gradient at x of the tropical distance between x and p is given byprovided there are no ties in , which ensures that the min- and max-sectors are uniquely identifiable; that is, the point x is inside of open sectors and not on the boundary of . Also, we notice that
for
.
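Lemma 1 can be illustrated with a short sketch. Because the tropical distance is $\max_i (x_i - p_i) - \min_i (x_i - p_i)$, away from ties its gradient in x is $+1$ in the coordinate attaining the maximum, $-1$ in the coordinate attaining the minimum, and 0 elsewhere. The function name below is hypothetical, ties are detected by exact floating-point equality (an assumption made for brevity), and the function raises an error on ties rather than choosing a subgradient.

```python
from typing import List, Sequence


def trop_dist_gradient(x: Sequence[float], p: Sequence[float]) -> List[float]:
    """Gradient in x of d_tr(x, p) = max_i(x_i - p_i) - min_i(x_i - p_i).

    Valid only when the maximum and the minimum are each attained once.
    """
    diffs = [xi - pi for xi, pi in zip(x, p)]
    hi, lo = max(diffs), min(diffs)
    if diffs.count(hi) != 1 or diffs.count(lo) != 1:
        raise ValueError("tie in x - p: the gradient is not unique")
    grad = [0.0] * len(diffs)
    grad[diffs.index(hi)] = 1.0   # coordinate attaining the maximum
    grad[diffs.index(lo)] = -1.0  # coordinate attaining the minimum
    return grad


print(trop_dist_gradient([3.0, 0.0, 1.0], [0.0, 0.0, 0.0]))  # [1.0, -1.0, 0.0]
```

The entries of the gradient sum to zero, which is consistent with the distance being invariant under adding a constant to every coordinate of x.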
Lemma 2. for , , and for . Proof. Direct computation from the equation in (12). □
Lemma 3. Subgradients of Problem 1 over iswhich can be obtained by Equations (11) and (12). Theorem 3. The subgradient of Problem 1 over iswhere is obtained in Lemma 3. Proof. Using Proposition 3, we know that
is non-expanding in terms of
. Therefore, we have
Therefore,
when
x is at a critical point.
Suppose
is an optimal solution for the Problem 1. Then, let
Since
is a subgradient in Lemma 3, we have
for
. Using Proposition 3, we have
for
. □
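To make the objective of Definition 14 concrete, the sketch below evaluates the sum of tropical distances from a sample to its projection onto a tropical polytope with given vertices. It assumes the standard projection formula $\pi_P(x) = \lambda_1 \odot v^1 \oplus \cdots \oplus \lambda_s \odot v^s$ with $\lambda_k = \min_j (x_j - v^k_j)$. All function names are hypothetical, and the code only evaluates the cost; it is not the projected gradient method itself.

```python
import math
from typing import List, Sequence


def trop_dist(v: Sequence[float], w: Sequence[float]) -> float:
    """Tropical metric: max_i (v_i - w_i) - min_i (v_i - w_i)."""
    diffs = [vi - wi for vi, wi in zip(v, w)]
    return max(diffs) - min(diffs)


def project_onto_tconv(x: Sequence[float],
                       vertices: Sequence[Sequence[float]]) -> List[float]:
    """Tropical projection of x onto the tropical convex hull of `vertices`."""
    proj = [-math.inf] * len(x)
    for v in vertices:
        lam = min(xj - vj for xj, vj in zip(x, v))  # lambda_k = min_j (x_j - v_j)
        proj = [max(pj, lam + vj) for pj, vj in zip(proj, v)]
    return proj


def tropical_pca_cost(sample: Sequence[Sequence[float]],
                      vertices: Sequence[Sequence[float]]) -> float:
    """Sum over the sample of d_tr(x, projection of x onto tconv(vertices))."""
    return sum(trop_dist(x, project_onto_tconv(x, vertices)) for x in sample)


vertices = [[0.0, 0.0, 0.0], [0.0, 1.0, 2.0]]
sample = [[0.0, 1.0, 2.0], [0.0, 3.0, 1.0]]
print(project_onto_tconv(sample[0], vertices))  # [0.0, 1.0, 2.0]
print(project_onto_tconv(sample[1], vertices))  # [0.0, 0.0, 1.0]
print(tropical_pca_cost(sample, vertices))      # 3.0
```

The first sample point is itself a vertex, so it contributes zero to the cost; the second contributes its tropical distance 3 to the polytope.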