Tropical Balls and Its Applications to K Nearest Neighbor over the Space of Phylogenetic Trees

: A tropical ball is a ball deﬁned by the tropical metric over the tropical projective torus. In this paper we show several properties of tropical balls over the tropical projective torus and also over the space of phylogenetic trees with a given set of leaf labels. Then we discuss its application to the K nearest neighbors (KNN) algorithm, a supervised learning method used to classify a high-dimensional vector into given categories by looking at a ball centered at the vector, which contains K vectors in the space.


Introduction
A phylogenetic tree with a given set of leaf labels [n] = {1, . . ., n} is a weighted tree whose leaves have labels [n] while their interior nodes do not have labels.In phylogenetics, leaves in a phylogenetic tree represent observable species [n] in the current time and a tree represents an evolutionary relationship between these species in a given set [n].
In order to study evolutionary histories of species or genes in terms of molecular clock, leaves in a phylogenetic tree represent a given set of observable species in the current time, internal nodes in the tree represent common ancestors and branch lengths in the tree present evolutionary time in a molecular clock.Since we assume that all species in the tree have the same most common ancestor (the root of a phylogenetic tree), a phylogenetic tree of a given set of species has the property that a distance from its root to each leaf is same for all leaves in the tree.We call such a rooted phylogenetic tree an equidistant tree.An example of an equidistant tree is shown in Figure 1.In phylogenetics and phylogenomics, we often use equidistant trees to analyze genome data since multispecies coalescent processes applied to analyze gene trees and species tree in genome data [1] assume that all gene trees are equidistant trees.
Phylogenomics is a field in which we apply tools from phylogenetics to problems in genomics.More specifically, phylogenomics extracts information from comparative study on entire genomes by constructing phylogenetic trees from each gene.In phylogenomics, researchers are interested in problems like predictions of gene function; evolutionary relationships between genes; and finding lateral gene transfers.For specific examples, genetic drift and gene flow in ancestral populations can cause topological differences between gene trees [1][2][3][4].In the statistical point of view, these problems can be seen as finding outliers from a sample of phylogenetics with the same set of leaf labels reconstructed from genes.However, a space of phylogenetics trees (the set of all possible phylogenetic trees with a given set of leaf labels [n]) is not Euclidean.Thus, we cannot just simply apply statistical methods over a Euclidean space.Therefore, we have to consider a space of phylogenetic trees, which contains all possible phylogenetic trees of the same leaf set.
There are several ways to define a space of phylogenetic trees (also known as a tree space).These differences come from ways to define a phylogenetic tree and to vectorize each phylogenetic tree into a vector-representation (see [5] for details on differences between different tree spaces).In this paper we focus on the space of phylogenetic trees as the set of all equidistant trees with a fixed set of labels for leaves.In 2009, Speyer and Sturmfels showed a space of phylogenetic trees is a tropical Grassmanian [6], that is, a tropicalization of a linear space defined by a set of linear equations [7], which is a linear subspace of the tropical projective torus, with the max-plus algebra.Therefore it is natural to use tropical geometry to develop statistical methods to analyze data points over a space of phylogenetic trees.In this paper, we use the tropical metric.The tropical metric over a space of phylogenetic tree is well defined and [8,9] investigated its properties over the tree space.In 2020, Monod et al. in [10] introduced tropical balls over the space of phylogenetic trees with the tropical metric and defined probability measures using tropical balls.In [10], Monod et al. defined tropical balls with the tropical metric over the tropical projective torus.In this paper we show properties of tropical balls over the tropical projective torus with the tropical metrics as well as the space of phylogenetic trees.Then, we compare tropical balls with balls defined with L 2 norm and L ∞ norm.
The K Nearest Neighbors (KNN) algorithm is an instance-based method which classifies a vector in a high-dimensional vector space into finite categories [11].The basic idea of the KNN algorithm is that all data points with the same category should be distributed near to each other.The KNN algorithm, first, considers the ball around the vector which you want to categorize.Next it increases the radius of the ball with a given metric until it contains K many points from the training set.Then we assign the category to the input vector which is the majority of data points from the training set in the ball.KNN algorithm is well-studied and it is applied to "Big Data" [12].For example, [13] showed that the KNN algorithm is a special case of kernel density estimator.
The KNN algorithm has been applied to classify data points over a lower dimensional manifold (for example, [14]).However, since a space of phylogenetic trees is not a manifold over a Euclidean space and currently there has not been applied to classify data points over a space of phylogenetic trees.Therefore, in this paper, we consider tropical KNN, that is, KNN algorithm with the tropical metric defined over the tropical projective torus under the max-plus algebra to KNN and then we applied tropical KNN to classify data points over a space of phylogenetic trees.
The contributions of this paper include that 1.
we show some properties of a tropical ball in the tropical projective torus; 2.
we show some properties of a tropical ball in a space of equidistant trees with a given set of leaves [n]; 3.
we compare tropical balls with balls defined with L 2 norm and L ∞ norm; 4.
we define a tropical KNN algorithm; and 5.
we applied tropical KNN algorithm to simulated data generated by a multispecies coalescent model.

Preliminaries
In this section, we set up some notation and definitions from phylogenetics and tropical geometry.For interested readers, see [15] for more details.
First we discuss some definitions from phylogenetics.A dissimilarity map w is a function If a dissimilarity map w satisfies a triangle inequality then w is called a metric.If there exists a phylogenetic tree T with the leaf labels [n] such that the total branch length from a leaf i ∈ [n] to a leaf j ∈ [n] coincides with w(i, j) for all i, j ∈ [n], then we call w a tree metric.In fact if w is a tree metric of a phylogenetic tree T, then T has a unique tree metric w and T is unique for a tree metric w.
Definition 1.If a metric w satisfies the following condition, such that max{w(i, j), w(i, k), w(j, k)} is achieved twice for every distinct i, j, k ∈ [n], then we call a metric w an ultrametric.
To vectorize an equidistant tree, we can use an ultrametric w since there is a one-to-one mapping from a set of all possible equidistant trees with a leaf set [n] to the space of all possible ultrametrics: Theorem 1 ([10]).A tree metric w of a phylogenetic tree T with a set of leaf labels [n] is an ultrametric if and only if T is an equidistant tree with a set of leaf labels [n].
Therefore, in this paper, we define a space of equidistant trees with a leaf set [n] as the space of ultrametrics notated as U n .Now we shift our attentions to basics from tropical geometry.Throughout this paper we consider the max-plus tropical semiring ( R ∪ {−∞}, ⊕, ) .
Over this semiring, the tropical arithmetic operations of addition and multiplication are defined as the following: x ⊕ y := max{x, y}, x y := x + y for any x, y ∈ R ∪ {−∞}.
Any semiring has to have the identity for addition and identity for multiplication.In this semiring, −∞ is the identity for addition and 0 is the identity for multiplication.
Let e := ( n 2 ) and let R e /R1, where 1 := (1, 1, . . ., 1), be the tropical projective torus.Note that U n ⊂ R e /R1.Scalar multiplication and vector addition can be defined as: Through the paper, we use the tropical metric: Definition 2. For u = (u 1 , . . ., u e ), v = (v 1 , . . ., v e ) ∈ R e /R1, the tropical distance d tr between u and v is defined as: Recall that from Theorem 1 we consider the space of ultrametrics with labels [n] as a space of all equidistant trees with the label set [n].Let U n be the space of ultrametrics for equidistant trees with the leaf labels [n].In fact we can write U n as the tropicalization of the linear space generated by linear equations over the tropical projective torus R e /R1 where e = ( n 2 ).
Let L n ⊆ R e be the linear subspace of R e defined by the linear equations: for 1 ≤ i < j < k ≤ n.For the linear equations (2) spanning the linear space L n , the max-plus tropicalization Trop(L n ) of the linear space L n is the tropical linear space with w ∈ R e such that max w ij , w ik , w jk achieves at least twice for all i, j, k ∈ [n].Note that this is exactly the three point condition defined in Definition 1.
Therefore, by Theorem 2, U n is a tropical linear space in R e /R1 where e = ( n 2 ).For example, if n = 4, the space of ultrametrics U 4 is a union of 15 two-dimensional polyhedral cones over the tropical projective torus R 6 /R1.

Properties of Tropical Balls over the Space of Ultrametrics
In this section we investigate the tropical ball in terms of the tropical metric in the space of ultrametrics U n .Here we fix the height of the equidistant trees associate with their ultrametrics in U n as 1.All proofs for Lemmas and Theorems in this section are in Section 6. Recall e := ( n 2 ).
Lemma 1.If u := (u 1 , . . ., u e ) ∈ U n is from an ultrametric realizing an equidistant tree of height 1, then there is i ∈ {1, . . ., e} such that u i = 2.In addition, We will investigate a tropical ball centered at a point in U n with a radius r > 0 defined via the tropical metric d tr .We define a ball B x (r) at a point x ∈ U n with a radius r > 0 under the tropical metric d tr as the following: In the rest of this section, we show some properties on tropical balls in U n : Proposition 1. Suppose x ∈ U n is the origin, that is, the ultrametric x = (0, 0, . . ., 0) = (2, 2, . . ., 2) ∈ U n .In terms of equidistant trees, x represents a star tree with its height 1.Then for Now we show that a tropical ball is convex in terms of the tropical metric.For Lemmas 2 and 3, and Theorem 3, let us consider the tropical projective torus R e /R1.Until this moment in this section, we consider ultrametrics in U n ⊂ R e /R1.Even with a natural bijection between ultrametrics in U n and the space of equidistant trees described in Theorem 1, one might want to describe a tropical balls in terms of equidistant trees.In general we can describe a tropical ball in U n in terms of equidistant trees using the following theorem: Theorem 5. Let T n (h) be the set of all equidistant trees with their height h > 0 and with the leaf set [m] = {1, . . ., n}.Also let d T (i, j) be a pairwise distance between leaves i, j ∈ {1, . . ., n} in an equidistant tree T ∈ T n (h).Suppose T 1 , T 2 ∈ T n (h).Then let M(T 1 , T 2 ) = max{d T 1 (i, j) − d T 2 (i, j) : i, j ∈ [n]}.For an equidistant tree T x ∈ T n (h) and for 0 < r ≤ 4h,

Examples in U 4
In this section, we visualize tropical balls in the space of ultrametrics and in order to visualize, we consider n = 4.To visualize tropical balls in U 4 , we map all coordinates in U 4 to the Billera-Holmes-Vogtmann (BHV) treespace [16] using the one-to-one mapping described in [17].
Example 1. First, we consider the tree shown in the left side of Figure 2.This is an example so that the center of a tropical ball and entire tropical ball are inside of an orthant for a tree topology in Figure 2.
Here we have the height h = 1, so The corresponding ultrametric is The center of this tropical ball in this example is a = b = 1/4.The corresponding ultrametric is Therefore, we have The tropical ball in U 4 mapped in the BHV tree space is shown in the right side of Figure 2. Example 2. For the second example, we consider the tree shown in the left side of Figure 3.This is also a tropical ball contained inside of an orthant for the tree topology shown in Figure 3.
Here we have the height h = 1, so The corresponding ultrametric is The center of this tropical ball in this example is a = b = 1/2.The corresponding ultrametric is Therefore, we have The tropical ball around the ultrametric corresponding the equidistant tree is shown in the right side of Figure 3. Example 3. The third example we consider is given in the left side of Figure 4.This example shows a case that the tropical ball crosses between two orthants for two tree topologies shown in Figure 4.
The center tree of the tropical ball in this example is the tree where b = 0, c = 0.This is the tree on the boundary of two orthants in the space of phylogenetic trees in this case.Here we have the height h = 1, so The corresponding ultrametrics are The center of the tropical ball in this example is a Therefore, we have The tropical ball centered around the ultrametric corresponding to the equidistant tree where b = 0 and c = 0 is shown in the right side of Figure 4.

Approximation of a Tropical Ball
In this subsection we discuss how a tropical ball with the radius r > 0 relates with a ball under the l 2 and l ∞ metrics.Suppose we have u, v ∈ R e /R1.Let d 2 (u, v) be the l 2 norm metric and d ∞ (u, v) be the l ∞ norm between x and y.
Suppose we can assume that (u − v) > 0 and min(u − v) = 0 for u, v ∈ R e /R1.Then we have the following proposition: Proposition 2. We assume that (u − v) ≥ 0 over R e /R1 and min(u − v) = 0. Then we have In this case the tropical ball coincides with the ball defined with the l ∞ norm.However, if we are working on the space of ultrametrics, we have the constraints that all points have to be ultrametrics and no longer we can assume that (u − v) > 0 and min(u − v) = 0 in U n .Therefore for more general cases, we have the following bounds: Proposition 3. Suppose we have u, v ∈ R e /R1.Then we have Proof.The second inequality is trivial.Thus we want to prove that We can define d tr (u, v) as Therefore Using Proposition 3 we have the following theorem: Theorem 6.Let B 2 r (x) be a ball around a point x ∈ R e /R1 with the radius r > 0 with the l 2 metric d 2 and let B ∞ r (x) be a ball around a point x ∈ R e with the radius r > 0 with the l 2 metric d ∞ .Then we have Using Theorem 6 we can approximate a tropical KNN algorithm using the classical KNN algorithm with d 2 metric.We show some computational experiments using the multi-species coalescent model in the following section.

Computational Results
By Theorem 6 we can approximate a tropical ball using d 2 metric.Since the KNN algorithm uses the notion of balls to classify each data point in a test set, we can approximate a tropical KNN algorithm using the classical KNN algorithm.In this section we compare the tropical KNN algorithm and the classical KNN algorithm using d 2 with simulated data sets generated under the multi-species coalescent model via the software Mesquite [18].All computations are conducted in Apple Notebook MacBook Pro 2019 with 2.4 GHz 8-Core Intel Core i9 and 64 GB 2667 MHz DDR4.The R code used for this simulation study can be found at polytopes.net/tropical_KNN.tar.
First we review the KNN algorithm.The KNN algorithm is a classification model for a data set where the response variable is categorical with finite levels and explanatory variables are all numerical.Suppose we have a data set {(c 1 , x 1 ), . . .(c m , x m )} where c i ∈ {1, . . ., C} with C > 0 positive integer and x i ∈ R e .The outline of the Algorithm 1 is following: Algorithm 1: KNN Algorithm.In terms of balls, basically we can think of the KNN algorithm as a method to find a ball around y, which contains k many points from the training set.From this view we can think that the KNN algorithm with the tropical metric d tr can be approximated using the KNN algorithm with the l 2 metric d 2 .For this simulation we implemented the tropical KNN, the KNN algorithm with the tropical metric in R and we use the KNN algorithm implemented in the "class" package in R [19].
For simulated data sets, we use the software Mesquite, available at http://mesquiteproj ect.org [18], to generate gene trees under the multispecies coalescent model.In this model, we have two parameters, the effective population size N e and species depth SD.In this simulation study, we set N e = 100, 000 and varied c = SD N e .
For this simulation, we generated species trees under the Yule model and gene trees given the species tree are generated by the multispecies coalescent model via Mesquite.
With the knn() function from the "class" package, it took 12.7 s to finish the computations while the tropical metric took several hours since we can speed up the KNN algorithm for the l 2 metric.The results are shown in Figure 5.For the weighted tropical KNN, we assigned weight when we compute the majority votes such that 1 (d tr (y, x i )) 2 .

Conclusions
In this paper, we discussed a tropical ball over the tropical projective torus and the space of ultrametrics.Then we discussed approximation of tropical balls using the l 2 and l ∞ metrics.
Then we discussed applications of the KNN algorithm and we showed by simulations, using the multi-species coalescent model, that approximation of tropical balls with the l 2 metric works well.In addition, we can consider the ensemble model, that is, taking the average of all three methods we used in the simulation study.This will increase the accuracy of the simulation study.
Since the tropical metric considers maximum and minimum of elements in a vector, this metric might be very sensitive to outliers.Therefore, we recommend to conduct analysis on outliers before applying a statistical method with the tropical metric.

Discussion
We focused on tropical balls with the tropical metric under the max-plus algebra in this paper.However, we still have many problems to solve.For example, we do not know how to compute a tropical ball over the tropical projective torus and the space of ultrametrics in general even though we can compute some small examples by hand.Explicitly, Problem 1. Develop an algorithm to compute a tropical ball around a point x ∈ R e /R with the radius r > 0, B r (x) ∈ R e /R.Problem 2. Develop an algorithm to compute a tropical ball around a point x ∈ U n with the radius r > 0, B r (x) ∈ U n .
In addition, we might want to consider a distribution-based clustering method, such as Density-based spatial clustering of applications with noise (DBSCAN) with the tropical metric for clustering gene trees.Similarly to the results on the KNN, with initial experiments, DBSCAN works fairly well on the data sets generated under the coalescent model shown in Figure 6.

Materials and Methods
In this section, we show proofs for propositions and theorems in Section 3.
Proof for Lemma 1.Since the root in the equidistant tree has degree of at least two, at least one pairwise distance in u is realized by a path through the root.In addition since the height of an equidistant tree is 1, therefor, the maximum of the pairwise distance from any leaf i to any leaf j is 2.
Proof for Theorem 5. Let x ∈ U m be an ultrametric associated with the equidistant tree T x ∈ T m (h) and let u ∈ U m be an ultrametric associate with the equidistant tree T u ∈ T m (h).Then we have d tr (x, u) = max(x − u) − min(x − u) ≤ r.
Thus, M(T x , T u ) + M(T u , T x ) ≤ r.
Also we have max(a x − z) = a + max i (x i − z i ), min(a x − z) = a + min i (x i − z i ).
Thus we have d tr (a x, z) = max(a x − z) − min(a x − z) = a + max i (x i − z i ) − (a + min i (x i − z i )) = max i (x i − z i ) − min i (x i − z i ) = d tr (x, z).

Figure 1 .
Figure 1.An equidistant tree with species S 1 , S 2 , S 3 .Leaves in the tree represent observable species S 1 , S 2 , S 3 in the given set of labels and internal nodes in the tree represent their common ancestors.Filled black circles represent observable states and unfilled circles represent unobservable states.Number in each branch in the tree represent its branch length and the total branch lengths from the root to each leaf are same for all leaves.

Lemma 2 .Lemma 3 .
Suppose x, z ∈ R e /R1 and a ∈ R. Then d tr (a x, z) = d tr (x, z).Suppose x, y, z ∈ R e /R1 and a ∈ R. Then d tr (x ⊕ y, z) ≤ max{d tr (x, z), d tr (x, z)}.Theorem 3. A tropical ball B x (r) in the tropical projective torus R e /R1 is convex under the tropical metric d tr .Since U m is a tropical linear space over the tropical projective torus R e /R1 by Theorem 2 and since B x (r) is convex, now we have the following theorem by Theorem 3. Theorem 4. A tropical ball B x (r) in U m is convex in terms of the tropical metric d tr .

Figure 2 .
Figure 2. The first example for visualizing a tropical ball.LEFT: The tree corresponding to the center of the tropical ball.RIGHT: The tropical ball centered around the ultrametric corresponding to the equidistant tree in U 4 .

Figure 3 .
Figure 3.The second example for visualizing a tropical ball.LEFT: The tree corresponding to the center of the tropical ball.RIGHT: The tropical ball centered round the ultrametric corresponding to the equidistant tree in U 4 .

Figure 4 .
Figure 4.The third example for visualizing a tropical ball.LEFT: The tree corresponding to the center of the tropical ball is the tree where b = 0 and c = 0 in the picture.In this case the center of the tropical ball is on the boundary between two orthants in the tree space.RIGHT: The tropical ball centered around the ultrametric corresponding to the equidistant tree in U 4 .

Figure 5 .
Figure 5. Accuracy Rates for the classical KNN, tropical KNN, and weighted tropical KNN on simulated coalescent models.

Proof for Theorem 3 .
Thus,d tr (x ⊕ y, z) ≤ max{d tr (x, z), d tr (x, z)}.Let u, v ∈ B x (r).Then, we have d tr (x, u) ≤ r and d tr (x, v) ≤ r.We want to show any points in the tropical line segmentΓ u,v between u, v is in B x (r).Let z ∈ Γ u,v .Then there exist a, b ∈ R such that z = a u ⊕ b v.Thus we haved tr (z, x) = d tr (a u ⊕ b v, x)≤ max{d tr (a u, x), d tr (b v, x)}by Lemma 3 = max{d tr (u, x), d tr (v, x)}by Lemma 2 ≤ r.

•
Input: A data point y ∈ R e from a test set, a training set {x 1 , . . .x m }, and a metric d.Order d(y, x 1 ), . . ., d(y, x m ) from the smallest to the largest.Suppose d(y, x i 1 ), . . ., d(y, x i k ) be the first k smallest distances.Consider categories of x i 1 , . . .x i k , that is, c i 1 , . . ., c i k and assign the class, which is the biggest frequency among c i 1 , . . ., c i k , to y.
tr (x ⊕ y, z) = max i {max(x i , y i ) − z i } − min i {max(x i , y i ) − z i } = max i {max(x i − z i , y i − z i )} − min i {max(x i − z i , y i − z i )}, and max{d tr (x, z), d tr (x, z)} = max{max i (x i , z i ) − min i (x i − z i ), max i (y i , z i ) − min i (y i − z i )}.Also, min i (x i − z i ) ≤ min i {max(x i − z i , y i − z i )} min i (y i − z i ) ≤ min i {max(x i − z i , y i − z i )},and max{max i (x i − z i ), max i (y i − z i )} = max i {max(x i − z i , y i − z i )}.