Abstract
Given a labeled tree topology t of n taxa, consider a population P of k leaves chosen among those of t. The clade of P is the minimal subtree of t containing P, and its size is the number of leaves in the clade. We study distributive properties of the clade size variable over labeled topologies of size n generated at random in the framework of Ford’s α-model. Under this model, starting from the one-taxon labeled topology, a random labeled topology is produced iteratively by a sequence of α-insertions, each of which adds a pendant edge to either a pendant or an internal edge of a labeled topology, with a probability that depends on the parameter α. Different values of α determine different probability distributions over the set of labeled topologies of given size n, with the special cases α = 0 and α = 1/2 corresponding to the Yule and uniform distributions, respectively. In the first part of the manuscript, we consider a labeled topology t of size n generated by a sequence of random α-insertions starting from a fixed labeled topology of given size k, and determine the probability mass function, mean, and variance of the clade size in t when P is chosen as the set of leaves of t inherited from the starting topology. In the second part of the paper, we calculate the probability that a set P of k leaves chosen at random in a Ford-distributed labeled topology of size n is monophyletic, that is, the probability that the clade of P has size k. Our investigations extend previous results on clade size statistics obtained for Yule and uniformly distributed labeled topologies.
MSC:
05C05; 60C05; 92B10
1. Introduction
The study of probabilistic properties of tree structures generated under random growth models is a popular subject in mathematical phylogenetics. Starting from a one-taxon tree, the Yule model generates a tree of n labeled taxa by adding new leaves equiprobably to pendant edges. By this process, different labeled topologies, or cladograms, of the same size are not equally likely to be generated, with more balanced labeled topologies having a larger probability of appearing. Under the uniform model, sometimes called the PDA (proportional to different arrangements) model [1], new leaves are added equiprobably to any edge, inducing the uniform distribution over the set of labeled topologies of a given size.
The Yule and uniform processes are the simplest stochastic models for generating random labeled topologies. They can be seen as specific instances of a popular random parametric model, the so-called Ford’s α-model [2]. In this model, given a random permutation σ of {1, ..., n}, a labeled topology t of size n results from a sequence of random α-insertions
t_1 → t_2 → ⋯ → t_n = t,   (1)
where t_1 is the one-taxon tree whose leaf is labeled by σ(1) and the tree t_{i+1} is obtained by attaching a new pendant edge labeled by σ(i + 1) to a pendant or internal edge of t_i with a probability depending on the value of the parameter α. Each value of α determines a different probability distribution over the set of labeled topologies of given size n. In particular, when α = 0, the α-model yields the Yule distribution, while when α = 1/2 the α-model induces the uniform distribution. When α = 1, i.e., when new leaves are added equiprobably only to internal edges, the α-model provides the comb distribution, under which only completely unbalanced trees (also called comb or caterpillar trees) have non-zero probability.
A key concept in phylogenetic studies is that of the “clade”. Given a labeled topology, a clade is a subtree rooted at one of its nodes; as such, it consists of a node together with all its descendants. The taxa of a clade are said to form a “monophyletic” group; from a biological perspective, they determine a set of individuals that are more genealogically related to each other than any of them is to any taxon that does not belong to the clade. Monophyly, and clade statistics more generally, have been deeply investigated in mathematical biology. The probability of one or two sets of randomly selected taxa forming monophyletic groups in a Yule or uniformly distributed labeled topology of fixed size has been studied in [3,4]. In [5], the authors investigated the random variable defined by the size of the minimal clade (or, equivalently, by the size of the minimal monophyletic group) containing a set P of k taxa selected at random from among the n leaves of a labeled topology generated at random under the Yule or PDA model. Here, we extend the calculations for the clade size by considering labeled topologies generated under Ford’s α-model when α is fixed in the range [0, 1]. In Section 3, we study the growth of the clade size of a set of k initially monophyletic leaves subject to a sequence of α-insertions. More precisely, we consider trees t of size n generated by random α-insertions performed starting from a tree of given size k, then determine the distribution of the clade size of the set of leaves P of t inherited from the starting tree. In Section 4, we broaden the results of [3,4] by studying the probability of a set P of k randomly chosen taxa forming a monophyletic group in a labeled topology of size n selected under the Ford distribution. We start with some basic terminology and preliminary results.
2. Preliminaries
In this section, we outline basic combinatorial and probabilistic properties of labeled topologies and Ford’s α-model. We also define the “clade size” variable, which is studied in the next sections over random labeled topologies.
A labeled topology, or cladogram, of size n is a full binary rooted tree with n leaves, also called taxa, labeled by the symbols 1, ..., n (Figure 1A). A subtree of a labeled topology consists of an edge together with all its descending edges and nodes. An internal edge is either the branch above the root node of the tree (edge i in Figure 1A) or a branch connecting two adjacent internal nodes (edges f, g, and h in the figure). A pendant edge is a branch that intersects a leaf (edges a, b, c, d, and e in Figure 1A). Each labeled topology of size n has n pendant edges and n − 1 internal edges. There are exactly (2n − 3)!! = 1 · 3 ⋯ (2n − 3) different labeled topologies of size n ≥ 2 [6], among which n!/2 have a completely unbalanced “caterpillar” structure such that (as in the rightmost tree of Figure 1B) each internal node has at least one leaf stemming directly from it.
Figure 1.
Labeled topologies and a sequence of α-insertions. (A) A labeled topology of size n = 5. The tree has n = 5 pendant edges (those in gray, denoted by the letters a, b, c, d, and e) and n − 1 = 4 internal edges (those in black, denoted by the letters f, g, h, and i). (B) The labeled topology on the right is generated under the α-model by the sequence of insertions shown, in which a permutation of the labels determines the label of the pendant edge added with the i-th insertion in (1).
Given a labeled topology t of size n, let P be a subset of its leaves. We define the clade of P as the subtree of t rooted at the most recent common ancestor of the taxa in P. In other words, the clade of P is the minimal subtree of t containing the taxa belonging to P. In the phylogenetic literature [7], it is also called the subtree of t induced by the set of leaves in P. As an example, let t be the labeled topology of Figure 1A and take P as the set containing the two leaves labeled by 3 and 5; then, the clade of P is the subtree of t whose leaves are labeled by 3, 4, and 5. The clade size of P is defined as the number of taxa belonging to the clade of P. Clearly, the clade size of P is at least |P| and at most n, and the population P is said to be monophyletic when its clade size equals |P|.
Ford’s α-model [2] is a probability model over the set of labeled topologies of fixed size. For a given value of the parameter α, this model generates a random labeled topology t of size n by first selecting a permutation σ of the labels 1, ..., n uniformly at random; then, starting from the one-taxon tree t_1 whose leaf is labeled by σ(1), a tree of size n is generated as the n-th element of a sequence (1) of trees of increasing size, in which tree t_{i+1} is obtained by attaching a new pendant edge labeled by σ(i + 1) to a random edge of tree t_i. Depending on the value of α, the α-insertion of the new pendant edge has to satisfy the following rule: each pendant (resp. internal) edge of t_i has probability (1 − α)/(i − α) (resp. α/(i − α)) to intersect the new edge; these probabilities sum to one because t_i has i pendant and i − 1 internal edges. For instance, consider the labeled topology t of size 4 shown on the right of Figure 1B. For the permutation considered there, the insertion process that generates t is the one depicted in the figure.
Thus, under Ford’s model, t has conditional probability . Indeed, with probability , the pendant edge labeled by is attached to the unique pendant edge of ; then, the pendant edge labeled by is attached to the pendant edge of labeled by 1 with probability , and finally, with probability , the pendant edge labeled by is attached to the internal edge of that connects the root of the subtree to the root of . Similarly, if , then the insertion process that generates the same labeled topology is
with t having conditional probability , as the pendant edges labeled by and are attached to the pendant edges labeled by 4 in and by 1 in , respectively. Performing similar calculations for all possible permutations , the probability of the labeled topology t under Ford’s -model can be calculated as . An explicit formula for the probability of a labeled topology is provided in Proposition 2 of [8].
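The insertion rule of Ford’s model can be made concrete with a short exact enumeration. The following Python sketch is illustrative only: the nested-tuple encoding of trees and the helper names (`insertions`, `ford_distribution`) are assumptions of the sketch and do not come from the text. Each new leaf is attached to a pendant edge with weight 1 − α or to an internal edge with weight α, the weights are normalized by i − α, and the result is averaged over all permutations of the labels.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def size(t):
    """Number of leaves of a tree given as an int (leaf) or a pair of subtrees."""
    return 1 if isinstance(t, int) else size(t[0]) + size(t[1])


def min_leaf(t):
    """Smallest label in t, used to order the two children canonically."""
    return t if isinstance(t, int) else min(min_leaf(t[0]), min_leaf(t[1]))


def join(a, b):
    """Unordered pair of subtrees stored in a canonical left/right order."""
    return (a, b) if min_leaf(a) < min_leaf(b) else (b, a)


def insertions(t, x, alpha):
    """Yield (tree, probability) for every way of attaching leaf x to an edge of t."""
    alpha = Fraction(alpha)
    n = size(t)
    if n == 1:  # a single edge is available, so the insertion is forced
        yield join(t, x), Fraction(1)
        return

    def rec(s):
        # The edge above subtree s is pendant if s is a leaf, internal otherwise.
        weight = (1 - alpha) if isinstance(s, int) else alpha
        yield join(s, x), weight / (n - alpha)
        if not isinstance(s, int):
            left, right = s
            for new_left, p in rec(left):
                yield join(new_left, right), p
            for new_right, p in rec(right):
                yield join(left, new_right), p

    yield from rec(t)


def ford_distribution(n, alpha):
    """Exact probability of every labeled topology of size n (small n only)."""
    perms = list(permutations(range(1, n + 1)))
    dist = defaultdict(Fraction)
    for sigma in perms:
        layer = {sigma[0]: Fraction(1, len(perms))}
        for x in sigma[1:]:
            nxt = defaultdict(Fraction)
            for t, p in layer.items():
                for t2, q in insertions(t, x, alpha):
                    nxt[t2] += p * q
            layer = nxt
        for t, p in layer.items():
            dist[t] += p
    return dict(dist)
```

For n = 4, this enumeration gives probability 1/15 to each of the 15 labeled topologies when α = 1/2, while for α = 0 it gives 1/9 to each balanced topology and 1/18 to each caterpillar, in agreement with the uniform and Yule distributions.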
3. Clade Size of a Set of Initially Monophyletic Leaves Subject to α-Insertions
Let t be a labeled topology of size n generated by a sequence of random α-insertions
t_k → t_{k+1} → ⋯ → t_n = t,   (2)
where t_k is any starting tree of fixed size k with leaves labeled by the symbols 1, ..., k, and tree t_{i+1} is obtained by attaching a new pendant edge labeled by i + 1 to one of the edges of t_i, each of the i pendant (resp. i − 1 internal) edges being chosen with probability (1 − α)/(i − α) (resp. α/(i − α)) (Figure 2).
Figure 2.
A sequence of α-insertions. Starting from the labeled topology depicted on the left, the sequence generates the labeled topology t of size 7 depicted on the right. After the insertions, the clade size of the set of leaves inherited from the starting topology has increased by one in the final tree t.
In what follows, we study the random variable defined as the clade size in t of the set of leaves P inherited from the starting tree. Our goal is to measure how a sequence of random α-insertions modifies the clade size of the set of leaves P, which is monophyletic in the starting tree at the beginning of the sequence of insertions. We first provide a recurrence for the probability mass function of this random variable by using the iterative construction (2) of the random labeled topology t.
Lemma 1.
Fix . In a random labeled topology of size generated by random insertions, as in (2), the probability mass function of the clade size of the set of leaves inherited from satisfies the following recurrence:
with boundary conditions if or and if .
Proof.
As in (2), the random labeled topology of size n is generated by attaching a pendant edge with leaf labeled by n to one of the edges of the random labeled topology of size n − 1. The clade size of P increases by one only when the new pendant edge is attached to an edge placed below the root (say, r) of the clade of P in the topology of size n − 1. Hence, conditioning on the clade size of P in the topology of size n − 1, we have
where is if in and if in . This provides the claimed recurrence. □
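The conditioning argument in the proof can be turned into a small exact computation. The sketch below is a hedged illustration rather than a transcription of (3): it assumes k ≥ 2 and counts the edges placed below the root of a clade of size d as d pendant and d − 2 internal edges, so that the clade grows with probability (d − 2α)/(m − α) when the current tree has m leaves. The function name `clade_size_pmf` is an assumption of the sketch.

```python
from fractions import Fraction


def clade_size_pmf(n, k, alpha):
    """Exact pmf {d: P(clade size = d)} after growing a size-k tree to size n.

    Assumes k >= 2. A clade of size d in a tree with m leaves has d pendant
    and d - 2 internal edges below its root, of total weight d - 2*alpha.
    """
    alpha = Fraction(alpha)
    pmf = {k: Fraction(1)}  # the inherited leaves start as the whole tree
    for m in range(k, n):  # m = current number of leaves
        nxt = {}
        for d, p in pmf.items():
            grow = (d - 2 * alpha) / (m - alpha)
            nxt[d + 1] = nxt.get(d + 1, 0) + p * grow
            nxt[d] = nxt.get(d, 0) + p * (1 - grow)
        pmf = nxt
    return pmf
```

For example, `clade_size_pmf(3, 2, Fraction(1, 2))` assigns probability 2/3 to clade size 3 and 1/3 to clade size 2, and for α = 0 all the mass sits on d = n, since under the Yule model new leaves are never attached to the edge above the root.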
From the recurrence provided in (3), it is possible to derive closed formulas for the probability mass function of the clade size in terms of the generalized binomial coefficient C(x, y) = Γ(x + 1)/(Γ(y + 1) Γ(x − y + 1)), where Γ is the Γ-function. This is done in the following theorem, in which the mean and variance of the clade size are also determined.
Theorem 1.
Fix . In a random labeled topology of size generated by random insertions, as in (2), the probability of the clade size of the set of leaves inherited from being d is provided by
Furthermore, the expected value and variance of the clade size of P are, respectively,
Proof.
We can prove the first formula for by induction on . For , we must have if and otherwise, which is consistent with the claimed formula. Now, assuming and using the recurrence from Lemma 1, we have
where in the third equality we have used the binomial identity that holds for every complex number x and every non-negative integer y.
By induction on , we can now show the formula for the expected value. If , then we must have , which is in agreement with the claimed formula. Assuming , the inductive step provides
Finally, we prove the formula for the variance ; we proceed by showing that the second moment of the variable reads as follows:
which, together with , yields the claimed formula for the variance calculated as . If , then , which is in agreement with the claimed formula. Assuming , the inductive step yields
□
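As a numerical companion to the theorem, the following sketch evaluates one product form of the probability mass function. It is obtained by reading the growth of the clade as a Pólya-type urn with initial weights k − 2α (edges below the clade root) and α (the edge above it), each increased by one at every insertion. This urn reading, the helper names, and the moment expressions used in the checks are assumptions of the sketch, valid for k ≥ 2, and should be compared with (4) before being relied upon.

```python
from fractions import Fraction
from math import comb


def rising(x, m):
    """Rising factorial x (x + 1) ... (x + m - 1); the empty product is 1."""
    out = Fraction(1)
    for j in range(m):
        out *= x + j
    return out


def urn_pmf(n, k, alpha):
    """P(clade size = d) for d = k..n under the urn reading described above."""
    alpha = Fraction(alpha)
    a, b = k - 2 * alpha, alpha  # initial weights inside / outside the clade
    total = rising(a + b, n - k)
    return {
        d: comb(n - k, d - k) * rising(a, d - k) * rising(b, n - d) / total
        for d in range(k, n + 1)
    }


def mean_and_variance(pmf):
    """Mean and variance of a pmf given as a dict {value: probability}."""
    mean = sum(d * p for d, p in pmf.items())
    second = sum(d * d * p for d, p in pmf.items())
    return mean, second - mean ** 2
```

With exact rational arithmetic, the mean computed this way equals k + (n − k)(k − 2α)/(k − α), which decreases as α grows and equals n when α = 0.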
Remark 1.
We conclude the section with a few observations following from the latter theorem.
- (i)
- The right-hand side of Equation (4) is extended by continuity to the cases of or . In particular, when and , the second factor in the formula reads as , while for and it is provided by . If we instead have and , then the third factor in the formula becomes , while for and it reads as .
- (ii)
- Lower values of the α parameter determine a larger expected clade size and smaller variance. This is shown in Figure 3, where we perform a sequence of α-insertions (2) starting from a labeled topology of size , then calculate the mean and variance of the clade size of the set of leaves of in the resulting random labeled topology of size n (Figure 2).
Figure 3. Expected value and variance of the clade size of a set of initially monophyletic leaves after random α-insertions for two values of the parameter α (dots and boxes).
- (iii)
- For fixed values of k and d, the asymptotic formula for , which holds for every complex number , yields the following asymptotic expansion as for the probabilityIn particular, for increasing values of n, the probability of the set of initially monophyletic leaves remaining monophyletic decreases to 0 as quickly as . From Theorem 1, the probability of the clade size of the starting set of leaves being equal to n is instead seen to be . By substituting in (4), we find that for , the probability of the ratio being a fixed satisfies
- (iv)
- It is possible to calculate the probability mass function of the clade size variable without making use of the generalized binomial coefficient. Indeed, by simple algebraic manipulations, from the formula provided in the latter theorem we can also writewhere we admit empty products.
4. Probability of Monophyly for a Set of Random Taxa
In this section, we determine the probability that a set P of k randomly chosen taxa forms a monophyletic group in a labeled topology of size n selected under the Ford distribution. Without loss of generality, we assume P to be the set of taxa whose labels belong to {1, ..., k} in a random labeled topology of n leaves. Note that when α = 0, the α-model corresponds to the Yule model; from [5], we know that in this case the probability of monophyly is 2n/(k(k + 1) C(n, k)) for k < n, where C(n, k) denotes the binomial coefficient. If we instead have α = 1, then a random labeled topology of size n generated under Ford’s model has a completely unbalanced (caterpillar) tree shape such as the one on the right of Figure 1B. Thus, for k ≥ 2, the probability of the leaves with labels in {1, ..., k} forming a monophyletic group is 1/C(n, k), as there is only one subtree of size k in such a caterpillar labeled topology.
We now focus on the values of α within the range 0 < α < 1. As described in Section 2, a random labeled topology t of n taxa is generated by first selecting a random permutation of size n, then performing a sequence of random α-insertions (1). The probability that
where each is an element of and each is a sequence of length (with ) over the alphabet , is provided by
Conditioned on as in (5), the probability that the set P of taxa is monophyletic after a sequence of random insertions of pendant edges labeled by the symbols in is a chain of (possibly empty) products
where is the probability of attaching a new pendant edge (say, one labeled by x) in a tree of size z that has a clade of size c, in such a way that the clade size of is and is the probability of attaching a new pendant edge in a tree of size z that has a clade of size c, while the clade size of remains equal to c after the insertion. For instance, if is the clade of size in the labeled topology t of size depicted in Figure 1A, then a new pendant edge labeled by x can be attached to different pendant edges of t (i.e., , and e) or to different internal edges of t (i.e., g and h) such that , which happens with probability . By considering the same tree t of size and the same clade of size , we see that a new pendant edge can be attached to different pendant edges of t (i.e., a and b) or to different internal edges of t (i.e., , and i) such that after the insertion we have , which happens with probability . In particular, each factor (resp. ) in (6) yields the probability of the clade size of (resp. ) being (resp. i) after insertion of the pending edge labeled by (resp. ).
Summing over the possible values of , the probability of the population P of taxa with label in being monophyletic in a random labeled topology of size n can then be calculated as follows:
In order to simplify the latter formula for the probability , we need the next lemma.
Lemma 2.
Let n be a positive integer; then, we have the following equality for the r nested sums:
Proof.
We proceed by induction on . If , then the sum reduces to . Now, assuming , setting , and applying the binomial “hockey stick” identity , we obtain
□
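The hockey stick identity used in the proof, and the way it collapses nested sums into a single binomial coefficient, can be checked numerically. The sketch below only treats the simplest instance, in which every summand equals 1 and the indices satisfy 1 ≤ i_1 ≤ ... ≤ i_r ≤ n; this choice of summand and index range, like the function names, is an assumption made for illustration and is not the statement of Lemma 2.

```python
from itertools import combinations_with_replacement
from math import comb


def hockey_stick_holds(n, r):
    """Check the identity sum_{i=r}^{n} C(i, r) = C(n + 1, r + 1)."""
    return sum(comb(i, r) for i in range(r, n + 1)) == comb(n + 1, r + 1)


def nested_sum_count(n, r):
    """Value of r nested sums of 1 over 1 <= i_1 <= ... <= i_r <= n."""
    return sum(1 for _ in combinations_with_replacement(range(1, n + 1), r))
```

Applying the hockey stick identity once per nesting level gives `nested_sum_count(n, r) == comb(n + r - 1, r)`, which is the pattern exploited by the induction on r.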
Then, we have the following theorem.
Theorem 2.
Under Ford’s α-model with , the probability of the set of leaves being monophyletic in a random labeled topology of size n is provided by
Proof.
We use the following notation: for each , set ; then, and . In particular, we can write
Furthermore, for every , let
We start by rewriting the products appearing in (7). First, we have
Second, we find
from which the second product in (7) reads as
Hence, by multiplying the products of (9) and (10), we obtain
and (7) becomes
Because , we find
From the last lemma, we know that
Hence, we obtain
Setting and denoting the argument of the latter sum by , we have
where
Thus, the formula for can be written as
where in the second identity we have used the fact that
□
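The probability of monophyly can also be cross-checked without the closed formula, by following the insertion process directly. The sketch below is an illustration under stated assumptions: it tracks the number c of labels of P already inserted, uses the fact that P can only be monophyletic at the end if the inserted part of P is monophyletic at every step, and takes the survival probabilities (c − α)/(z − α) for a leaf of P and (z − c + α)/(z − α) for a leaf outside P in a tree of size z, with the degenerate cases c = 0, c = 1, and c = z handled separately. The function name is hypothetical.

```python
from fractions import Fraction


def monophyly_probability(n, k, alpha):
    """P(labels 1..k form a clade) in a Ford-distributed topology of size n."""
    alpha = Fraction(alpha)
    # state[c] = P(c labels of P inserted so far and they form a clade)
    state = {0: Fraction(1)}
    for z in range(n):  # z = current number of leaves
        nxt = {}
        for c, p in state.items():
            in_left = k - c                # labels of P still to insert
            out_left = (n - k) - (z - c)   # other labels still to insert
            if in_left > 0:
                # New leaf of P: any edge works if c is 0 or z, otherwise the
                # c pendant and c - 1 internal edges of the clade and above it.
                keep = Fraction(1) if c in (0, z) else (c - alpha) / (z - alpha)
                nxt[c + 1] = nxt.get(c + 1, 0) + p * Fraction(in_left, n - z) * keep
            if out_left > 0:
                # New leaf outside P: it must avoid the edges below the clade root.
                keep = Fraction(1) if c <= 1 else (z - c + alpha) / (z - alpha)
                nxt[c] = nxt.get(c, 0) + p * Fraction(out_left, n - z) * keep
        state = nxt
    return state[k]
```

For n = 4 and k = 2, this gives 2/9 for α = 0, 1/5 for α = 1/2, and 1/6 for α = 1, matching the Yule, uniform, and comb cases.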
Remark 2.
From the latter theorem, we observe the following.
- (i)
- For different values of n, we can observe a different behavior by plotting the (natural logarithm of the) probability as a function of α (Figure 4). The mentioned probability can be an increasing function of α (left panel), a decreasing function of α (right panel), or a unimodal function of α (middle panel).
Figure 4. Natural logarithm of the probability of monophyly for three choices of the parameters (from left to right), as α ranges in steps of 0.05.
- (ii)
- The formula for that appears in (8) calculates the probability of monophyly for a set of k taxa chosen at random from among the leaves of a labeled topology of size n selected under the Ford’s distribution and when or (the latter when we consider the limit ). Indeed, for , by substituting in the third of the following equalities and applying the Chu–Vandermonde identity in the fourth equality, the formula for reduces toFurthermore, when , we can use Legendre’s duplication formula to findFor , we havewhile providesHence,showing that the formula in (8) extended by continuity to yields the right expression for the probability .
- (iii)
- The formula for provided in the latter theorem can be further simplified for fixed values of k. For instance, when or , Equation (8) respectively reduces to the following:andMore generally, explicit expressions for can be determined for fixed values of k by writing the sum in (8) as a telescoping sum:wherefor a polynomial whose coefficients satisfyand the linear systemNote that if satisfies (12) and (13), then as in the first equality of (11) we indeed findbecause and are identical as polynomials of degree in the variable h with leading coefficient and roots .
We conclude this section with an asymptotic estimate for when and k are fixed and n goes to infinity.
Theorem 3.
Fix and ; then, when , we have the following asymptotic equivalence:
Proof.
We start by analyzing the sum
where
We write
For , Gautschi’s inequality (which states that for every positive real number x and every s in ) yields
Combining this with the elementary inequality
we obtain
where we have used and to respectively denote the lower and upper bounds of the previous inequality. For , and , consider the function defined by (with the convention ). Note that f is (strictly) increasing for , (strictly) decreasing for , and admits a maximum with . Note also that by the substitution we have
where is the Beta function defined for every complex number z and w such that and . Because
and
we find
that is,
We apply the latter estimates to bound and . For each , we denote . We first consider . Because , we have
We now focus on . Since and , we have
Thus, both and are asymptotically equivalent to , as ; therefore, so is . Because , we can conclude that
Finally, by inserting the latter formula into (8) together with the asymptotic equivalences
we find
□
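Gautschi’s inequality, on which the bounds in the proof rest, states that x^(1 − s) < Γ(x + 1)/Γ(x + s) < (x + 1)^(1 − s) for every positive real x and every s in (0, 1). A minimal numerical check, with a hypothetical helper name and floating-point arithmetic, might look as follows.

```python
from math import gamma


def gautschi_bounds(x, s):
    """Return (lower, ratio, upper) for the ratio Gamma(x + 1) / Gamma(x + s)."""
    ratio = gamma(x + 1) / gamma(x + s)
    return x ** (1 - s), ratio, (x + 1) ** (1 - s)
```

The two bounds differ by a factor (1 + 1/x)^(1 − s), which tends to 1 as x grows; this is what makes the lower and upper estimates of the sum asymptotically equivalent.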
From Theorem 3 and the formulas for provided at the beginning of this section for and , we find that behaves as follows for :
where .
The asymptotic constant is plotted in Figure 5 for (left) and (center). The plot has different shapes depending on k; it is strictly decreasing when , while it has a maximum at when . The value of the parameter at which the constant reaches its maximum is shown as a function of on the right of Figure 5. Note that for small values of k, i.e., , is maximum when , that is, when the Ford model corresponds to the Yule model. When , the uniform model () is the phylogenetic scenario that maximizes .
Figure 5.
Plot of the asymptotic constant appearing in (14) when (left) and (center) with ranging in . The plot on the (right) shows the value of for a fixed such that is at its maximum.
5. Conclusions
In this article, we have investigated the distributive properties of the clade size of a population P of k leaves taken from n taxa of a labeled topology generated at random under different scenarios in the framework of Ford’s α-model.
Our goal in Section 3 was to measure how much a sequence of random α-insertions modifies the clade size of a set of initially monophyletic leaves. To this end, in Theorem 1 we have determined the probability mass function, mean, and variance of the clade size of the set of k leaves belonging to a random labeled topology of size n obtained through a sequence of α-insertions starting from a tree of size k labeled by the symbols in P. As might be expected, our results show that lowering the value of the parameter α, that is, decreasing the probability of attaching new leaves at the internal edges of the tree, increases the expectation of the clade size variable and reduces its variance (Figure 3).
Next, Section 4 was dedicated to the study of monophyly under the α-model. In Theorem 2, we have derived a formula for calculating the probability that a set of k randomly chosen taxa is monophyletic in a labeled topology of size n selected under the Ford distribution with parameter α. Previous studies [3,4] have investigated the probability of monophyly for the special cases of α = 0 and α = 1/2, respectively corresponding to the Yule and uniform distributions of labeled topologies of a given size. Our formula for the probability of monophyly allows for the inference of the parameter α by conditioning on the monophyly of a random sample of k leaves in a labeled topology of size n generated under the Ford model. In particular, the conditional density function of α reads as
where is the prior density function of . In Figure 6, assuming a uniform prior distribution on the parameter, i.e., , we plot the conditional cumulative distribution of when a random sample of k leaves is observed to be monophyletic in a random labeled topology of size . The plot shows that for a relatively small value of k, i.e., (top curve), the probability of is increased for every x with respect to the prior distribution (the dashed line). For a larger value of k, i.e., (bottom curve), the probability of instead increases for every x, suggesting that small values of alpha are more likely when observing the monophyly of a small sample of leaves. This is confirmed by the plot in Figure 7, where we have calculated the conditional probability of being either small (), medium () or large () when k random leaves are seen to be monophyletic in a Ford-distributed tree of size , again assuming a uniform prior distribution on . The probability of a small or large value of respectively decreases or increases with the first values of k, while the probability of a medium value of appears to be more uncorrelated with k. For small, medium, or large , the Ford model has its best-known instances respectively provided by the Yule (), uniform (), or comb () model of random labeled topologies, respectively, and our calculations can be used to infer which model is more likely to produce a given tree of size n in which a set of k random leaves is found to form a clade. In the final part of Section 4, we have further investigated the probability of monophyly by determining its asymptotic behavior (Theorem 3). We have shown that decreases like a constant multiple of for every , that is, as in (14). Thus, for a given value of k, the probability of monophyly is asymptotically larger for those values of at which the factor attains its maximum (Figure 5, right).
Figure 6.
Cumulative distribution of the parameter conditioned on the monophyly of a random sample of k leaves in a Ford-distributed labeled topology of size . We set in the top curve and in the bottom curve. The dashed line is the cumulative prior distribution of when this is assumed to be uniform over the interval .
Figure 7.
Probability of being either small, i.e., (•), medium, i.e., (■), or large, i.e., (♦), when k random leaves of a Ford-distributed labeled topology of size are found to be monophyletic.
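The inference step described above is an application of Bayes’ rule: the conditional density of α is proportional to the probability of monophyly times the prior density. The sketch below approximates the conditional cumulative distribution under a uniform prior on [0, 1] with a midpoint rule. The function names are hypothetical, and the example likelihood (2 − α)/(3(3 − α)) for n = 4 and k = 2 was worked out by hand from the insertion rule for this illustration; it is not taken from the text, although it agrees with the Yule (2/9), uniform (1/5), and comb (1/6) values.

```python
def posterior_cdf(likelihood, x, steps=2000):
    """P(alpha <= x | monophyly) under a uniform prior on [0, 1] (midpoint rule)."""
    h = 1.0 / steps
    mids = [(j + 0.5) * h for j in range(steps)]
    total = sum(likelihood(a) for a in mids)
    below = sum(likelihood(a) for a in mids if a <= x)
    return below / total


def likelihood_n4_k2(alpha):
    """Probability of monophyly for n = 4, k = 2 (hand-derived, an assumption)."""
    return (2 - alpha) / (3 * (3 - alpha))
```

Because this likelihood decreases with α, `posterior_cdf(likelihood_n4_k2, 0.5)` exceeds 1/2: observing a monophyletic pair shifts mass toward small values of α, in line with the behavior described for small k.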
Our investigation of the probability of monophyly also relates to other studies of leaf-induced subtrees. In [9], for a uniformly distributed labeled topology T of size n, the authors studied in the limit the maximum value of the probability of observing a fixed “caterpillar” or “even” tree B of size k as the clade of a set of k leaves randomly sampled in T. The probability of monophyly studied in the latter section of the present paper for a Ford-distributed labeled topology T of size n can be seen as the sum of the quantities , when all trees B of k leaves are considered.
Several directions of research naturally arise from this work. In Section 4, our focus has been on the probability of the clade size of a set of k random taxa in a Ford-distributed labeled topology of given size being equal to k. It would be of interest to extend our calculations to broaden the results of [5] by determining the entire distribution of the clade size variable of a set of k randomly chosen taxa in a labeled topology of size n sampled under the Ford distribution. We remark that the population P for which the clade size is studied in Section 3 is not a set of k randomly chosen taxa of a random labeled topology t of size n, as P corresponds to the set of leaves present in the starting labeled topology from which a sequence of α-insertions has produced t. It also remains to investigate reciprocal monophyly, that is, the probability of two (or more) prespecified groups of leaves being monophyletic in a random tree selected under Ford’s distribution. This probability was calculated in a special case in Theorem 4.5 of [3]; however, a generalization to arbitrary values of α remains missing.
Author Contributions
Conceptualization, A.D.N. and F.D.; Investigation, A.D.N. and F.D.; Writing—original draft, A.D.N. and F.D.; Writing—review & editing, A.D.N. and F.D. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the authors.
Acknowledgments
ADN is a member of INdAM–GNSAGA and acknowledges financial support from INdAM. FD acknowledges the MIUR Excellence Department Project awarded to the Department of Mathematics, University of Pisa, CUP I57G22000700001.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Aldous, D.J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci. 2001, 16, 23–34.
- Ford, D.J. Probabilities on cladograms: Introduction to the alpha model. arXiv 2005, arXiv:0511246.
- Zhu, S.; Degnan, J.H.; Steel, M. Clades, clans, and reciprocal monophyly under neutral evolutionary models. Theor. Popul. Biol. 2011, 79, 220–227.
- Zhu, S.; Than, C.; Wu, T. Clades and clans: A comparison study of two evolutionary models. J. Math. Biol. 2015, 71, 99–124.
- Di Nunzio, A.; Disanto, F. Clade size distribution under neutral evolutionary models. Theor. Popul. Biol. 2024, 156, 93–102.
- Rosenberg, N.A. The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Ann. Comb. 2006, 10, 129–146.
- Semple, C.; Steel, M. Phylogenetics; Oxford Univ. Press: Oxford, UK, 2003.
- Coronado, T.M.; Mir, A.; Rosselló, F. The probabilities of trees and cladograms under Ford’s α-model. Sci. World J. 2018, 2018, 1916094.
- Czabarka, E.; Székely, L.A.; Wagner, S. Inducibility in Binary Trees and Crossings in Random Tanglegrams. SIAM J. Discret. Math. 2017, 31, 1732–1750.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).