Clade Size Statistics Under Ford’s α-Model

Di Nunzio, Antonio; Disanto, Filippo

doi:10.3390/math12243974

Open Access Article

by

Antonio Di Nunzio

and

Filippo Disanto

^*

Dipartimento di Matematica, Università di Pisa, 56127 Pisa, Italy

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(24), 3974; https://doi.org/10.3390/math12243974

Submission received: 24 September 2024 / Revised: 11 December 2024 / Accepted: 16 December 2024 / Published: 18 December 2024

(This article belongs to the Special Issue Combinatorics, Riordan Matrices and Umbral Calculus—in Memory of Prof. Emanuele Munarini)

Abstract

Given a labeled tree topology t of n taxa, consider a population P of k leaves chosen among those of t. The clade of P is the minimal subtree

\hat{P}

of t containing P, and its size

| \hat{P} |

is the number of leaves in the clade. We study distributional properties of the clade size variable

| \hat{P} |

considered over labeled topologies of size n generated at random in the framework of Ford’s

α

-model. Under this model, starting from the one-taxon labeled topology, a random labeled topology is produced iteratively by a sequence of

α

-insertions, each of which adds a pendant edge to either a pendant or internal edge of a labeled topology, with a probability that depends on the parameter

α \in [0, 1]

. Different values of

α

determine different probability distributions over the set of labeled topologies of given size n, with the special cases

α = 0

and

α = 1 / 2

respectively corresponding to the Yule and uniform distributions. In the first part of the manuscript, we consider a labeled topology t of size n generated by a sequence of random

α

-insertions starting from a fixed labeled topology

t^{*}

of given size k, and determine the probability mass function, mean, and variance of the clade size

| \hat{P} |

in t when P is chosen as the set of leaves of t inherited from

t^{*}

. In the second part of the paper, we calculate the probability that a set P of k leaves chosen at random in a Ford-distributed labeled topology of size n is monophyletic, that is, the probability that

| \hat{P} | = k

. Our investigations extend previous results on clade size statistics obtained for Yule and uniformly distributed labeled topologies.

Keywords:

phylogenetics; Ford’s α-model; labeled topology; clade size

MSC:

05C05; 60C05; 92B10

1. Introduction

The study of probabilistic properties of tree structures generated under random growth models is a popular subject in mathematical phylogenetics. Starting from a one-taxon tree, the Yule model generates a tree of n labeled taxa by adding new leaves equiprobably to pendant edges. By this process, different labeled topologies, or cladograms, of the same size are not equally likely to be generated, with more balanced labeled topologies having a larger probability of appearing. Under the uniform model, sometimes called the PDA (proportional to different arrangements) model [1], new leaves are added equiprobably to any edge, inducing the uniform distribution over the set of labeled topologies of a given size.

The Yule and uniform processes are the simplest stochastic models for generating random labeled topologies. They can be seen as specific instances of a popular random parametric model, the so-called Ford’s

α

-model [2]. In this model, given a random permutation

σ = σ_{1} σ_{2} \dots σ_{n}

of

{1, 2, \dots, n}

, a labeled topology t of size n results from a sequence of random

α

-insertions

t_{1} = {\overset{|}{•}}_{σ_{1}} \overset{σ_{2}}{⟶} \dots \overset{σ_{i}}{⟶} t_{i} \overset{σ_{i + 1}}{⟶} t_{i + 1} \overset{σ_{i + 2}}{⟶} \dots \overset{σ_{n}}{⟶} t_{n} = t,

(1)

where

t_{1} = {\overset{|}{•}}_{σ_{1}}

is the one-taxon tree for which the leaf is labeled by

σ_{1}

and the tree

t_{i + 1}

is obtained by attaching a new pendant edge labeled by

σ_{i + 1}

to a pendant or internal edge of

t_{i}

with a probability depending on the value of the parameter

α \in [0, 1]

. Each value of

α

determines a different probability distribution over the set of labeled topologies of given size n. In particular, when

α = 0

, the

α

-model yields the Yule distribution, while when

α = 1 / 2

the

α

-model induces the uniform distribution. When

α = 1

, i.e., when new leaves are added equiprobably only to internal edges, the

α

-model provides the comb distribution, under which only completely unbalanced trees (also called comb or caterpillar trees) have non-zero probability.

A key concept in phylogenetic studies is that of the “clade”. Given a labeled topology, a clade is a subtree rooted at one of its nodes; as such, it consists of a node together with all its descendants. The taxa of a clade are said to form a “monophyletic” group; from a biological perspective, they determine a set of individuals that are more closely related genealogically to each other than any of them is to any taxon that does not belong to the clade. Monophyly, and clade statistics more generally, have been extensively investigated in mathematical biology. The probability of one or two sets of randomly selected taxa forming monophyletic groups in a Yule or uniformly distributed labeled topology of fixed size has been studied in [3,4], respectively. In [5], the authors investigated the random variable defined by the size

| \hat{P} |

of the minimal clade

\hat{P}

(or, equivalently, by the size of the minimal monophyletic group) containing a set P of k taxa selected at random from among the n leaves of a labeled topology generated at random under the Yule or PDA model. Here, we extend the calculations for the clade size

| \hat{P} |

by considering labeled topologies generated under Ford’s

α

-model when

α

is fixed in the range

0 \leq α \leq 1

. In Section 3, we study the growth of the clade size of a set of k initially monophyletic leaves subject to a sequence of

α

-insertions. More precisely, we consider trees of size n generated by random

α

-insertions performed starting from a tree

t^{*}

of given size k, then determine the distribution of the clade size

| \hat{P} |

of the set of leaves P of t inherited from

t^{*}

. In Section 4, we broaden the results of [3,4] by studying the probability of a set P of k randomly chosen taxa forming a monophyletic group in a labeled topology of size n selected under the Ford distribution. We start with some basic terminology and preliminary results.

2. Preliminaries

In this section, we outline basic combinatorial and probabilistic properties of labeled topologies and Ford’s

α

-model. We also define the “clade size” variable, which is studied in the next sections over random labeled topologies.

A labeled topology, or cladogram, of size n is a full binary rooted tree with n leaves, also called taxa, labeled by the symbols

1, 2, \dots, n

(Figure 1A). A subtree of a labeled topology consists of an edge together with all its descending edges and nodes. An internal edge is either the branch above the root node of the tree (edge i in Figure 1A) or a branch connecting two adjacent internal nodes (edges

f, g

, and h in the figure). A pendant edge is a branch that intersects a leaf (edges

a, b, c, d

, and e in Figure 1A). Each labeled topology of size n has n pendant edges and

n - 1

internal edges. There are exactly

1 \times 3 \times 5 \times \dots \times (2 n - 3) = (2 n - 3)!!

different labeled topologies of size

n \geq 2

[6], among which

(\binom{n}{2}) (n - 2)! = n! / 2

have a completely unbalanced “caterpillar” structure such that (as in the rightmost tree of Figure 1B) each internal node has at least one leaf stemming directly from it.

Given a labeled topology t of size n, let P be a subset of its leaves. We define the clade of P as the subtree

\hat{P}

of t rooted at the most recent common ancestor of the taxa in P. In other words,

\hat{P}

is the minimal subtree of t containing the taxa belonging to P. In the phylogenetic literature [7],

\hat{P}

is also called the subtree of t induced by the set of leaves in P. As an example, let t be the labeled topology of Figure 1A and take P as the set containing the two leaves labeled by 3 and 5; then,

\hat{P}

is the subtree of t whose leaves are labeled by 3, 4, and 5. The clade size of P is defined as the number

| \hat{P} |

of taxa belonging to the clade of P. Clearly, we have

| \hat{P} | \geq | P |

, and the population P is said to be monophyletic when

| \hat{P} | = | P |

.

Ford’s

α

-model [2] is a probability model over the set of labeled topologies of fixed size. For a given value of the parameter

α \in [0, 1]

, this model generates a random labeled topology t of size n by first selecting a permutation

σ = σ_{1} σ_{2} \dots σ_{n}

of the labels

{1, 2, \dots, n}

uniformly at random; then, starting from the one-taxon tree

t_{1}

whose leaf is labeled by

σ_{1}

, a tree

t = t_{n}

of size n is generated as the n-th element of a sequence (1) of trees of increasing size, in which tree

t_{i + 1}

is obtained by attaching a new pendant edge labeled by

σ_{i + 1}

to a random edge of tree

t_{i}

. Depending on the value of

α

, the

α

-insertion of the new pendant edge has to satisfy the following rule: each pendant (resp. internal) edge of

t_{i}

has probability

\frac{1 - α}{i - α}

(resp.

\frac{α}{i - α}

) of being the edge to which the new pendant edge is attached. For instance, consider the labeled topology t of size

n = 4

shown on the right of Figure 1B, which in Newick format is written as

t = (((1, 3), 4), 2)

. If

σ = 1234

, then the insertion process that generates t is the one depicted in the figure, that is,

t_{1} = {\overset{|}{•}}_{1} \overset{2}{⟶} t_{2} = (1, 2) \overset{3}{⟶} t_{3} = ((1, 3), 2) \overset{4}{⟶} t_{4} = (((1, 3), 4), 2) = t .

Thus, under Ford’s model, t has conditional probability

Prob (t | σ = 1234) = 1 \cdot \frac{1 - α}{2 - α} \cdot \frac{α}{3 - α}

. Indeed, with probability

\frac{1 - α}{1 - α} = 1

, the pendant edge labeled by

σ_{2} = 2

is attached to the unique pendant edge of

t_{1} = {\overset{|}{•}}_{1}

; then, the pendant edge labeled by

σ_{3} = 3

is attached to the pendant edge of

t_{2}

labeled by 1 with probability

\frac{1 - α}{2 - α}

, and finally, with probability

\frac{α}{3 - α}

, the pendant edge labeled by

σ_{4} = 4

is attached to the internal edge of

t_{3}

that connects the root of the subtree

(1, 3)

to the root of

t_{3}

. Similarly, if

σ = 4213

, then the insertion process that generates the same labeled topology

t = (((1, 3), 4), 2)

is

t_{1} = {\overset{|}{•}}_{4} \overset{2}{⟶} t_{2} = (4, 2) \overset{1}{⟶} t_{3} = ((1, 4), 2) \overset{3}{⟶} t_{4} = (((1, 3), 4), 2) = t,

with t having conditional probability

Prob (t | σ = 4213) = \frac{1 - α}{2 - α} \cdot \frac{1 - α}{3 - α}

, as the pendant edges labeled by

σ_{3} = 1

and

σ_{4} = 3

are attached to the pendant edges labeled by 4 in

t_{2}

and by 1 in

t_{3}

, respectively. Performing similar calculations for all possible

n! = 24

permutations

σ

, the probability of the labeled topology t under Ford’s

α

-model can be calculated as

Prob (t) = \sum_{σ} Prob (t | σ) Prob (σ) = \frac{1}{n!} \sum_{σ} Prob (t | σ)

. An explicit formula for the probability of a labeled topology is provided in Proposition 2 of [8].
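
As a quick check of the two worked examples above, the following Python sketch evaluates the insertion probabilities with exact rational arithmetic. The helper names `pendant_prob` and `internal_prob` and the value α = 1/3 are illustrative assumptions of ours; they do not come from the original text.

```python
from fractions import Fraction


def pendant_prob(i, alpha):
    """Probability that a given pendant edge of t_i receives the new edge."""
    return (1 - alpha) / (i - alpha)


def internal_prob(i, alpha):
    """Probability that a given internal edge of t_i receives the new edge."""
    return alpha / (i - alpha)


alpha = Fraction(1, 3)  # illustrative value; any 0 <= alpha < 1 works here

# t_i has i pendant and i - 1 internal edges, so the weights sum to one.
for i in range(1, 6):
    assert i * pendant_prob(i, alpha) + (i - 1) * internal_prob(i, alpha) == 1

# Prob(t | sigma = 1234): pendant, pendant, then internal insertion.
prob_1234 = pendant_prob(1, alpha) * pendant_prob(2, alpha) * internal_prob(3, alpha)
# Prob(t | sigma = 4213): three pendant insertions.
prob_4213 = pendant_prob(1, alpha) * pendant_prob(2, alpha) * pendant_prob(3, alpha)

print(prob_1234, prob_4213)
```

The loop confirms that the i pendant and i - 1 internal edges of a tree of size i carry total probability one, which is what makes each α-insertion a well-defined random choice.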

3. Clade Size of a Set of Initially Monophyletic Leaves Subject to α-Insertions

Let t be a labeled topology of size n generated by a sequence of random

α

-insertions

t^{*} = t_{k} \overset{k + 1}{⟶} \dots \overset{i}{⟶} t_{i} \overset{i + 1}{⟶} t_{i + 1} \overset{i + 2}{⟶} \dots \overset{n}{⟶} t_{n} = t,

(2)

where

t^{*} = t_{k}

is any starting tree of fixed size k with leaves labeled by the symbols

P = {1, 2, \dots, k}

and tree

t_{i + 1}

is obtained by attaching a new pendant edge labeled by

i + 1

to one of the i pendant (resp. to one of the

i - 1

internal) edges of

t_{i}

with probability

\frac{1 - α}{i - α}

(resp.

\frac{α}{i - α}

) (Figure 2).

In what follows, we study the variable

| \hat{P} |

, defined as the clade size in t of the set of leaves P inherited from the starting tree

t^{*}

. Our goal is to measure how a sequence of random

α

-insertions modifies the clade size of the set of leaves P, which is monophyletic in

t^{*}

at the beginning of the sequence of insertions. We first provide a recurrence for the probability mass function of the random variable

| \hat{P} |

by using the iterative construction (2) of the random labeled topology t.

Lemma 1.

Fix

k \geq 2

. In a random labeled topology of size

n \geq k

generated by random insertions, as in (2), the probability mass function of the clade size

| \hat{P} |

of the set

P = {1, 2, \dots, k}

of leaves inherited from

t^{*}

satisfies the following recurrence:

p_{n} (d) \equiv p_{n} (| \hat{P} | = d) = (\frac{1 - α + d - n}{1 + α - n}) p_{n - 1} (d) + (\frac{1 + 2 α - d}{1 + α - n}) p_{n - 1} (d - 1)

(3)

with boundary conditions

p_{n} (d) = 0

if

d > n

or

d < k

and

p_{n} (d) = 1

if

n = k = d

.

Proof.

As in (2), the random labeled topology

t_{n} = t

of size n is generated by attaching a pendant edge with leaf labeled by n to one of the

2 n - 3

edges of the random labeled topology

t_{n - 1}

of size

n - 1

. The clade size of P increases by one only when the new pendant edge is attached to an edge placed below the root (say, r) of

\hat{P}

in

t_{n - 1}

. Hence, conditioning on the clade size of P in

t_{n - 1}

, we have

p_{n} (d) = [1 - Prob (attaching below r)] \times p_{n - 1} (d) + Prob (attaching below r) \times p_{n - 1} (d - 1),

where

Prob (attaching below r)

is

d \cdot \frac{1 - α}{n - 1 - α} + (d - 2) \cdot \frac{α}{n - 1 - α}

if

| \hat{P} | = d

in

t_{n - 1}

and

(d - 1) \cdot \frac{1 - α}{n - 1 - α} + (d - 3) \cdot \frac{α}{n - 1 - α}

if

| \hat{P} | = d - 1

in

t_{n - 1}

. This provides the claimed recurrence. □
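
The recurrence (3) can be iterated directly. The sketch below is one possible implementation under our own naming (`clade_size_pmf` is a hypothetical helper); it uses exact fractions and the equivalent coefficients (n - 1 - d + α)/(n - 1 - α) and (d - 1 - 2α)/(n - 1 - α), obtained by changing the sign of numerator and denominator in (3).

```python
from fractions import Fraction


def clade_size_pmf(n, k, alpha):
    """Return {d: p_n(d)} by iterating recurrence (3) from p_k(k) = 1."""
    alpha = Fraction(alpha)
    pmf = {k: Fraction(1)}  # boundary condition: n = k = d
    for m in range(k + 1, n + 1):
        denom = m - 1 - alpha
        new_pmf = {}
        for d in range(k, m + 1):
            # The clade keeps size d: the new edge is not attached below its root.
            stay = pmf.get(d, 0) * (m - 1 - d + alpha) / denom
            # The clade grows from d - 1 to d: the new edge is attached below its root.
            grow = pmf.get(d - 1, 0) * (d - 1 - 2 * alpha) / denom
            new_pmf[d] = stay + grow
        pmf = new_pmf
    return pmf


pmf = clade_size_pmf(10, 4, Fraction(1, 2))
print(sum(pmf.values()))  # the probabilities should add up to one
```

Working with `Fraction` avoids rounding, so the check that the values sum to one is exact for rational α.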

From the recurrence provided in (3), it is possible to derive closed formulas for the probability

p_{n} (d)

in terms of the generalized binomial coefficient

(\binom{x}{y}) = \frac{Γ (x + 1)}{Γ (y + 1) Γ (x - y + 1)}

, where

Γ

is the

Γ

-function. This is done in the following theorem, in which the mean and variance of

| \hat{P} |

are also determined.

Theorem 1.

Fix

k \geq 2

. In a random labeled topology of size

n \geq k

generated by random insertions, as in (2), the probability of the clade size

| \hat{P} |

of the set

P = {1, 2, \dots, k}

of leaves inherited from

t^{*}

being d is provided by

p_{n} (d) = \frac{(\binom{n - d + α - 1}{n - d}) (\binom{d - 1 - 2 α}{d - k})}{(\binom{n - 1 - α}{n - k})} = \frac{Γ (k - α)}{Γ (n - α)} \cdot \frac{Γ (n + α - d)}{Γ (α)} \cdot \frac{Γ (d - 2 α)}{Γ (k - 2 α)} \cdot (\binom{n - k}{d - k}) .

(4)

Furthermore, the expected value and variance of the clade size of P are, respectively,

E_{n} (| \hat{P} |) = \frac{k n - 2 α n + α k}{k - α} and V_{n} (| \hat{P} |) = \frac{(n - k) (k - 2 α) (n - α) α}{{(k - α)}^{2} (k - α + 1)} .

Proof.

We can prove the first formula for

p_{n} (d)

by induction on

n \geq k

. For

n = k

, we must have

p_{k} (d) = 1

if

d = k

and

p_{k} (d) = 0

otherwise, which is consistent with the claimed formula. Now, assuming

n > k

and using the recurrence from Lemma 1, we have

\begin{matrix} p_{n} (d) & = & (\frac{1 - α + d - n}{1 + α - n}) p_{n - 1} (d) + (\frac{1 + 2 α - d}{1 + α - n}) p_{n - 1} (d - 1) \\ = & (\frac{n - d + α - 1}{n - 1 - α}) \frac{(\binom{n - d + α - 2}{n - d - 1}) (\binom{d - 1 - 2 α}{d - k})}{(\binom{n - 2 - α}{n - k - 1})} + (\frac{d - 1 - 2 α}{n - 1 - α}) \frac{(\binom{n - d + α - 1}{n - d}) (\binom{d - 2 - 2 α}{d - k - 1})}{(\binom{n - 2 - α}{n - k - 1})} \\ = & \frac{(n - d) (\binom{n - d + α - 1}{n - d}) (\binom{d - 1 - 2 α}{d - k}) + (d - k) (\binom{n - d + α - 1}{n - d}) (\binom{d - 1 - 2 α}{d - k})}{(n - k) (\binom{n - 1 - α}{n - k})} \\ = & \frac{(\binom{n - d + α - 1}{n - d}) (\binom{d - 1 - 2 α}{d - k})}{(\binom{n - 1 - α}{n - k})}, \end{matrix}

where in the third equality we have used the binomial identity

y (\binom{x}{y}) = x (\binom{x - 1}{y - 1})

that holds for every complex number x and every non-negative integer y.

By induction on

n \geq k

, we can now show the formula for the expected value. If

n = k

, then we must have

E_{k} (| \hat{P} |) = k \cdot p_{k} (k) = k

, which is in agreement with the claimed formula. Assuming

n > k

, the inductive step provides

\begin{matrix} E_{n} (| \hat{P} |) & = & \sum_{d = k}^{n} d \cdot p_{n} (d) = \frac{1}{1 + α - n} \sum_{d = k}^{n} (d (1 - α + d - n) p_{n - 1} (d) + d (1 + 2 α - d) p_{n - 1} (d - 1)) \\ = & \frac{1}{1 + α - n} [(1 - α - n) \sum_{d = k}^{n} d \cdot p_{n - 1} (d) + \sum_{d = k}^{n} d^{2} \cdot p_{n - 1} (d) \\ + (1 + 2 α) \sum_{d = k}^{n} d \cdot p_{n - 1} (d - 1) - \sum_{d = k}^{n} d^{2} \cdot p_{n - 1} (d - 1)] \\ = & \frac{1}{1 + α - n} [(1 - α - n) \sum_{d = k}^{n - 1} d \cdot p_{n - 1} (d) + \sum_{d = k}^{n - 1} d^{2} \cdot p_{n - 1} (d) \\ + (1 + 2 α) \sum_{d = k + 1}^{n} (d - 1 + 1) \cdot p_{n - 1} (d - 1) - \sum_{d = k + 1}^{n} {(d - 1 + 1)}^{2} \cdot p_{n - 1} (d - 1)] \\ = & \frac{1}{1 + α - n} [(1 - α - n) \cdot E_{n - 1} (| \hat{P} |) + \sum_{d = k}^{n - 1} d^{2} \cdot p_{n - 1} (d) \\ + (1 + 2 α) (E_{n - 1} (| \hat{P} |) + 1) - \sum_{d = k + 1}^{n} {(d - 1)}^{2} \cdot p_{n - 1} (d - 1) - 2 \cdot E_{n - 1} (| \hat{P} |) - 1] \\ = & \frac{1}{1 + α - n} [(α - n) \cdot E_{n - 1} (| \hat{P} |) + 2 α] = \frac{1}{1 + α - n} [(α - n) \cdot \frac{k (n - 1) - 2 α (n - 1) + α k}{k - α} + 2 α] \\ = & \frac{(α - n) \cdot (k (n - 1) - 2 α (n - 1) + α k) + 2 α (k - α)}{(1 + α - n) (k - α)} = \frac{(1 + α - n) (k n - 2 α n + α k)}{(1 + α - n) (k - α)} \\ = & \frac{k n - 2 α n + α k}{k - α} . \end{matrix}

Finally, we prove the formula for the variance

V_{n} (| \hat{P} |)

; we proceed by showing that the second moment of the variable

| \hat{P} |

reads as follows:

E_{n} (| \hat{P} |^{2}) = \frac{(k - 2 α) (k + 1 - 2 α) n^{2} + α (2 k + 1) (k - 2 α) n + α^{2} (k^{2} + 2 k)}{(k - α) (k + 1 - α)}

which, together with

E_{n} (| \hat{P} |) = \frac{k n - 2 α n + α k}{k - α}

, yields the claimed formula for the variance calculated as

V_{n} (| \hat{P} |) = E_{n} (| \hat{P} |^{2}) - {(E_{n} (| \hat{P} |))}^{2}

. If

n = k

, then

E_{k} (| \hat{P} |^{2}) = k^{2} \cdot p_{k} (k) = k^{2}

, which is in agreement with the claimed formula. Assuming

n > k

, the inductive step yields

\begin{matrix} E_{n} (| \hat{P} |^{2}) & = & \sum_{d = k}^{n} d^{2} \cdot p_{n} (d) \\ = & \frac{1}{1 + α - n} \sum_{d = k}^{n} (d^{2} (1 - α + d - n) p_{n - 1} (d) + d^{2} (1 + 2 α - d) p_{n - 1} (d - 1)) \\ = & \frac{1}{1 + α - n} [(1 - α - n) \sum_{d = k}^{n} d^{2} \cdot p_{n - 1} (d) + \sum_{d = k}^{n} d^{3} \cdot p_{n - 1} (d) \\ + (1 + 2 α) \sum_{d = k}^{n} d^{2} \cdot p_{n - 1} (d - 1) - \sum_{d = k}^{n} d^{3} \cdot p_{n - 1} (d - 1)] \\ = & \frac{1}{1 + α - n} [(1 - α - n) \sum_{d = k}^{n - 1} d^{2} \cdot p_{n - 1} (d) + \sum_{d = k}^{n - 1} d^{3} \cdot p_{n - 1} (d) \\ + (1 + 2 α) \sum_{d = k + 1}^{n} {(d - 1 + 1)}^{2} \cdot p_{n - 1} (d - 1) - \sum_{d = k + 1}^{n} {(d - 1 + 1)}^{3} \cdot p_{n - 1} (d - 1)] \\ = & \frac{1}{1 + α - n} [(1 - α - n) \cdot E_{n - 1} (| \hat{P} |^{2}) + \sum_{d = k}^{n - 1} d^{3} \cdot p_{n - 1} (d) \\ + (1 + 2 α) (E_{n - 1} (| \hat{P} |^{2}) + 2 E_{n - 1} (| \hat{P} |) + 1) - \sum_{d = k + 1}^{n} {(d - 1)}^{3} \cdot p_{n - 1} (d - 1) \\ - 3 \cdot E_{n - 1} (| \hat{P} |^{2}) - 3 \cdot E_{n - 1} (| \hat{P} |) - 1] \\ = & \frac{1}{1 + α - n} [(α - n - 1) \cdot E_{n - 1} (| \hat{P} |^{2}) + (4 α - 1) \cdot E_{n - 1} (| \hat{P} |) + 2 α] \\ = & \frac{α - n - 1}{1 + α - n} \frac{(k - 2 α) (k + 1 - 2 α) {(n - 1)}^{2} + (2 k + 1) (k - 2 α) α (n - 1) + α^{2} (k^{2} + 2 k)}{(k - α) (k + 1 - α)} \\ + \frac{1}{1 + α - n} ((4 α - 1) \frac{k (n - 1) - 2 α (n - 1) + α k}{k - α} + 2 α) \\ = & \frac{(k - 2 α) (k + 1 - 2 α) n^{2} + (2 k + 1) (k - 2 α) α n + α^{2} (k^{2} + 2 k)}{(k - α) (k + 1 - α)} . \end{matrix}

□
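
The statements of Theorem 1 can be checked numerically. The following sketch evaluates the Γ-function form of (4) in floating point and compares the resulting mean and variance with the closed formulas; the function name `clade_pmf` and the parameter values are illustrative assumptions, and the code presumes 0 < α < 1 so that no Γ-function is evaluated at a pole.

```python
from math import comb, gamma


def clade_pmf(n, k, alpha, d):
    """Closed form (4), Gamma-function expression; assumes 0 < alpha < 1, k <= d <= n."""
    return (gamma(k - alpha) / gamma(n - alpha)
            * gamma(n + alpha - d) / gamma(alpha)
            * gamma(d - 2 * alpha) / gamma(k - 2 * alpha)
            * comb(n - k, d - k))


n, k, alpha = 12, 4, 0.3  # illustrative parameters
probs = {d: clade_pmf(n, k, alpha, d) for d in range(k, n + 1)}

# Moments computed directly from the probability mass function.
mean = sum(d * p for d, p in probs.items())
var = sum(d * d * p for d, p in probs.items()) - mean ** 2

# Closed formulas stated in Theorem 1.
mean_formula = (k * n - 2 * alpha * n + alpha * k) / (k - alpha)
var_formula = ((n - k) * (k - 2 * alpha) * (n - alpha) * alpha
               / ((k - alpha) ** 2 * (k - alpha + 1)))

print(mean, mean_formula)
print(var, var_formula)
```

Up to floating-point error, the two printed pairs should agree and the probabilities should sum to one.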

Remark 1.

We conclude the section with a few observations following from the latter theorem.

(i): The right-hand side of Equation (4) is extended by continuity to the cases of $α = 0$ or $α = 1$ . In particular, when $α = 0$ and $d < n$ , the second factor in the formula reads as $\frac{Γ (n + α - d)}{Γ (α)} = \frac{Γ (n - d)}{Γ (0)} = \frac{Γ (n - d)}{\infty} = 0$ , while for $α = 0$ and $d = n$ it is provided by $\frac{Γ (n + α - d)}{Γ (α)} = \frac{Γ (0)}{Γ (0)} = 1$ . If we instead have $α = 1$ and $d > k = 2$ , then the third factor in the formula becomes $\frac{Γ (d - 2 α)}{Γ (k - 2 α)} = \frac{Γ (d - 2)}{Γ (0)} = \frac{Γ (d - 2)}{\infty} = 0$ , while for $α = 1$ and $d = k = 2$ it reads as $\frac{Γ (d - 2 α)}{Γ (k - 2 α)} = \frac{Γ (0)}{Γ (0)} = 1$ .
(ii): Lower values of the α parameter determine a larger expected clade size and smaller variance. This is shown in Figure 3, where we perform a sequence of α-insertions (2) starting from a labeled topology $t^{*}$ of size $k = 4$ , then calculate the mean and variance of the clade size of the set of leaves ${1, 2, 3, 4}$ of $t^{*}$ in the resulting random labeled topology of size n (Figure 2).
(iii): For fixed values of k and d, the asymptotic formula $Γ (x + γ) \sim Γ (x) x^{γ}$ for $x \to + \infty$ , which holds for every complex number $γ$ , yields the following asymptotic expansion as $n \to \infty$ for the probability

$p_{n} (d) \sim \frac{Γ (k - α) Γ (d - 2 α)}{Γ (α) Γ (k - 2 α) Γ (d - k + 1)} \frac{1}{n^{k - 2 α}} = (\binom{d - 2 α - 1}{d - k}) \frac{Γ (k - α)}{Γ (α)} \frac{1}{n^{k - 2 α}} .$

In particular, for increasing values of n, the probability $p_{n} (k)$ of the set of initially monophyletic leaves remaining monophyletic decreases to 0 as quickly as $O (1 / n^{k - 2 α})$ . From Theorem 1, the probability $p_{n} (n)$ of the clade size of the starting set of leaves being equal to n is instead seen to be $O (1 / n^{α})$ . By substituting $d = ρ \cdot n$ in (4), we find that for $n \to \infty$ , the probability of the ratio $\frac{| \hat{P} |}{n}$ being a fixed $ρ < 1$ satisfies

$p_{n} (\frac{| \hat{P} |}{n} = ρ) \sim \frac{Γ (k - α)}{Γ (α) Γ (k - 2 α)} \cdot \frac{1}{n} \frac{ρ^{k - 2 α - 1}}{{(1 - ρ)}^{1 - α}} .$
(iv): It is possible to calculate the probability mass function of the clade size variable without making use of the generalized binomial coefficient. Indeed, by simple algebraic manipulations, from the formula provided in the latter theorem we can also write

$p_{n} (d) = \frac{\prod_{j = 1}^{n - d} (n - d + α - j) \prod_{h = 1}^{d - k} (d - 2 α - h)}{\prod_{ℓ = 1}^{n - k} (n - α - ℓ)} (\binom{n - k}{d - k}),$

where we admit empty products.
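
The product form in (iv) lends itself to exact evaluation, including at the boundary values α = 0 and α = 1 discussed in (i), because no Γ-function has to be evaluated at a pole. A minimal sketch, with the hypothetical helper name `clade_pmf_exact`:

```python
from fractions import Fraction
from math import comb, prod


def clade_pmf_exact(n, k, alpha, d):
    """Product form of p_n(d) from Remark 1 (iv); empty products equal 1."""
    alpha = Fraction(alpha)
    one = Fraction(1)
    num = (prod((n - d + alpha - j for j in range(1, n - d + 1)), start=one)
           * prod((d - 2 * alpha - h for h in range(1, d - k + 1)), start=one))
    den = prod((n - alpha - l for l in range(1, n - k + 1)), start=one)
    return num / den * comb(n - k, d - k)


n, k, alpha = 9, 3, Fraction(1, 3)  # illustrative parameters
total = sum(clade_pmf_exact(n, k, alpha, d) for d in range(k, n + 1))
print(total)  # exact rational sum of the probabilities
```

For α = 0 the factor with j = n - d vanishes whenever d < n, and for α = 1 with k = 2 the factor with h = d - k vanishes whenever d > 2, which matches the limiting behavior described in (i).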

4. Probability of Monophyly for a Set of Random Taxa

In this section, we determine the probability

M_{n} = M_{n} (α, k)

that a set P of

k < n

randomly chosen taxa forms a monophyletic group in a labeled topology of size n selected under the Ford distribution. Without loss of generality, we assume P to be the set of taxa whose labels belong to

{1, 2, \dots, k}

in a random labeled topology of n leaves. Note that when

α = 0

, the

α

-model corresponds to the Yule model; from [5], we know that

M_{n} = \frac{2 n}{k (k + 1) (\binom{n}{k})}

. If we instead have

α = 1

, then a random labeled topology of size n generated under Ford’s model has a completely unbalanced (caterpillar) tree shape such as the one on the right of Figure 1B. Thus, the probability of the leaves with labels in

P = {1, 2, \dots, k}

forming a monophyletic group is

M_{n} = \frac{1}{(\binom{n}{k})}

, as there is only one subtree of size k in such a caterpillar labeled topology.

We now focus on the values of

α

within the range

0 < α < 1

. As described in Section 2, a random labeled topology t of n taxa is generated by first selecting a random permutation

σ

of size n, then performing a sequence of random insertions (1). The probability that

σ = \underset{σ_{0}}{\underset{⏟}{σ_{0, 1} \dots σ_{0, s_{0}}}} p_{1} \underset{σ_{1}}{\underset{⏟}{σ_{1, 1} \dots σ_{1, s_{1}}}} p_{2} \underset{σ_{2}}{\underset{⏟}{σ_{2, 1} \dots σ_{2, s_{2}}}} \dots \underset{σ_{k - 1}}{\underset{⏟}{σ_{k - 1, 1} \dots σ_{k - 1, s_{k - 1}}}} p_{k} \underset{σ_{k}}{\underset{⏟}{σ_{k, 1} \dots σ_{k, s_{k}}}},

(5)

where each

p_{i}

is an element of

{1, 2, \dots, k}

and each

σ_{i}

is a sequence of length

s_{i} \geq 0

(with

s_{k} = n - k - s_{0} - s_{1} - \dots - s_{k - 1}

) over the alphabet

{1, 2, \dots, n} ∖ {1, 2, \dots, k} = {k + 1, k + 2, \dots, n}

, is provided by

\frac{k!}{n!} \cdot (\binom{n - k}{s_{0}}) s_{0}! \cdot (\binom{n - k - s_{0}}{s_{1}}) s_{1}! \cdot \dots \cdot (\binom{n - k - s_{0} - \dots - s_{k - 2}}{s_{k - 1}}) s_{k - 1}! \cdot s_{k}! = \frac{1}{(\binom{n}{k})} .

Conditioned on

σ

as in (5), the probability that the set P of taxa is monophyletic after a sequence of random insertions of pendant edges labeled by the symbols in

σ

is a chain of (possibly empty) products

\prod_{i = 1}^{k - 1} a (i + s_{0} + s_{1} + \dots + s_{i}, i) \times \prod_{i = 2}^{k} \prod_{j = 0}^{s_{i} - 1} b (i + j + s_{0} + s_{1} + \dots + s_{i - 1}, i),

(6)

where

a (z, c) = \frac{c (1 - α)}{z - α} + \frac{(c - 1) α}{z - α} = \frac{c - α}{z - α}

is the probability of attaching a new pendant edge (say, one labeled by x) in a tree of size z that has a clade

\hat{C}

of size c, in such a way that the clade size of

\hat{C \cup {x}}

is

c + 1

and

b (z, c) = \frac{(z - c) (1 - α)}{z - α} + \frac{(z - c + 1) α}{z - α} = \frac{z - c + α}{z - α}

is the probability of attaching a new pendant edge in a tree of size z that has a clade

\hat{C}

of size c, while the clade size of

\hat{C}

remains equal to c after the insertion. For instance, if

\hat{C}

is the clade

\hat{C} = ((3, 4), 5)

of size

c = 3

in the labeled topology t of size

z = 5

depicted in Figure 1A, then a new pendant edge labeled by x can be attached to

c = 3

different pendant edges of t (i.e.,

c, d

, and e) or to

c - 1 = 2

different internal edges of t (i.e., g and h) such that

| \hat{C \cup {x}} | = c + 1 = 4

, which happens with probability

a (z, c)

. By considering the same tree t of size

z = 5

and the same clade

\hat{C} = ((3, 4), 5)

of size

c = 3

, we see that a new pendant edge can be attached to

z - c = 2

different pendant edges of t (i.e., a and b) or to

z - c + 1 = 3

different internal edges of t (i.e.,

f, h

, and i) such that after the insertion we have

| \hat{C} | = c = 3

, which happens with probability

b (z, c)

. In particular, each factor

a (i + s_{0} + s_{1} + \dots + s_{i}, i)

(resp.

b (i + j + s_{0} + s_{1} + \dots + s_{i - 1}, i)

) in (6) yields the probability of the clade size of

{p_{1}, \dots, p_{i}, p_{i + 1}}

(resp.

{p_{1}, \dots, p_{i}}

) being

i + 1

(resp. i) after insertion of the pendant edge labeled by

p_{i + 1}

(resp.

σ_{i, j + 1}

).

Summing over the possible values of

s_{0}, s_{1}, \dots, s_{k - 1}

, the probability of the population P of taxa with labels in

{1, 2, \dots, k}

being monophyletic in a random labeled topology of size n can then be calculated as follows:

M_{n} = \frac{1}{(\binom{n}{k})} \sum_{s_{0} = 0}^{n - k} \sum_{s_{1} = 0}^{n - k - s_{0}} \dots \sum_{s_{k - 1} = 0}^{n - k - s_{0} - s_{1} - \dots - s_{k - 2}} \prod_{i = 1}^{k - 1} a (i + s_{0} + s_{1} + \dots + s_{i}, i) \times \prod_{i = 2}^{k} \prod_{j = 0}^{s_{i} - 1} b (i + j + s_{0} + s_{1} + \dots + s_{i - 1}, i) .

(7)
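
For small n and k, the nested sum (7) can be evaluated by brute force. The sketch below enumerates the tuples (s_0, ..., s_{k-1}) and multiplies the factors a and b exactly as in (6); the function name `monophyly_prob_sum` is our own, and the enumeration is exponential in k, so this is meant only as a sanity check. The case α = 1 is excluded here because the factor a(1, 1) would be 0/0.

```python
from fractions import Fraction
from itertools import product
from math import comb


def a(z, c, alpha):
    """Probability that the clade gains the newly inserted member of P."""
    return (c - alpha) / (z - alpha)


def b(z, c, alpha):
    """Probability that the clade size stays c after an insertion."""
    return (z - c + alpha) / (z - alpha)


def monophyly_prob_sum(n, k, alpha):
    """Evaluate formula (7) exactly; assumes 2 <= k < n and 0 <= alpha < 1."""
    alpha = Fraction(alpha)
    total = Fraction(0)
    for head in product(range(n - k + 1), repeat=k):  # (s_0, ..., s_{k-1})
        if sum(head) > n - k:
            continue
        s = list(head) + [n - k - sum(head)]            # append s_k
        z = [sum(s[:i + 1]) for i in range(k + 1)]      # z_i = s_0 + ... + s_i
        term = Fraction(1)
        for i in range(1, k):
            term *= a(i + z[i], i, alpha)
        for i in range(2, k + 1):
            for j in range(s[i]):
                term *= b(i + j + z[i - 1], i, alpha)
        total += term
    return total / comb(n, k)


print(monophyly_prob_sum(5, 3, 0))  # Yule case, alpha = 0
```

At α = 0 the result can be compared with the Yule value 2n / (k (k + 1) C(n, k)) quoted at the beginning of this section.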

In order to simplify the latter formula for the probability

M_{n}

, we need the next lemma.

Lemma 2.

Let n be a non-negative integer and let r be a positive integer; then, we have the following equality for the r nested sums:

\sum_{k_{1} = 0}^{n} \sum_{k_{2} = 0}^{n - k_{1}} \dots \sum_{k_{r} = 0}^{n - k_{1} - \dots - k_{r - 1}} 1 = \sum_{0 \leq k_{1} \leq k_{1} + k_{2} \leq \dots \leq k_{1} + \dots + k_{r} \leq n} 1 = (\binom{n + r}{r}) .

Proof.

We proceed by induction on

r \geq 1

. If

r = 1

, then the sum reduces to

n + 1 = (\binom{n + 1}{1})

. Now, assuming

r > 1

, setting

h = n - k_{1} + r - 1

, and applying the binomial “hockey stick” identity

\sum_{x = y}^{z} (\binom{x}{y}) = (\binom{z + 1}{y + 1})

, we obtain

\begin{matrix} \sum_{k_{1} = 0}^{n} \sum_{k_{2} = 0}^{n - k_{1}} \dots \sum_{k_{r} = 0}^{n - k_{1} - \dots - k_{r - 1}} 1 = \sum_{k_{1} = 0}^{n} (\binom{n - k_{1} + r - 1}{r - 1}) = \sum_{h = r - 1}^{n + r - 1} (\binom{h}{r - 1}) = (\binom{n + r}{r}) . \end{matrix}

□
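
Lemma 2 is easy to confirm by direct enumeration for small values. The recursive helper below, whose name `nested_sum_count` is our own, peels off the outermost sum exactly as in the inductive step of the proof.

```python
from math import comb


def nested_sum_count(n, r):
    """Count tuples (k_1, ..., k_r) of non-negative integers with k_1 + ... + k_r <= n."""
    if r == 0:
        return 1  # empty tuple
    # Fix k_1 and count the remaining r - 1 nested sums with bound n - k_1.
    return sum(nested_sum_count(n - k1, r - 1) for k1 in range(n + 1))


for n in range(6):
    for r in range(1, 5):
        assert nested_sum_count(n, r) == comb(n + r, r)
```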

Then, we have the following theorem.

Theorem 2.

Under Ford’s α-model with

0 < α < 1

, the probability

M_{n} = M_{n} (α, k)

of the set of leaves

P = {1, 2, \dots, k}

being monophyletic in a random labeled topology of size n is provided by

M_{n} = \frac{(\binom{1 - 2 α}{1 - α})}{(\binom{n}{k}) (\binom{n - 1 - α}{k - 2 α}) (\binom{k - 2 α}{1 - α})} \sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - 2 - h}{k - 2}) .

(8)

Proof.

We use the following notation: for each

i = 0, \dots, k

, set

z_{i} = s_{0} + s_{1} + \dots + s_{i}

; then,

s_{k} = n - k - z_{k - 1}

and

z_{k} = z_{k - 1} + s_{k} = n - k

. In particular, we can write

\begin{matrix} a (i + s_{0} + s_{1} + \dots + s_{i}, i) = a (i + z_{i}, i) = \frac{i - α}{z_{i} + i - α}; \\ b (i + j + s_{0} + s_{1} + \dots + s_{i - 1}, i) = b (i + j + z_{i - 1}, i) = \frac{z_{i - 1} + j + α}{z_{i - 1} + i + j - α} . \end{matrix}

Furthermore, for every

z \in \mathbb{C} ∖ \mathbb{Z}_{\leq - 1}

, let

z! = Γ (z + 1) .

We start by rewriting the products appearing in (7). First, we have

\prod_{i = 1}^{k - 1} a (s_{0} + s_{1} + \dots + s_{i} + i, i) = \prod_{i = 1}^{k - 1} \frac{i - α}{z_{i} + i - α} = \frac{(k - 1 - α)!}{(- α)!} \prod_{i = 1}^{k - 1} \frac{1}{z_{i} + i - α} .

(9)

Second, we find

\prod_{j = 0}^{s_{i} - 1} b (s_{0} + s_{1} + \dots + s_{i - 1} + i + j, i) = \prod_{j = 0}^{s_{i} - 1} \frac{z_{i - 1} + j + α}{z_{i - 1} + i + j - α} = \frac{(z_{i} - 1 + α)!}{(z_{i - 1} - 1 + α)!} \frac{(z_{i - 1} + i - 1 - α)!}{(z_{i} + i - 1 - α)!},

from which the second product in (7) reads as

\begin{matrix} \prod_{i = 2}^{k} \frac{(z_{i} - 1 + α)!}{(z_{i - 1} - 1 + α)!} \frac{(z_{i - 1} + i - 1 - α)!}{(z_{i} + i - 1 - α)!} & = & \frac{(z_{k} - 1 + α)!}{(z_{1} - 1 + α)!} \frac{(z_{1} + 1 - α)!}{(z_{k} + k - α)!} \prod_{i = 2}^{k} (z_{i} + i - α) \\ = & \frac{(n - k - 1 + α)!}{(n - α - 1)!} \frac{(z_{1} - α)!}{(z_{1} - 1 + α)!} \prod_{i = 1}^{k - 1} (z_{i} + i - α) . \end{matrix}

(10)

Hence, by multiplying the products of (9) and (10), we obtain

\begin{matrix} \frac{(n - k - 1 + α)!}{(n - α - 1)!} \frac{(z_{1} - α)!}{(z_{1} - 1 + α)!} \frac{(k - 1 - α)!}{(- α)!} = \frac{(\binom{n - k + α - 1}{n - k})}{(\binom{n - 1 - α}{n - k})} \frac{(\binom{z_{1} - α}{z_{1}})}{(\binom{z_{1} + α - 1}{z_{1}})}, \end{matrix}

and (7) becomes

M_{n} = \frac{1}{(\binom{n}{k})} \sum_{s_{0} = 0}^{n - k} \sum_{s_{1} = 0}^{n - k - s_{0}} \dots \sum_{s_{k - 1} = 0}^{n - k - s_{0} - \dots - s_{k - 2}} \frac{(\binom{n - k + α - 1}{n - k})}{(\binom{n - 1 - α}{n - k})} \frac{(\binom{z_{1} - α}{z_{1}})}{(\binom{z_{1} + α - 1}{z_{1}})} .

Because

z_{1} = s_{0} + s_{1}

, we find

M_{n} = \frac{(\binom{n - k + α - 1}{n - k})}{(\binom{n}{k}) (\binom{n - 1 - α}{n - k})} \sum_{s_{0} = 0}^{n - k} \sum_{s_{1} = 0}^{n - k - s_{0}} \frac{(\binom{z_{1} - α}{z_{1}})}{(\binom{z_{1} + α - 1}{z_{1}})} (\sum_{s_{2} = 0}^{n - k - s_{0} - s_{1}} \dots \sum_{s_{k - 1} = 0}^{n - k - s_{0} - \dots - s_{k - 2}} 1) .

From the last lemma, we know that

\sum_{s_{2} = 0}^{n - k - s_{0} - s_{1}} \dots \sum_{s_{k - 1} = 0}^{n - k - s_{0} - \dots - s_{k - 2}} 1 = (\binom{n - k - s_{0} - s_{1} + k - 2}{k - 2}) = (\binom{n - 2 - z_{1}}{k - 2}) .

Hence, we obtain

M_{n} = \frac{(\binom{n - k + α - 1}{n - k})}{(\binom{n}{k}) (\binom{n - 1 - α}{n - k})} \sum_{s_{0} = 0}^{n - k} \sum_{s_{1} = 0}^{n - k - s_{0}} \frac{(\binom{z_{1} - α}{z_{1}})}{(\binom{z_{1} + α - 1}{z_{1}})} (\binom{n - 2 - z_{1}}{k - 2}) .

Setting $h = z_{1} = s_{0} + s_{1}$ and denoting the argument of the latter sum by $f (h)$, we have

\sum_{s_{0} = 0}^{n - k} \sum_{s_{1} = 0}^{n - k - s_{0}} f (h) = \sum_{h = 0}^{n - k} (h + 1) f (h),

where

(h + 1) f (h) = \frac{(h + 1) (\binom{h - α}{h})}{(\binom{h + α - 1}{h})} (\binom{n - 2 - h}{k - 2}) = (α - 1) \frac{(\binom{h - α}{h})}{(\binom{h + α - 1}{h + 1})} (\binom{n - 2 - h}{k - 2}) .

Thus, the formula for $M_{n}$ can be written as

\begin{aligned} M_{n} & = \frac{(\binom{n - k + α - 1}{n - k})}{(\binom{n}{k}) (\binom{n - 1 - α}{n - k})} (α - 1) \sum_{h = 0}^{n - k} \frac{(\binom{h - α}{h})}{(\binom{h + α - 1}{h + 1})} (\binom{n - 2 - h}{k - 2}) \\ & = \frac{(\binom{1 - 2 α}{1 - α})}{(\binom{n}{k}) (\binom{n - 1 - α}{k - 2 α}) (\binom{k - 2 α}{1 - α})} \sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - 2 - h}{k - 2}), \end{aligned}

where in the second identity we have used the fact that

\begin{matrix} (α - 1) \frac{(\binom{h - α}{h})}{(\binom{h + α - 1}{h + 1})} & = & (α - 1) \frac{(h - α)! (h + 1)! (α - 2)!}{h! (- α)! (h + α - 1)!} = \frac{h + 1}{(\binom{- α}{1 - 2 α})} (\binom{h - α}{1 - 2 α}) . \end{matrix}

□
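As a numerical sanity check of the last step of the proof, the two expressions obtained for $M_{n}$ can be compared directly. The following is a minimal sketch, not part of the original derivation: the function names are hypothetical, generalized binomial coefficients are evaluated through the Gamma function, and the parameter is restricted to $0 < α < 1$ so that no Gamma argument hits a pole.

```python
from math import comb, gamma


def gbinom(a: float, b: float) -> float:
    """Generalized binomial coefficient Γ(a+1) / (Γ(b+1) Γ(a-b+1)).

    Assumes no Gamma argument is a non-positive integer, which holds for
    every call below when 0 < alpha < 1.
    """
    return gamma(a + 1) / (gamma(b + 1) * gamma(a - b + 1))


def monophyly_first_form(n: int, k: int, alpha: float) -> float:
    """First expression for M_n: (alpha - 1) times a sum of binomial ratios."""
    prefactor = gbinom(n - k + alpha - 1, n - k) / (
        comb(n, k) * gbinom(n - 1 - alpha, n - k)
    )
    total = sum(
        gbinom(h - alpha, h) / gbinom(h + alpha - 1, h + 1) * comb(n - 2 - h, k - 2)
        for h in range(n - k + 1)
    )
    return prefactor * (alpha - 1) * total


def monophyly_second_form(n: int, k: int, alpha: float) -> float:
    """Second expression for M_n: sum of (h + 1) * binom(h - alpha, 1 - 2 alpha) * ..."""
    prefactor = gbinom(1 - 2 * alpha, 1 - alpha) / (
        comb(n, k)
        * gbinom(n - 1 - alpha, k - 2 * alpha)
        * gbinom(k - 2 * alpha, 1 - alpha)
    )
    total = sum(
        (h + 1) * gbinom(h - alpha, 1 - 2 * alpha) * comb(n - 2 - h, k - 2)
        for h in range(n - k + 1)
    )
    return prefactor * total


if __name__ == "__main__":
    # Arbitrary illustrative parameters; both columns should agree.
    for n, k, alpha in [(10, 3, 0.3), (12, 5, 0.5), (8, 2, 0.75)]:
        print(n, k, alpha,
              monophyly_first_form(n, k, alpha),
              monophyly_second_form(n, k, alpha))
```

Two simple cases can be used as further checks: for $n = k$ the sampled set is the whole leaf set, so both expressions should return 1, and for $n = 3$, $k = 2$ the value should be $1/3$ for every $α$, since the three labeled topologies of size 3 are equally likely.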

Remark 2.

From the latter theorem, we observe the following.

(i): For different values of n, we can observe a different behavior by plotting the (natural logarithm of the) probability $M_{n}$ as a function of α (Figure 4). The mentioned probability can be an increasing function of α (left panel), a decreasing function of α (right panel), or a unimodal function of α (middle panel).
(ii): The formula for $M_{n}$ that appears in (8) also yields the probability of monophyly for a set of k taxa chosen at random among the leaves of a labeled topology of size n selected under Ford’s distribution when $α = 0$ or $α = 1$ (the latter in the limit $α \to 1^{-}$ ). Indeed, for $α = 0$ , by extending the range of summation up to $h = n - 2$ in the second of the following equalities (the added terms vanish), substituting $j = h + 1$ in the third, and applying the Chu–Vandermonde identity in the fourth, the formula for $M_{n} = M_{n} (α, k)$ reduces to

$\begin{aligned} M_{n} (0, k) & = \left. \frac{(\binom{1 - 2 α}{1 - α})}{(\binom{n}{k}) (\binom{n - 1 - α}{k - 2 α}) (\binom{k - 2 α}{1 - α})} \sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - 2 - h}{k - 2}) \right|_{α = 0} \\ & = \frac{1}{(\binom{n}{k}) (\binom{n - 1}{k}) k} \sum_{h = 0}^{n - 2} (h + 1) h (\binom{n - 2 - h}{k - 2}) \\ & = \frac{2}{(\binom{n}{k}) (\binom{n - 1}{k}) k} \sum_{j = 0}^{n - 1} (\binom{j}{2}) (\binom{n - 1 - j}{k - 2}) \\ & = \frac{2}{(\binom{n}{k}) (\binom{n - 1}{k}) k} (\binom{n}{k + 1}) \\ & = \frac{2 n}{k (k + 1) (\binom{n}{k})} . \end{aligned}$

Furthermore, when $α \to 1^{-}$ , we can use Legendre’s duplication formula to find

$\lim_{α \to 1^{-}} (\binom{1 - 2 α}{1 - α}) = \lim_{α \to 1^{-}} \frac{Γ (2 - 2 α)}{Γ (2 - α) Γ (1 - α)} = \lim_{α \to 1^{-}} \frac{1}{2^{1 - 2 (1 - α)} \sqrt{π}} Γ (1 - α + \frac{1}{2}) = \frac{1}{2} .$

For $h = 1, \dots, n - k$ , we have

$\lim_{α \to 1^{-}} (\binom{h - α}{1 - 2 α}) = \lim_{α \to 1^{-}} \frac{Γ (h + 1 - α)}{Γ (2 - 2 α) Γ (h + α)} = \lim_{α \to 1^{-}} \frac{Γ (h)}{Γ (2 - 2 α) Γ (h + 1)} = 0,$

while $h = 0$ provides

$\lim_{α \to 1^{-}} (\binom{- α}{1 - 2 α}) = \lim_{α \to 1^{-}} \frac{Γ (1 - α)}{Γ (2 - 2 α) Γ (α)} = \lim_{α \to 1^{-}} \frac{Γ (1 - α)}{Γ (2 - 2 α)} = 2 .$

Hence,

$\lim_{α \to 1^{-}} M_{n} (α, k) = \frac{1 / 2}{(\binom{n}{k}) (\binom{n - 2}{k - 2})} 2 (\binom{n - 2}{k - 2}) = \frac{1}{(\binom{n}{k})},$

showing that the formula in (8) extended by continuity to $α = 1$ yields the right expression for the probability $M_{n}$ .
(iii): The formula for $M_{n}$ provided in the latter theorem can be further simplified for fixed values of k. For instance, when $k = 2$ or $k = 3$ , Equation (8) respectively reduces to the following:

$M_{n} = \frac{Γ (n - α) Γ (α - 2) (α + (2 - 2 α) n) + Γ (1 - α) Γ (n - 2 + α)}{2 (3 - 2 α) Γ (n - α) Γ (α - 2) (\binom{n}{2})}$

and

$M_{n} = \frac{((1 - α) n + α) Γ (n - α) Γ (α - 2) + ((2 - α) n + α - 3) Γ (1 - α) Γ (n - 3 + α)}{2 (3 - 2 α) Γ (n - α) Γ (α - 2) (\binom{n}{3})} .$

More generally, explicit expressions for $M_{n}$ can be determined for fixed values of k by writing the sum in (8) as a telescoping sum:

$\sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - 2 - h}{k - 2}) = \sum_{h = 0}^{n - k} [g (h + 1) - g (h)] = g (n - k + 1) - g (0)$

(11)

where

$g (h) = (2 - 2 α) A (h) (\binom{h - α}{2 - 2 α})$

for a polynomial $A (h) = a_{k - 1} h^{k - 1} + a_{k - 2} h^{k - 2} + \dots + a_{0}$ whose coefficients satisfy

$a_{k - 1} = \frac{{(- 1)}^{k - 2}}{(k + 1 - 2 α) (k - 2)!}$

(12)

and the linear system

$\begin{cases} A (0) = \frac{2 - α}{α} A (- 1), \\ A (n - k + i + 1) = \frac{n - k + i + α - 1}{n - k + i + 1 - α} A (n - k + i), & i = 1, \dots, k - 2 . \end{cases}$

(13)

Note that if $A (h)$ satisfies (12) and (13), then as in the first equality of (11) we indeed find

$\begin{aligned} g (h + 1) - g (h) & = (2 - 2 α) [A (h + 1) (\binom{h + 1 - α}{2 - 2 α}) - A (h) (\binom{h - α}{2 - 2 α})] \\ & = (\binom{h - α}{1 - 2 α}) \cdot (A (h + 1) (h + 1 - α) - A (h) (h + α - 1)) \\ & = (\binom{h - α}{1 - 2 α}) \cdot (h + 1) (\binom{n - 2 - h}{k - 2}), \end{aligned}$

because $A (h + 1) (h + 1 - α) - A (h) (h + α - 1)$ and $(h + 1) (\binom{n - 2 - h}{k - 2}) = (h + 1) \frac{(n - 2 - h) \cdot \dots \cdot (n - k - h + 1)}{(k - 2)!}$ are identical as polynomials of degree $k - 1$ in the variable h with leading coefficient ${(- 1)}^{k - 2} / (k - 2)!$ and roots $- 1, n - k + 1, n - k + 2, \dots, n - 3, n - 2$ .
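The special cases in (ii) and the telescoping argument in (iii) can be checked numerically. The sketch below is only an illustration under stated assumptions: all function names are hypothetical, the reciprocal Gamma function is set to 0 at its poles so that $α = 0$ can be evaluated directly (with $n > k$), the limit $α \to 1^{-}$ is approximated by $α = 1 - 10^{-7}$, and for $k = 2$ the polynomial is taken as $A (h) = a_{1} h + a_{0}$ with $a_{1} = 1 / (3 - 2 α)$ from (12) and $a_{0} = (2 - α) a_{1} / (2 - 2 α)$ obtained by solving the first equation of (13).

```python
from math import comb, gamma


def rgamma(x: float) -> float:
    """Reciprocal Gamma function, set to 0 at the poles x = 0, -1, -2, ..."""
    if x <= 0 and x == int(x):
        return 0.0
    return 1.0 / gamma(x)


def gbinom(a: float, b: float) -> float:
    """Generalized binomial Γ(a+1) / (Γ(b+1) Γ(a-b+1)); assumes a + 1 > 0."""
    return gamma(a + 1) * rgamma(b + 1) * rgamma(a - b + 1)


def monophyly(n: int, k: int, alpha: float) -> float:
    """Formula (8) for 0 <= alpha < 1 (with n > k when alpha = 0)."""
    prefactor = gbinom(1 - 2 * alpha, 1 - alpha) / (
        comb(n, k)
        * gbinom(n - 1 - alpha, k - 2 * alpha)
        * gbinom(k - 2 * alpha, 1 - alpha)
    )
    total = sum(
        (h + 1) * gbinom(h - alpha, 1 - 2 * alpha) * comb(n - 2 - h, k - 2)
        for h in range(n - k + 1)
    )
    return prefactor * total


def yule_closed_form(n: int, k: int) -> float:
    """Value 2n / (k (k + 1) binom(n, k)) obtained in (ii) for alpha = 0."""
    return 2 * n / (k * (k + 1) * comb(n, k))


def direct_sum_k2(n: int, alpha: float) -> float:
    """Left-hand side of (11) for k = 2, where binom(n - 2 - h, 0) = 1."""
    return sum((h + 1) * gbinom(h - alpha, 1 - 2 * alpha) for h in range(n - 1))


def telescoped_sum_k2(n: int, alpha: float) -> float:
    """Right-hand side g(n - 1) - g(0) of (11) for k = 2 and 0 < alpha < 1."""
    a1 = 1 / (3 - 2 * alpha)                   # leading coefficient, from (12)
    a0 = (2 - alpha) * a1 / (2 - 2 * alpha)    # solves A(0) = (2 - α)/α · A(-1)

    def g(h: int) -> float:
        return (2 - 2 * alpha) * (a1 * h + a0) * gbinom(h - alpha, 2 - 2 * alpha)

    return g(n - 1) - g(0)


if __name__ == "__main__":
    print(monophyly(10, 3, 0.0), yule_closed_form(10, 3))
    print(monophyly(10, 3, 1 - 1e-7), 1 / comb(10, 3))
    print(direct_sum_k2(9, 0.3), telescoped_sum_k2(9, 0.3))
```

Each printed pair should agree, the second one only approximately because the limit is replaced by a value of $α$ close to 1.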

We conclude this section with an asymptotic estimate for $M_{n}$ when $α$ and k are fixed and n goes to infinity.

Theorem 3.

Fix $0 < α < 1$ and $k \geq 2$; then, when $n \to \infty$, we have the following asymptotic equivalence:

M_{n} \sim \frac{(\binom{2 - 2 α}{2 - α})}{(\binom{k + 1 - 2 α}{2 - α})} \frac{k!}{n^{k - 1}} .

Proof.

We start by analyzing the sum

\sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - h - 2}{k - 2}) = \frac{1}{Γ (2 - 2 α) (k - 2)!} \sum_{h = 0}^{n - k} φ (h),

where

φ (h) = Γ (2 - 2 α) (k - 2)! \cdot (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - h - 2}{k - 2}) .

We write

(\binom{h - α}{1 - 2 α}) = \frac{1}{Γ (2 - 2 α)} \frac{Γ (h + 1 - α)}{Γ (h + α)} = \frac{1}{Γ (2 - 2 α)} \frac{Γ (h + 1 - α)}{Γ (h + 1)} \frac{Γ (h + 1)}{Γ (h + α)} .

For $h \geq 1$, Gautschi’s inequality (which states that $x^{1 - s} < Γ (x + 1) / Γ (x + s) < {(x + 1)}^{1 - s}$ for every positive real number x and every s in $(0, 1)$) yields

\frac{h^{1 - α} {(h + 1)}^{- α}}{Γ (2 - 2 α)} < (\binom{h - α}{1 - 2 α}) < \frac{{(h + 1)}^{1 - α} h^{- α}}{Γ (2 - 2 α)} .
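The two-sided bound above can be spot-checked numerically. This is a small sketch with hypothetical function names, in which the generalized binomial coefficient is written through the Gamma function exactly as in the preceding display.

```python
from math import gamma


def binom_h(h: int, alpha: float) -> float:
    """binom(h - alpha, 1 - 2 alpha) = Γ(h + 1 - α) / (Γ(2 - 2α) Γ(h + α))."""
    return gamma(h + 1 - alpha) / (gamma(2 - 2 * alpha) * gamma(h + alpha))


def gautschi_bounds(h: int, alpha: float) -> tuple[float, float]:
    """Lower and upper bounds stated above, valid for h >= 1 and 0 < alpha < 1."""
    scale = 1 / gamma(2 - 2 * alpha)
    lower = scale * h ** (1 - alpha) * (h + 1) ** (-alpha)
    upper = scale * (h + 1) ** (1 - alpha) * h ** (-alpha)
    return lower, upper


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):
        lower, upper = gautschi_bounds(5, alpha)
        print(alpha, lower, binom_h(5, alpha), upper)
```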

Combining this with the elementary inequality

\frac{{(n - h - k + 1)}^{k - 2}}{(k - 2)!} \leq (\binom{n - h - 2}{k - 2}) \leq \frac{{(n - h - 2)}^{k - 2}}{(k - 2)!},

we obtain

L (n) \equiv \sum_{h = 1}^{n - k} {(h (h + 1))}^{1 - α} {(n - h - k + 1)}^{k - 2} < \sum_{h = 1}^{n - k} φ (h) < \sum_{h = 1}^{n - k} {(h + 1)}^{2 - α} h^{- α} {(n - h - 2)}^{k - 2} \equiv U (n),

where we have used $L (n)$ and $U (n)$ to respectively denote the lower and upper bounds of the previous inequality. For $m \geq 1$, $a > 0$ and $b \geq 0$, consider the function $f : [0, m] \to [0, \infty)$ defined by $f (t) = t^{a} {(m - t)}^{b}$ (with the convention $0^{0} = 1$). Note that f is (strictly) increasing for $0 < t < t_{0} = \frac{a m}{a + b}$, (strictly) decreasing for $t_{0} \leq t \leq m$, and admits a maximum $f (t_{0}) = C_{a, b} m^{a + b}$ with $C_{a, b} = a^{a} b^{b} {(a + b)}^{- (a + b)}$. Note also that by the substitution $t = m u$ we have

\int_{0}^{m} f (t) d t = m^{a + b + 1} \int_{0}^{1} u^{a} {(1 - u)}^{b} d u = m^{a + b + 1} B (a + 1, b + 1),

where $B (z, w) = Γ (z) Γ (w) / Γ (z + w)$ is the Beta function defined for every complex number z and w such that $ℜ (z) > 0$ and $ℜ (w) > 0$. Because

\sum_{h = 1}^{m} f (h) = \sum_{h = 0}^{⌊ t_{0} ⌋} f (h) + f (⌊ t_{0} ⌋ + 1) + \sum_{h = ⌊ t_{0} ⌋ + 2}^{m} f (h) \leq \int_{0}^{m} f (t) d t + f (⌊ t_{0} ⌋ + 1) \leq \int_{0}^{m} f (t) d t + f (t_{0})

and

\sum_{h = 1}^{m} f (h) + f (t_{0}) = \sum_{h = 0}^{m} f (h) + f (t_{0}) \geq \int_{0}^{m} f (t) d t,

we find

\begin{matrix} - f (t_{0}) + \int_{0}^{m} f (t) d t & \leq \sum_{h = 1}^{m} f (h) \leq f (t_{0}) + \int_{0}^{m} f (t) d t, \end{matrix}

that is,

\begin{matrix} - C_{a, b} m^{a + b} + m^{a + b + 1} B (a + 1, b + 1) & \leq \sum_{h = 1}^{m} f (h) \leq C_{a, b} m^{a + b} + m^{a + b + 1} B (a + 1, b + 1) . \end{matrix}
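The latter two-sided estimate compares a sum of values of a unimodal function with its integral, up to one copy of the maximum. A short numerical sketch follows; the function names and the test parameters are illustrative assumptions and do not appear in the original argument.

```python
from math import gamma


def beta(z: float, w: float) -> float:
    """Beta function B(z, w) = Γ(z) Γ(w) / Γ(z + w) for positive real z, w."""
    return gamma(z) * gamma(w) / gamma(z + w)


def power_sum_bounds(m: int, a: float, b: float) -> tuple[float, float, float]:
    """Return (lower bound, sum of f(h) for h = 1..m, upper bound).

    Here f(t) = t**a * (m - t)**b with a > 0 and b >= 0; Python evaluates
    0**0 as 1, which matches the convention used in the text.
    """
    c_ab = a**a * b**b / (a + b) ** (a + b)          # constant C_{a,b}
    peak = c_ab * m ** (a + b)                       # maximum f(t_0)
    integral = m ** (a + b + 1) * beta(a + 1, b + 1)
    total = sum(h**a * (m - h) ** b for h in range(1, m + 1))
    return integral - peak, total, integral + peak


if __name__ == "__main__":
    for m, a, b in [(50, 1.4, 3), (30, 0.5, 0), (100, 2.0, 1)]:
        print(m, a, b, power_sum_bounds(m, a, b))
```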

We apply the latter estimates to bound $L (n)$ and $U (n)$. For each $a > 0$, we denote $C_{a} = C_{a, k - 2}$. We first consider $L (n)$. Because $1 - α > 0$, we have

\begin{matrix} L (n) & > \sum_{h = 1}^{n - k} h^{2 - 2 α} {(n - k - h)}^{k - 2} \geq - C_{2 - 2 α} {(n - k)}^{k - 2 α} + {(n - k)}^{k + 1 - 2 α} B (3 - 2 α, k - 1) . \end{matrix}

We now focus on $U (n)$. Since ${(h + 1)}^{2} \leq h^{2} + 3 h$ and $h^{- 2 α} \leq h^{- α}$ for every $h \geq 1$, we have

\begin{aligned} U (n) & = \sum_{h = 1}^{n - k} {(h + 1)}^{2 - α} h^{- α} {(n - h - 2)}^{k - 2} < \sum_{h = 1}^{n} {(h + 1)}^{2} h^{- 2 α} {(n - h)}^{k - 2} \\ & < \sum_{h = 1}^{n} h^{2 - 2 α} {(n - h)}^{k - 2} + 3 \sum_{h = 1}^{n} h^{1 - α} {(n - h)}^{k - 2} \\ & \leq C_{2 - 2 α} n^{k - 2 α} + n^{k + 1 - 2 α} B (3 - 2 α, k - 1) + 3 C_{1 - α} n^{k - α - 1} + 3 B (2 - α, k - 1) n^{k - α} . \end{aligned}

Thus, both $L (n)$ and $U (n)$ are asymptotically equivalent to $B (3 - 2 α, k - 1) n^{k + 1 - 2 α}$ as $n \to \infty$; therefore, so is $\sum_{h = 1}^{n - k} φ (h)$. Because $φ (0) = o (n^{k + 1 - 2 α})$, we can conclude that

\begin{matrix} \sum_{h = 0}^{n - k} (h + 1) (\binom{h - α}{1 - 2 α}) (\binom{n - h - 2}{k - 2}) \sim \frac{B (3 - 2 α, k - 1)}{Γ (2 - 2 α) (k - 2)!} n^{k + 1 - 2 α} = \frac{2 - 2 α}{Γ (k + 2 - 2 α)} n^{k + 1 - 2 α} . \end{matrix}

Finally, by inserting the latter formula into (8) together with the asymptotic equivalences

(\binom{n}{k}) \sim \frac{n^{k}}{k!} \quad \text{and} \quad (\binom{n - 1 - α}{k - 2 α}) \sim \frac{n^{k - 2 α}}{Γ (k + 1 - 2 α)},

we find

\begin{matrix} M_{n} & \sim \frac{(\binom{1 - 2 α}{1 - α})}{(\binom{k - 2 α}{1 - α})} \frac{Γ (k + 1 - 2 α) k!}{n^{2 k - 2 α}} \frac{2 - 2 α}{Γ (k + 2 - 2 α)} n^{k + 1 - 2 α} = \frac{(\binom{2 - 2 α}{2 - α})}{(\binom{k + 1 - 2 α}{2 - α})} \frac{k!}{n^{k - 1}} . \end{matrix}

□
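The asymptotic equivalence of Theorem 3 can be observed numerically by comparing formula (8) with $c (α, k) \cdot n^{1 - k}$ for a large value of n. The sketch below makes a few assumptions that are not in the original text: the function names are hypothetical, binomial coefficients are evaluated through the log-Gamma function to avoid overflow, and the parameter is restricted to $0 < α < 1$ so that every Gamma argument is positive.

```python
from math import comb, exp, factorial, lgamma, log


def log_gbinom(a: float, b: float) -> float:
    """log of Γ(a+1) / (Γ(b+1) Γ(a-b+1)); assumes all three arguments are positive."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def monophyly(n: int, k: int, alpha: float) -> float:
    """Formula (8) for 0 < alpha < 1, evaluated through log-Gamma."""
    log_prefactor = (
        log_gbinom(1 - 2 * alpha, 1 - alpha)
        - log(comb(n, k))
        - log_gbinom(n - 1 - alpha, k - 2 * alpha)
        - log_gbinom(k - 2 * alpha, 1 - alpha)
    )
    total = sum(
        (h + 1) * exp(log_gbinom(h - alpha, 1 - 2 * alpha)) * comb(n - 2 - h, k - 2)
        for h in range(n - k + 1)
    )
    return exp(log_prefactor) * total


def asymptotic_constant(k: int, alpha: float) -> float:
    """c(alpha, k) = k! * binom(2 - 2α, 2 - α) / binom(k + 1 - 2α, 2 - α)."""
    return factorial(k) * exp(
        log_gbinom(2 - 2 * alpha, 2 - alpha) - log_gbinom(k + 1 - 2 * alpha, 2 - alpha)
    )


if __name__ == "__main__":
    k = 3
    for alpha in (0.25, 0.5, 0.75):
        for n in (50, 500, 2000):
            ratio = monophyly(n, k, alpha) * n ** (k - 1) / asymptotic_constant(k, alpha)
            print(alpha, n, ratio)  # expected to approach 1 as n grows
```

For $k = 2$ and $α = 1 / 2$ the formula in (8) simplifies by hand to $M_{n} = 1 / (2 n - 3)$, while $c (1 / 2, 2) = 1 / 2$, which agrees with the theorem.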

From Theorem 3 and the formulas for $M_{n}$ provided at the beginning of this section for $α = 0$ and $α = 1$, we find that $M_{n}$ behaves as follows for $n \to \infty$:

M_{n} \sim \begin{cases} c (α, k) \cdot n^{1 - k}, & \text{if } 0 \leq α < 1, \\ k! \cdot n^{- k}, & \text{if } α = 1, \end{cases}

(14)

where $c (α, k) = \frac{k! (\binom{2 - 2 α}{2 - α})}{(\binom{k + 1 - 2 α}{2 - α})}$.

The asymptotic constant $c (α, k)$ is plotted in Figure 5 for $k = 5$ (left) and $k = 15$ (center). The plot has different shapes depending on k; it is strictly decreasing when $k = 5$, while it has a maximum at $α \approx 0.5$ when $k = 15$. The value of the parameter $α$ at which the constant $c (α, k)$ reaches its maximum is shown as a function of $k \in [2, 50]$ on the right of Figure 5. Note that for small values of k, i.e., $k \leq 7$, $c (α, k)$ is maximum when $α = 0$, that is, when the Ford model corresponds to the Yule model. When $k \approx 15$, the uniform model ($α \approx 1 / 2$) is the phylogenetic scenario that maximizes $c (α, k)$.
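The location of the maximum of $c (α, k)$ can be explored with a simple grid search. The sketch below is illustrative only: the function names and the grid resolution are assumptions, and the constant is evaluated in the equivalent form $c (α, k) = k! \, Γ (3 - 2 α) Γ (k - α) / (Γ (1 - α) Γ (k + 2 - 2 α))$, obtained by writing both binomial coefficients through the Gamma function and cancelling the common factor $Γ (3 - α)$.

```python
from math import exp, factorial, lgamma


def asymptotic_constant(k: int, alpha: float) -> float:
    """c(alpha, k) for 0 <= alpha < 1, via log-Gamma for numerical stability."""
    log_c = (
        lgamma(3 - 2 * alpha)
        + lgamma(k - alpha)
        - lgamma(1 - alpha)
        - lgamma(k + 2 - 2 * alpha)
    )
    return factorial(k) * exp(log_c)


def argmax_alpha(k: int, steps: int = 1000) -> float:
    """Grid point in [0, 1) at which c(alpha, k) is largest (first one on ties)."""
    grid = [i / steps for i in range(steps)]
    return max(grid, key=lambda alpha: asymptotic_constant(k, alpha))


if __name__ == "__main__":
    for k in (2, 5, 7, 8, 15, 30, 50):
        print(k, argmax_alpha(k))
```

On this grid the maximizer is $α = 0$ up to $k = 7$ and moves into the interior of the interval from $k = 8$ onward, in line with the description of Figure 5.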

5. Conclusions

In this article, we have investigated the distributive properties of the clade size $| \hat{P} |$ of a population P of k leaves taken from the n taxa of a labeled topology generated at random under different scenarios in the framework of Ford’s $α$-model.

Our goal in Section 3 was to measure how much a sequence of random $α$-insertions modifies the clade size of a set of initially monophyletic leaves. To this end, in Theorem 1 we have determined the probability mass function, mean, and variance of the clade size of the set $P = \{1, 2, \dots, k\}$ of k leaves belonging to a random labeled topology of size $n \geq k$ obtained through a sequence of $n - k$ $α$-insertions starting from a tree of size k labeled by the symbols in P. As might be expected, our results show that lowering the value of the parameter $α$, that is, decreasing the probability of attaching new leaves at the internal edges of the tree, increases the expectation of the clade size variable and reduces its variance (Figure 3).

Next, Section 4 was dedicated to the study of monophyly under the $α$-model. In Theorem 2, we have derived a formula for calculating the probability $M_{n} = M_{n} (α, k)$ that a set of k randomly chosen taxa is monophyletic in a labeled topology of size $n \geq k$ selected under the Ford distribution with parameter $α$. Previous studies [3,4] have investigated the probability of monophyly for the special cases of $α = 0$ and $α = 1 / 2$, respectively corresponding to the Yule and uniform distributions of labeled topologies of a given size. Our formula for $M_{n}$ allows for the inference of the parameter $α$ by conditioning on the monophyly of a random sample of k leaves in a labeled topology of size n generated under the Ford model. In particular, the conditional density function of $α$ reads as

f_{α | monophyly} (x) = \frac{Prob (monophyly | α = x) \cdot f_{α} (x)}{\int_{0}^{1} Prob (monophyly | α = u) \cdot f_{α} (u) d u} = \frac{M_{n} (x, k) \cdot f_{α} (x)}{\int_{0}^{1} M_{n} (u, k) \cdot f_{α} (u) d u},

where $f_{α} (x)$ is the prior density function of $α$. In Figure 6, assuming a uniform prior distribution on the $α$ parameter, i.e., $f_{α} (x) = 1$, we plot the conditional cumulative distribution of $α$ when a random sample of k leaves is observed to be monophyletic in a random labeled topology of size $n = 100$. The plot shows that for a relatively small value of k, i.e., $k = 5$ (top curve), the probability of $α \leq x$ is increased for every x with respect to the prior distribution (the dashed line). For a larger value of k, i.e., $k = 20$ (bottom curve), the probability of $α \geq x$ instead increases for every x, suggesting that small values of $α$ are more likely when observing the monophyly of a small sample of leaves. This is confirmed by the plot in Figure 7, where we have calculated the conditional probability of $α$ being either small ($α \leq 0.2$), medium ($0.4 \leq α \leq 0.6$), or large ($α \geq 0.8$) when k random leaves are seen to be monophyletic in a Ford-distributed tree of size $n = 100$, again assuming a uniform prior distribution on $α$. The probability of a small or large value of $α$ respectively decreases or increases over the first values of k, while the probability of a medium value of $α$ appears to depend less on k. For small, medium, or large $α$, the best-known instances of the Ford model are respectively the Yule ($α = 0$), uniform ($α = 1 / 2$), and comb ($α = 1$) models of random labeled topologies, and our calculations can be used to infer which model is more likely to produce a given tree of size n in which a set of k random leaves is found to form a clade.

In the final part of Section 4, we have further investigated the probability of monophyly $M_{n}$ by determining its asymptotic behavior (Theorem 3). We have shown that $M_{n}$ decreases like a constant multiple of $n^{1 - k}$ for every $0 \leq α < 1$, that is, $M_{n} \sim c (α, k) \cdot n^{1 - k}$ as in (14). Thus, for a given value of k, the probability of monophyly is asymptotically largest for those values of $α$ at which the factor $c (α, k)$ attains its maximum (Figure 5, right).
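The conditional distribution of $α$ given monophyly can be approximated numerically from formula (8). The following sketch rests on several assumptions that are not taken from the original text: the function names are hypothetical, the prior is uniform on $[0, 1]$, the integrals are replaced by a midpoint rule (which avoids the endpoints $α = 0$ and $α = 1$, where some Gamma arguments vanish), and the number of cells is an arbitrary choice.

```python
from math import comb, exp, lgamma, log


def log_gbinom(a: float, b: float) -> float:
    """log of Γ(a+1) / (Γ(b+1) Γ(a-b+1)); assumes all three arguments are positive."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def monophyly(n: int, k: int, alpha: float) -> float:
    """Formula (8) for 0 < alpha < 1, evaluated through log-Gamma."""
    log_prefactor = (
        log_gbinom(1 - 2 * alpha, 1 - alpha)
        - log(comb(n, k))
        - log_gbinom(n - 1 - alpha, k - 2 * alpha)
        - log_gbinom(k - 2 * alpha, 1 - alpha)
    )
    total = sum(
        (h + 1) * exp(log_gbinom(h - alpha, 1 - 2 * alpha)) * comb(n - 2 - h, k - 2)
        for h in range(n - k + 1)
    )
    return exp(log_prefactor) * total


def posterior_cdf(n: int, k: int, x: float, cells: int = 400) -> float:
    """Approximate Prob(alpha <= x | monophyly) under a uniform prior.

    Midpoint rule on `cells` equal cells of [0, 1]; a cell is counted when its
    midpoint is <= x, so the value is most accurate when x is a cell boundary.
    """
    width = 1.0 / cells
    mids = [(i + 0.5) * width for i in range(cells)]
    weights = [monophyly(n, k, alpha) for alpha in mids]
    below = sum(w for alpha, w in zip(mids, weights) if alpha <= x)
    return below / sum(weights)


if __name__ == "__main__":
    for k in (5, 20):
        print(k, [round(posterior_cdf(100, k, x), 3) for x in (0.2, 0.5, 0.8)])
```

Comparing the printed values with the uniform prior values 0.2, 0.5, and 0.8 gives a rough numerical counterpart of the curves described for Figure 6.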

Our investigation of the probability of monophyly also relates to other studies of leaf-induced subtrees. In [9], for a uniformly distributed labeled topology T of size n, the authors studied in the limit $n \to \infty$ the maximum value of the probability $γ (B, T)$ of observing a fixed “caterpillar” or “even” tree B of size k as the clade of a set of k leaves randomly sampled in T. The probability of monophyly studied in Section 4 of the present paper for a Ford-distributed labeled topology T of size n can be seen as the sum of the quantities $γ (B, T)$ when all trees B of k leaves are considered.

Several directions of research naturally arise from this work. In Section 4, our focus has been on the probability of the clade size of a set of k random taxa in a Ford-distributed labeled topology of given size being equal to k. It would be of interest to extend our calculations to broaden the results of [5] by determining the entire distribution of the clade size variable of a set of k randomly chosen taxa in a labeled topology of size n sampled under the Ford distribution. We remark that the population P for which the clade size is studied in Section 3 is not a set of k randomly chosen taxa of a random labeled topology t of size n, as P corresponds to the set of leaves present in the labeled topology $t^{*}$ from which a sequence of $α$-insertions has produced t. It also remains to investigate reciprocal monophyly, that is, the probability of two (or more) prespecified groups of leaves being monophyletic in a random tree selected under Ford’s distribution. This probability was calculated for $α = 0$ in Theorem 4.5 of [3]; however, a generalization to arbitrary values of $α$ is still missing.

Author Contributions

Conceptualization, A.D.N. and F.D.; Investigation, A.D.N. and F.D.; Writing—original draft, A.D.N. and F.D.; Writing—review & editing, A.D.N. and F.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the authors.

Acknowledgments

ADN is a member of INdAM–GNSAGA and acknowledges financial support from INdAM. FD acknowledges the MIUR Excellence Department Project awarded to the Department of Mathematics, University of Pisa, CUP I57G22000700001.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Aldous, D.J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci. 2001, 16, 23–34.
2. Ford, D.J. Probabilities on cladograms: Introduction to the alpha model. arXiv 2005, arXiv:math/0511246.
3. Zhu, S.; Degnan, J.H.; Steel, M. Clades, clans, and reciprocal monophyly under neutral evolutionary models. Theor. Popul. Biol. 2011, 79, 220–227.
4. Zhu, S.; Than, C.; Wu, T. Clades and clans: A comparison study of two evolutionary models. J. Math. Biol. 2015, 71, 99–124.
5. Di Nunzio, A.; Disanto, F. Clade size distribution under neutral evolutionary models. Theor. Popul. Biol. 2024, 156, 93–102.
6. Rosenberg, N.A. The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees. Ann. Comb. 2006, 10, 129–146.
7. Semple, C.; Steel, M. Phylogenetics; Oxford Univ. Press: Oxford, UK, 2003.
8. Coronado, T.M.; Mir, A.; Rosselló, F. The probabilities of trees and cladograms under Ford’s α-model. Sci. World J. 2018, 2018, 1916094.
9. Czabarka, E.; Székely, L.A.; Wagner, S. Inducibility in Binary Trees and Crossings in Random Tanglegrams. SIAM J. Discret. Math. 2017, 31, 1732–1750.

Figure 1. Labeled topologies and a sequence of $α$-insertions. (A) A labeled topology of size $n = 5$. The tree has $n = 5$ pendant edges (those in gray, denoted by the letters $a, b, c, d$, and e) and $n - 1 = 4$ internal edges (those in black, denoted by the letters $f, g, h$, and i). (B) The labeled topology $(((1, 3), 4), 2)$ on the right is generated under the $α$-model when $σ = σ_{1} σ_{2} σ_{3} σ_{4} = 1234$ is the permutation that determines the label $σ_{i}$ of the pendant edge added with the i-th insertion in (1).

Figure 2. A sequence of $α$-insertions. Starting from the labeled topology $t^{*} = ((1, 2), (3, 4))$ depicted on the left, the sequence generates the labeled topology t of size 7 depicted on the right. After the insertions, the clade size of the set of leaves $P = \{1, 2, 3, 4\}$ inherited from $t^{*}$ has increased by one, as $\hat{P} = (7, ((1, 2), (3, 4)))$ in the final tree t.

Figure 3. Expected value and variance of the clade size of a set of $k = 4$ initially monophyletic leaves after $n - k$ random $α$-insertions when $α = 0.2$ (dots) and $α = 0.8$ (boxes).

Figure 4. Natural logarithm of the probability of monophyly $M_{n} (α)$ when (from left to right) $n = 10, 15, 20$, $k = 5$, and when $α$ ranges in $[0, 1]$ in steps of 0.05.

Figure 5. Plot of the asymptotic constant $c (α, k)$ appearing in (14) when $k = 5$ (left) and $k = 15$ (center) with $α$ ranging in $[0, 1)$. The plot on the right shows, for each fixed $k \in [2, 50]$, the value of $α$ at which $c (α, k)$ is at its maximum.

Figure 6. Cumulative distribution of the $α$ parameter conditioned on the monophyly of a random sample of k leaves in a Ford-distributed labeled topology of size $n = 100$. We set $k = 5$ in the top curve and $k = 20$ in the bottom curve. The dashed line is the cumulative prior distribution of $α$ when this is assumed to be uniform over the interval $[0, 1]$.

Figure 7. Probability of $α$ being either small, i.e., $α \leq 0.2$ (•), medium, i.e., $0.4 \leq α \leq 0.6$ (■), or large, i.e., $α \geq 0.8$ (♦), when k random leaves of a Ford-distributed labeled topology of size $n = 100$ are found to be monophyletic.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
