Linearization of the Kingman Coalescent
Mathematics 2018, 6

Kingman's coalescent process is a mathematical model of genealogy in which only pairwise common ancestry may occur. Inter-arrival times between successive coalescence events have a negative exponential distribution whose rate equals the combinatorial term $\binom{n}{2}$, where n denotes the number of lineages present in the genealogy. These two standard constraints of Kingman's coalescent, obtained in the limit of a large population size, approximate the exact ancestral process of Wright-Fisher or Moran models under appropriate parameterization. Calculation of coalescence event probabilities with higher accuracy quantifies the dependence of sample and population sizes that adhere to Kingman's coalescent process. The convention that probabilities of leading order $N^{-2}$ are negligible provided $n \ll N$ is examined at key stages of the mathematical derivation. Empirically, expected genealogical parity of the single-pair restricted Wright-Fisher haploid model exceeds 99% where $n \le \frac{1}{2}\sqrt[3]{N}$; similarly, per expected interval where $n \le \frac{1}{2}\sqrt{N/6}$. The fractional cubic root criterion is practicable, since although it corresponds to perfect parity and to an extent confounds identifiability it also accords with manageable conditional probabilities of multi-coalescence.


Introduction
Kingman's coalescent process is a mathematical model of ancestral lineages that inspired a paradigmatic era in population genetics [1][2][3]. Kingman's coalescent process [4][5][6][7] relies on negligibility of coalescence probabilities, and inter-arrival times, other than those of single pair-wise coalescence. Negligibility depends on terms of leading order $N^{-2}$ or less that can be omitted from the process in the limit of a large population size. A comparative study of data generation simulators that implement Kingman's coalescent process demonstrates the utility of this conventional approximation to the exact ancestral process [8]. Phylogenetic trees in general contain a coalescent process of ancestral lineages from the corresponding sub-population within each branch of the phylogeny. The ancestral process within the branches of a phylogeny is often modeled using Kingman's coalescent [9] or the theory of branching processes [10]. The statistical distribution theory of Ewens' sampling formula is derived in population genetics by superimposing unique event mutations on the genealogical structure of Kingman's coalescent [11,12].

Coalescent Theory of Ancestral Processes
Kingman's coalescent process can be derived in a straightforward manner based on the genealogy of a Wright-Fisher model [13]. Consider a parent and an offspring generation, where the haploid population size N is kept fixed in each generation. The probability of zero coalescence events, such that none of the offspring are direct descendants of any parent in common, equals

$$\prod_{i=1}^{n-1}\frac{N-i}{N} = 1 - \binom{n}{2}\frac{1}{N} + O\left(N^{-2}\right) \qquad (1)$$

with respect to n ancestral lines. This conventional approximation defines a geometric probability distribution for the number of generations that pass until a coalescence event,

$$\left(1 - \binom{n}{2}\frac{1}{N}\right)^{j-1}\binom{n}{2}\frac{1}{N}, \qquad (2)$$

where j = 1, 2, 3, ... denotes the generation in which at least one coalescence occurs. Recalibrated coalescent units of time, $t = j/N$ generations, in Equation (2) yield a negative exponential probability distribution, $\Pr(T > t) = e^{-\binom{n}{2}t}$, where T denotes the waiting time until a coalescence event in the limit of a large population size. Consider $\Pr(T \ge j) = (1-p)^{j}$, where $T \sim \mathrm{Geom}(p)$. Take $p = \binom{n}{2}/N$ and $j = Nt$ to get an approximation of the geometric distribution relevant to Kingman's coalescent process.
The binomial formula $(x+y)^{n} = \sum_{i=0}^{n}\binom{n}{i}x^{i}y^{n-i}$ thus yields an infinite series, in the limit of a large population size,

$$\left(1 - \binom{n}{2}\frac{1}{N}\right)^{Nt} = \sum_{i=0}^{Nt}\binom{Nt}{i}\left(-\binom{n}{2}\frac{1}{N}\right)^{i} \longrightarrow \sum_{i=0}^{\infty}\frac{1}{i!}\left(-\binom{n}{2}t\right)^{i} \quad \text{as } N \to \infty. \qquad (3)$$
Now, consider the practical approximation, where $t = 1 \Rightarrow j = N$ and one unit of coalescent time equals N discrete generations in the geometric distribution. Thus, the negative exponential series in Equation (3) yields the conventional result, $\Pr(T > t) = e^{-\binom{n}{2}t}$, when the process is observed in this rewind coalescent time under the approximation of a large finite population size.
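As a numerical illustration of the limit described above, the following minimal Python sketch compares the geometric survival probability $(1-p)^{Nt}$, with $p = \binom{n}{2}/N$, against the exponential form $e^{-\binom{n}{2}t}$. The function names and the example values of n, t and N are illustrative assumptions and are not taken from the original analysis.

```python
from math import comb, exp


def geometric_survival(n: int, N: int, t: float) -> float:
    """Pr(no coalescence in j = N*t generations) under the linearized
    per-generation coalescence probability p = C(n, 2) / N."""
    p = comb(n, 2) / N
    j = round(N * t)  # number of discrete generations
    return (1.0 - p) ** j


def exponential_survival(n: int, t: float) -> float:
    """Large-N limit: Pr(T > t) = exp(-C(n, 2) * t), with t in units of N generations."""
    return exp(-comb(n, 2) * t)


if __name__ == "__main__":
    n, t = 10, 0.05  # illustrative values only
    for N in (10**3, 10**4, 10**5, 10**6):
        g = geometric_survival(n, N, t)
        e = exponential_survival(n, t)
        print(f"N={N:>8}  geometric={g:.6f}  exponential={e:.6f}  diff={g - e:+.2e}")
```

The printed difference should shrink as N grows with n held fixed, which is the sense in which one unit of coalescent time corresponds to N discrete generations.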
Simulation of the trade-off between n versus N had suggested that $n^{2} < N$ should ensure Kingman's coalescent process ([14], pp. 5-6). Alternatively, a classic theoretical approximation due to R.A. Fisher yields a recursion of expected genealogical branch lengths to quantify single nucleotide polymorphisms as a function of sample size upon effective population size [15]. A further simulation study of the Kingman coalescent had suggested its validity threshold should be $n \approx \sqrt{2N}$ [16]. Evaluations in that work compared probabilities of pair-wise, multiple pair-wise and multi-coalescence events. Exploratory analysis concludes that Kingman's coalescent should be a robust approximation of the Wright-Fisher model in terms of genealogical timing, with external branch lengths likely to differ significantly. Another simulation study, under a similar approximation to the Kingman coalescent, calculates percentages of multi-coalescence events and statistics of mutational activity throughout a genealogy of high sample sizes with alternative demographics [17]. The results in Sections 2 and 3 herein clearly demonstrate that the region of validity for the Kingman coalescent depends on population size. Furthermore, multi-coalescence events yield sensitivity in terms of fine-scale topological variation towards the tips. The negligibility of multiple coalescence events by which the Kingman coalescent should accurately approximate the exact Wright-Fisher ancestral process tends to be indirectly addressed in the literature of applied probability modeling and evolutionary biology on multi-coalescent processes.

Coalescent Theory of Branching Processes
An active research field on the extension of discrete-generation Wright-Fisher models, overlapping-generation Moran models, and generalizations to the Cannings model is based on their multinomial offspring distribution variance and moments to develop multi-coalescents [18][19][20]. Derivations of alternative coalescent processes usually retain the conventional proportionality to $N^{-2}$ ([21], Theorem 3.2 via Equation (5); [22], Theorem 2.1 via Equation (4)). These generalizations are in turn based on the partition structures of equivalence classes described in terms of sampling distributions not originally connected to genealogy [23][24][25]. The corresponding convergence-to-coalescent results tend to rely upon fast continuous time scales rather than generational ancestral processes. Thus, multi-coalescent processes replace a multinomial offspring distribution with a variety of continuous population frequency distributions that yield non-negligible jump transitions of lineage decrements greater than one in continuous-time Markov chains. There are alternative approaches to the development of multi-coalescents: (i) branching process theory ([26][27][28], for an application see [29]); and (ii) measure-valued diffusion theory [30,31]. Both approaches model proliferation of lineages over time. Further examples include the β-coalescent [32], Λ-coalescent [33,34], Ξ-coalescent [35,36], and Galton-Watson theory [37,38]. Technical mathematical treatments tend to assume the foundations of ancestral processes. The quantitative analysis of Sections 2 and 3 in this work clearly identifies regions of adherence and detraction from the Wright-Fisher ancestral process, in terms of transition probabilities and expected inter-arrival times, due to the linearization of Kingman's coalescent that neglects multi-coalescence events.

Ancestral Process, per Generation
The error threshold is at the forefront of the issue for computationally-intensive methodologies and statistical models based on Kingman's coalescent. Six main points arise: (i) discrepancy between the exact and linearized non-coalescence probability in Equation (1); (ii) validity of the linearized coalescence probability in Equation (2); (iii) conditional probabilities of single-pair and multi-coalescences given at least one coalescence; (iv) parity of reduced ancestral processes that suppress multi-coalescences, when compared to the exact ancestral process; (v) genealogical topology; and (vi) subsequent inter-arrival times.

Zero Coalescence Events
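The exact probability of k offspring genes that are descendants of k different parents, without shared ancestry in the parental generation, was given by Equation (1). The corresponding approximation derives from the product in Equation (1), where expansion yields

$$\prod_{i=1}^{n-1}\left(1 - \frac{i}{N}\right) = 1 - \frac{1}{N}\sum_{i=1}^{n-1} i + \frac{1}{N^{2}}\sum_{1 \le i_1 < i_2 \le n-1} i_1 i_2 - \frac{1}{N^{3}}\sum_{1 \le i_1 < i_2 < i_3 \le n-1} i_1 i_2 i_3 + \cdots \qquad (4)$$

In Equation (4), calculate the summation of the quadratic term, $N^{-2}$, to get a coefficient

$$\sum_{1 \le i_1 < i_2 \le n-1} i_1 i_2 = \frac{n(n-1)(n-2)(3n-1)}{24}. \qquad (5)$$

Similarly, the summation of the cubic term, $N^{-3}$, yields a coefficient

$$\sum_{1 \le i_1 < i_2 < i_3 \le n-1} i_1 i_2 i_3 = \frac{n^{2}(n-1)^{2}(n-2)(n-3)}{48}. \qquad (6)$$

The derivations of Equations (5) and (6) are deferred to Appendix A.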
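The quadratic and cubic coefficients of the expansion are sums of products over pairs and triples of distinct integers from 1 to n − 1. The short Python sketch below checks closed forms for these sums against direct enumeration. The closed forms used, $n(n-1)(n-2)(3n-1)/24$ and $n^{2}(n-1)^{2}(n-2)(n-3)/48$, are standard identities assumed here for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from math import prod


def quadratic_coefficient(n: int) -> int:
    """Assumed closed form for the sum of i1*i2 over 1 <= i1 < i2 <= n-1."""
    return n * (n - 1) * (n - 2) * (3 * n - 1) // 24


def cubic_coefficient(n: int) -> int:
    """Assumed closed form for the sum of i1*i2*i3 over 1 <= i1 < i2 < i3 <= n-1."""
    return n**2 * (n - 1) ** 2 * (n - 2) * (n - 3) // 48


def brute_force_coefficient(n: int, order: int) -> int:
    """Sum of products over all subsets of size `order` drawn from {1, ..., n-1}."""
    return sum(prod(subset) for subset in combinations(range(1, n), order))


if __name__ == "__main__":
    for n in range(2, 9):
        print(n,
              quadratic_coefficient(n), brute_force_coefficient(n, 2),
              cubic_coefficient(n), brute_force_coefficient(n, 3))
```

Agreement of the two columns for each order gives a quick sanity check on the coefficients before they are used in higher order approximations.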
The default population size in this work is set at $N = 2 \times 10^{5}$, unless otherwise stated; the exponent is then increased and decreased by one or two to verify generality for criteria that are expressed as functions of N. Refer to Figure 1, which compares the first and third order approximations of the non-coalescence probability. The criterion $\sqrt{2N}$ [16] sets the error tolerance down to where the linearized non-coalescence probability, per generation, goes negative at n = 633; clearly, negativity must occur at $n(n-1) > 2N$. The criterion $\sqrt{N}$ [14] sets the error tolerance greater than 15%, and the corresponding proportion of the exact probability equals 0.825979 at n = 447. Reduction to precisely 1% error tolerance occurs at n = 233. Exact non-coalescence probability can be compared to its linearized, quadratic and cubic approximations; refer to Figures 2 and 3. The difference between the quadratic and cubic terms of Equation (4) determines the error of the linearization, since non-linear terms of higher degree do not significantly affect the exact value even with many lineages present in the genealogy; refer to Figure 4. Evaluation of the non-coalescence probability suggests a criterion of 1% proportional error after round-up be √N/3.
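The thresholds quoted for $N = 2 \times 10^{5}$ can be explored numerically. The Python sketch below evaluates the exact product of (N − i)/N for i = 1, ..., n − 1 and its linearization $1 - \binom{n}{2}/N$. It assumes that "error tolerance" refers to the ratio of the linearized to the exact non-coalescence probability; that reading, and all function names, are assumptions made for illustration.

```python
from math import comb


def exact_non_coalescence(n: int, N: int) -> float:
    """Product of (N - i)/N for i = 1, ..., n-1."""
    p = 1.0
    for i in range(1, n):
        p *= (N - i) / N
    return p


def linearized_non_coalescence(n: int, N: int) -> float:
    """First order approximation 1 - C(n, 2)/N."""
    return 1.0 - comb(n, 2) / N


def first_negative_n(N: int) -> int:
    """Smallest n at which the linearized probability goes negative."""
    n = 2
    while linearized_non_coalescence(n, N) >= 0.0:
        n += 1
    return n


def largest_n_within_tolerance(N: int, tol: float) -> int:
    """Largest n with linearized/exact >= 1 - tol (assumed error definition)."""
    n = 2
    exact = exact_non_coalescence(2, N)
    while True:
        next_exact = exact * (N - n) / N  # exact probability for n + 1 lineages
        ratio = linearized_non_coalescence(n + 1, N) / next_exact
        if ratio < 1.0 - tol:
            return n
        n, exact = n + 1, next_exact


if __name__ == "__main__":
    N = 200_000
    print("first negative n:", first_negative_n(N))
    print("ratio at n = 447:",
          linearized_non_coalescence(447, N) / exact_non_coalescence(447, N))
    print("largest n within 1%:", largest_n_within_tolerance(N, 0.01))
```

The three printed quantities can be compared with the values n = 633, 0.825979 and n = 233 stated in the text; changing N by factors of ten shows how the thresholds scale with population size.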

Remark 1. In this case, coalescence probability absolute error, linearized minus exact value, yields an empirical criterion for greater than (precisely) 99% expected genealogical parity; n ≤ 33. The Wright-Fisher ancestral process restricted to single-pair coalescence thus yields n ≤ 34. The total linearization error of the Kingman coalescent, which includes non-coalescence error, thus yields n ≤ 26. Refer to the exposition of parity in Section 4 for the details of these criteria.

Single Pair Coalescence Events
Identically to Equation (1), precisely two lineages with the same parent occurs with probability

$$\binom{n}{2}\frac{1}{N}\prod_{i=1}^{n-2}\frac{N-i}{N}. \qquad (7)$$

The form of Equation (7) can be explained by analogy to Equation (1). Common ancestry among two lineages occurs with probability $1 \cdot \frac{1}{N}$, since the same individual must be picked uniformly at random from the parent generation by two individuals from the offspring generation in a population of fixed size N. Exchangeability renders a combinatorial term $\binom{n}{2}$, since any single pair of the n lineages from the offspring generation may participate in such a common ancestry event. There is no common ancestry among the remaining n − 2 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 1, 2, ..., n − 2.
Compare the linearized probability of at least one coalescence, $\frac{1}{2}n(n-1)N^{-1}$ from Equation (2), and the exact pair-wise coalescence probability of Equation (7). Clearly, the linearization omits the corresponding non-coalescence probability product.
Remark 2. The single-pair coalescence restriction is questionable prima facie with respect to the exact coalescence probabilities, since the complementary event to non-coalescence in Equation (1) describes at least one coalescence. This includes combinations of single-pairs or multi-coalescence. The $N^{-1}$ term of Equation (4) linearizes the probability of at least one coalescence, which is to be distinguished from the probability of single pair coalescence.
The differences between the corresponding linearized and exact probabilities cancel out as equal and opposite, whereas the relative proportions yield asymmetric linearized substitutions; refer to Figure 5. Both substitutions equal the exact value at n = 2; as n increases, linearized non-coalescence probability underestimates and linearized coalescence probability overestimates. Although the absolute errors have zero sum, linearization exaggerates coalescence transition probabilities and by comparison slightly reduces non-coalescence transition probabilities; refer to Section 2.3. Thus, Kingman's coalescent detracts from the exact ancestral process.
Table 1 quantifies the decreased accuracy of coalescence probability linearization, in Figure 5 (ii), for alternative population sizes, N.
Remark 3. Does the conventional substitution correspond to omission of the multi-coalescence probabilities, or constraint of emergent coalescence events by suppression of multi-coalescence and replacement with single pair coalescence? Answer: The latter, since the probability of at least one coalescence is linearized in Equation (1).
Define the absolute error (type I) as the difference between linearized and exact single pair coalescence probabilities; $\binom{n}{2}/N$ minus Equation (7). The quotient of the absolute error (type I) upon the exact single pair coalescence probability defines the relative error (type I). After cancellation, when n lineages remain, this equals the quotient of exact probabilities for at least one coalescence upon non-coalescence from n − 2 lineages. Alternatively, define absolute error (type II) as the difference between linearized and exact at least one coalescence probabilities; $\binom{n}{2}/N$ minus the probability of the complementary event to Equation (1). The quotient of the absolute error (type II) upon the exact at least one coalescence probability defines the relative error (type II). Refer to Equations (14) and (16) in Section 4 for further explanation.
The absolute and relative errors heighten a probability structure that would be invisible otherwise; refer to Figures 6 and 7. Thus, the robustness of the Kingman coalescent gets a qualitative measure. The quotient of relative errors illustrates their comparative proportional growth as n increases; refer to Figure 8. In this case, a minmax transition occurs around n = 20 between two gradient phases that correspond to the quotients of relative error type I upon type II. Intuitively, the two types of relative errors follow maximum and minimum detraction, respectively; type I corresponds to suppression of multi-coalescence altogether, whereas type II corresponds to replacement of multi-coalescence events with a single-pair, which accords with the Kingman coalescent. The single-pair and at least one coalescence probabilities for small to moderate numbers of lineages look equivalent; refer to Figure 9 in Section 2.3.
Relative error (type II), per generation, does not exceed precisely 1% where n ≤ 90, in this case.
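The type I and type II errors defined above can be evaluated directly from the exact products. The following Python sketch is a minimal illustration under the definitions given in the text: the exact single-pair probability is taken as $\binom{n}{2}/N$ times the product of (N − i)/N for i = 1, ..., n − 2, and the exact probability of at least one coalescence as one minus the product for i = 1, ..., n − 1. The function names and dictionary keys are hypothetical.

```python
from math import comb


def distinct_parents(m: int, N: int) -> float:
    """Product of (N - i)/N for i = 1, ..., m-1."""
    p = 1.0
    for i in range(1, m):
        p *= (N - i) / N
    return p


def linearization_errors(n: int, N: int) -> dict:
    """Absolute and relative errors of the linearized coalescence probability."""
    linear = comb(n, 2) / N
    exact_single_pair = linear * distinct_parents(n - 1, N)  # product over i = 1..n-2
    exact_at_least_one = 1.0 - distinct_parents(n, N)        # complement of no coalescence
    abs_i = linear - exact_single_pair
    abs_ii = linear - exact_at_least_one
    return {
        "abs_I": abs_i,
        "rel_I": abs_i / exact_single_pair,
        "abs_II": abs_ii,
        "rel_II": abs_ii / exact_at_least_one,
    }


def largest_n_rel_ii(N: int, tol: float) -> int:
    """Largest n for which the type II relative error stays at or below tol."""
    n = 2
    while linearization_errors(n + 1, N)["rel_II"] <= tol:
        n += 1
    return n


if __name__ == "__main__":
    N = 200_000
    for n in (10, 20, 50, 90, 200):
        e = linearization_errors(n, N)
        print(n, f"rel_I={e['rel_I']:.5f}", f"rel_II={e['rel_II']:.5f}")
    print("largest n with rel_II <= 1%:", largest_n_rel_ii(N, 0.01))
```

With $N = 2 \times 10^{5}$ the last line can be compared with the n ≤ 90 criterion quoted above; the type I error exceeds the type II error throughout, consistent with the description of maximum and minimum detraction.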

Multiple Coalescence Events
There is no implementation of multiple coalescence events in fastsimcoal, version 2.6 (fsc26), according to its online documentation [39][40][41]. An extension of the original SimCoal package [42], Serial SimCoal [43], simulates serially sampled genetic data and implements a heuristic double-pair coalescence transition probability (software and documentation available online).

Consider the ancestral process in which at coalescence a decrement of two lineages can occur; double-pair or triplet coalescence. Precisely two pairs of lineages, with a different parent in common for each pair, occurs with probability

$$\frac{1}{2}\binom{n}{2}\frac{1}{N}\cdot\frac{N-1}{N}\binom{n-2}{2}\frac{1}{N}\prod_{i=2}^{n-3}\frac{N-i}{N}, \qquad (8)$$

since discounting permutation of both pairs yields a factor one half. Similar to Equation (7), precisely one pair-wise common ancestry event occurs with probability $\binom{n}{2}/N$, since this event involves any two of the n lineages present in the offspring generation. The second pair-wise common ancestry event picks a different common parent to the first pair and this occurs with probability $\left(\frac{N-1}{N}\right)\binom{n-2}{2}/N$. Permutation of the first and second pairs does not count due to the exchangeability of lineages in the ancestral process and requires the factor $\frac{1}{2}$. There is no common ancestry among the remaining n − 4 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 2, 3, ..., n − 3.
Three lineages with the same parent occurs with probability

$$\binom{n}{3}\frac{1}{N^{2}}\prod_{i=1}^{n-3}\frac{N-i}{N}. \qquad (9)$$

Common ancestry among three lineages occurs with probability $1 \cdot \frac{1}{N} \cdot \frac{1}{N}$, since the same individual from the parent generation is picked uniformly at random by three individuals from the offspring generation in a population of fixed size N. Exchangeability renders a combinatorial term $\binom{n}{3}$, since any triplet of the n lineages from the offspring generation may participate in such a common ancestry event. There is no common ancestry among the remaining n − 3 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 1, 2, ..., n − 3.
Consider the ancestral process in which at coalescence a decrement of three lineages can occur; triple-pair, both a single-pair and a triplet, or quadruplet coalescence. Three pairs of lineages, with a different parent in common for each pair, occurs with probability

$$\frac{1}{6}\binom{n}{2}\frac{1}{N}\cdot\frac{N-1}{N}\binom{n-2}{2}\frac{1}{N}\cdot\frac{N-2}{N}\binom{n-4}{2}\frac{1}{N}\prod_{i=3}^{n-4}\frac{N-i}{N}, \qquad (10)$$

since discounting permutation of the triple-pair yields a factor one sixth. Similar to Equation (8), the first pair-wise common ancestry event occurs with probability $\binom{n}{2}/N$. The second pair-wise common ancestry event picks a different common parent to the first pair and this occurs with probability $\left(\frac{N-1}{N}\right)\binom{n-2}{2}/N$. The third pair-wise common ancestry event picks a different common parent to the first and second pairs and this occurs with probability $\left(\frac{N-2}{N}\right)\binom{n-4}{2}/N$. Permutation of the first, second and third pairs does not count due to the exchangeability of lineages in the ancestral process and requires the factor $\frac{1}{6}$. There is no common ancestry among the remaining n − 6 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 3, 4, ..., n − 4.
One single-pair and one triplet of lineages, with a different parent in common, occurs with probability

$$\frac{1}{2}\binom{n}{2}\frac{1}{N}\cdot\frac{N-1}{N}\binom{n-2}{3}\frac{1}{N^{2}}\prod_{i=2}^{n-4}\frac{N-i}{N}, \qquad (11)$$

since discounting permutation of the pair and the triplet yields a factor one half. The pair-wise common ancestry event occurs with probability $\binom{n}{2}/N$. Similar to Equation (9), the triplet common ancestry event now picks a different common parent to the pair and this occurs with probability $\left(\frac{N-1}{N}\right)\binom{n-2}{3}/N^{2}$. The alternative combinatorial product $\binom{n}{3}\binom{n-3}{2}$ yields the same function of n as in Equation (11). In this sense, the two alternatives cannot be distinguished. However, the usual permutation discount of simultaneous common ancestry events, one single pair and one triplet, applies with the factor $\frac{1}{2}$ due to exchangeability. There is no common ancestry among the remaining n − 5 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 2, 3, ..., n − 4.
Precisely four lineages with the same parent occur with probability

$$\binom{n}{4}\,\frac{1}{N^{3}}\prod_{i=1}^{n-4}\frac{N-i}{N}. \tag{12}$$

Common ancestry among four lineages occurs with probability $1\cdot\frac{1}{N}\cdot\frac{1}{N}\cdot\frac{1}{N}$, since the same individual from the parent generation is picked uniformly at random by four individuals from the offspring generation in a population of fixed size N. Exchangeability renders a combinatorial term $\binom{n}{4}$, since any quadruplet of the n lineages from the offspring generation may participate in such a common ancestry event. There is no common ancestry among the remaining n − 4 lineages in the offspring generation, which yields the corresponding product of (N − i)/N for i = 1, 2, . . ., n − 4.
The probabilities of Equations (7)-(12) constitute a subset of all possible types of coalescences and therefore yield a restricted ancestral process. These probabilities apply in every generation until a coalescence event occurs, with those of certain multi-coalescences equal to zero for small n. That is, such probabilities apply from one generation to the next among the offspring while n lineages remain. At coalescence, adjust n accordingly and continue the ancestral process, until eventually absorption occurs with a most recent common ancestor of the entire initial sample.
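The event probabilities derived above can be evaluated exactly with rational arithmetic. The following is a minimal sketch rather than the author's implementation: the function and key names are hypothetical, the forms of the non-coalescence, single-pair, double-pair and triplet probabilities are assumed from the verbal derivations of this section, and only non-coalescence, single-pair, double-pair, triplet, triple-pair and quadruplet events are covered.

```python
from fractions import Fraction
from math import comb


def choose(a, b):
    """Binomial coefficient that is zero when too few lineages remain."""
    return comb(a, b) if a >= b >= 0 else 0


def distinct_parents(N, first, last):
    """Product of (N - i)/N for i = first..last; an empty range gives 1."""
    result = Fraction(1)
    for i in range(first, last + 1):
        result *= Fraction(N - i, N)
    return result


def event_probabilities(n, N):
    """Exact per-generation probabilities for n lineages, population size N.

    The formulas are assumptions read off the verbal derivations in the text.
    """
    pair = Fraction(choose(n, 2), N)
    second_pair = Fraction(N - 1, N) * Fraction(choose(n - 2, 2), N)
    third_pair = Fraction(N - 2, N) * Fraction(choose(n - 4, 2), N)
    return {
        "none": distinct_parents(N, 1, n - 1),
        "single_pair": pair * distinct_parents(N, 1, n - 2),
        "double_pair": Fraction(1, 2) * pair * second_pair
        * distinct_parents(N, 2, n - 3),
        "triplet": Fraction(choose(n, 3), N**2)
        * distinct_parents(N, 1, n - 3),
        "triple_pair": Fraction(1, 6) * pair * second_pair * third_pair
        * distinct_parents(N, 3, n - 4),
        "quadruplet": Fraction(choose(n, 4), N**3)
        * distinct_parents(N, 1, n - 4),
    }
```

For n = 4 these event types exhaust all possibilities, so the returned values sum to exactly one for any N, which gives a simple consistency check on the assumed forms.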
The exact probability of at least one coalescence, the complement of Equation (1), and the exact multiple coalescence probabilities of Equations (8) and (10), evaluated for small, moderate and larger numbers of lineages, demonstrate their region of negligibility; refer to Figure 9.
The coalescence probabilities of Equations (7)-(12) are of direct relevance to computer simulation and importance sampling methodology of the ancestral Markov chain, particularly as linearization errors accumulate. For the present purpose, quantitative analysis of conditional coalescence probability given the event of at least one coalescence, the complement of Equation (1), occupies Section 3.1.

Genealogical Topology and Expected Inter-Arrival Generations
Realization of the entire ancestral process yields one resultant genealogy. Statistical inference of genealogical time, for instance importance sampling methodologies, should be robust under a subset of ancestral transitions restricted to lineage decrements of one unless other genetic or exogenic processes act to emphasize the external branches.

Conditional Probabilities of Multi-Coalescence
The conditional probability of multi-coalescences given a coalescence event determines the genealogical topology in realization of the ancestral process. Refer to Figure 10, where conditional probability is given the event of at least one coalescence, either linearized or exact. Given exact coalescence: when n = 10, Pr(double-pair | coalescence) < 1/14,286, Pr(triplet | coalescence) < 1/75,003 and Pr(triple-pair | coalescence) < 1/571,428,571. When n = 20, the corresponding bounds are 1/2615, 1/33,344 and 1/13,071,895, respectively. Thus, in the region of most significance to timing, such multi-coalescence events rarely occur under genealogical stochastic reiteration.
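The conditional probabilities quoted above can be checked with a short exact calculation. This is a sketch under stated assumptions: the population size N = 2 × 10^5 is taken from the figure captions (the text does not state which N the quoted bounds use), the double-pair and triplet forms are assumed from the derivations of the restricted ancestral process, and all function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def distinct_parents(N, first, last):
    """Product of (N - i)/N for i = first..last; an empty range gives 1."""
    result = Fraction(1)
    for i in range(first, last + 1):
        result *= Fraction(N - i, N)
    return result


def conditional_given_coalescence(n, N):
    """Pr(event | at least one coalescence) for n >= 4 lineages."""
    # Exact probability of at least one coalescence: one minus non-coalescence.
    coalescence = 1 - distinct_parents(N, 1, n - 1)
    # Assumed exact double-pair probability (two pairs, two distinct parents).
    double_pair = (Fraction(1, 2) * Fraction(comb(n, 2), N)
                   * Fraction(N - 1, N) * Fraction(comb(n - 2, 2), N)
                   * distinct_parents(N, 2, n - 3))
    # Assumed exact triplet probability (three lineages share one parent).
    triplet = Fraction(comb(n, 3), N**2) * distinct_parents(N, 1, n - 3)
    return {
        "double_pair": double_pair / coalescence,
        "triplet": triplet / coalescence,
    }


cond = conditional_given_coalescence(10, 200_000)
print(float(1 / cond["double_pair"]), float(1 / cond["triplet"]))
```

Printing the reciprocals makes the comparison with bounds of the form 1/14,286 and 1/75,003 direct.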
Figures 11-13 illustrate the rapid decline of significant intervals for timing the genealogy and quantify the extent of multi-coalescence event rarity. Multi-coalescence event probabilities vary substantially within such regions, and negligibility becomes less extensive as population size decreases; refer to Figures 14-16.
In the first regions, these conditional probabilities of triplets exceed triple-pairs; this trend switches in the second regions and triple-pairs far exceed triplets. Thus, only the slightest relative contribution of multiple coalescence transition probabilities occurs in the ancestral process, per generation. Substantial replication of the ancestral process will be required before realizing a genealogy that contains multi-coalescence events. That is, unless the genealogy consists of many lineages or the population size is diminished substantially.

Single-Pairs Dominate Double-Pairs?
Consider the relative probabilities of double-pair and single-pair coalescence, namely the quotient of Equation (8) upon Equation (7), denoted Equation (13). Equation (13) equals case i = 1 of [16] (Equation (19)), which required correction: the denominator term 2N(i + 1) replaces 4N(i + 1). This expression equals the quotient of the (i + 1)st and the ith multiple-pair coalescence probabilities. Thus, i = 1 corresponds to the quotient of double-pair upon single-pair coalescence probabilities.
The quotient of Equation (13) explains the dominance of expected inter-arrival times by single-pair coalescence. This is because the geometric distribution yields expectation equal to the reciprocal of the sum of Equation (7) plus Equation (8), when double and single-pair coalescences may occur in the ancestral process. Thus, double-pair coalescence is negligible in terms of the expected inter-arrival generations in the ancestral process due to Equation (13). Refer to Figure 17: the quotient of double-pair upon single-pair coalescence probabilities per generation has increased from nil at n = 2 to 1% (0.1%, N = 2,000,000) at n = 92, whereas the relative proportion of the total expected generations in the genealogy then equals 0.0121%. The expected inter-arrival generations determined by single-pair and double-pair coalescence probability, respectively, equal $1/p_s$ of Equation (7) and $1/p_d$ of Equation (8); refer to Figure 18. The exact probability of avoiding a double-pair coalescence per expected interval, according to the geometric distribution with parameter $p_d$, at n = 92 equals 0.990032.

[Figure: … single-pair (Equation (7)) and double-pair (Equation (8)) exact coalescence probabilities. Population size N = 2 × 10^5 and n = 2, 3, . . ., 500 (inset n = 2, 3, . . ., 50). Due to a property of the Wright-Fisher model such that a geometric distribution determines the number of generations until a coalescence event occurs, the success probability of the distribution equals either Equation (7) or Equation (8).]

In the next Section, calculation of multiple coalescence event probabilities per expected interval leads to a paradox of negligibility and its resolution.
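The figures quoted in this section for n = 92 can be reproduced with a few lines of arithmetic. The sketch below assumes the single-pair and double-pair forms implied by the derivations of the restricted ancestral process, takes N = 2 × 10^5 from the figure caption, and interprets "per expected interval" as raising the per-generation avoidance probability to the power 1/p_s; the function names are hypothetical.

```python
from math import comb


def single_pair(n, N):
    """Assumed exact single-pair coalescence probability per generation."""
    p = comb(n, 2) / N
    for i in range(1, n - 1):        # i = 1..n-2: remaining distinct parents
        p *= (N - i) / N
    return p


def double_pair(n, N):
    """Assumed exact double-pair coalescence probability per generation."""
    p = 0.5 * comb(n, 2) / N * (N - 1) / N * comb(n - 2, 2) / N
    for i in range(2, n - 2):        # i = 2..n-3: remaining distinct parents
        p *= (N - i) / N
    return p


def double_pair_avoidance(n, N):
    """Probability of no double-pair event over 1/p_s generations."""
    return (1.0 - double_pair(n, N)) ** (1.0 / single_pair(n, N))


N = 200_000
print(double_pair(92, N) / single_pair(92, N))   # quotient at n = 92
print(double_pair_avoidance(92, N))              # per expected interval
```

Under these assumed forms the quotient collapses to C(n − 2, 2) / (2(N − n + 2)), which is close to 1% at n = 92 for N = 2 × 10^5 and close to 0.1% for N = 2,000,000.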

Parity of the Kingman Coalescent
Empirical calculations in this section yield a criterion of coalescence probability error, linearized minus exact value, such that expected genealogical parity be greater than 99% where $n \le \tfrac{1}{2}\sqrt[3]{N}$. The Wright-Fisher ancestral process restricted to single pair coalescence empirically yields the same criterion as that just described. Total error of the Kingman coalescent that includes linearized non-coalescence probability thus yields $\tfrac{1}{2}\sqrt[3]{N/2}$. In general, per generation, consider error to be the probability of a neglected coalescence; parity the probability of avoiding a neglected coalescence. The parity, per expected interval, is obtained by raising parity, per generation, to the power of an exponent given by 1/p, where p equals the probability of coalescence, per generation. For instance, using the linearized coalescence probability yields the expected inter-arrival generations of Kingman's coalescent. The product of parity, per expected interval, across all intervals from the initial sample to its most recent common ancestor yields expected genealogical parity. Non-occurrence of neglected coalescence events anywhere in the expected genealogical realization represents perfect parity. This maximum stringency confounds observability, since the impact of neglected coalescence depends on position within the genealogy. Therefore, parity, per expected interval, is more directly informative.

Linearization Errors
The linearization of Kingman's coalescent yields error in both the non-coalescence and the coalescence probabilities, which cancel each other and sum to zero when the coalescence error is with respect to the exact probability of at least one coalescence. Consider n lineages to be present in the genealogy. Define the linearization error (type I) with respect to the exact probability of single-pair coalescence, per generation, as Equation (14). Equation (14) simplifies to the exact multi-coalescence probability and is equivalent to the error of the Wright-Fisher ancestral process restricted to single-pair coalescence. Thus, one minus the linearization error (type I) defines linearization parity (type I), per generation, as Equation (15). Note that Equation (15) equals one plus the linearized coalescence probability, then multiplied by the exact non-coalescence probability, while n − 1 lineages remain in the genealogy. Equation (15) is quantified as n varies, per expected interval and expected cumulative genealogy, according to reduced, mid-range and enlarged constant population sizes in Figures 19-24. These figures also illustrate that inclusion of multi-coalescence transition probabilities of Equations (8)-(12) sustains parity of restricted Wright-Fisher models.
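The closed form described for the type I parity can be verified in a few steps. The derivation below is a worked check rather than a reproduction of Equations (14) and (15): the symbols $x_n$, $y_n$ and $\ell_n$ are introduced here for illustration only, and the forms of the exact non-coalescence and single-pair probabilities are assumed from the derivations of the restricted ancestral process.

```latex
\begin{align*}
x_n &= \prod_{i=1}^{n-1}\frac{N-i}{N}, \qquad
y_n = \frac{\binom{n}{2}}{N}\prod_{i=1}^{n-2}\frac{N-i}{N}
    = \frac{\binom{n}{2}}{N}\,x_{n-1}, \qquad
\ell_n = \frac{\binom{n}{2}}{N}, \\[4pt]
\text{error (type I)} &= \bigl[(1-\ell_n) - x_n\bigr] + \bigl[\ell_n - y_n\bigr]
    = 1 - x_n - y_n, \\[4pt]
\text{parity (type I)} &= x_n + y_n
    = x_{n-1}\left(\frac{N-(n-1)}{N} + \frac{\binom{n}{2}}{N}\right)
    = \left(1 + \frac{\binom{n-1}{2}}{N}\right)x_{n-1}.
\end{align*}
```

The last step uses $\binom{n}{2} - (n-1) = \binom{n-1}{2}$, so the linearized coalescence probability inside the bracket is the one for n − 1 lineages, matching the exact non-coalescence probability $x_{n-1}$ that multiplies it.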
In the single-pair exact ancestral process, parity, per expected interval, exceeds precisely 99% where n ≤ 91; refer to Table 2. An identical criterion was observed with relative error (type II) of the …

Remark 4. Pr(decrement of 2 | coalescence) rises to 1% at n lineages when parity, per expected interval, falls to 99%; verified as N varies. An alternative interpretation of this coupling is that the fractional cubic root criterion of expected genealogical parity occurs at values of n lineages where multi-coalescence transitions remain probabilistically insignificant.
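The parity quantities defined in this section can be sketched numerically for the single-pair restricted Wright-Fisher process. The code below is illustrative only: it assumes that the per-generation error is the exact multi-coalescence probability, that the exponent 1/p uses the exact single-pair probability, and that N = 2 × 10^5 as in the figure captions; all function names are hypothetical.

```python
from math import comb


def exact_none_and_single(n, N):
    """Assumed exact non-coalescence and single-pair probabilities."""
    shared = 1.0
    for i in range(1, n - 1):            # i = 1..n-2
        shared *= (N - i) / N
    none = shared * (N - (n - 1)) / N    # extends the product to i = n-1
    single = comb(n, 2) / N * shared
    return none, single


def parity_per_generation(n, N):
    """Probability of avoiding a neglected multi-coalescence in one generation."""
    none, single = exact_none_and_single(n, N)
    return none + single


def parity_per_interval(n, N):
    """Per-generation parity raised to the expected interval length 1/p."""
    none, single = exact_none_and_single(n, N)
    return (none + single) ** (1.0 / single)


def genealogical_parity(n_sample, N):
    """Product of per-interval parity from the sample down to two lineages."""
    parity = 1.0
    for n in range(2, n_sample + 1):
        parity *= parity_per_interval(n, N)
    return parity


def largest_n_meeting(threshold, N, n_max=1000):
    """Largest n for which per-interval parity stays at or above threshold."""
    best = 1
    for n in range(2, n_max + 1):
        if parity_per_interval(n, N) < threshold:
            break
        best = n
    return best


print(largest_n_meeting(0.99, 200_000))
print(genealogical_parity(29, 200_000))
```

The first printed value can be compared with the per-interval criterion reported in Table 2, and the second with the fractional cubic root criterion for expected genealogical parity.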

Parity Paradox
The single-pair coalescence probability dominates the expectation of generations between adjacent coalescence events in the genealogy, although inclusion of the double-pair coalescence probability sustains genealogical parity significantly beyond that obtained with single-pair coalescence. The paradox is resolved by two points: (i) relative probability values of single and double-pair coalescence explain the expected inter-arrival generations; and (ii) binomial expansion of the geometric probability for avoidance of omitted multi-coalescence events until the expected inter-arrival generations elapse.
Recall from Section 3.2 that single-pairs dominate expectation of inter-arrival generations. Then, let $G = 1/p_n$, where $p_n$ equals Equation (7). Consider the binomial expansion of parity, Equation (18), where x, y, and z denote non-coalescence, single-pair and double-pair coalescence probabilities of Equations (1), (7) and (8), respectively. The left side of Equation (18) quantifies the long run non-occurrence probability of omitted multi-coalescence events within the expected interval duration, while n lineages remain. Double-pair coalescence yields a non-negligible probability in total, since Equation (18) contains a sum of terms on the order $\tfrac{1}{2}G^{2}$ multiplied by Equation (8). Therefore, accumulation of double-pair coalescence probabilities over many generations sustains parity. Hence, parity of the double-pairs restricted Wright-Fisher model is significantly greater than that of the single-pairs restricted Wright-Fisher model. Additional multi-coalescence transition probabilities strengthen parity accordingly.
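The contrast between the two restricted models can be illustrated numerically. The sketch below assumes that per-interval parity is (x + y)^G for the single-pairs restricted model and (x + y + z)^G for the double-pairs restricted model, with G = 1/y; this reading of the left side of Equation (18), the forms of x, y and z, and the function names are assumptions made for illustration.

```python
from math import comb


def distinct_parents(N, first, last):
    """Product of (N - i)/N for i = first..last; an empty range gives 1."""
    p = 1.0
    for i in range(first, last + 1):
        p *= (N - i) / N
    return p


def restricted_interval_parities(n, N):
    """Per-interval parity of single-pairs and double-pairs restricted models."""
    x = distinct_parents(N, 1, n - 1)                     # non-coalescence
    y = comb(n, 2) / N * distinct_parents(N, 1, n - 2)    # single-pair
    z = (0.5 * comb(n, 2) / N * (N - 1) / N               # double-pair
         * comb(n - 2, 2) / N * distinct_parents(N, 2, n - 3))
    G = 1.0 / y                                           # expected interval
    return (x + y) ** G, (x + y + z) ** G


single_only, with_double = restricted_interval_parities(92, 200_000)
print(single_only, with_double)
```

At n = 92 and N = 2 × 10^5 the first value falls just below 99% while the second stays well above it, which illustrates how including double-pair coalescence sustains parity even though its per-generation probability is small.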
The conventional standard deviation of the generations expected in between successive coalescence events equals $\sqrt{q_n}/p_n$, where $q_n = 1 - p_n$, and the subscript denotes the dependence of the coalescence probability on n lineages present. Note the conventional variance replaces a pathological mathematical variance of the geometric probability distribution (refer to Appendix B, for derivation of the mathematical variance). The higher moments do not resolve the conundrum that double-pair coalescence sustains genealogical parity, whereas single-pair coalescence determines expected inter-arrival generations. Consider the functional forms of Equations (A6), (A9) and (A10) in two cases: (i) Equation (7); and (ii) the sum of Equations (7) and (8). Therefore, single pair coalescence probability dominates the first, second, (to a lesser extent) third, and fourth moments similarly to the discussion of Section 3.2.

Conclusions
Linearization potentially affects the Kingman coalescent in two ways: (i) suppression of multi-coalescence events induces upward size bias; and (ii) inflation of coalescence probabilities due to linearization induces downward size bias. Quantitative analyses demonstrate that such effects are unlikely. More specifically, genealogical topology is predominantly unaffected from root to tips provided lineage numbers remain small to moderate. This relegates similar conjectured compensatory mechanisms [15][16][17] to regions of many lineages. Many lineages render significant multi-coalescence probabilities and inflated linearized coalescence probabilities, although expected inter-arrival times diminish on external branches; in this region Kingman's coalescent therefore departs from the exact ancestral process.
Kingman's coalescent is a reasonably robust genealogical model of population genetics, although unsuitable for a wide range of sample sizes dependent on population size. Regions of validity were quantified with restricted versions of the exact ancestral process. Computationally-intensive statistical inference methods usually require many millions of genealogical realizations to converge. Thus, small waiting-time adjustments and slightly inflated coalescence event probabilities could be investigated more fully for significant elaboration of the sample space upon which resultant parametric estimates depend.
Double-pairs and higher combinations of multi-coalescence have proven to be negligible in the region of most significance for timing the genealogy, in both the linearized and exact ancestral processes. In contrast, parity quantifies the long run avoidance of omitted multi-coalescences across many generations as the sample size increases. Multi-coalescence affects the shape towards the tips of large sample genealogies, and then yields only fine-tuning effects of ancestral timing properties. The loss of parity of the Kingman coalescent, under relaxation of its conventional limit of a large population size, was quantified. The resultant empirical criteria, that a valid sample size is less than certain fractional square and cubic roots of population size, were all verified to hold for a wide range of population sizes. Finally, utilizing genomic data for the discovery of ecological evolutionary dynamics represents an important challenge [44] that demands extremely robust statistical models of genealogy applicable to phylogenetics.

Table 1. Percentage overestimation of linearized coalescence probability reached at n lineages.
