Computation of the Likelihood of Joint Site Frequency Spectra Using Orthogonal Polynomials

In population genetics, information about evolutionary forces, e.g., mutation, selection and genetic drift, is often inferred from DNA sequence information. Generally, DNA consists of two long strands of nucleotides or sites that pair via the complementary bases cytosine and guanine (C and G), on the one hand, and adenine and thymine (A and T), on the other. With whole genome sequencing, most genomic information stored in the DNA has become available for multiple individuals of one or more populations, at least in humans and model species, such as fruit flies of the genus Drosophila. In a genome-wide sample of L sites for M (haploid) individuals, the state of each site may be made binary, by binning the complementary bases, e.g., C with G to C/G, and contrasting C/G to A/T, to obtain a “site frequency spectrum” (SFS). Two such samples of either a single population from different time-points or two related populations from a single time-point are called joint site frequency spectra (joint SFS). While mathematical models describing the interplay of mutation, drift and selection have been available for more than 80 years, calculation of exact likelihoods from joint SFS is difficult. Sufficient statistics for inference of, e.g., mutation or selection parameters that would make use of all the information in the genomic data are rarely available. Hence, often suites of crude summary statistics are combined in simulation-based computational approaches. In this article, we use a bi-allelic boundary-mutation and drift population genetic model to compute the transition probabilities of joint SFS using orthogonal polynomials. This allows inference of population genetic parameters, such as the mutation rate (scaled by the population size) and the time separating the two samples. We apply this inference method to a population dataset of neutrally-evolving short intronic sites from six DNA sequences of the fruit fly Drosophila melanogaster and the reference sequence of the related species Drosophila sechellia.


Introduction
Evolutionary forces, e.g., mutation, selection and genetic drift, shape DNA sequence information.Typically, the evolutionary processes that have influenced the data reach back millions of generations or years.Mathematical theory that describes these processes has been available for more than 80 years (e.g., [1,2]), yet inference of population genetic parameters using probabilistic models is difficult, and only few analytical maximum-likelihood estimators are available; those based on diffusion theory, so far, assume independence among sites and are briefly reviewed in Vogl [3], Vogl and Bergman [4] and in the theory section below.
A DNA molecule is a string (or strand) of nucleotides (or sites) that usually pairs with a complementary strand to form a double-stranded chromosome.Pairing of sites is accomplished by hydrogen bonds between the complementary base pairs adenine (A) and thymine (T), on the one hand, and cytosine (C) and guanine (G), on the other.An assortment of chromosomes forms a genome, which is specific for a species.The main functional units of genomes are genes that often code for proteins.Proteins provide structure, catalyze metabolism or mediate physiological pathways in all living organisms.While the single-celled Bacteria and Archaea generally have compact genomes, genes of the more complex eukaryotic organisms are often interrupted by non-coding introns.Introns are spliced out, i.e., eliminated, during maturation of the messenger RNA, which is then translated into the chain of amino acids that makes up proteins.
A point mutation at a certain nucleotide or site creates a new genetic variant, i.e., an allele.Mutation is not strand-specific, but may be biased towards A or T (A/T) over C or G (C/G) or vice versa, because mutation rates between these two allelic classes may vary.Genome-wide sequence data may be made bi-allelic (binary), by considering A/T nucleotides as Allele 0 and C/G nucleotides as Allele 1.This simplifies mathematical analysis, such that maximum-likelihood inference becomes possible.Mutations introduce new variants into the genome and, thus, increase genomic variation.Conversely, stochastic fluctuations of the allelic proportion due to finite population sizes, i.e., random genetic drift, eventually cause fixation of an allelic type, thus eliminating variation.An equilibrium between mutation and drift may establish with time.
Recently, relatively inexpensive, high throughput DNA sequencing methods have made available population data from whole genomes (in multicellular organisms typically comprising 10 7 − 10 11 sites), at least for humans and model species, such as fruit flies of the genus Drosophila.These data provide the basis for inference of population genetic forces, such as random genetic drift, the mutation rate scaled by the population size, directional selection and the time of the split between two populations.
In this article, we focus on inferring population genetic parameters using a mutation-drift model of the allelic proportion x.For mathematical convenience, genomic sites are classified as binary with respect to their nucleotide (C/G vs. A/T).A total of L sites are classified into categories, depending on the count y of Allele 1 (C/G) among M aligned genomic sequences.Together, these counts form a site frequency spectrum (SFS) of size (M + 1) with 0 ≤ y ≤ M. Joint SFS may be constructed considering the allelic states of sites within a single population at two different time points or two related populations at a single time point.
The solution of the diffusion equation describing the evolution of x conditional on mutation and drift parameters has previously been represented as a series expansion of orthogonal polynomials (e.g., [5][6][7][8][9]).In this article, we extend the mathematical theory to a boundary-mutation model [4], which describes the evolution of x when the scaled mutation rate θ is small, i.e., on the order 0.1 or smaller [10].Using this model, a method for the inference of θ and the time of split t is derived and applied to both simulated and empirical Drosophila population data.The empirical data are joint SFS of short introns, as the nucleotide composition of this site class is considered to not be affected by selection, but only by the joint forces of mutation and drift [11][12][13].Therefore, the study of these sites likely provides an accurate estimate of the population demography and the genome-wide scaled mutation rate.The joint SFS from a sample of six individuals from the Malawian D. melanogaster population [14] and the D. sechellia reference sequence (Release 1.0; [15]) is used to infer mutation and drift parameters.

Inference with a Single Site Frequency Spectrum Assuming Equilibrium
Assume that a sample of L genomic loci or sites is available for M haploid individuals.The sample space of the allelic count for each locus l is then y l = (0, . . ., M) copies of Allele 1, with 1 ≤ l ≤ L. In regions of high recombination rates relative to mutation rates, sites may be assumed to be independently and identically distributed, such that the probabilities given the model parameters of all L sites can be multiplied.In this case, the theory developed below can be considered maximum likelihood.In regions of relatively low recombination rates, estimators are still consistent and may be considered a composite likelihood.For notational convenience, the index l is often dropped in the following.

Inference Based on the Beta Equilibrium Distribution
In a classical study, Wright [2] proposed a model for the evolution of a bi-allelic locus under the influence of the population genetic forces: mutation, directional selection and drift.He also derived the equilibrium distribution of the allelic proportion, conditional on the scaled mutation rate, the mutation bias and the scaled strength of directional selection.In the absence of selection, the equilibrium distribution of the population allelic proportion x of Allele 1 is a beta: where α is the mutation bias towards Allele 1 and β = (1 − α) and θ the overall scaled mutation rate, i.e., the product of the per-generation mutation rate µ and the effective population number or size N.
Conditional on x, the distribution of the allelic count y is assumed to be binomial.Especially with genome-wide samples, the allelic proportions at particular sites are not interesting and "integrated out", which leads to the beta-binomial distribution of the allelic count: Given a sample of L independent loci for M individuals for each locus and a common α and θ, let L y represent the counts of sites with y alleles of Type 1. Set q y = p(y | α, θ).The likelihood is then: For arbitrarily large values of θ, only iterative algorithms have been derived to obtain maximum likelihood estimates of α and θ [3], even in the simple case without selection.Note that this model corresponds to the canonical model of the empirical Bayes method, and maximizing this marginal likelihood corresponds to a parametric empirical Bayes approach [16].
In the limit of small θ, the beta-binomial compound distribution (2) can be expanded into a Taylor series in θ at θ = 0, up to first order.The "folded" site frequency spectrum (folded SFS) is derived from the general site frequency spectrum by lumping the samples L y with L M−y , such that the state space becomes 0 ≤ y ≤ [M/2] per locus.For a polymorphic sample of such a folded SFS, the Taylor series expansion in θ of the beta-binomial compound distribution has been derived by RoyChoudhury and Wakeley [17].In the general situation with 0 ≤ y ≤ M, the series expansion of the beta-binomial compound distribution leads to the general "RoyChoudhury-Wakeley" distribution [3]: With this first order expansion of the beta-binomial in θ, approximate maximum likelihood (ML) estimators of α and θ and their posterior distributions can be obtained easily [3].In particular, the approximate ML estimator for the scaled mutation rate is: where θ = α β θ and L p = ∑ M−1 y=1 L y , i.e., the sum over all polymorphic sites in the sample, while the approximate ML estimator for α is: If the boundary mutation model is assumed, these estimators are maximum likelihood, rather than approximations in the limit of small-scaled mutation rates θ [4].

Inference Based on the Assumptions of Equilibrium and Rare Mutations
The estimator θ of Formula ( 5) is a variant of the well-known Ewens-Watterson estimator of the scaled mutation rate [18,19], θw = L p /(L ∑ M−1 y=1 1/y).The latter was originally derived assuming the infinite sites model [20,21], which in turn was based on a model with irreversible mutation [2].With the infinite sites model, infinitely many sites may be hit by mutation at a finite rate, such that each site is hit only once [19,21].Furthermore, it is usually assumed that the ancestral and derived allelic states can be inferred with outgroup information, i.e., with information from closely-related species or populations (e.g., [12,13,[22][23][24]).Then, segregating mutations are assumed to only arise once from the ancestral background.Alleles at a site are thus not defined as having bases A/T versus C/G, but as being ancestral versus derived.Note that the factor of two difference between θ of Formula ( 5) and the Ewens-Watterson estimator reflects that mutations arise from both boundaries in the former and only one boundary in the latter.Obviously, the mutation bias cannot be inferred with polarization.Yet, irrespective of whether or not data are polarized, a polymorphic site is scored as polymorphic.
The Ewens-Watterson estimator is generally unbiased.If sites are unlinked, it can be shown that it is also the maximum likelihood estimator of θ and, thus, a sufficient statistic (e.g., [3]).Ewens [18] neglected to show this explicitly, while he earlier showed that the estimator for the corresponding infinite alleles model is maximum likelihood [25].Furthermore, it can been shown that the estimators are unbiased (e.g., [3,18]).Note that if assumptions are met, θw corresponds to the "expected heterozygosity", i.e., the expected proportion of polymorphism in a sample of size M = 2.
Similar to the infinite sites model, applications of the Poisson random field (PRF) model to population genetics generally assume small mutation rates.The PRF theory is often based on irreversible mutation models and, like the infinite sites model of Kimura [21], usually assumes the presence of directional selection.For an equilibrium distribution to exist, an inexhaustible and unvarying supply of sites must be assumed.Furthermore, the ancestral state of all sites must be known without errors and conditioned on.This is because, as discussed above, the rates of mutation from A/T to C/G generally differ, and the force of directional selection is reversed if an A/T mutates to a C/G or vice versa.The above assumptions are not met with real datasets: genomes are finite, and inference of the ancestral state is error-prone.Nevertheless, if appropriate outgroup information is available and quasi-equilibrium is assumed, the approach is sensible [26][27][28][29].
While RoyChoudhury and Wakeley [17] also use the PRF approach, they do not assume outgroup information, but rather start from a Taylor series expansion in θ of the equilibrium beta distribution (1).As shown above (Equation ( 5)), the estimator RoyChoudhury and Wakeley [17] derived is essentially identical to the Ewens-Watterson estimator of θ.Starting from a Moran model that only allows for mutations from the boundaries, Vogl and Clemente [10] derive a generalization of the estimator also for the case with directional selection, without assuming outgroup information.Vogl and Bergman [4] derive ML estimators for all three parameters: mutation bias α, scaled mutation rate θ and scaled selection strength γ, with the same assumptions, but base the analysis on a diffusion model.

Mathematical Theory and Algorithms
The Ewens-Watterson estimator θw [18,19] or its varieties are sufficient statistics for the analysis of site frequency spectra assuming the infinite sites model, equilibrium and unlinked sites.With real datasets, however, changes in demography or mutation parameters usually invalidate the equilibrium assumption.Moreover, the approach to equilibrium is dominated by the scaled mutation rate θ.Since θ is often on the order of 10 −2 per unit of diffusion time, which is scaled in N generations, it takes on the order of 100 • N generations to reach equilibrium.This is on the order of 10 7 − 10 9 generations.Even with the short-lived fruit flies, equilibrium is thus usually not reached before a change in population demography, the selection regime or the mutation bias.For probabilistic analysis of datasets that have not yet reached equilibrium, calculation of transition probabilities or densities is necessary.This is also necessary for joint site frequency spectra, where samples are drawn from a single population at two different time points or two closely-related populations at a single time point, which we will present in this article.
Consider a population of haploid population size N(t), where t is time.The dynamics are governed by only two population genetic forces: mutation and drift.Generally, the diffusion limit, i.e., N → ∞, is considered such that, at each time point only two quantities matter: the scaled mutation rate θ(t) = µ(t)N(t) and the mutation bias α(t) = (1 − β(t)) = µ 1 (t)/(µ 0 (t) + µ 1 (t)).Let x(t) denote the proportion of the first allelic type, which in our case may be identified with the proportion of C/G at this site in the population at time t.Assume now that the parameters N(t), µ(t) and α(t) are piecewise constant and consider only a single such epoch of constant parameters.Usually, the following forward operator is obtained [3,30]: The corresponding forward diffusion or Kolmogorov equation is: where φ(x, t) is the transition density of the allelic proportion x at any time t.To solve this equation, Song and Steinrücken [7] employ a series expansion with the modified Jacobi polynomials: where P (a,b) i (z) are the classical Jacobi polynomials [31].Note that Song and Steinrücken [7] primarily analyze the backward diffusion equation (but also use the forward diffusion equation in the section: "Empirical Transition Densities and Stationary Distributions").However, the relationship between the adjoint backward and forward diffusion equations is such that adaption of the theory concerning the backward equation to the forward equation is minimal (compare [9]).

The Boundary-Mutation Model
Further in the text, we will follow Vogl and Bergman [4] and model mutations as only affecting the boundaries.Then, φ(x, t) must be interpreted as a generalized probability measure that integrates to one over the unit interval, but may contain point masses at the boundaries (compare [32]).Within the polymorphic region, 1/N ≤ x ≤ (N − 1)/N, the dynamics are purely governed by drift, such that the diffusion generator is: and the corresponding Kolmogorov forward (or forward diffusion) equation is: Mutations are assumed to arise at the boundaries and correspond to a transition from x = 0 to x = 1/N, for a mutation from A/T to C/G, or from x = 1 to x = (N − 1)/N, for a mutation from C/G to A/T.
The Wright-Fisher model is most familiar to population geneticists.With this model, the transition between subsequent generations due to drift is modeled via binomial sampling, such that transitions between distant states are possible.The slightly less familiar Moran model only allows transitions between neighboring states, which simplifies the math.This simplification pertains also to the boundaries [4].With the boundary-mutation model, mutations are assumed to arise only at the boundaries; a transition from x = 0 to x = 1/N corresponds to a mutation from a monomorphic state with only A/T to a polymorphic state with a single C/G in the population; conversely, a transition from x = 1 to x = (N − 1)/N corresponds to a mutation from C/G to A/T.The reverse transitions from the polymorphic region to the boundaries at x = 0 and x = 1 are caused by drift.In particular, the flow from x = 1/N towards zero is proportional to drift times the probability mass at x = 1/N, and similarly at the other boundary.
With a change from the Moran to the diffusion model, the formulas for the flow towards the boundaries due to drift are: where the sign of the flow represents the direction.Conversely, the mutational flow from the boundaries to the interior is given by:

Modified Gegenbauer Polynomials
We will first analyze the situation without mutations, i.e., θ = 0.With pure drift, the transition density φ(x, t) can be expanded into a series of Gegenbauer polynomials (e.g., [5,[7][8][9]33]).Define: where the G i (x) are the modified Gegenbauer polynomials of Song and Steinrücken [7], and the C (α) i (z) correspond to the classical ultraspherical or Gegenbauer polynomials with α = 3/2 ([31], Chapter 22), also used by Kimura [5] and Tran et al. [8].The forward and backward diffusion generators are adjoint, such that the modified Gegenbauer polynomials from Song and Steinrücken [7] can also be used to solve the forward diffusion equation.Multiplication of the weight function x −1 (1 − x) −1 and G i (x) in ( 14) transforms a solution of the backward equation into that of the forward equation (compare [9]).
The first two polynomials are U 2 (x) = −1 and U 3 (x) = (2 − 4x); the recurrence relation to calculate all other polynomials is [7]: The U i (x) solve the differential equation: with: The U i (x) are orthogonal with the weight function: and the proportionality constant: A function f (x) defined within ]0, 1[ can be represented by an expansion of the U i (x).The coefficients c i can be calculated using: , where τ i (t) is a function pertaining to the temporal part of the transition density, into the diffusion Equation ( 11) leads to: For each i, we have: which can be rearranged to: Observing Equation ( 16), we obtain the eigenvalue equations: As stated above, the U i (x) solve the spatial differential Equation in (24) for each i, while τ i (t) = c i e −λ i t solves the temporally homogeneous, linear differential Equation in (24).The c i depend on the starting conditions and can be obtained from Formula (20).Since any function in ]0, solves the diffusion Equation (11) for any starting condition.
Integrating ( 16) from zero to one for the symmetric eigenvectors with even i, we obtain: Note that, for odd i, we have 26) is therefore trivially true for odd i.
Following Kimura [5], we substitute φ(x, t) = ∑ ∞ i=2 τ i (t)U i (x) into the differential Equation ( 11) and integrate from zero to one observing Equation ( 26): As in the temporal part of Formula ( 24), substituting τ i (t) = c i e −λ i t solves the system of differential equations: For even i, the flow out at the boundaries zero and one per unit time is symmetric and corresponds to what is present just inside the boundaries, i.e., U i (0) and U i (1).The rate of loss is the eigenvalue λ i .Now, multiply φ(x, t) with x and integrate again from zero to one: Thus, xφ(x, t) will eventually drift to Boundary 1, and conversely (1 − x)φ(x, t) to Boundary 0. This is expected, since drift is symmetric, such that the probability of eventual fixation is equal to the proportion x, for the boundary at one, and (1 − x), for the boundary at zero (see also [30], Chapter 4.3).As in the temporal part of Formula ( 24), substituting τ i (t) = c i e −λ i t solves the system of differential Equations (28).
The solution of the diagonal system of differential Equations (28) thus fulfills Equation ( 11) for all t, as well as the boundary conditions at both zero and one (12).
Note that: The above results suggest augmenting the U i (x) with the boundary terms: Furthermore, the result (30) shows that the boundary terms derived above correspond to those defined by Tran et al. [8].
We therefore define the following set of orthogonal polynomials augmented with boundary terms: where δ(x) is the Dirac delta function (compare Tran et al. [8], who arrive at the corresponding set of augmented eigenfunctions).With this definition of eigenfunctions, the probability mass that leaves the polymorphic region for each i at 1/N and (N − 1)/N is added to the monomorphic boundaries at x = 0 and x = 1.The integral over the closed interval between zero and one thus remains unity for all times.In Appendix A.1, we show that these augmented orthogonal polynomials can also be obtained by a Taylor series expansion of the general eigensystem solving the diffusion Equation ( 11) using the modified Jacobi polynomials R (α,θ) (x).

Starting and Prior Distributions
We base the following description on the theory of hierarchical Bayesian models and the empirical Bayes method [16] that we also employed earlier [3,4].In a frequentist context, one would rather use the context of marginal likelihoods.
Traditionally, a Dirac delta function at a certain position p has been used as a starting condition [33].With a site frequency spectrum, however, the joint density of the population allelic proportion x and the observed allelic count y in a sample of size M 0 must be used as starting density.Most naturally, the conditional distribution of the data y given the allelic proportion x is modeled as a binomial: With the pure drift model, we are generally interested in the polymorphic region, since probability mass at a boundary remains there due to the absence of mutations.
For "integrating out" the population allelic proportions x, a prior distribution for x must be assumed.With small-scaled mutation rates, an "improper prior" proportional to x −1 (1 − x) −1 within 1/N and (N − 1)/N is appropriate, as this is proportional to the equilibrium distribution (see also Subsections 2.3.4 and 2.3.5 below).Note that this prior corresponds to the inverse of the weight function.Thus, the inner product (20) to calculate the initial coefficients becomes: where the limit notation indicates that the integration includes only the polymorphic region, i.e., no point masses at the boundaries.We can thus specify a general algorithm that also includes the boundaries.

Algorithm 1: Allelic Proportions x with Pure Drift for All Times t, Conditional on Initial Values
• A measure f (x) between zero and one, which may have point masses m 0 and m 1 at Boundaries 0 and 1, is represented by an expansion of the H i (x) up to i = n.The coefficients c i are calculated according to Equation (34).The expansion of g(x) times the prior, up to the order n, is then: • The solution of Equation ( 28) for all t conditional on the initial distribution can be represented by a series expansion up to n: Note that the H i (x) contain the boundary terms that balance the probability masses at zero and one.This is obvious if the initial probability measure f (x) does not contain point masses at the boundaries, i.e., if m 0 = m 1 = 0.

Modified Gegenbauer Polynomials and the Boundary-Mutation Model
In this subsection, we will use the expansion in orthogonal polynomials with boundary terms to model both mutation and drift.

Mutation and Drift: Slowly Evolving Dynamics
For the slowly evolving dynamics at the boundaries, we augment the system with two eigenfunctions.Starting from the system for general θ, which can be expanded in a series of modified Jacobi polynomials (see Equation (9) in Song and Steinrücken [7]), we note that the eigenfunction for i = 0 does not change with time, i.e., λ 0 = 0.The eigenfunction for i = 1 has the eigenvalue λ 1 = θ (compare: [7]) and reflects the slow change in allele frequencies through mutation.Expressing the Jacobi polynomials as beta distributions and taking the limit θ → 0, such that only probability masses at the boundaries remain, the first two eigenvectors become: (see Appendix A.1). Obviously, these two eigenfunctions are unaffected by the dynamics in the polymorphic region inside [1/N, (N − 1)/N].These two eigenvectors have no probability mass within the polymorphic region, such that only eigenvectors with i ≥ 2 have nonzero probability masses in the polymorphic region.Hence, the model separates two spatial regions: the monomorphic boundaries and the polymorphic interior.As λ 1 = θ 1 while the λ i > 1 for all eigenvalues with i > 2, two different temporal regions can also be distinguished, in addition to the two spatial regions.Thus, evolution is modeled as a two-time process, where the slow dynamics captured by the eigenfunctions i = 0 and i = 1 are evolving independently from the polymorphic region, while the fast dynamics in the polymorphic region are in quasi-equilibrium with the slow dynamics at the boundaries.Generally, we are thus looking at a system of differential equations that for the slowly evolving part of the system is: Initially at t = 0, the boundary values are b 0 (0) = The solution over time is τ 1 (t) = (b 1 (t) − α)e −θt , such that the boundary values will slowly, at a rate of θ, approach the equilibrium values: If f (x) does not integrate to one, i.e., b 0 (t) + b 1 (t) = 1, modifications are trivial.Note that b 0 (t = 0) and b 1 (t = 0) correspond to the probability mass currently at the boundaries plus the part of the probability mass within the polymorphic region that is expected to be fixed by drift (i.e., without any further mutations) at the respective boundaries.They would only be identical to the probability mass currently at the boundaries if there were no probability mass in the polymorphic region.

Mutation and Drift: Quickly Evolving Dynamics
The slowly evolving part of the system is given in (39).For the quickly evolving part, note that, from Equation ( 13), mutation moves probability mass from the boundary at zero x = 0 to x = 1/N and from x = 1 to x = (N − 1)/N, respectively.We can model this with a Dirac delta function at x = 1/N and x = (N − 1)/N: with b 0 (t) and b 1 (t) as above.Combined with the pure drift diffusion Equation ( 11), we thus obtain the following diffusion equation within the interval between 1/N and (N − 1)/N: Equation ( 41) is an extension of Equation ( 21) to mutations from the boundaries.

Mutation and Drift: Slowly and Quickly Evolving Dynamics
Theorem 1. Starting from a generalized probability measure f (x) within the unit interval (Equation ( 11)), with the boundary Conditions ( 12) and ( 13), and letting N → ∞, the following function provides the general solution for all times of the Kolmogorov forward equation of boundary-mutation drift diffusion: with the previously-defined eigenfunctions (Equations ( 37) and ( 32)); the τ i (t) are given by a system of linear inhomogenous first order differential equations: The starting values, τ i (t = 0) for i ≥ 1, are given by the initial probability masses at the boundaries and by the expansion of the initial density f (x) in the interior into the eigenvectors.
Proof.The slowly evolving part of the system is given in (39).The coefficients for expanding the delta function in (40) are (compare Equation ( 20)) for the boundary at zero: (44) and similarly for the boundary at one.Substituting the Gegenbauer expansion into Equation ( 41), we obtain: For each i, we have: which can be rearranged to: Observing Equation ( 16), we obtain eigenvalue equations corresponding to those in (24): Compared to the case without mutation, the spatial part is unchanged, while the temporal part becomes a system of linear inhomogenous first order differential equations: All other considerations correspond to the case without mutation.
Note that, substituting b 0 (t) and b 1 (t), Equation (49) can also be written as: with constants: Furthermore, note that we assumed that the probability measure f (x) integrates to one over the closed interval between zero and one, i.e., 1 0 f (x) dx = 1.If this is not the case, the constants A i and B i must be multiplied by 1 0 f (x) dx.

Boundary-Mutation-Drift Equilibrium Distribution
In earlier work [4], we show that the equilibrium solution of the boundary-mutation model is the measure: where the interior region is bounded by 1/N and (N − 1)/N and BME stands for boundary-mutation equilibrium.BME(x | α, θ, N) integrates to one over the unit interval, irrespective of N.However, note that for large N, it integrates to more than one inside the interval [1/N, (N − 1)/N], while assuming negative values at the boundaries.In this limit, it therefore must be considered an "improper distribution" [4,34].In Appendix A.2, we show that, with time, Solution (43) converges to the BME (Equation ( 52)).

Prior Distribution
With the BME as prior and a binomial likelihood p(y | x, M 0 ) with 0 ≤ y ≤ M 0 , the coefficients of the joint distribution p(x, y where the limit notation indicates that the integration includes only the polymorphic region.Note that already Ewens used the same limit for inference [18,25].For polymorphic data, i.e., 1 ≤ y ≤ (M 0 − 1), this function is a polynomial and, thus, can be represented accurately as a series of Gegenbauer polynomials as long as n > M 0 .The boundary terms can also be derived easily because the probability of drifting to boundary one corresponds to the current proportion x (and to (1 − x) to the boundary zero), such that: where the limit notation is not used for brevity.For monomorphic y, i.e., y = 0 or y = M 0 , the c i for the probability mass in the interior are also given by Equation (53) with i ≤ n.The corresponding boundary terms are: and analogously for y = 0.
2.3.6.Algorithm 2: Allelic Proportions x with Boundary-Mutations and Drift for All Times t, Conditional on Initial Values

•
The interior of a joint distribution p(x, y | M 0 , α, θ) is represented as a Gegenbauer series (53).

•
The slowly evolving part of the system consists of the dynamics at the boundaries.Set the boundary terms at t = 0 to b 0 (t = 0) and b 1 (t = 0) as in Equations ( 54) and ( 55).With time, the boundary terms b 0 (t) and b 1 (t) then change slowly at the rate of θ according to the exponential function in Equation (39).
The solution of Equation ( 41) for all t conditional on f (x) can be represented by a series expansion up to n: with: and constants analogous to (51): Equation ( 57) is a solution to (50), as can be shown by substitution.

Applications
In this section, we illustrate the calculation of the marginal likelihoods of a mock dataset and an empirical fruit fly dataset using the expansion of Gegenbauer polynomials up to degree n = 52.

A Joint Site Frequency Spectrum under Pure Drift
With the pure drift model, the time between two time points t 0 = 0 and t 1 > 0 is assumed to be so small that newly arising mutations can be neglected.Moreover, sites where the samples from both time points are monomorphic for the same allele are usually ignored with such data analysis.For simplicity, assume that the sample size of the initial sample at time t 0 is M 0 = 3 and that of time t 1 also M 1 = 3.Four different cases need to be considered: (i) a site is polymorphic in both samples; (ii) a site is polymorphic in the first sample and monomorphic in the second; (iii) a site is monomorphic in the first sample and polymorphic in the second; and (iv) a site is monomorphic in the first sample for one allelic type and polymorphic in the second sample for the other allelic type.For Cases (i) and (ii), assume, e.g., a sample with two alleles of a certain type (zero or one), i.e., y 0 = 2. Thus, the joint density of the sample y 0 and the allelic proportions x become: This is represented by a sum of the modified Gegenbauer polynomials of degree up to three with c 3 = − 3 4 and c 2 = − 3 2 .At time t 1 , before considering the second sample, the probability mass of the joint interior distribution has diminished: while it has grown at the boundaries: and: For Case (i), the likelihood of a second sample of size M 1 = 3 with y 1 = 1 alleles of the first type is binomial: 3x(1 − x) 2 .The joint distribution consists only of an interior part, from which x can be integrated out to obtain the marginal likelihood: For Case (ii), the likelihood of a second sample of, e.g., size M 1 = 3 with y 1 = 3 alleles of the first type, is binomial: x 3 .The joint distribution consists of an interior part, from which x can be integrated out: Summing the interior and the boundary part, the marginal likelihood of the two samples is obtained: For Cases (i) and (ii), thus, a finite expansion of the Gegenbauer polynomials was sufficient.
For Cases (iii) and (iv), the product of the likelihood and prior at time t = 0 results in an infinite series of Gegenbauer polynomials.Note that the monomorphic term at t = 0 does not need to be included since, without new mutation, a polymorphism or a monomorphic alternative allele at t = t 1 implies that the population allelic proportion x must have been in the polymorphic region already at t = 0. Take, e.g., M 0 = 3 and y 0 = 3, which results in: While Equation ( 20) can be used in this case, it results in a rather "wiggly" function of x (Figure 1).With the expansion (1 x i a much smoother polynomial of degree n will be obtained, which can be expressed without loss as a sum of Gegenbauer polynomials up to that degree.Even with only moderate n, the two expansions will produce nearly indistinguishable likelihoods.In either case, the algebra cannot easily be reproduced here, but likelihoods corresponding to transitions between all possible allelic states with arbitrary (but moderate) M 0 , M 1 and t 1 can be easily calculated using computers.We implemented such an algorithm and tested inference using simulated datasets of L = 10 5 and M 0 = M 1 = 3, while varying the parameter t 1 .As expected, the modes of the likelihood curves closely coincide with the true values of t 1 (Figure 2).

Application to Drosophila Population Data
The joint site frequency spectrum of putatively neutral short intronic sites (positions 8-30 of introns less than 66 bp in length [11]) was used for inference (Table 1).The interior part of short introns is unlikely to contain selectively-constrained sequences.Short introns also show the highest intra-and inter-species diversity of any sequence class within the Drosophila genome [11].Furthermore, short introns are the most abundant intron type within the Drosophila genome.It is therefore assumed that mutation-drift dynamics shape the nucleotide composition of short intronic sites, and since polymorphism within a single intron is rare and linkage disequilibria decrease quickly, free recombination among sites may be assumed.Sites were classified as binary by lumping A and T nucleotides together as Allele 0 and C and G nucleotides as Allele 1.The reference sequence from D. sechellia (Release 1.0; [15]) was taken as ancestral, i.e., the initial sample of size M 0 = 1 at time t 0 = 0.While the states of closely related species are routinely taken as ancestral (e.g., [12,13,[22][23][24]), this practice violates the model assumptions that data are from a single populations and two time points.A D. melanogaster Malawian population sample [14] of size M 1 = 6 was considered as a sample from a later time point t 1 .The sequences were annotated by aligning the D. sechellia reference and the D. melanogaster population sample to the D. melanogaster reference sequence (Release 5.9; [15]).Interest is centered on inferring the time point t 1 , i.e., the time in N generations since the split of the two Drosophila species and the scaled mutation rate θ 1 , corresponding to the current mutation rate of the D. melanogaster population sample.Firstly, a prior distribution of allelic counts needed to be determined by setting initial parameters, α 0 and θ 0 .The ancestral mutation bias α 0 was inferred from the D. sechellia data to be α0 = 0.35 and is assumed to not change, i.e., α 1 = α 0 .We estimate the ancestral scaled mutation rate to be about θ0 = 0.079 [4] from D. simulans data, as this closely related species most likely reflects the ancestral state of both D. sechellia and D. melanogaster species due to its relatively constant (over the evolutionary times considered) and large effective population size [35].
We implemented a direct grid search algorithm, with the likelihood calculated as in Subsection 2.3.6, to obtain maximum likelihood estimates of parameters t 1 and θ 1 (Figure 3).The maximum likelihood estimates t1 = 4.5 and θ1 = 0.03 correspond closely to previously published estimates [36,37].Likelihood surface with respect to parameters θ 1 and t 1 estimated from the joint site frequency spectrum in Table 1.The point on the likelihood surface corresponds to ML estimates: θ1 = 0.03 and t1 = 4.5.

Discussion
Most population genetic models, e.g., the Wright-Fisher and Moran models and the corresponding (forward) Wright-Fisher or Moran diffusion models, do not restrict the number of mutations segregating in a population at a given site and time.By contrast, most population genetic methods that allow for analytical maximum likelihood estimators assume that variation at a specific site originates from only a single mutation, such that no more than two alleles can be present at any given site and time.This assumption is made for the Ewens-Watterson estimators of the scaled mutation rate [18,19] and the Poisson random field (PRF) models [26][27][28][29].These approaches thus implicitly assume low scaled mutation rates.Usually, it is also assumed that the ancestral state of a site is known.Both of these assumptions are made explicit with the infinite sites model.When introducing the infinite sites model, Kimura [20,21] assumed selection against the mutant allelic variant, such that only the favored ancestral allele would exist in the monomorphic state, while a mutant allele is eventually lost from the population by the joint action of adverse selection and drift.Kimura based the mathematics on a model of irreversible mutation [2].
In later developments of the infinite sites model [18,19], the assumption of selection was dropped.Without selection, sites may become monomorphic for all possible allelic states (usually two states are assumed, such that the model is bi-allelic).In practical applications, the ancestral state has to be inferred via "outgroup" information.For this inference, which is also called "polarization", data from an extant closely related population or species, i.e., an "outgroup", are used (e.g., [12,13,[22][23][24]). Polarization can only be successful if the outgroup is related closely enough to the focal population, such that double mutations are improbable, but distantly enough, such that allelic polymorphism is not shared with the focal population.These biological assumptions are rather restrictive.
With unrestricted mutations, i.e., with the assumptions of the Wright-Fisher diffusion or the Moran models, allelic proportions in a population converge to a beta equilibrium distribution.For a sample of moderate size, a beta-binomial equilibrium distribution is obtained.It seems that RoyChoudhury and Wakeley [17] were the first to expand the beta-binomial equilibrium distribution into a Taylor series up to first order in the scaled mutation rate θ to derive the sample distribution of a bi-allelic locus.With this approach, polarization is not necessary.Rather, DNA sequence data can be made binary by grouping together sites with the bases adenine (A) and thymine (T) to A/T and cytosine (C) and guanine (G) to C/G.In spite of this difference of the original infinite sites model, the sample distribution of polymorphic sites is a variant of the infinite sites model [18], and the maximum likelihood estimator of the scaled mutation rate derived from this distribution is a variant of the Ewens-Watterson estimator.Based on the Moran model, Vogl and Clemente [10] arrived at the same distribution of polymorphic sites if mutations only occur at the monomorphic boundaries.A Moran model with mutations only at the boundaries approximates a Moran model with mutations at any allelic state sufficiently well if the scaled mutation rate θ is below 0.1 [10].Obviously, the critical assumption that allows for analytical derivation of the maximum likelihood estimator is not that the ancestral state is known, but rather that only a single mutation segregates.Indeed, without selection equivalence of the estimators derived from the Taylor series expansion in θ of the general mutation model, estimators assuming the boundary mutation model can be shown [4].In the latter case, the ML estimators can be considered exact.Note that mutation bias has generally been ignored when analyzing DNA sequence data.In contrast to the usual infinite sites model, which assumes that ancestral and derived states can be inferred, a mutation bias α creating an imbalance between A/T and C/G sites can be modeled naturally with our approach.As deviations in the A/T:C/G ratio from 1:1 are generally observed and as inference of ancestral states is difficult, such theory is practically useful.Vogl [3] derived a maximum likelihood estimator for not only the scaled mutation rate θ, but also the mutation bias α.
Computation-intensive, probabilistic methods for estimating population genetic parameters, such as the ones implemented in the LAMARC software package [38], are suitable for the analysis of populations governed by non-equilibrium dynamics.Our method also considers non-equilibrium population dynamics, such as changes in the effective size of the population, the mutation rate and mutation bias between different time points, but poses less computational burden.This is achieved by extending the boundary mutation model [4] to joint site frequency spectra and using modified Gegenbauer polynomials to solve the transition density φ(x, t) of the allelic proportion x at any time t.If these model assumptions hold, the method also provides maximum likelihood estimates.
Gutenkunst et al. [39] estimate migration rates, selection coefficients and split times from joint site frequency spectra in their program ∂a∂i by approximating the diffusion process using a numerical grid of population proportions x.Influx of mutations is modeled by "injecting φ density at low frequency in each population (at a rate proportional to the total mutation flux θ)".This presumably corresponds to the boundary mutation model, but seems to assume influx in equal proportions from the boundaries.Furthermore, this method is directed towards evaluating population sizes, growth rates and migration rates from joint site frequency spectra, rather than scaled mutation rates.The difference in mutation rates of the different allelic classes, i.e., mutation bias, is not taken into account.
With mutations arising only at the boundaries, evolution of the allelic proportion x separates into a slowly evolving part, where the proportions of alleles at the boundaries change at a rate of θ, while the interior dynamics adjust relatively quickly to the slowly evolving boundary proportions.This process leads to a system of inhomogeneous linear differential equations.With this theory, changes in the parameters, i.e., the scaled mutation rate θ or the mutation bias α, do not necessitate a recalculation of the eigensystem, unlike the approach described in Song and Steinrücken [7].Thus, our approach speeds up computation, such that more complicated population genetic scenarios may be modeled, e.g., growing or shrinking population sizes that are commonly observed in nature.Since the equilibrium is reached at the rate of the scaled mutation rate θ, natural populations are rarely in equilibrium, and non-equilibrium dynamics need to be considered when inferring population genetic parameters.
We note that Evans et al. [40] also arrived at a system of inhomogeneous linear differential equations assuming the infinite sites model as defined above.Furthermore, they assume directional selection.Their analysis is based on iteration of moments, rather than on orthogonal polynomials, which leads to a recursive inhomogeneous system of differential equations.Nevertheless, the similarities between their and our approaches are readily apparent.In a follow-up study, Zivkovic et al. [41] apply their algorithm to human and fruit fly data.
So far, we have only discussed approaches based on the Kolmogorov (or diffusion) forward or backward equations.Another approach successfully employed for analyzing joint site frequency spectra is based on Kingman's coalescence [42,43].With the coalescence, the starting point is the sample.Inference proceeds by summing and integrating over the sample's genealogical history.It is also possible to derive the beta-binomial equilibrium distribution with the coalescent.In more complicated cases, one or the other of the two approaches may be more suited.Generally, the coalescence seems preferable if the sample distribution can be derived fairly easily compared to the population distribution.This is the case for the infinite sites model without recombination, where only the coalescence approach has been used to derive joint site frequency spectra, as far as we are aware.Recently, Chen [44,45] and Kamm et al. [46] improved the computational efficiency of the coalescence approach to joint site frequency spectra.We note that changes in the mutation bias, as incorporated into our approach, have not yet been incorporated into the coalescence approach.
Our algorithm allows for inference of the mutation bias α, the scaled mutation rate θ and the time separating the samples of joint site frequency spectra.A fruit fly dataset consisting of a joint site frequency spectrum of two Drosophila species is analyzed to illustrate the method.

Conclusions
In this article, we present a method for inferring population genetic parameters from joint site frequency spectra, i.e., from allelic frequencies at two (or more) time points.The parameters inferred are the time between the samples t and the mutation rate θ, scaled by the (effective) population size N.In contrast to earlier approaches [7,47,48], we assume a small mutation rate θ, such that only a mutation of single origin may segregate per site at any given time point in a sample.This assumption simplifies the mathematical treatment because, unlike with earlier approaches, changes in the parameters do not lead to a change in the eigensystem.Rather, the spatial expansion in orthogonal polynomials, specifically in Gegenbauer polynomials, remains unaffected.Compared to the case without mutations, i.e., to the pure drift model, the temporal part changes from a system of homogeneous to one of inhomogenous linear differential equations.In effect, the system separates into the slowly evolving boundaries, which change at a rate of the scaled mutation rate θ, and a fast evolving interior polymorphic region, which changes at the rate of drift, conditional on the dynamics at the boundaries.We show that the eigenvectors can be derived from the general Jacobi polynomials by a Taylor series expansion.Furthermore, the equilibrium distribution corresponds to the Taylor series expansion of the equilibrium distribution for general θ.Due to the underlying boundary mutation model, parameter estimation is computationally efficient, and the method can be expanded to accommodate analysis of more complex population genetic scenarios.

Figure 1 . 1 Figure 2 .
Figure 1.Approximate densities using the Gegenbauer polynomial expansion with terms up to n = 52.(A) Approximation to point masses at both boundaries, but without mass in the interior region; (B) approximation to the equilibrium improper density overlying the function x −1 (1 − x) −1 ; (C) approximation to the joint posterior distribution for a sample with y = 1, M = 1 overlying the joint distribution 2 x 1−1 (1 − x) 1−1 ; (D) approximation to the joint posterior distribution for a sample with y = 3, M = 6 overlying the joint distribution ( 6 3 ) x 3−1 (1 − x) 3−1 .

Figure 3 .
Figure3.Likelihood surface with respect to parameters θ 1 and t 1 estimated from the joint site frequency spectrum in Table1.The point on the likelihood surface corresponds to ML estimates: θ1 = 0.03 and t1 = 4.5.

Table 1 .
A joint site frequency spectrum of Drosophila short intronic sites with M 0 = 1 and M 1 = 6.The left-most column and the upper row of the table represent the possible allelic states of sites for the sample M 0 and M 1 , respectively.The interior entries of the table are the counts of sites with a specific allelic state with respect to Allele 1.