A General Framework for Neutrality Tests Based on the Site Frequency Spectrum

One of the main necessities for population geneticists is the availability of sensitive statistical tools that make it possible to accept or reject the standard Wright–Fisher model of neutral evolution. A number of statistical tests have been developed to detect specific deviations from the null frequency spectrum in different directions (e.g., Tajima's D, Fu and Li's F and D tests, Fay and Wu's H). A general framework exists to generate all neutrality tests that are linear functions of the frequency spectrum. Within this framework, it is possible to develop a family of optimal tests with almost maximum power against a specific alternative evolutionary scenario. In this paper we provide a thorough discussion of the structure and properties of linear and nonlinear neutrality tests. First, we present the general framework for linear tests and emphasise the importance of the property of scalability with the sample size (that is, the interpretation of the tests should not depend on the sample size), which, if missing, can lead to errors in interpreting the data. After summarising the motivation and structure of linear optimal tests, we present a more general framework for the optimisation of linear tests, leading to a new family of tunable neutrality tests. In a further generalisation, we extend the framework to nonlinear neutrality tests and we derive nonlinear optimal tests for polynomials of any degree in the frequency spectrum.


Introduction
Since the development of molecular genetics techniques made it possible to obtain nucleotide sequences for the study of population genetics [1], a number of neutrality tests have been developed with the objective of facilitating the interpretation of an increasing volume of molecular data. Statistical tests for neutrality have been generated by exploiting different properties of the stationary neutral model. Examples are the HKA test [2], which takes advantage of the polymorphism/divergence relationship across independent loci in a multilocus framework, and the Lewontin–Krakauer test [3], which looks for an unexpected level of population differentiation at a locus in relation to other loci. Another popular family comprises the tests related to linkage disequilibrium, such as the one developed by [4], which detects long haplotypes at unusually elevated frequencies in candidate regions.
An important family of these tests, often used as summary statistics, is built on the frequency spectrum of biallelic polymorphisms. This family includes the well known tests by Tajima [5], Fu and Li [6], and Fay and Wu [7]. If an outgroup is available, these tests are based on the unfolded spectrum ξ_i, that is, the number of segregating sites with a derived allele frequency of i in a sample of (haploid) size n. Without an outgroup, it is not possible to distinguish derived and ancestral alleles, and the only available data correspond to the folded spectrum η_i, that is, the number of segregating sites with a minor allele frequency of i. The quantities ξ_i and η_i are all positive, and the range of allele frequencies is 1 ≤ i ≤ n − 1 for the unfolded spectrum and 1 ≤ i ≤ ⌊n/2⌋ for the folded spectrum. The average spectra for the standard Wright–Fisher neutral model are given by:

E(ξ_i) = θL/i,   E(η_i) = θL (1/i + 1/(n − i)) / (1 + δ_{i,n−i}),   (1)

where L is the length of the sequence and θ = 2pµN_e, where µ is the mutation rate per base, p is the ploidy and N_e is the effective population size. Note that we define θ as the rescaled mutation rate per base and not per sequence. Apart from this, we follow the notation of [8,9]. Note that the spectra are proportional to θ.
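As a quick numerical illustration, the neutral expectations E(ξ_i) = θL/i and the folded counterpart can be computed directly (a sketch in Python; the function names are ours):

```python
def expected_unfolded(theta, L, n):
    # E[xi_i] = theta * L / i for i = 1, ..., n-1 (standard neutral model)
    return [theta * L / i for i in range(1, n)]

def expected_folded(theta, L, n):
    # E[eta_i] = theta * L * (1/i + 1/(n-i)) / (1 + delta_{i,n-i}),
    # for i = 1, ..., floor(n/2); the Kronecker delta halves the middle
    # frequency class when n is even (i and n-i then coincide)
    spectrum = []
    for i in range(1, n // 2 + 1):
        delta = 1 if i == n - i else 0
        spectrum.append(theta * L * (1.0 / i + 1.0 / (n - i)) / (1 + delta))
    return spectrum
```

Since folding only merges frequency classes, both spectra share the same total expected number of segregating sites, which is a useful sanity check.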
A general framework for linear tests was presented in [8]. The general tests proposed there were of the form:

T_Ω = (∑_{i=1}^{n−1} Ω_i i ξ_i) / √(Var(∑_{i=1}^{n−1} Ω_i i ξ_i)),

which are centred (i.e., they have a null expectation value) if the weights Ω_i, Ω*_i satisfy the conditions ∑_{i=1}^{n−1} Ω_i = 0 and ∑_{i=1}^{⌊n/2⌋} Ω*_i = 0. This framework allows the construction of many new neutrality tests and has been used to obtain optimal tests for specific alternative scenarios [10]. However, the original framework does not take into account the dependence of the tests on the sample size, as emphasised in [10]. It is important to choose this dependence in order to obtain results that are as independent as possible of sample size. Moreover, the framework presented in [8] covers just a large subfamily of neutrality tests based on the frequency spectrum, that is, the class of tests that are linear functions of the spectrum. This subfamily contains almost all the tests that can be found in the literature, with the exception of the G_ξ, G_η tests of Fu [11], which are second order polynomials in the spectrum whose form is related to Hotelling statistics. Since these G_ξ, G_η tests were shown to be quite effective in some scenarios, it would be interesting to build a general framework for these quadratic (and more generally nonlinear) tests. New optimal tests based on extensions of the site frequency spectrum [12,13] were recently applied successfully to the detection of selection pressure on human chromosomal inversions [14].
In this work we provide a detailed study of the properties of the whole family of tests based on the site frequency spectrum, i.e., we focus on the structure and the properties of neutrality tests that consider the frequency of the variants per position. Technical details and proofs can be found in Appendices A and B.
We begin with the discussion of the most interesting case, i.e., linear tests. Achaz [8] developed a general framework for constructing linear tests by comparing two different estimators of variability, which are based on linear combinations of the frequency spectrum, each one with different weights. We summarise his approach in Section 2.1.
In Section 2.2, we discuss the importance of scaling the weights with the aim of avoiding a dependence on the sample size. This allows us to interpret the information of the test in the same way for any sample size analysed. We provide a thorough analysis of a simple proposal for the scaling of the tests with the sample size. Different scaling strategies (including alternatives to scaling) are analysed and evaluated, demonstrating the importance, but also the suitability, of different weighting methods depending on the nature of the statistic.
In Section 2.3, we present and expand the theory of optimal tests, i.e., tests that have maximum power to detect an alternative scenario versus a null scenario, introduced in [10]. We show that generic linear tests cannot detect arbitrary deviations from the neutral spectrum, and why tests must be optimised with respect to a specific alternative scenario. A geometrical interpretation of neutrality tests is developed, showing how these tests depend on the differences between the null and the alternative scenario and clarifying their theoretical basis.
In Section 2.4, we extend this approach to nonlinear tests that consider different moments of the frequency spectrum combined in generic polynomials or in power series. In contrast to linear tests, nonlinear tests depend on the level of variability θ and thus need an accurate estimation of θ to have good statistical properties (e.g., unbiasedness, high statistical power). These tests are classified as strongly or weakly centred (centred meaning that they have a null expectation value). Strongly centred tests are centred for any estimate of θ, even if this estimate is far from the actual value, and thus are more robust to deviations of the θ estimates than weakly centred tests (simulations are shown in the Results section (Section 3)). Weakly centred tests, instead, are only centred if the θ estimate is equal to the actual value, but this feature also makes them more powerful if the inference of θ is reliable.
In Section 3.1 of the Results, we explore some consequences of the framework outlined in the Methods: it is possible to optimise neutrality tests following a general maximisation approach that depends on the proxy used for the power to reject the null model in the alternative scenario, which in turn depends on the mean spectra, the covariance matrices, and the critical p-value used. Moreover, we show that under some conditions, the power maximisation can always be achieved by tuning a parameter in a new family of linear "tunable" optimal tests developed here, which depend only on the mean spectra of the alternative model.
In Section 3.2, we present the results for the power to detect deviations from the neutral spectrum in coalescent simulations. We show that linear optimal tests have more power than usual neutrality tests such as Tajima's D, that nonlinear optimal tests are more powerful than linear ones, and that weakly centred tests are more powerful than strongly centred ones if θ is known.
In summary, our work deepens the understanding of neutrality tests, encompassing weight normalisation, optimal test formulation, and both linear and nonlinear paradigms. The outcome not only enriches the theoretical foundations but also provides novel methodologies for increasing the power and effectiveness of neutrality tests across diverse scenarios.

Linear Neutrality Tests

General Framework
As discussed by Achaz [8], the general form for linear tests based on the unfolded spectrum can be written as:

T_Ω = (∑_{i=1}^{n−1} Ω_i i ξ_i) / √(Var(∑_{i=1}^{n−1} Ω_i i ξ_i)),   (3)

where Ω_i is a set of weights satisfying the condition:

∑_{i=1}^{n−1} Ω_i = 0.

This is the most general form if we require that the test be centred and have variance 1, that is, E(T_Ω) = 0 and Var(T_Ω) = 1. The condition of centredness can be obtained by substituting the spectrum with its average in the standard neutral model, given by Equation (1).
Alternatively, it is sufficient to choose any pair of unbiased estimators of θ based on the unfolded spectrum:

θ̂_ω = (1/L) ∑_{i=1}^{n−1} i ξ_i ω_i,   θ̂_{ω′} = (1/L) ∑_{i=1}^{n−1} i ξ_i ω′_i,

with weights ω_i, ω′_i that obey the conditions:

∑_{i=1}^{n−1} ω_i = 1,   ∑_{i=1}^{n−1} ω′_i = 1,

to obtain a new test for neutrality:

T = (θ̂_ω − θ̂_{ω′}) / √(Var(θ̂_ω − θ̂_{ω′})),

which is equivalent to the definition (3) with Ω_i = ω_i − ω′_i. Therefore, a test T_Ω is defined by real vectors Ω or ω, ω′ satisfying the above normalisation conditions.

Scaling of Weights with Sample Size
In this section we would like to remark that there are conditions that have to be imposed on the weights Ω_i or ω_i, ω′_i to ensure that these tests are consistent and meaningful in their interpretation. In fact, the values (and even the number!) of these weights depend explicitly on the sample size n. Since every conceivable test should be applicable to samples of different sizes, its definition involves a whole family of weights with n = 2, 3, ..., and to define a test it is necessary to specify how these weights scale with n.
As an example of the weird effects of some choices of scaling, we consider the test for admixture of [8]. The weights of this test are binomial, ω_i ∝ (n choose i)/2^n, and ω′_i = 1/(n − 1). Suppose that the population under study shows an excess of alleles of frequency f between 0.3 and 0.4. The average weight of these frequencies, rescaled by the sample size, is 0.5 for n = 10, but it reduces to −0.75 for n = 100 and to −1.0 for n = 1000. These weights are largely different, even in sign; therefore, a strong excess of alleles in this range of frequency would show itself as either a positive or a negative value of this test, depending on the sample size! The reason can be understood by noticing that for large n, the binomial can be approximated by a Gaussian function of the allele frequency f = i/n centred at f = 1/2 and with variance 1/4n. Therefore this weight function has a strong dependence on n when considered as a function of f and n. The changes of this weight function with sample size are apparent in the plot of Figure 1, which shows the actual function (rescaled by sample size) for n = 10, 100, 1000.
In this example it is apparent that the interpretation of the results of this test depends on n. This means that the calibration of the test should be different for each possible sample size.
The weight-consistency requirement that we propose is that the result of the test should be almost independent of sample size. This requirement is equivalent to a condition on the scaling of the weights Ω_i^{(n)} with n. Our proposal for a reasonable requirement on this scaling is the following: the relative weight of different frequencies in the population should remain approximately constant while varying the size of the sample. This condition ensures that, at least for sufficiently large n, the average values of the test on samples of different sizes from the same population should be approximately independent of sample size, i.e., that the test should be weight-consistent. Note that this requirement has to do with the interpretation of the test, rather than with the usual definition of statistical consistency.
Tests that are not weight-consistent could be statistically consistent, but the interpretation of results in the left and right tails would not be assured to be independent of sample size. To determine the scaling, we note that in the limit of large n, the frequency spectrum approaches a continuum and we can define the weights as functions Ω(f) or ω(f), ω′(f) with f ∈ (0, 1) and

∫₀¹ df Ω(f) = 0.

Since the ratio of the derived allele count and the sample size, i/n, is an unbiased estimator of the frequency f of the allele in the population (because E(i) = nf), a simple scaling that satisfies the above requirement is:

Ω_i^{(n)} ∝ Ω(i/n),   (8)

as proposed by some of the authors in [10].
In order to have the above approximate scaling while obeying the condition ∑_{i=1}^{n−1} Ω_i = 0, there are two simple weight-consistent forms for the weights:

Ω_i^{(n)} = Ω(i/n) − (1/(n − 1)) ∑_{j=1}^{n−1} Ω(j/n),   (9)

where the last term is a (typically small) correction that enforces the centredness of the test, or:

ω_i^{(n)} = ω(i/n) / ∑_{j=1}^{n−1} ω(j/n),   ω′_i^{(n)} = ω′(i/n) / ∑_{j=1}^{n−1} ω′(j/n),   (10)

where the denominators are normalisation factors. Typically, this second form (10) for the scaling is more consistent in practice and it is implicitly assumed for most of the existing tests. However, the above expressions give similar numerical results for most choices of the functions Ω(f) = ω(f) − ω′(f). In fact, if Ω(f) is a bounded and piecewise-continuous function, the difference between (9) and (10) is of order O(Ω)/n (since it is a factor coming from the discretisation of the frequencies) and it does not have a relevant impact on the results of the test. Therefore, in these cases the two scaling relations (9) and (10) are practically equivalent.
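The two scaling forms can be compared numerically (a sketch; the mean-subtraction and normalisation details reflect our reading of forms (9) and (10), and the function names are ours):

```python
def weights_form_9(Omega, n):
    # Omega_i = Omega(i/n) minus its mean over i, enforcing sum(Omega_i) == 0
    raw = [Omega(i / n) for i in range(1, n)]
    mean = sum(raw) / (n - 1)
    return [r - mean for r in raw]

def weights_form_10(omega, omega_prime, n):
    # omega_i and omega'_i normalised separately, then subtracted
    w = [omega(i / n) for i in range(1, n)]
    wp = [omega_prime(i / n) for i in range(1, n)]
    s, sp = sum(w), sum(wp)
    return [a / s - b / sp for a, b in zip(w, wp)]
```

For the simple choice ω(f) = 2(1 − f), ω′(f) = 2f (so Ω(f) = 2 − 4f), the two forms turn out to agree exactly up to an overall multiplicative factor; for general bounded Ω(f) they differ by the O(Ω)/n correction mentioned above, which is irrelevant after the variance normalisation.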
Note that all the tests involving the Watterson estimator (which corresponds to ω(f) ∼ 1/f) have additional subtleties that are discussed in the next section.
Example: Fay and Wu's H test. This test was proposed in [7] to look for an excess of high-frequency derived alleles as a signal of selection. It can be defined by the weight functions ω(f) = 2(1 − f) and ω′(f) = 2f, or alternatively ω(f) = 1 and ω′(f) = 2f. The weights can be found following Equation (10). The resulting test is:

H = (θ̂_ω − θ̂_{ω′}) / √(Var(θ̂_ω − θ̂_{ω′})).

The scaling defined in Equation (9), with weight function Ω(f) = 2 − 4f, gives precisely the same result.
Example: F(r, r′) tests of Fu [15]. This large class of tests is based on the comparison of two estimators with weights:

ω_i ∝ i^{−r},   ω′_i ∝ i^{−r′},

which in the case r, r′ < 1 correspond precisely to the scaling (10) suggested above, with weight functions ω(f) = (1 − r) f^{−r} and ω′(f) = (1 − r′) f^{−r′}. This can be easily verified by multiplying both the numerator and the denominator of ω_i, ω′_i by a factor (1 − r) n^{r−1} and (1 − r′) n^{r′−1}, respectively. The test by Fay and Wu actually corresponds to F(0, −1).
The cases with r ≥ 1 or r′ ≥ 1 involve weight functions with divergent integrals and will be discussed in the next section.
Note that the same weight functions with the scaling (9) would give rise to a slightly different test, with weights obtained from (9) applied to Ω(f) = (1 − r) f^{−r} − (1 − r′) f^{−r′}, which is not weight-consistent for the weights of low-frequency alleles, i.e., with i/n ≪ n^{−2/max(r,r′)}, and is therefore less interesting.
Example: Test for bottleneck of Achaz [8]. This test is another example of a test with an unwanted scaling: its weights are ω_i ∝ e^{−αi} and ω′_i = 1/(n − 1). The weight function for this test is

Ω(f) = αn e^{−αnf} / (1 − e^{−αn}) − 1,

which depends strongly on n; therefore this test is not weight-consistent in the above sense.
It is easy to build an equivalent test with the correct scaling by choosing the functions ω(f) = e^{−βf} (with β a fixed parameter playing the role of αn) and ω′(f) = 1. The resulting weights with the scaling (10) are:

ω_i^{(n)} = e^{−βi/n} / ∑_{j=1}^{n−1} e^{−βj/n},   ω′_i^{(n)} = 1/(n − 1),   (15)

as discussed before. The optimal value reported in [8] is α ≈ 0.9 for n = 30. This value corresponds to β ≈ 27.
The test can also be implemented by choosing the scaling (9) and the weight function Ω(f) = e^{−βf} − (1 − e^{−β})/β. The resulting weights are:

Ω_i^{(n)} = e^{−βi/n} − (1/(n − 1)) ∑_{j=1}^{n−1} e^{−βj/n},

which are equivalent to the weights (15) up to an irrelevant multiplicative factor (see Theorem A1). Therefore, in this case the two choices of scaling give precisely the same result.

Divergent Weights
As discussed above, the two choices of scaling in Equations (9) and (10) do not usually yield appreciably different numerical results. However, there are important choices of Ω(f) for which this approximate equivalence between (9) and (10) does not hold. These critical cases correspond to functions that diverge as 1/f or faster near f = 0 (or f = 1). This divergence is not a real feature of the distribution, because the integral has a natural cutoff at the scale of the inverse population size, f_min = 1/N (more precisely, the inverse effective population size 1/N_e, but this does not affect the discussion). However, in this case the integral ∫_{1/N}^{1} df Ω(f) has a strong dependence on the cutoff 1/N, and therefore the function Ω(f) itself should depend strongly on N to ensure proper normalisation.
If this dependence is contained in a multiplicative term in front of ω(f) or ω′(f) or both, then the second term in Equation (9) is not a small correction of order 1/n, as it is for simple functions Ω(f), but rather represents a relevant correction with a strong dependence on the sample size n and the population size N. The denominators in Equation (10) also show a strong dependence on n (which could not be avoided anyway) but not on N, and therefore this second scaling form should be used. The dependence on sample size is as strong as the dependence of the divergent integral on the cutoff. This can be easily understood by noticing that the sample size n plays the role of the cutoff in the sum over the frequencies that are present in the sample, which is the same role played by the population size N for the whole population; more formally, the denominator in Equation (10) can be bounded from above and from below by the divergent integral, and therefore the divergence of the denominator as n → ∞ is the same as the divergence of the integral as its inverse cutoff (that is, N) goes to infinity. For functions diverging as f^{−k} with k ≥ 1, the dependence on n goes as n^{k−1} if k > 1, or as log(n) for k = 1. This case always occurs when the test is built by comparing an estimator of θ with the Watterson estimator, which corresponds to ω(f) ∼ 1/f and therefore has a logarithmic dependence on n given by the usual harmonic factor a_n = ∑_{i=1}^{n−1} 1/i ≈ log(n). If the dependence of Ω(f) on N is contained in an additive term that does not depend on f, it is the correction in (9) that does not depend on N, and therefore the first scaling form is more appropriate. We do not know examples of tests of this kind in the literature, even if the test by Zeng et al. [16] can also be interpreted in this way.
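The logarithmic divergence of the Watterson normalisation is easy to check numerically (a sketch; the function name is ours):

```python
import math

def watterson_norm(n):
    # discrete normalisation of omega(f) ~ 1/f: the harmonic number
    # a_n = sum_{i=1}^{n-1} 1/i, which grows like log(n)
    return sum(1.0 / i for i in range(1, n))
```

Increasing the sample size tenfold increases a_n by approximately log(10), which is exactly the cutoff dependence of the divergent integral discussed above.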
Example: Tajima's D test. This is the best known test for neutrality based on the frequency spectrum. It is given by the difference between the Tajima estimator θ̂_Π [17], based on the nucleotide pairwise diversity Π, and the Watterson estimator θ̂_W [18], based on the number S of segregating sites; therefore, it can be defined by the weight functions ω(f) = 2(1 − f) for θ̂_Π and ω′(f) = 1/(f log(N)) for the Watterson estimator. The latter function has an integral that diverges logarithmically near f = 0, and the corresponding dependence on N is contained in the factor 1/log(N) that multiplies ω′(f); therefore, the scaling (10) should be used. The result is the usual test:

D = (θ̂_Π − θ̂_W) / √(Var(θ̂_Π − θ̂_W)).

Example: Test of Zeng et al. [16]. This test was proposed to look for an excess of high-frequency derived alleles compared to low-frequency alleles. It is defined by the weight functions ω(f) = 1/(f log(N)) and ω′(f) = 1, the former corresponding to the Watterson estimator. Proceeding as in the above example, the result is:

E = (θ̂_L − θ̂_W) / √(Var(θ̂_L − θ̂_W)),

where θ̂_L is the unbiased estimator with weights ω′_i = 1/(n − 1). Note that, exceptionally, the scaling of this test can also be defined by (9), without modifying the result. This is a consequence of the existence of two equivalent forms for the weight function.

Weights of Singletons
The above scaling (8) is valid in principle for all weights. However, in practice there is an important exception, namely the weight Ω_1 of singletons. This is due to the fact that for n ≪ N, the number of derived singletons ξ_1 is the only estimator that is affected by very rare derived alleles (and often by sequencing errors, see [19]). More precisely, ξ_1 is actually the only estimator sensitive to deviations from neutrality in alleles of frequency 1/N < f < 1/n, which represent the vast majority of the SNPs in the population and can contain interesting biological information. Therefore, if the contribution of these alleles is relevant for the test, we can enhance (or reduce) the weight Ω_1 by adding a term Ω_ds.
In the approach detailed in the previous sections, this additional contribution to Ω_1 is needed to take into account a contribution ∆Ω(f) to Ω(f) of the form ∆Ω(f) = Ω_ds I(f < φ)/φ with φ ≪ 1. As long as the maximum sample size in practice never exceeds n_max ≃ 2/φ, this function weights positively only alleles that appear as singletons.
A similar argument also applies to the weights of the number of ancestral singletons, that is, Ω_{n−1}, ω_{n−1}, ω′_{n−1}, which can be enhanced by terms Ω_as, ω_as, and ω′_as, respectively. However, this case is rarer, the only interesting example being the tests of Achaz [19] that avoid sequencing errors by neglecting both derived and ancestral singletons.
Summarising the results up to this section, a test T_Ω is completely defined by a function Ω(f) and two parameters Ω_ds, Ω_as (which could depend on n) satisfying the conditions:

∫₀¹ df Ω(f) + Ω_ds + Ω_as = 0,

and determining the weights through the formula:

Ω_i^{(n)} = (1/n) Ω(i/n) + Ω_ds δ_{i,1} + Ω_as δ_{i,n−1},

or by a pair of functions ω(f), ω′(f) and parameters ω_ds, ω′_ds, ω_as, ω′_as satisfying:

∫₀¹ df ω(f) + ω_ds + ω_as = 1,   ∫₀¹ df ω′(f) + ω′_ds + ω′_as = 1,

and resulting in this formula for the scaling of the weights:

ω_i^{(n)} = (1/n) ω(i/n) + ω_ds δ_{i,1} + ω_as δ_{i,n−1},   ω′_i^{(n)} = (1/n) ω′(i/n) + ω′_ds δ_{i,1} + ω′_as δ_{i,n−1}.

As shown in the examples above and below, most of the tests in the literature follow this general scaling, the only exceptions being the tests contained in [8], which are not weight-consistent in the above sense.
Example: Fu and Li's F test. This test looks for an excess of very rare derived alleles as a possible signature of negative selection [6]. The only nonzero weights are ω_ds = 1 and ω′(f) = 1/(f log(N)), while ω(f) = ω′_ds = ω_as = ω′_as = 0. The resulting test is:

F = (θ̂_{ξ_1} − θ̂_W) / √(Var(θ̂_{ξ_1} − θ̂_W)),

where θ̂_{ξ_1} = ξ_1/L is the unbiased estimator of θ based on the derived singletons. Note that this test has both singleton weights and a divergent weight function.
Example: Error-corrected tests of Achaz [19]. This class of tests is an attempt to correct for sequencing errors and biases in the data by removing the alleles where most of the problems manifest themselves, i.e., singletons (both ancestral and derived). With a slight generalisation of the proposal in [19], the weights of the singletons are chosen in such a way as to cancel precisely the contributions of the weight functions:

ω_ds = −(1/n) ω(1/n),   ω_as = −(1/n) ω((n − 1)/n),

or:

Ω_ds = −(1/n) Ω(1/n),   Ω_as = −(1/n) Ω((n − 1)/n);

therefore, the final weights of derived or ancestral singletons are zero. These corrections can be applied in principle to any weight function.

Alternative Choices of Scaling
The choice of scaling discussed in the previous sections represents a quite simple and effective way to fix the dependence on n of a newly devised test. However, other choices are possible whose weights differ from the above ones for small n. The reason is that for n not too large, both the variance f(1 − f)/n ≈ i(n − i)/n³ in the estimation of the frequency f = i/n and the related uncertainty about how the frequencies are actually weighted in the test become important. This uncertainty originates from the (binomial) sampling of individuals from the population, and there is some degree of arbitrariness in deciding how to account for it. Moreover, tests that take it into account could be inconsistent in the above sense.
A possible choice of scaling that uses the binomial sampling is the following: considering ω(f), ω′(f) as frequency distributions, the weights ω_i, ω′_i are assigned from ω(f), ω′(f) through the same binomial sampling that applies to allele spectra, that is:

ω_i^{(n)} ∝ ∫₀¹ df ω(f) (n choose i) f^i (1 − f)^{n−i},   (26)

ω′_i^{(n)} ∝ ∫₀¹ df ω′(f) (n choose i) f^i (1 − f)^{n−i}.   (27)

A simple example of this scaling (but with a highly divergent weight function) is given by the test for admixture [8] discussed previously. Optimal tests also follow this scaling.
Example: Test for admixture of Achaz [8]. This test is apparently not consistent, and it does not follow the scaling (8). However, it follows another scaling related to the allele sampling. To understand this, consider the weight functions ω(f) = δ(f − a) and ω′(f) = 1, where δ(f − a) is the Dirac delta, defined by ∫ df δ(f − a) g(f) = g(a) if a is inside the range of integration and 0 otherwise. (Technically speaking it is not a mathematical function, but a distribution, i.e., an element of a dual space of regular functions.) If we scale the weights according to (26) and (27), that is:

ω_i^{(n)} ∝ (n choose i) a^i (1 − a)^{n−i},   ω′_i^{(n)} = 1/(n − 1),   (28)

then the corresponding test is precisely the one proposed by Achaz. Note that the strong dependence of the test on sample size does not come only from the choice of scaling, but also from the weight function chosen, which is highly divergent.
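The delta-function case of this binomial-sampling scaling can be sketched directly, since the integral in (26) collapses to the binomial kernel evaluated at f = a (function name ours):

```python
from math import comb

def binomial_point_weights(a, n):
    # weights obtained by binomially sampling a point mass at frequency a:
    # omega_i proportional to C(n, i) * a^i * (1-a)^(n-i), for i = 1..n-1,
    # normalised to sum to 1
    w = [comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(1, n)]
    s = sum(w)
    return [x / s for x in w]
```

The weights concentrate around i ≈ a·n with a width of order √n, reproducing the Gaussian behaviour of the admixture test, and their strong n-dependence, described above.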

Optimal Tests

On the Existence of Generic Tests
An interesting question on the way to building good linear tests is the following: do generic tests exist? A completely generic test for neutrality should be able to detect any deviation from the spectrum of the null model that is sufficiently large. Unfortunately, such tests do not exist. In fact, for every test defined by a set of weights Ω_i it is possible to find a spectrum ξ_i = α/(a_n i) + (1 − α)∆_i that is maximally different from the standard spectrum at least in a range of frequencies and is nevertheless undetectable by the test, because the average value of the test on this spectrum is zero. This is expressed in a more formal way in the following theorem, which shows that even the complete lack of alleles in some range of frequencies could not always be detected.
Theorem 1. For every set of n real weights Ω_i with ∑_i Ω_i = 0, there is a set of n real numbers ∆_i ≠ const/i and a parameter α ∈ [0, 1] such that the spectrum ξ_i = α/(a_n i) + (1 − α)∆_i gives a null expected value of the test.

The above limitation is not a consequence of the small sample size. This can be seen, for example, in the framework of the scaling theory discussed in this paper. In fact, for large sample sizes, the weights can be approximated by a weight function Ω(f). In this context it is possible to prove the next theorem, which is a continuous equivalent of the previous one.
Theorem 2. For every piecewise continuous weight function Ω(f), there is a smooth function ∆(f) ≠ const/f and a parameter α ∈ [0, 1] such that the corresponding spectrum gives a null expected value of the test.

Note that in principle this problem can be solved using multiple tests. In fact, multiple tests should be able to detect any strong deviation from the null spectrum, provided that the number of these tests is large enough, as can be seen from the following theorem.

Theorem 3. Given at least n − 2 linearly independent sets of n − 1 real weights Ω_i with ∑_i Ω_i = 0, it is not possible to find a set of real numbers ∆_i, different from the null spectrum, that is undetectable by all the corresponding tests.

This last theorem is only a formal result, and the requirement of n − 2 independent tests is too strong. In practice, a small (but good) set of tests can detect most of the reasonable and interesting deviations for realistic spectra.
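The message of these theorems can be illustrated with a small computation (a sketch, not the explicit construction used in the proofs; function name ours): any candidate deviation can be made invisible to a given linear test by projecting out its component along the test's weights.

```python
def invisible_deviation(Omega, d):
    # remove from a candidate deviation d its component along the
    # weight vector Omega; the residual is nonzero in general, yet
    # the weighted sum (the test's expected value) is exactly zero
    c = sum(o * x for o, x in zip(Omega, d)) / sum(o * o for o in Omega)
    return [x - c * o for x, o in zip(d, Omega)]
```

For example, an excess of alleles concentrated in a narrow band of frequencies can be adjusted in this way so that a single fixed test assigns it an expected value of zero, while a second, linearly independent test would still detect it, in line with Theorem 3.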
The above theorems can be extended to the folded spectrum. In this section and the next ones, we will consider only tests based on the unfolded spectrum. The generalisation of the discussion to the folded spectrum is usually straightforward after substituting ξ_i (i = 1 ... n − 1) with η_i (i = 1 ... ⌊n/2⌋).

Optimal Tests and Their Geometric Structure
From the theorems of the previous section, it is clear that a single test cannot detect all the possible deviations occurring in complicated evolutionary scenarios. However, it is still possible to optimise neutrality tests for a specific alternative evolutionary scenario. A simple optimality condition was proposed by some of the authors in [10] in order to maximise the power of the test to detect a fixed alternative scenario. We denote by E(·) the expected value with respect to the null scenario (the neutral model) and by E′(·) the expected value with respect to the alternative scenario. If the null spectrum is E(ξ_i) = θLξ⁰_i and the expected spectrum of the alternative scenario is E′(ξ_i) = θL ξ̄_i (note that it may include an average over a number of different related scenarios, e.g., averaging over some of the parameters of the scenario), the condition for optimal tests is the maximisation of the average result of the test under the alternative scenario:

max_Ω E′(T_Ω).

This condition is based on the observation that the tests have mean zero and variance 1; therefore, if the distributions of the results of the tests are similar, the maximisation of the average value of the test should correspond to the maximisation of the average power of the test. It is also possible to maximise directly the power of the test, taking into account the different distributions of the results under the null and the alternative model; this possibility will be pursued in Section 3.1.
Interestingly, optimal tests show a geometric structure which becomes apparent after defining the scalar product between spectra:

⟨x, y⟩ = ∑_{i,j} x_i c^{−1}_{ij} y_j,   (33)

where c^{−1}_{ij} is the inverse of the covariance matrix Cov(ξ_i, ξ_j). Since Cov(ξ_i, ξ_j) is symmetric and positive, its inverse is also symmetric and positive, i.e., it is a positive bilinear form; therefore, the above expression defines a scalar product. Then, the optimal test for an alternative spectrum ξ̄ can be written in the elegant form:

T_{ξ̄} = ⟨ξ, (1 − P_{ξ⁰}) ξ̄⟩ / √(⟨ξ̄, (1 − P_{ξ⁰}) ξ̄⟩).

This expression can be easily obtained as a special case of the general formula (50) that we will discuss later in the context of nonlinear tests. A direct proof of this result can be found in [10] after substituting the scalar products with the definition (33).
The numerator of the test is actually the matrix element between ξ and ξ̄ of the linear operator 1 − P_{ξ⁰}, where P_{ξ⁰} is the projection operator along ξ⁰. In other words, it is proportional to the difference between the length of the projection of ξ on ξ̄ and the length of the projection on ξ̄ of the spectrum obtained by the projection of ξ on ξ⁰, as illustrated in Figure 2. From this geometrical interpretation it is clear that if the spectrum ξ corresponds to the null spectrum θLξ⁰, then the two projections are equal and the result of the test is zero. On the other hand, if the spectrum is the alternative spectrum θLξ̄, then the value of the test is:

T_{ξ̄} = θL √(⟨ξ̄, (1 − P_{ξ⁰}) ξ̄⟩),

which is the maximum value over all possible tests in the alternative scenario. The same expression, but with a minus sign, corresponds to the minimum value. The denominator of the test is the square root of the matrix element of the linear operator 1 − P_{ξ⁰} between ξ̄ and itself. Note that neither the numerator nor the denominator of the test changes by adding any (possibly negative) multiple of ξ⁰ to ξ̄, because ξ⁰ lies in the kernel of 1 − P_{ξ⁰}. This means that optimal tests depend only on the expected deviations from the null spectrum in the alternative scenario. The result of the test is maximal when the deviations of the data from the null spectrum correspond exactly to the expected ones, and minimal when they are opposite to the expected ones.
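The projection structure can be sketched in a few lines of Python, assuming for simplicity the small-θ unlinked-sites approximation for the covariance, Cov(ξ_i, ξ_j) ∝ ξ⁰_i δ_ij (stated later in the text), so that the scalar product reduces to ⟨x, y⟩ = ∑_i x_i y_i / ξ⁰_i (function names ours; the variance normalisation of the test is omitted):

```python
def optimal_test_weights(xi0, xibar):
    # scalar product under the diagonal covariance approximation
    def dot(x, y):
        return sum(a * b / z for a, b, z in zip(x, y, xi0))

    # subtract the projection of the alternative spectrum xibar
    # onto the null direction xi0 (the operator 1 - P_{xi0})
    coef = dot(xibar, xi0) / dot(xi0, xi0)
    residual = [b - coef * z for b, z in zip(xibar, xi0)]
    # weights in xi-space: the test statistic is sum_i w_i * xi_i
    return [r / z for r, z in zip(residual, xi0)]
```

By construction ∑_i w_i ξ⁰_i = 0 (the test is centred under the null), while the expected value under the alternative spectrum is positive, which is the optimality property described above.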

Beyond Linear Neutrality Tests

Quadratic and Nonlinear Tests
Almost all the neutrality tests proposed in the literature are linear in the spectrum ξ_i. As far as we know, there is only one exception, namely the G_ξ test of Fu [11]. This test is a quadratic polynomial reminiscent of Hotelling's t² statistic for the different components of the spectrum:

G_ξ = ∑_{i,j} (ξ_i − µ_i) c^{−1}_{ij} (ξ_j − µ_j),

where c^{−1}_{ij} is the inverse of the covariance matrix Cov(ξ_i, ξ_j). Actually, the test proposed by Fu is an approximation to this test with a different normalisation, namely:

G = ∑_i (ξ_i − µ_i)² / Var(ξ_i).

In this approximation, the off-diagonal terms in the covariance can be neglected [9,11]. For large samples, the distribution of the results of the test G tends to a χ² distribution with n − 1 degrees of freedom.
Fu's approach cannot be extended to general quadratic or higher-order tests, because the distribution of the results of the test would generally be unknown and not positive definite. For this reason we propose to rescale the tests to have zero mean and variance 1. With this normalisation, we expect the distribution to converge asymptotically to a Gaussian N(0, 1) for all tests. As an example, the (re)normalised version of Fu's test would be: Since the only difference between this test and the original one is the normalisation and a shift by the constant n − 1, the power of the test is the same. We now present a systematic discussion of nonlinear tests that are generic polynomials (or possibly power series) in the spectrum ξᵢ. All the tests are rescaled to be centred (i.e., to have zero mean) and to have variance 1. We denote by µᵢⱼₖ... the moments of the spectrum under the null model, that is, µᵢⱼₖ... = E(ξᵢξⱼξₖ...). With this definition, µᵢ = θLξ⁰ᵢ. Note that all these moments depend on θ. In the approximation of unlinked (independent) sites and small θ, the second moments are equal to µᵢⱼ = θLξ⁰ᵢδᵢⱼ + θ²L²ξ⁰ᵢξ⁰ⱼ. The weights of general nonlinear tests can depend explicitly on θ, as seen in the previous example. To compute the values of the tests, the (unknown) parameter θ is substituted with an estimator θ̂. Unlike the linear case, there are two different classes of tests, distinguished by how their centredness depends on θ̂: strongly centred and weakly centred tests.
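The second-moment formula for unlinked sites and small θ translates directly into code. This is a sketch of µᵢⱼ = θLξ⁰ᵢδᵢⱼ + θ²L²ξ⁰ᵢξ⁰ⱼ with illustrative parameter values.

```python
import numpy as np

def second_moments(theta_L, xi0):
    """Second-moment matrix mu_ij = theta*L*xi0_i*delta_ij
    + (theta*L)^2 * xi0_i * xi0_j, valid in the unlinked-sites,
    small-theta approximation."""
    xi0 = np.asarray(xi0)
    return theta_L * np.diag(xi0) + theta_L ** 2 * np.outer(xi0, xi0)

n = 6
xi0 = 1.0 / np.arange(1, n)       # standard neutral shape xi0_i = 1/i
mu_ij = second_moments(0.5, xi0)  # theta*L = 0.5 (illustrative)
```

The diagonal carries the Poisson-like term θLξ⁰ᵢ, while the rank-one outer product contributes the θ²L² correlations.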
Strongly centred tests are tests that are centred for any value of θ̂, even if it differs from the actual value of θ. The general form for strongly centred tests is: with the real symmetric weights Ω⁽ⁿ⁾ᵢⱼₖ... satisfying the set of conditions: (where we denote by µ the corresponding null moments). The sum can be limited to polynomials of some finite order in ξᵢ, or it can be a (convergent) power series. If we introduce the notation I = ijk... to denote a group of n_I indices, we can rewrite the test in the simpler form: with the conditions: If we constrain these tests to be first-order polynomials in ξᵢ, we recover the linear case with Ω⁽¹⁾ᵢ = Ωᵢ/ξ⁰ᵢ. Note that linear tests are always strongly centred. In fact, in the infinite-sites model the spectrum is always proportional to θ, which consequently factorises out by linearity and therefore has no effect on the centredness.
Weakly centred tests are tests that are centred but not strongly centred, i.e., they are centred if and only if θ̂ = θ. The general form for weakly centred tests is: with the condition: where the Γᵢⱼₖ... are real symmetric weights, possibly dependent on θ. We can simplify these expressions using the same notation as above, obtaining the simpler form: with the condition: As before, for this class of tests the sum can be limited to polynomials of fixed order or extended to power series. Note that the rescaled version of the G test by Fu presented above belongs to this class.
The important difference between strongly and weakly centred tests concerns their robustness with respect to a biased estimate of θ. Since the class of weakly centred tests is much larger than the class of strongly centred ones, it should be easier to find powerful tests in the former class than in the latter. However, even if weakly centred tests can be more powerful, they are not centred in scenarios where the value of θ cannot be estimated precisely. On the other hand, strongly centred tests are robust with respect to a poor estimate of θ, and they are therefore preferable in scenarios where an unbiased estimate of θ is difficult to obtain.
The scaling rule (8) can be generalised to nonlinear tests in terms of functions Ω^(n_I)(f₁, f₂, ..., f_{n_I}) for strongly centred and Γ^(n_I)(f₁, f₂, ..., f_{n_I}) for weakly centred tests: Fixing the precise scaling is more ambiguous than in the linear case because there are many different ways to preserve centredness. For this reason, the choice of scaling would differ between strongly and weakly centred tests and will not be discussed here.
All the possible nonlinear neutrality tests based on the frequency spectrum fall into one of the two classes presented in this section and have the form (41) and (42), or (45) and (46). Since both these classes contain an infinite number of possible choices of weights, the only reasonable criterion for studying general nonlinear tests is to select the most powerful or interesting ones. Apart from the Hotelling-type choice of Fu [11], the most interesting choice is apparently the subclass of nonlinear optimal tests, which will be discussed in the next sections.

Strongly Centred Optimal Tests
As discussed for the linear case, optimal tests depend on the expected alternative scenario. In the nonlinear case, in principle it would be possible to find generic optimal tests, but there is no clear framework to obtain them. For this reason we limit our study to the case of optimal tests for a specific alternative scenario. We denote by μ̂ᵢⱼₖ... = Ê(ξᵢξⱼξₖ...) the moments of the alternative spectrum for this scenario.
Since we use the same normalisation for linear and nonlinear tests, the optimality condition corresponds to the maximisation of the expected value of the test under the alternative scenario: and can be justified as in the linear case. We denote by Ĩ the ordered sequence of the indices contained in I = ijk..., and by σ(I) the number of distinct permutations of the sequence I, i.e., the total number of permutations divided by the number of permutations that leave I invariant. The main result for the optimal weights is presented in the following theorem. Theorem 4. The maxima of Ê(T_Ω) correspond to the weights: where the matrices C⁻¹_ĨJ̃ and M_kl satisfy the identities: Moreover, the variance of the corresponding unnormalised test under the null model is equal to its expected value under the alternative model: Note that in general all the weights of the above optimal solution (50) are nonzero; therefore the maximum average value of the test for optimal tests built on polynomials of degree d increases with d. This suggests that optimal tests of higher degree should be more powerful than linear optimal tests.
We provide explicit formulae for the above weights for the optimal quadratic test in the independent-sites approximation. Given E(ξᵢ) = µᵢ and Ê(ξᵢ) = μ̂ᵢ, the relevant weights Ω^(n_I)_I are: where Σ_µ = ∑_{i=1}^{n−1} µᵢ and Σ_μ̂ = ∑_{i=1}^{n−1} μ̂ᵢ. All these formulae are also valid for the folded spectrum if the appropriate µᵢ and μ̂ᵢ are used. These results are discussed in Appendix A.7.
For optimal tests of higher degree, explicit expressions become cumbersome, and the numerical implementation of the test (50) and the matrices (51) and (52) is more convenient.

Weakly Centred Optimal Tests
In this case the optimality condition corresponds to the maximisation of the expression: with the same condition: The simplest case corresponds to a first-order polynomial: whose maximum corresponds to the optimal weights: where c⁻¹ᵢⱼ is the inverse of the covariance matrix cᵢⱼ = µᵢⱼ − µᵢµⱼ. Since γ ≠ 0 for this optimal test, the value of this test for the specific scenario for which it is built is larger than the value of the corresponding linear optimal test. In fact, the maximum of the test is: which should be compared with the maximum of the optimal test for the linear case, which can be rewritten as: The comparison clearly shows that these nonlinear optimal tests are always more powerful than linear optimal tests for the same scenario.
The form of the results for the general case is similar to this simple case.
Theorem 5. The maxima of Ê(T_Γ) correspond to the weights: where C⁻¹_ĨJ̃ satisfies the identity: Moreover, the variance of the corresponding unnormalised test under the null model is equal to its expected value under the alternative model: Also in this case, the power of optimal tests based on polynomials of higher degree increases with the degree of the polynomial.
It is possible to give explicit expressions for the above matrix and moments for the optimal quadratic test. The formulae for the weights Γ for the unfolded spectrum are: These results are valid in the independent-sites approximation. They are also valid for the folded spectrum if the appropriate µᵢ and μ̂ᵢ are used. An expression for the denominator of the test in the independent-sites approximation can be found in Appendix A.7.

Empirical Simulations and Code Used for Studying the Statistical Power of Neutrality Tests
We obtained the frequency spectrum of samples for different subdivision and expansion models using one million iterations. Simulation parameters, as well as all the frequency data from the simulations, are available on Zenodo (DOI: 10.5281/zenodo.8279694). Linear and nonlinear tests were calculated using our own code, available at the same link. Additionally, linear and nonlinear optimal tests are included in the program mstatspop (https://github.com/cragenomica/mstatspop, accessed on 15 August 2023) to facilitate the calculation of these tests by users.

General Optimisation of Linear Tests
The condition for optimal tests is the maximisation of Ê(T_Ω) under the alternative scenario. However, a better approach would be the maximisation of the power of the test to reject the neutral model in the alternative scenario, given a choice of significance level α. This approach requires knowledge of the form of the probability distributions p(T_Ω = t|H₀) and p(T_Ω = t|H₁), where H₀ and H₁ are the null and alternative models, or equivalently of all the moments of the spectrum E(ξᵢξⱼξₖ...) and Ê(ξᵢξⱼξₖ...).
Since this information is usually not available in analytic form and is hard to obtain computationally, we limit ourselves to the case where the distribution can be well approximated by a Gaussian both for the null and for the expected model. Then, the only information needed are the spectra and their covariances. We expect that, both in this approximation and in the general case, the tests with maximum power will depend on the significance level chosen, thereby limiting the interest of these tests and the possibility of comparing results of the test on samples from different experiments.
We call τ = √2 erf⁻¹(1 − 2α) the z-value corresponding to the critical p-value α. In the Gaussian approximation, the power is given by the following expression: and its maximisation is equivalent to the maximisation of: In the general case, the weights corresponding to the maximum depend explicitly on τ and therefore on α. This dependence is expected but unwanted, since it makes the interpretation of the test depend explicitly on the critical p-value chosen.
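In the Gaussian approximation, a one-tailed power can be written as Φ((Ê − E₀ − τ√V₀)/√V̂), consistent with the maximised quantity given later for one-tailed tests. The helper below sketches this form, using Python's statistics.NormalDist for Φ and Φ⁻¹; since the paper's display equation is elided here, this exact expression is an assumption inferred from the surrounding discussion.

```python
from statistics import NormalDist

_N = NormalDist()

def gaussian_power(E1, V1, alpha=0.05, E0=0.0, V0=1.0):
    """One-tailed power to reject the null in the Gaussian approximation:
    Power = Phi((E1 - E0 - tau*sqrt(V0)) / sqrt(V1)),
    where tau is the (1 - alpha) quantile of the standard normal
    (an assumed form, consistent with the text)."""
    tau = _N.inv_cdf(1.0 - alpha)
    return _N.cdf((E1 - E0 - tau * V0 ** 0.5) / V1 ** 0.5)
```

A quick sanity check: when the alternative coincides with the null (Ê = E₀, V̂ = V₀), the power reduces to the significance level α, and it grows monotonically with Ê.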
There is only one case with weights independent of τ, namely the case of ĉᵢⱼ (approximately) proportional to cᵢⱼ. In this case the maximisation of the power of the test is (approximately) equivalent to the maximisation of the average result of the test, which is precisely the condition for optimal tests in the sense of [10]. In fact, in this case, optimal tests correspond precisely to an approximation of the likelihood-ratio tests under the assumption of Gaussian likelihood functions, and are therefore approximately the most powerful tests by the Neyman-Pearson lemma.
As a side note, there is also a regime of values of α such that the weights corresponding to maximum power are independent of α, namely the regime τ(α) ≫ 1. In this case the power is an increasing function of ∑_{j,k} ĉⱼₖΩⱼΩₖ / ∑_{l,m} cₗₘΩₗΩₘ, and the weights are simply given by the null eigenvector (or linear combination of null eigenvectors) of the matrix ĉᵢⱼ − χcᵢⱼ, where χ is uniquely defined by the requirement that ĉᵢⱼ − χcᵢⱼ be a negative semidefinite matrix with at least one null eigenvalue. However, this regime is uninteresting because such small significance levels are practically useless (if τ ∼ 10, the corresponding critical p-value is α ∼ 10⁻²⁰). Nevertheless, it could be possible to build interesting tests with higher power by selecting linear combinations of the weights of the two α-independent tests discussed above, that is, optimal tests and tests that maximise the alternative/null variance ratio ∑_{j,k} ĉⱼₖΩⱼΩₖ / ∑_{l,m} cₗₘΩₗΩₘ.
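The variance-ratio weights described above can be sketched as a generalised eigenproblem: χ is the largest generalised eigenvalue of the pair (ĉ, c) and the weights are the corresponding eigenvector, so that ĉ − χc is negative semidefinite with a null eigenvalue. This is a sketch assuming c is positive definite; the matrices in the example are toy values.

```python
import numpy as np

def variance_ratio_weights(c_alt, c_null):
    """Weights maximising the variance ratio w^T c_alt w / w^T c_null w:
    the eigenvector of the largest generalised eigenvalue chi of
    (c_alt, c_null), i.e., the null eigenvector of c_alt - chi*c_null."""
    # Reduce to an ordinary eigenproblem for c_null^{-1} c_alt
    # (c_null is a covariance matrix, assumed positive definite).
    evals, evecs = np.linalg.eig(np.linalg.solve(c_null, c_alt))
    k = np.argmax(evals.real)
    chi = evals.real[k]
    w = evecs[:, k].real
    return chi, w / np.linalg.norm(w)

# Toy example: independent components with alternative variances (3, 1)
chi, w = variance_ratio_weights(np.diag([3.0, 1.0]), np.eye(2))
```

By construction, ĉ − χc has its largest eigenvalue at zero, matching the semidefiniteness requirement in the text.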

Generalised Optimality Conditions
In a more general linear framework, the (normalised) test has the form T = ∑_{i=1}^{n−1} Ωᵢξᵢ/ξ⁰ᵢ. We denote its moments under the standard neutral model (SNM) and under the alternative model by: where E(ξᵢ)_SNM = θLξ⁰ᵢ, Ê(ξᵢ)_alt = θLξ̂ᵢ, Cov(ξᵢ, ξⱼ)_SNM = cᵢⱼ and Cov(ξᵢ, ξⱼ)_alt = ĉᵢⱼ. For the standard neutral model, ξ⁰ᵢ = 1/i. The optimality condition can be the maximisation of a general function M(E₀, V₀, Ê, V̂) of these quantities with respect to the weights Ωᵢ. Since we are dealing with linear combinations of the frequency spectrum with mean E₀ = 0 and variance V₀ = 1, the function effectively depends on Ê and V̂ only. Maximisation of power (in the Gaussian approximation), given a significance level α, is equivalent to the maximisation of: If there is some knowledge of the variance of the alternative spectrum, or at least of the contribution from unlinked sites, maximisation of power is not the only possible optimisation. In other words, it is possible to optimise with respect to other functions M(Ê, V̂) of the alternative mean Ê and variance V̂ of the test, in order to obtain, e.g., more robust or conservative tests.
For example:
• "Optimal tests": if V̂ is assumed to be completely unknown, the best we can do is minimise the p-value of the expected alternative spectrum, that is, in the Gaussian approximation, p = 1 − Φ((Ê − E₀)/√V₀), where Φ(z) is the cumulative distribution function of the standard normal distribution. This is equivalent to the maximisation of:
• Most powerful one-tailed tests: in the Gaussian approximation, the power of the right tail of the test to reject the neutral Wright-Fisher model, given a significance level α, is: Maximising this power is equivalent to the maximisation of: If V̂ ∝ V₀, this is equivalent to the previous case and we retrieve the optimal tests of [10], which are therefore the most powerful tests in this approximation.
• Most powerful two-tailed tests: in the Gaussian approximation, the power of both tails of the test to reject the neutral Wright-Fisher model is: where τ = Φ⁻¹(1 − α/2) and α is the significance level. If E₀ = 0, this maximisation is equivalent to the maximisation of (Ê − E₀ − τ√V₀)/√V̂, similarly to the previous case.

• Penalisations for high or low variance, depending on an extra parameter ν, such as:

Tunable Optimal Tests
Interestingly, it turns out that the choice of an optimisation criterion, i.e., of a function M(Ê, V̂) to be maximised, can be performed simply by tuning a parameter λ in a simple class of tests that we call "tunable optimal tests".
The family of optimised tests with tunable parameter λ has the simple form: with weights: where aₙ = ∑_{j=1}^{n−1} 1/j. The interest of this family of tests lies in the following property: for every possible choice of the optimisation function M(Ê, V̂), the test corresponding to the maximum value of this function belongs to this family (see Appendix B).
Usual optimal tests [10] are obtained from the maximisation of M(Ê, V̂) = Ê and correspond to λ = 0⁺, that is, to the usual weights aₙ. More generally, λ can be obtained directly by evaluating Ê and V̂ with the weights (80) and then maximising M(Ê(λ), V̂(λ)). Note that there could be several local maxima.
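Since the explicit weight formula (80) is not reproduced here, the following sketch treats it as a user-supplied callable and simply scans a λ grid for the maximum of a chosen criterion M(Ê(λ), V̂(λ)). The function names and the toy weight family are assumptions for illustration only.

```python
import numpy as np

def tune_lambda(weights_of_lambda, M, mu_alt, cov_alt, lambdas):
    """Grid search for the tunable-test parameter lambda:
    for each lambda, evaluate the alternative mean E(lambda) and variance
    V(lambda) of the test with weights Omega(lambda), then keep the lambda
    maximising the criterion M(E, V). `weights_of_lambda` stands in for
    the paper's Eq. (80), which is not reproduced here."""
    best = None
    for lam in lambdas:
        w = np.asarray(weights_of_lambda(lam))
        E = w.dot(mu_alt)                 # alternative mean of the test
        V = w.dot(cov_alt).dot(w)         # alternative variance of the test
        score = M(E, V)
        if best is None or score > best[0]:
            best = (score, lam, w)
    return best  # (M value, lambda, weights) at the (possibly local) maximum

# Toy weight family and the "optimal test" criterion M(E, V) = E
toy_weights = lambda lam: [1.0 - lam, lam]
best_score, best_lam, best_w = tune_lambda(
    toy_weights, lambda E, V: E,
    np.array([1.0, 2.0]), np.eye(2), np.linspace(0.0, 1.0, 11))
```

For this toy family the alternative mean is 1 + λ, so the scan picks λ = 1; with a different criterion (e.g., penalising high variance) a different λ would be selected, as described in the text.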
This class of tests is simple but far more flexible than usual optimal tests. However, note that the exact choice of weights depends explicitly on θ and α. Furthermore, the smoothness of the choice of weights with respect to the evolutionary parameters is not guaranteed: in principle, a slight change of evolutionary scenario could change the optimal weights abruptly.

Simulations of the Power of Optimal Tests
Since a theoretical evaluation of the power of optimal tests of different degrees is not possible, we evaluate numerically the power of some of these tests in different scenarios. We consider the best possible case, that is, we assume that the precise value of θ is known. Moreover, we assume unlinked sites and θ ≪ 1. In this approximation, as shown in Appendix A.7, the moments E(ξᵢξⱼξₖ...) depend only on the first moments µᵢ = θLξ⁰ᵢ, and similarly the Ê(ξᵢξⱼξₖ...) depend only on μ̂ᵢ = θLξ̂ᵢ; therefore, optimal tests depend only on the alternative and null average spectra.
Note that for numerical simulations of optimal tests of higher degree, the numerical implementation can be made easier if all the occurrences of the inverse covariance matrices C⁻¹_ĨJ̃ in the above formulae are replaced with the corresponding inverse second-moment matrices µ⁻¹_ĨJ̃, both in the expressions (50) and (63) and in the definition (52). The test is the same because of the centredness condition, as can be verified explicitly. We compare four optimal tests. The first two are the linear and quadratic strongly centred optimal tests, denoted by T^sc_O(1) and T^sc_O(2), respectively. The third test is the weakly centred optimal test T^wc_O(1) based on a first-order polynomial and presented in (59). The last optimal test, T^wc_O(2), is also weakly centred and based on a quadratic polynomial. The explicit formulae for the computation of the weights of T^sc_O(2) and T^wc_O(2) were given in Equations (54)-(56) and (66)-(69). We simulated two demographic processes: (A) subdivision, where two populations of identical size exchange individuals at a symmetric migration rate M, and individuals are sampled from one population only; (B) expansion, where the population size changes by a factor N₀/N = 10 at a time T before the present (in units of 4N generations). For each value of the parameters M and T, 10⁶ simulations were performed with mlcoalsim v1.98b [20] for a region of 1000 bases with variability θ = 0.05, recombination ρ = ∞, and a sample size of n = 20 (haploid) individuals. Confidence intervals at the 95% level were estimated from 10⁶ simulations of the standard neutral coalescent with the same parameters.
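A Monte Carlo power estimate of the kind used here can be sketched as the fraction of alternative-model test values falling outside the two-tailed null confidence interval. The Gaussian samples below merely stand in for the outputs of coalescent simulations; the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_power(t_null, t_alt, alpha=0.05):
    """Monte Carlo power: the fraction of test values simulated under the
    alternative model that fall outside the two-tailed (1 - alpha)
    confidence interval estimated from null-model simulations."""
    lo, hi = np.quantile(t_null, [alpha / 2.0, 1.0 - alpha / 2.0])
    t_alt = np.asarray(t_alt)
    return float(np.mean((t_alt < lo) | (t_alt > hi)))

# Stand-ins for simulated test values under the null and a shifted alternative
t_null = rng.normal(0.0, 1.0, 100_000)
t_alt = rng.normal(4.0, 1.0, 100_000)
power = empirical_power(t_null, t_alt)
```

As a consistency check, feeding the null sample back into the estimator recovers a rejection rate close to α.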
In Figure 3 we compare the power of the tests in the best possible situation, namely when θ is known with good precision. In this condition all optimal tests should give their best results. In fact, the power of the weakly centred tests (T^wc_O(1) and T^wc_O(2)) is impressive, being around 100% for a large part of the parameter space and decreasing for large migration rates (Figure 3A) and long times (Figure 3B), as for every other test, because the frequency spectrum in these cases becomes very similar to the standard spectrum. Weakly centred tests therefore show a very good theoretical performance, counterbalanced by their lack of robustness. The powers of T^wc_O(1) and T^wc_O(2) are almost identical; therefore, the contribution of the quadratic part to T^wc_O(2) is probably not relevant. On the other hand, strongly centred optimal tests are more powerful than Tajima's D but less powerful than weakly centred tests, as expected. However, there is a sizeable difference in power between T^sc_O(1) and T^sc_O(2): in the range of parameters where the power of the weakly centred tests is around 100%, both strongly centred tests show a good performance, not far from the weakly centred ones, while in the less favourable range the quadratic test T^sc_O(2), although performing worse than the weakly centred tests, has a power that is 20% higher than that of the linear test T^sc_O(1). Taking into account the robustness of the tests, these simulations show that optimal tests like T^sc_O(2) could be an interesting alternative to the usual linear tests.

Discussion
In this paper we have presented a systematic analysis of neutrality tests based on the site frequency spectrum. This study is intended to extend and complete the recent works [8,10] on the linear neutrality tests presented by Achaz. The properties of linear neutrality tests and optimal tests have been studied using this framework. A new class of "tunable" optimal tests, which includes the usual optimal tests as a special case, has been proposed, using a generalisation of the optimisation approach for linear tests.
The aim of the paper is to give mathematical guidelines to build new and more effective tests to detect deviations from neutrality. The proposed guidelines are the scaling relation (8) and the optimality condition based on the maximisation of Ê(T_Ω). Both these guidelines are thoroughly explained and discussed.
One of the strengths of optimal neutrality tests (initially presented in [10]) is that their weights scale automatically with the sample size, avoiding a dependence on it and enabling interpretation of the results of the test in a sample-size-independent way. For other tests, different scaling strategies were analysed and evaluated, discussing the suitability of weighting methods and the scaling of existing neutrality tests.
The different neutrality tests studied and developed in this paper have different features, which make them suitable for different purposes. For example, a general neutrality test whose weights are scaled to obtain interpretable results may be sufficient, with minimum effort, to reject the null model and to interpret the results, but its statistical power against a specific alternative hypothesis can be low. Instead, optimal tests become the better approach if the alternative hypothesis is clearly formulated and the data are not clearly compatible with the null hypothesis. Regarding the differences between linear and nonlinear optimal tests, while nonlinear optimal tests have been shown to be more powerful than linear ones (and weakly centred tests more powerful than strongly centred ones), power is not the only important issue: robustness must also be taken into consideration.
In fact, there are several important remarks to be made on the relative robustness of these tests. The first one is that, as already discussed, the centredness of weakly centred tests is not robust with respect to a biased estimate of θ; therefore, these tests should be preferred to strongly centred tests only in situations where the value of θ is well known or a good estimate is available.
The second remark is that neither the weights nor the results of linear optimal tests depend on the value of θ or on the number of segregating sites S, while the weights of nonlinear optimal tests depend explicitly on θ and their results depend not only on the spectrum but also on S; therefore, the interpretation of the results of these tests is more complicated. However, this is not necessarily true for homogeneous tests of any degree, like the quadratic G_ξ test by Fu. An interesting development of this work could be a study of homogeneous tests of a given degree k satisfying the optimality condition, which can easily be obtained from Equation (50) by restricting all ordered sequences of indices Ĩ, J̃, K̃, L̃ to contain precisely k indices (along with a suitable "traceless" condition, if needed). These homogeneous optimal tests (or at least some subclass of them) should depend only weakly on S.
The third remark is that linear optimal tests have two interesting properties that are not shared by other tests: they depend only on the deviations from the null spectrum, and they have an easy interpretation in terms of these deviations, that is, they are positive if the observed deviations are similar to the expected ones and negative if the observed deviations are opposite to the expected ones. These features give an important advantage to linear optimal tests. A fourth remark is that "tunable" optimal tests can be built to achieve weights corresponding to maximum power to reject the null hypothesis for a given alternative scenario even when the power calculation is uncertain (e.g., when the variance of the alternative scenario is not known). This new class of neutrality tests is highly flexible and depends only on a single parameter and the mean alternative spectrum. Nevertheless, since the test can be used even when the alternative scenario is not fully defined, there may be situations where optimising its power compromises the interpretability or robustness of the test.
Tests based on the frequency spectrum of polymorphic sites are fast, being based on simple matrix multiplications, and can therefore be applied to genome-wide data. Moreover, they can be used as summary statistics for Approximate Bayesian Computation or other statistical approaches to the analysis of sequence data. While linear tests are often used in this way, the nonlinear tests presented in this paper contain more information (related to the covariances and higher moments of the frequency spectrum) that could increase the power of these analyses.

the sample size and the number of segregating sites of each studied locus, and therefore the values for each locus are not directly comparable. The contribution of each locus to the heterogeneity is hardly known. Tajima's D is robust to differences in the level of variability (the variance is approximately equal to one) and also quite robust to differences in sample size (as will be shown in the next part), although the quantitative values of Tajima's D for each condition are somewhat different and therefore the comparison between values is not simple. The proposal of Schaeffer is to use the test: as a (re)normalised version of Tajima's D, where S_obs is the observed number of segregating sites in the sample. This proposal can be generalised to all tests of the form (3) as follows:

T′_Ω = T_Ω / min(T_Ω)|_{S=S_obs} = ∑_{i=1}^{n−1} i Ωᵢ ξᵢ / (min_j(j Ωⱼ) S_obs)   (A4)

This appears to be the natural generalisation of D to general linear tests. For optimal tests, there are other possibilities.
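The generalised renormalisation (A4) is straightforward to compute. The sketch below takes arbitrary toy weights and a toy spectrum (not a real neutrality test) and applies the formula literally.

```python
import numpy as np

def renormalised_test(omega, xi):
    """Sketch of the generalised Schaeffer renormalisation (A4):
    T'_Omega = sum_i i*Omega_i*xi_i / (min_j(j*Omega_j) * S_obs),
    where S_obs = sum_i xi_i is the observed number of segregating sites."""
    omega, xi = map(np.asarray, (omega, xi))
    i = np.arange(1, len(xi) + 1)
    s_obs = xi.sum()
    return float((i * omega * xi).sum() / ((i * omega).min() * s_obs))

# Toy weights and spectrum (illustrative values only)
t_prime = renormalised_test([1.0, -1.0], [3.0, 1.0])
```

With these toy values the numerator is 1, min_j(jΩⱼ) = −2, and S_obs = 4, giving T′_Ω = −0.125.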
Appendix A.3. Tests Based on the Folded SFS

If an outgroup is not available, then the test should be based on the folded spectrum and has the general form: where the weights Ω*ᵢ = ω*ᵢ − ω̂*ᵢ satisfy the conditions: and: are unbiased estimators of θ.
Appendix A.4. Scaling of Weights in Tests without an Outgroup

The above arguments can be repeated in a straightforward way for the tests T*_Ω based on the folded spectrum ηᵢ. The only relevant difference is that the frequency f of the minor allele in the population is always at most 50%, that is, f ∈ (0, 1/2]. For consistency with the unfolded case, the weight of η_{n/2} is reduced by a factor of 2. Moreover, the additional parameters related to the weights of singletons cannot distinguish between ancestral and derived alleles and therefore reduce to Ω*_s, ω*_s, ω̂*_s. These parameters, together with the functions Ω*(f), ω*(f), and ω̂*(f), should satisfy the conditions:

Figure 1 .
Figure 1. Illustrative example of the dependence of the weights on the sample size: weight Ω as a function of i/n for the test for admixture by Achaz, plotted for different sample sizes n = 10 (blue), 100 (red), and 1000 (green).

Figure 2 .
Figure 2. Geometrical representation of the numerator of the optimal test T_O in (34). The length of the red line segment corresponds to the value of the numerator.

Figure 3 .
Figure 3. Statistical power of nonlinear optimal tests from coalescent simulations for the 5% tail, compared with Tajima's D test (for both the left and the right tail). The parameters for the simulations are: n = 20, θ = 0.05, L = 1000 bp, ρ = ∞; two populations considered but only one sampled (panel A); expansion factor N₀/N = 10 (panel B).