Difference in the Proportions of Deleterious Variations Within and Between Populations Influences the Estimation of FST

Estimating the extent of genetic differentiation between populations is important in population genetics, ecology and evolutionary biology. The fixation index, or FST, is routinely used to quantify this differentiation. Previous studies have shown that FST estimated for selectively constrained regions is significantly lower than that estimated for neutral regions. By deriving the theoretical relationship between FST at neutral and constrained sites, we show that an excess in the fraction of deleterious variations segregating within populations, compared to that segregating between populations, is the cause of the reduction in FST estimated at constrained sites. Using whole genome data, we found that the magnitude of reduction in FST estimates obtained for selectively constrained regions was much higher for distantly related populations than for closely related pairs. For example, the reduction was 47% for the European-African comparison, 30% for the European-Asian comparison, 16% for the Northern-Southern European pair and only 4% for the comparison involving two Southern European (Italian and Spanish) populations. Since deleterious variants are purged over time by purifying selection, their contribution to the among-population diversity at constrained sites decreases as the divergence between populations increases. However, within-population diversity remains the same for all pairs compared, and therefore the reduction in FST at constrained sites, relative to neutral sites, is much larger for distantly related populations than for closely related populations. Our results suggest that the level of population divergence should be considered when comparing constrained-site FST estimates from different pairs of populations.


Introduction
Since the introduction of F-statistics by Sewall Wright 1, the fixation index, or FST, has been routinely used to measure the extent of differentiation between populations [2][3][4][5][6][7][8][9][10][11][12]. FST compares the heterozygosities within and between (or total) populations to measure the level of genetic structure among populations. Apart from being an integral part of the descriptive statistics of a population, FST has direct applications in conservation biology, ecology, evolutionary biology and clinical genetics. FST reveals the extent of genetic drift and the level of migration between populations, which is useful for understanding the population dynamics of an ecosystem 13. The level of differentiation between populations helps conservation biologists to measure the risk of extinction of a population or species 14. Furthermore, FST is used to identify candidate genetic variants and genes associated with Mendelian and complex genetic diseases 2,3,9.
Since adaptive mutations spread quickly through a population, their relatively high frequency in one population (compared to the other) elevates the FST estimates. In whole genome analyses, FST is estimated using a sliding window to detect genomic regions showing high differentiation between populations 10. Although a number of previous studies have investigated the relationship between FST and positive selection, only a handful have examined the influence of negative selection on FST. A previous study reported lower FST for genic than for nongenic SNPs 3. The reduction in FST was more pronounced when only the amino acid changing nonsynonymous SNPs (nSNPs) were considered, and a similar reduction was observed for mutations in disease-related genes. This suggests that purifying selection does not allow an increase in the frequency of SNPs, which could have led to the observed low FST 16. Later, a more systematic investigation was conducted to examine this issue using human genome data 17. This study grouped nSNPs based on the evolutionary rates of the sites in which they were present and showed a positive correlation between the rates and FST. Hence FST estimated for the nSNPs present in selectively constrained sites (with low rates of evolution) was much smaller than that estimated for those present in neutral sites with high evolutionary rates. A similar observation was made by another study on populations of fruit flies from France and Rwanda 18. FST estimates obtained for long introns (known to be under strong purifying selection) and conserved genes were typically lower than those estimated for short introns (under relaxed selective constraints) and less conserved genes.
Although the influence of purifying selection on FST estimates has been well documented, the mechanism by which selective constraint affects these estimates is unclear. Furthermore, it is unknown whether the magnitude of reduction in FST depends on the divergence between populations, or whether it is similar for closely related and distantly related populations. To examine these questions, we first investigated the theoretical relationship between FST at neutral and constrained sites. Using the data from the 1000 Genomes Project Phase 3 19, we then estimated FST for pairs of populations with different levels of divergence.

Theoretical relationship between FST and Neutrality Index (NI)
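FST at synonymous sites (FST(S)) can be expressed following Hudson et al. 20 as:

$$F_{ST(S)} = 1 - \frac{H_w}{H_b} \tag{1}$$

where Hb and Hw are the mean numbers of synonymous nucleotide differences between and within populations, respectively. Similarly, FST at nonsynonymous sites (FST(N)) is given as:

$$F_{ST(N)} = 1 - \frac{w_w H_w}{w_b H_b} \tag{2}$$

where w_w and w_b are the ratios of nonsynonymous-to-synonymous polymorphisms within and between populations, respectively. From equation 1,

$$\frac{H_w}{H_b} = 1 - F_{ST(S)} \tag{3}$$

and substituting this into equation 2 gives

$$F_{ST(N)} = 1 - \frac{w_w}{w_b}\left(1 - F_{ST(S)}\right) \tag{4}$$

The ratio of w_w to w_b is qualitatively similar to NI:

$$NI = \frac{w_w}{w_b} \tag{5}$$

Therefore

$$F_{ST(N)} = 1 - NI\left(1 - F_{ST(S)}\right) \tag{6}$$

If the w_b and w_w fractions of nonsynonymous mutations are equal (which results in NI = 1), then equation 6 reduces to

$$F_{ST(N)} = F_{ST(S)} \tag{7}$$

This suggests that FST at synonymous sites is equal to that at nonsynonymous sites. However, it is well known that the fraction of slightly deleterious mutations present within populations is higher than that observed between populations. This is because a much higher fraction of those present within populations are young and yet to be purged by natural selection. Therefore, w_w is expected to be higher than w_b (NI > 1), and hence

$$F_{ST(N)} < F_{ST(S)} \tag{8}$$

The above theoretical relationships demonstrate that FST at nonsynonymous sites is expected to be smaller than that observed for synonymous sites if there is an excess in the proportion of nonsynonymous deleterious mutations present within populations compared to that between populations.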

Population genome data
Whole genome data for 26 world-wide populations were downloaded from the 1000 Genomes Project Phase 3 (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) 19. Only biallelic single nucleotide polymorphisms (SNPs) from the autosomes were included in the analyses, and the allele frequency of each SNP in the 26 populations was computed and used for estimating FST with the estimators described below. Pairwise FSTs were computed for the exomes of human populations including CHB (Northern Chinese), CHS (Southern Chinese), GBR (British), IBS (Spanish), JPT (Japanese), TSI (Italian) and YRI (Nigerian). To determine the magnitude of selective constraint at each site of the human genome we used Combined Annotation-Dependent Depletion (CADD), a robust method that integrates many diverse annotations into a single measure (C score) 22. The precomputed C scores for each genome position are available at: http://cadd.gs.washington.edu/download/1000G_phase3_inclAnno.tsv.gz and these scores were mapped to the genotype data from the 1000 Genomes Project. To identify derived alleles, the orientations of SNVs were determined using the ancestral state of the nucleotides, which was inferred from six primate EPO alignments 19.

FST estimation
For estimating FST from human exome data we used the method developed by Hudson et al. 20, with the estimator described by Bhatia et al. 23, where p1 and p2 are the allele frequencies of a biallelic SNP in the two populations being compared. To combine FST estimated for different SNPs of the genome we used the ratio of averages approach suggested by Bhatia et al. 23. To estimate the variance, we used a bootstrap resampling procedure with 1000 replicates.
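The estimation steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: it assumes the population-level form of the Hudson estimator (no sample-size correction), and the function names `hudson_components`, `fst_ratio_of_averages` and `bootstrap_se`, as well as the example frequencies, are hypothetical.

```python
import random


def hudson_components(p1, p2):
    """Return (Hw, Hb) for one biallelic SNP.

    Assumption: population-level form without sample-size correction.
    Hw is the within-population and Hb the between-population diversity.
    """
    hw = p1 * (1 - p1) + p2 * (1 - p2)
    hb = p1 * (1 - p2) + p2 * (1 - p1)
    return hw, hb


def fst_ratio_of_averages(freqs):
    """Combine SNPs as 1 - sum(Hw) / sum(Hb) (ratio of averages).

    freqs is a list of (p1, p2) pairs; at least one SNP must be
    polymorphic, otherwise sum(Hb) is zero.
    """
    sum_hw = 0.0
    sum_hb = 0.0
    for p1, p2 in freqs:
        hw, hb = hudson_components(p1, p2)
        sum_hw += hw
        sum_hb += hb
    return 1.0 - sum_hw / sum_hb


def bootstrap_se(freqs, replicates=1000, seed=0):
    """Standard error of FST from resampling SNPs with replacement."""
    rng = random.Random(seed)
    n = len(freqs)
    estimates = []
    for _ in range(replicates):
        sample = [freqs[rng.randrange(n)] for _ in range(n)]
        estimates.append(fst_ratio_of_averages(sample))
    mean = sum(estimates) / replicates
    variance = sum((e - mean) ** 2 for e in estimates) / (replicates - 1)
    return variance ** 0.5


# Hypothetical allele frequencies (p1, p2) for four SNPs.
example = [(0.10, 0.30), (0.50, 0.40), (0.80, 0.20), (0.25, 0.60)]
print(fst_ratio_of_averages(example), bootstrap_se(example))
```

Summing Hw and Hb across SNPs before taking the ratio, rather than averaging per-SNP FST values, is what the ratio of averages approach refers to; it keeps rare variants with near-zero Hb from dominating the estimate.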
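The magnitude of reduction in the FST of nonsynonymous sites, relative to that of synonymous sites, was quantified as a percentage:

$$\frac{F_{ST(S)} - F_{ST(N)}}{F_{ST(S)}} \times 100 \tag{10}$$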

Results
The effect of purifying selection on FST
To examine the influence of purifying selection on FST we used European and African exome data from the 1000 Genomes Project Phase 3 (see methods). To quantify the magnitude of selection pressure, the Combined Annotation-Dependent Depletion (CADD) score, or C score, was used 22. Nonsynonymous SNPs were grouped into seven categories based on their C scores. Figure 1A shows the relationship between selection pressure and FST estimated for synonymous SNPs (sSNPs) and nonsynonymous SNPs (nSNPs) using the exome data for the Italian (TSI)-Nigerian (YRI) pair. Clearly, FST is highest for the neutral sSNPs and declines as the magnitude of selection increases. The FST estimate for highly constrained nSNPs with a C score > 30 was only 0.082, which is much smaller than that estimated for sSNPs (0.154). We introduced a measure to capture the magnitude of reduction in the FST estimates of nSNPs compared to that of sSNPs (Equation 10, see methods). Figure 1B shows the positive relationship between the extent of selective constraint and the magnitude of reduction in FST. The reduction in the FST estimate was only 2% for nSNPs under relaxed selection pressure (C score ≤ 5) and increased with the magnitude of selection pressure. For highly constrained nSNPs (C score > 30), the reduction in FST was 47%, which is about 24 times higher than that observed for nSNPs under relaxed constraint.
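The grouping of nSNPs into seven C-score categories, followed by a per-category FST, can be sketched as follows. The text only states the lowest (C score ≤ 5) and highest (C score > 30) categories, so the intermediate bin edges in `C_SCORE_EDGES` are an assumption made for illustration, as are the function names and the uncorrected Hudson form of FST used here.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bin edges: only "<= 5" and "> 30" are stated in the text.
# Six edges give seven categories: <=5, (5,10], ..., (25,30], >30.
C_SCORE_EDGES = [5, 10, 15, 20, 25, 30]


def c_score_bin(score):
    """Index (0-6) of the C-score category; upper edges are inclusive."""
    return bisect_left(C_SCORE_EDGES, score)


def fst_by_bin(snps):
    """Ratio-of-averages FST per C-score category.

    snps is an iterable of (c_score, p1, p2), where p1 and p2 are the
    allele frequencies of a biallelic SNP in the two populations.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # bin -> [sum Hw, sum Hb]
    for score, p1, p2 in snps:
        acc = sums[c_score_bin(score)]
        acc[0] += p1 * (1 - p1) + p2 * (1 - p2)  # within populations
        acc[1] += p1 * (1 - p2) + p2 * (1 - p1)  # between populations
    # Skip categories with no between-population diversity.
    return {b: 1.0 - hw / hb for b, (hw, hb) in sorted(sums.items()) if hb > 0}


# Hypothetical records: (C score, frequency in pop 1, frequency in pop 2).
records = [(2.0, 0.20, 0.80), (4.5, 0.40, 0.10), (35.0, 0.02, 0.01)]
print(fst_by_bin(records))
```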

Relationship between FST at neutral and constrained genomic regions
To understand the cause of the reduction in FST for constrained SNPs we derived the theoretical relationship between FST at neutral (FST(S)) and constrained (FST(N)) regions, which is shown in equation 6. This relationship is modulated by the Neutrality Index (NI), which is the ratio of the nonsynonymous-to-synonymous polymorphism ratio observed within populations (w_w) to that observed between populations (w_b) (NI = w_w/w_b).
As shown in equation 7, FST(S) and FST(N) are equal when the proportions of nonsynonymous polymorphisms observed within and between populations are the same (w_w = w_b). However, it is well known that a higher proportion of slightly deleterious nSNPs is expected to segregate within populations than between populations. This is because a significant fraction of them are purged by purifying selection over time, and hence their fraction is diminished in between-population comparisons. Therefore, FST estimated for nSNPs is expected to be smaller than that of sSNPs, as the fraction of nSNPs segregating within populations is higher than that segregating between populations (w_w > w_b) (Equation 8).
Plotting the relationship of equation 6 shows that the relative difference between FST(S) and FST(N) is much higher when FST(S) is small (Figure 2A). For instance, when FST(S) was 0.02 the corresponding FST(N) was 0.0004 (for NI = 1.02). In contrast, when FST(S) was 0.1 the corresponding FST(N) was 0.082, which is only slightly smaller than the former. The magnitude of reduction of FST(N) compared to FST(S) is very clear in Figure 2B. For NI = 1.02, the magnitude of reduction of FST(N) was 98% when FST(S) = 0.02, whereas this reduction was only 18% for FST(S) = 0.1. Figure 2B also shows that the magnitude of reduction increases with increasing NI. For example, when FST(S) = 0.2 and NI = 1.02 the magnitude of reduction was only 8%, but it was 48% for NI = 1.12.

FST estimates and population divergence
Next, we investigated the effects of purifying selection on FST with respect to population divergence, in order to compare the magnitude of reduction of FST estimated for closely and distantly related populations. For this purpose, we used four pairs of populations with different levels of divergence: European (Italian/TSI)-African (Nigerian/YRI), European (Italian/TSI)-Asian (Chinese/CHB), Southern European (Italian/TSI)-Northern European (British/GBR) and two Southern Europeans (Italian/TSI-Spanish/IBS). Figure 3 shows FST estimates obtained for sSNPs and highly constrained nSNPs (C score > 30) for the four pairs of populations. Comparing the pairs of columns from Figure 3A to 3D reveals a reduction in the difference between the FST estimated for synonymous (FST(S)) and highly constrained nonsynonymous (FST(N)) sites. The positive correlation between population divergence and the extent of reduction of FST(N) is clear in Figure 4. The FST(N) observed for constrained nSNPs of the distantly related Italian-Nigerian pair was 47% smaller than that of sSNPs (Figure 4A). While this reduction was 30% for the Italian-Chinese pair and 16% for the Italian-British comparison, it was only 4% for the closely related Italian-Spanish pair (Figure 4). We then estimated NI for the four pairs, which was found to be 1.0861, 1.0400, 1.0015 and 1.0003 for the Italian-Nigerian, Italian-Chinese, Italian-British and Italian-Spanish pairs, respectively. This translates to an excess of 8.6%, 4.0%, 0.15% and 0.03% of nSNPs within populations compared to those segregating between the populations of the Italian-Nigerian, Italian-Chinese, Italian-British and Italian-Spanish pairs, respectively. The presence of these excess fractions resulted in 47.1%, 30.0%, 15.9% and 4.2% reductions in the FST estimated for the highly constrained nSNPs (C score > 30) of the corresponding pairs of populations. A similar analysis was performed for the Chinese lineage using Chinese-Nigerian, Chinese-British, Chinese-Japanese and Northern-Southern Chinese population pairs. This analysis showed that an excess of 10.3%, 4.0%, 0.1% and 0.03% of nSNPs was present within populations compared to between these pairs of populations. These excess proportions resulted in 46.6%, 29.9%, 8.6% and 4.1% reductions in FST(N) compared to FST(S).
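The link between the small NI excess and the large percentage reduction can be made explicit with a short worked rearrangement. It assumes the relationship FST(N) = 1 − NI(1 − FST(S)), which reproduces the numerical values quoted for equation 6, and it assumes that the exome-wide sSNP estimate of 0.154 reported for the Italian-Nigerian pair is the appropriate FST(S) to pair with the NI of 1.0861.

```latex
\begin{align*}
\frac{F_{ST(S)} - F_{ST(N)}}{F_{ST(S)}}
  &= \frac{F_{ST(S)} - 1 + NI\,\bigl(1 - F_{ST(S)}\bigr)}{F_{ST(S)}} \\
  &= \frac{(NI - 1)\,\bigl(1 - F_{ST(S)}\bigr)}{F_{ST(S)}} \\[4pt]
\text{Italian-Nigerian:}\quad
\frac{(1.0861 - 1)(1 - 0.154)}{0.154}
  &= \frac{0.0861 \times 0.846}{0.154} \approx 0.473
\end{align*}
```

Under these assumptions the expected reduction is about 47%, in line with the 47.1% reported for this pair. The same expression shows why a very small excess can still produce a measurable reduction when FST(S) itself is small, as is the case for closely related populations.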

Discussion
compared to that of sSNPs (FST(N) << FST(S)). In contrast, there are fewer deleterious nSNPs in less constrained regions, and hence the fraction of harmful polymorphisms segregating within populations is expected to be only modestly higher than that segregating between populations (w_w > w_b or NI > 1). This results in a much smaller reduction in the FST estimated for nSNPs present in regions under relaxed selective constraints (FST(N) < FST(S)).
Second, we have shown that the magnitude of reduction of FST at constrained sites was much higher for comparisons involving distantly related populations than for those involving closely related pairs. For instance, the reduction for the Nigerian-Italian comparison (47%) is more than ten-fold higher than that for the Southern European (Spanish-Italian) pair (4.2%). It is well known that deleterious variants are removed over time, and hence only a small fraction (w_b << 1) of them segregate and contribute to constrained-site inter-population diversity for distantly related populations. However, a relatively modest fraction (w_b < 1) of harmful nSNPs contribute to the inter-population diversity for closely related populations, as the elapsed time is not enough to purge most of them. On the other hand, the fraction of deleterious nSNPs within a population (w_w) remains the same (e.g. within Italians) for comparisons involving both distantly (e.g. Nigerian-Italian) and closely (e.g. Spanish-Italian) related populations.
Hence the magnitude of reduction in constrained site FST (with respect to neutral site FST) for distantly related populations is much higher (FST(N) << FST(S)) than that observed for closely related populations (FST(N) < FST(S)).
The findings of this study suggest that the FST estimates for different genes or genomic regions of a genome are not comparable if the levels of selective constraint differ between them. This is particularly important when using FST estimates to detect positive selection, because such methods assume neutral evolution in genes and genomic regions and hence do not account for purifying selection in the estimations 4,6,[10][11][12],15. Our results also strongly indicate that FST estimates obtained from the constrained regions of different pairs of populations are not comparable if the population divergence times of the pairs are not the same. In such cases FST estimations should include only neutral sites to obtain unbiased estimates. However, this is only possible for large genomes such as those of vertebrates, in which constrained regions constitute only a small fraction (~10%) of the genome 24,25. This is an important issue for small genomes such as those of fruit flies, in which >50% of the genome is under selection 26. Therefore, population divergence time needs to be considered when comparing genome-wide FST estimates from different populations.
Figure 1. (A) Relationship between selection intensity and FST. Whole exome data

Figure 3. FST estimates for synonymous and highly constrained nonsynonymous SNPs of the

Figure 4. The magnitude of reduction in FST estimates of nSNPs obtained for four population pairs. The population tree on top is drawn to highlight the correlation between the population divergence and the magnitude of reduction in FST. (A) Italian-Nigerian, Italian-Chinese, Italian-British and Italian-Spanish population pairs. (B) Chinese-Nigerian, Chinese-British, Chinese-Japanese and Northern Chinese (Beijing)-Southern Chinese (Shanghai) population pairs. Error bars are the standard error of the mean and a bootstrap resampling procedure (1000 replicates) was used to estimate the variance.