1. Introduction
Over the last decade, population geneticists have recognized that most traits are highly polygenic, and hence, have moved away from the study of genetic evolution using the single-gene, Mendelian approach, towards models that examine many genes together (i.e., polygenic models). Moreover, research into global variations in complex traits shows a significant amount of differentiation, for example in height [1], cardiovascular disease [2] and BMI [3,4]. Identifying polygenic adaptation is complicated by several factors. First, the presence of hundreds or thousands of genetic variants, each having a small effect, implies that signals of genetic differentiation are diluted, and that methods that target hard sweeps are not sensitive enough [5,6]. Second, identifying a number of SNPs which is sufficient to explain at least 5 or 10% of the total variance in a trait requires very large samples, and these have become available only recently (e.g., [7]). Third, environmental factors that differ across populations can influence the phenotype, hence masking genetic differentiation.
Signals of polygenic selection can be identified by various methods, such as correlation of allele frequencies [5,8,9,10] and the regression of population-average trait values on polygenic scores (PGS) [9]; these have been successfully applied to human stature [3,11,12] and cognitive abilities [9].
The goal of this paper is to test the predictive power of polygenic scores, independently of spatial autocorrelation and noise due to drift and migrations. The prediction is that the polygenic selection model explains average population IQ better than a null model representing only drift and migrations. This implies that the frequencies of alleles with positive effect in the GWAS have different means across different populations.
A further goal of this study is to replicate the effects found by Piffer [8,9], using evidence from the new intelligence and educational attainment GWAS published to date. Piffer [8,9] analyzed educational attainment and intelligence GWAS hits and found a factor that was highly predictive of population IQ. The factor analytic method is based on the assumption that polygenic selection acts as a latent variable which accounts for commonalities among several genetic variants scattered across the genome [8]. The model also includes an error term that captures noise arising from measurement error (limited sample sizes, imperfect coverage) and from genetic drift.
Larger SNP sets become intractable with the factor analytic method; hence, the use of polygenic scores is usually preferred. These scores can be unweighted or weighted. Unweighted PGS are calculated as the sums of the number of effect alleles found in the genome for all trait-associated SNPs. A weighted PGS weights each trait-increasing allele by its odds ratio (for categorical traits) or β-value (for quantitative traits).
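The two scoring schemes can be illustrated with a minimal sketch at the population level, where the count of effect alleles per genome is replaced by the mean frequency of the effect alleles in a population. The function names, the toy frequencies and the β-values are hypothetical, and normalizing the weighted score by the sum of the weights is an assumption made here so that both scores stay on the frequency scale.

```python
def unweighted_pgs(freqs):
    """Mean frequency of the trait-increasing alleles in one population."""
    return sum(freqs) / len(freqs)


def weighted_pgs(freqs, betas):
    """Beta-weighted mean frequency of the trait-increasing alleles.

    betas are assumed to be positive effect sizes, one per SNP, aligned
    with freqs. Dividing by sum(betas) is an illustrative normalization.
    """
    total = sum(betas)
    return sum(f * b for f, b in zip(freqs, betas)) / total


# Toy (invented) effect-allele frequencies for three SNPs in one population
freqs = [0.60, 0.40, 0.50]
betas = [0.02, 0.01, 0.01]

print(round(unweighted_pgs(freqs), 3))       # 0.5
print(round(weighted_pgs(freqs, betas), 3))  # 0.525
```

With these toy values the weighted score is higher than the unweighted one because the SNP with the largest β also has the highest effect-allele frequency.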
The chief limitation of polygenic scores is currently that the GWA studies were carried out on samples of overwhelmingly European descent. This raises two main issues. 1) The GWAS will fail to capture population-specific variants. This does not necessarily bias the PGS in favor of the reference group, as GWAS identify both negative and positive effect variants. For example, a recent GWAS carried out on Peruvians found a population-specific variant that reduces height by about 2.2 cm [13]. Since this variant is polymorphic only in populations of Native American descent, it would have been missed by a European-based GWAS, potentially leading to an overestimation (relative to Europeans) of the PGS for the Peruvian population. A similar scenario might happen with EA polygenic scores, where population-specific variants with negative or positive effects are missed in other populations, leading respectively to over- and under-estimations of the non-European population polygenic score. However, since population-specific variants can have effects of either sign, these will tend to cancel each other out, thus limiting the potential bias. Evidence suggesting that this is the case can be gathered from the polygenic score on height calculated using a European-based GWAS [14], which produced very low scores for Peruvians, the second lowest in the 1000 Genomes samples (see Section 3.4). 2) Since most GWAS hits are not causal variants but “tag SNPs” that are genetically linked with the “true” causal variants, and because patterns of LD vary across populations (for example, Africans have on average much smaller LD blocks), predictive accuracy will be reduced for populations that are genetically distant from the GWAS sample.
Piffer (2017) [15] (also see supplementary material) identified 9 genomic loci that were replicated across three of the largest GWAS of educational attainment (EA) [16,17,18]. The same 9 SNPs were successfully used to predict genetic differences in cognitive ability between ancient and modern samples [19].
In addition, the full set of 2411 genome-wide significant SNPs from the latest GWAS of educational attainment (henceforth, “EDU3”) [7] will be employed.
Lee et al. (2018) [7] also reported a set of 127 putatively causal SNPs (posterior inclusion probability >0.9). These will be used to provide a polygenic score which is, from a theoretical point of view, less subject to linkage disequilibrium decay (see Discussion).
Average estimated population IQ will be used as the phenotype of interest and main dependent variable in the analyses. This choice can be justified by its privileged status in psychometric research and its robust genetic correlation with educational performance [20] and attainment [17]. Moreover, the GWAS hits identified by the three educational GWAS also predict general cognitive ability in their samples. A re-analysis of the Okbay et al. dataset revealed that the polygenic score also predicts general intelligence (3.6%), compared to 2% for the 2013 polygenic score [21]. The latest EDU PGS have been estimated to predict between 3.2% and 11–13% of variance in EA, depending on which set is used, and about 10% in general cognitive ability [7].
Height will be used as a control variable because its highly polygenic nature is comparable to that of cognitive ability, and because it is the anthropometric trait with the largest number of GWA studies available. Moreover, height is known to differ among human populations for genetic and environmental reasons.
Finally, socioeconomic variables will be added as predictors to model the effect of genetics and environment at the group level.
2. Materials and Methods
Lee et al. (2018) [7] reported 2415 SNPs reaching GWAS significance (P < 5 × 10^−8), and 2411 were found in 1000 Genomes. Weighted and unweighted PGS were computed for this sample (“EDU3”). Since some of these SNPs are in LD, PGS was computed also for a set of LD clumped (“LD-free”) SNPs (N = 1267) (for clumping algorithm, see Lee et al., 2018 [7]).
The Genome Aggregation Database (gnomAD) is the largest publicly-available genomic database to date [23], comprising 15,708 genomes from 8 populations. These are divided as follows: African/African American (4359); Latino/Admixed American (424); Ashkenazi Jewish (145); East Asian (780); Finnish (1738); Estonian (2297); North-Western European (4299); Southern European (53).
The SNP frequencies were downloaded using the gnomAD browser v2.1.
Polygenic scores for height were calculated using GWAS hits from the largest meta-analysis to date [14]. Average height was obtained from the largest meta-analysis (NCD-RisC, 2016). However, for sub-populations within countries, separate estimates were retrieved from other sources (e.g., US Blacks vs. US Whites; Toscani in Italy; Chinese North vs. South).
Variables that were found to correlate with height in a previous study [24] were included. Since some of these variables can be expected to be correlated with cognitive development (via health, education and nutrition), they were included in this analysis.
Monte Carlo simulations were performed using a null dataset, consisting of a large sample (N = 2,411,000) of matched random unlinked SNPs (downloaded from 1000 Genomes, phase 3). Matching was carried out using SNPSNAP [25], by supplying the 2411 EA SNPs as input and setting LD r^2 < 0.25 (for EUR), with maximum allowed deviation for MAF = 5%. Unlinked SNPs were used in order to have a sample with independent observations.
The empirical value p = (r + 1)/(n + 1) was calculated, where r is the number of runs (not to be confused with the correlation coefficient) whose Pearson’s correlation with population IQ was higher than the one found using the actual (GWAS-derived) polygenic score, and n is the total number of runs. The corrected formula was provided by [26]. Fst distances were obtained from [9], calculated using Vcftools with 1000 Genomes, phase 3 data. Average population IQ estimates were obtained from Piffer (2015) [9]. Previously published estimates were used also to guarantee that the values were not created post hoc. In addition, recently published estimates of learning/education quality were included [27], based on performance on standardized tests of mathematics, reading, and science by 5-year age groups (from 5 to 19 years) for school-aged children.
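The empirical p-value formula can be sketched as follows. The function name is hypothetical, and the use of "greater than or equal to" when counting null runs is an assumption (the conservative convention); the placeholder null correlations only reproduce the situation reported in the Discussion, where none of 943 null runs reached the observed r = 0.886.

```python
def empirical_p(null_rs, observed_r):
    """Empirical p = (r + 1) / (n + 1).

    r = number of null runs whose correlation is at least as high as the
    observed one (assumed convention), n = total number of null runs.
    """
    exceed = sum(1 for value in null_rs if value >= observed_r)
    return (exceed + 1) / (len(null_rs) + 1)


# Placeholder null correlations: 943 runs, all below the observed value
null_rs = [0.5] * 943
p = empirical_p(null_rs, 0.886)
print(round(p, 5))  # 0.00106, i.e. 1/944
```

Adding 1 to both numerator and denominator prevents the estimate from ever being exactly zero, which is why 943 runs without a single exceedance give p of about 0.001 rather than 0.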
Fst distances were used to partial out spatial autocorrelation, following the method outlined in Piffer (2015) [9], which is based on the Mantel (1967) [28] test. The method as it is commonly employed seeks spatial patterns of genetic variation by comparing genetic distances, estimated by pairwise Fst, with geographic distances. In this study, the PGS distances were employed as the index of genetic distance for the EA-relevant polymorphisms, and the Fst distances were used as an index of “spatial” clustering due to phylogenetic relationships. Thus, the PGS distances were correlated to the Fst distance matrix. Subsequently, pairwise population IQ differences were regressed on Fst and PGS distances to test whether the observed relationship appears only because both variables are “spatially” structured by intrinsic effects (random drift or migrations) or whether the two matrices are “causally” related, indicating positive directional selection or diversifying selection. Various forms of the partial Mantel test have been widely used in ecological settings using geographic distance and some environmental variable such as temperature [29], and its polygenic score form was devised by Piffer (2015) [9] for polygenic evolution modelling.
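A minimal sketch of one common form of the partial Mantel test is given below: the partial correlation between trait distances and PGS distances, controlling for Fst, with significance assessed by permuting the populations of the trait distance matrix. This is an illustration under stated assumptions, not necessarily the exact procedure of [9]; all function names and all numeric values are invented, and the Fst matrix is a placeholder.

```python
import random


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_abs_diff(values):
    """Square matrix of absolute differences between population values."""
    return [[abs(a - b) for b in values] for a in values]


def upper_triangle(m, order=None):
    """Flatten the upper triangle, optionally after relabelling populations."""
    n = len(m)
    idx = order if order is not None else list(range(n))
    return [m[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]


def partial_r(y, x, z):
    """Correlation of y and x after removing the linear effect of z."""
    ryx, ryz, rxz = pearson_r(y, x), pearson_r(y, z), pearson_r(x, z)
    return (ryx - ryz * rxz) / ((1 - ryz ** 2) * (1 - rxz ** 2)) ** 0.5


def partial_mantel(trait, pgs, fst, n_perm=999, seed=0):
    rng = random.Random(seed)
    d_trait = pairwise_abs_diff(trait)
    x = upper_triangle(pairwise_abs_diff(pgs))
    z = upper_triangle(fst)
    observed = partial_r(upper_triangle(d_trait), x, z)
    order = list(range(len(trait)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)  # relabel populations in the trait matrix only
        if partial_r(upper_triangle(d_trait, order), x, z) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Invented toy values for six populations
trait = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
pgs = [0.10, 0.22, 0.38, 0.71, 1.12, 1.58]
fst = pairwise_abs_diff([0.0, 0.3, 0.1, 0.5, 0.2, 0.9])  # placeholder Fst
print(partial_mantel(trait, pgs, fst, n_perm=199, seed=1))
```

Permuting whole populations, rather than individual matrix cells, respects the fact that the pairwise distances are not independent observations.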
HGDP-CEPH data were downloaded from SPSmart [30] and PGS were calculated in R after matching effect alleles (code can be found in the supplemental files). Fst distances and distance from Addis Ababa for HGDP populations were obtained from Handley et al. (2007) [31], and after removal of the non-overlapping samples, 49 matching populations were retained.
Statistical analyses were run using R (R Core Team, 2018) [32].
4. Discussion
The calculation of population-level polygenic scores (average frequencies of alleles with positive GWAS beta) is a promising and quick approach to test signals of polygenic adaptation. The results clearly showed population differences in PGS (Figure 3), which correlated with estimates of average population IQ (Figure 2) and students’ performance on standardized tests of mathematics, reading and science (r = 0.9 and 0.8, respectively).
The EDU3 polygenic score was the most robust to tests of spatial autocorrelation (SAC) (Table 2); that is, it predicted population IQ also after removing SAC by partial Mantel test via Fst distances [9,28]. In fact, when IQ was regressed on the PGS and Fst distances, the latter lost all the predictive value (Beta = 0.045, p = 0.539), whereas the former retained high predictive power (Beta = 0.68, p < 2.2 × 10^−16). Additionally, partial and semi-partial correlations were used as an alternative method to remove SAC, and both yielded significant correlation coefficients (respectively, r = 0.506, p = 7.85 × 10^−18 and r = 0.409, p = 1.304 × 10^−11). Similar results were obtained by using the Learning index instead of IQ.
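For reference, the standard textbook definitions of the two coefficients are given below, with Y denoting the pairwise IQ distances, X the PGS distances and Z the Fst distances. This symbol assignment is an assumption made here for illustration; the text does not state which variable was residualized.

```latex
% Partial correlation: Z is removed from both Y and X
r_{YX \cdot Z} = \frac{r_{YX} - r_{YZ}\, r_{XZ}}
                      {\sqrt{\left(1 - r_{YZ}^{2}\right)\left(1 - r_{XZ}^{2}\right)}}

% Semi-partial correlation: Z is removed from X only
r_{Y(X \cdot Z)} = \frac{r_{YX} - r_{YZ}\, r_{XZ}}
                        {\sqrt{1 - r_{XZ}^{2}}}
```

Because the two expressions share a numerator and the partial coefficient has the smaller denominator, the semi-partial coefficient can never exceed the partial one in absolute value, which agrees with the ordering of the two reported values (0.409 vs. 0.506).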
The polygenic score computed for MTAG-derived Cognitive Performance SNPs was highly correlated to EDU3 (r = 0.92) and to population IQ (r = 0.86).
More strength is given to the present findings by the high replicability of the same polygenic model from a previous publication (Piffer, 2015 [9]). Piffer (2015) [9] calculated polygenic and factor scores for 1000 Genomes populations using the data available at the time from GWAS that relied on much smaller sample sizes. Since the correlation between the present study’s EDU3 and previous estimates of polygenic scores is very high (r = 0.95–0.98), and the correlations with the average population IQ estimates from the previous study are similarly high (Figure 2), the claim that this finding could be post hoc is ruled out.
A Monte Carlo simulation using 943 PGS computed from sets of 2411 randomly matched SNPs showed that none of them reached a correlation with population IQ equal to or higher than EDU’s (r = 0.886), corresponding to p = 0.001.
Height was used as a control variable, due to its similar polygenic nature to cognitive ability. The height PGS was positively correlated with average population height but it had a weak negative association with EDU PGS and with population IQ (Figure 8). The lack of a relationship between the two PGS and the lack of an association with a phenotype (i.e., IQ) different from the GWAS trait (i.e., height) suggests that potential biases (e.g., favoring the European reference population) in the GWAS procedure do not drive the correlation between EDU PGS and IQ. Remarkably, whilst East Asians scored at the top of the EDU PGS, they also scored at the bottom of the height PGS, in accordance with their relatively smaller physical size.
The results were replicated in a new dataset (gnomAD), comprising a much larger sample of individuals, where the correlation between population IQ and PGS was 0.98 (Table 5 and Figure 9). This dataset included a sample of 145 Ashkenazi Jewish individuals. The IQ of Ashkenazi Jews has been estimated to be around 110 [34]. Remarkably, their EDU polygenic score was the highest in our sample, corresponding to a predicted score of about 108, mirroring preliminary results from a smaller (N = 53) sample (Dunkel et al., 2019) [34].
The large sample of Finnish individuals present in gnomAD also replicated their polygenic score advantage found in 1000 Genomes, closely mirroring the advantage over other European populations observed in scholastic aptitude and intelligence tests [33].
One-way ANOVA found differences in mean allele frequencies between populations in both the 1000 Genomes and gnomAD datasets.
The overall results were replicated using a larger sample of populations from the HGDP-CEPH panel (N = 52), which showed similar population and continental rankings of polygenic scores (Figure 11 and Figure 12). A positive correlation with latitude was found (r = 0.57), but a small negative one with distance from Eastern Africa (r = −0.29). The latter finding casts doubt on the hypothesis that migratory patterns account for population differences in polygenic scores. The positive correlation with latitude could reflect selection pressures due to climate, but further evaluating this hypothesis would require exhaustive simulations with null SNPs, using an approach similar to Berg and Coop (2014) [10], which are beyond the scope of this paper.
The much higher East Asian scores (Figure 3 and Table 5) suggest that strong selection pressure on East Asians continued after the East Asian-Native American split, about 15 kya at the earliest (the earliest estimate of a migration across the Bering Strait into the Americas) [8]. This date has been disputed. There is evidence from ancient genomes that the split between northeast Asians and Native Americans happened 20 kya when both were still residing in NE Asia, and there was continued gene flow until the Native American group crossed the Bering Strait into North America around 15–14 kya [39]. It is possible that the extremely low population density in the Americas reduced intraspecific competition (hence, selection pressure on cognitive ability was lower), but this topic is widely open to debate and speculation.
A limitation of the present study is its reliance on estimates of population IQ as the phenotypic variable, which are not perfectly accurate and may also reflect environmental and economic differences between populations. Moreover, the EA GWAS can capture genetic variation that contributes to educational attainment via mechanisms other than IQ (e.g., conscientiousness).
The moderate amount of variance explained at the individual level by the full set (11–13%) and by the subset of significant hits (3.2%) (Lee et al., 2018) [7] is not an issue of primary importance. Indeed, predicting group-level variance is different from predicting variance within a group. The important difference between trans-ethnic and within-population phenotypic predictions, as measured by the amount of variance explained, is that while the latter is maximized by using the full set of GWAS SNPs, in the former, using low significance SNPs risks introducing too much noise, such as elements of drift or migrations, that would dilute the signal derived from selection pressures on a specific trait. Hence, the two problems require different approaches to SNP selection and optimal PGS construction.
Deciding the optimum number of SNPs can be done empirically (by picking the significance threshold that results in the highest correlation with the average phenotypic population value). In our data, there was degradation of signal across significance quantiles, as shown by a weak trend for lower significance SNPs to have lower correlation with population IQ (Figure 7 and Figure 10). More remarkably, the PGS generated from most SNP subsets had lower predictive power than that of the full set. For example, for the subsets of the 10k lead SNPs in 1000 Genomes, the average correlation coefficient (between each subset’s PGS and IQ) was r = 0.470, whereas the correlation coefficient of the full set with IQ was r = 0.868. In the gnomAD dataset, the average correlation was 0.61, whereas the correlation of the full (clumped) PGS was 0.93.
This suggests the presence of a trade-off between quantity and quality of SNPs relative to gains in predictive power. That is, a larger number of SNPs increases the signal, yet it introduces more noise due to the inclusion of lower significance SNPs. The maximal predictive power was reached by selecting SNPs which met the conventional GWAS significance threshold (P < 5 × 10^−8), whilst picking only higher significance SNPs reduced the predictive power (due to the reduced number of SNPs). Weighting by effect size is traditionally the most commonly employed weighting method. However, since no increase in predictive accuracy was observed in this study for the effect size weighted PGS vs. the unweighted (raw frequency) PGS, p value weighting should be considered as a valid alternative for PGS computation, particularly when including SNPs that are below the conventional GWAS significance threshold. The predictive accuracy of the PGS in this study is already close to its ceiling, given the high correlation with population IQ, but such methods could be used in other studies to improve PGS construction. Reviewer 2 suggested the following procedure for optimal PGS construction that could be used in future studies: “Start with the quantile that has the most significant SNPs, and then add quantiles in declining order of genome-wide significance. Initially, adding quantiles will improve prediction, but after a certain point, adding more quantiles will make prediction worse. At that inflection point you have the optimal PGS”.
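The procedure suggested by Reviewer 2 can be sketched as follows, assuming an unweighted population-level PGS (mean effect-allele frequency) and Pearson's correlation with the population phenotype as the measure of prediction. The function names, the data layout and all toy values are hypothetical.

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_cumulative_pgs(snps, phenotype, n_quantiles):
    """Add significance quantiles one at a time and track the correlation.

    snps: list of (p_value, freqs), where freqs holds the effect-allele
    frequency in each population, in the same order as phenotype.
    Returns the best (n_snps, r) pair and the full list of results.
    """
    ranked = sorted(snps, key=lambda s: s[0])  # most significant first
    size = -(-len(ranked) // n_quantiles)      # ceiling division
    results = []
    for q in range(1, n_quantiles + 1):
        subset = ranked[: q * size]
        pgs = [
            sum(freqs[k] for _, freqs in subset) / len(subset)
            for k in range(len(phenotype))
        ]
        results.append((len(subset), pearson_r(pgs, phenotype)))
    return max(results, key=lambda t: t[1]), results


# Invented toy data: four SNPs, three populations
phenotype = [1.0, 2.0, 3.0]
snps = [
    (1e-12, [0.2, 0.4, 0.6]),
    (1e-9, [0.3, 0.5, 0.6]),
    (1e-3, [0.9, 0.1, 0.5]),
    (1e-2, [0.8, 0.5, 0.1]),
]
best, results = best_cumulative_pgs(snps, phenotype, n_quantiles=2)
print(best[0], round(best[1], 3))  # 2 0.997
```

In this toy example the two most significant SNPs alone give the highest correlation, and adding the less significant quantile reverses the sign of the correlation, which is the kind of degradation the procedure is meant to detect.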
A persistent issue is that the trans-ethnic validity of PGS is compromised by LD decay. This is the decay in linkage disequilibrium with time, meaning that the older the causal polymorphism, the lower the level of linkage disequilibrium due to recombination events that occur with constant probability in every generation. As a consequence, linkage patterns can be different in different populations, especially those that separated a long time ago and that underwent population bottlenecks after separation.
The effect of LD decay on comparison of risk alleles between populations is still unclear. Since most GWAS hits are actually tag SNPs, decay in LD implies that the causal SNPs will be less efficiently flagged by the tag SNPs, until the tag SNPs resemble a sample of random SNPs. With less significant associations, it is not only more likely that the distance between the GWAS hit and the causal polymorphism is larger and linkage is weaker in the European-origin populations that are represented in the GWASs, but it is also more likely that the linkage phase is different in different populations.
LD is sensitive to coverage and in older studies using low coverage genomic data (e.g., 1000 Genomes phase 1), it was found to reduce the reproducibility of findings [40]. However, contemporary GWAS use higher coverage data (e.g., 1000 Genomes phase 3); hence, this issue is less important.
Moreover, simulations found that the effect of LD decay on true causal variants was null to negligible [41]. The present study, by focusing only on the most significant hits (N = 2411), increased the likelihood of hitting on or very close to causal variants, hence reducing the artifact of LD decay. Moreover, the analysis was replicated using a set of putatively causal SNPs (N = 125) from the Lee et al. (2018) [7] paper. The correlation with population IQ was still high (r = 0.82 and 0.85 with the weighted and unweighted PGS, respectively) although not as high as that achieved by the larger set: this could be caused by the loss of signal due to the much smaller number of SNPs.
In contrast, Lee et al. (2018) [7] analyzed the association between EDU PGS and years of education in an older African American sample. Given their use of all SNPs regardless of significance, it is not surprising that the cross-ethnic validity of their scores was drastically reduced [7]. Moreover, since the heritability of education among older African Americans is unknown and the predictive validity of PGS depends on the population-specific trait heritability, we do not know if the reduction in trans-ethnic validity was due to a reduced heritability.
In fact, it is well known that polygenic scores perform better in European populations, and prediction accuracy is reduced by approximately 2- and 5-fold in East Asian and African American populations, respectively [42].
A recent study has replicated the validity of the PGS of Lee et al. (2018) [7] on a sample of African Americans; the authors found that a higher EA PGS was associated with higher probability of college completion and math performance, although not with reading achievement. There was also a negative association with criminal record status. However, there was an attenuation compared to the PGS effect among White subjects. The authors attribute this to various potential factors besides LD decay: 1) the study used a sample low in socioeconomic status (SES), where shared environment plays a bigger role than in high SES environments [43]; and 2) measures of math and reading performance were obtained in early childhood, a developmental period during which the importance of genetic influences on intelligence is lower (and that of shared environment is higher) compared to young adulthood [43]. Nonetheless, this study provides evidence for the (partial) transferability of EA polygenic scores to African Americans.
Additionally, trans-ethnic GWAS meta-analyses on other traits have also found genetic variants with little heterogeneity between ancestry groups [44,45]. A recent GWAS of schizophrenia found that approximately 95% of SNPs from a Western GWAS had consistent direction of effect in the Chinese sample, and this was significant for about half of those (Li et al., 2017) [46].
We postulate that this common core of causal genetic variants with trans-ethnically homogeneous effects is what drives the association between polygenic scores and average trait values (population IQ or height), and the group differences in mean PGS. This would be superimposed on a background of heterogeneity of allelic effects, thus adding noise to the data.
In fact, LD decay is expected (from a theoretical perspective) to create noise and follow drift, and not necessarily to produce a bias in a direction that favours the hypothesis of this study [40]. Since the frequency of the average SNP allele is 50%, the tag SNPs will tend to converge towards an average frequency of 50% with increasing LD decay. The implication of this for our analysis is that when the polygenic scores are below 50%, our estimates will be inflated, and vice versa for the polygenic scores which are above 50%, because LD decay pushes the polygenic scores up (or down) towards the background frequency of 0.5. In the present case, the average frequency of alleles with positive effect is around 50% for CEU and CHS (49.7% and 50.1%, respectively); hence, LD decay should produce only a tiny bias in the estimate (upward and downward bias, respectively). However, in populations whose average allele frequencies are farther away from 50%, the bias will be stronger. For example, the average YRI frequency is 47.5%, so LD decay produces an overestimate of the PGS. In other words, the frequency of causal SNPs is expected to be lower than 47.5%.
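The direction of this bias can be illustrated with a deliberately simple toy model, which is an assumption introduced here and not part of the analysis: the observed (tag SNP) mean frequency is taken to be a linear shrinkage of the causal mean frequency towards the 0.5 background, with a hypothetical "retention" factor between 0 (complete LD decay) and 1 (no decay).

```python
def implied_causal_freq(observed, retention, background=0.5):
    """Invert the toy model: observed = background + retention * (causal - background).

    retention is a hypothetical parameter in (0, 1]; 1 means no LD decay.
    """
    if not 0 < retention <= 1:
        raise ValueError("retention must be in (0, 1]")
    return background + (observed - background) / retention


# YRI observed mean frequency from the text (47.5%); retention of 0.8 is invented
print(round(implied_causal_freq(0.475, 0.8), 3))  # 0.469
# CHS observed mean frequency from the text (50.1%): the correction is tiny
print(round(implied_causal_freq(0.501, 0.8), 3))  # 0.501
```

Under this toy model any retention below 1 places the causal frequency farther from 0.5 than the observed one, so an observed value of 47.5% implies a causal frequency below 47.5%, while values already close to 50% are barely affected.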
A preliminary test of this hypothesis was carried out by employing 125 candidate causal SNPs from the Lee et al. (2018) [7] GWAS. The unweighted EDU3-Causal PGS difference was negatively correlated (r = −0.62) to the unweighted EDU3 PGS. In other words, the lower the frequency of the EDU3 PGS, the higher the estimate in the larger set than in the set of putatively causal SNPs, suggesting that LD decay leads to an overestimation of the PGS in the populations with lower PGS, as predicted by theory.
Adding socioeconomic variables to the model slightly (but significantly) increased the predictive power (from 78–80% to 85–89%), although the PGS explained twice as much of the variance (70% vs. 35%) as HDI, average protein consumption or child mortality (Table 3). The reverse was the case for height, where socioeconomic factors explained much more of the variance in average height than the PGS (Table 4). This is in line with heritability studies which show the importance of the shared environment for adult height even in rich Western countries (about 10%) and more so in non-Western countries [47], and with the dramatic secular trend in height. On the other hand, the shared environmental impact on IQ or g in adulthood is typically found to be near zero [48], although this might not be the case in developing countries or deprived rearing environments. Indeed, the IQ of African Americans appears to be higher than what is predicted by the PGS (Figure 2), which suggests this cannot be explained by European admixture alone, but it could be the result of enjoying better nutrition or education infrastructure compared to native Africans. Another explanation is heterosis (“hybrid vigor”), that is, the increase in fitness observed in hybrid offspring thanks to the reduced expression of homozygous deleterious recessive alleles [49]. The Sri Lankan UK population also constitutes an outlier, because their IQ is lower than that predicted by the PGS. This does not contradict the previous statement, because the IQ estimate obtained from Piffer (2015) [9] was based on native Sri Lankans, since estimates for Sri Lankans living in the UK were not available. Given the moderate impact of environment, it is likely that the IQ of Sri Lankans living in the UK is actually higher than that of native Sri Lankans.
Testing more sophisticated models with larger sets of socioeconomic variables would go beyond the scope of this paper but it is an interesting direction for future research.
Future GWAS studies should be carried out on non-European populations. Indeed, trans-ethnic GWASs are a promising resource for the identification of alleles with homogeneous and heterogeneous effects and the computation of population-specific polygenic scores. Specifically, they would enable us to include SNPs that are polymorphic only in some populations, and to find the causal SNPs that have the same causal effect in all populations.