GWAS have had a history of success in the study of complex traits, enabling the identification of the genomic loci involved in these phenotypes for the first time. Indeed, GWAS have so far discovered more than 276 thousand genomic associations for more than 4 thousand traits and diseases [49,50,51]. However, almost 20 years of analyses have also highlighted their limitations, which preclude more genomic associations from being identified [21,22]. Here, we discuss the main critical points of GWAS in detail, and we explain how the methodology can be extended to mitigate some of these. Next, we describe the most common complementary approaches and the existing alternatives that are attempting to solve these limitations.
4.1. Power and Sample Size
One of the main concerns in a GWAS is whether the study is powered enough to detect any association with a trait. The statistical power of association for a given variant strongly depends on the magnitude of its effect size and on its frequency in the population. Strong effect sizes are easier to capture, and common variants generally provide higher power. However, due to evolutionary selective pressures, effect sizes and frequencies are generally inversely correlated, with rarer alleles showing stronger effects. In practical terms, current GWAS have mostly revealed associations for common variants with odds ratios of around 1.05–1.3 [52].
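The interplay between sample size, allele frequency, and effect size can be illustrated with a back-of-the-envelope power calculation. The sketch below is not taken from any specific GWAS tool; it assumes an additive model with a per-allele effect on a standardised quantitative trait, approximates the one-degree-of-freedom chi-square test by a two-sided Z-test with non-centrality sqrt(2N·p(1−p))·β, and uses the conventional genome-wide significance threshold of 5 × 10⁻⁸:

```python
from statistics import NormalDist

def gwas_power(n, maf, beta, alpha=5e-8):
    """Approximate power to detect a per-allele effect `beta` (standardised
    trait, additive model) at allele frequency `maf` with `n` samples,
    treating the chi-square(1) test as a two-sided Z-test at level `alpha`."""
    nd = NormalDist()
    ncp = (2 * n * maf * (1 - maf)) ** 0.5 * abs(beta)  # sqrt of chi2 non-centrality
    z_alpha = nd.inv_cdf(1 - alpha / 2)                 # ~5.45 at genome-wide alpha
    return nd.cdf(ncp - z_alpha) + nd.cdf(-ncp - z_alpha)

# Same effect size: reasonably powered for a common variant,
# essentially undetectable for a rare one at this sample size.
p_common = gwas_power(10_000, maf=0.30, beta=0.1)
p_rare = gwas_power(10_000, maf=0.01, beta=0.1)
```

The steep dependence on `maf` and `n` is exactly why rare variants and small effects stay below the significance threshold until sample sizes grow.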
A natural way to increase power in GWAS is to increase the size of the sample under study (N). Increasing sample size would allow the identification of smaller effects for common variants as well as open the possibility to study rare variants. Motivated by this need, large-scale initiatives have been established in the form of international consortia to pool multiple resources and thus generate larger cohorts for subsequent analyses. These efforts have pushed the discovery of new loci and our understanding of complex disease genetics [53,54,55,56]. Further, biobanks have been established to make these large collections of genotypic and phenotypic data available for future studies [57,58,59]. However, given the sensitive nature of these genomic and medical data, accessibility restrictions have been put in place, which often hinder or discourage their reutilisation by further scientific efforts.
Another commonly used strategy to increase sample size in GWAS is meta-analysis, based on the statistical combination of previous GWAS results from different studies on the same phenotype. Requiring only GWAS summary statistics (e.g., sample size, effect sizes and p-values), meta-analyses are far more cost-effective than the generation of new genotype–phenotype datasets and thus have been used extensively [13,60,61].
Meta-analysis approaches are based on a weighted sum of the effects obtained in each of the studies, thus providing an estimate of the association of each genetic marker over all of them. For example, in a meta-analysis of M studies where each variant i has been assigned an effect Z_{ij} for the j-th study, a Stouffer's Z-score can be calculated by assigning a weight w_j (commonly the square root of the sample size of the j-th study) to the estimated allelic effect in each study, so that the allelic effect across all the studies will be

Z_i = \frac{\sum_{j=1}^{M} w_j Z_{ij}}{\sqrt{\sum_{j=1}^{M} w_j^{2}}}

which estimates the association to disease over all tests.
In addition, the genetic heterogeneity between the different studies is measured based on Cochran's Q-test, with the statistic

Q_i = \sum_{j=1}^{M} w_j \left(\beta_{ij} - \bar{\beta}_i\right)^{2}

for each SNV i, where \beta_{ij} is the effect of variant i in study j, w_j = 1/SE_{ij}^{2} is its inverse-variance weight, and \bar{\beta}_i is the weighted mean effect across studies. This measure helps to detect associations that are not consistent across the studies, which might then be filtered out if necessary.
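As an illustration, both statistics can be computed directly from summary statistics. The Python sketch below uses hypothetical per-study numbers (the z-scores, sample sizes, effects, and standard errors are invented) and assumes the common choices of w_j = sqrt(N_j) for Stouffer's weights and inverse-variance weights for Cochran's Q:

```python
import math

# Hypothetical summary statistics for one variant across M = 3 studies.
z_scores = [2.1, 1.8, 2.5]              # per-study association z-scores
sample_sizes = [10_000, 25_000, 15_000]
betas = [0.12, 0.10, 0.15]              # per-study effect estimates
std_errs = [0.05, 0.04, 0.06]           # their standard errors

def stouffer_z(zs, ns):
    """Sample-size-weighted Stouffer's Z with w_j = sqrt(N_j)."""
    ws = [math.sqrt(n) for n in ns]
    return sum(w * z for w, z in zip(ws, zs)) / math.sqrt(sum(w * w for w in ws))

def cochran_q(bs, ses):
    """Cochran's Q heterogeneity statistic with inverse-variance weights."""
    ws = [1.0 / se ** 2 for se in ses]
    b_bar = sum(w * b for w, b in zip(ws, bs)) / sum(ws)
    return sum(w * (b - b_bar) ** 2 for w, b in zip(ws, bs))

meta_z = stouffer_z(z_scores, sample_sizes)  # combined evidence across studies
het_q = cochran_q(betas, std_errs)           # heterogeneity across studies
```

Sub-threshold signals in the individual studies can thus combine into a genome-wide significant `meta_z`, while a large `het_q` flags variants whose effects are inconsistent between studies.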
Despite their proven value in increasing power, large sample sizes in GWAS nonetheless present many challenges. The recruitment and genotyping of individuals can be extremely expensive in terms of time and resources. Despite having received more attention in recent years, data sharing is still limited and difficult, even in the form of summary statistics. Further, recent studies have estimated that unprecedented sample sizes, in the order of millions, might be needed to capture the entire spectrum of the variants associated with a trait [62]. Strategies other than simply increasing the number of analysed samples might thus be more feasible for increasing discovery power and will be briefly discussed in the following sections.
4.2. Increasing the Number of Genomic Variants
Another important factor in determining the discovery power is the correlation (LD) existing between the interrogated variants and the real, underlying causal variant [47]. Higher discovery power can be achieved by increasing the number of tested variants, thus obtaining a higher density coverage of the genome and increasing the probability of directly testing variants that are strongly correlated with the causal ones. However, as described in Section 2, GWAS typically use DNA microarray technologies, which only provide the genotypes for a limited subset (0.5 to 2 M) of all of the SNVs in a genome [63].
A technique that is commonly used to increase the number of variants that can be tested in a GWAS is genomic imputation. Starting from genotyping array data, genotypes of over 10 million variants can be inferred for an entire group of individuals (also named a cohort) [64], with a reduced number of missing values [65,66].
Imputation is usually preceded by a phasing step, in which haplotypes for each individual are inferred starting from genotypes, typically from array data. Then, the studied haplotypes are statistically compared with those in reference panels, which are panels of thousands of individuals with deeply characterised haplotypes [15,67,68,69,70,71]. Through this comparison, the genotype probabilities for variants in the reference panels are imputed into the cohort haplotypes [72]. Several methods and tools have been developed to phase and impute [65,73,74,75,76]. Most of them are essentially based on Markov Chains (MC), Hidden Markov Models (HMM), Markov Chain Monte Carlo (MCMC), and the expectation-maximisation algorithm [28,77]. Other tools have also been developed to combine the imputation results from different panels [14].
As a result, consider a population of N individuals in which M variants v_i, i = 1, …, M, are inspected for each individual n. Each variant genotype can take a value from the space of genotypes G = {AA, Aa, aa}. Based on the space defined by the genotype, each genomic variant can be considered as a simple random variable X_i so that X_i : \Omega \to G, with \Omega as the space of events. Under this scenario, the imputation model can be formalised by first stating that each variant genotype g_{i,n} for the individual n has a corresponding pair of haplotypes, defined by a function f : G \to H, where H is the haplotype space. Thus, the haplotype space H is a partition of the genotype space G. For simplicity, each genotype can be written as a pair set (h^1, h^2). The aim of imputation is to infer the missing genotypes based on the posterior probability P(g \mid h) for each individual in an LD region by comparing the individual haplotypes in that region with the N haplotypes h_1, …, h_N present in a reference panel (Figure 4).
For example, in Hidden Markov Model (HMM) approaches, the posterior probability of each genotype, given the reference haplotypes H, can be calculated as

P(g \mid H) = \sum_{z} P(g \mid z, H)\, P(z \mid H) \quad (6)

where z denotes the sequence of hidden states (the reference haplotypes being copied at each position), the term P(z \mid H) is the prior probability for each hidden state change along the sequence, and P(g \mid z, H) models the probability that the genotype will be similar to the haplotypes that are copied from the reference. By estimating the genomic recombination rate \rho across the region based on the effective population size, and the mutation rate \theta, Equation (6) can be simplified to

P(g \mid H) = \sum_{z} P(g \mid z, H, \theta)\, P(z \mid H, \rho).

Given that both \rho and \theta can be estimated from the population of study and that the hidden states can be inferred from the HMM, this model can be used to infer missing genotypes in the study population.
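To make the copying model concrete, the sketch below implements a toy Li–Stephens-style HMM in pure Python. The reference panel and the RHO/THETA values are invented for illustration, and real imputation tools are far more sophisticated; the point is only to show the mechanics. The hidden state at each site is the reference haplotype being copied, recombination drives state switches, mutation lets the emitted allele differ from the copied one, and the forward–backward posteriors give the probability of each allele at a masked site:

```python
# Toy Li–Stephens-style copying HMM (illustrative only; panel and parameters
# are invented). Hidden state z_t = which reference haplotype is copied at
# site t; RHO governs recombination (state switches), THETA mutation.

REF = [                      # hypothetical reference panel: 4 haplotypes, 5 sites
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
]
RHO, THETA = 0.1, 0.01       # assumed recombination / mutation parameters

def emission(obs, ref_allele):
    """P(observed allele | copied reference allele); missing sites are uninformative."""
    if obs is None:
        return 1.0
    return 1.0 - THETA if obs == ref_allele else THETA

def forward_backward(hap):
    K, T = len(REF), len(hap)
    stay = 1.0 - RHO + RHO / K     # keep copying the same haplotype
    switch = RHO / K               # recombine onto any given other haplotype
    fwd = [[0.0] * K for _ in range(T)]
    for k in range(K):
        fwd[0][k] = (1.0 / K) * emission(hap[0], REF[k][0])
    for t in range(1, T):
        total = sum(fwd[t - 1])
        for k in range(K):
            trans = switch * total + (stay - switch) * fwd[t - 1][k]
            fwd[t][k] = trans * emission(hap[t], REF[k][t])
    bwd = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        nxt = [bwd[t + 1][k] * emission(hap[t + 1], REF[k][t + 1]) for k in range(K)]
        total = sum(nxt)
        for k in range(K):
            bwd[t][k] = switch * total + (stay - switch) * nxt[k]
    return fwd, bwd

def impute(hap, t):
    """Posterior probability that the masked allele at site t equals 1."""
    fwd, bwd = forward_backward(hap)
    post = [f * b for f, b in zip(fwd[t], bwd[t])]
    z = sum(post)
    return sum(p for p, row in zip(post, REF) if row[t] == 1) / z

# Impute site 2 of a study haplotype that matches the first two reference
# haplotypes (which disagree only at the masked site) at every observed site:
posterior_one = impute([0, 1, None, 0, 1], 2)
```

Because the study haplotype is equally compatible with a reference haplotype carrying each allele at the masked site, the posterior splits close to evenly; a haplotype matching a single reference sequence would instead be imputed with high confidence.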
The accuracy of the different imputation methods can be assessed by masking known genotypes and imputing them using surrounding variants. The correlation between the estimations and the true values can then be used to measure the imputation accuracy. Based on this method, current error rates range between 5.10% and 6.33% [28].
Genotype imputation offered the possibility of comprehensively investigating variants throughout the genome, including rare variants, at a large scale for the first time. However, the imputation of rare variants still presents difficulties. Although rare variants are present in reference panels, these are usually in low LD with the common variants on the genotyping array and are therefore imputed with less accuracy. Further, rare variants tend to be more private, and only a fraction of them may be present in reference panels; thus, only a few can be imputed. In the future, when whole genome sequencing becomes affordable for large studies, the imputation process will cease to be necessary, since all of the genomic variants will be obtained directly from the DNA of the participants. Until then, however, genotype imputation provides the best alternative for comprehensive GWAS.
4.3. Genetic and Population Heterogeneity
Genetic heterogeneity between individuals of shared ancestry or between those of different ancestries is a factor that further complicates the study of polygenic traits. The same apparent phenotype (especially diseases) might be the result of different combinations of genomic variants in different individuals. Genetic heterogeneity is typically overlooked in GWAS, as individuals with the same broad disease are considered as a homogeneous group of cases. In this scenario, GWAS can only capture the most shared signals, and less prevalent genomic associations might be masked.
An attempt to reduce this issue has been made by classifying cases into sub-groups using multiple clinical variables or by defining sub- or endo-phenotypes. For example, a disease such as Type 2 diabetes is broadly defined by a high content of glucose in the blood, but different clinical sub-types have recently been identified using measures such as age of disease onset or body-mass index [78]. The rationale is that these phenotypic sub-groups might reflect more genetically homogeneous groups and may thus help us to identify the underlying genomic loci that differentiate them. Even though this strategy entails a reduction in sample size due to fragmentation, the power to discover the underlying genomic factors could be increased, as the greater homogeneity and lower variability of the data reduce the dilution of the relevant signals [79].
Genetic heterogeneity is also significant between individuals of different ancestral backgrounds due to differences in variant frequencies (e.g., a rare variant in one ancestry might be common in another) and LD patterns. Early GWAS were performed with individuals of predominantly European or Caucasian ancestry, which raised the question of their relevance for individuals of other ancestries. Moreover, the possibility remained that common variants were only associated with complex diseases because they were in LD with rare, high-impact variants that were specific to the studied ancestry and thus that these associations would not replicate in other ancestries.
Since then, trans-ancestry (also named trans-ethnic) studies, which analyse samples of multiple ancestries together, have shown that the variants associated with complex traits and diseases in these studies were predominantly consistent with those identified in ancestry-specific studies [80,81,82]. These findings suggest that these phenotypes are indeed driven by common variants and that their genetic architecture is mostly shared across different ancestries.
Albeit burdened with increased sample-collection and analytical complexities, these large studies have advanced population genomics and increased the genetic understanding of complex traits [82,83].
4.5. Biological Interpretation and Clinical Implications
GWAS have been successful in identifying multiple loci that are associated with complex traits. However, the biological interpretation and clinical application of these findings has proven to be very challenging.
First, because of linkage disequilibrium, GWAS can only provide associated genomic loci, encompassing multiple correlated variants. In addition, GWAS identify statistical associations, but it is well established that association does not imply causation. To attempt to overcome these limitations, further computational and experimental studies need to be pursued. Computational approaches include gene expression studies and enrichment analyses of gene, pathway, epigenomic, and regulatory elements or Mendelian randomisation analyses, which are used to gain further biological insights [108,109]. Simultaneously, wet-lab experiments with cell lines, model organisms, or further human studies also need to be used to answer the biological hypotheses that are inferred from these analyses.
As an attempt to produce some clinical insight directly from GWAS results, Polygenic Risk Scores (PRS) have recently been developed. PRS are based on the premise of evaluating the total risk of disease of a genome by considering all of its genomic variants with known disease associations [110].
Particularly, PRS compute the relative risk of an individual from the population of study to develop a disease. Therefore, in a study of a population with N individuals, for each individual n, given M genomic variants v_i, i = 1, …, M, whose genotypes can take a value from the genotype space G, GWAS models can be applied to estimate the effect \beta_i of each genotype (Section 3.2). Then, a PRS can be calculated based on the sum of the individual genotypes g_{i,n} weighted by the effects \beta_i estimated for those genotypes in the GWAS analysis [111]. Thus, each individual score S_n is calculated using the equation

S_n = \sum_{i=1}^{M} \beta_i\, g_{i,n}.

As each individual n will have an associated score S_n, the score can be treated as an independent variable explaining the phenotype Y_n of the individual. Consequently, under a similar scenario to the one explained in Section 3.2.2 for binary traits, Y_n \in \{0, 1\}, with \pi = P(Y_n = 1 \mid S_n) being the probability of an individual being diseased given a particular score. Therefore, the logit can be applied to the ratio between the probabilities of the individual having the disease or not, given a particular score, to fit the logistic regression model

\mathrm{logit}(\pi) = \ln\!\left(\frac{\pi}{1 - \pi}\right) = \alpha + \gamma S_n.

For quantitative traits, where the individual phenotype takes values Y_n \in (\mathbb{R}, \mathcal{B}), with \mathcal{B} the Borel \sigma-algebra, a linear regression model could then be fitted to explain the phenotype based on the individual's score as

Y_n = \alpha + \gamma S_n + \varepsilon_n.
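The score computation itself is a simple weighted sum, as the following minimal sketch shows. All numbers here are invented for illustration; in practice the effects \beta_i would come from GWAS summary statistics and the genotypes g_{i,n} from real genotype data:

```python
import random

random.seed(0)

# Hypothetical GWAS output: per-variant effects beta_i and risk-allele
# frequencies f_i for M variants (all values invented for illustration).
M, N = 100, 1_000
betas = [random.gauss(0.0, 0.05) for _ in range(M)]
freqs = [random.uniform(0.05, 0.5) for _ in range(M)]

def simulate_genotypes():
    """g_in = number of risk alleles (0, 1 or 2), drawn per variant."""
    return [sum(random.random() < f for _ in range(2)) for f in freqs]

def prs(genotypes):
    """S_n = sum_i beta_i * g_in, the weighted sum from the text."""
    return sum(b * g for b, g in zip(betas, genotypes))

# Score a simulated cohort and pick out the tails of the distribution,
# which is where PRS typically separate risk groups.
scores = sorted(prs(simulate_genotypes()) for _ in range(N))
bottom5 = scores[: N // 20]          # lowest-risk 5% of individuals
top5 = scores[-(N // 20):]           # highest-risk 5% of individuals
```

The fitted logistic or linear model from the equations above would then regress each individual's phenotype on their score S_n.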
The distribution of the scores across the population of study follows a normal distribution, in which the left tail contains the individuals with the lowest risk of developing the disease and the right tail those with the highest risk (Figure 5). However, although the use of PRS has shown potential, statistically significant differences in disease risk are typically only found when comparing the individuals at the tails of the distributions (e.g., the individuals with the highest 5% of scores have a 3x higher risk of disease than those with the lowest 5% of scores), thus only providing limited insights for the majority of the population.
Overall, the combination of cell biology studies [112,113] with GWAS results has produced a greater understanding of the biology behind complex diseases [56]. However, the study of the specific biological mechanisms that mediate the association between genotype and disease remains one of the main open fields of study in biomedicine, and the advancement of personalised medicine depends on its success.
4.6. Comprehensive GWAS Strategies for New Discoveries: An Example
As detailed in the previous sections, different strategies can be put in place to achieve good power and to produce discoveries in GWAS. Here, we describe an example of how an improved, comprehensive methodology for GWAS can reveal novel association loci in a previously analysed, publicly available cohort. In this study [14], 22 age-related diseases were analysed in 62,281 subjects from the GERA cohort. Ninety-four significant loci were identified, of which twenty-six had never been reported before, despite the fact that the data had already been previously analysed.
A first essential feature in driving novel discovery was an extended imputation step. Imputation was performed using four reference panels yielding 16,059,686 variants to test for association. The variants encompassed a broad spectrum of frequencies and types, including 2.6 M low-frequency and 5.5 M rare variants as well as 1.6 M small insertion/deletions (indels), which are normally absent from DNA microarrays and were thus excluded from analysis. Indeed, 3 of the 26 new loci corresponded to low-frequency variants, and 7 corresponded to rare variants. Further, only a fraction of the 26 new loci would have been genome-wide significant if the imputation had been performed with only one of the individual haplotype panels.
A second feature ensuring an increased discovery power was the use of multiple inheritance models in association testing. Typical GWAS only consider the additive model, according to which disease risk is proportional to the number of risk alleles in a genotype. However, dominant, recessive, or even more complex allelic interactions are known to exist. Indeed, 20 of the 94 loci only showed genome-wide significance when non-additive tests were applied. When focusing on the novel findings, 13 out of 26 (50%) would have been missed if considering the additive model only, indicating again the strength of this approach in pushing discovery. Three of the thirteen non-additive signals corresponded to rare variants with large recessive effects (OR 4.3–19.0).
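The inheritance models mentioned above differ only in how the genotype (the count g ∈ {0, 1, 2} of risk alleles) is encoded before the association test. A minimal sketch (the function is ours for illustration, not taken from the study's framework):

```python
def encode(g, model):
    """Recode a risk-allele count g in {0, 1, 2} under an inheritance model."""
    if model == "additive":    # risk grows with each copy of the risk allele
        return g
    if model == "dominant":    # a single copy is enough to confer risk
        return int(g >= 1)
    if model == "recessive":   # both copies are required to confer risk
        return int(g == 2)
    raise ValueError(f"unknown model: {model}")

# A heterozygous carrier (g = 1) counts as exposed under the dominant
# model but as unexposed under the recessive one:
row = [encode(1, m) for m in ("additive", "dominant", "recessive")]
```

Running the regression on each encoding and keeping the best-fitting model is what allows recessive-acting rare variants, invisible to the additive test, to reach genome-wide significance.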
This study highlighted the value of open access and data sharing since the re-analysis using more refined and extensive methodologies led to the discovery of novel loci and disease insights. The entire GWAS strategy for this comprehensive methodology was integrated into a publicly available framework named GUIDANCE in order to facilitate further studies.