Establishing Best Practices for Clinical GWAS: Tackling Imputation and Data Quality Challenges
Abstract
1. Introduction
2. Genotype Imputation in GWASs
2.1. Imputation Methods
Potential and Limitations of Deep-Learning-Based Imputation
2.2. Quality Metrics for Imputation
- INFO Score: The INFO score estimates the squared correlation between imputed and true genotypes, ranging from 0 to 1. Higher values indicate greater confidence. A threshold of INFO > 0.8 is typically used for inclusion in downstream analyses, although lower thresholds (e.g., 0.6–0.7) may be acceptable when evaluating low-frequency variants or when using well-matched reference panels. However, it is important to recognize that a fixed INFO threshold may not be universally appropriate across all populations. For example, in ancestrally diverse or admixed cohorts, such as African Americans, greater haplotype diversity and shorter linkage disequilibrium blocks can reduce imputation accuracy, especially for rare variants. In these contexts, more stringent thresholds (e.g., INFO > 0.9) may be considered for variants with a high clinical impact to minimize false positives, while slightly relaxed thresholds might be acceptable for exploratory analyses or polygenic risk score construction when maximizing variant inclusion is important. Threshold selection should be guided by the imputation performance of the reference panel in the target population, variant allele frequency, and the intended application, and should be transparently reported with justification. INFO scores are also influenced by the minor allele frequency (MAF); rare variants tend to have lower INFO scores due to the greater uncertainty (Table 2) [29].
- Imputation R-squared (r2): This metric also estimates the correlation between imputed and actual genotypes, expressed as the proportion of variance explained, and is especially useful in regression-based analyses. Although INFO and r2 are often used interchangeably, they are computed differently across platforms (e.g., IMPUTE vs. Minimac) and may not be directly comparable (Table 2) [30].
- Minor Allele Frequency (MAF): Variants with a low MAF (e.g., <0.01) are more difficult to impute accurately and often yield lower confidence scores. In clinical contexts, these variants require more stringent quality control to avoid spurious associations or false positives. For example, a rare variant associated with drug metabolism, such as DPYD rs3918290, which influences fluoropyrimidine toxicity, has an MAF below 1% in most populations. If poorly imputed, this variant could lead to incorrect dosing recommendations in pharmacogenomic settings. Ensuring high-quality imputation or confirming such variants with direct genotyping is critical when the clinical consequences are significant (Table 2) [31].
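To make the relationship between these metrics concrete, the following Python sketch computes the MAF and the empirical squared correlation between imputed dosages and true genotypes for a single variant. It assumes that directly typed genotypes exist for the same samples. The function names and the ten-sample data are hypothetical, and the result is an empirical check against known genotypes, not the INFO or r2 value that IMPUTE or Minimac would report.

```python
def minor_allele_frequency(genotypes):
    """MAF from per-sample alternate-allele counts (0, 1, or 2)."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def dosage_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(true_genotypes)
    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_x = sum((x - mean_x) ** 2 for x in imputed_dosages)
    var_y = sum((y - mean_y) ** 2 for y in true_genotypes)
    if var_x == 0 or var_y == 0:
        # Monomorphic in this sample: correlation is undefined,
        # so the variant is treated as uninformative (assumption).
        return 0.0
    return cov ** 2 / (var_x * var_y)


# Hypothetical data: true genotypes and imputed dosages for ten samples.
true_calls = [0, 0, 0, 1, 0, 0, 2, 0, 1, 0]
dosages = [0.1, 0.0, 0.2, 0.8, 0.1, 0.0, 1.6, 0.3, 0.9, 0.0]

print(f"MAF = {minor_allele_frequency(true_calls):.3f}")
print(f"empirical r2 = {dosage_r2(dosages, true_calls):.3f}")
```

With very rare variants, only a handful of samples carry the minor allele, so a single miscalled carrier can shift this correlation substantially, which is one reason rare variants tend to receive lower quality scores.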
2.3. Factors Affecting Imputation Accuracy
3. GWAS Clinical Actionability, Genotyping Confirmation, and Transparent Reporting
3.1. Drug Dosing and Pharmacogenomics
3.2. Polygenic Risk Prediction
3.3. Identification of High-Risk Individuals
4. Best Practices for GWAS Implementation in Clinical Settings
4.1. Rigorous Quality Control for Imputed Data
- INFO/r2 thresholds: Exclude variants with INFO or r2 scores below a certain threshold. While a cutoff of 0.8 is commonly used, this should be adjusted based on the study goals and variant characteristics. For example, studies focusing on rare variants or underrepresented populations may benefit from more stringent thresholds to improve the reliability of imputed calls [65].
- Minor allele frequency (MAF): Exclude variants with MAF < 0.01 unless directly genotyped. This step helps to reduce spurious associations driven by rare variants with unreliable imputation.
- Hardy–Weinberg equilibrium (HWE): Filter variants deviating from HWE (e.g., p < 1 × 10⁻⁶). Deviations from HWE can indicate genotyping errors or ancestry stratification issues.
- Ancestry stratification: Adjust for genetic ancestry structure using principal component analysis (PCA) and, where needed, more advanced methods such as mixed linear models (e.g., EMMAX [66] and GEMMA [67]) or local ancestry deconvolution [68]. These methods account for subtle population structure effects that can confound association analyses. PCA captures broad-scale ancestry differences, while mixed models and local ancestry deconvolution can address more complex patterns of relatedness and admixture.
- Batch effects: Test for systematic, non-biological errors using visualization techniques such as PCA and multidimensional scaling (MDS). Correct for these effects using statistical methods such as linear regression and analysis of variance (ANOVA), or more sophisticated approaches such as ComBat [69] and mixed-effects models [70], which account for both fixed and random effects.
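The variant-level filters in this list (INFO/r2, MAF, and HWE) can be sketched as a single pass/fail check. The Python example below is a minimal illustration, not a replacement for an established QC tool: the `Variant` record, the function names, and the test counts are hypothetical, the default thresholds are the example values quoted above, and HWE is tested with a one-degree-of-freedom chi-square approximation on hard-call genotype counts, whereas production pipelines commonly use an exact test.

```python
from dataclasses import dataclass
from math import erfc, sqrt


@dataclass
class Variant:
    variant_id: str
    info: float            # INFO or r2 as reported by the imputation tool
    n_hom_ref: int         # genotype counts from hard calls (assumption)
    n_het: int
    n_hom_alt: int
    genotyped: bool = False


def maf(v):
    """Minor allele frequency from genotype counts."""
    n = v.n_hom_ref + v.n_het + v.n_hom_alt
    p_alt = (2 * v.n_hom_alt + v.n_het) / (2 * n)
    return min(p_alt, 1 - p_alt)


def hwe_chi2_pvalue(v):
    """Chi-square (1 df) approximation to the HWE test."""
    n = v.n_hom_ref + v.n_het + v.n_hom_alt
    p = (2 * v.n_hom_ref + v.n_het) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (v.n_hom_ref, v.n_het, v.n_hom_alt)
    if min(expected) == 0:
        return 1.0  # monomorphic: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


def passes_qc(v, info_min=0.8, maf_min=0.01, hwe_p_min=1e-6):
    """Apply the INFO/r2 and MAF filters to imputed variants, HWE to all."""
    if not v.genotyped and v.info < info_min:
        return False
    if not v.genotyped and maf(v) < maf_min:
        return False
    return hwe_chi2_pvalue(v) >= hwe_p_min


# Hypothetical variants covering each filter.
variants = [
    Variant("common_ok", 0.95, 490, 420, 90),
    Variant("low_info", 0.50, 490, 420, 90),
    Variant("rare_imputed", 0.95, 990, 10, 0),
    Variant("rare_genotyped", 1.00, 990, 10, 0, genotyped=True),
    Variant("hwe_outlier", 0.95, 500, 0, 500),
]
kept = [v.variant_id for v in variants if passes_qc(v)]
print(kept)
```

Keeping the thresholds as function arguments makes it straightforward to apply stricter or more relaxed cutoffs per study goal or population, and to report the values actually used.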
4.2. Direct Genotyping of Critical Variants: Minimizing Reliance on Imputation
- Established clinical guidelines: Prioritize variants included in evidence-based clinical guidelines from leading organizations such as the Clinical Pharmacogenetics Implementation Consortium (CPIC; https://cpicpgx.org/guidelines/, accessed on 5 May 2025) or the National Comprehensive Cancer Network (NCCN, https://www.nccn.org/guidelines, accessed on 5 May 2025), or relevant specialty-specific societies. These guidelines represent a consensus on variants with established clinical utility.
- Significant effect size: Focus on variants exhibiting large effect sizes on clinically relevant phenotypes, particularly those that influence therapeutic response, disease susceptibility, or prognostic outcomes. Effect size should be considered in the context of the specific clinical application and target population [73].
- Population-specific allele frequency: Account for the frequency of the variant in the target population. While high-frequency variants are generally prioritized, consider also including low-frequency variants that are enriched in specific populations and have substantial clinical implications within those groups [74].
- Evidence of imputation inaccuracy: If certain variants consistently show poor imputation performance (e.g., low INFO scores or high discordance rates) in specific populations or genomic regions, prioritize direct genotyping to ensure accurate assessment [18].
- Tier 1—Initial validation: Genotype all critical variants in a subset of samples using an orthogonal method to confirm assay accuracy and identify potential technical artifacts.
- Tier 2—Periodic monitoring: Regularly monitor the genotyping performance using quality control metrics (e.g., call rates and allele frequencies) and implement periodic orthogonal validation in a representative subset of samples to detect any drift in assay performance or emerging technical issues.
4.3. Cross-Population Validation of Imputation Models
4.4. Transparent Reporting of Imputation Quality in Clinical Reports
- Flag imputed variants: Clearly identify imputed variants using a consistent notation (e.g., asterisk, color code, or designated field) to distinguish them from directly genotyped variants.
- Include quality scores: Provide relevant quality metrics for each imputed variant, such as INFO scores, r2 values, and MAF in the reference population. These metrics allow readers to judge the confidence and reliability of the imputed calls.
- Offer plain-language interpretations: Provide clear, concise, and context-specific interpretations of the quality metrics, enabling clinicians to understand their implications for clinical decision-making.
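One way to apply these three reporting elements consistently is to generate each report row from a small function. The Python sketch below is illustrative only: the function names, the field layout, the 0.8 reporting threshold, and the wording of the notes are assumptions, and the example rows are hypothetical.

```python
def report_row(variant, genotype, imputed, metric=None, score=None,
               confirm_below=0.8):
    """Build one report row with an explicit imputation flag and quality note."""
    if not imputed:
        return {
            "variant": variant,
            "genotype": genotype,
            "status": "Genotyped",
            "quality": "n/a",
            "note": "Directly genotyped call.",
        }
    if metric is None or score is None:
        raise ValueError("imputed variants need a quality metric and a score")
    if score >= confirm_below:
        note = "Imputed call; quality meets the reporting threshold."
    else:
        note = ("Imputed call below the reporting threshold; "
                "confirm by direct genotyping before clinical use.")
    return {
        "variant": variant,
        "genotype": genotype,
        "status": "Imputed",
        "quality": f"{metric} = {score:.2f}",
        "note": note,
    }


def format_row(row):
    """Render a row as pipe-separated text."""
    keys = ("variant", "genotype", "status", "quality", "note")
    return " | ".join(row[k] for k in keys)


# Hypothetical example rows.
rows = [
    report_row("SLCO1B1*5", "T/C", True, "r2", 0.91),
    report_row("DPYD*2A", "C/T", True, "INFO", 0.73),
    report_row("CYP2C19*2", "*1/*2", False),
]
for row in rows:
    print(format_row(row))
```

A dedicated status field is used here instead of an asterisk because star-allele names already contain asterisks, which would make an asterisk flag ambiguous.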
4.5. Ethical and Equity Considerations
4.6. Global Collaboration and Data Sharing
- Expand Diversity in Reference Panels: Actively recruit participants from underrepresented populations to create more inclusive datasets, and support initiatives like H3Africa that focus on increasing the genomic research capacity in Africa.
- Promote Capacity Building in Low- and Middle-Income Countries (LMICs): Provide financial support for infrastructure development, training programs, and research initiatives in LMICs. Collaborate with local institutions to develop culturally appropriate consent processes.
- Facilitate Open Data Access with Safeguards: Encourage open-access policies while implementing strong governance frameworks to protect participant privacy.
- Encourage Cross-Border Collaboration: Establish international consortia focused on specific diseases or traits (e.g., the COVID-19 Host Genetics Initiative). Share best practices through global workshops, conferences, and working groups.
- Leverage Technology for Secure Data Sharing: Use federated analysis models that allow researchers to analyze data locally without transferring it across borders. Blockchain technology could be explored as a tool for ensuring transparency and security in data-sharing agreements.
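The federated idea can be illustrated with a pooled allele-frequency calculation in which each site shares only two aggregate counts per variant. The Python sketch below is a toy example under stated assumptions: the function names and site data are hypothetical, and a real deployment would add authentication, access governance, and safeguards such as minimum cell counts before any summary leaves a site.

```python
def local_summary(genotypes):
    """Run inside each site: reduce individual genotypes to two counts.

    `genotypes` holds alternate-allele counts (0, 1, 2) or None for missing.
    """
    called = [g for g in genotypes if g is not None]
    return {"alt_alleles": sum(called), "total_alleles": 2 * len(called)}


def pooled_alt_frequency(summaries):
    """Run by the coordinator: combine site-level counts only."""
    alt = sum(s["alt_alleles"] for s in summaries)
    total = sum(s["total_alleles"] for s in summaries)
    return alt / total if total else None


# Hypothetical genotypes held separately at two sites.
site_a = [0, 1, 2, None, 0]
site_b = [1, 1, 0]

summaries = [local_summary(site_a), local_summary(site_b)]
print(pooled_alt_frequency(summaries))
```

Only the `summaries` list crosses institutional boundaries in this sketch; the individual-level genotype lists never leave the site that holds them.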
5. Future Perspectives
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
GWASs | Genome-Wide Association Studies |
PRS | Polygenic Risk Score |
SNP | Single-Nucleotide Polymorphism |
LD | Linkage Disequilibrium |
MAF | Minor Allele Frequency |
PCA | Principal Component Analysis |
OUD | Opioid Use Disorder |
CNV | Copy Number Variation |
CGH | Comparative Genomic Hybridization |
WGS | Whole-Genome Sequencing |
QC | Quality Control |
HWE | Hardy–Weinberg Equilibrium |
MDS | Multidimensional Scaling |
ML | Machine Learning |
CNN | Convolutional Neural Networks |
NGS | Next-Generation Sequencing |
ddPCR | Droplet Digital PCR |
HIPAA | Health Insurance Portability and Accountability Act |
GDPR | General Data Protection Regulation |
LMICs | Low- and Middle-Income Countries |
FDA | Food and Drug Administration |
EMA | European Medicines Agency |
References
- Lennon, N.J.; Kottyan, L.C.; Kachulis, C.; Abul-Husn, N.; Arias, J.; Belbin, G.; Below, J.E.; Berndt, S.; Chung, W.; Cimino, J.J.; et al. Selection, optimization and validation of ten chronic disease polygenic risk scores for clinical implementation in diverse US populations. Nat. Med. 2024, 30, 480–487.
- Giacomini, K.M.; Yee, S.W.; Mushiroda, T.; Weinshilboum, R.M.; Ratain, M.J.; Kubo, M. Genome-wide association studies of drug response and toxicity: An opportunity for genome medicine. Nat. Rev. Drug Discov. 2016, 16, 70.
- Sun, L.; Pennells, L.; Kaptoge, S.; Nelson, C.P.; Ritchie, S.C.; Abraham, G.; Arnold, M.; Bell, S.; Bolton, T.; Burgess, S.; et al. Polygenic risk scores in cardiovascular risk prediction: A cohort study and modelling analyses. PLoS Med. 2021, 18, e1003498.
- Stadler, Z.K.; Thom, P.; Robson, M.E.; Weitzel, J.N.; Kauff, N.D.; Hurley, K.E.; Devlin, V.; Gold, B.; Klein, R.J.; Offit, K. Genome-Wide Association Studies of Cancer. J. Clin. Oncol. 2010, 28, 4255–4267.
- Panday, S.K.; Shankar, V.; Lyman, R.A.; Alexov, E. Genetic Variants Linked to Opioid Addiction: A Genome-Wide Association Study. Int. J. Mol. Sci. 2024, 25, 12516.
- Hall, F.S.; Drgonova, J.; Jain, S.; Uhl, G.R. Implications of genome wide association studies for addiction: Are our a priori assumptions all wrong? Pharmacol. Ther. 2013, 140, 267–279.
- Kuhn, B.N.; Cannella, N.; Chitre, A.S.; Nguyen, K.M.H.; Cohen, K.; Chen, D.; Peng, B.; Ziegler, K.S.; Lin, B.; Johnson, B.B.; et al. Genome-wide association study reveals multiple loci for nociception and opioid consumption behaviors associated with heroin vulnerability in outbred rats. Mol. Psychiatry 2025.
- Ikegawa, S. A Short History of the Genome-Wide Association Study: Where We Were and Where We Are Going. Genom. Inform. 2012, 10, 220–225.
- Staerk, C.; Klinkhammer, H.; Wistuba, T.; Maj, C.; Mayr, A. Generalizability of polygenic prediction models: How is the R2 defined on test data? BMC Med. Genom. 2024, 17, 132.
- Koch, S.; Schmidtke, J.; Krawczak, M.; Caliebe, A. Clinical utility of polygenic risk scores: A critical 2023 appraisal. J. Community Genet. 2023, 14, 471–487.
- Purcell, S.M.; Wray, N.R.; Stone, J.L.; Visscher, P.M.; O’Donovan, M.C.; Sullivan, P.F.; International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 2009, 460, 748–752.
- Khera, A.V.; Chaffin, M.; Aragam, K.G.; Haas, M.E.; Roselli, C.; Choi, S.H.; Natarajan, P.; Lander, E.S.; Lubitz, S.A.; Ellinor, P.T.; et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 2018, 50, 1219–1224.
- Kachuri, L.; Chatterjee, N.; Hirbo, J.; Schaid, D.J.; Martin, I.; Kullo, I.J.; Kenny, E.E.; Pasaniuc, B.; Polygenic Risk Methods in Diverse Populations (PRIMED) Consortium Methods Working Group; Auer, P.L.; et al. Principles and methods for transferring polygenic risk scores across global populations. Nat. Rev. Genet. 2023, 25, 8–25.
- Zaitlen, N.; Kraft, P. Heritability in the genome-wide association era. Hum. Genet. 2012, 131, 1655–1664.
- Roberts, E.; Howell, S.; Evans, D.G. Polygenic risk scores and breast cancer risk prediction. Breast 2023, 67, 71–77.
- Naito, T.; Okada, Y. Genotype imputation methods for whole and complex genomic regions utilizing deep learning technology. J. Hum. Genet. 2024, 69, 481–486.
- Chen, D.; Tashman, K.; Palmer, D.S.; Neale, B.; Roeder, K.; Bloemendal, A.; Churchhouse, C.; Ke, Z.T. A data harmonization pipeline to leverage external controls and boost power in GWAS. Hum. Mol. Genet. 2021, 31, 481–489.
- Liu, Q.; Cirulli, E.T.; Han, Y.; Yao, S.; Liu, S.; Zhu, Q. Systematic assessment of imputation performance using the 1000 Genomes reference panels. Brief. Bioinform. 2014, 16, 549–562.
- Quick, C.; Anugu, P.; Musani, S.; Weiss, S.T.; Burchard, E.G.; White, M.J.; Keys, K.L.; Cucca, F.; Sidore, C.; Boehnke, M.; et al. Sequencing and imputation in GWAS: Cost-effective strategies to increase power and genomic coverage across diverse populations. Genet. Epidemiol. 2020, 44, 537–549.
- Chen, S.F.; Dias, R.; Evans, D.; Salfati, E.L.; Liu, S.; Wineinger, N.E.; Torkamani, A. Genotype imputation and variability in polygenic risk score estimation. Genome Med. 2020, 12, 100.
- Li, Y.; Willer, C.; Sanna, S.; Abecasis, G. Genotype Imputation. Annu. Rev. Genom. Hum. Genet. 2009, 10, 387–406.
- Hickey, J.M.; Kinghorn, B.P.; Tier, B.; van der Werf, J.H.; Cleveland, M.A. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet. Sel. Evol. 2012, 44, 9.
- Treccani, M.; Locatelli, E.; Patuzzo, C.; Malerba, G. A broad overview of genotype imputation: Standard guidelines, approaches, and future investigations in genomic association studies. Biocell 2023, 47, 1225–1241.
- van Leeuwen, E.M.; Kanterakis, A.; Deelen, P.; Kattenberg, M.V.; Slagboom, P.E.; de Bakker, P.I.W.; Wijmenga, C.; Swertz, M.A.; Boomsma, D.I.; The Genome of the Netherlands Consortium; et al. Population-specific genotype imputations using minimac or IMPUTE2. Nat. Protoc. 2015, 10, 1285–1296.
- Browning, B.L.; Browning, S.R. Genotype Imputation with Millions of Reference Samples. Am. J. Hum. Genet. 2016, 98, 116–126.
- Fuchsberger, C.; Abecasis, G.R.; Hinds, D.A. minimac2: Faster genotype imputation. Bioinformatics 2014, 31, 782–784.
- Rubinacci, S.; Hofmeister, R.J.; Sousa da Mota, B.; Delaneau, O. Imputation of low-coverage sequencing data from 150,119 UK Biobank genomes. Nat. Genet. 2023, 55, 1088–1090.
- Arisdakessian, C.; Poirion, O.; Yunits, B.; Zhu, X.; Garmire, L.X. DeepImpute: An accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol. 2019, 20, 211.
- Southam, L.; Panoutsopoulou, K.; Rayner, N.W.; Chapman, K.; Durrant, C.; Ferreira, T.; Arden, N.; Carr, A.; Deloukas, P.; the arcOGEN Consortium; et al. The effect of genome-wide association scan quality control on imputation outcome for common variants. Eur. J. Hum. Genet. 2011, 19, 610–614.
- Nelson, S.C.; Doheny, K.F.; Pugh, E.W.; Romm, J.M.; Ling, H.; Laurie, C.A.; Browning, S.R.; Weir, B.S.; Laurie, C.C. Imputation-Based Genomic Coverage Assessments of Current Human Genotyping Arrays. G3 Genes|Genomes|Genet. 2013, 3, 1795–1807.
- Zhang, Z.; Xiao, X.; Zhou, W.; Zhu, D.; Amos, C.I. False positive findings during genome-wide association studies with imputation: Influence of allele frequency and imputation accuracy. Hum. Mol. Genet. 2021, 31, 146–155.
- Hwang, S.; Kim, E.; Lee, I.; Marcotte, E.M. Systematic comparison of variant calling pipelines using gold standard personal exome variants. Sci. Rep. 2015, 5, 17875.
- Cahoon, J.L.; Rui, X.; Tang, E.; Simons, C.; Langie, J.; Chen, M.; Lo, Y.-C.; Chiang, C.W. Imputation accuracy across global human populations. Am. J. Hum. Genet. 2024, 111, 979–989.
- Blair, D.R.; Hoffmann, T.J.; Shieh, J.T. Common genetic variation associated with Mendelian disease severity revealed through cryptic phenotype analysis. Nat. Commun. 2022, 13, 3675.
- Deng, T.; Zhang, P.; Garrick, D.; Gao, H.; Wang, L.; Zhao, F. Comparison of Genotype Imputation for SNP Array and Low-Coverage Whole-Genome Sequencing Data. Front. Genet. 2022, 12, 704118.
- Thanh Nguyen, D.; Hoang Nguyen, Q.; Thuy Duong, N.; Vo, N.S. LmTag: Functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays. Brief. Bioinform. 2022, 23, bbac252.
- Ullah, E.; Mall, R.; Abbas, M.M.; Kunji, K.; Nato, A.Q.; Bensmail, H.; Wijsman, E.M.; Saad, M. Comparison and assessment of family- and population-based genotype imputation methods in large pedigrees. Genome Res. 2018, 29, 125–134.
- Coombes, B.J.; Ploner, A.; Bergen, S.E.; Biernacka, J.M. A principal component approach to improve association testing with polygenic risk scores. Genet. Epidemiol. 2020, 44, 676–686.
- Tan, T.; Atkinson, E.G. Strategies for the Genomic Analysis of Admixed Populations. Annu. Rev. Biomed. Data Sci. 2023, 6, 105–127.
- Appadurai, V.; Bybjerg-Grauholm, J.; Krebs, M.D.; Rosengren, A.; Buil, A.; Ingason, A.; Mors, O.; Børglum, A.D.; Hougaard, D.M.; Nordentoft, M.; et al. Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks. Commun. Biol. 2023, 6, 101.
- Shi, S.; Rubinacci, S.; Hu, S.; Moutsianas, L.; Stuckey, A.; Need, A.C.; Palamara, P.F.; Caulfield, M.; Marchini, J.; Myers, S. A Genomics England haplotype reference panel and imputation of UK Biobank. Nat. Genet. 2024, 56, 1800–1803.
- Flanagan, J.; Liu, X.; Ortega-Reyes, D.; Tomizuka, K.; Matoba, N.; Akiyama, M.; Koido, M.; Ishigaki, K.; Ashikawa, K.; Takata, S.; et al. Population-specific reference panel improves imputation quality for genome-wide association studies conducted on the Japanese population. Commun. Biol. 2024, 7, 1665.
- Mauleekoonphairoj, J.; Tongsima, S.; Khongphatthanayothin, A.; Jurgens, S.J.; Zimmerman, D.S.; Sutjaporn, B.; Wandee, P.; Bezzina, C.R.; Nademanee, K.; Poovorawan, Y.; et al. A diverse ancestrally-matched reference panel increases genotype imputation accuracy in a underrepresented population. Sci. Rep. 2023, 13, 12360.
- Visscher, P.M.; Wray, N.R.; Zhang, Q.; Sklar, P.; McCarthy, M.I.; Brown, M.A.; Yang, J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 2017, 101, 5–22.
- Buniello, A.; MacArthur, J.A.L.; Cerezo, M.; Harris, L.W.; Hayhurst, J.; Malangone, C.; McMahon, A.; Morales, J.; Mountjoy, E.; Sollis, E.; et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019, 47, D1005–D1012.
- Kember, R.L.; Vickers-Smith, R.; Xu, H.; Toikumo, S.; Niarchou, M.; Zhou, H.; Hartwell, E.E.; Crist, R.C.; Rentsch, C.T.; Million Veteran Program; et al. Cross-ancestry meta-analysis of opioid use disorder uncovers novel loci with predominant effects in brain regions associated with addiction. Nat. Neurosci. 2022, 25, 1279–1287.
- Klein, M.D.; Williams, A.K.; Lee, C.R.; Stouffer, G.A. Clinical Utility of CYP2C19 Genotyping to Guide Antiplatelet Therapy in Patients With an Acute Coronary Syndrome or Undergoing Percutaneous Coronary Intervention. Arterioscler. Thromb. Vasc. Biol. 2019, 39, 647–652.
- Lucky, M.H.; Baig, S.; Hanif, M.; HAsghar, A. Systematic Review: Comprehensive Methods for Detecting BRCA1 and BRCA2 Mutations in Breast and Ovarian Cancer. Asian Pac. J. Cancer Biol. 2025, 10, 229–238.
- König, E.; Mitchell, J.S.; Filosi, M.; Fuchsberger, C. Impact of the inaccessible genome on genotype imputation and genome-wide association studies. Hum. Mol. Genet. 2024, 33, 1207–1214.
- McInnes, G.; Yee, S.W.; Pershad, Y.; Altman, R.B. Genomewide association studies in pharmacogenomics. Clin. Pharmacol. Ther. 2021, 110, 637–648.
- Leu, C.; Avbersek, A.; Stevelink, R.; Custodio, H.M.; Chen, S.; Speed, D.; Bennett, C.A.; Jonsson, L.; Unnsteinsdóttir, U.; Jorgensen, A.L.; et al. Genome-wide association meta-analyses of drug-resistant epilepsy. EBioMedicine 2025, 115, 105675.
- Huang, H.; Liu, J.; Xiao, Q.; Mao, C.; She, L.; Yu, L.; Yu, B.; Lei, M.; Gao, Y.; He, B.; et al. GWAS study of myelosuppression among NSCLC patients receiving platinum-based combination chemotherapy. Acta Biochim. Et Biophys. Sin. 2025.
- Best, C.M.; Li, X.; Rotter, J.I.; Prince, D.K.; Hsu, S.; Hoofnagle, A.N.; Siscovick, D.; Taylor, K.D.; Williams, K.; Michos, E.D.; et al. Genetic Variants Associated with the Biochemical Response to Vitamin D3 in the Multi-Ethnic Study of Atherosclerosis. J. Clin. Endocrinol. Metab. 2025, dgaf025.
- Liboredo, R.; Pena, S.D.J. Pharmacogenomics: Accessing important alleles by imputation from commercial genome-wide SNP arrays. Genet. Mol. Res. 2014, 13, 5713–5721.
- Zhao, M.; Wang, Q.; Wang, Q.; Jia, P.; Zhao, Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: Features and perspectives. BMC Bioinform. 2013, 14, S1.
- Yang, L. A Practical Guide for Structural Variation Detection in the Human Genome. Curr. Protoc. Hum. Genet. 2020, 107, e103.
- Maamari, D.J.; Abou-Karam, R.; Fahed, A.C. Polygenic Risk Scores in Human Disease. Clin. Chem. 2025, 71, 69–76.
- Schunkert, H.; Di Angelantonio, E.; Inouye, M.; Patel, R.S.; Ripatti, S.; Widen, E.; Sanderson, S.C.; Kaski, J.P.; McEvoy, J.W.; Vardas, P.; et al. Clinical utility and implementation of polygenic risk scores for predicting cardiovascular disease: A clinical consensus statement of the ESC Council on Cardiovascular Genomics, the ESC Cardiovascular Risk Collaboration, and the European Association of Preventive Cardiology. Eur. Heart J. 2025, 46, 1372–1383.
- Sonehara, K.; Okada, Y. Leveraging genome-wide association studies to better understand the etiology of cancers. Cancer Sci. 2025, 116, 288–296.
- Cheng, F.; Yang, K.; Wang, Y.; Yang, F.; Niu, X.; Li, W. Therapeutic potential of GLP-1RAs in sleep apnea with genetic associations to type 2 diabetes. Diabetol. Metab. Syndr. 2025, 17, 1–11.
- Rüeger, S.; McDaid, A.; Kutalik, Z. Evaluation and application of summary statistic imputation to discover new height-associated loci. PLoS Genet. 2018, 14, e1007371.
- Levi, H.; Elkon, R.; Shamir, R. The predictive capacity of polygenic risk scores for disease risk is only moderately influenced by imputation panels tailored to the target population. Bioinformatics 2024, 40, btae036.
- Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; de Bakker, P.I.W.; Daly, M.J.; et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
- O’Connell, J.; Sharp, K.; Shrine, N.; Wain, L.; Hall, I.; Tobin, M.; Zagury, J.-F.; Delaneau, O.; Marchini, J. Haplotype estimation for biobank-scale data sets. Nat. Genet. 2016, 48, 817–820.
- Marchini, J.; Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010, 11, 499–511.
- Wu, C.; DeWan, A.; Hoh, J.; Wang, Z. A Comparison of Association Methods Correcting for Population Stratification in Case–Control Studies. Ann. Hum. Genet. 2011, 75, 418–427.
- Parker, C.C.; Gopalakrishnan, S.; Carbonetto, P.; Gonzales, N.M.; Leung, E.; Park, Y.J.; Aryee, E.; Davis, J.; A Blizard, D.; Ackert-Bicknell, C.L.; et al. Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice. Nat. Genet. 2016, 48, 919–926.
- Geza, E.; Mugo, J.; Mulder, N.J.; Wonkam, A.; Chimusa, E.R.; Mazandu, G.K. A comprehensive survey of models for dissecting local ancestry deconvolution in human genome. Brief. Bioinform. 2018, 20, 1709–1724.
- Zhang, Y.; Parmigiani, G.; Johnson, W.E. ComBat-seq: Batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2020, 2, lqaa078.
- Korte, A.; Vilhjálmsson, B.J.; Segura, V.; Platt, A.; Long, Q.; Nordborg, M. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat. Genet. 2012, 44, 1066–1071.
- Ani, A.; van der Most, P.J.; Snieder, H.; Vaez, A.; Nolte, I.M. GWASinspector: Comprehensive quality control of genome-wide association study results. Bioinformatics 2021, 37, 129–130.
- Poplin, R.; Chang, P.C.; Alexander, D.; Schwartz, S.; Colthurst, T.; Ku, A.; Newburger, D.; Dijamco, J.; Nguyen, N.; Afshar, P.T.; et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 2018, 36, 983–987.
- Tesi, N.; van der Lee, S.J.; Hulsman, M.; Jansen, I.E.; Stringa, N.; van Schoor, N.; Meijers-Heijboer, H.; Huisman, M.; Scheltens, P.; Reinders, M.J.T.; et al. Centenarian controls increase variant effect sizes by an average twofold in an extreme case–extreme control analysis of Alzheimer’s disease. Eur. J. Hum. Genet. 2018, 27, 244–253.
- Daw Elbait, G.; Henschel, A.; Tay, G.K.; Al Safar, H.S. A Population-Specific Major Allele Reference Genome From The United Arab Emirates Population. Front. Genet. 2021, 12, 660428.
- Lee, C.R.; Luzum, J.A.; Sangkuhl, K.; Gammal, R.S.; Sabatine, M.S.; Stein, C.M.; Kisor, D.F.; Limdi, N.A.; Lee, Y.M.; Scott, S.A.; et al. Clinical Pharmacogenetics Implementation Consortium Guideline for CYP2C19 Genotype and Clopidogrel Therapy: 2022 Update. Clin. Pharmacol. Ther. 2022, 112, 959–967.
- Fernandes, A.G.; Hernández, S.C.; Navaro, R.L.; Kawamura, S.; Melin, A.D. Droplet Digital PCR Provides Highly Sensitive and Accurate Opsin Gene SNP Detection From Wild Primate Fecal Samples. Ecol. Evol. 2025, 15, e70996.
- Sawyer, S.L.; Mukherjee, N.; Pakstis, A.J.; Feuk, L.; Kidd, J.R.; Brookes, A.J.; Kidd, K.K. Linkage disequilibrium patterns vary substantially among populations. Eur. J. Hum. Genet. 2005, 13, 677–686.
- Burgess, D.J. The TOPMed genomic resource for human health. Nat. Rev. Genet. 2021, 22, 200.
- GenomeAsia100K Consortium. The GenomeAsia 100K Project enables genetic discoveries across Asia. Nature 2019, 576, 106–111.
- Claw, K.G.; Anderson, M.Z.; Begay, R.L.; Tsosie, K.S.; Fox, K.; Garrison, N.A.; Summer internship for INdigenous peoples in Genomics (SING) Consortium. A framework for enhancing ethical genomic research with Indigenous communities. Nat. Commun. 2018, 9, 2957.
- Bahcall, O.G. In this issue: GA4GH standards enable the responsible sharing of human genomic and biomedical data. Cell Genom. 2021, 1, 100038.
Algorithm | Strengths | Weaknesses | Optimal Context |
---|---|---|---|
IMPUTE2 [24] | High accuracy for common variants; extensively validated in population-based studies | Computationally intensive | Smaller datasets, studies requiring high accuracy for common variants |
Beagle [25] | Fast, integrates phasing and imputation | Less accurate for rare variants | Large datasets, high-throughput studies |
Minimac4 [26] | Scalable, optimized for low memory usage | Slight accuracy trade-off | Very large datasets, meta-analyses |
GLIMPSE [27] | Effective for rare variants in admixed populations | Computationally intensive | Admixed cohorts; studies focused on rare variants |
DeepImpute [28] | Captures complex patterns; potential for high accuracy | Requires large training datasets; less validated | Experimental settings with rich computational resources |
Metric | Definition | Common Threshold | Interpretation |
---|---|---|---|
INFO Score | Squared correlation between imputed and true genotypes | >0.8 (typical) | High confidence; tool-specific |
Imputation r2 | Proportion of variance explained | >0.8 | Regression-friendly; may differ by tool |
Minor Allele Frequency (MAF) | Frequency of the less common allele | >0.01 (for inclusion) | Rare variants are less reliably imputed |
Hard-Call Concordance | % match to gold-standard genotypes | >95% | Best for validating imputation accuracy directly |
Year & [Ref] | Population | Variant Frequency | Accuracy Metric | Imputation Method | Reference Panel | Sample Size | Accuracy |
---|---|---|---|---|---|---|---|
2023 [40] | European | Rare (MAF < 0.005) | r2 | BEAGLE | PGP-UK Cohort | 452,264 | r2 = 0.43 |
2023 [40] | European | Common (MAF 0.2–0.5) | r2 | BEAGLE | PGP-UK Cohort | 452,264 | r2 = 0.95 |
2024 [41] | British (GBR) | Rare (MAF < 10%) | r2 | Not specified | GEL Panel | 200,000 | r2 = 0.60 |
2024 [41] | British (GBR) | Very Rare (MAF < 2%) | r2 | Not specified | GEL Panel | 200,000 | r2 = 0.75 |
2024 [33] | Saudi Arabian | Rare (MAF 1–5%) | Mean Rsq | TOPMed | TOPMed Panel | 1061 | R2 = 0.79 |
2024 [33] | Vietnamese | Rare (MAF 1–5%) | Mean Rsq | TOPMed | TOPMed Panel | 1264 | R2 = 0.78 |
2024 [33] | Thai | Rare (MAF 1–5%) | Mean Rsq | TOPMed | TOPMed Panel | 2435 | R2 = 0.76 |
2024 [33] | Papua New Guinean | Rare (MAF 1–5%) | Mean Rsq | TOPMed | TOPMed Panel | 776 | R2 = 0.62 |
2024 [42] | Japanese | Rare (MAF < 5%) | Aggregated r2 | Minimac4 | Japanese WGS-enhanced panel (1KG + 7K) | 51,777 | Improved over TOPMed |
2023 [43] | Thai | Ultra-Rare (MAF < 0.001) | Genotype Concordance Rate (GCR) | GenomeAsia | GenomeAsia Panel | 412 | Median GCR = 0.97 |
Metric | Definition |
---|---|
Concordance | Assess overall agreement between imputed and directly typed genotypes, calculating the proportion of matching calls across all variants and samples |
Sensitivity (True Positive Rate) | Determine the proportion of true positives correctly identified by imputation, indicating the ability of imputation to detect variants that are present |
Specificity (True Negative Rate) | Calculate the proportion of true negatives correctly identified by imputation, indicating the ability of imputation to correctly exclude variants that are absent |
Positive Predictive Value (PPV) | Assess the proportion of imputed calls that are actually true positives, providing a measure of the reliability of positive predictions |
Negative Predictive Value (NPV) | Determine the proportion of imputed non-calls that are actually true negatives, providing a measure of the reliability of negative predictions |
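
The validation metrics defined above can be computed for a single variant from paired imputed and directly typed calls. The Python sketch below is a minimal illustration under stated assumptions: the function name and sample data are hypothetical, genotypes are coded as alternate-allele counts (0, 1, 2), and a "positive" is defined as carrying at least one alternate allele, which is one possible convention among several.

```python
def validation_metrics(imputed, typed):
    """Compare imputed hard calls with directly typed genotypes for one variant."""
    pairs = list(zip(imputed, typed))
    # Concordance counts exact genotype matches (0/1/2), not just carrier status.
    concordance = sum(i == t for i, t in pairs) / len(pairs)
    tp = sum(i > 0 and t > 0 for i, t in pairs)    # carrier called as carrier
    tn = sum(i == 0 and t == 0 for i, t in pairs)  # non-carrier called as non-carrier
    fp = sum(i > 0 and t == 0 for i, t in pairs)   # non-carrier called as carrier
    fn = sum(i == 0 and t > 0 for i, t in pairs)   # carrier missed

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is zero

    return {
        "concordance": concordance,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical calls for ten samples.
typed_calls = [0, 0, 0, 0, 1, 1, 2, 0, 0, 1]
imputed_calls = [0, 0, 1, 0, 1, 0, 2, 0, 0, 2]

metrics = validation_metrics(imputed_calls, typed_calls)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this example the last sample is a true carrier whose genotype is imputed as homozygous instead of heterozygous, so it counts as a true positive for sensitivity while still lowering concordance, which shows why the two metrics should be reported together.
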
Variant | Genotype | Imputation Status | Quality Score | Interpretation |
---|---|---|---|---|
CYP2C19*2 | *1/*2 | Imputed | INFO = 0.85 | Reduced function allele. Consider alternative therapy, especially in individuals of African ancestry |
SLCO1B1*5 | T/C | Imputed | r2 = 0.91 | Increased statin-associated myopathy risk. May warrant dose adjustment or alternative statin use |
VKORC1-1639G>A | A/A | Imputed | INFO = 0.87 | Higher warfarin sensitivity. Consider lower starting dose per dosing guidelines |
DPYD*2A | C/T | Imputed | INFO = 0.73 | Partial DPD deficiency. Elevated risk of fluoropyrimidine toxicity. Consider alternative dosing |
TPMT*3C | G/A | Imputed | r2 = 0.78 | Decreased TPMT activity. Risk of thiopurine toxicity; consider dose reduction or alternative therapy |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Casaburi, G.; McCullough, R.; D’Argenio, V. Establishing Best Practices for Clinical GWAS: Tackling Imputation and Data Quality Challenges. Int. J. Mol. Sci. 2025, 26, 6397. https://doi.org/10.3390/ijms26136397