Generative AI and Language Models in Human Genetics and Health: From Variant Interpretation to Clinical Decision Support
Abstract
1. Introduction
2. Sequence Models for Human Genomics
2.1. DNA Language Models
2.2. Protein Language Models
2.3. Structure and Diffusion Generators (Non-LM)
3. Biomedical and Clinical Text Language Models
4. Synthetic Data for Genomics and Clinical Research
5. Clinical Decision Support and Integration
6. Limits and What’s Next
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| Data Type/Model Family | Primary Task | Representative Method(s) | Main Limitation | Clinical Readiness |
|---|---|---|---|---|
| DNA sequence LMs | Regulatory/splice and noncoding variant scoring | DNABERT, Enformer | Tissue/context bias; limited interpretability | Triage/research support |
| Protein sequence LMs | Missense/functional effect signals; sequence seeds for protein design | ProtGPT2, ProGen | In silico scores may not translate to function | Research/variant support |
| Structure prediction and diffusion generators | Structure and interaction prediction; protein design | AlphaFold 3, RFdiffusion, ProteinMPNN | Requires wet-lab validation | Research/design, drug discovery |
| Biomedical and clinical text LMs | Summaries, cohort search, gene–disease extraction, clinical text | BioGPT, Med-PaLM | Hallucination; site-specific bias | Drafting/review support |
| Synthetic data, tabular | Phenotype/genomics-adjacent table generation | medWGAN, medBGAN | Privacy leakage; may distort real-data patterns | Research/method development |
| Synthetic data, time series | Synthetic vitals/labs over time; predict next values/events | TimeGrad | May miss rare trajectories | Research/simulation |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Pinchevsky Itan, Y.; Itan, Y. Generative AI and Language Models in Human Genetics and Health: From Variant Interpretation to Clinical Decision Support. Genes 2026, 17, 723. https://doi.org/10.3390/genes17060723

