Integrating Fine-Tuning and Retrieval-Augmented Generation for Healthcare AI Systems: A Scoping Review
Abstract
1. Introduction
2. Materials and Methods
2.1. Eligibility Criteria
2.2. Study Screening
2.3. Data Extraction and Synthesis
3. Results
3.1. Study Screening and Selection
3.2. Baseline Characteristics of Included Studies
3.3. Impact on Model Performance
4. Discussion
4.1. Overview of Model Adaptation in Healthcare
4.2. RAG + FT Frameworks in Healthcare AI
4.3. Strengths and Limitations
4.4. Future Directions
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Haider, S.A.; Prabha, S.; Gomez-Cabello, C.A.; Borna, S.; Genovese, A.; Trabilsy, M.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Tao, C.; et al. Synthetic Patient–Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation. Sensors 2025, 25, 4305.
2. Zhang, K.; Meng, X.; Yan, X.; Ji, J.; Liu, J.; Xu, H.; Zhang, H.; Liu, D.; Wang, J.; Wang, X.; et al. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J. Med. Internet Res. 2025, 27, e59069.
3. McCoy, L.G.; Swamy, R.; Sagar, N.; Wang, M.; Bacchi, S.; Fong, J.M.N.; Tan, N.C.; Tan, K.; Buckley, T.A.; Brodeur, P. Assessment of large language models in clinical reasoning: A novel benchmarking study. NEJM AI 2025, 2, AIdbp2500120.
4. Asgari, E.; Montaña-Brown, N.; Dubois, M.; Khalil, S.; Balloch, J.; Yeung, J.A.; Pimenta, D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8, 274.
5. Agarwal, V.; Jin, Y.; Chandra, M.; De Choudhury, M.; Kumar, S.; Sastry, N. Medhalu: Hallucinations in responses to healthcare queries by large language models. arXiv 2024, arXiv:2409.19492.
6. Wang, Z.; Wang, H.; Danek, B.; Li, Y.; Mack, C.; Arbuckle, L.; Biswal, D.; Poon, H.; Wang, Y.; Rajpurkar, P. A perspective for adapting generalist ai to specialized medical ai applications and their challenges. npj Digit. Med. 2025, 8, 429.
7. Ding, H.; Fang, Y.; Zhu, R.; Jiang, X.; Zhang, J.; Xu, Y.; Liao, W.; Chu, X.; Zhao, J.; Wang, Y. 3DS: Medical Domain Adaptation of LLMs via Decomposed Difficulty-based Data Selection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; EMNLP: Suzhou, China, 2025; pp. 19473–19495.
8. Pingua, B.; Sahoo, A.; Kandpal, M.; Murmu, D.; Rautaray, J.; Barik, R.K.; Saikia, M.J. Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation. Bioengineering 2025, 12, 687.
9. Ng, K.K.Y.; Matsuba, I.; Zhang, P.C. RAG in health care: A novel framework for improving communication and decision-making by addressing LLM limitations. NEJM AI 2025, 2, AIra2400380.
10. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877.
11. Haider, S.A.; Prabha, S.; Gomez Cabello, C.A.; Genovese, A.; Collaco, B.; Wood, N.; London, J.; Bagaria, S.; Tao, C.; Forte, A.J. The Development and Evaluation of a Retrieval-Augmented Generation Large Language Model Virtual Assistant for Postoperative Instructions. Bioengineering 2025, 12, 1219.
12. Rangan, K.; Yin, Y. A fine-tuning enhanced RAG system with quantized influence measure as AI judge. Sci. Rep. 2024, 14, 27446.
13. Soudani, H.; Kanoulas, E.; Hasibi, F. Fine tuning vs. retrieval augmented generation for less popular knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region; Association for Computing Machinery: New York, NY, USA, 2024; pp. 12–22.
14. Aromataris, E.; Munn, Z. Chapter 11: Scoping reviews. In JBI Reviewer’s Manual; JBI: Adelaide, Australia, 2020; Volume 10.
15. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
16. Light, M. EndNote 1-2-3 Easy! Reference Management for the Professional, Second Edition, A. Agrawal, 2009, Springer, New York, USA, Price: €44.95, Soft Cover, 294 pages, ISBN: 978-0-387-95900-9, Website: www.springer.com. S. Afr. J. Bot. 2010, 76, 810.
17. Collaco, B.G.; Haider, S.A.; Prabha, S.; Gomez-Cabello, C.A.; Genovese, A.; Wood, N.G.; Bagaria, S.P.; Gopala, N.; Tao, C.; Forte, A.J. The Role of Agentic Artificial Intelligence in Healthcare: A Systematic Review. Preprint 2025.
18. Bora, A.; Cuayáhuitl, H. Systematic Analysis of Retrieval-Augmented Generation-Based LLMs for Medical Chatbot Applications. Mach. Learn. Knowl. Extr. 2024, 6, 2355–2374.
19. Garcia, J.; Gong, J.; Zajac, M.; Hahn, A. DF-RAG: A Dual Federated Retrieval-Augmented Generation Framework for Collaborative Medical AI. In Proceedings of the ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies; IEEE: New York, NY, USA, 2025; pp. 418–423.
20. Kuo, S.-M.; Tai, S.-K.; Lin, H.-Y.; Chen, R.-C. Automated Clinical Trial Data Analysis and Report Generation by Integrating Retrieval-Augmented Generation (RAG) and Large Language Model (LLM) Technologies. AI 2025, 6, 188.
21. Gao, Y.; Zong, L.; Li, Y. Enhancing biomedical question answering with parameter-efficient fine-tuning and hierarchical retrieval augmented generation. In CLEF Working Notes; CLEF: Grenoble, France, 2024.
22. Neupane, S.; Tripathi, H.; Mitra, S.; Bozorgzad, S.; Mittal, S.; Rahimi, S.; Amirlatifi, A. ClinicSum: Utilizing Language Models for Generating Clinical Summaries from Patient-Doctor Conversations. Proc. IEEE Int. Conf. Big Data 2024, 2024, 5050–5059.
23. Xia, P.; Zhu, K.; Li, H.; Wang, T.; Shi, W.; Wang, S.; Zhang, L.; Zou, J.; Yao, H. Mmed-rag: Versatile multimodal rag system for medical vision language models. arXiv 2024, arXiv:2410.13085.
24. Hou, Z.; Liu, H.; Bian, J.; He, X.; Zhuang, Y. Enhancing medical coding efficiency through domain-specific fine-tuned large language models. npj Health Syst. 2025, 2, 14.
25. Nunes, M.; Bone, J.; Ferreira, J.C.; Elvas, L.B. Health Care Language Models and Their Fine-Tuning for Information Extraction: Scoping Review. JMIR Med. Inform. 2024, 12, e60164.
26. Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631.
27. Dorfner, F.J.; Dada, A.; Busch, F.; Makowski, M.R.; Han, T.; Truhn, D.; Kleesiek, J.; Sushil, M.; Adams, L.C.; Bressem, K.K. Evaluating the effectiveness of biomedical fine-tuning for large language models on clinical tasks. J. Am. Med. Inform. Assoc. 2025, 32, 1015–1024.
28. Gargari, O.K.; Habibi, G. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digit. Health 2025, 11, 20552076251337177.
29. Eastwood, B. How Does Retrieval-Augmented Generation (RAG) Support Healthcare AI Initiatives? Available online: https://healthtechmagazine.net/article/2025/01/retrieval-augmented-generation-support-healthcare-ai-perfcon (accessed on 23 July 2025).
30. Yang, R.; Ning, Y.; Keppo, E.; Liu, M.; Hong, C.; Bitterman, D.S.; Ong, J.C.L.; Ting, D.S.W.; Liu, N. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2025, 2, 2.
31. Genovese, A.; Prabha, S.; Borna, S.; Gomez-Cabello, C.A.; Haider, S.A.; Trabilsy, M.; Ho, O.A.; Forte, A.J. 34. Assessing the Effectiveness of RAG-Based AI Models in Answering Postoperative Rhinoplasty Queries: Limitations and Future Directions. Plast. Reconstr. Surg.–Glob. Open 2025, 13, 22–23.
32. Ke, Y.H.; Jin, L.; Elangovan, K.; Abdullah, H.R.; Liu, N.; Sia, A.T.H.; Soh, C.R.; Tung, J.Y.M.; Ong, J.C.L.; Kuo, C.-F.; et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. npj Digit. Med. 2025, 8, 187.
33. Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: A systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. 2025, 32, 605–615.
34. Lakatos, R.; Pollner, P.; Hajdu, A.; Joo, T. Investigating the performance of retrieval-augmented generation and domain-specific fine-tuning for the development of AI-driven knowledge-based systems. Mach. Learn. Knowl. Extr. 2025, 7, 15.
35. Shi, Y.; Xu, S.; Yang, T.; Liu, Z.; Liu, T.; Li, X.; Liu, N. MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering. AMIA Annu. Symp. Proc. 2024, 2024, 1011–1020.
36. Xiong, G.; Jin, Q.; Wang, X.; Zhang, M.; Lu, Z.; Zhang, A. Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium; World Scientific Publishing Co. Pte. Ltd.: Singapore, 2024; pp. 199–214.
37. Lu, S.; Cosgun, E. Boosting GPT models for genomics analysis: Generating trusted genetic variant annotations and interpretations through RAG and Fine-tuning. Bioinform. Adv. 2025, 5, vbaf019.
38. Zhang, T.; Patil, S.G.; Jain, N.; Shen, S.; Zaharia, M.; Stoica, I.; Gonzalez, J.E. Raft: Adapting language model to domain specific rag. arXiv 2024, arXiv:2403.10131.
39. Lopez, I.; Swaminathan, A.; Vedula, K.; Narayanan, S.; Nateghi Haredasht, F.; Ma, S.P.; Liang, A.S.; Tate, S.; Maddali, M.; Gallo, R.J.; et al. Clinical entity augmented retrieval for clinical information extraction. npj Digit. Med. 2025, 8, 45.
40. Elkin, P.L.; Mehta, G.; LeHouillier, F.; Koppel, R.; Elkin, A.N.; Nebeker, J.; Brown, S.H. Retrieval Augmented Generation: What Works and Lessons Learned. Stud. Health Technol. Inform. 2025, 326, 2–6.
41. Brown, A.; Roman, M.; Devereux, B. A systematic literature review of retrieval-augmented generation: Techniques, metrics, and challenges. Big Data Cogn. Comput. 2025, 9, 320.
42. Dorobanțu, F.R.; Hodoșan, V.; Tîrb, A.M.; Zaha, D.C.; Galușca, D.; Pop, N.O.; Dorobanțu, C.D. Pattern of newborn antibiotic use in a tertiary level maternity for five years. Pharmacophore 2022, 13, 57–63.
43. Das, S.; Ge, Y.; Guo, Y.; Rajwal, S.; Hairston, J.; Powell, J.; Walker, D.; Peddireddy, S.; Lakamana, S.; Bozkurt, S. Two-layer retrieval augmented generation framework for low-resource medical question-answering: Proof of concept using Reddit data. arXiv 2024, arXiv:2405.19519.
44. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
45. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115.
46. Afrin, S.; Haque, M.Z.; Mastropaolo, A. A systematic literature review of parameter-efficient fine-tuning for large code models. arXiv 2025, arXiv:2504.21569.
47. Zhao, S.; Yang, Y.; Wang, Z.; He, Z.; Qiu, L.K.; Qiu, L. Retrieval augmented generation (rag) and beyond: A comprehensive survey on how to make your llms use external data more wisely. arXiv 2024, arXiv:2409.14924.
48. Atf, Z.; Safavi-Naini, S.A.A.; Lewis, P.R.; Mahjoubfar, A.; Naderi, N.; Savage, T.R.; Soroush, A. The challenge of uncertainty quantification of large language models in medicine. arXiv 2025, arXiv:2504.05278.
49. González, C.; Fuchs, M.; Santos, D.P.d.; Matthies, P.; Trenz, M.; Grüning, M.; Chaudhari, A.; Larson, D.B.; Othman, A.; Kim, M. Regulating radiology AI medical devices that evolve in their lifecycle. arXiv 2024, arXiv:2412.20498.
50. Farah, L.; Borget, I.; Martelli, N. International Market Access Strategies for Artificial Intelligence–Based Medical Devices: Can We Standardize the Process to Faster Patient Access? Mayo Clin. Proc. Digit. Health 2023, 1, 406–412.
51. Yuan, H. Agentic large language models for healthcare: Current progress and future opportunities. Med. Adv. 2025, 3, 37–41.
52. Vatsal, S.; Dubey, H.; Singh, A. Agentic AI in Healthcare & Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM-based Agents. arXiv 2025, arXiv:2602.04813.



| Author, Year | Country | Hybrid Framework Name a | Base Model | FT Strategy | RAG Strategy | Main Clinical Task |
|---|---|---|---|---|---|---|
| Bora and Cuayáhuitl, 2024 [18] | UK | RAG-based medical chatbot | LLaMA-2-7B, Mistral-7B, Flan-T5-Large | PEFT (LoRA) | Dense RAG | Medical chatbot QA |
| Garcia et al., 2025 [19] | USA | DF-RAG | LLaMA, DeepSeek, Qwen b | Federated PEFT (FlexLoRA) | FKG-based RAG | Clinical decision support (theoretical) |
| Kuo et al., 2025 [20] | Taiwan | Multimodal RAG–LLM | LLaMA-3 8B-Instruct | PEFT (LoRA/QLoRA) + RL alignment | Multimodal Hierarchical RAG | Automated clinical trial report generation |
| Gao et al., 2024 [21] | China | CPS | LLaMA-2-7B | PEFT (LoRA) | Hierarchical Hybrid RAG (Sparse + Dense) | Biomedical QA |
| Neupane et al., 2024 [22] | USA | CLINICSUM | LLaMA-3-8B, Gemma-2-9B, Mistral-7B, Mistral-Nemo-12B | PEFT (LoRA) | Hybrid RAG (Sparse + Dense) | SOAP clinical summary generation |
| Pingua et al., 2025 [8] | USA and India | Hybrid FT + RAG approach | Meta-LLaMA-3.1-8B; Phi-3.5-Mini-Instruct; Gemma-2-9B; Mistral-7B-Instruct; Qwen2.5-7B | PEFT (LoRA/QLoRA) | Dense RAG | Medical QA |
| Xia et al., 2024 [23] | USA | MMed-RAG | LLaVA-Med-1.5 (7B) | RAG-aware PEFT (LoRA) + DPO | Multimodal RAG with domain-aware retrieval | Medical VQA and medical report generation across radiology, ophthalmology, and pathology |

| Author, Year | Baseline Model | Key Metrics | Quantitative Performance (Dataset) | Main Qualitative Findings |
|---|---|---|---|---|
| Bora and Cuayáhuitl, 2024 [18] | Flan-T5-Large | | | Mistral-7B consistently outperformed the other models across tasks and configurations, with superior accuracy, relevance, and overall performance. Flan-T5-Large showed lower accuracy and performance across tasks but the fastest processing. LLaMA-2-7B offered balanced performance. |
| | LLaMA-2-7B | | | |
| | Mistral-7B | | | |
| Garcia et al., 2025 [19] | Seed LLM a | | Total score: 28/30 | Improved accuracy, privacy, and interpretability in theoretical evaluation. Highest score compared with non-hybrid approaches. |
| Kuo et al., 2025 [20] | LLaMA-3 8B-Instruct | | 43.1/0.904/0.791/6.2/78.3 | Improved factual grounding with reduced hallucinations, high clinician acceptance, strong workflow integration, and enhanced trust via traceable evidence and expert review. Improved report quality with a 75% reduction in generation time. |
| Gao et al., 2024 [21] | LLaMA-2-7B | | 0.558/0.573 (BioASQ Task 11b) | Improved QA performance; FT + RAG outperformed baselines. |
| Neupane et al., 2024 [22] | LLaMA-3-8B | | 0.70/0.48/0.55/0.84 | Outperformed GPT models; higher factual accuracy and 61% SME preference. |
| | Gemma-2-9B | | 0.67/0.43/0.49/0.82 | |
| | Mistral-7B | | 0.68/0.45/0.49/0.79 | |
| | Mistral-Nemo-12B | | 0.68/0.45/0.51/0.78 | |
| Pingua et al., 2025 [8] | LLaMA-3.1-8B | | 0.153/0.236/0.423/0.299/0.413/0.268/0.891/0.819/0.908 | FT + RAG significantly outperformed FT across BLEU, ROUGE, BERTScore, and NASS, with the strongest gains for LLaMA-3.1-8B and Phi-3.5-Mini (p < 0.05). RAG + FT improved factual grounding, semantic coherence, and negation handling; benefits were model-dependent and sometimes comparable to RAG alone. |
| | Phi-3.5-Mini-Instruct | | 0.136/0.217/0.391/0.250/0.380/0.258/0.876/0.815/0.881 | |
| | Gemma-2-9B | | 0.104/0.181/0.351/0.193/0.338/0.240/0.871/0.791/0.841 | |
| | Mistral-7B-Instruct | | 0.089/0.172/0.346/0.205/0.331/0.207/0.875/0.796/0.857 | |
| | Qwen2.5-7B | | 0.042/0.130/0.290/0.160/0.872/0.787/0.859 | |
| Xia et al., 2024 [23] | LLaVA-Med-1.5 (7B) | | | Improved factual accuracy by 18.5% (VQA) and 69.1% (RG) over the baseline Med-LVLM; outperformed decoding-based and prior RAG methods; reduced hallucinations and improved cross-modality alignment. |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Collaco, B.G.; Srinivasagam, P.; Gomez-Cabello, C.A.; Haider, S.A.; Genovese, A.; Wood, N.G.; Bagaria, S.; Lifson, M.A.; Forte, A.J. Integrating Fine-Tuning and Retrieval-Augmented Generation for Healthcare AI Systems: A Scoping Review. Bioengineering 2026, 13, 225. https://doi.org/10.3390/bioengineering13020225

