Applications of Large Language Models in Medical Research: From Systematic Reviews to Clinical Studies
Abstract
1. Introduction
1.1. Search Strategy and Study Selection
1.2. Large Language Models in Systematic Reviews
Literature Search Strategy Generation
1.3. Literature Screening and Study Selection
1.4. Data Extraction and Evidence Synthesis
1.5. Risk-of-Bias Assessment
1.6. Large Language Models in Narrative Review Writing
Augmenting Scientific Writing
1.7. Literature Synthesis and Thematic Analysis
1.8. Large Language Models in Clinical Research and Data Analysis
Statistical Programming and Analysis
1.9. Clinical Data Processing
1.10. Clinical Trial Protocol Development
2. Methodological Considerations
2.1. Prompt Engineering and Optimization
2.2. Validation and Quality Assurance
2.3. Ethical and Regulatory Considerations
Publication Ethics and Attribution
2.4. Data Privacy and Security
2.5. Access Limitations and Information Bias
2.6. Bias and Fairness
2.7. Integrating Scientific Integrity into LLM Workflows
2.8. Limitations and the Human-AI Partnership
3. Research Question and Hypothesis Generation
4. Future Directions
Emerging Open-Source and Cost-Effective Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- van Dis, E.A.M.; Bollen, J.; Zuidema, W.; van Rooij, R.; Bockting, C.L. ChatGPT: Five priorities for research. Nature 2023, 614, 224–226.
- Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940.
- Gong, E.J.; Bang, C.S. Evaluating the role of large language models in inflammatory bowel disease patient information. World J. Gastroenterol. 2024, 30, 3538–3540.
- Gong, E.J.; Bang, C.S. Revolutionizing gastrointestinal endoscopy: The emerging role of large language models. Clin. Endosc. 2024, 57, 759–762.
- Gong, E.J.; Bang, C.S.; Lee, J.J.; Park, J.; Kim, E.; Kim, S.; Kimm, M.; Choi, S.-H. Large Language Models in Gastroenterology: Systematic Review. J. Med. Internet Res. 2024, 26, e66648.
- Liu, J.; Wang, C.; Liu, S. Utility of ChatGPT in Clinical Practice. J. Med. Internet Res. 2023, 25, e48568.
- Kim, H.J.; Gong, E.J.; Bang, C.-S. Application of Machine Learning Based on Structured Medical Data in Gastroenterology. Biomimetics 2023, 8, 512.
- Borah, R.; Brown, A.W.; Capers, P.L.; Kaiser, K.A. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open 2017, 7, e012545.
- Landhuis, E. Scientific literature: Information overload. Nature 2016, 535, 457–458.
- Marshall, I.J.; Wallace, B.C. Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Syst. Rev. 2019, 8, 163.
- Linardon, J.; Messer, M.; Anderson, C.; Liu, C.; McClure, Z.; Jarman, H.K.; Goldberg, S.B.; Torous, J. Role of large language models in mental health research: An international survey of researchers’ practices and perspectives. BMJ Ment. Health 2025, 28, e301787.
- Qureshi, R.; Shaughnessy, D.; Gill, K.A.R.; Robinson, K.A.; Li, T.; Agai, E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst. Rev. 2023, 12, 72.
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
- Anonymous. Tools such as ChatGPT threaten transparent science; here are our ground rules for their use. Nature 2023, 613, 612.
- Flanagin, A.; Bibbins-Domingo, K.; Berkwits, M.; Christiansen, S.L. Nonhuman “Authors” and Implications for the Integrity of Scientific Publication and Medical Knowledge. JAMA 2023, 329, 637–639.
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA 2025, 333, 319–328.
- Scherbakov, D.; Hubig, N.; Jansari, V.; Bakumenko, A.; Lenert, L.A. The emergence of large language models as tools in literature reviews: A large language model-assisted systematic review. J. Am. Med. Inform. Assoc. 2025, 32, 1071–1086.
- Lieberum, J.L.; Toews, M.; Metzendorf, M.I.; Heilmeyer, F.; Siemens, W.; Haverkamp, C.; Böhringer, D.; Meerpohl, J.J.; Eisele-Metzger, A. Large language models for conducting systematic reviews: On the rise, but not yet ready for use—A scoping review. J. Clin. Epidemiol. 2025, 181, 111746.
- Ahn, S. The transformative impact of large language models on medical writing and publishing: Current applications, challenges and future directions. Korean J. Physiol. Pharmacol. 2024, 28, 393–401.
- Omar, M.; Nadkarni, G.N.; Klang, E.; Glicksberg, B.S. Large language models in medicine: A review of current clinical trials across healthcare applications. PLoS Digit. Health 2024, 3, e0000662.
- Ferrari, R. Writing narrative style literature reviews. Med. Writ. 2015, 24, 230–235.
- Greenhalgh, T.; Thorne, S.; Malterud, K. Time to challenge the spurious hierarchy of systematic over narrative reviews? Eur. J. Clin. Investig. 2018, 48, e12931.
- Sukhera, J. Narrative reviews: Flexible, rigorous, and practical. J. Grad. Med. Educ. 2022, 14, 414–417.
- Baethge, C.; Goldbeck-Wood, S.; Mertens, S. SANRA—A scale for the quality assessment of narrative review articles. Res. Integr. Peer Rev. 2019, 4, 5.
- Wang, S.; Scells, H.; Koopman, B.; Zuccon, G. Can ChatGPT write a good boolean query for systematic review literature search? arXiv 2023, arXiv:2302.03495.
- Yu, F.; Kincaide, H.; Carlson, R.B. An Empirical Study Evaluating ChatGPT’s Performance in Generating Search Strategies for Systematic Reviews. Proc. Assoc. Inf. Sci. Technol. 2024, 61, 423–434.
- Parisi, V.; Sutton, A. The role of ChatGPT in developing systematic literature searches: An evidence summary. J. EAHIL 2024, 20, 30–34.
- O’Connor, A.M.; Tsafnat, G.; Gilbert, S.B.; Thayer, K.A.; Shemilt, I.; Thomas, J.; Glasziou, P.; Wolfe, M.S. Still moving toward automation of the systematic review process: A summary of discussions at the third meeting of the International Collaboration for Automation of Systematic Reviews (ICASR). Syst. Rev. 2019, 8, 57.
- Clark, J.; Glasziou, P.; Del Mar, C.; Bannach-Brown, A.; Stehlik, P.; Scott, A.M. A full systematic review was completed in 2 weeks using automation tools: A case study. J. Clin. Epidemiol. 2020, 121, 81–90.
- Issaiy, M.; Ghanaati, H.; Kolahi, S.; Shakiba, M.; Jalali, A.H.; Zarei, D.; Kazemian, S.; Avanaki, M.A.; Firouznia, K. Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Med. Res. Methodol. 2024, 24, 78.
- Cai, X.; Geng, Y.; Du, Y.; Westerman, B.; Wang, D.; Ma, C.; Vallejo, J.J.G. Utilizing large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation. BMC Med. Res. Methodol. 2025, 25, 116.
- Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res. Synth. Methods 2024, 15, 616–626.
- Kohandel Gargari, O.; Mahmoudi, M.H.; Hajisafarali, M.; Samiee, R. Enhancing title and abstract screening for systematic reviews with GPT-3.5 turbo. BMJ Evid.-Based Med. 2024, 29, 69–70.
- Matsui, K.; Utsumi, T.; Aoki, Y.; Maruki, T.; Takeshima, M.; Takaesu, Y. Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews. J. Med. Internet Res. 2024, 26, e52758.
- Windisch, P.; Dennstädt, F.; Koechli, C.; Schröder, C.; Aebersold, D.M.; Förster, R.; Zwahlen, D.R.; Windisch, P.Y. The Impact of Temperature on Extracting Information From Clinical Trial Publications Using Large Language Models. Cureus 2024, 16, e75748.
- Oami, T.; Okada, Y.; Nakada, T.-A. Optimal large language models to screen citations for systematic reviews. Res. Synth. Methods 2025, 16, 859–875.
- Sorich, M.J.; Mangoni, A.A.; Bacchi, S.; Menz, B.D.; Hopkins, A.M. The Triage and Diagnostic Accuracy of Frontier Large Language Models: Updated Comparison to Physician Performance. J. Med. Internet Res. 2024, 26, e67409.
- Karpathy, A. LLM Council; GitHub: San Francisco, CA, USA, 2025; Available online: https://github.com/karpathy/llm-council (accessed on 13 March 2026).
- Guo, E.; Gupta, M.; Deng, J.; Park, Y.-J.; Paget, M.; Naugler, C. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. J. Med. Internet Res. 2024, 26, e48996.
- Jonnalagadda, S.R.; Goyal, P.; Huffman, M.D. Automating data extraction in systematic reviews: A systematic review. Syst. Rev. 2015, 4, 78.
- Schmidt, L.; Hair, K.; Graziosi, S.; Campbell, F.; Kapp, C.; Khanteymoori, A.; Craig, D.; Engelbert, M.; Thomas, J. Exploring the use of a large language model for data extraction in systematic reviews: A rapid feasibility study. arXiv 2024, arXiv:2405.14445.
- Khan, M.A.; Ayub, U.; Naqvi, S.A.A.; Khakwani, K.Z.R.; Sipra, Z.B.R.; Raina, A.; Zhou, S.; He, H.; Saeidi, A.; Hasan, B.; et al. Collaborative large language models for automated data extraction in living systematic reviews. J. Am. Med. Inform. Assoc. 2025, 32, 638–647.
- Konet, A.; Thomas, I.; Gartlehner, G.; Kahwati, L.; Hilscher, R.; Kugley, S.; Crotty, K.; Viswanathan, M.; Chew, R. Performance of two large language models for data extraction in evidence synthesis. Res. Synth. Methods 2024, 15, 818–824.
- Gartlehner, G.; Kahwati, L.; Hilscher, R.; Thomas, I.; Kugley, S.; Crotty, K.; Viswanathan, M.; Nussbaumer-Streit, B.; Booth, G.; Erskine, N.; et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Res. Synth. Methods 2024, 15, 576–589.
- Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; Park, S. OCR-free document understanding transformer. arXiv 2021, arXiv:2111.15664.
- Wang, D.; Raman, N.; Sibue, M.; Ma, Z.; Babkin, P.; Kaur, S.; Pei, Y.; Nourbakhsh, A.; Liu, X. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv 2023, arXiv:2401.00908.
- Jin, Q.; Chen, F.; Zhou, Y.; Xu, Z.; Cheung, J.M.; Chen, R.; Summers, R.M.; Rousseau, J.F.; Ni, P.; Landsman, M.J.; et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digit. Med. 2024, 7, 190.
- Bhattacharyya, M.; Miller, V.M.; Bhattacharyya, D.; Miller, L.E. High Rates of Fabricated and Inaccurate References in ChatGPT-Generated Medical Content. Cureus 2023, 15, e39238.
- Motzfeldt Jensen, M.; Brix Danielsen, M.; Riis, J.; Assifuah Kristjansen, K.; Andersen, S.; Okubo, Y.; Jørgensen, M.G. ChatGPT-4o can serve as the second rater for data extraction in systematic reviews. PLoS ONE 2025, 20, e0313401.
- Pitre, T.; Jassal, T.; Talukdar, J.R.; Shahab, M.; Ling, M.; Zeraatkar, D. ChatGPT for assessing risk of bias of randomized trials using the RoB 2.0 tool: A methods study. medRxiv 2023. medRxiv:2023.11.19.23298727.
- Kuitunen, I.; Ponkilainen, V.T.; Liukkonen, R.; Nyrhi, L.; Pakarinen, O.; Vaajala, M.; Uimonen, M.M. Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments. J. Evid.-Based Med. 2024, 17, 700–702.
- Kuitunen, I.; Nyrhi, L.; De Luca, D. ChatGPT-4o in Risk-of-Bias Assessments in Neonatology: A Validity Analysis. Neonatology 2025, 122, 360–365.
- Šuster, S.; Baldwin, T.; Verspoor, K. Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials. Res. Synth. Methods 2024, 15, 988–1000.
- Lai, H.; Ge, L.; Sun, M.; Pan, B.; Huang, J.; Hou, L.; Yang, Q.; Liu, J.; Liu, J.; Ye, Z.; et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw. Open 2024, 7, e2412687.
- Huang, J.; Lai, H.; Zhao, W.; Xia, D.; Bai, C.; Sun, M.; Liu, J.; Liu, J.; Pan, B.; Tian, J.; et al. Large Language Model–Assisted Risk-of-Bias Assessment in Randomized Controlled Trials Using the Revised Risk-of-Bias Tool: Evaluation Study. J. Med. Internet Res. 2025, 27, e70450.
- Huang, J.; Tan, M. The role of ChatGPT in scientific communication: Writing better scientific review articles. Am. J. Cancer Res. 2023, 13, 1148–1154.
- Amano, T.; González-Varo, J.P.; Sutherland, W.J. Languages Are Still a Major Barrier to Global Science. PLoS Biol. 2016, 14, e2000933.
- Dergaa, I.; Chamari, K.; Zmijewski, P.; Ben Saad, H. From human writing to artificial intelligence generated text: Examining the prospects and potential threats of ChatGPT in academic writing. Biol. Sport 2023, 40, 615–622.
- Gong, E.J.; Woo, J.; Lee, J.J.; Bang, C.S. Role of artificial intelligence in gastric diseases. World J. Gastroenterol. 2025, 31, 111327.
- Dwivedi, Y.K.; Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Baabdullah, A.M.; Koohang, A.; Raghavan, V.; Ahuja, M.; et al. “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inf. Manag. 2023, 71, 102642.
- Walters, W.H.; Wilder, E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci. Rep. 2023, 13, 14045.
- Else, H. Abstracts written by ChatGPT fool scientists. Nature 2023, 613, 423.
- Salvagno, M.; Taccone, F.S.; Gerli, A.G. Can artificial intelligence help for scientific writing? Crit. Care 2023, 27, 75.
- Patel, S.B.; Lam, K. ChatGPT: The future of discharge summaries? Lancet Digit. Health 2023, 5, e107–e108.
- Cascella, M.; Montomoli, J.; Bellini, V.; Bignami, E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. J. Med. Syst. 2023, 47, 33.
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-refine: Iterative refinement with self-feedback. arXiv 2023, arXiv:2303.17651.
- Sun, S.; Yuan, R.; Cao, Z.; Li, W.; Liu, P. Prompt chaining or stepwise prompt? Refinement in text summarization. arXiv 2024, arXiv:2406.00507.
- Lee, P.; Bubeck, S.; Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N. Engl. J. Med. 2023, 388, 1233–1239.
- Ruta, M.R.; Gaidici, T.; Irwin, C.; Lifshitz, J. ChatGPT for Univariate Statistics: Validation of AI-Assisted Data Analysis in Healthcare Research. J. Med. Internet Res. 2025, 27, e63550.
- Huang, Y.; Wu, R.; He, J.; Xiang, Y. Evaluating ChatGPT-4.0’s data analytic proficiency in epidemiological studies: A comparative analysis with SAS, SPSS, and R. J. Glob. Health 2024, 14, 04070.
- Dobler, D.; Binder, H.; Boulesteix, A.L.; Igelmann, J.; Köhler, D.; Mansmann, U.; Pauly, M.; Scherag, A.; Schmid, M.; Al Tawil, A.; et al. ChatGPT as a Tool for Biostatisticians: A Tutorial on Applications, Opportunities, and Limitations. Stat. Med. 2025, 44, e70263.
- Shahrul, A.I.; Syed Mohamed, A.M.F. A Comparative Evaluation of Statistical Product and Service Solutions (SPSS) and ChatGPT-4 in Statistical Analyses. Cureus 2024, 16, e72581.
- Evans, R.; Pozzi, A. Using CHATGPT to develop the statistical analysis plan for a randomized controlled trial: A case report. Research Square 2023.
- Lee, J.H.; Shin, J. How to Optimize Prompting for Large Language Models in Clinical Research. Korean J. Radiol. 2024, 25, 869–873.
- Suh, C.H.; Yi, J.; Shim, W.H.; Heo, H. Insufficient Transparency in Stochasticity Reporting in Large Language Model Studies for Medical Applications in Leading Medical Journals. Korean J. Radiol. 2024, 25, 1029–1031.
- Ordak, M. ChatGPT’s Skills in Statistical Analysis Using the Example of Allergology: Do We Have Reason for Concern? Healthcare 2023, 11, 2554.
- Liu, X.; Wu, Z.; Wu, X.; Lu, P.; Chang, K.-W.; Feng, Y. Are LLMs capable of data-based statistical and causal reasoning? Benchmarking advanced quantitative reasoning with data. arXiv 2024, arXiv:2402.17644.
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 2024, 30, 1134–1142.
- Yan, C.; Ong, H.H.; Grabowska, M.E.; Krantz, M.S.; Su, W.C.; Dickson, A.L.; Peterson, J.F.; Feng, Q.; Roden, D.M.; Stein, C.M.; et al. Large language models facilitate the generation of electronic health record phenotyping algorithms. J. Am. Med. Inform. Assoc. 2024, 31, 1994–2001.
- Jonnagaddala, J.; Wong, Z.S. Privacy preserving strategies for electronic health records in the era of large language models. npj Digit. Med. 2025, 8, 34.
- Wiest, I.C.; Ferber, D.; Zhu, J.; van Treeck, M.; Meyer, S.K.; Juglan, R.; Carrero, Z.I.; Paech, D.; Kleesiek, J.; Ebert, M.P.; et al. Privacy-preserving large language models for structured medical information retrieval. npj Digit. Med. 2024, 7, 257.
- Kugic, A.; Schulz, S.; Kreuzthaler, M. Disambiguation of acronyms in clinical narratives with large language models. J. Am. Med. Inform. Assoc. 2024, 31, 2040–2046.
- Cui, H.; Unell, A.; Chen, B.; Fries, J.A.; Alsentzer, E.; Koyejo, S.; Shah, N.H. TIMER: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digit. Med. 2025, 8, 577.
- Jin, Q.; Wang, Z.; Floudas, C.S.; Chen, F.; Gong, C.; Bracken-Clarke, D.; Xue, E.; Yang, Y.; Sun, J.; Lu, Z. Matching patients to clinical trials with large language models. Nat. Commun. 2024, 15, 9074.
- Markey, N.; El-Mansouri, I.; Rensonnet, G.; van Langen, C.; Meier, C. From RAGs to riches: Utilizing large language models to write documents for clinical trials. Clin. Trials 2025, 22, 626–631.
- Ali, R.; Connolly, I.D.; Tang, O.Y.; Mirza, F.N.; Johnston, B.; Abdulrazeq, H.F.; Lim, R.K.; Galamaga, P.F.; Libby, T.J.; Sodha, N.R.; et al. Bridging the literacy gap for surgical consents: An AI-human expert collaborative approach. npj Digit. Med. 2024, 7, 63.
- Zaghir, J.; Naguib, M.; Bjelogrlic, M.; Névéol, A.; Tannier, X.; Lovis, C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J. Med. Internet Res. 2024, 26, e60501.
- Gong, E.J.; Bang, C.S. Interpretation of Medical Images Using Artificial Intelligence: Current Status and Future Perspectives. Korean J. Gastroenterol. 2023, 82, 43–45.
- Jeon, S.; Kim, H.G. A comparative evaluation of chain-of-thought-based prompt engineering techniques for medical question answering. Comput. Biol. Med. 2025, 196, 110614.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
- Park, S.H.; Suh, C.H.; Lee, J.H.; Kahn, C.E.; Moy, L. Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare (MI-CLEAR-LLM). Korean J. Radiol. 2024, 25, 865–868.
- Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv 2023, arXiv:2302.04023.
- Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus 2023, 15, e35179.
- Busch, F.; Hoffmann, L.; Rueger, C.; van Dijk, E.H.; Kader, R.; Ortiz-Prado, E.; Makowski, M.R.; Saba, L.; Hadamitzky, M.; Kather, J.N.; et al. Current applications and challenges in large language models for patient care: A systematic review. Commun. Med. 2025, 5, 26.
- Zielinski, C.; Winker, M.A.; Aggarwal, R.; Ferris, L.E.; Heinemann, M.; Lapeña, J.F., Jr.; Pai, S.A.; Ing, E.; Citrome, L.; Alam, M.; et al. Chatbots, generative AI, and scholarly manuscripts: WAME recommendations on chatbots and generative artificial intelligence in relation to scholarly publications. Colomb. Med. 2023, 54, e1015868.
- Ganjavi, C.; Eppler, M.B.; Pekcan, A.; Biedermann, B.; Abreu, A.; Collins, G.S.; Gill, I.S.; Cacciamani, G.E. Publishers’ and journals’ instructions to authors on use of generative artificial intelligence in academic and scientific publishing: Bibliometric analysis. BMJ 2024, 384, e077192.
- Hill-Yardin, E.L.; Hutchinson, M.R.; Laycock, R.; Spencer, S.J. A Chat(GPT) about the future of scientific publishing. Brain Behav. Immun. 2023, 110, 152–154.
- Murdoch, B. Privacy and artificial intelligence: Challenges for protecting health information in a new era. BMC Med. Ethics 2021, 22, 122.
- Price, W.N., 2nd; Cohen, I.G. Privacy in the age of medical big data. Nat. Med. 2019, 25, 37–43.
- Kaissis, G.A.; Makowski, M.R.; Rückert, D.; Braren, R.F. Secure, privacy-preserving and federated machine learning in medical imaging. Nat. Mach. Intell. 2020, 2, 305–311.
- Vayena, E.; Blasimme, A.; Cohen, I.G. Machine learning in medicine: Addressing ethical challenges. PLoS Med. 2018, 15, e1002689.
- Morley, J.; Machado, C.C.V.; Burr, C.; Cowls, J.; Joshi, I.; Taddeo, M.; Floridi, L. The ethics of AI in health care: A mapping review. Soc. Sci. Med. 2020, 260, 113172.
- Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; et al. The Pile: An 800 GB dataset of diverse text for language modeling. arXiv 2020, arXiv:2101.00027.
- Baack, S. A critical analysis of the largest source for generative AI training data: Common Crawl. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency; ACM: New York, NY, USA, 2024.
- Omar, M.; Sorin, V.; Agbareia, R.; Apakama, D.U.; Soroush, A.; Sakhuja, A.; Freeman, R.; Horowitz, C.R.; Richardson, L.D.; Nadkarni, G.N.; et al. Evaluating and addressing demographic disparities in medical large language models: A systematic review. Int. J. Equity Health 2025, 24, 57.
- Omiye, J.A.; Lester, J.C.; Spichak, S.; Rotemberg, V.; Daneshjou, R. Large language models propagate race-based medicine. npj Digit. Med. 2023, 6, 195.
- Zack, T.; Lehman, E.; Suzgun, M.; Rodriguez, J.A.; Celi, L.A.; Gichoya, J.; Jurafsky, D.; Szolovits, P.; Bates, D.W.; Abdulnour, R.-E.E.; et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: A model evaluation study. Lancet Digit. Health 2024, 6, e12–e22.
- Haltaufderheide, J.; Ranisch, R. The ethics of ChatGPT in medicine and healthcare: A systematic review on Large Language Models (LLMs). npj Digit. Med. 2024, 7, 183.
- Parasuraman, R.; Manzey, D.H. Complacency and bias in human use of automation: An attentional integration. Hum. Factors 2010, 52, 381–410.
- Goddard, K.; Roudsari, A.; Wyatt, J.C. Automation bias: A systematic review of frequency, effect mediators, and mitigators. J. Am. Med. Inform. Assoc. 2012, 19, 121–127.
- Quek, S.X.Z.; Ho, K.Y. Artificial Intelligence in Upper Gastrointestinal Diagnosis. Korean J. Helicobacter Up. Gastrointest. Res. 2025, 25, 251.
- Roser, D.; Meinikheim, M.; Muzalyova, A.; Mendel, R.; Palm, C.; Probst, A.; Nagl, S.; Scheppach, M.W.; Römmele, C.; Schnoy, E.; et al. Artificial Intelligence-assisted Endoscopy and Examiner Confidence: A Study on Human-Artificial Intelligence Interaction in Barrett’s Esophagus (with Video). DEN Open 2026, 6, e70150.
- Abdulnour, R.E.; Gin, B.; Boscardin, C.K. Educational Strategies for Clinical Supervision of Artificial Intelligence Use. N. Engl. J. Med. 2025, 393, 786–797.
- Marcus, G. Deep learning: A critical appraisal. arXiv 2018, arXiv:1801.00631.
- Stadler, M.; Bannert, M.; Sailer, M. Cognitive ease at a cost: LLMs reduce mental effort but compromise depth in student scientific inquiry. Comput. Hum. Behav. 2024, 160, 108386.
- Kosmyna, N.; Hauptmann, E.; Yuan, Y.T.; Situ, J.; Liao, X.-H.; Beresnitzky, A.V.; Braunstein, I.; Maes, P. Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv 2025, arXiv:2506.08872.
- Choudhury, A.; Chaudhry, Z. Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals. J. Med. Internet Res. 2024, 26, e56764.
- Abdel-Rehim, A.; Zenil, H.; Orhobor, O.; Fisher, M.; Collins, R.J.; Bourne, E.; Fearnley, G.W.; Tate, E.; Smith, H.X.; Soldatova, L.N.; et al. Scientific hypothesis generation by large language models: Laboratory validation in breast cancer treatment. J. R. Soc. Interface 2025, 22, 20240674.
- Acosta, J.N.; Falcone, G.J.; Rajpurkar, P.; Topol, E.J. Multimodal biomedical AI. Nat. Med. 2022, 28, 1773–1784.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv 2020, arXiv:2005.11401.
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950.
- Ren, F.; Aliper, A.; Chen, J.; Zhao, H.; Rao, S.; Kuppe, C.; Ozerov, I.V.; Zhang, M.; Witte, K.; Kruse, C.; et al. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat. Biotechnol. 2025, 43, 63–75.
- Tordjman, M.; Liu, Z.; Yuce, M.; Fauveau, V.; Mei, Y.; Hadjadj, J.; Bolger, I.; Almansour, H.; Horst, C.; Parihar, A.S.; et al. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nat. Med. 2025, 31, 2550–2555.
- Sandmann, S.; Hegselmann, S.; Fujarski, M.; Bickmann, L.; Wild, B.; Eils, R.; Varghese, J. Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat. Med. 2025, 31, 2546–2549.
- Zhu, S.; Hu, W.; Yang, Z.; Yan, J.; Zhang, F. Qwen-2.5 outperforms other large language models in the Chinese National Nursing Licensing Examination: Retrospective cross-sectional comparative study. JMIR Med. Inform. 2025, 13, e63731.
- Lin, K.H.; Kao, T.H.; Wang, L.C.; Kuo, C.T.; Chen, P.C.; Chu, Y.C.; Yeh, Y.C. Benchmarking large language models GPT-4o, Llama 3.1, and Qwen 2.5 for cancer genetic variant classification. npj Precis. Oncol. 2025, 9, 141.

| Study | Model(s) | Number of Studies | Sensitivity | Specificity | Key Findings |
|---|---|---|---|---|---|
| Oami et al. 2025 [36] | GPT-4o | 16,669 | 0.85 | 0.97 | Higher specificity, lower sensitivity |
| | Gemini 1.5 Pro | | 0.94 | 0.85 | Higher sensitivity, lower specificity |
| | Claude 3.5 Sonnet | | 0.94 | 0.80 | Higher sensitivity, lowest specificity |
| | Llama 3.3 70B | | 0.88 | 0.93 | Trade-off: sensitivity vs. specificity |
| | Ensemble (OR rule) | | Improved | Decreased | Trade-off: sensitivity vs. specificity |
| Matsui et al. 2024 [34] | GPT-4 (3-layer) | 4527 | 0.81–0.88 | 0.86–1.00 | Layered screening approach effective |
| | GPT-3.5 (3-layer) | | 0.69–0.75 | 0.95–0.98 | Lower sensitivity than GPT-4 |
| Guo et al. 2024 [39] | GPT-3.5/GPT-4 | 24,307 | 0.76 | 0.91 | No pretraining required |
| Kohandel Gargari et al. 2024 [33] | GPT-3.5 Turbo | 200 | 0.38–0.69 | 0.25–0.85 | Prompt structure critical; trade-offs inevitable |
| Cai et al. 2025 [31] | LARS-GPT (multi-LLM) | N/A | >0.90 | N/A | 40% workload reduction with dual-phase approach |
| Khraisha et al. 2024 [32] | GPT-4 | 2421 | 0.75 (English) | N/A | Sensitivity drops to 0.36 for non-English texts |
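
The sensitivity and specificity columns above, and the direction of the "Ensemble (OR rule)" row, can be illustrated with a minimal sketch. It assumes each model returns a binary include/exclude decision per record and that human decisions serve as the reference standard; the function names and the ten-record toy data are hypothetical and are not taken from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningMetrics:
    sensitivity: float  # share of truly eligible records the screener included
    specificity: float  # share of truly ineligible records the screener excluded


def screening_metrics(truth, predicted):
    """Compare include/exclude decisions with the human reference standard."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one eligible and one ineligible record")
    return ScreeningMetrics(tp / (tp + fn), tn / (tn + fp))


def or_rule(*model_decisions):
    """Include a record if ANY model includes it (the 'OR rule' ensemble)."""
    return [any(votes) for votes in zip(*model_decisions)]


# Hypothetical toy data: True = include, False = exclude.
truth   = [True, True, True, True, False, False, False, False, False, False]
model_a = [True, True, False, False, False, False, False, False, False, True]
model_b = [False, True, True, False, False, False, False, False, True, False]

for name, decisions in [("model_a", model_a), ("model_b", model_b),
                        ("OR ensemble", or_rule(model_a, model_b))]:
    m = screening_metrics(truth, decisions)
    print(f"{name}: sensitivity={m.sensitivity:.2f}, specificity={m.specificity:.2f}")
```

Because the OR rule can only add inclusions, it can raise sensitivity but never specificity, which matches the "Improved / Decreased" entries in the ensemble row.
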

| Extraction Approach | Performance | Key Findings |
|---|---|---|
| Overall extraction (single LLM) [41] | Accuracy ~80% | 82% clinical, 80% animal, 72% social science studies |
| PICO elements [41] | Accuracy >80% (P, I, C), lower for O | Participants/Intervention well-extracted; Outcomes challenging |
| Collaborative dual-LLM (concordant) [42] | Accuracy 94% | GPT-4-turbo + Claude-3-Opus agreement; hallucination rate 0.25% |
| Single LLM (discordant cases) [42] | Accuracy 41–50% | GPT-4-turbo 41%, Claude-3-Opus 50%; hallucination rate ~2.5% |
| Non-English texts [32] | Sensitivity 36% | Significant performance drop in non-English literature |
| PDF-dependent extraction [43,44] | 68.8–100% | Automated PDF parsing 68.8% vs. manual text selection 100% |
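
The dual-LLM rows above suggest a simple workflow: accept fields on which two models agree and route disagreements to a human reviewer. The sketch below illustrates that idea under stated assumptions. It assumes each model's extraction has already been parsed into a dictionary of field names and string values; `normalize`, `split_by_concordance`, and the example records are hypothetical and do not reproduce the method of the cited study [42].

```python
from typing import Optional


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def split_by_concordance(
    extraction_a: dict[str, str], extraction_b: dict[str, str]
) -> tuple[dict[str, str], dict[str, tuple[Optional[str], Optional[str]]]]:
    """Partition fields into concordant values and discordant pairs.

    A field is concordant only if both models supplied it and the normalized
    values match exactly; everything else is flagged for human review.
    """
    concordant: dict[str, str] = {}
    discordant: dict[str, tuple[Optional[str], Optional[str]]] = {}
    for field in sorted(extraction_a.keys() | extraction_b.keys()):
        a = extraction_a.get(field)
        b = extraction_b.get(field)
        if a is not None and b is not None and normalize(a) == normalize(b):
            concordant[field] = a
        else:
            discordant[field] = (a, b)
    return concordant, discordant


# Hypothetical outputs from two models for the same article.
model_a_output = {
    "sample_size": "120",
    "intervention": "Drug X 10 mg",
    "primary_outcome": "Mortality at 30 days",
}
model_b_output = {
    "sample_size": "120",
    "intervention": "drug x  10 mg",
    "primary_outcome": "30-day mortality",
}

agreed, needs_review = split_by_concordance(model_a_output, model_b_output)
print("Accepted:", agreed)
print("Needs human review:", needs_review)
```

Exact matching after normalization is deliberately strict: paraphrases such as the two outcome strings here are flagged even though they mean the same thing, which errs toward more human review rather than silently accepting a mismatch.
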

| Issue Type | Estimated Frequency | Detection Method | Mitigation Strategy |
|---|---|---|---|
| Fabricated references | 18–55% (model-dependent) [48] | Database verification | Verify citations |
| Inaccurate citations | 24–46% [48,61] | Original source check | Verify bibliographic details |
| Incorrect PMID | 93% of papers [48] | PubMed verification | Cross-check all PMIDs |
| Oversimplification | Common (not quantified) [64] | Expert review | Maintain technical precision |
| Lost nuance | Common (not quantified) [65] | Domain expert check | Preserve complexity |
| Style homogenization | Common (not quantified) | AI detection tools, stylometric analysis | Maintain author voice, iterative refinement |

| Consideration | Challenge | Evidence | Recommendation |
|---|---|---|---|
| Assumption checking | Often omitted without explicit prompting; 43.8% accuracy with basic prompts [69] | Fails normality verification, inappropriate test selection [76] | Always verify assumptions manually |
| Model selection | May choose inappropriate tests; incorrect method selection was the most common error (66%, n = 51 of 77 total errors) [69] | 44% of errors involved knowledge recall (wrong test selection, statistical vs. causal method confusion) [77] | Require statistical expertise for selection |
| Complex designs | Poor performance on hierarchical models, survival analysis, or meta-analysis [71] | R code for survival analysis worked without corrections in 7/10 sessions [71] | Use only for simple analyses initially |
| Reproducibility | Identical prompts yield different results across sessions [71] | High variability in meta-analysis outputs [71] | Verify across multiple runs |
| Stochasticity reporting | Stochastic outputs even at temperature = 0; model version changes alter results [74] | Only 15.1% of studies adequately reported stochasticity handling [75] | Document per MI-CLEAR-LLM; use temperature = 0; archive model versions |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gong, E.J.; Bang, C.S.; Shin, Y.S. Applications of Large Language Models in Medical Research: From Systematic Reviews to Clinical Studies. Bioengineering 2026, 13, 365. https://doi.org/10.3390/bioengineering13030365

