5. Gaps and Challenges
Addressing the tensions identified in this literature review requires a research agenda that can adapt to the rapid evolution of LLM technology. We propose that the path forward be structured around four interconnected pillars, as illustrated in Figure 3. The following discussion unpacks the critical gaps in current research that motivate each pillar of this agenda.
The first pillar of this agenda, evolving evaluation frameworks, is motivated by the misalignment between the field’s current evaluation methods and clinical reality. Because the capabilities of LLMs evolve rapidly, evaluation frameworks must go beyond static benchmarks. The field requires methodological infrastructure that is flexible, modular, and continuously updated, so that new models can be tested against evolving metrics such as factuality, empathy, bias, and safety. Rather than fixed leaderboards, this calls for iterative pipelines that support reproducibility, versioning, and clinician-in-the-loop refinement.
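As an illustration only, the following Python sketch shows one way such a modular, versioned pipeline could be organized. Every name in it (`EvalSuite`, `exact_match`, `review_threshold`, the case fields, and the stub model) is a hypothetical placeholder introduced for this example and is not taken from any reviewed study. The single metric is a toy proxy for factuality; real factuality, empathy, bias, and safety metrics would be registered in the same way.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A metric maps (model_output, reference) to a score in [0, 1].
Metric = Callable[[str, str], float]


def exact_match(output: str, reference: str) -> float:
    """Toy factuality proxy (assumption): 1.0 if normalized strings match."""
    return float(output.strip().lower() == reference.strip().lower())


@dataclass
class EvalSuite:
    """Hypothetical evaluation suite with pluggable metrics and a version tag."""
    version: str
    review_threshold: float = 0.5  # scores below this are routed to a clinician
    metrics: Dict[str, Metric] = field(default_factory=dict)

    def register(self, name: str, fn: Metric) -> None:
        # New metrics can be added without touching the rest of the pipeline.
        self.metrics[name] = fn

    def run(self, model_id: str, cases: List[dict],
            generate: Callable[[str], str]) -> dict:
        if not cases:
            raise ValueError("at least one evaluation case is required")
        per_metric: Dict[str, List[float]] = {name: [] for name in self.metrics}
        flagged: List[str] = []
        for case in cases:
            output = generate(case["prompt"])
            for name, fn in self.metrics.items():
                score = fn(output, case["reference"])
                per_metric[name].append(score)
                if score < self.review_threshold and case["id"] not in flagged:
                    flagged.append(case["id"])
        # Fingerprint the dataset so a report can be tied to the exact cases used.
        payload = json.dumps(cases, sort_keys=True).encode("utf-8")
        return {
            "suite_version": self.version,
            "model_id": model_id,
            "dataset_sha": hashlib.sha256(payload).hexdigest()[:12],
            "scores": {n: sum(v) / len(v) for n, v in per_metric.items()},
            "needs_clinician_review": flagged,
        }


# Usage with synthetic cases and a stub model that always gives the same answer.
cases = [
    {"id": "c1", "prompt": "First-line drug for condition A?", "reference": "drug x"},
    {"id": "c2", "prompt": "First-line drug for condition B?", "reference": "drug y"},
]
suite = EvalSuite(version="2025.1")
suite.register("exact_match", exact_match)
report = suite.run("stub-model", cases, generate=lambda prompt: "Drug X")
```

Recording the suite version, the model identifier, and a dataset fingerprint in every report is what makes runs comparable as models and metrics change, and the `needs_clinician_review` list is one simple way to route low-scoring cases to a clinician for refinement.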
The necessity of the second pillar, conducting clinical trials, arises from a significant gap between in silico validation and real-world clinical utility. The majority of studies evaluated in this review rely on retrospective, de-identified datasets, testing LLMs on static, historical information [23,25]. While essential for initial development, this methodology fails to address the complexities of prospective, longitudinal implementation. Future research must therefore move towards pilot and clinical trials that deploy these systems in live clinical workflows, measuring their impact not only on diagnostic accuracy but also on real-world outcomes such as consultation time, clinician workload, and, most importantly, patient safety.
The third pillar, the need to improve explainability, is a direct response to the significant trust deficit identified among clinicians [23,25]. Bridging this gap requires more than higher AUROC scores; it calls for clinically meaningful explainability. A critical research gap exists in identifying and testing which XAI methods, from chain-of-thought rationales to counterfactual risk explanations, can effectively improve clinician calibration and decision confidence without introducing new biases.
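One way to make "clinician calibration" measurable is a proper scoring rule such as the Brier score, the mean squared difference between a stated confidence and the eventual correctness of the decision. The minimal Python sketch below assumes a hypothetical study design in which clinicians report a confidence in [0, 1] for each decision, with and without an explanation. The numbers are synthetic placeholders chosen only to exercise the function; they are not results from any study in this review.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and outcome (lower is better)."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(confidences, correct)) / len(confidences)


# Synthetic illustration (assumption): 1 = decision was correct, 0 = incorrect.
correct = [1, 0, 1, 0]
conf_without_xai = [0.9, 0.9, 0.9, 0.9]  # uniformly high confidence
conf_with_xai = [0.9, 0.3, 0.8, 0.2]     # confidence tracks correctness

score_without = brier_score(conf_without_xai, correct)
score_with = brier_score(conf_with_xai, correct)
print(f"Brier without explanation: {score_without:.3f}")
print(f"Brier with explanation:    {score_with:.3f}")
```

A metric of this kind separates calibration from raw accuracy: both synthetic arms have the same decisions, and only the match between confidence and correctness differs.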
Finally, the fourth pillar, the call to focus on socio-technical systems, addresses the gap between model-centric research and the realities of clinical practice. As LLMs grow in complexity, research must extend beyond the models themselves. This requires designing systems with a human-in-the-loop philosophy that supports, rather than replaces, human expertise. A critical component of this research is workforce preparedness: investigating the core competencies clinicians need to supervise and correct LLM output effectively is a crucial step towards safe and effective integration.
Author Contributions
G.d.P.S., methodology—developed the systematic literature review protocol, including search strategy, inclusion/exclusion criteria, and data extraction framework; formal analysis—synthesized study results, identified research gaps, and formulated the taxonomy and discussion; writing—original draft—prepared the initial manuscript, ensuring accuracy and completeness. G.M., supervision—provided oversight, strategic guidance, and methodological input throughout the study; writing—review and editing—critically reviewed and refined the manuscript. D.S., supervision—offered academic and technical oversight for the research; writing—review and editing—revised the manuscript for clarity, coherence, and scientific rigor. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Data is contained within the article. Further inquiries can be directed to the corresponding author.
Acknowledgments
The authors would like to thank the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| LLM | Large Language Model |
| SLR | Systematic Literature Review |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
| RAG | Retrieval-augmented generation |
| SDOH | Social Determinants of Health |
| LMIC | Low- and Middle-Income Countries |
| EHR | Electronic Health Record |
| NLP | Natural Language Processing |
| API | Application Programming Interface |
| MoE | Mixture of Experts |
| XAI | Explainable Artificial Intelligence |
| HIPAA | Health Insurance Portability and Accountability Act |
| PICOC | Population, Intervention, Comparison, Outcome, Context |
| RQ | Research Question |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| RLHF | Reinforcement Learning from Human Feedback |
| BLEU | Bilingual Evaluation Understudy |
| AUROC | Area Under the Receiver Operating Characteristic Curve |
| VLM | Visual Language Model |
| SLM | Small Language Model |
| PHI | Protected Health Information |
| MIMIC | Medical Information Mart for Intensive Care |
References
- Louca, S. Personalized medicine—A tailored health care system: Challenges and opportunities. Croat. Med. J. 2012, 53, 211–213.
- Cinti, C.; Trivella, M.G.; Joulie, M.; Ayoub, H.; Frenzel, M. The Roadmap toward Personalized Medicine: Challenges and Opportunities. J. Pers. Med. 2024, 14, 546.
- Vicente, A.M.; Ballensiefen, W.; Jönsson, J.I. How personalised medicine will transform healthcare by 2030: The ICPerMed vision. J. Transl. Med. 2020, 18, 180.
- Chunara, R.; Gjonaj, J.; Immaculate, E.; Wanga, I.; Alaro, J.; Scott-Sheldon, L.A.J.; Mangeni, J.; Mwangi, A.; Vedanthan, R.; Hogan, J. Social Determinants of Health: The Need for Data Science Methods and Capacity. Lancet Digit. Health 2024, 6, e235–e237.
- Onnela, J.P. Opportunities and challenges in the collection and analysis of digital phenotyping data. Neuropsychopharmacology 2021, 46, 45–54.
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774.
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; et al. Sparks of Artificial General Intelligence: Early Experiments with GPT-4. arXiv 2023, arXiv:2303.12712.
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A Survey on Multimodal Large Language Models. Natl. Sci. Rev. 2024, 11, nwae403.
- Fang, C.M.; Danry, V.; Whitmore, N.; Bao, A.; Hutchison, A.; Pierce, C.; Maes, P. PhysioLLM: Supporting Personalized Health Insights with Wearables and Large Language Models. In Proceedings of the 2024 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Houston, TX, USA, 10–13 November 2024; pp. 1–8.
- So, K.; Kim, H.J.; Shin, D.S.; Sim, J.A.; Lee, J.J.; Duong, D.; Meisinger, K.; Won, D.O. A Conversational Interaction Framework Using Large Language Models for Personalized Elderly Care. In Proceedings of the 2025 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 11–14 January 2025; pp. 1–2.
- Kambare, S.M.; Jain, K.; Kale, I.; Kumbhare, V.; Lohote, S.; Lonare, S. Design and Evaluation of an AI-Powered Conversational Agent for Personalized Mental Health Support and Intervention (MindBot). In Proceedings of the 2024 International Conference on Sustainable Communication Networks and Application (ICSCNA), Theni, India, 11–13 December 2024; pp. 1394–1402.
- Akilesh, S.; Abinaya, R.; Dhanushkodi, S.; Sekar, R. A Novel AI-based Chatbot Application for Personalized Medical Diagnosis and Review Using Large Language Models. In Proceedings of the 2023 International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE), Chennai, India, 1–2 November 2023; pp. 1–5.
- Pap, I.A.; Oniga, S. eHealth Assistant AI Chatbot Using a Large Language Model to Provide Personalized Answers through Secure Decentralized Communication. Sensors 2024, 24, 6140.
- Subramanian, S.; Han, X.; Baldwin, T.; Cohn, T.; Frermann, L. Evaluating debiasing techniques for intersectional biases. arXiv 2021, arXiv:2109.10441.
- Cai, H. Multimodal Hybrid Healthcare Recommendation System Based on ERT-MOE and Large Language Model Enhancement. In Proceedings of the 2024 4th International Conference on Electronic Information Engineering and Computer Communication (EIECC), Wuhan, China, 13–15 December 2024; pp. 1222–1226.
- Abbasian, M.; Khatibi, E.; Azimi, I.; Oniani, D.; Shakeri Hossein Abad, Z.; Thieme, A.; Sriram, R.; Yang, Z.; Wang, Y.; Lin, B.; et al. Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI. npj Digit. Med. 2024, 7, 82.
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. PLoS Med. 2021, 18, e1003583.
- Kitchenham, B.; Pearl Brereton, O.; Budgen, D.; Turner, M.; Bailey, J.; Linkman, S. Systematic literature reviews in software engineering—A systematic literature review. Inf. Softw. Technol. 2009, 51, 7–15.
- Petersen, K.; Vakkalanka, S.; Kuzniarz, L. Guidelines for Conducting Systematic Mapping Studies in Software Engineering: An Update. Inf. Softw. Technol. 2015, 64, 1–18.
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Kumar, G. A Doctor Assistance Tool: Personalized Healthcare Treatment Recommendations Journey from Deep Reinforcement Learning to Generative AI. In Proceedings of the 2024 3rd Edition of IEEE Delhi Section Flagship Conference (DELCON), New Delhi, India, 21–23 November 2024; pp. 1–9.
- Johri, S.; Jeong, J.; Tran, B.A.; Schlessinger, D.I.; Wongvibulsin, S.; Barnes, L.A.; Zhou, H.Y.; Cai, Z.R.; Van Allen, E.M.; Kim, D.; et al. An Evaluation Framework for Clinical Use of Large Language Models in Patient Interaction Tasks. Nat. Med. 2025, 31, 77–86.
- Kim, J.; Chen, M.L.; Rezaei, S.J.; Hernandez-Boussard, T.; Chen, J.H.; Rodriguez, F.; Han, S.S.; Lal, R.A.; Kim, S.H.; Dosiou, C.; et al. Artificial Intelligence Tools in Supporting Healthcare Professionals for Tailored Patient Care. npj Digit. Med. 2025, 8, 210.
- Jaiswal, S.; Lee, J.; Berria, J.; Tanikella, R.; Zolyomi, A.; Ahmad, M.A.; Si, D. Building Personality-Adaptive Conversational AI for Mental Health Therapy. In Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Shenzhen, China, 22–25 November 2024; p. 1.
- Williams, C.Y.K.; Miao, B.Y.; Kornblith, A.E.; Butte, A.J. Evaluating the Use of Large Language Models to Provide Clinical Recommendations in the Emergency Department. Nat. Commun. 2024, 15, 8236.
- Subramanian, A.; Yang, Z.; Azimi, I.; Rahmani, A.M. Graph-Augmented LLMs for Personalized Health Insights: A Case Study in Sleep Analysis. In Proceedings of the 2024 IEEE 20th International Conference on Body Sensor Networks (BSN), Chicago, IL, USA, 15–17 October 2024; pp. 1–4.
- Garima, S.; Swapnil, M.; Shashank, S. Harnessing the Power of Language Models for Intelligent Digital Health Services. In Proceedings of the 2024 ITU Kaleidoscope: Innovation and Digital Transformation for a Sustainable World (ITU K), New Delhi, India, 21–23 October 2024; pp. 1–8.
- Rahman, M.A.; Al-Hazzaa, S. Next-Generation Virtual Hospital: Integrating Discriminative and Large Multi-Modal Generative AI for Personalized Healthcare. In Proceedings of the GLOBECOM 2024—2024 IEEE Global Communications Conference, Cape Town, South Africa, 8–12 December 2024; pp. 3509–3514.
- Balakrishna, C.; Yadav, A.; Singh, J.; Saba, M.; Shashikant; Shrivastava, V. Smart Drug Delivery Systems Using Large Language Models for Real-Time Treatment Personalization. In Proceedings of the 2024 2nd World Conference on Communication & Computing (WCONF), Raipur, India, 12–14 July 2024; pp. 1–6.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903.
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models Are Zero-Shot Reasoners. arXiv 2022, arXiv:2205.11916.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward Expert-Level Medical Question Answering with Large Language Models. Nat. Med. 2025, 31, 943–950.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55.
- Li, X.; Wang, S.; Zeng, S.; Wu, Y.; Yang, Y. A Survey on LLM-based Multi-Agent Systems: Workflow, Infrastructure, and Challenges. Vicinagearth 2024, 1, 9.
- Ferreira, A.A.; Rocha, L.; Cunha, W.; Machado, A.C.; Campos, J.M.; Jallais, G.; Viana, A.C.F.; Tuler, E.; Araújo, I.; Macul, V.; et al. A comprehensive qualitative analysis of patient dialogue summarization using large language models applied to noisy, informal, non-English real-world data. Sci. Rep. 2025, 15, 31660.
- Silva, V.; Furtado, E.S.; Oliveira, J.; Furtado, V. Engenharia de Prompts em Assistentes Conversacionais para Promoção de Autocuidado baseados em Modelos Amplos de Linguagem. In Proceedings of the Anais do XXIV Simpósio Brasileiro de Computação Aplicada à Saúde (SBCAS 2024), Goiânia, Brazil, 25–28 June 2024; pp. 377–388.
- Reis, Z.S.N.; Pagano, A.S.; Ramos de Oliveira, I.J.; dos Santos Dias, C.; Lage, E.M.; Mineiro, E.F.; Varella Pereira, G.M.; de Carvalho Gomes, I.; Basilio, V.A.; Cruz-Correia, R.J.; et al. Evaluating Large Language Model–Supported Instructions for Medication Use: First Steps Toward a Comprehensive Model. Mayo Clin. Proc. Digit. Health 2024, 2, 632–644.
- Rajashekar, N.C.; Shin, Y.E.; Pu, Y.; Chung, S.; You, K.; Giuffre, M.; Chan, C.E.; Saarinen, T.; Hsiao, A.; Sekhon, J.; et al. Human-Algorithmic Interaction Using a Large Language Model-Augmented Artificial Intelligence Clinical Decision Support System. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–20.
- Park, Y.J.; Pillai, A.; Deng, J.; Guo, E.; Gupta, M.; Paget, M.; Naugler, C. Assessing the research landscape and clinical utility of large language models: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 72.
- Zhang, Z.; Rossi, R.A.; Kveton, B.; Shao, Y.; Yang, D.; Zamani, H.; Dernoncourt, F.; Barrow, J.; Yu, T.; Kim, S.; et al. Personalization of Large Language Models: A Survey. arXiv 2025, arXiv:2411.00027.
- Schneider, D.; de Almeida, M.A.; Nascimento, M.; Correia, A.; de Souza, J.M. Designing for (Digital) Nomad-AI Interaction. In Proceedings of the International Conference on Computer-Human Interaction Research and Applications, Marbella, Spain, 20–21 October 2025.
- Wang, Y.; Zhao, Y.; Petzold, L. Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding. arXiv 2023, arXiv:2304.05368.
- Cai, C.J.; Winter, S.; Steiner, D.; Wilcox, L.; Terry, M. “Hello AI”: Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–24.
- Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 2024, 27, 42.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).