Large Language Model-Based Virtual Patient Simulations in Medical and Nursing Education: A Review
Abstract
1. Introduction
- RQ1: What major implementation approaches to LLM-based VP simulations have been proposed since 2023, and what are their key technical and pedagogical characteristics?
- RQ2: What types of datasets and knowledge sources are utilized, and how are data quality and governance ensured?
- RQ3: What evaluation frameworks and metrics are used to assess system performance and educational effectiveness?
- RQ4: What limitations, challenges, and future directions have been identified across the reviewed studies?
2. Materials and Methods
2.1. Data Sources and Search Strategies
2.2. Study Selection and Eligibility Criteria
- Studies that explicitly utilized LLMs within medical, nursing, or healthcare education contexts.
- Studies that involved interactive or generative simulations, including scenario design or VP dialogue.
- Peer-reviewed journal articles or full-length conference papers published in English between 2023 and 2025.
- Did not employ LLMs or were unrelated to medical or nursing education.
- Were unavailable in full text or limited to conference abstracts.
- Lacked interactive or agent-based simulation components relevant to this review’s objectives.
- Were inconsistent with the research purpose or analytical scope.
3. Results
3.1. Implementation Approaches
3.1.1. LLM-Based Scenario Generation
3.1.2. Simple Prompt-Based Virtual Patient Systems
3.1.3. Iterative Feedback and Automated Scoring Systems
3.1.4. Realism- and Adaptability-Enhanced Virtual Patient Systems
3.1.5. Knowledge-Driven and Multi-Agent Hybrid Virtual Patient Systems
3.1.6. Mental Health- and Counseling-Oriented Systems
3.2. Datasets
3.3. Evaluation Methods
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Collins, J.C.; Chong, W.W.; de Almeida Neto, A.C.; Moles, R.J.; Schneider, C.R. The simulated patient method: Design and application in health services research. Res. Soc. Adm. Pharm. 2021, 17, 2108–2115.
- Davis, S. Patient-Drama: A Literature Review of Simulated Patient Experiences in Medical Education and Training; Springer: Berlin/Heidelberg, Germany, 2022.
- Mackenzie, C.F.; Harper, B.D.; Xiao, Y. Simulator limitations and their effects on decision-making. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Philadelphia, PA, USA, 2–6 September 1996; pp. 747–751.
- Kononowicz, A.A.; Woodham, L.A.; Edelbring, S.; Stathakarou, N.; Davies, D.; Saxena, N.; Car, L.T.; Carlstedt-Duke, J.; Car, J.; Zary, N. Virtual patient simulations in health professions education: Systematic review and meta-analysis by the digital health education collaboration. J. Med. Internet Res. 2019, 21, e14676.
- Huang, G.; Reynolds, R.; Candler, C. Virtual patient simulation at US and Canadian medical schools. Acad. Med. 2007, 82, 446–451.
- Botezatu, M.; Hult, H.; Fors, U.G. Virtual patient simulation: What do students make of it? A focus group study. BMC Med. Educ. 2010, 10, 91.
- Hege, I.; Kononowicz, A.A.; Berman, N.B.; Lenzer, B.; Kiesewetter, J. Advancing clinical reasoning in virtual patients–development and application of a conceptual framework. GMS J. Med. Educ. 2018, 35, Doc12.
- Botezatu, M.; Hult, H.; Tessma, M.K.; Fors, U. Virtual patient simulation: Knowledge gain or knowledge loss? Med. Teach. 2010, 32, 562–568.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Eysenbach, G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: A conversation with ChatGPT and a call for papers. JMIR Med. Educ. 2023, 9, e46885.
- Jung, S. Challenges and future directions for artificial intelligence integrated nursing simulation education. Korean J. Women Health Nurs. 2023, 29, 239–242.
- Maaz, S.; Palaganas, J.C.; Palaganas, G.; Bajwa, M. A guide to prompt design: Foundations and applications for healthcare simulationists. Front. Med. 2025, 11, 1504532.
- OpenAI. ChatGPT. Available online: https://chat.openai.com/ (accessed on 17 June 2025).
- Google DeepMind. Gemini. Available online: https://deepmind.google/gemini (accessed on 17 June 2025).
- Meta. LLaMA. Available online: https://www.llama.com/ (accessed on 17 June 2025).
- Anthropic. Claude. Available online: https://www.anthropic.com/claude (accessed on 17 June 2025).
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
- Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
- Kang, K.; Yu, M. Rapid cycle deliberate practice simulation with standardized prebriefing and video based formative feedback in advanced cardiac life support. Sci. Rep. 2025, 15, 16150.
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110.
- Holderried, F.; Stegemann-Philipps, C.; Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, M.; Eickhoff, C.; Mahling, M. A language model–powered simulated patient with automated feedback for history taking: Prospective study. JMIR Med. Educ. 2024, 10, e59213.
- Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; Weston, J. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv 2018, arXiv:1801.07243.
- Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887.
- García-Torres, D.; Vicente Ripoll, M.A.; Fernández Peris, C.; Mira Solves, J.J. Enhancing clinical reasoning with virtual patients: A hybrid systematic review combining human reviewers and ChatGPT. Healthcare 2024, 12, 2241.
- Vrdoljak, J.; Boban, Z.; Vilović, M.; Kumrić, M.; Božić, J. A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 2025, 13, 603.
- Vaughn, J.; Ford, S.H.; Scott, M.; Jones, C.; Lewinski, A. Enhancing healthcare education: Leveraging ChatGPT for innovative simulation scenarios. Clin. Simul. Nurs. 2024, 87, 101487.
- Ghaffari, F.; Langarizadeh, M.; Nabovati, E.; Sabery, M. Effectiveness of ChatGPT for Clinical Scenario Generation: A Qualitative Study. Arch. Acad. Emerg. Med. 2025, 13, e49.
- Violato, E.; Corbett, C.; Rose, B.; Rauschning, B.; Witschen, B. The effectiveness and efficiency of using ChatGPT for writing health care simulations. Int. J. Healthc. Simul. 2023, 10, 54531.
- Tian, Q.; Ren, F.; Zou, B.; Zhou, J.; Liu, G.; Zheng, Y.; Zhang, Z.; Wang, S. Iteratively refined ChatGPT outperforms clinical mentors in generating high-quality interprofessional education clinical scenarios: A comparative study. BMC Med. Educ. 2024, 25, 845.
- Gray, M.; Baird, A.; Sawyer, T.; James, J.; DeBroux, T.; Bartlett, M.; Krick, J.; Umoren, R. Increasing realism and variety of virtual patient dialogues for prenatal counseling education through a novel application of ChatGPT: Exploratory observational study. JMIR Med. Educ. 2024, 10, e50705.
- Ananthanarayanan, A. Generating Medical Diagnostic Scenarios with LLM-Based Reinforcement Learning Feedback: Dataset Release and Methodology. In Proceedings of the IEEE Integrated STEM Education Conference, Princeton, NJ, USA, 15 March 2025.
- Sumpter, S. Automated Generation of High-Quality Medical Simulation Scenarios Through Integration of Semi-Structured Data and Large Language Models. arXiv 2024, arXiv:2404.19713.
- Barra, F.L.; Rodella, G.; Costa, A.; Scalogna, A.; Carenzo, L.; Monzani, A.; Corte, F.D. From prompt to platform: An agentic AI workflow for healthcare simulation scenario design. Adv. Simul. 2025, 10, 29.
- Öncü, S.; Torun, F.; Ülkü, H.H. AI-powered standardised patients: Evaluating ChatGPT-4o’s impact on clinical case management in intern physicians. BMC Med. Educ. 2025, 25, 278.
- Benfatah, M.; Marfak, A.; Saad, E.; Hilali, A.; Nejjari, C.; Youlyouz-Marfak, I. Assessing the efficacy of ChatGPT as a virtual patient in nursing simulation training: A study on nursing students’ experience. Teach. Learn. Nurs. 2024, 19, e486–e493.
- Holderried, F.; Stegemann-Philipps, C.; Herschbach, L.; Moldt, J.-A.; Nevins, A.; Griewatz, J.; Holderried, M.; Herrmann-Werner, A.; Festl-Wietek, T.; Mahling, M. A generative pretrained transformer (GPT)–powered chatbot as a simulated patient to practice history taking: Prospective, mixed methods study. JMIR Med. Educ. 2024, 10, e53961.
- Reichenpfader, D.; Denecke, K. Simulating diverse patient populations using patient vignettes and large language models. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, Torino, Italy, 20 May 2024; pp. 20–25.
- Yi, Y.; Kim, K.-J. The feasibility of using generative artificial intelligence for history taking in virtual patients. BMC Res. Notes 2025, 18, 80.
- Aster, A.; Ragaller, S.V.; Raupach, T.; Marx, A. ChatGPT as a virtual patient: Written empathic expressions during medical history taking. Med. Sci. Educ. 2025, 35, 1513–1522.
- Lower, K.; Seth, I.; Lim, B.; Seth, N. ChatGPT-4: Transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J. Orthop. 2023, 57, 1527–1544.
- Cross, J.; Kayalackakom, T.; Robinson, R.E.; Vaughans, A.; Sebastian, R.; Hood, R.; Lewis, C.; Devaraju, S.; Honnavar, P.; Naik, S. Assessing ChatGPT’s Capability as a New Age Standardized Patient: Qualitative Study. JMIR Med. Educ. 2025, 11, e63353.
- Scherr, R.; Halaseh, F.F.; Spina, A.; Andalib, S.; Rivera, R. ChatGPT interactive medical simulations for early clinical education: Case study. JMIR Med. Educ. 2023, 9, e49877.
- Wang, C.; Li, S.; Lin, N.; Zhang, X.; Han, Y.; Wang, X.; Liu, D.; Tan, X.; Pu, D.; Li, K. Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment. J. Med. Internet Res. 2025, 27, e59435.
- Brügge, E.; Ricchizzi, S.; Arenbeck, M.; Keller, M.N.; Schur, L.; Stummer, W.; Holling, M.; Lu, M.H.; Darici, D. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: A randomized controlled trial. BMC Med. Educ. 2024, 24, 1391.
- Haut, K.; Hasan, M.; Carroll, T.; Epstein, R.; Sen, T.; Hoque, E. AI Standardized Patient Improves Human Conversations in Advanced Cancer Care. arXiv 2025, arXiv:2505.02694.
- Yamamoto, A.; Koda, M.; Ogawa, H.; Miyoshi, T.; Maeda, Y.; Otsuka, F.; Ino, H. Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial. JMIR Med. Educ. 2024, 10, e58753.
- Hicke, Y.; Geathers, J.; Rajashekar, N.; Chan, C.; Jack, A.G.; Sewell, J.; Preston, M.; Cornes, S.; Shung, D.; Kizilcec, R. MedSimAI: Simulation and formative feedback generation to enhance deliberate practice in medical education. arXiv 2025, arXiv:2503.05793.
- Chiu, J.; Castro, B.; Ballard, I.; Nelson, K.; Zarutskie, P.; Olaiya, O.K.; Song, D.; Zhao, Y. Exploration of the Role of ChatGPT in Teaching Communication Skills for Medical Students: A Pilot Study. Med. Sci. Educ. 2025, 35, 1871–1882.
- Cook, D.A.; Overgaard, J.; Pankratz, V.S.; Del Fiol, G.; Aakre, C.A. Virtual patients using large language models: Scalable, contextualized simulation of clinician-patient dialogue with feedback. J. Med. Internet Res. 2025, 27, e68486.
- Bodonhelyi, A.; Stegemann-Philipps, C.; Sonanini, A.; Herschbach, L.; Szép, M.; Herrmann-Werner, A.; Festl-Wietek, T.; Kasneci, E.; Holderried, F. Modeling Challenging Patient Interactions: LLMs for Medical Communication Training. arXiv 2025, arXiv:2503.22250.
- Chen, S.; Wu, M.; Zhu, K.Q.; Lan, K.; Zhang, Z.; Cui, L. LLM-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. arXiv 2023, arXiv:2305.13614.
- Borg, A.; Georg, C.; Jobs, B.; Huss, V.; Waldenlind, K.; Ruiz, M.; Edelbring, S.; Skantze, G.; Parodis, I. Virtual patient simulations using social robotics combined with large language models for clinical reasoning training in medical education: Mixed methods study. J. Med. Internet Res. 2025, 27, e63312.
- Gutiérrez Maquilón, R.; Uhl, J.; Schrom-Feiertag, H.; Tscheligi, M. Integrating GPT-Based AI into Virtual Patients to Facilitate Communication Training Among Medical First Responders: Usability Study of Mixed Reality Simulation. JMIR Form. Res. 2024, 8, e58623.
- Sardesai, N.; Russo, P.; Martin, J.; Sardesai, A. Utilizing generative conversational artificial intelligence to create simulated patient encounters: A pilot study for anaesthesia training. Postgrad. Med. J. 2024, 100, 237–241.
- Lee, K.; Lee, S.; Kim, E.H.; Ko, Y.; Eun, J.; Kim, D.; Cho, H.; Zhu, H.; Kraut, R.E.; Suh, E. Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees’ Dialogue to Facilitate Nurse Communication Training. arXiv 2025, arXiv:2506.00386.
- Du, Z.; Zheng, L.; Hu, R.; Xu, Y.; Li, X.; Sun, Y.; Chen, W.; Wu, J.; Cai, H.; Ying, H. LLMs Can Simulate Standardized Patients via Agent Coevolution. arXiv 2024, arXiv:2412.11716.
- Yu, H.; Zhou, J.; Li, L.; Chen, S.; Gallifant, J.; Shi, A.; Li, X.; Hua, W.; Jin, M.; Chen, G. AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow. arXiv 2024, arXiv:2409.18924.
- Li, Y.; Zeng, C.; Zhang, J.; Zhou, J.; Zou, L. MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient. arXiv 2024, arXiv:2408.12236.
- Li, Y.; Zeng, C.; Zhong, J.; Zhang, R.; Zhang, M.; Zou, L. Leveraging large language model as simulated patients for clinical education. arXiv 2024, arXiv:2404.13066.
- Wang, R.; Milani, S.; Chiu, J.C.; Zhi, J.; Eack, S.M.; Labrum, T.; Murphy, S.M.; Jones, N.; Hardy, K.; Shen, H. Patient-Ψ: Using large language models to simulate patients for training mental health professionals. arXiv 2024, arXiv:2405.19660.
- Steenstra, I.; Nouraei, F.; Bickmore, T. Scaffolding empathy: Training counselors with simulated patients and utterance-level performance visualizations. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–22.
- Louie, R.; Nandi, A.; Fang, W.; Chang, C.; Brunskill, E.; Yang, D. Roleplay-doh: Enabling domain-experts to create LLM-simulated patients via eliciting and adhering to principles. arXiv 2024, arXiv:2407.00870.
- Lee, J.; Lim, K.; Jung, Y.-C.; Kim, B.-H. PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents. arXiv 2025, arXiv:2501.01594.
- Wang, J.; Xiao, Y.; Li, Y.; Song, C.; Xu, C.; Tan, C.; Li, W. Towards a client-centered assessment of LLM therapists by client simulation. arXiv 2024, arXiv:2406.12266.
- Saeed, M.; Villarroel, M.; Reisner, A.T.; Clifford, G.; Lehman, L.-W.; Moody, G.; Heldt, T.; Kyaw, T.H.; Moody, B.; Mark, R.G. Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database. Crit. Care Med. 2011, 39, 952–960.
- Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.-w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035.
- Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2015, 23, 304–310.
- Pérez-Rosas, V.; Sun, X.; Li, C.; Wang, Y.; Resnicow, K.; Mihalcea, R. Analyzing the quality of counseling conversations: The tell-tale signs of high-quality counseling. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
- Wu, Z.; Balloccu, S.; Kumar, V.; Helaoui, R.; Reiter, E.; Recupero, D.R.; Riboni, D. AnnoMI: A dataset of expert-annotated counselling dialogues. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6177–6181.
- MTSamples. MTSamples. Available online: https://mtsamples.com (accessed on 5 June 2025).
- Lee, J.; Park, S.; Shin, J.; Cho, B. Analyzing evaluation methods for large language models in the medical field: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 366.
- Fan, L.; Hua, W.; Li, L.; Ling, H.; Zhang, Y. NPHardEval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv 2023, arXiv:2312.14890.
- Fan, L.; Hua, W.; Li, X.; Zhu, K.; Jin, M.; Li, L.; Ling, H.; Chi, J.; Wang, J.; Ma, X. NPHardEval4V: A dynamic reasoning benchmark of multimodal large language models. arXiv 2024, arXiv:2403.01777.
- Gusev, I. PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation. arXiv 2024, arXiv:2409.06820.
- Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623.


| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| ChatGPT | [26] | <15 s/scenario; realism acceptable for HPI & unfolding; frequent omissions 65–88%, inaccuracies 18–41%. | Missing PMH/profile/vitals common; SME review required |
| GPT-3.5 | [28] | Time ↓ 154.8 min/case (total −12.9 h/5 cases); non-expert preferred 4/23; strong structure/flow | Expert quality > non-expert; gaps in technical accuracy/clinical detail |
| GPT-3.5 | [30] | Generated 176 responses: realistic 80%, educationally relevant 87%, usable (≤minor edits) 63%; weighted κ = 0.84 | 37% require edits; precision/detail limited; expert screening advised |
| GPT-3.5 | [32] | Semi-structured data + LLM pipeline; reported time/resource reduction; better consistency/reuse | No quantitative evaluation; potential misinterpretation of complex cases; SME validation required |
| GPT-3.5 Turbo | [31] | accuracy 9.59 → 10, detail 5.59 → 5.78 (0–10) with RAG + critic; included more women and people of color cases for diversity | Small/preliminary; depends on RAG/critic quality; external validation pending |
| GPT-4 | [27] | ≈5 s generation; structured, realistic, clear objectives (expert panel) | Drug dose/logic errors; incomplete histories/tables; expert review required |
| GPT-4o | [29] | Time: mentors 118 ± 23 min → 9 ± 2 (iterative)/4 ± 2 (single); IQS ↑ challenge +0.63, engagement +0.39 (p < 0.01); blind attribution AI = human 16/16 (p = 0.61) → AI scenarios matched/exceeded expert quality | Subjective ratings; no IRR; expert review still needed |
| GPT-4o, Gemini 2.0, Claude-3.7 | [33] | Multi-agent workflow; ~4.5 min/case (≈50 runs); time ↓ 70–80%; INACSL/ASPiH-compliant; multilingual | Potential errors/biases; complex setup; expert oversight essential |

| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| ChatGPT | [35] | 5-pt ratings: Accessibility 4.3 ± 0.5, Engagement 4.3 ± 0.4, Usefulness 4.2 ± 0.5. Correlations with total (25-pt): Clarity r = 0.701, Useful info r = 0.597, Relevance r = 0.444 (all p < 0.05) | Small sample (12 participants); limited to one scenario (dyspnea); low adaptability in some students |
| GPT-3.5 | [39] | Empathic interactions 93/659 ≈ 14%; Autonomy score 38.2 ± 3.44/42 (freedom 6.8/7, task relevance 5.93/7) | Low empathy frequency; no non-verbal cues; no voice/visual input |
| GPT-3.5 | [42] | ACLS & ICU (pneumonia, sepsis) in open-response/state-change; very low cost, high accessibility, unlimited regeneration | No quantitative scoring; no automated grading/standardization; reproducibility/feedback consistency limited |
| GPT-3.5 Turbo | [36] | Generated 826 Q–A; clinical validity 97.9%; in-script info 94.4%; out-of-script fabricated 56.4%; CUQ 77/100; Q–A length ρ = 0.29 (p < 0.01) | Out-of-script hallucinations (56.4%); role drift & calc errors; one case/model |
| GPT-4 | [37] | Role-prompted vignette: Compliance/Coherence/Correctness 100%; Containment 64→45→9% (less context → worse); maintain realism and coherence within structured role-prompting setups | Context-reliant; over-inference/generalization; single vignette, non-clinical raters |
| GPT-4 | [40] | Ortho cases: Likert (accuracy 4/5; complex Q 3/5; comprehensiveness 3/5; depth 2/5). Consistent diagnostic reasoning & initial ED management | Limited specialized detail (e.g., urinalysis, nerve block); needs expert oversight |
| GPT-4 | [41] | Effective for repetitive practice, convenience, and anxiety reduction; enabled personalized feedback | Lacks non-verbal/visual cues; sensitive-topic limits; minor latency/language issues; small single-site sample |
| GPT-4o | [34] | Observed scores (6–10): PS (problem-solving) 8.4, CR (clinical reasoning) 8.3, CM (case management) 8.5; Self (55 max): 41.6, 42.6, 36.1; high inter-domain correlations (r = 0.68–0.95, p < 0.001); no competence gap | Tech issues (language, delays); time-pressure info handling; single site, small n |
| Naver HyperCLOVA X (Seongnam-si, Republic of Korea) | [38] | Pilot (5 sessions): 96 Q–A/1325 words; implausible 2.6% (inarticulate 1.7%, hallucinated 0.5%, missing 0.3%). Expert (1–5): Rel 4.50, Acc 4.10, Valid 4.20, Concise 3.80, Fluent 3.20, Total 3.96; ICC 0.64–0.80 | Some inaccuracies and unrealistic expressions; limited fluency; need for refined prompting |

| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| GPT-3.5 | [44] | CRI-HTI ↑ (F(1,18) = 4.44, p = 0.049, η2 = 0.198); inter-rater reliability ICC = 0.924 | No gain in focusing (p = 0.265); small, single-site RCT; some feedback lacked accuracy/specificity |
| GPT-3.5 | [48] | Confidence ↑ 3.00→4.17 (p = 0.002); trust ↑ (p = 0.001); SPIKES-based immediate feedback feasible | Faculty–AI scoring gap; limited non-verbal/emotional context (text-only); very small sample size |
| GPT-3.5 Turbo | [45] | ICC = 0.882; 3E (Empower/Be Explicit/Empathize) skills ↑ (all p < 0.001, d = 1.07–1.61); feedback 4.68/5; avatar realism 3.40/5 | Some “uncanny” affect; limited gaze/speech naturalness; technical complexity |
| GPT-4 | [21] | 99.3% clinical plausibility; Cohen’s κ 0.832 vs. human raters | Some category-level κ < 0.6 (rubric overlap/ambiguity) |
| GPT-4 | [43] | Score Difference Percentage (SDP) 29.8%→6.1%; no language-group diff (p > 0.05) | Occasional over-information; no emotion/attitude scoring |
| GPT-4.0 Turbo | [46] | OSCE score ↑ 28.1 vs. 27.1 (p = 0.01); repeated training; accessible via web, smartphone, LINE | Limited non-verbal training; repetitive/inaccurate feedback; short-term, non-randomized evaluation |
| GPT-4.0 Turbo | [49] | Validated authenticity/UX/feedback tools; LLM–human rating alignment; cost ≈ US $0.51/dialogue; patient prefs reflected in 42–98% | Reproducibility (ICC) slight–fair; possible self-evaluation bias; some verbosity/unnatural dialogue |
| GPT-4o | [47] | ~19.9 min, 38.7 turns/session; MIRS-based instant feedback; 53% found useful; included SRL features (goal-setting, reflection, progress tracking) | Some formulaic/repetitive replies; limited advanced-skill training; low SRL engagement |

| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| ChatGPT | [51] | Realism ↑ 1.93→2.21; wrong-symptom ↓ 18.4→15.1%; more human-like lexical style | Prompt drift; increased symptom inaccuracy; no non-verbal expression |
| ChatGPT | [54] | Intuitiveness 9/10, Accuracy 8/10, Comfort 87% | Overly polite/verbose; repetitive; hallucination risk; limited to 2D interface |
| GPT-3.5 Turbo | [52] | Robot > PC: authenticity 4.5 vs. 3.9 (p = 0.04); learning 4.4 vs. 4.1 (p = 0.01); greater immersion & emotion | ASR errors; timing interruptions; limited exam/clinical detail; robot setup cost |
| GPT-3.5 Turbo | [53] | Voice usability: MOS-X2 ≈ 4/10; SASSI 4.0–4.8/7 | ~3 s latency; ASR overlap disrupted turn-taking; limited prosody/natural flow |
| GPT-4 | [50] | Authenticity 3.8/5, style reproduction 3.7/5; sentiment: Accuser 3.1 vs. Rationalizer 4.0 (9-pt); sessions ≈ 15–17 min, ~15 turns | Style drift after 4–6 turns; repetitive pre-scripted replies; unnatural non-verbal attempts |
| Claude-3.5 | [55] | Dynamic > Static: role fidelity F(1,25.4) = 4.52 (p = 0.043); realism F(1,24.7) = 8.42 (p = 0.008); κ > 0.75; Cronbach’s α = 0.96–0.97; U = 160,960 (p = 0.001) | Text-only; some responses less fluent; KR-only, small sample size |

| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| GPT-3.5 Turbo, Qwen 2.5-72B | [56] | Ability = 0.860 (Rel 0.759/Faith 0.879/Rob 0.941); cheat-Q pref 91.3% (GPT-4)/86.1% (human); efficiency 6.69 s/401 tokens (≈−380 vs. CoT); req. align > 10% vs. baseline | Sim-only (~150 cases); injected cheat-Qs (5/10 turns); no real-patient validation; >10% gain not externally benchmarked |
| GPT-4 Turbo | [57] | QA accuracy = 94.15% (Symp 91.2/Hist 87.1/Soc 85.6); NER F1 = 0.89 (P 0.95/R 0.84/TPR 84.2%); readability FRE = 68.8/FK 6.4; robust/stable ns; κ = 0.92 | Med-history paraphrase sensitive (F = 5.30, p = 0.006); heavy KG/RAG dependence (F&S acc. ↓ to 13.3% w/o); single-site KG (MIMIC-III) |
| Qwen2-72B, Diffusion Transformer | [58] | KG-controlled, symptom-consistent image generation; multi-agent framework (KG/Chat/Image) | No FID/SSIM/diagnostic metrics; limited Open-i (3314) dataset; ethical/clinical validation pending |
| GPT-3.5 Turbo, PaLM, ERNIE-4, etc. | [59] | B-ELO: +250 (GPT-3.5) > +99 (ERNIE-4) > +68 (PaLM) > +51 (Qwen-72B) > +48 (Mixtral); human–AI corr. ρ = 0.81/r = 0.85 (p < 0.05); Virtual-Doctor score: ChatGPT 0.51 vs. human 0.78 | Role-flip risk (GPT-3.5); small scale (8 cases/80 dialogs); generalization (multi-lang/real-world) unverified |

| LLM | Reference | Key Research Achievements | Limitations |
|---|---|---|---|
| GPT-3.5, GPT-4 | [62] | Expert-in-the-loop “principles” pipeline; better adherence/consistency (win 35%/loss 5–10%; vs. baseline GPT-4); higher authenticity (+0.80, p < 0.01) & training-readiness (+0.64, p < 0.05) | ~20% responses missed multi-part principles; text-only dialogue; 11.4% manual edits required |
| GPT-4o | [61] | Real-time MI training with automatic MITI feedback and cognitive-state visualizations; significant self-efficacy gain (F = 15.56, p < 0.001); high usability (SUS = 88.1); reliable across modules (ICC ≥ 0.77) | No non-verbal cues; limited handling of resistance/ambivalence in complex dialogues |
| GPT-4o | [63] | Multi-faceted psychiatric simulation with construct-grounded evaluation (PSYCHE SCORE, r = 0.85 vs. experts, p < 0.0001); psychiatrist agreement 93%; high reliability (AC1 0.87, PABAK 0.86); moderate convergent validity (r = 0.64, p = 0.0025) | Under-expressed behaviors (e.g., thought process, insight) in specific disorders; low insight accuracy in OCD/PTSD (~50%); limited multimodal/non-verbal expressivity |
| GPT-4 | [60] | Client-centered framework (ClientCAST); objective outcome/alliance metrics (WAI-SR, SRS, CECS, SEQ); clear 213 high vs. 87 low session split; profile reproduction ≈ 70% similarity, F1 ≤ 0.85 | Limited emotional responsiveness; variable personality/affect consistency; overuse of positive tone |
| Claude-3, GPT-3.5, LLaMA3-70B, Mixtral 8 × 7B | [64] | CBT-based patient simulation framework integrating cognitive models; expert-rated fidelity +1.3 (p < 10⁻⁴); trainee confidence +1.8 vs. traditional (p < 10⁻⁴); auto-eval Acc 0.97 (Situation), Macro-F1 0.80 (Core Beliefs). | Limited emotional-style diversity; text-only interaction; short-term subjective evaluation |

| Category | Definition | No. of Studies | References |
|---|---|---|---|
| Self-constructed datasets | Scenarios, dialogues, and evaluation materials newly created by researchers using LLM prompts | 35 | [21,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,56,58,59,62,63,64,65] |
| Public medical datasets | Direct use and evaluation of open datasets such as MIMIC-III, Open-I, and High/Low Quality Counseling | 3 | [57,64] |
| Hybrid clinical datasets | Combined use of institutional EHR data and public datasets | 2 | [56,58] |
| KG/rule-based structured datasets | Use of patient KGs, CCDs, etc., to control LLM behavior and responses | 4 | [57,58,60,63] |

| Dataset | Data Type | Key Characteristics | Application Purpose |
|---|---|---|---|
| MIMIC-II [65] | EHR (ICU) | ~26,000 ICU admission records; CSV and DB formats | [56] (EvoPatient)—Combined with real hospital data to generate realistic VPs |
| MIMIC-III [66] | EHR (ICU) | 40,000+ ICU admission records; relational DB (CSV) format | [57] (AIPatient)—Converted patient data into KGs to improve LLM answer accuracy |
| Open-I Chest X-ray [67] | Medical imaging + reports | ~3300 chest X-ray images (DICOM format) with corresponding radiology reports | [58] (MedDiT)—Used for training a medical X-ray image generation model; LoRA fine-tuning |
| High/Low Quality Counseling [68] | Conversational text (counseling) | 300 counseling sessions (English dialogue data) | [62]—Trained for automatic evaluation of counseling quality |
| AnnoMI [69] | Conversational text (counseling) | 42 motivational interviewing (MI) counseling sessions (English dialogue data) | [62]—Trained ability to classify and evaluate counseling techniques at the sentence level |
| MTSamples [70] | Medical record documents | ~5000 clinical notes (text files) | [56] (EvoPatient)—Used to enrich terminology for enhancing the diversity of medical scenarios |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jo, Y.-W.; Lee, M.; Yang, H.-J. Large Language Model-Based Virtual Patient Simulations in Medical and Nursing Education: A Review. Appl. Sci. 2025, 15, 11917. https://doi.org/10.3390/app152211917

