Review

Large Language Model-Based Virtual Patient Simulations in Medical and Nursing Education: A Review

1 Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju 61186, Republic of Korea
2 Hyper-Wide Federated Medical AI Research Center, Chonnam National University, Gwangju 61186, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 11917; https://doi.org/10.3390/app152211917
Submission received: 26 September 2025 / Revised: 28 October 2025 / Accepted: 7 November 2025 / Published: 9 November 2025

Abstract

Large language model (LLM)-based virtual patient (VP) simulations are emerging to complement traditional medical and nursing education by enabling safe, repeatable, and context-rich clinical practice. This review synthesizes recent developments from 2023 to 2025, mapping implementation approaches, data practices, evaluation methods, and cross-cutting challenges across forty studies. Six implementation categories are identified: scenario generation; prompt-driven VPs; feedback-integrated automated scoring; realism- and adaptability-enhanced systems; knowledge-driven and multi-agent hybrids; and mental health-oriented systems. The analysis summarizes dataset usage (including knowledge sources and governance) and evaluation frameworks, and it introduces quantitative indicators for reproducible assessment. Persistent challenges include factual accuracy, role consistency, emotional realism, and ethical and legal accountability. Overall, LLM-based VP systems show growing potential to extend simulation-based learning, but stronger evidence from multi-site controlled studies, standardized metrics, transparent reporting (model versions, prompts), and robust data governance is needed to establish educational validity and generalizability.

1. Introduction

Simulation has become a cornerstone of medical and nursing education, providing safe and repeatable opportunities to practice clinical procedures in realistic environments without exposing patients to risk. Despite its pedagogical value, traditional simulation methods face notable constraints. Standardized patient (SP) programs demand significant time and financial resources for actor recruitment and training, while high-fidelity mannequin simulations require considerable operational effort and cost, limiting accessibility and sustained immersion [1,2,3].
To address these limitations, virtual patient (VP) simulations have emerged as software-based systems that enable learners to engage in realistic and standardized clinical scenarios within controlled environments. A VP models the symptoms, behaviors, and contexts of real patients to support the development of communication, reasoning, and decision-making skills [4]. Widely implemented in medical schools, VP systems mitigate limited clinical exposure while ensuring consistent, repeatable learning experiences [5,6]. However, most conventional VPs rely on predefined branching dialogues that constrain learner agency and reduce the realism and diversity of clinical encounters [7,8].
To overcome these constraints, large language models (LLMs) have opened a new frontier for VP simulation by enabling dynamic and context-aware exchanges that enhance realism, adaptability, and interactivity in learner–patient dialogues [9,10]. Systems such as ChatGPT leverage extensive medical knowledge and linguistic fluency to support authentic and flexible interactions, improving scenario diversity and educational realism [11]. Yet challenges remain—contextual inconsistency, factual hallucination, and data bias necessitate expert oversight and systematic evaluation [12]. LLMs, advanced transformer-based neural networks trained on large-scale text corpora, include models such as ChatGPT by OpenAI (San Francisco, CA, USA) [13], Gemini by Google DeepMind (London, UK) [14], LLaMA by Meta (Menlo Park, CA, USA) [15], and Claude by Anthropic (San Francisco, CA, USA) [16]. Architecturally, LLM-based VP systems may adopt single-agent designs, where one model represents the patient, or multi-agent frameworks, in which multiple models simulate distinct professional roles (e.g., physician, nurse, patient) to facilitate interprofessional collaboration [17,18]. When combined with appropriate pedagogical strategies and ethical design principles, these systems can significantly enhance educational realism, learner reflection, and confidence [19].
Evaluating such systems requires multidimensional frameworks that combine quantitative metrics—such as accuracy, error rates, and response time—with qualitative assessments of dialogue realism, behavioral fidelity, and learner satisfaction [20,21,22]. Recent reviews have examined LLM applications in healthcare education, yet few have specifically addressed their role in VP simulation. Sallam [23] offered an early systematic overview of ChatGPT in medical education and research without analyzing VP systems in detail. García-Torres et al. [24] presented a hybrid human–ChatGPT evaluation of conversational VPs emphasizing clinical reasoning but overlooking design and governance dimensions. Vrdoljak et al. [25] synthesized LLM applications across medical education and decision support but lacked comparative analysis of system architectures or evaluation frameworks.
Building on these works, the present review delivers the first focused and integrative synthesis of LLM-based VP simulations reported between 2023 and 2025, mapping their design paradigms, interaction mechanisms, data utilization practices, and evaluation methodologies. Specifically, it categorizes implementation architectures (e.g., prompt-based, feedback-integrated, multi-agent), summarizes quantitative and qualitative indicators, and identifies cross-cutting challenges related to realism, role consistency, and ethical governance. The review is guided by the following research questions (RQs):
  • RQ1: What major implementation approaches to LLM-based VP simulations have been proposed since 2023, and what are their key technical and pedagogical characteristics?
  • RQ2: What types of datasets and knowledge sources are utilized, and how are data quality and governance ensured?
  • RQ3: What evaluation frameworks and metrics are used to assess system performance and educational effectiveness?
  • RQ4: What limitations, challenges, and future directions have been identified across the reviewed studies?
These research questions form the foundation of the paper. Section 2 outlines the materials and methods, including the literature search strategy and inclusion criteria. Section 3 presents the results across six implementation approaches and their corresponding datasets and evaluation methodologies. Section 4 discusses the pedagogical, technical, and ethical implications of the findings, highlighting ongoing challenges and future directions. Section 5 concludes with key insights and recommendations to guide the advancement of LLM-based VP simulations in medical and nursing education.

2. Materials and Methods

2.1. Data Sources and Search Strategies

Papers published between January 2023 and June 2025 were reviewed by searching Google Scholar, PubMed, Web of Science, and Scopus. The final search was conducted on 5 July 2025. The following Boolean query served as the core search strategy (adapted to each database's syntax):

(“virtual patient” OR “simulation” OR “scenario generation”) AND (“LLM” OR “Large Language Model” OR “Generative AI” OR “ChatGPT”) AND (“medical” OR “nursing” OR “healthcare”) AND education

Studies published from 2023 onward were included because the adoption of LLMs in medical education grew significantly after the public release of GPT-3.5 in late 2022. Although this review is not a formal systematic review, the study selection and reporting followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework to ensure transparency and reproducibility. Figure 1 presents the PRISMA flow diagram summarizing the identification, screening, eligibility, and inclusion process of the reviewed studies.

2.2. Study Selection and Eligibility Criteria

All candidate studies were screened against the following criteria, and final inclusion decisions were based on these assessments. Studies were included if they met all of the following conditions:
  • Studies that explicitly utilized LLMs within medical, nursing, or healthcare education contexts.
  • Studies that involved interactive or generative simulations, including scenario design or VP dialogue.
  • Peer-reviewed journal articles or full-length conference papers published in English between 2023 and 2025.
Studies were excluded if they met any of the following conditions:
  • Did not employ LLMs or were unrelated to medical or nursing education.
  • Were unavailable in full text or limited to conference abstracts.
  • Lacked interactive or agent-based simulation components relevant to this review’s objectives.
  • Were inconsistent with the research purpose or analytical scope.
After screening and eligibility assessment, a total of forty studies were included in this review. The subsequent section (Section 3) details how these studies were analyzed and classified into six implementation categories.

3. Results

3.1. Implementation Approaches

This study reviewed forty recent works on VP simulations utilizing LLMs to identify major implementation approaches and evaluation methodologies. Although the reviewed studies differed in technical scope, educational objectives, and application domains, recurring design patterns were identified through inductive synthesis.
Based on this process, the studies were classified into six categories that collectively capture the diverse functional roles and pedagogical applications of LLMs in healthcare simulation. The classification was not predetermined but derived from the observed convergence of technological characteristics and instructional purposes across the literature.
From a technical perspective, the reviewed systems vary in their degree of complexity—from prompt-driven models emphasizing simple interaction to knowledge-driven and multi-agent systems integrating advanced reasoning and contextual adaptation. From a pedagogical perspective, they reflect distinct educational intentions, ranging from scenario generation and interactive learning to feedback provision, realism enhancement, and domain-specific counseling training.
Accordingly, six interrelated categories were defined: (1) LLM-based scenario generation, (2) simple prompt-based VP systems, (3) iterative feedback and automated scoring systems, (4) realism- and adaptability-enhanced systems, (5) knowledge-driven and multi-agent hybrid systems, and (6) mental health- and counseling-oriented systems.
Rather than representing sequential stages, these categories are conceptually connected through overlapping technical and pedagogical objectives. Each reflects a specific approach to leveraging LLMs for simulation-based education. For example, scenario generation systems use LLMs as content creators to produce authentic learning materials, while prompt-based VP systems focus on interactive communication with learners. Feedback and scoring systems extend this to formative assessment, and realism-enhanced or hybrid models emphasize contextual adaptation and collaboration among multiple agents. Mental health-oriented systems, in turn, demonstrate the application of these techniques to specialized domains emphasizing empathy and emotional communication.
Together, these six categories provide a structured basis for analyzing and comparing how different LLM-based VP systems are designed, what educational goals they pursue, and what learning outcomes they achieve. Figure 2 illustrates the distribution of the forty reviewed studies across these categories. Section 3.1.1 provides a detailed analysis of the first component, LLM-based scenario generation.

3.1.1. LLM-Based Scenario Generation

Traditionally, domain experts authored medical simulation scenarios, a time- and resource-intensive process that struggles to capture sufficient case diversity.
Recent studies have introduced methods for rapidly generating and modifying scenarios using LLMs, demonstrating advantages in efficiency and accessibility. However, challenges remain in ensuring clinical authenticity, emotional expression, and alignment with educational objectives. To address these gaps, researchers have explored advanced techniques such as iterative feedback, external knowledge integration, and standardized validation.
A considerable number of studies have focused on prompt-based automatic generation. Vaughn et al. [26], Ghaffari et al. [27], and Violato et al. [28] utilized GPT-3.5 or GPT-4 to generate medical scenarios, highlighting efficiency and accessibility. These studies primarily used simple single-prompt approaches to rapidly generate simulation scenarios. Nevertheless, subsequent evaluations revealed limitations in terms of clinical authenticity, emotional complexity, and alignment with educational objectives.
To overcome such limitations, Tian et al. [29] proposed an iterative improvement strategy using GPT-4o, designing a three-stage feedback structure consisting of draft scenario generation, student-role GPT feedback, and expert-role GPT review. Their approach scored higher than expert-authored scenarios in terms of educational goal alignment, task difficulty, and learner engagement, while significantly reducing production time to 9 min (±2). To enhance emotional immersion and scenario diversity, Gray et al. [30] developed prenatal counseling scenarios with GPT-3.5, embedding patient emotional characteristics such as anxiety and mistrust of medicine, thereby enabling training in communication and empathy skills.
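The three-stage refinement structure of Tian et al. [29] can be sketched as a simple loop. The sketch below is illustrative only: `call_llm` is a hypothetical stand-in for a real GPT-4o API call, and the prompts are invented for exposition.

```python
# Hedged sketch of an iterative scenario-refinement loop in the style of
# Tian et al. [29]: draft generation, student-role feedback, expert-role review.

def call_llm(role: str, content: str) -> str:
    """Hypothetical LLM call; a real system would hit a model API here."""
    return f"[{role}] revised: {content}"

def refine_scenario(learning_objective: str, rounds: int = 3) -> str:
    # Stage 1: draft scenario generation
    scenario = call_llm("author", f"Draft a scenario for: {learning_objective}")
    for _ in range(rounds):
        # Stage 2: student-role GPT critiques the draft
        feedback = call_llm("student", f"Critique for clarity: {scenario}")
        # Stage 3: expert-role GPT revises using the critique
        scenario = call_llm("expert", f"Revise using feedback: {feedback}")
    return scenario

final = refine_scenario("chest pain triage")
```

The key design choice is that feedback and revision come from differently prompted roles of the same model, rather than from human reviewers, which is what yields the reported time savings.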
More structured approaches included Ananthanarayanan et al. [31], who combined GPT-3.5 Turbo with retrieval-augmented generation (RAG) and chain-of-thought (CoT) evaluation to ground outputs and promote demographic diversity; Sumpter et al. [32], who used JSON templates to standardize elements for consistency and reuse; and Barra et al. [33], who orchestrated GPT-4o, Gemini 2.0, and Claude 3.7 in an agentic workflow to meet INACSL/ASPiH standards in approximately 4.5 min with a 70–80% development-time reduction.
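A JSON template in the spirit of Sumpter et al. [32] might look like the following; the field names here are illustrative assumptions, not taken from the paper. Validating generated scenarios against such a template is one way to enforce the consistency the authors describe.

```python
import json

# Hypothetical scenario template; fields are assumptions for exposition.
SCENARIO_TEMPLATE = {
    "title": "",
    "learning_objectives": [],
    "patient_profile": {"age": None, "sex": "", "history": []},
    "vital_signs": {"hr": None, "bp": "", "spo2": None},
    "expected_actions": [],
}

def validate_scenario(scenario: dict) -> list:
    """Return the top-level template keys missing from a generated scenario."""
    return [k for k in SCENARIO_TEMPLATE if k not in scenario]

# An LLM-generated draft that omits several required sections:
draft = json.loads('{"title": "Sepsis in the ED", "learning_objectives": []}')
missing = validate_scenario(draft)
```

A pipeline could loop on `missing`, re-prompting the model until the template is fully populated, which trades a few extra calls for reusable, structurally uniform scenarios.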
Overall, LLMs markedly accelerate scenario authoring and help structure coherent narratives, but single-prompt generation often omits clinical details, misaligns with learning objectives, and lacks emotional depth. These risks can be mitigated through iterative feedback, structured templates, and expert review; however, most current studies remain technical or quasi-experimental, relying on small samples, non-validated metrics, and infrequent reporting of inter-rater reliability or learner outcomes. Consequently, existing evidence supports LLMs mainly as tools for procedural efficiency rather than confirmed pedagogical equivalence to expert-authored scenarios. Future research should adopt prospective controlled designs comparing human and iterative-LLM pipelines, employ validated and transparent evaluation rubrics aligned with INACSL/ASPiH standards, report reliability indices (e.g., ICC, κ), and measure learner-level outcomes such as OSCE performance or decision accuracy. Ensuring demographic fairness, reproducible reporting of model parameters, and fidelity metrics for simulator integration will be essential to establish educational validity and external generalizability.
The key contributions and limitations are summarized in Table 1, and the next section examines LLMs acting directly as VPs during learner interactions.

3.1.2. Simple Prompt-Based Virtual Patient Systems

Simple prompt-based VP studies provide LLMs with only basic patient role instructions and scenario information. This approach enables immediate implementation of medical simulations without additional datasets or complex configurations. It is advantageous in terms of rapid scenario creation, low cost, and high accessibility, particularly in contexts where SPs are difficult to employ.
Öncü et al. [34], Benfatah et al. [35], Holderried et al. [36], Denecke & Reichenpfader [37], and Yi & Kim [38] utilized GPT-3.5, GPT-4, or HyperCLOVA X. By inputting basic patient information and clinical situations, learners could freely engage in question–answer exchanges to practice history-taking, clinical communication, and case management. This approach is especially valuable in educational settings where opportunities for clinical practice are limited, supporting the development of essential clinical competencies.
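The defining feature of these systems is that the entire patient is specified in a single system prompt. A minimal sketch of such a prompt builder follows; the wording and fields are illustrative assumptions, not reproduced from any of the cited studies.

```python
# Minimal prompt-only VP setup: all patient behavior is carried by one prompt.

def build_vp_prompt(name: str, age: int, chief_complaint: str,
                    history: list) -> str:
    """Assemble a hypothetical system prompt for a prompt-based VP."""
    return (
        f"You are {name}, a {age}-year-old patient. "
        f"Chief complaint: {chief_complaint}. "
        f"History: {'; '.join(history)}. "
        "Stay in character, answer only what the learner asks, "
        "and never volunteer your diagnosis."
    )

prompt = build_vp_prompt(
    "Ms. Park", 58, "crushing chest pain",
    ["hypertension", "type 2 diabetes"],
)
# This string would be sent as the system message of a chat-completion call.
```

The low cost and accessibility noted above follow directly from this design: no fine-tuning, datasets, or infrastructure beyond a hosted chat API is required, though the same simplicity underlies the consistency problems discussed below.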
In contrast, Aster et al. [39], Lower et al. [40], and Cross et al. [41] retained the basic prompt structure while augmenting it with specific objectives or instructional functions. Aster et al. [39] designed a cardiac case scenario that encouraged learners to identify ‘empathic opportunities’, thereby extending training beyond information gathering to relationship-building between physician and patient. Lower et al. [40] evaluated the diagnostic and treatment accuracy of GPT-4 in orthopedic cases, finding it useful for basic and intermediate-level education but limited in addressing advanced specialized domains. Cross et al. [41] demonstrated that GPT-4, acting as an SP, could facilitate repeated practice, self-directed learning, and feedback provision, highlighting its potential as a supplementary tool in education.
Similarly, Scherr et al. [42] used ChatGPT-3.5 to implement ACLS and ICU (pneumonia and sepsis) scenarios using simple prompts. Patient conditions dynamically changed in response to learner interventions, and narrative feedback was provided upon scenario completion. Although this approach demonstrated low cost and high accessibility, it also revealed limitations in clinical accuracy, reproducibility, and feedback consistency, suggesting the need for automated scoring and rubric-based feedback in subsequent designs.
Although these simple prompt-based VP studies demonstrated feasibility, accessibility, and positive learner engagement, their methodological rigor remains limited. Most employed small samples, single-site pilots, or lacked control groups, which constrains generalizability and makes it difficult to attribute learning gains specifically to LLM interaction. Variations in model version, prompt structure, and evaluation criteria further hinder reproducibility across studies. In addition, outcome measures—often restricted to plausibility, usability, or self-reported satisfaction—do not yet establish competence transfer to clinical performance. To advance the field, future research should adopt larger, multi-institutional randomized designs with transparent prompt reporting, validated scoring rubrics, and comparative analyses against SP encounters.
The main contributions and limitations of these simple prompt-based VP studies are summarized in Table 2. Building on these foundational systems, further research has explored approaches that provide automated scoring and structured feedback immediately after learner–patient interactions, thereby supporting iterative practice and self-directed learning. These feedback-integrated automatic scoring systems are examined in the next section.

3.1.3. Iterative Feedback and Automated Scoring Systems

As an advanced form of prompt-based VPs, studies have emerged that provide automatic scoring and structured feedback immediately after learner–VP interactions. This approach offers rubric-based objective scores and narrative suggestions for improvement in real time, enabling learners to recognize the strengths and weaknesses of their dialogue. In turn, this facilitates repetitive practice and self-regulated learning (SRL) and may serve as an efficient alternative in settings where SP use is limited.
In terms of validating scoring accuracy, the follow-up study by Holderried et al. [21], building on their earlier work [36], compared feedback generated by a GPT-4–based VP with that of human evaluators. Across 45 scored items, near-perfect agreement was achieved (Cohen’s κ = 0.832), and 99.3% of responses were judged medically valid. Although some discrepancies arose due to ambiguous definitions, the authors suggested that category redefinition, more specific prompting, and the provision of examples could enhance consistency. Similarly, Wang et al. [43] reduced the scoring error rate from 29.83% to 6.06% by refining prompts, thereby improving the stability and precision of LLM-based assessment.
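Cohen's κ, the agreement statistic reported by Holderried et al. [21], corrects raw rater agreement for agreement expected by chance. A pure-Python version for two raters with the invented binary score vectors below (not data from the study) shows the computation:

```python
# Cohen's kappa for two raters: (p_observed - p_expected) / (1 - p_expected).

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    n = len(rater_a)
    labels = sorted(set(rater_a) | set(rater_b))
    # Proportion of items on which the raters agree outright
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater assigned labels independently
    p_expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative binary item scores (1 = credit given), NOT the study's data:
llm_scores   = [1, 1, 0, 1, 0, 1, 1, 0]
human_scores = [1, 1, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(llm_scores, human_scores)  # ≈ 0.714
```

Values above roughly 0.8, such as the κ = 0.832 reported in the study, are conventionally read as near-perfect agreement.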
For evaluating learning outcomes, Brügge et al. [44] conducted a randomized controlled trial (RCT) combining ChatGPT-based patient simulation with structured feedback. After only four sessions, clinical decision-making (CDM) scores improved significantly compared with the control group (p = 0.049), with particularly notable gains in contextual reasoning and information acquisition. Haut et al. [45], in another RCT applying personalized feedback based on the 3E framework (Empathy, Explicitness, Empowerment), reported significant improvement across all skills compared to the control group (p < 0.01), with a large effect size. Their system (SOPHIE) enhanced realism through hybrid dialogue management that combined rule-based schemas with LLMs, along with 3D avatars and emotion synchronization. Although not randomized, Yamamoto et al. [46] found that learners who received feedback after interviewing an AI patient achieved higher scores in OSCE-like interview assessments (p = 0.01) than the control group, and reported greater realism and positive affective interaction in their learning experience.
Hicke et al. [47] presented an integrated platform approach. Their system combined a GPT-4o–based AI-SP with multiple rubric-based assessments (MIRS, SPIKES), automated scoring, evidence citation from dialogues, and tools for scheduling and goal setting, creating an SRL hub. This allowed learners not only to receive scores but also to understand the rationale behind them, thereby supporting iterative training. Instructors, in turn, were able to apply customized case-specific assessment frameworks more easily and track learner progress.
Other technical approaches have also been reported. Chiu et al. [48] automatically generated narrative feedback based on the SPIKES framework, providing clear guidance for learner improvement. Cook et al. [49] proposed a GPT-4–based system for self-assessment of dialogue and feedback quality, demonstrating low cost (average $0.51 per dialogue). However, the study noted low reproducibility of evaluation and a tendency toward overly positive feedback, highlighting directions for future improvement.
Despite promising results across individual studies, several methodological weaknesses limit the generalizability of current evidence. Most investigations involved small, single-site cohorts, lacked randomization or blinded comparison groups, and used narrow short-term outcomes such as rubric scores or user satisfaction. In addition, scoring reproducibility remains highly sensitive to prompt structure and model parameters, while feedback often shows over-positive tone and inconsistent depth. These methodological constraints highlight the need for larger, multi-site randomized trials with transparent prompt disclosure, fixed model versions, and longitudinal outcome tracking. Establishing such standards will be essential for determining whether LLM-based automated feedback can serve as a reliable, generalizable complement or alternative to standardized-patient-based assessment.
The key contributions and limitations of representative systems appear in Table 3, and the next section examines systems that enhance the realism and adaptability of learner–VP interactions.

3.1.4. Realism- and Adaptability-Enhanced Virtual Patient Systems

Beyond iterative feedback and automated scoring systems, recent studies have advanced toward more sophisticated designs of VP language, emotions, and attitudes. These efforts also integrate multimodal elements such as speech, facial expressions, gaze, avatars, and mixed reality (MR) to enhance realism and immersion. Rather than relying on static dialogues, such approaches allow patient responses to adapt to learner utterances and provide sensory experiences resembling face-to-face encounters, thereby maximizing learning effectiveness.
Bodonhelyi et al. [50] implemented the Accuser (direct, aggressive) and Rationalizer (logical, emotion-avoiding) types from the Satir model as prompts, activating conditional resistance mechanisms during conversations with uncooperative patients. Their design relied on multi-stage prompt engineering—comprising author’s notes, behavioral instructions, and a stubbornness mechanism—representing a rule-based open-loop control algorithm that adjusts only linguistic style without adaptive feedback. This allowed learners to repeatedly practice persuasion and empathy skills, with expert evaluations confirming high scores for both stylistic consistency and realism. Chen et al. [51] developed a patient chatbot set in a mental health outpatient context that incorporated colloquial speech, emotional fluctuations, and resistance toward physicians, achieving such a level of immersion that both real patients and doctors evaluated it as “patient-like.” Algorithmically, this model introduced a lightweight feedback loop by adding short “attention reminders” at each conversational turn, allowing the LLM to retain emotional continuity over time but still lacking a mechanism for real-time behavioral adaptation.
Several studies have also leveraged nonverbal signals and physical interfaces. Borg et al. [52] used an LLM-based social robot to implement real-time speech, facial, and gaze responses, demonstrating significant improvements in authenticity (p = 0.04) and learning effectiveness (p = 0.01) compared with conventional computer-based VPs. The system followed a multimodal processing pipeline—speech recognition → LLM generation → text-to-speech and animation—providing a procedural synchronization algorithm between audio and visual channels. Gutiérrez et al. [53] applied GPT-3.5 to a mixed reality (MR) environment for pulmonary training, overlaying a 3D avatar on a physical mannequin while integrating speech recognition, speech synthesis, and emotional expression. However, response delays of about three seconds and turn-taking constraints were reported. Sardesai et al. [54] developed a no-code platform for remote anesthesia training simulations that combined knowledge bases with voice- and avatar-based patient interactions. Together, these multimodal systems expanded interface-level control but remained algorithmically open-loop, as they did not update internal patient states or learning parameters in response to user behavior.
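The open-loop multimodal chain described for these systems (speech recognition, then LLM generation, then synchronized text-to-speech and animation) can be sketched as a linear pipeline. Every stage below is a hypothetical stub standing in for a real ASR, LLM, or TTS component.

```python
# Open-loop multimodal VP pipeline sketch: ASR -> LLM -> TTS/animation.
# Stubs only; no stage adapts to learner behavior, matching the "open-loop"
# characterization in the text.

def asr(audio: bytes) -> str:
    return "where does it hurt?"            # stub speech recognition

def llm_reply(text: str) -> str:
    return "It hurts here, in my chest."    # stub patient-role LLM response

def tts_and_animate(text: str) -> dict:
    # Synchronize the audio channel and a facial-animation track to one
    # utterance (the procedural synchronization noted for Borg et al. [52]).
    return {"audio": f"<speech:{text}>", "viseme_track": len(text.split())}

out = tts_and_animate(llm_reply(asr(b"...")))
```

Because data flows strictly left to right, latency accumulates across stages, which is one plausible source of the roughly three-second response delays reported by Gutiérrez et al. [53].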
In contrast to the above studies, Lee et al. [55] introduced “Adaptive-VP,” a more sophisticated system incorporating multilayered design features such as safety monitoring and rule-based utterance control. Unlike previous multimodal or persona-based systems, Adaptive-VP emphasizes behavioral adaptation by dynamically adjusting the patient’s emotional state and communication style in response to learner input, marking a transition from visual and emotional realism to interactive, performance-driven realism. Algorithmically, Adaptive-VP implements a closed-loop control structure built on four interdependent modules (Evaluation, Dynamic Adaptation, Dialogue Generation, and Safety Monitoring): the system continuously evaluates learner utterances in real time, assigns diagnostic scores, and updates the patient’s attitude, level of dissatisfaction, and response style through an adaptive loop. This feedback-driven design establishes a causal relationship between learner performance and VP behavior, representing the first fully adaptive behavioral framework among the reviewed systems, and it significantly improved role fidelity and conversational realism compared with static VPs (p < 0.05). In addition, a multi-expert agent evaluation method was employed to minimize bias.
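A closed-loop turn of this kind can be sketched as follows. This is a deliberately simplified illustration of the evaluate–adapt–generate–monitor cycle, not the Adaptive-VP implementation; the scoring rule and state fields are invented.

```python
# Closed-loop VP turn sketch: learner input updates patient state, and the
# updated state shapes the next reply. Scoring and generation are stubs.

def score_utterance(utterance: str) -> int:
    """Hypothetical rubric: +1 for empathic phrasing, -1 otherwise."""
    return 1 if "I understand" in utterance else -1

def vp_turn(state: dict, learner_utterance: str) -> str:
    # 1. Evaluation: score the learner's utterance
    delta = score_utterance(learner_utterance)
    # 2. Dynamic adaptation: update dissatisfaction, clamped to [0, 10]
    state["dissatisfaction"] = min(10, max(0, state["dissatisfaction"] - delta))
    # 3. Dialogue generation conditioned on the updated state
    tone = "cooperative" if state["dissatisfaction"] < 5 else "resistant"
    # 4. Safety monitoring: flag runaway dissatisfaction for review
    state["flagged"] = state["dissatisfaction"] >= 9
    return f"({tone}) patient reply"

state = {"dissatisfaction": 5, "flagged": False}
reply = vp_turn(state, "I understand this must be frightening.")
```

The contrast with the open-loop systems above is that `state` persists between turns and feeds back into generation, so the same learner behavior produces different patient behavior over the course of an encounter.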
Although these studies collectively demonstrate the technical feasibility and educational promise of realism- and adaptability-enhanced VPs, their evidence base remains preliminary. Most were small, single-site pilots—15 to 28 participants—without control groups or blinding, and primarily assessed perceived realism rather than objective learning gains. Open-loop persona models achieved stylistic consistency but reinforced fixed, non-adaptive behaviors, while multimodal and MR systems improved authenticity and engagement yet suffered from latency and small-sample self-report bias. The Adaptive-VP framework introduced the first closed-loop behavioral adaptation validated with expert- and novice-nurse corpora, but still focused on perceived realism rather than transfer to OSCE performance. To establish causal effectiveness, future work should employ adequately powered, randomized or non-inferiority designs comparing standardized-patient, open-loop, and adaptive VPs, use blinded human-rated communication metrics, and report safety and reproducibility indicators. Such methodological rigor is essential for determining which elements of realism and adaptivity truly improve clinical training outcomes.
The key strengths and limitations of representative systems are summarized in Table 4, and subsequent sections examine knowledge-driven and multi-agent hybrids that extend realism and adaptability with explicit grounding and coordinated roles.

3.1.5. Knowledge-Driven and Multi-Agent Hybrid Virtual Patient Systems

Recent studies have moved beyond the limitations of simple prompt-based patient generation by integrating knowledge graphs (KGs), RAG, and multi-agent architectures to enhance the clinical consistency and realism of VPs. This approach systematically leverages structured medical knowledge and improves the accuracy, consistency, and realism of dialogues through collaboration among specialized agents. In addition, by integrating multimodal content delivery with automated assessment and feedback functions, it aims to maximize learner immersion and educational effectiveness. To better explain how these integrations are implemented, recent systems can be generally divided into three layers: a knowledge layer (responsible for retrieving and structuring medical data from KGs or EHRs), an orchestration layer (which coordinates multi-agent communication and verification), and a generation layer (which produces text, image, or audio outputs and provides feedback). Each agent exchanges structured messages—such as JSON or Cypher queries—across these layers, enabling reasoning, consistency checking, and multimodal synchronization.
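The three-layer breakdown above can be made concrete with a toy structured message. The schema, Cypher-style query string, and routing rule below are assumptions introduced for exposition; they are not drawn from any single reviewed system.

```python
import json

# Illustrative message flow: knowledge layer -> orchestration -> generation.

knowledge_msg = {
    "layer": "knowledge",
    # Cypher-style query an agent might run against a patient KG (illustrative)
    "query": "MATCH (p:Patient)-[:HAS_SYMPTOM]->(s) RETURN s.name",
    "results": ["dyspnea", "fever"],
}

def route(msg: dict) -> dict:
    """Orchestration layer: verify the message, then forward it downstream."""
    assert msg["layer"] == "knowledge" and msg["results"], "verification failed"
    return {
        "layer": "generation",
        "prompt": "Patient reports: " + ", ".join(msg["results"]),
    }

# Serialize/deserialize to mimic inter-agent JSON message passing:
out = route(json.loads(json.dumps(knowledge_msg)))
```

Keeping messages structured (rather than free text) is what allows the orchestration layer to run consistency checks before anything reaches the generation layer.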
Du et al. [56] proposed the EvoPatient framework, which implemented a co-evolution structure in which patient agents and physician agents were trained interactively. The system simulated real clinical procedures step by step, with the patient agent selectively retrieving necessary information from medical records via RAG, while the physician agent dynamically invoked multidisciplinary experts based on a directed acyclic graph (DAG) to improve the professionalism and goal-directedness of questions. In implementation, each interaction phase—complaint, triage, interrogation, and conclusion—is handled through autonomous dialogue between a patient agent and multiple doctor agents. The DAG structure ensures non-redundant information flow among doctors from different specialties. In parallel, the framework maintains both instant and summarized memory to preserve dialogue continuity and context, while two specialized data stores—an Attention Library (for refined requirements) and a Trajectory Library (for validated dialogue sequences)—accumulate high-quality exemplars for reuse. This agentic co-evolution process allows both patient and doctor agents to iteratively refine dialogue quality and achieve performance improvements of more than 10% even under transfer learning conditions.
Yu et al. [57] developed the AIPatient system, which combined an EHR-based KG with a Reasoning-RAG framework. The system employed a multi-stage structure linking six specialized agents: Retrieval, KG Query Generation, Abstraction, Checker, Rewrite, and Summarization. Within the standard three-stage RAG pipeline, verification and summarization modules were inserted at the reasoning stage. Technically, the Reasoning-RAG workflow retrieves relevant nodes and edges from the AIPatient KG using Cypher queries (via the Retrieval and KG Query Generation agents), abstracts the query into higher-level medical reasoning tasks, and then executes iterative checking cycles to validate factual alignment. When an inconsistency is detected, the Checker agent automatically triggers a new query or a reformulation. The verified information is then rewritten into natural language and adapted to the patient's persona by the Rewrite and Summarization agents. Furthermore, by incorporating the Big Five personality model, the system maintains 32 distinct patient profiles that control the tone and emotional style of responses. This multi-agent reasoning chain ensures clinical accuracy, stability, and realism, achieving a QA accuracy of 94.15% and outperforming human SPs in consistency and reliability.
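A minimal sketch of this retrieve–check–rewrite cycle follows. The dictionary lookup, checker rule, and persona rewrite are placeholder logic standing in for AIPatient's LLM-driven agents and Cypher-based KG access; none of the names come from the paper's code.

```python
# Toy retrieve-check-rewrite loop in the spirit of Reasoning-RAG. The KG, the
# checker rule, and the query reformulation are all hypothetical stand-ins.

KG = {"allergy": "penicillin", "onset": "two days ago"}

def kg_query(entity):
    # Stand-in for a Cypher query such as MATCH (p:Patient)-[:HAS]->(f:Fact {name: $entity}).
    return KG.get(entity)

def checker(value):
    return value is not None  # real system: factual-alignment check by the Checker agent

def rewrite(entity, value, persona):
    return f"({persona}) Well... my {entity} is {value}."

def answer(entity, persona="anxious", max_tries=2):
    query = entity
    for _ in range(max_tries):
        value = kg_query(query)
        if checker(value):
            return rewrite(query, value, persona)
        query = query.rstrip("?")  # stand-in for the Checker triggering a reformulated query
    return "(no verified answer)"

print(answer("allergy"))  # (anxious) Well... my allergy is penicillin.
print(answer("onset?"))   # succeeds on the reformulated second pass
```

The key design choice mirrored here is that generation only happens after verification passes, so an unverifiable query degrades to an explicit refusal instead of a hallucinated answer.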
Li et al. [58] proposed the MedDiT framework, which integrated a KG-controlled LLM with a Diffusion Transformer (DiT)-based medical image generation model. The system was implemented as a multi-agent structure composed of a KG Agent, a Chat Agent, and an Image Generation Agent, enabling real-time synchronization of patient dialogues with medical images to provide a multimodal learning environment. In MedDiT's architecture, the KG Agent first retrieves relevant patient attributes and symptom data from the KG and linearizes them into natural-language prompts. These structured prompts are shared between the Chat Agent (responsible for conversational output) and the Image Generation Agent, which employs a Hunyuan-DiT model with LoRA adapters trained on 3314 Open-I chest X-rays. The resulting image prompt (IP) is then processed by the Diffusion Transformer as I = DiT(IP); that is, the generated medical image I is obtained by applying the DiT model to the textual prompt IP, which keeps the synthesized image semantically consistent with the clinical context of the dialogue. All three agents communicate through a unified LLM server operating as a microservice architecture, which manages message routing, version control, and real-time updates. Additionally, an integrated dialogue evaluation module automatically analyzes learner interactions in terms of information completeness, symptom exploration, and empathy, providing both quantitative scores and narrative feedback. Through this pipeline, MedDiT demonstrates a concrete form of multimodal integration that merges textual reasoning with image synthesis in real time.
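The KG-to-image flow can be illustrated as follows. Here `linearize` mimics the KG Agent's prompt construction and `dit` is a mock for the Hunyuan-DiT call, so both functions are assumptions for illustration rather than MedDiT's actual API.

```python
# Sketch of the I = DiT(IP) pipeline: KG attributes are linearized into a
# textual image prompt IP, which a (mocked) diffusion model turns into an
# image I. The real system uses Hunyuan-DiT with LoRA adapters.

def linearize(attributes):
    """KG Agent step: turn structured attributes into a natural-language prompt."""
    return "chest X-ray, " + ", ".join(f"{k}: {v}" for k, v in sorted(attributes.items()))

def dit(image_prompt):
    """Mocked Diffusion Transformer; returns placeholder bytes instead of pixels."""
    return f"<image for: {image_prompt}>".encode()

attrs = {"finding": "left lower lobe opacity", "view": "PA"}
ip = linearize(attrs)  # shared with both the Chat Agent and the Image Generation Agent
image = dit(ip)        # I = DiT(IP)
print(ip)
```

Because the same linearized prompt feeds both the dialogue and the image model, the generated image and the patient's utterances are derived from one shared clinical state, which is what keeps the two modalities consistent.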
Li et al. [59] introduced CureFun, which maintained the profile consistency of Virtual Simulated Patients (VSPs) using a case graph and the ERRG (Extract, Retrieve, Rewrite, Generate) procedure. This prevented role-flipping during conversations and ensured consistency by plausibly generating information not contained in the original scripts and writing it into the graph for use in subsequent dialogues. In practical implementation, CureFun organizes dialogue generation as a four-stage pipeline: the Extract stage identifies medical entities and relations in the learner's query, the Retrieve stage executes a SPARQL query on the case graph to collect relevant subgraphs, the Rewrite stage transforms structured triples into natural-language evidence, and the Generate stage produces a context-aware response while writing new facts back to the graph. This continuous write-back mechanism maintains longitudinal consistency and prevents hallucination. Beyond text, CureFun integrates speech-based interaction through built-in STT (speech-to-text) and TTS (text-to-speech) modules, supporting natural conversation. The system operates on a dedicated LLM server that uses paged attention and speculative decoding to minimize latency. For automated assessment, CureFun converts traditional SP checklists into LLM-executable scoring programs and applies multi-LLM ensemble voting, achieving high alignment with human evaluators (r = 0.81–0.85). Through this integration of reasoning agents, KGs, and multimodal input/output, CureFun establishes a fully interactive and evaluable VP simulation framework.
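The write-back behavior at the heart of ERRG can be sketched as below. The extraction rule and the generated default value are toy assumptions, with a plain dictionary standing in for the SPARQL-queried case graph.

```python
# Toy ERRG loop highlighting the write-back step: when a queried fact is absent
# from the case graph, a plausible value is generated once and stored, so later
# turns stay consistent with it. All rules here are illustrative stand-ins.

case_graph = {("patient", "symptom"): "cough"}

def extract(query):
    # Stand-in for LLM-based entity/relation extraction.
    return "symptom" if "symptom" in query else "allergy"

def retrieve(relation):
    # Stand-in for a SPARQL subgraph query on the case graph.
    return case_graph.get(("patient", relation))

def generate(relation, value):
    if value is None:
        value = "none reported"                    # plausibly generated new fact...
        case_graph[("patient", relation)] = value  # ...written back for subsequent turns
    return f"My {relation}: {value}."

first = generate("symptom", retrieve("symptom"))
rel = extract("Do you have any allergies?")
second = generate(rel, retrieve(rel))  # fact absent: generated once and recorded
third = generate(rel, retrieve(rel))   # now answered consistently from the graph
print(first, second, third)
```

The point of the sketch is the third call: once an invented fact has been written back, every later turn retrieves the same value, which is how the pipeline avoids contradicting itself across a long dialogue.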
While these knowledge-driven and multi-agent systems represent a clear methodological leap from single-prompt simulations, their current validation remains largely technical. Most studies assess internal reasoning accuracy or architectural efficiency rather than demonstrable learning gains, often relying on limited datasets and AI-based evaluators. As a result, evidence for educational effectiveness, scalability, and external validity is still emerging. Future research should therefore extend beyond engineering optimization to pedagogical verification—through multi-site controlled comparisons with human SPs, integration of multimodal outcome measures, and transparent, human-blinded assessment protocols. Such rigor will be essential to transform these technically mature prototypes into credible, evidence-based training tools for clinical and interprofessional education.
The key achievements and limitations are summarized in Table 5, and the subsequent section extends these advances to mental health and counseling-oriented VP systems with specialized affective control and automated evaluation.

3.1.6. Mental Health- and Counseling-Oriented Systems

In the field of mental health and counseling, LLM-based VP research aims to achieve both high realism and evaluability by combining patient modeling grounded in clinical theories with structured conversational guidelines. Traditional prompt-based approaches showed limitations in subtle emotional expression, resistant behaviors, and maintenance of multi-turn context. Recent studies, however, have sought to overcome these limitations by integrating theory- and scale-based information injection, multi-agent architectures, automated scoring and feedback, and behavior-principle control.
One line of research has focused on skill training using VPs. Wang et al. [60] proposed the Patient-Ψ system, which injected 106 expert-designed cognitive models for cognitive behavioral therapy (CBT) into GPT-4 and applied six conversational styles to reproduce emotional fluctuations, information concealment, and resistant behaviors. Learners engaged in dialogue to establish cognitive models and then compared their models with reference models to receive feedback and improve their skills. Steenstra et al. [61] developed SimPatient, which employed a multi-agent structure for motivational interviewing (MI) training. The system generated patient responses, coded utterance-level behaviors, tracked cognitive state changes, and visualized session summaries through a dashboard, significantly improving self-efficacy (p < 0.001). Louie et al. [62] proposed the Principle-Adherence Pipeline, which extracted and applied counselor feedback as principles, adjusting emotional expressions and resistant behaviors to fit the context, thereby increasing principle adherence by more than 30%.
Other studies have emphasized the evaluation of counselor and therapist competencies. Lee et al. [63] proposed PSYCHE, a multi-faceted construct (MFC)–based framework for generating and evaluating SPs for psychiatric interviews. The generated patients achieved an average clinical appropriateness of 93%, and automatic scoring showed a strong correlation with expert ratings (r ≈ 0.85). The MFC-Behavior design incorporated paralinguistic and uncooperative behaviors to strengthen realism, while information leakage prevention guidelines ensured safety. Wang et al. [64] introduced ClientCAST, in which an LLM client conversed with an LLM therapist and then completed client-perspective surveys (e.g., SRS, WAI-SR) to evaluate session quality. This system demonstrated validity in distinguishing between high- and low-quality counseling data.
Although recent LLM-based counseling simulations have achieved higher emotional realism and automated evaluability, several field-level limitations remain evident. Current studies often rely on small, non-randomized samples and focus on short-term self-efficacy rather than objective skill transfer, restricting claims of educational effectiveness. Despite theoretical grounding, many virtual clients still exhibit overly cooperative or homogeneous behaviors, lacking resistance, ambivalence, and paralinguistic nuance essential for authentic counseling encounters. Evaluation methods also vary considerably—ranging from expert-reviewed construct scoring to model-internal assessments—undermining comparability and validity across systems. To address these gaps, future research should adopt controlled and longitudinal designs, incorporate standardized behavioral metrics such as MITI or MFC-based rubrics, and employ triangulated evaluation integrating expert, client, and automated perspectives. Broader scenario diversity, cross-cultural validation, and transparent reporting of reliability and safety indicators will further enhance the robustness and generalizability of these mental health-oriented simulation frameworks.
The main achievements and limitations of the mental health- and counseling-oriented systems described above are summarized in Table 6.

3.2. Datasets

Across 40 studies (2023–2025), dataset use clustered into three sources—custom-built corpora, open medical resources, and mixed clinical collections—reflecting divergent practices in curation, accessibility, and governance beyond the search and selection methods.
Most studies utilized self-constructed datasets consisting of researcher- or instructor-developed prompts and scenarios, as well as learner–LLM dialogue logs. These datasets are typically created within specific institutional or classroom environments, which limits external accessibility and poses challenges for reproducibility and data sharing.
Public medical datasets were employed in a smaller number of cases to enhance the factual accuracy and diversity of simulated clinical information. Commonly used examples included electronic health record (EHR) and imaging resources such as MIMIC-II/III and Open-I Chest X-ray, as well as counseling-oriented corpora, enabling more realistic and domain-specific scenario generation.
Hybrid clinical datasets combined open resources with institution-specific EHR data to construct realistic and personalized scenarios. Although these designs improved authenticity, their adoption remained limited because of strict requirements related to data protection, ethical approval, and privacy governance.
Several studies also implemented KG- or rule-based structured datasets to control the reasoning and behavior of VP agents. While these partially overlap with the above categories, they were treated separately in this review because of their methodological novelty and emphasis on structured data representation.
When handling any form of medical, conversational, or learner-generated data, researchers emphasized the need for responsible data governance—including anonymization, informed consent, secure storage, and limited access—to protect privacy and ensure ethical compliance. Such measures, aligned with widely accepted international data-protection principles and research-ethics standards, are essential for maintaining transparency, reproducibility, and trust in LLM-based simulation research.
The classification of dataset utilization types is summarized in Table 7, and representative publicly available datasets and their application cases are shown in Table 8.

3.3. Evaluation Methods

To reliably verify the performance and educational effectiveness of LLM-based medical simulation systems, an objective and systematic evaluation framework is required. Traditional evaluations relied on expert judgment or standardized tests, but recent research has developed methods emphasizing objectivity, automation, and multidimensionality. This section reviews these recent research trends.
According to the systematic review by Lee et al. [71], evaluations of LLMs in the medical domain are divided into subjective assessments (56.3%) and objective assessments based on standardized examinations (37.3%). Test-based evaluations were found to have several limitations, including insufficient numbers of questions (often fewer than 100, 29%), a lack of repeated measures (24%), and limited use of prompt engineering (12.9%). Expert-based evaluations, on the other hand, faced challenges in reliability due to the small number of evaluators. The authors emphasized the necessity of multidimensional evaluation methodologies, such as ensuring an adequate number of test items, securing reproducibility through repeated measures, adopting diverse prompting strategies, and incorporating additional metrics beyond accuracy.
More recently, new benchmarks have been developed to assess the pure logical reasoning ability of LLMs. Fan et al. [72,73] introduced NPHardEval and NPHardEval4V, dynamic benchmarks based on computational complexity theory that subdivide 900 algorithmic problems of varying difficulty—ranging from NP-hard and NP-complete to P—into ten levels. These benchmarks enabled rigorous evaluation of the logical reasoning capabilities of state-of-the-art LLMs by defining explicit quantitative indicators. In these frameworks, model performance is measured using three complementary indices—Recognition Accuracy (RA), Instruction-following Effective Rate (ER), and Aggregated Accuracy (AA)—that jointly quantify recognition, compliance, and reasoning ability.
The recognition accuracy is calculated as

$$RA = \frac{\sum_{i=1}^{N} C_i}{N}, \tag{1}$$

and the instruction-following rate as

$$ER = \frac{\sum_{i=1}^{N} F_i}{N}, \tag{2}$$

where $C_i$ and $F_i$ denote binary indicators of whether the model correctly recognized the input and produced a well-formed output for each of $N$ tasks.
The overall Aggregated Accuracy (AA) integrates these elements with a difficulty weight $w_i$ assigned to each task:

$$AA = \frac{\sum_{i=1}^{N} w_i\, A_i\, RA_i\, ER_i}{\sum_{i=1}^{N} w_i}, \tag{3}$$

where $A_i$ represents the accuracy for correctly recognized and parsed items, and $RA_i$ and $ER_i$ are the per-item recognition and instruction-following indicators $C_i$ and $F_i$.
This weighted design allows consistent comparison across P, NP-complete, and NP-hard problem classes and reflects how reasoning accuracy changes with task complexity.
To ensure reproducibility, additional indices such as Weighted Accuracy (WA) and Failure Rate (FR) are also employed:

$$WA = \frac{\sum_{i=1}^{10} w_i A_i}{\sum_{i=1}^{10} w_i}, \tag{4}$$

$$FR = \frac{\sum_{i=1}^{10} F_i}{100}, \tag{5}$$

where the sums in Equations (4) and (5) run over the ten difficulty levels.
These quantitative definitions provide transparency and statistical consistency across studies, addressing the limitation noted by Lee et al. [71] that many medical-domain evaluations relied solely on simple accuracy metrics without reproducibility testing.
When applied to LLM-based medical simulations, these equations can be interpreted within an educational context. Specifically, C i indicates whether key clinical cues or vital signs are correctly identified, F i reflects adherence to scenario or rubric constraints, and A i denotes the accuracy of medical reasoning as judged by experts. Accordingly, Equation (3) serves as a unified quantitative metric that captures perception, compliance, and reasoning performance of VP systems within a reproducible framework.
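Under this interpretation, the indices reduce to a few lines of code. The per-item indicators, accuracies, and weights below are toy values chosen for illustration, not results from any cited benchmark.

```python
# Direct computation of RA, ER, and AA from per-item records: C marks whether a
# clinical cue was recognized, F whether the rubric/format was followed, A the
# expert-judged reasoning accuracy, and w the difficulty weight. Toy data only.

def ra(C):
    return sum(C) / len(C)  # Equation (1)

def er(F):
    return sum(F) / len(F)  # Equation (2)

def aa(w, A, C, F):
    # Equation (3), with C_i and F_i serving as the per-item RA_i and ER_i.
    return sum(wi * Ai * Ci * Fi for wi, Ai, Ci, Fi in zip(w, A, C, F)) / sum(w)

C = [1, 1, 0, 1]
F = [1, 1, 1, 0]
A = [1.0, 0.5, 0.0, 1.0]
w = [1, 2, 3, 4]

print(ra(C), er(F), aa(w, A, C, F))  # 0.75 0.75 0.2
```

Note how the fourth item, which scored perfectly on reasoning but violated the format constraint, contributes nothing to AA: the multiplicative form only credits items that were recognized, compliant, and correct.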
Results from these benchmarks revealed that closed-source models such as GPT-4 Turbo and Claude 2 generally outperformed open-source models; however, all models exhibited substantial performance degradation on higher-complexity NP-hard problems. In multimodal evaluation with NPHardEval4V, reasoning performance of multimodal models was even lower than that of text-only models, due to the added complexity of processing visual information. Furthermore, the ability to generalize knowledge gained from simpler problems to more difficult ones remained limited. These findings underscore the importance of future research aimed at improving the logical reasoning capacity of LLMs and the performance of multimodal models.
Research has also focused on evaluating role-playing ability and automating the evaluation process. Gusev [74] proposed the PingPong benchmark, which employs an ensemble of a player model, an interrogator model, and multiple evaluator models to automatically assess the role-playing ability of LLMs. Results demonstrated a strong correlation between automated ensemble evaluations and human judgments, and models fine-tuned for creative writing also exhibited superior role-playing performance. In addition, PingPong supported multilingual evaluation, enabling comparisons of model performance in English and Russian.
Zheng et al. [75] proposed the LLM-as-a-Judge approach, which employs powerful LLMs such as GPT-4 as evaluators, and confirmed a high correlation (>80%) with human assessments. However, they also identified issues such as position bias, verbosity bias, self-reinforcement bias, and inaccuracies in evaluating mathematical and logical reasoning. To address these challenges, the authors suggested improvements including order-shuffled evaluation, the use of reference responses, and multi-rater ensemble scoring to mitigate bias and improve reliability.
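The order-shuffled evaluation they suggest can be sketched as a consistency check over both answer orders. The `judge` callables below are toy stand-ins for an actual LLM judge.

```python
# Order-shuffled pairwise judging: query the judge with both answer orders and
# accept a verdict only when the two runs agree, otherwise fall back to a tie.
# This mitigates position bias. The judges here are illustrative lambdas.

def shuffled_verdict(judge, answer_a, answer_b):
    v1 = judge(answer_a, answer_b)  # returns "first", "second", or "tie"
    v2 = judge(answer_b, answer_a)
    flipped = {"first": "second", "second": "first", "tie": "tie"}[v2]
    return v1 if v1 == flipped else "tie"  # inconsistent verdicts collapse to a tie

position_biased = lambda a, b: "first"  # always prefers whichever answer is shown first
length_based = lambda a, b: "first" if len(a) > len(b) else "second"

print(shuffled_verdict(position_biased, "A", "B"))                   # tie
print(shuffled_verdict(length_based, "a detailed answer", "short"))  # first
```

The purely position-biased judge never survives the swap and so yields only ties, while a judge whose preference tracks the answers themselves gives the same verdict in both orders and is accepted.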
Collectively, these advances mark a shift toward quantitative, automated, and reproducible evaluation in LLM-based simulation research. By pairing explicit metrics (e.g., Equations (1)–(5)) with automated judging frameworks such as PingPong and LLM-as-a-Judge, researchers can improve both objectivity and interpretability when assessing the educational effectiveness of LLM-driven VP systems.

4. Discussion

This study analyzed 40 publications released between January 2023 and June 2025 to comprehensively review the technical implementations, data utilization, evaluation methodologies, and challenges of LLM-based VP simulations in medical and nursing education. Each approach demonstrated distinct strengths and limitations in terms of educational effectiveness and technical maturity.
In scenario generation research, single-prompt methods showed significant advantages in production speed and accessibility. However, frequent omission of clinical details, errors in medication dosage, and limited emotional expression necessitated expert review. Recently, advanced techniques such as iterative feedback, emotional context modeling, structured templates, and RAG have been introduced. These reduced scenario creation time to within minutes and improved scenario quality and alignment with educational objectives, though fact verification and bias mitigation strategies remain necessary.
Prompt-based VP systems enabled training in history taking, clinical communication, and case management without the need for dedicated datasets. Such accessibility was particularly valuable in settings with limited opportunities for clinical practice. Nonetheless, challenges included difficulty maintaining role consistency, generation of hallucinated information, excessive generalization, and the absence of nonverbal cues. To address these, multilingual support, domain-specific prompting, and integration with multimodal interfaces appear necessary.
Systems combining iterative feedback and automated scoring allowed learners to immediately recognize strengths and weaknesses following VP interactions by providing rubric-based scores and narrative feedback. Some studies demonstrated strong agreement with human raters (Cohen’s κ ≈ 0.83). Randomized controlled trials and quasi-experimental studies further showed significant short-term improvements in clinical reasoning, information gathering, and contextualization skills. However, issues such as overly positive feedback, ambiguity in evaluation criteria, and variability in reproducibility remain unresolved.
Realism- and adaptability-enhanced systems improved immersion and role consistency by integrating emotional and attitudinal modeling with multimodal elements such as voice, gaze, facial expressions, avatars, and MR. Implementations using social robots and MR demonstrated significant gains in authenticity and learning outcomes, while adaptive structures dynamically adjusted patient attitudes and responses based on learner utterances, thereby increasing conversational realism. Nonetheless, technical constraints such as speech recognition errors, response delays, and rigid turn-taking were reported, highlighting the need for further improvements in safety monitoring and fluency of expression.
Knowledge-driven and multi-agent hybrid systems combined KGs, RAG, and role-differentiated agents to enhance accuracy, consistency, and persona maintenance. Some systems achieved QA accuracies above 94% and demonstrated high concordance with expert evaluations. Others extended functionality to include medical image generation, offering immersive multimodal training environments. However, challenges included the costs of KG construction and maintenance, limited generalizability across institutions, and insufficient clinical and ethical validation of multimodal outputs.
Mental health- and counseling-oriented systems incorporated CBT, motivational interviewing (MI), and other theory- and scale-based designs to more accurately reproduce emotional expression, resistant behaviors, and multi-turn conversational context. By combining utterance-level behavior coding, session summarization, and principle-adherence pipelines, these systems improved learner self-efficacy and satisfaction, while also validating the reliability of automated assessment. Nevertheless, limitations remained in handling complex cases, accurately representing nuanced resistance and ambivalence, and incorporating nonverbal signals.
From the perspective of data utilization, most studies relied on self-constructed datasets such as researcher-designed scenarios and dialogue logs, which limited reproducibility and comparability. Public medical datasets were employed in only a few studies, and hybrid clinical datasets were rarely used. Some studies employed KGs or rule-based structured data to control LLM behavior, but standards for dataset construction and sharing remain underdeveloped. Future work should pursue reproducibility strategies such as multi-institutional and multilingual scenario repositories, publication of de-identified logs and rubrics, and version management of KG schemas.
In evaluation methodology, expert subjective assessments and test-based evaluations continued to dominate. However, emerging methods such as dynamic difficulty reasoning benchmarks, LLM-as-a-Judge, and evaluation ensembles introduced automation and multidimensionality, with the potential to enhance efficiency and consistency. Yet, issues such as position bias, verbosity bias, and reduced accuracy in logical reasoning assessments have been reported. Mitigation strategies such as evidence citation, multi-rater consensus, and order-shuffled evaluations are therefore necessary.
In summary, the reviewed studies on LLM-based VP simulations can be categorized into scenario generation, iterative feedback and automated scoring, multimodal and adaptive systems, knowledge-integrated multi-agent frameworks, and mental health–specific designs. These approaches collectively contributed to enhancing educational effectiveness while expanding access to training opportunities. However, unresolved challenges include ensuring clinical accuracy, mitigating bias, improving data governance, strengthening evaluation reliability, verifying long-term effectiveness, and conducting cost-effectiveness analyses. Addressing these challenges and developing standardized guidelines for design, implementation, and evaluation will be critical. Such measures can help position LLM-based VPs as a reliable educational and assessment infrastructure that complements SPs and mannequins in medical training.
As LLM-based VP systems advance toward broader deployment, the establishment of comprehensive ethical and governance frameworks becomes essential. These systems frequently process sensitive educational and clinical information, requiring strict adherence to international principles of data protection, transparency, fairness, and accountability. Responsible data management—including anonymization, informed consent, secure storage, and restricted access—should be integrated into all stages of development and evaluation. Automated feedback or scoring mechanisms must operate under human oversight to ensure pedagogical validity and prevent bias or overreliance on algorithmic decisions. In the global regulatory landscape, emerging frameworks such as the EU Artificial Intelligence Act, the WHO guidance on LLMs in health, and the NIST AI Risk Management Framework highlight the need for human supervision, explainability, and traceable audit systems. Incorporating these standards into simulation design and institutional policy will promote safe, transparent, and equitable use of LLM-based VPs across diverse educational contexts. Ultimately, aligning technological innovation with ethical and legal responsibility will be crucial to building sustainable trust and ensuring that these systems enhance—not compromise—the integrity of medical and nursing education.

5. Conclusions

LLM-based VP simulations have emerged as a transformative innovation in medical and nursing education. By complementing traditional SP and mannequin-based simulations, these systems enhance accessibility, scalability, and interactivity. This review analyzed forty studies published between 2023 and 2025 and identified six key approaches: scenario generation, prompt-driven VPs, automated feedback and scoring, realism- and adaptability-enhanced systems, knowledge-driven and multi-agent frameworks, and mental health-oriented applications. Together, these developments demonstrate the rapid technical and pedagogical evolution of LLM-based simulations.
Despite these advances, current evidence for educational effectiveness remains preliminary. Most studies are limited by small samples, short-term evaluations, and heterogeneous assessment methods, making reproducibility and comparison difficult. Future research should adopt controlled, multi-site designs using standardized performance metrics, such as OSCE scores and validated communication rubrics, while investigating long-term learning outcomes and cost-effectiveness. Enhancing multimodal realism—through speech, emotion, and adaptive interaction—will also be crucial for improving immersion and transferability to real clinical contexts.
Ethical and governance principles must accompany this progress. LLM-based VP systems should ensure transparency, fairness, and data protection through anonymization, human oversight, and accountable evaluation frameworks. With rigorous validation and responsible implementation, these systems can evolve from experimental tools into reliable educational infrastructures that enrich the quality, equity, and realism of healthcare training.

Author Contributions

Conceptualization, Y.-W.J., M.L. and H.-J.Y.; Data curation, Y.-W.J.; Writing—original draft preparation, Y.-W.J. and M.L.; Writing—review and editing, M.L. and H.-J.Y.; Project administration, H.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)—Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2022-00156287, 33%), the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development grant funded by the Korea government (MSIT) (IITP-2023-RS-2023-00256629, 34%), and the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2025-RS-2024-00437718, 33%) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). This study was also financially supported by Chonnam National University (Grant number: 2025-1047-01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, 2025 version, GPT-4 and GPT-5 models) to review the text for potential improvements in clarity and grammar. The authors have carefully reviewed and edited all output and take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Collins, J.C.; Chong, W.W.; de Almeida Neto, A.C.; Moles, R.J.; Schneider, C.R. The simulated patient method: Design and application in health services research. Res. Soc. Adm. Pharm. 2021, 17, 2108–2115.
  2. Davis, S. Patient-Drama: A Literature Review of Simulated Patient Experiences in Medical Education and Training; Springer: Berlin/Heidelberg, Germany, 2022.
  3. Mackenzie, C.F.; Harper, B.D.; Xiao, Y. Simulator limitations and their effects on decision-making. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Philadelphia, PA, USA, 2–6 September 1996; pp. 747–751.
  4. Kononowicz, A.A.; Woodham, L.A.; Edelbring, S.; Stathakarou, N.; Davies, D.; Saxena, N.; Car, L.T.; Carlstedt-Duke, J.; Car, J.; Zary, N. Virtual patient simulations in health professions education: Systematic review and meta-analysis by the digital health education collaboration. J. Med. Internet Res. 2019, 21, e14676.
  5. Huang, G.; Reynolds, R.; Candler, C. Virtual patient simulation at US and Canadian medical schools. Acad. Med. 2007, 82, 446–451.
  6. Botezatu, M.; Hult, H.; Fors, U.G. Virtual patient simulation: What do students make of it? A focus group study. BMC Med. Educ. 2010, 10, 91.
  7. Hege, I.; Kononowicz, A.A.; Berman, N.B.; Lenzer, B.; Kiesewetter, J. Advancing clinical reasoning in virtual patients–development and application of a conceptual framework. GMS J. Med. Educ. 2018, 35, Doc12.
  8. Botezatu, M.; Hult, H.; Tessma, M.K.; Fors, U. Virtual patient simulation: Knowledge gain or knowledge loss? Med. Teach. 2010, 32, 562–568.
  9. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  10. Eysenbach, G. The role of ChatGPT, generative language models, and artificial intelligence in medical education: A conversation with ChatGPT and a call for papers. JMIR Med. Educ. 2023, 9, e46885.
  11. Jung, S. Challenges for future directions for artificial intelligence integrated nursing simulation education. Korean J. Women Health Nurs. 2023, 29, 239–242.
  12. Maaz, S.; Palaganas, J.C.; Palaganas, G.; Bajwa, M. A guide to prompt design: Foundations and applications for healthcare simulationists. Front. Med. 2025, 11, 1504532.
  13. OpenAI. ChatGPT. Available online: https://chat.openai.com/ (accessed on 17 June 2025).
  14. DeepMind, G. Gemini. Available online: https://deepmind.google/gemini (accessed on 17 June 2025).
  15. Meta. LLaMA. Available online: https://www.llama.com/ (accessed on 17 June 2025).
  16. Anthropic. Claude. Available online: https://www.anthropic.com/claude (accessed on 17 June 2025).
  17. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
  18. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
  19. Kang, K.; Yu, M. Rapid cycle deliberate practice simulation with standardized prebriefing and video based formative feedback in advanced cardiac life support. Sci. Rep. 2025, 15, 16150.
  20. Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110.
  21. Holderried, F.; Stegemann-Philipps, C.; Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, M.; Eickhoff, C.; Mahling, M. A language model–powered simulated patient with automated feedback for history taking: Prospective study. JMIR Med. Educ. 2024, 10, e59213.
  22. Zhang, S.; Dinan, E.; Urbanek, J.; Szlam, A.; Kiela, D.; Weston, J. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv 2018, arXiv:1801.07243.
  23. Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887.
  24. García-Torres, D.; Vicente Ripoll, M.A.; Fernández Peris, C.; Mira Solves, J.J. Enhancing clinical reasoning with virtual patients: A hybrid systematic review combining human reviewers and ChatGPT. Healthcare 2024, 12, 2241.
  25. Vrdoljak, J.; Boban, Z.; Vilović, M.; Kumrić, M.; Božić, J. A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 2025, 13, 603.
  26. Vaughn, J.; Ford, S.H.; Scott, M.; Jones, C.; Lewinski, A. Enhancing healthcare education: Leveraging ChatGPT for innovative simulation scenarios. Clin. Simul. Nurs. 2024, 87, 101487. [Google Scholar] [CrossRef]
  27. Ghaffari, F.; Langarizadeh, M.; Nabovati, E.; Sabery, M. Effectiveness of ChatGPT for Clinical Scenario Generation: A Qualitative Study. Arch. Acad. Emerg. Med. 2025, 13, e49. [Google Scholar] [PubMed]
  28. Violato, E.; Corbett, C.; Rose, B.; Rauschning, B.; Witschen, B. The effectiveness and efficiency of using ChatGPT for writing health care simulations. Int. J. Healthc. Simul. 2023, 10, 54531. [Google Scholar] [CrossRef]
  29. Tian, Q.; Ren, F.; Zou, B.; Zhou, J.; Liu, G.; Zheng, Y.; Zhang, Z.; Wang, S. Iteratively refined ChatGPT outperforms clinical mentors in generating high-quality interprofessional education clinical scenarios: A comparative study. BMC Med. Educ. 2024, 25, 845. [Google Scholar]
  30. Gray, M.; Baird, A.; Sawyer, T.; James, J.; DeBroux, T.; Bartlett, M.; Krick, J.; Umoren, R. Increasing realism and variety of virtual patient dialogues for prenatal counseling education through a novel application of ChatGPT: Exploratory observational study. JMIR Med. Educ. 2024, 10, e50705. [Google Scholar] [CrossRef] [PubMed]
  31. Ananthanarayanan, A. Generating Medical Diagnostic Scenarios with LLM-Based Reinforcement Learning Feedback: Dataset Release and Methodology. In Proceedings of the IEEE Integrated STEM Education Conference, Princeton, NJ, USA, 15 March 2025. [Google Scholar]
  32. Sumpter, S. Automated Generation of High-Quality Medical Simulation Scenarios Through Integration of Semi-Structured Data and Large Language Models. arXiv 2024, arXiv:2404.19713. [Google Scholar]
  33. Barra, F.L.; Rodella, G.; Costa, A.; Scalogna, A.; Carenzo, L.; Monzani, A.; Corte, F.D. From prompt to platform: An agentic AI workflow for healthcare simulation scenario design. Adv. Simul. 2025, 10, 29. [Google Scholar] [CrossRef]
  34. Öncü, S.; Torun, F.; Ülkü, H.H. AI-powered standardised patients: Evaluating ChatGPT-4o’s impact on clinical case management in intern physicians. BMC Med. Educ. 2025, 25, 278. [Google Scholar] [CrossRef] [PubMed]
  35. Benfatah, M.; Marfak, A.; Saad, E.; Hilali, A.; Nejjari, C.; Youlyouz-Marfak, I. Assessing the efficacy of ChatGPT as a virtual patient in nursing simulation training: A study on nursing students’ experience. Teach. Learn. Nurs. 2024, 19, e486–e493. [Google Scholar] [CrossRef]
36. Holderried, F.; Stegemann-Philipps, C.; Herschbach, L.; Moldt, J.-A.; Nevins, A.; Griewatz, J.; Holderried, M.; Herrmann-Werner, A.; Festl-Wietek, T.; Mahling, M. A generative pretrained transformer (GPT)–powered chatbot as a simulated patient to practice history taking: Prospective, mixed methods study. JMIR Med. Educ. 2024, 10, e53961. [Google Scholar] [CrossRef]
  37. Reichenpfader, D.; Denecke, K. Simulating diverse patient populations using patient vignettes and large language models. In Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health)@ LREC-COLING 2024, Torino, Italy, 20 May 2024; pp. 20–25. [Google Scholar]
  38. Yi, Y.; Kim, K.-J. The feasibility of using generative artificial intelligence for history taking in virtual patients. BMC Res. Notes 2025, 18, 80. [Google Scholar] [CrossRef]
  39. Aster, A.; Ragaller, S.V.; Raupach, T.; Marx, A. ChatGPT as a Virtual patient: Written empathic expressions during medical history taking. Med. Sci. Educ. 2025, 35, 1513–1522. [Google Scholar] [CrossRef]
  40. Lower, K.; Seth, I.; Lim, B.; Seth, N. ChatGPT-4: Transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J. Orthop. 2023, 57, 1527–1544. [Google Scholar] [CrossRef]
  41. Cross, J.; Kayalackakom, T.; Robinson, R.E.; Vaughans, A.; Sebastian, R.; Hood, R.; Lewis, C.; Devaraju, S.; Honnavar, P.; Naik, S. Assessing ChatGPT’s Capability as a New Age Standardized Patient: Qualitative Study. JMIR Med. Educ. 2025, 11, e63353. [Google Scholar] [CrossRef]
  42. Scherr, R.; Halaseh, F.F.; Spina, A.; Andalib, S.; Rivera, R. ChatGPT interactive medical simulations for early clinical education: Case study. JMIR Med. Educ. 2023, 9, e49877. [Google Scholar] [CrossRef] [PubMed]
  43. Wang, C.; Li, S.; Lin, N.; Zhang, X.; Han, Y.; Wang, X.; Liu, D.; Tan, X.; Pu, D.; Li, K. Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment. J. Med. Internet Res. 2025, 27, e59435. [Google Scholar] [CrossRef]
  44. Brügge, E.; Ricchizzi, S.; Arenbeck, M.; Keller, M.N.; Schur, L.; Stummer, W.; Holling, M.; Lu, M.H.; Darici, D. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: A randomized controlled trial. BMC Med. Educ. 2024, 24, 1391. [Google Scholar] [CrossRef]
  45. Haut, K.; Hasan, M.; Carroll, T.; Epstein, R.; Sen, T.; Hoque, E. AI Standardized Patient Improves Human Conversations in Advanced Cancer Care. arXiv 2025, arXiv:2505.02694. [Google Scholar] [CrossRef]
  46. Yamamoto, A.; Koda, M.; Ogawa, H.; Miyoshi, T.; Maeda, Y.; Otsuka, F.; Ino, H. Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial. JMIR Med. Educ. 2024, 10, e58753. [Google Scholar] [CrossRef]
  47. Hicke, Y.; Geathers, J.; Rajashekar, N.; Chan, C.; Jack, A.G.; Sewell, J.; Preston, M.; Cornes, S.; Shung, D.; Kizilcec, R. MedSimAI: Simulation and formative feedback generation to enhance deliberate practice in medical education. arXiv 2025, arXiv:2503.05793. [Google Scholar]
  48. Chiu, J.; Castro, B.; Ballard, I.; Nelson, K.; Zarutskie, P.; Olaiya, O.K.; Song, D.; Zhao, Y. Exploration of the Role of ChatGPT in Teaching Communication Skills for Medical Students: A Pilot Study. Med. Sci. Educ. 2025, 35, 1871–1882. [Google Scholar] [CrossRef] [PubMed]
  49. Cook, D.A.; Overgaard, J.; Pankratz, V.S.; Del Fiol, G.; Aakre, C.A. Virtual patients using large language models: Scalable, contextualized simulation of clinician-patient dialogue with feedback. J. Med. Internet Res. 2025, 27, e68486. [Google Scholar] [CrossRef] [PubMed]
  50. Bodonhelyi, A.; Stegemann-Philipps, C.; Sonanini, A.; Herschbach, L.; Szép, M.; Herrmann-Werner, A.; Festl-Wietek, T.; Kasneci, E.; Holderried, F. Modeling Challenging Patient Interactions: LLMs for Medical Communication Training. arXiv 2025, arXiv:2503.22250. [Google Scholar]
  51. Chen, S.; Wu, M.; Zhu, K.Q.; Lan, K.; Zhang, Z.; Cui, L. LLM-empowered chatbots for psychiatrist and patient simulation: Application and evaluation. arXiv 2023, arXiv:2305.13614. [Google Scholar] [CrossRef]
  52. Borg, A.; Georg, C.; Jobs, B.; Huss, V.; Waldenlind, K.; Ruiz, M.; Edelbring, S.; Skantze, G.; Parodis, I. Virtual patient simulations using social robotics combined with large language models for clinical reasoning training in medical education: Mixed methods study. J. Med. Internet Res. 2025, 27, e63312. [Google Scholar] [CrossRef]
  53. Gutiérrez Maquilón, R.; Uhl, J.; Schrom-Feiertag, H.; Tscheligi, M. Integrating GPT-Based AI into Virtual Patients to Facilitate Communication Training Among Medical First Responders: Usability Study of Mixed Reality Simulation. JMIR Form. Res. 2024, 8, e58623. [Google Scholar] [CrossRef] [PubMed]
  54. Sardesai, N.; Russo, P.; Martin, J.; Sardesai, A. Utilizing generative conversational artificial intelligence to create simulated patient encounters: A pilot study for anaesthesia training. Postgrad. Med. J. 2024, 100, 237–241. [Google Scholar] [CrossRef]
  55. Lee, K.; Lee, S.; Kim, E.H.; Ko, Y.; Eun, J.; Kim, D.; Cho, H.; Zhu, H.; Kraut, R.E.; Suh, E. Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees’ Dialogue to Facilitate Nurse Communication Training. arXiv 2025, arXiv:2506.00386. [Google Scholar]
  56. Du, Z.; Zheng, L.; Hu, R.; Xu, Y.; Li, X.; Sun, Y.; Chen, W.; Wu, J.; Cai, H.; Ying, H. LLMs Can Simulate Standardized Patients via Agent Coevolution. arXiv 2024, arXiv:2412.11716. [Google Scholar] [CrossRef]
  57. Yu, H.; Zhou, J.; Li, L.; Chen, S.; Gallifant, J.; Shi, A.; Li, X.; Hua, W.; Jin, M.; Chen, G. AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow. arXiv 2024, arXiv:2409.18924. [Google Scholar] [CrossRef]
  58. Li, Y.; Zeng, C.; Zhang, J.; Zhou, J.; Zou, L. MedDiT: A Knowledge-Controlled Diffusion Transformer Framework for Dynamic Medical Image Generation in Virtual Simulated Patient. arXiv 2024, arXiv:2408.12236. [Google Scholar]
  59. Li, Y.; Zeng, C.; Zhong, J.; Zhang, R.; Zhang, M.; Zou, L. Leveraging large language model as simulated patients for clinical education. arXiv 2024, arXiv:2404.13066. [Google Scholar] [CrossRef]
  60. Wang, R.; Milani, S.; Chiu, J.C.; Zhi, J.; Eack, S.M.; Labrum, T.; Murphy, S.M.; Jones, N.; Hardy, K.; Shen, H. Patient-Ψ: Using large language models to simulate patients for training mental health professionals. arXiv 2024, arXiv:2405.19660. [Google Scholar] [CrossRef]
  61. Steenstra, I.; Nouraei, F.; Bickmore, T. Scaffolding empathy: Training counselors with simulated patients and utterance-level performance visualizations. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–22. [Google Scholar]
  62. Louie, R.; Nandi, A.; Fang, W.; Chang, C.; Brunskill, E.; Yang, D. Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles. arXiv 2024, arXiv:2407.00870. [Google Scholar]
  63. Lee, J.; Lim, K.; Jung, Y.-C.; Kim, B.-H. PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents. arXiv 2025, arXiv:2501.01594. [Google Scholar] [CrossRef]
  64. Wang, J.; Xiao, Y.; Li, Y.; Song, C.; Xu, C.; Tan, C.; Li, W. Towards a client-centered assessment of llm therapists by client simulation. arXiv 2024, arXiv:2406.12266. [Google Scholar] [CrossRef]
  65. Saeed, M.; Villarroel, M.; Reisner, A.T.; Clifford, G.; Lehman, L.-W.; Moody, G.; Heldt, T.; Kyaw, T.H.; Moody, B.; Mark, R.G. Multiparameter Intelligent Monitoring in Intensive Care II: A public-access intensive care unit database. Crit. Care Med. 2011, 39, 952–960. [Google Scholar] [CrossRef]
  66. Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.-w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035. [Google Scholar] [CrossRef]
  67. Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2015, 23, 304–310. [Google Scholar] [CrossRef] [PubMed]
  68. Pérez-Rosas, V.; Sun, X.; Li, C.; Wang, Y.; Resnicow, K.; Mihalcea, R. Analyzing the quality of counseling conversations: The tell-tale signs of high-quality counseling. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  69. Wu, Z.; Balloccu, S.; Kumar, V.; Helaoui, R.; Reiter, E.; Recupero, D.R.; Riboni, D. Anno-mi: A dataset of expert-annotated counselling dialogues. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6177–6181. [Google Scholar]
  70. MTSamples. MTSamples. Available online: https://mtsamples.com (accessed on 5 June 2025).
  71. Lee, J.; Park, S.; Shin, J.; Cho, B. Analyzing evaluation methods for large language models in the medical field: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 366. [Google Scholar] [CrossRef]
  72. Fan, L.; Hua, W.; Li, L.; Ling, H.; Zhang, Y. Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv 2023, arXiv:2312.14890. [Google Scholar]
  73. Fan, L.; Hua, W.; Li, X.; Zhu, K.; Jin, M.; Li, L.; Ling, H.; Chi, J.; Wang, J.; Ma, X. Nphardeval4v: A dynamic reasoning benchmark of multimodal large language models. arXiv 2024, arXiv:2403.01777. [Google Scholar]
  74. Gusev, I. PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation. arXiv 2024, arXiv:2409.06820. [Google Scholar]
  75. Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
Figure 1. PRISMA flow diagram of the study selection process.
Figure 2. Distribution of reviewed articles by research area.
Table 1. Comparative analysis of studies on LLM-based scenario generation.
LLM | Reference | Key Research Achievements | Limitations
ChatGPT | [26] | <15 s/scenario; realism acceptable for HPI & unfolding cases; frequent omissions 65–88%, inaccuracies 18–41% | Missing PMH/profile/vitals common; SME review required
GPT-3.5 | [28] | Time ↓ 154.8 min/case (total −12.9 h/5 cases); non-expert version preferred by 4/23; strong structure/flow | Expert quality > non-expert; gaps in technical accuracy/clinical detail
GPT-3.5 | [30] | Generated 176 responses: realistic 80%, educationally relevant 87%, usable (≤minor edits) 63%; weighted κ = 0.84 | 37% required edits; precision/detail limited; expert screening advised
GPT-3.5 | [32] | Semi-structured data + LLM pipeline; reported time/resource reduction; better consistency/reuse | No quantitative evaluation; potential misinterpretation of complex cases; SME validation required
GPT-3.5 Turbo | [31] | Accuracy 9.59 → 10, detail 5.59 → 5.78 (0–10 scale) with RAG + critic; included more cases featuring women and people of color for diversity | Small/preliminary; depends on RAG/critic quality; external validation pending
GPT-4 | [27] | ≈5 s generation; structured, realistic, clear objectives (expert panel) | Drug dose/logic errors; incomplete histories/tables; expert review required
GPT-4o | [29] | Time: mentors 118 ± 23 min → 9 ± 2 (iterative)/4 ± 2 (single); IQS ↑ challenge +0.63, engagement +0.39 (p < 0.01); blind attribution AI = human 16/16 (p = 0.61), i.e., AI scenarios matched or exceeded expert quality | Subjective ratings; no inter-rater reliability reported; expert review still needed
GPT-4o, Gemini 2.0, Claude-3.7 | [33] | Multi-agent workflow; ~4.5 min/case (≈50 runs); time ↓ 70–80%; INACSL/ASPiH-compliant; multilingual | Potential errors/biases; complex setup; expert oversight essential
Table 2. Comparative analysis of studies on simple prompt-based virtual patient systems.
LLM | Reference | Key Research Achievements | Limitations
ChatGPT | [35] | 5-pt ratings: accessibility 4.3 ± 0.5, engagement 4.3 ± 0.4, usefulness 4.2 ± 0.5; correlations with total (25-pt): clarity r = 0.701, useful info r = 0.597, relevance r = 0.444 (all p < 0.05) | Small sample (12 participants); limited to one scenario (dyspnea); low adaptability for some students
GPT-3.5 | [39] | Empathic interactions 93/659 ≈ 14%; autonomy score 38.2 ± 3.44/42 (freedom 6.8/7, task relevance 5.93/7) | Low empathy frequency; no non-verbal cues; no voice/visual input
GPT-3.5 | [42] | ACLS & ICU cases (pneumonia, sepsis) in open-response/state-change formats; very low cost, high accessibility, unlimited regeneration | No quantitative scoring; no automated grading/standardization; limited reproducibility/feedback consistency
GPT-3.5 Turbo | [36] | Generated 826 Q–A pairs; clinical validity 97.9%; in-script info 94.4%; out-of-script answers fabricated 56.4%; CUQ 77/100; Q–A length ρ = 0.29 (p < 0.01) | Out-of-script hallucinations (56.4%); role drift & calculation errors; single case/model
GPT-4 | [37] | Role-prompted vignette: compliance/coherence/correctness 100%; containment 64→45→9% (less context → worse); maintained realism and coherence within structured role-prompting setups | Context-reliant; over-inference/generalization; single vignette, non-clinical raters
GPT-4 | [40] | Orthopedic cases: Likert ratings (accuracy 4/5; complex questions 3/5; comprehensiveness 3/5; depth 2/5); consistent diagnostic reasoning & initial ED management | Limited specialized detail (e.g., urinalysis, nerve block); needs expert oversight
GPT-4 | [41] | Effective for repetitive practice, convenience, and anxiety reduction; enabled personalized feedback | Lacks non-verbal/visual cues; sensitive-topic limits; minor latency/language issues; small single-site sample
GPT-4o | [34] | Observed scores (6–10): problem-solving 8.4, clinical reasoning 8.3, case management 8.5; self-assessed (max 55): 41.6, 42.6, 36.1; high inter-domain correlations (r = 0.68–0.95, p < 0.001); no competence gap | Technical issues (language, delays); information handling under time pressure; single site, small n
Naver (Seongnam-si, Republic of Korea) HyperCLOVA X | [38] | Pilot (5 sessions): 96 Q–A pairs/1325 words; implausible 2.6% (inarticulate 1.7%, hallucinated 0.5%, missing 0.3%); expert ratings (1–5): relevance 4.50, accuracy 4.10, validity 4.20, conciseness 3.80, fluency 3.20, total 3.96; ICC 0.64–0.80 | Some inaccuracies and unrealistic expressions; limited fluency; refined prompting needed
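The systems in Table 2 share one pattern: a single role prompt that fixes the patient persona and constrains answers to the scripted case. The sketch below is a minimal, hypothetical prompt builder illustrating that pattern; the field names (name, age, chief_complaint, history) and the instruction wording are illustrative assumptions, not taken from any reviewed study.

```python
def build_vp_system_prompt(case: dict) -> str:
    """Assemble a role-play system prompt from a structured case vignette.

    The case fields used here are illustrative; each reviewed study
    defined its own prompt template.
    """
    return "\n".join([
        f"You are {case['name']}, a {case['age']}-year-old patient.",
        f"Chief complaint: {case['chief_complaint']}.",
        f"Relevant history: {case['history']}.",
        "Stay in character as the patient at all times; never reveal you are an AI.",
        "Answer only from the case details above; if asked about information "
        "not in the case, say you are not sure rather than inventing facts.",
        "Do not volunteer diagnoses or use medical jargon.",
    ])

# Invented demonstration case (not from a reviewed study).
demo_case = {
    "name": "Ms. Han",
    "age": 58,
    "chief_complaint": "shortness of breath on exertion for two weeks",
    "history": "hypertension, no smoking history",
}
prompt = build_vp_system_prompt(demo_case)
```

The explicit "answer only from the case" constraint targets the out-of-script fabrication problem reported in [36], where 56.4% of out-of-script answers were fabricated.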
Table 3. Comparative analysis of studies on iterative feedback and automated scoring systems.
LLM | Reference | Key Research Achievements | Limitations
GPT-3.5 | [44] | CRI-HTI ↑ (F(1,18) = 4.44, p = 0.049, η² = 0.198); inter-rater reliability ICC = 0.924 | No gain in focusing (p = 0.265); small, single-site RCT; some feedback lacked accuracy/specificity
GPT-3.5 | [48] | Confidence ↑ 3.00→4.17 (p = 0.002); trust ↑ (p = 0.001); SPIKES-based immediate feedback feasible | Faculty–AI scoring gap; limited non-verbal/emotional context (text-only); very small sample
GPT-3.5 Turbo | [45] | ICC = 0.882; 3E (Empower/Be Explicit/Empathize) skills ↑ (all p < 0.001, d = 1.07–1.61); feedback 4.68/5; avatar realism 3.40/5 | Some "uncanny" affect; limited gaze/speech naturalness; technical complexity
GPT-4 | [21] | 99.3% clinical plausibility; Cohen's κ = 0.832 vs. human raters | Some category-level κ < 0.6 (rubric overlap/ambiguity)
GPT-4 | [43] | Score Difference Percentage (SDP) 29.8%→6.1%; no language-group difference (p > 0.05) | Occasional over-information; no emotion/attitude scoring
GPT-4.0 Turbo | [46] | OSCE score ↑ 28.1 vs. 27.1 (p = 0.01); supports repeated training; accessible via web, smartphone, and LINE | Limited non-verbal training; repetitive/inaccurate feedback; short-term, non-randomized evaluation
GPT-4.0 Turbo | [49] | Validated authenticity/UX/feedback instruments; LLM–human rating alignment; cost ≈ US$0.51/dialogue; patient preferences reflected in 42–98% of cases | Reproducibility (ICC) slight–fair; possible self-evaluation bias; some verbosity/unnatural dialogue
GPT-4o | [47] | ~19.9 min, 38.7 turns/session; MIRS-based instant feedback; 53% found it useful; included SRL features (goal-setting, reflection, progress tracking) | Some formulaic/repetitive replies; limited advanced-skill training; low SRL engagement
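Several studies in Table 3 report chance-corrected agreement between LLM-generated and human scores (e.g., Cohen's κ = 0.832 in [21]; ICC = 0.924 in [44]). As a reminder of what such figures measure, here is a minimal pure-Python computation of Cohen's κ for two raters, written from the standard definition rather than from any reviewed study's code:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently according
    # to their own marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Perfect agreement gives kappa = 1.0; agreement at chance level gives 0.0.
assert cohens_kappa(["pass", "fail"] * 2, ["pass", "fail"] * 2) == 1.0
```

A κ around 0.8, as in [21], therefore indicates agreement well above what the raters' marginal score distributions would produce by chance; values below 0.6, as in some category-level results, are commonly read as only moderate agreement.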
Table 4. Comparative analysis of realism- and adaptability-enhanced virtual patient systems.
LLM | Reference | Key Research Achievements | Limitations
ChatGPT | [51] | Realism ↑ 1.93→2.21; wrong-symptom rate ↓ 18.4→15.1%; more human-like lexical style | Prompt drift; increased symptom inaccuracy; no non-verbal expression
ChatGPT | [54] | Intuitiveness 9/10, accuracy 8/10, comfort 87% | Overly polite/verbose; repetitive; hallucination risk; limited to 2D interface
GPT-3.5 Turbo | [52] | Robot > PC: authenticity 4.5 vs. 3.9 (p = 0.04); learning 4.4 vs. 4.1 (p = 0.01); greater immersion & emotion | ASR errors; timing interruptions; limited examination/clinical detail; robot setup cost
GPT-3.5 Turbo | [53] | Voice usability: MOS-X2 ≈ 4/10; SASSI 4.0–4.8/7 | ~3 s latency; ASR overlap disrupted turn-taking; limited prosody/natural flow
GPT-4 | [50] | Authenticity 3.8/5, style reproduction 3.7/5; sentiment: Accuser 3.1 vs. Rationalizer 4.0 (9-pt scale); sessions ≈ 15–17 min, ~15 turns | Style drift after 4–6 turns; repetitive pre-scripted replies; unnatural non-verbal attempts
Claude-3.5 | [55] | Dynamic > static: role fidelity F(1,25.4) = 4.52 (p = 0.043); realism F(1,24.7) = 8.42 (p = 0.008); κ > 0.75; Cronbach's α = 0.96–0.97; U = 160,960 (p = 0.001) | Text-only; some responses less fluent; Korean-only, small sample
Table 5. Comparative analysis of knowledge-driven and multi-agent hybrid virtual patient systems.
LLM | Reference | Key Research Achievements | Limitations
GPT-3.5 Turbo, Qwen 2.5-72B | [56] | Ability = 0.860 (relevance 0.759/faithfulness 0.879/robustness 0.941); cheat-question preference 91.3% (GPT-4)/86.1% (human); efficiency 6.69 s/401 tokens (≈−380 vs. CoT); requirement alignment > 10% vs. baseline | Simulation-only (~150 cases); injected cheat questions (5/10 turns); no real-patient validation; >10% gain not externally benchmarked
GPT-4 Turbo | [57] | QA accuracy = 94.15% (symptoms 91.2/history 87.1/social 85.6); NER F1 = 0.89 (P 0.95/R 0.84/TPR 84.2%); readability FRE = 68.8/FK 6.4; robust/stable (ns); κ = 0.92 | Sensitive to medical-history paraphrasing (F = 5.30, p = 0.006); heavy KG/RAG dependence (accuracy ↓ to 13.3% without); single-site KG (MIMIC-III)
Qwen2-72B, Diffusion Transformer | [58] | KG-controlled, symptom-consistent image generation; multi-agent framework (KG/chat/image) | No FID/SSIM/diagnostic metrics; limited Open-i dataset (3314 images); ethical/clinical validation pending
GPT-3.5 Turbo, PaLM, ERNIE-4, etc. | [59] | B-ELO: +250 (GPT-3.5) > +99 (ERNIE-4) > +68 (PaLM) > +51 (Qwen-72B) > +48 (Mixtral); human–AI correlation ρ = 0.81/r = 0.85 (p < 0.05); virtual-doctor score: ChatGPT 0.51 vs. human 0.78 | Role-flip risk (GPT-3.5); small scale (8 cases/80 dialogues); generalization (multilingual/real-world) unverified
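A unifying idea in Table 5 is to gate the virtual patient's answers through an external knowledge structure so that facts come from the record rather than from model parameters; in AIPatient [57], QA accuracy dropped to 13.3% without the knowledge graph. The toy sketch below illustrates only the gating step, with an invented two-fact graph; in the actual systems the retrieved facts are injected into the LLM prompt as context, which is the mechanism that suppresses fabricated answers.

```python
# Toy patient knowledge graph: (entity, relation) -> list of facts.
# All entries are invented for illustration, not drawn from MIMIC-III.
TOY_KG = {
    ("patient", "symptom"): ["chest pain", "shortness of breath"],
    ("patient", "medication"): ["aspirin 81 mg daily"],
}

def retrieve_facts(relation: str) -> str:
    """Look up facts for a relation; refuse instead of guessing when absent.

    A full system would prepend the result to the LLM prompt, e.g.
    "Answer using only these documented facts: ...".
    """
    facts = TOY_KG.get(("patient", relation))
    if not facts:
        return "not documented"
    return "; ".join(facts)

assert retrieve_facts("symptom") == "chest pain; shortness of breath"
assert retrieve_facts("allergy") == "not documented"
```

The "not documented" branch is the essential design choice: an ungated LLM will usually improvise an answer for a missing fact, whereas the lookup-then-generate pattern makes refusal the default.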
Table 6. Comparative analysis of mental health- and counseling-oriented systems.
LLM | Reference | Key Research Achievements | Limitations
GPT-3.5, GPT-4 | [62] | Expert-in-the-loop "principles" pipeline; better adherence/consistency (win 35%/loss 5–10% vs. baseline GPT-4); higher authenticity (+0.80, p < 0.01) & training readiness (+0.64, p < 0.05) | ~20% of responses missed multi-part principles; text-only dialogue; 11.4% required manual edits
GPT-4o | [61] | Real-time MI training with automatic MITI feedback and cognitive-state visualizations; significant self-efficacy gain (F = 15.56, p < 0.001); high usability (SUS = 88.1); reliable across modules (ICC ≥ 0.77) | No non-verbal cues; limited handling of resistance/ambivalence in complex dialogues
GPT-4o | [63] | Multi-faceted psychiatric simulation with construct-grounded evaluation (PSYCHE SCORE, r = 0.85 vs. experts, p < 0.0001); psychiatrist agreement 93%; high reliability (AC1 0.87, PABAK 0.86); moderate convergent validity (r = 0.64, p = 0.0025) | Under-expressed behaviors (e.g., thought process, insight) in specific disorders; low insight accuracy in OCD/PTSD (~50%); limited multimodal/non-verbal expressivity
GPT-4 | [60] | CBT-based patient simulation framework integrating cognitive models; expert-rated fidelity +1.3 (p < 10⁻⁴); trainee confidence +1.8 vs. traditional (p < 10⁻⁴); automatic evaluation accuracy 0.97 (Situation), macro-F1 0.80 (Core Beliefs) | Limited emotional-style diversity; text-only interaction; short-term subjective evaluation
Claude-3, GPT-3.5, LLaMA3-70B, Mixtral 8×7B | [64] | Client-centered framework (ClientCAST); objective outcome/alliance metrics (WAI-SR, SRS, CECS, SEQ); clear 213 high vs. 87 low session split; profile reproduction ≈ 70% similarity, F1 ≤ 0.85 | Limited emotional responsiveness; variable personality/affect consistency; overuse of positive tone
Table 7. Classification of dataset utilization types.
Category | Definition | No. of Studies | References
Self-constructed datasets | Scenarios, dialogues, and evaluation materials newly created by researchers using LLM prompts | 35 | [21,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,56,58,59,62,63,64,65]
Public medical datasets | Direct use and evaluation of open datasets such as MIMIC-III, Open-I, and High/Low Quality Counseling | 3 | [57,64]
Hybrid clinical datasets | Combined use of institutional EHR data and public datasets | 2 | [56,58]
KG/rule-based structured datasets | Use of patient KGs, CCDs, etc., to control LLM behavior and responses | 4 | [57,58,60,63]
Table 8. Major public medical datasets and their application cases in the reviewed studies.
Dataset | Data Type | Key Characteristics | Application Purpose
MIMIC-II [65] | EHR (ICU) | ~26,000 ICU admission records; CSV and DB formats | [56] (EvoPatient): combined with real hospital data to generate realistic VPs
MIMIC-III [66] | EHR (ICU) | 40,000+ ICU admission records; relational DB (CSV) format | [57] (AIPatient): converted patient data into KGs to improve LLM answer accuracy
Open-I Chest X-ray [67] | Medical imaging + reports | ~3300 chest X-ray images (DICOM format) with corresponding radiology reports | [58] (MedDiT): used for training a medical X-ray image generation model; LoRA fine-tuning
High/Low Quality Counseling [68] | Conversational text (counseling) | 300 counseling sessions (English dialogue data) | [62]: trained for automatic evaluation of counseling quality
AnnoMI [69] | Conversational text (counseling) | 42 motivational interviewing (MI) counseling sessions (English dialogue data) | [62]: trained to classify and evaluate counseling techniques at the sentence level
MTSamples [70] | Medical record documents | ~5000 clinical notes (text files) | [56] (EvoPatient): used to enrich terminology, enhancing the diversity of medical scenarios

Share and Cite

MDPI and ACS Style

Jo, Y.-W.; Lee, M.; Yang, H.-J. Large Language Model-Based Virtual Patient Simulations in Medical and Nursing Education: A Review. Appl. Sci. 2025, 15, 11917. https://doi.org/10.3390/app152211917
