International Medical Education
  • Article
  • Open Access

12 October 2025

Understanding the Role of Large Language Model Virtual Patients in Developing Communication and Clinical Skills in Undergraduate Medical Education

1 Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON L8S 4L8, Canada
2 Department of Family Medicine, Division of Palliative Care, McMaster University, Hamilton, ON L8S 4L8, Canada
3 Department of Medicine, McMaster University, Hamilton, ON L8S 4L8, Canada
4 McMaster Health Education Research, Innovation & Theory (MERIT), McMaster University, Hamilton, ON L8S 4L8, Canada
This article belongs to the Special Issue New Advancements in Medical Education

Abstract

Access to practice opportunities for history-taking in undergraduate medical education can be resource-limited. Large language models (LLMs) are a potential avenue to address this. This study sought to characterize changes in learner self-reported confidence with history-taking before and after a simulation with an LLM-based patient, and to understand learner experience with and the acceptability of LLM-based virtual patients. This was a multi-method study conducted at McMaster University. Simulations were facilitated with the OSCEai tool. Data were collected through surveys, comprising Likert-scale and open-ended questions, and through semi-structured interviews. A total of 24 participants generated 93 survey responses and 17 interviews. Overall, participants reported a 14.6% increase in comfort with history-taking. Strengths included the tool’s flexibility, accessibility, detailed feedback, and ability to provide a judgement-free space to practice. Limitations included lower fidelity compared to standardized patients and feedback that was at times repetitive and less clinically relevant than preceptors’. The tool was viewed best as a supplement to, rather than a replacement for, standardized patients. In conclusion, LLM-based virtual patients were feasible and valued as an adjunct tool, and can support scalable, personalized practice. Future work is needed to identify objective metrics of improvement and to design curricular strategies for integration.

1. Introduction

Patient interviews are a critical tool for clinicians. Over the course of their career, physicians conduct thousands of interviews, making this one of the most frequent and essential tasks in clinical practice. Strong interview skills significantly enhance a physician’s ability to capture a patient’s symptoms and are therefore essential for diagnosis. In fact, a thorough patient history alone can provide between 60 and 80 percent of the information required for diagnosis [1]. History-taking is a complex, multifaceted skill, requiring structured questioning, active listening, empathy, interpretation, and navigation of clinical decision-making. Given the cognitive load involved, early deliberate practice opportunities are critical. These allow medical students to focus on learning interview techniques before they are required to build clinical reasoning or diagnostic skills [1,2,3], and facilitate the development of basic skills with expanding complexity as students progress through the curriculum [4].
Current training methods often focus on structured, skills-focused training through role play [1]. This is facilitated by interactions with standardized patients (SPs), who are actors trained to answer questions about a predetermined medical case and diagnosis in order to simulate real patient interaction [2]. These training methods allow for structured practice with a high degree of fidelity; they closely replicate real patient interactions, allowing for learning objectives to be achieved in an environment that mimics the clinical environment [5]. However, simulations with standardized patients are often resource-intensive, and therefore, the ability of medical students to practice these skills is often limited. As a result of these limited simulation opportunities, the breadth of topics covered elsewhere in the curriculum cannot be fully captured in simulated cases. Additionally, the fidelity of standardized patients can occasionally introduce an increase in cognitive load [6]; there is evidence to suggest that cognitive load increases with greater realism in the environment of a simulation [7].
Generative artificial intelligence (AI) is rapidly gaining popularity in medical education and, therefore, offers a promising avenue to address these challenges. Generative AI, particularly large language models (LLMs) like the generative pretrained transformer (GPT), can generate human-like conversations, and therefore may be an avenue by which medical students are able to simulate patient interactions and practice their history-taking skills [8]. Recent feasibility studies have shown promise regarding the use of LLMs as virtual patients, including with respect to simulating patient-clinical dialogues by generating plausible answers, representing patient preferences, and generating structured feedback [9,10,11].
LLM-based virtual patients have many advantages. From a technology adoption lens, LLM-based virtual patients are scalable and on-demand, allowing asynchronous, self-paced access to clinical scenarios at relatively low cost [10,12,13]. Furthermore, they support early communication skills training in a low-stakes environment where experimentation is possible without risk of harm to real patients while also facilitating learner exposure to a diversity of cases [12]. In addition, from a human factors lens, they can reduce learner anxiety and support repeated, flexible practice [14]. Finally, they align with adult learning principles by enabling active, experiential, and contextual learning; this facilitates authentic engagement, promotes self-directed learning and adaptive thinking, and ultimately mirrors the clinical encounter [15,16].
However, there are potential drawbacks to LLMs. For example, they can display limitations in fidelity, including atypical vocabulary, excessive agreeableness, and limited emotional nuance [10,17]. Additionally, there may be variability in LLM performance, factual inaccuracies, and feedback that does not address important weaknesses [10,17]. Most significantly, over-reliance on virtual patients, who may not have real-world patient emotions, cultural diversity, and human responses, may leave learners at risk for underdeveloping the nuanced interpersonal skills required to navigate real human interaction [18,19].
Much of the existing literature focuses on evaluating the technical accuracy of LLMs [20]. There is comparatively little evidence regarding how medical students themselves experience these tools in practice. While prior reviews and surveys have explored students’ general perceptions [14,20,21], to the best of our knowledge, no studies have directly evaluated the impact of LLMs on learners’ self-reported confidence as a primary outcome. In addition, there is currently limited insight regarding the perceived pedagogical value of the integration of such simulations into medical curricula, including student acceptability and preferred contexts for use. With rising interest in the potential application of generative AI in medical education, there is a need to understand medical students’ perceptions of its utility. There are three key objectives of this study: (1) to quantify changes in students’ self-reported confidence with communication and history taking before and after a simulated interview with an LLM; (2) to identify medical students’ perceived strengths and limitations of using generative AI-driven simulations in medical education, particularly LLM-based virtual patients; and (3) to understand the acceptability and preferred contexts for use of AI-generated patient interviews. We will then use the information collected to inform the development of strategies for curricular integration.

2. Materials and Methods

2.1. Context

This study was conducted at the Michael G. DeGroote School of Medicine of McMaster University, where medical students are introduced to medical interviewing with SPs once a week in the first month of medical school. However, medical students’ abilities to reliably practice these skills are resource-limited and largely restricted to opportunities presented by the formal curriculum.

2.2. Study Design & Tool

This was a multi-method study that presented generative AI-simulated patients to undergraduate medical students. Participants were provided patient cases that they had previously encountered in McMaster’s problem-based learning curriculum, which was developed by medical experts and tailored to the knowledge level of first-year medical students. These cases were simulated using OSCEai [22,23,24,25], a tool that uses generative AI to simulate a patient history. Participants carried out an untimed conversational interview, during which they could speak or type to the simulated patient, with OSCEai providing both an auditory and a text response. Participants had the option to solicit basic physical examination findings and laboratory values, should they choose to do so. Participants were not required to develop a formal synthesis or assessment of their findings as part of the study, and were able to immediately receive OSCEai-generated feedback on their performance. Feedback from this tool is generated using the Calgary-Cambridge guide for medical interviews, a research-validated rubric [22]. Post-simulation, both quantitative and qualitative feedback were solicited from medical students to understand their experiences with and perceptions of generative AI as a history-taking tool.

2.3. Study Participants

All participants were medical students at the Michael G. DeGroote School of Medicine at McMaster University. Participants were chosen via convenience and voluntary response sampling, through a cohort-wide email and social media post about the study. Convenience and voluntary response sampling were chosen as they best aligned with the aim of this study, which was to gather early insights from diverse students across the cohort rather than focusing on a predetermined, narrowly defined subgroup. Furthermore, given that this research was exploratory, convenience sampling provided a practical, timely, and cost-effective way to collect data. This enabled a variety of perspectives to be gathered. Participants were able to join the study even after the data collection process had started, provided that they were able to complete at least one simulation before the conclusion of the study. All study participants signed a detailed consent form and had the opportunity to discuss any questions or concerns with the study team.

2.4. Data Collection

Data were collected in two phases. In Phase 1, simulations were emailed to participants to complete over the course of a 6-month period from October 2024 to March 2025. Simulations were based on recent cases encountered in McMaster University’s problem-based learning curriculum, and topics therefore corresponded to elements of the curriculum with which students were currently engaged. Students were not otherwise given restrictions regarding when to complete these simulations, although they were encouraged to do so within a one-week time frame to ensure relevance to existing curricular content. There were no restrictions on the time given to complete a simulation. Immediately after a simulation, participants completed a post-simulation survey, which included both Likert-scale and open-ended questions to gather information about student experience and was approved by a panel of medical educators well-versed in AI (Appendix A).
Phase 2 occurred between March 2025 and May 2025. Students who had completed at least one simulation in Phase 1 were asked to participate in semi-structured interviews, lasting approximately 30 min each. A total of 17 interviews were conducted by two researchers (US and ML). Recognizing that interviews were conducted by near-peers, steps were taken to proactively address any resulting bias; these included acknowledgment of the near-peer relationship and close adherence to the pre-existing interview guide (Appendix B). All data were anonymized and de-identified before being shared with the analytic team. The purpose of these interviews was four-fold: (1) to seek further insight regarding participants’ experiences with generative AI in the context of history-taking, including whether it impacted their confidence; (2) to identify the perceived strengths and limitations of generative AI in this context; (3) to understand participants’ experiences with the feedback provided by generative AI; and (4) to discuss participant perspectives on the integration of such a tool within the formal curriculum. The interview guide (Appendix B) was revised iteratively. Interviews were audio recorded and transcribed via Zoom’s inbuilt transcription software, and were de-identified and edited for grammatical clarity by US or ML. Data collection was considered complete when newly collected data were redundant with previously collected data [26,27,28] and the analytic team agreed that the analysis had sufficient conceptual depth. All participants in the second phase were provided a $20 honorarium.

2.5. Analysis

Data analysis occurred in a two-stage process. First, the quantitative survey data from Phase 1 of the study were assessed to determine changes in participant comfort with history-taking before and after using the tool, as well as other relevant parameters, including participant perceptions of usability and feedback. The survey data was analyzed using a repeated measures ANOVA to assess significant changes in participant comfort with history-taking before and after using the tool. All open-response survey data were assigned one or more codes according to the main idea discussed in the participant’s answer; the codes were then numerically synthesized to generate a summary.
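As an illustration of the pre/post comfort comparison described above, the sketch below computes the percent change in mean comfort and a simple paired comparison on synthetic 5-point Likert ratings. The data and the use of SciPy’s `ttest_rel` are illustrative assumptions, not the study’s actual data or analysis code (the study also reports a repeated measures ANOVA, which this sketch does not reproduce).

```python
import numpy as np
from scipy import stats

# Hypothetical paired 5-point Likert comfort ratings, one pre/post pair per
# survey response. These values are synthetic placeholders, not study data.
pre = np.array([3, 2, 4, 3, 3, 2, 4, 3, 3, 2])
post = np.array([4, 3, 4, 4, 3, 3, 4, 4, 3, 3])

# Paired comparison: each respondent serves as their own control.
t_stat, p_value = stats.ttest_rel(post, pre)

# Percent change in mean comfort, analogous to the figure reported in Results.
pct_change = (post.mean() - pre.mean()) / pre.mean() * 100

print(f"mean pre = {pre.mean():.2f}, mean post = {post.mean():.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"percent change in mean comfort = {pct_change:.1f}%")
```

The paired design matters here: because each response contributes its own pre/post pair, between-participant variability is removed from the comparison.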
Second, the interview transcripts from Phase 2 were qualitatively analyzed. This was informed by a direct content analysis approach, sensitized by the theoretical perspective of simulation learning [29,30,31]. This theoretical perspective, wherein simulation provides a controlled environment for learners to engage in practice, reflection, and feedback, is grounded in several key educational theories, including experiential learning theory, situated learning, and deliberate practice. Simulation learning aligns with Kolb’s experiential learning cycle [32], which describes a cyclical, four-stage process of concrete experience, reflective observation, abstract conceptualization, and active experimentation. Transcripts were reviewed in a staged manner with initial qualitative description followed by coding conducted by two team members (US and ML) using qualitative analysis software (Dedoose 10.0.35). The analysis process was further supported by bi-weekly meetings with the analytic team to review transcripts, codes, and emerging key themes, as well as to iterate on the interview guide as required. Results from the quantitative analysis were triangulated with emergent themes from the qualitative analysis, with the aim of offering deeper insights into the overall impact of the tool. Data saturation was considered to be achieved when no new concepts arose in at least two consecutive interviews.

2.6. Analytic Team

The analytic team was composed of individuals from diverse backgrounds, including educators, medical trainees, clinical skills preceptors, the clinical skills chair at the undergraduate medical education program, and an SP trainer. Analysis was informed by a constructivist perspective and supported by frequent reflexive conversations, which included critical reflection on researchers’ preconceived notions and potential biases.

2.7. Ethical Approval

This study was approved by the Hamilton Integrated Research Ethics Board (Project 16977) on 11 July 2024, before the recruitment process commenced.

3. Results

3.1. Survey Results

Twenty-four unique participants completed the eight available simulations to varying degrees, yielding a total of 93 survey responses. Not all participants completed all simulations. A breakdown of the number of participants who completed each simulation is provided in Appendix C.
Table 1 presents the summarized numerical responses, means, and standard deviations for each question in the Likert-scale survey, along with a paired t-test statistic comparing comfort with a topic before and after interacting with OSCEai. Overall, participants reported an average 14.6% increase in comfort with a topic after one simulated history and reported favourable views of OSCEai for improving their understanding of history-taking and communication skills. Participants also found the tool easy to use and accessible. Participants found the tool’s feedback comparable to that of standardized patients and preceptors, although ratings on these items were less strongly positive than on the other survey measures.
Table 1. Quantitative Responses to Survey Questions.
Each open-ended survey response was assigned one or more codes according to the key idea(s) reflected in the response. These data are found in Table 2. The majority of participants found the tool useful overall (n = 82), particularly for its ability to provide an avenue for practice (n = 38). However, some participants felt that its usefulness was hindered by relatively low fidelity compared to standardized patients (n = 13). With respect to challenges faced while using the tool, the majority of participants reported none (n = 54). Reported challenges included a learning curve or difficulties with the user interface (n = 14), low fidelity in history-taking (n = 13) and physical examination (n = 3), technical glitches (n = 7), and hallucinations (n = 2). Participants had mixed perceptions when comparing the tool’s feedback to human (standardized patient and preceptor) feedback. Some felt that the feedback was comparable (n = 31), and others felt that it was more thorough and/or personalized (n = 28). Conversely, others felt it was less valuable than preceptor feedback (n = 23), too broad or redundant (n = 5), based on different assessment criteria than preceptors’ (n = 5), less clinically relevant than preceptor feedback (n = 2), or missing preceptor insight (n = 2), or reported that it contained hallucinations (n = 1).
Table 2. Coded, Open-Ended Responses to Survey Questions.
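The numerical synthesis of open-ended codes described above amounts to tallying code frequencies across responses, with each response able to carry several codes. A minimal sketch of this tally, using hypothetical code labels rather than the study’s actual codebook:

```python
from collections import Counter

# Hypothetical coded open-ended responses: each response carries one or more
# codes. The labels below are invented for illustration, not the study's codes.
coded_responses = [
    ["useful_overall", "practice_avenue"],
    ["useful_overall", "low_fidelity"],
    ["useful_overall"],
    ["low_fidelity", "technical_glitch"],
]

# Count how many responses received each code (a response counts once per code).
tally = Counter(code for codes in coded_responses for code in codes)

for code, n in tally.most_common():
    print(f"{code}: n = {n}")
```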

3.2. Interview Analysis

Seventeen semi-structured interviews were conducted and subsequently underwent qualitative analysis. Three major themes were derived from these data: (1) usability; (2) feedback; and (3) curricular integration.
  • Theme 1: Usability (Fidelity, Accessibility, and Flexibility)
Fidelity. Participants reported both benefits and limitations with respect to fidelity. Many participants highlighted that simulations with OSCEai felt more realistic than practicing with peers. Others felt that its responses felt similar to real patients with respect to clarity and comprehensiveness.
“I’d say typically when I’m practicing a history… by myself it’s not as realistic. Like I’m expecting the responses that I’m going to give myself, if that makes sense. So being able to use generative AI in this setting… was a lot more realistic.”
(PAR 7)
“I was really impressed with sort of how realistic it was and how it provided me a chance to sort of practice gathering information and asking questions to figure out the whole picture.”
(PAR 4)
“I think [OSCEai is] more clinically realistic.”
(PAR 8)
Conversely, other participants felt that at times it was unrealistic, hallucinated, or was too forthcoming with information. Most notably, participants highlighted that SPs provided human-like clinical nuance, tone, and body language and that it was more difficult to establish a rapport or emotional connection with generative AI. Furthermore, the occasional lag or technical challenge limited fidelity.
“I think standardized patients definitely offer a more human perspective. I can’t prove that, obviously, but in terms of, like, empathy and… understanding tone.”
(PAR 14)
“The main difference is something that’s just inherent to the fact that it’s not a real person… body language, tone, specific kind of ways of phrasing a question. It was a little bit less nuanced, because obviously the AI is not able to…see how you…react to an answer… or the way that you’re angling yourself towards the patient.”
(PAR 12)
Accessibility & Freedom of Use. Participants noted that they were able to use the tool anytime, at their convenience, including outside of scheduled sessions. In addition, the tool offered a low-stakes, safe space to experiment and make mistakes, with no associated human judgement or repercussions.
“Clinical skills happens once a week, right? And you have like eight students. So it’s really helpful to kind of have like an unlimited person, like unlimited SP to take histories on.”
(PAR 15)
“It allowed me to practice outside of hours where I couldn’t necessarily practice with a friend or family member.”
(PAR 7)
“In clinical skills, sometimes you do it in front of your entire tutorial group, which is intimidating, or you do it one-on-one with your preceptor or your preceptor watching, which is also intimidating. And then, or even just doing it with a person who’s actively perceiving you… sometimes you’re very conscious about your hand movements and stuff like that. But I felt like with the GPT, I could be more… experimental. I could pause. I could take my time. I could think through the history just because I knew I wasn’t really inconveniencing anybody else and there wasn’t as significant of a time constraint.”
(PAR 3)
Flexibility. Participants highlighted that the tool was self-paced: they were able to control the speed of their interaction and to extend the conversation for as long as they required. They noted that they could reinforce learning by repeating cases. Participants also found it helpful to be able to select the difficulty level of the cases and tailor their practice to their own personal goals for improvement.
“If someone’s having a difficult time with, like, a specific history, they could go back and… refine it a bit.”
(PAR 13)
“Frankly, I like that you could do the same case over and over. So I did it twice, actually. So it was nice for me, I think, to have that iterative experience with the patient history.”
(PAR 3)
  • Theme 2: Feedback
Many participants described AI feedback as thorough, noting that it was able to provide comments on every aspect of the history. Some participants felt this comprehensiveness was beneficial, while others found the level of detail overwhelming and felt that some of the feedback was not clinically relevant. Participants felt that AI feedback was structured and provided concrete suggestions for improvement, including highlighting missed opportunities for questioning. However, participants highlighted that the feedback started to feel repetitive across cases.
“I thought it was pretty similar to my clinical skills tutors or maybe the SPs saying you missed this part of the condition or like this line of thought that you should have explored…. I will say I felt like there was a whole section on like empathy and spending time to build rapport with the patient, that kind of soft skill category, that I was getting similar [feedback on] every single time.”
(PAR 1)
“Well, I think with generative AI in general, sometimes it’s almost like too detailed where… it’ll be nitpicky almost with its feedback. That’s like a double-edged sword, [which you can maybe consider a limitation], but it’s also like sometimes nice because you miss certain things.”
(PAR 2)
Students had mixed opinions when comparing AI and preceptor (physician) feedback. Some noted that it felt similar or more detailed. Others felt that the feedback provided was less specific and lacked clinical context or “real-world” nuance. In addition, OSCEai did not have the ability to comment on tone and body language. Some participants felt that AI feedback was harsher when compared to preceptor feedback, which was described as gentler and encouraging.
“I think the AI, again, is just like more generalized feedback. It didn’t contextualize… asking this is really important when you’re in the emergency department… you want to make sure that you’re ruling out these really bad things… The AI was more like, you delivered your questions well and you elicited the right information.”
(PAR 5)
When comparing AI and SP feedback, participants felt that SPs were often able to provide valuable feedback on body language and tone that OSCEai could not. Otherwise, however, participants overwhelmingly viewed AI feedback as more impactful than SP feedback, finding it more individualized.
“I feel like an SP is able to say like, ‘I really liked when you asked this specific question or like your tone of voice in this part was really helpful for making me feel comfortable’.”
(PAR 5)
“I feel like standardized patients do give good feedback, but it’s a lot more on the emotional aspect and the social aspect of how they felt as a patient, and how my line of questioning was… which is really useful as well, because the social aspect is very important in terms of building rapport. But I feel like the tool did a really good job in terms of telling me where to focus my questioning a little bit more, kind of more of the [type of] feedback you would get from a preceptor… which I found was useful”
(PAR 13)
  • Theme 3: Curricular Integration
Participants overwhelmingly felt that AI was best used as a supplement to, rather than a replacement for, existing teaching methods. The OSCEai tool was seen as a strong adjunct to, but not a replacement for, SPs. Participants valued the tool as a source of additional history-taking practice between formal clinical skills sessions. Many participants used the tool in preparation for their OSCE. A small number described using OSCEai to reinforce related concepts before other, non-clinical-skills assessments within the curriculum.
“Sometimes it can be really nerve-wracking going into clinical skills and having to take a patient history in front of your whole group. So maybe before those sessions, like if people knew that [OSCEai] was available and if it was just available all the time, for a variety of topics, they could just do that topic that they were going to be doing in their upcoming clinical skills session. I think maybe that would help with some anxiety around practicing histories with actual SPs. I just don’t think that I would want it to completely replace SPs, which I know it’s not… but like I do think it’s important to practice the nonverbal side of stuff as well.”
(PAR 4)
“Perhaps a lot of people might have laughed and said there’s no way AI is being used for simulated cases, but I think with the current capabilities, it’s… definitely very feasible to do now.”
(PAR 14)
Participants often found it useful to revisit cases after tutorials. Over the course of the curriculum, this was seen as a chance to reinforce concepts studied previously and to recall the pathophysiology and clinical presentation alongside history-taking. Pre-existing familiarity with a case, including detailed knowledge of its pathophysiology and clinical presentation, allowed participants to focus on history-taking technique. Some participants felt that, in the context of OSCE preparation specifically, they would value new, surprise cases.
“It was nice to connect it back to a condition that I was like, ‘oh, I remember this’, and having to also ask questions that relate to the pathophysiology and try to recall that at the same time. It was kind of a good way to study not only how to take a history, but also like the conditions we’re looking at and kind of a recall back to other clinical skills sessions.”
(PAR 5)

4. Discussion

This study assessed participants’ subjective perceptions of AI-generated histories. Overall, participants reported an average 14.6% increase in comfort with history-taking for a particular topic after one AI simulation. Participants generally appreciated that AI-based simulations were accessible, self-paced, and customizable to their individual academic goals and progress. However, participants did raise questions about the fidelity of these interactions, including that it was difficult to establish a rapport with generative AI or rely on non-verbal cues, such as body language. AI-generated feedback was described as detailed and thorough. This was seen as a positive by some participants, who appreciated its comprehensiveness and ability to comment on all aspects of performance. On the other hand, some individuals viewed this negatively as they felt that the excessive detail was overwhelming. Additionally, participants felt that AI-generated feedback felt repetitive over time, and this reduced the perceived utility of the feedback longitudinally. Preceptor feedback was also considered to be more clinically relevant than AI feedback. Overall, the OSCEai tool was best seen as a supplement to the existing curriculum rather than a replacement for any current components.
The findings from this study can be contextualized within the broader framework of simulation-based medical education, wherein learners are provided the opportunity to engage in practice, reflection, and feedback. Generative AI allows learners to engage in all four stages of Kolb’s experiential learning cycle [32]: concrete experience through the simulation itself, reflective observation through feedback, abstract conceptualization as the learner engages with this feedback, and active experimentation as they apply this learning. Although AI-based simulations are relatively limited in their physical and psychological fidelity, they are associated with a high degree of functional and cognitive fidelity [33,34]. A recent review by Cook et al. [10] supports this finding by highlighting the similarity in authenticity between GPTs and real healthcare practitioners, although those findings were limited to the fidelity of the GPT dialogue. Ultimately, by facilitating the task of taking a history and the cognitive processes required for this, AI simulations provide meaningful opportunities for deliberate practice and reflection, supporting skill acquisition even in the absence of visual or emotional realism. Our findings supplement prior research in similar populations and settings (i.e., nursing students), which demonstrated the ability of AI tools to develop skills while providing psychological safety and clinical realism [35,36].
Simulated AI interactions represent a unique modality for learners to develop history-taking skills. At present, resource constraints, in particular limited time with standardized patients, often restrict student access to practice opportunities, reducing the time available to develop critical skills. AI represents a scalable solution that can be integrated into the medical school curriculum to provide students with more frequent and varied practice sessions, regardless of their access to standardized patients. This would ultimately democratize the learning experience and ensure that students can refine their clinical skills according to their needs, rather than having practice opportunities limited by program resource constraints. AI simulation may also help mitigate disparities between resource-rich and resource-poor learning environments. Furthermore, the flexibility of AI-driven tools allows for a more personalized training experience, enabling learners to control the speed, difficulty, and topic of a particular interaction. This accessibility and flexibility benefit students practicing at their own pace and allow them to focus on areas in which they feel less confident.
Simulated AI interactions may be best used as a supplement to standardized patient interactions. Certain elements of professional identity formation, including empathy, tone, and non-verbal communication, remain best cultivated through interactions with standardized patients. As such, AI-based history-taking practice cannot act as a stand-alone modality or entirely replace the educational value of standardized patients. Opportunities for supplementary integration include practice cases adjacent to clinical skills sessions, a longitudinal resource available within the curriculum, and targeted preparation for clinical examinations such as the Objective Structured Clinical Examination. Given current limitations in emotional fidelity, AI-based simulations may be best suited for formative practice rather than summative assessment. Similar work with virtual standardized patients in nursing environments has demonstrated significant promise as a complementary resource, paralleling the findings of our research [37].
AI patient simulations may be optimally deployed within a spiral curriculum. Aligning case content with other elements of the curriculum, and therefore with students’ pre-existing familiarity with the material, may allow students in early stages to focus on communication rather than being overwhelmed by medical knowledge. In more advanced stages, this may facilitate backward transfer, a theory of learning that describes how new learning affects prior knowledge and reasoning [38]. In this case, students can use AI patient simulations to consolidate medical knowledge learned elsewhere by applying it practically. By facilitating the acquisition of these foundational skills, AI tools may also allow students to use time with standardized patients to focus on more nuanced or emotionally and socially complex patient cases.
With that said, it would be remiss not to mention some methodological limitations of our study. This work was conducted at a single medical school, and results may therefore not generalize to settings with broadly different curriculum design or baseline simulation exposure. A second limitation is the sampling method: individuals less inclined to engage with artificial intelligence-based tools may not have participated. It is therefore difficult to know to what degree these data reflect the entire medical class.
It is also important to acknowledge that this study measured subjective changes in comfort and perceived usefulness; students’ objective improvement was not measured. Future work can assess the extent to which student performance objectively improves before and after AI simulation. Subsequent studies could also explore the utility of AI-driven tools in enhancing other aspects of clinical training and their integration into medical education overall. Additionally, exploring the use of AI in health professions beyond medicine is warranted. Foundational work has been conducted in nursing, and early work in undergraduate medical education has begun [10,35,36,37,39]. However, further exploration of AI and its use in clinical practice is needed.

5. Conclusions

This study explores the feasibility and perceived value of integrating a generative AI tool into clinical skills education. Participants found the tool useful as a mechanism for skill development, particularly because it allowed for repeated, personalized, and flexible practice; however, generative AI is limited in fidelity compared with standardized patients. AI is best used as a supplemental tool within the clinical skills curriculum rather than a replacement for standardized patients, and may be optimally integrated alongside clinical skills sessions within a spiral curriculum. Future work can explore the utility of AI-driven tools in enhancing other aspects of medical education.

Author Contributions

Conceptualization, U.S. and M.S.; methodology, M.S., U.S., M.L., S.M., J.M. and N.L.; software, E.G.; validation, M.S.; formal analysis, U.S., M.L., J.M., N.B., N.L., S.M. and M.S.; investigation, U.S. and M.L.; resources, M.S.; data curation, U.S., M.L. and M.S.; writing—original draft preparation, U.S.; writing—review and editing, U.S., M.L., J.M., N.B., N.L., E.G., S.M. and M.S.; supervision, M.S.; project administration, U.S.; funding acquisition, U.S. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by funding from the Michael G. DeGroote School of Medicine–McMaster Medical Student Research Excellence Awards (MAC RES).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Hamilton Integrated Research Ethics Board (protocol code 16977 on 11 July 2024, with renewal approved on 11 June 2025).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to participant privacy concerns.

Acknowledgments

During this study, the author(s) used Generative AI (OSCEai) for the purposes of simulation.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1. Survey questions provided to participants in post-simulation. All Likert Scale questions were scored from 1–5, with 1 being “Very uncomfortable” and 5 being “Very comfortable” for questions 1–2; and 1 being “Strongly disagree” and 5 being “Strongly agree” for questions 3–7.
Likert Scale Questions
(1) Before interacting with OSCEai, how comfortable were you with this topic?
(2) How comfortable are you now with this topic?
(3) Using this tool and format would improve my understanding of history-taking.
(4) Using this tool and format would improve my communication skills.
(5) I found the tool and format easy to use and accessible.
(6) The feedback given by OSCEai is comparable to that given by standardized patients in clinical skills classes.
(7) The feedback given by OSCEai is comparable to that given by preceptors in clinical skills classes.
Open Response Questions
(8) Was the tool useful? Did it impact your ability to take a history and/or your communication skills?
(9) Did you face any challenges using the tool?
(10) Would you use this tool again?
(11) How did the feedback you were given by OSCEai compare to that of your clinical skills preceptors?
(12) How did the feedback you were given by OSCEai compare to that of your standardized patients?
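As a concrete illustration of how items (1) and (2) support the pre/post comfort comparison, the percent change in mean comfort can be sketched as below. This is an illustrative sketch only: the scores shown are hypothetical, this is not the study’s analysis code, and the 14.6% figure reported in the abstract was derived from the actual survey data.

```python
# Illustrative sketch (hypothetical data): computing the percent change in
# mean self-reported comfort from paired Likert responses, where item (1)
# is comfort before the simulation and item (2) is comfort after.
pre = [3, 2, 4, 3, 3]    # hypothetical item (1) scores on a 1-5 scale
post = [4, 3, 4, 4, 3]   # hypothetical item (2) scores on a 1-5 scale

mean_pre = sum(pre) / len(pre)
mean_post = sum(post) / len(post)
percent_change = (mean_post - mean_pre) / mean_pre * 100

print(f"Mean comfort: {mean_pre:.2f} -> {mean_post:.2f} ({percent_change:+.1f}%)")
```

With the hypothetical scores above, mean comfort rises from 3.00 to 3.60, a +20.0% change; the same arithmetic applied to the study’s paired responses yields the reported figure.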

Appendix B

Questions included within the semi-structured interview guide.
  • Experience with Generative AI & taking histories:
  • Did using generative AI affect your confidence in taking a patient history? If so, how?
  • Did you face any challenges using the generative AI in this context?
  • Were there any limitations of the tool?
  • Were there any strengths of the tool? Any examples of a time when it helped you to recognize an area for improvement or practice a skill?
  • How would you compare the experience of taking a history with generative AI and humans? When/where in the curriculum did you typically use it? When you used generative AI, did you simulate it like a real history or did you ever go back and repeat parts of the history?
  • After using the tool, did you notice any differences between this and SP/real patient interactions? Anything that stood out in particular?
  • We reused cases that you had previously encountered in tutorial. Do you have any thoughts on this?
  • Feedback:
  • How did generative AI feedback compare to feedback from standardized patients?
  • How did generative AI feedback compare to preceptor feedback?
  • Is there any feedback you received that helped you refine your approach to history-taking?
  • General perceptions of AI:
  • Before this study, what were your thoughts on the use of generative AI in medical education? Have these changed?
  • From your perspective, is there a role for this to be integrated into clinical skills curricula?
    • Why or why not?
    • And if yes, how? When would this be most useful?
  • If you were to incorporate this AI tool into your process of learning to take histories, how would you go about doing this?
  • What improvements would make generative AI more useful for medical students?

Appendix C

Table A2. Number of respondees who participated in each simulation.
Simulation    Participants
1             8
2             6
3             6
4             12
5             24
6             15
7             11
8             11

References

  1. Keifenheim, K.E.; Teufel, M.; Ip, J.; Speiser, N.; Leehr, E.J.; Zipfel, S.; Herrmann-Werner, A. Teaching history taking to medical students: A systematic review. BMC Med. Educ. 2015, 15, 159. [Google Scholar] [CrossRef]
  2. Maguire, G.P.; Clarke, D.; Jolley, B. An experimental comparison of three courses in history-taking skills for medical students. Med. Educ. 1977, 11, 175–182. [Google Scholar] [CrossRef]
  3. Rutter, D.R.; Maguire, G.P. History-taking for medical students: II—Evaluation of a training programme. Lancet 1976, 308, 558–560. [Google Scholar] [CrossRef]
  4. Rubin, E.S.; Rullo, J.; Tsai, P.; Criniti, S.; Elders, J.; Thielen, J.M.; Parish, S.J. Best practices in North American pre-clinical medical education in sexual history taking: Consensus from the summits in medical education in sexual health. J. Sex. Med. 2018, 15, 1414–1425. [Google Scholar] [CrossRef]
  5. Presado, M.H.C.V.; Colaço, S.; Rafael, H.; Baixinho, C.L.; Félix, I.; Saraiva, C.; Rebelo, I. Learning with high fidelity simulation. Cienc. Saude Coletiva 2018, 23, 51–59. [Google Scholar] [CrossRef]
  6. Fraser, K.; McLaughlin, K. Temporal pattern of emotions and cognitive load during simulation training and debriefing. Med. Teach. 2019, 41, 184–189. [Google Scholar] [CrossRef]
  7. Tremblay, M.L.; Lafleur, A.; Leppink, J.; Dolmans, D.H. The simulated clinical environment: Cognitive and emotional impact among undergraduates. Med. Teach. 2017, 39, 181–187. [Google Scholar] [CrossRef]
  8. Gordon, M.; Daniel, M.; Ajiboye, A.; Uraiby, H.; Xu, N.Y.; Bartlett, R.; Hanson, J.; Haas, M.; Spadafore, M.; Grafton-Clarke, C.; et al. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med. Teach. 2024, 46, 446–470. [Google Scholar] [CrossRef] [PubMed]
  9. Holderried, F.; Stegemann-Philipps, C.; Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, M.; Eickhoff, C.; Mahling, M. A language model–powered simulated patient with automated feedback for history taking: Prospective study. JMIR Med. Educ. 2024, 10, e59213. [Google Scholar] [CrossRef] [PubMed]
  10. Cook, D.A.; Overgaard, J.; Pankratz, V.S.; Del Fiol, G.; Aakre, C.A. Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue with Feedback. J. Med. Internet Res. 2025, 27, e68486. [Google Scholar] [CrossRef] [PubMed]
  11. Yi, Y.; Kim, K.J. The feasibility of using generative artificial intelligence for history taking in virtual patients. BMC Res. Notes 2025, 18, 80. [Google Scholar] [CrossRef]
  12. Laverde, N.; Grévisse, C.; Jaramillo, S.; Manrique, R. Integrating large language model-based agents into a virtual patient chatbot for clinical anamnesis training. Comput. Struct. Biotechnol. J. 2025, 27, 2481–2491. [Google Scholar] [CrossRef]
  13. Luo, M.J.; Bi, S.; Pang, J.; Liu, L.; Tsui, C.K.; Lai, Y.; Chen, W.; Yang, Y.; Xu, K.; Zhao, L.; et al. A large language model digital patient system enhances ophthalmology history taking skills. npj Digit. Med. 2025, 8, 502. [Google Scholar] [CrossRef]
  14. Mondal, H.; Karri, J.K.K.; Ramasubramanian, S.; Mondal, S.; Juhi, A.; Gupta, P. A qualitative survey on perception of medical students on the use of large language models for educational purposes. Adv. Physiol. Educ. 2025, 49, 27–36. [Google Scholar] [CrossRef]
  15. Borg, A.; Jobs, B.; Huss, V.; Gentline, C.; Espinosa, F.; Ruiz, M.; Edelbring, S.; Georg, C.; Skantze, G.; Parodis, I. Enhancing clinical reasoning skills for medical students: A qualitative comparison of LLM-powered social robotic versus computer-based virtual patients within rheumatology. Rheumatol. Int. 2024, 44, 3041–3051. [Google Scholar] [CrossRef]
  16. Borg, A.; Georg, C.; Jobs, B.; Huss, V.; Waldenlind, K.; Ruiz, M.; Edelbring, S.; Skantze, G.; Parodis, I. Virtual patient simulations using social robotics combined with large language models for clinical reasoning training in medical education: Mixed methods study. J. Med. Internet Res. 2025, 27, e63312. [Google Scholar] [CrossRef]
  17. Haider, S.A.; Prabha, S.; Gomez-Cabello, C.A.; Borna, S.; Genovese, A.; Trabilsy, M.; Collaco, B.G.; Wood, N.G.; Bagaria, S.; Tao, C.; et al. Synthetic Patient–Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation. Sensors 2025, 25, 4305. [Google Scholar] [CrossRef]
  18. Moura, L.; Jones, D.T.; Sheikh, I.S.; Murphy, S.; Kalfin, M.; Kummer, B.R.; Weathers, A.L.; Grinspan, Z.M.; Silsbee, H.M.; Jones, L.K., Jr.; et al. Implications of large language models for quality and efficiency of neurologic care: Emerging issues in neurology. Neurology 2024, 102, e209497. [Google Scholar] [CrossRef]
  19. Denecke, K.; May, R.; LLMHealthGroup; Rivera Romero, O. Potential of large language models in health care: Delphi study. J. Med. Internet Res. 2024, 26, e52399. [Google Scholar] [CrossRef]
  20. Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA 2025, 333, 319–328. [Google Scholar] [CrossRef]
  21. Verghese, B.G.; Iyer, C.; Borse, T.; Cooper, S.; White, J.; Sheehy, R. Modern artificial intelligence and large language models in graduate medical education: A scoping review of attitudes, applications & practice. BMC Med. Educ. 2025, 25, 730. [Google Scholar]
  22. Guo, E.; Gupta, M.; Park, Y.; Ramchandani, R. Ai in medical education: Interactive and personalized learning with OSCEai (Abstract 01-4-4). Can. Med. Educ. J. 2025. [Google Scholar] [CrossRef]
  23. Guo, E.; Ramchandani, R.; Park, Y.; Gupta, M. OSCEai: Personalized interactive learning for undergraduate medical education. Can. Med. Educ. J. 2024. [Google Scholar] [CrossRef]
  24. Park, Y.-J.; Guo, E.; Sachdeva, M.; Ma, B.; Mirali, S.; Rankin, B.; Nathanielsz, N.; Abduelmula, A.; Lapa, T.; Gupta, M.; et al. OSCEai dermatology: Augmenting dermatologic medical education with large language model GPT-4. Can. Med. Educ. J. 2025; in press. Available online: https://journalhosting.ucalgary.ca/index.php/cmej/article/view/80056 (accessed on 15 August 2025). [CrossRef]
  25. Ramchandani, R.; Biglou, S.G.; Gupta, M.; Guo, E. Using AI to revolutionize clinical training through OSCEai: A focused exploration of user feedback on otolaryngology and neurology cases. Can. J. Neurol. Sci./J. Can. Des Sci. Neurol. 2024, 51, S35. [Google Scholar] [CrossRef]
  26. Nelson, J. Using conceptual depth criteria: Addressing the challenge of reaching saturation in qualitative research. Qual. Res. 2017, 17, 554–570. [Google Scholar] [CrossRef]
  27. Saunders, B.; Sim, J.; Kingston, T.; Baker, S.; Waterfield, J.; Bartlam, B.; Burroughs, H.; Jinks, C. Saturation in qualitative research: Exploring its conceptualization and operationalization. Qual. Quant. 2018, 52, 1893–1907. [Google Scholar] [CrossRef]
  28. Gutiérrez, K.D.; Penuel, W.R. Relevance to practice as a criterion for rigor. Educ. Res. 2014, 43, 19–23. [Google Scholar] [CrossRef]
  29. Herrera-Aliaga, E.; Estrada, L.D. Trends and innovations of simulation for twenty first century medical education. Front. Public Health 2022, 10, 619769. [Google Scholar] [CrossRef]
  30. Motola, I.; Devine, L.A.; Chung, H.S.; Sullivan, J.E.; Issenberg, S.B. Simulation in healthcare education: A best evidence practical guide. AMEE Guide No. 82. Med. Teach. 2013, 35, e1511–e1530. [Google Scholar] [CrossRef]
  31. Kneebone, R. Evaluating clinical simulations for learning procedural skills: A theory-based approach. Acad. Med. 2005, 80, 549–553. [Google Scholar] [CrossRef]
  32. Kolb, A.; Kolb, D. Eight important things to know about the experiential learning cycle. Aust. Educ. Lead. 2018, 40, 8–14. [Google Scholar]
  33. Andrews, D.H.; Carroll, L.A.; Bell, H.H. The future of selective fidelity in training devices. Educ. Technol. 1995, 35, 32–36. [Google Scholar]
  34. Hamstra, S.J.; Brydges, R.; Hatala, R.; Zendejas, B.; Cook, D.A. Reconsidering fidelity in simulation-based training. Acad. Med. J. Assoc. Am. Med. Coll. 2014, 89, 387–392. [Google Scholar] [CrossRef]
  35. Harder, N.; Ali, F.; Turner, S.; Workum, K.; Gillman, L. Comparing artificial intelligence-enhanced virtual reality and simulated patient simulations in undergraduate nursing education. Clin. Simul. Nurs. 2025, 105, 101780. [Google Scholar] [CrossRef]
  36. Jallad, S.T.; Işık, B. The effectiveness of virtual reality simulation as a learning strategy in the acquisition of medical skills in nursing education: A systematic review. Ir. J. Med. Sci. 2022, 191, 1407–1426. [Google Scholar] [CrossRef]
  37. De Mattei, L.; Morato, M.Q.; Sidhu, V.; Gautam, N.; Mendonca, C.T.; Tsai, A.; Hammer, M.; Creighton-Wong, L.; Azzam, A. Are Artificial Intelligence Virtual Simulated Patients (AI-VSP) a valid teaching modality for health professional students? Clin. Simul. Nurs. 2024, 92, 101536. [Google Scholar] [CrossRef]
  38. Hohensee, C.; Willoughby, L.; Gartland, S. Backward transfer, the relationship between new learning and prior ways of reasoning, and action versus process views of linear functions. Math. Think. Learn. 2024, 26, 71–89. [Google Scholar] [CrossRef]
  39. Zidoun, Y.; Mardi, A.E. Artificial intelligence (AI)-based simulators versus simulated patients in undergraduate programs: A protocol for a randomized controlled trial. BMC Med. Educ. 2024, 24, 1260. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
