Article

DIALOGUE: A Generative AI-Based Pre–Post Simulation Study to Enhance Diagnostic Communication in Medical Students Through Virtual Type 2 Diabetes Scenarios

by Ricardo Xopan Suárez-García 1,2, Quetzal Chavez-Castañeda 2, Rodrigo Orrico-Pérez 2, Sebastián Valencia-Marin 1,2, Ari Evelyn Castañeda-Ramírez 1,2, Efrén Quiñones-Lara 1,2, Claudio Adrián Ramos-Cortés 1,2,3, Areli Marlene Gaytán-Gómez 1,2,3, Jonathan Cortés-Rodríguez 4, Jazel Jarquín-Ramírez 2,5, Nallely Guadalupe Aguilar-Marchand 2,5, Graciela Valdés-Hernández 2,6, Tomás Eduardo Campos-Martínez 2,6, Alonso Vilches-Flores 2,6, Sonia Leon-Cabrera 2,6,7, Adolfo René Méndez-Cruz 2,8, Brenda Ofelia Jay-Jímenez 2,5,* and Héctor Iván Saldívar-Cerón 1,2,6,7,*
1 Unidad de Remisión de Diabetes Mellitus (URDM), Facultad de Estudios Superiores-Iztacala, Universidad Nacional Autónoma de México, Tlalnepantla 54090, Mexico
2 Carrera de Médico Cirujano, Facultad de Estudios Superiores-Iztacala, Universidad Nacional Autónoma de México, Tlalnepantla 54090, Mexico
3 Laboratorio de Medicina de la Conservación, Escuela Superior de Medicina, Instituto Politécnico Nacional (IPN), Plan de San Luis y Díaz Mirón, Colonia Casco de Santo Tomás, Miguel Hidalgo, Mexico City 11350, Mexico
4 Colegio de Ciencias y Humanidades, Plantel (I) Azcapotzalco (CCH), Av. Aquiles Serdan No. 2060, Ex-Hacienda del Rosario, Azcapotzalco, Mexico City 02020, Mexico
5 Centro Internacional de Simulación y Entrenamiento en Soporte Vital Iztacala (CISESVI), Facultad de Estudios Superiores-Iztacala, Universidad Nacional Autónoma de México, Tlalnepantla 54090, Mexico
6 Academia del Módulo de Sistema Endocrino, Carrera de Médico Cirujano, Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México (UNAM), Tlalnepantla 54090, Mexico
7 Unidad de Biomedicina (UBIMED), Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México, Tlalnepantla 54090, Mexico
8 Laboratorio de Inmunología (UMF), Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México, Los Barrios N° 1, Los Reyes Iztacala, Tlalnepantla 54090, Mexico
* Authors to whom correspondence should be addressed.
Eur. J. Investig. Health Psychol. Educ. 2025, 15(8), 152; https://doi.org/10.3390/ejihpe15080152
Submission received: 9 July 2025 / Revised: 30 July 2025 / Accepted: 5 August 2025 / Published: 7 August 2025

Abstract

DIALOGUE (DIagnostic AI Learning through Objective Guided User Experience) is a generative artificial intelligence (GenAI)-based training program designed to enhance diagnostic communication skills in medical students. In this single-arm pre–post study, we evaluated whether DIALOGUE could improve students’ ability to disclose a type 2 diabetes mellitus (T2DM) diagnosis with clarity, structure, and empathy. Thirty clinical-phase students completed two pre-test virtual encounters with an AI-simulated patient (ChatGPT, GPT-4o), scored by blinded raters using an eight-domain rubric. Participants then engaged in ten asynchronous GenAI scenarios with automated natural-language feedback. Seven days later, they completed two post-test consultations with human standardized patients, again evaluated with the same rubric. Mean total performance increased by 36.7 points (95% CI: 31.4–42.1; p < 0.001), and the proportion of high-performing students rose from 0% to 70%. Gains were significant across all domains, most notably in opening the encounter, closure, and diabetes-specific explanation. Multiple regression showed that lower baseline empathy (β = −0.41, p = 0.005) and higher digital self-efficacy (β = 0.35, p = 0.016) independently predicted greater improvement; gender had only a marginal effect. Cluster analysis revealed three learner profiles, with the highest-gain group characterized by low empathy and high digital self-efficacy. Inter-rater reliability was excellent (ICC ≈ 0.90). These findings provide empirical evidence that GenAI-mediated training can meaningfully enhance diagnostic communication and may serve as a scalable, individualized adjunct to conventional medical education.

1. Introduction

DIALOGUE (DIagnostic AI Learning through Objective Guided User Experience) is an educational intervention designed to train medical students in diagnostic communication through scalable, AI-based simulations. As artificial intelligence (AI) rapidly transforms medical education, generative AI models such as ChatGPT have gained attention for their ability to simulate human-like dialogue and provide individualized feedback (Feng & Shen, 2023). Unlike traditional digital tools, large language models (LLMs) can dynamically engage students in realistic, interactive patient encounters, potentially accelerating learning across clinical domains (Chokkakula et al., 2025). A key advantage of these tools lies in their capacity to tailor educational experiences—analyzing individual learner profiles and adapting feedback to specific gaps and learning styles (Chokkakula et al., 2025). Beyond knowledge delivery, generative AI enables the simulation of nuanced diagnostic conversations, fostering students’ critical thinking and communication skills in a cost-effective, high-frequency format. In short, these technologies offer new possibilities for scalable, on-demand clinical training in medical education (Chokkakula et al., 2025).
Communication skills, especially those related to diagnosis, are a core competency for physicians and a critical focus for medical training. Effective doctor–patient communication improves the quality of care, whereas poor communication can undermine history-taking, hinder accurate diagnosis, and negatively affect patient adherence to treatments (Moezzi et al., 2024). Importantly, deficiencies in communication—rather than medical knowledge—are often cited as a leading cause of patient dissatisfaction and complaints (Moezzi et al., 2024). Despite the well-recognized importance of communication, many trainees still feel underprepared in this domain. For example, a recent survey found that while over 70% of senior medical students had received some instruction in “breaking bad news,” only about 17% felt adequately prepared to actually deliver serious diagnoses to patients (Santos et al., 2025). Nearly all students in that study agreed on the necessity of being well prepared to communicate difficult news (Santos et al., 2025), highlighting a gap between current training and learner needs. Clearly, there is an educational imperative to strengthen medical students’ diagnostic communication skills—from effective history-taking and explanation of clinical reasoning to empathetic disclosure of diagnoses—through improved teaching methods and practice opportunities.
Simulation-based education has long been an integral strategy for teaching clinical communication and diagnostic reasoning skills in a safe, controlled environment (Elendu et al., 2024). Traditional simulations often involve standardized patients (actors trained to portray patients), which have been shown to significantly improve learners’ diagnostic accuracy, communication abilities, and overall clinical competence (Elendu et al., 2024). Simulated patient encounters have long been used to develop communication skills in medical training, particularly through objective structured clinical examinations (OSCEs). These face-to-face simulations are effective but resource-intensive, requiring extensive time, personnel, and coordination, which inherently limits their frequency and scalability (Skryd & Lawrence, 2024). Previous attempts to digitize these interactions—using scripted virtual patients or rule-based chatbots—have yielded mixed results, often hindered by limited interactivity, lack of realism, and poor accessibility (Sallam, 2023). The recent emergence of generative AI models capable of producing coherent, unscripted dialogue opens new possibilities for scalable, on-demand simulations (Kung et al., 2023). Unlike traditional systems, LLMs like GPT-4 can simulate dynamic patient behavior, adapt to diverse learner inputs, and provide real-time feedback, potentially transforming how students rehearse complex conversations such as diagnostic delivery (Weisman et al., 2025). Indeed, the use of virtual patient software has already shown positive educational effects; studies and systematic reviews report that computer-based simulations can enhance students’ clinical reasoning and skill acquisition and are generally well received by learners (Öncü et al., 2025). Now, state-of-the-art generative models like ChatGPT enable highly dynamic patient–doctor conversations that were not previously possible. Educators can leverage these AI-driven virtual patients to offer medical trainees essentially unlimited practice of clinical scenarios without the logistical constraints of scheduling actors or specialized labs (Öncü et al., 2025). Such AI-powered simulations can be tailored to specific learning objectives, provide immediate feedback, and ensure standardized yet responsive patient interactions for every student (Öncü et al., 2025). In sum, combining time-tested simulation pedagogy with modern AI technology presents a compelling approach to improve diagnostic communication training in a scalable way.
Early experiences with generative AI in medical training are emerging, and the results are promising. Other pilot studies have explored using ChatGPT or related LLMs as conversational partners for clinical simulations. For instance, Scherr et al. demonstrated that ChatGPT-3.5 could successfully generate interactive clinical case simulations, allowing students to practice forming diagnostic impressions and management plans over a full patient encounter (Scherr et al., 2023). Notably, the AI-driven cases were able to adapt to learners’ inputs and questions, more closely mimicking real-life dialogues than static text vignettes (Scherr et al., 2023). Other interventions have focused specifically on communication skills. One recent pilot study programmed ChatGPT as a virtual patient in a breaking-bad-news scenario (using the SPIKES protocol) and had second-year medical students conduct a difficult diagnosis discussion via text. The pre–post findings were encouraging: students’ self-confidence in communicating with patients—particularly in delivering bad news—increased significantly after the ChatGPT interaction, and their trust in AI as a useful training tool also improved (Chiu et al., 2025). Qualitative feedback from that study indicated that learners valued the structured practice and felt the AI exercise helped them understand the patient’s perspective (Chiu et al., 2025). Similarly, another experiment using an advanced voice-interactive ChatGPT-4 system as a virtual standardized patient found that medical trainees were satisfied with the realism of the encounter and were enthusiastic about integrating such AI-based practice into their learning (Öncü et al., 2025). Participants in that study appreciated the uniform and safe environment the AI patient provided, and many reported the experience as a valuable supplement to their clinical education (Öncü et al., 2025). At the same time, students acknowledged the current limitations of AI simulations—for example, a chatbot cannot yet fully convey human emotions or nuanced non-verbal cues. In the breaking-bad-news simulation, learners emphasized that the ChatGPT exercise was a helpful adjunct but not a replacement for real patient interaction, largely due to the lack of genuine emotional exchange and the predictable nature of AI responses (Chiu et al., 2025). These early trials underscore both the potential and the challenges of generative AI in communication training. Overall, the evidence to date—though still limited in scope—suggests that AI chatbots can provide meaningful practice that improves learners’ confidence and skills, especially when used as part of a blended educational approach (Chiu et al., 2025).
In educational research, pre–post intervention designs are commonly used to evaluate the impact of novel teaching methods on learners’ skills and attitudes. By measuring outcomes before and after an intervention in the same cohort, this design allows for an initial assessment of effectiveness without requiring a separate control group. Such an approach has been applied in communication skills training and has demonstrated clear gains. For example, one program that combined didactics and a simulated patient encounter to teach first-year students how to deliver bad news saw the proportion of students reporting confidence in this skill jump from 32% before training to 91% afterward (Poei et al., 2022). In a similar vein, the ChatGPT-based breaking-bad-news pilot described above used a pre–post survey to show significant improvement in students’ self-rated communication confidence following the AI-facilitated practice session (Chiu et al., 2025). Although pre–post studies have inherent limitations (such as lack of a parallel control group and potential response-shift bias), they are invaluable for pilot assessments of educational innovations, helping to determine whether an intervention shows promise and warrants further development or more rigorous testing (Chiu et al., 2025; Poei et al., 2022).
In this context, the present study aimed to evaluate the effectiveness of a generative AI-driven simulation tool designed to train diagnostic communication skills in medical students. Through a structured pre–post intervention, we assessed whether repeated simulated conversations with a responsive virtual patient could improve learners’ ability to deliver a diagnosis clearly, accurately, and empathetically. Beyond feasibility, we sought to generate robust evidence on learning outcomes and predictive factors associated with greater improvement. The simulated cases focused on T2DM and were constructed in accordance with the American Diabetes Association’s (ADA) 2025 Standards of Care and the European Association for the Study of Diabetes (EASD) 2023 Guidelines on Patient Communication, ensuring clinical and pedagogical relevance (American Diabetes Association Professional Practice Committee, 2025; Karagiannis et al., 2025). By combining performance-based assessment, cluster analysis, and learner feedback, this study contributes to the growing field of AI-enhanced medical education and supports the integration of generative AI as a pedagogical adjunct for clinical communication training (Shorey et al., 2024).

2. Materials and Methods

2.1. Study Design and Overview

The DIALOGUE (DIagnostic AI Learning through Objective Guided User Experience) study was a prospective, single-arm, pre–post intervention designed to evaluate the effectiveness of a GenAI-based training program on diagnostic communication performance in medical students. The study took place between May and June 2025 at the Facultad de Estudios Superiores Iztacala (FESI), Universidad Nacional Autónoma de México (UNAM). Ethics approval was obtained from the institutional committee (ID: CE/FESI/042025/1922), and all participants provided written informed consent. A target sample size of 30–32 students was selected based on feasibility for a pre–post design with blinded evaluation, consistent with prior AI-based educational studies in medical training (Chiu et al., 2025; Poei et al., 2022). This number was considered sufficient to detect large within-subject effect sizes in communication outcomes, assuming 80% statistical power and α = 0.05, as supported by previous pilot simulations.

2.2. Participants

Eligible participants were undergraduate medical students at FESI-UNAM enrolled in the clinical phase of their MD program (semesters 2 to 7). Convenience sampling was used to recruit students from May to June 2025. Inclusion criteria were as follows: (a) enrollment in clinical semesters, (b) completion of at least one core clerkship, (c) no prior formal training in diagnostic communication, and (d) no participation in prior pilot studies involving AI tools. Baseline characteristics—including age, gender, academic semester, GPA, prior use of generative AI tools, digital self-efficacy, empathy (Jefferson Scale of Empathy–Student version), and previous clinical exposure to real patients—were collected via structured questionnaires. The MD program at FES-Iztacala spans six years, including preclinical and clinical phases. The clinical phase begins in year 3 and progresses through structured clerkships and simulation-based training modules.

2.3. Study Phases

The study was conducted in three sequential phases: (1) a pre-test diagnostic communication simulation, (2) an asynchronous remote AI-based training intervention, and (3) a post-test diagnostic consultation with a human standardized patient (SP).

2.3.1. Pre-Test Phase

In the pre-test phase, each participant completed two simulated diagnostic encounters, where they were required to communicate a diagnosis of T2DM to a virtual patient powered by a generative AI model (ChatGPT, GPT-4o-mini, release 15 May 2025; temperature = 0.7, max_tokens = 1000). Prior to each simulation, participants were provided with a five-minute written clinical case summary. Interactions were conducted individually, in real time, through audio-based dialogue and without scripted prompts. Each performance was independently evaluated by two blinded clinical raters using a validated 8-domain diagnostic communication rubric (see Section 2.4). All clinical scenarios were co-designed by the research team and faculty members with expertise in endocrinology and medical education. Each case involved a newly diagnosed adult with type 2 diabetes mellitus and incorporated psychosocial variables of moderate complexity, such as denial, family concern, or emotional distress. Both scenarios presented a diagnosis of T2DM but differed slightly in psychosocial framing—Scenario 1 involved a middle-aged patient with minimal emotional reaction, while Scenario 2 portrayed a younger adult with significant concern about long-term complications and lifestyle impact. All participants completed both scenarios in randomized order to mitigate sequencing bias and ensure balanced exposure. The diagnostic criteria, clinical framing, and communication strategies embedded in the scenarios were designed in accordance with international guidelines, including the ADA Standards of Care in Diabetes—2025 and the EASD 2023 Guidelines for Effective Patient Communication in Diabetes Care (American Diabetes Association Professional Practice Committee, 2025; Karagiannis et al., 2025).
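To make the virtual-patient setup concrete, the following minimal Python sketch shows how a single patient turn could be requested with the model settings reported above (GPT-4o-mini, temperature = 0.7, max_tokens = 1000). It is an illustration only: the persona wording, the function name patient_reply, and the use of a text-based chat-completions call (the study used audio-based dialogue) are assumptions, not the study’s published materials.

```python
# Illustrative sketch only; the persona prompt and function name are assumptions,
# not the DIALOGUE study's published prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical persona for Scenario 1: newly diagnosed adult, minimal emotional reaction
PATIENT_PERSONA = (
    "You are a 52-year-old adult who has just received laboratory results consistent "
    "with type 2 diabetes mellitus. Respond realistically but with little emotional "
    "display, speak only as the patient, and never give medical advice."
)

def patient_reply(history: list[dict]) -> str:
    """Return the virtual patient's next turn, given the dialogue so far."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": PATIENT_PERSONA}] + history,
        temperature=0.7,   # parameters as reported in Section 2.3.1
        max_tokens=1000,
    )
    return response.choices[0].message.content

# Example opening turn by the student
history = [{"role": "user", "content": "Good morning, thank you for coming in today."}]
print(patient_reply(history))
```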

2.3.2. Educational Intervention: AI-Based Training

Following the pre-test, participants completed an asynchronous two-part educational intervention delivered remotely:
Module 1: Prompt Engineering Workshop—A 20 min instructional session introducing the principles of effective prompt construction for clinical simulations using generative AI tools.
Module 2: Diagnostic Communication Workshop—A 40 min training video focused on core elements of patient-centered diagnostic delivery, including empathy, emotional regulation, clarity of explanation, and strategies for communicating a T2DM diagnosis.
After completing both modules, each student conducted ten independent simulated diagnostic conversations with ChatGPT acting as a virtual patient. Each scenario was designed with increasing psychosocial complexity and targeted specific communication competencies (e.g., adapting medical language to patient understanding, managing emotional responses, or addressing denial). Simulations were performed in separate conversation threads. At the conclusion of each interaction, participants entered the prompt “FEEDBACK” to trigger an automated reflection by the AI, providing personalized formative guidance in natural language. All simulation links were submitted to the study coordinator and archived for subsequent qualitative analysis.
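The feedback mechanism described above can be sketched in the same hedged way: the sentinel prompt “FEEDBACK” simply switches the existing conversation thread from role-play to formative review. The tutor instruction below is an assumed example, not the study’s published prompt, and the sketch assumes the thread already contains the patient persona as its system message.

```python
# Continuation of the previous sketch: the learner's "FEEDBACK" sentinel ends the
# role-play and requests formative feedback in the same thread.
FEEDBACK_INSTRUCTION = (
    "Stop role-playing. Acting as a communication-skills tutor, review the conversation "
    "above and give the student specific, constructive feedback on clarity, structure, "
    "empathy, and checking for patient understanding."
)

def run_turn(client, history: list[dict], user_text: str) -> str:
    """Append the learner's turn (or the feedback request) and return the model's reply."""
    content = FEEDBACK_INSTRUCTION if user_text.strip().upper() == "FEEDBACK" else user_text
    history.append({"role": "user", "content": content})
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=history, temperature=0.7, max_tokens=1000
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```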

2.3.3. Post-Test Phase

Seven days after completing the AI-based training, each participant engaged in two live diagnostic consultations with human standardized patients (SPs), each representing one of the two original clinical scenarios. Thus, every participant delivered two diagnoses during the post-test phase. The encounters lasted approximately 5–7 min each and were conducted in person. Performances were assessed in real time by a third clinical evaluator, blinded to the participant’s prior exposure to the AI intervention. Scoring was based on the same 8-domain rubric used in the pre-test phase to ensure consistency across evaluations. Post-test evaluations were carried out by clinical raters who were licensed physicians with 5 to 15 years of clinical experience and formal teaching appointments in the MD program at FES-Iztacala. All raters underwent structured calibration using a standardized scoring manual and were blinded to participants’ pre-test scores and AI training exposure.

2.4. Evaluation Instruments

The primary outcome was the improvement in diagnostic communication competency, assessed using a modified version of the Kalamazoo Essential Elements Communication Checklist (Adapted), originally developed by the American Academy on Communication in Healthcare (Makoul, 2001a). The rubric was adapted for T2DM diagnostic disclosure and included eight domains: (1) opening and rapport, (2) patient-centered history, (3) empathic listening, (4) clarity of explanation, (5) emotional containment, (6) lay language use, (7) shared decision-making, and (8) professionalism. Each domain was scored on a 5-point Likert scale (1 = poor, 5 = excellent). Rubric descriptors were refined through pilot testing with five students not included in the main analysis. Clinical evaluators received structured calibration training using a standardized scoring manual. Inter-rater discrepancies ≥2 points were resolved through consensus discussion or adjudication by a third blinded reviewer.
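For illustration, the domain structure and the discrepancy rule described above can be expressed as a small data-handling sketch. The field names and data layout are assumptions, and the sketch works only at the domain level; the study’s full rubric also scores individual items within each domain.

```python
# Sketch of domain-level score handling for the adapted Kalamazoo rubric.
# Domain names follow Section 2.4; the data layout is illustrative only.
DOMAINS = [
    "opening_and_rapport", "patient_centered_history", "empathic_listening",
    "clarity_of_explanation", "emotional_containment", "lay_language_use",
    "shared_decision_making", "professionalism",
]

def domain_total(ratings: dict) -> int:
    """Sum the 1-5 Likert ratings across the eight domains."""
    return sum(ratings[d] for d in DOMAINS)

def flag_discrepancies(rater_a: dict, rater_b: dict, threshold: int = 2) -> list[str]:
    """Domains where the two blinded raters differ by >= 2 points, which per
    Section 2.4 are resolved by consensus or a third blinded reviewer."""
    return [d for d in DOMAINS if abs(rater_a[d] - rater_b[d]) >= threshold]
```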

2.5. Data Collection and Analysis

Rubric scores and self-reported measures (confidence and empathy) were collected manually using standardized paper-based forms and subsequently transcribed into a digital spreadsheet. Pre- and post-test scores were paired for each participant using anonymized identifiers. All statistical analyses were performed using R version 4.5.1 (R Foundation for Statistical Computing, Vienna, Austria). The Shapiro–Wilk test was employed to assess the normality of score distributions. Depending on the results, either paired t-tests or Wilcoxon signed-rank tests were applied to compare pre- and post-intervention scores at both the total and domain-specific levels. Effect sizes were calculated using Cohen’s d for parametric tests or r for non-parametric comparisons. Inter-rater reliability was evaluated using intraclass correlation coefficients (ICCs; two-way random effects model, absolute agreement).
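A minimal re-expression of this pipeline in Python is sketched below (the study itself used R 4.5.1; file and column names are assumptions). It follows the same logic: test normality of the paired differences, choose the paired t-test or Wilcoxon signed-rank test accordingly, compute the matching effect size, and estimate the two-way random-effects, absolute-agreement ICC.

```python
# Illustrative Python analogue of the analyses described in Section 2.5.
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

df = pd.read_csv("dialogue_scores.csv")              # hypothetical file: one row per participant
pre, post = df["pre_total"].to_numpy(), df["post_total"].to_numpy()
diff = post - pre

# Normality of the paired differences decides the test
if stats.shapiro(diff).pvalue >= 0.05:
    result = stats.ttest_rel(post, pre)
    effect = diff.mean() / diff.std(ddof=1)           # paired Cohen's d
else:
    result = stats.wilcoxon(post, pre)
    z = stats.norm.isf(result.pvalue / 2)             # approximate z from the two-sided p
    effect = z / np.sqrt(len(diff))                   # rank-based effect size r

# Inter-rater reliability: two-way random effects, absolute agreement (ICC2)
long_scores = pd.read_csv("ratings_long.csv")         # columns: participant, rater, score (assumed)
icc = pg.intraclass_corr(data=long_scores, targets="participant",
                         raters="rater", ratings="score")
print(result, effect)
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```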

2.6. Qualitative Analysis of AI Feedback

AI-generated feedback from the training phase was analyzed through inductive thematic analysis. Two independent researchers manually coded the feedback using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA), following an iterative, line-by-line approach. Codes were then grouped into themes through constant comparison. Discrepancies in coding were resolved through discussion until consensus was reached. The process aimed to identify core patterns in learners’ reflections and AI-generated suggestions.

2.7. Use of Generative AI in the Study

Generative AI was utilized in two distinct capacities within the study: (1) ChatGPT (GPT-4o-mini; OpenAI, San Francisco, CA, USA) served as a virtual simulated patient during the training phase; and (2) the same model was prompted to generate natural language feedback following each simulated consultation. No model fine-tuning or external training was conducted. Safety settings, token limits, and API parameters were configured in accordance with OpenAI’s safety guidelines (version 2.4). Generative AI was not used in the writing, editing, or translation of the manuscript text.

2.8. Statistical Analysis

All statistical analyses were conducted using R version 4.5.1 (R Foundation for Statistical Computing, Vienna, Austria). A two-tailed p-value < 0.05 was considered statistically significant. Baseline characteristics were summarized using descriptive statistics. Normality of continuous variables was assessed via the Shapiro–Wilk test. Paired comparisons between pre- and post-intervention scores (total and domain-specific) were performed using paired t-tests or Wilcoxon signed-rank tests, with effect sizes reported as Cohen’s d or rank-based r, respectively. Improvement scores (Δ-scores) were calculated as the difference between post-test and pre-test rubric totals. To identify predictors of improvement, we fitted multiple linear regression models, including baseline empathy, self-efficacy, GPA, gender, prior AI use, and other relevant covariates. Model selection followed stepwise procedures using the Akaike Information Criterion (AIC), and multicollinearity was assessed using variance inflation factors (VIFs). Model assumptions were verified via residual plots. Cluster analysis (k-means, k = 3) was performed on standardized baseline variables to identify latent learner profiles. Clusters were compared on Δ-scores using ANOVA or Kruskal–Wallis tests with Tukey or Dunn’s post hoc corrections. Correlation analyses (Pearson or Spearman) were used to explore associations between baseline traits (e.g., empathy, motivation) and performance outcomes. Rubric reliability was assessed through Cronbach’s alpha (internal consistency) and intraclass correlation coefficients (ICC, two-way random effects, absolute agreement) for inter-rater agreement. Sensitivity analyses excluded participants with high baseline scores or incomplete training engagement. No data imputation was performed.
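The regression, multicollinearity screening, and clustering steps can likewise be sketched in Python (the original analyses were run in R; variable names and the input file are assumptions, and the stepwise-AIC selection step is omitted for brevity):

```python
# Illustrative Python analogue of the predictor and profile analyses in Section 2.8.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("dialogue_baseline.csv")             # hypothetical analysis file
df["delta"] = df["post_total"] - df["pre_total"]      # improvement (delta-score)

# Multiple linear regression of delta-scores on baseline characteristics
model = smf.ols(
    "delta ~ empathy + digital_self_efficacy + gpa + C(gender) + prior_ai_use + motivation",
    data=df,
).fit()
print(model.summary())

# Variance inflation factors to screen for multicollinearity
X = model.model.exog
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)

# k-means (k = 3) on standardized baseline variables to recover learner profiles
baseline = df[["empathy", "digital_self_efficacy", "gpa", "motivation"]]
Z = StandardScaler().fit_transform(baseline)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
print(df.groupby("cluster")["delta"].agg(["mean", "std", "count"]))
```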

3. Results

3.1. Participant Flow and Baseline Characteristics

3.1.1. Participant Flow

Of the 41 medical students assessed for eligibility, 32 met the inclusion criteria and agreed to participate (Figure 1). All 32 students completed the baseline questionnaire and both pre-test diagnostic consultations with an AI-based virtual patient. Subsequently, all enrolled participants engaged in and completed the asynchronous AI remote training module. However, two students (6.3%) were lost to follow-up and did not attend the scheduled human standardized patient (SP) post-test. Therefore, 30 participants (93.8%) completed the full study protocol and were included in both the intention-to-treat (ITT) and per-protocol analyses.

3.1.2. Baseline Characteristics

Baseline characteristics for the enrolled cohort (n = 32) are summarized in Table 1. The mean age was 21.1 ± 4.1 years (range: 19–43), with 22 participants identifying as female (73.3%) and 8 as male (26.7%). Most students were in the fourth clinical semester (63.3%), with smaller proportions in semester 2 (13.3%), semester 6 (6.7%), and semester 7 (16.7%). No participant had previously received formal instruction in diagnostic communication, and none had completed an Objective Structured Clinical Examination (OSCE). Over two-thirds of students (70%) reported having ≥10 prior interactions with ChatGPT or similar large language models (LLMs), and 40% had previous experience communicating clinical findings to real patients. Digital self-efficacy was rated as high or very high by 43.3%, medium by 53.3%, and low by 3.3% of students. Laptops or desktop computers were the most frequently used devices for simulation (73.3%), followed by tablets and smartphones. Internet quality was self-rated as “good” (56.7%), “medium” (33.3%), or “poor” (10%). The median self-reported motivation to participate was 9 on a 10-point scale. Confidence in core communication skills varied: it was highest for explaining laboratory results (mean = 3.3 ± 0.7) and lowest for performing teach-back (mean = 2.3 ± 0.8). The mean total score on the Jefferson Scale of Empathy was 116 ± 15.

3.2. Baseline Diagnostic-Communication Performance

At baseline, the mean total rubric score was 48.83 ± 9.65 for Scenario 1 and 51.10 ± 9.82 for Scenario 2, showing a modest but statistically significant difference (p = 0.05). This suggests that some participants may have experienced procedural gains or growing familiarity with the simulation format between the two baseline encounters. Despite this overall difference, no significant variation was observed at the domain level (all p > 0.05), indicating general consistency across the eight communication domains. However, item-level comparisons revealed significant differences in a few key subskills. Participants scored higher in Scenario 2 when evaluated on their ability to assess the daily-life impact of diabetes (Item 4.3, p = 0.02), to check patient understanding through teach-back (Item 5.3, p = 0.01), and to close the consultation with empathy and follow-up planning (Item 7.3, p = 0.02). Additionally, the ability to provide a final summary (Item 7.1) approached significance (p = 0.05), suggesting a mild improvement in closure behavior during the second encounter. These differences likely reflect initial learning effects or adjustment to the AI-simulated environment rather than true intervention effects, given that no formal training had yet occurred. Across both scenarios, most participants demonstrated limited baseline communication proficiency. Based on total scores, 63% (19/30) of students were classified as low performers, 33% (10/30) as intermediate, and none as high performers. The lowest baseline scores were consistently observed in the subdomains of teach-back and goal negotiation. In contrast, active listening (Item 3.2) received the highest mean score (4.00 ± 0.18 in Scenario 1). However, this rating should be interpreted with caution, as it may reflect the structural nature of AI-mediated conversations. Unlike human interactions—where overlaps and interruptions are common—dialogues with ChatGPT occur in a turn-based format, which may have inadvertently facilitated uninterrupted responses and inflated perceived listening quality. A radar plot comparing mean domain scores between the two pre-test scenarios (Figure 2) visually confirms the similarity in participants’ communication profiles, with only minor fluctuations between domains. A detailed breakdown of scores by item and scenario is provided in Table 2.

3.3. Post-Intervention Diagnostic Performance and Pre–Post Comparison

3.3.1. Post-Test Performance Across Scenarios

After completing the AI-based intervention, participants underwent two live diagnostic consultations with standardized patients. The mean total rubric scores were 84.5 ± 17.8 for Scenario 1 and 88.9 ± 17.4 for Scenario 2 (p = 0.08), with no statistically significant difference between scenarios. As shown in Table 3 and Figure 3, domain-level analyses revealed consistent performance across both encounters. Slight variations emerged in patient greeting (D2.1, p = 0.01), active listening (D3.2, p = 0.01), and diabetes-specific plan explanation (D8.2, p = 0.02), but overall post-test profiles were comparable. The spider plot in Figure 4 illustrates the near-overlapping domain scores between scenarios, suggesting internal consistency and performance stability across post-test cases.

3.3.2. Pre–Post Intervention Gains

When comparing pre- and post-intervention results, participants demonstrated substantial and statistically significant improvements across all rubric domains. The total score increased from 49.96 ± 9.72 to 86.70 ± 17.56 (Δ = 36.74, 95% CI: 31.39 to 42.09, p < 0.001; Cohen’s d = 2.58), representing a very large effect size. As shown in Table 4, every domain showed meaningful gains, with the largest improvements observed in “Opening” (Δ = 1.92, d = 2.68), “Closure” (Δ = 1.79, d = 2.07), and “Diabetes-specific explanation” (Δ = 2.06, d = 2.95). The most improved subitems included teach-back (Item 5.3, Δ = 2.12), goal negotiation (6.3, Δ = 1.58), and empathetic closure (7.3, Δ = 1.77), all with large effect sizes. Figure 4 presents a radar plot overlaying pre- and post-intervention domain means, visually confirming significant gains across all communication areas. In parallel, Figure 5 displays a violin box plot contrasting the total rubric score distribution before and after training, highlighting both the upward shift in central tendency and reduced score dispersion after the intervention. Together, these findings indicate that the generative AI-based training module produced broad and meaningful enhancements in students’ diagnostic communication competencies, spanning both structural and affective elements of clinical dialogue.
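As a consistency check, the reported confidence interval can be recovered (to rounding) from the published mean gain and paired Cohen’s d alone, assuming n = 30 completers and d defined as the mean of the paired differences divided by their standard deviation: SD(Δ) ≈ 36.74/2.58 ≈ 14.2, SE ≈ 14.2/√30 ≈ 2.6, and 95% CI ≈ 36.74 ± 2.045 × 2.6 ≈ 31.4 to 42.1.

```python
# Back-of-the-envelope check of the reported 95% CI from the published summary
# statistics (assumes n = 30 paired observations; d = mean/SD of the differences).
from math import sqrt
from scipy import stats

mean_delta, cohens_d, n = 36.74, 2.58, 30
sd_delta = mean_delta / cohens_d          # ≈ 14.2
se = sd_delta / sqrt(n)                   # ≈ 2.60
t_crit = stats.t.ppf(0.975, df=n - 1)     # ≈ 2.045
print(round(mean_delta - t_crit * se, 1), round(mean_delta + t_crit * se, 1))  # ≈ 31.4 42.1
```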

3.4. Predictors of Improvement

To identify baseline factors associated with the magnitude of improvement in diagnostic-communication scores, we fitted a multiple linear regression model with the Δ-score (post–pre total rubric score) as the dependent variable. Candidate predictors were age, gender, school term, GPA, previous interaction with real patients (yes/no), baseline empathy (Jefferson Scale total score), digital self-efficacy (1–5 scale), self-reported motivation (1–10 scale), and prior use of ChatGPT or other LLMs (≥10 interactions vs. <10). The overall model was statistically significant (R² = 0.43, adjusted R² = 0.37; F(6, 23) = 4.13, p = 0.004), explaining approximately 37% of the variance in Δ-scores. Two variables emerged as independent predictors: baseline empathy was negatively associated with improvement (β = −0.41, SE = 0.13; 95% CI: −0.68 to −0.14; p = 0.005; Figure 6E), indicating that students with lower initial empathy scores tended to achieve larger gains, likely due to a ceiling effect among highly empathic learners. Conversely, digital self-efficacy was positively associated with improvement (β = 0.35, SE = 0.14; 95% CI: 0.07–0.63; p = 0.016; Figure 6D), suggesting that greater confidence with digital tools enhanced the effectiveness of GenAI training. Gender showed a marginal effect (p = 0.044; Figure 6G) in the bivariate analysis, with female students demonstrating higher median Δ-scores; however, this difference was no longer significant after adjusting for other covariates. No significant associations were observed for age (Figure 6A), school term (Figure 6B), GPA (Figure 6C), motivation (Figure 6F), prior patient contact (Figure 6H), or previous LLM use. Overall, these results indicate that learners entering the training with lower empathy yet stronger digital self-efficacy derived the greatest incremental benefit from GenAI-mediated communication practice, highlighting the combined influence of emotional baseline and technological readiness on learning outcomes. Figure 6 visualizes these relationships through scatterplots (panels A–F) and boxplots (panels G–H), underscoring the heterogeneity of individual learning trajectories.

3.5. Learner Profiles and Cluster Patterns

To examine differential response patterns, we performed unsupervised k-means clustering (k = 3) based on participants’ domain-level rubric scores before and after the intervention. The optimal k was confirmed with the elbow method and inspection of within-cluster sum-of-squares. Three discrete learner profiles emerged. Cluster A (n = 10) achieved the greatest mean improvement (Δ = 58.7 ± 10.2), followed by Cluster B (n = 12; Δ = 33.4 ± 11.5) and Cluster C (n = 8; Δ = 16.2 ± 9.3). Between-cluster differences in total Δ-score and every domain-specific Δ-score were significant (p < 0.001; Table 5), indicating meaningful heterogeneity in learning gains. Figure 7 shows domain-specific Cohen’s d values, all in the moderate-to-large range. The largest effects occurred in Domain 8 (diabetes-specific explanation, d = 2.95), Domain 2 (opening the encounter, d = 2.68), and Domain 5 (information sharing, d = 2.17). The heat map and dendrogram in Figure 8 depict these patterns: Cluster A displays uniformly high gains across all eight domains, whereas Cluster C shows marginal change, especially in “patient perspective” and “shared decision-making”. Baseline trait inspection revealed that Cluster A learners started with lower empathy but higher digital self-efficacy, consistent with predictors identified in Section 3.4; Cluster C showed the opposite pattern (higher empathy, lower self-efficacy). A scatterplot of baseline total rubric score versus Δ-score (Figure 9) demonstrated a weak inverse relationship (r = −0.22, p = 0.24), supporting a compensatory effect in which students with lower starting performance realized larger gains. Collectively, these findings suggest that GenAI-mediated practice may help narrow performance gaps by especially benefiting learners who begin with weaker communication skills yet possess the technological confidence to exploit AI feedback.
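The elbow check mentioned above can be illustrated with a short sketch; the input matrix of domain-level Δ-scores and its file name are assumptions for illustration.

```python
# Illustrative elbow-method inspection: within-cluster sum of squares (inertia)
# for k = 1..6 on standardized domain-level delta-scores; a bend near k = 3
# would support the three-cluster solution described above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

domain_deltas = np.loadtxt("domain_deltas.csv", delimiter=",")   # hypothetical (30 x 8) matrix
Z = StandardScaler().fit_transform(domain_deltas)
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).inertia_
           for k in range(1, 7)}
print(inertia)
```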

3.6. Post-Test Surveys

Upon completion of the post-test phase, all 30 students responded to a structured post-intervention questionnaire assessing perceived knowledge acquisition, self-efficacy in diagnostic communication, and the perceived usefulness of the GenAI training. The overall self-reported knowledge score averaged 4.2 ± 0.6 on a five-point Likert scale, while self-efficacy in delivering a diagnosis improved from a baseline mean of 2.9 ± 0.8 to 4.3 ± 0.7. Notably, 93.3% (28/30) of participants rated the GenAI practice sessions as “very useful” or “extremely useful” for reinforcing communication skills, and 86.7% (26/30) expressed willingness to integrate AI-based simulation into future clinical training. Open-ended responses highlighted three recurring themes: (1) increased comfort navigating emotionally charged conversations, (2) appreciation for structured feedback in natural language, and (3) a desire for broader case variety and real-time faculty debriefing in future iterations. Additionally, both evaluator and SP post-revelation surveys (administered after disclosure of AI use in training) revealed no detection bias; only one out of six evaluators and eight out of 24 SPs suspected that students had undergone prior AI-based preparation. Following disclosure, both groups acknowledged that the students’ communication had exceeded their expectations.

3.7. Adverse Events and Missing Data

No adverse events or psychological distress were reported by participants during or after the intervention. Two students who completed the full AI-based remote training withdrew before the post-test simulation due to scheduling conflicts and were excluded from final analyses. No technical failures occurred during pre- or post-test evaluations. Two minor discrepancies in rubric scoring (≥2-point difference between evaluators) were resolved through consensus with a third blinded reviewer. All data points from the 30 remaining participants were complete and included in the final analyses.

4. Discussion

This study provides robust evidence that a deliberately scaffolded educational intervention powered by generative artificial intelligence (GenAI) can significantly improve diagnostic communication skills among undergraduate medical students. The intervention yielded a substantial within-subject gain (Δ = +36.7 points on a 115-point rubric), with large effect sizes across all eight assessed domains. These results reflect a broad-based enhancement of both cognitive-structural and emotional-affective communication competencies rather than isolated improvements in discrete tasks. T2DM was intentionally selected as the diagnostic focus of all scenarios due to its high clinical prevalence, familiarity among medical students, and capacity to elicit both cognitive processing and empathic engagement. This choice likely facilitated student immersion and increased the ecological validity of the simulations, allowing participants to focus on communication delivery rather than clinical unfamiliarity. By anchoring the intervention in a universally relevant and emotionally resonant condition, the study maximized its potential to reveal true gains in communicative competence.
Interestingly, 73% of the participants were female, which closely reflects the actual gender distribution in the MD program at FES-Iztacala, where approximately 70% of enrolled students are women. Thus, the gender composition of the sample aligns with institutional demographics rather than selection bias. While female students are often reported to exhibit higher empathy scores and greater receptiveness to communication training, gender showed only a marginal effect on improvement in our multivariate model. Nonetheless, future studies may explore whether gender-related attitudes toward AI tools or communication style mediate training responsiveness. Importantly, our findings advance the current literature by moving beyond student self-report measures toward objective, evaluator-blinded behavioral assessments. Prior studies exploring LLM applications in medical education—such as Webb (2023) and Chiu et al. (2025)—have primarily reported increased learner confidence following GenAI-assisted simulations but lacked external validation through SP encounters or blinded human scoring (Chiu et al., 2025; Webb, 2023). In contrast, our study combines blinded, domain-based performance assessments with Spanish-language AI-patient dialogues, offering a culturally contextualized and psychometrically rigorous evaluation framework. Furthermore, while Webb employed automated scoring metrics and Chiu et al. focused on English-language communication exercises, our design uniquely integrates live standardized patients, a validated rubric, and a focus on real-time diagnostic disclosure in a Latin American medical context (Chiu et al., 2025; Webb, 2023). These methodological distinctions underscore the added value of our findings in terms of generalizability and educational depth. To our knowledge, this is the first study to generate such triangulated, rubric-based evidence in Spanish-speaking medical students, within a Latin American setting (Makoul, 2001a, 2001b).
Several pedagogical mechanisms may underlie the observed gains. First, the GenAI platform provided asynchronous, low-stakes opportunities for deliberate practice—a well-established driver of communication skill acquisition in simulation-based education. Second, immediate feedback in natural language allowed for dynamic self-correction without requiring constant faculty oversight.
Thematic analysis of over 300 AI-generated responses revealed alignment with communication best practices (e.g., “organize your explanation,” “validate patient concerns,” “check for understanding”), echoing established heuristics described by Nestel and Tierney (2007). These effects mirror findings from large-scale meta-analyses on simulation-based education in healthcare, such as that carried out by Cook et al. (2011), which underscores the importance of repeated practice and structured feedback in skill acquisition. Third, the virtual patient’s emotionally neutral tone likely created a psychologically safe space, encouraging students to experiment with empathetic strategies without fear of judgment—an effect consistent with early simulated patient training (Kalet et al., 2004).
Regression modeling confirmed an inverse relationship between baseline empathy (measured via the Jefferson Scale) and gains—students with lower initial empathy achieved larger improvements. Simultaneously, students with higher digital self-efficacy benefited more from the GenAI training. These predictors jointly explained over one-third of the variance in learning gains and resonate with constructs from the technology acceptance model (Davis, 1989), where perceived usability and personal relevance enhance engagement with digital tools. Interestingly, prior ChatGPT use, motivation level, and GPA were not significant predictors, suggesting that emotional and technological readiness may outweigh cognitive or experiential variables when learning with AI. These results carry practical implications: baseline learner profiles may serve as a foundation for tailoring feedback scaffolding, simulation complexity, or engagement thresholds in future GenAI deployments.
Our unsupervised clustering further highlighted the heterogeneity of learning trajectories. Three learner profiles were identified, with “Cluster A” students—characterized by lower baseline empathy and higher digital self-efficacy—demonstrating nearly fourfold higher gains compared to their peers. These findings suggest that GenAI is not universally transformative but particularly beneficial for specific learner subtypes. Future adaptive systems could dynamically adjust training sequences based on real-time profiling, maximizing pedagogical efficiency while personalizing learning trajectories.
From a broader curricular standpoint, our findings suggest that GenAI tools—if properly contextualized—can be feasibly integrated into clinical communication training in Mexican medical schools. Faculty shortages, limited access to standardized patients, and rigid curricula pose barriers to consistent development of communication skills. GenAI platforms, especially those adapted to local languages and clinical realities, offer a scalable, low-cost complement. However, successful adoption will require both technical infrastructure and institutional culture change. Some educators may view AI as a threat to pedagogical authority, while others may lack the AI literacy needed to evaluate feedback quality. Addressing this will require deliberate faculty training, transparency in AI decision logic, and curricular frameworks that combine human and machine feedback in meaningful ways.
Looking forward, the evolution of LLMs—such as the expected GPT-4.5 or GPT-5—may radically expand the capabilities of virtual patients. Upcoming models may integrate real-time voice, emotion recognition, and adaptive personas, allowing for multimodal simulation of complex encounters. These advances may offer unprecedented fidelity in diagnostic communication training. However, ethical challenges remain. Overreliance on AI feedback may desensitize learners or reinforce mechanical communication patterns. There is also a risk of depersonalization if students generalize from algorithmic interactions to real patient care. To mitigate these risks, GenAI tools must be embedded within reflective, supervised curricula that cultivate critical thinking, emotional sensitivity, and context-aware communication.
Despite its strengths, this study has several limitations. First, the absence of a control group limits causal inference. However, this was an intentional design choice for this first-phase feasibility and efficacy assessment of the DIALOGUE program. The magnitude and distribution of pre–post gains, combined with evaluator blinding and objective performance metrics, help mitigate—but do not eliminate—concerns about internal validity. Future iterations of the program will include a randomized controlled trial to more rigorously compare GenAI-based training against traditional simulation and didactic instruction. Subsequent phases of the DIALOGUE program will incorporate larger, multicenter samples to enhance statistical power and generalizability. Although evaluator blinding was implemented, participants were aware of the study phase, which may have introduced performance expectancy bias. Second, the short duration precludes conclusions about long-term retention or clinical transfer. Longitudinal follow-up will be incorporated in future studies to assess the persistence of skill gains over time. Future studies may also consider integrating standardized instruments such as the SEGUE framework to enhance transparency and facilitate comparison across educational settings.
Our intervention was deployed within a single institution, potentially limiting generalizability. Although the GenAI platform functioned reliably in Spanish, subtle limitations in semantic nuance or cultural appropriateness may have affected feedback quality. The natural language feedback from the generative AI was not independently validated by clinical educators, which limits conclusions about its educational adequacy. Additionally, while we standardized system prompts and temperature settings, the inherent stochasticity of LLMs introduces minor variations that may affect reproducibility. Future research should explore the impact of AI randomness, develop seed-logging protocols, and benchmark AI feedback against expert commentary.
Finally, while this study provides promising initial evidence of effectiveness, it does not compare GenAI-based feedback to traditional faculty-led instruction nor does it evaluate downstream clinical outcomes such as patient satisfaction or diagnostic accuracy. These remain important avenues for future research. Thus, caution is warranted when interpreting these findings as definitive. Nevertheless, our findings support the role of GenAI as a potentially scalable and psychometrically sound adjunct—not a replacement—for clinical communication training. Although the present intervention centered on T2DM, the framework may be adaptable to other specialties such as pediatrics, psychiatry, or gynecology. Future iterations should investigate domain-specific adaptations and their differential impact on communication skill acquisition.

5. Conclusions

This study demonstrates that generative AI can serve as a reliable, scalable tool to enhance diagnostic communication skills in medical students. A short, asynchronous GenAI-based intervention led to significant, domain-wide improvements, particularly among students with lower baseline empathy and higher digital self-efficacy. While not a replacement for human teaching, GenAI offers a promising pedagogical adjunct, especially in resource-limited settings. Future work should focus on long-term outcomes, integration into curricula, and alignment with ethical, cultural, and clinical standards.

Author Contributions

Conceptualization, R.X.S.-G. and H.I.S.-C.; data curation, J.C.-R. and H.I.S.-C.; formal analysis, J.C.-R. and H.I.S.-C.; investigation, R.X.S.-G., Q.C.-C., R.O.-P., S.V.-M., A.E.C.-R., E.Q.-L., C.A.R.-C., A.M.G.-G., J.C.-R., J.J.-R., N.G.A.-M., G.V.-H., T.E.C.-M., A.V.-F., S.L.-C., A.R.M.-C., B.O.J.-J. and H.I.S.-C.; methodology, H.I.S.-C.; project administration, B.O.J.-J.; resources, H.I.S.-C.; software, J.C.-R. and H.I.S.-C.; supervision, G.V.-H. and T.E.C.-M.; validation, A.R.M.-C. and H.I.S.-C.; visualization, H.I.S.-C.; writing—original draft, H.I.S.-C.; writing—review and editing, G.V.-H., T.E.C.-M., A.V.-F., S.L.-C., B.O.J.-J. and H.I.S.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Consejo Mexiquense de Ciencia y Tecnología (COMECYT), under grant FICDTEM-2023-131, and by the Universidad Nacional Autónoma de México (UNAM), through PAPIIT IA201725 and PAPIME 203825, awarded to H.I.S.-C., as well as PAPIIT IN201724, awarded to S.L.-C.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Ethics Committee of the Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México (protocol code CE/FESI/052025/1922, approval date: 9 May 2025). The ethical approval was granted for the full duration of the study (valid from 7 May 2025 to 7 May 2026).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study. Written informed consent included permission to analyze anonymized performance data and survey responses, as well as to publish findings in academic journals. No identifiable personal information is presented in this publication.

Data Availability Statement

The datasets generated and analyzed during this study—including anonymized rubric scores, survey instruments, AI prompt templates, simulation scripts, and training materials—are publicly available in the Open Science Framework (OSF) repository at https://doi.org/10.17605/OSF.IO/YJF39. Audio recordings and voice files are not publicly accessible due to privacy and ethical restrictions but may be reviewed upon reasonable request and approval by the institutional ethics committee.

Acknowledgments

The authors gratefully acknowledge the standardized-patient team from the Centro Internacional de Simulación y Entrenamiento en Soporte Vital Iztacala (CISESVI), Facultad de Estudios Superiores Iztacala, Universidad Nacional Autónoma de México, whose professionalism and consistency were essential to the integrity of the post-test assessments: Rebeca Aguilar-Sánchez, Itzel Antonino-Cruz, Lizbeth Nohemí Arrazola-Live, Jorge Omar Barragán-Hermosillo, Diego Alan Calderón-Trejo, Luis Fernando Cárdenas-Gasca, Rafael Alberto Chinchilla-Herrera, María Fernanda Cruz-Carrillo, Kevin Cuandón-Quintero, Diego Díaz-Hill, Guadalupe Itzel Díaz-Roa, Jennifer Denisse Flores-Islas, Andrea Abigail Galván-Becerra, Mariana García-Morales, Samantha Karina Gómez-Islas, Yanireth Israde-Osnaya, Aida Esmeralda Juárez-Cabello, Luis Alberto Juárez-Fernández, María Elena Lara-Rodríguez, Juan Jesús López-Jiménez, Emiliano Federico Martínez-Franco, Roberto Carlos Meza-Hernández, Andrea Alejandra Murillo-Maya, Abigail Rosales-Pimentel, Luis Fernando Serrano-Trejo, Zaira Ixchel Silva-López, Evans Antonio Tierra-Lazcano, Karla Jacqueline Velázquez-Reyes, and Lourdes Aylin Villeda-Villa. We also thank the administrative staff of the Facultad de Estudios Superiores Iztacala for logistical support and the CISESVI technical team for providing audiovisual equipment and secure data storage. During the preparation of this study, the authors used ChatGPT (GPT-4o, OpenAI, May 2025 release) to generate virtual-patient dialogue and produce automated formative feedback. The authors have critically reviewed and edited all AI-generated output and take full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
AIC: Akaike information criterion
API: Application programming interface
GenAI: Generative artificial intelligence
GPA: Grade point average
GPT-4: Generative Pretrained Transformer 4
ICC: Intraclass correlation coefficient
ITT: Intention-to-treat
JSE: Jefferson Scale of Empathy
LLM: Large language model
OSCE: Objective structured clinical examination
SE: Standard error
SP: Standardized patient
SPIKES: Setting, perception, invitation, knowledge, emotions, strategy (protocol for breaking bad news)
T2DM: Type 2 diabetes mellitus
VIF: Variance inflation factor
WA: Weighted average

References

  1. American Diabetes Association Professional Practice Committee. (2025). 4. Comprehensive medical evaluation and assessment of comorbidities: Standards of care in diabetes-2025. Diabetes Care, 48(Suppl. S1), S59–S85.
  2. Chiu, J., Castro, B., Ballard, I., Nelson, K., Zarutskie, P., Olaiya, O. K., Song, D., & Zhao, Y. (2025). Exploration of the role of ChatGPT in teaching communication skills for medical students: A pilot study. Medical Science Educator.
  3. Chokkakula, S., Chong, S., Yang, B., Jiang, H., Yu, J., Han, R., Attitalla, I. H., Yin, C., & Zhang, S. (2025). Quantum leap in medical mentorship: Exploring ChatGPT's transition from textbooks to terabytes. Frontiers in Medicine, 12, 1517981.
  4. Cook, D. A., Hatala, R., Brydges, R., Zendejas, B., Szostek, J. H., Wang, A. T., Erwin, P. J., & Hamstra, S. J. (2011). Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. JAMA, 306(9), 978–988.
  5. Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319–340.
  6. Elendu, C., Amaechi, D. C., Okatta, A. U., Amaechi, E. C., Elendu, T. C., Ezeh, C. P., & Elendu, I. D. (2024). The impact of simulation-based training in medical education: A review. Medicine, 103(27), e38813.
  7. Feng, S., & Shen, Y. (2023). ChatGPT and the future of medical education. Academic Medicine, 98(8), 867–868.
  8. Hopewell, S., Chan, A.-W., Collins, G. S., Hróbjartsson, A., Moher, D., Schulz, K. F., Tunn, R., Aggarwal, R., Berkwits, M., Berlin, J. A., Bhandari, N., Butcher, N. J., Campbell, M. K., Chidebe, R. C. W., Elbourne, D., Farmer, A., Fergusson, D. A., Golub, R. M., Goodman, S. N., … Boutron, I. (2025). CONSORT 2025 statement: Updated guideline for reporting randomised trials. BMJ, 389, e081123.
  9. Kalet, A., Pugnaire, M. P., Cole-Kelly, K., Janicik, R., Ferrara, E., Schwartz, M. D., Lipkin, M., Jr., & Lazare, A. (2004). Teaching communication in clinical clerkships: Models from the Macy Initiative in Health Communications. Academic Medicine, 79(6), 511–520.
  10. Karagiannis, T., Advani, A., Davies, M. J., Del Prato, S., Dinneen, S. F., Kuss, O., Lorenzoni, V., Mathieu, C., Mauricio, D., Tankova, T., Zaccardi, F., Holt, R. I. G., & Tsapas, A. (2025). European Association for the Study of Diabetes (EASD) standard operating procedure for the development of guidelines. Diabetologia, 68(8), 1600–1615.
  11. Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health, 2(2), e0000198.
  12. Makoul, G. (2001a). Essential elements of communication in medical encounters: The Kalamazoo consensus statement. Academic Medicine, 76(4), 390–393.
  13. Makoul, G. (2001b). The SEGUE Framework for teaching and assessing communication skills. Patient Education and Counseling, 45(1), 23–34.
  14. Moezzi, M., Rasekh, S., Zare, E., & Karimi, M. (2024). Evaluating clinical communication skills of medical students, assistants, and professors. BMC Medical Education, 24(1), 19.
  15. Nestel, D., & Tierney, T. (2007). Role-play for medical students learning about communication: Guidelines for maximising benefits. BMC Medical Education, 7(1), 3.
  16. Öncü, S., Torun, F., & Ülkü, H. H. (2025). AI-powered standardised patients: Evaluating ChatGPT-4o's impact on clinical case management in intern physicians. BMC Medical Education, 25(1), 278.
  17. Poei, D. M., Tang, M. N., Kwong, K. M., Sakai, D. H. T., Choi, S. Y. T., & Chen, J. J. (2022). Increasing medical students' confidence in delivering bad news using different teaching modalities. Hawaii Journal of Health and Social Welfare, 81(11), 302–308.
  18. Sallam, M. (2023). ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare, 11(6), 887.
  19. Santos, M. S., Cunha, L. M., Ferreira, A. J., & Drummond-Lage, A. P. (2025). From classroom to clinic: Addressing gaps in teaching and perceived preparedness for breaking bad news in medical education. BMC Medical Education, 25(1), 449.
  20. Scherr, R., Halaseh, F. F., Spina, A., Andalib, S., & Rivera, R. (2023). ChatGPT interactive medical simulations for early clinical education: Case study. JMIR Medical Education, 9, e49877.
  21. Shorey, S., Mattar, C., Pereira, T. L., & Choolani, M. (2024). A scoping review of ChatGPT's role in healthcare education and research. Nurse Education Today, 135, 106121.
  22. Skryd, A., & Lawrence, K. (2024). ChatGPT as a tool for medical education and clinical decision-making on the wards: Case study. JMIR Formative Research, 8, e51346.
  23. Webb, J. J. (2023). Proof of concept: Using ChatGPT to teach emergency physicians how to break bad news. Cureus, 15(5), e38755.
  24. Weisman, D., Sugarman, A., Huang, Y. M., Gelberg, L., Ganz, P. A., & Comulada, W. S. (2025). Development of a GPT-4-powered virtual simulated patient and communication training platform for medical students to practice discussing abnormal mammogram results with patients: Multiphase study. JMIR Formative Research, 9, e65670.
Figure 1. Participant flow through the DIALOGUE pre–post study, based on the 2025 CONSORT diagram. The figure outlines the number of students assessed for eligibility, enrolled, exposed to the AI-based remote training intervention, and completing the SP post-test. Final analyses were conducted on all participants who completed the protocol (Hopewell et al., 2025).
Figure 2. Radar plot comparing mean baseline diagnostic communication scores across eight rubric domains for Scenario 1 and Scenario 2. Each axis represents one domain from the adapted Kalamazoo rubric: (D1) relationship and rapport, (D2) opening the encounter, (D3) information gathering, (D4) exploring patient perspective, (D5) information sharing, (D6) plan negotiation, (D7) closure, and (D8) diabetes specific explanation. Colored lines indicate mean Likert scale scores (range 1–5): green for Scenario 1 and purple for Scenario 2. Pre-intervention profiles were broadly similar across domains.
Figure 3. Radar plot comparing mean post-intervention diagnostic communication scores across eight rubric domains for Scenario 1 and Scenario 2. Each axis represents one domain from the adapted Kalamazoo rubric: (D1) relationship and rapport, (D2) opening the encounter, (D3) information gathering, (D4) exploring patient perspective, (D5) information sharing, (D6) plan negotiation, (D7) closure, and (D8) diabetes specific explanation. Colored lines indicate mean Likert scale scores (range 1–5): green for Scenario 1 and purple for Scenario 2. Post-intervention profiles show consistent improvement across domains.
Figure 4. Radar plot comparing mean post-intervention diagnostic communication scores across eight rubric domains for Scenario 1 and Scenario 2. Each axis represents one domain from the adapted Kalamazoo rubric: (D1) relationship and rapport, (D2) opening the encounter, (D3) information gathering, (D4) exploring patient perspective, (D5) information sharing, (D6) plan negotiation, (D7) closure, and (D8) diabetes specific explanation. Colored lines indicate mean Likert scale scores (range 1–5): orange for Scenario 1 and green for Scenario 2. Post-intervention profiles were broadly consistent across scenarios.
Figure 5. Violin plot comparing pre- and post-intervention diagnostic communication scores. Each dot represents an individual participant’s total score (range: 23–115), derived from the adapted Kalamazoo rubric. Violin plots illustrate the distribution and density of scores; overlaid boxplots show the median (horizontal bar) and interquartile range (box). Vertical lines indicate score range (minimum–maximum). Orange denotes pre-intervention scores; green denotes post-intervention scores. Asterisks (***) indicate statistically significant improvement after the intervention (paired-samples t-test, p < 0.001).
Figure 6. Associations between baseline predictors and improvement in diagnostic-communication scores (Δ-score) among medical students (n = 30). Scatterplots (top and middle rows) display bivariate Pearson correlations between continuous baseline variables and the change in total rubric score (Δ-score = post − pre): (A) age (years), (B) school term, (C) grade point average (GPA), (D) digital self-efficacy (1–5 scale), (E) empathy score (Jefferson Scale of Empathy, JSE), and (F) self-reported motivation (1–10 scale). Solid blue lines represent linear regression trends; shaded areas show 95% confidence intervals. Box-and-whisker plots (bottom row) compare Δ-scores by (G) gender (female vs. male) and (H) prior patient interaction (yes vs. no). Each dot denotes an individual student. Only higher baseline empathy was significantly associated with lower Δ-scores in the bivariate analysis (r = −0.45, p = 0.013); gender showed a marginal difference (p = 0.044). No significant bivariate correlations were observed for age, school term, GPA, digital self-efficacy, motivation, or prior patient contact. (In the multivariate model reported in Section 3.4, digital self-efficacy emerged as an additional positive predictor of improvement.)
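For readers who wish to reproduce this type of analysis, the sketch below shows how Δ-scores and the bivariate associations in Figure 6 could be computed in Python; it is a minimal illustration, not the authors' analysis code, and the file and column names (dialogue_scores.csv, pre_total, post_total, empathy_jse, etc.) are hypothetical.

```python
# Minimal sketch (assumed data layout): one row per student with pre/post totals
# and baseline predictors; compute Δ-scores and bivariate tests as in Figure 6.
import pandas as pd
from scipy import stats

df = pd.read_csv("dialogue_scores.csv")           # hypothetical file name
df["delta"] = df["post_total"] - df["pre_total"]  # Δ-score = post − pre

# Pearson correlations between each continuous baseline predictor and Δ-score
for predictor in ["age", "school_term", "gpa",
                  "digital_self_efficacy", "empathy_jse", "motivation"]:
    r, p = stats.pearsonr(df[predictor], df["delta"])
    print(f"{predictor}: r = {r:.2f}, p = {p:.3f}")

# Group comparison for a binary baseline variable (e.g., prior patient contact)
yes = df.loc[df["prior_patient_contact"] == 1, "delta"]
no = df.loc[df["prior_patient_contact"] == 0, "delta"]
t, p = stats.ttest_ind(yes, no)
print(f"prior patient contact: t = {t:.2f}, p = {p:.3f}")
```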
Figure 7. Forest plot showing effect sizes (Cohen’s d) across the eight diagnostic communication domains. All values reflect the magnitude of improvement from pre- to post-intervention. Dots represent the mean effect size per domain; horizontal lines indicate 95% confidence intervals. All domains demonstrated moderate to large effects, with the highest observed in D8 (diabetes specific explanation), D2 (opening), and D5 (information sharing).
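As an illustration of how the domain-level effect sizes in Figure 7 can be obtained, the sketch below computes a paired-samples Cohen's d (mean of the pre–post differences divided by their standard deviation) with a bootstrap 95% confidence interval. This is one common convention, offered as an assumption rather than a description of the authors' exact procedure; the example data are simulated.

```python
# Sketch: paired Cohen's d with a bootstrap 95% CI (assumed convention, simulated data).
import numpy as np

rng = np.random.default_rng(0)

def paired_cohens_d(pre, post):
    diff = np.asarray(post) - np.asarray(pre)
    return diff.mean() / diff.std(ddof=1)

def bootstrap_ci(pre, post, n_boot=5000, alpha=0.05):
    pre, post = np.asarray(pre), np.asarray(post)
    n = len(pre)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample students with replacement
        boots.append(paired_cohens_d(pre[idx], post[idx]))
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

pre = rng.normal(1.9, 0.6, 30)                      # illustrative domain scores, 30 students
post = rng.normal(3.8, 0.8, 30)
d = paired_cohens_d(pre, post)
lo, hi = bootstrap_ci(pre, post)
print(f"d = {d:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```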
Figure 8. Cluster heatmap and hierarchical dendrogram of learner profiles based on post-intervention diagnostic communication performance. Each row represents an individual student (N = 30), and each column represents one of the eight communication domains from the adapted Kalamazoo rubric: (D1) relationship and rapport, (D2) opening the encounter, (D3) information gathering, (D4) patient perspective, (D5) information sharing, (D6) plan negotiation, (D7) closure, and (D8) diabetes specific explanation. Scores were standardized (z-score transformation) to enable clustering and visualization. A hierarchical cluster analysis using Euclidean distance and complete linkage identified three distinct learner profiles: Cluster A (n = 10; high responders), Cluster B (n = 12; intermediate responders), and Cluster C (n = 8; low responders). The color gradient reflects standardized performance: red indicates higher relative scores and blue indicates lower relative scores across each domain. This visualization allows comparison of performance patterns across individuals and highlights consistent strengths or deficits within clusters.
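A minimal sketch of the clustering pipeline described in this caption (z-score standardization followed by hierarchical clustering with Euclidean distance and complete linkage, cut at three clusters) is given below; the 30 × 8 score matrix is simulated purely for illustration.

```python
# Sketch: hierarchical clustering of standardized domain scores (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

rng = np.random.default_rng(1)
scores = rng.uniform(1, 5, size=(30, 8))           # 30 students x 8 rubric domains (simulated)

z = zscore(scores, axis=0)                          # standardize each domain across students
Z = linkage(z, method="complete", metric="euclidean")
clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters

for c in (1, 2, 3):
    print(f"Cluster {c}: n = {(clusters == c).sum()}")
```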
Figure 9. Association between baseline communication performance and improvement following the GenAI-based intervention. The scatterplot displays the relationship between pre-intervention rubric scores (x-axis) and Δ-scores (y-axis), calculated as post−pre total rubric score. Each point represents one student (N = 30), color-coded by cluster membership (Cluster A = red, B = pink, C = blue). The dashed regression line represents the best linear fit, and the shaded area indicates the 95% confidence interval. A negative trend was observed, suggesting that students with lower initial scores experienced greater improvements.
Table 1. Baseline sociodemographic, academic, and digital profile of enrolled medical students (n = 30).
Variable | Category/Units | N (%) or Mean ± SD
Age | Years | 21.1 ± 4.1
Sex | Female | 22 (73.33%)
 | Male | 8 (26.66%)
Clinical semester | 2 | 4 (13.33%)
 | 4 | 19 (63.33%)
 | 6 | 2 (6.66%)
 | 7 | 5 (16.66%)
Cumulative GPA | 0–10 | 8.31 ± 0.50
Prior formal course in diagnostic communication | Yes | 0 (0%)
Prior OSCE completed | Yes | 0 (0%)
Prior experience with real patients | Yes | 12 (40.00%)
Prior ChatGPT/LLM use | ≥10 interactions | 21 (70%)
 | <10 interactions | 9 (30%)
 | Never | 1 (3.33%)
Digital self-efficacy † | Very high/high | 13 (43.33%)
 | Medium | 16 (53.33%)
 | Low | 1 (3.33%)
Device most used for simulation | Laptop/PC | 22 (73.33%)
 | Tablet | 4 (13.33%)
 | Smartphone | 4 (13.33%)
Self-rated internet quality | Good | 17 (56.66%)
 | Medium | 10 (32.25%)
 | Poor | 3 (10%)
Motivation score (1–10) | 0–10 | 9
Self-confidence score, 1–5 | Explain lab results | 3.3 ± 0.7
 | Explain DM criteria | 3.0 ± 0.8
 | Convey bad news empathetically | 3.2 ± 0.8
 | Teach-back | 2.3 ± 0.8
 | Anxiety reduction | 2.4 ± 0.7
Jefferson Scale of Empathy, total (20–140) | – | 116 ± 15
† Self-efficacy categories derived from a 5-point Likert scale (very low = 1, very high = 5).
Table 2. Baseline diagnostic-communication scores obtained during the two pre-test scenarios (n = 30).
Item | Domain/Skill (Abridged) | Scenario 1 Mean ± SD | Scenario 2 Mean ± SD | p
1 | Relationship | 3.10 ± 0.58 | 3.15 ± 0.55 | 0.50
1.1 | Eye contact and open posture | 3.86 ± 0.73 | 3.83 ± 0.79 | 0.74
1.2 | Friendly body language | 3.46 ± 0.86 | 3.56 ± 0.72 | 0.41
1.3 | Self-introduction and role | 2.00 ± 1.16 | 2.06 ± 1.20 | 0.70
2 | Opening | 1.81 ± 0.71 | 1.92 ± 0.69 | 0.22
2.1 | Greeting and ID verification | 1.73 ± 0.63 | 1.63 ± 0.71 | 0.18
2.2 | States purpose of visit | 1.90 ± 0.84 | 2.13 ± 0.89 | 0.06
2.3 | Explores patient expectations | 1.80 ± 0.98 | 2.00 ± 0.94 | 0.22
3 | Information gathering | 2.71 ± 0.64 | 2.58 ± 0.69 | 0.12
3.1 | Uses open questions | 2.00 ± 1.01 | 2.30 ± 0.99 | 0.76
3.2 | Listens without interrupting | 4.00 ± 0.18 | 3.90 ± 0.54 | 0.32
3.3 | Summarizes to confirm | 2.13 ± 1.04 | 1.83 ± 1.11 | 0.08
4 | Patient perspective | 1.91 ± 0.85 | 2.00 ± 0.84 | 0.36
4.1 | Elicits beliefs about illness | 1.80 ± 1.06 | 1.70 ± 1.02 | 0.57
4.2 | Explores worries/concerns | 2.13 ± 0.97 | 1.96 ± 1.03 | 0.23
4.3 | Assesses daily-life impact | 1.80 ± 0.92 | 2.33 ± 1.12 | 0.02
5 | Information sharing | 1.76 ± 0.76 | 2.05 ± 0.83 | 0.18
5.1 | Explains diagnosis in plain language | 1.80 ± 1.06 | 1.70 ± 1.02 | 0.57
5.2 | Uses visual aids/examples | 2.13 ± 0.97 | 1.96 ± 1.03 | 0.23
5.3 | Checks understanding (teach-back) | 1.80 ± 0.92 | 2.33 ± 1.12 | 0.01
6 | Plan negotiation | 2.03 ± 0.65 | 2.03 ± 0.53 | 1.00
6.1 | Discusses therapeutic options | 2.36 ± 0.85 | 2.10 ± 0.99 | 0.07
6.2 | Involves patient in decisions | 2.06 ± 0.90 | 2.16 ± 0.74 | 0.44
6.3 | Negotiates realistic goals | 1.66 ± 0.80 | 1.83 ± 0.83 | 0.20
7 | Closure | 1.67 ± 0.62 | 2.01 ± 0.59 | 0.06
7.1 | Provides final summary | 1.63 ± 0.71 | 1.86 ± 0.81 | 0.05
7.2 | Checks for residual questions | 1.53 ± 0.93 | 1.86 ± 0.93 | 0.07
7.3 | Closes with empathy and follow-up | 1.86 ± 0.97 | 2.30 ± 0.83 | 0.02
8 | Diabetes specific | 1.88 ± 0.56 | 1.93 ± 0.56 | 0.73
8.1 | Explains lab results (HbA1c) | 1.80 ± 0.71 | 1.90 ± 0.71 | 0.50
8.2 | Provides initial plan and alleviates anxiety | 1.96 ± 0.71 | 1.90 ± 0.75 | 0.57
Total | Overall score (23–115) | 48.83 ± 9.65 | 51.1 ± 9.82 | 0.05
Paired-samples t-tests; Shapiro–Wilk confirmed normality for all difference distributions. Significant differences (p < 0.05) were observed in Items 4.3, 5.3, and 7.3.
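The scenario-level comparisons in Tables 2 and 3 follow a standard paired-sample workflow. The sketch below, using simulated arrays rather than the study data, shows how the Shapiro–Wilk check on the difference scores and the subsequent paired t-test could be run.

```python
# Sketch: normality check on paired differences, then a paired t-test (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
scenario1 = rng.normal(2.0, 0.8, 30)   # per-student scores, scenario 1 (illustrative)
scenario2 = rng.normal(2.1, 0.8, 30)   # per-student scores, scenario 2 (illustrative)

diff = scenario2 - scenario1
w, p_normal = stats.shapiro(diff)                  # Shapiro-Wilk on the paired differences
t, p_paired = stats.ttest_rel(scenario2, scenario1)
print(f"Shapiro-Wilk p = {p_normal:.3f}; paired t = {t:.2f}, p = {p_paired:.3f}")
```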
Table 3. Diagnostic-communication scores obtained during the two post-test scenarios (N = 30). Scores were broadly comparable across scenarios; only isolated items reached statistical significance (Items 2.1, 3.2, and 8.2 and Domain 8; p < 0.05, paired t-tests).
Item | Domain/Skill (Abridged) | Scenario 1 Mean ± SD | Scenario 2 Mean ± SD | p
1 | Relationship | 4.05 ± 0.63 | 4.15 ± 0.67 | 0.36
1.1 | Eye contact and open posture | 4.15 ± 0.54 | 4.26 ± 0.86 | 0.21
1.2 | Friendly body language | 4.26 ± 0.55 | 4.26 ± 0.62 | 1.00
1.3 | Self-introduction and role | 3.85 ± 0.84 | 3.81 ± 0.86 | 0.82
2 | Opening | 3.71 ± 0.64 | 3.86 ± 0.81 | 0.25
2.1 | Greeting and ID verification | 3.81 ± 0.71 | 4.35 ± 0.84 | 0.01
2.2 | States purpose of visit | 3.68 ± 0.78 | 3.93 ± 0.96 | 0.11
2.3 | Explores patient expectations | 3.63 ± 0.81 | 3.30 ± 1.04 | 0.16
3 | Information gathering | 3.86 ± 0.60 | 4.01 ± 0.68 | 0.19
3.1 | Uses open questions | 3.88 ± 0.67 | 3.96 ± 0.00 | 0.56
3.2 | Listens without interrupting | 4.21 ± 0.31 | 4.43 ± 0.40 | 0.01
3.3 | Summarizes to confirm | 3.50 ± 1.09 | 3.63 ± 1.12 | 0.08
4 | Patient perspective | 3.43 ± 0.62 | 3.61 ± 0.54 | 0.31
4.1 | Elicits beliefs about illness | 3.21 ± 0.78 | 3.15 ± 0.82 | 0.79
4.2 | Explores worries/concerns | 3.91 ± 0.57 | 3.60 ± 0.53 | 0.11
4.3 | Assesses daily-life impact | 3.50 ± 0.75 | 3.76 ± 0.52 | 0.15
5 | Information sharing | 3.43 ± 1.05 | 3.61 ± 1.04 | 0.31
5.1 | Explains diagnosis in plain language | 3.21 ± 1.09 | 3.15 ± 0.97 | 0.79
5.2 | Uses visual aids/examples | 3.91 ± 1.16 | 3.60 ± 1.10 | 0.11
5.3 | Checks understanding (teach-back) | 3.50 ± 1.23 | 3.76 ± 1.16 | 0.15
6 | Plan negotiation | 3.41 ± 0.95 | 3.55 ± 1.00 | 0.34
6.1 | Discusses therapeutic options | 3.53 ± 0.97 | 3.70 ± 0.90 | 0.42
6.2 | Involves patient in decisions | 3.43 ± 1.11 | 3.55 ± 1.16 | 0.37
6.3 | Negotiates realistic goals | 3.26 ± 1.15 | 3.40 ± 1.10 | 0.47
7 | Closure | 3.53 ± 1.11 | 3.74 ± 0.99 | 0.14
7.1 | Provides final summary | 3.30 ± 1.24 | 3.46 ± 0.93 | 0.36
7.2 | Checks for residual questions | 3.62 ± 0.97 | 3.75 ± 1.02 | 0.41
7.3 | Closes with empathy and follow-up | 3.68 ± 1.28 | 4.01 ± 1.10 | 0.06
8 | Diabetes specific | 3.80 ± 0.81 | 4.12 ± 0.79 | 0.03
8.1 | Explains lab results (HbA1c) | 3.81 ± 0.80 | 4.05 ± 0.71 | 0.12
8.2 | Provides initial plan and alleviates anxiety | 3.78 ± 0.97 | 4.20 ± 0.74 | 0.02
Total | Overall score (23–115) | 84.46 ± 17.75 | 88.93 ± 17.37 | 0.08
Paired-samples t-test; Shapiro–Wilk confirmed normality for all difference distributions.
Table 4. Pre–post comparison of diagnostic communication scores across rubric domains (N = 30). Mean scores, mean differences (Δ), 95% confidence intervals, effect sizes (Cohen's d), and p-values are shown for each rubric item comparing pre- and post-intervention performance. All differences were statistically significant (paired t-tests; all p ≤ 0.004).
Item | Domain/Skill (Abridged) | Pre-Intervention Mean ± SD | Post-Intervention Mean ± SD | Δ Mean | 95% CI (Δ) | Cohen's d | p
1 | Relationship | 3.13 ± 0.56 | 4.10 ± 0.65 | 0.97 | 0.74–1.20 | 1.59 | 0.001
1.1 | Eye contact and open posture | 3.85 ± 0.75 | 4.20 ± 0.60 | 0.35 | 0.09–0.61 | 0.51 | 0.004
1.2 | Friendly body language | 3.51 ± 0.79 | 4.26 ± 0.58 | 0.75 | 0.49–1.01 | 1.08 | 0.001
1.3 | Self-introduction and role | 2.03 ± 1.19 | 3.83 ± 0.84 | 1.80 | 1.41–2.19 | 1.74 | 0.001
2 | Opening | 1.86 ± 0.70 | 3.78 ± 0.73 | 1.92 | 1.65–2.19 | 2.68 | 0.001
2.1 | Greeting and ID verification | 1.68 ± 0.67 | 4.08 ± 0.83 | 2.40 | 2.12–2.68 | 3.18 | 0.001
2.2 | States purpose of visit | 2.01 ± 0.87 | 3.80 ± 0.87 | 1.79 | 1.46–2.12 | 2.05 | 0.001
2.3 | Explores patient expectations | 1.90 ± 0.96 | 3.46 ± 0.94 | 1.56 | 1.20–1.92 | 1.64 | 0.001
3 | Information gathering | 2.65 ± 0.56 | 3.93 ± 0.64 | 1.28 | 1.05–1.51 | 2.12 | 0.001
3.1 | Uses open questions | 2.01 ± 0.99 | 3.92 ± 0.68 | 1.91 | 1.59–2.23 | 2.24 | 0.001
3.2 | Listens without interrupting | 3.95 ± 0.38 | 4.32 ± 0.37 | 0.37 | 0.23–0.51 | 0.98 | 0.001
3.3 | Summarizes to confirm | 1.98 ± 1.11 | 3.56 ± 1.09 | 1.58 | 1.17–1.99 | 1.43 | 0.001
4 | Patient perspective | 1.95 ± 0.56 | 3.52 ± 1.04 | 1.57 | 1.25–1.89 | 1.87 | 0.001
4.1 | Elicits beliefs about illness | 1.75 ± 1.03 | 3.18 ± 1.13 | 1.43 | 1.02–1.84 | 1.32 | 0.001
4.2 | Explores worries/concerns | 2.05 ± 0.99 | 3.75 ± 1.13 | 1.70 | 1.30–2.10 | 1.60 | 0.001
4.3 | Assesses daily-life impact | 2.06 ± 1.05 | 3.63 ± 1.18 | 1.57 | 1.15–1.99 | 1.40 | 0.001
5 | Information sharing | 1.91 ± 0.56 | 3.57 ± 0.92 | 1.66 | 1.37–1.95 | 2.17 | 0.001
5.1 | Explains diagnosis in plain language | 2.28 ± 1.09 | 3.54 ± 1.05 | 1.26 | 0.86–1.66 | 1.17 | 0.001
5.2 | Uses visual aids/examples | 1.91 ± 0.99 | 3.06 ± 1.33 | 1.15 | 0.71–1.59 | 0.98 | 0.001
5.3 | Checks understanding (teach-back) | 1.53 ± 0.92 | 3.65 ± 1.10 | 2.12 | 1.74–2.50 | 2.09 | 0.001
6 | Plan negotiation | 2.03 ± 0.56 | 3.48 ± 0.97 | 1.45 | 1.15–1.75 | 1.83 | 0.001
6.1 | Discusses therapeutic options | 2.23 ± 0.92 | 3.61 ± 1.00 | 1.38 | 1.02–1.74 | 1.43 | 0.001
6.2 | Involves patient in decisions | 2.11 ± 0.82 | 3.49 ± 1.13 | 1.38 | 1.01–1.75 | 1.39 | 0.001
6.3 | Negotiates realistic goals | 1.75 ± 0.81 | 3.33 ± 1.11 | 1.58 | 1.21–1.95 | 1.62 | 0.001
7 | Closure | 1.84 ± 0.62 | 3.63 ± 1.05 | 1.79 | 1.46–2.12 | 2.07 | 0.001
7.1 | Provides final summary | 1.75 ± 0.77 | 3.38 ± 1.24 | 1.63 | 1.24–2.02 | 1.57 | 0.001
7.2 | Checks for residual questions | 1.70 ± 0.94 | 3.68 ± 0.99 | 1.98 | 1.62–2.34 | 2.05 | 0.001
7.3 | Closes with empathy and follow-up | 2.08 ± 0.92 | 3.85 ± 1.19 | 1.77 | 1.37–2.17 | 1.66 | 0.001
8 | DM-specific | 1.90 ± 0.56 | 3.96 ± 0.81 | 2.06 | 1.80–2.32 | 2.95 | 0.001
8.1 | Explains lab results (HbA1c) | 1.85 ± 0.70 | 3.93 ± 0.88 | 2.08 | 1.78–2.38 | 2.61 | 0.001
8.2 | Provides initial plan and alleviates anxiety | 1.93 ± 0.73 | 3.99 ± 0.89 | 2.06 | 1.75–2.37 | 2.53 | 0.001
Total | Overall score (23–115) | 49.96 ± 9.72 | 86.70 ± 17.56 | 36.74 | 31.39–42.09 | 2.58 | 0.001
Table 5. Improvement in diagnostic-communication scores (Δ = post − pre) for the total rubric and each domain, compared across the three learner clusters identified in the hierarchical cluster analysis (n = 30).
Outcome Variable | Cluster A (n = 10) | Cluster B (n = 12) | Cluster C (n = 8) | p
Δ-score (total rubric) | 58.7 ± 10.2 | 33.4 ± 11.5 | 16.2 ± 9.3 | <0.001
Relationship (Domain 1) | 1.23 ± 0.40 | 0.68 ± 0.28 | 0.31 ± 0.17 | <0.001
Opening (Domain 2) | 1.79 ± 0.52 | 0.96 ± 0.41 | 0.37 ± 0.25 | <0.001
Information gathering (Domain 3) | 1.42 ± 0.61 | 0.89 ± 0.36 | 0.34 ± 0.30 | <0.001
Patient perspective (Domain 4) | 1.51 ± 0.77 | 0.79 ± 0.45 | 0.21 ± 0.33 | <0.001
Information sharing (Domain 5) | 1.76 ± 0.91 | 1.03 ± 0.59 | 0.45 ± 0.38 | <0.001
Plan negotiation (Domain 6) | 1.55 ± 0.87 | 0.92 ± 0.44 | 0.33 ± 0.28 | <0.001
Closure (Domain 7) | 1.62 ± 0.89 | 0.87 ± 0.53 | 0.41 ± 0.26 | <0.001
DM-specific (Domain 8) | 2.04 ± 0.80 | 1.07 ± 0.62 | 0.54 ± 0.41 | <0.001
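The cross-cluster comparisons in Table 5 can be reproduced with an omnibus test on each outcome. The sketch below uses a one-way ANOVA as an assumed choice (the specific test is not stated in this excerpt) over illustrative Δ-score vectors simulated to match the cluster summaries.

```python
# Sketch: omnibus comparison of Δ-scores across the three clusters (assumed one-way ANOVA).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
cluster_a = rng.normal(58.7, 10.2, 10)   # illustrative values matching Table 5 summaries
cluster_b = rng.normal(33.4, 11.5, 12)
cluster_c = rng.normal(16.2, 9.3, 8)

f, p = stats.f_oneway(cluster_a, cluster_b, cluster_c)
print(f"F = {f:.2f}, p = {p:.4f}")
```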
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

