Review

Artificial Intelligence in Medical Education: A Narrative Review on Implementation, Evaluation, and Methodological Challenges

by Annalisa Roveta 1,†, Luigi Mario Castello 2,3,†, Costanza Massarino 2,*, Alessia Francese 1, Francesca Ugo 1 and Antonio Maconi 1

1 Research Laboratories, Research and Innovation Department, Azienda Ospedaliero-Universitaria SS. Antonio e Biagio e Cesare Arrigo, 15121 Alessandria, Italy
2 Translational Medicine, Research and Innovation Department, Azienda Ospedaliero-Universitaria SS. Antonio e Biagio e Cesare Arrigo, 15121 Alessandria, Italy
3 Department of Translational Medicine, Università del Piemonte Orientale, 28100 Novara, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
AI 2025, 6(9), 227; https://doi.org/10.3390/ai6090227
Submission received: 27 June 2025 / Revised: 29 August 2025 / Accepted: 3 September 2025 / Published: 11 September 2025
(This article belongs to the Special Issue Exploring the Use of Artificial Intelligence in Education)

Abstract

Artificial Intelligence (AI) is rapidly transforming medical education by enabling adaptive tutoring, interactive simulation, diagnostic enhancement, and competency-based assessment. This narrative review explores how AI has influenced learning processes in undergraduate and postgraduate medical training, focusing on methodological rigor, educational impact, and implementation challenges. The literature reveals promising results: large language models can generate didactic content and foster academic writing; AI-driven simulations enhance decision-making, procedural skills, and interprofessional communication; and deep learning systems improve diagnostic accuracy in visually intensive tasks such as radiology and histology. Despite promising findings, the existing literature is methodologically heterogeneous. A minority of studies use controlled designs, while the majority focus on short-term effects or are confined to small, simulated cohorts. Critical limitations include algorithmic opacity, generalizability concerns, ethical risks (e.g., GDPR compliance, data bias), and infrastructural barriers, especially in low-resource contexts. Additionally, the unregulated use of AI may undermine critical thinking, foster cognitive outsourcing, and compromise pedagogical depth if not properly supervised. In conclusion, AI holds substantial potential to enhance medical education, but its integration requires methodological robustness, human oversight, and ethical safeguards. Future research should prioritize multicenter validation, longitudinal evaluation, and AI literacy for learners and educators to ensure responsible and sustainable adoption.

1. Introduction

1.1. General Context

The role of artificial intelligence (AI) in healthcare education has become an increasingly vital topic in both research and practice. AI technologies, particularly machine learning (ML), deep learning (DL), and natural language processing (NLP), are revolutionizing medical training through the introduction of data-driven, adaptive, and interactive learning methods [1,2]. These innovations are reshaping the acquisition of knowledge, the development of clinical reasoning, and the attainment of hands-on skills across multiple domains, including surgery, radiology, pathology, and diagnostics [3,4]. AI-enhanced simulation tools and intelligent tutoring systems, for example, offer learners immediate and personalized feedback, thereby addressing performance gaps and strengthening clinical competence [5,6]. At the same time, NLP algorithms are employed to summarize medical texts, while generative AI technologies, such as ChatGPT, are increasingly used as supplemental resources to support differential diagnosis exploration and exam preparation [7].
From AI chatbots that answer student queries to image-generation tools used in teaching anatomy, AI is increasingly becoming embedded within the educational curriculum [8]. Although case reports and implementation studies are on the rise, the existing literature remains fragmented. A clear need for synthesis and conceptual integration persists, as AI keeps changing not only the content of education, but also the methods of delivery, the target audiences, and the ethical and practical frameworks within which learning occurs [9,10].

1.2. Rationale

Despite the high level of interest generated by AI in medical education, the current literature highlights notable methodological limitations. The majority of studies are exploratory or descriptive, with limited adoption of controlled experimental designs and a general lack of methodological standardization [11]. These shortcomings hinder the accumulation of robust, generalizable evidence and reduce the ability to compare findings across different educational contexts.
Moreover, ethical and regulatory considerations, particularly those defined by the General Data Protection Regulation (GDPR), are often superficially addressed, although they have a major role in governing the responsible use of AI in safety-critical domains such as healthcare education [12].
At the same time, fundamental issues such as algorithmic transparency, bias mitigation, and risk management are rarely incorporated into the design and evaluation of AI-enhanced educational interventions. This omission has direct implications for user trust, perceived validity, and overall acceptance of AI systems by both educators and learners [13].

1.3. Emerging Issues and Challenges

The implementation of AI in medical education is confronted with a complex set of technical, organizational, methodological, and ethical challenges that must be addressed to ensure safe, fair, and pedagogically meaningful adoption. From a technical standpoint, the “black box” nature of many AI models limits interpretability and reduces user trust, particularly in educational contexts, where transparency and explainability are essential for both learners and instructors [14].
Ethical and regulatory considerations also play a critical role. The design and deployment of AI-driven educational tools must comply with stringent legal frameworks, particularly the General Data Protection Regulation (GDPR). Beyond the principle of informed consent, the GDPR imposes obligations on special-category data processing (Article 9), protection against fully automated decision-making (Article 22), and the principles of data protection by design and by default (Article 25). Recent empirical data indicate that divergent interpretations of these provisions have already delayed, or in some cases excluded, European participation in international AI-driven healthcare research, owing to unresolved questions about cross-border data sharing and legal uncertainty [15].
Legal scholarship also argues that a “health-conformant” interpretation of Article 22 is needed to safeguard trainees and patients where AI systems are applied in contexts such as individual learning analytics or decision support [16]. However, methodological reviews indicate that fewer than 20% of AI education projects in the UK include a formal Data Protection Impact Assessment, highlighting a persistent compliance gap [17].
In addition, organizational and pedagogical barriers must be taken into account during implementation. Many medical curricula still lack structured training in AI fundamentals, digital ethics, and algorithmic literacy, raising questions about the preparedness of both faculty and students to engage meaningfully with these technologies. In parallel, the existing literature reveals recurring methodological shortcomings: most studies are based on small samples, lack longitudinal follow-up, and rarely employ standardized metrics to assess educational impact [2].
These limitations hinder cumulative knowledge development and reduce the ability to generate generalizable insights into the effectiveness and safety of AI-enhanced educational interventions.

1.4. Specific Objectives

In light of the challenges outlined above, this review aims to examine how the adoption of AI has influenced learning processes in undergraduate and postgraduate medical education, with particular attention to the methodological rigor and evaluative frameworks applied in current research. The central research question is “In what ways has the integration of AI into medical education affected learning processes, and what methodological limitations and evaluation challenges emerge from the current literature?”.
The objectives of this review are to (1) assess the methodological rigor of existing studies on AI applications in medical education, (2) identify key strengths and limitations in current research, (3) evaluate the effectiveness of AI-driven educational interventions, and (4) highlight existing gaps and future directions in this field. By synthesizing recent empirical findings through a narrative and thematically structured approach, this review seeks to support a more critical, transparent, and pedagogically grounded adoption of AI in medical education.

2. Materials and Methods

This narrative review was conducted to explore the impact of AI on learning processes in medical education and to critically examine the methodological challenges associated with its evaluation. The goal was not to conduct a systematic appraisal of intervention effectiveness, but rather to generate a conceptually driven and thematically structured synthesis of recent empirical findings.
A narrative synthesis approach was adopted in line with established methodological guidance for non-systematic reviews in complex and emerging fields. The synthesis was organized thematically across four core domains. These domains were identified a priori through a consensus process among the authors, drawing upon their collective expertise and familiarity with the key literature in the field. This approach allowed for the definition of a foundational framework representing the most established areas of AI application in medical education. The domains were then refined iteratively during the full-text analysis: (1) AI as a tutor or content generator; (2) AI in simulation-based learning; (3) AI in diagnostic training; and (4) AI in competency assessment.

2.1. Literature Search Strategy

A structured literature search was carried out using PubMed and Embase, covering the period from January 2018 to December 2024. These databases were selected for their comprehensive indexing of biomedical and health sciences literature, particularly relevant to medical education. While broader multidisciplinary databases such as Scopus or Web of Science offer wider citation tracking capabilities, they are less specialized in biomedical indexing [18].
The chosen timeframe (2018–2024) was selected to focus on the most recent and significant developments of AI in medical education, a rapidly evolving field.
Search strings combined Medical Subject Headings (MeSH) and free-text terms related to AI, machine learning, educational technologies, and clinical training. The full search strategies for both databases are reported below.
  • PubMed search string: (("artificial intelligence"[MeSH] OR "artificial intelligence"[Title/Abstract] OR "AI"[Title/Abstract] OR "machine learning"[Title/Abstract] OR "deep learning"[Title/Abstract] OR "Large Language Models"[Title/Abstract] OR "Generative Artificial Intelligence"[Title/Abstract] OR "Generative AI"[Title/Abstract]) AND ("curriculum"[Title/Abstract] OR "training program"[Title/Abstract] OR "simulation"[Title/Abstract] OR "skills training"[Title/Abstract] OR "competency-based education"[Title/Abstract] OR "clinical teaching"[Title/Abstract] OR "medical education"[MeSH] OR "clinical education"[Title/Abstract]) AND ("medical students"[Title/Abstract] OR "residents"[Title/Abstract] OR "fellows"[Title/Abstract] OR "physicians"[Title/Abstract] OR "health personnel"[MeSH])) AND ("2018/01/01"[Date - Publication] : "2024/12/31"[Date - Publication])
  • Embase search string: ('artificial intelligence':ti,ab OR 'ai':ti,ab OR 'machine learning':ti,ab OR 'deep learning':ti,ab OR 'large language models':ti,ab OR 'generative artificial intelligence':ti,ab OR 'generative ai':ti,ab) AND ('medical education':ti,ab OR 'curriculum':ti,ab OR 'training program':ti,ab OR 'simulation':ti,ab OR 'skills training':ti,ab OR 'competency-based education':ti,ab OR 'clinical teaching':ti,ab OR 'clinical education':ti,ab) AND ('healthcare professionals':ti,ab OR 'clinicians':ti,ab OR 'medical personnel':ti,ab OR 'medical students':ti,ab OR 'residents':ti,ab OR 'fellows':ti,ab OR 'specialists':ti,ab OR 'physicians':ti,ab) AND [2018-2024]/py
The search was limited to titles and abstracts to ensure relevance and to manage scope. While this may have excluded some marginally relevant studies, it allowed for the identification of literature explicitly concerned with AI in medical education.
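For transparency, the PubMed arm of this strategy can be reproduced programmatically. The sketch below uses Biopython's Entrez client with an abridged version of the query reported above; the contact email and retmax cap are placeholders, not part of the original protocol.

```python
# Minimal sketch: running an abridged version of the PubMed query via
# Biopython's Entrez client. Placeholder email; retmax caps returned PMIDs.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address

query = (
    '("artificial intelligence"[MeSH] OR "machine learning"[Title/Abstract] '
    'OR "deep learning"[Title/Abstract]) '
    'AND ("medical education"[MeSH] OR "simulation"[Title/Abstract]) '
    'AND ("medical students"[Title/Abstract] OR "residents"[Title/Abstract]) '
    'AND ("2018/01/01"[Date - Publication] : "2024/12/31"[Date - Publication])'
)

handle = Entrez.esearch(db="pubmed", term=query, retmax=2000)
record = Entrez.read(handle)   # dict-like result with 'Count' and 'IdList'
handle.close()

print(f"Records found: {record['Count']}")
print(record["IdList"][:10])   # first ten PMIDs, e.g., for screening export
```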

2.2. Study Selection and Eligibility Criteria

The study selection process followed a structured approach. The initial search yielded 1317 records (514 from PubMed and 803 from Embase). After removing 211 duplicates, 1106 unique articles were screened based on their titles and abstracts. This screening was performed independently by two reviewers to assess relevance. During this phase, 936 articles were excluded as they did not align with the review’s objectives.
This left 170 articles for full-text eligibility assessment. Following a detailed review of the full texts, 107 articles were further excluded for the following primary reasons: irrelevant outcomes, wrong population, full text not available in English, or non-pertinent AI intervention.
A final set of 63 articles met the inclusion criteria and was included in the narrative synthesis. Disagreements during the selection process were resolved by consensus. Data were extracted and coded thematically into the four predefined domains, with iterative refinement based on the emergent patterns across studies.
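As a sanity check, the selection flow above reduces to simple arithmetic, which is convenient when drafting a PRISMA-style diagram; the snippet below merely re-derives the counts reported in the text.

```python
# Cross-checking the study selection counts reported in the text.
records = {"PubMed": 514, "Embase": 803}
identified = sum(records.values())   # 1317 records identified
screened = identified - 211          # 1106 unique records after deduplication
full_text = screened - 936           # 170 assessed for full-text eligibility
included = full_text - 107           # 63 included in the narrative synthesis

assert (identified, screened, full_text, included) == (1317, 1106, 170, 63)
print(f"identified={identified}, screened={screened}, "
      f"full_text={full_text}, included={included}")
```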
Inclusion criteria:
- Empirical studies examining AI applications in undergraduate, graduate, or continuing medical education;
- Studies reporting objective educational outcomes (e.g., performance metrics, test scores, skill acquisition);
- Interventions involving AI-based tools as tutors, simulators, evaluators, or diagnostic aids.
Exclusion criteria:
- Studies involving only patients or non-healthcare learners;
- Articles reporting only subjective outcomes or perceptions;
- Editorials, conference abstracts, reviews, or protocols.

3. Results

3.1. AI as a Tutor and Generator of Educational Content

AI is increasingly used to support the theoretical and pre-clinical components of medical education, functioning as a content generator, intelligent tutor, simulation agent, and assessment tool tied to medical performance outcomes such as OSCE stations, Mini-CEX–style skill checks, and specialty knowledge tests. The reviewed studies highlight a growing potential for AI to enhance educational accessibility, interactivity, and personalization. However, its impact is tempered by methodological heterogeneity and domain-specific constraints.
A first area of application is the generation of personalized educational content. Observational and controlled studies have shown that LLM-based tools such as ChatGPT can produce coherent and, at times, effective material for the preparation of specialist examinations [19,20]. For instance, the use of ChatGPT-4 to generate multiple-choice quizzes in pathology produced materials with good internal consistency (α = 0.74), a psychometric index of a test’s reliability [19]. However, this statistical coherence did not ensure clinical validity; indeed, the same study reported that expert review was essential to correct factual inaccuracies, showing that human oversight remains indispensable. Nevertheless, the educational potential of these tools is significant: a randomized comparison conducted by Gan et al. in orthopedics reported higher exam scores in the intervention group (p = 0.02) [20]. Yet, these advantages remained conditional on prompt quality and expert supervision, necessary to secure clinical accuracy, adherence to guidelines, and blueprint alignment.
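For readers less familiar with the reliability index cited above, the sketch below shows how Cronbach's α is computed from a respondents-by-items score matrix; the data are synthetic, not drawn from the cited study.

```python
# Minimal sketch: Cronbach's alpha for a k-item quiz, computed from a
# (respondents x items) matrix of 0/1 scores. Synthetic data only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Rows = respondents, columns = items (e.g., 0/1 per question)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))                               # latent ability
items = (ability + rng.normal(size=(50, 20)) > 0).astype(float)  # correlated items
print(f"alpha = {cronbach_alpha(items):.2f}")
```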
AI is also applied in interactive clinical simulation. Conversational agents simulating patient interactions enhanced both clinical reasoning and OSCE performance (p = 0.049 and p = 0.01, respectively) [21,22]. Interactive simulation proved most effective when coupled with structured feedback and checklist-based debriefing. Zheng et al. expanded the simulation approach by generating entire clinical scenarios for cardiovascular training, with significant gains in knowledge (p < 0.05), operational skills (p < 0.001), and critical thinking (p < 0.001) [23]. Further applications include the use of AI to enhance interaction and emotional engagement through realistic simulations. Aster et al. integrated GAN-generated synthetic faces into a serious game for clinical reasoning; although no quantitative outcomes were measured, students reported high engagement and acceptance [24]. In interprofessional contexts, Liaw et al. found that immersive VR agents significantly strengthened communication self-efficacy (p < 0.001) [25].
AI-based assessment of cognitive, communicative, and metacognitive competencies is emerging as well. Beyond the NLP-based system evaluated by Cianciolo et al. [26], which outperformed human raters in evaluating clinical writing, Su et al. developed an assessment system based on repertory grids and case-based reasoning to evaluate specialist knowledge in otolaryngology, showing significant improvements in learning scores among both undergraduate students and residents (z = −3.976; p < 0.001) [27]. Brutschi et al. tested speaker diarization algorithms to automatically analyze debriefing dynamics, achieving 97.8% speaker identification accuracy [28].
AI is also used to support academic writing. Interventions with LLMs were associated with improved cohesion, confidence, and textual quality in student essays (p < 0.05 across several indicators) [29]. In addition, AI can promote AI literacy itself: modules using TensorFlow and deep learning frameworks were well received by radiology trainees [30], and curricular programs on AI-assisted decision-making led to measurable knowledge gains and stronger engagement [31].
More advanced applications involve NLP on real clinical data to scaffold diagnostic reasoning [32,33], providing learning environments that approximate real-world complexity.
However, Desseauve et al. documented that only 21% of ChatGPT outputs were accurate in pregnancy-related liver disease, underscoring the limits of LLMs in high-risk, semantically complex contexts [34].
In synthesis, current research supports a gradual incorporation of AI into medical education, though outcomes remain contingent on contextual factors, supervision, and pedagogical design. While AI can extend access, facilitate interaction, and enrich analytic processes, it cannot substitute for human expertise in judgment, evaluation, or reflective practice.
  • Limitations and Challenges in the Use of AI as a Tutor and Generator of Educational Content
Despite promising developments, the integration of AI into medical education presents significant challenges across multiple domains.
- Content reliability: Several studies report conceptual errors, omissions, and incoherent responses generated by LLMs such as ChatGPT, especially in complex clinical contexts [35]. The performance of these systems is highly sensitive to prompt quality and requires expert supervision to ensure accuracy [36].
- Technical limitations: The inability to represent complex visual structures reduces AI’s effectiveness in domains such as anatomy and surgery [36]. Additional studies report operational constraints related to audio quality and context of use, which limit scalability [23].
- External validity: Many interventions have been tested on small samples or within highly controlled experimental settings, limiting the generalizability of findings [21,22,24,25,28,32]. Moreover, the lack of longitudinal follow-up renders the long-term impact on learning uncertain.
- Pedagogical risks: The widespread use of AI may promote passive learning and cognitive outsourcing unless accompanied by specific training in AI literacy [29,31,37,38]. In academic contexts, unregulated use of generative tools in writing raises concerns about originality and the development of critical thinking [29,34].
- Ethical and systemic issues: Automated performance assessment raises concerns regarding bias, transparency, and accountability [23,26,33]. Some authors highlight the absence of AI-related competencies among teaching staff as a further barrier to responsible and effective technology adoption [31].

3.2. From Simulation to Practice: Developing Competence with AI

AI is increasingly shaping procedural learning in medicine by enabling simulation environments that integrate automated assessment, personalized adaptation, and real-time feedback. Compared to traditional instructional modalities, often dependent on human supervision and subjective evaluation, AI allows for more objective, standardizable, and scalable training processes [39,40,41].
In surgical simulation, machine learning algorithms have been applied to classify levels of expertise. In a tumor resection task, Siyar et al. reported up to 90% accuracy in distinguishing experts from novices [40], while Radi et al. documented significant performance improvements following VR training using AI-driven metrics [42]. A relevant methodological contribution is the Mastery Performance Index (MPI), developed by Simmonds et al., which enables inter-institutional comparison across robotic surgery training programs [41].
AI also supports adaptive tutoring strategies within simulation. Fazlollahi et al. showed that AI-based tutoring, compared with conventional instruction, led to significant improvements in OSATS scores and instrument path deviation [43]. Ruberto et al. developed a deep learning model that adjusts simulation difficulty in real time, based on cognitive load inferred from physiological signals (ECG, GSR), with favorable effects on perceived learning [44].
In anesthesiology training, intelligent systems have shown clinical benefits. Cai et al. reported that AI-supported nerve identification significantly reduced complications (e.g., paresthesia) during sciatic nerve blocks (4.12% vs. 14.06%) [45]. Similarly, Yovanoff et al. observed higher self-efficacy in trainees performing central venous catheter placement within an AI-mediated feedback environment [46].
Advanced predictive modeling is being explored to optimize learning curves. Ledwos et al. used a k-nearest neighbor (KNN) algorithm to model procedural progression in simulated neurosurgery, accurately identifying the optimal timing for instructional feedback [47]. Di Mitri et al. tested and validated a system based on Long Short-Term Memory (LSTM) networks to detect errors during CPR, achieving over 90% accuracy for metrics such as compression quality and body posture [48]. Melnyk et al. demonstrated that exposure to expert visual patterns enhances operational efficiency and multitasking in robotic simulation, suggesting reinforcement of implicit learning mechanisms by AI [49].
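To illustrate the general technique underlying several of these studies, the sketch below trains a k-nearest neighbor classifier to separate experts from novices on simulation-derived metrics; the feature set and data are hypothetical placeholders, not the published pipelines.

```python
# Minimal sketch: KNN classification of expertise level from simulated
# performance metrics. Features and data are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical per-trial metrics: path length, tremor index,
# force variability, task time. Experts (label 1) score lower on average.
experts = rng.normal([0.8, 0.2, 0.3, 60], [0.15, 0.05, 0.10, 10], (60, 4))
novices = rng.normal([1.4, 0.6, 0.7, 110], [0.25, 0.10, 0.15, 20], (60, 4))
X = np.vstack([experts, novices])
y = np.array([1] * 60 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")
```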
AI is also integrated into immersive and interprofessional training contexts. Sok et al. compared VR-based simulations led by AI versus human instructors. While clinical outcomes were similar, the AI arm was associated with reduced perceived self-efficacy [50]. In emergency medicine, Riaño et al. observed gains in patient stabilization and guideline adherence through an AI-based expert system [51].
Peripheral yet relevant applications include ergonomic and communication training. Hamilton et al. developed an AI-based intraoperative posture monitoring system that improved operator ergonomics significantly [52]. Hershberger et al. integrated a natural language processing system for real-time feedback during motivational interviewing training, resulting in enhanced communication quality [53]. In dental education, Chang et al. reported improved workflow efficiency with AI-assisted radiographic assembly, though diagnostic accuracy declined without expert oversight [54].
  • Limitations and Challenges
Despite promising findings, several limitations persist.
- Sample size and generalizability: Several studies rely on extremely small samples, as in the case of Ruberto et al. (n = 4) [44], which limits generalizability.
- Model interpretability: Although advanced models such as CNNs and LSTMs demonstrate high predictive accuracy, their opaque architecture impedes their adoption in high-responsibility domains like surgery and anesthesia, where trust and transparency are critical [40,50,52]. Emerging interpretability techniques, such as saliency maps for CNNs, aim to mitigate this “black box” issue by visualizing the features that most influence a model’s decision (a minimal sketch follows this list), yet their integration into educational platforms is not yet standard practice.
- Scalability and infrastructure: The implementation of AI-driven simulation often depends on high-performance computing resources and complex technical setups, which may not be available in low-resource settings [42,43,45].
- Learner engagement and self-efficacy: Some evidence suggests that AI-led training may reduce learner confidence or perceived competence. Sok et al. reported lower self-efficacy scores in simulations involving AI instructors, and Chang et al. warned against cognitive offloading and automation dependency [50,54].
- Retention and long-term efficacy: The durability of AI-acquired skills remains uncertain. Liu et al. observed evidence of skill decay over time, highlighting the need for periodic reinforcement and longitudinal curriculum strategies [55].
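To make the interpretability point above concrete, the sketch below computes a gradient-based saliency map for a toy PyTorch CNN; the model and input are illustrative stand-ins, not a validated diagnostic network.

```python
# Minimal sketch: gradient-based saliency for a toy CNN. The gradient of
# the predicted class score w.r.t. the input highlights influential pixels.
import torch
import torch.nn as nn

model = nn.Sequential(                    # stand-in for a trained diagnostic CNN
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
)
model.eval()

image = torch.rand(1, 1, 64, 64, requires_grad=True)  # placeholder input
scores = model(image)
top_class = int(scores.argmax(dim=1))

scores[0, top_class].backward()          # backpropagate the top class score
saliency = image.grad.abs().squeeze()    # (64, 64) pixel-importance map
print(saliency.shape, float(saliency.max()))
```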

3.3. Enhancing Clinical Perception: AI in Diagnostic Training

AI is becoming a pivotal element of diagnostic training across visually intensive medical disciplines such as radiology, pathology, dermatology, ultrasound, and ophthalmology. By providing annotated case material, instant feedback, and adaptive learning environments, AI has delivered measurable gains in diagnostic competence.
In histology and radiology, AI systems have proved especially helpful for novices interpreting high-resolution medical images. In a classification task covering ten tissue classes, medical students achieved an average accuracy of 55%, compared to 91–93% for convolutional neural networks (CNNs) [56]. In hip fracture detection, AI-assisted training raised accuracy from 75.7% to 88.9% (p < 0.01), outperforming the control group [57].
In ultrasound education, AI-based tools providing real-time anatomical feedback doubled the likelihood of acquiring correct cardiac views (relative risk = 2.3; p = 0.002) [58] and shortened the training required for competence in obstetric imaging (3 vs. 4 cycles; p = 0.037) [59]. In ophthalmology, training with AI for pathological myopia identification enhanced diagnostic skills among residents (p < 0.0001) [60].
In cytopathology, AI-assisted image reading not only increased interpretation scores (p < 0.001) but also reduced reading times (from 32.1 to 11.4 min) and raised interrater agreement (κ 0.645 → 0.803) [61].
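For context on the agreement statistic reported above, Cohen's κ can be computed directly from paired rater labels; the sketch below uses scikit-learn with illustrative classifications, not data from the cited study.

```python
# Minimal sketch: Cohen's kappa between two raters' slide classifications.
from sklearn.metrics import cohen_kappa_score

rater_a = ["benign", "malignant", "benign", "atypical", "malignant", "benign"]
rater_b = ["benign", "malignant", "atypical", "atypical", "malignant", "benign"]

kappa = cohen_kappa_score(rater_a, rater_b)  # 1 = perfect, 0 = chance-level
print(f"kappa = {kappa:.3f}")
```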
Applications extend to dermatology and rare disease recognition. A randomized study using synthetic images for chalazion and sebaceous carcinoma, which, according to the authors, were vetted and selected for clinical suitability by two physicians, increased accuracy from 56.1% to 69.8% (p < 0.001) and reduced response times by 78 s [62]. The educational potential of AI also emerges in clinical dysmorphology: automated facial phenotyping tools such as DeepGestalt have been incorporated into genetics education to strengthen differential diagnosis, where patient-specific images and automated syndrome comparisons encouraged systematic learning and boosted confidence in morphological classification [63].
Beyond visual perception, AI contributes to diagnostic training through NLP-based conversational platforms designed to simulate clinical interviews. Virtual systems such as OSCEBot, powered by SBERT, achieved strong semantic accuracy (AUC 0.864; 85% precision) [64], while knowledge-based systems like MCRDR improved efficiency and learner confidence [65].
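The scoring principle behind such SBERT-based systems can be illustrated in a few lines: embed a reference answer and a student answer, then use cosine similarity as the semantic-accuracy signal. The checkpoint name below is an arbitrary choice, not the one used by OSCEBot.

```python
# Minimal sketch: SBERT-style semantic scoring of a student's answer against
# a reference answer. Checkpoint choice is an assumption, not OSCEBot's.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "Ask about the onset, duration, and character of the chest pain."
student = "I would ask when the pain started and what it feels like."

embeddings = model.encode([reference, student], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity = {similarity:.2f}")  # threshold to grade answers
```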
AI is also contributing to diagnostic reasoning via large language models. In a study using the PREP emergency medicine questionnaire, ChatGPT exceeded the certification threshold (accuracy 74.5%; κ = 0.71) and provided detailed explanations for each answer, underscoring its potential both for content validation and for guided self-learning [66].
AI is also being used to generate training content dynamically. In an educational workshop on COVID-19, participants rated ChatGPT-generated clinical scenarios positively (4.13/5), particularly for their value in stimulating reasoning and reflective learning [67]. These applications suggest a role for generative AI in enhancing authenticity and complexity in diagnostic training environments.
Across these domains, AI contributes through three main pathways: enhancing perceptual accuracy with visual feedback, optimizing assessment through automated metrics, and enabling realistic case interaction and generation. The strongest impact emerges in structured visual tasks, where AI enables scalable, repeatable training with objective performance monitoring.
This pedagogical focus is critical, as early and structured exposure to AI during training is foundational for developing the skills needed to critically appraise, validate, and responsibly integrate these technologies into future clinical practice.
  • Limitations and Challenges
Despite these encouraging results, the implementation of AI in diagnostic education faces multiple critical barriers.
- Algorithmic reliability: Some generative systems, such as DALL·E 3, have demonstrated poor reliability for educational purposes. In one evaluation, over 78% of AI-generated anatomical images for congenital heart disease were judged unsuitable due to structural errors and misleading labels [68]. NLP tools like OSCEBot also show reduced performance in non-scripted or unexpected clinical interactions [64].
- Generalizability and dataset bias: Many AI models perform well in narrowly defined diagnostic domains but fail to generalize beyond their training data. For instance, CNNs used in histological classification struggled with atypical or rare morphologies, limiting training validity [56]. In dermatology and rare disease education, synthetic datasets may under-represent relevant pathologies or demographic groups, introducing bias and limiting realism [62].
- Technical and infrastructural barriers: Effective implementation requires high-quality hardware, validated datasets, and stable platforms, resources that are not evenly distributed across institutions. Limited digital infrastructure can compromise access and reproducibility, especially in low-resource settings [60,61].
- Ethical concerns and data governance: The use of facial recognition technologies for training in clinical genetics (e.g., DeepGestalt) raises critical questions about biometric data protection, informed consent, and the potential for unintended re-identification. Biases embedded in facial datasets may also propagate into learner judgments [63].
- Pedagogical risks: Automated feedback systems may reduce reflective learning if used without structured supervision. Studies warn against excessive reliance on AI-generated outputs, advocating for blended pedagogical approaches that include human feedback and critical reflection [67,68].

3.4. Towards Data-Driven Training: AI in Competency Assessment

AI is increasingly transforming competency assessment in medical education by enabling continuous, objective, and individualized evaluation strategies. Through the integration of machine learning algorithms, computer vision, and generative models, AI is shifting assessment from episodic and subjective evaluations toward scalable, data-rich, and reproducible systems.
In surgery and technical skills training, AI-based video analysis has shown particular promise in evaluating motor performance. At Stanford, computer vision models (Mask R-CNN and SORT) were used to quantify bimanual dexterity and efficiency during robotic procedures. The AI metrics correlated with human expert ratings (r = 0.48 for dexterity; r = −0.72 for efficiency, a negative correlation expected because higher expert scores correspond to lower, more favorable AI metrics such as shorter task completion times), achieved intraclass correlation coefficients (ICCs) between 0.7 and 0.8 across GEARS domains, and clearly separated expert from novice operators (p < 0.001) [69].
Virtual reality environments have further validated AI in procedural assessment. Support Vector Machines (SVMs) classified skill levels in simulated spinal surgery with 97.6% accuracy [70], while deep learning-enhanced simulators (e.g., Intuitive Learning System) improved task performance by more than 180% in metrics such as the Ring Rollercoaster II task [71]. Intelligent tutoring systems such as the Virtual Operative Assistant (VOA), based on SVMs, provided adaptive feedback and significantly improved student performance in soft-tissue handling, including lower rates of healthy tissue removal (p = 0.03) and better motor control (p < 0.001) [72].
Deep neural networks (DNNs) and 3D convolutional neural networks have also been used to automatically assess surgical videos, classifying trainee expertise with accuracies over 80–91% for tasks such as suturing and knot-tying [73,74]. Complementarily, wearable devices have enabled motion tracking: in a study using Apple Watch and artificial neural networks, laparoscopic skill level was classified with an F1-score of 86.1% [75]. Even in low-fidelity VR simulations, AI has proven reliable—YOLOv8 models achieved 95% concordance with human evaluators in measuring error rates and task completion [76].
Beyond surgical contexts, AI is assisting clinicians in acquiring complex diagnostic skills. In a study on cPOCUS, a deep learning algorithm guiding junior neurologists significantly reduced acquisition time (from 3.1 to 1.7 days, p < 0.001), increased user confidence (89.3%), and contributed to clinical decision-making in more than one-third of cases [77]. Similarly, in radiotherapy planning, AI-assisted visual feedback modules enabled novice learners to generate treatment configurations closer to clinical standards, demonstrating potential for structured procedural planning [78].
AI has also been applied to support non-native medical students’ academic writing. The use of ChatGPT-3.5 and GPT-4 improved both human-graded and automated assessment scores in medical essays (p < 0.05), indicating that generative models may assist in formative evaluation of linguistic and argumentative competence [79].
Recent work has expanded AI’s role into the real-time evaluation of cognitive strain during technical tasks. Using eye-tracking data and LSTM-GAN architectures, one study identified disorientation episodes during endoscopic procedures with over 91% accuracy, providing valuable insight into user behavior under pressure and the triggers of procedural error [80].
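To give a sense of the sequence-modeling idea involved, the sketch below defines a minimal LSTM classifier over windows of eye-tracking features; the architecture and data are illustrative and far simpler than the LSTM-GAN pipeline of the cited study.

```python
# Minimal sketch: an LSTM that flags disorientation from windows of
# eye-tracking features (e.g., gaze x/y, pupil diameter). Illustrative only.
import torch
import torch.nn as nn

class GazeClassifier(nn.Module):
    def __init__(self, n_features: int = 3, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # 0 = oriented, 1 = disoriented

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)        # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])         # one logit pair per window

model = GazeClassifier()
windows = torch.randn(8, 120, 3)  # 8 windows of 120 gaze samples, 3 features
logits = model(windows)
print(logits.shape)               # torch.Size([8, 2])
```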
Taken together, these developments outline an emerging model of AI-supported competency assessment that spans motor, cognitive, linguistic, and procedural domains. AI enables automated and adaptive feedback loops that bridge formative and summative functions, while supporting real-time monitoring, performance tracking, and digital literacy.
  • Limitations and Challenges
Despite promising developments, the reviewed studies highlight several significant limitations, both methodological and pedagogical, that warrant caution before large-scale adoption of AI-based assessment systems.
- Generalizability and sample limitations: Many models have been tested on small or homogeneous populations (e.g., medical students only, low-fidelity simulators), limiting external validity. For instance, YOLOv8 and 3D CNNs were validated in highly controlled settings and may not generalize to real clinical tasks [73,74].
- Clinical validation: Although AI tools show high accuracy in simulated environments, few studies have demonstrated translation to real-world clinical outcomes. The connection between performance improvements in VR and patient-level results remains tenuous [75].
- Data quality and input standardization: AI performance can degrade in the presence of noisy, unbalanced, or non-standardized data (e.g., skewed pass/fail ratios or motion artifacts), particularly for video- and sensor-based systems [74,75].
- Technological and cognitive burden: High technical complexity and limited digital familiarity can inhibit learner engagement. Meade et al. observed reduced participation and course retention due to initial intimidation regarding AI concepts [81].

4. Discussion

4.1. Summary of Evidence: Where AI Impacts Medical Education

In this narrative review, we examine the integration of AI into medical education across four main domains: personalized tutoring, procedural simulation, diagnostic training, and competency assessment. The literature shows that machine learning (ML), deep learning (DL), and natural language processing (NLP) consistently improve feedback quality, task personalization, and performance metrics [20,22].
In tutoring, generative tools such as ChatGPT produce quizzes and educational materials that students judge useful, in some cases improving test performance [19,20]. AI-based simulation platforms, including conversational agents and virtual reality environments, enhance clinical decision-making and self-efficacy and improve technical skills in immersive learning scenarios [21,25]. Results are promising: generative AI tools improve test scores, and simulations enhance performance in OSCE exams.
However, focusing only on these metrics risks being reductive and obscuring the deeper pedagogical potential of AI. Medical competence extends beyond exams to probabilistic reasoning under uncertainty. This is where the more advanced role of an AI tutor comes into play: no longer a mere content provider, but a “cognitive sparring partner”. Instead of providing perfect clinical cases, curricula can challenge students to find inaccuracies in AI outputs, a process that trains critical validation skills essential for patient safety.
The same reasoning applies to simulations. Simulations should evolve beyond scripted scenarios to reflect human variability, anxiety, bias, and cultural diversity—integrating technical with non-technical skills such as interprofessional communication.
In this context, the ability of AI to distinguish between novices and experts is technically impressive [40], but its educational value lies in the “why” behind this distinction. Algorithms detect micro-movements and procedural strategies that characterize expertise. This algorithmic ‘gaze’ can render tacit learning both visible and measurable. AI feedback can therefore move beyond a simple score, offering the learner a granular analysis of their procedural deficits and accelerating the transition from rule-based competence to expert fluency.
However, this potential is not without risks. A secondary analysis of the study by Fazlollahi et al. [43] revealed an algorithmic “hidden curriculum”: while the AI tutor improved procedural confidence, it diminished efficiency, causing a divergence from expert benchmarks. This demonstrates that algorithmic optimization focused on a single aspect of performance can inadvertently compromise other essential skills, underscoring the irreplaceable role of the human supervisor in contextualizing AI feedback and in teaching the management of decision-making trade-offs in high-stakes contexts.
This advanced role for AI extends to assessing the metacognitive skills crucial for clinical reasoning. Rather than merely grading a final answer, intelligent systems can analyze a student’s decision-making path, offering feedback on the reasoning process itself and helping them to reflect on their biases. The fallibility of LLMs, highlighted by Desseauve et al. [34], becomes pedagogical: students learn ‘clinical skepticism,’ treating outputs as hypotheses to be verified, ensuring responsibility remains with clinicians.
In the diagnostic domain, convolutional neural networks (CNNs) improve accuracy across radiology, ultrasonography, cytology, and dermatology training [56,60,80].
The fact that AI can outperform students in visual pattern recognition tasks [56] is not merely a statistic, but the signal of a paradigm shift. If the machine is superior in pattern recognition, the goal of education can no longer be to train humans to compete with it. The focus must shift from perceptual competence to one of supervision and integration. Physicians must judge reliability, recognize limitations (especially with atypical or rare morphologies), and incorporate AI results into holistic reasoning.
This phenomenon, however, highlights a fundamental tension between efficiency and learning: the automation of a task can foreclose the opportunity to develop the corresponding competence. Curricula should balance AI-supported tasks with independent performance to build resilient, non-technology-dependent skills.
Similarly, in the field of assessment, computer vision and neural network-based tools have demonstrated high discriminatory power in distinguishing different levels of expertise and providing objective metrics of procedural proficiency [69,70].
This assessment capability becomes particularly innovative when it shifts the analysis from what a student does to how they think and feel while performing an action. The identification of disorientation episodes via eye-tracking [80] opens new frontiers: instead of a memory-based debriefing, an instructor can review a procedure by pointing to the exact moments of cognitive overload. This allows for teaching not only the technique itself but also the metacognitive skills of self-regulation, which are fundamental for error management in high-stress environments.

4.2. Effectiveness, Methodological Rigor, and Epistemic Boundaries

Although the number of AI-related studies in medical education is growing, methodological rigor remains variable. Fewer than one-third of studies reviewed employed randomized controlled designs. Most are observational, exploratory, or proof-of-concept studies, often with small sample sizes or limited follow-up, typically conducted in simulated environments [43,44].
This suggests that AI-driven educational interventions cannot be conceived as isolated events but must be integrated into a longitudinal curriculum. For example, strategies facilitated by AI platforms may schedule recall sessions and ensure that procedural skills acquired in simulation are effectively maintained and transferred to clinical practice, thereby mitigating the risk of skill atrophy.
Regarding validation, the metrics used vary considerably across studies, limiting direct comparisons. Moreover, many evaluations rely on indirect indicators (e.g., execution time, OSATS scores) rather than clinical outcomes or real-world decision-making. While AI systems may outperform humans in recognizing visual patterns, their performance in ambiguous, multimorbid, or rare scenarios remains unreliable [60].
From a pedagogical perspective, it is essential to question what AI is actually measuring in educational contexts. If algorithms prioritize speed or formal correctness, they may fail to capture essential aspects of clinical training such as uncertainty management, ethical reasoning, and the navigation of diagnostic ambiguity. This highlights the risk of an uncritical “educational techno-positivism”, mistaking metrics for learning, without deeper pedagogical reflection.

4.3. Systemic Barriers: Technology, Pedagogy, and Ethics

The large-scale adoption of AI-based educational tools faces infrastructural, pedagogical, and ethical barriers.
From a technical standpoint, many high-performing systems require specialized equipment (VR headsets, eye-tracking tools, physiological sensors) and dedicated environments. This limits scalability, particularly in resource-limited settings or large medical programs, potentially exacerbating educational disparities [52]. This infrastructural barrier has a deep equity dimension. The risk is the creation of a two-tiered educational system, separating elite institutions from those with fewer resources. This educational “digital divide” could translate into disparities in the quality of healthcare. It is therefore imperative to promote more inclusive innovation, fostering the development of low-cost and open-source solutions to ensure that the benefits of AI are accessible to all future physicians.
Pedagogically, the unguided use of AI systems can lead to cognitive outsourcing and diminished learner autonomy. Several studies note reductions in intrinsic motivation and metacognitive engagement when AI tools are deployed without critical scaffolding or supervision [20,54]. This is especially problematic in academic institutions, where future educators and clinician-scientists are trained, and where overreliance on automated systems may ultimately erode the capacity for critical reflection and innovation.
Ethically, the use of sensitive data, including biometric and physiological information, requires compliance with strict regulatory standards. The European General Data Protection Regulation (GDPR) mandates specific safeguards, including data minimization (Art. 9), protection from solely automated decision-making (Art. 22), and privacy by design (Art. 25). Yet fewer than 20% of surveyed AI-education projects in the UK report having conducted a formal Data Protection Impact Assessment (DPIA), revealing a gap between technological innovation and governance [12]. In this context, the issue of bias in datasets [62] is not just a technical problem, but an ethical and health equity issue with profound pedagogical implications. If a student trains on a dataset that, for example, lacks images of dermatological pathologies on dark skin, not only will the AI underperform, but the student themselves will develop a ‘visual bias’ that will translate into care disparities in their future practice. Curricula must therefore include specific modules on data ethics, teaching students to interrogate the provenance and composition of the datasets underpinning the tools they use. Furthermore, the use of tools like DeepGestalt [63] for dysmorphology brings the ethical discussion to an even more sensitive level: that of biometric data. Here, concerns extend beyond bias to include privacy, stigmatization, and the potential for discrimination. Teaching with AI in this field requires integrating technical training with skills in ethical communication, preparing future physicians to be not only users of a technology, but also custodians of their patients’ rights and dignity.
Finally, AI integration is also organizational and cultural. In academic centers, it affects educational dynamics, faculty roles, and learning pathways. Faculty must not only supervise but co-design, validate alignment with standards, and integrate AI into curricula.
This transformation of the educator’s role, however, entails profound challenges. Many faculty members report feeling unprepared, citing limited familiarity with AI and a lack of institutional support. The transition requires a massive investment in professional development and new forms of interdisciplinary collaboration among educators, clinicians, and data scientists. The integration of AI is therefore not merely a technological issue, but an organizational and cultural challenge.

4.4. Future Perspectives: Towards Responsible and Integrated Adoption

To realize the potential of AI in medical education responsibly and sustainably, several strategic priorities are recommended:
- Supervised Hybrid Models (Human-in-the-Loop): AI should not replace but enhance educational interactions, providing automated feedback while requiring expert validation and interpretation [72].
- Multicenter and Longitudinal Evaluations: Large-scale studies with clinical impact measures are essential to move beyond the exploratory phase and generate transferable evidence. Academic hospitals represent ideal testbeds for such integrated models, due to their integration of education, clinical care, and research.
- AI Literacy for Learners and Educators: Medical curricula should include foundational modules on ML principles, digital ethics, and critical appraisal of algorithmic outputs. Without such training, there is a risk of creating passive users rather than critically engaged professionals [81].
- Standardization and Interoperability: There is a pressing need for shared benchmarks, validated datasets, and interoperable systems to support algorithmic transparency and reliable cross-institutional implementation. This includes developing reference metrics for procedural and diagnostic competencies.

5. Conclusions

AI is emerging as a potentially transformative tool in medical education. It can support adaptive learning, interactive simulation, and objective assessment. Yet, its integration must be shaped by methodological rigor, pedagogical clarity, and a strong sense of ethical responsibility.
In the context of academic hospitals, using AI in education also calls for a deeper reflection on what it really means to learn medicine today. Learning medicine is no longer just about acquiring information; it is about developing sound clinical judgment, a sense of responsibility, and the ability to reflect critically on one’s decisions.
To meet these evolving needs, a more integrated educational model is required, one that helps students develop a structured and meaningful approach to patient care. This includes learning how to communicate effectively with patients and colleagues, and how to use diagnostic and therapeutic tools appropriately. During their training, students should be guided in collecting and interpreting clinical data in a way that helps them recognize patients’ problems and manage them through a clear, problem-solving workflow.
A growing body of evidence suggests that AI can play a valuable role in supporting this process. However, the key challenge is not simply to add technology to the curriculum, but to design teaching methods that bring together clinical experience and the possibilities opened up by AI tools.
Moving toward more data-informed models of education will require close collaboration between technological innovation and thoughtful, critical pedagogy. It also calls for investment, not only in infrastructure, but in supporting faculty and developing applied research. Only under these conditions can AI become not an end in itself, but a smart and meaningful tool that serves the broader goal of helping students grow in knowledge, judgment, and care.

Author Contributions

Conceptualization, A.R.; methodology, C.M., A.F. and F.U.; writing—original draft preparation, A.R.; writing—review and editing, C.M. and A.F.; visualization, A.R.; supervision, A.M.; project administration, A.R. and L.M.C. All authors have read and agreed to the published version of the manuscript.

Funding

Open access publication fees (APC) are supported by Fondazione Compagnia di San Paolo and Fondazione CDP, Bando Intelligenza Artificiale 2, AI-LEAP project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are presented in the main text.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AI-TIG: Artificial Intelligence Text-to-Image Generation
ANOVA: Analysis of Variance
AUC: Area Under the Curve
AUROC: Area Under the Receiver Operating Characteristic Curve
CAD: Computer-Aided Diagnostic
CCEP: Clinical Cardiac Electrophysiology
CHD: Congenital Heart Disease
CHDs: Congenital Heart Diseases
cPOCUS: Cardiac Point-Of-Care Ultrasound
DCNN: Deep Convolutional Neural Network
DL: Deep Learning
GANs: Generative Adversarial Networks
GDPR: General Data Protection Regulation
LLMs: Large Language Models
MeSH: Medical Subject Headings
ML: Machine Learning
Mocap: Motion Capture
NEC: Necrotizing Enterocolitis
NICU: Neonatal Intensive Care Unit
NLP: Natural Language Processing
SUS: System Usability Scale
WSI: Whole Slide Images
OSCE: Objective Structured Clinical Examination
OSATS: Objective Structured Assessment of Technical Skills
VR: Virtual Reality
CNN: Convolutional Neural Network
SVM: Support Vector Machine
DNN: Deep Neural Network
LSTM: Long Short-Term Memory
GEARS: Global Evaluative Assessment of Robotic Skills
SBERT: Sentence-BERT
MCRDR: Multiple Classification Ripple Down Rules

References

  1. Hallquist, E.; Gupta, I.; Montalbano, M.; Loukas, M. Applications of Artificial Intelligence in Medical Education: A Systematic Review. Cureus 2025, 17, e79878. [Google Scholar] [CrossRef]
  2. Gordon, M.; Daniel, M.; Ajiboye, A.; Uraiby, H.; Xu, N.Y.; Bartlett, R.; Hanson, J.; Haas, M.; Spadafore, M.; Grafton-Clarke, C.; et al. A Scoping Review of Artificial Intelligence in Medical Education: BEME Guide No. 84. Med. Teach. 2024, 46, 446–470. [Google Scholar] [CrossRef]
  3. Nagi, F.; Salih, R.; Alzubaidi, M.; Shah, H.; Alam, T.; Shah, Z.; Househ, M. Applications of Artificial Intelligence (AI) in Medical Education: A Scoping Review. Stud. Health Technol. Inform. 2023, 305, 648–651. [Google Scholar] [CrossRef] [PubMed]
  4. Shaw, K.; Henning, M.A.; Webster, C.S. Artificial Intelligence in Medical Education: A Scoping Review of the Evidence for Efficacy and Future Directions. Med. Sci. Educ. 2025, 35, 1803–1816. [Google Scholar] [CrossRef] [PubMed]
  5. Younis, H.A.; Eisa, T.A.E.; Nasser, M.; Sahib, T.M.; Noor, A.A.; Alyasiri, O.M.; Salisu, S.; Hayder, I.M.; Younis, H.A. A Systematic Review and Meta-Analysis of Artificial Intelligence Tools in Medicine and Healthcare: Applications, Considerations, Limitations, Motivation and Challenges. Diagnostics 2024, 14, 109. [Google Scholar] [CrossRef] [PubMed]
  6. Giansanti, D.; Pirrera, A. Integrating AI and Assistive Technologies in Healthcare: Insights from a Narrative Review of Reviews. Healthcare 2025, 13, 556. [Google Scholar] [CrossRef] [PubMed]
  7. Lee, J.; Wu, A.S.; Li, D.; Kulasegaram, K.M. Artificial Intelligence in Undergraduate Medical Education: A Scoping Review. Acad. Med. 2021, 96, S62–S70. [Google Scholar] [CrossRef]
  8. Kovalainen, T.; Pramila-Savukoski, S.; Kuivila, H.-M.; Juntunen, J.; Jarva, E.; Rasi, M.; Mikkonen, K. Utilising Artificial Intelligence in Developing Education of Health Sciences Higher Education: An Umbrella Review of Reviews. Nurs. Educ. Today 2025, 147, 106600. [Google Scholar] [CrossRef]
  9. Feigerlova, E.; Hani, H.; Hothersall-Davies, E. A Systematic Review of the Impact of Artificial Intelligence on Educational Outcomes in Health Professions Education. BMC Med. Educ. 2025, 25, 129. [Google Scholar] [CrossRef]
  10. Batista, J.; Mesquita, A.; Carnaz, G. Generative AI and Higher Education: Trends, Challenges, and Future Directions from a Systematic Literature Review. Information 2024, 15, 676. [Google Scholar] [CrossRef]
  11. Al-kfairy, M.; Mustafa, D.; Kshetri, N.; Insiew, M.; Alfandi, O. Ethical Challenges and Solutions of Generative AI: An Interdisciplinary Perspective. Informatics 2024, 11, 58. [Google Scholar] [CrossRef]
  12. Mohammad Amini, M.; Jesus, M.; Fanaei Sheikholeslami, D.; Alves, P.; Hassanzadeh Benam, A.; Hariri, F. Artificial Intelligence Ethics and Challenges in Healthcare Applications: A Comprehensive Review in the Context of the European GDPR Mandate. Mach. Learn. Knowl. Extr. 2023, 5, 1023–1035. [Google Scholar] [CrossRef]
  13. van Kolfschooten, H.B. A Health-Conformant Reading of the GDPR’s Right Not to Be Subject to Automated Decision-Making. Med. Law Rev. 2024, 32, 373–391. [Google Scholar] [CrossRef] [PubMed]
  14. Gilbert, F.J.; Palmer, J.; Woznitza, N.; Nash, J.; Brackstone, C.; Faria, L.; Dunbar, J.K.; Hogg, H.D.J.; Liu, X.; Denniston, A.K. Data and Data Privacy Impact Assessments in the Context of AI Research and Practice in the UK. Front. Health Serv. 2025, 5, 1525955. [Google Scholar] [CrossRef] [PubMed]
  15. Garcia, P.E.; Marques, F.C. Issues and Limitations on the Integration of Artificial Intelligence into Medical Education: A Narrative Review. Educ. Sci. 2024, 14, 379. [Google Scholar] [CrossRef]
  16. Barrera Castro, G.P.; Chiappe, A.; Ramírez-Montoya, M.S.; Alcántar Nieblas, C. Key Barriers to Personalized Learning in Times of Artificial Intelligence: A Literature Review. Appl. Sci. 2025, 15, 3103. [Google Scholar] [CrossRef]
  17. Lalova-Spinks, T.; Valcke, P.; Ioannidis, J.P.A.; Huys, I. EU–US Data Transfers: An Enduring Challenge for Health Research Collaborations. NPJ Digit. Med. 2024, 7, 215. [Google Scholar] [CrossRef]
  18. Falagas, M.E.; Pitsouni, E.I.; Malietzis, G.A.; Pappas, G. Comparison of PubMed, Scopus, Web of Science, and Google Scholar: Strengths and Weaknesses. FASEB J. 2008, 22, 338–342. [Google Scholar] [CrossRef] [PubMed]
  19. Laohawetwanit, T.; Apornvirat, S.; Kantasiripitak, C. ChatGPT as a Teaching Tool: Preparing Pathology Residents for Board Examination with AI-Generated Digestive System Pathology Tests. Am. J. Clin. Pathol. 2024, 162, 471–479. [Google Scholar] [CrossRef]
  20. Gan, W.; Ouyang, J.; Li, H.; Xue, Z.; Zhang, Y.; Dong, Q.; Huang, J.; Zheng, X.; Zhang, Y. Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial. J. Med. Internet Res. 2024, 26, e57037. [Google Scholar] [CrossRef]
  21. Brügge, E.; Ricchizzi, S.; Arenbeck, M.; Keller, M.N.; Schur, L.; Stummer, W.; Holling, M.; Lu, M.H.; Darici, D. Large Language Models Improve Clinical Decision Making of Medical Students through Patient Simulation and Structured Feedback: A Randomized Controlled Trial. BMC Med. Educ. 2024, 24, 1391. [Google Scholar] [CrossRef]
  22. Yamamoto, A.; Koda, M.; Ogawa, H.; Miyoshi, T.; Maeda, Y.; Otsuka, F.; Ino, H. Enhancing Medical Interview Skills Through AI-Simulated Patient Interactions: Nonrandomized Controlled Trial. JMIR Med. Educ. 2024, 10, e58753. [Google Scholar] [CrossRef]
  23. Zheng, K.; Shen, Z.; Chen, Z.; Che, C.; Zhu, H. Application of AI-Empowered Scenario-Based Simulation Teaching Mode in Cardiovascular Disease Education. BMC Med. Educ. 2024, 24, 1003. [Google Scholar] [CrossRef]
  24. Aster, A.; Hütt, C.; Morton, C.; Flitton, M.; Laupichler, M.C.; Raupach, T. Development and Evaluation of an Emergency Department Serious Game for Undergraduate Medical Students. BMC Med. Educ. 2024, 24, 1061. [Google Scholar] [CrossRef] [PubMed]
  25. Liaw, S.Y.; Tan, J.Z.; Lim, S.; Zhou, W.; Yap, J.; Ratan, R.; Ooi, S.L.; Wong, S.J.; Seah, B.; Chua, W.L. Artificial Intelligence in Virtual Reality Simulation for Interprofessional Communication Training: Mixed Method Study. Nurse Educ. Today 2023, 122, 105718. [Google Scholar] [CrossRef]
  26. Cianciolo, A.T.; LaVoie, N.; Parker, J. Machine Scoring of Medical Students’ Written Clinical Reasoning: Initial Validity Evidence. Acad. Med. 2021, 96, 1026–1035. [Google Scholar] [CrossRef]
  27. Su, J.-M.; Hsu, S.-Y.; Fang, T.-Y.; Wang, P.-C. Developing and Validating a Knowledge-Based AI Assessment System for Learning Clinical Core Medical Knowledge in Otolaryngology. Comput. Biol. Med. 2024, 178, 108765. [Google Scholar] [CrossRef]
  28. Brutschi, R.; Wang, R.; Kolbe, M.; Weiss, K.; Lohmeyer, Q.; Meboldt, M. Speech Recognition Technology for Assessing Team Debriefing Communication and Interaction Patterns: An Algorithmic Toolkit for Healthcare Simulation Educators. Adv. Simul. 2024, 9, 42. [Google Scholar] [CrossRef] [PubMed]
  29. Wang, J.; Liao, Y.; Liu, S.; Zhang, D.; Wang, N.; Shu, J.; Wang, R. The Impact of Using ChatGPT on Academic Writing Among Medical Undergraduates. Ann. Med. 2024, 56, 2426760. [Google Scholar] [CrossRef] [PubMed]
  30. Wiggins, W.F.; Caton, M.T., Jr.; Magudia, K.; Rosenthal, M.H.; Andriole, K.P. A Conference-Friendly, Hands-On Introduction to Deep Learning for Radiology Trainees. J. Digit. Imaging 2021, 34, 1026–1033. [Google Scholar] [CrossRef]
  31. Krive, J.; Isola, M.; Chang, L.; Patel, T.; Anderson, M.; Sreedhar, R. Grounded in Reality: Artificial Intelligence in Medical Education. JAMIA Open 2023, 6, ooad037. [Google Scholar] [CrossRef]
  32. Furlan, R.; Gatti, M.; Menè, R.; Shiffer, D.; Marchiori, C.; Giaj Levra, A.; Saturnino, V.; Brunetta, E.; Dipaola, F. A Natural Language Processing-Based Virtual Patient Simulator and Intelligent Tutoring System for the Clinical Diagnostic Process: Simulator Development and Case Study. JMIR Med. Inform. 2021, 9, e24073. [Google Scholar] [CrossRef]
  33. Wang, M.; Sun, Z.; Jia, M.; Wang, Y.; Wang, H.; Zhu, X.; Chen, L.; Ji, H. Intelligent Virtual Case Learning System Based on Real Medical Records and Natural Language Processing. BMC Med. Inform. Decis. Mak. 2022, 22, 60. [Google Scholar] [CrossRef] [PubMed]
  34. Desseauve, D.; Lescar, R.; de la Fourniere, B.; Ceccaldi, P.F.; Dziadzko, M. AI in Obstetrics: Evaluating Residents’ Capabilities and Interaction Strategies with ChatGPT. Eur. J. Obstet. Gynecol. Reprod. Biol. 2024, 302, 238–241. [Google Scholar] [CrossRef] [PubMed]
  35. Scherr, R.; Halaseh, F.F.; Spina, A.; Andalib, S.; Rivera, R. ChatGPT Interactive Medical Simulations for Early Clinical Education: Case Study. JMIR Med. Educ. 2023, 9, e49877. [Google Scholar] [CrossRef] [PubMed]
  36. Saluja, S.; Tigga, S.R. Capabilities and Limitations of ChatGPT in Anatomy Education: An Interaction with ChatGPT. Cureus 2024, 16, e69000. [Google Scholar] [CrossRef]
  37. Veras, M.; Dyer, J.O.; Shannon, H.; Bogie, B.J.M.; Ronney, M.; Sekhon, H.; Rutherford, D.; Silva, P.G.B.; Kairy, D. A Mixed Methods Crossover Randomized Controlled Trial Exploring the Experiences, Perceptions, and Usability of Artificial Intelligence (ChatGPT) in Health Sciences Education. Digit. Health 2024, 10, 20552076241298485. [Google Scholar] [CrossRef]
  38. Xie, Y.; Seth, I.; Hunter-Smith, D.J.; Rozen, W.M.; Seifman, M.A. Investigating the Impact of Innovative AI Chatbot on Post-Pandemic Medical Education and Clinical Assistance: A Comprehensive Analysis. ANZ J. Surg. 2024, 94, 68–77. [Google Scholar] [CrossRef]
  39. Siyar, S.; Azarnoush, H.; Rashidi, S.; Winkler-Schwartz, A.; Bissonnette, V.; Ponnudurai, N.; Del Maestro, R.F. Machine Learning Distinguishes Neurosurgical Skill Levels in a Virtual Reality Tumor Resection Task. Med. Biol. Eng. Comput. 2020, 58, 1357–1367. [Google Scholar] [CrossRef]
  40. Alkadri, S.; Ledwos, N.; Mirchi, N.; Reich, A.; Yilmaz, R.; Driscoll, M.; Del Maestro, R.F. Utilizing a Multilayer Perceptron Artificial Neural Network to Assess a Virtual Reality Surgical Procedure. Comput. Biol. Med. 2021, 136, 104770. [Google Scholar] [CrossRef]
  41. Simmonds, C.; Brentnall, M.; Lenihan, J. Evaluation of a Novel Universal Robotic Surgery Virtual Reality Simulation Proficiency Index That Will Allow Comparisons of Users across Any Virtual Reality Simulation Curriculum. Surg. Endosc. 2021, 35, 5867–5875. [Google Scholar] [CrossRef]
  42. Radi, I.; Tellez, J.C.; Alterio, R.E.; Scott, D.J.; Sankaranarayanan, G.; Nagaraj, M.B.; Hogg, M.E.; Zeh, H.J.; Polanco, P.M. Feasibility, Effectiveness and Transferability of a Novel Mastery-Based Virtual Reality Robotic Training Platform for General Surgery Residents. Surg. Endosc. 2022, 36, 7279–7287. [Google Scholar] [CrossRef] [PubMed]
  43. Fazlollahi, A.M.; Bakhaidar, M.; Alsayegh, A.; Yilmaz, R.; Winkler-Schwartz, A.; Mirchi, N.; Langleben, I.; Ledwos, N.; Sabbagh, A.J.; Bajunaid, K.; et al. Effect of Artificial Intelligence Tutoring vs. Expert Instruction on Learning Simulated Surgical Skills Among Medical Students: A Randomized Clinical Trial. JAMA Netw. Open 2022, 5, e2149008. [Google Scholar] [CrossRef]
  44. Ruberto, A.J.; Rodenburg, D.; Ross, K.; Sarkar, P.; Hungler, P.C.; Etemad, A.; Howes, D.; Clarke, D.; McLellan, J.; Wilson, D.; et al. The Future of Simulation-Based Medical Education: Adaptive Simulation Utilizing a Deep Multitask Neural Network. AEM Educ. Train. 2021, 5, e10605. [Google Scholar] [CrossRef]
  45. Cai, N.; Wang, G.; Xu, L.; Zhou, Y.; Chong, H.; Zhao, Y.; Wang, J.; Yan, W.; Zhang, B.; Liu, N. Examining the Impact of Perceptual Learning Artificial-Intelligence-Based on the Incidence of Paresthesia When Performing the Ultrasound-Guided Popliteal Sciatic Block: Simulation-Based Randomized Study. BMC Anesthesiol. 2022, 22, 392. [Google Scholar] [CrossRef]
  46. Yovanoff, M.A.; Chen, H.E.; Pepley, D.F.; Mirkin, K.A.; Han, D.C.; Moore, J.Z.; Miller, S.R. Investigating the Effect of Simulator Functional Fidelity and Personalized Feedback on Central Venous Catheterization Training. J. Surg. Educ. 2018, 75, 1410–1421. [Google Scholar] [CrossRef]
  47. Ledwos, N.; Mirchi, N.; Yilmaz, R.; Winkler-Schwartz, A.; Sawni, A.; Fazlollahi, A.M.; Bissonnette, V.; Bajunaid, K.; Sabbagh, A.J.; Del Maestro, R.F. Assessment of Learning Curves on a Simulated Neurosurgical Task Using Metrics Selected by Artificial Intelligence. J. Neurosurg. 2022, 137, 1160–1171. [Google Scholar] [CrossRef]
  48. Di Mitri, D.; Schneider, J.; Specht, M.; Drachsler, H. Detecting Mistakes in CPR Training with Multimodal Data and Neural Networks. Sensors 2019, 19, 3099. [Google Scholar] [CrossRef]
  49. Melnyk, R.; Campbell, T.; Holler, T.; Cameron, K.; Saba, P.; Witthaus, M.W.; Joseph, J.; Ghazi, A. See Like an Expert: Gaze-Augmented Training Enhances Skill Acquisition in a Virtual Reality Robotic Suturing Task. J. Endourol. 2021, 35, 376–382. [Google Scholar] [CrossRef] [PubMed]
  50. Liaw, S.Y.; Tan, J.Z.; Bin Rusli, K.D.; Ratan, R.; Zhou, W.; Lim, S.; Lau, T.C.; Seah, B.; Chua, W.L. Artificial Intelligence Versus Human-Controlled Doctor in Virtual Reality Simulation for Sepsis Team Training: Randomized Controlled Study. J. Med. Internet Res. 2023, 25, e47748. [Google Scholar] [CrossRef] [PubMed]
  51. Riaño, D.; Real, F.; Alonso, J.R. Improving Resident’s Skills in the Management of Circulatory Shock with a Knowledge-Based E-Learning Tool. Int. J. Med. Inform. 2018, 113, 49–55. [Google Scholar] [CrossRef]
  52. Hamilton, B.C.; Dairywala, M.I.; Highet, A.; Nguyen, T.C.; O’Sullivan, P.; Chern, H.; Soriano, I.S. Artificial Intelligence Based Real-Time Video Ergonomic Assessment and Training Improves Resident Ergonomics. Am. J. Surg. 2023, 226, 741–746. [Google Scholar] [CrossRef]
  53. Hershberger, P.J.; Pei, Y.; Bricker, D.A.; Crawford, T.N.; Shivakumar, A.; Castle, A.; Conway, K.; Medaramitta, R.; Rechtin, M.; Wilson, J.F. Motivational Interviewing Skills Practice Enhanced with Artificial Intelligence: ReadMI. BMC Med. Educ. 2024, 24, 237. [Google Scholar] [CrossRef]
  54. Chang, J.; Bliss, L.; Angelov, N.; Glick, A. Artificial Intelligence-Assisted Full-Mouth Radiograph Mounting in Dental Education. J. Dent. Educ. 2024, 88, 933–939. [Google Scholar] [CrossRef] [PubMed]
  55. Liu, S.; Watkins, K.; Hall, C.E.; Liu, Y.; Lee, S.H.; Papandria, D.; Delman, K.A.; Srinivasan, J.; Patel, A.; Davis, S.S.; et al. Utilizing Simulation to Evaluate Robotic Skill Acquisition and Learning Decay. Surg. Laparosc. Endosc. Percutan. Tech. 2023, 33, 317–323. [Google Scholar] [CrossRef]
  56. Barui, S.; Sanyal, P.; Rajmohan, K.S.; Panigrahi, A.; Kundu, R. Perception without Preconception: Comparison between the Human and Machine Learner in Recognition of Tissues from Histological Sections. Sci. Rep. 2022, 12, 16420. [Google Scholar] [CrossRef]
  57. Cheng, C.T.; Chen, C.C.; Fu, C.Y.; Chaou, C.H.; Wu, Y.T.; Hsu, C.P.; Chang, C.C.; Chung, I.F.; Hsieh, C.H.; Hsieh, M.J.; et al. Artificial Intelligence-Based Education Assists Medical Students’ Interpretation of Hip Fracture. Insights Imaging 2020, 11, 119. [Google Scholar] [CrossRef]
  58. Aronovitz, N.; Hazan, I.; Jedwab, R.; Ben Shitrit, I.; Quinn, A.; Wacht, O.; Fuchs, L. The Effect of Real-Time EF Automatic Tool on Cardiac Ultrasound Performance among Medical Students. PLoS ONE 2024, 19, e0299461. [Google Scholar] [CrossRef]
  59. Lei, T.; Zheng, Q.; Feng, J.; Zhang, L.; Zhou, Q.; He, M.; Lin, M.; Xie, H.N. Enhancing Trainee Performance in Obstetric Ultrasound through an Artificial Intelligence System: Randomized Controlled Trial. Ultrasound Obstet. Gynecol. 2024, 64, 453–462. [Google Scholar] [CrossRef] [PubMed]
  60. Fang, Z.; Xu, Z.; He, X.; Han, W. Artificial Intelligence-Based Pathologic Myopia Identification System in the Ophthalmology Residency Training Program. Front. Cell Dev. Biol. 2022, 10, 1053079. [Google Scholar] [CrossRef] [PubMed]
  61. Yang, Y.; Xian, D.; Yu, L.; Kong, Y.; Lv, H.; Huang, L.; Liu, K.; Zhang, H.; Wei, W.; Tang, H. Integration of AI-Assisted in Digital Cervical Cytology Training: A Comparative Study. Cytopathology 2024, 36, 156–164. [Google Scholar] [CrossRef] [PubMed]
  62. Tabuchi, H.; Nakajima, I.; Day, M.; Masumoto, H.; Tsuji, S.; Miki, M.; Enno, H.; Masumoto, K. Comparative Educational Effectiveness of AI Generated Images and Traditional Lectures for Diagnosing Chalazion and Sebaceous Carcinoma. Sci. Rep. 2024, 14, 29200. [Google Scholar] [CrossRef]
  63. Marwaha, A.; Chitayat, D.; Meyn, M.S.; Mendoza-Londono, R.; Chad, L. The Point-of-Care Use of a Facial Phenotyping Tool in the Genetics Clinic: Enhancing Diagnosis and Education with Machine Learning. Am. J. Med. Genet. A 2021, 185, 1151–1158. [Google Scholar] [CrossRef]
  64. Pereira, D.S.M.; Falcão, F.; Nunes, A.; Santos, N.; Costa, P.; Pêgo, J.M. Designing and Building OSCEBot® for Virtual OSCE—Performance Evaluation. Med. Educ. Online 2023, 28, 2228550. [Google Scholar] [CrossRef] [PubMed]
  65. Yang, W.; Hebert, D.; Kim, S.; Kang, B. MCRDR Knowledge-Based 3D Dialogue Simulation in Clinical Training and Assessment. J. Med. Syst. 2019, 43, 200. [Google Scholar] [CrossRef]
  66. Ramgopal, S.; Varma, S.; Gorski, J.K.; Kester, K.M.; Shieh, A.; Suresh, S. Evaluation of a Large Language Model on the American Academy of Pediatrics’ PREP Emergency Medicine Question Bank. Pediatr. Emerg. Care 2024, 40, 871–875. [Google Scholar] [CrossRef]
  67. Berbenyuk, A.; Powell, L.; Zary, N. Feasibility and Educational Value of Clinical Cases Generated Using Large Language Models. Stud. Health Technol. Inform. 2024, 316, 1524–1528. [Google Scholar] [CrossRef]
  68. Temsah, M.H.; Alhuzaimi, A.N.; Almansour, M.; Aljamaan, F.; Alhasan, K.; Batarfi, M.A.; Altamimi, I.; Alharbi, A.; Alsuhaibani, A.A.; Alwakeel, L.; et al. Art or Artifact: Evaluating the Accuracy, Appeal, and Educational Value of AI-Generated Imagery in DALL·E 3 for Illustrating Congenital Heart Diseases. J. Med. Syst. 2024, 48, 54. [Google Scholar] [CrossRef]
  69. Yang, J.H.; Goodman, E.D.; Dawes, A.J.; Gahagan, J.V.; Esquivel, M.M.; Liebert, C.A.; Kin, C.; Yeung, S.; Gurland, B.H. Using AI and Computer Vision to Analyze Technical Proficiency in Robotic Surgery. Surg. Endosc. 2023, 37, 3010–3017. [Google Scholar] [CrossRef]
  70. Bissonnette, V.; Mirchi, N.; Ledwos, N.; Alsidieri, G.; Winkler-Schwartz, A.; Del Maestro, R.F.; Neurosurgical Simulation & Artificial Intelligence Learning Centre. Artificial Intelligence Distinguishes Surgical Training Levels in a Virtual Reality Spinal Task. J. Bone Jt. Surg. Am. 2019, 101, e127. [Google Scholar] [CrossRef] [PubMed]
  71. Gleason, A.; Servais, E.; Quadri, S.; Manganiello, M.; Cheah, Y.L.; Simon, C.J.; Preston, E.; Graham-Stephenson, A.; Wright, V. Developing Basic Robotic Skills Using Virtual Reality Simulation and Automated Assessment Tools: A Multidisciplinary Robotic Virtual Reality-Based Curriculum Using the Da Vinci Skills Simulator and Tracking Progress with the Intuitive Learning Platform. J. Robot. Surg. 2022, 16, 1313–1319. [Google Scholar] [CrossRef]
  72. Fazlollahi, A.M.; Yilmaz, R.; Winkler-Schwartz, A.; Mirchi, N.; Ledwos, N.; Bakhaidar, M.; Alsayegh, A.; Del Maestro, R.F. AI in Surgical Curriculum Design and Unintended Outcomes for Technical Competencies in Simulation Training. JAMA Netw. Open 2023, 6, e2334658. [Google Scholar] [CrossRef] [PubMed]
  73. Smith, R.; Julian, D.; Dubin, A. Deep Neural Networks Are Effective Tools for Assessing Performance during Surgical Training. J. Robot. Surg. 2022, 16, 559–562. [Google Scholar] [CrossRef] [PubMed]
  74. Nagaraj, M.B.; Namazi, B.; Sankaranarayanan, G.; Scott, D.J. Developing Artificial Intelligence Models for Medical Student Suturing and Knot-Tying Video-Based Assessment and Coaching. Surg. Endosc. 2023, 37, 402–411. [Google Scholar] [CrossRef] [PubMed]
  75. Laverde, R.; Rueda, C.; Amado, L.; Rojas, D.; Altuve, M. Artificial Neural Network for Laparoscopic Skills Classification Using Motion Signals from Apple Watch. Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. 2018, 2018, 5434–5437. [Google Scholar] [CrossRef]
  76. Bogar, P.Z.; Virag, M.; Bene, M.; Hardi, P.; Matuz, A.; Schlegl, A.T.; Toth, L.; Molnar, F.; Nagy, B.; Rendeki, S.; et al. Validation of a Novel, Low-Fidelity Virtual Reality Simulator and an Artificial Intelligence Assessment Approach for Peg Transfer Laparoscopic Training. Sci. Rep. 2024, 14, 16702. [Google Scholar] [CrossRef]
  77. Mears, J.; Kaleem, S.; Panchamia, R.; Kamel, H.; Tam, C.; Thalappillil, R.; Murthy, S.; Merkler, A.E.; Zhang, C.; Ch’ang, J.H. Leveraging the Capabilities of AI: Novice Neurology-Trained Operators Performing Cardiac POCUS in Patients with Acute Brain Injury. Neurocrit. Care 2024, 41, 523–532. [Google Scholar] [CrossRef]
  78. Mistro, M.; Sheng, Y.; Ge, Y.; Kelsey, C.R.; Palta, J.R.; Cai, J.; Wu, Q.; Yin, F.F.; Wu, Q.J. Knowledge Models as Teaching Aid for Training Intensity Modulated Radiation Therapy Planning: A Lung Cancer Case Study. Front. Artif. Intell. 2020, 3, 66. [Google Scholar] [CrossRef]
  79. Li, J.; Zong, H.; Wu, E.; Wu, R.; Peng, Z.; Zhao, J.; Yang, L.; Xie, H.; Shen, B. Exploring the Potential of Artificial Intelligence to Enhance the Writing of English Academic Papers by Non-Native English-Speaking Medical Students—The Educational Application of ChatGPT. BMC Med. Educ. 2024, 24, 736. [Google Scholar] [CrossRef]
  80. Xin, L.; Bin, Z.; Xiaoqin, D.; Wenjing, H.; Yuandong, L.; Jinyu, Z.; Chen, Z.; Lin, W. Detecting Task Difficulty of Learners in Colonoscopy: Evidence from Eye-Tracking. J. Eye Mov. Res. 2021, 14, 5. [Google Scholar] [CrossRef]
  81. Meade, S.M.; Salas-Vega, S.; Nagy, M.R.; Sundar, S.J.; Steinmetz, M.P.; Benzel, E.C.; Habboub, G. A Pilot Remote Curriculum to Enhance Resident and Medical Student Understanding of Machine Learning in Healthcare. World Neurosurg. 2023, 180, e142–e148. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
