1. Introduction
Large Language Models (LLMs), a class of artificial intelligence (AI) systems designed for complex language processing tasks, are increasingly being applied in healthcare and medical education [1,2]. These models can retrieve domain-specific knowledge, generate clinical reasoning pathways, and simulate decision-making processes, positioning them as promising tools for diagnostic assistance and medical training. Their performance on English-language assessments, such as the United States Medical Licensing Examination (USMLE), has approached or exceeded that of human examinees in recent studies [3,4].
Despite these advances, there is limited understanding of how LLMs perform in high-stakes, non-English medical certification settings. This gap is especially relevant given the linguistic, cultural, and curricular variability across global medical education systems. Evaluating model performance in such contexts is essential to determine whether LLMs generalize beyond English-dominant datasets and standardized examinations.
The Chilean National Certification Examination in Anesthesiology (CONACEM), a rigorous Spanish-language board exam, provides an ideal setting for benchmarking AI performance in a specialized clinical field. Anesthesiology requires not only factual knowledge of pharmacology, physiology, and critical care but also situational decision-making and clinical reasoning. These multifaceted cognitive requirements are likely to challenge current LLMs, which tend to excel in recall-based tasks but may falter on items that require inference, synthesis, or contextual nuance.
While previous studies evaluating AI on multiple-choice questions have primarily relied on Classical Test Theory (CTT) [5,6], which captures surface-level accuracy, this study integrates Item Response Theory (IRT)—specifically the one-parameter logistic Rasch model—to provide deeper insight into latent ability and item difficulty. To further enhance interpretability, we classified all test items according to Bloom’s revised taxonomy [7,8] (Recall, Understand, Apply, Analyze), which enabled a stratified analysis across cognitive domains.
In addition, we conducted a qualitative error analysis of incorrect LLM responses using a classification framework adapted from Roy et al. (2024) [9]. Each justification was reviewed and coded into one of four error types: reasoning-based, knowledge-based, reading comprehension, or plausible non-error. This allowed us to probe the underlying mechanisms of failure, moving beyond simple correctness to assess the cognitive and interpretive limitations of LLMs in clinical decision-making.
We hypothesized that closed-source models (e.g., GPT-4o and GPT-o1) would outperform their open-source counterparts (e.g., LLaMA 3 and Deepseek-R1) due to their extensive training datasets and architectural refinements. Nonetheless, we anticipated that all models would underperform on high-difficulty and high-discrimination items, especially those requiring deep reasoning. We also expected human examinees to maintain an advantage in tasks involving contextual judgment and clinical nuance.
The primary aim of this study was to benchmark the performance of nine state-of-the-art LLMs against human anesthesiology candidates using psychometrically calibrated exam data. By incorporating cognitive complexity analysis and error taxonomy, we offer a multidimensional evaluation framework that enhances the understanding of LLM behavior in educational and clinical assessment contexts.
2. Methods
2.1. Study Design and Ethical Approval
This study employed a cross-sectional experimental design to benchmark the performance of nine large language models against historical responses from human examinees on a Spanish-language board examination in anesthesiology. The dataset comprised anonymized responses from 134 Chilean anesthesiology candidates who had previously completed the National Certification Examination in Anesthesiology under standard conditions. Given the retrospective nature of the analysis and the complete de-identification of the human data, formal ethical review was not required under local research governance protocols. The CONACEM board had previously approved the use of these data for educational and benchmarking purposes.
2.2. Exam Dataset Preparation
A total of 68 multiple-choice questions (MCQs) were selected from recent versions of the CONACEM exam. These questions represent the core domains of anesthesiology, including pharmacology, physiology, airway management, pediatric and obstetric anesthesia, critical care, and perioperative monitoring.
Exclusion Criteria and Item Refinement
To ensure compatibility across all evaluated LLMs, items involving multimodal inputs (e.g., imaging or waveform interpretation) were excluded, as several models (e.g., Deepseek-R1) lack vision capabilities. The remaining items were then subjected to psychometric refinement. An exploratory factor analysis (EFA) was attempted to examine dimensionality; however, given the limited sample size (n = 153), the results were considered unstable, with factor loadings below 0.5. Consequently, EFA findings were not used in the final calibration, and unidimensionality was instead assessed using Rasch-based methods.
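For transparency, the dimensionality check described above can be sketched as follows. This is a minimal, illustrative example only—the published analysis judged the EFA unstable and did not use it—and it assumes a persons-by-items binary matrix named responses (a hypothetical object name), with tetrachoric correlations used because the items are dichotomous.

```r
library(psych)

# responses: persons x items matrix of 0/1 scores (hypothetical name)
rho <- tetrachoric(responses)$rho                  # tetrachoric correlations for binary items
efa <- fa(rho, nfactors = 1, fm = "minres",
          n.obs = nrow(responses))                 # single-factor exploratory solution
print(efa$loadings, cutoff = 0.5)                  # loadings below 0.5 were treated as weak
```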
To minimize item bias, we conducted a Differential Item Functioning (DIF) analysis using both the Mantel–Haenszel method and binary logistic regression, following established psychometric practices for detecting subgroup-based variance in item functioning [10,11]. Five items flagged for significant DIF were excluded from the analysis. The final dataset consisted of 63 MCQs, representing a refined and psychometrically sound unidimensional instrument appropriate for Rasch modeling and model-human comparisons.
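A sketch of this DIF screening is shown below, using the Mantel–Haenszel and logistic-regression procedures from the difR package. The grouping variable is an assumption here (the text does not specify which subgroups were contrasted), and responses and group are hypothetical object names.

```r
library(difR)

# responses: persons x items 0/1 matrix; group: binary subgroup indicator (assumed)
mh <- difMH(Data = responses, group = group, focal.name = 1)        # Mantel-Haenszel DIF
lr <- difLogistic(Data = responses, group = group, focal.name = 1)  # logistic-regression DIF

flagged <- union(mh$DIFitems, lr$DIFitems)   # items flagged by either procedure
responses_clean <- responses[, -flagged]     # drop flagged items before Rasch calibration
```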
2.3. Response Generation and Model Evaluation
Nine state-of-the-art LLMs were tested, including GPT-4o [12] and GPT-o1 [13] (OpenAI), Claude Sonnet and Claude Haiku (Anthropic) [14], Deepseek-R1 [15] (Deepseek), Qwen 2.5 [16] (Alibaba), Gemini 1.5 Pro [17] (Google), and two versions of Meta’s LLaMA-3 [18] (70B and 405B). Each model was accessed via its official API and tested using a standardized zero-shot prompting approach [19], meaning that no in-context learning or chain-of-thought strategies were applied. All MCQs were presented verbatim in Spanish to preserve clinical and linguistic fidelity. To assess the impact of sampling randomness, responses were generated at two temperature settings: T = 0 for deterministic output and T = 1 to allow variability. The exact prompts used for all models are provided in the Supplementary Materials.
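As an illustration of the querying procedure, the sketch below submits one item at each temperature setting. The endpoint, payload format, and the mcq_prompts vector are assumptions for this example (the OpenAI chat-completions format is shown as one instance); each vendor’s API differs, and the exact prompts are those given in the Supplementary Materials.

```r
library(httr)

# Hypothetical helper: send one Spanish-language MCQ prompt and return the model's reply.
ask_model <- function(prompt, temperature = 0) {
  resp <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", Sys.getenv("OPENAI_API_KEY"))),
    body = list(model = "gpt-4o",
                messages = list(list(role = "user", content = prompt)),
                temperature = temperature),
    encode = "json"
  )
  content(resp)$choices[[1]]$message$content
}

# mcq_prompts: character vector holding the 63 items verbatim in Spanish (assumed name)
answers_t0 <- vapply(mcq_prompts, ask_model, character(1), temperature = 0)
answers_t1 <- vapply(mcq_prompts, ask_model, character(1), temperature = 1)
```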
2.4. Rasch Modeling and Item Characterization
Item Response Theory (IRT) modeling was conducted using a one-parameter logistic (1PL) Rasch model, implemented in the eRm package in R (v4.3.2). The Rasch model was selected for its desirable psychometric properties, including sample-independent item calibration, interval-scale measurement, and robustness under small-to-moderate sample sizes [20,21]. The Rasch framework assumes that all items measure the same latent trait—in this case, theoretical competence in anesthesiology—and that they do so with equal discrimination [6]. Item difficulty parameters (β) were estimated from human response data and then fixed, enabling the maximum likelihood estimation (MLE) of latent ability scores (θ) for each LLM on the same psychometric scale.
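A minimal sketch of this two-step procedure is given below: the Rasch model is calibrated on the human response matrix with eRm, and an LLM’s ability is then estimated by maximizing the 1PL likelihood with the item difficulties held fixed. The anchored-MLE helper and the object names (X_human, x_llm) are illustrative assumptions, not the authors’ code.

```r
library(eRm)

# X_human: 134 x 63 matrix of human 0/1 responses (hypothetical name)
rasch_fit  <- RM(X_human)
item_diff  <- -rasch_fit$betapar            # eRm reports easiness; negate to obtain difficulty (beta)
human_abil <- person.parameter(rasch_fit)   # human ability estimates (theta) on the same scale

# Anchored MLE of an LLM's theta: 1PL log-likelihood with beta fixed at the human calibration
estimate_theta <- function(x_llm, beta) {   # x_llm: the model's 0/1 scores on the 63 items
  loglik <- function(theta) sum(x_llm * (theta - beta) - log1p(exp(theta - beta)))
  optimize(loglik, interval = c(-6, 6), maximum = TRUE)$maximum
}
```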
In addition to the Rasch difficulty estimates, each item was further characterized using the Discrimination Index (DI), defined as the difference in correct response rates between the top and bottom quartiles of human performers. Although a Classical Test Theory (CTT) measure, DI remains widely used for its interpretability and complementary value in assessing item quality [7]. Items were categorized into three difficulty groups (easy: β < 0.3; moderate: 0.3 ≤ β < 0.8; hard: β ≥ 0.8) and four discrimination groups (poor: DI < 0; marginal: 0 ≤ DI < 0.2; good: 0.2 ≤ DI < 0.4; excellent: DI ≥ 0.4), enabling a stratified analysis of model behavior across psychometric categories [21,22].
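The sketch below computes the discrimination index from the human response matrix and bins items into the difficulty and discrimination bands defined above; X_human and item_diff follow the hypothetical names used in the previous sketch.

```r
# Discrimination Index: correct-response rate in the top quartile minus the bottom quartile
total_scores <- rowSums(X_human)
top_group    <- X_human[total_scores >= quantile(total_scores, 0.75), ]
bottom_group <- X_human[total_scores <= quantile(total_scores, 0.25), ]
di <- colMeans(top_group) - colMeans(bottom_group)

difficulty_band <- cut(item_diff, breaks = c(-Inf, 0.3, 0.8, Inf), right = FALSE,
                       labels = c("easy", "moderate", "hard"))
discrimination_band <- cut(di, breaks = c(-Inf, 0, 0.2, 0.4, Inf), right = FALSE,
                           labels = c("poor", "marginal", "good", "excellent"))
```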
To evaluate the psychometric validity and Rasch model assumptions, we conducted three core fit diagnostics. Andersen’s Likelihood Ratio Test assessed item parameter invariance across examinee subgroups [23], the Martin-Löf Test confirmed the unidimensionality of the item set [24,25], and the Wald Test [26,27] verified the item-level model fit. These tests provided rigorous statistical confirmation that the dataset satisfied the Rasch modeling assumptions, allowing for defensible ability estimation and model comparison.
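In eRm, these three diagnostics amount to the calls sketched below; the median split used for the subgroup-based tests is an assumption of this sketch, and rasch_fit is the hypothetical object from the calibration step above.

```r
library(eRm)

LRtest(rasch_fit, splitcr = "median")    # Andersen's LR test: item-parameter invariance
MLoef(rasch_fit)                         # Martin-Loef test: unidimensionality of the item set
Waldtest(rasch_fit, splitcr = "median")  # Wald test: item-level model fit
```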
2.5. Statistical Analysis
Two main performance metrics were used to evaluate the LLM outputs: (1) accuracy, defined as the proportion of correctly answered items, and (2) response time, measured as the average number of seconds taken per item. To explore the impact of item features on model performance, a series of inferential statistical tests was conducted. Logistic regression models were used to assess whether item difficulty (β) and DI significantly predicted the likelihood of a correct model response [20]. Additionally, Pearson’s correlation coefficients were computed to quantify the linear relationship between item difficulty and model accuracy. Group-level comparisons of correct versus incorrect model responses were analyzed using the Mann–Whitney U test, while chi-square tests assessed associations between categorical levels of item difficulty/discrimination and accuracy, with Cramer’s V reported as a measure of effect size. To examine the influence of sampling variability, the accuracy rates at T = 0 and T = 1 were compared using Mann–Whitney U tests.
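For a single model, this battery of item-level tests can be expressed in a few base-R calls, as sketched below. The data frame d, with one row per item and columns correct (0/1), beta, and di, is a hypothetical structure assumed for illustration, as is the p_values vector collected across models and tests for the FDR adjustment.

```r
# d: one row per item with columns correct (0/1), beta (Rasch difficulty), di (hypothetical names)
fit <- glm(correct ~ beta + di, family = binomial, data = d)   # logistic regression
exp(cbind(OR = coef(fit), confint(fit)))                        # odds ratios with 95% CIs

cor.test(d$beta, d$correct)                 # Pearson correlation: difficulty vs. accuracy
wilcox.test(beta ~ correct, data = d)       # Mann-Whitney U: difficulty for correct vs. incorrect

tab <- table(cut(d$beta, c(-Inf, 0.3, 0.8, Inf), right = FALSE,
                 labels = c("easy", "moderate", "hard")), d$correct)
chi <- chisq.test(tab)                                          # chi-square test by difficulty band
cramers_v <- sqrt(unname(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

p_adjusted <- p.adjust(p_values, method = "fdr")   # p_values: collected p-values (assumed vector)
```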
To assess the cognitive alignment between human and artificial intelligence systems, Pearson’s correlation analysis was conducted on the accuracy rates of humans and LLMs across the 15 anesthesiology knowledge domains. This analysis aimed to determine whether humans and LLMs show similar difficulty patterns across clinical domains or reveal fundamentally different cognitive architectures in medical reasoning.
For all statistical tests, a two-tailed p-value < 0.05 was considered significant, and the False Discovery Rate (FDR) correction was applied to control for multiple comparisons. Where appropriate, Cohen’s d and other effect size estimates were reported to aid in interpretation.
The sample size of 134 human participants was considered appropriate for Rasch modeling and item calibration. Prior psychometric literature indicates that sample sizes between 100 and 200 are adequate for 1PL models when working with instruments containing 30–100 items, particularly under unidimensional assumptions and when sufficient item variability is present. Moreover, Rasch models are known for their statistical stability and low bias in small-to-moderate samples due to their strong model constraints [20,21,28]. While larger samples would enable finer subgroup analysis or exploration of multidimensional constructs, the present study—designed as a benchmarking analysis—offers robust estimates for the intended comparisons.
All statistical analyses were conducted using Python (v3.9) and R (v4.3.2), with Rasch-specific modeling performed in eRm and general statistical analyses conducted using base R and the statsmodels and scipy libraries in Python.
2.6. Cognitive Complexity Categorization and Error Taxonomy
To further analyze the qualitative performance of the language models, we categorized all 63 refined multiple-choice questions according to Bloom’s revised taxonomy [7,8] into four cognitive domains: Recall, Understand, Apply, and Analyze. This categorization was performed independently by two medical educators with content expertise in anesthesiology. Discrepancies were resolved by consensus. This classification enabled stratified performance comparisons across varying cognitive demands, providing deeper insights into the strengths and limitations of knowledge processing and clinical reasoning.
Additionally, for the top-performing models—GPT-o1 and Deepseek-R1—we conducted an in-depth qualitative error analysis of incorrect responses. For each incorrect answer, the associated model-generated justification was examined and annotated into one of four categories adapted from the error taxonomy proposed by Roy et al. (2024) [9] for GPT-4’s performance on the USMLE:
- Class 1: Reasoning-based errors, reflecting flawed clinical logic or premature conclusions.
- Class 2: Knowledge-based errors, resulting from factual inaccuracies or misapplication of domain knowledge.
- Class 3: Reading comprehension errors, arising from misinterpretation or neglect of critical information in the prompt.
- Class 4: Non-errors, such as plausible alternative justifications.
This categorization was carried out by two medical experts following refined guidelines adapted from Roy et al.’s [9] multi-label span annotation framework. The taxonomy enabled the identification of systematic patterns in model reasoning and highlighted areas where even correct medical logic could still lead to incorrect answers, a phenomenon also observed in GPT-4’s performance on the USMLE questions.
3. Results
3.1. Psychometric Evaluation of the Instrument
Model fit was assessed using three psychometric tests following the refinement of the item pool. Prior to Rasch calibration, a Differential Item Functioning (DIF) analysis identified five biased items (44, 63, 68, 23, and 11), which were excluded to ensure fairness and parameter invariance in subsequent modeling.
With the bias-free item set, three psychometric tests were conducted:
First, Andersen’s Likelihood Ratio Test (LR = 76.125, df = 61, p = 0.092) did not reject the null hypothesis of parameter invariance, indicating item stability across the ability levels.
Second, the Martin-Löf Test confirmed unidimensionality (p = 1.0), supporting the interpretation that the test assessed a single latent trait—namely, theoretical competence in anesthesiology.
Third, the Wald Test (Supplementary Table S1) showed no significant item-level deviations for most items, further validating the Rasch model assumptions.
Rasch-based ability estimates (θ) were then computed for each model anchored to human performance. GPT-o1 achieved the highest ability (θ = 2.42), followed by Deepseek-R1 (θ = 2.17), and GPT-4o (θ = 1.96). LLaMA 3 70B had the lowest (θ = 0.52), indicating a weaker alignment with human-level performance under the IRT framework.
Supporting psychometric visuals—including the Item Characteristic Curve (ICC) plots, Item Fit Map, and Person-Item Map—are presented in Supplementary Figures S1–S3.
Figure 1 illustrates the distribution of the scores of the LLMs and human examinees.
3.2. Overall Performance Metrics
Model performance varied notably in terms of accuracy, IRT score, and response time (Table 1). GPT-o1 reached the highest mean accuracy (88.7 ± 0.0%), while LLaMA 3 70B had the lowest (60.5 ± 3.4%). Regarding efficiency, Deepseek-R1 had the slowest responses (60.0 ± 29.3 s), whereas LLaMA 3 405B was the fastest (2.1 ± 0.5 s).
Figure 2 provides a side-by-side comparison of the model accuracies and Rasch-based abilities, reinforcing the observed performance disparities.
3.3. Effect of Item Difficulty and Discrimination on Model Performance
The impact of item difficulty (β) and discrimination index (DI) on model performance was analyzed using logistic regression, correlation analysis, non-parametric comparisons, and categorical testing.
Logistic regression (Supplementary Table S2) revealed that item difficulty was a consistent and significant negative predictor of model accuracy. For instance, GPT-4o at T = 1 had an odds ratio of 0.07 (95% CI: [0.01, 0.44]; p = 0.005), indicating a marked decline in correct responses for harder items. In contrast, DI was not a reliable predictor; although some models, like Deepseek-R1, showed high odds ratios (e.g., OR = 40.5), these lacked statistical significance.
Pearson’s correlation analysis (Supplementary Tables S3 and S4) further confirmed these findings. Models such as LLaMA 3 405B, Anthropic Haiku, Qwen 2.5, and Gemini 1.5 demonstrated strong correlations between item difficulty and accuracy (r = 0.48–0.60; p < 0.001), while GPT-o1 and GPT-4o showed moderate but significant correlations (r ≈ 0.33–0.40). Other models, including Deepseek-R1, LLaMA 3 70B, and Anthropic Sonnet, showed negligible associations, indicating a lower sensitivity to difficulty.
These results were mirrored by the Mann–Whitney U tests, which identified significant differences in item difficulty between correct and incorrect responses for models with high correlations (p < 0.01; Cohen’s d > 1.0). Models with weak correlations, such as GPT-4o (T = 1) and Deepseek-R1, showed no significant differences and small effect sizes, confirming the variability in difficulty sensitivity across models.
Chi-square tests (Supplementary Table S5) using categorical groupings of item difficulty (easy, moderate, hard) showed significant associations with accuracy in most models (e.g., GPT-4o: χ² = 16.97, p < 0.001; Cramer’s V = 0.52), indicating that harder items were more likely to be answered incorrectly. In contrast, the discrimination category had no significant effect on model performance (all p > 0.5), with consistently low Cramer’s V values (<0.25), suggesting that LLMs are generally insensitive to item discrimination.
3.4. Influence of Temperature Settings
To assess the role of randomness, model performance was compared across two temperature settings (T = 0 vs. T = 1) using Mann–Whitney U tests. No significant differences were found in the accuracy of any model (p range: 0.609–1.000), suggesting that the model outputs were robust to variations in temperature (Table 2).
3.5. Comparison of LLM and Human Performance by Topic
A comparison at the domain level between the LLMs and human examinees revealed significant performance differences (see Supplementary Table S6). The LLMs exhibited better results in areas such as Volume Replacement and Transfusion Therapy (100% vs. 45.5%), Cardiopulmonary Resuscitation (92.6% vs. 69.2%), and Obstetric Anesthesia (88.9% vs. 69.2%). Conversely, domains like Complications of General and Regional Anesthesia showed nearly identical performances (68.5% vs. 68.9%).
Pearson’s correlation analysis between human and LLM accuracy across the 15 anesthesiology domains revealed a moderate positive correlation (r = 0.567, p = 0.0346, 95% CI: 0.052–0.844). The most significant performance differences favoring LLMs were noted in Volume Replacement and Transfusion Therapy (+54.5 percentage points), HTM (+38.4 percentage points), and Monitoring (+24.8 percentage points). The smallest differences were observed in Complications of General and Regional Anesthesia (‒0.4 percentage points), Pediatric Anesthesia (+8.2 percentage points), and Airway (+8.3 percentage points).
Figure 3 illustrates these domain-specific comparisons, ranking content areas by LLM performance and highlighting the performance differences between the two groups across all anesthesiology knowledge domains.
3.6. Cognitive and Error-Type Analysis
To deepen our understanding of the model performance beyond surface-level accuracy, we conducted a detailed analysis of the errors made by the two highest-performing language models (GPT-o1 and Deepseek-R1). This analysis considered both the cognitive demands of the exam items, classified using Bloom’s taxonomy, and the nature of the errors, based on qualitative evaluation of the models’ answer justifications.
Among the 298 total items administered across the nine models, the distribution across Bloom’s cognitive levels was as follows: Apply (125 items, 41.9%), Understand (91 items, 30.5%), Analyze (42 items, 14.1%), and Recall (40 items, 13.4%). This distribution indicates a deliberate emphasis on higher-order cognitive processes, particularly the application and comprehension of clinical knowledge, which aligns with the CONACEM exam’s focus on critical thinking and reasoning in anesthesiology. The associated error rates for these categories are shown in Figure 4.
To assess the qualitative nature of the model errors, we conducted a focused review of 36 incorrect responses generated by GPT-o1 and Deepseek-R1. Each response was manually classified into one of three dominant error categories: (1) Reasoning-Based Errors, which reflected flawed clinical logic or unjustified conclusions; (2) Knowledge-Based Errors, stemming from factual inaccuracies or misapplications of medical knowledge; and (3) Reading Comprehension Errors, which arose from misinterpretation or omission of key information in the question prompt.
The distribution of these error types across cognitive domains revealed several important trends. The Apply category contained the largest number of reasoning (n = 5) and knowledge (n = 10) errors, underscoring the challenge faced by these models in integrating knowledge into clinically relevant scenarios. The Understand and Recall categories exhibited a mixture of all three error types, highlighting limitations not only in inferential reasoning but also in basic information retrieval and comprehension. In contrast, Analyze tasks were exclusively associated with knowledge-based errors, likely reflecting the small sample size and inherent complexity of such high-level tasks. The detailed breakdown is summarized in Table 3.
4. Discussion
The results of our study provide a comprehensive evaluation of LLMs on the Chilean National Anesthesiology Certification Exam, offering critical insights into their strengths, limitations, and future applications in medical education. Our findings align with prior research that underscores the potential of LLMs in medical reasoning while also revealing key gaps in their performance, particularly in specialized non-English contexts [17,29].
4.1. Evaluating Model Strengths and Weaknesses
We hypothesized that closed-source models would outperform open-source models due to their access to extensive proprietary datasets and optimized architectures. This was largely confirmed, with models like GPT-o1 and GPT-4o achieving the highest accuracy rates (88% and 83%, respectively), while open-source models such as LLaMA 3 70B struggled with lower accuracy (60.5%). However, Deepseek-R1 proved to be highly competitive, achieving an accuracy of 86% and narrowing the performance gap between the open- and closed-source models.
These results suggest that performance is not solely dependent on access to proprietary data but also on architectural optimization and alignment strategies. The high Rasch-based ability estimates for GPT-o1 and Deepseek-R1 reinforce that well-optimized models, even when open-sourced, can compete with their closed-source counterparts under certain conditions.
4.2. Insights from Comparative Studies
Our findings echo those of previous studies on the performance of LLMs in high-stakes medical assessments. For instance, studies evaluating GPT-4 on the Polish Medical Final Examination (MFE) showed that while the model achieved passing scores, it still lagged behind human performance in complex diagnostic reasoning [6]. Similarly, GPT-4 achieved top scores on the USMLE but demonstrated difficulties in handling nuanced reasoning tasks compared with clinical experts [30]. Research in radiology and ultrasound examinations has further shown that model performance drops when dealing with image-based or interpretive questions, with hallucination and overconfidence being noted issues [31,32].
Moreover, our work aligns with benchmarking efforts in anesthesiology, such as the Chinese Anesthesiology Benchmark (CAB), which concluded that while LLMs handled factual knowledge adequately, their clinical decision-making capacities remained suboptimal [5]. These findings reflect a consistent challenge across specialties and languages: while LLMs excel in recall-based tasks, they struggle with synthesis, contextual nuance, and high-discrimination items.
4.3. Study Limitations and Future Improvements
Despite its valuable insights, this study has several limitations. First, we focused solely on text-based multiple-choice questions, excluding multimodal elements such as patient imaging and waveform analysis. This choice was necessary to ensure a fair evaluation across all models, including those without image capabilities. However, multimodal LLMs, like GPT-4V and Gemini, are gaining traction in medical AI and have shown promise in visual diagnostic tasks [31,33]. Future research should include multimodal benchmarks to better simulate real-world clinical conditions.
Second, we evaluated only nine LLMs, and emerging open-source models (e.g., those fine-tuned on biomedical corpora) were not included. All models were tested in zero-shot mode without prompt engineering or domain-specific fine-tuning. While this ensures a standardized baseline comparison, it may underestimate the full potential of LLMs when they are adapted to specific tasks or domains [23].
4.4. Implications for Medical Training and Certification
The results have important implications for medical education and certification processes. The ability of LLMs to achieve high accuracy on low-difficulty, recall-driven items supports their potential as educational aids in medical training. This aligns with the findings of US-based studies in which LLMs were used to generate clinical vignettes, assist in self-testing, and explain complex topics [30].
However, the observed drop in performance on higher-difficulty and high-discrimination items underscores the limitations of LLMs as autonomous reasoning agents. In high-stakes clinical environments, especially in specialties like anesthesiology, where decisions are time-sensitive and safety-critical, AI should augment—not replace—human judgment [34].
4.5. Domain-Specific Performance Patterns: Cognitive Architecture Differences Between Human and Artificial Intelligence
The moderate positive correlation (r = 0.567, p = 0.0346) between human and LLM accuracy across the 15 anesthesiology knowledge domains reveals fundamental differences in cognitive architecture while suggesting partial alignment in difficulty patterns. Domains in which humans struggle moderately (HTM: 17.2%, Monitoring: 40.5%) also challenge LLMs, but to a lesser extent (HTM: 55.6%, Monitoring: 65.3%). This indicates that LLMs maintain baseline competency through systematic knowledge access, even in complex areas that require hemodynamic interpretation and physiological monitoring integration. Conversely, the dramatic LLM advantage in Volume Replacement and Transfusion Therapy (100% vs. 45.5%) reflects LLMs’ superior ability to access and apply algorithmic, protocol-driven knowledge compared with humans’ more variable recall of specific transfusion guidelines and fluid management protocols.
The differential performance profile indicates systematic differences in knowledge processing: LLMs excel in domains that require factual recall and standardized protocol application (Cardiopulmonary Resuscitation, Volume Replacement) but show more modest advantages in areas that demand contextual clinical judgment and experiential reasoning (Complications of Anesthesia, Airway Management). This cognitive complementarity supports the development of hybrid decision-support systems that utilize systematic AI knowledge retrieval for protocol-driven domains while preserving human oversight in complex diagnostic reasoning and complication management. Recent experimental evidence from endoscopic decision-making has shown that effective human-AI collaboration can lead to superior outcomes through the weighted integration of complementary expertise, where clinicians appropriately follow AI advice when it is correct while maintaining independent judgment when AI recommendations are flawed [35].
4.6. Cognitive Complexity and Error Taxonomy: A Diagnostic Lens into LLM Failures
The stratified analysis of model failures using Bloom’s taxonomy and the error classification schema proposed by Roy et al. offers critical insights into the qualitative nature of LLM shortcomings in clinical assessments. This approach moves beyond simple accuracy metrics to uncover how and why models fail, which is vital for their safe deployment in high-stakes environments.
Among the 298 total question simulations, the cognitive distribution leaned heavily toward higher-order reasoning: Apply (41.9%) and Understand (30.5%) together comprised nearly three-quarters of the total item pool. These domains require not only factual recall but also the capacity to integrate knowledge into dynamic, often ambiguous, clinical contexts. In this setting, the two best-performing models—GPT-o1 and Deepseek-R1—still exhibited systematic vulnerabilities.
A detailed review of 36 incorrect answers showed that knowledge-based errors (n = 23) and reasoning-based errors (n = 9) predominated. Most of these errors occurred in Apply- and Understand-level items, highlighting deficiencies in the models’ ability to synthesize and manipulate clinical knowledge even when they had adequate factual grounding. For instance, models frequently misapplied pharmacologic principles or failed to account for contraindications and context-specific factors—mistakes that are critical in anesthesiology, where patient safety depends on nuanced clinical decision-making.
Interestingly, reading comprehension errors (n = 6), though less frequent, were disproportionately present in Recall and Understand tasks, indicating occasional failures to parse essential details or to frame questions. This aligns with the findings of Roy et al., who showed that GPT-4 hallucinated or misattributed clinical features when reasoning paths were not well defined [9]. Similarly, Herrmann-Werner et al. [36] found that GPT-4 frequently erred at lower cognitive levels in medical multiple-choice examinations, particularly in tasks requiring basic understanding and recall. These failures suggest that limitations are not confined to higher-order reasoning but may also reflect foundational issues in text comprehension and concept recognition.
Moreover, Roy et al. noted that a significant portion of GPT-4’s errors were judged as “reasonable” by medical annotators, which poses challenges for its use in autonomous settings where plausibility may mask critical inaccuracies [9]. This reinforces our finding that high performance on knowledge-recall questions may conceal latent weaknesses in inferential reasoning and contextual awareness.
These observations have direct implications for the role of LLMs in formative and summative assessment. While LLMs can serve as effective teaching aids for reinforcing basic knowledge, their deployment in evaluative or autonomous decision-making roles should be approached with caution, particularly in domains that require synthesis, judgment, and patient-specific adaptation.
By explicitly linking cognitive demands to error typologies, this study demonstrates a replicable framework for auditing AI behavior in clinical examinations. Such diagnostic insights are critical for informing future fine-tuning strategies, AI safety protocols, and hybrid human-AI collaboration models in medical education and practice.
4.7. Future Directions for AI in Medical Assessments
To enhance the applicability of LLMs in anesthesiology and other medical domains, several key directions must be pursued.
Fine-Tuning on Medical-Specific Data: Tailoring LLMs using anesthesiology-specific datasets (e.g., case studies, guidelines, perioperative protocols) could boost reasoning capabilities in complex clinical scenarios.
Integration of Multimodal Capabilities: As shown in imaging-based evaluations [31], incorporating visual interpretation can enhance AI tools for real-world diagnostics.
Cross-Language Optimization: Given the performance variability across languages, particularly Spanish, multilingual alignment must be prioritized to ensure equitable AI in global health systems.
Human-AI Collaboration Models: Rather than aiming for full autonomy, LLMs should be integrated into collaborative workflows in which humans retain clinical oversight, mitigating the risk of error propagation [32,34].
Our study highlights both the promise and current limitations of LLMs in complex, non-English clinical certification contexts. While closed-source models outperformed open-source models on average, the strong performance of Deepseek-R1 suggests that optimization, rather than the exclusivity of data, may be the key differentiator. These results support the growing utility of AI in medical education while reinforcing the importance of domain-specific training, multimodal integration, and robust evaluation standards. Moving forward, collaborative efforts among developers, educators, and regulators will be critical to safely and effectively harness the capabilities of LLMs in healthcare.
5. Conclusions
This study provides the first comprehensive evaluation of large language models on a Spanish-language, high-stakes medical certification exam, revealing both the promise and limitations of current AI systems in specialized clinical contexts. Our findings demonstrate that LLMs, particularly closed-source models like GPT-o1 (88.7% accuracy), can achieve impressive performance in factual recall and protocol-driven domains, often surpassing human examinees in areas that require systematic knowledge application, such as volume replacement therapy and cardiopulmonary resuscitation. However, the moderate correlation (r = 0.567) between human and LLM performance patterns across knowledge domains reveals fundamental differences in cognitive architecture, with LLMs maintaining a consistent baseline competency through systematic knowledge access, while humans show greater variability that reflects experiential learning. The predominance of knowledge- and reasoning-based errors in higher-order cognitive tasks (Apply and Understand), combined with LLMs’ sensitivity to item difficulty but not discrimination, underscores persistent limitations in complex clinical reasoning and contextual adaptation. These results support the cautious integration of LLMs as educational aids and decision-support tools in anesthesiology, emphasizing the need for hybrid human-AI collaboration models that leverage AI’s systematic knowledge retrieval capabilities while preserving human oversight for complex diagnostic reasoning and patient-specific clinical judgment.