1. Introduction
Large language models (LLMs) are artificial intelligence systems trained on vast text corpora to understand and generate human language through pattern recognition and statistical inference. These models, including OpenAI’s ChatGPT series, Google’s Gemini, and Meta’s Llama, have demonstrated increasing proficiency in medical knowledge tasks. Early studies demonstrated ChatGPT-4’s strong performance on general medical licensing exams such as the United States Medical Licensing Examination (USMLE) as well as the Japanese licensing examination, meeting the passing threshold across multiple steps [1, 2, 3]. However, the performance of LLMs in specialist medical examinations, particularly in gastroenterology, has been mixed [4].
Recent studies have examined LLM performance on gastroenterology-specific examinations with variable results. ChatGPT-4 failed to pass the American College of Gastroenterology (ACG) self-assessment test, scoring below the 70% threshold required for competency [5]. However, ChatGPT-3.5 and Perplexity AI scored above 80% on questions based on the Italian residents’ gastroenterology exam [6]. These mixed results suggest that LLM performance may depend on examination format, regional guideline differences, and model architecture.
Recent innovations have produced two distinct categories of LLMs: reasoning models that employ explicit chain-of-thought processes to decompose complex problems into intermediate steps (e.g., DeepSeek-R1, ChatGPT-o1), and non-reasoning models that generate responses through direct inference without visible intermediate reasoning (e.g., ChatGPT-4o, Gemini-1.5-Pro, Llama-3.1-405B). While systematic reviews demonstrate progressive improvement in LLM performance on medical examinations [2], comparative analyses of these architectural approaches remain limited, particularly in European medical contexts.
The European Specialty Examination in Gastroenterology and Hepatology (ESEGH) is a mandatory, high-quality, knowledge-based exam for board certification in the UK and Switzerland [7]. The examination comprises 200 multiple-choice questions aligned with the European Blue Book Curriculum. This standardized format and broad content coverage make ESEGH-style questions an appropriate benchmark for evaluating LLM performance in European gastroenterology contexts.
This study addresses the gap in comparative architectural analysis by systematically benchmarking five current LLMs on questions designed to simulate ESEGH content and format. Our objectives are to: (1) compare performance between reasoning and non-reasoning model architectures, (2) identify domain-specific performance patterns across clinical areas, and (3) establish baseline performance metrics for future studies.
4. Discussion
This benchmarking study provides the first systematic comparison of reasoning versus non-reasoning LLM architectures on European gastroenterology board-style questions. The key finding—that reasoning models outperformed non-reasoning models by 5.7 percentage points overall, with advantages exceeding 20 percentage points in specific domains—suggests that explicit chain-of-thought processing enhances performance on complex medical questions. All evaluated models exceeded the ESEGH passing threshold, with four of five achieving accuracy levels that would place them in the top performance tier of human test-takers. However, substantial variation across clinical domains and question types reveals important limitations that must be understood before considering any clinical applications.
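For illustration, the percentage-point gap between two model groups can be computed as in the following minimal sketch. It assumes that answers are pooled within each group before accuracy is calculated, which may differ from the aggregation used in this study; the function names and the toy records are hypothetical and are not study data.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy in percent} from (group, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, is_correct in records:
        total[group] += 1
        correct[group] += int(is_correct)
    return {g: 100.0 * correct[g] / total[g] for g in total}


def percentage_point_gap(records, group_a, group_b):
    """Accuracy of group_a minus accuracy of group_b, in percentage points."""
    acc = accuracy_by_group(records)
    return acc[group_a] - acc[group_b]


# Toy, made-up records for demonstration only (NOT data from this study).
toy_records = (
    [("reasoning", True)] * 8 + [("reasoning", False)] * 2
    + [("non_reasoning", True)] * 7 + [("non_reasoning", False)] * 3
)
gap = percentage_point_gap(toy_records, "reasoning", "non_reasoning")
print(f"Gap: {gap:.1f} percentage points")
```

The same helper could be applied per clinical domain by filtering the records before calling it, which is one way domain-specific gaps might be derived.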
Our findings align with a rapidly growing body of research on the use of LLMs in medical education and specialty board exams. Many earlier investigations focused on model performance across diverse medical licensing examinations. Ali et al. [8] reported that GPT-4 achieved scores comparable to human test-takers on the ACG self-assessment exams, scoring 76.3% on a text-based question set. Interestingly, the average human examinee scored 75.7%, suggesting near-equal performance between GPT-4 and board-eligible gastroenterologists. Safavi-Naini et al. (2024) compared GPT-4o and Claude-3.5-Sonnet with Llama and Mistral on gastroenterology exams [9]. They found that GPT-4o and Claude-3.5-Sonnet achieved the highest accuracy (73.7–74.0%). Samaan et al. [10] demonstrated that advanced prompt engineering strategies such as Retrieval-Augmented Generation (RAG) substantially improved GPT-4’s performance on specialty gastroenterology exams, from 60.3% to 80.7%.
The performance range observed in our study (68.5–84.0%) aligns with LLM evaluations across other medical specialties, where accuracies typically range from 60% to 89%. Gilson et al. [11] evaluated ChatGPT on USMLE questions, revealing passing or near-passing performance (over 60%). Ali et al. [12] and Chan et al. [13] showed that GPT-4 surpassed human pass marks in neurosurgery and MRCS Part A exams, respectively. Angel et al. [14] found that GPT-4 achieved 89% accuracy on the North American Veterinary Licensing Examination, exceeding GPT-3 and Bard. Other subfields beyond gastroenterology showed similar findings. For instance, Longwell et al. [15] demonstrated 84.4–86.7% correctness on ASCO and ESMO oncology questions. Schubert et al. [16] reported GPT-4 scoring 85.0% on neurology board-style examinations. Tarabanis et al. [17] noted GPT-4’s 77.5–80.7% performance on internal medicine board-style questions, occasionally surpassing human respondents. Our results fit within this performance range. While vision–language models have been proposed to address image-based questions, Safavi-Naini et al. [9] noted that LLMs often struggle with images unless a thorough human-crafted description is provided. We did not explore image interpretation, but existing evidence suggests that current multimodal approaches still lag behind text-based results.
An important consideration is the potential difference between the commercial preparation materials used in this study and the actual ESEGH examination questions. While the preparation banks we employed are designed to simulate the content, format, and difficulty of the official examination, several factors may limit direct generalizability. First, official ESEGH questions undergo rigorous psychometric validation, including item analysis and calibration against candidate performance data, which may result in more precisely calibrated difficulty levels and distractor effectiveness. Second, the official examination committee may employ specific question-writing conventions, clinical scenarios, or emphasis on emerging topics that are not fully captured in third-party materials. Third, the security of actual examination content means that preparation materials are necessarily approximations based on published curricula and candidate recall rather than direct replications. To establish definitive performance benchmarks, future research should pursue collaboration with the European Board of Gastroenterology and Hepatology to obtain access to validated examination items under appropriate confidentiality agreements.
A few studies highlight potential pitfalls. Koga et al. [18] found inconsistency and inaccuracies in LLM answers to pathology questions. Kaiser et al. [19] similarly reported incomplete or vague responses about colon cancer management in publicly available LLMs. Finally, Igarashi et al. [20] evaluated ChatGPT on Japanese emergency medicine board certification exams, finding 62.3% accuracy. This lower accuracy relative to the studies above may be due to differences in exam style, language, or localized guidelines.
The superior performance of reasoning models likely stems from their ability to decompose complex questions into intermediate steps, particularly evident in questions requiring the filtering of irrelevant information. It is important to note that our reasoning versus non-reasoning classification reflects observable output behavior under standardized prompting rather than fundamental architectural differences—all LLMs engage in some form of internal computation that could be considered reasoning. The 11.3 percentage-point advantage on bait-and-switch questions demonstrates this capability. However, all models showed weaknesses in bariatric (61.6%) and pancreatic disorders (64.3%), suggesting training data limitations rather than architectural constraints. These subspecialty gaps highlight that even advanced architectures cannot compensate for insufficient domain representation in training corpora.
A limitation of our analysis is that we did not systematically categorize the types of errors made by each model. Incorrect answers may stem from distinct failure modes, including outdated or missing medical knowledge, guideline mismatches between European and North American recommendations, flawed reasoning chains, or misinterpretation of question distractors. Understanding these failure modes is essential for targeted model improvement. Future studies should incorporate structured error taxonomies to distinguish knowledge deficits from reasoning failures, which would provide actionable insights for both model developers and clinical end-users.
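As a sketch of what such a structured error taxonomy might look like in practice, the failure modes listed above can be encoded as a fixed set of labels and tallied per model. The category names mirror the failure modes named in this paragraph; the class name, function name, model labels, and toy annotations are hypothetical and were not part of this study.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    """Hypothetical labels for the failure modes named in the text."""
    KNOWLEDGE_GAP = "outdated or missing medical knowledge"
    GUIDELINE_MISMATCH = "European vs. North American guideline mismatch"
    FLAWED_REASONING = "flawed reasoning chain"
    DISTRACTOR_MISREAD = "misinterpretation of question distractors"


def tally_errors(labelled_errors):
    """Count errors per (model, ErrorType) from an iterable of such pairs."""
    return Counter(labelled_errors)


# Toy annotations for demonstration only (NOT data from this study).
toy_annotations = [
    ("model_a", ErrorType.KNOWLEDGE_GAP),
    ("model_a", ErrorType.FLAWED_REASONING),
    ("model_b", ErrorType.KNOWLEDGE_GAP),
    ("model_a", ErrorType.KNOWLEDGE_GAP),
]
counts = tally_errors(toy_annotations)
for (model, error_type), n in sorted(counts.items(), key=lambda kv: (kv[0][0], kv[0][1].name)):
    print(f"{model}: {error_type.name} x{n}")
```

Using a closed set of labels forces each incorrect answer into exactly one category, which would make knowledge deficits and reasoning failures directly comparable across models.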
Importantly, this study was not designed as a human-versus-machine competition. The primary contribution lies in the systematic comparison of reasoning versus non-reasoning LLM architectures on European gastroenterology content. The human reference data contextualizes these findings but does not constitute a powered comparison. Our exploratory human comparison provides preliminary context suggesting that LLMs may match physician performance on standardized board-style questions. However, several important caveats warrant emphasis. First, our convenience sample of four physicians—two board-certified experts and two fellows—was designed to provide initial reference points rather than establish definitive human benchmarks. The observed low inter-physician agreement (43.3%) likely reflects both the inherent complexity of specialty-level medical knowledge and the heterogeneity of our small sample, which intentionally spanned different expertise levels. A larger, more homogeneous cohort of ESEGH-certified gastroenterologists would be needed to establish robust human performance baselines. Second, the expert-novice performance gap (62.8% vs. 38.9%) aligns with expected expertise gradients, suggesting our sample captured meaningful variation despite its size. Third, standardized examinations assess only a subset of clinical competence; they do not capture diagnostic reasoning at the bedside, procedural skills, patient communication, or the integration of contextual factors that define expert clinical practice.
The finding that all LLMs exceeded individual physician scores should therefore not be interpreted as evidence that these models can replace clinical judgment. Future studies should incorporate larger physician cohorts, ideally stratified by years of experience and recent ESEGH examination performance, to establish more reliable human benchmarks against which LLM capabilities can be meaningfully assessed.
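For readers wishing to reproduce an agreement figure of this kind, the sketch below shows two common definitions of percent agreement among raters. It is an assumption that either matches the definition behind the inter-physician agreement reported here; the function names and the toy answer matrix are hypothetical.

```python
from itertools import combinations


def unanimous_agreement(answers):
    """Percent of questions on which every rater chose the same option.

    answers: list of per-question lists, one answer per rater.
    """
    unanimous = sum(1 for row in answers if len(set(row)) == 1)
    return 100.0 * unanimous / len(answers)


def mean_pairwise_agreement(answers):
    """Percent of (question, rater pair) combinations with matching answers."""
    n_raters = len(answers[0])
    pairs = list(combinations(range(n_raters), 2))
    agree = sum(row[i] == row[j] for row in answers for i, j in pairs)
    return 100.0 * agree / (len(pairs) * len(answers))


# Toy answers from four raters on four questions (NOT data from this study).
toy_answers = [
    ["A", "A", "A", "A"],
    ["B", "B", "C", "B"],
    ["D", "A", "A", "C"],
    ["E", "E", "E", "E"],
]
print(f"Unanimous: {unanimous_agreement(toy_answers):.1f}%")
print(f"Mean pairwise: {mean_pairwise_agreement(toy_answers):.1f}%")
```

The two definitions can diverge considerably with only four raters, so any replication with a larger cohort would need to state which one is used.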
Data contamination is another consideration. The commercial preparation materials used in this study may exist in some models’ training data, which could inflate performance through memorization rather than reasoning. Several observations partially address this concern. The substantial variation in performance across models suggests contamination did not affect all models equally. Additionally, the consistent difficulty of clinical examination questions and bariatric topics across all models suggests reasoning rather than pure recall—memorized content would likely show more uniform performance. However, we cannot definitively exclude contamination without access to training data documentation, which model developers do not publicly disclose. Future studies should consider using newly developed or embargoed questions to minimize this risk.
Our study has several limitations. The board exam is not directly correlated with clinical skills, but instead reflects a selected knowledge base; thus, our findings should be interpreted primarily in an educational rather than clinical context. The questions used in this analysis were sourced from specialized ESEGH preparation materials rather than from actual past exams. While these high-quality mock questions approximate the real exam structure, the absence of original ESEGH items may affect generalizability, particularly if official questions differ in nuance, ambiguity, or distractor design. Additionally, image-based items were excluded due to variable multimodal capabilities across models. These questions, which represent a notable subset of ESEGH content, may pose distinct challenges for current LLMs. The lack of direct benchmarking against a representative cohort of human test-takers also prevents practical conclusions about clinical utility or deployment readiness. While the ESEGH multiple-choice format provides a useful framework for comparison, it remains a simplified abstraction of complex medical reasoning that may artificially inflate model performance. Nonetheless, the top-performing models in our study achieved scores exceeding the published pass marks for recent ESEGH sittings, suggesting potential utility as board-preparation tools. The human reference sample was small (n = 4) and heterogeneous by design, limiting the generalizability of human-LLM comparisons. This exploratory comparison should be replicated with larger, more homogeneous physician cohorts before drawing conclusions about relative human-AI performance. We did not perform a detailed error analysis to distinguish between different failure modes, such as knowledge gaps, reasoning errors, or guideline mismatches. Such an analysis would provide valuable insights into the specific weaknesses of each model and inform targeted improvements.
Our findings demonstrate that the top-performing LLMs achieved accuracy levels exceeding current ESEGH examination standards. With ChatGPT-o1 reaching 84.0% accuracy and the top four models all scoring above 77%, these results surpass both the historical pass threshold of 59% (2019) and the current equated pass mark of 61.5% (2022 onwards). The 432-point equated score corresponds to 61.5% accuracy, meaning that all evaluated LLMs would achieve passing scores. However, these findings should be interpreted within the context that our study used preparation materials rather than actual ESEGH questions, and real examination conditions may present additional challenges not captured in our assessment. The substantial margin by which LLMs exceeded the passing threshold suggests potential utility as study aids for candidates preparing for the ESEGH, though the clinical relevance of this performance advantage requires further investigation in real-world educational settings. Our sample size of 203 questions may underrepresent certain subspecialties and rare conditions. Focusing exclusively on European guidelines also limits generalizability to other healthcare systems and regional standards. We evaluated only a narrow slice of model functionality without assessing their ability to provide reasoning rationales. Our standardized prompting approach likewise did not explore the full impact of prompt engineering, which may significantly influence model performance. Finally, we did not analyze the underlying causes of incorrect answers—such as outdated knowledge, guideline mismatch, or distractor misinterpretation—each of which could inform future model refinement. Given these limitations, our findings should be interpreted cautiously as an initial exploration rather than evidence of clinical capability.
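To make the pass-mark comparison and the sample-size caveat concrete, the following sketch converts a percentage pass mark into a minimum number of correct answers on a 203-question set and attaches a Wilson score interval to an observed accuracy. It assumes that a score equal to the pass mark counts as a pass and that the ESEGH percentage pass marks can be applied directly to this question set; the function names and the example count of correct answers are hypothetical.

```python
import math


def min_correct_to_pass(n_questions, pass_mark_pct):
    """Smallest number of correct answers whose accuracy meets the pass mark."""
    # The small epsilon guards against floating-point noise at exact boundaries.
    return math.ceil(n_questions * pass_mark_pct / 100.0 - 1e-9)


def wilson_interval(n_correct, n_questions, z=1.96):
    """Wilson score interval (in percent) for a binomial proportion."""
    p = n_correct / n_questions
    denom = 1 + z**2 / n_questions
    centre = (p + z**2 / (2 * n_questions)) / denom
    half = z * math.sqrt(
        p * (1 - p) / n_questions + z**2 / (4 * n_questions**2)
    ) / denom
    return 100 * (centre - half), 100 * (centre + half)


N_QUESTIONS = 203  # question count reported in this study
for pass_mark in (59.0, 61.5):  # pass marks cited in the text
    needed = min_correct_to_pass(N_QUESTIONS, pass_mark)
    print(f"Pass mark {pass_mark}%: at least {needed} of {N_QUESTIONS} correct")

# Hypothetical count of correct answers, for illustration only.
low, high = wilson_interval(170, N_QUESTIONS)
print(f"95% interval for 170/{N_QUESTIONS}: {low:.1f}% to {high:.1f}%")
```

An interval of this kind gives a rough sense of how much an accuracy estimated from about 200 questions could shift with a different question sample, which is relevant when comparing models whose scores differ by only a few percentage points.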
This benchmarking study establishes baseline performance metrics for current LLM architectures on European gastroenterology board-style questions. The consistent advantage of reasoning models suggests that future development should prioritize architectures capable of explicit problem decomposition. However, persistent weaknesses in clinical examination questions and subspecialized topics indicate fundamental limitations that architectural improvements alone cannot address. Future research should: (1) validate these findings using official examination materials through collaboration with the European Board of Gastroenterology and Hepatology (EBGH), (2) investigate whether reasoning advantages translate to other medical specialties, (3) develop multimodal capabilities for image-based questions, (4) perform detailed error analyses to distinguish knowledge deficits from reasoning failures and guideline mismatches, and (5) most importantly, assess whether high test performance correlates with any clinically meaningful outcomes. Until such evidence exists, LLMs should be viewed as emerging technologies requiring rigorous evaluation rather than ready tools for medical education or practice.