1. Introduction
The integration of artificial intelligence (AI) into clinical medicine has accelerated markedly, with large language models (LLMs) demonstrating notable capabilities in synthesizing medical knowledge and supporting clinical reasoning [1]. Two principal architectural approaches have emerged: multimodal large language models (MLLMs), which integrate textual and image data, and unimodal LLMs optimized specifically for text-based clinical reasoning [2,3]. Understanding the comparative strengths and limitations of these approaches relative to human clinical expertise is essential for their appropriate implementation in healthcare environments.
Multimodal systems, which combine multiple specialized AI models, offer the potential to assimilate diverse clinical data streams in a manner analogous to multidisciplinary human teams. For instance, a combined MLLM system integrating GPT-4V (designed for visual-textual analysis), Med-PaLM 2 (specialized in medical knowledge), and BioGPT (trained on biomedical literature) could theoretically harness complementary strengths across diagnostic imaging, acute clinical management, and evidence-based treatment planning [4,5,6,7]. Conversely, advanced unimodal models such as ChatGPT-5.2 provide streamlined text-based reasoning capabilities, featuring extended context windows and improved logical inference [8].
Aortic dissection constitutes an ideal clinical scenario for evaluating AI systems, as it necessitates the integration of diagnostic imaging interpretation, acute hemodynamic management, and complex treatment decisions, including surgical planning and long-term follow-up [9,10]. Clinical management involves multidisciplinary collaboration among radiologists (diagnostic imaging), emergency medicine specialists (acute stabilization), and cardiovascular surgeons (operative management and surveillance) [11]. The 2022 ACC/AHA guidelines establish standardized recommendations that facilitate objective performance assessment [12].
This multicenter study compared the performance of a combined MLLM system, a unimodal LLM (ChatGPT-5.2), and human clinical experts (radiologists, emergency medicine specialists, and cardiovascular surgeons) from five tertiary care centers across Turkey on identical clinical questions pertaining to aortic dissection. Performance was assessed across three clinical domains: diagnosis, treatment, and management of complications.
2. Methods
2.1. Study Design
This multicenter cross-sectional comparative study was conducted in November 2025. The study protocol was reviewed and approved by the Ankara Provincial Health Directorate Non-Interventional Clinical Research Ethics Committee (Approval No: 2025/01-142). The comprehensive workflow of the study, including AI model integration and consensus methodology, is depicted in Figure 1.
Model selection was conducted systematically based on four criteria: (1) documented performance benchmarks on medical knowledge assessments, (2) domain-specific training relevant to cardiovascular medicine, (3) API accessibility for reproducible evaluation, and (4) complementary capabilities across the diagnostic-therapeutic spectrum. ChatGPT-5.2 (OpenAI, San Francisco, CA, USA) was selected as the unimodal comparator based on its state-of-the-art performance on medical board examinations (94% accuracy) and extensive prior validation in medical applications, enabling comparison with existing literature. Alternative models considered but not selected included Claude-3 Opus (Anthropic, San Francisco, CA, USA) (limited medical-specific validation at study initiation), Gemini Ultra (Google DeepMind, Mountain View, CA, USA) (restricted API access during study period), and LLaMA-2-Med (Meta AI, Menlo Park, CA, USA) (insufficient benchmark data).
Formal equivalence testing was prospectively planned to rigorously evaluate whether AI systems achieved human-equivalent performance. Equivalence margins were pre-specified as ±10% absolute accuracy difference based on three considerations: (1) FDA guidance on AI/ML-based software as a medical device, which suggests that AI systems should perform within the range of inter-expert variability; (2) published literature indicating that accuracy differences of less than 10% between physicians of different specialties are generally considered clinically acceptable; and (3) the typical standard error of measurement observed in medical board examinations [13]. Equivalence testing was performed using the two one-sided tests (TOST) procedure, which tests whether AI performance falls within pre-specified equivalence bounds by rejecting both the hypothesis that AI performs worse than the lower margin and the hypothesis that AI performs better than the upper margin [14]. Equivalence was confirmed when both TOST p-values were <0.05 and the 90% confidence interval for the performance difference was entirely contained within the ±10% equivalence bounds.
2.2. Question Development and Validation
Twenty-five multiple-choice questions addressing aortic dissection clinical scenarios were selected from the American Board of Internal Medicine (ABIM) publicly available question bank and self-assessment resources. Questions were retrieved from the ABIM Medical Knowledge Self-Assessment Program (MKSAP) and ABIM Practice Examination question pools, which are freely accessible for educational purposes. All selected questions were validated against the 2022 ACC/AHA guidelines for the diagnosis and management of aortic disease [12,15]. Questions were categorized into three clinical domains based on the 2022 ACC/AHA guidelines: Diagnosis Domain (n = 8), Treatment Domain (n = 9), and Complication Management Domain (n = 8). The content validity index (CVI) was 0.92 [12].
2.3. Artificial Intelligence Models
The AI evaluation framework employed two distinct architectures: a combined multimodal large language model (MLLM) system integrating three specialized models, and a single unimodal large language model (ChatGPT-5.2). The complete MLLM workflow is illustrated in Figure 1.
2.4. Multimodal Large Language Model (MLLM) System
The MLLM system integrated three state-of-the-art AI models:
(1) GPT-4V (OpenAI, version gpt-4-vision-preview, January 2025): A multimodal transformer model capable of processing both textual and visual inputs; reported accuracy exceeds 85% on diagnostic imaging tasks [5,6]. Parameters: temperature = 0.3, max_tokens = 1024, top_p = 1.0.
(2) Med-PaLM 2 (Google Health, version 2025.01): A domain-specific large language model with 86.5% accuracy on USMLE-style questions [7,8]. Parameters: temperature = 0.3, max_output_tokens = 1024, top_k = 40, top_p = 0.95.
(3) BioGPT (Microsoft Research, version BioGPT-Large, 2025): Pre-trained on 15 million PubMed abstracts with 81% accuracy on PubMedQA benchmarks. Parameters: temperature = 0.3, max_tokens = 1024, top_p = 0.95.
2.5. MLLM Prompting Strategy and Input Processing
A standardized prompting methodology was developed based on established best practices for medical AI evaluation and prompt engineering principles.
GPT-4V System Prompt (for questions with images):
“You are an expert physician specializing in cardiovascular medicine and aortic diseases. You will be presented with a clinical case including medical imaging. Analyze the provided image and clinical information carefully. [CLINICAL VIGNETTE] {clinical_vignette_text} [IMAGE] {attached_medical_image} [QUESTION] {question_text} Answer Options: A. {option_a} B. {option_b} C. {option_c} D. {option_d} E. {option_e} Based on the clinical presentation and imaging findings, select the single best answer. Respond with ONLY the letter (A, B, C, D, or E) of your answer choice. Do not provide any explanation.”
Med-PaLM 2 and BioGPT System Prompt (text-only with image descriptions):
“You are an expert physician specializing in cardiovascular medicine and aortic diseases. You will be presented with a clinical case including detailed imaging descriptions. [CLINICAL VIGNETTE] {clinical_vignette_text} [IMAGING FINDINGS] {standardized_radiology_description} [QUESTION] {question_text} Answer Options: A. {option_a} B. {option_b} C. {option_c} D. {option_d} E. {option_e} Based on the clinical presentation and imaging findings described, select the single best answer. Respond with ONLY the letter (A, B, C, D, or E) of your answer choice. Do not provide any explanation.”
For questions containing imaging data, standardized image files were prepared (DICOM converted to PNG format, 512 × 512 pixel resolution). Images were provided directly to GPT-4V. For Med-PaLM 2 and BioGPT, standardized radiological descriptions were generated by a board-certified cardiovascular radiologist using RSNA reporting guidelines.
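For illustration, the image-standardization step can be reproduced with a minimal sketch such as the one below, using pydicom and Pillow; the file names and the simple min-max intensity normalization are assumptions rather than the study's exact preprocessing.

```python
# Minimal sketch of the DICOM-to-PNG standardization step (assumed details:
# file names and simple min-max windowing; the study's preprocessing may differ).
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dicom_path: str, png_path: str, size: int = 512) -> None:
    ds = pydicom.dcmread(dicom_path)           # read the DICOM file
    arr = ds.pixel_array.astype(np.float32)    # raw pixel matrix

    # Rescale intensities to 0-255 (simple min-max normalization).
    arr -= arr.min()
    if arr.max() > 0:
        arr /= arr.max()
    arr = (arr * 255).astype(np.uint8)

    # Resize to the standardized 512 x 512 resolution and save as PNG.
    img = Image.fromarray(arr).resize((size, size), Image.Resampling.LANCZOS)
    img.save(png_path)

dicom_to_png("case_014_cta.dcm", "case_014_cta.png")   # hypothetical file names
```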
2.6. MLLM Consensus Determination Methodology
Each question was independently presented to all three component models. Each model was queried three times per question to assess response consistency (inter-query agreement: GPT-4V 97.3%, Med-PaLM 2 95.6%, BioGPT 94.1%). Final MLLM responses were determined using a majority consensus methodology (Figure 1); a schematic code sketch of this logic follows the steps below:
Step 1—Unanimous Agreement: When all three models agreed, that response was recorded with “high confidence” (n = 21, 84.0%).
Step 2—Majority Agreement (2/3): When two of three models agreed, the majority response was recorded with “moderate confidence” (n = 4, 16.0%).
Step 3—Complete Disagreement: GPT-4V’s response was designated as a tiebreaker based on superior pilot study performance (n = 0, 0%).
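The three-step rule can be implemented compactly. The following is a schematic re-implementation of the consensus logic described above, not the study's code; the function and model-key names are illustrative.

```python
# Schematic re-implementation of the three-step consensus rule (illustrative
# function and model-key names; not the study code).
from collections import Counter

def mllm_consensus(answers: dict[str, str]) -> tuple[str, str]:
    """answers maps component model -> selected option, e.g.
    {"gpt4v": "B", "medpalm2": "B", "biogpt": "C"}.
    Returns (final_answer, confidence_label)."""
    counts = Counter(answers.values())
    top_answer, top_votes = counts.most_common(1)[0]

    if top_votes == 3:                        # Step 1: unanimous agreement
        return top_answer, "high confidence"
    if top_votes == 2:                        # Step 2: 2/3 majority
        return top_answer, "moderate confidence"
    return answers["gpt4v"], "tiebreaker"     # Step 3: complete disagreement

print(mllm_consensus({"gpt4v": "B", "medpalm2": "B", "biogpt": "C"}))
# -> ('B', 'moderate confidence')
```

In this form, the recorded confidence label directly mirrors the agreement level, which is what allows agreement to be reported as a confidence indicator downstream.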
Unimodal Large Language Model (ChatGPT-5.2)
ChatGPT-5.2 (OpenAI, version gpt-5.2-turbo, January 2025) features a 128,000-token context window and a reported 94% accuracy on medical board examinations [11,12]. Parameters: temperature = 0.3, max_tokens = 1024.
ChatGPT-5.2 System Prompt:
“You are an expert physician specializing in cardiovascular medicine and aortic diseases. You will be presented with a clinical case. [CLINICAL VIGNETTE] {clinical_vignette_text} [IMAGING FINDINGS] {standardized_radiology_description} [QUESTION] {question_text} Answer Options: A. {option_a} B. {option_b} C. {option_c} D. {option_d} E. {option_e} Based on the clinical presentation and imaging findings described, select the single best answer. Respond with ONLY the letter (A, B, C, D, or E) of your answer choice. Do not provide any explanation or reasoning.”
2.7. AI Model Query Protocol
All AI models were queried on 15–20 January 2025 using Python 3.11 with the LangChain framework. Safeguards included: (1) randomized question order, (2) session isolation, (3) automated response validation, (4) timestamp logging, and (5) no chain-of-thought or few-shot prompting.
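A simplified sketch of this query protocol is shown below. It is illustrative only: the study used the LangChain framework, whereas this sketch calls the OpenAI Python client directly, takes the component-model identifier from Section 2.4 as given, and treats the prompt assembly and the single-letter validation pattern as assumptions.

```python
# Simplified sketch of the query protocol (illustrative only: the study used
# the LangChain framework; here the OpenAI Python client is called directly,
# the model name is taken from Section 2.4 as-is, and the single-letter
# validation pattern is an assumption).
import datetime
import random
import re

from openai import OpenAI

client = OpenAI()                      # assumes OPENAI_API_KEY in the environment
VALID_ANSWER = re.compile(r"^[A-E]$")

def ask_once(prompt: str) -> str:
    """One session-isolated query: a fresh message list, no carried-over context."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # component model named in Section 2.4
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=1024,
    )
    answer = resp.choices[0].message.content.strip()
    if not VALID_ANSWER.match(answer): # automated response-format validation
        raise ValueError(f"Unparseable answer: {answer!r}")
    return answer

def run_protocol(prompts: dict[str, str], n_repeats: int = 3) -> list[dict]:
    """Randomize question order and query each question three times."""
    order = list(prompts)
    random.shuffle(order)              # safeguard (1): randomized question order
    log = []
    for qid in order:
        for rep in range(n_repeats):   # triple query per question
            log.append({
                "question": qid,
                "repeat": rep + 1,
                "answer": ask_once(prompts[qid]),
                "timestamp": datetime.datetime.now().isoformat(),  # safeguard (4)
            })
    return log
```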
Comprehensive validation procedures were implemented to ensure the reliability and generalizability of AI model outputs. Internal validation included a triple-query consistency assessment for each AI model; responses were considered valid when at least two of three queries produced identical answers, a criterion met in 100% of cases, with inter-query agreement rates of 97.3% for GPT-4V, 95.6% for Med-PaLM 2, and 94.1% for BioGPT. Pilot validation of the MLLM consensus algorithm on 10 cardiovascular questions excluded from the main analysis demonstrated 90% accuracy and informed the designation of GPT-4V as the tiebreaker based on its superior individual performance. Automated response-format validation and session-isolation protocols with fresh context initialization ensured accurate answer extraction and eliminated contextual carryover effects [16,17]. External validation was performed using an independent set of 15 questions, balanced across clinical domains, derived from the 2024 European Society of Cardiology guidelines for aortic diseases and the Medical Knowledge Self-Assessment Program cardiovascular module [18].
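Assuming the triple-query responses are stored per model and per question, the internal-validation quantities (the ≥2/3 validity rule and an inter-query agreement rate, defined here as pairwise agreement among the three repeats) could be computed as in the brief sketch below; the data layout and the agreement definition are assumptions.

```python
# Sketch of the internal-validation metrics, assuming responses are stored as
# responses[model][question_id] -> list of the three repeat answers.
# "Inter-query agreement" is computed here as pairwise agreement across the
# three repeats; the manuscript's exact definition may differ.
from collections import Counter

def internal_validation(responses: dict[str, dict[str, list[str]]]) -> dict:
    summary = {}
    for model, per_question in responses.items():
        n_valid = agreeing = total = 0
        for answers in per_question.values():
            top_count = Counter(answers).most_common(1)[0][1]
            n_valid += top_count >= 2            # >=2/3 identical -> valid
            pairs = [(0, 1), (0, 2), (1, 2)]     # the three repeat pairs
            agreeing += sum(answers[i] == answers[j] for i, j in pairs)
            total += len(pairs)
        summary[model] = {
            "validity_rate": n_valid / len(per_question),
            "inter_query_agreement": agreeing / total,
        }
    return summary
```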
2.8. Human Participants
Nine board-certified physicians were recruited from five participating centers across Turkey: radiologists (n = 3; R1 from Elazığ Fethi Sekin City Hospital, R2 from Ankara Bilkent City Hospital, R3 from Antalya City Hospital; mean experience 11.3 years), emergency medicine specialists (n = 3; EM1 from Elazığ Fethi Sekin City Hospital, EM2 from Yenimahalle Training and Research Hospital, EM3 from Antalya City Hospital; mean 9.0 years), and cardiovascular surgeons (n = 3; CVS1 from Ankara Bilkent City Hospital, CVS2 from Yenimahalle Training and Research Hospital, CVS3 from Etimesgut Şehit Sait Ertürk State Hospital; mean 14.3 years). Participants completed the questions via the REDCap platform; mean completion time was 45.2 min (SD: 12.8).
2.9. Statistical Analysis
Categorical comparisons employed chi-square or Fisher's exact tests with Bonferroni correction (adjusted significance threshold p < 0.008). Analyses were performed using SPSS 29.0 and R 4.3.2, with statistical significance otherwise defined as p < 0.05. Post hoc power analysis was conducted using G*Power 3.1 to evaluate the study's ability to detect meaningful performance differences. With n = 25 questions and the observed accuracy rates, the study had approximately 15–25% power to detect a 10% absolute difference in accuracy between groups (α = 0.05, two-tailed). This indicates that the study was exploratory in nature, and non-significant results should be interpreted as inconclusive rather than as evidence of equivalence. Effect sizes were calculated using Cohen's h for proportion comparisons to facilitate future meta-analyses and sample size planning.
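The post hoc power calculation can also be reproduced outside G*Power. The sketch below uses statsmodels; the 90% versus 80% accuracy pair is an illustrative anchoring of a 10% absolute difference, not study data.

```python
# Sketch reproducing the post hoc power calculation and Cohen's h with
# statsmodels (the study used G*Power 3.1). The 90% vs. 80% accuracy pair is
# an illustrative anchoring of a 10% absolute difference, not study data.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_ref, p_alt = 0.90, 0.80
h = proportion_effectsize(p_ref, p_alt)   # Cohen's h for two proportions

power = NormalIndPower().power(
    effect_size=h,
    nobs1=25,                 # 25 questions per group
    alpha=0.05,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Cohen's h = {h:.2f}, post hoc power = {power:.2f}")   # ~0.28, ~0.17
```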
Equivalence testing employed the two one-sided tests (TOST) procedure with pre-specified equivalence margins of ±10% absolute accuracy difference. For each comparison, we tested H01: δ ≤ −10% (AI inferior) and H02: δ ≥ +10% (AI superior), where δ represents the true accuracy difference between AI and human experts. Rejection of both null hypotheses at α = 0.05 confirmed equivalence. The 90% confidence interval approach was used as a complementary method, with equivalence demonstrated when the entire 90% CI fell within the −10% to +10% bounds.
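A minimal sketch of the TOST procedure for a difference in accuracy between two groups, with the ±10% margin and the complementary 90% confidence interval, is given below; it is not the SPSS/R code used in the study, the normal approximation is assumed, and the counts in the example call are illustrative only.

```python
# Minimal sketch of the TOST equivalence test for a difference in accuracy
# between two groups with a +/-10% margin (not the SPSS/R code used in the
# study; the normal approximation is assumed).
from math import sqrt
from scipy.stats import norm

def tost_two_proportions(x1, n1, x2, n2, margin=0.10, alpha=0.05):
    """Return the accuracy difference, both one-sided p-values, the 90% CI,
    and whether equivalence is declared at the given margin."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

    p_inferior = 1 - norm.cdf((diff + margin) / se)   # H01: diff <= -margin
    p_superior = norm.cdf((diff - margin) / se)       # H02: diff >= +margin

    z = norm.ppf(1 - alpha)                           # 1.645 -> 90% CI
    ci90 = (diff - z * se, diff + z * se)
    equivalent = p_inferior < alpha and p_superior < alpha
    return diff, (p_inferior, p_superior), ci90, equivalent

# Illustrative counts only (24/25 vs. 23/25 correct), not study data.
print(tost_two_proportions(24, 25, 23, 25))
```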
Beyond accuracy, supplementary performance metrics were calculated to provide a comprehensive characterization of classification performance. Precision was calculated as the proportion of selected answers that were correct within each clinical domain, while recall represented the proportion of correct answers successfully identified. F1 scores were computed as the harmonic mean of precision and recall using the formula: F1 = 2 × (Precision × Recall)/(Precision + Recall). Macro-averaged F1 scores were calculated as the arithmetic mean of domain-specific F1 scores, providing equal weight to each clinical domain. Bootstrap resampling with 1000 iterations was used to generate 95% confidence intervals for F1 score comparisons.
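Under one reading of these definitions (a single selected answer per question, so that domain precision and recall both reduce to the domain proportion correct), the macro-F1 and its bootstrap confidence interval could be computed as in the sketch below; the data layout and example values are illustrative.

```python
# Sketch of the supplementary metrics under the stated definitions; with one
# selected answer per question, domain precision and recall both reduce to the
# domain proportion correct. Data layout and example values are illustrative.
import numpy as np

rng = np.random.default_rng(42)

def f1(precision: float, recall: float) -> float:
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def macro_f1(correct, domains) -> float:
    """Equal-weight mean of domain-specific F1 scores."""
    correct, domains = np.asarray(correct), np.asarray(domains)
    scores = []
    for d in np.unique(domains):
        acc = correct[domains == d].mean()   # domain precision = recall = accuracy
        scores.append(f1(acc, acc))
    return float(np.mean(scores))

def bootstrap_ci(correct, domains, n_boot=1000, level=0.95):
    """Percentile CI for macro-F1 obtained by resampling questions."""
    correct, domains = np.asarray(correct), np.asarray(domains)
    n = len(correct)
    stats = [macro_f1(correct[idx], domains[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return float(lo), float(hi)

correct = [1, 1, 0, 1, 1, 1, 0, 1]                       # illustrative flags
domains = ["dx", "dx", "tx", "tx", "tx", "comp", "comp", "comp"]
print(macro_f1(correct, domains), bootstrap_ci(correct, domains))
```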
4. Discussion
This study evaluated the clinical decision-making performance of an MLLM system, a unimodal large language model (ChatGPT-5.2), and human clinical experts across 25 aortic dissection scenarios. The MLLM system achieved 92.0% overall accuracy, while ChatGPT-5.2 achieved 96.0%. Human expert performance ranged from 89.3% (emergency medicine) to 96.0% (cardiovascular surgeons). Statistical analysis revealed no significant differences between AI models and human experts across all comparisons (p > 0.05). These performance levels exceed the typical passing threshold for medical board examinations, suggesting that current large language models have reached clinical competence levels suitable for decision support applications [19,20].
Both AI systems achieved perfect 100% accuracy in the diagnosis domain, while pooled human experts achieved 95.8%. The MLLM system demonstrated unanimous agreement among all three component models (GPT-4V, Med-PaLM 2, and BioGPT) for all eight diagnostic questions. This finding indicates high reliability in image interpretation and clinical correlation for aortic dissection diagnosis. Previous studies have demonstrated that multimodal AI systems can effectively integrate visual and textual information for medical image interpretation. The comparable diagnostic performance between multimodal and text-only approaches suggests that well-structured radiological descriptions can effectively convey critical imaging information [21].
The treatment domain showed differential performance patterns: ChatGPT-5.2 achieved 100% accuracy compared with 88.9% for the MLLM system. The MLLM error arose from component-model disagreement on blood pressure target selection: GPT-4V selected the guideline-concordant answer (<120 mmHg systolic), but Med-PaLM 2 and BioGPT selected an incorrect option, producing an erroneous majority vote. In the complication domain, human experts achieved higher pooled accuracy (90.3%) than both AI systems (87.5%). Both AI systems made identical errors on stroke management, incorrectly prioritizing thrombolysis over emergent surgical repair; the 2022 ACC/AHA guidelines recommend that Type A dissection repair take precedence over stroke thrombolysis [12]. Cardiovascular surgeons demonstrated the highest accuracy in both domains, reflecting their specialized clinical experience.
The MLLM consensus algorithm provided valuable insights into AI decision-making reliability. Unanimous agreement among component models occurred in 84% of questions and yielded 100% accuracy, whereas majority (2/3) agreement occurred in 16% of questions and yielded only 50% accuracy. Inter-rater reliability among the component models was substantial (Fleiss' kappa = 0.78) [5]. These findings suggest that the level of component-model agreement may serve as a confidence indicator for clinical implementation: high-confidence responses demonstrated perfect reliability, while lower-confidence responses showed reduced accuracy. Future clinical decision support systems could incorporate such confidence metrics to guide appropriate human oversight [22].
Multimodal large language models represent a paradigm shift toward integrated, multimodal data-driven medical practice. In the complex field of medicine, multimodal data, including medical images, time-series data, audio recordings, clinical notes, and videos, are prevalent and crucial for informed clinical decisions. The MLLM architecture encompasses four key stages: modality-specific encoding, embedding alignment and fusion, contextual understanding with cross-modal interactions, and decision-making output generation [23]. These models can reduce the need for large datasets through few-shot or zero-shot learning and support visual prompting to refine predictions. Future multimodal AI systems could bridge interoperability gaps between different medical software systems, including electronic medical records, decision support tools, and radiology AI models, potentially transforming clinical workflows across emergency triage, procedural documentation, and personalized treatment planning. The seamless integration of diverse data types enables more comprehensive diagnostic insights, as demonstrated by Med-PaLM Multimodal: in a side-by-side ranking of 246 retrospective chest X-rays, clinicians expressed a pairwise preference for Med-PaLM M-generated reports over those produced by radiologists in up to 40.50% of cases [24].
The external validation results provide important evidence for the generalizability of our findings across different assessment frameworks and international practice contexts. Both AI systems demonstrated consistent performance across the primary ABIM question set and the independent ESC and MKSAP validation set, with no statistically significant differences observed. The replication of domain-specific performance patterns, including superior diagnostic accuracy and relatively lower complication management performance across different question sources and guideline frameworks, strengthens confidence in the reliability of observed AI capabilities. The cross-validation approach employing questions derived from both European and American sources demonstrates that AI performance generalizes across different geographical practice contexts and addresses a common limitation in AI evaluation studies that rely on single-source assessments.
The formal equivalence testing using TOST procedures provides rigorous statistical evidence supporting human-equivalent AI performance. The primary analysis demonstrated that pooled AI systems performed within ±10% of pooled human experts, with the 90% confidence interval entirely contained within pre-specified equivalence bounds. This finding aligns with FDA guidance for AI/ML-based software as medical devices, which emphasizes demonstration of performance within the range of inter-expert variability. Notably, ChatGPT-5.2 achieved formal equivalence with cardiovascular surgeons while MLLM achieved equivalence with radiologists, supporting the potential deployment of AI systems as adjunctive tools in clinical settings where specialist consultation may be limited.
Several limitations warrant consideration. The sample size of 25 questions, while representative of key clinical domains, may not capture the full spectrum of aortic dissection scenarios and substantially limits the statistical power and generalizability of our findings. Performance differences of one or two questions translate into accuracy changes of 4–8%, which may be clinically meaningful yet remain undetectable given our sample constraints. The non-significant p-values observed across comparisons should not be interpreted as evidence of true equivalence between AI systems and human experts; rather, these findings reflect the study's exploratory nature and insufficient power to detect potentially meaningful differences. Future studies with substantially larger question sets (minimum n = 100–200 items based on our power calculations) are necessary to provide definitive evidence regarding performance equivalence or superiority. The human expert sample (n = 9) from five centers, while geographically diverse across Turkey, represents tertiary care settings and may not generalize to all practice environments.

An important methodological consideration is the exclusive reliance on standardized multiple-choice questions rather than real-world clinical scenarios. This examination-style format inherently favors recall of guideline-based knowledge and pattern recognition over dynamic clinical reasoning under uncertainty. Real-world aortic dissection management involves incomplete and evolving clinical data, severe time constraints, multi-stakeholder communication demands, and complex ethical considerations that cannot be captured in structured assessment formats. Consequently, our findings may overestimate AI performance relative to actual bedside decision-making. The controlled testing environment eliminates variables such as cognitive load from simultaneous patient care responsibilities, the need to gather and synthesize information from multiple sources, and the integration of patient preferences into shared decision-making. Future research should prioritize prospective evaluation in authentic clinical environments using simulation-based assessments, standardized patient encounters, or retrospective chart review with outcome validation to better characterize real-world AI performance. AI responses were likewise obtained under controlled conditions with standardized prompts.

The consensus-based MLLM decision strategy warrants critical examination as a potential source of systematic error. The majority voting approach was selected based on established ensemble learning principles, whereby combining predictions from multiple models typically improves overall accuracy and reduces individual model biases [25,26]. However, our finding that accuracy fell from 100% under unanimous agreement to 50% in disagreement cases highlights a fundamental limitation of simple ensemble approaches in safety-critical clinical domains. This phenomenon, in which ensemble methods paradoxically suppress correct responses from the most accurate individual model, has been documented in heterogeneous AI systems with varying domain-specific expertise [27]. Several alternative ensemble strategies merit consideration for future implementations. First, weighted voting based on domain-specific validation performance could assign greater influence to models with demonstrated expertise in particular clinical areas. Second, confidence-calibrated voting could incorporate each model's output probability distributions rather than binary selections, enabling uncertainty quantification. Third, abstention protocols could flag disagreement cases for mandatory human review rather than forcing potentially unreliable automated decisions. Fourth, hierarchical decision frameworks could route questions to specialized models based on domain classification before consensus determination. The clinical implications are significant: in safety-critical applications, ensemble AI systems should incorporate disagreement detection as a trigger for human oversight rather than autonomous resolution, consistent with emerging human-AI collaboration frameworks in healthcare [28]. Finally, the cross-sectional design captures a single point in time, and AI capabilities continue to evolve rapidly.
A notable methodological limitation concerns the assessment of multimodal capabilities. Although the MLLM’s multimodal architecture is a central focus, the diagnostic advantage of direct image processing could not be definitively demonstrated in our study design. Both AI systems achieved perfect diagnostic accuracy (100%), precluding differentiation of their capabilities in this domain due to ceiling effects. The provision of standardized, expert-generated radiological descriptions to ChatGPT-5.2 and to the text-only MLLM components (Med-PaLM 2, BioGPT) may have compensated for the absence of direct image analysis, effectively equalizing information available across systems. With larger sample sizes incorporating more diagnostically challenging cases, performance differences between multimodal and unimodal architectures may become more apparent. To rigorously isolate the incremental value of true image-based reasoning, future studies should employ factorial designs with expanded question sets comparing: (1) raw imaging input without text descriptions, (2) text-only descriptions without images, (3) combined image and text input, and (4) deliberately degraded or ambiguous image quality conditions where visual interpretation becomes critical.
The clinical implications are significant. AI systems demonstrated comparable accuracy to human specialists, supporting their potential role as clinical decision support tools. The differential performance across domains suggests that AI systems may be particularly valuable for diagnostic support. The identified error patterns provide guidance for AI improvement, particularly regarding subspecialty-specific guidelines. Future research should evaluate AI performance in prospective clinical settings, assess the impact on clinical outcomes, and explore optimal human-AI collaboration models. The consensus confidence metric offers a framework for calibrating appropriate levels of AI autonomy versus human oversight in clinical implementation.