Article

A Comparative Assessment of ChatGPT, Gemini, and DeepSeek Accuracy: Examining Visual Medical Assessment in Internal Medicine Cases with and Without Clinical Context

1 Department of Clinical Pharmacy, College of Pharmacy, King Khalid University, Abha 62583, Saudi Arabia
2 Faculty of Pharmacy, Iqra University, Karachi 75850, Pakistan
3 Department of Pharmacology, Faculty of Pharmacy and Pharmaceutical Sciences, University of Karachi, Karachi 75270, Pakistan
* Author to whom correspondence should be addressed.
Diagnostics 2026, 16(3), 388; https://doi.org/10.3390/diagnostics16030388
Submission received: 28 November 2025 / Revised: 14 January 2026 / Accepted: 17 January 2026 / Published: 26 January 2026
(This article belongs to the Special Issue Deep Learning in Medical Imaging: Challenges and Opportunities)

Abstract

Background and Aim: Large language models (LLMs) demonstrate significant potential in assisting with medical image interpretation. However, the diagnostic accuracy of general-purpose LLMs on image-based internal medicine cases and the added value of brief clinical history remain unclear. This study evaluated three general-purpose LLMs (ChatGPT, Gemini, and DeepSeek) on expert-curated cases to quantify diagnostic accuracy with image-only input versus image plus brief clinical context. Methods: We conducted a comparative evaluation using 138 expert-curated cases from Harrison’s Visual Case Challenge. Each case was presented to the models in two distinct phases: Phase 1 (image only) and Phase 2 (image plus a brief clinical history). The primary endpoint was top-1 diagnostic accuracy for the textbook diagnosis, comparing performance with versus without a brief clinical history. Secondary/Exploratory analyses compared models and assessed agreement between model-generated differential lists and the textbook differential. Statistical analysis included Wilson 95% confidence intervals, McNemar’s tests, Cochran’s Q with Benjamini–Hochberg correction, and Wilcoxon signed-rank tests. Results: The inclusion of clinical history substantially improved diagnostic accuracy for all models. ChatGPT’s accuracy increased from 50.7% in Phase 1 to 80.4% in Phase 2. Gemini’s accuracy improved from 39.9% to 72.5%, and DeepSeek’s accuracy rose from 30.4% to 75.4%. In Phase 2, diagnostic accuracy reached at least 65% across most disease nature and organ system categories. However, agreement with the reference differential diagnoses remained modest, with average overlap rates of 6.99% for ChatGPT, 36.39% for Gemini, and 32.74% for DeepSeek. Conclusions: The provision of brief clinical history significantly enhances the diagnostic accuracy of large language models on visual internal medicine cases. In this benchmark, performance differences between models were smaller in Phase 2 than in Phase 1. While diagnostic precision improves markedly, the models’ ability to generate comprehensive differential diagnoses that align with expert consensus is still limited. These findings underscore the utility of context-aware, multimodal LLMs for educational support and structured diagnostic practice in supervised settings while also highlighting the need for more sophisticated, semantics-sensitive benchmarks for evaluating diagnostic reasoning.

1. Introduction

Healthcare specialists rely heavily on visual inspection when diagnosing patients, particularly in internal medicine. These diagnoses often originate from what clinicians observe across visual sources such as skin presentations, X-rays, computed tomography scans, electrocardiogram tracings, and microscopic tissue samples. Specialists highlight how lesion characteristics and their distribution guide dermatological diagnosis, while also stressing the need to integrate visual data from different medical specialties into coherent diagnostic assessments [1,2]. Diagnosis is not only a visual exercise; research demonstrates that even brief patient histories can improve the accuracy of image interpretation, revealing how visual evidence and narrative details work together in clinical decision making [3]. This reality drives interest in large language models as tools for both education and clinical decision support. Initial research indicates these models can handle medical questions and standardized examinations, though their dependability appears linked to input characteristics [4].
Initial assessments positioned general-purpose large language models as capable performers on text-based medical challenges. ChatGPT achieved passing or near-passing scores on United States Medical Licensing Examination (USMLE) assessments and generated clinically rigorous explanations, pointing toward utility in educational settings [4]. Follow-up experimental work using standardized clinical scenarios demonstrated that these models can organize differential diagnoses and explain their reasoning, though both accuracy and explanation quality fluctuate with case difficulty and how questions are framed [5]. Direct model comparisons using specialty examination databases showed mixed results, revealing how sensitive outcomes are to subject matter coverage and instruction phrasing [6]. Complementing these single-study findings, recent reviews synthesize a fast-moving literature and converge on a theme: LLMs exhibit promise for medical decision support and education, yet evidence remains uneven across tasks and modalities. Limitations noted in this body of work include sparse assessment of visual, image-first problems and inconsistent treatment of differential diagnosis scoring or the incremental effect of adding brief clinical history [7].
Recent work has begun to test multimodal large models on real medical images across specialties. In radiology, several groups evaluated GPT-4V or related vision-language models on chest radiographs and mixed-modality studies, showing that these systems can extract key findings and answer image-grounded questions, albeit with variable reliability across tasks and datasets [8,9]. Broader assessments using curated radiology case collections similarly reported that GPT-4V outperforms text-only LLM baselines when image input is available, emphasizing the potential value of visual context in diagnostic reasoning [10]. Outside radiology, dermatology studies indicate that multimodal models can separate benign from malignant lesions and generate clinically sensible descriptions, suggesting applicability to pattern-recognition problems at the bedside [11]. In pathology, emerging domain-specific artificial intelligence (AI) and early ChatGPT-based evaluations on microscopic images illustrate growing feasibility of image-aware reasoning in morphologic diagnosis, though methods and benchmarks remain heterogeneous [12,13]. Parallel systematic reviews examining medical artificial intelligence systems have captured these patterns, finding that vision-language models show potential for report creation and visual question answering tasks. However, these reviews also underscore the necessity for uniform evaluation methods across various clinical fields [14,15,16].
Despite artificial intelligence’s remarkable advances in processing diverse medical information, surprisingly few investigations have tested its capability on simple, single-image scenarios in general internal medicine. Such studies seldom employ tightly controlled experimental designs using just two conditions to definitively measure how much a brief patient narrative actually improves performance. Equally rare are investigations that assess how well artificial intelligence-generated differential diagnoses correspond to expert consensus while simultaneously measuring overall diagnostic precision. Recent systematic reviews have called for developing standardized assessment tools spanning multiple medical disciplines with explicit scoring frameworks. These benchmarks should extend beyond purely text-based case presentations and explicitly demonstrate how supplementary context modifies outcomes, building on emerging work on rigorous imaging-AI evaluation and attention-driven analysis [17,18]. This research gap stands out given long-established evidence that providing concise clinical histories can alter image interpretation and reshape the relevance of proposed differential diagnoses [19]. Harrison’s Visual Case Challenge offers expert-selected, image-centered cases complete with authoritative diagnoses and reference differential lists—materials that reflect how internists practice visual pattern recognition and how medical students are evaluated—thereby providing a standardized, reproducible benchmark for testing whether brief clinical context enhances artificial intelligence diagnostic performance [20]. This investigation sought to directly compare ChatGPT, Gemini, and DeepSeek on 138 Harrison’s cases. Our primary objective was to determine whether adding a brief clinical history improves top-1 diagnostic accuracy of general-purpose multimodal LLMs on visual internal medicine cases. Using this single endpoint, we compared three models (ChatGPT, Gemini, and DeepSeek). Secondary/Exploratory analyses assessed (i) the magnitude of the history-associated change in accuracy and (ii) agreement between model-generated differential lists and the textbook differential.

2. Methods

2.1. Study Design

We conducted a comparative evaluation of three large language models—ChatGPT, Gemini, and DeepSeek—using image-based clinical cases from Harrison’s Visual Case Challenge under a standardized evaluation workflow [20]. Prompt wording, decoding parameters, and case order were held constant across models.

2.2. Phases and Prompting

In Phase 1 (image only), the case image—downloaded directly from Harrison’s Visual Case Challenge on AccessMedicine (McGraw Hill)—was uploaded to each model using its native image attachment interface, with no accompanying clinical text; the prompt asked for a single best diagnosis and instructed the model not to request additional information. Immediately after recording the Phase 1 output for the same case, we proceeded to Phase 2 (image + brief history). The identical image was presented again together with the concise patient history as written in the book, and the prompt requested a single best diagnosis plus a brief differential diagnosis list. The Phase 2 histories were used verbatim and were not edited or standardized; their length and included elements vary by case (e.g., demographics, symptom timing, and occasional laboratory clues), consistent with the textbook format. Prompts were standardized across models, decoding parameters were held constant, and no external tools or web browsing were permitted. The study intentionally used a minimal, fixed prompt and did not introduce additional prompt variants (e.g., explicit reasoning instructions or top-k answer formats) in this benchmark.
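To make the two-phase setup concrete, the sketch below paraphrases the prompt structure in Python. The wording and the helper function name are illustrative only, not the study's exact templates, which are provided in the Supplementary Materials.

```python
# Illustrative paraphrase of the two fixed prompts described above; the exact
# wording used in the study is given in Supplementary File S1, not reproduced here.
PHASE1_PROMPT = (
    "Review the attached clinical image and provide the single best diagnosis. "
    "Do not ask for additional information."
)

PHASE2_PROMPT_TEMPLATE = (
    "Review the attached clinical image together with this brief history: {history}\n"
    "Provide the single best diagnosis and a brief differential diagnosis list."
)

def build_phase2_prompt(history: str) -> str:
    # The textbook history is inserted verbatim, without editing or standardization.
    return PHASE2_PROMPT_TEMPLATE.format(history=history)
```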

2.3. Data Source and Case Selection

Cases were drawn from Harrison’s Visual Case Challenge (McGraw-Hill; AccessMedicine). All 138 cases with a single representative image and a definitive reference diagnosis were included. Because the corpus contains a fixed set of 138 cases (n = 138), no formal a priori power calculation was performed; however, this sample size allows reasonably precise estimation of accuracy (95% confidence intervals of roughly ±8 percentage points around proportions near 0.5) and provides adequate power to detect moderate differences between phases and between models. These challenge cases are educational, single-image vignettes curated for teaching and are not prevalence-weighted, emphasizing classic “textbook” presentations rather than real-world incidence. Cases with a provided expert differential list were flagged for the differential-agreement analyses. For each case we recorded the clinical system/domain, a disease nature category (e.g., inflammatory/autoimmune, infectious, neoplastic), and the image modality (e.g., dermatology photograph, radiograph/CT, ECG, pathology). No cases contained patient identifiers. Because Harrison’s content is widely used in medical education, training data contamination cannot be excluded, and we did not attempt to distinguish cases that might be familiar to the models from those that are likely novel.
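As a rough check of the precision statement above, the half-width of a 95% confidence interval for a proportion near 0.5 with n = 138 is approximately

```latex
z_{0.975}\sqrt{\frac{p(1-p)}{n}} \;=\; 1.96\sqrt{\frac{0.5 \times 0.5}{138}} \;\approx\; 0.083,
```

i.e., roughly plus or minus 8 percentage points around an accuracy near 50%.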

2.4. Reference Answer

For each case, the book’s single correct diagnosis served as the reference. A reference differential diagnosis list is provided for all cases and was used to assess differential agreement. Primary diagnosis scoring followed the book’s term, with limited manual synonym allowance at the author’s discretion (no formal ontology or automated normalization). Differential agreement was summarized as percent overlap, i.e., matched terms divided by the number of terms in the book’s list.
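Stated as a formula, with B denoting the set of terms in the book's differential list and M the set of terms in the model's list, the overlap reported for a case is

```latex
\text{overlap} \;=\; \frac{|B \cap M|}{|B|} \times 100\%.
```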

2.5. Models and Configurations

ChatGPT (version: GPT-5 default), Gemini (version: Gemini 2.5 Flash), and DeepSeek (version: DeepSeek-V3.2) were accessed via their websites on the free (unpaid) tier between 1 and 30 September 2025. Model/version identifiers were recorded exactly as displayed in each platform’s model selector during the evaluation window. All interactions used default settings; no plug-ins, retrieval, or web browsing were enabled, and identical prompts were used across models. Thus, the results represent a time-bounded snapshot of each platform under free-tier constraints, which may include undocumented limits on context length, response length, or image resolution and could affect absolute performance.

2.6. Outcomes

The primary outcome was top-1 diagnostic accuracy per case for each model in Phase 1 (image only, without any patient history) and Phase 2 (image plus the book’s brief patient history). Top-1 accuracy is reported as a benchmark endpoint under the study’s single diagnosis prompt constraint and should not be interpreted as a direct clinical top-k usefulness measure. To ensure reproducible scoring, we normalized diagnosis labels using a pre-specified equivalence list covering clinically trivial wording variants (including unambiguous abbreviations and established eponyms). Before analysis, we created this synonym list for all distinct diagnosis labels; two clinician authors curated it with SNOMED CT used as a reference where needed, and disagreements were resolved by a third author. Diagnoses were scored correct only when they matched the normalized concept label exactly (partial string matches were not credited), and umbrella diagnoses were not credited unless identical to the book’s reference diagnosis concept. The single diagnosis was counted as correct when it exactly matched the book’s single correct diagnosis. For example, ‘MI’ was treated as equivalent to ‘myocardial infarction,’ whereas broader related terms such as ‘acute coronary syndrome’ were not counted as a match unless identical to the reference diagnosis concept. The secondary outcome was differential diagnosis agreement, defined as the percentage of diagnoses in the book’s reference differential list that also appeared in the model’s differential diagnosis list (matched terms ÷ total terms in the book’s list × 100). We summarized Phase 2 differential performance using three set-based metrics: (i) differential coverage, defined as the proportion of cases in which at least one textbook differential term appeared in the model’s differential list (case-level hit rate); (ii) recall of the textbook differential list, defined as matched textbook terms ÷ total textbook differential terms; and (iii) Jaccard overlap, defined as matched terms ÷ the number of terms in the union of the textbook and model differential term sets. Differential lists were evaluated as unordered sets (list order ignored), and rank-aware metrics were therefore not applied.
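To make the three set-based definitions concrete, the following minimal Python sketch (not the authors' scoring script; term normalization is assumed to have been applied already) computes coverage, recall, and Jaccard overlap for a single case.

```python
def differential_metrics(book_terms, model_terms):
    """Return (hit, recall, jaccard) for one case, treating lists as unordered sets."""
    book = set(book_terms)
    model = set(model_terms)
    matched = book & model
    hit = len(matched) > 0                                   # coverage: >=1 textbook term present
    recall = len(matched) / len(book) if book else 0.0       # matched / textbook list size
    union = book | model
    jaccard = len(matched) / len(union) if union else 0.0    # matched / union of both lists
    return hit, recall, jaccard

# Example with hypothetical, already-normalized labels:
book = ["psoriasis", "lichen planus", "pityriasis rosea"]
model = ["psoriasis", "nummular eczema"]
print(differential_metrics(book, model))  # (True, 0.333..., 0.25)
```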

2.7. Statistical Analysis

For each model and analysis stratum, we reported accuracy as a proportion with Wilson 95% confidence intervals. Paired comparisons of accuracy used McNemar’s test (within case), including (i) model-versus-model contrasts within a phase and (ii) Phase 1 versus Phase 2 within model. For paired contrasts (Phase 1 vs. Phase 2 within model; model-versus-model within phase), we additionally report the absolute difference in accuracy with 95% CIs computed using a paired-proportion CI method based on discordant pairs. When comparing all three models simultaneously, we used Cochran’s Q with post hoc McNemar tests and Benjamini–Hochberg correction for multiple testing. For differential diagnosis agreement, we reported the case-level percent overlap as mean (SD) and median [IQR]; paired between-model comparisons used nonparametric tests (e.g., Wilcoxon signed-rank) where appropriate. Analyses were performed using Python (version 3.14.0). Two-sided p < 0.05 was considered statistically significant after correction where applicable.
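A minimal sketch of these analyses is shown below, assuming per-case correctness is stored as 0/1 arrays of length 138 per model and phase; the variable names and placeholder data are illustrative, not the authors' code. scipy and statsmodels provide the tests named above.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.contingency_tables import mcnemar, cochrans_q
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n = 138
phase1 = rng.integers(0, 2, n)   # placeholder 0/1 correctness, Phase 1
phase2 = rng.integers(0, 2, n)   # placeholder 0/1 correctness, Phase 2

# Wilson 95% CI for a single accuracy estimate
lo, hi = proportion_confint(phase2.sum(), n, alpha=0.05, method="wilson")

# McNemar's test on the paired 2x2 table (Phase 1 vs. Phase 2 within one model)
table = np.array([
    [np.sum((phase1 == 1) & (phase2 == 1)), np.sum((phase1 == 1) & (phase2 == 0))],
    [np.sum((phase1 == 0) & (phase2 == 1)), np.sum((phase1 == 0) & (phase2 == 0))],
])
p_mcnemar = mcnemar(table, exact=True).pvalue

# Cochran's Q across the three models within one phase,
# followed by Benjamini-Hochberg adjustment of post hoc McNemar p-values
three_models = rng.integers(0, 2, (n, 3))   # placeholder columns: ChatGPT, Gemini, DeepSeek
q_result = cochrans_q(three_models)
posthoc_p = [0.01, 0.20, 0.03]              # illustrative raw post hoc p-values
reject, q_values, _, _ = multipletests(posthoc_p, method="fdr_bh")

# Wilcoxon signed-rank test for paired case-level differential overlap of two models
overlap_a = rng.random(n)
overlap_b = rng.random(n)
w_stat, p_wilcoxon = wilcoxon(overlap_a, overlap_b)
```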

2.8. Ethics and Copyright

All cases were drawn from Harrison’s Visual Case Challenge and contained no identifiable patient information. Representative case images and verbatim histories cannot be reproduced in the manuscript due to publisher copyright. This project involved no human–subjects interaction; an institutional review board (IRB) review was therefore not required or was deemed exempt under local policy. To support reproducibility, the complete prompt templates and platform configurations, scoring procedures, and case index with case-set IDs and A/B image labels, together with an outputs template, are provided in Supplementary Files S1 and S2.

3. Results

3.1. Overall Diagnostic Accuracy

We analyzed 138 cases from Harrison’s Visual Case Challenge to estimate the top-1 diagnostic accuracy of ChatGPT, Gemini, and DeepSeek and to compare performance across the two phases. Table 1 summarizes overall counts and per-phase accuracies for each model. Adding a brief clinical history improved performance for all three models. In Phase 1 (image only), accuracies were 50.72% for ChatGPT (70/138), 39.90% for Gemini (55/138), and 30.43% for DeepSeek (42/138). In Phase 2 (image + brief history), accuracies rose to 80.40% for ChatGPT (111/138), 72.50% for Gemini (100/138), and 75.36% for DeepSeek (104/138). The corresponding absolute improvements were +29.70 percentage points (pp) for ChatGPT, +32.60 pp for Gemini, and +44.93 pp for DeepSeek, indicating that concise clinical context was associated with sizable gains across models. As a contamination sensitivity analysis, after excluding dermatology and ocular cases (axes 1–3; 50/138 images), the main conclusion was unchanged and diagnostic accuracy still improved from Phase 1 to Phase 2 for all models (ChatGPT 55.7% vs. 79.5%; Gemini 20.5% vs. 67.0%; DeepSeek 36.4% vs. 77.3%).

3.2. Performance by Disease Nature Category

Table 2 presents diagnostic performance stratified by disease nature category, with Phase 1 and Phase 2 accuracies and changes for each model. Key findings include consistent improvements from Phase 1 to Phase 2 for ChatGPT and DeepSeek across all categories, while Gemini showed mixed results with gains in most but declines in two (metabolic/toxic: −10.00 pp; arrhythmic/electrophysiological: −20.00 pp). DeepSeek demonstrated the most substantial gains overall (e.g., +75.00 percentage points [pp] in cutaneous inflammatory/autoimmune and viral/parasitic infections), though its Phase 1 performance was very poor. In general, Phase 1 accuracies varied widely, with Gemini performing best (highest in 7 of 10 categories, e.g., 90.00% in structural/degenerative, 85.00% in arrhythmic/electrophysiological) and DeepSeek worst (lowest in 9 of 10 categories, e.g., 0.00% in viral/parasitic infections, 5.00% in arrhythmic/electrophysiological); ChatGPT was intermediate (e.g., 100.00% in “others” but 20.00% in metabolic/toxic). In Phase 2, accuracies converged to ≥65.00% in most categories, with DeepSeek leading or tying in four categories (e.g., 85.00% in cutaneous inflammatory/autoimmune, 80.00% in metabolic/toxic), ChatGPT in five categories (e.g., 100.00% in structural/degenerative and traumatic/hemorrhagic), and Gemini in three categories.
Figure 1 visualizes these accuracies as grouped bars with 95% Wilson confidence intervals (CIs; Phase 1 on top, Phase 2 on bottom), highlighting inter-model comparisons with statistical markers. Important findings include broader CIs in smaller categories (e.g., traumatic/hemorrhagic, n = 8), indicating greater variability, and more distinct bar separations in Phase 1 than Phase 2. In Phase 1, Gemini often led with significant superiority (solid stars) in categories like arrhythmic/electrophysiological and cutaneous inflammatory/autoimmune (Newcombe’s 95% CI for differences excludes 0; p < 0.05), while DeepSeek rarely competed. In Phase 2, differences were less marked, suggesting that added history can help reduce disparities.
Figure 2 presents false discovery rate adjusted q-values (Benjamini–Hochberg) as heatmaps comparing the models pairwise, where darker shades correspond to q-values closer to zero and therefore stronger evidence of a difference. The findings indicate greater differences in Phase 1. Q-values below 0.05 were frequently observed for Gemini versus DeepSeek, notably q = 0.000 in arrhythmic/electrophysiological, cutaneous inflammatory/autoimmune, and neoplastic/proliferative cases. Some differences were also noted for ChatGPT against DeepSeek, with a q-value of 0.028 in the “others” category. In contrast, ChatGPT versus Gemini exhibited minimal significant results, with most q-values exceeding 0.05. In Phase 2, q-values were consistently higher with no statistically significant contrasts, indicating diminished inter-model variability after the inclusion of clinical history, at least within this single textbook-derived benchmark.

3.3. Performance by Organ System Category

Table 3 summarizes the diagnostic performance stratified by organ system category, presenting Phase 1 and Phase 2 accuracies along with changes for each model. Accuracies varied widely among the models in Phase 1. ChatGPT performed best in pulmonary/thoracic cases (100%) and the “others” category (100%). Gemini led in cardiovascular (83.3%) and neurological (91.7%) cases, while DeepSeek lagged markedly, with the lowest accuracies in ocular (0%) and oral/mucosal (0%) cases. In Phase 2, all three models improved substantially, achieving ≥75.00% in most categories. The largest improvements were ChatGPT’s +57.10 pp in hematological/fluid systems (from 28.60% to 85.70%) and DeepSeek’s +100.00 pp in oral/mucosal cases (from 0% to 100%), alongside consistently high performance in pulmonary/thoracic cases (100.00% for ChatGPT and Gemini, 75.00% for DeepSeek). The dot-whisker plots highlighting these shifts are presented in Figure 3, with tighter CIs in larger categories (e.g., blistering/nodular skin disorders, n = 28) and broader CIs in smaller ones (e.g., pulmonary/thoracic, n = 4), reflecting sample size effects.
Per-model accuracy by category with 95% Wilson CIs is shown as dot-whisker plots for (left to right) ChatGPT, Gemini, and DeepSeek. For each category, markers display Phase 1 and Phase 2 side by side, enabling a direct visual comparison of within-model improvements and the remaining variability across categories.

3.4. Differential Diagnosis Precision

For Phase 2 only, differential diagnosis precision (percentage overlap with the differential diagnoses listed in the book) is summarized in Table 4, stratified by imaging category. Precision varied widely across all three models. ChatGPT averaged 6.99% (median 6.47%; range 0.0–21.42%), substantially lower than Gemini at 36.39% (median 37.5%; range 0.0–92.85%) and DeepSeek at 32.74% (median 33.33%; range 0.0–87.5%). This divergence from ChatGPT’s high Phase 2 top-1 accuracy reflects that top-1 scoring applied clinician-curated normalization for trivial label variants, whereas differential agreement was evaluated as strict term overlap against the textbook list (order ignored), which is more sensitive to lexical choices and brief lists. Gemini and DeepSeek consistently outperformed ChatGPT across categories. In microscopy (urine/synovial fluid) cases, Gemini achieved 45% and DeepSeek 50%, while ChatGPT scored just 7.74%. For papulosquamous skin photos, Gemini scored 43.75% and DeepSeek 37.49%, compared with ChatGPT’s 10.49%. The ranges also differed, with Gemini peaking at 92.85% in ocular photographs and DeepSeek at 87.5% in papulosquamous skin photos, while ChatGPT topped out at 21.42% in ECG cases. The set-based summaries of the Phase 2 differential diagnoses are shown in Figure 4. Under our matching criteria, at least one textbook differential term appeared in only 5.8% of cases for ChatGPT (8/138), but in 83.3% and 70.3% of cases for Gemini and DeepSeek, respectively. Mean recall of the textbook differential list was 1.1% for ChatGPT, 39.6% for Gemini, and 26.7% for DeepSeek, with corresponding mean Jaccard overlaps of 0.26%, 24.4%, and 18.3%, respectively.

4. Discussion

Evaluating ChatGPT, Gemini, and DeepSeek on 138 cases from Harrison’s Visual Case Challenge showed that adding brief clinical histories dramatically boosted diagnostic performance. DeepSeek showed the most striking gains, jumping from a modest Phase 1 baseline of 30.43% accuracy to 75.36%—a remarkable increase of 44.93 percentage points. Gemini AI led competitors in Phase 1 visual diagnostics, excelling at pure image interpretation across comparable benchmarks, which underscores its multimodal pattern recognition capabilities even without contextual support. When context was added in Phase 2, accuracies across models became more similar, as reflected in elevated q-values, suggesting diminished between-model variation. That said, generating precise differential diagnoses proved difficult for all models. ChatGPT averaged merely 6.99% alignment (ranging from 0.0% to 21.42%), while Gemini achieved 36.39% (0.0% to 92.85%) and DeepSeek reached 32.74% (0.0% to 87.5%). These figures reveal ongoing struggles to produce thorough differential lists, despite some categories like ocular photographs showing notably higher ranges. The results point to large language models as promising supplementary tools when clinical history is available, yet they continue to face obstacles in pure image interpretation, especially for complex presentations. Given this modest differential alignment, we interpret the current findings primarily as supporting educational use (e.g., supervised teaching and structured practice) rather than stand-alone diagnostic assistance [21,22].
At a more granular level, the stratified analyses by disease nature category, organ system, and imaging modality show that the general pattern of Phase 2 improvement holds across most internal medicine domains, with some variation in the magnitude of gains by model and category. Qualitatively, many of the remaining errors involve visually or clinically similar entities (e.g., look-alike dermatoses or rhythm-strip patterns) rather than completely unrelated diagnoses, which may be helpful when using these outputs in teaching and case discussion. By modality, the most common failure modes differed. In dermatology photographs, incorrect top-1 outputs most often reflected visually similar mimics and near-miss look-alike dermatoses. In ECG or rhythm-strip cases, errors often reflected confusion between similar rhythm patterns. In CT or MRI cases, misses were more common when findings were subtle or nonspecific on a single representative image, and brief history could help narrow interpretation.
Our Phase 1 to Phase 2 gains (spanning +29.70 to +44.93 percentage points) echo Han et al. (2024)’s observations that multimodal large language models surpass baseline performance when provided clinical context [23]. Hu et al. (2023) reported that LLMs enhance pathology tasks with contextual input, specifically highlighting the performance of models like GPT-3 and BERT in improving diagnostic accuracy when paired with clinical context in pathology image analysis [24]. Bradshaw et al. (2025) also observed that adding a patient’s background or brief history greatly improves the accuracy of multimodal models in correctly assessing cardiac images [25]. In the current study, 138 cases spanning dermatology photographs, ECG tracings, chest X-rays, and CT/MRI scans offer a comprehensive view of internal medicine diagnostics, moving beyond single-modality tests. Clear reporting of evaluation design and scoring procedures is emphasized across medical imaging AI studies [26]. Phase 1 accuracies ranged from 30 to 50%, but adding a brief patient history covering age, symptoms, and laboratory findings boosted Phase 2 accuracies to 80.43% for ChatGPT (+29.70 pp), 72.46% for Gemini (+32.60 pp), and 75.36% for DeepSeek (+44.93 pp). For radiology-specific results, cardiovascular cases (n = 24) improved from 4.17% to 54.17% for DeepSeek (+50.00 pp) and from 25.00% to 62.50% for ChatGPT (+37.50 pp), though Gemini dropped from 83.30% to 62.50% (−20.80 pp), while neurological CT/MRI brain scans (n = 12) rose from 33.33% to 91.67% for DeepSeek (+58.34 pp), showing history’s role in refining subtle findings. This approach aligns with Suh and Shim (2024), who reported top-1 accuracy on 190 radiology cases jumping from 15% (image only) to 48% when patient history accompanied the images [27], and with Kim et al. (2025), where Llama-3 and GPT-4o achieved 70–80% accuracy on 1933 cases with full history, emphasizing context’s impact—though our mixed internal medicine focus and pure-image Phase 1 baseline offer a broader contrast than their subspecialty baselines [28]. Related vision-language model benchmarking in radiology has also evaluated multimodal models on multisequence MRI datasets using structured comparisons [29].
Gemini displayed strong Phase 1 capabilities, hitting 85% accuracy for arrhythmic and electrophysiological cases—a finding that resonates with Kao et al. (2025)’s work on large language model effectiveness in radiology pattern recognition [30] and Bradshaw et al. (2025)’s primer on large language and multimodal models in medical imaging, both pointing to solid initial visual interpretation skills [25]. By contrast, our study found ChatGPT delivered steady incremental improvements, including a noteworthy +57.10 percentage point gain in hematological and fluid system cases, which mirrors Han et al. (2024)’s comparative assessment of text-based reasoning and suggests balanced flexibility across different contextual scenarios [23]. DeepSeek’s context-driven improvement stands out, particularly its +75.00 percentage point rise in cutaneous cases, which aligns with Ling et al. (2023)’s focus on domain specialization [31] and mirrors the well-documented constraints of multimodal AI in clinical settings, such as data harmonization and model integration [32]. However, its weak Phase 1 showing (including 0.00% in viral cases) diverges from the consistent visual competence reported by Qin et al. (2022), highlighting its reliance on contextual information [33]. Gemini’s −20.00 percentage point drop in arrhythmic cases contrasts with the stable reporting seen in Atsukawa et al. (2025)’s radiology research [34] and Hu et al. (2023)’s imaging advances [24], likely reflecting the complexity of arrhythmic visuals. Worth noting, Suh and Shim (2024) documented similar initial strength in Gemini Pro Vision that weakened without historical context [27], contrasting with the Eurorad benchmark study (2025), in which Llama-3 maintained stability when given contextual support [28]. Our comparative analysis reveals this variability, a dimension often overlooked in single-model investigations such as Bradshaw et al. (2025) [25] and the prompt engineering survey of Sahoo et al. (2024) [35], thereby extending comprehensive evaluation by identifying model-specific deficiencies in visual task performance [33,36].
Differential diagnosis accuracy (Phase 2) varied among AI models, i.e., Gemini 36.39%, DeepSeek 32.74%, and ChatGPT 6.99% overall, with wide spread across imaging categories. Accuracy peaked in stereotyped visuals (ocular photographs up to 92.85% and papulosquamous dermatoses up to 87.5%) but was lower for ECG/rhythm cases (ChatGPT’s maximum: 21.42%). Because overlap reflects precision rather than completeness, future work should add recall/Jaccard or concept-level metrics to capture missed but clinically relevant alternatives. In a comparison of two LLMs with a legacy diagnostic decision support system on 36 unpublished cases, the share of cases where the correct diagnosis appeared anywhere in a 25-item differential was 42% (LLM1) and 39% (LLM2) without labs, rising to 64% and 58% with labs [37]; the DDSS listed the diagnosis more often and higher than the LLMs. Our differential precision numbers (Gemini 36.39%, DeepSeek 32.74%) are broadly consistent with that mid-range inclusion band once limited context is provided, though we score by term overlap rather than presence/rank, which is a stricter criterion.
Domain-specific evaluations echo this. In dermatology, a vision-enabled LLM correctly identified the top diagnosis in 54% of image-only cases and included the correct diagnosis in the differential in 50%, highlighting that differentials can capture partial clinical alignment even when the single best guess is wrong—much like our category-level peaks in ocular and papulosquamous groups [38]. In multimodal general clinical testbeds, investigators have likewise emphasized top-k differential metrics as more informative than top-1 alone; adopting such multi-term measures would complement our precision overlap and likely raise measured agreement for cases where models named several correct alternatives [39].
A key strength of the current study is the standardized, expert-curated corpus from Harrison’s Visual Case Challenge with canonical answers and reference differentials, enabling transparent, reproducible comparisons across models and conditions. The two-condition design (image-only vs. image + history) directly quantifies the incremental contribution of brief context at scale (n = 138). Limitations include reliance on a single educational, prevalence-agnostic textbook source and single-image vignettes (real-world cases may involve multi-view or serial imaging), which may introduce spectrum bias by over-representing classic presentations and under-representing borderline, ambiguous, or comorbid cases; real-world imaging may also be noisy or incomplete. We also did not obtain clinician performance on the same images under identical constraints, so we cannot directly compare LLM accuracy with expert or trainee benchmarks. In addition, we evaluated free-tier models during a fixed window (September 2025) under platform-specific constraints that may limit context length, throughput, or image resolution. We did not repeat the full experiment across multiple days, so our findings should be interpreted as a time-bounded snapshot that may change as model versions and free-tier policies evolve. Results reflect a single run per case and phase for each model; we did not repeat prompts to estimate within-model stochastic variability. While top-1 accuracy was pre-specified to match the study’s single diagnosis prompt and single reference label per case, future work can additionally assess top-3/top-5 inclusion using standardized ranked-list prompts. Model training corpora are not transparent, so training data contamination cannot be definitively excluded. We addressed this concern with a leave-out sensitivity analysis that excluded dermatology and ocular cases (axes 1–3). The Phase 2 improvement pattern was preserved across models. This supports the robustness of the main finding. The study did not quantify history information content (e.g., length or presence of age/labs/timing) or analyze accuracy gain as a function of these features, as Phase 2 histories were not standardized across cases. The study did not perform a prompt ablation (e.g., diagnosis-only vs. diagnosis + reasoning, or single diagnosis vs. top-k formats); therefore, some degree of prompt sensitivity cannot be excluded.

5. Conclusions

Across 138 single-image internal medicine cases, ChatGPT, Gemini, and DeepSeek achieved moderate to high top-1 diagnostic accuracy, with performance varying by disease category and modality. Providing a brief clinical history further improved accuracy for all three models and reduced apparent performance gaps between models within this benchmark. While variations across categories persisted, diagnostic accuracy generally aligned once clinical context was provided. Correspondence with expert differential diagnoses stayed moderate overall, with stronger agreement in visually characteristic modalities and weaker matching in rhythm interpretation. These outcomes support using multimodal models with clinical history for educational support and structured practice in supervised settings, while highlighting the ongoing need for standardized, semantically informed assessment of differential diagnosis lists. Future work should extend beyond a single educational source and single-image cases, incorporate ontology-anchored scoring for differentials, and include prospective comparisons with clinicians in blinded, multi-center settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics16030388/s1.

Author Contributions

Conceptualization, A.A.I. and R.A.; data curation, R.A. and S.A.A.; formal analysis, S.A.A., M.I., and A.I.; methodology, R.A., M.I., and K.O.; project administration, A.A.I. and K.O.; software, A.A.I., S.A.A., M.I., and A.I.; supervision, A.A.I. and K.O.; validation, R.A. and K.O.; visualization, A.A.I., R.A., K.O., and A.I.; writing, R.A., A.A.I., S.A.A., M.I., K.O., and A.I. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through small group research under grant number RGP2/83/46.

Institutional Review Board Statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements.

Informed Consent Statement

Written informed consent to participate in the study was not required from the participants or the participants’ legal guardians in accordance with the national legislation and institutional requirements.

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. The diagnostic images used in this study are drawn from Harrison’s Visual Case Challenge (McGraw-Hill; AccessMedicine) and are subject to publisher copyright; the raw images cannot be publicly shared by the authors. All non-image artifacts required to reproduce the evaluation—including full prompt templates and example wording, platform configuration and decoding settings, the case index with case-set IDs and A/B image labels, and an anonymized outputs template—are provided in Supplementary Files S1 and S2.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Hansmann, M.-L.; Klauschen, F.; Samek, W.; Müller, K.-R.; Donnadieu, E.; Scharf, S.; Hartmann, S.; Koch, I.; Ackermann, J.; Pantanowitz, L. Imaging bridges pathology and radiology. J. Pathol. Inform. 2023, 14, 100298. [Google Scholar] [CrossRef]
  2. Micheletti, R.G.; Shinkai, K.; Madigan, L. Introducing “images in dermatology”. JAMA Dermatol. 2018, 154, 1255–1256. [Google Scholar] [CrossRef]
  3. Yapp, K.E.; Brennan, P.; Ekpo, E. The effect of clinical history on diagnostic imaging interpretation—A systematic review. Acad. Radiol. 2022, 29, 255–266. [Google Scholar] [CrossRef]
  4. Kung, T.H.; Cheatham, M.; Medenilla, A.; Sillos, C.; De Leon, L.; Elepaño, C.; Madriaga, M.; Aggabao, R.; Diaz-Candido, G.; Maningo, J. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit. Health 2023, 2, e0000198. [Google Scholar] [CrossRef]
  5. Goh, E.; Gallo, R.; Hom, J.; Strong, E.; Weng, Y.; Kerman, H.; Cool, J.A.; Kanjee, Z.; Parsons, A.S.; Ahuja, N. Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Netw. Open 2024, 7, e2440969. [Google Scholar] [CrossRef]
  6. Guerra, G.A.; Hofmann, H.L.; Le, J.L.; Wong, A.M.; Fathi, A.; Mayfield, C.K.; Petrigliano, F.A.; Liu, J.N. ChatGPT, Bard, and Bing chat are large language processing models that answered orthopaedic in-training examination questions with similar accuracy to first-year orthopaedic surgery residents. Arthrosc. J. Arthrosc. Relat. Surg. 2025, 41, 557–562. [Google Scholar] [CrossRef] [PubMed]
  7. Meng, X.; Yan, X.; Zhang, K.; Liu, D.; Cui, X.; Yang, Y.; Zhang, M.; Cao, C.; Wang, J.; Wang, X. The application of large language models in medicine: A scoping review. Iscience 2024, 27, 109713. [Google Scholar] [CrossRef] [PubMed]
  8. Zhou, Y.; Ong, H.; Kennedy, P.; Wu, C.C.; Kazam, J.; Hentel, K.; Flanders, A.; Shih, G.; Peng, Y. Evaluating GPT-4V (GPT-4 with Vision) on detection of radiologic findings on chest radiographs. Radiology 2024, 311, e233270. [Google Scholar] [CrossRef] [PubMed]
  9. Liu, Y.; Li, Y.; Wang, Z.; Liang, X.; Liu, L.; Wang, L.; Cui, L.; Tu, Z.; Wang, L.; Zhou, L. A systematic evaluation of GPT-4V’s multimodal capability for chest X-ray image analysis. Meta-Radiology 2024, 2, 100099. [Google Scholar] [CrossRef]
  10. Busch, F.; Han, T.; Makowski, M.R.; Truhn, D.; Bressem, K.K.; Adams, L. Integrating text and image analysis: Exploring GPT-4v’s capabilities in advanced radiological applications across subspecialties. J. Med. Internet Res. 2024, 26, e54948. [Google Scholar] [CrossRef]
  11. Cirone, K.; Akrout, M.; Abid, L.; Oakley, A. Assessing the utility of multimodal large language models (GPT-4 vision and large language and vision assistant) in identifying melanoma across different skin tones. JMIR Dermatol. 2024, 7, e55508. [Google Scholar] [CrossRef]
  12. Dai, D.; Zhang, Y.; Yang, Q.; Xu, L.; Shen, X.; Xia, S.; Wang, G. Pathologyvlm: A large vision-language model for pathology image understanding. Artif. Intell. Rev. 2025, 58, 186. [Google Scholar] [CrossRef]
  13. Ding, L.; Fan, L.; Shen, M.; Wang, Y.; Sheng, K.; Zou, Z.; An, H.; Jiang, Z. Evaluating ChatGPT’s diagnostic potential for pathology images. Front. Med. 2025, 11, 1507203. [Google Scholar] [CrossRef] [PubMed]
  14. Hartsock, I.; Rasool, G. Vision-language models for medical report generation and visual question answering: A review. Front. Artif. Intell. 2024, 7, 1430984. [Google Scholar] [CrossRef]
  15. Huang, S.-C.; Jensen, M.; Yeung-Levy, S.; Lungren, M.P.; Poon, H.; Chaudhari, A.S. Multimodal Foundation Models for Medical Imaging-A Systematic Review and Implementation Guidelines. medRxiv 2024. [Google Scholar] [CrossRef]
  16. Ryu, J.S.; Kang, H.; Chu, Y.; Yang, S. Vision-language foundation models for medical imaging: A review of current practices and innovations. Biomed. Eng. Lett. 2025, 15, 809–830. [Google Scholar] [CrossRef]
  17. Ge, X.; Chen, J.; Yuan, C.; Chu, Z.; Li, X.; Zhang, X.; Chen, Y.; Zheng, W.Y.; Miao, C. Systematic Comparison of Multimodal Large Language Models for Pediatric Profile Orthodontic Assessment and Early Intervention: ChatGPT, DeepSeek, and Gemini. 2025. Available online: https://www.researchsquare.com/article/rs-7750405/v1 (accessed on 16 January 2026).
  18. Hayat, M. Endoscopic Image Super-Resolution Algorithm Using Edge and Disparity Awareness. 2023. Available online: https://digital.car.chula.ac.th/chulaetd/11935/ (accessed on 16 January 2026).
  19. Maizlin, N.N.; Somers, S. The role of clinical history collected by diagnostic imaging staff in interpreting of imaging examinations. J. Med. Imaging Radiat. Sci. 2019, 50, 31–35. [Google Scholar] [CrossRef]
  20. Graber, M. The Harrison’s Visual Case Challenge; McGraw-Hill Education: New York, NY, USA, 2021. [Google Scholar]
  21. Aykac, K.; Cubuk, O.; Demir, O.O.; Choe, Y.J.; Aydin, M.; Ozsurekci, Y. Comparing ChatGPT-3.5, Gemini 2.0, and DeepSeek V3 for pediatric pneumonia learning in medical students. Sci. Rep. 2025, 15, 40342. [Google Scholar] [CrossRef] [PubMed]
  22. Bahir, D.; Zur, O.; Attal, L.; Nujeidat, Z.; Knaanie, A.; Pikkel, J.; Mimouni, M.; Plopsky, G. Gemini AI vs. ChatGPT: A comprehensive examination alongside ophthalmology residents in medical knowledge. Graefe’s Arch. Clin. Exp. Ophthalmol. 2025, 263, 527–536. [Google Scholar] [CrossRef]
  23. Han, T.; Adams, L.C.; Bressem, K.K.; Busch, F.; Nebelung, S.; Truhn, D. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA 2024, 331, 1320–1321. [Google Scholar] [CrossRef] [PubMed]
  24. Hu, M.; Pan, S.; Li, Y.; Yang, X. Advancing medical imaging with language models: A journey from n-grams to chatgpt. arXiv 2023, arXiv:2304.04920. [Google Scholar] [CrossRef]
  25. Bradshaw, T.J.; Tie, X.; Warner, J.; Hu, J.; Li, Q.; Li, X. Large Language Models and Large Multimodal Models in Medical Imaging: A Primer for Physicians. J. Nucl. Med. 2025, 66, 173–182. [Google Scholar] [CrossRef]
  26. Hayat, M.; Aramvith, S. Superpixel-Guided Graph-Attention Boundary GAN for Adaptive Feature Refinement in Scribble-Supervised Medical Image Segmentation. IEEE Access 2025, 13, 196654–196668. [Google Scholar] [CrossRef]
  27. Suh, P.S.; Shim, W.H.; Suh, C.H.; Heo, H.; Park, C.R.; Eom, H.J.; Park, K.J.; Choe, J.; Kim, P.H.; Park, H.J. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from diagnosis please cases. Radiology 2024, 312, e240273. [Google Scholar] [CrossRef] [PubMed]
  28. Kim, S.H.; Schramm, S.; Adams, L.C.; Braren, R.; Bressem, K.K.; Keicher, M.; Platzek, P.-S.; Paprottka, K.J.; Zimmer, C.; Hedderich, D.M. Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports. npj Digit. Med. 2025, 8, 97. [Google Scholar] [CrossRef]
  29. Elboardy, A.T.; Khoriba, G.; al-Shatouri, M.; Mousa, M.; Rashed, E.A. Benchmarking vision-language models for brain cancer diagnosis using multisequence MRI. Inform. Med. Unlocked 2025, 58, 101692. [Google Scholar] [CrossRef]
  30. Kao, J.-P.; Kao, H.-T. Large Language Models in radiology: A technical and clinical perspective. Eur. J. Radiol. Artif. Intell. 2025, 30, 100021. [Google Scholar] [CrossRef]
  31. Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhang, X. Domain specialization as the key to make large language models disruptive: A comprehensive survey. ACM Comput. Surv. 2023, 58, 1–39. [Google Scholar] [CrossRef]
  32. Huang, S.-C.; Pareek, A.; Seyyedi, S.; Banerjee, I.; Lungren, M.P. Fusion of medical imaging and electronic health records using deep learning: A systematic review and implementation guidelines. npj Digit. Med. 2020, 3, 136. [Google Scholar] [CrossRef]
  33. Qin, Z.; Yi, H.; Lao, Q.; Li, K. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv 2022, arXiv:2209.15517. [Google Scholar]
  34. Atsukawa, N.; Tatekawa, H.; Oura, T.; Matsushita, S.; Horiuchi, D.; Takita, H.; Mitsuyama, Y.; Omori, A.; Shimono, T.; Miki, Y. Evaluation of radiology residents’ reporting skills using large language models: An observational study. Jpn. J. Radiol. 2025, 43, 1204–1212. [Google Scholar] [CrossRef]
  35. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  36. Wang, G.; Yang, R.; Zhang, Y.; Wen, X.; Liu, C.; Liu, E.; Tang, M.; Xue, L.; Liu, Z. Evaluating the performance of large language models in rheumatology for connective tissue diseases: DeepSeek-R1, ChatGPT-4.0, Copilot, and Gemini-2.0. Int. J. Med. Inform. 2026, 207, 106172. [Google Scholar] [CrossRef] [PubMed]
  37. Feldman, M.J.; Hoffer, E.P.; Conley, J.J.; Chang, J.; Chung, J.A.; Jernigan, M.C.; Lester, W.T.; Strasser, Z.H.; Chueh, H.C. Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses. JAMA Netw. Open 2025, 8, e2512994. [Google Scholar] [CrossRef] [PubMed]
  38. Pillai, A.; Parappally-Joseph, S.; Kreutz, J.; Traboulsi, D.; Gandhi, M.; Hardin, J. Evaluating the Diagnostic and Treatment Capabilities of GPT-4 Vision in Dermatology: A Pilot Study. J. Cutan. Med. Surg. 2025, 29, 570–576. [Google Scholar] [CrossRef] [PubMed]
  39. Jiang, Y.; Omiye, J.A.; Zakka, C.; Moor, M.; Gui, H.; Alipour, S.; Mousavi, S.S.; Chen, J.H.; Rajpurkar, P.; Daneshjou, R. Evaluating general vision-language models for clinical medicine. MedRxiv 2024, 2024-04. [Google Scholar]
Figure 1. Phase-wise accuracy by category with 95% CIs (grouped bars). Phase-wise (Phase 1 top; Phase 2 bottom) per-category diagnostic accuracy for ChatGPT, Gemini, and DeepSeek on 138 images across 10 disease nature categories. Bars show % correct with Wilson 95% CIs. A solid star (★) marks a single highest model whose lead over the next best is statistically significant (Newcombe 95% CI for the difference excludes 0). A white/hollow star (☆) indicates the model is highest but not significant; ties have no star.
Figure 2. Pairwise significance after multiple-testing correction (FDR heatmaps). False discovery rate (FDR)-adjusted q-values (Benjamini–Hochberg) for pairwise model comparisons across categories in Phase 1 (top) and Phase 2 (bottom). Rows correspond to ChatGPT vs. Gemini, ChatGPT vs. DeepSeek, and Gemini vs. DeepSeek; columns are the 10 categories. Cell values are q-values (darker = smaller), with a white star (★) marking q < 0.05. This panel answers where model differences are statistically credible after controlling for multiplicity.
Figure 3. Model-centric dot-whisker plots with 95% CIs across categories.
Figure 4. Phase 2 differential diagnosis coverage, recall, and Jaccard overlap by model.
Table 1. Overall diagnosis accuracy and differential diagnosis precision for ChatGPT, Gemini, and DeepSeek in medical image diagnosis.

Diagnosis accuracy:

| Metric | ChatGPT Phase 1 N (%) | ChatGPT Phase 2 N (%) | ChatGPT Δ | Gemini Phase 1 N (%) | Gemini Phase 2 N (%) | Gemini Δ | DeepSeek Phase 1 N (%) | DeepSeek Phase 2 N (%) | DeepSeek Δ |
|---|---|---|---|---|---|---|---|---|---|
| Total cases | 138 | 138 |  | 138 | 138 |  | 138 | 138 |  |
| Correct diagnoses | 70 (50.72%) | 111 (80.43%) | +41 (+58.57%) | 55 (39.86%) | 100 (72.46%) | +45 (+81.82%) | 42 (30.43%) | 104 (75.36%) | +62 (+147.62%) |
| Incorrect diagnoses | 68 (49.28%) | 27 (19.57%) | −41 (−60.29%) | 83 (60.14%) | 38 (27.54%) | −45 (−54.22%) | 96 (69.57%) | 34 (24.64%) | −62 (−64.58%) |
| Overall accuracy | 50.70% | 80.40% | +29.70 ppt | 39.90% | 72.50% | +32.60 ppt | 30.43% | 75.36% | +44.93 ppt |

Differential diagnosis accuracy (Phase 2 only):

| Metric | ChatGPT | Gemini | DeepSeek |
|---|---|---|---|
| Average (%) | 6.99 | 36.39 | 32.74 |
| Range (min–max %) | 0.0–21.42 | 0.0–92.85 | 0.0–87.5 |
| Median (%) | 6.47 | 37.5 | 33.33 |
Table 2. Comparative performance of ChatGPT, Gemini, and DeepSeek in medical image diagnosis across disease nature categories: correct diagnoses and accuracy changes between Phase 1 and Phase 2 for 138 cases.

| Disease Nature Category | Total Cases | ChatGPT Phase 1 Correct (% Acc.) | ChatGPT Phase 2 Correct (% Acc.) | ChatGPT Δ Accuracy | Gemini Phase 1 Correct (% Acc.) | Gemini Phase 2 Correct (% Acc.) | Gemini Δ Accuracy | DeepSeek Phase 1 Correct (% Acc.) | DeepSeek Phase 2 Correct (% Acc.) | DeepSeek Δ Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Cutaneous Inflammatory/Autoimmune | 20 | 8 (40.00%) | 16 (80.00%) | +40.00 ppt | 15 (75.00%) | 16 (80.00%) | +5.00 ppt | 2 (10.00%) | 17 (85.00%) | +75.00 ppt |
| Systemic Inflammatory/Autoimmune | 12 | 6 (50.00%) | 10 (83.30%) | +33.30 ppt | 10 (83.30%) | 10 (83.30%) | 0.00 ppt | 6 (50.00%) | 10 (83.33%) | +33.33 ppt |
| Bacterial and Fungal Infections | 12 | 6 (50.00%) | 11 (91.70%) | +41.70 ppt | 10 (83.30%) | 11 (91.70%) | +8.40 ppt | 4 (33.33%) | 8 (66.67%) | +33.34 ppt |
| Viral and Parasitic Infections | 12 | 4 (33.30%) | 10 (83.30%) | +50.00 ppt | 9 (75.00%) | 10 (83.30%) | +8.30 ppt | 0 (0.00%) | 9 (75.00%) | +75.00 ppt |
| Neoplastic and Proliferative | 20 | 10 (50.00%) | 16 (80.00%) | +30.00 ppt | 14 (70.00%) | 16 (80.00%) | +10.00 ppt | 6 (30.00%) | 15 (75.00%) | +45.00 ppt |
| Metabolic and Toxic | 10 | 2 (20.00%) | 6 (60.00%) | +40.00 ppt | 7 (70.00%) | 6 (60.00%) | −10.00 ppt | 3 (30.00%) | 8 (80.00%) | +50.00 ppt |
| Arrhythmic and Electrophysiological | 20 | 4 (20.00%) | 13 (65.00%) | +45.00 ppt | 17 (85.00%) | 13 (65.00%) | −20.00 ppt | 1 (5.00%) | 10 (50.00%) | +45.00 ppt |
| Structural and Degenerative | 10 | 8 (80.00%) | 10 (100.00%) | +20.00 ppt | 9 (90.00%) | 10 (100.00%) | +10.00 ppt | 5 (50.00%) | 8 (80.00%) | +30.00 ppt |
| Traumatic and Hemorrhagic | 8 | 6 (75.00%) | 8 (100.00%) | +25.00 ppt | 7 (87.50%) | 8 (100.00%) | +12.50 ppt | 2 (25.00%) | 7 (87.50%) | +62.50 ppt |
| Others | 16 | 16 (100.00%) | 16 (100.00%) | 0.00 ppt | 15 (93.75%) | 16 (100.00%) | +6.25 ppt | 9 (56.25%) | 12 (75.00%) | +18.75 ppt |
Table 3. Comparative performance of ChatGPT, Gemini, and DeepSeek in medical image diagnosis across organ systems.

| Organ System Category | Total Cases | ChatGPT Phase 1 Correct (% Acc.) | ChatGPT Phase 2 Correct (% Acc.) | ChatGPT Δ Accuracy | Gemini Phase 1 Correct (% Acc.) | Gemini Phase 2 Correct (% Acc.) | Gemini Δ Accuracy | DeepSeek Phase 1 Correct (% Acc.) | DeepSeek Phase 2 Correct (% Acc.) | DeepSeek Δ Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Scaly Skin Disorders | 12 | 4 (33.30%) | 10 (83.30%) | +50.00 ppt | 9 (75.00%) | 10 (83.30%) | +8.30 ppt | 2 (16.67%) | 10 (83.33%) | +66.66 ppt |
| Blistering and Nodular Skin Disorders | 28 | 12 (42.90%) | 22 (78.60%) | +35.70 ppt | 21 (75.00%) | 22 (78.60%) | +3.60 ppt | 8 (28.57%) | 21 (75.00%) | +46.43 ppt |
| Ocular System | 10 | 5 (50.00%) | 9 (90.00%) | +40.00 ppt | 7 (70.00%) | 9 (90.00%) | +20.00 ppt | 0 (0.00%) | 5 (50.00%) | +50.00 ppt |
| Oral and Mucosal System | 4 | 2 (50.00%) | 4 (100.00%) | +50.00 ppt | 3 (75.00%) | 4 (100.00%) | +25.00 ppt | 0 (0.00%) | 4 (100.00%) | +100.00 ppt |
| Cardiovascular System | 24 | 6 (25.00%) | 15 (62.50%) | +37.50 ppt | 20 (83.30%) | 15 (62.50%) | −20.80 ppt | 1 (4.17%) | 13 (54.17%) | +50.00 ppt |
| Neurological System | 12 | 9 (75.00%) | 11 (91.70%) | +16.70 ppt | 11 (91.70%) | 11 (91.70%) | 0.00 ppt | 4 (33.33%) | 11 (91.67%) | +58.34 ppt |
| Abdominopelvic and Gastrointestinal System | 16 | 10 (62.50%) | 14 (87.50%) | +25.00 ppt | 13 (81.25%) | 14 (87.50%) | +6.25 ppt | 12 (75.00%) | 13 (81.25%) | +6.25 ppt |
| Hematological and Fluid Systems | 14 | 4 (28.60%) | 12 (85.70%) | +57.10 ppt | 11 (78.60%) | 12 (85.70%) | +7.10 ppt | 6 (42.86%) | 13 (92.86%) | +50.00 ppt |
| Pulmonary and Thoracic System | 4 | 4 (100.00%) | 4 (100.00%) | 0.00 ppt | 4 (100.00%) | 4 (100.00%) | 0.00 ppt | 2 (50.00%) | 3 (75.00%) | +25.00 ppt |
| Others | 14 | 14 (100.00%) | 14 (100.00%) | 0.00 ppt | 13 (92.90%) | 14 (100.00%) | +7.10 ppt | 7 (50.00%) | 11 (78.57%) | +28.57 ppt |
Table 4. Comparative precision analysis of ChatGPT, Gemini, and DeepSeek in differential diagnosis across medical imaging categories: average, median, and range.

| Category | Total Cases | ChatGPT Avg (%) | ChatGPT Median (%) | ChatGPT Range (Min–Max %) | Gemini Avg (%) | Gemini Median (%) | Gemini Range (Min–Max %) | DeepSeek Avg (%) | DeepSeek Median (%) | DeepSeek Range (Min–Max %) |
|---|---|---|---|---|---|---|---|---|---|---|
| Papulosquamous, Plaque, and Scaling Skin Photos | 12 | 10.23 | 10.49 | 3.12–16.52 | 44.03 | 43.75 | 20.0–61.66 | 43.75 | 37.49 | 16.66–87.5 |
| Vesicular, Ulcerative, Nodular, and Pigmented Skin Photos | 28 | 5.63 | 5.52 | 0.0–11.46 | 35.45 | 32.08 | 10.0–77.5 | 36.01 | 39.58 | 0.0–66.66 |
| Ocular and Conjunctival Photographs | 8 | 4.64 | 0 | 0.0–18.58 | 37.8 | 29.17 | 0.0–92.85 | 32.29 | 20.83 | 12.5–75.0 |
| Oral and Mucosal Photographs | 4 | 7.6 | 7.6 | 6.54–8.66 | 30 | 30 | 10.0–50.0 | 41.67 | 41.67 | 0.0–83.34 |
| Electrocardiography (ECG/Rhythm Strips) | 24 | 9.01 | 10.04 | 0.0–21.43 | 39.14 | 43.75 | 7.14–87.5 | 23.75 | 16.66 | 0.0–62.5 |
| Cross-Sectional Neuroimaging (CT/MRI Brain) | 12 | 6.01 | 6.84 | 0.0–11.66 | 34.23 | 39.58 | 14.58–50.0 | 35.42 | 39.59 | 16.66–50.0 |
| Cross-Sectional Neuroimaging (CT Abdomen/Pelvis) | 12 | 6.78 | 5.32 | 0.0–18.34 | 36.9 | 39.64 | 12.14–55.0 | 27.64 | 26.66 | 0.0–58.34 |
| Microscopy (Blood Smears and Parasites) | 10 | 6.68 | 4.54 | 2.5–16.54 | 36.67 | 37.5 | 0.0–90.0 | 29.33 | 30 | 16.66–50.0 |
| Microscopy (Urine and Synovial Fluid) | 4 | 7.74 | 7.74 | 4.76–10.72 | 45 | 45 | 40.0–50.0 | 50 | 50 | 33.33–66.67 |
| Others | 24 | 6.2 | 6.08 | 0.0–14.36 | 30.28 | 33.33 | 12.5–62.5 | 30.68 | 33.33 | 12.5–50.0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
