This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Do LLMs Have ‘the Eye’ for MRI? Evaluating GPT-4o, Grok, and Gemini on Brain MRI Performance: First Evaluation of Grok in Medical Imaging and a Comparative Analysis

1 Department of Neurosurgery, Sincan Training and Research Hospital, Ankara 06949, Turkey
2 Department of Neurosurgery, Kulu State Hospital, Konya 42780, Turkey
3 Ankara Medipol University Faculty of Medicine, Ankara 06050, Turkey
4 Department of Neurosurgery, Adiyaman Training and Research Hospital, Adiyaman 02100, Turkey
5 Department of Neurosurgery, Gazi University Faculty of Medicine, Ankara 06560, Turkey
6 Department of Radiology, Stanford University School of Medicine, Stanford, CA 94305, USA
* Author to whom correspondence should be addressed.
Diagnostics 2025, 15(11), 1320; https://doi.org/10.3390/diagnostics15111320 (registering DOI)
Submission received: 2 May 2025 / Revised: 17 May 2025 / Accepted: 21 May 2025 / Published: 24 May 2025
(This article belongs to the Special Issue Artificial Intelligence in Neuroimaging 2024)

Abstract

Background/Objectives: Large language models (LLMs) are increasingly used in medicine and continue to improve rapidly. With recent advances in image interpretation, evaluating the reasoning capabilities of these models and benchmarking their performance on brain MRI tasks has become crucial, as they may be used, albeit off-label, for patient care by both neurosurgeons and non-neurosurgeons. Methods: ChatGPT-4o, Grok, and Gemini were presented with 35,711 brain MRI slices, comprising scans with various pathologies and normal scans. Each model was asked to identify the MRI sequence and to determine whether pathology was present. Their individual performances were measured and compared with one another. Results: GPT-4o refused to answer for 28.02% of the slices despite three attempts, whereas Grok and Gemini responded on the first attempt for every slice. Gemini achieved 74.54% accuracy in pathology prediction and 46.38% in sequence prediction. GPT-4o achieved 74.33% accuracy in pathology prediction and 85.98% in sequence prediction on the slices it answered (53.50% and 61.67% across all slices, respectively). Grok achieved 65.64% accuracy in pathology prediction and 66.23% in sequence prediction. Conclusions: The image interpretation capabilities of the investigated LLMs are currently limited and require further refinement before they can compete with dedicated applications that are specifically trained and fine-tuned. Measured across all slices, Gemini outperformed the others in pathology prediction, while Grok outperformed the others in sequence prediction. These limitations should be kept in mind if use in patient care is planned.
Keywords: Gemini; GPT; Grok; large language model; magnetic resonance imaging; neuroradiology
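
The abstract reports two accuracy figures for GPT-4o: one over the slices it answered and one over all slices. The relationship between them can be sketched as follows, under the assumption (not stated in the abstract) that every refused slice is scored as incorrect and that the same 28.02% refusal rate applies to both tasks. The function name `overall_accuracy` is hypothetical and used only for illustration.

```python
def overall_accuracy(answered_accuracy_pct: float, refusal_rate_pct: float) -> float:
    """Accuracy over all slices, assuming refused slices count as incorrect."""
    answered_fraction = (100.0 - refusal_rate_pct) / 100.0
    return answered_accuracy_pct * answered_fraction


# Figures taken from the abstract (GPT-4o).
REFUSAL_RATE_PCT = 28.02
pathology_total = overall_accuracy(74.33, REFUSAL_RATE_PCT)
sequence_total = overall_accuracy(85.98, REFUSAL_RATE_PCT)

print(f"pathology, all slices: {pathology_total:.2f}%")
print(f"sequence, all slices:  {sequence_total:.2f}%")
```

Under these assumptions the pathology figure comes out at about 53.50%, matching the abstract. The sequence figure comes out at about 61.89%, close to but not identical with the reported 61.67%, so the authors' exact denominator for that task may differ from the one assumed here.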

Share and Cite

MDPI and ACS Style

Sozer, A.; Sahin, M.C.; Sozer, B.; Erol, G.; Tufek, O.Y.; Nernekli, K.; Demirtas, Z.; Celtikci, E. Do LLMs Have ‘the Eye’ for MRI? Evaluating GPT-4o, Grok, and Gemini on Brain MRI Performance: First Evaluation of Grok in Medical Imaging and a Comparative Analysis. Diagnostics 2025, 15, 1320. https://doi.org/10.3390/diagnostics15111320