Diagnostics
  • Article
  • Open Access

10 November 2025

Evaluating Large Language Models in Interpreting MRI Reports and Recommending Treatment for Vestibular Schwannoma

Department of Neurosurgery, Faculty of Medicine, University of Freiburg, 79106 Freiburg, Germany
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Large Language Models in Medical Diagnostics: Advancing Clinical Practice, Research, and Patient Care

Abstract

Background/Objectives: The use of large language models (LLMs) by patients seeking information about their diagnosis and treatment is rapidly increasing. While their application in healthcare is still under scientific investigation, the demand for these models is expected to grow significantly in the coming years. This study evaluates and compares the diagnostic accuracy and treatment recommendations provided by three publicly available AI tools—GPT-4, Gemini, and Bing—for patients with vestibular schwannoma (VS) based on MRI reports, against the background of the growing use of these tools by patients seeking medical information. Methods: This retrospective study included 35 consecutive patients with VS treated at a university-based neurosurgery department. Anonymized MRI reports in German were translated into English, and the AI tools were queried with five standardized verbal prompts for diagnoses and treatment recommendations. Diagnostic accuracy, differential diagnoses, and treatment recommendations were assessed and compared. Results: Thirty-five patients (mean age, 57 ± 13 years; 18 men) were included. GPT-4 achieved the highest diagnostic accuracy for VS at 97.14% (34/35), followed by Gemini at 88.57% (31/35) and Bing at 85.71% (30/35). GPT-4 provided the most accurate treatment recommendations (57.1%, 20/35), compared with Gemini (45.7%, 16/35) and Bing (31.4%, 11/35); the difference between GPT-4 and Bing was statistically significant (p = 0.02). GPT-4 correctly recommended surgery in 60% of cases (21/35), compared with 51.4% for Bing (18/35) and 45.7% for Gemini (16/35). Conclusions: GPT-4 outperformed Gemini and Bing in interpreting MRI reports and providing treatment recommendations for VS. Although the AI tools demonstrated good diagnostic accuracy, their treatment recommendations were less precise than those made by an interdisciplinary tumor board. This study highlights the growing role of AI tools in patient-driven healthcare inquiries.

1. Introduction

The use of commercially available AI tools based on natural language processing (NLP) in healthcare has been a subject of interest in previous studies [,,,,,,,,,,,,,,,].
These tools offer an easy-to-use interface that requires no programming skills, since they accept commands in natural language, which facilitates their use by both physicians and patients.
The fact that these tools draw on a vast amount of online content, such as books, articles, and expert opinions with varying levels of evidence, raises uncertainty about the accuracy of the information generated in response to medical queries and about the extent to which these responses can be applied to individual cases. As these tools continue to evolve, their use by patients seeking insight into their diagnosis and treatment is expected to increase substantially in the coming years. This raises critical concerns regarding the accuracy and dependability of the recommendations they provide, which may positively or negatively influence the doctor-patient relationship.
Vestibular schwannoma (VS) is a tumor of Schwann cell origin arising from the vestibulocochlear nerve [,,]. This tumor is located in the internal auditory canal and/or cerebellopontine angle and classically manifests with unilateral sensorineural hearing loss []. The main diagnostic tool used to investigate patients with suspected VS is magnetic resonance imaging (MRI), owing to its high sensitivity, specificity, and cost-effectiveness [,].
The use of AI tools for speech recognition, categorization, and extraction of information from radiology reports has been described in the literature [,,,,]. This is especially true for AI tools based on large language models (LLMs), e.g., Generative Pre-trained Transformer version 4 (GPT-4), Google’s Gemini, and Microsoft’s GPT-based Bing [,,,,]. In this work, we compared the accuracy of three publicly available LLMs (GPT-4, Gemini, and Bing) in interpreting MRI reports of patients with VS written in two languages (German and English). In addition, we evaluated treatment recommendations based only on these MRI findings. To the best of our knowledge, this is the first study to evaluate the role of LLM-based AI tools in both interpreting MRI reports and suggesting treatments for neurosurgical pathologies.

2. Materials and Methods

2.1. Patients and MRI Report Data Preparation

Thirty-five consecutive patients with vestibular schwannoma who were followed in the skull base clinic of the Department of Neurosurgery at the University Hospital of Freiburg in 2023 were included in this study. The exclusion criteria comprised missing written MRI reports, incomplete radiological documentation, or poor-quality imaging data. Tumor size or stage was not used as an inclusion criterion, resulting in a heterogeneous cohort encompassing both intracanalicular and large cerebellopontine-angle lesions.
All MRI reports were originally written in German and subsequently translated into English through a standardized human-assisted process (Figure 1). Each patient’s report was therefore available in two linguistically parallel forms (35 German and 35 English), allowing for paired analysis across languages. MRI report data was anonymized and converted from Portable Document Format (PDF) to text format. Since the primary outcome of this research is the accuracy of AI tools in diagnosing VS based on MRI reports, the terms ‘acoustic neuroma,’ ‘vestibular schwannoma,’ and any other synonyms directly referring to the etiological diagnosis were replaced by the word ‘lesion’ and its German synonyms in the findings section of the MRI report. Additionally, the conclusion section was excluded from the prompt for the same reason. During this process, no patient identifiers (such as names or addresses) were transferred or manually copied. Two board-certified neurosurgeons independently verified the resulting text files to ensure complete removal of personal data and preservation of the original report content and structure.
Figure 1. Study workflow from case selection to model evaluation.
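As an illustration only, the blinding step described above can be sketched in a few lines of Python. The folder layout, file handling, and the synonym list are assumptions introduced here for demonstration and do not reproduce the exact study script; the principle is simply a case-insensitive replacement of etiological terms with the neutral word “lesion” in the findings section.

```python
import re
from pathlib import Path

# Illustrative synonym list; the exact German and English terms replaced in the
# study may differ from those shown here.
DIAGNOSIS_TERMS = [
    "vestibular schwannoma", "acoustic neuroma",
    "vestibularisschwannom", "akustikusneurinom",
]
PATTERN = re.compile("|".join(map(re.escape, DIAGNOSIS_TERMS)), flags=re.IGNORECASE)

def blind_report(findings_text: str) -> str:
    """Replace etiological terms with 'lesion' in the findings section.
    The conclusion section is assumed to have been removed beforehand."""
    return PATTERN.sub("lesion", findings_text).strip()

if __name__ == "__main__":
    out_dir = Path("reports_blinded")                  # hypothetical output folder
    out_dir.mkdir(exist_ok=True)
    for path in Path("reports_txt").glob("*.txt"):     # hypothetical folder of converted PDFs
        blinded = blind_report(path.read_text(encoding="utf-8"))
        (out_dir / path.name).write_text(blinded, encoding="utf-8")
```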

2.2. Verbal Prompts

We used five prompts in English to elicit responses related to diagnosis and treatment, as shown in Table 1.
Table 1. Verbal Prompts.
The prompts were used twice for each patient to investigate response changes using the same MRI reports written in two different languages (English and German). We decided to also evaluate the accuracy of the model for interpreting MRI reports in German, as this was the original language used in the reports. See Figure 2. Each case was processed once per language and per model in an independent chat session to prevent memory carry-over effects. To assess output stability, five representative cases were repeated three times for each model under identical conditions. Reproducibility was quantified as the proportion of identical outputs across repetitions.
Figure 2. Example of the same prompt with MRI reports written in English and German (original language of the MRI report).
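The overall evaluation procedure (one independent session per case, language, and model, with repeated runs for selected cases) can be summarized in pseudocode-like Python. This is a minimal sketch: in the study, queries were entered manually through the public web interfaces, so the model call below is only a placeholder, and the reproducibility function reflects one possible operationalization of “proportion of identical outputs across repetitions.”

```python
from itertools import product

MODELS = ["GPT-4", "Gemini", "Bing"]
LANGUAGES = ["de", "en"]
PROMPT_1 = "What is the most probable etiological diagnosis based on the following MRI report?"

def query_model(model: str, prompt: str) -> str:
    """Placeholder: each query was entered in a fresh chat session of the
    respective web interface, so no API call is implemented here."""
    raise NotImplementedError

def run_evaluation(cases: dict) -> dict:
    """cases maps case_id -> {'de': blinded report text, 'en': blinded report text}."""
    outputs = {}
    for case_id, model, lang in product(cases, MODELS, LANGUAGES):
        prompt = f"{PROMPT_1}\n\n{cases[case_id][lang]}"
        outputs[(case_id, model, lang)] = query_model(model, prompt)
    return outputs

def reproducibility(repeated_runs: list) -> float:
    """Fraction of repeated runs whose output matches the first run of the same
    case and model (44/45 identical runs were observed in the study)."""
    identical = sum(out == runs[0] for runs in repeated_runs for out in runs)
    total = sum(len(runs) for runs in repeated_runs)
    return identical / total
```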
Prompts 1–4 were zero-shot queries, meaning the models received no prior contextual information other than the MRI report itself. Prompt 5 constituted a prompt-chaining query because it explicitly incorporated the diagnosis revealed in the previous response (‘considering that this is a vestibular schwannoma …’) to elicit a more refined, case-specific treatment recommendation. This design tested the models’ ability to integrate newly provided clinical facts into subsequent reasoning.
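To make the distinction concrete, the prompt-chaining step can be expressed as a small helper function. The wording beyond the quoted fragment (‘considering that this is a vestibular schwannoma …’) is paraphrased and therefore an assumption; the exact prompt texts are listed in Table 1.

```python
# Prompts 1-4 (zero-shot): only the blinded MRI report is supplied.
# Prompt 5 (prompt chaining): the previously revealed diagnosis is injected
# back into the query to elicit a case-specific treatment recommendation.
def chained_treatment_prompt(report: str) -> str:
    return (
        "Considering that this is a vestibular schwannoma, "
        "what treatment would you recommend for this particular patient?\n\n" + report
    )
```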

2.3. Outcomes and Statistical Analysis

The primary outcome was the diagnostic accuracy of each large language model (GPT-4, Gemini, and Bing) in identifying the correct etiological diagnosis of vestibular schwannoma from MRI report text.
Secondary outcomes included:
- The accuracy of AI-generated treatment recommendations relative to the interdisciplinary skull-base tumor board decision;
- The ability of each model to generate specific versus general treatment suggestions (e.g., “microsurgical resection” vs. “treatment required”);
- The proportion of cases in which each model proposed plausible differential diagnoses for cerebellopontine angle lesions;
- The presence of language-related performance differences between German and English report versions.
Analyses were performed using GPT-4 (OpenAI, May 2024 release, accessed via ChatGPT interface in January 2025), Gemini (Google, version 1.5 Pro release June 2024, accessed via Gemini web interface in January 2025) and Bing Copilot (Microsoft GPT-4-based model, accessed January 2025 via web interface).
All secondary analyses were exploratory and not adjusted for multiplicity. Accuracy values are reported with 95% confidence intervals (Wilson’s method). Pairwise comparisons of model performance and language-related accuracy differences were assessed using McNemar’s test on case-level paired data. A p-value < 0.05 was considered significant. Data are presented as mean (standard deviation), median (interquartile range), or number of patients (percentage). Model outputs explicitly declining to provide an etiological diagnosis (‘I cannot make a clinical judgment’) were coded as non-answers and treated as incorrect in the primary analysis. A sensitivity analysis excluding these cases was also performed. All analyses were performed using IBM SPSS Statistics version 27.0 (IBM Corp., Armonk, NY, USA) and R version 4.5.0 (R Core Team, 2025).
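For readers who wish to reproduce the interval estimates, a minimal sketch using statsmodels is shown below (the study itself used SPSS and R; the Python call is an equivalent illustration). The example uses 20 of 35 correct treatment recommendations and reproduces the Wilson interval of 40.9–72.0% reported in the Results.

```python
from statsmodels.stats.proportion import proportion_confint

# Wilson 95% CI for a binomial accuracy, e.g. 20/35 correct treatment recommendations.
low, high = proportion_confint(count=20, nobs=35, alpha=0.05, method="wilson")
print(f"20/35 = {20 / 35:.1%}, 95% CI {low:.1%} to {high:.1%}")  # ~40.9% to 72.0%
```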

2.4. Ground Truth

The ground truth for diagnostic accuracy was defined through a two-step expert review. The initial MRI diagnosis was provided by a board-certified neuroradiologist. All MRI scans were independently reassessed by a board-certified neurosurgeon experienced in skull-base imaging, blinded to the neuroradiologist’s interpretation. In the single case of diagnostic disagreement, a third neurosurgeon adjudicated the final label. Because this process represented a hierarchical review rather than parallel ratings, inter-rater agreement was expressed as raw percentage agreement rather than Cohen’s kappa.
The ground truth for overall treatment recommendations was defined as the consensus decision of the interdisciplinary skull-base tumor board at the University of Freiburg, which serves as the institutional gold standard for case management decisions.
For the classification of treatment recommendations as specific versus general, two board-certified neurosurgeons independently assessed each case, with a third reviewer resolving any disagreement. Inter-rater reliability for this categorical classification was assessed using Cohen’s kappa coefficient.
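As a brief illustration of the agreement statistic, Cohen’s kappa can be computed from two raters’ labels with scikit-learn. The labels below are hypothetical (the per-case ratings are not published) and serve only to show the calculation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical rater labels for the specific (1) vs. general (0) classification.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")
```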

2.5. Ethical Issues

This research was conducted in compliance with The Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving human subjects. The local medical ethics committee approved this study (23-1393-S1-retro). Written informed consent was not required due to the retrospective study design.

3. Results

All 35 screened patients with vestibular schwannoma had written MRI reports and were included in the analysis. No cases were excluded.

3.1. Diagnostic Accuracy for Acoustic Neuroma

Regarding the outputs to prompt 1 (What is the most probable etiological diagnosis based on the following MRI report?) and considering the diagnosis of VS as the ground truth, we obtained the following results shown in Table 2.
Table 2. AI Tools and most probable diagnosis based on MRI report.
GPT-4 correctly identified vestibular schwannoma in 34/35 cases (97.1%), Gemini in 31/35 cases (88.6%), and Bing in 30/35 cases (85.7%) (Figure 3 and Figure 4).
Figure 3. Model-level confusion matrices for diagnostic classification.
Figure 4. Case-level comparison of diagnostic correctness across models.
Pairwise McNemar comparisons showed no statistically significant differences:
- GPT-4 vs. Gemini: 4 vs. 1 discordant pairs; odds ratio 4.00 (95% CI 0.45–35.79); p = 0.375;
- GPT-4 vs. Bing: 5 vs. 1 discordant pairs; odds ratio 5.00 (95% CI 0.58–42.80); p = 0.219;
- Gemini vs. Bing: 5 vs. 4 discordant pairs; odds ratio 1.25 (95% CI 0.34–4.66); p = 1.000.
The corresponding absolute differences in accuracy (unpaired) were 8.6% for GPT-4 vs. Gemini and 11.4% for GPT-4 vs. Bing.
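The exact McNemar p-values above can be reproduced directly from the reported marginal accuracies and discordant-pair counts: given 34/35, 31/35, and 30/35 correct diagnoses and the discordant pairs listed, the remaining cells of each paired 2×2 table follow arithmetically (no case was misclassified by both models in any pair). A minimal sketch with statsmodels:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired 2x2 tables reconstructed from the reported accuracies and discordant pairs
# (rows: first model correct/incorrect; columns: second model correct/incorrect).
pairs = {
    "GPT-4 vs Gemini": [[30, 4], [1, 0]],
    "GPT-4 vs Bing":   [[29, 5], [1, 0]],
    "Gemini vs Bing":  [[26, 5], [4, 0]],
}
for name, table in pairs.items():
    result = mcnemar(table, exact=True)
    print(f"{name}: exact McNemar p = {result.pvalue:.3f}")
# Expected output: 0.375, 0.219, and 1.000, matching the values above.
```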
When cerebellopontine angle meningioma, the main differential diagnosis of vestibular schwannoma, was additionally accepted as a plausible first diagnosis, accuracy increased to 100% for GPT-4, 97.1% for Gemini, and 88.6% for Bing, without altering the relative ranking of the models.
To assess the reproducibility of diagnostic outputs across models, five representative cases were re-evaluated in multiple runs. Across 45 repeated runs (5 cases × 3 repetitions × 3 models), 44 yielded identical diagnostic outputs (97.8%; 95% CI 88.5–99.9%), corresponding to a variability of about 2% and indicating high reproducibility across models.
GPT-4 identified meningioma as the most probable diagnosis in 1 case, Gemini in 3 cases, and Bing in 1 case. Additionally, Bing provided implausible diagnoses in 3 cases (intracerebral hemorrhage, intradural extramedullary spinal tumor, and intramedullary spinal tumor). GPT-4 and Gemini did not provide any implausible diagnoses. Agreement between the neuroradiologist and neurosurgeon who established the diagnostic ground truth was 34/35 (97.1%; 95% CI 85.5–99.5%), with one case requiring adjudication by a third neurosurgeon.

3.2. Differential Diagnoses

With respect to the ability of the AI tools to suggest plausible differential diagnoses for the lesion described in the MRI reports, we obtained the following results (Table 3):
Table 3. Second most probable differential diagnosis suggested by AI tools based on MRI reports.
Meningioma was the second most probable diagnosis for all three AI models. GPT-4 and Bing identified meningioma in 28 cases, while Gemini did so in 30 cases. Bing had the highest rate of implausible diagnoses (3 cases), while GPT-4 and Gemini did not provide implausible diagnoses.
Epidermoid cyst was the third most probable differential diagnosis for all three AI models, followed by cholesteatoma and facial nerve schwannoma (Table 4). We considered “infection” an implausible diagnosis because it is a nonspecific term that does not directly relate to the lesions described in the MRI reports. In contrast, an abscess would be considered a plausible differential diagnosis because of its mass effect, which can present similarly to other lesions on MRI.
Table 4. Third most probable differential diagnosis suggested by AI tools based on MRI reports.

3.3. Treatment Recommendations

Prompt 3 asked the AI tools for treatment recommendations. We classified the outputs into two categories: general recommendations and specific treatment recommendations. General recommendations were based on the presumed diagnosis and provided general treatment guidelines without considering the particular patient whose MRI report was given in the prompt. Specific treatment recommendations included precise suggestions, such as surgery, wait and scan, or radiotherapy, tailored to the particular patient. See Table 5.
Table 5. Treatment recommendations of AI tools based on MRI reports.
Bing was the AI tool with the highest rate of general treatment recommendations (23 cases), followed by Gemini (18 cases). GPT-4 provided general treatment recommendations in only 8 cases. GPT-4 indicated surgery in 26 cases, Gemini in 13 cases and Bing in 11 cases. Inter-rater agreement for the general-vs-specific classification was high across all model outputs: κ = 0.92 (95% CI 0.77–1.00) for GPT-4, κ = 0.94 (95% CI 0.83–1.00) for Gemini, and κ = 0.94 (95% CI 0.82–1.00) for Bing. Only three model-case combinations (one per model; 3/105, 2.9%) required adjudication by the third reviewer.
After revealing the etiological diagnosis (acoustic neuroma) in prompt 5, there was a slight trend towards recommending specific treatment modalities by GPT-4 and Gemini, while Bing increased the rate of general treatment recommendations. See Table 6.
Table 6. Treatment recommendations of AI tools based on MRI reports after revealing the etiological diagnosis (AN).
Considering the skull-base tumor board recommendations as the ground truth and counting non-responses as incorrect, GPT-4 achieved 20/35 correct treatment recommendations (57.1%, 95% CI 40.9–72.0), Gemini 16/35 (45.7%, 95% CI 30.5–61.8), and Bing 11/35 (31.4%, 95% CI 18.6–48.0). Pairwise McNemar tests showed no significant difference between GPT-4 and Gemini (6 vs. 2 discordant cases; p = 0.289; OR 3.0, 95% CI 0.61–14.9) and between Gemini and Bing (7 vs. 2; p = 0.180; OR 3.5, 95% CI 0.73–16.9). GPT-4, however, outperformed Bing (11 vs. 2 discordant cases; p = 0.022; OR 5.5, 95% CI 1.22–24.8). See Figure 5 and Figure 6.
Figure 5. Waterfall plot showing per-case correctness of treatment recommendations.
Figure 6. Model-level confusion matrices for treatment recommendations.
Excluding the single non-response did not materially affect results: GPT-4 (58.8%, 95% CI 42.3–73.6), Gemini (47.1%, 95% CI 31.9–62.8), and Bing (32.4%, 95% CI 19.1–49.2) maintained the same relative performance ranking. Pairwise McNemar comparisons remained unchanged, confirming that non-responses had no impact on the direction or significance of results.
The rates of unspecific treatment recommendations, i.e., general statements about radiotherapy, surgery, or wait and scan based on treatment guidelines without considering the individual patient, were also evaluated. GPT-4 recommended specific treatment measures based on the MRI report in 77.1% of cases, Gemini in 48.6% of cases, and Bing in only 34.3% of cases. The difference in rates of specific treatment recommendations between GPT-4 and the other AI tools was statistically significant, while the difference between Gemini and Bing was not (Figure 7).
Figure 7. Rates of specific and unspecific treatment recommendations of AI Tools.

3.4. Surgical Indication

Prompt 4 inquired about surgical indication based solely on MRI reports. GPT-4 indicated surgery in 26 cases and did not indicate it in 9 cases. Gemini indicated surgery in 13 cases and did not answer the question in 6 cases, citing insufficient information to make a decision. Bing indicated surgery in 17 cases and did not answer the question in 15 cases.
Considering the treatment recommendations of our interdisciplinary skull base tumor board as the ground truth, GPT-4 presented the highest rate of accurate surgical indication based solely on MRI reports (60%). Gemini indicated surgery properly in 45.7% of cases, while Bing did so in 51.4% of cases. The difference between AI tools was not statistically significant.
Regarding the rates of no response to Prompt 4 (“Would you recommend surgical treatment in this particular case?”), GPT-4 provided responses regarding surgical indication for all patients, whereas Gemini did not answer this question in 17.1% of cases, and Bing failed to respond in 42.9% of cases. The difference in rates of no response regarding surgical treatment between GPT-4 and Bing was statistically significant (p-value < 0.001).

3.5. Language Issues

The use of MRI reports in the original language (German) did not significantly affect the diagnostic accuracy of any of the AI tools. For GPT-4, accuracy was identical in German and English (34/35, 97.1% for both; McNemar p = 1.000). Gemini showed a non-significant improvement when prompted in English (from 31/35, 88.6% to 33/35, 94.3%; McNemar p = 0.625). Bing also showed a non-significant trend toward higher accuracy in English (from 30/35, 85.7% to 33/35, 94.3%; McNemar p = 0.25).

4. Discussion

This study addresses an increasingly relevant issue in modern clinical practice: the growing use of LLMs by patients seeking medical advice and bringing these AI-generated recommendations into consultations. Understanding the reliability and limitations of these tools is crucial for physicians, as their outputs must always be interpreted in the context of individualized clinical decision-making.
The results of this study demonstrate that GPT-4 outperforms Gemini and Bing in interpreting MRI reports of patients with VS and providing specific treatment recommendations based solely on these reports. Previous studies have investigated the accuracy of AI models in diagnosing medical conditions [,,,]. Kanjee et al. assessed the ability of GPT-4 to diagnose 70 challenging medical cases from the New England Journal of Medicine clinicopathologic conferences and reported a correct diagnosis in 39% of cases (27/70). In addition, in 64% of cases (45/70), the model included the final diagnosis in its differential list []. Another study compared the accuracy of GPT-4 and GPT-4 with vision (GPT-4V) to that of board-certified radiologists in diagnosing 32 consecutive “Freiburg Neuropathology Case Conference” cases from the journal Clinical Neuroradiology. The study reported that GPT-4 had a lower diagnostic accuracy than each radiologist; however, the difference was not statistically significant []. Our study demonstrated that GPT-4 achieved a diagnostic accuracy of 97.14% for VS, higher than Gemini (88.57%) and Bing (85.71%). This higher performance of GPT-4 compared to other publicly available AI models in diagnosing medical conditions was echoed in other studies [,,]. However, it is important to emphasize that all models performed well and that the differences in accuracy between them were not statistically significant. The present results should be interpreted in the context of the model versions evaluated. Subsequent releases, including GPT-5 and Gemini 2.5, are likely to show different performance characteristics as training data and alignment methods continue to improve.
GPT-4’s higher accuracy likely reflects a larger multimodal training corpus and more refined alignment procedures optimized for complex reasoning tasks. Gemini, although trained on comparable text data, appears less domain-specific in medical language. Bing, which relies on a search-augmented GPT architecture, may incorporate non-curated web content during response generation, occasionally leading to implausible outputs. These architectural and alignment differences may partly explain the observed performance gradient.
Differentiating vestibular schwannoma from cerebellopontine angle meningioma is challenging because both share similar MRI signal patterns and contrast enhancement. The models recognized textual cues such as ‘widening of the internal auditory canal’ or ‘dural tail,’ reflecting radiological reasoning. Their occasional identification of meningioma as the main etiological diagnosis demonstrated realistic diagnostic behavior, as this distinction can be difficult even for expert neuroradiologists.
No patients were excluded from the analysis, as all screened cases had written MRI reports available. The relatively small sample size reflects the exploratory design of this proof-of-concept study, which aimed primarily to assess whether publicly available large language models could accurately interpret standardized MRI reports and to compare their relative diagnostic performance. While all evaluated LLMs demonstrated high diagnostic accuracy, their susceptibility to data-source bias and misinterpretation remains an ethical concern. Since LLMs generate probabilistic text without transparent reasoning or certified medical validation, their use in clinical workflows must always involve human oversight. Transparent auditing of model outputs and continuous ethical monitoring are essential to prevent potential harm from over-reliance on AI-generated medical advice.
In the language comparison analysis, diagnostic accuracy across all three models remained largely consistent between German and English prompts. These results indicate that the prompt language did not materially affect diagnostic performance, although English prompts occasionally rescued misclassifications observed with German input. This finding is significant as it underscores the robustness of these AI models in processing medical information across different languages, thus broadening their applicability in diverse clinical settings. This result was also confirmed by a study that demonstrated the efficiency of AI tools in assigning BI-RADS categories across reports written in three different languages [].
Mohammad-Rahimi et al. [] assessed the responses of six different artificial intelligence (AI) chatbots (Bing, GPT-3.5, GPT-4, Google Bard, Claude, Sage) to controversial and difficult questions in oral pathology, oral medicine, and oral radiology. They found that GPT-4 provided the highest-quality information on controversial topics across various dental disciplines, outperforming the other chatbots []. In our study, implausible diagnoses such as intramedullary spinal tumor or cerebral microangiopathy in cerebellopontine-angle reports were classified as hallucinations. These errors likely reflect limited domain representation and overgeneralization from unrelated medical contexts.
One of the main practical applications of this study is the simulation of the responses that patients would obtain if they uploaded their MRI report. The use of the internet as a source of information for medical conditions has become a common practice among patients, and in the era of large language models, this usage is expected to increase significantly, as well as patients’ trust in the information that has been extracted. In this study, we used prompts written in scientific language to extract the most accurate responses and evaluate the LLMs at their full potential. For healthcare providers, it will become increasingly important to understand the outputs of such LLMs for various medical conditions, as patients are likely to use these tools more frequently in the future. This familiarity is essential for clinicians to navigate consultations effectively, recognizing the (realistic or unrealistic) expectations and preconceptions that patients may bring to the encounter, potentially influenced by AI-generated advice. Future studies reporting the frequency of medical queries directed at LLMs could provide further insights into this growing trend.
Regarding treatment recommendations, GPT-4 provided the most accurate recommendations at 57.1%, followed by Gemini at 45.7% and Bing at 31.4%. As mentioned in the methods section, we considered the recommendations of our interdisciplinary skull base tumor board, composed of several board-certified physicians, as the ground truth. This approach is supported by the fact that the management of vestibular schwannomas follows a relatively well-defined decision-making pathway, with three main treatment options (surgery, radiotherapy, wait and scan). Key factors influencing these decisions include compression of the brain stem, tumor size, patient age, and specific symptoms, ensuring a structured and consistent framework for evaluating cases. The difference in accuracy rates of treatment recommendation between GPT-4 and Bing was statistically significant (p = 0.02).
It is important to mention that even though Bing is based on GPT models, the two can exhibit significant performance differences due to various factors. GPT-4 is specifically designed and fine-tuned for a wide range of applications, while Bing is tailored more towards general search and information retrieval tasks. This specialization in training can lead to substantial differences in performance, particularly in complex and nuanced fields such as medical diagnostics. Moreover, GPT-4 was able to suggest specific treatment measures tailored to individual patients in 77.1% of cases, compared to Gemini (48.6%) and Bing (34.3%). This rate of specific treatment recommendations differed significantly from that of Gemini and Bing and aligns with findings from previous research []. This may reflect differences in training scale, reinforcement alignment, or contextual reasoning capacity. Although these factors could contribute to a more coherent synthesis of clinical cues, such as recognizing terms like brainstem compression or a large cerebellopontine-angle component as indicators of therapeutic urgency, these mechanisms cannot be empirically verified. From a clinical standpoint, this pattern suggests that GPT-4 may approximate multidisciplinary reasoning more closely than the other models and could assist in pre-board case summarization or guideline cross-checking. Nevertheless, its outputs remain non-deterministic and require expert validation before any clinical use.
With respect to surgical treatment recommendations based on MRI reports, the examined AI tools performed similarly, with GPT-4 showing an accuracy rate of 60%, compared to 51.4% for Bing and 45.7% for Gemini. Regarding the recommendation of surgical treatment, Bing and Gemini failed to provide responses in some cases (Bing in 42.9% of cases and Gemini in 17.1% of cases). The typical response was: “I am not capable of suggesting medical treatment.” This behavior appeared to occur inconsistently, as the same prompt sometimes elicited this disclaimer, while at other times a treatment recommendation was provided. A possible hypothesis for this phenomenon is that the variability in responses could stem from limitations in their training data. When faced with complex or ambiguous scenarios that exceed their confidence level, the models might default to a safety mechanism by refusing to provide a recommendation to avoid generating potentially harmful or inaccurate advice.
This study exclusively assessed three publicly accessible closed-source LLMs to mirror real-world patient usage. Future investigations should include open-source and medically fine-tuned models such as LLaMa, BioGPT, and MedPaLM to compare domain-specific performance, interpretability, and auditability [,,]. Such comprehensive benchmarking would elucidate the relative advantages and safety profiles of open versus commercial LLM architectures in neurosurgical contexts.
The most immediate clinical applications of LLM-based tools are expected in structured report generation, automated translation of radiological narratives, and standardized extraction of tumor characteristics from free-text MRI reports. This may streamline communication between neuroradiologists and neurosurgeons and accelerate multidisciplinary decision-making. However, these systems also shift the clinician’s workload toward higher-level oversight, validation, and error detection. Rather than replacing expert input, they redefine it by emphasizing supervision, interpretive judgment, and the ethical responsibility to ensure accuracy and patient safety. Moreover, future research should extend the evaluation beyond free-format zero-shot prompts to include structured and role-specific prompting strategies. Examples include prompts enriched with relevant background information (‘Act as a neurosurgeon interpreting this MRI report’) or chain-of-thought scaffolding. Such designs would simulate realistic communication scenarios and enable a more comprehensive assessment of LLM usability in clinical practice.
One limitation of this study is considering the decision of a single interdisciplinary tumor board as the ground truth, given that the evaluation of the same patient may lead to different recommendations across different tumor boards. However, we believe that for scientific purposes, using a tumor board from a university hospital composed of multiple board-certified physicians as the ground truth is appropriate. Another limitation is that treatment recommendations were provided based solely on MRI reports, whereas tumor board decisions also take into account other significant factors such as age, the Karnofsky Performance Index, patient symptoms, and neurological deficits. However, this lack of additional information also serves the purpose of simulating patient inquiries when they have only their MRI report and may not consider the aforementioned factors relevant for diagnostic and treatment purposes. Future research should focus on incorporating additional clinical variables to further enhance the performance and applicability of AI in medical diagnostics and treatment planning.
In addition, this study was based on a consecutive single-center cohort of 35 patients, which ensures internal consistency but restricts external generalizability. Patient demographics, report style, and institutional wording may differ elsewhere. Multi-center datasets and larger sample sizes are required to confirm the reproducibility of accuracy metrics across healthcare systems.
LLMs lack intrinsic mechanisms to assess the scientific validity or hierarchy of evidence within their training data. Consequently, they may blend peer-reviewed medical knowledge with unverified online content, producing statements that appear coherent but may be clinically inaccurate. When such generalized information is applied to individual cases, it can result in misleading reassurance or unwarranted concern. This risk is particularly relevant for patients who use publicly available models to interpret their own imaging reports or symptoms. Without professional guidance, they may over- or misinterpret AI-generated advice, underscoring the need for transparent disclaimers and continuous public education about the limitations of these tools.

5. Conclusions

This study provides evidence that publicly available AI tools (GPT-4, Gemini, and Bing) perform well in interpreting MRI reports of patients with VS. With respect to treatment recommendations, GPT-4 showed the greatest accuracy and specificity, with the difference versus Bing reaching statistical significance. On the other hand, the occasional generation of implausible diagnoses by Gemini and Bing suggests a need for ongoing refinement of these models to ensure their clinical reliability. This study provides evidence to support communication with patients regarding the current limitations of using LLMs in diagnosing and treating neurosurgical pathologies. Future work should focus on expanding the dataset to multi-institutional cohorts, integrating additional clinical variables such as hearing function and performance scores, and exploring multimodal LLM architectures capable of processing both text and imaging data.

Author Contributions

A.H.A.S. contributed to the conception and design of the work, data collection, and drafting of the manuscript; C.J.G. participated in data analysis and interpretation, as well as critical revision of the manuscript for important intellectual content; J.B. contributed to the clinical interpretation of the data and final manuscript approval; J.G. provided supervision, contributed to the clinical data interpretation, and critically revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This research was conducted in compliance with The Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving human subjects. The local ethics committee of the University of Freiburg (23-1393-S1-retro, approval date: 31 October 2023) approved this study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this study, the authors used ChatGPT (GPT-5; OpenAI, San Francisco, CA, USA) to assist in the generation of Figure 3, Figure 4, Figure 5 and Figure 6. All outputs were reviewed and edited by the authors, who take full responsibility for the accuracy and integrity of the content presented in this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence; MRI: Magnetic Resonance Imaging; VS: Vestibular Schwannoma; GPT-4: Generative Pre-trained Transformer 4; NLP: Natural Language Processing; LLM: Large Language Model.

References

  1. Fortnum, H.; O’Neill, C.; Taylor, R.; Lenthall, R.; Nikolopoulos, T.; Lightfoot, G.; O’Donoghue, G.; Mason, S.; Baguley, D.; Jones, H.; et al. The Role of Magnetic Resonance Imaging in the Identification of Suspected Acoustic Neuroma: A Systematic Review of Clinical and Cost Effectiveness and Natural History. Health Technol. Assess. 2009, 13, iii–iv, ix–xi, 1–154. [Google Scholar] [CrossRef] [PubMed]
  2. Cozzi, A.; Pinker, K.; Hidber, A.; Zhang, T.; Bonomo, L.; Lo Gullo, R.; Christianson, B.; Curti, M.; Rizzo, S.; Del Grande, F.; et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024, 311, e232133. [Google Scholar] [CrossRef] [PubMed]
  3. Russo, A.G.; Ciarlo, A.; Ponticorvo, S.; Di Salle, F.; Tedeschi, G.; Esposito, F. Explaining Neural Activity in Human Listeners with Deep Learning via Natural Language Processing of Narrative Text. Sci. Rep. 2022, 12, 17838. [Google Scholar] [CrossRef]
  4. Schmidt, R.A.; Seah, J.C.Y.; Cao, K.; Lim, L.; Lim, W.; Yeung, J. Generative Large Language Models for Detection of Speech Recognition Errors in Radiology Reports. Radiol. Artif. Intell. 2024, 6, e230205. [Google Scholar] [CrossRef]
  5. Le Guellec, B.; Lefèvre, A.; Geay, C.; Shorten, L.; Bruge, C.; Hacein-Bey, L.; Amouyel, P.; Pruvo, J.-P.; Kuchcinski, G.; Hamroun, A. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol. Artif. Intell. 2024, 6, e230364. [Google Scholar] [CrossRef]
  6. Truhn, D.; Weber, C.D.; Braun, B.J.; Bressem, K.; Kather, J.N.; Kuhl, C.; Nebelung, S. A Pilot Study on the Efficacy of GPT-4 in Providing Orthopedic Treatment Recommendations from MRI Reports. Sci. Rep. 2023, 13, 20159. [Google Scholar] [CrossRef]
  7. Rao, A.; Pang, M.; Kim, J.; Kamineni, M.; Lie, W.; Prasad, A.K.; Landman, A.; Dreyer, K.; Succi, M.D. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J. Med. Internet Res. 2023, 25, e48659. [Google Scholar] [CrossRef]
  8. Hasani, A.M.; Singh, S.; Zahergivar, A.; Ryan, B.; Nethala, D.; Bravomontenegro, G.; Mendhiratta, N.; Ball, M.; Farhadi, F.; Malayeri, A. Evaluating the Performance of Generative Pre-Trained Transformer-4 (GPT-4) in Standardizing Radiology Reports. Eur. Radiol. 2024, 34, 3566–3574. [Google Scholar] [CrossRef]
  9. Mallio, C.A.; Sertorio, A.C.; Bernetti, C.; Beomonte Zobel, B. Large Language Models for Structured Reporting in Radiology: Performance of GPT-4, ChatGPT-3.5, Perplexity and Bing. Radiol. Med. 2023, 128, 808–812. [Google Scholar] [CrossRef]
  10. Fink, M.A. Large language models such as ChatGPT and GPT-4 for patient-centered care in radiology. Radiologie 2023, 63, 665–671. [Google Scholar] [CrossRef]
  11. Fink, M.A.; Bischoff, A.; Fink, C.A.; Moll, M.; Kroschke, J.; Dulz, L.; Heußel, C.P.; Kauczor, H.-U.; Weber, T.F. Potential of ChatGPT and GPT-4 for Data Mining of Free-Text CT Reports on Lung Cancer. Radiology 2023, 308, e231362. [Google Scholar] [CrossRef] [PubMed]
  12. Gertz, R.J.; Dratsch, T.; Bunck, A.C.; Lennartz, S.; Iuga, A.-I.; Hellmich, M.G.; Persigehl, T.; Pennig, L.; Gietzen, C.H.; Fervers, P.; et al. Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy. Radiology 2024, 311, e232714. [Google Scholar] [CrossRef]
  13. Lecler, A.; Duron, L.; Soyer, P. Revolutionizing Radiology with GPT-Based Models: Current Applications, Future Possibilities and Limitations of ChatGPT. Diagn. Interv. Imaging 2023, 104, 269–274. [Google Scholar] [CrossRef] [PubMed]
  14. Lecler, A.; Soyer, P.; Gong, B. The Potential and Pitfalls of ChatGPT in Radiology. Diagn. Interv. Imaging 2024, 105, 249–250. [Google Scholar] [CrossRef] [PubMed]
  15. Lyu, Q.; Tan, J.; Zapadka, M.E.; Ponnatapura, J.; Niu, C.; Myers, K.J.; Wang, G.; Whitlow, C.T. Translating Radiology Reports into Plain Language Using ChatGPT and GPT-4 with Prompt Learning: Results, Limitations, and Potential. Vis. Comput. Ind. Biomed. Art. 2023, 6, 9. [Google Scholar] [CrossRef]
  16. Gordon, E.B.; Towbin, A.J.; Wingrove, P.; Shafique, U.; Haas, B.; Kitts, A.B.; Feldman, J.; Furlan, A. Enhancing Patient Communication With Chat-GPT in Radiology: Evaluating the Efficacy and Readability of Answers to Common Imaging-Related Questions. J. Am. Coll. Radiol. 2024, 21, 353–359. [Google Scholar] [CrossRef]
  17. Gupta, V.K.; Thakker, A.; Gupta, K.K. Vestibular Schwannoma: What We Know and Where We Are Heading. Head. Neck Pathol. 2020, 14, 1058–1066. [Google Scholar] [CrossRef]
  18. Carlson, M.L.; Link, M.J. Vestibular Schwannomas. N. Engl. J. Med. 2021, 384, 1335–1348. [Google Scholar] [CrossRef]
  19. Gianoli, G.J.; Soileau, J.S. Acoustic Neuroma Neurophysiologic Correlates: Vestibular-Preoperative, Intraoperative, and Postoperative. Otolaryngol. Clin. N. Am. 2012, 45, 307–314. [Google Scholar] [CrossRef]
  20. Murphy, M.R.; Selesnick, S.H. Cost-Effective Diagnosis of Acoustic Neuromas: A Philosophical, Macroeconomic, and Technological Decision. Otolaryngol. Head. Neck Surg. 2002, 127, 253–259. [Google Scholar] [CrossRef]
  21. Hirosawa, T.; Harada, Y.; Yokose, M.; Sakamoto, T.; Kawamura, R.; Shimizu, T. Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study. Int. J. Env. Res. Public. Health 2023, 20, 3378. [Google Scholar] [CrossRef]
  22. Galetta, K.; Meltzer, E. Does GPT-4 Have Neurophobia? Localization and Diagnostic Accuracy of an Artificial Intelligence-Powered Chatbot in Clinical Vignettes. J. Neurol. Sci. 2023, 453, 120804. [Google Scholar] [CrossRef]
  23. Zhang, X.; Zhong, Y.; Jin, C.; Hu, D.; Tian, M.; Zhang, H. Medical Image Generative Pre-Trained Transformer (MI-GPT): Future Direction for Precision Medicine. Eur. J. Nucl. Med. Mol. Imaging 2024, 51, 332–335. [Google Scholar] [CrossRef]
  24. Kanjee, Z.; Crowe, B.; Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 2023, 330, 78–80. [Google Scholar] [CrossRef]
  25. Horiuchi, D.; Tatekawa, H.; Oura, T.; Oue, S.; Walston, S.L.; Takita, H.; Matsushita, S.; Mitsuyama, Y.; Shimono, T.; Miki, Y.; et al. Comparing the Diagnostic Performance of GPT-4-Based ChatGPT, GPT-4V-Based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin. Neuroradiol. 2024, 34, 779–787. [Google Scholar] [CrossRef]
  26. Mohammad-Rahimi, H.; Khoury, Z.H.; Alamdari, M.I.; Rokhshad, R.; Motie, P.; Parsa, A.; Tavares, T.; Sciubba, J.J.; Price, J.B.; Sultan, A.S. Performance of AI Chatbots on Controversial Topics in Oral Medicine, Pathology, and Radiology. Oral. Surg. Oral. Med. Oral. Pathol. Oral. Radiol. 2024, 137, 508–514. [Google Scholar] [CrossRef] [PubMed]
  27. Hirosawa, T.; Harada, Y.; Mizuta, K.; Sakamoto, T.; Tokumasu, K.; Shimizu, T. Evaluating ChatGPT-4’s Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases. JMIR Form. Res. 2024, 8, e59267. [Google Scholar] [CrossRef]
  28. Ren, L.; Belkadi, S.; Han, L.; Del-Pinto, W.; Nenadic, G. Synthetic4Health: Generating Annotated Synthetic Clinical Letters. Front. Digit. Health 2025, 7, 1497130. [Google Scholar] [CrossRef] [PubMed]
  29. Ah-Thiane, L.; Heudel, P.E.; Campone, M.; Robert, M.; Brillaud-Meflah, V.; Rousseau, C.; Le Blanc-Onfroy, M.; Tomaszewski, F.; Supiot, S.; Perennec, T.; et al. Large Language Models as Decision-Making Tools in Oncology: Comparing Artificial Intelligence Suggestions and Expert Recommendations. JCO Clin. Cancer Inform. 2025, 9, e2400230. [Google Scholar] [CrossRef] [PubMed]
  30. Merlino, D.J.; Brufau, S.R.; Saieed, G.; Van Abel, K.M.; Price, D.L.; Archibald, D.J.; Ator, G.A.; Carlson, M.L. Comparative Assessment of Otolaryngology Knowledge Among Large Language Models. Laryngoscope 2025, 135, 629–634. [Google Scholar] [CrossRef]
