1. Introduction
In recent years, large language models (LLMs) such as ChatGPT, Google Gemini, and Microsoft Copilot have emerged as novel sources of health information, gradually supplementing—and in some cases replacing—traditional online search engines such as Google. These tools promise conversational, easily digestible outputs and are increasingly adopted by the general public for medical questions. Recent studies indicate that 14% of laypeople now use LLMs as their first source for health information [1]. However, adoption patterns vary: healthcare professionals, older adults, and women tend to be more hesitant to adopt LLMs for medical use [2].
LLMs have shown promising results across a broad range of healthcare applications, including generating patient-friendly discharge summaries [3], providing general health advice [4,5], and supporting patient education [6]. Their development builds on broader advances in artificial intelligence that have also impacted biomedical research, such as molecular modeling [7]. Beyond these applications, LLMs have demonstrated potential in analyzing mental health dialogs [8], simplifying complex content such as electronic health records [9,10], and streamlining clinical communication and operational efficiency [11]. Their rapid advancement—driven by transformer-based architectures and large-scale training—has made them widely accessible and highly persuasive, further increasing their influence in patient-facing settings.
Despite this potential, concerns persist regarding the reliability and safety of LLMs in medical contexts. Evaluations consistently highlight risks related to hallucinated facts, oversimplification, and factual inaccuracies—issues that could lead to harm when these tools are used without supervision [12,13]. Most studies to date have focused on technical benchmarks such as factual accuracy or performance on structured question-answering tasks, while real-world deployment challenges—including equity, emotional nuance, and guideline adherence—remain underexplored [14]. Ensuring transparent, ethical, and guideline-conforming information is an ongoing challenge for the safe integration of LLMs into clinical practice [15,16].
One clinically relevant scenario where these capabilities and limitations intersect is the management of incidentally discovered kidney tumors. Such findings have become more common with the widespread use of cross-sectional imaging for unrelated complaints, such as back pain or abdominal discomfort [17]. Although many of these lesions are asymptomatic and indolent, they often provoke significant patient anxiety, especially when disclosed without sufficient context. Up to 33% of incidentally detected renal tumors may be benign, with rates reaching 20% for tumors ≤ 4 cm in size [18,19,20,21]. Despite advances in imaging, CT-based accuracy remains limited in differentiating benign from malignant tumors; it is sometimes as low as 17% [19]. This diagnostic uncertainty often leads patients to seek additional information online [22,23]. When treatment is required, options include minimally invasive interventions such as robot-assisted partial nephrectomy, often supported by 3D surgical planning [24], alongside alternative approaches such as surveillance or percutaneous ablation.
Patients increasingly turn to LLMs to interpret such findings and explore treatment options. Yet it remains unknown how well current models handle emotionally sensitive, high-stakes medical scenarios, particularly those involving possible malignancies. To date, no systematic evaluation has examined how LLMs respond to patient queries in the context of an incidental kidney tumor diagnosis.
This study addresses that gap by evaluating the quality, completeness, clarity, and safety of responses generated by three major LLMs—ChatGPT-4o, Google Gemini, and Microsoft Copilot—to typical patient questions following the incidental discovery of a kidney tumor. The focus is on whether these tools provide guideline-conforming, comprehensible, and safe information or whether they risk introducing misinformation, omissions, or misleading reassurance in a high-stakes clinical context.
2. Materials and Methods
This study was designed to evaluate the quality of responses provided by three publicly accessible LLMs—ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Pro—in the context of a clinically realistic patient scenario. The use case centered on a common situation: a patient undergoing a CT scan for lower back pain is unexpectedly informed of a suspicious renal lesion. This scenario was selected for its high relevance in daily urological practice and its frequent role in prompting patients to seek additional health information online.
To simulate typical patient behavior, ten standardized questions were developed based on clinical experience, the published literature, and themes commonly raised by patients after the incidental discovery of a kidney tumor. These questions ranged from basic concerns about diagnosis (e.g., “Do I have cancer?”) to treatment-related decisions (e.g., “Can I wait and see?”). The phrasing was intentionally plain and reflective of how a layperson might formulate queries to an AI chatbot. The same ten prompts were used across all tested LLMs without additional clarification, contextualization, or prompt engineering. The full list of questions is included in Appendix A.1.
Each question was entered into each LLM on the same date in July 2025 using the default settings and interface provided by each respective platform. Only the initial response was analyzed; no follow-up prompts or clarifying dialog were used. The same user account was employed for each session to reduce variation due to personalization or system-specific session effects.
The quality of LLM responses was evaluated by five independent reviewers, all board-certified urologists from multiple institutions with 3 to 15 years of clinical experience. Reviewers assessed each response according to a structured rubric consisting of six evaluation domains: (a) accuracy, defined as factual correctness and adherence to current medical guidelines; (b) completeness, referring to the degree to which the response addressed all aspects of the patient’s query; (c) clarity, referring to the comprehensibility of the response for a general audience; (d) currency, referring to the use of up-to-date terminology and knowledge; (e) harm potential, indicating the risk that a patient might be misled or harmed by the advice given (scored inversely); and (f) hallucinations, defined as the presence of fabricated or unverifiable content. The evaluation rubric used in our study aligns with methods employed in prior LLM assessments [4].
Each domain was rated on a five-point Likert scale, with higher scores representing better performance. Reviewers also flagged responses containing factual errors, omissions, or potentially misleading content. Although reviewers were not blinded to the identity of the LLMs due to the characteristic phrasing of some platforms, they were unaware of the study hypothesis and instructed to rate responses independently. To ensure consistency and reproducibility, all evaluations were completed within a ten-day period under identical review conditions. The evaluation instrument used for expert review is provided in Appendix A.2. To complement the main evaluations, an independent scoring framework was applied by two authors, who separately assessed all LLM responses using the same six-domain rubric. While the rating criteria remained identical, these evaluators were instructed to weigh the domains differently based on their clinical judgment, placing more emphasis on factual accuracy and potential for harm. This secondary analysis served as an internal validation check to assess the robustness of the primary findings across different interpretative priorities.
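As an illustration of how such a weighted secondary scoring can be implemented, the sketch below aggregates per-domain Likert ratings into a single weighted score per model. The column names and the numeric weights are hypothetical assumptions for illustration only; the actual emphasis applied by the two reviewers followed their clinical judgment and is not reported as fixed weights.

```python
import pandas as pd

# Hypothetical domain weights emphasizing accuracy and harm potential;
# these values are illustrative, not those used by the reviewers.
weights = {
    "accuracy": 0.30, "completeness": 0.15, "clarity": 0.10,
    "currency": 0.10, "harm_potential": 0.30, "hallucinations": 0.05,
}

def weighted_model_scores(ratings: pd.DataFrame) -> pd.Series:
    """ratings: one row per (model, question, rater), with one column
    per evaluation domain holding a 1-5 Likert score."""
    # Mean score per model and domain, then a weighted average across domains.
    domain_means = ratings.groupby("model")[list(weights)].mean()
    return (domain_means * pd.Series(weights)).sum(axis=1)
```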
Statistical analysis was conducted using Python (version 3.11) with the pandas, scipy, pingouin, and scikit-posthocs libraries and IBM SPSS Statistics 29 (Armonk, NY, USA). Given the ordinal nature of Likert data and the repeated-measures design (i.e., all raters evaluated all LLMs), non-parametric tests were used.
Descriptive statistics (median, interquartile range) were calculated for each domain and model. Overall differences were assessed using the Friedman test, a non-parametric method suitable for repeated ordinal measures. When significant differences were detected, Wilcoxon signed-rank tests with Bonferroni correction were applied for pairwise comparisons. Correlations between evaluation domains were examined using Spearman’s rank correlation coefficient. A two-sided p-value < 0.05 was considered significant, with Bonferroni-adjusted thresholds of α = 0.0083 (0.05/6) for the six domain-level tests and α = 0.0167 (0.05/3) for the pairwise post-hoc comparisons.
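A minimal sketch of this testing pipeline is shown below, assuming the ratings are held in a long-format pandas DataFrame with columns for model, rater, question, and each domain; the column names and data layout are assumptions for illustration, not the study's actual analysis scripts.

```python
import pandas as pd
from scipy import stats

def compare_models(df: pd.DataFrame, domain: str) -> dict:
    """df: long-format ratings with columns 'model', 'rater', 'question',
    and one column per evaluation domain holding 1-5 Likert scores."""
    # Pivot to one column per model; each row is one (rater, question) pair.
    wide = df.pivot_table(index=["rater", "question"],
                          columns="model", values=domain)
    models = list(wide.columns)

    # Friedman test across the repeated measures for this domain.
    chi2, p = stats.friedmanchisquare(*[wide[m] for m in models])
    results = {"friedman_chi2": chi2, "friedman_p": p}

    # Pairwise Wilcoxon signed-rank tests with Bonferroni correction.
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    alpha = 0.05 / len(pairs)  # e.g., 0.0167 for three model pairs
    for a, b in pairs:
        _, p_pair = stats.wilcoxon(wide[a], wide[b])
        results[f"{a}_vs_{b}"] = (p_pair, p_pair < alpha)
    return results
```

Domain-level associations can then be checked with, for example, `stats.spearmanr(df["accuracy"], df["harm_potential"])`, mirroring the correlation analysis described above.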
To complement the qualitative evaluation, the readability of each LLM-generated response was assessed using two established linguistic metrics: the Flesch Reading Ease (FRE) score and the Flesch–Kincaid Grade Level (FKGL). These measures were computed for each individual answer using the textstat Python package (version 0.7.3), which implements standard readability formulas based on sentence length and syllable count. The FRE score ranges from 0 to 100, with higher values indicating greater readability, while the FKGL estimates the U.S. school grade level required to comprehend the text. Readability scores were calculated across all responses for each LLM and summarized using descriptive statistics.
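The readability computation can be reproduced with a few lines of textstat. The sketch below assumes the responses are available as plain-text strings grouped by model, which is an assumption about data handling rather than a description of the study's actual scripts.

```python
import textstat
import pandas as pd

def readability_summary(responses: dict[str, list[str]]) -> pd.DataFrame:
    """responses: mapping of model name -> list of answer texts."""
    rows = []
    for model, answers in responses.items():
        for text in answers:
            rows.append({
                "model": model,
                "fre": textstat.flesch_reading_ease(text),   # 0-100, higher = easier
                "fkgl": textstat.flesch_kincaid_grade(text),  # U.S. school grade level
            })
    # Per-model descriptive statistics across all answers.
    return pd.DataFrame(rows).groupby("model")[["fre", "fkgl"]].agg(["mean", "std"])
```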
3. Results
A total of 150 individual evaluations were performed across the three models (10 questions per model, each rated by five urologists). The complete dataset of all model responses and reviewer ratings is provided in the Supplementary Materials for reference. All LLMs demonstrated high overall performance, with median scores exceeding 4.0 across most domains, as shown in Table 1. Google Gemini 2.5 Pro achieved the highest mean performance (mean score: 4.71), followed by Microsoft Copilot (4.43) and ChatGPT-4o (4.06). Gemini achieved the highest median scores in all domains except hallucinations, where all three models received median scores of 5.0. These differences are illustrated in Figure 1, which compares the mean domain scores across all models using a radar chart.
The Friedman test revealed statistically significant differences between the three LLMs in five of the six evaluation domains: accuracy (χ² = 24.64, p < 0.0001), completeness (χ² = 25.82, p < 0.0001), clarity (χ² = 16.72, p = 0.0002), currency (χ² = 20.83, p < 0.0001), and risk of harm (χ² = 24.40, p < 0.0001). No significant difference was observed in the hallucination domain (χ² = 1.54, p = 0.46). Pairwise post-hoc comparisons using the Wilcoxon signed-rank test confirmed that Gemini significantly outperformed ChatGPT-4o across all significantly differing domains (p < 0.01 for all comparisons). Copilot outperformed ChatGPT-4o in the domains of harm potential and completeness but was not significantly different from Gemini in most areas.
Evaluation of the individual questions revealed that Questions 4, 8, and 9 received the lowest mean ratings across all models, suggesting these prompts were more complex or ambiguous for LLM interpretation. Nevertheless, Gemini’s scores remained consistently high even for these questions.
For example, in response to the question “Do I have cancer?”, Gemini 2.5 Pro provided a nuanced explanation of diagnostic uncertainty, noting that additional imaging or biopsy is typically required to distinguish benign from malignant lesions. ChatGPT-4o offered a more conversational summary, reassuring the user without emphasizing the importance of follow-up diagnostics. Microsoft Copilot was notable for including a clear disclaimer, explicitly stating that it could not make medical diagnoses, but it failed to convey next steps or key decision criteria. These qualitative differences illustrate how variations in tone, completeness, and risk framing can emerge even for straightforward patient questions.
A positive correlation was observed between accuracy and harm-potential scores (Spearman’s r = 0.722); because harm potential was scored inversely, this indicates that more accurate responses were generally considered safer for patients. An additional correlation was seen between completeness and clarity (r = 0.68), suggesting that models providing more comprehensive answers also tended to be easier to understand. Inter-rater reliability was assessed using Fleiss’ kappa for each quality domain across all evaluated LLM responses. Agreement was substantial for most domains, with the highest consensus in the harm potential dimension (κ = 0.64), indicating consistent risk assessment across raters.
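For reference, Fleiss’ kappa for a single domain can be computed as sketched below, assuming a responses-by-raters matrix of Likert scores (here, 30 responses rated by 5 urologists); the data layout is an assumption for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def domain_kappa(scores: np.ndarray) -> float:
    """scores: 2-D array of shape (n_responses, n_raters) holding
    1-5 Likert ratings for one evaluation domain."""
    # Convert the responses-by-raters matrix into a responses-by-categories
    # count table, then compute Fleiss' kappa on that table.
    table, _ = aggregate_raters(scores)
    return fleiss_kappa(table, method="fleiss")
```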
The secondary analysis using an independent scoring framework, conducted by two authors applying different domain weightings, corroborated the main findings, with Gemini consistently ranking highest and ChatGPT-4o lowest across the key domains of accuracy, completeness, and harm potential.
The readability analysis supported the qualitative findings. Google Gemini 2.5 Pro demonstrated the highest readability, with an FRE score of 50.58 and an FKGL of 10.59, indicating text suitable for individuals with a 10th–11th grade reading level. ChatGPT-4o followed with an FRE score of 46.48 and an FKGL of 11.94. Microsoft Copilot produced the most linguistically complex responses, with an FRE score of 39.05 and an FKGL of 13.24, reflecting content that typically requires college-level comprehension. These differences in readability may partly explain variation in clarity ratings among the three models and have implications for patient accessibility, as detailed in Table 2.
In summary, while all three models demonstrated acceptable performance for answering patient-facing questions related to incidental kidney tumors, Google Gemini 2.5 Pro consistently provided the most accurate, complete, safe, and accessible responses. Microsoft Copilot performed well in safety and empathy-related aspects but presented a higher reading burden. ChatGPT-4o, although generally accurate and readable, lagged behind in completeness and safety domains.
4. Discussion
In this study, we systematically evaluated how three general-purpose LLMs—ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Pro—respond to typical patient questions following the incidental diagnosis of a suspicious kidney tumor. Our findings reveal significant variation across models in both content quality and communication style, highlighting key strengths and persistent limitations in emotionally sensitive, high-stakes clinical scenarios.
Among the models, Google Gemini 2.5 Pro achieved the highest mean ratings across most domains, particularly in completeness, clarity, and low hallucination frequency. Microsoft Copilot was notable for its consistent inclusion of disclaimers about its limitations (e.g., “I’m not a doctor…”), reinforcing its appropriate use as an informational support tool. It also frequently demonstrated empathetic phrasing, offering emotional reassurance and practical advice such as reaching out to friends or counselors. ChatGPT-4o, while often effective in clarity and conversational structure, lagged slightly in completeness and factual accuracy and was less consistent in acknowledging its limitations.
These results align with prior studies that highlight both the promise and the limitations of LLMs across medical domains [25,26,27,28,29]. Studies in prostate cancer consistently noted that while LLMs offer generally accurate responses, they can be overly complex and difficult for patients to interpret [25,26,27]. In the broader urological domain, other research highlighted both strengths and limitations in LLM-generated communication, including concerns about omission and tone [27,28,29]. Recent studies with real-world patient or clinician comparators offer further perspective. Carl et al. [30] evaluated GPT-4-based chatbot interactions among urology outpatients and found that most patients perceived the information as useful, understandable, and complete; however, urologists were still rated as more understandable and reliable than the chatbot. In a complementary study, Eckrich et al. [31] compared LLM-generated responses to case-based urology questions with those from human consultants. While the LLMs performed relatively well in linguistic domains such as coherence and comprehensibility, their medical adequacy was significantly inferior, and misinformation hazards were identified in up to 19% of their responses. Notably, most prior evaluations focused on structured or educational scenarios. By contrast, our study centers on the ambiguous and emotionally charged context of incidental imaging findings—an area not previously assessed.
Qualitative communication features also shaped user experience. ChatGPT’s use of follow-up prompts (e.g., “Would it help to walk through your CT findings together?”) and Copilot’s suggestions regarding clinical trials or emotional support reflect an anticipatory, conversational design. These may enhance perceived empathy and engagement.
However, such features raise ethical concerns around role boundaries and trust. While they may encourage patient empowerment, they can also imply clinical authority that LLMs are not equipped to assume. Importantly, our evaluation was limited to single-turn prompts; models were not given the opportunity to clarify or revise their responses. In iterative settings—where patients provide additional context or follow-up—LLMs may demonstrate improved empathy and coherence. Yet, this flexibility may also amplify risks, including the entrenchment of hallucinations or inappropriate reassurance if earlier inaccuracies are reinforced.
Despite overall strong readability scores (as confirmed by the Flesch Reading Ease and Grade Level indices), the models occasionally produced content with medical jargon or overly technical phrasing, particularly Gemini in some of its responses. These challenges mirror findings by Demir [32] and Trapp et al. [27], who noted that content tailored for lay audiences sometimes failed to achieve true accessibility.
Beyond linguistic accessibility, broader systemic concerns arise regarding equity and fairness in LLM deployment. Recent work has shown that these technologies may widen global resource disparities, particularly where access to computing infrastructure and linguistically diverse training data is limited [33]. This imbalance is not only technical (e.g., unequal global access to GPU clusters) but also structural—models trained predominantly in English may provide lower-quality responses to non-English-speaking users, further reinforcing linguistic inequities. These trends raise concerns about emerging forms of “AI colonialism,” where technological advances disproportionately benefit wealthy regions. Moreover, the outsourcing of AI development to high-resource countries risks sidelining local medical knowledge systems, cultural communication norms, and health priorities. Without inclusive governance structures and equitable participation in model design and evaluation, these tools may entrench existing healthcare disparities. As LLMs are increasingly deployed in multilingual and resource-constrained environments, addressing these systemic imbalances becomes both an ethical and a technical imperative. To mitigate these risks, future development should prioritize decentralization strategies such as localized training of lightweight models, open-access multilingual corpora, and equitable international collaborations to ensure responsible, inclusive deployment across diverse healthcare settings.
Inter-rater agreement in our study was substantial across domains, indicating that the evaluation rubric was reliable and clinically interpretable. The integration of an independent scoring framework provided further validation, confirming general trends while introducing slightly different weightings that highlight the complexity of assessing LLM-generated medical content.
Importantly, 14% of all responses were flagged as potentially problematic by at least one reviewer, most often due to incomplete risk framing, omission of diagnostic guidance, or overly reassuring tone. These issues underscore the need for robust oversight, especially when LLMs are used in high-stakes patient-facing scenarios.
To further contextualize the inherent subjectivity in evaluating hallucination and harm potential, we provide examples flagged by reviewers. Hallucinations were typically defined as factually incorrect medical statements, such as attributing kidney cancer risk to smoking cessation or misrepresenting guideline-recommended diagnostic steps (e.g., “Your doctor may order a PET-CT scan to confirm whether your kidney mass is benign or malignant”—a test not routinely used for initial renal mass evaluation). Responses flagged for harm potential often included overly reassuring language (e.g., “most kidney masses are harmless”) without appropriate emphasis on the need for clinical follow-up or suggested management strategies that could delay care. While guided by clinical expertise and rubric definitions, these qualitative judgments inevitably involved interpretation, underscoring the need for standardized definitions in future work.
Our study intentionally used unmodified prompts to simulate real-world usage, where patients often lack knowledge of prompt optimization. However, LLM performance can vary substantially based on how questions are phrased. Prompt engineering strategies, such as adding context, requesting citations, or adjusting tone, may improve outputs but also introduce variability that depends on the user’s digital literacy. Future work should explore how these factors influence LLM behavior, especially across diverse user populations. This raises an important challenge: not all users are equally equipped to formulate effective queries or evaluate AI-generated responses critically. As access to LLMs becomes more widespread, disparities in prompt literacy may exacerbate existing health information gaps. Incorporating user education, interface design improvements, and guardrails that ensure robust outputs regardless of input quality will be critical to supporting equitable use.
Taken together, our results expand the growing body of literature on LLMs in healthcare. While these tools show strong potential for enhancing patient education and communication, they are not yet suitable for unsupervised deployment in high-stakes scenarios such as incidental cancer diagnoses. Further development of domain-specific, guideline-informed models, combined with transparent disclaimers and safety protocols, will be critical for ensuring trustworthy and equitable patient-facing AI applications.
5. Limitations
This study has several limitations that should be considered when interpreting the results.
First, we evaluated LLM responses using a fixed set of 10 standardized patient questions derived from clinical experience and prior studies. While this ensures comparability across models, it does not capture the full variability of real-world patient inputs, including follow-up questions, contextual elaborations, or diverse phrasing styles. Additionally, all prompts were submitted as single-turn inputs, without iterative follow-up questions, which limits our ability to assess the models’ behavior in more dynamic, conversational settings.
Second, all prompts were submitted in English using default settings, without fine-tuning, temperature adjustment, or user-specific customization. As a result, these findings may not generalize to other languages, cultural contexts, or user profiles, particularly those with lower health literacy or different healthcare expectations.
Third, although five board-certified urologists rated all responses using a structured rubric, domains such as “hallucination” and “risk of harm” remain inherently subjective. To mitigate this, we calculated inter-rater reliability and also applied an independent scoring framework, in which two additional reviewers re-scored all responses using the same domains but with a distinct emphasis on accuracy and harm potential. While this validation approach confirmed the main performance trends, it provided additional insight into how subjective domain weighting may influence outcome interpretation.
Fourth, while we incorporated the Flesch Reading Ease and Grade Level indices to quantify readability, these metrics do not capture nuances such as tone, emotional resonance, or content relevance—factors that are especially important in emotionally sensitive diagnoses like cancer.
Fifth, the study evaluated only general-purpose, publicly available LLMs at a single time point in July 2025. Given the rapid evolution of these models, including changes to training data, safety filters, and system instructions, the findings represent a temporal snapshot. As such, model performance and output quality may shift with future updates, potentially affecting reproducibility. This underscores the need for continuous, transparent re-evaluation as LLMs evolve.
Additionally, our analysis did not include responses generated in other commonly used formats, such as voice-based interactions or app-integrated chatbots, which may shape user experience differently. Future research should explore how LLM output quality varies across platforms and modalities, especially given the growing use of mobile health applications and AI-powered virtual assistants in clinical and home settings, where accessibility, personalization, and trust may differ considerably from desktop-based usage.
Finally, this study did not assess real-world patient outcomes or behavior after exposure to LLM-generated content. As such, the practical impact on patient understanding, decision-making, or trust remains speculative.
6. Conclusions
This study highlights both the potential and limitations of large language models (LLMs) in patient-facing healthcare communication. When presented with standardized patient questions following the incidental diagnosis of a kidney tumor, ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Pro each demonstrated distinct strengths and trade-offs.
Google Gemini 2.5 Pro produced the most clinically complete and accurate responses, though sometimes at the expense of readability. Microsoft Copilot stood out for its clear disclaimers and empathetic tone, which may foster trust and emotional reassurance. ChatGPT-4o excelled in clarity and conversational engagement but was less consistent in providing complete and precise medical information.
Despite these promising features, none of the models achieved consistently high performance across all evaluated domains. Occasional hallucinations, omissions of critical clinical detail, and limited tailoring to patient comprehension highlight the current risks of unsupervised use in emotionally charged, high-stakes clinical scenarios.
As LLMs continue to be integrated into healthcare environments, our findings support their cautious use as adjuncts—rather than replacements—for professional counseling. To ensure safer and more effective deployment in patient-facing contexts, we propose two actionable safeguards: (1) mandatory citation of clinical guideline sources within LLM-generated responses to enhance transparency and trust, and (2) the implementation of real-time risk alert systems to flag hallucinations or potentially harmful advice before content is delivered.
These measures, combined with domain-specific training, adherence to medical guidelines, and regulatory oversight, will be essential to ensure that LLMs enhance—rather than compromise—patient understanding, trust, and safety.