Evaluating the Readability and Quality of Bladder Cancer Information from AI Chatbots: A Comparative Study Between ChatGPT, Google Gemini, Grok, Claude and DeepSeek
Abstract
1. Introduction
2. Materials and Methods
2.1. Question Selection and Prompting Process
2.2. Readability Assessment
2.3. Urologist Assessment of AI Chatbot Generated Answers
3. Results
Urologist Assessment of AI Chatbot Answers
- Two consultant urologists independently assessed the accuracy, completeness and clarity of the answers to each of the 10 bladder cancer questions using the DISCERN tool, a validated instrument developed at the University of Oxford for scoring the quality of written health information on treatment choices [15]. Table 2 summarises their average DISCERN scores. Grok 3 achieved the highest average DISCERN score (52), followed closely by ChatGPT 4.0 (50.5), while Gemini 2.0 Flash had the lowest (47).
- Both urologist reviewers independently recorded whether each response contained any factual inaccuracies. No hallucinations were detected across 50 chatbot answers (0/50; κ = 1.0).
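The averages quoted above are simply the mean of the two reviewers' DISCERN totals shown in the table below. The following minimal Python sketch, using the per-reviewer totals from that table purely for illustration, makes the arithmetic explicit; it is not the authors' analysis code.

```python
# Minimal sketch: average the two reviewers' 16-item DISCERN totals per chatbot.
# Totals are taken from the table below.
reviewer_totals = {
    "ChatGPT 4.0":       (53, 48),
    "Gemini 2.0 Flash":  (49, 45),
    "Grok 3":            (46, 58),
    "Claude 3.7 Sonnet": (55, 42),
    "DeepSeek R1":       (49, 47),
}

for model, (r1, r2) in reviewer_totals.items():
    avg = (r1 + r2) / 2
    print(f"{model}: reviewer totals {r1}/{r2} -> average DISCERN {avg:.1f}")
# ChatGPT 4.0 -> 50.5, Gemini 2.0 Flash -> 47.0, Grok 3 -> 52.0,
# Claude 3.7 Sonnet -> 48.5, DeepSeek R1 -> 48.0
```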
| DISCERN Item | ChatGPT 4.0 Rev. 1 | ChatGPT 4.0 Rev. 2 | Gemini 2.0 Flash Rev. 1 | Gemini 2.0 Flash Rev. 2 | Grok 3 Rev. 1 | Grok 3 Rev. 2 | Claude 3.7 Sonnet Rev. 1 | Claude 3.7 Sonnet Rev. 2 | DeepSeek R1 Rev. 1 | DeepSeek R1 Rev. 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 3 | 4 | 4 |
| 2 | 5 | 3 | 5 | 4 | 5 | 4 | 4 | 3 | 5 | 4 |
| 3 | 4 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | 3 | 5 |
| 4 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 1 |
| 6 | 4 | 5 | 4 | 4 | 3 | 4 | 4 | 4 | 3 | 4 |
| 7 | 1 | 1 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 2 |
| 8 | 2 | 2 | 2 | 3 | 2 | 3 | 3 | 2 | 3 | 2 |
| 9 | 5 | 5 | 5 | 3 | 4 | 4 | 4 | 3 | 5 | 3 |
| 10 | 5 | 4 | 4 | 4 | 4 | 4 | 5 | 3 | 5 | 4 |
| 11 | 4 | 3 | 4 | 2 | 3 | 4 | 5 | 3 | 4 | 4 |
| 12 | 3 | 4 | 3 | 2 | 2 | 4 | 4 | 2 | 3 | 1 |
| 13 | 4 | 2 | 3 | 2 | 3 | 2 | 4 | 3 | 2 | 2 |
| 14 | 4 | 4 | 3 | 5 | 4 | 4 | 4 | 3 | 3 | 4 |
| 15 | 3 | 2 | 3 | 2 | 3 | 4 | 3 | 3 | 3 | 3 |
| 16 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 3 | 3 | 3 |
| Total | 53 | 48 | 49 | 45 | 46 | 58 | 55 | 42 | 49 | 47 |
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLM | Large Language Model |
| FRE | Flesch Reading Ease |
| AI | Artificial Intelligence |
| FKRGL | Flesch–Kincaid Reading Grade Level |
References
1. Ma, J.; Roumiguie, M.; Hayashi, T.; Kohada, Y.; Zlotta, A.R.; Lévy, S.; Matsumoto, T.; Sano, T.; Black, P.C. Long-Term Recurrence Rates of Low-Risk Non-Muscle-Invasive Bladder Cancer-How Long Is Cystoscopic Surveillance Necessary? Eur. Urol. Focus 2024, 10, 189–196.
2. Grabe-Heyne, K.; Henne, C.; Mariappan, P.; Geiges, G.; Pöhlmann, J.; Pollock, R.F. Intermediate and high-risk non-muscle-invasive bladder cancer: An overview of epidemiology, burden, and unmet needs. Front. Oncol. 2023, 13, 1170124.
3. Thomas, K.R.; Joshua, C.; Ibilibor, C. Psychological Distress in Bladder Cancer Patients: A Systematic Review. Cancer Med. 2024, 13, e70345.
4. Makaroff, L.E.; Filicevas, A.; Boldon, S.; Hensley, P.; Black, P.C.; Chisolm, S.; Demkiw, S.; Fernández, M.I.; Sugimoto, M.; Jensen, B.T.; et al. Patient and Carer Experiences with Bladder Cancer: Results from a Global Survey in 45 Countries. Eur. Urol. 2023, 84, 248–251.
5. Cacciamani, G.E.; Bassi, S.; Sebben, M.; Marcer, A.; Russo, G.I.; Cocci, A.; Dell’Oglio, P.; Medina, L.G.; Nassiri, N.; Tafuri, A.; et al. Consulting “Dr. Google” for prostate cancer treatment options: A contemporary worldwide trend analysis. Eur. Urol. Oncol. 2020, 3, 481–488.
6. Cacciamani, G.E.; Dell’Oglio, P.; Cocci, A.; Russo, G.I.; De Castro Abreu, A.; Gill, I.S.; Briganti, A.; Artibani, W. Asking “Dr. Google” for a second opinion: The devil is in the details. Eur. Urol. Focus 2021, 7, 479–481.
7. Cacciamani, G.E.; Gill, K.; Gill, I.S. Web search queries and prostate cancer. Lancet Oncol. 2020, 21, 494–496.
8. Davis, R.; Eppler, M.; Ayo-Ajibola, O.; Loh-Doyle, J.C.; Nabhani, J.; Samplaski, M.; Gill, I.; Cacciamani, G.E. Evaluating the effectiveness of artificial intelligence-powered large language models application in disseminating appropriate and readable health information in urology. J. Urol. 2023, 210, 688–694.
9. Aljamaan, F.; Temsah, M.; Altamimi, I.; Al-Eyadhy, A.; Jamal, A.; Alhasan, K.; Mesallam, T.; Farahat, M.; Malki, K. Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study. JMIR Med. Inform. 2024, 12, e54345.
10. Steimetz, E.; Minkowitz, J.; Gabutan, E.C.; Ngichabe, J.; Attia, H.; Hershkop, M.; Ozay, F.; Hanna, M.G.; Gupta, R. Use of Artificial Intelligence Chatbots in Interpretation of Pathology Reports. JAMA Netw. Open 2024, 7, e2412767.
11. Tung, J.Y.M.; Lim, D.Y.Z.; Sng, G.G.R. Potential safety concerns in use of the artificial intelligence chatbot ‘ChatGPT’ for perioperative patient communication. BJU Int. 2023, 132, 157–159.
12. Temsah, O.; Khan, S.A.; Chaiah, Y.; Senjab, A.; Alhasan, K.; Jamal, A.; Aljamaan, F.; Malki, K.H.; Halwani, R.; Al-Tawfiq, J.A.; et al. Overview of early ChatGPT’s presence in medical literature: Insights from a hybrid literature review by ChatGPT and human experts. Cureus 2023, 15, e37281.
13. Eppler, M.B.; Ganjavi, C.; Knudsen, J.E.; Davis, R.J.; Ayo-Ajibola, O.; Desai, A.; Ramacciotti, L.S.; Chen, A.; Abreu, A.D.C.; Desai, M.M.; et al. Bridging the gap between urological research and patient understanding: The role of large language models in automated generation of layperson’s summaries. Urol. Pract. 2023, 10, 436–443.
14. Cocci, A.; Pezzoli, M.; Lo Re, M.; Russo, G.I.; Asmundo, M.G.; Fode, M.; Cacciamani, G.; Cimino, S.; Minervini, A.; Durukan, E. Quality of information and appropriateness of ChatGPT outputs for urology patients. Prostate Cancer Prostatic Dis. 2024, 27, 103–108.
15. DISCERN. Available online: https://www.ndph.ox.ac.uk/research/research-groups/applied-health-research-unit-ahru/discern (accessed on 5 May 2025).
16. Stokel-Walker, C.; Van Noorden, R. What ChatGPT and generative AI mean for science. Nature 2023, 614, 214–216.
17. Flores-Cohaila, J.A.; García-Vicente, A.; Vizcarra-Jiménez, S.F.; De la Cruz-Galán, J.P.; Gutiérrez-Arratia, J.D.; Quiroga Torres, B.G.; Taype-Rondan, A. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study. JMIR Med. Educ. 2023, 9, e48039.
18. Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312.
19. Humar, P.; Asaad, M.; Bengur, F.B.; Nguyen, V. ChatGPT Is Equivalent to First-Year Plastic Surgery Residents: Evaluation of ChatGPT on the Plastic Surgery In-Service Examination. Aesthet. Surg. J. 2023, 43, NP1085–NP1089.
20. Hofmann, H.L.; Guerra, G.A.; Le, J.L.; Wong, A.M.; Hofmann, G.H.; Mayfield, C.K.; Petrigliano, F.A.; Liu, J.N. The Rapid Development of Artificial Intelligence: GPT-4’s Performance on Orthopedic Surgery Board Questions. Orthopedics 2023, 47, e85–e89.
21. Ghanem, D.; Nassar, J.E.; El Bachour, J.; Hanna, T. ChatGPT Earns American Board Certification in Hand Surgery. Hand Surg. Rehabil. 2024, 43, 101688.
22. Desouky, E.; Jallad, S.; Bhardwa, J.; Sharma, H.; Kalsi, J. ChatGPT sitting for FRCS Urology examination: Will artificial intelligence get certified? J. Clin. Urol. 2024, 18, 383–391.
23. Zhu, L.; Mou, W.; Chen, R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J. Transl. Med. 2023, 21, 269.

Per-question readability scores by chatbot. Column prefixes: GPT = ChatGPT 4.0, GEM = Gemini 2.0 Flash, GRK = Grok 3, CLD = Claude 3.7 Sonnet, DS = DeepSeek R1.
| Q# | GPT FRE | GPT GF | GPT FK | GPT CL | GPT SM | GEM FRE | GEM GF | GEM FK | GEM CL | GEM SM | GRK FRE | GRK GF | GRK FK | GRK CL | GRK SM | CLD FRE | CLD GF | CLD FK | CLD CL | CLD SM | DS FRE | DS GF | DS FK | DS CL | DS SM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 45 | 11.9 | 9.43 | 12.9 | 8.76 | 38 | 11.6 | 10.05 | 14.48 | 8.17 | 28 | 14.4 | 14.63 | 15.84 | 12.69 | 32 | 14 | 13.6 | 15.5 | 11.9 | 32 | 14.1 | 12.1 | 15.5 | 10.6 |
| 2 | 49 | 9.9 | 8.91 | 11.2 | 7.7 | 51 | 10.1 | 8.89 | 10.79 | 8.09 | 51 | 9.9 | 11.25 | 11.28 | 9.62 | 58 | 9.3 | 7.62 | 10.5 | 7.3 | 53 | 9.9 | 7.96 | 11.6 | 7.09 |
| 3 | 65 | 9.1 | 6.55 | 8.05 | 7.24 | 50 | 10.6 | 8.45 | 11.03 | 7.78 | 67 | 8.3 | 8.19 | 8.83 | 7.9 | 49 | 11 | 9.05 | 12.3 | 8.47 | 54 | 10.6 | 8.37 | 10.7 | 8.25 |
| 4 | 68 | 9.8 | 5.59 | 7.52 | 6.81 | 57 | 9.4 | 7.5 | 9.78 | 7.58 | 64 | 8.8 | 8.32 | 10.26 | 7.88 | 52 | 11 | 8.79 | 11.8 | 8.83 | 55 | 11 | 8.1 | 10.4 | 8.4 |
| 5 | 56 | 11.5 | 7.7 | 10.1 | 8.17 | 42 | 11.3 | 9.96 | 13.04 | 8.71 | 59 | 9.2 | 7.68 | 11.38 | 7.5 | 26 | 16 | 12 | 15.9 | 10.4 | 25 | 14.5 | 11.7 | 15.1 | 9.23 |
| 6 | 58 | 11.8 | 7.53 | 9.49 | 8.26 | 55 | 9 | 8.91 | 11.1 | 7.91 | 65 | 9.6 | 7.47 | 9.81 | 7.99 | 50 | 12 | 9.03 | 11.9 | 8.95 | 60 | 9.6 | 7.01 | 9.9 | 7.15 |
| 7 | 52 | 12.5 | 8.14 | 10.2 | 8.51 | 46 | 11.7 | 10.05 | 11.47 | 9.81 | 62 | 9.9 | 7.98 | 10.66 | 8.52 | 47 | 11 | 9.06 | 11.1 | 8.68 | 45 | 13.1 | 9.64 | 11.8 | 9.54 |
| 8 | 65 | 10.7 | 6.41 | 9.19 | 7.76 | 43 | 10.4 | 11.44 | 12.89 | 9.47 | 63 | 9.6 | 7.2 | 10.19 | 7.58 | 46 | 10 | 9.13 | 13.5 | 7.82 | 53 | 10.4 | 8.37 | 11.2 | 7.89 |
| 9 | 66 | 10.5 | 6.34 | 7.89 | 7.63 | 46 | 10.4 | 9.54 | 12.3 | 8.19 | 61 | 9.7 | 7.8 | 10.78 | 7.9 | 37 | 12 | 10.1 | 13 | 8.22 | 42 | 12.8 | 9.95 | 12.3 | 9.06 |
| 10 | 68 | 10.1 | 6.38 | 8.66 | 7.94 | 51 | 10.2 | 8.96 | 12.19 | 8.4 | 63 | 9.8 | 8.31 | 11.27 | 8.56 | 48 | 12 | 11.9 | 12.5 | 10.1 | 56 | 9.5 | 8.29 | 11.8 | 8.09 |
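For reference, per-answer readability indices of this kind can be computed with an off-the-shelf package. The sketch below assumes the `textstat` Python package and assumes the column abbreviations denote the Flesch Reading Ease (FRE), Gunning Fog (GF), Flesch-Kincaid Grade Level (FK), Coleman-Liau (CL) and SMOG (SM) indices, the standard set these abbreviations usually stand for; it is an illustration, not necessarily the tooling used in the study.

```python
# Hedged sketch: computing five common readability indices for one chatbot answer.
# Assumes the `textstat` package (pip install textstat).
import textstat

answer = (
    "Bladder cancer starts in the lining of the bladder. "
    "Most cases are found early and can be treated."
)  # placeholder text, not an actual chatbot response from the study

scores = {
    "FRE": textstat.flesch_reading_ease(answer),    # higher score = easier to read
    "GF":  textstat.gunning_fog(answer),            # approximate US school grade
    "FK":  textstat.flesch_kincaid_grade(answer),   # approximate US school grade
    "CL":  textstat.coleman_liau_index(answer),     # approximate US school grade
    "SM":  textstat.smog_index(answer),             # approximate US school grade
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```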
Inter-rater agreement between the two urologist reviewers on DISCERN item scores, by chatbot.
| Model | ICC2 (Agreement) | Cohen’s Kappa | Weighted Kappa (Quadratic) |
|---|---|---|---|
| ChatGPT 4.0 | 0.791 | 0.437 | 0.780 |
| Gemini 2.0 Flash | 0.657 | 0.291 | 0.643 |
| Grok 3 | 0.339 | 0.000 | 0.325 |
| Claude 3.7 Sonnet | 0.562 | 0.063 | 0.546 |
| DeepSeek R1 | 0.669 | 0.308 | 0.655 |
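Agreement statistics of this kind can be computed from the two reviewers' item-level scores with standard libraries. The sketch below assumes `scikit-learn` for Cohen's and quadratic-weighted kappa and `pingouin` for the two-way ICC, with Grok 3's 16 DISCERN item scores from the table above used as example data; it illustrates the statistics rather than reproducing the authors' exact analysis code.

```python
# Hedged sketch: inter-rater agreement between two reviewers' DISCERN item scores.
# Assumes scikit-learn and pingouin; example data are Grok 3's item scores.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

rev1 = [4, 5, 3, 1, 1, 3, 1, 2, 4, 4, 3, 2, 3, 4, 3, 3]  # reviewer 1, Grok 3
rev2 = [4, 4, 4, 2, 4, 4, 3, 3, 4, 4, 4, 4, 2, 4, 4, 4]  # reviewer 2, Grok 3

kappa = cohen_kappa_score(rev1, rev2)                        # unweighted agreement
wkappa = cohen_kappa_score(rev1, rev2, weights="quadratic")  # quadratic-weighted

# Two-way random-effects ICC: reshape to long format (item, rater, score).
long = pd.DataFrame({
    "item":  list(range(1, 17)) * 2,
    "rater": ["R1"] * 16 + ["R2"] * 16,
    "score": rev1 + rev2,
})
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")
icc2 = icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0]

print(f"Cohen's kappa: {kappa:.3f}, weighted kappa: {wkappa:.3f}, ICC2: {icc2:.3f}")
```

With Grok 3's item scores the unweighted kappa works out to 0.000, consistent with the value reported above; the weighted kappa and ICC are computed from the same long-format data for each model.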