Assessing ChatGPT Accuracy Across Versions for Patient and Guideline Queries in Sacral Neuromodulation
Abstract
1. Introduction
2. Methods
2.1. Study Design
2.2. Question Development
2.2.1. Sources
2.2.2. Identification of Patient FAQ Domains
2.2.3. FAQ Prioritization
2.2.4. Selection of Guideline-Based Questions
2.3. Classification of Questions
2.4. ChatGPT Response Collection
2.5. Evaluation of Responses
2.6. Statistical Analysis
3. Results
3.1. Overall Accuracy Across Versions
3.2. Domain-Specific Accuracy
3.3. FAQ vs. Guideline-Based Questions
3.4. Question Difficulty
3.5. Reproducibility of Responses
3.6. Weighted Sensitivity Analysis
3.7. Examples of “Correct but Incomplete” (Grade-2) Responses
- MRI safety guidance omitted conditional rules (e.g., the device must be switched to MRI mode).
- Infection risk was omitted despite being a major cause of revision surgery.
- Postoperative restrictions were stated ambiguously (e.g., without specifying no twisting, bending, or lifting >10 lbs for 4–6 weeks).
4. Discussion
4.1. Study Limitations
- Only 20 questions were analyzed, so the observed performance trends may not capture the full breadth of real-world patient or clinician inquiries about SNM.
- The evaluation was performed in English only, limiting generalizability to other languages.
- Responses were rated by two reviewers, leaving potential for residual subjectivity.
- Only ChatGPT was tested; other LLMs such as Claude, Gemini, and DeepSeek were not compared.
- SNM questions were selected manually because no standardized, validated SNM question bank exists, which may introduce selection bias and limit comparability across studies.
- The small question set (n = 20) limits the power of subgroup analyses, particularly for domain-level and difficulty-level comparisons.
- ChatGPT versions evolve continuously, and silent model updates during the study period may have affected reproducibility; the results therefore reflect performance during a specific time window rather than a fixed model state.
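To make the sample-size limitation concrete, the sketch below computes approximate 95% Wilson score intervals for the reported combined success counts (Grade 1 + 2) out of 20 questions. The choice of the Wilson interval and the names `wilson_interval` and `reported_successes` are assumptions made for illustration; they are not described as part of the study's analysis.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Combined success counts (Grade 1 + 2) out of 20, as reported in the results.
reported_successes = {"3.5": 14, "4.0": 17, "5.0": 18}

for version, k in reported_successes.items():
    lower, upper = wilson_interval(k, 20)
    print(f"ChatGPT {version}: {k}/20, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions, the interval for version 3.5 (about 0.48 to 0.85) overlaps the interval for version 5.0 (about 0.70 to 0.97), which illustrates how little a 20-question set can resolve.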
4.2. Recommendations for Safe Clinical Use of ChatGPT in SNM Counseling
5. Conclusions
Supplementary Materials
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

| No. | Question | Category |
|---|---|---|
| Mechanism & Indications | ||
| 1 | How does sacral neuromodulation (SNM) work to improve bladder function? | Frequently asked questions (FAQ) |
| 2 | Who is eligible for SNM therapy? | Guideline-based |
| 3 | Can SNM be used for urinary retention as well as overactive bladder? | Guideline-based |
| 4 | Is SNM indicated for patients with neurological conditions (e.g., multiple sclerosis, spinal cord injury)? | Guideline-based |
| Procedural Technique | ||
| 5 | What is the difference between the trial (test phase) and the permanent implant in SNM? | FAQ |
| 6 | Is the SNM procedure done under local or general anesthesia? | FAQ |
| 7 | How is the lead positioned during the SNM procedure? | Guideline-based |
| Outcomes & Efficacy | ||
| 8 | What are the success rates of SNM for overactive bladder? | FAQ |
| 9 | How long does the benefit of SNM usually last? | FAQ |
| 10 | How does SNM compare to Botox injections for overactive bladder? | Guideline-based |
| 11 | Can SNM therapy be repeated or adjusted if symptoms return? | FAQ |
| Complications & Risk Mitigation | ||
| 12 | What are the most common complications of SNM (e.g., lead migration, infection)? | FAQ |
| 13 | How often do patients need revision surgery after SNM? | Guideline-based |
| 14 | Is SNM safe for patients who may require magnetic resonance imaging (MRI) scans in the future? | Guideline-based |
| Postoperative Management | ||
| 15 | How long is the recovery period after SNM implantation? | FAQ |
| 16 | How long does the SNM battery last, and how is it replaced? | FAQ |
| 17 | What activity restrictions should patients follow after SNM implantation? | FAQ |
| 18 | Can SNM be used together with medications for overactive bladder? | Guideline-based |
| 19 | What follow-up care and monitoring are recommended after SNM? | Guideline-based |
| 20 | Are there contraindications for SNM, such as certain medical conditions or prior surgeries? | Guideline-based |

| ChatGPT Version | Completely Correct (Grade 1) | Correct but Incomplete (Grade 2) | Partially Misleading (Grade 3) | Completely Incorrect (Grade 4) | Combined Success Rate (Grade 1 + 2) |
|---|---|---|---|---|---|
| 3.5 | 7/20 (35%) | 7/20 (35%) | 4/20 (20%) | 2/20 (10%) | 14/20 (70%) |
| 4.0 | 11/20 (55%) | 6/20 (30%) | 2/20 (10%) | 1/20 (5%) | 17/20 (85%) |
| 5.0 | 13/20 (65%) | 5/20 (25%) | 1/20 (5%) | 1/20 (5%) | 18/20 (90%) |
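As a quick arithmetic check, the combined success rate in the last column can be recomputed from the four grade counts in each row. The Python sketch below only re-derives the tabulated values; the names `grade_counts` and `combined_success_rate` are illustrative and do not come from the study.

```python
# Grade counts per ChatGPT version, copied from the table above:
# (Grade 1, Grade 2, Grade 3, Grade 4)
grade_counts = {
    "3.5": (7, 7, 4, 2),
    "4.0": (11, 6, 2, 1),
    "5.0": (13, 5, 1, 1),
}

def combined_success_rate(counts):
    """Return (successes, total, percent), where success means Grade 1 or Grade 2."""
    total = sum(counts)
    successes = counts[0] + counts[1]
    return successes, total, round(100 * successes / total)

for version, counts in grade_counts.items():
    k, n, pct = combined_success_rate(counts)
    print(f"ChatGPT {version}: {k}/{n} ({pct}%)")
```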

| Thematic Domain | Version 3.5, Completely Correct, n/N (%) | Version 4.0, Completely Correct, n/N (%) | Version 5.0, Completely Correct, n/N (%) |
|---|---|---|---|
| Mechanism & Indications (Q1–4) | 2/4 (50%) | 3/4 (75%) | 4/4 (100%) |
| Procedural Technique (Q5–7) | 1/3 (33%) | 2/3 (67%) | 3/3 (100%) |
| Outcomes & Efficacy (Q8–11) | 1/4 (25%) | 2/4 (50%) | 3/4 (75%) |
| Complications & Risk Mitigation (Q12–14) | 1/5 (20%) | 2/5 (40%) | 3/5 (60%) |
| Postoperative Management (Q15–20) | 2/6 (33%) | 3/6 (50%) | 4/6 (67%) |

| Category | Version 3.5, Correct, n/N (%) | Version 4.0, Correct, n/N (%) | Version 5.0, Correct, n/N (%) |
|---|---|---|---|
| Source Type | |||
| FAQ (n = 10) | 8/10 (80%) | 9/10 (90%) | 10/10 (100%) |
| Guideline-based (n = 10) | 6/10 (60%) | 8/10 (80%) | 8/10 (80%) |
| Question Difficulty | |||
| Easy (n = 6) | 5/6 (83%) | 6/6 (100%) | 6/6 (100%) |
| Medium (n = 9) | 6/9 (67%) | 7/9 (78%) | 8/9 (89%) |
| Difficult (n = 5) | 2/5 (40%) | 3/5 (60%) | 4/5 (80%) |
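One way to gauge whether the FAQ versus guideline-based gap in the table above exceeds what chance would produce is a two-sided Fisher's exact test on each version's 2×2 table of correct and not-correct counts. The sketch below is an assumption for illustration: this excerpt does not state which test the study used, and the function name `fisher_exact_two_sided` is hypothetical. It relies only on the Python standard library.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that x of the col1 "correct" answers fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    x_min = max(0, col1 - row2)
    x_max = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(x_min, x_max + 1) if prob(x) <= p_obs * (1 + 1e-9))

# (FAQ correct, FAQ not correct, guideline correct, guideline not correct)
source_type_counts = {
    "3.5": (8, 2, 6, 4),
    "4.0": (9, 1, 8, 2),
    "5.0": (10, 0, 8, 2),
}

for version, table in source_type_counts.items():
    print(f"ChatGPT {version}: p = {fisher_exact_two_sided(*table):.2f}")
```

With ten questions per group, the resulting p-values are approximately 0.63, 1.00, and 0.47 for versions 3.5, 4.0, and 5.0, so a test of this kind cannot separate the two question types at this sample size.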

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Published by MDPI on behalf of the Société Internationale d’Urologie. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Eskandar, K. Assessing ChatGPT Accuracy Across Versions for Patient and Guideline Queries in Sacral Neuromodulation. Soc. Int. Urol. J. 2026, 7, 11. https://doi.org/10.3390/siuj7010011
