Assessing ChatGPT-v4 for Guideline-Concordant Inflammatory Bowel Disease: Accuracy, Completeness, and Temporal Drift
Abstract
1. Introduction
2. Methods
2.1. Study Design and Settings
2.2. Statistical Analysis
3. Results
3.1. Binary Questions
3.2. Multiple-Choice Questions
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
IBD | Inflammatory Bowel Disease |
UC | Ulcerative Colitis |
CD | Crohn’s Disease |
AI | Artificial Intelligence |
LLM | Large Language Model |
ECCO | European Crohn’s and Colitis Organization |
ChatGPT | Chat Generative Pretrained Transformer |
Group | Category | Initial Day (1), n (%) | 15th Day (2), n (%) | 180th Day (3), n (%) | p c |
---|---|---|---|---|---|
Binary Answers | Correct answers | 47 (92.2) | 47 (92.2) | 50 (98.0) | 0.862 |
Binary Answers | False answers | 4 (7.8) | 4 (7.8) | 1 (2.0) | |
Binary Answers | Variable | Median (IQR) | Median (IQR) | Median (IQR) | p k/Pairwise Difference |
Binary Answers | 3-point Likert | 2.00 (1.00) | 2.00 (1.00) | 3.00 (1.00) | 0.028 */2 < 3 |
Binary Answers | 6-point Likert | 6.00 (1.00) | 6.00 (1.00) | 6.00 (1.00) | 0.496/- |
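
The percentages in the table above follow directly from the counts. As a minimal sketch (the dictionary layout and the function name are illustrative, not taken from the article), the following recomputes them, assuming each percentage is the count divided by the column total of 51 answers and rounded to one decimal place:

```python
# Counts of correct and false binary answers at each time point,
# copied from the table above.
counts = {
    "Initial Day (1)": (47, 4),
    "15th Day (2)": (47, 4),
    "180th Day (3)": (50, 1),
}


def percentages(correct, false):
    """Return (correct %, false %) of the column total, rounded to 1 decimal."""
    total = correct + false
    return round(100 * correct / total, 1), round(100 * false / total, 1)


for label, (correct, false) in counts.items():
    pct_correct, pct_false = percentages(correct, false)
    print(f"{label}: correct {pct_correct}%, false {pct_false}% (n = {correct + false})")
```

Under that assumption the recomputed values agree with the n (%) cells reported for all three time points.
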

Likert | Group | Variable | Median | IQR | p | Pairwise Difference | Kendall’s W |
---|---|---|---|---|---|---|---|
3-point Likert | Author 1 | Initial Day (1) | 2.00 | 1.00 | 0.331 | - | 0.022 |
3-point Likert | Author 1 | 15th Day (2) | 3.00 | 1.00 | | | |
3-point Likert | Author 1 | 180th Day (3) | 3.00 | 1.00 | | | |
3-point Likert | Author 2 | Initial Day (1) | 2.00 | 1.00 | 0.131 | - | 0.040 |
3-point Likert | Author 2 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 2 | 180th Day (3) | 3.00 | 1.00 | | | |
3-point Likert | Author 3 | Initial Day (1) | 2.00 | 1.00 | 0.711 | - | 0.007 |
3-point Likert | Author 3 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 3 | 180th Day (3) | 2.00 | 1.00 | | | |
3-point Likert | Author 4 | Initial Day (1) | 3.00 | 1.00 | 0.048 * | 1 < 3 | 0.060 |
3-point Likert | Author 4 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 4 | 180th Day (3) | 3.00 | 1.00 | | | |
3-point Likert | Author 5 | Initial Day (1) | 3.00 | 1.00 | 0.015 * | 1 < 3 | 0.082 |
3-point Likert | Author 5 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 5 | 180th Day (3) | 3.00 | 1.00 | | | |
6-point Likert | Author 1 | Initial Day (1) | 6.00 | 1.00 | 0.307 | - | 0.023 |
6-point Likert | Author 1 | 15th Day (2) | 6.00 | 1.00 | | | |
6-point Likert | Author 1 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 2 | Initial Day (1) | 6.00 | 1.00 | 0.789 | - | 0.005 |
6-point Likert | Author 2 | 15th Day (2) | 6.00 | 1.00 | | | |
6-point Likert | Author 2 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 3 | Initial Day (1) | 6.00 | 1.00 | 0.549 | - | 0.012 |
6-point Likert | Author 3 | 15th Day (2) | 5.00 | 1.00 | | | |
6-point Likert | Author 3 | 180th Day (3) | 5.00 | 1.00 | | | |
6-point Likert | Author 4 | Initial Day (1) | 6.00 | 1.00 | 0.789 | - | 0.005 |
6-point Likert | Author 4 | 15th Day (2) | 6.00 | 1.00 | | | |
6-point Likert | Author 4 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 5 | Initial Day (1) | 6.00 | 1.00 | 0.627 | - | 0.009 |
6-point Likert | Author 5 | 15th Day (2) | 6.00 | 1.00 | | | |
6-point Likert | Author 5 | 180th Day (3) | 6.00 | 1.00 | | | |
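
Kendall’s W and the Friedman chi-square statistic are linked by W = χ²/(N(k − 1)), so the p values in the table above can be cross-checked against the reported W values. The sketch below rests on assumptions that the table itself does not confirm: that the p values come from a Friedman test with the chi-square approximation, that N = 51 rated questions (the total of correct and false binary answers), and that k = 3 time points. The function name is illustrative.

```python
import math

N_ITEMS = 51  # ASSUMPTION: rated questions per set (47 correct + 4 false)
K_TIMES = 3   # Initial Day, 15th Day, 180th Day


def friedman_p_from_w(w, n=N_ITEMS, k=K_TIMES):
    """Approximate Friedman p value implied by Kendall's W (k = 3 only)."""
    if k != 3:
        raise ValueError("the closed form below is valid only for 2 degrees of freedom")
    chi2 = w * n * (k - 1)      # Friedman statistic: chi2 = W * N * (k - 1)
    return math.exp(-chi2 / 2)  # chi-square survival function with 2 df


# (Kendall's W, reported p) for the 3-point Likert rows of the table above.
reported = {
    "Author 1": (0.022, 0.331),
    "Author 2": (0.040, 0.131),
    "Author 3": (0.007, 0.711),
    "Author 4": (0.060, 0.048),
    "Author 5": (0.082, 0.015),
}

for rater, (w, p_reported) in reported.items():
    p_implied = friedman_p_from_w(w)
    print(f"{rater}: W = {w:.3f}, implied p = {p_implied:.3f}, reported p = {p_reported:.3f}")
```

With these assumptions the implied p values land within about 0.01 of the reported ones, a gap that is plausibly explained by W being rounded to three decimals. The closed-form survival function is used only because k − 1 = 2; for other designs a general chi-square routine would be needed.
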

Group | Category | Initial Day (1), n (%) | 15th Day (2), n (%) | 180th Day (3), n (%) | p c |
---|---|---|---|---|---|
Multiple | Correct answers | 46 (90.20) | 47 (92.20) | 43 (84.30) | 0.471 |
Multiple | False answers | 5 (9.80) | 4 (7.80) | 8 (15.70) | |
Multiple | Variable | Median (IQR) | Median (IQR) | Median (IQR) | p k/Pairwise Difference |
Multiple | 3-point Likert | 2.00 (1.00) | 2.00 (1.00) | 2.00 (1.00) | 0.001 */1 < 2, 3 < 2 |
Multiple | 6-point Likert | 6.00 (1.00) | 6.00 (0.00) | 6.00 (1.00) | 0.001 */1 < 2, 3 < 2 |

Likert | Group | Variable | Median | IQR | p | Pairwise Difference | Kendall’s W |
---|---|---|---|---|---|---|---|
3-point Likert | Author 1 | Initial Day (1) | 2.00 | 1.00 | 0.004 * | 2 > 3, 2 > 1 | 0.110 |
3-point Likert | Author 1 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 1 | 180th Day (3) | 2.00 | 1.00 | | | |
3-point Likert | Author 2 | Initial Day (1) | 2.00 | 1.00 | 0.004 * | 2 > 3, 2 > 1 | 0.110 |
3-point Likert | Author 2 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 2 | 180th Day (3) | 2.00 | 1.00 | | | |
3-point Likert | Author 3 | Initial Day (1) | 2.00 | 1.00 | 0.644 | - | 0.009 |
3-point Likert | Author 3 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 3 | 180th Day (3) | 2.00 | 1.00 | | | |
3-point Likert | Author 4 | Initial Day (1) | 2.00 | 1.00 | 0.005 * | 2 > 3, 2 > 1 | 0.103 |
3-point Likert | Author 4 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 4 | 180th Day (3) | 2.00 | 0.00 | | | |
3-point Likert | Author 5 | Initial Day (1) | 2.00 | 1.00 | 0.137 | - | 0.039 |
3-point Likert | Author 5 | 15th Day (2) | 2.00 | 1.00 | | | |
3-point Likert | Author 5 | 180th Day (3) | 2.00 | 1.00 | | | |
6-point Likert | Author 1 | Initial Day (1) | 6.00 | 1.00 | 0.159 | - | 0.036 |
6-point Likert | Author 1 | 15th Day (2) | 6.00 | 0.00 | | | |
6-point Likert | Author 1 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 2 | Initial Day (1) | 6.00 | 1.00 | 0.012 * | 2 > 1 | 0.087 |
6-point Likert | Author 2 | 15th Day (2) | 6.00 | 0.00 | | | |
6-point Likert | Author 2 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 3 | Initial Day (1) | 6.00 | 1.00 | 0.043 * | 2 > 1 | 0.062 |
6-point Likert | Author 3 | 15th Day (2) | 6.00 | 1.00 | | | |
6-point Likert | Author 3 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 4 | Initial Day (1) | 6.00 | 1.00 | 0.012 * | 2 > 1, 2 > 3 | 0.087 |
6-point Likert | Author 4 | 15th Day (2) | 6.00 | 0.00 | | | |
6-point Likert | Author 4 | 180th Day (3) | 6.00 | 1.00 | | | |
6-point Likert | Author 5 | Initial Day (1) | 6.00 | 1.00 | 0.009 * | 2 > 1, 2 > 3 | 0.092 |
6-point Likert | Author 5 | 15th Day (2) | 6.00 | 0.00 | | | |
6-point Likert | Author 5 | 180th Day (3) | 6.00 | 1.00 | | | |

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ozturk, O.; Ergul, M.; Cagir, Y.; Atay, A.; Acun, K.C.; Coskun, O.; Tenlik, I.; Durak, M.B.; Yuksel, I. Assessing ChatGPT-v4 for Guideline-Concordant Inflammatory Bowel Disease: Accuracy, Completeness, and Temporal Drift. J. Clin. Med. 2025, 14, 4599. https://doi.org/10.3390/jcm14134599