Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations
Abstract
1. Introduction
2. Materials and Methods
2.1. General Study Design
2.2. Question Design and Prompting
“My kidney function is impaired with a GFR of about 45—can I have a CT with contrast?” or “I have a pacemaker—can I still have an MRI?”
“I am a patient. I am due to have a CT scan and have some questions about this examination. Can you answer each of the following questions in an understandable way.” and “I am a patient. I am due to have an MRI scan and have some questions about this examination. Can you answer each of the following questions in an understandable way.”
2.3. Response Evaluation
2.4. Statistical Analysis
3. Results
4. Discussion
5. Limitations
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
LLM | Large language model |
CT | Computed tomography |
MRI | Magnetic resonance imaging |
GFR | Glomerular filtration rate |
IQR | Interquartile range |
R1 | Radiologist 1 |
R2 | Radiologist 2 |
ICC | Intraclass correlation coefficient |
Rating (CT) | ChatGPT 4o | Google Gemini | Claude 3.5 Sonnet | Mistral Large 2 |
---|---|---|---|---|
5 | 75 (65.79%) | 74 (64.91%) | 71 (62.28%) | 56 (49.12%) |
4 | 26 (22.81%) | 26 (22.81%) | 23 (20.18%) | 33 (28.95%) |
3 | 11 (9.65%) | 8 (7.02%) | 16 (14.04%) | 23 (20.18%) |
2 | 1 (0.88%) | 2 (1.75%) | 3 (2.63%) | 1 (0.88%) |
1 | 1 (0.88%) | 4 (3.51%) | 1 (0.88%) | 1 (0.88%) |
Mean (±SD) | 4.52 (±0.46) | 4.44 (±0.58) | 4.40 (±0.59) | 4.25 (±0.54) |
Rating (MRI) | ChatGPT 4o | Google Gemini | Claude 3.5 Sonnet | Mistral Large 2 |
---|---|---|---|---|
5 | 107 (83.59%) | 102 (79.69%) | 107 (83.59%) | 105 (82.03%) |
4 | 15 (11.72%) | 14 (10.94%) | 15 (11.72%) | 16 (12.50%) |
3 | 6 (4.69%) | 9 (7.03%) | 5 (3.91%) | 4 (3.13%) |
2 | 0 (0.00%) | 2 (1.56%) | 1 (0.78%) | 2 (1.56%) |
1 | 0 (0.00%) | 1 (0.78%) | 0 (0.00%) | 1 (0.78%) |
Mean (±SD) | 4.79 (±0.37) | 4.68 (±0.58) | 4.79 (±0.37) | 4.74 (±0.47) |
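The percentages and means in the two rating-distribution tables above can be cross-checked from the raw counts. The following is a minimal sketch, not the authors' analysis code: the names `CT_COUNTS`, `MRI_COUNTS`, and `summarize` are illustrative, and the assignment of the first table to CT and the second to MRI is inferred from the matching "All questions" means in the Friedman tables. The SD is not recomputed because the source does not state over which unit (single ratings or per-question averages) it was calculated.

```python
# Rating counts (Likert 5..1) copied from the two distribution tables above.
# Table-to-modality mapping (CT first, MRI second) is an inference, see lead-in.
CT_COUNTS = {
    "ChatGPT 4o": {5: 75, 4: 26, 3: 11, 2: 1, 1: 1},
    "Google Gemini": {5: 74, 4: 26, 3: 8, 2: 2, 1: 4},
    "Claude 3.5 Sonnet": {5: 71, 4: 23, 3: 16, 2: 3, 1: 1},
    "Mistral Large 2": {5: 56, 4: 33, 3: 23, 2: 1, 1: 1},
}
MRI_COUNTS = {
    "ChatGPT 4o": {5: 107, 4: 15, 3: 6, 2: 0, 1: 0},
    "Google Gemini": {5: 102, 4: 14, 3: 9, 2: 2, 1: 1},
    "Claude 3.5 Sonnet": {5: 107, 4: 15, 3: 5, 2: 1, 1: 0},
    "Mistral Large 2": {5: 105, 4: 16, 3: 4, 2: 2, 1: 1},
}


def summarize(counts):
    """Return (total ratings, percentage share per rating, pooled mean rating)."""
    total = sum(counts.values())
    shares = {rating: round(100 * n / total, 2) for rating, n in counts.items()}
    pooled_mean = sum(rating * n for rating, n in counts.items()) / total
    return total, shares, round(pooled_mean, 2)


for label, table in (("CT", CT_COUNTS), ("MRI", MRI_COUNTS)):
    for model, counts in table.items():
        total, shares, mean = summarize(counts)
        print(f"{label:3} {model:18} n={total} share(5)={shares[5]:.2f}% mean={mean:.2f}")
```

Under these assumptions, each CT column sums to 114 ratings and each MRI column to 128. The pooled CT means agree with the reported means, while the pooled MRI means land within 0.01 of the reported ones, which may reflect a different averaging or rounding order in the original analysis.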
| ChatGPT 4o | Google Gemini | Claude 3.5 Sonnet | Mistral Large 2 | p-Value Friedman Test |
---|---|---|---|---|---|
All questions (mean, SD) | 4.52 (±0.46) | 4.44 (±0.58) | 4.40 (±0.59) | 4.25 (±0.54) | <0.001 ** |
General and technical information (mean, SD) | 4.92 (±0.19) | 4.96 (±0.14) | 4.62 (±0.58) | 4.69 (±0.38) | 0.009 ** |
Contrast media information (mean, SD) | 4.48 (±0.39) | 4.44 (±0.50) | 4.46 (±0.40) | 4.06 (±0.55) | <0.001 ** |
| 4.33 (±0.41) | 4.17 (±0.61) | 4.17 (±0.41) | 4.00 (±0.63) | 0.557 (n.s.) |
| 4.75 (±0.29) | 4.75 (±0.29) | 4.88 (±0.25) | 4.25 (±0.50) | 0.097 (n.s.) |
| 5.00 (±0.00) | 4.83 (±0.29) | 4.83 (±0.29) | 4.67 (±0.29) | 0.262 (n.s.) |
| 4.42 (±0.38) | 4.50 (±0.32) | 4.50 (±0.32) | 4.25 (±0.52) | 0.145 (n.s.) |
| 4.29 (±0.27) | 4.29 (±0.57) | 4.29 (±0.27) | 3.57 (±0.19) | 0.007 ** |
Pregnancy, breastfeeding, and pediatric examinations (mean, SD) | 4.50 (±0.35) | 4.28 (±0.62) | 4.28 (±0.94) | 4.28 (±0.44) | 0.400 (n.s.) |
Pre- and post-examination information (mean, SD) | 4.06 (±0.58) | 3.83 (±0.56) | 4.06 (±0.53) | 4.11 (±0.42) | 0.969 (n.s.) |
| ChatGPT 4o | Google Gemini | Claude 3.5 Sonnet | Mistral Large 2 | p-Value Friedman Test |
---|---|---|---|---|---|
All questions (mean, SD) | 4.79 (±0.37) | 4.68 (±0.58) | 4.79 (±0.37) | 4.74 (±0.47) | 0.173 (n.s.) |
Categories | |||||
General and technical information (mean, SD) | 4.90 (±0.31) | 4.83 (±0.49) | 4.85 (±0.33) | 4.90 (±0.21) | 0.456 (n.s.) |
Information about external material (mean, SD) | 4.72 (±0.45) | 4.69 (±0.48) | 4.81 (±0.25) | 4.84 (±0.24) | 0.531 (n.s.) |
Contrast media information (mean, SD) | 4.68 (±0.37) | 4.50 (±0.71) | 4.68 (±0.46) | 4.46 (±0.63) | 0.127 (n.s.) |
Pregnancy, breastfeeding, and pediatric examinations (mean, SD) | 5.00 (±0.00) | 4.83 (±0.41) | 4.83 (±0.41) | 4.75 (±0.42) | 0.262 (n.s.) |
Pre- and post-examination information (mean, SD) | 4.64 (±0.38) | 4.36 (±0.80) | 4.64 (±0.48) | 4.50 (±0.76) | 0.491 (n.s.) |
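The Friedman test reported in the tables above compares the four models on the same set of questions by ranking the models within each question. A minimal standard-library sketch of the tie-corrected statistic follows. It is an illustration under stated assumptions, not the authors' code: the function names are invented for this example, the p-value helper is hard-coded for four models (3 degrees of freedom), and `EXAMPLE_SCORES` is hypothetical data, not study data. In practice a library routine such as `scipy.stats.friedmanchisquare` would normally be used instead.

```python
import math


def average_ranks(row):
    """Rank the values of one row (1 = smallest), averaging ranks across ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def friedman_test(rows):
    """Tie-corrected Friedman statistic and p-value for exactly four related samples.

    rows: one list per question holding the four model scores for that question.
    """
    n, k = len(rows), len(rows[0])
    if k != 4:
        raise ValueError("this sketch hard-codes df = 3, i.e. four models")
    ranked = [average_ranks(row) for row in rows]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    # Tie correction: sum of (t^3 - t) over every group of t tied values in a row.
    ties = sum(row.count(v) ** 3 - row.count(v) for row in rows for v in set(row))
    correction = 1 - ties / (n * k * (k * k - 1))
    if correction == 0:
        raise ValueError("all scores are tied within every row")
    stat = (12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)) / correction
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    return stat, chi2_sf_df3(stat)


# Hypothetical per-question scores (ChatGPT, Gemini, Claude, Mistral); NOT study data.
EXAMPLE_SCORES = [
    [5.0, 5.0, 4.5, 4.0],
    [4.5, 4.0, 4.5, 3.5],
    [5.0, 4.5, 4.0, 4.0],
    [4.0, 4.5, 4.0, 3.5],
    [4.5, 4.5, 5.0, 4.0],
    [5.0, 4.0, 4.5, 3.5],
]
stat, p = friedman_test(EXAMPLE_SCORES)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")
```

Because Likert-type scores tie frequently, the tie correction matters here; without it the statistic would be understated whenever several models receive the same score on a question.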
p-Values Wilcoxon Signed-Rank Test with Holm Correction | ||||
---|---|---|---|---|
| ChatGPT 4o | Google Gemini | Claude 3.5 Sonnet | Mistral Large 2 |
CT—All questions | ||||
ChatGPT 4o | - | 0.903 (n.s.) | 0.299 (n.s.) | <0.001 ** |
Google Gemini | - | - | 0.903 (n.s.) | 0.071 (n.s.) |
Claude 3.5 Sonnet | - | - | - | 0.058 (n.s.) |
Mistral Large 2 | - | - | - | - |
CT—General and technical information | ||||
ChatGPT 4o | - | 0.635 (n.s.) | 0.170 (n.s.) | 0.190 (n.s.) |
Google Gemini | - | - | 0.118 (n.s.) | 0.170 (n.s.) |
Claude 3.5 Sonnet | - | - | - | 0.914 (n.s.) |
Mistral Large 2 | - | - | - | - |
CT—Contrast media information | ||||
ChatGPT 4o | - | 1.000 (n.s.) | 1.000 (n.s.) | 0.003 ** |
Google Gemini | - | - | 1.000 (n.s.) | 0.022 * |
Claude 3.5 Sonnet | - | - | - | 0.004 ** |
Mistral Large 2 | - | - | - | - |
CT—Contrast media information—thyroid gland | ||||
ChatGPT 4o | - | 1.000 (n.s.) | 1.000 (n.s.) | 0.097 (n.s.) |
Google Gemini | - | - | 1.000 (n.s.) | 0.328 (n.s.) |
Claude 3.5 Sonnet | - | - | - | 0.097 (n.s.) |
Mistral Large 2 | - | - | - | - |
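The pairwise p-values above are Holm-corrected. As a minimal sketch of that step-down adjustment: the raw p-values would come from a Wilcoxon signed-rank test per model pair (for instance `scipy.stats.wilcoxon`), but the source does not report them, so `RAW_P` below is hypothetical and `holm_adjust` is an illustrative name rather than the authors' implementation.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on; cap at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p may not fall below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# The six model pairs compared in each block of the table above.
PAIRS = [
    ("ChatGPT 4o", "Google Gemini"),
    ("ChatGPT 4o", "Claude 3.5 Sonnet"),
    ("ChatGPT 4o", "Mistral Large 2"),
    ("Google Gemini", "Claude 3.5 Sonnet"),
    ("Google Gemini", "Mistral Large 2"),
    ("Claude 3.5 Sonnet", "Mistral Large 2"),
]
# Hypothetical raw Wilcoxon p-values, one per pair; NOT taken from the study.
RAW_P = [0.45, 0.08, 0.0001, 0.40, 0.02, 0.015]

for (a, b), p_adj in zip(PAIRS, holm_adjust(RAW_P)):
    print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```

The cap at 1 and the monotonicity step are consistent with the repeated values of 1.000 in the table, although the table alone does not show which raw p-values produced them.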
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Eminovic, S.; Levita, B.; Dell’Orco, A.; Leppig, J.A.; Nawabi, J.; Penzkofer, T. Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations. J. Pers. Med. 2025, 15, 235. https://doi.org/10.3390/jpm15060235