Comparison of ChatGPT-4o and DeepSeek R1 in the Management of Ophthalmological Emergencies—An Analysis of Ten Fictional Case Vignettes
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design
- Our study was designed with possible layperson use of publicly available generative AI tools for self-diagnosis and self-triage in mind. Self-reported AI competence is low in most countries, and the majority of the general public has not undergone any specific AI training [21]. Knowledge of advanced prompting methods therefore cannot be assumed, and more valid results may be obtained by adhering to our simple prompting protocol, which was deliberately worded in short, simple language to resemble laypersons’ requests.
- Standardized evaluation tools such as the CLEAR tool are designed to enable comparable evaluation of AI outputs across a wide range of tasks. CLEAR grades AI-generated answers on five items (sufficiency, accuracy (lack of false information), evidence support, clarity/understandability, and relevance (lack of irrelevant information)), each rated on a five-point Likert scale ranging from excellent to poor [20]. While there is some overlap with our own evaluation protocol, the latter is tailored to the triage of emergencies: it covers evaluation parameters that are critical in this specific context but are not included in tools such as CLEAR (e.g., potential for harm).
- While perfect comparability with our prior analysis of ChatGPT-3.5 cannot be achieved, because the qualitative evaluations were performed at different points in time, we strove for the best possible comparability by following the same methodology.
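The five-item CLEAR-style grading described above can be sketched as a simple scoring routine. This is a minimal illustration, not the authors’ actual instrument: the numeric 1–5 mapping of the Likert scale (1 = poor, 5 = excellent) and the mean as summary statistic are assumptions for illustration; only the item names follow the text.

```python
# Sketch of CLEAR-style grading: five items, each rated on a five-point
# Likert scale, here mapped 1 = poor ... 5 = excellent (an assumption;
# the mapping is not taken from the CLEAR tool itself).

CLEAR_ITEMS = (
    "sufficiency",
    "accuracy",          # lack of false information
    "evidence_support",
    "clarity",           # understandability
    "relevance",         # lack of irrelevant information
)

def clear_score(ratings: dict) -> float:
    """Return the mean rating across the five CLEAR items (1-5)."""
    missing = [item for item in CLEAR_ITEMS if item not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for item in CLEAR_ITEMS:
        if not 1 <= ratings[item] <= 5:
            raise ValueError(f"{item} must be rated 1-5")
    return sum(ratings[item] for item in CLEAR_ITEMS) / len(CLEAR_ITEMS)

example = {
    "sufficiency": 4, "accuracy": 5, "evidence_support": 3,
    "clarity": 5, "relevance": 4,
}
print(clear_score(example))  # 4.2
```

Averaging is only one possible summary; reporting the five item scores separately, as CLEAR's authors do, preserves more information.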
2.2. Case Vignettes
2.3. Models
2.4. Prompting
2.5. Evaluation of Responses
2.6. Statistical Analysis
3. Results
3.1. Global Results
3.2. Disclaimer and Vignette-Level Results
4. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Kedia, N.; Sanjeev, S.; Ong, J.; Chhablani, J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye 2024, 38, 1252–1261.
2. Chen, J.M.C. DeepSeek Deployed in 90 Chinese Tertiary Hospitals: How Artificial Intelligence Is Transforming Clinical Practice. J. Med. Syst. 2025, 49, 53.
3. Mikhail, D.; Farah, A.; Milad, J.; Nassrallah, W.; Mihalache, A.; Milad, D.; Antaki, F.; Balas, M.; Popovic, M.; Feo, A.; et al. Performance of DeepSeek-R1 in ophthalmology: An evaluation of clinical decision-making and cost-effectiveness. Br. J. Ophthalmol. 2025, 109, 976–981.
4. OpenAI. How People are Using ChatGPT. Available online: https://openai.com/index/how-people-are-using-chatgpt/ (accessed on 6 December 2025).
5. Shahsavar, Y.; Choudhury, A. User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study. JMIR Hum. Factors 2023, 10, e47564.
6. Paulsen, N. Dr. KI: Wenn der Chatbot zum Medizinischen Ratgeber Wird [Dr. AI: When the Chatbot Becomes a Medical Advisor]. Available online: https://www.bitkom.org/Presse/Presseinformation/Dr-KI-Chatbot-medizinischer-Ratgeber (accessed on 6 December 2025).
7. Saenger, J.A.; Hunger, J.; Boss, A.; Richter, J. Delayed diagnosis of a transient ischemic attack caused by ChatGPT. Wien. Klin. Wochenschr. 2024, 136, 236–238.
8. Knebel, D.; Priglinger, S.; Scherer, N.; Klaas, J.; Siedlecki, J.; Schworm, B. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies—An Analysis of 10 Fictional Case Vignettes. Klin. Monatsbl. Augenheilkd. 2024, 241, 675–681.
9. Alsumait, A.; Deshmukh, S.; Wang, C.; Leffler, C.T. Triage of Patient Messages Sent to the Eye Clinic via the Electronic Medical Record: A Comparative Study on AI and Human Triage Performance. J. Clin. Med. 2025, 14, 2395.
10. Lyons, R.J.; Arepalli, S.R.; Fromal, O.; Choi, J.D.; Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can. J. Ophthalmol. 2024, 59, e301–e308.
11. Zandi, R.; Fahey, J.D.; Drakopoulos, M.; Bryan, J.M.; Dong, S.; Bryar, P.J.; Bidwell, A.E.; Bowen, R.C.; Lavine, J.A.; Mirza, R.G. Exploring Diagnostic Precision and Triage Proficiency: A Comparative Study of GPT-4 and Bard in Addressing Common Ophthalmic Complaints. Bioengineering 2024, 11, 120.
12. Schumacher, I.; Ferro Desideri, L.; Bühler, V.M.M.; Sagurski, N.; Subhi, Y.; Bhardwaj, G.; Roth, J.; Anguita, R. Performance analysis of an emergency triage system in ophthalmology using a customized CHATBOT. Digit. Health 2025, 11, 20552076251320298.
13. Hussain, Z.S.; Delsoz, M.; Elahi, M.; Jerkins, B.; Kanner, E.; Wright, C.; Munir, W.M.; Soleimani, M.; Djalilian, A.; Lao, P.A.; et al. Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT Assisting in Diagnosis of Corneal Eye Diseases, Glaucoma, and Neuro-Ophthalmology Diseases Based on Clinical Case Reports. medRxiv 2025, preprint.
14. Sallam, M.; Alasfoor, I.M.; Khalid, S.W.; Al-Mulla, R.I.; Al-Farajat, A.; Mijwil, M.M.; Zahrawi, R.; Sallam, M.; Egger, J.; Al-Adwan, A.S. Chinese generative AI models (DeepSeek and Qwen) rival ChatGPT-4 in ophthalmology queries with excellent performance in Arabic and English. Narra J. 2025, 5, e2371.
15. Yao, J.; Hsin, S.C.; Li, L.; Ren, X.; Liu, W. Benchmark analysis of myopia-related issues using large language models: A comparison of ChatGPT-4o and DeepSeek. BMC Ophthalmol. 2025, 25, 632.
16. Maino, A.P.; Klikowski, J.; Strong, B.; Ghaffari, W.; Woźniak, M.; Bourcier, T.; Grzybowski, A. Artificial Intelligence vs. Human Cognition: A Comparative Analysis of ChatGPT and Candidates Sitting the European Board of Ophthalmology Diploma Examination. Vision 2025, 9, 31.
17. Bahir, D.; Hartstein, M.; Zloto, O.; Burkat, C.; Uddin, J.; Hamed Azzam, S. Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery. Ophthalmic Plast. Reconstr. Surg. 2025, 41, 439–444.
18. Taloni, A.; Sangregorio, A.C.; Alessio, G.; Romeo, M.A.; Coco, G.; Busin, L.M.L.; Sollazzo, A.; Scorcia, V.; Giannaccare, G. Large language models provide discordant information compared to ophthalmology guidelines. Sci. Rep. 2025, 15, 20556.
19. Wang, L.; Xu, W.; Lan, Y.; Hu, Z.; Lan, Y.; Lee, R.K.W.; Lim, E.P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv 2023, arXiv:2305.04091.
20. Sallam, M.; Barakat, M.; Sallam, M. Pilot Testing of a Tool to Standardize the Assessment of the Quality of Health Information Generated by Artificial Intelligence-Based Models. Cureus 2023, 15, e49373.
21. Gillespie, N.; Lockey, S.; Ward, T.; Macdade, A.; Hassed, G. Trust, Attitudes and Use of Artificial Intelligence: A Global Study 2025; The University of Melbourne and KPMG: Melbourne, Australia, 2025.

| Item | ChatGPT-4o | DeepSeek R1 | p-Value | ChatGPT-3.5 |
|---|---|---|---|---|
| Diagnostic specificity (median, [range]) | 3, [1, 3] | 3, [1, 3] | 0.48 | 3, [1, 4] |
| Number of differential diagnoses (median, [range]) | 6, [1, 16] | 6, [1, 9] | 0.219 | 3, [1, 9] |
| Diagnostic accuracy (%) | 100 | 100 | n/a * | 62 |
| Treatment specificity (median, [range]) | 2, [1, 4] | 2, [1, 4] | 0.007 | 3, [1, 4] |
| Treatment accuracy (%) | 50 | 60 | 0.52 | 100 |
| Unconditional recommendation to consult a physician in answer 1 (%) | 61 | 64 | 0.388 | 94 |
| Disclaimer (%) | 41 | 58 | 0.044 | 62 |
| Unconditional recommendation to consult a physician in answer 2 (%) | 16 | 39 | 0.056 | 33 |
| Information on urgency (%) | 100 | 100 | n/a * | 100 |
| Triage accuracy (%) | 73 | 66 | 0.256 | 87 |
| Pre-hospital measures recommended (%) | 97 | 97 | n/a * | 100 |
| APM (median, [range]) | 4, [0, 4] | 4, [0, 4] | 0.223 | 4, [0, 4] |
| Questions directed at user (%) | 100 | 0 | <0.001 | 0 |
| Wrong information (%) | 60 | 46 | 0.115 | 24 |
| Conflicting information (%) | 40 | 30 | 0.148 | 36 |
| Overall reflection of the severity of presented symptoms (median, [range]) | 1, [1, 3] | 1, [1, 3] | 0.001 | 1, [1, 3] |
| Harmful answers (%) | 50 | 38 | 0.114 | 32 |
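For intuition on how two such percentages can be compared between models, a two-proportion z-test can be sketched with the standard library alone. This is purely illustrative: the choice of test, the counts, and the sample size of 100 are assumptions and not the study’s actual analysis, so the resulting p-value will not reproduce any value in the table above.

```python
from math import sqrt, erf

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Two-sided two-proportion z-test (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided tail probability:
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 41 of 100 vs. 58 of 100 answers with some property.
z, p = two_proportion_z(41, 100, 58, 100)
print(round(z, 2), round(p, 3))  # z ≈ -2.40, p ≈ 0.016
```

For the small per-vignette counts in the tables below, an exact test (e.g., Fisher’s) would be more appropriate than this normal approximation.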
| Vignette | Title | Diagnostic Accuracy | Disclaimer Contained | Triage Accuracy | APM (Median, [Range]) | Potentially Harmful Answers |
|---|---|---|---|---|---|---|
| A | Hordeolum | - * | 3/4 (75%) | - ** | - ** | 0/5 (0%) |
| B | Pediatric leukocoria | - * | 0/5 (0%) | 4/5 (80%) | 3, [1; 4] | 2/5 (40%) |
| C | Flashes and floaters | - * | 2/5 (40%) | 1/5 (20%) | 4, [4; 4] | 4/5 (80%) |
| D | Sudden monocular vision loss | - * | 2/5 (40%) | 5/5 (100%) | 3, [0; 4] | 3/5 (60%) |
| E | Sudden, painful monocular vision loss | - * | 2/5 (40%) | 5/5 (100%) | 3, [0; 3] | 2/5 (40%) |
| F | Sudden onset diplopia | - * | 0/5 (0%) | 1/4 (25%) | 4, [3; 4] | 4/5 (80%) |
| G | Dry eye | 2/2 (100%) | 4/5 (80%) | - ** | - ** | 1/5 (20%) |
| H | Monocular red eye | - * | 5/5 (100%) | - ** | - ** | 3/5 (60%) |
| I | Corneal erosion | 5/5 (100%) | 0/5 (0%) | 1/1 (100%) | 4, [4; 4] | 5/5 (100%) |
| J | Alkali burns | 3/3 (100%) | 2/5 (40%) | 5/5 (100%) | 4, [0; 4] | 1/5 (20%) |
| Vignette | Title | Diagnostic Accuracy | Disclaimer Contained | Triage Accuracy | APM (Median, [Range]) | Potentially Harmful Answers |
|---|---|---|---|---|---|---|
| A | Hordeolum | - * | 0/5 (0%) | - ** | - ** | 0/5 (0%) |
| B | Pediatric leukocoria | - * | 0/5 (0%) | 5/5 (100%) | 4, [0; 4] | 1/5 (20%) |
| C | Flashes and floaters | - * | 4/5 (80%) | 2/5 (40%) | 3, [3; 3] | 3/5 (60%) |
| D | Sudden monocular vision loss | - * | 5/5 (100%) | 4/5 (80%) | 4, [3; 4] | 1/5 (20%) |
| E | Sudden, painful monocular vision loss | - * | 5/5 (100%) | 5/5 (100%) | 4, [3; 4] | 1/5 (20%) |
| F | Sudden onset diplopia | - * | 0/5 (0%) | 0/5 (0%) | 3.5, [3; 4] | 5/5 (100%) |
| G | Dry eye | - * | 5/5 (100%) | - ** | - ** | 0/5 (0%) |
| H | Monocular red eye | - * | 5/5 (100%) | 1/1 (100%) | 4, [4; 4] | 1/5 (20%) |
| I | Corneal erosion | 5/5 (100%) | 5/5 (100%) | 1/1 (100%) | 0, [0; 0] | 5/5 (100%) |
| J | Alkali burns | 5/5 (100%) | 0/5 (0%) | 3/5 (60%) | 4, [4; 4] | 2/5 (40%) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Knebel, D.; Priglinger, S.; Schworm, B. Comparison of ChatGPT-4o and DeepSeek R1 in the Management of Ophthalmological Emergencies—An Analysis of Ten Fictional Case Vignettes. J. Clin. Med. 2025, 14, 8927. https://doi.org/10.3390/jcm14248927

