Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Population
2.2. Large Language Models Evaluated
2.3. Prompt Design for LLMs
2.4. Triage Systems
2.5. Ethical Considerations
2.6. Statistical Analysis
2.7. Sub-Analysis of Clinical Features
3. Results
3.1. Triage Score Agreement
3.2. Clinic Referral Accuracy
3.3. Performance by Clinical Specialty
3.4. Admission Prediction (Outcome)
3.5. Error Bias Analysis (McNemar’s Test)
3.6. Paired Bootstrap Comparison of Top-Performing Models
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| ED | Emergency Department |
| ENT | Ear, Nose, and Throat (Otolaryngology) |
| LLM | Large Language Model |
| ESI | Emergency Severity Index |
| F1 | F1-score (harmonic mean of precision and recall) |
| κ | Cohen’s Kappa (coefficient for inter-rater agreement) |
| κw | Quadratic Weighted Cohen’s Kappa |
| χ² | Chi-squared statistic |
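
For readers less familiar with the agreement statistics abbreviated above, their standard textbook definitions can be written out as follows. This is a reference sketch rather than a formula taken from the article: the symbols $p_o$ (observed agreement), $p_e$ (chance-expected agreement), $O_{ij}$ and $E_{ij}$ (observed and chance-expected counts for reference level $i$ and model level $j$), $k$ (number of ordered categories), and $P$, $R$ (precision, recall) are notation assumed here.

```latex
% Cohen's kappa: agreement corrected for chance
\kappa = \frac{p_o - p_e}{1 - p_e}

% Quadratic weighted kappa for k ordered categories
\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, O_{ij}}
                    {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, E_{ij}},
\qquad
w_{ij} = \frac{(i-j)^2}{(k-1)^2}

% F1-score: harmonic mean of precision P and recall R
F_1 = \frac{2\,P\,R}{P + R}
```

With quadratic weights, a two-level triage disagreement is penalized four times as heavily as a one-level disagreement, which is why κw suits ordinal scales such as the ESI.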

| Model | N | κw | 95% CI | Exact Agreement (%) | Bias | Interpretation |
|---|---|---|---|---|---|---|
| DeepSeek | 35,581 | 0.467 | 0.457–0.476 | 59.4 | −0.22 | Moderate |
| Gemini 2.5 | 38,410 | 0.465 | 0.457–0.471 | 43.6 | −0.38 | Moderate |
| Claude Sonnet 4 | 37,897 | 0.402 | 0.394–0.409 | 48.0 | −0.46 | Fair |
| Qwen | 36,372 | 0.304 | 0.297–0.311 | 36.7 | −0.67 | Fair |
| Grok | 36,585 | 0.261 | 0.253–0.268 | 34.2 | −0.74 | Fair |
| Thinking GPT-5 | 38,536 | 0.258 | 0.249–0.266 | 39.5 | −0.26 | Fair |
| Instant GPT-5 | 37,884 | 0.176 | 0.167–0.186 | 40.1 | −0.15 | Slight |
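
The columns of the table above (κw, exact agreement, bias, interpretation) can be illustrated with a minimal Python sketch. It is not the authors' analysis code. Several points are assumptions made for illustration: triage levels are coded as integers 1 to 5, "bias" is taken to be the mean signed difference (model level minus reference level), and the interpretation labels follow the Landis and Koch bands applied to κw rounded to two decimals, a rule that happens to reproduce the labels in the table. The function names are hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(reference, predicted, n_levels=5):
    """Quadratic weighted Cohen's kappa for ordinal levels coded 1..n_levels."""
    ref = np.asarray(reference) - 1
    pred = np.asarray(predicted) - 1
    # Observed confusion matrix (rows: reference level, columns: model level).
    observed = np.zeros((n_levels, n_levels))
    np.add.at(observed, (ref, pred), 1)
    # Counts expected by chance from the two marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximal distance.
    idx = np.arange(n_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


def interpret_kappa(kappa):
    """Landis and Koch style label (assumed banding, kappa rounded to 2 decimals)."""
    k = round(float(kappa), 2)
    if k < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if k <= upper:
            return label
    return "Almost perfect"


def summarize_agreement(reference, predicted, n_levels=5):
    """Summary row for one model; 'bias' direction (model minus reference) is assumed."""
    ref = np.asarray(reference)
    pred = np.asarray(predicted)
    kappa_w = quadratic_weighted_kappa(ref, pred, n_levels)
    return {
        "n": int(ref.size),
        "kappa_w": float(kappa_w),
        "exact_agreement_pct": float(np.mean(ref == pred) * 100),
        "bias": float(np.mean(pred - ref)),
        "interpretation": interpret_kappa(kappa_w),
    }


if __name__ == "__main__":
    # Toy data only; these are not values from the study.
    nurse = [1, 2, 3, 4, 5, 3]
    model = [1, 2, 2, 4, 4, 3]
    print(summarize_agreement(nurse, model))
```

Under the assumed sign convention, a negative bias means the model assigned numerically lower levels than the reference on average; whether that corresponds to over-triage or under-triage depends on the direction of the triage scale in use.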

| Comparison | Δκ | 95% CI | p-Value |
|---|---|---|---|
| **Triage Score (Quadratic Weighted Kappa, κw)** | | | |
| Claude Sonnet 4 vs. DeepSeek | −0.074 | [−0.085, −0.063] | <0.001 |
| Claude Sonnet 4 vs. Gemini 2.5 | −0.059 | [−0.068, −0.050] | <0.001 |
| DeepSeek vs. Gemini 2.5 | +0.015 | [+0.005, +0.025] | 0.005 |
| **Clinic Referral (Multiclass Kappa, κ)** | | | |
| Claude Sonnet 4 vs. DeepSeek | −0.003 | [−0.010, +0.005] | 0.494 |
| Claude Sonnet 4 vs. Gemini 2.5 | +0.006 | [−0.001, +0.013] | 0.104 |
| DeepSeek vs. Gemini 2.5 | +0.008 | [+0.001, +0.016] | 0.020 |
| **Admission Prediction (Binary Kappa, κ)** | | | |
| Claude Sonnet 4 vs. DeepSeek | +0.113 | [+0.101, +0.125] | <0.001 |
| Claude Sonnet 4 vs. Gemini 2.5 | +0.091 | [+0.078, +0.104] | <0.001 |
| DeepSeek vs. Gemini 2.5 | −0.022 | [−0.035, −0.009] | <0.001 |
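
A paired bootstrap comparison of the kind summarized in the table above can be sketched as follows. This is one plausible implementation, not the authors' code. It assumes that both models are scored against the same reference labels on the same patients (cases where either model produced no usable output would have to be excluded first), that the interval is a percentile interval, and that the two-sided p-value is taken from the share of bootstrap differences on either side of zero. The function names, the `n_boot` default, and the `metric` parameter are hypothetical.

```python
import numpy as np


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa for nominal labels (binary or multiclass)."""
    ref = np.asarray(reference)
    pred = np.asarray(predicted)
    labels = np.union1d(ref, pred)
    p_observed = np.mean(ref == pred)
    p_expected = sum(np.mean(ref == lab) * np.mean(pred == lab) for lab in labels)
    if p_expected == 1.0:
        return float("nan")  # degenerate sample: a single label everywhere
    return float((p_observed - p_expected) / (1.0 - p_expected))


def paired_bootstrap_delta(reference, model_a, model_b,
                           metric=cohen_kappa, n_boot=2000, seed=0):
    """Return (delta, (ci_low, ci_high), p) for metric(A) - metric(B).

    Patients are resampled with replacement, and the same resampled indices
    are used for both models, which keeps the comparison paired.
    """
    ref, a, b = (np.asarray(x) for x in (reference, model_a, model_b))
    rng = np.random.default_rng(seed)
    n = ref.size
    point = metric(ref, a) - metric(ref, b)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        deltas[i] = metric(ref[idx], a[idx]) - metric(ref[idx], b[idx])
    # Percentile interval; degenerate (NaN) replicates are ignored.
    low, high = np.nanpercentile(deltas, [2.5, 97.5])
    # Two-sided p-value from the bootstrap distribution (assumed convention).
    p_value = min(1.0, 2 * min(np.mean(deltas <= 0), np.mean(deltas >= 0)))
    return float(point), (float(low), float(high)), float(p_value)


if __name__ == "__main__":
    # Synthetic admission labels only; these are not study data.
    rng = np.random.default_rng(1)
    truth = rng.integers(0, 2, size=400)
    model_a = np.where(rng.random(400) < 0.10, 1 - truth, truth)  # about 10% errors
    model_b = np.where(rng.random(400) < 0.40, 1 - truth, truth)  # about 40% errors
    print(paired_bootstrap_delta(truth, model_a, model_b, n_boot=500))
```

Passing a weighted-kappa function as `metric` would give the triage-score variant of the comparison. Resampling patients rather than individual model outputs is what makes the comparison paired: each replicate evaluates both models on an identical case mix, so case-level difficulty cancels out of the difference.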
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Nedos, I.; Zagalioti, S.-C.; Kofos, C.; Katsikidou, T.; Vellidou, D.; Astrinakis, K.; Karagiannis, I.; Giannakopoulos, P.; Michaloudi, S.; Apostolopoulou, A.; Karagiannidis, E.; Fyntanidou, B. Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department. J. Clin. Med. 2026, 15(4), 1512. https://doi.org/10.3390/jcm15041512

