Article

Is Artificial Intelligence Ready for Emergency Department Triage? A Retrospective Evaluation of Multiple Large Language Models in 39,375 Patients at a University Emergency Department

by Ioannis Nedos, Sofia-Chrysovalantou Zagalioti, Christos Kofos, Theoni Katsikidou, Dimitra Vellidou, Konstantinos Astrinakis, Ioannis Karagiannis, Panagiotis Giannakopoulos, Styliani Michaloudi, Aikaterini Apostolopoulou, Efstratios Karagiannidis *,† and Barbara Fyntanidou
Department of Emergency Medicine, AHEPA University General Hospital, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
J. Clin. Med. 2026, 15(4), 1512; https://doi.org/10.3390/jcm15041512
Submission received: 13 January 2026 / Revised: 5 February 2026 / Accepted: 12 February 2026 / Published: 14 February 2026

Abstract

Background: Large language models (LLMs) are increasingly proposed as clinical decision support tools. However, their reliability in emergency department (ED) triage remains insufficiently validated. This study aimed to evaluate the performance and limitations of multiple LLMs in triage using a large retrospective dataset. Methods: We conducted a retrospective analysis of 39,375 anonymized patient cases from the ED of AHEPA University General Hospital, Thessaloniki, Greece (June 2024–July 2025), extracted from the hospital’s electronic medical record system. All cases were triaged in real time according to the Emergency Severity Index (ESI) by 25 emergency physicians. In cases of uncertainty, a senior emergency physician was consulted. Seven LLMs (ChatGPT-5 Thinking, ChatGPT-5 Instant, Gemini 2.5, Qwen 3, Grok 4.0, DeepSeek v3.1, and Claude Sonnet 4) were evaluated against the physician-assigned ESI level (reference standard). Outcomes included triage score agreement (quadratic weighted kappa, κw), clinic referral accuracy and admission prediction. Subgroup analyses were performed by referral clinic and admission outcome. The study was conducted in accordance with TRIPOD-AI reporting guidelines. Results: Model performance varied substantially. DeepSeek and Claude Sonnet 4 achieved the highest agreement with physician-assigned ESI (κw ≈ 0.467; raw accuracy: 61.7%). In contrast, GPT-5 Instant performed poorly across all evaluation metrics (κw = 0.176; 95% CI: 0.167–0.186). Claude Sonnet 4 demonstrated the best performance in clinic referral (67.1%; κ = 0.619) and admission prediction (κ ≈ 0.46). Subgroup analyses indicated higher performance in pediatric cases and organ-specific complaints, such as ophthalmology (up to 81% accuracy). LLMs also showed tendencies toward over- or under-triage. Conclusions: Current LLMs demonstrate promising but inconsistent capability in triage. While selected models achieved moderate alignment with physician ESI decisions, none achieved strong agreement (κ > 0.80). LLMs are most suitable as supervised decision support tools, particularly in anatomically well-defined clinical scenarios, rather than as autonomous systems.

1. Introduction

Emergency departments (EDs) rely on triage scoring as a core function to prioritize patient treatment based on medical urgency [1]. Accurate triage assessments are crucial for ensuring safe patient care and the appropriate allocation of medical resources [2]. Overcrowding in EDs represents a major challenge, increasing the risk of delayed care, mis-triage and suboptimal patient outcomes [3]. To address these challenges, numerous studies have investigated strategies for improving triage training and refining decision-making processes [1,4].
The Emergency Severity Index (ESI) is a widely adopted five-level triage system designed to categorize patients based on both clinical acuity and anticipated diagnostic and therapeutic resource utilization [5]. Although different triage systems are used worldwide, no ideal triage system has yet been established [6]. Several studies have shown that trained clinicians can accurately assign triage levels, contributing to improved patient flow and better management of ED overcrowding [3,7]. To address overcrowding, various technological aids have also been explored to support triage. For instance, e-kiosks have been studied as a means of reducing pre-triage waiting times, and mobile applications have been explored to support triage decision making [8,9]. Consequently, there is growing interest in whether other data-driven tools, such as large language models (LLMs), can support accurate and consistent triage decisions, which may improve patient safety in the ED. Moreover, disparities in triage outcomes exist across sociodemographic groups, with minority patients (such as Black and Hispanic individuals) often receiving less acute triage scores than White patients, highlighting the need for unbiased decision support tools [10,11].
LLMs offer strong computational support for clinical decision making, enabling physicians to address complex diagnostic and therapeutic challenges [12]. Seven LLMs were included in this study: ChatGPT-5 (Thinking mode), ChatGPT-5 (Instant mode), DeepSeek, Claude Sonnet 4, Qwen, Grok, and Gemini 2.5. These models differ in architecture, reasoning style, and processing capabilities, which may influence their performance in clinical decision-making tasks [13,14,15,16,17]. Their flexible architectures allow them to process unstructured clinical data, medical guidelines and diverse biomedical sources with increasing precision when multiple data types and complex patient information must be considered [18,19]. Initial assessments of LLMs show both potential benefits and current limitations. In precision oncology, LLMs frequently generated useful complementary ideas and showed notable potential in tasks such as rapidly filtering the biomedical literature to support evidence-based therapeutic planning [20]. Similarly, evaluations of GPT-3.5 and GPT-4 across real-world clinical scenarios revealed that while LLMs could provide reasonable suggestions for diagnostic steps, examinations, and treatments, they struggled most with proposing accurate initial diagnoses [21].
Current research on LLM applications in triage demonstrates considerable promise; however, reliability and accuracy still lag behind experienced clinicians [17]. Several studies report tendencies for LLMs to overestimate severity and urgency, increasing false-positive emergency classifications [22,23], while others indicate potential underestimation of clinical risk [22]. Limitations of existing studies include small sample sizes (124–4000 patients), assessment of single triage scales, and evaluation of only a few LLMs, making it difficult to derive reliable and broadly generalizable conclusions [22,23,24,25,26,27,28].
This study aims to provide a comprehensive, multidimensional evaluation of the diagnostic performance of existing LLMs in emergency triage using the ESI. The large clinical sample supports a more robust and generalizable assessment of model capabilities, helping to overcome key limitations observed in prior research.

2. Materials and Methods

2.1. Study Design and Population

We retrospectively compiled 39,375 emergency cases from 1 June 2024 to 21 July 2025 in the interdisciplinary ED of AHEPA University General Hospital (American Hellenic Educational Progressive Association), Thessaloniki, Greece. The study was conducted in accordance with TRIPOD-AI reporting guidelines [29]. The sample size of 39,375 cases exceeds those used in prior research on LLMs in triage, providing sufficient statistical power to robustly evaluate model performance in triage decisions [22,23,24,25,26,27,28].
All cases were extracted from the hospital’s electronic medical records and converted into standardized, anonymized English-language case vignettes for model evaluation. All presenting symptoms and vital signs were recorded as free text in the hospital’s electronic medical record. These fields did not contain personally identifying information. Moreover, direct identifiers such as patient names, medical record numbers and dates of birth were removed during a manual review prior to vignette creation. This review ensured that all vignettes were fully anonymized and safe for further analysis.
Patient ages ranged from 0 to 106 years, and patients of both sexes were represented. Age and presenting symptoms were included in all LLM evaluations; vital signs and level of consciousness were added when documented at triage. These predictors were used in their raw form, with no rescaling or standardization. Missing data were not imputed: each vignette contained only the information that was actually documented. This was a retrospective observational study of triage decisions at ED presentation; no follow-up was conducted.
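As an illustration of the "available information only, no imputation" rule, the following minimal sketch assembles a vignette from one record. The function name, field labels and sentence layout are hypothetical; the study does not report the exact vignette template.

```python
from typing import Optional

def build_vignette(age: int, symptoms: str,
                   vitals: Optional[dict] = None,
                   consciousness: Optional[str] = None) -> str:
    """Assemble a free-text vignette; undocumented fields are simply omitted."""
    parts = [f"Age: {age} years", f"Presenting symptoms: {symptoms}"]
    if vitals:
        # Keep only vital signs that were actually documented (no imputation).
        documented = ", ".join(
            f"{name}: {value}" for name, value in vitals.items() if value is not None
        )
        if documented:
            parts.append(f"Vital signs: {documented}")
    if consciousness:
        parts.append(f"Level of consciousness: {consciousness}")
    return ". ".join(parts) + "."

# Hypothetical record with one documented and one missing vital sign.
example = build_vignette(67, "chest pain",
                         {"heart rate": 110, "oxygen saturation": None})
print(example)
```

A record with no documented vital signs yields a vignette containing age and symptoms only, mirroring the low-acuity presentations described in Section 2.4.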
All patient records were originally triaged in real time according to ESI by 25 physicians, who receive ESI training every year. In cases of uncertainty, a senior emergency physician was consulted.
Inclusion and exclusion criteria were defined as follows. All ED encounters during the study period were eligible. For each model, cases were excluded from analysis if the model output was invalid (e.g., non-existent clinic names, responses outside the expected format, or outputs not mapping to predefined hospital categories). Exclusion rates varied across models, ranging from 2.1% (Thinking GPT-5, n = 839) to 9.6% (DeepSeek, n = 3794), yielding analyzable samples of 35,581–38,536 cases per model. These exclusions were reported descriptively as an indicator of output-format compliance (instruction-following robustness). No official recording of ethnicity or other sociodemographic characteristics beyond age and sex was available in the hospital records; therefore, potential disparities across ethnic or demographic groups could not be directly assessed.
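The reported exclusion rates and analyzable sample sizes follow directly from the total of 39,375 encounters. A short arithmetic check, using only the two extreme counts given above (the helper name is illustrative):

```python
TOTAL_ENCOUNTERS = 39_375  # all ED encounters in the study period

# Invalid-output counts reported in the text for the two extremes.
excluded = {"Thinking GPT-5": 839, "DeepSeek": 3_794}

def analyzable(total: int, n_excluded: int) -> tuple[int, float]:
    """Return (analyzable cases, exclusion rate in percent)."""
    return total - n_excluded, 100.0 * n_excluded / total

summary = {model: analyzable(TOTAL_ENCOUNTERS, n) for model, n in excluded.items()}
for model, (n_valid, rate) in summary.items():
    print(f"{model}: {n_valid} analyzable, {rate:.1f}% excluded")
```

This reproduces the stated range of 35,581–38,536 analyzable cases and exclusion rates of 2.1–9.6%.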

2.2. Large Language Models Evaluated

Seven LLMs were evaluated: ChatGPT-5 (Thinking mode), ChatGPT-5 (Instant mode), DeepSeek, Claude Sonnet 4, Qwen, Grok, and Gemini 2.5. The rationale for model selection was to assess a range of widely available LLMs in a real-world triage scenario. Models were accessed during the study period through their publicly available interfaces with default generation parameters. No additional training, fine-tuning, updating or recalibration was performed. Each model was evaluated on the full set of eligible cases.

2.3. Prompt Design for LLMs

A single, fixed prompt was used for all models to ensure consistency and comparability across evaluations. The prompt was not modified or optimized for individual models, reflecting a real-world, survey-style deployment rather than model-specific prompt engineering.
The following instruction was provided to all LLMs:
“I will provide you with clinical information regarding specific patients, including their age and the symptoms with which they presented to the emergency department. Based on these data, I would like you to generate a structured summary containing specific elements in a predetermined order. More precisely, I require you to carefully apply the distinctions and decision-making algorithms of the five-level ESI triage system, and to calculate the corresponding score for each patient. Subsequently, you should determine the most appropriate referral clinic, which must be written in Greek capital letters. Finally, you should indicate whether hospital admission is required, using 1 to denote the need for admission/hospitalization and 0 to denote that admission is not required. The results must appear strictly in the following order: ESI—Referral Clinic—Admission Prediction, and they should be presented in Excel-compatible format to facilitate further data processing. The available referral clinics are: Vascular Surgery, Cardiology, Cardiac Surgery, Neurology, Neurosurgery, Nephrology, Orthopedics, Pathology, Surgery, Psychiatry, Otolaryngology, Ophthalmology, Pediatrics, Fast-Track, Shock-Room. You should select the most appropriate one for each case and provide it in Greek capital letters.”
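The prompt fixes a three-field output (ESI, referral clinic, admission flag), and Section 2.1 excludes responses that break this format. The sketch below shows one way such a validity check could look. It is not the study's pipeline: the tab separator, the function and class names, and the two Greek clinic labels are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class TriageOutput(NamedTuple):
    esi: int      # ESI level, 1 (most urgent) to 5
    clinic: str   # referral clinic label
    admit: int    # 1 = admission required, 0 = not required

def parse_output(line: str, allowed_clinics: set[str],
                 sep: str = "\t") -> Optional[TriageOutput]:
    """Return the parsed row, or None if it violates the expected format."""
    parts = [p.strip() for p in line.split(sep)]
    if len(parts) != 3:
        return None
    esi_text, clinic, admit_text = parts
    if esi_text not in {"1", "2", "3", "4", "5"}:
        return None  # ESI outside the five-level scale
    if clinic not in allowed_clinics:
        return None  # non-existent or unmapped clinic name
    if admit_text not in {"0", "1"}:
        return None  # admission flag must be binary
    return TriageOutput(int(esi_text), clinic, int(admit_text))

# Hypothetical Greek labels standing in for two of the 15 clinics.
ALLOWED = {"ΚΑΡΔΙΟΛΟΓΙΚΗ", "ΟΦΘΑΛΜΟΛΟΓΙΚΗ"}

valid_row = parse_output("2\tΚΑΡΔΙΟΛΟΓΙΚΗ\t1", ALLOWED)
invalid_row = parse_output("2\tCARDIOLOGY\t1", ALLOWED)
print(valid_row, invalid_row)
```

Rows for which the function returns None would count toward a model's exclusion rate.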

2.4. Triage Systems

Each case was evaluated by all seven LLMs using the five-level ESI triage system [5]. The ESI results of the LLMs were directly compared to physician-assigned ESI levels to assess agreement. At the time of triage, physicians were able to measure all vital signs and assess level of consciousness; however, these parameters were not documented for every patient. Consequently, LLMs were provided exclusively with information recorded at triage. When vital signs or level of consciousness were measured by the triage physician, they were documented and made equally available to the LLMs, ensuring comparable clinical information for both human and AI assessments. Vital signs were documented for the majority of encounters; missing vital signs occurred mainly in a small subset of clearly low-acuity presentations in which full vital sign measurement was not routinely performed. Recorded vital signs included heart rate, respiratory rate, oxygen saturation, and, when available, blood pressure, temperature, and glucose.

2.5. Ethical Considerations

This study was conducted in accordance with the Declaration of Helsinki and was approved by the local ethics committee (protocol code 212/2025; date of approval 23 May 2025). Because of the retrospective design and the use of anonymized data, informed consent was not required. Patients and the public were not involved in the design, conduct, reporting, interpretation or dissemination of this study.

2.6. Statistical Analysis

Statistical analysis evaluated agreement between physician triage decisions and LLM predictions for the ESI five-level triage scale. Quadratic weighted Cohen’s kappa (κw) was used, as it penalizes large disagreements (e.g., confusing Level 1 with Level 5) more heavily than minor discrepancies [30,31]. For binary admission decisions, standard Cohen’s kappa was calculated. Classification performance was evaluated using accuracy, sensitivity and specificity, as well as F1-scores where appropriate. For multiclass clinic referral, the weighted F1-score was used to account for class imbalance across specialties. Agreement for the 15-specialty clinic referral was assessed using multiclass Cohen’s kappa. Confusion matrices were constructed to illustrate patterns of errors between physicians and LLMs. McNemar’s test was used to compare LLM predictions against physician-assigned admission decisions. Agreement strength was interpreted as follows: ≤0.20, poor; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect [32]. Additional sensitivity analyses comparing the top-performing models were conducted using paired bootstrap resampling.
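For readers unfamiliar with the statistic, quadratic weighted kappa can be written as κw = 1 − Σ w_ij·O_ij / Σ w_ij·E_ij, where O is the observed confusion matrix, E the matrix expected from the marginal totals, and w_ij = (i − j)²/(k − 1)² for k ordered levels. The following is a minimal pure-Python sketch of that definition, not the authors' analysis code; the function names are illustrative, and the interpretation cut-offs follow the scale given above with 0.20 treated as "poor" (an assumption).

```python
def quadratic_weighted_kappa(rater_a, rater_b, levels=(1, 2, 3, 4, 5)):
    """Quadratic weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # larger gaps cost more
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - weighted_observed / weighted_expected

def interpret(kappa):
    """Map a kappa value to the agreement labels used in this section."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Toy data: two raters who agree or disagree only at the extremes of the scale.
chance_level = quadratic_weighted_kappa([1, 1, 5, 5], [1, 5, 1, 5])
print(chance_level, interpret(0.467), interpret(0.176))
```

In the toy example the raters agree exactly as often as chance predicts, so κw is 0; identical rating lists give κw = 1.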
To assess whether small observed differences in agreement reflected systematic performance variation rather than sampling noise, we conducted a paired bootstrap analysis among the three top-performing models identified in the primary analyses (Claude Sonnet 4, DeepSeek, and Gemini 2.5). Analyses were restricted to cases with valid outputs from all three models to ensure identical patient samples. Bootstrap resampling preserved within-case pairing across models.
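The paired resampling described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's code: the function names, the number of replicates, the percentile interval and the use of plain accuracy as the metric are all choices made here, and an agreement metric such as kappa could be passed in instead.

```python
import random

def accuracy(reference, predicted):
    """Share of cases where the prediction equals the reference."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

def paired_bootstrap_diff(reference, pred_a, pred_b, metric,
                          n_boot=1000, seed=0):
    """Percentile 95% interval for metric(A) - metric(B) on matched cases."""
    rng = random.Random(seed)
    n = len(reference)
    diffs = []
    for _ in range(n_boot):
        # One index sample is shared by both models, preserving case pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        ref = [reference[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(ref, a) - metric(ref, b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper

# Toy data: model A is always right, model B is always wrong.
reference = [1, 2, 3, 4, 5] * 20
always_right = list(reference)
always_wrong = [r % 5 + 1 for r in reference]
interval = paired_bootstrap_diff(reference, always_right, always_wrong,
                                 accuracy, n_boot=200)
print(interval)
```

An interval that excludes zero suggests a systematic difference between the two models on the same patients rather than sampling noise.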

2.7. Sub-Analysis of Clinical Features

Prespecified subgroup analyses were conducted to explore variation in model performance across clinically relevant strata. Performance metrics were stratified by referral clinic and admission outcome.

3. Results

3.1. Triage Score Agreement

From 39,375 emergency department encounters, valid LLM responses were analyzed for concordance with physician-assigned ESI triage levels. Figure 1 summarizes agreement using quadratic weighted Cohen’s kappa (κw) for each model. DeepSeek demonstrated the highest agreement with physician ESI assessments (κw = 0.467; 95% CI: 0.457–0.476), followed closely by Gemini 2.5 (κw = 0.465; 95% CI: 0.457–0.471). Both models achieved moderate agreement. Claude Sonnet 4 showed slightly lower agreement (κw = 0.402; 95% CI: 0.394–0.409), remaining at the boundary between fair and moderate concordance.
Qwen (κw = 0.304; 95% CI: 0.297–0.311), Grok (κw = 0.261; 95% CI: 0.253–0.268), and Thinking GPT-5 (κw = 0.258; 95% CI: 0.249–0.266) demonstrated fair agreement, while Instant GPT-5 showed poor agreement with physician triage decisions (κw = 0.176; 95% CI: 0.167–0.186).
Across all models, negative mean bias values indicated a general tendency toward over-triage relative to physician assessments; exact agreement percentages are provided for descriptive comparison (Table 1).
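The direction of bias can be made concrete with a small sketch. It assumes that bias is computed as model ESI minus physician ESI, so that a negative value means the model assigned a more urgent (numerically lower) level; this sign convention is inferred from the sentence above, and the function name is illustrative.

```python
def triage_bias(physician_esi, model_esi):
    """Summarize signed ESI differences (model minus physician)."""
    diffs = [m - p for p, m in zip(physician_esi, model_esi)]
    n = len(diffs)
    return {
        "mean_bias": sum(diffs) / n,                        # negative = over-triage
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "over_triage": sum(d < 0 for d in diffs) / n,       # model more urgent
        "under_triage": sum(d > 0 for d in diffs) / n,      # model less urgent
    }

# Toy example with four cases.
bias = triage_bias(physician_esi=[3, 3, 4, 2], model_esi=[2, 3, 3, 2])
print(bias)
```

In this toy example half of the cases are over-triaged by one level, giving a mean bias of −0.5.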

3.2. Clinic Referral Accuracy

LLM performance in clinic referrals was evaluated across 15 specialty categories using accuracy and multiclass Cohen’s kappa. Claude Sonnet 4 achieved the highest agreement with physician referral decisions (accuracy: 67.1%; κ = 0.619, 95% CI: 0.614–0.624), corresponding to substantial agreement (Figure 2). DeepSeek demonstrated comparable performance (accuracy: 66.8%; κ = 0.615, 95% CI: 0.608–0.620), followed by Gemini 2.5 (accuracy: 64.5%; κ = 0.597, 95% CI: 0.591–0.602) and Grok (accuracy: 63.8%; κ = 0.580, 95% CI: 0.575–0.586).
Thinking GPT-5 showed moderate agreement (κ = 0.416, 95% CI: 0.411–0.423), while Instant GPT-5 demonstrated poor performance (κ = 0.229, 95% CI: 0.224–0.235). No model reached the threshold for strong agreement (κ > 0.80).
A representative confusion matrix for the highest-performing model (Claude Sonnet 4) is shown in Figure 3. Correct classifications clustered along the diagonal, with strong performance in anatomically well-defined specialties such as Ophthalmology and Pediatrics. In contrast, lower recall for severity-based categories, such as Fast Track and Shock Room, was shown, with a substantial proportion of cases misclassified as Internal Medicine.

3.3. Performance by Clinical Specialty

Performance varied substantially across clinical specialties. Severity-based routing categories, including Fast Track and Shock Room, demonstrated the highest misclassification rates, with F1-scores below 0.25. In contrast, anatomically well-defined specialties showed consistently higher performance.
For the highest-performing model (Claude Sonnet 4), classification performance was strongest in Ophthalmology (F1 = 0.872), Pediatrics (F1 = 0.849), and Otolaryngology (ENT; F1 = 0.810). Moderate performance was observed in Cardiology (F1 = 0.740), Neurology (F1 = 0.707), and Internal Medicine (F1 = 0.650) (Figure 4).
Performance was poorest in Orthopedics (F1 = 0.018), Shock Room (F1 = 0.185), and Fast Track (F1 = 0.235), highlighting persistent challenges in severity-based referral categories.
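The per-specialty F1-scores above, and the weighted F1-score named in Section 2.6, can be computed as in the following minimal sketch. Each class is scored as F1 = 2TP / (2TP + FP + FN), and the weighted average uses the number of physician-assigned cases per class as weights. The function names and toy labels are illustrative and not taken from the study.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1-score per class, treating each class one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 weighted by each class's true support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in support) / len(y_true)

# Toy referral labels (hypothetical short names).
truth = ["EYE", "EYE", "PED", "PED"]
predicted = ["EYE", "PED", "PED", "PED"]
class_scores = per_class_f1(truth, predicted)
overall = weighted_f1(truth, predicted)
print(class_scores, overall)
```

Weighting by support keeps rare categories from dominating the summary while still letting their individual F1-scores be reported separately, as done for the specialties above.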

3.4. Admission Prediction (Outcome)

Claude Sonnet 4 was the best-performing model, with a binary Cohen’s kappa of approximately 0.46 (Figure 5), corresponding to moderate agreement with physician admission decisions. Gemini 2.5 and DeepSeek followed with nearly identical performance (κ ≈ 0.37). ChatGPT-5 Instant showed the poorest result (κ < 0.10), indicating that its agreement with physicians’ admission decisions was barely above chance.
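Unweighted Cohen's kappa corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch for the binary admission decision follows; the function name and toy data are illustrative.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters on nominal categories."""
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data: 1 = admitted, 0 = discharged.
physician = [1, 1, 0, 0, 1, 0]
model = [1, 1, 0, 0, 0, 0]
kappa = cohen_kappa(physician, model)
print(round(kappa, 3))
```

Here five of six decisions match (raw agreement 0.833), but because half of that agreement is expected by chance, κ is about 0.667. The same function applies to the multiclass clinic referral, since it treats categories as nominal.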

3.5. Error Bias Analysis (McNemar’s Test)

McNemar’s test revealed statistically significant discordance in error patterns for all models (p < 0.001). However, the direction of bias varied critically. Qwen and DeepSeek exhibited a strong positive bias (χ² = 6162 and 2851, respectively), generating 3–5 times more false admissions than missed admissions. In contrast, Thinking GPT-5 showed a dangerous negative bias (χ² = 749), significantly favoring false discharges (missed admissions) over false positives. Gemini 2.5 demonstrated the most balanced error profile, though it still leaned toward over-admission (Figure 6).
Analysis of sensitivity and specificity revealed distinct admission decision profiles across models. Claude Sonnet 4 demonstrated the most balanced performance, maintaining comparable sensitivity (≈75%) and specificity (≈76%).
In contrast, Instant GPT-5 showed markedly low sensitivity (<25%) despite high specificity (≈85%), indicating a strong tendency toward discharge decisions. Qwen exhibited the opposite pattern, achieving high sensitivity (≈79%) but substantially lower specificity (≈57%), consistent with a tendency toward over-admission (Figure 7).
These trade-offs contextualize the overall agreement results and the systematic error asymmetries identified by McNemar’s test.
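For reference, sensitivity and specificity are computed here with the physician's admission decision as the reference standard. A minimal sketch with an illustrative function name and toy data:

```python
def sensitivity_specificity(physician_admit, model_admit):
    """Return (sensitivity, specificity) with physician decisions as reference."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(physician_admit, model_admit):
        if truth == 1 and pred == 1:
            tp += 1   # admission correctly predicted
        elif truth == 1 and pred == 0:
            fn += 1   # missed admission
        elif truth == 0 and pred == 0:
            tn += 1   # discharge correctly predicted
        else:
            fp += 1   # false admission
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 1 = admitted, 0 = discharged.
sens, spec = sensitivity_specificity(
    physician_admit=[1, 1, 1, 1, 0, 0, 0, 0],
    model_admit=[1, 1, 1, 0, 0, 0, 0, 1],
)
print(sens, spec)
```

A model that rarely predicts admission will show the low-sensitivity, high-specificity profile reported for Instant GPT-5; one that predicts admission liberally shows the reverse, as with Qwen.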
McNemar’s test demonstrated statistically significant asymmetry in admission decision errors for all evaluated models (p < 0.001), indicating systematic differences between false-positive and false-negative predictions. The magnitude and direction of this asymmetry varied across models. Qwen exhibited the largest imbalance (χ² > 6000), reflecting a strong tendency toward over-admission. In contrast, Gemini 2.5 and Thinking GPT-5 showed the smallest chi-square values, indicating more balanced false-positive and false-negative error distributions relative to physician decisions.
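McNemar's statistic depends only on the discordant pairs: with b false admissions (model admits, physician discharges) and c missed admissions (model discharges, physician admits), χ² = (b − c)² / (b + c) with one degree of freedom. The sketch below uses this uncorrected form; whether the study applied a continuity correction is not stated, so this is an assumption, and the function name is illustrative.

```python
import math

def mcnemar(physician_admit, model_admit):
    """Return (false admissions, missed admissions, chi-square, p-value)."""
    pairs = list(zip(physician_admit, model_admit))
    false_admissions = sum(p == 0 and m == 1 for p, m in pairs)
    missed_admissions = sum(p == 1 and m == 0 for p, m in pairs)
    discordant = false_admissions + missed_admissions
    chi2 = (false_admissions - missed_admissions) ** 2 / discordant
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return false_admissions, missed_admissions, chi2, p_value

# Toy data: 30 false admissions, 10 missed admissions, 60 concordant cases.
physician = [0] * 30 + [1] * 10 + [1] * 30 + [0] * 30
model = [1] * 30 + [0] * 10 + [1] * 30 + [0] * 30
b, c, chi2, p_value = mcnemar(physician, model)
print(b, c, chi2, round(p_value, 4))
```

The sign of b − c gives the direction of bias (positive for over-admission, negative for missed admissions), which is why two models can both reach p < 0.001 while erring in opposite directions.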

3.6. Paired Bootstrap Comparison of Top-Performing Models

To assess whether the small observed differences in agreement among the top-performing models reflected systematic performance variation rather than sampling noise, we conducted paired bootstrap analyses on matched patient samples with valid outputs from all three models (Claude Sonnet 4, DeepSeek, and Gemini 2.5) (Table 2). For ESI triage, DeepSeek demonstrated significantly higher agreement with physician-assigned scores compared with both Claude Sonnet 4 and Gemini 2.5. In contrast, admission prediction performance was significantly higher for Claude Sonnet 4 compared with both DeepSeek and Gemini 2.5. Differences in clinic referral agreement between models were small and not consistently statistically significant. These findings indicate that, while overall performance among the leading models was broadly comparable, specific strengths varied by task.

4. Discussion

In this large retrospective study, we evaluated the clinical decision-making performance of seven LLMs across hospital admission prediction, clinic referral selection, severity categorization, and triage scoring according to the ESI. Using a real-world dataset of 39,375 cases—one of the largest sample sizes assessed to date [23,24,25,27,28]—we observed substantial variability across models.
Overall, Claude Sonnet 4 and DeepSeek achieved the highest and most consistent agreement with physician decisions, though overall agreement remained moderate. These findings align with prior reports showing high accuracy in clinical scenarios using Claude 3.5 Sonnet [25]. In contrast, Instant GPT-5 consistently underperformed, showing low accuracy, weak kappa values, and limited stability across all tasks. These differences likely reflect variations in model architecture and reasoning strategies, consistent with prior studies demonstrating that reasoning-optimized modes enhance diagnostic accuracy [33].
Triage and specialty-specific performance varied notably. LLMs performed best in domains with clear anatomical or organ-specific patterns (ophthalmology, pediatrics and ENT), approaching near-specialist-level accuracy. The higher performance in these domains likely reflects distinct symptomatology and vocabulary that facilitate pattern recognition, a phenomenon also observed in the study by Lyons et al. [34]. However, the severity-based categories (shock room and fast track), together with orthopedics, posed significant challenges. Severity-based scenarios require synthesis of high-impact clinical signs, such as abnormal vital signs and elements of clinical gestalt, as well as contextual reasoning that remains challenging for current LLMs [35]. In our study, many of these critical cues were not consistently documented across all cases. As ESI depends heavily on such inputs, this likely constrained model performance in severity-based categories. Consequently, when evaluated primarily on inputs lacking structured clinical data, LLMs may struggle to accurately classify such cases. Additionally, the orthopedic category should be interpreted with caution, as our hospital does not have a dedicated orthopedics department and only manages a small number of urgent cases. Misclassification in severity-based categories carries significant clinical risk; for example, delayed recognition of severe conditions like acute limb ischemia can increase morbidity and mortality [36]. These findings suggest that current LLMs should be used with caution in severity-driven decision making. Providing more detailed clinical information for shock room and fast track cases could potentially improve LLMs’ severity-related predictions.
Admission prediction revealed similar trends across models. Claude Sonnet 4 achieved moderate predictive power (κ ≈ 0.46, sensitivity: ~75%, specificity: ~76%). Other models displayed substantial biases. Instant GPT-5 exhibited extremely low sensitivity (<25%), with high specificity (~85%), indicating under-triage risk. In emergency medicine, missed admissions and underestimation of clinical severity have been associated with increased mortality, making this pattern particularly concerning [37]. In contrast, Qwen and Grok tended to over-triage patients (high sensitivity: ~79%, low specificity: ~57%). This finding is consistent with previous studies, which have reported that LLMs tend toward over-triage [22,23]. These opposing biases indicate the need for careful calibration and validation of individual models before any clinical integration and also raise ethical concerns [37]. The recent literature emphasizes that inaccurate or biased LLM outputs may affect patient safety, influence care decisions and lead clinicians to over-rely on model suggestions [38,39].
The comparison between GPT-5 variants further illustrates the importance of structured reasoning. Thinking mode consistently outperformed Instant mode across all metrics (triage scoring, clinic referral accuracy, and admission prediction). This aligns with prior research showing that chain-of-thought and structured reasoning prompts improve performance in complex clinical tasks [33,40]. However, no model achieved strong agreement with clinicians (κ > 0.80), emphasizing that current LLMs lack the reliability required for independent triage decision making. This limitation is consistent with the broader artificial intelligence (AI) literature, which highlights challenges related to confabulation, limited contextual awareness and biases inherited from training data [41].
From a clinical perspective, current LLMs may offer value as supportive tools rather than autonomous decision-making tools in triage. Our analysis shows that performance differences among top-performing models were systematic and task-dependent, meaning that no single model consistently outperformed others across all clinical decisions. This highlights that their main benefit lies in augmenting clinical expertise rather than replacing it. LLMs may also help mitigate ED overcrowding by optimizing time and resource allocation [42]. Evidence suggests that physicians are willing to modify clinical decisions based on LLM assistance in standardized chest pain scenarios [43]; however, erroneous recommendations can propagate automation bias and degrade performance [44,45]. Strategies such as retrieval-augmented generation, combining multiple LLMs to reduce model-specific biases, and clinician review may reduce these risks [25,40,46].
This study has limitations. This was a single-center retrospective study in which some text-based inputs did not consistently include vital signs or level of consciousness. Although the ESI formally requires vital signs for triage, incomplete documentation occurred in a minority of cases and reflects real-world clinical practice. This inconsistent documentation of critical cues may have contributed to lower ESI assessment reliability in our study. Our hospital does not have a dedicated orthopedics department and only manages a small number of urgent cases, so the orthopedic category should be interpreted with caution. Sample sizes varied across LLMs (range: 35,581–38,536; exclusion rates: 2.1–9.6%) due to invalid model outputs. However, exclusions were based on predefined validity criteria (non-conforming clinic names or output formats) rather than case characteristics, minimizing the risk of selection bias. Additionally, LLM performance is sensitive to prompt design and model version updates, which limits reproducibility over time. Finally, the tested LLMs were generalist models that had not been fine-tuned for health applications, which likely also limited their agreement with physician-assigned ESI levels.
Future research should evaluate fine-tuned and multimodal LLM architectures and assess their performance across multiple healthcare settings. Moreover, prospective studies incorporating structured clinical inputs should assess actual patient outcomes when LLMs are part of triage workflows. These studies may further support a “triage assistant” role in real-world settings.

5. Conclusions

In summary, performance varied substantially across the seven models. While selected models demonstrated moderate alignment with physician ESI decisions and consistent performance in clinic referral and admission decisions, none achieved high-level concordance suitable for autonomous triage. LLMs performed more reliably in anatomically defined scenarios and pediatric cases but struggled with severity-based triage. These findings support the use of LLMs as adjunctive tools under clinician supervision rather than autonomous systems in triage.

Author Contributions

Conceptualization, E.K. and I.N.; formal analysis, data curation, writing—original draft, and writing—review and editing, I.N., S.-C.Z., E.K., T.K., D.V., K.A., I.K., P.G., S.M., A.A. and C.K.; data validation, I.N., S.-C.Z., C.K. and E.K.; statistical analysis and writing—review and editing, I.N., E.K. and C.K.; supervision—equal contribution, E.K. and B.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of AHEPA University Hospital (protocol code 212/2025 and date of approval 23 May 2025).

Informed Consent Statement

Patient consent was waived due to the study’s retrospective design.

Data Availability Statement

The data underlying this article will be shared upon reasonable request by the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ED: Emergency Department
ENT: Ear, Nose, and Throat (Otolaryngology)
LLM: Large Language Model
ESI: Emergency Severity Index
F1: F1-score (harmonic mean of precision and recall)
κ: Cohen’s Kappa (coefficient for inter-rater agreement)
κw: Quadratic Weighted Cohen’s Kappa
χ²: Chi-squared statistic

References

  1. Ouellet, S.; Gallani, M.C.; Fontaine, G.; Mercier, É.; Lapierre, A.; Severino, F.; Gélinas, C.; Bérubé, M. Strategies to improve the quality of nurse triage in emergency departments: A systematic review. Int. Emerg. Nurs. 2025, 81, 101639.
  2. Hodge, A.; Hugman, A.; Varndell, W.; Howes, K. A review of the quality assurance processes for the Australasian Triage Scale (ATS) and implications for future practice. Australas Emerg. Nurs. J. 2013, 16, 21–29.
  3. Zagalioti, S.-C.; Fyntanidou, B.; Exadaktylos, A.; Lallas, K.; Ziaka, M. The first positive evidence that training improves triage decisions in Greece: Evidence from emergency nurses at an Academic Tertiary Care Emergency Department. BMC Emerg. Med. 2023, 23, 60.
  4. Zagalioti, S.-C.; Ziaka, M.; Exadaktylos, A.; Fyntanidou, B. An effective triage education method for triage nurses: An overview and update. Open Access Emerg. Med. 2025, 17, 105–112.
  5. Emergency Nurses Association. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care, Version 4. [Internet]. 2020. Available online: https://sgnor.ch/fileadmin/user_upload/Dokumente/Downloads/Esi_Handbook.pdf (accessed on 27 December 2025).
  6. Tsiftsis, D.; Tasioulis, A.; Bampalis, D. Adult triage in the emergency department: Introducing a multi-layer triage system. Healthcare 2025, 13, 1070.
  7. Seo, Y.H.; Lee, K.; Jang, K. Factors influencing the classification accuracy of triage nurses in emergency department: Analysis of triage nurses’ characteristics. BMC Nurs. 2024, 23, 764.
  8. Joseph, M.J.; Summerscales, M.; Yogesan, S.; Bell, A.; Genevieve, M.; Kanagasingam, Y. The use of kiosks to improve triage efficiency in the emergency department. NPJ Digit. Med. 2023, 6, 19.
  9. Sutham, K.; Khuwuthyakorn, P.; Thinnukool, O. Thailand medical mobile application for patients triage base on criteria based dispatch protocol. BMC Med. Inform. Decis. Mak. 2020, 20, 66.
  10. Joseph, J.W.; Kennedy, M.; Landry, A.M.; Marsh, R.H.; Baymon, D.E.; Im, D.E.; Chen, P.C.; Samuels-Kalow, M.E.; Nentwich, L.M.; Elhadad, N.; et al. Race and Ethnicity and Primary Language in Emergency Department Triage. JAMA Netw. Open 2023, 6, e2337557.
  11. Patel, M.D.; Lin, P.; Cheng, Q.; Argon, N.T.; Evans, C.S.; Linthicum, B.; Liu, Y.; Mehrotra, A.; Murphy, L.; Ziya, S. Patient sex, racial and ethnic disparities in emergency department triage: A multi-site retrospective study. Am. J. Emerg. Med. 2024, 76, 29–35.
  12. Ong, J.C.L.; Jin, L.; Elangovan, K.; Lim, G.Y.S.; Lim, D.Y.Z.; Sng, G.G.R.; Ke, Y.H.; Tung, J.Y.M.; Zhong, R.J.; Koh, C.M.Y.; et al. Large language model as clinical decision support system augments medication safety in 16 clinical specialties. Cell Rep. Med. 2025, 6, 102323.
  13. GPT-5 Technical Overview and Evaluation Benchmarks. 2025. Available online: https://cdn.openai.com/gpt-5-system-card.pdf (accessed on 27 December 2025).
  14. Li, J.; Deng, Y.; Sun, Q.; Zhu, J.; Tian, Y.; Li, J.; Zhu, T. Benchmarking Large Language Models in Evidence-Based Medicine. IEEE J. Biomed. Health Inform. 2025, 29, 6143–6156.
  15. Siam, K.; Varela, A.; Faruk, J.H.; Cheng, J.Q.; Gu, H.; Al Maruf, A.; Aung, Z. Benchmarking large language models on the United States medical licensing examination for clinical reasoning and medical licensing scenarios. Sci. Rep. 2025, 16, 1387.
  16. Şan, I.; Öz, M.A.; Yortanli, M.; Genç, M.; Bulut, B.; Gür, A.; Yazici, R.; Mutlu, H.; Gönen, M.Ö. AI performance in emergency medicine fellowship examination: Comparative analysis of ChatGPT-4o, Gemini 2.0, Claude 3.5, and DeepSeek R1 models. Turk. J. Med. Sci. 2025, 55, 1292–1299.
  17. Shan, G.; Chen, X.; Wang, C.; Liu, L.; Gu, Y.; Jiang, H.; Shi, T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med. Inform. 2025, 13, e64963.
  18. Wiest, I.C.; Bhat, M.; Clusmann, J.; Schneider, C.V.; Jiang, X.; Kather, J.N. Large language models for clinical decision support in gastroenterology and hepatology. Nat. Rev. Gastroenterol. Hepatol. 2025, 22, 773–787.
  19. Shool, S.; Adimi, S.; Amleshi, R.S.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117.
  20. Benary, M.; Wang, X.D.; Schmidt, M.; Soll, D.; Hilfenhaus, G.; Nassir, M.; Sigler, C.; Knödler, M.; Keller, U.; Beule, D.; et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 2023, 6, e2343689.
  21. Sandmann, S.; Riepenhausen, S.; Plagwitz, L.; Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 2024, 15, 2050.
  22. Masanneck, L.; Schmidt, L.; Seifert, A.; Kölsche, T.; Huntemann, N.; Jansen, R.; Mehsin, M.; Bernhard, M.; Meuth, S.G.; Böhm, L.; et al. Triage performance across large language models, ChatGPT, and untrained doctors in emergency medicine: Comparative study. J. Med. Internet Res. 2024, 26, e53297.
  23. Arslan, B.; Nuhoglu, C.; Satici, M.; Altinbilek, E. Evaluating LLM-based generative AI tools in emergency triage: A comparative study of ChatGPT Plus, Copilot Pro, and triage nurses. Am. J. Emerg. Med. 2025, 89, 174–181.
  24. Gaber, F.; Shaik, M.; Allega, F.; Bilecz, A.J.; Busch, F.; Goon, K.; Franke, V.; Akalin, A. Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. NPJ Digit. Med. 2025, 8, 263.
  25. Lee, S.; Jung, S.; Park, J.-H.; Cho, H.; Moon, S.; Ahn, S. Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department. BMC Emerg. Med. 2025, 25, 176.
  26. Wang, C.; Wang, F.; Li, S.; Ren, Q.-W.; Tan, X.; Fu, Y.; Liu, D.; Qian, G.; Cao, Y.; Yin, R.; et al. Patient triage and guidance in emergency departments using large language models: Multimetric study. J. Med. Internet Res. 2025, 27, e71613.
  27. Han, S.; Choi, W. Development of a large language model-based multi-agent clinical decision support system for Korean Triage and Acuity Scale (KTAS)-based triage and treatment planning in emergency departments. Adv. Artif. Intell. Mach. Learn. 2025, 5, 3261–3275.
  28. Collins, G.S.; Moons, K.G.M.; Dhiman, P.; Riley, R.D.; Beam, A.L.; Van Calster, B.; Ghassemi, M.; Liu, X.; Reitsma, J.B.; van Smeden, M.; et al. TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024, 385, e078378.
  29. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213–220.
  30. Fleiss, J.L.; Levin, B.; Paik, M.C. Statistical Methods for Rates and Proportions, 3rd ed.; John Wiley & Sons: Nashville, TN, USA, 2003.
  31. Altman, D.G. Practical Statistics for Medical Research; Chapman and Hall: London, UK, 1990.
  32. Savage, T.; Nayak, A.; Gallo, R.; Rangan, E.; Chen, J.H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit. Med. 2024, 7, 20.
  33. Lyons, R.J.; Arepalli, S.R.; Fromal, O.; Choi, J.D.; Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can. J. Ophthalmol. 2024, 59, e301–e308.
  34. Porto, B.M. Improving triage performance in emergency departments using machine learning and natural language processing: A systematic review. BMC Emerg. Med. 2024, 24, 219.
  35. Zaboli, A. Establishing a common ground: The future of triage systems. BMC Emerg. Med. 2024, 24, 148.
  36. Templin, T.; Fort, S.; Padmanabham, P.; Seshadri, P.; Rimal, R.; Oliva, J.; Lich, K.H.; Sylvia, S.; Sinnott-Armstrong, N. Framework for bias evaluation in large language models in healthcare settings. NPJ Digit. Med. 2025, 8, 414.
  37. Elbattah, M.; Arnaud, E.; Ghazali, D.A.; Dequen, G. Exploring the Ethical Challenges of Large Language Models in Emergency Medicine: A Comparative International Review. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisbon, Portugal, 3–6 December 2024; pp. 5750–5755.
  38. Preiksaitis, C.; Ashenburg, N.; Bunney, G.; Chu, A.; Kabeer, R.; Riley, F.; Ribeira, R.; Rose, C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med. Inform. 2024, 12, e53787.
  39. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  40. Cross, J.L.; Choma, M.A.; Onofrey, J.A. Bias in medical AI: Implications for clinical decision-making. PLoS Digit. Health 2024, 3, e0000651.
  41. Hoot, N.R.; Aronsky, D. Systematic review of emergency department crowding: Causes, effects, and solutions. Ann. Emerg. Med. 2008, 52, 126–136.
  42. Goh, E.; Bunning, B.; Khoong, E.C.; Gallo, R.J.; Milstein, A.; Centola, D.; Chen, J.H. Physician clinical decision modification and bias assessment in a randomized controlled trial of AI assistance. Commun. Med. 2025, 5, 59.
  43. Parasuraman, R.; Manzey, D.H. Complacency and bias in human use of automation: An attentional integration. Hum. Factors 2010, 52, 381–410.
  44. Qazi, I.A.; Ali, A.; Khawaja, A.U.; Akhtar, M.J.; Sheikh, A.Z.; Alizai, M.H. Automation bias in large language model assisted diagnostic reasoning among AI-trained physicians. medRxiv 2025, 2025.08.23.25334280.
  45. Yazaki, M.; Maki, S.; Furuya, T.; Inoue, K.; Nagai, K.; Nagashima, Y.; Maruyama, J.; Toki, Y.; Kitagawa, K.; Iwata, S.; et al. Emergency patient triage improvement through a Retrieval-Augmented Generation enhanced large-scale language model. Prehosp. Emerg. Care 2025, 29, 203–209.
  46. Zaboli, A.; Brigo, F.; Brigiari, G.; Massar, M.; Parodi, M.; Pfeifer, N.; Magnarelli, G.; Turcato, G. Chat-GPT in triage: Still far from surpassing human expertise - An observational study. Am. J. Emerg. Med. 2025, 92, 165–171.
Figure 1. Agreement between LLM and physician triage scores. Quadratic weighted kappa (κw) for each model compared to the physician ESI score. Error bars represent 95% confidence intervals.
Figure 2. Multiclass Cohen’s kappa coefficients for clinic referral agreement. The red dashed line shows the threshold for strong agreement, which is κ > 0.80.
Figure 3. Confusion Matrix for Clinic Referral (Claude Sonnet 4). Heatmap showing the distribution of predicted vs. physician-assigned clinic destinations for Claude Sonnet 4.
Figure 4. F1-scores by clinical specialty (Claude Sonnet 4).
Figure 5. Agreement between LLM-predicted and physician-assigned admission decisions, measured using binary Cohen’s kappa. The dashed line indicates κ = 0.80.
Figure 6. Analysis of systematic error bias (McNemar’s test). The chi-squared (χ2) values quantify the asymmetry of errors (false positives vs. false negatives).
Figure 7. Hospital admission prediction metrics. Comparative bar chart of sensitivity (recall of admitted patients) vs. specificity (correct discharge).
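The quantities named in the captions of Figures 6 and 7 (sensitivity, specificity, and McNemar's χ2 on the asymmetry of false positives vs. false negatives) can be illustrated with a minimal sketch. It assumes binary labels with 1 = admitted and 0 = discharged and uses the uncorrected McNemar statistic; the article does not state whether a continuity correction was applied. The function name `admission_metrics` and the toy labels are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence


def admission_metrics(physician: Sequence[int], model: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, and uncorrected McNemar chi-squared.

    Labels are assumed binary: 1 = admitted, 0 = discharged.
    The physician decision is treated as the reference standard.
    """
    if len(physician) != len(model):
        raise ValueError("label sequences must have equal length")

    pairs = list(zip(physician, model))
    tp = sum(1 for p, m in pairs if p == 1 and m == 1)
    tn = sum(1 for p, m in pairs if p == 0 and m == 0)
    fp = sum(1 for p, m in pairs if p == 0 and m == 1)  # model admits, physician discharges
    fn = sum(1 for p, m in pairs if p == 1 and m == 0)  # model discharges, physician admits

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")

    # McNemar's statistic uses only the discordant pairs (fp and fn).
    discordant = fp + fn
    chi2 = (fp - fn) ** 2 / discordant if discordant else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcnemar_chi2": chi2,
    }


if __name__ == "__main__":
    # Toy data for illustration only.
    physician = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    model = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(admission_metrics(physician, model))
```

A large χ2 indicates that one error type dominates the other; the sign of `fp - fn` (not returned here) would indicate whether a model leans toward over- or under-admission.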
Table 1. Triage score agreement and bias analysis. Comparison of quadratic weighted Cohen’s kappa and mean bias across seven LLMs. Negative bias values correspond to higher triage scores assigned by the model; 95% confidence intervals (CIs) were calculated via bootstrap resampling.

| Model | N | κw | 95% CI | Exact Agreement (%) | Bias | Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek | 35,581 | 0.467 | 0.457–0.476 | 59.4 | −0.22 | Moderate |
| Gemini 2.5 | 38,410 | 0.465 | 0.457–0.471 | 43.6 | −0.38 | Moderate |
| Claude Sonnet 4 | 37,897 | 0.402 | 0.394–0.409 | 48.0 | −0.46 | Fair |
| Qwen | 36,372 | 0.304 | 0.297–0.311 | 36.7 | −0.67 | Fair |
| Grok | 36,585 | 0.261 | 0.253–0.268 | 34.2 | −0.74 | Fair |
| Thinking GPT-5 | 38,536 | 0.258 | 0.249–0.266 | 39.5 | −0.26 | Fair |
| Instant GPT-5 | 37,884 | 0.176 | 0.167–0.186 | 40.1 | −0.15 | Slight |

κw = quadratic weighted kappa; CI = confidence interval; bias = mean deviation. Agreement interpretation: <0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; >0.80, almost perfect.
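For readers who wish to reproduce the kind of statistic reported in Table 1, the following is a minimal sketch of quadratic weighted kappa with a percentile bootstrap confidence interval. It assumes ESI levels coded as integers 1 to 5 and a simple case-level resampling scheme; the article states only that CIs were obtained by bootstrap resampling, so the percentile method, the number of replicates, and the function names `quadratic_weighted_kappa` and `bootstrap_ci` are assumptions made for illustration.

```python
import random
from typing import Sequence, Tuple


def quadratic_weighted_kappa(
    ref: Sequence[int], pred: Sequence[int], levels: Sequence[int] = (1, 2, 3, 4, 5)
) -> float:
    """Cohen's kappa with quadratic disagreement weights w_ij = (i - j)^2 / (k - 1)^2."""
    n = len(ref)
    if n == 0 or n != len(pred):
        raise ValueError("inputs must be non-empty and of equal length")
    index = {level: i for i, level in enumerate(levels)}
    k = len(levels)

    # Observed contingency table: rows = reference, columns = prediction.
    observed = [[0] * k for _ in range(k)]
    for r, p in zip(ref, pred):
        observed[index[r]][index[p]] += 1
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_obs = 0.0
    weighted_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            weighted_obs += w * observed[i][j]
            weighted_exp += w * row_tot[i] * col_tot[j] / n  # expected count under independence
    if weighted_exp == 0:
        return float("nan")  # undefined when all ratings fall in a single level
    return 1.0 - weighted_obs / weighted_exp


def bootstrap_ci(
    ref: Sequence[int], pred: Sequence[int],
    n_boot: int = 1000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI obtained by resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(ref)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = quadratic_weighted_kappa([ref[i] for i in idx], [pred[i] for i in idx])
        if value == value:  # drop NaN replicates
            stats.append(value)
    if not stats:
        raise ValueError("no valid bootstrap replicates")
    stats.sort()
    m = len(stats)
    lower = stats[int((alpha / 2) * m)]
    upper = stats[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper


if __name__ == "__main__":
    # Toy ratings for illustration only.
    physician = [1, 2, 3, 4, 5] * 20
    model = [1, 2, 3, 3, 4] * 20
    print(quadratic_weighted_kappa(physician, model))
    print(bootstrap_ci(physician, model))
```

Quadratic weights penalize a two-level disagreement four times as heavily as a one-level disagreement, which is why κw and exact agreement can rank models differently.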
Table 2. Paired bootstrap comparison of top three LLMs.

| Comparison | Δκ | 95% CI | p-Value |
| --- | --- | --- | --- |
| Triage Score (Quadratic Weighted Kappa, κw) | | | |
| Claude Sonnet 4 vs. DeepSeek | −0.074 | [−0.085, −0.063] | <0.001 |
| Claude Sonnet 4 vs. Gemini 2.5 | −0.059 | [−0.068, −0.050] | <0.001 |
| DeepSeek vs. Gemini 2.5 | +0.015 | [+0.005, +0.025] | 0.005 |
| Clinic Referral (Multiclass Kappa, κ) | | | |
| Claude Sonnet 4 vs. DeepSeek | −0.003 | [−0.010, +0.005] | 0.494 |
| Claude Sonnet 4 vs. Gemini 2.5 | +0.006 | [−0.001, +0.013] | 0.104 |
| DeepSeek vs. Gemini 2.5 | +0.008 | [+0.001, +0.016] | 0.020 |
| Admission Prediction (Binary Kappa, κ) | | | |
| Claude Sonnet 4 vs. DeepSeek | +0.113 | [+0.101, +0.125] | <0.001 |
| Claude Sonnet 4 vs. Gemini 2.5 | +0.091 | [+0.078, +0.104] | <0.001 |
| DeepSeek vs. Gemini 2.5 | −0.022 | [−0.035, −0.009] | <0.001 |

κw = quadratic weighted kappa; CI = confidence interval; Δκ = delta kappa.
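A paired bootstrap comparison of the kind summarized in Table 2 can be sketched as follows. The sketch assumes that both models were scored on the same cases, that each replicate resamples cases jointly for both models, and that the two-sided p-value is taken as twice the smaller tail proportion of the bootstrap Δκ distribution around zero; the article does not describe these details, so they are assumptions. The unweighted `cohen_kappa` is used as the default metric for brevity, and all function names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple


def cohen_kappa(ref: Sequence[int], pred: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for nominal labels."""
    n = len(ref)
    p_obs = sum(r == p for r, p in zip(ref, pred)) / n
    ref_counts, pred_counts = Counter(ref), Counter(pred)
    p_exp = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else float("nan")


def paired_bootstrap_delta(
    ref: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = cohen_kappa,
    n_boot: int = 2000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Return (observed delta, percentile CI, two-sided p) for metric(A) - metric(B)."""
    if not (len(ref) == len(pred_a) == len(pred_b)) or not ref:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(ref)
    observed = metric(ref, pred_a) - metric(ref, pred_b)

    deltas = []
    for _ in range(n_boot):
        # The same resampled cases are used for both models (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        r = [ref[i] for i in idx]
        d = metric(r, [pred_a[i] for i in idx]) - metric(r, [pred_b[i] for i in idx])
        if d == d:  # drop NaN replicates
            deltas.append(d)
    if not deltas:
        raise ValueError("no valid bootstrap replicates")

    deltas.sort()
    m = len(deltas)
    ci = (deltas[int((alpha / 2) * m)], deltas[min(m - 1, int((1 - alpha / 2) * m))])
    frac_le = sum(d <= 0 for d in deltas) / m
    frac_ge = sum(d >= 0 for d in deltas) / m
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, ci, p_value


if __name__ == "__main__":
    # Toy data for illustration only: model A is perfect, model B errs on every fourth case.
    ref = [0, 1] * 50
    pred_a = list(ref)
    pred_b = [1 - r if i % 4 == 0 else r for i, r in enumerate(ref)]
    print(paired_bootstrap_delta(ref, pred_a, pred_b))
```

Because the sample sizes in Table 1 differ by model, a paired analysis of this kind would have to be restricted to cases for which both models returned a usable output.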
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
