Interpretable Machine Learning for Emergency Department Triage: Clinical Insights from 133,198 Patients Using the Korean Triage and Acuity Scale (KTAS)
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Data
2.2. Variable Definition and Preprocessing
- Demographics: Gender and Age.
- Vital Signs: Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), Heart Rate (HR), Respiratory Rate (RR), and Body Temperature (BT).
- Symptom Assessment: Pain score measured via the Numerical Rating Scale (NRS).
- Mean Arterial Pressure (MAP): Calculated as MAP = (SBP + 2 × DBP)/3, offering a better indicator of perfusion to vital organs than SBP alone.
- Shock Index (SI): Calculated as SI = HR/SBP, a sensitive marker for occult shock and hemodynamic instability.
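The two derived variables follow directly from the formulas above. A minimal sketch in Python is given here; the function names and the handling of a non-positive SBP are illustrative assumptions, not the study's preprocessing code.

```python
import math


def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    """MAP = (SBP + 2 * DBP) / 3, in mmHg."""
    return (sbp + 2.0 * dbp) / 3.0


def shock_index(hr: float, sbp: float) -> float:
    """SI = HR / SBP.

    Assumption: a non-positive SBP is treated as missing and yields NaN,
    so that it can be handled later (e.g., by imputation or exclusion).
    """
    if sbp <= 0:
        return math.nan
    return hr / sbp


# Illustrative vitals, not taken from the study dataset.
print(round(mean_arterial_pressure(120, 80), 1))  # 93.3
print(round(shock_index(90, 120), 2))             # 0.75
```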
2.3. Label Definition
- Level 1 (Resuscitation): Cardiac arrest, major trauma, respiratory arrest (Immediate intervention).
- Level 2 (Emergent): Potential threat to life or limb (Care within 10 min).
- Level 3 (Urgent): Condition may progress seriously if untreated (Care within 30 min).
- Level 4 (Less Urgent): Patient requires intervention but is stable (Care within 1–2 h).
- Level 5 (Non-Urgent): Minor ailments requiring no immediate intervention (Care within 2 h).
2.4. Modeling Strategy
2.5. Performance Evaluation Metrics
- Quadratic Weighted Kappa (QWK): This was the primary metric. Unlike standard accuracy, QWK accounts for the ordinal nature of triage scales by weighting each disagreement by the squared distance between the predicted and actual levels (e.g., mistaking Level 1 for Level 5 is penalized far more heavily than mistaking Level 1 for Level 2). Because KTAS levels are ordinal, the clinical risk of a misclassification grows with the distance between grades, and QWK was selected to quantify that severity.
- Mean Absolute Error (MAE): Computed as the average absolute distance between the predicted class and the true class. MAE intuitively demonstrates how many levels the model’s predictions deviate from the actual severity rating on average, which is useful for quantifying the magnitude of the overall prediction error.
- ±1 Accuracy: This clinical metric reflects the percentage of predictions that fell within one level of the true label, acknowledging that triage often has a “grey zone” of subjectivity. This was included to evaluate the practical utility of the model, taking into account acceptable variations in clinical judgment that may occur in real-world practice.
- Confusion Matrix Analysis: Used to visualize specific patterns of error, specifically identifying rates of under-triage (dangerous) versus over-triage (inefficient). By analyzing the direction of misclassification beyond simple numbers, we aimed to verify the safety of the model by precisely examining the prevalence of under-triage that threatens patient safety and over-triage that hinders ED operational efficiency.
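The four metrics above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the study's evaluation code: the function names are hypothetical, the labels are toy values, and under-triage is taken to mean a predicted level number greater than the true one (a higher KTAS number means lower severity).

```python
LEVELS = [1, 2, 3, 4, 5]  # KTAS levels, 1 = most severe


def confusion_matrix(y_true, y_pred, levels=LEVELS):
    """Rows are true levels, columns are predicted levels."""
    idx = {lvl: i for i, lvl in enumerate(levels)}
    k = len(levels)
    matrix = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        matrix[idx[t]][idx[p]] += 1
    return matrix


def quadratic_weighted_kappa(y_true, y_pred, levels=LEVELS):
    """1 - (weighted observed disagreement / weighted chance disagreement)."""
    k, n = len(levels), len(y_true)
    obs = confusion_matrix(y_true, y_pred, levels)
    row = [sum(r) for r in obs]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            num += w * obs[i][j]
            den += w * row[i] * col[j] / n   # expected under independence
    # Kappa is undefined when both raters use one identical class only.
    return 1.0 - num / den if den else float("nan")


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def triage_direction_rates(y_true, y_pred):
    """Return (under-triage rate, over-triage rate)."""
    n = len(y_true)
    under = sum(p > t for t, p in zip(y_true, y_pred)) / n  # predicted less severe
    over = sum(p < t for t, p in zip(y_true, y_pred)) / n   # predicted more severe
    return under, over


# Toy labels for illustration only.
y_true = [1, 2, 3, 3, 4, 5, 3, 2]
y_pred = [1, 3, 3, 4, 4, 3, 3, 2]
print(round(quadratic_weighted_kappa(y_true, y_pred), 3))  # 0.662
print(mean_absolute_error(y_true, y_pred))                 # 0.5
print(within_one_accuracy(y_true, y_pred))                 # 0.875
print(triage_direction_rates(y_true, y_pred))              # (0.25, 0.125)
```

In the toy example, the single two-level error (true Level 5 predicted as Level 3) contributes four times the penalty of each adjacent-level error to the QWK numerator, which is the behavior that motivates its use on ordinal scales.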
2.6. Interpretability Analysis (XAI)
- Permutation Importance: We systematically shuffled the values of each feature to measure the resulting drop in QWK. This identifies which variables are most heavily relied upon by the model. This technique measures the actual contribution of each variable to predictive performance without requiring model re-training, making it effective for determining overall variable importance.
- Partial Dependence Plots (PDPs): These plots visualize the marginal effect of a specific feature (e.g., Age) on the predicted outcome, averaging over the observed values of the other variables. PDPs were used for the post hoc validation of whether the nonlinear relationships learned by the model conformed to physiological logic (e.g., does risk increase as age increases?). For instance, we assessed the clinical plausibility of the model by checking whether the predicted severity risk increased as vital sign parameters deteriorated.
- Uniform Manifold Approximation and Projection (UMAP): A dimension reduction technique used to project the high-dimensional patient data into a 2D scatter plot. This allows us to visually compare the clustering of the model’s predictions against the actual patient distribution. UMAP is well suited to visualizing high-dimensional clinical data because it preserves local neighborhood structure while retaining much of the global structure. This facilitated an intuitive assessment of how effectively the model differentiates between various KTAS categories.
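The first two techniques are model-agnostic and can be sketched without any machine learning library. The example below is a minimal illustration under stated assumptions: `toy_risk` and `toy_predict` are a hypothetical stand-in for the trained model, the data are synthetic, and plain accuracy is used as the score to keep the sketch short, whereas the analysis described above measures the drop in QWK.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, score_fn, n_repeats=5, seed=0):
    """Mean drop in score when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = score_fn(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - score_fn(y, [predict(r) for r in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances


def partial_dependence(predict_value, X, col, grid):
    """Average prediction when feature `col` is forced to each grid value."""
    curve = []
    for g in grid:
        preds = [predict_value(row[:col] + [g] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve


# Hypothetical stand-in model: risk rises with pain score and age and
# ignores the third (noise) feature entirely.
def toy_risk(row):
    nrs, age, _noise = row
    return min(1.0, 0.05 * nrs + 0.005 * age)


def toy_predict(row):
    return 2 if toy_risk(row) >= 0.5 else 4  # toy KTAS level


# Synthetic data: [NRS 0-10, age in years, noise].
data_rng = random.Random(42)
X = [[data_rng.randint(0, 10), data_rng.randint(1, 95), data_rng.random()]
     for _ in range(200)]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y, accuracy)
curve = partial_dependence(toy_risk, X, col=0, grid=list(range(11)))
print(importances)  # the noise feature has importance exactly 0.0
print(curve)        # non-decreasing in NRS for this toy model
```

Because the toy model never reads the noise feature, shuffling it cannot change any prediction, so its importance is exactly zero. A feature the model relies on shows a positive drop.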
3. Results
3.1. Characteristics of the Study Population
- High Acuity (Levels 1–2): 13.1% (Level 1: 2.8%, Level 2: 10.4%)
- Mid Acuity (Level 3): 52.2% (The dominant category)
- Low Acuity (Levels 4–5): 34.7% (Level 4: 26.9%, Level 5: 7.8%)
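These shares can be recomputed from the per-level patient counts reported in the baseline characteristics table. The short sketch below does so; the helper name `share` is illustrative. Grouped shares computed directly from counts can differ in the last digit from the sum of the rounded per-level figures.

```python
# Per-level patient counts as reported in the baseline characteristics table.
counts = {1: 3672, 2: 13803, 3: 69476, 4: 35802, 5: 10445}
total = sum(counts.values())


def share(levels):
    """Percentage of all patients falling in the given KTAS levels."""
    return 100.0 * sum(counts[lvl] for lvl in levels) / total


print(total)
for name, levels in [("High", [1, 2]), ("Mid", [3]), ("Low", [4, 5])]:
    print(name, round(share(levels), 1))
```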
3.2. Model Performance Comparison
- XGBoost Results: Achieved the best raw metrics, with a QWK of 0.476, MAE of 0.386, and Exact Accuracy of 67.4%. Its ±1 Accuracy was 94.7%, indicating that errors of two or more levels were uncommon.
- Random Forest Results: While slightly lower in raw metrics (QWK = 0.434, MAE = 0.494, Accuracy = 61.0%, ±1 Accuracy = 91.6%), RF was selected as the primary model.
- Under-triage (21.7%): Cases where the model predicted a lower severity than the nurse. This is the primary safety concern.
- Over-triage (17.4%): Cases where the model predicted higher severity.
- Exact Match (61.0%): Perfect agreement. Crucially, the majority of misclassifications occurred between adjacent severity levels (e.g., Level 2 vs. Level 3), and fatal errors such as misclassifying Level 1 as Level 5 were extremely rare. This suggests that the model shows substantial agreement with expert triage decisions when evaluated using the ±1 tolerance metric (>90%).

3.3. Variable Importance and Clinical Plausibility (Figure 2)
- Pain Score (NRS): The single most predictive feature. This suggests that the model has captured the real-world clinical guidelines in which pain serves as a key indicator for determining severity in the KTAS classification system.
- Age: Highly significant, reflecting the higher biological vulnerability of older adults.
- Systolic Blood Pressure (SBP): A key indicator of hemodynamic status.
- Pain: The probability of High Acuity rose sharply as NRS increased.
- Age: Risk increased progressively with age, with a steeper incline after age 60.
- SBP: A “U-shaped” non-linear relationship was observed. Specifically, the probability of high-acuity classification increased during both hypotension (suggesting potential shock) and severe hypertension (suggesting hypertensive emergencies).

3.4. Patient Distribution Visualization (UMAP)
4. Discussion
4.1. The Need for Objective Triage Support
4.2. Interpreting the Model Performance
4.3. The Role of Explainable AI (XAI) in Adoption
4.4. Clinical Implications and Future Directions
5. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Variable | Overall (n = 133,198) | KTAS 1 (n = 3672) | KTAS 2 (n = 13,803) | KTAS 3 (n = 69,476) | KTAS 4 (n = 35,802) | KTAS 5 (n = 10,445) |
|---|---|---|---|---|---|---|
| Age, years | 43.0 (21.0–63.0) | 73.0 (59.0–83.0) | 62.0 (45.0–76.0) | 42.0 (21.0–62.0) | 30.0 (8.0–53.0) | 48.0 (34.0–61.0) |
| Gender, n (%) | | | | | | |
| Male | 63,707 (47.8) | 1983 (54.0) | 7548 (54.7) | 31,292 (46.5) | 18,042 (50.4) | 3847 (37.0) |
| Female | 69,491 (52.2) | 1689 (46.0) | 6255 (45.3) | 37,184 (53.5) | 17,760 (49.6) | 6598 (63.0) |
| Systolic BP, mmHg | 122 (103–139) | 99 (55–134) | 123 (99–146) | 124 (106–142) | 119 (93–136) | 123 (116–131) |
| Diastolic BP, mmHg | 75 (60–86) | 55 (24–77) | 73 (57–86) | 76 (62–87) | 74 (44–85) | 78 (71–84) |
| MAP, mmHg | 91.3 (75.7–103.0) | 71.0 (32.0–96.0) | 90.3 (72.0–105.0) | 92.0 (78.0–104.3) | 89.3 (64.3–101.7) | 93.3 (87.3–98.3) |
| Heart rate, bpm | 89 (75–107) | 86 (52–110) | 87 (73–105) | 91 (76–110) | 90 (77–107) | 76 (70–86) |
| Respiratory rate, /min | 18 (16–20) | 20 (16–23) | 18 (16–20) | 18 (16–20) | 18 (16–20) | 16 (16–18) |
| Body temperature, °C | 36.8 (36.4–37.3) | 36.4 (35.0–37.1) | 36.7 (36.3–37.1) | 36.9 (36.5–37.6) | 36.8 (36.4–37.1) | 36.7 (36.4–36.9) |
| Shock Index | 0.6 (0.5–0.8) | 0.8 (0.6–1.0) | 0.7 (0.5–0.9) | 0.6 (0.5–0.8) | 0.6 (0.4–0.7) | 0.6 (0.5–0.7) |
| Pain score (NRS) | 2.0 (0.0–4.0) | 0.0 (0.0–0.0) | 0.0 (0.0–4.0) | 4.0 (0.0–5.0) | 2.0 (0.0–4.0) | 0.0 (0.0–0.0) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Song, M.; Kim, J.; Jang, E.-C.; Kwon, S. Interpretable Machine Learning for Emergency Department Triage: Clinical Insights from 133,198 Patients Using the Korean Triage and Acuity Scale (KTAS). Diagnostics 2026, 16, 954. https://doi.org/10.3390/diagnostics16060954

