Beyond Vital Signs: A Machine Learning Model Using Comprehensive Triage-Time Data to Detect Undertriage in Emergency Department Patients

Cha, Kyungman; Lee, Sohee; Shin, Jaekwang; Lim, Jee Yong

doi:10.3390/ai7060202


by Kyungman Cha ¹, Sohee Lee ², Jaekwang Shin ³ and Jee Yong Lim ⁴,*

¹ Department of Emergency Medicine, Suwon St. Vincent Hospital, The Catholic University of Korea, Suwon 16247, Republic of Korea

² Project SLK Research Institute for AI medical Innovation, Gwacheon 13832, Republic of Korea

³ Department of Sports and Technology, Seokyeong University, Seoul 02713, Republic of Korea

⁴ International Healthcare Center, Seoul St. Mary’s Hospital, The Catholic University of Korea, Seoul 06591, Republic of Korea

* Author to whom correspondence should be addressed.

AI 2026, 7(6), 202; https://doi.org/10.3390/ai7060202

Submission received: 24 March 2026 / Revised: 22 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Applications of Artificial Intelligence in Medicine)


Abstract

Undertriage—the misclassification of acutely ill patients into low-acuity triage categories—is a persistent patient safety concern, and prior machine learning approaches restricted to vital signs have yielded modest predictive performance. We hypothesized that this ceiling reflects feature restriction rather than an inherent predictive barrier. In this retrospective cohort study of 10,792 adult patients (age ≥ 18) initially triaged as Korean Triage and Acuity Scale (KTAS) level 4 or 5 across two tertiary academic centers during 2025, the primary outcome was triage reclassification—change from initial KTAS 4/5 to final KTAS 1–3 (n = 941; 8.7%). Five nested feature sets of increasing breadth were compared using logistic regression (LR) and gradient-boosting classifiers (GBC). Calibration (slope, intercept, Brier score), sensitivity/specificity/positive and negative predictive values at operating thresholds of 3%, 5%, and 10%, and decision-curve net benefit were evaluated on a held-out test partition. NEWS alone yielded an AUROC of 0.58, whereas the full triage-time panel (Set E; 43 features) achieved a GBC AUROC of 0.72 (95% CI 0.68–0.76; 5-fold CV 0.73 ± 0.02) and an AUPRC of 0.23, approximately doubling the NEWS baseline (0.12). The model was well calibrated, with a Brier score of 0.075, a calibration slope of 0.85 (95% CI 0.70–1.01), and an intercept of −0.30 (95% CI −0.65 to 0.07); both intervals included the ideal values of 1 and 0, indicating that predicted probabilities can be interpreted as approximate absolute event likelihoods. At a 5% operating threshold, sensitivity was 0.79, capturing 79% of reclassifications while flagging 53% of the cohort. Decision curve analysis demonstrated positive net clinical benefit across thresholds of 3–20%, exceeding both a vital-signs-only model and the treat-all/treat-none baselines. 
Feature importance analysis identified pain score, onset-to-arrival time, heart rate, systolic blood pressure, and age as the dominant predictors. Contextual variables routinely documented at triage—particularly pain score and onset-to-arrival time—together with heart rate and systolic blood pressure form a discriminative composite that exceeds the performance of vital-signs-only models in the KTAS 4/5 subpopulation. The resulting model is well calibrated and provides positive net clinical benefit across the 3–20% threshold range, supporting its potential role as a secondary screening flag for low-acuity patients warranting clinician re-review. External validation in independent cohorts is needed before clinical deployment.

Keywords:

undertriage; triage reclassification; machine learning; emergency department; KTAS; gradient boosting; decision curve analysis; calibration; screening


1. Introduction

Emergency departments (EDs) serve as the primary point of contact for acute, unscheduled medical care worldwide, and their capacity to function safely under high patient volumes depends heavily on the accuracy of the initial triage process. ED overcrowding has been consistently associated with treatment delays, increased adverse events, and higher mortality [1,2]. As ED utilization continues to grow, particularly among older and more complex patient populations, the consequences of triage error are becoming an increasingly visible patient safety concern. Crowding does not affect all patients equally: patients assigned to lower-acuity categories may wait for hours without physiological monitoring, and even a small proportion who harbor unrecognized acute pathology are placed at substantial risk [3,4].

The Korean Triage and Acuity Scale (KTAS), adopted nationally across all Korean EDs since 2016, assigns patients to one of five urgency levels based on presenting symptoms, vital signs, and selected clinical modifiers. Its implementation has been associated with measurable reductions in ED length of stay and in-hospital mortality [5]. Nevertheless, KTAS is subject to meaningful inter-rater variability (weighted kappa 0.772), where disagreement is most closely tied to chief complaint selection and modifier application [6]. Undertriage occurs in approximately 10% of triage assessments and accounts for the majority of mistriage events [7,8]. Its consequences are not limited to delayed waiting times: a 2025 multicenter cohort study demonstrated that undertriage was associated with significant delays in diagnostic and therapeutic orders for patients with subarachnoid hemorrhage, aortic dissection, and ST-elevation myocardial infarction [9]—conditions that frequently present with non-specific symptoms in their early stages, making them precisely the type of patient at risk for low-acuity assignment. Moreover, ED crowding compounds this risk: even patients classified as non-critical face independently increased 10-day mortality under high occupancy conditions [2].

Standard Early Warning Scores (EWS), including the National Early Warning Score (NEWS), aggregate vital sign deviations into a composite score to flag physiological deterioration. NEWS was originally developed and validated for inpatient ward settings, where serial measurements are available and the clinical trajectory is the primary signal of interest [10]. The ED triage environment presents a fundamentally different challenge: clinicians must estimate illness severity from a single snapshot of physiology, often obtained within minutes of arrival and frequently confounded by pain, anxiety, or incomplete stabilization. Several studies have examined the utility of NEWS in ED populations and found that while NEWS performs reasonably well for predicting short-term mortality and intensive care admission among undifferentiated ED patients, its discriminative ability declines substantially in lower-acuity subgroups. In the KTAS 4/5 subpopulation, initial triage assigns patients to low-acuity levels precisely because their vital signs are recorded within or near the conventional normal range at the time of triage assessment. Undertriage in this group therefore reflects a residual prediction problem confined to patients whose physiological signals at presentation provide little discriminative information about underlying severity [10,11,12]. This limitation is structural rather than incidental. EWS systems rely on linear thresholds applied to individual vital parameters and lack the capacity to capture interactions between clinical context variables such as symptom onset patterns, chief complaint categories, or arrival circumstances that experienced triage nurses routinely integrate into their assessments.

Machine learning (ML) offers a potential framework for integrating the heterogeneous, multi-domain information available at triage into a unified risk estimate. Over the past decade, a growing body of literature has applied ML to ED triage prediction, with results that vary considerably depending on the scope of input features and the clinical population studied. Models trained on comprehensive triage-time datasets have achieved AUROC values exceeding 0.86 in several large-cohort studies [13,14,15]. A systematic review of 60 ML-based triage algorithms confirmed that multi-domain feature integration consistently outperforms vital-sign-only approaches across diverse healthcare settings [16,17]. However, the majority of these studies address the full triage spectrum, where the inclusion of high-acuity patients with pronounced physiological derangements substantially inflates discriminative performance. Whether the same feature-dependent performance gains persist in the low-acuity subpopulation—where physiological signals are attenuated and the baseline event rate is low—has not been adequately examined. Prior ML studies targeting undertriage detection in lower-acuity settings have reported more modest results, but none have systematically investigated whether this performance gap reflects an irreducible predictive ceiling inherent to the task or a correctable consequence of feature restriction.

We aimed to test directly whether the modest performance reported in low-acuity populations reflects feature restriction rather than an irreducible predictive ceiling. Using Clinical Data Warehouse records from two tertiary academic medical centers, we assembled a cohort of 10,792 adult patients initially triaged as KTAS level 4 or 5 and compared ML models across five progressively enriched feature sets to determine the incremental contribution of each feature domain to the detection of triage reclassification. Our analysis incorporates not only discriminative metrics but also calibration assessment (slope, intercept), operating-threshold performance (sensitivity, specificity, predictive values), decision curve analysis, and partial dependence profiling to characterize the clinical interpretability and operational utility of the identified risk factors.

2. Materials and Methods

2.1. Study Design and Setting

This was a retrospective observational cohort study utilizing electronic medical record (EMR) data extracted from the Clinical Data Warehouses (CDW) of two tertiary academic medical centers affiliated with The Catholic University of Korea (Seoul St. Mary’s Hospital, Seoul; and Suwon St. Vincent Hospital, Suwon, Republic of Korea), both of which operate general academic EDs serving an all-comers adult population that includes medical, surgical, and traumatic presentations. Both CDWs integrate structured clinical data from each hospital’s EMR system—including triage assessments, nursing intake records, vital sign measurements, disposition records, and administrative fields—and are linked to the National Emergency Department Information System (NEDIS) registry. At the time of cohort definition, the two-center extract was unified without preservation of hospital-level identifiers, which precluded center-stratified analyses or leave-one-center-out validation in the present work. The study period extended from 1 January 2025 to 31 December 2025. The study was approved by the Institutional Review Board of The Catholic University of Korea (Protocol No. KC26RASI0131; approved 10 March 2026). The requirement for informed consent was waived owing to the retrospective, de-identified nature of the data. This study was conducted and reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis + Artificial Intelligence (TRIPOD + AI) guidelines.

2.2. Study Population

The study population comprised all adult patients (age ≥ 18 years) who presented to the EDs of the two participating centers during the study period and were assigned an initial Korean Triage and Acuity Scale (KTAS) level of 4 (less urgent) or 5 (non-urgent) at their first triage assessment. KTAS is a five-level triage system modeled after the Canadian Triage and Acuity Scale (CTAS) and has been mandated across all Korean EDs since 2016 [5]. Levels 4 and 5 represent patients deemed by the triage nurse to have conditions of low acuity that do not require immediate medical intervention.

From the 35,052 ED encounters extracted from the CDW during the study period, 33 encounters of patients aged below 18 years were excluded at the data extraction stage, yielding 35,019 adult ED encounters. Of these, 510 administrative or non-clinical encounters initially triaged as KTAS Level 8 were excluded, leaving 34,509 screened encounters. A further 23,717 encounters initially triaged as KTAS Level 1, 2, or 3 (comprising 644 Level 1 [2.7%], 2766 Level 2 [11.7%], and 20,307 Level 3 [85.6%]) were excluded as high-acuity presentations, producing a final adult analytic cohort of 10,792 patients initially triaged as low-acuity (KTAS Level 4: 9206 [85.3%]; KTAS Level 5: 1586 [14.7%]). A flow diagram detailing patient selection is provided in Figure 1.
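The exclusion cascade can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the paragraph above; the variable names are illustrative.

```python
# Counts as reported in the text above.
extracted = 35_052
minors = 33
administrative = 510                      # KTAS Level 8
high_acuity = {1: 644, 2: 2_766, 3: 20_307}
low_acuity = {4: 9_206, 5: 1_586}

adults = extracted - minors
screened = adults - administrative
high_total = sum(high_acuity.values())
cohort = screened - high_total

# Percentage shares within the excluded high-acuity group and the final cohort.
high_share = {k: round(100 * v / high_total, 1) for k, v in high_acuity.items()}
low_share = {k: round(100 * v / cohort, 1) for k, v in low_acuity.items()}

print(adults, screened, cohort)
print(high_share, low_share)
```

Each intermediate total and percentage in the text is reproduced by this calculation, and the KTAS 4 and 5 counts sum to the final cohort size.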

2.3. Outcome Definition

The primary outcome was triage reclassification, defined as a change from an initial KTAS level of 4 or 5 to a final KTAS level of 1, 2, or 3 during the same ED encounter. Under the KTAS system, triage level can only be revised upward during an ED encounter; downward reclassification is not permitted. This unidirectional constraint ensures that all observed reclassification events represent recognition of higher acuity than initially assigned. Triage reclassification was ascertained by comparing the initial triage assignment (recorded at first assessment) with the final triage level documented in the NEDIS registry at ED disposition.

Triage reclassification was selected as the primary endpoint because it directly captures the clinical recognition that a patient’s condition warranted higher-acuity care than initially assigned—the operational definition of undertriage. Unlike surrogate endpoints such as emergency transfer (which may reflect bed availability or insurance logistics), triage reclassification reflects a clinical reassessment by the treating physician and is recorded prospectively within the existing triage infrastructure.

Among the 941 patients who met this outcome, the majority were reclassified to KTAS level 3 (n = 886; 94.2%), with smaller proportions reclassified to KTAS level 2 (n = 54; 5.7%) and KTAS level 1 (n = 1; 0.1%). Interpretive boundaries of this outcome definition—in particular the possibility that reclassification may reflect evolving clinical status, inter-physician variation in reassessment thresholds, or documentation conventions rather than misclassification at presentation—are discussed in Section 4.2.

2.4. Feature Engineering and Preprocessing

2.4.1. Vital Sign Processing

Six vital signs were extracted from the initial triage assessment: systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), respiratory rate (RR), body temperature (BT), and peripheral oxygen saturation (SpO₂). Pain intensity at triage was recorded on the Numeric Rating Scale (NRS), ranging from 0 (no pain) to 10 (worst pain imaginable). Physiologically implausible values were identified using predefined thresholds (e.g., SBP < 40 or > 300 mmHg, HR < 20 or > 250 beats/min, BT < 30 or > 45 °C) and set to missing prior to imputation. These thresholds were established based on published physiological ranges and reviewed by two emergency physicians (J.Y.L. and K.C.) from the participating centers.
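A minimal sketch of the plausibility filter follows, assuming each encounter is held as a Python dictionary. The field names and the function `mask_implausible` are hypothetical. Only the three ranges quoted as examples in the text are encoded; thresholds for the remaining vital signs are not stated and are therefore omitted.

```python
# Plausible ranges quoted in the text; values outside them are treated as missing.
PLAUSIBLE_RANGE = {
    "sbp": (40.0, 300.0),   # mmHg
    "hr": (20.0, 250.0),    # beats/min
    "bt": (30.0, 45.0),     # degrees Celsius
}


def mask_implausible(record: dict) -> dict:
    """Return a copy of `record` with out-of-range vital signs set to None."""
    cleaned = dict(record)
    for name, (low, high) in PLAUSIBLE_RANGE.items():
        value = cleaned.get(name)
        if value is not None and not (low <= value <= high):
            cleaned[name] = None
    return cleaned


example = mask_implausible({"sbp": 350, "hr": 72, "bt": 36.8, "rr": 16})
print(example)
```

Masking to a missing value, rather than clipping, leaves the decision about the replacement value to the imputation step.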

2.4.2. Handling of Missing Data and Informative Missingness

Missing data patterns were analyzed prior to imputation, and three variables exhibited high missingness rates that were judged to be clinically informative rather than random. SpO₂ was missing in 80.3% of cases and was assigned a fixed value of 98%, reflecting both the median observed among patients with recorded measurements and the clinical convention that pulse oximetry is selectively omitted in patients who appear well-oxygenated on visual assessment. The National Early Warning Score (NEWS) was missing in 81.2% of records and was assigned a value of 0, consistent with the interpretation that NEWS is not routinely calculated for patients triaged to low-acuity categories. Pain score (NRS) was missing in 30.5% of encounters and was assigned 0, reflecting clinical practice in which pain is not formally documented when patients do not report significant discomfort. These three constants are clinical conventions rather than data-derived parameters and were therefore applied uniformly across the cohort without introducing any pathway for information leakage between training and evaluation data.
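The fixed-constant fill described above might look as follows. This is a sketch under the assumption of dictionary-style records; the field names and `fill_informative` are hypothetical, while the three fill values are those stated in the text.

```python
# Fixed fill values stated in the text. They are treated as clinical
# conventions, not as parameters estimated from the data.
INFORMATIVE_FILL = {"spo2": 98.0, "news": 0.0, "pain_nrs": 0.0}


def fill_informative(record: dict) -> dict:
    """Replace missing SpO2, NEWS and pain score with their fixed constants."""
    filled = dict(record)
    for name, constant in INFORMATIVE_FILL.items():
        if filled.get(name) is None:
            filled[name] = constant
    return filled


example = fill_informative({"spo2": None, "news": None, "pain_nrs": 4, "hr": 80})
print(example)
```

Because the constants do not depend on any data partition, the same function can be applied before the train/test split without leakage.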

For remaining variables with missing values, median imputation was applied for continuous features and mode imputation for categorical features. To ensure that no information from the held-out test set influenced model development, all data-driven preprocessing parameters—including imputation medians and modes, the StandardScaler statistics used for the logistic regression model, and the empirically derived high-risk chief complaint threshold (Section 2.4.3)—were fit exclusively on the training partition. The fitted transformations were then applied without modification to the test set and to each cross-validation fold during evaluation. This strict separation was enforced through a scikit-learn Pipeline that encapsulated the full preprocessing-and-modeling sequence.
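The train-only fitting of preprocessing parameters can be illustrated with a scikit-learn Pipeline. The sketch below uses synthetic data and a reduced set of steps (median imputation, standardization, logistic regression); it is an assumption about how such a pipeline could be arranged, not the study's actual code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not study data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
X_train[rng.random(X_train.shape) < 0.1] = np.nan   # synthetic missingness
y_train = (rng.random(200) < 0.3).astype(int)
X_test = rng.normal(size=(50, 3))

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # medians learned in fit()
    ("scale", StandardScaler()),                    # mean/SD learned in fit()
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# fit() sees only the training partition; predict_proba() reuses the
# fitted medians and scaling statistics on the test partition unchanged.
pipe.fit(X_train, y_train)
test_risk = pipe.predict_proba(X_test)[:, 1]
train_medians = pipe.named_steps["impute"].statistics_
```

Passing the same pipeline object to a cross-validation routine refits every step inside each fold, which is what keeps fold-level evaluation free of leakage.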

This approach to informative missingness—treating the absence of a measurement as an implicit clinical observation—has been described in prior EHR-based prediction modeling studies [18] and was preferred over conventional multiple imputation, which assumes data are missing at random and would obscure the clinical signal embedded in selective non-measurement.

As a sensitivity analysis, we compared this informative missingness approach with two alternatives: iterative multiple imputation (MICE) via scikit-learn’s IterativeImputer (10 iterations) and naive median imputation, and additionally evaluated three perturbation scenarios that varied the fixed fill values (conservative: SpO₂ = 100, NEWS = 0, pain = 0; acuity-biased: SpO₂ = 95, NEWS = 2, pain = 2; pure train-median). Test-set AUROC ranged from 0.717 to 0.749 across these scenarios (Supplementary Table S1), indicating that the primary results were robust across imputation strategies and that the informative missingness approach did not artificially inflate model performance.

2.4.3. Feature Derivation

Beyond raw vital signs, a set of clinically motivated derived features was constructed from structured triage fields and nursing intake records to capture contextual information routinely available at the time of triage assessment. These included: ambulance arrival (binary); transfer-in from another facility (binary); altered level of consciousness at triage (binary); night-time visit (arrival between 23:00 and 07:00; binary); weekend or holiday visit (binary); onset-to-arrival time interval (continuous, minutes); acute onset indicator defined as symptom onset within 60 min of ED arrival (binary); past medical history (binary, from nursing intake); surgical history (binary); current medication use (binary); presence of an accompanying person (binary); medical aid insurance status as a proxy for socioeconomic vulnerability (binary); and documented injury mechanism (binary).

Chief complaint was encoded using the 17 KTAS chief complaint categories as dummy variables. Additionally, a high-risk chief complaint flag was derived empirically: complaint categories with an observed triage reclassification rate exceeding 15% in the training partition were coded as 1. This threshold was selected based on the distribution of complaint-specific reclassification rates in the training set: the mean rate across the 15 chief complaint categories with adequate sample size was 9.5% (SD 4.5%), and 15% corresponds approximately to the mean plus 1.5 standard deviations, identifying outlier complaint categories. Only two categories exceeded this threshold and were classified as high-risk. Model performance was robust to threshold variation between 10% and 20% (Supplementary Table S2).
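The empirical high-risk flag can be sketched as below. The category labels and toy rows are hypothetical; the 15% threshold is the one stated in the text. The sketch assumes the rates are computed from training rows only, as described in Section 2.4.2.

```python
from collections import defaultdict


def high_risk_categories(train_rows, threshold=0.15):
    """Return complaint categories whose training-set reclassification
    rate exceeds `threshold`.

    train_rows: iterable of (complaint_category, reclassified) pairs taken
    from the training partition only.
    """
    events = defaultdict(int)
    totals = defaultdict(int)
    for category, reclassified in train_rows:
        totals[category] += 1
        events[category] += int(reclassified)
    return {c for c in totals if events[c] / totals[c] > threshold}


# Toy training data with hypothetical category labels.
train_rows = (
    [("chest", 1)] * 2 + [("chest", 0)] * 8     # 20% reclassified
    + [("skin", 1)] * 1 + [("skin", 0)] * 19    # 5% reclassified
)
flagged = high_risk_categories(train_rows)


def high_risk_flag(category):
    """Binary feature value for a single encounter."""
    return int(category in flagged)
```

The set of flagged categories is then applied unchanged to test-set encounters, so no test outcome contributes to the flag.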

2.5. Feature Set Design

To evaluate the incremental predictive contribution of each information domain, five nested feature sets of increasing breadth were constructed:

Set A (Vital Signs; n = 10 features): SBP, DBP, HR, RR, BT, SpO₂, NEWS, pain score, age, and sex. This set represents the information available to conventional early warning scores and serves as the baseline comparator.

Set B (+Arrival Context; n = 13): Set A plus ambulance arrival, transfer-in status, and altered consciousness.

Set C (+Temporal and Onset Context; n = 17): Set B plus night visit, weekend visit, onset-to-arrival interval, and acute onset indicator.

Set D (+Chief Complaint Categories; n = 34): Set C plus 17 KTAS chief complaint category dummies.

Set E (Full Triage-Time Panel; n = 43): Set D plus past medical history, surgical history, medication use, accompanying person, medical aid status, injury mechanism, repeat visit, KTAS level (4 vs. 5), and the high-risk chief complaint flag.

This nested design enables a stepwise decomposition of predictive performance, allowing direct quantification of the marginal contribution of each feature domain while controlling for the cumulative information from preceding sets.
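The nested construction can be written out explicitly to confirm the feature counts. The column names below are hypothetical placeholders, in particular the 17 chief-complaint dummies named `cc_01` to `cc_17`; the membership of each set follows the descriptions above.

```python
SET_A = ["sbp", "dbp", "hr", "rr", "bt", "spo2", "news", "pain_nrs", "age", "sex"]
SET_B = SET_A + ["ambulance", "transfer_in", "altered_consciousness"]
SET_C = SET_B + ["night_visit", "weekend_visit", "onset_to_arrival_min", "acute_onset"]
SET_D = SET_C + [f"cc_{i:02d}" for i in range(1, 18)]   # 17 complaint dummies
SET_E = SET_D + [
    "past_history", "surgical_history", "medication_use", "accompanied",
    "medical_aid", "injury_mechanism", "repeat_visit", "ktas_level",
    "high_risk_cc",
]

FEATURE_SETS = {"A": SET_A, "B": SET_B, "C": SET_C, "D": SET_D, "E": SET_E}
sizes = {name: len(cols) for name, cols in FEATURE_SETS.items()}
print(sizes)
```

Building each set by appending to its predecessor guarantees the nesting, so any change in performance between consecutive sets can be attributed to the newly added domain.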

2.6. Model Development

Two classifiers representing complementary modeling paradigms were trained on each of the five feature sets:

Logistic Regression (LR) was selected as a linear baseline model for its interpretability and well-characterized behavior in low-prevalence settings. L2 (ridge) regularization was applied, and the regularization strength C was selected by 5-fold stratified cross-validation on the training partition (StratifiedKFold with shuffle = True and random_state = 42) over a logarithmic grid C ∈ {0.001, 0.01, 0.1, 1, 10, 100}, with the configuration maximizing mean cross-validated AUROC retained for final evaluation. Balanced class weights were applied to address the class imbalance (reclassification rate 8.7%), weighting each class inversely to its frequency so that the minority class carries proportionally greater weight during model fitting. All continuous features were standardized to zero mean and unit variance prior to LR training.

Gradient-Boosting Classifier (GBC) was selected as a non-linear ensemble method capable of capturing feature interactions and non-linear decision boundaries without explicit specification. Hyperparameters were specified a priori following standard regularization conventions for gradient boosting in low-prevalence binary classification: n_estimators = 300, max_depth = 3, learning_rate = 0.05, min_samples_leaf = 20, and subsample = 0.8. The combination of shallow trees, a moderate learning rate, and stochastic subsampling was chosen to mitigate overfitting on the minority class while preserving sensitivity to non-linear patterns. A formal grid search was not undertaken for the GBC; instead, the robustness of these choices was confirmed by perturbing each hyperparameter individually around its specified value, with all variants yielding mean cross-validated AUROC within a 0.020-point band of the primary configuration (Supplementary Table S2).
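The two model configurations described in this section could be expressed in scikit-learn roughly as follows. The hyperparameter values are those stated in the text; the pipeline layout, `max_iter`, and the GBC `random_state` are assumptions made for this sketch.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Logistic regression: L2 penalty is the scikit-learn default, so it is
# not passed explicitly; C is tuned on cross-validated AUROC.
lr_search = GridSearchCV(
    estimator=Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]),
    param_grid={"clf__C": [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring="roc_auc",
    cv=cv,
)

# Gradient boosting with the a priori hyperparameters from the text.
gbc = GradientBoostingClassifier(
    n_estimators=300,
    max_depth=3,
    learning_rate=0.05,
    min_samples_leaf=20,
    subsample=0.8,
    random_state=42,   # assumption: the GBC seed is not stated in the text
)
```

Calling `lr_search.fit(X_train, y_train)` and `gbc.fit(X_train, y_train)` on the training partition would then complete model development; neither object touches the test set.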

2.7. Evaluation Strategy

The dataset was partitioned into training (n = 8627; 80%) and held-out test (n = 2165; 20%) sets using stratified random splitting to preserve the outcome prevalence in both partitions. All model development, hyperparameter selection, and feature engineering decisions were performed exclusively on the training partition; the test set was reserved for final performance evaluation and was accessed only once.

2.7.1. Discrimination

Discriminative performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC). Ninety-five percent confidence intervals for both metrics were derived from 1000 stratified bootstrap iterations on the test set. Internal validation was performed using 5-fold stratified cross-validation on the training set, reporting mean AUROC ± standard deviation. Pairwise AUROC differences between nested feature sets were assessed using the DeLong test.
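For illustration, AUROC and a stratified percentile bootstrap interval can be computed in plain Python as below. This is a sketch, not the study's implementation: the function names, the seed, and the percentile indexing are assumptions, and the pairwise AUROC is quadratic in sample size, so it suits small examples only.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling positives and negatives
    separately so that every replicate keeps the original class counts."""
    rng = random.Random(seed)
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        labels = [1] * len(boot_pos) + [0] * len(boot_neg)
        stats.append(auroc(labels, boot_pos + boot_neg))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


point = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
interval = stratified_bootstrap_ci([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], n_boot=200)
```

Stratifying the resampling keeps the number of events fixed in every replicate, which avoids degenerate bootstrap samples when the event rate is low.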

2.7.2. Calibration

Model calibration, defined as the agreement between predicted probabilities and observed event frequencies, was assessed using three complementary measures. The Brier score, ranging from 0 (perfect prediction) to 1, was reported as a single overall summary of probabilistic accuracy. Calibration curves were constructed by grouping predicted probabilities into decile bins and plotting observed versus predicted event rates. Calibration slope and intercept were estimated by logistic regression of the binary outcome on the logit of predicted probabilities; perfect calibration corresponds to slope = 1 and intercept = 0, with values close to these targets indicating that predicted probabilities can be interpreted as accurate absolute event likelihoods rather than as relative risk rankings only. Ninety-five percent confidence intervals for slope and intercept were derived from 1000 stratified bootstrap iterations on the test set.
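In symbols, and using notation introduced here for illustration (predicted probability \hat{p}_i, observed outcome y_i ∈ {0, 1}, N test-set patients, intercept a, slope b), the Brier score and the calibration regression described above can be written as:

```latex
\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{p}_i - y_i \right)^2,
\qquad
\operatorname{logit} P(y_i = 1) = a + b \, \operatorname{logit}(\hat{p}_i),
\qquad
\operatorname{logit}(p) = \ln \frac{p}{1 - p}.
```

Under this formulation, b < 1 indicates predictions that are too extreme (high risks too high, low risks too low), and a < 0 indicates predicted risks that are too high on average.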

2.7.3. Operating Performance

To characterize the model’s behavior at clinically meaningful decision thresholds, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated at predicted-probability thresholds of 3%, 5%, and 10%, with 95% confidence intervals derived from 1000 stratified bootstrap iterations on the test set. The 5% threshold was designated as the primary operating point on clinical grounds (Section 2.7.4), with 3% and 10% providing more sensitive and more specific alternatives, respectively. To enable interpretation in operational terms, the proportion of the low-acuity cohort flagged at each threshold, the proportion of reclassification events captured, and the number of patients flagged per reclassification correctly identified—analogous to a number-needed-to-screen—were tabulated (Section 3.3).
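The threshold-based quantities listed above follow directly from the confusion matrix at a given cut-off. The sketch below is a plain-Python illustration on toy data; the function name and the convention that a risk equal to the threshold counts as flagged are assumptions.

```python
def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def operating_metrics(y_true, risk, threshold):
    """Confusion-matrix summaries at one predicted-probability threshold."""
    tp = fp = fn = tn = 0
    for y, r in zip(y_true, risk):
        flagged = r >= threshold   # assumption: ties at the threshold are flagged
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "flagged_fraction": _ratio(tp + fp, n),
        "flagged_per_tp": _ratio(tp + fp, tp),   # number-needed-to-screen analogue
    }


# Toy example (not study data): 2 events among 10 patients.
y_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
risk_toy = [0.20, 0.04, 0.06, 0.08, 0.01, 0.02, 0.03, 0.01, 0.02, 0.12]
metrics = operating_metrics(y_toy, risk_toy, threshold=0.05)
```

Note that `flagged_per_tp` is the reciprocal of the positive predictive value, which is why the two are reported together.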

2.7.4. Clinical Utility

Decision curve analysis (DCA) was performed to evaluate the net clinical benefit of each model across a range of decision thresholds [19]. DCA quantifies the trade-off between correctly identifying patients who require intervention (true positives) and unnecessarily flagging stable patients (false positives), expressed in units of net benefit relative to the default strategies of treating all or treating none. The threshold probability range of 3–20% was selected as clinically relevant, reflecting the range at which a secondary triage screen might reasonably be triggered in a busy ED.
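Net benefit at a threshold probability p_t is conventionally defined as TP/N − (FP/N) × p_t/(1 − p_t). A minimal sketch with toy counts (not study data) is shown below; the function names are illustrative.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at threshold probability `threshold`.

    True positives are credited at full value; false positives are
    penalized by the odds of the threshold, threshold / (1 - threshold).
    """
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of flagging every patient."""
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy counts: 1000 patients, 90 events, model flags 520 (70 TP, 450 FP).
nb_model = net_benefit(tp=70, fp=450, n=1000, threshold=0.05)
nb_all = net_benefit_treat_all(prevalence=0.09, threshold=0.05)
nb_none = 0.0   # flagging nobody yields zero net benefit by definition
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all and treat-none values, which is the comparison a decision curve displays across the threshold range.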

2.7.5. Model Interpretability

Partial dependence plots (PDPs) were constructed for the four continuous predictors with the highest feature importance in the GBC model to characterize the functional relationship between each predictor and the predicted probability of triage reclassification. PDPs isolate the marginal effect of a single feature by averaging model predictions across the observed distribution of all other features, enabling identification of non-linear risk inflection points that may inform clinical decision thresholds. Feature importance was quantified using the mean decrease in impurity (Gini importance) for the GBC model and standardized coefficient magnitudes for LR.
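The averaging step behind a partial dependence curve can be sketched in plain Python. The risk function below is a hypothetical stand-in for a fitted model, and the field names are illustrative; in practice a library routine such as scikit-learn's inspection module would normally be used.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row,
    predict, and average the predictions over all rows."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_risk(row):
    """Hypothetical risk function standing in for a fitted classifier."""
    return min(1.0, 0.02 + 0.01 * row["pain_nrs"] + 0.001 * max(0, row["hr"] - 100))


rows = [{"pain_nrs": 0, "hr": 80}, {"pain_nrs": 5, "hr": 120}]
curve = partial_dependence(toy_risk, rows, "pain_nrs", grid=[0, 5, 10])
```

Because every other feature keeps its observed value, the curve reflects the average effect of the chosen feature over the cohort rather than its effect for any single patient.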

2.8. Software and Reproducibility

All analyses were performed in Python 3.12. Model development and evaluation used scikit-learn 1.4; statistical testing used SciPy 1.12; and visualization used Matplotlib 3.8 and Seaborn 0.13. The complete analysis pipeline, including preprocessing, model training, evaluation, and figure generation, is available from the corresponding author upon reasonable request.

3. Results

3.1. Study Population and Baseline Characteristics

From the 35,052 ED encounters extracted from the CDW during the study period, 33 minors (age < 18 years) were excluded at the extraction stage, yielding 35,019 adult ED encounters. After further excluding 510 administrative encounters (KTAS Level 8) and 23,717 high-acuity encounters initially triaged as KTAS Levels 1–3, a final analytic cohort of 10,792 adult patients with initial KTAS Level 4 or 5 was retained (Figure 1). The cohort was partitioned into training (n = 8627; 752 reclassifications, 8.72%) and held-out test (n = 2165; 189 reclassifications, 8.73%) sets via stratified random splitting.

Triage reclassification occurred in 941 patients (8.7%). The majority of reclassifications were to KTAS level 3 (n = 886; 94.2%), with 54 patients (5.7%) reclassified to KTAS level 2 and 1 patient (0.1%) to KTAS level 1. Most patients in the cohort were initially classified as KTAS level 4 (n = 9206; 85.3%) rather than KTAS level 5 (n = 1586; 14.7%), and reclassification was significantly more frequent among KTAS level 4 patients, who accounted for 93.6% of reclassifications despite making up 85.3% of the cohort (p < 0.001).

Baseline characteristics are presented in Table 1. Patients who were reclassified were older (median 61 vs. 56 years; p < 0.001), more likely to arrive by ambulance (16.5% vs. 10.1%; p < 0.001), and more likely to have been transferred from another facility (12.4% vs. 5.1%; p < 0.001). The onset-to-arrival interval was markedly longer among reclassified patients (median 540 vs. 312 min; p < 0.001), while acute onset (≤ 60 min) was less frequent (13.2% vs. 17.7%; p < 0.001). Reclassified patients were more likely to have documented past medical history (63.3% vs. 55.6%; p < 0.001), medication use (54.8% vs. 46.8%; p < 0.001), and an accompanying person at presentation (64.3% vs. 57.6%; p < 0.001).

3.2. Model Performance Across Feature Sets

Model performance across the five feature sets is summarized in Table 2 and Figure 2.

NEWS alone, used as a univariate baseline, yielded an AUROC of 0.583 (95% CI 0.548–0.621) and an AUPRC of 0.119 on the test set, confirming the limited discriminative capacity of a single vital-sign composite for detecting undertriage in this population.

Set A (Vital Signs; n = 10): LR achieved an AUROC of 0.633 (95% CI 0.588–0.677) and an AUPRC of 0.166. GBC outperformed LR with an AUROC of 0.674 (95% CI 0.633–0.716) and an AUPRC of 0.176.

Set B (+Arrival Context; n = 13): The addition of ambulance arrival, transfer-in status, and altered consciousness produced the largest single AUROC increment in the sequence. GBC AUROC rose to 0.706 (95% CI 0.667–0.748), an absolute improvement of 0.032 over Set A.

Set C (+Temporal; n = 17): Incorporating onset-to-arrival time and the acute-onset indicator produced a small further improvement to a GBC AUROC of 0.711 (95% CI 0.673–0.752).

Set D (+Chief Complaint; n = 34): Chief complaint categories contributed the largest single AUPRC increment in the sequence. GBC AUROC rose to 0.723 (95% CI 0.684–0.762), while AUPRC improved more substantially to 0.236 (95% CI 0.191–0.301)—a 34% relative gain over Set A.

Set E (Full Triage-Time Panel; n = 43): The full feature set yielded a GBC AUROC of 0.719 (95% CI 0.678–0.759), with cross-validated AUROC on the training set of 0.730 ± 0.022—the highest cross-validated value across all configurations—and an AUPRC of 0.230 (95% CI 0.185–0.291) and Brier score of 0.075. Although the test-set AUROC for Set E was marginally below Set D (0.723), the superior cross-validated performance of Set E and its more comprehensive feature representation supported its retention as the primary configuration for subsequent calibration, decision-curve, and operating-threshold analyses on a single coherent model.

The stepwise progression from NEWS alone (AUROC 0.583) to the full GBC model (AUROC 0.719) corresponded to an absolute AUROC improvement of 0.136 and an approximately 1.9-fold improvement in AUPRC (0.119 → 0.230) (Figure 3). The largest single AUROC increment came from the addition of arrival-context variables (Set A → B, +0.032), and the largest single AUPRC increment came from the addition of chief complaint categories (Set C → D, +0.028), consistent with the central role of arrival mode and chief complaint in clinical triage decision-making.

3.3. Calibration, Operating Performance and Clinical Utility

LR showed markedly higher Brier scores than GBC (0.215–0.228 vs. 0.074–0.077 across feature sets), reflecting the miscalibration introduced by balanced class weighting in the logistic framework. The GBC model (Set E) demonstrated well-calibrated predicted probabilities with a Brier score of 0.075. The calibration slope was 0.85 (95% CI 0.70–1.01) and the calibration intercept was −0.30 (95% CI −0.65 to 0.07); both confidence intervals included the ideal values of 1 and 0, indicating that predicted probabilities can be interpreted as approximate absolute event likelihoods rather than as relative risk rankings alone, although the point estimates suggest a slight tendency toward over-prediction at the upper extreme (Figure 4).
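For readers less familiar with these calibration metrics, the following is a minimal sketch of how a Brier score, calibration slope, and calibration intercept can be computed. It assumes one common convention, in which slope and intercept are estimated jointly by regressing the outcome on the logit of the predicted probability; the study's exact estimation procedure is not described in this section, and the toy data are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def calibration_slope_intercept(y, p, iters=50):
    """Fit y ~ sigmoid(a + b * logit(p)) by Newton-Raphson; return (b, a).

    Predictions of exactly 0 or 1 would need clipping before calling this.
    """
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0  # start at perfect calibration
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) < 1e-10 and abs(db) < 1e-10:
            break
    return b, a


# Toy data (hypothetical): three groups of 10 patients with observed event
# rates 0.2, 0.5 and 0.8, scored by a model whose log-odds are twice too large.
y, p = [], []
for rate in [0.2, 0.5, 0.8]:
    events = round(rate * 10)
    overconfident = 1.0 / (1.0 + math.exp(-2.0 * logit(rate)))
    y += [1] * events + [0] * (10 - events)
    p += [overconfident] * 10

slope, intercept = calibration_slope_intercept(y, p)
print(round(slope, 3), round(intercept, 3), round(brier_score(y, p), 3))
```

In this toy case the predictions are too extreme by construction, so the fitted slope falls well below 1. A slope of 0.85, as reported for Set E, indicates the same pattern in a much milder form.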

Operating performance at three clinically relevant thresholds is summarized in Table 3. At the primary operating threshold of 5%, the GBC Set E model achieved a sensitivity of 0.79 (95% CI 0.74–0.85), specificity of 0.49, positive predictive value of 0.13, and negative predictive value of 0.96, flagging 53.3% of the low-acuity cohort for secondary review and capturing 79.4% of reclassifications. A more conservative 10% threshold yielded higher specificity (0.78), lower sensitivity (0.57; 95% CI 0.51–0.64), and a substantially reduced workload (25.4% of the cohort flagged) while still capturing 57.1% of reclassification events. Conversely, the 3% threshold captured 94.2% of reclassifications but flagged 85.5% of the cohort, defining the upper bound of operational feasibility.
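The threshold metrics are mutually consistent and can be approximately reconstructed from the test-set size (n = 2165, 189 events) together with the flagged and captured percentages in Table 3. The sketch below does this. The counts are back-calculated from rounded percentages, so they are approximations rather than the study's actual confusion matrix, and the function name is illustrative.

```python
def operating_metrics(n, events, flagged_frac, caught_frac):
    """Rebuild an approximate 2x2 table from published summary percentages."""
    tp = round(events * caught_frac)   # reclassifications that were flagged
    flagged = round(n * flagged_frac)  # all flagged patients
    fp = flagged - tp                  # flagged but not reclassified
    fn = events - tp                   # reclassified but not flagged
    tn = n - events - fp               # neither flagged nor reclassified
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "nns": (tp + fp) / tp,         # patients flagged per event identified
    }


# 5% and 10% thresholds, using the percentages reported in Table 3.
m5 = operating_metrics(n=2165, events=189, flagged_frac=0.533, caught_frac=0.794)
m10 = operating_metrics(n=2165, events=189, flagged_frac=0.254, caught_frac=0.571)

for name, m in (("5%", m5), ("10%", m10)):
    print(name, {k: round(v, 2) for k, v in m.items()})
```

Under these assumptions, the 5% threshold corresponds to roughly 150 true positives among about 1154 flagged patients, which reproduces the reported positive predictive value of 0.13 and number-needed-to-screen of 7.7.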

Decision curve analysis demonstrated that the GBC Set E model provided positive net clinical benefit across decision thresholds of 3–20%, consistently outperforming both the treat-all and treat-none strategies as well as the vital-signs-only model (Figure 5). Expressed in operational terms, applying the model at the 5% threshold would require secondary review of approximately one in two low-acuity patients to correctly identify one reclassification candidate for every 7.7 patients flagged—a workload profile broadly consistent with existing secondary screening practices in busy EDs.
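Net benefit in decision curve analysis [19] weighs true positives against false positives at the odds of the chosen threshold. As an illustration, the sketch below evaluates the standard net-benefit formula at the 5% threshold, using counts back-calculated from the rounded percentages in Table 3 (an assumption; the exact counts are not reported in this section).

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt) at threshold probability pt."""
    weight = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * weight


n, events = 2165, 189                 # held-out test set (Table 3 footnote)
threshold = 0.05

# Approximate counts implied by 79.4% of events caught and 53.3% flagged.
tp = round(events * 0.794)
fp = round(n * 0.533) - tp

nb_model = net_benefit(tp, fp, n, threshold)
# Treat-all flags everyone: every event is a TP, every non-event an FP.
nb_treat_all = net_benefit(events, n - events, n, threshold)
# Treat-none flags no one, so its net benefit is zero by definition.
nb_treat_none = 0.0

print(round(nb_model, 3), round(nb_treat_all, 3), nb_treat_none)
```

With these reconstructed counts the model's net benefit at 5% exceeds that of the treat-all strategy, in line with the ordering described for Figure 5. The margin is modest, so the values should be read as indicative only.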

3.4. Feature Importance and Partial Dependence

Feature importance analysis for the GBC Set E model (Figure 6) identified pain score (NRS; importance 0.153), onset-to-arrival time (0.109), heart rate (0.105), systolic blood pressure (0.083), and age (0.076) as the five most influential predictors. Diastolic blood pressure (0.062), NEWS score (0.060), the musculoskeletal/abdominal chief complaint category (CC_J; 0.050), and body temperature (0.045) completed the set of nine features whose importances exceeded the 0.04 threshold.

Partial dependence plots (Figure 7) revealed non-linear relationships between continuous predictors and reclassification risk. Pain score showed a step-like profile: predicted probability was flat across NRS 0–3, rose sharply between NRS 3 and 7, and plateaued thereafter. Onset-to-arrival time showed a sharp inflection at approximately 1500–2000 min (≈25–33 h), with a plateau at higher values. Heart rate demonstrated a U-shaped relationship, with markedly elevated risk at bradycardia (<65 beats per minute) and a steep rise above 100 beats per minute. Systolic blood pressure showed a gradual rise from approximately 100 mmHg with a peak in the 140–160 mmHg range, dropping at extreme high values.
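A partial dependence curve is obtained by forcing one feature to each value on a grid for every patient, leaving all other features unchanged, and averaging the model's predictions. The sketch below illustrates that procedure with a hypothetical stand-in model; `toy_predict`, its coefficients, and the four example rows are invented for illustration and are not the fitted gradient-boosting classifier.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical risk model with a step in pain score (not the study model)."""
    risk = 0.05
    if row["pain_nrs"] > 3:
        risk += 0.04
    if row["heart_rate"] > 100:
        risk += 0.03
    return risk


rows = [
    {"pain_nrs": 2, "heart_rate": 80},
    {"pain_nrs": 6, "heart_rate": 110},
    {"pain_nrs": 4, "heart_rate": 95},
    {"pain_nrs": 0, "heart_rate": 70},
]

pain_grid = [0, 3, 4, 7, 10]
pd_pain = partial_dependence(toy_predict, rows, "pain_nrs", pain_grid)
print(list(zip(pain_grid, [round(v, 4) for v in pd_pain])))
```

Because the other features keep their observed values, the curve reflects the marginal effect of the chosen predictor averaged over the cohort, which is how the profiles in Figure 7 should be read.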

4. Discussion

Among 10,792 adult patients initially triaged as KTAS level 4 or 5, 941 (8.7%) were reclassified to KTAS level 1–3 during their ED course, confirming that undertriage is not a rare occurrence in this population. A gradient-boosting model incorporating the full spectrum of triage-time variables achieved a test-set AUROC of 0.72 (5-fold CV 0.73 ± 0.02) and an AUPRC of 0.23—a substantial improvement over NEWS alone (AUROC 0.58, AUPRC 0.12) and over models restricted to vital signs (GBC Set A: AUROC 0.67, AUPRC 0.18). The model was well calibrated (slope 0.85, intercept −0.30; both confidence intervals including their ideal values) and provided positive net clinical benefit across the 3–20% threshold range. These findings support the hypothesis that previously reported performance limitations in triage prediction models reflect feature restriction rather than an inherent ceiling on predictability, and they support the interpretation of the model as a secondary screening flag.

4.1. Pain Score and Contextual Variables Outperform Individual Vital Signs in Low-Acuity Settings

The most striking finding was the relative contribution of non-physiological features in the upper ranks of predictor importance. Pain score, recorded on the Numeric Rating Scale, emerged as the single strongest predictor of triage reclassification (importance 0.153). Onset-to-arrival time (0.109) and heart rate (0.105) followed closely as the second- and third-ranked features, with systolic blood pressure (0.083), age (0.076), diastolic blood pressure (0.062), NEWS (0.060), the musculoskeletal/abdominal chief complaint category (CC_J; 0.050), and body temperature (0.045) completing the set of nine features whose importances exceeded the 0.04 threshold (Figure 6). Pain score and onset-to-arrival time—both contextual rather than physiological features—together carried approximately the same weight as the next three vital-sign-derived features combined.
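The comparison of contextual and vital-sign weights can be verified from the importances listed above. The snippet below is a simple arithmetic check; the dictionary keys are illustrative labels, and reading the top-nine total as a share assumes that impurity-based importances are normalized to sum to 1 across all 43 features, which is the usual convention but is not stated explicitly here.

```python
# Importances for the nine features above the 0.04 threshold (Figure 6).
importance = {
    "pain_nrs": 0.153,
    "onset_to_arrival": 0.109,
    "heart_rate": 0.105,
    "sbp": 0.083,
    "age": 0.076,
    "dbp": 0.062,
    "news": 0.060,
    "cc_j": 0.050,
    "temperature": 0.045,
}

# Two contextual features versus the next three vital-sign features.
contextual = round(importance["pain_nrs"] + importance["onset_to_arrival"], 3)
vital_signs = round(
    importance["heart_rate"] + importance["sbp"] + importance["dbp"], 3
)

# Combined importance of all nine listed features.
top_nine = round(sum(importance.values()), 3)

print(contextual, vital_signs, top_nine)
```

The two contextual features sum to 0.262 against 0.250 for heart rate, systolic, and diastolic blood pressure combined, which supports the "approximately the same weight" statement.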

The prominence of pain is particularly relevant because pain scoring is a primary modifier within the KTAS algorithm and has been identified as the leading cause of mistriage in prior validation studies [7]. Our results suggest that the relationship between pain and true acuity in the low-severity population is more complex than the linear weighting embedded in the current KTAS framework, and that ML models may capture non-linear pain–acuity interactions that structured triage algorithms miss. The partial dependence profile (Figure 7A) reinforces this interpretation: predicted reclassification probability is flat across NRS 0–3, rises sharply between NRS 3 and 7, and plateaus thereafter—a step-like pattern that a single linear pain modifier would fail to capture.

The partial dependence profiles in Figure 7 illustrate precisely the form of non-linearity that fixed-threshold triage algorithms cannot capture. Where the KTAS algorithm applies a single pain modifier weight, the model identifies a step-like increase between NRS 3 and 7; where the algorithm treats heart rate through paired upper- and lower-bound thresholds, the model identifies a continuous U-shaped risk profile, with markedly elevated risk at bradycardia (<65 beats per minute) and a steep rise above 100 beats per minute; and where the algorithm applies no SBP modifier in the KTAS 4/5 range, the model identifies a gradual rise from 100 mmHg with a peak in the 140–160 mmHg range.

The protective association of acute onset (≤60 min) likely reflects triage behavior rather than biology: patients presenting with acute-onset symptoms receive closer initial scrutiny and are more likely to be correctly classified at the outset, whereas those with subacute or gradual presentations—reflected in longer onset-to-arrival intervals—may be perceived as less urgent but ultimately harbor pathology that becomes apparent only after further evaluation. Age contributed in the fifth-ranked position (0.076) and remains a clinically established modifier of triage validity. Prior work has shown that KTAS shows decreased predictive validity in elderly patients [20], among whom systematic under-recognition of acute illness may be more common, and our results are consistent with this body of literature.

A boundary condition should be noted: the present cohort is, by selection, the subset of patients triaged as KTAS 4 or 5—patients who could communicate sufficiently for an initial low-acuity assignment. Severely ill patients with impaired consciousness or inability to communicate would have been assigned KTAS 1–3 and are not represented here. The strong pain-score signal observed in this population cannot be straightforwardly extrapolated to patients in whom pain assessment is unreliable due to altered mental status or communication barriers. We treat the NRS as ordinal rather than interval and rely on the non-parametric structure of the gradient-boosting model to accommodate inter-patient variation in pain reporting.

4.2. The Choice of Outcome and Its Interpretive Boundaries

We defined triage reclassification—the reassignment from initial KTAS 4/5 to final KTAS 1–3—as the primary outcome, rather than disposition-based endpoints such as hospitalization or emergency transfer. This choice warrants discussion in light of both its strengths and its inherent limitations.

Hospitalization from KTAS 4/5 may reflect social admissions, observation stays, or logistic factors unrelated to acute severity. Emergency transfer, while more specific, is influenced by bed availability and institutional capacity. Triage reclassification, by contrast, directly captures the treating clinician’s retrospective judgment that the patient’s condition was more urgent than initially recognized; it also yields a substantially higher event rate (8.7% vs. approximately 1.2% for emergency transfer), providing greater statistical power for model development and evaluation.

At the same time, triage reclassification is not synonymous with true undertriage. Reclassification reflects the treating clinician’s recognition of higher acuity rather than an external ground truth, and its occurrence may be influenced by several factors beyond initial triage accuracy. These include the evolution of clinical findings during the ED stay (a patient who deteriorates after the initial assessment may be reclassified without the initial assignment having been incorrect), inter-physician variation in the threshold for upward reclassification, institution-specific documentation conventions, and the possibility that some initial assignments—particularly among older patients or those with prolonged onset-to-arrival intervals—reflect systematic under-recognition rather than genuine low acuity at presentation. The latter consideration is consistent with our finding that age and onset-to-arrival time were among the most informative predictors of reclassification.

These caveats argue against interpreting model output as a precise estimate of true undertriage probability. They do not, however, undermine the practical value of the model as a screening flag intended to direct attention toward patients whose presentations warrant a second look. Calibrated probabilities and positive net clinical benefit at clinically reasonable thresholds support this more modest interpretation; external validation against more objective endpoints in independent cohorts would be required before a stronger interpretive claim could be sustained.

4.3. Comparison with Prior Literature

The AUROC of 0.72 achieved in this study is modest compared to ML triage models that operate across the full acuity spectrum (AUROC > 0.86) [13,15], but this comparison is misleading. Full-spectrum models benefit from the easy separability of KTAS 1–2 patients, who exhibit pronounced physiological derangements. Within the KTAS 4/5 subpopulation, physiological variation is compressed and the prediction task is inherently more difficult. Our results are consistent with the limited prior literature on low-acuity prediction while demonstrating that feature enrichment can meaningfully improve performance even in this challenging setting. Reported undertriage rates from comparable systems—approximately 13.7% for the Emergency Severity Index (ESI) and 12–18% for the Canadian Triage and Acuity Scale (CTAS) in adult ED cohorts—provide context for our observed reclassification rate of 8.7% and suggest that the limited predictive power of vital signs alone is not a KTAS-specific phenomenon but a structural property of low-acuity triage across systems that rely primarily on physiological parameters. Recent efforts to modify KTAS through complaint-specific recalibration have shown improved predictive performance [21], and ML-based approaches to KTAS level prediction have shown promise in Korean ED settings [22].

The stepwise feature-set design provides direct evidence for the source of this improvement. The transition from Set A (vital signs only) to Set E (full triage-time panel) produced an absolute AUROC gain of 0.045 for GBC, and the AUPRC—a more sensitive metric in low-prevalence settings—improved by 31% (from 0.176 to 0.230). These gains are attributable not to a single dominant feature domain but to the cumulative integration of arrival context, temporal pattern, chief complaint, and medical/social history. Within this sequence, the largest single AUROC increment came from arrival-context variables (Set A → B, +0.032), while the largest single AUPRC increment came from chief complaint categories (Set C → D, +0.028), consistent with the centrality of arrival mode and chief complaint in clinical triage decision-making.

4.4. Clinical Implications

The practical value of the present model lies not in replacing nurse-led triage but in providing a systematic safety net for the subset of low-acuity patients in whom acuity has been underestimated. Three complementary lines of evidence support this framing.

First, the model’s predicted probabilities are well calibrated. The confidence intervals for the calibration slope (0.85; 95% CI 0.70–1.01) and intercept (−0.30; 95% CI −0.65 to 0.07) both included their ideal values of 1 and 0, indicating that the predicted risk for an individual patient can be interpreted as an approximate event probability rather than as a relative ranking alone. The point estimates suggest a modest tendency toward over-prediction at the upper extreme, which is acceptable for a screening application in which false positives are tolerated more readily than missed events. Absolute interpretability of risk is a prerequisite for any threshold-based screening application, where the absolute level of risk, not merely the rank order, determines whether secondary review is triggered.

Second, decision curve analysis demonstrated positive net clinical benefit across the 3–20% threshold range, consistently exceeding the treat-all, treat-none, and vital-signs-only baselines. This finding is informative independent of the absolute AUROC: net benefit measures clinical value at a specific operating threshold and can remain meaningful when overall discrimination is modest, as it is in this physiologically compressed low-acuity population where vital-sign variation is inherently constrained.

Third, the operating threshold metrics translate directly into ED workflow. At the 5% primary threshold, secondary review of approximately one in two low-acuity patients would correctly identify 79% of eventual reclassifications, with one reclassification identified for every 7.7 patients flagged. More conservative deployment at a 10% threshold reduces the screening volume to 25% of the cohort while still capturing 57% of reclassifications. These figures define a reasonable operational envelope within which institutions could choose, depending on local workforce capacity and tolerance for false positives.

Beyond the model itself, the partial dependence analysis offers actionable insights that do not require computational deployment. The identified inflection regions (pain score [NRS] above 3, onset-to-arrival time exceeding approximately 25 h, heart rate below 65 or above 100 beats per minute, and systolic blood pressure in the 140–160 mmHg range) could inform structured prompts within existing triage protocols, directing attention to KTAS 4/5 patients whose presentations fall within these higher-risk regions.

We emphasize that these findings represent internal validation only. The model has not been tested against external cohorts, and the operating characteristics reported here are likely to differ in institutions with distinct patient case-mix, triage practices, or documentation conventions. Prospective validation in independent EDs—with recalibration of decision thresholds to local prevalence and workflow—is a necessary next step before clinical deployment can be recommended.

4.5. Limitations

Several limitations should be acknowledged. First, although the cohort derives from two academic medical centers within the same hospital system, the analytic extract was unified at the time of data preparation and hospital-level identifiers were not preserved, precluding leave-one-center-out validation, per-center stratification, and assessment of center-specific calibration. External validation in independent cohorts with preserved site information—ideally drawn from health systems and triage cultures distinct from the present one—remains a priority before any clinical implementation.

Second, triage reclassification, while the most face-valid operationally available marker of undertriage, is recorded by the treating clinician and may be influenced by documentation variability, the evolution of clinical findings after initial assessment, and inter-physician differences in the threshold for upward reclassification. We have discussed these interpretive boundaries in Section 4.2; here we note that the model output is best interpreted as a screening signal rather than as a precise estimate of true undertriage probability.

Third, the informative-missingness imputation strategy was clinically motivated and supported by an expanded sensitivity analysis showing AUROC stability across alternative fill specifications (range 0.717–0.749) and across imputation methods (Supplementary Table S1). It nonetheless relies on fixed-value assumptions that could introduce bias if the clinical rationale for non-measurement differs systematically across institutions.

Fourth, the gradient-boosting hyperparameters were specified a priori rather than tuned via formal grid search. Robustness to perturbations of each parameter was confirmed within a 0.020-point band of cross-validated AUROC (Supplementary Table S2), but a fully tuned model in an external cohort might yield modestly different optimal values.

Fifth, we did not incorporate unstructured text data (free-text chief complaints, nursing notes) that may contain additional predictive signal beyond the structured triage variables analyzed here. Sixth, the study period encompasses a single calendar year, and seasonal variation in ED case-mix could affect the generalizability of calibration and threshold-specific operating characteristics. Finally, the test-set AUROC of 0.72 indicates that substantial residual unpredictability remains. The model should therefore be viewed as a complementary screening aid that flags higher-risk presentations for clinician review, not as a definitive classifier or replacement for nurse-led triage.

5. Conclusions

Triage reclassification occurred in 8.7% of adult patients initially classified as KTAS level 4 or 5, confirming that undertriage is a clinically relevant phenomenon in this population. A gradient-boosting model incorporating the full spectrum of triage-time variables—vital signs, arrival context, temporal features, chief complaint categories, and medical/social history—achieved a test AUROC of 0.72 with calibration metrics whose confidence intervals included the ideal values of slope 1 and intercept 0 (point estimates: slope 0.85, intercept −0.30), and positive net clinical benefit across the 3–20% decision-threshold range. Pain score, onset-to-arrival time, heart rate, and systolic blood pressure emerged as the dominant predictors, indicating that contextual information together with selected vital signs carries substantially greater discriminative value than vital-signs-only screening in the low-acuity setting. The model is positioned as a secondary screening flag rather than a definitive classifier, and prospective validation in independent cohorts—with recalibration of decision thresholds to local prevalence and workflow—is a necessary next step before clinical deployment can be recommended.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7060202/s1, Table S1: Imputation sensitivity analysis; Table S2: Gradient boosting hyperparameter sensitivity.

Author Contributions

Conceptualization, J.Y.L.; methodology, K.C.; formal analysis, S.L.; investigation, J.Y.L.; data curation, J.S.; writing—original draft preparation, J.Y.L.; writing—review and editing, K.C.; visualization, S.L.; supervision, J.Y.L.; project administration, J.Y.L.; funding acquisition, J.Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Research Fund of Seoul St. Mary’s Hospital, The Catholic University of Korea (Project No. ZC26RISI0146).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of The Catholic University of Korea (Protocol No. KC26RASI0131; approved 10 March 2026).

Informed Consent Statement

Patient consent was waived due to the retrospective nature of the study.

Data Availability Statement

The datasets generated and/or analyzed during the current study are not publicly available due to patient privacy and institutional restrictions. De-identified data may be made available from the corresponding author upon reasonable request and subject to Institutional Review Board approval and institutional data-sharing policies.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AUROC	Area Under the Receiver Operating Characteristic Curve
AUPRC	Area Under the Precision–Recall Curve
CDW	Clinical Data Warehouse
DCA	Decision Curve Analysis
ED	Emergency Department
EWS	Early Warning Score
GBC	Gradient-Boosting Classifier
KTAS	Korean Triage and Acuity Scale
LR	Logistic Regression
NEDIS	National Emergency Department Information System
NEWS	National Early Warning Score
PDP	Partial Dependence Plot

References

1. Morley, C.; Unwin, M.; Peterson, G.M.; Stankovich, J.; Kinsman, L. Emergency department crowding: A systematic review of causes, consequences and solutions. PLoS ONE 2018, 13, e0203316.
2. Sartini, M.; Carbone, A.; Demartini, A.; Giribone, L.; Oliva, M.; Spagnolo, A.M.; Cremonesi, P.; Canale, F.; Cristina, M.L. Overcrowding in Emergency Department: Causes, Consequences, and Solutions—A Narrative Review. Healthcare 2022, 10, 1625.
3. Jo, S.; Jeong, T.; Jin, Y.H.; Lee, J.B.; Yoon, J.; Park, B. ED crowding is associated with inpatient mortality among critically ill patients admitted via the ED: Post hoc analysis from a retrospective study. Am. J. Emerg. Med. 2015, 33, 1725–1731.
4. Guttmann, A.; Schull, M.J.; Vermeulen, M.J.; Stukel, T.A. Association between waiting times and short term mortality and hospital admission after departure from emergency department. BMJ 2011, 342, d2983.
5. Kwon, H.; Kim, Y.J.; Jo, Y.H.; Lee, J.H.; Lee, J.H.; Kim, J.; Hwang, J.E.; Jeong, J.; Choi, Y.J. The Korean Triage and Acuity Scale: Associations with admission, disposition, mortality and length of stay in the emergency department. Int. J. Qual. Health Care 2019, 31, 449–455.
6. Park, J.B.; Lee, J.; Kim, Y.J.; Lee, J.H.; Lim, T.H. Reliability of Korean Triage and Acuity Scale: Interrater agreement between two experienced nurses. J. Korean Med. Sci. 2019, 34, e189.
7. Moon, S.H.; Shim, J.L.; Park, K.S.; Park, C.S. Triage accuracy and causes of mistriage using the Korean Triage and Acuity Scale. PLoS ONE 2019, 14, e0216972.
8. Hinson, J.S.; Martinez, D.A.; Cabral, S.; George, K.; Whalen, M.; Hansoti, B.; Levin, S. Triage performance in emergency medicine: A systematic review. Ann. Emerg. Med. 2019, 74, 140–152.
9. Sax, D.R.; Warton, E.M.; Mark, D.G.; Reed, M.E. Emergency department triage accuracy and delays in care for high-risk conditions. JAMA Netw. Open 2025, 8, e258498.
10. Royal College of Physicians. National Early Warning Score (NEWS): Standardising the Assessment of Acute-Illness Severity in the NHS; Royal College of Physicians: London, UK, 2012.
11. Pedersen, N.E.; Rasmussen, L.S.; Petersen, J.A.; Gerds, T.A.; Østergaard, D.; Lippert, A. A critical assessment of early warning score records in 168,000 patients. J. Clin. Monit. Comput. 2018, 32, 109–116.
12. Alam, N.; Hobbelink, E.; van Tienhoven, A.; van de Ven, P.; Jansma, E.; Nanayakkara, P. The impact of the use of the Early Warning Score (EWS) on patient outcomes: A systematic review. Resuscitation 2014, 85, 587–595.
13. Levin, S.; Toerper, M.; Hamrock, E.; Hinson, J.S.; Barnes, S.; Gardner, H.; Dugas, A.; Linton, B.; Kirsch, T.; Kelen, G. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes. Ann. Emerg. Med. 2018, 71, 565–574.
14. Raita, Y.; Goto, T.; Faridi, M.K.; Brown, D.F.M.; Camargo, C.A., Jr.; Hasegawa, K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit. Care 2019, 23, 64.
15. Kwon, J.M.; Lee, Y.; Lee, Y.; Lee, S.; Park, J. An algorithm based on deep learning for predicting in-hospital cardiac arrest. J. Am. Heart Assoc. 2018, 7, e008678.
16. Fernandes, M.; Vieira, S.M.; Leite, F.; Palos, C.; Finkelstein, S.; Sousa, J.M. Clinical decision support systems for triage in the emergency department using intelligent systems: A review. Artif. Intell. Med. 2020, 102, 101762.
17. Porto, B.M. Improving triage performance in emergency departments using machine learning and natural language processing: A systematic review. BMC Emerg. Med. 2024, 24, 219.
18. Sharafoddini, A.; Dubin, J.A.; Lee, J. Patient similarity in prediction models based on health data: A scoping review. JMIR Med. Inform. 2017, 5, e7.
19. Vickers, A.J.; Elkin, E.B. Decision curve analysis: A novel method for evaluating prediction models. Med. Decis. Mak. 2006, 26, 565–574.
20. Chung, H.S.; Namgung, M.; Lee, D.H.; Choi, Y.H.; Bae, S.J. Validity of the Korean Triage and Acuity Scale in older patients compared to the adult group. Exp. Gerontol. 2023, 175, 112136.
21. Choi, D.H.; Hong, W.P.; Song, K.J.; Kim, T.H.; Shin, S.D.; Hong, K.J.; Park, J.H.; Jeong, J. Modification and Validation of a Complaint-Oriented Emergency Department Triage System: A Multicenter Observational Study. Yonsei Med. J. 2021, 62, 1145–1154.
22. Choi, S.W.; Ko, T.; Hong, K.J.; Kim, K.H. Machine learning-based prediction of Korean Triage and Acuity Scale level in emergency department patients. Healthc. Inform. Res. 2019, 25, 305–312.

Figure 1. Study population flow and analytic split. From 35,052 ED encounters extracted from the Clinical Data Warehouse, 33 minors were excluded at the extraction stage, yielding 35,019 adult encounters. Subsequent exclusions of 510 KTAS Level 8 (administrative) encounters and 23,717 KTAS Level 1–3 (high-acuity) encounters produced a final low-acuity analytic cohort of 10,792 patients (KTAS Level 4: 9206 [85.3%]; KTAS Level 5: 1586 [14.7%]) with 941 reclassifications (8.7%). The cohort was partitioned into training (n = 8627; 752 reclassifications) and held-out test (n = 2165; 189 reclassifications) sets via stratified random splitting (80:20). KTAS, Korean Triage and Acuity Scale; LR, logistic regression; GBC, gradient-boosting classifier; CV, cross-validation; CI, confidence interval.

Figure 2. Discriminative performance of selected models on the held-out test set. (A) Receiver operating characteristic curves; (B) precision–recall curves. AUROC and AUPRC values indicated in parentheses. The horizontal dashed line in panel B represents the outcome prevalence (8.7%).

Figure 3. Area under the receiver operating characteristic curve (AUROC) with 95% confidence intervals across the five nested feature sets for logistic regression (blue) and gradient boosting (red). The dashed grey line indicates NEWS alone (AUROC = 0.583). Error bars from 1000 stratified bootstrap iterations.

Figure 4. Calibration plot for the gradient-boosting Set E model on the held-out test set, with calibration slope (0.85) and intercept (−0.30) annotated. The dashed diagonal line represents perfect calibration.

Figure 5. Decision curve analysis comparing the net clinical benefit of gradient-boosting Sets A (vital signs only, dashed blue) and E (full panel, solid red) across a range of decision thresholds. The dotted line represents the treat-all strategy and the solid line represents the treat-none strategy. The shaded band identifies the clinically relevant 3–20% threshold range.

Figure 6. Feature importance (mean decrease in impurity) for the top 20 predictors in the gradient-boosting classifier trained on the full triage-time panel (Set E, n = 43 features). Red bars indicate features with importance > 0.04. Numerical values are annotated on each bar.

Figure 7. Partial dependence plots for the four most influential continuous predictors in the gradient-boosting classifier (Set E). (A) Pain score (NRS); (B) onset-to-arrival time (minutes, capped at the 95th percentile); (C) heart rate (beats per minute); (D) systolic blood pressure (mmHg). Each curve represents the marginal effect of the predictor on the predicted probability of triage reclassification, averaged across all other features.

Table 1. Baseline characteristics of the study population stratified by triage reclassification status.

Variable	Total (n = 10,792)	Not Reclassified (n = 9851)	Reclassified (n = 941)	p
Demographics
Age, years, median (IQR)	56.0 (37.0–71.0)	56.0 (37.0–70.0)	61.0 (43.0–73.0)	<0.001
Male sex, n (%)	4650 (43.1)	4233 (43.0)	417 (44.3)	0.447
KTAS level 4 (vs. 5), n (%)	9206 (85.3)	8325 (84.5)	881 (93.6)	<0.001
Vital Signs (at triage)
SBP, mmHg	133.0 (120.0–150.0)	133.0 (120.0–150.0)	136.0 (120.0–152.0)	0.043
DBP, mmHg	79.0 (69.0–88.0)	79.0 (69.0–88.0)	78.0 (68.0–89.0)	0.455
Heart rate, /min	84.0 (75.0–95.0)	84.0 (75.0–95.0)	84.0 (74.0–98.0)	0.374
Respiratory rate, /min	20.0 (20.0–20.0)	20.0 (20.0–20.0)	20.0 (20.0–20.0)	0.006
Body temperature, °C	36.6 (36.3–36.9)	36.6 (36.3–36.9)	36.7 (36.3–37.0)	0.006
Arrival Context
Ambulance arrival, n (%)	1150 (10.7)	995 (10.1)	155 (16.5)	<0.001
Transfer-in, n (%)	615 (5.7)	498 (5.1)	117 (12.4)	<0.001
Altered consciousness, n (%)	45 (0.4)	42 (0.4)	3 (0.3)	0.795
Temporal Factors
Night visit (23:00–07:00), n (%)	2467 (22.9)	2270 (23.0)	197 (20.9)	0.153
Weekend/Holiday, n (%)	3648 (33.8)	3366 (34.2)	282 (30.0)	0.010
Onset-to-arrival, min, median (IQR)	326 (90–1456)	312 (88–1416)	540 (121–2498)	<0.001
Acute onset (≤60 min), n (%)	1865 (17.3)	1741 (17.7)	124 (13.2)	<0.001
Medical History and Social
Past medical history, n (%)	6078 (56.3)	5482 (55.6)	596 (63.3)	<0.001
Surgical history, n (%)	4447 (41.2)	4038 (41.0)	409 (43.5)	0.150
Medication use, n (%)	5131 (47.5)	4615 (46.8)	516 (54.8)	<0.001
Accompanying person, n (%)	6279 (58.2)	5674 (57.6)	605 (64.3)	<0.001
Medical aid insurance, n (%)	609 (5.6)	569 (5.8)	40 (4.3)	0.062
Injury mechanism, n (%)	2407 (22.3)	2216 (22.5)	191 (20.3)	0.132
Repeat visit, n (%)	872 (8.1)	807 (8.2)	65 (6.9)	0.187

Values are presented as median (interquartile range) for continuous variables and n (%) for categorical variables. p-values from Mann–Whitney U test for continuous variables and χ² test (or Fisher’s exact test) for categorical variables.

Table 2. Model performance across progressively enriched feature sets.

Feature Set	Model	Features, n	CV AUROC	Test AUROC (95% CI)	AUPRC (95% CI)	Brier Score
NEWS alone	—	1	—	0.583 (0.548–0.621)	0.119 (0.096–0.152)	—
A (Vital Signs)	LR	10	0.661 ± 0.017	0.633 (0.588–0.677)	0.166 (0.131–0.213)	0.228
A (Vital Signs)	GBC	10	0.685 ± 0.013	0.674 (0.633–0.716)	0.176 (0.145–0.230)	0.077
B (+Arrival)	LR	13	0.680 ± 0.016	0.666 (0.626–0.709)	0.188 (0.151–0.241)	0.221
B (+Arrival)	GBC	13	0.695 ± 0.010	0.706 (0.667–0.748)	0.198 (0.161–0.252)	0.076
C (+Temporal)	LR	17	0.680 ± 0.018	0.664 (0.624–0.708)	0.185 (0.149–0.236)	0.221
C (+Temporal)	GBC	17	0.696 ± 0.011	0.711 (0.673–0.752)	0.208 (0.170–0.265)	0.075
D (+Chief Complaint)	LR	34	0.694 ± 0.023	0.681 (0.642–0.723)	0.194 (0.157–0.249)	0.216
D (+Chief Complaint)	GBC	34	0.726 ± 0.020	0.723 (0.684–0.762)	0.236 (0.191–0.301)	0.074
E (Full Panel)	LR	43	0.695 ± 0.015	0.683 (0.643–0.724)	0.193 (0.157–0.249)	0.215
E (Full Panel)	GBC	43	0.730 ± 0.022	0.719 (0.678–0.759)	0.230 (0.185–0.291)	0.075

AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision–recall curve; CI, confidence interval; CV, cross-validation; GBC, gradient-boosting classifier; LR, logistic regression; NEWS, National Early Warning Score. 95% CIs derived from 1000 stratified bootstrap iterations.
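
The stratified bootstrap mentioned in the footnote can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the percentile method for the interval is assumed, and the function names `auroc` and `stratified_bootstrap_ci` are hypothetical. Stratification here means that events and non-events are resampled separately, so every bootstrap replicate keeps the original event count.

```python
import random

def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney) identity, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of tied scores starting at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def stratified_bootstrap_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling events and non-events separately."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_iter):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(auroc([labels[i] for i in idx], [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round(alpha / 2 * n_iter)]
    upper = stats[round((1 - alpha / 2) * n_iter) - 1]
    return lower, upper

# Synthetic example only: 50 events, 450 non-events, weakly separated scores
rng = random.Random(42)
labels = [1] * 50 + [0] * 450
scores = [rng.gauss(0.6, 1.0) if y else rng.gauss(0.0, 1.0) for y in labels]

point = auroc(labels, scores)
lo, hi = stratified_bootstrap_ci(labels, scores)
print(f"AUROC = {point:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With a low event rate, resampling within each class avoids replicates that contain very few events, which would otherwise widen the interval.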

Table 3. Operating performance of the GBC Set E model at decision thresholds across the clinically relevant 3–20% band.

Threshold	Sensitivity (95% CI)	Specificity	PPV	NPV	Patients Flagged	Events Caught	NNS *
3% (highly sensitive)	0.94 (0.91–0.97)	0.15	0.10	0.97	85.5%	94.2%	10.4
5% (primary)	0.79 (0.74–0.85)	0.49	0.13	0.96	53.3%	79.4%	7.7
10% (specific)	0.57 (0.51–0.64)	0.78	0.20	0.95	25.4%	57.1%	5.1

* NNS, number-needed-to-screen—patients flagged per reclassification correctly identified. Computed on the held-out test set (n = 2165, events = 189). PPV, positive predictive value; NPV, negative predictive value.
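
The columns of Table 3 are linked through a single confusion matrix, which the sketch below reconstructs for the 5% threshold from the reported test-set size, event count, sensitivity, and flagged fraction. The function name `operating_metrics` is hypothetical, and the cell counts are an assumption: they are recovered by rounding published percentages, so they may differ from the authors' actual counts by a patient or two.

```python
def operating_metrics(n_total, n_events, sensitivity, flagged_fraction):
    """Rebuild a 2x2 confusion matrix from summary figures and derive the Table 3 metrics."""
    tp = round(sensitivity * n_events)           # events flagged by the model
    flagged = round(flagged_fraction * n_total)  # all patients flagged
    fp = flagged - tp                            # non-events flagged
    fn = n_events - tp                           # events missed
    tn = (n_total - n_events) - fp               # non-events not flagged
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "nns": (tp + fp) / tp,  # patients flagged per event caught, i.e. 1 / PPV
    }

# 5% threshold row: test set n = 2165, events = 189, 79.4% caught, 53.3% flagged
m5 = operating_metrics(2165, 189, 0.794, 0.533)
print(m5["tp"], m5["fp"], m5["fn"], m5["tn"])
print(f"spec={m5['specificity']:.2f} ppv={m5['ppv']:.2f} "
      f"npv={m5['npv']:.2f} nns={m5['nns']:.1f}")
```

Because NNS is the reciprocal of PPV, a lower threshold that flags more patients raises sensitivity at the cost of more patients screened per reclassification found.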
