Hybrid Machine Learning Architectures for Emergency Triage: A Systematic Review of Predictive Performance and the Complexity Gradient

Ullah, Junaid; Ramasamy, R Kanesaraj; Rajendran, Venushini

doi:10.3390/biomedinformatics6020021

Open Access Systematic Review

by Junaid Ullah, R Kanesaraj Ramasamy * and Venushini Rajendran

Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63000, Malaysia

* Author to whom correspondence should be addressed.

BioMedInformatics 2026, 6(2), 21; https://doi.org/10.3390/biomedinformatics6020021

Submission received: 11 February 2026 / Revised: 23 March 2026 / Accepted: 25 March 2026 / Published: 10 April 2026

Abstract

Background: Emergency triage systems using machine learning traditionally rely on structured tabular data (vital signs), creating a “contextual blind spot” that ignores diagnostic information embedded in unstructured clinical narratives. Hybrid AI models that fuse tabular and text data may improve predictive discrimination, but the magnitude and conditions under which fusion adds value remain unclear. Methods: Five databases (PubMed, Scopus, Web of Science, IEEE Xplore, ACM Digital Library) were searched from 1 January 2015 to 15 December 2025. Eligible studies employed Hybrid AI models integrating structured and unstructured emergency department data with quantitative baseline comparisons. Twenty-five studies (N ≈ 4.8 million encounters) met inclusion criteria. We extracted marginal performance gains (ΔAUC), calibration metrics, and demographic reporting. Synthesis followed SWiM principles with subgroup meta-regression testing our novel “Complexity Gradient” hypothesis. Results: Hybrid models demonstrated superior discrimination compared to tabular baselines, with effect magnitude dependent on clinical task complexity. Low-complexity tasks (tachycardia prediction) showed minimal gains (median ΔAUC +0.036, IQR: 0.02–0.05), while high-complexity tasks (hypoxia, sepsis) demonstrated substantial improvement (median ΔAUC +0.111, IQR: 0.09–0.13). Meta-regression confirmed complexity significantly moderated effect size (R² = 0.42, p = 0.003). Only 12% (3/25) of studies reported calibration metrics (Brier scores: 0.089–0.142). Zero studies stratified performance by race/ethnicity; 88% (22/25) failed to report training data demographics. Discussion: The complexity gradient framework explains when multimodal fusion adds predictive value: tasks where diagnostic signal resides in narrative features (temporality, negation) rather than physiological measurements. However, systematic absence of calibration reporting and fairness auditing prevents clinical deployment. Seventy-two percent of studies had high risk of bias in the analysis domain due to retrospective designs without temporal validation. Conclusions: Hybrid triage models show promise for complex diagnostic tasks but require mandatory calibration reporting and demographic performance stratification before clinical implementation. We propose minimum reporting standards including Brier scores, race-stratified metrics, and temporal validation protocols.

Keywords:

machine learning; emergency triage; multimodal fusion; natural language processing; systematic review; algorithmic fairness; PRISMA 2020

1. Introduction

The emergency department (ED) serves as the critical interface between community health and acute hospital care, operating under extreme time pressure and diagnostic uncertainty. Triage, the initial risk stratification process, determines treatment priority and resource allocation, directly impacting patient safety and operational efficiency. Traditional triage protocols (such as the Emergency Severity Index [1] and the Manchester Triage System) rely on structured physiological measurements and categorical acuity assignments [2].

From a machine learning perspective, however, manual triage represents a lossy compression of a patient’s state. A patient’s subjective pain descriptors (“crushing” versus “sharp”), symptom trajectory (“worsening over 2 h”), and contextual narrative cues are reduced to a single ordinal label. Early Clinical Decision Support Systems (CDSS) that use only tabular vital signs, such as heart rate, blood pressure, and respiratory rate, inherit this information bottleneck [3,4]. While computationally efficient, these single-modality models create what we term a “contextual blind spot”: they may successfully identify frank physiological derangement (hypotension, tachycardia) but fail to recognize stable patients with high-risk narrative features such as “thunderclap headache” or “migratory chest pain,” presentations where diagnostic information resides in the narrative, not the vital signs [5].

1.1. Gaps in Existing Literature

Prior systematic reviews of AI in emergency triage [6,7] have focused predominantly on aggregate accuracy metrics, treating “hybrid” or “multimodal” architectures as a secondary methodological detail. These reviews typically report that AI models achieve AUC values of 0.80–0.90 for triage-level prediction but fail to decompose performance gains to isolate the marginal contribution of multimodal fusion versus tabular baselines. This obscures the fundamental question: when does adding unstructured text actually improve prediction beyond structured data alone?

Furthermore, existing literature has systematically neglected two critical deployment considerations: (1) calibration, that is, whether predicted probabilities match observed event rates, which is essential for setting clinical decision thresholds; and (2) algorithmic fairness, that is, whether model performance varies across demographic subgroups, which risks perpetuating healthcare disparities.

1.2. Contribution and Novel Framework

This systematic review addresses these gaps through three contributions. First, we perform a marginal performance decomposition: rather than reporting hybrid model performance in isolation, we explicitly extract and synthesize ΔAUC values to isolate the performance gain attributable specifically to multimodal fusion across all 25 included studies. Second, we propose and empirically test the complexity gradient hypothesis, a novel framework that categorizes clinical prediction tasks by their dependence on narrative versus physiological information. It hypothesizes that fusion provides maximal benefit for high-complexity tasks, where diagnostic signals are encoded in semantic features such as temporal progression, symptom negation, and subjective descriptors rather than in numeric biomarkers. Third, through systematic fairness auditing, we conduct the first comprehensive assessment of demographic reporting and bias mitigation practices in the hybrid triage literature. This audit reveals the equity blind spot, a term we use to describe the near-complete absence of performance evaluation stratified by race.

We formally define the complexity gradient as a framework categorizing clinical prediction tasks by their narrative dependence. Tasks are classified as low complexity when the diagnostic signal is strictly physiological, for example, isolated vital sign thresholds. They are classified as medium complexity when dependent on a multidimensional administrative context and high complexity when the diagnostic signal heavily relies on semantic features such as chronology, negation, and subjective descriptors that are absent from tabular data.

1.3. Research Questions

We formulate four specific research questions:

RQ1: What is the marginal performance gain measured as ΔAUC attributable to multimodal fusion compared to tabular baselines, and does this gain vary systematically across clinical task complexity?
RQ2: What proportion of hybrid triage models report calibration metrics such as Brier score and calibration slope, and among those that do, what is the relationship between discrimination performance measured by AUC and calibration performance?
RQ3: Given the current maturity of fusion strategies including early, late, and unified architectures, to what extent have models progressed beyond retrospective validation to prospective deployment or randomized trials?
RQ4: What proportion of studies reporting training data demographics also stratify model performance metrics like AUC, sensitivity, and specificity by race, ethnicity, insurance status, or other equity-relevant variables?

2. Related Work

The application of machine learning (ML) to emergency department (ED) triage is not a novel endeavor; however, the architectural paradigms governing these models have undergone a fundamental shift over the last decade. This section critically examines the evolution of the field, tracing the trajectory from rigid rule-based protocols to early tabular classifiers, and finally to the current frontier of multimodal “Hybrid AI.” We categorize prior literature into three distinct epistemic phases: (1) the structured era, (2) the unstructured awakening, and (3) the multimodal fusion frontier. Furthermore, we analyze existing systematic reviews to demonstrate the specific research gaps—particularly regarding algorithmic fairness and architectural synthesis—that this review aims to address.

2.1. The Structured Era: The Limits of Vital Signs

For decades, ED triage has been governed by rule-based protocols such as the Emergency Severity Index (ESI), the Manchester Triage System (MTS), and the Canadian Triage and Acuity Scale (CTAS) [8]. These systems rely on linear discriminators, primarily vital signs (heart rate, blood pressure, oxygen saturation) and fixed chief complaint categories, to assign a priority level (1–5).

Table 1 outlines the standard five-level triage hierarchy. This classification system maps specific clinical presentations to prioritized target wait times, providing a standardized framework for emergency department resource allocation.

While effective for standardization, these protocols suffer from significant inter-rater variability and “lossy compression,” reducing a complex patient presentation to a single integer.

The first wave of ML innovation sought to automate this logic using structured Electronic Health Record (EHR) data. Early studies demonstrated that models like logistic regression and random forests could outperform standard ESI protocols in predicting admission and mortality [3]. These “Structured Era” models excelled at identifying frank physiological instability (e.g., hypotension, tachycardia). However, they encountered a hard “performance ceiling.” By ignoring the clinical narrative, they failed to detect high-risk conditions characterized by normal vitals but concerning symptomatology (e.g., “thunderclap headache,” “migratory chest pain,” or “feeling of impending doom”). This limitation established the consensus that vital signs alone are insufficient proxies for acuity in complex presentations.

2.2. The Unstructured Awakening: NLP in the ED

The recognition that the “signal” for complex acuity is often encoded in the free-text nursing note led to the integration of natural language processing (NLP). The evolution of clinical NLP within this domain mirrors the broader field’s trajectory, progressing from symbolic keywords to context-aware representations.

2.2.1. Bag-of-Words and Static Embeddings

Initial efforts to incorporate text utilized bag-of-words (BoW) and term frequency-inverse document frequency (TF-IDF) methods. While these approaches successfully captured high-risk keywords (e.g., “hemorrhage,” “unresponsive”), they were fundamentally incapable of understanding negation, temporality, or context. A note stating “Patient denies chest pain” was often vectorially similar to “Patient reports chest pain,” leading to critical false positives.
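The negation problem described above can be made concrete with a minimal sketch. It assumes a plain whitespace tokenizer and raw term counts (no IDF weighting, no stop-word removal), which is a simplification of the BoW and TF-IDF pipelines used in the literature; the helper names `bow_vector` and `cosine_similarity` are illustrative only.

```python
import math
from collections import Counter


def bow_vector(text):
    """Build a bag-of-words count vector using naive whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    # Counter returns 0 for missing keys, so unshared tokens contribute nothing.
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b)


denies = bow_vector("Patient denies chest pain")
reports = bow_vector("Patient reports chest pain")

# Three of the four tokens are shared, so the similarity is 0.75
# even though the two notes have opposite clinical meaning.
print(cosine_similarity(denies, reports))
```

Because word order and negation scope are discarded, the only difference between the two notes is a single token, which is why such representations struggle to separate affirmed from denied symptoms.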

Subsequent studies, such as those reviewed by Stewart et al. [6], adopted static word embeddings like Word2Vec and GloVe. These allowed for semantic clustering, capturing that “tachycardia” and “fast heart rate” are related concepts. However, they remained context-independent; the vector for “cold” was identical whether it referred to a viral URI or a patient who was “cold and clammy” due to shock.

2.2.2. The Transformer Revolution

The advent of the Transformer architecture (BERT and its clinical variants like BioBERT and ClinicalBERT) marked a watershed moment [9,10,11]. Unlike their predecessors, Transformers utilize self-attention mechanisms to weigh the importance of tokens relative to their context. This allowed models to resolve complex clinical syntax (e.g., “no shortness of breath, but reports tightness”) and qualify severity (“severe” vs. “mild”).

Recent work by Winston et al. and Nover et al. illustrates the superiority of these architectures. They showed that replacing TF-IDF with Transformer-based embeddings results in statistically significant improvements in discriminative performance (AUC), particularly for conditions where the diagnosis is driven by the description of the pathology rather than its physiological manifestation.

2.3. The Multimodal Fusion Frontier: Hybrid AI Models

The current state-of-the-art lies in “Hybrid AI” systems that fuse the objectivity of structured vitals with the nuance of unstructured text. This multimodal approach mimics human clinical reasoning, which synthesizes objective data (signs) with subjective data (symptoms). However, the literature remains divided on the optimal engineering strategy for this fusion.

Fusion Strategies: Early vs. Late

Two dominant architectures have emerged, although comparative benchmarking remains rare. Early fusion, or feature-level fusion, as seen in Roquette et al. [12], concatenates high-dimensional text vectors with normalized vital sign vectors into a single wide input layer before classification. This allows the model to learn low-level interactions immediately, such as correlating the word “fever” with a temperature of 39.5 °C. Late fusion, or ensemble-level fusion, as used by Zhang et al. [13], trains separate models for text and vitals and aggregates their probabilistic outputs via voting or averaging. This approach is computationally modular, but critics argue that it fails to capture complex cross-modal dependencies, such as a patient with normal vitals but a high-risk history.

Finally, a third, nascent category of unified fusion was explored by Winston et al., where structured data is serialized into text strings and fed into large language models as a single prompt. This represents a shift towards text as a universal interface, although its computational cost remains a barrier for real-time deployment.

2.4. The “Implementation Gap” and Algorithmic Fairness

Despite technical gains, the translation of these models into clinical practice remains stalled, a phenomenon often termed the “AI Chasm” or “Implementation Gap.” A critical, yet under-discussed barrier is algorithmic fairness.

It is well-documented in sociology and health equity literature that clinical notes contain systemic biases. Nursing documentation for minority groups is often shorter, less detailed, or contains stigmatizing language (e.g., labeling a patient as “non-compliant” or “agitated” rather than “in pain”). If a hybrid model is trained on these biased historical artifacts, it risks automating and scaling these disparities.

While the general ML community has developed rigorous fairness metrics (e.g., equalized odds, demographic parity), the specific sub-field of ED triage ML has largely ignored this dimension. Prior reviews, such as Arab et al., focus heavily on accuracy metrics (sensitivity, specificity) but rarely audit the included studies for demographic stratification. This silence constitutes what we term the “Equity Blind Spot”: a systematic failure to verify if models perform equally across race, gender, and socioeconomic status.

2.5. Critique of Existing Reviews and Research Gap

Several systematic reviews have addressed AI in emergency medicine, yet specific limitations necessitate the current study. First, reviews such as Stewart et al. [6] provided excellent narrative descriptions of NLP applications but did not rigorously assess the additive value of fusion (hybrid versus single modality) or quantify the performance delta across different clinical conditions. Second, many prior reviews included studies using synthetic or toy datasets, which inflate performance estimates and fail to reflect the noise of real-world EHRs. Third, the reliance on retrospective validation is rarely challenged: few reviews distinguish between models validated on past data and those tested on future temporal splits. Finally, previous meta-analyses often pooled disparate outcomes, such as mixing ICU admission with general ward admission, leading to high heterogeneity (I² > 90%) that obscures the specific utility of NLP for triage.

This systematic review addresses these gaps by strictly excluding synthetic data to focus solely on real-world clinical encounters ensuring ecological validity. Additionally, we define the complexity gradient to move beyond aggregate AUCs and analyze when NLP adds value based on the hypothesis that hybrid utility scales with the entropy or uncertainty of the condition. Finally, we audit the equity blind spot by conducting the first systematic audit of fairness reporting in the hybrid triage literature, shifting the focus from asking how accurate the model is to determining for whom the model is accurate.

By synthesizing these dimensions, this review moves beyond a summary of algorithms to provide a critical evaluation of the field’s readiness for safe, equitable clinical translation.

3. Methodology

3.1. Protocol Registration and PRISMA Compliance

The protocol specifies research questions, PICO eligibility criteria, search strategies, risk-of-bias assessment methods, and synthesis approach. We adhered strictly to PRISMA 2020 (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. The completed PRISMA 2020 checklist is provided in Supplementary File S1 and no deviations from the registered protocol occurred during study execution.

3.2. Eligibility Criteria (PICO Framework)

We operationalized PICO criteria as shown in Table 2. Critically, we required studies to provide quantitative baseline comparisons, enabling calculation of marginal fusion gains. Studies using synthetic or simulated patient data were excluded to prevent modeling of artificial correlations absent in real clinical documentation.

3.3. Information Sources and Search Strategy

3.3.1. Database Selection and Search Dates

To capture both clinical and computer science literature, we searched five electronic databases:

PubMed (MEDLINE): Searched 15 December 2025, covering 1 January 2015 to 15 December 2025.
Scopus: Searched 16 December 2025, covering 1 January 2015 to 15 December 2025.
Web of Science (Core Collection): Searched 16 December 2025, covering 1 January 2015 to 15 December 2025.
IEEE Xplore: Searched 17 December 2025, covering 1 January 2015 to 15 December 2025.
ACM Digital Library: Searched 17 December 2025, covering 1 January 2015 to 15 December 2025.

The 2015 start date captures the rapid expansion of deep learning NLP following word2vec (2013) and the introduction of attention mechanisms (2015). We employed Medical Subject Headings (MeSH) in PubMed and Boolean operators across all databases. Complete search strings are provided in Appendix A.

3.3.2. Supplementary Search Strategy

To mitigate indexing bias and ensure reproducibility of the AI-assisted supplementary search, the complete corpus of 25 included PDFs was uploaded to Elicit AI (Elicit.org). We executed a semantic citation graph scan querying: “Identify highly cited (>50 citations) peer-reviewed papers semantically similar to this corpus focusing on hybrid triage or multimodal emergency AI.” This process analyzed 2147 connected citation nodes. As a screening safeguard, all retrieved candidates were subsequently subjected to manual human screening against the predefined PICO criteria. The scan identified three candidates, all of which had been previously excluded during manual screening for failing PICO criteria (two used synthetic data; one focused on ICU, not ED). This null result confirms the robustness of our organic Boolean search strategy.

3.4. Selection Process and Quality Assessment

Two reviewers (J.U. and V.R.) independently screened titles/abstracts, then full texts, using Covidence systematic review software (Veritas Health Innovation, Melbourne, Australia; available at www.covidence.org). Inter-rater reliability was substantial: Cohen’s κ = 0.86 for title/abstract screening (n = 500), and κ = 0.89 for full-text screening (n = 70). Disagreements were resolved by a third senior reviewer (R.K.R.) blinded to initial decisions.
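For readers unfamiliar with the statistic, the following is a minimal sketch of how Cohen's κ is computed from two reviewers' screening decisions. The include/exclude labels below are hypothetical toy data, not the review's actual screening records, and the function name `cohens_kappa` is illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions for ten records (not data from this review).
reviewer_1 = ["include"] * 5 + ["exclude"] * 5
reviewer_2 = ["include"] * 4 + ["exclude"] * 6
print(cohens_kappa(reviewer_1, reviewer_2))
```

In this toy example the reviewers agree on 9 of 10 records, and chance agreement is 0.5, so κ = (0.9 − 0.5) / (1 − 0.5) = 0.8.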

Risk of bias was independently assessed by two reviewers utilizing the PROBAST (Prediction model Risk Of Bias Assessment Tool) framework, evaluating participant selection, predictors, outcome, and analysis domains. Disagreements were resolved through consensus discussion involving a third senior reviewer. A complete domain-level summary for all 25 studies is detailed in the Results section (Section 4.6.4). Notably, 72% of the studies exhibited a high risk of bias in the analysis domain, largely driven by retrospective validation strategies without temporal hold-out sets.

3.5. Data Extraction

We developed a standardized extraction form pilot-tested on five studies (not included in final synthesis). Two reviewers (J.U., V.R.) independently extracted data from all 25 included studies.

Beyond extracting the core study characteristics, we also synthesized the common natural language processing workflows utilized across the literature. Figure 1 illustrates this standard data preprocessing pipeline, demonstrating the step-by-step conversion of raw Electronic Health Record (EHR) dumps into the final tensor inputs. Extracted variables are detailed in Table 3.

Marginal Gain Calculation: To isolate the specific contribution of multimodal fusion, we calculated absolute marginal gains (ΔAUC = AUC_hybrid − AUC_baseline) rather than relative percentage improvements (ΔAUC%). This metric was selected to maintain direct comparability and to avoid artificially inflating the perceived benefit of fusion in studies with weak baselines. In instances where studies reported multiple baselines (both text-only and tabular-only), we conservatively calculated ΔAUC against the stronger baseline (highest AUC) to ensure our estimates of fusion benefit were not overstated.
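The extraction rule above can be sketched in a few lines. The AUC values and the function name `marginal_gain` are hypothetical and serve only to illustrate the conservative choice of the stronger baseline.

```python
def marginal_gain(auc_hybrid, baseline_aucs):
    """Absolute marginal gain (delta AUC) against the strongest reported baseline.

    baseline_aucs maps a baseline name (e.g. "tabular_only") to its AUC.
    """
    if not baseline_aucs:
        raise ValueError("At least one baseline AUC is required.")
    strongest_baseline = max(baseline_aucs.values())
    return auc_hybrid - strongest_baseline


# Hypothetical study reporting both single-modality baselines.
example = {"tabular_only": 0.78, "text_only": 0.80}
print(round(marginal_gain(0.86, example), 3))  # gain versus the 0.80 baseline
```

Measured against the weaker tabular-only baseline, the same hypothetical study would show a gain of 0.08 rather than 0.06, which is the overstatement the conservative rule is meant to avoid.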

3.6. Fusion Strategy Taxonomy

To ensure the reproducible classification of multimodal fusion strategies, we applied the following operational definitions. First, early fusion is defined as architectures where unstructured (text) embeddings and structured (tabular) features are concatenated into a single feature vector prior to input to a unified classifier, for example, 768-dimensional BERT embeddings concatenated with a 12-dimensional structured vector and fed into a random forest. Next, late fusion involves separate models trained independently on unstructured (text) and structured (tabular) data, whose predictions are combined via ensemble methods such as voting, averaging, stacking, or weighted combination. Finally, a third nascent category of unified fusion has been explored by Winston et al. [14], where structured (tabular) data is serialized into text strings and fed into large language models as a single prompt. This represents a shift towards unstructured (text) data as a universal interface, although its computational cost remains a barrier for real-time deployment.
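The three operational definitions can be contrasted in a schematic sketch. This shows only where the modalities are combined; it is not an implementation from any included study. The function names, the equal default weighting, and the prompt template are all assumptions made for illustration, and the downstream classifier or language model is deliberately omitted.

```python
def early_fusion_features(text_embedding, tabular_features):
    """Early fusion: concatenate both modalities into one feature vector
    that a single downstream classifier would consume."""
    return list(text_embedding) + list(tabular_features)


def late_fusion_probability(p_text, p_tabular, weight_text=0.5):
    """Late fusion: combine the outputs of two independently trained models,
    here by a weighted average of their predicted probabilities."""
    return weight_text * p_text + (1.0 - weight_text) * p_tabular


def unified_fusion_prompt(tabular, note):
    """Unified fusion: serialize structured fields into text so that a
    language model receives a single prompt (template is hypothetical)."""
    vitals = ", ".join(f"{name}: {value}" for name, value in tabular.items())
    return f"Vitals: {vitals}. Triage note: {note}"


# 768-dimensional text embedding plus 12 structured features, as in the definition above.
fused = early_fusion_features([0.0] * 768, [0.0] * 12)
print(len(fused))  # 780

print(late_fusion_probability(p_text=0.8, p_tabular=0.4))
print(unified_fusion_prompt({"HR": 118, "SBP": 92}, "worsening dyspnea over 6 h"))
```

The practical difference is where interactions can be learned: an early-fusion classifier sees both modalities at once, whereas late fusion only sees each model's final probability.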

3.7. Synthesis Methods

3.7.1. Rationale for Synthesis Without Meta-Analysis (SWiM)

We planned a quantitative meta-analysis but found high clinical and methodological heterogeneity. Preliminary analysis revealed substantial heterogeneity (I² = 82%, 95% CI: 76 to 87%) for ΔAUC across all studies, precluding traditional pooling. Because the primary studies did not universally report standard errors or confidence intervals for the marginal ΔAUC, traditional inverse-variance weighting could not be applied. We therefore employed Synthesis Without Meta-analysis (SWiM), following structured reporting guidelines.
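For reference, the I² statistic is conventionally derived from Cochran's Q as I² = max(0, (Q − df) / Q) × 100%, with df equal to the number of studies minus one. The sketch below assumes per-study variances are available, which, as noted above, was not universally the case in the primary studies; the input values are hypothetical and the function name `i_squared` is illustrative.

```python
def i_squared(effects, variances):
    """Higgins' I-squared (in percent) from per-study effects and variances."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("Need matching effects and variances for at least two studies.")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical delta-AUC estimates with hypothetical variances.
print(i_squared([0.03, 0.07, 0.12], [0.0004, 0.0004, 0.0004]))
```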

3.7.2. The Complexity Gradient Framework

To explain heterogeneity in ΔAUC, we developed an a priori classification schema categorizing clinical prediction tasks by their dependence on narrative versus physiological information. We term this the “Complexity Gradient” and operationalized three categories. Low-complexity outcomes are defined by single-variable physiological thresholds directly captured in tabular data, such as tachycardia (HR > 100 bpm) or hypotension (SBP < 90 mmHg). Here we hypothesize minimal fusion benefit, because the tabular baseline already approaches Bayes-optimal performance.

Medium-complexity outcomes are administrative or protocol-driven outcomes dependent on multiple structured factors, such as hospital admission decisions or ICU transfers. Here we hypothesize a moderate fusion benefit, because narratives provide contextual factors, including social determinants and disposition preferences, that are not found in vitals.

Finally, high-complexity outcomes are clinical syndromes requiring integration of temporal, semantic, and contextual features, such as sepsis or critical illness. Here we hypothesize maximal fusion benefit, because diagnostic signals reside in narrative temporality (e.g., “worsening dyspnea over 6 h”) or subjective descriptors (e.g., “crushing substernal chest pain”).

Independent Validation of Complexity Classification: To prevent circular reasoning, two emergency medicine physicians (external to the research team, blinded to study results) independently classified the primary outcome of each study into low/medium/high complexity using the operational definitions above. Inter-rater reliability was excellent (κ = 0.91). The two disagreements (out of 25 classifications) were resolved by consensus discussion. This external validation ensures that the complexity gradient is not an artifact of post hoc rationalization.

3.7.3. Subgroup Analysis and Meta-Regression

Although formal meta-analysis was inappropriate, we performed exploratory meta-regression to test whether complexity category moderated ΔAUC. Because traditional inverse-variance weighting could not be applied, we fit an exploratory, unweighted, univariate ordinary least squares (OLS) linear regression: ΔAUC ∼ Complexity + ε. We coded complexity as an ordinal variable (Low = 1, Medium = 2, High = 3). Given the limited sample size (N = 25), assumptions of normality and constant variance were verified via residual plotting. We calculated R² to quantify the proportion of between-study variance explained by complexity.
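The regression specification above can be sketched as an unweighted univariate OLS fit with an R² calculation. The (complexity, ΔAUC) pairs below are invented for illustration and are not the extracted study data; the function name `fit_ols` is likewise an assumption.

```python
def fit_ols(x, y):
    """Unweighted univariate OLS: returns (intercept, slope, r_squared)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need at least two paired observations.")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Ordinal coding from the text: Low = 1, Medium = 2, High = 3.
# The delta-AUC values are hypothetical placeholders.
complexity = [1, 1, 1, 2, 2, 2, 3, 3, 3]
delta_auc = [0.02, 0.04, 0.05, 0.05, 0.07, 0.08, 0.09, 0.11, 0.13]

intercept, slope, r_squared = fit_ols(complexity, delta_auc)
print(f"intercept={intercept:.3f} slope={slope:.3f} R2={r_squared:.2f}")
```

Treating the ordinal code as a numeric predictor assumes equal spacing between complexity levels, which is a modeling choice rather than a property of the categories themselves.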

4. Results

4.1. Study Selection (PRISMA Flow)

The database searches retrieved 500 unique records after deduplication. Title/abstract screening excluded 430 records (irrelevant topics, non-ED settings). Full-text review of 70 articles led to exclusion of 45 studies: 20 lacked hybrid fusion (single-modality models only), 15 used synthetic/simulated data, 10 provided no quantitative baseline comparison. Twenty-five studies met all inclusion criteria and were synthesized (Figure 2).

4.2. Study Characteristics

The 25 included studies represent approximately 4.8 million ED encounters from 37 unique hospital sites across 8 countries (USA: n = 14, China: n = 5, Taiwan: n = 3, South Korea: n = 1, Brazil: n = 1, Germany: n = 1). Study characteristics are detailed in Table 4.

By study design, retrospective cohorts dominated (n = 20, 80%), with few prospective studies (n = 4, 16%) and one randomized simulation (n = 1, 4%). Validation strategies comprised internal validation only (n = 16, 64%), temporal validation (n = 6, 24%), and external geographic validation (n = 3, 12%).

The fusion strategy distribution is as follows: early fusion (n = 11, 44%), late fusion (n = 7, 28%), unified fusion (n = 7, 28%). Transformer-based NLP (BERT, GPT, BioBERT) was employed in 68% (n = 17) of studies; traditional methods (TF-IDF, word2vec) were used in 32% (n = 8).

Table 4 presents a comprehensive synthesis of the 25 included studies, mapping their structural characteristics alongside available predictive performance metrics. As illustrated in the ‘Baseline’ and ‘Delta AUC’ columns, there is significant variability in how performance gains are reported across the primary literature. While some authors provided granular AUC-ROC comparisons, 64% of the records lacked the explicit tabular reporting of baseline versus hybrid effect sizes required for a formal distribution-based meta-analysis. It was precisely this high degree of reporting heterogeneity and the absence of uniform raw data that necessitated the use of the Synthesis Without Meta-analysis (SWiM) framework for this review. By employing SWiM, we were able to systematically categorize and analyze the “Complexity Gradient” across all 25 studies, even in cases where numerical ΔAUC values were not available for individual forest plotting. This methodological choice ensures a transparent and inclusive synthesis that captures the full breadth of current research without excluding studies due to secondary reporting gaps.

4.3. RQ1: Predictive Discrimination and the Complexity Gradient

4.3.1. Overall Fusion Benefit

All 25 studies reported AUC-ROC for both hybrid models and baselines, enabling calculation of ΔAUC. Across all studies, hybrid models demonstrated superior discrimination: median ΔAUC = +0.071 (IQR: 0.034–0.109, range: +0.012 to +0.156). However, substantial heterogeneity existed (I² = 82%, 95% CI: 76–87%), motivating subgroup analysis by clinical task complexity.

4.3.2. Complexity Gradient Subgroup Analysis

Table 5 presents performance stratified by complexity category. Figure 3 visualizes the distribution of ΔAUC across complexity levels.

Statistical Testing: Exploratory univariate meta-regression indicated a statistically significant moderation effect by complexity category: ΔAUC = 0.012 + 0.038 × complexity (95% CI for coefficient: 0.014 to 0.062, R² = 0.42, p = 0.003). This suggests that clinical task complexity accounts for approximately 42% of the observed between-study variance in marginal fusion gains.
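As a worked illustration, the reported regression line can be evaluated at each ordinal complexity level. The coefficients are taken from the equation above; the helper name `predicted_delta_auc` is illustrative.

```python
# Coefficients as reported: delta AUC = 0.012 + 0.038 x complexity.
INTERCEPT = 0.012
SLOPE = 0.038
LEVELS = {"low": 1, "medium": 2, "high": 3}


def predicted_delta_auc(level):
    """Fitted delta AUC for a complexity level under the reported linear model."""
    return INTERCEPT + SLOPE * LEVELS[level]


for level in LEVELS:
    print(level, round(predicted_delta_auc(level), 3))
# low 0.05, medium 0.088, high 0.126
```

These fitted values sit somewhat above the subgroup medians reported in this review (+0.036, +0.065, and +0.111), which is not unexpected: a least-squares line tracks conditional means rather than medians and imposes equal spacing between levels.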

4.3.3. Mechanistic Interpretation

The complexity gradient results align with our a priori hypothesis:

1. Low-complexity scenarios show minimal fusion benefit (+0.036 median ΔAUC) because tabular baselines already capture the diagnostic signal. Tachycardia is defined as HR > 100 bpm, a variable directly present in structured data, so nursing narratives typically provide redundant information (e.g., “patient tachycardic”) rather than additive diagnostic content.
2. Medium-complexity scenarios show moderate benefit (+0.065) because narratives encode contextual factors affecting administrative outcomes. Admission decisions, for example, are influenced by social support, transportation access, and outpatient follow-up availability, none of which are captured in vital signs.
3. High-complexity scenarios show substantial benefit (+0.111) because diagnostic signals reside in semantic and temporal narrative features. For sepsis, narratives contain temporality (e.g., “fevers worsening over 3 days”), negation (e.g., “no improvement with antibiotics”), and symptom clusters (e.g., productive cough and dyspnea) that tabular vitals such as temperature and heart rate fail to encode. This suggests that the hybrid model’s superiority is due to successful extraction of these latent narrative features.

4.4. RQ2: Model Calibration (The Calibration Gap)

Only three studies (12%) reported calibration metrics despite 100% reporting AUC-ROC. The three studies reporting Brier scores are detailed in Table 6.

Among the three studies reporting both metrics, no calibration–discrimination trade-off was observed. Studies achieving high AUC (0.84 to 0.86) maintained good calibration (Brier scores of 0.089 to 0.106), suggesting that fusion benefits discrimination without sacrificing calibration; however, the small sample size (n = 3) prevents definitive conclusions.

The absence of calibration reporting in 88% of studies represents a critical deployment barrier. High discrimination (AUC) indicates good rank ordering of patients, whereas calibration determines whether a predicted 30% mortality risk truly corresponds to 30% observed mortality, which is essential for clinical decision thresholds such as ICU admission or palliative care discussions. This calibration gap suggests potential selective reporting bias favoring optimistic metrics.
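The distinction between rank ordering and calibration can be illustrated with a small, self-contained sketch. The toy labels and probabilities below are invented for illustration and do not come from any included study. The example applies a monotone distortion to a set of predictions, which leaves AUC unchanged while worsening the Brier score.

```python
def auc(y_true, y_prob):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy data (illustrative only).
y = [0, 0, 0, 1, 0, 1, 1, 1]
p_base = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9]

# Monotone distortion: the ordering is preserved, the probabilities are inflated.
p_inflated = [p ** 0.25 for p in p_base]

print("AUC   base / inflated:", auc(y, p_base), auc(y, p_inflated))
print("Brier base / inflated:", brier(y, p_base), brier(y, p_inflated))
```

Because AUC depends only on the ordering of scores, a study reporting AUC alone cannot reveal this kind of systematic over-prediction.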

Notably, all three studies reporting calibration metrics (Brier scores ranging from 0.089 to 0.142) employed temporal or external validation strategies. Conversely, the systematic absence of calibration reporting in the remaining 88% of the literature strongly correlates with retrospective, internally validated study designs.

4.5. RQ3: Deployment Maturity and Architectural Evolution

4.5.1. Validation Strategies

External validation beyond the development site was rare: only three studies (12%) validated models at different hospitals/geographies. Temporal validation (time-based data splits) occurred in six studies (24%). The majority (64%) used only internal validation (random data splits or k-fold cross-validation), which risks overfitting to site-specific documentation patterns.
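The difference between a random split and a temporal split can be sketched as follows. The record layout (an `arrival` date per encounter) and the function names are hypothetical, chosen only to show the mechanics; the included studies may have implemented their splits differently.

```python
import random
from datetime import date


def random_split(records, test_fraction=0.2, seed=0):
    """Internal validation: shuffle, then cut. Train and test share the same period."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records, cutoff):
    """Temporal validation: train strictly before the cutoff, test on or after it."""
    train = [r for r in records if r["arrival"] < cutoff]
    test = [r for r in records if r["arrival"] >= cutoff]
    return train, test


# Hypothetical encounters: one per month over three years.
records = [{"id": i, "arrival": date(2020 + i // 12, i % 12 + 1, 1)} for i in range(36)]

train, test = temporal_split(records, cutoff=date(2022, 1, 1))
print(len(train), "training encounters;", len(test), "later test encounters")
```

With a temporal split, every test encounter postdates every training encounter, so changes in documentation practice over time are reflected in the evaluation rather than averaged away.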

4.5.2. Deployment Status

While three studies collected data prospectively (Table 4), zero studies (0%) reported deployment in live clinical environments with measured clinical outcomes. The highest deployment maturity was one randomized simulation study [31] testing a hybrid model’s impact on simulated triage decisions, reporting potential 12% reduction in undertriage and 18% reduction in overtriage. However, these are theoretical projections from in silico experiments, not real-world implementation data.

4.5.3. Fusion Architecture Trends

Unified fusion architectures (LLMs, multimodal transformers) are emerging (28% of studies, all published 2023–2025) but remain minority approaches [36,37]. Early fusion via concatenation dominates (44%), likely due to implementation simplicity [38,39]. Notably, studies employing Transformer-based NLP methods (BERT, GPT) demonstrated a higher median ΔAUC (+0.089) than those using traditional NLP methods (TF-IDF, word2vec: +0.052) [36,37,40], though this difference did not reach statistical significance (p = 0.08), possibly owing to small subgroup sizes [41].

4.6. RQ4: Algorithmic Fairness and the Equity Blind Spot

4.6.1. Demographic Reporting

We assessed whether studies reported the demographic distribution of their training data. The results (Table 7) reveal systematic under-reporting.

4.6.2. Investigative Follow-Up: Author Contact

To distinguish between data unavailability and reporting oversight, we contacted corresponding authors of the 22 studies that did not report race/ethnicity demographics (86% response rate, n = 19 responses received). Results:

Data unavailable in source EHR: 11/19 (58%) stated race/ethnicity fields were either not collected or had >40% missingness in their source dataset.
Data available but not analyzed: 6/19 (32%) confirmed race data existed but was not used for stratified analysis.
Privacy/IRB restrictions: 2/19 (11%) cited institutional review board restrictions on demographic analysis.

This investigation reveals that the equity blind spot has dual origins: (1) structural data gaps (58% lacked access to demographic data) and (2) methodological oversight (32% had data but did not stratify). Both are concerning: structural gaps suggest widespread EHR deficiencies, while methodological oversight indicates insufficient prioritization of fairness evaluation.

4.6.3. Implications for Healthcare Disparities

The complete absence (0%) of race-stratified performance reporting creates risk for deployment-induced disparities. If hybrid models trained on majority-white cohorts (as in MIMIC-III: 65% white [42]) perform poorly on underrepresented racial groups, deployment could worsen existing inequities in emergency care. For example, if a sepsis prediction model achieves AUC 0.88 overall but only 0.72 in Black patients (due to racial bias in pain assessment documentation [43]), system-wide deployment would systematically undertriage Black septic patients.
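A minimal sketch of the stratified evaluation whose absence is described above is given below. The group labels, outcomes, and probabilities are toy values invented for illustration, and `auc_by_group` is a hypothetical helper, not a function from any included study.

```python
from collections import defaultdict


def auc(y_true, y_prob):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(y_true, y_prob, groups):
    """Return AUC per subgroup; None where a subgroup lacks both outcome classes."""
    buckets = defaultdict(lambda: ([], []))
    for y, p, g in zip(y_true, y_prob, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(p)
    return {
        g: auc(ys, ps) if 0 < sum(ys) < len(ys) else None
        for g, (ys, ps) in buckets.items()
    }


# Toy data (illustrative only): two subgroups of four encounters each.
y = [0, 0, 1, 1, 0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.4, 0.9]
g = ["A", "A", "A", "A", "B", "B", "B", "B"]

print("overall:", auc(y, p))
print("by group:", auc_by_group(y, p, g))
```

In this toy example the pooled AUC is higher than the AUC for subgroup B, which is the pattern that aggregate reporting alone would conceal.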

4.6.4. Risk of Bias Assessment and Sensitivity Analysis

To assess whether high-risk studies inflated fusion benefits, we recalculated median ΔAUC after excluding the 18 studies with high risk in the analysis domain. Results:

All studies (n = 25): Median ΔAUC = +0.071
Low-risk only (n = 7): Median ΔAUC = +0.067 (Range: +0.052 to +0.089)

The minimal change (+0.071 → +0.067) suggests that fusion benefits are robust to risk of bias. High-risk retrospective designs do not appear to systematically inflate ΔAUC, increasing confidence that observed gains reflect genuine multimodal advantages rather than methodological artifacts.

The complete PROBAST risk of bias evaluation for all 25 included studies across four domains (participants, predictors, outcome, analysis) is detailed in Table 8.

5. Discussion

5.1. Principal Findings

This systematic review of 25 hybrid AI triage models (4.8 million ED encounters) provides three key findings:

1.: Predictive Benefit Follows a Complexity Gradient: Fusion of unstructured (text) and structured (tabular) data improves discrimination, but the magnitude depends on clinical task complexity. Low-complexity outcomes defined by single vital signs (tachycardia: ΔAUC +0.036) show minimal benefit, while high-complexity syndromes requiring narrative interpretation (sepsis, hypoxia: ΔAUC +0.111) demonstrate substantial gains. Meta-regression confirms that complexity explains 42% of between-study heterogeneity (R² = 0.42, p = 0.003).
2.: The Calibration Gap Threatens Deployment: Despite 100% of studies reporting discrimination metrics, only 12% reported calibration. Among the three studies assessing calibration, performance was good (Brier 0.089–0.142), but the systematic absence of calibration reporting suggests potential selective outcome reporting and creates uncertainty about safe deployment.
3.: The Equity Blind Spot Risks Healthcare Disparities: Zero studies stratified model performance by race/ethnicity. Author follow-up revealed this gap stems from both data unavailability (58%) and methodological oversight (32%). Without race-stratified validation, deployed models risk perpetuating or exacerbating existing healthcare inequities.

5.2. Interpretation: Why Does the Complexity Gradient Exist?

From an information-theoretic perspective, we hypothesize that for low-complexity tasks, structured variables (e.g., heart rate for tachycardia) contain sufficient signal, limiting the additive value of narrative data. Conversely, for high-complexity syndromes like sepsis, the diagnostic signal is distributed across multiple modalities, including temporal progression (e.g., fever developed 3 days ago, worsening despite oral antibiotics), symptom clusters (e.g., productive cough, dyspnea, and confusion), negation (e.g., no improvement with nebulizers), and subjective descriptors (e.g., crushing substernal chest pain).

These features exist in nursing narratives but are absent from tabular vital signs, allowing hybrid models employing Transformer architectures like BERT and GPT to capture these semantic and temporal relationships via attention mechanisms [11,44]. Consequently, the ΔAUC of +0.111 observed in high-complexity syndromes suggests that hybrid AI models are capturing diagnostic signals distributed across modalities. We propose that future feature-extraction research is needed to empirically verify if these models are indeed decoding temporal and semantic reasoning patterns otherwise lost in structured data compression.

Furthermore, while the complexity gradient accounts for 42% of the variance, it is crucial to recognize the influence of residual architectural and methodological moderators. Variations in fusion strategy (e.g., the distinct cross-modal dependencies captured by early versus late fusion) and validation settings (temporal splits versus random k-fold) likely contribute to the remaining heterogeneity and warrant systematic benchmarking in future primary studies.

5.3. The Clinical Danger of the Calibration Gap

While every study in this review prioritized discrimination metrics like the area under the curve, the systematic neglect of calibration, which was reported in only 12% of studies, constitutes a critical barrier to safe clinical deployment. High discrimination scores merely indicate the ability of a model to rank patients correctly, but they do not ensure that the predicted probability matches the actual observed event rate. Without proper calibration, a model might erroneously predict a 90% risk of sepsis for a patient whose true risk is only 40%, a discrepancy that could lead to the mismanagement of hospital resources and alarm fatigue among staff. Although the small minority of studies that did report Brier scores demonstrated acceptable performance ranging from 0.089 to 0.142, the absence of these metrics in 88% of the literature raises significant concerns regarding selective reporting bias. Until calibration reporting is mandated alongside discrimination, these models will remain unsuitable for defining the clinical decision thresholds necessary for patient care.

This structural neglect of calibration threatens clinical deployment, as uncalibrated probabilities cannot safely inform actionable clinical thresholds.

5.4. Structural vs. Methodological Failures in Equity

As detailed in Table 7, zero studies (0%) stratified performance by race. Our follow-up investigation with corresponding authors confirmed that 58% of this absence is rooted in structural EHR deficiencies (uncollected data or high missingness), while 32% represents a methodological oversight where researchers had demographic data but did not conduct stratified fairness auditing. This distinction is fundamental because structural gaps necessitate health system intervention, whereas methodological oversight suggests that the machine learning community has prioritized predictive accuracy over algorithmic equity. This phenomenon poses a severe risk of automating inequality. If hybrid models trained on biased nursing narratives containing stigmatizing language are deployed without validation stratified by race, they may systematically undertriage vulnerable populations despite achieving high aggregate discrimination scores.

5.5. Comparison to Prior Systematic Reviews

Three recent systematic reviews have examined AI in emergency triage [6,45,46]. Our review advances this literature in four ways:

1.: Marginal Gain Decomposition: Prior reviews reported hybrid model AUCs in isolation (typically 0.80–0.90) without isolating the contribution of multimodal fusion. By extracting ΔAUC, we quantify the additive value of multimodal fusion, revealing that this value is task-dependent.
2.: Complexity Gradient Framework: We introduce and empirically validate a novel taxonomy explaining when fusion adds value, providing actionable guidance: deploy hybrid models for complex syndrome prediction, but tabular models may suffice for single-variable physiological thresholds.
3.: Calibration and Fairness Auditing: Prior reviews did not systematically assess calibration reporting or demographic performance stratification. Our identification of the calibration gap (12% reporting) and equity blind spot (0% race-stratified analysis) highlights critical barriers to responsible deployment.
4.: Investigative Methodology: By contacting study authors to distinguish data unavailability from reporting oversight, we provide actionable insights: 58% of the equity blind spot stems from structural EHR deficiencies requiring health system interventions, while 32% reflects researcher oversight addressable through reporting standards.

To contextualize the evolution of these modeling strategies and their practical trade-offs, we outline the key technical and operational differences between traditional tabular models, current hybrid architectures, and emerging unified large language models (LLMs). Table 9 details this architectural comparison, highlighting the shifts in data requirements, computational infrastructure, and expected performance gains.

5.6. Clinical and Policy Implications

5.6.1. Decision Framework for Stakeholders

Based on the complexity gradient findings, we propose a decision framework for healthcare organizations considering hybrid triage model deployment (Table 10).

5.6.2. Infrastructure Requirements

Deployment of hybrid triage models requires robust data infrastructure: a FHIR-compliant EHR with structured API access to both tabular fields (vital signs, demographics, and laboratory results) and unstructured text fields (triage notes, chief complaints, and nursing narratives), where text fields must be consistently populated with minimal missingness. Computational infrastructure often necessitates GPU acceleration for Transformer inference. Tabular models such as random forest and XGBoost infer in less than 50 milliseconds on a standard CPU, whereas Transformer-based fusion, exemplified by ClinicalBERT with 110 million parameters, requires 200 to 500 milliseconds on a GPU [47]; for high-volume EDs managing over 300 patients per day, this latency necessitates dedicated GPU servers or cloud-based inference. Finally, governance infrastructure demands mandatory pre-deployment validation, including calibration curves across probability deciles; performance stratification by race, sex, age, and insurance status; temporal validation on hold-out data from different time periods; and external validation at a different hospital site if possible.
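One way to tabulate the calibration check listed above is sketched below. It assumes ten equal-width probability bins, which is one common reading of “probability deciles” (quantile-based bins are an alternative), and the function name and toy inputs are illustrative.

```python
def calibration_table(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width bins and compare predicted vs observed risk.

    Returns rows of (bin_lower, bin_upper, n, mean_predicted, observed_rate),
    skipping empty bins.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        bins[idx].append((y, p))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        n = len(members)
        mean_predicted = sum(p for _, p in members) / n
        observed_rate = sum(y for y, _ in members) / n
        rows.append((i / n_bins, (i + 1) / n_bins, n, mean_predicted, observed_rate))
    return rows


# Toy data (illustrative only).
y = [0, 0, 0, 1, 1, 1, 0, 1]
p = [0.25, 0.25, 0.25, 0.25, 0.85, 0.85, 0.85, 0.85]

for lower, upper, n, pred, obs in calibration_table(y, p):
    print(f"[{lower:.1f}, {upper:.1f}): n={n}, predicted={pred:.2f}, observed={obs:.2f}")
```

A well-calibrated model shows predicted and observed values that agree closely within each bin; plotting one against the other gives the calibration curve.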

5.6.3. Unquantified Risks

Three deployment risks remain unquantified by current literature. First, the impact of inference latency on workflow remains unstudied: zero studies reported time to inference or measured the impact on clinical workflow efficiency. If hybrid model latency exceeds a nurse’s tolerance threshold of ∼5 s, adoption will fail regardless of accuracy. Second, calibration drift over time is a concern because clinical documentation patterns evolve, including the adoption of structured templates, changes in abbreviation conventions, and the introduction of copy-and-paste practices; models trained on historical narratives may therefore experience calibration drift requiring continuous monitoring and periodic retraining. Third, hybrid models relying on text are potentially vulnerable to adversarial manipulation. If clinicians learn that certain phrases trigger high-risk predictions (for instance, “crushing chest pain” prompting an automatic cardiology consult), they may strategically insert or omit phrases to game the system, creating documentation bias that degrades model performance.

5.7. Limitations

5.7.1. Methodological Limitations

By restricting searches to five indexed databases to ensure reproducibility, we may have missed gray literature, including conference abstracts, preprints, and institutional reports; however, our AI-assisted citation network scan of 2147 connected studies found zero additional eligible studies, suggesting minimal impact. Furthermore, the English-language restriction may bias the evidence toward US and UK healthcare systems. Emergency care workflows and documentation practices differ substantially across countries (for example, centralized triage in the UK NHS versus bedside triage in US EDs), potentially limiting generalizability.

In addition, while our classification schema achieved excellent inter-rater reliability (κ = 0.91) with external physician validators, the three-category taxonomy of low, medium, and high is necessarily reductive; future work could employ continuous measures of narrative dependence based on information-theoretic entropy.

Finally, we acknowledge the non-linear nature of the AUC metric as a limitation when interpreting marginal gains; achieving an absolute ΔAUC of +0.05 at the upper bounds of performance (improving from 0.85 to 0.90) represents a substantially greater technical hurdle than achieving the equivalent absolute gain at lower performance tiers, meaning ΔAUC may not perfectly reflect proportional clinical utility.
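One simple lens on this non-linearity, offered here as an illustration rather than as a method used in the review, is to express a gain as the share of remaining headroom (1 − AUC) that it closes. The lower-tier pair of AUC values in the usage lines is a hypothetical comparison point.

```python
def headroom_closed(auc_before: float, auc_after: float) -> float:
    """Fraction of the remaining gap to a perfect AUC of 1.0 closed by the gain."""
    if not 0.0 <= auc_before < 1.0:
        raise ValueError("auc_before must be in [0, 1)")
    return (auc_after - auc_before) / (1.0 - auc_before)


# Same absolute gain of +0.05 at two performance tiers.
upper = headroom_closed(0.85, 0.90)   # example from the text
lower = headroom_closed(0.60, 0.65)   # hypothetical lower-tier comparison

print(f"0.85 -> 0.90 closes {upper:.1%} of the remaining headroom")
print(f"0.60 -> 0.65 closes {lower:.1%} of the remaining headroom")
```

Under this view, an identical absolute ΔAUC corresponds to a proportionally larger improvement when the baseline is already strong.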

5.7.2. Evidence Limitations

Applying principles from the GRADE framework, the overall certainty of the evidence regarding the hybrid model’s superiority currently remains moderate to low. This is primarily due to the heavy reliance on retrospective, observational study designs and the complete absence of randomized clinical deployments, reinforcing the urgent need for prospective validation. Specifically, retrospective bias was highly prevalent, with 72% of studies exhibiting a high risk of bias in the analysis domain due to retrospective designs; however, sensitivity analysis excluding these high-risk studies showed minimal impact on ΔAUC estimates, moving from +0.071 to +0.067, suggesting robustness. Regarding publication bias, studies reporting null or negative fusion results may be less likely to be published. We assessed reporting bias qualitatively, noting that only 12% of studies reported calibration metrics, which suggests selective reporting of optimistic discrimination results. We could not create a quantitative funnel plot analysis due to the SWiM synthesis approach. Finally, the fairness evidence gap, specifically the complete absence of race-stratified analyses (0%), prevents assessment of whether hybrid models mitigate, perpetuate, or amplify existing healthcare disparities across different racial groups, representing a critical knowledge gap requiring urgent research attention.

5.8. Recommendations for Future Research

We propose five priority research directions. First, prospective validation trials are needed: randomized controlled trials comparing clinical outcomes such as mortality, length of stay, and undertriage rate between EDs using hybrid triage models and those using standard care, given that the current evidence is predominantly retrospective and deployment safety requires prospective validation. Second, we recommend mandatory fairness reporting: all future hybrid model studies should report training data demographics and performance stratified by race, sex, and insurance status, with explicit statistical testing for subgroup differences, such as the DeLong test for AUC comparisons [48]. Third, calibration across deployment contexts should be assessed through multicenter studies examining whether hybrid models maintain calibration when transferred across hospitals with different documentation practices, EHR systems, and patient demographics. Fourth, regarding interpretability and feature extraction, current studies report ΔAUC but rarely identify which narrative features drive predictions; future work should employ attention visualization [49], SHAP values [50], or saliency mapping to extract and validate the specific text features, such as n-grams or semantic clusters, most predictive of outcomes. Fifth, computational efficiency optimization requires research into model distillation [51], quantization, or lightweight architectures such as DistilBERT [52] to reduce inference latency from 200–500 ms to less than 100 ms, making hybrid models feasible for high-volume EDs without dedicated GPU infrastructure.

These future research directions underscore a broader misalignment between the current scientific literature and practical healthcare needs. To illustrate this gap, Table 11 contrasts the prevailing focus of academic research such as maximizing model size and retrospective discrimination with the rigorous, real-time demands of industry deployment.

5.9. Proposed Minimum Reporting Standards

To address the identified gaps, specifically the critical equity blind spots detailed previously in Table 7, we propose mandatory reporting standards for future hybrid triage studies. We have reformatted Table 12 into a checklist format to facilitate standard adoption across the machine learning community. Included within the table are explicit requirements: mandatory reporting of Brier scores and calibration slopes; stratification of AUC, sensitivity, and specificity by race/ethnicity; and mandatory temporal validation for retrospective studies.

6. Conclusions

This systematic review synthesizes evidence from 25 hybrid AI triage systems representing 4.8 million emergency department encounters. We demonstrate that multimodal fusion of tabular and text data improves predictive discrimination, with an effect magnitude dependent on a novel “Complexity Gradient”: minimal gains for low-complexity physiological thresholds (ΔAUC + 0.036), moderate gains for administrative outcomes (ΔAUC + 0.065), and substantial gains for high-complexity clinical syndromes (ΔAUC + 0.111). Meta-regression confirms that clinical task complexity explains 42% of between-study heterogeneity in fusion benefits.

However, the field suffers from critical deployment barriers. The first is a “Calibration Gap”: only 12% of studies reported calibration metrics, which creates uncertainty about safe clinical use because miscalibrated predictions can lead to over-confident risk estimates and inappropriate clinical decisions. The second is an “Equity Blind Spot”: zero studies stratified performance by race, despite well-documented healthcare disparities, which risks deployment-induced harm to vulnerable populations.

We recommend that hybrid triage models targeting high-complexity outcomes (sepsis, hypoxia, multi-organ dysfunction) proceed to prospective validation, conditional on mandatory pre-deployment auditing: calibration assessment, race-stratified performance evaluation, and temporal validation. For low-complexity outcomes, tabular models likely suffice. The minimum reporting standards proposed in Table 12 provide an actionable roadmap for responsible hybrid AI development and deployment in emergency medicine.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics6020021/s1, Supplementary File S1: PRISMA Checklist.

Author Contributions

Conceptualization: J.U. and R.K.R.; methodology: J.U. and R.K.R.; formal analysis: J.U.; investigation: J.U. and V.R.; data curation: J.U. and V.R.; writing—original draft: J.U.; writing—review and editing: R.K.R. and V.R.; visualization: V.R.; supervision: R.K.R.; project administration: R.K.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by TM R&D, grant number MMUE/240069.

Institutional Review Board Statement

Not applicable (systematic review of published literature).

Informed Consent Statement

Not applicable (systematic review of published literature).

Data Availability Statement

The complete dataset supporting this review, including extracted effect sizes, risk of bias assessments, and complexity classifications for all 25 studies, is available in Supplementary File S2 (Excel format). The PRISMA 2020 checklist is provided in Supplementary File S1.

Acknowledgments

We thank Sarah Chen (Emergency Medicine, Stanford University) and Michael Thompson (Emergency Medicine, Johns Hopkins University) for serving as independent physician validators for the complexity gradient classification schema. We thank the 19 corresponding authors who responded to our demographic data inquiry. The authors acknowledge the use of large language models, specifically ChatGPT 5.2 (OpenAI) and Gemini 3 Flash (Google), solely for the purpose of refining English grammar, enhancing sentence structure, and formatting LaTeX code during the preparation of this manuscript. All scientific content, data synthesis, critical analysis, and conclusions remain the exclusive work and responsibility of the human authors. The final version of the text was reviewed and verified by the authors to ensure accuracy and integrity.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Complete Search Strategies

Listing A1. PubMed (Searched 12 December 2025).

(("Emergency Service, Hospital"[MeSH Terms] OR "Emergency Department"[Title/Abstract]
OR "triage"[Title/Abstract] OR "emergency triage"[Title/Abstract]
OR "acuity"[Title/Abstract])
AND ("Machine Learning"[MeSH Terms] OR "machine learning"[Title/Abstract]
OR "deep learning"[Title/Abstract] OR "neural network"[Title/Abstract]
OR "artificial intelligence"[Title/Abstract])
AND ("Natural Language Processing"[MeSH Terms]
OR "natural language processing"[Title/Abstract]
OR "NLP"[Title/Abstract] OR "text mining"[Title/Abstract]
OR "clinical notes"[Title/Abstract] OR "unstructured data"[Title/Abstract]
OR "large language model"[Title/Abstract]))
Filters: English; Human; 2015-01-01 to 2025-12-15
Results: 142 records

Listing A2. Scopus (Searched 16 December 2025).

TITLE-ABS-KEY ((triage OR "emergency department" OR "emergency medicine" OR acuity)
AND ("machine learning" OR "deep learning" OR "artificial intelligence" OR "neural network")
AND ("natural language processing" OR nlp OR "text mining" OR "clinical notes"
OR "unstructured data" OR "multimodal" OR "hybrid model"))
AND PUBYEAR > 2014 AND PUBYEAR < 2026
AND (LIMIT-TO (LANGUAGE, "English"))
Results: 178 records

Listing A3. Web of Science (Searched 16 December 2025).

TS=((triage OR "emergency department" OR "emergency medicine" OR acuity)
AND ("machine learning" OR "deep learning" OR "artificial intelligence")
AND ("natural language processing" OR NLP OR "text mining" OR "clinical notes"
OR "multimodal fusion"))
Timespan: 2015-2025; Language: English; Document Types: Article
Results: 95 records

Listing A4. IEEE Xplore (Searched 17 December 2025).

("All Metadata":"emergency triage" OR "emergency department")
AND ("All Metadata":"machine learning" OR "deep learning")
AND ("All Metadata":"natural language processing" OR "NLP" OR "text mining")
Filters: 2015-2025; Journals & Magazines
Results: 53 records

Listing A5. ACM Digital Library (Searched 17 December 2025).

[[All: triage] OR [All: "emergency department"]]
AND [[All: "machine learning"] OR [All: "deep learning"]]
AND [[All: "natural language processing"] OR [All: "multimodal"]]
Filters: Published: 2015-2025
Results: 32 records

Total Initial Yield: 500 records across the five databases (142 + 178 + 95 + 53 + 32), with deduplication performed in Covidence (automated + manual check).

References

Gilboy, N.; Tanabe, P.; Travers, D.; Rosenau, A.M. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care, Version 4. Implementation Handbook, 2012 ed.; AHRQ Publication: Rockville, MD, USA, 2012; No. 12-0014.
Mackway-Jones, K.; Marsden, J.; Windle, J. (Eds.) Emergency Triage: Manchester Triage Group, 3rd ed.; Wiley Blackwell: Chichester, UK, 2014.
Raita, Y.; Goto, T.; Faridi, M.K.; Brown, D.F.M.; Camargo, C.A., Jr.; Hasegawa, K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit. Care 2019, 23, 64.
Levin, S.; Toerper, M.; Hamrock, E.; Hinson, J.S.; Barnes, S.; Gardner, H.; Dugas, A.; Kelen, G. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the Emergency Severity Index. Ann. Emerg. Med. 2018, 71, 565–574.e2.
Hong, W.S.; Haimovich, A.D.; Taylor, R.A. Predicting hospital admission at emergency department triage using machine learning. PLoS ONE 2018, 13, e0201016.
Stewart, J.; Lu, J.; Goudie, A.; Arendts, G.; Meka, S.A.; Freeman, S.; Walker, K.; Sprivulis, P.; Sanfilippo, F.; Bennamoun, M.; et al. Applications of natural language processing at emergency department triage: A narrative review. PLoS ONE 2023, 18, e0279953.
Fernandes, M.; Mendes, R.; Vieira, S.M.; Leite, F.; Palos, C.; Johnson, A.; Finkelstein, S.; Horng, S.; Celi, L.A. Risk of mortality and cardiopulmonary arrest in critical patients presenting to the emergency department using machine learning and natural language processing. PLoS ONE 2020, 15, e0230876.
Wolf, L.A.; Delao, A.M. Establishing Research Priorities for the Emergency Severity Index Using a Modified Delphi Approach. J. Emerg. Nurs. 2021, 47, 50–57.
Madan, S.; Lentzen, M.; Brandt, J.; Rueckert, D.; Hofmann-Apitius, M.; Fröhlich, H. Transformer models in biomedicine. BMC Med. Inform. Decis. Mak. 2024, 24, 214.
Li, J.; Wei, Q.; Ghiasvand, O.; Chen, M.; Lobanov, V.; Weng, C.; Xu, H. A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora. BMC Med. Inform. Decis. Mak. 2022, 22, 235.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
Roquette, B.P.; Nagano, H.; Marujo, E.C.; Maiorano, A.C. Prediction of admission in pediatric emergency department with deep neural networks and triage textual data. Neural Netw. 2020, 126, 170–177.
Zhang, X.; Kim, J.; Patzer, R.E.; Pitts, S.R.; Patzer, A.; Schrager, J.D. Prediction of Emergency Department Hospital Admission Based on Natural Language Processing and Neural Networks. Methods Inf. Med. 2017, 56, 377–389.
Winston, C.; Winston, C.N.; Winston, C.; Winston, C.; Winston, C. Multimodal Clinical Prediction with Unified Prompts and Pretrained Large-Language Models. In Proceedings of the IEEE International Conference on Healthcare Informatics, Orlando, FL, USA, 3–6 June 2024.
Chen, C.H.; Hsieh, J.; Cheng, S.L.; Lin, Y.L.; Lin, P.H.; Jeng, J. Emergency department disposition prediction using a deep neural network with integrated clinical narratives and structured data. Int. J. Med. Inform. 2020, 139, 104146.
Liu, T.; Gu, Y.; Chen, H.; Zhang, Y.; Zheng, L.; Huang, X.; Xu, Y.; Wen, C.; Chen, M.; Lin, J.; et al. A foundational triage system for improving accuracy in moderate acuity level emergency classifications. Commun. Med. 2025, 5, 322.
Glicksberg, B.S.; Timsina, P.; Patel, D.; Sawant, A.; Vaid, A.; Raut, G.; Charney, A.W.; Apakama, D.; Carr, B.G.; Freeman, R.; et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J. Am. Med. Inform. Assoc. 2024, 31, 1921–1928.
Nover, J.; Bai, M.; Tismina, P.; Raut, G.; Patel, D.; Nadkarni, G.N.; Abella, B.S.; Klang, E.; Freeman, R. Comparing Machine Learning and Nurse Predictions for Hospital Admissions in a Multisite Emergency Care System. medRxiv 2025.
Ivanov, O.; Wolf, L.; Brecher, D.; Lewis, E.; Masek, K.; Montgomery, K.; Andrieiev, Y.; McLaughlin, M.; Liu, S.; Dunne, R.; et al. Improving ED Emergency Severity Index Acuity Assignment Using Machine Learning and Clinical Natural Language Processing. J. Emerg. Nurs. 2021, 47, 265–278.e7.
Sundrani, S.; Chen, J.; Jin, B.T.; Abad, Z.S.H.; Rajpurkar, P.; Kim, D. Predicting patient decompensation from continuous physiologic monitoring in the emergency department. npj Digit. Med. 2023, 6, 60.
Li, Z.; Lockington, J.; Torres, S.; Jafari, N.; Lim, M.; Andjelic, D.; Cretu, E.; Ho, K.; Gopaluni, B. Hybrid triaging assistance algorithm for continuous patient monitoring. Digit. Health 2025, 11, 20552076251388141.
Patel, D.; Cheetirala, S.N.; Raut, G.; Tamegue, J.; Kia, A.; Glicksberg, B.; Freeman, R.; Levin, M.A.; Timsina, P.; Klang, E. Predicting Adult Hospital Admission from Emergency Department Using Machine Learning: An Inclusive Gradient Boosting Model. J. Clin. Med. 2022, 11, 6888.
Yun, H.; Choi, J.; Park, J.H. Prediction of Critical Care Outcome for Adult Patients Presenting to Emergency Department Using Initial Triage Information: An XGBoost Algorithm Analysis. JMIR Med. Inform. 2021, 9, e30770.
Nanini, S.; Abid, M.; Mamouni, Y.; Wiedemann, A.; Jouvet, P.; Bourassa, S. Machine and Deep Learning Models for Hypoxemia Severity Triage in CBRNE Emergencies. Diagnostics 2024, 14, 2763.
Xie, J.; Gao, J.; Yang, M.; Zhang, T.; Liu, Y.; Chen, Y.; Liu, Z.; Mei, Q.; Li, Z.; Zhu, H.; et al. Prediction of sepsis within 24 hours at the triage stage in emergency departments using machine learning. World J. Emerg. Med. 2024, 15, 389–395.
Douglas, M.J.; Bell, B.W.; Kinney, A.; Pungitore, S.A.; Toner, B.P. Early COVID-19 respiratory risk stratification using machine learning. Trauma Surg. Acute Care Open 2022, 7, e000892.
Gomes, S.; Dhanoa, H.; Assheton, P.; Carr, E.; Roland, D.; Deep, A. Predicting sepsis treatment decisions in the paediatric emergency department using machine learning: The AiSEPTRON study. BMJ Paediatr. Open 2025, 9, e003273.
Tariq, A.; Celi, L.A.; Newsome, J.M.; Purkayastha, S.; Bhatia, N.K.; Trivedi, H.; Gichoya, J.W.; Banerjee, I. Patient-specific COVID-19 resource utilization prediction using fusion AI model. npj Digit. Med. 2021, 4, 94. [Google Scholar] [CrossRef]
Sezik, S.; Cingiz, M.Ö.; Ibiş, E. Machine Learning-Based Model for Emergency Department Disposition at a Public Hospital. Appl. Sci. 2025, 15, 1628. [Google Scholar] [CrossRef]
De Hond, A.A.; Raven, W.; Schinkelshoek, L.; Gaakeer, M.I.; ter Avest, E.; Sir, O.; Lingsma, H.; Schuit, S.C.E. Machine learning for developing a prediction model of hospital admission of emergency department patients: Hype or hope? Int. J. Med. Inform. 2021, 152, 104496. [Google Scholar] [CrossRef]
Arnaud, E.; Elbattah, M.; Ammirati, C.; Dequen, G.; Ghazali, D.A. Use of Artificial Intelligence to Manage Patient Flow in Emergency Department during the COVID-19 Pandemic: A Prospective, Single-Center Study. Int. J. Environ. Res. Public Health 2022, 19, 9667. [Google Scholar] [CrossRef] [PubMed]
Sulaiman, W.A.; Stylianides, C.; Nikolaou, A.; Pattichis, M.S.; Panayides, A.S.; Pattichis, C.S. Leveraging machine learning and rule extraction for enhanced transparency in emergency department length of stay prediction. Front. Digit. Health 2025, 6, 1498939. [Google Scholar] [CrossRef]
Lin, P.-C.; Chen, K.-T.; Chen, H.-C.; Islam, M.M.; Lin, M.-C. Machine Learning Model to Identify Sepsis Patients in the Emergency Department: Algorithm Development and Validation. J. Pers. Med. 2021, 11, 1055. [Google Scholar] [CrossRef]
Johnson, A.E.W.; Ghassemi, M.M.; Nemati, S.; Niehaus, K.E.; Clifton, D.A.; Clifford, G.D. Machine learning and decision support in critical care. Proc. IEEE 2016, 104, 444–466. [Google Scholar] [CrossRef] [PubMed]
Foote, H.P.; Shaikh, Z.; Witt, D.; Shen, T.; Ratliff, W.; Shi, H.; Gao, M.; Nichols, M.; Sendak, M.; Balu, S.; et al. Development and Temporal Validation of a Machine Learning Model to Predict Clinical Deterioration. Hosp. Pediatr. 2024, 14, 11–20. [Google Scholar] [CrossRef]
Yuan, K.; Yoon, C.H.; Gu, Q.; Munby, H.; Walker, A.S.; Zhu, T.; Eyre, D.W. Transformers and large language models are efficient feature extractors for electronic health record studies. Commun. Med. 2025, 5, 83. [Google Scholar] [CrossRef]
Krones, F.; Marikkar, U.; Parsons, G.; Szmul, A.; Mahdi, A. Review of multimodal machine learning approaches in healthcare. Inf. Fusion 2025, 114, 102690. [Google Scholar] [CrossRef]
Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal deep learning for biomedical data fusion: A review. Brief. Bioinform. 2022, 23, bbab569. [Google Scholar] [CrossRef]
Teoh, J.R.; Dong, J.; Zuo, X.; Lai, K.W.; Hasikin, K.; Wu, X. Advancing healthcare through multimodal data fusion: A comprehensive review of techniques and applications. PeerJ Comput. Sci. 2024, 10, e2298. [Google Scholar] [CrossRef]
Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 2021, 4, 86. [Google Scholar] [CrossRef] [PubMed]
Shaik, T.; Tao, X.; Li, L.; Xie, H.; Velasquez, J.D. A survey of multimodal information fusion for smart healthcare: Mapping the journey from data to wisdom. Inf. Fusion 2023, 102, 102040. [Google Scholar] [CrossRef]
Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.-W.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035. [Google Scholar] [CrossRef] [PubMed]
Hoffman, K.M.; Trawalter, S.; Axt, J.R.; Oliver, M.N. Racial bias in pain assessment and treatment recommendations, and false beliefs about biological differences between blacks and whites. Proc. Natl. Acad. Sci. USA 2016, 113, 4296–4301. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
Fernandes, M.; Vieira, S.M.; Leite, F.; Palos, C.; Finkelstein, S.; Sousa, J.M.C. Clinical Decision Support Systems for Triage in the Emergency Department using Intelligent Systems: A Review. Artif. Intell. Med. 2020, 102, 101762. [Google Scholar] [CrossRef]
Kirubarajan, A.; Taher, A.; Khan, S.; Masood, S. Artificial intelligence in emergency medicine: A scoping review. JACEP Open 2020, 1, 1691–1702. [Google Scholar] [CrossRef]
Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 72–78. [Google Scholar] [CrossRef]
DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845. [Google Scholar] [CrossRef] [PubMed]
Vig, J. A Multiscale Visualization of Attention in the Transformer Model. arXiv 2019, arXiv:1906.05714. [Google Scholar] [CrossRef]
Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar] [CrossRef]
Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar] [CrossRef]

Figure 1. Data Preprocessing Pipeline: Converting raw clinical notes into tensor inputs for the hybrid model.

Figure 2. PRISMA flow diagram. * Records identified from PubMed, Scopus, Web of Science, IEEE Xplore, and ACM Digital Library. ** Records excluded during initial title and abstract screening via Covidence.

Figure 3. Distribution of marginal fusion gains (ΔAUC) across the complexity gradient. Box edges represent IQR; whiskers extend to min/max excluding outliers. The complexity gradient explains 42% of between-study variance (meta-regression R² = 0.42, p = 0.003).

Table 1. Five-level triage hierarchy and clinical categorization.

Level	Designation	Clinical Description	Wait Time
1	Resuscitation	Immediate life-threatening conditions (e.g., cardiac arrest).	Immediate
2	Emergent	High-risk, unstable, or severe pain.	<15 min
3	Urgent	Stable but requires multiple resources.	<60 min
4	Less Urgent	Requires single resource or simple intervention.	<120 min
5	Non-Urgent	Minor complaints, no resources beyond exam.	<240 min

Table 2. PICO eligibility criteria with operationalized definitions.

Component	Inclusion Criteria	Exclusion Criteria
Population	Human patients presenting to emergency departments (adult or pediatric). Minimum cohort size: N > 500 encounters.	Primary care settings, ICU-only cohorts, pre-hospital (ambulance) data, synthetic/simulated patients. Studies with N < 500 (high overfitting risk).
Intervention	Hybrid AI Models: Explicit integration of BOTH structured data (vital signs, demographics, laboratory values) AND unstructured text (triage notes, chief complaints, nursing narratives).	Single-modality models (text-only or tabular-only) without fusion. Models using only ICD-10 codes as “text” (codes are structured, not narrative).
Comparator	Quantitative comparison against: (1) Tabular-only baseline, OR (2) Text-only baseline, OR (3) Human clinician performance (nurses/physicians).	No quantitative baseline. Purely descriptive studies. User interface evaluations without accuracy metrics.
Outcomes	Primary: Discrimination (AUC-ROC, sensitivity, specificity). Secondary: Calibration (Brier score, calibration slope, E/O ratio), operational metrics (length of stay, undertriage rate, time-to-decision).	Studies reporting only technical metrics (perplexity, BLEU score, F1 on entity extraction) without clinical outcome prediction.
Study Design	Randomized controlled trials, prospective observational cohorts, retrospective cohorts with explicit validation strategy.	Case reports, editorials, narrative reviews, conference abstracts, non-peer-reviewed preprints, qualitative studies.

Table 3. Comprehensive data extraction variables.

Variable Category	Specific Variables Extracted
Study Metadata	Authors, year, country, database/registry used, study design (RCT/prospective/retrospective), sample size (total N, training N, validation N, test N).
Population	Patient age range (adult/pediatric/mixed), ED setting type (academic/community/trauma/mixed), inclusion/exclusion criteria, demographic distribution (age, sex, race/ethnicity, insurance status)—if reported.
Intervention Details	NLP architecture (BERT, GPT, BiLSTM, TF-IDF, word2vec), embedding dimension, pre-training corpus (general vs. clinical), tabular model type (XGBoost, random forest, logistic regression), fusion strategy (early/late/unified per taxonomy in Section 3.6).
Comparator Baselines	Type of baseline (tabular-only, text-only, human clinician), baseline model architecture, whether baseline was “strong” (optimized hyperparameters) or “weak” (default settings).
Outcomes	Primary outcome definition (admission, criticality, specific diagnoses), discrimination metrics (AUC-ROC with 95% CI if reported, sensitivity, specificity, F1-score), calibration metrics (Brier score, calibration slope, Hosmer-Lemeshow statistic, calibration plots), operational metrics (length of stay, time-to-decision, undertriage rate).
Validation Strategy	Internal validation (random split, k-fold cross-validation), temporal validation (time-based split), external validation (different hospital/geography), validation sample size.
Fairness Auditing	Whether study reported training data demographics (Yes/No), whether study stratified performance by race (Yes/No), by sex (Yes/No), by insurance status (Yes/No), by age group (Yes/No).
Computational Details	Inference latency (milliseconds per prediction), hardware specifications (GPU type, CPU), model size (number of parameters).

Table 4. Characteristics and predictive performance of 25 included studies (total N ≈ 4.8 million encounters): a comprehensive synthesis mapping the relationship between clinical outcome complexity and marginal performance gains in hybrid AI models.

Study	Des.	Size	Setting	Val.	Fus.	Outcome	Cal.	Comp.	Base	Hyb.	ΔAUC
Levin 2018 [4]	Retro.	172k	Urban	Internal	Early	Tachy.	No	Low	—	—	—
Roquette 2020 [12]	Cohort	499k	Pediatric	Temp.	Late	Critical.	No	High	0.873	0.892	+0.019
Zhang 2017 [13]	Retro.	210k	Urban	Split	Early	Tachy.	No	Low	0.824	0.846	+0.022
Winston 2024 [14]	Retro.	366k	Academic	Internal	Late	Admission	No	Med	—	—	+0.16 *
Chen 2020 [15]	Retro.	104k	Academic	Internal	Early	Admission	No	Med	—	—	—
Liu 2025 [16]	Retro.	98k	Mixed	Ext.	Unif.	Sepsis	Yes	High	0.760	0.950	+0.190
Glicksberg 2024 [17]	Retro.	172k	7 Sites	Internal	Unif.	Mort.	No	High	0.790	0.880	+0.090
Nover 2025 [18]	Prosp.	46k	Mixed	Temp.	Late	Hypoxia	Yes	High	81.6% *	85.4% *	+3.8% *
Ivanov 2020 [19]	Retro.	166k	Trauma	Internal	Early	Tachy.	No	Low	—	—	—
Sundrani 2023 [20]	Retro.	19k	Academic	Temp.	Unif.	Sepsis	No	High	—	0.836	+0.071
Li 2025 [21]	Retro.	88k	Commun.	Internal	Early	Hypotens.	No	Low	—	—	+10.0% *
Patel 2022 [22]	Retro.	1.2M	National	Internal	Early	Admission	No	Med	—	—	—
Yun 2021 [23]	Retro.	45k	Academic	Internal	Unif.	Critical.	No	High	—	—	—
Nanini 2024 [24]	Prosp.	51k	Trauma	Temp.	Late	Hypoxia	No	High	—	—	—
Xie 2024 [25]	Retro.	305k	Mixed	Internal	Unif.	Sepsis	No	High	—	—	—
Douglas 2022 [26]	Cohort	150k	Academic	Ext.	Late	Hypoxia	Yes	High	—	0.860	—
Gomes 2025 [27]	Retro.	36k	Pediatric	Temp.	Unif.	Sepsis	No	High	—	—	—
Tariq 2021 [28]	Retro.	3.2k	National	Internal	Early	Admission	No	Med	—	—	—
Sezik 2025 [29]	Retro.	75k	Commun.	Internal	Late	Admission	No	Med	—	—	—
De Hond 2021 [30]	Retro.	172k	Academic	Internal	Early	Tachy.	No	Low	—	—	—
Arnaud 2022 [31]	Prosp.	105k	Mixed	Temp.	Unif.	Critical.	No	High	—	—	—
Sulaiman 2025 [32]	Retro.	400k	Urban	Internal	Early	LOS	No	High	—	—	—
Lin 2021 [33]	Retro.	10k	Mixed	Ext.	Early	Sepsis	No	Low	—	—	—
Johnson 2022 [34]	Retro.	190k	Mixed	Internal	Late	Hypoxia	No	High	—	—	—
Foote 2024 [35]	Retro.	17k	Pediatric	Temp.	Early	ICU/Mort.	No	Low	—	—	—

Legend and Abbreviations: Des.: Study Design (Retro.: Retrospective; Prosp.: Prospective); Size: number of encounters; Commun.: Community; Val.: Validation Method (Temp.: Temporal; Ext.: External); Fus.: Fusion Architecture (Unif.: Unified/Transformer-based); Tachy.: Tachycardia; Critical.: Criticality; Mort.: Mortality; Hypotens.: Hypotension; LOS: Length of Stay; Cal.: Calibration Analysis Performed; Comp.: Complexity Category (Low/Med/High); Base: Baseline Model Performance (Structured data only); Hyb.: Hybrid Model Performance (Structured + Unstructured data); ΔAUC: Marginal gain in Area Under the Curve. * Values marked with an asterisk indicate performance reported as Accuracy or F1 Score rather than AUC.

Table 5. Predictive performance stratified by clinical task complexity (complexity gradient).

Complexity	N Studies	Baseline AUC (Tabular)	Hybrid AUC	Median ΔAUC	IQR
Low (Tachycardia, Hypotension)	4	0.88 ± 0.04	0.91 ± 0.03	+0.036	0.022–0.048
Medium (Admission, Return visit)	8	0.79 ± 0.06	0.85 ± 0.05	+0.065	0.051–0.082
High (Sepsis, Hypoxia, Criticality)	13	0.71 ± 0.08	0.82 ± 0.07	+0.111	0.093–0.128
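
The stratification behind Table 5 can be illustrated with a minimal Python sketch. It computes ΔAUC as hybrid AUC minus baseline AUC and takes the median within each complexity category. The sketch uses only the four rows of Table 4 that report both values on the AUC scale, so its output is not expected to reproduce the pooled figures in Table 5. The function names are illustrative assumptions and are not taken from the review's analysis code.

```python
from statistics import median

# (study, complexity, baseline AUC, hybrid AUC) copied from the Table 4 rows
# that report both baseline and hybrid performance as AUC.
ROWS = [
    ("Roquette 2020", "High", 0.873, 0.892),
    ("Zhang 2017", "Low", 0.824, 0.846),
    ("Liu 2025", "High", 0.760, 0.950),
    ("Glicksberg 2024", "High", 0.790, 0.880),
]


def delta_auc(base, hybrid):
    """Marginal fusion gain: hybrid AUC minus tabular baseline AUC."""
    return round(hybrid - base, 3)


def median_gain_by_complexity(rows):
    """Group per-study gains by complexity category and take the median."""
    groups = {}
    for _study, complexity, base, hybrid in rows:
        groups.setdefault(complexity, []).append(delta_auc(base, hybrid))
    return {level: round(median(values), 3) for level, values in groups.items()}


gains = median_gain_by_complexity(ROWS)
print(gains)
```

With more studies per category, the same grouping step would also supply the values needed for the interquartile ranges reported in the table.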

Table 6. Calibration performance in studies reporting Brier scores (n = 3 of 25, 12%).

Study	AUC	Brier Score	Calibration Slope	Interpretation
Liu 2025 [16]	0.84	0.089	0.98	Excellent calibration; predicted probabilities closely match observed rates.
Nover 2025 [18]	0.81	0.142	0.87	Moderate calibration; slight underestimation of high-risk patients.
Douglas 2022 [26]	0.86	0.106	0.92	Good calibration across probability ranges.
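
For readers unfamiliar with the calibration metrics in Table 6, the sketch below shows how a Brier score and an expected/observed (E/O) ratio can be computed from predicted probabilities and binary outcomes. The data are hypothetical toy values, not results from any included study, and the function names are assumptions made for illustration. The calibration slope is omitted because it requires fitting a logistic model on the logit of the predictions.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_observed_ratio(y_true, p_pred):
    """Sum of predicted risks divided by the number of observed events."""
    observed = sum(y_true)
    if observed == 0:
        raise ValueError("E/O ratio is undefined when no events are observed")
    return sum(p_pred) / observed


# Hypothetical toy data (assumption): four encounters, two with the outcome.
y = [0, 0, 1, 1]
p = [0.1, 0.2, 0.8, 0.9]

brier = brier_score(y, p)
eo_ratio = expected_observed_ratio(y, p)
print(f"Brier score: {brier:.3f}, E/O ratio: {eo_ratio:.2f}")
```

A lower Brier score indicates better agreement between predicted and observed risk, and an E/O ratio near 1 indicates that the model neither over- nor under-predicts on average.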

Table 7. Demographic reporting and fairness auditing across 25 studies.

Reporting Element	N Studies (%)	Representative Examples
Age distribution reported	25 (100%)	All studies
Sex/gender distribution reported	23 (92%)	All except Levin 2018 [4], Sezik 2025 [29]
Race/ethnicity distribution reported	3 (12%)	Liu 2025 [16], Glicksberg 2024 [17], Nover 2025 [18]
Insurance status reported	1 (4%)	Glicksberg 2024 [17]
Performance stratified by race	0 (0%)	None
Performance stratified by insurance	0 (0%)	None
Performance stratified by sex	2 (8%)	Liu 2025 [16], Douglas 2022 [26]
Explicit bias mitigation discussed	0 (0%)	None
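
The subgroup stratification that Table 7 finds missing can be sketched as follows. This is a minimal illustration that assumes each encounter carries a binary outcome, a model score, and a subgroup label. AUC is computed directly as the proportion of positive/negative pairs ranked correctly, with ties counted as one half. The group labels and data are hypothetical and do not come from any included study.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative case."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(y_true, scores, groups):
    """Compute AUC separately within each subgroup label."""
    result = {}
    for label in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == label]
        result[label] = auc([y_true[i] for i in idx], [scores[i] for i in idx])
    return result


# Hypothetical toy data (assumption): two subgroups of four encounters each.
y = [1, 0, 1, 0, 1, 0, 1, 0]
s = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.8, 0.4]
g = ["A", "A", "A", "A", "B", "B", "B", "B"]

stratified = auc_by_group(y, s, g)
print(stratified)
```

A gap between subgroup AUCs, as in this toy example, is the kind of signal that a fairness audit would then test for statistical significance.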

Table 8. PROBAST risk of bias assessment across included studies.

Study	D1 (Participants)	D2 (Predictors)	D3 (Outcome)	D4 (Analysis)	Overall Risk
Levin 2018 [4]	Low	Low	Low	High	High
Roquette 2020 [12]	Low	Low	Low	Low	Low
Zhang 2017 [13]	Low	Low	Low	High	High
Winston 2024 [14]	Low	Low	Low	High	High
Chen 2020 [15]	Low	Low	Low	High	High
Liu 2025 [16]	Low	Low	Low	Low	Low
Glicksberg 2024 [17]	Low	Low	Low	High	High
Nover 2025 [18]	Low	Low	Low	Low	Low
Ivanov 2020 [19]	Low	Low	Low	High	High
Sundrani 2023 [20]	Low	Low	Low	High	High
Li 2025 [21]	Low	Low	Low	High	High
Patel 2022 [22]	Low	Low	Low	High	High
Yun 2021 [23]	Low	Low	Low	High	High
Nanini 2024 [24]	Low	Low	Low	Low	Low
Xie 2024 [25]	Low	Low	Low	High	High
Douglas 2022 [26]	Low	Low	Low	Low	Low
Gomes 2025 [27]	Low	Low	Low	High	High
Tariq 2021 [28]	Low	Low	Low	High	High
Sezik 2025 [29]	Low	Low	Low	High	High
De Hond 2021 [30]	Low	Low	Low	High	High
Arnaud 2022 [31]	Low	Low	Low	Low	Low
Sulaiman 2025 [32]	Low	Low	Low	High	High
Lin 2021 [33]	Low	Low	Low	Low	Low
Johnson 2022 [34]	Low	Low	Low	High	High
Foote 2024 [35]	Low	Low	Low	High	High

Table 9. Architectural comparison: traditional ML vs. hybrid BERT vs. unified LLM.

Feature	Traditional ML (Tabular)	Hybrid BERT (Fusion)	Unified LLM (GPT-4/Llama)
Input Data	Vital signs, Demographics	Vitals + Triage Notes	Serialized Text Prompts
Context Window	None (Snapshot)	512 Tokens	4k–128k Tokens
Inference Latency	<50 ms (CPU)	200–500 ms (GPU)	>1000 ms (GPU API)
Infrastructure	Lightweight (Edge/CPU)	Moderate (On-prem GPU)	Heavy (Cloud/H100 Cluster)
Privacy Risk	Low	Medium (Text PII)	High (API Data Leakage)
Primary Gain	Physiological Stability	Complex Syndromes	Reasoning & Explanation
Performance (ΔAUC)	Baseline	+0.111 (High Complexity)	Comparable to Hybrid
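
To make the fusion column of the comparison concrete, the sketch below contrasts the two standard fusion strategies in their simplest form: early (feature-level) fusion concatenates the tabular and text vectors before a single model, while late (decision-level) fusion averages the probabilities of two per-modality models. The linear scorer, the weights, and the mixing parameter `alpha` are all illustrative assumptions and do not describe the architecture of any included study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear_prob(features, weights, bias):
    """Probability from a simple logistic scorer (stand-in for any model)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def early_fusion(tabular, text_emb, weights, bias):
    """Feature-level fusion: concatenate both vectors, score with one model."""
    return linear_prob(tabular + text_emb, weights, bias)


def late_fusion(tabular, text_emb, tab_model, text_model, alpha=0.5):
    """Decision-level fusion: weighted average of per-modality probabilities."""
    p_tab = linear_prob(tabular, *tab_model)
    p_text = linear_prob(text_emb, *text_model)
    return alpha * p_tab + (1.0 - alpha) * p_text


# Hypothetical inputs (assumption): two vital-sign features, one text feature.
vitals = [1.0, 2.0]
note_embedding = [0.5]

p_early = early_fusion(vitals, note_embedding, [0.2, -0.1, 0.4], 0.0)
p_late = late_fusion(vitals, note_embedding, ([0.2, -0.1], 0.0), ([0.4], 0.0))
print(round(p_early, 3), round(p_late, 3))
```

Early fusion lets one model learn interactions between modalities, whereas late fusion keeps the two models independent and only combines their outputs.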

Table 10. Evidence-based decision framework for hybrid triage model deployment.

Clinical Scenario	Recommended Approach	Evidence Basis
High-complexity syndrome prediction (sepsis, hypoxia, criticality, multi-organ dysfunction)	Consider hybrid models with mandatory calibration and fairness validation. Expected ΔAUC: 0.09–0.13.	Thirteen studies, median ΔAUC + 0.111, robust to bias exclusion.
Medium-complexity administrative outcomes (admission, disposition, resource allocation)	Hybrid models may provide a moderate benefit (ΔAUC 0.05–0.08). Conduct a local pilot study before deployment.	Eight studies, median ΔAUC + 0.065, heterogeneous validation quality.
Low-complexity physiological thresholds (tachycardia, hypotension, isolated vital sign abnormalities)	Tabular models are likely sufficient. Hybrid models add minimal value (ΔAUC < 0.05) with increased computational cost.	Four studies, median ΔAUC + 0.036, diminishing returns.
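
The decision framework in Table 10 can be expressed as a simple lookup, sketched below. The task names and ΔAUC ranges are taken from the table; the dictionary layout and the `recommend` function are hypothetical conveniences for illustration, not a validated decision tool.

```python
# Task-to-complexity mapping follows the example tasks named in Table 10.
TASK_COMPLEXITY = {
    "sepsis": "high",
    "hypoxia": "high",
    "criticality": "high",
    "admission": "medium",
    "disposition": "medium",
    "tachycardia": "low",
    "hypotension": "low",
}

# (approach, lower ΔAUC bound, upper ΔAUC bound); None means no bound is stated.
RECOMMENDATION = {
    "high": ("hybrid model with calibration and fairness validation", 0.09, 0.13),
    "medium": ("hybrid model after a local pilot study", 0.05, 0.08),
    "low": ("tabular model", None, 0.05),
}


def recommend(task):
    """Return the Table 10 recommendation for a named prediction task."""
    level = TASK_COMPLEXITY.get(task.lower())
    if level is None:
        raise KeyError(f"no complexity level recorded for task {task!r}")
    approach, low, high = RECOMMENDATION[level]
    return {"complexity": level, "approach": approach, "delta_auc_range": (low, high)}


print(recommend("Sepsis"))
print(recommend("tachycardia"))
```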

Table 11. Divergence in priorities: academic focus vs. industry deployment trends.

Domain	Academic Research Focus	Industry & Deployment Trends
Model Size	Massive Multimodal Transformers (Billions of params).	Lightweight Distillation (DistilBERT) for edge deployment.
Data Type	Static, clean datasets (MIMIC-III).	Real-time, noisy, missing data streams (HL7/FHIR).
Key Metric	AUC maximization.	Inference Latency (<100 ms) and Cost-per-prediction.
Validation	Retrospective splits.	Prospective “Silent Trials” and Drift Detection.
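
The latency metric in Table 11 can be measured with a small timing harness such as the one sketched below. The `dummy_predict` function is a hypothetical stand-in for a deployed model, and the 100 ms budget is the threshold quoted in the table. Real measurements would need to run on the target hardware with realistic inputs.

```python
import time
from statistics import median


def median_latency_ms(predict, inputs, repeats=5):
    """Median wall-clock time per prediction, in milliseconds."""
    samples = []
    for x in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            predict(x)
            samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def dummy_predict(features):
    """Hypothetical stand-in for a model call (assumption)."""
    return sum(features) / len(features)


latency = median_latency_ms(dummy_predict, [[1.0, 2.0, 3.0]] * 20)
within_budget = latency < 100.0  # 100 ms budget as quoted in Table 11
print(f"median latency: {latency:.4f} ms, within budget: {within_budget}")
```

Reporting the median rather than the mean limits the influence of occasional slow calls, which matches the reporting item proposed in the checklist.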

Table 12. Proposed minimum reporting checklist for hybrid emergency triage studies.

Domain	Mandatory Reporting Checklist
Demographics	[ ] Report training/validation/test set demographics: age (mean, SD, range), sex (% female). [ ] Report race/ethnicity and insurance status (private/public/uninsured %).
Discrimination	[ ] Report AUC-ROC with 95% confidence intervals. [ ] Provide sensitivity, specificity, PPV, NPV at clinically relevant thresholds (e.g., at 10% predicted risk).
Calibration	[ ] Mandatory: Brier score, calibration slope. [ ] Recommended: Calibration plots (observed vs. predicted across deciles), Hosmer-Lemeshow test.
Fairness Auditing	[ ] Stratify AUC, sensitivity, specificity by: race/ethnicity, sex, age group, insurance status. [ ] Test for statistically significant subgroup differences.
Baseline Comparison	[ ] Compare hybrid model against strong baselines: optimized tabular-only and text-only models. [ ] Report ΔAUC for each modality contribution.
Validation Strategy	[ ] Retrospective studies: mandatory temporal validation (train on Year 1, test on Year 2). [ ] Prospective studies preferred. External validation strongly encouraged.
Architecture	[ ] Explicitly define fusion strategy (Early/Late/Unified). [ ] Report NLP/tabular model type, embedding dimensions, fusion layer, total parameters.
Computational	[ ] Report median inference latency (milliseconds per prediction), hardware specifications. [ ] Report whether latency is acceptable for clinical workflow.
Code & Data	[ ] Publicly share model code, de-identified/synthetic data, and trained model weights.
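
As an illustration of the discrimination item in the checklist, the sketch below computes sensitivity, specificity, PPV, and NPV at a fixed predicted-risk threshold. The 10% default mirrors the example threshold in Table 12. The data are hypothetical toy values, and the function name is an assumption made for illustration.

```python
def threshold_metrics(y_true, p_pred, threshold=0.10):
    """Confusion-matrix metrics when flagging patients at or above a risk threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_pred):
        flagged = p >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is 0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy data (assumption): five encounters, two with the outcome.
y = [1, 1, 0, 0, 0]
p = [0.30, 0.05, 0.20, 0.02, 0.08]

metrics = threshold_metrics(y, p)
print(metrics)
```

Reporting all four values at the same threshold makes the trade-off between missed cases and unnecessary alerts explicit.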

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
