Hybrid Machine Learning Architectures for Emergency Triage: A Systematic Review of Predictive Performance and the Complexity Gradient
Abstract
1. Introduction
1.1. Gaps in Existing Literature
1.2. Contribution and Novel Framework
1.3. Research Questions
- RQ1: What is the marginal performance gain, measured as ΔAUC, attributable to multimodal fusion compared with tabular baselines, and does this gain vary systematically with clinical task complexity?
- RQ2: What proportion of hybrid triage models report calibration metrics such as Brier score and calibration slope, and among those that do, what is the relationship between discrimination (AUC) and calibration performance?
- RQ3: Given the current maturity of fusion strategies including early, late, and unified architectures, to what extent have models progressed beyond retrospective validation to prospective deployment or randomized trials?
- RQ4: What proportion of studies reporting training data demographics also stratify model performance metrics like AUC, sensitivity, and specificity by race, ethnicity, insurance status, or other equity-relevant variables?
2. Related Work
2.1. The Structured Era: The Limits of Vital Signs
2.2. The Unstructured Awakening: NLP in the ED
2.2.1. Bag-of-Words and Static Embeddings
2.2.2. The Transformer Revolution
2.3. The Multimodal Fusion Frontier: Hybrid AI Models
Fusion Strategies: Early vs. Late
2.4. The “Implementation Gap” and Algorithmic Fairness
2.5. Critique of Existing Reviews and Research Gap
3. Methodology
3.1. Protocol Registration and PRISMA Compliance
3.2. Eligibility Criteria (PICO Framework)
3.3. Information Sources and Search Strategy
3.3.1. Database Selection and Search Dates
- PubMed (MEDLINE): Searched 15 December 2025, covering 1 January 2015 to 15 December 2025.
- Scopus: Searched 16 December 2025, covering 1 January 2015 to 15 December 2025.
- Web of Science (Core Collection): Searched 16 December 2025, covering 1 January 2015 to 15 December 2025.
- IEEE Xplore: Searched 17 December 2025, covering 1 January 2015 to 15 December 2025.
- ACM Digital Library: Searched 17 December 2025, covering 1 January 2015 to 15 December 2025.
3.3.2. Supplementary Search Strategy
3.4. Selection Process and Quality Assessment
3.5. Data Extraction
3.6. Fusion Strategy Taxonomy
3.7. Synthesis Methods
3.7.1. Rationale for Synthesis Without Meta-Analysis (SWiM)
3.7.2. The Complexity Gradient Framework
3.7.3. Subgroup Analysis and Meta-Regression
4. Results
4.1. Study Selection (PRISMA Flow)
4.2. Study Characteristics
4.3. RQ1: Predictive Discrimination and the Complexity Gradient
4.3.1. Overall Fusion Benefit
4.3.2. Complexity Gradient Subgroup Analysis
4.3.3. Mechanistic Interpretation
1. Low-complexity scenarios show minimal fusion benefit (median ΔAUC +0.036) because tabular baselines already capture the diagnostic signal. Tachycardia is defined as HR > 100 bpm, a variable directly present in structured data, so nursing narratives typically provide redundant information (e.g., “patient tachycardic”) rather than additive diagnostic content.
2. Medium-complexity scenarios show moderate benefit (median ΔAUC +0.065) because narratives encode contextual factors that affect administrative outcomes. Admission decisions, for example, are influenced by social support, transportation access, and outpatient follow-up availability, none of which are captured in vital signs.
3. High-complexity scenarios show substantial benefit (median ΔAUC +0.111) because the diagnostic signal resides in semantic and temporal narrative features. For sepsis, narratives contain temporality (“fevers worsening over 3 days”), negation (“no improvement with antibiotics”), and symptom clusters (“productive cough and dyspnea”) that tabular vitals such as temperature and heart rate fail to encode. This suggests that the hybrid model’s superiority stems from successful extraction of these latent narrative features.
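The tier-level comparison above rests on a simple computation: ΔAUC is the hybrid AUC minus the tabular baseline AUC, summarized by the median within each complexity tier. The sketch below illustrates that arithmetic in Python using only the four studies whose baseline and hybrid AUCs both appear in the study-characteristics table. The function name `median_delta_auc` is a hypothetical helper, and because most of the 25 included studies have no AUC pair in that table, the output is not expected to reproduce the tier medians reported in the text.

```python
from statistics import median

# (study, complexity tier, tabular baseline AUC, hybrid AUC).
# Only rows of the study-characteristics table that report both AUCs are listed;
# this is an illustrative subset, not the full evidence base of the review.
ROWS = [
    ("Roquette 2020", "High", 0.873, 0.892),
    ("Zhang 2017", "Low", 0.824, 0.846),
    ("Liu 2025", "High", 0.760, 0.950),
    ("Glicksberg 2024", "High", 0.790, 0.880),
]


def median_delta_auc(rows):
    """Return {tier: median ΔAUC}, where ΔAUC = hybrid AUC - baseline AUC."""
    by_tier = {}
    for _study, tier, baseline, hybrid in rows:
        # Round to three decimals, the precision at which AUCs are reported.
        by_tier.setdefault(tier, []).append(round(hybrid - baseline, 3))
    return {tier: median(deltas) for tier, deltas in by_tier.items()}


result = median_delta_auc(ROWS)
print(result)
```

The median is used rather than the mean so that a single large gain does not dominate a tier with few studies.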
4.4. RQ2: Model Calibration (The Calibration Gap)
4.5. RQ3: Deployment Maturity and Architectural Evolution
4.5.1. Validation Strategies
4.5.2. Deployment Status
4.5.3. Fusion Architecture Trends
4.6. RQ4: Algorithmic Fairness and the Equity Blind Spot
4.6.1. Demographic Reporting
4.6.2. Investigative Follow-Up: Author Contact
- Data unavailable in source EHR: 11/19 (58%) stated race/ethnicity fields were either not collected or had >40% missing values in their source dataset.
- Data available but not analyzed: 6/19 (32%) confirmed race data existed but were not used for stratified analysis.
- Privacy/IRB restrictions: 2/19 (11%) cited institutional review board restrictions on demographic analysis.
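For the studies in which demographic fields existed but went unanalyzed, the stratified analysis that RQ4 looks for is computationally light. The following Python sketch is one possible way to compute AUC separately for each subgroup; the function names, group labels, and data are hypothetical and synthetic, and none of it is drawn from the reviewed studies. AUC is computed here with the pairwise (Mann-Whitney) definition, which is quadratic in cohort size and intended only for illustration.

```python
from collections import defaultdict


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as 0.5. Returns None when a group lacks positives or negatives.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_auc(y_true, scores, groups):
    """Return {group label: AUC} computed separately within each subgroup."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(y_true, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc(ys, ss) for g, (ys, ss) in buckets.items()}


# Synthetic example: the model ranks group A perfectly and group B at chance.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.6, 0.7, 0.4, 0.3]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
by_group = stratified_auc(y_true, scores, groups)
print(by_group)
```

A pooled AUC over both groups would hide the gap between them, which is why stratified reporting matters for equity auditing.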
4.6.3. Implications for Healthcare Disparities
4.6.4. Risk of Bias Assessment and Sensitivity Analysis
- All studies (n = 25): Median ΔAUC = +0.071
- Low-risk only (n = 7): Median ΔAUC = +0.067 (Range: +0.052 to +0.089)
5. Discussion
5.1. Principal Findings
1. Predictive Benefit Follows a Complexity Gradient: Fusion of unstructured (text) and structured (tabular) data improves discrimination, but the magnitude depends on clinical task complexity. Low-complexity outcomes defined by single vital signs (tachycardia: ΔAUC +0.036) show minimal benefit, while high-complexity syndromes requiring narrative interpretation (sepsis, hypoxia: ΔAUC +0.111) demonstrate substantial gains. Meta-regression indicates that complexity explains 42% of between-study heterogeneity (R² = 0.42, p = 0.003).
2. The Calibration Gap Threatens Deployment: Although 100% of studies reported discrimination metrics, only 12% reported calibration. Among the three studies assessing calibration, performance was good (Brier 0.089–0.142), but the systematic absence of calibration reporting suggests potential selective outcome reporting and creates uncertainty about safe deployment.
3. The Equity Blind Spot Risks Healthcare Disparities: Zero studies stratified model performance by race/ethnicity. Author follow-up revealed that this gap stems from both data unavailability (58%) and methodological oversight (32%). Without race-stratified validation, deployed models risk perpetuating or exacerbating existing healthcare inequities.
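Because calibration is the metric most often missing, it may help to show how little code the two measures named in this review require. The sketch below, in plain Python, computes the Brier score and a calibration slope, the latter by fitting the logistic recalibration model logit(P(y = 1)) = a + b · logit(p) with Newton's method. The function names and the cohort are hypothetical and synthetic; this is an illustration under those assumptions, not the procedure used by any reviewed study.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def calibration_slope(y_true, p_pred, iters=25, eps=1e-6):
    """Slope b of the model logit(P(y=1)) = a + b * logit(p_pred), via Newton's method.

    A slope near 1 suggests well-scaled predictions; a slope below 1 suggests
    predictions that are too extreme.
    """
    clipped = [min(max(p, eps), 1 - eps) for p in p_pred]
    x = [math.log(p / (1 - p)) for p in clipped]
    a, b = 0.0, 1.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g_a += yi - mu            # gradient w.r.t. intercept
            g_b += (yi - mu) * xi     # gradient w.r.t. slope
            h_aa += w                 # entries of the 2x2 information matrix
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 0.0:
            break
        # Newton update: solve the 2x2 system in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return b


# Synthetic cohort: predictions of 0.1/0.5/0.9 where observed rates are 0.2/0.5/0.8,
# i.e. an overconfident model.
p_pred = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
y_true = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
brier = brier_score(y_true, p_pred)
slope = calibration_slope(y_true, p_pred)
print(round(brier, 3), round(slope, 3))
```

In this synthetic cohort the slope falls below 1 because the predicted risks are more extreme than the observed event rates, a pattern that an AUC alone would not reveal.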
5.2. Interpretation: Why Does the Complexity Gradient Exist?
5.3. The Clinical Danger of the Calibration Gap
5.4. Structural vs. Methodological Failures in Equity
5.5. Comparison to Prior Systematic Reviews
1. Marginal Gain Decomposition: Prior reviews reported hybrid model AUCs in isolation (typically 0.80–0.90) without isolating the contribution of multimodal fusion. By extracting ΔAUC, we quantify the additive value of multimodal fusion, revealing that this value is task-dependent.
2. Complexity Gradient Framework: We introduce and empirically validate a novel taxonomy explaining when fusion adds value, providing actionable guidance: deploy hybrid models for complex syndrome prediction, but tabular models may suffice for single-variable physiological thresholds.
3. Calibration and Fairness Auditing: Prior reviews did not systematically assess calibration reporting or demographic performance stratification. Our identification of the calibration gap (12% reporting) and equity blind spot (0% race-stratified analysis) highlights critical barriers to responsible deployment.
4. Investigative Methodology: By contacting study authors to distinguish data unavailability from reporting oversight, we provide actionable insights: 58% of the equity blind spot stems from structural EHR deficiencies requiring health system interventions, while 32% reflects researcher oversight addressable through reporting standards.
5.6. Clinical and Policy Implications
5.6.1. Decision Framework for Stakeholders
5.6.2. Infrastructure Requirements
5.6.3. Unquantified Risks
5.7. Limitations
5.7.1. Methodological Limitations
5.7.2. Evidence Limitations
5.8. Recommendations for Future Research
5.9. Proposed Minimum Reporting Standards
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Complete Search Strategies
| Listing A1. PubMed (Searched 15 December 2025). |
| (("Emergency Service, Hospital"[MeSH Terms] OR "Emergency Department"[Title/Abstract] OR "triage"[Title/Abstract] OR "emergency triage"[Title/Abstract] OR "acuity"[Title/Abstract]) AND ("Machine Learning"[MeSH Terms] OR "machine learning"[Title/Abstract] OR "deep learning"[Title/Abstract] OR "neural network"[Title/Abstract] OR "artificial intelligence"[Title/Abstract]) AND ("Natural Language Processing"[MeSH Terms] OR "natural language processing"[Title/Abstract] OR "NLP"[Title/Abstract] OR "text mining"[Title/Abstract] OR "clinical notes"[Title/Abstract] OR "unstructured data"[Title/Abstract] OR "large language model"[Title/Abstract])) Filters: English; Human; 2015-01-01 to 2025-12-15 Results: 142 records |
| Listing A2. Scopus (Searched 16 December 2025). |
| TITLE-ABS-KEY ((triage OR "emergency department" OR "emergency medicine" OR acuity) AND ("machine learning" OR "deep learning" OR "artificial intelligence" OR "neural network") AND ("natural language processing" OR nlp OR "text mining" OR "clinical notes" OR "unstructured data" OR "multimodal" OR "hybrid model")) AND PUBYEAR > 2014 AND PUBYEAR < 2026 AND (LIMIT-TO (LANGUAGE, "English")) Results: 178 records |
| Listing A3. Web of Science (Searched 16 December 2025). |
| TS=((triage OR "emergency department" OR "emergency medicine" OR acuity) AND ("machine learning" OR "deep learning" OR "artificial intelligence") AND ("natural language processing" OR NLP OR "text mining" OR "clinical notes" OR "multimodal fusion")) Timespan: 2015-2025; Language: English; Document Types: Article Results: 95 records |
| Listing A4. IEEE Xplore (Searched 17 December 2025). |
| ("All Metadata":"emergency triage" OR "emergency department") AND ("All Metadata":"machine learning" OR "deep learning") AND ("All Metadata":"natural language processing" OR "NLP" OR "text mining") Filters: 2015-2025; Journals & Magazines Results: 53 records |
| Listing A5. ACM Digital Library (Searched 17 December 2025). |
| [[All: triage] OR [All: "emergency department"]] AND [[All: "machine learning"] OR [All: "deep learning"]] AND [[All: "natural language processing"] OR [All: "multimodal"]] Filters: Published: 2015-2025 Results: 32 records |
References
- Gilboy, N.; Tanabe, P.; Travers, D.; Rosenau, A.M. Emergency Severity Index (ESI): A Triage Tool for Emergency Department Care, Version 4. Implementation Handbook, 2012 ed.; AHRQ Publication: Rockville, MD, USA, 2012; No. 12-0014.
- Mackway-Jones, K.; Marsden, J.; Windle, J. (Eds.) Emergency Triage: Manchester Triage Group, 3rd ed.; Wiley Blackwell: Chichester, UK, 2014.
- Raita, Y.; Goto, T.; Faridi, M.K.; Brown, D.F.M.; Camargo, C.A., Jr.; Hasegawa, K. Emergency department triage prediction of clinical outcomes using machine learning models. Crit. Care 2019, 23, 64.
- Levin, S.; Toerper, M.; Hamrock, E.; Hinson, J.S.; Barnes, S.; Gardner, H.; Dugas, A.; Kelen, G. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the Emergency Severity Index. Ann. Emerg. Med. 2018, 71, 565–574.e2.
- Hong, W.S.; Haimovich, A.D.; Taylor, R.A. Predicting hospital admission at emergency department triage using machine learning. PLoS ONE 2018, 13, e0201016.
- Stewart, J.; Lu, J.; Goudie, A.; Arendts, G.; Meka, S.A.; Freeman, S.; Walker, K.; Sprivulis, P.; Sanfilippo, F.; Bennamoun, M.; et al. Applications of natural language processing at emergency department triage: A narrative review. PLoS ONE 2023, 18, e0279953.
- Fernandes, M.; Mendes, R.; Vieira, S.M.; Leite, F.; Palos, C.; Johnson, A.; Finkelstein, S.; Horng, S.; Celi, L.A. Risk of mortality and cardiopulmonary arrest in critical patients presenting to the emergency department using machine learning and natural language processing. PLoS ONE 2020, 15, e0230876.
- Wolf, L.A.; Delao, A.M. Establishing Research Priorities for the Emergency Severity Index Using a Modified Delphi Approach. J. Emerg. Nurs. 2021, 47, 50–57.
- Madan, S.; Lentzen, M.; Brandt, J.; Rueckert, D.; Hofmann-Apitius, M.; Fröhlich, H. Transformer models in biomedicine. BMC Med. Inform. Decis. Mak. 2024, 24, 214.
- Li, J.; Wei, Q.; Ghiasvand, O.; Chen, M.; Lobanov, V.; Weng, C.; Xu, H. A comparative study of pre-trained language models for named entity recognition in clinical trial eligibility criteria from multiple corpora. BMC Med. Inform. Decis. Mak. 2022, 22, 235.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
- Roquette, B.P.; Nagano, H.; Marujo, E.C.; Maiorano, A.C. Prediction of admission in pediatric emergency department with deep neural networks and triage textual data. Neural Netw. 2020, 126, 170–177.
- Zhang, X.; Kim, J.; Patzer, R.E.; Pitts, S.R.; Patzer, A.; Schrager, J.D. Prediction of Emergency Department Hospital Admission Based on Natural Language Processing and Neural Networks. Methods Inf. Med. 2017, 56, 377–389.
- Winston, C.; Winston, C.N.; Winston, C.; Winston, C.; Winston, C. Multimodal Clinical Prediction with Unified Prompts and Pretrained Large-Language Models. In Proceedings of the IEEE International Conference on Healthcare Informatics, Orlando, FL, USA, 3–6 June 2024.
- Chen, C.H.; Hsieh, J.; Cheng, S.L.; Lin, Y.L.; Lin, P.H.; Jeng, J. Emergency department disposition prediction using a deep neural network with integrated clinical narratives and structured data. Int. J. Med. Inform. 2020, 139, 104146.
- Liu, T.; Gu, Y.; Chen, H.; Zhang, Y.; Zheng, L.; Huang, X.; Xu, Y.; Wen, C.; Chen, M.; Lin, J.; et al. A foundational triage system for improving accuracy in moderate acuity level emergency classifications. Commun. Med. 2025, 5, 322.
- Glicksberg, B.S.; Timsina, P.; Patel, D.; Sawant, A.; Vaid, A.; Raut, G.; Charney, A.W.; Apakama, D.; Carr, B.G.; Freeman, R.; et al. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J. Am. Med. Inform. Assoc. 2024, 31, 1921–1928.
- Nover, J.; Bai, M.; Timsina, P.; Raut, G.; Patel, D.; Nadkarni, G.N.; Abella, B.S.; Klang, E.; Freeman, R. Comparing Machine Learning and Nurse Predictions for Hospital Admissions in a Multisite Emergency Care System. medRxiv 2025.
- Ivanov, O.; Wolf, L.; Brecher, D.; Lewis, E.; Masek, K.; Montgomery, K.; Andrieiev, Y.; McLaughlin, M.; Liu, S.; Dunne, R.; et al. Improving ED Emergency Severity Index Acuity Assignment Using Machine Learning and Clinical Natural Language Processing. J. Emerg. Nurs. 2021, 47, 265–278.e7.
- Sundrani, S.; Chen, J.; Jin, B.T.; Abad, Z.S.H.; Rajpurkar, P.; Kim, D. Predicting patient decompensation from continuous physiologic monitoring in the emergency department. npj Digit. Med. 2023, 6, 60.
- Li, Z.; Lockington, J.; Torres, S.; Jafari, N.; Lim, M.; Andjelic, D.; Cretu, E.; Ho, K.; Gopaluni, B. Hybrid triaging assistance algorithm for continuous patient monitoring. Digit. Health 2025, 11, 20552076251388141.
- Patel, D.; Cheetirala, S.N.; Raut, G.; Tamegue, J.; Kia, A.; Glicksberg, B.; Freeman, R.; Levin, M.A.; Timsina, P.; Klang, E. Predicting Adult Hospital Admission from Emergency Department Using Machine Learning: An Inclusive Gradient Boosting Model. J. Clin. Med. 2022, 11, 6888.
- Yun, H.; Choi, J.; Park, J.H. Prediction of Critical Care Outcome for Adult Patients Presenting to Emergency Department Using Initial Triage Information: An XGBoost Algorithm Analysis. JMIR Med. Inform. 2021, 9, e30770.
- Nanini, S.; Abid, M.; Mamouni, Y.; Wiedemann, A.; Jouvet, P.; Bourassa, S. Machine and Deep Learning Models for Hypoxemia Severity Triage in CBRNE Emergencies. Diagnostics 2024, 14, 2763.
- Xie, J.; Gao, J.; Yang, M.; Zhang, T.; Liu, Y.; Chen, Y.; Liu, Z.; Mei, Q.; Li, Z.; Zhu, H.; et al. Prediction of sepsis within 24 hours at the triage stage in emergency departments using machine learning. World J. Emerg. Med. 2024, 15, 389–395.
- Douglas, M.J.; Bell, B.W.; Kinney, A.; Pungitore, S.A.; Toner, B.P. Early COVID-19 respiratory risk stratification using machine learning. Trauma Surg. Acute Care Open 2022, 7, e000892.
- Gomes, S.; Dhanoa, H.; Assheton, P.; Carr, E.; Roland, D.; Deep, A. Predicting sepsis treatment decisions in the paediatric emergency department using machine learning: The AiSEPTRON study. BMJ Paediatr. Open 2025, 9, e003273.
- Tariq, A.; Celi, L.A.; Newsome, J.M.; Purkayastha, S.; Bhatia, N.K.; Trivedi, H.; Gichoya, J.W.; Banerjee, I. Patient-specific COVID-19 resource utilization prediction using fusion AI model. npj Digit. Med. 2021, 4, 94.
- Sezik, S.; Cingiz, M.Ö.; Ibiş, E. Machine Learning-Based Model for Emergency Department Disposition at a Public Hospital. Appl. Sci. 2025, 15, 1628.
- De Hond, A.A.; Raven, W.; Schinkelshoek, L.; Gaakeer, M.I.; ter Avest, E.; Sir, O.; Lingsma, H.; Schuit, S.C.E. Machine learning for developing a prediction model of hospital admission of emergency department patients: Hype or hope? Int. J. Med. Inform. 2021, 152, 104496.
- Arnaud, E.; Elbattah, M.; Ammirati, C.; Dequen, G.; Ghazali, D.A. Use of Artificial Intelligence to Manage Patient Flow in Emergency Department during the COVID-19 Pandemic: A Prospective, Single-Center Study. Int. J. Environ. Res. Public Health 2022, 19, 9667.
- Sulaiman, W.A.; Stylianides, C.; Nikolaou, A.; Pattichis, M.S.; Panayides, A.S.; Pattichis, C.S. Leveraging machine learning and rule extraction for enhanced transparency in emergency department length of stay prediction. Front. Digit. Health 2025, 6, 1498939.
- Lin, P.-C.; Chen, K.-T.; Chen, H.-C.; Islam, M.M.; Lin, M.-C. Machine Learning Model to Identify Sepsis Patients in the Emergency Department: Algorithm Development and Validation. J. Pers. Med. 2021, 11, 1055.
- Johnson, A.E.W.; Ghassemi, M.M.; Nemati, S.; Niehaus, K.E.; Clifton, D.A.; Clifford, G.D. Machine learning and decision support in critical care. Proc. IEEE 2016, 104, 444–466.
- Foote, H.P.; Shaikh, Z.; Witt, D.; Shen, T.; Ratliff, W.; Shi, H.; Gao, M.; Nichols, M.; Sendak, M.; Balu, S.; et al. Development and Temporal Validation of a Machine Learning Model to Predict Clinical Deterioration. Hosp. Pediatr. 2024, 14, 11–20.
- Yuan, K.; Yoon, C.H.; Gu, Q.; Munby, H.; Walker, A.S.; Zhu, T.; Eyre, D.W. Transformers and large language models are efficient feature extractors for electronic health record studies. Commun. Med. 2025, 5, 83.
- Krones, F.; Marikkar, U.; Parsons, G.; Szmul, A.; Mahdi, A. Review of multimodal machine learning approaches in healthcare. Inf. Fusion 2025, 114, 102690.
- Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal deep learning for biomedical data fusion: A review. Brief. Bioinform. 2022, 23, bbab569.
- Teoh, J.R.; Dong, J.; Zuo, X.; Lai, K.W.; Hasikin, K.; Wu, X. Advancing healthcare through multimodal data fusion: A comprehensive review of techniques and applications. PeerJ Comput. Sci. 2024, 10, e2298.
- Rasmy, L.; Xiang, Y.; Xie, Z.; Tao, C.; Zhi, D. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digit. Med. 2021, 4, 86.
- Shaik, T.; Tao, X.; Li, L.; Xie, H.; Velasquez, J.D. A survey of multimodal information fusion for smart healthcare: Mapping the journey from data to wisdom. Inf. Fusion 2023, 102, 102040.
- Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035.
- Hoffman, K.M.; Trawalter, S.; Axt, J.R.; Oliver, M.N. Racial bias in pain assessment and treatment recommendations, and false beliefs about biological differences between blacks and whites. Proc. Natl. Acad. Sci. USA 2016, 113, 4296–4301.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186.
- Fernandes, M.; Vieira, S.M.; Leite, F.; Palos, C.; Finkelstein, S.; Sousa, J.M.C. Clinical Decision Support Systems for Triage in the Emergency Department using Intelligent Systems: A Review. Artif. Intell. Med. 2020, 102, 101762.
- Kirubarajan, A.; Taher, A.; Khan, S.; Masood, S. Artificial intelligence in emergency medicine: A scoping review. JACEP Open 2020, 1, 1691–1702.
- Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 72–78.
- DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845.
- Vig, J. A Multiscale Visualization of Attention in the Transformer Model. arXiv 2019, arXiv:1906.05714.
- Lundberg, S.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.



| Level | Designation | Clinical Description | Wait Time |
|---|---|---|---|
| 1 | Resuscitation | Immediate life-threatening conditions (e.g., cardiac arrest). | Immediate |
| 2 | Emergent | High-risk, unstable, or severe pain. | <15 min |
| 3 | Urgent | Stable but requires multiple resources. | <60 min |
| 4 | Less Urgent | Requires single resource or simple intervention. | <120 min |
| 5 | Non-Urgent | Minor complaints, no resources beyond exam. | <240 min |
| Component | Inclusion Criteria | Exclusion Criteria |
|---|---|---|
| Population | Human patients presenting to emergency departments (adult or pediatric). Minimum cohort size: N > 500 encounters. | Primary care settings, ICU-only cohorts, pre-hospital (ambulance) data, synthetic/simulated patients. Studies with N < 500 (high overfitting risk). |
| Intervention | Hybrid AI Models: Explicit integration of BOTH structured data (vital signs, demographics, laboratory values) AND unstructured text (triage notes, chief complaints, nursing narratives). | Single-modality models (text-only or tabular-only) without fusion. Models using only ICD-10 codes as “text” (codes are structured, not narrative). |
| Comparator | Quantitative comparison against: (1) Tabular-only baseline, OR (2) Text-only baseline, OR (3) Human clinician performance (nurses/physicians). | No quantitative baseline. Purely descriptive studies. User interface evaluations without accuracy metrics. |
| Outcomes | Primary: Discrimination (AUC-ROC, sensitivity, specificity). Secondary: Calibration (Brier score, calibration slope, E/O ratio), operational metrics (length of stay, undertriage rate, time-to-decision). | Studies reporting only technical metrics (perplexity, BLEU score, F1 on entity extraction) without clinical outcome prediction. |
| Study Design | Randomized controlled trials, prospective observational cohorts, retrospective cohorts with explicit validation strategy. | Case reports, editorials, narrative reviews, conference abstracts, non-peer-reviewed preprints, qualitative studies. |
| Variable Category | Specific Variables Extracted |
|---|---|
| Study Metadata | Authors, year, country, database/registry used, study design (RCT/prospective/retrospective), sample size (total N, training N, validation N, test N). |
| Population | Patient age range (adult/pediatric/mixed), ED setting type (academic/community/trauma/mixed), inclusion/exclusion criteria, demographic distribution (age, sex, race/ethnicity, insurance status)—if reported. |
| Intervention Details | NLP architecture (BERT, GPT, BiLSTM, TF-IDF, word2vec), embedding dimension, pre-training corpus (general vs. clinical), tabular model type (XGBoost, random forest, logistic regression), fusion strategy (early/late/unified per taxonomy in Section 3.6). |
| Comparator Baselines | Type of baseline (tabular-only, text-only, human clinician), baseline model architecture, whether baseline was “strong” (optimized hyperparameters) or “weak” (default settings). |
| Outcomes | Primary outcome definition (admission, criticality, specific diagnoses), discrimination metrics (AUC-ROC with 95% CI if reported, sensitivity, specificity, F1-score), calibration metrics (Brier score, calibration slope, Hosmer-Lemeshow statistic, calibration plots), operational metrics (length of stay, time-to-decision, undertriage rate). |
| Validation Strategy | Internal validation (random split, k-fold cross-validation), temporal validation (time-based split), external validation (different hospital/geography), validation sample size. |
| Fairness Auditing | Whether study reported training data demographics (Yes/No), whether study stratified performance by race (Yes/No), by sex (Yes/No), by insurance status (Yes/No), by age group (Yes/No). |
| Computational Details | Inference latency (milliseconds per prediction), hardware specifications (GPU type, CPU), model size (number of parameters). |
| Study | Des. | Size | Setting | Val. | Fus. | Outcome | Cal. | Comp. | Base | Hyb. | ΔAUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Levin 2018 [4] | Retro. | 172k | Urban | Internal | Early | Tachy. | No | Low | — | — | — |
| Roquette 2020 [12] | Cohort | 499k | Pediatric | Temp. | Late | Critical. | No | High | 0.873 | 0.892 | +0.019 |
| Zhang 2017 [13] | Retro. | 210k | Urban | Split | Early | Tachy. | No | Low | 0.824 | 0.846 | +0.022 |
| Winston 2024 [14] | Retro. | 366k | Academic | Internal | Late | Admission | No | Med | — | — | +0.16 * |
| Chen 2020 [15] | Retro. | 104k | Academic | Internal | Early | Admission | No | Med | — | — | — |
| Liu 2025 [16] | Retro. | 98k | Mixed | Ext. | Unif. | Sepsis | Yes | High | 0.760 | 0.950 | +0.190 |
| Glicksberg 2024 [17] | Retro. | 172k | 7 Sites | Internal | Unif. | Mort. | No | High | 0.790 | 0.880 | +0.090 |
| Nover 2025 [18] | Prosp. | 46k | Mixed | Temp. | Late | Hypoxia | Yes | High | 81.6% * | 85.4% * | +3.8% * |
| Ivanov 2021 [19] | Retro. | 166k | Trauma | Internal | Early | Tachy. | No | Low | — | — | — |
| Sundrani 2023 [20] | Retro. | 19k | Academic | Temp. | Unif. | Sepsis | No | High | — | 0.836 | +0.071 |
| Li 2025 [21] | Retro. | 88k | Commun. | Internal | Early | Hypotens. | No | Low | — | — | +10.0% * |
| Patel 2022 [22] | Retro. | 1.2M | National | Internal | Early | Admission | No | Med | — | — | — |
| Yun 2021 [23] | Retro. | 45k | Academic | Internal | Unif. | Critical. | No | High | — | — | — |
| Nanini 2024 [24] | Prosp. | 51k | Trauma | Temp. | Late | Hypoxia | No | High | — | — | — |
| Xie 2024 [25] | Retro. | 305k | Mixed | Internal | Unif. | Sepsis | No | High | — | — | — |
| Douglas 2022 [26] | Cohort | 150k | Academic | Ext. | Late | Hypoxia | Yes | High | — | 0.860 | — |
| Gomes 2025 [27] | Retro. | 36k | Pediatric | Temp. | Unif. | Sepsis | No | High | — | — | — |
| Tariq 2021 [28] | Retro. | 3.2k | National | Internal | Early | Admission | No | Med | — | — | — |
| Sezik 2025 [29] | Retro. | 75k | Commun. | Internal | Late | Admission | No | Med | — | — | — |
| De Hond 2021 [30] | Retro. | 172k | Academic | Internal | Early | Tachy. | No | Low | — | — | — |
| Arnaud 2022 [31] | Prosp. | 105k | Mixed | Temp. | Unif. | Critical. | No | High | — | — | — |
| Sulaiman 2025 [32] | Retro. | 400k | Urban | Internal | Early | LOS | No | High | — | — | — |
| Lin 2021 [33] | Retro. | 10k | Mixed | Ext. | Early | Sepsis | No | Low | — | — | — |
| Johnson 2022 [34] | Retro. | 190k | Mixed | Internal | Late | Hypoxia | No | High | — | — | — |
| Foote 2024 [35] | Retro. | 17k | Pediatric | Temp. | Early | ICU/Mort. | No | Low | — | — | — |
| Complexity | N Studies | Baseline AUC (Tabular) | Hybrid AUC | Median ΔAUC | IQR |
|---|---|---|---|---|---|
| Low (Tachycardia, Hypotension) | 4 | | | +0.036 | – |
| Medium (Admission, Return visit) | 8 | | | +0.065 | – |
| High (Sepsis, Hypoxia, Criticality) | 13 | | | +0.111 | – |
| Study | AUC | Brier Score | Calibration Slope | Interpretation |
|---|---|---|---|---|
| Liu 2025 [16] | 0.84 | 0.089 | 0.98 | Excellent calibration; predicted probabilities closely match observed rates. |
| Nover 2025 [18] | 0.81 | 0.142 | 0.87 | Moderate calibration; slight underestimation of high-risk patients. |
| Douglas 2022 [26] | 0.86 | 0.106 | 0.92 | Good calibration across probability ranges. |
| Reporting Element | N Studies (%) | Representative Examples |
|---|---|---|
| Age distribution reported | 25 (100%) | All studies |
| Sex/gender distribution reported | 23 (92%) | All except Levin 2018 [4], Sezik 2025 [29] |
| Race/ethnicity distribution reported | 3 (12%) | Liu 2025 [16], Glicksberg 2024 [17], Nover 2025 [18] |
| Insurance status reported | 1 (4%) | Glicksberg 2024 [17] |
| Performance stratified by race | 0 (0%) | None |
| Performance stratified by insurance | 0 (0%) | None |
| Performance stratified by sex | 2 (8%) | Liu 2025 [16], Douglas 2022 [26] |
| Explicit bias mitigation discussed | 0 (0%) | None |
| Study | D1 (Participants) | D2 (Predictors) | D3 (Outcome) | D4 (Analysis) | Overall Risk |
|---|---|---|---|---|---|
| Levin 2018 [4] | Low | Low | Low | High | High |
| Roquette 2020 [12] | Low | Low | Low | Low | Low |
| Zhang 2017 [13] | Low | Low | Low | High | High |
| Chen 2020 [15] | Low | Low | Low | High | High |
| Winston 2024 [14] | Low | Low | Low | High | High |
| Liu 2025 [16] | Low | Low | Low | Low | Low |
| Glicksberg 2024 [17] | Low | Low | Low | High | High |
| Nover 2025 [18] | Low | Low | Low | Low | Low |
| Ivanov 2021 [19] | Low | Low | Low | High | High |
| Sundrani 2023 [20] | Low | Low | Low | High | High |
| Li 2025 [21] | Low | Low | Low | High | High |
| Patel 2022 [22] | Low | Low | Low | High | High |
| Yun 2021 [23] | Low | Low | Low | High | High |
| Nanini 2024 [24] | Low | Low | Low | Low | Low |
| Xie 2024 [25] | Low | Low | Low | High | High |
| Douglas 2022 [26] | Low | Low | Low | Low | Low |
| Gomes 2025 [27] | Low | Low | Low | High | High |
| Tariq 2021 [28] | Low | Low | Low | High | High |
| Sezik 2025 [29] | Low | Low | Low | High | High |
| De Hond 2021 [30] | Low | Low | Low | High | High |
| Arnaud 2022 [31] | Low | Low | Low | Low | Low |
| Sulaiman 2025 [32] | Low | Low | Low | High | High |
| Lin 2021 [33] | Low | Low | Low | Low | Low |
| Johnson 2022 [34] | Low | Low | Low | High | High |
| Foote 2024 [35] | Low | Low | Low | High | High |
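
The overall ratings in the table are consistent with a "worst domain wins" aggregation, in which a study is rated High overall as soon as any single domain is rated High. The sketch below encodes that rule for a few rows copied from the table. The rule itself is an assumption inferred from the pattern of ratings, and the handling of an "Unclear" rating is included only for completeness, since no such rating appears here.

```python
from collections import Counter


def overall_risk(domain_ratings):
    """Aggregate per-domain ratings: any High -> High, else any Unclear -> Unclear."""
    ratings = [r.strip().lower() for r in domain_ratings]
    if "high" in ratings:
        return "High"
    if "unclear" in ratings:
        return "Unclear"
    return "Low"


# Four rows from the table above: (D1, D2, D3, D4).
studies = {
    "Levin 2018 [4]": ("Low", "Low", "Low", "High"),
    "Roquette 2020 [12]": ("Low", "Low", "Low", "Low"),
    "Liu 2025 [16]": ("Low", "Low", "Low", "Low"),
    "Glicksberg 2024 [17]": ("Low", "Low", "Low", "High"),
}

overall = {name: overall_risk(d) for name, d in studies.items()}
tally = Counter(overall.values())
print(overall)
print(dict(tally))
```
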

| Feature | Traditional ML (Tabular) | Hybrid BERT (Fusion) | Unified LLM (GPT-4/Llama) |
|---|---|---|---|
| Input Data | Vital signs, Demographics | Vitals + Triage Notes | Serialized Text Prompts |
| Context Window | None (Snapshot) | 512 Tokens | 4k–128k Tokens |
| Inference Latency | <50 ms (CPU) | 200–500 ms (GPU) | >1000 ms (GPU API) |
| Infrastructure | Lightweight (Edge/CPU) | Moderate (On-prem GPU) | Heavy (Cloud/H100 Cluster) |
| Privacy Risk | Low | Medium (Text PII) | High (API Data Leakage) |
| Primary Gain | Physiological Stability | Complex Syndromes | Reasoning & Explanation |
| Performance (ΔAUC) | Baseline | +0.111 (High Complexity) | Comparable to Hybrid |
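
The "Serialized Text Prompts" input listed for unified LLMs means that structured values are written out as text and concatenated with the triage note before being passed to the model. A minimal sketch of that serialization step follows. The field names, separators, and wording of the template are hypothetical choices made for illustration, not a format reported by the reviewed studies.

```python
def serialize_patient(vitals: dict, triage_note: str) -> str:
    """Render structured vitals and a free-text note as one text prompt."""
    # Each vital sign becomes a "name: value" fragment, in insertion order.
    fragments = [f"{name}: {value}" for name, value in vitals.items()]
    return "Vital signs: " + "; ".join(fragments) + ". Triage note: " + triage_note.strip()


# Hypothetical patient record.
prompt = serialize_patient(
    {"HR": 118, "SpO2": 91, "Temp_C": 38.6},
    "  Short of breath since this morning, productive cough.  ",
)
print(prompt)
```

Because the whole record is expressed as text, its length counts against the model's context window, which is one reason the context sizes in the table matter for this architecture.
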

| Clinical Scenario | Recommended Approach | Evidence Basis |
|---|---|---|
| High-complexity syndrome prediction (sepsis, hypoxia, criticality, multi-organ dysfunction) | Consider hybrid models with mandatory calibration and fairness validation. Expected ΔAUC: 0.09–0.13. | Thirteen studies; median ΔAUC +0.111; robust to exclusion of high-risk-of-bias studies. |
| Medium-complexity administrative outcomes (admission, disposition, resource allocation) | Hybrid models may provide a moderate benefit (ΔAUC 0.05–0.08). Conduct a local pilot study before deployment. | Eight studies; median ΔAUC +0.065; heterogeneous validation quality. |
| Low-complexity physiological thresholds (tachycardia, hypotension, isolated vital sign abnormalities) | Tabular models are likely sufficient. Hybrid models add minimal value (ΔAUC < 0.05) at increased computational cost. | Four studies; median ΔAUC +0.036; diminishing returns. |

| Domain | Academic Research Focus | Industry & Deployment Trends |
|---|---|---|
| Model Size | Massive multimodal transformers (billions of parameters). | Lightweight distillation (DistilBERT) for edge deployment. |
| Data Type | Static, clean datasets (MIMIC-III). | Real-time, noisy, missing data streams (HL7/FHIR). |
| Key Metric | AUC maximization. | Inference Latency (<100 ms) and Cost-per-prediction. |
| Validation | Retrospective splits. | Prospective “Silent Trials” and Drift Detection. |

| Domain | Mandatory Reporting Checklist |
|---|---|
| Demographics | [ ] Report training/validation/test set demographics: age (mean, SD, range), sex (% female). [ ] Report race/ethnicity and insurance status (private/public/uninsured %). |
| Discrimination | [ ] Report AUC-ROC with 95% confidence intervals. [ ] Provide sensitivity, specificity, PPV, NPV at clinically relevant thresholds (e.g., at 10% predicted risk). |
| Calibration | [ ] Mandatory: Brier score, calibration slope. [ ] Recommended: Calibration plots (observed vs. predicted across deciles), Hosmer-Lemeshow test. |
| Fairness Auditing | [ ] Stratify AUC, sensitivity, specificity by: race/ethnicity, sex, age group, insurance status. [ ] Test for statistically significant subgroup differences. |
| Baseline Comparison | [ ] Compare hybrid model against strong baselines: optimized tabular-only and text-only models. [ ] Report ΔAUC for each modality contribution. |
| Validation Strategy | [ ] Retrospective studies: mandatory temporal validation (train on Year 1, test on Year 2). [ ] Prospective studies preferred. External validation strongly encouraged. |
| Architecture | [ ] Explicitly define fusion strategy (Early/Late/Unified). [ ] Report NLP/tabular model type, embedding dimensions, fusion layer, total parameters. |
| Computational | [ ] Report median inference latency (milliseconds per prediction), hardware specifications. [ ] Report whether latency is acceptable for clinical workflow. |
| Code & Data | [ ] Publicly share model code, de-identified/synthetic data, and trained model weights. |
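
The temporal validation item in the checklist above amounts to splitting encounters by arrival date rather than at random, so that the model is always evaluated on patients seen after the training period. A minimal sketch is given below. The record structure, the `arrival` field name, and the cutoff date are hypothetical and serve only to illustrate the split.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Train on encounters before `cutoff`; test on encounters on or after it."""
    train = [r for r in records if r["arrival"] < cutoff]
    test = [r for r in records if r["arrival"] >= cutoff]
    return train, test


# Hypothetical encounters spanning two calendar years.
encounters = [
    {"id": 1, "arrival": date(2023, 3, 14)},
    {"id": 2, "arrival": date(2023, 11, 2)},
    {"id": 3, "arrival": date(2024, 1, 1)},
    {"id": 4, "arrival": date(2024, 6, 30)},
]

train, test = temporal_split(encounters, cutoff=date(2024, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])
```

Unlike a random split, this design exposes the model to changes in case mix, documentation practice, and coding that occur over time, which is closer to the conditions of prospective use.
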
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ullah, J.; Ramasamy, R.K.; Rajendran, V. Hybrid Machine Learning Architectures for Emergency Triage: A Systematic Review of Predictive Performance and the Complexity Gradient. BioMedInformatics 2026, 6, 21. https://doi.org/10.3390/biomedinformatics6020021

