Temporal Knowledge Extraction Through BayeStack with Multi-Level Explainability for Optimal Sepsis Classification
Abstract
1. Introduction
- Bayesian Optimization-Driven Ensemble Construction: Unlike conventional methods that rely on grid search, random search, or manual hyperparameter tuning, the proposed approach efficiently explores high-dimensional hyperparameter spaces by maximizing AUROC.
- Multi-window Aggregation-based Temporal Feature Extraction: Temporal knowledge extraction based on multi-window aggregation captures vital physiological responses and clinical deterioration patterns.
- Multi-Level Explainability Pipeline: To address the black-box problem in machine learning, an integrated three-level interpretability framework is incorporated: feature importance computation at the global level, partial dependence profile (PDP) analysis at the population level, and breakdown analysis with contribution heatmaps at the individual patient level.
- Complementary Model Behavior Quantification: To justify ensemble construction beyond empirical performance gains, a complementary model behavior quantification analysis was performed, showing that Random Forest (RF) and XGBoost (XGB) exhibit complementary decision-making strategies.
- Clinical Decision Support Integration: Provides actionable clinical insights, including priority-based critical thresholds, feature interaction interpretation, and multi-timescale monitoring priorities.
2. Related Works
Gaps in the Existing Literature and Significance of the Proposed Work
- Lack of Optimization: Most existing works [10,13,19] rely on time-consuming manual hyperparameter tuning or grid search. For high-dimensional medical datasets, these methods are computationally expensive. Therefore, this work uses a Bayesian optimization-based approach that explores the hyperparameter space efficiently and maximizes AUROC, a critical metric for balancing sensitivity and specificity.
- Insufficient Temporal Modeling: Studies by Liu et al. [11] and Zilker et al. [14] include temporal features but with shorter time windows and static aggregations. The proposed approach implements multi-window statistical aggregation across 1 h, 2 h, 4 h, 8 h, 24 h, and 48 h intervals for capturing temporal clinical patterns, thereby supporting rapid diagnostic confirmation.
- Fragmented Interpretability Approaches: Existing works by He and Qiu [12] and Hu et al. [13] incorporate SHAP- and LIME-based explainability for individual predictions but lack a population-level analysis. On the other hand, feature importance studies [10,17] provide global insights without patient-specific explanations. Therefore, BayeStack integrates a comprehensive three-level framework ensuring explainability at all decision-making levels.
- Unjustified Ensemble Construction: Traditional ensemble methods [19] show empirical performance gains but do not explain how the combination of specific models improves predictions. This work systematically quantifies complementary behavior through partial dependence profile analysis. It reveals Random Forest’s distributed feature utilization vs. XGBoost’s concentrated biomarker focus, thereby providing theoretical justification for the ensemble strategy.
3. Methodology
3.1. Dataset Selection
3.2. Data Processing and Data Balancing
3.2.1. Temporal Bounded Bidirectional Imputation
3.2.2. Data Normalization
3.2.3. One-Hot Encoding
3.2.4. SMOTE-ENN Algorithm for Data Balancing
- The dataset was split into training (80%, N = 32,269 patients) and testing (20%, N = 8067 patients) sets using stratified random sampling based on sepsis status. To avoid patient-level data leakage, all hourly measurements from each patient were allocated exclusively to the same dataset partition.
- After data splitting, the SMOTE-ENN algorithm was applied only to the training set. Synthetic minority samples were created using the SMOTE procedure with nearest neighbors, followed by Edited Nearest Neighbors (ENN) cleaning to remove borderline instances from both classes.
- The test set was preserved without any modifications and contained only original patient data with no synthetic samples, thereby ensuring evaluations under real-world deployment conditions.
- Z-score normalization parameters ($\mu$, $\sigma$) were computed from the training set and applied to both sets.
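
The leakage safeguards listed above (patient-level stratified splitting, then normalization fitted on the training partition only) can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the `patient_labels` mapping, the toy heart-rate records, and the function names are all hypothetical, and the SMOTE-ENN resampling step is omitted.

```python
import random
import statistics

def patient_level_split(patient_labels, test_frac=0.2, seed=0):
    """Split patient IDs into train/test sets, stratified by sepsis status.

    patient_labels maps patient_id -> 0/1 (hypothetical input format).
    Splitting IDs rather than rows keeps every hourly record of a patient
    in exactly one partition.
    """
    rng = random.Random(seed)
    train_ids, test_ids = set(), set()
    for label in (0, 1):
        ids = sorted(pid for pid, y in patient_labels.items() if y == label)
        rng.shuffle(ids)
        n_test = round(len(ids) * test_frac)
        test_ids.update(ids[:n_test])
        train_ids.update(ids[n_test:])
    return train_ids, test_ids

def fit_zscore(values):
    """Return (mu, sigma) estimated from the given values only."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        sigma = 1.0  # guard against constant features
    return mu, sigma

def apply_zscore(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

# Toy cohort: 100 patients, 10% septic, three hourly heart-rate rows each.
patient_labels = {f"p{i}": int(i % 10 == 0) for i in range(100)}
records = [(pid, 60.0 + 5 * h + int(pid[1:]))
           for pid in patient_labels for h in range(3)]

train_ids, test_ids = patient_level_split(patient_labels)
train_hr = [hr for pid, hr in records if pid in train_ids]
test_hr = [hr for pid, hr in records if pid in test_ids]

# Parameters come from the training partition and are reused on the test set.
mu, sigma = fit_zscore(train_hr)
train_z = apply_zscore(train_hr, mu, sigma)
test_z = apply_zscore(test_hr, mu, sigma)
```

Any resampling of the minority class would then operate on the training rows only, leaving `test_z` composed of original measurements.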
3.3. Feature Engineering
3.3.1. Multi-Window Temporal Feature Aggregation
- Multi-scale temporal patterns: Statistical aggregations computed across different time windows capture disease progression trajectories that distinguish acute physiological responses (1–4 h) from sustained clinical deterioration (24–48 h) patterns.
- Temporal stability indicators: Cross-window comparisons of aggregated features differentiate transient fluctuations from persistent abnormalities (e.g., a single lactate spike vs. sustained lactate elevation).
- Time-dependent feature importance: Feature importance varies by temporal scales, with vital signs (HR, O2Sat, and Resp) contributing predominantly within short windows (1–4 h), indicating acute physiological stress, whereas laboratory biomarkers (WBC, Fibrinogen, and Lactate) become more influential over longer windows (24–48 h), indicating sustained organ dysfunction.
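
The multi-window aggregation described above can be illustrated with a short sketch. The window lengths (1, 2, 4, 8, 24, and 48 h) come from the text; the choice of statistics (mean, min, max, standard deviation), the feature naming scheme, and the toy lactate series are assumptions made for illustration only.

```python
import statistics

WINDOWS_H = (1, 2, 4, 8, 24, 48)  # window lengths in hours, as stated in the text

def window_features(series, t, windows=WINDOWS_H):
    """Aggregate the trailing w hourly values that end at index t (inclusive).

    Missing readings are represented as None and skipped. The set of
    statistics is an assumption, not the authors' documented feature list.
    """
    feats = {}
    for w in windows:
        chunk = [v for v in series[max(0, t - w + 1): t + 1] if v is not None]
        if not chunk:
            continue
        feats[f"mean_{w}h"] = statistics.fmean(chunk)
        feats[f"min_{w}h"] = min(chunk)
        feats[f"max_{w}h"] = max(chunk)
        feats[f"std_{w}h"] = statistics.pstdev(chunk)
    return feats

# Toy lactate trajectory: stable for 44 h, then rising over the last 4 h.
lactate = [1.0] * 44 + [1.2, 2.5, 3.1, 3.4]
features = window_features(lactate, t=len(lactate) - 1)

# A short-window mean well above the long-window mean flags an acute change.
acute_rise = features["mean_4h"] - features["mean_48h"]
```

Comparing the same statistic across windows, as in `acute_rise`, is one simple way to separate a recent spike from a sustained abnormality.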
3.3.2. Feature Importance Computation and Feature Selection
3.4. Model Development
Modeling of BayeStack Algorithm
3.5. Model Interpretability Framework
3.5.1. Level 1: Global Feature Importance Analysis
3.5.2. Level 2: Population-Level Analysis
- Clinical biomarkers, where both the base models show similar PDP patterns;
- Features or parameters that show different sensitivity patterns aiding in complementary decision-making;
- The feature utilization patterns of the base models.
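
The partial dependence profiles compared above can be computed in a model-agnostic way: force one feature to each value on a grid, average the model's predictions over the cohort, and take the spread of the resulting curve. The sketch below assumes a hypothetical `toy_model` and a four-row toy cohort; it is not the fitted RF or XGB model from this work.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows when `feature` is set to each grid value."""
    profile = []
    for g in grid:
        preds = [predict({**row, feature: g}) for row in rows]
        profile.append(sum(preds) / len(preds))
    return profile

def toy_model(row):
    """Hypothetical risk score standing in for a trained classifier."""
    risk = 0.30
    if row["ICULOS"] > 48:
        risk += 0.25
    if row["Resp"] > 22:
        risk += 0.10
    return risk

rows = [
    {"ICULOS": 10, "Resp": 18},
    {"ICULOS": 60, "Resp": 25},
    {"ICULOS": 30, "Resp": 24},
    {"ICULOS": 5, "Resp": 16},
]
grid = [0, 24, 48, 72]

pdp = partial_dependence(toy_model, rows, "ICULOS", grid)
# The PDP range (max minus min) summarizes how strongly the feature moves
# the average prediction.
pdp_range = max(pdp) - min(pdp)
```

Running the same routine for two models on the same grid yields the paired profiles from which range and agreement statistics can be derived.
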
| Algorithm 1 BayeStack Algorithm for Optimized Sepsis Classification |
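|---|
| Require: Search space $\mathcal{X}$; objective function $f(x)$; observed hyperparameters $x_{1:n}$; corresponding objective function values $f(x_{1:n})$ |
| Ensure: Optimized sepsis classification model |
| 1: Step 1: Gaussian Process Surrogate Model |
| 2: Build surrogate model with mean $\mu_0$ and covariance kernel $\Sigma_0$ |
| 3: Find joint distribution: $f(x_{1:n}) \sim \mathcal{N}\left(\mu_0(x_{1:n}), \Sigma_0(x_{1:n}, x_{1:n})\right)$ |
| 5: Step 2: Acquisition Function |
| 6: Improvement function: $I(x) = \max\left(f(x) - f^{*}, 0\right)$ |
| 7: Expected Improvement: $\mathrm{EI}(x) = \mathbb{E}\left[I(x)\right]$ |
| 10: Base classifiers: $\{M_1, M_2\}$, where $M_1$: RF, $M_2$: XGB |
| 11: Step 4: Stacked Out-of-Fold Predictions |
| 12: for $k = 1$ to $K$ do |
| 13: Train $M_1$ and $M_2$ on training folds |
| 14: Generate the predictions on validation fold $k$ |
| 15: $\hat{p}_1^{(k)} = M_1\left(X_{\mathrm{val}}^{(k)}\right)$ |
| 16: $\hat{p}_2^{(k)} = M_2\left(X_{\mathrm{val}}^{(k)}\right)$ |
| 17: end for |
| 18: Stack predictions: $Z = [\hat{p}_1, \hat{p}_2]$ |
| 19: Step 5: Meta-Model Training and Prediction |
| 20: Meta-features: $Z$ |
| 21: Train Logistic Regression meta-model on $Z$ to learn optimal weights |
| 23: return Optimized ensemble model with learned blended weights |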
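
The Expected Improvement acquisition named in Step 2 of Algorithm 1 has a well-known closed form when the surrogate's posterior at a candidate point is Gaussian. The sketch below uses that standard formula for a maximization objective such as AUROC; the candidate names, posterior means, and standard deviations are hypothetical values chosen for illustration, not results from this work.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2).

    `best` is the best objective value observed so far.
    """
    if sigma <= 0.0:
        return max(mu - best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf

# Hypothetical posterior (mean, std) of AUROC for three candidate configurations.
candidates = {"A": (0.960, 0.001), "B": (0.950, 0.030), "C": (0.970, 0.000)}
best_auroc = 0.965

scores = {name: expected_improvement(m, s, best_auroc)
          for name, (m, s) in candidates.items()}
next_cfg = max(scores, key=scores.get)
```

In this toy setting the uncertain candidate `B` scores highest even though its posterior mean is the lowest, which is how EI trades off exploration against exploitation.
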
1. Feature-wise Agreement Score: The Pearson correlation between the Random Forest and XGBoost partial dependence profile predictions for each feature:

$$A_j = \frac{\operatorname{cov}\left(PD_j^{RF}, PD_j^{XGB}\right)}{\sigma_{PD_j^{RF}}\,\sigma_{PD_j^{XGB}}}$$

where $PD_j^{RF}$ and $PD_j^{XGB}$ represent the partial dependence predictions for feature $j$ from Random Forest and XGBoost, respectively, and $\sigma$ denotes standard deviation. An agreement score close to 1 indicates that both models interpret the feature consistently, whereas lower scores indicate model-specific feature utilization.

2. Spearman Rank Correlation (ρ): To evaluate whether both models assign comparable rankings to features based on their importance within the specified range, a non-parametric correlation method is employed. It is calculated as follows:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\left(n^{2}-1\right)}$$

where $n$ is the number of features, and $d_i$ is the difference between the ranks of feature $i$ based on RF and XGB range values.
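
Both statistics defined above are straightforward to compute. The sketch below is a plain-Python illustration: the two partial dependence profiles are hypothetical, while the five range values are taken from the first five rows of the PDP summary table (ICULOS, O2Sat, Resp, HR, Magnesium), so the resulting coefficient describes only that subset and is not the full-feature value reported in this work. The rank formula assumes no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1 (largest) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical PDP curves for one feature, evaluated on a shared grid.
pd_rf = [0.30, 0.35, 0.45, 0.55]
pd_xgb = [0.44, 0.50, 0.58, 0.66]
agreement = pearson(pd_rf, pd_xgb)

# PDP ranges for five features (subset of the summary table).
rf_range = [0.26, 0.21, 0.23, 0.11, 0.07]
xgb_range = [0.22, 0.12, 0.04, 0.03, 0.06]
rho_subset = spearman(rf_range, xgb_range)
```
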
3.5.3. Level 3: Individual Patient-Level Interpretability
4. Results and Discussion
4.1. Baseline Characteristic Results
4.1.1. Time-Based Aggregation Results
- 1–4 h: The model captures the acute responses and indicates early metabolic shifts during the initial time window.
- 8–24 h: Over the 8 to 24 h window, gradual feature trends and patterns, including feature stabilization, were monitored. Within this window, trends such as blood pressure and CBC counts may reflect systemic responses.
- 48 h: Stabilization or progression to severe conditions was noted. Indicators of renal function, such as creatinine, became critical. Persistent abnormalities in vital signs and laboratory features indicate sepsis severity.
4.1.2. Quantile-Quantile Plots (Q-Q Plots)
4.1.3. Feature Importance Radar Plot
4.2. Framework Evaluation and Model Interpretability Results
Component Performance Analysis
4.3. Ablation Studies Analysis
4.3.1. Temporal Aggregation Component Analysis
4.3.2. Data Balancing Component Analysis
4.4. Comprehensive Model Interpretability Analysis
4.4.1. Population-Level Model Behavior: Comprehensive Feature Effect Analysis
- Observable at Prediction Time: ICULOS captures the cumulative duration of ICU stay up to time t, representing information available to clinicians during real-time decision-making.
- Absence of Target Leakage: ICULOS does not directly represent the sepsis outcome and, therefore, avoids target leakage; it reflects accumulated risk exposure and overall disease complexity.
- Empirical Validation via Ablation: As shown in Table 10, ICULOS contributes comparably to individual laboratory markers rather than acting as a dominant modeling artifact.
4.4.2. Individual Patient-Level Interpretability: Case Study Analysis
4.5. Computational Complexity and Scalability Analysis
4.5.1. Computational Complexity Analysis
- Bayesian Optimization (100 iterations): Gaussian Process surrogate inversion dominates.
- Random Forest Training (438 trees): Tree induction dominates.
- XGBoost Training (385 rounds): Gradient computation and tree splits.
- Stacking Ensemble: Meta-feature generation and meta-model training.
- Training time for full dataset: 3.92 min (235 s).
- Inference time per sample: 5.83 ms.
- Peak memory requirement: 102.93 MB (0.10 GB).
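
The stacking component listed above (meta-feature generation followed by meta-model training) can be sketched with out-of-fold predictions. This is a simplified standard-library illustration: `MidpointModel` is a hypothetical stand-in for the RF and XGB base learners, the data are synthetic, and the logistic regression meta-model is fitted with plain gradient descent rather than any particular library routine.

```python
import math
import random

def kfold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

class MidpointModel:
    """Hypothetical base learner: scores one feature against a class midpoint."""
    def __init__(self, feature):
        self.feature = feature
        self.mid = 0.0

    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        self.mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict_proba(self, x):
        return 1.0 / (1.0 + math.exp(-(x[self.feature] - self.mid)))

def out_of_fold(X, y, make_models, k=5):
    """Each sample is scored only by models that never saw it during training."""
    Z = [None] * len(X)
    for fold in kfold_indices(len(X), k):
        held = set(fold)
        X_tr = [x for i, x in enumerate(X) if i not in held]
        y_tr = [t for i, t in enumerate(y) if i not in held]
        models = [m.fit(X_tr, y_tr) for m in make_models()]
        for i in fold:
            Z[i] = [m.predict_proba(X[i]) for m in models]
    return Z

def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Logistic regression meta-model trained by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
            for j, zj in enumerate(z):
                gw[j] += (p - t) * zj
            gb += p - t
        w = [wi - lr * g / len(Z) for wi, g in zip(w, gw)]
        b -= lr * gb / len(Z)
    return w, b

# Synthetic two-feature data with alternating class labels.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [(2 * t + rng.gauss(0, 0.5), 2 * t + rng.gauss(0, 0.5)) for t in y]

Z = out_of_fold(X, y, lambda: [MidpointModel(0), MidpointModel(1)])
w, b = fit_logistic(Z, y)  # learned blending weights and intercept
```

Building `Z` from held-out folds keeps the meta-model from fitting on predictions the base learners made for their own training samples.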
4.5.2. Scalability Analysis
4.5.3. Performance Contextualization Against PhysioNet 2019 Challenge Baselines
4.6. Methodological Design Trade-Offs and Research Positioning
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AUROC | Area Under the Receiver Operating Characteristic Curve |
| RF | Random Forest |
| XGB | XGBoost |
| PDP | Partial Dependence Profile |
| SMOTE-ENN | Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors |
| ICU | Intensive Care Unit |
| SOFA | Sequential Organ Failure Assessment |
| ML | Machine Learning |
| EI | Expected Improvement |
| SHAP | SHapley Additive exPlanations |
| LIME | Local Interpretable Model-agnostic Explanations |
References
- Singer, M.; Deutschman, C.S.; Seymour, C.W.; Shankar-Hari, M.; Annane, D.; Bauer, M.; Bellomo, R.; Bernard, G.R.; Chiche, J.D.; Coopersmith, C.M.; et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 2016, 315, 801–810.
- Ho, B.-S.; Lee, Y.-H.W.; Lin, Y.-B. Impact of hourly serial SOFA score on signaling emerging sepsis. Inform. Med. Unlocked 2022, 31, 100999.
- Kamath, S.; Altaq, H.H.; Abdo, T. Management of Sepsis and Septic Shock: What Have We Learned in the Last Two Decades? Microorganisms 2023, 11, 2231.
- Calvert, J.S.; Price, D.A.; Chettipally, U.K.; Barton, C.W.; Feldman, M.D.; Hoffman, J.L.; Jay, M.; Das, R. A computational approach to early sepsis detection. Comput. Biol. Med. 2016, 74, 69–73.
- Selcuk, M.; Koc, O.; Kestel, A.S. The prediction power of machine learning on estimating the sepsis mortality in the intensive care unit. Inform. Med. Unlocked 2022, 28, 100861.
- Mohamed, A.K.S.; Mehta, A.A.; James, P. Predictors of mortality of severe sepsis among adult patients in the medical Intensive Care Unit. Lung India 2017, 34, 330–335.
- Rangan, E.S.; Pathinarupothi, R.K.; Anand, K.J.S.; Snyder, M.P. Performance effectiveness of vital parameter combinations for early warning of sepsis exhaustive study using machine learning. JAMIA Open 2022, 5, ooac080.
- Scherpf, M.; Gräßer, F.; Malberg, H.; Zaunseder, S. Predicting sepsis with a recurrent neural network using the MIMIC III database. Comput. Biol. Med. 2019, 113, 103395.
- Kam, H.J.; Kim, H.Y. Learning representations for the early detection of sepsis with deep neural networks. Comput. Biol. Med. 2017, 89, 248–255.
- Chen, Q.; Li, R.; Lin, C.; Lai, C.; Chen, D.; Qu, H.; Huang, Y.; Lu, W.; Tang, Y.; Li, L. Transferability and interpretability of the sepsis prediction models in the intensive care unit. BMC Med. Inform. Decis. Mak. 2022, 22, 343.
- Liu, Z.; Shu, W.; Li, T.; Zhang, X.; Chong, W. Interpretable machine learning for predicting sepsis risk in emergency triage patients. Sci. Rep. 2025, 15, 887.
- He, B.; Qiu, Z. Development and validation of an interpretable machine learning for mortality prediction in patients with sepsis. Front. Artif. Intell. 2024, 7, 1348907.
- Hu, C.; Li, L.; Huang, W.; Wu, T.; Xu, Q.; Liu, J.; Hu, B. Interpretable Machine Learning for Early Prediction of Prognosis in Sepsis: A Discovery and Validation Study. Infect. Dis. Ther. 2022, 11, 1117–1132.
- Zilker, S.; Weinzierl, S.; Kraus, M.; Zschech, P.; Matzner, M. A machine learning framework for interpretable predictions in patient pathways: The case of predicting ICU admission for patients with symptoms of sepsis. Health Care Manag. Sci. 2024, 27, 136–167.
- Stylianides, C.; Nicolaou, A.; Sulaiman, W.A.; Alexandropoulou, C.A.; Panagiotopoulos, I.; Karathanasopoulou, K.; Dimitrakopoulos, G.; Kleanthous, S.; Politi, E.; Ntalaperas, D.; et al. AI Advances in ICU with an Emphasis on Sepsis Prediction: An Overview. Mach. Learn. Knowl. Extr. 2025, 7, 6.
- Zhang, G.; Shao, F.; Yuan, W.; Wu, J.; Qi, X.; Gao, J.; Shao, R.; Tang, Z.; Wang, T. Predicting sepsis in-hospital mortality with machine learning: A multi-center study using clinical and inflammatory biomarkers. Eur. J. Med. Res. 2024, 29, 156.
- Islam, K.R.; Prithula, J.; Kumar, J.; Tan, T.L.; Reaz, M.B.I.; Sumon, M.S.I.; Chowdhury, M.E.H. Machine Learning-Based Early Prediction of Sepsis Using Electronic Health Records: A Systematic Review. J. Clin. Med. 2023, 12, 5658.
- Bignami, E.G.; Berdini, M.; Panizzi, M.; Domenichetti, T.; Bezzi, F.; Allai, S.; Damiano, T.; Bellini, V. Artificial Intelligence in Sepsis Management: An Overview for Clinicians. J. Clin. Med. 2025, 14, 286.
- Prithula, J.; Islam, K.R.; Kumar, J.; Tan, T.L.; Reaz, M.B.I.; Rahman, T.; Zughaier, S.M.; Khan, M.S.; Murugappan, M.; Chowdhury, M.E. A novel classical machine learning framework for early sepsis prediction using electronic health record data from ICU patients. Comput. Biol. Med. 2025, 184, 109284.
- Goldberger, A.; Amaral, L.; Glass, L.; Hausdorff, J.; Ivanov, P.C.; Mark, R.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220, RRID:SCR_007345.
- Reyna, M.A.; Josef, C.S.; Jeter, R.; Shashikumar, S.P.; Westover, M.B.; Nemati, S.; Clifford, G.D.; Sharma, A. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. Crit. Care Med. 2019, 48, 210–217.
- Reyna, M.; Josef, C.; Jeter, R.; Shashikumar, S.; Moody, B.; Westover, M.B.; Sharma, A.; Nemati, S.; Clifford, G.D. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). PhysioNet 2019.
- Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV (version 2.2). PhysioNet 2023.
- Pollard, T.; Johnson, A.; Raffa, J.; Celi, L.A.; Badawi, O.; Mark, R. eICU Collaborative Research Database (version 2.0). PhysioNet 2019, RRID:SCR_007345.
- Strickler, E.A.; Thomas, J.; Thomas, J.P.; Benjamin, B.; Shamsuddin, R. Exploring a global interpretation mechanism for deep learning networks when predicting sepsis. Sci. Rep. 2023, 13, 3067.
- Murugesan, I.; Murugesan, K.; Balasubramanian, L.; Arumugam, M. Interpretation of artificial intelligence algorithms in the prediction of sepsis. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
- Narayanaswamy, L.; Garg, D.; Narra, B.; Narayanswamy, R. Machine learning algorithmic and system level considerations for early prediction of sepsis. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
- Alfaras, M.; Varandas, R.; Gamboa, H. Ring-topology echo state networks for ICU sepsis classification. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
- Deogire, A. A Low Dimensional Algorithm for Detection of Sepsis From Electronic Medical Record Data. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
- Moahmed, T.A.; El Gayar, N.; Atiya, A.F. Forward and backward forecasting ensembles for the estimation of time series missing data. In Artificial Neural Networks in Pattern Recognition, Proceedings of the 6th IAPR TC 3 International Workshop, ANNPR 2014, Montreal, QC, Canada, 6–8 October 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 93–105.
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524.
- Udilă, A.; Ionescu, A.; Katsifodimos, A. Encoding Methods for Categorical Data: A Comparative Analysis for Linear Models, Decision Trees, and Support Vector Machines; Technical Report; Delft University of Technology: Delft, The Netherlands, 2023; Available online: https://repository.tudelft.nl/file/File_9e5e5225-4c53-4362-862e-1b7072e82b6a (accessed on 30 July 2023).
- Lamari, M.; Azizi, N.; Hammami, N.E.; Boukhamla, A.; Cheriguene, S.; Dendani, N.; Benzebouchi, N.E. SMOTE–ENN-based data sampling and improved dynamic ensemble selection for imbalanced medical data classification. In Advances on Smart and Soft Computing: Proceedings of ICACIn 2020; Springer: Singapore, 2021; pp. 84–93.
- Kaushik, S.; Choudhury, A.; Sheron, P.K.; Dasgupta, N.; Natarajan, S.; Pickett, L.A.; Dutt, V. AI in healthcare: Time-series forecasting using statistical, neural, and ensemble architectures. Front. Big Data 2020, 3, 4.
- Regier, P.; Duggan, M.; Myers-Pigg, A.; Ward, N. Effects of random forest modeling decisions on biogeochemical time series predictions. Limnol. Oceanogr. Methods 2023, 21, 40–52.
- Tyralis, H.; Papacharalampous, G. Variable selection in time series forecasting using random forests. Algorithms 2017, 10, 114.
- Liang, J.; Pan, W.S.Y.; Yang, Z.-H. Characterization-based Q–Q plots for testing multinormality. Stat. Probab. Lett. 2004, 70, 183–190.
- Anjana, G.; Nisha, K.L.; Arun Sankar, M.S. Improving sepsis classification performance with artificial intelligence algorithms: A comprehensive overview of healthcare applications. J. Crit. Care 2024, 83, 154815.
- Gholamzadeh, M.; Abtahi, H.; Safdari, R. Comparison of different machine learning algorithms to classify patients suspected of having sepsis infection in the intensive care unit. Inform. Med. Unlocked 2023, 38, 101236.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Lin, Y.; Gao, J.; Chen, L.; Hong, Y.; Li, M.; Chen, P.; Shang, X. An interpretable XGBoost model for risk prediction of progression from sepsis-associated acute kidney injury to chronic kidney disease. Inform. Med. Unlocked 2025, 58, 101685.
- Wang, X.; Jin, Y.; Schmitt, S.; Olhofer, M. Recent advances in Bayesian optimization. ACM Comput. Surv. 2023, 55, 1–36.
- Awal, M.A.; Masud, M.; Hossain, M.S.; Bulbul, A.A.M.; Mahmud, S.H.; Bairagi, A.K. A novel bayesian optimization-based machine learning framework for COVID-19 detection from inpatient facility data. IEEE Access 2021, 9, 10263–10281.
- Zheng, J.; Zhang, Z.; Wang, J.; Zhao, R.; Liu, S.; Yang, G.; Liu, Z.; Deng, Z. Metabolic syndrome prediction model using Bayesian optimization and XGBoost based on traditional Chinese medicine features. Heliyon 2023, 9, e22727.
- Lacoste, A.; Larochelle, H.; Laviolette, F.; Marchand, M. Sequential model-based ensemble optimization. arXiv 2014, arXiv:1402.0796.
- Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
- Sabbatella, A.; Ponti, A.; Candelieri, A.; Archetti, F. Bayesian Optimization Using Simulation-Based Multiple Information Sources over Combinatorial Structures. Mach. Learn. Knowl. Extr. 2024, 6, 2232–2247.
- Lim, Y.-F.; Ng, C.K.; Vaitesswar, U.S.; Hippalgaonkar, K. Extrapolative Bayesian optimization with Gaussian process and neural network ensemble surrogate models. Adv. Intell. Syst. 2021, 3, 2100101.
- Nour, M.; Senturk, U.; Polat, K. Diagnosis and classification of Parkinson’s disease using ensemble learning and 1D-PDCovNN. Comput. Biol. Med. 2023, 161, 107031.
- Kalagotla, S.K.; Gangashetty, S.V.; Giridhar, K. A novel stacking technique for prediction of diabetes. Comput. Biol. Med. 2021, 135, 104554.
- Prakash, J.A.; Asswin, C.R.; Ravi, V.; Sowmya, V.; Soman, K.P. Pediatric pneumonia diagnosis using stacked ensemble learning on multi-model deep CNN architectures. Multimed. Tools Appl. 2023, 82, 21311–21351.
- Alqahtani, A.F.; Ilyas, M. An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying. Mach. Learn. Knowl. Extr. 2024, 6, 156–170.
- Wu, T.; Zhang, W.; Jiao, X.; Guo, W.; Hamoud, Y.A. Evaluation of stacking and blending ensemble learning methods for estimating daily reference evapotranspiration. Comput. Electron. Agric. 2021, 184, 106039.
- Mienye, I.D.; Obaido, G.; Jere, N.; Mienye, E.; Aruleba, K.; Emmanuel, I.D.; Ogbuokiri, B. A survey of explainable artificial intelligence in healthcare: Concepts, applications, and challenges. Inform. Med. Unlocked 2024, 51, 101587.
- Srinivasu, P.N.; Sandhya, N.; Jhaveri, R.H.; Raut, R. From blackbox to explainable AI in healthcare: Existing tools and case studies. Mob. Inf. Syst. 2022, 2022, 8167821.
- Reghu, L.; Ashok, G.; Menon, R.R.K. Explainable AI for Health Care based Retrieval System. Grenze Int. J. Eng. Technol. 2024, 10, 1915.
- Talin, I.A.; Abid, M.H.; Khan, M.A.M.; Kee, S.H.; Nahid, A.A. Finding the influential clinical traits that impact on the diagnosis of heart disease using statistical and machine-learning techniques. Sci. Rep. 2022, 12, 20199.
- Raţiu, A.; Pop, E.L. Machine Learning in Clinical Decision Making: Applications, Data Limitations and Multidisciplinary Perspectives. Appl. Sci. 2026, 16, 785.
- Mahmud, F.; Quamruzzaman, M.; Sanka, A.I.; Cheung, R.C.; Chowdhury, M.H. Interpretable machine learning-based real-time sepsis diagnosis. Sci. Rep. 2026, 16, 36945.
- Ristori, M.V.; Ruffini, F.; Spoto, S.; Cammarata, R.; La Vaccara, V.; Bani, L.; Caputo, D.; Soda, P.; Guarrasi, V.; Angeletti, S. Machine Learning Models for Sepsis: From Early Detection to Short-and Long-Term Prognosis. Int. J. Mol. Sci. 2026, 27, 2721.
- Niazi, S.K. A Critical Review of the FDA’s Draft Guidance on Artificial Intelligence in Drug and Biological Product Regulation. J. Chem. 2026, 2026, 5202999.
- Do, D.K.; Rockenschaub, P.; Boie, S.D.; Kumpf, O.; Volk, H.D.; Balzer, F.; Von Dincklage, F.; Lichtner, G. The Impact of Evaluation Strategy on Sepsis Prediction Model Performance Metrics in Intensive Care Data: Retrospective Cohort Study. J. Med. Internet Res. 2026, 28, e72083.
- Oliveira, T.Q.; Carvalho, L.A.; Sousa, F.R.; Filho, J.B.; Oliveira, K.F.; Tavares, D.A. Responsible AI for Sepsis Prediction: Bridging the Gap Between Machine Learning Performance and Clinical Trust. J. Clin. Med. 2026, 15, 2251.








| Aspect | Existing Works | BayeStack (Proposed) |
|---|---|---|
| Hyperparameter Optimization | Manual tuning or grid search [10,13,19] | Bayesian optimization with Gaussian process surrogate model |
| Temporal Modeling | Single time-point or limited windows [11,14] | Multi-window aggregation (1–48 h) capturing disease progression |
| Interpretability Scope | Single-level (global or local) [12,13] | Three-level hierarchy (global, population, individual) |
| Ensemble Justification | Empirical performance gains [19] | Quantified complementary behavior (Spearman $\rho$, agreement analysis) |
| Clinical Actionability | Risk scores only [15,16,17] | Thresholds, interactions, monitoring priorities |
| Sensitivity–Specificity Balance | Emphasis on a single metric [13,18] | Balanced optimization (both 0.97) via AUROC maximization |
| Feature Labels | Detailed Features |
|---|---|
| Vital Features | HR, O2Sat, Temperature, SBP, MAP, DBP, RR, EtCO2 |
| Laboratory Values | Base Excess, HCO3, FiO2, PaCO2, SaO2, AST, BUN, Alkalinephos, PTT, WBC, Calcium, Chloride, pH, Hct, Creatinine, Direct bilirubin, Glucose, Phosphate, Magnesium, Total bilirubin, Lactate, Hgb, Troponin I, Potassium, Fibrinogen, Platelets |
| Demographics | Age, Gender, Unit 1 (MICU), Unit 2 (SICU), Hospital admission time, ICULOS |
| Outcome | Sepsis Label (1-Septic, 0-Non-septic) |
| Classifier Combination | Mean Accuracy |
|---|---|
| Random Forest + XGBoost | 0.9961 |
| Random Forest + Logistic Regression | 0.9924 |
| Random Forest + KNN | 0.9940 |
| Random Forest + Naïve Bayes | 0.9498 |
| Logistic Regression + KNN | 0.9823 |
| Naïve Bayes + KNN | 0.9751 |
| XGBoost + KNN | 0.9924 |
| XGBoost + Logistic Regression | 0.9346 |
| XGBoost + Naïve Bayes | 0.8824 |
| Logistic Regression + Naïve Bayes | 0.6467 |
| Classifier | Parameter | Range | Optimal |
|---|---|---|---|
| Random Forest | Number of Trees | (100, 500) | 438 |
| Random Forest | Tree Depth | (1, 50) | 36 |
| Random Forest | Split Threshold | (2, 20) | 19 |
| Random Forest | Leaf Size | (1, 20) | 20 |
| XGBoost | Number of Trees | (100, 500) | 385 |
| XGBoost | Tree Depth | (1, 20) | 20 |
| XGBoost | Learning Rate | (0.001, 1) | 0.679 |
| XGBoost | Child Node Samples | (0.001, 20) | 14.4 |
| Metric | RF Value | RF 95% CI | XGBoost Value | XGBoost 95% CI | BayeStack Value | BayeStack 95% CI |
|---|---|---|---|---|---|---|
| Specificity | 0.96 | [0.954–0.966] | 0.96 | [0.954–0.966] | 0.97 | [0.964–0.976] *** |
| Sensitivity | 0.94 | [0.932–0.948] | 0.98 | [0.974–0.986] | 0.97 | [0.964–0.976] ** |
| Accuracy | 0.95 | [0.942–0.958] | 0.97 | [0.962–0.978] | 0.97 | [0.964–0.976] |
| Precision | 0.94 | [0.932–0.948] | 0.98 | [0.974–0.986] | 0.97 | [0.964–0.976] ** |
| Recall | 0.96 | [0.952–0.968] | 0.96 | [0.952–0.968] | 0.97 | [0.964–0.976] * |
| F1-Score | 0.95 | [0.942–0.958] | 0.97 | [0.962–0.978] | 0.97 | [0.964–0.976] |
| AUC-ROC | 0.95 | [0.942–0.958] | 0.97 | [0.962–0.978] | 0.99 | [0.984–0.996] *** |
| Temporal Window | AUROC | Sensitivity | Specificity | F1-Score |
|---|---|---|---|---|
| Single timepoint | 0.88 | 0.90 | 0.87 | 0.88 |
| 1 h window | 0.91 | 0.92 | 0.90 | 0.91 |
| 1–4 h | 0.94 | 0.94 | 0.93 | 0.93 |
| 1–24 h | 0.97 | 0.96 | 0.96 | 0.96 |
| 1–48 h (Full) | 0.99 | 0.97 | 0.97 | 0.97 |
| Balancing Strategy | AUROC | Sensitivity | Specificity | F1-Score |
|---|---|---|---|---|
| No balancing | 0.66 | 0.52 | 0.88 | 0.68 |
| SMOTE only | 0.93 | 0.92 | 0.94 | 0.90 |
| ENN only | 0.92 | 0.91 | 0.93 | 0.89 |
| SMOTE-ENN | 0.99 | 0.97 | 0.97 | 0.97 |
| Feature | RF Range | XGB Range | RF Min | RF Max | XGB Min | XGB Max | RF Mean | XGB Mean | Agreement |
|---|---|---|---|---|---|---|---|---|---|---|
| ICULOS | 0.26 | 0.22 | 0.30 | 0.55 | 0.44 | 0.66 | 0.50 | 0.61 | 0.85 |
| O2Sat | 0.21 | 0.12 | 0.36 | 0.57 | 0.56 | 0.69 | 0.45 | 0.57 | 0.59 |
| Resp | 0.23 | 0.04 | 0.37 | 0.60 | 0.57 | 0.61 | 0.42 | 0.58 | 0.15 |
| HR | 0.11 | 0.03 | 0.34 | 0.44 | 0.55 | 0.58 | 0.41 | 0.57 | 0.31 |
| Magnesium | 0.07 | 0.06 | 0.42 | 0.49 | 0.54 | 0.60 | 0.42 | 0.56 | 0.79 |
| TroponinI | 0.09 | 0.03 | 0.35 | 0.45 | 0.54 | 0.57 | 0.39 | 0.54 | 0.31 |
| FiO2 | 0.10 | 0.02 | 0.32 | 0.42 | 0.54 | 0.56 | 0.41 | 0.56 | 0.21 |
| WBC | 0.05 | 0.05 | 0.38 | 0.43 | 0.52 | 0.57 | 0.40 | 0.54 | 0.92 |
| Fibrinogen | 0.05 | 0.04 | 0.41 | 0.46 | 0.53 | 0.57 | 0.42 | 0.55 | 0.79 |
| BaseExcess | 0.06 | 0.03 | 0.39 | 0.45 | 0.55 | 0.58 | 0.40 | 0.56 | 0.46 |
| Temp | 0.05 | 0.04 | 0.39 | 0.44 | 0.55 | 0.58 | 0.40 | 0.56 | 0.70 |
| SBP | 0.08 | 0.01 | 0.42 | 0.49 | 0.57 | 0.58 | 0.44 | 0.57 | 0.18 |
| BUN | 0.06 | 0.03 | 0.35 | 0.41 | 0.54 | 0.58 | 0.37 | 0.57 | 0.54 |
| Bilirubin (D) | 0.06 | 0.03 | 0.40 | 0.46 | 0.55 | 0.57 | 0.41 | 0.55 | 0.41 |
| Bilirubin (T) | 0.06 | 0.03 | 0.41 | 0.47 | 0.55 | 0.58 | 0.42 | 0.56 | 0.46 |
| Unit1_0.0 | 0.07 | 0.01 | 0.47 | 0.54 | 0.58 | 0.59 | 0.51 | 0.59 | 0.13 |
| pH | 0.06 | 0.01 | 0.40 | 0.46 | 0.57 | 0.58 | 0.42 | 0.57 | 0.24 |
| AST | 0.05 | 0.03 | 0.39 | 0.44 | 0.54 | 0.57 | 0.40 | 0.56 | 0.55 |
| Lactate | 0.05 | 0.02 | 0.41 | 0.46 | 0.56 | 0.58 | 0.43 | 0.56 | 0.40 |
| PaCO2 | 0.05 | 0.02 | 0.42 | 0.47 | 0.56 | 0.58 | 0.44 | 0.57 | 0.37 |
| Unit2_1.0 | 0.06 | 0.00 | 0.47 | 0.54 | 0.58 | 0.58 | 0.51 | 0.58 | 0.02 |
| Unit1_1.0 | 0.04 | 0.00 | 0.50 | 0.54 | 0.58 | 0.59 | 0.52 | 0.58 | 0.11 |
| Unit2_0.0 | 0.04 | 0.00 | 0.50 | 0.54 | 0.58 | 0.58 | 0.52 | 0.58 | 0.01 |
| Feature | Critical Threshold | ΔP (RF) | ΔP (XGB) | Clinical Interpretation | Priority |
|---|---|---|---|---|---|
| ICULOS | >48–50 h | 0.26 | 0.22 | Cumulative hospital-acquired infection risk | High |
| WBC | <4 or >12–15 ×10³/µL | 0.05 | 0.05 | Leukopenia/leukocytosis (SIRS criteria) | High |
| Resp | >20–24 breaths/min | 0.23 | 0.04 | Tachypnea indicating respiratory distress | High |
| O2Sat | <92–94% | 0.21 | 0.12 | Hypoxemia/respiratory compromise | High |
| Temp | <36 °C or >38 °C | 0.05 | 0.04 | Hypothermia/fever (infection indicator) | Medium |
| Lactate | >2–4 mmol/L | 0.05 | 0.02 | Tissue hypoperfusion/metabolic stress | High |
| Fibrinogen | <200 or >400 mg/dL | 0.05 | 0.04 | Coagulopathy/acute phase response | High |
| Magnesium | <1.5 or >2.5 mg/dL | 0.07 | 0.06 | Electrolyte imbalance | Medium |
| BUN | >20–30 mg/dL | 0.06 | 0.03 | Renal dysfunction indicator | Medium |
| HR | >100–110 bpm | 0.11 | 0.03 | Tachycardia (systemic stress response) | Medium |
| pH | <7.35 or >7.45 | 0.06 | 0.01 | Metabolic acidosis/alkalosis | Medium |
| BaseExcess | <−2 or >2 mEq/L | 0.06 | 0.03 | Acid-base imbalance | Medium |
| Model Configuration | AUROC | Sensitivity | Specificity | Δ AUROC |
|---|---|---|---|---|
| Full Model (Baseline) | 0.99 | 0.97 | 0.97 | - |
| Without ICULOS | 0.96 | 0.94 | 0.96 | −0.03 |
| Without WBC | 0.97 | 0.95 | 0.96 | −0.02 |
| Without Fibrinogen | 0.97 | 0.96 | 0.97 | −0.02 |
| Without Lactate | 0.98 | 0.96 | 0.97 | −0.01 |
| Without O2Sat | 0.97 | 0.95 | 0.97 | −0.02 |
| Metric | Random Forest | XGBoost |
|---|---|---|
| Mean Range | 0.085 | 0.038 |
| Std Range | 0.061 | 0.047 |
| Max Range | 0.257 | 0.218 |
| Min Range | 0.042 | 0.001 |
| Mean Agreement | 0.413 | |
| Spearman | 0.856 *** | |
| Feature | Value | RF Contrib | XGB Contrib | RF Impact | XGB Impact |
|---|---|---|---|---|---|
| ICULOS | 58.0 | 0.706 | 0.492 | High | High |
| Temp | 38.0 | 0.643 | 0.067 | High | Medium |
| BUN | 8.0 | 0.460 | 0.057 | High | Medium |
| WBC | 13.6 | 0.414 | 0.139 | High | High |
| Magnesium | 2.9 | 0.367 | 0.330 | High | High |
| AST | 102.0 | 0.357 | 0.323 | High | High |
| HR | 96.3 | 0.343 | 0.031 | High | Low |
| Resp | 13.8 | 0.326 | 0.007 | High | Low |
| Bilirubin (D) | 0.4 | 0.324 | 0.067 | High | Medium |
| BaseExcess | | 0.301 | 0.019 | High | Low |
| Dataset Size | Training (min) | Per-Sample (ms) | % Full |
|---|---|---|---|
| 1000 patients | 0.06 | 3.79 | 2.5% |
| 5000 patients | 0.39 | 4.68 | 12.4% |
| 10,000 patients | 0.84 | 5.06 | 24.8% |
| 20,000 patients | 1.81 | 5.44 | 49.6% |
| 40,336 patients (Full) | 3.92 | 5.83 | 100.0% |
| 80,000 patients † | 8.27 | 6.20 | 198.3% |
| 100,000 patients † | 10.54 | 6.32 | 247.9% |
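
The measured rows of the scalability table above allow a quick check of how training time grows with cohort size. The sketch below fits a least-squares slope on log-log axes to the five measured rows; the two rows marked † are excluded on the assumption that they are extrapolated rather than measured, and the function name is hypothetical.

```python
import math

# Measured rows from the scalability table (patients, training minutes).
sizes = [1000, 5000, 10000, 20000, 40336]
train_min = [0.06, 0.39, 0.84, 1.81, 3.92]

def loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x).

    A slope near 1 indicates roughly linear scaling; a larger slope
    indicates super-linear growth.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

slope = loglog_slope(sizes, train_min)
print(f"empirical scaling exponent: {slope:.2f}")
```

An exponent only slightly above 1 would suggest that training cost grows close to linearly with the number of patients over the measured range.
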
| Method | AUROC | Accuracy | F-Measure | Rank |
|---|---|---|---|---|
| BayeStack (Proposed) | 0.99 *** | 0.97 *** | 0.97 *** | 1 |
| Deogire | 0.86 ** | 0.84 ** | 0.48 ** | 2 |
| Alfaras et al. | 0.72 * | 0.87 | 0.58 * | 3 |
| Narayanaswamy et al. | 0.71 * | 0.88 | 0.59 * | 4 |
| Strickler et al. | 0.78 * | 0.99 | 0.70 * | 5 |
| Murugesan et al. | 0.56 | 0.96 | 0.13 | 6 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Geetha, A.; Nisha, K.L.; Pillai, A.S.M.S.; Rajeev, S. Temporal Knowledge Extraction Through BayeStack with Multi-Level Explainability for Optimal Sepsis Classification. Mach. Learn. Knowl. Extr. 2026, 8, 150. https://doi.org/10.3390/make8060150

