Temporal Knowledge Extraction Through BayeStack with Multi-Level Explainability for Optimal Sepsis Classification

Geetha, Anjana; Nisha, K. L.; Pillai, Arun Sankar Muttathu Sivasankara; Rajeev, Sreenath

doi:10.3390/make8060150

¹ Department of Electronics and Communication Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam 690525, Kerala, India

² Electronic and Communications Department, South East Technological University, R93 V960 Carlow, Ireland

³ Dr. K.M. Cherian Institute of Medical Sciences, Chengannur 689124, Kerala, India

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(6), 150; https://doi.org/10.3390/make8060150

Submission received: 6 April 2026 / Revised: 9 May 2026 / Accepted: 12 May 2026 / Published: 1 June 2026

(This article belongs to the Topic Deep Supplement Learning for Healthcare and Biomedical Applications)

Abstract

Sepsis, a life-threatening condition causing significant global mortality, requires rapid diagnosis and intervention. Although recent advances in machine learning have supported clinical decision-making, existing sepsis classification approaches exhibit several limitations, including inadequate temporal modeling of disease progression, lack of systematic hyperparameter optimization, fragmented interpretability approaches that do not fully address multi-stakeholder clinical needs, and challenges in achieving balanced sensitivity–specificity trade-offs. These limitations restrict effective extraction of knowledge from complex temporal clinical data and hinder actionable decision-making. To address these challenges, this work proposes BayeStack, a temporal knowledge-extraction framework that integrates Bayesian optimization-driven ensemble learning with hierarchical interpretability to optimize sepsis classification. This framework captures the progression of sepsis through multi-window temporal aggregation, performs optimal classification by applying AUROC-maximizing hyperparameter space exploration, and enables comprehensive clinical knowledge extraction by applying a three-level interpretability framework that includes global feature importance, population-level partial dependence analysis, and patient-specific contribution-level analysis. Evaluation results indicated that BayeStack achieved an AUROC of 0.99 with balanced sensitivity and specificity of 0.97, substantially outperforming all baseline methods (p < 0.001). Ablation studies validated that temporal aggregation and data balancing contributed to performance improvements. A strong Spearman correlation (ρ = 0.856) validated the feature ranking convergence and effectiveness of the ensemble strategy. The interpretability framework provides insights into complementary model behavior and extracts evidence-based clinical thresholds for priority-based treatment monitoring, thereby enabling robust clinical decision support. As the first phase of a systematic integration of traditional machine learning models, this framework establishes baseline performance and explainability standards for subsequent deep learning advancements.

Keywords: sepsis classification; temporal feature aggregation; interpretability; optimization; clinical decision support system

1. Introduction

Sepsis is a serious medical condition that arises when the body’s response to an infection becomes dysregulated and damages its own tissues and organs. Even with advanced treatments and improved care, sepsis is still one of the main causes of mortality globally, which underscores the need for better diagnostic and predictive tools. The third international consensus definition for sepsis (Sepsis-3) was published in 2016 to provide a more precise clinical definition [1]: “Sepsis is a life-threatening organ dysfunction caused by a dysregulated host response to infection”. Sepsis-3 introduced a new diagnostic framework based on the Sequential Organ Failure Assessment (SOFA) score [2], which requires an increase of 2 or more points from the patient’s baseline for a sepsis diagnosis. Nevertheless, the Sepsis-3 criteria are not predictive; they identify patients already experiencing organ dysfunction, but offer limited support for early intervention. In recent years, artificial intelligence and machine learning have opened up new avenues to improve early detection and diagnostic confirmation for sepsis [3,4].

Medical research has undergone drastic changes due to the introduction of machine learning (ML) models, which are useful for diagnosis, risk assessment, and identifying underlying pathophysiology. However, due to the complexity and dynamic nature of sepsis, reliable models are needed to identify patients at risk early and accurately [5,6,7]. Current ML-based models show promise but lack transparency, which hinders clinical adoption. Model transparency is therefore essential to ML-based systems, as it allows clinicians to interpret predictions and understand the variables that influence classification results, which in turn builds trust in the model recommendations. Although these clinical needs are widely recognized, there is still no standardized approach for sepsis prediction, which motivates research into optimized ML frameworks [8,9].

This study aims to enhance sepsis classification through a systematic integration of traditional machine learning approaches that not only aligns with the Sepsis-3 diagnostic criteria but also enables explainable sepsis classification. The novel contributions of this work include:

Bayesian Optimization-Driven Ensemble Construction: Unlike the conventional methods that rely on grid search, random search, or manual hyperparameter tuning, the proposed approach enables efficient exploration of higher-dimensional hyperparameter spaces through AUROC maximization.
Multi-window Aggregation-based Temporal Feature Extraction: A temporal knowledge extraction based on multi-window aggregation is incorporated that captures the vital physiological responses and clinical deterioration patterns.
Multi-Level Explainability Pipeline: To address the black-box problem in machine learning, an integrated three-level interpretability framework is incorporated at the global level with feature importance computation, population level by partial dependence profile (PDP) analysis, and finally, individual-level patient analysis through breakdown analysis and contribution heatmaps.
Complementary Model Behavior Quantification: To justify ensemble construction beyond empirical performance gains, a complementary model behavior quantification analysis was performed, which shows how Random Forest (RF) and XGBoost (XGB) exhibit complementary decision-making strategies.
Clinical Decision Support Integration: Provides actionable clinical insights, including priority-based critical thresholds, feature interaction interpretation, and multi-timescale monitoring priorities.

These contributions collectively address the key limitations in the existing methods and advance optimal methods in sepsis classification towards actionable clinical decision support systems.

The rest of the paper is organized as follows: Section 2 reviews related works, and Section 3 explains the proposed methodology. Section 4 presents the results and discusses the findings. Section 5 summarizes the findings and contributions, discusses the limitations, and outlines future research directions.

2. Related Works

Sepsis prediction and classification have been studied using machine learning techniques with a recent focus on interpretability and predictive accuracy. Many researchers have looked into different frameworks to make sepsis diagnosis models more robust and transparent.

Chen et al. [10] developed a sepsis-prediction model that is limited to intensive care units (ICUs) and focused on transferability and interpretability. Liu et al. [11] introduced an interpretable machine learning approach for sepsis risk assessment in emergency triage. By using structured electronic medical records (sEMRs) along with vital signs, they showed that more features improved predictive accuracy. However, their study only showed predictive gains without fine-grained interpretability of feature contributions over time.

He and Qiu [12] introduced an interpretable machine learning model for mortality prediction in sepsis patients. Their work showed the effectiveness of explainable models in clinical decision support systems by leveraging SHapley Additive exPlanations (SHAPs) and Local Interpretable Model-agnostic Explanations (LIMEs). An interpretable machine learning framework aimed at early detection of sepsis was presented by Hu et al. [13]. Their work underscored the importance of achieving a balance between predictive accuracy and model transparency, but lacked an optimization-based strategy to improve sensitivity and specificity.

For predicting the ICU admissions for sepsis patients, Zilker et al. [14] introduced an interpretable machine learning system called PatWay-Net. Their research showed the benefits of merging static and sequential clinical features to boost model performance. However, the approach did not have optimization or detailed visual interpretability tools to explain individual feature contributions.

Additionally, several recent studies have investigated and reviewed sepsis prediction using artificial intelligence techniques. Stylianides et al. [15] conducted a comprehensive review of AI in ICUs with a focus on sepsis prediction and discussed various machine learning architectures that have improved real-time sepsis detection. They noted that physiological signals and lab results need to be combined with deep learning models for early sepsis diagnosis. However, as a review, their work did not provide model-level interpretability solutions.

To develop a prediction model for hospital mortality in an ICU, Zhang et al. [16] conducted a multi-center study and analyzed clinical and inflammatory biomarkers. Their results showed that inflammatory markers are important for sepsis-prediction models. A systematic review of machine learning approaches for early prediction of sepsis conducted by Islam et al. [17] presented the value of feature selection but did not provide methodological advances.

Bignami et al. [18] reviewed artificial intelligence in sepsis management and emphasized the need for model transparency in clinical aspects. Johayra et al. [19] introduced a classical machine learning framework for early sepsis prediction using ICU data, showing that traditional models can still work well, but their approach did not include optimization-driven ensemble learning or advanced interpretability mechanisms.

Gaps in the Existing Literature and Significance of the Proposed Work

The literature review indicates significant advances in sepsis diagnosis and interpretability, but several key gaps remain unaddressed. A comparison of the existing approaches with the proposed BayeStack framework is tabulated in Table 1.

Some of the key limitations addressed by the proposed BayeStack Algorithm are given below.

Lack of Optimization: Most existing works [10,13,19] rely on time-consuming manual hyperparameter tuning or grid search. For high-dimensional medical datasets, these methods are computationally expensive. Therefore, this work uses Bayesian optimization to explore the hyperparameter space efficiently and maximize AUROC, a critical metric for balancing sensitivity and specificity.
Insufficient Temporal Modeling: Studies by Liu et al. [11] and Zilker et al. [14] include temporal features but with shorter time windows and static aggregations. The proposed approach implements multi-window statistical aggregation across 1 h, 2 h, 4 h, 8 h, 24 h, and 48 h intervals for capturing temporal clinical patterns, thereby supporting rapid diagnostic confirmation.
Fragmented Interpretability Approaches: Existing works by He and Qiu [12] and Hu et al. [13] incorporate SHAP- and LIME-based explainability for individual predictions but lack a population-level analysis. On the other hand, feature importance studies [10,17] provide global insights without patient-specific explanations. Therefore, BayeStack integrates a comprehensive three-level framework ensuring explainability at all decision-making levels.
Unjustified Ensemble Construction: Traditional ensemble methods [19] show empirical performance gains but do not explain how the combination of specific models improves predictions. This work systematically quantifies complementary behavior through partial dependence profile analysis. It reveals Random Forest’s distributed feature utilization vs. XGBoost’s concentrated biomarker focus, thereby providing theoretical justification for the ensemble strategy.
Limited Clinical Integration Insights: Review studies [15,17,18] and multi-center analysis [16] generate prediction risk scores without actionable clinical guidance, whereas BayeStack translates model outputs into clinically interpretable insights that support clinical integration.
Imbalanced Performance Metrics: Existing works focusing on sensitivity [13] lack specificity, which leads to false alarms, while those focusing on accuracy [18] may miss critical sepsis cases; however, BayeStack achieves balanced performance.

By systematically addressing these limitations, BayeStack advances the sepsis classification frameworks via the integration of optimization theory, temporal knowledge extraction modeling, hierarchical interpretability, and clinical actionability, thus establishing a comprehensive methodology applicable to trustworthy decision support systems.

3. Methodology

This research introduces a data-driven method for sepsis classification by using the dataset from the PhysioNet/Computing in Cardiology Challenge 2019 database [20,21,22]. A total of 40,336 patients with 36,786 non-sepsis cases and 3550 sepsis cases were present in the de-identified dataset with label information. The dataset contains hourly measurements of 8 vital signs, 26 laboratory features, and 6 demographic metrics, with a total of 40 clinical features, as listed in Table 2.

This work addresses the concurrent sepsis classification that determines the presence of sepsis at time t using clinical features of the preceding 48 h temporal window [t − 48 h, t]. This classification task supports rapid diagnostic confirmation for patients presenting with suspected sepsis symptoms to enable immediate treatment decisions and provide retrospective quality assessment for identifying missed diagnoses in ICU populations. As a concurrent classification framework, this approach eliminates the prediction horizon by identifying existing sepsis rather than predicting the future onset, thereby enhancing diagnostic reliability and interpretability for time-critical clinical decision support. To capture both acute physiological responses and sustained clinical deterioration patterns relevant to sepsis progression, this study constructed the feature vectors using aggregated statistics from multi-scale temporal windows (1 h, 2 h, 4 h, 8 h, 24 h, and 48 h intervals) for each patient at prediction time t. Patient-level stratified hold-out validation was employed to evaluate model performance using an 80% training and 20% testing split. Processing at time t utilized only features from [t − 48 h, t]. No information beyond time t was incorporated at any processing step, thereby strictly preserving temporal causality and preventing temporal data leakage.

This work constitutes the first phase within a broader two-phase research initiative, which establishes an interpretable baseline performance using traditional machine learning with comprehensive explainability analysis. The next phase, which is in progress, will focus on early sepsis onset prediction with defined lead times (4–12 h before clinical onset) through sequential learning architectures with external validation on MIMIC IV [23] and eICU datasets [24].

3.1. Dataset Selection

The PhysioNet/Computing in Cardiology Challenge 2019 dataset was chosen for this work due to several reasons. First, it is a publicly available de-identified sepsis dataset with 40 clinical features and standardized clinical labels from 40,336 ICU patients. This allows direct performance comparison with the existing literature [25,26,27,28,29]. Second, the hourly measurements in the dataset align perfectly with the research objectives of evaluating time-based feature aggregation strategies across multiple windows (1–48 h) for knowledge extraction. This is crucial for capturing both short-term physiological responses and ongoing clinical deterioration patterns in sepsis progression. The clinical foundation for identifying sepsis comes from the Sepsis-3 diagnostic criteria introduced in 2016 [1], which have not undergone any revisions. This consistency ensures that the predictive patterns learned from 2019 data will still be relevant in 2026. Additionally, since the dataset includes data from different types of ICUs, it improves external validity and reduces biases.

The proposed study focuses on developing an interpretable algorithmic framework for concurrent sepsis classification rather than performing real-time experiments or interventions. This work establishes a hierarchical interpretability framework to provide transparent baseline performance using traditional machine learning methods. The proposed workflow shown in Figure 1 includes a five-stage approach.

The first stage is the data pipeline phase, in which data processing, data balancing, and feature engineering are performed. The second stage covers base model classifier identification, performance evaluation, and model interpretability analysis. In the third stage, the actual BayeStack algorithm with a mathematically grounded optimization strategy is developed. The fourth stage includes the stacked blended ensemble, where the base model classifiers are stacked and blended with learned weights. Then, a logistic regression meta-model processes these blended features to produce the final classification. The final stage is the optimal model generation and performance evaluation. In the following sections, all these steps are discussed in detail.

3.2. Data Processing and Data Balancing

In healthcare scenarios like sepsis classification, the preparation of raw datasets includes crucial steps such as data processing and balancing. These steps ensure data quality, handle missing data, and improve the class distribution. This section describes the methodologies used for handling missing values and balancing the dataset, both of which are essential for proper medical analysis.

3.2.1. Temporal Bounded Bidirectional Imputation

This work uses bidirectional filling, a combination of forward and backward missing value imputation applied within a strictly bounded temporal window, which is capable of dealing with complex time series data [30] having different trends and patterns. The bidirectional imputation is employed within the 48 h historical feature extraction window [t − 48 h, t] and does not incorporate any data from future timestamps beyond the prediction time t. For predictions at time t, forward filling propagates observations from earlier to later timestamps, and backward filling propagates observations from later to earlier timestamps within [t − 48 h, t]. This process reconstructs the sequence within the bounded time window, similar to a clinician reviewing temporally disordered entries in a patient’s historical records. No data from the interval (t, ∞) enters the feature construction process, which maintains temporal causality. The bidirectional operation reduces missingness from 47.3% (forward-fill only) to 12.8% while preserving causal structure, improving feature quality without introducing future data leakage [11]. This approach differs from methods that utilize post-outcome data to impute pre-outcome features. Bidirectional filling is therefore employed to accommodate temporal changes in the underlying data distributions and to minimize the risk of unintentional data leakage.
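The bounded fill described above can be sketched with pandas. This is a minimal illustration rather than the authors' implementation: the function name `impute_window`, the integer hourly index, and the toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def impute_window(patient: pd.DataFrame, t: int, window_h: int = 48) -> pd.DataFrame:
    """Forward- then backward-fill hourly rows restricted to [t - window_h, t].

    `patient` is assumed to be indexed by integer hour since admission.
    """
    # Slice first, so nothing recorded after hour t can influence the fill.
    window = patient.loc[max(0, t - window_h):t]
    # Forward fill carries earlier observations to later hours; backward fill
    # then covers leading gaps using later observations inside the window.
    return window.ffill().bfill()


# Toy record: six hourly rows with gaps (values are made up).
patient = pd.DataFrame(
    {
        "HR": [np.nan, 88.0, np.nan, 95.0, np.nan, 120.0],
        "Lactate": [np.nan, np.nan, 2.1, np.nan, np.nan, 4.0],
    },
    index=np.arange(6),
)
filled = impute_window(patient, t=4)
print(filled)
```

With `t=4`, the values recorded at hour 5 (HR 120, Lactate 4.0) never appear in the filled window, which is the property the bounded window is meant to guarantee.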

3.2.2. Data Normalization

Data normalization was performed using standard scaling (Z-score normalization). Standard scaling transforms the dataset features to have a mean of 0 and a standard deviation of 1 [31], making the data more suitable for analysis. It is applied independently to each feature, as shown in Equation (1):

$$Z = \frac{x - \mu}{\sigma} \qquad (1)$$

where $Z$ is the standardized value, $x$ is the original value, $μ$ is the mean of the feature, and $σ$ is the standard deviation of the feature.
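As a small worked instance of Equation (1), the sketch below standardizes one feature with NumPy. The heart-rate values and the train/test split are invented for illustration. The statistics are taken from the training values only, in line with the leakage precautions listed in Section 3.2.4.

```python
import numpy as np

# Hypothetical heart-rate values; the split is illustrative only.
train = np.array([70.0, 80.0, 90.0, 100.0])
test = np.array([85.0, 110.0])

mu = train.mean()      # mean of the training feature
sigma = train.std()    # population standard deviation of the training feature

z_train = (train - mu) / sigma   # Equation (1) applied to the training values
z_test = (test - mu) / sigma     # same mu and sigma reused for the test values

print(z_train, z_test)
```

After the transform the training feature has mean 0 and standard deviation 1, while the test feature generally does not, because it is scaled with statistics it did not contribute to.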

3.2.3. One-Hot Encoding

One-hot encoding converts categorical variables into a binary matrix format [32], with each unique category represented by a binary column (0 or 1) in the matrix. This technique is useful for representing categorical data, such as gender, into a binary matrix format.
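A minimal sketch of this encoding with pandas follows. The column names and category labels are assumptions chosen for the example and are not taken from the dataset schema.

```python
import pandas as pd

# Hypothetical demographic table with one categorical column.
df = pd.DataFrame({"Gender": ["F", "M", "M", "F"], "Age": [64, 71, 58, 80]})

# Each unique category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(encoded)
```

Every row has exactly one active indicator among the generated columns, which is what distinguishes one-hot encoding from an ordinal integer code.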

3.2.4. SMOTE-ENN Algorithm for Data Balancing

To address the class imbalance problem in the sepsis dataset, the SMOTE-ENN [33] (Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors) algorithm was used, as illustrated in Figure 2.

The synthetic minority oversampling technique balances the classes by creating synthetic samples along the line segments between minority data points and their k-nearest minority neighbors. Then, the edited nearest neighbors-based data cleaning approach is applied to remove misclassified instances from all the classes. This method addresses the class distribution issues, reduces class overlap, and sharpens the decision boundaries, which enhances model performance.

To minimize data leakage and ensure valid performance evaluation, the following steps were employed:

The dataset was split into training (80%, N = 32,269 patients) and testing (20%, N = 8067 patients) sets using stratified random sampling based on sepsis status. To avoid patient-level data leakage, all hourly measurements from each patient were allocated exclusively to the same dataset partition.
After the data splitting, the SMOTE-ENN algorithm was applied only to the training set. Synthetic minority samples were created using the SMOTE procedure with $k = 5$ nearest neighbors, followed by Edited Nearest Neighbors (ENNs) cleaning to remove borderline instances from both classes.
The test set was preserved without any modifications and contained only original patient data with no synthetic samples, thereby ensuring evaluations under real-world deployment conditions.
Z-score normalization parameters ( $μ$ , $σ$ ) were computed from the training set and applied to both sets.
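The steps above can be sketched in NumPy as follows. This is a simplified stand-in, not the authors' pipeline: the data are synthetic, each patient is reduced to a single feature vector, the helper names are hypothetical, normalization is assumed to happen before resampling, and the ENN rule uses a majority vote over k = 3 neighbors (the source does not state the ENN settings).

```python
import numpy as np

rng = np.random.default_rng(0)


def patient_split(patient_ids, patient_labels, test_frac=0.2):
    """Stratified split over patients, so no patient appears in both sets."""
    train, test = [], []
    for cls in np.unique(patient_labels):
        ids = rng.permutation(patient_ids[patient_labels == cls])
        n_test = int(round(test_frac * len(ids)))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return np.array(train), np.array(test)


def knn_indices(X, k):
    """Indices of the k nearest other rows for each row of X (Euclidean)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Column 0 of the sorted order is the row itself (distance 0), so skip it.
    return np.argsort(d, axis=1)[:, 1:k + 1]


def smote(X_min, n_new, k=5):
    """Interpolate between minority points and their k nearest minority neighbours."""
    nn = knn_indices(X_min, k)
    base = rng.integers(0, len(X_min), n_new)
    neigh = nn[base, rng.integers(0, nn.shape[1], n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


def enn(X, y, k=3):
    """Drop samples whose label disagrees with the majority of their k neighbours."""
    nn = knn_indices(X, k)
    keep = (y[nn] == y[:, None]).mean(axis=1) >= 0.5
    return X[keep], y[keep]


# Synthetic stand-in: 200 patients, 10% positive, 4 features each.
labels = np.array([1] * 20 + [0] * 180)
X = rng.normal(size=(200, 4)) + labels[:, None] * 1.5
ids = np.arange(200)

train_ids, test_ids = patient_split(ids, labels)
X_tr, y_tr = X[train_ids], labels[train_ids]
X_te, y_te = X[test_ids], labels[test_ids]   # test set is never resampled

# Normalisation statistics come from the training split only.
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr, X_te = (X_tr - mu) / sigma, (X_te - mu) / sigma

# Oversample the minority class up to the majority count, then clean with ENN.
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote(X_tr[y_tr == 1], n_new, k=5)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])
X_bal, y_bal = enn(X_bal, y_bal, k=3)
print(np.bincount(y_tr), np.bincount(y_bal), np.bincount(y_te))
```

The printed class counts show the training set moving from a 144/16 imbalance towards parity, while the test set keeps its original 36/4 distribution.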

3.3. Feature Engineering

Feature engineering is a key part of data preparation, which involves creating, transforming, and selecting features to improve model interpretability and predictive performance. This work applies feature engineering in three steps, namely time-based feature aggregation, feature importance computation, and relevant feature extraction, to build an efficient and informative feature set.

3.3.1. Multi-Window Temporal Feature Aggregation

To deal with time series data, time-based feature aggregation is performed to extract the significant feature variations. Time-based aggregation is the process of summarizing or accumulating data over specific time intervals [34] to capture the temporal trends and patterns in the dataset. The data were split into non-overlapping segments by choosing appropriate time windows of interest (e.g., 1 h, 2 h, 4 h, and 8 h). The vital signs and lab features within each time window were aggregated using statistical measures (e.g., mean, median, and std) to capture the central tendency and variability, which represents the initial time series data as a feature matrix. The rows of the feature matrix indicate the time window, and the columns show the aggregated statistics for each vital sign or laboratory feature, which allow us to gain insights into the progression of sepsis by analysis of how the vital signs and laboratory features vary over different time windows. The extracted features are the central point of model interpretation and feature selection; thus, the machine learning models can make correct predictions while keeping clinical relevance.

In this work, temporal knowledge extraction denotes the systematic identification of clinically relevant temporal patterns across multiple time windows rather than sequential time series modeling methods such as LSTM. This aggregation-based framework captures the following temporal knowledge aspects.

Multi-scale temporal patterns: Statistical aggregations computed across different time windows capture disease progression trajectories that distinguish acute physiological responses (1–4 h) from sustained clinical deterioration (24–48 h) patterns.
Temporal stability indicators: Cross-window comparisons of aggregated features differentiate transient fluctuations from persistent abnormalities (e.g., a single lactate spike vs. sustained elevated lactate).
Time-dependent feature importance: Feature importance varies by temporal scales, with vital signs (HR, O₂Sat, and Resp) contributing predominantly within short windows (1–4 h), indicating acute physiological stress, whereas laboratory biomarkers (WBC, Fibrinogen, and Lactate) become more influential over longer windows (24–48 h), indicating sustained organ dysfunction.

This aggregation-based temporal knowledge extraction strategy emphasizes clinical interpretability and computational efficiency by prioritizing transparent clinician-validated feature engineering over sequential dependency modeling, which would require recurrent architectures. The extracted temporal patterns align with the multi-level explainability framework, facilitating interpretation of feature importance across multiple temporal scales in sepsis classification [34,35]. Consequently, this framework provides an interpretable baseline for evaluating future deep learning models designed to model sequential dynamics.
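A minimal sketch of the multi-window aggregation is given below, using pandas. It assumes trailing windows that all end at the classification time t, which is one reading of the description; the function name, feature-naming scheme, and toy signals are hypothetical.

```python
import numpy as np
import pandas as pd

WINDOWS_H = (1, 2, 4, 8, 24, 48)
STATS = ("mean", "median", "std")


def window_features(patient: pd.DataFrame, t: int) -> pd.Series:
    """Aggregate each signal over trailing windows that end at hour t."""
    feats = {}
    for w in WINDOWS_H:
        # The w most recent hourly rows: hours t - w + 1 .. t (inclusive).
        chunk = patient.loc[max(0, t - w + 1):t]
        for col in patient.columns:
            for stat in STATS:
                # std over a single row is NaN, so the 1 h std is undefined.
                feats[f"{col}_{stat}_{w}h"] = getattr(chunk[col], stat)()
    return pd.Series(feats)


# Toy record: 48 hourly rows with a steadily rising heart rate and lactate.
patient = pd.DataFrame(
    {"HR": 60.0 + np.arange(48), "Lactate": np.linspace(1.0, 4.0, 48)},
    index=np.arange(48),
)
feats = window_features(patient, t=47)
print(feats[["HR_mean_1h", "HR_mean_8h", "HR_mean_48h"]])
```

Comparing the same statistic across windows (here the 1 h mean of 107 against the 48 h mean of 83.5) is what exposes a recent rise relative to the longer-term level.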

3.3.2. Feature Importance Computation and Feature Selection

Feature importance is the share of a given feature in the predictive performance of a model. Feature selection is a way to choose only the relevant features for a particular task. Feature importance computation using a Random Forest tree-based model is used to determine the extent to which each feature impacts the predictive performance [36]. The feature importance scores are extracted and plotted to understand which vital signs and laboratory features are more important in predicting sepsis at different time frames. A Q-Q (Quantile–Quantile) plot [37] is used to check the distribution of feature data by comparing the quantiles of the observed data against the quantiles of a reference distribution, here the normal distribution. If the points fall approximately along a straight line, the data follow the reference distribution. Q-Q plots are also well suited to detecting irregularities and outliers in a dataset. Together, feature importance computation and feature selection help to extract the most important features contributing to a given task. Time-based aggregation and feature importance computation are often used together to capture the dynamic nature of physiological indicators in sepsis over different time windows for model interpretability and clinical decision-making.

3.4. Model Development

Model development starts by applying a variety of traditional classification algorithms [38,39], including Random Forest, XGBoost, Logistic Regression, Naive Bayes, and K-Nearest Neighbor, to the medical dataset. The dataset used was a collection of clinical features such as heart rate, oxygen saturation, temperature, and various blood parameters. The performance of each classifier is evaluated thoroughly using the cross-validation technique. The cross-validation accuracy obtained for different classifier combinations is tabulated in Table 3.

The mean accuracy scores obtained from the cross-validation procedure indicate that the combination of the Random Forest classifier and the XGBoost classifier performs well across all metrics compared to other individual classifiers. The Random Forest and XGBoost classifiers were therefore chosen as base model classifiers in the initial stage of model development. This choice is grounded both in theoretical principles and in empirical observations drawn from applying the algorithms to the dataset. In addition, it reflects the technical benefits of these models for the dataset used and the classification task [35,40,41].

Some of the advantages of using ensemble methods are robustness to overfitting, interpretability through feature importance analysis, the ability to handle complex and high-dimensional data, and generalizability to unseen data. By combining bagging with random feature selection, Random Forest [40] reduces variance, making it robust to noise and overfitting, whereas XGBoost [41] utilizes a boosting approach that sequentially corrects errors from prior trees, reducing bias and capturing complex nonlinear interactions. Due to these differing strengths, such as variance reduction in RF and bias reduction in XGBoost, the ensemble created from both these classifiers often performs better than either model alone, as each of them compensates for the limitations of the other. This ensemble method not only enhances predictive performance but also accommodates the diversity of patterns and dependencies in the data.

Modeling of BayeStack Algorithm

The dataset consists of 40 clinical features with time series information, which makes it multifaceted. Selecting the most relevant parameters therefore becomes a critical task, requiring an optimization technique that can operate efficiently in a high-dimensional, noisy feature space with computationally expensive evaluations. The BayeStack algorithm integrates Random Forest (RF) and XGBoost (XGB) models with Bayesian optimization and a stacked blended ensemble using logistic regression as a meta-learner for optimizing sepsis classification.

To begin the modeling of the proposed algorithm, the first step is to define the appropriate search spaces for the hyperparameters of the chosen base model classifiers. The parameters chosen for each classifier and their final optimal values obtained after optimization are listed in Table 4.

Area Under the Receiver Operating Characteristic Curve (AUROC) summarizes the trade-off between sensitivity and specificity across all possible classification thresholds. Thus, this work uses AUROC maximization as the optimization objective. As sepsis is a critical medical condition, identifying all sepsis cases while avoiding unnecessary false alarms is crucial. Optimizing only for accuracy, F1-score, or precision risks biasing the model towards one class. By maximizing AUROC, the model is not tuned to a single threshold, but instead learns to discriminate between septic and non-septic patients over a range of thresholds, making it generalizable and robust.
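To make the threshold-independence of AUROC concrete, the sketch below computes it directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name and the four toy scores are illustrative only.

```python
import numpy as np


def auroc(y_true, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Compare every positive score with every negative score; ties count half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, s))                      # 0.75: three of four pairs are ordered correctly
print(auroc(y, [v ** 3 for v in s]))    # 0.75: unchanged by a monotone rescaling
```

Because only the ranking of scores matters, any monotone rescaling of the model output leaves the value unchanged, which is why the objective does not commit the model to a particular decision threshold.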

Moreover, Bayesian optimization was chosen [42] because it performs effectively where gradients of the objective are unavailable. Sepsis classification is a complex problem that is well suited to Bayesian optimization [43,44] owing to the diversity of clinical parameters in the dataset. Additionally, Q-Q plot analysis reveals that most features approximate normal distributions after normalization, although deviations exist in some laboratory features, making Bayesian optimization with Gaussian process surrogate modeling a suitable choice. The proposed BayeStack uses a sequential model-based Bayesian optimization (SMBO) strategy [45,46,47,48], which is well suited to efficient exploration of the hyperparameter space and maintains a careful balance between exploration and exploitation.

In the Bayesian optimization procedure, a Gaussian process surrogate model was built and used to approximate the objective function. A surrogate model replaces complex simulations or experiments with a simpler model that is fast to evaluate and accurate enough for analysis and optimization. A Gaussian process surrogate model, which comprises a mean function and a covariance kernel, is adopted in the proposed method to capture the underlying distribution of the unknown objective function across the hyperparameter space. The acquisition function used is Expected Improvement (EI), which balances exploration and exploitation by selecting hyperparameters that maximize the expected improvement over the current best AUROC. The best hyperparameters for both Random Forest and XGBoost are obtained by maximizing the EI function across the search space. Subsequently, the Random Forest and XGBoost models are trained separately with the Bayesian-optimized hyperparameters.
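As an illustration of how the acquisition step ranks candidates, the following minimal sketch evaluates the closed-form Expected Improvement for a maximization objective. The posterior means and standard deviations are hypothetical values chosen for illustration, not outputs of the study's Gaussian process, and the function name `expected_improvement` is an assumption of this sketch.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: provides Phi (cdf) and phi (pdf)


def expected_improvement(mu, sigma, f_best):
    """Closed-form EI (maximization) for a candidate with posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    return (mu - f_best) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)


# Hypothetical posterior summaries (mean AUROC, std) for three candidate settings.
candidates = {
    "A": (0.960, 0.005),  # slightly below the incumbent, low uncertainty
    "B": (0.955, 0.030),  # lower mean but high uncertainty (exploration)
    "C": (0.972, 0.010),  # above the incumbent (exploitation)
}
f_best = 0.965  # hypothetical best AUROC observed so far

scores = {name: expected_improvement(mu, sd, f_best)
          for name, (mu, sd) in candidates.items()}
next_candidate = max(scores, key=scores.get)  # setting to evaluate next
```

In this toy case the uncertain candidate B scores higher than the confident but slightly worse candidate A, which shows how EI rewards exploration as well as a high predicted mean.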

The stacking ensemble [49,50,51,52,53] is performed using Stratified K-Fold cross-validation to ensure robust generalization and further enhance performance. For each fold, Random Forest and XGBoost predictions were used as meta-features. A blending layer assigns learned weights to each base model’s predictions and combines them. The logistic regression meta-model produces a final probabilistic output through a Sigmoid function by using the blended meta-features. The stacked blended ensemble method captures the benefits of simplicity and the ability to model complex relationships. The performance of the blended models was comprehensively assessed using various metrics, including sensitivity and specificity, to capture knowledge about how well the model performed in positive and negative cases. The BayeStack Algorithm, developed for optimized model generation, is presented in Algorithm 1. See Section S1 in Supplementary Materials for detailed mathematical modeling of the BayeStack algorithm.
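The meta-learning step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the out-of-fold probabilities are hypothetical, and the logistic regression is fitted with plain batch gradient descent instead of a library solver.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def blend(params, p_rf, p_xgb):
    """Meta-model output: sigmoid(b0 + b_rf * p_rf + b_xgb * p_xgb)."""
    b0, b_rf, b_xgb = params
    return sigmoid(b0 + b_rf * p_rf + b_xgb * p_xgb)


def log_loss(params, meta_features, labels):
    total = 0.0
    for (p_rf, p_xgb), y in zip(meta_features, labels):
        p = min(max(blend(params, p_rf, p_xgb), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def fit_meta_learner(meta_features, labels, lr=0.5, epochs=500):
    """Fit the three coefficients by batch gradient descent on the log-loss."""
    b0 = b_rf = b_xgb = 0.0
    n = len(labels)
    for _ in range(epochs):
        g0 = g_rf = g_xgb = 0.0
        for (p_rf, p_xgb), y in zip(meta_features, labels):
            err = blend((b0, b_rf, b_xgb), p_rf, p_xgb) - y  # d(loss)/d(logit)
            g0 += err
            g_rf += err * p_rf
            g_xgb += err * p_xgb
        b0 -= lr * g0 / n
        b_rf -= lr * g_rf / n
        b_xgb -= lr * g_xgb / n
    return b0, b_rf, b_xgb


# Hypothetical out-of-fold probabilities (RF, XGB) and true labels.
oof = [(0.90, 0.80), (0.20, 0.10), (0.70, 0.90), (0.30, 0.40),
       (0.85, 0.60), (0.10, 0.30), (0.60, 0.75), (0.40, 0.20)]
y_true = [1, 0, 1, 0, 1, 0, 1, 0]

params = fit_meta_learner(oof, y_true)
final_probs = [blend(params, p_rf, p_xgb) for p_rf, p_xgb in oof]
```

Because the meta-features are out-of-fold predictions, the meta-learner never sees a base-model output for a sample that the base model was trained on, which is what keeps the learned weights from simply rewarding overfitting.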

3.5. Model Interpretability Framework

In this work, a hierarchical explainability framework suitable for ensemble models was implemented to ensure clinical transparency and trustworthiness. Clinical decision-making requires both population-wide patterns and justifications for individual cases, so the proposed framework provides global and local explanations within a unified analytical pipeline [54,55,56]. Figure 3 illustrates the hierarchical explainability workflow implemented in this study.

3.5.1. Level 1: Global Feature Importance Analysis

Global feature importance analysis identifies the physiological biomarkers and laboratory values that have the strongest impact on the sepsis classification task by estimating the relative contribution of each feature across the entire test dataset. This is achieved by calculating the increase in prediction error when a feature is permuted.
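A minimal sketch of permutation-based importance is given below. The toy classifier and data are hypothetical: the classifier depends only on the first feature, so the second feature should receive zero importance.

```python
import random


def error_rate(model, rows, labels):
    wrong = sum(1 for r, y in zip(rows, labels) if model(r) != y)
    return wrong / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean increase in error rate when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = error_rate(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(error_rate(model, permuted, labels) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 drives the label, feature 1 is noise.
rows = [[0.10, 5.0], [0.20, 3.0], [0.30, 9.0], [0.40, 1.0], [0.45, 7.0],
        [0.60, 2.0], [0.70, 8.0], [0.80, 4.0], [0.90, 6.0], [0.95, 0.5]]
toy_model = lambda r: 1 if r[0] > 0.5 else 0
labels = [toy_model(r) for r in rows]

importance = permutation_importance(toy_model, rows, labels)
```

Averaging over several shuffles reduces the variance that a single random permutation would introduce.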

3.5.2. Level 2: Population-Level Analysis

In this level, partial dependence profile (PDP) analysis was performed. It assesses how the predictions change across the clinical range of each feature while marginalizing over all other features through range metrics, mean predictions, and inter-model agreement scores. This approach reveals the following:

Clinical biomarkers, where both the base models show similar PDP patterns;
Features or parameters that show different sensitivity patterns aiding in complementary decision-making;
The feature utilization patterns of the base models.

For each feature $f_{i}$, the PDP function is computed as

${PDP}_{f_{i}} (x) = E_{X_{- i}} [\hat{y} (x, X_{- i})]$

(2)

where $X_{- i}$ represents all features except $f_{i}$, and $\hat{y}$ is the model’s prediction function. Then, the prediction sensitivity for each feature is measured as follows:

$Range (f_{i}) = \max_{x \in X} {PDP}_{f_{i}} (x) - \min_{x \in X} {PDP}_{f_{i}} (x)$

(3)

where $X$ represents the clinical range of feature values in the test dataset. If these ranges are high, it indicates features where prediction probability varies substantially across physiological measurement ranges, and this can be considered as critical markers for sepsis classification.
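Equations (2) and (3) can be sketched in a few lines. The prediction function below is a hypothetical linear toy model (the two columns loosely stand for a lactate-like and a heart-rate-like feature); it is used only to make the averaging and the range computation concrete.

```python
def partial_dependence(predict, rows, feature_idx, grid):
    """PDP_{f_i}(x): mean prediction with feature i forced to x, others as observed."""
    profile = []
    for x in grid:
        preds = [predict(r[:feature_idx] + [x] + r[feature_idx + 1:]) for r in rows]
        profile.append(sum(preds) / len(preds))
    return profile


def pdp_range(profile):
    """Range(f_i): spread of the partial dependence profile over the grid."""
    return max(profile) - min(profile)


# Hypothetical toy model: strong effect of feature 0, weak effect of feature 1.
def toy_predict(r):
    return 0.1 + 0.15 * r[0] + 0.001 * r[1]


rows = [[1.0, 70.0], [2.0, 90.0], [4.0, 110.0]]

profile_0 = partial_dependence(toy_predict, rows, 0, [1.0, 2.0, 3.0, 4.0])
profile_1 = partial_dependence(toy_predict, rows, 1, [70.0, 90.0, 110.0])
range_0 = pdp_range(profile_0)  # large range: predictions depend strongly on feature 0
range_1 = pdp_range(profile_1)  # small range: feature 1 barely moves the prediction
```

The grid here is supplied by hand; in practice it would span the observed clinical range of the feature in the test set, as Equation (3) specifies.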

Algorithm 1 BayeStack Algorithm for Optimized Sepsis Classification

Require: Search Space $S = \{n, d, s, l\}$; Objective Function $F = AUROC [Model (n, s, d, l)]$; Observed Hyperparameters $X = \{x_{1}, x_{2}, \dots, x_{n}\}$; Corresponding Objective function values $y = \{y_{1}, y_{2}, \dots, y_{n}\}$
Ensure: Optimized Sepsis Classification Model
1: Step 1: Gaussian Process Surrogate Model
2: Build surrogate model with mean $μ (x)$, covariance kernel $k (x, x^{*})$
3: Find joint distribution: $\begin{bmatrix} y \\ f^{*} \end{bmatrix} \sim N \left( \begin{bmatrix} μ_{X} \\ μ_{X^{*}} \end{bmatrix}, \begin{bmatrix} K_{X, X} + σ_{n}^{2} I & K_{X, X^{*}} \\ K_{X^{*}, X} & K_{X^{*}, X^{*}} \end{bmatrix} \right)$
4: Predictive distribution at $x^{*}$: $f (x^{*}) \sim N (μ^{*}, σ^{* 2})$
5: Step 2: Acquisition Function
6: Improvement function: $I (x^{*}) = \max (f (x^{*}) - f_{best}, 0)$
7: Expected Improvement: $E [I (x^{*})] = (μ^{*} - f_{best}) Φ \left( \frac{μ^{*} - f_{best}}{σ^{*}} \right) + σ^{*} ϕ \left( \frac{μ^{*} - f_{best}}{σ^{*}} \right)$
8: Step 3: Extract Best Hyperparameters
9: $θ_{RF}^{*} = \arg \max E [I_{RF} (x^{*})]$, $θ_{XGB}^{*} = \arg \max E [I_{XGB} (x^{*})]$
10: Base Classifiers: $C = [C_{1}, C_{2}]$ where $C_{1}$: RF $(θ_{RF}^{*})$, $C_{2}$: XGB $(θ_{XGB}^{*})$
11: Step 4: Stacked Out-of-Fold Predictions
12: for $k \leftarrow 1$ to K do
13: Train ${RF}^{(k)}$ and ${XGB}^{(k)}$ on training folds
14: Generate predictions on validation fold k
15: ${RF}_{stack}^{(k)} (x) = {RF}_{θ_{RF}^{*}}^{(k)} (x_{val}^{(k)})$
16: ${XGB}_{stack}^{(k)} (x) = {XGB}_{θ_{XGB}^{*}}^{(k)} (x_{val}^{(k)})$
17: end for
18: Stack predictions: $D_{meta} = \{({RF}_{stack} (x_{i}), {XGB}_{stack} (x_{i}), y_{i})\}_{i = 1}^{N}$
19: Step 5: Meta-Model Training and Prediction
20: Meta-features: $z (x) = [{RF}_{stack} (x), {XGB}_{stack} (x)]$
21: Train Logistic Regression meta-model on $D_{meta}$ to learn optimal weights: $\hat{y} (x) = σ (β_{0} + β_{RF} \cdot {RF}_{stack} (x) + β_{XGB} \cdot {XGB}_{stack} (x))$
22: where $β_{RF}, β_{XGB}$ are learned coefficients, $σ (\cdot)$ is the sigmoid function
23: return Optimized ensemble model with learned blended weights

Statistical Validation of Model Agreement: To quantify the convergence and divergence between the explanations of the two base models, two metrics were computed [50].

1. Feature-wise Agreement Score: Pearson correlation between Random Forest and XGBoost partial dependence profile predictions for each feature:

$Agreement (f_{i}) = \frac{Cov ({PDP}_{f_{i}}^{RF}, {PDP}_{f_{i}}^{XGB})}{σ_{{PDP}_{f_{i}}^{RF}} \cdot σ_{{PDP}_{f_{i}}^{XGB}}}$

(4)

where ${PDP}_{f_{i}}^{RF}$ and ${PDP}_{f_{i}}^{XGB}$ represent the partial dependence predictions for feature $f_{i}$ from Random Forest and XGBoost, respectively, and $σ$ denotes standard deviation. An agreement score close to 1 indicates that both models interpret the feature in the same way, whereas lower scores indicate model-specific feature utilization.
2. Spearman Rank Correlation (ρ): A non-parametric correlation is used to evaluate whether both models assign comparable rankings to features based on their range values. It is calculated as follows:

$ρ = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)}$

(5)

where n is the number of features, and $d_{i}$ is the difference between the ranks of feature i based on RF and XGB range values.
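Both agreement metrics are straightforward to compute. The sketch below implements Equations (4) and (5) directly; the profiles and range values are hypothetical, and the rank helper assumes no ties, as Equation (5) does.

```python
import math


def pearson(a, b):
    """Equation (4): covariance divided by the product of standard deviations."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)  # the common 1/n factors cancel


def ranks(values):
    """Rank 1 = largest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(a, b):
    """Equation (5): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical PDP profiles of one feature under the two base models.
pdp_rf = [0.10, 0.15, 0.25, 0.40]
pdp_xgb = [0.12, 0.14, 0.20, 0.30]
agreement = pearson(pdp_rf, pdp_xgb)

# Hypothetical range values of four features under the two base models.
rf_ranges = [0.30, 0.20, 0.10, 0.05]
xgb_ranges = [0.25, 0.05, 0.10, 0.02]
rho = spearman_rho(rf_ranges, xgb_ranges)
```

In the toy ranking, the two models swap the second and third features, so $d_{i}^{2}$ sums to 2 and $ρ = 1 - 12/60 = 0.8$.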

3.5.3. Level 3: Individual Patient-Level Interpretability

A breakdown analysis was performed by decomposing the prediction into additive feature contributions to provide explainability at the individual patient level. This shows which features contributed positively and which contributed negatively to the sepsis prediction for a particular patient. The feature contributions of both models are visualized using a contribution heatmap.

For a given prediction $\hat{y} (x)$, the contribution of feature $f_{i}$ is calculated as

$Δ_{i} = \hat{y} (x_{1 : i}) - \hat{y} (x_{1 : i - 1})$

(6)

Here, $x_{1 : i}$ represents the feature vector, with features ordered by their contribution magnitude.

The mean contribution magnitude for each model is computed as

${\bar{C}}_{model} = \frac{1}{k} \sum_{i = 1}^{k} | C_{i} |$

(7)

where k is the number of features with contributions above the threshold (typically $| C_{i} | > 0.10$), and $C_{i}$ is the contribution of feature i. The mean contribution magnitude for each model quantifies whether the models use distributed (high k, lower $\bar{C}$) or concentrated (low k, higher $\bar{C}$) feature utilization strategies.
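A minimal sketch of the sequential decomposition in Equation (6) and the summary in Equation (7) follows. It assumes one common reading of the breakdown method: $\hat{y} (x_{1 : i})$ is taken as the mean prediction over a background sample with the first i features fixed to the patient's values. The model, background rows, and feature order are hypothetical.

```python
def breakdown(predict, instance, background, order):
    """Return (baseline, contributions) with Delta_i = y_hat(x_{1:i}) - y_hat(x_{1:i-1})."""
    def mean_prediction(fixed):
        # Average over background rows, overriding the fixed features
        # with the values of the explained instance.
        total = 0.0
        for row in background:
            mixed = [instance[j] if j in fixed else row[j] for j in range(len(row))]
            total += predict(mixed)
        return total / len(background)

    fixed = set()
    baseline = mean_prediction(fixed)  # no feature fixed yet
    previous = baseline
    contributions = {}
    for j in order:
        fixed.add(j)
        current = mean_prediction(fixed)
        contributions[j] = current - previous
        previous = current
    return baseline, contributions


def mean_contribution_magnitude(contributions, threshold=0.10):
    """Equation (7): mean |C_i| over the k features with |C_i| above the threshold."""
    kept = [abs(c) for c in contributions.values() if abs(c) > threshold]
    return (sum(kept) / len(kept), len(kept)) if kept else (0.0, 0)


# Hypothetical additive toy model over three features scaled to [0, 1].
def toy_predict(r):
    return 0.1 + 0.5 * r[0] + 0.3 * r[1] + 0.05 * r[2]


background = [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
patient = [1.0, 1.0, 0.0]

baseline, contrib = breakdown(toy_predict, patient, background, order=[0, 1, 2])
c_bar, k = mean_contribution_magnitude(contrib)
```

By construction, the baseline plus the sum of all contributions equals the model's prediction for the patient, which is the additivity property that makes breakdown plots readable.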

This multi-level interpretability method provides quantitative, metric-based explanations of the predictions and ensures that each BayeStack prediction can be linked to specific clinical features. The PDP-based method offers compact population-level analysis, while breakdown plots and contribution heatmaps provide patient-specific explainability for clinical decision support.

4. Results and Discussion

4.1. Baseline Characteristic Results

This subsection discusses the baseline characteristics and statistical properties of the clinical features used in the model. It covers the temporal aggregation patterns, distribution analyses, and feature importance assessments.

4.1.1. Time-Based Aggregation Results

The vital sign variations and laboratory feature variations were extracted with respect to different time windows.

1–4 h: The model captures the acute responses and indicates early metabolic shifts during the initial time window.
8–24 h: During the next 8 to 24 h time window, continuous monitoring of feature variations and their gradual trends and patterns, including feature stabilization, was undertaken. In this time window, certain feature trends, such as blood pressure trends and CBC counts, may reflect the systemic responses.
48 h: Stabilization or progression to severe conditions was noted. Indicators of renal function, such as creatinine, became critical. Persistent abnormalities in vital signs and laboratory features indicate sepsis severity.
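The multi-window aggregation described above can be sketched for a single hourly signal. The window lengths and the summary statistics (mean, minimum, maximum, and net change) are assumptions made for illustration; the exact aggregates used in the study are not restated here.

```python
from statistics import mean

WINDOWS_H = (1, 4, 8, 24, 48)  # assumed trailing-window lengths in hours


def aggregate_windows(hourly_values, windows=WINDOWS_H):
    """Summarize the trailing w hours of one signal for each window length w.

    hourly_values holds one reading per hour, oldest first; None marks a
    missing reading. Windows longer than the record use all available hours.
    """
    features = {}
    for w in windows:
        recent = [v for v in hourly_values[-w:] if v is not None]
        if not recent:
            continue  # nothing observed in this window
        features[f"mean_{w}h"] = mean(recent)
        features[f"min_{w}h"] = min(recent)
        features[f"max_{w}h"] = max(recent)
        features[f"delta_{w}h"] = recent[-1] - recent[0]  # net change in window
    return features


# Hypothetical heart-rate record covering six hours, with one missing reading.
heart_rate = [80.0, 82.0, None, 90.0, 95.0, 110.0]
hr_features = aggregate_windows(heart_rate)
```

Short windows react to abrupt changes such as the final jump to 110 bpm, while longer windows expose the gradual upward trend over the whole record.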

4.1.2. Quantile-Quantile Plots (Q-Q Plots)

Figure 4 depicts sample Q-Q (Quantile–Quantile) plots of the vital and laboratory features used in this work to assess the normality and distribution characteristics of clinical features. Most features exhibit distributions close to normality: vital signs (HR and Temp) are approximately normal, whereas some features, such as O2Sat, Fibrinogen, and Lactate, are skewed, which tree-based ensembles handle robustly. These deviations represent meaningful clinical conditions, and normalization was applied to standardize feature scales while preserving clinically relevant variability (see Section S2 in Supplementary Materials).

4.1.3. Feature Importance Radar Plot

A feature importance radar plot is a circular chart used to visualize how much each feature contributes to the prediction. Figure 5 depicts a radar plot illustrating the importance of different features for predicting the outcome of sepsis. The feature importance is extracted from a Random Forest model, normalized, and mapped onto the radar plot. Each feature is placed radially around a center point, and the length of its spoke indicates its significance: a longer spoke indicates higher importance. The high importance of WBC and Fibrinogen highlights the model’s focus on inflammatory markers, which are critical in sepsis diagnosis.

Temporal Feature Engineering Impact: Multi-window aggregation (1–48 h) allows the framework to capture both rapid changes in vital signs and acute physiological responses (1–4 h) and the gradual decline in organ function and clinical parameters (8–48 h). The feature importance analysis shows that short-window aggregations highlight vital signs, while longer-window aggregations represent laboratory trends, which supports the temporal modeling approach. The Q-Q plot analysis (Figure 4) shows that most features approximate normal distributions, although several features are markedly skewed, reflecting clinical variability in critical care populations. Tree-based ensembles are robust to these distributional variations, which avoids the information loss that further data transformation could introduce.

4.2. Framework Evaluation and Model Interpretability Results

Overall model performance was evaluated using standard measures, namely specificity, sensitivity, accuracy, precision, recall, and F1-score, which are reported with 95% bootstrap confidence intervals in Table 5. The results highlight the superior performance of the BayeStack model, which achieves a balanced trade-off between specificity and sensitivity while performing strongly across all evaluation metrics.

Component Performance Analysis

Table 5 presents a detailed performance comparison between Random Forest (RF), XGBoost, and the proposed BayeStack model, with 95% bootstrap confidence intervals. Random Forest shows consistent performance across the metrics AUC-ROC, Accuracy, and F1-Score, but has lower sensitivity, indicating chances of elevated false negatives. XGBoost achieves higher AUC-ROC and improved sensitivity but shows slightly lower specificity, suggesting marginally elevated false positive rates.

Because BayeStack combines both base learners through Bayesian optimization and stacking, it outperforms the individual base models even at the lower end of its confidence interval. It achieves the highest AUC-ROC score of 0.99, a statistically significant 4.21% improvement over the best base model ($p < 0.001$). The 95% confidence intervals indicate that these improvements are statistically reliable, which is crucial in medical applications where performance variation directly affects patient safety and treatment outcomes. The proposed framework retains XGBoost’s ability to catch cases that need attention and uses Random Forest’s strength in correctly identifying negative cases.
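For readers unfamiliar with bootstrap intervals, the sketch below computes a percentile bootstrap confidence interval for AUROC. It is a generic illustration on hypothetical scores, not the evaluation code of the study, and the number of resamples is an arbitrary choice.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUROC needs both classes in the resample
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical labels and predicted probabilities for ten patients.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

point_estimate = auroc(y_true, y_score)  # 24 of 25 positive-negative pairs are ordered correctly
ci_low, ci_high = bootstrap_ci(y_true, y_score)
```

With only ten patients the interval is wide; the narrow intervals reported for a cohort of tens of thousands of patients follow from the much larger sample.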

4.3. Ablation Studies Analysis

Ablation studies were conducted to validate the contribution of temporal aggregation and data balancing strategies to overall model performance.

4.3.1. Temporal Aggregation Component Analysis

Table 6 presents the performance improvement achieved by extending the temporal aggregation window. It validates the necessity of extracting temporal features for optimized sepsis classification.

The performance gains observed when moving from a single time point to the complete 48 h time window indicate that capturing disease progression trajectories is essential for sepsis classification. The temporal aggregation component contributes a 12.5% improvement in AUROC. The full 48 h window captures critical temporal patterns needed for accurate sepsis detection.

4.3.2. Data Balancing Component Analysis

The impact of the data balancing strategy on performance is illustrated in Table 7.

The impact analysis shows that the combined SMOTE-ENN strategy significantly outperforms the individual techniques in handling class imbalance. SMOTE alone achieves 0.93 AUROC and ENN alone achieves 0.92 AUROC, whereas the combined SMOTE-ENN reaches 0.99, demonstrating the benefit of the hybrid approach. It contributes a 15.1% improvement in AUROC and indicates the need to combine synthetic oversampling and noise removal to handle severe class imbalance.

4.4. Comprehensive Model Interpretability Analysis

4.4.1. Population-Level Model Behavior: Comprehensive Feature Effect Analysis

The population-level analysis examined changes in the predicted probability of sepsis with respect to the variations of individual features across their clinical ranges. Table 8 provides a comparison of how the Random Forest and XGBoost base models respond to the 23 most important selected clinical features extracted across the entire test dataset. This provides an idea of the range of influence that each feature has on model predictions. To indicate whether the models interpret the features similarly, the Pearson correlation between the Random Forest and XGBoost predictions for each feature was measured as an agreement score.

Clinical Decision Thresholds: Using the interpretability framework and relying on partial dependence profile (PDP) analysis, the evidence-based clinical thresholds for prioritized treatment monitoring were derived, as shown in Table 9. The elevated probability (0.30–0.55) indicated by the RF model when ICULOS exceeded 48–50 h suggests cumulative hospital-acquired infection risk. Both models exhibit probability variations of about 0.05 in response to abnormal WBC levels such as counts below 4 × 10³/µL (leukopenia) or above 12–15 × 10³/µL (leukocytosis), which reflect SIRS-consistent inflammatory responses. A respiratory rate exceeding 20–24 breaths/min was associated with a significant RF probability rise of 0.23, suggesting respiratory distress. Likewise, reduced O2Sat (<92–94%) corresponded to the hypoxemic threshold, with a probability increase of 0.2, which reflected the respiratory compromise. Tissue hypoperfusion was signaled by lactate thresholds of >2–4 mmol/L. Intermediate-priority monitoring thresholds were identified as heart rates (HR) above 100–110 bpm, BUN levels above 20–30 mg/dL, and a temperature below 36 °C or above 38 °C. For both models, moderate variations in probability were observed in magnesium imbalances (<1.5 or >2.5 mg/dL). The bidirectional threshold ranges of fibrinogen indicate coagulopathy risk (<200 mg/dL) and acute-phase responses (>400 mg/dL). These results are consistent with Sepsis-3 diagnostic criteria and provide quantitative decision boundaries for evidence-based prioritization of clinical monitoring.

Temporal and Vital Sign Patterns: For both base model classifiers, ICU length of stay (ICULOS) was found to be the most influential feature in sepsis diagnosis, with increased sepsis probability for stays exceeding 48 h. Although ICULOS may initially appear to act as a potential confounding variable, since prolonged ICU stay is associated with both increased sepsis risk and subsequent clinical outcomes, it can be interpreted as a valid temporal risk accumulator for the following reasons:

Observable at Prediction Time: ICULOS captures the cumulative duration of ICU stay up to time t, representing information available to clinicians during real-time decision-making.
Clinical Relevance: ICULOS captures cumulative clinical risk exposure, such as hospital-acquired infections, antibiotic use that may promote antimicrobial resistance, prolonged use of invasive devices and severity of underlying diseases requiring extended clinical care [2,7].
Absence of Target Leakage: ICULOS does not directly represent the sepsis outcome and, therefore, avoids target leakage; it reflects accumulated risk exposure and overall disease complexity.
Empirical Validation via Ablation: As shown in Table 10, ICULOS contributes comparably to individual laboratory markers rather than acting as a dominant modeling artifact.

Consistent with this interpretation, ICULOS demonstrates the highest prediction ranges (RF: 0.26; XGB: 0.22) and strong inter-model agreement (0.85), supporting its role as a robust temporal marker of hospital-acquired infection risk and cumulative organ dysfunction, aligned with the clinical pathophysiology of healthcare-associated sepsis.

Vital signs show divergent model behaviors. Oxygen saturation elicits higher sensitivity from RF than from XGB, with a moderate agreement score of 0.59. Respiratory rate shows marked divergence: RF is highly sensitive, whereas XGB is nearly insensitive, which results in a very low agreement score of 0.15. Heart rate (HR) shows a moderate RF sensitivity range of 0.11, a low XGB sensitivity range of 0.03, and poor agreement of 0.31. Temperature shows minimal sensitivity for both models but high agreement (0.70).

Laboratory Biomarker Analysis: WBC shows almost equal sensitivity for both models, with a high agreement score of 0.92, confirming leukocytosis/leukopenia as a consistently recognized sepsis marker. Fibrinogen shows a high agreement score of 0.79 with low but equal sensitivity. Magnesium and AST exhibit moderate sensitivity, with agreement scores of 0.79 and 0.55, respectively. These consensus laboratory markers show that leukocytosis, coagulopathy, and electrolyte imbalance are clear indicators of sepsis across the different model architectures. BUN shows moderate RF sensitivity of 0.06 with low XGB sensitivity of 0.03 and a moderate agreement score of 0.54. Lactate shows low sensitivity for both models and a poor agreement score of 0.40.

Categorical Unit Features: The categorical unit features, Unit1 and Unit2, are used primarily as baseline adjustments by the Random Forest model but are largely ignored by XGBoost, as indicated by the minimal XGB sensitivity ranges (0.00–0.01) and very poor agreement scores (0.01–0.13).

Summary Statistics and Model Characterization: Table 11 evaluates the fundamental differences in the model behavior.

Random Forest’s broader feature utilization strategy was confirmed by its mean range of about 0.085, which is more than double that of XGBoost (0.038). The mean agreement score of 0.413 indicates moderate agreement, which is sufficient to validate the stacked ensemble approach. The strong Spearman correlation ($ρ = 0.856$, $p < 0.001$) confirms that both base models rank the features similarly, regardless of the differences in absolute magnitudes. This ranking convergence validates the computational complementarity of the ensemble components rather than serving as evidence of clinical validity, which requires prospective clinical trials.

Random Forest works as a surveillance system: it monitors the physiological parameters with fairly equal attention, is more robust to missing data, and provides broad-spectrum pattern recognition. This breadth-first monitoring approach aids emergency triage situations where laboratory results may be delayed or unavailable. XGBoost, on the other hand, follows a depth-first strategy that excels at identifying the critical markers needed for definitive diagnosis. It functions like a precision diagnostic tool focusing on highly discriminative features such as ICULOS, fibrinogen, magnesium, and AST. This complementary behavior of the base models plays a crucial role in the performance of the BayeStack model.

The moderate feature-wise agreement score of 0.413, coupled with the strong ranking correlation (Spearman $ρ = 0.856$), shows that the base models overlap on certain clinical features: they converge on clinically important features, such as ICULOS, WBC, and magnesium, and diverge on secondary markers. Depending on data availability, the meta-learner decides how much to trust each model. It implicitly learns to weight RF more heavily when laboratory data are sparse and to weight XGB more heavily when complete chemistry panels are available.

4.4.2. Individual Patient-Level Interpretability: Case Study Analysis

An explainability analysis was performed on a confirmed sepsis case (Sample 1000) to demonstrate clinical utility at the bedside. Both the models correctly classified this patient with high confidence, but their reasoning methodologies are different.

Figure 6 presents breakdown plots decomposing the predictions into feature contributions.

The breakdown plot of Random Forest indicates that its prediction is driven by distributed contributions across 19 high-impact features with a mean contribution of 0.291, including the ICULOS:Temp interaction (−0.089), BUN (−0.027), HR (+0.023), Magnesium (+0.046), and WBC (−0.027), with all other factors contributing +0.536. In contrast, XGBoost reaches near-certain predictions using primarily 5 features, with a mean contribution of 0.087 and some complex feature interactions such as the Resp:HR interaction (−0.111), Fibrinogen:SBP (−0.157), Magnesium:AST (+0.119), Lactate (+0.133), WBC (+0.133), and pH (+0.107), with minimal contribution from the remaining features (+0.026).

Table 12 ranks the top 10 contributing features for Sample 1000.

Random Forest emphasizes vital signs, namely Temp, HR, and Resp, so the model can flag issues early at the bedside, whereas XGBoost gives much more priority to laboratory biomarkers, namely Fibrinogen, Magnesium, and AST, to confirm the diagnosis with laboratory evidence. ICULOS, WBC, and Magnesium contribute strongly in both models, indicating that they are robust sepsis markers. The feature contribution comparison charts shown in Figure 7 illustrate each base model’s feature contributions to the prediction. Random Forest shows a gradually declining distribution across 19 high-impact features, while XGBoost shows a steep distribution in which 5 features account for most of the prediction. This complementarity supports robustness: if laboratory values are delayed or missing, Random Forest can provide predictions using vital signs, and if vital signs are delayed or missing, XGBoost can rely on laboratory evidence.

The contribution heatmap in Figure 8 compares the top feature contributions of both models. It is a color-coded visualization that helps clinicians identify the features that contribute to early diagnosis and the parameters that need close monitoring for treatment response. Dark red cells cluster around ICULOS, Temp, and BUN for Random Forest and around ICULOS, Fibrinogen, and Magnesium for XGBoost. Random Forest (left) demonstrates distributed feature utilization across 19 features, while XGBoost (right) exhibits importance concentrated in 5 key features.

The multi-level explainability hierarchy is designed for different stakeholders, including researchers, clinicians, and healthcare administrators. It provides context-appropriate explanations at multiple decision-making levels.

4.5. Computational Complexity and Scalability Analysis

4.5.1. Computational Complexity Analysis

With 40,336 patients and 40 clinical features, the total computational requirement is calculated using the concept of $O (n \times d \times \log (n))$ complexity, where n is the number of samples, and d is the number of features. The computational complexity of BayeStack is dominated by ensemble training with a total computational requirement of $O (2.04 \times 10^{10})$ floating-point operations.

Phase-by-phase breakdown:

Bayesian Optimization (100 iterations): $O (1.00 \times 10^{8})$ (Gaussian Process surrogate inversion dominates).
Random Forest Training (438 trees): $O (1.08 \times 10^{10})$ (tree induction dominates).
XGBoost Training (385 rounds): $O (9.50 \times 10^{9})$ (gradient computation and tree splits).
Stacking Ensemble: $O (4.03 \times 10^{5})$ (meta-feature generation and meta-model training).
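The two tree-ensemble figures above can be reproduced with simple arithmetic under two assumptions that are not stated explicitly in the text: the logarithm is taken to base 2, and the per-pass cost $n \times d \times \log_{2} (n)$ is multiplied by the number of trees or boosting rounds. The Bayesian optimization and stacking figures are taken as reported rather than derived.

```python
import math

n, d = 40_336, 40  # patients and clinical features, as reported

# Assumed per-pass cost: n * d * log2(n) operations for one tree or boosting round.
per_pass = n * d * math.log2(n)

estimates = {
    "bayesian_optimization": 1.00e8,   # taken as reported
    "random_forest": 438 * per_pass,   # 438 trees
    "xgboost": 385 * per_pass,         # 385 boosting rounds
    "stacking": 4.03e5,                # taken as reported
}
total = sum(estimates.values())

for phase, ops in estimates.items():
    print(f"{phase:>22}: {ops:.2e}")
print(f"{'total':>22}: {total:.2e}")
```

Under these assumptions the two ensemble phases account for more than 99% of the total, which is consistent with the statement that ensemble training dominates the cost.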

Empirical Runtime Performance:

Training time for full dataset: 3.92 min (235 s).
Inference time per sample: 5.83 ms.
Peak memory requirement: 102.93 MB (0.10 GB).

These computational requirements are manageable for a clinical decision support system on a standard hospital server infrastructure when retraining its model on new data every 24–48 h.

4.5.2. Scalability Analysis

A scalability analysis was performed to evaluate how well the BayeStack model scales beyond the available dataset. The analysis combined empirical measurements (1000 to 40,336 patients) with theoretical projections for larger populations (80,000–100,000 patients). The projections were developed using two complementary methods. The first is synthetic data generation from the original patient dataset, using bootstrap resampling with replacement to maintain the feature distributions. The second is computational complexity modeling based on the known algorithmic complexity $O (n \times \log (n))$, in which runtime curves are fitted to empirical measurements at different patient scales (1 K, 5 K, 10 K, 20 K and 40 K). This dual validation, which combines empirical measurements with theoretical projections, supports the expected computational behavior of the model at larger scales. Results are presented in Table 13.

The near-linear scaling pattern of BayeStack indicates that the model can handle larger patient populations without compromising computational performance. When the dataset size doubles, training time increases by about 2.1×, which is consistent with $O (n \times \log (n))$ complexity. Empirical validation on smaller subsets (1 K–40 K patients) matched the theoretical predictions very closely ($R^{2} = 0.998$), which validates the extrapolation approach. For larger healthcare systems, training would take about 8.3 min for 80,000 patients and approximately 10.5 min for 100,000 patients.
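The projected training times can be approximated with a simplified version of the complexity model. The sketch below assumes a single calibration point (the reported 3.92 min at 40,336 patients) instead of the multi-point curve fit described above, so it should be read as a consistency check rather than a reproduction of the fitted model.

```python
import math

N_REF, T_REF_MIN = 40_336, 3.92  # reported dataset size and training time (min)


def projected_minutes(n):
    """Scale the reference runtime by the n*log(n) ratio (the log base cancels)."""
    return T_REF_MIN * (n * math.log(n)) / (N_REF * math.log(N_REF))


t_80k = projected_minutes(80_000)
t_100k = projected_minutes(100_000)
doubling_factor = projected_minutes(2 * N_REF) / T_REF_MIN

print(f"80,000 patients:  {t_80k:.1f} min")
print(f"100,000 patients: {t_100k:.1f} min")
print(f"runtime factor when n doubles: {doubling_factor:.2f}x")
```

Even with this one-point calibration, the projections land close to the values quoted in the text, and the doubling factor stays slightly above 2, as expected for $n \log (n)$ growth.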

Scalability and Clinical Deployment: Training on the entire 40,336-patient dataset takes approximately 3.92 min, and an individual patient prediction is obtained in 5.83 ms. The model is therefore fast enough to process real-time patient data as it arrives in the ICU. Together, the modest computational requirements, near-linear scaling, and peak memory usage of 102.93 MB make BayeStack a practical solution for real-time clinical deployment.

4.5.3. Performance Contextualization Against PhysioNet 2019 Challenge Baselines

Table 14 reports performance metrics from existing methods evaluated on the same dataset. Notably, the original challenge focused on early sepsis prediction with predefined temporal prediction horizons [21,22], whereas BayeStack is designed for concurrent diagnostic classification at time t with a zero prediction horizon. These approaches address complementary clinical objectives: early prediction enables timely clinical intervention, whereas concurrent classification helps in diagnostic confirmation in patients already suspected of sepsis. Concurrent classification generally achieves higher performance metrics due to the increased availability of information at inference time; thus, comparisons reflect relative performance positioning rather than direct benchmarking. The substantial performance obtained by BayeStack underscores the effectiveness of the proposed Bayesian optimization and temporal aggregation framework for concurrent diagnostic classification.

Clinical Rationale for Concurrent Classification: In contrast to early prediction systems designed to anticipate sepsis prior to clinical deterioration, concurrent classification addresses the clinical need for immediate diagnostic confirmation in patients with suspected sepsis in emergency department or ICU admission contexts. The clinicians require immediate diagnostic assistance to enable timely initiation of appropriate treatment strategies in these time-critical scenarios [1,3]. The proposed approach of adopting a zero-prediction-horizon framework maximizes interpretability by utilizing clinical information available at prediction time, thereby enabling transparent decision support without assumptions about future patient trajectories. This design aligns with the study’s primary objective of establishing an interpretable baseline with strong explainability, serving as a foundation for subsequent deep learning investigations, as described in Section 4.6.

The inclusion of PhysioNet 2019 Challenge baseline results in Table 14 fulfills several objectives: (1) it demonstrates awareness of the established literature and benchmark performance on this widely used sepsis dataset, (2) it provides performance contextualization using identical data infrastructure, enabling assessment of relative method effectiveness, (3) it defines empirical performance ranges for both concurrent classification and early prediction tasks to guide realistic expectations for future algorithmic developments, and (4) it validates that the proposed Bayesian optimization and temporal aggregation framework achieves strong performance within the dataset ecosystem despite focusing on a methodologically distinct clinical objective. Collectively, these comparisons contextualize BayeStack within the PhysioNet 2019 ecosystem while explicitly acknowledging methodological differences between concurrent diagnostic classification and early warning prediction approaches.

4.6. Methodological Design Trade-Offs and Research Positioning

This work employs a temporal aggregation-based feature engineering framework combined with traditional machine learning models rather than sequential deep learning architectures. This design choice introduces important trade-offs, which are discussed below.

Interpretability vs. Sequential Modeling: Multi-window aggregation does not explicitly model temporal ordering or trajectory dynamics, which can be captured by recurrent or attention-based architectures. However, in clinical deployment contexts where regulatory compliance and clinician trust are critical, interpretability is often prioritized over modest performance gains from black-box models [54,55,56,57,58].

Computational Efficiency vs. Representation Learning: BayeStack requires 3.92 min for training and 5.83 ms per patient inference across 40,336 patients, whereas deep learning models commonly demand 10–15× longer training times, along with GPU-based deployment infrastructure. In resource-constrained healthcare environments and real-time ICU applications, computational efficiency, therefore, represents an important practical consideration [38,59].

Feature Engineering vs. End-to-End Learning: The proposed methodology relies on domain-guided temporal window selection and predefined statistical aggregation choices rather than end-to-end representation learning. Although deep learning models can automatically learn optimal temporal representations, the explicit feature engineering employed in this study facilitates clinical validation of extracted patterns and direct alignment with established stages of sepsis pathophysiology [1,3].
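
As a minimal illustration of the window-based feature engineering discussed above, the following Python sketch summarizes one hourly vital-sign series over several trailing windows ending at the current hour. The window lengths (1, 4, 24, and 48 h) mirror the ablation rows in Table 6, but the specific aggregates (mean, minimum, maximum, standard deviation), the function name `aggregate_windows`, and the example values are assumptions made for illustration and do not reproduce the authors' implementation.

```python
from statistics import mean, pstdev


def aggregate_windows(values, t, windows=(1, 4, 24, 48)):
    """Summarize hourly measurements over trailing windows ending at hour t.

    values:  hourly measurements (index = hour since ICU admission);
             None marks a missing measurement
    t:       current hour index (inclusive); no future values are used
    windows: trailing window lengths in hours (illustrative assumption)
    """
    features = {}
    for w in windows:
        start = max(0, t - w + 1)
        window = [v for v in values[start:t + 1] if v is not None]
        if not window:  # nothing observed in this window
            continue
        features[f"mean_{w}h"] = mean(window)
        features[f"min_{w}h"] = min(window)
        features[f"max_{w}h"] = max(window)
        features[f"std_{w}h"] = pstdev(window)
    return features


# Hypothetical heart-rate series (beats/min), one value per hour.
hr = [80, 82, None, 90, 95, 110]
feats = aggregate_windows(hr, t=5)
```

Because every window ends at hour t, the resulting features use only information available at prediction time, which is consistent with the zero-prediction-horizon setting described in this section.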

Traditional ML as First-Phase Baseline: The present study establishes an interpretable performance baseline using traditional machine learning supported by comprehensive explainability analysis. The resulting benchmarks provide a foundation for subsequent second-phase deep learning investigations on the MIMIC-IV dataset, facilitating evaluation of the performance–interpretability trade-off [60].

Clinical Deployment Considerations: Explainability methods for tree-based ensembles, such as feature importance and partial dependence analysis, are well supported by the clinical literature and regulatory frameworks. In contrast, deep learning explainability methods, such as attention-weight mechanisms and saliency maps, remain areas of ongoing investigation with limited clinical validation. This work, therefore, prioritizes immediate deployability with established interpretability methods [61,62,63].

Concurrent Classification vs. Early Warning Systems: The proposed framework conducts concurrent sepsis classification at time t with zero prediction horizon, emphasizing real-time diagnostic confirmation rather than early warning prediction. The temporal features, including ICULOS, provide contextual information on disease progression during the ICU stay, but they do not enable lead-time quantification for future sepsis onset. This design prioritizes diagnostic accuracy and maximum interpretability to enable immediate clinical decision-making. The subsequent phase will address early sepsis prediction with explicit lead times using sequential modeling architectures that capture temporal dependencies for early warning applications.

Collectively, these trade-offs represent a practical methodological choice aimed at establishing rigorous and interpretable baseline models prior to pursuing deep learning extensions, which is in alignment with best practices in clinical AI development.

5. Conclusions

This work presents BayeStack, an interpretability-focused framework for sepsis classification that establishes baseline performance through temporal knowledge extraction using traditional machine learning with multi-level explainability. The framework achieves an AUROC of 0.99 with balanced sensitivity (0.97) and specificity (0.97). It explores the high-dimensional hyperparameter space by building a Gaussian process surrogate model with an expected improvement acquisition function and tunes the base classifiers to maximize the AUROC. The key contributions of this study are (i) temporal feature extraction across multiple time windows to capture both acute physiological responses and gradual clinical changes; (ii) a multi-level interpretability framework providing global feature importance, population-level partial dependence profile (PDP) analysis, and individual patient-level breakdown explanations; and (iii) the quantification of complementary model behaviors, revealing Random Forest’s distributed feature utilization and XGBoost’s concentrated biomarker focus.
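
To make the expected improvement acquisition concrete, the sketch below evaluates its standard closed form for a maximization objective (here, AUROC) given a Gaussian posterior mean and standard deviation at a candidate hyperparameter setting. The function name, the optional exploration margin `xi`, and the numeric inputs are assumptions for illustration; the sketch does not fit a Gaussian process and is not the authors' optimization code.

```python
import math


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement for maximization under a Gaussian posterior.

    mu, sigma: surrogate posterior mean and standard deviation at a candidate
    best:      best objective value (e.g., AUROC) observed so far
    xi:        optional exploration margin (assumed; 0 gives plain EI)
    """
    gain = mu - best - xi
    if sigma <= 0.0:
        # No posterior uncertainty: improvement is deterministic.
        return max(gain, 0.0)
    z = gain / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return gain * cdf + sigma * pdf


# Hypothetical posterior summaries for two hyperparameter candidates
# that share the same predicted AUROC but differ in uncertainty.
ei_confident = expected_improvement(mu=0.970, sigma=0.001, best=0.975)
ei_uncertain = expected_improvement(mu=0.970, sigma=0.020, best=0.975)
```

With equal predicted means below the incumbent, the more uncertain candidate receives the larger expected improvement, which is how the acquisition function balances exploitation against exploration of the hyperparameter space.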

Comparative evaluation against existing methods on the PhysioNet Challenge 2019 dataset demonstrates substantial improvements over baseline approaches. The multi-level interpretability analysis identifies ICULOS, vital signs, such as HR, O₂Sat, and Resp, and laboratory biomarkers, such as WBC, Fibrinogen, Magnesium, and AST, as key predictors of sepsis. Computational complexity analysis and ablation studies support the suitability of the proposed BayeStack model for building reliable AI-assisted healthcare decision support systems.

The results of this study are promising, but the work has limitations, as it focuses only on structured data from the PhysioNet 2019 Challenge dataset. To improve the generalizability across different patient populations, a multi-institutional validation is currently under consideration. The temporal aggregation approach that emphasizes interpretability rather than explicit sequential trajectory learning in this study will be extended in the second phase by incorporating deep learning architectures that capture sequential dependencies and rate-of-change dynamics. Future work will involve incorporating sequential models that integrate multi-modal data and extending the model for predictive tasks such as predicting the mortality rate, assessing the risk factors, and forecasting the treatment response. The interpretable benchmark established in this work will provide a foundation for evaluating the deep learning extensions in the next phase, enabling quantification of the performance gains from sequential modeling and assessment of interpretability trade-offs. Ultimately, this approach has the potential to improve patient outcomes in sepsis treatment by balancing predictive accuracy with multi-level clinical interpretability, ensuring a transparent, dependable and clinically deployable machine learning system.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/make8060150/s1.

Author Contributions

Conceptualization, A.G. and K.L.N.; methodology, A.G. and K.L.N.; software, A.G.; validation, A.G., K.L.N., A.S.M.S.P. and S.R.; formal analysis, A.G.; investigation, A.G.; resources, K.L.N.; data curation, A.G.; writing—original draft preparation, A.G.; writing—review and editing, K.L.N., A.S.M.S.P. and S.R.; visualization, A.G.; supervision, K.L.N., A.S.M.S.P. and S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. The study used publicly available de-identified data.

Data Availability Statement

The dataset used in this study is publicly available from the PhysioNet 2019 Sepsis Challenge repository (https://physionet.org/content/challenge-2019/, accessed on 9 March 2023). All raw data, preprocessed datasets, and challenge specifications are maintained by PhysioNet and can be accessed following their data use agreement requirements. The complete implementation code, including data preprocessing pipelines, hyperparameter optimization routines, baseline model implementations, and interpretability analysis tools, is available from the authors upon reasonable request.

Acknowledgments

The authors would like to extend their heartfelt thanks to Mata Amritanandamayi Devi for serving as a guiding force and a source of inspiration throughout this research journey. Special thanks are owed to Amrita Vishwa Vidyapeetham, Amritapuri campus, for offering the essential facilities and creating the right environment to complete this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUROC	Area Under the Receiver Operating Characteristic Curve
RF	Random Forest
XGB	XGBoost
PDP	Partial Dependence Profile
SMOTE-ENN	Synthetic Minority Over-sampling Technique with Edited Nearest Neighbors
ICU	Intensive Care Unit
SOFA	Sequential Organ Failure Assessment
ML	Machine Learning
EI	Expected Improvement
SHAP	SHapley Additive exPlanations
LIME	Local Interpretable Model-agnostic Explanations

References

1. Singer, M.; Deutschman, C.S.; Seymour, C.W.; Shankar-Hari, M.; Annane, D.; Bauer, M.; Bellomo, R.; Bernard, G.R.; Chiche, J.D.; Coopersmith, C.M.; et al. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). JAMA 2016, 315, 801–810.
2. Ho, B.-S.; Lee, Y.-H.W.; Lin, Y.-B. Impact of hourly serial SOFA score on signaling emerging sepsis. Inform. Med. Unlocked 2022, 31, 100999.
3. Kamath, S.; Altaq, H.H.; Abdo, T. Management of Sepsis and Septic Shock: What Have We Learned in the Last Two Decades? Microorganisms 2023, 11, 2231.
4. Calvert, J.S.; Price, D.A.; Chettipally, U.K.; Barton, C.W.; Feldman, M.D.; Hoffman, J.L.; Jay, M.; Das, R. A computational approach to early sepsis detection. Comput. Biol. Med. 2016, 74, 69–73.
5. Selcuk, M.; Koc, O.; Kestel, A.S. The prediction power of machine learning on estimating the sepsis mortality in the intensive care unit. Inform. Med. Unlocked 2022, 28, 100861.
6. Mohamed, A.K.S.; Mehta, A.A.; James, P. Predictors of mortality of severe sepsis among adult patients in the medical Intensive Care Unit. Lung India 2017, 34, 330–335.
7. Rangan, E.S.; Pathinarupothi, R.K.; Anand, K.J.S.; Snyder, M.P. Performance effectiveness of vital parameter combinations for early warning of sepsis exhaustive study using machine learning. JAMIA Open 2022, 5, ooac080.
8. Scherpf, M.; Gräßer, F.; Malberg, H.; Zaunseder, S. Predicting sepsis with a recurrent neural network using the MIMIC III database. Comput. Biol. Med. 2019, 113, 103395.
9. Kam, H.J.; Kim, H.Y. Learning representations for the early detection of sepsis with deep neural networks. Comput. Biol. Med. 2017, 89, 248–255.
10. Chen, Q.; Li, R.; Lin, C.; Lai, C.; Chen, D.; Qu, H.; Huang, Y.; Lu, W.; Tang, Y.; Li, L. Transferability and interpretability of the sepsis prediction models in the intensive care unit. BMC Med. Inform. Decis. Mak. 2022, 22, 343.
11. Liu, Z.; Shu, W.; Li, T.; Zhang, X.; Chong, W. Interpretable machine learning for predicting sepsis risk in emergency triage patients. Sci. Rep. 2025, 15, 887.
12. He, B.; Qiu, Z. Development and validation of an interpretable machine learning for mortality prediction in patients with sepsis. Front. Artif. Intell. 2024, 7, 1348907.
13. Hu, C.; Li, L.; Huang, W.; Wu, T.; Xu, Q.; Liu, J.; Hu, B. Interpretable Machine Learning for Early Prediction of Prognosis in Sepsis: A Discovery and Validation Study. Infect. Dis. Ther. 2022, 11, 1117–1132.
14. Zilker, S.; Weinzierl, S.; Kraus, M.; Zschech, P.; Matzner, M. A machine learning framework for interpretable predictions in patient pathways: The case of predicting ICU admission for patients with symptoms of sepsis. Health Care Manag. Sci. 2024, 27, 136–167.
15. Stylianides, C.; Nicolaou, A.; Sulaiman, W.A.; Alexandropoulou, C.A.; Panagiotopoulos, I.; Karathanasopoulou, K.; Dimitrakopoulos, G.; Kleanthous, S.; Politi, E.; Ntalaperas, D.; et al. AI Advances in ICU with an Emphasis on Sepsis Prediction: An Overview. Mach. Learn. Knowl. Extr. 2025, 7, 6.
16. Zhang, G.; Shao, F.; Yuan, W.; Wu, J.; Qi, X.; Gao, J.; Shao, R.; Tang, Z.; Wang, T. Predicting sepsis in-hospital mortality with machine learning: A multi-center study using clinical and inflammatory biomarkers. Eur. J. Med. Res. 2024, 29, 156.
17. Islam, K.R.; Prithula, J.; Kumar, J.; Tan, T.L.; Reaz, M.B.I.; Sumon, M.S.I.; Chowdhury, M.E.H. Machine Learning-Based Early Prediction of Sepsis Using Electronic Health Records: A Systematic Review. J. Clin. Med. 2023, 12, 5658.
18. Bignami, E.G.; Berdini, M.; Panizzi, M.; Domenichetti, T.; Bezzi, F.; Allai, S.; Damiano, T.; Bellini, V. Artificial Intelligence in Sepsis Management: An Overview for Clinicians. J. Clin. Med. 2025, 14, 286.
19. Prithula, J.; Islam, K.R.; Kumar, J.; Tan, T.L.; Reaz, M.B.I.; Rahman, T.; Zughaier, S.M.; Khan, M.S.; Murugappan, M.; Chowdhury, M.E. A novel classical machine learning framework for early sepsis prediction using electronic health record data from ICU patients. Comput. Biol. Med. 2025, 184, 109284.
20. Goldberger, A.; Amaral, L.; Glass, L.; Hausdorff, J.; Ivanov, P.C.; Mark, R.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 2000, 101, e215–e220, RRID:SCR_007345.
21. Reyna, M.A.; Josef, C.S.; Jeter, R.; Shashikumar, S.P.; Westover, M.B.; Nemati, S.; Clifford, G.D.; Sharma, A. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge. Crit. Care Med. 2019, 48, 210–217.
22. Reyna, M.; Josef, C.; Jeter, R.; Shashikumar, S.; Moody, B.; Westover, M.B.; Sharma, A.; Nemati, S.; Clifford, G.D. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). PhysioNet 2019.
23. Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV (version 2.2). PhysioNet 2023.
24. Pollard, T.; Johnson, A.; Raffa, J.; Celi, L.A.; Badawi, O.; Mark, R. eICU Collaborative Research Database (version 2.0). PhysioNet 2019, RRID:SCR_007345.
25. Strickler, E.A.; Thomas, J.; Thomas, J.P.; Benjamin, B.; Shamsuddin, R. Exploring a global interpretation mechanism for deep learning networks when predicting sepsis. Sci. Rep. 2023, 13, 3067.
26. Murugesan, I.; Murugesan, K.; Balasubramanian, L.; Arumugam, M. Interpretation of artificial intelligence algorithms in the prediction of sepsis. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
27. Narayanaswamy, L.; Garg, D.; Narra, B.; Narayanswamy, R. Machine learning algorithmic and system level considerations for early prediction of sepsis. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
28. Alfaras, M.; Varandas, R.; Gamboa, H. Ring-topology echo state networks for ICU sepsis classification. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
29. Deogire, A. A Low Dimensional Algorithm for Detection of Sepsis From Electronic Medical Record Data. In 2019 Computing in Cardiology (CinC); IEEE: Piscataway, NJ, USA, 2019.
30. Moahmed, T.A.; El Gayar, N.; Atiya, A.F. Forward and backward forecasting ensembles for the estimation of time series missing data. In Artificial Neural Networks in Pattern Recognition, Proceedings of the 6th IAPR TC 3 International Workshop, ANNPR 2014, Montreal, QC, Canada, 6–8 October 2014; Springer International Publishing: Cham, Switzerland, 2014; pp. 93–105.
31. Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524.
32. Udilă, A.; Ionescu, A.; Katsifodimos, A. Encoding Methods for Categorical Data: A Comparative Analysis for Linear Models, Decision Trees, and Support Vector Machines; Technical Report; Delft University of Technology: Delft, The Netherlands, 2023; Available online: https://repository.tudelft.nl/file/File_9e5e5225-4c53-4362-862e-1b7072e82b6a (accessed on 30 July 2023).
33. Lamari, M.; Azizi, N.; Hammami, N.E.; Boukhamla, A.; Cheriguene, S.; Dendani, N.; Benzebouchi, N.E. SMOTE–ENN-based data sampling and improved dynamic ensemble selection for imbalanced medical data classification. In Advances on Smart and Soft Computing: Proceedings of ICACIn 2020; Springer: Singapore, 2021; pp. 84–93.
34. Kaushik, S.; Choudhury, A.; Sheron, P.K.; Dasgupta, N.; Natarajan, S.; Pickett, L.A.; Dutt, V. AI in healthcare: Time-series forecasting using statistical, neural, and ensemble architectures. Front. Big Data 2020, 3, 4.
35. Regier, P.; Duggan, M.; Myers-Pigg, A.; Ward, N. Effects of random forest modeling decisions on biogeochemical time series predictions. Limnol. Oceanogr. Methods 2023, 21, 40–52.
36. Tyralis, H.; Papacharalampous, G. Variable selection in time series forecasting using random forests. Algorithms 2017, 10, 114.
37. Liang, J.; Pan, W.S.Y.; Yang, Z.-H. Characterization-based Q–Q plots for testing multinormality. Stat. Probab. Lett. 2004, 70, 183–190.
38. Anjana, G.; Nisha, K.L.; Arun Sankar, M.S. Improving sepsis classification performance with artificial intelligence algorithms: A comprehensive overview of healthcare applications. J. Crit. Care 2024, 83, 154815.
39. Gholamzadeh, M.; Abtahi, H.; Safdari, R. Comparison of different machine learning algorithms to classify patients suspected of having sepsis infection in the intensive care unit. Inform. Med. Unlocked 2023, 38, 101236.
40. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
41. Lin, Y.; Gao, J.; Chen, L.; Hong, Y.; Li, M.; Chen, P.; Shang, X. An interpretable XGBoost model for risk prediction of progression from sepsis-associated acute kidney injury to chronic kidney disease. Inform. Med. Unlocked 2025, 58, 101685.
42. Wang, X.; Jin, Y.; Schmitt, S.; Olhofer, M. Recent advances in Bayesian optimization. ACM Comput. Surv. 2023, 55, 1–36.
43. Awal, M.A.; Masud, M.; Hossain, M.S.; Bulbul, A.A.M.; Mahmud, S.H.; Bairagi, A.K. A novel bayesian optimization-based machine learning framework for COVID-19 detection from inpatient facility data. IEEE Access 2021, 9, 10263–10281.
44. Zheng, J.; Zhang, Z.; Wang, J.; Zhao, R.; Liu, S.; Yang, G.; Liu, Z.; Deng, Z. Metabolic syndrome prediction model using Bayesian optimization and XGBoost based on traditional Chinese medicine features. Heliyon 2023, 9, e22727.
45. Lacoste, A.; Larochelle, H.; Laviolette, F.; Marchand, M. Sequential model-based ensemble optimization. arXiv 2014, arXiv:1402.0796.
46. Wu, J.; Chen, X.Y.; Zhang, H.; Xiong, L.D.; Lei, H.; Deng, S.H. Hyperparameter optimization for machine learning models based on Bayesian optimization. J. Electron. Sci. Technol. 2019, 17, 26–40.
47. Sabbatella, A.; Ponti, A.; Candelieri, A.; Archetti, F. Bayesian Optimization Using Simulation-Based Multiple Information Sources over Combinatorial Structures. Mach. Learn. Knowl. Extr. 2024, 6, 2232–2247.
48. Lim, Y.-F.; Ng, C.K.; Vaitesswar, U.S.; Hippalgaonkar, K. Extrapolative Bayesian optimization with Gaussian process and neural network ensemble surrogate models. Adv. Intell. Syst. 2021, 3, 2100101.
49. Nour, M.; Senturk, U.; Polat, K. Diagnosis and classification of Parkinson’s disease using ensemble learning and 1D-PDCovNN. Comput. Biol. Med. 2023, 161, 107031.
50. Kalagotla, S.K.; Gangashetty, S.V.; Giridhar, K. A novel stacking technique for prediction of diabetes. Comput. Biol. Med. 2021, 135, 104554.
51. Prakash, J.A.; Asswin, C.R.; Ravi, V.; Sowmya, V.; Soman, K.P. Pediatric pneumonia diagnosis using stacked ensemble learning on multi-model deep CNN architectures. Multimed. Tools Appl. 2023, 82, 21311–21351.
52. Alqahtani, A.F.; Ilyas, M. An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying. Mach. Learn. Knowl. Extr. 2024, 6, 156–170.
53. Wu, T.; Zhang, W.; Jiao, X.; Guo, W.; Hamoud, Y.A. Evaluation of stacking and blending ensemble learning methods for estimating daily reference evapotranspiration. Comput. Electron. Agric. 2021, 184, 106039.
54. Mienye, I.D.; Obaido, G.; Jere, N.; Mienye, E.; Aruleba, K.; Emmanuel, I.D.; Ogbuokiri, B. A survey of explainable artificial intelligence in healthcare: Concepts, applications, and challenges. Inform. Med. Unlocked 2024, 51, 101587.
55. Srinivasu, P.N.; Sandhya, N.; Jhaveri, R.H.; Raut, R. From blackbox to explainable AI in healthcare: Existing tools and case studies. Mob. Inf. Syst. 2022, 2022, 8167821.
56. Reghu, L.; Ashok, G.; Menon, R.R.K. Explainable AI for Health Care based Retrieval System. Grenze Int. J. Eng. Technol. 2024, 10, 1915.
57. Talin, I.A.; Abid, M.H.; Khan, M.A.M.; Kee, S.H.; Nahid, A.A. Finding the influential clinical traits that impact on the diagnosis of heart disease using statistical and machine-learning techniques. Sci. Rep. 2022, 12, 20199.
58. Raţiu, A.; Pop, E.L. Machine Learning in Clinical Decision Making: Applications, Data Limitations and Multidisciplinary Perspectives. Appl. Sci. 2026, 16, 785.
59. Mahmud, F.; Quamruzzaman, M.; Sanka, A.I.; Cheung, R.C.; Chowdhury, M.H. Interpretable machine learning-based real-time sepsis diagnosis. Sci. Rep. 2026, 16, 36945.
60. Ristori, M.V.; Ruffini, F.; Spoto, S.; Cammarata, R.; La Vaccara, V.; Bani, L.; Caputo, D.; Soda, P.; Guarrasi, V.; Angeletti, S. Machine Learning Models for Sepsis: From Early Detection to Short-and Long-Term Prognosis. Int. J. Mol. Sci. 2026, 27, 2721.
61. Niazi, S.K. A Critical Review of the FDA’s Draft Guidance on Artificial Intelligence in Drug and Biological Product Regulation. J. Chem. 2026, 2026, 5202999.
62. Do, D.K.; Rockenschaub, P.; Boie, S.D.; Kumpf, O.; Volk, H.D.; Balzer, F.; Von Dincklage, F.; Lichtner, G. The Impact of Evaluation Strategy on Sepsis Prediction Model Performance Metrics in Intensive Care Data: Retrospective Cohort Study. J. Med. Internet Res. 2026, 28, e72083.
63. Oliveira, T.Q.; Carvalho, L.A.; Sousa, F.R.; Filho, J.B.; Oliveira, K.F.; Tavares, D.A. Responsible AI for Sepsis Prediction: Bridging the Gap Between Machine Learning Performance and Clinical Trust. J. Clin. Med. 2026, 15, 2251.

Figure 1. Proposed workflow for optimized sepsis classification.

Figure 2. SMOTE-ENN algorithm used for data balancing. Notation: $y_{i}$: minority class instance; K: nearest neighbors count; $\hat{y}$: selected neighbor; $δ \in [0, 1]$: interpolation factor; $y_{new} = y_{i} + (\hat{y} - y_{i}) δ$: synthetic sample.
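
The interpolation rule in the Figure 2 notation can be sketched in a few lines of Python. This covers only the SMOTE synthesis step for a single instance and a single already-selected neighbor; the nearest-neighbor search and the Edited Nearest Neighbors cleaning step are omitted. The function name `smote_sample` and the example vectors are hypothetical.

```python
import random


def smote_sample(y_i, y_hat, delta):
    """Synthetic point on the segment between y_i and a selected neighbor y_hat.

    Implements y_new = y_i + (y_hat - y_i) * delta, applied per feature.
    """
    return [a + (b - a) * delta for a, b in zip(y_i, y_hat)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
y_i = [1.0, 2.0]         # hypothetical minority-class instance
y_hat = [3.0, 6.0]       # hypothetical selected nearest neighbor
delta = rng.random()     # interpolation factor drawn from [0, 1)
y_new = smote_sample(y_i, y_hat, delta)
midpoint = smote_sample(y_i, y_hat, 0.5)
```

Because the interpolation factor lies between 0 and 1, every synthetic sample falls on the line segment joining the minority instance and its neighbor.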

Figure 3. Multi-level interpretability framework. The framework analyzes ensemble models through global feature importance (Level 1), population-level partial dependence profile (PDP) analysis with Spearman correlation agreement metrics (Level 2), and individual patient breakdown analysis with contribution heatmaps (Level 3), generating clinical decision support outputs.

Figure 4. Representative distribution analysis using histograms and Q-Q plots for selected clinical features. (a) Vital signs: HR, O₂Sat, and Temp. (b) Laboratory values: Fibrinogen, pH, and PaCO₂. Complete distribution analysis for all features is provided in Section S2.

Figure 5. Feature importance radar plot depicting the relative contribution of clinical features to sepsis prediction (refer to Table 2 for the details of the features used).

Figure 6. Breakdown plots showing feature contributions for Sample 1000. (a) Random Forest base model. (b) XGBoost base model.

Figure 7. Comparative feature contribution analysis for Sample 1000. (Left): Comprehensive feature utilization of Random Forest. (Right): Focused feature selection of XGBoost.

Figure 8. Feature contribution heatmap for Patient 1000. The color-coded matrix displays contribution values, with darker red indicating higher magnitudes.

Table 1. Comparison of existing works with proposed BayeStack framework.

Aspect	Existing Works	BayeStack (Proposed)
Hyperparameter Optimization	Manual tuning or grid search [10,13,19]	Bayesian optimization with Gaussian process surrogate model
Temporal Modeling	Single time-point or limited windows [11,14]	Multi-window aggregation (1–48 h) capturing disease progression
Interpretability Scope	Single-level (global or local) [12,13]	Three-level hierarchy (global, population, individual)
Ensemble Justification	Empirical performance gains [19]	Quantified complementary behavior ( $ρ = 0.856$ , agreement analysis)
Clinical Actionability	Risk scores only [15,16,17]	Thresholds, interactions, monitoring priorities
Sensitivity–Specificity Balance	Emphasis on a single metric [13,18]	Balanced optimization (both 0.97) via AUROC maximization

Table 2. Clinical time series data from the dataset.

Feature Labels	Detailed Features
Vital Features	HR, O₂Sat, Temperature, SBP, MAP, DBP, RR, EtCO₂
Laboratory Values	Base Excess, HCO₃, FiO₂, PaCO₂, SaO₂, AST, BUN, Alkalinephos, PTT, WBC, Calcium, Chloride, pH, Hct, Creatinine, Direct bilirubin, Glucose, Phosphate, Magnesium, Total bilirubin, Lactate, Hgb, Troponin I, Potassium, Fibrinogen, Platelets
Demographics	Age, Gender, Unit 1 (MICU), Unit 2 (SICU), Hospital admission time, ICULOS
Outcome	Sepsis Label (1-Septic, 0-Non-septic)

Table 3. Cross-validation mean accuracy: Classifier combinations for sepsis dataset.

Classifier Combination	Mean Accuracy
Random Forest + XGBoost	0.9961
Random Forest + Logistic Regression	0.9924
Random Forest + KNN	0.9940
Random Forest + Naïve Bayes	0.9498
Logistic Regression + KNN	0.9823
Naïve Bayes + KNN	0.9751
XGBoost + KNN	0.9924
XGBoost + Logistic Regression	0.9346
XGBoost + Naïve Bayes	0.8824
Logistic Regression + Naïve Bayes	0.6467

Table 4. Base model classifier parameters.

Classifier	Parameter	Range	Optimal
Random Forest	Number of Trees	(100, 500)	438
	Tree Depth	(1, 50)	36
	Split Threshold	(2, 20)	19
	Leaf Size	(1, 20)	20
XGBoost	Number of Trees	(100, 500)	385
	Tree Depth	(1, 20)	20
	Learning Rate	(0.001, 1)	0.679
	Child Node Samples	(0.001, 20)	14.4

Table 5. Performance evaluation results with 95% bootstrap confidence intervals.

Metric	RF		XGBoost		BayeStack
	Value	95% CI	Value	95% CI	Value	95% CI
Specificity	0.96	[0.954–0.966]	0.96	[0.954–0.966]	0.97	[0.964–0.976] ***
Sensitivity	0.94	[0.932–0.948]	0.98	[0.974–0.986]	0.97	[0.964–0.976] **
Accuracy	0.95	[0.942–0.958]	0.97	[0.962–0.978]	0.97	[0.964–0.976]
Precision	0.94	[0.932–0.948]	0.98	[0.974–0.986]	0.97	[0.964–0.976] **
Recall	0.96	[0.952–0.968]	0.96	[0.952–0.968]	0.97	[0.964–0.976] *
F1-Score	0.95	[0.942–0.958]	0.97	[0.962–0.978]	0.97	[0.964–0.976]
AUC-ROC	0.95	[0.942–0.958]	0.97	[0.962–0.978]	0.99	[0.984–0.996] ***

Note: *** $p < 0.001$, ** $p < 0.01$, * $p < 0.05$. RF = Random Forest; CI = Confidence Interval.
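
Table 5 reports 95% bootstrap confidence intervals. A minimal sketch of how such an interval can be obtained for one metric is given below. It uses the percentile method with 1000 resamples; both choices, along with the helper names and the synthetic labels, are assumptions, since the bootstrap variant and resample count are not specified here.

```python
import random


def sensitivity(y_true, y_pred):
    """True-positive rate; returns 0.0 if the sample contains no positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pos = sum(y_true)
    return tp / pos if pos else 0.0


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a classification metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    k = int(n_boot * alpha / 2)  # number of resamples cut from each tail
    return stats[k], stats[n_boot - k - 1]


# Synthetic labels and predictions (not the study data).
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 97 + [0] * 3 + [0] * 97 + [1] * 3
point = sensitivity(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, sensitivity)
```

The same helper can be reused for specificity, precision, or AUROC by passing a different metric function.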

Table 6. Ablation study: Impact of temporal aggregation window on performance.

Temporal Window	AUROC	Sensitivity	Specificity	F1-Score
Single timepoint	0.88	0.90	0.87	0.88
1 h window	0.91	0.92	0.90	0.91
1–4 h	0.94	0.94	0.93	0.93
1–24 h	0.97	0.96	0.96	0.96
1–48 h (Full)	0.99	0.97	0.97	0.97

Table 7. Ablation study: Impact of data balancing strategy on performance.

Balancing Strategy	AUROC	Sensitivity	Specificity	F1-Score
No balancing	0.66	0.52	0.88	0.68
SMOTE only	0.93	0.92	0.94	0.90
ENN only	0.92	0.91	0.93	0.89
SMOTE-ENN	0.99	0.97	0.97	0.97

Table 8. Comprehensive model profile comparison: Random Forest vs. XGBoost across all test samples.

Feature	RF Range	XGB Range	RF Min	RF Max	XGB Min	XGB Max	RF Mean	XGB Mean	Agreement
ICULOS	0.26	0.22	0.30	0.55	0.44	0.66	0.50	0.61	0.85
O₂Sat	0.21	0.12	0.36	0.57	0.56	0.69	0.45	0.57	0.59
Resp	0.23	0.04	0.37	0.60	0.57	0.61	0.42	0.58	0.15
HR	0.11	0.03	0.34	0.44	0.55	0.58	0.41	0.57	0.31
Magnesium	0.07	0.06	0.42	0.49	0.54	0.60	0.42	0.56	0.79
TroponinI	0.09	0.03	0.35	0.45	0.54	0.57	0.39	0.54	0.31
FiO₂	0.10	0.02	0.32	0.42	0.54	0.56	0.41	0.56	0.21
WBC	0.05	0.05	0.38	0.43	0.52	0.57	0.40	0.54	0.92
Fibrinogen	0.05	0.04	0.41	0.46	0.53	0.57	0.42	0.55	0.79
BaseExcess	0.06	0.03	0.39	0.45	0.55	0.58	0.40	0.56	0.46
Temp	0.05	0.04	0.39	0.44	0.55	0.58	0.40	0.56	0.70
SBP	0.08	0.01	0.42	0.49	0.57	0.58	0.44	0.57	0.18
BUN	0.06	0.03	0.35	0.41	0.54	0.58	0.37	0.57	0.54
Bilirubin (D)	0.06	0.03	0.40	0.46	0.55	0.57	0.41	0.55	0.41
Bilirubin (T)	0.06	0.03	0.41	0.47	0.55	0.58	0.42	0.56	0.46
Unit1_0.0	0.07	0.01	0.47	0.54	0.58	0.59	0.51	0.59	0.13
pH	0.06	0.01	0.40	0.46	0.57	0.58	0.42	0.57	0.24
AST	0.05	0.03	0.39	0.44	0.54	0.57	0.40	0.56	0.55
Lactate	0.05	0.02	0.41	0.46	0.56	0.58	0.43	0.56	0.40
PaCO₂	0.05	0.02	0.42	0.47	0.56	0.58	0.44	0.57	0.37
Unit2_1.0	0.06	0.00	0.47	0.54	0.58	0.58	0.51	0.58	0.02
Unit1_1.0	0.04	0.00	0.50	0.54	0.58	0.59	0.52	0.58	0.11
Unit2_0.0	0.04	0.00	0.50	0.54	0.58	0.58	0.52	0.58	0.01

Range = Max − Min contribution; Agreement = RF–XGB correlation (1.0 = perfect). Column definitions: Min/Max = Minimum/Maximum prediction probability across feature values (clinical thresholds); Mean = Average prediction probability across all feature values (baseline risk); Agreement = Pearson correlation between RF and XGB predictions (1.0 = perfect consensus); Bold values: High ranges (>0.10), strong agreement (>0.75), or extreme divergence (<0.20).
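
The Range and Agreement columns defined above can be computed from two partial dependence profiles evaluated on a shared feature grid, as in the following sketch. The profile values are invented for illustration (their endpoints loosely echo the ICULOS row, but the interior points are hypothetical), and the helper names are assumptions.

```python
import math


def pdp_range(profile):
    """Range of a profile: maximum minus minimum predicted probability."""
    return max(profile) - min(profile)


def pearson(a, b):
    """Pearson correlation between two equally long profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical PDP curves for one feature on a shared grid of feature values.
rf_profile = [0.30, 0.35, 0.42, 0.50, 0.55]
xgb_profile = [0.44, 0.47, 0.55, 0.60, 0.66]

rf_range = pdp_range(rf_profile)
xgb_range = pdp_range(xgb_profile)
agreement = pearson(rf_profile, xgb_profile)
```

A high agreement value indicates that both models respond to the feature in the same direction across its range, even when the magnitude of the response (the range) differs between them.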

Table 9. Evidence-based clinical thresholds extracted from partial dependence profile analysis.

Feature	Critical Threshold	ΔP (RF)	ΔP (XGB)	Clinical Interpretation	Priority
ICULOS	>48–50 h	0.26	0.22	Cumulative hospital-acquired infection risk	High
WBC	<4 or >12–15 ×10³/µL	0.05	0.05	Leukopenia/leukocytosis (SIRS criteria)	High
Resp	>20–24 breaths/min	0.23	0.04	Tachypnea indicating respiratory distress	High
O₂Sat	<92–94%	0.21	0.12	Hypoxemia/respiratory compromise	High
Temp	<36 °C or >38 °C	0.05	0.04	Hypothermia/fever (infection indicator)	Medium
Lactate	>2–4 mmol/L	0.05	0.02	Tissue hypoperfusion/metabolic stress	High
Fibrinogen	<200 or >400 mg/dL	0.05	0.04	Coagulopathy/acute phase response	High
Magnesium	<1.5 or >2.5 mg/dL	0.07	0.06	Electrolyte imbalance	Medium
BUN	>20–30 mg/dL	0.06	0.03	Renal dysfunction indicator	Medium
HR	>100–110 bpm	0.11	0.03	Tachycardia (systemic stress response)	Medium
pH	<7.35 or >7.45	0.06	0.01	Metabolic acidosis/alkalosis	Medium
BaseExcess	<−2 or >2 mEq/L	0.06	0.03	Acid-base imbalance	Medium

ΔP = Prediction probability range from PDP analysis (see Table 8). Priority: High = immediate clinical action required; Medium = enhanced monitoring warranted. Thresholds represent inflection points where sepsis probability shows marked changes.

Table 10. Feature ablation study: Contribution of key temporal and laboratory features.

Model Configuration	AUROC	Sensitivity	Specificity	Δ AUROC
Full Model (Baseline)	0.99	0.97	0.97	-
Without ICULOS	0.96	0.94	0.96	−0.03
Without WBC	0.97	0.95	0.96	−0.02
Without Fibrinogen	0.97	0.96	0.97	−0.02
Without Lactate	0.98	0.96	0.97	−0.01
Without O₂Sat	0.97	0.95	0.97	−0.02

The ablation analysis shows that removing ICULOS lowers AUROC by 0.03 (from 0.99 to 0.96), a drop comparable to that of individual laboratory markers (WBC and Fibrinogen: 0.02 each). This supports its inclusion as one of the complementary features among the 40 clinical variables.
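The Δ AUROC column of Table 10 is simple arithmetic on the reported AUROC values, which the short sketch below reproduces. The AUROC numbers are copied from Table 10; the helper name `auroc_deltas` is hypothetical and exists only for illustration.

```python
# AUROC values copied from Table 10.
FULL_AUROC = 0.99
ABLATED_AUROC = {
    "ICULOS": 0.96,
    "WBC": 0.97,
    "Fibrinogen": 0.97,
    "Lactate": 0.98,
    "O2Sat": 0.97,
}


def auroc_deltas(full, ablated):
    """Return the AUROC change (ablated minus full) for each removed feature."""
    return {name: round(value - full, 2) for name, value in ablated.items()}


deltas = auroc_deltas(FULL_AUROC, ABLATED_AUROC)

# Sort so that the feature whose removal hurts most comes first.
ranked = sorted(deltas, key=deltas.get)

for name in ranked:
    relative = 100 * deltas[name] / FULL_AUROC  # change relative to the full model
    print(f"Without {name}: delta = {deltas[name]:+.2f} ({relative:+.1f}% relative)")
```

Because the table reports AUROC to two decimals, differences of 0.01 between features are at the limit of the reported precision.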

Table 11. Summary statistics: Model profile comparison.

Metric	Random Forest	XGBoost
Mean Range	0.085	0.038
Std Range	0.061	0.047
Max Range	0.257	0.218
Min Range	0.042	0.001
Mean Agreement	0.413
Spearman ρ	0.856 ***

Note: *** p < 0.001 (highly significant correlation).

Table 12. Top 10 feature contributions for individual sepsis prediction (Sample 1000).

Feature	Value	RF Contrib	XGB Contrib	RF Impact	XGB Impact
ICULOS	58.0	0.706	0.492	High	High
Temp	38.0	0.643	0.067	High	Medium
BUN	8.0	0.460	0.057	High	Medium
WBC	13.6	0.414	0.139	High	High
Magnesium	2.9	0.367	0.330	High	High
AST	102.0	0.357	0.323	High	High
HR	96.3	0.343	0.031	High	Low
Resp	13.8	0.326	0.007	High	Low
Bilirubin (D)	0.4	0.324	0.067	High	Medium
BaseExcess	−2.9	0.301	0.019	High	Low

Table 13. Scalability analysis: Training time by dataset size.

Dataset Size	Training (min)	Per-Sample (ms)	% Full
1000 patients	0.06	3.79	2.5%
5000 patients	0.39	4.68	12.4%
10,000 patients	0.84	5.06	24.8%
20,000 patients	1.81	5.44	49.6%
40,336 patients (Full)	3.92	5.83	100.0%
80,000 patients †	8.27	6.20	198.3%
100,000 patients †	10.54	6.32	247.9%

† Projected using bootstrap resampling and complexity modeling.
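The derived columns of Table 13 can be checked with a few lines of arithmetic, sketched below. The training times and cohort sizes are copied from the table; the helper names and the log-log scaling estimate are illustrative assumptions and do not represent the complexity model used for the projected rows. Small mismatches with the reported per-sample values (for example at 1000 patients) are expected because the training times are rounded to two decimals.

```python
import math

FULL_COHORT = 40_336  # patients in the full dataset (Table 13)

# (patients, training time in minutes) copied from the measured rows of Table 13.
MEASURED = [(1_000, 0.06), (5_000, 0.39), (10_000, 0.84),
            (20_000, 1.81), (40_336, 3.92)]


def per_sample_ms(training_minutes, n_patients):
    """Average training time per patient, in milliseconds."""
    return training_minutes * 60_000 / n_patients


def percent_of_full(n_patients):
    """Dataset size as a percentage of the full cohort."""
    return 100 * n_patients / FULL_COHORT


def scaling_exponent(small, large):
    """Slope of log(time) against log(size) between two (patients, minutes) rows."""
    (n1, t1), (n2, t2) = small, large
    return math.log(t2 / t1) / math.log(n2 / n1)


for n, minutes in MEASURED:
    print(f"{n:>6} patients: {per_sample_ms(minutes, n):.2f} ms/sample, "
          f"{percent_of_full(n):.1f}% of full")

# An exponent slightly above 1 would indicate mildly super-linear growth.
exponent = scaling_exponent(MEASURED[1], MEASURED[-1])
print(f"Empirical scaling exponent (5000 to full cohort): {exponent:.2f}")
```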

Table 14. Comparison with published sepsis classification methods.

Method	AUROC	Accuracy	F-Measure	Rank
BayeStack (Proposed)	0.99 ***	0.97 ***	0.97 ***	1
Deogire	0.86 **	0.84 **	0.48 **	2
Alfaras et al.	0.72 *	0.87	0.58 *	3
Narayanaswamy et al.	0.71 *	0.88	0.59 *	4
Strickler et al.	0.78 *	0.99	0.70 *	5
Murugesan et al.	0.56	0.96	0.13	6

*** p < 0.001, ** p < 0.01, * p < 0.05. Dataset: All methods use the PhysioNet 2019 Challenge dataset (40,336 patients). Task context: Baseline methods (refs. [25,26,27,28,29]) perform early prediction per the challenge specifications, whereas BayeStack performs concurrent classification. These different temporal objectives naturally yield different performance ranges.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

