Applying an Improved Stacking Ensemble Model to Predict the Mortality of ICU Patients with Heart Failure

Cardiovascular diseases have been identified as one of the top three causes of death worldwide, with onset and deaths largely attributable to heart failure (HF). In the ICU, where patients with HF are at increased risk of death and consume significant medical resources, early and accurate prediction of the time of death for high-risk patients would enable them to receive appropriate and timely medical care. The data for this study were obtained from the MIMIC-III database, from which we collected vital signs and laboratory tests for 6699 HF patients during the first 24 h of their first ICU admission. To predict the mortality of HF patients in ICUs more precisely, an integrated stacking model is proposed and applied in this paper. In the first stage, six first-level classifiers (RF, SVC, KNN, LGBM, Bagging, and AdaBoost) were applied to the dataset. The decisions of these six classifiers were then fused to construct and optimize a second-level stacked classifier. The results indicate that our model achieved an accuracy of 95.25% and an AUROC of 82.55% in predicting the mortality of HF patients, demonstrating the capability and efficiency of our method. In addition, the results revealed that platelets, glucose, and blood urea nitrogen were the clinical features with the greatest impact on model prediction. These findings not only improve healthcare professionals' understanding of patients' conditions but also allow for a more optimal use of healthcare resources.


Introduction
Cardiovascular diseases (CVD) have ranked among the top three causes of death worldwide for many years, accounting for an estimated 18.9 million deaths per year, or approximately 31% of global mortality [1]. The majority of CVD morbidity and mortality derives from heart failure (HF), a common cardiovascular condition in which the heart fails to maintain the body's metabolic needs. Patients with HF experience a variety of overt symptoms such as shortness of breath, swollen ankles, and physical fatigue, and may also show signs of elevated jugular venous pressure, pulmonary crackles, and peripheral edema caused by cardiac or noncardiac structural abnormalities [2]. As a major cause of cardiovascular morbidity and mortality, HF poses a significant threat to human health and social development [3]. It is estimated that one in five people will develop HF, and 50% of HF patients die within 5 years [4]. The mortality rate of HF patients in the year after hospitalization is 20% to 60% [5]. At least 64.3 million people worldwide are affected by HF, and the prevalence of HF among Americans over the age of 20 has risen to 6.2 million [6]. Despite rapid advances in medical technology and significant progress in the diagnosis, assessment, and treatment of cardiovascular disease [7,8], HF remains a major medical problem worldwide in the face of the growing number of people affected.
According to the World Federation of Societies of Intensive and Critical Care Medicine, an Intensive Care Unit (ICU) is an organized system for the provision of care to critically ill patients; however, no developed system is in common use for mortality prediction, owing to low recognition rates [13]. Gasillas describes new machine learning algorithms for predicting mortality in COVID-19 patients [31]. Bi et al. presented a study assessing the predictive performance of ML methods for in-hospital mortality in adult PCS patients [32]. José et al. introduced a new methodology to improve ICU monitoring systems through an age-based stratification approach, using XGBoost classifiers and SHAP technology to automatically identify thresholds for the most important clinical variables to monitor, yielding interpretable machine learning [33]. Accurate prediction of hospital mortality has therefore remained a challenge to date.
Given that each ML method may excel or fall short in different situations, developing a model that integrates multiple ML methods to obtain better performance has become a new research approach. There are three main types of ensemble learning methods: Boosting, Bagging, and Stacking. Boosting updates the weights of the training data after each training iteration and then produces the classification output by weighted voting [34]. Bagging trains several base learners on different bootstrap samples and then combines them by voting on the result [35]. Stacking is a powerful ensemble technique that uses the predictions of multiple base learners as features to train a new meta learner [36], and it usually performs better than any single trained model. Li et al. [37] proposed a stacked Long Short-Term Memory (LSTM) network transfer learning framework to improve the transferability of a traffic prediction model. Zhai & Chen [38] used a stacked ensemble model to predict and analyze the daily average concentration of PM2.5 in Beijing. Jia [39] proposed a stacked ML approach to efficiently and rapidly construct 3D multi-type rock models from geological and geophysical datasets. Zhou et al. [40] investigated the stacking algorithm for cheating detection in large-scale assessments with consideration of class imbalance. Meharie et al. [41] presented a stacking ensemble algorithm to linearly and optimally estimate the cost of highway construction projects. Dong et al. [42] proposed an ensemble learning model based on the stacking framework to improve wind power forecasting. In this paper, we deploy the stacking method to construct a mortality prediction model for HF patients in the ICU, consisting of Random Forest (RF), Support Vector Classifier (SVC), K-Nearest Neighbors (KNN), Light Gradient Boosting Machine (LGBM), Bootstrap aggregating (Bagging), and Adaptive Boosting (AdaBoost).
In this study, we used detailed clinical data from the MIMIC-III database, focusing on ICU patients suffering from heart disease, to predict the mortality of HF patients at 3 days, 30 days, 3 months, and 1 year after admission to the ICU using ML modeling techniques. This study proposes a stacking-based model to predict mortality in patients with HF. The base estimators can be adaptively selected and applied in the base layer of the stacking model. In addition, in the selection of key variables, the constructed model is expected to successfully identify the important vital-sign indicators of HF patients. We hope to improve the prediction of mortality in patients with heart disease by medical professionals, so that patients, their families, and medical professionals can make more informed judgments and more appropriate prognostic preparations. The contributions of this study are summarized below.

•	We propose an accurate and medically intuitive framework for predicting ICU mortality based on a comprehensive list of key characteristics of patients with HF in the ICU. The model is based on ensemble learning theory and stacking methods, and constructs a heterogeneous ensemble learning model to improve generalization and prediction performance.
•	For our model we adopted the most popular and most diverse classifiers in the current literature, comprising six different ML techniques. The generated classifier lists are used to construct the proposed stacking models. Different meta classifiers were tested and the best-performing estimators were selected. The predictive capabilities of our stacking model outperform the results of single classifiers and standard ensemble techniques, achieving encouraging accuracy and strong generalization performance.
•	In addition, feature importance provides a specific score for each feature, indicating its impact on model performance. The importance score represents the degree to which each input variable adds value to the decisions of the constructed model. We also identified the clinical characteristics of ICU patients with HF that had the greatest impact on model prediction.

Patient Selection and Variable Selection
The MIMIC-III dataset was used in this study. MIMIC-III integrates comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center (BIDMC) in Boston, Massachusetts, comprising de-identified data on 46,520 ICU admissions from 2001 to 2012 [43]. The data consist of 26 tables containing information such as demographics, admission records, discharge summaries, ICD-9 diagnosis records, vital signs, laboratory measurements, medications, nursing staff observation records, radiology reports, and survival data. We completed the National Institutes of Health online course, passed the Protecting Human Research Participants examination, and received data access privileges (Certificate No. 35628530).
For this study, in order to prevent possible information leakage and to ensure an experimental setting similar to related work, we used only the first ICU admission of each patient. To emphasize early predictive value, we used data from the first 24 h of admission as input to the predictive model and excluded patients younger than 16 years of age [12,21,26,44-46]. The patient cohort was selected according to the following criteria: only the first ICU stay was retained, all subsequent ICU stays were excluded, and the single ICU stay had to exceed 24 h [47-50]. After this first stage of data screening, a total of 7278 eligible patients remained.
The predictive variables selected in this paper were obtained from two tables in the MIMIC-III database: the admissions table and the chartevents table. We referred to predictive variables that have been used by other researchers to predict mortality in HF patients [2,3,47,48,50]. For the treatment of missing values, the three-stage data preprocessing method of Guo et al. [21] was adopted. First, patients with more than 30% missing values were removed, eliminating 579 patients. Second, variables with more than 40% missing values were removed, eliminating 4 variables, so that 16 predictor variables were finally used in this study. Third, measurements with more than 20% missing values for these indicators were removed, and the remaining missing values were imputed using mean values. In total, 6699 patients were used in this study. Figure 1 shows the detailed process of data extraction.
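The three-stage missing-value treatment described above can be sketched in pandas as follows (a minimal illustration only; the function name and the exact handling details are ours, not the code of Guo et al.):

```python
import numpy as np
import pandas as pd

def three_stage_missing_value_filter(df, patient_thresh=0.30, var_thresh=0.40):
    """Sketch of the three-stage missing-value treatment (thresholds as in the text)."""
    # Stage 1: drop patients (rows) with more than 30% missing values.
    row_missing = df.isna().mean(axis=1)
    df = df.loc[row_missing <= patient_thresh]
    # Stage 2: drop predictor variables (columns) with more than 40% missing values.
    col_missing = df.isna().mean(axis=0)
    df = df.loc[:, col_missing <= var_thresh]
    # Stage 3: impute the remaining gaps with each column's mean value.
    return df.fillna(df.mean(numeric_only=True))
```

Applied to the admissions and chartevents extracts, a filter of this form yields the final patient-by-variable matrix used for modeling.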

Proposed Framework
Mortality rate is a major outcome in acute care: ICU mortality is the highest among hospital units (10% to 29%, depending on age and disease), and early identification of high-risk patients is key to improving outcomes [51]. In this paper, we use the ICU admission records of HF patients in the MIMIC-III dataset to predict mortality at different time points, and we propose an ensemble learning model based on stacking. In the first layer of the stacking model, six ML methods (RF, SVC, KNN, LGBM, Bagging, and AdaBoost) are applied as first-level classifiers on the dataset. The decisions of these six classifiers are then fused to construct and optimize the stacked set of classifiers. Finally, the dataset is passed to a second-level classifier using LGBM, which forms the final prediction. Figure 2 shows the stacking ensemble based on cross-validation of all feature subsets.


•	Random Forest (RF) is an ensemble supervised ML algorithm that uses decision trees as its base classifiers. RF generates many classifiers and combines their results by majority voting [53]. In regression, the output is numerical and the mean of the predicted outputs is used. The random forest algorithm is well suited to datasets with missing values, performs well on large datasets, and can rank features by importance. The advantages of RF are that it provides higher accuracy than a single decision tree, can handle datasets with a large number of predictive variables, and can be used for variable selection [54].
•	Support Vector Classifier (SVC) performs classification and regression analysis on linear and non-linear data. SVC aims to separate classes by creating decision hyperplanes in a higher-dimensional feature space [55]. SVC is a robust tool for addressing data bias and variance and leads to accurate binary or multiclass classification. In addition, SVC is robust to overfitting and has significant generalization capability [56].

•	The K-Nearest Neighbors (KNN) algorithm does not require a training phase. It is used to predict binary or sequential outputs. The data are divided into clusters, and the number of nearest neighbors is specified by the constant "K". KNN is an algorithm [53] that stores all available instances and classifies new instances based on a similarity measure (e.g., a distance function). Owing to its simple implementation and excellent performance, it has been widely used in classification and regression problems [57].
•	Light Gradient Boosting Machine (LGBM) is an ensemble approach that combines predictions from multiple decision trees to make well-generalized final predictions. LGBM divides continuous feature values into K intervals and selects split points from these K values, which greatly accelerates prediction and reduces the required storage space without degrading accuracy [58,59]. LGBM is a gradient boosting decision tree learning algorithm that has been widely used for feature selection, classification, and regression [60].

•	Bootstrap aggregating (Bagging) is an ensemble learning algorithm originally proposed by Leo Breiman in 1994. Bagging can be combined with other classification and regression algorithms to improve accuracy and stability and to avoid overfitting by reducing the variance of the results. As an ensemble method, i.e., a method of combining multiple predictors, it helps to avoid overfitting the model to the data and has been used in a series of microarray studies [61,62]. We implemented Bagging using Python's sklearn library, choosing an ensemble of 500 DecisionTreeClassifier classifiers with a maximum of 100 bootstrap-sampled instances per classifier and leaving all other hyperparameters at sklearn's default values.

•	The adaptive nature of the Adaptive Boosting (AdaBoost) method lies in using the samples misclassified by the previous classifier to train the next one; consequently, AdaBoost is sensitive to noisy and anomalous data. It trains a basic classifier, assigns higher weights to the misclassified samples, and passes these to the next iteration. This iterative process continues until the stopping condition is reached or the error rate becomes small enough [63,64]. We implemented AdaBoost using Python's sklearn library, choosing a maximum of 50 iterations and using sklearn's default values for the remaining hyperparameters.

Table 1. The parameters of the six candidate models.
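The Bagging and AdaBoost configurations described in the two bullet points above can be sketched in sklearn as follows (hyperparameter values follow the text; the `random_state` settings are our addition for reproducibility):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging as described above: an ensemble of 500 DecisionTreeClassifier
# classifiers, each trained on at most 100 bootstrap-sampled instances;
# all other hyperparameters are left at sklearn's default values.
bagging = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    random_state=42,
)

# AdaBoost as described above: at most 50 boosting iterations, with the
# remaining hyperparameters at sklearn's default values.
adaboost = AdaBoostClassifier(n_estimators=50, random_state=42)
```

Both objects can then be fitted with the usual `fit(X, y)` / `predict(X)` interface.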

Stacking Ensemble Technique
Stacking is an ensemble method that connects multiple different types of classification models through a meta-classifier [65]. The basic concept is to combine multiple weak learners to obtain a model with stronger generalization ability [66]. Stacking is a novel ensemble framework that uses a meta-learner to fuse the results generated by each base learner [67]. Generally, the base learners are called first-level learners and the combiner is called the second-level learner or meta learner. The basic principle of stacking is as follows. First, the first-level learners are trained on the initial training dataset. Then, the outputs of the first-level learners are used as input features for the meta learner. Finally, a new dataset is formed, with the corresponding original labels as the new labels, to train the meta learner. If the first-level learners all use the same type of learning algorithm, the ensemble is called homogeneous; otherwise, it is heterogeneous [42,68-70].
In this study, the first-layer prediction models of the stacking ensemble were trained using a k-fold cross-validation method. The specific training process is as follows:

1.	The original dataset S is randomly divided into K sub-datasets {S_1, S_2, ..., S_K}. Taking base learner 1 as an example, each sub-dataset S_i (i = 1, 2, ..., K) serves in turn as the validation set while the remaining K − 1 sub-datasets are used as the training set, yielding K prediction results. These are merged into the set D_1, which has the same length as S.
2.	The same operation is performed on the other n − 1 base learners to obtain the sets D_2, D_3, ..., D_n. Combining the prediction results of the n base learners yields a new dataset D = {D_1, D_2, ..., D_n}, which constitutes the input data of the second-layer meta-learner.
3.	The second-layer prediction model can detect and correct errors of the first-layer prediction models in time, improving the accuracy of the overall model.
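Steps 1 and 2 can be illustrated with sklearn's `cross_val_predict`, which returns exactly these out-of-fold predictions (a minimal sketch with K = 5 and three of the six base learners; the dataset here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

base_learners = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    SVC(probability=True, random_state=0),
    KNeighborsClassifier(),
]

# Step 1: each D_i holds the K-fold out-of-fold predicted probabilities of
# base learner i, so every meta-feature comes from a model that never saw
# that sample during training.
D = [
    cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    for clf in base_learners
]

# Step 2: column-stack D_1..D_n to form the meta-learner's input dataset.
meta_X = np.column_stack(D)  # shape: (n_samples, n_base_learners)
```

Training the meta-learner on `meta_X` with the original labels `y` completes the procedure.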
As heterogeneous ensembles have better generalization performance and prediction accuracy, this study proposes a stacked ensemble classifier built in two stages. First, RF, SVC, KNN, LGBM, Bagging, and AdaBoost are used as base classifiers, each trained on the complete training set. The probabilistic outputs obtained in the first stage are then fed into the meta-classifier in the second stage, which is fitted on the output meta-features of each classification model using the chosen ensemble technique. The meta classifier can be trained on either the predicted class labels or the predicted probabilities of the base classifiers.

Synthetic Minority Oversampling Technique (SMOTE)
In ML, a problem is imbalanced when the class distribution is highly skewed. Imbalanced classification problems occur in many applications and pose obstacles to traditional learning algorithms [71]. In general, an imbalanced dataset may adversely affect model results; gold-standard datasets are often imbalanced, which decreases the predictive power of the model [66]. Model overfitting and underfitting are the most common problems encountered when evaluating performance.
Overfitting occurs when a model shows high accuracy during training but low accuracy during validation; it can be reduced by adding more training data or decreasing model complexity. Underfitting occurs when the model fails to classify the data or make predictions even during the training stage. Class imbalance refers to the disproportionality between the classes used to train predictive models, a common problem that is not unique to medical data. Classification algorithms tend to favor the majority class when the class with negative outcomes has significantly fewer observations; class imbalance can therefore be addressed by manipulating the data, the algorithm, or both, typically by undersampling the majority class or oversampling the minority class, to improve predictive performance [6]. SMOTE is a powerful solution to the classification imbalance problem and has delivered robust results in various domains [72]: it adds synthetic data to the minority class to form a balanced dataset [71]. In the MIMIC-III data used in this study, only a small proportion of patients died during their ICU admission; therefore, the SMOTE method, which uses synthetic minority oversampling to preprocess highly imbalanced datasets, was adopted.
In our study, we applied SMOTE with different percentages for different cases, generating several new training datasets (Table 2). Taking the 3-day mortality dataset as an example, SMOTE at 3000% increased the "died" class from 213 to 6390 instances, raising the share of the minority class from 3.18% in the original dataset to 49.63% in the SMOTE 3000% dataset.
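The core SMOTE idea, synthesizing new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows (an illustrative toy implementation, not the paper's code; libraries such as imbalanced-learn provide the production version used in practice):

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Minimal SMOTE sketch: create n_synthetic new minority samples by
    interpolating between minority samples and their k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude self-matches
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k neighbours per synthetic point.
    base = rng.integers(0, len(X_min), size=n_synthetic)
    nn = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))         # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nn] - X_min[base])
```

For the 3000% case above, one would request 6390 - 213 = 6177 synthetic instances to take the "died" class from 213 to 6390.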

Evaluation Criteria
Classification performance is measured using several evaluation metrics based on the outcomes of the confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In this study, five evaluation measures were selected as performance indicators in order to obtain a fuller picture of the results. We applied Precision, Recall, F-score, and Accuracy, which are widely used in this research field, defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = 2 × Precision × Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

In addition, this study adopts the area under the receiver operating characteristic curve (AUROC) to measure the predictive performance of the model. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) and is widely used in binary classification problems to evaluate the reliability of a classifier; the metric considered is the area under this curve, called AUROC, with values ranging from 0 to 1, the larger the better [73]. Because the ROC curve is not affected by class imbalance in the dataset, it allows a more objective evaluation of performance in imbalanced learning, and the completely threshold-independent AUC reveals the performance of the classification algorithm.
A larger value of AUC implies that the classification algorithm exhibits stronger and more stable predictive performance [58].
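The metrics above can be computed directly from the confusion-matrix counts; a small worked example (the label and probability vectors here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # ground-truth labels
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3])  # model probabilities
y_pred = (y_prob >= 0.5).astype(int)                 # thresholded predictions

# TP/TN/FP/FN from the confusion matrix, then the manual formulas.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Cross-check the manual formulas against sklearn's implementations.
assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)

# AUROC is threshold-independent: it is computed from the probabilities.
auroc = roc_auc_score(y_true, y_prob)
```

Note that Precision, Recall, F-score, and Accuracy depend on the 0.5 threshold applied to `y_prob`, whereas AUROC does not.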

Baseline Characteristics
Data from the MIMIC-III database were preprocessed, and the ICU admission records of 6699 HF patients were ultimately used. Table 3 provides demographic information on the HF patients. The mean age of the patients was 70.3 years, 55% were male, and the mean length of stay in the ICU was 5.8 days. More than 84% of the patients were hospitalized for medical emergencies; 37.9% were admitted to the Medical ICU and 27.7% to the Coronary Care Unit. Furthermore, over 70% of patients were covered by Medicare. Table 3 also presents the means of the 20 variables used in this study.

Mortality Prediction Results of Different Models
In this study, 10-fold cross-validated training and testing was used, with 80% of the data used for training and 20% for testing, followed by extensive statistical analysis to evaluate performance. Data collected within 24 h of ICU admission were used to predict patient mortality at 3 days, 30 days, 3 months, and 1 year after admission. The most commonly used metric for assessing diagnostic tools is AUROC, which graphically relates the true positive rate and the false positive rate. Table 4 lists the AUROC of the six different ML methods used in the first phase of this study, together with the Stacking technique added in the second phase, for the HF mortality prediction tasks over the four time periods. The highest AUROC was obtained for predicting 3-day mortality, i.e., death within 3 days could be predicted accurately from data collected within 24 h of admission. Figures 3 and 4 indicate that the data collected at ICU admission predict 3-day mortality best; Figure 3 also clearly shows that all four models have high AUROC after adding the stacking technique in the second stage, meaning they distinguish mortality from non-mortality cases well. The prediction models constructed in this study can therefore assess patients' survival status more accurately. In addition to AUROC, this study also compares Precision, Recall, F-score, and Accuracy, as shown in Table 5, when evaluating the performance of the different attribute sets used in the classification algorithms. The best precision and recall were obtained in the run in which 3-day mortality was used to select the best subset of attributes.
In addition, it can be observed from Figure 5 that adding the Stacking technique in the second stage yields higher accuracy, with the highest value of 0.9525 obtained when predicting mortality within 3 days. This demonstrates that mortality in HF patients can be predicted very well using data from the first 24 h after ICU admission.

Interpretation of Variable Importance
Feature importance measures the contribution of each feature to the predictive power of the whole model. It provides an intuitive view of which features have a greater impact on the final model, although it cannot determine how the features relate to the final prediction. Figure 6 lists the explanatory variables according to their contribution to each model. We can observe that the clinical characteristics Platelets, Glucose, Blood Urea Nitrogen, Age, Heart Rate, Systolic Blood Pressure, and Diastolic Blood Pressure are important factors influencing the prediction of HF mortality; similar results have been found in previous studies of HF [74-76]. In contrast, the effects of GCS eye, GCS motor, GCS verbal, Red Blood Cell, International Normalized Ratio, and Gender were less significant. In this study, with reference to previous related work, we used several variables to predict mortality in ICU patients with HF. These characteristics result in different contributions of the input-layer parameters in the construction of LightGBM. The LightGBM model was originally proposed by Microsoft [77]. It is a decision tree-based algorithm that partitions the parameters in the input layer to construct a mapping between inputs and outputs; more specifically, it uses a gradient boosted decision tree (GBDT) algorithm to achieve more accurate predictions. Many hyperparameters are used in the LightGBM model, and since they can significantly affect performance, many can be fine-tuned for different applications to improve the model; hyperparameter tuning is therefore an essential step in constructing ML models. Figure 6 shows the contribution of the top 20 variables in the input layer for each prediction task.
Parameter contribution represents the ratio of the number of times a parameter is used for tree splitting to the total number of times all parameters are used for tree splitting. A larger parameter contribution value means that the parameter has been used more times to split nodes, indicating greater importance. For a deeper understanding of the LightGBM algorithm, please refer to Guolin Ke et al. [77], which describes its principles and applications in detail. More information about how the LightGBM method calculates the importance of input variables can be found in the related research notes [78,79]. Table 6 lists the ten clinical features that contributed most to the model at the four prediction times, ranked from most to least important. We find that, in addition to an HF patient's Platelets, Glucose, and Blood Urea Nitrogen, Systolic Blood Pressure and Diastolic Blood Pressure are also important variables in predicting 3-day, 30-day, and 3-month mortality of HF patients in the ICU, while Heart Rate is an important variable when predicting 1-year mortality.
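Ranking input variables by a boosted-tree model's importance scores can be sketched as follows (sklearn's `GradientBoostingClassifier` stands in for LightGBM here, and its default importances are impurity-based rather than the split-count ratio described above; in LightGBM itself, `importance_type='split'` yields the split-count measure; the data and feature names below are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-ins for the clinical variables; the real model ranks
# features such as Platelets, Glucose, and Blood Urea Nitrogen.
feature_names = [f"feature_{i}" for i in range(10)]
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importance scores sum to 1; sorting descending gives the contribution
# ranking analogous to Figure 6 / Table 6.
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

The top entries of `ranking` correspond to the variables that contribute most to the model's splitting decisions.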

Discussion
The feasibility of ML techniques for mortality prediction in HF patients has been demonstrated previously. Negassa et al. developed an ensemble model for 30-day post-discharge mortality and used discrimination, range of prediction, the Brier index, and explained variance as metrics to evaluate model performance; the discrimination achieved by the ensemble model was 0.83 [28]. Adler et al. developed a new ML-based approach to risk assessment, the MARKER-HF algorithm, which naturally captures correlations in the multi-dimensional covariate space and predicted mortality in a separate U.S. healthcare system and a large European registry with AUCs of 0.84 and 0.81, respectively [29]. Jing proposed an ML model to accurately predict one-year all-cause mortality in a large cohort of HF patients, with the nonlinear XGBoost model (AUC: 0.77) achieving the best prediction [30]. Austin et al. developed ML algorithms to predict mortality in 12,608 patients with acute HF; the baseline logistic regression model achieved an AUC of 0.794 [80]. Luo et al. constructed a risk stratification tool using the extreme gradient boosting algorithm to correlate patient clinical characteristics with in-hospital mortality, and this new ML model outperformed traditional risk prediction methods with an AUC of 0.831 [81]. The aim of this study is to predict the mortality of ICU patients with an ML model built from the structured vital-sign data collected during the ICU stay of HF patients.
Through in-depth analysis of the experimental results, the following conclusions were drawn from this study: (1) The proposed stacking ensemble model can make full use of the different prediction models' views of the data space and structure, allowing the models to complement one another and making it possible to obtain the best prediction.
(2) The data collected during the first 24 h after the admission of HF patients to the ICU were used for modeling analysis, and the ML stacking method was effective and rapid in constructing predictive models. The results indicated that our proposed stacking method has good performance in predicting mortality in HF patients, with an Accuracy of 95.25% and AUROC of 82.55%. This also validates the findings of an analysis of the relationship between classifier diversity and the quality of stacking, which concluded that diversity may be considered a selection criterion for building the ensemble classifier [13,82]. (3) This study was able to accurately predict the mortality of patients after admission to the ICU; as the prediction time was shortened, the prediction performance improved. (4) Random Forest, LGBM, and Bagging ML techniques have developed rapidly in recent years and can accurately predict outcomes in the medical field. (5) This study also found that Platelets, Glucose, and Blood Urea Nitrogen were the clinical characteristics that had the greatest impact on model prediction and were important indicators to consider in the selection of important variables, which is in line with previous studies on HF [74,76].
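The two-level design described above can be sketched with scikit-learn's `StackingClassifier`. This is a minimal illustration on synthetic data, not the exact configuration or hyperparameters used in this study; the LightGBM base learner is omitted here so the sketch depends only on scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the 24 h ICU feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# First-level classifiers; a LightGBM learner would be added here as well
# when the lightgbm package is available.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("bag", BaggingClassifier(random_state=0)),
    ("ada", AdaBoostClassifier(random_state=0)),
]

# The second-level classifier fuses the base-learner decisions, which are
# generated with internal cross-validation to avoid leaking training labels.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_tr, y_tr)
print(f"held-out accuracy: {stack.score(X_te, y_te):.3f}")
```

The internal `cv=5` argument is what lets the second-level model see out-of-fold base-learner predictions rather than in-sample ones, which is the key to why stacking can outperform its individual members.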
However, there are some limitations to this study. First, this study is a retrospective rather than a prospective study, and inherent biases are inevitable in retrospective studies. Since most of the dynamic data used to train the classifier were obtained from the ICU specifically, there may be problems in non-ICU settings; i.e., less real-time data may negatively affect the prediction results and increase the possibility of incorrect judgments and decisions. This is a common problem in the practical implementation of ML methods using dynamic data from EHR [83]. Second, for the convenience and completeness of data collection, only ICU databases with the easiest access to complete dynamic patient information were considered. The MIMIC-III data used in this study were obtained from a single center, the Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA. We trained our ML model using ten-fold cross-validation and tested the final model using an independent patient cohort as an external validation of model generalizability. We evaluated the performance of all models on the dataset using Scikit-learn v0.23.1 [52]. However, the broader application of our ML model requires further validation in different populations. More comprehensive results and validation may be obtained for patients in other types or scales of medical institutions, such as large hospitals or small clinics. Third, the data used in this study were limited to those collected during the first ICU stay, excluding the records and reports of patients readmitted to the hospital. Collecting data from multiple ICU hospitalizations may allow a comprehensive assessment of time-series issues, or may provide additional levels of analysis to patients, healthcare professionals, and patients' families for future evaluation and prognosis.
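The ten-fold cross-validation procedure mentioned above can be reproduced with `cross_val_score`; again this is a sketch on synthetic data with an illustrative Random Forest, not the study's actual cohort or model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholder for the MIMIC-III feature matrix.
X, y = make_classification(n_samples=300, n_features=15, random_state=1)

# Ten-fold cross-validation, scored by AUROC as in the study's evaluation.
scores = cross_val_score(RandomForestClassifier(random_state=1), X, y,
                         cv=10, scoring="roc_auc")
print(f"mean AUROC over 10 folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the fold-to-fold standard deviation alongside the mean gives a rough sense of how stable the estimate is on a single-center dataset.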
Fourth, the data extracted from the MIMIC database were distributed over the years 2001-2012, during which the treatment of HF changed significantly, which may weaken the application of our model.

Conclusions
The world is facing the challenge of the COVID-19 pandemic, in which patients need regular monitoring of vital signs and medical warning scores to enable early identification and treatment of their disease. The burden on the healthcare system is enormous, and the interventions used and their duration may seriously affect the prognosis of patients [11]. Despite the recent development of various severity scores and ML models for early mortality prediction, such predictions remain challenging. Compared to existing solutions, this study complements current clinical decision-making methods by proposing a new stacked ensemble approach to predict mortality in patients with HF in the ICU. The base estimators can be adaptively selected and applied to the base layer of the stacking model. Our stacking model is significantly better than traditional ML approaches in mortality prediction and can successfully screen out the important clinical features of HF patients. It can empower healthcare professionals to better predict mortality in HF patients, and provide patients, their families, and medical professionals with more information to determine the status of patients and make a more appropriate prognosis. This study offers the following recommendations for future studies:

1. Compared to the structured data that have been used for clinical outcome prediction, the information available in diagnostic records and test reports, i.e., unstructured data, is still underutilized in medical research. These diagnostic data are important references for clinical decision-making because they record multifaceted information about the patient's visit, such as the focus of care, the preliminary medical assessment, and the different recommendations generated for the final diagnosis. Future research could integrate structured and unstructured data for more detailed classification and study [84].

2. The MIMIC-III data are relatively rich and complete, and this study only modeled and predicted patient mortality; subsequent studies could evaluate readmission, length of stay, medication use, and complications with reference to the framework of this study. Such studies would be more objective and complete if extended to a more comprehensive evaluation and analysis.

3. The evolution of variables over time can be collected from patient EHR data in an attempt to obtain better predictive performance. In terms of research methods, future work can attempt different ML methods as well as deep learning methods that have recently been applied to handle time-series data more effectively; for example, long short-term memory networks, recurrent neural networks, and CNN models [85,86] are common deep learning models.

4. With the increasing prevalence of AI, telemedicine, and robotics, which have grown in response to the recent COVID-19 pandemic, imaging AI and speech AI can be incorporated. Combining existing clinical data, diagnostic reports, medical images, etc., can improve the quality of medical care, which will be an important issue in the future field of smart medicine [87,88].