Machine Learning Models to Predict 30-Day Mortality in Mechanically Ventilated Patients

Previous scoring models, such as the Acute Physiologic Assessment and Chronic Health Evaluation II (APACHE II) score, do not adequately predict the mortality of patients receiving mechanical ventilation in the intensive care unit. Therefore, this study aimed to apply machine learning algorithms to improve the prediction accuracy for 30-day mortality of mechanically ventilated patients. The data of 16,940 mechanically ventilated patients were divided into training-validation (83%, n = 13,988) and test (17%, n = 2952) sets. Machine learning algorithms including balanced random forest, light gradient boosting machine, extreme gradient boost, multilayer perceptron, and logistic regression were used. We compared the areas under the receiver operating characteristic curves (AUCs) of the machine learning algorithms with those of the APACHE II and ProVent scores. The extreme gradient boost model showed the highest AUC (0.79 (0.77–0.80)) for 30-day mortality prediction, followed by the balanced random forest model (0.78 (0.76–0.80)). The AUCs of these machine learning models were higher than those achieved by the APACHE II and ProVent scores (0.67 (0.65–0.69) and 0.69 (0.67–0.71), respectively). The most important variables in developing each machine learning model were APACHE II score, Charlson comorbidity index, and norepinephrine. The machine learning models have a higher AUC than conventional scoring systems and can thus better predict the 30-day mortality of mechanically ventilated patients.


Introduction
Acute respiratory failure, which results from various medical conditions such as pneumonia, congestive heart failure, sepsis, or acute respiratory distress syndrome, is a common reason for mechanical ventilation in the intensive care unit (ICU) [1]. Although mechanical ventilation is indicated for respiratory support or airway protection, it is associated with higher mortality and morbidity [1][2][3][4][5][6]. Moreover, patients requiring prolonged mechanical ventilation have high long-term mortality, and tracheostomy is often needed to maintain mechanical ventilation. Therefore, accurate prediction of prognosis in mechanically ventilated patients in the ICU is important.
As such, several mortality prediction models for mechanically ventilated patients have been suggested [4][7][8][9][10][11]. However, few models focus on predicting hospital mortality, and most included only patients with pneumonia or chronic obstructive lung disease. Conventional scoring systems such as the Acute Physiologic Assessment and Chronic Health Evaluation II (APACHE II) or Sequential Organ Failure Assessment (SOFA) scores have been reported as significant mortality predictors among mechanically ventilated patients [1][10][11][12][13][14]. However, the discrimination ability of these scoring systems has not been validated in large cohorts of patients with various types of respiratory failure.
Machine learning algorithms have been recently applied to predict various outcomes related to mechanical ventilation. These outcomes include prolonged ventilation or tracheostomy, need for mechanical ventilation, successful extubation, weaning from mechanical ventilation, and monitoring lung mechanics [15][16][17][18][19]. However, there are no studies yet on using machine learning models for predicting mortality in mechanically ventilated patients. Therefore, we aimed to apply machine learning algorithms to predict the mortality of mechanically ventilated patients. Further, we investigated whether the machine learning models have better predictive capability than do conventional scoring systems.

Data Source and Study Population
In this retrospective study, data were collected from the study cohort enrolled at five hospitals of Hallym University Medical Center, Republic of Korea. The hospitals were located in Seoul (Kangnam Sacred Heart Hospital and Hangang Sacred Heart Hospital), Gyeonggi Province (Hallym University Sacred Heart Hospital and Dongtan Sacred Heart Hospital), and Gangwon Province (Chuncheon Sacred Heart Hospital). The overall bed capacity was 3047 beds, and 2,598,544 outpatients and 835,543 inpatients were managed in 2019.
We evaluated consecutive adult patients (≥18 years old) who required mechanical ventilation in the ICU between 1 January 2010 and 31 December 2019. Among the 28,340 mechanically ventilated patients identified, 11,400 were excluded because they underwent surgery (n = 7403) or had missing values (n = 3997) (Figure S1). Thus, 16,940 patients were included in the analysis.
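The cohort arithmetic above can be checked with a few lines of Python. This is only a consistency sketch: the exclusion counts come from this section, the split sizes (13,988 and 2952) are those reported in the abstract and the Machine Learning Algorithms section, and all variable names are hypothetical.

```python
# Counts reported in the text; variable names are illustrative only.
identified = 28_340
excluded_surgery = 7_403
excluded_missing = 3_997

excluded = excluded_surgery + excluded_missing
included = identified - excluded

# Split sizes reported for the training-validation and test sets.
train_valid, test = 13_988, 2_952
share_train_valid = round(100 * train_valid / included)
share_test = round(100 * test / included)

print(excluded, included, share_train_valid, share_test)
```

The two exclusion counts sum to the stated 11,400, and the two split sizes sum to the 16,940 included patients, with rounded shares of 83% and 17%.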
The study was approved by the institutional review board of Chuncheon Sacred Heart Hospital (No. 2020-11-008). The need for informed consent was waived owing to the retrospective nature of the study.

Data Collection and Definitions
Data were collected from the electronic medical records (EMRs) of each participating hospital using the clinical big data analytic solution Smart Clinical Data Warehouse, which is based on the QlikView Elite Solution (Qlik, King of Prussia, PA, USA). The following information was collected from the time of mechanical ventilation initiation: age, sex, body mass index, time from hospitalization to ICU admission, time from hospitalization to mechanical ventilation initiation, APACHE II score, ProVent score, Modified Early Warning Score (MEWS) [20], status post tracheostomy, transfer from a skilled nursing facility, Charlson Comorbidity Index (CCI) and its component variables [21], vital signs, continuous renal replacement therapy (CRRT), mode of mechanical ventilation, transfusion requirement (packed red blood cells, fresh frozen plasma, and platelet concentrate), use and type of vasopressors and inotropes (norepinephrine, epinephrine, dobutamine, dopamine, and vasopressin), use and type of corticosteroids (hydrocortisone, dexamethasone, and methylprednisolone), use and type of opioids (fentanyl and remifentanil), use and type of sedatives (propofol and midazolam), use and type of neuromuscular blockers (atracurium, cisatracurium, rocuronium, and vecuronium), and laboratory results including arterial blood gases. For longitudinal data such as vital signs or laboratory findings, we selected the initial values taken on the day mechanical ventilation was initiated. To prevent errors in the dataset, we excluded patients with systolic blood pressure, heart rate, respiratory rate, or body temperature outside the ranges of 30–300 mmHg, 10–300 beats/min, 3–60 breaths/min, and 30–45 °C, respectively [22].
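The plausibility filter on vital signs described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the record layout and the field names (`sbp`, `hr`, `rr`, `temp`) are hypothetical, the example patients are invented, and treating the range bounds as inclusive is an assumption.

```python
# Plausible ranges taken from the text; field names are hypothetical.
PLAUSIBLE_RANGES = {
    "sbp": (30, 300),   # systolic blood pressure, mmHg
    "hr": (10, 300),    # heart rate, beats/min
    "rr": (3, 60),      # respiratory rate, breaths/min
    "temp": (30, 45),   # body temperature, degrees Celsius
}

def is_plausible(record):
    """True if every vital sign lies within its range (bounds inclusive, by assumption)."""
    return all(lo <= record[name] <= hi
               for name, (lo, hi) in PLAUSIBLE_RANGES.items())

def drop_implausible(records):
    """Keep only records whose vital signs all fall inside the plausible ranges."""
    return [r for r in records if is_plausible(r)]

# Invented example records, for illustration only.
patients = [
    {"id": 1, "sbp": 120, "hr": 88, "rr": 20, "temp": 36.8},
    {"id": 2, "sbp": 12, "hr": 88, "rr": 20, "temp": 36.8},   # SBP below 30 mmHg
    {"id": 3, "sbp": 95, "hr": 110, "rr": 75, "temp": 38.1},  # RR above 60 breaths/min
]
kept = drop_implausible(patients)
print([r["id"] for r in kept])
```

A patient is dropped as soon as any one of the four vital signs is out of range, which matches the "or" reading of the exclusion rule.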
The ProVent score was calculated using five categories and their corresponding scores as follows: (1) age ≥ 65 years, 2 points; (2) age 50–64 years, 1 point; (3) platelets ≤ 100 × 10⁹/L, 1 point; (4) use of vasopressors and hemodialysis, 1 point; and (5) non-trauma, 1 point [7]. The maximum score was 7. The MEWS is a bedside tool for predicting an increased risk of clinical deterioration and uses five physiological parameters: systolic blood pressure, pulse rate, respiratory rate, temperature, and level of consciousness [20].
The allocated diagnoses for each patient were categorized using the Korean Standard Classification of Diseases-7 codes, the Korean version of the International Classification of Diseases-10 (ICD-10). CCI variables were categorized according to ICD-10 codes. These variables were included as features in developing the machine learning algorithms (Table S4). The primary outcome was mortality within 30 days from the initiation of mechanical ventilation in the ICU.

Machine Learning Algorithms
The dataset comprised the patient variables described above. We divided it into a training-internal validation set and an external validation (test) set to guard against model overfitting. The test set (17%, n = 2952) consisted of data from Chuncheon Sacred Heart Hospital so that the machine learning models could be applied to an independent dataset. The data from the other four hospitals were used for training-validation (83%, n = 13,988). The training-internal validation set was further divided into a training set and an internal validation set at a ratio of 4:1 with the same percentage of deaths. Datasets were standardized using min-max scaling.
Supervised learning is a machine learning task that learns a function mapping inputs to outputs from example input-output pairs. All machine learning used in this study was supervised learning. We used five machine learning algorithms, namely, balanced random forest (BRF), light gradient boosting machine (LGBM), extreme gradient boost (XGB), multilayer perceptron (MLP), and logistic regression (LR) [23,24]. LR is a regression algorithm that estimates the probability, a continuous value between 0 and 1, that an observation belongs to a specific category. Based on this probability, the algorithm decides which category the observation belongs to and thereby solves the classification problem. MLP is a neural network in which one or more intermediate layers, called hidden layers, exist between the input layer and the output layer. The network is connected in the direction of the input layer, the hidden layers, and the output layer; there are no connections within a layer and no direct connection from the output layer back to the input layer, which makes it a feedforward network. It differs from logistic regression in that it can contain one or more nonlinear hidden layers.
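The hospital-based hold-out, the 4:1 stratified split, and the min-max scaling described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the study's pipeline: the record fields (`hospital`, `age`, `died`), the toy data, and the hospital labels A to E are invented, and estimating the scaling range on the training set only is a common practice that the text does not confirm.

```python
import random

def split_by_hospital(records, test_hospital):
    """Hold out one hospital as the external test set; the rest is used for development."""
    dev = [r for r in records if r["hospital"] != test_hospital]
    test = [r for r in records if r["hospital"] == test_hospital]
    return dev, test

def stratified_split(records, valid_fraction=0.2, seed=0):
    """Split 4:1 while keeping the proportion of deaths equal in both parts."""
    rng = random.Random(seed)
    train, valid = [], []
    for outcome in (0, 1):
        group = [r for r in records if r["died"] == outcome]
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid

def fit_min_max(records, feature):
    """Return (min, max) of a feature, estimated on the given records only."""
    values = [r[feature] for r in records]
    return min(values), max(values)

def min_max_scale(value, lo, hi):
    """Map a value to [0, 1] relative to (lo, hi); a constant feature maps to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

# Invented toy cohort: five hospitals, 20 patients each, 5 deaths per hospital.
records = [
    {"hospital": h, "age": 40 + i, "died": 1 if i % 4 == 0 else 0}
    for h in "ABCDE" for i in range(20)
]

dev, test = split_by_hospital(records, test_hospital="E")
train, valid = stratified_split(dev)
lo, hi = fit_min_max(train, "age")          # assumption: range fitted on training data
scaled_test_ages = [min_max_scale(r["age"], lo, hi) for r in test]
```

Because the range is estimated on the training set in this sketch, scaled values in the validation or test sets can fall slightly outside [0, 1].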
Random forest is an ensemble learning method for classification and regression that aggregates the outputs of multiple decision trees constructed during training. Its defining characteristic is that randomness gives each tree slightly different properties. This decorrelates the predictions of the individual trees and consequently improves generalization performance. In addition, randomization makes the forest robust to noisy data. Extreme gradient boost (XGB) is a gradient boosting method that optimizes the gradient boosting algorithm through parallel processing, tree pruning, handling of missing values, and regularization to avoid overfitting and bias. LGBM works differently from conventional gradient boosting algorithms. Conventional boosting models grow trees level-wise, whereas LGBM grows them leaf-wise. Level-wise growth keeps the tree balanced and thereby limits its depth, but it requires additional operations to maintain that balance. LGBM does not balance the tree; instead, it keeps splitting leaf nodes. This produces a deep, asymmetric tree, but for the same number of leaves, leaf-wise growth tends to reduce loss more than level-wise growth.

Variable Importance
In BRF, XGB, and LGBM, we used the built-in function that calculates feature importance. In MLP and LR, permutation feature importance was used because the packages provide no built-in function. Permutation feature importance computes feature importance for any black-box estimator by measuring how much the score decreases when a feature is not available; the method is also known as Mean Decrease Accuracy [25,26].
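The idea behind permutation feature importance can be illustrated with a short, self-contained sketch. It is not the implementation used in the study: the scoring metric (accuracy), the number of repeats, the toy model, and the random data are all assumptions made for illustration. A feature is made "unavailable" by shuffling its column, and its importance is the mean drop in score.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy "black box": predicts 1 when feature 0 exceeds 0.5 and ignores feature 1.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

In this toy setting, shuffling the feature that the model ignores leaves the score unchanged, so its importance is zero, whereas shuffling the feature the model relies on lowers accuracy substantially.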

Statistical Analyses
Descriptive analysis was performed to compare the characteristics between survivors and non-survivors. Categorical variables were presented as numbers (%) and were compared using the Pearson's chi-squared test. Continuous variables were presented as mean ± standard deviation and were compared using the Student's t test. The discrimination powers of APACHE II, ProVent, and MEWS were assessed according to the AUC evaluated with the ROC curve analysis. All analyses were performed using SPSS software (version 26.0; IBM Corporation, Armonk, NY, USA). Differences were considered statistically significant at p-values of <0.05.
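For readers less familiar with the AUC, the following sketch shows one standard way to compute it: as the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor, with ties counted as one half. The analyses in this study were run in SPSS; this Python function, the made-up scores, and the variable names are illustrative assumptions only, and no confidence interval is computed.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative cases (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented example: 1 = died within 30 days, 0 = survived.
died = [1, 1, 1, 0, 0, 0, 0]
severity_score = [30, 24, 19, 24, 18, 15, 21]

auc = roc_auc(died, severity_score)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of survivors and non-survivors.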

Patient Characteristics
The mean patient age was 67 ± 15 years, and 61.5% of the patients were male. In total, 5061 patients (29.9%) died within 30 days; the mortality rates of the internal and external data sets were 31.5% and 22.0%, respectively. The mortality rates of the four hospitals included in the internal data set were 33.7%, 30.8%, 28.9%, and 33.7%. The baseline characteristics and laboratory values are presented in Table 1 and Table S1. Compared with the survivor group, the non-survivor group showed significantly higher age (69 ± 14 years vs. 66 ± 15 years, p < 0.001) and APACHE II score (26.3 ± 6.5 vs. 21.6 ± 6.8, p < 0.001). The rates of use of vasopressors, corticosteroids, neuromuscular blockers, and CRRT were also significantly higher in the non-survivor group. The most common ventilator mode was pressure control (n = 7269, 42.9%). The PaO2/FiO2 ratio was significantly lower in the non-survivor group (207 ± 173 in non-survivors vs. 262 ± 176 in survivors) (Table 1). Other laboratory findings also differed significantly between the two groups.
The 30-day mortality of the training-validation set and test set was 31.5% and 22.0%, respectively (p < 0.001). Moreover, the medications and interventions received during mechanical ventilation also differed significantly between the two sets (Tables S2 and S3).
BRF showed the highest sensitivity (84%), while XGB showed the highest positive predictive value (46%) as well as accuracy (76%) (Table 2).
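Sensitivity, positive predictive value, and accuracy are all derived from the same confusion matrix, as the sketch below illustrates. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common performance metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # deaths correctly flagged
        "specificity": tn / (tn + fp),   # survivors correctly cleared
        "ppv": tp / (tp + fp),           # flagged patients who actually died
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Invented counts: 50 deaths and 150 survivors in a hypothetical test set.
metrics = classification_metrics(tp=40, fp=30, tn=120, fn=10)
print(metrics)
```

Because positive predictive value depends on how common the outcome is, a model can combine high sensitivity with a modest positive predictive value when deaths are a minority of cases.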

Variable Importance
The top 10 variables in the machine learning algorithms are listed in Table 3. The most important features in the models were APACHE II in BRF and LGBM, CCI in MLP and LR, and norepinephrine in XGB. APACHE II and norepinephrine were the top predictors common across all models. Arterial blood gas analysis (ABGA) variables, including base excess, bicarbonate, and pH, were important in BRF. Age and comorbidities, such as chronic pulmonary disease, congestive heart failure, and diabetes, were important variables in the development of the models. The SHapley Additive exPlanations (SHAP) results for each model are shown in Supplementary Figures S3–S7.

Discussion
Few studies have evaluated the validity of predictive models of mortality in cohorts with varying types of disease conditions. This multicenter study found that machine learning models based on EMR data from the day of mechanical ventilation initiation can predict the 30-day mortality of patients receiving mechanical ventilation in the ICU regardless of the disease condition. Although the mortality of these patients is difficult to predict, the machine learning models predicted 30-day mortality better than did the conventional scoring systems of APACHE II, ProVent, and MEWS. The BRF and XGB models showed adequate discrimination abilities (AUC, 0.78 and 0.79, respectively). The most important features in the models were APACHE II, CCI, and norepinephrine. Although APACHE II did not show excellent discrimination power (AUC, 0.67), the conventional scores are still useful in developing machine learning models to predict outcomes.
Machine learning models have been developed to predict mortality in patients undergoing CRRT [27], critical trauma patients [28,29], and patients in the ICU [30]. Other studies have also used these models to predict in-hospital cardiac arrest [22] and real-time mortality [31,32]. Thorsen-Meyer et al. developed a machine learning model using a recurrent neural network in a cohort of 15,615 ICU patients with a 33% 90-day mortality [30]. The AUC at ICU admission was 0.73, and performance improved over time from an AUC of 0.82 after 24 h to 0.85 after 72 h. Kang et al. reported that a machine learning model using random forest predicted ICU mortality in 1571 patients undergoing CRRT [27]. The AUC of the machine learning model was 0.78, which was higher than those of the APACHE II and SOFA scores.
Mortality among mechanically ventilated patients is associated with factors related to patient management and complications during mechanical ventilation as well as with factors present at the initiation of mechanical ventilation [1]. As such, predicting mortality in patients with mechanical ventilation can be challenging. However, our model, which used clinical data within 24 h of mechanical ventilation initiation, showed good discrimination ability. Further, the predictive capabilities were similar to those of machine learning models described previously. In addition, unlike previous studies, the results of our study are derived from external validation in a completely independent hospital dataset.
Mechanical ventilation is one of the crucial life support measures provided in the ICU. However, mechanical ventilation is associated with markedly increased ICU costs [33]. Worldwide, there is a shortage of ICU beds because of the coronavirus disease 2019 pandemic [34]. Therefore, in terms of prioritizing the application of mechanical ventilation, rapid and accurate prediction of mortality in mechanically ventilated patients is important. Our machine learning models were trained on all ICU patients who received mechanical ventilation regardless of the disease condition. We used clinical variables that are commonly available in mechanically ventilated patients. This improved the generalizability of our machine learning models for application in the clinical setting. Furthermore, our machine learning models can predict mortality more accurately than conventional scoring systems in patients receiving mechanical ventilation. These models can be easily developed using clinical variables from electronic medical records and can therefore be used even in the emergency room setting. We suggest that our machine learning models should be utilized by physicians in making clinical decisions for judging mechanical ventilation priority.
Machine learning is like a black box in that we cannot explain the processes between input and output [35]. However, it is helpful to understand which variables play a significant role in predictive performance. Although there were some discrepancies between the machine learning models, APACHE II, norepinephrine, age, and CCI contributed significantly to the development of our machine learning models. The variables selected in each model tended to be similar to those reported in previous studies. Age, type of respiratory failure, use of inotropes, and severity scoring systems such as APACHE II and SAPS II have been considered important predictors of outcomes in patients receiving mechanical ventilation [1,12]. The models showed better discrimination abilities than the conventional scoring systems. However, APACHE II and CCI were also important factors in the development of the machine learning models. It is notable that the machine learning algorithms identified variables that were considered significant in previous studies. As machine learning for big data processing becomes more advanced, the systems for outcome prediction in critically ill patients will evolve. However, our results suggest that the conventional scoring systems will remain useful even in the big data era.
This study has some limitations. First, other variables associated with mechanical ventilation, such as PEEP or plateau pressure, were not included because they were not collected during mechanical ventilation and the data were analyzed only retrospectively. Moreover, the retrospective nature of the study precluded collection of data regarding the reasons for mechanical ventilation initiation or the cause of respiratory failure. Second, the AUC of the SOFA score, which is one of the most widely used outcome prediction scores, was not presented. Third, we did not present information regarding lung heterogeneity related to the outcomes of acute respiratory distress syndrome [36]. Adding this information to the machine learning models could improve the prediction accuracy. In terms of the mechanical properties of the lungs, the fractional order model is emerging as a tool to characterize lung function [37]. The forced oscillation technique is a non-invasive and reliable method for evaluating lung function, and it has shown great potential in healthy individuals and in patients with asthma or chronic obstructive pulmonary disease [37][38][39]. These can be further used in machine learning algorithms for the evaluation of the respiratory systems of mechanically ventilated patients. Fourth, historical bias may exist owing to the long duration of patient enrollment. Fifth, although the AUCs of our machine learning models indicate acceptable discriminatory ability, model performance still needs to be improved. Therefore, further prospective studies including a large cohort with various scoring systems and variables over time are needed to improve the efficacy of machine learning models for predicting outcomes in mechanically ventilated patients. Mortality prediction in mechanically ventilated patients can be further improved if longitudinal data can be collected.

Conclusions
Compared with previous scoring models, our machine learning models of BRF, LGBM, XGB, MLP, and LR can better predict 30-day mortality in mechanically ventilated patients. Although APACHE II did not show excellent discrimination power, the machine learning algorithms showed that the conventional scoring systems and CCI remain important factors for predicting outcomes in mechanically ventilated patients. Our findings suggest that improving the discrimination power of machine learning models can help in making crucial clinical decisions for patients who are less likely to benefit from mechanical ventilation.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/jcm10102172/s1, Figure S1: Flow chart of the patients, Figure S2: Receiver operating characteristics (ROC) curves of conventional scoring systems for predicting 30-day mortality in mechanically ventilated patients in total patients, Figure S3: SHAP features importance for the determinants of 30-day mortality in BRF, Figure S4: SHAP features importance for the determinants of 30-day mortality in LGBM, Figure S5: SHAP features importance for the determinants of 30-day mortality in XGB, Figure S6: SHAP features importance for the determinants of 30-day mortality in MLP, Figure S7: SHAP features importance for the determinants of 30-day mortality in LR, Table S1: Baseline characteristics and laboratory findings of the patients according to mortality, Table S2: Baseline patient characteristics according to validation dataset, Table S3: Clinical and laboratory findings according to validation dataset, Table S4: Variables included in the machine learning algorithms.

Institutional Review Board Statement:
This retrospective study protocol was approved by the institutional review board of Chuncheon Sacred Heart Hospital (2020-11-008).

Informed Consent Statement:
The need for informed consent was waived owing to the retrospective nature of the study.

Data Availability Statement:
Data sharing is not applicable to this article.