Using Machine Learning to Develop and Validate an In-Hospital Mortality Prediction Model for Patients with Suspected Sepsis

Background: Early recognition of sepsis and the prediction of mortality in patients with infection are important. This multi-center, ED-based study aimed to develop and validate a 28-day mortality prediction model for patients with infection using various machine learning (ML) algorithms. Methods: Patients with acute infection requiring intravenous antibiotic treatment during the first 24 h of admission were prospectively recruited. Patient demographics, comorbidities, clinical signs and symptoms, laboratory test data, selected sepsis-related novel biomarkers, and 28-day mortality were collected and divided into training (70%) and testing (30%) datasets. Logistic regression and seven ML algorithms were used to develop the prediction models. The area under the receiver operating characteristic curve (AUROC) was used to compare different models. Results: A total of 555 patients were recruited with a full panel of biomarker tests. Among them, 18% fulfilled Sepsis-3 criteria, with a 28-day mortality rate of 8%. The wrapper algorithm selected 30 features, including disease severity scores, biochemical parameters, conventional biomarkers, and a few sepsis-related biomarkers. Random forest outperformed other ML models (AUROC: 0.96; 95% confidence interval: 0.93–0.98) and SOFA and early warning scores (AUROC: 0.64–0.84) in the prediction of 28-day mortality in patients with infection. Additionally, random forest remained the best-performing model, with an AUROC of 0.95 (95% CI: 0.91–0.98, p = 0.725) after removing five sepsis-related novel biomarkers. Conclusions: Our results demonstrated that ML models provide a more accurate prediction of 28-day mortality than the logistic regression model, with an enhanced ability to handle multi-dimensional data.


Introduction
Sepsis was first defined as a documented or suspected infection with systemic inflammatory response syndrome (SIRS) [1]. However, this conventional definition was abandoned in 2016, and "sepsis-3" now serves as the definition of sepsis, with the intent of increasing prognostic accuracy [2]. The incidence of sepsis has been steadily increasing.

The 28-day mortality rate of the study cohort was 8%. Non-survivors tended to be older and male, were more likely to have malignancies, and more often presented with a GCS below 15, an increased respiratory rate, poorer oxygen saturation, and abnormal hematological profiles, such as lower hematocrit and higher red cell distribution width. They were also more likely to have sepsis-3 and septic shock (Table 2). The mean levels of the sepsis-related novel biomarkers measured at baseline were higher in non-survivors than in survivors (Table S3). The 28-day mortality did not differ significantly between patients with and without missing data.

Machine Learning Development and Evaluation
First, we divided the overall dataset, which contained 219 features (Table S4), into training and testing sets with sample sizes of 389 and 166, respectively. Among the features, 70 were continuous, and 12 had U-shaped distributions and were centered before feature selection: pulse rate, respiratory rate, systolic blood pressure, diastolic blood pressure, mean arterial pressure, blood sugar, activated partial thromboplastin time, phosphorus, calcium, protein C, bicarbonate, and white blood cell count (Figure S1).
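The split and centering steps can be sketched as follows. This is a minimal illustration, not the study's code: stratifying by outcome, the count of 44 deaths (roughly 8% of 555), and centering on the training-set median are all assumptions, since the text does not specify them. The function names are hypothetical.

```python
import random
from statistics import median

def stratified_split(labels, test_pct=30, seed=0):
    """Split indices into train/test, sampling each outcome class separately.

    Stratification is an assumption; the text only states a 70/30 split.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = len(idx) * test_pct // 100  # integer arithmetic avoids rounding surprises
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

def center_on_training_median(train_values, test_values):
    """Center a feature using a statistic from the training set only.

    Using the median is an assumption; the text says only "centered".
    """
    center = median(train_values)
    return ([v - center for v in train_values],
            [v - center for v in test_values])

# Illustrative labels: 44 deaths among 555 patients (hypothetical count).
labels = [1] * 44 + [0] * 511
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Computing the center from the training set alone keeps information from the testing set out of the preprocessing step.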

28-Day In-Hospital Mortality
Figure 1 shows the importance of the independent features, ranked in descending order, in the final random forest model. The five most important features in predicting 28-day mortality were the SOFA score, IL-8, D-dimer, IL-6, and angiopoietin-2. In general, higher levels of these biomarkers and disease scores pushed the model toward a higher predicted risk of 28-day mortality. Conversely, lower albumin and platelet levels were associated with a greater predicted risk of 28-day mortality. These observations were compatible with the simple univariate analysis (Tables S3 and S5).
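To illustrate how a per-feature contribution of this kind can be computed, the sketch below calculates exact Shapley values for a toy model by enumerating feature subsets. It is an assumption-laden illustration: the linear risk function, its coefficients, and the patient and baseline values are hypothetical, and replacing absent features with a single baseline value is a simplification of what SHAP implementations do for tree ensembles.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input."""
    n = len(x)

    def value(present):
        # Features outside `present` are replaced by their baseline value.
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_risk(z):
    """Hypothetical risk score: rises with SOFA and IL-8, falls with albumin."""
    sofa, il8, albumin = z
    return 0.05 * sofa + 0.002 * il8 - 0.04 * albumin

patient = [8, 100, 2.5]   # hypothetical values
baseline = [2, 20, 4.0]   # hypothetical reference patient
for name, phi in zip(["SOFA", "IL-8", "albumin"],
                     shapley_values(toy_risk, patient, baseline)):
    print(f"{name}: {phi:+.3f}")
```

In this toy case the low albumin value receives a positive contribution, which mirrors the direction described above. The contributions sum to the difference between the model output for the patient and for the baseline.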

Comparison between the Best Machine Learning and Traditional Scoring Systems
Compared with all the traditional scoring systems, RF performed best in predicting 28-day mortality on the testing dataset (AUROC: 0.96; 95% CI: 0.93-0.98, p < 0.001; Figure 2), and the AUROC remained high after removing the five sepsis-related biomarkers (AUROC: 0.95; 95% CI: 0.91-0.99). Among the seven traditional scoring systems, the CHARM score demonstrated the second-best performance with an AUROC of 0.86 (95% CI: 0.79-0.91) for 28-day mortality prediction, whereas SIRS performed the worst (AUROC: 0.53; 95% CI: 0.40-0.77).

[Figure caption] The horizontal location of this SHAP plot demonstrates whether the effect of the value of that feature is associated with a higher or lower prediction of the model output, and the color indicates whether that feature is high (red) or low (blue) for that observation. SOFA, Sequential Organ Failure Assessment; SOFA score-Res, SOFA-respiratory; SOFA score-Coag, SOFA-coagulation; FDP, fibrin degradation products.
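As a rough sketch of how an AUROC and its 95% CI can be obtained from predicted scores, the example below computes the AUROC as the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor. The percentile bootstrap used for the interval is an assumption; the text does not state which method produced the reported intervals. The labels and scores are made up.

```python
import random

def auroc(y_true, scores):
    """AUROC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (labels must be 0 or 1)."""
    auroc(y_true, scores)  # raises early if one class is missing
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Made-up example: 1 = died within 28 days, 0 = survived.
y = [0, 0, 0, 1, 0, 1, 0, 1]
s = [0.1, 0.2, 0.6, 0.5, 0.3, 0.8, 0.05, 0.9]
print(auroc(y, s), bootstrap_ci(y, s, n_boot=500))
```

With a mortality rate near 8%, a testing set of 166 patients contains few deaths, so intervals from any resampling method will be sensitive to a handful of cases.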

Imbalanced Data Management
Applying SMOTE with both up-sampling and down-sampling procedures did not improve model performance (Table S6). Therefore, no imbalance correction was adopted in this study.
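For readers unfamiliar with SMOTE, the sketch below shows its core up-sampling step: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority-class neighbours. This is a simplified illustration under assumed parameters (`k`, `n_new`, and the toy points are hypothetical), not the implementation used in the study, and it does not cover the down-sampling procedure.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features (hypothetical values).
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
for point in smote(minority, n_new=3, k=2):
    print(point)
```

Synthetic samples should be generated from the training set only, so that the testing set still reflects the original outcome prevalence.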

Discussion
In this prospective hospital-based cohort study, ensemble-based ML models, especially the random forest (RF) model, outperformed deep learning models, logistic regression models, and traditional scoring systems in the prediction of 28-day mortality for patients with infection. We demonstrated that ML models incorporating conventional features could be developed to assist daily practice in frontline health care settings. With 25 conventional features, the RF model had an AUROC of up to 0.95 in predicting 28-day mortality on the testing dataset. Many single biomarkers, such as IL-8, albumin, and D-dimer, were also found to have predictive power similar to that of the SOFA score. Sepsis-related novel biomarkers, including IL-8, IL-6, and angiopoietin-2, were included in the final models, but could also be substituted by other features without significant impact on the performance.
In our study, we demonstrated that RF can be used to rank the importance of features and derive a powerful prediction model with complicated interactions between features. RF generates hundreds of decision trees to fit the dataset. By averaging the predictions of many trees, RF reduces the high variance of a single tree. RF also enables the evaluation of more features and interactions than traditional modelling approaches. A similar study by Taylor et al., which predicted 28-day mortality of ED patients with sepsis using real-world data, also demonstrated that the RF model performed better than the logistic regression model (AUROC of 0.86 vs. 0.76) [15].
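The variance-reduction effect of averaging can be illustrated with a small simulation. This is an idealized sketch: each "tree" is modelled as the true risk plus independent Gaussian noise, and the true value, noise level, and function names are hypothetical. Trees in a real random forest are correlated, so the actual reduction is smaller than shown here.

```python
import random
from statistics import mean, pvariance

def ensemble_prediction(true_value, noise_sd, n_trees, rng):
    """Average n_trees noisy predictions (independent noise is an idealization)."""
    return mean(true_value + rng.gauss(0, noise_sd) for _ in range(n_trees))

def ensemble_variance(n_trees, n_repeats=2000, seed=0):
    """Variance of the averaged prediction across repeated simulated fits."""
    rng = random.Random(seed)
    preds = [ensemble_prediction(0.3, 0.2, n_trees, rng) for _ in range(n_repeats)]
    return pvariance(preds)

single = ensemble_variance(n_trees=1)
forest = ensemble_variance(n_trees=100)
print(f"single tree: {single:.4f}, average of 100 trees: {forest:.6f}")
```

Under the independence assumption the variance of the average falls roughly in proportion to the number of trees, which is why adding trees stabilizes predictions without increasing bias.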
The SOFA score was an important feature in our feature selection process and ML modelling in predicting 28-day mortality for patients with infection, in accordance with many other retrospective studies (pooled AUROC of 0.75-0.78) [12,19,27,28]. Furthermore, the SOFA respiratory score was selected in addition to the total SOFA score in our final RF model. Our findings suggest that respiratory dysfunction is an important predictor of mortality in patients with infection, which is supported by many other studies [5,29]. In addition, the qSOFA score, which was developed to screen for patients with possible sepsis-3, also contains the respiratory rate [19]. We believe that respiratory dysfunction contributes more to sepsis-associated mortality and should be considered an important risk factor in future research.
Several commonly measured biomarkers in clinical settings were selected for our final model. D-dimer is a fibrin degradation protein fragment that is formed after a blood clot is degraded by fibrinolysis. Severe infection may lead to the activation of an inflammatory cascade that can trigger this coagulopathy process [30]. Furthermore, the low serum albumin level in the acute phase of sepsis may be due to the inhibition of the albumin gene caused by TNF-α overexpression during inflammation [31]. This decrease in albumin level was significantly associated with the risk of death in septic patients [32]. Additionally, the presence of a large amount of lipopolysaccharides during sepsis activates the hypothalamus-pituitary-adrenal axis and the sympathoadrenal system, which leads to an increased output of cortisol and catecholamines, and ultimately an elevated serum lactic acid level, which has been widely recognized as a marker of tissue hypoxia or hypoperfusion and increased risk of multiple organ dysfunction syndrome [33].
Inflammation-related biomarkers, such as IL-8 and IL-6, were predictive of 28-day mortality both in the univariate analysis and in the final RF model. As a single biomarker, IL-8 performed similarly to the SOFA score (AUROC 0.83 vs. 0.82) and has been associated with 28-day mortality in a smaller study [34]. In contrast, the contributions of the other sepsis-related novel biomarkers were less prominent in our study. One possible reason is that the timing of sampling affected their predictive power, as many markers may be elevated only at certain phases of disease progression and then rapidly subside [35], whereas patients at different stages of infection severity were enrolled in this study. Therefore, we hypothesize that multiple measurements may capture the dynamic patterns of these indicators and better correlate with patient outcomes.
In this study, the clinician's gestalt demonstrated moderate predictive accuracy and precision (AUROC: 0.83; 95% CI: 0.69-0.90). However, it was not selected in the RF models of our study. The wrapper algorithm ranks features that are selected in a model with high-order feature interactions, which indicates that the contribution of clinical gestalt can be replaced by other features. Since the development of ML algorithms, researchers have been trying to enhance human experience and judgement by translating data into a language that the machine can understand [36]. In reality, the clinician's gestalt is seldom assessed together with the performance of machine-aided decision making, but when the two were assessed simultaneously and reported, the decision aid rarely outperformed human judgment [37].
Compared to previous studies that used retrospective electronic medical record data for model inputs, our study has the advantage of having good-quality data prospectively collected from three different emergency departments. This study still has some limitations. First, from the standpoint of clinical practicality, not all the information used for modelling inputs in this study is routinely collected in the emergency department. Nonetheless, the reduced model without the sepsis-related novel biomarkers still performed relatively well in our study. It is also worth recognizing that the predictive capability of traditional scoring systems can be improved by applying ML algorithms to handle multidimensional features. Additionally, if laboratory and vital sign data can be collected repeatedly over time, this may allow more precise analysis that reflects the time-dependent nature of sepsis progression [38]. Lastly, the outcome that we adopted in this study, 28-day mortality, might not be the best endpoint for improving the care of septic patients. Future models are needed to predict the response to treatments, such as fluids and inotropic agents, in patients with suspected sepsis.

Conclusions
We derived and tested a multi-feature prediction model that estimates the mortality probability in adult patients with infection. We demonstrated that the ML model could enhance the ability to deal with multidimensional data and achieve excellent performance in outcome prediction. Our ML model predicts mortality well, but at the expense of gathering a large amount of data, which might not be cost-effective in real-world clinical settings. While this study adds to the understanding of how ML algorithms can enhance the utility of health data, critical issues remain to be resolved; further studies that evaluate the clinical utility of the developed models and predictors are required. Additionally, external validation to confirm whether these algorithms provide improved predictive value in identifying at-risk patients in various clinical settings is warranted before they can move forward to clinical application.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines10040802/s1, Figure S1: The U-shape distributions of 12 features; Table S1: Relevant studies about 28-day mortality prediction for sepsis patients; Table S2: Biomarker distribution of the different subgroups; Table S3: Comparison of level of sepsis-related novel biomarkers between the survivors and the in-hospital mortality groups; Table S4: Total Included Features; Table S5: The corresponding mean weighted contribution and area under the receiver operating characteristic curves (AUROC) of 30 features selected by the wrapper algorithm around the random forest models in the training dataset. The feature candidates were ranked according to the AUROC; Table S6: The AUROC performance of seven machine learning models when various features were selected by applying SMOTE for both up-sampling and down-sampling procedures; Table S7: Sequential (sepsis-related) Organ Failure Assessment (SOFA) score; Table S8: The summary descriptions of the selected eight algorithms.