Machine Learning Model Development and Validation for Predicting Outcome in Stage 4 Solid Cancer Patients with Septic Shock Visiting the Emergency Department: A Multi-Center, Prospective Cohort Study

Reliable prognostic scores for minimizing futile treatments in advanced cancer patients with septic shock are rare. A machine learning (ML) model to classify the risk of advanced cancer patients with septic shock is proposed and compared with the existing scoring systems. In this multi-center, retrospective, observational study, patients with stage 4 cancer from a septic shock registry were divided into a training set and a test set in a 7:3 ratio. The primary outcome was 28-day mortality. The best ML model was determined using a stratified 10-fold cross-validation in the training set. A total of 897 patients were included, and the 28-day mortality was 26.4%. The best ML model in the training set was balanced random forest (BRF), with an area under the curve (AUC) of 0.821 to predict 28-day mortality. The AUC of the BRF to predict the 28-day mortality in the test set was 0.859. The AUC of the BRF was significantly higher than those of the Sequential Organ Failure Assessment (SOFA) score and the Acute Physiology and Chronic Health Evaluation II (APACHE II) score (both p < 0.001). The ML model outperformed the existing scores for predicting 28-day mortality in stage 4 cancer patients with septic shock. However, further studies are needed to improve the prediction algorithm and to validate it in various countries. This model might support clinicians in real time in adopting appropriate levels of care.


Introduction
The incidence of all cancer types has increased worldwide and is a major public health burden [1]. Recent advances in cancer treatment have improved the overall survival. Nevertheless, there is an increased risk of critical illness requiring intensive care unit (ICU) management [2]. Reportedly, approximately 5.2% of patients develop a critical illness within 2 years after a cancer diagnosis and are admitted to the ICU [3]. The mortality rate in the ICU is reportedly 14.1%, and 24.6% of ICU patients die during their hospital stay [3].

Study Population
This multi-center, retrospective, observational study used data from the Korean Shock Society (KoSS) septic shock registry between October 2016 and June 2019. The KoSS is a collaborative research network of the EDs of 11 university-affiliated hospitals in South Korea, established in 2013 to improve the quality of the research, diagnosis, and management of sepsis [15]. The study included patients ≥ 19 years of age who met the inclusion criteria (evidence of refractory hypotension or hypoperfusion in patients with suspected or confirmed infection) [16,17]. Hypotension was defined as systolic blood pressure (SBP) < 90 mm Hg, mean arterial pressure < 70 mm Hg, or an SBP decrease > 40 mm Hg. Refractory hypotension was defined as persistent hypotension despite the administration of a fluid challenge or the requirement of vasopressors to maintain an SBP ≥ 90 mm Hg or mean arterial pressure ≥ 70 mm Hg [16,17]. Hypoperfusion was defined as a serum lactate level ≥ 4 mmol/L [18].
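The inclusion definitions above can be written as a short screening check. The following Python sketch is illustrative only: the record type and its field names are hypothetical and are not taken from the KoSS registry, and the way the criteria are combined is one plausible reading of the definitions.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    """Hypothetical patient record; field names are illustrative only."""
    age_years: int
    infection_suspected: bool    # suspected or confirmed infection
    sbp: float                   # systolic blood pressure, mm Hg
    mean_ap: float               # mean arterial pressure, mm Hg
    sbp_drop: float              # decrease in SBP from baseline, mm Hg
    persists_after_fluid: bool   # hypotension persists after a fluid challenge
    needs_vasopressor: bool      # vasopressors required to maintain pressure
    lactate: float               # serum lactate, mmol/L


def is_hypotensive(p: Presentation) -> bool:
    # SBP < 90 mm Hg, MAP < 70 mm Hg, or SBP decrease > 40 mm Hg
    return p.sbp < 90 or p.mean_ap < 70 or p.sbp_drop > 40


def is_refractory_hypotension(p: Presentation) -> bool:
    # Persistent hypotension despite fluids, or a vasopressor requirement
    return (is_hypotensive(p) and p.persists_after_fluid) or p.needs_vasopressor


def has_hypoperfusion(p: Presentation) -> bool:
    # Serum lactate >= 4 mmol/L
    return p.lactate >= 4.0


def meets_inclusion(p: Presentation) -> bool:
    return (
        p.age_years >= 19
        and p.infection_suspected
        and (is_refractory_hypotension(p) or has_hypoperfusion(p))
    )
```

The exclusion criteria and the stage 4 cancer restriction are applied separately and are not modeled here.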
The exclusion criteria were as follows: patients who refused ICU management, patients who signed a "do not resuscitate" (DNR) order before arrival at the ED or at the time of diagnosis, patients who met the inclusion criteria 6 h or more after their ED arrival, patients who were transferred from other hospitals after stabilization, and patients who were transferred directly from the ED to other hospitals. This study included only patients with stage 4 cancer who were registered in the septic shock database of a participating hospital. In stage 4 cancer, the disease has spread to other organs or parts of the body. [16-2014-36], Hallym University College of Medicine Gangnam Sacred Heart Hospital [2015- , Korea University Guro Hospital [KUGH15358-001], and Hanyang University Hospital [HYUH2015-11-013-007]) approved the study protocol. Informed consent was obtained from all subjects before the data collection if the patient was conscious; otherwise, the consent of the guardian was obtained.

Outcome and Data Collection
The endpoint selected to develop ML models was 28-day mortality. Death during hospitalization was confirmed by a record review, and death after discharge or transfer was confirmed by telephone. The study coordinator at each hospital confirmed 28-day mortality by a telephone interview. If the first call did not connect, an additional call was attempted, and there were no cases where 28-day mortality was not confirmed. The principal investigator of every site worked with a designated local research coordinator who was responsible for ensuring the accuracy of the data entry and verifying the records. A quality management committee of emergency physicians, local research coordinators, and investigators from every ED was established to monitor and review the data's quality regularly. Members from the committee provided feedback on the results of the quality management process to the research coordinators and investigators, and doubts pertaining to the data were clarified either through the use of the system's query function or directly via a telephone call. The data collection included the patient's characteristics, clinical variables, and their laboratory results at presentation which were required to calculate the SOFA score, APACHE II score, and initial lactate level.

Machine Learning Model Development and Feature Analysis
During pre-processing, missing values were encoded as -1 because the ML algorithms require complete numeric input variables. Next, unnecessary variables were excluded through attribute and feature selection, leaving demographic characteristics, underlying diseases, blood test results, and other variables included in the 6 h bundle treatment after the ED visit for analysis. However, variables that reflected the worst value within 24 h of admission, such as the SOFA score and APACHE II score, were excluded from the ML model development. To develop an ML model for predicting the prognosis of advanced cancer patients, the dataset was randomly split in a 7:3 ratio (training:test) using stratified partitioning. All the variables identifiable after the initial resuscitation (6 h bundle therapy), including demographics, 6 h bundle therapy components, vital signs, and laboratory variables, were used to develop the ML model. Under the stratified condition, the samples selected for each fold have the same proportions of class labels as the original dataset [19]. The best ML model and its optimal hyperparameters were determined in the training set using stratified 10-fold cross-validation, in which the data set is divided into 10 folds, each of which is used once for internal validation while the remaining nine folds (90%) are used for training. The use of cross-validation and hyperparameter tuning for internal validation is considered a robust method for model evaluation before external validation on a separate data set and maximizes the potential performance of the ML model [20,21]. The hyperparameters of each model were optimized by a grid search that exhaustively considers all the parameter combinations. In the internal validation phase, the hyperparameters giving the best performance were sought. For example, numbers of estimators from 100 to 1000 were investigated for the balanced random forest (BRF).
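The missing-value encoding, stratified 7:3 split, and stratified 10-fold assignment described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the labels are toy data, and a real analysis would normally rely on an established ML library.

```python
import random
from collections import defaultdict

MISSING = -1  # sentinel for missing values, as described in the text


def encode_missing(row):
    """Replace None entries with the sentinel value."""
    return [MISSING if v is None else v for v in row]


def _indices_by_class(labels):
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    return by_class


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, k=10, seed=0):
    """Assign indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


# Toy labels (1 = died, 0 = survived); these are not the study data.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]
folds = stratified_folds(train_labels)  # indices into train_labels
```

Each fold then serves once as the validation set while the other nine are used for fitting, so every training-set patient is used for validation exactly once.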
The search grids were as follows: for logistic regression with balanced weight (LR-bw), C (0.001, 0.01, 0.1, 1, 10, and 100) and penalty (l1 and l2); for XGB with balanced weight (XGB-bw), max_depth (3, 5, and 7) and subsample (0.6, 0.8, and 1.0); and for random forest with balanced weight (RF-bw), the balanced bagging classifier (BBC), and BRF, n_estimators (100, 200, 300, 400, 500, 600, 700, 800, 900, and 1000).
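To make the size of the exhaustive search concrete, the grids listed above can be enumerated directly. The sketch below uses only the parameter names and values given in the text; how these keys map onto a particular library's estimator arguments is an assumption, and the `expand` helper is hypothetical.

```python
from itertools import product

# Search spaces as listed in the text.
n_estimators = list(range(100, 1001, 100))
grids = {
    "LR-bw": {"C": [0.001, 0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]},
    "XGB-bw": {"max_depth": [3, 5, 7], "subsample": [0.6, 0.8, 1.0]},
    "RF-bw": {"n_estimators": n_estimators},
    "BBC": {"n_estimators": n_estimators},
    "BRF": {"n_estimators": n_estimators},
}


def expand(grid):
    """Yield every parameter combination in a grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Number of candidate configurations per model.
counts = {model: len(list(expand(grid))) for model, grid in grids.items()}
total = sum(counts.values())
```

Under these grids there are 12 + 9 + 10 + 10 + 10 = 51 candidate configurations, so a 10-fold cross-validation over all of them corresponds to 510 model fits.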
To develop the ML model, a total of five classifiers were considered [10,22]. The dataset in this study was imbalanced because the number of data points differed between the classes. Therefore, five weighted or balanced ML models were used in the model development. Three basic ML classifiers with balanced weights were considered (LR-bw, XGB-bw, and RF-bw). In addition, two ensemble classifiers designed for handling imbalanced datasets were selected: the balanced bagging classifier and the balanced random forest classifier [23][24][25]. The best ML model was selected using the AUC and F1 score. The F1 score is a useful indicator for analyzing imbalanced data like ours, in which the minority class (28-day mortality) is the prediction target. After selecting the best ML model and its hyperparameters, the final ML model was built using the training set, and its performance was evaluated using the test set. The F1 score (F-measure) is a popular and suitable metric for imbalanced classification [26,27]. It is widely used in applications with imbalanced datasets because it measures how well the classification model handles the minority class [28]. The F1 score is the harmonic mean of the precision and recall and therefore balances a model in terms of both.
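The following sketch shows how the F1 score is obtained as the harmonic mean of precision and recall, and why it is informative for a minority class. The labels and predictions are toy values chosen for illustration and are not study data; the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 3 deaths (1) among 10 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # finds 2 of 3 deaths, 1 false alarm
y_trivial = [0] * 10                        # always predicts survival

model_scores = precision_recall_f1(y_true, y_model)
trivial_scores = precision_recall_f1(y_true, y_trivial)
```

In this toy case the trivial classifier is correct for 7 of 10 patients yet has an F1 score of 0, whereas the model that identifies two of the three deaths has precision, recall, and F1 all equal to 2/3.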
For feature analysis, Shapley Additive exPlanations (SHAP) values were used. The SHAP values quantify the effects of the features on the outcome of the ML model [29]. In addition, the performance of the final ML model was compared with the existing severity scores of the SOFA, APACHE II, and initial lactate level. For the SOFA and APACHE II scores, the worst values obtained within 24 h of the ED visit were used. The calibration and discrimination of the final ML model and other scores were compared based on the calibration curve and AUC.
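As an illustration of what a Shapley value measures, the sketch below computes exact Shapley values for a small toy function by averaging each feature's marginal contribution over all feature subsets. It is a simplified stand-in, not the SHAP procedure applied to the BRF model: the toy function, the three features, and the use of a single baseline vector for "absent" features are all assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their observed value; others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(z):
    # Hypothetical score with one interaction term (features 0 and 2).
    return 0.5 * z[0] + 0.3 * z[1] + 0.2 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, x, baseline)
```

The attributions sum to the difference between the output at `x` and at the baseline, and the interaction term is shared equally between the two features involved. Exact enumeration grows exponentially with the number of features, which is why practical tools use model-specific or approximate algorithms.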

Statistical Analysis
The continuous variables were presented as the mean ± standard deviation or the median with interquartile range, as appropriate, and the categorical variables were presented as absolute or relative frequencies. Student's t-test and the Mann-Whitney U test were used to compare the continuous variables, and the Chi-square test and Fisher's exact test were used for the categorical variables. The discrimination and calibration of the final ML model and other scores were compared based on the AUC and calibration curve. The AUC was calculated and then compared using a 2-tailed nonparametric method [30].
A two-sided p-value < 0.05 was considered to be statistically significant. All the statistical analyses were performed using SPSS Statistics version 18 (SPSS Inc., Chicago, IL, USA) and R version 4.0.4 (R Foundation for Statistical Computing, Vienna, Austria). In addition, we used development software (Anaconda 3) for the ML on the Python platform. The Python version was 3.7 for Windows (Python Software Foundation, Wilmington, DE, USA).
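The AUC used for discrimination has a simple rank interpretation: it is the probability that a randomly chosen non-survivor receives a higher score than a randomly chosen survivor, with ties counted as one half. The sketch below computes it this way on toy values; the function name is hypothetical, and the statistical comparison of two AUCs [30] is not reproduced here.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy predicted risks and outcomes (1 = died); not study data.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
toy_auc = auc(scores, labels)  # 5 of the 6 pairs are ranked correctly
```

This pairwise form is quadratic in the sample size, which is acceptable for a cohort of a few hundred patients; library implementations typically use a sort-based equivalent.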

Participant Characteristics
A total of 2132 patients were screened, of whom 210 patients with DNR orders were excluded (Figure 1). In addition, patients diagnosed with septic shock 6 h or longer after arriving at the ED (n = 125) were excluded. Another 83 patients who were transferred from another hospital for septic shock but were stable were also excluded, as were 43 patients who were transferred directly from the ED to another hospital without admission. After excluding 774 non-stage 4 cancer patients, a final 897 adult patients were included in the analysis. The dataset was split randomly in a 7:3 ratio (training: 627, test: 270). The baseline characteristics of the 897 patients are shown in Table 1. The mean ages of the survivors and non-survivors were 65.6 and 66.9 years, respectively (p = 0.165). The proportion of males did not differ between survivors and non-survivors (61.8% vs. 64.6%, p = 0.482). The mean SBP of the non-survivors was significantly higher than that of survivors (99.5 vs. 94.9 mm Hg, p = 0.033). The median SOFA score for the non-survivors was significantly higher than for the survivors (10 vs. 7, p < 0.001). The other characteristics of the survivors and non-survivors are presented in Table 1.
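The patient flow reported above can be checked with simple arithmetic. The counts below are taken from the text; the dictionary keys are shorthand labels introduced for this example.

```python
screened = 2132
exclusions = {
    "DNR order": 210,
    "septic shock diagnosed 6 h or longer after ED arrival": 125,
    "transferred in after stabilization": 83,
    "transferred directly from the ED to another hospital": 43,
    "not stage 4 cancer": 774,
}

included = screened - sum(exclusions.values())
train_n, test_n = 627, 270
train_share = train_n / included

print(included, train_n + test_n, round(train_share, 3))
```

The exclusions leave 897 patients, which matches the reported cohort size and the sum of the training and test sets, and the training share is approximately 0.70.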

ML Model Development
Five weighted or balanced ML models were considered as candidates because the numbers of patients who died and survived were imbalanced. The best ML model in stratified 10-fold cross-validation was selected as the final ML model. The AUC of the BRF for 28-day mortality in the training set was 0.823 (95% confidence interval (CI): 0.782-0.864; Table S2), and the F1 score was 0.604 (95% CI: 0.548-0.66). The AUC of the BRF with 10-fold validation in the training set is shown in Supplementary Figure S1. Based on this, the final chosen ML model was BRF. The AUC of the BRF for 28-day mortality in the test set was 0.826 (Table 2). The F1 score of the BRF was 0.64. The F1 scores of the SOFA, APACHE II, and initial lactate level were 0.321, 0.294, and 0.36, respectively.

Comparison of Diagnostic Performance of ML Model with Other Scores
The AUC of the BRF in the test set was 0.826. The AUCs of the SOFA, APACHE II, and initial lactate level in the test set were 0.672, 0.662, and 0.683, respectively (Figure 2). The AUC of the BRF was significantly higher than those of the SOFA, APACHE II, and initial lactate level (p = 0.0001, p < 0.0001, and p = 0.0001, respectively). After the Bonferroni correction on the test set, the AUC of the BRF model was significantly higher than those of the SOFA and APACHE II scores (Supplementary Table S3 and Figure S2).
The top 20 most important variables for predicting 28-day mortality with BRF are summarized in Figure 3 with visible explanations across all the patients. The impacts of these features on the 28-day mortality were quantified by the Shapley values. Most of the variables were found to be related to the patient's prognosis, similar to the results of previous studies, but the activated prothrombin time and potassium level were unexpected predictors. The initial body temperature had the highest overall impact (total 0.032) on the 28-day mortality, followed by the initial albumin level (0.028) and CK MB level (0.028).
Calibration was evaluated with plots of the predicted and observed probabilities. The ML model showed better calibration than the SOFA score, APACHE II score, and initial lactate level in the test set (Figure S3). The ML model was closest to the ideal line and predicted the actual 28-day mortality rate well.
Figure 3. Abbreviations: Body_temperature1, initial body temperature; Lactate_after_fluid, lactate level following fluid administration; Respiratory_rate1, initial respiratory rate; Lactate1, initial lactate level; Focus_lung, lung as the site of infection; BUN, blood urea nitrogen level; pH, initial pH on arterial blood gas analysis; Body_temperature_enroll, body temperature when septic shock is recognized; Lactate_enroll, lactate level when septic shock is recognized; SaO2, initial arterial O2 saturation on arterial blood gas analysis; Heart_rate_enroll, heart rate when septic shock is recognized. All other variables are the first test values after the emergency department visit unless otherwise specified.
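The calibration assessment described in this section compares predicted probabilities with observed event rates. A minimal sketch of how the points of such a calibration plot can be computed is shown below, assuming equal-width probability bins; the binning scheme, function name, and toy values are assumptions for illustration and are not taken from the study.

```python
def calibration_curve(probs, outcomes, n_bins=5):
    """Return (mean predicted, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((p, y))
    curve = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, observed, len(members)))
    return curve


# Toy predicted risks and outcomes (1 = died); not study data.
probs = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]
curve = calibration_curve(probs, outcomes)
```

A well-calibrated model yields points close to the diagonal, where the mean predicted probability in each bin equals the observed event rate; this diagonal is the "ideal line" referred to above.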

Discussion
In the present study, an ML (BRF) model to predict the prognosis of stage 4 cancer patients with septic shock was developed and validated. Several models showed a similar AUC to that of the BRF; however, BRF had the best F1 score. The AUC of the ML model for predicting the 28-day mortality was superior to those of the traditional SOFA score, APACHE II score, and initial lactate level. The ML model showed better calibration than the SOFA and APACHE II scores in both the training and test sets. The variables that had the greatest effect on the 28-day mortality were the initial body temperature, serum albumin level, and serum CK MB level. However, the model should be applied with caution because an F1 score > 0.8 is necessary to consider a model good. Moreover, the model alone is an insufficient basis for decisions about the direction of a patient's care; a collegial discussion and exchanges with the families are also necessary to reach a shared decision.
To the best of our knowledge, this is the first study in which an ML model to predict the outcomes of stage 4 cancer patients with septic shock visiting the ED was developed and validated. Our study population was extracted from a multicenter study with a large sample size. The discrimination and calibration of the ML model were examined and compared with the pre-existing severity scores. In developing the ML model, we also used as many variables as possible that could influence the outcomes of patients with advanced cancer and septic shock.
Oncologic patients with sepsis and septic shock are admitted more frequently to the ICU than subjects from the general population. In addition, the 28-day mortality rate is higher in these patients [4,31]. However, the demand for ICU care usually exceeds the supply; therefore, the triage and allocation decisions for ICU care for critically ill patients are important [32]. Moreover, considering the high mortality rates associated with stage 4 solid cancer patients with septic shock visiting the ED, it is necessary to establish a risk score that provides professionals and families with more precise information for deciding whether to go ahead with an invasive procedure. It is also important to work on the early expression of patients' wishes to be resuscitated or not. Scoring systems that predict the prognosis in cancer patients with septic shock have been investigated in several studies. The predictive values of the SOFA and APACHE II scores for the outcomes of patients with sepsis and septic shock were not satisfactory [33][34][35]. One ICU-based study reported that the AUC of the SOFA score for in-hospital mortality was 0.69 in critically ill cancer patients with a suspected infection [33]. Another study reported that the AUC of the APACHE II score for predicting in-hospital mortality in patients with sepsis or septic shock was 0.71 [36]. Compared to these studies using a similar study design, our ML model had an AUC greater than 0.8, although the primary outcome was different. In several studies, hypothermia was associated with increased mortality and organ failure in patients with severe sepsis [37,38]. Jonas et al. demonstrated that an elevated body temperature in patients admitted to the ED was associated with reduced mortality in patients with sepsis or septic shock admitted to the ICU [39]. However, a fever above 39.5 °C was associated with an increased mortality rate [40].
In several sepsis studies, the serum albumin level was associated with increased mortality [41][42][43]. An elevated troponin level was correlated with a greater degree of left ventricular dysfunction, illness severity, and mortality [44,45]. In a meta-analysis, 61% of subjects with elevated troponin had a twofold increased risk of death compared with patients with undetectable troponin [46]. In our machine learning model, troponin had a lower prognostic contribution than CK MB, but similarly to previous studies, elevated myocardial enzyme levels in sepsis and septic shock were highly associated with a poor prognosis. To recognize the occurrence of sepsis or predict the prognosis in the early stages while the patient is in the ward or ICU, research using ML is being actively conducted [11,47].
The present study had several limitations. First, it only included single-country data and a relatively small sample size, although multi-center prospectively collected registry data were used. Therefore, the results might not be representative of the general population. Unlike other studies using ML models, in the present study, the diagnostic performance in the test set is slightly higher than the diagnostic performance of the training set. Although we randomly divided the data into the training set and the test set, since we used patients with similar characteristics, we think that the similarity between the training set and the test set still exists and the diagnostic performance of the test set may have been higher as a result of chance. In future work, the performance of our ML model must be tested using independent cohort data for an accurate evaluation. The data used is from a registry, which inherently limits the applicability of such an algorithm with real-world data. Therefore, ML models should be prospectively validated using larger samples and a variety of real-world data from many countries and ethnicities. Second, cancer-related characteristics such as the treatment modality (e.g., surgery, radiotherapy, and chemotherapy), performance status, and response to therapy were not examined. These factors are known to be associated with the outcomes. However, in recent studies, cancer-related characteristics were not associated with mortality [48,49]. The accurate evaluation of the performance status is challenging in the ED due to subjectivity and irreproducibility [50,51]. Third, there may be unmeasured confounders that can affect the results. Our registry was for the general population with septic shock and was not cancer-specific. Therefore, the possibility of unmeasured variables, especially cancer-specific prognostic factors, cannot be excluded. Nevertheless, many variables associated with the outcome of patients with septic shock were included. 
Lastly, although 10 hospitals participated in the Surviving Sepsis Campaign for septic shock treatment, there might be differences in the treatment strategies between hospitals.

Conclusions
The ML model showed satisfactory performance in predicting 28-day mortality in stage 4 cancer patients with septic shock. The ML model outperformed the pre-existing prediction scores, such as the SOFA, APACHE II, and initial lactate level. We identified 20 important variables that significantly affected the prediction model. This model might support clinicians in real time in adopting an appropriate level of care in terms of the chance of survival. However, the level of treatment should not be determined by the ML model alone without a cooperative discussion or exchange with the families. Additionally, further studies are needed to improve the prediction algorithm and to validate it in various countries.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11237231/s1, Table S1. Comparison of the training and test sets. Table S2. Comparison of machine learning models for predicting 28-day mortality in training sets. Table S3. Comparison of ML models for predicting 28-day mortality in test sets. Figure S1. The AUC of BRF with 10-fold validation in the training set. Abbreviations: AUC-area under the curve; BRF-balanced random forest. Figure S2. The AUC of ML models for predicting 28-day mortality in the test set. Abbreviations: AUC-area under the curve; BBC-balanced bagging classifier; BRF-balanced random forest classifier; CI-confidence interval; LR-bw-logistic regression with balanced weight; ML-machine learning; RF-bw-random forest classifier with balanced weight; XGB-bw-XGB classifier with balanced weight. Figure S3. Calibration curve of the ML model, SOFA, APACHE II and initial lactate level for 28-day mortality in the test set. Abbreviations: BRF-balanced random forest; ML-machine learning; SOFA-Sequential Organ Failure Assessment; APACHE-Acute Physiology and Chronic Health Evaluation.
Informed Consent Statement: Informed consent was obtained from all subjects before the data collection.

Data Availability Statement:
The study data will be made available upon request to the corresponding author.