Machine Learning Prediction Models for Mortality in Intensive Care Unit Patients with Lactic Acidosis

Background: Lactic acidosis is the most common cause of anion gap metabolic acidosis in the intensive care unit (ICU) and is associated with poor outcomes, including mortality. We sought to compare machine learning (ML) approaches with logistic regression analysis for the prediction of mortality in patients with lactic acidosis admitted to the ICU. Methods: We used the Medical Information Mart for Intensive Care III (MIMIC-III) database to identify adult ICU patients with lactic acidosis (serum lactate ≥4 mmol/L). The outcome of interest was hospital mortality. Using the training dataset, we developed prediction models with four ML approaches, consisting of random forest (RF), decision tree (DT), extreme gradient boosting (XGBoost), and artificial neural network (ANN), as well as statistical modeling with forward stepwise logistic regression. We then assessed model performance in the independent testing dataset using the area under the receiver operating characteristic curve (AUROC), accuracy, precision, error rate, Matthews correlation coefficient (MCC), and F1 score, and assessed model calibration using the Brier score. Results: Of 1919 lactic acidosis ICU patients, 1535 and 384 were included in the training and testing datasets, respectively. Hospital mortality was 30%. RF had the highest AUROC at 0.83, followed by logistic regression (0.81), XGBoost (0.81), ANN (0.79), and DT (0.71). In addition, RF had the highest accuracy (0.79), MCC (0.45), and F1 score (0.56), and the lowest error rate (21.4%). The RF model was also the best calibrated: the Brier scores for RF, DT, XGBoost, ANN, and multivariable logistic regression were 0.15, 0.19, 0.18, 0.19, and 0.16, respectively. The RF model outperformed the multivariable logistic regression model, the SOFA score (AUROC 0.74), the SAPS II score (AUROC 0.77), and the Charlson score (AUROC 0.69). Conclusion: The ML prediction model using the RF algorithm provided the highest predictive performance for hospital mortality among ICU patients with lactic acidosis.

Recently, artificial intelligence (AI) and machine learning (ML) have been increasingly utilized for precision medicine [28,29], including the prediction of clinical outcomes among critically ill patients [30][31][32][33][34]. Owing to the ability of ML to cope with nonlinear, complex, and multidimensional data [31,35], recent studies have demonstrated that ML approaches using ICU data provide high predictive performance that outperforms traditional analysis [32,33]. While the use of lactate levels in mortality prediction among critically ill patients has been investigated [26,27], data on mortality risk prediction in the subgroup of ICU patients with lactic acidosis are limited. Given the heterogeneous impact of lactic acidosis on clinical outcomes across patient populations and ICU settings (such as lactic acidosis in patients with trauma, cardiac surgery, and septic shock) [9][10][11][12][13][14][15][16][17][18][19][20], an ML-based mortality prediction model for ICU patients with lactic acidosis could provide a novel, individualized approach to clinical decision making for critically ill patients.
In this study, we aimed to develop various ML-based prediction models for mortality in ICU patients with lactic acidosis and to assess their performance in comparison to a traditional statistical model.

Patient Population
The Mayo Clinic Institutional Review Board approved this observational study (IRB number 21-009222). We used the Medical Information Mart for Intensive Care III (MIMIC-III) database to conduct this study. MIMIC-III provides deidentified comprehensive clinical data from ICU patients at Beth Israel Deaconess Medical Center in Boston, Massachusetts, United States between 2001 and 2012 [36]. The database is widely accessible to researchers internationally under a data use agreement. If patients had multiple ICU admissions, we analyzed only the first admission.
Inclusion criteria were (1) age ≥18 years and (2) presence of lactic acidosis at ICU admission, defined as the first serum lactate measured within 48 h of ICU admission of ≥4.0 mmol/L. The exclusion criteria were (1) no serum lactate measurements within 48 h of ICU admission or (2) being admitted to the ICU for ≤24 h.
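Under these criteria, cohort selection reduces to a simple row filter. The sketch below illustrates the logic on a toy table; the column names are hypothetical and do not correspond to the actual MIMIC-III schema, which requires its own derivation queries.

```python
import pandas as pd

# Toy cohort with hypothetical column names (not real MIMIC-III fields).
df = pd.DataFrame({
    "age": [67, 15, 54, 71],
    "first_lactate_mmol_l": [5.2, 6.1, 3.1, 4.0],   # first value within 48 h of ICU admission
    "lactate_measured_48h": [True, True, True, False],
    "icu_los_hours": [72, 48, 96, 20],
})

mask = (
    (df["age"] >= 18)                       # inclusion: adults only
    & df["lactate_measured_48h"]            # exclusion: no lactate within 48 h
    & (df["first_lactate_mmol_l"] >= 4.0)   # inclusion: lactic acidosis definition
    & (df["icu_los_hours"] > 24)            # exclusion: ICU admission <=24 h
)
cohort = df[mask]
print(len(cohort))
```

Only the first row satisfies all four criteria in this toy example.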

Data Collection
We abstracted data on patient characteristics, comorbidities, vital signs, organ support, and laboratory results for prediction model development. As our goal was to develop and assess a prediction model for mortality in lactic acidosis patients based on the available data at the time of ICU admission, we only used data that were present within 48 h of ICU admission for analysis. When multiple values existed, we selected the closest vital sign or laboratory value to lactic acidosis occurrence. We excluded laboratory results with more than 10% missing data. Otherwise, we imputed missing data through multiple imputation using Random Forest (RF).
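The missingness handling described above can be sketched as follows. The paper does not name a specific imputation package, so this is a minimal missForest-style illustration using scikit-learn's iterative imputer with an RF estimator, on synthetic lab data.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic lab matrix; column names are illustrative only.
rng = np.random.default_rng(0)
labs = pd.DataFrame(rng.normal(size=(200, 4)), columns=["bun", "ph", "po2", "inr"])
labs.loc[rng.choice(200, size=10, replace=False), "ph"] = np.nan   # ~5% missing: kept
labs.loc[rng.choice(200, size=60, replace=False), "inr"] = np.nan  # ~30% missing: dropped

# Exclude labs with more than 10% missing data, as in the study protocol.
labs = labs.loc[:, labs.isna().mean() <= 0.10]

# Impute the remaining gaps with an RF-based iterative imputer.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(labs), columns=labs.columns)
print(imputed.shape)
```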

Model Development
In order to utilize ML models to predict the risk of in-hospital mortality in ICU patients with lactic acidosis, we followed the TRIPOD statement to build these ML models (Online Supplementary) [37]. Spearman's rank correlation was applied to assess pairwise correlations between variables in the dataset, which demonstrated no significant correlations (Supplementary Figure S1). Numeric data were normalized to have a standard deviation of 1 and a mean of 0 [38]. The overall study cohort was randomly split into a training (80%) and testing (20%) dataset as per the Pareto principle [39]. We used the training dataset to develop the ML models. The testing cohort was blinded to all methods until the final evaluation. As a reference model, we used multivariable logistic regression analysis with forward stepwise variable selection, using p < 0.20 as the entry cut-off.
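A minimal sketch of the normalization and 80/20 split on synthetic data follows; the stratified split and fitting the scaler on the training data only are standard practice, as the paper does not specify its exact implementation. Notably, an 80/20 split of 1919 patients yields the 1535/384 partition reported in the Results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrix and hospital-mortality labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1919, 10))
y = rng.integers(0, 2, size=1919)

# 80/20 split (Pareto principle); the test set is held out until final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to the held-out set,
# so no information leaks from the test split into normalization.
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)
print(X_train_z.shape, X_test_z.shape)  # (1535, 10) (384, 10)
```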
The ML models included decision tree (DT), RF, extreme gradient boosting (XGBoost), and deep learning. RF and XGBoost are both DT ensemble algorithms [40,41]. However, RF relies on bagging, a democratic process that "elects" the best decision among subgroups of trees [40], whereas XGBoost is based on a gradient descent-boosting process, an ensemble in which weak learners are added sequentially and reinforced depending on the quality of the assessment [41]. For deep learning, we used a multi-layer feedforward artificial neural network (ANN) trained with stochastic gradient descent using back-propagation.
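As a rough illustration, the model families above can be instantiated and compared on toy data. This is a sketch, not the paper's implementation: scikit-learn's GradientBoostingClassifier and MLPClassifier stand in for the XGBoost and deep learning libraries actually used, and all settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Toy data standing in for the ICU feature matrix and mortality labels.
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=500, random_state=0),  # bagged trees
    "Boosting": GradientBoostingClassifier(random_state=0),          # boosted trees
    "ANN": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
    "LogReg": LogisticRegression(max_iter=1000),                     # reference model
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC={auc:.2f}")
```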
For the DT analysis, the number of terminal nodes was determined from the scree plot showing the relationship between tree size and the coefficient of variance. The decision tree was pruned based on cross-validated error results, using the complexity parameter associated with the minimal error (Supplementary Figure S2). For the RF model, the number of trees was set to 500, which yielded the lowest error rate (Supplementary Figure S3), and the mtry value was set to the square root of the number of variables [42]. For XGBoost and ANN, we created a hyperparameter tuning grid to identify the best combination of hyperparameters using cross-validation [43]. Detailed hyperparameters are provided in the Online Supplementary data.
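Hyperparameter tuning of this kind can be sketched as a cross-validated grid search. The grid below is illustrative only (the paper's actual grids are in its supplement); it also shows the mtry = square root of the number of variables heuristic.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data; 16 features so the mtry heuristic gives a round number.
X, y = make_classification(n_samples=400, n_features=16, random_state=1)

# mtry = sqrt(number of variables), the heuristic cited in the text.
mtry = round(math.sqrt(X.shape[1]))  # 4 for 16 features

# Small illustrative grid, not the grid used in the study.
grid = {
    "n_estimators": [100, 500],
    "max_features": [mtry, "sqrt"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid=grid,
    scoring="roc_auc",  # select the combination with the best cross-validated AUROC
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```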

Model Evaluation and Calibration
Model performance was assessed with area under the receiver operating characteristic curve (AUROC), accuracy, precision, error rate (ERR), Matthews correlation coefficient (MCC), and F1 score in the testing dataset [44][45][46]. The formula for each measure is provided in the Online Supplementary data. The Brier score was used to evaluate model calibration [47].
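These metrics can all be computed directly from predicted probabilities and thresholded class labels. A self-contained sketch with toy predictions follows, assuming ERR = 1 − accuracy (the formulas themselves are in the Online Supplementary data).

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, accuracy_score, precision_score,
    matthews_corrcoef, f1_score, brier_score_loss,
)

# Toy labels and probabilities; in practice y_prob comes from
# model.predict_proba on the held-out testing dataset.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6])
y_pred = (y_prob >= 0.5).astype(int)  # 0.5 classification threshold

metrics = {
    "AUROC": roc_auc_score(y_true, y_prob),        # discrimination
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Error rate": 1 - accuracy_score(y_true, y_pred),  # ERR = 1 - accuracy
    "MCC": matthews_corrcoef(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
    "Brier": brier_score_loss(y_true, y_prob),     # calibration: lower is better
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these toy values the confusion matrix is TP = 3, TN = 3, FP = 1, FN = 1, giving accuracy 0.75, MCC 0.5, and a Brier score of 0.125.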

Explanations of the Features in the ML-Based Prediction Model That Drive Patient-Specific Predictions of Mortality
After we identified the ML model with the highest predictive performance, we applied Shapley additive explanations (SHAP) values to explain which features drive patient-specific estimates. In addition, we applied the local interpretable model-agnostic explanations (LIME) approach, which approximates the complex nonlinear model with a locally linear model around observations of interest.
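To make the idea behind SHAP concrete, the toy sketch below computes exact Shapley values for a hypothetical two-feature risk score by enumerating feature coalitions. The weights, patient values, and baseline are entirely made up; in practice the shap library approximates this computation efficiently for tree ensembles such as RF, where exact enumeration over all features would be infeasible.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, reference):
    """Exact Shapley values: features absent from a coalition take their reference value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Standard Shapley coalition weight |S|!(n-|S|-1)!/n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [x[j] if j in subset or j == i else reference[j] for j in range(n)]
                without_i = [x[j] if j in subset else reference[j] for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Hypothetical linear risk score over (lactate, BUN); weights are made up.
def predict(v):
    return 0.3 * v[0] + 0.1 * v[1]

patient = [6.0, 40.0]    # this patient's lactate and BUN
reference = [2.0, 15.0]  # cohort-average baseline
phi = shapley_values(predict, patient, reference)
print(phi)  # the attributions sum to predict(patient) - predict(reference)
```

For a linear model, each feature's Shapley value reduces to its weight times its deviation from baseline, which is what the enumeration recovers here.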

Results
A total of 1919 ICU patients with lactic acidosis were eligible for analysis. Of these, 1535 and 384 were included in the training and testing datasets, respectively. Table 1 shows the clinical characteristics of patients in the training and testing datasets; the two datasets were comparable. Hospital mortality was also similar between the training and testing datasets (29.8% vs. 29.7%; p = 0.97).
The ERRs and AUROCs of all ML models and the multivariable logistic regression model for mortality prediction in the testing dataset are shown in Table 2. The MCC ranges from -1 (worst) to +1 (best); the F1 score, accuracy, and precision range from 0 (worst) to 1 (best); and the Brier score, a combined measure of discrimination and calibration, ranges from 0 (best) to 1 (worst).
The results of the multivariable logistic regression analysis with stepwise variable selection are shown in Table 3. The AUROC of the multivariable logistic prediction model with forward stepwise variable selection was 0.81 (95% CI 0.79-0.83). We also compared our predictive models with the Sequential Organ Failure Assessment (SOFA) score, the Simplified Acute Physiology Score (SAPS II; an acute severity score), and the Charlson score (a comorbidity score). The RF model outperformed the multivariable logistic regression model, the SOFA score (AUROC 0.74), the SAPS II score (AUROC 0.77), and the Charlson score (AUROC 0.69) (Table 2).
Model calibration is presented in Supplementary Figures S4-S10. The Brier score for RF, DT, XGBoost, ANN, and multivariable logistic regression was 0.15, 0.19, 0.18, 0.19, and 0.16, respectively (Table 2). Variable importance analysis of RF, the best model, was performed. The top important variables of the RF model, combining the mean decrease in Gini (how much each variable decreases node impurity), the decrease in accuracy, and the p values of the clinical indices, included BUN, anion gap, lactate level, INR, pO2, phosphate level, PTT, platelet count, pH, and baseline eGFR (Figure 3).
To identify the features that influenced the prediction model the most, we depicted the SHAP summary plot of the RF model with the top 20 features of the prediction model (Figure 4). This plot shows how high and low feature values relate to SHAP values in the testing dataset; according to the prediction model, the higher the SHAP value of a feature, the higher the probability of mortality. Additionally, we applied LIME to the RF model to illustrate the impact of key features at the individual level (Figure 5). In Figure 5, label "1" means prediction of mortality and label "0" means prediction of no mortality (survival); the probability shows the probability of the observation belonging to label "1" or "0". The five most influential variables that best explain the linear model in that observation's local region are shown, along with whether each variable increases the probability (supports/blue bar) or decreases it (contradicts/red bar). The x-axis shows how much each feature added to or subtracted from the final probability value for the patient. Abbreviations: BUN, blood urea nitrogen; GCS, Glasgow Coma Scale; PTT, partial thromboplastin time; pO2, partial pressure of oxygen.

Discussion
Significant efforts have been invested into the development of predictive risk models of mortality for ICU patients. Traditional statistical models such as logistic regression analysis have previously been utilized to construct such prognostication tools [49][50][51][52]. In recent years, ML predictive algorithms have emerged as a method to handle high-dimensional, unstructured, and structurally complex data [28][29][30][31][32][33][34]. In this study, we compared ML models and a conventional multivariable logistic regression model to identify the best-performing model for in-hospital mortality among ICU patients with lactic acidosis. The findings from our study suggest that the RF algorithm demonstrated superior performance in the prediction of mortality among critically ill patients with lactic acidosis compared to the other predictive tools.
Modern ICUs and advances in electronic health records (EHRs) generate vast amounts of complex and multidimensional data that provide valuable information on patient outcomes, which has led to considerable advances in precision medicine [53]. While elevated serum lactate levels have been shown to be associated with increased mortality [54][55][56], and their incorporation improves predictive performance in traditional logistic regression models among critically ill patients [26,27], mortality risk prediction in the subgroup of ICU patients with lactic acidosis [54][55][56] remains limited, especially using ML approaches. Furthermore, patients with lactic acidosis are heterogeneous, and the impact of lactic acidosis on ICU mortality varies with the clinical ICU setting, such as trauma, cardiac surgery, and sepsis [54][55][56]. Given this heterogeneity and the lack of adequate tools for patient-level prognostication, clinicians may often resort to subjective gestalt judgment, which is prone to bias [57]. Thus, in this study, we investigated whether ML methods improve mortality prognostication for ICU patients with lactic acidosis, in order to advance precision medicine. Our best model used the RF algorithm, which achieved the highest AUROC and lowest ERR of all the models. We acknowledge that the AUROC has several flaws [58], and thus we also investigated other evaluation metrics, including accuracy, precision, MCC, and F1 score; these confirmed the robustness of our RF prediction model. Finally, the finding that the predicted probabilities were close to the expected probability distribution supports that our RF model for mortality prediction among ICU patients with lactic acidosis is well calibrated.
RF is a widely used ML approach that can effectively predict outcomes [40]. It does so by utilizing additive combinations of trees that are built using different subsets of data and variables [42,59]. This nonparametric and nonlinear method can resist noise, and it is thus expected to build accurate prediction models from aggregated data [40,60]. As a robust nonparametric model, RF can capture complex relationships and, unlike logistic regression, does not depend on assumptions about the data distribution [40]. In addition, RF works well on large datasets, particularly when there are many categorical independent variables and unbalanced data [42]. By contrast, logistic regression analysis uses a generalized linear equation, and its stepwise variable selection method is based on the likelihood ratio test, to describe the directed dependencies among a set of variables; these approaches require that a number of statistical assumptions be met. Thus, logistic regression analysis possesses inherent bias and, consequently, low variance due to the rigid shape of the fitted line. Our RF prediction model also outperformed the acute severity scores (SOFA and SAPS II) and the comorbidity score (Charlson) for prediction of hospital mortality in ICU patients with lactic acidosis. In addition, this study provides important information on the variables in our RF model, including BUN, anion gap, lactate level, INR, pO2, phosphate level, PTT, platelet count, pH, and baseline eGFR. These variables, identified in an RF model for critically ill patients with lactic acidosis using the MIMIC-III database, may help each institution develop its own individualized RF model to better prognosticate mortality risk.
Additionally, we applied model-agnostic approaches including feature relevant explanation through SHAP and local explanations through the LIME approach [61], which demonstrated how our RF model can be used to explain how each feature contributes to mortality prediction among patients with lactic acidosis.
Although our study includes a large sample of ICU patients with lactic acidosis and their ICU admission data, there are several important limitations. First, our models utilized data obtained at the time of ICU admission in order to prognosticate mortality risk early in the ICU course; thus, later events that markedly altered the prognosis of an individual patient were not captured. In addition, our study is retrospective and based on the MIMIC-III database, which derives from a single large tertiary care hospital in the United States. Hence, the models may have been influenced by that institution's specific clinical guidelines, practices, and treatment decisions. A future validation study with the updated MIMIC-IV database and external validation studies of the ML prediction models are needed.

Conclusions
In conclusion, an ML prediction model using the RF algorithm (available online as a Shiny app at https://wisitc.shinyapps.io/RandomForestLacticAcid/, created on 15 September 2021) provided the highest predictive performance for hospital mortality among ICU patients with lactic acidosis. While future external validation studies are required, the findings of our study support the utilization of RF algorithms to improve risk stratification among critically ill patients with lactic acidosis.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/jcm10215021/s1, Figure S1: Correlation of variables in the dataset; Figure S2: Pruned DT based on cross-validated error results using the complexity parameter associated with the minimal error.
Informed Consent Statement: Patient consent was waived due to the minimal risk nature of observational chart review studies.