Explainable Preoperative Automated Machine Learning Prediction Model for Cardiac Surgery-Associated Acute Kidney Injury

Background: We aimed to develop and validate an automated machine learning (autoML) prediction model for cardiac surgery-associated acute kidney injury (CSA-AKI). Methods: Using 69 preoperative variables, we developed several models to predict postoperative AKI in adult patients undergoing cardiac surgery. Models included autoML and non-autoML types, including decision tree (DT), random forest (RF), extreme gradient boosting (XGBoost), and artificial neural network (ANN), as well as a logistic regression prediction model. We then compared model performance using the area under the receiver operating characteristic curve (AUROC) and assessed model calibration using the Brier score on the independent testing dataset. Results: The incidence of CSA-AKI was 36%. The stacked ensemble autoML had the highest predictive performance among autoML models and was chosen for comparison with the non-autoML and multivariable logistic regression models. The autoML had the highest AUROC (0.79), followed by RF (0.78), XGBoost (0.77), multivariable logistic regression (0.77), ANN (0.75), and DT (0.64). The autoML had an AUROC comparable to that of RF and outperformed the other models. The autoML was well calibrated; the Brier scores for autoML, RF, DT, XGBoost, ANN, and multivariable logistic regression were 0.18, 0.18, 0.21, 0.19, 0.19, and 0.18, respectively. We applied the SHAP and LIME algorithms to our autoML prediction model to extract explanations of the variables that drive patient-specific predictions of CSA-AKI. Conclusion: We present a preoperative autoML prediction model for CSA-AKI with high predictive performance, comparable to RF and superior to the other ML and multivariable logistic regression models. The novel approaches of the proposed explainable preoperative autoML prediction model for CSA-AKI may guide clinicians in advancing individualized medicine plans for patients undergoing cardiac surgery.


Introduction
Cardiac surgery-associated acute kidney injury (CSA-AKI) is a common and serious complication, with an incidence ranging from 17% to 49% [1][2][3]. Compared to patients without CSA-AKI, those with CSA-AKI carry increased risks of mortality, prolonged hospital stay, and higher healthcare costs [4][5][6][7][8]. Previous risk prediction models for CSA-AKI based on multivariable logistic regression analysis have been developed to help assess the perioperative risk of CSA-AKI [9][10][11][12][13][14][15][16][17][18]. However, particular risk scores have limitations, such as restricted generalizability (pre-specified types of elective cardiac surgery [9], coronary artery bypass grafting (CABG) [15], or only CKD patients [14]) and the need to include intraoperative factors that are not available for preoperative risk assessment (such as intraoperative inotrope use, intraoperative intra-aortic balloon pump insertion, or cardiopulmonary bypass time [13]). In addition, several risk scores have been developed specifically to predict severe AKI requiring kidney replacement therapy (KRT) after cardiac surgery [10][11][12][16]. However, even milder degrees of CSA-AKI carry increased risks of CKD and progression to end-stage kidney disease (ESKD) and are clinically relevant [3,19,20]. Therefore, there is a need to develop accurate, reliable, and clinically meaningful preoperative risk prediction models for CSA-AKI to assist providers in counseling patients undergoing cardiac surgery.
Artificial intelligence (AI) and machine learning (ML) have been increasingly applied to individualized medicine [21][22][23][24][25][26], including the prediction of AKI in various settings [27][28][29][30][31][32][33][34][35]. ML algorithms can handle nonlinear, complex, and multidimensional data [36,37], and recent studies have shown high predictive performance from ML algorithms that outperform traditional statistical analyses [38,39]. Recently, automated ML (autoML) has emerged as a growing field that minimizes human input and effort on repetitive tasks in ML pipelines, such as optimal algorithm selection and hyperparameter optimization [40], by replacing manual trial-and-error approaches with systematic data-driven decision making [41,42]. In addition, autoML uses automation to efficiently identify the algorithms or models that work best for each dataset and improves accuracy by ensembling algorithms [43]. Thus, autoML has been shown to be very effective, with high predictive performance comparable to human hyperparameter optimization (identification of hyperparameters that returns an optimal model), a more time-efficient workflow, and less human assistance [41,43]. In the present era of electronic health records (EHRs), where additional data are continuously added and updated, rapid adjustment of scoring systems in real-world autoML applications is more feasible than with traditional ML approaches [40]. Despite the growing research in the field of autoML, little work has applied it to healthcare, where there is a demonstrated need [44].
In this study, we aimed to: (1) develop a preoperative autoML prediction model for CSA-AKI; (2) compare model performance among autoML, various other ML-based prediction models, and traditional statistical (multivariable logistic regression) models in predicting AKI after cardiac surgery; and (3) obtain explanations of the features in the ML-based prediction model that drive patient-specific predictions of CSA-AKI.

Patient Population
This was a single-center observational study conducted at a tertiary referral hospital. We studied all consecutive adult patients (≥18 years old) who underwent open-heart surgery at Mayo Clinic Hospital, Rochester, MN, from 1 January 2014 to 31 December 2020. To avoid assessment of multiple outcomes for a single patient, we analyzed only the first heart surgery during the study period for patients with multiple heart surgeries. We excluded (1) patients who had end-stage kidney disease or received any dialysis modalities within 7 days before the surgery, (2) patients who did not have known baseline serum creatinine before surgery, (3) patients who underwent solely right or left ventricular assist device placement, and (4) moribund patients who died during surgery or within 24 h after surgery. The Mayo Clinic Institutional Review Board approved this observational study (IRB number-21-004248) and waived informed consent due to the minimal risk nature of this study. The study was conducted in accordance with the relevant guidelines and regulations.

Data Collection
The primary outcome was postoperative AKI. We defined and staged AKI based solely on the serum creatinine criterion of the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines [45]; AKI was defined as an increase in serum creatinine of ≥0.3 mg/dL within 48 h after surgery or a relative increase of ≥50% from baseline within 7 days after surgery. We used the most recent outpatient serum creatinine within 1 year prior to the surgery as the baseline value. If an outpatient baseline serum creatinine was not available, we used the lowest in-hospital serum creatinine prior to the surgery as the baseline instead. AKI severity was classified into three stages, as follows: stage 1 was an increase of ≥0.3 mg/dL or an increase to 1.5- to 1.9-fold from baseline; stage 2 was an increase to 2.0- to 2.9-fold from baseline; and stage 3 was an increase to ≥3.0-fold from baseline, an increase to ≥4.0 mg/dL, or the initiation of renal replacement therapy.
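The creatinine-based staging logic described above can be sketched as a small function. This is our own illustrative reconstruction of the staging rules as stated (the study applied only the serum creatinine criterion, not the urine output criterion); the function name and simplifications are ours, not the study's code.

```python
def kdigo_stage(baseline, cr_max_48h, cr_max_7d, on_krt=False):
    """Stage AKI by the KDIGO serum creatinine criterion (values in mg/dL).

    baseline    -- baseline serum creatinine before surgery
    cr_max_48h  -- highest serum creatinine within 48 h after surgery
    cr_max_7d   -- highest serum creatinine within 7 days after surgery
    on_krt      -- kidney replacement therapy initiated postoperatively
    """
    ratio = cr_max_7d / baseline
    # Stage 3: >=3.0-fold rise, creatinine reaching >=4.0 mg/dL, or KRT
    if on_krt or ratio >= 3.0 or cr_max_7d >= 4.0:
        return 3
    # Stage 2: 2.0- to 2.9-fold rise from baseline
    if ratio >= 2.0:
        return 2
    # Stage 1: rise of >=0.3 mg/dL within 48 h or 1.5- to 1.9-fold rise
    if (cr_max_48h - baseline) >= 0.3 or ratio >= 1.5:
        return 1
    return 0  # no AKI
```

For example, a patient with baseline creatinine 1.0 mg/dL whose creatinine rises to 1.4 mg/dL within 48 h meets the absolute-rise criterion for stage 1.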
We used our institutional electronic database to abstract cardiac surgery information, patient demographics, comorbidities, echocardiographic findings, vital signs, medications, and laboratory data. Comorbidities were identified according to the Elixhauser Comorbidity index using previously defined ICD-9 and ICD-10 diagnosis codes. As our goal was to develop and assess a prediction model for CSA-AKI based on the available data before cardiac surgery, we only used the preoperative data that were present within 7 days before cardiac surgery for analysis. When multiple values existed, we selected the most recent vital signs or laboratory values prior to cardiac surgery. We excluded laboratory results with more than 10% missing data. Otherwise, we imputed missing data through a multiple imputation approach using Random Forest (RF).
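As a sketch of the missing-data step, the following uses scikit-learn's iterative imputer with a random forest as the per-column estimator, which is similar in spirit to the RF-based multiple imputation described above (the study's actual implementation was not in Python; the synthetic data and parameter choices here are ours).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic stand-in dataset
X[rng.random(X.shape) < 0.10] = np.nan   # ~10% of values missing at random

# Iterative imputation: each column with missing values is modeled from
# the others using a random forest -- analogous to missForest-style RF
# imputation. Tree counts and iterations are kept small for speed.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=25, random_state=0),
    max_iter=5, random_state=0)
X_imputed = imputer.fit_transform(X)
```

After fitting, `X_imputed` has the same shape as `X` with all missing entries filled in.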

Feature Selection
Spearman's rank correlation was applied to assess pairwise correlations of variables in the dataset and demonstrated no significant correlations (Supplementary Figure S1). Subsequently, a recursive feature elimination (RFE) approach with RF was completed using the Caret R package. The optimal number of variables (69 variables) was identified by the best accuracy and kappa metrics using five times repeated ten-fold cross-validation (Supplementary Figure S2).
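The study performed RFE with the Caret R package. A scaled-down Python analogue using scikit-learn's `RFECV` (synthetic data, fewer folds and trees than the paper's 5x repeated 10-fold setup, purely for illustration) might look like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedStratifiedKFold

# Synthetic stand-in for the preoperative dataset.
X, y = make_classification(n_samples=200, n_features=12, n_informative=5,
                           random_state=0)

# The paper used 5x repeated 10-fold CV over the candidate variables;
# this demo is scaled down (2x repeated 5-fold) to run quickly.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
selector = RFECV(RandomForestClassifier(n_estimators=30, random_state=0),
                 step=1, cv=cv, scoring="accuracy")
selector.fit(X, y)

n_selected = selector.n_features_  # size of the best-scoring feature subset
```

`RFECV` drops the least important feature at each step and keeps the subset with the best cross-validated score, mirroring the accuracy-driven selection of the 69 variables.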

Model Development
In order to utilize ML models to predict the risk of AKI after cardiac surgery, we followed TRIPOD (Online Supplementary) to build automated ML and various ML models [46]. Numerical data were normalized to have a standard deviation of 1 and a mean of 0 [47]. H2O.ai was used to develop autoML models [44]. The H2O autoML platform has been validated and provides very stable performance [48]. It includes a number of advanced ML algorithms, including distributed RF (DRF), generalized linear model (GLM), gradient boosting machine (GBM), deep learning (a fully-connected multi-layer neural network), and extremely randomized trees (XRT). In addition, H2O AutoML builds two stacked ensemble models, one using all the trained models and the other using just the best performing model from each algorithm family [49]. Detailed autoML algorithms and hyperparameter optimization processes by H2O autoML are provided in the Online Supplementary Materials.
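A stacked ensemble combines base learners by training a metalearner on their cross-validated predictions. The following scikit-learn sketch (not H2O, and not the study's pipeline) illustrates the idea with an RF and a GBM feeding a logistic regression metalearner, conceptually similar to H2O's StackedEnsemble with a GLM metalearner:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base learners' out-of-fold predictions become the inputs of the
# final (metalearner) model, which learns how to weight each base model.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50,
                                              random_state=0)),
                ("gbm", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)  # held-out accuracy of the ensemble
```

The design choice behind stacking is that the metalearner can exploit complementary errors of the base models, which is why stacked ensembles often top AutoML leaderboards.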
The overall study cohort was randomized into training (70%), validation (15%), and testing (15%) datasets. The training dataset was used to develop autoML, ML, and traditional multivariable logistic regression analysis models. After model development, autoML models were ranked by evaluation metrics (area under the receiver operating characteristic curve (AUROC) and log loss) on a leaderboard using the validation dataset. The autoML model with highest predictive performance (top-ranked on the leaderboard) was subsequently chosen for comparison with various other ML and traditional multivariable logistic regression analysis models. The testing dataset was blinded to all methods until the final evaluation. As a reference model, we used multivariable logistic regression analysis. We included variables with p-value < 0.05 in univariate analysis into the multivariable model and subsequently selected the final multivariable model using a backward stepwise approach with p-value < 0.05 as the pre-specified threshold for model retention.
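The 70/15/15 partition described above can be reproduced with two successive stratified splits; a minimal sketch (synthetic labels, our own variable names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)                  # stand-in feature matrix
y = np.random.default_rng(0).integers(0, 2, 1000)   # stand-in binary outcome

# First carve off 70% for training, then split the remaining 30%
# evenly into validation and testing (15% / 15% overall). Stratifying
# on y keeps the CSA-AKI incidence similar across the three datasets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
```

On 1000 rows this yields 700/150/150 rows, matching the 70/15/15 proportions.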
ML (non-automated) models included decision tree (DT), RF, extreme gradient boosting (XGBoost), and deep learning. We utilized deep learning based on a multi-layer feedforward artificial neural network (ANN) trained with stochastic gradient descent using back-propagation. For DT analysis, the number of terminal nodes was determined considering the scree plot revealing the relationship between the tree size and coefficient of variance. The decision tree was pruned based on cross-validated error results utilizing the complexity parameter associated with the minimal error (Supplementary Figure S3). For the RF model, the number of trees was 500, which yielded the lowest error rate (Supplementary Figure S4), and the mtry value was calculated by the square root of the number of variables [50]. For XGBoost and ANN, we created a hyperparameter tuning grid to identify the best combination of hyperparameters using cross-validation methods (Online Supplementary Data) [51].

Model Evaluation and Calibration
The performance of the autoML, ML, and multivariable logistic regression analysis models was assessed with AUROC, accuracy, precision, error rate (ERR), Matthews correlation coefficient (MCC), and F1 score in the testing dataset [52][53][54]. The DeLong test was used to compare AUROCs [55]. Two-sided p values less than 0.05 were considered significant. The formula for each measure is provided in the Online Supplementary Data. The Brier score was used to evaluate model calibration [56].
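For reference, the listed measures can be computed directly from the confusion matrix and the predicted probabilities. The following is our own minimal implementation of these standard formulas (AUROC is computed via pairwise concordance, which is equivalent to the Mann-Whitney statistic); it is illustrative, not the study's evaluation code.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Standard binary classification metrics from first principles."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)

    accuracy = (tp + tn) / n
    err = 1 - accuracy                                  # error rate (ERR)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Matthews correlation coefficient (-1 worst, +1 best)
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / den if den else 0.0

    # Brier score: mean squared error of the predicted probabilities
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n

    # AUROC as pairwise concordance (Mann-Whitney formulation)
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    auroc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"accuracy": accuracy, "err": err, "precision": precision,
            "f1": f1, "mcc": mcc, "brier": brier, "auroc": auroc}
```

Note that AUROC and the Brier score use the probabilities themselves, while the remaining measures depend on the chosen classification threshold.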

Explanations of the Variables in the autoML-Based Prediction Model That Drive Patient-Specific Predictions of CSA-AKI
Model-agnostic approaches, including the Shapley additive explanations (SHAP) algorithm and Local Interpretable Model-Agnostic Explanations (LIME), were applied to our autoML prediction model in order to extract an explanation of the variables that drive patient-specific predictions and to mitigate the issue of black-box predictions [57,58]. SHAP is a model-agnostic demonstration of variable importance in which the effect of each feature on a specific prediction is represented through Shapley values [57,58]. The Shapley value indicates how much a single variable contributes to the difference between the actual prediction and the average (mean) prediction, in the context of its interaction with other features. LIME, in turn, explains individual predictions by training an interpretable white-box surrogate model in the local region of each prediction [58,59].
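The Shapley value definition above can be made concrete with a brute-force computation over feature coalitions (tractable only for a handful of features; practical SHAP implementations use model-specific approximations). The toy linear model and function below are our own illustration, not the study's tooling:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction by enumerating all
    feature coalitions; features outside a coalition are replaced by
    the corresponding `baseline` value. Exponential cost: toy sizes only."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                # Weighted marginal contribution of feature i to coalition s
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return phi
```

By the efficiency property, the Shapley values sum exactly to the difference between the model's prediction for the instance and its prediction at the baseline, which is what makes per-patient attributions additive.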

Clinical Characteristics
A total of 13,158 cardiac surgery patients were eligible for analysis. The mean age was 65 ± 15 years, and 66% were male. Eighteen percent had coronary artery bypass grafting (CABG), 60% had valve surgery, 19% had CABG and valve surgery, 1% had heart transplant, and 2% had pericardiectomy. The mean baseline creatinine was 1.1 ± 0.7 mg/dL and the mean estimated glomerular filtration rate was 69 mL/min/1.73 m² (Table 1). Thirty-six percent (n = 4745) developed CSA-AKI, with 30% in stage 1, 3% in stage 2, and 3% in stage 3. Two percent (n = 284) required postoperative renal replacement therapy. Of these eligible cardiac surgery patients, 9244, 1967, and 1947 were randomly included in the training, validation, and testing datasets, respectively. Table 1 shows the clinical characteristics of patients in the training, validation, and testing datasets, which were mostly comparable. The incidence of CSA-AKI was similar among the three datasets (36% in training vs. 36% in validation vs. 35% in testing; p = 0.73).

AutoML Prediction Models for CSA-AKI
AutoML models for CSA-AKI were developed in the training dataset and were ranked by AUROC and log loss on the leaderboard using the validation dataset (Supplementary Table S1). Table 2 demonstrates the top 20 autoML models for CSA-AKI. The top autoML (Stacked ensemble model ID: StackedEnsemble_AllModels_3_AutoML_1_20211031_170047) shows the highest predictive performance on the leaderboard (AUROC = 0.78), and thus was subsequently chosen for comparison with other various ML and traditional multivariable logistic regression analysis models.

Traditional Logistic Regression Prediction Model for CSA-AKI
In the final multivariable logistic regression, the predictors for CSA-AKI included age, sex, race, cardiac surgery type, history of cardiac arrhythmia, peripheral vascular disease, hypertension with and without complications, liver disease, coagulopathy, obesity, right ventricular systolic pressure, systolic blood pressure, the use of aspirin, beta-blockers, antiarrhythmic medications, benzodiazepine, vasopressor/inotropes, insulin, serum sodium, albumin, hemoglobin, and eGFR. (Supplementary Table S2).

Model Comparison among the Different Models
The ERRs, accuracy, precision, MCC, F1 scores, and AUROCs of the top autoML, all ML models, and the multivariable logistic regression model for CSA-AKI prediction in the testing dataset are shown in Table 3. Abbreviations: ANN, artificial neural network; AUROC, area under the receiver operating characteristic curve. MCC ranges from −1 (worst) to +1 (best); F1 score, accuracy, and precision range from 0 (worst) to 1 (best). The Brier score is a combined measure of discrimination and calibration that ranges between 0 and 1, where the best score is 0 and the worst is 1.

Explanations of the Variables in the autoML-Based Prediction Model That Drive Patient-Specific Predictions of CSA-AKI
To identify the features that most influenced the autoML prediction model, we applied the SHAP algorithm to extract an explanation of the variables that drive patient-specific predictions for CSA-AKI. As the SHAP algorithm could not be utilized for the ensemble model, it was applied to GBM_1_AutoML_1_20211031_170047 (rank number 7 on the leaderboard, Table 2), which was one of the key component models of our top autoML model (stacked ensemble model ID: StackedEnsemble_AllModels_3_AutoML_1_20211031_170047). The SHAP summary plot of the GBM_1_AutoML_1_20211031_170047 model and the top 20 features of the prediction model are shown in Figure 3. This plot depicts how high and low feature values relate to the SHAP values in the testing dataset. According to the prediction model, the higher the SHAP value of a feature, the higher the probability of CSA-AKI occurring. The top three features that influenced predictions of CSA-AKI were baseline eGFR, cardiac surgery type, and coagulopathy, respectively.
Additionally, we applied LIME to the autoML model to illustrate the impact of key variables at the individual level (Figure 4). For each patient and individual risk assessment of CSA-AKI, a LIME plot was generated depicting the top five variables that support (increase the risk of CSA-AKI) or contradict (decrease the risk of CSA-AKI) the prediction of CSA-AKI for that patient. Figure 4 shows representative patients from the testing dataset. Label "1" denotes a prediction of CSA-AKI and label "0" a prediction of no CSA-AKI; the probability indicates how likely the observation belongs to label "1" or "0". The five most important features that best explain the linear model in that observation's local region are shown, along with whether each feature increases the probability (blue bar, supports) or decreases it (red bar, contradicts). The x-axis demonstrates how much each feature added to or subtracted from the final probability value for the patient. Abbreviations: BUN, blood urea nitrogen; eGFR, estimated glomerular filtration rate.

Discussion
Significant efforts have been invested in the development of predictive risk models for CSA-AKI. Traditional statistical models such as logistic regression analysis have previously been utilized to construct such prognostication tools [9][10][11][12][13][14][15][16][17]. In recent years, ML predictive algorithms have emerged as a method to handle high-dimensional, unstructured, and complex structured data, including data on hospitalized patients with AKI [27][28][29][30][31]. While autoML has been shown to be very effective, with high predictive performance comparable to human hyperparameter optimization and a more time-efficient workflow than non-automated ML [41,43], autoML had never been utilized in the development of AKI prediction models. In this study, we successfully developed preoperative autoML prediction models for CSA-AKI and compared the predictive performances of autoML models with non-automated ML and conventional multivariable logistic regression models.
Previous traditional risk prediction models using multivariable logistic regression for CSA-AKI have been developed [9][10][11][12][13][14][15][16][17][18], including risk scores derived from only subgroups of patients undergoing cardiac surgery, such as elective cardiac surgery [9], CABG [15], or only patients with CKD [14]. While the inclusion of intraoperative variables in risk scores helps to improve predictive performance [13], the utilization of these models is limited in the real clinical practice of preoperative risk assessment of CSA-AKI. In addition, several risk scores have been developed specifically to predict severe AKI requiring KRT after cardiac surgery [10][11][12][16]. Considering that CSA-AKI, even of milder severity, involves increased risks of CKD and ESKD [3,19,20], in the current era of individualized medicine and advanced EHRs the development of preoperative ML risk prediction models for CSA-AKI can be clinically meaningful to assist providers in counseling each individual patient prior to cardiac surgery. Recently, there has been increasing interest in the utilization of supervised non-automated ML algorithms to predict the risk of CSA-AKI [32,33,61,62]. While these ML models provide excellent discrimination of cases with CSA-AKI [32,33,61] and higher predictive performance than traditional multivariable logistic regression analyses, these non-automated ML predictive models for CSA-AKI include intraoperative data in order to achieve high predictive performance [32,33,61]. Thus, the utility of these ML models for preoperative risk assessment is limited.
Our study used solely preoperative data in the development of CSA-AKI prediction models. Additionally, for the first time, we utilized the autoML approach in the development of preoperative prediction models for CSA-AKI. Furthermore, we demonstrated that the top autoML model from the leaderboard (stacked ensemble model ID: StackedEnsemble_AllModels_3_AutoML_1_20211031_170047) achieved high predictive performance, comparable to that of the non-automated RF model, and outperformed the DT, XGBoost, ANN, and multivariable logistic regression models. In addition to high predictive performance, the autoML approach requires less human assistance and reduces human biases in optimal algorithm selection and hyperparameter optimization during model development [43]. With the rapid changes in novel treatment patterns, demographics, and patient populations, data shifts have been increasingly recognized and have significantly affected predictive performance over time [63,64]. The rapid adjustment of autoML predictive performance with new data is more feasible than with non-automated ML models [40], and can improve workflow time-efficiency in the model maintenance phase.
One issue that has received considerable visibility, and has often been cited as a limitation on the use of ML and autoML in clinical applications, is the lack of transparency and interpretability of ML-derived recommendations [57,58]. When provided with two models of equal performance, one a black-box model and one an interpretable model, most users opt for the interpretable model [65]. Gaining user trust has frequently been referenced as one reason for interpretability [66]. In this study, to obtain explanations of the variables that drive patient-specific predictions of CSA-AKI, we applied model-agnostic approaches to our autoML prediction models using the SHAP and LIME algorithms [57,58]. While SHAP cannot be used with our top autoML model, as it is an ensemble model that combines several base models to produce one optimal predictive model, we applied the SHAP algorithm to explain the top 20 variables that played the most important roles in predicting CSA-AKI in the GBM autoML model (model ID: GBM_1_AutoML_1_20211031_170047), which is one of the key component models of our top autoML model. The LIME algorithm can be utilized for ensemble models, and thus we successfully applied it to our top autoML prediction model. Through the adoption of the LIME approach, we were able to explain the variables driving patient-specific predictions of CSA-AKI for each individual patient and reduce the black-box concern of our preoperative autoML prediction model for CSA-AKI.
There are several limitations to our study. First, our study cohort represents a majority Caucasian population, and thus the autoML prediction model may need further adjustment with data including other patient populations. Second, our autoML model included only preoperative data in order to make it applicable in real clinical practice for preoperative assessment. While incorporation of intraoperative factors such as operative time and cardiopulmonary bypass time may additionally improve the predictive performance of the model, and may be beneficial for interventional research during or after cardiac surgery, this was not the focus of our current study. Lastly, prospective and external validation studies of the preoperative autoML prediction model for CSA-AKI are needed.

Conclusions
In conclusion, we presented a preoperative autoML prediction model for CSA-AKI (available online as a shiny app at https://wisitc.shinyapps.io/autoML-CSA-AKI/, created on 21 July 2022) that provided high predictive performance, comparable to non-automated ML approaches and superior to the multivariable logistic regression model. In addition, we demonstrated the explainability of our preoperative autoML prediction model for CSA-AKI. These novel approaches involving an explainable preoperative autoML prediction model for CSA-AKI may guide clinicians in advancing individualized medicine plans for patients undergoing cardiac surgery.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11216264/s1. Figure S1: Correlation of variables in the dataset; Figure S2: The optimal number of variables (69 variables) identified by the best (A) accuracy and (B) kappa metrics using five times repeated ten-fold cross-validation; Figure S3: Pruned decision tree associated with the minimal error; Figure S4: Number of trees of the RF model that yielded the lowest error rate; Figure S5: Simple decision tree model showing the classification of patients who had CSA-AKI (1) and did not have CSA-AKI (0); Table S1: Leaderboard of the top 45 autoML models for CSA-AKI ranked by evaluation metrics using the validation dataset; Table S2: Development of the multivariable logistic regression model to predict acute kidney injury after cardiac surgery using stepwise variable selection in the training dataset.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Mayo Clinic (IRB number-21-004248, approved on 1 July 2021).
Informed Consent Statement: Patient consent was waived due to the minimal-risk nature of this observational chart review study.
Data Availability Statement: Data are available upon reasonable request to the corresponding author.

Conflicts of Interest:
The authors declare no conflicts of interest.