Predicting Mortality Using Machine Learning Algorithms in Patients Who Require Renal Replacement Therapy in the Critical Care Unit

Background: General severity of illness scores are not well calibrated to predict mortality among patients receiving renal replacement therapy (RRT) for acute kidney injury (AKI). We developed machine learning models for mortality prediction and compared their performance to that of the Sequential Organ Failure Assessment (SOFA) and HEpatic failure, LactatE, NorepInephrine, medical Condition, and Creatinine (HELENICC) scores. Methods: We extracted routinely collected clinical data for AKI patients requiring RRT in the MIMIC and eICU databases. The models were trained on 80% of the pooled dataset and tested on the remaining 20%. We compared the area under the receiver operating characteristic curves (AUCs) of four machine learning models (multilayer perceptron [MLP], logistic regression, XGBoost, and random forest [RF]) to that of the SOFA, nonrenal SOFA, and HELENICC scores and assessed calibration, sensitivity, specificity, positive (PPV) and negative (NPV) predictive values, and accuracy. Results: Among the machine learning models, XGBoost had the highest AUC for mortality (0.823; 95% confidence interval [CI], 0.791–0.854) in the testing dataset, and it had the highest accuracy (0.758). The XGBoost model showed no evidence of lack of fit with the Hosmer–Lemeshow test (p > 0.05). Conclusion: XGBoost provided the best mortality prediction for patients with AKI requiring RRT and outperformed previous scoring systems.


Introduction
Acute kidney injury (AKI) is a common and significant problem in intensive care units (ICU), with incidence rates reportedly as high as 50% of patients admitted [1]. Up to 25% of AKI patients in the ICU require renal replacement therapy (RRT) [1,2]. Despite advancements in the performance and technology of RRT, the mortality rate of those patients remains 30% to 50% [3,4]. Although the outcomes of these patients are likely related partly to the severity of their underlying diseases, having clinical tools that can accurately and reliably provide prognostic predictions is important to aid in clinical decision-making.
General severity of illness scores have been used to predict ICU mortality. For example, the Acute Physiology And Chronic Health Evaluation (APACHE) and Simplified Acute Physiology Score (SAPS) have been developed since the 1980s. They provide adequate prediction of in-hospital and ICU mortality of all ICU patients regardless of ICU type [5][6][7][8]. The Sequential Organ Failure Assessment (SOFA) score is also used for hospital and ICU mortality prediction [9]. However, studies have suggested that these traditional models are not reliable for the AKI populations who need RRT in ICU [10][11][12]. Instead, models using data at RRT initiation have performed better at mortality prediction. Some of these models have shown good performance for mortality prediction but have limited results during external validation [13][14][15][16].
Recently, machine learning models have been broadly applied to tasks ranging from disease diagnosis to mortality prediction. They are expected to capture nonlinear interactions from high complexity data and consider all data points for continuous data, thus providing more accurate risk prediction than traditional models. We developed machine learning algorithms using data collected from the Medical Information Mart for Intensive Care (MIMIC-III) [17] and eICU Collaborative Research (eICU-CRD) databases [18] and compared their performance with that of the SOFA [19]; nonrenal SOFA [20]; and HEpatic failure, LactatE, NorepInephrine, medical Condition, and Creatinine (HELENICC) [16] scores in 30-day mortality prediction for AKI patients requiring RRT.

Data Sources
This retrospective observational cohort study was performed using two publicly available ICU datasets (MIMIC-III and eICU-CRD). MIMIC-III (53,423 ICU admissions between 2001 and 2012) was released by the Massachusetts Institute of Technology Laboratory for Computational Physiology (MIT-LCP) from a single tertiary care hospital (Beth Israel Deaconess Medical Center) in 2016 [17]. eICU-CRD (approximately 200,000 ICU admissions between 2014 and 2015) is a multicenter critical care database from rural/nonacademic hospitals across the United States made available by Philips Healthcare with the help of researchers from MIT-LCP in 2018 [18]. There is no overlap of patients included in these two databases [18].
We included adults ≥18 years old who received RRT (intermittent hemodialysis or continuous RRT [CRRT]) for AKI in the ICU. AKI was defined by creatinine change and diagnosis codes in this study. We used only the creatinine criteria because urine output data in the retrospective databases were unreliable. Patients were included if the maximum-minimum change in creatinine from ICU admission to RRT was ≥0.3 mg/dL or, when fewer than two creatinine measurements were available, if they had a diagnosis of AKI by ICD-9 codes (Supplementary Table S1). If a patient had been admitted to the ICU multiple times during one hospitalization course, data from the first ICU admission were extracted for study. Patients with a history of end-stage kidney disease who underwent chronic peritoneal dialysis (PD) or hemodialysis (HD) were excluded from the study, as were those with chronic kidney disease (CKD) stages 4 and 5 based on ICD-9 codes (Supplementary Table S1), because advanced CKD patients who develop AKI are more likely to survive the episode of AKI [21]. Patients with a history of any organ transplant were also excluded, as they may have other confounding risk variables that affect mortality.
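The creatinine-based inclusion rule can be illustrated with a short sketch. It assumes one reading of the criteria: with at least two creatinine values the maximum-minimum change decides, and otherwise the ICD-9 diagnosis decides. The function and argument names are hypothetical and are not taken from the study code.

```python
from typing import Sequence

AKI_DELTA_MG_DL = 0.3  # creatinine change threshold stated in the text


def meets_aki_criteria(creatinine_mg_dl: Sequence[float],
                       has_aki_icd9_code: bool) -> bool:
    """Illustrative AKI check (assumed reading of the inclusion rule).

    creatinine_mg_dl: values measured from ICU admission to RRT start.
    has_aki_icd9_code: True if an AKI diagnosis code is on record.
    """
    if len(creatinine_mg_dl) >= 2:
        # Enough measurements: use the maximum-minimum change.
        change = max(creatinine_mg_dl) - min(creatinine_mg_dl)
        return change >= AKI_DELTA_MG_DL
    # Fewer than two measurements: fall back to the diagnosis code.
    return has_aki_icd9_code
```

Under this reading, a patient with values of 1.0 and 1.5 mg/dL qualifies by creatinine change, whereas a patient with a single measurement qualifies only through a diagnosis code.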

Predictors
The variables of our models consisted of demographics, medical history, mechanical ventilation use, FiO2, vital signs, laboratory tests, and medications (diuretics, vasopressors). The mechanical ventilation, vital signs, lab tests, and medications were recorded within 24 h before RRT initiation. Relevant past medical history, extracted from database records using ICD-9 codes (Supplementary Table S1), included diabetes mellitus (DM), CKD, hypertension (HTN), congestive heart failure (CHF), liver cirrhosis (LC), and cancer. We used mean values of lab tests, FiO2, Glasgow Coma Scale (GCS) scores, mean arterial pressure (MAP), and respiratory and heart rates (HR). FiO2 was taken from a laboratory test in MIMIC and from a respiratory chart in eICU.
For laboratory tests, we used mean values of all variables recorded within 24 h before the first dialysis therapy initiation date, because some laboratory data values would have been influenced by dialysis. Supplementary Table S2 shows the percentage of missing data in laboratory tests. We excluded variables with >30% missing values. Multiple imputation by chained equations (MICE) with five imputed datasets was used to derive the missing values of the laboratory tests and vital signs, and the results were pooled using the MICE package [22]. The imputed data were then used to build the models.
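The variable-exclusion step that precedes imputation can be sketched as follows. This is a minimal illustration, assuming missing values are represented as `None` and variables are stored as columns in a dictionary; it does not reproduce the MICE imputation itself, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

MAX_MISSING_FRACTION = 0.30  # variables with >30% missing values are excluded


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are missing (None) in one variable."""
    if not values:
        return 1.0  # an empty column is treated as entirely missing
    return sum(v is None for v in values) / len(values)


def variables_to_keep(table: Dict[str, Sequence[Optional[float]]]) -> List[str]:
    """Names of variables whose missing fraction does not exceed the limit."""
    return [name for name, values in table.items()
            if missing_fraction(values) <= MAX_MISSING_FRACTION]
```

Only the variables returned by `variables_to_keep` would then be passed on to the imputation step.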
We modified the codes found in the public domain at https://github.com/nus-morninlab/oxygenation_kc (accessed on 5 February 2020) and https://github.com/MIT-LCP/mimic-code/tree/master/concepts/severityscores (accessed on 5 February 2020) to calculate a SOFA score using variables collected within 24 h before RRT start in eICU and MIMIC based on methods in the original study [19]. We also calculated the nonrenal SOFA score as the total SOFA score minus the points for the renal system [20]. For patients with missing variables, SOFA and nonrenal SOFA scores were imputed using MICE, as described above.
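The relationship between the two scores can be written out in a few lines. The sketch assumes the component points (0 to 4 for each of the six organ systems) have already been computed; the dictionary keys are hypothetical labels, not names from the cited repositories.

```python
from typing import Mapping

# The six organ systems scored by SOFA, each contributing 0-4 points.
SOFA_COMPONENTS = ("respiration", "coagulation", "liver",
                   "cardiovascular", "cns", "renal")


def sofa_total(points: Mapping[str, int]) -> int:
    """Sum the six component scores after checking their range."""
    for name in SOFA_COMPONENTS:
        if not 0 <= points[name] <= 4:
            raise ValueError(f"{name} score must be between 0 and 4")
    return sum(points[name] for name in SOFA_COMPONENTS)


def nonrenal_sofa(points: Mapping[str, int]) -> int:
    """Total SOFA score minus the points for the renal system."""
    return sofa_total(points) - points["renal"]
```

The nonrenal score therefore ranges from 0 to 20, compared with 0 to 24 for the full score.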
The primary outcome was all-cause mortality within 30 days of RRT initiation.

Prediction Machine Learning Algorithms
Mortality prediction is a classification problem in supervised machine learning. Four machine learning classification methods were applied in this study: logistic regression (LR), XGBoost, random forest (RF), and multilayer perceptron (MLP). We used grid search with tenfold cross-validation to find the best hyperparameters for all models. Our machine learning modeling strategy followed TRIPOD statement recommendations for the reporting of predictive models [23].

1. LR is the fundamental algorithm for machine learning development. In scikit-learn, LR uses regularization by default. The advantage of regularization is improved numerical stability.

2. XGBoost [24] is an implementation of the gradient-boosted decision trees ensemble algorithm. The implementation of XGBoost is optimized for performance and provides the best available solutions in many fields. It reduces variance and bias by using multiple models and adjusting the subsequent trees by the errors the previous trees made.

3. RF [25] is a bagging ensemble machine learning model that also includes several decision trees, but decisions made among trees are independent. It makes the final prediction by voting for the most common class, which reduces variance in decision trees. The advantages of RF are that it is robust to overfitting and is more stable in high-dimensional data than other machine learning algorithms [26].

4. MLP [27] is a well-known supervised learning implementation in artificial neural networks. Typically, it consists of one input layer, one or more hidden layers, and one output layer. It solves high-dimensional classification problems by dealing with the interactions among variables.

Model Validation
Models were validated using two strategies (Figure 1). In the first strategy, we developed the models on the MIMIC dataset and assessed external validity on the eICU dataset; this assignment was based on the higher severity of comorbidities and more complete records. In the second strategy, which aimed to build more robust models that could be applied across institutions, we pooled the eICU and MIMIC datasets to obtain more diverse and heterogeneous data, so that the trained models would generalize across different hospitals, and then randomly split the pooled data into training and testing datasets at a ratio of 8:2. We performed grid searches with tenfold cross-validation to obtain the best parameters using the training dataset and then tested the models on the independent testing dataset.
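The data partitioning in the second strategy can be sketched with the standard library alone. This is an illustration under stated assumptions: the split is a plain random shuffle (the text does not say whether it was stratified), the seed is arbitrary, and the function names are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def split_train_test(n_samples: int, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices into training and testing sets (8:2)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices: Sequence[int],
           k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [list(indices[i::k]) for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds)
                    if f != held_out for idx in fold]
        yield training, validation
```

In a grid search, each candidate hyperparameter setting would be fitted on every `training` list and scored on the matching `validation` list, and the held-out testing indices would be used only once, for the final evaluation.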

Statistical Analyses
We compared the baseline characteristics between the survival and death groups. Categorical variables were presented as proportions, and the mean with standard deviation or median with interquartile range was used to summarize the results for continuous variables. Numeric variables of clinical characteristics with normal distribution tested by the Kolmogorov-Smirnov test between the two groups were compared using the Student's t-test. Non-normally distributed continuous variables were tested by a Mann-Whitney U test. A Chi-squared test was used to compare the differences in categorical variables.
The overall performance of the prediction models on validation was assessed by the calculation of the area under the receiver operating characteristic curve (AUC) and the associated 95% confidence interval (CI) using the roc_auc_score function of scikit-learn. Calibration was assessed using the Hosmer-Lemeshow test and by constructing calibration curves. The differences between model AUCs were pairwise-compared using the DeLong test (p < 0.05 was considered statistically significant). Sensitivity, specificity, positive (PPV) and negative (NPV) predictive values, and accuracy were calculated for evaluation of model performance. To evaluate the impact of features on our best model, we used the SHAP framework (available in the public domain at https://github.com/slundberg/shap (accessed on 5 February 2020)) [28]. We used decision curve analysis to assess the net benefits of our best machine learning model, SOFA, nonrenal SOFA, and HELENICC scores. In the decision curve analysis, SOFA, nonrenal SOFA, and HELENICC scores were converted to predicted probabilities using logistic regression [12].
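The discrimination and classification metrics named above can be made concrete with a small standard-library sketch. It is not the scikit-learn implementation: the AUC is computed directly from its rank interpretation (the probability that a randomly chosen death receives a higher score than a randomly chosen survivor), and the 0.5 classification threshold is an assumption for illustration.

```python
from typing import Dict, Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true: Sequence[int], y_score: Sequence[float],
                           threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV, and accuracy at one threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

The pairwise AUC loop is quadratic in the number of patients, which is acceptable for a sketch but slower than the sorting-based computation used by library implementations.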
Machine learning algorithms and statistical analyses were performed using Python version 3.6, scikit-learn version 0.22.1, keras version 2.3.1, and R version 3.6.1.
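The decision curve analysis mentioned in this section rests on the net benefit at a threshold probability p_t, conventionally defined as TP/n − (FP/n) × p_t/(1 − p_t). The following is a minimal sketch of that standard formula, not the authors' code; the function name is hypothetical.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float],
                threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    if n == 0:
        raise ValueError("at least one patient is required")
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds for each model or score and comparing the results with the "treat all" strategy (all predicted probabilities set to 1) and the "treat none" strategy (net benefit of 0).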

General Demographics
Of 3357 patients in the MIMIC and 8201 patients in the eICU databases who required dialysis therapy, 1129 and 2283, respectively, met the criteria for study inclusion (Figure 2). Baseline demographics, comorbidities, vital signs, and laboratory values for patients in the two datasets are grouped by survival status (Supplementary Tables S3 and S4). Overall, patients who died were older and had a lower percentage of black race, a longer ICU stay before dialysis therapy initiation, and a higher percentage of mechanical ventilation use. Supplementary Table S5 shows that the mortality rate and comorbidities differed significantly between the eICU and MIMIC datasets. The 30-day ICU mortality rate was 42.9% and 32.7% in the MIMIC and eICU datasets, respectively. Table 1 shows the differences between the training and testing datasets of the pooled dataset. Apart from BUN, hemoglobin, and glucose, the variables were similar between the training and testing datasets.

Machine Learning Algorithm Performance and Comparison with Other Predictive Models in the First Strategy
In the eICU (testing) dataset, the RF model achieved the highest AUC (0.816; 95% CI, 0.798-0.834) (Table 3), but there were no significant differences among the four models (Supplementary Table S6). All four models performed significantly better than the SOFA, nonrenal SOFA, and HELENICC scores (p < 0.001). The Hosmer-Lemeshow test showed a poor model fit for all models except the MLP model. Figure 3 illustrates the ROC curves for our models, as well as for the SOFA, nonrenal SOFA, and HELENICC scores. Supplementary Figures S1 and S2 show the calibration curves of all models in the training and testing datasets.

Machine Learning Algorithm Performance and Comparison with Other Predictive Models in the Second Strategy
In the pooled dataset, there were 978 and 258 deaths in the training and testing datasets, respectively. The XGBoost model achieved the highest AUC (0.823; 95% CI, 0.791-0.854) and the highest accuracy (0.758) (Table 4). The MLP model performed worse than the other three models (Supplementary Table S7). The XGBoost, LR, and RF models showed no evidence of lack of fit with a Hosmer-Lemeshow test (p > 0.05) in the testing dataset. All four models performed significantly better than the SOFA, nonrenal SOFA, and HELENICC scores (p < 0.001). Figure 4 shows the ROC curves for all models. Supplementary Figures S3 and S4 show the calibration curves of all models in the training and testing datasets. The decision curve analysis showed that the net benefit of the XGBoost model was superior to that of the previous scoring systems (Figure 5).
Figure 6a shows the top 10 important features of the XGBoost model, calculated by SHAP values in the training dataset. In the 80% pooled dataset, older age, higher FiO2 and RR, lower creatinine and HCO3, increased anion gap, lower GCS, lower BP, vasopressor use, and decreased platelet count were associated with an increased mortality rate. The predictor ranks of all the models by the mean absolute value of SHAP are shown in Figure 6b. The results of multivariable logistic regression analysis using stepwise variable selection are shown in Table S8 and were similar to those of the XGBoost model.

Discussion
SOFA, nonrenal SOFA, and HELENICC scores have only modest predictive value for 30-day mortality in ICU patients with AKI requiring RRT. In the first strategy, we found that the machine learning models performed better than SOFA, nonrenal SOFA, and HELENICC scores when the models were validated on an external validation dataset (eICU dataset). In the second strategy, the XGBoost model showed reasonable performance and a sufficiently good fit (p = 0.22) in the heterogeneous dataset. Decision curve analysis indicated that the XGBoost model improved the net benefit for predicting 30-day mortality compared with SOFA, nonrenal SOFA, and HELENICC scores. The better performance of the XGBoost model may be related to its application of regularization and its high flexibility for hyperparameter tuning.
General severity of illness scores such as APACHE, SAPS, and SOFA, which use clinical and laboratory variables at ICU admission or even sequential data, predict mortality well for all ICU patients. However, they showed poor mortality prediction for AKI patients requiring RRT [10][11][12][13]. The models targeted specifically at this population, including the HELENICC score, ATN study, and Cleveland Clinic score, showed good performance in predicting mortality (AUC = 0.82, 0.85, 0.81, respectively) [13,16,29]. Those models either lacked external validation, did not perform well during external validation [15], or focused on patients with specific conditions. Notably, some variables were not readily available in the clinical datasets we used, thus limiting our ability to make a direct comparison of ML models with these scores. Prior studies have predicted mortality using machine learning algorithms, although these did not center on RRT. Brajer et al. [30] reported excellent performance using XGBoost to predict the in-hospital mortality of adults (AUC~0.85). Another study developed models for patients with influenza infection requiring ICU admission and found that XGBoost achieved the highest AUC (0.842) [31]. Kang et al. [12] applied machine learning algorithms to predict mortality in patients requiring CRRT and found that the RF model achieved the highest AUC (0.768). This was a retrospective study in one hospital (n = 1094) and lacked external validation. The performance of our models was reasonably high for all AKI patients requiring RRT, whether the eICU dataset was used as an external validation dataset or the independent part of the pooled dataset was used as a testing dataset to provide more generalizable results.
In this study, we found that the performance of validating models was better when using the eICU dataset. This may be related to the patient characteristics of the two datasets. Patients in the MIMIC dataset had a higher mortality rate and more severe comorbidities than those in the eICU dataset (Supplementary Table S5). We speculate that this could be due to demographic differences, as the eICU dataset included data from multiple ICUs in rural areas while the MIMIC dataset contained data from a single ICU in an urban medical center. Another reason may be that, according to principal component analysis, the data distribution of MIMIC was more complicated to classify than that of eICU (Supplementary Figure S5).
One challenge for medical researchers using machine learning algorithms is that it is difficult to assess or explain the individual contributing factors [32]. However, many advanced methods are being developed to make machine learning more transparent. We used SHAP values to visualize feature importance and determine the effect of different variables on the final output. SHAP offers not only the rank order of importance of variables but also how the variables impact the outcome, such as the association of low creatinine with death risk. The results correlated highly with clinical outcome. Overall, the features identified by SHAP could be classified into hemodynamic status, central nervous system, coagulation, respiratory system, kidney-related features, and age. The SHAP results were similar to the SOFA score parameters, but our models outperformed the SOFA score. That may be related to the loss of information when the SOFA score categorizes data, its inaccurate allocation of the creatinine score for AKI patients, and the ability of machine learning algorithms to capture nonlinear interactions from high complexity data. In this study, the XGBoost model used more and different variables, such as age and anion gap, which may have led to better performance than the SOFA and HELENICC scores. In addition, we retrained a new XGBoost model using only the top 10 features identified by SHAP and still achieved a good AUC (0.818; 95% CI: 0.786-0.849) in the testing dataset. Such a model would allow clinicians to enter only 10 data points, which could minimize their burden and limit non-use when data on a larger number of variables are missing.
Given its improved performance over traditional severity-of-illness scoring measures, such a model or tool could be used and further refined for several potential applications. For example, given its relatively high negative predictive value, it might help to enrich clinical trials for a targeted risk profile of patients. Another potential advantage of our model is that it utilizes data that are easily available and routinely collected in clinical practice. This is a distinct advantage over some prior risk scores that have been used in this population, such as the ATN score, as those included data collected as part of a research study that may not be available clinically. Having a model that can be calculated in real time will give clinicians prognostic data and help them engage in informed shared decision-making with patients and their caregivers about whether to initiate RRT. This will also allow physicians to discuss the overall aggressiveness of care and help with medical decision-making.
The strengths of the study include large sample size, external validation, curated datasets representing heterogeneous ICU populations, and the use of routinely collected clinical data. This study has several limitations. The MIMIC-III dataset is old, and practice patterns may have changed. This dataset is limited to laboratory values recorded while the patient was in the ICU, so we may have missed AKI by creatinine values in patients who were admitted to the hospital in a non-ICU setting prior to their ICU admission. Although our datasets included a large amount of collected clinical data, some data had to be excluded due to poor-quality data recording or missing data. Using MICE to impute data may reduce predictive power. Due to many missing values, we were unable to calculate SAPS and APACHE at RRT start time. Thus, we could not make a head-to-head comparison of the performance of SAPS or APACHE with our models. In addition, the datasets did not have the variables needed for comparison with other scores that have looked specifically at this population, such as the ATN study score, which included research data. Moreover, the variables only capture data collected in the hospitals, and we were unable to capture out-of-hospital patient mortality in the eICU dataset. Our sample size may not be large enough for machine learning to achieve better performance. The results should be interpreted only for the studied population in the United States, since we excluded all transplant and advanced CKD patients. Finally, the causal relationship between the top 10 features of the XGBoost model and 30-day mortality was not clearly explored.

Conclusions
All machine learning models had a reasonable performance and were superior to the SOFA, nonrenal SOFA, and HELENICC scores in predicting 30-day mortality for AKI patients requiring RRT. XGBoost provided the highest performance in this study. Further prospective research is needed to validate these results and explore how they can be integrated into clinical decision-making.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jcm11185289/s1. Table S1: ICD-9 diagnosis codes used to identify acute kidney injury, transplant history, and comorbidities; Table S2: Percentages of missing data in the MIMIC and eICU datasets; Table S3: Baseline characteristics of the patients requiring renal replacement therapy in the MIMIC dataset; Table S4: Baseline characteristics of the patients requiring renal replacement therapy in the eICU dataset; Table S5: Comparison of baseline characteristics of the patients requiring renal replacement therapy in the eICU and MIMIC datasets; Table S6: Pairwise p value of area under ROC curves (AUROCs) of prediction models using the DeLong test in the eICU dataset; Table S7: Pairwise p value of area under ROC curves (AUROCs) of prediction models using the DeLong test in the 20% pooled dataset; Table S8: Significant variables in multivariate logistic regression model; Figure S1: Calibration curves of all models using MIMIC dataset as a training dataset; Figure S2: Calibration curves of all models using the eICU dataset as the testing dataset; Figure S3: Calibration curves of all models using the 80% pooled eICU and MIMIC dataset as the training dataset; Figure S4: Calibration curves of all models using the 20% pooled eICU and MIMIC datasets as the testing datasets; Figure S5

Institutional Review Board Statement: Both databases are previously de-identified and have been reviewed by the institutional review boards (IRB) of their hosting organizations and determined to be exempt from subsequent IRB.

Informed Consent Statement:
The data descriptor for MIMIC-III states, "Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified".

Conflicts of Interest:
The authors declare no conflict of interest.