Internal and External Validation of Machine Learning Models for Predicting Acute Kidney Injury Following Non-Cardiac Surgery Using Open Datasets

This study developed and validated a machine learning model to accurately predict acute kidney injury (AKI) after non-cardiac surgery, aiming to improve patient outcomes by assessing its clinical feasibility and generalizability. We conducted a retrospective cohort study using data from 76,032 adults who underwent non-cardiac surgery at a single tertiary medical center between March 2019 and February 2021, and we used data from 5512 patients in the VitalDB open dataset for external model validation. The predictive variables for model training consisted of demographic, preoperative laboratory, and intraoperative data, including calculated statistical values such as the minimum, maximum, and mean intraoperative blood pressure. When predicting postoperative AKI, our gradient boosting machine model incorporating all the variables achieved the best results, with AUROC values of 0.868 and 0.757 for the internal and external validations using the VitalDB dataset, respectively. The model using intraoperative data performed best in internal validation, whereas the model using preoperative data excelled in external validation. In this study, we developed a predictive model for postoperative AKI in adult patients undergoing non-cardiac surgery using preoperative and intraoperative data, and external validation demonstrated the value of open datasets for assessing generalizability in medical artificial intelligence (AI) modeling research.


Introduction
Acute kidney injury (AKI) is a condition characterized by a rapid decline in kidney function and can be caused by a variety of factors [1]. In particular, postoperative patients are known to be at a higher risk of developing AKI than the average non-surgical hospitalized patient [2]. The occurrence of postoperative AKI not only prolongs the patient's hospitalization but also escalates healthcare expenses [3,4]. Furthermore, postoperative AKI can affect the long-term outcomes of surgical patients by elevating postoperative morbidity and mortality rates, potentially leading to the development of chronic kidney disease [5-7]. Predicting the occurrence of postoperative AKI before surgery enables the allocation of adequate postoperative medical resources to surgical patients, ultimately enhancing patient outcomes and mitigating healthcare costs [8,9]. However, to date, a systematic approach for predicting postoperative AKI in patients undergoing non-cardiac surgery has not been well established. We intended to address this gap by developing a predictive model using machine learning (ML) techniques, which have recently been actively applied in the medical field. Although previous artificial intelligence (AI) studies have focused on predicting postoperative AKI, most relied on data from single institutions, which limited their generalizability [9-12]. To overcome this limitation, our study aimed to explore the model's feasibility and utility across different institutions by externally validating it using an openly accessible dataset. In medical AI research, the utilization of open datasets for external validation is a crucial aspect of model development, offering a highly practical and viable alternative. However, because of the challenges associated with sharing and accessing medical data over different time periods, obtaining external datasets for validation remains a formidable task. Therefore, we anticipate that research focused on leveraging open datasets for external model validation will continue to thrive in the future.

Ethical Statement and Study Data
This study utilized data from 76,032 adult patients aged 18 years and older who underwent non-cardiac surgery at Asan Medical Center (AMC) from March 2019 to February 2021, with the data extracted from the hospital's Electronic Health Record (EHR) system. In addition, we utilized the open dataset VitalDB for the external validation of the model. Of the 6388 patients in VitalDB, we extracted data for 5512 patients to use as the external validation dataset for our prediction model (Figure S1). The study received approval from the Institutional Review Board (IRB) of Asan Medical Center (IRB No. 2024-0060), and informed consent from participants was waived based on the retrospective collection and secondary utilization of EHR data. We excluded pediatric patients under the age of 18, individuals who underwent cardiac surgery, and those who were diagnosed with end-stage renal disease or had preoperative creatinine levels exceeding 4.5 mg/dL (Figure 1).
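The exclusion steps above amount to a straightforward cohort filter. As a minimal sketch, assuming hypothetical column names (`age`, `surgery_type`, `esrd`, `preop_creatinine`) rather than the actual EHR schema, the criteria could be applied as follows:

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Filter a surgical cohort using the study's exclusion criteria.

    Column names are illustrative placeholders, not the real EHR schema:
    - age: patient age in years
    - surgery_type: coded surgery category
    - esrd: True if diagnosed with end-stage renal disease
    - preop_creatinine: preoperative creatinine in mg/dL
    """
    mask = (
        (df["age"] >= 18)                    # exclude pediatric patients
        & (df["surgery_type"] != "cardiac")  # exclude cardiac surgery
        & (~df["esrd"])                      # exclude end-stage renal disease
        & (df["preop_creatinine"] <= 4.5)    # exclude creatinine > 4.5 mg/dL
    )
    return df.loc[mask].copy()

cohort = pd.DataFrame({
    "age": [45, 12, 70, 60],
    "surgery_type": ["general", "general", "cardiac", "thoracic"],
    "esrd": [False, False, False, True],
    "preop_creatinine": [0.9, 0.8, 1.1, 5.2],
})
print(len(apply_exclusions(cohort)))  # only the first patient remains
```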

Variables for Modeling and Preprocessing Data
The variables employed in the modeling of this study encompassed demographic information, preoperative data, and intraoperative data (Table S1). The demographic data included the patient's age, gender, body mass index (BMI), and American Society of Anesthesiologists (ASA) class. The preoperative data included preoperative laboratory test results such as the white blood cell, hemoglobin, hematocrit, platelet, sodium, potassium, chloride, total bilirubin, albumin, aspartate aminotransferase (AST), and alanine aminotransferase (ALT) results, along with the estimated glomerular filtration rate (eGFR), glucose level, prothrombin time (PT), activated partial thromboplastin time (aPTT), blood urea nitrogen (BUN) level, creatinine level, and C-reactive protein (CRP) level. The intraoperative data included the arterial blood pressure values (systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean blood pressure (MBP)) measured through arterial catheterization, non-invasive blood pressure values (SBP, DBP, and MBP), estimated blood loss, anesthetic time, and surgery time. In our study, anesthetic time refers to the total duration from the induction of anesthesia to the moment of extubation and the patient's return to consciousness. Surgery time denotes the period from the initial skin incision to the completion of the surgery. The intraoperative blood pressure data were employed as input variables for the model by computing the maximum, minimum, mean, and standard deviation of the blood pressure measurements, along with the sum of the changes between blood pressure readings. Intraoperative monitoring measured blood pressure every 3 to 5 min using non-invasive methods, whereas arterial catheterization provided continuous real-time blood pressure data. However, the anesthesia records analyzed in our study only documented blood pressure readings at 5 min intervals, so more frequent measurements were not accessible. Intraoperative blood pressure data, which are automatically linked to the electronic medical record, often contain artifacts. We removed these artifacts using the algorithm presented in our previous study [13]. Blood pressure measurements exceeding the following thresholds were excluded from the analysis: a systolic blood pressure greater than 300 mm Hg or less than 20 mm Hg, a diastolic blood pressure greater than 225 mm Hg or less than 5 mm Hg, and a systolic blood pressure less than the diastolic blood pressure + 5 mm Hg, in accordance with the criteria proposed in previous studies [13,14]. Since the missing values in the modeling dataset did not exhibit any specific pattern or correlation, we addressed them by imputing the missing values with the mean value of each variable in the dataset (Figures S2 and S3).
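The artifact thresholds and summary features described above can be sketched in a few lines. This is an illustrative reimplementation, not the algorithm from [13]; in particular, "sum of the changes" is interpreted here as the sum of absolute differences between consecutive readings:

```python
import numpy as np

def clean_bp(sbp: np.ndarray, dbp: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Drop artifact readings using the study's thresholds:
    SBP must lie between 20 and 300 mm Hg, DBP between 5 and 225 mm Hg,
    and SBP must be at least DBP + 5 mm Hg."""
    valid = (
        (sbp > 20) & (sbp < 300)
        & (dbp > 5) & (dbp < 225)
        & (sbp >= dbp + 5)
    )
    return sbp[valid], dbp[valid]

def bp_features(series: np.ndarray) -> dict:
    """Summary features used as model inputs: max, min, mean, standard
    deviation, and the sum of absolute changes between readings."""
    return {
        "max": float(series.max()),
        "min": float(series.min()),
        "mean": float(series.mean()),
        "std": float(series.std()),
        "sum_delta": float(np.abs(np.diff(series)).sum()),
    }

sbp = np.array([120.0, 310.0, 118.0, 15.0, 122.0])  # 310 and 15 are artifacts
dbp = np.array([80.0, 70.0, 79.0, 75.0, 81.0])
sbp_clean, _ = clean_bp(sbp, dbp)
print(bp_features(sbp_clean))
```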

Open Dataset for External Validation
In our study, we utilized an open dataset known as VitalDB to externally validate the performance of our prediction model [15]. VitalDB comprises data from 6388 surgical patients who underwent non-cardiac surgery at Seoul National University Hospital, including preoperative data and intraoperative monitoring parameters. Since AKI, the primary outcome of our study, relies on creatinine blood test results, we chose VitalDB as our external validation dataset because it provides both preoperative and postoperative blood laboratory test data. Furthermore, VitalDB is publicly accessible, allowing researchers to download it from the internet without the need for institutional IRB approval. In this regard, it proves valuable as an external validation dataset for a range of clinical studies.

Primary Outcome
The primary outcome in our study was the occurrence of AKI within 7 days following surgery. AKI was defined as either a 1.5-fold increase in the creatinine level from the preoperative baseline or a rise of 0.3 mg/dL over a 48 h period, as measured within the first 7 days after surgery, in accordance with the definition from the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines. If no creatinine level was measured postoperatively, it was assumed that AKI had not occurred.
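The outcome definition above can be expressed as a small labeling function. This is a sketch of one plausible reading of the criteria; the text does not specify whether the 0.3 mg/dL rise is measured against the baseline or between measurements, so the version below compares pairs of postoperative measurements within a 48 h window:

```python
from datetime import datetime, timedelta

def had_aki(baseline: float, postop: list[tuple[datetime, float]]) -> bool:
    """Label postoperative AKI per the KDIGO-based definition used here:
    a creatinine value >= 1.5x the preoperative baseline, or a rise of
    >= 0.3 mg/dL between two measurements taken within a 48 h window.
    `postop` holds (measurement time, creatinine in mg/dL) pairs already
    restricted to the first 7 postoperative days. An empty list means no
    postoperative creatinine was measured, so no AKI is assumed."""
    if not postop:
        return False
    values = sorted(postop)
    # Criterion 1: 1.5-fold increase over the preoperative baseline.
    if any(creat >= 1.5 * baseline for _, creat in values):
        return True
    # Criterion 2: absolute rise of >= 0.3 mg/dL within any 48 h window.
    for i, (t1, c1) in enumerate(values):
        for t2, c2 in values[i + 1:]:
            if t2 - t1 <= timedelta(hours=48) and c2 - c1 >= 0.3:
                return True
    return False

day = datetime(2020, 1, 1)
print(had_aki(0.8, [(day, 0.9), (day + timedelta(hours=24), 1.25)]))  # True
```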

Modeling and Model Evaluation
In this study, the dataset variables used for modeling were categorized into demographic, preoperative, and intraoperative datasets. We compared the performances of models based on the individual datasets as well as various combinations of dataset configurations. We divided the datasets for modeling into training, testing, and validation datasets in a ratio of 6:2:2. To mitigate any potential overfitting of the model performance stemming from the specific composition of the training dataset, we employed the bootstrapping method for model evaluation by resampling each dataset multiple times. We present the evaluation results as mean values and statistical confidence intervals for the performance of the trained models on each resampled dataset. We employed four predictive modeling methods: logistic regression (LR), random forest (RF), gradient boosting machine (GBM), and a simple deep neural network (DNN) with five layers. We configured the hyperparameters for our modeling techniques as follows. For LR, default values were used for the L2 penalty and the "lbfgs" solver. For RF and GBM, we used a grid search approach to optimize the hyperparameters. Our DNN model comprised five hidden layers, along with batch normalization and a dropout rate of 0.5. The dense layers utilized rectified linear unit (ReLU) activation functions, and the final output layer employed a sigmoid activation function for binary classification. We applied a learning rate of 0.001, employed binary cross-entropy as the loss function, and utilized the Adam optimizer. We primarily assessed the prediction models' performance using two key metrics: the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). We employed the DeLong test to establish the statistical significance of the performance differences. Furthermore, we identified and compared the crucial variables influencing the performance of each model using Shapley additive explanation (SHAP) values.
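The split, bootstrapped evaluation, and metrics described above can be illustrated with scikit-learn on synthetic data. This is a simplified sketch, not the study's pipeline: it shows only the GBM model with default hyperparameters, bootstraps the validation set rather than the training data, and omits the grid search and the DeLong test:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EHR features; the real study used
# demographic, preoperative, and intraoperative variables.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)

# 6:2:2 split into training, testing, and validation sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
probs = model.predict_proba(X_val)[:, 1]

# Bootstrap-resample the validation set for an AUROC confidence interval.
rng = np.random.default_rng(0)
aurocs = []
for _ in range(200):
    idx = rng.integers(0, len(y_val), len(y_val))
    if len(set(y_val[idx])) < 2:  # both classes needed to compute AUROC
        continue
    aurocs.append(roc_auc_score(y_val[idx], probs[idx]))
lo, hi = np.percentile(aurocs, [2.5, 97.5])
print(f"AUROC {roc_auc_score(y_val, probs):.3f} "
      f"(95% CI {lo:.3f}-{hi:.3f}), "
      f"AUPRC {average_precision_score(y_val, probs):.3f}")
```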

Statistical Analysis and Modeling Tools
To statistically describe the dataset characteristics, the distributions of continuous variables are depicted using mean and standard deviation values, whereas categorical variables are presented using counts and proportions. To compare continuous variables across datasets, t-tests were employed, whereas chi-squared tests were used for categorical variables. The descriptive statistics were obtained using R 3.4.3, whereas the ML predictive modeling and model performance evaluations were conducted using Python 3.11.
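The two comparison tests above can be run with SciPy; the cohorts below are synthetic stand-ins, not the study data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical ages from two cohorts (continuous variable -> t-test).
internal_age = rng.normal(58, 14, 500)
external_age = rng.normal(57, 15, 300)
t_stat, p_cont = stats.ttest_ind(internal_age, external_age)

# Hypothetical gender counts (categorical variable -> chi-squared test).
contingency = np.array([[260, 240],    # internal: male, female
                        [150, 150]])   # external: male, female
chi2, p_cat, dof, _ = stats.chi2_contingency(contingency)

print(f"age: mean {internal_age.mean():.1f} +/- {internal_age.std():.1f}, "
      f"t-test p = {p_cont:.3f}; gender: chi-squared p = {p_cat:.3f}")
```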

Study Data Characteristics
Table 1 presents the results of a comparative analysis of the attributes of the research data extracted from two distinct datasets: the AMC data (the internal dataset) and the VitalDB data (the external dataset). There do not appear to be any significant disparities in demographic variables such as age, gender, and BMI between the two datasets (Table 1). Nevertheless, upon closer examination of the surgical department information, it becomes apparent that the external dataset exhibits a higher prevalence of general and thoracic surgeries than the internal dataset (Table 1). Among the 76,032 adult patients aged 18 years and older who underwent general anesthesia for non-cardiac surgery at AMC, a total of 2314 (3.1%) patients experienced postoperative AKI. Conversely, in the external validation dataset from VitalDB, which included 5512 patients who did not meet the exclusion criteria, 78 (1.5%) developed postoperative AKI (Table 2). The percentage of patients with a hospital stay of 7 days or more was higher for the external dataset, at 57.2%, in contrast to 38.5% for the internal dataset (Table 2). Mortality within 30 days of surgery did not differ significantly between the two datasets; however, postoperative in-hospital mortality was higher for the internal dataset, at 2.1% compared to 0.9% for the external dataset (Table 2). Additionally, the rate of postoperative intensive care unit (ICU) admissions for the internal dataset was nearly twice as high, at 18.3% compared to 9.5% for the external dataset (Table 2). For our internal dataset, we compared clinical outcomes between the AKI and non-AKI groups, and as anticipated, the AKI group exhibited a higher incidence of extended hospital stays, postoperative in-hospital mortality, and postoperative ICU admissions than the non-AKI group.

Predictive Performance Results of Internal and External Validations
The internal validation results for the AKI prediction model showed that the GBM algorithm had the highest prediction performance when using all the datasets, with an AUROC of 0.868, an AUPRC of 0.786, and an F1-score of 0.723 (Figure 2 and Table 3). The GBM algorithm also had the highest performance when utilizing the intraoperative dataset, with an AUROC of 0.816 and an AUPRC of 0.715, as compared to the prediction model performances when using the demographic and preoperative datasets (Table 3). The external validation results for the AKI model on the VitalDB dataset showed that the highest prediction performance, with an AUROC of 0.769 and an AUPRC of 0.696, was achieved by the GBM algorithm when using the preoperative dataset (Table 4). Furthermore, the LR algorithm exhibited the highest prediction performance when utilizing the entire dataset, with an F1-score of 0.627 (Table 4).

Feature Importance
An analysis of the feature importance of the AKI prediction model using SHAP values on the internal dataset showed that the total surgery time and intraoperative blood pressure data were highly significant, whereas gender from the demographic data, and the albumin and creatinine levels from the preoperative laboratory test data, played pivotal roles in the prediction performance (Figure 3).

Discussion
In our study, we developed a model for predicting postoperative AKI by utilizing both preoperative and intraoperative data. We subsequently conducted an external validation of its performance using a publicly available dataset from a distinct institution, thus highlighting the generalizability of our predictive model. Furthermore, by utilizing a substantial EHR dataset, our study demonstrated the potential to predict the occurrence of AKI, a significant postoperative complication. We achieved this using automatically generated EHR data that could be seamlessly integrated with a hospital's electronic medical record (EMR) system. This integration would serve the dual purposes of alerting medical personnel to the risk in advance, thereby facilitating the provision of intensive care to patients at risk, and assisting in optimizing the allocation of hospital resources by appropriately identifying and reclassifying these at-risk patients.
AKI is a major postoperative complication, contributing significantly to prolonged hospital stays and increased postoperative morbidity and mortality [1,3,4,16-18]. The incidence of postoperative AKI varies widely with the type of surgery, with surgical factors being some of the most common causes [2]. Although AKI is generally less frequent in non-cardiac surgeries than in cardiac procedures, its occurrence in the former can present greater prediction challenges, potentially leading to delayed detection and treatment and consequently worsening the patient's prognosis. Postoperative AKI is associated with prolonged postoperative mechanical ventilation, extended intensive care hospitalization, and increased treatment requirements, all of which negatively impact the patient's postoperative outcome [1,3,4,6,7,16-21]. Therefore, the ability to predict AKI before surgery is crucial for efficiently allocating hospital resources and directing medical attention to high-risk patients, ultimately leading to improved patient outcomes [8,9,22]. Previous studies have explored the use of ML algorithms to predict the occurrence of postoperative AKI [8,9,11,23-27]. Lei et al. aimed to predict the risk of postoperative AKI in non-cardiac surgery by utilizing preoperative and intraoperative blood pressure data [9]. Nevertheless, a limitation of their study was its reliance solely on data from a single institution. This limitation hindered the ability to assess the generalizability of their findings to other healthcare facilities and made it challenging to evaluate potential overfitting in the prediction model. Overfitting is a common concern in AI research, particularly in studies that rely solely on data from a single institution. Therefore, for a developed ML prediction model to hold significance in terms of reproducibility and generalizability, it needs to undergo validation using data from other institutions to assess its predictive performance in an external context. We performed an external validation of our ML prediction model by utilizing the VitalDB dataset, an open dataset containing information on surgical patients [15].
One noteworthy aspect of our study was our approach to validation. Unlike previous studies that predominantly relied on single-institution data for internal validation, we extensively validated our model's performance against an external institutional dataset. Although it is recognized that external validation may yield slightly lower predictive accuracy than internal validation, our research underscored that the predictive performance achieved through external validation remained notably strong, particularly when compared to findings from prior studies. This finding highlights our study's distinctive contribution in addressing a common limitation of AI model research, namely the limited generalizability of models trained on data from a single institution, which results in suboptimal performance on datasets from other institutions.
One of the paramount considerations in AI prediction research involving medical data is to ensure that the developed model avoids overfitting to the training data. This step is crucial for enhancing the model's applicability in real-world clinical scenarios, making the outcomes of such studies more practically relevant [28,29]. Consequently, we believe that external validation, employing data from external sources, is an essential phase in medical AI prediction research. Nevertheless, acquiring data from external organizations often proves to be a substantial challenge. This difficulty stems from the various data regulations governing these organizations, which impose stringent restrictions on the exchange and sharing of large-scale data. As a result, a practical alternative is to evaluate predictive models developed on single-institution data against publicly available open datasets. This approach offers a valuable means of conducting external validation in AI prediction research within real-world healthcare settings.
Our study had several strengths compared to prior research. First, we considered not only the basic statistical metrics of intraoperative blood pressure, a critical parameter, but also the changes in the blood pressure data. Additionally, we incorporated factors such as intraoperative blood loss and surgery time into our prediction model. Consequently, we enhanced the postoperative AKI prediction performance, achieving an AUROC of 0.87, which surpassed the previous studies' AUROC of 0.82 [8,9]. The external validation also demonstrated strong predictive performance, with an AUROC of 0.769.
Our study had several limitations that warrant consideration in future research. First, we did not use the complete intraoperative blood pressure (BP) dataset for the predictive modeling; instead, we utilized statistical summaries of the BP data. Therefore, the modeling process did not use all of the BP data but substituted a statistical representation derived from the complete blood pressure recordings. Future research should emphasize the application of predictive modeling techniques such as transformers and recurrent neural networks (RNNs), which are known for their effectiveness in handling time series data like blood pressure readings. The goal should be to develop models capable of capturing all of the blood pressure data. Moreover, it is imperative to validate the applicability of predictive models across diverse real-world organizations using a more comprehensive set of publicly available datasets. Prospective studies are also essential to assess the practical value of these predictive models when deployed in real clinical settings. Ultimately, such prospective investigations can offer valuable insights into the clinical utility of AI predictive models.

Conclusions
In conclusion, we successfully developed and implemented a prediction model for postoperative AKI in adult patients undergoing non-cardiac surgery by using a dataset that could be automatically extracted from an EHR system. Furthermore, we conducted an external validation of our prediction model using a separate institutional open dataset, thereby confirming the reproducibility and generalizability of our model. In addition, our study underscored the viability of open datasets as a valuable alternative to third-party datasets for the external validation of prediction models in the field of medical AI research. As additional diverse datasets become available in the future, it is anticipated that more researchers will leverage them for validating a range of AI prediction models.

Figure 1. Study flow chart of the internal dataset.

Figure 2. AUROC and AUPRC values of prediction models across various modeling methods using all features in the internal and external datasets. (A) AUROC and (B) AUPRC values of prediction models across various modeling methods using all features in the internal datasets. (C) AUROC and (D) AUPRC values of prediction models across various modeling methods using all features in the external datasets. AUROC and AUPRC values are represented with 95% confidence intervals. AUROC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; LR, logistic regression; RF, random forest; GBM, gradient boosting machine; DNN, deep neural network.

Supplementary Materials: Figure S1. Study flow chart of the open dataset; Figure S2. Distribution plots of missing values: (A) internal dataset, (B) external dataset; Figure S3. Nullity correlation heatmaps: (A) internal dataset, (B) external dataset; Figure S4. Distribution plots of various blood pressure variables in the internal and external datasets; Figure S5. AUROC and AUPRC values of prediction models across different modeling methods according to the various feature selections: (A) demographic dataset, (B) preoperative dataset, (C) intraoperative dataset; Figure S6. SHAP value summary plots generated from the GBM models according to the various feature selections: (A) demographic dataset, (B) preoperative dataset, (C) intraoperative dataset.

Author Contributions: Conceptualization: S.-W.L. and S.-H.K.; data curation: S.-W.L. and J.J.; investigation: S.-W.L., J.J., W.-Y.S. and D.L.; methodology: S.-W.L., J.J., W.-Y.S. and S.-H.K.; supervision: S.-W.L. and S.-H.K.; writing, original draft: S.-W.L.; writing, review and editing: S.-W.L., J.J. and S.-H.K. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant numbers: HR20C0026, HI22C1723). This study was also supported by grants (2022IP0053, 2023IP0132) from the Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea.

Institutional Review Board Statement: The study received approval from the Institutional Review Board (IRB) of Asan Medical Center (IRB No. 2024-0060) and was conducted according to the guidelines of the Declaration of Helsinki.

Informed Consent Statement: Informed consent from participants was waived by the IRB of Asan Medical Center based on the retrospective collection and secondary utilization of EHR data.

Table 1. Characteristics of the internal and external datasets.

Table 2. Clinical outcomes in the internal and external datasets. Data represent mean ± standard deviation, median (interquartile range), or number (percentage). AKI, acute kidney injury; ICU, intensive care unit.

Table 3. Predictive performances of machine learning techniques for predicting postoperative AKI in the internal dataset using combinations of various features.


Table 4. Predictive performance of machine learning techniques for predicting postoperative AKI in the external dataset using combinations of various features.