Article

Machine Learning Models for Point-of-Care Diagnostics of Acute Kidney Injury

1 Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
2 Artificial Intelligence Research and Development Center, Wan Fang Hospital, Taipei Medical University, Taipei 110, Taiwan
3 Department of Radiation Oncology, Wan Fang Hospital, Taipei Medical University, Taipei 116, Taiwan
4 Department of Surgery, School of Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
5 Division of Cardiovascular Surgery, Department of Surgery, Wan Fang Hospital, Taipei Medical University, Taipei 116, Taiwan
6 Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 100, Taiwan
7 Department of Internal Medicine, School of Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
8 Division of Nephrology, Department of Internal Medicine, Wan Fang Hospital, Taipei Medical University, Taipei 116, Taiwan
9 Division of Nephrology, Department of Internal Medicine, Shuang Ho Hospital, Taipei Medical University, Taipei 235, Taiwan
10 Emergency Department, Department of Emergency and Critical Medicine, Wan Fang Hospital, Taipei Medical University, Taipei 116, Taiwan
11 Department of Emergency Medicine, School of Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
12 Division of Cardiology and Cardiovascular Research Center, Department of Internal Medicine, Taipei Medical University Hospital, Taipei 110, Taiwan
13 Division of Cardiology, Department of Medicine, Taipei Veterans General Hospital, Taipei 112, Taiwan
14 Second Degree Bachelor of Science in Nursing, College of Medicine, National Taiwan University, Taipei 100, Taiwan
15 Department of Nursing, National Taiwan University Hospital Yunlin Branch, Yunlin 640, Taiwan
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Diagnostics 2025, 15(21), 2801; https://doi.org/10.3390/diagnostics15212801
Submission received: 20 September 2025 / Revised: 28 October 2025 / Accepted: 1 November 2025 / Published: 5 November 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background/Objectives: Computerized diagnostic algorithms can achieve early detection of acute kidney injury (AKI) only when a baseline serum creatinine (SCr) is available. To address this weakness, we constructed machine learning models that diagnose AKI from point-of-care clinical features, regardless of baseline SCr. Methods: Patients with SCr > 1.3 mg/dL were recruited retrospectively from Wan Fang Hospital, Taipei. Dataset A (n = 2846) served as the training dataset and Dataset B (n = 1331) as the testing dataset. Point-of-care features, including laboratory data and physical readings, were input into the machine learning models. The repeated machine learning models randomly split Dataset A into training (70%) and testing (30%) subsets for 1000 rounds. The single machine learning models used Dataset A as the training dataset and Dataset B as the testing dataset. A computerized algorithm for AKI diagnosis, based on a 1.5-fold increase in SCr, and the clinicians’ AKI diagnoses were compared against the machine learning models. Results: On the independent, unbalanced test set (n = 1331), our machine learning models achieved AUROC values ranging from 0.67 to 0.74, and all of them significantly outperformed the routine clinician’s diagnosis (AUROC ~0.74 vs. 0.53, p < 0.05). For context, the pre-existing computerized algorithm, which requires available baseline SCr data, achieved an AUROC of 0.94 on a relevant subset of the data, marking the performance benchmark when baseline data are available. Formal statistical comparisons revealed that the top-performing models (e.g., Random Forest, SVM) were often statistically indistinguishable from one another. Model performance was highly dependent on the test scenario, with precision and F1 scores improving markedly on a balanced dataset. Conclusions: In the absence of baseline SCr, machine learning models can diagnose AKI with significantly greater accuracy than routine clinical diagnoses. Our statistical analysis suggests that several advanced algorithms achieve a similarly high level of performance.

1. Introduction

Acute kidney injury (AKI) represents a significant adverse event among hospitalized patients [1]. Given that kidney function is frequently affected by cardiovascular dysfunction [2], sepsis [3], autoimmune diseases, and various causes of circulatory collapse [4], AKI serves as a critical indicator of in-hospital mortality and prolonged hospital stays [5]. Consequently, the early detection of AKI is a strategic approach for enhancing patient outcomes in hospital settings [6,7].
The implementation of an electronic AKI alert system holds considerable promise for early detection. A multicenter cohort study demonstrated that such a system improved renal function recovery in patients admitted to the intensive care units [8]. Another multicenter study reported that the use of a computerized decision support system was associated with reduced in-hospital mortality, fewer dialysis sessions, and shorter hospital stays [9]. In a prospective study conducted in Korea, an electronic AKI alert system linked to automated nephrologist consultation revealed that early consultation and intervention by a nephrologist increased the likelihood of renal recovery from AKI in hospitalized patients [10]. Although the electronic AKI alert system does not consistently alter clinical management or improve AKI outcomes, it shows potential for optimization [11,12].
Nevertheless, constructing an electronic AKI alert system is challenging. Such computerized algorithms detect AKI events based on an increase in serum creatinine (SCr) exceeding 1.5 times the baseline level within 7 days [13,14]. For patients lacking “baseline SCr within 7 days,” computerized algorithms must deduce a diagnosis, potentially lowering diagnostic accuracy. To address this limitation, machine learning models that detect AKI based on point-of-care clinical features may offer a solution. AKI is considered an ideal syndrome for the application of artificial intelligence due to its standardized and readily identifiable definition [15]. Patients with AKI may exhibit various clinical features, including demographic characteristics, comorbidities, changes in vital signs, and diverse laboratory findings [16]. These features may be input into machine learning models to facilitate point-of-care AKI diagnoses in the absence of baseline SCr levels. For instance, when a patient with no known medical history presents with abnormal SCr, it can be challenging for an inexperienced physician to differentiate between AKI and chronic kidney disease (CKD) and to initiate an accurate diagnostic and therapeutic plan. In such cases, our machine learning model may assist less experienced clinicians in making this distinction, thereby enabling timely and appropriate clinical decision-making. Therefore, the present study aimed to construct machine learning models for the context of absent baseline SCr within seven days, utilizing clinical features at a single time point. To achieve this, we first assessed model stability using a repeated sampling methodology (Method 1) before evaluating real-world generalizability on an independent, temporally distinct test set (Method 2). We then benchmarked these models against both routine clinician diagnoses and the traditional computerized algorithm to fully characterize their clinical potential.

2. Methods

2.1. Study Design and Participants

This retrospective study was conducted at Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan. The study was approved by the Ethics Committee and Institutional Review Board of Taipei Medical University (approval no. N202111017, date 30 September 2021) and adhered to the tenets of the 1975 Declaration of Helsinki, as revised in 2013. Informed consent for participation was not required, as determined by the Ethics Committee and Institutional Review Board of Taipei Medical University. The study population comprised hospitalized patients with one or more records of SCr levels exceeding 1.3 mg/dL. Two datasets, designated Datasets A and B, were assembled from patients meeting this criterion. Dataset A served both as the training and testing set in the repeated machine learning model (Method 1) and as the training set for the single machine learning model (Method 2). Dataset B was used exclusively as the testing dataset for the single machine learning model (Method 2). To mitigate selection bias, all patients were included through simple randomization. Given that 26 features were input into the machine learning models, the study aimed to enroll over 2600 patients in the training dataset and more than 1000 patients in the testing dataset.
The study was intentionally designed to use two separate datasets, Dataset A for training and Dataset B for testing, to ensure a rigorous and clinically relevant evaluation of the models. This approach serves two primary purposes. First, it facilitates temporal validation: the models are trained on older data (January 2018–June 2020) and evaluated on newer, unseen data (July 2020–December 2020), simulating a real-world deployment scenario and testing the models’ robustness to potential shifts in patient characteristics or clinical practices over time. Second, it keeps the test set fully independent of model development, so that the reported performance reflects generalization to unseen patients rather than overfitting to the training data.
For Dataset A, patients hospitalized from January 2018 to June 2020 were randomly screened for eligibility for inclusion in the training dataset. The inclusion criteria were as follows: (1) at least one SCr value > 1.3 mg/dL during hospitalization and (2) age > 20 years. The exclusion criterion was the absence of baseline SCr within 7 days preceding the indexed abnormal SCr level. Patients with abnormal SCr levels were categorized into AKI and non-AKI groups using a computerized algorithm for AKI diagnosis, which will be detailed subsequently. The AKI and non-AKI groups were then randomly balanced to form Dataset A, with 1423 patients in each group.
For Dataset B, patients hospitalized from July 2020 to December 2020 with (1) at least one SCr value > 1.3 mg/dL during hospitalization and (2) age > 20 years were randomly selected for inclusion in the testing dataset. Notably, the availability of baseline SCr within 7 days was not a requirement for Dataset B. For each patient in Dataset B, a final diagnosis of AKI or non-AKI was retrospectively established by our researcher nephrologists based on a comprehensive review of the patient’s record according to KDIGO guidelines; this expert diagnosis served as the ground truth for evaluating all other methods. AKI was defined by the following criteria: (1) For patients with available baseline SCr values within 7 days before the indexed abnormal SCr level, an increase in SCr > 1.5 times satisfied the diagnosis of AKI. (2) For patients with available SCr values more than 7 days before the indexed abnormal SCr, the nearest previous SCr value was assumed to be the baseline SCr, and an increase in SCr > 1.5 times above this baseline value satisfied the diagnosis of AKI. (3) For cases in which previous SCr values were unavailable, patients were assumed to have normal baseline SCr levels, and AKI was diagnosed by convention. Notably, this approach to patients without baseline SCr within 7 days may lead to an AKI diagnosis in some CKD patients. In addition to this ground truth, two baseline diagnostic labels were collected for comparison: the routine clinician’s diagnosis documented in the final discharge summary, and the diagnosis generated by our pre-existing computerized algorithm. Dataset B included 334 and 997 patients in the AKI and non-AKI groups, respectively. To address the imbalance in Dataset B, additional trials using balanced and refined subsets were conducted. These trials showed improved or consistent model performance, indicating that the impact of the imbalance on model metrics was evaluated and controlled.

2.2. Features of the Machine Learning Models

The features utilized in the machine learning models were sex, age, and laboratory and physical readings obtained at the time of admission. Notably, the present study aimed to diagnose AKI in patients with no known medical history who presented with abnormal SCr. Consequently, comorbidities were excluded from the feature set used by the machine learning models. Laboratory parameters included SCr, Na, K, aspartate aminotransferase (AST), alanine aminotransferase (ALT), red blood cell (RBC) count, hemoglobin, hematocrit (Hct), red cell distribution width (RDW-CV), white blood cell (WBC) count, fractions of neutrophils, lymphocytes, monocytes, eosinophils, and basophils, platelet count, and platelet distribution width (PDW). Physical readings included respiratory rate, systolic blood pressure (SBP), diastolic blood pressure (DBP), oxygen saturation (SpO2), body temperature, pulse rate, weight, and height. Before running the machine learning models, the association between the model features and AKI events was evaluated using principal component analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). Physical readings were recorded by attending nurses at 7:00 on the day the indexed laboratory data were obtained. Biochemical data were measured using a Beckman Coulter DxC AU5800 (Beckman Coulter Inc., Brea, CA, USA), and hematological data were measured using a Beckman Coulter DxH 1601 (Beckman Coulter Inc., Brea, CA, USA).

2.3. Computerized Algorithm for Defining AKI

The accuracy of the computerized algorithm was validated in a separate study conducted by our research team [15]. Briefly, if an azotemic patient (SCr > 1.3 mg/dL) had a previous SCr value within 90 days, an increase in SCr > 1.5 times satisfied the diagnosis of AKI; if an azotemic patient had no previous SCr value, the baseline SCr was assumed to be normal and AKI was diagnosed by convention; for an azotemic patient whose nearest previous SCr was recorded more than 90 days before the indexed abnormal SCr, that value was assumed to be the baseline, and an increase in SCr > 1.5 times satisfied the AKI diagnosis. The program code was developed using Node.js 14.19.1 (OpenJS Foundation, San Francisco, CA, USA).
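To make the decision rules concrete, the following is a minimal Python sketch of the algorithm’s logic (the production implementation was written in Node.js); the function name and data layout are illustrative, not the actual code.

```python
from datetime import datetime

def diagnose_aki(index_scr, index_date, history):
    """Rule-based AKI call for an azotemic patient (index SCr > 1.3 mg/dL).

    `history` holds (date, SCr in mg/dL) pairs. Per the algorithm, the
    nearest SCr measured before the index test serves as the baseline and
    a >1.5-fold rise satisfies AKI; if no previous SCr exists, the
    baseline is assumed normal and AKI is diagnosed by convention.
    """
    prior = [(d, v) for d, v in history if d < index_date]
    if not prior:
        return True  # no baseline available: AKI assumed by convention
    _, baseline_scr = max(prior, key=lambda pair: pair[0])  # nearest value
    return index_scr > 1.5 * baseline_scr

# Example: SCr rose from 1.1 to 2.8 mg/dL within ten days -> AKI.
print(diagnose_aki(2.8, datetime(2020, 7, 1), [(datetime(2020, 6, 21), 1.1)]))
```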

2.4. Researcher’s Definition for AKI and Clinician’s Diagnosis

AKI was defined according to the KDIGO Clinical Practice Guidelines for AKI [17]. Once an SCr value > 1.3 mg/dL was identified, the previous SCr values were reviewed. AKI was defined as follows: (1) In patients with previous SCr tests within 7 days preceding the indexed SCr value, an increase in SCr > 1.5 times was defined as AKI; however, no such cases were observed in the present study. (2) In patients with previous SCr tests more than 7 days before the indexed abnormal SCr value, the nearest previous SCr value was assumed as the baseline SCr, and an increase in SCr > 1.5 times above this value was defined as AKI; again, no such cases were observed in the present study. (3) In patients without previous SCr values, the patient was assumed to have normal baseline SCr levels, and AKI was defined as present. This diagnosis was used as the standard in the present study. Notably, the AKI criteria for decreased urine output in the KDIGO Clinical Practice Guidelines were not applied in this study. The AKI diagnosis documented in the discharge summaries was considered the clinician’s diagnosis. In cases in which AKI was not included in the discharge diagnosis of a patient with AKI, the clinician’s diagnosis was considered inaccurate.

2.5. Data Preprocessing and Imputation

A standardized preprocessing pipeline was applied to the data before model training to ensure consistency and optimal performance. This pipeline, which includes imputation and feature scaling, was developed using only the training dataset (Dataset A) to prevent any information leakage from the test set (Dataset B).
First, to handle missing data, we employed mean imputation. In this procedure, any missing value for a given feature was replaced with the arithmetic mean of all observed values of that feature within the training data. This approach ensures a complete dataset for model training.
Second, following imputation, all features were standardized. This transformation rescales each feature so that it has a mean of zero and a standard deviation of one. Standardization is a critical step that prevents features with larger numeric ranges from disproportionately influencing the model’s learning process, which is particularly important for distance-based and gradient-based algorithms.
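As a concrete illustration, the pipeline below is a minimal scikit-learn sketch of this two-step preprocessing; the variable names (X_train_A, X_test_B) are placeholders for the feature matrices of Datasets A and B.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit the preprocessing statistics (feature means and standard deviations)
# on Dataset A only, then apply them unchanged to Dataset B, so that no
# information leaks from the test set into training.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # mean imputation
    ("scale", StandardScaler()),                 # zero mean, unit variance
])

X_train_prep = preprocess.fit_transform(X_train_A)  # learn on Dataset A
X_test_prep = preprocess.transform(X_test_B)        # reuse on Dataset B
```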

2.6. Development of Machine Learning Models

We employed two distinct approaches to develop and validate our machine learning models. Method 1 (Repeated Learning) was designed to assess the inherent stability and internal validity of the models by repeatedly training and testing on subsets of a single dataset (Dataset A). In contrast, Method 2 (Single Learning on an Independent Test Set) was designed to assess the models’ real-world generalizability on new, unseen data (Dataset B) and to investigate the impact of factors like data imbalance.
For Method 1, only Dataset A was used in the repeated machine learning models. In each iteration, 70% of Dataset A was randomly selected as the training dataset and the remaining 30% served as the testing dataset. This procedure was repeated 1000 times to obtain the average performance of the machine learning models. In each iteration, seven machine learning models were employed: the Support Vector Machine (SVM), Logistic Regression (LR), Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Naive Bayes classifier (NB), and Neural Network (NN).
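The resampling procedure can be sketched as follows; Random Forest stands in for any of the seven models, X_A and y_A are placeholders for the Dataset A features and labels, and the stratified split is our assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

aurocs = []
for seed in range(1000):  # 1000 random 70/30 resamples of Dataset A
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_A, y_A, test_size=0.30, stratify=y_A, random_state=seed)
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    aurocs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# Report each metric as mean +/- standard deviation over the 1000 rounds.
print(f"AUROC: {np.mean(aurocs):.2f} +/- {np.std(aurocs):.2f}")
```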
For Method 2, we used the entire Dataset A as the training dataset to build a single, final version of each model. The performance of these models was then evaluated on the independent Dataset B, which served exclusively as the testing dataset.
The evaluation in Method 2 was conducted under three distinct clinical scenarios using the independent Dataset B:
Trial 1: Performance on the Full, Unbalanced Test Set
First, we tested the models on the entire Dataset B to evaluate their performance in a scenario that mirrors a typical clinical setting, where non-AKI cases are often more prevalent than AKI cases.
Trial 2: Performance on a Balanced Test Set
Second, to mitigate the potential effects of class imbalance on performance metrics like precision and F1-score, we evaluated the models on a reduced subset of Dataset B containing a balanced number of AKI and non-AKI patients.
Trial 3: Performance on a Post-Exclusion Test Set
Finally, to assess model performance on the most diagnostically definitive cases, we used a third subset of Dataset B that excluded patients who lacked baseline SCr values within the preceding seven days. This created a “cleaner” dataset to test the models’ core diagnostic capability.
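In code, the three evaluation subsets might be derived from Dataset B as in the sketch below; df_B, its aki label column, and the has_baseline_7d flag are hypothetical names used for illustration.

```python
import pandas as pd

# Trial 1: the full, unbalanced Dataset B (334 AKI vs. 997 non-AKI).
trial1 = df_B

# Trial 2: a balanced subset, undersampling non-AKI cases to match the
# number of AKI cases (the sampling seed is illustrative).
aki_cases = df_B[df_B["aki"] == 1]
non_aki_sample = df_B[df_B["aki"] == 0].sample(n=len(aki_cases),
                                               random_state=0)
trial2 = pd.concat([aki_cases, non_aki_sample])

# Trial 3: only patients with a baseline SCr within the preceding 7 days,
# i.e., the most diagnostically definitive cases.
trial3 = df_B[df_B["has_baseline_7d"]]
```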
The models were implemented using Python (version 3.12.12) with the scikit-learn and XGBoost libraries. For the more complex models, key hyperparameters were selected through a tuning process that used a randomized search with 5-fold cross-validation on the training data to optimize for accuracy. Specifically, the Random Forest model was constructed as an ensemble of 400 decision trees, with a maximum tree depth of 14 and a maximum of 8 features considered at each split. The XGBoost model was configured with 250 boosting rounds, a learning rate of 0.06, and a maximum tree depth of 4. Similarly, the Gradient Boosting model used 60 estimators and a learning rate of 0.11. The Neural Network was a Multi-layer Perceptron with a single hidden layer of 100 neurons, set to stop training early if validation performance ceased to improve over 10 consecutive epochs to prevent overfitting. For the Support Vector Machine, Logistic Regression, and Naive Bayes models, we used the standard, well-established default parameters from the scikit-learn library, as they provided robust baseline performance without extensive modification.
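Under these reported settings, the seven models could be instantiated roughly as follows; probability=True on the SVM is our addition so that AUROC can be computed, and any parameter not listed above is left at its scikit-learn default.

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    # Tuned hyperparameters as reported in the text
    "RF": RandomForestClassifier(n_estimators=400, max_depth=14,
                                 max_features=8),
    "XGBoost": XGBClassifier(n_estimators=250, learning_rate=0.06,
                             max_depth=4),
    "GB": GradientBoostingClassifier(n_estimators=60, learning_rate=0.11),
    "NN": MLPClassifier(hidden_layer_sizes=(100,), early_stopping=True,
                        n_iter_no_change=10),
    # Library defaults; probability=True enables predict_proba for AUROC
    "SVM": SVC(probability=True),
    "LR": LogisticRegression(),
    "NB": GaussianNB(),
}
```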

2.7. Evaluation of Model Performance

The performance of each machine learning model was evaluated based on its accuracy, precision, recall (sensitivity), specificity, and F1 score, calculated as (2 × precision × recall)/(precision + recall). The predictive value of each model was evaluated using the area under the receiver operating characteristic curve (AUROC). Together, these metrics capture different aspects of diagnostic performance, especially under data imbalance, and reflect the trade-offs between false positives and false negatives that are critical in AKI diagnosis.
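A minimal helper that computes these six metrics with scikit-learn is sketched below; specificity is derived from the confusion matrix, since scikit-learn does not expose it directly.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_score):
    """Return the study's six metrics for one model's predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),   # sensitivity
        "specificity": tn / (tn + fp),            # recall of non-AKI class
        "f1": f1_score(y_true, y_pred),           # 2PR / (P + R)
        "auroc": roc_auc_score(y_true, y_score),  # from predicted scores
    }
```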

2.8. Statistical Analyses

Continuous variables with normal distribution are shown as mean ± standard deviation; continuous variables deviating from normal distribution are shown as median and interquartile range; categorical variables are shown as frequency and percentage. Continuous variables with normal distribution were compared using the two-tailed t-test for independent samples; continuous variables deviating from normal distribution were compared using the Wilcoxon rank-sum test; categorical variables were compared using the chi-squared test. p values < 0.05 were considered significant. The distribution of the data was examined using Q-Q plots. Statistical analysis was performed using SAS 9.4 (SAS Institute Inc., Cary, NC, USA).
To compare the performance between all machine learning models and baseline algorithms, we conducted pairwise statistical tests using the expert nephrologists’ diagnosis as the ground truth. For accuracy, recall (sensitivity), and specificity, we used McNemar’s test. For precision and F1-score, we employed a non-parametric bootstrap procedure with 2000 resamples. For the area under the receiver operating characteristic curve (AUROC), we used DeLong’s test. A p-value of <0.05 was considered statistically significant. The results of these pairwise comparisons are presented in the results tables using superscript letter notations, where models sharing a letter are not significantly different from one another.
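As an illustration of the bootstrap comparison, the sketch below tests the F1 difference between two models with 2000 paired resamples; this is one common way to operationalize such a test, and the exact procedure used here may differ in detail.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_pvalue(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Two-sided bootstrap test for the F1 difference between two models
    evaluated on the same patients (paired resampling of cases)."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        diffs[b] = (f1_score(y_true[idx], pred_a[idx])
                    - f1_score(y_true[idx], pred_b[idx]))
    # p-value: how often the bootstrap difference crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return min(p, 1.0)
```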

3. Results

3.1. Clinical Characteristics of the Datasets

Dataset A comprised 1423 patients with AKI and 1423 without AKI. Sex, point-of-care SCr, RBC, Hct, body temperature, weight, and height were similar between AKI and non-AKI groups. In contrast, AKI patients were significantly younger; had higher serum Na and K; higher AST and ALT; lower hemoglobin; higher RDW-CV; higher WBC count; higher neutrophil fraction; lower lymphocyte, monocyte, eosinophil, and basophil fractions; lower platelet counts; and higher PDW. Patients with AKI had significantly higher respiratory rates, lower SBP and DBP, lower SpO2 and higher pulse rates (Table 1).
Dataset B consisted of 1331 hospitalized patients with abnormal renal function (SCr > 1.3 mg/dL), categorized into 334 patients with AKI and 997 non-AKI patients by the nephrologists involved in this study. The AKI and non-AKI groups exhibited similar sex, age, serum K, RDW-CV, body temperature, and height. Patients with AKI had significantly lower SCr; higher serum Na, AST, and ALT; higher RBC, hemoglobin, and Hct; higher WBC count and neutrophil fraction; lower lymphocyte, monocyte, eosinophil, and basophil fractions; lower platelet count and higher PDW; higher respiratory rate; lower SBP and DBP; lower SpO2; and higher pulse rate (Table 2).
PCA and UMAP were used to reduce the dimensionality of all the features mentioned above to visualize their correlations with AKI. Both PCA and UMAP roughly placed patients with AKI toward the upper-right region of the projection and non-AKI patients toward the lower-left region (Figure 1). The substantial overlap between the orange and blue clusters in both plots visually demonstrates the inherent difficulty of diagnosing AKI using point-of-care data: there is no simple, clear boundary separating the two patient populations. This complexity underscores the limitations of simple diagnostic rules and provides the rationale for using machine learning models, which are designed to identify subtle, non-linear patterns within such complex data. The rough separation, in which AKI patients (orange) show some tendency to cluster, suggests that such patterns exist for a model to learn.
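The two projections can be reproduced with a sketch along these lines, assuming a preprocessed feature matrix X_prep and label array y (1 = AKI); the umap-learn package provides the UMAP implementation.

```python
import matplotlib.pyplot as plt
import umap  # from the umap-learn package
from sklearn.decomposition import PCA

emb_pca = PCA(n_components=2).fit_transform(X_prep)         # linear projection
emb_umap = umap.UMAP(n_components=2).fit_transform(X_prep)  # non-linear embedding

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, emb, title in zip(axes, (emb_pca, emb_umap), ("PCA", "UMAP")):
    # One dot per patient, colored by the final diagnosis
    ax.scatter(*emb[y == 0].T, s=5, c="tab:blue", label="Non-AKI")
    ax.scatter(*emb[y == 1].T, s=5, c="tab:orange", label="AKI")
    ax.set_title(title)
    ax.legend()
plt.show()
```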

3.2. Repeated Machine Learning Model (Method 1)

As previously described, for Method 1, 70% of Dataset A was randomly selected as the training dataset and the remaining 30% was used as the testing dataset. This method was applied to all seven machine learning models mentioned above, and the procedure was repeated 1000 times to obtain the average performance. The performance metrics of each machine learning model are expressed as the mean ± standard deviation of 1000 iterations. The accuracy of Method 1 ranged from 0.65 to 0.69; SVM, GB, XGB, and RF achieved the highest accuracy (0.69 ± 0.01). The F1 score for Method 1 ranged from 0.55 to 0.69; XGB and RF achieved the highest F1 score (0.69 ± 0.01). The models of Method 1 exhibited AUROC values ranging from 0.73 to 0.76, of which SVM, GB, XGB, and RF demonstrated the highest AUROC (0.76 ± 0.01). Overall, using Method 1, XGB and RF exhibited the best performance (Table 3).

3.3. Single Machine Learning Models on an Independent Test Set (Method 2)

We evaluated the seven trained machine learning models and two baseline methods on the independent test set (Dataset B), using the researcher nephrologists’ diagnosis as the ground truth. The evaluation was conducted under three different clinical scenarios, with detailed performance metrics and statistical comparisons presented in Table 4, Table 5 and Table 6.
Performance on the Unbalanced Test Set: In the first trial, using the entire, unbalanced Dataset B (n = 1331), the models achieved AUROC values from 0.67 to 0.74 (Table 4). The computerized algorithm was the top performer (AUROC 0.94). All ML models demonstrated statistically significantly superior performance compared to the clinician’s diagnosis across key metrics, for instance, F1 score (~0.52 vs. 0.35) and AUROC (~0.74 vs. 0.53). The letter notations indicate that the top-performing models, such as RF and SVM, were statistically indistinguishable from each other.
Performance on the Balanced Test Set: In the second trial, on a balanced subset (n = 334 in each group), there was a substantial improvement in precision and F1 scores (e.g., RF’s F1 score increased from 0.52 to 0.63) for most models (Table 5). This highlights the models’ learning capability when the effect of class imbalance is mitigated. The ML models remained significantly superior to the clinician’s diagnosis, again showing a clear advantage in both F1 score (e.g., ~0.63 vs. 0.50) and AUROC.
Performance on the Post-Exclusion Test Set: In the third trial, on a subset of the most definitive diagnoses (n = 398), the ML models’ AUROCs ranged from 0.64 to 0.71 (Table 6). Even in this “cleaner” data scenario, the ML models maintained a significant performance advantage over the clinician’s diagnosis in both F1 score (e.g., RF 0.59 vs. 0.46) and AUROC (e.g., RF 0.71 vs. 0.53).

3.4. Feature Importance of the Machine Learning Models

The three machine learning models that demonstrated the best performance in the third trial of Method 2 (RF, XGBoost, and GB) were selected for evaluation of feature importance. In the RF model, the three features with the highest importance were lower lymphocyte fraction, SBP, and SCr (Figure 2A). In the XGBoost model, the three features with the highest importance were lower SCr level, lower lymphocyte fraction, and higher WBC count (Figure 2B). In the GB model, the three features with the highest importance were low SCr, high AST, and low SBP (Figure 2C). Overall, lower SCr, higher WBC count, lower lymphocyte fraction, higher RDW-CV, lower platelet count, higher AST (GOT), younger age, lower SBP, higher pulse rate, and higher respiratory rate were features of AKI.
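For reference, beeswarm plots like those in Figure 2 can be generated with the shap library along the following lines; model, X_test_prep, and feature_names are placeholders for a fitted tree ensemble and the preprocessed test data.

```python
import shap

# TreeExplainer covers the tree ensembles used here (RF, XGBoost, GB).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_prep)

# For binary classifiers, some shap versions return one array per class;
# keep the positive (AKI) class before plotting.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Beeswarm plot: features ranked by mean |SHAP|; dots to the right push
# the prediction towards AKI, colored red for high feature values.
shap.summary_plot(shap_values, X_test_prep, feature_names=feature_names)
```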

4. Discussion

In summary, the repeated machine learning models employed in the present study demonstrated accuracies ranging from 0.65 to 0.69 and AUROC values ranging from 0.73 to 0.76 for the diagnosis of AKI in the absence of baseline SCr. Conversely, the single machine learning models exhibited accuracies ranging from 0.53 to 0.74 and AUROC values ranging from 0.70 to 0.74 for the diagnosis of AKI without available baseline SCr. These findings suggest that repeated machine learning models offer superior accuracy and predictive value for AKI diagnosis. Additionally, while the single machine learning models did not exhibit better accuracy and predictive value in the balanced testing dataset (Method 2, trial 2), they exhibited better performance in the testing dataset after exclusion of patients with uncertain AKI status (Method 2, trial 3). Notably, with available past SCr records, the computerized algorithm was superior on every index to both the repeated and single machine learning models. While RF, XGBoost, and GB consistently ranked among the top models, our statistical analysis revealed no significant performance difference between them across most metrics. This suggests that several advanced algorithms can achieve a similar performance ceiling. The results imply that future improvements may lie more in feature engineering and in addressing data-driven challenges such as class imbalance than in selecting a single ‘best’ algorithm.
As the diagnosis of AKI is based on an increase in SCr over a 7-day period [16], computerized algorithms can accurately diagnose AKI in patients with available baseline SCr or a recent record of SCr. Nevertheless, for patients without such reference SCr values, the diagnosis of AKI is challenging using computerized algorithms, and even for clinicians. The present study attempted to overcome this impediment using machine learning models to identify AKI events based on point-of-care features of patients presenting with abnormal SCr. In cases where a patient with no known medical history presents with abnormal SCr, our machine learning model can assist inexperienced physicians in distinguishing between AKI and CKD, thereby facilitating the initiation of appropriate diagnostic and therapeutic strategies. Remarkably, all included patients had abnormal SCr values; thus, the function of our models was to distinguish AKI events from preexisting chronic kidney disease. To date, the application of machine learning in the management of AKI has primarily focused on AKI prediction. Thus, AKI prediction models with short time windows can be compared to our AKI diagnosis models [18]. In an AKI prediction model for all-care settings conducted by Cronin et al. in 2015 [19], pre-admission laboratory tests of −5 days to +48 h from the admission date were obtained for over 1.6 million hospitalizations for model training. They found that the models (LR, LASSO regression, and RF) exhibited an AUROC of 0.746–0.758 for predicting in-hospital AKI events [19]. In another study, He et al. tested machine learning models to differentiate AKI in different prediction time windows. Their models exhibited AUROC values ranging from 0.720 to 0.764. Among the tested models, the best model performance was achieved in predicting AKI one day in advance [20]. A similar study by Cheng et al. tested different data collection time windows to train datasets of AKI prediction models. The results suggested that the RF algorithm showed the best performance for AKI prediction 1–3 days in advance, with AUROC of 0.765, 0.733, and 0.709, respectively [21]. Compared with these studies, our repeated machine learning models exhibited AUROC of 0.73–0.76, depending on the training model used, showing that repeated machine learning could exhibit similar performance with different model algorithms and is comparable to machine learning models with large training datasets.
Although we compared the present AKI diagnostic model with AKI prediction models, a difference existed between these two models. Koyner et al. tested models with and without a change in SCr from the baseline in an all-care setting. The results showed that excluding “change in SCr” from input features did not affect the model’s AKI prediction ability [22]. In contrast, in the present study, SCr was an important feature to be input into the machine learning model for AKI diagnosis, regardless of the algorithm used. The reason for this difference may be that AKI prediction relies more on the severity of comorbidities than on the existing abnormal SCr readings. In contrast, the SCr value at point-of-care is an important feature for the identification of AKI events.
Among the studies developing AKI prediction models, researchers have been seeking the best machine learning algorithm for predicting the risk of AKI events. The AKI prediction model developed by Cronin et al. in 2015, using a 1.6 million training dataset revealed that the performance of traditional LR and LASSO regression models was slightly superior to that of the RF model [19]. In a 2021 study by Kim et al., in which they intended to develop a continuous real-time prediction model for AKI events, a recurrent neural network algorithm was found to be most suitable for predicting AKI events 48 h in advance [23]. In the single machine learning models used in the present study, we found that the RF, XGBoost, and GB algorithms exhibited superior performance in AKI diagnosis. Nonetheless, in the case of repeated machine learning models, the differences between the different algorithms were not evident. This finding suggests that with repeated training, the performance of different machine-learning algorithms may approach a consistent level.
Yue et al. developed a machine learning model for AKI prediction in patients with sepsis, identifying key features such as urine output, mechanical ventilation, body mass index, estimated glomerular filtration rate, SCr, partial thromboplastin time, and blood urea nitrogen [24]. In addition to features directly associated with renal function, those indicative of general disease severity are crucial in this model of sepsis-related AKI. In the present model, which was designed for an all-care setting, features related to sepsis, including lymphocyte fraction, white blood cell (WBC) count, platelet count, pulse rate, SBP, and AST (GOT), also play important roles. This finding suggests that, in an all-care setting, sepsis is the most important cause of AKI in hospitalized patients.
As electronic diagnostic tools have been integrated into decision support and electronic alert systems for AKI, these studies showed a heterogeneous system design and revealed mixed results [25]. Previous research has shown that electronic AKI alert systems possess acceptable accuracy and applicability [26,27]. Furthermore, Hodgson et al. demonstrated that their electronic AKI alert system reduced the incidence of hospital-acquired AKI and in-hospital mortality [28]. Conversely, a study by Wilson et al. involving 6030 patients indicated that the electronic AKI alert system did not reduce the risk of the primary outcome, with variable effects across clinical centers [29]. The findings of the present study suggest that, while electronic diagnostic tools may enhance the accuracy of AKI diagnosis, timely differential diagnosis and management are imperative to improve outcomes.
The results of the present study showed that single model trials of the machine learning models were associated with wide variability in accuracy. The variation in accuracy (0.53–0.74) across single model trials reflects differences in Dataset B’s characteristics, specifically data imbalance and the inclusion of deduced diagnoses. When tested with balanced or refined subsets, model performance became more consistent (accuracy 0.63–0.72), with RF, XGBoost, and GB generally outperforming the others. This demonstrates that the observed inconsistency is largely driven by dataset quality and composition. In addition, we found that the machine learning models underperform compared with the traditional computerized algorithm in diagnosing AKI. A likely explanation is that the computerized algorithm achieved higher accuracy (up to 0.95) because it directly relied on detecting a defined increase over baseline serum creatinine (SCr), as per the AKI diagnostic criteria. In contrast, our machine learning models were designed for cases lacking a recent baseline SCr, a scenario in which traditional algorithms fail or rely on assumptions. A key finding of our study, robustly supported by formal statistical analysis, is that machine learning models significantly outperform routine clinicians’ diagnoses in identifying AKI when baseline SCr is unavailable. This was consistently observed across all three evaluation scenarios (Table 4, Table 5 and Table 6). This suggests that in data-limited, real-world settings where clinicians may rely on subjective judgment, the models provide a valuable and more accurate diagnostic support tool.
The present study unexpectedly demonstrates that machine learning models outperform clinicians in diagnosing acute kidney injury (AKI). One possible explanation is that AKI may not have been the primary clinical concern in many cases, with clinicians focusing more on dominant conditions such as sepsis or heart failure. As a result, timely recognition and management of AKI were sometimes overlooked. Additionally, when patients presented with renal impairment but lacked baseline renal function data, clinicians often relied on subjective judgment to diagnose AKI. In such scenarios—where objective diagnostic criteria are unavailable—machine learning models offer valuable support, enabling timely and accurate decision-making. Moving forward, we aim to incorporate clinician feedback into model development to explore the root causes of misdiagnoses and further enhance diagnostic performance.
A limitation of the present study is the relatively small sample size, particularly for the testing dataset. However, considering the all-care setting in the present study, our machine learning models may be applied to hospitalized patients admitted to both critical care units and general wards. Another limitation was the single-center design that limits the generalizability of the results. To compensate for this, an independent testing dataset was used for validation (Dataset B). Nevertheless, external validation is restricted by the institutional review board and is therefore not feasible. Furthermore, this study did not exclude patients based on specific comorbidities. Clinical features used in our models, such as inflammatory markers and vital signs, can be influenced by a wide range of conditions beyond AKI, such as sepsis, heart failure, or diabetes. This could introduce confounding factors and affect model stability. However, this approach was intentional, as our goal was to develop models that could function in a real-world clinical setting where patients often present with multiple, complex health issues. The models were thus trained to identify diagnostic patterns within this inherent clinical complexity. Nevertheless, future research should aim to quantify the impact of specific comorbidities on model performance. Integrating established comorbidity indices, such as the Charlson Comorbidity Index, as input features could potentially improve model robustness and accuracy.
Looking ahead, several avenues for future research could build upon our findings. Future studies should explore more advanced machine learning algorithms, such as deep learning models or sophisticated ensembles, to potentially capture more complex relationships in the data. To improve the generalizability of these models, conducting multicenter studies with diverse data sources—including time-series clinical data and novel inputs like genetic markers—is crucial. Most importantly, prospective validation in real-time clinical settings is essential to assess the models’ true clinical impact, their utility in decision support, and their seamless integration into existing hospital workflows.
In conclusion, machine learning models were able to diagnose AKI without baseline SCr records. Additionally, the machine learning models for AKI diagnosis demonstrated superior accuracy compared to clinicians. Also, repeated machine learning models exhibited more consistent and superior performance than single machine learning models. Notably, the computerized AKI diagnostic algorithms showed superior accuracy compared to the machine learning models when baseline SCr was available. Therefore, these two approaches can be combined to develop a more comprehensive electronic AKI diagnostic system.

Author Contributions

Conception and Design: C.-Y.C., T.-I.C., H.-L.H., and C.-T.L.; Analysis and Interpretation: C.-Y.C., S.-C.H., C.-H.C., P.-H.H., H.-L.H., and C.-T.L.; Data Collection: C.-Y.C., T.-I.C., Y.-L.C., N.-J.H., H.-L.H., and C.-T.L.; Writing the Manuscript: C.-Y.C. and C.-T.L.; Critical Revision: C.-H.C., S.-C.H., Y.-M.S., T.-H.C., F.-Y.L., C.-M.S., P.-H.H., H.-L.H., and C.-T.L.; Statistical Analysis: C.-Y.C., Y.-L.C., N.-J.H., H.-L.H., and C.-T.L.; Obtaining Funding: C.-T.L. and Y.-M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported, in part, by the research grants from the Wan Fang Hospital, Taipei Medical University (111-wf-f-1, 111-wf-f-2, 112-wf-swf-03). These funding agencies had no influence on the study design, data collection or analysis, the decision to publish, or preparation of the manuscript.

Institutional Review Board Statement

This retrospective study was conducted at Wan Fang Hospital, Taipei Medical University, Taipei, Taiwan. The study was approved by the Ethics Committee and Institutional Review Board of Taipei Medical University (approval no. N202111017, date 30 September 2021) and adhered to the tenets of the 1975 Declaration of Helsinki, as revised in 2013.

Informed Consent Statement

Informed consent for participation is not required as per the Ethics Committee and Institutional Review Board of Taipei Medical University.

Data Availability Statement

The datasets used and/or analyzed during the current study are not freely available due to ethical restrictions but may be obtained from the corresponding author, Chung-Te Liu, on reasonable request.

Acknowledgments

The content of this publication does not reflect the views or policies of the authors’ facilities.

Conflicts of Interest

All authors declare no interests or relationships within the last 3 years directly related to this manuscript, including advisory positions, consulting fees, equity and stock ownership, and non-financial support. We also declare that no author holds patents or copyrights relevant to the work in this manuscript.

Abbreviations

AKI: acute kidney injury
ALT: alanine aminotransferase
AST: aspartate aminotransferase
AUROC: area under the receiver operating characteristic curve
CKD: chronic kidney disease
DBP: diastolic blood pressure
GB: Gradient Boosting
Hct: hematocrit
LR: Logistic Regression
NB: Naive Bayes classifier
NN: Neural Network
PCA: Principal Component Analysis
PDW: platelet distribution width
RBC: red blood cell count
RDW-CV: red cell distribution width
RF: Random Forest
SBP: systolic blood pressure
SCr: serum creatinine
SpO2: oxygen saturation
SVM: Support Vector Machine
UMAP: Uniform Manifold Approximation and Projection
WBC: white blood cell count
XGBoost: Extreme Gradient Boosting

References

  1. Levey, A.S.; James, M.T. Acute Kidney Injury. Ann. Intern. Med. 2017, 167, ITC66–ITC80. [Google Scholar] [CrossRef]
  2. Rangaswami, J.; Bhalla, V.; Blair, J.E.; Chang, T.I.; Costa, S.; Lentine, K.L.; Lerma, E.V.; Mezue, K.; Molitch, M.; Mullens, W.; et al. American Heart Association Council on the Kidney in Cardiovascular Disease and Council on Clinical Cardiology. Cardiorenal Syndrome: Classification, Pathophysiology, Diagnosis, and Treatment Strategies: A Scientific Statement From the American Heart Association. Circulation 2019, 139, e840–e878. [Google Scholar]
  3. Peerapornratana, S.; Manrique-Caballero, C.L.; Gómez, H.; Kellum, J.A. Acute kidney injury from sepsis: Current concepts, epidemiology, pathophysiology, prevention and treatment. Kidney Int. 2019, 96, 1083–1099. [Google Scholar] [CrossRef] [PubMed]
  4. Bienholz, A.; Wilde, B.; Kribben, A. From the nephrologist’s point of view: Diversity of causes and clinical features of acute kidney injury. Clin. Kidney J. 2015, 8, 405–414. [Google Scholar] [CrossRef]
  5. Abebe, A.; Kumela, K.; Belay, M.; Kebede, B.; Wobie, Y. Mortality and predictors of acute kidney injury in adults: A hospital-based prospective observational study. Sci. Rep. 2021, 11, 15672. [Google Scholar] [CrossRef]
  6. Connell, A.; Laing, C. Acute kidney injury. Clin. Med. 2015, 15, 581–584. [Google Scholar] [CrossRef]
  7. Levey, A.S. Defining AKD: The Spectrum of AKI, AKD, and CKD. Nephron 2022, 146, 302–305. [Google Scholar] [CrossRef]
  8. Holmes, J.; Roberts, G.; Geen, J.; Dodd, A.; Selby, N.M.; Lewington, A.; Scholey, G.; Williams, J.D.; Phillips, A.O. Utility of electronic AKI alerts in intensive care: A national multicentre cohort study. J. Crit. Care 2018, 44, 185–190. [Google Scholar] [CrossRef]
  9. Al-Jaghbeer, M.; Dealmeida, D.; Bilderback, A.; Ambrosino, R.; Kellum, J.A. Clinical Decision Support for In-Hospital AKI. J. Am. Soc. Nephrol. 2018, 29, 654–660. [Google Scholar] [CrossRef] [PubMed]
  10. Park, S.; Baek, S.H.; Ahn, S.; Lee, K.-H.; Hwang, H.; Ryu, J.; Ahn, S.Y.; Chin, H.J.; Na, K.Y.; Chae, D.-W.; et al. Impact of Electronic Acute Kidney Injury (AKI) Alerts with Automated Nephrologist Consultation on Detection and Severity of AKI: A Quality Improvement Study. Am. J. Kidney Dis. 2018, 71, 9–19. [Google Scholar] [CrossRef] [PubMed]
  11. Wilson, F.P.; Shashaty, M.; Testani, J.; Aqeel, I.; Borovskiy, Y.; Ellenberg, S.S.; Feldman, H.I.; Fernandez, H.; Gitelman, Y.; Lin, J.; et al. Automated, electronic alerts for acute kidney injury: A single-blind, parallel-group, randomised controlled trial. Lancet 2015, 385, 1966–1974. [Google Scholar] [CrossRef]
  12. Colpaert, K.; Hoste, E.A.; Steurbaut, K.; Benoit, D.; Van Hoecke, S.; De Turck, F.; Decruyenaere, J. Impact of real-time electronic alerting of acute kidney injury on therapeutic intervention and progression of RIFLE class. Crit. Care Med. 2012, 40, 1164–1170. [Google Scholar] [CrossRef]
  13. John, A.K.; Norbert, L.; Peter, A.; Rashad, S.B.; Emmanuel, A.B.; Stuart, L.G.; Charles, A.H.; Michael, J.; Andreas, K.; Andrew, S.L. Kidney Disease: Improving Global Outcomes (KDIGO) Acute Kidney Injury Work Group. KDIGO Clinical Practice Guideline for Acute Kidney Injury. Kidney Int. Suppl. 2012, 2, 124–138. [Google Scholar]
  14. Palevsky, P.M.; Liu, K.D.; Brophy, P.D.; Chawla, L.S.; Parikh, C.R.; Thakar, C.V.; Tolwani, A.J.; Waikar, S.S.; Weisbord, S.D. KDOQI US commentary on the 2012 KDIGO clinical practice guideline for acute kidney injury. Am. J. Kidney Dis. 2013, 61, 649–672. [Google Scholar] [CrossRef] [PubMed]
  15. Bagshaw, S.M.; Goldstein, S.L.; Ronco, C.; Kellum, J.A.; for the ADQI 15 Consensus Group. Acute kidney injury in the era of big data: The 15th Consensus Conference of the Acute Dialysis Quality Initiative (ADQI). Can. J. Kidney Health Dis. 2016, 3, 5. [Google Scholar] [CrossRef]
  16. Makris, K.; Spanou, L. Acute Kidney Injury: Definition, Pathophysiology and Clinical Phenotypes. Clin. Biochem. Rev. 2016, 37, 85–98. [Google Scholar]
  17. Khwaja, A. KDIGO clinical practice guidelines for acute kidney injury. Nephron Clin. Pract. 2012, 120, c179–c184. [Google Scholar] [CrossRef]
  18. Yu, X.; Ji, Y.W.; Huang, M.J.; Feng, Z. Machine learning for acute kidney injury: Changing the traditional disease prediction mode. Front. Med. 2023, 10, 1050255. [Google Scholar] [CrossRef]
  19. Cronin, R.M.; VanHouten, J.P.; Siew, E.D.; Eden, S.K.; Fihn, S.D.; Nielson, C.D.; Peterson, J.F.; Baker, C.R.; Ikizler, T.A.; Speroff, T.; et al. National Veterans Health Administration inpatient risk stratification models for hospital-acquired acute kidney injury. J. Am. Med. Inform. Assoc. 2015, 22, 1054–1071. [Google Scholar] [CrossRef]
  20. He, J.Q.; Hu, Y.; Zhang, X.Z.; Wu, L.J.; Waitman, L.R.; Liu, M. Multi-perspective predictive modeling for acute kidney injury in general hospital populations using electronic medical records. JAMIA Open 2019, 2, 115–122. [Google Scholar] [CrossRef]
  21. Cheng, P.; Waitman, L.R.; Hu, Y.; Liu, M. Predicting Inpatient Acute Kidney Injury over Different Time Horizons: How Early and Accurate? AMIA Annu. Symp. Proc. 2018, 2017, 565–574. [Google Scholar] [PubMed]
  22. Koyner, J.L.; Carey, K.A.; Edelson, D.P.; Churpek, M.M. The Development of a Machine Learning Inpatient Acute Kidney Injury Prediction Model. Crit. Care Med. 2018, 46, 1070–1077. [Google Scholar] [CrossRef] [PubMed]
  23. Kim, K.; Yang, H.; Yi, J.; Son, H.-E.; Ryu, J.-Y.; Kim, Y.C.; Jeong, J.C.; Chin, H.J.; Na, K.Y.; Chae, D.-W.; et al. Real-time clinical decision support based on recurrent neural networks for in-hospital acute kidney injury: External validation and model interpretation. J. Med. Internet Res. 2021, 23, e24120. [Google Scholar] [CrossRef]
  24. Yue, S.; Li, S.; Huang, X.; Liu, J.; Hou, X.; Zhao, Y.; Niu, D.; Wang, Y.; Tan, W.; Wu, J. Machine learning for the prediction of acute kidney injury in patients with sepsis. J. Transl. Med. 2022, 20, 215. [Google Scholar] [CrossRef]
  25. Bajaj, T.; Koyner, J.L. Artificial Intelligence in Acute Kidney Injury Prediction. Adv. Chronic Kidney Dis. 2022, 29, 450–460. [Google Scholar] [CrossRef] [PubMed]
  26. Colpaert, K.; Hoste, E.; Van Hoecke, S.; Vandijck, D.; Danneels, C.; Steurbaut, K.; De Turck, F.; Decruyenaere, J. Implementation of a real-time electronic alert based on the RIFLE criteria for acute kidney injury in ICU patients. Acta Clin. Belg. 2007, 62 (Suppl. S2), 322–325. [Google Scholar] [CrossRef] [PubMed]
  27. Selby, N.M.; Crowley, L.; Fluck, R.J.; McIntyre, C.W.; Monaghan, J.; Lawson, N.; Kolhe, N.V. Use of electronic results reporting to diagnose and monitor AKI in hospitalized patients. Clin. J. Am. Soc. Nephrol. 2012, 7, 533–540. [Google Scholar] [CrossRef]
  28. Hodgson, L.E.; Roderick, P.J.; Venn, R.M.; Yao, G.L.; Dimitrov, B.D.; Forni, L.G. The ICE-AKI study: Impact analysis of a Clinical prediction rule and Electronic AKI alert in general medical patients. PLoS ONE 2018, 13, e0200584. [Google Scholar] [CrossRef]
  29. Wilson, F.P.; Martin, M.; Yamamoto, Y.; Partridge, C.; Moreira, E.; Arora, T.; Biswas, A.; Feldman, H.; Garg, A.X.; Greenberg, J.H.; et al. Electronic health record alerts for acute kidney injury: Multicenter, randomized clinical trial. BMJ 2021, 372, m4786. [Google Scholar] [CrossRef]
Figure 1. Feature visualization and dimensionality reduction. This figure aims to visualize the overall clinical data to see if patients with Acute Kidney Injury (AKI) and those without (Non-AKI) form distinct, separable groups. Since it is impossible to plot all 26 clinical features simultaneously, we use two dimensionality reduction techniques—(A) Principal Component Analysis (PCA) and (B) Uniform Manifold Approximation and Projection (UMAP)—to compress the complex, high-dimensional data into a simple two-dimensional map. Each dot on the plots represents a single patient, colored according to their final diagnosis: orange for AKI and blue for Non-AKI. This visualization highlights the substantial overlap between the AKI and non-AKI patient clusters, visually confirming the difficulty of simple linear separation and reinforcing the need for sophisticated machine learning models to diagnose AKI from point-of-care data. 2D: two-dimensional; PCA, Principal Component Analysis; UMAP, Uniform Manifold Approximation and Projection.
Figure 2. Feature importance of the three best performing machine learning models. This figure displays the most important clinical features for the three top-performing machine learning models, as determined by SHAP values. (A) Random Forest Model. (B) XGBoost Model. (C) Gradient Boosting Model. Feature Ranking (Y-axis): Clinical features are ranked from top to bottom based on their overall impact on the model’s predictions. Higher-ranked features are more influential. Impact on Diagnosis (X-axis): Each dot represents a single patient. A dot’s position on the horizontal axis shows how that feature value influenced the model’s conclusion for that patient. Dots on the right side (positive SHAP values) pushed the prediction towards an AKI diagnosis, while dots on the left side (negative SHAP values) pushed the prediction towards a non-AKI diagnosis. Feature Value (Color): The color of each dot indicates the feature’s value for that patient. Red dots represent higher clinical values (e.g., high serum creatinine), and blue dots represent lower values. SHAP value, SHapley Additive exPlanations; XGBoost, Extreme Gradient Boosting.
Table 1. Baseline characteristics of Dataset A.
Characteristic | AKI | Non-AKI | p Value
Number, n | 1423 | 1423 | n/a
Male, n (%) | 822 (57) | 854 (60) | 0.13
Age, years | 73.2 ± 16.1 | 75.1 ± 14.1 | <0.05
Creatinine, mg/dL | 3.6 ± 2.3 | 3.7 ± 3.2 | 0.30
Na, mmol/L | 140.4 ± 9.1 | 138.5 ± 7.2 | <0.05
K, mmol/L | 4.4 ± 1.0 | 4.1 ± 0.7 | <0.05
AST, U/L | 33.0 (58.0) | 23.0 (19.5) | <0.05
ALT, U/L | 22.0 (33.5) | 15 (18.0) | <0.05
RBC, 10⁶/µL | 3.3 ± 0.8 | 3.4 ± 0.8 | 0.12
Hb, g/dL | 10.0 ± 2.2 | 10.3 ± 2.1 | <0.05
Hct, % | 30.3 ± 6.9 | 30.8 ± 6.6 | 0.06
RDW-CV, % | 16.8 ± 3.0 | 15.7 ± 2.4 | <0.05
WBC, 10³/µL | 13.1 ± 19.0 | 10.2 ± 13.5 | <0.05
Neutrophil, % | 80.0 ± 15.4 | 74.2 ± 14.3 | <0.05
Lymphocyte, % | 10.4 ± 11.3 | 14.5 ± 10.3 | <0.05
Monocyte, % | 6.2 ± 4.05 | 7.7 ± 4.1 | <0.05
Eosinophil, % | 1.1 ± 2.1 | 2.2 ± 3.2 | <0.05
Basophil, % | 0.3 ± 0.5 | 0.5 ± 0.5 | <0.05
Platelet, 10³/µL | 135 (136) | 181 (120) | <0.05
PDW, % | 17.4 ± 0.8 | 17.2 ± 0.7 | <0.05
Respiratory rate, /min | 19.5 ± 5.3 | 18.3 ± 3.5 | <0.05
SBP, mmHg | 117.2 ± 23.9 | 127.9 ± 22.4 | <0.05
DBP, mmHg | 64.7 ± 16.2 | 68.5 ± 14.1 | <0.05
SpO2, % | 96.4 ± 4.4 | 96.9 ± 3.1 | <0.05
Temperature, °C | 36.7 ± 0.6 | 36.7 ± 0.4 | 0.25
Pulse rate, /min | 91.8 ± 21.4 | 83.4 ± 17.5 | <0.05
Weight, kg | 61.2 ± 15.2 | 62.2 ± 14.8 | 0.09
Height, cm | 158.6 ± 15.3 | 159.1 ± 10.3 | 0.32
Dataset A comprised randomly selected AKI and non-AKI hospitalized patients (n = 1423 in each group). The presence or absence of AKI was determined using a validated computerized algorithm. AKI, acute kidney injury; AST, aspartate aminotransferase; ALT, alanine aminotransferase; RBC, red blood cell; Hb, hemoglobin; Hct, hematocrit; RDW-CV, red cell distribution width (coefficient of variation); WBC, white blood cell; PDW, platelet distribution width; SBP, systolic blood pressure; DBP, diastolic blood pressure; SpO2, oxygen saturation. Categorical variables are expressed as frequency (%) and were compared using the chi-square test; normally distributed continuous variables are expressed as mean ± standard deviation and were compared using Student’s t test for independent groups; continuous variables that deviated from a normal distribution are expressed as median (interquartile range) and were compared using the Wilcoxon rank-sum test.
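The univariate comparisons in Tables 1 and 2 can be reproduced in outline with scipy. The sketch below assumes per-group NumPy arrays and uses a Shapiro–Wilk check to choose between the t test and the rank-sum test; that normality check, like the simulated data, is an illustrative assumption rather than the authors' documented procedure.

```python
# Minimal sketch, assuming per-group NumPy arrays; the simulated ages
# below mirror Table 1's means and SDs for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
aki = rng.normal(73.2, 16.1, 1423)      # placeholder: age, AKI group
non_aki = rng.normal(75.1, 14.1, 1423)  # placeholder: age, non-AKI group

# Continuous variable: t test if both groups look normal, else rank-sum
if stats.shapiro(aki).pvalue > 0.05 and stats.shapiro(non_aki).pvalue > 0.05:
    stat, p = stats.ttest_ind(aki, non_aki)
else:
    stat, p = stats.ranksums(aki, non_aki)

# Categorical variable (male sex): chi-square on a 2x2 contingency table
table = np.array([[822, 1423 - 822],    # AKI: male, female
                  [854, 1423 - 854]])   # non-AKI: male, female
chi2, p_cat, dof, expected = stats.chi2_contingency(table)
```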
Table 2. Baseline characteristics of Dataset B.
| Characteristic | AKI | Non-AKI | p Value |
|---|---|---|---|
| Number, n | 334 | 997 | n/a |
| Male, n (%) | 207 (62) | 648 (65) | 0.26 |
| Age, years | 73.3 ± 16.5 | 72.6 ± 14.3 | 0.45 |
| Creatinine, mg/dL | 3.0 ± 1.9 | 3.4 ± 2.9 | <0.05 |
| Na, mmol/L | 139.2 ± 8.2 | 137.8 ± 6.4 | <0.05 |
| K, mmol/L | 4.2 ± 0.8 | 4.2 ± 0.6 | 0.56 |
| AST, U/L | 26.5 (33.8) | 19 (12) | <0.05 |
| ALT, U/L | 21 (30.6) | 15 (13) | <0.05 |
| RBC, 10⁶/µL | 3.7 ± 0.8 | 3.6 ± 0.8 | <0.05 |
| Hb, g/dL | 11.2 ± 2.5 | 10.8 ± 2.3 | <0.05 |
| Hct, % | 33.5 ± 7.7 | 32.4 ± 7.1 | <0.05 |
| RDW-CV, % | 15.6 ± 2.7 | 15.5 ± 2.4 | 0.81 |
| WBC, 10³/µL | 12.0 ± 7.5 | 8.17 ± 4.4 | <0.05 |
| Neutrophil, % | 79.0 ± 13.5 | 70.6 ± 13.5 | <0.05 |
| Lymphocyte, % | 11.5 ± 10.0 | 17.6 ± 10.3 | <0.05 |
| Monocyte, % | 6.56 ± 4.3 | 8.44 ± 4.0 | <0.05 |
| Eosinophil, % | 1.3 ± 2.5 | 2.4 ± 2.7 | <0.05 |
| Basophil, % | 0.3 ± 0.4 | 0.6 ± 0.5 | <0.05 |
| Platelet, 10³/µL | 171 (119) | 194 (108) | <0.05 |
| PDW, % | 17.2 ± 0.8 | 17.1 ± 0.6 | <0.05 |
| Respiratory rate, /min | 18.8 ± 6.5 | 17.7 ± 2.6 | <0.05 |
| SBP, mmHg | 121.8 ± 24.4 | 130.8 ± 21.4 | <0.05 |
| DBP, mmHg | 68.6 ± 16.2 | 71.8 ± 14.3 | <0.05 |
| SpO2, % | 96.4 ± 4.5 | 97.2 ± 2.9 | <0.05 |
| Temperature, °C | 36.7 ± 0.5 | 36.6 ± 0.3 | 0.11 |
| Pulse rate, /min | 86.4 ± 18.9 | 79.3 ± 14.7 | <0.05 |
| Weight, kg | 64.9 ± 17.7 | 64.2 ± 14.7 | 0.48 |
| Height, cm | 161.4 ± 8.9 | 161.6 ± 9.0 | 0.67 |
Dataset B comprised 334 patients with AKI and 997 patients without AKI. AKI was diagnosed by nephrologists involved in the present study. AKI, acute kidney injury; AST, aspartate aminotransferase; ALT, alanine aminotransferase; RBC, red blood cell; Hb, hemoglobin; Hct, hematocrit; RDW-CV, red cell distribution width (coefficient of variation); WBC, white blood cell; PDW, platelet distribution width; SBP, systolic blood pressure; DBP, diastolic blood pressure; SpO2, oxygen saturation. Categorical variables are expressed as frequency (%) and were compared using the chi-square test; normally distributed continuous variables are expressed as mean ± standard deviation and were compared using Student’s t test for independent groups; continuous variables that deviated from a normal distribution are expressed as median (interquartile range) and were compared using the Wilcoxon rank-sum test.
Table 3. Repeated machine learning models based on Dataset A (n = 1423 in each group).
| Model | Accuracy | Precision | Recall | Specificity | F1 Score | AUROC |
|---|---|---|---|---|---|---|
| SVM | 0.69 ± 0.01 | 0.70 ± 0.01 | 0.67 ± 0.02 | 0.71 ± 0.02 | 0.68 ± 0.01 | 0.76 ± 0.01 |
| LR | 0.67 ± 0.01 | 0.68 ± 0.01 | 0.64 ± 0.02 | 0.70 ± 0.02 | 0.66 ± 0.01 | 0.73 ± 0.01 |
| GB | 0.69 ± 0.01 | 0.70 ± 0.01 | 0.67 ± 0.02 | 0.70 ± 0.02 | 0.68 ± 0.01 | 0.76 ± 0.01 |
| XGB | 0.69 ± 0.01 | 0.70 ± 0.01 | 0.68 ± 0.02 | 0.70 ± 0.02 | 0.69 ± 0.01 | 0.76 ± 0.01 |
| RF | 0.69 ± 0.01 | 0.69 ± 0.01 | 0.68 ± 0.02 | 0.70 ± 0.02 | 0.69 ± 0.01 | 0.76 ± 0.01 |
| NB | 0.65 ± 0.01 | 0.76 ± 0.02 | 0.44 ± 0.05 | 0.86 ± 0.02 | 0.55 ± 0.04 | 0.73 ± 0.01 |
| NN | 0.67 ± 0.01 | 0.68 ± 0.02 | 0.65 ± 0.02 | 0.70 ± 0.03 | 0.67 ± 0.01 | 0.74 ± 0.01 |
Values are the mean ± standard deviation over 1000 machine learning repetitions. In each iteration, 70% of the data were randomly selected as the training dataset and the remaining 30% served as the testing dataset. The presence or absence of AKI was determined using a validated computerized algorithm. SVM, Support Vector Machine; LR, Logistic Regression; GB, Gradient Boosting; XGB, Extreme Gradient Boosting; RF, Random Forest; NB, Naive Bayes classifier; NN, Neural Network.
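The repeated-split procedure described above can be sketched as follows; the synthetic data and the single Random Forest stand in for Dataset A and the seven evaluated classifiers, and stratified splitting is an assumption consistent with the balanced design.

```python
# Minimal sketch of the 1000-round 70/30 evaluation; make_classification
# stands in for Dataset A, and one Random Forest stands in for the seven
# classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2846, n_features=26, random_state=0)

aurocs = []
for seed in range(1000):  # one model per random split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    aurocs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"AUROC = {np.mean(aurocs):.2f} ± {np.std(aurocs):.2f}")
```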
Table 4. Single machine learning models trained on Dataset A (n = 1423 in each group) and tested on the unbalanced Dataset B (n = 1331).
| Model | Accuracy | Precision | Recall | Specificity | F1 Score | AUROC |
|---|---|---|---|---|---|---|
| SVM | 0.74 a | 0.48 a | 0.52 abc | 0.81 ab | 0.50 a | 0.74 a |
| RF | 0.75 ab | 0.50 a | 0.54 ab | 0.82 ab | 0.52 a | 0.74 a |
| GB | 0.76 ab | 0.51 a | 0.54 ab | 0.83 b | 0.52 a | 0.73 ac |
| NB | 0.77 b | 0.58 b | 0.27 d | 0.93 c | 0.37 b | 0.73 abc |
| XGB | 0.75 ab | 0.49 a | 0.55 b | 0.81 a | 0.52 a | 0.72 b |
| LR | 0.74 a | 0.48 a | 0.49 ac | 0.82 ab | 0.49 a | 0.71 bc |
| NN | 0.75 ab | 0.51 ab | 0.23 d | 0.92 c | 0.32 b | 0.67 e |
| *Traditional methods* | | | | | | |
| Computerized algorithm | 0.97 c | 0.98 c | 0.88 e | 0.99 e | 0.93 c | n/a |
| Clinician’s diagnosis | 0.57 d | 0.28 d | 0.46 c | 0.61 d | 0.35 b | n/a |
Performance of models on the full, unbalanced test set (Dataset B, n = 1331), evaluated against the researcher nephrologists’ diagnosis as the ground truth. Within each column, values sharing a common superscript letter (e.g., a, b, ab) are not statistically significantly different (p ≥ 0.05). Pairwise comparisons were performed using McNemar’s test (accuracy, recall, specificity), DeLong’s test (AUROC), and a bootstrap procedure (precision, F1 score). Notably, even in this unbalanced, real-world scenario, all developed machine learning models delivered significantly better diagnostic performance (based on F1 score and AUROC) than the routine clinician’s diagnosis. SVM, Support Vector Machine; LR, Logistic Regression; GB, Gradient Boosting; XGB, Extreme Gradient Boosting; RF, Random Forest; NB, Naive Bayes classifier; NN, Neural Network.
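Two of the pairwise procedures named in this footnote can be sketched as follows: McNemar's test applied to the discordant predictions of two models, and a patient-level bootstrap approximating a p-value for the F1 difference. The predictions are placeholders, and DeLong's test for AUROC, which needs a dedicated implementation, is omitted.

```python
# Minimal sketch of two pairwise procedures; predictions are random
# placeholders. DeLong's test for AUROC is omitted (no standard
# statsmodels/scipy implementation).
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1331)
pred_a = rng.integers(0, 2, 1331)   # placeholder predictions, model A
pred_b = rng.integers(0, 2, 1331)   # placeholder predictions, model B

# McNemar's test: build the 2x2 table of agreement between the two models
ok_a, ok_b = pred_a == y_true, pred_b == y_true
table = [[np.sum(ok_a & ok_b), np.sum(ok_a & ~ok_b)],
         [np.sum(~ok_a & ok_b), np.sum(~ok_a & ~ok_b)]]
print("McNemar p =", mcnemar(table, exact=False, correction=True).pvalue)

# Bootstrap the F1 difference by resampling patients with replacement
diffs = np.empty(2000)
for i in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    diffs[i] = (f1_score(y_true[idx], pred_a[idx]) -
                f1_score(y_true[idx], pred_b[idx]))
p_boot = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
print("bootstrap F1 p ~", p_boot)
```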
Table 5. Single machine learning models trained on Dataset A (n = 1423 in each group) and tested on the balanced Dataset B (n = 334 in each group).
| Model | Accuracy | Precision | Recall | Specificity | F1 Score | AUROC |
|---|---|---|---|---|---|---|
| GB | 0.69 a | 0.77 a | 0.54 ab | 0.84 a | 0.63 a | 0.75 a |
| RF | 0.68 a | 0.76 a | 0.54 ab | 0.83 a | 0.63 a | 0.75 a |
| SVM | 0.67 a | 0.74 a | 0.52 abc | 0.81 a | 0.61 a | 0.75 a |
| XGB | 0.68 a | 0.75 a | 0.55 b | 0.81 a | 0.63 a | 0.73 a |
| LR | 0.66 a | 0.75 a | 0.49 ac | 0.84 a | 0.59 a | 0.73 a |
| NB | 0.61 b | 0.85 b | 0.27 d | 0.95 b | 0.41 b | 0.74 a |
| NN | 0.59 bc | 0.79 ab | 0.23 d | 0.94 b | 0.36 b | 0.68 c |
| *Traditional methods* | | | | | | |
| Computerized algorithm | 0.94 d | 1.00 c | 0.88 e | 1.00 d | 0.93 d | n/a |
| Clinician’s diagnosis | 0.54 c | 0.54 d | 0.46 c | 0.62 c | 0.50 c | n/a |
Performance of models on a balanced subset of the test set (Dataset B, n = 668; 334 patients in each group), evaluated against the researcher nephrologists’ diagnosis as the ground truth. Within each column, values sharing a common superscript letter (e.g., a, b, ab) are not statistically significantly different (p ≥ 0.05). Pairwise comparisons were performed using McNemar’s test (accuracy, recall, specificity), DeLong’s test (AUROC), and a bootstrap procedure (precision, F1 score). SVM, Support Vector Machine; LR, Logistic Regression; GB, Gradient Boosting; XGB, Extreme Gradient Boosting; RF, Random Forest; NB, Naive Bayes classifier; NN, Neural Network.
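The balanced subset used in this table can be constructed by undersampling the non-AKI majority of Dataset B to match the 334 AKI patients, as in the minimal sketch below; the DataFrame layout and column names are assumptions, and the synthetic frame only reproduces the group sizes.

```python
# Minimal sketch, assuming Dataset B lives in a DataFrame with an "aki"
# label column; the synthetic frame below only reproduces the group sizes.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
dataset_b = pd.DataFrame({
    "creatinine": rng.normal(3.3, 2.5, 1331),
    "aki": np.r_[np.ones(334, dtype=int), np.zeros(997, dtype=int)],
})

aki_cases = dataset_b[dataset_b["aki"] == 1]              # all 334 AKI patients
controls = dataset_b[dataset_b["aki"] == 0].sample(
    n=len(aki_cases), random_state=42)                    # 334 random non-AKI
balanced_b = pd.concat([aki_cases, controls]).sample(frac=1, random_state=42)
print(balanced_b["aki"].value_counts())                   # 334 / 334
```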
Table 6. Single machine learning models trained on Dataset A (n = 1423 in each group) and tested on Dataset B after excluding patients without baseline creatinine values in the preceding 7 days (n = 398).
| Model | Accuracy | Precision | Recall | Specificity | F1 Score | AUROC |
|---|---|---|---|---|---|---|
| SVM | 0.67 abc | 0.65 ab | 0.52 abc | 0.78 a | 0.58 ab | 0.71 b |
| RF | 0.67 ab | 0.65 ab | 0.54 ab | 0.78 a | 0.59 a | 0.71 ab |
| GB | 0.68 b | 0.67 bc | 0.54 ab | 0.80 a | 0.60 a | 0.71 ab |
| NB | 0.64 c | 0.73 c | 0.27 d | 0.92 b | 0.39 cd | 0.71 ab |
| XGB | 0.68 ab | 0.65 ab | 0.55 b | 0.77 a | 0.60 a | 0.70 ab |
| LR | 0.65 ac | 0.62 a | 0.49 ac | 0.77 a | 0.55 b | 0.69 a |
| NN | 0.61 e | 0.64 ab | 0.23 d | 0.90 b | 0.34 c | 0.64 d |
| *Traditional methods* | | | | | | |
| Computerized algorithm | 0.94 d | 0.98 d | 0.88 e | 0.99 d | 0.93 e | n/a |
| Clinician’s diagnosis | 0.53 f | 0.46 e | 0.46 c | 0.59 c | 0.46 d | n/a |
Performance of models on a subset of the test set (Dataset B, n = 398) after excluding patients without baseline creatinine values within the preceding 7 days, evaluated against the researcher nephrologists’ diagnosis as the ground truth. Within each column, values sharing a common superscript letter (e.g., a, b, ab) are not statistically significantly different (p ≥ 0.05). Pairwise comparisons were performed using McNemar’s test (accuracy, recall, specificity), DeLong’s test (AUROC), and a bootstrap procedure (precision, F1 score). SVM, Support Vector Machine; LR, Logistic Regression; GB, Gradient Boosting; XGB, Extreme Gradient Boosting; RF, Random Forest; NB, Naive Bayes classifier; NN, Neural Network.
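As a point of reference, the 1.5-fold serum creatinine rule underlying the computerized algorithm can be sketched as follows; defining baseline as the lowest SCr in the preceding 7-day window is an illustrative assumption, and the source system's exact baseline logic may differ.

```python
# Minimal sketch of the 1.5-fold SCr rule; treating the lowest SCr in the
# preceding 7 days as baseline is an illustrative assumption, and the
# source system's exact baseline logic may differ.
def aki_by_scr_rule(index_scr, prior_scrs, window_days=7, fold=1.5):
    """Flag AKI when index SCr >= `fold` x the lowest SCr recorded within
    the preceding `window_days`; return None when no baseline exists,
    mirroring the exclusion applied in Table 6."""
    baselines = [scr for days_ago, scr in prior_scrs if 0 < days_ago <= window_days]
    if not baselines:
        return None  # undeterminable without a baseline value
    return index_scr >= fold * min(baselines)

# Example: index SCr 2.1 mg/dL, baseline 1.2 mg/dL three days earlier -> True
print(aki_by_scr_rule(2.1, [(3, 1.2), (10, 1.0)]))
```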