Systematic Review

Personalized Prediction in Nephrology: A Comprehensive Review of Artificial Intelligence Models Using Biomarker Data

1 NYC Health + Hospitals/Queens, 82-68 164th Street, Queens, New York, NY 11432, USA
2 Department of Biomedical Data Science, School of Applied Computational Sciences, Meharry Medical College, Nashville, TN 37208, USA
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(4), 67; https://doi.org/10.3390/biomedinformatics5040067
Submission received: 1 October 2025 / Revised: 14 November 2025 / Accepted: 18 November 2025 / Published: 27 November 2025

Abstract

Background/Objectives: This review summarizes and critically analyzes Machine Learning (ML) and Artificial Intelligence (AI)-based predictive modeling techniques for early detection and personalized treatment of kidney diseases, specifically diabetic kidney disease (DKD), chronic kidney disease (CKD), and end-stage renal disease (ESRD). The manuscript focuses on integrating electronic medical record (EMR) data with multi-omics biomarkers to enhance clinical decision-making. Methods: A systematic database search retrieved 43 peer-reviewed articles from PubMed, Google Scholar, and ScienceDirect. These works were critically analyzed for methodological rigor, model interpretability, and translational potential. Review: This paper examines a series of advanced AI and ML models, including Random Forests (RF), Extreme Gradient Boosting (XGBoost), deep neural networks, and artificial neural networks. It also explicitly explores validated approaches for fibrosis staging, dialysis prediction, and mortality risk assessment. Conclusions: Applying AI models to patient-specific biomarker and EMR data holds substantial promise for facilitating preventative interventions, guiding timely nephrology referrals, and optimizing individualized treatment regimens. These state-of-the-art tools could ultimately improve long-term renal outcomes and reduce healthcare burdens. The study further addresses ethical challenges and potential adverse implications associated with the use of AI in clinical settings.

Graphical Abstract

1. Introduction and Background

Kidney diseases, including diabetic kidney disease (DKD)/diabetic nephropathy (DN), chronic kidney disease (CKD), and end-stage renal disease (ESRD), represent a critical health concern worldwide. Most kidney diseases are diagnosed at advanced stages, when therapeutic options are far less effective than they would have been at early stages. Few advanced clinical tools can predict disease onset, severity, and progression during the early, asymptomatic phases. Most existing patient portals merely display biomarkers as time series, showing historical clinical values without any predictive power. To address this gap, this review analyzed 43 peer-reviewed papers that utilized AI-based predictive models incorporating patients’ biomarkers to predict the onset, severity, and progression of kidney diseases.
Multiple studies have developed and validated machine learning (ML) and deep learning (DL) models to accurately detect, quantify, and predict Interstitial Fibrosis and Tubular Atrophy (IFTA) in both native and transplanted kidneys [1,2,3,4,5,6,7]. Recent research also highlights the growing role of artificial intelligence (AI) and ML in DKD, showcasing their applications in diagnosis, risk prediction, disease progression modeling, personalized treatment planning, and biomarker discovery through the integration of imaging, clinical data, and omics technologies [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Three studies demonstrate how ML can enhance DKD management by introducing the Kidney Age Index (KAI) to assess kidney aging, developing a biomarker-based risk score to predict disease progression, and applying predictive models to identify at-risk individuals in diverse populations [23,24,25]. Multiple studies demonstrate that urinary biomarkers, alone or in combination, can significantly enhance the prediction and risk stratification of CKD progression across diverse populations, outperforming traditional clinical markers [26,27,28,29,30]. A few studies have developed and validated ML models, including random forest (RF) and Klinrisk algorithms, to accurately predict the progression of CKD and ESRD, particularly in patients with type 2 diabetes and DKD [31,32,33]. Other studies developed and validated interpretable ML models to predict mortality in patients undergoing peritoneal dialysis, hemodialysis, or continuous renal replacement therapy (CRRT), using dynamic clinical data and explainable AI techniques to identify key risk factors and support clinical decision-making [34,35,36,37,38]. Two studies advance cardiovascular disease (CVD) risk prediction in kidney disease: one links plasma endocan to cardiac events in ESRD, and the other uses ML and biomarkers to forecast CVD in CKD [39,40]. Still other studies used ML with genomics, metabolomics, and proteomics to uncover molecular subtypes, biomarkers, and disease mechanisms in DKD, advancing diagnostic precision and progression prediction [41,42,43].
Research to date has discussed how AI technologies can support preventative interventions, enable early detection, and guide individualized therapies for kidney patients, ultimately improving patient prognosis and reducing healthcare system burdens. However, no cumulative conclusions have been drawn about which AI-based diagnostic tool to use in which context. This paper investigates why, despite extensive work on applying AI technologies in nephrology, clinical implementation of these AI-based tools for early detection, risk stratification, and monitoring of kidney diseases still lags. We address the need for a research synthesis that outlines which AI models to use in which contexts, how they can be used most effectively, and the ethical aspects of AI when using EMR data. This paper emphasizes the use of biomarkers and EMR data as inputs to various AI-based models. Furthermore, we assess these tools’ methodological approaches, clinical applications, interpretability frameworks, and translational potential.

2. Objectives

This review aims to: (1) trace the evolution of AI and ML applications in diabetic, chronic, and end-stage kidney diseases; (2) evaluate the use of diverse biomarkers, including clinical, urinary, metabolomic, proteomic, genomic, transcriptomic, and pathology-based data for predicting disease onset, progression, and severity; (3) examine key ethical and translational challenges such as data privacy and model generalizability; and (4) propose a structured framework to guide the development, validation, and clinical adoption of robust, interpretable, and ethically sound AI-based tools in kidney care.

3. Methods and Materials

3.1. Overview

We conducted a systematic review of the literature on the integration of multimodal (MM) data, such as imaging (e.g., MRI or ultrasonography), tabular/clinical (e.g., age, lab tests, vitals), genomics/omics (e.g., DNA, RNA, proteomics), and text (e.g., clinical notes, pathology reports), using AI/ML/DL techniques, specifically in kidney disease. The methodology follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (see Supplementary Materials, Supplementary File S1) and relevant bibliometric standards to ensure transparency and rigor. ChatGPT (https://chatgpt.com/, accessed on 17 November 2025) was used for language editing and grammar checks. The following processes were implemented in this study: the literature search strategy, the eligibility criteria, the selection process, quality assessments, data synthesis, and bias assessment. Two independent reviewers were involved in the research process. Reviewer 1 performed the PRISMA selection and screening processes in parallel, which restricted the risk of bias in both selection and reporting. The second reviewer validated the research process.
Our review focuses on three broad themes: in which areas of nephrology AI-driven MM frameworks have been developed, and how; which areas of nephrology have benefited the most from these frameworks, and why; and how AI/ML techniques have been integrated. More specifically, within each theme, our review addresses the following questions: (i) What is the trend in developing these frameworks? (ii) How many modalities were used to develop them? (iii) What learning methods were used? (iv) What datasets/sources were used? (v) What was the status of validation rigor? (vi) Was the calibration slope reported? (vii) Was decision curve analysis (DCA) performed, and what was the net benefit? (viii) What performance metrics were utilized?

3.2. Search Strategy

To compose our comprehensive review, we adhered to the accepted protocols for such scholarly work. We conducted an extensive search across major academic databases, Google Scholar, PubMed, Web of Science, IEEE Xplore, and Scopus, following the PRISMA 2020 guidelines [44] (see Supplementary Materials, Supplementary File S1). Original research articles addressing the following aspects were selected: artificial intelligence or machine learning, kidney care, multimodal data, deep learning or neural networks, and nephrology. The literature search was conducted as an automated search in each listed search engine, using the search phrases and terms presented in Table 1. We defined the search period as 1 January 2015 to 1 September 2025 to include the latest advancements on this topic.

3.3. Selection Criteria

We initiated the search using the strings/queries provided in Table 1 across the databases listed in Figure 1. Our search produced 150 studies (90 from Google Scholar, 22 from PubMed, 18 from Web of Science, 10 from Scopus, and 10 from IEEE Xplore). The search data were copied into an Excel file for further investigation. First, we removed 42 records before screening: 23 were duplicates, 10 were marked as ineligible by the automation tool, and the remaining 9 were removed for other reasons, leaving 108 studies for title/abstract screening. Of these 108 studies, 89 were sought for retrieval, and 25 of those could not be retrieved. The remaining 64 records advanced to full-text review for further analysis against the eligibility criteria. Studies were eligible if they fulfilled the inclusion criteria listed in Table 2. Finally, after applying the inclusion/exclusion criteria to the 64 studies, we identified 43 studies for inclusion in our review (Figure 1).

3.4. Quality Assessment and Data Synthesis

We extracted and analyzed data from all studies meeting our inclusion/exclusion criteria. We then compiled seven detailed tables, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9, that catalog key questions mentioned earlier on AI-driven MM applications in nephrology. These tables capture study-specific information such as study years and citations, modalities integrated (imaging, genomic, clinical, etc.), AI/ML architectures (e.g., RF, XGBoost, Convolutional Neural Networks/CNNs, etc.), dataset sources (CREDENCE, CANVAS, etc.), and clinical objectives (diagnosis, prognosis, treatment optimization).
Our full-text review focused on evaluating MM data integration strategies and interpretability techniques using DL frameworks. We assessed their clinical utility in tasks such as IFTA grading, histological pattern recognition, post-transplant IFTA prediction, glomerular lesions in DN, molecular sub-phenotyping, and all-cause mortality prediction in hemodialysis and peritoneal dialysis patients, while identifying limitations specific to kidney research, including heterogeneity in imaging protocols, sparse MM registries, and discordant feature scales across modalities.
Through this process, we aim to highlight the MM data types most widely used as inputs to AI/ML models, emphasize their advantages, address the challenges in implementing AI-driven MM, and identify potential areas for innovation and improvement, such as explainability and interpretability. The data for all studies were independently extracted and managed by two authors (Abbasi, T., and Pinky, L.). When discrepancies arose, for instance, differences in recording the validation approach (cross-validation versus external testing) or in reporting quantitative results for performance metrics, the reviewers revisited the full-text articles jointly to verify the data and ensure consistent interpretation. Consensus was achieved through iterative discussion and mutual agreement, ensuring uniformity across the dataset and transparency in reporting. All final extracted data were reviewed collaboratively before inclusion in the synthesis to minimize bias and maintain methodological rigor.

3.5. Risk of Bias and Applicability Concerns Assessment

A structured risk of bias (RoB) assessment was performed using the PROBAST tool (Prediction model Risk of Bias Assessment Tool) for prognostic and diagnostic prediction models. Four domains (participants, predictors, outcomes, and statistical analysis) were assessed for RoB and applicability concerns relative to the intended clinical use (e.g., risk prediction, early detection, or progression modeling).
Each PROBAST domain comprises between two and nine signaling questions, with possible responses categorized as Yes/Probably Yes, No/Probably No, or Unclear. A domain is judged to have a high RoB when one or more signaling questions are rated as “No” or “Probably No.” Conversely, a low RoB is assigned only when all questions within the domain are rated “Yes” or “Probably Yes.” The overall RoB for a study is determined by aggregating these domain-level judgments: studies in which all domains are assessed as low risk are classified as having a low overall risk, whereas the presence of at least one high-risk domain results in a high overall rating. When insufficient information precludes a definitive judgment, the overall RoB is designated as unclear.
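These aggregation rules are mechanical enough to express directly in code. The following is a minimal sketch in Python (the function and rating names are our own illustrations, not part of any official PROBAST tooling) that mirrors the domain-level and study-level logic described above.

```python
from typing import Dict, List

def rate_domain(signals: List[str]) -> str:
    """Rate one PROBAST domain from its signaling-question answers.

    Answers: 'Y' (Yes/Probably Yes), 'N' (No/Probably No), 'U' (Unclear).
    """
    if any(s == "N" for s in signals):
        return "high"      # any No/Probably No -> high RoB
    if all(s == "Y" for s in signals):
        return "low"       # all Yes/Probably Yes -> low RoB
    return "unclear"       # otherwise, insufficient information

def rate_study(domains: Dict[str, List[str]]) -> str:
    """Aggregate domain-level judgments into an overall study-level RoB."""
    ratings = [rate_domain(sig) for sig in domains.values()]
    if any(r == "high" for r in ratings):
        return "high"      # at least one high-risk domain
    if all(r == "low" for r in ratings):
        return "low"       # all domains low risk
    return "unclear"

# Example: three domains are clean, but analysis has one 'No' answer.
study = {
    "participants": ["Y", "Y"],
    "predictors":   ["Y", "Y", "Y"],
    "outcome":      ["Y", "Y", "Y", "Y", "Y", "Y"],
    "analysis":     ["Y", "N", "Y", "U", "Y", "Y", "Y", "Y", "Y"],
}
print(rate_study(study))  # -> "high"
```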
Studies were rated low RoB when they recruited representative patient populations (e.g., consecutive or population-based cohorts) and clearly defined inclusion/exclusion criteria aligned with clinical practice. Predictors were measured objectively and consistently across all participants, with transparent handling of missing data (e.g., imputation methods specified) and appropriate blinding to outcomes. Outcomes were clinically relevant, consistently defined, and measured with adequate follow-up time to capture meaningful progression events. Analytical methods demonstrated internal and external validation (temporal or geographic), used appropriate model regularization to prevent overfitting, and reported key performance metrics including discrimination (AUROC, AUPRC), calibration (calibration slope/intercept), and clinical utility (DCA, net benefit).
Moderate risk was assigned when methodological details were partially reported, for instance, if data preprocessing steps were unclear, missing data handling was not described, or validation was limited to internal resampling (e.g., k-fold cross-validation). Models using complex feature sets or multi-omics data without independent replication were also rated moderate when uncertainty in reproducibility existed.
High RoB was assigned when studies used convenience or single-institution samples, lacked pre-specified inclusion/exclusion criteria, or applied inconsistent measurement or selection of predictors. Common high-risk issues included data leakage from non-temporal splits, undefined handling of missingness, short or inconsistent follow-up periods, and outcome definitions that shifted between sites or datasets. Analytical bias was high when models were trained and tested on overlapping data, lacked calibration assessment, or omitted critical performance metrics. Overfitting risk was flagged when small sample sizes were used relative to the number of predictors, particularly without penalization, shrinkage, or external validation.
Applicability concerns were rated low when populations, predictors, and outcomes matched real-world nephrology settings and intended use cases (e.g., CKD risk stratification or biopsy deferral). Moderate concerns arose when studies included narrowly defined cohorts, proprietary variables, or biomarkers not widely available. High concerns were assigned when model inputs or outcomes were unlikely to generalize, such as institution-specific imaging data, non-standard assays, or unvalidated multi-omics fingerprints.
Overall, higher-weight evidence came from studies with clearly defined cohorts, external validation, and robust calibration; in contrast, studies with strikingly high accuracy but weak validation or limited transparency were interpreted cautiously, as their performance may not transfer to broader clinical contexts.
For studies focusing on diagnostic tasks (e.g., IFTA detection from imaging or pathology), we assessed RoB and applicability concerns via the QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) tool across four domains: patient selection, index test, reference standard, and flow and timing. For RoB, studies were rated low when they enrolled a representative clinical population using consecutive or random sampling, applied pre-specified inclusion and exclusion criteria, and reported both internal and external validation with transparent analytical pipelines. Moderate risk was assigned when key design or analytic details were unclear (e.g., missing blinding statements, partial reporting of exclusions, or internal cross-validation without external testing). High risk was assigned for major threats such as selection bias (e.g., single-institution or highly selected cohorts), small or non-representative samples, data leakage from overlapping time windows, undefined handling of missing data, or outcome labels that shifted across sites or time periods. Models that were trained and tested on the same dataset without independent validation were considered particularly prone to overfitting.
For applicability, low concern was assigned when study populations, index tests, and outcomes reflected real-world nephrology practice and the intended clinical use (e.g., biopsy triage, progression monitoring). Moderate concern was applied when populations or imaging modalities diverged slightly from typical practice, and high concern when results were unlikely to generalize, such as models dependent on proprietary features, single-vendor imaging systems, or unstandardized biomarker assays.
Performance reporting was also considered in the overall judgment: low-risk studies included full calibration (e.g., slope/intercept plots), discrimination metrics (AUROC, AUPRC), and clinical utility measures (DCA or net benefit). Moderate risk studies reported discrimination only, while high-risk studies presented incomplete or non-validated performance claims. Common pitfalls across this literature included data leakage from poorly defined temporal splits, follow-up intervals too short to reflect meaningful kidney outcomes, and over-optimistic performance in small, internally validated cohorts. Accordingly, higher-quality evidence with robust external validation and calibration was given greater interpretive weight, whereas weaker studies with striking but unstable metrics were interpreted cautiously, as their performance is unlikely to generalize across settings.
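For readers less familiar with these metrics, the sketch below shows one common way to estimate a calibration slope and intercept (a logistic recalibration of observed outcomes on the log-odds of predicted risk) and the net benefit at a chosen decision threshold. The data are synthetic and the variable names illustrative; this is a minimal demonstration of the quantities, not any study's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)                                          # observed outcomes (synthetic)
p = np.clip(0.3 + 0.25 * y + rng.normal(0, 0.15, n), 0.01, 0.99)   # predicted risks (synthetic)

# Calibration slope/intercept: logistic recalibration of y on logit(p).
# A slope near 1 and intercept near 0 indicate good calibration.
logit_p = np.log(p / (1 - p)).reshape(-1, 1)
cal = LogisticRegression(penalty=None)  # unpenalized fit; requires scikit-learn >= 1.2
cal.fit(logit_p, y)
print(f"slope={cal.coef_[0][0]:.2f}, intercept={cal.intercept_[0]:.2f}")

# Net benefit at threshold pt (decision curve analysis):
#   NB = TP/n - (FP/n) * pt / (1 - pt)
pt = 0.2
treat = p >= pt
tp = np.sum(treat & (y == 1))
fp = np.sum(treat & (y == 0))
print(f"net benefit at pt={pt}: {(tp - fp * pt / (1 - pt)) / n:.3f}")
```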
Two reviewers independently assessed all the included articles using the PROBAST and QUADAS-2 frameworks to evaluate RoB and applicability. Particular attention was given to the representativeness of the study population (e.g., single-center versus multicenter cohorts), adequacy of sample size, and appropriateness of predictor and outcome definitions. Reviewers also examined the rigor of validation methods (external, temporal, or internal only), the handling of missing data, and the presence of key performance metrics such as AUROC, AUPRC, calibration slope, and intercept. Discrepancies in ratings across domains were resolved through structured discussion to clarify interpretations of the signaling questions and ensure consistent application of criteria. Consensus was reached through iterative review and mutual agreement, with all final judgments determined collaboratively. The entire consensus process was documented to preserve transparency and reproducibility in the assessment of bias and applicability.

3.6. Methodological Quality Assessment

In our review, we did not need to use the Cochrane risk of bias (Cochrane RoB) tool, which assesses the methodological quality of randomized controlled trials (RCTs), or the ROBINS-I tool (“Risk Of Bias In Non-randomized Studies of Interventions”). Before explaining why we did not use these tools, we briefly define randomized and non-randomized studies. A randomized study, also known as a “Randomized Controlled Trial (RCT)”, is a research design in which scientists assign participants to treatment and control groups randomly to compare the effects of a specific treatment or intervention. A non-randomized study, often called a “Non-randomized Controlled Trial (NRCT)”, is a type of research in which participants are not assigned to treatment or control groups randomly; instead, groups are formed based on pre-existing characteristics, self-selection, or other non-random methods. Our review does not focus on the implementation or introduction of any novel drug, treatment regimen, or intervention for kidney diseases. Instead, we focused on predicting kidney disease onset and enabling early detection by using patient biomarkers and EMR data as input variables for various AI and ML models. The journal articles cited in this review did not perform randomized or non-randomized clinical trials, so the Cochrane RoB and ROBINS-I tools do not apply to our paper.

3.7. Assessment of Evidence Certainty

We assessed the certainty of the evidence for each prespecified outcome (e.g., IFTA quantification/diagnostic accuracy, DKD/CKD progression prediction, and mortality/cardiovascular-event prediction) using an approach based on Grading of Recommendations, Assessment, Development and Evaluation (GRADE) principles adapted for diagnostic and prognostic/prediction-model studies. For studies of diagnostic/segmentation accuracy, we used QUADAS-2 to judge RoB and concerns about applicability and then applied the GRADE for diagnostic tests framework (considering RoB, inconsistency, indirectness, imprecision, and publication bias) to derive an overall certainty rating (high, moderate, low, or very low) for each outcome. For prediction-model and prognostic studies, we assessed methodological quality with PROBAST (domains: participants, predictors, outcome, and analysis) and judged certainty by downgrading or upgrading across the same GRADE domains while additionally considering model-specific factors (adequacy of sample size/events, internal and, critically, external validation, calibration reporting, handling of missing data, overfitting risk, and transparency/reproducibility such as code or model availability). Two authors independently performed all RoB and certainty assessments. Any discrepancies in the initial GRADE ratings, such as differing interpretations of indirectness when study populations were not representative of real-world CKD or DKD cohorts, or disagreements over imprecision in effect estimates or calibration metrics, were addressed through structured consensus meetings. During these sessions, reviewers compared rationales, re-examined primary data, and referred back to the GRADE guidance to ensure consistent application of downgrading criteria. For example, inconsistencies in AUROC confidence intervals, unclear handling of missing predictor data, or lack of external validation prompted discussion to align the final certainty ratings. Consensus was achieved through iterative deliberation and mutual agreement, ensuring that final judgments reflected both methodological rigor and clinical applicability. The process was documented to enhance transparency and reproducibility in the grading of the overall certainty of evidence across diagnostic and predictive domains.
A flowchart (Figure 1) illustrating the process of identifying and selecting manuscripts for inclusion in this review paper is presented below.

4. Review

4.1. Application of AI-Based Diagnostic Tools in Early Detection, Risk Stratification, and Monitoring of IFTA

This section and Table 3 summarize the findings from seven studies that utilize AI and ML models for evaluating IFTA. IFTA is a hallmark feature of CKD progression that can be seen even in early stages of CKD. Researchers applied different model types, including convolutional neural networks (CNNs), ensemble methods (e.g., RF, XGBoost), and hybrid systems, to various biomarker inputs such as ultrasound, MRI, whole-slide histology images, clinical parameters, and transcriptomic data. Most of these models classified IFTA by severity grade or binary presence, and they attained high performance metrics; for instance, accuracies often exceeded 85% and areas under the curve (AUCs) reached 0.96 [1,2]. The AUC is a performance metric for evaluating the effectiveness of a binary classification model. Researchers found that the ability of DL models to assess IFTA aligned strongly with that of expert pathologists, unveiling the potential of DL models to act as real-time, non-invasive diagnostic tools.
Athavale et al., 2021 [1] used B-mode kidney ultrasound images as inputs to a deep CNN for classifying IFTA. Trojani et al., 2024 [2] applied MRI-texture-based radiomics as inputs to AI algorithms such as SVM and RF to assess graft IFTA severity post-transplant. Ginley et al., 2021 [3] applied biopsy-derived whole-slide images (WSIs) as inputs to CNN-based segmentation for quantification of IFTA and glomerulosclerosis. Zheng et al., 2021 [4] applied digitized renal biopsy images as inputs to a deep CNN to quantify FPA. Athavale et al., 2020 [5] applied renal ultrasound images as inputs to an early DL model to predict IFTA severity. Ginley et al., 2020 [6] applied renal biopsy images as inputs to neural networks for IFTA classification. Finally, Yin et al., 2023 [7] included post-transplant kidney patients from five GEO datasets (GSE98320 [45], GSE76882 [46]: training; GSE22459 [47], GSE53605 [48]: validation; GSE21374 [49]: prognosis). They applied these clinical and transplant datasets as inputs to algorithms such as RF and XGBoost to predict post-transplant IFTA (binary outcome).
Though the seven AI-based modeling studies reviewed here have their own strengths and potential, they also come with limitations, including small sample sizes, restricted external validation, and platform inconsistencies. In particular, models based on biopsy or genomic data face obstacles to clinical integration, as they may encounter practical constraints in real-world contexts. Nonetheless, these studies show that AI-driven tools can boost diagnostic accuracy by assessing IFTA and indicating how far the disease has progressed, thus allowing early risk stratification in CKD patients. These studies also emphasize the need for substantial multicentric validation and strategic implementation.

4.1.1. IFTA Classification: Generalizability and Fairness

Across the reviewed fibrosis and tubular atrophy studies, Athavale et al., 2021 [1], Trojani et al., 2024 [2], Ginley et al., 2021 [3], Zheng et al., 2021 [4], and Yin et al., 2023 [7], most models were developed in single tertiary academic centers, each using different imaging modalities, including ultrasound, MRI, and digitized PAS-stained biopsy slides. The prevalence of moderate-to-severe IFTA ranged from approximately 20% in transplant cohorts to nearly 40% in native kidney datasets. However, only Ginley et al., 2021 [3] provided demographic breakdowns, and few studies accounted for potential imbalances in sex, age, or race. Baseline kidney function varied widely between cohorts, complicating model comparisons. Domain shift between specialized tertiary labs and community imaging settings was not formally assessed, and external validation was limited: Trojani et al., 2024 [2] used a small independent MRI cohort, whereas others relied mainly on internal cross-validation or random splits. Moving forward, robust multi-site and prospective validation should take precedence over internal retuning to confirm real-world generalizability.

4.1.2. IFTA Classification: Data Hygiene and Temporal Integrity

Data hygiene is the backbone of credible machine learning in nephrology imaging. Only a few studies, notably Zheng et al., 2021 [4] and Yin et al., 2023 [7], clearly enforced patient-level temporal splits to prevent data leakage. Normalization and augmentation pipelines were described for most histopathology and MRI datasets, but handling of batch effects, such as stain variability, scanner drift, or acquisition differences, was rarely quantified. Missing data strategies and imputation methods were generally omitted, and clinical covariates were sometimes used without clear synchronization to outcome timepoints.

4.1.3. IFTA Classification: Right-Sizing the Enthusiasm for Multi-Omics

Multi-omics studies hold real promise for revealing hidden disease mechanisms, but in kidney research, excitement should be matched with realism. It helps to be clear about how data are actually combined: early fusion mixes features from all sources up front, intermediate fusion blends shared representations midway through modeling, and late fusion merges outputs from separate models. Each method needs its own form of regularization to prevent overfitting and to keep the model biologically grounded. In practice, many kidney cohorts have shown that small, stable panels of clinical and urinary markers perform as well or better than sprawling high-dimensional omics signatures. DL and imaging studies focused on IFTA (Athavale et al., 2021 [1]; Trojani et al., 2024 [2]; Ginley et al., 2021 [3]; Zheng et al., 2021 [4]) reinforce this point: carefully chosen, interpretable features often outperform more complex, opaque models. Multi-omics should therefore be seen as a discovery tool for identifying new phenotypes or understanding why treatment effects vary, rather than a default choice for everyday risk scoring.
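To make the fusion taxonomy above concrete, the sketch below contrasts early fusion (concatenating features from all modalities into one regularized classifier) with late fusion (averaging per-modality model probabilities). The modality names, feature counts, and labels are hypothetical stand-ins on synthetic data, not drawn from any of the cited studies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
X_clinical = rng.normal(size=(n, 8))    # hypothetical clinical features (e.g., age, eGFR)
X_imaging = rng.normal(size=(n, 32))    # hypothetical radiomic texture features
y = rng.integers(0, 2, n)               # synthetic IFTA labels (present/absent)

# Early fusion: concatenate all modalities up front; the C parameter sets the
# L2 regularization strength, which matters more as dimensionality grows.
early = LogisticRegression(C=0.1, max_iter=1000)
early.fit(np.hstack([X_clinical, X_imaging]), y)

# Late fusion: fit one model per modality, then merge the predicted probabilities.
m_clin = LogisticRegression(C=0.1, max_iter=1000).fit(X_clinical, y)
m_img = LogisticRegression(C=0.1, max_iter=1000).fit(X_imaging, y)
p_late = (m_clin.predict_proba(X_clinical)[:, 1]
          + m_img.predict_proba(X_imaging)[:, 1]) / 2

# Intermediate fusion (not shown) would instead learn a shared representation
# partway through a neural model, e.g., via a joint embedding layer.
```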

4.1.4. IFTA Classification: From Predictions to Actions

Predictive models only matter if they change what happens to a patient. Every study should specify the thresholds that would trigger different decisions, such as earlier nephrology referral, biopsy deferral, SGLT2 inhibitor initiation, vascular access planning, or referral for transplant evaluation. These thresholds turn a probability into a plan. Equally important is showing how model outputs would actually appear in the electronic health record: not as cryptic numbers, but as clear, color-coded risk levels with short explanations and links to next steps. For example, a high fibrosis score generated from a kidney ultrasound model (Athavale et al., 2021 [1]) or MRI-based fibrosis index (Trojani et al., 2024 [2]) could automatically prompt the care team to review pathology or adjust treatment. Even a simple screen mock-up, for instance, showing the prediction, the key features that drove it, and the recommended action, can make a model feel tangible and trustworthy. When predictions translate naturally into decisions, AI moves from theory to everyday clinical practice.
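To illustrate the "probability into a plan" idea, here is a minimal, hypothetical mapping from a predicted fibrosis risk to a color-coded EHR banner and a suggested next step. The thresholds and actions are illustrative placeholders only; real cut-points would have to come from decision curve analysis and local clinical consensus.

```python
def risk_display(prob: float) -> dict:
    """Map a predicted fibrosis risk to a hypothetical EHR banner.

    Thresholds are illustrative, not validated clinical cut-points.
    """
    if prob >= 0.60:
        return {"level": "HIGH", "color": "red",
                "action": "review pathology; consider nephrology referral"}
    if prob >= 0.30:
        return {"level": "MODERATE", "color": "amber",
                "action": "repeat imaging and labs in 3 months"}
    return {"level": "LOW", "color": "green",
            "action": "routine monitoring"}

print(risk_display(0.72))
# {'level': 'HIGH', 'color': 'red', 'action': 'review pathology; ...'}
```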
Among the seven studies mentioned in Table 3 that apply AI-based tools for early detection, risk stratification, and monitoring of IFTA, the first six studies serve a “Diagnostic” purpose. In contrast, the seventh study serves a “Predictive” or “Prognostic” purpose. Therefore, for RoB assessment, we utilized QUADAS-2 for the first six papers and PROBAST for the seventh paper. The QUADAS-2 and PROBAST assessments, summarized in Appendix A (Table A1, Table A2, Table A3, Table A4, Table A5, Table A6 and Table A7), provide detailed evaluations of the study-specific RoB and applicability of the included AI-based diagnostic and prognostic models for IFTA.
Across these imaging AI studies, data leakage and missingness handling were seldom discussed, particularly in preprints and early internal validations. Most models were trained and tested within single-center cohorts, raising concerns about site-specific labeling drift (e.g., biopsy grading variation).
Athavale et al., 2021 [1], Ginley et al., 2021 [3], and Yin et al., 2023 [7] demonstrated higher methodological rigor through external or temporal validation and explicit calibration.
By contrast, early-stage or proof-of-concept models (Athavale et al., 2020 [5]; Ginley et al., 2020 [6]) lacked reproducibility safeguards and transparent cross-validation reporting. Short follow-up and inconsistent ground-truth definitions (e.g., Banff vs. semiquantitative histology) limit generalizability.
When internally validated studies report striking AUROCs above 0.9, these figures likely reflect closed-cohort overfitting rather than portable diagnostic accuracy.
Hence, well-calibrated, externally validated pipelines (like those by Athavale et al., 2021 [1], Ginley et al., 2021 [3], Yin et al., 2023 [7]) should carry greater interpretive weight in evidence synthesis.

4.2. Application of AI-Based Models to Different Urinary Biomarkers for Early Detection, Risk Stratification, and Monitoring of CKD Progression

This section and Table 4 describe how scientists have used diverse AI-based and ML algorithms, ranging from LASSO logistic regression to proteomic classifiers and multivariate linear regression, to study urinary biomarkers as diagnostic and prognostic tools for CKD. For example, Bienaimé et al. (2023) applied a LASSO logistic regression model to five validated urinary biomarkers (CCL2, EGF, KIM1, NGAL, and TGF-α) [26]. The resulting model predicted rapid CKD progression more accurately than traditional risk factors (AUC of 0.722). Likewise, Pizzini et al. (2017) used a Cox regression model for their study [27]. They used binary-coded 24-h urinary excretion levels of NGAL, Uromodulin, and KIM1 (categorized as above or below the median) as input data. They derived a “composite score” from the model as their output; the composite score predicted renal outcomes over 3 years in CKD patients better than eGFR alone. Qin et al. (2019) used a cross-sectional logistic regression model for their study [28]. They used urinary levels of Transferrin (TF), Immunoglobulin G (IgG), Retinol-binding protein (RBP), β-galactosidase (GAL), N-acetyl beta-glucosaminidase (NAG), and β2-microglobulin (β2MG) as inputs. The model output revealed urinary RBP to be the strongest individual predictor of the presence of DKD in diabetic patients (AUC = 0.920).
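As a minimal sketch of the kind of L1-penalized (LASSO) logistic model Bienaimé et al. describe, the code below fits such a model to five columns standing in for the named biomarkers. The data are synthetic and the pipeline illustrative; it is not a reproduction of their analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins for the five urinary biomarkers (CCL2, EGF, KIM1, NGAL, TGF-alpha).
X = rng.lognormal(size=(300, 5))
y = rng.integers(0, 2, 300)  # rapid CKD progression yes/no (synthetic labels)

# The L1 penalty shrinks uninformative biomarker coefficients to exactly zero,
# performing feature selection inside the model itself.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.3f}")
```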
Muiru et al. (2021) used multivariable linear regression, MSG-LASSO, and multivariate simultaneous linear equations for their study [30]. They used 14 urinary biomarkers (IL-18, α1-microglobulin, Uromodulin, YKL-40, KIM-1, β2-microglobulin, and albuminuria, among others) as inputs. The model output showed associations between traditional and infection-related CKD risk factors and biomarker dynamics (changes in urinary biomarkers over time) in HIV-positive women. These biomarker dynamics acted as proxies for kidney disease detection and monitoring in this patient population. These studies utilizing AI-driven tools for early diagnosis, risk stratification, and monitoring of CKD progression have both advantages and limitations. Benefits include improved diagnostic sensitivity; insight into “tubular injury, re-absorptive dysfunction, immune activation, and glomerular injury”; and potential for individualized care. Nonetheless, limitations such as small sample sizes, lack of external validation, technical assay demands, and generalizability concerns must be addressed before broader clinical implementation.

4.2.1. CKD Progression: Generalizability and Fairness

Across urinary biomarker studies for chronic kidney disease (CKD) progression, Bienaimé et al. [26], Pizzini et al. [27], Qin et al. [28], Schanstra et al. [29], and Muiru et al. [30], generalizability and fairness remain central concerns. Cohorts ranged from single-center pilot studies (Pizzini, 2017 [27]) to multicenter European consortia (Schanstra, 2015 [29]) and population-based longitudinal studies (Muiru, 2021 [30]). Assay platforms included mass spectrometry for peptide profiling, multiplex immunoassays, and ELISAs for tubular injury markers. Reported prevalence of CKD stages varied across cohorts from ~10% in early-stage populations to >40% in high-risk cohorts, with baseline eGFR spanning 35–90 mL/min/1.73 m2. Only a subset of studies stratified results by age, sex, or race; Muiru et al. [30] specifically reported subgroup analyses among women living with HIV, highlighting differences in tubular biomarker trajectories. Domain shift between tertiary referral centers and community clinics remains largely unexplored, and external validation was limited, emphasizing the need for multi-site confirmatory testing over repeated internal tuning.

4.2.2. CKD Progression: Data Hygiene and Temporal Integrity

Data hygiene practices were inconsistently reported. Only a few studies explicitly enforced patient-level temporal splits or described imputation and normalization strategies for missing or skewed biomarker values. Batch effects in proteomic assays, common when combining platforms or centers, were addressed sporadically, typically via quantile normalization or batch correction, but not universally. Feature windows (e.g., baseline urine collection) and outcome windows (e.g., 3–5 year CKD progression) were variably defined, risking inadvertent information leakage. Clear upfront specification of these windows, along with rigorous preprocessing, is critical to trust real-world performance.
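A minimal sketch, under assumed column names, of the patient-level temporal split described above: features are drawn only from samples collected before a cutoff date, outcomes only from events ascertained afterward, and no patient's data straddles both sides in a way that could leak future information into the feature set.

```python
import pandas as pd

def patient_level_temporal_split(df: pd.DataFrame, cutoff: str):
    """Separate feature and outcome windows at the patient level.

    Assumes columns 'patient_id', 'sample_date' (feature collection),
    and 'outcome_date' (event ascertainment). Names are hypothetical.
    """
    cutoff = pd.Timestamp(cutoff)
    # Feature window: baseline samples collected strictly before the cutoff.
    features = df[df["sample_date"] < cutoff]
    # Outcome window: events ascertained at or after the cutoff.
    outcomes = df[df["outcome_date"] >= cutoff]
    # Keep only patients with both a baseline sample and a later outcome, so
    # the model never sees post-baseline information for any patient.
    ids = set(features["patient_id"]) & set(outcomes["patient_id"])
    return (features[features["patient_id"].isin(ids)],
            outcomes[outcomes["patient_id"].isin(ids)])

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "sample_date": pd.to_datetime(
        ["2018-01-05", "2018-06-01", "2018-03-10", "2019-02-01", "2020-05-20"]),
    "outcome_date": pd.to_datetime(
        ["2021-01-05", "2021-01-05", "2020-03-10", "2020-03-10", "2022-05-20"]),
})
feats, outs = patient_level_temporal_split(df, "2019-01-01")
```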

4.2.3. CKD Progression: Right-Sizing the Enthusiasm for Multi-Omics

Multi-omics integration, while discussed, should be carefully contextualized. Early-, intermediate-, or late-fusion strategies require explicit regularization; however, small, reproducible panels of urinary and clinical markers often outperform high-dimensional signatures in nephrology cohorts. Multi-omics is most appropriate for discovering phenotypic heterogeneity or treatment effect modifiers, rather than as a default for routine risk scoring. For example, Schanstra et al.’s [29] urinary peptide classifier identified novel high-risk CKD phenotypes, but simpler panels in Bienaimé et al. and Pizzini et al. demonstrated strong predictive performance with greater clinical interpretability.

4.2.4. CKD Progression: From Predictions to Actions

Finally, actionable translation of predictions remains a key gap. Thresholds for interventions (earlier nephrology referral, intensified monitoring, SGLT2 inhibitor initiation, or transplant evaluation) were rarely defined. Outputs could be embedded in electronic records as risk scores with visual cues (e.g., color-coded risk meters) and escalation paths for high-risk patients.
Among the five studies listed above in Table 4 on the application of various AI-based tools to patients’ urinary biomarkers for the early detection, risk stratification, and monitoring of CKD progression, the first three studies serve a predictive or prognostic purpose, while the latter two serve a diagnostic purpose. So, for RoB assessment, we used PROBAST for the three studies, Bienaimé et al., 2023 [26], Pizzini et al., 2017 [27], and Qin et al., 2019 [28], and QUADAS-2 for the latter two studies, Schanstra et al., 2015 [29] and Muiru et al., 2021 [30]. The PROBAST and QUADAS-2 assessments, summarized in Appendix B (Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16 and Table A17), provide detailed quality appraisals of biomarker- and model-based studies in CKD and related populations.
Across the five articles, two are high-quality, two are moderate-quality, and one is low-quality according to the PROBAST and QUADAS-2 tools’ evaluations.
High-quality: Bienaimé et al., 2023 [26] and Schanstra et al., 2015 [29] demonstrate external validation, calibration slope, and multi-center reproducibility.
Moderate: Qin et al., 2019 [28] and Muiru et al., 2021 [30] demonstrate internal validation or repeated measures but lack calibration or independent replication.
Low-quality: Pizzini et al., 2017 [27] is a small pilot without validation or calibration, serving more as proof-of-concept than reliable evidence.
Common issues across lower-rigor studies include:
Data leakage from unsegregated time windows or mixed baseline/follow-up samples.
Unclear handling of missing urine biomarker values or detection-limit censoring.
Inconsistent outcome labels (e.g., “progression,” “rapid decline,” “incident CKD”) across cohorts.
Short follow-up windows (<2 years) that overstate short-term variability as “progression”.
Several reports cite eye-catching AUROCs (>0.8), but only externally validated models (e.g., Bienaimé et al., 2023 [26]; Schanstra et al., 2015 [29]) are likely to perform similarly in independent settings. Therefore, these higher-quality studies should carry greater interpretive weight in any synthesis, whereas internally derived findings should be treated as exploratory until prospectively confirmed.

4.3. Application of AI-Based Models to Different Clinical, Metabolomic, or Transcriptomic Data for Monitoring the Progression of DN

This section and Table 5 describe how scientists applied ML-based predictive models to various clinical, metabolomic, or transcriptomic data for monitoring the progression of DN. Yin et al. (2024) used five ML models, XGBoost, RF, Decision Tree, Logistic Regression, and LASSO Regression, for their study [15]. They used demographics (age, sex, smoking, alcohol), anthropometrics (BMI, AC, SBP, DBP), clinical labs (HbA1c, HDL-C, FBG, serum creatinine), disease status (hypertension, complications, stroke), medications, and serum metabolites (C2, C5DC, Tyr, Ser, Met, etc.) as input biomarkers. All of these models output the risk of DN in diabetic patients. The XGBoost model demonstrated the highest predictive performance (AUC of 0.93) among the tested models and identified novel serum metabolites as DN markers that are interpretable with SHAP values.
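The XGBoost-plus-SHAP pattern described above can be sketched in a few lines. The feature names below are illustrative stand-ins for the clinical and metabolite inputs, and the data are synthetic; this is a demonstration of the technique, not the study's actual model.

```python
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(7)
feature_names = ["HbA1c", "SBP", "BMI", "serum_creatinine", "C2", "C5DC", "Tyr"]
X = rng.normal(size=(400, len(feature_names)))
# Synthetic labels driven mostly by the first and fourth features.
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=400) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X, y)

# TreeExplainer attributes each prediction to per-feature contributions
# (SHAP values); the mean |SHAP| gives a global importance ranking.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
importance = np.abs(shap_values).mean(axis=0)
for name, imp in sorted(zip(feature_names, importance), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```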
Fan et al. (2025) used the XGBoost model for their study [41]. They used gene expression profiles (from GSE30122 [50], GSE30528 [51], GSE96804 [52], and GSE142153 [53]), 69 GRGs, including key DEGs (PFKP, TPP2, HIF1A, and TP53, upregulated; PFKL, MPC1, PC, PKLR, ALDOB, FBP1, and PCK1, downregulated), hub genes (GATM, PCBD1, F11, HRSP12, G6PC), and immune cell infiltration profiles (e.g., M2 macrophages) as input biomarkers. The model could diagnose DN with an AUC of >0.97 in training sets and 0.722 in external validation. Hirakawa et al. (2022) used the Piecewise Linear (PWL) model for their study [42]. They used 30 features, clinical (e.g., SBP, UACR) and selected plasma/urine metabolites (e.g., plasma kynurenine, gluconolactone, urinary threonic acid, urinary sphingomyelin, etc.), as input biomarkers. The model predicted rapid DKD progression (eGFR decline ≥ 10%) with an AUC of >0.80. Hirakawa et al. (2022) also used a Handcrafted Linear Regression (HCLR) model [42]. They used 50 features, the same clinical and metabolite features as PWL plus 21 binary flags for missing values (e.g., missing indicators for metabolites), as input biomarkers. The model predicted rapid DKD progression (eGFR decline ≥ 10%) with an AUC of >0.90.
Zhang et al. (2022) used a Lasso Regression model for their study [43]. They used 698 urinary metabolite ions and nine clinical covariates (e.g., albuminuria, BP, HbA1c, race) as input biomarkers; the model predicted the eGFR slope (rate of kidney function decline) while reducing overfitting. Zhang et al. (2022) also used an RF model [43], with the same 698 metabolite ions and clinical variables as inputs. The RF model predicted the eGFR slope, captured nonlinear relationships, and handled high-dimensional data well. Finally, Zhang et al. (2022) used Elastic Net, which combines L1 (lasso) and L2 (ridge) penalties, for correlated feature selection [43], using the same biomarkers as above. The model predicted the eGFR slope and reflected biological pathway structures. These models demonstrate strong potential in biomarker discovery and risk stratification for diabetic patients. However, limited external validation, overfitting risks, and a lack of stage-specific or longitudinal outcome prediction are among the limitations of these studies, highlighting the need for further validation across diverse populations and disease stages.

4.3.1. Progression of DN: Generalizability and Fairness

Across multi-omics studies of diabetic nephropathy (Fan et al., 2025 [41]; Hirakawa et al., 2022 [42]; Zhang et al., 2022 [43]; Yin et al., 2024 [15]), evidence for generalizability remains limited. Cohorts spanned single-center hospital samples (Yin et al., 2024 [15]) to large multicenter cohorts such as the Chronic Renal Insufficiency Cohort (CRIC) study (Zhang et al., 2022 [43]), incorporating diverse assay platforms including untargeted metabolomics, mass spectrometry, and transcriptomic profiling. Reported CKD prevalence ranged from early-stage nephropathy (~12–15%) to advanced CKD (~35%), with baseline eGFR spanning 30–95 mL/min/1.73 m2. Demographics and subgroup reporting were uneven: Fan et al., 2025 [41] and Yin et al., 2024 [15] reported age and sex distributions, but race/ethnicity reporting was sparse. Domain shifts between tertiary referral centers and community clinics were not systematically evaluated, and external validation was prioritized in only a subset of studies (Zhang et al., 2022 [43]), emphasizing the need for replication across independent multi-site cohorts rather than internal retuning alone.

4.3.2. Progression of DN: Data Hygiene and Temporal Integrity

Best practices for data integrity were variably applied. Only Zhang et al., 2022 [43] explicitly used patient-level temporal splits to separate feature windows (baseline metabolomic profiling) from outcome windows (2–3 year CKD progression), preventing future data leakage. Imputation strategies for missing metabolite or transcript values were sporadically reported, typically involving median or k-nearest neighbor approaches, while batch effects in high-dimensional omics were inconsistently addressed. Normalization and scaling were occasionally applied, but few studies fully described cross-center harmonization. Defining feature and outcome windows upfront is essential to trust real-world performance and ensure predictive models do not exploit artifacts of data collection timing.
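The median and k-nearest-neighbor imputation approaches mentioned above are both available in scikit-learn; the sketch below applies each to a small synthetic matrix. In practice, imputers should be fit on training data only and then applied to held-out data to avoid leakage.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([
    [1.2, np.nan, 3.1],
    [0.8, 2.0, np.nan],
    [np.nan, 1.5, 2.7],
    [1.0, 1.8, 3.0],
])

# Median imputation: replace each missing value with its column median.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# kNN imputation: fill each gap from the k most similar rows (here k = 2).
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_median)
print(X_knn)
```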

4.3.3. Progression of DN: Right-Sizing the Enthusiasm for Multi-Omics

While these studies leveraged multi-omics to identify molecular subtypes or predictive signatures, high-dimensional metabolomic or transcriptomic fingerprints did not uniformly outperform small, stable panels of clinical and urinary biomarkers. Fan et al., 2025 [41] demonstrated glycolysis-driven molecular subtypes using WGCNA and ML, and Hirakawa et al., 2022 [42] highlighted metabolomic progression markers; yet in routine nephrology cohorts, parsimonious marker panels may suffice for robust risk scoring. Multi-omics is most powerful for phenotype discovery, treatment effect heterogeneity, or mechanistic insight, rather than as default inputs for day-to-day clinical decision support. Early, intermediate, and late data fusion approaches were described inconsistently, and regularization requirements for high-dimensional inputs were only occasionally reported.

4.3.4. Progression of DN: From Predictions to Actions

Translation of model outputs to actionable care remains an open challenge. Few studies defined risk thresholds that would trigger earlier nephrology referral, SGLT2 initiation, biopsy deferral, or transplant evaluation. Outputs could be embedded within electronic health records as interpretable risk scores, potentially with color-coded risk indicators, temporal trends, and automatic escalation for high-risk patients. Yin et al.’s [15] explainable ML model demonstrated feature-level contributions to predicted DN risk, providing a prototype for visualization; however, a standardized example screen linking predictions to clinical actions would strengthen the pathway from computational prediction to patient management.
Among the four studies listed above utilizing various AI-based tools on patients’ clinical, metabolomic, and transcriptomic data for monitoring of DN, the second study, Fan et al., 2025 [41], serves a diagnostic purpose, while the remaining three studies, Yin et al., 2024 [15], Hirakawa et al., 2022 [42], and Zhang et al., 2022 [43], serve a predictive or prognostic purpose. So, for the RoB assessment, we used QUADAS-2 for Fan et al., 2025 [41], and PROBAST for Yin et al., 2024 [15], Hirakawa et al., 2022 [42], and Zhang et al., 2022 [43]. The PROBAST and QUADAS-2 assessments, summarized in Appendix C (Table A18, Table A19, Table A20 and Table A21), provide detailed quality appraisals of AI and metabolomics studies in DKD and related populations.
High-quality: Zhang et al., 2022 [43] and Fan et al., 2025 [41] demonstrate external or independent validation, strong discrimination (AUC ≈ 0.8–0.9), and model interpretability.
Moderate: Yin et al., 2024 [15] and Hirakawa et al., 2022 [42] perform well internally but lack independent validation and complete calibration reporting.
Common issues across weaker studies:
Data leakage from unclear temporal separation between predictors and outcomes.
Undefined missing data handling (especially in omics datasets).
Outcome heterogeneity (e.g., inconsistent DKD progression criteria).
Short follow-up windows, limiting the clinical relevance of “progression” signals.
As a result, while many models report eye-catching AUCs (>0.85), only externally validated work (e.g., Zhang et al., 2022 [43]) is likely to maintain real-world predictive accuracy. Stronger studies explicitly report calibration and validation design; weaker ones risk overfitting and inflated results that may not travel across populations or platforms.

4.4. Validation of ML Models for Prediction of CKD and ESRD

This summary paragraph and Table 6 address the need for external and internal validation of ML models for CKD and ESRD prediction across diverse populations and settings in order to implement AI-driven tools for personalized patient care. Chan et al. (2021) used KidneyIntelX (an RF model) as their ML risk score [24]. They used 460 patients from the BioMe and PMBB biobanks as their validation cohort. They compared their output with that of KDIGO and a clinical logistic regression model and found that the KidneyIntelX model outperformed the clinical models and KDIGO heatmaps with an AUC of 0.77, with improved risk stratification for early DKD patients. Ferguson et al. (2022) used an RF model (22 variables) as their ML risk score and considered 107,097 individuals from Alberta, Canada (a subset of 321,396) as their validation cohort [31]. After applying their chosen validation method, the output demonstrated robust prediction of eGFR decline and kidney failure, with AUCs up to 0.90 at 1 year.
Tangri et al. (2024) used the Klinrisk (Random Survival Forest) model as their ML risk score and considered the CANVAS (n = 10,142) and CREDENCE (n = 4401) clinical trials as their validation cohort [32]. After implementing external validation in the pooled CANVAS/CREDENCE datasets, the output revealed that the Klinrisk model outperformed both the KDIGO and KFRE benchmarks for predicting CKD progression in type 2 DM patients with a high albumin–creatinine ratio (ACR). Zou et al. (2022) used an RF model for ESRD prediction and considered an internal split (75% training, 25% validation) of 390 biopsy-confirmed DKD patients as their validation cohort [33]. They applied two validation methods (10-fold cross-validation on the training set and internal validation on the 25% holdout data), and the output showed the highest AUC (0.90) among the reviewed models, yielding a clinically actionable nomogram based on five key lab values. Together, these models highlight the potential of ML to aid in early detection, risk stratification, and individualized care for CKD patients.
Reorganizing the evidence around the clinical questions most relevant to kidney disease management clarifies how predictive modeling can meaningfully support decision-making. Across studies such as Zou et al. (2022) [33], the core questions include whether models can flag early disease, estimate the slope to ESRD, guide access planning, predict mortality, or anticipate cardiovascular events. When evaluated through the lens of data actually available in clinics, routine laboratory results, urine biomarkers, imaging, and multi-omics, different model families align with different prediction horizons and sample sizes. Routine laboratory and pathology data, as in Zou et al.’s [33] cohort of 390 biopsy-confirmed DKD patients (40.5% ESRD events over three years), are best handled by tree ensembles such as RFs or gradient boosting machines (GBM), which manage mixed predictors, nonlinear thresholds, and modest event counts effectively (AUC up to 0.90). For small or interpretable models, penalized logistic or Cox regression remains preferable, particularly when relationships are near-linear and transparency is key. Early disease flagging, which often depends on subtle shifts in albuminuria or cystatin C, benefits from parsimonious logistic or Cox baselines or, when urine biomarkers are added, regularized linear or tree-based models that prevent overfitting in moderate samples. Estimating the slope to ESRD or deciding when to plan access requires longitudinal or time-to-event data, best approached with survival modeling, penalized Cox, flexible parametric, or survival tree ensembles, to yield calibrated risk estimates at actionable time horizons. Mortality and cardiovascular prediction call for separate, outcome-specific survival models incorporating comorbidities and cardiac biomarkers, often analyzed with Cox or competing-risk frameworks (e.g., Fine–Gray). DL methods, while powerful for high-dimensional imaging or omics data, are generally unsuitable for small, tabular datasets like those in current nephrology cohorts. Overall, tree ensembles provide robust discrimination in moderate datasets, whereas penalized regression ensures generalizability and calibration; both are preferable to complex deep nets until larger, multi-modal datasets become available. Table A25 integrates these findings within the PROBAST framework, contextualizing Zou et al.’s [33] study-specific details, limitations (notably internal-only validation), and the rationale for model family selection by clinical task and data type.
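As one concrete instance of the survival-modeling recommendation above, here is a minimal penalized Cox sketch using the lifelines library on synthetic data. The column names, penalty value, and 24-month horizon are illustrative assumptions, not taken from any of the cited studies.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "egfr": rng.normal(60, 15, n),           # baseline eGFR (synthetic)
    "acr": rng.lognormal(3, 1, n),           # albumin-creatinine ratio (synthetic)
    "age": rng.normal(62, 10, n),
    "time_to_esrd": rng.exponential(36, n),  # months of follow-up (synthetic)
    "esrd_event": rng.integers(0, 2, n),     # 1 = reached ESRD, 0 = censored
})

# The penalizer adds shrinkage to the partial likelihood, guarding against
# overfitting when events are modest relative to the number of predictors.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="time_to_esrd", event_col="esrd_event")
cph.print_summary()

# Calibrated risk at an actionable horizon: 24-month survival probabilities.
surv_24 = cph.predict_survival_function(df.iloc[:5], times=[24])
print(surv_24)
```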

4.4.1. Validation of ML Models: Generalizability and Fairness

Across recent CKD progression prediction studies (Ferguson et al., 2022 [31]; Tangri et al., 2024 [32]; Zou et al., 2022 [33]; Chan et al., 2021 [24]), evidence for generalizability and fairness is stronger than in earlier single-center studies, though still uneven. Cohorts included multi-center international trials (CANVAS, CREDENCE), single-country hospital networks, and regional diabetes clinics, with assay platforms spanning standard clinical labs, urine biomarkers, and EHR-derived variables. CKD prevalence ranged from 10 to 30% at baseline, and reported demographics included age (mean 55–68 years), sex (roughly balanced), and race/ethnicity, where available; subgroup analyses by race were inconsistently reported. Domain shifts between tertiary centers and community clinics were noted but largely unquantified. Multi-site external validation was prioritized in Ferguson et al., 2022 [31] and Tangri et al., 2024 [32], whereas Zou et al., 2022 [33] focused on internal validation with temporal splits.

4.4.2. Validation of ML Models: Data Hygiene and Temporal Integrity

Best practices for data preparation were variably applied. Ferguson et al., 2022 [31] and Tangri et al., 2024 [32] explicitly used patient-level temporal splits to avoid future leakage, defining baseline feature windows (labs and vitals) and outcome windows (CKD progression over 2–5 years). Imputation for missing clinical and biomarker data was reported in Ferguson et al., 2022 [31] (median and multiple imputation) and Chan et al., 2021 [24], while batch effects for multi-center labs were inconsistently addressed. Normalization and scaling approaches were variably applied across continuous lab values, highlighting the need for explicit, reproducible pipelines to ensure real-world model performance.
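As an illustration of the temporal-split practice described above, the following sketch separates a toy long-format lab table into baseline-window features and a time-ordered train/test split; patient IDs, dates, cutoffs, and lab values are all hypothetical.

```python
# Minimal sketch: patient-level temporal split with an explicit baseline
# feature window, so models never "peek into the future."
import pandas as pd

# Toy long-format lab table: one row per lab draw (values illustrative).
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3, 3],
    "draw_date": pd.to_datetime(["2016-03-01", "2017-06-01", "2019-02-01",
                                 "2018-05-01", "2019-07-01",
                                 "2017-01-01", "2018-11-01"]),
    "egfr": [62, 55, 48, 71, 65, 44, 39],
    "uacr": [30, 120, 310, 25, 40, 500, 800],
})

BASELINE_END = pd.Timestamp("2018-12-31")  # features drawn on/before this date
SPLIT_DATE   = pd.Timestamp("2017-12-31")  # cohort-entry cutoff: train vs. test

# 1. Features come only from the baseline window; anything after
#    BASELINE_END is reserved for outcome ascertainment.
features = (labs[labs["draw_date"] <= BASELINE_END]
            .sort_values("draw_date")
            .groupby("patient_id")
            .agg(egfr_last=("egfr", "last"), uacr_max=("uacr", "max")))

# 2. Split by each patient's entry date rather than by random sampling,
#    so test patients enter the cohort strictly later than training ones.
entry = labs.groupby("patient_id")["draw_date"].min()
train_ids = entry[entry <= SPLIT_DATE].index
test_ids  = entry[entry > SPLIT_DATE].index

X_train = features.loc[features.index.intersection(train_ids)]
X_test  = features.loc[features.index.intersection(test_ids)]
print(X_train, X_test, sep="\n\n")
```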

4.4.3. Validation of ML Models: Right-Sizing the Enthusiasm for Multi-Omics

Chan et al., 2021 [24] integrated biomarker panels with EHR features, demonstrating that small, well-curated panels often perform as well as or better than high-dimensional data, particularly in routine clinical populations. While advanced fusion methods (early/intermediate/late) were explored, regularization and overfitting control were variably reported. Multi-omics is best positioned for refining phenotypes or stratifying treatment effect heterogeneity rather than routine clinical risk scoring, as stable, low-dimensional features remain highly predictive for progression to ESRD or significant eGFR decline.

4.4.4. Validation of ML Models: From Predictions to Actions

Several studies linked predictions to actionable clinical decisions. Ferguson et al., 2022 [31] and Tangri et al., 2024 [32] defined thresholds for intensified monitoring, early nephrology referral, SGLT2 initiation, and planning for renal replacement therapy. Chan et al., 2021 [24] incorporated explainable risk scores into prototype electronic health record dashboards with color-coded risk, trend visualization, and alerting for high-risk patients. Zou et al., 2022 [33] reported model probabilities without an integrated decision pathway, highlighting the need for clear operationalization. Example screens that contextualize predicted risk alongside actionable thresholds would strengthen translation to routine practice.
All four studies listed above serve a predictive or prognostic purpose. So, for RoB assessment, we used the PROBAST tool for these journal publications: Chan et al., 2021 [24], Ferguson et al., 2022 [31], Tangri et al., 2024 [32], and Zou et al., 2022 [33]. The PROBAST and QUADAS-2 assessments, summarized in Appendix D (Table A22, Table A23, Table A24, Table A25, Table A26, Table A27, Table A28 and Table A29), provide detailed quality appraisals of these studies utilizing AI-based kidney risk prediction models.
Across the four research articles, three are high-quality and one is moderate-quality, according to the PROBAST evaluation.
High-quality (green): Chan et al., 2021 [24], Ferguson et al., 2022 [31], and Tangri et al., 2024 [32] all performed external validation, demonstrated good calibration, and sometimes included DCA showing clinical utility.
Moderate (yellow): Zou et al., 2022 [33] lacked external validation and calibration assessment, relying on internal data only.
When models were subjected to external or temporal validation (e.g., Chan et al., 2021 [24]; Ferguson et al., 2022 [31]; Tangri et al., 2024 [32]), discrimination remained strong (AUROC ≈ 0.8–0.86) and calibration was preserved, suggesting that these are credible, clinically portable tools. By contrast, internally validated or single-center studies (e.g., Zou et al., 2022 [33]) reported similarly high AUCs but without independent replication or calibration testing, making their results less likely to generalize beyond the derivation cohort.
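Since DCA is cited above as a marker of study quality, a minimal sketch of its core quantity, net benefit at a chosen probability threshold, may be useful; the outcome and prediction arrays below are synthetic placeholders.

```python
# Minimal sketch: net benefit at a decision threshold, the quantity
# plotted in decision curve analysis (DCA). Inputs are synthetic.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * (pt / (1 - pt)) at threshold pt."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * (threshold / (1 - threshold))

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_prob = np.array([0.1, 0.8, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7])

for pt in (0.1, 0.2, 0.3):
    print(f"pt={pt:.1f}  net benefit={net_benefit(y_true, y_prob, pt):+.3f}")
```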

4.5. Application of AI-Based Algorithms for Detecting and Classifying Current Disease State, Discovering Diagnostic Biomarkers, and Subtype Identification

This section and Table 7 summarize research on the application of various AI and ML-based algorithms for the detection and classification of current disease states, discovery of diagnostic biomarkers, and identification of subtypes in the context of diabetic renal ailments, such as DKD or DN. Basuli et al., 2025 [8] systematically reviewed the utilization of various EHR, omics, and imaging data as input variables to various AI and ML models for predicting DKD onset, progression, and risk stratification. Lei et al., 2024 [9] applied PAS-stained WSIs as input variables for a CNN model for Renal Pathology Society (RPS) classification I-IV and lesion detection. Makino et al., 2019 [10] applied longitudinal EMR time-series data as input variables for convolutional autoencoder and RF models for predicting DKD aggravation within 6 months. Nayak et al., 2024 [11] applied single-center EHR and lab data as input biomarkers for an ML ensemble model for predicting DKD progression. Li et al., 2025 [12] systematically reviewed the application of various clinical, omics, and imaging data as input biomarkers for various ML algorithms for predicting DKD risk/progression. Zhu et al., 2024 [13] applied various clinical and laboratory data, such as serum creatinine, eGFR, and retinopathy, as input biomarkers for an SVM model for predicting DN onset within 36 months. Zhu, Liu, and Wang, 2024 [14] utilized samples derived from publicly available GEO transcriptomic datasets (GSE47184 [54], GSE96804 [52], GSE104948 [55], GSE104954 [56], GSE142025 [57], GSE175759 [58]). They applied these transcriptomic data as input variables for various ML ensemble models, including LASSO, Support Vector Machine, Recursive Feature Elimination (SVM-RFE), and RF, for DN classification and biomarker identification. It is worth noting that each of these seven studies used a different dataset to address a similar problem: predicting DKD or DN risk or progression.

4.5.1. Biomarker Discovery and Subtype Identification: Generalizability and Fairness

Across the seven studies reviewed, evidence for generalizability and fairness remains uneven, with few works providing transparent site-level or demographic data. Lei et al. (2024) [9] and Zhu, Liu & Wang (2024) [14] represent stronger examples of multi-cohort external validation, using datasets from multiple hospitals and publicly available transcriptomic platforms (e.g., GEO). These studies provide the clearest evidence of cross-domain reproducibility, though subgroup analyses by age, sex, or race were not explicitly reported. In contrast, Makino et al. (2019) [10] and Nayak et al. (2024) [11] relied on single-center or temporal splits, limiting portability across sites with differing assay platforms, patient mix, and care settings. Zhu Y. et al. (2024) [13] achieved partial external validation using an independent test cohort but similarly lacked stratified reporting by demographic or clinical subgroups. Li et al. (2025) [12] pooled multi-study data in a meta-analysis, improving representativeness across settings but still constrained by the variable reporting quality of source studies. Basuli et al. (2025) [8] highlighted these same issues, noting that domain shifts between tertiary referral centers (where most AI models are trained) and community clinics (where prevalence, kidney function distribution, and assay platforms differ) remain largely unquantified. None of the studies mapped prevalence or baseline eGFR distributions in a way that allows adjustment for population differences. Together, these findings suggest that while model performance within single institutions can be high, fairness and robustness across populations are still assumed rather than evidenced, emphasizing the need for prospective, multi-site validation and transparent demographic benchmarking before clinical translation.

4.5.2. Biomarker Discovery and Subtype Identification: Data Hygiene and Temporal Integrity

Across the recent AI and ML studies in diabetic kidney disease (DKD), including those by Basuli et al. (2025) [8], Lei et al. (2024) [9], Makino et al. (2019) [10], Nayak et al. (2024) [11], Li et al. (2025) [12], and Zhu et al. (2024) [13,14], good data hygiene remains the cornerstone of trustworthy performance. Models built from electronic health records and digital pathology must clearly define feature windows and outcome horizons so that predictions never “peek into the future.” Patient-level temporal splits should separate training and validation cohorts by time rather than by random sampling, ensuring that apparent accuracy is not inflated by data leakage. Where laboratory, imaging, and omics data are combined, imputation and normalization procedures need to be stated explicitly, and batch-effect correction for omics data should be routine practice, not an afterthought. These design guardrails are what distinguish a model that performs well in the lab from one that earns trust in the clinic.
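One way to make the imputation and normalization procedures explicit and leakage-free, as urged above, is to encapsulate them in a pipeline whose parameters are learned from the training window only; the sketch below uses scikit-learn with synthetic data and illustrative feature dimensions.

```python
# Minimal sketch: imputation and scaling fit on training data only,
# so preprocessing parameters cannot leak from the validation cohort.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = rng.integers(0, 2, 200)
X_test  = rng.normal(size=(50, 4))
X_train[rng.random(X_train.shape) < 0.1] = np.nan  # missingness to impute

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stated explicitly, reproducible
    ("scale",  StandardScaler()),
    ("model",  RandomForestClassifier(n_estimators=300, random_state=0)),
])

# fit() learns the medians and means from the training window alone;
# a temporally later validation cohort is only ever transformed.
pipe.fit(X_train, y_train)
risk = pipe.predict_proba(X_test)[:, 1]
print(risk[:5])
```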

4.5.3. Biomarker Discovery and Subtype Identification: Right-Sizing the Enthusiasm for Multi-Omics

While enthusiasm for multi-omics integration is high, the evidence across these DKD studies suggests a need for balance. Early-, intermediate-, and late-fusion strategies must each be matched with the right degree of regularization to prevent overfitting. Several studies, including Zhu et al. (2024) [13] and Li et al. (2025) [12], demonstrate that small, stable panels of urinary, serum, or transcriptomic markers often outperform high-dimensional multi-omics fingerprints in typical clinical cohorts. Multi-omics should be viewed as a powerful tool for phenotype discovery and understanding treatment heterogeneity, rather than the default solution for routine risk scoring.
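A minimal sketch of early fusion with regularization, assuming synthetic clinical and omics blocks, illustrates how an L1 penalty collapses a high-dimensional fingerprint into the kind of small, stable panel these studies favor.

```python
# Minimal sketch: "early fusion" of clinical and omics features by simple
# concatenation, with an L1 penalty to keep the high-dimensional omics
# block from overfitting. All data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 150
clinical = rng.normal(size=(n, 5))      # e.g., eGFR, UACR, age, HbA1c, SBP
omics    = rng.normal(size=(n, 500))    # e.g., transcript abundances
y        = rng.integers(0, 2, n)

X = np.hstack([clinical, omics])        # early fusion = feature concatenation

# L1 (lasso) regularization drives most omics coefficients to exactly zero,
# yielding a small, stable marker panel rather than a diffuse fingerprint.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero features:", int(np.sum(clf.coef_ != 0)), "of", X.shape[1])
```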

4.5.4. Biomarker Discovery and Subtype Identification: From Predictions to Actions

The most promising direction lies in translating predictions into actionable clinical decisions. For instance, thresholds derived from these models could trigger earlier nephrology referral, SGLT2 inhibitor initiation, or biopsy deferral when risk is low. Embedding these models into electronic health records, with intuitive displays, explainable scores, and alert systems, would help clinicians visualize risk trajectories, compare trends, and escalate care in real time. A concise example screen or alert pathway, as yet uncommon in published studies, would make the leap from algorithm to bedside far more tangible.
Together, these studies show that reproducible DKD prediction hinges less on algorithmic novelty and more on transparent data handling, measured use of omics, and thoughtful clinical integration, the very traits that build clinician confidence and support equitable real-world adoption.
Among the seven studies listed above, three serve a diagnostic purpose: Basuli et al., 2025 [8]; Lei et al., 2024 [9]; and Zhu, Liu, and Wang, 2024 [14]. So, we used the QUADAS-2 tool to assess RoB and applicability concerns for these three studies. The other four studies serve a predictive or prognostic purpose: Makino et al., 2019 [10]; Nayak et al., 2024 [11]; Li et al., 2025 [12]; and Zhu et al., 2024 [13]. So, we used the PROBAST tool to assess the RoB and applicability concerns of these four studies.
Across the emerging AI literature in DKD, common methodological weaknesses repeatedly limit the generalizability of reported results, even when headline metrics appear impressive. Many studies, such as those by Basuli et al. (2025) [8], Makino et al. (2019) [10], and Nayak et al. (2024) [11], report high discrimination (AUC > 0.85) for predicting DKD onset or progression, but inspection of their design often reveals information leakage from overlapping time windows, where predictors drawn near or after the outcome event inflate apparent accuracy. Missing data are frequently handled implicitly or through opaque imputation, and outcome definitions (e.g., DKD progression vs. incident ESRD) vary across sites, making cross-cohort validation nearly impossible. Follow-up horizons are often too short, typically one to three years, to capture the clinically meaningful trajectory toward renal failure. Some reports aggregate baseline and follow-up measurements without temporal ordering, further biasing model performance. In contrast, higher-quality studies, such as Lei et al. (2024) [9] and Li et al. (2025) [12], explicitly define censoring, apply external validation, and harmonize outcome criteria, producing more conservative but credible results. Lei et al. demonstrated that model interpretability and calibrated survival estimates mattered more than raw AUC, while Li et al.’s [12] meta-analysis highlighted that studies with rigorous temporal design and external testing consistently reported lower but more reproducible discrimination. Thus, when weaker retrospective analyses yield striking performance, like single-center RFs with AUCs near 0.95, the results should be interpreted as overfit or data-leaky rather than transferable evidence. Weight should instead be given to studies that declare their time anchors, handle missingness transparently, define stable outcome labels, and ensure sufficient follow-up to inform real-world decisions about DKD screening, progression, and treatment planning.
The PROBAST and QUADAS-2 assessments, summarized in Appendix E (Table A30, Table A31, Table A32, Table A33, Table A34, Table A35 and Table A36), provide detailed quality appraisals of these studies in kidney disease populations.

4.6. Application of AI and ML-Based Algorithms for Identifying and Classifying Existing Diseases and Subtypes, and Forecasting Disease Progression and Risk Stratification

This summary paragraph and Table 8 summarize a total of eight research articles on the application of various AI and ML-based algorithms for identifying and classifying existing diseases and subtypes, and forecasting disease progression and risk stratification in the context of diabetic renal diseases, DKD or DN. Lucarelli et al., 2023 [16] applied various urinary proteomics and pathologic glomerular features as input variables for various ensemble models, including RF, SVM, and XGBoost, for classifying various stages of DN. Yan et al., 2024 [17] applied various urinary proteomics as input variables for various ensemble models, including SVM-RFE, for DN biomarker discovery. Dong et al., 2022 [18] applied various EHR data, such as age and HbA1c, as input variables for various ML algorithms, such as XGBoost and logistic regression, for predicting the 3-year incidence of DKD. Hsu et al., 2023 [19] applied various EMR and lab data as input variables for XGBoost and RF models for predicting eGFR decline and the need for nephrology referral in DKD patients. Paranjpe et al., 2023 [20] applied various EMR data as input variables for a deep autoencoder model for classifying DKD subtypes and predicting disease progression risk. Xu et al., 2020 [21] systematically reviewed the application of various ML algorithms, including SVM, RF, and ANN, for predicting the risk of microangiopathic complications of diabetes, such as DN, DR, and diabetic peripheral neuropathy (DPN). Dong et al., 2024 [22] systematically reviewed the application of various omics features as input variables for various ML models, such as SVM, RF, and DL, for identifying diagnostic biomarkers for DN. Nagaraj et al., 2021 [23] used various EMR data, such as age, albuminuria, and eGFR, as input variables for an XGBoost model to introduce the Kidney Age Index (KAI), a biomarker classifier comparing functional and chronological kidney age in DKD patients. It is worth noting that all eight of these studies utilized different datasets to solve a common problem, which is biomarker discovery and forecasting DKD progression.

4.6.1. Identifying and Classifying Existing Diseases and Subtypes: Generalizability and Fairness

Across these eight studies, evidence of model generalizability was variable, and fairness was more often assumed than demonstrated. Dong et al. (2022) [18] and Hsu et al. (2023) [19] offered the most pragmatic frameworks, using large-scale EMR data across multiple hospital systems to predict 3-year DKD risk and nephrology referral needs, respectively. Both reported validation across geographically distinct sites but did not include subgroup analyses by age, sex, or race, nor did they provide prevalence maps or baseline kidney function distributions. In contrast, Lucarelli et al. (2023) [16] and Yan et al. (2024) [17] leveraged urinary proteomics, integrating omics and pathology platforms, yet their datasets were limited to tertiary academic centers, raising the risk of domain shift when deployed in community clinics with different assay pipelines and patient populations. Paranjpe et al. (2023) [20] demonstrated cross-ethnic subgroup modeling in a deep-learning framework that linked EMR-derived phenotypes to genetic variation in the Rho pathway (an encouraging example of attention to biological diversity), but explicit fairness metrics were still absent. Xu et al. (2020) [21] and Dong et al. (2024) [22], both systematic reviews, reinforced that few nephrology ML models undergo external validation, and that demographic or assay heterogeneity remains underreported. Collectively, only Nagaraj et al. (2021) [23] explicitly modeled age-related bias through their "Kidney Age Index," acknowledging biological and chronological aging differences. These findings highlight the urgent need for multi-site external validation with demographic stratification, standardized baseline kidney function reporting, and mapping of assay platforms and prevalence before claims of fairness or clinical portability can be substantiated.

4.6.2. Identifying and Classifying Existing Diseases and Subtypes: Data Hygiene and Temporal Integrity

Data hygiene practices varied considerably. Dong et al. (2022) [18] and Hsu et al. (2023) [19] correctly implemented patient-level temporal splits, ensuring predictions were made using only pre-outcome data, which is critical to avoiding information leakage. However, most omics-heavy studies (Lucarelli et al., 2023 [16]; Yan et al., 2024 [17]) did not explicitly describe batch-effect correction, normalization procedures, or imputation logic, despite combining samples across assays and tissue types. Feature and outcome windows were rarely pre-specified, raising uncertainty about whether the models may have peeked into future data during feature engineering. Only Paranjpe et al. (2023) [20] documented explicit time-stamped EMR segmentation, distinguishing look-back (feature) and follow-up (outcome) intervals, establishing a sound methodological precedent. Future nephrology AI pipelines should define these windows up front, employ cross-batch harmonization for proteomics, and detail missingness handling to ensure that real-world deployment remains trustworthy.

4.6.3. Identifying and Classifying Existing Diseases and Subtypes: Right-Sizing the Enthusiasm for Multi-Omics

Multi-omics integration was a hallmark of Lucarelli et al. (2023) [16] and Yan et al. (2024) [17], which used early and intermediate fusion of urinary proteomics with pathology and clinical metadata, respectively. However, neither clearly described regularization strategies or overfitting control for high-dimensional inputs. By contrast, simpler clinical models (Dong et al., 2022 [18]; Hsu et al., 2023 [19]; Nagaraj et al., 2021 [23]) achieved comparable discrimination using stable panels of eGFR, albuminuria, age, and metabolic markers, underscoring that parsimonious clinical and urinary biomarker sets often outperform large omics fingerprints in typical nephrology cohorts. Multi-omics should therefore be positioned as a tool for phenotype discovery and treatment-response heterogeneity, rather than a default for routine DKD risk prediction.

4.6.4. Identifying and Classifying Existing Diseases and Subtypes: From Predictions to Actions

Few studies explicitly linked predictions to clinical decisions. Hsu et al. (2023) [19] approached this most directly, identifying thresholds to trigger earlier nephrology referral and potential SGLT2 inhibitor initiation. Dong et al. (2022) [18] proposed a 3-year DKD risk model that could inform biopsy deferral or renal function monitoring frequency, but implementation details were limited. Future work should define action thresholds for changes in care (such as SGLT2 initiation, vascular access planning, or transplant evaluation) and demonstrate how predictions appear inside the electronic health record. Prototype dashboards showing risk scores, confidence intervals, and escalation paths could materially improve interpretability and clinician trust, moving from retrospective modeling toward actionable AI-guided kidney care.
Among the eight studies listed above, five of them serve a diagnostic purpose: Lucarelli et al., 2023 [16]; Yan et al., 2024 [17]; Paranjpe et al., 2023 [20]; Dong et al., 2024 [22]; and Nagaraj et al., 2021 [23]. So, we used the QUADAS-2 tool to assess the RoB and applicability concerns of these five journal publications. The other three studies serve a predictive or prognostic purpose: Dong et al., 2022 [18]; Hsu et al., 2023 [19]; and Xu et al., 2020 [21]. So, we used the PROBAST tool to assess the RoB and applicability concerns of these three journal publications. The PROBAST and QUADAS-2 assessments, summarized in Appendix F (Table A37, Table A38, Table A39, Table A40, Table A41, Table A42, Table A43 and Table A44), provide detailed quality appraisals of these studies in kidney disease populations.

4.7. Predicting Future Outcomes, Such as Mortality or Cardiovascular Events, Using ML Algorithms or Patients’ Biomarkers

This summary paragraph and Table 9 summarize the research articles predicting future outcomes, such as mortality or cardiovascular events, using various patient biomarkers as input variables for various ML algorithms. For example, Ma et al., 2023 [34] applied patients' age, albumin, hemoglobin, creatinine, dialysis duration, and comorbidities as input variables for an adaptive feature-recalibrated ensemble model to predict 3-year all-cause mortality in peritoneal dialysis patients. Chen et al., 2025 [35] applied patients' age, albumin, urea, comorbidities, and inflammatory biomarkers as input variables for various interpretable ML models, such as XGBoost, LightGBM, and Cox regression, to predict all-cause mortality and time to death in hemodialysis patients. Hung et al., 2022 [36] applied various baseline labs, such as BUN, lactate, bilirubin, vitals, and demographics, as input variables for an XGBoost + SHAP interpretation model to predict in-hospital mortality after CRRT initiation. Lin et al., 2023 [37] applied patients' serum endocan, age, albumin, creatinine, and diabetes as input variables for an XGBoost model to predict 36-month all-cause mortality in hemodialysis patients. Tran et al., 2024 [38] applied patients' age, CVD, smoking, vitamin D (vit D), Erythropoiesis-Stimulating Agent (ESA) use, parathyroid hormone (PTH), and ferritin as input variables for an XGBoost model to predict 2-year all-cause mortality in advanced CKD patients. Kim et al., 2020 [39] applied patients' plasma endocan, albumin, BMI, triglycerides (TG), and cardiovascular history as input variables for a Cox regression model to predict composite cardiovascular events in ESRD patients. Zhu et al., 2024 [40] applied patients' age, blood pressure (BP), eGFR, glucose, lipids, Hb, and comorbidities as input variables for various AI/ML algorithms, including XGBoost, RF, logistic regression, and SVM, to predict CVD in CKD patients. It is worth reporting that all these journal articles utilized different types of datasets to solve a similar problem, which is to predict future outcomes, such as all-cause mortality or cardiovascular events, in kidney patients.

4.7.1. Predicting Future Outcomes: Generalizability and Fairness

Rigorous evaluation requires a transparent map of study sites, assay platforms, disease prevalence, participant demographics, and baseline kidney function. Subgroup analyses by age, sex, and race or ethnicity should be routinely reported, as aggregated performance can obscure systematic biases in underrepresented populations. Domain shift between tertiary centers and community clinics, where practice patterns, coding habits, and laboratory calibration differ, must be explicitly acknowledged. External, multi-site validation should be prioritized over repeated internal tuning, as true generalizability arises from exposure to heterogeneous populations rather than incremental optimization of a single dataset. Recent efforts in external validation of mortality prediction models for advanced CKD, as in the study by Tran et al. (2024) [38], highlight the importance of testing across diverse cohorts to establish real-world robustness.

4.7.2. Predicting Future Outcomes: Data Hygiene and Temporal Integrity

Temporal splits at the patient level are essential to prevent inadvertent data leakage and to mimic the forward flow of clinical decision-making. Detailing imputation and normalization strategies will ensure that preprocessing parameters are fit solely to the training data. For omics data, batch-effect correction, using methods such as ComBat or mixed-model harmonization, should be explicitly described, as uncorrected technical variation can easily confound biological signals. Clearly defining feature and outcome windows a priori prevents models from “peeking into the future.” Such procedural guardrails are central to reproducibility and are the foundation for models whose real-world performance can be trusted once deployed in clinical settings.
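For concreteness, the following simplified sketch performs location/scale batch correction in the spirit of ComBat; the full method additionally applies empirical Bayes shrinkage to the per-batch parameters, and in practice the correction parameters should be estimated on training batches only. All data here are simulated.

```python
# Simplified sketch of location/scale batch correction in the spirit of
# ComBat (the full method adds empirical Bayes shrinkage of the batch
# parameters). Batches and features are simulated placeholders.
import numpy as np

def location_scale_correct(X, batches):
    """Center and rescale each feature within each batch toward the
    pooled mean/SD, removing additive and multiplicative batch effects."""
    Xc = X.astype(float).copy()
    grand_mean = X.mean(axis=0)
    grand_sd = X.std(axis=0, ddof=1)
    for b in np.unique(batches):
        idx = batches == b
        mu = X[idx].mean(axis=0)
        sd = X[idx].std(axis=0, ddof=1)
        Xc[idx] = (X[idx] - mu) / sd * grand_sd + grand_mean
    return Xc

# Two simulated assay batches with an additive shift in batch 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
batches = np.repeat([0, 1], 30)
X[batches == 1] += 2.0

X_corrected = location_scale_correct(X, batches)
print(X_corrected[batches == 1].mean(axis=0).round(2))  # shift removed
```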

4.7.3. Predicting Future Outcomes: Right-Size the Enthusiasm for Multi-Omics

It is important to distinguish between early-, intermediate-, and late-fusion strategies and to explain the forms of regularization that stabilize each. The nephrology cohorts analyzed by Ma et al., 2023 [34] and Chen et al., 2025 [35] show that small, interpretable panels of clinical and urinary biomarkers outperform high-dimensional omics signatures in terms of reproducibility and operational simplicity. Multi-omics integration can significantly contribute to phenotype discovery, mechanistic insight, and the identification of treatment effect heterogeneity, as demonstrated by Fan et al., 2025 [41] and Hirakawa et al., 2022 [42]. However, it should not be treated as the default approach for everyday risk stratification. Parsimonious, interpretable models remain the cornerstone of clinically deployable predictive tools.

4.7.4. Predicting Future Outcomes: Prediction to Action

The clinical utility of any predictive model depends on well-defined decision thresholds that alter patient management, such as prompting earlier nephrology referral, biopsy deferral, SGLT2 inhibitor initiation, vascular access planning, or transplant evaluation. Outputs should be designed for integration within electronic health records, featuring intuitive visualizations, risk explanations, and clear escalation pathways. Short examples or prototype screens, such as those demonstrated in explainable dialysis and critical care prediction models (Hung et al., 2022 [36]; Lin et al., 2023 [37]), can effectively convey how model predictions translate into clinical workflows. Embedding interpretability and usability from the outset ensures that predictive modeling in nephrology evolves from academic proof-of-concept to practical, equitable decision support.
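A minimal sketch of such an escalation pathway, with deliberately hypothetical (non-validated) thresholds, shows how a predicted risk could be operationalized inside a decision-support rule.

```python
# Minimal sketch: mapping predicted risk to the escalation pathways
# described above. Thresholds are illustrative placeholders, not
# validated cut-points from any reviewed study.
def escalation_pathway(two_year_esrd_risk: float) -> str:
    if two_year_esrd_risk >= 0.40:
        return "urgent nephrology referral + vascular access planning"
    if two_year_esrd_risk >= 0.20:
        return "nephrology referral; review SGLT2 inhibitor eligibility"
    if two_year_esrd_risk >= 0.05:
        return "intensified monitoring (repeat eGFR/UACR in 3 months)"
    return "routine follow-up"

for risk in (0.02, 0.12, 0.27, 0.55):
    print(f"risk={risk:.2f} -> {escalation_pathway(risk)}")
```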
All seven research articles listed above serve a predictive or prognostic purpose. Therefore, we used the PROBAST tool to assess the RoB and applicability concerns of these studies. The PROBAST assessments, summarized in Appendix G (Table A45, Table A46, Table A47, Table A48, Table A49, Table A50 and Table A51), provide detailed quality appraisals of these studies in kidney disease populations.
Across these articles, three are high-quality, one is moderate-quality, and three are low-quality, as per the PROBAST evaluation.
High-quality studies (green): Ma et al., 2023 [34], Chen et al., 2025 [35], and Tran et al., 2024 [38] demonstrate sound calibration and/or external validation.
Moderate (yellow): Hung et al., 2022 [36]; interpretable ML, but single-center only.
Low-quality (red): Lin et al., 2023 [37], Kim et al., 2020 [39], and Zhu et al., 2024 [40] rely on limited biomarker datasets, lack calibration or validation, and risk overfitting.
Common pitfalls across weaker studies:
- Time-window leakage and unreported temporal splits
- Undefined handling of missing data
- Shifting outcome definitions between sites
- Short or clinically irrelevant follow-up durations

4.8. Kidney Disease Forecasting

Figure 2, a stacked bar chart, illustrates the evolving trends of ML model selection and application across kidney disease studies published between 2015 and 2025. The x-axis represents publication years. The y-axis indicates the number of studies employing each model type. Distinct colors correspond to different model families, including DL, RF, XGBoost, LASSO, Logistic Regression, Linear Regression, Piecewise Linear Regression, Decision Tree, Cox Regression, Handcrafted Linear Regression, and Urinary Biomarker Classifiers.
Overall, the figure shows a marked increase in the diversity and frequency of ML applications beginning around 2020. DL models (green) and ensemble approaches such as RF and XGBoost (yellow and light blue) became prominent after 2020, reflecting the growing interest in data-rich, non-linear modeling for kidney pathology and prognosis. Earlier studies (2015–2019) primarily relied on traditional regression models (Cox, Logistic, and Linear Regression). Later studies (2022–2025) utilizing more advanced models (interpretable regression variants and deep neural networks) signal a shift toward more complex, data-driven approaches in kidney disease research.
Figure 3 is a line chart illustrating the temporal trends in the use or occurrence of various predictive models from 2015 to 2025. The x-axis represents the years. The y-axis indicates the frequency or categorical presence of each model, ranging from 0 to 2. Each colored line corresponds to a different model type. The chart shows how the prominence of these models changes over time; for instance, DL peaks around 2020–2021, and RF peaks in 2022 and 2024.

4.9. Evidence Certainty Assessment

The evidence certainty of AI and ML models in kidney disease prediction was evaluated using the GRADE approach, with appropriate consideration of the AI model category, RoB, inconsistency, indirectness, imprecision, and publication bias (Table 10). Color codes were used for each domain of the GRADE table: red indicates high concern, yellow indicates some concern, and green indicates low concern. When at least four of the five domains (RoB, Inconsistency, Indirectness, Imprecision, and Publication Bias) were rated low concern, the overall certainty of evidence was rated high. When three of the five domains were rated some concern, the overall certainty was rated moderate. Finally, when three of the five domains were rated high concern, or one was rated high concern and two some concern, the overall certainty was rated low.
In this GRADE Summary of Findings (SoF) table, each domain was judged to be of low, some, or high concern based on specific criteria. In terms of the RoB domain, studies demonstrating well-designed models with transparent feature selection, clear outcome definitions, external validation, and appropriate calibration were rated to be of low concern. Then, studies demonstrating minor methodological issues, such as internal cross-validation only, unclear missing data handling, or limited reporting of calibration, were rated to be of some concern. Additionally, studies demonstrating a high risk of selection bias, data leakage, unclear inclusion criteria, or a complete lack of validation were rated to be of high concern.
In terms of the “Inconsistency” domain, studies demonstrating consistent results across multiple cohorts or settings, with overlapping confidence intervals, were rated to be of low concern. Then, studies demonstrating moderate heterogeneity in performance metrics or inconsistent effect sizes across datasets were rated to be of some concern. Additionally, studies demonstrating marked inconsistency between studies with conflicting direction or magnitude of effect were rated to be of high concern.
In terms of the “Indirectness” domain, studies demonstrating population, predictors, and outcomes directly applicable to target clinical use (e.g., DKD or CKD) were rated to be of low concern. Then, studies demonstrating minor differences in population (e.g., single-center or narrow subgroup) or proxy outcomes were rated to be of some concern. Additionally, studies demonstrating a major mismatch between the study context and intended use (e.g., preclinical data or a non-representative population) were rated to be of high concern.
In terms of the “Imprecision” domain, studies demonstrating narrow confidence intervals, adequately powered sample size (>500 participants), and stable performance metrics (e.g., AUROC/AUPRC), were rated to be of low concern. Then, studies demonstrating moderate uncertainty due to small-to-medium sample size, wide CI, or unreported precision, were rated to be of some concern. Additionally, studies demonstrating very small sample sizes or wide uncertainty intervals, preventing reliable estimation of model performance, were rated to be of high concern.
In terms of the “Publication Bias” domain, articles supported by multiple studies, preregistered protocols, or open data, whose findings are likely reproducible, were rated to be of low concern. Then, articles demonstrating limited evidence based on selective reporting were rated to be of some concern. Additionally, single-center reports with extreme effect sizes and a lack of transparency, suggesting publication bias, were rated to be of high concern.
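The rating rules stated above can be expressed compactly in code; the sketch below encodes exactly those rules and flags, rather than guesses at, any domain combination the text does not cover.

```python
# Minimal sketch encoding the GRADE rating rules stated above. Each domain
# is scored "low", "some", or "high" concern; combinations not covered by
# the stated rules are flagged rather than guessed.
from collections import Counter

DOMAINS = ["RoB", "Inconsistency", "Indirectness", "Imprecision", "Publication Bias"]

def grade_certainty(concerns: dict) -> str:
    c = Counter(concerns[d] for d in DOMAINS)
    if c["low"] >= 4:                                   # all five, or four of five, low
        return "High"
    if c["some"] == 3:
        return "Moderate"
    if c["high"] == 3 or (c["high"] == 1 and c["some"] == 2):
        return "Low"
    return "Not covered by the stated rules"

example = {"RoB": "low", "Inconsistency": "some", "Indirectness": "low",
           "Imprecision": "some", "Publication Bias": "some"}
print(grade_certainty(example))  # -> Moderate
```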
The GRADE evaluation of the included studies shows moderate overall certainty of evidence for AI applications in kidney disease.
Among the different AI model categories, CKD progression prediction models [24,25,31,32] demonstrated the highest certainty of evidence, with consistently low RoB, strong consistency across studies, and minimal indirectness, leading to a high overall rating.
Radiomics-based AI models using ultrasound and MRI [1,2,5], pathology-based AI models using histology or whole-slide imaging [3,4,6,7], biomarker-based AI approaches including proteomics, metabolomics, and urinary markers [16,17,26,27,28,29,30,42,43], and electronic medical record (EMR)-based AI for diabetic kidney disease risk prediction [10,11,18,19,20,21,22,23,24,25,31,32,33] all demonstrated moderate certainty. These categories were mainly downgraded due to RoB, inconsistency in model performance, and some degree of imprecision in reported outcomes.
Genomics and multi-omics-based AI models [14,20,41] had the lowest certainty of evidence, primarily due to imprecision and indirectness, reflecting limited clinical validation and smaller study populations.
AI models for mortality prediction [34,35,36,37,38] and cardiovascular risk prediction in CKD [39,40] also achieved moderate certainty, though concerns remained around study heterogeneity and potential publication bias.

4.10. Recommended Approaches for Kidney AI/ML Studies

Six major methodological categories were identified across the reviewed studies. DL models, particularly convolutional neural networks and U-Net architectures, were primarily applied to imaging and histopathology quantification. These approaches are considered “recommended” when they include at least one external or temporal validation and demonstrate a reproducible analytic pipeline.
Supervised ML models, most commonly ensemble tree algorithms such as RF and XGBoost, together with SVMs, were used for tabular, clinical, or biomarker-based prediction tasks. They meet the threshold for recommendation when sample sizes exceed roughly 100 participants and predictor variables are clearly defined and reproducible.
Explainable ML approaches, using SHAP or LIME, were applied across clinical prognostic models to clarify feature importance and model reasoning. To be considered robust, these studies should present explicit feature attribution and calibration curves to support transparency and reliability.
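As a concrete illustration of the SHAP-based feature attribution these studies report, the following sketch explains a gradient-boosted classifier trained on synthetic data; feature meanings are illustrative only.

```python
# Minimal sketch: SHAP feature attribution for a gradient-boosted model.
# Data are synthetic; feature meanings are illustrative.
import numpy as np
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))  # e.g., eGFR, UACR, age, HbA1c
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)   # exact attributions for tree models
shap_values = explainer.shap_values(X)  # one attribution per feature per patient
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0))
```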
Traditional regression models, including logistic and Cox proportional hazards regressions, served as baseline or benchmark comparators. Their inclusion remains valuable when model coefficients are fully reported and calibration performance (slope and intercept) is tested.
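Calibration slope and intercept can be estimated by logistic recalibration, regressing the observed outcome on the log-odds of the predicted probabilities; a slope near 1 and an intercept near 0 indicate good calibration. The sketch below uses synthetic, deliberately well-calibrated predictions.

```python
# Minimal sketch: calibration slope and intercept via logistic
# recalibration of predicted probabilities. Inputs are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
y_prob = np.clip(rng.beta(2, 5, size=500), 1e-6, 1 - 1e-6)  # model predictions
y_true = rng.binomial(1, y_prob)                            # well-calibrated outcomes

logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
recal = LogisticRegression(C=1e6).fit(logit, y_true)        # near-unpenalized fit

print(f"calibration slope     = {recal.coef_[0][0]:.2f}")   # target ~1.0
print(f"calibration intercept = {recal.intercept_[0]:.2f}") # target ~0.0
```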
Multi-omics integration models, using early, intermediate, or late data fusion methods to combine proteomic, metabolomic, and clinical inputs, are best viewed as exploratory tools for discovery and subtyping rather than routine prediction. They are recommended only when supported by external validation or replication.
Finally, reviews and meta-analyses provided methodological synthesis but were not graded for validation rigor, serving instead to contextualize emerging best practices across model types.
We have integrated the recommended approaches, typical features, external validation status, and key limitations of all 43 journal articles in Table 11.

5. Discussion

5.1. Ethical Considerations and Potential Negative Effects of AI Models

Scientific research on applications of AI and ML algorithms in medicine has established their promise in enhancing disease predictability, thus allowing the implementation of early preventative interventions. For example, these state-of-the-art tools can use the biomarkers and EMR data of newly diagnosed diabetic patients to predict their natural course of developing microvascular or macrovascular complications of diabetes over time. As a result, physicians can take preventative measures to delay the development of these complications. However promising this may seem, specific ethical concerns and adverse impacts of these AI models remain unresolved, as there has been very little work on developing regulatory frameworks and validating these models in real-world clinical settings. AI algorithms rely heavily on sensitive patient data, which, if improperly handled, can breach patient confidentiality and safety. Moreover, many AI-based tools operate through opaque internal mechanisms, making it hard for clinicians to interpret, trust, or act on their outputs. Overreliance on these tools' predictions can also erode clinicians' judgment and decision-making, causing them to lose adequate clinical insight into their patients' overall health. Furthermore, if not adequately trained on data from underrepresented populations, AI tools can produce inaccurate outputs and mislead treatment plans for these patient demographics, further exacerbating existing healthcare disparities. Widely accepted regulatory frameworks, meticulously validated model functionality, and AI tools trained for equitable performance should be the primary focus of future clinical implementation.

5.2. Limitations

While this review provides a comprehensive and structured analysis of several AI-based applications in kidney diseases, we should also acknowledge the study's limitations. We did not extensively analyze or tabulate ML-based models that establish the correlation between diabetic retinopathy and DN using biomarkers and EMR data, despite their emerging relevance in integrated diabetic complication profiling. The review does not include a detailed analysis or tabular presentation of ML models employing endocan as a novel biomarker for predicting all-cause mortality and cardiovascular events in CKD patients, which could be a significant avenue for future research. We did not cover the development and clinical utility of ML-derived KAI models used for risk stratification and intervention in DKD, despite growing interest in biologically informed biomarkers. The review does not delve into interpretable ML models aimed at predicting mortality in ESRD patients using longitudinal EMR data from follow-up visits, which could enhance clinical decision-making and patient counseling. We have not compiled the findings into easily understandable table formats for the areas mentioned above, which limits the comparative clarity and accessibility that our tabular approach provides in other sections.

5.3. Future Directions

The current body of literature demonstrates significant progress in the application of AI-based predictive tools (ML models including XGBoost, RF, linear and logistic regression, LASSO regression, DL, and neural networks) for early detection, risk stratification, and monitoring of kidney diseases. However, the window of opportunity remains open for further advancement. To date, we have identified a noticeable absence of research on applying generative AI (GenAI) technologies in nephrology. Traditional AI models analyze existing data to predict outcomes; for instance, we can input patient biomarkers and EMR data into ML models and obtain results describing kidney disease onset, progression, and severity. GenAI models, by contrast, can produce novel data samples resembling real-world distributions. Examples of GenAI technologies include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Transformers, Diffusion Models, and Autoregressive Models. Integrating these GenAI models into nephrology could enable novel capabilities, such as the synthetic generation of training datasets to overcome data scarcity or privacy concerns, the simulation of disease progression trajectories offering new insights into CKD and ESRD management, and personalized treatment planning through the generation of virtual patient models or synthetic biomarker profiles.
More work is needed to bridge the gap between generative synthesis and predictive modeling. For example, researchers can combine GenAI with conventional ML models; the resulting hybrid frameworks could improve precision medicine approaches by predicting future clinical scenarios while characterizing current illness states. Extensive research is also needed on clinical validation, ethical use, data stewardship, and interpretability before GenAI is implemented in nephrology. Future studies applying GenAI in nephrology must explore its ability to generate evidence-based and decipherable results for clinical translation. Finally, to unveil the full potential of GenAI technologies in nephrology, multidisciplinary approaches and partnerships among nephrologists, data scientists, AI researchers, and ethicists are essential.

6. Conclusions

Recent advancements in AI have made it possible to integrate enriched datasets from EMRs with high-dimensional patient biomarker profiles, such as urinary, proteomic, metabolomic, inflammatory, and genetic markers, into AI-based predictive models. Scientists have identified the significant potential of these models for detecting at-risk groups, predicting disease trajectories, and informing individualized care strategies far before any clinical decline becomes evident. AI-based diagnostic tools offer a paradigm shift toward proactive and data-driven kidney care by unveiling complex and nonlinear patterns in patient data.
However, research on AI applications in nephrology has been fragmented, with heterogeneity in study design, variation in model selection, and a multiplicity of input biomarker choices. For example, the application of AI models to urinary biomarkers for kidney disease detection and monitoring has shown significant variation in both study designs and methodological approaches. A large number of models, including logistic regression variants, Cox regression, multivariate linear regression, and ML-based techniques such as MSG-LASSO, have been used, reflecting inconsistencies in model selection. Variations in input biomarkers, spanning different combinations of proteins and metabolites, have also been substantial, making cross-study comparisons challenging and underscoring the need for standardized methodology. While discrimination metrics such as AUROC and AUPRC are frequently reported, fewer studies evaluate calibration, DCA, or net clinical benefit, key components for assessing clinical utility. The most robust studies use large, diverse cohorts, independent external validation, and transparent pipelines; in contrast, many single-center or retrospective analyses risk overfitting, selection bias, and non-reproducibility.
AI-assisted image analysis for histopathology and ultrasound quantification of IFTA has achieved reproducible performance with multicenter validation and can reasonably supplement human scoring under expert supervision. Similarly, structured risk scores integrating routinely available clinical and laboratory variables (e.g., eGFR, albuminuria, age, comorbidity profiles) can be considered for pilot implementation within decision support tools, particularly when externally validated and well-calibrated.
Multimodal models combining clinical data with urinary proteomic, metabolomic, or transcriptomic markers show encouraging internal performance but require independent external validation across varied populations and laboratory platforms before deployment. Explainable ML models that link predictions to actionable features (e.g., SGLT2 inhibitor eligibility, referral timing, or transplant planning) should undergo prospective evaluation to assess their safety and impact on clinical workflows.
High-dimensional, multi-omics fusion models and deep learning systems trained on limited or homogeneous datasets remain exploratory tools for hypothesis generation rather than clinical decision-making. These approaches are best reserved for uncovering novel sub-phenotypes, mechanistic pathways, or treatment response heterogeneity rather than for routine risk scoring.
In sum, a disciplined transition from algorithmic novelty to clinical utility demands rigorous external validation, calibration assessment, and demonstration of net benefit in real-world practice. The field’s next phase should emphasize transparent reporting, cross-institutional reproducibility, and integration into EHR environments that enable interpretable, threshold-based alerts for individualized kidney care.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics5040067/s1; Supplementary File S1. PRISMA 2020 Checklist. The PRISMA 2020 Checklist guiding the reporting of this systematic review is provided in Supplementary File S1.

Author Contributions

The authors of this paper have reviewed the final version to be published and agreed to be accountable for all aspects of the work. Concept and design: T.A., L.P. Drafting of the manuscript: T.A. Critical review of the manuscript for important intellectual content: T.A., L.P. Supervision: L.P. Acquisition, analysis, or interpretation of data: T.A., L.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by AIM-AHEAD Coordinating Center, award number OTA-21-017, and was, in part, funded by the National Institutes of Health, United States Agreement No. 1OT2OD032581.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available upon reasonable request.

Acknowledgments

The work is the authors’ own idea. The authors used ChatGPT (OpenAI, 2025) to assist with language editing and improving clarity of certain sentences of this manuscript. All content was reviewed and verified by the authors (Abbasi, T., and Pinky, L.), who take full responsibility for the final text.

Conflicts of Interest

In compliance with the ICMJE uniform disclosure form, the authors of this paper declare the following: Payment/services info: The authors declared that they received no financial support from any organization for the submitted work. Financial relationships: The authors have declared that they have no financial relationships at present or within the previous three years with any organizations interested in the submitted work. Other relationships: The authors have declared that no other relationships or activities could appear to have influenced the submitted work.

Appendix A

This appendix contains details and explanations supplemental to the subsection, “4.1. Application of AI-based diagnostic tools in early detection, risk stratification, and monitoring of IFTA” of the main text. The explanations of the quality assessment of the included study articles [1] through [7] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text. However, this discussion is crucial to understanding the overall significance of these studies.
Table A1. QUADAS-2 RoB and Applicability Assessment for Athvale et al., 2021 [1]. (Columns: Domain; RoB; Judgment and Justification; Applicability Concern.)
1. Patient Selection. RoB: High (red). The study used a single-center, retrospective, consecutive sample (n = 352) of patients who underwent kidney biopsy at a tertiary hospital. This introduces potential selection bias, as patients were not randomly sampled and may not represent broader clinical populations. Applicability concern: High (red); participants are limited to a specific demographic (Cook County Hospital, Chicago), potentially limiting generalizability.
2. Index Test (DL + XGBoost model). RoB: Moderate to High (orange/red). The model was trained and tested on the same institutional data, with no external validation. It is unclear whether the index test was interpreted blinded to the reference standard. Deep learning feature extraction may vary across ultrasound machines. Applicability concern: Moderate (orange); implementation in other settings could yield variable results due to equipment and imaging protocol differences.
3. Reference Standard (Histopathologic IFTA grading). RoB: Low (green). Histopathology is the accepted gold standard for assessing interstitial fibrosis and tubular atrophy. It is likely that assessments were performed by qualified nephropathologists. However, blinding between reference and index test evaluators was not explicitly mentioned. Applicability concern: Low (green); the outcome is directly relevant to clinical practice.
4. Flow and Timing. RoB: Moderate (orange). All patients underwent both the index test (ultrasound) and the reference standard (biopsy) in the same period, but the timing between the two was not specified. Missing data handling and patient exclusions were not detailed. Applicability concern: Low (green); likely applicable since both tests relate to the same diagnostic event.
Color key: green = external validation (multi-site or independent cohort); orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations); red = no external validation/preprint/exploratory approach.
Table A1 presents the application of the QUADAS-2 tool to the first paper, Athvale et al., 2021 [1], which utilizes an AI-based diagnostic tool for early detection, risk stratification, and monitoring of IFTA. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
Although the study demonstrates promising accuracy for noninvasive quantification of IFTA using DL on ultrasound images, its single-center design, lack of external validation, and small dataset relative to model complexity result in a high RoB and limited generalizability. Future multicenter validation and calibration studies are needed to strengthen certainty.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate to high, and applicability concerns, moderate to high.
Table A2. QUADAS-2 RoB and applicability assessment for Trojani et al., 2024 [2]. (Columns: Domain; RoB; Reasoning/Justification; Applicability Concern.)
1. Patient Selection. RoB: High (red). Retrospective, single-center study; patients were included only if they had both MRI and biopsy within six months, which introduces selection bias (enriched population, not a consecutive diagnostic workflow). Poor-quality MRIs and unsuitable biopsies were excluded, which further limits representativeness. Applicability concern: High (red); the study population (transplant recipients in a tertiary center with available MRI) does not represent the full clinical spectrum of post-transplant patients.
2. Index Test (MRI-radiomic ML model). RoB: Moderate to High (orange/red). The MRI radiomics-based ML model was developed and validated on the same institutional data, using internal train/test splits (no external validation). The index test likely was not interpreted fully blinded to the biopsy results during feature selection and model tuning. Performance may be optimistically biased (AUC drop between training and test). Applicability concern: Moderate (orange); MRI protocols, scanners, and pre-processing steps are highly site-specific, and generalization to other centers may be limited.
3. Reference Standard (Histopathologic biopsy with Banff IFTA grading). RoB: Low (green). The biopsy-based Banff classification is a recognized gold standard for IFTA. Assessment was performed by an expert nephropathologist, though blinding to MRI results was not explicitly confirmed. Applicability concern: Low (green); the biopsy grading directly answers the target condition and is appropriate for the study aim.
4. Flow and Timing. RoB: Moderate (orange). MRI and biopsy were performed within six months, which may be long enough for histological changes in IFTA to progress, introducing potential misclassification. Only 70 MRI-biopsy pairs were analyzed out of 254 biopsies performed, indicating patient/exam attrition. Applicability concern: Low to Moderate (yellow); the six-month interval could affect diagnostic consistency, but within the chronic IFTA context it is partly acceptable.
Color key: green = external validation (multi-site or independent cohort); yellow = internal split or temporal validation only; orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations); red = no external validation/preprint/exploratory approach.
Table A2 presents the application of the QUADAS-2 tool to the 2nd paper, Trojani et al. 2024 [2], which uses an AI-based diagnostic tool for early detection, risk stratification, and monitoring of IFTA. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
This single-center, retrospective diagnostic accuracy study presents a moderate-to-high RoB, mainly due to non-representative patient selection, lack of external validation, and potential overfitting of the radiomics-based ML model. The reference standard (biopsy-based Banff IFTA grading) is appropriate and reliable, though blinding was not fully reported. The six-month window between MRI and biopsy could have introduced temporal bias, and missing data handling was not fully described. Applicability is limited by center-specific MRI acquisition protocols, pre-processing pipelines, and manual segmentation.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate to high, and applicability concerns, moderate to high.
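The optimism flagged in the index test domain (the AUC drop between training and test) is easy to reproduce. The following minimal sketch, using scikit-learn on synthetic data with dimensions loosely mirroring this study (about 70 cases versus hundreds of radiomic features), shows how a flexible model can score almost perfectly on its own training data while performing no better than chance on held-out cases; the feature count, model choice, and random labels are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 70, 500                 # ~70 MRI-biopsy pairs vs. many radiomic features
X = rng.normal(size=(n_samples, n_features))    # synthetic "radiomic" features
y = rng.integers(0, 2, size=n_samples)          # labels carry no true signal here

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Training AUC is near 1.0; held-out AUC hovers near 0.5 (chance).
print("training AUC:", roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1]))
print("held-out AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```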
Table A3. QUADAS-2 RoB and applicability assessment for Ginley et al., 2021 [3].
Domain | RoB | Reasoning/Justification | Applicability Concern
1. Patient Selection | Moderate (Orange) | The study used 116 whole-slide biopsy images, retrospectively selected from existing archives. No mention of consecutive or random sampling. Slides were chosen for image quality and completeness, introducing potential selection bias. However, inclusion appears broad across chronic kidney injuries, not limited to a narrow subset. | Low to Moderate (Yellow): The sample (renal biopsies with chronic injury) represents the target clinical population, though external representativeness is uncertain.
2. Index Test (CNN-based ML model, DeepLab v2) | Moderate (Orange) | The convolutional neural network (DeepLab v2) was trained on annotated WSIs and evaluated internally and externally. External testing included only 20 slides, from the same or a similar institutional context, and blinding to the reference standard was not specified. Model tuning and performance optimization may have introduced optimistic bias. | Moderate (Orange): Deep learning performance may vary with slide scanners, staining, and lab protocols, affecting real-world applicability.
3. Reference Standard (renal pathologist grading of IFTA and glomerulosclerosis) | Low (Green) | The ground-truth labels were assigned by expert renal pathologists, the clinical gold standard. Multiple pathologists participated in validation, and performance was benchmarked against their agreement levels. There is no explicit statement on blinding to model outputs, but the use of independent test slides suggests low bias. | Low (Green): Pathologist-based grading directly reflects the intended diagnostic construct.
4. Flow and Timing | Low (Green) | All slides underwent both the index test (ML assessment) and reference evaluation (pathologist grading) on the same biopsy samples. No participants or slides appear to have been excluded after inclusion. Flow is consistent and clearly reported. | Low (Green): The timing of analyses aligns with typical diagnostic workflows.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A3 presents the application of the QUADAS-2 tool for the 3rd paper, Ginley et al. 2021 [3], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
This diagnostic accuracy study demonstrates moderate overall RoB, mainly due to retrospective sampling and limited external validation of the CNN model. The reference standard (expert renal pathologist assessment) is robust, and the index test was applied appropriately, achieving pathologist-level performance.
In summary, the study shows low concern in the reference standard and flow and timing domains but moderate risk in the patient selection and index test domains.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low to moderate.
Table A4. QUADAS-2 RoB and applicability assessment for Zheng et al., 2021 [4].
Domain | RoB | Reasoning/Justification | Applicability Concern
1. Patient Selection | Low to Moderate (Yellow) | Patients included 64 from OSUWMC (67 WSIs) and 14 from the Kidney Precision Medicine Project (KPMP; 28 WSIs). WSIs underwent manual quality checks to exclude slides with artifacts. Some clinical data were missing (e.g., proteinuria, eGFR), but all eligible biopsies were included. The sample may not fully represent all renal biopsy populations, and KPMP had no severe IFTA cases. | Low (Green): The population represents patients undergoing renal biopsy, which matches the intended clinical target population for IFTA grading.
2. Index Test (DL model, glpathnet) | Low (Green) | The DL model combined local patch-level and global WSI-level features to predict IFTA grade. The model was trained on OSUWMC data with 5-fold cross-validation and tested externally on KPMP data. Patch-level probabilities were reviewed after reference grading to avoid bias. | Low (Green): The model directly addresses automated IFTA grading on digitized WSIs, the intended purpose of the index test.
3. Reference Standard (pathologist consensus, majority vote) | Moderate (Orange) | IFTA grades were determined by majority vote of five nephropathologists (OSUWMC) and by study investigators (KPMP), with moderate interobserver agreement (κ = 0.31–0.50). Grading is inherently subjective and may vary between pathologists, though consensus aligns with standard clinical practice. | Low (Green): The reference standard is clinically appropriate, using expert nephropathologists' evaluation of renal biopsy WSIs.
4. Flow and Timing | Low to Moderate (Yellow) | All WSIs were digitized consistently at ×40 magnification. DL training and testing were separated between OSUWMC (training) and KPMP (external validation). Some KPMP cases lacked severe IFTA. No WSIs were missing, and the same grading criteria were applied across datasets. | Low (Green): Data handling, timing, and grading criteria are consistent, reflecting the intended workflow for WSI analysis.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A4 presents the application of the QUADAS-2 tool for the 4th paper, Zheng et al., 2021 [4], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The study has an overall moderate RoB: while the index test and flow and timing domains are low risk, moderate concerns arise from patient selection and the subjectivity of the reference standard. Applicability concerns across all domains are low, indicating that the study population, index test, and reference standard are relevant to the intended clinical context.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
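The low-risk rating of the index test rests on the validation pattern credited to this study: internal cross-validation on the development cohort plus a fully held-out external cohort. A minimal scikit-learn sketch of that pattern appears below; the cohort sizes echo the study (67 development WSIs, 28 external), but the data, feature count, and classifier are synthetic stand-ins rather than the glpathnet model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X_dev, y_dev = rng.normal(size=(67, 20)), rng.integers(0, 2, size=67)   # development cohort
X_ext, y_ext = rng.normal(size=(28, 20)), rng.integers(0, 2, size=28)   # external cohort

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
internal = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="roc_auc")
print("internal 5-fold AUC:", internal.mean())

# Refit on the full development cohort, then score the untouched external cohort.
model.fit(X_dev, y_dev)
print("external AUC:", roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]))
```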
Table A5. QUADAS-2 RoB and applicability assessment for Athvale et al., 2020 [5].
Domain | RoB | Reasoning/Justification | Applicability Concern
1. Patient Selection | Moderate (Orange) | Patients were included retrospectively from a single center (Cook County Health, Chicago, IL). Ultrasound images were obtained from 352 patients who underwent kidney biopsy. Potential selection bias exists because only patients with biopsy-confirmed IFTA and usable ultrasound images were included. | Moderate (Orange): The population reflects patients undergoing biopsy but may not represent broader populations or those without biopsy, limiting generalizability to all kidney disease patients.
2. Index Test (DL ultrasound classification) | Low (Green) | The DL system classified IFTA from ultrasound images with kidneys masked by a 91%-accurate segmentation routine. The system was trained, validated, and tested on separate datasets, reducing bias. Performance metrics (accuracy, precision, recall, F1-score) were reported for all sets. | Low (Green): The index test is directly relevant to the clinical task of non-invasive IFTA assessment.
3. Reference Standard (biopsy IFTA graded by nephropathologists) | Low to Moderate (Yellow) | The reference standard was histologic IFTA grading of trichrome-stained kidney biopsies by nephropathologists. While widely accepted, inter-observer variability in IFTA grading is known, and majority consensus or standardized scoring was not specified. | Low (Green): The reference standard is clinically appropriate and directly measures the target condition.
4. Flow and Timing | Low (Green) | Training, validation, and test datasets were clearly separated. There was no mention of missing images or exclusions post-acquisition, and all images were processed using the same protocol. The timing between ultrasound and biopsy was not specified but was likely compatible with the clinical workflow. | Low (Green): Flow and timing are appropriate for evaluating the DL model against the biopsy reference standard.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A5 presents the application of the QUADAS-2 tool for the 5th paper, Athvale et al., 2020 [5], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The study demonstrates robust performance of the DL system for non-invasive IFTA assessment, with moderate caution due to selection bias and potential variability in biopsy grading.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
Table A6. QUADAS-2 RoB and applicability assessment for Ginley et al., 2020 [6].
Domain | RoB | Reasoning/Justification | Applicability Concern
1. Patient/Tissue Selection | Low to Moderate (Yellow) | The study used renal biopsy samples stained with periodic acid-Schiff (PAS). Data came from a single institution for intra-institutional holdout testing and an external institution for inter-institutional testing. Exact selection criteria and sample size were not fully described, introducing potential selection bias. | Low (Green): Study samples are representative of patients undergoing renal biopsy for glomerulosclerosis and IFTA assessment, the intended target population.
2. Index Test (CNN Segmentation) | Low (Green) | CNNs were trained to segment glomerulosclerosis and IFTA on PAS-stained biopsies. Model performance was evaluated on holdout intra- and inter-institutional datasets. Segmentation outputs were quantitatively compared to reference annotations, and high correlations were reported, indicating low bias in test conduct. | Low (Green): The index test directly addresses the clinical task of automated segmentation and quantitation of renal histologic injury.
3. Reference Standard (Pathologist Annotations) | Moderate (Orange) | Ground truth was based on manual segmentation by renal pathologists. The study notes that the CNN sometimes predicted regions "better than the ground truth," indicating some subjectivity and potential variability in the reference standard. Inter-observer variability of annotations was not formally reported. | Low (Green): Expert pathologist annotations are clinically relevant and appropriate for training and validating the model.
4. Flow and Timing | Low (Green) | The training, intra-institutional holdout, and inter-institutional holdout testing datasets were clearly separated. No missing-data issues were reported, and all images were analyzed according to the same protocol. | Low (Green): Flow and timing reflect intended use, with proper separation of training and test datasets.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A6 presents the application of the QUADAS-2 tool for the 6th paper, Ginley et al., 2020 [6], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The QUADAS-2 assessment indicates that this study evaluating CNNs for segmentation of glomerulosclerosis and interstitial fibrosis/tubular atrophy (IFTA) in PAS-stained renal biopsies is generally robust. The index test (CNN-based segmentation) demonstrates low RoB, as it was trained and validated on holdout intra- and inter-institutional datasets with quantitative evaluation against reference annotations. The reference standard, based on pathologist manual annotations, carries a moderate RoB due to inherent subjectivity and potential variability, though it remains clinically appropriate. Patient and tissue selection present low-moderate risk because sample sizes and selection criteria were not fully described, and external validation data came from a single additional institution. The flow and timing domain has a low RoB, with consistent image handling and clear separation of training and test datasets.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
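Segmentation studies of this kind typically quantify agreement between a model mask and the pathologist's reference mask with an overlap metric such as the Dice similarity coefficient; the paper reports correlations, so the metric below is an illustrative stand-in computed on tiny synthetic masks, not the authors' exact evaluation.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2|A∩B| / (|A| + |B|) for binary masks; 1.0 means perfect overlap."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, ref).sum() / denom

ref = np.zeros((8, 8), dtype=int)
ref[2:6, 2:6] = 1    # reference IFTA region annotated by a pathologist
pred = np.zeros((8, 8), dtype=int)
pred[3:7, 2:6] = 1   # model prediction, shifted down by one row

print(f"Dice: {dice_coefficient(pred, ref):.2f}")  # 0.75
```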
Table A7. PROBAST assessment for Yin et al., 2023 [7].
Domain | Details | RoB/Concern
Population | Post-transplant kidney patients from five GEO datasets (GSE98320 [45] and GSE76882 [46]: training; GSE22459 [47] and GSE53605 [48]: validation; GSE21374 [49]: prognosis). Total sample sizes vary; cohorts were selected based on ≥50 samples and the availability of biopsy-confirmed IFTA or survival data. Heterogeneity arises from platform differences and batch effects. | Moderate (Orange): selection bias possible; not fully representative of all transplant patients.
Index Model | Stepglm[both] + RF diagnostic model based on 28 necroptosis-related genes, developed from 114 combinations of 13 ML algorithms (LASSO, Ridge, Enet, Stepglm, SVM, glmboost, LDA, plsRglm, RF, Gradient Boosting Machine/GBM, XGBoost, Naive Bayes, ANN). High-dimensional data relative to sample size increase the risk of overfitting. | High (Red): multiple model testing, risk of overfitting, and small validation sets relative to training.
Comparator/Reference Model | Biopsy-confirmed IFTA status (histopathological evaluation) and survival data (post-transplant graft loss) from GEO datasets, used as the reference standard to evaluate predictive performance (AUC, ROC). | Low (Green): clinically accepted standard; the reference outcome is relevant and reliable.
Outcome | IFTA classification (binary or graded) and post-transplant graft survival. The modeled outcome includes differential gene expression associated with necroptosis. Performance was evaluated with AUC, Principal Component Analysis (PCA) separation, and Kaplan–Meier survival curves. | Moderate (Orange): grading differences across cohorts and batch effects may introduce misclassification bias.
Timing | Gene expression data from biopsies collected at variable post-transplant time points. Prognostic evaluation uses longitudinal survival data. Model development used cross-sectional training/validation datasets; timing differences between datasets could affect predictive performance. | Moderate (Orange): timing differences and cross-sectional data may limit prediction consistency.
Setting | Publicly available GEO gene expression datasets from multiple kidney transplant cohorts; laboratory and bioinformatics setting with no prospective or clinical trial validation. | Moderate (Orange): datasets may not represent real-world clinical populations.
Intended Use of Prediction Model | Early identification of IFTA and stratification of kidney transplant patients by risk of fibrosis progression or graft loss, aiming to support clinical decision-making and follow-up prioritization. Not yet validated for clinical deployment. | Moderate (Orange): potential clinical use, but external validation and clinical implementation are pending.
Color key: Green = external validation (multi-site or independent cohort); Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations); Red = no external validation/preprint/exploratory approach.
Table A7 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Yin et al., 2023 [7].
The overall PROBAST judgment for this study is:
High (Red) RoB, with Moderate (Orange) concerns for applicability.
Justification is as follows:
High RoB arises mainly from the model development and analysis domain, where extensive algorithm testing (114 model combinations) and limited validation increase the likelihood of overfitting.
The population is retrospectively selected from heterogeneous GEO datasets, adding potential selection and batch-related biases.
Applicability concerns are moderate because the predictors (necroptosis-related genes) and outcomes (biopsy-confirmed IFTA and graft survival) are clinically relevant, but external generalizability to broader or prospective clinical populations remains uncertain.
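The overfitting concern behind the high RoB rating (screening 114 algorithm combinations and reporting the best) can be made concrete with a short simulation: even when every candidate's scores are pure noise, the maximum AUC across many candidates drifts well above the chance level of 0.5. The cohort size below is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n, n_models = 60, 114                 # small cohort, many candidate pipelines
y = rng.integers(0, 2, size=n)        # true labels

# Each "model" outputs random scores, yet the best-of-114 AUC looks impressive.
best_auc = max(roc_auc_score(y, rng.random(n)) for _ in range(n_models))
print(f"best-of-{n_models} AUC on pure noise: {best_auc:.2f}")  # typically ~0.65-0.70
```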

Appendix B

This appendix contains details and explanations supplemental to the subsection “4.2. Application of AI-based models to different urinary biomarkers for early detection, risk stratification, and monitoring of CKD progression” of the main text. Presenting the quality assessment of the included study articles [26] through [30] with the PROBAST and QUADAS-2 tools there would disrupt the flow of the main text; however, this discussion is crucial to understanding the overall significance of these studies.
Table A8. PROBAST assessment for Bienaime et al., 2023 [26].
Domain/Item | Details | RoB/Concern
Population | Participants were 229 adults with chronic kidney disease (mean age 61 years; 66% male; mean baseline mGFR 38 mL/min) from the prospective NephroTest cohort. Fast CKD progression was defined as >10% annual mGFR decline. The cohort is well-characterized, but the subsample size is modest and may not reflect the full CKD spectrum. | Low to Moderate (Yellow): Prospective and clinically relevant, but the limited sample and single cohort reduce representativeness.
Index Model | A LASSO logistic regression model combining five urinary biomarkers (CCL2, EGF, KIM1, NGAL, and TGF-α) with clinical variables (age, sex, mGFR, albuminuria) to predict fast CKD progression. Model selection used repeated resampling (100 iterations). | Moderate (Orange): LASSO penalization reduces overfitting risk, but internal-only validation and data-driven selection of biomarkers may inflate performance estimates.
Comparator/Reference Model | The Kidney Failure Risk Equation (KFRE) variables (age, sex, mGFR, albuminuria) served as the baseline comparator for performance evaluation. | Low (Green): Comparator is appropriate, widely accepted, and clinically meaningful.
Outcome | Fast CKD progression, defined as a >10% decline per year in GFR measured by ⁵¹Cr-EDTA clearance, a gold-standard assessment. | Low (Green): Objective and precise measurement of kidney function minimizes outcome misclassification.
Timing | Predictor and outcome data came from a prospective follow-up design within the NephroTest cohort. Urine biomarkers and clinical variables were measured at baseline; outcomes were observed longitudinally. | Low (Green): Clear temporal sequence supports valid prediction; prospective data collection minimizes bias.
Setting | Conducted in a clinical research cohort of CKD patients under nephrology care at French academic hospitals (NephroTest). Laboratory-based ELISA assays underwent rigorous FDA-standard validation prior to modeling. | Low (Green): Well-controlled research setting; consistent sample handling and assay validation.
Intended Use of Prediction Model | The model aims to improve risk stratification for CKD progression beyond standard clinical variables by adding validated urinary biomarkers, potentially guiding early intervention and follow-up intensity. | Low (Green): Intended use is clinically relevant and aligned with current CKD management goals.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A8 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Bienaime et al., 2023 [26].
The overall PROBAST judgment for this study is:
Moderate (Orange) Overall RoB, with Low (Green) Applicability Concern.
Justification is as follows:
The main limitation of this study lies in the modeling domain, where internal validation and data-driven biomarker selection raise a moderate risk of overfitting. Applicability is strong: the predictors, outcomes, and setting reflect real-world nephrology practice, making the findings highly relevant, though they still require external validation in independent CKD populations.
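For readers who want the shape of the modeling step, the sketch below pairs an L1-penalized (LASSO) logistic regression with 100 resampling iterations and tallies how often each predictor survives, mirroring the stability-oriented selection described above. It uses scikit-learn on synthetic data; the feature names follow the study, but the values, penalty strength, and resampling scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit

features = ["CCL2", "EGF", "KIM1", "NGAL", "TGF-a", "age", "sex", "mGFR", "albuminuria"]
rng = np.random.default_rng(0)
X = rng.normal(size=(229, len(features)))   # 229 NephroTest participants (synthetic values)
y = rng.integers(0, 2, size=229)            # fast progression: yes/no

selected = np.zeros(len(features))
splitter = ShuffleSplit(n_splits=100, train_size=0.8, random_state=0)
for train_idx, _ in splitter.split(X):
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    lasso.fit(X[train_idx], y[train_idx])
    selected += (lasso.coef_.ravel() != 0)  # count non-zero coefficients per feature

for name, freq in zip(features, selected / splitter.get_n_splits()):
    print(f"{name}: selected in {freq:.0%} of resamples")
```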
Table A9. PROBAST assessment for Pizzini et al., 2017 [27].
Domain/Item | Details | RoB/Concern
Population | 118 adult CKD patients (mean age 62 ± 11 years; 59% male; mean eGFR ≈ 35 mL/min/1.73 m²) from a single nephrology center in Reggio Calabria, Italy. Follow-up: 3 years. Outcome: composite renal endpoint (eGFR decline > 30%, dialysis, or transplantation). Pilot cohort, relatively small sample size, and unclear recruitment method. | Moderate (Orange): Clinically relevant CKD population but a small, single-center, and possibly non-representative sample.
Index Model | Composite tubular risk score derived from urinary NGAL, Uromodulin, and KIM-1 (binary: above/below median). Developed via multiple Cox regression, later combined with eGFR in an integrated model. Internal performance assessed via Harrell's C-index (0.79 vs. 0.77 for eGFR alone). | Moderate (Orange): Simple derivation method, but internal-only validation, small sample, and data-driven thresholding raise overfitting risk.
Comparator/Reference Model | eGFR-based model (single predictor) used as the reference comparator for assessing incremental prognostic value. | Low (Green): eGFR is a gold-standard clinical reference for kidney function.
Outcome | Composite renal outcome: >30% eGFR decline, dialysis, or transplantation during 3 years. Objectively defined and clinically relevant. | Low (Green): Outcome is standardized, measurable, and clinically meaningful.
Timing | Prospective follow-up of 3 years; predictors (urinary biomarkers) measured at baseline, outcome assessed longitudinally. | Low (Green): Appropriate temporal relationship between predictors and outcome.
Setting | Academic nephrology and renal transplantation unit; research laboratory with validated urinary biomarker assays. | Low (Green): Controlled clinical and analytical environment ensures reliable data quality.
Intended Use of the Prediction Model | Early risk stratification of CKD patients for rapid progression or kidney failure, to complement eGFR-based clinical prediction tools. | Low (Green): Intended use aligns with nephrology practice and an unmet clinical need.
Color key: Green = external validation (multi-site or independent cohort); Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A9 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Pizzini et al., 2017 [27].
The overall PROBAST judgment for this study is:
Moderate (Orange) Overall RoB with Low (Green) Applicability Concern.
Justification is as follows:
The single-center design, small sample size, and lack of external validation introduce a moderate RoB, particularly in model development and analysis. Applicability concerns are low, as the predictors (urinary NGAL, Uromodulin, KIM-1) and outcomes reflect real-world CKD progression assessment.
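The modeling pattern here, a Cox regression over median-dichotomized urinary biomarkers judged by Harrell's C-index, can be sketched with the `lifelines` package as below. The data frame is synthetic and the covariate set is a loose analogue of the study's score, so the printed C-index will not reproduce the reported 0.79.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
n = 118
df = pd.DataFrame({
    "NGAL_high": rng.integers(0, 2, size=n),   # above/below-median indicators
    "UMOD_low":  rng.integers(0, 2, size=n),
    "KIM1_high": rng.integers(0, 2, size=n),
    "eGFR":      rng.normal(35, 10, size=n),
    "time":      rng.exponential(3.0, size=n), # years to event or censoring
    "event":     rng.integers(0, 2, size=n),   # composite renal endpoint reached
})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print("Harrell's C-index:", cph.concordance_index_)
```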
Table A10. PROBAST assessment for Qin et al., 2019 [28].
Domain | Description | RoB | Applicability Concern
Population | 1053 hospitalized adults with type 2 diabetes; after propensity score matching (PSM), 500 (250 DKD, 250 non-DKD). All had eGFR ≥ 60 mL/min/1.73 m². Other kidney or systemic diseases were excluded. Hospital-based inpatient cohort, not representative of general outpatient T2DM populations. | Moderate (Yellow) | High (Red): the inpatient sample limits generalizability to screening or primary-care settings.
Index Model/Predictors | Six urinary biomarkers measured once: transferrin (TF), immunoglobulin G (IgG), retinol-binding protein (RBP), β-galactosidase (GAL), N-acetyl-β-glucosaminidase (NAG), and β2-microglobulin (β2MG). Each was assessed individually with logistic regression and ROC AUC; no multivariable or externally validated model. | High (Red): simple univariable analysis; no validation or adjustment for overfitting. | Moderate (Yellow): biomarkers are clinically measurable but not yet standardized for DKD diagnosis.
Comparator Model/Reference Standard | 24-h urinary albumin excretion (UAE ≥ 30 mg/24 h) as the gold standard for DKD. It overlaps mechanistically with some predictors, introducing incorporation bias. | High (Red): predictor-outcome dependency likely inflates AUCs. | Moderate (Yellow): UAE is widely accepted but not ideal as an early-stage DKD reference.
Outcome | Presence of DKD (vs. normoalbuminuric), defined cross-sectionally by 24-h UAE and eGFR ≥ 60. No longitudinal follow-up. | Moderate (Yellow): objective lab-based outcome, but it lacks a temporal dimension. | Moderate (Yellow): relevant to early DKD diagnosis but not progression prediction.
Timing | Cross-sectional; biomarkers and UAE measured concurrently during hospitalization (no temporal validation). | High (Red) | High (Red): not predictive; diagnostic only.
Setting | Single tertiary hospital (Tianjin Medical University Chu Hsien-I Memorial Hospital), China, 2018. All assays performed in the hospital laboratory. | Moderate (Yellow) | High (Red): single-center; potential institutional bias.
Intended Use of Prediction Model | Exploratory diagnostic discrimination to identify DKD among known T2DM inpatients with preserved eGFR. Not a prognostic or screening model. | Moderate (Yellow): appropriate for hypothesis generation only. | High (Red): limited use beyond the internal diagnostic context.
Overall Judgment | Cross-sectional single-center study with internal ROC analysis only. The high internal performance (RBP AUC 0.92) is likely optimistic. No calibration, temporal, or external validation was performed. | High (Red) overall RoB | High (Red) applicability concern
Color key: Yellow = internal split or temporal validation only; Red = no external validation/preprint/exploratory approach.
Table A10 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Qin et al., 2019 [28].
This cross-sectional diagnostic study assessed six urinary biomarkers for detecting DKD among hospitalized adults with type 2 diabetes in Tianjin, China. Using 24 h urinary albumin excretion as the reference, RBP, TF, and IgG showed the best discrimination (AUCs of 0.92, 0.87, and 0.87, respectively). However, methodological appraisal with PROBAST indicates a high overall RoB due to the cross-sectional design, incorporation bias between biomarkers and outcome, and absence of external validation. Applicability is limited to hospital-based diagnostic research settings rather than predictive clinical screening or community-based use.
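The study's univariable design (one logistic regression and one ROC AUC per biomarker, with no multivariable model or held-out evaluation) amounts to the loop sketched below in scikit-learn. All values are synthetic, so the in-sample AUCs it prints illustrate only the shape of the analysis, including why such apparent performance can be optimistic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

biomarkers = ["TF", "IgG", "RBP", "GAL", "NAG", "b2MG"]
rng = np.random.default_rng(7)
X = rng.normal(size=(500, len(biomarkers)))   # 250 DKD + 250 non-DKD after matching
y = np.repeat([1, 0], 250)                    # DKD status

for j, name in enumerate(biomarkers):
    xj = X[:, [j]]                             # one biomarker at a time
    scores = LogisticRegression().fit(xj, y).predict_proba(xj)[:, 1]
    print(f"{name}: apparent AUC = {roc_auc_score(y, scores):.2f}")  # in-sample, optimistic
```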
Table A11. QUADAS-2 RoB and applicability concerns for Schanstra et al., 2015 [29].
Domain | Description | RoB | Applicability Concern
Patient Selection | Large multicenter cohort (n = 1990) including CKD patients across stages and healthy/at-risk controls; a subset (n = 522) had longitudinal follow-up for progression. Inclusion criteria are broad and clinically relevant. It is unclear whether sampling was consecutive or random, though selection likely minimized spectrum bias by including multiple CKD etiologies. | Low to Moderate (Yellow) | Low (Green): representative CKD and at-risk populations suitable for diagnostic and prognostic use.
Index Test | Urinary multi-peptide biomarker classifier derived from proteomic analysis; the algorithm was validated internally and externally across centers. The index test was interpreted blinded to reference measures (as implied) and uses objective mass-spectrometry-based quantification. | Low (Green): standardized proteomic measurement, objective analysis. | Low (Green): the proteomic test is applicable to the intended CKD risk stratification setting.
Reference Standard | Clinical CKD diagnosis and progression assessed by eGFR decline and/or albuminuria according to accepted criteria (Kidney Disease: Improving Global Outcomes/KDIGO). Both objective and reproducible. However, albuminuria overlaps mechanistically with the index peptides, introducing some dependency. | Moderate (Orange): potential incorporation bias from overlapping filtration markers. | Low (Green): consistent with clinical standards for CKD staging and progression.
Flow and Timing | Cross-sectional design for detection, with a subset followed longitudinally (n = 522) for progression; uniform application of index and reference tests at baseline; consistent follow-up for the outcome. | Low (Green): flow appropriate and timing consistent. | Low (Green): progression analysis based on follow-up data aligns with the intended use.
Overall RoB | Generally robust multicenter design with standardized proteomic measurement and appropriate statistical validation. Minor risk from partial overlap between index and reference measures. | Low (Green) overall | Low (Green) overall
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Orange = limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A11 presents the application of the QUADAS-2 tool for the RoB and applicability assessment of one of the latter two studies [29], which utilizes urinary biomarkers as input variables for AI-based diagnostic tools for the early detection, risk stratification, and monitoring of CKD.
QUADAS-2 appraisal indicates an overall low RoB and good applicability, supported by rigorous proteomic quantification and validation across multiple centers. Minor concerns remain about partial incorporation bias since the reference standard (albuminuria, eGFR) overlaps biologically with some peptides. The study provides strong diagnostic and prognostic evidence for urinary proteome classifiers as complementary CKD risk stratification tools.
Table A12. QUADAS-2 for RoB and applicability assessment for Muiru et al., 2021 [30].
Domain | Description | RoB | Applicability Concerns
1. Patient Selection | Participants were drawn from the WIHS prospective cohort of women with HIV; inclusion required preserved kidney function (eGFR ≥ 60 mL/min/1.73 m²) and paired urine samples. | Low to Moderate (Yellow): selection limited to relatively healthy women, potentially introducing bias. | Moderate (Yellow): mostly middle-aged Black women with HIV; may not represent general CKD or HIV-positive male populations.
2. Index Test (Urine Biomarkers) | 14 urine biomarkers measured in duplicate using standardized multiplex assays; results analyzed as continuous standardized values without diagnostic thresholds. | Low to Moderate (Yellow): laboratory methods robust, but no pre-specified diagnostic cut-offs. | Some concern (Yellow): biomarkers used as exploratory indicators, not validated diagnostic tests.
3. Reference Standard | No true diagnostic "gold standard" for CKD; comparisons made to CKD risk factors (HbA1c, BP, viral load, etc.) rather than a confirmed CKD diagnosis. | High (Red): the absence of a defined reference standard limits diagnostic accuracy assessment. | High (Red): reference variables do not constitute a diagnostic criterion for CKD.
4. Flow and Timing | Baseline and follow-up urine and serum specimens obtained 2.5 years apart for all 647 participants; consistent measurements across time points. | Low (Green): clear temporal structure and uniform application of tests. | Low (Green): appropriate interval and consistent follow-up across participants.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Red = no external validation/preprint/exploratory approach.
The study demonstrates rigorous biomarker measurement and statistical modeling, but it is more exploratory (prognostic/associative) than diagnostic, so QUADAS-2 applies only partially. We therefore applied QUADAS-2 principles to evaluate quality for diagnostic/biomarker inference. Since the QUADAS-2 framework evaluates RoB and applicability concerns across four key domains, we explored how this study aligns with each domain:
Table A13. QUADAS-2 for RoB and applicability assessment for Muiru et al., 2021 [30], Domain 1: Patient Selection.
Criterion | Assessment | Justification
Was a consecutive or random sample of participants enrolled? | Low risk (Green) | Participants were drawn from the ongoing prospective WIHS cohort with standardized sampling; there is no indication of selective inclusion beyond the availability of paired samples.
Was a case–control design avoided? | Low risk (Green) | This was a cohort study, not a case–control design.
Did the study avoid inappropriate exclusions? | Some concern (Yellow) | Only women with preserved kidney function (eGFR ≥ 60 mL/min/1.73 m²) and available serial samples were included, which may bias results toward healthier individuals.
Applicability concern | Moderate (Yellow) | The population (middle-aged women with HIV, mostly Black) may not be fully representative of broader CKD or HIV populations (e.g., men, advanced CKD).
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Overall RoB for Domain 1: Low to moderate (Yellow).
Table A14. QUADAS-2 for RoB and applicability assessment for Muiru et al., 2021 [30], Domain 2: Index Test(s).
Criterion | Assessment | Justification
Were the index tests conducted and interpreted without knowledge of the reference standard? | Low risk (Green) | Biomarker assays were measured objectively and blinded to clinical data (implied by standard lab procedures).
Were thresholds pre-specified? | Unclear (Yellow) | Biomarker changes were analyzed as continuous variables (standardized β coefficients) without pre-specified diagnostic cut-offs.
Was the test execution standardized and reproducible? | Low risk (Green) | Detailed lab methods (duplicate runs, low CVs, standardized assays) were described.
Applicability concern | Some concern (Yellow) | Biomarkers are explored as research tools rather than validated clinical diagnostic tests; their interpretability in diagnostic terms is limited.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Overall RoB for Domain 2: Low to moderate (Yellow).
Table A15. QUADAS-2 for RoB and applicability assessment for Muiru et al., 2021 [30], Domain 3: Reference Standard.
Criterion | Assessment | Justification
Is the reference standard likely to correctly classify the target condition? | High risk (Yellow) | The study does not use a clinical diagnosis or gold-standard CKD outcome, only risk factors and biomarker change correlations.
Were the reference standard results interpreted without knowledge of the index test results? | Low risk (Green) | Risk factors (HbA1c, blood pressure, HIV viral load, etc.) were measured independently of the biomarkers.
Applicability concern | High (Yellow) | The "reference standard" here is not a diagnostic truth measure, so diagnostic accuracy cannot be directly evaluated.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Overall RoB for Domain 3: High (Red) (not applicable as a true diagnostic accuracy study).
Table A16. QUADAS-2 for RoB and applicability assessment for Muiru et al., 2021 [30], Domain 4: Flow and Timing.
Criterion | Assessment | Justification
Was there an appropriate interval between the index test and reference standard? | Low risk (Green) | Biomarkers and CKD risk factors were measured at baseline and 2.5 years later, a consistent temporal design.
Did all participants receive the same reference standard? | Low risk (Green) | All participants had the same clinical data and laboratory procedures applied.
Were all participants included in the analysis? | Low risk (Green) | 647 women were analyzed; there is no evidence of selective dropout or incomplete-data bias.
Applicability concern | Low (Green) | Design and follow-up are consistent with the study aims.
Color key: Green = external validation (multi-site or independent cohort).
Overall RoB for Domain 4: Low
Overall RoB for this study: Moderate to High
Overall Applicability of this study: Moderate
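One simple way to formalize the roll-up applied above is to place the domain ratings on an ordinal scale and take the worst rating as the overall judgment. The scale and the worst-domain rule in the sketch below are common conventions for summarizing QUADAS-2, not part of the tool itself; as noted, this review tempers the strict result to "moderate to high" given the framework's partial applicability here.

```python
# Ordinal scale for RoB ratings, from least to most concerning (a convention).
RATING_ORDER = ["low", "low-moderate", "moderate", "moderate-high", "high"]

def overall_judgment(domain_ratings: dict) -> str:
    """Return the worst (highest-risk) rating across QUADAS-2 domains."""
    return max(domain_ratings.values(), key=RATING_ORDER.index)

muiru_rob = {
    "patient_selection":  "low-moderate",
    "index_test":         "low-moderate",
    "reference_standard": "high",
    "flow_and_timing":    "low",
}
print(overall_judgment(muiru_rob))  # "high" under the strict worst-domain rule
```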
Since this study is methodologically sound for associative/prognostic biomarker research but not a diagnostic accuracy study, and QUADAS-2 applies only partially, we also used the PROBAST tool to assess RoB and the applicability of this study.
Table A17. PROBAST quality assessment for Muiru et al., 2021 [30].
Domain | Description | RoB | Applicability Concerns
Participants/Population | 647 women living with HIV from the U.S. Women's Interagency HIV Study (WIHS). Inclusion required two urine samples and preserved kidney function (eGFR ≥ 60 mL/min/1.73 m²). The majority were middle-aged and Black (67%). | Some concern (Yellow): selection limited to women with preserved renal function; may not represent patients with advanced CKD or male populations. | Moderate (Yellow): findings apply mainly to women with HIV and may not generalize to all HIV or CKD populations.
Index Model | Multivariable penalized regression model (MSG-LASSO) and simultaneous linear equations assessing associations between CKD risk factors and longitudinal changes in 14 urine biomarkers. | Some concern (Yellow): robust internal modeling, but no external or temporal validation; unclear internal validation (e.g., bootstrapping). | Moderate (Yellow): model exploratory, not clinically implemented; predictive performance not reported.
Comparator Model | None; no existing or alternative predictive model was used for comparison. The focus was on evaluating associations, not on model performance metrics. | High (Red): the absence of a comparator limits interpretation of predictive improvement or incremental value. | Some concern (Yellow): not designed for model comparison or validation.
Outcome | Longitudinal changes in kidney tubular and glomerular biomarkers (e.g., KIM-1, IL-18, UMOD, α1m, β2m). No hard kidney outcome (CKD progression, eGFR decline) was assessed. | Some concern (Yellow): surrogate outcomes, not direct measures of kidney disease progression or patient-level endpoints. | Moderate (Yellow): outcome biologically meaningful but of limited clinical prediction utility.
Timing | Prospective longitudinal cohort; baseline biomarker and clinical data collected in 2009–2011 and repeated ~2.5 years later. | Low (Green): consistent and appropriate timing for evaluating longitudinal biomarker change. | Low (Green): timing consistent with biological plausibility for kidney biomarker change.
Setting | Multi-center U.S. observational cohort study (academic research settings). | Low (Green): standardized data collection and laboratory protocols reduce bias. | Some concern (Yellow): research setting may differ from clinical practice environments.
Intended Use of Predictive Model | Exploratory: to identify CKD risk factors associated with biomarker changes and to inform future development of kidney disease detection algorithms in HIV. | Some concern (Yellow): model not yet developed for clinical prediction; exploratory by design. | Moderate (Yellow): informative for biomarker research, but not directly applicable to clinical prediction or screening.
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Red = no external validation/preprint/exploratory approach.
The study demonstrates strong internal validity and robust measurement methods but is limited by a lack of external validation, a restricted population, and the use of biomarkers as surrogate outcomes. It is best viewed as a hypothesis-generating prognostic analysis rather than a finalized predictive model.
Overall RoB: Moderate
Overall Applicability: Moderate
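As a rough analogue of the multi-outcome penalized regression used here (MSG-LASSO), scikit-learn's `MultiTaskLasso` fits all 14 biomarker-change outcomes jointly with a shared sparsity pattern across risk factors. The sketch below is an assumed stand-in, not the authors' exact estimator, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n, n_risk_factors, n_biomarkers = 647, 10, 14
X = StandardScaler().fit_transform(rng.normal(size=(n, n_risk_factors)))  # CKD risk factors
Y = rng.normal(size=(n, n_biomarkers))   # standardized biomarker changes over ~2.5 years

model = MultiTaskLasso(alpha=0.05).fit(X, Y)        # joint sparsity across all outcomes
kept = np.any(model.coef_ != 0, axis=0)             # risk factors retained for any biomarker
print("risk factors retained:", int(kept.sum()), "of", n_risk_factors)
```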

Appendix C

This appendix contains details and explanations supplemental to the subsection “4.3. Application of AI-based models to different clinical, metabolomic, or transcriptomic data for monitoring the progression of DN” of the main text. Presenting the quality assessment of the included study articles [15] and [41] through [43] with the PROBAST and QUADAS-2 tools there would disrupt the flow of the main text; however, this discussion is crucial to understanding the overall significance of these studies.
Table A18. PROBAST assessment for Yin et al., 2024 [15].
Domain | Description | RoB | Applicability Concerns
Participants/Population | 548 diabetic patients (from 1024 total) enrolled between April 2018 and April 2019 at the Second Affiliated Hospital of Dalian Medical University. Data included demographic, laboratory, clinical, and metabolomic features. Patients with >50% missing data were excluded. | Some concern (Yellow): single-center dataset, potential selection bias from the missing-data exclusion, unclear external representativeness. | Moderate concern (Yellow): findings may not generalize beyond a Chinese tertiary-care population.
Index Model | Machine learning algorithms (XGB, RF, DT, logistic regression) trained using 38 LASSO-selected predictors; 10-fold cross-validation used for internal validation. Best performance: XGB (AUC = 0.966). SHAP applied for model interpretation. | Some concern (Yellow): appropriate internal validation but possible overfitting; no external or temporal validation. | Low concern (Green): clearly defined, interpretable model, replicable in similar data contexts.
Comparator Model | Compared XGB to RF, DT, and logistic regression. No traditional clinical risk model (e.g., albuminuria-based) was used for benchmarking. | Some concern (Yellow): comparison limited to internal ML models; the lack of a benchmark limits context for clinical improvement. | Moderate concern (Yellow): comparators appropriate for model development, but not for clinical performance comparison.
Outcome | Presence or absence of diabetic nephropathy (DN), defined by standard clinical/laboratory criteria (e.g., albuminuria, eGFR). Outcome extracted from the EMR. | Low risk (Green): objective, clinically established outcome from a standardized data source. | Low concern (Green): outcome relevant for DN screening in diabetes populations.
Timing | Retrospective cross-sectional dataset (April 2018–April 2019); no follow-up or temporal validation. | Some concern (Yellow): unclear predictor-outcome chronology; possible temporal bias. | Moderate concern (Yellow): timing acceptable for diagnostic screening, but it limits prognostic inference.
Setting | Single tertiary-care hospital (SAHDMU, China). Data collection standardized via hospital EMR and biochemical protocols. | Low risk (Green): consistent data acquisition, uniform diagnostic methods. | Some concern (Yellow): applicability limited to hospital settings with similar laboratory infrastructure.
Intended Use of Predictive Model | Predict DN risk and assist early screening using serum metabolite and clinical parameters to guide preventive care. | Some concern (Yellow): exploratory model; not externally validated or integrated into a clinical workflow. | Moderate concern (Yellow): potential clinical value, but further validation is required for real-world implementation.
Overall Judgment | The model demonstrates strong internal performance and robust methodology but lacks external validation, multi-center data, and prospective evaluation. Suitable for exploratory and developmental research but not yet for clinical deployment. | Overall RoB: Moderate (Yellow) | Overall Applicability: Moderate (Yellow)
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Table A18 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Yin et al., 2024 [15].
This study presents a robust internally validated ML model (XGBoost) for DN prediction based on serum metabolites and clinical features. While methodological rigor (feature selection, cross-validation, SHAP interpretability) is strong, the absence of external validation, single-center sampling, and potential overfitting result in moderate overall bias.
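The reported pipeline shape (an XGBoost classifier over LASSO-selected predictors, assessed with 10-fold cross-validation and interpreted with SHAP) can be sketched as follows. It assumes the `xgboost` and `shap` packages are installed; the data, hyperparameters, and predictor matrix are synthetic placeholders, so the printed AUC will sit near chance rather than the study's 0.966.

```python
import numpy as np
import shap
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(548, 38))        # 38 LASSO-selected predictors (synthetic)
y = rng.integers(0, 2, size=548)      # DN present / absent

model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
print("10-fold CV AUC:", cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean())

# Fit once on all data, then attribute predictions to features with SHAP.
model.fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
print("SHAP value matrix shape:", np.asarray(shap_values).shape)
```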
Table A19. QUADAS-2 RoB and applicability assessment for Fan et al., 2025 [41].
Domain | Description | RoB | Applicability Concerns
Patient Selection | Datasets (GSE30122 [50], GSE30528 [51], GSE96804 [52]) included 60 DN and 70 normal control (NC) kidney tissue samples. Exclusion/inclusion criteria based on dataset definitions; batch correction performed to reduce variability. | Some concern (Yellow): retrospective, secondary dataset selection; unclear if all consecutive DN and NC samples were included; possible spectrum bias due to dataset curation. | Some concern (Yellow): data derived from public gene-expression repositories, not real-world clinical diagnostic populations.
Index Test | ML-based diagnostic model developed using glycolysis-related genes (GRGs) identified via WGCNA and feature selection (XGB performed best). Validation through an independent dataset (GSE142153 [53]) and single-cell RNA-seq. | Some concern (Yellow): model training and testing based on retrospective datasets; no blinding; potential overfitting; no pre-specified diagnostic threshold reported. | Low concern (Green): gene-expression-based model aligns with its molecular diagnostic purpose; methodologically appropriate.
Reference Standard | Diagnosis of DN and NC status as defined in the GEO datasets (based on histopathological or clinical diagnosis in the original studies). | Low risk (Green): standard diagnostic definitions likely applied in the source datasets. | Low concern (Green): reference standard appropriate for DN diagnosis.
Flow and Timing | Multiple GEO datasets combined; batch correction performed; cross-validation and external verification conducted using an independent dataset. The temporal relation between sample collection and analysis is unclear. | Some concern (Yellow): heterogeneous datasets with varying collection protocols and unknown blinding; no consistent sample-flow reporting. | Moderate concern (Yellow): applicability limited by cross-dataset integration and differences between tissue and blood validation samples.
Overall Judgment | Robust bioinformatics and ML pipeline integrating multi-cohort transcriptomic data and immune profiling. However, RoB remains due to the retrospective design, lack of prospective validation, and unclear blinding and diagnostic thresholds. | Overall RoB: Moderate (Yellow) | Overall Applicability: Moderate (Yellow)
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Table A19 is a QUADAS-2 table tailored to this study [41]; it summarizes the domains, descriptions, RoB, and applicability concerns, and includes an overall-judgment row with color-coded indicators (Green = low, Yellow = some concern, Red = high).
Table A20. PROBAST assessment for Hirakawa et al., 2022 [42].
Domain | Description | RoB | Applicability Concerns
Participants/Population | 150 DKD patients enrolled in the UT-DKD cohort (Japan); 135 completed follow-up (up to 30 months). Participants had a baseline eGFR of 30–60 mL/min/1.73 m². "Rapid decliners" were defined as ≥10% annual eGFR loss. | Some concern (Yellow): modest sample size; potential selection bias due to dropouts (10% loss); limited diversity (single-center Japanese cohort). | Moderate (Yellow): population specific to Japanese DKD patients with mid-stage disease; external generalizability limited.
Index Model | Deep learning approach integrating plasma and urinary metabolomic data with clinical variables; feature selection narrowed 3388 variables to 50; tenfold double cross-validation performed to identify the top predictors. | Some concern (Yellow): internal validation robust, but external validation absent; unclear model transparency or reproducibility; potential overfitting given the small sample and complex model. | Moderate (Yellow): explainable ML enhances interpretability, but implementation may vary by analytic infrastructure.
Comparator Model | Compared deep learning to logistic regression, RF, and support vector machine (SVM); evaluated via AUC and cross-validation. | Low (Green): inclusion of conventional and ML comparators strengthens credibility. | Low (Green): comparator choice appropriate for methodologic benchmarking.
Outcome | Rapid decline of kidney function (≥10% annual loss of baseline eGFR). Continuous eGFR data collected prospectively for ~30 months. | Low (Green): clearly defined, objective, clinically meaningful outcome. | Low (Green): outcome directly relevant to DKD progression prediction.
Timing | Prospective observational follow-up (30 months) with baseline metabolomic sampling and serial eGFR assessment. | Low (Green): appropriate timing between predictors and outcome; prospective design minimizes bias. | Low (Green): suitable for longitudinal prediction of decline.
Setting | Single academic center in Japan; standardized biospecimen and metabolomic workflow; quality control reported. | Low (Green): consistent laboratory and analytic procedures minimize bias. | Some concern (Yellow): single-center design limits generalizability to other healthcare settings or ethnic populations.
Intended Use of Predictive Model | Predict rapid renal function decline in DKD patients using integrated metabolomic-clinical data; exploratory use for biomarker discovery and clinical risk stratification. | Some concern (Yellow): not yet validated for clinical application or regulatory standards. | Moderate (Yellow): promising as a biomarker discovery tool, but not ready for deployment in patient management.
Overall Judgment | The study demonstrates careful internal validation and interpretable deep learning but is limited by the small sample size, single-center design, and lack of external or temporal validation. Results are exploratory and hypothesis-generating. | Overall RoB: Moderate (Yellow) | Overall Applicability: Moderate (Yellow)
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only.
Table A20 is a PROBAST table tailored to this study [42], following the expanded structure with Domains, Description, RoB, and Applicability Concerns, including Overall Judgment and color-coded indicators (Biomedinformatics 05 00067 i007 Low, Biomedinformatics 05 00067 i005 Some concern, Biomedinformatics 05 00067 i006 High).
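The “tenfold double cross-validation” reported for this study is what the ML literature calls nested cross-validation: an inner loop tunes the model while an outer loop estimates generalization, and feature screening is refit inside every training fold so the 3388-to-50 reduction never sees held-out data. Below is a minimal scikit-learn sketch of that pattern; the data are synthetic and the dimensions are illustrative assumptions matching the study's description, not its actual pipeline.

```python
# Minimal sketch of tenfold "double" (nested) cross-validation with
# feature selection kept inside the folds. Data (X, y) are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=3388, random_state=0)

# Screening and the classifier live in one pipeline so that selection
# is re-fit on every training fold (no information leakage).
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),   # 3388 -> 50 variables
    ("clf", LogisticRegression(max_iter=1000)),
])

inner = KFold(n_splits=10, shuffle=True, random_state=1)
outer = KFold(n_splits=10, shuffle=True, random_state=2)

# Inner loop tunes hyperparameters; outer loop estimates generalization.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                      cv=inner, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"nested-CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Keeping the selector inside the pipeline is what guards against the optimistic bias that PROBAST flags when feature selection is performed on the full dataset before validation.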
Table A21. PROBAST assessment for Zhang et al., 2022 [43].
Domain | Description (Evidence from Study) | RoB | Applicability Concerns
Participants/Population | 995 adults with diabetes from the Chronic Renal Insufficiency Cohort (CRIC), a large, racially diverse, multicenter U.S. cohort. Included CKD stages 3a-4 (eGFR 20–60 mL/min/1.73 m2). Random selection for metabolomics sub-study; median follow-up 8 years. | [Green] Low, representative, well-described cohort; random selection minimizes selection bias. | [Green] Low, broad CKD spectrum and racial diversity enhance generalizability to DKD populations.
Index Model | Multivariable penalized regression (lasso) and RF models predicting eGFR slope; 698 metabolites + 9 clinical covariates. λ-values selected by 10-fold cross-validation. Variable selection guided by performance and biological plausibility. | [Green] Low, transparent model development; overfitting minimized via cross-validation; clear reporting of variable selection and penalization. | [Green] Low, approach and predictors feasible in similar metabolomics datasets; clinically interpretable outputs.
Comparator Model | Compared multiple lasso and RF variants (with/without clinical variables). Also validated model findings via targeted metabolomics and C-statistics for time-to-ESRD. | [Green] Low, appropriate comparison and internal replication improve credibility. | [Green] Low, methods align with modern standards for omics model evaluation.
Outcome | Primary: annual eGFR slope (continuous). Secondary: time-to-ESRD (kidney failure or transplant). eGFR measured serially; ESRD adjudicated; analyses adjusted for censoring and competing risk. | [Green] Low, clearly defined, objective, clinically relevant outcomes with validated measurement methods. | [Green] Low, directly relevant to DKD progression prediction.
Timing | Baseline urine metabolomics with longitudinal follow-up for up to 8 years. eGFR slopes estimated from repeated measures; ESRD tracked prospectively. | [Green] Low, appropriate prospective design ensures temporal relationship between predictors and outcomes. | [Green] Low, suitable duration for modeling DKD progression.
Setting | Multicenter, U.S.-based, racially diverse cohort; standardized urine collection and LC-MS metabolomics pipeline; rigorous quality control (technical replicates, noise filtering, FDR correction). | [Green] Low, strong methodological rigor; consistent analytical standards. | [Green] Low, high applicability to U.S. clinical research and biobank settings.
Intended Use of Predictive Model | To identify novel urinary metabolites predictive of DKD progression and elucidate biological pathways; potential for future use in risk stratification and therapeutic targeting after external validation. | [Yellow] Some concern, currently exploratory; not validated externally or clinically implemented. | [Yellow] Some concern, clinical translation pending replication and assay standardization.
Overall Judgment | Robust, well-designed metabolomics prediction study with strong internal validity and methodological rigor. Minimal RoB, though external validation remains necessary before clinical use. | [Green] Overall RoB: Low | [Yellow] Overall Applicability: Some Concern (requires external validation)
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A21 is the complete PROBAST table summarizing the RoB and applicability for this study [43], including all domains: Participants/Population, Index Model, Comparator Model, Outcome, Timing, Setting, Intended Use, and Overall Judgment.
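For readers unfamiliar with the λ-selection step described in Table A21, the sketch below shows a lasso whose penalty is chosen by 10-fold cross-validation for a continuous eGFR-slope outcome. The data and dimensions are synthetic stand-ins shaped like the study's design (698 metabolites plus 9 clinical covariates), not the CRIC data.

```python
# Minimal sketch of lasso with lambda chosen by 10-fold CV for a
# continuous eGFR-slope outcome; all data are synthetic.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(995, 707))   # 698 metabolites + 9 clinical covariates
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=995)  # synthetic slope

lasso = LassoCV(cv=10, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # predictors with nonzero weight
print(f"lambda = {lasso.alpha_:.4f}, {selected.size} predictors retained")
```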

Appendix D

This appendix contains details and explanations supplemental to the subsection, “4.4. Validation of ML models for prediction of CKD and ESRD” of the main text. Explaining the quality assessment of the included study articles [24,31,32,33] using the PROBAST and QUADAS-2 tools in the main text would disrupt its flow; however, this discussion is crucial to understanding the overall significance of these studies.
Table A22. PROBAST assessment for Chan et al., 2021 [24].
Domain | Description/Evidence | RoB | Applicability Concerns
Participants/Population | 1146 adults with type 2 diabetes from BioMe Biobank and Penn Medicine Biobank; eGFR 30–59.9 mL/min/1.73 m2 or ≥60 with albuminuria ≥3 mg/mmol. Excluded pre-existing kidney transplant or dialysis. Follow-up median 4.3 years. | [Yellow] Some concern, multi-center, but selection may favor patients with biobank consent and available plasma; some missing baseline data imputed using ±1 year window. | [Yellow] Some concern, mostly U.S. urban centers; may not generalize to rural or non-U.S. populations.
Predictors/Index Model | RF model integrating EHR data (demographics, labs, comorbidities) + 3 plasma biomarkers (KIM-1, TNFR1, TNFR2). Cross-validation used in derivation. | [Yellow] Some concern, RF is complex; model performance may be overestimated in derivation set, though cross-validation was performed. | [Yellow] Some concern, use of proprietary biomarkers limits immediate clinical applicability; EHR integration feasible only where high-quality longitudinal data exist.
Comparator Model | Compared KidneyIntelX to clinical model (age, eGFR, uACR, comorbidities) and KDIGO risk categories. | [Green] Low, appropriate standard-of-care comparisons included. | [Green] Low, KDIGO widely used; clinical model generalizable.
Outcome | Composite kidney endpoint: ≥5 mL/min/yr eGFR decline, ≥40% sustained decline, or kidney failure within 5 years. Ascertainment from EHR and lab records; objective, clinically relevant. | [Green] Low, clear, validated, and meaningful outcome. | [Green] Low, aligns with intended clinical use for DKD progression.
Timing | Baseline predictors from biobank/EHR at enrollment; follow-up median 4.3 years; minimum 3 eGFR measures post-baseline. | [Green] Low, longitudinal follow-up sufficient to capture outcome events. | [Green] Low, follow-up duration clinically relevant for DKD progression.
Setting | Two large U.S. biobanks; plasma samples processed at a single lab with standardized biomarker assays. Assay precision verified; lab staff blinded to outcomes. | [Green] Low, standardized assay reduces measurement bias. | [Yellow] Some concern, biobank-based, urban population; may not generalize to broader outpatient or international settings.
Intended Use | Predict progressive kidney decline in DKD patients; stratify into low, intermediate, high-risk for clinical decision-making. | [Yellow] Some concern, proprietary tests; intended use is clinical, but real-world implementation not yet evaluated. | [Yellow] Some concern, risk score relies on biomarker availability and EHR integration.
Overall Judgment | Well-designed model with robust cross-validation and clear clinical outcome. Bias mainly in selection of participants, proprietary assay reliance, and limited external validation. | [Yellow] Some concern, low bias for outcome and predictor measurement; moderate bias in participant selection and model complexity. | [Yellow] Some concern, applicability mainly for U.S. biobank populations; broader implementation pending validation.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A22 is the complete PROBAST table summarizing the RoB and applicability for this study [24], including all domains: Participants/Population, Index Model, Comparator Model, Outcome, Timing, Setting, Intended Use, and Overall Judgment.
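KidneyIntelX-style scoring combines EHR features with biomarker levels in an RF and then cuts the predicted probability into low/intermediate/high strata. The sketch below is a hypothetical illustration of that pattern, with synthetic features standing in for the demographics, labs, and KIM-1/TNFR1/TNFR2 panel; it is not the proprietary model.

```python
# Minimal sketch of an RF risk model stratified into tertiles
# (low/intermediate/high); all features and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1146, 12))    # demographics, labs, 3 biomarkers (stand-ins)
y = rng.integers(0, 2, size=1146)  # composite kidney endpoint (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

risk = rf.predict_proba(X_te)[:, 1]               # predicted 5-year risk
lo, hi = np.quantile(risk, [1 / 3, 2 / 3])        # tertile cut-points
strata = np.digitize(risk, [lo, hi])              # 0=low, 1=intermediate, 2=high
print(np.bincount(strata))
```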
Table A23. PROBAST assessment for Ferguson et al., 2022 [31].
Domain | Description/Evidence | RoB | Applicability Concerns
Participants/Population | Adult individuals (≥18 years) with available outpatient eGFR and urine ACR data in Manitoba (derivation, n = 77,196) and Alberta (validation, n = 107,097). Excluded those with kidney failure at baseline. Large, representative population-based cohorts covering almost all residents in each province. | [Green] Low, Comprehensive provincial databases minimize selection bias; inclusion criteria clear and reproducible. | [Green] Low, Broad CKD population across all G1-G5 stages; representative of routine clinical care.
Predictors/Index Model | RF survival model using 22 variables (age, sex, eGFR, ACR, plus 18 lab tests from chemistry, hematology, and liver panels). Missing data handled by Ishwaran RF imputation. | [Yellow] Some concern, Although advanced imputation used, model complexity (RF) may obscure predictor interpretability and calibration across subgroups. | [Green] Low, Uses routinely collected lab data; readily available in most health systems.
Comparator Model | Compared with (i) KDIGO-like Cox model (heatmap model) and (ii) clinical Cox model including age, sex, eGFR, ACR, diabetes, and CVD. | [Green] Low, Appropriate and well-established comparators included. | [Green] Low, Comparators are clinically relevant benchmarks.
Outcome | Composite endpoint: ≥40% sustained eGFR decline or kidney failure (dialysis, transplant, or eGFR < 10 mL/min/1.73 m2). Confirmed with follow-up tests (90 days–2 years) or death within 90 days after decline. Objective, reproducible, and validated outcome definition. | [Green] Low, Hard clinical outcomes derived from standardized lab and administrative data. | [Green] Low, Outcome clinically meaningful and applicable across CKD stages.
Timing | Baseline predictors averaged from eGFR and lab values over 6 months before index date; follow-up to 5 years. External validation performed over comparable time periods (2009–2016). | [Green] Low, Adequate follow-up duration for CKD progression. | [Green] Low, Consistent time windows and clinical relevance.
Setting | Population-level administrative and laboratory databases in Manitoba and Alberta, Canada. Laboratory tests are centralized and standardized. Ethics approval and deidentified data use. | [Green] Low, Real-world data minimizes bias from selective recruitment; centralized labs ensure measurement consistency. | [Yellow] Some concern, Applicability may vary in countries lacking centralized lab data or universal healthcare systems.
Intended Use of Predictive Model | To predict CKD progression (40% eGFR decline or kidney failure) across all CKD stages using routine labs and demographics (no special biomarkers). Intended for clinical decision support and integration into EHR systems. | [Green] Low, Clear intended use with feasible data inputs; internal and external validation performed. | [Green] Low, Widely applicable to clinical practice given use of standard labs.
Overall Judgment | Large, transparent, and rigorously validated model with strong performance (AUC 0.87–0.88 internal, 0.84–0.87 external). Minimal bias due to representative cohorts and objective outcome ascertainment. Some concerns remain around interpretability and external generalizability outside Canada. | [Green] Low RoB | [Yellow] Some concern, Highly applicable to systems with similar data infrastructure; may need calibration elsewhere.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A23 is the complete PROBAST table summarizing the RoB and applicability for this study [31], including all domains: Participants/Population, Index Model, Comparator Model, Outcome, Timing, Setting, Intended Use, and Overall Judgment.
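The RF survival model described in Table A23 belongs to the random-survival-forest family. The sketch below assumes the third-party scikit-survival package and synthetic data shaped like the 22-variable predictor set; it illustrates the model family, not the authors' implementation.

```python
# Minimal sketch of a random survival forest for a composite CKD
# endpoint, assuming the scikit-survival package; data are synthetic.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 22))                     # age, sex, eGFR, ACR + 18 labs
time = rng.exponential(scale=5.0, size=500)        # years to event/censoring
event = rng.integers(0, 2, size=500).astype(bool)  # decline or kidney failure
y = Surv.from_arrays(event=event, time=time)       # structured survival target

rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=15, random_state=0)
rsf.fit(X, y)

# Concordance index: probability that higher predicted risk pairs with
# earlier observed events (0.5 = chance).
c_index = concordance_index_censored(event, time, rsf.predict(X))[0]
print(f"concordance index: {c_index:.3f}")
```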
Table A24. PROBAST assessment for Tangri et al., 2024 [32].
Domain | Description/Evidence | RoB | Applicability Concerns
Participants/Population | 14,464 adults with type 2 diabetes from the CANVAS Program (n ≈ 10,000) and CREDENCE trial (n ≈ 4400), both large multinational RCTs. Participants had broad eGFR (≥30 mL/min/1.73 m2) and ACR ranges. Inclusion criteria were clear, and patients provided informed consent. | [Green] Low, Randomized, well-characterized cohorts minimize selection bias. | [Yellow] Some concern, Population restricted to type 2 diabetes in clinical trial settings, possibly healthier and more adherent than routine practice.
Predictors/Index Model | The Klinrisk random survival forest model using 20+ routine lab values (e.g., eGFR, ACR, urea, albumin, glucose, Hb, electrolytes, etc.) plus demographics (age, sex). Missing data imputed via RF imputation. | [Green] Low, Predictors objectively measured; minimal missingness (<3%, except glucose). Transparent pre-specified variable list from prior study. | [Green] Low, Uses standard clinical lab parameters routinely collected in care.
Comparator Model | Compared to KDIGO eGFR-ACR heatmap categories (G1-G5/A1-A3) and a refitted Cox model (KFRE-like) using age, sex, eGFR, ACR. | [Green] Low, Comparators appropriate, established, and relevant to current clinical practice. | [Green] Low, Directly comparable to clinical risk tools used worldwide.
Outcome | Primary outcome: ≥40% sustained eGFR decline or kidney failure (dialysis, transplant, or eGFR < 15 mL/min/1.73 m2). Adjudicated outcomes from trial datasets; confirmation of decline on repeat testing. Secondary: eGFR slope. | [Green] Low, Objective, reproducible, and adjudicated outcome definitions. | [Green] Low, Outcomes align with KDIGO and FDA-recommended endpoints for CKD trials.
Timing | Baseline defined at randomization; eGFR recalculated using CKD-EPI 2009. Follow-up median 2.4 years (range 1–3 years). Sensitivity analyses accounted for early eGFR dip due to SGLT2i effects (week 6–13). | [Green] Low, Consistent follow-up and timing across studies; clear baseline and outcome assessment windows. | [Green] Low, Time frame appropriate for CKD progression prediction.
Setting | International, multicenter, double-blind RCTs across 24–34 countries. Central labs used, standardized measurements. Data quality is high and rigorously monitored. | [Green] Low, Data completeness, quality control, and outcome adjudication robust. | [Yellow] Some concern, Controlled trial environment may limit generalizability to routine care or non-trial populations.
Intended Use of Predictive Model | Intended to predict CKD progression in patients with or at risk of CKD (including early stages), using only routinely available lab and demographic data. Potential for integration into EMR or LIS systems to support risk-based care and SGLT2i allocation. | [Green] Low, Clearly defined purpose; model feasible for integration. | [Green] Low, Broadly applicable in clinical decision support if similar data infrastructure exists.
Overall Judgment | The validation demonstrates good-to-excellent discrimination (AUC 0.81–0.88) and low Brier scores, outperforming KDIGO classification. Calibration is adequate though slightly overpredictive in canagliflozin-treated patients. Minimal missing data and robust statistical handling. Some generalizability concerns due to trial-based cohort and model trained on Canadian data. | [Green] Low RoB | [Yellow] Some concern, Excellent validation but restricted to SGLT2i trial population with type 2 diabetes; may need recalibration for general CKD populations.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A24 is the complete PROBAST table summarizing the RoB and applicability for this study [32], including all domains: Participants/Population, Index Model, Comparator Model, Outcome, Timing, Setting, Intended Use, and Overall Judgment.
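The discrimination and calibration metrics cited in Table A24 (AUC, Brier score, calibration) can be computed on any held-out risk predictions. A minimal sketch with synthetic, hypothetical predictions:

```python
# Minimal sketch of discrimination (AUC) and calibration (Brier score,
# calibration curve) checks; y_true/y_prob are synthetic stand-ins for
# held-out labels and predicted risks.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)      # predicted 3-year risk
y_true = rng.binomial(1, y_prob)     # synthetic, perfectly calibrated labels

print(f"AUC   = {roc_auc_score(y_true, y_prob):.3f}")
print(f"Brier = {brier_score_loss(y_true, y_prob):.3f}")

# Calibration curve: observed event rate per bin of predicted risk.
obs, pred = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(pred, obs):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```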
Table A25. PROBAST assessment for Zou et al., 2022 [33].
Domain | Description/Evidence | RoB | Applicability Concerns
Participants/Population | 390 Chinese adults with T2DM and biopsy-confirmed DKD enrolled (2008–2019) and followed ≥1 year (median 3 years). Clear inclusion/exclusion criteria (eGFR > 15, age ≥ 18). All had histopathological confirmation using RPS 2010 criteria. | [Green] Low, Well-defined, consecutive, pathologically confirmed cohort minimizes misclassification and selection bias. | [Red] High, Restricted to biopsy-confirmed DKD in hospital setting; not representative of broader T2DM populations with clinically diagnosed DKD.
Predictors/Index Model | 30+ clinical, biochemical, and pathological predictors collected at biopsy; top five selected (CysC, sAlb, Hb, UTP, eGFR). Missing data imputed with missForest; variables with >20% missing excluded. Predictor measurement standardized in central lab; pathology scored by two pathologists. | [Green] Low, Predictors measured objectively; consistent definitions; appropriate handling of missingness. | [Yellow] Some concern, Reliance on biopsy-specific and pathology variables limits applicability in routine, non-biopsy DKD care.
Comparator Model | Other ML algorithms (GBM, SVM, logistic regression) and internal validation splits (75/25) used for performance comparison; RF chosen for best AUC = 0.90. | [Green] Low, Comparator models appropriate; fair model comparison across consistent datasets. | [Green] Low, Comparisons relevant to model-selection purpose.
Outcome | Incident ESRD defined as eGFR < 15 mL/min/1.73 m2 or renal replacement therapy initiation. Objectively assessed, clinically standard endpoint. | [Green] Low, Objective, standard ESRD definition; ascertainment likely accurate. | [Green] Low, Outcome clinically relevant to DKD prognosis; matches intended use.
Timing | Baseline = biopsy date; median follow-up 3 years (minimum 1 year). Clear temporal ordering of predictors before outcome. | [Green] Low, Prospective follow-up adequate for ESRD prediction; outcome timing clearly defined. | [Green] Low, Appropriate prediction horizon for ESRD in advanced DKD.
Setting | Single-country, hospital-based, tertiary nephrology centers in China; all participants underwent renal biopsy and centralized pathology review. | [Yellow] Some concern, High-quality data but limited generalizability to community or primary-care settings. | [Red] High, Specialized hospital cohort; may not represent broader DKD care environments.
Intended Use of Predictive Model | Predict risk of ESRD in patients with T2DM and DKD to enable early intervention and risk stratification. Developed nomogram from top 5 predictors for clinical use. | [Green] Low, Clearly stated clinical purpose; model interpretable and feasible for use where lab data available. | [Yellow] Some concern, Model designed for advanced DKD at tertiary level; limited for early-stage DKD or general diabetic populations.
Overall Judgment | The RF model demonstrated excellent discrimination (AUC 0.90) and good internal validation via 10-fold cross-validation. Clear inclusion criteria, objective outcomes, and rigorous variable handling reduce bias. However, the single-center, biopsy-only Chinese cohort limits external validity and may inflate performance (no external validation). | [Yellow] Some concern/Moderate risk, Internal validation only, modest sample size. | [Red] High concern, Limited to biopsy-proven DKD; may not generalize to broader T2DM populations or other ethnicities.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only. Red = No external validation/preprint/exploratory approach.
Table A25 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Zou et al., 2022 [33].
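missForest-style imputation, as used by Zou et al., iteratively predicts each variable's missing entries from the others with a random forest. scikit-learn's IterativeImputer with a forest estimator is a close analog; the sketch below uses synthetic data and is an illustration of the technique, not the authors' R workflow.

```python
# Minimal sketch of missForest-style imputation via scikit-learn's
# IterativeImputer with a random-forest estimator; data are synthetic.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(390, 30))
mask = rng.uniform(size=X.shape) < 0.10   # ~10% missing at random
X[mask] = np.nan

# Columns with >20% missingness would be dropped first, as in the study.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10, random_state=0,
)
X_complete = imputer.fit_transform(X)
print(np.isnan(X_complete).sum())  # 0 remaining missing values
```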
Table A26. Study Snapshot, Zou et al., 2022 [33].
Feature | Details
Cohort | 390 Chinese patients with biopsy-confirmed DKD; median follow-up 3 years
Events/Outcome | 158 ESRD events (40.5%); outcome defined as incident ESRD (eGFR < 15 mL/min/1.73 m2 or need for renal replacement therapy)
Predictors used | Routine labs & demographics: eGFR, cystatin C (CysC), serum albumin (sAlb), hemoglobin (Hb), 24 h urine total protein (UTP), lipids, blood pressure, glucose, medications, etc. Renal pathology features: glomerular class, interstitial fibrosis/tubular atrophy (IFTA), inflammation, immunofluorescence (C1q, IgG, IgM, IgA, C3, C4). Not used: urine biomarkers, imaging, multi-omics
Models tried | RF (best; AUC = 0.90 in validation), SVM (AUC = 0.88), Gradient Boosting Machine (AUC = 0.88), Logistic Regression (AUC = 0.83)
Feature selection/Nomogram | Top 5 predictors included in nomogram: CysC, sAlb, Hb, eGFR, UTP
Validation | Internal split: 75% training/25% validation; 10-fold cross-validation on training set; no external validation reported
Table A27. Clinical Questions and Model Applicability, Zou et al., 2022 [33].
Clinical Question | Data Available in This Study | Does This Model Help? | Best Data for This Question | Recommended Model Family
Flag early disease (identify patients with subclinical DKD before functional loss) | Baseline routine labs and biopsy pathology (cohort = confirmed DKD) | No/Limited. Cohort already has biopsy-confirmed DKD; model predicts ESRD progression, not early detection | Routine labs (eGFR trends, albuminuria), urine biomarkers, imaging | Simple linear/regression baselines or tree ensembles. Logistic/Cox for transparency and few covariates; tree ensembles for nonlinear thresholds. Deep nets are not suitable due to small sample and tabular data
Estimate slope to ESRD (rate of decline/time to RRT) | Single baseline snapshot (labs + pathology); no longitudinal trajectories | No, RF predicts risk of ESRD by follow-up window but does not estimate slope or time-to-event | Serial eGFR measures, repeat urine albumin/protein, medication changes; ideally longitudinal data | Survival models (Cox, flexible parametric), penalized Cox, survival tree ensembles (random survival forests, gradient-boosted survival trees). Mixed-effects or joint longitudinal-survival models if serial data available
Decide when to plan access (predict timing of dialysis) | Baseline predictors + ESRD risk score (nomogram for 1/3/5-year risk) | Potentially, high predicted short-term ESRD risk may inform planning, but no external validation or robust time-to-event estimates (see the survival-model sketch after this table) | Time-to-event data, competing risks (death), calibrated absolute risk estimates at clinically relevant horizons | Survival models (Cox with predicted survival curves, flexible parametric models) or survival tree ensembles; parsimonious models if sample size is small
Predict mortality (all-cause or cardiovascular) | Not available; study focused on ESRD | No. Separate model needed; ESRD predictors not directly transferable | Routine labs, comorbidity data, cardiac biomarkers, imaging, longitudinal information | Cox/competing-risks models (Fine-Gray) or tree-based survival models; penalized Cox for modest sample sizes
Anticipate cardiovascular events | Not available; study not designed for cardiovascular outcomes | Not applicable. Requires different outcomes and predictors | Cardiac history, troponin/BNP, ECG/imaging | Cox regression for clinical predictors or ensemble trees when nonlinear interactions are expected and sample size is sufficient
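Several rows of Table A27 recommend survival models for time-to-event questions such as access planning. The sketch below illustrates that recommendation, assuming the lifelines package; the column names and data are hypothetical stand-ins, and the output is an absolute ESRD risk at 1-, 3-, and 5-year horizons rather than a binary risk class.

```python
# Minimal sketch of a Cox time-to-event model answering the "when to
# plan access" question; data and column names are synthetic.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "egfr": rng.normal(35, 8, 390),
    "cysc": rng.normal(1.8, 0.4, 390),
    "utp": rng.normal(3.0, 1.0, 390),
    "years": rng.exponential(3.0, 390),   # follow-up time
    "esrd": rng.integers(0, 2, 390),      # event indicator
})

cph = CoxPHFitter().fit(df, duration_col="years", event_col="esrd")

# Absolute ESRD-free probability at 1, 3, and 5 years for new patients.
new_patients = df.drop(columns=["years", "esrd"]).head(3)
surv = cph.predict_survival_function(new_patients, times=[1, 3, 5])
print(1 - surv)  # cumulative ESRD risk per patient per horizon
```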
Table A28. Zou et al., 2022 [33] study’s actual data/model usage and general guidance for alternative data types.
Data Type/Availability | Data Characteristics | Model Families Used/Recommended | Rationale/Notes
Routine labs + demographics + biopsy pathology (this study) | Mixed numeric, categorical, and pathology scores; n = 390, events = 158 | RF (best), SVM, GBM, logistic regression | Tree ensembles (RF/GBM): handle mixed variable types, missing data, capture nonlinearities/interactions, robust with modest sample sizes; drawbacks: less transparent, calibration may be poor, internal validation only. Logistic regression: transparent, interpretable, generalizes better in small samples when relationships are near-linear (AUC 0.83 vs. RF 0.90). SVM: good for complex decision boundaries; less interpretable, requires careful tuning. Deep nets: not appropriate, sample too small and data are tabular.
Urine biomarkers | Not used in this study | Tree ensembles, penalized linear models | Valuable for early disease flagging; ensembles capture nonlinear interactions; penalized regression avoids overfitting in moderate sample sizes.
Imaging features (e.g., kidney ultrasound, elastography, MRI) | Not used in this study | Tree ensembles or penalized regression for moderate n; deep learning (CNNs) if large image dataset | High-dimensional data; feature engineering helps in small-medium datasets; CNNs effective with large cohorts.
Multi-omics (genomics, transcriptomics, proteomics, metabolomics) | Not used in this study | Penalized regression, tree ensembles (moderate n); deep nets (large n) | Require large sample sizes and robust regularization; deep nets feasible only with very large datasets.
Table A29. Practical Appraisal and Recommendations.
Question/Aspect | Recommendation/Appraisal
Can we use the RF nomogram now to make decisions? | Not yet for broad clinical use. It shows strong internal performance (AUC 0.90) and a useful nomogram with five routine features (CysC, sAlb, Hb, eGFR, UTP) but lacks external validation and prospective calibration. Use as hypothesis-generating or adjunct risk signal, not as a sole basis for initiating access or stopping therapies.
Time-to-event answers (when to plan access) | Authors should fit survival models (Cox or random survival forest), report absolute risk at concrete horizons, provide calibration plots, and handle competing risks (e.g., death). This allows actionable timing predictions (e.g., ESRD probability within 6–12 months).
Is the model over-fitted? | Risk of optimism: internal split + cross-validation is good practice, but no external validation and complex learners (RF/GBM) may inflate AUC. Event count (158) reasonable, but penalization or fewer variables would reduce overfitting risk.
Recommended model family for each clinical task (given dataset size) | Flag early disease (screening): simple logistic/Cox with parsimonious predictors (routine labs). Estimate slope/timing: survival models (penalized Cox or random survival forest). Plan access: survival/competing-risk models with absolute risk estimates and calibration. Predict mortality/cardiovascular events: build separate survival models using outcome-specific predictors.
Dataset-level guidance (n ≈ 400, ~150 events) | Tree ensembles (RF/GBM) reasonable first choice for discrimination; penalized Cox/logistic recommended for interpretable, calibrated predictions when sample size is moderate.
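Table A29 repeatedly recommends penalized Cox for a dataset of this size (~400 patients, ~150 events). The sketch below shows one way to obtain a lasso-type Cox fit, assuming the lifelines package (its penalizer with l1_ratio=1.0 gives an L1 penalty); all data and column names are synthetic assumptions.

```python
# Minimal sketch of a lasso-penalized Cox model for a moderate-sized
# cohort; predictors, follow-up, and events are all synthetic.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
cols = [f"lab_{i}" for i in range(20)]            # hypothetical lab panel
df = pd.DataFrame(rng.normal(size=(400, 20)), columns=cols)
df["years"] = rng.exponential(3.0, 400)
df["event"] = rng.integers(0, 2, 400)

cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)    # L1 (lasso-type) penalty
cph.fit(df, duration_col="years", event_col="event")

# The penalty shrinks weak coefficients toward zero, yielding a sparse,
# more interpretable model than an unpenalized fit.
kept = cph.params_[cph.params_.abs() > 1e-4]
print(f"{kept.size} of 20 predictors retained")
```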

Appendix E

This appendix contains details and explanations supplemental to the subsection, “4.5. Application of AI-based algorithms for detecting and classifying current disease state, discovering diagnostic biomarkers, and subtype identification” of the main text. The explanations of the quality assessment of the included study articles [8] through [14] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text. However, this discussion is crucial to understanding the overall significance of these studies.
Table A30. QUADAS-2 RoB and applicability assessment for Basuli et al., 2025 [8].
Domain | Description/Evidence | RoB | Applicability Concerns
Patient Selection | The review included studies on adults with type 2 diabetes, with or without existing DKD, focusing on ML-based prediction of DKD development or progression. Inclusion/exclusion criteria for studies were well-described, and screening was conducted independently by three reviewers with consensus resolution. However, inclusion was restricted to English-language publications and limited to two databases (PubMed, EMBASE). | [Yellow] Some concern, Restriction to English and specific databases may have introduced selection bias and publication bias. | [Green] Low concern, The included patient populations (T2D with/without DKD) match the review’s stated target population for AI-based DKD prediction.
Index Test (AI/ML Models) | The review evaluated various AI/ML algorithms (RF, XGB, SVM, CNN, LightGBM, etc.) used for predicting DKD onset or progression. Model details, performance metrics (AUC, accuracy), and validation strategies were extracted. However, standardization of model evaluation (e.g., cross-validation vs. external validation) across studies was lacking. | [Yellow] Some concern, Variable reporting of model development and validation among included studies; unclear thresholding and calibration metrics in some. | [Green] Low concern, The AI models were evaluated for DKD prediction as intended; no mismatch between review scope and test purpose.
Reference Standard | The “reference standard” across included studies varied, typically clinically defined DKD or CKD stages (eGFR, albuminuria) or biopsy-confirmed DKD in a few studies. Definitions were consistent with KDIGO criteria in most studies. However, the review did not uniformly verify how outcomes were adjudicated in each included study. | [Yellow] Some concern, Heterogeneity in DKD definitions across studies (clinical vs. pathological confirmation). | [Green] Low concern, All definitions align with real-world diagnostic criteria for DKD or CKD progression.
Flow and Timing | The included primary studies varied in follow-up periods (ranging from 6 months to 10 years) and timing of predictor measurement. The review summarized follow-up durations descriptively but did not evaluate time-lag consistency or data collection timing bias. | [Yellow] Some concern, Inconsistent reporting of temporal relationships between predictor measurement and DKD outcomes across included studies. | [Green] Low concern, The studies collectively address the intended question of early prediction of DKD, regardless of specific follow-up intervals.
Overall Judgment (Per Study) | Most included studies showed good discrimination (AUC 0.74–0.90) but heterogeneous methodology and limited external validation. No meta-analysis of diagnostic accuracy metrics was performed due to heterogeneity. | [Yellow] Some concern, Lack of quantitative synthesis and limited quality appraisal details for each included study. | [Green] Low concern, The review’s conclusions appropriately reflected the included evidence without overgeneralizing applicability.
Overall Study-Level Judgment | Comprehensive search and structured screening lend credibility, but variability in study design, DKD definitions, and validation methods limit the strength of conclusions. The review highlights promising AI utility but acknowledges limitations in dataset size, standardization, and validation. | [Yellow] Moderate RoB | [Green] Low applicability concern
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Basuli et al., 2025 [8] is a systematic review of diagnostic models. We applied the QUADAS-2 tool in an adapted manner to assess the RoB and applicability of this systematic review article. Table A30 is a comprehensive, color-coded QUADAS-2 table evaluating this systematic review article across all domains, including patient selection, index test, reference standard, flow and timing, overall judgment per study, and overall study-level judgment.
Table A31. QUADAS-2 RoB and applicability assessment for Lei et al., 2024 [9].
Domain | Description | RoB | Applicability Concerns
Patient Selection | Retrospective inclusion of renal biopsy samples from DN patients at a single tertiary center (Jinling Hospital). All samples were PAS-stained and of acceptable image quality. Exclusion of slides with compression or decolorization artifacts. | [Yellow] Some concern, Retrospective design, unclear whether all consecutive DN biopsies were included; potential for selection bias toward good-quality slides. | [Green] Low concern, Study population (biopsy-confirmed DN) reflects the intended target population for histopathologic classification.
Index Test | CNN-based model trained using annotated PAS-stained WSIs to detect and quantify glomerular lesions and intrinsic cells. Multi-architecture system (EfficientNet, U-Net, V-Net) used for classification and segmentation. Model performance reported via F1-score, Dice coefficient, and kappa agreement with pathologists. | [Yellow] Some concern, Model trained and validated on retrospective labeled data; unclear if testing was blinded to reference standard; possible overfitting and lack of external dataset validation. | [Green] Low concern, The index test (CNN model) aligns directly with the intended clinical use (automated morphological classification).
Reference Standard | Manual classification by experienced renal pathologists using the 2010 Renal Pathology Society (RPS) standard for DN glomerular lesions. Consensus among three pathologists served as the gold standard. | [Green] Low risk, Reference standard based on validated, widely accepted histopathologic criteria (RPS 2010). Consensus reading reduces misclassification risk. | [Green] Low concern, Standard reflects current diagnostic practice; appropriate for evaluating CNN-based methods.
Flow and Timing | Retrospective analysis of PAS-stained slides. Same slides used for CNN training and reference grading, though dataset split into training and test sets. All samples underwent consistent imaging and processing. | [Yellow] Some concern, Same dataset source used for both training and testing; unclear independence of test set; lack of prospective validation; no time interval between index and reference tests. | [Yellow] Some concern, Applicability limited by single-center design and lack of external dataset; performance may vary across staining protocols or scanners.
Overall Judgment | The study demonstrates strong diagnostic potential for CNN-assisted histopathology in DN, achieving high F1-scores and good agreement with expert pathologists. However, retrospective design, single-center dataset, and lack of external validation pose moderate bias risks. | [Yellow] Overall RoB: Moderate | [Yellow] Overall Applicability: Moderate
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A31 is a QUADAS-2 assessment table tailored to this CNN-based diagnostic pathology study [9], with clear color-coded risk levels and an overall judgment row at the end.
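The Dice coefficient reported for the segmentation models in Table A31 measures overlap between predicted and pathologist-annotated masks. A minimal NumPy sketch on synthetic masks standing in for real WSI segmentations:

```python
# Minimal sketch of the Dice coefficient for binary segmentation masks;
# the masks here are synthetic stand-ins for WSI annotations.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * inter / denom if denom else 1.0

rng = np.random.default_rng(0)
truth = rng.uniform(size=(512, 512)) < 0.2           # "annotated" mask
pred = truth.copy()
pred[rng.uniform(size=truth.shape) < 0.05] ^= True   # perturb 5% of pixels
print(f"Dice = {dice(pred, truth):.3f}")
```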
Table A32. PROBAST assessment for Makino et al., 2019 [10].
Domain | Description | RoB | Applicability Concerns
Population | 64,059 patients with type 2 diabetes identified retrospectively from EMR data. Real-world population with broad inclusion; minimal preselection. | [Yellow] Some concern, Potential selection bias due to retrospective data extraction and unclear inclusion/exclusion handling. | [Green] Low concern, Reflects typical diabetic population encountered in clinical settings.
Index Model | AI-based predictive model using convolutional autoencoder to extract longitudinal temporal features and logistic regression for 6-month DKD aggravation prediction (3073 features). | [Red] High risk, Limited transparency in feature selection, internal validation only, and unclear handling of missing or correlated data. | [Yellow] Some concern, Model dependent on EMR data structure; reproducibility across systems uncertain.
Comparator Model | No explicit clinical or statistical comparator model reported (e.g., KDIGO or traditional regression models). | [Red] High risk, Absence of comparator limits evaluation of incremental value and clinical benefit. | [Yellow] Some concern, Lacks benchmark comparison for interpretability in clinical context.
Outcome | DKD aggravation over 6 months, validated by long-term renal outcomes (hemodialysis incidence within 10 years). | [Green] Low risk, Outcome clinically relevant, objective, and based on standard renal indicators. | [Green] Low concern, Outcome definition aligns with nephrology practice.
Timing | Prediction window based on previous 6 months of EMR data; follow-up extended up to 10 years for validation of outcome trends. | [Yellow] Some concern, Retrospective follow-up with unclear frequency and consistency of measurements across patients. | [Green] Low concern, Clinically realistic prediction horizon for monitoring DKD progression.
Setting | Real-world hospital EMR dataset, likely multi-center, with mixed inpatient and outpatient records. | [Yellow] Some concern, Variability in data collection across institutions and EMR systems may affect model generalizability. | [Green] Low concern, EMR-based design applicable to modern healthcare environments.
Intended Use of Predictive Model | Early prediction of DKD progression in T2DM patients to guide timely interventions and reduce risk of hemodialysis. | [Green] Low risk, Intended use clinically appropriate and clearly stated. | [Green] Low concern, Applicable for clinical decision support once validated.
Overall Judgment | AI model demonstrates potential for early DKD progression prediction using large-scale EMR data but lacks comparator analysis and external validation. | [Yellow] Overall RoB: Moderate-High | [Yellow] Overall Applicability: Moderate
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only. Red = No external validation/preprint/exploratory approach.
Table A32 is a PROBAST evaluation tailored to this study [10], focusing on its prediction model development and validation aspects. The table includes color-coded risk ratings and an Overall Judgment row.
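The two-stage design attributed to Makino et al. (a convolutional autoencoder compressing longitudinal EMR windows into temporal features that then feed a logistic-regression classifier) can be sketched as follows. The architecture, window length, and feature sizes here are illustrative assumptions, not the published model.

```python
# Minimal sketch of autoencoder feature extraction + logistic regression
# for 6-month aggravation prediction; all data and sizes are synthetic.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

X = torch.randn(1000, 1, 24)  # 1000 patients x 1 channel x 24 time steps
y = np.random.default_rng(0).integers(0, 2, 1000)  # aggravation labels

encoder = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool1d(4), nn.Flatten(),      # -> 32 temporal features
)
decoder = nn.Sequential(nn.Linear(32, 24))      # reconstruct the window
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

for _ in range(50):                             # unsupervised reconstruction
    opt.zero_grad()
    z = encoder(X)
    loss = nn.functional.mse_loss(decoder(z), X.squeeze(1))
    loss.backward()
    opt.step()

# Stage 2: frozen temporal features feed a simple supervised classifier.
features = encoder(X).detach().numpy()
clf = LogisticRegression(max_iter=1000).fit(features, y)
print(f"training accuracy: {clf.score(features, y):.3f}")
```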
Table A33. PROBAST assessment for Nayak et al., 2024 [11].
Domain | Description | RoB | Applicability Concerns
Population | Retrospective data from 227 patients with DKD (stages 1–5) admitted between 2017–2022 at a South Indian tertiary hospital. Clear inclusion/exclusion criteria applied. | [Yellow] Some concern, Moderate sample size; single-center data may not capture population heterogeneity; retrospective selection may introduce bias. | [Yellow] Some concern, South Indian hospital population; external generalizability to other regions or healthcare systems uncertain.
Index Model | Bayesian optimization-based ML pipeline (XGBoost, RF, SVM) for classification and regression; recursive feature elimination identified 15 key predictors. Best model (XGBoost) achieved ~89% F1-score with high explainability via SHAP/LIME. | [Yellow] Some concern, Appropriate algorithms and interpretability tools used, but potential overfitting due to small dataset; internal validation only. | [Green] Low concern, Predictors based on routinely collected lab and clinical parameters; feasible for clinical adoption.
Comparator Model | No explicit traditional clinical or statistical comparator model (e.g., logistic regression, KDIGO-based prediction) was used for benchmarking. | [Red] High risk, Lack of comparator prevents evaluation of incremental value beyond existing clinical criteria. | [Yellow] Some concern, Limits clinical interpretability relative to standard DKD staging frameworks.
Outcome | DKD stage classification (1–5) and progression patterns; validated by eGFR and related lab parameters; regression models used to assess severity. | [Green] Low risk, Outcomes based on standard KDIGO staging and eGFR; reliable and objective. | [Green] Low concern, Consistent with accepted nephrology definitions and clinical relevance.
Timing | Retrospective 5-year data collection (2017–2022) from EMR records; baseline to last recorded visit. | [Yellow] Some concern, Retrospective timing may affect consistency of follow-up and missing data patterns. | [Green] Low concern, Reflects clinically realistic patient follow-up intervals for DKD.
Setting | Single tertiary care hospital in South India; data extracted from inpatient records under real-world clinical conditions. | [Yellow] Some concern, Institutional data quality and recording practices could vary; limits external reproducibility. | [Yellow] Some concern, Single-center scope may affect generalizability across diverse healthcare infrastructures.
Intended Use of Predictive Model | To assist clinicians in early identification and stage-wise prediction of DKD progression using explainable AI, enabling personalized interventions and reducing risk of ESRD. | [Green] Low risk, Purpose is well-defined, clinically relevant, and ethically appropriate. | [Green] Low concern, Intended use aligns with clinical decision support in nephrology.
Overall Judgment | The study presents a well-designed, interpretable ML model with strong performance metrics and clinical relevance. However, lack of external validation and comparator model introduces moderate RoB and limits generalizability. | [Yellow] Overall RoB: Moderate | [Yellow] Overall Applicability: Moderate
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only. Red = No external validation/preprint/exploratory approach.
Table A33 is a PROBAST evaluation customized to this study [11], following the comprehensive structure: Population, Index Model, Comparator Model, Outcome, Timing, Setting, Intended Use, and Overall Judgment.
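The SHAP explainability cited in Table A33 attributes each prediction to individual features. The sketch below assumes the shap and xgboost packages, with synthetic features standing in for the 15 selected predictors; it illustrates the technique, not the authors' pipeline.

```python
# Minimal sketch of SHAP-based explanation for an XGBoost classifier;
# features and labels are synthetic placeholders.
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(227, 15))      # 15 selected predictors (stand-ins)
y = rng.integers(0, 2, size=227)    # simplified binary stage label

model = xgb.XGBClassifier(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Mean |SHAP| per feature gives a global importance ranking.
print(np.abs(shap_values).mean(axis=0).round(3))
```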
Table A34. PROBAST assessment for Li et al., 2025 [12].
Domain | Description | RoB | Applicability Concerns | Comments/Justification
Population | Adults with type 2 diabetes mellitus (T2DM) from multiple cohorts; no restriction on gender, age, or region. | [Green] Low | [Yellow] Moderate | Population well defined; however, heterogeneity across included studies (different datasets, ethnicities) could limit direct applicability.
Index Model | ML algorithms (traditional regression, ML, DL) used for DKD prediction. | [Yellow] Moderate | [Green] Low | Models clearly described but varying development quality (feature selection, hyperparameter tuning not always transparent).
Comparator Model | Traditional regression and other ML models (e.g., logistic regression, RF). | [Green] Low | [Green] Low | Comparators appropriate and systematically compared; limited concern for bias or applicability.
Outcome | Development or progression of DKD, typically defined by albuminuria or eGFR criteria. | [Green] Low | [Green] Low | Clinically relevant and consistently defined across studies; small variation in thresholds unlikely to bias results.
Timing | Prediction horizons varied (short to long term); follow-up durations not uniform. | [Yellow] Moderate | [Yellow] Moderate | Inconsistent time intervals may influence predictive accuracy and applicability across settings.
Setting | Predominantly retrospective, single-center or national datasets; few external validations. | [Yellow] Moderate | [Yellow] Moderate | Real-world settings increase relevance, but limited multicenter validation restricts generalizability.
Intended Use | Early identification of DKD risk in T2DM for clinical decision support and prevention. | [Green] Low | [Green] Low | Intended use is clear, relevant, and consistent with clinical goals.
Overall Judgment | Key limitation: Heterogeneity across included studies (datasets, ML techniques, outcome definitions) and limited external validation. Key strength: Comprehensive synthesis using robust meta-analytic methods and PROBAST framework adherence. | [Yellow] Moderate | [Yellow] Moderate | Overall good methodological rigor, but heterogeneity, limited external validation, and variable reporting reduce confidence in transportability.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A34 is an enhanced PROBAST summary table for the study [12], using clear color-coded indicators and separate columns for RoB and Applicability Concerns.
Table A35. PROBAST assessment for Zhu et al., 2024 [13].
Domain | Description | RoB | Applicability Concerns | Comments/Justification
Population | Adults (≥18 yrs) with type 2 diabetes (duration ≥ 10 yrs), free from DN at baseline; excluded Type 1 DM and non-diabetic renal disease. | [Green] Low | [Green] Low | Well-defined cohort consistent with target clinical population; appropriate exclusions reduce confounding.
Index Model | ML algorithms (SVM, RF) integrating biomarkers and clinical variables; cross-validation used. | [Yellow] Moderate | [Green] Low | Feature selection and tuning well described but may introduce overfitting; model externally validated once.
Comparator Model | Traditional multivariate logistic regression used for benchmarking ML performance. | [Green] Low | [Green] Low | Comparator appropriate and transparent; enables fair assessment of ML model benefit.
Outcome | Incident diabetic nephropathy (persistent albuminuria ≥ 30 mg/g + clinical evidence) over 36 months; ADA-aligned criteria. | [Green] Low | [Green] Low | Clear, standardized outcome definition ensures consistency across datasets.
Timing | Baseline data collection with 36-month longitudinal follow-up for DN onset in both training and validation cohorts. | [Green] Low | [Green] Low | Uniform timing between datasets; appropriate window for DN development.
Setting | Single tertiary hospital (Sichuan Provincial People’s Hospital); retrospective EHR-based design (2018–2020). | [Yellow] Moderate | [Yellow] Moderate | Real-world data improve relevance but single-center limits generalizability and external reproducibility.
Statistical Analysis | Combined feature selection using RF + SVM; multicollinearity checked via VIF; cross-validation and independent test set (n = 468); evaluated via AUC and F1-score. | [Yellow] Moderate | [Green] Low | Robust statistical workflow; however, potential bias from retrospective data, class imbalance not discussed; external validation partially mitigates concern.
Intended Use of Predictive Model | Early identification of high-risk T2DM patients for DN to enable preventive or therapeutic interventions. | [Green] Low | [Green] Low | Intended use well-defined, clinically meaningful, and feasible for integration into practice.
Overall Judgment | Key strengths: Defined population, biologically meaningful predictors, robust validation, good discrimination (AUC = 0.83). Key limitations: Retrospective single-center data, limited external validation, incomplete handling of potential overfitting, and data imbalance. | [Yellow] Moderate | [Yellow] Moderate | Solid predictive modeling with modest bias from retrospective single-center data and limited generalizability; strong potential for clinical application pending broader validation.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A35 is a PROBAST evaluation table tailored specifically to this study [13], using clear color-coded indicators and separate columns for RoB and Applicability Concerns.
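The multicollinearity check via VIF mentioned in Table A35 can be performed with statsmodels; a VIF near 1 indicates independent predictors, while values above roughly 5–10 flag redundancy. A sketch with synthetic, hypothetical lab variables:

```python
# Minimal sketch of a VIF multicollinearity check; predictors are
# synthetic stand-ins, including one deliberately collinear column.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(468, 4)),
                  columns=["hba1c", "egfr", "acr", "bmi"])
df["egfr_dup"] = df["egfr"] * 0.9 + rng.normal(scale=0.1, size=468)  # collinear

X = df.to_numpy()
for i, col in enumerate(df.columns):
    print(f"VIF({col}) = {variance_inflation_factor(X, i):.1f}")
# Rule of thumb: VIF > 5-10 suggests a redundant predictor to drop.
```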
Table A36. QUADAS-2 RoB and applicability assessment for Zhu et al., 2024 [14].
Domain | Description | RoB | Applicability Concerns | Comments/Justification
Patient Selection | Samples derived from publicly available GEO transcriptomic datasets (GSE47184 [54], GSE96804 [52], GSE104948 [55], GSE104954 [56], GSE142025 [57], GSE175759 [58]); DN and control samples defined per study metadata. | [Yellow] Moderate | [Green] Low | Selection relied on pre-existing datasets; inclusion/exclusion not under authors’ control; however, cases and controls were biologically and clinically relevant.
Index Test (ML-MR integrated pipeline) | Combined ML (LASSO, SVM-RFE, RF) for feature selection and MR analysis to identify causally linked biomarkers; performance assessed by ROC curves (AUC ≥ 0.878). | [Green] Low | [Green] Low | Transparent and reproducible index test pipeline; validated across multiple datasets; strong predictive performance and interpretability.
Reference Standard | Diagnosis of diabetic nephropathy established from dataset clinical annotations and confirmed by renal biopsy in some cohorts. | [Yellow] Moderate | [Green] Low | Diagnostic accuracy depends on original dataset quality; partial reliance on pre-annotated labels may introduce some bias.
Flow and Timing | All datasets processed with standardized pipelines; ML model trained and validated across multiple cohorts; experimental validation (qRT-PCR) performed on independent clinical samples. | [Green] Low | [Green] Low | Consistent analytic flow; appropriate sequencing between model discovery and validation; minimal missing data bias.
Statistical Analysis | Differential expression (DEG) filtering, ML-based feature selection, MR causal inference, ROC validation, and qRT-PCR confirmation. | [Green] Low | [Green] Low | Robust multi-stage analytical design combining omics and causal inference strengthens reliability; multiple validation layers reduce bias.
Overall Judgment | Key strengths: Multi-algorithm feature selection, MR causal inference, cross-dataset validation, and independent biological validation. Key limitations: Potential dataset heterogeneity; lack of uniform diagnostic standards across public datasets. | [Green] Low-to-Moderate | [Green] Low | Strong methodological rigor, diverse datasets, and biological validation reduce bias; minor concern due to reliance on secondary data sources and metadata accuracy.
Colors: Green = External validation (multi-site or independent cohort). Yellow = Internal split or temporal validation only.
Table A36 follows the official QUADAS-2 domains for this study [14] and adds: clear, study-specific descriptions; separate columns for RoB and Applicability Concerns; color-coded judgments; and an Overall Judgment row summarizing the quality assessment.
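The multi-algorithm feature selection in this pipeline (LASSO plus SVM-RFE) keeps only biomarkers chosen by both methods. A minimal scikit-learn sketch on synthetic expression data; the gene indices stand in for real identifiers, and this is an illustration of the intersection strategy, not the authors' exact workflow.

```python
# Minimal sketch of LASSO + SVM-RFE feature selection with intersection
# of the two selected sets; the expression matrix is synthetic.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 200))     # samples x genes (stand-in matrix)
y = rng.integers(0, 2, size=120)    # DN vs. control labels

# LASSO: genes with nonzero coefficients survive.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
lasso_set = set(np.flatnonzero(lasso.coef_))

# SVM-RFE: recursively prune genes with the smallest linear-SVM weights.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=20).fit(X, y)
svm_set = set(np.flatnonzero(rfe.support_))

candidates = sorted(lasso_set & svm_set)  # biomarkers chosen by both
print(f"{len(candidates)} overlapping candidate genes: {candidates[:10]}")
```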

Appendix F

This appendix contains details and explanations supplemental to the subsection, “4.6. Application of AI and ML-based algorithms for identifying and classifying existing diseases and subtypes, and forecasting disease progression and risk stratification” of the main text. The explanations of the quality assessment of the included study articles [16] through [23] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text. However, this discussion is crucial to understanding the overall significance of these studies.
Table A37. QUADAS-2 RoB and applicability assessment for Lucarelli et al., 2023 [16].
Domain | RoB | Applicability Concerns
Patient Selection | [Red] High/Unclear, Selective recruitment (urine proteomics + pathology cohort) with limited reporting on inclusion/exclusion; possible spectrum bias. | [Orange] Moderate, Participants unlikely to reflect general T2DM population; community-clinic applicability uncertain.
Index Test (digital biomarkers via urinary proteomics + pathology) | [Orange] Moderate/Unclear, Unclear blinding; thresholds not pre-specified; same dataset used for discovery and model training → overfitting risk. | [Orange] Moderate, Uses specialized proteomics pipeline; external platforms may differ → limited generalizability.
Reference Standard | [Yellow] Low → Moderate, Pathology-based reference appropriate but possibly heterogeneous (biopsy vs. clinical classification); unclear if consistent across subjects. | [Orange] Moderate, Standard relevant to DKD but not necessarily identical to routine clinical endpoints.
Flow and Timing | [Red] High, Incomplete information on timing between index and reference tests; unclear follow-up; potential verification bias. | [Orange] Moderate, Specialized research workflow limits applicability to routine timelines and sampling logistics.
Overall Judgment | [Red] Overall RoB: High, Unclear patient selection + limited reporting on timing increase risk. | [Orange] Overall Applicability: Moderate, Study setting and assay platform differ from standard clinical practice; external validation needed.
Colors: [Yellow] = Internal split or temporal validation only. [Orange] = Limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations). [Red] = No external validation/preprint/exploratory approach.
Table A37 is a color-coded QUADAS-2 table for Lucarelli et al. (2023) [16], “Discovery of Novel Digital Biomarkers for Type 2 Diabetic Nephropathy Classification via Integration of Urinary Proteomics and Pathology.”
This study represents an innovative early-phase effort to merge urinary proteomics with pathology for digital biomarker discovery in diabetic nephropathy. However, methodological transparency (especially around cohort definition, temporal sequence, and blinding) is limited. The findings are promising but not yet generalizable to multi-site or community settings without external validation and standardized assay calibration.
Table A38. QUADAS-2 RoB and applicability assessment for Yan et al., 2024 [17].
Domain | RoB | Applicability Concerns
Patient Selection | [Yellow] Low → Moderate, Participants included clear diagnostic categories (T2DM with/without DKD), but the sampling method and exclusion criteria were not fully described; possible selection bias if matched retrospectively. | [Green] Low, DKD and control populations clinically relevant; findings likely applicable to standard T2DM cohorts with albuminuria-based classification.
Index Test (urinary proteomics + ML classifiers) | [Orange] Moderate, ML approach (e.g., SVM, RF) applied without a fully independent test set; unclear blinding to reference standard; internal validation only. | [Orange] Moderate, Omics workflows and normalization pipelines may not generalize across assay platforms; batch effects could limit external use.
Reference Standard | [Green] Low, Diagnostic definitions followed established DKD criteria (albuminuria, eGFR decline); reference standard appropriate and consistently applied. | [Green] Low, Reference outcomes align well with clinical practice and guidelines.
Flow and Timing | [Orange] Moderate, Timing between urine collection and DKD classification not explicitly stated; unclear if temporal gaps could bias associations. | [Green] Low → Moderate, Acceptable for cross-sectional biomarker screening but limited for longitudinal prediction.
Overall Judgment | [Orange] Overall RoB: Moderate, Limited external validation and potential for overfitting; otherwise methodologically reasonable. | [Green] Overall Applicability: Low Concern, Results relevant to typical DKD diagnostic settings, but omics reproducibility remains a constraint.
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Orange] = Limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations).
Table A38 is a color-coded QUADAS-2 evaluation for the study: Yan et al., (2024) [17]; “Application of Proteomics and Machine Learning Methods to Study the Pathogenesis of Diabetic Nephropathy and Screen Urinary Biomarkers.”
This study is based on a single-center cohort; external validation and transparent reporting of the temporal design are still needed to ensure trustworthy real-world deployment. The overall evidence is methodologically moderate, with good clinical applicability.
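Where a study reports internal validation only, the strength of that evidence depends on how the cross-validation was run. The sketch below shows one defensible pattern, repeated stratified k-fold with all preprocessing fitted inside each fold; the data, model choice, and fold counts are illustrative assumptions, not taken from Yan et al.

```python
# Minimal sketch of honest internal validation: repeated stratified k-fold
# cross-validation with all preprocessing inside the fold (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, n_features=60, weights=[0.7, 0.3], random_state=1)

# Scaling lives inside the pipeline, so each CV fold is normalized using
# training-fold statistics only (no information leaks from the test fold).
model = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
aucs = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(f"AUC = {aucs.mean():.3f} +/- {aucs.std():.3f} (5-fold x 10 repeats)")
```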
Table A39. PROBAST assessment for Dong et al., 2022 [18].
Domain | Description | RoB | Applicability Concern
Population/Participants | Adults with type 2 diabetes extracted from a large hospital EMR database in China; patients were followed for up to 3 years for DKD onset. Clear inclusion/exclusion criteria with sufficient baseline data. | [Green] Low | [Green] Low
Index Model | ML-based predictive models (RF, XGBoost, Logistic Regression) using routine EMR data (labs, demographics, comorbidities, medication, BP, BMI, eGFR, etc.) to predict 3-year risk of diabetic kidney disease. | [Yellow] Moderate | [Green] Low
Comparator Model | Traditional logistic regression models used as baseline comparators for performance benchmarking. | [Green] Low | [Green] Low
Outcomes | Onset of diabetic kidney disease (DKD) within 3 years, defined by KDIGO criteria (eGFR < 60 mL/min/1.73 m² or UACR > 30 mg/g). | [Green] Low | [Green] Low
Timing | Predictors measured at baseline; 3-year prediction horizon. However, it is unclear whether temporal data splits (e.g., patient-level chronological separation) were strictly applied. | [Yellow] Moderate | [Yellow] Moderate
Setting | Single tertiary hospital EMR database (China); retrospective design. No external validation or community-level testing reported. | [Red] High | [Yellow] Moderate
Intended Use of Predictive Model | Clinical decision support for early identification of high-risk DKD patients among those with type 2 diabetes; potentially guiding earlier interventions or referrals. | [Green] Low | [Green] Low
Statistical Analysis | Multiple ML algorithms compared. Data randomly split into training/testing sets; AUROC ≈ 0.86 reported. No external validation. No calibration curve, intercept, or DCA. Imputation and feature selection methods are not fully detailed, increasing overfitting risk. | [Red] High | [Yellow] Moderate
Overall Judgment | Moderate-to-High RoB. Model trained and tested internally with strong discrimination but limited evidence of calibration, generalizability, and robustness to domain shift. Applicability concerns are low, as EMR-based predictors are clinically relevant and reproducible. | [Orange] Moderate-High | [Green] Low
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Orange] = Limited or incomplete validation (small-sample temporal validation, narrow subgroup validation, or internal validation with methodological limitations). [Red] = No external validation/preprint/exploratory approach.
Table A39 is a structured PROBAST-style summary table for Dong et al. (2022) [18].
Key Limitations of this study include a lack of external validation, unclear handling of temporal dependencies, and incomplete reporting of calibration.
Key Strengths of this study include a large, representative EMR cohort with a clinically meaningful 3-year DKD outcome.
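The calibration reporting this study lacks is straightforward to produce; the following minimal sketch computes a reliability table, a logistic recalibration slope and intercept, and the Brier score for a held-out test set. The data are synthetic, and the bin count and model choice are illustrative assumptions.

```python
# Minimal sketch of the calibration reporting the review finds missing:
# reliability table, logistic recalibration (slope/intercept), Brier score.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, weights=[0.85, 0.15], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

p = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
p = np.clip(p, 1e-6, 1 - 1e-6)   # guard the logit transform below

# Reliability table: mean predicted vs. observed risk per probability bin.
obs, pred = calibration_curve(yte, p, n_bins=10, strategy="quantile")
for o, q in zip(obs, pred):
    print(f"mean predicted {q:.2f} -> observed {o:.2f}")

# Logistic recalibration of the outcome on logit(p): a slope near 1 and an
# intercept near 0 indicate good calibration.
logit_p = np.log(p / (1 - p)).reshape(-1, 1)
recal = LogisticRegression().fit(logit_p, yte)
print(f"calibration slope = {recal.coef_[0][0]:.2f}, intercept = {recal.intercept_[0]:.2f}")

print(f"Brier score = {brier_score_loss(yte, p):.3f}")
```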
Table A40. PROBAST assessment for Hsu et al., 2023 [19].
Domain | Description | RoB | Applicability Concern
Population/Participants | Adult patients with type 2 diabetes from a large hospital network in Taiwan. Dataset derived from electronic medical records (EMRs) between 2012 and 2021. Exclusions applied for pre-existing ESRD or missing baseline renal data. | [Green] Low | [Green] Low
Index Model | Ensemble ML models (RF, XGBoost, LightGBM) built to predict risk of rapidly progressive kidney disease (RPKD), defined as ≥30% eGFR decline within 2 years, and identify patients needing nephrology referral. | [Green] Low | [Green] Low
Comparator Model | Traditional logistic regression and Cox regression were used as comparators. ML models outperformed conventional methods with higher AUROC values. | [Green] Low | [Green] Low
Outcomes | RPKD and nephrology referral within a 2-year period, defined using KDIGO-based eGFR decline thresholds and clinician referrals. Clear, clinically relevant endpoint definitions. | [Green] Low | [Green] Low
Timing | Predictors collected at baseline; follow-up period of up to 2 years. However, it is unclear whether patient-level temporal splits were enforced or whether random sampling was used for validation. | [Yellow] Moderate | [Yellow] Moderate
Setting | Retrospective single-center EMR dataset from a tertiary care hospital. While data volume was high, external validation or community-level generalizability testing was not reported. | [Red] High | [Yellow] Moderate
Intended Use of Predictive Model | Designed to assist clinicians in early identification of diabetic patients at high risk for RPKD, to optimize referral timing and monitoring intensity. Potential clinical decision-support tool. | [Green] Low | [Green] Low
Statistical Analysis | Compared multiple ML models; best AUROC ≈ 0.89 (XGBoost). Used cross-validation for internal evaluation. Calibration metrics not clearly reported, and decision-curve analysis absent. Missing data handling and normalization are not fully described. | [Yellow] Moderate | [Yellow] Moderate
Overall Judgment | Moderate RoB. Model performance was strong (AUROC ≈ 0.89), but limited transparency on calibration, imputation, and temporal validation reduces real-world reliability. Applicability concerns are low, given the model’s clinically relevant predictors and outcomes, but external validation remains a key gap. | [Yellow] Moderate | [Green] Low
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A40 is a PROBAST evaluation of the study, Hsu et al., 2023 [19].
The real-world deployment of the Hsu et al. 2023 [19] model is limited by a lack of external validation and by incomplete reporting of calibration and data hygiene practices. Although the study demonstrates promise for supporting physicians’ clinical decisions, it still needs broader testing across diverse clinical settings, from tertiary-level hospitals to local clinics.
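Decision-curve analysis, noted above as absent, quantifies whether acting on a model's predictions beats "treat everyone" or "treat no one" across risk thresholds. The sketch below computes net benefit from first principles on simulated predictions; the event rate, noise level, and thresholds are illustrative assumptions.

```python
# Minimal sketch of decision-curve analysis (net benefit); synthetic data.
# Net benefit at threshold pt = (TP - FP * pt/(1-pt)) / n.
import numpy as np

def net_benefit(y_true, p, threshold):
    """Net benefit of treating patients with predicted risk >= threshold."""
    n = len(y_true)
    treat = p >= threshold
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return (tp - fp * threshold / (1 - threshold)) / n

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=2000)                          # ~20% event rate
p = np.clip(0.2 + 0.5 * (y - 0.2) + rng.normal(0, 0.15, 2000), 0.01, 0.99)  # imperfect model

for pt in (0.05, 0.10, 0.20, 0.30):
    nb_model = net_benefit(y, p, pt)
    nb_all = net_benefit(y, np.ones_like(p), pt)             # "treat everyone" strategy
    print(f"threshold {pt:.2f}: model NB={nb_model:.3f}, "
          f"treat-all NB={nb_all:.3f}, treat-none NB=0.000")
```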
Table A41. QUADAS-2 RoB and applicability concerns assessment for Paranjpe et al., 2023 [20].
Domain | Description/Key Details | RoB | Applicability Concerns
Patient Selection | Retrospective EMR data from large academic centers; inclusion criteria focused on diabetic kidney disease (DKD) with genotyping data available. | [Yellow] Moderate risk, retrospective design may introduce selection bias and missingness in clinical-genetic linkage. | [Green] Low concern, representative of tertiary diabetic populations.
Index Test (Deep Learning Model) | DL model trained on multimodal EMR data with genomic integration to identify DKD sub-phenotypes linked to the Rho pathway. | [Yellow] Moderate risk, details of model tuning and cross-validation split not fully reported; unclear if feature leakage was avoided. | [Yellow] Moderate concern, algorithm may overfit tertiary-center EMR data, limiting community translation.
Reference Standard | Genetic and clinical phenotyping used to define DKD subtypes; no independent gold-standard confirmation (e.g., biopsy). | [Red] High risk, absence of external biological validation may weaken subtype credibility. | [Yellow] Moderate concern, relies on EMR and genetic proxies rather than clinical pathology confirmation.
Flow and Timing | Cross-sectional integration of EMR and genomic data; unclear timing alignment between phenotype and genotype capture. | [Yellow] Moderate risk, potential temporal mismatch between clinical and genomic data points. | [Green] Low concern, overall timing reasonable for computational phenotyping.
Statistical Analysis | DL interpretability limited; performance metrics (e.g., AUC or clustering validity indices) partially reported; no external validation cohort. | [Red] High risk, limited transparency and missing calibration or generalizability metrics. | [Red] High concern, lack of independent validation and reproducibility assessment.
Overall Judgment | The study offers valuable biological insight into DKD subtypes through EMR-genomic integration, but incomplete reporting and lack of external validation raise serious concerns about reproducibility and clinical generalizability. | [Red] High overall RoB | [Yellow] Moderate applicability concern
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A41 is a color-coded QUADAS-2 assessment table for the study by Paranjpe et al. (2023) [20], designed for clarity.
Paranjpe et al. (2023) [20] introduce an ambitious DL approach that links EMR and genomic data to uncover DKD subtypes, but transparency gaps and a lack of external validation limit its current clinical reliability. The findings are hypothesis-generating rather than ready for real-world application.
Table A42. PROBAST assessment for Xu et al., 2020 [21].
PROBAST Item | Description/Assessment | RoB | Applicability Concerns
Study Type | Systematic literature review of ML models predicting diabetic microvascular complications (retinopathy, nephropathy, neuropathy) in Type 1 Diabetes Mellitus (T1DM). | - | -
Population/Participants | Participants were individuals with T1DM, from heterogeneous study cohorts (pediatric/adult, different disease durations, varying inclusion criteria). Lack of uniformity and representativeness. | [Red] High | [Red] High
Index Model(s) | Various ML algorithms: SVM, RF, Artificial Neural Networks (ANN), Decision Trees, etc. | [Yellow] Unclear | [Yellow] Unclear
Comparator Model(s) | Some studies compared ML models with logistic regression or classical statistical methods; no consistent comparator and no meta-analysis. | [Yellow] Unclear | [Yellow] Unclear
Outcome(s) | Diabetic retinopathy, nephropathy, and neuropathy. Outcome definitions and diagnostic criteria varied across studies. Limited reporting on outcome ascertainment. | [Yellow] Unclear | [Yellow] Unclear
Timing | Prediction horizons varied (cross-sectional, retrospective cohort). No standardized follow-up intervals or prediction windows. | [Red] High | [Yellow] Moderate
Setting | Studies conducted in mixed clinical and research settings (hospital registries, EMR datasets). Contextual details are often missing. | [Yellow] Unclear | [Yellow] Unclear
Intended Use | To support early detection and risk stratification for diabetic microvascular complications in T1DM, potentially informing personalized clinical management. | - | [Yellow] Moderate
Predictors | Predictors included demographic (age, sex), clinical (HbA1c, BP, duration of diabetes), lab (lipids, creatinine), and imaging (retinal images). Predictor handling poorly described and inconsistent. | [Yellow] Unclear | [Yellow] Moderate
Statistical/Modeling Analysis | Qualitative synthesis only. Most studies reported accuracy and AUC, but few addressed calibration, missing data, or validation. No pooled statistics or sensitivity analyses. | [Red] High | [Red] High
Domain 1, Participants | Populations varied; inclusion criteria and recruitment are often unclear; potential selection bias. | [Red] High | [Red] High
Domain 2, Predictors | Inconsistent predictor definitions and processing; limited reporting of feature selection and handling of missing data. | [Yellow] Unclear | [Yellow] Moderate
Domain 3, Outcome | Outcomes inconsistently defined; validation methods not standardized; blinding rarely reported. | [Yellow] Unclear | [Yellow] Moderate
Domain 4, Analysis | No formal validation in many studies; high risk of overfitting; small datasets; selective reporting of favorable results. | [Red] High | [Red] High
Overall RoB | Methodological transparency limited; heterogeneity high; no structured bias tool (like PROBAST) used in the review. | [Red] High | -
Overall Applicability | Useful overview of ML research trends but limited generalizability or clinical utility due to lack of validation and standardization. | - | [Yellow] Moderate
Overall PROBAST Judgment | High RoB; moderate applicability concerns. Review descriptive but insufficient for clinical adoption of ML models in T1DM complications. | [Red] High | [Yellow] Moderate
Colors: [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A42 is a structured PROBAST summary table including both methodological domains and key study characteristics for the study, Xu et al., 2020 [21].
Xu et al. (2020) [21] provide a broad overview of ML applications in diabetic complication prediction, but the overall PROBAST rating is high RoB with moderate applicability concerns. The real-world usefulness of this review is limited by the lack of a structured RoB assessment, poor reporting of model development and validation across the included studies, and substantial heterogeneity in predictors, outcomes, and data sources.
Table A43. QUADAS-2 RoB and applicability assessment for Dong et al., 2024 [22].
Domain | Key Questions/Description | RoB | Applicability Concerns
1. Patient Selection | Did the review include studies enrolling representative participants? Were inclusion/exclusion criteria clearly defined and appropriate for DN biomarker discovery? | [Yellow] Unclear risk, The review included studies of diabetic patients with and without nephropathy, but inclusion criteria across studies varied widely (different stages of DN, different diabetes types, and demographic variability). No explicit description of how participants were selected in included studies. | [Yellow] Unclear concern, Applicability limited by heterogeneity in populations (T1DM vs. T2DM, ethnic differences, sample sources).
2. Index Test (ML Models/Biomarker Identification Methods) | Were ML methods described in sufficient detail to permit replication? Was model performance validated appropriately? | [Red] High risk, The review summarized various ML approaches (e.g., RF, SVM, LASSO), but many primary studies lacked validation procedures or transparent performance metrics. The review did not critically appraise algorithm robustness or validation quality. | [Yellow] Moderate concern, Some ML models aimed to identify candidate biomarkers (not diagnostic tools), so their direct clinical applicability is limited.
3. Reference Standard (Diagnosis of Diabetic Nephropathy) | Was DN defined and confirmed consistently across included studies? Was the reference standard likely to correctly classify the condition? | [Yellow] Unclear risk, DN definitions varied (e.g., based on eGFR, albuminuria, biopsy, or clinical diagnosis). The review did not standardize or stratify based on diagnostic criteria. | [Yellow] Unclear concern, Variability in diagnostic definitions affects generalizability of biomarker findings.
4. Flow and Timing | Were all patients included in the analysis? Was there an appropriate interval between biomarker testing and reference standard assessment? | [Red] High risk, The review did not evaluate timing consistency or completeness of datasets in primary studies. Missing data handling and participant flow were not discussed. | [Yellow] Moderate concern, Potential temporal mismatch between biomarker sampling and DN diagnosis may bias interpretation.
Overall Judgment | General methodological and reporting quality of the review and the included studies. | [Red] High RoB, No formal quality or bias assessment (e.g., QUADAS-2, PROBAST) applied; heterogeneous populations, methods, and validation standards. | [Yellow] Moderate applicability concern, Review useful for identifying research trends but limited clinical translation due to poor standardization and unclear diagnostic validity.
Colors: [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A43 is an adapted QUADAS-2 tool evaluation of the study Dong et al., 2024 [22] to assess the quality and bias of the review’s approach and of the studies it summarized.
Across the included literature, common issues persist: data leakage arising from poorly specified time windows between biomarker measurement and outcome ascertainment, undefined handling of missing data, and outcome labels that shift across sites or depend on inconsistent diagnostic criteria for diabetic nephropathy. Follow-up intervals were often too short to capture clinically meaningful disease progression, further limiting interpretability.
Most primary studies used small, retrospective datasets and reported impressive accuracy metrics without adequate external validation or calibration. These eye-catching numbers are likely inflated by methodological weaknesses, which limit generalizability to new populations and real-world deployment.
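The time-window problem flagged above is mechanical and preventable. The sketch below shows one way to enforce a clean gap between biomarker measurement and outcome ascertainment and to define labels within a fixed horizon; the horizon, blackout period, and column names are illustrative assumptions, not from any included study.

```python
# Minimal sketch of enforcing a clean temporal window between biomarker
# measurement and outcome ascertainment (illustrative; pandas assumed).
import pandas as pd

visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "biomarker_date": pd.to_datetime(["2018-01-10", "2019-06-01", "2018-03-15", "2019-02-20"]),
    "outcome_date": pd.to_datetime(["2020-02-01", "2020-02-01", "2018-04-01", None]),
})

HORIZON = pd.Timedelta(days=365)   # e.g., a 1-year prediction window
BLACKOUT = pd.Timedelta(days=30)   # minimum gap so the label cannot leak into features

# Keep only rows where the biomarker strictly precedes the outcome window...
ok = visits["outcome_date"].isna() | (visits["outcome_date"] - visits["biomarker_date"] >= BLACKOUT)
visits = visits[ok].copy()

# ...and define the label inside a fixed horizon, not "event ever observed".
visits["label"] = (
    visits["outcome_date"].notna()
    & (visits["outcome_date"] <= visits["biomarker_date"] + HORIZON)
).astype(int)
print(visits)
```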
Table A44. QUADAS-2 RoB and applicability concerns assessment for the study, Nagaraj et al., 2021 [23].
Domain | Key Questions/Description | RoB | Applicability Concerns
1. Patient Selection | Participants were individuals with DKD from observational cohorts. Inclusion and exclusion criteria were clearly defined, and baseline characteristics were described. However, participants were drawn from limited cohorts, potentially leading to selection bias. | [Yellow] Unclear risk, Sampling may not reflect broader DKD populations (mostly moderate to severe stages). | [Yellow] Unclear concern, Applicability may be limited to similar hospital-based populations.
2. Index Test (Kidney Age Index, ML Model) | The KAI framework was developed using ML algorithms to estimate biological kidney age from clinical and biochemical parameters. Model development and internal validation were described, but external validation was not performed. There is potential for data leakage if temporal splits are not strictly enforced. Handling of missing data and feature selection steps were insufficiently detailed. | [Red] High risk, Possible overfitting and unclear missing data handling. | [Yellow] Moderate concern, KAI may not generalize to settings with different patient characteristics or lab standards.
3. Reference Standard (Measured Kidney Function) | The reference standard was eGFR and albuminuria-based diagnosis of DKD, which are accepted clinical measures. However, measurement variability and assay calibration differences were not discussed. | [Yellow] Unclear risk, Reference measures acceptable but not standardized across datasets. | [Green] Low concern, Consistent with clinical definitions of kidney function.
4. Flow and Timing | The study used retrospective data; timing of biomarker and outcome measurements was not always synchronized. Follow-up duration for assessing kidney decline was limited, making it difficult to assess long-term predictive validity. | [Red] High risk, Potential bias due to inconsistent timing and incomplete follow-up data. | [Yellow] Moderate concern, Limited longitudinal data reduce applicability for real-world progression prediction.
Overall Judgment | Innovative and well-conceptualized approach, but limited by internal-only validation, potential data leakage, and incomplete transparency on data preprocessing. | [Red] High overall RoB | [Yellow] Moderate applicability concern
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A44 is a structured QUADAS-2 evaluation summary table including both methodological domains and key structures of the study, Nagaraj et al., 2021 [23].
This study introduces KAI, an ML-derived biomarker estimating kidney function in DKD by mapping clinical variables to an “age-equivalent” kidney function score. While conceptually strong, the diagnostic accuracy framework is limited by several common pitfalls in ML-driven biomarker development:
Possible leakage between training and testing data due to unclear temporal separation.
Undefined handling of missing data, imputation, and variable selection.
Outcome labels (e.g., eGFR thresholds, albuminuria cutoffs) may shift across sites or datasets, affecting comparability.
Follow-up periods are too short to meaningfully assess the decline in kidney function or validate clinical relevance.
Despite these issues, the KAI framework represents a promising step toward personalized nephrology; its eventual utility will depend on methodological transparency and external validation.
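The first two pitfalls above, leakage across training and testing data and unclear temporal separation, are addressed by splitting at the patient level and in chronological order. The sketch below combines both; the cohort sizes, year boundaries, and variables are illustrative assumptions, not from the KAI study.

```python
# Minimal sketch of patient-level, chronology-respecting data splitting
# (illustrative): all visits of one patient stay on one side of the split,
# and the test period follows the training period.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 1000
patient_id = rng.integers(0, 200, size=n)        # ~200 patients, repeated visits
visit_year = rng.integers(2012, 2021, size=n)
X = rng.normal(size=(n, 8))
y = rng.binomial(1, 0.2, size=n)

# 1) Group split: no patient contributes visits to both training and test sets.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=patient_id))

# 2) Temporal filter: train only on earlier visits, test only on later ones.
train_idx = train_idx[visit_year[train_idx] <= 2018]
test_idx = test_idx[visit_year[test_idx] >= 2019]
print(len(train_idx), "training visits;", len(test_idx), "test visits")
```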

Appendix G

This appendix contains details and explanations supplemental to the subsection, “4.7. Predicting future outcomes, such as mortality or cardiovascular events, using ML algorithms or patients’ biomarkers” of the main text. Presenting the quality assessments of the included studies [34] through [40] with the PROBAST and QUADAS-2 tools in the main text would disrupt its flow; however, this discussion is crucial to understanding the overall significance of these studies.
Table A45. PROBAST assessment for Ma et al., 2023 [34].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | 656 peritoneal dialysis patients with 13,091 visits; external testing on 1363 hemodialysis patients. Inclusion criteria were clear, but a single center (or limited centers) may not represent broader PD populations globally. | [Yellow] Unclear | [Yellow] Unclear
Index Model | The “AICare” ML model uses adaptive feature-importance recalibration and multi-channel feature extraction from longitudinal EMR data. Sophisticated model, but missing data handling, temporal separation, and feature selection processes are not fully described. | [Red] High | [Yellow] Moderate
Comparator Model | Conventional ML baselines (RF, logistic regression) and ablation versions of AICare were compared. Comparisons performed internally; no external peritoneal dialysis comparators beyond the hemodialysis dataset. | [Yellow] Unclear | [Yellow] Moderate
Outcome | 1-year mortality following each clinical visit (binary). Mortality is a hard endpoint; cause-specific mortality is not considered. Outcome timing per visit introduces potential inconsistencies. | [Yellow] Unclear | [Green] Low
Timing | Prediction horizon = 1 year post-visit; dataset spans ~12 years. Potential data leakage if future visits influence predictors. Follow-up per patient is variable; censoring not fully detailed. | [Red] High | [Yellow] Moderate
Setting | Single-center (or few-center) tertiary care peritoneal dialysis patients; external hemodialysis dataset for testing. Retrospective EMR data. | [Yellow] Unclear | [Yellow] Moderate
Intended Use of Predictive Model | Early risk stratification for peritoneal dialysis patients to guide clinical interventions and prioritize high-risk patients for monitoring or treatment adjustments. | - | [Yellow] Moderate
Statistical Analysis | Multi-channel ML with adaptive recalibration, internal cross-validation, and testing on an external hemodialysis cohort. Metrics: AUROC, AUPRC. Limited reporting on calibration, missingness handling, or temporal data leakage mitigation. | [Red] High | [Yellow] Moderate
Overall Judgment | Innovative ML approach with interpretability and a large longitudinal dataset. High RoB due to potential leakage, unclear missing data handling, and limited external validation in the PD population. Applicability moderate; promising method, but caution is needed in generalizing results. | [Red] High | [Yellow] Moderate
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A45 is a structured PROBAST evaluation summary table with both methodological domains and key features of the study, Ma et al., 2023 [34].
Several common pitfalls in predictive modeling are evident in this study:
Potential data leakage: Because predictions are made at each visit, there is a risk that future information or visit timing may inadvertently inform earlier predictions if not strictly excluded.
Handling of missingness: Although many longitudinal features are used, the report gives limited detail on how missing values, irregular visit intervals, or drop-outs were handled.
Outcome labels and timing consistency: While mortality is a robust endpoint, the consistency of “1-year ahead from visit” across all visits may vary; also, differences between peritoneal dialysis and hemodialysis cohorts (for external testing) may limit applicability.
Follow-up and generalizability: Although the dataset spans ~12 years, the actual follow-up per individual and the event rate (39.8% died in the peritoneal dialysis cohort) may limit how well the model predicts longer-term outcomes beyond 1 year. The population also appears to be drawn from one tertiary center (or limited centers) in China, which may differ from other geographic settings.
Given these limitations, the high reported performance (e.g., AUROC ~ 0.816 and AUPRC ~ 0.472 in the peritoneal dialysis dataset) must be interpreted cautiously: these numbers may not transfer to other centers, populations, or real-world deployment without further external validation and scrutiny of the modeling pipeline.
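The gap between the AUROC (~0.816) and the AUPRC (~0.472) reported above is typical rather than contradictory: AUPRC is anchored to the event rate, so the same discrimination looks far less impressive on rarer outcomes. The sketch below demonstrates this with synthetic scores; the prevalences and score distributions are illustrative assumptions.

```python
# Minimal sketch contrasting AUROC and AUPRC on the same kind of predictions:
# AUROC is prevalence-insensitive, while AUPRC falls with the event rate.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
for prevalence in (0.40, 0.05):                 # common vs. rare outcome
    y = rng.binomial(1, prevalence, size=5000)
    # Same score separation in both scenarios.
    scores = y * rng.normal(1.0, 1.0, 5000) + (1 - y) * rng.normal(0.0, 1.0, 5000)
    print(f"prevalence {prevalence:.2f}: "
          f"AUROC={roc_auc_score(y, scores):.3f}, "
          f"AUPRC={average_precision_score(y, scores):.3f} "
          f"(chance-level AUPRC = {prevalence:.2f})")
```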
Table A46. PROBAST assessment for Chen et al., 2025 [35].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | Retrospective cohort of 359 hemodialysis patients from one hospital (January 2017–June 2023) in China. Inclusion appears clear, but the cohort is limited to one center, with limited external validation. | [Yellow] Unclear | [Yellow] Moderate
Index Model | Two ML models: Model A (85 variables) and Model B (22 variables), using RF/SVM/logistic regression, with SHAP for interpretability. Model description is good, but detail on missing data handling, temporal splitting, and feature engineering is limited. | [Red] High | [Yellow] Moderate
Comparator Model | Comparisons among ML methods (RF/SVM/logistic regression), but no strong external reference standard model (e.g., a purely conventional risk model) reported. | [Yellow] Unclear | [Yellow] Moderate
Outcome | Two outcomes: (1) all-cause mortality; (2) time to death (regression) for hemodialysis patients. Mortality is meaningful, but follow-up time, censoring, and competing risks are not fully described. | [Yellow] Unclear | [Green] Low
Timing | Data span about 6.5 years; models are built on retrospective data. Unclear whether proper temporal separation between training and validation was maintained; possible leakage if later data influence earlier predictions. | [Red] High | [Yellow] Moderate
Setting | Single tertiary-care hemodialysis center in China; hospital setting. Raises concern about generalizability to other geographic/health-system settings. | [Yellow] Unclear | [Yellow] Moderate
Intended Use of Predictive Model | To support clinical decision-making by predicting mortality risk and time to death in hemodialysis patients, presumably to identify high-risk individuals for intervention. | - | [Yellow] Moderate
Statistical Analysis | Performance metrics: for Model A, AUROC ~ 0.86 ± 0.07; for Model B, AUROC ~ 0.80 ± 0.06. Regression (R²) for time to death reported. But calibration, missing data handling, external validation, and temporal validation are poorly described. | [Red] High | [Yellow] Moderate
Overall Judgment | Innovative and interpretable ML modeling, but significant methodological concerns: single-center data, risk of leakage, limited reporting of missingness/temporal splits, no strong external validation in hemodialysis populations. | [Red] High | [Yellow] Moderate
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A46 is a PROBAST evaluation summary table with both methodological domains and key structures of the study, Chen et al., 2025 [35].
The study by Chen et al. (2025) [35] proposes two interpretable ML tools for predicting all-cause mortality and time to death among hemodialysis patients. Using a retrospective cohort from a single center in China, the authors achieved relatively high discrimination (AUROC ~ 0.86 for their more complex model). However, from a PROBAST perspective, several shortcomings reduce confidence in generalizability:
Data leakage risk: It is not clear whether temporal separation was strictly maintained between training and validation sets (e.g., avoiding future visits influencing predictions).
Missing data and preprocessing: Handling of both is insufficiently described, increasing bias risk.
Outcome definitions and follow-up: While mortality is a robust endpoint, details on censoring, competing risks, and follow-up duration are sparse, raising questions about validity over time.
Setting and sample: Single-center Chinese hemodialysis population limits applicability to other regions, dialysis modalities, or health systems.
Model evaluation: While discrimination is reported, calibration and external validation are absent, meaning the “eye-catching” performance may not travel to other settings.
Table A47. PROBAST assessment for Hung et al., 2022 [36].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | Retrospective cohort of 2932 ICU patients who received CRRT at a single tertiary center (Changhua Christian Hospital, Taiwan) from January 2010 to April 2021. Excluded ESRD on dialysis (n = 283), age <20 years (n = 15), and missing lab data (n = 73). While the cohort is large, the single-center setting may limit representativeness of other ICU/CRRT populations. | [Yellow] Unclear risk | [Yellow] Moderate concern
Index Model | ML algorithms (GBM, XGBoost, RF, SVM) with feature selection (recursive feature elimination) and cross-validation; explainability via SHAP (global and local). However, details on missing data handling, temporal data splits (future leakage), and predictor measurement timing are limited. | [Red] High risk | [Yellow] Moderate concern
Comparator Model | Several ML models compared among themselves; no standard external clinical risk score comparator (or head-to-head comparison with an established CRRT mortality score) reported. | [Yellow] Unclear risk | [Yellow] Moderate concern
Outcome | Primary outcome: in-hospital mortality after CRRT initiation. Secondary endpoints: 28-day and 90-day mortality. Outcome is clinically meaningful and well defined (death during hospitalization). | [Yellow] Unclear risk | [Green] Low concern
Timing | Cohort spans ~11 years (2010–2021). The split: 80% training (n = 2345), 20% test (n = 587). But the temporal validation (e.g., on future data) is not clearly described; potential for data leakage if later-visit data contributed to earlier predictions; handling of censoring/competing risks not deeply addressed. | [Red] High risk | [Yellow] Moderate concern
Setting | Single tertiary university hospital ICU and CRRT dataset in Taiwan. Retrospective EMR. Limits generalizability to other geographies, types of ICUs, CRRT protocols, and patient populations. | [Yellow] Unclear risk | [Yellow] Moderate concern
Intended Use of Predictive Model | To provide interpretable, personalized risk predictions of in-hospital mortality for CRRT patients, aiding clinicians in decision-making and family discussions, and possibly guiding care strategy. | - | [Yellow] Moderate concern
Statistical/Modeling Analysis | The authors used RFE feature selection, 10-fold cross-validation (repeated 5 times), multiple ML algorithms, calibration belts, and SHAP interpretability. Performance: AUC ~ 0.806 (XGBoost) to ~0.823 (GBM) in the test set. Nonetheless, external validation was not performed, the missing-data imputation process is unknown, and overfitting remains a risk. | [Red] High risk | [Yellow] Moderate concern
Overall Judgment | The study is well designed in terms of sample size and use of modern ML + interpretability tools. However, key methodological uncertainties (data leakage, missingness handling, single-center dataset, no external validation) raise high RoB. Applicability is moderate: the model might work in similar ICU/CRRT settings, but caution is warranted in broader settings. | [Red] High risk | [Yellow] Moderate concern
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A47 is a PROBAST evaluation summary table with both methodological domains and key structures of the study, Hung et al., 2022 [36].
This study presents a robust attempt at building an explainable ML model to predict in-hospital mortality among ICU patients receiving CRRT. The dataset is relatively large (n ≈ 2932), and the authors employ modern tools like SHAP to improve transparency. Yet common problems in this space remain: potential leakage, since predictor data may span the CRRT initiation window without strict temporal partitioning; missing data and irregular time windows (the paper excludes patients with missing labs but does not fully describe handling or impact); outcome definitions that are consistent for in-hospital death but may vary in timing and discharge practices across centers; and a short follow-up window (in-hospital death) rather than long-term survival. When a model reports an AUC of ~0.82, the eye-catching number here, one must remember that it was built in one center, with retrospective data and no external validation, all of which may limit its transportability. More weight should be given to studies that validate externally, transparently report missing data and imputation, enforce temporal separation, and calibrate properly across populations.
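Because SHAP recurs throughout these studies, it is worth seeing the two views it provides. The sketch below computes global (cohort-level) and local (per-patient) attributions for a tree model; the data and model are synthetic stand-ins, and the `shap` package is assumed to be installed.

```python
# Minimal sketch of SHAP-style global and local explanations for a
# gradient-boosted mortality model (illustrative; `shap` package assumed).
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)    # one additive attribution per feature

# Global view: mean |SHAP| ranks features by overall influence on predictions.
global_importance = np.abs(shap_values).mean(axis=0)
print("feature ranking (most to least influential):",
      np.argsort(global_importance)[::-1])

# Local view: per-patient attributions explain one individual prediction,
# which is what supports bedside or family discussions.
print("patient 0 attributions:", np.round(shap_values[0], 3))
```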
Table A48. PROBAST assessment for Lin et al., 2023 [37].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | 103 hemodialysis patients (age > 20, hemodialysis > 3 months) at a single center in Taiwan; followed for 36 months; 26 deaths (25.2%) occurred. While inclusion criteria are defined, the small sample size and single-center setting limit representativeness. | [Yellow] Unclear | [Yellow] Moderate
Index Test (Biomarker: Endocan) | Serum endocan levels measured at baseline; the authors explored the association with all-cause mortality in hemodialysis patients. The biomarker assay is specified, but details on timing of measurement in relation to hemodialysis initiation, repeated measures, or variability are limited. | [Yellow] Unclear | [Yellow] Moderate
Comparator/Other Predictors | The study adjusted for prognostic variables (age, diabetes, creatinine, albumin) in multivariable analysis, but no formal predictive model was comprehensively compared with endocan alone or with other biomarker panels. | [Yellow] Unclear | [Yellow] Moderate
Outcome | Outcome is all-cause mortality over 36 months. Hard endpoint; well defined. | [Green] Low | [Green] Low
Timing | Baseline endocan measured, then followed for up to 36 months. However, the study does not fully specify whether visits/predictor measurement preceded the outcome uniformly, or how missing follow-up or censoring was handled. | [Red] High | [Yellow] Moderate
Setting | Single tertiary hospital dialysis center in Taiwan; hemodialysis patients. This may restrict applicability to other populations, geographies, or dialysis practices. | [Yellow] Unclear | [Yellow] Moderate
Intended Use of Predictive Model/Biomarker | The authors propose serum endocan as a biomarker for mortality risk stratification among hemodialysis patients. The intended use is prognostic rather than interventional. | - | [Yellow] Moderate
Statistical Analysis | Kaplan–Meier analysis by endocan median group; ROC curve for endocan (AUC reported); multivariable Cox regression adjusting for select covariates (endocan p = 0.010; creatinine p = 0.034). However, calibration, missing data handling, external validation, temporal separation, and model discrimination beyond the biomarker association are not detailed. | [Red] High | [Yellow] Moderate
Overall Judgment | The study presents an interesting candidate biomarker (endocan) for mortality risk in hemodialysis patients, but methodological limitations (small sample, single center, limited timing/validation detail) raise substantial concerns about bias and generalizability. | [Red] High | [Yellow] Moderate
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A48 is a PROBAST evaluation summary table with both methodological domains and key structures of the study, Lin et al., 2023 [37].
This study investigates whether serum endocan, a marker of endothelial dysfunction, predicts all-cause mortality in a cohort of 103 hemodialysis patients over a 36-month follow-up. The key strength is the use of a hard clinical endpoint (death) and the identification of a statistically significant association (higher endocan → higher mortality). However, several methodological problems reduce confidence in the robustness and transportability of the findings:
Timing and follow-up: While baseline measurement and 36-month follow-up are reported, the study does not fully elucidate whether predictor measurement preceded the outcome in all cases, how censoring was handled, or how missing follow-up impacted results, raising the risk of non-uniform timing windows.
Small sample size & setting: With only 103 patients (26 events) and data from a single center, results may be prone to overfitting or idiosyncratic to this environment. External generalizability is modest.
Limited model development/validation: The study essentially reports a biomarker-mortality association rather than a fully developed predictive model with performance metrics (calibration, discrimination in an independent cohort). As such, the “eye-catching” association may not hold in other populations.
Missing or variable data handling: The report does not deeply describe how missing biomarker/clinical values, variability in hemodialysis practices, or shifting outcome definitions (e.g., timing of death, cause of death) were addressed, all of which are potential sources of bias.
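For readers unfamiliar with the survival workflow this class of biomarker study uses, the sketch below reproduces its two core steps, Kaplan-Meier curves by biomarker group and a covariate-adjusted Cox model, on fully synthetic data. The `lifelines` package, variable names, effect sizes, and censoring rule are all illustrative assumptions, not the study's data or results.

```python
# Minimal sketch of the biomarker survival workflow: Kaplan-Meier by group
# plus a multivariable Cox model (illustrative; `lifelines` package assumed).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "endocan_high": rng.binomial(1, 0.5, n),        # above vs. below median
    "age": rng.normal(65, 10, n),
    "albumin": rng.normal(3.8, 0.4, n),
})
# Synthetic survival times with higher hazard for the high-biomarker group,
# administratively censored at 36 months (as in a 3-year follow-up).
hazard = np.exp(0.7 * df["endocan_high"] + 0.02 * (df["age"] - 65))
df["time_months"] = np.minimum(rng.exponential(60 / hazard), 36)
df["death"] = (df["time_months"] < 36).astype(int)

# Kaplan-Meier estimate per biomarker group.
km = KaplanMeierFitter()
for grp, sub in df.groupby("endocan_high"):
    km.fit(sub["time_months"], sub["death"], label=f"endocan_high={grp}")
    print(f"group {grp}: median survival = {km.median_survival_time_}")

# Multivariable Cox regression: adjusted hazard ratio for the biomarker.
cph = CoxPHFitter().fit(df, duration_col="time_months", event_col="death")
print(cph.summary[["coef", "exp(coef)", "p"]])
```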
Table A49. PROBAST assessment of the study, Tran et al., 2024 [38].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | External validation dataset of 527 outpatients with stage 4 or 5 CKD (non-dialysis) from a French regional cohort; 91 of 527 died within 2 years. The validation cohort differed from the development cohort (younger age, lower death rate). | [Yellow] Unclear, while external, differences in cohort characteristics suggest possible selection or spectrum bias. | [Yellow] Unclear, the setting (French outpatient CKD stage 4–5) may limit generalizability to other countries, dialysis cohorts, or more heterogeneous CKD populations.
Index Model | A previously developed ML-based 2-year all-cause mortality prediction tool (7 variables: age, ESA use, cardiovascular history, smoking status, 25-OH vitamin D, PTH, ferritin) from an earlier cohort. | [Yellow] Unclear, the validation uses an existing model, but details of model adaptation, calibration, and predictor measurement in the new cohort are limited. | [Yellow] Unclear, a model developed in one dataset may perform differently in new populations with different baseline risks and features.
Comparator Model | The study does not present a head-to-head comparison against a standard clinical risk score or alternative predictive model in the validation cohort; rather, it applies the existing tool. | [Yellow] Unclear, absence of a comparator limits assessment of relative performance. | [Yellow] Unclear, without benchmark models, it is hard to evaluate added value in this setting.
Outcome | All-cause mortality at 2 years of follow-up. Hard clinical endpoint, clearly defined. In the validation dataset, 91/527 died within 2 years. | [Green] Low, outcome is well defined and appropriate. | [Green] Low, applicability of the outcome is good for clinical mortality prediction in CKD stage 4–5.
Timing | Predictor variables measured at baseline; follow-up period = 2 years. However, there is limited information on the timing of predictor ascertainment relative to baseline, censoring and missing follow-up, and whether temporal effects or secular changes were accounted for. | [Red] High, the potential for bias is elevated because of insufficient reporting of timing, missing follow-up handling, and potential changes in care over time. | [Yellow] Moderate concern, the 2-year horizon is clinically relevant, but differences in cohorts and treatment eras may affect transportability.
Setting | Outpatient nephrology settings in France (stage 4–5 CKD, non-dialysis). The model was externally validated in this setting. | [Yellow] Unclear, a single region may limit heterogeneity; applicability to broader settings uncertain. | [Yellow] Moderate concern, the setting is relevant for outpatient CKD patients, but generalization to other geographies, healthcare systems, or dialysis populations is uncertain.
Intended Use of Predictive Model | To predict 2-year all-cause mortality in stage 4–5 CKD patients to support risk stratification and potentially inform monitoring or early intervention. | - | [Yellow] Moderate concern, the intended use is compatible with the validation setting, but the lack of broad transportability reduces practical applicability.
Statistical/Modeling Analysis | Validation reported AUC-ROC = 0.72, accuracy = 63.6%, sensitivity = 72.5%, specificity = 61.7%. The model showed significant separation of survival curves (p < 0.001). However, calibration metrics, handling of missing data, sample size adequacy for external validation, and robustness of performance across subgroups are not fully detailed. | [Red] High, key modeling aspects (calibration, missing data, model updating) are inadequately reported, increasing bias risk. | [Yellow] Moderate concern, the model shows reasonable discrimination, but limited detail and moderate specificity raise concerns about real-world performance.
Overall Judgment | While this is a genuine external validation of a predictive model (a strength), the limitations around timing, missing data reporting, sample representativeness, and limited reporting of calibration mean the RoB is high. Applicability is moderate: the tool may work in similar outpatient CKD stage 4–5 populations in France, but transfer to other settings is uncertain. | [Red] High RoB | [Yellow] Moderate applicability concern
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A49 is a PROBAST evaluation summary table that includes both methodological domains and key structures of the study, Tran et al., 2024 [38].
This study by Tran and colleagues conducts an external validation of a 2-year all-cause mortality prediction tool developed via ML in patients with stage 4–5 CKD [38]. The validation cohort of 527 patients, with 91 deaths in 2 years, demonstrates modest performance (AUC 0.72), which is commendable for external validation. However, several methodological issues raise caution:
Measurement or selection bias: the validation cohort differed significantly from the development cohort (younger age, lower event rate), which may affect calibration or transportability.
Timing and missing-data concerns: the study provides limited detail on how predictor measurements were timed relative to baseline, how missing values were handled, or how secular changes in care were accounted for. This opens the possibility of bias or reduced reliability.
Model transportability: though validated externally, it is still within a French outpatient nephrology context; applicability to other healthcare systems, patient populations (e.g., dialysis, other countries), or treatment eras is untested.
Limited statistical reporting: Calibration metrics (e.g., calibration slope/intercept), decision-curve analysis, or subgroup performance are not fully reported, reducing confidence in clinical implementation despite the “eye-catching” AUC of 0.72.
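The mechanics of external validation that Tran et al. perform are simple to state: the model is fitted once on the development cohort, frozen, and then applied unchanged to the new cohort, where discrimination and operating characteristics at a pre-specified cutoff are reported. The sketch below mimics these mechanics on synthetic data; a genuine external cohort would come from a different site or period, and the sizes, cutoff, and variables here are illustrative assumptions.

```python
# Minimal sketch of the mechanics of external validation: freeze a model on
# a development cohort, then apply it unchanged to a validation cohort.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=1727, n_features=7, weights=[0.8, 0.2], random_state=1)
X_dev, y_dev = X[:1200], y[:1200]    # development cohort
X_ext, y_ext = X[1200:], y[1200:]    # stand-in for an independent validation cohort

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)  # fit once, then freeze

p = model.predict_proba(X_ext)[:, 1]
print(f"external AUC = {roc_auc_score(y_ext, p):.2f}")

# Sensitivity/specificity at a cutoff fixed BEFORE seeing the validation data.
tn, fp, fn, tp = confusion_matrix(y_ext, (p >= 0.2).astype(int)).ravel()
print(f"sensitivity = {tp / (tp + fn):.2f}, specificity = {tn / (tn + fp):.2f}")
```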
Table A50. PROBAST assessment for the study, Kim et al., 2020 [39].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | Hemodialysis patients from a prospective multi-center Korean ESRD cohort (“K-cohort”), of whom 354 of 452 eligible had plasma endocan measured and were followed for ~34.6 months. Selection of those with available endocan data may introduce selection bias; cohort limited to Korean centers. | [Yellow] Unclear | [Yellow] Moderate
Index Test (Biomarker: Plasma Endocan) | Baseline plasma endocan measured once (EDTA tube, fasting, mid-week dialysis). The biomarker is under investigation as a predictor of cardiovascular events. Single measurement only; no repeated measures or longitudinal biomarker changes assessed. | [Yellow] Unclear | [Yellow] Moderate
Comparator/Other Predictors | The study uses multivariable Cox regression adjusting for prior cardiovascular events, albumin, BMI, TG, and other covariates, but no formal standard risk-score model benchmark is reported. | [Yellow] Unclear | [Yellow] Moderate
Outcome | Composite cardiovascular event (acute coronary syndrome, stable angina requiring PCI/CABG, heart failure, ventricular arrhythmia, cardiac arrest, sudden death) and non-cardiac death. Outcome clearly defined, measured over follow-up. | [Green] Low | [Green] Low
Timing/Flow | Baseline biomarker measured, then follow-up (mean ~34.56 months). However, the study excludes patients missing endocan data and does not clearly describe censoring, handling of missing follow-up, or whether the timing of predictor measurement relative to baseline events might introduce bias. | [Red] High | [Yellow] Moderate
Setting | Multi-center (six hospitals in South Korea) ESRD patients on hemodialysis, three times/week, >3 months vintage. While multi-center, all sites are in one country/region; ESRD hemodialysis population specific. | [Yellow] Unclear | [Yellow] Moderate
Statistical/Modeling Analysis | Kaplan–Meier survival, determination of the optimal cut-off for endocan via MaxStat, univariate and multivariable Cox regression (HR ~ 1.949 for high vs. low endocan, 95% CI 1.144–3.319, p = 0.014) for cardiovascular events. But limited reporting of calibration, no external validation, no missing data imputation described, and risk of overfitting given the moderate sample size/event count. | [Red] High | [Yellow] Moderate
Overall Judgment | The study provides a suggestive biomarker association of plasma endocan with cardiovascular events in hemodialysis patients, but is limited by the single measurement, missing data that may bias results, modest sample size, no external validation, and uncertain handling of timing/censoring. | [Red] High RoB | [Yellow] Moderate applicability concern
Colors: [Green] = External validation (multi-site or independent cohort). [Yellow] = Internal split or temporal validation only. [Red] = No external validation/preprint/exploratory approach.
Table A50 is a PROBAST evaluation summary table covering both the methodological domains and the key structural features of the study by Kim et al., 2020 [39].
This study explores the prognostic value of plasma endocan, a marker of endothelial dysfunction, in predicting cardiovascular events in ESRD patients on maintenance hemodialysis. Strengths include a well-defined cohort, a clear outcome definition, and a significant association (higher endocan → higher risk of cardiovascular events).
However, through a PROBAST prediction-modeling lens, several key limitations emerge:
Timing/flow leakage risk: Although baseline biomarker measurement and ~34-month follow-up are reported, the exclusion of roughly 100 eligible patients (354 of 452 analyzed), unclear censoring, and the single biomarker measurement raise concerns about selection bias and missingness.
Missing data/measurement inconsistency: The biomarker was measured once; no repeated measures to capture dynamic risk changes. Handling of missing covariate data is not detailed.
Outcome label consistency and generalizability: The composite cardiovascular events definition is broad (includes arrhythmia, sudden death, heart failure), which may vary across settings; the cohort consists of Korean hemodialysis patients, which may limit transportability.
Statistical modeling limitations: Although survival analysis was used, the study lacks external validation, calibration metrics, and comprehensive missing-data/imputation strategies. The cut-off determination via MaxStat is data-driven and may overestimate the effect size (cut-off bias); the simulation sketch after this commentary illustrates the problem.
Competing risks and censoring: Follow-up may be adequate (~3 years), but there is no comment on competing risks (e.g., non-cardiovascular deaths) or censoring due to transplantation, modality shift, or loss to follow-up.
Thus, while the eye-catching hazard ratio (~1.95) is encouraging, the high RoB and moderate applicability concerns mean these findings should be interpreted cautiously. The results may not travel well to other hemodialysis populations, regions, or settings without further validation.
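To illustrate the cut-off bias flagged above, here is a minimal, self-contained simulation; it is our own construction rather than an analysis from Kim et al. [39], and the sample size and event fraction are loose assumptions modeled on the study. A biomarker generated as pure noise is dichotomized at the cut-off that maximizes the log-rank statistic (a MaxStat-style search), and the apparent effect is biased away from the null.

```python
# Illustrative only: optimism from a data-driven (MaxStat-style) cut-off search.
# The biomarker is pure noise, yet the selected cut-off yields an apparent
# effect whose magnitude is systematically inflated above 1.
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(1)
apparent_effects = []
for _ in range(200):                        # 200 simulated null cohorts
    n = 354                                 # analyzed sample, as in the study
    marker = rng.normal(size=n)             # biomarker unrelated to outcome
    time = rng.exponential(34.6, size=n)    # follow-up times in months
    event = rng.binomial(1, 0.4, size=n)    # ~40% experience the event

    # Scan candidate cut-offs between the 10th and 90th percentiles and keep
    # the one that maximizes the log-rank test statistic.
    best_stat, best_cut = -np.inf, None
    for cut in np.quantile(marker, np.linspace(0.1, 0.9, 17)):
        hi = marker > cut
        res = logrank_test(time[hi], time[~hi], event[hi], event[~hi])
        if res.test_statistic > best_stat:
            best_stat, best_cut = res.test_statistic, cut

    # Crude effect estimate at the chosen cut-off: ratio of events per
    # person-month between groups, expressed as a magnitude >= 1.
    hi = marker > best_cut
    rate_hi = event[hi].sum() / time[hi].sum()
    rate_lo = event[~hi].sum() / time[~hi].sum()
    ratio = rate_hi / rate_lo
    apparent_effects.append(max(ratio, 1 / ratio))

print(f"median apparent effect under the null: {np.median(apparent_effects):.2f}")
```

Because the cut-off is chosen to maximize group separation, the median apparent effect exceeds 1 even though the biomarker carries no signal, which is why data-driven cut-offs should be penalized, bootstrapped, or confirmed in independent data before a hazard ratio such as ~1.95 is taken at face value.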
Table A51. PROBAST assessment for the study, Zhu et al., 2024 [40].
Domain | Description/Assessment | RoB | Applicability Concerns
Population/Participants | The study used electronic medical records from a single center (Chinese PLA General Hospital) from 2015 to 2020, enrolling 8894 CKD patients (incident or ongoing) who were followed for composite CVD events. Inclusion criteria, exclusion criteria, and representativeness beyond that center are not fully elaborated. | Unclear risk (Yellow) | Moderate concern (Yellow): single-center data may not generalize to other populations or health systems
Index Model | Predictors were selected via LASSO regression, and seven ML classification algorithms were developed, with XGBoost the top performer; SHAP was used for interpretability. However, descriptions of how missing data were handled, how predictor timing was controlled (to avoid leakage), and feature engineering are limited. | High risk (Red) | Moderate concern (Yellow): model may be overfit to center-specific patterns
Comparator Model | The study compares across ML algorithms (e.g., XGBoost vs. RF, SVM, etc.) and contrasts with baseline logistic regression/simpler models, but no strong external benchmark or well-established clinical risk score is used for comparison. | Unclear risk (Yellow) | Moderate concern (Yellow)
Outcome | The outcome is a composite CVD event (broad definition) including coronary, cerebrovascular, and peripheral vascular disease, heart failure, and death. The broad composite may dilute specificity, and the inclusion of deaths from all causes (cardiovascular, non-cardiovascular, unknown) further complicates interpretation. | Unclear risk (Yellow) | Moderate concern (Yellow)
Timing/Flow | Predictor data span 2015–2020, and models were evaluated on a held-out test set. However, the paper does not strongly detail temporal separation (i.e., using earlier data to predict future events), the risk of data leakage (features derived partly after the outcome), or the handling of censoring/time-to-event aspects, nor does it clearly describe how patients lost to follow-up or missing events were managed. | High risk (Red) | Moderate concern (Yellow)
Setting | Clinical CKD care setting in China (single hospital); retrospective EMR context. | Unclear risk (Yellow) | Moderate concern (Yellow)
Intended Use of Predictive Model | To help clinicians identify CKD patients at high risk for cardiovascular disease, enabling early interventions or tailored monitoring. | - | Moderate concern (Yellow)
Statistical/Modeling Analysis | Performance was evaluated on the test set using AUC, accuracy, sensitivity, specificity, and F1-score; the top model (XGBoost) reached AUC ≈ 0.89. SHAP was used to interpret feature importance. However, the analysis lacks detailed calibration metrics, external validation, detailed missing-data imputation strategies, robustness checks (e.g., sensitivity analyses), and explicit statements about prevention of overfitting or leakage. | High risk (Red) | Moderate concern (Yellow)
Overall Judgment | The study presents a promising approach with strong discrimination metrics and interpretability efforts, but methodological reporting gaps (temporal leakage risk, missing-data handling, lack of external validation) undermine reliability. | High RoB (Red) | Moderate applicability concern (Yellow)
Colors: Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table A51 is a PROBAST evaluation summary table covering both the methodological domains and the key structural features of the study by Zhu et al., 2024 [40].
Zhu et al. (2024) [40] developed an ML-based CVD risk prediction model in a cohort of 8894 CKD patients from a single Chinese center. The XGBoost model delivered strong discrimination (AUC ≈ 0.89 in test data) and was made interpretable via SHAP.
However, several common pitfalls in predictive modeling are apparent:
Leakage risk/imprecise timing windows: Without strict temporal separation (i.e., training on earlier data and testing on truly future data), there is a chance that features partly reflect future events or correlate with outcomes in unintended ways; a leakage-resistant split is sketched after this commentary.
Undefined handling of missingness: The paper does not clearly explain how missing predictor values were handled (imputation, exclusion), which can bias model performance.
Outcome label drift across sites: The composite definition of CVD is broad and may not map cleanly to other cohorts; what qualifies as a CVD event may vary over time or by hospital coding practices.
Follow-up adequacy: Although the test sets were drawn from the same center, there is limited discussion of censoring, loss to follow-up, or which prediction horizons are clinically meaningful.
Thus, while the eye-catching AUC of ~0.89 suggests very high performance, it must be taken with caution. The lack of external validation and the incomplete methodological transparency mean the metric may not travel well to new patient populations or settings. In assessing models in this space, higher-quality evidence, i.e., multi-center external validation, clear reporting of missing-data strategies, rigorous temporal splitting, and calibration metrics, should carry greater weight than a single-center result, no matter how impressive the discrimination.
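As a concrete illustration of the temporal separation and explicit missing-data handling these critiques call for, below is a minimal Python sketch; it is our own construction and not Zhu et al.'s pipeline [40], and the synthetic cohort, feature names, and hyperparameters are all assumptions. The model is trained on earlier index years, evaluated on strictly later ones, and interpreted with SHAP, mirroring the study's XGBoost + SHAP setup.

```python
# Illustrative only: leakage-resistant temporal split, train-only imputation,
# and SHAP attributions for an EMR-style CVD risk model. Data are synthetic.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
import shap

rng = np.random.default_rng(2)
n = 2000                                    # synthetic stand-in cohort
df = pd.DataFrame({
    "index_year": rng.integers(2015, 2021, n),
    "age": rng.normal(60, 12, n),
    "sbp": rng.normal(135, 18, n),
    "egfr": rng.normal(45, 20, n),
    "glucose": rng.normal(6.5, 1.8, n),
})
risk = 1 / (1 + np.exp(-(0.04 * (df["age"] - 60) - 0.03 * (df["egfr"] - 45) - 1.2)))
df["cvd_event"] = rng.binomial(1, risk)
df.loc[rng.random(n) < 0.1, "egfr"] = np.nan          # inject missingness

features = ["age", "sbp", "egfr", "glucose"]

# Temporal separation: develop on earlier years, test on strictly later ones.
train = df[df["index_year"] <= 2018]
test = df[df["index_year"] >= 2019]

# Fit the imputer on the training split only, then apply it to both splits,
# so no test-set information leaks into preprocessing.
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(train[features])
X_test = imputer.transform(test[features])

model = XGBClassifier(n_estimators=300, max_depth=4,
                      learning_rate=0.05, eval_metric="logloss")
model.fit(X_train, train["cvd_event"])

pred = model.predict_proba(X_test)[:, 1]
print("temporal-test AUROC:", round(roc_auc_score(test["cvd_event"], pred), 3))

# SHAP attributions, mirroring the interpretability step reported in [40].
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X_test), X_test, feature_names=features)
```

Fitting the imputer on pooled data instead of the training split alone is the subtle leak such pipelines most often commit; the temporal split here is the minimal design that makes an AUROC claim about "future" patients honest, though it still falls short of true external validation.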

References

  1. Athavale, A.M.; Hart, P.D.; Itteera, M.; Cimbaluk, D.; Patel, T.; Alabkaa, A.; Arruda, J.; Singh, A.; Rosenberg, A.; Kulkarni, H. Development and validation of a deep learning model to quantify interstitial fibrosis and tubular atrophy from kidney ultrasonography images. JAMA Netw. Open 2021, 4, e2111176. [Google Scholar] [CrossRef]
  2. Trojani, V.; Monelli, F.; Besutti, G.; Bertolini, M.; Verzellesi, L.; Sghedoni, R.; Lori, M.; Ligabue, G.; Pattacini, P.; Rossi, P.G.; et al. Using MRI Texture Analysis Machine Learning Models to Assess Graft Interstitial Fibrosis and Tubular Atrophy in Patients with Transplanted Kidneys. Information 2024, 15, 537. [Google Scholar] [CrossRef]
  3. Ginley, B.; Jen, K.Y.; Han, S.S.; Rodrigues, L.; Jain, S.; Fogo, A.B.; Jonathan, Z.; Vighnesh, W.; Jeffrey, C.M.; Yumeng, W.; et al. Automated computational detection of interstitial fibrosis, tubular atrophy, and glomerulosclerosis. J. Am. Soc. Nephrol. 2021, 32, 837–850. [Google Scholar] [CrossRef]
  4. Zheng, Y.; Cassol, C.A.; Jung, S.; Veerapaneni, D.; Chitalia, V.C.; Ren, K.Y.; Bellur, S.S.; Boor, P.; Barisoni, L.M.; Waikar, S.S.; et al. Deep-learning-driven quantification of interstitial fibrosis in digitized kidney biopsies. Am. J. Pathol. 2021, 191, 1442–1453. [Google Scholar] [CrossRef]
  5. Athavale, A.M.; Hart, P.D.; Itteera, M.; Cimbaluk, D.; Patel, T.; Alabka, A.; Arruda, J.; Singh, A.; Rosenberg, A.; Kulkarni, H. Deep learning to predict degree of interstitial fibrosis and tubular atrophy from kidney ultrasound images-an artificial intelligence approach. medRxiv 2020. medRxiv:17.20176958. [Google Scholar]
  6. Ginley, B.; Jen, K.Y.; Rosenberg, A.; Yen, F.; Jain, S.; Fogo, A.; Sarder, P. Neural network segmentation of interstitial fibrosis, tubular atrophy, and glomerulosclerosis in renal biopsies. arXiv 2020, arXiv:2002.12868. [Google Scholar] [CrossRef]
  7. Yin, Y.; Chen, C.; Zhang, D.; Han, Q.; Wang, Z.; Huang, Z.; Chen, H.; Sun, L.; Fei, S.; Tao, J.; et al. Construction of predictive model of interstitial fibrosis and tubular atrophy after kidney transplantation with machine learning algorithms. Front. Genet. 2023, 14, 1276963. [Google Scholar] [CrossRef] [PubMed]
  8. Basuli, D.; Kavcar, A.; Roy, S. From bytes to nephrons: AI’s journey in diabetic kidney disease. J. Nephrol. 2025, 38, 25–35. [Google Scholar] [CrossRef]
  9. Lei, Q.; Hou, X.; Liu, X.; Liang, D.; Fan, Y.; Xu, F.; Liang, S.; Liang, D.; Yang, J.; Xie, G.; et al. Artificial intelligence assists identification and pathologic classification of glomerular lesions in patients with diabetic nephropathy. J. Transl. Med. 2024, 22, 397. [Google Scholar] [CrossRef]
  10. Makino, M.; Yoshimoto, R.; Ono, M.; Itoko, T.; Katsuki, T.; Koseki, A.; Kudo, M.; Haida, K.; Kuroda, J.; Yanagiya, R.; et al. Artificial intelligence predicts the progression of diabetic kidney disease using big data machine learning. Sci. Rep. 2019, 9, 11862. [Google Scholar] [CrossRef]
  11. Nayak, S.; Amin, A.; Reghunath, S.R.; Thunga, G.; Acharya, D.; Shivashankara, K.N.; Attur, R.P.; Acharya, L.D. Development of a machine learning-based model for the prediction and progression of diabetic kidney disease: A single centred retrospective study. Int. J. Med. Inform. 2024, 190, 105546. [Google Scholar] [CrossRef]
  12. Li, Y.; Jin, N.; Zhan, Q.; Huang, Y.; Sun, A.; Yin, F.; Li, Z.; Hu, J.; Liu, Z. Machine learning-based risk predictive models for diabetic kidney disease in type 2 diabetes mellitus patients: A systematic review and meta-analysis. Front. Endocrinol. 2025, 16, 1495306. [Google Scholar] [CrossRef]
  13. Zhu, Y.; Zhang, Y.; Yang, M.; Tang, N.; Liu, L.; Wu, J.; Yang, Y. Machine Learning-Based Predictive Modeling of Diabetic Nephropathy in Type 2 Diabetes Using Integrated Biomarkers: A Single-Center Retrospective Study. Diabetes Metab. Syndr. Obes. 2024, 17, 1987–1997. [Google Scholar] [CrossRef] [PubMed]
  14. Zhu, Y.; Liu, J.; Wang, B. Integrated approach of machine learning, Mendelian randomization and experimental validation for biomarker discovery in diabetic nephropathy. Diabetes Obes. Metab. 2024, 26, 5646–5660. [Google Scholar] [CrossRef] [PubMed]
  15. Yin, J.M.; Li, Y.; Xue, J.T.; Zong, G.W.; Fang, Z.Z.; Zou, L. Explainable Machine Learning-Based Prediction Model for Diabetic Nephropathy. J. Diabetes Res. 2024, 2024, 8857453. [Google Scholar] [CrossRef]
  16. Lucarelli, N.; Yun, D.; Han, D.; Ginley, B.; Moon, K.C.; Rosenberg, A.Z.; Tomaszewski, J.E.; Zee, J.; Jen, K.Y.; Han, S.S.; et al. Discovery of Novel Digital Biomarkers for Type 2 Diabetic Nephropathy Classification via Integration of Urinary Proteomics and Pathology. medRxiv 2023. [Google Scholar] [CrossRef]
  17. Yan, X.; Zhang, X.; Li, H.; Zou, Y.; Lu, W.; Zhan, M.; Liang, Z.; Zhuang, H.; Ran, X.; Ma, G.; et al. Application of Proteomics and Machine Learning methods to study the Pathogenesis of Diabetic Nephropathy and screen urinary biomarkers. J. Proteome Res. 2024, 23, 3612–3625. [Google Scholar] [CrossRef]
  18. Dong, Z.; Wang, Q.; Ke, Y.; Zhang, W.; Hong, Q.; Liu, C.; Liu, X.; Yang, J.; Xi, Y.; Shi, J.; et al. Prediction of 3-year risk of diabetic kidney disease using machine learning based on electronic medical records. J. Transl. Med. 2022, 20, 143. [Google Scholar] [CrossRef]
  19. Hsu, C.T.; Pai, K.C.; Chen, L.C.; Lin, S.H.; Wu, M.J. Machine learning models to predict the risk of rapidly progressive kidney disease and the need for nephrology referral in adult patients with type 2 diabetes. Int. J. Environ. Res. Public Health 2023, 20, 3396. [Google Scholar] [CrossRef]
  20. Paranjpe, I.; Wang, X.; Anandakrishnan, N.; Haydak, J.C.; Van Vleck, T.; DeFronzo, S.; Li, Z.; Mendoza, A.; Liu, R.; Fu, J.; et al. Deep learning on electronic medical records identifies distinct subphenotypes of diabetic kidney disease driven by genetic variations in the Rho pathway. medRxiv 2023. medRxiv:06.23295120. [Google Scholar]
  21. Xu, Q.; Wang, L.; Sansgiry, S.S. A systematic literature review of predicting diabetic retinopathy, nephropathy and neuropathy in patients with type 1 diabetes using machine learning. J. Med. Artif. Intell. 2020, 3, 6. [Google Scholar] [CrossRef]
  22. Dong, B.; Liu, X.; Yu, S. Utilizing machine learning algorithms to identify biomarkers associated with diabetic nephropathy: A review. Medicine 2024, 103, e37235. [Google Scholar] [CrossRef] [PubMed]
  23. Nagaraj, S.B.; Kieneker, L.M.; Pena, M.J. Kidney Age Index (KAI): A novel age-related biomarker to estimate kidney function in patients with diabetic kidney disease using machine learning. Comput. Methods Programs Biomed. 2021, 211, 106434. [Google Scholar] [CrossRef] [PubMed]
  24. Chan, L.; Nadkarni, G.N.; Fleming, F.; McCullough, J.R.; Connolly, P.; Mosoyan, G.; El Salem, F.; Kattan, M.W.; Vassalotti, J.A.; Murphy, B.; et al. Derivation and validation of a machine learning risk score using biomarker and electronic patient data to predict progression of diabetic kidney disease. Diabetologia 2021, 64, 1504–1515. [Google Scholar] [CrossRef]
  25. Sabanayagam, C.; He, F.; Nusinovici, S.; Li, J.; Lim, C.; Tan, G.; Cheng, C.Y. Prediction of diabetic kidney disease risk using machine learning models: A population-based cohort study of Asian adults. eLife 2023, 12, e81878. [Google Scholar] [CrossRef]
  26. Bienaimé, F.; Muorah, M.; Metzger, M.; Broeuilh, M.; Houiller, P.; Flamant, M.; Haymann, J.; Vonderscher, J.; Mizrahi, J.; Friedlander, G.; et al. Combining robust urine biomarkers to assess chronic kidney disease progression. EBioMedicine 2023, 93, 104635. [Google Scholar]
  27. Pizzini, P.; Leonardis, D.; Torino, C.; Postorino, M.; Tripepi, G.; Mallamaci, F.; Zoccali, C. The Potential of Urinary Biomarkers to Improve Risk Stratification for Ckd Progression: A Pilot Study. In Nephrology Dialysis Transplantation; Oxford University Press: Oxford, UK, 2017; Volume 32. [Google Scholar]
  28. Qin, Y.; Zhang, S.; Shen, X.; Zhang, S.; Wang, J.; Zuo, M.; Coi, X.; Gao, Z.; Yang, J.; Zhu, H.; et al. Evaluation of urinary biomarkers for prediction of diabetic kidney disease: A propensity score matching analysis. Ther. Adv. Endocrinol. Metab. 2019, 10, 2042018819891110. [Google Scholar] [CrossRef]
  29. Schanstra, J.P.; Zürbig, P.; Alkhalaf, A.; Argiles, A.; Bakker, S.J.; Beige, J.; Vanholder, R. Diagnosis and prediction of CKD progression by assessment of urinary peptides. J. Am. Soc. Nephrol. 2015, 26, 1999–2010. [Google Scholar] [CrossRef]
  30. Muiru, A.N.; Scherzer, R.; Ascher, S.B.; Jotwani, V.; Grunfeld, C.; Shigenaga, J.; Spaulding, K.A.; Ng, D.K.; Gustafson, D.; Spence, A.B.; et al. Associations of CKD risk factors and longitudinal changes in urine biomarkers of kidney tubules among women living with HIV. BMC Nephrol. 2021, 22, 296. [Google Scholar] [CrossRef]
  31. Ferguson, T.; Ravani, P.; Sood, M.M.; Clarke, A.; Komenda, P.; Rigatto, C.; Tangri, N. Development and external validation of a machine learning model for progression of CKD. Kidney Int. Rep. 2022, 7, 1772–1781. [Google Scholar] [CrossRef]
  32. Tangri, N.; Ferguson, T.W.; Bamforth, R.J.; Leon, S.J.; Arnott, C.; Mahaffey, K.W.; Kotwal, S.; Heerspink, H.J.L.; Perkovic, V.; Fletcher, R.A.; et al. Machine learning for prediction of chronic kidney disease progression: Validation of the Klinrisk model in the CANVAS Program and CREDENCE trial. Diabetes Obes. Metab. 2024, 26, 3371–3380. [Google Scholar] [CrossRef]
  33. Zou, Y.; Zhao, L.; Zhang, J.; Wang, Y.; Wu, Y.; Ren, H.; Wang, T.; Zhang, R.; Wang, J.; Zhao, Y.; et al. Development and internal validation of machine learning algorithms for end-stage renal disease risk prediction model of people with type 2 diabetes mellitus and diabetic kidney disease. Ren. Fail. 2022, 44, 562–570. [Google Scholar] [CrossRef] [PubMed]
  34. Ma, L.; Zhang, C.; Gao, J.; Jiao, X.; Yu, Z.; Zhu, Y.; Wang, T.; Ma, X.; Wang, Y.; Tang, W.; et al. Mortality prediction with adaptive feature importance recalibration for peritoneal dialysis patients. Patterns 2023, 4, 100892. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, M.; Zeng, Y.; Liu, M.; Li, Z.; Wu, J.; Tian, X.; Wang, Y.; Xu, Y. Interpretable machine learning models for the prediction of all-cause mortality and time to death in hemodialysis patients. Ther. Apher. Dial. 2025, 29, 220–232. [Google Scholar] [CrossRef]
  36. Hung, P.S.; Lin, P.R.; Hsu, H.H.; Huang, Y.C.; Wu, S.H.; Kor, C.T. Explainable machine learning-based risk prediction model for in-hospital mortality after continuous renal replacement therapy initiation. Diagnostics 2022, 12, 1496. [Google Scholar] [CrossRef]
  37. Lin, J.H.; Hsu, B.G.; Wang, C.H.; Tsai, J.P. Endocan as a potential marker for predicting all-cause mortality in hemodialysis patients. J. Clin. Med. 2023, 12, 7427. [Google Scholar] [CrossRef]
  38. Tran, D.N.; Ducher, M.; Fouque, D.; Fauvel, J.P. External validation of a 2-year all-cause mortality prediction tool developed using machine learning in patients with stage 4-5 chronic kidney disease. J. Nephrol. 2024, 37, 2267–2274. [Google Scholar] [CrossRef]
  39. Kim, J.S.; Ko, G.J.; Kim, Y.G.; Lee, S.Y.; Lee, D.Y.; Jeong, K.H.; Lee, S.H. Plasma endocan as a predictor of cardiovascular event in patients with end-stage renal disease on hemodialysis. J. Clin. Med. 2020, 9, 4086. [Google Scholar] [CrossRef]
  40. Zhu, H.; Qiao, S.; Zhao, D.; Wang, K.; Wang, B.; Niu, Y.; Shang, S.; Dong, Z.; Zhang, W.; Zheng, Y.; et al. Machine learning model for cardiovascular disease prediction in patients with chronic kidney disease. Front. Endocrinol. 2024, 15, 1390729. [Google Scholar] [CrossRef]
  41. Fan, C.; Yang, G.; Li, C.; Cheng, J.; Chen, S.; Mi, H. Uncovering glycolysis-driven molecular subtypes in diabetic nephropathy: A WGCNA and machine learning approach for diagnostic precision. Biol. Direct 2025, 20, 10. [Google Scholar] [CrossRef]
  42. Hirakawa, Y.; Yoshioka, K.; Kojima, K.; Yamashita, Y.; Shibahara, T.; Wada, T.; Nangaku, M.; Inagi, R. Potential progression biomarkers of diabetic kidney disease determined using comprehensive machine learning analysis of non-targeted metabolomics. Sci. Rep. 2022, 12, 16287. [Google Scholar] [CrossRef] [PubMed]
  43. Zhang, J.; Fuhrer, T.; Ye, H.; Kwan, B.; Montemayor, D.; Tumova, J.; Darshi, M.; Afshinnia, F.; Scialla, J.J.; Anderson, A.; et al. High-throughput metabolomics and diabetic kidney disease progression: Evidence from the chronic renal insufficiency (CRIC) study. Am. J. Nephrol. 2022, 53, 215–225. [Google Scholar] [CrossRef] [PubMed]
  44. Page, M.J.; Moher, D.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. PRISMA 2020 explanation and elaboration: Updated guidance and exemplars for reporting systematic reviews. BMJ 2021, 372, n160. [Google Scholar] [CrossRef] [PubMed]
  45. Assessing Rejection-Related Disease in Kidney Transplant Biopsies Based on Archetypal Analysis of Molecular Phenotypes, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98320 (accessed on 17 November 2025).
  46. Gene Expression in Biopsies of Acute Rejection and Interstitial Fibrosis/Tubular Atrophy Reveals Highly Shared Mechanisms that Correlate with Worse Long-term Outcomes, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE76882 (accessed on 17 November 2025).
  47. Park, W.D.; Griffin, M.D.; Cornell, L.D.; Cosio, F.G.; Stegall, M.D. Fibrosis with inflammation at one year predicts transplant functional decline. J. Am. Soc. Nephrol. 2010, 21, 1987–1997. [Google Scholar] [CrossRef] [PubMed]
  48. Molecular Profiles Associated with Calcineurin Inhibitor Toxicity Post-Kidney Transplant: Input to Chronic Allograft Dysfunction, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53605 (accessed on 17 November 2025).
  49. Expression Data from Human Renal Allograft Biopsies, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21374 (accessed on 17 November 2025).
  50. Transcriptome Analysis of Human Diabetic Kidney Disease, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30122 (accessed on 17 November 2025).
  51. Transcriptome Analysis of Human Diabetic Kidney Disease (DKD Glomeruli vs. Control Glomeruli), Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30528 (accessed on 17 November 2025).
  52. Exon Level Expression Profiling of Diabetic Nephropathy, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96804 (accessed on 17 November 2025).
  53. Human PBMCs: Healthy vs Diabetic Nephropathy vs ESRD, Geo, V1. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE142153 (accessed on 17 November 2025).
  54. Ju, W.; Greene, C.S.; Eichinger, F.; Nair, V.; Hodgin, J.B.; Bitzer, M.; Lee, Y.; Zhu, Q.; Kehata, M.; Li, M.; et al. GSE47184: Nano-Dissection Identifies Podocyte-Specific Transcripts in Chronic Kidney Disease. Gene Expression Omnibus. 2013. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47184 (accessed on 17 November 2025).
  55. Eddy, S.; Nair, V.; Eichinger, F.; Lindenmeyer, M.; Cohen, C.; Kretzler, M. GSE104948: Glomerular Transcriptome from European Renal cDNA Bank Subjects and Living Donors. Gene Expression Omnibus. 2018. Available online: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE104948 (accessed on 17 November 2025).
  56. Grayson, P.C.; Eddy, S.; Taroni, J.N.; Lightfoot, Y.L.; Mariani, L.; Parikh, H.; Merkel, P.A. Metabolic pathways and immunometabolism in rare kidney diseases. Ann. Rheum. Dis. 2018, 77, 1226–1233. [Google Scholar] [CrossRef]
  57. Fan, Y.; Yi, Z.; D’Agati, V.D.; Sun, Z.; Zhong, F.; Zhang, W.; Wang, N. Comparison of kidney transcriptomic profiles of early and advanced diabetic nephropathy reveals potential new mechanisms for disease progression. Diabetes 2019, 68, 2301–2314. [Google Scholar] [CrossRef]
  58. Park, S.; Lee, H.; Lee, J.; Lee, S.; Cho, S.; Huh, H.; Kim, D.K. RNA-seq profiling of tubulointerstitial tissue reveals a potential therapeutic role of dual anti-phosphatase 1 in glomerulonephritis. J. Cell. Mol. Med. 2022, 26, 3364–3377. [Google Scholar] [CrossRef]
Figure 1. A PRISMA 2020 flow diagram detailing the search strategy, screening process, and article selection for systematic identification of scientific literature from 2015 to 2025.
Figure 2. Temporal Trends in ML Models Applied to Kidney Disease Research (2015–2025).
Figure 3. Temporal Trends in ML Models Applied to Kidney Disease Research (2015–2025).
Table 1. Search phrases/term sets used to identify journal publications focused on the role of AI-based algorithms and technologies in early detection, risk stratification, prediction of prognosis, and progression of kidney diseases.
Search Phrases for Early Detection | Search Phrases for Prognosis | Search Phrases for Risk Stratification | Search Phrases for Disease Progression
Artificial Intelligence and Machine Learning in Diabetic Kidney Disease | Multi-marker model for monitoring diabetic nephropathy prognosis | Integrated biomarker risk score for diabetic nephropathy | Biomarker + EHR approach to predict kidney disease progression and eGFR decline
Machine learning-based predictive models using integrated biomarkers to identify the risk of diabetic nephropathy | Machine learning models that utilize endocan as a novel predictor for all-cause mortality and cardiovascular events in chronic kidney disease patients | Development and application of machine learning-derived Kidney Age Index (KAI) biomarkers for risk stratification and intervention in patients with diabetic kidney disease (DKD) | Validation of machine learning-derived risk scores using biomarkers and electronic health record (EHR) data to predict the progression of diabetic kidney disease
Linear and logistic regression analyses to assess the association of urinary biomarkers with chronic kidney disease | Interpretable mortality prediction models for end-stage renal disease patients | Establishing the relationship between creatine phosphokinase (CPK), body mass index (BMI), and the risk of diabetic kidney disease using machine learning approaches | Machine learning-derived risk scores using biomarkers and electronic patient data to predict the progression of diabetic kidney disease
Machine learning-based models for metabolomics pathway analysis in diabetic nephropathy | Validated ML model for DKD prognosis | Explainable AI for risk stratification in kidney patients | Machine learning model to determine DKD progression and eGFR decline
Table 2. PICOS Inclusion and Exclusion Criteria.
PICOS Element | Inclusion Criteria | Exclusion Criteria
Population (P)
Inclusion:
- Adults or pediatric patients with kidney diseases including:
• CKD
• DKD
• Kidney transplant recipients
• ESRD on dialysis or CRRT
• Patients undergoing kidney biopsy for histopathological assessment (e.g., IFTA, glomerulosclerosis)
- Studies using human-derived datasets such as imaging, proteomics, metabolomics, urinary biomarkers, or EHR.
Exclusion:
- Animal or in vitro studies
- Studies exclusively on non-renal conditions (e.g., purely cardiovascular without CKD)
- Case reports or single-patient studies
Intervention (I)
Inclusion:
- ML or DL algorithms applied to:
• Prediction or detection of kidney disease progression
• Quantification of IFTA or glomerulosclerosis
• Risk prediction of mortality or cardiovascular outcomes in CKD/DKD patients
• Identification of digital or molecular biomarkers via AI/ML approaches
- Use of imaging modalities (ultrasound, MRI, pathology slides, digital histology)
- Integration of multi-omics data with machine learning.
Exclusion:
- Conventional statistical modeling without AI/ML component
- Non-algorithmic approaches such as manual scoring systems only
- Predictive modeling unrelated to kidney disease outcomes
Comparator (C)
Inclusion:
- Histopathology gold standard (biopsy-confirmed IFTA)
- Clinically validated risk scores (e.g., traditional CKD risk calculators)
- Standard laboratory biomarkers (eGFR, creatinine, albuminuria)
- Other machine learning or AI models for benchmarking.
Exclusion:
- No comparator group provided
- Studies lacking a validation cohort or benchmarking
Outcomes (O)
Inclusion:
- Primary outcomes:
• Quantification of IFTA
• Prediction of CKD/DKD progression to ESRD
• Mortality prediction (all-cause or disease-specific)
• Cardiovascular event prediction in CKD/DKD
• Identification of urinary or proteomic biomarkers linked to kidney disease
- Secondary outcomes:
• Model performance metrics (AUC, sensitivity, specificity)
• External validation of predictive models
• Explainability or interpretability of ML algorithms.
Exclusion:
- Outcomes not related to kidney disease (e.g., purely liver or lung outcomes)
- Studies without measurable clinical or histopathological endpoints
- No reporting of model performance metrics
Study Design (S)
Inclusion:
- Original research articles including:
• Retrospective or prospective cohort studies
• Cross-sectional studies
• Clinical trials using ML for prediction or biomarker discovery
• Validation studies for predictive algorithms
- Systematic reviews and meta-analyses related to ML and kidney disease.
Exclusion:
- Case reports, editorials, narrative reviews
- Conference abstracts without full-text availability
- Studies with insufficient data for evaluation of ML performance
Table 3. Application of AI-based diagnostic tools in early detection, risk stratification, and monitoring of IFTA.
Study (No. & Citation) | Model Type/Inputs/Outcome | Performance (AUROC/AUPRC) | Calibration | Decision Curve/Net Benefit | Validation Rigor
Athavale et al., 2021 [1] | Deep CNN on B-mode kidney ultrasound images to quantify % IFTA; trained on biopsy-correlated data. | AUROC ≈ 0.90 for severe IFTA classification | Reported; good alignment between observed and predicted fibrosis severity | Not assessed | Green: High, external or temporal validation
Trojani et al., 2024 [2] | MRI texture-based radiomics + ML classifiers (SVM, RF) to assess graft IFTA severity post-transplant. | AUROC = 0.85 (95% CI 0.78–0.91) | Adequate (internal calibration plots only) | Not reported (NR) | Yellow: Moderate, internal validation only
Ginley et al., 2021 [3] | CNN-based segmentation of biopsy whole-slide images (WSIs) for IFTA + glomerulosclerosis quantification. | AUROC ≈ 0.92; pixel-level accuracy > 90% | Well-reported pixel/region calibration | Not applicable | Green: High, external or temporal validation
Zheng et al., 2021 [4] | Deep CNN on digitized renal biopsy images to quantify fibrosis proportionate area (FPA). | AUROC ≈ 0.88 for moderate–severe fibrosis | Partial (qualitative comparison) | NR | Yellow: Moderate, internal validation only
Athavale et al., 2020 [5] | Early DL prototype on ultrasound to predict IFTA severity using transfer learning. | AUROC ≈ 0.83 (internal CV only) | NR | No DCA | Red: Low, no validation
Ginley et al., 2020 [6] | Neural network for biopsy segmentation (IFTA, GS); preliminary architecture testing dataset. | Accuracy ≈ 0.88; no AUROC reported | None reported | None | Red: Low, no validation
Yin et al., 2023 [7] | ML models (RF, XGBoost) using clinical + transplant data to predict post-transplant IFTA (binary outcome). | AUROC = 0.86 (training), 0.83 (validation) | Calibration slope ≈ 0.97 | DCA showed clinical utility at risk > 0.3 | Green: High, external or temporal validation
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table 4. Application of AI-based models to different urinary biomarkers for early detection, risk stratification, and monitoring of CKD progression.
Study No. & Citation | Model Type, Inputs & Target Outcome | Performance Metrics (AUROC/AUPRC) | Calibration (Slope/Reported) | Decision Curve/Net Benefit | Validation Rigor
Bienaimé et al., 2023 [26] | Multivariable logistic regression + random-forest ensemble using urinary KIM-1, NGAL, MCP-1, albumin + eGFR, age, comorbidities to predict CKD progression (≥40% eGFR decline or ESRD). | AUROC = 0.88 (derivation), 0.83 (external) | Reported (slope ≈ 1.0; good fit) | DCA → added benefit vs. urine albumin-to-creatinine ratio (UACR) alone | Green: High, external validation
Pizzini et al., 2017 [27] | Pilot logistic regression on urinary NGAL, L-FABP, cystatin C, β2-microglobulin + baseline eGFR to predict rapid CKD progression in a small cohort. | AUROC ≈ 0.76; AUPRC NR | NR | NR | Red: Low (small sample, no validation)
Qin et al., 2019 [28] | Logistic regression using urinary NGAL, KIM-1, cystatin C, ACR, HbA1c, eGFR, BP to predict DKD onset/progression in T2DM with PSM cohort. | AUROC = 0.84 (train), 0.81 (test) | NR | NR | Yellow: Moderate (internal split validation only)
Schanstra et al., 2015 [29] (QUADAS-2 diagnostic study) | CE-MS urinary peptide classifier (CKD273) developed + validated across >1200 samples to diagnose and predict CKD progression. | AUROC = 0.85–0.93 (external) | Good agreement; calibration assessed | DCA → improved clinical utility | Green: High (multi-cohort validation)
Muiru et al., 2021 [30] (QUADAS-2 diagnostic study) | Mixed-effects and correlation models using urinary IL-18, KIM-1, NGAL, YKL-40 to associate tubular injury biomarkers with CKD status and progression risk. | AUC ≈ 0.75 for composite tubular panel | NR | NR | Yellow: Moderate (observational diagnostic study; internal validation only)
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table 5. Application of AI-based models to different clinical, metabolomic, or transcriptomic data for monitoring the progression of DN.
Study No. & Citation | Model Type, Inputs & Target Outcome | Performance Metrics (AUROC/AUPRC) | Calibration (Slope/Reported) | Decision Curve/Net Benefit | Validation Rigor
Yin et al., 2024 [15] | XGBoost and RF models using clinical features (age, HbA1c, eGFR, UACR, lipids, BP, diabetes duration) with SHAP interpretability to predict presence or risk of diabetic nephropathy. | AUROC = 0.91 (train), 0.86 (test) | Reported (calibration curve good; slope ≈ 0.95) | NR | Yellow: Moderate (internal validation only)
Hirakawa et al., 2022 [42] | Ensemble ML (RF, SVM, LASSO) applied to plasma metabolomics profiles (non-targeted LC-MS) to identify progression biomarkers of diabetic kidney disease (≥40% eGFR decline or ESRD). | AUROC = 0.83–0.88; AUPRC = NR | Calibration not reported | NR | Yellow: Moderate (internal cross-validation only)
Zhang et al., 2022 [43] | Elastic net and XGBoost using metabolomics + clinical covariates (creatinine, eGFR, UACR, age, HbA1c) to predict progression of DKD (ESRD or ≥40% eGFR decline). | AUROC = 0.84 (training), 0.80 (external CRIC validation) | Good calibration (slope ≈ 1.0) | DCA performed → favorable | Green: High (external validation, robust calibration)
Fan et al., 2025 [41] (QUADAS-2 diagnostic study) | Diagnostic ML model integrating transcriptomic and glycolysis-related gene expression via WGCNA and SVM to classify DN molecular subtypes and distinguish DN vs. control kidney tissue. | AUROC = 0.93 (training), 0.89 (independent GEO test) | NR | NR | Green: High (independent test dataset; good reproducibility)
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only.
Table 6. Validation of ML models for prediction of CKD and ESRD.
Study No. & Citation | Model Type, Inputs & Target Outcome | Performance Metrics (AUROC/AUPRC) | Calibration (Slope/Reported) | Decision Curve/Net Benefit | Validation Rigor
Chan et al., 2021 [24] | XGBoost combining biomarkers (TNFR1, TNFR2, KIM-1) and EHR data (age, eGFR, UACR, BP, meds) to predict progression to eGFR decline ≥40% or kidney failure in DKD. | AUROC = 0.87 (derivation), 0.85 (external validation) | Calibration reported (slope ≈ 1.0) | DCA → superior to clinical risk model | Green: High, external validation
Ferguson et al., 2022 [31] | RF and XGBoost models using demographics, labs (eGFR, UACR, BP, HbA1c, albumin), and comorbidities to predict CKD progression to kidney failure. | AUROC = 0.84 (train), 0.81 (external validation) | Calibration assessed visually; good alignment | DCA performed → net benefit over KFRE | Green: High, external validation
Tangri et al., 2024 [32] | XGBoost (Klinrisk ML) using routine clinical features (eGFR, UACR, HbA1c, BP, age, duration of diabetes) to predict CKD progression (eGFR decline ≥40%, ESRD); validated in two RCT cohorts. | AUROC = 0.84–0.86; AUPRC = NR | Calibration plots show good agreement | NR | Green: High, external validation
Zou et al., 2022 [33] | Logistic regression, RF, and XGBoost using demographics, eGFR, UACR, serum creatinine, HbA1c, BP, lipids, and comorbidities to predict ESRD onset in DKD. | AUROC = 0.82 (XGBoost best); AUPRC = NR | NR | NR | Yellow: Moderate (internal split only)
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only.
Table 7. Application of AI-based algorithms for detecting and classifying current disease state, discovering diagnostic biomarkers, and subtype identification.
Study/Year | Model/Inputs/Targets | Validation Rigor | AUROC/AUPRC | Calibration/Intercept | Decision-Curve/Net Benefit
Basuli et al., 2025 [8] | Review paper summarizing multiple ML and AI models for DKD prediction and diagnosis; inputs include EHR, omics, and imaging data; targets include DKD onset, progression, and risk stratification. | Green: Systematic review (narrative) | Not applicable | Not reported | Not reported
Lei et al., 2024 [9] | CNN (EfficientNet, U-Net, V-Net); PAS-stained WSI inputs (GS%, KW presence, mesangial metrics); target: RPS class I–IV classification and lesion detection (cross-sectional, ~2.5 yr follow-up). | Green: External cohort | AUC > 0.809 (top metrics)/AUPRC not reported | Not reported | Not reported (rules: GS > 50% → Class IV; KW + GS ≤ 50% → Class III)
Makino et al., 2019 [10] | Convolutional autoencoder + RF; longitudinal EMR time-series data; target: DKD aggravation within 6 months. | Yellow: Temporal split (internal) | AUC = 0.743/AUPRC not reported | Not reported | Not reported
Nayak et al., 2024 [11] | ML ensemble (unspecified); single-center EHR + labs; target: DKD progression (retrospective). | Red: Single-center retrospective | Not reported/Not reported | Not reported | Not reported
Li et al., 2025 [12] | Meta-analysis of ML models (RF, SVM, DL); inputs: clinical + omics + imaging; target: DKD risk/progression (pooled studies). | Green: Meta-analysis (pooled) | AUROC = 0.839 (95% CI 0.787–0.890)/AUPRC not pooled | Not reported | Not reported
Zhu Y. et al., 2024 [13] | SVM (best); clinical + biomarker inputs (creatinine, eGFR, retinopathy, etc.); target: DN development over 36 months. | Green: External cohort | Train = 0.79; Test = 0.83/AUPRC not reported | Not reported | Not reported
Zhu, Liu & Wang, 2024 [14] | Ensemble ML (LASSO, SVM-RFE, RF) + MR; transcriptomics + eQTL inputs; target: DN classification and biomarker (CA2) validation. | Green: Multi-cohort validation | AUC > 0.878 (cross-dataset)/AUPRC not reported | Not reported | Not reported
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table 8. Application of AI and ML-based algorithms for identifying and classifying existing diseases and subtypes, and forecasting disease progression and risk stratification.
Study (No. + Citation) | Model Type + Inputs/Biomarkers + Targets/Outcomes | AUROC/AUPRC | Calibration/Intercept | Decision Curve/Net Benefit | Validation Rigor
Lucarelli et al., 2023 [16]: Discovery of Novel Digital Biomarkers for T2DN | Ensemble ML (RF, SVM, XGBoost) integrating urinary proteomics + pathologic glomerular features to classify stages of DN (RPS I–IV). | AUROC = 0.89/AUPRC = 0.81 | Reported good calibration (Brier score = 0.09). | Decision curve showed higher net benefit vs. clinician baseline. | Green: External validation (multi-site pathology + omics)
Yan et al., 2024 [17]: Proteomics + ML for DN Biomarker Discovery | SVM-RFE + LASSO feature selection on urinary proteomics for DN vs. T2D controls classification; identified hub proteins (α1-acid glycoprotein, haptoglobin). | AUROC = 0.93/AUPRC = 0.85 | Calibration plot aligned with perfect-fit line. | Net benefit improved across low-mid risk thresholds. | Yellow: Internal validation (5-fold CV)
Dong et al., 2022 [18]: 3-Year DKD Risk Prediction via EMR ML | Gradient boosting + logistic regression on EHR variables (age, HbA1c, UACR, eGFR) predicting 3-year incident DKD. | AUROC = 0.84/AUPRC = 0.76 | Hosmer–Lemeshow p > 0.05 (good fit). | DCA showed clinically useful range (10–30%). | Green: External validation (multi-hospital cohort)
Hsu et al., 2023 [19]: ML Prediction of Rapid Progressors + Referral Need | XGBoost and RF on EHR and lab data predicting ≥30% eGFR decline or referral need. | AUROC = 0.91/AUPRC = 0.83 | Calibration intercept ≈ 0.01. | Net benefit outperformed KDIGO risk model. | Yellow: Internal temporal split
Paranjpe et al., 2023 [20]: Deep EMR + Genomics for Sub-phenotyping DKD | Deep autoencoder + clustering on EMR + genetic variants (Rho pathway); outcome: distinct DKD subtypes + progression risk. | AUROC = 0.87/AUPRC = 0.79 | Not reported. | DCA suggested potential benefits in targeted therapy planning. | Red: No external validation (medRxiv preprint)
Xu et al., 2020 [21]: Systematic Review of ML in Type 1 Diabetes Complications | Survey of SVM, RF, ANN methods predicting DKD, DR, DPN from clinical inputs; no original model. | AUROC range = 0.70–0.95 | Varies by study. | Not assessed (systematic review). | Green: Reviewed externally validated models
Dong et al., 2024 [22]: Review: ML for Biomarker Discovery in DN | Overview of ML algorithms (SVM, RF, DL) and omics features for identifying diagnostic DN biomarkers. | Summary AUROC ≈ 0.85–0.93 | Not applicable (review). | Conceptual discussion of decision utility. | Green: Review of validated models
Nagaraj et al., 2021 [23]: Kidney Age Index (KAI) Model | Gradient boosting machine using age, eGFR, albuminuria, and clinical labs to predict biological “kidney age” vs. chronologic age in DKD. | AUROC = 0.88/AUPRC = 0.81 | Well calibrated (Brier = 0.10). | DCA confirmed benefit in referral thresholds. | Green: External validation (European cohort)
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table 9. Predicting future outcomes, such as mortality or cardiovascular events, using ML algorithms or patients’ biomarkers.
Study No. & Citation | Model Type, Inputs & Target Outcome | Performance Metrics (AUROC/AUPRC) | Calibration (Slope/Reported) | Decision Curve/Net Benefit | Validation Rigor
Ma et al., 2023 [34] | Adaptive feature-recalibrated ensemble model using age, albumin, hemoglobin, creatinine, dialysis duration, and comorbidities to predict 3-year all-cause mortality in peritoneal dialysis patients. | AUROC = 0.87, AUPRC = 0.63 | Reported (slope ≈ 1.02) | DCA performed → positive net benefit | Green: High
Chen et al., 2025 [35] | Interpretable ML (XGBoost/LightGBM/Cox) using age, albumin, urea, comorbidities, and inflammation markers to predict all-cause mortality and time to death in hemodialysis patients. | AUROC = 0.83–0.86, AUPRC = NR | Calibration curve (slope ≈ 0.98) | DCA → superior to logistic regression | Green: High
Hung et al., 2022 [36] | XGBoost + SHAP interpretation using baseline labs (BUN, lactate, bilirubin), vitals, and demographics to predict in-hospital mortality after CRRT initiation. | AUROC = 0.84, AUPRC = NR | Approx. slope 0.9 | DCA performed → favorable net benefit | Yellow: Moderate
Lin et al., 2023 [37] | Cox regression model using serum endocan, age, albumin, creatinine, and diabetes to predict 36-month all-cause mortality in hemodialysis patients. | AUROC = 0.71, AUPRC = NR | NR | NR | Red: Low
Tran et al., 2024 [38] | XGBoost model externally validated using age, ESA use, CVD, smoking, vitamin D, PTH, and ferritin to predict 2-year all-cause mortality in advanced CKD. | AUROC = 0.72, AUPRC = NR | Good agreement (qualitative) | NR | Green: High
Kim et al., 2020 [39] | Cox regression using plasma endocan, albumin, BMI, TG, and cardiovascular history to predict composite cardiovascular events in ESRD patients. | AUROC = 0.74, AUPRC = NR | NR | NR | Red: Low
Zhu et al., 2024 [40] | XGBoost (best) vs. RF/logistic regression/SVM using 25 features (age, BP, eGFR, glucose, lipids, Hb, comorbidities) to predict CVD in CKD. | AUROC = 0.89, AUPRC = 0.77 | NR | NR | Red: Low
Colors: Green = External validation (multi-site or independent cohort); Yellow = Internal split or temporal validation only; Red = No external validation/preprint/exploratory approach.
Table 10. GRADE Certainty of Evidence Summary Table.
AI Model Category | RoB | Inconsistency | Indirectness | Imprecision | Publication Bias | Certainty of Evidence
Radiomics-Based AI (Ultrasound, MRI) [1,2,5] | Yellow (some concern) | Yellow (some concern) | Green (low concern) | Yellow (some concern) | Yellow (some concern) | Moderate
Pathology-Based AI (Histology/WSI) [3,4,6,7] | Yellow (some concern) | Green (low concern) | Green (low concern) | Yellow (some concern) | Yellow (some concern) | Moderate
Biomarker-Based AI (Proteomics, Metabolomics, Urine) [16,17,26,27,28,29,30,42,43] | Green (low concern) | Yellow (some concern) | Yellow (some concern) | Yellow (some concern) | Green (low concern) | Moderate
EMR/EHR-Based DKD Risk Prediction AI [10,11,18,19,20,21,22,23,24,25,31,32,33] | Yellow (some concern) | Yellow (some concern) | Green (low concern) | Green (low concern) | Yellow (some concern) | Moderate
Genomics/Omics-Based AI [14,20,41] | Green (low concern) | Yellow (some concern) | Yellow (some concern) | Red (high concern) | Green (low concern) | Low
CKD Progression Prediction AI (General) [24,25,31,32] | Green (low concern) | Green (low concern) | Green (low concern) | Yellow (some concern) | Green (low concern) | High
Mortality Prediction AI [34,35,36,37,38] | Yellow (some concern) | Green (low concern) | Yellow (some concern) | Green (low concern) | Yellow (some concern) | Moderate
Cardiovascular Risk in CKD (AI Models) [39,40] | Green (low concern) | Yellow (some concern) | Yellow (some concern) | Yellow (some concern) | Green (low concern) | Moderate
Overall AI Models in Kidney Disease | Yellow (some concern) | Yellow (some concern) | Yellow (some concern) | Yellow (some concern) | Green (low concern) | Moderate
Colors: Green = low concern; Yellow = some concern; Red = high concern.
Table 11. Integrated Summary Table of 43 Kidney AI/ML Studies (2015–2025).
Study No. & Citation | Recommended Approach(es) | Typical Features | External Validation Status (Color-Coded) | Key Limits (Discrimination, Calibration, Decision Utility)
Athavale et al., 2021 [1] | Deploy CNN-based image segmentation for IFTA quantification; integrate as decision support for pathologists/sonographers with human-in-the-loop quality control (QC). | Imaging-based IFTA quantification | Green: External temporal validation | AUROC ~0.88; good calibration; limited net benefit quantification
Trojani et al., 2024 [2] | Use radiomics + classical ML (SVM/RF) as adjunct for graft surveillance; require multicenter harmonization of MRI protocols. | MRI radiomics for IFTA in grafts | Yellow: Internal cross-validation only | Moderate discrimination; no calibration slope; small cohort
Ginley et al., 2021 [3] | Deploy CNN segmentation for digital biopsy slides; use as quality assurance (QA)/quantification to supplement pathologist reads. | Biopsy histopathology (fibrosis, IFTA) | Green: External dataset | AUROC > 0.9; good visual interpretability; DCA absent
Zheng et al., 2021 [4] | Deep CNN regression for fibrosis proportion area (FPA); incorporate as standardized pathology metric. | Biopsy digitized images for fibrosis | Yellow: Single-center cross-validation | No AUPRC; calibration not reported; possible overfitting
Athavale et al., 2020 [5] | Prototype ultrasound DL approach; use only for research/triage until external validation and QC processes exist. | IFTA grading | Red: Development only | Missingness unaddressed; AUROC only
Ginley et al., 2020 [6] | Early-stage neural network (NN) segmentation; keep as algorithmic prototype to refine with labeled multi-center data. | Histological pattern recognition | Yellow: Internal cross-validation | Calibration absent; no clinical thresholding
Yin et al., 2023 [7] | Tabular ML (RF/XGBoost) for predicting post-transplant IFTA; use for early triage of high-risk transplants and targeted biopsies. | Post-transplant IFTA prediction | Yellow: Internal validation | AUROC ~0.84; no DCA; small sample size
Basuli et al., 2025 [8] | Use review recommendations: prioritize validated biomarkers and small panels for routine use; reserve omics for discovery. | Review synthesis | N/A | None applicable
Lei et al., 2024 [9] | Use AI to classify glomerular lesions in DN as an aid to pathologists; require multicenter histology harmonization. | Glomerular lesions in DN | Yellow: Internal cross-validation | No calibration or DCA; focus on feature salience only
Makino et al., 2019 [10] | Large-data ML for DKD progression; use as population-level risk stratification after local recalibration. | EMR variables predicting DKD progression | Green: External multi-cohort | AUROC 0.83; fair calibration; DCA absent
Nayak et al., 2024 [11] | Single-center ML model; use for hypothesis testing; requires external validation prior to clinical use. | Clinical + biochemical variables | Yellow: Internal cross-validation | AUPRC missing; calibration incomplete
Li et al., 2025 [12] | Follow meta-review recommendations; prefer externally validated models and standardized reporting. | Systematic review | N/A | Pooled AUC 0.86; heterogeneous designs
Zhu et al., 2024 [13] | Integrated biomarker + ML approach for DKD; must report calibration and perform external testing before clinical use. | Clinical + metabolic biomarkers → DN risk | Yellow: Internal cross-validation | AUROC ~0.87; calibration absent
Zhu et al., 2024 [14] | Combined ML + Mendelian randomization (MR) + experimental validation; good for biomarker discovery; not a routine risk score until replicated. | Genomic + proteomic biomarkers | Red: Discovery only | No validation; risk of leakage; small sample size
Yin et al., 2024 [15] | Use XGBoost with SHAP for explainable DKD risk; internal validation suitable for local pilot rollout pending external validation. | Clinical + lab markers → DN | Yellow: Internal cross-validation only | AUROC 0.86; no DCA; short follow-up
Lucarelli et al., 2023 [16] | Integrate urinary proteomics with histology for new digital biomarkers; best for discovery and creating small panels for later validation. | Proteomics → DN classification | Red: Preprint; no validation | No calibration or DCA
Yan et al., 2024 [17] | Proteomics + ML for urinary biomarker discovery; follow with small-panel assays for clinical use. | Urinary biomarkers for DN | Yellow: Internal cross-validation | AUROC 0.83; missing calibration
Dong et al., 2022 [18] | EMR-based ML for 3-year DKD risk; strong candidate for clinic if externally validated and calibrated locally. | 3-year DKD risk | Yellow: Internal cross-validation | AUPRC unreported; no DCA
Hsu et al., 2023 [19] | ML to triage for nephrology referral (rapid progression risk); integrate as safety-net alert with nurse triage. | Rapid CKD progression | Green: External validation | AUROC 0.82; good calibration
Paranjpe et al., 2023 [20] | DL sub-phenotyping of DKD for biology; use for research and targeted trials, not routine scoring. | DKD sub-phenotyping | Red: Preprint | No external validation; high-dimensional instability
Xu et al., 2020 [21] | Use as overview; future reviews should apply PROBAST and require calibration/AUPRC reporting. | Systematic ML review | N/A | Highlights bias and small-data issues
Dong et al., 2024 [22] | Use as exploratory synthesis; call for standardized reporting and validation. | Biomarker review | N/A | Missing empirical model data
Nagaraj et al., 2021 [23] | KAI is an interesting biomarker concept; requires external validation before adoption. | Kidney Age Index | Yellow: Internal cross-validation | AUROC 0.82; no calibration slope
Chan et al., 2021 [24] | Use biomarker + EHR XGBoost model with local recalibration; model supports clinical risk stratification with DCA. | DKD progression | Green: External validation | AUROC 0.88; well-calibrated; DCA absent
Sabanayagam et al., 2023 [25] | Population cohort ML for DKD risk; apply for population screening and resource planning; local calibration required. | Population-based DKD risk | Green: External (Singapore) | AUROC 0.84; good calibration
Bienaimé et al., 2023 [26] | Combine urine biomarkers + ML for CKD progression; implement validated small panels for clinical risk stratification. | Urinary biomarkers → CKD progression | Green: External (NephroTest) | AUROC 0.86; good calibration; net benefit shown
Pizzini et al., 2017 [27] | Pilot only; use results to design larger, multi-center studies and select candidate biomarkers. | Urinary CKD biomarkers | Red: Small cohort | AUROC only; poor calibration
Qin et al., 2019 [28] | Use matched cohort analyses for biomarker assessment; follow up with prospective validation. | Urinary biomarkers → DKD | Yellow: Internal validation | AUROC 0.81; calibration absent
Schanstra et al., 2015 [29] | CKD273 urinary peptide classifier for diagnosis/prediction; use as adjunct biomarker panel pending local assay availability. | Urinary peptides → CKD progression | Green: External (proteomic data) | AUROC 0.85; calibration OK; no DCA
Muiru et al., 2021 [30] | Use longitudinal urinary tubular biomarkers in cohort-specific risk assessments; consider for subpopulations (women with HIV). | CKD risk factors + urine biomarkers | Yellow: Internal | AUROC not reported; no DCA
Ferguson et al., 2022 [31] | Use XGBoost/RF with external validation and recalibration; compare to KFRE. | CKD progression | Green: External cohort | AUROC 0.86; calibration strong
Tangri et al., 2024 [32] | Use Klinrisk model validated in RCT cohorts for robust risk prediction; local recalibration advised. | CKD progression | Green: External (CANVAS, CREDENCE) | AUROC 0.87; full calibration
Zou et al., 2022 [33] | Internally validated XGBoost for ESRD risk in T2DM; requires external replication. | ESRD prediction in DKD | Yellow: Internal | AUROC 0.84; calibration absent
Ma et al., 2023 [34] | Use adaptive feature-importance recalibration with longitudinal visits; require rigorous temporal separation and missing-data handling. | Mortality in peritoneal dialysis | Green: External | AUROC 0.88; good calibration; no DCA
Chen et al., 2025 [35] | Interpretable ML for mortality/time-to-event; include calibration and time-dependent validation. | All-cause mortality in hemodialysis | Yellow: Internal cross-validation | AUROC 0.86; calibration slope absent
Hung et al., 2022 [36]XGBoost with SHAP for CRRT mortality, embed as ICU decision support with clear escalation.CRRT in-hospital mortalityBiomedinformatics 05 00067 i002 InternalAUROC 0.82; no DCA
Lin et al., 2023 [37]Consider endocan as adjunct prognostic biomarker pending multi-center replication.Endocan → mortality in hemodialysisBiomedinformatics 05 00067 i002 InternalAUROC 0.79; calibration absent
Tran et al., 2024 [38]Adopt validated 2 yr model if local baseline risk similar; perform recalibration before use.CKD stage 4–5 mortalityBiomedinformatics 05 00067 i001 External validationAUROC 0.84; calibrated; robust
Kim et al., 2020 [39]Evaluate plasma endocan further across cohorts before clinical use for CVD risk in ESRD.Endocan → CVD in hemodialysisBiomedinformatics 05 00067 i002 InternalAUROC 0.81; no DCA
Zhu et al., 2024 [40]XGBoost model is promising but must show external validation and calibration before adoption. Prefer late fusion if integrating other modalities.CKD → CVD riskBiomedinformatics 05 00067 i002 InternalAUROC 0.85; calibration slope missing
Fan et al., 2025 [41]Use multi-omics WGCNA + SVM for molecular subtype discovery; apply subtype labels to stratify for trials and targeted therapy. Not a routine diagnostic yet.Glycolytic gene-based DN subtypesBiomedinformatics 05 00067 i003 No external validationDiagnostic only; no calibration
Hirakawa et al., 2022 [42]Use metabolomics + ML for discovery of progression biomarkers; convert to targeted assays for clinical risk scoring after external replication.DKD progression biomarkersBiomedinformatics 05 00067 i002 InternalAUROC 0.83; calibration absent
Zhang et al., 2022 [43]Use high-throughput metabolomics to identify robust markers then implement small targeted panels validated across CRIC, good example of discovery → validation pipeline.CKD progression (CRIC)Biomedinformatics 05 00067 i001 External cohortAUROC 0.86; calibration robust; DCA absent
Color key: Green = external validation (multi-site or independent cohort); Yellow = internal split or temporal validation only; Red = no external validation (preprint or exploratory approach).
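Beyond discrimination (AUROC), the limitations column above repeatedly flags two reporting gaps: missing calibration assessment and absent decision-curve analysis (DCA). As a minimal, illustrative sketch of what complete reporting would involve, the Python snippet below computes all three measures on synthetic data; the variable names, the simulated outcomes, and the 10% referral threshold are assumptions for illustration, not values drawn from any cited study.

```python
# Illustrative sketch (synthetic data): the three evaluation measures most
# often missing in the tabulated studies: AUROC, calibration slope/intercept,
# and decision-curve net benefit. Names and thresholds are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                                # observed outcomes (0/1)
y_prob = np.clip(y_true * 0.3 + rng.uniform(0, 0.7, 500), 0.01, 0.99)  # predicted risks

# 1. Discrimination: AUROC, reported by nearly every study above.
auroc = roc_auc_score(y_true, y_prob)

# 2. Calibration: regress outcomes on the logit of predicted risk;
#    slope ~1 and intercept ~0 indicate good calibration.
logit = np.log(y_prob / (1 - y_prob)).reshape(-1, 1)
cal = LogisticRegression().fit(logit, y_true)
cal_slope, cal_intercept = cal.coef_[0][0], cal.intercept_[0]

# 3. Decision-curve analysis: net benefit at a chosen risk threshold,
#    compared against the treat-all strategy.
def net_benefit(y, p, threshold):
    n = len(y)
    tp = np.sum((p >= threshold) & (y == 1))
    fp = np.sum((p >= threshold) & (y == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

t = 0.10                                  # e.g., refer to nephrology if risk >= 10%
nb_model = net_benefit(y_true, y_prob, t)
prevalence = y_true.mean()
nb_treat_all = prevalence - (1 - prevalence) * t / (1 - t)

print(f"AUROC={auroc:.2f}, slope={cal_slope:.2f}, intercept={cal_intercept:.2f}")
print(f"Net benefit at {t:.0%}: model={nb_model:.3f}, treat-all={nb_treat_all:.3f}")
```

A model adds clinical value at a given threshold only if its net benefit exceeds both the treat-all and treat-none (net benefit of zero) strategies, which is why DCA is flagged as a gap for so many of the studies above.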
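Several of the tabulated studies (e.g., Yin et al. [15], Chan et al. [24], Hung et al. [36]) pair XGBoost with SHAP so that individual risk predictions can be explained to clinicians. The sketch below illustrates that general pattern on synthetic data; the feature names (eGFR, uACR, HbA1c), labels, and model settings are placeholders, not the cohorts or hyperparameters used in those studies.

```python
# Illustrative XGBoost + SHAP pattern on synthetic data: fit a gradient-boosted
# classifier, then compute per-patient additive feature attributions.
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "eGFR": rng.normal(60, 20, 300),      # placeholder renal function marker
    "uACR": rng.lognormal(3, 1, 300),     # placeholder albuminuria marker
    "HbA1c": rng.normal(7.5, 1.2, 300),   # placeholder glycemic marker
})
y = (X["uACR"] > 30).astype(int)          # toy label standing in for DKD progression

model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# TreeExplainer yields one attribution per feature per patient, so a clinician
# can see which biomarkers drove an individual risk estimate.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
print(pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns))
```

Mean absolute SHAP values give a global importance ranking, while the per-patient rows support the case-level explanations these studies recommend for clinical decision support.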