Appendix A
This appendix contains details and explanations supplemental to the subsection, “4.1. Application of AI-based diagnostic tools in early detection, risk stratification, and monitoring of IFTA” of the main text. The explanations of the quality assessment of the included study articles [1] through [7] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text; however, this discussion is crucial to understanding the overall significance of these studies.
Table A1. QUADAS-2 RoB and applicability assessment for Athvale et al., 2021 [1].
| Domain | RoB | Judgment & Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient Selection | High | The study used a single-center, retrospective, consecutive sample (n = 352) of patients who underwent kidney biopsy at a tertiary hospital. This introduces potential selection bias, as patients were not randomly sampled and may not represent broader clinical populations. | High, Participants are limited to a specific demographic (Cook County Hospital, Chicago), potentially limiting generalizability. |
| 2. Index Test (DL + XGBoost model) | Moderate to High | The model was trained and tested on the same institutional data, with no external validation. It is unclear whether the index test was interpreted blinded to the reference standard. Deep learning feature extraction may vary across ultrasound machines. | Moderate, Implementation in other settings could yield variable results due to equipment and imaging protocol differences. |
| 3. Reference Standard (Histopathologic IFTA grading) | Low | Histopathology is the accepted gold standard for assessing interstitial fibrosis and tubular atrophy. Assessments were likely performed by qualified nephropathologists; however, blinding between reference and index test evaluators was not explicitly mentioned. | Low, The outcome is directly relevant to clinical practice. |
| 4. Flow and Timing | Moderate | All patients underwent both the index test (ultrasound) and the reference standard (biopsy) in the same period, but the timing between the two was not specified. Missing data handling and patient exclusions were not detailed. | Low, Likely applicable since both tests relate to the same diagnostic event. |
Table A1 presents the application of the QUADAS-2 tool to the 1st paper, Athvale et al., 2021 [1], which utilizes an AI-based diagnostic tool for early detection, risk stratification, and monitoring of IFTA. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
Although the study demonstrates promising accuracy for noninvasive quantification of IFTA using DL on ultrasound images, its single-center design, lack of external validation, and small dataset relative to model complexity result in a high RoB and limited generalizability. Future multicenter validation and calibration studies are needed to strengthen certainty.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate to high, and applicability concerns, moderate to high.
Table A2. QUADAS-2 RoB and applicability assessment for Trojani et al., 2024 [2].
| Domain | RoB | Reasoning/Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient Selection | High | Retrospective, single-center study; patients were included only if they had both MRI and biopsy within six months, which introduces selection bias (enriched population, not a consecutive diagnostic workflow). Poor-quality MRIs and unsuitable biopsies were excluded, which further limits representativeness. | High, The study population (transplant recipients in a tertiary center with available MRI) does not represent the full clinical spectrum of post-transplant patients. |
| 2. Index Test (MRI-radiomic ML model) | Moderate to High | The MRI radiomics-based ML model was developed and validated on the same institutional data, using internal train/test splits (no external validation). The index test likely was not interpreted fully blinded to the biopsy results during feature selection and model tuning. Performance may be optimistically biased (AUC drop between training and test). | Moderate, MRI protocols, scanners, and pre-processing steps are highly site-specific; generalization to other centers may be limited. |
| 3. Reference Standard (Histopathologic biopsy with Banff IFTA grading) | Low | The biopsy-based Banff classification is a recognized gold standard for IFTA. Assessment was performed by an expert nephropathologist, though blinding to MRI results was not explicitly confirmed. | Low, The biopsy grading directly addresses the target condition and is appropriate for the study aim. |
| 4. Flow and Timing | Moderate | MRI and biopsy were performed within six months of each other, which may be long enough for histological IFTA changes to progress, introducing potential misclassification. Only 70 MRI-biopsy pairs were analyzed out of 254 biopsies performed, indicating patient/exam attrition. | Low to Moderate, The six-month interval could affect diagnostic consistency, but within the chronic IFTA context, it is partly acceptable. |
Table A2 presents the application of the QUADAS-2 tool to the 2nd paper, Trojani et al., 2024 [2], which uses an AI-based diagnostic tool for early detection, risk stratification, and monitoring of IFTA. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
This single-center, retrospective diagnostic accuracy study presents a moderate-to-high RoB, mainly due to non-representative patient selection, lack of external validation, and potential overfitting of the radiomics-based ML model. The reference standard (biopsy-based Banff IFTA grading) is appropriate and reliable, though blinding was not fully reported. The six-month window between MRI and biopsy could have introduced temporal bias, and missing data handling was not fully described. Applicability is limited by center-specific MRI acquisition protocols, pre-processing pipelines, and manual segmentation.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate to high, and applicability concerns, moderate to high.
Table A3. QUADAS-2 RoB and applicability assessment for Ginley et al., 2021 [3].
| Domain | RoB | Reasoning/Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient Selection | Moderate | The study used 116 whole-slide biopsy images, retrospectively selected from existing archives. No mention of consecutive or random sampling. Slides were chosen for image quality and completeness, introducing potential selection bias. However, inclusion appears broad across chronic kidney injuries, not limited to a narrow subset. | Low to Moderate, The sample (renal biopsies with chronic injury) represents the target clinical population, though external representativeness is uncertain. |
| 2. Index Test (CNN-based ML model, DeepLab v2) | Moderate | The convolutional neural network (DeepLab v2) was trained on annotated WSIs and evaluated internally and externally. External testing included only 20 slides, from the same or similar institutional context, and blinding to the reference standard was not specified. Model tuning and performance optimization may have introduced optimistic bias. | Moderate, Deep learning performance may vary with slide scanners, staining, and lab protocols, affecting real-world applicability. |
| 3. Reference Standard (Renal Pathologist Grading of IFTA and Glomerulosclerosis) | Low | The ground truth labels were assigned by expert renal pathologists, the clinical gold standard. Multiple pathologists participated in validation, and performance was benchmarked against their agreement levels. There’s no explicit statement on blinding to model outputs, but the use of independent test slides suggests low bias. | Low, Pathologist-based grading directly reflects the intended diagnostic construct. |
| 4. Flow and Timing | Low | All slides underwent both the index test (ML assessment) and reference evaluation (pathologist grading) from the same biopsy samples. No participants or slides appear to have been excluded after inclusion. Flow is consistent and clearly reported. | Low, The timing of analyses aligns with typical diagnostic workflows. |
Table A3 presents the application of the QUADAS-2 tool for the 3rd paper, Ginley et al., 2021 [3], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
This diagnostic accuracy study demonstrates moderate overall RoB, mainly due to retrospective sampling and limited external validation of the CNN model. The reference standard (expert renal pathologist assessment) is robust, and the index test was applied appropriately, achieving pathologist-level performance.
In summary, the study shows low concern in the reference standard and flow and timing domains, but moderate risk in the patient selection and index test domains.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low to moderate.
Table A4. QUADAS-2 RoB and applicability assessment for Zheng et al., 2021 [4].
| Domain | RoB | Reasoning/Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient Selection | Low-Moderate | Patients included 64 from OSUWMC (67 WSIs) and 14 from Kidney Precision Medicine Project/KPMP (28 WSIs). WSIs underwent manual quality checks to exclude slides with artifacts. Some clinical data was missing (e.g., proteinuria, eGFR), but all eligible biopsies were included. The sample may not fully represent all renal biopsy populations, and KPMP had no severe IFTA cases. | Low, The population represents patients undergoing renal biopsy, which matches the intended clinical target population for IFTA grading. |
| 2. Index Test (DL Model, glpathnet) | Low | The DL model combined local patch-level and global WSI-level features to predict IFTA grade. Model was trained on OSUWMC data with 5-fold cross-validation and tested externally on KPMP data. Patch-level probabilities were reviewed after reference grading to avoid bias. | Low, The model directly addresses automated IFTA grading on digitized WSIs, the intended purpose of the index test. |
| 3. Reference Standard (Pathologist Consensus, Majority Vote) | Moderate | IFTA grades were determined by majority vote of five nephropathologists (OSUWMC) and by study investigators (KPMP), with moderate interobserver agreement (κ = 0.31–0.50). Grading is inherently subjective and may vary between pathologists, though consensus aligns with standard clinical practice. | Low, Reference standard is clinically appropriate, using expert nephropathologists’ evaluation of renal biopsy WSIs. |
| 4. Flow and Timing | Low-Moderate | All WSIs were digitized consistently at ×40 magnification. DL training and testing were separated between OSUWMC (training) and KPMP (external validation). Some KPMP cases lacked severe IFTA. No missing WSIs, and the same grading criteria were applied across datasets. | Low, Data handling, timing, and grading criteria are consistent, reflecting intended workflow for WSI analysis. |
Table A4 presents the application of the QUADAS-2 tool for the 4th paper, Zheng et al., 2021 [4], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The study has an overall moderate RoB. This is because while the index test and flow/timing are low risk, moderate concerns arise from patient selection and the subjectivity of the reference standard. Applicability concerns across all domains are low, indicating that the study population, index test, and reference standard are relevant to the intended clinical context.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
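The validation layout described above, internal 5-fold cross-validation on the OSUWMC cohort with the KPMP cohort reserved entirely for external testing, can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the glpathnet model itself is replaced by a placeholder comment, and only the data-partitioning discipline is shown.

```python
# Illustrative sketch (not the authors' code) of the validation layout described
# for Zheng et al.: 5-fold cross-validation within one cohort (OSUWMC), with a
# second cohort (KPMP) held out entirely as an external test set.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous, non-overlapping folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

internal_ids = [f"OSUWMC_{i}" for i in range(67)]  # 67 training-cohort WSIs
external_ids = [f"KPMP_{i}" for i in range(28)]    # 28 external-validation WSIs

folds = k_fold_indices(len(internal_ids), 5)
for val_idx in folds:
    train = [internal_ids[i] for i in range(len(internal_ids)) if i not in val_idx]
    val = [internal_ids[i] for i in val_idx]
    # a model would be fit on `train` and tuned on `val` here
    assert not set(train) & set(val)               # no leakage within a fold

# The external cohort is scored exactly once, after all model choices are frozen.
assert not set(internal_ids) & set(external_ids)   # external data never trains
```

The point of the final check is the one QUADAS-2 rewards here: because no KPMP slide ever enters training or tuning, the external performance estimate is protected from the optimism that affects internally validated models.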
Table A5. QUADAS-2 RoB and applicability assessment for Athvale et al., 2020 [5].
| Domain | RoB | Reasoning/Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient Selection | Moderate | Patients were included retrospectively from a single center (Cook County Health, Chicago, IL). Ultrasound images were obtained from 352 patients who underwent kidney biopsy. Potential selection bias exists as only patients with available biopsy-confirmed IFTA, and usable ultrasound images were included. | Moderate, Population reflects patients undergoing biopsy but may not represent broader populations or those without biopsy, limiting generalizability to all kidney disease patients. |
| 2. Index Test (DL Ultrasound Classification) | Low | The DL system classified IFTA from ultrasound images with masked kidneys based on a 91% accurate segmentation routine. The system was trained, validated, and tested on separate datasets, reducing bias. Performance metrics (accuracy, precision, recall, F1-score) were reported for all sets. | Low, The index test is directly relevant to the clinical task of non-invasive IFTA assessment. |
| 3. Reference Standard (Biopsy IFTA by Nephropathologist) | Low-Moderate | Reference standard was histologic IFTA grading on trichrome-stained kidney biopsy by nephropathologists. While widely accepted, inter-observer variability in IFTA grading is known, but majority consensus or standardized scoring was not specified. | Low, The reference standard is clinically appropriate and directly measures the target condition. |
| 4. Flow and Timing | Low | Training, validation, and test datasets were clearly separated. No missing images or exclusions post-acquisition were mentioned, and all images were processed using the same protocol. The timing between ultrasound and biopsy was not specified but was presumably consistent with routine clinical workflow. | Low, Flow and timing are appropriate for evaluating the DL model against the biopsy reference standard. |
Table A5 presents the application of the QUADAS-2 tool for the 5th paper, Athvale et al., 2020 [5], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The study demonstrates robust performance of the DL system for non-invasive IFTA assessment, with moderate caution due to selection bias and potential variability in biopsy grading.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
Table A6. QUADAS-2 RoB and applicability assessment for Ginley et al., 2020 [6].
| Domain | RoB | Reasoning/Justification | Applicability Concern |
|---|---|---|---|
| 1. Patient/Tissue Selection | Low-Moderate | The study used renal biopsy samples stained with periodic acid-Schiff (PAS). Data came from a single institution for intra-institutional holdout testing and an external institution for inter-institutional testing. Exact selection criteria and sample size were not fully described, introducing potential selection bias. | Low, Study samples are representative of patients undergoing renal biopsy for glomerulosclerosis and IFTA assessment, the intended target population. |
| 2. Index Test (CNN Segmentation) | Low | CNNs were trained to segment glomerulosclerosis and IFTA on PAS-stained biopsies. The model performance was evaluated on holdout intra- and inter-institutional datasets. Segmentation outputs were quantitatively compared to reference annotations, and high correlations were reported, indicating low bias in test conduct. | Low, The index test directly addresses the clinical task of automated segmentation and quantitation of renal histologic injury. |
| 3. Reference Standard (Pathologist Annotations) | Moderate | Ground truth was based on manual segmentation by renal pathologists. The study notes that the CNN sometimes predicted regions “better than the ground truth,” indicating some subjectivity and potential variability in reference standard. Inter-observer variability of annotations was not formally reported. | Low, Expert pathologist annotations are clinically relevant and appropriate for training and validating the model. |
| 4. Flow and Timing | Low | The training, intra-institutional holdout, and inter-institutional holdout testing datasets were clearly separated. No missing data issues were reported, and all images were analyzed according to the same protocol. | Low, Flow and timing reflect intended use, with proper separation of training and test datasets. |
Table A6 presents the application of the QUADAS-2 tool for the 6th paper, Ginley et al., 2020 [6], which uses an AI-based diagnostic tool for the early detection, risk stratification, and monitoring of IFTA progression. Using QUADAS-2, we evaluated the RoB and applicability concerns of this study across four domains: Patient Selection, Index Test, Reference Standard, and Flow and Timing.
The QUADAS-2 assessment indicates that this study evaluating CNNs for segmentation of glomerulosclerosis and interstitial fibrosis/tubular atrophy (IFTA) in PAS-stained renal biopsies is generally robust. The index test (CNN-based segmentation) demonstrates low RoB, as it was trained and validated on holdout intra- and inter-institutional datasets with quantitative evaluation against reference annotations. The reference standard, based on pathologist manual annotations, carries a moderate RoB due to inherent subjectivity and potential variability, though it remains clinically appropriate. Patient and tissue selection present low-moderate risk because sample sizes and selection criteria were not fully described, and external validation data came from a single additional institution. The flow and timing domain has a low RoB, with consistent image handling and clear separation of training and test datasets.
Overall QUADAS-2 judgment for this study is as follows: RoB, moderate, and applicability concerns, low.
Table A7. PROBAST assessment for Yin et al., 2023 [7].
| Domain | Details | RoB/Concern |
|---|---|---|
| Population | Post-transplant kidney patients from five GEO datasets (GSE98320 [45], GSE76882 [46]: training; GSE22459 [47], GSE53605 [48]: validation; GSE21374 [49]: prognosis). Total sample sizes vary; cohorts selected based on ≥50 samples and availability of biopsy-confirmed IFTA or survival data. Heterogeneity is due to platform differences and batch effects. | Moderate, selection bias possible; not fully representative of all transplant patients |
| Index Model | Stepglm[both] + RF diagnostic model based on 28 necroptosis-related genes. Developed from 114 combinations of 13 ML algorithms (LASSO, Ridge, Enet, Stepglm, SVM, glmboost, LDA, plsRglm, RF, Gradient Boosting Machine/GBM, XGBoost, Naive Bayes, ANN). High-dimensional data relative to sample size increases risk of overfitting. | High, multiple model testing, risk of overfitting, small validation sets relative to training |
| Comparator/Reference Model | Biopsy-confirmed IFTA status (histopathological evaluation) and survival data (post-transplant graft loss) from GEO datasets. Used as reference standard to evaluate predictive performance (AUC, ROC). | Low, clinically accepted standard; reference outcome is relevant and reliable |
| Outcome | IFTA classification (binary or graded) and post-transplant graft survival. Modeled outcome includes differential gene expressions associated with necroptosis. Performance evaluated with AUC, Principal Component Analysis/PCA separation, and Kaplan–Meier curves for survival. | Moderate, grading differences across cohorts and batch effects may introduce misclassification bias |
| Timing | Gene expression data from biopsies collected at variable post-transplant time points. Prognostic evaluation uses longitudinal survival data. Model development used cross-sectional training/validation datasets; timing differences between datasets could affect predictive performance. | Moderate, timing differences and cross-sectional data may limit prediction consistency |
| Setting | Publicly available GEO gene expression datasets from multiple kidney transplant cohorts. Laboratory and bioinformatics setting; no prospective or clinical trial validation. | Moderate, datasets may not represent real-world clinical populations |
| Intended Use of Prediction Model | Early identification of IFTA and stratification of kidney transplant patients by risk of fibrosis progression or graft loss. Aims to support clinical decision-making and follow-up prioritization. Not yet validated for clinical deployment. | Moderate, potential clinical use, but external validation and clinical implementation pending |
Table A7 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Yin et al., 2023 [7].
The overall PROBAST judgment for this study is high RoB, with moderate concerns for applicability.
Justification is as follows:
High RoB arises mainly from the model development and analysis domain, where extensive algorithm testing (114 model combinations) and limited validation increase the likelihood of overfitting.
The population is retrospectively selected from heterogeneous GEO datasets, adding potential selection and batch-related biases.
Applicability concerns are moderate because the predictors (necroptosis-related genes) and outcomes (biopsy-confirmed IFTA and graft survival) are clinically relevant, but external generalizability to broader or prospective clinical populations remains uncertain.
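The multiplicity problem flagged above, selecting the best of 114 algorithm combinations on limited data, can be illustrated with a toy simulation. In the sketch below, 114 "models" that are pure noise score the same small validation set; the best of them still appears to discriminate well, which is selection optimism rather than signal. All numbers are invented for illustration and have no connection to the study's data.

```python
import random

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney): probability a positive outranks a negative."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank + 1 for rank, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

random.seed(0)
labels = [1] * 20 + [0] * 20          # toy validation set: 20 cases, 20 controls
# 114 "models" that are pure noise: each scores the same validation set randomly.
aucs = [auc([random.random() for _ in labels], labels) for _ in range(114)]

mean_auc = sum(aucs) / len(aucs)      # hovers near 0.5, as expected for noise
best_auc = max(aucs)                  # well above 0.5 purely by selection
```

The gap between `best_auc` and `mean_auc` is the optimism PROBAST penalizes here; nested cross-validation, or an external set untouched during model selection, is the standard remedy.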
Appendix B
This appendix contains details and explanations supplemental to the subsection, “4.2. Application of AI-based models to different urinary biomarkers for early detection, risk stratification, and monitoring of CKD progression” of the main text. The explanations of the quality assessment of the included study articles [26] through [30] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text; however, this discussion is crucial to understanding the overall significance of these studies.
Table A8. PROBAST assessment for Bienaime et al., 2023 [26].
| Domain/Item | Details | RoB/Concern |
|---|---|---|
| Population | Participants were 229 adults with chronic kidney disease (mean age 61 years; 66% male; mean baseline mGFR 38 mL/min) from the prospective NephroTest cohort. Fast CKD progression was defined as >10% annual mGFR decline. The cohort is well-characterized, but the subsample size is modest and may not reflect the full CKD spectrum. | Low-Moderate, Prospective and clinically relevant, but limited sample and single cohort reduce representativeness. |
| Index Model | A LASSO logistic regression model combining five urinary biomarkers (CCL2, EGF, KIM1, NGAL, and TGF-α) with clinical variables (age, sex, mGFR, albuminuria) to predict fast CKD progression. Model selection used repeated resampling (100 iterations). | Moderate, LASSO penalization reduces overfitting risk, but internal-only validation and data-driven selection of biomarkers may inflate performance estimates. |
| Comparator/Reference Model | The Kidney Failure Risk Equation (KFRE) variables (age, sex, mGFR, albuminuria) served as the baseline comparator for performance evaluation. | Low, Comparator is appropriate, widely accepted, and clinically meaningful. |
| Outcome | The outcome was fast CKD progression, defined as >10% decline per year in measured GFR using 51Cr-EDTA clearance, a gold-standard assessment. | Low, Objective and precise measurement of kidney function minimizes outcome misclassification. |
| Timing | Predictor and outcome data came from a prospective follow-up design within the NephroTest cohort. Urine biomarkers and clinical variables were measured at baseline; outcomes were observed longitudinally. | Low, Clear temporal sequence supports valid prediction; prospective data collection minimizes bias. |
| Setting | Conducted in a clinical research cohort of CKD patients under nephrology care at French academic hospitals (NephroTest). Laboratory-based ELISA assays underwent rigorous FDA-standard validation prior to modeling. | Low, Well-controlled research setting; consistent sample handling and assay validation. |
| Intended Use of Prediction Model | The model aims to improve risk stratification for CKD progression beyond standard clinical variables by adding validated urinary biomarkers, potentially guiding early intervention and follow-up intensity. | Low, Intended use is clinically relevant and aligned with current CKD management goals. |
Table A8 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Bienaime et al., 2023 [26].
The overall PROBAST judgment for this study is moderate overall RoB, with low applicability concern.
Justification is as follows:
The main limitation of this study lies in the modeling domain, where internal validation and data-driven biomarker selection raise a moderate risk of overfitting. Applicability is strong: the predictors, outcomes, and setting reflect real-world nephrology practice, making the findings highly relevant, though they require external validation in independent CKD populations.
Table A9. PROBAST assessment for Pizzini et al., 2017 [27].
| Domain/Item | Details | RoB/Concern |
|---|---|---|
| Population | 118 adult CKD patients (mean age 62 ± 11 years; 59% male; mean eGFR ≈ 35 mL/min/1.73 m2) from a single nephrology center in Reggio Calabria, Italy. Follow-up: 3 years. Outcome: composite renal endpoint (eGFR decline > 30%, dialysis, or transplantation). Pilot cohort, relatively small sample size, and unclear recruitment method. | Moderate, Clinically relevant CKD population but small, single-center, and possibly non-representative sample. |
| Index Model | Composite tubular risk score derived from urinary NGAL, Uromodulin, and KIM-1 (binary: above/below median). Developed via multiple Cox regression, combined later with eGFR in an integrated model. Internal performance assessed via Harrell’s C-index (0.79 vs. 0.77 for eGFR alone). | Moderate, Simple derivation method, but internal-only validation, small sample, and data-driven thresholding raise overfitting risk. |
| Comparator/Reference Model | eGFR-based model (single predictor) used as the reference comparator for assessing incremental prognostic value. | Low, eGFR is a gold-standard clinical reference for kidney function. |
| Outcome | Composite renal outcome: >30% eGFR decline, dialysis, or transplantation during 3 years. Objectively defined and clinically relevant. | Low, Outcome is standardized, measurable, and clinically meaningful. |
| Timing | Prospective follow-up of 3 years; predictors (urinary biomarkers) measured at baseline, outcome assessed longitudinally. | Low, Appropriate temporal relationship between predictors and outcome. |
| Setting | Academic nephrology and renal transplantation unit; research laboratory with validated urinary biomarker assays. | Low, Controlled clinical and analytical environment ensures reliable data quality. |
| Intended Use of the Prediction Model | Early risk stratification of CKD patients for rapid progression or kidney failure, to complement eGFR-based clinical prediction tools. | Low, Intended use aligns with nephrology practice and unmet clinical need. |
Table A9 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Pizzini et al., 2017 [27].
The overall PROBAST judgment for this study is moderate overall RoB, with low applicability concern.
Justification is as follows:
The single-center design, small sample size, and lack of external validation introduce a moderate RoB, particularly in model development and analysis. Applicability concerns are low, as the predictors (urinary NGAL, Uromodulin, KIM-1) and outcomes reflect real-world CKD progression assessment.
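Harrell's C-index, the discrimination metric reported for the integrated model (C = 0.79 vs. 0.77 for eGFR alone), can be computed from first principles. The sketch below uses invented follow-up data, not the study's dataset; it only illustrates how the statistic counts pairs in which the subject with the earlier event also carries the higher predicted risk.

```python
def harrell_c(times, events, risks):
    """Harrell's concordance index: among usable pairs (subject i has an
    observed event before subject j's time), the fraction in which the
    earlier-event subject also has the higher predicted risk."""
    concordant = tied = usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / usable

# Hypothetical cohort: follow-up time (months), event indicator (1 = composite
# renal endpoint), and a composite tubular risk score (higher = worse).
times  = [6, 12, 18, 24, 30, 36]
events = [1, 1, 1, 0, 0, 0]
risks  = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
c_index = harrell_c(times, events, risks)   # 11 of 12 usable pairs concordant
```

A value near 0.5 means the score orders event times no better than chance, and 1.0 means perfect ranking, which is why the reported 0.79 vs. 0.77 represents only a modest incremental gain over eGFR alone.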
Table A10. PROBAST assessment for Qin et al., 2019 [28].
| Domain | Description | RoB | Applicability Concern |
|---|---|---|---|
| Population | 1053 hospitalized adults with type 2 diabetes; after PSM, 500 (250 DKD, 250 non-DKD). All had eGFR ≥ 60 mL/min/1.73 m2. Excluded other kidney or systemic diseases. Hospital-based inpatient cohort, not representative of general outpatient T2DM populations. | Moderate | High, inpatient sample limits generalizability to screening or primary-care settings. |
| Index Model/Predictors | Six urinary biomarkers measured once: transferrin (TF), immunoglobulin G (IgG), retinol-binding protein (RBP), β-galactosidase (GAL), N-acetyl-β-glucosaminidase (NAG), β2-microglobulin (β2MG). Each assessed individually with logistic regression and ROC AUC; no multivariable or externally validated model. | High, simple univariable analysis; no validation or adjustment for overfitting. | Moderate, biomarkers clinically measurable but not yet standardized for DKD diagnosis. |
| Comparator model/Reference Standard | 24 h urinary albumin excretion (UAE ≥ 30 mg/24 h) as gold standard for DKD. Overlaps mechanistically with some predictors, introducing incorporation bias. | High, predictor-outcome dependency likely inflates AUCs. | Moderate, UAE widely accepted but not ideal for early-stage DKD reference. |
| Outcome | Presence of DKD (vs. normoalbuminuric) is defined cross-sectionally by 24 h UAE and eGFR ≥ 60. No longitudinal follow-up. | Moderate, objective lab-based outcome but lacks temporal dimension. | Moderate, relevant to early DKD diagnosis but not progression prediction. |
| Timing | Cross-sectional; biomarkers and UAE measured concurrently during hospitalization (no temporal validation). | High | High, not predictive; diagnostic only. |
| Setting | Single tertiary hospital (Tianjin Medical University Chu Hsien-I Memorial Hospital), China, 2018. All assays in hospital lab. | Moderate | High, single-center; potential institutional bias. |
| Intended Use of prediction model | Exploratory diagnostic discrimination to identify DKD among known T2DM inpatients with preserved eGFR. Not a prognostic or screening model. | Moderate, appropriate for hypothesis generation only. | High, limited use beyond internal diagnostic context. |
| Overall Judgment | Cross-sectional single-center study with internal ROC analysis only. High internal performance (RBP AUC 0.92) is likely optimistic. No calibration, temporal, or external validation performed. | High overall RoB | High applicability concern |
Table A10 integrates PROBAST domains with the study-specific details (population, predictors, comparator, outcome, timing, setting, intended use) for a clear, structured assessment of the journal publication, Qin et al., 2019 [28].
This cross-sectional diagnostic study assessed six urinary biomarkers for detecting DKD among hospitalized adults with type 2 diabetes in Tianjin, China. Using 24 h urinary albumin excretion as the reference, RBP, TF, and IgG showed the best discrimination (AUCs of 0.92, 0.87, and 0.87, respectively). However, methodological appraisal with PROBAST indicates a high overall RoB due to the cross-sectional design, incorporation bias between biomarkers and outcome, and absence of external validation. Applicability is limited to hospital-based diagnostic research settings rather than predictive clinical screening or community-based use.
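The univariable workflow appraised above, one urinary biomarker at a time against a binary DKD label, is easy to reproduce in outline. The sketch below picks a cutoff by Youden's J (sensitivity + specificity - 1), a common companion to the ROC analysis the authors report; the RBP values and labels are invented, and the perfect separation they produce is deliberately artificial, unlike the study's AUC of 0.92.

```python
# Hypothetical sketch of a univariable diagnostic analysis: a single urinary
# biomarker (here labeled RBP, values invented) against a binary DKD label,
# with the cutoff chosen to maximize Youden's J = sensitivity + specificity - 1.

def youden_cutoff(values, labels):
    """Return (cutoff, J, sensitivity, specificity) maximizing Youden's J."""
    best = (None, -1.0, 0.0, 0.0)
    for cut in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cut)
        fn = sum(1 for v, y in zip(values, labels) if y == 1 and v < cut)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cut)
        fp = sum(1 for v, y in zip(values, labels) if y == 0 and v >= cut)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        if j > best[1]:
            best = (cut, j, sens, spec)
    return best

rbp = [0.4, 0.5, 0.6, 0.9, 1.1, 1.3, 1.6, 2.0]   # hypothetical urinary RBP
dkd = [0,   0,   0,   0,   1,   1,   1,   1]      # DKD by 24 h UAE criterion
cutoff, j, sens, spec = youden_cutoff(rbp, dkd)
```

Note that this workflow shares the weaknesses PROBAST flags: the cutoff is tuned on the same sample it is evaluated on, and the label itself (UAE) overlaps mechanistically with filtration-related biomarkers, so apparent performance is optimistic.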
Table A11. QUADAS-2 RoB and applicability concerns for Schanstra et al., 2015 [29].
| Domain | Description | RoB | Applicability Concern |
|---|---|---|---|
| Patient Selection | Large multicenter cohort (n = 1990) including CKD patients across stages and healthy/at-risk controls; subset (n = 522) had longitudinal follow-up for progression. Inclusion criteria are broad and clinically relevant. Unclear if sampling was consecutive or random, though selection likely minimized spectrum bias by including multiple CKD etiologies. | Low-Moderate | Low, representative CKD and at-risk populations suitable for diagnostic and prognostic use. |
| Index Test | Urinary multi-peptide biomarker classifier derived from proteomic analysis; algorithm validated internally and externally across centers. Index test interpreted blinded to reference measures (as implied). Uses objective mass-spectrometry-based quantification. | Low, standardized proteomic measurement, objective analysis. | Low, proteomic test applicable to intended CKD risk stratification setting. |
| Reference Standard | Clinical CKD diagnosis and progression assessed by eGFR decline and/or albuminuria according to accepted criteria (Kidney Disease: Improving Global Outcome/KDIGO). Both objective and reproducible. However, albuminuria overlaps mechanistically with the index peptides, introducing some dependency. | Moderate, potential incorporation bias with overlapping filtration markers. | Low, consistent with clinical standards for CKD staging and progression. |
| Flow and Timing | Cross-sectional design for detection with a subset followed longitudinally (n = 522) for progression; uniform application of index and reference tests at baseline; consistent follow-up for outcome. | Low, flow appropriate and timing consistent. | Low, progression analysis based on follow-up data aligns with intended use. |
| Overall RoB | Generally robust multicenter design with standardized proteomic measurement and appropriate statistical validation. Minor risk from partial overlap between index and reference measures. | Low overall | Low overall |
Table A11 presents the application of the QUADAS-2 tool for the RoB and applicability concerns assessment of one of the latter two studies [29], which uses urinary biomarkers as input variables for AI-based diagnostic tools for the early detection, risk stratification, and monitoring of CKD.
QUADAS-2 appraisal indicates an overall low RoB and good applicability, supported by rigorous proteomic quantification and validation across multiple centers. Minor concerns remain about partial incorporation bias since the reference standard (albuminuria, eGFR) overlaps biologically with some peptides. The study provides strong diagnostic and prognostic evidence for urinary proteome classifiers as complementary CKD risk stratification tools.
Table A12. QUADAS-2 RoB and applicability assessment for Muiru et al., 2021 [30].
| Domain | Description | RoB | Applicability Concerns |
|---|---|---|---|
| 1. Patient Selection | Participants were drawn from the WIHS prospective cohort of women with HIV; inclusion required preserved kidney function (eGFR ≥ 60 mL/min/1.73 m²) and paired urine samples. | Low-Moderate, selection limited to relatively healthy women, potentially introducing bias. | Moderate, mostly middle-aged Black women with HIV; may not represent general CKD or HIV-positive male populations. |
| 2. Index Test (Urine Biomarkers) | 14 urine biomarkers measured in duplicate using standardized multiplex assays; results analyzed as continuous standardized values without diagnostic thresholds. | Low-Moderate, laboratory methods robust, but no pre-specified diagnostic cut-offs. | Some concern, biomarkers used as exploratory indicators, not validated diagnostic tests. |
| 3. Reference Standard | No true diagnostic “gold standard” for CKD; comparisons made to CKD risk factors (HbA1c, BP, viral load, etc.) rather than confirmed CKD diagnosis. | High, absence of a defined reference standard limits diagnostic accuracy assessment. | High, reference variables do not constitute a diagnostic criterion for CKD. |
| 4. Flow and Timing | Baseline and follow-up urine and serum specimens obtained 2.5 years apart for all 647 participants; consistent measurements across time points. | Low, clear temporal structure and uniform application of tests. | Low, appropriate interval and consistent follow-up across participants. |
The study demonstrates rigorous biomarker measurement and statistical modeling, but it is more exploratory (prognostic/associative) than diagnostic, so QUADAS-2 applies only partially. We therefore applied QUADAS-2 principles to evaluate the quality of its diagnostic/biomarker inferences. Since the QUADAS-2 framework evaluates RoB and applicability concerns across four key domains, we examined how this study aligns with each domain:
Overall RoB for Domain 1: Low to moderate
Overall RoB for Domain 2: Low to moderate
Table A15. QUADAS-2 RoB and applicability assessment for Muiru et al., 2021 [30], Domain 3: Reference Standard.
| Criterion | Assessment | Justification |
|---|---|---|
| Is the reference standard likely to correctly classify the target condition? | High risk | The study does not use a clinical diagnosis or gold-standard CKD outcome, only risk factors and biomarker change correlations. |
| Were the reference standard results interpreted without knowledge of the index test results? | Low risk | Risk factors (A1c, blood pressure, HIV viral load, etc.) were measured independently of biomarkers. |
| Applicability concern | High | The “reference standard” here is not a diagnostic truth measure, so diagnostic accuracy cannot be directly evaluated. |
Overall RoB for Domain 3: High (not applicable as a true diagnostic accuracy study)
Overall RoB for Domain 4: Low
Overall RoB for this study: Moderate to High
Overall Applicability of this study: Moderate
Since this study is methodologically sound for associative/prognostic biomarker research but not a diagnostic accuracy study, and QUADAS-2 applies only partially, we also used the PROBAST tool to assess RoB and the applicability of this study.
Table A17. PROBAST quality assessment for Muiru et al., 2021 [30].
| Domain | Description | RoB | Applicability Concerns |
|---|---|---|---|
| Participants/Population | 647 women living with HIV from the U.S. Women’s Interagency HIV Study (WIHS). Inclusion required two urine samples and preserved kidney function (eGFR ≥ 60 mL/min/1.73 m²). Majority were middle-aged and Black (67%). | Some concern, selection limited to women with preserved renal function; may not represent patients with advanced CKD or male populations. | Moderate, findings apply mainly to women with HIV and may not generalize to all HIV or CKD populations. |
| Index Model | Multivariable penalized regression model (MSG-LASSO) and simultaneous linear equations assessing associations between CKD risk factors and longitudinal changes in 14 urine biomarkers. | Some concern, robust internal modeling, but no external or temporal validation; unclear internal validation (e.g., bootstrapping). | Moderate, model exploratory, not clinically implemented; predictive performance not reported. |
| Comparator Model | None, no existing or alternative predictive model used for comparison; focus was on evaluating associations, not on model performance metrics. | High, absence of comparator limits interpretation of predictive improvement or incremental value. | Some concern, not designed for model comparison or validation. |
| Outcome | Longitudinal changes in kidney tubular and glomerular biomarkers (e.g., KIM-1, IL-18, UMOD, α1m, β2m). No hard kidney outcome (CKD progression, eGFR decline) assessed. | Some concern, surrogate outcomes, not direct measures of kidney disease progression or patient-level endpoints. | Moderate, outcome biologically meaningful but limited for clinical prediction utility. |
| Timing | Prospective longitudinal cohort; baseline biomarker and clinical data collected in 2009–2011 and repeated ~2.5 years later. | Low, consistent and appropriate timing for longitudinal biomarker change evaluation. | Low, timing consistent with biological plausibility for kidney biomarker change. |
| Setting | Multi-center U.S. observational cohort study (academic research settings). | Low, standardized data collection and laboratory protocols reduce bias. | Some concern, research setting may differ from clinical practice environments. |
| Intended Use of Predictive Model | Exploratory, to identify CKD risk factors associated with biomarker changes and to inform future development of kidney disease detection algorithms in HIV. | Some concern, model not yet developed for clinical prediction; exploratory by design. | Moderate, informative for biomarker research, but not directly applicable to clinical prediction or screening. |
The study demonstrates strong internal validity and robust measurement methods but is limited by a lack of external validation, a restricted population, and the use of biomarkers as surrogate outcomes. It is best viewed as a hypothesis-generating prognostic analysis rather than a finalized predictive model.
Overall RoB: Moderate
Overall Applicability: Moderate
Appendix F
This appendix contains details and explanations supplemental to the subsection, “4.6. Application of AI and ML-based algorithms for identifying and classifying existing diseases and subtypes, and forecasting disease progression and risk stratification” of the main text. The explanations of the quality assessment of the included study articles [16] through [23] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text. However, this discussion is crucial to understanding the overall significance of these studies.
Table A37. QUADAS-2 RoB and applicability assessment for Lucarelli et al., 2023 [16].
| Domain | RoB | Applicability Concerns |
|---|---|---|
| Patient Selection | High/Unclear, Selective recruitment (urine proteomics + pathology cohort) with limited reporting on inclusion/exclusion; possible spectrum bias. | Moderate, Participants unlikely to reflect general T2DM population; community-clinic applicability uncertain. |
| Index Test (digital biomarkers via urinary proteomics + pathology) | Moderate/Unclear, Unclear blinding; thresholds not pre-specified; same dataset used for discovery and model training → overfitting risk. | Moderate, Uses specialized proteomics pipeline; external platforms may differ → limited generalizability. |
| Reference Standard | Low → Moderate, Pathology-based reference appropriate but possibly heterogeneous (biopsy vs. clinical classification); unclear if consistent across subjects. | Moderate, Standard relevant to DKD but not necessarily identical to routine clinical endpoints. |
| Flow & Timing | High, Incomplete information on timing between index and reference tests; unclear follow-up; potential verification bias. | Moderate, Specialized research workflow limits applicability to routine timelines and sampling logistics. |
| Overall Judgment | Overall RoB: High, Unclear patient selection + limited reporting on timing increase risk. | Overall Applicability: Moderate, Study setting and assay platform differ from standard clinical practice; external validation needed. |
Table A37 is a color-coded QUADAS-2 table for the study by Lucarelli et al. (2023) [16], “Discovery of Novel Digital Biomarkers for Type 2 Diabetic Nephropathy Classification via Integration of Urinary Proteomics and Pathology.”
This study represents an innovative early-phase effort to merge urinary proteomics with pathology for digital biomarker discovery in diabetic nephropathy. However, methodological transparency (especially around cohort definition, temporal sequence, and blinding) is limited. The findings are promising but not yet generalizable to multi-site or community settings without external validation and standardized assay calibration.
Table A38. QUADAS-2 RoB and applicability assessment for Yan et al., 2024 [17].
| Domain | RoB | Applicability Concerns |
|---|---|---|
| Patient Selection | Low → Moderate, Participants included clear diagnostic categories (T2DM with/without DKD), but sampling method and exclusion criteria were not fully described; possible selection bias if matched retrospectively. | Low, DKD and control populations clinically relevant; findings likely applicable to standard T2DM cohorts with albuminuria-based classification. |
| Index Test (Urinary proteomics + ML classifiers) | Moderate, ML approach (e.g., SVM, RF) applied without fully independent test set; unclear blinding to reference standard; internal validation only. | Moderate, Omics workflows and normalization pipelines may not generalize across assay platforms; batch effects could limit external use. |
| Reference Standard | Low, Diagnostic definitions followed established DKD criteria (albuminuria, eGFR decline); reference standard appropriate and consistently applied. | Low, Reference outcomes align well with clinical practice and guidelines. |
| Flow & Timing | Moderate, Timing between urine collection and DKD classification not explicitly stated; unclear if temporal gaps could bias associations. | Low → Moderate, Acceptable for cross-sectional biomarker screening but limited for longitudinal prediction. |
| Overall Judgment | Overall RoB: Moderate, Limited external validation and potential for overfitting; otherwise methodologically reasonable. | Overall Applicability: Low Concern, Results relevant to typical DKD diagnostic settings, but omics reproducibility remains a constraint. |
Table A38 is a color-coded QUADAS-2 evaluation for the study by Yan et al. (2024) [17], “Application of Proteomics and Machine Learning Methods to Study the Pathogenesis of Diabetic Nephropathy and Screen Urinary Biomarkers.”
This study follows a single-center cohort. External validation and transparent reporting of temporal design are still needed to ensure trustworthy real-world deployment. The overall evidence is methodologically moderate with good clinical applicability.
Table A39. PROBAST assessment for Dong et al., 2022 [18].
| Domain | Description | RoB | Applicability Concern |
|---|---|---|---|
| Population/Participants | Adults with type 2 diabetes extracted from a large hospital EMR database in China; patients were followed for up to 3 years for DKD onset. Clear inclusion/exclusion criteria with sufficient baseline data. | Low | Low |
| Index Model | ML-based predictive models (RF, XGBoost, Logistic Regression) using routine EMR data (labs, demographics, comorbidities, medication, BP, BMI, eGFR, etc.) to predict 3-year risk of diabetic kidney disease. | Moderate | Low |
| Comparator Model | Traditional logistic regression models used as baseline comparators for performance benchmarking. | Low | Low |
| Outcomes | Onset of diabetic kidney disease (DKD) within 3 years, defined by KDIGO criteria (eGFR < 60 mL/min/1.73 m² or UACR > 30 mg/g). | Low | Low |
| Timing | Predictors measured at baseline; 3-year prediction horizon. However, it is unclear if temporal data splits (e.g., patient-level chronological separation) were strictly applied. | Moderate | Moderate |
| Setting | Single tertiary hospital EMR database (China); retrospective design. No external validation or community-level testing reported. | High | Moderate |
| Intended Use of Predictive Model | Clinical decision support for early identification of high-risk DKD patients among those with type 2 diabetes; potentially guiding earlier interventions or referrals. | Low | Low |
| Statistical Analysis | Multiple ML algorithms compared. Data randomly split into training/testing sets; AUROC ≈ 0.86 reported. No external validation; no calibration curve, calibration intercept, or decision curve analysis (DCA). Imputation and feature selection methods are not fully detailed, increasing overfitting risk. | High | Moderate |
| Overall Judgment | Moderate-to-High RoB, Model trained and tested internally with strong discrimination but limited evidence of calibration, generalizability, and robustness to domain shift. Applicability concerns low, as EMR-based predictors are clinically relevant and reproducible. | Moderate-High | Low |
Table A39 is a structured PROBAST-style summary table for Dong et al. (2022) [18].
Key Limitations of this study include a lack of external validation, unclear handling of temporal dependencies, and incomplete reporting of calibration.
Key Strengths of this study include a large, representative EMR cohort with a clinically meaningful 3-year DKD outcome.
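The KDIGO-based outcome definition in Table A39 (eGFR < 60 mL/min/1.73 m2 or UACR > 30 mg/g) reduces to a simple, reproducible labeling rule. The sketch below is our own illustration of that rule, not code from Dong et al. [18]:

```python
def dkd_label(egfr: float, uacr: float) -> bool:
    """KDIGO-style DKD flag used as the 3-year outcome: eGFR below
    60 mL/min/1.73 m^2 or UACR above 30 mg/g. Illustrative sketch only."""
    return egfr < 60.0 or uacr > 30.0

# Hypothetical patients:
print(dkd_label(egfr=85.0, uacr=12.0))   # preserved eGFR, normoalbuminuric -> False
print(dkd_label(egfr=52.0, uacr=12.0))   # reduced eGFR -> True
print(dkd_label(egfr=85.0, uacr=140.0))  # macroalbuminuria -> True
```

Because the label is fully determined by routine EMR values, it is one of the more reproducible elements of the study; the bias concerns lie in the modeling and validation, not the outcome definition.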
Table A40. PROBAST assessment for Hsu et al., 2023 [19].
| Domain | Description | RoB | Applicability Concern |
|---|---|---|---|
| Population/Participants | Adult patients with type 2 diabetes from a large hospital network in Taiwan. Dataset derived from electronic medical records (EMRs) between 2012–2021. Exclusions applied for pre-existing ESRD or missing baseline renal data. | Low | Low |
| Index Model | Ensemble ML models (RF, XGBoost, LightGBM) built to predict risk of rapidly progressive kidney disease (RPKD), defined as ≥30% eGFR decline within 2 years, and identify patients needing nephrology referral. | Low | Low |
| Comparator Model | Traditional logistic regression and Cox regression were used as comparators. ML models outperformed conventional methods with higher AUROC values. | Low | Low |
| Outcomes | RPKD and nephrology referral within a 2-year period, defined using KDIGO-based eGFR decline thresholds and clinician referrals. Clear, clinically relevant endpoint definitions. | Low | Low |
| Timing | Predictors collected at baseline; follow-up period of up to 2 years. However, it is unclear whether patient-level temporal splits were enforced, or random sampling used for validation. | Moderate | Moderate |
| Setting | Retrospective single-center EMR dataset from a tertiary care hospital. While data volume was high, external validation or community-level generalizability testing was not reported. | High | Moderate |
| Intended Use of Predictive Model | Designed to assist clinicians in early identification of diabetic patients at high risk for RPKD, to optimize referral timing and monitoring intensity. Potential clinical decision-support tool. | Low | Low |
| Statistical Analysis | Compared multiple ML models; best AUROC ≈ 0.89 (XGBoost). Used cross-validation for internal evaluation. Calibration metrics not clearly reported, and decision-curve analysis absent. Missing data handling and normalization are not fully described. | Moderate | Moderate |
| Overall Judgment | Moderate RoB. Model performance was strong (AUROC ≈ 0.89), but limited transparency on calibration, imputation, and temporal validation reduces real-world reliability. Applicability concerns are low, given the model’s clinically relevant predictors and outcomes, but external validation remains a key gap. | Moderate | Low |
Table A40 is a PROBAST evaluation of Hsu et al., 2023 [19].
The real-world deployment of the Hsu et al., 2023 [19] model is limited by a lack of external validation and by incomplete reporting of calibration and data hygiene practices. Although the study demonstrates promise for supporting physicians’ clinical decisions, it still needs broader testing across diverse clinical settings, from tertiary-level hospitals to local clinics.
Table A41. QUADAS-2 RoB and applicability concerns assessment for Paranjpe et al., 2023 [20].
| Domain | Description/Key Details | RoB | Applicability Concerns |
|---|---|---|---|
| Patient Selection | Retrospective EMR data from large academic centers; inclusion criteria focused on diabetic kidney disease (DKD) with genotyping data available. | Moderate risk, retrospective design may introduce selection bias and missingness in clinical-genetic linkage. | Low concern, representative of tertiary diabetic populations. |
| Index Test (Deep Learning Model) | DL model trained on multimodal EMR data with genomic integration to identify DKD sub-phenotypes linked to the Rho pathway. | Moderate risk, details of model tuning and cross-validation split not fully reported; unclear if feature leakage was avoided. | Moderate concern, algorithm may overfit tertiary-center EMR data, limiting community translation. |
| Reference Standard | Genetic and clinical phenotyping used to define DKD subtypes; no independent gold-standard confirmation (e.g., biopsy). | High risk, absence of external biological validation may weaken subtype credibility. | Moderate concern, relies on EMR and genetic proxies rather than clinical pathology confirmation. |
| Flow and Timing | Cross-sectional integration of EMR and genomic data; unclear timing alignment between phenotype and genotype capture. | Moderate risk, potential temporal mismatch between clinical and genomic data points. | Low concern, overall timing reasonable for computational phenotyping. |
| Statistical Analysis | DL interpretability limited; performance metrics (e.g., AUC or clustering validity indices) partially reported; no external validation cohort. | High risk, limited transparency and missing calibration or generalizability metrics. | High concern, lack of independent validation and reproducibility assessment. |
| Overall Judgment | The study offers valuable biological insight into DKD subtypes through EMR-genomic integration, but incomplete reporting and lack of external validation raise serious concerns about reproducibility and clinical generalizability. | High overall RoB | Moderate applicability concern |
Table A41 is a color-coded QUADAS-2 assessment table for the study by Paranjpe et al. (2023) [20].
Paranjpe et al. (2023) [20] introduce an ambitious DL approach that links EMR and genomic data to uncover DKD subtypes, but transparency gaps and a lack of external validation limit its current clinical reliability. The findings are hypothesis-generating rather than ready for real-world application.
Table A42. PROBAST assessment for Xu et al., 2020 [21].
| PROBAST Item | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Study Type | Systematic literature review of ML models predicting diabetic microvascular complications (retinopathy, nephropathy, neuropathy) in Type 1 Diabetes Mellitus (T1DM). | - | - |
| Population/Participants | Participants were individuals with T1DM, from heterogeneous study cohorts (pediatric/adult, different disease durations, varying inclusion criteria). Lack of uniformity and representativeness. | High | High |
| Index Model(s) | Various ML algorithms: SVM, RF, artificial neural networks (ANN), decision trees, etc. | Unclear | Unclear |
| Comparator Model(s) | Some studies compared ML models with logistic regression or classical statistical methods; no consistent comparator and no meta-analysis. | Unclear | Unclear |
| Outcome(s) | Diabetic retinopathy, nephropathy, and neuropathy. Outcome definitions and diagnostic criteria varied across studies. Limited reporting on outcome ascertainment. | Unclear | Unclear |
| Timing | Prediction horizons varied (cross-sectional, retrospective cohort). No standardized follow-up intervals or prediction windows. | High | Moderate |
| Setting | Studies conducted in mixed clinical and research settings (hospital registries, EMR datasets). Contextual details are often missing. | Unclear | Unclear |
| Intended Use | To support early detection and risk stratification for diabetic microvascular complications in T1DM, potentially informing personalized clinical management. | - | Moderate |
| Predictors | Predictors included demographic (age, sex), clinical (HbA1c, BP, duration of diabetes), lab (lipids, creatinine), and imaging (retinal images). Predictor handling poorly described and inconsistent. | Unclear | Moderate |
| Statistical/Modeling Analysis | Qualitative synthesis only. Most studies reported accuracy and AUC, but few addressed calibration, missing data, or validation. No pooled statistics or sensitivity analyses. | High | High |
| Domain 1, Participants | Populations varied; inclusion criteria and recruitment are often unclear; potential selection bias. | High | High |
| Domain 2, Predictors | Inconsistent predictor definitions and processing; limited reporting of feature selection and handling of missing data. | Unclear | Moderate |
| Domain 3, Outcome | Outcomes inconsistently defined; validation methods not standardized; blinding rarely reported. | Unclear | Moderate |
| Domain 4, Analysis | No formal validation in many studies; high risk of overfitting; small datasets; selective reporting of favorable results. | High | High |
| Overall RoB | Methodological transparency limited; heterogeneity high; no structured bias tool (like PROBAST) used in the review. | High | - |
| Overall Applicability | Useful overview of ML research trends but limited generalizability or clinical utility due to lack of validation and standardization. | - | Moderate |
| Overall PROBAST Judgment | High RoB; moderate applicability concerns. Review descriptive but insufficient for clinical adoption of ML models in T1DM complications. | High | Moderate |
Table A42 is a structured PROBAST summary table including both methodological domains and key study characteristics for Xu et al., 2020 [21].
Xu et al. (2020) [21] provide a broad overview of ML applications in diabetic complication prediction, but the overall PROBAST rating is high RoB, with moderate concerns about applicability. The real-world applicability of this review is limited by the lack of a structured RoB assessment, poor reporting of model development and validation across the included studies, and substantial heterogeneity in predictors, outcomes, and data sources.
Table A43. QUADAS-2 RoB and applicability assessment for Dong et al., 2024 [22].
| Domain | Key Questions/Description | RoB | Applicability Concerns |
|---|---|---|---|
| 1. Patient Selection | Did the review include studies enrolling representative participants? Were inclusion/exclusion criteria clearly defined and appropriate for DN biomarker discovery? | Unclear risk, The review included studies of diabetic patients with and without nephropathy, but inclusion criteria across studies varied widely (different stages of DN, different diabetes types, and demographic variability). No explicit description of how participants were selected in included studies. | Unclear concern, Applicability limited by heterogeneity in populations (T1DM vs. T2DM, ethnic differences, sample sources). |
| 2. Index Test (ML Models/Biomarker Identification Methods) | Were ML methods described in sufficient detail to permit replication? Was model performance validated appropriately? | High risk, The review summarized various ML approaches (e.g., RF, SVM, LASSO), but many primary studies lacked validation procedures or transparent performance metrics. The review did not critically appraise algorithm robustness or validation quality. | Moderate concern, Some ML models aimed to identify candidate biomarkers (not diagnostic tools), so their direct clinical applicability is limited. |
| 3. Reference Standard (Diagnosis of Diabetic Nephropathy) | Was DN defined and confirmed consistently across included studies? Was the reference standard likely to correctly classify the condition? | Unclear risk, DN definitions varied (e.g., based on eGFR, albuminuria, biopsy, or clinical diagnosis). The review did not standardize or stratify based on diagnostic criteria. | Unclear concern, Variability in diagnostic definitions affects generalizability of biomarker findings. |
| 4. Flow and Timing | Were all patients included in the analysis? Was there an appropriate interval between biomarker testing and reference standard assessment? | High risk, The review did not evaluate timing consistency or completeness of datasets in primary studies. Missing data handling and participant flow were not discussed. | Moderate concern, Potential temporal mismatch between biomarker sampling and DN diagnosis may bias interpretation. |
| Overall Judgment | General methodological and reporting quality of the review and the included studies. | High RoB, No formal quality or bias assessment (e.g., QUADAS-2, PROBAST) applied; heterogeneous populations, methods, and validation standards. | Moderate applicability concern, Review useful for identifying research trends but limited clinical translation due to poor standardization and unclear diagnostic validity. |
Table A43 is an adapted QUADAS-2 evaluation of Dong et al., 2024 [22], assessing the quality and bias of the review’s approach and of the studies it summarized.
Across the included literature, common issues persist: data leakage arising from sloppy time windows between biomarker measurement and outcome ascertainment, undefined handling of missing data, and outcome labels that shift across sites or depend on inconsistent diagnostic criteria for diabetic nephropathy. Follow-up intervals were often too short to capture clinically meaningful disease progression, further limiting interpretability.
Most primary studies used small, retrospective datasets and reported impressive accuracy metrics without adequate external validation or calibration. These eye-catching numbers are likely inflated by methodological weaknesses that limit generalizability to new populations and real-world deployment.
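One of the leakage mechanisms described above — repeated measurements from one patient landing on both sides of a random train/test split — is avoided by splitting at the patient level. A minimal sketch under our own assumptions (hypothetical patient IDs and scikit-learn's GroupShuffleSplit), not a reconstruction of any reviewed study:

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical repeated-measures dataset: each patient contributes two
# samples, so a naive random split could leak patient-specific signal.
samples = list(range(8))
patient_ids = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=patient_ids))

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
# No patient appears on both sides of the split.
assert train_patients.isdisjoint(test_patients)
print(f"train: {sorted(train_patients)}, test: {sorted(test_patients)}")
```

For prognostic models, a chronological (temporal) split is stricter still: training data should precede test data in time, mirroring deployment.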
Table A44. QUADAS-2 RoB and applicability concerns assessment for Nagaraj et al., 2021 [23].
| Domain | Key Questions/Description | RoB | Applicability Concerns |
|---|---|---|---|
| 1. Patient Selection | Participants were individuals with DKD from observational cohorts. Inclusion and exclusion criteria were clearly defined, and baseline characteristics were described. However, participants were drawn from limited cohorts, potentially leading to selection bias. | Unclear risk, Sampling may not reflect broader DKD populations (mostly moderate to severe stages). | Unclear concern, Applicability may be limited to similar hospital-based populations. |
| 2. Index Test (Kidney Age Index, ML Model) | The KAI framework was developed using an ML algorithm to estimate biological kidney age from clinical and biochemical parameters. Model development and internal validation were described, but external validation was not performed. There is potential for data leakage if temporal splits are not strictly enforced. Handling of missing data and feature selection steps were insufficiently detailed. | High risk, Possible overfitting and unclear missing data handling. | Moderate concern, KAI may not generalize to settings with different patient characteristics or lab standards. |
| 3. Reference Standard (Measured Kidney Function) | The reference standard was eGFR and albuminuria-based diagnosis of DKD, which are accepted clinical measures. However, measurement variability and assay calibration differences were not discussed. | Unclear risk, Reference measures acceptable but not standardized across datasets. | Low concern, Consistent with clinical definitions of kidney function. |
| 4. Flow and Timing | The study used retrospective data; timing of biomarker and outcome measurements was not always synchronized. Follow-up duration for assessing kidney decline was limited, making it difficult to assess long-term predictive validity. | High risk, Potential bias due to inconsistent timing and incomplete follow-up data. | Moderate concern, Limited longitudinal data reduce applicability for real-world progression prediction. |
| Overall Judgment | Innovative and well-conceptualized approach, but limited by internal-only validation, potential data leakage, and incomplete transparency on data preprocessing. | High overall RoB | Moderate applicability concern |
Table A44 is a structured QUADAS-2 evaluation summary table including both methodological domains and key characteristics of the study, Nagaraj et al., 2021 [23].
This study introduces KAI, an ML-derived biomarker estimating kidney function in DKD by mapping clinical variables to an “age-equivalent” kidney function score. While conceptually strong, the diagnostic accuracy framework is limited by several common pitfalls in ML-driven biomarker development:
Possible leakage between training and testing data due to unclear temporal separation.
Undefined handling of missing data, imputation, and variable selection.
Outcome labels (e.g., eGFR thresholds, albuminuria cutoffs) may shift across sites or datasets, affecting comparability.
Follow-up periods are too short to meaningfully assess the decline in kidney function or validate clinical relevance.
Despite these issues, the KAI framework represents a promising step toward personalized nephrology, and its methodological transparency and external validation should determine its eventual utility.
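The first of these pitfalls, leakage across the training/testing boundary, is avoided by enforcing a strict temporal split. A minimal sketch (toy visit records and an illustrative `temporal_split` helper, not the KAI pipeline itself):

```python
from datetime import date

# Hypothetical longitudinal records: (patient_id, visit_date, label).
visits = [
    ("p1", date(2018, 3, 1), 0), ("p1", date(2020, 7, 1), 1),
    ("p2", date(2019, 5, 1), 0), ("p3", date(2021, 2, 1), 1),
]

def temporal_split(records, cutoff):
    """Train only on visits strictly before the cutoff date; everything at
    or after the cutoff is held out, so no future visit can leak into
    model development."""
    train = [r for r in records if r[1] < cutoff]
    held_out = [r for r in records if r[1] >= cutoff]
    return train, held_out

train_rows, test_rows = temporal_split(visits, date(2020, 1, 1))
```

A stricter variant also splits by patient, so that no individual contributes visits to both sets.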
Appendix G
This appendix contains details and explanations supplemental to the subsection, “4.7. Predicting future outcomes, such as mortality or cardiovascular events, using ML algorithms or patients’ biomarkers” of the main text. The explanations of the quality assessment of the included study articles [34] through [40] using the PROBAST and QUADAS-2 tools would disrupt the flow of the main text. However, this discussion is crucial to understanding the overall significance of these studies.
Table A45.
PROBAST assessment for Ma et al., 2023 [34].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | 656 peritoneal dialysis patients with 13,091 visits; external testing on 1363 hemodialysis patients. Inclusion criteria were clear, but single center (or limited centers) may not represent broader PD populations globally. | Unclear | Unclear |
| Index Model | “AICare” ML model uses adaptive feature-importance recalibration and multi-channel feature extraction from longitudinal EMR data. Sophisticated model, but missing data handling, temporal separation, and feature selection processes are not fully described. | High | Moderate |
| Comparator Model | Conventional ML baselines (RF, logistic regression) and ablation versions of AICare were compared. Comparisons performed internally; no external peritoneal dialysis comparators beyond the hemodialysis dataset. | Unclear | Moderate |
| Outcome | 1-year mortality following each clinical visit (binary). Mortality is a hard endpoint; cause-specific mortality is not considered. Outcome timing per visit introduces potential inconsistencies. | Unclear | Low |
| Timing | Prediction horizon = 1-year post-visit; dataset spans ~12 years. Potential data leakage if future visits influence predictors. Follow-up per patient variable; censoring not fully detailed. | High | Moderate |
| Setting | Single-center (or few-center) tertiary care peritoneal dialysis patients; external hemodialysis dataset for testing. Retrospective EMR data. | Unclear | Moderate |
| Intended Use of Predictive Model | Early risk stratification for peritoneal dialysis patients to guide clinical interventions and prioritize high-risk patients for monitoring or treatment adjustments. | - | Moderate |
| Statistical Analysis | Multi-channel ML with adaptive recalibration, internal cross-validation, and testing on external hemodialysis cohort. Metrics: AUROC, AUPRC. Limited reporting on calibration, missingness handling, or temporal data leakage mitigation. | High | Moderate |
| Overall Judgment | Innovative ML approach with interpretability, large longitudinal dataset. High RoB due to potential leakage, unclear missing data handling, and limited external validation in PD population. Applicability moderate; promising method but caution needed in generalizing results. | High | Moderate |
Table A45 is a structured PROBAST evaluation summary table with both methodological domains and key features of the study, Ma et al., 2023 [34].
This study has common pitfalls in predictive modeling that are evident:
Potential data leakage: Because predictions are made at each visit, there is a risk that future information or visit timing may inadvertently inform earlier predictions if not strictly excluded.
Handling of missingness: Although many longitudinal features are used, the report gives limited detail on how missing values, irregular visit intervals, or drop-outs were handled.
Outcome labels and timing consistency: While mortality is a robust endpoint, the consistency of “1-year ahead from visit” across all visits may vary; also, differences between peritoneal dialysis and hemodialysis cohorts (for external testing) may limit applicability.
Follow-up and generalizability: Although the dataset spans ~12 years, the actual follow-up per individual and event rate (39.8% died in the peritoneal dialysis cohort) may limit how well the model predicts longer-term outcomes beyond 1 year. Also, the population appears to be drawn from one tertiary center (or limited centers) in China, which may differ from other geographic settings.
Given these limitations, when this study reports high performance (e.g., AUROC ~ 0.816 in the peritoneal dialysis dataset, AUPRC ~ 0.472), one must interpret the results cautiously: these numbers may not transfer to other centers, populations, or real-world deployment without further external validation and scrutiny of the modeling pipeline.
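For reference, the two metrics quoted here can be computed directly from ranked predictions. A minimal, self-contained sketch on toy labels and scores (the functions and data are illustrative, not taken from the AICare pipeline):

```python
def auroc(y_true, y_score):
    """Rank-statistic AUROC: the probability that a randomly chosen
    positive is scored above a randomly chosen negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(y_true, y_score):
    """Area under the precision-recall curve, sweeping the decision
    threshold from the highest score downward."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp = fp = 0
    n_pos = sum(y_true)
    area = prev_recall = 0.0
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        recall, precision = tp / n_pos, tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

y = [0, 1, 1, 0]          # toy outcome labels
s = [0.9, 0.8, 0.7, 0.3]  # toy model scores
```

Because AUPRC's chance level equals the event prevalence while AUROC's is always 0.5, AUPRC typically sits well below AUROC for rarer outcomes, which is one reason the two reported numbers diverge.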
Table A46.
PROBAST assessment for Chen et al., 2025 [35].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | Retrospective cohort of 359 hemodialysis patients from one hospital (January 2017–June 2023) in China. Inclusion appears clear, but limited to one center, limited external validation. | Unclear | Moderate |
| Index Model | Two ML models: Model A (85 variables) and Model B (22 variables), using RF/SVM/logistic regression, with SHAP for interpretability. Model description is good, but detail on missing data, temporal split, feature engineering is limited. | High | Moderate |
| Comparator Model | Comparisons among ML methods (RF/SVM/Logistic), but no strong external reference standard model (e.g., purely conventional risk model) reported. | Unclear | Moderate |
| Outcome | Two outcomes: (1) all-cause mortality; (2) time to death (regression) for hemodialysis patients. Mortality is meaningful, but follow-up time, censoring, competing risks are not fully described. | Unclear | Low |
| Timing | Data span about 6.5 years; models are built on retrospective data. Unclear if proper temporal separation between training and validation; possible leakage if later data influence earlier predictions. | High | Moderate |
| Setting | Single tertiary-care hemodialysis center in China; hospital setting. Raises concern about generalizability to other geographic/health-system settings. | Unclear | Moderate |
| Intended Use of Predictive Model | To support clinical decision-making by predicting mortality risk and time to death in hemodialysis patients, presumably to identify high-risk individuals for intervention. | - | Moderate |
| Statistical Analysis | Performance metrics: for Model A AU-ROC ~ 0.86 ± 0.07; for Model B AU-ROC ~ 0.80 ± 0.06. Regression (R2) for time to death reported. But calibration, missing data handling, external validation, temporal validation poorly described. | High | Moderate |
| Overall Judgment | Innovative and interpretable ML modeling, but significant methodological concerns: single-center data, risk of leakage, limited reporting of missingness/temporal splits, no strong external validation in hemodialysis populations. | High | Moderate |
Table A46 is a PROBAST evaluation summary table with both methodological domains and key features of the study, Chen et al., 2025 [35].
The study by Chen et al. (2025) [35] proposes two interpretable ML tools for predicting all-cause mortality and time to death among hemodialysis patients. Using a retrospective cohort from a single center in China, the authors achieved relatively high discrimination (AU-ROC ~ 0.86 for their more complex model). However, from a PROBAST perspective, several shortcomings reduce confidence in generalizability:
Data leakage risk: It is not clear whether temporal separation was strictly maintained between training and validation sets (e.g., avoiding future visits influencing predictions).
Handling of missing data and preprocessing is insufficiently described, increasing bias risk.
Outcome definitions and follow-up: While mortality is a robust endpoint, details on censoring, competing risks, and follow-up duration are sparse, raising questions about validity over time.
Setting and sample: Single-center Chinese hemodialysis population limits applicability to other regions, dialysis modalities, or health systems.
Model evaluation: While discrimination is reported, calibration and external validation are absent, meaning the “eye-catching” performance may not travel to other settings.
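Calibration, whose absence is flagged here, can be assessed with a simple binned reliability check: compare the mean predicted risk against the observed event rate within probability bins. A minimal sketch on toy predictions (not the study's data):

```python
def reliability_bins(y_true, y_prob, n_bins=4):
    """Split predictions into equal-width probability bins and return
    (mean predicted risk, observed event rate, bin size) per non-empty bin.
    A well-calibrated model has mean_pred close to obs_rate in every bin."""
    out = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(y_prob)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if idx:
            mean_pred = sum(y_prob[i] for i in idx) / len(idx)
            obs_rate = sum(y_true[i] for i in idx) / len(idx)
            out.append((mean_pred, obs_rate, len(idx)))
    return out

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.2, 0.6, 0.9, 0.8, 0.3]
bins = reliability_bins(y_true, y_prob, n_bins=2)
```

Discrimination (AU-ROC) can remain high even when such a table reveals systematic over- or under-estimation of risk, which is why reporting both matters.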
Table A47.
PROBAST assessment for Hung et al., 2022 [36].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | Retrospective cohort of 2932 ICU patients who received CRRT at a single tertiary center (Changhua Christian Hospital, Taiwan) from January 2010 to April 2021. Excluded ESRD on dialysis (n = 283), <20 yrs (n = 15), missing lab data (n = 73). While large cohort, single-center setting may limit representativeness of other ICU/CRRT populations. | Unclear risk | Moderate concern |
| Index Model | ML algorithms (GBM, XGBoost, RF, SVM) with feature selection (recursive feature elimination) and cross-validation; explainability via SHAP (global and local). However, details on missing data handling, temporal data splits (future leakage), and predictor measurement timing are limited. | High risk | Moderate concern |
| Comparator Model | Several ML models compared among themselves; no standard external clinical risk score comparator (or head-to-head with established CRRT mortality score) reported. | Unclear risk | Moderate concern |
| Outcome | Primary outcome: in-hospital mortality after CRRT initiation. Secondary endpoints: 28-day and 90-day mortality. Outcome is clinically meaningful and well defined (death during hospitalization). | Unclear risk | Low concern |
| Timing | Cohort spans ~11 years (2010-2021). The split: 80% training (n = 2345), 20% test (n = 587). But the temporal validation (e.g., future data) is not clearly described; potential for data leakage if later-visit data contributed to earlier predictions; handling of censoring/competing risks not deeply addressed. | High risk | Moderate concern |
| Setting | Single tertiary university hospital ICU and CRRT dataset in Taiwan. Retrospective EMR. Limits generalizability to other geographies, types of ICUs, CRRT protocols, and patient populations. | Unclear risk | Moderate concern |
| Intended Use of Predictive Model | To provide interpretable, personalized risk predictions of in-hospital mortality for CRRT patients, aiding clinicians in decision-making, family discussions, possibly guiding care strategy. | - | Moderate concern |
| Statistical/Modeling Analysis | The authors used RFE feature selection, 10-fold cross-validation (repeated 5 times), multiple ML algorithms, calibration belts, SHAP interpretability. Performance: AUC ~ 0.806 (XGBoost) to ~ 0.823 (GBM) in test set. Nonetheless, external validation not performed, unknown missing-data imputation process, potential overfitting risk. | High risk | Moderate concern |
| Overall Judgment | The study is well-designed in terms of sample size and use of modern ML + interpretability tools. However, key methodological uncertainties (data leakage, missingness handling, single-center dataset, no external validation) raise high RoB. Applicability is moderate: the model might work in similar ICU/CRRT settings but caution in broader settings. | High risk | Moderate concern |
Table A47 is a PROBAST evaluation summary table with both methodological domains and key features of the study, Hung et al., 2022 [36].
This study presents a robust attempt at building an explainable ML model to predict in-hospital mortality among ICU patients receiving CRRT. The dataset is relatively large (n ≈ 2932), and the authors employ modern tools like SHAP to improve transparency. Yet common problems in this space remain: potential leakage (since predictor data may span the CRRT initiation window without strict temporal partitioning); missing data and irregular time windows (the paper excludes patients with missing labs but does not fully describe handling or impact); outcome definitions that are consistent for in-hospital death but may vary in timing and discharge practices across centers; and a short follow-up window (in-hospital death) rather than long-term survival. When a model reports an AUC of ~0.82, the eye-catching number here, one must remember it was built in one center with retrospective data and no external validation, all of which may limit its transportability. More weight should be given to studies that externally validate, transparently report missing data and imputation, enforce temporal separation, and calibrate properly across populations.
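The repeated cross-validation scheme the authors describe (10-fold, repeated 5 times) can be sketched as follows; the shuffling details are assumptions for illustration, and such within-dataset resampling does not by itself guard against temporal leakage:

```python
import random

def repeated_kfold(n, k=10, repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs: each repeat reshuffles the sample
    indices and partitions them into k disjoint folds, so every sample is
    held out exactly once per repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for f in range(k) if f != held_out
                         for i in folds[f]]
            yield train_idx, test_idx

# Small demonstration: 20 samples, 4 folds, 2 repeats -> 8 splits.
splits = list(repeated_kfold(n=20, k=4, repeats=2))
```

Every split partitions the same cohort; only a chronologically later or external cohort tests transportability.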
Table A48.
PROBAST assessment for Lin et al., 2023 [37].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | 103 hemodialysis patients (age > 20, hemodialysis > 3 months) at a single center in Taiwan; followed for 36 months; 26 deaths (25.2%) occurred. While inclusion criteria are defined, the small sample size and single-center setting limit representativeness. | Unclear | Moderate |
| Index Test (Biomarker: Endocan) | Serum endocan levels measured at baseline; the authors explored association with all-cause mortality in hemodialysis patients. The biomarker assay is specified, but details on timing of measurement in relation to hemodialysis initiation, repeated measures, or variability are limited. | Unclear | Moderate |
| Comparator/Other Predictors | The study adjusted for prognostic variables (age, diabetes, creatinine, albumin) in multivariable analysis; but no formal predictive model comprehensively compared to endocan alone or other biomarker panels. | Unclear | Moderate |
| Outcome | Outcome is all-cause mortality over 36 months. Hard endpoint; well-defined. | Low | Low |
| Timing | Baseline endocan measured, then followed for up to 36 months. However, the study does not fully specify whether visits/predictor measurement preceded outcome uniformly, or how missing follow-up or censoring was handled. | High | Moderate |
| Setting | Single tertiary hospital dialysis center in Taiwan; hemodialysis patients. This may restrict applicability to other populations, geographies, or dialysis practices. | Unclear | Moderate |
| Intended Use of Predictive Model/Biomarker | The authors propose serum endocan as a biomarker for mortality risk stratification among hemodialysis patients. The intended use is prognostic rather than interventional. | - | Moderate |
| Statistical Analysis | Kaplan–Meier analysis by endocan median group; ROC curve for endocan (AUC reported); multivariable Cox regression adjusting for select covariates (endocan p = 0.010; creatinine p = 0.034). However, calibration, missing data handling, external validation, temporal separation, and model discrimination beyond biomarker association are not detailed. | High | Moderate |
| Overall Judgment | The study presents an interesting candidate biomarker (endocan) for mortality risk in hemodialysis patients, but methodological limitations (small sample, single center, limited timing/validation detail) raise substantial concerns about bias and generalizability. | High | Moderate |
Table A48 is a PROBAST evaluation summary table with both methodological domains and key features of the study, Lin et al., 2023 [37].
This study investigates whether serum endocan, a marker of endothelial dysfunction, predicts all-cause mortality in a cohort of 103 hemodialysis patients over a 36-month follow-up. The key strength is the use of a hard clinical endpoint (death) and the identification of a statistically significant association (higher endocan → higher mortality). However, several methodological problems reduce confidence in the robustness and transportability of the findings:
Timing and follow-up: While baseline measurement and 36-month follow-up are reported, the study does not fully elucidate whether predictor measurement preceded the outcome in all cases, how censoring was handled, or how missing follow-up impacted results, raising the risk of non-uniform timing windows.
Small sample size & setting: With only 103 patients (26 events) and data from a single center, results may be prone to overfitting or idiosyncratic to this environment. External generalizability is modest.
Limited model development/validation: The study essentially reports a biomarker-mortality association rather than a fully developed predictive model with performance metrics (calibration, discrimination in an independent cohort). As such, the “eye-catching” association may not hold in other populations.
Missing or variable data handling: The report does not deeply describe how missing biomarker/clinical values, variability in hemodialysis practices, or shifting outcome definitions (e.g., timing of death, cause of death) were addressed; each is a potential source of bias.
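The Kaplan–Meier analysis used in the study follows the standard product-limit construction, which a short sketch makes explicit (toy survival times, not the cohort's data):

```python
def kaplan_meier(times, events):
    """Product-limit estimator. events[i] is 1 for death, 0 for censoring
    at times[i]. Returns (t, S(t)) at each time with at least one death;
    subjects censored at t are conventionally still at risk at t."""
    data = sorted(zip(times, events))
    at_risk, surv, curve = len(data), 1.0, []
    for t in sorted(set(times)):
        group = [e for tt, e in data if tt == t]
        deaths = sum(group)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= len(group)
    return curve

# Toy data: deaths at months 6, 6, 7; censoring at months 6 and 10.
curve = kaplan_meier([6, 6, 6, 7, 10], [1, 1, 0, 1, 0])
```

Comparing curves between groups split at the biomarker median, as the study does, requires a log-rank test on top of this estimator; the estimator itself carries no covariate adjustment.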
Table A49.
PROBAST assessment of the study, Tran et al., 2024 [38].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | External validation dataset of 527 outpatients with stage 4 or 5 CKD (non-dialysis) from a French regional cohort; 91 of 527 died within 2 years. The validation cohort differed from the development cohort (younger age, lower death rate). | Unclear, While external, differences in cohort characteristics suggest possible selection or spectrum bias. | Unclear, Setting (French outpatient CKD stage 4–5) may limit generalizability to other countries, dialysis cohorts, or more heterogeneous CKD populations. |
| Index Model | A previously developed ML-based 2-year all-cause mortality prediction tool (7 variables: age, ESA use, cardiovascular history, smoking status, 25-OH vitamin D, PTH, ferritin) from an earlier cohort. | Unclear, The validation uses an existing model, but details of model adaptation, calibration and predictor measurement in the new cohort are limited. | Unclear, Model developed in one dataset may perform differently in new populations with different baseline risks and features. |
| Comparator Model | The study does not present a head-to-head comparison against a standard clinical risk score or alternative predictive model in the validation cohort; rather, it applies the existing tool alone. | Unclear, Absence of a comparator limits assessment of relative performance. | Unclear, Without benchmark models, it is hard to evaluate added value in this setting. |
| Outcome | All-cause mortality at 2 years follow-up. Hard clinical endpoint clearly defined. In the validation dataset, 91/527 died within 2 years. | Low, Outcome is well-defined and appropriate. | Low, Applicability of outcome is good for clinical mortality prediction in CKD stage 4–5. |
| Timing | Predictor variables measured at baseline; follow-up period = 2 years. However, there is limited information on timing of predictor ascertainment relative to baseline, censoring and missing follow-up, and whether temporal effects or secular changes were accounted for. | High, The potential for bias is elevated because of insufficient reporting of timing, missing follow-up handling, and potential changes in care over time. | Moderate concern, The 2-year horizon is clinically relevant, but differences in cohorts and treatment eras may affect transportability. |
| Setting | Outpatient nephrology settings in France (stage 4–5 CKD non-dialysis). The model was externally validated in this setting. | Unclear, Single region may limit heterogeneity; applicability to broader settings uncertain. | Moderate concern, The setting is relevant for outpatient CKD patients, but generalization to other geographies, healthcare systems or dialysis populations is uncertain. |
| Intended Use of Predictive Model | To predict 2-year all-cause mortality in stage 4–5 CKD patients to support risk stratification and potentially inform monitoring or early intervention. | - | Moderate concern, Intended use is compatible with the validation setting, but the lack of broad transportability reduces practical applicability. |
| Statistical/Modeling Analysis | Validation reported AUC-ROC = 0.72, accuracy = 63.6%, sensitivity = 72.5%, specificity = 61.7%. The model showed significant separation of survival curves (p < 0.001). However, calibration metrics, handling of missing data, sample size adequacy for external validation, and robustness of performance across subgroups are not fully detailed. | High, Key modeling aspects (calibration, missing data, model updating) are inadequately reported, increasing bias risk. | Moderate concern, The model shows reasonable discrimination, but limited detail and moderate specificity raise concerns about real-world performance. |
| Overall Judgment | While this is a genuine external validation of a predictive model (which is a strength), the limitations around timing, missing data/reporting, sample representativeness, and limited reporting of calibration mean the RoB is high. Applicability is moderate: the tool may work in similar outpatient CKD stage 4–5 populations in France but transfer to other settings is uncertain. | High RoB | Moderate applicability concern |
Table A49 is a PROBAST evaluation summary table that includes both methodological domains and key features of the study, Tran et al., 2024 [38].
This study by Tran and colleagues conducts an external validation of a 2-year all-cause mortality prediction tool developed via ML in patients with stage 4–5 CKD [38]. The validation cohort of 527 patients, with 91 deaths in 2 years, demonstrates modest performance (AUC 0.72); conducting an external validation at all is commendable. However, several methodological issues warrant caution:
Measurement or selection bias: the validation cohort differed significantly from the development cohort (younger age, lower event rate), which may affect calibration or transportability.
Timing and missing-data concerns: the study provides limited detail on how baseline predictor data were timed in relation to baseline, how missing values were handled, or how secular changes in care were accounted for. This opens the possibility of bias or reduced reliability.
Model transportability: though validated externally, it is still within a French outpatient nephrology context; applicability to other healthcare systems, patient populations (e.g., dialysis, other countries), or treatment eras is untested.
Limited statistical reporting: Calibration metrics (e.g., calibration slope/intercept), decision-curve analysis, or subgroup performance are not fully reported, reducing confidence in clinical implementation despite the “eye-catching” AUC of 0.72.
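Decision-curve analysis, noted here as missing, reduces to computing net benefit across threshold probabilities. A minimal sketch of the Vickers–Elkin net-benefit formula (hypothetical predictions, not the validation cohort):

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients flagged at decision threshold p_t:
    NB = TP/N - FP/N * p_t / (1 - p_t).
    The second term weights false positives by the odds at the threshold,
    i.e., by how much harm an unnecessary intervention is judged to cause."""
    n = len(y_true)
    flagged = [(y, p) for y, p in zip(y_true, y_prob) if p >= threshold]
    tp = sum(1 for y, _ in flagged if y == 1)
    fp = len(flagged) - tp
    return tp / n - fp / n * threshold / (1 - threshold)

y_true = [1, 1, 0, 0]
y_prob = [0.8, 0.4, 0.6, 0.1]
nb = net_benefit(y_true, y_prob, threshold=0.2)
```

Plotting net benefit over a range of thresholds against the "treat all" and "treat none" strategies yields the decision curve that would complement the reported AUC.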
Table A50.
PROBAST assessment for the study, Kim et al., 2020 [39].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | Hemodialysis patients from a prospective multi-center Korean ESRD cohort (“K-cohort”); 354 of 452 eligible patients had plasma endocan measured and were followed for ~34.6 months. Selection of those with available endocan data may introduce selection bias; the cohort is limited to Korean centers. | Unclear | Moderate |
| Index Test (Biomarker: Plasma Endocan) | Baseline plasma endocan measured once (EDTA tube, fasting, mid-week dialysis). The biomarker is under investigation as predictor of cardiovascular events. Single measurement only; no repeated measures or longitudinal biomarker changes assessed. | Unclear | Moderate |
| Comparator/Other Predictors | The study uses multivariable Cox regression adjusting for prior cardiovascular events, albumin, BMI, TG and other covariates; but no formal standard risk-score model benchmark is reported. | Unclear | Moderate |
| Outcome | Composite cardiovascular event (acute coronary syndrome, stable angina requiring PCI/CABG, heart failure, ventricular arrhythmia, cardiac arrest, sudden death) and non-cardiac death. Outcome clearly defined, measured over follow-up. | Low | Low |
| Timing/Flow | Baseline biomarker measured, then follow-up (mean ~34.56 months). However: the study excludes patients missing endocan data, does not clearly describe censoring, handling of missing follow-up, or whether timing of predictor measurement relative to baseline events might introduce bias. | High | Moderate |
| Setting | Multi-center (six hospitals in South Korea) ESRD patients on hemodialysis, three times/week, >3 months vintage. While multi-center, all in one country/region; ESRD hemodialysis population specific. | Unclear | Moderate |
| Statistical/Modeling Analysis | Kaplan–Meier survival, determination of optimal cut-off for endocan via MaxStat, univariate and multivariable Cox regression (HR ~ 1.949 for high vs. low endocan, 95% CI 1.144–3.319, p = 0.014) for cardiovascular events. But limited reporting of calibration, no external validation, no missing data imputation described, risk of over-fitting given moderate sample size/events. | High | Moderate |
| Overall Judgment | The study provides a suggestive biomarker association of plasma endocan with cardiovascular events in hemodialysis patients, but limited by single measurement, missing data may bias results, modest sample size, no external validation, and uncertain handling of timing/censoring. | High RoB | Moderate applicability concern |
Table A50 is a PROBAST evaluation summary table that includes both methodological domains and key features of the study, Kim et al., 2020 [39].
This study explores the prognostic value of plasma endocan, a marker of endothelial dysfunction, in predicting cardiovascular events in ESRD patients on maintenance hemodialysis. Strengths include a well-defined cohort, clear outcome definition, and finding a significant association (higher endocan → higher risk of cardiovascular events).
However, from a prediction-modeling evaluation (via PROBAST lens), several key limitations emerge:
Timing/flow leakage risk: Although baseline biomarker measurement and ~34-month follow-up are reported, the exclusion of ~100 eligible patients (354/452), unclear censoring, and single biomarker measurement raise concerns of selection bias and missingness.
Missing data/measurement inconsistency: The biomarker was measured once; no repeated measures to capture dynamic risk changes. Handling of missing covariate data is not detailed.
Outcome label consistency and generalizability: The composite cardiovascular events definition is broad (includes arrhythmia, sudden death, heart failure), which may vary across settings; the cohort consists of Korean hemodialysis patients, which may limit transportability.
Statistical modeling limitations: Although survival analysis was used, the study lacks external validation, calibration metrics, and comprehensive missing data/imputation strategies. The cut-off determination via MaxStat is data-driven and may overestimate effect size (cut-off bias).
Follow-up adequacy and competing risks: Follow-up (~3 years) may be adequate, but there is no comment on competing risks (e.g., non-cardiovascular deaths) or on censoring due to transplantation, modality shift, or loss to follow-up.
Thus, while the “eye-catching” hazard ratio (~1.95) is encouraging, given the high RoB and moderate applicability concerns, these findings should be interpreted cautiously. The results may not travel well to other hemodialysis populations, regions, or settings without further validation.
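The cut-off bias attributed to MaxStat can be demonstrated directly: scanning every candidate threshold on pure noise still yields a split with a larger apparent group difference than a prespecified (median) cut-off. A toy simulation (synthetic data, not the study's):

```python
import random

rng = random.Random(42)
biomarker = [rng.gauss(0, 1) for _ in range(60)]
outcome = [rng.random() < 0.3 for _ in range(60)]  # pure noise: no true link

def rate_diff(cut):
    """Absolute difference in event rate between high/low biomarker groups."""
    hi = [o for b, o in zip(biomarker, outcome) if b > cut]
    lo = [o for b, o in zip(biomarker, outcome) if b <= cut]
    if not hi or not lo:
        return 0.0
    return abs(sum(hi) / len(hi) - sum(lo) / len(lo))

# MaxStat-style search: try every observed value as a candidate cut-off.
best_cut = max(biomarker, key=rate_diff)
median_cut = sorted(biomarker)[len(biomarker) // 2]
```

Because the threshold is chosen to maximize separation, the apparent effect at `best_cut` is optimistic even when no association exists; permutation-adjusted p-values or external validation are needed to correct for this.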
Table A51.
PROBAST assessment for the study, Zhu et al., 2024 [40].
| Domain | Description/Assessment | RoB | Applicability Concerns |
|---|---|---|---|
| Population/Participants | The study used electronic medical records from a single center (Chinese PLA General Hospital) from 2015 to 2020, enrolling 8894 CKD patients (incident or ongoing) and followed them for composite CVD events. Inclusion criteria, exclusion criteria, and representativeness beyond that center are not fully elaborated. | Unclear risk | Moderate concern, single-center data may not generalize to other populations or health systems |
| Index Model | They selected predictors via LASSO regression, then developed seven ML classification algorithms, with XGBoost being the top performer. They used SHAP for interpretability. However, descriptions of how missing data were handled, how predictor timing was controlled (to avoid leakage), and feature engineering are limited. | High risk | Moderate concern, model may be overfit to center-specific patterns |
| Comparator Model | They compare across ML algorithms (e.g., XGBoost vs. RF, SVM, etc.), and contrast with baseline logistic regression/simpler models. But no strong external benchmark or well-established clinical risk score is used for comparison. | Unclear risk | Moderate concern |
| Outcome | The outcome is a composite CVD event (broad definition) including coronary, cerebrovascular, peripheral vascular disease, heart failure, and death. The composite definition is broad, which may dilute specificity, and the inclusion of “deaths from all causes (cardiovascular, non-cardiovascular, unknown)” further complicates interpretation. | Unclear risk | Moderate concern |
| Timing/Flow | Predictor data from 2015–2020; models evaluated on held-out (test) set. However, the paper does not strongly detail temporal separation (i.e., using earlier data to predict future), risk of data leakage (features derived partly after outcome), or handling of censoring/time-to-event aspects. It also does not clearly describe how patients lost to follow-up or missing events were managed. | High risk | Moderate concern |
| Setting | Clinical CKD care setting in China (single hospital). Retrospective EMR context. | Unclear risk | Moderate concern |
| Intended Use of Predictive Model | To help clinicians identify CKD patients at high risk for cardiovascular disease, enabling early interventions or tailored monitoring. | - | Moderate concern |
| Statistical/Modeling Analysis | They evaluated performance using AUC, accuracy, sensitivity, specificity, F1-score on test set. The top model (XGBoost) had AUC ~ 0.89 in the test set. They also used SHAP to interpret feature importance. However, the analysis lacks detailed calibration metrics, external validation, detailed missing-data imputation strategies, robustness checks (e.g., sensitivity analyses), and explicit statements about prevention of overfitting or leakage. | High risk | Moderate concern |
| Overall Judgment | The study presents a promising approach with strong discrimination metrics and interpretability efforts, but methodological reporting gaps (temporal leakage risk, missing data handling, lack of external validation) undermine reliability. | High RoB | Moderate applicability concern |
Table A51 is a PROBAST evaluation summary table that includes both methodological domains and key features of the study, Zhu et al., 2024 [40].
Zhu et al. (2024) [40] developed an ML-based CVD risk prediction model in a cohort of 8894 CKD patients from a single Chinese center. The XGBoost model delivered strong discrimination (AUC ≈ 0.89 in test data) and was made interpretable via SHAP.
However, several common pitfalls in predictive modeling are apparent:
Leakage risk/sloppy timing windows: Without strong temporal separation (i.e., training on earlier data, testing on truly future data), there is a chance that features partly reflect future events or correlate with outcomes in unintended ways.
Undefined handling of missingness: The paper does not clearly explain how missing predictor values were handled (imputation, exclusion), which can bias model performance.
Outcome label drift across sites: The composite definition of CVD is broad and may not map cleanly to other cohorts; what qualifies as a CVD event may vary over time or by hospital coding practices.
Follow-up adequacy: Although the test sets were drawn from the same center, there is limited discussion of censoring, loss to follow-up, or how long predictive horizons are clinically meaningful.
Thus, while the “eye-catching number” of AUC ~ 0.89 suggests very high performance, it must be taken with caution. The lack of external validation and incomplete methodological transparency mean the metric may not travel well to new patient populations or settings. In assessing models in this space, higher-quality evidence, i.e., multi-center external validations, clear reporting of missing-data strategies, rigorous temporal splitting, and calibration metrics, should carry greater weight than a single-center result, no matter how impressive the discrimination.
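SHAP requires access to the fitted model, but the underlying idea of model-agnostic attribution can be illustrated with the simpler permutation importance: shuffle one feature and measure the resulting drop in performance. A minimal sketch with a toy scoring rule (all names and data are illustrative, not from the study):

```python
import random

def accuracy(model, X, y):
    """Fraction of samples where the model's prediction matches the label."""
    return sum(model(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, seed=0):
    """Importance of one feature = baseline score minus the score after
    shuffling that column, which breaks its link with the outcome."""
    base = accuracy(model, X, y)
    col = [row[feature] for row in X]
    random.Random(seed).shuffle(col)
    X_perm = [list(row) for row in X]
    for row, v in zip(X_perm, col):
        row[feature] = v
    return base - accuracy(model, X_perm, y)

# Toy "model" that uses only feature 0; feature 1 is ignored.
model = lambda x: int(x[0] > 0.5)
X = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.1, 0.2]]
y = [1, 1, 0, 0]
imp0 = permutation_importance(model, X, y, feature=0)
imp1 = permutation_importance(model, X, y, feature=1)
```

An ignored feature always scores zero importance under this scheme; like SHAP, however, the attributions describe the fitted model, not causal effects in the population.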