Multi-Armed Bandit Optimization for Explainable AI Models in Chronic Kidney Disease Risk Evaluation

Huang, Jianbo; Li, Long; Chen, Jia

doi:10.3390/sym17111808

by Jianbo Huang ¹, Long Li ²,³ and Jia Chen ¹,*

¹ School of Computer Application, Guilin University of Technology, Guilin 541006, China
² Key Laboratory of Equipment Data Security and Guarantee Technology, Ministry of Education, Guilin University of Electronic Technology, Guilin 541004, China
³ Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
* Author to whom correspondence should be addressed.

Symmetry 2025, 17(11), 1808; https://doi.org/10.3390/sym17111808 (registering DOI)

Submission received: 8 September 2025 / Revised: 30 September 2025 / Accepted: 17 October 2025 / Published: 27 October 2025

(This article belongs to the Special Issue Simulation and Modelling in Natural Sciences, Biomedicine and Engineering III)

Abstract

Chronic kidney disease (CKD) affects over 850 million people globally and represents a critical public health issue, yet existing risk assessment methodologies inadequately address the complexity of disease progression trajectories. Traditional machine learning approaches face critical limitations, including inefficient hyperparameter selection and lack of clinical transparency, that hinder their deployment in healthcare settings. This study introduces a computational framework that integrates adaptive Multi-Armed Bandit (MAB) strategies with BorderlineSMOTE sampling to improve CKD risk assessment. The proposed methodology uses XGBoost within an ensemble learning paradigm, tuned by an Upper Confidence Bound (UCB) exploration strategy, and couples it with an interpretability system built on SHAP and LIME to ensure model transparency. To address algorithmic interpretability while maintaining clinical utility, a four-level risk categorization framework was developed, employing cross-validated stratification methods and balanced performance evaluation metrics, thereby ensuring fair predictive accuracy across diverse patient populations and minimizing bias toward dominant risk categories. On clinical datasets, we performed a comparative analysis against sixteen established algorithms using paired statistical testing with Bonferroni correction. The MAB-optimized framework achieved an accuracy of 91.8%, an F1-score of 91.0%, and a ROC-AUC of 97.8%, outperforming the evaluated reference algorithms (p < 0.001). The optimized framework also delivered nearly ten-fold computational efficiency gains relative to conventional grid search while preserving robust classification performance. Feature importance analysis identified albumin-to-creatinine ratio, eGFR measurements, and CKD staging as dominant prognostic factors, in concordance with established clinical nephrology practice. This research addresses three core limitations in healthcare artificial intelligence: the computational cost of optimization, model interpretability, and consistent performance across heterogeneous clinical populations, offering a practical solution for improved CKD risk stratification in clinical practice.

Keywords:

chronic kidney disease; adaptive bandit optimization; interpretable artificial intelligence; risk classification; SHAP; healthcare decision support

1. Introduction

Chronic kidney disease (CKD) is a critical worldwide health concern, affecting nearly 850 million individuals and ranking as the 11th leading cause of death globally [1]. The gradual decline of renal function demands timely identification and systematic risk evaluation to prevent permanent sequelae, including cardiovascular disease, mineral bone disorders, and end-stage renal disease [2,3]. The varied clinical manifestations and frequently silent initial phases pose significant diagnostic difficulties for physicians, especially in settings with limited healthcare resources and limited nephrology expertise [4].

Machine learning (ML) approaches have emerged as promising solutions to these diagnostic challenges [5,6,7,8,9]. Recent studies demonstrate superior predictive performance of ensemble methods, particularly gradient boosting algorithms, for complex medical classification tasks [10,11]. XGBoost, LightGBM, and Random Forest show promising results in nephrology applications, frequently outperforming traditional statistical models in discrimination ability and calibration [12,13]. These ensemble methods excel with medical data due to their capacity for handling mixed data types, missing values, and complex feature interactions inherent in clinical datasets [14].

Despite these advances, current ML-based CKD risk assessment approaches face critical limitations. Machine learning model performance depends heavily on hyperparameter configuration, yet most existing studies employ manual tuning or simplistic grid search methods that are computationally expensive and frequently fail to identify optimal parameter combinations [15]. Traditional hyperparameter optimization techniques suffer from the curse of dimensionality and require prohibitively long computation times [16,17]. Recent comparative studies show that hyperparameter configuration alone can result in performance variations of up to 12–15% in medical classification tasks [18].

Broader healthcare implementation of machine learning methodologies remains constrained by their opaque characteristics, offering insufficient transparency in predictive mechanisms [19,20]. This interpretability deficit is particularly problematic in medical applications where clinicians require understanding of factors that contribute most significantly to patient risk profiles. Recent regulatory guidelines, including the FDA’s framework for AI/ML-based medical devices, increasingly emphasize the importance of explainable AI approaches that provide transparent and interpretable predictions [21,22].

SHAP (SHapley Additive exPlanations) has emerged as a leading framework for post hoc explanations of machine learning model predictions [23], providing mathematically rigorous explanations based on cooperative game theory with desirable properties such as efficiency, symmetry, and additivity [24,25]. However, SHAP explanation effectiveness is inherently linked to underlying model performance—poorly optimized models may produce misleading explanations that adversely affect clinical decision-making [26].

Multi-armed bandit (MAB) algorithms represent a novel paradigm for tackling sequential decision-making problems under uncertainty [27]. Unlike traditional optimization approaches that treat hyperparameter selection as a static problem, MAB frameworks adaptively balance exploration of new parameter configurations with exploitation of promising regions in the parameter space [28]. The theoretical foundations of MAB algorithms, particularly Upper Confidence Bound (UCB) strategies, provide rigorous guarantees on convergence rates and regret bounds [29,30]. Despite their demonstrated success in domains such as online advertising and recommendation systems, MAB approaches have received limited attention in medical machine learning applications [31]. The medical domain presents unique challenges including class imbalance, missing data patterns, and complex correlation structures that create challenging optimization landscapes where traditional methods may converge to suboptimal solutions [32,33,34,35].

Current literature reveals significant methodological gaps in intelligent hyperparameter optimization for medical classification applications. Existing research typically treats hyperparameter tuning and class imbalance handling as separate processes, resulting in suboptimal workflows. CKD risk evaluation faces two distinctive obstacles: pronounced class imbalance among risk strata, and strict requirements for both predictive precision and algorithmic transparency. These call for targeted methodological frameworks that existing approaches do not adequately provide.

Symmetry principles offer a guiding framework for developing robust and equitable machine learning systems in medical applications. In this work, symmetry manifests through balanced exploration-exploitation tradeoffs in hyperparameter optimization, equitable performance guarantees across all risk categories via macro-averaged metrics, and complementary global-local perspectives in model interpretability. This multi-faceted symmetry ensures fair treatment of all patient subgroups and transparent algorithmic decision-making.

Confronting these constraints, we present an innovative computational architecture that unifies adaptive Multi-Armed Bandit (MAB) strategies with BorderlineSMOTE techniques to improve CKD risk categorization. The framework employs Upper Confidence Bound (UCB) strategy to efficiently explore hyperparameter spaces while incorporating BorderlineSMOTE’s conservative balancing strategy to address class imbalance without over-synthesis artifacts, with comprehensive SHAP and LIME interpretability analysis ensuring clinical transparency.

This research delivers four key innovations:

A unified adaptive bandit approach that merges XGBoost with Upper Confidence Bound exploration for intelligent hyperparameter search, achieving 0.9-fold to 9.7-fold time efficiency improvement over four hyperparameter optimization methods;
Advanced imbalance handling through BorderlineSMOTE with conservative balancing strategy preventing over-synthesis while preserving original data distribution characteristics;
Rigorous statistical validation through extensive empirical evaluation across 30 repeated experiments against 16 baseline algorithms with Bonferroni-corrected significance testing, demonstrating consistent superiority (p < 0.001) across all performance metrics;
Comprehensive explainability integration through dual SHAP-LIME framework resolving interpretability deficits of ensemble architectures while preserving clinical utility.

2. Related Works

Machine learning approaches have revolutionized chronic kidney disease research, enabling advanced detection and risk stratification through computational methods that uncover complex clinical patterns invisible to conventional diagnostic techniques. However, the rapid proliferation of methodologies has created fragmentation, with limited consensus on evaluation standards and concerning patterns of performance inflation.

Initial computational studies established foundational benchmarks. Comprehensive assessment of multiple methodologies—encompassing logistic regression, decision trees, random forest, KNN, and SVM—delivered valuable performance insights for CKD prediction [36], though feature correlation issues and parameter tuning challenges persisted unresolved. Telemonitoring capabilities were demonstrated employing ensemble tree methods and deep learning architectures for CKD classification and creatinine prediction using home-based monitoring data [37]. A composite classification approach incorporating sophisticated missing data handling, minority class augmentation, and variable selection methods attained 99.75% accuracy [38], though such exceptionally high performance raises methodological concerns regarding validation stringency.

XGBoost gained prominence for robust structured clinical data handling. An explainable framework utilizing XGBoost with SHAP and LIME interpretation tools achieved 93.29% accuracy and 0.9689 AUC [39]. Six machine learning models were evaluated, with XGBoost demonstrating superior performance while integrating SHAP and PDP for interpretability in a novel explainable GUI tool [40]. While these studies established XGBoost as a strong baseline, most employed minimally tuned hyperparameters, potentially leaving performance gains unrealized.

Temporal modeling benefited from sophisticated deep learning architectures. A deep ensemble framework combining CNN, LSTM, and BLSTM architectures predicted CKD occurrence 6–12 months in advance, achieving up to 98% accuracy [41]. XGBoost-based prediction models for cardiovascular disease risk in CKD patients achieved AUC = 0.89 [42], while various algorithms were applied for chronic kidney disease stage classification [43]. An end-to-end framework integrating multiple machine learning models and explainable AI techniques was created [44], though extensive hyperparameter tuning on small datasets raised overfitting concerns.

Hyperparameter optimization approaches evolved beyond traditional methods, yet most studies reported final performance without documenting optimization costs or computational requirements. Recursive Feature Elimination and GridSearchCV enhanced CKD detection accuracy [45], though small dataset usage limited generalizability. Bayesian optimization emerged as a principled alternative, with the BO-FTT model achieving 91.81% accuracy for renal anemia prediction [46], though lacking detailed convergence analysis.

Evolutionary and metaheuristic approaches demonstrate advantages in complex parameter spaces, but often at substantial computational expense that remains undocumented. Genetic Algorithm integration for feature selection with SHAP interpretability boosted decision tree accuracy from 95.83% to 97.50% [47]. Grey Wolf Optimizer frameworks for hyperparameter tuning combined with SHAP interpretability were proposed [48]. OptiNet-CKD using population-based optimization achieved perfect performance metrics [49], ensemble deep learning with Eurygasters Optimization Algorithm and Shuffled Frog Leap Algorithm was investigated [50], and optimized ensemble frameworks across LightGBM, CatBoost, and stacking ensembles achieved 99.5% accuracy [51]. This pattern of exceptionally high reported accuracies (95–100%) suggests potential systematic issues with evaluation protocols or publication bias. Convolutional neural networks—ResNet50, MobileNet, and InceptionV3—were investigated for CKD classification from medical imaging, with ResNet50 achieving 98% accuracy [52].

SHAP established itself as the leading medical model interpretation framework based on cooperative game theory. Ensemble learning approaches enhanced by Explainable AI techniques for model interpretation and biomarker identification were proposed [53]. AI-driven predictive analytics frameworks combining ensemble learning and explainable AI for early CKD prognosis achieved 98% fidelity scores with XGBoost [54]. Hybrid approaches combining ant colony optimization for feature selection with SHAP and LIME transparency achieved 97.70% accuracy and 99.55% AUC [55], and hybrid graph-based collaborative filtering for personalized drug recommendation, integrating SHAP and LIME for interpretability, achieved up to 88% precision [56]. Explainable models using decision tree and random forest for CKD progression prediction achieved strong internal validation results (AUC up to 0.98) [57], while explainable systems for CKD prediction in high-risk cardiovascular patients achieved 88.2% sensitivity with Random Forest [58]. However, most interpretability analyses merely confirmed established clinical knowledge rather than revealing novel insights, and few studies report confidence intervals or stability analysis across multiple runs.

Despite these advances, critical integration gaps persist between advanced optimization techniques and explainable AI. Current hyperparameter optimization methods predominantly derive from general machine learning without adequate consideration of healthcare-specific requirements. The relationship between model optimization quality and explanation reliability represents a fundamental yet understudied aspect, potentially leading to clinical decisions based on explanations derived from suboptimal models. Our work addresses these gaps by introducing a novel framework that integrates multi-armed bandit optimization with explainable AI specifically for medical applications, demonstrating systematic hyperparameter exploration with documented computational efficiency and rigorous comparative evaluation for chronic kidney disease risk assessment.

3. Preliminary

3.1. Data Overview

Table 1 presents a summary of the dataset characteristics. This investigation employed a retrospective design using clinical records from 1150 patients with chronic kidney disease from seven medical institutions in Shanghai, China [59]. The dataset includes comprehensive clinical information encompassing patient demographics, medical histories, laboratory findings, and disease classification parameters, with all personal identifiers anonymized to ensure privacy protection.

Comprehensive demographic data and institutional identifiers were documented across all participating centers. Patient medical histories encompassed hereditary nephropathy, familial chronic nephritis, transplantation records, and biopsy documentation. Genetic components constituted approximately 10–15% of CKD cases, with hereditary patterns significantly influencing disease progression and treatment outcomes. Key comorbidities included hypertension, diabetes mellitus, hyperuricemia, and urological structural abnormalities.

Early disease phases (stages 1–2) maintain preserved function with minimal symptoms, constituting crucial intervention windows for slowing progression. Stage 3 represents a critical transition where compensatory mechanisms deteriorate, producing metabolic complications including anemia, bone disorders, and increased cardiovascular risk. Advanced stages (4–5) manifest severe impairment with uremic complications, electrolyte disturbances, and multi-organ toxin effects, requiring renal replacement therapy at stage 5. A four-tier risk stratification system (low, moderate, high, very high risk) integrates both clinical and biochemical indicators to assess the likelihood of disease advancement and inform treatment strategies. This classification recognizes that patients within identical stages may demonstrate markedly different outcomes based on etiology, rate of decline, extent of proteinuria, and comorbidity load. Albumin-creatinine ratio (ACR) functions as a pivotal biomarker, with proteinuria exceeding 300 mg/g signifying substantial glomerular injury and independently forecasting disease advancement and cardiovascular complications.

3.2. Preliminary Data Investigation

Systematic dataset examination constitutes a fundamental step in CKD investigation, enabling discovery of latent clinical patterns, verification of established staging frameworks, and recognition of critical prognostic associations necessary for constructing reliable predictive algorithms.

Figure 1 presents a comprehensive correlation analysis of clinical parameters and kidney function biomarkers using Pearson correlation coefficients. The mixed visualization employs pie charts in the lower triangle to represent correlation magnitude through sector angles, while the upper triangle displays bubble plots with numerical correlation values. Strong correlations (|r| > 0.7; strong positive values are shown in dark red) are observed among kidney function markers: eGFR demonstrates robust negative correlations with serum creatinine (r = −0.72), CKD stage progression (r = −0.91), and standardized measures (eGFR_stan, r = −0.88), reflecting the fundamental inverse relationship between filtration capacity and disease severity. Notable clustering patterns emerge among urinary biomarkers, with urinary anatomical structure abnormalities (UAS) showing moderate correlations with urine protein index (UP_index, r = 0.31) and hyperuricemia (r = 0.36), consistent with shared pathophysiological mechanisms. Clinical comorbidities including hypertension (HBP) and diabetes demonstrate expected associations with progression markers, while the albumin-creatinine ratio (ACR) exhibits strong correlations with proteinuria measures, reinforcing its role as a critical biomarker for glomerular damage assessment. Demographic factors (gender, genetic background, family history) display minimal correlations with most biomarkers (|r| < 0.1), suggesting disease progression patterns are largely independent of these baseline characteristics across this diverse clinical cohort.

Figure 2 demonstrates distinct biomarker patterns through parallel coordinate analysis. (a) Parallel Coordinates by CKD Stage displays standardized biomarker trajectories across CKD progression, where each colored line represents individual patients grouped by disease stage. Advanced stages (CKD4–5, orange-red lines) demonstrate characteristic elevation in serum creatinine (mean standardized values: CKD4 ≈ 1.2, CKD5 ≈ 1.8) and corresponding eGFR decline (CKD4 ≈ −1.0, CKD5 ≈ −1.5 standardized units) compared to early stages (CKD1–2, blue-green lines) maintaining near-normal profiles. Thick colored lines represent group means, while thin transparent lines show individual patient trajectories. (b) Risk Stratification Profiles illustrate biomarker distributions across four-level risk classification using mean values with standard deviation bands (shaded areas). Very-high-risk patients (pink) exhibit the most pronounced serum creatinine elevation (1.6 standardized units) and elevated urine RBC counts (0.5 standardized units), while low risk patients (green) maintain favorable biomarker profiles across all parameters.

Figure 3 shows clear stage-dependent biomarker distributions using kernel density estimation plots. (a) Serum creatinine distributions show a progressive rightward shift with disease advancement: CKD1 patients cluster tightly around 80 μmol/L (narrow peak), CKD2–3 exhibit moderate elevation with broader distributions (100–200 μmol/L range), and CKD4–5 patients display wide distributions extending beyond 600 μmol/L, reflecting significant functional heterogeneity in advanced disease. (b) eGFR distributions demonstrate the inverse relationship with stage: early-stage patients (CKD1–2) maintain concentrated distributions above 60 mL/min/1.73 m² (sharp peaks at ≈90 and ≈75, respectively), while advanced stages (CKD4–5) show compressed distributions below 30 mL/min/1.73 m², with CKD5 clustering near the dialysis threshold (<15 mL/min/1.73 m²). (c) Urine RBC count distributions remain predominantly low across all stages (majority <10 HP), though CKD2–3 exhibit extended right tails suggesting increased hematuria variability in moderate disease, while CKD4–5 maintain relatively low counts despite advanced nephropathy.

Figure 4 shows dimensionality reduction and clustering patterns in CKD patient data using principal component analysis. (a) PCA scatter plot by CKD stage captures 50.1% of total variance (PC1: 34.3%, PC2: 15.8%), with each colored point representing an individual patient grouped by disease severity. Advanced stages (CKD4–5, orange-red points) cluster predominantly toward positive PC1 values (>2), while early stages (CKD1–2, blue-green points) distribute across negative PC1 values (<0), demonstrating partial but meaningful stage separation in the reduced dimensional space. (b) PCA loading plot displays feature contributions as directional vectors, where vector length indicates contribution magnitude and direction shows correlation patterns. Genetic and family history emerge as dominant contributors (importance: 0.93 each), followed by serum creatinine (0.82) and hypertension (0.81), forming the primary disease axis. (c) K-means clustering analysis reveals four distinct patient phenotypes (Clusters 1–4) with clear centroid separation (black X markers), suggesting underlying patient subgroups beyond traditional staging. (d) Feature importance ranking quantifies each variable’s discriminative power in PCA space, where genetic factors dominate (0.93), while gender, proteinuria status, and urine RBC count show minimal discriminative power (0.29–0.31), indicating limited utility for patient stratification in this cohort.

Figure 5 illustrates CKD progression patterns and risk distribution across patient populations using multiple visualization approaches. (a) CKD progression funnel displays patient distribution by disease stage using trapezoid widths proportional to patient counts, revealing characteristic disease pyramid with early stages (CKD1–2) each comprising 34.7% of patients (399 patients each), moderate disease (CKD3) representing 19.9% (229 patients), while advanced stages show decreased prevalence (CKD4: 3.7%, 42 patients; CKD5: 7.0%, 81 patients). Accompanying bar charts quantify absolute patient numbers for each stage. (b) Risk stratification distribution combines absolute counts (bars) with cumulative percentage analysis (red line), demonstrating moderate risk as most prevalent category (529 patients, 46.0%), followed by low risk (268 patients, 23.3%), with cumulative analysis reaching 84.3% when combining low and moderate risk categories, while high-risk categories constitute the remaining 15.7%. (c) Risk distribution by eGFR groups reveals inverse relationship between kidney function and risk classification, where patients with severely reduced eGFR (<30 mL/min/1.73 m²) exhibit 90% very-high-risk classification, while those with preserved function (eGFR > 90) show predominantly moderate-to-low risk profiles (60% moderate, 25% low risk). (d) eGFR violin plots by CKD stage confirm stage-appropriate functional clustering with distinct median values and minimal overlap between adjacent stages, demonstrating precise staging accuracy with characteristic distributions for each disease severity level.

This exploratory investigation provided a solid groundwork through identification of critical predictive associations, validation of data integrity, and establishment of the dataset’s clinical reliability and consistency over time among collaborating centers.

3.3. Data Processing

Raw clinical data received comprehensive transformation procedures to ensure analytical quality for computational modeling tasks. Starting with 1150 patient cases across 23 clinical parameters, preliminary assessment identified quality concerns encompassing inconsistent categorical formats, varied measurement scales, and incomplete data entries.

Categorical variable inconsistencies required standardization protocols. Binary formatting was applied to hyperuricemia measurements, while UAS parameters underwent normalization through unified encoding schemes. URC value discrepancies across different measurement systems were resolved by converting all entries to standardized HP units.

$$\mathrm{URC}_{\mathrm{HP}} = \begin{cases} \dfrac{\mathrm{URC}_{\mathrm{num}}}{10}, & \text{if } \mathrm{URC}_{\mathrm{unit}} = \mu\mathrm{l} \\ \mathrm{URC}_{\mathrm{num}}, & \text{if } \mathrm{URC}_{\mathrm{unit}} = \mathrm{HP} \end{cases} \tag{1}$$
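As a minimal sketch of the unit conversion in Equation (1), the function below maps a urine RBC count to HP units. The function name, the accepted unit spellings (`μl`, `µl`, `ul`, `hp`), and the error raised for unknown units are illustrative assumptions, not details taken from the study.

```python
def urc_to_hp(value: float, unit: str) -> float:
    """Convert a urine RBC count to HP units, following Equation (1)."""
    unit = unit.strip().lower()
    # Assumption: per-microliter counts may be spelled with either mu glyph or "ul".
    if unit in ("μl", "µl", "ul"):
        return value / 10.0
    if unit == "hp":
        return value
    # Assumption: unknown units are rejected rather than guessed.
    raise ValueError(f"unknown URC unit: {unit!r}")
```

For example, a count of 50 reported per μl becomes 5.0 in HP units, while a count already in HP units is returned unchanged.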

Missing categorical data was addressed through mode-based substitution, while numerical gaps were filled using K-nearest neighbor imputation with $k = 5$:

$$\hat{x}_{ij} = \frac{1}{k} \sum_{l \in N_{k}(i)} x_{lj} \tag{2}$$

where $N_{k}(i)$ denotes the $k$ nearest neighbors of sample $i$. Distances were determined via the Euclidean distance measure:

$$d(i, l) = \sqrt{\sum_{m=1}^{p} (x_{im} - x_{lm})^{2}} \tag{3}$$
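The imputation rule in Equations (2) and (3) can be sketched in plain Python as follows. Two details are assumptions made for illustration, since the source does not specify them: distances are computed only over features observed in both rows, and only rows with an observed value for the target feature are eligible as neighbors. The function name `knn_impute` is hypothetical.

```python
import math

def knn_impute(rows, k=5):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for l, other in enumerate(rows):
                if l == i or other[j] is None:
                    continue
                # Assumption: use only features observed in both rows.
                shared = [m for m in range(len(row))
                          if row[m] is not None and other[m] is not None]
                if not shared:
                    continue
                # Euclidean distance, Equation (3), over the shared features.
                dist = math.sqrt(sum((row[m] - other[m]) ** 2 for m in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                # Neighbor average, Equation (2).
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

With rows `[[1.0, 10.0], [1.1, 12.0], [0.9, None], [5.0, 50.0]]` and `k=2`, the missing entry is filled from the two closest rows (values 10.0 and 12.0), and the distant fourth row does not contribute.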

Dichotomous variables were encoded as binary (1/0) values. Ordered categorical features preserved ranking relationships: UP_index (6-category scale) and ACR groupings. Outcome measures employed four-tier risk stratification (0–3) and five-level CKD staging (0–4). Positively skewed features were log-transformed:

$$\mathrm{URC}_{\mathrm{HP\_log}} = \begin{cases} \ln(\mathrm{URC}_{\mathrm{HP}}), & \text{if } \mathrm{URC}_{\mathrm{HP}} > 0 \\ \ln(0.01), & \text{if } \mathrm{URC}_{\mathrm{HP}} = 0 \end{cases} \tag{4}$$

Log transformation was applied to serum creatinine values:

$$\mathrm{Scr}_{\log} = \ln(\mathrm{Scr}) \tag{5}$$
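A short sketch of the two log transforms in Equations (4) and (5) is given below. The function names are hypothetical, and rejecting negative counts is an added assumption, because Equation (4) only defines the positive and zero cases.

```python
import math

def log_urc(urc_hp: float) -> float:
    """Equation (4): natural log of the URC count, with zeros mapped to ln(0.01)."""
    if urc_hp < 0:
        # Assumption: negative counts are treated as data errors.
        raise ValueError("URC count must be non-negative")
    return math.log(urc_hp) if urc_hp > 0 else math.log(0.01)

def log_scr(scr: float) -> float:
    """Equation (5): natural log of serum creatinine."""
    return math.log(scr)
```

Mapping zero counts to ln(0.01) keeps them finite while placing them below every positive observation on the log scale.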

Continuous measurements on varying scales underwent $z$-score standardization:

$$z = \frac{x - \mu}{\sigma} \tag{6}$$

The standardization parameters were computed as:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_{i} \tag{7}$$

$$\sigma = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_{i} - \mu)^{2}} \tag{8}$$
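Equations (6) to (8) correspond to standardization with the sample standard deviation (divisor n − 1). A minimal sketch using Python's standard library, with a hypothetical function name:

```python
import statistics

def zscore(values):
    """Standardize values using the sample mean and sample standard deviation."""
    mu = statistics.fmean(values)      # Equation (7)
    sigma = statistics.stdev(values)   # Equation (8), divides by n - 1
    return [(x - mu) / sigma for x in values]   # Equation (6)
```

Note that some libraries default to the population standard deviation (divisor n), which gives slightly different values on small samples.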

Discretized features were derived through $k$-means clustering, whose objective is to minimize the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_{i}} \lVert x - \mu_{i} \rVert^{2} \tag{9}$$

Cluster centroids were updated iteratively as follows:

$$\mu_{i} = \frac{1}{|C_{i}|} \sum_{x \in C_{i}} x \tag{10}$$

Cut-points were derived as the average (midpoint) of consecutive cluster centers, taken in ascending order:

$$\mathrm{cutpoint}_{j} = \frac{\mu_{j} + \mu_{j+1}}{2} \tag{11}$$

where $j = 1, 2, \dots, k - 1$. Each derived discrete category was assigned consecutive numeric labels from $0$ to $k - 1$, corresponding to progressively worsening clinical conditions. Our final analytical cohort comprised 988 participants characterized by 25 distinct variables. These included eight demographic and clinical history indicators, six initial laboratory test results, three logarithmically converted metrics, three standardized variables, three discretized features derived from clustering analysis, and two endpoint indicators (disease risk categorization and chronic kidney disease grading).

This framework delivered complete data coverage while maintaining 85.9% of initial sample volume, establishing optimal equilibrium between data integrity and population preservation.
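The clustering-based discretization of Equations (9) to (11) can be illustrated with a one-dimensional sketch. This is not the authors' implementation: the initialization at evenly spaced order statistics, the iteration cap, and all function names are assumptions chosen to keep the example deterministic and dependency-free.

```python
from bisect import bisect_right

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm in one dimension; returns cluster centers in ascending order."""
    ordered = sorted(values)
    # Assumption: initialize centers at evenly spaced order statistics.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in values:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        # Centroid update, Equation (10); empty clusters keep their old center.
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def cutpoints(centers):
    """Midpoints of consecutive sorted centers, Equation (11)."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]

def discretize(x, cuts):
    """Map a value to its ordinal label in 0 .. k-1."""
    return bisect_right(cuts, x)
```

For three well-separated groups around 1, 5, and 9, the cut-points land near 3 and 7, so values are labeled 0, 1, or 2 in increasing order of magnitude.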

4. Methodology

This research introduces a hybrid machine learning architecture that merges Multi-Armed Bandit (MAB) optimization with Extreme Gradient Boosting (XGBoost) for risk assessment in chronic kidney disease. The analytical pipeline encompasses six key phases: initial data preprocessing paired with intelligent label mapping, advanced feature engineering incorporating standardization and encoding procedures, BorderlineSMOTE-based resampling to address class imbalance through adaptive sampling, MAB-guided hyperparameter selection using Upper Confidence Bound (UCB) strategies, stratified 5-fold cross-validation for rigorous performance evaluation, and extensive model interpretability assessment via SHAP and LIME techniques. The optimization process treats hyperparameter combinations as “arms” in a multi-armed bandit problem, dynamically balancing exploration and exploitation across 85 iterations to maximize macro-averaged F1-score performance, tackling key obstacles in clinical data categorization such as imbalanced class distribution, diverse feature types, and streamlined hyperparameter tuning.
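Because the optimization target is the macro-averaged F1-score, it may help to see how that metric weights classes. The sketch below is a plain-Python illustration with a hypothetical function name, not the evaluation code used in the study. Each class contributes equally to the average regardless of how many patients it contains.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support or predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In an imbalanced toy case with six samples of class 0 and two of class 1, missing one of the two minority samples lowers the minority F1 to 2/3 and pulls the macro average down accordingly, even though seven of eight predictions are correct.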

Figure 6 illustrates the complete framework architecture, demonstrating the sequential flow from raw clinical data through preprocessing, class balancing, optimization, and validation stages, culminating in four-category risk classification output with comprehensive interpretability analysis for clinical decision support.

4.1. XGBoost

The XGBoost algorithm builds predictive models by sequentially adding decision trees (weak learners), where each new tree corrects the residual errors of the preceding ensemble through gradient-based optimization. For sample $i$ at iteration $t$, the prediction follows:

$$\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + f_{t}(x_{i}) \tag{12}$$

The optimization objective incorporates regularization terms to prevent overfitting, thereby striking a balance between achieving high predictive accuracy and maintaining model generalizability:

$$L^{(t)} = \sum_{i=1}^{n} l(y_{i}, \hat{y}_{i}^{(t)}) + \sum_{j=1}^{t} \Omega(f_{j}) \tag{13}$$
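For reference, the regularization term in the original XGBoost formulation penalizes the number of leaves and the squared leaf weights of each tree. The symbols $K$, $w_{m}$, $\gamma$, and $\lambda$ below are notation introduced here for illustration, and the source does not state which penalty terms were active in this study.

```latex
\Omega(f) = \gamma K + \frac{1}{2} \lambda \sum_{m=1}^{K} w_{m}^{2}
```

Here $K$ is the number of leaves in tree $f$ and $w_{m}$ is the weight of leaf $m$. The XGBoost library additionally supports an L1 penalty on leaf weights through its `reg_alpha` parameter, which appears in the search space of Algorithm 1.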

4.2. Multi-Armed Bandit Optimization

The Multi-Armed Bandit framework treats hyperparameter configurations as “arms” in a sequential decision problem, where each arm represents a unique parameter combination. The Upper Confidence Bound (UCB) strategy balances exploration of untested configurations against exploitation of high-performing ones through:

\mathrm{UCB}_{i}(t) = \bar{x}_{i} + c \sqrt{\frac{\ln t}{n_{i}}}

(14)

where $\bar{x}_{i}$ represents the empirical mean reward for arm $i$, $c$ is the confidence parameter, $t$ denotes the current iteration, and $n_{i}$ indicates the selection count for arm $i$.
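As a minimal sketch of Eq. (14), the score of each arm can be computed from its running mean reward and pull count. The function name `ucb_scores` and the example numbers are illustrative assumptions rather than the authors' implementation; arms that have never been pulled are given an infinite score so that they are tried first.

```python
from math import inf, log, sqrt

def ucb_scores(means, counts, t, c=1.4):
    """Upper Confidence Bound per arm: mean + c * sqrt(ln t / n_i)."""
    return [
        m + c * sqrt(log(t) / n) if n > 0 else inf  # unpulled arms go first
        for m, n in zip(means, counts)
    ]

# Hypothetical state at iteration 11: arm 0 pulled 10 times, arm 1 pulled once.
scores = ucb_scores(means=[0.90, 0.80], counts=[10, 1], t=11)
chosen = max(range(len(scores)), key=scores.__getitem__)
```

In this example the rarely pulled arm 1 is chosen despite its lower mean, because its exploration bonus is larger.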

The algorithm dynamically expands the arm set, introducing a newly sampled configuration every 20 iterations for as long as the number of arms remains below the maximum $A_{\max}$:

A_{t + 1} = A_{t} \cup \{a_{\mathrm{new}}\} \quad \text{if } t \bmod 20 = 0 \text{ and } |A_{t}| < A_{\max}

(15)
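The expansion rule in Eq. (15) can be traced with a few lines of code. This is a sketch under two assumptions that the text does not state explicitly: iterations are numbered from 1, and exactly one arm is added per expansion step. The function name `arm_count_schedule` is hypothetical.

```python
def arm_count_schedule(T=85, initial_arms=18, max_arms=35, period=20):
    """Number of arms available after each iteration t = 1..T under Eq. (15)."""
    n_arms, history = initial_arms, []
    for t in range(1, T + 1):
        if t % period == 0 and n_arms < max_arms:
            n_arms += 1  # one new configuration joins the arm set
        history.append(n_arms)
    return history

history = arm_count_schedule()
final_arm_count = history[-1]
```

Under these assumptions, four expansion steps fall within 85 iterations (t = 20, 40, 60, 80), so the arm set would grow from 18 to 22 and the limit of 35 acts only as an upper bound.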

Optimization quality is assessed through cumulative regret, which sums the difference between the expected reward of always selecting the best arm and the expected reward of the arms actually chosen by the algorithm:

R_{t} = \sum_{s = 1}^{t} (\mu^{*} - \mu_{a_{s}})

(16)

where $\mu^{*}$ represents the optimal arm's expected reward and $\mu_{a_{s}}$ denotes the expected reward of the arm selected at iteration $s$.
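A short sketch of Eq. (16) follows. Because the true optimum $\mu^{*}$ is unknown in practice, the code assumes the best value in the supplied sequence stands in for it unless one is passed explicitly; the function name and the reward values are illustrative only.

```python
from itertools import accumulate

def cumulative_regret(expected_rewards, mu_star=None):
    """Running sum of (mu* - mu_{a_s}) over the sequence of selected arms."""
    if mu_star is None:
        mu_star = max(expected_rewards)  # assumption: best observed value approximates mu*
    return list(accumulate(mu_star - mu for mu in expected_rewards))

# Hypothetical expected macro-F1 of the arm chosen at each of four iterations.
regret = cumulative_regret([0.88, 0.91, 0.90, 0.91])
```

The curve is non-decreasing by construction and flattens whenever the algorithm selects the best arm.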

The MAB optimizer employs the UCB strategy to balance exploration and exploitation during hyperparameter search. The algorithm initializes with 18 randomly sampled parameter configurations and expands the arm set during the search, up to a maximum of 35 arms. A total of 85 evaluations keeps the search computationally efficient for clinical deployment. This reinforcement learning approach identifies configurations that maximize the macro-averaged F1-score, as detailed in Algorithm 1.

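Algorithm 1 Multi-Armed Bandit XGBoost Hyperparameter Optimization
Input: X, y, T = 85, initial_arms = 18, max_arms = 35, c = 1.4
Initialization: MAB = MultiarmedBanditOptimizer(param_space, strategy = 'ucb')
import random
from math import exp, log, sqrt
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier

# Outer stratified 80/20 holdout (Section 5.1); arms are scored on the development split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

def sample_arm():
    # Draw one hyperparameter configuration (an "arm").
    return {
        "n_estimators": random.choice(range(300, 801, 25)),
        "max_depth": random.choice(range(5, 10)),
        "learning_rate": exp(random.uniform(log(0.01), log(0.05))),  # log-uniform draw
        "subsample": random.uniform(0.7, 0.9),
        "colsample_bytree": random.uniform(0.8, 1.0),
        "reg_alpha": random.uniform(0.1, 1.0),
        "reg_lambda": random.uniform(3.0, 8.0),
    }

def evaluate_arm(arm_params):
    # Reward = mean macro-F1 over 5-fold cross-validation (stratified for classifiers).
    model = XGBClassifier(**arm_params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="f1_macro").mean()

def select_arm_ucb(t):
    # Pull every untried arm once (this also avoids division by zero), then apply Eq. (14).
    for i, n in enumerate(arm_counts):
        if n == 0:
            return i
    ucb_values = [arm_means[i] + c * sqrt(log(t) / arm_counts[i]) for i in range(len(arms))]
    return int(np.argmax(ucb_values))

arms = [sample_arm() for _ in range(initial_arms)]
arm_means, arm_counts = [0.0] * len(arms), [0] * len(arms)
for t in range(1, T + 1):
    arm_idx = select_arm_ucb(t)
    reward = evaluate_arm(arms[arm_idx])
    # Incremental update of the pull count and the running mean reward.
    arm_counts[arm_idx] += 1
    arm_means[arm_idx] += (reward - arm_means[arm_idx]) / arm_counts[arm_idx]
    if t % 20 == 0 and len(arms) < max_arms:  # Eq. (15): periodic arm expansion
        arms.append(sample_arm())
        arm_means.append(0.0)
        arm_counts.append(0)
best_arm_idx = int(np.argmax(arm_means))
final_model = XGBClassifier(**arms[best_arm_idx]).fit(X_train, y_train)
predictions = final_model.predict(X_test)
Return: final_model, arms[best_arm_idx], evaluate_metrics(y_test, predictions)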

4.3. BorderlineSMOTE

BorderlineSMOTE specifically targets the challenge of imbalanced datasets by focusing synthetic sample generation on ambiguous regions where majority and minority classes meet, thereby strengthening the decision boundary. The algorithm identifies borderline minority samples where at least half of the $k$-nearest neighbors belong to the majority class:

\mathrm{BORDER} = \{x_{i} \in \mathrm{MIN} : \frac{|\mathrm{NN}_{k}(x_{i}) \cap \mathrm{MAJ}|}{k} \geq 0.5\}

(17)

where $\mathrm{MIN}$ and $\mathrm{MAJ}$ represent the minority and majority class sets, and $\mathrm{NN}_{k}(x_{i})$ denotes the $k$-nearest neighbors of sample $x_{i}$.
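The borderline criterion of Eq. (17) can be illustrated with a brute-force nearest-neighbor search. The sketch below follows the equation literally; the function name, the toy one-dimensional data, and the choice k = 3 are assumptions made for illustration. Note that the original Borderline-SMOTE formulation additionally treats minority points whose neighbors are all majority samples as noise, a refinement not expressed in Eq. (17) and not implemented here.

```python
from math import dist

def borderline_minority(points, labels, minority, k=3):
    """Indices of minority samples whose k nearest neighbours are >= 50% majority."""
    border = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Brute-force k nearest neighbours among all other samples.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        majority_share = sum(labels[j] != minority for j in neighbours) / k
        if majority_share >= 0.5:
            border.append(i)
    return border

# Toy data: five minority samples (label 1) and four majority samples (label 0).
points = [(0.0,), (0.1,), (0.2,), (0.9,), (1.0,), (1.1,), (1.2,), (2.0,), (2.1,)]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
border = borderline_minority(points, labels, minority=1)
```

Only the two minority samples adjacent to the majority cluster (indices 3 and 4) are flagged, so synthetic samples would be generated near the class boundary rather than deep inside the minority region.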

4.4. Model Validation Strategy

Performance evaluation implements five-fold cross-validation with stratified sampling to ensure balanced class distributions throughout each fold. For every split, stratified sampling maintains:

\frac{|S_{k}^{(c)}|}{|S_{k}|} \approx \frac{|D^{(c)}|}{|D|}

(18)

where $S_{k}^{(c)}$ represents the samples of class $c$ in fold $k$, $S_{k}$ is the full set of samples in fold $k$, $D^{(c)}$ denotes all samples of class $c$ in the dataset $D$, and $|\cdot|$ indicates set cardinality.
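The property in Eq. (18) can be checked with a simplified stratified split. The sketch below deals the samples of each class out to the folds in round-robin order; it is a stand-in for a library routine such as scikit-learn's `StratifiedKFold`, not the authors' code, and the class counts are hypothetical.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

# Hypothetical class counts for the four risk categories.
labels = ["low"] * 40 + ["moderate"] * 90 + ["high"] * 30 + ["very_high"] * 25
folds = stratified_folds(labels)
overall = {c: n / len(labels) for c, n in Counter(labels).items()}
fold_shares = [
    {c: sum(labels[i] == c for i in fold) / len(fold) for c in overall}
    for fold in folds
]
```

With these counts every fold holds 37 samples and reproduces the overall class proportions exactly; with counts that are not divisible by five the match is approximate, as the "≈" in Eq. (18) indicates.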

5. Experiment

5.1. Implementation Environment and Settings

We conducted all computational experiments on a workstation running Windows 11 operating system equipped with 32 GB of memory, utilizing Python version 3.11 as the primary programming environment. To ensure result reproducibility, we fixed the random number generator seed throughout all experimental procedures.

Our implementation combined the XGBoost classifier with Multi-Armed Bandit-based hyperparameter tuning, following the approach detailed in Section 4 (Methodology). The bandit optimization procedure commenced with 18 candidate parameter configurations, progressively expanding the search space to accommodate up to 35 distinct configurations. We adopted the Upper Confidence Bound exploration strategy with a confidence coefficient of c = 1.4, conducting 85 successive optimization rounds to achieve an optimal balance between parameter space exploration and exploitation of promising configurations.

Performance assessment was conducted through nested validation: an outer 80–20% stratified holdout split separated the data into development and test sets, while an inner stratified 5-fold cross-validation procedure operated within the optimization loop to evaluate candidate configurations. This dual-layer validation structure preserved proportional class representation at both levels, ensuring unbiased performance estimation.

The 16 baseline algorithms were selected to represent diverse ML paradigms: tree-based and boosting methods (Random Forest, XGBoost, LightGBM, CatBoost, Gradient Boosting, Decision Tree), linear models (Logistic Regression, Ridge, Lasso, ElasticNet), other approaches (SVM, KNN, Naive Bayes, Neural Network), and meta-ensembles (Voting, Stacking). This selection ensures comprehensive evaluation across different learning principles while maintaining reproducibility with standard libraries.

5.2. Performance Metrics

We evaluated classifier performance using a suite of classification metrics designed to capture predictive effectiveness across the four-category CKD risk taxonomy. Given the clinical significance of each risk stratum and the inherent class distribution asymmetry in our dataset, we prioritized macro-averaged metrics to ensure balanced representation of all risk categories in the assessment process.

Overall classification correctness is measured by accuracy, defined as the proportion of correctly identified instances across all risk tiers:

\mathrm{Accuracy} = \frac{TP_{1} + TP_{2} + TP_{3} + TP_{4}}{N}

(19)

where $TP_{n}$ denotes true positives for category $n$ and $N$ represents the total sample count.

The precision metric measures the reliability of positive predictions within each risk category, quantified through macro-averaging as:

{Precision}_{macro} = \frac{1}{4} \sum_{c = 1}^{4} \frac{T P_{c}}{T P_{c} + F P_{c}}

(20)

where $TP_{c}$ and $FP_{c}$ represent true and false positives for class $c$.

Recall, also termed sensitivity, quantifies the proportion of actual cases correctly identified within each risk tier—a critical measure for minimizing false-negative diagnoses:

{Recall}_{macro} = \frac{1}{4} \sum_{c = 1}^{4} \frac{T P_{c}}{T P_{c} + F N_{c}}

(21)

F1-Score represents the harmonic mean of precision and recall, delivering balanced evaluation for minimizing both false positive and false negative errors:

F 1_{macro} = \frac{1}{4} \sum_{c = 1}^{4} \frac{2 \times {Precision}_{c} \times {Recall}_{c}}{{Precision}_{c} + {Recall}_{c}}

(22)
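Equations (19) to (22) can all be evaluated from a single confusion matrix. The following sketch assumes rows index the true class and columns the predicted class; the function name and the 4 x 4 matrix are illustrative and are not data from the study.

```python
def macro_metrics(cm):
    """Accuracy and macro precision/recall/F1 from a square confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(map(sum, cm))
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n_classes)) - tp
        fn = sum(cm[c]) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "accuracy": sum(cm[c][c] for c in range(n_classes)) / total,
        "precision_macro": sum(precisions) / n_classes,
        "recall_macro": sum(recalls) / n_classes,
        "f1_macro": sum(f1s) / n_classes,
    }

# Hypothetical confusion matrix for the four risk categories.
cm = [
    [8, 2, 0, 0],
    [1, 9, 0, 0],
    [0, 0, 10, 0],
    [0, 0, 0, 10],
]
metrics = macro_metrics(cm)
```

Because each class contributes equally to the macro averages, errors in a small class lower the score as much as errors in a large one, which is the reason macro-averaging is preferred under class imbalance.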

This measure served as the principal optimization target during hyperparameter tuning, balancing diagnostic precision against sensitivity.

ROC AUC (One-vs-Rest) measures discriminative performance between risk categories through binary classification assessment:

{AUC}_{OvR} = \frac{1}{4} \sum_{c = 1}^{4} {AUC}_{c}

(23)
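A one-vs-rest AUC as in Eq. (23) can be sketched by treating each class in turn as the positive class and averaging the resulting binary AUCs. The pairwise-comparison estimator used below is one standard way to compute a binary AUC; the function names and the toy probabilities are assumptions for illustration.

```python
def binary_auc(scores, positives):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def auc_ovr(prob_rows, labels, n_classes):
    """Unweighted mean of the per-class one-vs-rest AUCs."""
    aucs = [
        binary_auc([row[c] for row in prob_rows], [y == c for y in labels])
        for c in range(n_classes)
    ]
    return sum(aucs) / n_classes

# Toy predicted probabilities for three samples and three classes.
probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
macro_auc = auc_ovr(probs, labels=[0, 1, 2], n_classes=3)
```

The pairwise estimator is quadratic in the number of samples, which is acceptable for a small illustration; rank-based implementations are preferable for large datasets.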

Statistical significance validation involved comprehensive analysis across 30 repeated experiments. The methodology incorporated paired statistical testing with multiple comparison adjustments for reliable conclusions.

Paired t-test compared performance differences when normality conditions were met:

t = \frac{\bar{d}}{s_{d} / \sqrt{n}}

(24)

where $\bar{d}$ represents the mean paired difference, $s_{d}$ denotes the standard deviation of the differences, and $n$ indicates the number of paired samples.
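The statistic in Eq. (24) can be computed directly from the paired scores of two methods. The sketch below uses only the standard library; the function name and the example values are illustrative, and obtaining a p-value would additionally require the t-distribution with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t = mean(d) / (s_d / sqrt(n)) for paired samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))  # stdev uses the n - 1 denominator

# Hypothetical paired scores from three repeated runs of two methods.
t_stat = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```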

The Wilcoxon signed-rank test provided a non-parametric alternative when the paired differences were not normally distributed:

W = \sum_{i = 1}^{n} rank (|d_{i}|) \cdot I (d_{i} > 0)

(25)

where $d_{i}$ represents the paired observation differences and $\mathrm{rank}(|d_{i}|)$ indicates the rank of the absolute differences, with $I(d_{i} > 0)$ as the indicator function for positive differences.
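The sum in Eq. (25) can be sketched as follows. Two conventions are assumed here because the text does not specify them: zero differences are discarded, and tied absolute differences share the average of the ranks they occupy. The function name is hypothetical.

```python
def wilcoxon_w_plus(differences):
    """Sum of the ranks of |d_i| over the positive differences."""
    d = [x for x in differences if x != 0]  # assumption: zero differences are dropped
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return sum(r for r, x in zip(ranks, d) if x > 0)

# Toy differences: |d| = 3, 1, 2, 2 receive ranks 4, 1, 2.5, 2.5.
w_plus = wilcoxon_w_plus([3, -1, 2, 2])
```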

To assess the clinical relevance of observed differences independent of p-values, we computed Cohen’s d as a standardized measure of effect size:

d = \frac{\bar{x_{1}} - \bar{x_{2}}}{s_{pooled}}

(26)

where $s_{pooled} = \sqrt{\frac{(n_{1} - 1) s_{1}^{2} + (n_{2} - 1) s_{2}^{2}}{n_{1} + n_{2} - 2}}$ represents the pooled standard deviation.
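Equation (26) and the pooled standard deviation translate into a few lines of code. The function name and the sample values below are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x1, x2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    pooled = sqrt(((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2))
    return (mean(x1) - mean(x2)) / pooled

# Toy samples: means 4 and 3, both with sample variance 4, so the pooled SD is 2.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

By Cohen's conventional thresholds, the resulting value of 0.5 would count as a medium effect, while values above 0.8 are considered large.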

Bonferroni correction managed multiple comparison challenges through significance level adjustment:

α_{corrected} = \frac{α}{k}

(27)

where $α$ denotes the original significance threshold (0.05) and $k$ indicates the number of pairwise comparisons.
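For the 16 baseline comparisons in this study, Eq. (27) gives the corrected threshold below. The snippet is a simple arithmetic check; the function name is hypothetical.

```python
from math import log10

def bonferroni_alpha(alpha=0.05, k=16):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / k

alpha_corrected = bonferroni_alpha()
threshold = -log10(alpha_corrected)  # position on a -log10(p) axis
```

The corrected level is 0.003125, which corresponds to roughly 2.5 on a −log₁₀(p) scale.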

A t-distribution confidence interval provided a parametric estimate of the uncertainty in the mean paired difference:

{CI}_{α} = \bar{d} \pm t_{α / 2, n - 1} \cdot \frac{s_{d}}{\sqrt{n}}

(28)

where $\bar{d}$ represents the mean paired difference, $t_{α / 2, n - 1}$ denotes the critical $t$-value, $s_{d}$ indicates the standard deviation of the differences, and $n$ represents the sample size.
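A minimal sketch of Eq. (28) is given below. The critical value is passed in as an argument rather than computed, since the standard library has no t-distribution quantile function; for 30 paired trials (29 degrees of freedom) the two-sided 95% value is approximately 2.045. The function name and the toy differences are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

def paired_ci(differences, t_crit):
    """Confidence interval for the mean paired difference."""
    n = len(differences)
    centre = mean(differences)
    half_width = t_crit * stdev(differences) / sqrt(n)
    return centre - half_width, centre + half_width

# Toy example with n = 3 (2 degrees of freedom, two-sided 95% critical value 4.303).
lower, upper = paired_ci([1.0, 2.0, 3.0], t_crit=4.303)
```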

5.3. Performance Evaluation and Comparative Assessment

Table 2 summarizes the comparative evaluation of our proposed method against 16 benchmark algorithms across 30 independent trials. Performance metrics are reported as mean ± standard deviation, with statistical significance assessed via paired t-tests with Bonferroni adjustment. Our framework attained a macro F1-score of 91.1%, accuracy of 91.9%, precision of 91.3%, and recall of 91.1%, demonstrating statistically significant superiority over all competing methods (p < 0.001). The ROC-AUC of 97.8%, with exceptional consistency (±0.001), exceeds the leading competitor by 1.7 percentage points, indicating superior discriminative power vital for clinical applications. Particularly noteworthy is the F1-score gain over the top-performing baseline (CatBoost: 88.4%): 2.7 percentage points in absolute terms, or 3.1% in relative terms. The accompanying accuracy margin (91.9% versus 89.4% for CatBoost) corresponds to roughly 250 additional correct classifications per 10,000 cases, which can directly influence therapeutic strategies and clinical outcomes. Importantly, our methodology displays limited overfitting, with training-testing discrepancies of merely 2.4% for F1-score and 1.7% for accuracy, among the smallest of the high-performing approaches. This supports reliable performance across heterogeneous patient cohorts, institutional settings, and temporal shifts, all of which are critical for real-world implementation.

Among comparison algorithms, CatBoost exhibited the strongest performance (F1-score: 88.4%) with comparably minimal overfitting disparities (2.4% F1, 1.8% accuracy), followed by standard XGBoost (87.5%) displaying moderate training-testing gaps (3.7% F1, 3.2% accuracy). Notably, LightGBM demonstrated competitive testing performance (86.7%) but revealed troubling overfitting patterns with the widest disparities among leading methods (7.9% F1, 6.9% accuracy), indicating possible instability in diverse clinical environments or patient populations. Conventional ensemble techniques such as Voting Classifier (85.7%) and Stacking Classifier (85.6%) could not match the leading methods despite reasonable robustness, suggesting that intelligent hyperparameter tuning can outperform architectural complexity for healthcare classification problems. Linear approaches showed modest efficacy, with Lasso (83.7%) and ElasticNet (83.3%) surpassing Logistic Regression (82.8%) while maintaining acceptable overfitting discrepancies under 3.0%. Ridge Classifier demonstrated substantial weaknesses (61.5% F1-score), and its ROC-AUC shows zero variance because it cannot generate the probabilistic outputs required for AUC computation. The distance-based KNN exposed serious overfitting concerns, attaining only 68.2% F1-score alongside the widest training-testing gap of 10.0%, while Naive Bayes, despite weak overall performance (45.0%), unexpectedly exhibited minimal overfitting (1.0% gap). The uniformly significant p-values throughout all comparisons, coupled with effect sizes spanning 1.416 to 15.732 (all surpassing Cohen's "large effect" criterion of 0.8) and our technique's superior cross-validation stability, furnish strong statistical support for the proposed method. The low performance variability (CV = 1.1%) across 30 trials indicates consistent algorithmic performance, which is essential for regulatory clearance and physician trust across varied healthcare contexts.

Figure 7 presents comprehensive F1-score performance distribution analysis across 17 machine learning algorithms through hybrid statistical visualization combining violin plots, box plots, and scatter points. The analysis encompasses 30 repeated experiments to ensure robust statistical inference. Left-side violin plots display kernel density estimations showing distribution shape and variance characteristics, where our proposed MAB-optimized method exhibits the most concentrated distribution around the highest performance level (mean F1-score: 0.911 ± 0.010). Right-side box plots provide statistical summaries with quartile boundaries, medians (black lines), and means (white diamonds), revealing our method’s superior central tendency and minimal interquartile range. The three strongest baseline competitors are CatBoost (0.884 ± 0.023), XGBoost (0.875 ± 0.025), and LightGBM (0.867 ± 0.025), yet all exhibit notably larger variance and lower mean performance. Gray confidence ellipses represent 95% confidence intervals, where our method’s narrow ellipse indicates high precision. Critically, our methodology achieves both exceptional predictive accuracy and remarkable consistency (coefficient of variation: 1.1%), indicating dependable performance transferability crucial for real-world medical implementation across heterogeneous patient cohorts. The trend line underscores the clear performance stratification, with ensemble techniques displaying moderate efficacy while distance-metric methods reveal substantial deficiencies.

Figure 8 presents comprehensive multi-metric evaluation encompassing accuracy, precision, recall, and ROC-AUC across all 17 algorithms using distinct visualization methods for each metric. Accuracy (red violin plots) demonstrates our method’s superior performance (0.919 ± 0.009) with remarkably tight distribution density, surpassing the strongest baseline competitors CatBoost (0.894 ± 0.022) and XGBoost (0.884 ± 0.024), while Neural Network shows the broadest distribution indicating high variability. Precision (blue notched box plots) reveals consistent high performance with our method achieving 0.913 ± 0.014 and narrow confidence notches, while Ridge Classifier displays the broadest interquartile range and extensive outliers due to classification instability. Recall (green scatter plots with diamond means) illustrates individual experimental outcomes where our method maintains 0.911 ± 0.007 with tightly clustered data points, demonstrating reliable sensitivity across repeated trials, contrasting with Neural Network’s wide scatter pattern. ROC-AUC (orange raincloud plots) combines density estimation with scatter points and mini-box summaries, showing our method’s exceptional discriminative performance (0.978 ± 0.001) with remarkably low variance and near-perfect density concentration, while Ridge Classifier exhibits zero variance due to its inability to provide probabilistic outputs. The unified visualization reveals consistent algorithmic ranking patterns across different metrics, with our method achieving the highest performance stability and minimal cross-metric variance. Traditional ensemble methods (CatBoost, XGBoost, LightGBM) demonstrate moderate effectiveness with acceptable variance levels, linear models show intermediate performance with higher variance patterns, while distance-based approaches (KNN) and probabilistic methods (Naive Bayes) exhibit significant performance limitations across all evaluation criteria.

Figure 9 presents comprehensive statistical significance analysis through −log₁₀(p-value) transformation, systematically confirming robust statistical evidence for our method’s superiority across all baseline comparisons. The horizontal bar chart displays the magnitude of statistical significance, where longer bars indicate stronger evidence against the null hypothesis of equal performance. All 16 baseline comparisons achieve p < 0.001 with effect sizes ranging from 1.416 (CatBoost) to 15.732 (Ridge Classifier), establishing unequivocal statistical evidence for our method’s superiority. Critical significance thresholds are marked by vertical reference lines: the red dashed line represents α = 0.05 (−log₁₀(0.05) ≈ 1.3), while the orange dashed line indicates the Bonferroni-corrected threshold (α = 0.003125 for 16 comparisons, −log₁₀(0.003125) ≈ 2.5), demonstrating that all comparisons far exceed even the most stringent multiple comparison corrections. Ridge Classifier exhibits the highest statistical significance (−log₁₀(p-value) ≈ 32, Cohen’s d = 15.732), while CatBoost shows the smallest yet still substantial effect size (Cohen’s d = 1.416). The color gradient from green to red reflects effect size magnitude, revealing distinct hierarchies: ensemble methods (CatBoost: 1.416, XGBoost: 1.833, LightGBM: 2.161) show moderate differences, distance-based methods (KNN: 8.647) and probabilistic approaches (Naive Bayes: 6.671) demonstrate larger effects, and linear classifiers (Ridge: 15.732) exhibit extreme gaps. All 16 comparisons exceed Cohen’s “large effect” threshold (d > 0.8), with 10 algorithms showing very large effects (d > 2.0), validating our methodology’s clinical superiority and statistical robustness.

Table 3 presents a detailed comparison of hyperparameter search strategies, illustrating the different exploration approaches employed by each optimization technique. Our MAB framework samples from continuous probability distributions across the complete parameter ranges, notably drawing learning rates log-uniformly from [0.01, 0.05] (by exponentiating a uniform draw between log 0.01 and log 0.05) to accommodate the logarithmic scale of this hyperparameter. Grid Search employs the most restrictive strategy with limited discrete value sets, such as merely three options for n_estimators [400, 600, 800] and constrained regularization ranges (reg_alpha: [0.1, 0.5], reg_lambda: [5.0, 8.0]), ensuring computational tractability but potentially overlooking optimal configurations. Random Search, Evolutionary Algorithm, and Bayes Search maintain broader parameter coverage comparable to our framework, with Random Search and Bayes Search utilizing uniform sampling across complete ranges while Evolutionary Algorithm implements discrete choices for integer-valued parameters (min_child_weight: [1, 2, 3], scale_pos_weight: [2, 3, 4]). The primary distinction lies in Grid Search's conservative parameter space reduction, which constrains exploration to predetermined combinations, whereas the alternative methods investigate the full hyperparameter landscape through different sampling strategies, reflecting each method's trade-off between computational efficiency and optimization comprehensiveness.

Table 4 offers a comparative analysis of five parameter optimization methodologies, providing supplementary evidence for our MAB approach's effectiveness. Our MAB-based framework delivered the best performance with an F1-score of 91.40% and balanced results across all evaluation metrics, while remaining computationally efficient at 158.52 s for 85 evaluations. Grid search, despite necessitating 2304 evaluations and expending 1539.93 s, produced a lower F1-score (90.92%) and inferior ROC-AUC (97.58%), requiring 27.1 times as many evaluations and about 9.7 times the execution time of our approach. Random search achieved identical performance to grid search but with reduced execution time (137.79 s), while Genetic Algorithm and Bayesian Search exhibited inferior F1-scores of 90.44% and 89.98%, respectively, indicating that the bandit-based strategy outperformed the exhaustive, random, evolutionary, and model-based alternatives in this comparison. Notably, our method achieved the highest precision (92.01%) and ROC-AUC (97.85%) while requiring far fewer evaluations than grid search, establishing both performance and efficiency advantages essential for practical medical AI applications.
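The efficiency figures quoted for grid search and the MAB framework imply two distinct ratios, one for the number of evaluations and one for wall-clock time. The short calculation below makes both explicit using only the numbers reported in the paragraph above; the variable names are illustrative.

```python
# Figures reported for Table 4.
grid = {"evaluations": 2304, "seconds": 1539.93}
mab = {"evaluations": 85, "seconds": 158.52}

evaluation_ratio = grid["evaluations"] / mab["evaluations"]  # about 27.1
time_ratio = grid["seconds"] / mab["seconds"]                # about 9.7
```

The time ratio of roughly 9.7 agrees with the "nearly ten-fold" efficiency gain stated in the abstract, whereas 27.1 is the ratio of evaluation counts.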

Figure 10 illustrates detailed confusion matrix comparisons across the six top-performing algorithms, demonstrating our method’s superior classification accuracy across all CKD risk categories. Our approach exhibits strong performance in correctly identifying low-risk patients (38/43, 90.5%), moderate-risk cases (90/93, 94.7%), and high-risk patients (28/33, 84.8%) compared to CatBoost’s 85.7%, 88.8%, and 71.4%, respectively, with exceptional accuracy in very high-risk classification (27/28, 96.4%) significantly surpassing XGBoost’s 88.1%.

Our method maintains balanced accuracy across all risk strata while minimizing false negatives. Misclassification errors primarily occur between adjacent risk categories, with traditional ensemble methods showing increased confusion between moderate and high-risk classifications. The model exhibits conservative classification tendencies, favoring clinical safety by trending toward higher risk categories rather than missing high-risk cases.

Figure 11 presents multi-class ROC curve analysis for six representative algorithms using one-versus-rest classification, demonstrating our method's strong discriminative ability with higher AUC values in three of the four risk categories. Our approach achieves AUC scores of 0.971 for low-risk, 0.971 for moderate-risk, 0.968 for high-risk, and 0.991 for very high-risk patients, outperforming the leading baseline competitors CatBoost (0.938, 0.937, 0.930, 0.992) and XGBoost (0.927, 0.924, 0.920, 0.991) in the low-, moderate-, and high-risk categories while performing on par with them for very high-risk patients.

Category-specific discriminative analysis reveals our method’s particularly strong performance in very high-risk patient identification (AUC = 0.991), where effective discrimination is clinically critical for timely intervention, while maintaining reliably high performance across lower risk categories. The ROC curves demonstrate optimal sensitivity-specificity balance with steep initial rises and minimal false positive rates, particularly for the highest severity categories where diagnostic precision directly impacts treatment decisions. Traditional ensemble methods including Voting Classifier and Decision Tree exhibit notably lower discriminative power, especially for moderate-risk classification (AUC ≈ 0.933 vs. our 0.971).

Figure 12 displays precision-recall curve evaluation across all algorithms using class-specific assessment to examine diagnostic performance at varying decision thresholds. Our method achieves enhanced performance with average precision scores of 0.946 for low-risk, 0.963 for moderate-risk, 0.869 for high-risk, and 0.976 for very high-risk patients, consistently surpassing baseline algorithms across all categories. Class-specific comparison reveals our approach particularly excels in very high-risk identification where CatBoost achieves only 0.956 compared to our 0.976, while maintaining stable precision across varying recall thresholds unlike competitors showing significant precision degradation. Traditional ensemble methods exhibit substantial limitations in high-risk classification, with Decision Tree achieving merely 0.649 and Voting Classifier reaching 0.686 average precision, compared to our robust 0.869 performance. Our method demonstrates reliable precision maintenance at high recall levels for critical patient categories, with smooth curve trajectories indicating consistent performance across different decision thresholds.

Figure 13 displays calibration curve evaluation across all models using reliability diagrams to assess probabilistic prediction accuracy. Calibration curves plot positive outcome fractions against mean predicted probabilities, where alignment with the perfect calibration diagonal indicates accurate probability estimates. Our MAB-optimized XGBoost exhibits enhanced probability calibration with curves closely coinciding with the diagonal line across the entire spectrum of risk stratifications, achieving strong calibration reliability for high-risk (slope ≈ 0.95) and very high-risk patients (slope ≈ 0.98), indicating predicted probabilities accurately reflect actual outcome frequencies. Our method maintains consistent linearity with minimal deviation from ideal calibration across the complete probability spectrum. Baseline algorithms exhibit substantial calibration deficiencies: Decision Tree shows erratic oscillations with steep departures and irregular patterns reflecting unstable probability estimates, while traditional XGBoost displays systematic overestimation of confidence in predictions for moderate-risk cases and insufficient confidence in high-risk situations. CatBoost and LightGBM demonstrate intermediate calibration quality but fail to achieve consistent linearity, with notable deviations in extreme probability ranges where clinical decisions are most critical.

Figure 14 shows decision curve evaluation comparing clinical net benefit across threshold probabilities to assess real-world utility of different algorithms. Decision curves plot net benefit against threshold probabilities, with higher curves indicating better clinical value compared to “treat all” (dashed line) and “treat none” (horizontal baseline). Our approach achieves enhanced clinical utility with consistent net benefit of 0.28–0.30 across the critical 0.1–0.4 threshold range, outperforming all baseline algorithms and maintaining positive benefit up to 0.85 threshold while alternatives show earlier decline. Threshold-specific evaluation reveals Decision Tree and LightGBM fall below the “treat none” baseline at 0.6 and 0.7 thresholds, respectively, indicating limited utility at higher probability cutoffs. Clinical relevance is evident in our approach’s notable stability within the 0.2–0.5 threshold range where most treatment decisions occur, exhibiting minimal fluctuation compared to traditional XGBoost and CatBoost which show marked benefit degradation.

Figure 15 displays concordance index evaluation across probability thresholds to assess discriminative performance consistency. C-index quantifies the probability that algorithms correctly rank patient pairs with different outcomes, where values exceeding 0.5 indicate superior-to-random discrimination. Our approach exhibits consistent performance with C-index values sustaining 0.60–0.65 stability throughout the 0.1–0.8 threshold range, ensuring dependable discrimination regardless of probability cutoff selection. Threshold consistency examination reveals enhanced reliability compared to baseline algorithms, which display either weak discrimination (Decision Tree maintaining ~0.57) or substantial threshold-dependent variability (LightGBM remaining near random performance until 0.8 threshold). While conventional XGBoost and CatBoost exhibit threshold-dependent enhancement reaching 0.82–0.85 at elevated thresholds, they compromise consistency across clinical scenarios, potentially restricting practical applicability when threshold requirements differ between clinical environments.

5.4. Model Interpretability Analysis

Figure 16 presents SHAP value distributions across four risk categories, illustrating feature-specific contributions to model predictions for each CKD severity level. The analysis reveals distinct contribution patterns where ACR demonstrates consistently high impact across all risk groups, with particularly pronounced positive contributions in high-risk patients, reflecting its established role as a primary indicator of glomerular damage and proteinuria severity. eGFR exhibits complementary behavior with predominantly negative SHAP values in low-risk cases and increasingly positive contributions in severe disease states, confirming its utility as a bidirectional predictor of kidney function decline. Stage-related features show clear stratification patterns, with URC_HP and eGFR_cluster displaying progressive contribution magnitudes that align with disease severity, while demographic factors including gender and diabetes exhibit more subtle but consistent influences across risk categories. The beeswarm distribution patterns demonstrate robust model interpretability, with feature contributions showing appropriate clinical directionality where protective factors (higher eGFR, lower ACR) generate negative SHAP values for high-risk predictions, while deteriorating kidney function markers produce positive contributions toward elevated risk classifications.

Figure 17 illustrates individualized SHAP waterfall plots demonstrating feature-level contributions for representative patients across four CKD risk categories, providing transparent mechanistic insights into model decision-making processes. Each subplot traces the cumulative contribution pathway from baseline expected value to final prediction, with positive contributions (green bars) indicating risk-elevating factors and negative contributions (red bars) representing protective influences. In the low-risk case (a), ACR emerges as the dominant positive contributor (SHAP = 2.051), while URC_cluster and URC_HP_log provide protective effects, illustrating how patients with elevated proteinuria can still maintain low overall risk through compensatory factors. The moderate-risk patient (b) demonstrates balanced contributions where ACR (SHAP = 0.851) and stage-related features create moderate elevation from the expected baseline, reflecting early disease progression patterns. High-risk classification (c) shows pronounced stage influence (SHAP = 1.326) combined with sustained ACR contribution, indicating advanced kidney dysfunction with significant proteinuria. The very high-risk case (d) exhibits the most dramatic deviation, with stage contributing substantially (SHAP = 1.630) alongside elevated eGFR impact (SHAP = 1.053), demonstrating severe kidney impairment across multiple clinical dimensions.

Figure 18 demonstrates feature-specific SHAP dependence patterns for the four most influential predictors, revealing critical non-linear relationships and interaction effects underlying CKD risk stratification. UP_index (a) exhibits distinct categorical behavior with pronounced negative contributions at lower values transitioning to minimal impact at higher levels, suggesting threshold-based protective effects in urinary protein assessment. eGFR dependence (b) displays classical kidney function relationships where higher values generate negative SHAP contributions (protective effect) while declining eGFR produces increasingly positive contributions toward higher risk classifications, with interaction coloring revealing synergistic effects with other clinical parameters. ACR dependence (c) shows the expected monotonic relationship where elevated albumin-creatinine ratios consistently drive positive risk contributions, with particularly steep increases beyond normal thresholds reflecting accelerated glomerular damage progression. Stage dependence (d) demonstrates clear categorical transitions corresponding to CKD severity levels, with distinct SHAP value clusters at each stage boundary and strong interaction effects indicated by color gradients.

Figure 19 presents LIME-generated local explanations for representative patients across the four CKD risk categories, showing feature-specific contributions to individual predictions at the patient level. In the low-risk case (a), family history and stage are the primary protective factors (negative weights), while biopsy and URC_cluster make modest positive contributions, illustrating how a patient can retain a favorable prognosis despite certain risk indicators. In the moderate-risk case (b), contributions are balanced: ACR is the dominant positive predictor (weight = 0.067) alongside stage, while family history has a protective influence, reflecting early disease progression with mixed prognostic indicators. In the high-risk case (c), stage carries a large weight (−0.117) that is, counterintuitively, negative, which we attribute to local feature interactions, while ACR and family history contribute positively, indicating a complex local decision boundary. In the very high-risk case (d), stage and ACR both show pronounced negative contributions, suggesting that LIME’s local linear approximation is capturing non-linear model behavior in which extreme feature values follow different decision pathways. This local analysis provides patient-specific explanations that support clinical decision-making and shows how global model patterns can manifest differently in individual cases.
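The sign differences between local LIME weights and global importance patterns are easier to interpret with the mechanism in view: LIME perturbs the instance, queries the model, and fits a distance-weighted linear surrogate whose coefficients are the reported weights. The sketch below is a simplified illustration of that idea using NumPy only; it is not the `lime` package and not the authors' configuration. The function name, the Gaussian perturbation scale, the kernel width, and the toy model are all assumptions made for illustration.

```python
import numpy as np


def lime_style_weights(predict_fn, x, n_samples=500, scale=0.5, kernel_width=0.75, seed=0):
    """Fit a locally weighted linear surrogate around x (simplified LIME-style sketch)."""
    rng = np.random.default_rng(seed)
    # Perturb the instance with Gaussian noise (assumed perturbation scheme).
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = predict_fn(Z)
    # Exponential kernel: samples closer to x get larger weights.
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))
    # Weighted least squares with an intercept, centered on x.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]


# Hypothetical non-linear "model": the effect of feature 0 saturates at high values.
def toy_model(Z):
    return np.tanh(Z[:, 0]) + 0.3 * Z[:, 1]


for point in (np.array([0.0, 0.0]), np.array([3.0, 0.0])):
    _, weights = lime_style_weights(toy_model, point)
    print(point, "-> local weights", np.round(weights, 3))
```

In this toy case the local weight of the first feature shrinks sharply at the extreme point even though the feature matters globally, which is the kind of behavior that can make a locally fitted weight differ from the global pattern.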

5.5. MAB Optimization and Algorithm Performance

Figure 20 characterizes the UCB-based multi-armed bandit procedure used for XGBoost hyperparameter optimization, covering its exploration-exploitation dynamics and convergence behavior. The cumulative regret curve (a) grows sublinearly, consistent with the theoretical regret bounds for UCB, and accumulated regret stays below 1.0 throughout the optimization process. The arm performance curves (b) show the algorithm identifying and exploiting high-performing hyperparameter combinations while still sampling across the search space, converging toward F1 scores of approximately 0.913. The exploration-exploitation balance (c) shifts over time, with periods of intensive exploration followed by focused exploitation as performance feedback accumulates. The parameter space landscape (d) maps performance over learning rate and max depth configurations, with performance intensity marking the best-performing regions, which the algorithm samples preferentially. The evolution of the UCB confidence intervals (e) shows the confidence bounds tightening as exploration progresses, yielding increasingly stable performance estimates. The performance heatmap (f) tracks six key metrics across optimization phases, showing steady improvement and stable behavior, with mean performance values ranging from 0.905 to 0.910 and exploration rates that maintain coverage of the parameter space. Together, these results support the multi-armed bandit approach as an efficient alternative to traditional grid search for hyperparameter optimization in this setting.
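The selection rule behind these curves can be sketched as follows. This is a minimal UCB loop over a small set of discrete hyperparameter configurations, not the authors' implementation: the arm list, the simulated F1 values, the noise level, and the exploration coefficient `c` are hypothetical, and `simulated_f1` stands in for what would in practice be a cross-validated XGBoost evaluation. The budget of 85 evaluations matches the evaluation count reported in Table 4.

```python
import math
import random


def ucb_select(arms, evaluate, budget, c=0.05):
    """Pull each arm once, then pick the arm with the highest mean + confidence bonus."""
    n_arms = len(arms)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    history = []
    for t in range(1, budget + 1):
        if t <= n_arms:
            i = t - 1  # initialization round: try every arm once
        else:
            i = max(
                range(n_arms),
                key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]),
            )
        reward = evaluate(arms[i])
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        history.append((i, reward))
    return counts, means, history


# Hypothetical arms and hypothetical "true" mean F1 per arm (for simulation only).
arms = [
    {"learning_rate": 0.01, "max_depth": 5},
    {"learning_rate": 0.02, "max_depth": 6},
    {"learning_rate": 0.03, "max_depth": 7},
    {"learning_rate": 0.05, "max_depth": 8},
]
true_f1 = [0.895, 0.905, 0.913, 0.900]
rng = random.Random(0)


def simulated_f1(arm):
    # Stand-in for training and cross-validating a model with this configuration.
    return true_f1[arms.index(arm)] + rng.gauss(0.0, 0.005)


counts, means, history = ucb_select(arms, simulated_f1, budget=85)
cumulative_regret = sum(max(true_f1) - true_f1[i] for i, _ in history)
print("pulls per arm:", counts)
print("estimated F1 per arm:", [round(m, 4) for m in means])
print("cumulative regret:", round(cumulative_regret, 4))
```

The classic UCB1 bonus for rewards in [0, 1] is sqrt(2 ln t / n); a smaller coefficient is used here because the simulated F1 gaps between arms are only a few hundredths, so the textbook bonus would keep the pulls close to uniform within 85 evaluations.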

6. Conclusions

This research introduces a hybrid framework integrating Multi-Armed Bandit optimization with BorderlineSMOTE for risk stratification of chronic kidney disease. The proposed model delivered 91.8% accuracy, 91.0% F1-score, and 97.8% ROC-AUC and outperformed 16 benchmark algorithms across 30 repeated experiments (all comparisons showed p < 0.001 with Bonferroni correction). Upper Confidence Bound optimization also improved computational efficiency, reducing optimization time from 1539.93 s (GridSearch) to 158.52 s, a 9.7-fold speedup that is valuable in resource-limited clinical settings, while preserving predictive performance.
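The efficiency figures can be checked directly from the timings and evaluation counts in Table 4. The short Python calculation below also shows the per-comparison threshold implied by a Bonferroni correction over 16 comparisons; the family-wise level of 0.05 is an assumption made here for illustration, since the level is not restated in this section.

```python
# Timings and evaluation counts as reported in Table 4.
grid_seconds, mab_seconds = 1539.93, 158.52
grid_evals, mab_evals = 2304, 85

speedup = grid_seconds / mab_seconds   # wall-clock ratio
eval_ratio = grid_evals / mab_evals    # how many fewer configurations were evaluated

# Bonferroni: divide the family-wise level by the number of comparisons.
alpha = 0.05          # assumed family-wise significance level
n_comparisons = 16    # benchmark algorithms compared against the proposed model
bonferroni_alpha = alpha / n_comparisons

print(f"speedup: {speedup:.2f}x")
print(f"evaluation ratio: {eval_ratio:.1f}x")
print(f"per-comparison threshold: {bonferroni_alpha:.6f}")
```

Under that assumed level, the reported p < 0.001 falls below the per-comparison threshold for every benchmark.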

The framework tackles critical medical AI challenges through sophisticated class imbalance management and comprehensive interpretability incorporation. BorderlineSMOTE’s measured balancing approach avoided over-synthesis complications while maintaining original data distribution properties, supporting reliable generalization across varied patient cohorts. Combined SHAP-LIME interpretability examination identified feature importance patterns showing strong alignment with established nephrology principles: albumin-creatinine ratio surfaced as the leading predictor, confirming its recognized function in identifying glomerular barrier dysfunction and forecasting cardiovascular outcomes in CKD populations. Estimated glomerular filtration rate exhibited bidirectional predictive significance, validating its function as both a staging measure and independent prognostic indicator for disease advancement, while CKD stage classification supported the clinical staging system’s applicability for risk assessment. These results correspond with pathophysiological knowledge that proteinuria indicates early glomerular injury, eGFR reduction signifies progressive nephron depletion, and advanced staging associates with systemic complications including cardiovascular disease and mineral bone disorders. The model exhibited dependable calibration with curves closely matching ideal calibration across all risk groups, while decision curve examination verified sustained clinical net benefit (0.28–0.30) throughout clinically meaningful threshold ranges, confirming practical value for real-world implementation.
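For readers less familiar with decision curve analysis, net benefit at a threshold probability is conventionally defined as the true-positive rate per patient minus the false-positive rate per patient weighted by the odds of the threshold. The sketch below implements that standard definition for a binary outcome. The labels and probabilities are hypothetical, and applying it to the four-level risk setting would require a one-vs-rest reduction, which is an assumption rather than a description of the authors' procedure.

```python
import numpy as np


def net_benefit(y_true, y_prob, threshold):
    """Net benefit = TP/n - FP/n * threshold / (1 - threshold)."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    treat = y_prob >= threshold          # patients flagged at this threshold
    n = y_true.size
    tp = np.sum(treat & y_true)
    fp = np.sum(treat & ~y_true)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Hypothetical labels (1 = event) and predicted probabilities.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]

for pt in (0.2, 0.5):
    print(f"threshold {pt:.1f}: net benefit {net_benefit(y_true, y_prob, pt):.4f}")
```

A decision curve plots this quantity over a range of thresholds and compares it with the treat-all and treat-none strategies.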

From a clinical nephrology standpoint, the model’s strong performance in identifying very high-risk patients (ROC-AUC 0.991) has significant therapeutic consequences, as these individuals typically need immediate nephrology consultation, intensive cardiovascular risk control, and preparation for renal replacement therapy. The model’s conservative tendency to classify toward higher risk categories is in line with clinical best practice, where underestimating disease severity might postpone essential interventions such as ACE inhibitor optimization, phosphate binder introduction, or timely dialysis access establishment. Although these results indicate encouraging clinical potential, limitations include the retrospective design and single-region data acquisition, which may restrict generalizability across varied populations and healthcare systems. Future work should include prospective validation across multiple international centers to determine broader applicability. Further research directions include extending the framework to longitudinal risk prediction that incorporates temporal disease progression patterns, integrating multi-modal data sources such as imaging and genomic information, and building adaptive learning systems that update the model as clinical evidence accumulates. This methodology offers a reproducible approach to building transparent, computationally efficient machine learning models that maintain balanced diagnostic performance across all risk strata, with practical utility for chronic kidney disease management and potential applicability to other medical prediction tasks.

Author Contributions

Conceptualization, J.C.; Methodology, J.H. and J.C.; Software, J.H. and L.L.; Validation, L.L. and J.C.; Data Collection and Organization, J.H.; Manuscript Drafting, J.H.; Manuscript Revision and Polishing, L.L. and J.C.; Visualization, L.L.; Supervision, J.C.; Funding Acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

Financial support for this research was provided by the Key Research and Development Program of Guangxi (Grant Nos. AB24010085, AB23026120, AA24263010), National Natural Science Foundation of China (Grant Nos. 62462019, 62561018), Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023A1515012846), Key Laboratory of Equipment Data Security and Guarantee Technology, Ministry of Education (Grant No. GDZB2024060500), Natural Science Foundation of Guangxi Zhuang Autonomous Region (Grant No. 2025GXNSFBA069410), and Basic Scientific Research Capacity Enhancement Program for Young and Middle-Aged Teachers in Guangxi Institutions of Higher Education (Grant Nos. 2024KY0233, 2025KY0243).

Data Availability Statement

The novel contributions generated in this research are incorporated within the manuscript. For additional inquiries, please contact the corresponding author.

Conflicts of Interest

The authors have no conflicts of interest to disclose.


Figure 1. Correlation Matrix of Clinical Variables and Biomarkers.

Figure 2. Patient Characteristics by CKD Stage.

Figure 3. Biomarker Distribution Profiles Across CKD Stages.

Figure 4. PCA Analysis and Clustering of CKD Patients.

Figure 5. CKD Progression and Risk Stratification Analysis.

Figure 6. Proposed Framework Architecture.

Figure 7. F1-Score Performance Distribution Analysis.

Figure 8. F1-Score Performance Distribution Analysis.

Figure 9. Statistical Significance Analysis.

Figure 10. Confusion Matrix Comparison Analysis.

Figure 11. Multi-class ROC Analysis.

Figure 12. Precision-Recall Curve Performance Analysis.

Figure 13. Calibration Curve Performance Analysis.

Figure 14. Decision Curve Analysis.

Figure 15. Concordance Index Curve Analysis.

Figure 16. SHAP Summary Analysis.

Figure 17. SHAP Waterfall Analysis.

Figure 18. SHAP Dependence Analysis.

Figure 19. LIME Local Interpretability Analysis.

Figure 20. Multi-Armed Bandit Optimization Analysis.

Table 1. Data Field Description.

Variable Name	Description	Values/Range
hos_id	Hospital ID	7 hospitals
hos_name	Hospital Name	Hospital names
gender	Gender	Male/Female
genetic	Hereditary Kidney Disease	Yes/No
family	Family History of Chronic Nephritis	Yes/No
transplant	Kidney Transplant History	Yes/No
biopsy	Renal Biopsy History	Yes/No
HBP	Hypertension History	Yes/No
diabetes	Diabetes Mellitus History	Yes/No
hyperuricemia	Hyperuricemia	Yes/No
UAS	Urinary Anatomical Structure Abnormality	None/No/Yes
ACR	Albumin-to-Creatinine Ratio	<30/30–300/>300 mg/g
UP_positive	Urine Protein Test	Negative/Positive
UP_index	Urine Protein Index	± (0.1–0.2 g/L); + (0.2–1.0); 2+ (1.0–2.0); 3+ (2.0–4.0); 5+ (>4.0)
URC_unit	Urine RBC Unit	HP: per high-power field; μL: per microliter
URC_num	Urine RBC Count	0–93.9 (units vary)
Scr	Serum Creatinine	0/27.2–85,800 μmol/L
eGFR	Estimated Glomerular Filtration Rate	2.5–148 mL/min/1.73 m²
date	Diagnosis Date	13 December 2016 to 27 January 2018
rate	CKD Risk Stratification	Low Risk/Moderate Risk/High Risk/Very High Risk
stage	CKD Stage	CKD Stage 1–5

Table 2. Model Comparison Results.

Model	F1-Score	Accuracy	Precision	Recall	ROC AUC	F1 Overfitting Gap	Accuracy Overfitting Gap	p Value	95% CI	Effect Size
Ours	0.911 ± 0.010	0.919 ± 0.009	0.913 ± 0.014	0.911 ± 0.007	0.978 ± 0.001	0.024	0.017	—	—	—
Random Forest	0.783 ± 0.035	0.819 ± 0.026	0.823 ± 0.038	0.763 ± 0.033	0.940 ± 0.013	0.056	0.042	<0.001	[0.113, 0.140]	4.849
XGBoost	0.875 ± 0.025	0.884 ± 0.024	0.883 ± 0.024	0.869 ± 0.026	0.957 ± 0.012	0.037	0.032	<0.001	[0.026, 0.046]	1.833
Decision Tree	0.859 ± 0.028	0.872 ± 0.026	0.866 ± 0.028	0.855 ± 0.029	0.942 ± 0.017	0.040	0.033	<0.001	[0.040, 0.062]	2.356
SVM	0.841 ± 0.030	0.856 ± 0.027	0.853 ± 0.032	0.833 ± 0.029	0.951 ± 0.013	0.052	0.043	<0.001	[0.057, 0.080]	3.005
Neural Network	0.722 ± 0.079	0.766 ± 0.057	0.763 ± 0.065	0.709 ± 0.076	0.908 ± 0.030	0.057	0.045	<0.001	[0.157, 0.219]	3.320
Logistic Regression	0.828 ± 0.022	0.842 ± 0.021	0.848 ± 0.026	0.815 ± 0.021	0.938 ± 0.013	0.032	0.027	<0.001	[0.073, 0.091]	4.597
Ridge Classifier	0.615 ± 0.024	0.714 ± 0.023	0.703 ± 0.078	0.633 ± 0.022	0.000 ± 0.000	0.031	0.020	<0.001	[0.286, 0.305]	15.732
LightGBM	0.867 ± 0.025	0.877 ± 0.023	0.876 ± 0.025	0.861 ± 0.028	0.955 ± 0.013	0.079	0.069	<0.001	[0.033, 0.053]	2.161
CatBoost	0.884 ± 0.023	0.894 ± 0.022	0.893 ± 0.021	0.878 ± 0.025	0.961 ± 0.011	0.024	0.018	<0.001	[0.016, 0.035]	1.416
Gradient Boosting	0.844 ± 0.028	0.858 ± 0.024	0.857 ± 0.027	0.835 ± 0.030	0.950 ± 0.012	0.059	0.048	<0.001	[0.055, 0.076]	3.026
KNN	0.682 ± 0.035	0.720 ± 0.030	0.726 ± 0.035	0.658 ± 0.036	0.871 ± 0.020	0.100	0.085	<0.001	[0.215, 0.240]	8.647
Naive Bayes	0.450 ± 0.097	0.431 ± 0.119	0.489 ± 0.113	0.542 ± 0.065	0.857 ± 0.022	0.010	0.007	<0.001	[0.423, 0.495]	6.671
Lasso Regression	0.837 ± 0.027	0.851 ± 0.024	0.857 ± 0.031	0.825 ± 0.026	0.943 ± 0.012	0.028	0.023	<0.001	[0.063, 0.082]	3.520
ElasticNet	0.833 ± 0.025	0.846 ± 0.022	0.853 ± 0.029	0.819 ± 0.024	0.941 ± 0.012	0.030	0.026	<0.001	[0.067, 0.087]	3.933
Voting Classifier	0.857 ± 0.024	0.869 ± 0.022	0.878 ± 0.025	0.843 ± 0.024	0.958 ± 0.011	0.045	0.038	<0.001	[0.043, 0.062]	2.771
Stacking Classifier	0.856 ± 0.025	0.866 ± 0.023	0.867 ± 0.026	0.849 ± 0.025	0.952 ± 0.012	0.050	0.044	<0.001	[0.044, 0.063]	2.690

Table 3. Comparative Assessment of Parameter Settings Across Tuning Strategies.

Hyperparameter	Ours	Grid Search	Random Search	Genetic Algorithm	Bayes Search
n_estimators	300–800	[400, 600, 800]	300–800	300–800	300–800
max_depth	5–8	[6, 8]	5–8	5–8	5–8
learning_rate	exp (0.01–0.05)	[0.02, 0.03, 0.04]	0.01–0.05	exp (0.01–0.05)	0.01–0.05
subsample	0.7–0.9	[0.8, 0.9]	0.7–0.9	0.7–0.9	0.7–0.9
colsample_bytree	0.8–1.0	[0.9, 1.0]	0.8–1.0	0.8–1.0	0.8–1.0
reg_alpha	0.1–1.0	[0.1, 0.5]	0.1–1.0	0.1–1.0	0.1–1.0
reg_lambda	3.0–8.0	[5.0, 8.0]	3.0–8.0	3.0–8.0	3.0–8.0
min_child_weight	[1, 2, 3]	[2, 3]	1–3	[1, 2, 3]	1–3
gamma	0.1–1.0	[0.1, 0.5]	0.1–1.0	0.1–1.0	0.1–1.0
scale_pos_weight	[2, 3, 4]	[2, 3]	2–4	[2, 3, 4]	2–4

Table 4. Performance Assessment Across Hyperparameter Search Strategies.

Method	F1 Score	Accuracy	Precision	Recall	ROC AUC	Time (s)	Evaluations
Ours	0.9140	0.9242	0.9201	0.9113	0.9785	158.52	85
Grid Search	0.9092	0.9192	0.9122	0.9086	0.9758	1539.93	2304
Random Search	0.9092	0.9192	0.9122	0.9086	0.9764	137.79	100
Genetic Algorithm	0.9044	0.9141	0.9048	0.9060	0.9752	415.93	100
Bayes Search	0.8998	0.9091	0.8978	0.9034	0.9727	598.24	100

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
