1. Introduction
Kidney disease has become a major health concern affecting millions of individuals worldwide. The number of people suffering from chronic kidney disease (CKD) and related disorders, such as tumors, stones, and cysts, has grown rapidly in recent years [1]. CKD affects over 10% of the global population, and recent estimates suggest that between 700 and 850 million people worldwide have the condition [2,3]. Furthermore, some conditions, like multilocular kidney cysts and stones, may recur or remain undetectable in their earliest stages, adding further diagnostic complexity. Renal cancers, or kidney tumors, constitute approximately 2.5% of all cancer cases [4]. Simple cysts are common (especially with increasing age), but their prevalence may be higher in certain genetic or acquired disease states (e.g., CYP24A1 deficiency, dialysis-associated acquired cystic kidney disease), and they may be associated with CKD progression and elevated cancer risk [5,6]. The incidence of kidney cancer is rising worldwide, with approximately 400,000 new cases and 175,000 deaths annually; contributing factors include CKD, acquired cystic kidney diseases, and metabolic disorders. Kidney malignancies, including Wilms’ tumor, account for 3–11% of all childhood cancers [7,8]. Kidney stones are one of the predominant causes of CKD, affecting approximately 10% of adults worldwide, with recurrence rates reaching up to 50% [9,10]. Early detection of CKD is vital because the condition often goes unnoticed in its early stages. Identifying CKD early allows for timely interventions that can slow or even halt disease progression, reduce the risk of complications such as cardiovascular disease, improve overall patient outcomes, and increase survival rates [11,12]. Training a deep neural network (DNN) model on a limited dataset increases the risk of overfitting [13]. Traditional diagnostic methods, such as blood tests and biopsies, are often invasive, time-consuming, and may fail to identify early-stage kidney diseases accurately [14]. Both traditional diagnostic methods and standalone machine learning models face challenges such as overfitting, slow computational speed, class imbalance, and limited feature extraction capabilities, which hinder their effectiveness in clinical settings [15,16]. Time-consuming registration steps further slow processing, hampering a model’s practicality for real-time clinical use [17]. Deep learning methods, particularly CNNs, have proven their worth in various medical image classification applications [18]. CNNs can analyze kidney images obtained from ultrasound, MRI, and CT to identify abnormalities with a high degree of accuracy [19]. Despite the great advances in CNN-based kidney disease classification, several intrinsic drawbacks persist in past research. Most existing studies employ a single CNN model, such as VGG, ResNet, or Inception, without presenting an extensive comparison or fusion of different feature extractors. In addition, very little effort has been devoted to justifying the selection of particular CNN models and their parameter settings, which limits the explainability and transferability of the results. Class imbalance is rarely fully addressed, and experiments seldom evaluate model robustness under different data partition ratios or across datasets. Moreover, the lack of full ablation studies and thorough analyses of existing methods makes it hard to identify the optimal feature fusion and classification strategies. These gaps call for a more formal, interpretable, and robust hybrid scheme that integrates complementary CNN structures to improve the reliability of kidney disease diagnosis. Hybrid deep learning approaches address these limitations by combining the strength of deep learning in robust feature extraction with the strengths of classical machine learning or optimization algorithms in efficient and accurate classification, leading to improved diagnostic accuracy, reduced computational load, and better handling of imbalanced datasets [15,20].
This study proposes a hybrid deep learning model for the accurate classification of kidney diseases from medical images. It also explores effective feature extraction techniques, investigates the benefits of transfer learning with pretrained CNN architectures, and prioritizes improvements in both model performance and interpretability to enhance clinical usability. The model effectively combines the extracted features of DenseNet121 and EfficientNetB0, and the Synthetic Minority Oversampling Technique (SMOTE) is applied to handle class imbalance in the dataset. To further refine classification, three machine learning classifiers are applied: Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). A soft-voting ensemble mechanism is then applied to leverage the performance of the individual models and improve combined prediction accuracy and reliability. The hybrid model with the ensemble classifier correctly classifies kidney abnormalities and achieves outstanding performance, suggesting the model’s effectiveness and readiness for real-world deployment in computer-aided kidney disease diagnosis. To further evaluate the model, it was tested on two additional augmented datasets, on which it also performed well. Thus, this study provides a hybrid model with the following contributions:
This study combines DenseNet121 and EfficientNetB0, both state-of-the-art convolutional neural networks, which together efficiently extract dense and discriminative features from medical images.
By combining deep learning models and traditional machine learning algorithms, the hybrid model benefits from the representational power of CNNs and the interpretability and generalizability of conventional classifiers.
A novel soft-voting ensemble strategy is introduced to fuse multiple classifier outputs, enhancing the model’s stability and diagnostic reliability across heterogeneous datasets.
The proposed pipeline incorporates a class-balancing mechanism using the Synthetic Minority Oversampling Technique (SMOTE), ensuring equitable learning across imbalanced medical image categories and improving generalization for underrepresented kidney disease types.
The rest of the article is organized as follows:
Section 2—Related work—reviews the background, motivation, and research gaps.
Section 3—Methodology—explains the workflow diagram.
Section 4—Experimental setup—states the evaluation metrics and device configuration.
Section 5—Results and discussion—explains results, comparing CNN models for kidney disease classification across imaging modalities. Finally,
Section 6—Conclusion—summarizes the limitations and future research directions.
2. Related Work
Deep learning medical image analysis started with core architectures, including AlexNet [21], VGGNet [22], GoogLeNet [23], ResNet [24], and other similar models. This research demonstrated how convolutional neural networks (CNNs) could extract multi-level image features, which led to new medical diagnostic approaches. Medical imaging performance for segmentation tasks received a major boost from U-Net [25], which became a standard architecture. These classical models formed the foundation for modern networks like DenseNet and EfficientNet, which deliver better feature sharing, depth utilization, and gradient flow for superior disease detection accuracy. Deep learning methods have been broadly used for the automated detection and classification of kidney abnormalities in medical imaging. However, many earlier approaches, as summarized in Table 1, show limited generalizability across diverse imaging modalities and patient populations. Cruz et al. proposed a deep neural network combining AlexNet for scope reduction and U-Net for segmentation, achieving a high accuracy of 99.92% on the KiTS19 dataset; however, its parameter settings are manually defined, which may lead to the loss of kidney slices or small true positives, thereby reducing segmentation accuracy [26]. Somasundaram et al. proposed a VGG-19-based hybrid model and compared it with AlexNet, GoogLeNet, ResNet, and SqueezeNet for kidney stone detection; it achieved a high accuracy of 99.89% but lacked evaluation across diverse datasets, which may affect its generalizability to different medical imaging conditions [27]. Choudhari et al. developed a U-Net-based deep learning model for the semantic segmentation of kidneys and tumors, demonstrating strong IoU scores, but tumor segmentation accuracy was low, indicating that additional training epochs and model optimization were needed to improve performance [28].
Patro et al. suggested a Kronecker product-based convolution approach for reducing feature map redundancy and avoiding convolutional overlap, achieving an accuracy of 98.56%, but it is not cost-effective [29]. Ekpar et al. propose AI-powered healthcare systems for resource-poor settings, which face limitations such as weak infrastructure and low digital literacy [30]. Subedi et al. use a Vision Transformer (ViT), which moves beyond the limits of traditional methods and demonstrates promising potential for medical image analysis, but it has not yet been tested on varied datasets [31]. Yildrim et al. proposed a clinically validated deep learning model that could assist radiologists in accurately detecting kidney stone cases, but it could not classify kidney stones by size [32]. Ghosh et al. developed a fuzzy inference system to enhance kidney CT image contrast with an accuracy of 99.2%, but their dataset was limited, lacked tumor localization, and was not clinically validated [33]. Chaki et al. used an inductive transfer-based ensemble deep neural network (DNN) for feature extraction, but it lacks spatial localization and has not yet been clinically tested [34].
Kumar et al. introduced a hybrid convolutional neural network (CNN) and Residual Network (ResNet) model to combine the strengths of both networks, achieving an accuracy of 90%, but it lacked multi-modal data integration and interpretability [35]. Fuladi et al. suggested a CNN model achieving an accuracy of 99.57%, but it is limited by dataset diversity and lacks clinical validation [36]. Recently, hybrid deep learning methods have been recognized for their ability to combine the strengths of different architectures. Majid et al. integrated DenseNet121 and ResNet50 to enable both fine-grained feature extraction and robust classifier training, yielding a significant improvement in diagnostic accuracy, but their work was confined to a single dataset [37]. Furthermore, Grad-CAM and other interpretability techniques, as employed by Asif et al., aid in understanding which image regions contribute most to the algorithm’s decisions, thereby strengthening clinician confidence in automated diagnostics, but they sometimes misclassify CT images, which can pose severe clinical risks [38]. Chanakya et al. proposed a CNN-based deep learning approach for the automated detection of kidney diseases, improving diagnostic capability with an accuracy of 98.66% [39]. Ayogu et al. proposed ensembles of three deep learning models and three vision transformers, achieving a high accuracy of 99.67%, but the absence of object detection techniques in the preprocessing stage may restrict the model’s ability [40].
Refaee et al. proposed a ResNet-50-based machine learning approach that enhanced early detection, achieving an accuracy of 83%, but the study lacked a diverse dataset, and implementing XAI might improve the model’s performance [41]. Reddy et al. introduced a deep network comparing U-Net for segmentation and AlexNet for classification, reporting 94.75% accuracy with AlexNet, and visualized the convolutional layers using saliency maps [42]. Obaid et al. proposed a deep learning approach based on DarkNet53 to deliver the best performance, but the model depends entirely on high-quality and diverse datasets, which may hinder its accuracy in low-resource settings [43]. Pimpalkar et al. proposed a deep learning approach to enhance the early detection and classification of kidney diseases, achieving an accuracy of 99.96%, but the study lacked cross-dataset evaluation [44]. Saber et al. proposed an ensemble of CNN models, achieving an accuracy of 99.85%, but the study lacked external dataset validation, which may restrict the model’s reliability [45].
The medical field has seen multiple deep learning systems achieve outstanding results. The effectiveness of CNNs, U-Net, and hybrid frameworks appears in multiple studies [26,28,37,42,44,45]. However, these studies base their findings on restricted dataset samples, which makes their models unsuitable for different medical environments. The models from Somasundaram et al. [27] and Fuladi et al. [36] achieve high accuracy, but they have not been tested on external datasets or through cross-dataset evaluation. The architectures from Patro et al. [29] and Kumar et al. [35] deliver strong results, but they operate at high computational cost and produce results that lack interpretability. The implementation of Grad-CAM and saliency map techniques [38,42] improves system transparency, but these methods can misclassify essential CT regions, which could lead to dangerous medical outcomes. Current models also do not address practical problems stemming from restricted infrastructure and insufficient digital skills in underdeveloped areas [30,43]. Despite significant progress, the literature reveals persistent gaps: (1) limited generalization across multiple datasets, (2) lack of interpretability and clinical validation, and (3) insufficient exploration of complementary hybrid feature extraction. To address these gaps, this study introduces a hybrid model combining DenseNet121 and EfficientNetB0, designed to enhance feature diversity, interpretability, and classification robustness, yielding a reliable and trustworthy pipeline for automated kidney disease detection and classification.
3. Methodology
This research work follows a systematic workflow, illustrated in Figure 1, that includes several steps: (a) data collection, (b) data preprocessing and enhancement, (c) feature extraction, (d) SMOTE balancing, (e) classification and prediction, and (f) visualization.
Initially, the images were enhanced using two techniques, Contrast-Limited Adaptive Histogram Equalization (CLAHE) and bilateral filtering, and two distinct feature extractors—DenseNet121 and EfficientNetB0—were then applied. DenseNet121 was chosen for its dense connectivity, which facilitates efficient gradient flow and feature reuse, while EfficientNetB0 provides balanced depth, width, and resolution scaling for parameter efficiency. Their complementary structures enable the fusion of multi-scale and hierarchical features, enhancing robustness for kidney disease detection.
SMOTE balancing was used to reduce class imbalance. Although SMOTE is primarily designed for tabular data, its application to deep feature vectors helps balance the class distribution in the latent feature space. This ensures that minority classes, such as kidney stones, are adequately represented without introducing artificial pixel-level artifacts. These features were then combined and fed into three machine learning classifiers, Support Vector Machine (SVM), Random Forest (RF), and XGBoost, to accurately detect and classify diseased kidneys.
3.1. Data Collection
The dataset used in this research consists of CT kidney images sourced from Kaggle (https://www.kaggle.com/datasets/nazmul0087/ct-kidney-dataset-normal-cyst-tumor-and-stone, accessed on 15 June 2025). The dataset is publicly available and distributed under an open academic license on Kaggle, allowing reuse for research and publication purposes. All patient information was anonymized, ensuring compliance with ethical data usage standards. It includes four distinct classes—cyst (3709 images), normal (5077 images), stone (1377 images), and tumor (2283 images)—making a total of 12,446 labeled images. The dataset was split into a 70/15/15 ratio for training, validation, and testing, as shown in Figure 2.
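As a minimal sketch, this stratified 70/15/15 partition can be realized with scikit-learn as shown below; the variable names and the fixed random seed are illustrative assumptions rather than the exact implementation used here.

```python
from sklearn.model_selection import train_test_split

# 70/15/15 stratified split: hold out 30% first, then divide it evenly
# into validation and test sets (variable names are illustrative).
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)
```

Stratification preserves the four-class proportions in every split, which matters given the comparatively small stone class.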
While this partitioning provided satisfactory performance, the impact of alternative data split ratios such as 80/10/10 on model performance has not been fully explored. A comparative analysis using different partitioning strategies is planned for future work to assess the robustness and sensitivity of the proposed approach to data division.
All experiments in this study are conducted using variations of the same Kaggle CT kidney dataset. While the results demonstrate strong performance on this dataset, the generalizability of the proposed approach to other populations, imaging protocols, or clinical settings remains to be validated. Future work will involve testing the model on independent datasets from multiple sources to further assess its robustness and real-world applicability.
Examples of the CT images from the four classes are shown in Figure 3.
While other kidney diseases exist, this study specifically focuses on these four classes due to the availability and balance of labeled data in the dataset. Extension to additional diseases is left for future work.
3.2. Data Preprocessing
To improve contrast, reduce noise, and enable robust training, a series of preprocessing and augmentation steps was applied to the CT images. Each grayscale CT image was first enhanced using CLAHE to highlight fine-grained textures. CLAHE parameters were empirically selected with a clip limit of 2.0 and a tile grid size of (8, 8), which provided optimal contrast enhancement without amplifying noise. For bilateral filtering, the filter diameter $d$ and the standard deviations $\sigma_{\mathrm{color}}$ and $\sigma_{\mathrm{space}}$ were set to balance smoothness with edge preservation. CLAHE operates by partitioning the image into small, non-overlapping regions and clipping their histograms at a predefined threshold $T$ to avoid noise amplification. The clipped histogram $h_{\mathrm{clip}}(i)$ is defined by $h_{\mathrm{clip}}(i) = \min\big(h(i),\, T\big)$, where $h(i)$ denotes the original histogram count for bin $i$. The total excess $C$ is evenly redistributed to all $L$ histogram bins, as shown in Equation (1):

$$h_{\mathrm{final}}(i) = h_{\mathrm{clip}}(i) + \frac{C}{L}, \qquad i = 0, 1, \ldots, L - 1, \tag{1}$$

with $C = \sum_{i=0}^{L-1} \max\big(h(i) - T,\, 0\big)$ and $L$ the number of intensity bins. Subsequently, the pixel intensities are remapped by their Cumulative Distribution Function (CDF) as per Equation (2):

$$s(i) = \left\lfloor \frac{L - 1}{N} \sum_{j=0}^{i} h_{\mathrm{final}}(j) \right\rfloor, \tag{2}$$

where $N$ represents the total number of pixels in the tile. This procedure efficiently improves local contrast while avoiding noise and artifact amplification. Bilateral filtering is then applied to reduce noise while preserving edges. This filter performs non-linear, edge-preserving smoothing by averaging nearby pixel intensities, as given in Equation (3):

$$I_{\mathrm{filtered}}(p) = \frac{1}{W_p} \sum_{q \in \Omega} I(q)\, G_{\sigma_s}\big(\lVert p - q \rVert\big)\, G_{\sigma_r}\big(\lvert I(p) - I(q) \rvert\big), \tag{3}$$

with $G_{\sigma_s}$ and $G_{\sigma_r}$ representing the spatial and range kernel functions, $\Omega$ denoting the neighborhood of pixel $p$, and $W_p$ acting as a normalization factor.
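To make Equations (1) and (2) concrete, the following NumPy sketch applies the clip-and-redistribute step and the CDF remapping to a single tile. It omits the iterative redistribution refinement and inter-tile interpolation that full CLAHE implementations perform; the function name is illustrative.

```python
import numpy as np

def clahe_tile_remap(tile: np.ndarray, T: int, L: int = 256) -> np.ndarray:
    """Apply Eqs. (1)-(2) to one grayscale tile (uint8 values in [0, L-1])."""
    h, _ = np.histogram(tile, bins=L, range=(0, L))
    C = np.maximum(h - T, 0).sum()          # total excess above the clip limit T
    h_final = np.minimum(h, T) + C / L      # Eq. (1): clip, then redistribute evenly
    cdf = np.cumsum(h_final)                # cumulative distribution (sums to N)
    N = tile.size                           # N: total pixels in the tile
    lut = np.floor((L - 1) * cdf / N).astype(np.uint8)  # Eq. (2): CDF remapping
    return lut[tile]                        # remap every pixel through the LUT
```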
Figure 4 shows the changes in the image after applying CLAHE and bilateral filtering.
To further augment the dataset and aid generalization, transformation techniques such as rotation, flipping, and zooming were applied [46]. All enhanced images were then resized to $224 \times 224$ pixels, converted to RGB format, and normalized using ImageNet’s mean $\mu = (0.485, 0.456, 0.406)$ and standard deviation $\sigma = (0.229, 0.224, 0.225)$, expressed as Equation (4):

$$I_{\mathrm{norm}} = \frac{I - \mu}{\sigma}, \tag{4}$$

with $I$ representing the pixel values. Normalization using ImageNet statistics ensures feature compatibility with pretrained CNNs, stabilizing learning and improving convergence. Augmentation techniques, including rotation, flipping, and zooming, further mitigate overfitting by providing geometric diversity. The processed and augmented dataset was then fed in mini-batches to DenseNet121 and EfficientNetB0 for deep feature extraction. The resulting feature representations were concatenated and subsequently classified using an ensemble of machine learning classifiers (Random Forest, SVM, and XGBoost) with soft voting.
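The enhancement and augmentation pipeline can be sketched with OpenCV and torchvision as below. The clip limit 2.0, (8, 8) tile grid, 224 × 224 input size, and ImageNet statistics follow the text, while the bilateral filter values, rotation angle, and zoom range are assumptions standing in for the exact settings.

```python
import cv2
import numpy as np
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

def enhance(gray: np.ndarray) -> np.ndarray:
    # CLAHE with the paper's clip limit 2.0 and (8, 8) tile grid
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    out = clahe.apply(gray)
    # Bilateral filter; d and sigma values are common OpenCV settings,
    # assumed here in place of the paper's exact values
    return cv2.bilateralFilter(out, d=9, sigmaColor=75, sigmaSpace=75)

# Resize, convert to 3-channel RGB, normalize per Eq. (4), and apply the
# geometric augmentations named in the text (angles/ranges are assumptions).
train_tf = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Grayscale(num_output_channels=3),        # grayscale CT -> RGB
    transforms.Resize((224, 224)),
    transforms.RandomRotation(15),                      # rotation
    transforms.RandomHorizontalFlip(),                  # flipping
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # zooming
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # Eq. (4)
])
```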
Two more virtual datasets were created using separate augmentation techniques to further validate the proposed model’s performance on varied data. The process of creating these virtual datasets is shown in Figure 5.
3.3. Feature Extraction
Feature extraction is an important technique as it provides the most informative patterns for accurate diagnosis [47]. While several studies have applied single CNN architectures such as ResNet or VGG for kidney CT analysis, their limited feature diversity restricts generalization. The proposed hybrid feature extraction leverages both dense connectivity (DenseNet121) and compound scaling (EfficientNetB0) to capture complementary spatial and semantic representations, thus improving discriminative capacity. DenseNet121 introduces dense feature reuse and reduces the problem of vanishing gradients, while EfficientNetB0 relies on compound scaling for an ideal balance between accuracy and efficiency. Their complementary feature hierarchies therefore allow for better texture sensitivity and lesion discrimination than VGG or ResNet, as confirmed by preliminary experiments. In this study, a hybrid deep feature extraction approach was used, employing two well-established convolutional neural networks (CNNs)—DenseNet121 and EfficientNetB0—both pretrained on ImageNet. The classifier heads were removed, retaining their respective deep feature representations, which were subsequently fused into a unified descriptor for each CT image.
3.3.1. DenseNet121
DenseNet utilizes dense connections, expressed as Equation (5), where each layer directly accesses all previous feature maps:

$$x_{\ell} = H_{\ell}\big([x_0, x_1, \ldots, x_{\ell-1}]\big), \tag{5}$$

with $[\cdot]$ denoting concatenation of the previous layer outputs and $H_{\ell}(\cdot)$ representing a composite operation (batch normalization, ReLU, and convolution). This architecture efficiently reuses features, reduces redundancy, and helps mitigate the vanishing gradient problem.
For feature extraction, we removed the classifier layer and used the output of the last pooling layer, which yielded a 1024-dimensional vector.
3.3.2. EfficientNetB0
EfficientNet utilizes a compound scaling method alongside depthwise separable convolutions (MBConv) and squeeze-and-excitation modules to enable lightweight yet powerful feature extraction. The last convolutional layer produces a 1280-dimensional vector.
3.3.3. Feature Fusion
To leverage the complementary representations of both networks, their respective feature vectors (1024- and 1280-dimensional) were concatenated as shown in Equation (6):

$$f_{\mathrm{fused}} = \big[\, f_{\mathrm{DenseNet121}} \,\|\, f_{\mathrm{EfficientNetB0}} \,\big] \in \mathbb{R}^{2304}, \tag{6}$$

forming a unified 2304-dimensional descriptor. Feature concatenation was preferred over additive or attention-based fusion to preserve all discriminative information from both networks; preliminary experiments showed that concatenation provided higher validation accuracy than other fusion methods. This fused vector was subsequently fed into the classifiers (SVM, Random Forest, XGBoost) for robust and accurate kidney disease detection, as shown in Figure 6.
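A minimal torchvision sketch of the fusion in Equation (6) follows. It assumes ImageNet-normalized 224 × 224 inputs and uses global average pooling over each backbone’s final feature maps, which mirrors (but is not guaranteed to be identical to) the exact pooling used here.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Load both backbones pretrained on ImageNet; the classifier heads are
# never called, only the convolutional feature extractors.
densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1).eval()
effnet   = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1).eval()

@torch.no_grad()
def fused_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (B, 3, 224, 224), ImageNet-normalized -> (B, 2304) descriptor."""
    d = F.adaptive_avg_pool2d(densenet.features(batch), 1).flatten(1)  # (B, 1024)
    e = F.adaptive_avg_pool2d(effnet.features(batch), 1).flatten(1)    # (B, 1280)
    return torch.cat([d, e], dim=1)                                    # Eq. (6)
```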
These classifiers are employed in an ensemble manner, where each classifier contributes its class-probability predictions, which are then combined to form a stronger final decision. This approach leverages the complementary strengths of the individual models, enhancing overall classification performance.
3.4. SMOTE Balancing
To reduce the class imbalance problem of the stone class, which leads to more misclassifications, SMOTE is applied. The basic working principle of SMOTE is illustrated in Figure 7.
Traditionally, SMOTE is applied to input-level or shallow feature data to synthetically balance the classes. Applying it directly to high-dimensional deep feature embeddings can risk generating unrealistic samples. However, in this study, SMOTE was applied to the fused deep feature representations obtained by concatenating the outputs of DenseNet121 and EfficientNetB0. This design choice was motivated by the observation that the fused embeddings retained meaningful class separability in the high-dimensional feature space, allowing SMOTE to interpolate effectively between minority class neighbors.
Instead of simply duplicating minority class samples, SMOTE generates new synthetic examples by interpolating between existing minority class instances and their nearest neighbors. This process helps balance the dataset by increasing the representation of the minority class, thereby improving the model’s ability to learn meaningful patterns from underrepresented data. By reducing bias toward the majority class, SMOTE enhances the overall classification performance and ensures more reliable predictions in imbalanced learning tasks. To minimize the risk of unrealistic feature distributions after SMOTE augmentation, the generated feature vectors were inspected using dimensionality reduction visualization techniques like t-SNE on representative subsets, which indicated that synthetic features followed similar patterns to the original data.
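A minimal imbalanced-learn sketch of this step is given below; the variable names and the random seed are illustrative assumptions.

```python
from imblearn.over_sampling import SMOTE

# Balance the 2304-dimensional fused training features. SMOTE interpolates
# between minority-class samples and their nearest neighbors in feature space.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_feats, y_train)
```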
3.5. Classification Models
After obtaining the 2304-dimensional fused feature vector from DenseNet121 and EfficientNetB0, traditional machine learning classifiers were used for final prediction, including Support Vector Machine (SVM), Random Forest (RF), and XGBoost. In practice, all classifiers were trained on the fused feature vectors with mostly default hyperparameters. Specifically, the SVM used an RBF kernel; Random Forest employed 100 estimators; and XGBoost used 200 boosting rounds with a default learning rate and maximum depth. Since these classifiers operate on fixed feature vectors, no batch size or training epochs were required. Finally, a soft-voting ensemble strategy was applied to combine their predictions and achieve greater robustness and accuracy.
3.5.1. Support Vector Machine (SVM)
SVM is a supervised algorithm that finds the hyperplane that maximally separates classes in high-dimensional space. In this study, a radial basis function (RBF) kernel was used with probability estimates enabled and a fixed random state to ensure reproducibility.
3.5.2. Random Forest
Random Forest is an ensemble method that constructs multiple decision trees on different subsets of training data and merges their predictions, thereby reducing overfitting and improving generalization. Here, we used 100 trees with a set random state.
3.5.3. XGBoost
XGBoost is a high-performance algorithm based on gradient boosting and was also implemented with label encoding disabled and multiclass logarithmic loss as the evaluation metric, allowing for strong and accurate predictions.
To combine the strengths of these classifiers, a soft-voting ensemble was used, averaging their predicted class probabilities. The final class label was then assigned by choosing the highest average probability class. This approach resulted in a more balanced and robust classifier for kidney CT image recognition.
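The ensemble described above maps directly onto scikit-learn’s soft-voting API, sketched below with the stated settings (RBF SVM with probability estimates, 100-tree Random Forest, 200 XGBoost boosting rounds); the random seeds are assumptions, and recent xgboost versions disable label encoding by default.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

svm = SVC(kernel="rbf", probability=True, random_state=42)       # RBF SVM
rf  = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 trees
xgb = XGBClassifier(n_estimators=200, eval_metric="mlogloss")    # 200 rounds,
                                                                 # multiclass log loss

# Soft voting averages the three predicted probability vectors and
# assigns the class with the highest mean probability.
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("rf", rf), ("xgb", xgb)], voting="soft")
ensemble.fit(X_train_bal, y_train_bal)
```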
5. Results and Discussion
During the training phase, 70% of the 12,446 CT images were used for training and 15% for validation, which monitors whether the model is overfitting. After training, the remaining 15% (1869 images never seen during training) was used to test the model’s performance on the baseline dataset. This split provides a balanced evaluation between learning and generalization to unseen data. To further assess the robustness of the proposed model, two additional augmented datasets were created: Dataset A and Dataset B. This allowed an investigation of how training diversity and unseen augmentation influence model performance. The results on these three datasets are discussed in the following subsections.
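Assuming the fused features and trained ensemble from Section 3, the held-out evaluation reduces to standard scikit-learn metric calls; the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = ensemble.predict(X_test_feats)   # 1869 held-out fused feature vectors
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred,
                            target_names=["cyst", "normal", "stone", "tumor"]))
print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted
```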
Figure 8 presents the confusion matrix of the proposed hybrid model on the baseline dataset.
In the matrix, the diagonal elements represent correctly classified instances, and the off-diagonal elements indicate misclassifications. The numbers of incorrectly classified images are 18, 18, and 21 for the baseline, Dataset A, and Dataset B, respectively. Although the model performs consistently well across all datasets, a slightly higher misclassification rate is observed in the stone class. This is mainly attributed to the smaller sample size of the stone category, which limits the model’s ability to learn its feature variability, even after applying SMOTE balancing. Additionally, visual inspection of the misclassified cases shown in Figure 9 revealed that some stone images share overlapping visual characteristics with the cyst and tumor classes, such as low-contrast textures and irregular boundaries, making them challenging to distinguish even for deep learning models.
The classification report of the baseline dataset is shown in Table 2, its AUC curve in Figure 10, and its loss curve in Figure 11. For Dataset A, Figure 12 shows the confusion matrix, Table 3 presents the classification report, Figure 13 shows the AUC curve, and Figure 14 shows the loss curve. For Dataset B, the confusion matrix is shown in Figure 15 and the classification report in Table 4.
The model’s performance across the three datasets shows outstanding results on the baseline dataset, with an overall accuracy of 99.21% and near-perfect class-level performance, where all classes (cyst, normal, stone, tumor) obtained F1-scores between 0.98 and 1.00. Comparable results were observed on Dataset A, which also achieved 99% accuracy, with only a slight decline in the stone class (F1 = 0.98), while the remaining classes, particularly cyst and tumor, maintained perfect scores. Dataset B contains augmented test samples with higher texture variability. As seen in the figures, the AUC values for all classes remain close to 1.0, indicating excellent discriminative ability. A high AUC demonstrates that the model can effectively differentiate positive and negative cases across all thresholds, providing a robust measure of performance beyond simple accuracy. The loss curves show smooth convergence without divergence between training and validation losses, suggesting that the model learned effectively without overfitting. This consistency across datasets confirms the regularizing effect of CLAHE, bilateral filtering, and SMOTE on generalization. Overall, while the model demonstrates impressive robustness on the baseline dataset and maintains it on Dataset A, the results on Dataset B highlight the challenge of coping with more complex data distributions.
Table 5 presents a compact view of all the models that have been tested, including feature fusion strategies, a combination of classifiers, and performance metrics. Several architectures and classifier combinations were evaluated to justify the final model choice. DenseNet121 and EfficientNetB0 were selected due to their complementary characteristics—DenseNet’s dense connectivity promotes feature reuse and gradient stability, while EfficientNet’s compound scaling offers efficiency and fine-grained texture sensitivity.
We implemented six single and four hybrid models combining different feature extractors and classifiers for kidney disease classification. Models like VGG-16 with Random Forest and ResNet-101 with Softmax performed moderately, while deeper architectures such as VGG-19 and ResNet-50 achieved higher accuracy by capturing finer features. DenseNet-121 and ResNet-101 performed best among single and dual models, reaching about 98% accuracy due to DenseNet’s dense connectivity and ResNet’s gradient stability.
An ablation study was conducted to assess the contribution of preprocessing, feature fusion, and classifier combination. CLAHE and bilateral filtering improved image clarity and contrast, enhancing feature extraction. The fusion of DenseNet121 and EfficientNetB0 further improved performance by combining dense and efficient feature representations. Finally, the tri-ensemble of Random Forest, SVM, and XGBoost outperformed individual classifiers, yielding the most stable and accurate predictions.
The final model—combining CLAHE, bilateral filtering, DenseNet-121, EfficientNet-B0, and the tri-ensemble classifier—achieved 99.24% accuracy, confirming the effectiveness of deep feature fusion and ensemble learning.
Finally, the model is compared with existing studies on the same dataset, as shown in Table 6.
From the comparison table, we can see that our proposed hybrid model achieves the best results overall. Among the two hybrid models, [37] achieved the best accuracy of 98.22%, while the YOLO-v8 model [51] had the lowest accuracy. In contrast, our model—with CLAHE and bilateral filtering for state-of-the-art image enhancement, feature extraction from DenseNet121 and EfficientNet-B0, and a robust ensemble classifier (Random Forest, SVM, and XGBoost)—achieved a record 99.24% accuracy with 0.99 Precision, Recall, and F1-score. This testifies to the efficacy of the proposed hybrid and ensemble-based approach in delivering not just high accuracy but also uniform and stable performance across all classes, setting a new standard for kidney disease classification.
Finally, the comparative assessment firmly indicates the superiority of the proposed model over current techniques in kidney disease classification. By utilizing advanced preprocessing techniques (CLAHE and bilateral filtering), strong deep feature extractors (DenseNet121 and EfficientNet-B0), SMOTE class balancing, and a robust ensemble of classifiers (Random Forest, SVM, and XGBoost), the proposed model achieved an excellent accuracy of 99.24%, with near-perfect Precision, Recall, and F1-scores. This not only surpasses all the models reported earlier but also verifies the power of combining image enhancement, deep learning, and ensemble techniques in achieving robust and highly accurate medical image classification.