3. Materials and Methods
The methodology included quality selection, augmentation, noise modeling and generation, feature extraction via transfer learning, and classification by ensemble learning, as illustrated in
Figure 2.
Our proposed approach utilized an ensemble of pre-trained DNNs, addressing the challenges of limited medical image data and of noise in medical images through augmentation and transfer learning.
Neural network performance is greatly impacted by image quality, and accuracy can deteriorate when training and testing data are inconsistent [41,42]. Data augmentation is a popular method for reducing overfitting in DNN training [43,44]. A variety of augmentation techniques, including rotation, projective transformation, warping, and cropping, are used to create varied images such as ultrasound images. An augmented image datastore was utilized for automatic resizing to preserve homogeneity. Random vertical flipping, translation of up to 30 pixels, and 10% scaling in both directions were applied as additional modifications. By reducing overfitting and dependence on specific image characteristics, these augmentations help enhance model generalization.
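The following is a minimal sketch of such an augmentation pipeline using torchvision transforms; the input size and the fractions used to approximate a 30-pixel translation and 10% scaling are illustrative assumptions, not the authors' original implementation.

```python
import torchvision.transforms as T

IMG_SIZE = 256  # assumed input size for the Darknet models

# Hypothetical augmentation pipeline approximating the described operations:
# random vertical flip, translation up to ~30 px, and +/-10% scaling.
train_transforms = T.Compose([
    T.Grayscale(num_output_channels=3),             # grayscale, replicated to 3 channels
    T.Resize((IMG_SIZE, IMG_SIZE)),                 # automatic resizing for homogeneity
    T.RandomVerticalFlip(p=0.5),                    # random vertical flipping
    T.RandomAffine(
        degrees=0,
        translate=(30 / IMG_SIZE, 30 / IMG_SIZE),   # translation up to ~30 pixels
        scale=(0.9, 1.1),                           # 10% scaling in both directions
    ),
    T.ToTensor(),                                   # pixel normalization to [0, 1]
])
```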
For training and testing, we used a dataset of 4413 ultrasound images, classified into ‘normal’ (1821 images) and ‘stone’ (2592 images). These were gathered, while maintaining privacy, from an open-source collection of 5431 images of normal kidneys and kidneys with stones that also provides a pre-trained kidney-stone-ultrasound model and API [45]. Images were collected across multiple clinical sites using mid-range portable devices (Mindray, produced by Shenzhen Mindray Bio-Medical Electronics Co., Ltd., Shenzhen, China; GE Logiq, manufactured by GE HealthCare, Wauwatosa, WI, USA; and Philips, manufactured by Philips Healthcare, Bothell, WA, USA) and console systems. Gain was adjusted on a per-scan basis by operators to ensure optimal contrast between kidney parenchyma and stones, and these gain variations are visible as differences in image brightness. Depth settings ranged from approximately 8 cm to 16 cm, depending on patient size and kidney location. Scans included transverse, longitudinal, and oblique views, chosen to best visualize stone echogenicity and acoustic shadows. The 4413 images selected for this study were those clearly labelled as either ‘normal’ or ‘stone’ out of the 5431 images. The dataset was partitioned as follows: 60% for training, 20% for testing, and the remaining 20% for validation.
Images underwent preprocessing to correct distortions and lighting issues before further analysis. This included resizing, grayscale conversion, and pixel normalization to match the model input sizes; for instance, ResNet requires a 224 × 224 input size.
The images used in the dataset of this study were modified by adding different types of noise: speckle, salt and pepper, Poisson, and Gaussian.
Figure 3 shows an example of a Gaussian distributed noise model with zero mean and a specified standard deviation, together with two example ultrasound images before and after noise addition. The noise values are added pixel-wise to the original image to create the noisy image. It can be noted that the randomly distributed noisy pixels take random RGB values.
To ensure reliability, the noise was added randomly, with each image associated with one randomly selected noise type. Moreover, each noise was added with a random variance value to demonstrate robustness of performance. Noise was added to both training and testing images.
Speckle noise is formed as follows:
fn(x, y) = f(x, y) + γ f(x, y) η(x, y)
where fn(x, y) is the image after adding the speckle noise, f(x, y) is the image without added noise, and η(x, y) is zero-mean Gaussian noise. The term γ refers to the multiplicative noise model when its value is 1. The variance or density values of speckle noise, as well as of the other noise types, were chosen randomly for each image, ranging from 0.01 to 0.1.
Salt and pepper noise replaces some pixels in the image with either 0 or 255. It can be expressed as follows:
x(i, j) = Ns with probability γ/2, Np with probability γ/2, and the original pixel value with probability 1 − γ
where Ns is the maximum pixel value (salt noise), Np is the minimum pixel value (pepper noise), x(i, j) is the pixel value after corruption with the added noise, and γ refers to the noise density.
Figure 4 shows examples of noisy generated ultrasound images that were used for training or testing. The noise type is random, as well as the variance/density values.
The peak signal-to-noise ratio (PSNR) is a quality measure widely used to evaluate images in image processing. PSNR is obtained from the logarithm of the mean squared error (MSE) of an image. Since grayscale images are 2D, the MSE is computed over the M × N dimensions of the image as follows:
MSE = (1 / (M N)) Σx Σy [ I(x, y) − I′(x, y) ]²
where M and N are the image dimensions (with O denoting the number of channels for multi-channel images), I(x, y) refers to the pixel value of the original image at (x, y), and I′ is the output image. PSNR can then be calculated as follows:
PSNR = 10 log10 ( m² / MSE )
where m is the largest scale (maximum pixel) value.
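As a hedged illustration, the following sketch computes MSE and PSNR directly from these definitions and obtains SSIM via scikit-image; it is an assumed reference implementation, not the authors' original code.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(original: np.ndarray, noisy: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR from the MSE computed over the M x N image dimensions."""
    original = original.astype(np.float64)
    noisy = noisy.astype(np.float64)
    mse = np.mean((original - noisy) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Example usage on a grayscale image pair with values in [0, 255]:
# p = psnr(clean, corrupted)
# s = ssim(clean, corrupted, data_range=255)
```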
In our experiments, we added the noise to simulate real-world degradation in ultrasound images. For each image, one of four noise types, Gaussian, Poisson, salt and pepper, or speckle, was randomly applied. The selection was performed by a uniform discrete distribution, which ensures equal probability (25%) for each noise type. For Gaussian, salt and pepper, and speckle noise, the noise level (variance or density) was drawn uniformly at random from the interval [0.01,0.1]. Poisson noise does not take a variance parameter and was applied in its default configuration. This random noise generation procedure was applied independently to each image in both the training and testing sets, ensuring a uniform and unbiased distribution of noise types and intensities across the dataset. This approach provides sufficient variability while also ensuring that the methodology is reproducible, as the ranges used are explicitly defined.
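A minimal sketch of this random noise procedure, using skimage.util.random_noise under the assumption that its 'gaussian', 'speckle', 's&p', and 'poisson' modes adequately approximate the noise models described above:

```python
import numpy as np
from skimage.util import random_noise

rng = np.random.default_rng()

def corrupt(image: np.ndarray) -> np.ndarray:
    """Apply one uniformly chosen noise type with a random level in [0.01, 0.1]."""
    noise_type = rng.choice(["gaussian", "poisson", "s&p", "speckle"])  # 25% each
    level = rng.uniform(0.01, 0.1)  # variance or density, depending on the noise type

    if noise_type == "poisson":
        noisy = random_noise(image, mode="poisson")            # no variance parameter
    elif noise_type == "s&p":
        noisy = random_noise(image, mode="s&p", amount=level)  # density of corrupted pixels
    else:
        noisy = random_noise(image, mode=noise_type, var=level)  # gaussian / speckle variance

    return noisy  # float image scaled to [0, 1]
```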
Figure 5 shows the measured PSNR and Structural Similarity Index (SSIM) values of several noisy kidney ultrasound images with respect to their original counterparts. Based on the figure, the trends of PSNR and SSIM reveal significant variability in quality and structural integrity. PSNR values range from 10 dB to 50 dB, indicating a broad spectrum of image fidelity. High PSNR values (>40 dB) suggest near-lossless quality, likely resulting from low noise intensity, while lower values (20–30 dB) are typical of lossy compression or mild noise. An extremely low PSNR (<20 dB) points to severe distortions, such as synthetic corruptions. PSNR’s limitations are evident: it fails to penalize blur or perceptual distortions adequately and is sensitive to global intensity shifts, which may not align with human perception. This makes PSNR more suitable for quantifying noise reduction in technical applications than for assessing visual quality.
On the other hand, SSIM values are consistently low, ranging from 0.1 to 0.7, with most scores below 0.5. This indicates moderate to severe structural degradation across the dataset, with no images approaching the ideal score of 1. The stability of SSIM trends, compared with PSNR, suggests systemic structural distortions, possibly from a preprocessing step or inherent dataset issues. SSIM’s sensitivity to luminance, contrast, and structure makes it correlate better with human judgment, particularly for localized distortions such as edge artifacts. However, the absence of high SSIM outliers implies that the dataset lacks pristine images. The divergence between PSNR and SSIM is notable: some images with a moderate PSNR (30–40 dB) exhibit a low SSIM (~0.3), highlighting PSNR’s inability to capture structural degradation such as blur or texture loss. Conversely, noisy but structurally intact images may have a low PSNR but a moderate SSIM.
The joint analysis of PSNR and SSIM reveals critical insights into the dataset’s composition and potential biases. For instance, a high PSNR with a low SSIM often occurs in blurry denoised images, where noise reduction sacrifices texture details. A low PSNR with a moderate SSIM might reflect noisy but structurally preserved images, such as those with film grain.
Transfer learning can be utilized effectively for classification and detection. In transfer learning, a network is first trained on a sizable collection of labeled natural images to learn general characteristics, and those features are then applied to classify other datasets. In our approach, DNN models trained on images from a source domain are used to classify kidney ultrasound images as either normal or containing stones for multiple cases. Features extracted during training are passed to a SoftMax classifier. We used three pre-trained DNNs, Darknet19, Darknet53, and InceptionV3, to classify kidney ultrasound images into either two categories (normal and stones) or four categories: the first case uses the original dataset; the second case adds noise to all images (normal and stones); and the third case combines the original dataset (normal and stones) with the noisy dataset (noisy normal and noisy stones).
The pre-trained models help extract features from the kidney images, which are then used for classification. Since the method uses an ensemble of these models, the predictions from all three networks are combined to produce the final result.
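As a hedged sketch of this transfer-learning setup in PyTorch (assuming torchvision's ImageNet-pretrained InceptionV3 as one ensemble member; the Darknet backbones would be adapted analogously), the pre-trained backbone is kept and only the final classification head is replaced for the kidney ultrasound classes:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # e.g., normal, stone, noisy normal, noisy stone

# Load an ImageNet-pretrained backbone and replace its classification head.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)                       # new head
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)   # auxiliary head

# Optionally freeze the backbone so that only the replaced heads
# (and the auxiliary branch) are fine-tuned initially.
for name, param in model.named_parameters():
    if not name.startswith(("fc", "AuxLogits")):
        param.requires_grad = False
```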
The ensemble method improves classification by combining predictions from multiple models [46,47], reducing variance [48] and bias [49]. This approach combines three pre-trained DNNs in an ensemble for accurate kidney ultrasound image classification. Majority voting is used, where each model votes on a test instance, and the final prediction is based on the majority vote, enhancing overall model performance.
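A minimal sketch of the majority-voting step, assuming each of the three fine-tuned networks outputs class scores for a batch of images (tensor shapes and the tie-breaking rule are illustrative assumptions):

```python
import torch

def majority_vote(logits_a: torch.Tensor,
                  logits_b: torch.Tensor,
                  logits_c: torch.Tensor) -> torch.Tensor:
    """Combine per-model predictions by majority vote over class labels.

    Each input has shape (batch, num_classes). Ties are broken by the
    average of the class probabilities across the three models.
    """
    preds = torch.stack([logits_a.argmax(dim=1),
                         logits_b.argmax(dim=1),
                         logits_c.argmax(dim=1)], dim=1)     # (batch, 3)
    voted = torch.mode(preds, dim=1).values                   # majority label per sample

    # Tie-break (all three disagree): fall back to the averaged soft probabilities.
    probs = (logits_a.softmax(1) + logits_b.softmax(1) + logits_c.softmax(1)) / 3
    no_majority = ((preds[:, 0] != preds[:, 1]) &
                   (preds[:, 1] != preds[:, 2]) &
                   (preds[:, 0] != preds[:, 2]))
    voted[no_majority] = probs[no_majority].argmax(dim=1)
    return voted
```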
Our goal was to create a lightweight, computationally efficient ensemble. Inceptionv3 and Darknet53 have a deep architecture with 48 and 53 layers, respectively, while Darknet19 has fewer layers, requiring less computational power.
Table 1 compares three convolutional neural network architectures—InceptionV3, Darknet19, and Darknet53—across four critical hyperparameters directly influencing model performance and practical deployment. These parameters include network depth, parameter memory footprint, total trainable parameters, and required input image dimensions. Each architecture presents unique trade-offs between computational complexity, feature extraction capability, and hardware requirements, making them suitable for clinical scenarios.
The comparison reveals important technical trade-offs that inform clinical implementation decisions. Darknet19 emerges as the optimal choice for edge devices and point-of-care applications due to its compact size and efficient operation. InceptionV3 provides a middle ground, offering enhanced feature detection capabilities while remaining within reasonable computational limits. Darknet53 serves as a premium solution for high-accuracy diagnostic workstations where hardware resources are abundant. The ensemble approach combining all three networks likely capitalizes on their complementary strengths, using Darknet19 for initial screening, InceptionV3 for intermediate processing, and Darknet53 for the final confirmation of challenging cases.
These architectural differences also highlight important considerations for clinical workflow integration. The memory requirements directly impact deployment feasibility, with Darknet53’s 159 MB footprint potentially limiting its use in mobile applications. Input resolution requirements affect preprocessing pipelines, where InceptionV3’s 299 × 299 images may require a more substantial transformation of source images than the 256 × 256 standards of the Darknet variants. The parameter counts correlate with model expressiveness, explaining why Darknet53 achieves a high accuracy despite its greater resource consumption.
The deep learning models used in medical image classification, including those in this study, are primarily pre-trained on large-scale, natural image datasets such as ImageNet (ILSVRC). This dataset, which includes over 1.2 million images from 1000 object categories, serves as the source domain for transfer learning. While ImageNet images are from non-medical, natural scenes, the models learn generalizable low- and mid-level visual features that are useful for downstream tasks in medical imaging, the target domain.
InceptionV3, GoogLeNet, and EfficientNetB0 were pre-trained on ImageNet and adapted for medical use by replacing their final classification layers with domain-specific output heads. These models benefit from their architectural ability to capture multi-scale spatial features, which are particularly useful when identifying structures like kidney stones or tissue textures in ultrasound images.
Darknet19 and Darknet53 were also pre-trained on ImageNet. These models are particularly adept at real-time object localization and detection, which translates well into identifying distinct echogenic regions, such as renal stones in ultrasound scans. In medical adaptation, their classification layers are modified and fine-tuned on annotated medical data.
AlexNet and ResNet variants (ResNet18, ResNet50, and ResNet101) were also trained on ImageNet. The ResNet family, known for its residual learning capabilities, helps in training deeper models without vanishing gradient issues. These models transfer well to ultrasound imaging tasks due to their ability to extract deep hierarchical features that capture both low-level texture and high-level anatomical patterns.
MobileNetV2 also originates from ImageNet training. Its depth-wise separable convolutions make it computationally efficient, allowing adaptation to medical scenarios where lightweight models are essential, especially in point-of-care or portable ultrasound systems.
Vision Transformers (ViTs) are pre-trained on large datasets like ImageNet-21k or JFT-300M. In the medical domain, they are fine-tuned to learn spatial and contextual dependencies, which can be beneficial for tasks requiring attention to specific anatomical regions, such as stone localization or tissue classification in grayscale images.
In adapting all these models to the medical domain, transfer learning plays a central role. It involves retaining the learned convolutional or transformer-based features from the source domain and fine-tuning the models on a target medical dataset. Additionally, domain adaptation techniques, such as image preprocessing, data augmentation (rotation, scaling, shifting), and regularization, are applied to bridge the domain gap between natural and medical images. This adaptation enhances performance, especially when working with relatively small annotated medical datasets compared to ImageNet-scale datasets.
4. Simulation Results and Discussion
In this section, the experimental settings are described, including the dataset used, the performance outcomes, and comparisons with other related works.
The experiments were performed using the following specifications: an Intel Core i7 PC at 2.8 GHz with 4 TB of SSD storage, a 64-bit operating system, and 16 GB of RAM. The computational tasks were performed using an NVIDIA 1050 GPU.
The dataset is publicly available [45], containing 4413 images classified into ‘normal’ (1821 images) and ‘stone’ (2592 images). The dataset was gathered from a clinic while maintaining patient privacy. Some images were rotated or translated, or had varying viewpoints and projections. Moreover, the intensity and lighting of some images were low, presenting significant challenges. The size of each image was 640 × 640.
An augmented image data store was used to resize the images. Additionally, augmentation methods such as random flipping, random translation, and random scaling in both height and width were used to help reduce overfitting by ensuring that the model learns general patterns instead of memorizing details.
The training configuration used a mini-batch size of 10, a choice that balances convergence speed and computational efficiency.
The initial learning rate was set to 0.0003 to ensure stable convergence and prevent the model from overshooting the optimal weights.
Training data were shuffled at every epoch to ensure that the model does not learn the order of the samples and generalizes better. All models in this study were trained and used with the same options and settings to ensure fairness in performance evaluations.
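A hedged sketch of these training options in PyTorch (the optimizer choice is an assumption; the text specifies only the mini-batch size, initial learning rate, and per-epoch shuffling):

```python
import torch
from torch.utils.data import DataLoader

# Assumed dataset object providing (image, label) pairs after preprocessing/augmentation.
# train_dataset = ...

def make_loader(train_dataset):
    # Mini-batch size of 10; shuffle=True reshuffles the data at every epoch.
    return DataLoader(train_dataset, batch_size=10, shuffle=True)

def make_optimizer(model):
    # Initial learning rate of 0.0003 for stable convergence.
    return torch.optim.Adam(model.parameters(), lr=3e-4)
```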
Training and testing were performed using five cases, described as follows:
1. The original kidney images without added noise (normal kidney images and images with kidney stones), together with the dataset with added random Gaussian noise.
2. The dataset without added noise (normal kidney images and images with kidney stones), together with the dataset with added random speckle noise.
3. The dataset without added noise (normal kidney images and images with kidney stones), together with the dataset with added random Poisson noise.
4. The dataset without added noise (normal kidney images and images with kidney stones), together with the dataset with added random salt and pepper noise.
5. The dataset without added noise (normal kidney images and images with kidney stones), together with the dataset with random added noise of a type drawn randomly from the four noise types (Gaussian, Poisson, speckle, or salt and pepper).
The evaluations were assessed using metrics such as precision, recall, accuracy, specificity, and F1 score. The F1 score is used when there is a trade-off between precision and recall, especially in scenarios where minimizing false positives or false negatives is critical. It represents the harmonic mean of precision and recall, providing a balanced metric that accounts for both measures. The F1 score is useful for maintaining an equilibrium between these metrics, ensuring a comprehensive evaluation of the model’s performance.
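For reference, a minimal sketch of how these metrics follow from confusion-matrix counts (a generic per-class computation, not the authors' evaluation script):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-class metrics derived from confusion-matrix counts."""
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    recall      = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)                      # harmonic mean
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy, "f1": f1}
```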
Figure 6 and Figure 7 show examples of the confusion matrices obtained for the classification results of the dataset (using InceptionV3, Darknet19, and Darknet53) together with the dataset with added Gaussian or speckle noise.
The classification performance evaluation in
Figure 6 using confusion matrices for InceptionV3, Darknet19, and Darknet53 models on a kidney ultrasound dataset, comprising both original and Gaussian-noise-augmented images, revealed clear distinctions in accuracy and robustness. Each confusion matrix summarizes the model’s predictions across four classes: kidney stone cases, normal kidney cases, and their counterparts with Gaussian noise. The results highlight how each model handles clean versus noisy inputs and its ability to differentiate between pathological and healthy conditions.
InceptionV3 achieved the highest performance among the three models. It classified 516 out of 518 kidney stone cases and all 364 normal kidney cases correctly. On noisy data, it accurately classified 508 of 518 kidney stone cases with Gaussian noise and 361 of 364 normal kidney cases with Gaussian noise. Misclassifications were minimal, with only 13 errors in total, most of which involved noisy data being incorrectly labeled within its noisy class. This reflects InceptionV3’s exceptional ability to generalize well even in the presence of Gaussian noise. Its deep and sophisticated architecture allows it to extract meaningful features despite variations in image quality, making it highly reliable for medical image classification.
Darknet19, on the other hand, demonstrated a noticeable drop in performance. It correctly classified 496 kidney stone cases and 363 normal kidney cases from the original dataset, but misclassified 22 kidney stone cases as normal kidneys and one normal kidney case as a kidney stone. For the Gaussian noise images, the model correctly labeled 488 of the kidney stone cases but misclassified 27 as noisy normal kidneys. Three normal kidney cases with Gaussian noise were also incorrectly predicted. With a total of 55 misclassifications, Darknet19 showed reduced robustness to noise and a higher tendency to confuse kidney stone and normal categories.
Darknet53 delivered a better performance than Darknet19 but still fell short of InceptionV3. It correctly classified 498 kidney stone cases and 359 normal kidney cases in the original dataset, with 5 normal kidneys being misclassified. For the noisy subset, the model correctly labeled 506 noisy kidney stone cases but misclassified 12 as noisy normal kidneys. Additionally, one normal kidney case with Gaussian noise was mislabeled. The total number of misclassifications was 38, reflecting moderate robustness to noise and good but not optimal classification ability. Its deeper architecture compared to Darknet19 seems to improve generalization, yet it still exhibits vulnerability to class confusion in noisy scenarios.
The impact of various types of noise can be assessed using their confusion matrices in
Figure 6 and
Figure 7. Each noise type affects the models in unique ways, influencing classification accuracy and class confusion differently.
When Gaussian noise is added, the model demonstrates relatively mild degradation in accuracy. For example, in one Gaussian noise case, the model correctly classified 516 out of 518 kidney stone images (99.6% sensitivity) and 364 out of 364 normal kidney images (100% sensitivity). Most errors occur as slight confusions between clean images and their Gaussian noise counterparts within the same class. For instance, 10 kidney stones with Gaussian noise images were misclassified as normal kidney with Gaussian noise, representing only about 1.9% misclassification in that class. This minor confusion shows the model remains largely robust to Gaussian noise, effectively identifying pathology despite the noise. The confusion mainly involves differentiation between clean and noisy samples rather than mixing across classes.
In contrast, Poisson noise shows a more pronounced negative effect on classification accuracy. One confusion matrix reveals that only 451 out of 518 kidney stone images (87%) were correctly classified. Similarly, kidney stone with Poisson noise images had 445 correct classifications out of 518 (85.9%), but 70 images (13.5%) were misclassified as normal kidney with Poisson noise. Normal kidney classes also show confusion, with up to 11 misclassifications out of 364 samples in some cases. These figures reflect a significant drop in sensitivity and an increase in false negatives and false positives under Poisson noise. The model’s ability to distinguish pathological and normal cases deteriorates noticeably, indicating that Poisson noise disrupts critical image characteristics more severely than Gaussian noise.
Speckle noise also causes measurable declines in classification accuracy. For example, in one speckle noise condition, kidney stone images were correctly classified in 480 out of 518 cases (92.7%), while 16 images were misclassified as normal kidney and 22 as kidney stone with speckle noise. Normal kidney images had 362 correct classifications out of 364 (99.5%), but some were confused with noisy classes. The kidney stone with speckle class had 488 correct out of 518 (94.2%) but experienced some misclassification into normal kidney classes. These results indicate increased intra- and inter-class confusion, particularly due to speckle noise altering essential texture information, thus moderately impairing the model’s discrimination power.
Salt and pepper noise caused the most severe degradation in classification performance, particularly for normal kidney classes. In one matrix, only 314 out of 364 normal kidney samples (86.3%) were correctly classified, with 46 (12.6%) misclassified as normal kidney with salt and pepper noise and 4 (1.1%) as kidney stone. Kidney stone images remained more robust, with 501 out of 518 samples correctly classified (96.7%), and only 108 misclassifications as normal kidney in some instances. The impulse nature of salt and pepper noise, introducing random black and white pixels, severely disrupts pixel intensities, leading to significant drops in accuracy for normal kidney detection. This noise strongly affects the model’s sensitivity and specificity for this class, highlighting a class-dependent vulnerability.
For our proposed ensemble model, the accuracy is 98.19% for the dataset including the added Poisson noise, 98.98% with the added Gaussian noise, 98.58% with the added salt and pepper noise, and 97.85% with the added speckle noise.
Figure 8 shows the confusion matrix obtained from the classification results of our proposed system for the fifth case (the dataset together with the dataset with a random added noise type). Using these confusion matrix values, performance is further evaluated using error, recall, specificity, precision, false positive rate, F1 score, Matthews correlation coefficient, and kappa. The performance parameter values are the following: accuracy of 98.75%, error of 1.25%, recall of 98.77%, specificity of 99.59%, precision of 98.62%, false positive rate of 0.41%, F1 score of 98.7%, Matthews correlation coefficient of 98.28%, and kappa of 96.67%. These values indicate a very precise classification of both positive and negative cases, regardless of the conditions of the images in the dataset and their varying properties, such as intensity levels, noise, rotation, viewpoint, translation, and lighting.
To handle class imbalance, random oversampling was used to balance the dataset, and the performance of the proposed system was re-tested using 2592 images for each of the four category labels. The corresponding ensemble accuracy after balancing is 98.07%.
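A minimal sketch of such random oversampling at the file-list level (duplicating minority-class samples until each class reaches the majority count; the variable names are illustrative):

```python
import random
from collections import defaultdict

def oversample(paths: list[str], labels: list[str], seed: int = 0):
    """Randomly duplicate minority-class samples so every class has the same count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for p, y in zip(paths, labels):
        by_class[y].append(p)

    target = max(len(v) for v in by_class.values())   # e.g., 2592 per class
    balanced = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((p, y) for p in items + extra)
    rng.shuffle(balanced)
    return balanced
```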
The proposed model demonstrates high performance on both noisy and original ultrasound data, accurately distinguishing between the presence and absence of kidney stones. This high level of accuracy indicates that the model can be relied on for deployment in real-time ultrasound analysis. Furthermore, its robustness to noise reflects the benefit of effective preprocessing and of training with augmented/noisy data. As a result, the model holds potential for diagnostic decision support, helping to reduce false positives and missed detections in kidney stone classification for ultrasound images.
The receiver operating characteristic (ROC) curves shown in
Figure 9 for the classification results of ultrasound images with added random noise (Gaussian, Poisson, speckle, and salt and pepper) provide an insightful comparison of the performance of three deep learning models: Darknet19, InceptionV3, and Darknet53. These curves reflect each model’s ability to distinguish between different classes under noisy conditions, where robustness is critical.
The ROC curve for Darknet19 shows good performance, with the curve quickly rising to the top-left corner of the plot, indicating a very low false positive rate and a high true positive rate. However, the smoothness of the curve suggests fewer threshold variations, likely due to the model’s relatively simpler architecture. The model achieves a true positive rate very close to 1 and a false positive rate close to 0.02, which demonstrates good discrimination even with added noise, but it is not as refined as more advanced networks. The area under the curve (AUC) appears high (close to 0.99), confirming the model’s general effectiveness, though perhaps with slightly reduced flexibility under more complex noise distributions.
InceptionV3, as expected, delivers the most refined and stable ROC curve. It consistently shows high true positive rates across a broader range of false positive thresholds. The curve’s smooth, tight progression toward the top-left corner with minimal deviation indicates a high level of confidence and robustness in classification, even when images are corrupted with complex noise types. The closeness to the upper-left corner and the tightly packed steps suggest fine granularity in decision thresholds and minimal misclassification. The AUC for InceptionV3 likely approaches 1.0, confirming its generalization capabilities and adaptability to noisy medical images. This performance affirms InceptionV3’s advantage in deep feature extraction and noise resilience.
Darknet53 also exhibits high performance, with a steep rise and a curve hugging the top-left axis, very similar to InceptionV3. However, slight deviations and reduced step smoothness suggest a marginally lower precision or slightly higher variability in classification under noise. The curve still maintains a high true positive rate and nearly negligible false positives. It performs better than Darknet19 and nearly matches InceptionV3, showing that its deeper architecture allows for better feature representation and better handling of noise. The AUC is again very high, close to 1.0, indicating a high classification reliability.
The provided set of images in
Figure 10 displays ultrasound scans of kidneys, with each grayscale image accompanied by the Grad-CAM (Gradient-weighted Class Activation Mapping) visualization. These overlays are commonly used in deep learning to highlight the regions of the image that contributed most to a model’s classification decision. In the medical domain, such tools are essential for interpreting how and why a neural network reaches a particular diagnosis, especially in scenarios like renal stone detection.
Figure 10a shows a standard grayscale ultrasound of a kidney. A noticeable hyperechoic (brighter) area can be seen, which is suggestive of a renal stone. These stones typically appear as bright echoes with posterior acoustic shadowing due to their dense composition. This clinical presentation aligns with what radiologists expect when diagnosing urolithiasis using ultrasound imaging. The corresponding Grad-CAM visualization in the top-right panel overlays a heatmap on this image, where the most intense red zone aligns with the suspected stone. The model focuses its attention on the suspicious region, which suggests that its prediction is based on clinically relevant features. This alignment between the model’s attention and medically significant structures enhances the interpretability and trustworthiness of the AI system.
Figure 10c presents another kidney ultrasound in a different orientation. This image also shows a dense echogenic area near the center, again raising suspicion of a renal stone. The structure’s location and appearance suggest a calculus located near the renal pelvis. The corresponding Grad-CAM map indicates that the deep learning model directs its attention to this same region, evidenced by the red coloration on the heatmap. This correlation between the model’s highlighted region and the likely pathology indicates that the neural network is learning to focus on diagnostically meaningful cues rather than spurious patterns or background noise.
These images and their Grad-CAM overlays demonstrate a level of model interpretability. The deep learning system appears to attend to the correct anatomical regions, those most relevant for diagnosing renal stones. This is vital not only for performance but also for clinical acceptance. Physicians need to understand and trust the reasoning behind automated classifications, especially in healthcare settings.
The visualizations suggest that the model has learned to focus on clinical features, such as echogenic foci and their acoustic shadows, which are indicative of renal stones. This demonstrates the potential for AI-assisted diagnosis in medical imaging.
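A hedged sketch of how such Grad-CAM maps can be produced in PyTorch, hooking the last convolutional block of one ensemble member (the specific layer passed as conv_layer is an assumption and would differ per architecture):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class, conv_layer):
    """Return a Grad-CAM heatmap (H x W, values in [0, 1]) for one image tensor (C, H, W)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["maps"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["maps"] = grad_out[0]

    h1 = conv_layer.register_forward_hook(fwd_hook)
    h2 = conv_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    score = model(image.unsqueeze(0))[0, target_class]
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    # Weight each activation map by its average gradient, then ReLU and normalize.
    weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)     # (1, C, 1, 1)
    cam = F.relu((weights * activations["maps"]).sum(dim=1))        # (1, H', W')
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```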
The proposed method was also tested on another dataset [50], which contains 2776 kidney stone ultrasound images and 2161 normal kidney ultrasound images sourced from low- to mid-range clinical portable units (Philips Lumify and Clarius) used in academic ultrasound labs. The depth varied between 6 cm and 14 cm, with common settings around 10 cm. The dataset included standard renal protocol views (transverse and longitudinal) and some oblique and angled views.
With the randomly added noise, the total number of images was 9874. Based on the classification results using the proposed ensemble method, the accuracy obtained with the dataset in [50] was 98.38%.
The proposed method outperforms the existing DNNs in [24,25,26,27,28,29]. Moreover, it also outperforms other recent approaches for detecting kidney stones in ultrasound images [33,34,35,36,37], achieving a classification accuracy of 99.43% on high-quality images (the original dataset images) and 99.21% on the original dataset images with added noise. The method achieved a maximum classification accuracy of 98.75% on the original dataset together with the original dataset with added noise (four total classes, the fifth case).
The method was compared with ViTs. ViTs have emerged as a transformative approach in the field of computer vision, offering an alternative to traditional CNNs. Originally developed for natural language processing, the transformer architecture has been adapted to visual tasks by treating images as sequences of patches, similar to tokens in text. The confusion matrix shown in
Figure 11 represents the classification performance of a ViT model on the dataset, in addition to the dataset with the added noise. Ultrasound images categorized under ultrasound normal noise posed a significant challenge for the model. While 226 images were correctly classified, 190 were misclassified as ultrasound stone noise. This indicates that the ViT model struggles to differentiate between noise-induced artifacts in healthy tissue and actual pathological signs, possibly due to the subtlety of differences under distortion. Based on the confusion matrix, the obtained accuracy value is 82.17%.
Figure 12 represents a critical difference diagram generated using the Nemenyi post hoc test, which is applied to statistically compare multiple classifiers over multiple datasets. This helps in determining whether the differences in average performance rankings are statistically significant.
The models were benchmarked using the critical difference diagram, as shown in
Figure 12: the proposed model, InceptionV3 (I3), Darknet19 (D19), and Darknet53 (D53). They were plotted along a horizontal axis according to their average rank, with lower ranks (left side) indicating a better performance. The proposed model was placed furthest to the left, suggesting it had the best average rank and thus performed best overall across the tested datasets. I3, D19, and D53 were positioned to the right of the proposed model, indicating a comparatively lower performance.
The critical difference (CD) is specified as 2.14. This value represents the minimum difference in average rank required for the performance difference between two models to be considered statistically significant at a given confidence level (α = 0.05). In this case, the distance between the proposed model and the other models (I3, D19, and D53) exceeds the CD threshold, indicating that the proposed model is statistically significantly better than I3, D19, and D53.
This conclusion is strengthened by the fact that no connecting line joins the proposed model to I3, D19, or D53, indicating that the Nemenyi test identified a significant difference between their ranks. Consequently, the diagram supports the claim that the proposed model not only achieved the best average rank but also that this advantage is statistically validated.
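For reference, a hedged sketch of the standard Nemenyi critical-difference computation (the number of datasets and the studentized-range constant below are illustrative assumptions; the paper reports CD = 2.14):

```python
import math

def nemenyi_cd(k: int, n_datasets: int, q_alpha: float) -> float:
    """Critical difference CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Example: k = 4 classifiers compared over N = 5 datasets at alpha = 0.05,
# with q_alpha ~= 2.569 for k = 4 (Demsar's tabulated value).
print(nemenyi_cd(k=4, n_datasets=5, q_alpha=2.569))  # ~2.10, of the same order as the reported 2.14
```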
Figure 13 shows some examples of the hardest images to classify when using the proposed system. In Figure 13a,b, the model misclassified noisy ultrasound images with stones as noisy ultrasound images without stones. This misclassification occurred because of the high concentration of noisy pixels, which made the classification challenging. On the other hand, the misclassifications in Figure 13c,d occurred due to low image intensity from the scan, making it difficult to extract useful visual features for the decision. Figure 13e,f shows the misclassification of noisy ultrasound images with stones as noisy images without stones; these two images also have low intensity values from the scanning.
Figure 14, Figure 15 and Figure 16 show Grad-CAM visualizations used to compare interpretability across different models. In Figure 14, AlexNet highlights outer regions in the first, second, and third image examples, and Darknet has some difficulty highlighting relevant regions in the second image. In Figure 15, EfficientNet points at some outer regions in the second and third image examples, GoogLeNet maps outer regions with lower focus in the second, third, and fourth example images, and ShuffleNet highlights outer regions with lower focus in the second image. In Figure 16, both ResNet and MobileNet show a high focus on outer regions in the second image.
Table 2 shows the obtained FP and FN cases based on the results in
Figure 8. Based on the analysis of the confusion matrix with class-specific labeling, the model’s performance varies across different types of conditions. The original dataset with stones is the most accurately classified class. It has the lowest overall error rate of 1.52%, indicating that the model has a very high confidence and accuracy when predicting this class. The false positive rate for this class is only 0.44%, and the false negative rate is 0.77%, reflecting that the model seldom misses actual cases belonging to it.
The noisy dataset without stones has the highest error rate at 4.00%, which is significantly higher than the others. This class also shows the highest false negative rate (1.65%), which means the model often misses identifying true samples from this category. Additionally, the false positive rate is 0.85%, which contributes to the overall difficulty the model faces when distinguishing this noisy negative class from others.
The noisy dataset with stones performs well, with a false positive rate of only 0.22%, the lowest among all classes, which means that very few samples from other classes are incorrectly classified as noisy stone images. However, its false negative rate is 1.93%, the highest among all classes, indicating the model sometimes fails to recognize true positive instances from this group. This leads to a total error rate of 2.30%.
The original dataset without stones has a low false negative rate of 0.55% and a false positive rate of 0.66%. Its total error rate stands at 2.42%, placing it better than the noisy negative class but behind the original dataset with stones in terms of overall reliability.
The model performs best when dealing with non-noisy data, especially the original dataset with stones, and struggles more when distinguishing noisy non-stone images, likely due to the visual similarities introduced by noise.
The presented performance comparison tables (
Table 3 and
Table 4) evaluate various kidney image classification techniques, offering critical insights into the evolution of medical image analysis methodologies.
Table 3 presents a comprehensive comparison that encompasses both traditional machine learning approaches and deep learning methods, while
Table 4 focuses specifically on deep neural network architectures. These tables collectively demonstrate advancements in classification accuracy, robustness to noise, and the impact of dataset scale on model performance.
Table 3 reveals progression in classification accuracy from traditional machine learning techniques to modern deep learning approaches. The earliest methods ([
33,
34]) utilizing K-nearest neighbors (KNNs) and support vector machines (SVMs) achieve modest accuracies of 56.35% and 64.04%, respectively. These conventional methods, while computationally efficient, are limited by their reliance on handcrafted features and struggle with the complex patterns in medical imaging. The inclusion of meta-heuristic optimization in [
34] provides a marginal improvement, highlighting the challenges of traditional computer vision techniques in medical image analysis.
The transition to deep learning marks a significant leap in performance, with [
35]’s off-the-shelf CNN features achieving 92.31% accuracy. This substantial improvement underscores the power of learned feature representations over manual feature engineering. The ensemble MSVM model in [
36] shows an anomaly; its performance slightly improves (from 84.89% to 85.15%) with added noise, suggesting potential inherent regularization effects or robustness in the ensemble structure. The work by Zheng et al. [
37] demonstrates the value of combining traditional texture features with deep learning, achieving 91.73% accuracy despite using a relatively small dataset of 270 images.
Recent advancements are particularly noteworthy. Sudharson and Kokil’s [
51] ensemble deep neural network achieved 96.54% accuracy, while Asaye et al.’s [
52] machine learning approach reached 98.4%. These results highlight the benefits of sophisticated model architecture and careful feature engineering. The proposed ensemble method, combining Darknet19, Darknet53, and InceptionV3, sets a new benchmark with 99.43% accuracy and maintains 98.75% performance under noisy conditions. This exceptional performance can be attributed to several factors: the strengths of the constituent models, comprehensive noise augmentation during training (including salt and pepper, speckle, Poisson, and Gaussian noise), and the substantial training dataset of 8826 images, nearly double the size of most comparative studies.
Table 4 provides a focused comparison of deep learning architectures, all evaluated on the same dataset size (4940 images) except for the proposed method. The performance spectrum reveals important architectural insights. SqueezeNet [
25] achieves the lowest accuracy (72.31%), demonstrating the challenges of maintaining performance with extreme model compression. InceptionV1 (GoogLeNet) [
24] shows 86.73% accuracy, while ResNet [
26] and ShuffleNet [
28] perform notably better at 93.46% and 93.65%, respectively, illustrating the benefits of residual connections and efficient channel shuffling operations.
The proposed method’s 98.75% accuracy outperforms these established architectures, despite using the same evaluation framework. This stems from some aspects: the ensemble approach leverages Darknet19’s efficiency, Darknet53’s depth and feature extraction capabilities, and InceptionV3’s multi-scale processing. The comprehensive noise augmentation strategy during training (employing multiple noise types with random variances) has enhanced the model’s robustness, as evidenced by its performance in
Table 3’s noise-added scenario.
The progression from traditional methods to deep learning, and further to ensemble approaches, demonstrates a clear evolution in medical image analysis. Several key observations emerge:
Dataset size impact: The correlation between dataset size and performance is evident, though not absolute. While [51] achieves 96.54% with 4940 images, the proposed method’s larger dataset (8826 images) underscores the importance of data scale in deep learning and allows testing on a larger number of cases.
Noise robustness: The proposed method’s minimal accuracy drop (0.68%) under noisy conditions, compared with other techniques, demonstrates the effectiveness of its multi-noise augmentation strategy. This is particularly valuable for medical imaging, where acquisition artifacts are common.
Architectural aspect: The ensemble approach capitalizes on complementary model strengths, combining Darknet’s object detection capabilities with Inception’s multi-scale processing. This synergy explains the significant performance leap over the individual architectures.
While high accuracy is promising, the real test lies in clinical deployment. The method’s robustness to various noise types suggests a good generalization potential.
Several areas warrant further investigation, such as the ensemble’s computational requirements versus its performance benefits, the potential for knowledge distillation to reduce model size while maintaining accuracy, extension to other medical imaging modalities, and a detailed analysis of false positive/negative rates in clinical scenarios.
Table 5 presents a comparison between the proposed ensemble deep learning model and some widely used existing deep neural networks on the same dataset (original images and the images with added random noise types).
The proposed ensemble model achieves an accuracy of 98.75%. This result clearly illustrates the strength of an ensemble approach combining Darknet53, Darknet19, and InceptionV3, where the combined predictions of multiple models lead to more robust and accurate outcomes. Specifically, Darknet53 contributes deep hierarchical features, Darknet19 provides lightweight and fast inference capabilities, and InceptionV3 brings multi-scale feature extraction.
Conventional CNNs like AlexNet perform poorly (78.34%), primarily due to their shallow architecture and limited capacity to model complex patterns in noisy data. More modern models like GoogLeNet and ResNet50/101 show improved results, with accuracies ranging from 87.89% to 90.61%, yet they still fall short of the proposed method. ResNet18 achieves 94.39%, outperforming its deeper variants, which might indicate better generalization or less overfitting on this dataset. EfficientNetB0 and MobileNetV2, which are designed for performance–efficiency trade-offs, achieve accuracies of 92.49% and 91.55%, respectively, though still not matching the ensemble’s accuracy.
Among newer and more powerful models, the Swin Transformer showed a high performance, with an accuracy (97.79%) close to that of the proposed system, while ConvNeXt did not demonstrate a high performance (33.07%).
The results underscore the value of ensemble learning, combining Darknet53, Darknet19, and InceptionV3, in medical image analysis, especially for challenging data such as ultrasound. While single models may perform well in isolation, they often have specific weaknesses, such as susceptibility to noise or difficulty capturing certain patterns, that can be mitigated through model fusion. In clinical settings, even small gains in accuracy can translate to significant improvements in diagnostic confidence and patient care. Therefore, the proposed ensemble model’s performance (98.75%) makes it a candidate for integration into computer-aided diagnostic tools for kidney disease detection via ultrasound.
The study in [
53] presents a lightweight, real-time capable object detection model for thyroid nodules. It builds upon the YOLOv8 framework by integrating a Coordinate Attention-based C2fA module and custom loss functions to improve accuracy. The authors report improvements in mean average precision, reaching 43.6%, with a detection precision of 54% and recall of 58.2%. The model also achieves fast inference times of approximately 7.7 milliseconds per image, which is advantageous for real-time clinical settings.
In comparison, the proposed study focuses on the classification of kidney ultrasound images under both clean and noisy conditions using an ensemble of Darknet19, Darknet53, and Inceptionv3. This ensemble approach achieves a higher classification accuracy of 98.75%. Performance metrics derived from confusion matrices, such as recall (98.77%), specificity (99.59%), F1 score (98.7%), and kappa (96.67%), indicate superior classification reliability. These results suggest that the kidney ultrasound system provides a highly accurate identification of pathological conditions, even in the presence of noise.
In terms of computational efficiency, the two approaches diverge. The study in [
53] is designed for speed and resource efficiency, making it suitable for real-time applications in clinical settings. Its compact architecture and optimized design yield fast inference times and minimal computational overhead. On the other hand, the kidney classification system employs an ensemble of heavier networks, which results in increased time consumption. While exact timing benchmarks are not reported, the use of three deep models suggests a trade-off: the model favors high accuracy and noise robustness over real-time performance.
Noise handling is a key strength of our study. The study in [
53] focuses on enhancing detection performance under natural conditions without explicitly introducing noise. Our study applied four distinct types of noise with randomly selected variances to simulate imaging degradations. This testing under noisy conditions demonstrates the proposed model’s strong resilience to various artifacts in ultrasound imaging. The minimal performance degradation confirms the model’s robustness.
Regarding interpretability, our study leverages Grad-CAM to visualize regions in ultrasound images that influence the model’s predictions. This helps ensure clinical relevance by demonstrating that the model focuses on diagnostically important regions, such as echogenic kidney stones. These visualizations enhance transparency and can build trust with medical practitioners. In contrast, the YOLOv8-based thyroid detection model does not incorporate explicit interpretability tools, such as attention maps or Grad-CAM, which limits its transparency.
Both models offer valuable contributions tailored to different clinical needs. The study in [
53] focuses on fast and efficient thyroid nodule detection and shows improvements in precision and recall with architectural enhancements. Our study demonstrates a high classification performance and resilience to various noise types, supported by Grad-CAM. While it may demand more computational resources, its robustness and clinical transparency make it highly suitable for diagnostic decision support systems in environments where noise is a significant challenge.
Each approach reflects a different priority: the study in [
53] focuses on real-time detection and architectural efficiency, while our study emphasizes accuracy, robustness, and interpretability in classification tasks. These differences underline the complementary nature of the two studies in advancing the state of medical image analysis using deep learning.