Article

An Explainable YOLO-Based Deep Learning Framework for Pneumonia Detection from Chest X-Ray Images

1 Information Technology Department, Faculty of Computers and Information, Menoufia University, Shebin El-Kom 32511, Egypt
2 College of Arts and Science, Umm Al Quwain University, Umm Al Quwain 536, United Arab Emirates
3 Department of Embedded Network Systems Technology, Faculty of Artificial Intelligence, Kafrelsheikh University, Kafr El-Sheikh 33516, Egypt
4 Electronics and Communication Department, College of Engineering and Computer Science, Mustaqbal University, Buraydh 51411, Saudi Arabia
5 Faculty of Medicine Kasr Al-Ainy, Cairo University, Giza 12613, Egypt
6 Computer Science Department, College of Science, Northern Border University, Arar 91431, Saudi Arabia
7 Department of Computer Science, College of Computer Science and Engineering, Taibah University, Yanbu 46421, Saudi Arabia
8 Computer Science Department, Faculty of Science, University of Tanta, Tanta 31527, Egypt
* Authors to whom correspondence should be addressed.
Algorithms 2025, 18(11), 703; https://doi.org/10.3390/a18110703
Submission received: 29 August 2025 / Revised: 19 October 2025 / Accepted: 31 October 2025 / Published: 4 November 2025

Abstract

Pneumonia remains a serious global health issue, particularly affecting vulnerable groups such as children and the elderly, where timely and accurate diagnosis is critical for effective treatment. Recent advances in deep learning have significantly enhanced pneumonia detection using chest X-rays, yet many current methods still face challenges with interpretability, efficiency, and clinical applicability. In this work, we proposed a YOLOv11-based deep learning framework designed for real-time pneumonia detection, strengthened by the integration of Grad-CAM for visual interpretability. To further enhance robustness, the framework incorporated preprocessing techniques such as Contrast Limited Adaptive Histogram Equalization (CLAHE) for contrast improvement, region-of-interest extraction, and lung segmentation, ensuring both precise localization and improved focus on clinically relevant features. Evaluation on two publicly available datasets confirmed the effectiveness of the approach. On the COVID-19 Radiography Dataset, the system reached a macro-average accuracy of 98.50%, precision of 98.60%, recall of 97.40%, and F1-score of 97.99%. On the Chest X-ray COVID-19 & Pneumonia dataset, it achieved 98.06% accuracy, with corresponding high precision and recall, yielding an F1-score of 98.06%. The Grad-CAM visualizations consistently highlighted pathologically relevant lung regions, providing radiologists with interpretable and trustworthy predictions. Comparative analysis with other recent approaches demonstrated the superiority of the proposed method in both diagnostic accuracy and transparency. With its combination of real-time processing, strong predictive capability, and explainable outputs, the framework represents a reliable and clinically applicable tool for supporting pneumonia and COVID-19 diagnosis in diverse healthcare settings.

1. Introduction

Pneumonia continues to be a major contributor to global illness and mortality, particularly affecting vulnerable groups such as children under five and the elderly. The World Health Organization (WHO) estimates that the disease causes more than 2.5 million deaths each year, underscoring the urgent demand for timely and reliable diagnostic methods. Chest X-ray imaging remains a widely used and cost-effective approach for detecting pneumonia; however, conventional assessments depend heavily on radiologists’ expertise, which can introduce variability and subjectivity in interpretation, especially in settings with limited medical resources [1]. To address these challenges, automatic pneumonia detection systems have gained significant attention due to their potential to provide consistent, rapid, and scalable diagnostic support. Early automated detection not only assists clinicians in decision-making but also alleviates the burden on healthcare systems and improves patient outcomes [2]. Nevertheless, traditional computer-aided diagnosis (CAD) systems often depend on handcrafted features and shallow classifiers, which tend to be limited in their capacity to generalize across varied imaging conditions and patient populations [3,4,5].
In recent years, deep learning (DL) has emerged as a transformative technique for medical image analysis, offering notable improvements in robustness and detection accuracy. Convolutional Neural Networks (CNNs), in particular, have shown remarkable success in automatically learning complex visual representations from raw image data, thus removing the need for manual feature extraction [6,7,8]. Within DL-based object detection frameworks, the You Only Look Once (YOLO) model is especially notable for its ability to achieve real-time analysis with high precision [9,10], positioning it as a strong approach for pneumonia detection in chest X-ray imaging [11].
Despite the impressive performance of deep learning models, a major concern in clinical applications is their lack of transparency. To gain trust from healthcare professionals, it is essential that such models provide explainable results that clarify the reasoning behind predictions. Hence, integrating explainable AI (XAI) techniques into DL-based pneumonia detection frameworks is critical for fostering clinical acceptance and ensuring ethical, interpretable diagnostics [12,13].
This article presents an explainable YOLO-based framework tailored for pneumonia detection from chest X-ray images. The proposed approach combines the detection strength of YOLO with post hoc explainability techniques to provide not only accurate diagnoses but also visual justifications, thereby enhancing transparency and clinical usability.
The proposed framework leverages the latest YOLOv11 architecture, which provides notable improvements in detection accuracy and efficiency compared to earlier YOLO versions. Its enhanced feature extraction and anchor-free detection mechanisms make it particularly effective for identifying small, low-contrast abnormalities in chest X-rays. These capabilities allow for more precise localization of pneumonia and COVID-19–related opacities, supporting both diagnostic accuracy and clinical interpretability.
The contribution of this work can be summarized as follows:
  • Proposed a novel pneumonia detection system combining YOLOv11 for real-time object detection with Grad-CAM for visual interpretability.
  • Employed Grad-CAM to generate diagnostic heatmaps, enabling clinicians to visualize the critical regions influencing the model’s predictions, thereby enhancing explainability and trust in AI-assisted decision-making.
  • Applied a robust preprocessing pipeline including CLAHE-based contrast enhancement, region-of-interest (ROI) extraction, and lung segmentation to improve feature localization and reduce background noise.
  • Validated the proposed framework on two public benchmark datasets (the COVID-19 Radiography Database and the Chest X-ray COVID-19 & Pneumonia Dataset) to ensure performance generalizability across diverse clinical cases.
  • Conducted a comparative analysis with state-of-the-art models, showing competitive or superior performance in terms of well-known evaluation metrics.
The structure of the study is as follows: Section 2 reviews the most recent developments in pneumonia detection. Section 3 introduces the proposed explainable YOLO-based framework for detecting pneumonia from chest X-ray images. Section 4 details the experimental results obtained using the proposed models, along with a performance evaluation compared to the latest related studies. Finally, Section 5 summarizes the main contributions of this work and outlines potential directions for future improvements.

2. Literature Review

This section provides a comprehensive overview of the most relevant and up-to-date research efforts in this domain, focusing on different model architectures, datasets, classification strategies, and performance metrics. Emphasis is placed on the comparative analysis of traditional transfer learning, hybrid deep learning methods, and custom-designed CNNs for pneumonia and COVID-19 diagnosis. The goal is to identify the strengths, limitations, and trends in current methodologies to better position the proposed explainable YOLO-based framework within the existing body of work.

2.1. Hybrid Deep Learning Models

Hybrid models that combine multiple architectures or integrate machine learning classifiers with deep learning feature extractors have gained attention for their performance enhancements. Abdullah et al. [14] proposed a hybrid CNN model combining VGG16 and VGG19 with average pooling for feature fusion and a dense neural network for classification. Trained on the COVID-19 Radiography Database, this model achieved 92% accuracy. Despite improvements over individual CNNs, class imbalance and overfitting risks were noted. Aslan et al. [15] integrated ANN-based lung segmentation with features from eight CNN models (e.g., DenseNet201, ResNet50), classifying with SVMs and other ML algorithms. Their model reached 96.29% accuracy. The approach, though accurate, incurred significant computational costs due to Bayesian optimization. Lakshmi et al. [16] adopted a Bayesian-optimized SVM with deep features extracted from models like AlexNet and ResNet50, achieving 96.20% accuracy. While promising, the approach’s processing time and overfitting risk limited real-time application.

2.2. Transfer Learning Approaches and Custom CNN Architectures

Transfer learning using pre-trained CNN architectures has become a dominant approach, especially in scenarios with limited medical image datasets. Singh et al. [17] and El Houby [18] used ResNet50 and VGG19, respectively, achieving up to 97% accuracy. El Houby applied contrast enhancement techniques like CLAHE and histogram equalization for improved feature visibility. Chakravarthy et al. [19] used the SEA-ResNet50 model, attaining 97.50% accuracy in binary classification and employing explainable AI for interpretability. Srinivas et al. [20] fused InceptionV3 and VGG16, attaining 98% accuracy, though the model’s robustness on unseen data remains uncertain. Researchers have also designed custom CNNs tailored to pneumonia detection tasks. Ullah et al. [21] proposed CovidDetNet using a custom CNN structure incorporating batch and cross-channel normalization. It achieved 98.40% accuracy but faced limitations due to limited dataset variety. P. Szepesi and L. Szilágyi [22] optimized a lightweight CNN for pediatric pneumonia detection using dropout layers, achieving up to 97.76% accuracy. However, model applicability in clinical environments remains uncertain. F. Bayram and A. Eleyan [23] proposed a DL approach using a 3-stream fusion-based CNN model trained on a large US clinical dataset of chest X-ray images, achieving an accuracy of 97.76%. Their methodology involved extracting features from grayscale X-ray, LBP, and HOG images, then concatenating these features for classification, validated through fivefold cross-validation.

2.3. Federated Learning

To preserve patient data privacy while leveraging distributed training, federated learning (FL) has been explored. Naz et al. [24] applied FL with ResNet50, achieving up to 98% accuracy on both IID and non-IID distributions using the COVID-19 Radiography Database. However, model convergence and parameter tuning posed challenges in large-scale deployments. Kareem et al. [25] demonstrated a federated learning framework that enables hospitals and medical institutes to collaborate on model training with real-time datasets while preserving patient privacy and achieving high accuracy. Mabrouk et al. [26] proposed an Ensemble Federated Learning (EFL) framework for pneumonia detection from chest X-ray images, enabling multiple hospitals to train local CNN models (DenseNet, ResNet, MobileNet) on private data, form local ensembles, and share only model parameters with a central server for aggregation into a global ensemble. Experiments on a chest X-ray dataset achieved an accuracy of about 96.63%, highlighting the framework’s ability to enhance diagnostic performance while maintaining patient data privacy.

2.4. YOLO-Based Detection Models

The YOLO architecture has been a popular choice for its real-time object detection capability. Munna et al. [27] compared YOLOv3, YOLOv4, and YOLOv6 for pneumonia detection, with YOLOv6 performing best. Despite their effectiveness, the models required further tuning for clinical deployment. Yao [28] enhanced YOLOv3 with MaskFPN and dilated convolutions, achieving 83.70% AP. Nevertheless, the localization of subtle lesions remained challenging. Telaumbanua [29] applied YOLOv11 with standard preprocessing (resizing, normalization), achieving 91.24% accuracy. The model’s strength lies in its balance of sensitivity and specificity, though dataset limitations hinder generalization. Zhao [30] improved Fast-YOLO with the FASPA attention mechanism, using over 14,000 re-annotated images. This model excelled in both accuracy and speed, which is critical for real-time settings. Xie et al. [31] optimized YOLO for robustness by incorporating modules like DCNv2 and DynamicConv, attaining a mAP of 97.80%. Despite strong performance, interpretability was a noted limitation.

2.5. Alternative Techniques and Exploratory Models

Several studies have pursued alternative deep learning or preprocessing approaches to boost classification accuracy. Zhang [32] introduced NSEC-YOLO, incorporating adaptive noise suppression and global perception aggregation, achieving state-of-the-art accuracy and inference speed. Hameed et al. [33] proposed a steganography-based framework for medical data embedding, which, while outside classification, highlights security considerations in image-based AI systems. Das et al. [34] used U-Net and W-Net for segmentation and classification, achieving 97.50% F1-score. Their approach performed well but was sensitive to image quality and data diversity. Kailasam and Balasubramanian [35] combined CNNs and YOLO for pneumonia detection with an 83% accuracy. However, generalization and interpretability remained problematic. Accurate lung segmentation enhances the model’s focus on relevant anatomical regions. Nguyen et al. [36] combined YOLOv5s-based lung segmentation with Faster R-CNN and YOLOv5s for detecting five thoracic abnormalities. YOLOv5s surpassed Faster R-CNN in both speed and accuracy but struggled with small anomalies and hyperparameter tuning. H. M. Balaha et al. [37] introduced an Archimedes Optimization Algorithm (AOA)-guided framework for hyperparameter optimization in medical image segmentation. The methodology follows four stages: population initialization, fitness function evaluation, population updating, and results logging, where AOA optimizes hyperparameters such as activation functions, loss functions, optimizers, and batch sizes. The results showed high segmentation performance, with the R2 U-Net 2D model achieving 95.70% accuracy on the BUSI dataset and the V-Net model achieving 99.20% accuracy on the COVID-19 dataset.
The literature highlights several key trends: hybrid models and transfer learning dominate accuracy benchmarks; YOLO-based models excel in speed and localization; and interpretability is increasingly important, particularly in healthcare applications. However, challenges remain in achieving model generalization, real-time deployment, and explainability. The proposed YOLOv11-based framework addresses these gaps by combining efficient object detection with post hoc interpretability (Grad-CAM), offering a balanced solution suitable for clinical use. Table 1 summarizes the most recent work related to pneumonia detection.

3. Methodology

This section describes the proposed pneumonia detection framework based on YOLOv11 and Grad-CAM. The overall workflow includes image preprocessing, lung region segmentation, model training, and explainable visualization. The proposed detection framework is built upon the YOLOv11 architecture, which introduces several improvements over previous YOLO versions such as YOLOv8 and YOLOv10. YOLOv11 incorporates an enhanced backbone and feature pyramid (PAN-FPN) structure for more effective multi-scale feature extraction, improving sensitivity to small and low-contrast lesions common in medical images. It also adopts anchor-free detection with dynamic label assignment, reducing localization errors and improving training stability. The integration of efficient attention mechanisms and decoupled head design further enhances feature representation and object classification accuracy. These architectural advancements enable YOLOv11 to achieve faster convergence, higher precision, and improved localization, making it particularly well suited for accurate and reliable pneumonia detection in chest X-rays.
The proposed framework for pneumonia detection, as shown in Figure 1, integrates a real-time object detection model (YOLO) with explainable AI techniques to provide both accurate localization of infected lung regions and interpretable results that can aid clinical decision-making. The framework comprises a structured pipeline with several key stages, from input image acquisition to final explainable prediction. Algorithm 1 outlines the complete methodology used in this study for pneumonia detection and explainability using the YOLOv11 model.
Algorithm 1: Pneumonia Detection and Explainability using YOLOv11.
Input: Chest X-ray image dataset
Output: Classification results and Grad-CAM heatmaps
BEGIN
1. Dataset Selection and Preprocessing
  1.1 Load Dataset Images
  1.2 FOR each image in DatasetImages DO
   a. Apply ImageEnhancement(image) using CLAHE
   b. Perform LungSegmentation(image)
   c. Identify RegionOfInterest(image)
   d. Pass preprocessed image to pipeline
  END FOR
2. Model Training
  2.1 Split the preprocessed dataset into:
   - TrainingSet
   - ValidationSet
   - TestSet
  2.2 Apply Data Augmentation on TrainingSet
  2.3 Train YOLOv11 model:
   Model ← TrainYOLOv11(TrainingSet, ValidationSet)
3. Model Evaluation
  3.1 Evaluate TrainedModel using TestSet
   Results ← Evaluate(Model, TestSet)
  3.2 FOR each test image in TestSet DO
   a. Predict class using TrainedModel
   b. Apply Grad-CAM to generate heatmap
   c. Overlay heatmap on test image
  END FOR
 3.3 Display and Save:
   - Classification Results
   - Confusion Matrix
   - Grad-CAM Visualizations
END
To enhance the diagnostic effectiveness of chest X-ray images, a series of preprocessing steps were applied. Publicly available datasets containing Normal, COVID-19, and Pneumonia cases were utilized, with preprocessing focused on improving image clarity and reducing noise. CLAHE was employed to highlight key structures such as lung textures and opacities, thereby improving visibility of clinically significant features. Segmentation was then performed to isolate the lung fields, ensuring that the model focused on relevant regions while excluding surrounding background areas. This process not only reduced false activations but also enhanced the accuracy of Grad-CAM heatmap generation by localizing the areas most responsible for prediction. To further increase robustness, the segmented images underwent data augmentation before training. Transformations such as random rotations, scaling, and brightness variations were applied to simulate real-world imaging variability. These augmentations strengthened the model’s ability to generalize across diverse clinical conditions and reduced overfitting.
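As an illustration of this augmentation stage, the following minimal sketch uses torchvision; the specific parameter ranges (rotation angle, scale factors, brightness jitter) are assumptions for demonstration, as the paper does not report exact values.

```python
import torchvision.transforms as T

# Illustrative augmentation pipeline for the segmented training images.
# Parameter ranges below are assumed for demonstration, not taken from the paper.
train_augmentations = T.Compose([
    T.RandomRotation(degrees=10),                 # random rotations
    T.RandomAffine(degrees=0, scale=(0.9, 1.1)),  # mild random scaling
    T.ColorJitter(brightness=0.2),                # brightness variation
    T.ToTensor(),                                 # PIL image -> float tensor
])
```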
The processed and augmented images were then introduced into the YOLOv11 framework, selected for its real-time performance and strong detection accuracy. YOLO divides the image into a grid, predicting bounding boxes and class probabilities simultaneously, which enables effective pneumonia detection and localization. For interpretability, Grad-CAM was integrated into the framework to generate class-specific heatmaps that highlight areas of the lungs most influential to the model’s predictions. These visual explanations support clinical validation by confirming that the system is attending to medically relevant regions rather than irrelevant patterns.
The final stage involves training and evaluating the model on two separate benchmark datasets: (1) COVID-19 Radiography Database and (2) Chest X-ray COVID-19 & Pneumonia Database. These datasets provide diverse imaging conditions and disease classes to ensure generalizability. Performance metrics such as accuracy, precision, recall, F1-score, and confusion matrices are computed to assess the effectiveness of the detection framework.

3.1. Dataset Description

3.1.1. Dataset 1: COVID-19 Radiography Database

The first dataset is the COVID-19 Radiography Database, compiled and made publicly available by Tawsifur Rahman et al. on Kaggle [45]. This dataset comprises over 21,000 posterior–anterior (PA) view chest X-ray images divided into four categories: COVID-19 (3616 images), Normal (10,192 images), Lung Opacity (6012 images), and Viral Pneumonia (1345 images). The images are stored in JPEG format and annotated by clinical professionals. This dataset is widely recognized in the research community for its high-quality annotations and balanced representation of respiratory classes, making it suitable for both binary and multiclass classification tasks.

3.1.2. Dataset 2: Chest X-Ray (COVID-19 & Pneumonia)

The second dataset is the Chest X-ray (COVID-19 & Pneumonia) dataset, curated by Prashant Mohan and available on Kaggle [46]. It contains a total of 6432 chest X-ray images, categorized into three diagnostic classes: COVID-19 (1252 images), Pneumonia (3427 images), and Normal (1753 images). All images are in JPEG format and primarily captured in PA orientation. This dataset is particularly useful due to its balanced structure and clinical relevance in detecting COVID-19 and pneumonia symptoms from radiographic features. Its compact size and well-separated class structure make it ideal for rapid model prototyping and benchmarking.
Both datasets were used independently and not combined. Each dataset was divided into 70% training, 15% validation, and 15% testing subsets, ensuring that no image was repeated across the splits. As neither dataset includes patient identifiers, splitting was performed strictly at the image level to prevent overlap between subsets. According to their official Kaggle documentation, the COVID-19 Radiography Database was curated by medical experts, with duplicate and low-quality images removed during dataset preparation, while the Chest X-ray (COVID-19 & Pneumonia) Dataset was manually reviewed and balanced to eliminate duplicates and ensure representative class coverage. These properties help minimize data leakage and support a fair and reliable evaluation of the proposed model.
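A minimal sketch of such an image-level 70/15/15 stratified split, assuming scikit-learn and hypothetical image_paths and labels lists, is shown below.

```python
from sklearn.model_selection import train_test_split

# image_paths: list of image file paths; labels: parallel list of class names.
# First hold out 70% for training, then split the remaining 30% evenly into
# validation and test (15% each), stratified by class in both steps.
train_x, rest_x, train_y, rest_y = train_test_split(
    image_paths, labels, test_size=0.30, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=42)
```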

3.2. Data Preprocessing

Although the datasets used in this study provide only image-level diagnostic labels (i.e., COVID-19, Pneumonia, and Normal) without bounding box annotations, our framework is designed as a weakly supervised detection system rather than a conventional object detector: it leverages YOLOv11’s localization capability but was not trained with manually annotated bounding boxes. Instead, the YOLOv11-based model was employed in a detection mode in which each image was associated with its corresponding image-level label.
CLAHE is an advanced image enhancement technique used to improve the contrast of images, particularly in medical imaging and low-contrast environments [47]. Unlike traditional histogram equalization, which applies a global contrast adjustment, CLAHE operates on small regions called tiles (usually 8 × 8 or 16 × 16 pixels), applying histogram equalization within each tile to enhance local details. To prevent noise amplification in nearly uniform regions, CLAHE introduces a contrast-limiting step by clipping the histogram at a predefined threshold before redistributing the excess pixels. Mathematically, for a tile with intensity levels $I(x, y)$, the transformation function $T(i)$ is given by the cumulative distribution function (CDF) of the clipped and normalized histogram $H(i)$:
$T(i) = \frac{1}{M \times N} \sum_{j=0}^{i} H_{\text{clipped}}(j)$
where $M \times N$ is the number of pixels in the tile, and $H_{\text{clipped}}(j)$ is the clipped histogram count for gray level $j$. To avoid abrupt changes between tiles, CLAHE uses bilinear interpolation to merge neighboring tiles. This technique has proven effective in enhancing medical images such as MRI, CT, and ECG visualizations, by revealing subtle features while minimizing noise amplification.
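In practice, this tile-based, clip-limited equalization is available directly in OpenCV; the sketch below uses a clip limit of 2.0 and 8 × 8 tiles as common defaults, which are assumptions rather than values reported in the paper.

```python
import cv2

# Apply CLAHE to a grayscale chest X-ray. clipLimit and tileGridSize are
# common defaults, assumed here for illustration.
image = cv2.imread("chest_xray.jpeg", cv2.IMREAD_GRAYSCALE)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(image)
cv2.imwrite("chest_xray_clahe.jpeg", enhanced)
```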
According to the proposed framework, the preprocessing pipeline, as demonstrated in Figure 2, began with the application of CLAHE, which significantly improved local contrast and visibility of critical lung features. This enhancement is vital in radiographic imaging, where subtle opacities can indicate pathological changes. Following enhancement, the framework incorporated ROI identification to direct attention toward the lung areas, effectively minimizing background interference. Additionally, lung segmentation was performed to further isolate the anatomical zones of interest, ensuring that subsequent predictions and explanations pertain strictly to lung regions. This segmentation step also lays the groundwork for meaningful application of Grad-CAM visualizations.
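The paper does not detail the segmentation algorithm itself; as a rough, purely illustrative stand-in, a classical threshold-and-morphology pipeline can produce a coarse lung-field mask, as sketched below.

```python
import cv2

def lung_roi_mask(gray):
    """Coarse lung-field mask via Otsu thresholding plus morphology.
    Illustrative only; not the segmentation method used in the paper."""
    # Lung fields appear dark on X-rays, so invert before Otsu thresholding.
    _, mask = cv2.threshold(255 - gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # remove small specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # fill small holes
    return mask

gray = cv2.imread("chest_xray_clahe.jpeg", cv2.IMREAD_GRAYSCALE)
segmented = cv2.bitwise_and(gray, gray, mask=lung_roi_mask(gray))
```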

3.3. YOLOv11 Model Architecture

The YOLOv11 model architecture represents an advanced iteration in the YOLO family, designed to enhance real-time object detection by improving speed, accuracy, and feature representation. The architecture of the YOLOv11 model is shown in Figure 3. It incorporates a refined backbone network for efficient feature extraction, typically built on a modified CSP (Cross Stage Partial) structure to reduce computational cost while preserving gradient flow. The neck of the model employs a combination of PANet (Path Aggregation Network) and attention mechanisms to fuse multi-scale features, facilitating the detection of small and overlapping objects, particularly beneficial in medical imaging tasks like pneumonia detection from chest X-ray images. Furthermore, YOLOv11 introduces an optimized anchor-free head that improves bounding box regression and classification performance, thus delivering state-of-the-art results in various object detection benchmarks. This version emphasizes lightweight computation, making it suitable for edge deployments in healthcare applications where latency and resource constraints are critical [48,49].
The experimental setup utilizes the YOLOv11 model with the parameters listed in Table 2. This table outlines the key training parameters used for implementing the YOLOv11 model in the proposed pneumonia detection framework. The model was trained for 100 epochs using images resized to 520 × 520 pixels, with a batch size of 16 and three output classes (COVID-19, pneumonia, and normal). The optimizer was set to auto-select the best option, starting with an initial learning rate (lr0) of 0.01 and a momentum value of 0.937, while a patience parameter of 100 was applied to prevent overfitting by allowing early stopping when no further improvement was observed. These parameter choices were designed to balance computational efficiency, convergence stability, and classification performance, ensuring the model’s robustness and suitability for real-time clinical applications.
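As a rough illustration, the Table 2 settings map onto a training call like the following, assuming the Ultralytics Python API; the model checkpoint name and dataset configuration file are placeholders, and this is a sketch rather than the authors’ exact script.

```python
from ultralytics import YOLO

# Training sketch using the Table 2 parameters. "yolo11n.pt" and
# "dataset.yaml" are placeholders, not artifacts from the paper.
model = YOLO("yolo11n.pt")
results = model.train(
    data="dataset.yaml",
    epochs=100, imgsz=520, batch=16,
    optimizer="auto", lr0=0.01, momentum=0.937,
    patience=100,
)
```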
The YOLOv11 detection framework predicts a set of bounding boxes and class probabilities for each input image. Each detection output is represented as a vector $(x, y, w, h, P_{obj}, P(c \mid obj))$, where $(x, y)$ denote the center coordinates of the bounding box, $w$ and $h$ represent its width and height, $P_{obj}$ is the objectness confidence indicating the probability that the box contains an object, and $P(c \mid obj)$ is the conditional class probability. The final detection confidence for each class is computed as $P_{obj} \times P(c \mid obj)$.
The total loss function L used to optimize the YOLOv11 model is a multi-component objective that jointly minimizes errors in localization, objectness, and classification, expressed as:
$L = \lambda_{box} L_{box} + \lambda_{obj} L_{obj} + \lambda_{cls} L_{cls}$
where $L_{box}$ represents the bounding box regression loss based on Complete Intersection over Union (CIoU) to improve spatial accuracy, $L_{obj}$ is the binary cross-entropy loss for objectness prediction, and $L_{cls}$ is the classification loss for multi-class label prediction. The coefficients $\lambda_{box}$, $\lambda_{obj}$, and $\lambda_{cls}$ are weighting factors that balance the contributions of each component. This composite formulation ensures that the model simultaneously optimizes precise localization, reliable object detection, and accurate disease classification, which are crucial for robust medical image interpretation.
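A minimal sketch of this weighting scheme is shown below; the lambda values are illustrative placeholders, since the paper does not report the weights it used.

```python
def total_loss(l_box, l_obj, l_cls,
               lam_box=7.5, lam_obj=1.0, lam_cls=0.5):
    """Weighted sum of the three YOLO loss components.
    The lambda defaults are illustrative, not values from the paper."""
    return lam_box * l_box + lam_obj * l_obj + lam_cls * l_cls
```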

3.4. Grad-CAM Method for YOLOv11 Model Explainability

Grad-CAM, short for Gradient-weighted Class Activation Mapping, is a popular explainable AI method that provides visual interpretations of convolutional neural network predictions. It works by tracing the gradients of the target output back through the final convolutional layer to create heatmaps that highlight the most influential image regions. Unlike earlier methods, Grad-CAM can be applied to CNNs without modifying their structure, making it versatile across a wide variety of models. This capability is particularly useful in medical image analysis, where understanding why a diagnosis was made—such as in pneumonia detection from chest X-rays—builds trust in the system’s decisions. The heatmaps offer intuitive, spatially aligned insights that help clinicians and researchers confirm that the network is attending to medically meaningful features rather than irrelevant patterns [50].
Grad-CAM is valuable not just for its technical role but for the clarity it brings to AI predictions. By overlaying heatmaps on chest X-rays, it visually points to the exact lung regions that guided the YOLOv11 model’s decision. In pneumonia cases, for instance, the highlighted areas often correspond to classic signs such as consolidation or ground-glass opacities, showing that the model is focusing on clinically meaningful features rather than irrelevant background details. This level of transparency helps radiologists quickly see whether the AI’s reasoning aligns with their own observations. In this way, Grad-CAM turns the model from a “black box” into a supportive diagnostic partner—building trust, encouraging clinical use, and reducing the risks of relying on unexplained predictions.
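For readers unfamiliar with the mechanics, the following is a minimal Grad-CAM sketch for a generic CNN classifier in PyTorch: it hooks the last convolutional layer, global-average-pools the gradients into channel weights, and forms a ReLU-rectified weighted sum of the activations. Applying Grad-CAM to a YOLO detection head additionally requires selecting the score of a particular detection, and the authors’ exact implementation is not specified, so this should be read as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, class_idx):
    """Minimal Grad-CAM for a CNN classifier (illustrative sketch).
    `model` returns class logits; `layer` is its last conv layer;
    `image` is a (C, H, W) tensor; `class_idx` is the target class."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]  # target class score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted activation sum
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
```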

3.5. Model Evaluation Metrics

Model evaluation metrics are essential for assessing the effectiveness of machine learning and deep learning models, especially in classification tasks. The most common metrics are accuracy, precision, and recall (or sensitivity). These metrics are often derived from the confusion matrix, which summarizes predictions into true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [51].
Precision: This metric quantifies the fraction of correctly identified positive cases relative to all instances predicted as positive. It indicates the reliability of the model’s positive predictions, which is critical in medical diagnostics to minimize false alarms. High precision reflects a reduced likelihood of false positives. It can be mathematically expressed as:
$\text{Precision} = \frac{TP}{TP + FP}$
Here, $TP$ denotes the count of correctly identified positive cases, while $FP$ represents the count of negative cases incorrectly classified as positive.
Recall: Commonly referred to as sensitivity or the true positive rate, this metric measures the model’s effectiveness in accurately detecting all instances that belong to the positive class. It is especially crucial in health-related applications where missing a positive case (such as a pneumonia infection) can be critical. Recall is calculated as:
$\text{Recall} = \frac{TP}{TP + FN}$
where $FN$ denotes the count of false negatives. A higher recall value indicates that the model is able to correctly identify the majority of true positive cases.
F1-Score: This metric represents the harmonic mean of precision and recall, offering a unified measure that captures the trade-off between them. This metric is especially valuable in scenarios with imbalanced class distributions or when minimizing both false positives and false negatives is critical. The F1-score is defined as:
$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
The metric is bounded between 0 and 1, where larger values correspond to superior performance.
Accuracy: This metric reflects the model’s overall ability to correctly distinguish between positive and negative cases. Although it is often considered the most straightforward performance indicator, it may yield misleading interpretations when applied to imbalanced datasets. Accuracy is defined as:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
where $TN$ denotes the count of true negatives. While useful, accuracy should be considered alongside other metrics in skewed class scenarios.
Confusion Matrix: A confusion matrix is a structured table that evaluates the performance of a classification model by comparing the actual classes with the predicted ones. It offers a detailed breakdown of prediction outcomes, differentiating between TP, FP, TN, and FN. In binary classification tasks, it is generally organized as follows:
                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
This matrix serves as a basis for calculating key evaluation metrics such as precision, recall, F1-score, and accuracy, thereby providing a holistic assessment of the model’s performance and diagnostic reliability.
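Assuming per-image predictions are available as class indices, all of these metrics can be computed with scikit-learn, as in the sketch below (y_true and y_pred are hypothetical arrays of ground-truth and predicted labels).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# y_true, y_pred: ground-truth and predicted class indices for the test set.
acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")  # macro-averaged, as reported in the paper
cm = confusion_matrix(y_true, y_pred)
print(f"Accuracy {acc:.4f}  Precision {prec:.4f}  Recall {rec:.4f}  F1 {f1:.4f}")
```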

4. Results and Discussion

This section presents the experimental outcomes of the proposed explainable YOLOv11-based pneumonia detection framework and evaluates its performance using two benchmark chest X-ray datasets. The evaluation includes a detailed analysis of classification accuracy, precision, recall, and F1-score for each diagnostic category across both datasets. Additionally, the section showcases visual results, including prediction samples, confusion matrices, and Grad-CAM-based explanations, to demonstrate both the diagnostic capability and interpretability of the model. Comparative analysis with existing state-of-the-art approaches is also provided to highlight the relative strengths and unique contributions of the proposed framework.

4.1. Experimental Setup

Model training and evaluation were performed using the preprocessed datasets described in Section 3.1. Each dataset was already partitioned into training, validation, and testing subsets as outlined earlier, ensuring proportional class representation and strict separation between splits. All images were resized to 320 × 320 pixels to align with the YOLOv11 input dimensions. Data augmentation methods, such as random rotation and brightness adjustment, were applied to improve generalization and reduce overfitting. Model training was executed on a Kaggle environment using an NVIDIA Tesla P100 GPU with 16 GB of memory. Performance evaluation was based on standard metrics, including accuracy, precision, recall, F1-score, and confusion matrices, calculated on the test set after each training epoch.

4.2. Experimental Results Using Dataset 1

When trained and tested on the COVID-19 Radiography Database, the proposed YOLOv11-based pneumonia detection model achieved outstanding performance across all three classes (COVID-19, Normal, and Pneumonia). Figure 4 illustrates the model’s learning behavior, with both training and validation loss steadily decreasing over successive epochs and training accuracy consistently improving, indicating effective convergence without signs of overfitting. The confusion matrix in Figure 5 confirms the high classification accuracy, with only a few misclassifications and strong sensitivity and specificity for each category. Visual examples in Figure 6 further validate these results, showing a close match between ground-truth labels (panel a) and the model’s predictions (panel b). The high level of agreement across diverse chest X-ray samples highlights the framework’s reliability and robustness in accurately distinguishing between COVID-19, normal, and pneumonia cases.
Table 3 summarizes the classification results of the YOLOv11-based model when tested on the COVID-19 Radiography dataset. The framework demonstrated consistently strong outcomes across all three categories (COVID-19, Normal, and Pneumonia) based on accuracy, precision, recall, and F1-score. Specifically, accuracy reached 98.98% for COVID-19, 98.50% for Normal, and 99.52% for Pneumonia. Precision was above 98% for every class, while recall values ranged from 97.41% for COVID-19 to 99.28% for Pneumonia. F1-scores also showed stability, with values of 97.86% (COVID-19), 98.89% (Normal), and 97.22% (Pneumonia). When macro-averaged, the framework achieved 98.96% accuracy, 98.60% precision, 97.40% recall, and a 97.99% F1-score, confirming reliable and balanced performance. Figure 7 offers a visual look into how the model makes its decisions, using Grad-CAM to generate heatmaps for sample chest X-rays from each class. In each example, the original image (top row) is paired with a Grad-CAM heatmap (bottom row), where warmer colors mark the regions most influential to the prediction. These highlighted lung areas closely match known pathological patterns, showing that the model’s attention is focused on clinically relevant zones. Together, the solid numerical results in Table 3 and the clear visual explanations in Figure 7 demonstrate not only the accuracy of the YOLOv11-based framework but also its transparency, making it a promising tool for reliable and interpretable clinical use.

4.3. Experimental Results Using Dataset 2

Using the second dataset (Chest X-ray COVID-19 & Pneumonia), the proposed YOLOv11 framework achieved consistently strong outcomes across all three classes. Figure 8a–c show the training behavior, where both training and validation losses steadily declined, and accuracy quickly rose above 98% before stabilizing—indicating effective convergence without evidence of overfitting. The confusion matrix in Figure 9 highlights only a small number of misclassifications: 20 Normal scans predicted as Pneumonia, 5 Pneumonia cases classified as Normal, and a single Pneumonia image mislabeled as COVID-19. This demonstrates clear class separation, particularly for COVID-19, which reached near-perfect accuracy. Figure 10 further illustrates this reliability, as predicted labels closely matched ground-truth annotations with minimal discrepancies.
The quantitative metrics are presented in Table 4. For COVID-19 detection, the framework delivered 99.92% accuracy, 100% precision, and an F1-score of 99.57%, confirming its high sensitivity and specificity for COVID-related pneumonia. The Normal and Pneumonia categories also achieved strong results, with accuracies of 98.06% and 97.98%. On average, across all three classes, the framework attained 98% accuracy, 97.76% precision, 98.41% recall, and 98.06% F1-score, verifying robust and balanced classification performance.
Figure 11 presents Grad-CAM–based visual explanations for representative samples from the second dataset, covering COVID-19, Pneumonia, and Normal cases. For each example, the original chest X-ray image is shown alongside its corresponding Grad-CAM heatmap, where warmer colors indicate areas with greater influence on the model’s decision. The highlighted lung regions align closely with known pathological patterns, confirming that the framework focuses on clinically relevant zones when making predictions. This interpretability strengthens trust in the model’s outputs and underscores its potential as a transparent and reliable tool for clinical use.

4.4. Comparative Performance Evaluation with Recently Related Work

Evaluating the YOLOv11-based pneumonia detection framework against existing state-of-the-art methods is necessary to demonstrate its clinical potential and relative advantages. Table 5 reports a comparison with several advanced models on the COVID-19 Radiography dataset, using four key performance indicators: accuracy, precision, recall, and F1-score. On this benchmark dataset, the proposed YOLOv11 system achieved a macro-average accuracy of 98.56%, precision of 98.60%, recall of 97.45%, and F1-score of 97.99%. These results position it among the best-performing approaches and highlight its competitiveness in pneumonia and COVID-19 classification tasks. It clearly outperformed models such as the hybrid DL + ML approach [14] (92% accuracy), VGG19 [18] (93.38% accuracy), and CNN [23] (97.76% accuracy). Even advanced combinations like Inception V3 with VGG16 [20], which reached 98% across all metrics, were slightly below or on par with YOLOv11. YOLO-based approaches such as YOLOv3 with MaskFPN [28] and Fast-YOLO [30] demonstrated mixed performance: Yao’s improved YOLOv3 achieved only 81% accuracy on the RSNA dataset, while Fast-YOLO reached strong precision (95.20%) and recall (94.90%).
A major advantage of the proposed framework is its ability to perform real-time object detection, enabling rapid diagnostic support without sacrificing accuracy. In addition, the incorporation of Grad-CAM provides an interpretability component that is often lacking in comparable models, thereby enhancing user confidence and making the approach more appropriate for clinical environments where transparency is critical. The system also exhibited strong generalization when evaluated on the Chest X-ray COVID-19 & Pneumonia dataset, sustaining high performance with 98% accuracy, 97.76% precision, 98.41% recall, and 98.06% F1-score. Achieving consistent results across datasets with varying distributions demonstrates the robustness and adaptability of the YOLOv11-based method.

4.5. Discussion

The findings collectively demonstrate both the effectiveness and clinical promise of the proposed YOLOv11-based pneumonia detection framework, while also pointing to areas for future improvement. On both benchmark datasets, the model reached macro-average accuracies of 98.50% on the COVID-19 Radiography Dataset and 98% on the Chest X-ray COVID-19 & Pneumonia dataset. These consistently high values confirm the framework’s strong ability to recognize critical lung abnormalities, demonstrating competitive performance among state-of-the-art approaches. One of the notable strengths of this approach is the integration of Grad-CAM, which enhances interpretability by highlighting the lung regions most influential in the model’s predictions.
The heatmaps shown in Figure 7 and Figure 11 provide clear visual evidence that complements the numerical metrics, offering direct insight into the model’s decision-making. For pneumonia and COVID-19 images, Grad-CAM highlighted areas corresponding to typical pathological signs such as consolidations, infiltrates, and ground-glass opacities, whereas normal chest radiographs showed only minimal activation within the lung fields. This behavior confirms that the framework attends to medically relevant regions rather than spurious image artifacts, supporting the validity of the diagnostic outputs and building clinical trust, since radiologists can visually verify the AI’s reasoning. Preprocessing steps such as CLAHE and lung segmentation further improved the model’s focus and robustness, while the YOLOv11 architecture allowed for real-time detection with low computational requirements, an important feature for deployment in resource-limited settings.
The work is not without limitations. The datasets used, although reliable, cover a limited range of patient demographics and imaging variations, which could affect how well the model generalizes to different clinical environments. In addition, Grad-CAM’s usefulness depends on the model’s own accuracy—if the model is wrong, the highlighted regions may be misleading. Still, the overall findings show that the framework is an effective, efficient, and interpretable approach to automated pneumonia detection, with clear potential for further improvement through more diverse datasets and broader testing across hospitals and imaging systems.
To ensure a fair comparison, all baseline models reproduced in this study were trained and evaluated using identical data splits, preprocessing, and augmentation procedures as the proposed YOLOv11 framework. This consistent setup guarantees that differences in performance reflect model capability rather than data or processing bias. For previously published methods whose raw data or code were unavailable, the reported results are presented as indicative comparisons only, since exact replication under the same experimental settings was not possible. Statistical tests such as DeLong for AUC were not applied due to the lack of prediction-level outputs from external studies; however, the consistent improvements across multiple metrics demonstrate the robustness and reliability of the proposed model.

4.6. Clinical Significance and Practical Implications

The clinical significance of this framework extends beyond mere high accuracy scores. The combination of real-time performance, high diagnostic precision, and explainable outputs makes the proposed system a practical and valuable tool in healthcare settings, particularly those with limited resources. The ability to accurately detect and localize pneumonia in real time can significantly reduce the time from imaging to diagnosis, enabling earlier treatment and potentially improving patient outcomes. In settings where access to experienced radiologists is limited, this automated system can provide a reliable second opinion or serve as an initial screening tool to prioritize critical cases. The integrated Grad-CAM visualizations are key for clinical trust and validation, as they allow healthcare professionals to visually confirm that the model’s predictions are based on medically plausible evidence, thereby fostering greater acceptance of AI in clinical practice. This framework’s robustness, demonstrated on two different public datasets, suggests its potential for widespread, effective deployment, addressing the global health challenge posed by pneumonia.
To further improve the generalization and robustness of the proposed framework, future work may integrate concepts from robust feature learning and resilient signal reconstruction. Recent studies on complex scene parsing and light-field image watermarking have demonstrated effective strategies for maintaining feature stability and signal integrity in noisy or distorted environments [52,53]. Incorporating such approaches could enhance the proposed model’s adaptability to non-Gaussian noise and its applicability to multidimensional medical imaging scenarios.

5. Conclusions

Pneumonia remains a serious global health challenge, especially among vulnerable populations such as children and the elderly, where fast and accurate diagnosis can save lives. Traditional diagnostic methods, while valuable, often face practical challenges, ranging from subjective interpretation to limited access to expert radiologists and an inability to provide instant results in resource-limited areas. This study introduced a new, explainable pneumonia detection framework that combines the real-time object detection power of YOLOv11 with Grad-CAM visualizations for interpretability. The system not only delivers high detection accuracy but also pinpoints infected lung regions, giving clinicians a clear and visual explanation for each prediction. Tested on two benchmark datasets, the framework consistently outperformed or matched leading methods, demonstrating both accuracy and robustness. Looking ahead, the approach could be made even more versatile by incorporating additional medical data such as CT scans or patient records, adapting it for use on mobile or edge devices for on-site diagnosis, and exploring federated learning for privacy-preserving collaboration between hospitals. While this work focused on pneumonia, the same framework could be extended to detect and differentiate other thoracic diseases, such as tuberculosis or lung cancer, making it a powerful tool for comprehensive chest disease screening. Future work could focus on expanding the model to detect multiple chest diseases, integrating clinical and multimodal data, enhancing explainability with advanced XAI methods, and validating performance on larger and more diverse datasets.

Author Contributions

Conceptualization, M.A.A. and A.E.M.A.; methodology, A.A., A.I.S. and A.E.M.A.; software, A.I.S.; validation, M.A.A.; formal analysis, A.E.M.A. and M.A.A.; investigation, A.A., A.I.S. and M.A.A.; resources, E.-S.A., E.M.A. and A.E.M.A.; data curation, A.I.S. and M.A.A.; writing—original draft preparation, A.E.M.A., A.A. and M.A.A.; writing—review and editing, A.E.M.A., E.M.A. and E.-S.A.; visualization, A.I.S. and A.A.; supervision, E.M.A. and E.-S.A.; project administration, E.-S.A. and A.A.; funding acquisition, E.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

Deanship of Scientific Research at Northern Border University, Arar, KSA: NBU-FFR-2025-159-04.

Data Availability Statement

The datasets used in this work are publicly available at: https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database (accessed on 20 June 2025). https://www.kaggle.com/datasets/prashant268/chest-xray-covid19-pneumonia (accessed on 20 June 2025).

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA for funding this research work through the project number “NBU-FFR-2025-159-04”.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

References

  1. World Health Organization. Available online: https://www.who.int/news-room/fact-sheets/detail/pneumonia (accessed on 1 May 2025).
  2. Kermany, D.; Zhang, K.; Goldbaum, M. Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification. Mendeley Data 2018, 2. [Google Scholar] [CrossRef]
  3. Yanase, J.; Triantaphyllou, E. A systematic survey of computer-aided diagnosis in medicine: Past and present developments. Expert Syst. Appl. 2019, 138, 112821. [Google Scholar] [CrossRef]
  4. El-Rashidy, N.; Sedik, A.; Siam, A.I.; Ali, Z.H. An efficient edge/cloud medical system for rapid detection of level of consciousness in emergency medicine based on explainable machine learning models. Neural Comput. Appl. 2023, 35, 10695–10716. [Google Scholar] [CrossRef] [PubMed]
  5. Masmoudi, M.; Shakrouf, Y.; Omar, O.H.; Shikhli, A.; Abdalla, F.; Alketbi, W.; Alsyouf, I.; Cheaitou, A.; Jarndal, A.; Siam, A.I. Driver risk classification for transportation safety: A machine learning approach using psychological, physiological, and demographic factors with driving simulator. Eng. Appl. Artif. Intell. 2025, 162, 112585. [Google Scholar] [CrossRef]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  7. Abdelrahim, E.M.; Hashim, H.; Atlam, E.-S.; Osman, R.A.; Gad, I. TMS: Ensemble Deep Learning Model for Accurate Classification of Monkeypox Lesions Based on Transformer Models with SVM. Diagnostics 2024, 14, 2638. [Google Scholar] [CrossRef]
  8. Siam, A.I.; Sedik, A.; El-Shafai, W.; Elazm, A.A.; El-Bahnasawy, N.A.; El Banby, G.M.; Khalaf, A.A.; El-Samie, F.E.A. Biosignal classification for human identification based on convolutional neural networks. Int. J. Commun. Syst. 2021, 34, e4685. [Google Scholar] [CrossRef]
  9. Talaat, F.M.; El-Shafai, W.; Soliman, N.F.; Algarni, A.D.; El-Samie, F.E.A.; Siam, A.I. Real-time Arabic avatar for deaf-mute communication enabled by deep learning sign language translation. Comput. Electr. Eng. 2024, 119, 109475. [Google Scholar] [CrossRef]
  10. Zayed, A.; Siam, A.I. A YOLO-based Deep Learning Approach for Vibration-based Rotating Shaft Imbalance Detection. J. Eng. Res. 2025, 9, 32. [Google Scholar] [CrossRef]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar] [CrossRef]
  12. Samek, W.; Wiegand, T.; Müller, K.-R. Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models. arXiv 2017, arXiv:1708.08296. [Google Scholar]
  13. Atwa, A.E.M.; Atlam, E.-S.; Ahmed, A.; Atwa, M.A.; Abdelrahim, E.M.; Siam, A.I. Interpretable Deep Learning Models for Arrhythmia Classification Based on ECG Signals Using PTB-X Dataset. Diagnostics 2025, 15, 1950. [Google Scholar] [CrossRef]
  14. Abdullah, M.; Abrha, F.B.; Kedir, B.; Tagesse, T.T. A Hybrid Deep Learning CNN model for COVID-19 detection from chest X-rays. Heliyon 2024, 10, e26938. [Google Scholar] [CrossRef]
  15. Aslan, M.F.; Sabanci, K.; Durdu, A.; Unlersen, M.F. COVID-19 diagnosis using state-of-the-art CNN architecture features and Bayesian Optimization. Comput. Biol. Med. 2022, 142, 105244. [Google Scholar] [CrossRef]
  16. Lakshmi, M.; Das, R.; Manohar, B. A new COVID-19 classification approach based on Bayesian optimization SVM kernel using chest X-ray datasets. Evol. Syst. 2024, 15, 1521–1540. [Google Scholar] [CrossRef]
  17. Singh, A.K.; Kumar, A.; Kumar, V.; Prakash, S. COVID-19 Detection using adopted convolutional neural networks and high-performance computing. Multimedia Tools Appl. 2024, 83, 593–608. [Google Scholar] [CrossRef] [PubMed]
  18. El Houby, E.M.F. COVID-19 detection from chest X-ray images using transfer learning. Sci. Rep. 2024, 14, 11639. [Google Scholar] [CrossRef] [PubMed]
  19. Chakravarthy, S.R.S.; Bharanidharan, N.; Vinothini, C.; Kumar, V.V.; Mahesh, T.R.; Guluwadi, S. Adaptive Mish activation and ranger optimizer-based SEA-ResNet50 model with explainable AI for multiclass classification of COVID-19 chest X-ray images. BMC Med. Imaging 2024, 24, 206. [Google Scholar] [CrossRef] [PubMed]
  20. Srinivas, K.; Sri, R.G.; Pravallika, K.; Nishitha, K.; Polamuri, S.R. COVID-19 prediction based on hybrid Inception V3 with VGG16 using chest X-ray images. Multimedia Tools Appl. 2023, 83, 36665–36682. [Google Scholar] [CrossRef]
  21. Ullah, N.; Khan, J.A.; Almakdi, S.; Khan, M.S.; Alshehri, M.; Alboaneen, D.; Raza, A. A Novel CovidDetNet Deep Learning Model for Effective COVID-19 Infection Detection Using Chest Radiograph Images. Appl. Sci. 2022, 12, 6269. [Google Scholar] [CrossRef]
  22. Szepesi, P.; Szilágyi, L. Detection of pneumonia using convolutional neural networks and deep learning. Biocybern. Biomed. Eng. 2022, 42, 1012–1022. [Google Scholar] [CrossRef]
  23. Bayram, F.; Eleyan, A. COVID-19 detection on chest radiographs using feature fusion based deep learning. Signal Image Video Process. 2022, 16, 1455–1462. [Google Scholar] [CrossRef]
  24. Naz, S.; Phan, K.; Chen, Y.-P.P. Centralized and Federated Learning for COVID-19 Detection with Chest X-Ray Images: Implementations and Analysis. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2987–3000. [Google Scholar] [CrossRef]
  25. Kareem, A.; Liu, H.; Velisavljevic, V. A federated learning framework for pneumonia image detection using distributed data. Health Anal. 2023, 4, 100204. [Google Scholar] [CrossRef]
  26. Mabrouk, A.; Redondo, R.P.D.; Elaziz, M.A.; Kayed, M. Ensemble Federated Learning: An approach for collaborative pneumonia diagnosis. Appl. Soft Comput. 2023, 144, 110500. [Google Scholar] [CrossRef]
  27. Munna, M.S.; Chowdhury, R.; Siddiqee, A.M. A Comparative Study of YOLO Models for Pneumonia Detection. Available online: www.ijfmr.com (accessed on 1 May 2025).
  28. Yao, S.; Chen, Y.; Tian, X.; Jiang, R.; Ma, S. An improved algorithm for detecting pneumonia based on YOLOv3. Appl. Sci. 2020, 10, 1818. [Google Scholar] [CrossRef]
  29. Telaumbanua, P.; Fatihah, M.R.; Pratama Tarigan, A.; Afzalurrahmah, A. Deep Learning for Pneumonia Detection in Lung Diseases: Implementing YOLO11 Architecture. Available online: https://www.researchgate.net/publication/387020311_Deep_Learning_for_Pneumonia_Detection_in_Lung_Diseases_Implementing_YOLO11_Architecture (accessed on 1 May 2025).
  30. Zhao, B.; Chang, L.; Liu, Z. Fast-YOLO Network Model for X-Ray Image Detection of Pneumonia. Electronics 2025, 14, 903. [Google Scholar] [CrossRef]
  31. Xie, Y.; Zhu, B.; Jiang, Y.; Zhao, B.; Yu, H. Diagnosis of pneumonia from chest X-ray images using YOLO deep learning. Front. Neurorobotics 2025, 19, 1576438. [Google Scholar] [CrossRef]
  32. Zhang, X.; Liu, L.; Yang, X.; Liu, L.; Peng, W. NSEC-YOLO: Real-time lesion detection on chest X-ray with adaptive noise suppression and global perception aggregation. J. Radiat. Res. Appl. Sci. 2025, 18, 101281. [Google Scholar] [CrossRef]
  33. Hameed, M.A.; Hassaballah, M.; Abdelazim, R.; Sahu, A.K. A novel medical steganography technique based on Adversarial Neural Cryptography and digital signature using least significant bit replacement. Int. J. Cogn. Comput. Eng. 2024, 5, 379–397. [Google Scholar] [CrossRef]
34. Das, A.; Agarwal, R.; Singh, R.; Chowdhury, A.; Nandi, D. Automatic Detection of COVID-19 from Chest X-Ray Images Using Deep Learning Model. arXiv 2024, arXiv:2408.14927. [Google Scholar]
  35. Kailasam, R.; Balasubramanian, S. Deep Learning for Pneumonia Detection: A Combined CNN and YOLO Approach. Human-Centric Intell. Syst. 2025, 5, 44–62. [Google Scholar] [CrossRef]
  36. Nguyen, H.T.; Nguyen, M.N.; Phung, L.D.; Pham, L.T.T. Anomalies Detection in Chest X-Rays Images Using Faster R-CNN and YOLO. Vietnam. J. Comput. Sci. 2023, 10, 499–515. [Google Scholar] [CrossRef]
  37. Balaha, H.M.; Bahgat, W.M.; Aljohani, M.; Bamaqa, A.; Atlam, E.-S.; Badawy, M.; Elhosseini, M.A. AOA-guided hyperparameter refinement for precise medical image segmentation. Alex. Eng. J. 2025, 120, 547–560. [Google Scholar] [CrossRef]
  38. Ibrahim, A.U.; Ozsoz, M.; Serte, S.; Al-Turjman, F.; Yakoi, P.S. Pneumonia Classification Using Deep Learning from Chest X-ray Images During COVID-19. Cogn. Comput. 2024, 16, 1589–1601. [Google Scholar] [CrossRef]
  39. Khan, E.; Rehman, M.Z.U.; Ahmed, F.; Alfouzan, F.A.; Alzahrani, N.M.; Ahmad, J. Chest X-ray Classification for the Detection of COVID-19 Using Deep Learning Techniques. Sensors 2022, 22, 1211. [Google Scholar] [CrossRef]
  40. Talaat, M.; Si, X.; Xi, J. Multi-Level Training and Testing of CNN Models in Diagnosing Multi-Center COVID-19 and Pneumonia X-ray Images. Appl. Sci. 2023, 13, 10270. [Google Scholar] [CrossRef]
  41. Karim, A.M.; Kaya, H.; Alcan, V.; Sen, B.; Hadimlioglu, I.A. New Optimized Deep Learning Application for COVID-19 Detection in Chest X-ray Images. Symmetry 2022, 14, 1003. [Google Scholar] [CrossRef]
  42. Mohan, G.; Subashini, M.M.; Balan, S.; Singh, S. A multiclass deep learning algorithm for healthy lung, Covid-19 and pneumonia disease detection from chest X-ray images. Discov. Artif. Intell. 2024, 4, 20. [Google Scholar] [CrossRef]
  43. Fahad, N.; Ahmed, R.; Jahan, F.; Sadib, R.J.; Morol, M.K.; Al Jubair, M.A. MIC: Medical Image Classification Using Chest X-ray (COVID-19 & Pneumonia) Dataset with the Help of CNN and Customized CNN. In Proceedings of the 3rd International Conference on Computing Advancements, Dhaka, Bangladesh, 17–18 October 2024; ACM: New York, NY, USA, 2024; pp. 1007–1013. [Google Scholar] [CrossRef]
  44. Lunagaria, M.; Katkar, V.; Vaghela, K. COVID-19 and Pneumonia Infection Detection from Chest X-Ray Images using U-Net, EfficientNetB1, XGBoost and Recursive Feature Elimination. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 475–483. [Google Scholar] [CrossRef]
  45. Rahman, T. COVID-19 Radiography Database. 2021. Available online: https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database (accessed on 1 May 2025).
  46. Mohan, P. Chest X-Ray (COVID-19 & Pneumonia). 2021. Available online: https://www.kaggle.com/datasets/prashant268/chest-xray-covid19-pneumonia (accessed on 1 May 2025).
  47. Zuiderveld, K. Contrast limited adaptive histogram equalization. In Graphics Gems IV; Academic Press Professional: San Diego, CA, USA, 1994; pp. 474–485. [Google Scholar]
48. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  49. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  50. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 618–626. [Google Scholar] [CrossRef]
  51. Miller, C.; Portlock, T.; Nyaga, D.M.; O’Sullivan, J.M. A review of model evaluation metrics for machine learning in genetics and genomics. Front. Bioinform. 2024, 4, 1457619. [Google Scholar] [CrossRef]
  52. Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462. [Google Scholar] [CrossRef]
  53. Wang, C.; Zhang, Q.; Wang, X.; Zhou, L.; Li, Q.; Xia, Z.; Ma, B.; Shi, Y.-Q. Light-Field Image Multiple Reversible Robust Watermarking Against Geometric Attacks. IEEE Trans. Dependable Secur. Comput. 2025, 99, 1–15. [Google Scholar] [CrossRef]
Figure 1. The proposed pneumonia detection framework.
Figure 2. Preprocessing steps with an image sample.
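A minimal sketch of the CLAHE contrast-enhancement step illustrated in Figure 2, using OpenCV [47]. The clip limit and tile size below are common defaults rather than values reported in the paper, and the ROI-extraction and lung-segmentation steps are omitted for brevity:

```python
# Minimal CLAHE preprocessing sketch (assumed parameters, not the
# authors' exact configuration); ROI extraction and lung segmentation
# are omitted.
import cv2

def preprocess_cxr(path: str, size: int = 520):
    # Load the chest X-ray as a single-channel grayscale image.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Contrast Limited Adaptive Histogram Equalization (CLAHE) [47].
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img)
    # Resize to the network input resolution (Table 2 uses 520 x 520).
    return cv2.resize(enhanced, (size, size), interpolation=cv2.INTER_AREA)
```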
Figure 3. YOLOv11 architecture.
Figure 4. The model performance using Dataset 1.
Figure 5. Confusion matrix of the proposed model using Dataset 1.
Figure 6. Samples of model predictions using Dataset 1: (a) true labels; (b) predicted labels.
Figure 7. Grad-CAM explainability maps for samples from Dataset 1 (red: high contribution; blue: no contribution).
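The heatmaps in Figure 7 follow the standard Grad-CAM recipe [50]: gradients of the target class score are global-average-pooled into channel weights, which then combine the feature maps of a late convolutional layer. The following is a minimal PyTorch sketch for a generic CNN classifier; the model and target layer are placeholders, not the exact detection pipeline used in this work:

```python
# Minimal Grad-CAM sketch [50]; model and target_layer are assumed
# placeholders for a generic CNN classifier.
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx):
    feats, grads = [], []
    fh = target_layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out))
    bh = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.append(gout[0]))

    logits = model(x)                # x: (1, C, H, W)
    model.zero_grad()
    logits[0, class_idx].backward()  # gradient of the target class score
    fh.remove(); bh.remove()

    # Channel weights: gradients global-average-pooled over space.
    w = grads[0].mean(dim=(2, 3), keepdim=True)
    # Weighted sum of feature maps, rectified, upsampled, normalized.
    cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```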
Figure 8. The model performance using Dataset 2.
Figure 9. Confusion matrix of the proposed model using Dataset 2.
Figure 10. Samples of model predictions using Dataset 2: (a) true labels; (b) predicted labels.
Figure 11. Grad-CAM explainability maps for samples from Dataset 2 (red: high contribution; blue: no contribution).
Table 1. Summary of pneumonia detection related work.

| Ref. | Year | Dataset (# Images) | Model Used | Classification Type | Performance |
|------|------|--------------------|------------|---------------------|-------------|
| [14] | 2024 | COVID-19 Radiography Database (9220 images) | Hybrid CNN (VGG16 + VGG19) + Avg Pooling | Binary | Accuracy: 92.00% |
| [31] | 2025 | Re-annotated images from MIMIC-CXR dataset (4194 images) | Fast-YOLO | Multi-class | Precision: 95.20%, Recall: 94.90% |
| [38] | 2020 | COVID-19 Radiography Database (5856 images) | AlexNet | Binary, 3-class, 4-class | Accuracy: 99.16%, 94.00%, 93.42% |
| [39] | 2022 | COVID-19 Radiography Database (21,000 images) | EfficientNetB1, MobileNetV2, NasNetMobile | 4-class | Accuracy: 96.13% |
| [40] | 2023 | COVID-19 Radiography Database, Chest X-ray (COVID-19 & Pneumonia), Other (25,000 images) | ResNet-50, VGG-19, AlexNet, MobileNet | Multi-center | Accuracy: 99.54%, 95.90%, 83.00% |
| [41] | 2022 | COVID-19 Radiography Database, Chest X-ray (COVID-19 & Pneumonia) (26,009 images) | CNN + ALO + NB/SVM/KNN/DT | Multi-class | Accuracy: 99.63% |
| [42] | 2024 | COVID-19 Radiography Database, Chest X-ray (COVID-19 & Pneumonia) (images) | VGG16, InceptionResNetV2, Custom CNN | Multi-class | Accuracy: 97.00%, 96.00%, 93.00% |
| [43] | 2024 | Chest X-ray (COVID-19 & Pneumonia) (6432 images) | CNN, Customized CNN (CCNN) | Multi-class | Accuracy: 95.62% |
| [44] | 2022 | Mendeley (5000 images) | U-Net, EfficientNetB1, XGBoost, RFE | Multi-class | Accuracy: 97.60% |
Table 2. YOLOv11 model parameters used in the experiments.

| Parameter | Value |
|-----------|-------|
| Epochs | 100 |
| Image size | 520 × 520 |
| Patience | 100 |
| Batch size | 16 |
| Number of classes | 3 |
| Optimizer | Auto |
| Initial learning rate (lr0) | 0.01 |
| Momentum | 0.94 |
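The settings in Table 2 map directly onto the Ultralytics training API. The following is a minimal sketch assuming that implementation; the checkpoint variant and dataset path are illustrative placeholders, not the authors' exact configuration:

```python
# Minimal training sketch with the Table 2 settings, assuming the
# Ultralytics API; weights file and dataset path are hypothetical.
from ultralytics import YOLO

model = YOLO("yolo11n-cls.pt")  # assumed YOLOv11 classification checkpoint
model.train(
    data="chest_xray",          # hypothetical dataset directory
    epochs=100,                 # Epochs
    imgsz=520,                  # Image size (520 x 520)
    patience=100,               # Patience (early-stopping window)
    batch=16,                   # Batch size
    optimizer="auto",           # Optimizer
    lr0=0.01,                   # Initial learning rate
    momentum=0.94,              # Momentum
)
```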
Table 3. The performance metrics of the proposed model using the first dataset.

| Class | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|-------|--------------|---------------|------------|--------------|
| COVID | 98.98 | 98.32 | 97.41 | 97.86 |
| Normal | 98.50 | 98.51 | 99.28 | 98.89 |
| Pneumonia | 99.52 | 98.97 | 95.52 | 97.22 |
| Macro average | 98.50 | 98.60 | 97.40 | 97.99 |
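The macro averages reported here (and in Table 4 below) are unweighted means of the per-class scores [51]. A minimal sketch of this computation with scikit-learn, using illustrative label vectors rather than the actual model outputs:

```python
# Macro-averaged precision, recall, and F1 as unweighted per-class
# means; y_true/y_pred are illustrative placeholders.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["COVID", "Normal", "Pneumonia", "Normal", "Pneumonia", "COVID"]
y_pred = ["COVID", "Normal", "Pneumonia", "Pneumonia", "Pneumonia", "COVID"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Macro precision={precision:.4f}, recall={recall:.4f}, F1={f1:.4f}")
```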
Table 4. The performance metrics of the proposed model using the second dataset.

| Class | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|-------|--------------|---------------|------------|--------------|
| COVID | 99.92 | 100 | 99.14 | 99.57 |
| Normal | 98.06 | 93.98 | 98.42 | 96.15 |
| Pneumonia | 97.98 | 99.29 | 97.66 | 98.45 |
| Macro average | 98.00 | 97.76 | 98.41 | 98.06 |
Table 5. Performance evaluation against some related work.

| Ref. | Dataset | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|------|---------|-------|--------------|---------------|------------|--------------|
| [14] | COVID-19 Radiography dataset | Hybrid DL + ML | 92 | 93 | 92 | 92 |
| [15] | COVID-19 Radiography dataset | DenseNet201 + SVM | 96.29 | 96.42 | 96.42 | 94.53 |
| [16] | COVID-19 Radiography dataset | ResNet50 + SVM | 96.20 | - | - | - |
| [17] | Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification + Chest X-Ray Images (Pneumonia) | Adopted-CNN and ResNet50 | 97 | - | - | - |
| [18] | COVID-19 Radiography dataset | VGG19 | 93.38 | 94.12 | 96 | 95.05 |
| [20] | COVID-19 Radiography dataset | Inception V3 with VGG16 | 98 | 98 | 98 | 98 |
| [21] | COVID-19 Radiography dataset | DL + SVM | 98.40 | 97 | 96.66 | 96.82 |
| [23] | COVID-19 Radiography dataset | CNN | 97.76 | 97.78 | 97.76 | 97.76 |
| [24] | COVID-19 Radiography dataset | Federated Learning (FL) | 98 | 98 | 98 | 98 |
| [28] | RSNA dataset | YOLOv3 with MaskFPN | 81 | - | - | - |
| [31] | Re-annotated images from MIMIC-CXR dataset | Fast-YOLO | - | 95.20 | 94.90 | - |
| Proposed | COVID-19 Radiography dataset | YOLOv11 | 98.50 | 98.60 | 97.45 | 97.99 |
| Proposed | Chest X-ray (COVID-19 & Pneumonia) | YOLOv11 | 98.00 | 97.76 | 98.41 | 98.06 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
