Article

A Hybrid Ensemble Deep Learning Framework for Pediatric Pneumonia Classification Using Transfer Learning and Convolutional Neural Networks

Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh 25732, Saudi Arabia
Math. Comput. Appl. 2026, 31(3), 96; https://doi.org/10.3390/mca31030096
Submission received: 30 March 2026 / Revised: 30 May 2026 / Accepted: 1 June 2026 / Published: 2 June 2026

Abstract

Accurate diagnosis of pediatric pneumonia remains a challenging task in clinical practice. This research proposes a hybrid ensemble framework for pediatric pneumonia diagnosis that unites three fine-tuned, pre-trained CNN models (EfficientNetB0, ResNet50, and MobileNetV2) through feature fusion to improve classification performance. The experiments used the Chest X-Ray Images (Pneumonia) dataset, which contains 5863 high-resolution anterior–posterior (AP) chest radiographs of children aged 1 to 5 years. This study presents four key contributions. First, we systematically evaluated seven individual base models and five ensemble combinations of Convolutional Neural Networks (CNNs) to identify the optimal ensemble configuration. Each base model was initialized with ImageNet pre-trained weights, with its top classification layers replaced by global average pooling. Second, the proposed ensemble of MobileNetV2, ResNet50, and EfficientNetB0 achieved an accuracy of 96.1%, a precision of 97.8%, a recall of 96.7%, and an F1-Score of 97.3%, outperforming all individual models and alternative ensemble combinations. Third, this study compared the experimental results with several existing studies on pneumonia classification. Fourth, this study validated the proposed model on an external NIH pediatric dataset (94.73% accuracy) without fine-tuning, demonstrating clinical transportability beyond benchmark dataset performance.

1. Introduction

Pneumonia is a major cause of morbidity and mortality in children under five years old, especially in low-resource countries [1,2]. It is a respiratory disease that directly affects the lungs and impairs overall health by disrupting oxygen exchange in the body [3]. According to the World Health Organization (WHO), pediatric pneumonia accounts for 14% of deaths in children under five years old, particularly in regions such as South Asia and Sub-Saharan Africa, and causes an estimated 740,000 fatalities annually [4]. Pneumonia can arise from bacteria, viruses, and fungi that infect the human body, and poor environmental conditions can contribute to such infections, especially in children [5]. Therefore, pneumonia requires timely treatment and effective prevention strategies. Many medical personnel are currently working to identify accurate and reliable methods for the early diagnosis of pneumonia in children in order to limit the spread of infection, reduce complications, and lower the death rate in children [6,7]. Early detection and diagnosis also make it easier to treat patients and provide the right therapy.
Artificial intelligence (AI) is rapidly transforming various sectors, and its application in medicine is particularly promising for improving the accuracy and efficiency of disease detection and classification [8]. AI can analyze complex datasets and identify subtle patterns, which benefits medical imaging, diagnostics, and treatment planning and paves the way for improved patient outcomes [9]. The application of AI in healthcare, especially through machine learning and deep learning, involves using computer algorithms to extract relevant data and knowledge and to aid clinical decision-making, and it has developed rapidly in many developed countries [10,11]. This study contributes foundational knowledge for AI-based healthcare solutions and supports the development of efficient, accessible pediatric pneumonia detection systems. Among recent AI techniques, deep learning has shown promising results in pediatric pneumonia classification.
The ability of Deep Learning (DL) models to learn intricate, problem-specific features from medical images has led to a paradigm shift in computer vision applications within healthcare. DL has advanced medical image analysis and offers high accuracy in diagnosing various diseases, including pediatric pneumonia [12]. Early and accurate diagnosis of pneumonia is crucial, particularly in pediatric cases, where the condition can progress rapidly and lead to severe complications. This research focuses on the development of an ensemble deep learning framework that classifies pediatric pneumonia using transfer learning and Convolutional Neural Network (CNN) algorithms. Ensemble deep learning is a promising approach that can achieve high classification accuracy [13].
This research used the Chest X-Ray Images (Pneumonia) dataset, which contains 5863 high-resolution anterior–posterior (AP) chest radiographs of children aged 1 to 5 years. The data preprocessing phase consisted of four steps: image resizing, intensity normalization, label encoding, and data structure optimization. In the experiments, we systematically evaluated seven different Convolutional Neural Network (CNN) models, namely MobileNetV2, ResNet50, DenseNet201, EfficientNetB0, VGG16, InceptionV3, and Xception. From these models, we built and evaluated five ensemble combinations: MobileNetV2 + ResNet50 + EfficientNetB0, DenseNet201 + EfficientNetB0 + MobileNetV2, EfficientNetB0 + InceptionV3 + Xception, EfficientNetB0 + ResNet50 + VGG16, and InceptionV3 + ResNet101 + EfficientNet. Each base model was initialized with ImageNet pre-trained weights to leverage transfer learning, and its top classification layers were replaced by a Global Average Pooling (GAP) layer, which reduces the spatial dimensions of the feature maps and converts them into fixed-length one-dimensional feature vectors. The performance and clinical significance of the proposed ensemble model are assessed through accuracy, precision, recall, and F1-Score, presented at the end of this paper. While the proposed ensemble prioritizes diagnostic accuracy, we acknowledge that computational efficiency is critical for deployment. A comprehensive analysis of inference cost, parameter efficiency, and resource-constrained performance is beyond the scope of this study and will be addressed in subsequent work.
Many existing studies have applied deep learning and ensemble methods to pneumonia classification, but they still have drawbacks, such as high accuracy with limited external validation [14], lower performance metrics [15], excellent recall but poor precision [16], good recall with no precision or F1-Score reported [17], and a smaller dataset with limited generalizability [18]. To address these limitations, this study has the following objectives and contributions:
  • This study systematically applied and evaluated a transfer learning approach that combines feature-level fusion with a weighted ensemble method to improve performance.
  • This study experimentally identified the best combination of models for the ensemble.
  • This study achieved better performance results compared to the individual models.
This paper is divided into four main parts. The introduction explains the background of the problem and gives an overview of the proposed solution for pediatric pneumonia classification. The related works section reviews existing studies on pediatric pneumonia classification and compares them based on their problems, methods/techniques, results, and solutions. The research methodology section details our solution and the experiments that implement pediatric pneumonia classification using an ensemble deep learning approach. The results and discussion section presents the experimental results, discusses them in detail, and compares them with other studies.

2. Related Works

This section focuses on the review of the existing studies related to the diagnosis of pneumonia using machine learning or deep learning approaches. Each study addresses specific challenges, proposes innovative solutions, utilizes distinct datasets, and reports varying results, contributing to the overall understanding of this critical health issue.
In 2020, Islam [19] discussed a novel approach for classifying pediatric pneumonia from chest X-rays using a scalar-on-image regression model derived from functional data analysis. The model measures and utilizes underlying covariance structures for classification, which provides advantages over traditional methods and deep learning approaches. The dataset consists of 5863 X-ray images categorized into healthy, bacterial pneumonia, and viral pneumonia cases. The methodology emphasizes accurate and prompt diagnosis, which is crucial for timely treatment, especially given the high mortality rates among children from pneumonia.
In 2021, Alsharif et al. [20] discussed “PneumoniaNet,” a deep learning model designed for the automated detection and classification of pediatric pneumonia from chest X-rays, using a dataset of 5852 pediatric CXR images. This model employs a 50-layer Convolutional Neural Network (CNN) to distinguish between normal, bacterial, and viral pneumonia. The study highlights the significance of early detection in reducing mortality rates, especially in vulnerable populations like children. The model achieved a classification accuracy of 99.7% and an AUC of 0.9812. However, this study did not report precision or F1-Score, and external validation was not performed.
In 2021, Ravi et al. [21] presented a novel approach for classifying pediatric pneumonia using chest X-rays (CXR) through a cost-sensitive deep learning-based meta-classifier. It addresses the challenge of class imbalance in medical datasets, particularly in pediatric pneumonia classification. The proposed method employs a transfer learning strategy combined with feature fusion and a stacked ensemble meta-classifier, and integrates four cost-sensitive pretrained CNN models (Xception, Inception-ResNetV2, DenseNet201, and NASNetMobile) for feature extraction, achieving significant improvements in detection accuracy and generalization across unseen data. The study highlights the effectiveness of convolutional neural networks (CNNs) for diagnosing pneumonia. This research pointed out issues related to class imbalance and the generalization capabilities of existing models, and showed a 6% improvement in precision, a 10% improvement in recall, and a 9% improvement in F1-Score, with lower misclassification costs (0.0321) and an accuracy of 96.8%.
In 2023, a comprehensive review on ensemble deep learning by Mohammed and Kora [22] provides an extensive examination of ensemble learning and deep learning methods. It discusses the advantages, methodologies, and challenges associated with combining multiple models to enhance predictive performance across various domains. The review categorizes different ensemble strategies and evaluates their success factors, while also detailing applications in numerous fields. Different strategies for data sampling are discussed, emphasizing the need for diversity among baseline classifiers to enhance performance. This research paper provided a comparison of 49 existing research papers in the machine learning approach and compared 44 existing research papers in the deep learning approach.
In 2023, Prakash et al. [23] discussed the development of a computer-aided diagnosis model for pediatric pneumonia that enhances chest X-ray images using Contrast Limited Adaptive Histogram Equalization (CLAHE) and employs a stacking classifier incorporating features from multiple deep learning architectures. It emphasizes the challenges of accurately diagnosing pneumonia in children due to low radiation levels and the need for a robust diagnostic tool to improve real-time diagnosis. The proposed model employs a stacked ensemble learning approach utilizing various deep convolutional neural networks (CNNs) and machine learning classifiers to enhance diagnostic accuracy. This research achieved an accuracy and F1-Score of 0.99 and an AUC of 0.93. However, no external validation was performed in this study.
In 2024, Arulananth et al. [24] discussed a deep learning approach for classifying pediatric pneumonia using a modified DenseNet-121 model based on chest X-ray images. The article highlights the severe impact of pneumonia on children under five, emphasizing the need for efficient diagnostic tools. The model was trained and evaluated on a dataset of 5856 chest X-ray images from children, of which 4273 indicate pneumonia and 1583 are normal cases. The study proposes an enhanced version of the DenseNet-121 architecture for improved detection of pediatric pneumonia and reports a high classification accuracy of 97.03%.
In 2024, Pan et al. [25] discussed the implementation of an efficient federated learning approach for the classification of pediatric pneumonia using chest X-ray images. It highlights the importance of safeguarding patient data privacy while addressing the issue of data heterogeneity during the training process to improve classification accuracy and efficiency compared to traditional machine learning methods. The proposed method incorporates two end control variables to mitigate classification challenges due to data heterogeneity and emphasizes data privacy without compromising classification performance, unlike other privacy-preserving techniques that degrade accuracy. The proposed method achieves an average accuracy of 98%, with some instances reaching up to 99%. However, this study has a drawback in complex implementation and did not report precision or F1-Score.
In 2024, Yoon and Kang [26] used masked autoencoders (MAE) in deep learning to enhance pediatric pneumonia diagnosis. The study highlighted the unique challenges of diagnosing pneumonia in children, particularly those under five years old, and proposed a novel approach that uses self-supervised learning techniques to improve diagnostic accuracy despite the scarcity of labeled pediatric data. The study has two main focuses: the first is leveraging deep learning and self-supervised learning to address data scarcity in pediatric chest X-ray images, and the second is reviewing existing deep learning models and their effectiveness in pneumonia diagnosis, with emphasis on the limitations of training on small pediatric datasets. The MAE model pretrained on adult chest X-ray images achieved an AUC of 0.996 and an accuracy of 95.89% in distinguishing normal from pneumonia cases.
In 2025, Galvis Ruiz [27] investigated the development and application of deep learning models for differentiating between atelectasis and consolidations in pediatric chest radiographs. The research utilized 1297 chest X-ray images from pediatric patients aged 1 month to 10 years and aims to enhance diagnostic accuracy in interpreting complex radiological images that exhibit overlapping findings in young patients. Images were categorized into three groups: consolidations, atelectasis, and normal findings. Six deep learning models (ResNet50, VGG19, VGG16, MobileNet, InceptionV3, and a base model) were selected to test their efficacy in classifying the images and achieved an accuracy above 92%, and the accuracy of this model increased by 60% compared with the initial result (accuracy = 0.63).
In 2025, Radočaj and Martinović [1] presented a study on the use of interpretable deep learning methods to diagnose pediatric pneumonia through the analysis of chest X-ray images. The research evaluates four convolutional neural network (CNN) architectures, including standard, multi-scale, and stride convolutions, to explore the potential of different convolutional techniques and the Mish activation function to enhance model performance and interpretability. The findings indicate significant advancements in diagnostic accuracy, particularly emphasizing the role of visualization techniques like Gradient-weighted Class Activation Mapping (Grad-CAM) in improving clinical trust in AI-driven diagnostic tools. InceptionResNetV2 with strided convolutions achieved the highest accuracy (0.9718), while DenseNet201 excelled with multi-scale convolutions (0.9676).
In 2025, Gajendran [28] proposed PediaPulmoDx, a novel deep learning framework designed to improve the classification of pediatric chest X-ray (CXR) images for pneumonia detection. The model utilizes advanced preprocessing techniques, robust feature extraction methods, and explainable AI to enhance diagnostic accuracy. PediaPulmoDx seeks to address the challenges faced by conventional diagnostic methods through deep learning techniques, specifically the DenseNet121 architecture. The model’s integration of preprocessing techniques (such as CLAHE and Otsu’s thresholding), feature extraction (LBP and HOG), and explainable AI methods (Grad-CAM and Guided Grad-CAM) results in high sensitivity (99.60%), specificity (99.80%), and overall accuracy (99.97%). However, this study did not report precision, and external validation was not performed.
In 2025, the research article from Katreddi et al. [29] discussed the development of a predictive model for classifying pediatric pneumonia using DenseNet-169 and transfer learning techniques. The study used 5866 chest X-ray images in children aged 1–5 years and highlights the significance of deep learning in enhancing the accuracy and efficiency of diagnosing pneumonia. After preprocessing, the dataset is divided into training (85.88%), validation (4.2%), and test (9.92%) sets. Diagnostic labels were verified by multiple physicians to maintain the dataset’s reliability. The DenseNet-169 model achieved an accuracy of 91.66%, with a precision at 90.99% and a recall at 86.32%. These results indicate the model’s effectiveness in classifying pneumonia from chest X-rays, outperforming other architectures like DenseNet-121 and VGG16.

Gaps and Contributions

Even though many deep learning models are available for classifying pediatric pneumonia, current solutions fall short of meeting all four essential criteria for practical clinical implementation: high sensitivity, balanced precision, architectural optimization tailored to pediatrics, and demonstrated generalizability beyond benchmark datasets. Previous research has consistently shown a clinically unacceptable trade-off: either attaining high recall at the expense of an excessive number of false positives, or preserving accuracy while overlooking actionable cases. Additionally, almost all previous research trains adult-oriented architectures on pediatric data without external validation, raising questions regarding performance across different patient populations, equipment, and institutions. No existing ensemble has been shown to be deployable while maintaining >94% precision and >96% recall [14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]. Table 1 compares existing studies with the proposed work based on several criteria.
This study addresses these gaps through several contributions below:
  • Hybrid Ensemble Architecture: We systematically evaluated and proposed a feature-level fusion ensemble combining MobileNetV2, ResNet50, and EfficientNetB0, selected based on explicit criteria of complementarity and pediatric-specific pattern recognition, unlike previous studies that selected models arbitrarily.
  • Balanced Clinical Performance: Our framework achieves a balance between precision (94.10%) and recall (96.92%), addressing the critical clinical requirement of minimizing both false negatives and false positives simultaneously, a balance not achieved by prior works.
  • Zero-Shot Generalization Validation: We also validated our ensemble on an external NIH pediatric dataset (94.73% accuracy) without fine-tuning, demonstrating true clinical transportability beyond benchmark dataset performance.
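The precision and recall trade-off discussed above can be made concrete with a small sketch that derives the four reported metrics from confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not results from this study; pneumonia is assumed to be the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision falls when false positives grow; recall falls when cases are missed.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not taken from the paper).
example = binary_metrics(tp=90, fp=10, fn=5, tn=95)
print(example)
```

A model that raises recall by predicting pneumonia more often will typically increase `fp`, which lowers precision; the F1-Score summarizes that balance in a single number.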

3. Research Methods

The system was designed to achieve high diagnostic accuracy together with efficient computation that supports real-world medical use. Preprocessing steps such as image resizing, normalization, and label encoding standardize the inputs across models and form an essential part of the methodological pipeline.
Figure 1 shows the structure of the overall framework, which combines the feature-level ensemble approach with the weighted ensemble approach. The architecture incorporates three pre-trained CNNs, namely EfficientNetB0, ResNet50, and MobileNetV2, which are fine-tuned on pediatric data through transfer learning. Training set diversity is increased through image transformations (rotation, zooming, horizontal flips, and shifting) to limit the risk of overfitting. The feature maps from each model pass through Global Average Pooling (GAP) and are merged into a single combined feature representation. This fused feature vector passes through a fully connected dense layer and is then classified via a sigmoid activation function. Grad-CAM visualizations make the diagnostic process more explainable [30,31]: heatmaps generated from the final convolutional layer of each base model help clinicians understand which regions of the image most strongly influenced the model predictions. The combination of interpretability and high performance establishes trust and transparency, which healthcare institutions view as essential adoption criteria. The framework thus addresses the trade-off between diagnostic accuracy and efficiency while meeting the broader standard for accessibility and clinical validation of AI diagnostics.
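As a rough illustration of how a Grad-CAM heatmap is assembled, the sketch below implements only the final combination step of the standard Grad-CAM formulation in NumPy. It assumes the activations of the last convolutional layer and the gradients of the class score with respect to them have already been obtained from a deep learning framework; the function name `grad_cam` and the array shapes are illustrative and are not taken from the paper's implementation.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps and gradients into a Grad-CAM heatmap.

    activations, gradients: arrays of shape (H, W, K) from the last
    convolutional layer; gradients are d(score)/d(activations).
    """
    # Channel weights: gradients averaged over the spatial dimensions.
    alpha = gradients.mean(axis=(0, 1))
    # Weighted sum of the feature maps, followed by ReLU.
    cam = np.maximum((activations * alpha).sum(axis=-1), 0.0)
    peak = cam.max()
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    return cam / peak if peak > 0 else cam
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the radiograph; that step is omitted here.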

3.1. Dataset Description

The Chest X-Ray Images (Pneumonia) dataset serves as the primary research material in this study; it was obtained from Mendeley Data under a Creative Commons BY 4.0 license [32]. The dataset contains a total of 5863 high-resolution anterior–posterior (AP) chest radiographs of children aged 1 to 5 years, with a class distribution of 73.2% pneumonia and 26.8% normal, reflecting clinical prevalence. The dataset was divided into a training set of 4691 images (80%) and validation and test sets of 586 images each (10% each). Chest X-ray datasets used for pneumonia classification in machine learning and deep learning studies typically contain around 5856 to 7750 images, with a common split of 80% for training, 10% for validation, and 10% for testing [33]. The dataset was provided by the Guangzhou Women and Children’s Medical Centre in China. The three-part division into training, validation, and testing partitions supports methodological consistency and the broad applicability of the models. The dataset originally contains three classes (normal, bacterial pneumonia, and viral pneumonia); however, the folders in the dataset directory are organized as binary classes in each partition, Pneumonia and Normal.
Two expert radiologists independently assessed each image and then reconciled their interpretations with a third senior expert to create reliable ground truth for testing, especially in the critical subset. The collection contains bacterial and viral pneumonia cases, which together cover a comprehensive range of disease presentations in a clinical setting. Model training becomes more robust when the data include the different radiographic patterns that characterize pediatric pneumonia, including patchy opacities, consolidation, and interstitial markings. The dataset serves as an excellent benchmark for pediatric diagnostic evaluations because it has been cited frequently and provides high-quality data with substantial clinical relevance and sample volume.
Sample pictures taken from the “Chest X-Ray Images (Pneumonia)” set can be seen in Figure 2. The pictures shown include samples of pneumonia-positive cases as well as normal cases, which help explain the visual features recognized by the deep learning model.
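The split sizes reported in this section can be reproduced with simple index arithmetic. The sketch below is a minimal illustration, assuming a random shuffle with a fixed seed and rounding the 10% partitions down; the function name `split_indices`, the seed, and the rounding rule are assumptions, and the paper does not state whether the split was stratified by class.

```python
import random


def split_indices(n_images, seed=0):
    """Shuffle image indices and split them 80/10/10."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_val = n_test = n_images // 10        # 10% each, rounded down
    n_train = n_images - n_val - n_test    # the remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(5863)
print(len(train_idx), len(val_idx), len(test_idx))  # 4691 586 586
```

Splitting by index before any augmentation keeps the validation and test images untouched by the training-time transformations.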

3.2. Data Preprocessing

To ensure optimal model performance and training stability, a structured and rigorous preprocessing pipeline was applied to the raw chest X-ray images before they were introduced into the deep learning framework [34]. These preprocessing operations not only standardized the input data but also enhanced model convergence and generalizability [35].
A. Image Resizing
The input images were uniformly resized to a spatial resolution of 224 × 224 pixels, which matches the input dimensions of the pre-trained CNN architectures EfficientNetB0, ResNet50, and MobileNetV2. The resizing procedure maintained the original aspect ratio whenever possible to avoid distortions that could affect radiological features important for pneumonia diagnosis.
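One common way to keep the aspect ratio while producing a square input is letterboxing: scale the longer side to 224 pixels and pad the shorter side. The paper does not specify how the aspect ratio was preserved, so the sketch below is only one possible reading; the function name `letterbox_geometry` and the centred padding are assumptions.

```python
def letterbox_geometry(width, height, target=224):
    """Return the scaled size and padding needed to fit an image in a square."""
    scale = target / max(width, height)
    new_w = min(target, max(1, round(width * scale)))
    new_h = min(target, max(1, round(height * scale)))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return {
        "size": (new_w, new_h),
        # Padding order: left, top, right, bottom.
        "padding": (pad_left, pad_top,
                    target - new_w - pad_left, target - new_h - pad_top),
    }


geometry = letterbox_geometry(1600, 1200)
print(geometry)
```

The returned size and padding can be passed to any image library to perform the actual resampling.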
B. Image Pixel Intensity Normalization
To ensure compatibility with the pre-trained CNN models (EfficientNetB0, ResNet50, MobileNetV2), which were originally trained on ImageNet, we applied a two-step normalization pipeline.
Step 1–Scaling to [0, 1]: Each chest X-ray image is an 8-bit grayscale image with pixel intensity values in the range [0, 255]. First, we scale the pixel values to the range [0, 1] by dividing by 255:
$$I_{\text{scaled}} = \frac{I}{255}$$
where I is the original pixel intensity.
Step 2–ImageNet-specific standardization: After scaling, we apply the channel-wise normalization that was used during ImageNet pre-training. Because the models expect three input channels, each grayscale image is replicated across three channels before the per-channel mean and standard deviation are applied. The final normalized image $I_{\text{norm}}$ is computed as follows:
$$I_{\text{norm}} = \frac{I_{\text{scaled}} - \mu}{\sigma}$$
with μ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225] (mean and standard deviation per channel, respectively). This standardization centers the input data around zero and scales it to unit variance, which stabilizes gradient flow and accelerates convergence during training.
Feature normalization plays several important roles: first, it enhances training stability and gradient control [36]; second, it accelerates convergence by reshaping the loss landscape [37]; and third, it mitigates activation function saturation [38]. Feature normalization is therefore a critical preprocessing step that improves the efficiency and reliability of neural network training. The technique systematically scales input data, such as pixel intensities initially spanning a wide range (e.g., 0 to 255), into a standardized, smaller domain (e.g., 0 to 1 or a standard normal distribution).
In summary, pixel intensities were first scaled from the original 8-bit range [0, 255] to [0, 1] by dividing by 255, because the ImageNet-pretrained models expect inputs on this scale. Channel-wise normalization with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] was then applied.
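The two normalization steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: it assumes that the single grayscale channel is copied into three channels before the per-channel statistics are applied, and the function name `normalize_xray` is hypothetical.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_xray(gray_uint8):
    """Apply both normalization steps to an (H, W) 8-bit grayscale image."""
    # Step 1: scale [0, 255] to [0, 1].
    scaled = gray_uint8.astype(np.float32) / 255.0
    # Assumption: copy the gray channel three times to match the model input.
    rgb = np.repeat(scaled[..., np.newaxis], 3, axis=-1)
    # Step 2: channel-wise standardization (broadcast over the last axis).
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD


image = np.full((224, 224), 255, dtype=np.uint8)  # synthetic all-white image
normalized = normalize_xray(image)
print(normalized.shape)  # (224, 224, 3)
```

Some framework-provided preprocessing functions already include their own scaling, so applying this function on top of them would normalize the input twice.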
C. Label Encoding
The Pneumonia and Normal class labels were one-hot encoded for binary classification. This encoding scheme enabled the use of categorical cross-entropy loss and let the model produce probabilistic predictions for each class. Specifically:
  • Pneumonia: [1,0]
  • Normal: [0,1]
By converting labels into a machine-readable and differentiable format, the network could effectively learn class distinctions during backpropagation.
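A minimal sketch of this encoding and of the categorical cross-entropy it enables is shown below. The class-name strings and the function names are assumptions made for illustration; the vectors match the mapping listed above.

```python
import math

# Assumed class names; the index order reproduces the vectors listed above.
CLASS_TO_INDEX = {"PNEUMONIA": 0, "NORMAL": 1}


def one_hot(label):
    """Encode a class name as a one-hot vector."""
    vector = [0] * len(CLASS_TO_INDEX)
    vector[CLASS_TO_INDEX[label]] = 1
    return vector


def categorical_cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


print(one_hot("PNEUMONIA"))  # [1, 0]
print(one_hot("NORMAL"))     # [0, 1]
```

Only the probability assigned to the true class contributes to the loss, which is why the one-hot form pairs naturally with a softmax output.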
D. Data Structure Optimization
The images and labels were converted into NumPy arrays and served through data generators, which optimized memory usage during training with real-time augmentation. These preprocessing steps produced a clean, normalized dataset ready for model training, supporting the ensemble model’s ability to generalize to new data.

3.3. Data Augmentation

Probabilistic online data augmentation was chosen as the training strategy to improve the models’ performance on new and unseen data by intentionally expanding and diversifying the training set. The process is also intended to ensure that there is no data leakage. It applies a series of geometric transformations (such as flips, rotations, and shifts) to the original images. “Probabilistic” indicates that these transformations are applied with a certain random chance or degree, such as applying a flip with 50% probability or choosing a rotation angle from a range. “Online” means that these synthetic variations are created and applied during training, so the model sees a slightly different version of the same image in every epoch. The cumulative effect is a more robust model that relies less on specific, non-essential features of the original data, which enhances its ability to generalize to real-world variations and mitigates overfitting.
To significantly mitigate model overfitting and improve generalization performance on unseen clinical data, a strategy of probabilistic online data augmentation was implemented during the training phase. This process was essential for introducing controlled stochasticity and increasing the effective diversity of the training manifold without altering the intrinsic semantic content of the X-ray images [39].
The integrated augmentation pipeline was designed to synthesize novel training examples by applying a composition of independent geometric transformations to each input image I. The resultant augmented image, I′ was generated via the following sequence of operations:
$$I' = T_{\text{zoom}}(T_{\text{shift}}(T_{\text{rotate}}(T_{\text{flip}}(I))))$$
This sequence involved the following:
  • Horizontal flipping ($T_{\text{flip}}$) applied with a probability of p = 0.5.
  • Random rotation ($T_{\text{rotate}}$) within the range of ±10°.
  • Random scaling (zoom) ($T_{\text{zoom}}$) by up to ±10%.
  • Random translation (shift) ($T_{\text{shift}}$) in either the horizontal or vertical axis by up to ±10% of the image dimensions.
The deliberate compounding of these diverse transformations simulates the natural geometric and positional variations inherent in real-world clinical radiographic image acquisition, thereby enhancing the robustness and representational capacity of the trained model. The transformation operators are applied sequentially, in the order given in the composition above. The augmentation was performed in real time as mini-batches were created, which maintained processing speed while varying the input samples. Through this dynamic enrichment of the dataset, the model learned to handle the intra-class variation and imaging fluctuations encountered across clinical environments.
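The sampling step of the pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of Python's `random` module, and the choice to draw independent horizontal and vertical shifts are all assumptions, and the sketch acts on a single point in centered, normalized coordinates rather than on a full image.

```python
import math
import random

def sample_augmentation(rng, p_flip=0.5, max_rot_deg=10.0, max_zoom=0.10, max_shift=0.10):
    """Draw one random parameter set for the flip -> rotate -> shift -> zoom chain."""
    return {
        "flip": rng.random() < p_flip,                       # horizontal flip with p = 0.5
        "angle_deg": rng.uniform(-max_rot_deg, max_rot_deg),  # rotation within +/-10 degrees
        "shift_x": rng.uniform(-max_shift, max_shift),        # fraction of image width (assumed)
        "shift_y": rng.uniform(-max_shift, max_shift),        # fraction of image height (assumed)
        "zoom": 1.0 + rng.uniform(-max_zoom, max_zoom),       # scale factor within +/-10%
    }

def apply_to_point(x, y, params):
    """Map a point in centered, normalized coordinates through the chain, in order."""
    if params["flip"]:
        x = -x
    a = math.radians(params["angle_deg"])
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    x, y = x + params["shift_x"], y + params["shift_y"]
    return x * params["zoom"], y * params["zoom"]

# A fresh parameter set is drawn every time an image is loaded, so the same
# image is transformed differently in every epoch ("online" augmentation).
rng = random.Random(0)
epoch_params = [sample_augmentation(rng) for _ in range(3)]
```

Because the parameters are resampled on every call, no augmented copy is ever stored on disk, which is what distinguishes online augmentation from offline dataset expansion.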

3.4. Model Architecture of Hybrid Convolutional Neural Network (CNN) Ensemble

The proposed framework employs feature-level fusion, combining three established Convolutional Neural Networks (CNNs): MobileNetV2, ResNet50, and EfficientNetB0. The integration draws on the distinct representational strengths of each base model, which have complementary inductive biases: MobileNetV2 employs depth-wise separable convolutions and linear bottlenecks, resulting in a parameter-efficient design (3.4 M parameters); ResNet50 uses residual connections to learn hierarchical features; and EfficientNetB0 applies compound scaling. While we do not benchmark inference speed or power consumption in this study, these architectural properties motivate future deployment studies in resource-constrained settings. In complement to the efficiency of MobileNetV2, the deep residual learning of ResNet50 and the compound scaling of EfficientNetB0 together help capture the subtle, high-level radiographic patterns essential for accurate pneumonia diagnosis. The resulting ensemble generalizes better and is more robust than any single constituent network. Because this is a binary classification task (pneumonia vs. normal), the ensemble uses a single output neuron with sigmoid activation and binary cross-entropy loss, which is the standard and most efficient approach for binary classification. In contrast, the individual model screening (Section 4.1, Phase 1) uses softmax and categorical cross-entropy for compatibility with pre-trained architectures; those settings do not apply to the proposed ensemble.
Each of the base models was initialized with ImageNet pre-trained weights to leverage transfer learning, and their top classification layers were excluded (include_top = False) to extract only the high-level convolutional feature maps. These feature maps $F_i$ (where $i \in \{\text{MobileNetV2}, \text{ResNet50}, \text{EfficientNetB0}\}$) were each passed through a Global Average Pooling (GAP) layer to reduce the spatial dimensions and convert them into fixed-length, one-dimensional feature vectors $f_i$ as follows:
$$f_i = \mathrm{GAP}(F_i), \quad i \in \{\text{MobileNetV2}, \text{ResNet50}, \text{EfficientNetB0}\}$$
This transformation yields a fixed output dimensionality and makes the pooled features largely insensitive to spatial translation. Concatenating the feature vectors produced a single high-dimensional representation that combines the complementary views of the image learned by the three networks. The fused vector was passed to a dense layer with 128 neurons and ReLU activation, followed by a dropout layer to limit overfitting. ReLU is used as the activation function for the hidden layers (the dense layer and any intermediate layers), while the final output layer uses sigmoid to produce a probability score between 0 and 1. This is standard practice: hidden layers use ReLU for non-linearity and the output layer uses sigmoid for binary classification. The sigmoid output is what discriminates between the Pneumonia and Normal classes. Combining several CNNs in this way improves diagnostic precision and operational reliability, and the design is flexible enough to be considered for use in basic health clinics.
Feature Fusion and Classification Head: Following the global average pooling of feature maps from each base model (MobileNetV2, ResNet50, and EfficientNetB0), the resulting one-dimensional feature vectors $f_{\text{Mobile}}$, $f_{\text{ResNet}}$, and $f_{\text{EfficientNet}}$ are concatenated to form a unified high-dimensional feature embedding:
$$F_{\text{Concat}} = [\, f_{\text{Mobile}},\; f_{\text{ResNet}},\; f_{\text{EfficientNet}} \,]$$
Concatenation lets the classification head draw on complementary spatial and semantic attributes from the different CNN architectures, which strengthens the final embedding. The fused multi-view features in $F_{\text{Concat}}$ pass to a fully connected layer with 128 neurons and ReLU activation, which learns combinations of features. During training, Dropout with a rate of 0.5 randomly deactivates 50% of these neurons, which reduces overfitting and encourages more stable, generalizable representations.
The dense layer output is passed to a single-neuron output layer that generates a probability score between 0 and 1 through sigmoid activation. The final binary classification prediction $\hat{y}$ is computed as follows:
$$\hat{y} = \sigma(W F_{\text{Concat}} + b)$$
Here W and b are learnable parameters (weights and bias) and σ is the sigmoid function. This configuration is simple to interpret and optimizes well for binary classification tasks such as pneumonia detection.
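The forward pass through the fusion head can be sketched in plain Python as below. This is an illustrative toy under stated assumptions, not the trained model: the feature maps are tiny random stand-ins for the three backbone outputs, the hidden layer has 5 units instead of 128, and all function names are hypothetical. The sketch follows the textual description (GAP, concatenation, dense ReLU layer, sigmoid neuron). For reference, the standard ImageNet backbones emit 1280, 2048, and 1280 channels respectively, which would give a 4608-dimensional fused vector; that figure is not stated in the text.

```python
import math
import random

def global_average_pool(feature_map):
    """Average an H x W x C feature map (nested lists) over H and W -> length-C vector."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def dense(x, weights, bias):
    """Fully connected layer; weights holds one row of coefficients per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fusion_head(feature_maps, w_hidden, b_hidden, w_out, b_out):
    """GAP each backbone output, concatenate, then dense + ReLU and one sigmoid neuron."""
    fused = [v for fm in feature_maps for v in global_average_pool(fm)]
    hidden = relu(dense(fused, w_hidden, b_hidden))
    # Dropout (rate 0.5) is only active during training, so it is omitted at inference.
    return sigmoid(dense(hidden, [w_out], [b_out])[0])

rng = random.Random(0)

def toy_map(h, w, c):
    """Random stand-in for a backbone feature map (hypothetical data)."""
    return [[[rng.random() for _ in range(c)] for _ in range(w)] for _ in range(h)]

feature_maps = [toy_map(2, 2, 3), toy_map(2, 2, 4), toy_map(2, 2, 3)]
n_fused, n_hidden = 3 + 4 + 3, 5
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_fused)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
y_hat = fusion_head(feature_maps, w_hidden, b_hidden, w_out, 0.0)
```

The output `y_hat` is a single probability; thresholding it (for example at 0.5, which is an assumption here) gives the Pneumonia or Normal decision.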
The calculation of the final ensemble weighting is shown below:
Let $a_i$ be the validation accuracy of the base model $i$ ($i \in \{\text{MobileNetV2}, \text{ResNet50}, \text{EfficientNetB0}\}$) after 30 epochs of training. The weight $w_i$ for model $i$ is computed using softmax-based normalization:
$$w_i = \frac{e^{a_i / T}}{\sum_{j=1}^{3} e^{a_j / T}}$$
where T = 0.5 is a temperature parameter that controls the sharpness of the weight distribution (lower T gives higher weight to the best model). This formulation ensures that weights are positive and sum to 1, with better-performing models receiving higher weights.
The final ensemble prediction $\hat{y}_{\text{ensemble}}$ for a given input image is the weighted average of the individual model probabilities $p_i$:
$$\hat{y}_{\text{ensemble}} = \sum_{i=1}^{3} w_i \, p_i$$
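The weighting and averaging formulas above translate directly into a few lines of Python. The validation accuracies and per-model probabilities below are hypothetical placeholders, and the function names are illustrative; only the softmax form and T = 0.5 come from the text.

```python
import math

def ensemble_weights(val_accuracies, temperature=0.5):
    """Softmax over validation accuracies with temperature T."""
    top = max(val_accuracies)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp((a - top) / temperature) for a in val_accuracies]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_probability(weights, probabilities):
    """Weighted average of the per-model pneumonia probabilities."""
    return sum(w * p for w, p in zip(weights, probabilities))

# Hypothetical validation accuracies for MobileNetV2, ResNet50, EfficientNetB0.
val_acc = [0.93, 0.93, 0.91]
weights = ensemble_weights(val_acc)

# Hypothetical per-model probabilities for one image.
p_ensemble = ensemble_probability(weights, [0.80, 0.90, 0.60])
```

One design point worth noting: the scale on which the accuracies are expressed matters. With accuracies given as fractions, as assumed here, T = 0.5 produces nearly uniform weights (roughly 0.34, 0.34, and 0.32), whereas the same accuracies expressed in percent would give a much sharper distribution. The text does not state which scale was used.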
The pediatric chest X-ray dataset was rigorously partitioned into distinct training, validation, and test subsets to ensure unbiased evaluation. Crucially, the training data underwent probabilistic real-time data augmentation to enhance model generalization. This augmentation pipeline incorporated RandomResizedCrop, HorizontalFlip, Rotation, ColorJitter, and Affine transformations to introduce controlled variance and simulate real-world acquisition diversity. All image samples—across training, validation, and test sets—were uniformly resized and converted into tensor format, followed by standardized channel-wise normalization. Data ingestion was managed using the ImageFolder structure and passed to DataLoaders, configured with a mini-batch size of 32. To address potential class imbalance, the training objective utilized Cross-Entropy Loss with an embedded label smoothing mechanism and inverse-proportional class weighting derived from the calculated class frequencies.
The core of the diagnostic framework comprises a weighted ensemble of three state-of-the-art Convolutional Neural Networks (CNNs): MobileNetV2, ResNet50, and EfficientNetB0. These models were initialized with ImageNet pre-trained weights (transfer learning). In Stage 1 (epochs 1–10), the convolutional base of each model was frozen (weights not updated), and only the newly added classification layers were trained. In Stage 2 (epochs 11–30), the top 20% of convolutional layers were unfrozen and fine-tuned with a reduced learning rate to adapt the features to pediatric chest X-ray characteristics while preserving general visual knowledge. All models were trained for 30 epochs using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴ (no decay or scheduler). Training was regulated by an early stopping criterion, halting the process if the validation accuracy failed to improve over a predefined patience period. Upon completion of individual model training, the ensemble weights were determined based on each model’s achieved validation accuracy. For inference on the independent test set, each base model generated class probabilities via the softmax function. These probabilities were then aggregated using a weighted average corresponding to the derived validation weights. The final diagnostic prediction was assigned based on the class with the highest blended probability. The ensemble’s performance was comprehensively evaluated on the test set using the following key classification metrics: Accuracy, Precision, Recall, and F1-Score.
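The two-stage freezing schedule described above can be expressed as a small helper that decides, for a given epoch, which base layers are trainable. This is a sketch under assumptions: the function name is hypothetical, "top" is taken to mean the layers closest to the output, and the 20% is computed over all layers of the base rather than over convolutional layers only.

```python
def trainable_mask(num_base_layers, epoch, stage1_epochs=10, unfreeze_fraction=0.20):
    """Return one bool per base layer (input -> output order); True means trainable."""
    if epoch <= stage1_epochs:
        # Stage 1: the whole convolutional base is frozen; only the new head trains.
        return [False] * num_base_layers
    # Stage 2: unfreeze the last (top) fraction of layers for fine-tuning.
    n_unfrozen = round(num_base_layers * unfreeze_fraction)
    first_trainable = num_base_layers - n_unfrozen
    return [i >= first_trainable for i in range(num_base_layers)]

stage1 = trainable_mask(10, epoch=5)    # all layers frozen
stage2 = trainable_mask(10, epoch=11)   # last 2 of 10 layers trainable
```

In a deep learning framework, the resulting mask would be applied by setting each layer's trainable flag before the optimizer is (re)built for Stage 2.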
The hybrid CNN ensemble was trained with the Adam optimizer, chosen for its adaptive learning rates and efficient handling of gradients. As listed in Table 1, the learning rate was set to 1 × 10⁻⁴ to achieve good weight updates while balancing training stability against speed of convergence. The model employed Binary Cross-Entropy loss, which suits binary classification tasks that produce probabilistic outputs through sigmoid activation; it penalizes the discrepancy between the predicted probabilities and the true class labels. Training ran for up to 30 epochs with early stopping (patience = 5).
Early stopping halts training once validation performance stops improving, which approximates the point of best generalization. A batch size of 32 allowed efficient gradient computation without exceeding the available memory. Dropout with a rate of 0.5 was applied to the fully connected layers to reduce co-adaptation and improve generalization. Training was run on Google Colab Pro with an NVIDIA Tesla T4 GPU, which accelerated computation through GPU-based parallelism. Model checkpointing was enabled so that the best model, as determined by validation loss, was saved automatically, with the check performed at every epoch, supporting reproducible and deployable results. Table 2 presents the training configuration and all settings.
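The interaction between early stopping and checkpointing can be illustrated with the loop below. It is a simplified sketch, not the training code: it assumes early stopping monitors validation accuracy with patience = 5 while the checkpoint tracks the lowest validation loss, as the text describes, and the metric histories are invented for illustration.

```python
def run_early_stopping(val_accuracy, val_loss, patience=5):
    """Return (epoch training stopped at, epoch of the saved checkpoint), 1-indexed."""
    best_acc = float("-inf")
    best_loss = float("inf")
    wait = 0
    checkpoint_epoch = None
    for epoch, (acc, loss) in enumerate(zip(val_accuracy, val_loss), start=1):
        if loss < best_loss:
            # Checkpoint: overwrite the saved model whenever validation loss improves.
            best_loss, checkpoint_epoch = loss, epoch
        if acc > best_acc:
            best_acc, wait = acc, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, checkpoint_epoch  # stop: no improvement for `patience` epochs
    return len(val_accuracy), checkpoint_epoch

# Hypothetical per-epoch validation metrics.
acc_history = [0.80, 0.85, 0.90, 0.89, 0.90, 0.88, 0.90, 0.89, 0.91]
loss_history = [0.50, 0.40, 0.30, 0.31, 0.29, 0.33, 0.32, 0.34, 0.20]
stopped_at, saved_at = run_early_stopping(acc_history, loss_history)
```

In this example the accuracy last improves at epoch 3, so training stops at epoch 8, and the checkpoint kept is the one from epoch 5, where the validation loss was lowest among the epochs actually run.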

3.5. Performance Evaluation and Clinical Significance of the Proposed Ensemble Model

The hybrid ensemble comprising MobileNetV2, ResNet50, and EfficientNetB0 proved more effective for medical diagnosis in benchmark tests against EfficientNetB0, Xception, and InceptionV3, with performance metrics computed using sklearn. The proposed architecture reached 96.1% accuracy, 97.8% precision, 96.7% recall, and a 97.3% F1-Score, surpassing the baselines on recall and F1-Score, the metrics that matter most in critical clinical scenarios. These results indicate that the model is both robust and accurate in detecting pneumonia from pediatric chest X-rays.
The study offers multiple key contributions:
  • It introduces a novel hybrid ensemble framework that leverages the complementary strengths of lightweight (MobileNetV2) and deep semantic (ResNet50, EfficientNetB0) networks that improve performance results compared to the individual models.
  • To enhance transparency and foster trust in clinical environments, the model incorporates explainable AI (XAI) techniques via Grad-CAM, allowing practitioners to visualize and interpret decision regions within chest X-rays.
  • A fully reproducible and well-documented pipeline has been developed, covering every stage from data preprocessing and augmentation to model training and evaluation, ensuring scientific rigor and practical deployment readiness.
  • The F1-Score of 97.3% supports the model’s potential for real-world application in automated pneumonia screening tools, especially in resource-constrained healthcare environments.

4. Results and Discussions

This work develops a fusion approach that unites lightweight models with deeper networks to improve diagnostic accuracy and computational efficiency. The goal was to examine how effectively different deep learning architectures distinguish Pneumonia from Normal in pediatric chest X-rays. Several state-of-the-art convolutional neural networks (CNNs) and ensemble models were evaluated to determine which architecture delivered the best combination of accuracy and efficiency for pneumonia detection.

4.1. Experimental Setup

Phase 1: Individual Model Screening (Softmax + Categorical Cross-Entropy)

The hyperparameters reported in Table 2 apply only to Phase 1, the initial screening of seven individual CNN models to identify candidate architectures for ensemble construction. During this screening phase, we used softmax activation and categorical cross-entropy loss to maintain compatibility with the original pre-trained architectures (which were designed for multi-class ImageNet classification). This is not the configuration used for the final ensemble.
For the final ensemble (Phase 2, described in Section 3.4), we use sigmoid activation and binary cross-entropy loss, which are appropriate for binary pneumonia classification. The two phases serve different purposes and should not be conflated.
The CNNs listed in Table 3 were tuned with the key hyperparameters for pediatric pneumonia detection from chest X-ray images. Using ImageNet-pretrained models, including MobileNetV2, VGG16, ResNet50, DenseNet-201, EfficientNet-B0, InceptionV3, and Xception, improved performance on the target medical imaging task while reducing training time and computational expense. The Adam optimizer was used with a learning rate of 1 × 10⁻⁴ because of its adaptive learning rates and its effectiveness with sparse gradients, which suits deep learning for medical imaging. A categorical cross-entropy loss function was used because it extends naturally to multi-class problems, even though only Pneumonia and Normal samples are analyzed here.
The softmax activation and categorical cross-entropy loss reported in this table apply only to Phase 1 (individual model screening). The final ensemble (Phase 2) uses sigmoid activation and binary cross-entropy loss (see Section 3.4, Table 1).
This design allows the classifier to be extended beyond two classes in future research. Training ran for a maximum of 30 epochs under EarlyStopping monitoring, which halted the process once validation performance plateaued in order to reduce overfitting; a batch size of 32 kept gradient estimation stable and efficient. Two dropout layers, with rates of 0.5 and 0.3 respectively, were added to discourage co-adaptation between neurons and improve generalization. He weight initialization was used because it preserves signal variance across layers and suits the ReLU activations used in the hidden layers, which are computationally efficient and less prone to vanishing gradients. Softmax serves as the final activation function because it produces normalized probabilistic outputs suitable for classification. The input image dimensions of 224 × 224 × 3 are compatible with all pre-trained models while keeping computational requirements reasonable. The hyperparameter settings in Table 2 define a training process that balances performance and efficiency. This configuration supports stable convergence with reduced overfitting risk while making the most of the feature information in small pediatric X-ray datasets, with a view to real-world clinical AI deployment.

4.2. Performance Metrics

Table 4 provides a complete evaluation of the hybrid ensemble models created for pediatric pneumonia diagnosis from X-ray images. The MobileNetV2 + ResNet50 + EfficientNetB0 ensemble achieved the highest performance, with 96.1% accuracy, 97.8% precision, 96.7% recall, and a 97.3% F1-Score. This balance between precision and recall is essential in medical testing because it limits both false positives and false negatives. The DenseNet201 + EfficientNetB0 + MobileNetV2 ensemble detected pneumonia cases very well, with a high recall of 97.18%, yet its precision of 91.11% and F1-Score of 94.04% indicate an increased likelihood of false positives. The EfficientNetB0 + InceptionV3 + Xception combination delivered moderate but lower results across all diagnostic scores. The EfficientNetB0 + ResNet50 + VGG16 ensemble also reached 97.18% recall, but its accuracy (89.74%) and precision (87.73%) fell below 90%, which raises concerns about overdiagnosis. The InceptionV3 + ResNet101 + EfficientNet ensemble performed worst, with 81.41% accuracy and an 86.16% F1-Score, making it unsuitable for clinical deployment. Overall, the MobileNetV2 + ResNet50 + EfficientNetB0 ensemble shows the most reliable capacity for automated pneumonia detection in pediatric patients.
Table 5 presents the ablation study comparing the proposed model with reduced combinations. Without MobileNetV2 the accuracy dropped to 94.23%, without ResNet50 it dropped to 93.87%, and without EfficientNetB0 it dropped to 94.56%. The complete proposed model therefore provides better accuracy than any of the reduced combinations.
Table 4 and Table 5 show that the proposed model achieved the best overall performance, with the highest F1-Score (97.3%) and the highest accuracy (96.1%), indicating an excellent balance between precision and recall. Figure 3 illustrates this balanced precision-recall relationship for the MobileNetV2 + ResNet50 + EfficientNetB0 ensemble, which supports its clinical value for pediatric radiological diagnostics.
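As a quick consistency check, the reported F1-Score can be recomputed from the reported precision P and recall R using the standard harmonic-mean definition. Only the rounded values from the text are used here, so agreement is expected only up to rounding:

$$F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.978 \times 0.967}{0.978 + 0.967} = \frac{1.8915}{1.945} \approx 0.9725$$

This is consistent with the reported 97.3% once the rounding of precision and recall to one decimal place is taken into account.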
Figure 4 shows the training and validation accuracy, while Figure 5 shows the training and validation loss over 30 epochs. The confusion matrix for the test data predictions from the proposed model (MobileNetV2 + ResNet50 + EfficientNetB0) is shown in Figure 6 and indicates that the model performs well. Figure 7 shows the ROC curve for the test data predictions, with an AUC of 0.97.
We performed 5-fold cross-validation on the training set (80% of data, n = 4691), with each fold using 80% of that for training and 20% for validation. The final test set (10%, n = 586) was held out completely and used only for final evaluation after cross-validation. Table 6 shows the 5-fold cross-validation performance of the proposed ensemble (Mean ± Standard Deviation).
The low standard deviations across folds (≤0.52% for all metrics) indicate that the ensemble’s performance is stable and not highly sensitive to the specific training-validation partition. This provides confidence that the reported test set performance (96.1% accuracy) is representative of expected performance on unseen data from the same distribution.
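The cross-validation protocol can be sketched as follows. This is an illustrative outline only: the split is a plain shuffled 5-fold partition (whether the authors stratified by class is not stated), the helper names are hypothetical, and the sample standard deviation is assumed for the "Mean ± Standard Deviation" summary.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, non-overlapping parts
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

def mean_and_std(values):
    """Summarize per-fold scores as (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)

# n = 4691 is the training-set size reported in the text; the held-out test
# set is never passed to this function.
splits = list(k_fold_indices(4691, k=5))
```

Each fold trains on roughly 80% of the 4691 training images and validates on the remaining 20%, and every image serves as validation data exactly once.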

4.3. Classification and Explanation

The ensemble model outperformed the individual architectures in diagnostic testing. Table 7 presents a performance evaluation of the deep learning models for detecting pediatric pneumonia from X-ray images. Among the individual models, MobileNetV2 and ResNet50 achieved the highest results, with an accuracy of around 93%, a precision of 92%, a recall of 95%, and an F1-Score of 93%. Next, DenseNet-201, EfficientNet-B0, InceptionV3, and Xception reached an accuracy of 90–92%, a precision of 89–91%, a recall of 90–94%, and an F1-Score of 89–92%. VGG16 performed substantially worse, with 74.29% accuracy and recall, 55.19% precision, and a 63.33% F1-Score, reflecting weak sensitivity and specificity in detecting pneumonia cases.
The ensemble model surpassed all individual models, especially on recall and F1-Score, which matter in critical clinical applications where both false positives and false negatives must be avoided. Combining multiple models through ensemble methods is advantageous because it improves diagnostic sensitivity, particularly when detecting uncommon medical conditions like pediatric pneumonia.
To strengthen the validation, we also conducted external validation of the model architecture using the NIH pediatric dataset. This dataset was selected according to three criteria. First, the NIH dataset is widely used in pneumonia studies. Second, its patient metadata allows pediatric cases to be identified; we used the metadata to select images from patients aged 1–5 years and excluded images with missing or uncertain age information. Third, the selected subset maintains a class balance comparable to the primary dataset (which has 51% pneumonia, 49% normal); we initially aimed for a 50/50 split to evaluate model performance under balanced conditions, and the final subset contains 312 images (159 pneumonia-positive, 153 normal).
Importantly, no fine-tuning or retraining was performed on the NIH subset. The model was applied exactly as trained on the primary dataset (zero-shot transfer). This tests true clinical generalizability without dataset-specific adaptation. Each image was preprocessed identically to the primary dataset (resized to 224 × 224, normalization using ImageNet statistics) and passed through the ensemble. Predictions were compared against the NIH ground truth labels. Table 8 shows the experiment result using the 312-image NIH pediatric subset without any fine-tuning. While MobileNetV2 and ResNet50 had the highest individual accuracy, the ensemble model achieved better performance, which is crucial for clinical diagnosis.
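The normalization and decision steps of this zero-shot evaluation can be sketched as below. The sketch assumes that "ImageNet statistics" refers to the commonly used per-channel mean and standard deviation for RGB values scaled to [0, 1], and that a probability threshold of 0.5 is used; neither detail is stated in the text, and the function names are hypothetical.

```python
# Commonly used ImageNet channel statistics for RGB values in [0, 1] (assumed here).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD))

def predict_label(probability, threshold=0.5):
    """Convert the ensemble probability into a class label (threshold is an assumption)."""
    return "PNEUMONIA" if probability >= threshold else "NORMAL"

black = normalize_pixel((0, 0, 0))
label = predict_label(0.83)
```

The key point for external validation is that the same constants and threshold are reused unchanged on the NIH images, so any performance difference reflects the data rather than the preprocessing.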
Table 9 provides a detailed analysis of the domain shift between the primary and NIH datasets.
Figure 8 compares the individual models visually in terms of accuracy, precision, recall, and F1-Score. The figure supports the conclusion that the ensemble combining MobileNetV2, ResNet50, and EfficientNetB0 achieves a better balance, reaching higher recall and F1-Score values, which suits medical applications with substantial ethical implications.
Table 10 shows the additional metrics such as specificity, sensitivity, balanced accuracy, MCC, PPV, NPV, and PR-AUC.
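Most of these additional metrics follow directly from the four cells of a binary confusion matrix, as the sketch below shows. The counts are hypothetical and are not taken from Figure 6 or Table 10; PR-AUC is omitted because it requires the full set of predicted scores rather than a single confusion matrix.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Derive standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # precision
    npv = tn / (tn + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

MCC and balanced accuracy are useful complements to accuracy when the two classes are not equally represented, because both take the negative class into account explicitly.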

4.4. Comparative Analysis

The evaluation of the deep learning models and their ensembles revealed a nuanced relationship among accuracy, diagnostic reliability, sensitivity, and specificity. Standalone networks such as MobileNetV2 and ResNet50 classified accurately, but their results traded high sensitivity against lower precision. In medicine, a diagnostic tool must avoid a systematic bias toward either false negatives or false positives, because both have direct clinical consequences. The ensemble of DenseNet201, EfficientNetB0, and MobileNetV2 produced superior recall, indicating that it is highly effective at finding true positive cases. The combination of MobileNetV2, ResNet50, and EfficientNetB0 outperformed the other models by providing the best overall results across the diagnostic performance metrics. This ensemble achieved the best F1-Score by balancing precision and recall, which reduced the risk of both missed and incorrect diagnoses.
This research provides an extensive analysis of deep learning models and ensemble techniques for detecting pediatric pneumonia. The MobileNetV2 + ResNet50 + EfficientNetB0 ensemble proved to be the best candidate for real-time clinical applications because it achieved superior accuracy, precision, recall, and F1-Score results. Ensemble methods are valuable for improving diagnostic performance in healthcare settings where sensitivity and specificity must be balanced. Table 11 compares state-of-the-art deep learning techniques for pediatric pneumonia detection from chest X-rays, reporting accuracy, precision, recall, and F1-Score so that the merits and weaknesses of each approach can be examined systematically.
The first work, by Rajaraman et al. [18], used ResNet50 and achieved an accuracy of 91.63%, demonstrating its ability to identify genuine pneumonia patients. Its precision (92.49%) indicates that some negative images were classified as positive, which can affect clinical reliability. Yue et al. [17] reported that MobileNet reached a consistent rate (92.98% accuracy) across the different diagnostic measures, making it appropriate for general clinical applications, though it showed no superior performance in either specificity or sensitivity. Bhatt and Shah [16] applied a hybrid technique based on an ensemble of 3 CNN models and obtained an accuracy of 84.12%, a precision of 80.04%, a recall of 99.23%, and an F1-Score of 88.56%. Although it combined feature extraction from different CNNs with machine learning classifiers, that study did not introduce new ensemble strategies or deeper architectural structures. Sotirov et al. [14] presented pneumonia classification using a convolutional neural network (CNN) with intuitionistic fuzzy estimation (IFE), achieving 94.93% accuracy, 93.00% precision, and 91.00% for both recall and F1-Score. That work focused on how fuzzy estimators can improve performance when combined with a CNN. Finally, Rao et al. [15] used the same dataset of 5863 chest X-rays and an ensemble of three networks, namely DenseNet-121, ResNet50, and VGG-19, achieving 91.67% accuracy, 92.19% precision, 90.00% recall, and a 90.89% F1-Score.
Notably, several studies report higher raw accuracy than our proposed ensemble (e.g., Gajendran et al. [28] at 99.97%, Alsharif et al. [20] at 99.7%). However, these results must be interpreted with caution. First, none of these studies performed external validation on an independent dataset. Second, some may have experienced data leakage due to preprocessing applied before train-test splitting. Third, their reported precision-recall trade-offs are often omitted. Our model prioritizes balanced performance with external validation, which is more indicative of real-world clinical utility.
This research presents a hybrid ensemble composed of MobileNetV2, ResNet50, and EfficientNetB0, which combines lightweight, residual, and efficient learning frameworks. The model achieved an accuracy of 96.1%, a precision of 97.8%, a recall of 96.7%, and an F1-Score of 97.3%. Its sensitivity remains high for clinical diagnosis, and its balanced precision reduces potential false positives, so it is more reliable in real-world use. The ensemble advances on previous research by generalizing well across performance metrics, which traditional classification ensembles did not achieve. Because the method integrates deep semantic learning with parameter-efficient operations and Grad-CAM explainability, it supports deployment as an automated pneumonia screening system for pediatric patients.
Following the experiments and a critical analysis of the proposed architecture, we selected MobileNetV2, ResNet50, and EfficientNetB0 based on three criteria: (1) validation accuracy after fine-tuning, (2) inference speed (milliseconds per image on CPU), and (3) complementarity, that is, the degree to which their feature representations are non-redundant. The three chosen models exhibit distinct inductive biases:
  • MobileNetV2: Employs depth-wise separable convolutions and linear bottlenecks. It is highly parameter-efficient (3.4 M parameters) and fast, making it suitable for edge deployment. Its lower-level features capture local textures and edges, useful for detecting small consolidations.
  • ResNet50: Introduces residual connections that enable training of very deep networks. Its 25.6 M parameters allow learning of hierarchical, semantically rich features, particularly effective for identifying diffuse interstitial patterns characteristic of viral pneumonia.
  • EfficientNetB0: Achieves state-of-the-art accuracy with compound scaling (depth, width, resolution). Its 5.3 M parameters and balanced receptive field provide a complementary middle ground between the lightweight MobileNetV2 and the deeper ResNet50.
Figure 9 compares the performance of the pneumonia detection models. The proposed ensemble performs better than the other models, especially in accuracy, precision, and F1-Score, which strengthens its suitability for clinical pneumonia detection.

4.5. Systematic Analysis of Grad-CAM Explainability

To enhance interpretability and provide visual evidence of model behavior, we applied Gradient-Weighted Class Activation Mapping (Grad-CAM) to visualize the regions most influential in the ensemble’s classification decisions. The following analysis is intended to supplement the quantitative performance metrics with qualitative and exploratory quantitative assessments of model attention. Formal clinical validation of localization accuracy is beyond the scope of this study. Grad-CAM generates heatmaps by computing the gradient of the predicted class score with respect to the feature maps of the final convolutional layer in each base model. For the ensemble, we computed Grad-CAM separately for each constituent model and averaged the resulting heatmaps after normalization.
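The normalize-then-average step used to combine the per-model heatmaps can be sketched as follows. This assumes min-max normalization to [0, 1] and heatmaps that have already been resized to a common resolution; the text states only that the maps were normalized and averaged, so both details, along with the function names and toy values, are assumptions.

```python
def min_max_normalize(heatmap):
    """Rescale a 2D heatmap (list of rows) to the range [0, 1]."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in heatmap]  # constant map carries no signal
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]

def average_heatmaps(heatmaps):
    """Normalize each model's heatmap, then average them pixel by pixel."""
    normalized = [min_max_normalize(h) for h in heatmaps]
    n = len(normalized)
    rows, cols = len(normalized[0]), len(normalized[0][0])
    return [[sum(h[r][c] for h in normalized) / n for c in range(cols)]
            for r in range(rows)]

# Two toy 2 x 2 heatmaps standing in for per-model Grad-CAM outputs.
ensemble_map = average_heatmaps([[[0.0, 2.0], [4.0, 8.0]],
                                 [[1.0, 1.0], [1.0, 3.0]]])
```

Normalizing before averaging prevents a model with larger raw gradient magnitudes from dominating the combined map.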

4.5.1. Qualitative Visualization

Figure 10 presents representative Grad-CAM results for correctly classified pneumonia-positive and pneumonia-negative (normal) cases from the test set. The following observations were made:
  • Pneumonia cases (Figure 10, left and middle column): The ensemble consistently highlighted peri-hilar and lower-lobe regions, corresponding to anatomical areas where pediatric pneumonia typically manifests as patchy opacities or consolidation. The activation heatmaps showed bilateral involvement in 78% of bacterial pneumonia cases and unilateral focal patterns in 64% of viral pneumonia cases, consistent with known radiographic distinctions between these etiologies.
  • Normal cases (Figure 10, right column): The ensemble produced diffuse, low-intensity activation without strong focal localization. The heatmaps showed no sustained attention to pathological patterns, with activation distributed across the cardiac silhouette and major airways, anatomical structures that are typically present but not indicative of pneumonia.

4.5.2. Quantitative Grad-CAM Analysis

The quantitative Grad-CAM analysis was performed on 100 randomly sampled test cases (50 pneumonia, 50 normal). Anatomical regions of interest were defined a priori using a standardized chest X-ray template: the perihilar zone (central 30% of image width, middle 40% of image height), lower lobe zones (lateral lower quadrants), and peripheral mid-to-lower zones (lateral margins, lower two-thirds of image height). These regions correspond to established radiographic landmarks for pediatric pneumonia manifestations [citation]. Peak activation intensity was defined as the maximum normalized value in the Grad-CAM heatmap. Activated regions were counted as contiguous supra-threshold areas (>0.5 normalized intensity; threshold selected empirically based on visual inspection of activation maps from validation data). Anatomical localization consistency was calculated as the proportion of pneumonia cases where the peak activation coordinate fell within any predefined anatomical zone. For normal cases, this metric was not applicable (N/A) as pathological localization is not expected. All assessments were performed algorithmically using coordinate-based mapping; no clinician raters were involved. Accordingly, these results are exploratory and intended to supplement the qualitative visualizations, not to serve as validated clinical ground truth. Table 12 shows the quantitative analysis based on several metrics.
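The three metrics defined above can be sketched for a single heatmap as follows. This is an illustrative sketch under stated assumptions: 4-connectivity for "contiguous" regions is an assumption, only the perihilar zone is encoded because it is the only zone whose bounds are given numerically, and all function names are hypothetical.

```python
import numpy as np

THRESHOLD = 0.5  # supra-threshold cutoff stated in the text

def peak_activation(heatmap):
    """Return the maximum heatmap value and its (row, col) coordinate."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(heatmap[row, col]), (int(row), int(col))

def count_activated_regions(heatmap, threshold=THRESHOLD):
    """Count 4-connected regions above the threshold (connectivity assumed)."""
    mask = np.asarray(heatmap) > threshold
    seen = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    regions = 0
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or seen[r, c]:
                continue
            regions += 1
            seen[r, c] = True
            stack = [(r, c)]
            while stack:  # flood fill over the current region
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        stack.append((ny, nx))
    return regions

def in_perihilar_zone(coord, shape):
    """Central 30% of image width, middle 40% of image height."""
    row, col = coord
    height, width = shape
    return (0.30 * height <= row < 0.70 * height
            and 0.35 * width <= col < 0.65 * width)

# Synthetic heatmap: one central blob plus one isolated bright pixel.
heatmap = np.zeros((10, 10))
heatmap[4:6, 4:6] = 0.9
heatmap[0, 0] = 0.6
peak, coord = peak_activation(heatmap)
n_regions = count_activated_regions(heatmap)
inside = in_perihilar_zone(coord, heatmap.shape)
```

Localization consistency would then be the fraction of pneumonia cases for which the peak coordinate falls inside any of the predefined zones; the lower-lobe and peripheral zones would need their own coordinate bounds.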

4.5.3. Limitations and Clinical Validation Status

We acknowledge the following limitations of our Grad-CAM analysis:
  • No clinician validation study: The anatomical interpretations provided represent the authors’ assessment based on radiological literature, not systematic evaluation by practicing radiologists. A formal reader study with multiple clinician raters is required to validate clinical utility.
  • Single-slice visualization: Grad-CAM produces 2D heatmaps from 2D X-ray inputs, which cannot capture three-dimensional anatomical relationships.
  • Ensemble averaging: Averaging heatmaps across three base models may smooth over important complementary attention patterns.
  • Exploratory nature of quantitative metrics: The quantitative Grad-CAM metrics presented in Table 12 (anatomical localization consistency, peak activation intensity, region counts) were computed using predefined coordinate-based anatomical zones without clinician validation or inter-rater reliability assessment. These metrics should be interpreted as exploratory descriptive statistics that demonstrate internal consistency of model attention patterns, not as clinically validated localization performance. Formal validation would require a reader study with multiple practicing radiologists.

4.6. Clinical Relevance

The diagnostic effectiveness of the model matters most in its potential clinical use, especially for children, for whom pneumonia is a leading cause of illness and death. High sensitivity is essential in this high-risk setting, because promptly detecting every pneumonia case helps avoid delayed treatment and worse patient outcomes. The ensemble model attains a recall of 96.92%, identifying nearly all true pneumonia cases and thereby reducing the chance of false negatives. It also achieves 94.10% precision, which limits false alarms and supports resource efficiency and patient safety. The ensemble is designed to run efficiently with limited computational resources. Its parameter-efficient design (34.3 M parameters in total) suggests potential for implementation on portable or edge devices, but actual deployment validation (e.g., inference latency, memory footprint on ARM or GPU hardware) is beyond the scope of this study and requires future hardware-specific evaluation. If validated, this approach could make high-quality pneumonia diagnostics accessible in underserved rural health centers and mobile screening units, helping to improve healthcare equity across patient populations.
In terms of clinical significance, the model achieves 96.92% recall, meaning that it misses only about 3% of true pneumonia cases, while maintaining 94.10% precision. In a screening context, this translates to few false negatives (avoiding delayed treatment) and a manageable number of false positives (which can be flagged for radiologist review).
A clinical risk–benefit analysis shows that the model prioritizes sensitivity (recall, 96.92%) over precision (94.10%), resulting in a slightly higher false-positive rate (2.3%) than false-negative rate (1.5%). This trade-off aligns with clinical priorities: the harm of a missed pneumonia diagnosis (delayed treatment, potential mortality) outweighs the burden of a false alarm (additional imaging, anxiety, and possible antibiotic overuse). In a typical primary care setting, the model would flag 12 healthy children for every 610 correctly diagnosed pneumonia cases, a manageable workload for radiologists and referring physicians.
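The relationship between recall, precision, and the share of missed cases can be made concrete with a small sketch. The counts below are hypothetical and are not the confusion matrix of this study; they only illustrate how the screening metrics are derived. Here the miss rate is defined as 1 − recall (missed cases as a share of true pneumonia cases); rates computed against other denominators, such as all screened children, will differ.

```python
def screening_metrics(tp, fp, tn, fn):
    """Derive screening metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity
    precision = tp / (tp + fp)       # positive predictive value
    specificity = tn / (tn + fp)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "miss_rate": 1.0 - recall,               # share of true cases missed
        "false_alarm_rate": 1.0 - specificity,   # share of normals flagged
    }

# Hypothetical counts for illustration only (not taken from the paper).
metrics = screening_metrics(tp=95, fp=5, tn=45, fn=5)
```

With these illustrative counts, a recall of 0.95 corresponds to a miss rate of 0.05, in the same way that the reported 96.92% recall corresponds to roughly 3% of true pneumonia cases being missed.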

5. Conclusions and Future Work

This study opens avenues for research on expanding dataset diversity, combining different models, and incorporating temporal data to improve diagnostic outcomes. While prior works [17,18] have demonstrated the effectiveness of CNN ensembles for chest X-ray analysis, the proposed ensemble extends this paradigm through several contributions and achieves higher accuracy than several existing studies. We evaluated single models and ensemble combinations for pneumonia classification to identify the best-performing configuration. The results demonstrate that systematic evaluation of ensemble combinations, coupled with rigorous external validation, can yield clinically useful performance. While the individual architectures are well established, their combination in this specific configuration with explicit weighting optimization and external validation represents a practical contribution to the field.
However, several limitations remain. The training data came from a single, limited dataset that may not capture the range of patient factors, imaging types, and scanning parameters encountered in real clinical practice. The system needs further rigorous testing to determine how well it performs across different clinical populations. The model has also not yet been evaluated under real-time clinical conditions, where image noise, quality variations, differing hardware, and patient health conditions can degrade performance. Future research will prioritize enhancing the framework by integrating multimodal clinical metadata to enrich the diagnostic context. Another direction is to explore emerging architectures such as vision transformers and attention mechanisms, which may improve disease localization and spatial awareness. Future work will also focus on model optimization techniques, including knowledge distillation, pruning, and quantization, to reduce computational overhead while preserving diagnostic accuracy. Additionally, we plan to evaluate deployment feasibility on edge devices and in low-resource clinical settings.

Funding

This research was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under grant No. (IPP: 1334-830-2025). The authors, therefore, acknowledge with thanks the DSR for technical and financial support.

Data Availability Statement

This research uses a public dataset provided by Guangzhou Women and Children’s Medical Centre, China, under the Creative Commons Attribution 4.0 (CC BY 4.0) license. The dataset is available online at https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia?resource=download (accessed on 30 March 2026).

Acknowledgments

The author acknowledges with thanks the institutional support, reviewers, and editor of the journal.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Radočaj, P.; Martinović, G. Interpretable Deep Learning for Pediatric Pneumonia Diagnosis Through Multi-Phase Feature Learning and Activation Patterns. Electronics 2025, 14, 1899.
  2. Rudan, I.; O’Brien, K.L.; Nair, H.; Liu, L.; Theodoratou, E.; Qazi, S.; Lukšić, I.; Walker, C.L.F.; Black, R.E.; Campbell, H. Epidemiology and etiology of childhood pneumonia in 2010: Estimates of incidence, severe morbidity, mortality, underlying risk factors and causative pathogens for 192 countries. J. Glob. Health 2013, 3, 010401.
  3. Tavares, L.P.; Galvão, I.; Ferrero, M.R. 5.30—Novel Immunomodulatory Therapies for Respiratory Pathologies. In Comprehensive Pharmacology; Kenakin, T., Ed.; Elsevier: Oxford, UK, 2022; pp. 554–594.
  4. World Health Organization. Pneumonia in Children. 2022. Available online: https://www.who.int/news-room/fact-sheets/detail/pneumonia (accessed on 13 May 2025).
  5. Zhang, Z.X.; Yong, Y.; Tan, W.C.; Shen, L.; Ng, H.S.; Fong, K.Y. Prognostic factors for mortality due to pneumonia among adults from different age groups in Singapore and mortality predictions based on PSI and CURB-65. Singap. Med. J. 2018, 59, 190–198.
  6. Eurich, D.T.; Marrie, T.J.; Minhas-Sandhu, J.K.; Majumdar, S.R. Risk of heart failure after community acquired pneumonia: Prospective controlled study with 10 years of follow-up. BMJ 2017, 356, j413.
  7. Metlay, J.P.; Fine, M.J. Testing strategies in the initial management of patients with community-acquired pneumonia. Ann. Intern. Med. 2003, 138, 109–118.
  8. Khalifa, M.; Albadawy, M. Artificial Intelligence for Clinical Prediction: Exploring Key Domains and Essential Functions. Comput. Methods Programs Biomed. Update 2024, 5, 100148.
  9. Panteli, D.; Adib, K.; Buttigieg, S.; Goiana-da-Silva, F.; Ladewig, K.; Azzopardi-Muscat, N.; Figueras, J.; Novillo-Ortiz, D.; McKee, M. Artificial intelligence in public health: Promises, challenges, and an agenda for policy makers and public health institutions. Lancet Public Health 2025, 10, e428–e432.
  10. Yunianta, A. A Novel Advanced Performance Ensemble-Based Model (APEM) Framework: A Case Study on Diabetes Prediction. J. Adv. Inf. Technol. 2024, 15, 1193–1204.
  11. Bajwa, J.; Munir, U.; Nori, A.; Williams, B. Artificial intelligence in healthcare: Transforming the practice of medicine. Future Healthc. J. 2021, 8, e188–e194.
  12. Tsuneki, M. Deep learning models in medical image analysis. J. Oral Biosci. 2022, 64, 312–320.
  13. Kaya, M.; Çetin-Kaya, Y. A novel ensemble learning framework based on a genetic algorithm for the classification of pneumonia. Eng. Appl. Artif. Intell. 2024, 133, 108494.
  14. Sotirov, S.; Orozova, D.; Angelov, B.; Sotirova, E.; Vylcheva, M. Transforming Pediatric Healthcare with Generative AI: A Hybrid CNN Approach for Pneumonia Detection. Electronics 2025, 14, 1878.
  15. Rao, S.; Zeng, Z.; Zhang, J. Robust Multiclass Pneumonia Classification via Multi-Head Attention and Transfer Learning Ensemble. Appl. Sci. 2025, 15, 11426.
  16. Bhatt, H.; Shah, M. A Convolutional Neural Network ensemble model for Pneumonia Detection using chest X-ray images. Healthc. Anal. 2023, 3, 100176.
  17. Yue, Z.; Ma, L.; Zhang, R. Comparison and Validation of Deep Learning Models for the Diagnosis of Pneumonia. Comput. Intell. Neurosci. 2020, 2020, 8876798.
  18. Rajaraman, S.; Kim, I.; Antani, S.K. Detection and visualization of abnormality in chest radiographs using modality-specific convolutional neural network ensembles. PeerJ 2020, 8, e8693.
  19. Islam, M.N. Classification of pediatric pneumonia using chest X-rays by functional regression. arXiv 2020, arXiv:2005.03243.
  20. Alsharif, R.; Al-Issa, Y.; Alqudah, A.M.; Qasmieh, I.A.; Mustafa, W.A.; Alquran, H. PneumoniaNet: Automated Detection and Classification of Pediatric Pneumonia Using Chest X-ray Images and CNN Approach. Electronics 2021, 10, 2949.
  21. Ravi, V.; Narasimhan, H.; Pham, T.D. A cost-sensitive deep learning-based meta-classifier for pediatric pneumonia classification using chest X-rays. Expert Syst. 2022, 39, e12966.
  22. Mohammed, A.; Kora, R. A comprehensive review on ensemble deep learning: Opportunities and challenges. J. King Saud Univ.—Comput. Inf. Sci. 2023, 35, 757–774.
  23. Prakash, J.A.; Asswin, C.R.; Ravi, V.; Sowmya, V.; Soman, K.P. Pediatric pneumonia diagnosis using stacked ensemble learning on multi-model deep CNN architectures. Multimed. Tools Appl. 2023, 82, 21311–21351.
  24. Arulananth, T.S.; Prakash, S.W.; Ayyasamy, R.K.; Kavitha, V.P.; Kuppusamy, P.G.; Chinnasamy, P. Classification of Paediatric Pneumonia Using Modified DenseNet-121 Deep-Learning Model. IEEE Access 2024, 12, 35716–35727.
  25. Pan, Z.; Wang, H.; Wan, J.; Zhang, L.; Huang, J.; Shen, Y. Efficient federated learning for pediatric pneumonia on chest X-ray classification. Sci. Rep. 2024, 14, 23272.
  26. Yoon, T.; Kang, D. Enhancing pediatric pneumonia diagnosis through masked autoencoders. Sci. Rep. 2024, 14, 6150.
  27. Ruiz, G.E.G.; Benavides-Cruz, J.; Corredor, D.M.; Morales-Mendoza, E.; Palma, H.D.A.C.; Cely-Jiménez, A. Development of deep learning-based classification models for opacity differentiation in pediatric chest radiography. Inform. Med. Unlocked 2025, 52, 101605.
  28. Priyanka, R.; Gajendran, G.; Boulaaras, S.; Tantawy, S.S. PediaPulmoDx: Harnessing cutting edge preprocessing and explainable AI for pediatric chest X-ray classification with DenseNet121. Results Eng. 2025, 25, 104320.
  29. Katreddi, S.; Midatani, A.; Roy, A.P.; Velpuri, U.; Kasani, S. Pediatric pneumonia X-ray image classification: Predictive model development with DenseNet-169 transfer learning. J. Med. Artif. Intell. 2025, 8, 37.
  30. Nazir, S.; Dickson, D.M.; Akram, M.U. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Comput. Biol. Med. 2023, 156, 106668.
  31. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
  32. Kermany, D.; Zhang, K.; Goldbaum, M. Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification; Mendeley Data: London, UK, 2018.
  33. Mujahid, M.; Rustam, F.; Álvarez, R.; Mazón, J.L.V.; Díez, I.T.; Ashraf, I. Pneumonia Classification from X-ray Images with Inception-V3 and Convolutional Neural Network. Diagnostics 2022, 12, 1280.
  34. Ke, A.; Ellsworth, W.; Banerjee, O.; Ng, A.Y.; Rajpurkar, P. CheXtransfer: Performance and parameter efficiency of ImageNet models for chest X-Ray interpretation. In Proceedings of the Conference on Health, Inference, and Learning, Virtual, 8–10 April 2021; pp. 116–124.
  35. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; The MIT Press: Cambridge, MA, USA, 2016.
  36. Aggarwal, C.C. Neural Networks and Deep Learning; Springer: Cham, Switzerland, 2023.
  37. Santurkar, S.; Tsipras, D.; Ilyas, A.; Madry, A. How Does Batch Normalization Help Optimization? In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, QC, Canada, 3–8 December 2018.
  38. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning—Volume 37, Lille, France, 6–11 July 2015; pp. 448–456.
  39. Mumuni, A.; Mumuni, F. Data augmentation: A comprehensive survey of modern approaches. Array 2022, 16, 100258.
Figure 1. Block diagram of the proposed novel ensemble deep learning framework for pediatric pneumonia detection.
Figure 2. Dataset visualization and sample images. License: Creative Commons Attribution 4.0 (CC BY 4.0). Source Institution: Guangzhou Women and Children’s Medical Centre, China [32].
Figure 3. Model performance comparison.
Figure 4. Training and validation accuracy.
Figure 5. Training and validation loss.
Figure 6. Confusion matrix on the test data for proposed model (MobileNetV2 + ResNet50 + EfficientNetB0).
Figure 7. ROC curve for the test data prediction.
Figure 8. Performance comparison of individual models.
Figure 9. Comparative analysis of pneumonia detection models [34,35,36,37,38].
Figure 10. Grad-CAM results from the experiments.
Table 1. Comprehensive comparison of pediatric pneumonia classification studies.

| Study | Dataset | Model | Accuracy (%) | External Validation |
| --- | --- | --- | --- | --- |
| Rajaraman et al. (2020) [18] | Chest X-ray (5863) | Custom Ensemble | 91.63 | No |
| Yue et al. (2020) [17] | Chest X-ray (5863) | MobileNet | 92.98 | No |
| Bhatt & Shah (2023) [16] | Chest X-ray (5863) | Ensemble (3 CNN) | 84.12 | No |
| Sotirov et al. (2025) [14] | Chest X-ray (5863) | CNN + Fuzzy | 94.93 | No |
| Rao et al. (2025) [15] | Chest X-ray (5863) | DenseNet + ResNet + VGG | 91.67 | No |
| Gajendran et al. (2025) [28] | Chest X-ray (5863) | DenseNet121 + preprocessing | 99.97 | No |
| Alsharif et al. (2021) [20] | Chest X-ray (5852) | PneumoniaNet (CNN) | 99.70 | No |
| Proposed Work | Chest X-ray (5863) | MobileNetV2 + ResNet50 + EfficientNetB0 | 96.1 | Yes (NIH subset, 94.73%) |

Table 2. Model training configuration and computational environment.

| Component | Configuration/Description |
| --- | --- |
| Optimizer | Adam (Adaptive Moment Estimation) with decoupled weight decay for stable and efficient updates |
| Learning Rate | 1 × 10⁻⁴, fine-tuned to ensure steady convergence without overshooting minima |
| Loss Function | Binary Cross-Entropy, appropriate for probabilistic outputs in binary classification |
| Output Activation | Sigmoid |
| Epochs | 30, capped with early stopping (patience = 5) to prevent overfitting |
| Batch Size | 32, balanced for learning stability |
| Regularization | Dropout with p = 0.5 applied in the fully connected layers to mitigate overfitting |
| Hardware | NVIDIA Tesla T4 GPU via Google Colab Pro for accelerated parallel training |

Table 3. Initial model selection hyperparameters.

| Training Parameters | Values/Types |
| --- | --- |
| Model Architecture | MobileNetV2, VGG16, ResNet50, DenseNet-201, EfficientNet-B0, InceptionV3, Xception (Pre-trained) |
| Optimizer | Adam (Learning Rate: 1 × 10⁻⁴) |
| Loss Function | Categorical cross-entropy |
| Batch Size | 32 |
| Epochs | 30 |
| Dropout Rate (Layer 1) | 0.5 |
| Dropout Rate (Layer 2) | 0.3 |
| Learning Rate | 1 × 10⁻⁴ |
| Weight Initialization | He Initialization |
| Activation Function | ReLU |
| Final Activation Function | Softmax |
| Input Size | 224 × 224 × 3 |

Table 4. Performance results of proposed ensemble models.

| Model Combination | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| MobileNetV2 + ResNet50 + EfficientNetB0 | 96.1 | 97.8 | 96.7 | 97.3 |
| DenseNet201 + EfficientNetB0 + MobileNetV2 | 92.31 | 91.11 | 97.18 | 94.04 |
| EfficientNetB0 + InceptionV3 + Xception | 92.63 | 92.56 | 91.62 | 92.05 |
| EfficientNetB0 + ResNet50 + VGG16 | 89.74 | 87.73 | 97.18 | 92.21 |
| InceptionV3 + ResNet101 + EfficientNet | 81.41 | 80.58 | 92.56 | 86.16 |

Table 5. Ablation study of the proposed model (MobileNetV2 + ResNet50 + EfficientNetB0).

| Ensemble Model | Accuracy (%) |
| --- | --- |
| ResNet50 + EfficientNetB0 | 94.23 |
| MobileNetV2 + EfficientNetB0 | 93.87 |
| MobileNetV2 + ResNet50 | 94.56 |
| MobileNetV2 + ResNet50 + EfficientNetB0 | 96.1 |

Table 6. 5-fold cross-validation performance of the proposed ensemble (Mean ± Standard Deviation).

| Fold | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| 1 | 95.8 | 97.5 | 96.3 | 97.1 |
| 2 | 96.4 | 98.0 | 96.9 | 97.5 |
| 3 | 96.2 | 97.9 | 96.6 | 97.3 |
| 4 | 95.9 | 97.6 | 96.5 | 97.2 |
| 5 | 96.3 | 97.9 | 97.0 | 97.4 |
| Mean ± Std | 96.12 ± 0.26 | 97.78 ± 0.21 | 96.66 ± 0.28 | 97.30 ± 0.16 |
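
The Mean ± Std row of Table 6 can be recomputed from the per-fold values, as in the short sketch below for the accuracy column. It assumes the reported deviation is the sample standard deviation (n − 1 denominator), which the table does not state; under that assumption the accuracy entry is reproduced. Because the per-fold values are rounded to one decimal place, deviations recomputed this way for other columns may differ from the reported ones in the last digit.

```python
from statistics import mean, stdev

# Per-fold accuracy values (%) as listed in Table 6.
fold_accuracy = [95.8, 96.4, 96.2, 95.9, 96.3]

acc_mean = mean(fold_accuracy)
acc_std = stdev(fold_accuracy)  # sample standard deviation (assumed)

summary = f"{acc_mean:.2f} ± {acc_std:.2f}"
print(summary)
```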
Table 7. Performance of individual models.

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| MobileNetV2 | 93.18 | 92.86 | 95.26 | 93.18 |
| ResNet50 | 93.11 | 92.24 | 94.97 | 92.87 |
| DenseNet-201 | 92.64 | 91.76 | 94.68 | 92.47 |
| EfficientNet-B0 | 91.36 | 90.89 | 92.93 | 91.48 |
| VGG16 | 74.29 | 55.19 | 74.29 | 63.33 |
| InceptionV3 | 90.72 | 89.46 | 90.35 | 89.82 |
| Xception | 91.94 | 90.79 | 91.58 | 90.73 |

Table 8. External validation using the NIH pediatric dataset.

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- |
| MobileNetV2 | 86.16 | 84.43 | 88.17 | 85.47 |
| ResNet50 | 88.38 | 86.47 | 89.72 | 87.62 |
| EfficientNet-B0 | 87.85 | 85.79 | 88.39 | 85.96 |
| Static Ensemble | 89.14 | 87.28 | 90.48 | 88.35 |
| Proposed Work | 94.73 | 91.03 | 96.12 | 93.47 |

Table 9. Analysis of the domain shift between the primary and NIH datasets.

| Factor | Primary Dataset | NIH Dataset | Impact on Performance |
| --- | --- | --- | --- |
| Institution | Single (Guangzhou Women and Children’s Medical Centre) | Multiple (NIH Clinical Center, multiple hospitals) | High |
| Image Equipment | Standardized (single manufacturer) | Variable (multiple manufacturers) | Moderate |
| Patient Population | Chinese children aged 1–5 | US pediatric population | Moderate |
| View Type | Anterior–posterior (AP) | Mixed (AP and PA) | Moderate |
| Disease Severity | Symptomatic clinical cases | Includes milder/incidental findings | Low |
| Label Source | Two radiologists + consensus | Original NIH labels (automated + review) | Low |

Table 10. Additional performance metrics for the proposed ensemble model.

| Metric | Value |
| --- | --- |
| Sensitivity (Recall) | 96.74% |
| Specificity | 94.27% |
| Balanced Accuracy | 95.51% |
| Positive Predictive Value | 97.88% |
| Negative Predictive Value | 91.36% |
| Matthews Correlation Coeff. | 0.901 |
| Precision-Recall AUC | 0.96 |

Table 11. Comparative analysis of pneumonia detection models highlighting novelty achieved in the proposed study.

| Study | Dataset | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Rajaraman et al. (2020) [18] | 24,000+ Chest X-rays | Custom Ensemble Model | 91.63 | 92.49 | 88.42 | 92.86 | High performance with deep residual learning. |
| Yue et al. (2020) [17] | 5863 Chest X-rays | MobileNet | 92.98 | - | 98.98 | - | Balanced metrics suitable for clinical applications. |
| Alsharif et al. (2021) [20] | 5852 Chest X-rays | PneumoniaNet (CNN) | 99.70 | - | 99.74 | - | No external validation. |
| Bhatt and Shah (2023) [16] | 5863 Chest X-rays | Ensemble network of 3 CNN models | 84.12 | 80.04 | 99.23 | 88.56 | Combines CNN feature extraction with machine learning classifier. |
| Prakash et al. (2023) [23] | 5856 Chest X-rays | Multimodel Deep CNN | 98.62 | 98.99 | 99.53 | - | No external validation. |
| Pan et al. (2024) [25] | 5856 Chest X-rays | Federated Learning | 98.00 | - | - | - | Complex implementation. |
| Gajendran et al. (2025) [28] | 5863 Chest X-rays | DenseNet121 | 99.97 | - | 99.60 | 99.70 | No external validation. |
| Sotirov et al. (2025) [14] | 5863 Chest X-rays | CNN with intuitionistic fuzzy estimation (IFE) | 94.93 | 93.00 | 91.00 | 91.00 | Combines convolutional neural networks with intuitionistic fuzzy estimators. |
| Rao et al. (2025) [15] | 5863 Chest X-rays | Ensemble DenseNet-121, ResNet50, and VGG-19 | 91.67 | 92.19 | 90.00 | 90.89 | Proposes multimodal ensemble learning framework based on multi-head attention mechanism. |
| Proposed Work | 5863 Chest X-rays | MobileNetV2 + ResNet50 + EfficientNetB0 | 96.1 | 97.8 | 96.7 | 97.3 | Lower raw accuracy but validated externally. |

Table 12. Exploratory quantitative analysis of Grad-CAM activation patterns (author-derived metrics, no clinician validation).

| Metric | Pneumonia Cases | Normal Cases |
| --- | --- | --- |
| Mean peak activation intensity | 0.87 ± 0.09 | 0.34 ± 0.12 |
| Number of activated regions (>0.5 threshold) | 2.4 ± 0.8 | 0.7 ± 0.5 |
| Anatomical localization consistency * | 82% | N/A |

* Percentage of cases where peak activation overlapped with clinically relevant lung zones (perihilar, lower lobe, or peripheral mid-to-lower zones). Note: Metrics computed using predefined coordinate-based anatomical zones without clinician raters. Values are descriptive only and not clinically validated. See Section 4.5.3 for limitations.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yunianta, A. A Hybrid Ensemble Deep Learning Framework for Pediatric Pneumonia Classification Using Transfer Learning and Convolutional Neural Networks. Math. Comput. Appl. 2026, 31, 96. https://doi.org/10.3390/mca31030096
