The system was designed to achieve high diagnostic precision together with computational efficiency suitable for real-world medical use. Preprocessing steps, namely image resizing, normalization, and label encoding, standardize the inputs across models and form an essential part of the methodological pipeline.
Figure 1 shows the overall framework, which combines the feature-level ensemble approach and the weighted ensemble approach. The architecture incorporates three pre-trained CNNs, namely EfficientNetB0, ResNet50, and MobileNetV2, which are fine-tuned on pediatric data through transfer learning. To limit overfitting, the diversity of the training set is increased through image transformations: rotation, zooming, horizontal flips, and shifting. The feature maps from each model pass through Global Average Pooling (GAP) and are merged into a single high-dimensional representation. This fused feature vector passes through a fully connected dense layer and is then classified via a sigmoid activation function. Grad-CAM visualizations make the diagnostic process more explainable [30,31]: heatmaps are generated from the final convolutional layer of each base model to help clinicians understand which image regions most strongly influenced the predictions. The combination of interpretability and high performance supports trust and transparency, which healthcare institutions view as essential adoption criteria. The framework thus addresses the trade-off between diagnostic accuracy and efficiency while aiming to meet the broader requirements of accessibility and clinical validation for artificial intelligence diagnostics.
3.1. Dataset Description
The Chest X-Ray Images (Pneumonia) dataset serves as the primary research material in this study; it was obtained from Mendeley Data under a Creative Commons BY 4.0 license [32]. The dataset contains 5863 high-resolution anterior–posterior (AP) chest radiographs of children aged 1 to 5 years, with a class distribution of 73.2% pneumonia and 26.8% normal, reflecting clinical prevalence. The dataset was divided into a training set of 4691 images (80%) and validation and test sets of 586 images each (10% each). Chest X-ray datasets used for pneumonia classification in machine and deep learning studies typically contain around 5856 to 7750 images and are commonly split 80%/10%/10% into training, validation, and test sets [33]. The images were provided by the Guangzhou Women and Children’s Medical Centre in China. The three-way division into training, validation, and test partitions supports methodological consistency and model generalizability. The dataset originally contains three classes (normal, bacterial pneumonia, and viral pneumonia); however, the folders in the dataset directory are organized as two classes, Pneumonia and Normal, in each partition.
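The split sizes reported above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name, the fixed seed, and the plain (non-stratified) shuffle are assumptions, and the actual partitioning procedure may differ.

```python
import random

def split_indices(n_total, val_frac=0.10, test_frac=0.10, seed=42):
    """Shuffle indices 0..n_total-1 and cut them into train/val/test lists."""
    n_val = int(n_total * val_frac)      # 586 for n_total = 5863
    n_test = int(n_total * test_frac)    # 586 for n_total = 5863
    n_train = n_total - n_val - n_test   # the remainder goes to training
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # the seed value is an arbitrary assumption
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(5863)
print(len(train_idx), len(val_idx), len(test_idx))  # 4691 586 586
```

Because the three lists come from one shuffled permutation, no image index can appear in more than one partition.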
Two expert radiologists independently assessed each image, and their interpretations were then reconciled with a third senior expert to create reliable ground-truth labels, particularly for the test subset. The collection contains both bacterial and viral pneumonia cases and therefore covers a broad range of disease presentations seen in clinical settings. Exposure to different radiographic patterns of pneumonia, including patchy opacities, consolidation, and interstitial markings, which characterize pediatric pneumonia, makes model training more robust. The dataset is a suitable benchmark for pediatric diagnostic evaluation because it is frequently cited and provides high-quality data with substantial clinical relevance and sample volume.
Sample images from the “Chest X-Ray Images (Pneumonia)” dataset are shown in Figure 2. They include pneumonia-positive and normal cases and illustrate the visual features recognized by the deep learning model.
3.2. Data Preprocessing
To ensure optimal model performance and training stability, a structured and rigorous preprocessing pipeline was applied to the raw chest X-ray images before they were introduced into the deep learning framework [34]. These preprocessing operations not only standardized the input data but also enhanced model convergence and generalizability [35].
- A. Image Resizing
All input images were resized to a uniform spatial resolution of 224 × 224 pixels, which matches the input dimensions of the pre-trained CNN architectures (EfficientNetB0, ResNet50, and MobileNetV2). The resizing procedure maintained the original aspect ratio whenever possible to avoid distortions that could affect radiological features important for pneumonia diagnosis.
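One common way to keep the aspect ratio while still producing a square input is to scale the longer side to 224 pixels and pad the shorter side. The sketch below computes only the geometry of such a resize. It is an assumption about how aspect-ratio preservation could be done, not a description of the exact procedure used here, and the function name is hypothetical.

```python
def letterbox_geometry(width, height, target=224):
    """Return the resized size and the padding needed to fit an image into a
    target x target canvas without changing its aspect ratio."""
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    pad_right = target - new_w - pad_left
    pad_bottom = target - new_h - pad_top
    return (new_w, new_h), (pad_left, pad_top, pad_right, pad_bottom)

size, padding = letterbox_geometry(1600, 1200)
print(size, padding)  # (224, 168) (0, 28, 0, 28)
```

For a 1600 × 1200 radiograph, the image is scaled to 224 × 168 and padded with 28 rows at the top and bottom.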
- B. Image Pixel Intensity Normalization
To ensure compatibility with the pre-trained CNN models (EfficientNetB0, ResNet50, MobileNetV2), which were originally trained on ImageNet, we applied a two-step normalization pipeline.
Step 1–Scaling to [0, 1]: Each chest X-ray image is an 8-bit grayscale image with pixel intensity values in the range [0, 255]. First, we scale the pixel values to the range [0, 1] by dividing by 255:
$$x_{\text{scaled}} = \frac{x}{255}$$
where $x$ is the original pixel intensity.
Step 2–ImageNet-specific standardization: After scaling, we apply the channel-wise normalization that was used during ImageNet pre-training. Because the X-ray images are grayscale, the intensity values are replicated across the three channels expected by the models. The final normalized image $x_{\text{norm}}$ is computed as follows:
$$x_{\text{norm}} = \frac{x_{\text{scaled}} - \mu}{\sigma}$$
with $\mu = [0.485, 0.456, 0.406]$ and $\sigma = [0.229, 0.224, 0.225]$ (mean and standard deviation per channel, respectively). This standardization approximately centers the input data around zero and scales it toward unit variance, which stabilizes gradient flow and accelerates convergence during training.
Feature normalization plays several important roles: it enhances training stability and gradient control [36], accelerates convergence by reshaping the loss landscape [37], and mitigates activation function saturation [38]. It is therefore a critical preprocessing step that improves the efficiency and reliability of neural network training. The technique scales input data, such as pixel intensities initially spanning a wide range (e.g., 0 to 255), into a standardized, smaller domain (e.g., 0 to 1 or a standard normal distribution).
In summary, pixel intensities were first scaled from the original 8-bit range [0, 255] to [0, 1] by dividing by 255, because the ImageNet normalization statistics are defined on this range. Channel-wise normalization was then applied with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
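The two-step normalization can be sketched in NumPy as follows. The function name and the channels-last layout are assumptions made for illustration, as is the replication of the grayscale image to three channels. The mean and standard deviation are the values stated above.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_xray(gray_uint8):
    """Scale an (H, W) uint8 image to [0, 1], replicate it to three channels,
    and apply channel-wise ImageNet standardization. Returns (H, W, 3)."""
    scaled = gray_uint8.astype(np.float32) / 255.0          # step 1
    rgb = np.repeat(scaled[..., np.newaxis], 3, axis=-1)    # grayscale -> 3 channels
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD             # step 2

demo = np.array([[0, 255]], dtype=np.uint8)  # toy 1 x 2 "image"
out = normalize_xray(demo)
print(out.shape)  # (1, 2, 3)
```

A black pixel maps to $-\mu/\sigma$ and a white pixel to $(1-\mu)/\sigma$ in each channel, so the three channels differ slightly even though the source image is grayscale.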
- C. Label Encoding
The Pneumonia and Normal class labels were one-hot encoded. This encoding scheme enabled the use of categorical cross-entropy loss and let the model produce a probabilistic prediction for each class. Specifically:
Pneumonia: [1,0]
Normal: [0,1]
Converting the labels into this numeric format lets the loss be computed against the predicted class probabilities, so the network can learn class distinctions during backpropagation.
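The encoding above amounts to a simple lookup. The helper names in this sketch are hypothetical, and the decoding step (taking the class with the highest probability) is shown only to illustrate how a two-element prediction maps back to a class name.

```python
# Mapping taken from the encoding listed above.
CLASS_TO_ONE_HOT = {"Pneumonia": [1, 0], "Normal": [0, 1]}
CLASS_NAMES = ["Pneumonia", "Normal"]  # index order matches the one-hot positions

def encode_labels(labels):
    """Convert a list of class names into one-hot vectors."""
    return [CLASS_TO_ONE_HOT[label] for label in labels]

def decode_prediction(probabilities):
    """Map a two-element probability vector back to a class name (argmax)."""
    return CLASS_NAMES[probabilities.index(max(probabilities))]

encoded = encode_labels(["Normal", "Pneumonia"])
print(encoded)  # [[0, 1], [1, 0]]
```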
- D. Data Structure Optimization
The images and labels were converted into NumPy arrays and served through data generators, which reduced memory usage during training with real-time augmentation. Together, these preprocessing steps produced a clean, normalized dataset ready for model training, supporting the ensemble model’s ability to generalize to new data.
3.3. Data Augmentation
Probabilistic online data augmentation was chosen as the training strategy to improve the models’ performance on new, unseen data by deliberately diversifying the training set. Because augmentation is applied to the training set only, it does not leak information into the validation or test sets. The strategy applies a series of geometric transformations (such as flips, rotations, and shifts) to the original images. “Probabilistic” means that each transformation is applied with a certain random chance or degree, for example a flip with 50% probability or a rotation angle drawn from a range. “Online” means that the variations are generated during training rather than stored in advance, so the model sees a slightly different version of each image in every epoch. The cumulative effect is a more robust model that relies less on specific, non-essential features of the original data, which improves generalization to real-world variation and mitigates overfitting.
To mitigate overfitting and improve generalization on unseen clinical data, probabilistic online data augmentation was implemented during the training phase. This introduced controlled stochasticity and increased the effective diversity of the training data without altering the semantic content of the X-ray images [39].
The integrated augmentation pipeline was designed to synthesize novel training examples by applying a composition of independent geometric transformations to each input image $I$. The resultant augmented image, $I'$, was generated via the following sequence of operations:
$$I' = T_{\text{shift}}\left(T_{\text{zoom}}\left(T_{\text{rot}}\left(T_{\text{flip}}(I)\right)\right)\right)$$
This sequence involved the following:
Horizontal flipping ($T_{\text{flip}}$) applied with a probability of p = 0.5.
Random rotation ($T_{\text{rot}}$) within the range of ±10°.
Random scaling (zoom) ($T_{\text{zoom}}$) by up to ±10%.
Random translation (shift) ($T_{\text{shift}}$) in either the horizontal or vertical axis by up to ±10% of the image dimensions.
Compounding these transformations simulates the natural geometric and positional variations of real-world clinical radiographic acquisition, thereby enhancing the robustness and representational capacity of the trained model. The transformation operators $T$ are applied sequentially, in the order listed above. Augmentation was performed in real time as mini-batches were created, which kept processing fast while varying the input samples. Through this dynamic enrichment of the dataset, the model learned to handle the intra-class variation and imaging fluctuations found across clinical environments.
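The probabilistic part of the pipeline can be illustrated by sampling one set of transformation parameters per image and per epoch. This is a minimal sketch that uses the ranges stated above. The function name, the dictionary keys, and the choice to sample independent horizontal and vertical shifts are assumptions, and applying the parameters to pixels is left to an image library.

```python
import random

def sample_augmentation(rng, flip_prob=0.5, max_rotation_deg=10.0,
                        max_zoom=0.10, max_shift=0.10):
    """Draw one random parameter set for flip, rotation, zoom, and shift."""
    return {
        "flip": rng.random() < flip_prob,                        # p = 0.5
        "rotation_deg": rng.uniform(-max_rotation_deg, max_rotation_deg),
        "zoom": 1.0 + rng.uniform(-max_zoom, max_zoom),          # 0.9 to 1.1
        "shift_x": rng.uniform(-max_shift, max_shift),           # fraction of width
        "shift_y": rng.uniform(-max_shift, max_shift),           # fraction of height
    }

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
params_per_epoch = [sample_augmentation(rng) for _ in range(3)]
for epoch, params in enumerate(params_per_epoch, start=1):
    print(epoch, params)
```

Because a fresh parameter set is drawn every time an image is loaded, the same radiograph is presented slightly differently in each epoch, which is what makes the augmentation online.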
3.4. Model Architecture of Hybrid Convolutional Neural Network (CNN) Ensemble
The proposed framework employs feature-level fusion of three established Convolutional Neural Networks (CNNs): MobileNetV2, ResNet50, and EfficientNetB0. This integration leverages the representational strengths of each base model. The hybrid architecture combines models with complementary inductive biases: MobileNetV2 employs depth-wise separable convolutions and linear bottlenecks, resulting in a parameter-efficient design (3.4 M parameters); ResNet50 uses residual connections to learn hierarchical features; and EfficientNetB0 applies compound scaling. While we do not benchmark inference speed or power consumption in this study, these architectural properties motivate future deployment studies in resource-constrained settings. The deep residual learning of ResNet50 and the compound scaling of EfficientNetB0 complement MobileNetV2 by capturing the subtle, high-level radiographic patterns essential for accurate pneumonia diagnosis. The resulting ensemble is designed to generalize better and be more robust than any single constituent network. Because this is a binary classification task (pneumonia vs. normal), the ensemble uses a single output neuron with sigmoid activation and binary cross-entropy loss, the standard and most efficient approach for binary classification. In contrast, the individual model screening (Section 4.1, Phase 1) uses softmax and categorical cross-entropy for compatibility with pre-trained architectures; those settings do not apply to the proposed ensemble.
Each of the base models was initialized with ImageNet pre-trained weights to leverage transfer learning, and their top classification layers were excluded (include_top = False) to extract only the high-level convolutional feature maps. These feature maps $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ (where $i \in \{\text{MobileNetV2}, \text{ResNet50}, \text{EfficientNetB0}\}$) were each passed through a Global Average Pooling (GAP) layer to reduce the spatial dimensions and convert them into fixed-length, one-dimensional feature vectors $f_i$ as follows:
$$f_i = \mathrm{GAP}(F_i) = \frac{1}{H_i W_i} \sum_{h=1}^{H_i} \sum_{w=1}^{W_i} F_i(h, w, :)$$
This transformation yields a fixed dimensionality and makes the pooled features less sensitive to spatial translation. Concatenating the feature vectors produced a single high-dimensional representation that combines the complementary views of the image learned by the three networks. The fused vector entered a dense layer with 128 neurons and ReLU activation, followed by a dropout layer to prevent overfitting. ReLU is used as the activation function for the hidden layers (the dense layer and any intermediate layers), while the final output layer uses sigmoid to produce a probability score between 0 and 1 for the binary decision between the Pneumonia and Normal classes; this is standard practice for binary classification. By combining several CNNs, the ensemble structure aims to improve diagnostic precision and reliability while remaining flexible enough for use in basic health clinics.
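Feature Fusion and Classification Head: Following the global average pooling of feature maps from each base model (MobileNetV2, ResNet50, and EfficientNetB0), the resulting one-dimensional feature vectors $f_{\text{MobileNetV2}}$, $f_{\text{ResNet50}}$, and $f_{\text{EfficientNetB0}}$ are concatenated to form a unified high-dimensional feature embedding:
$$F_{\text{concat}} = \left[ f_{\text{MobileNetV2}} \,;\, f_{\text{ResNet50}} \,;\, f_{\text{EfficientNetB0}} \right]$$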
The concatenated structure lets the system draw on complementary spatial and semantic attributes from the different CNN architectures, which strengthens the final embedding. The fused multiview features in $F_{\text{concat}}$ proceed to a fully connected dense layer with 128 neurons and ReLU activation that learns feature combinations. During training, dropout with a rate of 0.5 randomly deactivates 50% of these neurons, which reduces overfitting and promotes more stable, generalizable learning.
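The pooling and concatenation steps can be sketched with NumPy on stand-in feature maps. The channel counts below are the standard final feature-map depths of these backbones without their classification heads, and the 7 × 7 spatial grid corresponds to a 224 × 224 input. The random arrays are placeholders for real backbone outputs, and the variable names are illustrative.

```python
import numpy as np

# Final feature-map depth of each backbone (standard values, no top layers).
FEATURE_CHANNELS = {"MobileNetV2": 1280, "ResNet50": 2048, "EfficientNetB0": 1280}

def global_average_pool(feature_map):
    """Average an (H, W, C) feature map over its spatial axes, giving (C,)."""
    return feature_map.mean(axis=(0, 1))

rng = np.random.default_rng(0)
# Random stand-ins for the backbone outputs of a single image.
feature_maps = {name: rng.standard_normal((7, 7, channels))
                for name, channels in FEATURE_CHANNELS.items()}

pooled = [global_average_pool(fm) for fm in feature_maps.values()]
f_concat = np.concatenate(pooled)  # fused embedding
print(f_concat.shape)  # (4608,)
```

Under these assumptions the fused embedding has 1280 + 2048 + 1280 = 4608 dimensions, which is the input size of the 128-neuron dense layer.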
The dense layer output feeds a single-neuron output layer that generates a probability score between 0 and 1 through sigmoid activation. The final binary classification prediction $\hat{y}$ is computed as follows:
$$\hat{y} = \sigma(W h + b)$$
where $h$ denotes the output of the dense layer, $W$ and $b$ are the learnable weights and bias of the output layer, and $\sigma$ is the sigmoid function. This configuration is simple to interpret and to optimize for binary classification tasks such as pneumonia detection.
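A forward pass through the classification head can be sketched as follows. The weights are random and untrained, and the 4608-dimensional input assumes the standard backbone feature sizes. Treating pneumonia as the positive class with a 0.5 decision threshold is also an assumption. Dropout is omitted because it is inactive at inference time.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classification_head(f_concat, W1, b1, W, b):
    """Dense(128, ReLU) followed by a single sigmoid output unit."""
    h = np.maximum(0.0, W1 @ f_concat + b1)  # hidden layer output, shape (128,)
    return float(sigmoid(W @ h + b))         # y_hat = sigma(W h + b)

rng = np.random.default_rng(0)
f_concat = rng.standard_normal(4608)           # stand-in fused feature vector
W1 = rng.standard_normal((128, 4608)) * 0.01   # random, untrained weights
b1 = np.zeros(128)
W = rng.standard_normal(128) * 0.01
b = 0.0

y_hat = classification_head(f_concat, W1, b1, W, b)
label = "Pneumonia" if y_hat >= 0.5 else "Normal"  # assumed threshold and polarity
print(round(y_hat, 4), label)
```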
The final ensemble weights are calculated as follows. Let $a_i$ be the validation accuracy of base model $i$ ($i \in \{\text{MobileNetV2}, \text{ResNet50}, \text{EfficientNetB0}\}$) after 30 epochs of training. The weight $w_i$ for model $i$ is computed using softmax-based normalization:
$$w_i = \frac{\exp(a_i / T)}{\sum_{j} \exp(a_j / T)}$$
where $T$ is a temperature parameter that controls the sharpness of the weight distribution (lower $T$ gives higher weight to the best model). This formulation ensures that weights are positive and sum to 1, with better-performing models receiving higher weights.
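The effect of the temperature can be seen in a small sketch of the softmax-based weighting. The accuracies below are placeholders chosen for illustration, not results from this study, and the two temperature values are arbitrary.

```python
import math

def softmax_weights(val_accuracy, temperature=1.0):
    """Map {model: validation accuracy} to weights that are positive and sum to 1."""
    scaled = {name: acc / temperature for name, acc in val_accuracy.items()}
    peak = max(scaled.values())  # subtracting the max keeps exp() numerically stable
    exps = {name: math.exp(value - peak) for name, value in scaled.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Placeholder accuracies for illustration only.
example_acc = {"MobileNetV2": 0.90, "ResNet50": 0.92, "EfficientNetB0": 0.94}
w_soft = softmax_weights(example_acc, temperature=1.0)
w_sharp = softmax_weights(example_acc, temperature=0.01)
print(w_soft)
print(w_sharp)
```

With $T = 1$ the weights stay close to uniform because the accuracies differ by only a few hundredths. With $T = 0.01$ most of the weight shifts to the model with the highest accuracy.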
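The final ensemble prediction $\hat{y}_{\text{ens}}$ for a given input image is the weighted average of the individual model probabilities:
$$\hat{y}_{\text{ens}} = \sum_{i} w_i \, p_i$$
where $p_i$ is the probability predicted by base model $i$ and $w_i$ is its ensemble weight.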
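The weighted average itself is a one-line computation. The weights and probabilities in this sketch are illustrative numbers only, and the function name is hypothetical. The weights are assumed to be already normalized so that they sum to 1.

```python
def ensemble_probability(weights, probabilities):
    """Weighted average of per-model pneumonia probabilities."""
    if set(weights) != set(probabilities):
        raise ValueError("weights and probabilities must cover the same models")
    return sum(weights[name] * probabilities[name] for name in weights)

# Illustrative numbers only.
weights = {"MobileNetV2": 0.2, "ResNet50": 0.3, "EfficientNetB0": 0.5}
probs = {"MobileNetV2": 0.6, "ResNet50": 0.8, "EfficientNetB0": 0.9}
p_ens = ensemble_probability(weights, probs)  # 0.2*0.6 + 0.3*0.8 + 0.5*0.9
print(round(p_ens, 2))  # 0.81
```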
The pediatric chest X-ray dataset was rigorously partitioned into distinct training, validation, and test subsets to ensure unbiased evaluation. Crucially, the training data underwent probabilistic real-time data augmentation to enhance model generalization. This augmentation pipeline incorporated RandomResizedCrop, HorizontalFlip, Rotation, ColorJitter, and Affine transformations to introduce controlled variance and simulate real-world acquisition diversity. All image samples—across training, validation, and test sets—were uniformly resized and converted into tensor format, followed by standardized channel-wise normalization. Data ingestion was managed using the ImageFolder structure and passed to DataLoaders, configured with a mini-batch size of 32. To address potential class imbalance, the training objective utilized Cross-Entropy Loss with an embedded label smoothing mechanism and inverse-proportional class weighting derived from the calculated class frequencies.
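The loss described above can be illustrated with a small pure-Python sketch. The weighting formula $N / (K \cdot n_c)$, the smoothing value of 0.1, and the way the class weight scales the whole loss are all assumptions made for illustration. Framework implementations combine class weights and label smoothing in slightly different ways.

```python
import math

def inverse_frequency_weights(class_counts):
    """weight_c = N / (K * n_c), so rarer classes receive larger weights."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {name: total / (k * count) for name, count in class_counts.items()}

def smoothed_weighted_ce(probs, target, class_weights, smoothing=0.1):
    """Cross-entropy against a label-smoothed target, scaled by the weight of
    the true class. probs maps class name to predicted probability."""
    k = len(probs)
    loss = 0.0
    for name, p in probs.items():
        q = (1.0 - smoothing) if name == target else 0.0
        q += smoothing / k  # spread the smoothing mass uniformly over classes
        loss -= q * math.log(p)
    return class_weights[target] * loss

# Counts matching the reported 73.2% / 26.8% split on a hypothetical 1000 images.
class_weights = inverse_frequency_weights({"Pneumonia": 732, "Normal": 268})
loss = smoothed_weighted_ce({"Pneumonia": 0.7, "Normal": 0.3}, "Normal", class_weights)
print(class_weights, round(loss, 4))
```

Under this weighting, an error on the minority Normal class costs roughly 2.7 times as much as the same error on the Pneumonia class.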
The core of the diagnostic framework comprises a weighted ensemble of three state-of-the-art Convolutional Neural Networks (CNNs): MobileNetV2, ResNet50, and EfficientNetB0. These models were initialized with ImageNet pre-trained weights (transfer learning). In Stage 1 (epochs 1–10), the convolutional base of each model was frozen (weights not updated), and only the newly added classification layers were trained. In Stage 2 (epochs 11–30), the top 20% of convolutional layers were unfrozen and fine-tuned with a reduced learning rate to adapt the features to pediatric chest X-ray characteristics while preserving general visual knowledge. All models were trained for 30 epochs using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$ (no decay or scheduler). Training was regulated by an early stopping criterion, which halted the process if the validation accuracy failed to improve over a predefined patience period. After the individual models were trained, the ensemble weights were determined from each model’s validation accuracy. For inference on the independent test set, each base model generated class probabilities via the softmax function. These probabilities were then aggregated using a weighted average with the derived validation weights. The final diagnostic prediction was the class with the highest blended probability. The ensemble’s performance was evaluated on the test set using the following classification metrics: Accuracy, Precision, Recall, and F1-Score.
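The four evaluation metrics follow directly from the confusion-matrix counts. In the sketch below, pneumonia is assumed to be the positive class and the counts are made up for illustration. They are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print({name: round(value, 4) for name, value in metrics.items()})
```

With an imbalanced class distribution such as the one in this dataset, precision, recall, and F1 are more informative than accuracy alone.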
The hybrid CNN ensemble was trained with the Adam optimizer because of its adaptive learning rates and efficient gradient handling. A learning rate of $1 \times 10^{-4}$ was used (Table 1) to balance training stability and speed of convergence. Binary cross-entropy was used as the loss function because it suits binary classification tasks with probabilistic outputs from a sigmoid activation; it measures the discrepancy between the predicted probabilities and the true class labels. Training ran for up to 30 epochs with early stopping (patience = 5).
Early stopping halts training once validation performance stops improving, the point beyond which further epochs would no longer improve generalization. A batch size of 32 was used to keep gradient computation efficient within the available memory. A dropout rate of 0.5 in the fully connected layers reduced co-adaptation and improved generalization. Training was run on Google Colab Pro with an NVIDIA Tesla T4 GPU to accelerate computation. Model checkpointing was enabled so that the model with the best validation loss was saved automatically during training, giving reproducible and deployable results. The training details can be found in Table 2, which presents the specific configuration along with all settings.
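The early stopping rule with a patience of 5 can be sketched as a small helper class. This is a simplified illustration: it tracks a single validation metric (accuracy) for both stopping and best-epoch bookkeeping, whereas the text describes checkpointing on validation loss. The class name and the accuracy curve are hypothetical.

```python
class EarlyStopping:
    """Stop when the monitored validation accuracy has not improved for
    `patience` consecutive epochs, and remember the best epoch."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def step(self, epoch, val_accuracy):
        """Record one epoch and return True if training should stop."""
        if val_accuracy > self.best_score:
            self.best_score = val_accuracy
            self.best_epoch = epoch      # a checkpoint would be written here
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

# Synthetic accuracy curve for illustration only.
history = [0.80, 0.85, 0.88, 0.87, 0.88, 0.86, 0.87, 0.85, 0.84, 0.83]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(epoch, acc):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopped_at)  # 3 8
```

In this synthetic run the best accuracy occurs at epoch 3, five epochs pass without a strict improvement, and training stops at epoch 8, so the epoch-3 checkpoint would be the one kept.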