Pneumonia Detection on chest X-ray images Using Ensemble of Deep Convolutional Neural Networks

Pneumonia is a life-threatening lung infection that can result from several different viral infections. Identifying and treating pneumonia on chest X-ray images can be difficult due to its similarity to other pulmonary diseases, and the existing methods for predicting pneumonia often fail to attain substantial levels of accuracy. Therefore, this paper presents a computer-aided classification approach for pneumonia, coined Ensemble Learning (EL), to simplify the diagnosis process on chest X-ray images. Our proposal is based on pre-trained Convolutional Neural Network (CNN) models, which have recently been employed to enhance the performance of many medical tasks instead of training CNN models from scratch. We use three well-known models pre-trained on the ImageNet database (DenseNet169, MobileNetV2, and Vision Transformer) and then fine-tune them on the chest X-ray data set. Finally, the results are obtained by combining the features extracted from these three models during the experimental phase. The proposed EL approach outperforms existing state-of-the-art methods, obtaining an accuracy of 93.91% and an F1-score of 93.88% in the testing phase.


I. INTRODUCTION
Viral infection has been one of the most serious threats to human health throughout history, and one of the most common viral infections is pneumonia [1]. Infections caused by viruses and bacteria harm the lungs [2]. Pneumonia symptoms are common and include pain, cough, shortness of breath, etc. Pneumonia affects approximately 7.7% of the world's population each year. As a result, early detection is critical for such illnesses, and the task of automated medical image classification has grown significantly [3]. This task aims to assign medical images to pre-defined classes. Recently, Deep Learning (DL) has become one of the most common and widely used methods for medical image classification tasks [4]. Moreover, DL models have achieved better performance than traditional techniques on chest X-ray images from pneumonia patients [2], [5].
DL architectures have shown effective predictive ability and have even outperformed physicians [6]. On chest X-ray images, DL models have been applied to multiple tasks, including tuberculosis identification [7], tuberculosis segmentation [8], large-scale recognition [9], COVID-19 detection [10], [11], and radiograph classification [12]. The automated classification of chest X-ray images using DL models is growing rapidly, and choosing an appropriate region of interest (ROI) on chest X-ray images has been used to discover pneumonia [13]. Furthermore, applying DL models helps to avoid problems that take a long time to solve with traditional approaches. However, these models require large volumes of well-labeled training samples.
To solve this problem, Transfer Learning (TL) has been developed. TL is becoming more widespread due to its capacity to effectively mitigate the shortcomings of reinforcement learning and supervised learning [14], [15]. TL comes in the following types: unsupervised, inductive, transductive, and negative transfer. These types have been demonstrated to tackle DL problems [16]. To enhance accuracy, TL provides highly suggestive contextual and discriminative capacity in the feature extraction stage across many fields [17], for example, web scraping [18], social media [19], sentiment classification [20], and medical image classification [17]. Therefore, this paper applies TL approaches to improve diagnostic reliability and to reduce time-consuming decisions for clinicians.
We propose a new neural network, coined EL, which is built by training CNN models using TL methods. To achieve this goal, the MobileNet [21], DenseNet [22], and Vision Transformer [23] models were trained to detect pneumonia in chest X-ray images.
Alhassan Mabrouk (alhassanmohamed@science.bsu.edu.eg) and Abdelghani Dahou (dahou.abdghani@univ-adrar.edu.dz) are with the Mathematics and Computer Science Department, Faculty of Science, Beni-Suef University, Beni Suef 62511, Egypt; Rebeca P. Díaz Redondo (rebeca@det.uvigo.

We have decided to use three models to generate the proposed ensemble learning method, trying to combine two approaches that, separately, have obtained promising results: on the one hand, using the best CNN models for the training stage; and, on the other hand, applying a vision transformer. Regarding the former, the authors of [24] proposed to use the two best CNN models to develop an ensemble learning solution. They built a hierarchical stacking method using the most relevant features extracted by the two selected convolutional networks, and they obtained good performance. Additionally, the Vision Transformer (VIT) has recently achieved highly competitive performance on several computer vision applications [25]. What is more, VIT achieves remarkable results compared to CNNs while requiring fewer computational resources for pre-training [23]. Since the existing approaches merge CNN models to create an ensemble method without using transformers, as in [26], we have decided to combine both, obtaining a methodology that simultaneously merges and improves each individual proposal. That is, we propose a method that joins VIT with the selection of the best CNN models for the training stage. In addition, the features obtained from the three selected models are combined using a probability-based ensemble approach to achieve good classification performance.
With the aforementioned in mind, we have developed a novel method to enhance the diagnosis of pneumonia. This method is based on three well-known CNN models, which significantly improve classification performance. The contributions of the suggested method are as follows:
• We suggest an ensemble method that uses predictions from multiple CNN models to improve the classification results.
• Instead of training a CNN model from scratch, we looked at appropriate transfer learning and fine-tuning methods.
• The architecture of the proposed ensemble learning method is improved by using a batch normalization layer and a dropout layer.
• A comprehensive analysis compares the developed method to different state-of-the-art approaches using a real-world data set.
The rest of the paper is organized as follows: Section II provides a review of related works. In Section III, the existing CNN models and the proposed method are presented. The pneumonia classification performance of the proposed method is given in Section IV. Lastly, the conclusion and future scope are provided in Section V.

II. RELATED WORKS
Over the past decade, many researchers have used deep learning to automatically detect lung infections and diseases from chest X-rays. For example, CheXNet is a 121-layer CNN-based approach developed by Rajpurkar et al. [27]. This approach was trained on 100,000 chest X-ray images covering 14 different diseases. The approach was also evaluated on 420 chest X-rays, and the results were compared with those of radiologists. It was found that the DL-based CNN method outperformed the average performance of radiologists in pneumonia detection. In [28], the authors trained a CNN method from scratch to extract features from chest X-ray images and used it to detect whether or not a patient had pneumonia, in contrast to previous studies based on traditional handcrafted features. Wu et al. [29] suggested a method based on an adaptive average-filtration CNN and a random forest to predict pneumonia from chest X-ray images. The adaptive filtration was applied to remove noise from the chest X-ray image, improving accuracy and making identification easier. Then, using dropout, a two-layer CNN model is created for extracting features. Nevertheless, more preprocessing with the adaptive filter is required to enhance the CNN's classification accuracy. Moreover, CNN models have some limitations: they require a large amount of labeled data to be trained, and learning a CNN architecture is computationally expensive and requires advanced machines. As a result, transfer learning (TL) approaches have been proposed to solve these problems.
Recently, the TL method has become very popular, mainly because it enables CNN models to be more efficient, reduces costs, and requires fewer inputs [30]. Ayan and Ünver [31] used the Xception and VGG16 architectures with transfer learning and fine-tuning. The Xception design was substantially altered with the addition of two fully connected layers and an output layer with a SoftMax activation function. According to the theory, the network's initial layers have the greatest generalization potential. The first eight layers of the VGG16 architecture were frozen, and the fully connected layers were altered. The test time per image was 16 ms for VGG16 and 20 ms for the Xception network. In [32], the methods included InceptionV3, ResNet18, and GoogLeNet, and the classifier results were merged using majority voting; that is, the diagnosis follows the class chosen by the majority of the models. Averaged over the models' testing results, this approach took 161 ms per image. Moreover, they were able to classify chest X-ray images with high accuracy. According to the results of this research, pneumonia may be detected using deep CNNs. We use standard algorithms as a component of our classification approach to keep computational costs to a minimum. Rahman et al. [33] used transfer learning techniques on ImageNet to detect pneumonia using four pre-trained CNN architectures, applying three classification strategies to classify chest radiography images. Togacar et al. [34] utilized three well-known CNN models for extracting features in the pneumonia classification task. They used the same data to train each model individually and acquired 1,000 features from every CNN's last fully connected layer. These essential features were reduced by the minimum redundancy maximum relevance (mRMR) feature selection method, and the selected features were fed into machine learning (ML) classification algorithms. Mittal et al. [35] suggested a CapsNet architecture for diagnosing pneumonia in chest X-ray images using multi-layered capsules. Liang and Zheng [36] suggested a new residual-network-based TL approach for pneumonia diagnosis. The DL model used in their study had 49 convolutional layers and 2 dense layers, and it achieved 90.05% test accuracy. However, because of the huge number of convolutional layers, this technique had a long execution time. In addition, octave-like convolutional neural networks [37] are considered lightweight, low-computational-cost networks that can replace the vanilla convolution operation, as in driver distraction detection [38], document image segmentation [39], and tumor segmentation [40]. Compared to vanilla convolution, the octave CNN uses a multi-frequency feature representation, which decomposes the input into low- and high-frequency maps (feature representations) rather than only using the high frequency. Thus, the low-frequency feature maps represent a low-resolution representation of the input, which helps decrease unnecessary redundancy and reduce the spatial dimensions.
To address this problem, several papers have recently attempted to detect pneumonia using deep CNN methods with fewer convolutional layers, as in [41], [42]. For example, Liang and Zheng [36] used a CNN approach with residual connections and dilated convolutions to identify pneumonia, and they revealed the influence of TL on the CNN approach when classifying chest X-rays. Kermany et al. [43] used transfer learning to train a CNN method to identify pneumonia in chest X-ray images. To classify chest X-rays as normal vs. pneumonia, Rajaraman et al. [44] developed a new CNN-based approach; they trained the CNN architecture on a region of interest (ROI) that only included the lungs rather than the entire image. However, these approaches are still unable to achieve a high degree of efficiency in detecting pneumonia.
To sum up, there are interesting approaches in the state of the art, but we have tried to go one step further by proposing a method that combines two different techniques: using CNN models for the training stage and taking the best ones for ensemble learning, and using a vision transformer (VIT), which obtains good results. Therefore, the main difference between our proposal (the EL method) and previous approaches is that we use an ensemble method that combines three well-known models, one of them the recent vision transformer. The obtained results are promising and slightly improve the state-of-the-art performance, with a small number of layers and features.

III. METHODOLOGY

A. Deep Convolutional Neural Networks (DCNN) models
Recently, many DCNN models have been suggested, and they have been shown to enhance the productivity and effectiveness of machine learning (ML) [20], [45]. Moreover, DCNN models are among the most studied DL methods due to their capability to extract features automatically and their adjustable structures, as in [15]. Many DL algorithms, such as MobileNet [21] and DenseNet [46], have incorporated the concept of depthwise separable convolutions to address the disadvantages of the traditional operation. In contrast to traditional convolution operations, depthwise separable convolutions are performed independently on each input channel. Consequently, the algorithms are cost-effective to run and can be trained with fewer parameters in a short time. Furthermore, the ensemble method has recently been introduced to learn more complex feature representations than a single network [47].
There are two kinds of ensemble techniques utilized with CNN architectures [26]. In the first technique, researchers employ different CNN algorithms to obtain features from the medical images, as in [48]. The collected features are aggregated and used in various machine learning techniques for classification tasks. Two distinct training steps and sophisticated algorithms are among the limitations of this technique. In the second technique, predicted values are merged using a computational formula, as suggested in [49]. The benefit of this technique is that the ensemble can classify samples correctly thanks to the correct predictions of the other CNN models. Therefore, this paper employs an ensemble technique to improve the performance of the classification task.
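As a minimal sketch of the second technique, the snippet below averages the class-probability vectors produced by three hypothetical models; the averaging rule and the example probabilities are illustrative assumptions, not the paper's exact formula:

```python
def average_ensemble(prob_vectors):
    """Merge per-model class probabilities by simple averaging."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / n_models
            for c in range(n_classes)]

# Hypothetical softmax outputs of three models for one chest X-ray
# (classes: [normal, pneumonia]).
preds = [[0.30, 0.70], [0.45, 0.55], [0.20, 0.80]]
merged = average_ensemble(preds)
label = max(range(len(merged)), key=lambda c: merged[c])  # index 1 -> pneumonia
```

Averaging the probabilities (rather than hard majority voting) lets a confident model outweigh two uncertain ones, which is one reason probability-based merging is often preferred.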

1) MobileNet:
The MobileNet architecture was designed by Howard et al. [21]. The MobileNet design is based on separable convolution layers and consists of two components: (a) depthwise convolution, where a single filter is applied to each input channel, and (b) pointwise convolution, where a 1 × 1 convolution aggregates the depthwise convolution's outputs. In contrast, a typical convolution filters and aggregates the input in a single step.
Depthwise convolution is used to cut down on computation time and model size. MobileNet also employs batch normalization and ReLU as a non-linear activation function. Furthermore, before the fully connected layer, a final average pooling reduces the spatial resolution to one.
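The saving can be made concrete by counting multiply-accumulate operations. The sketch below compares a standard convolution with its depthwise separable factorization; the layer sizes are illustrative assumptions, not MobileNet's exact configuration:

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def separable_conv_cost(h, w, c_in, c_out, k):
    """Depthwise (k x k filter per channel) plus pointwise (1 x 1) cost."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative layer: 56x56 feature map, 64 -> 128 channels, 3x3 kernel.
std = standard_conv_cost(56, 56, 64, 128, 3)
sep = separable_conv_cost(56, 56, 64, 128, 3)
ratio = sep / std  # equals 1/c_out + 1/k**2, roughly an 8x reduction here
```

The ratio 1/c_out + 1/k² explains why the factorized form trains with fewer parameters and runs at lower cost, as noted above.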
2) DenseNet: DenseNet was suggested by Huang et al. [46] to improve the depth of CNNs. This approach was first introduced to address the issues that arise when CNNs become more complex in model size. The authors solved the issue by fully connecting each layer to the following ones, thus ensuring maximal information and gradient flow. One of the key benefits of adopting such a structure is that DenseNet maximizes its capacity through feature reuse rather than a deep or broad design. Unlike traditional CNNs, DenseNet does not learn duplicate features, so it requires fewer parameters. Moreover, since the structure has relatively thin layers, each layer only adds a small number of new feature maps. Also, the structure relies on each layer having direct access to the gradients of the loss function and to the input image during the training stage.
It is worth noting that DenseNet concatenates a layer's output feature maps with its input feature maps rather than summing them. For this concatenation to be possible, the feature maps must have the same spatial dimensions. To satisfy this constraint, DenseNet introduces the concept of DenseBlocks: within a block, the size of the feature maps stays consistent while the number of filters varies. Layers of a particular sort (called transition layers) are placed between the DenseBlocks; in these layers, downsampling is performed using batch normalization, a 1×1 convolution, and 2×2 pooling.
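The feature-reuse idea can be sketched by tracking channel counts inside a dense block, where each layer emits a fixed growth rate of new feature maps and receives the concatenation of everything before it; the numbers below are illustrative assumptions, not DenseNet169's exact configuration:

```python
def dense_block_channels(c_in, growth_rate, n_layers):
    """Channel counts inside a dense block.

    Layer i sees the block input plus all i previously produced
    feature maps concatenated along the channel axis.
    """
    seen = [c_in + i * growth_rate for i in range(n_layers)]
    c_out = c_in + n_layers * growth_rate
    return seen, c_out

# Illustrative block: 64 input channels, growth rate 32, 6 layers.
seen, c_out = dense_block_channels(64, 32, 6)
# seen grows linearly while each layer adds only 32 new maps
```

This is the "thin layers" property in code form: every layer contributes a small, fixed number of new maps, yet later layers see all earlier features.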

3) Vision Transformer (VIT):
The VIT has achieved excellent performance on different computer vision tasks, as discussed in [25]. The Vision Transformer (VIT) [23] divides an image into patches and uses a transformer to model the relationships among these patches as sequences, resulting in strong image classification performance. VIT's structure can be summarized as follows: 1) Divide the given image into patches. 2) Flatten the patches and map them to lower-dimensional linear embeddings (patch embedding). 3) Add a class token and positional embeddings. 4) Feed the patch sequence into the transformer layers and use the class token to obtain the label. 5) To get the output prediction, pass the class token values to a Multi-Layer Perceptron (MLP).
When inserting a 112 × 112 image to generate the patches, we start with 16 × 16 non-overlapping patches, which yields 49 patches to insert into the linear projection layer. Taking into account that each patch has three color channels, the patches are fed into the linear projection layer to obtain a long vector representation of each patch.
The overall number of patches in the patch embedding is therefore 49, and the patch size with the number of channels is 16 × 16 × 3. As a result, each patch's long vector has length 768, and the patch embedding matrix is 49 × 768. In addition, a class token and positional embeddings are added to the sequence of embedded patches; if positional encoding were not used, the transformer could not retain position information. Because of the additional class token, the sequence of patch embeddings has length 50. Lastly, the patch embeddings, together with the positional encoding and class token, are fed into the transformer layers to obtain the representation of the class token. As a result, the transformer encoder produces a 1 × 768 vector, which is then passed to the MLP block to give the prediction.
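The patch arithmetic above can be checked in a few lines; the image size, patch size, and channel count follow the paper's example:

```python
def vit_patch_shapes(img, patch, channels):
    """Number of non-overlapping patches and per-patch vector length."""
    assert img % patch == 0, "image size must be divisible by patch size"
    n_patches = (img // patch) ** 2
    patch_dim = patch * patch * channels
    return n_patches, patch_dim

n_patches, patch_dim = vit_patch_shapes(112, 16, 3)
# 49 patches of dimension 768: the embedding matrix is 49 x 768.
seq_len = n_patches + 1  # adding the class token gives a 50 x 768 sequence
```

The same bookkeeping explains the 50 × 768 input of the transformer encoder described next.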
The transformer encoder, which contains the Multi-Head Self-Attention (MHSA) block and the MLP block, is the most important element of the VIT structure. The encoder layer takes a 50 × 768 input, which merges the patch embeddings, positional embeddings, and class token. In the VIT architecture, the inputs and outputs of all 12 layers are 50 × 768. Furthermore, a normalization layer normalizes the inputs before they are fed into the Multi-Head Attention (MHA) block. To obtain the query, key, and value matrices in MHA, the input is mapped by a linear layer into a 50 × 2304 (768 × 3) shape. These matrices are then reshaped into 50 × 3 × 768, where each of the query, key, and value matrices is 50 × 768, and reshaped once more to 12 × 50 × 64. Once these matrices are obtained, the attention for the MHA block is computed using the following equation:

Attention(Query, Key, Value) = softmax(Query · Keyᵀ / √d_Key) · Value

The outputs from the MHSA block are delivered to a skip connection, and the outcome of the skip connection is sent to a normalization layer before being delivered to the MLP block for processing. Due to significant advancements in VIT, the MLP includes a local mechanism to learn local features [50]. Furthermore, a depthwise convolution is integrated into the MLP block at the first fully connected layer to reduce parameters and achieve better results. The output of the MLP block finally feeds a skip connection to produce the encoder layer's output.
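A minimal single-head version of this attention equation can be sketched on plain Python lists; the toy matrices are assumptions for illustration only:

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head (q, k, v: n x d lists)."""
    d = len(k[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d)
               for kr in k] for qr in q]
    weights = [softmax(r) for r in scores]
    return [[sum(w * vr[j] for w, vr in zip(wr, v))
             for j in range(len(v[0]))] for wr in weights]

# Toy 2-token, 2-dimensional example (illustrative values).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)  # each output row is a softmax-weighted mix of v's rows
```

In the MHA block this computation runs in parallel over the 12 heads of dimension 64 described above, and the per-head outputs are concatenated back to width 768.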
In this paper, the vision transformer is used because it focuses on each independent patch of the image, as well as their relationships with other patches.In contrast, the convolutional network does not have this property because it uses convolutional filters to learn image features.

B. Proposed EL method
This section describes the implemented DL architecture based on the ensemble learning technique. The objective of the proposed method is to learn and extract medical image representations using three well-known DL models: MobileNetV2, DenseNet169, and Vision Transformer (VIT). As shown in Figure 1, the input image to the ensemble method is fed to three functional layers simultaneously. At this stage, each functional layer represents a pre-trained model that relies on MobileNetV2, DenseNet169, and VIT, respectively. For dimensionality reduction, each functional layer's output (learned representations) is fed to a global average pooling layer. After applying the pooling operation on each parallel flow, the output is flattened and concatenated to generate a single feature vector for each input image. To fine-tune the overall network, overcome over-fitting, and boost the classification accuracy, a sequential set of layers is placed on top, including batch normalization (BN), a fully connected layer (dense), and a dropout layer, as shown in Figure 1. The final output of the ensemble method is generated using a fully connected layer that produces the classification result.
Using chest X-ray image data sets, the ensemble method was fine-tuned to learn and extract feature vectors from input images of size 224 × 224. The three models, MobileNetV2, DenseNet169, and VIT, were pre-trained on ImageNet [51]. In our experiments, the pre-trained ensemble method was fine-tuned on the chest X-ray data sets. After flattening, these models generate feature vectors of size 1280, 1664, and 768, respectively; thus, the concatenated feature vector is of size 3712. During the fine-tuning of the ensemble method, the weights of the three models were frozen to accelerate the training process.
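A simple sketch of the concatenation step, using the per-model feature sizes reported above; the pooling and flattening details are simplified here, and only the dimensionality bookkeeping is checked:

```python
def concat_features(*vectors):
    """Concatenate per-model feature vectors into one ensemble vector."""
    merged = []
    for v in vectors:
        merged.extend(v)
    return merged

# Flattened feature sizes after global average pooling, per the text
# (placeholder zeros stand in for real learned features):
mobilenet_feats = [0.0] * 1280  # MobileNetV2
densenet_feats = [0.0] * 1664   # DenseNet169
vit_feats = [0.0] * 768         # Vision Transformer
fused = concat_features(mobilenet_feats, densenet_feats, vit_feats)
# len(fused) == 1280 + 1664 + 768 == 3712
```

The fused vector is what the BN, dense, and dropout layers on top of the ensemble then consume.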

IV. EXPERIMENTAL STUDY
This study trained nine well-known CNN methods and the proposed EL method to classify pneumonia in chest X-ray images. In the training phase, different TL and fine-tuning techniques were attempted on these methods, and the configurations ensuring excellent outcomes were utilized in the testing stage. A batch size of 32 and a learning rate of 1e-4 were used during this phase. We trained the methods with various numbers of epochs, but after 20 epochs the methods began to overfit, so early stopping was used to avoid overfitting. In addition, the Adam optimizer was applied to minimize the categorical cross-entropy loss function, and the softmax activation function was applied in the final layer for classification. The remainder of this section describes the experimental study: first the data sets and performance measures are presented, then the experimental results and their discussion, and finally a comparison of our proposed method with state-of-the-art methods.

Fig. 2: Example chest X-ray samples for the classification task from the selected database. The top row shows normal images, and the bottom row shows pneumonia images.
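The early-stopping rule described above can be sketched as a simple patience counter over the validation loss; the patience value and loss sequence below are illustrative assumptions, not the paper's training log:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, i.e. after
    `patience` consecutive epochs without a new best validation loss,
    or None if training runs to completion."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

# Illustrative validation-loss curve that starts to overfit after epoch 4:
losses = [0.50, 0.35, 0.28, 0.27, 0.29, 0.31, 0.33]
stop = early_stop_epoch(losses, patience=3)  # stops at epoch 7
```

Frameworks typically also restore the weights from the best epoch (epoch 4 here) when stopping.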

A. Data set description
Pneumonia diagnosis from chest X-ray images has been used for our experimental assessment. Figure 2 displays a set of images from the chosen database. The data set used in this study was provided by Kermany and Goldbaum [43] and is based on a chest X-ray scan database of pediatric patients from one to five years of age at the Guangzhou Women and Children's Medical Center. The Chest X-Ray Images (Pneumonia) data set, publicly available at https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia, contains a total of 5,856 normal and pneumonia chest X-ray images. To provide a fair comparison between our proposed method and the other methods, the previously defined training, validation, and test splits were used. The chest X-ray database is divided into two classes (normal and pneumonia), with subsets for each class. The training subset consists of 1,341 normal patients and 3,875 pneumonia patients; the test subset contains 234 normal patients and 390 pneumonia patients; and the validation set consists of 16 images, including eight pneumonia patients and eight normal patients. Examples of normal and pneumonia samples can be seen in Figure 2.

B. Evaluation metrics
The performance of the proposed classification method was evaluated based on precision, recall, F1-score, and accuracy, as introduced in Eqs. (2), (3), (4), and (5), respectively. These metrics are the most popular in medical image classification [26], [49]. Precision is the proportion of predicted positive cases that are truly positive. Recall is the proportion of actual positive cases that are correctly predicted. The F1-score balances recall and precision, making it an indicator suited to imbalanced data. Accuracy is the proportion of correct predictions over all predictions.
In these definitions, the positive class denotes pneumonia, while the negative class denotes normal images; the true term denotes a correct classification, and the false term a wrong one. The number of pneumonia images correctly recognized as pneumonia is the True Positives (TP). The number of normal images wrongly labeled as pneumonia is the False Positives (FP). The number of normal images accurately recognized as normal is the True Negatives (TN). The number of pneumonia images wrongly labeled as normal is the False Negatives (FN). Recall measures the percentage of actual labels found by the system, and precision measures the percentage of labels correctly assigned by the system; the F1-score depends on both precision and recall. From a different perspective, the accuracy metric, which defines the system's recognition rate, is used to evaluate the baselines for each task in the two main phases.
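For concreteness, these four metrics can be written directly from the confusion-matrix counts; the counts below are illustrative assumptions, not the paper's confusion matrix:

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

# Illustrative counts for a 624-image test set (positive = pneumonia):
precision, recall, f1, accuracy = classification_metrics(
    tp=370, fp=18, tn=216, fn=20)
```

Note that accuracy alone can be misleading on this data set because pneumonia images outnumber normal ones, which is why the F1-score is reported alongside it.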

C. Results and analysis
Initially, we chose the pre-trained models from previous research, as in [26], [52]. The three best methods were selected after comparing their accuracy with that of other methods on the previously mentioned public chest X-ray data set. In terms of testing accuracy, the MobileNetV2, VIT, and DenseNet169 models performed best, as shown in Table I. Their characteristics are described above in Subsection III-A, which also includes a summary of their structures.
The performance of MobileNetV2, DenseNet169, Vision Transformer (VIT), and the proposed ensemble learning (EL) method in terms of training and validation losses and accuracy is compared in Figure 3. The test set fulfills the requirements too: in fact, we obtain better accuracy results and a lower loss on the test set with the proposed method, illustrating a more reliable and robust approach. The accuracy on the test set can be seen in Table II.
To better understand how the four approaches performed in this binary classification, the confusion matrices comparing true and predicted labels for the proposed Ensemble Learning (EL) method, Vision Transformer (VIT), MobileNetV2, and DenseNet169 are shown in Figure 4. This is done primarily to understand what a good classification approach should be, as well as how it could be enhanced when dealing with the diagnosis of diseases, which is often critical to a patient's survival. The confusion matrices in the figure contain actual and predicted labels for both normal (234) and pneumonia (390) chest X-ray images.

D. Compared Methods
In this section, the proposed methodology is analyzed systematically, and its positive and negative aspects are discussed in comparison with other methods in the literature. The results obtained in this study were compared with other studies that achieved successful results in the literature; this comparison is given in Table III. The chest X-ray data set was used to compare the various advanced methods for pneumonia detection. The state-of-the-art methods on this data set are as follows:
• Madani et al. [53] examined using Generative Adversarial Networks (GANs) to enrich a data set by producing chest X-ray data samples. GANs offer a way to learn the underlying structure of medical images, which can subsequently be used to generate high-quality realistic samples.
• Kermany et al. [43] used transfer learning, which allowed them to train a neural network with a fraction of the data required by traditional methods. They also made the diagnosis more transparent and understandable by highlighting the regions recognized by the neural network.
• Ayan and Ünver [31] employed two well-known CNN approaches, Xception and VGG16, using transfer learning and fine-tuning in the learning phase.
• Stephen et al. [28] proposed a CNN-based method. Unlike other methods based solely on transfer learning or traditional handcrafted techniques, they trained the CNN model from scratch to extract attributes from a given chest X-ray image, achieving remarkable classification performance, and used it to determine whether a person was infected with pneumonia.
• Liang and Zheng [36] performed pneumonia detection with a CNN architecture using residual connections and dilated convolutions. They also studied the transfer learning effect on CNN models when classifying chest X-ray images.
• Salehi et al. [54] proposed an automatic transfer learning method based on CNNs using a pre-trained DenseNet121.
According to our test results, the proposed ensemble method showed better performance than a single pre-trained CNN model. Figure 4 shows a performance comparison of MobileNetV2, DenseNet169, VIT, and the proposed EL method. In addition, designing a CNN model needs massive experiments and domain knowledge to train a pre-trained CNN model with transfer learning. Also, CNN models trained from scratch need more data, more training time, and more epochs to gain good generalization ability on the input data. However, the proposed method suffers from two drawbacks. The first is defining the hyperparameters of pre-trained CNN methods when applying TL and fine-tuning to one's own problem: TL requires determining an appropriate pre-trained CNN method for a related issue, the size of the fully connected layers, and the number of frozen layers. Many researchers use trial and error or their own experience to identify these parameters, so finding the TL parameters can involve lengthy trial-and-error procedures. The second drawback is that the proposed EL method has to deal with a lot of variance and bias.
V. CONCLUSIONS

Fig. 1: The structure of the proposed ensemble learning method.
The proposed method reduced the validation loss, as shown in the figure, which improved the accuracy results. DenseNet169's training loss is 0.1664, training accuracy 0.9319, validation loss 0.2408, and validation accuracy 0.9103. MobileNetV2 achieved a training accuracy of 0.9122, a training loss of 0.2096, a validation loss of 0.2072, and a validation accuracy of 0.9087. The VIT had a training loss of 0.1503, a training accuracy of 0.9421, a validation loss of 0.2071, and a validation accuracy of 0.9215. The proposed ensemble method had a training loss of 0.1361, a training accuracy of 0.9525, a validation loss of 0.0421, and a validation accuracy of 1.0.

Fig. 3: The plots referring to a) MobileNetV2, b) DenseNet169, c) Vision Transformer (VIT), and d) Ensemble Learning (EL) of the losses and accuracy on the training and validation sets.

Fig. 4: Confusion matrices of the chest X-ray data set.

TABLE I: The results of well-known CNN models.

TABLE II: Comparison of testing data results among the proposed Ensemble Learning (EL) and three well-known CNN models.

TABLE III: Comparative accuracy results for state-of-the-art methods on the test set of the chest X-ray data set. The best results for each item are labeled in bold.