Automated Lung-Related Pneumonia and COVID-19 Detection Based on Novel Feature Extraction Framework and Vision Transformer Approaches Using Chest X-ray Images

According to research, classifiers and detectors are less accurate when images are blurry, have low contrast, or contain other flaws, which raises questions about a machine learning model's ability to recognize objects effectively. The chest X-ray has proven to be the preferred modality for medical imaging because it contains rich information about a patient; its interpretation, nevertheless, is quite difficult. The goal of this research is to construct a reliable deep-learning model capable of producing high classification accuracy on chest X-ray images for lung diseases. To enable a thorough study of the chest X-ray image, the suggested framework first derives richer features using an ensemble technique, and then applies global second-order pooling to derive higher-level global features of the images. The images are then separated into patches with position embeddings before the patches are analyzed individually via a vision transformer approach. The proposed model yielded 96.01% sensitivity, 96.20% precision, and 98.00% accuracy on the COVID-19 Radiography Dataset, while achieving 97.84% accuracy, 96.76% sensitivity, and 96.80% precision on the Covid-ChestX-ray-15k dataset. The experimental findings reveal that the presented models outperform traditional deep learning models and other state-of-the-art approaches reported in the literature.


Introduction
Lung disease is widespread across the globe. Chronic disease, tuberculosis, asthma, pneumonia, fibrosis, and other diseases fall into this category. A new coronavirus disease (COVID-19) has been causing major respiratory and breathing problems since early December 2019. It has been claimed that about 63.2 million individuals have been infected globally, with around 1.47 million fatalities, according to the World Health Organization (WHO). One study assessed the performance of four models, two of which were pre-trained models (MobileNetV2 and ResNet152V2), alongside a CNN model built from scratch and an LSTM model; the models were assessed with different parameters using standard classification assessment measures. Wang et al. [32] emphasized the need for early detection of pneumonia. They used transfer learning and model adaptation methodologies to forecast the illness using the VGG-16 and Xception models, reaching detection accuracies of 87% and 82% for the VGG-16 and Xception models, respectively. Talo et al. [33] used the transfer learning approach to diagnose pneumonia using the ResNet152 model. Without any preprocessing or feature extraction, it correctly identified 97.4% of the collection. Varshni et al. [34] investigated the diagnosis of pneumonia using numerous models based on a convolutional neural network (CNN), which they used to extract features via transfer learning, with several classifiers as predictors. Their findings show that pre-trained CNN models combined with supervised classifier models can aid in the evaluation of chest X-ray images, notably in the detection of pneumonia. The authors also observed that using DenseNet-169 for feature extraction and SVM (Support Vector Machines) as the predictor produced the best results. In contrast to transfer learning-based efforts, Stephen et al. [35] employed data augmentation to construct a trained CNN for pneumonia diagnosis.
The model's effectiveness was tested with various image dimensions, with a 200 by 200 RGB image yielding the best results (93.73%). To classify chest X-ray images as normal, bacterial pneumonia, or viral pneumonia, Hammoudi et al. [36] used numerous deep learning models (ResNet50, ResNet34, VGG-19, DenseNet169, and Inception ResNetV2-RNN). Sirazitdinov et al. [37] used RetinaNet and Mask R-CNN to detect lung pneumonia on a chest X-ray image database, with a recall of 79.3%. Liang and Zheng [38] presented a transfer learning strategy for diagnosing pediatric pneumonia with a recall rate of 96.7% and an F1_score of 92.7%. The authors also used the CNN and VGG16 models, achieving 90.5% accuracy, 89.1% precision, 96.7% recall, and 92.7% F1_score for the CNN model. Chouhan et al. [39] employed data from the Guangzhou Women's and Children's Medical Center with a transfer learning algorithm, achieving a 96.4% success rate. Siddiqi et al. [40] employed a sequential 18-layer CNN to identify pneumonia and achieved an accuracy of 93.75%, whereas Jain et al. [41] achieved 95.62% accuracy, 95% recall, and 96% precision for pneumonia diagnosis from chest X-ray images.
The authors of [42] investigated the ResNet-50, ResNet34, MobileNet V2, GoogleNet, Inception V3, VGG16, SqueezeNet, and AlexNet models for early COVID-19 infection detection using CXR images. For the best model selection, parameters such as learning rate, number of epochs, and batch size were considered. The assessment findings revealed that the ResNet34 model outperformed all other assessed models, with an accuracy of 98.33%. Ozturk et al. [43] employed X-ray images to diagnose COVID-19 using CNN-based transfer learning (TL). The images were fed directly into the Inception-V3 model, which achieved 96% accuracy. The authors of Reference [44] produced a COVID-19 test model (VGG-16 and ResNet-50) based on the COVID-19 radiography dataset, with three classes: normal, COVID-19, and other pneumonia infection. The VGG-16 model fared the best, with a 97.67% accuracy. Furthermore, Das et al. [45] indicated that they improved COVID-19 detection performance on CXR images by adjusting data augmentation and CNN model parameters. The VGG-19 and ResNet-50 models performed better as a result of this strategy. In addition, a suggested model called CovidXrayNet, built on EfficientNet-B0 and optimization, was presented, resulting in an accuracy of 95.82% when tested with data from two independent databases. Rajpal et al. [46] treated COVID-19 detection as a three-class classification problem: normal, COVID-19, and pneumonia. The proposed architecture was divided into three parts. ResNet-50 with TL was used in the first stage to generate 2048 parameters. The second part employed Principal Component Analysis (PCA) to choose 64 characteristics from a total of 252. In the third module, the attributes obtained in the previous two parts were combined and classified, yielding a classification accuracy of 98%. SARS-Net was proposed by Kumar et al. [47] for COVID-19 identification using CXR images. In that analysis, the open COVIDx database containing CXR data was used.
According to quantitative research, the proposed design attains a higher accuracy of 97.60%. The authors of Reference [48] trained and tested a ResNet50 architecture with a small database of 50 COVID-19 examples from the Cohen et al. source and 50 normal cases from Kaggle, attaining 98% accuracy using re-sampling and five-fold cross-validation. In [49], TL was used to offer a novel framework called COVID-Net. The authors created a dataset comprising 8066 normal samples, 183 COVID-19 samples, and 5538 pneumonia samples, while the test set included 100 pneumonia samples, 100 normal samples, and 31 COVID-19 samples, yielding a 92% accuracy. The authors of [50] studied the COVID-19 variants using the transfer learning approach to tackle the New Stringency Indicators. We summarize our literature review in Table 1.

Table 1. Summary of the literature review.

Ref.   Method                               Accuracy
[43]   CNN-based transfer learning          96%
[44]   VGG-16                               97.67%
[45]   COVIDXrayNet                         95.82%
[46]   ResNet-50 with TL + PCA + Ensemble   98%
[47]   SARS-Net                             97.60%
[48]   ResNet50                             98%
[49]   COVID-Net                            92%

Nevertheless, even for professional and competent doctors, X-ray-based lung disease identification remains a mammoth task because X-ray images offer similar region information for various disorders such as pneumonia and COVID-19. As a result, traditional techniques for detecting lung disorders are time-consuming and energy-intensive, and it is difficult to employ a consistent methodology to establish which sort of lung disease a patient has. Many scholars have sought to improve CNN performance and have seen significant improvements over time. The CNN model, however, merely examines the connection between spatially nearby pixels in the receptive region defined by the filter size. As a result, identifying associations with distant pixels is challenging.
As a result, this study proposes a chest X-ray image-based feature extraction framework for accurate and fast lung disease identification. First, richer contours and correlations of lung disease-specific X-ray characteristics are retrieved using fused, fine-tuned pre-trained deep learning models. In addition, global second-order pooling is applied to enhance the non-linear capabilities of the fused pre-trained deep learning models and to take advantage of comprehensive visual information across them. Furthermore, the chest X-ray images are split into patches with positional embeddings before being passed to the multi-head attention mechanism for robust global feature extraction. For the classification block, we utilized a GlobalAveragePooling2D layer, a batch normalization layer, a dense layer with GeLU activation, another batch normalization layer, and a dense layer with SoftMax activation. Before selecting the suggested feature extractor models, this article first investigated numerous deep learning models using transfer learning. Furthermore, a thorough evaluation of the proposed model was conducted utilizing multiple datasets with multi-class classification: (Normal, COVID-19, Pneumonia) and (Normal, COVID-19, Pneumonia, Lung Opacity). The main contributions of this paper are summarized as follows:

• This research offers a refined chest X-ray image-based feature extraction framework for lung disease identification that is significantly discriminative in identifying Pneumonia, COVID-19, and lung cancer diseases.

• We offer explainability-driven, medically explainable visuals that emphasize the crucial regions relevant to the model's prediction of the input image.

• We established a novel technique for improving ensemble models through the integration of global second-order pooling and multi-head self-attention.

• This work examined many pre-trained deep learning models, providing a unique ensemble deep learning model that acts as the suggested model backbone and tackles the requirement for large-scale data.

• We report a robust deep learning method, evaluated in terms of accuracy, specificity, sensitivity, precision, F1_score, confusion matrix, and AUC using receiver operating characteristics (ROC), for detecting Pneumonia, COVID-19, and lung cancer diseases, based on a detailed experimental evaluation of the proposed model and a comparison with state-of-the-art results.
This paper is structured as follows: Section 1 presents the introduction and literature review. Section 2 describes the materials (datasets) used, while the methodology and model architecture are presented in Section 3. The obtained results are presented in Section 4 alongside the experimental setup and result analysis. Section 5 presents the discussion, ablation studies, and comparison with state-of-the-art models.

Materials and Methods
This study implemented its idea in a vision-transformer style. The vision transformer [51] is an encoder-only variant of the attention-based transformer [26], widely deployed in the natural language processing (NLP) domain, that has simplified visual and pattern recognition on image data. For image analysis tasks such as image classification, the input image x ∈ R^(H×W×C) is divided into N image patches x_p^(i) ∈ R^(P×P×C), where i ∈ {1, ..., N}, each patch has the form P × P in 2-D, C specifies the number of channels, and N = (H × W)/(P × P). The image patches are then employed as a sequence forming the transformer's input. Patch embeddings are generated by flattening the input patches and then mapping them to a D-dimensional latent vector using a learnable linear projection. A trainable class embedding (z_0^0 = x_class) is prepended to the series of patch embeddings. The class token's state at the last transformer layer, z_L^0, contains the classification information y that the model can obtain from the image in a concise way. During both pre-training and fine-tuning, the classification head is connected to z_L^0. Standard learnable 1-D position embeddings are added to the patch embeddings to maintain critical positional information. The encoder receives the resulting sequence as input. The encoder is made up of alternating layers of multi-headed self-attention (MSA) and MLP blocks. Before each block, the layer norm (LN) is applied, followed by residual (skip) connections. Additionally, we introduced global second-order pooling [52], which utilizes comprehensive image information across the network to implement an effective higher-order interpretation of the output layers of the fused models, enhancing the non-linearity of the fused model before the features are passed to the encoder.
The global second-order pooling technique takes the 3-D tensor generated by the fused layer as input and creates a transformation matrix from it, which is then applied via matrix multiplication across the entire spatial context.
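The patch construction described above can be sketched in a few lines of NumPy. The projection matrix, class token, and position embeddings below are random stand-ins for the learnable parameters, not the paper's trained weights:

```python
import numpy as np

def image_to_patches(x, P):
    """Split an H x W x C image into N = (H*W)/(P*P) flattened patches.

    Each patch is flattened to a vector of length P*P*C, as in the
    vision-transformer formulation described above.
    """
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # Reshape into a grid of P x P blocks, then flatten each block.
    patches = x.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches  # shape: (N, P*P*C)

def embed_patches(patches, E, pos_emb, x_class):
    """Linearly project patches to D dimensions, prepend the class token,
    and add 1-D position embeddings (random stand-ins here)."""
    z = patches @ E                  # (N, D) patch embeddings
    z = np.vstack([x_class, z])      # prepend class token -> (N+1, D)
    return z + pos_emb               # add position embeddings

# Toy example: a 224 x 224 RGB "image" with 16 x 16 patches and D = 64.
rng = np.random.default_rng(0)
x = rng.normal(size=(224, 224, 3))
P, D = 16, 64
patches = image_to_patches(x, P)     # N = (224*224)/(16*16) = 196 patches
E = rng.normal(size=(P * P * 3, D))
x_class = rng.normal(size=(1, D))
pos_emb = rng.normal(size=(patches.shape[0] + 1, D))
z0 = embed_patches(patches, E, pos_emb, x_class)
```

The resulting sequence `z0` (197 tokens of dimension 64 in this toy case) is what the encoder consumes.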

Dataset
Some existing works use proprietary datasets to evaluate their approaches, while others mix data from many publicly available sources. Two large publicly available datasets were used in this work, as described below. The first, titled the COVID-19 Radiography Dataset [53], comprises medical CXR images for four distinct classes: Normal, Pneumonia, Lung Opacity, and COVID-19, gathered by researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh, together with medical professionals and researchers from Pakistan and Malaysia. The four classes consist of 3616 COVID-19 samples, 10,192 Normal samples, 6012 Lung Opacity samples, and 1345 Pneumonia samples. The images are in the PNG (Portable Network Graphics) format with a resolution of 299 × 299 pixels. Only 3000 images per class were sampled for training, 300 for validation, and 300 for testing in this paper. Because the Pneumonia class had fewer than 3000 samples, we performed data augmentation using the Python Augmentor pipeline to obtain the number of samples needed for the experiment.

Model Architecture
As shown in Figure 2, we present a patch-based chest X-ray image feature extraction framework for lung disease detection that is accurate and dependable. The DenseNet, VGG16, and GoogleNet architectures serve as the network backbone for feature extraction. The fused features are passed through global second-order pooling before being split into N patches, and a linear projection is employed to embed them. After position embeddings are added, the sequence is supplied to an encoder, whose output is passed to the classification/detection layer for prediction. The features captured from the network backbone, after passing through global second-order pooling, are processed in the encoder by two distinct layer configurations: a multi-head self-attention layer and an MLP layer. Each layer is built with a shortcut connection and a normalization layer.
The output of each layer is as follows:

x_i = f(f_LN(x_{i−1})) + x_{i−1},    (1)

where x_i symbolizes the input of layer i and the output of layer i − 1, f_LN symbolizes the normalization layer, and f(·) symbolizes either the multi-head attention f_ATT(·) or the MLP f_FFN(·). The multi-head self-attention layer, which is based on scaled dot-product attention as illustrated in Figure 3A, is utilized to capture the interdependence among input tokens. The scaled dot-product attention algorithm finds the information in the source sequence that is important for the target sequence. The output of the scaled dot-product attention is given in Equation (2):

Attention(Q, K, V) = softmax(QK^T / √m) V,    (2)

where n represents the length of the source and target sequences, m denotes the hidden dimension, the target sequence is represented as Q ∈ R^(n×m), and the source sequence is represented as K ∈ R^(n×m) and V ∈ R^(n×m).
where softmax(·) denotes the row-wise SoftMax. Because the output of SoftMax often has one dimension much bigger than the others in each row, a single scaled dot-product attention attends to just one place per row (for each target token). Multi-head attention is used to attend to several places simultaneously via multiple scaled dot-product attentions, as seen in Figure 3B and mathematically described as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),    (3)

where h depicts the number of attention heads and W^(·) represents the learnable projection matrices. The MLP layer configuration consists of two MLP blocks, as seen in Equation (4):

MLP(x) = φ(xW_1) W_2,    (4)

where the non-linear function is depicted as φ(·) and W^(·) depicts the parameters. After a Global Average Pooling 1D, the GeLU activation function was employed at the first layer, while the SoftMax activation function was utilized after batch normalization at the second layer, as shown in Figure 3C. Batch normalization is a neural network layer that allows the following layers of the model to adjust more independently [55]. It is used to scale the activations of the input layer and stabilize the outputs of the preceding layers. Training becomes more effective when batch normalization is utilized, and it may also act as a regularizer to reduce model overfitting. The Gaussian Error Linear Unit (GeLU) activation is used in the initial dense layer. The GeLU was chosen for this study because of its deterministic non-linearity, which includes a stochastic regularization effect that leads to a large performance boost in most models with intricate structures. The fundamental function of the SoftMax layer is to transform the output of the encoding layer into a probability interval (0, 1). In this work, the detection was treated as a multi-class classification challenge.
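The scaled dot-product attention of Equation (2) and the multi-head combination described above can be illustrated with a minimal NumPy sketch. The random matrices stand in for the learnable projections and are not the paper's implementation:

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable row-wise softmax."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(m)) V."""
    m = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(m)) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Multi-head self-attention with h heads; Q = K = V = x.

    Wq, Wk, Wv, Wo play the role of the learnable projections W^Q, W^K,
    W^V, W^O (random stand-ins in the demo below).
    """
    n, m = x.shape
    d = m // h  # per-head dimension
    heads = []
    for i in range(h):
        s = slice(i * d, (i + 1) * d)
        heads.append(attention(x @ Wq[:, s], x @ Wk[:, s], x @ Wv[:, s]))
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ Wo

# Demo: 8 tokens of dimension 64, 8 heads (as in the encoder configuration).
rng = np.random.default_rng(1)
n, m, h = 8, 64, 8
x = rng.normal(size=(n, m))
Wq, Wk, Wv, Wo = (rng.normal(size=(m, m)) for _ in range(4))
out = multi_head_attention(x, Wq, Wk, Wv, Wo, h)
```

Each head attends over the full token sequence, so distant patches can influence each other, which is the property the text contrasts with a CNN's local receptive field.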
Following that, the input samples are passed to the encoding network, whose outputs are transformed into a probability distribution over the n classes through the SoftMax layer, as seen below:

y = softmax(W_c x + b_c),

where the weight matrix and the bias term are denoted as W_c and b_c, respectively. The Adam optimizer is used in this research. To compute the loss between the ground truth and the identified item, this study used a modified loss function, categorical smooth loss, alongside the categorical cross-entropy loss. Categorical smooth loss is the cross-entropy loss with a label-smoothing function added to the targets. The extracted feature x_i from the backbone model is portrayed in Equation (1); the attention layer configuration generates outputs as described in Equation (7) and the MLP layer configuration as given in Equation (8).
The attention layer uses Equation (2) with Q, K, and V all derived from the same input (Q = K = V = x_{2i}), capturing the dependence between tokens within the same sequence, also known as self-attention.
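As a concrete illustration of the categorical smooth loss mentioned above, the sketch below applies label smoothing to one-hot targets before computing the cross-entropy. The smoothing factor of 0.1 is an assumed illustrative value, not taken from the paper:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Soft targets: a weighted average of the hard targets and a uniform
    distribution over the n classes, as used by the categorical smooth loss."""
    n = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / n

def categorical_cross_entropy(y_true, y_pred, tiny=1e-12):
    """Categorical cross-entropy: -sum_c y_c log p_c, averaged over samples."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + tiny), axis=-1)))

# Demo on the four-class problem (Normal, COVID-19, Pneumonia, Lung Opacity).
y = np.array([[0.0, 1.0, 0.0, 0.0]])      # hard target: COVID-19
p = np.array([[0.05, 0.85, 0.05, 0.05]])  # model probabilities
hard_loss = categorical_cross_entropy(y, p)
smooth_loss = categorical_cross_entropy(smooth_labels(y, eps=0.1), p)
```

With eps = 0.1 the hard target [0, 1, 0, 0] becomes [0.025, 0.925, 0.025, 0.025], which penalizes overconfident predictions; this is the regularizing effect discussed in Section 5.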

Feature Extraction
As illustrated in Figure 4, this work combines deep features collected from DenseNet [56,57], VGG16 [58], and GoogleNet [59] using ensembling algorithms [60,61]. The DenseNet [56] architecture is a classification model that connects layers in a feed-forward manner (with identical feature-map sizes); this design ensures knowledge transfer across network tiers, as the output of each layer is concatenated with the outputs of the preceding layers. VGG16 [58] is a deep learning architecture that first preprocesses its input data before passing it through stacked convolutional layers with small 3 × 3 receptive filters and a constant stride of one. Spatial pooling is then carried out using five max-pooling layers with a 2 × 2 filter and a stride of 2. Two fully connected (FC) layers and a SoftMax activation at the end of the design complete the model structure. GoogleNet [59] uses inception modules, which enable the model to select between several convolutional hyperparameters per block, and is intended for image classification and identification. It consists of 22 layers. By using an inception module as the first layer, which is then stacked upon itself, GoogleNet extends the capacity of a basic CNN by applying parallel filtering to the input from the previous layer.
Ensembling is the process of combining different learning algorithms to improve overall performance, merging many models into a single, more trustworthy model. The fusion concatenates the backbone features into a single vector, as shown below:

F_Ensemble = Concat(F_1, F_2, ..., F_n),

where n is the number of pre-trained models that have been chosen and F_i is the feature map produced by the i-th model. F_Ensemble is then run through a 2D convolutional layer with a kernel size of 1, padding = 'same', and activation = "ReLU". Immediately after comes the global second-order pooling, which is intended to use the comprehensive image information across the network for an effective higher-order interpretation of the output layers of the fused models, thus enhancing the non-linearity of the fused model. ZeroPadding2D was used to zero-pad the output of the new layer (padding = ((0, 5), (0, 5))).
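The fusion and pooling steps can be sketched as follows. Channel-wise covariance pooling is used here as a simplified stand-in for the full GSoP block of [52], and the spatial size and channel counts are illustrative, not the paper's actual backbone dimensions:

```python
import numpy as np

def fuse_features(feature_maps):
    """Channel-wise concatenation of backbone feature maps:
    F_Ensemble = Concat(F_1, ..., F_n)."""
    return np.concatenate(feature_maps, axis=-1)

def global_second_order_pooling(F):
    """Simplified second-order pooling: treat the H x W x C map as H*W
    C-dimensional samples and return their C x C covariance matrix, which
    captures channel-wise correlations across the whole image."""
    H, W, C = F.shape
    X = F.reshape(H * W, C)
    X = X - X.mean(axis=0, keepdims=True)
    return X.T @ X / (H * W)  # (C, C)

# Demo: three backbone outputs with matching 7 x 7 spatial size, as if taken
# from DenseNet, VGG16, and GoogleNet (channel counts chosen for illustration).
rng = np.random.default_rng(2)
maps = [rng.normal(size=(7, 7, c)) for c in (32, 32, 64)]
F_ens = fuse_features(maps)             # (7, 7, 128)
S = global_second_order_pooling(F_ens)  # (128, 128) second-order statistics
```

The covariance output is symmetric by construction, which is why second-order pooling summarizes global channel interactions rather than only local averages.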

Evaluation Metrics
The robustness of the suggested model was assessed using a variety of evaluation metrics: accuracy, precision, specificity, F1_score, sensitivity, and the area under the receiver operating characteristic curve (AUC) [62-68]. TP stands for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative. The metrics we used are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1_score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)

The AUC measures a classifier's performance; the ROC (Receiver Operating Characteristic) is the curve obtained by plotting the TP rate against the FP rate at different threshold settings. The AUC indicates how well the model distinguishes between the different lung disease instances: the higher the AUC, the better.
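These standard metrics follow directly from the confusion counts. The sketch below uses illustrative counts, not the paper's results:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion counts (TP, FP, TN, FN)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

# Illustrative counts for a single class evaluated one-vs-rest.
m = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

For multi-class problems such as the four-class task here, these are computed per class one-vs-rest and then averaged.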

Results
The many experiments carried out in this study are explained in this section. First, experiments were carried out with the pre-trained models, using two learning rates and two loss functions.

Experimental Setup
All experiments were performed on a desktop computer with 64.0 GB RAM, an Intel(R) Core(TM) i9-10850K CPU running at 3.60 GHz, and an NVIDIA GeForce RTX 3080 Ti 10 GB graphics processing unit (GPU). For the implementation, this research used the open-source Keras framework and TensorFlow. During the training phase, the suggested deep learning models were fine-tuned and assessed using the same training and testing settings and methodologies. An early-stopping callback with a patience of 10 was also employed; a callback is a component that may perform operations at different stages of learning, such as at batch or epoch intervals. The Adam optimizer is used for hyper-parameter optimization, with a clip value of 0.2, over 100 epochs. The encoder uses eight heads with a patch size of 2 and a dropout rate of 0.01 for all layers. Meanwhile, the shift size is calculated using an embed dim of 64 (embed dim indicates the dimension to which high-dimensional vectors are converted without loss of information), a num_MLP of 256 (the hidden dimension of the multilayer perceptron), a window size of 2, and global average pooling (GAP). The hyperparameters utilized in the studies are listed in Table 3.
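The patience logic of the early-stopping callback described above can be sketched as follows. This is a minimal re-implementation of the behaviour for illustration, not the Keras EarlyStopping class itself:

```python
class EarlyStopping:
    """Patience-based early stopping: stop when the monitored validation
    loss has not improved for `patience` consecutive epochs (patience = 10
    in the experiments above)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Demo: the loss plateaus after epoch 2, so training stops 10 epochs later.
es = EarlyStopping(patience=10)
losses = [1.0, 0.8, 0.7] + [0.7] * 15
stopped_at = next(i for i, loss in enumerate(losses) if es.should_stop(loss))
```

In Keras this corresponds to passing an `EarlyStopping` callback to `model.fit`; the sketch only shows why a patience of 10 tolerates short plateaus without ending training prematurely.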

Classification Results
The classification findings of the various methodologies used in this work are discussed in this section. Because the backbone is made up of fused deep learning models, we first present their results under the loss functions and learning rates employed.

Backbone Model Selection
Six pre-trained deep learning models were identified during the selection of the implemented backbone, namely DenseNet201, VGG16, GoogleNet, InceptionResNetV2, Xception, and EfficientNet network architecture. Table 4 shows the outcomes of the pretrained deep learning models that were used, as shown visually in Figure 5. During the backbone model selection experiment, Data_A was employed.
In terms of the assessment measures employed in this study, the DenseNet model achieved the best results, outperforming the other models across the employed learning rates. The InceptionResNetV2 architecture comes after the DenseNet architecture, ahead of the GoogleNet and VGG16 models. The sensitivity, specificity, F1_score, and AUC were the most important metrics for choosing the feature extractors: the better these metrics, the more precise the model's classification and prediction. With a learning rate of 10−4, DenseNet achieved 91.981% sensitivity, 97.325% specificity, 92.088% F1_score, and 94.651% AUC. The recorded results for the other architectures are: InceptionResNetV2, 89.385% sensitivity, 96.434% specificity, 89.398% F1_score, and 92.91% AUC; GoogleNet, 86.024% sensitivity, 95.322% specificity, 86.188% F1_score, and 90.673% AUC; and VGG16, 81.988% sensitivity, 93.985% specificity, and 82.008% F1_score. The pre-trained models performed better with the Adam optimizer at a learning rate of 10−4 than at a learning rate of 10−3. Table 5 shows the ROC performance and Table 6 the PR-curve performance of the network backbone selection, the objective being to see how well the models perform on their respective classes. The DenseNet design had the greatest COVID-19 ROC class performance, with an area of 95.583%, followed by the GoogleNet architecture with an area of 90.296%, InceptionResNetV2 with an area of 86.819%, and VGG16 with an area of 85.526%, with Xception and EfficientNet behind them. Compared to the GoogleNet architecture, the InceptionResNetV2 COVID-19 class had a superior area when utilizing the Adam optimizer with a learning rate of 10−3.
We also analyzed the computational cost of all six models, i.e., the number of trainable and non-trainable parameters of each architecture, to complete our backbone network choices. As a result, we concluded that DenseNet, GoogleNet, and VGG16 should be fused as the feature extractors. Table 7 shows the classification performance of the proposed model and the backbone network. Two learning rates and two loss functions were used in these studies. For both implemented loss functions, the results obtained with a learning rate of 10−3 surpass those obtained with a learning rate of 10−4. Even so, the model performed worse with the learning rate of 10−3 and the categorical smooth loss, with an accuracy of 96.667%, sensitivity of 93.314%, specificity of 97.772%, precision of 93.895%, F1_score of 93.391%, and AUC of 95.543%, than with the categorical cross-entropy loss, which yielded an accuracy of 98%, sensitivity of 94.965%, specificity of 98.992%, precision of 95.508%, F1_score of 95.216%, and AUC of 96.976%. In all other situations, the Adam optimizer with a learning rate of 10−3 and categorical cross-entropy is favored. Tables 8 and 9 present the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) results that confirm these findings. The ROC and PR curves are employed to evaluate the precise prediction rate of the classes Normal, COVID-19, Pneumonia, and Lung Opacity. The hyperparameters had a significant impact on the accurate prediction rate of the models, with the learning rate of 10−4 and categorical cross-entropy surpassing the learning rate of 10−3 with categorical smooth loss. Table 8 shows ROC class performances of 95.606% for COVID-19, 98.206% for Lung Opacity, 95.559% for Normal, and 92.801% for Pneumonia, with AP values of 92% for COVID-19, 95% for Lung Opacity, 83% for Normal, and 87% for Pneumonia.
When comparing the four classes, the COVID-19 class outperformed the lung opacity class in the majority of the optimum settings in terms of ROC and AP regions.
The ROC and AP curves reported in Table 7 are shown graphically in Figure 6. The hit-rate data were used to further assess the model, as illustrated in Figure 6: the "Hit Rate" is the number of correctly detected targets divided by the total number of cases, and the "Miss Rate" is 1 minus the "Hit Rate". The COVID-19 class recorded a hit rate of 38 percent. Because the Normal class hit rate was 38 under the 10−3 learning rate and categorical smooth loss function, the configuration with the 10−4 learning rate and categorical cross-entropy loss function was chosen.

Classification Results Using Data_B
The proposed model's classification results on Data_B are shown in this section. Unlike the experimental analysis on Data_A, only the learning rate of 10−4 and the categorical smooth loss function were employed here. Table 10 shows the obtained results in terms of the assessment measures used: the suggested model performed significantly better in classification, with an overall accuracy of 98.19%, sensitivity of 97.29%, specificity of 98.64%, precision of 97.29%, F1_score of 97.29%, and AUC of 98.10%. This demonstrates how the backbone model performs better when combined with the proposed model. The qualitative evaluation in terms of ROC and PR curves showed similar results. The COVID-19 samples were predicted more accurately than the other two classes in terms of the ROC and precision-recall curves, with the pneumonia class recording an AUC of 98.42% and an AP more significant than that of the viral pneumonia class, averaging 96.0%. COVID-19 had an AP of 97.96%, while Pneumonia had an AP of 97.85%. The Normal class attained slightly lower AUC and PR rates of 97% and 96.06%, respectively, owing to the well-known randomness of the deep-learning procedure for fine-tuning the trainable parameters. The ROC and AP curves recorded in Tables 8 and 9 for the Data_B experiment are shown in Figure 7A,B. To further elaborate on the performance of the proposed model on Data_B, we used the confusion matrix instead of the hit rate used for Data_A, since the testing set of Data_B is much bigger in terms of the number of samples. Figure 7C shows the confusion matrix for Data_B.

Discussion
The results of this experiment show that the proposed model for lung disease diagnosis is quite accurate. The results are described using the hyperparameters employed in this investigation. We examine how the chosen loss function affects performance and the advantage of the 10⁻⁴ learning rate over the 10⁻³ learning rate. With a 10⁻³ learning rate and the categorical cross-entropy loss function as the assessment criterion, the Adam optimizer performs significantly better, and the categorical cross-entropy yielded a better result than the categorical smooth loss in most cases. However, several papers report that, compared with other loss functions, label smoothing helps the model detect the damaged area. The proposed model makes use of patches and positional embedding, allowing it to focus on all the damaged areas patch by patch while keeping their positions in mind for reconstruction. The model's performance was further boosted by the feature extraction strategy, which pays special attention to global features. On Data_B, the label smoothing loss function, which uses soft targets formed as a weighted average of the hard targets and a uniform distribution over labels, regularly and significantly improved the generalization and learning speed of the multi-class network. Label smoothing also stops the network from becoming overconfident. However, the learning rate influences the effect of label smoothing: with label smoothing, the proposed model's 10⁻³ learning rate exceeded the 10⁻⁴ learning rate on the Data_A result. Before adding more complicated architectures to a network, this study emphasizes the relevance of deep learning feature extraction and hyperparameter tuning when processing new data.
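The label-smoothing construction above (soft targets as a weighted average of the hard one-hot targets and a uniform distribution over labels) can be sketched as follows. The smoothing factor `eps` is an assumed example value, not the paper's setting:

```python
import numpy as np

def smoothed_targets(labels: np.ndarray, num_classes: int, eps: float = 0.1) -> np.ndarray:
    """Soft targets as a weighted average of one-hot targets and a uniform
    distribution over labels: y = (1 - eps) * one_hot + eps / num_classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def cross_entropy(probs: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy between predicted probabilities and (soft) targets."""
    return float(-(targets * np.log(probs + 1e-12)).sum(axis=1).mean())

# Two samples with true classes 0 and 2, three classes, eps = 0.1.
targets = smoothed_targets(np.array([0, 2]), num_classes=3, eps=0.1)
print(targets[0])  # [0.93333333 0.03333333 0.03333333]
```

Because the true-class target is capped below 1, the network cannot drive its predicted probability to 1, which is the mechanism that prevents overconfidence.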
The outcomes of this work could be useful for quickly deploying accessible AI models for the rapid, accurate, and cost-effective detection of COVID-19 infection.

Ablation Studies of the Proposed Model
This section presents the heat maps that describe the deep learning outcomes. In this study, the attention approach helps the model highlight the relevant features of the chest X-ray images, which underpins the proposed model's predictive capacity. Internally, the proposed model first divides the input image into patches before adding the positional embedding. Each patch is compressed into a vector representation by merging the pixel layers in a patch and then stretching it to the proper input dimension. The positional embedding reflects how the model interprets distance within the input image: patches that are relatively close together have highly similar position embeddings. For accurate feature extraction, patches and learnable embeddings are employed to treat each patch separately, and the positional embedding lets the model remember where each patch was located in the original input and output. The patches are first transformed by 2D learnable convolutions. By examining the effects of the patch-and-embedding combination, Figures 8 and 9 confirm the suggested approach's efficacy in highlighting prospective ROIs, allowing the proposed model to focus on these regions quickly and successfully detect the disease. They also show that, thanks to the self-attention heads, the model can generalize across the input frame even within the earliest layers. The total distance in the input image across which relevant information is aggregated is comparable to the receptive field in CNNs; because our network backbone is an ensemble of pre-trained models, we consistently observed small attention scales in the early layers. When the proposed model is implemented without a network backbone, i.e., learning features from scratch, the attention heads attend to the bulk of the image even in the lowest layers, implying that the model's ability to aggregate information globally is exercised. As illustrated in Figures 8 and 9, the proposed model focuses on visual aspects carrying semantic information that is vital for classification.
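The patch-splitting and positional-embedding step described above can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the image size, patch size, and the random (rather than learned) embedding are all assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_to_patches(img: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patches and flatten
    each patch into a vector (the 'stretching' step described above)."""
    h, w, c = img.shape
    rows, cols = h // patch, w // patch
    return (img[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, c)
            .swapaxes(1, 2)                       # group pixels by patch
            .reshape(rows * cols, patch * patch * c))

# Hypothetical sizes: a 224x224 single-channel chest X-ray, 16x16 patches.
img = rng.random((224, 224, 1))
tokens = image_to_patches(img, patch=16)          # (196, 256) patch vectors
pos_embed = rng.normal(0.0, 0.02, tokens.shape)   # learnable in a real model
tokens = tokens + pos_embed                       # position-aware patch tokens
print(tokens.shape)  # (196, 256)
```

Each of the 196 tokens then enters the transformer separately, while its positional embedding preserves where the patch came from in the original frame.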

Comparison with the State-of-the-Art Based on Deep Learning Models
We compute and report Accuracy, Precision, Sensitivity, and F1_score to compare the proposed model's classification performance with existing cutting-edge approaches. The proposed model achieves the best overall accuracy of 98% (Table 11). For COVID-19 multiclassification, Wang et al. [53] proposed COVIDNet, while Khan et al. [69] proposed CoroNet. COVIDNet beats CoroNet with an accuracy of 90.78% vs. 89.6%, a precision of 91.1% vs. 90.0%, and an F1_score of 90.81% vs. 89.8%. Nonetheless, in terms of sensitivity, CoroNet outperformed COVIDNet, recording 96.4%. Li et al. [70] recommended the Mag-SD model, which attained 92.35% accuracy, 92.50% precision, 92.20% sensitivity, and 92.34% F1_score. To improve feature extraction from CXR images, Mondal et al. [71] and Shi et al. [72] advocated attention mechanisms. Shi et al. [72] presented Teacher-Student Attention, whose accuracy of 91.38% improved on the previous methods. Mondal et al. [71] introduced the Local-Global Attention Network, which surpassed earlier state-of-the-art models with a classification accuracy of 95.87%, precision of 95.56%, sensitivity of 95.99%, and F1_score of 95.74%. The author of Reference [73] evaluated two different CXR classification strategies on the same dataset; EfficientNetB1 (Strategy 2) produced the best classification results, with 92% accuracy, 91.75% precision, 94.50% sensitivity, and 92.75% F1_score. Furthermore, the proposed technique has the highest precision for COVID-19 cases, meaning that a COVID-19-negative sample is rarely misidentified as positive by the proposed classifier, and the highest recall score, indicating that the classifier correctly identifies the majority of positive COVID-19 samples.
When compared to the baseline approaches, the suggested method has the highest F1_score, indicating that it is the most balanced in terms of precision and sensitivity.
This research leverages the Data_B classification performance, given in Table 12, to further test the proposed model's superiority over state-of-the-art models. This comparison covers the many models used to detect pneumonia from chest X-ray images; as shown in the table, researchers have applied several methodologies, including pre-trained models, ensemble models, and models built from scratch. Naralasetti et al. [74] used a deep CNN architecture and achieved a 91% accuracy rate. Ensemble models allow for a deeper understanding of the task and better results; however, compared with the CNN model used by Dokur et al. [75], their ensemble model fared poorly: in Accuracy, Precision, Recall, and F1_score, the CNN model outperformed the ensemble model by a factor of 3. For pneumonia-detection feature extraction, Hammoudi et al. [36] implemented several deep-learning models, of which DenseNet121, VGG16, VGG19, and ResNet50 performed best, combined with traditional classifiers such as K-Nearest Neighbor (KNN), Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF) for the X-ray pneumonia classification task. The authors concluded that DenseNet-169, combined with well-tuned SVM RBF-kernel hyper-parameters, outperformed all other models tested. From-scratch techniques were employed in [76,77]; the researchers developed novel models for pneumonia detection, but performance was poor, with accuracy, precision, recall, and F1_score all falling below 90%. After analyzing transfer-learning methodologies, the authors of [78] used the AlexNet architecture via transfer learning and produced the best classification accuracy among the state-of-the-art models, at 97.40%.
With an accuracy of 98.19%, precision of 97.29%, recall of 97.29%, and F1_score of 97.29%, the proposed model, which integrates all the investigated strategies, has proven to outperform all previous techniques.

Limitations and Future Works
The implemented model nonetheless has some limitations. First, the model was investigated only on chest X-ray scans, so our findings are limited to chest X-ray images. There are various medical image modalities for lung disease detection and classification, including magnetic resonance imaging (MRI), ultrasound, and computed tomography (CT); in the future, the proposed approach will be applied to these modalities. Furthermore, no image-feature enhancement procedures were examined in this investigation, and the severity of the lung disease (mild, moderate, or severe) was not considered. We also note that the chest X-ray dataset contains only one series per patient, which supports the thesis of [79] that such a small dataset (one chest X-ray series per patient) cannot be used to predict whether a patient will develop a radiographic abnormality as the disease progresses. This will be thoroughly investigated in our upcoming work. Finally, the proposed model could be extended to predict oral cancer, skin cancer, breast cancer, and other cancer types.

Conclusions
This research focuses primarily on the identification of pneumonia and COVID-19, the two most common lung diseases now afflicting people around the world. Lung disease identification was and remains an important part of epidemic diagnosis, and effective CXR data extraction aids the correct diagnosis of lung illnesses, allowing early detection and treatment. We present a unique chest X-ray image-based feature extraction framework that splits the images into patches with positional embeddings for accurate and fast lung disease identification. This paper first examined the efficiency of six pre-trained deep learning models. We then proposed our model: the first step uses a fusion model to extract deep (generic) features. Since the fusion model concatenates three models, a higher-order representation of the features is needed, so we introduced global second-order pooling before applying the multi-head self-attention network to analyze the regional features of the input image; the extracted features are then passed to the MLP layer for accurate lung disease classification and detection. Two publicly available datasets were employed to test the efficacy of the proposed approach: Data_A yielded a precision of 96.20% and an accuracy of 98.00%, while Data_B yielded a precision of 97.29% and an accuracy of 98.19%. We also assess the proposed model's predictions using explainability-driven heatmap visualizations that highlight the key aspects influencing its decisions. Not only are these decipherable visual clues a step closer to understandable AI, they may also help professional radiologists in diagnosis. We have empirically proven the efficacy of the proposed strategy over state-of-the-art CNN-based algorithms in terms of precision, recall, and F1 score.
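The global second-order pooling step in the pipeline above can be sketched as covariance pooling over the channels of a feature map: instead of averaging each channel (first-order pooling), it captures pairwise channel correlations as a global descriptor. This is a minimal illustration under assumed feature-map dimensions, not the authors' implementation:

```python
import numpy as np

def global_second_order_pooling(features: np.ndarray) -> np.ndarray:
    """Global second-order (covariance) pooling over an H x W x C feature map.

    Returns a C x C covariance matrix of the channel activations, a
    higher-order global statistic than per-channel averages.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)
    x = x - x.mean(axis=0, keepdims=True)   # center each channel
    return x.T @ x / (h * w - 1)            # C x C covariance matrix

# Hypothetical fused feature map from the ensemble backbone: 7 x 7 x 32.
rng = np.random.default_rng(1)
fmap = rng.random((7, 7, 32))
desc = global_second_order_pooling(fmap)
print(desc.shape)  # (32, 32)
```

The resulting symmetric C x C descriptor can be flattened and fed to the subsequent attention/MLP stages in place of a length-C average-pooled vector.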