A Deep Learning Model with Self-Supervised Learning and Attention Mechanism for COVID-19 Diagnosis Using Chest X-ray Images

Abstract: The SARS-CoV-2 virus has spread worldwide, and the World Health Organization has declared COVID-19 a pandemic, proclaiming that the entire world must overcome it together. The chest X-ray and computed tomography datasets of individuals with COVID-19 remain limited, which can lower the performance of deep learning models. In this study, we developed a model for the diagnosis of COVID-19 by solving the classification problem using a self-supervised learning technique with a convolutional attention module. Self-supervised learning using a U-shaped convolutional neural network model combined with a convolutional block attention module (CBAM), trained on over 100,000 chest X-ray images with the structural similarity (SSIM) index, captures image representations extremely well. The proposed system consists of fine-tuning the weights of the encoder after a self-supervised learning pretext task, interpreting the chest X-ray representation in the encoder using convolutional layers, and diagnosing the chest X-ray image with the classification model. Additionally, adding the CBAM improves the average accuracy to 98.6%, outperforming the baseline model (97.8%) by 0.8 percentage points. The proposed model classifies the three classes of normal, pneumonia, and COVID-19 extremely accurately, and other metrics such as specificity and sensitivity are similar to the accuracy. The average area under the curve (AUC) is 0.994 in the COVID-19 class, indicating that our proposed model exhibits outstanding classification performance.


Introduction
The SARS-CoV-2 virus, which causes COVID-19, has appeared across the globe and has considerably damaged all forms of human activity worldwide. Like the severe acute respiratory syndrome coronavirus that appeared in 2002 and the Middle East respiratory syndrome coronavirus that appeared in 2012, the SARS-CoV-2 virus is a coronavirus that spreads through droplet infection. Its spread has had adverse effects on the economies and culture of humankind. The SARS-CoV-2 virus infected 6.18 million people across the globe in just six months from December 2019, and it continues to spread. Since its discovery, approximately eighty million cases have been recorded worldwide, including 35 million cases in the Americas and 26 million in Europe, with 1.8 million confirmed deaths across the globe as of 1 January 2021. Because of the rate of spread, the World Health Organization (WHO) declared COVID-19 a pandemic on 11 March 2020, and the whole world has since been actively working to overcome this disease.
As the virus spreads around the world, experts are urging people to keep a distance from their acquaintances and are working hard to intensively manage confirmed cases and conduct epidemiological investigations of contacts. As the virus is highly contagious, keeping distance is the key to preventing its spread. Moreover, infected
• First, self-supervised learning is introduced to prevent the overfitting problem caused by the limited number of training images in deep learning. Models Genesis [13] was modified by adding a convolutional attention module and trained on an unlabeled dataset of 112,120 CXR images. It was then fine-tuned on a COVID-19 dataset containing 1821 X-ray images. The accuracy of our model is 98.6%.
• We improved the performance of Models Genesis by adding a convolutional attention module after every convolutional layer. We conducted extensive experiments in which we compared the performance of the modified Models Genesis containing the attention modules with that of the original for COVID-19 classification.
• For qualitative evaluation of the model results, we considered a visually explainable AI approach, Score-CAM [14]. Using it, we investigated how the proposed model makes correct and incorrect classifications in order to identify critical factors related to COVID-19 cases. The Score-CAM used in this paper is an improved method that resolves issues of Grad-CAM [15].

Materials and Methods
This section describes the datasets and methods for the classification task based on CNN using self-supervised learning with a pretext task by a modified Models Genesis [13], which consists of UNet [16] and Convolutional Block Attention Module (CBAM) [17]. Combining these methods increases the classification accuracy, along with other metrics such as the sensitivity and AUC.

Datasets
The datasets used in this study were collected from various sources to develop classification models that could accurately identify images as belonging to the classes of normal, pneumonia, and COVID-19. To address the class-balance problem in classification tasks, the size of the dataset for each of the three classes was set to 607. The NIH CXR dataset [18], which is used for self-supervised learning, consists of 112,120 images in PNG format, covering two classes, normal and pneumonia. COVID-19 datasets were collected from various sources, as presented in Table 1. These datasets were collected from the same sources as those of the baseline model by Lee et al. [11], so that the model performances can be compared. All the images are resized to 512 × 512, because the collected images have different shapes. For self-supervised learning, we used 112,120 images as the training dataset and 1000 images as the validation dataset. For our classification task, we split the data randomly into three parts: 20% of the total dataset as the test set, 20% of the remainder as the validation set, and the rest as the training set; this yields 1160 training, 290 validation, and 363 test images. Then the pixel values of the collected images were scaled to the range of zero to one.
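The split described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the random seed, the rounding-up rule, and the 8-bit pixel assumption are all illustrative choices, and the total of 1813 images is simply the sum of the reported split sizes.

```python
import math
import numpy as np

def split_indices(n_samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle indices, hold out test_frac for testing, then val_frac of the rest for validation."""
    rng = np.random.default_rng(seed)          # seed is an arbitrary illustrative choice
    idx = rng.permutation(n_samples)
    n_test = math.ceil(test_frac * n_samples)  # rounding up is an assumption
    n_val = math.ceil(val_frac * (n_samples - n_test))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

def scale_pixels(image):
    """Scale pixel values to [0, 1], assuming 8-bit input (0..255)."""
    return image.astype(np.float32) / 255.0

# 1813 = 1160 + 290 + 363, the sum of the split sizes reported in the text
train, val, test = split_indices(1813)
print(len(train), len(val), len(test))
```

With these assumptions the sizes come out as 1160, 290, and 363, matching the counts reported above.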

Existing Models
As mentioned above, since the collected datasets are from the same sources as in Lee et al. [11], we set their model as the baseline. Lee et al. [11] trained deep CNN-based models developed by the Visual Geometry Group (VGG) at the University of Oxford, namely VGG-16 and VGG-19 [23]. VGG-16 and VGG-19 have 16 and 19 weight layers, respectively, and use an architecture with very small (3 × 3) convolution filters. By increasing the depth of the network with these small filters, the VGG models achieved top-ranking results in the ImageNet Challenge 2014. Lee et al. [11] constructed 12 experimental models by using different degrees of fine-tuning for the VGG-16 and VGG-19 models and then conducted an experiment to compare the performance of the models for COVID-19 classification. Fine-tuning Conv block 4 and Conv block 5 of VGG-16 exhibited the highest AUC value of the ROC curve.
Furthermore, in addition to the baseline model by Lee et al. [11], we compare the performance of our proposed system with two recently released models by Das et al. [24] and Rahimzadeh et al. [25]. Das et al. [24] proposed a two-stage machine learning model for classifying COVID-19 using CXR images. Their model extracts CXR features with the VGG19 model pretrained on the ImageNet dataset and classifies CXR images into normal and abnormal by logistic regression. Then, CXR images classified as abnormal in the first stage are further classified into pneumonia and COVID-19 using XGBoost [26]. Rahimzadeh et al. [25] introduced training techniques that help the network learn better when the dataset is unbalanced (with more cases from other classes and fewer COVID-19 cases). The authors also proposed a neural network connecting the Xception [27] and ResNet50V2 [28] networks.

Self-Supervised Learning
Self-supervised learning is used to learn feature representations from unlabeled data. As a form of semi-supervised learning, self-supervised learning is a state-of-the-art technique that uses unlabeled data with a pretext task to capture semantic features for use in other vision tasks such as classification, and to improve performance or prevent overfitting when the data are limited [35,36]. Image features learned by a self-supervised learning method can be good substitutes when transferred to other vision tasks, as shown by previous studies such as Larsson et al. [37] and Zhang et al. [38]. Furthermore, introducing self-supervised learning to obtain representations of data improves the accuracy of the model and makes the model more robust, even to adversarial examples [39]. Due to these benefits, self-supervised learning has been widely used in various areas, including medical imaging. The pretext task of Models Genesis begins by transforming the input medical images using three steps: non-linear transformation, local shuffling, and out-painting or in-painting, as shown in Figure 1. See Zhou et al. [13] for details of the three types of input image transformation of Models Genesis. In the pretext task of Models Genesis, the model learns the CXR representation by restoring the transformed images to their original ones. The UNet encoder-decoder architecture is used for the restoration task. UNet [16] is a fully convolutional network introduced for segmentation tasks, especially in the medical image field. It contains no fully connected layers, and being fully convolutional enables the model to yield more accurate results in segmentation tasks. It is a U-shaped model, as shown in Figure 2, that consists of convolution and upsampling layers so that the input and output shapes are equal. UNet is divided into two parts: the encoder and the decoder.
The encoder uses convolution layers to learn input image features. The decoder, designed for segmentation tasks, gradually upsamples the output of the encoder using upsampling layers until it has the same shape as the input, with each decoder stage corresponding to the encoder stage at the same depth. Furthermore, because the decoding component begins with a significantly large number of feature channels, the model can propagate context information to higher resolution layers [16].
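To make the pretext task more concrete, the following is a simplified sketch of two of the input transformations named above, local shuffling and in-painting, written with NumPy. It is not the Models Genesis implementation of Zhou et al. [13]; the function names, window sizes, block counts, and the random-noise fill are illustrative assumptions.

```python
import numpy as np

def local_pixel_shuffle(image, n_windows=50, max_window=8, seed=0):
    """Shuffle pixels inside small random windows; global structure is preserved."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape
    for _ in range(n_windows):
        wh = rng.integers(1, max_window + 1)   # window height
        ww = rng.integers(1, max_window + 1)   # window width
        top = rng.integers(0, h - wh + 1)
        left = rng.integers(0, w - ww + 1)
        patch = out[top:top + wh, left:left + ww].ravel()
        out[top:top + wh, left:left + ww] = rng.permutation(patch).reshape(wh, ww)
    return out

def in_paint(image, n_blocks=3, block=16, seed=0):
    """Overwrite a few random square blocks with noise that the model must restore."""
    rng = np.random.default_rng(seed)
    out = image.copy()
    h, w = image.shape
    for _ in range(n_blocks):
        top = rng.integers(0, h - block + 1)
        left = rng.integers(0, w - block + 1)
        out[top:top + block, left:left + block] = rng.random((block, block))
    return out

# Stand-in for a CXR image already scaled to [0, 1]
x = np.random.default_rng(1).random((64, 64))
x_shuffled = local_pixel_shuffle(x)
x_masked = in_paint(x)
```

In the restoration task, a transformed image such as `x_shuffled` or `x_masked` would be the network input and the untouched `x` the target.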

Convolutional Attention Module
The attention method used in the deep learning field has several advantages. Attention enables a deep learning model to focus on the important parts of the input and yield better interpretations of the output. Additionally, attention allows researchers to interpret the deep learning model through human perception [40]. Therefore, various approaches for applying attention to CNN-based models have been proposed. One approach is the Convolutional Block Attention Module (CBAM), which was developed by Woo et al. [17]. Figure 3 shows the architecture of the CBAM. It consists of channel-wise and spatial-wise attentions. When a feature map F is given as input, the CBAM computes two attention maps using the channel-wise and spatial-wise attentions. Then, the overall CBAM is expressed as follows:
$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$
$$F \in \mathbb{R}^{C \times H \times W}, \quad M_c(F) \in \mathbb{R}^{C \times 1 \times 1}, \quad M_s(F') \in \mathbb{R}^{1 \times H \times W},$$
where H, W, and C are the height, width, and number of channels of the input feature map, respectively, ⊗ denotes element-wise multiplication, and M_c and M_s represent the channel-wise and spatial-wise attentions, respectively. The channel attention map represents the 'inter-channel relationship' of features by considering each channel of an input feature map as a feature detector [41]. Channel attention uses a multi-layer perceptron with one hidden layer, applied to two different features obtained by average-pooling and max-pooling operations. Therefore, it can be interpreted as focusing on 'what' is meaningful in the input feature map [17]. Spatial attention, unlike channel attention, produces a spatial attention map focusing on the 'inter-spatial relationship' of the input feature map. It uses a convolution layer with a filter size of 7 × 7 and a sigmoid function to generate the spatial attention map. Hence, spatial attention can be interpreted as focusing on 'where' the important area of the input feature map is [17]. Using the channel and spatial attention, CBAM can be applied to a feature map of any dimension when it follows a convolution layer.
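The two attention steps can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this paper: the weights `W1`, `W2`, and `kernel` are random stand-ins for trained parameters, the reduction ratio `r` and the ReLU hidden activation are assumptions taken from the usual CBAM formulation, and the code uses a channels-last (H, W, C) layout for convenience.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """F: (H, W, C). A shared one-hidden-layer MLP is applied to avg- and max-pooled descriptors."""
    avg = F.mean(axis=(0, 1))                        # (C,)
    mx = F.max(axis=(0, 1))                          # (C,)
    mlp = lambda v: np.maximum(v @ W1, 0.0) @ W2     # ReLU hidden layer (assumed)
    return sigmoid(mlp(avg) + mlp(mx))               # (C,), values in (0, 1)

def spatial_attention(F, kernel):
    """kernel: (7, 7, 2) filter over the channel-wise average and max maps, zero padded."""
    pooled = np.stack([F.mean(axis=2), F.max(axis=2)], axis=2)   # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))
    H, W = F.shape[:2]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(out)                              # (H, W), values in (0, 1)

def cbam(F, W1, W2, kernel):
    F1 = channel_attention(F, W1, W2)[None, None, :] * F    # F' = M_c(F) * F
    F2 = spatial_attention(F1, kernel)[:, :, None] * F1     # F'' = M_s(F') * F'
    return F2

rng = np.random.default_rng(0)
C, r = 8, 2                                   # channel count and reduction ratio (illustrative)
F = rng.normal(size=(16, 16, C))              # stand-in for a convolutional feature map
W1 = rng.normal(size=(C, C // r)) * 0.1
W2 = rng.normal(size=(C // r, C)) * 0.1
kernel = rng.normal(size=(7, 7, 2)) * 0.1
out = cbam(F, W1, W2, kernel)
```

Because both attention maps lie in (0, 1), the module only rescales the feature map; the output keeps the input shape, which is why it can be placed after any convolution layer.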
In this paper, we used the self-supervised learning method Models Genesis [13] and modified it by adding a CBAM [17] after every convolutional layer in the Models Genesis network to learn the representation of CXR images. For the self-supervised learning, we conducted an experiment with the large NIH CXR dataset, which contains 112,120 CXR images from pneumonia patients and normal cases. Following the transformation of the input images based on Models Genesis, the proposed self-supervised learning model should capture image features better than a model without self-supervised learning, because the use of large amounts of CXR data helps capture important CXR image features. Figure 4 illustrates pictorially how our self-supervised learning task captures the image representations. The original CXR images are denoted by X and the restored images by X′. Transformed images are input into the U-shaped architecture of Models Genesis to restore their original pixel values. During this task, the U-shaped architecture and CBAM weights are updated to restore the transformed images. We conducted the self-supervised learning from scratch using the training dataset, and trainable weights are initialized randomly from zero to one. The loss function, L(X, X′), is based on the structural similarity (SSIM) index [42], which measures the similarity between original images and their corresponding restored ones. The SSIM index is calculated by the following equations:
$$\mathrm{SSIM}(x, y) = [l(x, y)]^{\alpha} \, [c(x, y)]^{\beta} \, [s(x, y)]^{\gamma},$$
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \qquad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \qquad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3},$$
where µ represents the sample mean and C_1 is a very small constant that stabilizes the metric when the means are close to zero. l(x, y) compares the luminance of the two images. In c(x, y), σ represents the sample standard deviation and C_2 is a constant close to zero; c(x, y) compares the contrast of the two images. s(x, y), in which σ_xy is the sample covariance, measures the structural correlation between the two images, and C_3 stabilizes the fraction in the same way as C_1 and C_2. The parameters α, β, and γ, which are positive values, adjust the relative importance of the three components. For our experiments, C_1, C_2, and C_3 are set to a very small value, 0.000001, and all three parameters are equal to 1, to simplify the calculations. As the SSIM index takes values between 0 and 1 in our experiments, 1 − SSIM is used as the loss function, so that reducing the loss value maximizes the SSIM index.
Figure 5 shows our overall proposed system for COVID-19 classification. As the self-supervised learning is conducted using 112,120 images, we enable our encoder-decoder architecture to capture representations of the CXR images. Our proposed classification model fine-tunes the weights of the encoder network from the pretrained self-supervised learning model. After the encoder network from the self-supervised learning model, we added classification layers, which consist of four convolution layers, a max pooling layer, a global average pooling layer, and some fully connected layers. The four 2D convolution layers interpret the information in the feature maps produced by the encoder. After the convolution layers, the max and global average pooling layers follow to maintain graphical features and to produce one-dimensional nodes that connect to the fully connected layers. A max pooling layer is used to down-sample the output of the two convolution layers. This procedure reduces the input dimension while maintaining the graphical information.
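As a worked illustration of the 1 − SSIM loss, the sketch below computes the three SSIM components from whole-image statistics with the constants stated above (C1 = C2 = C3 = 0.000001 and α = β = γ = 1). Using global rather than windowed statistics is a simplifying assumption, and the function names are illustrative; this is not the training code of this paper.

```python
import numpy as np

def ssim_global(x, y, c1=1e-6, c2=1e-6, c3=1e-6, alpha=1.0, beta=1.0, gamma=1.0):
    """SSIM from whole-image statistics (no sliding window; a simplification)."""
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()                          # sample covariance
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)            # luminance
    c = (2 * sigma_x * sigma_y + c2) / (sigma_x ** 2 + sigma_y ** 2 + c2)  # contrast
    s = (sigma_xy + c3) / (sigma_x * sigma_y + c3)                       # structure
    return (l ** alpha) * (c ** beta) * (s ** gamma)

def ssim_loss(x, y):
    """Loss that decreases as the restored image y becomes more similar to x."""
    return 1.0 - ssim_global(x, y)

rng = np.random.default_rng(0)
original = rng.random((64, 64))                       # stand-in for an original image X
restored = np.clip(original + 0.05 * rng.normal(size=original.shape), 0.0, 1.0)

print(ssim_loss(original, original))   # close to 0 for a perfect restoration
print(ssim_loss(original, restored))   # larger for an imperfect restoration
```

A perfect restoration gives l = c = s = 1 and therefore a loss of zero, which is the behavior the training objective relies on.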

Fine-Tuning the Encoder
Furthermore, a global average pooling layer calculates the average value of each feature map, which summarizes the feature maps more compactly than a flatten layer and prevents overfitting by reducing the number of trainable parameters [43].
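The parameter saving can be seen in a small numerical sketch. The feature-map shape (16 × 16 × 64) and the three-unit dense layer are hypothetical values chosen for illustration, not the dimensions of the proposed model.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.random((16, 16, 64))      # hypothetical (H, W, C) output of the last conv layer

gap = feature_maps.mean(axis=(0, 1))         # global average pooling: one value per channel
flat = feature_maps.reshape(-1)              # flatten: every activation is kept

n_units = 3                                  # hypothetical dense layer with one unit per class
params_after_gap = gap.size * n_units + n_units        # weights + biases
params_after_flatten = flat.size * n_units + n_units

print(gap.shape, flat.shape)
print(params_after_gap, params_after_flatten)
```

Under these assumed shapes, the dense layer after global average pooling needs 195 parameters, compared with 49,155 after flattening.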

Experimental Details
We trained the self-supervised learning model using the training dataset, and the validation set was used for hyperparameter tuning. The evaluation metrics are the SSIM index and the mean squared error (MSE), the latter of which treats the self-supervised learning task as a regression task that predicts the pixel values of the original CXR images.
For self-supervised learning, the initial learning rate was 0.0001 with an Adam optimizer [44]. In the training procedure, the key to developing classification models is to prevent overfitting. Thus, we used data augmentation techniques to make the training data more varied [45]. In this experiment, we used three image data augmentation methods, namely flipping, zooming, and width shifting, to reduce the bias due to the characteristics of the CXR images. Flipping mirrors an image left to right, and zooming augments an image by zooming at a certain ratio. Width shifting is an augmentation method that can reduce the bias on the position of an object in the image by shifting the image by a certain distance. For the self-supervised learning task, the selected hyperparameters are the Adam optimizer with an initial learning rate of 0.0001. The learning rate decreases exponentially by a factor of 0.8 every 10 epochs after the 40th epoch, and the batch size is 16. Also, the L1 and L2 regularization value was 0.01 and the dropout ratio was 0.2. For the classification task, the selected hyperparameters are the Adam optimizer with an initial learning rate of 0.00001. The learning rate decreases exponentially by a factor of 0.8 after the 30th epoch, and the batch size is 32.
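The self-supervised learning rate schedule can be written as a small function. The sketch below is one possible reading of the description: it assumes epochs are counted from zero, that the rate stays at its initial value until epoch 49, and that the first multiplication by 0.8 happens at epoch 50. The function name and these boundary conventions are assumptions.

```python
import math

def step_decay_lr(epoch, base_lr=1e-4, start_epoch=40, interval=10, factor=0.8):
    """Multiply base_lr by `factor` once per completed `interval` epochs after `start_epoch`."""
    n_decays = max(0, (epoch - start_epoch) // interval)
    return base_lr * factor ** n_decays

for epoch in (0, 40, 50, 60, 70):
    print(epoch, step_decay_lr(epoch))
```

A function of this form could be passed to a per-epoch learning rate callback in the training framework; shifting `start_epoch` changes where the first decay step falls if a different reading of the schedule is intended.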
Along with structure-wise methods such as self-supervised learning and data augmentation, we used regularization, batch normalization, and dropout to manage the model. Regularization can improve the model performance and prevent overfitting by controlling the trainable weights and thus the model complexity [46]. Adding an L1 or L2 penalty to the loss function shrinks the weights during training. Furthermore, batch normalization is used to reduce the chance of extremely high or low values in the layers [47]. Additionally, dropout can improve model performance by randomly removing several node connections in the hidden layers [48]. The L1 and L2 regularization coefficients and the dropout ratio were carefully tuned according to the validation accuracy in our classification task.
In this experiment, we used an NVIDIA Quadro RTX 8000 under the Ubuntu 20 operating system. All neural networks were implemented using the Keras API [49]. In the overall classification task, we used an exponentially decaying learning rate, beginning with 0.00001, to prevent the training procedure from falling into a local minimum of the loss function. An Adam optimizer was used in all training procedures.
Figure 6 shows how well the modified Models Genesis with CBAM restores the transformed images. As the figures show, the output images of the self-supervised learning pretext task are almost exactly the same as the original images. On the NIH CXR dataset, the MSE is 0.0228 for the training set and 0.0234 for the validation set, and the SSIM index is 0.9132 for the training set and 0.9083 for the validation set. Organs such as the bones, lungs, and heart are restored very clearly, which implies that self-supervised learning using the modified Models Genesis pretext task can learn medical image features very well.
To evaluate the performance of our approach, the accuracy, specificity, sensitivity, AUC, and F1 score on the test dataset are used. Sensitivity denotes the ability of the model to correctly detect infected patients among all infected cases, and specificity denotes its ability to correctly identify non-infected people among all non-infected cases. These metrics are calculated using the following equations:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN},$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad F1 = \frac{2\,TP}{2\,TP + FP + FN},$$
where TP and TN are the numbers of correctly predicted images for positive and negative cases, respectively, and FP and FN are the numbers of incorrectly predicted images for positive and negative cases, respectively. Using TP, TN, FP, and FN, we plotted ROC curves and calculated the AUC to show the performance of the model at every threshold. The classification performances of our approach and of the other models are presented in Table 2.
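For a three-class problem, these quantities are usually computed per class in a one-vs-rest manner from the confusion matrix. The sketch below shows this calculation; the confusion matrix is a made-up example, not the result reported in this paper, and the function names are illustrative.

```python
import numpy as np

def per_class_metrics(cm, k):
    """One-vs-rest metrics for class k; rows of cm are true labels, columns are predictions."""
    cm = np.asarray(cm)
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Hypothetical confusion matrix for (normal, pneumonia, COVID-19)
cm = [[50, 2, 1],
      [3, 45, 2],
      [0, 1, 49]]

covid = per_class_metrics(cm, 2)
acc = overall_accuracy(cm)
print(covid, acc)
```

In this made-up example the COVID-19 class has TP = 49, FN = 1, FP = 3, and TN = 100, so its sensitivity is 49/50 and its specificity is 100/103.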

Experimental Results
From the table we can see that, overall, our model outperforms the baseline model and the other CNN-based models. In particular, our model scored 98.6% average accuracy, 0.996 specificity, 0.992 sensitivity, and 0.994 AUC in the COVID-19 case. Our model did not record the highest accuracy, sensitivity, and specificity in every class. However, because the values of sensitivity and specificity vary with the threshold level, it is desirable to compare classification performance using the overall accuracy, AUC, and F1 scores. Because our proposed system has the highest AUC, F1 score, and overall accuracy, it may be more powerful in diagnosing COVID-19 than the other CNN-based models presented here. Also, the training loss using categorical cross entropy was 0.0256, while the validation loss was 0.0287, which indicates that there is no evidence of overfitting. The confusion matrix in Figure 7 presents the test dataset results, showing the samples that are accurately classified by our proposed method. Furthermore, Figure 8 shows the ROC curves, which depict the excellent overall classification performance of our proposed model on the test dataset.

AI over RT-PCR Using CXR
Published studies have shown that a diagnosis system that uses images such as CXR and CT for COVID-19 has remarkable benefits over RT-PCR [50,51]. However, while CT takes longer to produce images and requires more expensive equipment, CXR is significantly cheaper and faster in producing information from the tests. Because of these benefits, extending the diagnosis system to CXR using AI will enhance the detection of COVID-19. Furthermore, developing systems that use AI to automatically analyze the images will reduce the time and cost of diagnosing COVID-19 and all kinds of pneumonia.
Score-CAM [14] can interpret which parts of images are important according to the predicted result by using gradient information flowing through the model from the last convolution layer to the input layer. Compared to other CAM-based approaches, Score-CAM eliminates the dependence on gradients by obtaining the weight of each activation map from its forward-pass score on the target class. Therefore, Score-CAM produces a linear combination of weights and activation maps and achieves better visual explanations and fair interpretations of the decision-making process. As our proposed model outperforms the other models considered in this paper, we employed Score-CAM to justify the model performance by confirming the activated regions of images that are important for making such decisions. Figure 9 shows the Score-CAM results for well-classified examples, and Figure 10 shows the misclassified samples of the test set with Score-CAM. In Figure 9, the activated gradient areas are mostly in the lung tissues, with a high degree of activation compared to other regions. It is known that pneumonia is a disease that affects lung tissues [52], and COVID-19 is one of those types [53]. Therefore, because our model focuses mainly on lung tissues, it can be verified that the model could accurately diagnose whether the patients have COVID-19, pneumonia, or neither.
Furthermore, the misclassified samples in the test dataset, shown in Figure 10, exhibit a significant difference that leads to misclassification. As with the samples in Figure 9, the activated gradient areas of both images in Figure 10 cover the lung tissue area, but there is also a letter 'R' on the left side of the images. The activated gradient maps in the misclassified examples show that the proposed model focused on this character. This indicates that the activated gradient area focuses on the wrong place because of the foreign matter in the images, thereby resulting in misclassification of normal images as COVID-19 and pneumonia, respectively. These foreign objects distract and hinder the model, which is trained to target the torso, as shown by the activated gradient region focusing on the external materials.
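The gradient-free weighting used by Score-CAM can be sketched with NumPy as follows. This is a simplified illustration, not the implementation used for Figures 9 and 10: the activation maps are assumed to be already upsampled to the image size, the baseline-score subtraction of the original method is omitted, and `score_fn` is a hypothetical stand-in for the classifier's target-class score.

```python
import numpy as np

def score_cam(image, activations, score_fn):
    """activations: (K, H, W) maps at image resolution; score_fn(img) -> target-class score."""
    K = activations.shape[0]
    scores = np.empty(K)
    for k in range(K):
        a = activations[k]
        span = a.max() - a.min()
        mask = (a - a.min()) / span if span > 0 else np.zeros_like(a)   # normalize to [0, 1]
        scores[k] = score_fn(image * mask)       # forward pass on the masked input
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the activation maps
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)  # ReLU
    return cam, weights

# Toy setup: a "classifier" that only responds to the central region of the image
image = np.ones((32, 32))
activations = np.zeros((2, 32, 32))
activations[0, 12:20, 12:20] = 1.0               # map highlighting the center
activations[1, 0:8, 0:8] = 1.0                   # map highlighting a corner
score_fn = lambda img: img[12:20, 12:20].mean()  # hypothetical target-class score

cam, weights = score_cam(image, activations, score_fn)
```

In this toy case the center map receives the larger weight because masking the image with it preserves the score. A map that highlights a region the classifier wrongly relies on, such as a letter on the image, would be weighted up in the same way.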

Comparison with Other Methods for COVID-19 Classification
Since the global outbreak of COVID-19, several new and modified deep learning models, such as CVDNet [54] and transfer learning of various pre-trained deep learning models [55], have been proposed for screening COVID-19 using AI. The recently published studies on diagnosing COVID-19 can be divided into three main categories: (1) Using well-known CNN-based models. Hassantabar et al. [56] performed detection and diagnosis using an MLP on fractal features and a CNN on CXR. To extract fractal features from CXR, images were first reshaped into 1-dimensional vectors, the covariance matrix was calculated, and its eigenvalues and eigenvectors were used for fractal feature extraction. The CNN architecture reached the higher accuracy, 93.2%. Khan et al. [57] proposed CoroNet using the Xception architecture pretrained on the ImageNet dataset. CoroNet classifies CXR images into four classes: COVID-19, bacterial pneumonia, viral pneumonia, and normal. This model achieved 89.6% overall accuracy, and 95% overall accuracy in the 3-class case of COVID-19, pneumonia, and normal. Ozturk et al. [58] aimed at detecting COVID-19 in its early stages. The authors implemented the Darknet architecture used in you only look once (YOLO) [59] to propose a real-time diagnosis method. The dataset used in that study consists of CXR images taken on the first day of patients infected with COVID-19. The accuracy was 98.08% for binary classification of COVID-19 and normal, and 87.02% for multi-class classification of COVID-19, pneumonia, and normal cases. Afshar et al. [60] used a Capsule Network based model to perform classification with a small dataset. The proposed framework, which consists of several Capsule and convolutional layers, was pretrained on ImageNet. After transfer learning with a COVID-19 dataset, the classification results reached 95.7%, 90%, 95.8%, and 0.97 for accuracy, sensitivity, specificity, and AUC, respectively.
(2) Using self-supervised learning for feature extraction. Sriram et al. [61] trained a model using self-supervised learning based on the momentum contrast method in pretraining to learn more general representations of CXR images. They used the pretrained model for the downstream tasks of single-image prediction, prediction of oxygen requirements greater than 6 L, and mortality prediction using a sequence of multiple images. The proposed model achieved AUCs of 0.742, 0.765, and 0.848 for the three downstream tasks, respectively.
(3) Using an optimization method. Goel et al. [62] aimed to classify COVID-19, normal, and pneumonia cases using CXR images. The authors proposed the Optimized Convolutional Neural Network (OptCoNet) for automatic diagnosis of COVID-19. The Grey Wolf Optimizer (GWO) algorithm was used to optimize the hyperparameters for training the CNN. The GWO algorithm selects the hyperparameters iteratively and evaluates the candidate solutions until the condition set by the researcher is met. Using GWO, the CNN-based model achieved 97.78%, 97.75%, and 96.25% for accuracy, sensitivity, and specificity, respectively. The comparison is summarized in Table 3.
Our proposed system differs from other studies in three important ways: (1) We proposed a system for diagnosing COVID-19 that avoids overfitting by using the self-supervised learning pretext task of Models Genesis with a large CXR dataset to extract features of CXR images well. The pretext task of Models Genesis enables the encoder and decoder to be trained better with a large CXR dataset, as confirmed by the restored images in Figure 6. (2) We used convolutional attention modules to enhance the diagnosis performance for COVID-19. As our proposed system has CBAM layers after every convolutional layer, both the pretext task and the classification task reached improved results in the SSIM index and the evaluation metrics.
(3) We investigated the cause of misclassification visually through the activated regions produced by Score-CAM. Therefore, our study covers both quantitative and qualitative aspects: quantitatively, we reached high classification accuracy; qualitatively, we confirmed that the misclassified samples result not from problems in the training process of our proposed system, but from foreign matter in the chest X-ray images.

Conclusions
In this paper, we proposed a novel deep learning system that contributes to screening COVID-19 efficiently. The proposed system has three main advantages. First, it uses self-supervised learning, so it can learn the features of CXR well from large amounts of CXR image data. Second, by applying the convolutional attention module in the self-supervised learning task, it can focus more on important features of the CXR images. Lastly, through a qualitative evaluation using Score-CAM, we identified the reasons for the misclassified cases. The pretrained encoder from the modified Models Genesis was combined with classification layers and fine-tuned using the COVID-19 dataset. Through extensive experiments, we showed that our proposed system performs better than other CNN-based classification models. The trainable parameters of the encoder and convolutional attention module in the pretext task aim to capture image representations precisely, not merely to expand numerical calculations. These lead to over 98% accuracy, with an AUC and F1 score similar to or higher than those of other models. Furthermore, visualizing the activated gradient areas using Score-CAM also verifies that our proposed model diagnoses the CXR images for the proper reason, namely by focusing on the lung tissues in CXR images.
There are some limitations of the proposed solution. Although we constructed balanced data to compare the proposed model with the baseline model proposed by Lee et al. [11], the CXR image data with three categories, normal, pneumonia and COVID-19, would be imbalanced with few COVID-19 samples in real life. Thus, the proposed solution may not perform as well as presented in this paper in that case. Furthermore, the COVID-19 data used in this paper is limited. Recent papers such as Das et al. [24] show that there are more COVID-19 data publicly available. Hence, if we could collect more data, the experimental results may be different.
For future research, one could investigate some implications of our proposed methods. As future work, we are considering the reason for the misclassification that appeared in our study. In our proposed method, it was confirmed through the activated gradient map from Score-CAM that foreign objects, such as various medical devices on the chest and the letter 'R' on the CXR image, interfered with the diagnosis and led to misclassification. Therefore, given that the U-shaped model is often used in segmentation tasks, it is possible to study a diagnosis methodology in which image regions segmented as medical devices or external materials do not affect the classification task.
Also, our model was considered only for CXR in this paper. There will be more applications in which the model is applied to chest CT data. Moreover, as more COVID-19 radiology data are collected, sufficient CXR images can enable the self-supervised learning model to capture the image representation better, thereby yielding higher classification accuracy. Furthermore, our proposed approach handles only 3 categories; it would be more helpful to radiologists in the real world if more categories and data were applied to our proposed model.