Explainable COVID-19 Detection on Chest X-rays Using an End-to-End Deep Convolutional Neural Network Architecture

Abstract: The coronavirus pandemic is spreading around the world. Medical imaging modalities such as radiography play an important role in the fight against COVID-19. Deep learning (DL) techniques have improved medical imaging tools and help radiologists make clinical decisions for the diagnosis, monitoring and prognosis of different diseases. Computer-Aided Diagnostic (CAD) systems can improve work efficiency by precisely delineating infections in chest X-ray (CXR) images, thus facilitating subsequent quantification. CAD can also help automate the scanning process and reshape the workflow with minimal patient contact, providing the best protection for imaging technicians. The objective of this study is to develop a deep learning algorithm to detect COVID-19, pneumonia and normal cases on CXR images. We propose two classification problems: (i) a binary classification to classify COVID-19 and normal cases and (ii) a multi-class classification for COVID-19, pneumonia and normal. Nine datasets and more than 3200 COVID-19 CXR images are used to assess the efficiency of the proposed technique. The model is trained on a subset of the National Institutes of Health (NIH) dataset using swish activation, thus improving the training accuracy to detect COVID-19 and other pneumonia. The models are tested on eight merged datasets and on individual test sets in order to confirm the degree of generalization of the proposed algorithms. An explainability algorithm is also developed to visually show the location of the lung-infected areas detected by the model. Moreover, we provide a detailed analysis of the misclassified images. The models achieve an Area Under Curve (AUC) of 0.97 for multi-class classification (COVID-19 vs. other pneumonia vs. normal) and 0.98 for the binary model (COVID-19 vs. normal). The average sensitivity and specificity are 0.97 and 0.98, respectively. The sensitivity of the COVID-19 class reaches 0.99. The results outperform comparable state-of-the-art models for the detection of COVID-19 on CXR images. The explainability model shows that our model is able to efficiently identify the signs of COVID-19.


Introduction
Since its appearance at the end of 2019, the coronavirus pandemic, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread worldwide, infecting hundreds of millions of people and causing millions of deaths [1]. The World Health Organization (WHO) deemed the outbreak a public health emergency of international concern, which raised significant public health concerns in the international community, and on 11 March 2020 it was declared a pandemic [2,3].
The reverse transcription polymerase chain reaction (RT-PCR) test serves as the gold standard to test patients for COVID-19 [4]. However, RT-PCR appears to be inadequate images for the training set. Data augmentation was applied to the training set in order to increase the number of COVID-19 samples to 420. SqueezeNet networks achieved a sensitivity of 0.98 and a specificity of 0.90, and an average AUC of 0.986 was obtained for ResNet18, ResNet50, SqueezeNet and DenseNet-121.
All these studies report very high scores, which can be attributed to several factors: the low number of images in the datasets, the optimization of the hyper-parameters applied to the CNN, and the performance of the CNN architecture itself. However, there is a lack of research on the degree of generalization of deep learning algorithms on heterogeneous datasets. The degree of generalization is important in the medical field, because CXR images differ in image quality, resolution, lighting uniformity, contrast, orientation, etc. The main contribution of this study is a deep neural network model that detects COVID-19, pneumonia and normal cases using several heterogeneous datasets with different image quality, projection, resolution, lung capture, etc. Since COVID-19 is a pneumonia, another challenge is adapting the network to distinguish between the COVID-19 class and other pneumonia (the signs of the two categories are very similar). Moreover, we did not apply pre-processing or data augmentation techniques to the CXR images; we kept the original quality.
The model used in this study is a fine-tuned EfficientNet-B5, which was selected based on its performance in ImageNet [28] classification. The EfficientNet family consists of deep convolutional neural networks (DCNN) whose dimensions (width, depth and resolution) are scaled in a balanced way. The model was tested on two scenarios using CXR images: (i) detecting COVID-19 vs. pneumonia vs. normal cases (DeepCCXR-Multi) and (ii) a binary model for COVID-19 vs. normal (DeepCCXR-Bin). During training and testing, several datasets were employed: RSNA [29], CIDC from GitHub [30], BIMCV COVID19+ [31], CHEST X-RAY IMAGES PNEUMONIA (CXRIP) from Kaggle [32], MONTGOMERY [33], SHENZHEN [33] and NIH [34], which are publicly available, and a proprietary dataset from Montfort hospital (Ottawa, ON, Canada) [35]. Testing with numerous datasets allows the generalization performance of the proposed algorithm to be benchmarked under conditions closer to real life [36]. We compare our findings to current techniques and show that the suggested method achieves a higher COVID-19 detection score across several datasets. The performance of the proposed model was validated using accuracy, AUC, sensitivity and specificity. An explainability model was developed and adapted to visualize the signs of COVID-19 and pneumonia detected by the model.

Proposed Approach
In this work, we propose a robust system for detecting COVID-19, pneumonia and normal cases (multi-class problem) and COVID-19 and normal cases (binary problem) on several datasets using CXR images. The system is based on the fine-tuning of a recent CNN called EfficientNet-B5. An explainability algorithm is also developed to visually show the infected regions identified by the network. The proposed architecture, called DeepCCXR (Deep COVID-19 CXR detection), is illustrated in Figure 2.

Deep Learning Model
Tan and Le [24] investigated the relationship between the width, depth and input resolution of CNN models and devised a method for creating CNN models with fewer parameters and higher classification accuracy. The authors called the networks EfficientNet and proposed eight models (EfficientNet-B0 to EfficientNet-B7). They tested the performance of the networks on the ImageNet dataset [28] and showed that the EfficientNet networks surpassed previous CNN models in terms of accuracy.
The EfficientNet network is based on a novel scaling strategy for CNN models that relies on a simple yet effective compound coefficient. Unlike existing approaches that arbitrarily scale individual network dimensions such as width, depth and resolution, EfficientNet uniformly scales each dimension with a fixed set of scaling factors. Scaling individual dimensions increases model performance in practice, but balancing all network dimensions in relation to available resources significantly enhances overall performance.
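To make the compound coefficient more concrete, the following minimal sketch scales depth, width and resolution together from a single coefficient phi. The base values 1.2, 1.1 and 1.15 are the ones reported by Tan and Le [24] for the EfficientNet-B0 baseline; they are quoted here as an assumption and are not given in this paper, and the function names are hypothetical.

```python
# Compound scaling sketch. ALPHA, BETA and GAMMA are the base coefficients
# reported by Tan and Le [24] for EfficientNet-B0 (assumed, not from this paper).
ALPHA = 1.2   # base of the depth multiplier
BETA = 1.1    # base of the width multiplier
GAMMA = 1.15  # base of the resolution multiplier


def compound_scale(phi: float) -> dict:
    """Return the depth, width and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Approximate growth in FLOPS: depth * width^2 * resolution^2."""
    scale = compound_scale(phi)
    return scale["depth"] * scale["width"] ** 2 * scale["resolution"] ** 2


if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

With these base values, each unit increase of phi roughly doubles the FLOPS, which is the budget constraint used to choose the three factors.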
The primary building block of the EfficientNet model family is the mobile inverted bottleneck convolution (MBConv). The MobileNet models [37] provided the inspiration for MBConv. One of the main ideas is the use of depthwise separable convolutions, which apply a depthwise convolution layer followed by a pointwise convolution layer. Then, from MobileNetV2 (a second upgraded version of MobileNet), two further ideas are borrowed: (1) inverted residual connections and (2) linear bottlenecks. Figure 3 presents an illustration of inverted residual blocks. In the standard residual block, the skip connections exist between layers with a large number of channels (64 in Figure 3a).
The number of channels in the residual block is lowered or compressed to 16, reducing the number of parameters required by the next layer's 3 × 3 convolutions. The sizes of the connected channels are inverted in the inverted residual block illustrated in Figure 3b, so the skip connections now take place between narrower layers with fewer channels. This explains the name "inverted residual block". Because we employ depthwise convolutions, even if the number of channels in the layer inside the block increases to 64, the number of parameters is actually lower than in the original ResNet residual block. The second concept in MobileNetV2 is linear bottlenecks, which means that a linear activation function is used for the layer shown in red in Figure 3b. Because the number of channels is reduced at this point of the network, this layer is referred to as a bottleneck layer. According to the authors of MobileNetV2 [37], the ReLU activation function, which is often employed in CNN architectures, does not work well with inverted residual blocks since it discards values less than zero. The layer with reduced channels (bottleneck layer) performed better when using a linear activation function.
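The parameter saving described above can be checked with a short count of convolution weights. This is a rough sketch under stated assumptions: biases and batch normalization are ignored, the channel counts 64 and 16 are taken from the description of Figure 3, and the exact layer layout of each block (1 × 1, 3 × 3, 1 × 1) is an assumption rather than something this paper specifies.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_params(k: int, channels: int) -> int:
    """Weights of a k x k depthwise convolution: one filter per channel."""
    return k * k * channels


def residual_bottleneck_params(wide: int = 64, narrow: int = 16) -> int:
    """Classic block: wide -> narrow (1x1), 3x3 conv, narrow -> wide (1x1)."""
    return (conv_params(1, wide, narrow)
            + conv_params(3, narrow, narrow)
            + conv_params(1, narrow, wide))


def inverted_residual_params(wide: int = 64, narrow: int = 16) -> int:
    """Inverted block: narrow -> wide (1x1), 3x3 depthwise, wide -> narrow (1x1)."""
    return (conv_params(1, narrow, wide)
            + depthwise_params(3, wide)
            + conv_params(1, wide, narrow))


if __name__ == "__main__":
    print("residual bottleneck:", residual_bottleneck_params())
    print("inverted residual:  ", inverted_residual_params())
```

Under these assumptions the inverted block needs fewer weights even though its 3 × 3 layer operates on 64 channels, because the depthwise convolution uses a single filter per channel.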
In addition, instead of using the ReLU activation function, our proposed network utilizes a more recent activation function called Swish [38]. Swish is comparable in shape to the ReLU and LeakyReLU functions and so shares some of their performance advantages, but it is smoother than both. The Swish activation is defined as:
$$f_{\text{Swish}}(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$
where β ≥ 0 is a parameter that can be learned during training of the CNN model. Note that if β = 0, f_Swish becomes the scaled linear function f(x) = x/2, and as β → ∞, f_Swish looks more and more like the ReLU function, except that it is smoother [38].
The success of the previously described model scaling idea is highly dependent on the baseline network. For this reason, the automatic machine learning (AutoML) MNAS framework is used to generate a new baseline network by automatically searching for a CNN model that optimizes both accuracy and efficiency (in FLOPS). EfficientNet-B0 is the name of this baseline network, and its main architecture is shown in Table 1.
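As an illustration of the Swish activation discussed above, the following sketch implements it with NumPy and shows its two limiting cases. The function names are hypothetical, and the tanh form of the sigmoid is only a numerical convenience, not something taken from [38].

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Numerically stable logistic function, written with tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def swish(x, beta: float = 1.0) -> np.ndarray:
    """Swish activation: x * sigmoid(beta * x)."""
    x = np.asarray(x, dtype=np.float64)
    return x * sigmoid(beta * x)


if __name__ == "__main__":
    x = np.linspace(-3.0, 3.0, 7)
    print("beta = 0  :", swish(x, beta=0.0))   # scaled linear, x / 2
    print("beta = 1  :", swish(x, beta=1.0))
    print("beta = 50 :", swish(x, beta=50.0))  # close to ReLU
```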
The first thing to notice is that this baseline model is made up of MBConv1, MBConv3 and MBConv6 blocks that are repeated multiple times. MBConv blocks come in a variety of shapes and sizes. The second point is that the number of channels varies from block to block, according to the number of filters. The third observation is the presence of inverted residual connections between the model's narrow layers.
The authors in [24] also added the squeeze-and-excitation (SE) technique to the MBConv blocks, which helps to increase performance even further.
Recall that the number of filters of a convolutional layer determines the number of channels it produces. In most cases, subsequent operations give these channels equal weight. The SE block is a strategy that weights each channel differently rather than treating them all equally.
The SE block outputs a tensor of shape (1 × 1 × channels) that specifies the weight of each channel; these weights, like other parameters, are learned during training. Finally, Figure 4 shows an example of an MBConv block that accepts a feature map of size (128 × 128 × 40) as input and incorporates all of the preceding principles: (1) depthwise separable convolutions, (2) inverted residual blocks, (3) linear bottlenecks, (4) Swish activation functions and (5) the squeeze-and-excitation block. In general, EfficientNet models outperform existing CNNs such as AlexNet, DenseNet and GoogleNet in terms of accuracy and efficiency and have been widely used in the medical field [39,40,41].
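The channel reweighting performed by the SE block can be sketched in a few lines of NumPy. This is an illustrative forward pass only: the weights are random rather than learned, biases are omitted, the reduced dimension of 10 units is a hypothetical choice, and the function name is not from any library.

```python
import numpy as np


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Reweight the channels of an (H, W, C) feature map.

    w_reduce has shape (C, C_reduced) and w_expand has shape (C_reduced, C);
    they stand for the two fully connected layers of the block.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = feature_map.mean(axis=(0, 1))                # shape (C,)
    # Excitation: bottleneck with Swish, then a sigmoid gate in (0, 1).
    hidden = squeezed @ w_reduce
    hidden = hidden / (1.0 + np.exp(-hidden))               # Swish, beta = 1
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w_expand)))    # shape (C,)
    # Scale: broadcast the per-channel weights over H and W.
    return feature_map * weights, weights


rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128, 40))      # feature map size from Figure 4
w1 = rng.standard_normal((40, 10)) * 0.1     # hypothetical reduction to 10 units
w2 = rng.standard_normal((10, 40)) * 0.1
y, channel_weights = squeeze_excite(x, w1, w2)
print(y.shape, channel_weights.shape)
```

The output keeps the shape of the input; only the relative magnitude of each channel changes.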
Finally, to increase accuracy and avoid overfitting, we added a Global Average Pooling (GAP) layer after the final convolution layer of the network. Following GAP, we added a dense layer of size 1024 and a 50% Dropout. After the dense layer, we added a Softmax layer with three neurons to give the probability prediction scores for the three classes. We kept the same architecture for the binary classification, and only the Softmax layer was changed to give the probability prediction scores for COVID-19 vs. normal.
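The classification head described above (GAP, dense layer of 1024 units, 50% dropout, softmax) can be sketched as a plain NumPy forward pass. Several details are assumptions made only for illustration: the ReLU activation of the dense layer, the 16 × 16 × 2048 shape of the backbone output for a 512 × 512 input, and the random weights. In Keras the same head would typically be assembled from the `GlobalAveragePooling2D`, `Dense` and `Dropout` layers.

```python
import numpy as np


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def classification_head(features, w_dense, b_dense, w_out, b_out,
                        dropout_rate=0.5, training=False, rng=None):
    """GAP -> Dense(1024) -> Dropout -> Softmax over the classes."""
    pooled = features.mean(axis=(0, 1))                    # GAP: (H, W, C) -> (C,)
    hidden = np.maximum(pooled @ w_dense + b_dense, 0.0)   # ReLU is an assumption
    if training:
        # Inverted dropout: drop units at random and rescale the survivors.
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)
    return softmax(hidden @ w_out + b_out)


rng = np.random.default_rng(0)
features = rng.standard_normal((16, 16, 2048))   # assumed backbone output shape
w_dense = rng.standard_normal((2048, 1024)) * 0.01
w_out = rng.standard_normal((1024, 3)) * 0.01    # 3 classes; 2 for the binary model
probs = classification_head(features, w_dense, np.zeros(1024), w_out, np.zeros(3))
print(probs)
```

Dropout is only active when `training=True`; at inference the head is deterministic and returns one probability per class.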

Datasets
In this work, we used nine datasets for training and testing, which are presented in the following subsections.

COVID-19 Image Data Collection (CIDC)
Cohen et al. [30] provided an open dataset of CXR and CT scan images of patients who are positive or suspected of COVID-19 or other viral and bacterial pneumonias (MERS, SARS and ARDS). The images were primarily scraped from medical websites that collect publicly available COVID-19 CXR images from hospitals and clinicians. The dataset contains 654 COVID-19 CXR images gathered from various sources. Its purpose is to support the development of artificial intelligence-based tools to predict and understand the disease. COVID-19 CXR images from this dataset are shown in Figure 5.

COVID-19 Radiography
The COVID-19 RADIOGRAPHY [42] database contains 219 COVID-19 positive CXR images. It is available on Kaggle [42] and was created by a team of researchers from Qatar University (Doha, Qatar), the University of Dhaka (Bangladesh) and their collaborators from Pakistan and Malaysia, with the help of various medical doctors. Figure 6 shows examples of images from this dataset.

BIMCV COVID19+
BIMCV COVID19+ [31] is a large dataset of CXR and computed tomography (CT) images of COVID-19 positive patients, together with their radiography findings, pathologies, polymerase chain reaction (PCR) test results, immunoglobulin G (IgG) and immunoglobulin M (IgM) diagnostic antibody tests and radiography reports, from the Valencia Region Medical Image Bank (BIMCV). The images are annotated by a team of expert radiologists and stored in high resolution. Moreover, extensive information is provided, including the patient's demographic information, type of projection (PA-AP) and acquisition parameters for the imaging study, among others. The database includes 1380 CXR, 885 DX (Digital X-ray) and 163 computed tomography images. Figure 7 shows examples of BIMCV COVID19+ CXR images.

RSNA
The RSNA [29] dataset is a dataset of CXR images with patient metadata. It was provided for a Kaggle challenge by the US National Institutes of Health Clinical Center and is available on the Kaggle competition website [43]. It contains 26,684 CXR images from unique patients, and each image is labeled with one of three classes from the associated radiology reports: 'Normal', 'No Lung Opacity/Not Normal' and 'Lung Opacity'. Figure 8 shows image examples from the RSNA dataset.

Chest X-ray Images Pneumonia (CXRIP)
Chest X-ray images (anterior-posterior) [32] were selected from retrospective cohorts of pediatric patients of one to five years old from Guangzhou Women and Children's Medical Center (Guangzhou, China). All CXR imaging was performed as part of patients' routine clinical care. All CXR were initially screened for quality control by removing all low-quality images. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. The dataset is partitioned into three folders for training, testing and validation and contains sub-folders for each image category (pneumonia or normal). The dataset contains 5863 CXR images and two classes, pneumonia and normal. Figure 9 shows examples of CXR images from patients with pneumonia and normal cases.

Montgomery County X-ray
The MONTGOMERY County CXR dataset [33] was acquired from the tuberculosis control program of the Department of Health and Human Services of MONTGOMERY County (Rockville, MD, USA). The dataset contains 138 posterior-anterior CXR, of which 80 are normal and 58 are abnormal with manifestations of tuberculosis. All CXR images are de-identified and available in DICOM format. The dataset includes radiology readings available as a text file. Figure 10 shows example images from the MONTGOMERY dataset.

Shenzhen Hospital X-ray
The SHENZHEN [33] Hospital X-ray dataset was collected by SHENZHEN Hospital (Shenzhen, Guangdong, China). The CXR were acquired as part of the routine care at SHENZHEN Hospital. The dataset contains 326 normal CXR and 336 abnormal CXR showing various manifestations of tuberculosis. Figure 11 shows some CXR images from the SHENZHEN dataset.

National Institutes of Health (NIH)
The NIH [34] Chest X-ray dataset comprises 112,120 CXR images with disease labels from 30,805 unique patients. This dataset was obtained from the National Institutes of Health (Bethesda, MD, USA). There are 15 classes in the dataset (14 diseases and one for 'No findings'). The 14 disease classes are infiltration, edema, atelectasis, pneumothorax, consolidation, emphysema, effusion, fibrosis, pneumonia, cardiomegaly, pleural thickening, mass, nodule and hernia. Expert physicians assigned grades to the CXR images. Figure 12 shows example images from the NIH dataset.

Montfort Dataset
In addition to the above datasets, we collected more images in collaboration with health professionals from Montfort hospital (Ottawa, ON, Canada) and built the Montfort dataset [35]. This dataset contains 236 CXR images, with 150 COVID-19, 29 pneumonia (other than COVID-19) and 57 normal patients (no findings). Radiology reports and RT-PCR testing are used to label CXR images.

Data Distribution for Multi-Class and Binary Models
To train DeepCCXR-Multi (COVID-19 vs. other pneumonia vs. normal), we used the NIH dataset for the pneumonia and normal sets, with 14,226 CXR images (8551 for normal and 5675 for other pneumonia). To build the COVID-19 set, we combined COVID-19 positive CXR images from multiple datasets (Montfort Dataset, BIMCV COVID19+, CIDC and COVID-19 RADIOGRAPHY dataset), leading to a total of 3288 COVID-19 positive CXR images. We used 2060 images for training and 1228 for testing. For the test set of pneumonia and normal cases, we kept samples from each dataset (CXRIP, RSNA, NIH, SHENZHEN and MONTGOMERY) in order to validate the generalization performance of our model on all these datasets (since the quality and the technology used are different). Our test set comprises 1128 normal, 1228 COVID-19 and 1072 pneumonia images.
For DeepCCXR-Bin, we used 10,611 training images: 2060 COVID-19 images (the same as for DeepCCXR-Multi) and 8551 normal images from the NIH dataset. The test set contains 1128 normal cases and 1228 COVID-19 cases. We divided the datasets manually, because a random division may give poor results according to [20]. Figure 13 gives an overview of the dataset distribution. For all datasets used in this study, we kept the original image quality, without applying any preprocessing. As shown in Figure 14, images in the CIDC dataset have unbalanced colors (blue and gray). Figure 15 shows that the captured lung region sometimes includes the abdomen. Furthermore, the images differ in orientation, as shown in Figure 16. Moreover, we kept the CXR images with poor quality, as shown in Figure 17.

Training Parameters
The models were developed using the Keras (TensorFlow) library [44], and training was conducted on Microsoft Azure servers [45]. An optimization function is needed to optimize the learning parameters. Because different optimizers have distinct effects on parameter training, we studied the effects of SGD [46] and Adam [47] on model performance. Multiple comparative experiments were conducted under the same settings. SGD was found to be much superior to Adam in terms of convergence and training time. When we used Adam as the optimizer, the gradient of each sample was modified every time, which increased the noise: each iteration does not move in the same direction as the overall optimization, and training may converge to a local minimum, lowering accuracy. We used 200 epochs for training and a batch size of 16. All CXR images were resized to 512 × 512. We used the pre-trained model with ImageNet weights. To mitigate the class imbalance issue, we applied the class-weight approach [48] during model training on the NIH, CIDC and BIMCV datasets. The technique directly accounts for the asymmetry of misclassification costs. Hyperparameter optimization was conducted on the validation set, and the best results were kept for testing. Table 2 summarizes the hyperparameters used in our deep learning models.
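To illustrate the class-weight idea, the sketch below computes one weight per class from the training counts reported for DeepCCXR-Multi. The inverse-frequency formula N / (K · n_c) is a common heuristic and is an assumption here: the exact weighting scheme of [48] used by the authors is not stated in this paper, and the function name is hypothetical.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Weight each class by N / (K * n_c), so that rare classes count more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Training counts reported for DeepCCXR-Multi in the data distribution section.
train_counts = {"normal": 8551, "pneumonia": 5675, "covid19": 2060}
weights = balanced_class_weights(train_counts)
for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

With these counts the COVID-19 class receives the largest weight, so each of its samples contributes more to the loss. In Keras, such a dictionary (keyed by class index rather than name) can be passed to the `class_weight` argument of `Model.fit`.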

Metrics
For this work, we used the following metrics: accuracy (ACC), sensitivity (SN), specificity (SP) and area under curve (AUC) [9]. SN and SP measure how well the proposed approach identifies positive and negative cases, respectively. The AUC, computed from the ROC curve, is a performance measure widely used for medical classification problems to highlight the trade-off between good and bad classifications by the model. In medical applications and published papers about COVID-19, the main metrics used are sensitivity and specificity; we therefore report them so that our work can be compared with previously published work. These metrics are defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$
where TP means true positives for the samples correctly classified as positive, FN means false negatives for the samples incorrectly classified as negative, FP means false positives for the samples incorrectly classified as positive and TN means true negatives for the samples correctly classified as negative.
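The metrics of this section can be computed directly from the confusion counts, as in the minimal sketch below. The function names are hypothetical, and the AUC is computed with the pairwise ranking interpretation (the probability that a positive sample receives a higher score than a negative one), which is a standard equivalent of the area under the ROC curve rather than necessarily the routine used by the authors.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of actual negatives that are correctly rejected."""
    return tn / (tn + fp)


def auc(pos_scores, neg_scores) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # COVID-19 class of DeepCCXR-Multi: 11 of 1228 test images misclassified.
    print(f"COVID-19 sensitivity: {sensitivity(tp=1217, fn=11):.3f}")
```

The usage example takes its counts from the Results section, where 11 of the 1228 COVID-19 test images are reported as misclassified by DeepCCXR-Multi.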

Results
The proposed fine-tuned EfficientNet-B5 gives good results for DeepCCXR-Multi (COVID-19 vs. other pneumonia vs. normal). The model achieved an average SP and SN of 0.97 and 0.94, respectively. The obtained AUC was 0.973 on the 3428 CXR images (normal, COVID-19 and pneumonia) used in the test set. The details of the obtained SN and SP for each class are given in Table 3. DeepCCXR-Bin also achieves high scores, with an AUC of 0.985, an average SP of 0.94 and an SN of 0.98. Figure 18a,b shows the ROC curves for both models. The confusion matrices for DeepCCXR-Multi and DeepCCXR-Bin are given in Figure 19a,b. Only 11 of the 1228 COVID-19 images were misclassified by DeepCCXR-Multi, and only 19 of the 1228 images were misclassified by DeepCCXR-Bin, which shows the high performance obtained by our models.

Explainability
To understand how the model learned to detect the signs of pneumonia pathology, including COVID-19 signs, we developed an explainability algorithm based on Gradient-weighted Class Activation Mapping (Grad-CAM) [49]. This algorithm provides a visual output of the most relevant areas found by the proposed CNN models. Grad-CAM uses the gradients of any target class flowing into the final convolutional layer to produce a coarse localization map that highlights the regions of the image that are important for the prediction. The proposed technique blends Grad-CAM with fine-grained visualizations to construct a high-resolution class-discriminative visualization [50,51,52]. Figure 20 shows samples of TP and TN using Grad-CAM to localize the signs of COVID-19, pneumonia and normal regions on CXR images. In the TP cases of COVID-19 from the CIDC dataset shown in Figure 20a,b, the heatmap localizes the signs in the lungs. Similarly, Figure 20c,d shows true positive cases from the COVID-19 RADIOGRAPHY dataset; the green color in the lungs indicates an abnormality detected by the model, which classifies the images as COVID-19. Our model identified the dense homogeneous opacity regions as the most significant signs of COVID-19, which correlates well with radiology findings in COVID-19 medical research studies [53].
Other samples of TP cases of pneumonia are presented in Figure 20e,f for the RSNA dataset. The heatmap localizes the opacity in the lungs, and despite the poor quality and the bad projection of the lungs, the model classified the images correctly. Similarly, Figure 20g,h shows positive cases of pneumonia in the NIH dataset, where the heatmap focuses on the opacity area in the lungs.
The TN cases for the MONTGOMERY dataset are presented in Figure 20k,l. Generally, the heatmap focuses on something outside of the lungs or near the heart to distinguish between the normal and the other cases.
This explains the efficiency of the proposed deep learning approach in detecting COVID-19 and pneumonia signs and its high performance in COVID-19 classification.
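The core Grad-CAM computation described in this section can be sketched with NumPy, assuming the activations of the final convolutional layer and the gradients of the target class score with respect to them have already been extracted from the network. This is a minimal illustration with a hypothetical function name; it omits the upsampling of the map to the input resolution and the fine-grained visualization that the proposed technique adds on top.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Coarse Grad-CAM map from last-conv activations and class-score gradients.

    Both arrays have shape (H, W, K) for one image; gradients holds
    d(score of the target class) / d(activations).
    """
    # Channel importance: global average of the gradients over H and W.
    alphas = gradients.mean(axis=(0, 1))                        # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.maximum((activations * alphas).sum(axis=-1), 0.0)  # shape (H, W)
    # Normalize to [0, 1] so the map can be rendered as a heatmap.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 16, 2048))    # assumed last-conv output shape
grads = rng.standard_normal((16, 16, 2048))   # random stand-in for real gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape, float(heatmap.min()), float(heatmap.max()))
```

In practice the resulting map is resized to the CXR resolution and overlaid on the image to obtain heatmaps such as those discussed for Figure 20.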

Performance Comparison
In Table 4, the performance of COVID-19 detection is compared to that of contemporary approaches. We can see that DeepCCXR outperforms most of the recently published work for detecting COVID-19 on CXR images. In addition, the number of COVID-19 images used in this study is higher than in other studies, which supports the degree of generalization of the proposed model for detecting COVID-19. The DeepCCXR model improves on our previous model [54], which was built using a small number of COVID-19 positive CXR images (192 images). Based on ResNet50, our prior model had an AUC of 0.97, an SP of 0.96 and an SN of 0.95.
DeepCCXR obtained the same scores as Minaee et al. [26] in terms of AUC and sensitivity, while our proposed models achieve a higher specificity. However, the work of Minaee et al. [26] used only 203 COVID-19 images, which does not help in evaluating the generalization of the algorithm. Moreover, this number represents only about 4% of a small dataset (203 vs. 4797), which can make the model sensitive to overfitting and bias the results.
The only work using a large number of COVID-19 images was published recently by Wehbe et al. [20], and our results outperform those reported in that work. The authors tested their models on a number of COVID-19 images close to ours (1192 in Wehbe et al. [20] vs. 1228 in this work). According to their paper, their model surpassed radiologists, which suggests that our model may perform even better when compared to a manual interpretation by radiologists. This shows that DeepCCXR is robust and able to detect COVID-19 with a high sensitivity (0.99). In addition, our explainability (EXP) model shows a very precise localization of the COVID-19 signs and can be used as a CAD tool to further help physicians in their diagnosis.

Individual Tests
To confirm the degree of generalization, we conducted individual tests with each of the nine datasets.

DeepCCXR-Bin for Individual Datasets
We validated the DeepCCXR-Bin model on the nine datasets separately. We removed the pneumonia class and kept the normal and COVID-19 classes.
The results of each dataset are presented in Table 5. Very high AUC scores of 0.999, 0.997 and 0.999 were obtained on the CXRIP, RSNA and MONTGOMERY datasets, respectively. DeepCCXR-Bin gives a high AUC score of 0.986 on the NIH and 0.978 on the SHENZHEN datasets. Scores of 0.965 and 0.961 were obtained on BIMCV COVID19+ and COVID-19 RADIOGRAPHY. SP and SN are equal to 0.99 on RSNA, MONTGOMERY and CXRIP. This shows the degree of generalization of our binary model (DeepCCXR-Bin) on heterogeneous data. The confusion matrices for the nine datasets are shown in Figure 21. Only 74 of the 654 COVID-19 CXR images were misclassified in the CIDC dataset (see Figure 21b), and only six of the 218 COVID-19 CXR images were misclassified in the COVID-19 RADIOGRAPHY dataset (see Figure 21f). Figure 21c represents the confusion matrix for CXRIP: this dataset contains 1435 normal cases, and only 15 were misclassified. An SP of 1 was obtained on the MONTGOMERY dataset because all 33 patients were classified correctly (see Figure 21d). The Montfort dataset obtained a good classification, with only five patients from each category misclassified (see Figure 21i). Figure 22 shows the ROC curves for detecting COVID-19 on the nine datasets. The curve of CXRIP is the highest, with an AUC of 0.999, followed by the curve of RSNA, which obtained 0.997. The lowest is the curve of CIDC, which gives an AUC of 0.945.

DeepCCXR-Multi for Individual Datasets
Similar to DeepCCXR-Bin, we tested DeepCCXR-Multi on individual datasets. The results of each dataset are presented in Table 6. A high score was obtained for CXRIP, with an AUC of 0.994. The CIDC, MONTGOMERY and SHENZHEN datasets obtained AUCs of 0.987, 0.988 and 0.981, respectively. RSNA obtained an AUC score of 0.959 using 2790 CXR images. An AUC of 0.908 was obtained on the NIH dataset using 4389 CXR images (1228 for COVID-19, 1663 for normal and 1498 for pneumonia). The AUC score for COVID-19 RADIOGRAPHY is 0.849, with an SN of 0.92 and an SP of 0.92. The Montfort dataset obtained an AUC score of 0.801 because it contains only a few CXR images in the pneumonia class (24 CXR images) and 58 CXR images in the normal class.
The ROC curves of DeepCCXR-Multi for detecting COVID-19 in nine datasets are presented in Figure 23. As we can see, the curves of CIDC, CXRIP, SHENZHEN and MONTGOMERY are the highest because the AUC score is between 0.980 and 0.999, followed by the curve of RSNA dataset with an AUC of 0.950.
The confusion matrices for the multi-class classification are presented in Figure 24. Figure 24b gives the confusion matrix for the CIDC dataset: the model achieves good classification results, since 358 of the 369 CXR images in the pneumonia class and 944 CXR images in the normal class are correctly classified. For the CXRIP dataset, the model also gives a good classification of the three classes: only four patients of 1435 are misclassified, and only 129 of 2123 in the pneumonia class are misclassified (see Figure 24c). The confusion matrix for Montfort shows that the COVID-19 patients are correctly classified (see Figure 24g). This supports the robustness of our model in the classification of three classes.

Model Limitations
Despite the good results obtained, both models have some limitations in classifying CXR images correctly; this sometimes happens in the presence of poor quality images which contain artifacts and noise resembling opacity in the lungs. Figure 25a,b shows examples of FP cases in the RSNA dataset: the heatmap localizes an opacity in the right lung, which is probably just noise in the CXR images, and this leads the model to classify these images as COVID-19 instead of normal. The same can be said for Figure 25c in the SHENZHEN dataset, where the models classified the normal case as pneumonia. An example of FN is presented in Figure 25d from the RSNA dataset: the model detects this image as normal because the lungs appear clear at the top, which is similar to the normal cases.
Some CXR images also contain cables, text and medical objects which are captured with the thorax. This makes the detection of the region of interest and disease signs more difficult for the model. Figure 25e shows an example of FP in the RSNA dataset: the image is labeled normal by the authors of the dataset but is classified as COVID-19 by the deep model, and the opacity in the top left lung leads to this error (same in Figure 25f).

Conclusions
In this work, we developed DeepCCXR, a deep convolutional neural network (CNN) architecture for COVID-19 detection on chest X-ray images. The proposed model is based on a recent architecture called EfficientNet-B5. DeepCCXR was fine-tuned to detect COVID-19 using 2060 COVID-19 positive CXR images for training and 1228 for testing.
The obtained results show that our model outperforms recent deep learning approaches for COVID-19 detection on CXR. The model achieved high AUC scores of 0.973 and 0.986 for DeepCCXR-Multi and DeepCCXR-Bin, respectively. On average, the sensitivity reaches 0.97 and the specificity around 0.98. For the COVID-19 class, we obtained a sensitivity of 0.99 (tested on 1228 COVID-19 positive images), meaning that we have a better performance in measuring the proportion of actual positives that are correctly identified as such (e.g., the percentage of people who are correctly identified as having COVID-19). An explainability algorithm was also developed and showed that DeepCCXR is efficient in identifying the most important pathology regions (the signs of COVID-19).
Previously published works used a limited number of COVID-19 images for testing. The large number of test images in our work shows that DeepCCXR is robust in detecting COVID-19 and other pneumonia cases. Moreover, multiple datasets with varying quality were used in this work, which confirms the good degree of generalization of our model.
The proposed technique is an interesting contribution in the development of a CAD system able to detect COVID-19 and other pneumonia cases in CXR images. The model was deployed online and can be used freely by researchers and health professionals [55].
Future work includes extending our architecture to process CT scan images to detect COVID-19 and adapting the model to identify other types of diseases with radiography images.
Institutional Review Board Statement: Université de Moncton IRB waived the approval requirements since the data used in this work was anonymized.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data used in this work come mainly from public datasets. Please see the section describing the datasets.