Deep Learning Methods to Reveal Important X-ray Features in COVID-19 Detection: Investigation of Explainability and Feature Reproducibility

Abstract: X-ray technology has been recently employed for the detection of the lethal human coronavirus disease 2019 (COVID-19) as a timely, cheap, and helpful ancillary method for diagnosis. The scientific community evaluated deep learning methods to aid in the automatic detection of the disease, utilizing publicly available small samples of X-ray images. In the majority of cases, the results demonstrate the effectiveness of deep learning and suggest valid detection of the disease from X-ray scans. However, little has been investigated regarding the actual findings of deep learning through the image process. In the present study, a large-scale dataset of pulmonary diseases, including COVID-19, was utilized for experiments, aiming to shed light on this issue. For the detection task, MobileNet (v2) was employed, which has been proven very effective in our previous works. Through analytical experiments utilizing feature visualization techniques and altering the input dataset classes, it was suggested that MobileNet (v2) discovers important image findings and not only features. It was demonstrated that MobileNet (v2) is an effective, accurate, and low-computational-cost solution for distinguishing COVID-19 from 12 other pulmonary abnormalities and normal subjects. This study offers an analysis of image features extracted from MobileNet (v2), aiming to investigate the validity of those features and their medical importance. The pipeline can detect abnormal X-rays with an accuracy of 95.45 ± 1.54% and can distinguish COVID-19 with an accuracy of 89.88 ± 3.66%. The visualized results of the Grad-CAM algorithm provide evidence that the methodology identifies meaningful areas on the images. Finally, the detected image features were reproducible in 98% of cases when the experiment was repeated three times.


Introduction
Deep learning has already demonstrated superiority to conventional methods in a variety of medical imaging tasks, including the classification of important diseases using different imaging modalities, such as Computed Tomography (CT), Positron Emission Tomography (PET), and X-ray [1]. The recent human coronavirus disease (COVID-19) poses new challenges for deep learning experts, such as the automatic segmentation and classification of CT or X-ray images that can lead to a timely, accurate, and cost-effective diagnosis. Limitations related to data scarcity have been a major obstacle in designing deep and robust frameworks [2]. Since March 2020, the available X-ray image datasets included no more than 500 images of COVID-19 disease.
Typical imaging findings of COVID-19 lung infection include bilateral, patchy, lower-lobe-predominant, and peripheral ground-glass opacities and/or consolidation. These are mainly identified on CT imaging rather than X-ray, which has lower sensitivity for COVID-19 diagnosis at the level of ≈67-100% [3]. Nevertheless, the scientific community has responded to the aforementioned challenge and has provided first answers as to whether this disease can indeed be detected solely from X-ray images. Several works suggest the utilization of deep learning models, such as Convolutional Neural Networks (CNNs) for diagnosis [4][5][6][7][8][9][10]. In most cases, either handcrafted CNNs, or established CNNs in other domains, yield precise and promising results, at least in cases where the COVID-19 disease is adequately visualized in the particular imaging modality. All those networks have been evaluated utilizing approximately the same image sources.
Deep learning has already demonstrated its effectiveness in distinguishing COVID-19 using the particular image datasets. However, the assumption that through deep learning it is possible to diagnose COVID-19 solely on the basis of X-ray images is not valid yet. This is because the available datasets are heavily incomplete due to the following reasons:
a. The samples are too few for deep model training.
b. The image information is not accompanied by clinical outcomes.
c. There are few multicenter studies to support the conclusions.
d. The samples commonly illustrate COVID-19 disease of patients showing disease symptoms. Asymptomatic cases are under-represented.
The above issues motivated the scientific community towards applying data augmentation techniques to expand the training sets, add diversity to the data distributions, and enable their models to become robust to transformations. Nevertheless, the data scarcity issue is not circumvented completely. The question arising at this point is the following: "Besides their undeniably strong predictive power, are the developed deep learning models capable of providing explanations regarding their decisions, informing the actual user of their image findings so as to be trustworthy and accountable?".
Motivated by our previous studies on the automatic identification of COVID-19 from X-rays [5,9] and aiming to shed light on the explainability of deep learning, we performed a deeper analysis of the decision mechanisms of MobileNet (v2), a state-of-the-art CNN that exhibited promising results in our recent study [5]. In previous work of our group [9], the effectiveness of the training-from-scratch strategy compared with transfer learning was demonstrated, showing that training from scratch may discover potential image biomarkers extracted from X-ray images. This conclusion is based on the comparison of transfer learning with training from scratch. The reader should note that with transfer learning, the classification is mainly based on pre-learned feature extraction knowledge of a particular CNN. This knowledge is obtained by performing an independent training on large-scale datasets of a completely different domain task. Although transfer learning also yields good results, training from scratch improves the classification accuracy. This led the authors to the conclusion that novel and vital image features were extracted by the latter strategy.
In the present work, the feature extraction capabilities of MobileNet (v2) were further analyzed by performing extensive experiments and visualizing the output feature maps. The Grad-CAM algorithm [11] was utilized to reveal the regions where MobileNet (v2) looks for important features. In this way, a better understanding of the decision mechanism of the network is achieved.
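The core Grad-CAM computation can be illustrated with a short NumPy sketch. It assumes that the final convolutional feature maps and the gradients of the class score with respect to them have already been obtained from the network (for example, through automatic differentiation). The function name `grad_cam_map` and the toy arrays are illustrative only and are not taken from the original pipeline.

```python
import numpy as np

def grad_cam_map(feature_maps, gradients):
    """Grad-CAM heatmap from feature maps and class-score gradients.

    Both inputs have shape (height, width, channels).
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    if feature_maps.shape != gradients.shape:
        raise ValueError("feature_maps and gradients must share one shape")
    # Channel importance: spatial average of the gradients.
    weights = gradients.mean(axis=(0, 1))
    # Weighted sum of the channels, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(feature_maps, weights, axes=([2], [0])), 0.0)
    peak = cam.max()
    # Scale to [0, 1] so the map can be rendered as a heatmap.
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 opposes it.
toy_maps = np.zeros((2, 2, 2))
toy_maps[0, 0, 0] = 2.0
toy_maps[1, 1, 1] = 3.0
toy_grads = np.zeros((2, 2, 2))
toy_grads[..., 0] = 1.0
toy_grads[..., 1] = -1.0
heatmap = grad_cam_map(toy_maps, toy_grads)
print(heatmap)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the X-ray, which yields the red (high importance) and blue (low importance) regions discussed in the results.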
The contributions of this paper can be summarized as follows:
• The successful state-of-the-art network (MobileNet v2) was extensively evaluated in performing multi-class and two-class classification of X-ray images with the aim of identifying images related to the coronavirus disease. Further, the consistency of the reported metrics was assessed by running a 25-times repeated 10-fold cross-validation.
• The explainability algorithm (Grad-CAM) was employed to inspect the consistency of the suggested areas of interest across a three-run experiment.
• We present a staged approach for the detection of COVID-19 from X-ray images that exhibited an accuracy of 89.88 ± 3.66%.

COVID-19 Detection Based on X-ray Imaging: Recent Studies
The research community has put an enormous effort in developing deep learning pipelines for COVID-19 detection from either computed tomography (CT) scans or X-ray scans. In addition, a large amount of attention has been paid to leveraging explainability methods to visualize the suggested areas of interest as proposed by the models. Hence, model assessment can be based not only on quantitative metrics (such as the accuracy, the sensitivity, and the specificity scores), but also on qualitative evaluation. In this section, we briefly describe major findings and trends found in the latest literature.
Hou and Gao [12] proposed a deep CNN-based platform for COVID-19 detection that could identify COVID-19 cases with an accuracy of 96%. Their model has been trained using a dataset of 1400 chest X-ray images, which includes 400 normal images, 400 images of pneumonia infection from bacteria, 400 images of pneumonia infection by other viruses, and 200 images of pneumonia infection by COVID-19. The authors used the Grad-CAM algorithm to visualize the suggested areas of interest.
Ahsan et al. [13] proposed the utilization of the state-of-the-art networks named Visual Geometry Group (VGG) and MobileNet (v2) to distinguish between COVID-19 and non-COVID-19 X-rays from an imbalanced dataset of 2191 X-rays. The networks achieved remarkable accuracy, ranging between 91% and 96%, and an AUC score of approximately 0.82. The authors used the local interpretable model-agnostic explanations (LIME) [14] method for the visualization of important image areas.
Brunese et al. [4] analyzed 6523 X-ray scans and developed a pipeline for an incremental detection of COVID-19. Their framework identifies pulmonary-disease-related X-rays and then further distinguishes between COVID-19 cases and non-COVID-19 cases. Their model reached an accuracy of 97%. The authors adopted the Grad-CAM algorithm to visualize the feature maps and verified that their model did not focus on irrelevant locations of the image.
In [9], a previous study by the authors of this study, a first attempt to evaluate the extracted features of deep learning methods for COVID-19 detection from X-rays revealed evidence that training MobileNets from scratch can extract problem-specific features that could be of medical importance. In addition, an accuracy of 99% was achieved in distinguishing between COVID-19 and non-COVID-19 cases from an imbalanced dataset of 3905 scans.
Wang et al. [15] proposed COVID-Net, a tailored CNN trained on a dataset of 13,975 X-ray scans. They achieved an accuracy of 93.3% in distinguishing between normal, common pneumonia, and COVID-19-related pneumonia images. The authors employed the GSInquire method [16] to plot the associated critical factors on the image. COVID-Net primarily leveraged areas in the lungs in the X-ray images as the main critical factors in determining whether an X-ray image is of a patient with COVID-19.
Thorough interpretation and examination of the explainability methods is missing from the majority of the related studies, although particular explainability methods have been employed.

Deep Learning with Mobile Networks
The main advantage of CNNs lies in extracting new features from the input data distributions (i.e., images), thereby bypassing the manual feature extraction process, which is traditionally performed in image analysis tasks with machine learning methods [17].
Each convolution layer in a CNN processes the output of the previous layer by applying new filters and extracting new features. Because the convolutional layers are hierarchically ordered, features are extracted directly from the original image only by the first convolutional layer, whereas the other layers process the outputs of each other [18]. In this way, a slow introduction to large amounts of filters is achieved, whilst underlying features may be revealed during the later layers. The general rule of thumb relates the effectiveness of the network with the number of convolutional layers. This is why deep networks are generally superior, provided that adequate amounts of image data are present. In cases where the dataset's size is not large enough to feed a deep network, three solutions are commonly proposed:
(a) The selection of a simpler CNN, which contains fewer trainable parameters and fits the particular data well.
(b) Transfer learning [19], utilizing deep and complex CNNs, but freezing their layers, thereby decreasing the trainable parameters and allowing for knowledge transfer, following their training on large image datasets.
(c) Data augmentation methods to increase the training set size, such as geometric transformations (rotation, shear) and pixel-level transformations (equalizations, grey-level alterations) [20].
In this study, MobileNet (v2) [21] was selected for the classification task, which is a state-of-the-art CNN and has been recently employed and evaluated by the authors [9]. In that particular study, MobileNet (v2) was found to be superior for false negative reduction in COVID-19 detection, in comparison with a variety of famous CNNs, including Inception (v3) [22] and Xception [23].
The superiority of MobileNet (v2) in reducing the false negatives for the detection of COVID-19, compared to other famous CNNs, is demonstrated in [5,9]. Moreover, this CNN introduces a smaller number of parameters compared to other CNNs, which makes it appropriate for swift training and portable applications. The inventors of this network made use of depth-wise separable convolution [22] to drastically reduce the number of learnable parameters in CNNs, thereby reducing the computational cost.
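The parameter saving obtained from depth-wise separable convolution can be made concrete with a small calculation. The sketch below counts the weights (biases ignored) of a standard convolution and of its depth-wise separable counterpart; the layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are arbitrary illustrative values and are not taken from the MobileNet (v2) configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """One kernel x kernel depth-wise filter per input channel,
    followed by a 1 x 1 point-wise convolution that mixes the channels."""
    return kernel * kernel * c_in + c_in * c_out

standard = standard_conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"reduction: {standard / separable:.2f}x")
```

For a 3 × 3 kernel the reduction factor approaches 9 as the number of output channels grows, which is why this factorization suits portable applications.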
MobileNet (v2) is employed and trained from scratch, letting it fit in the training set completely and without making any adjustments to its structure. Every parameter is made trainable. In essence, the obtained weights from its training on ImageNet challenge dataset [24] are erased. This methodology is selected to allow for problem-specific feature extraction. At the top of the network, wherein the final feature maps are produced, a global average pooling [25] layer is applied to reduce overfitting. This layer connects the final feature map directly to the dense layer at the top of the CNN, which consists of 2500 nodes. Another dense layer of two outputs is inserted for the binary classification of the inputs. Batch normalization and dropout layers aid in the reduction of overfitting and are part of the densely connected layers at the top of the network.
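The shape of the classification head described above can be sketched as a NumPy forward pass. This is only an illustration under stated assumptions: the 13 × 13 × 1280 feature map size, the ReLU and softmax activations, and the random weights are assumptions, and batch normalization and dropout are omitted because they do not change the tensor shapes at inference time.

```python
import numpy as np

def classification_head(feature_map, w_hidden, b_hidden, w_out, b_out):
    """Global average pooling -> dense (2500 nodes) -> dense (2 outputs)."""
    # Global average pooling: (height, width, channels) -> (channels,)
    pooled = feature_map.mean(axis=(0, 1))
    # Dense layer with 2500 nodes; ReLU is an assumed activation.
    hidden = np.maximum(pooled @ w_hidden + b_hidden, 0.0)
    # Two-output dense layer followed by a numerically stable softmax.
    logits = hidden @ w_out + b_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
channels, hidden_units, n_classes = 1280, 2500, 2  # 1280 channels is assumed
feature_map = rng.standard_normal((13, 13, channels))
w_hidden = rng.standard_normal((channels, hidden_units)) * 0.01
b_hidden = np.zeros(hidden_units)
w_out = rng.standard_normal((hidden_units, n_classes)) * 0.01
b_out = np.zeros(n_classes)

probabilities = classification_head(feature_map, w_hidden, b_hidden, w_out, b_out)
print(probabilities.shape, probabilities.sum())
```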

Image Dataset

COVID-19, Common Bacterial and Viral Pneumonia X-ray Scans
X-ray images corresponding to confirmed cases infected by the virus SARS-CoV-2 were selected. Through extensive research, a collection of 1281 well-visualized, confirmed pathological X-ray images was created. The final collection included X-rays from a publicly available repository [26]. Contributing institutions of this repository include the Indian Institute of Science, the PES University, the M. S. Ramaiah Institute of Technology, and Concordia University. The publishers of this data did not include important clinical information, which could be useful for a more robust analysis.

Pulmonary Diseases Detected from X-ray Scans
The National Institutes of Health (NIH) X-ray repository was accessed and analyzed. It comprises 112,120 frontal-view X-ray images of 30,805 unique patients with 14 text-mined disease labels [27].
Those images were extracted from the clinical PACS database at the National Institutes of Health Clinical Center in the USA. The contents of this archive contained 14 common thoracic pathologies, namely, atelectasis, consolidation, infiltration, pneumothorax, edema, emphysema, fibrosis, effusion, pneumonia, pleural thickening, cardiomegaly, nodule, mass, and hernia. This dataset is significantly more representative of the real patient population distributions and realistic clinical diagnosis challenges than any previous chest X-ray datasets. The medical reports were analyzed by an automatic text-mining model that assigned the corresponding labels according to its text-mining procedure. This method has been initially adopted by the creators of the dataset and is not part of this work.
The final dataset characteristics are summarized in Table 1. In Figure 1, selected samples from major classes are presented.
Table 1.
Dataset: Abnormality detection. Classes: 2. Description: Huge dataset consisting of normal and abnormal X-ray scans. In the abnormal class, X-rays corresponding to COVID-19 were also included.
Dataset: COVID-19 detection. Classes: 2. Description: Dataset containing COVID-19 X-ray scans and a second class of both normal and abnormal X-ray scans (selected samples). Images: 2935.
For the normal class in the abnormality detection dataset, we added some more images to make the classes approximately even in terms of the number of images included. All image sizes were adjusted to 400 × 400 pixels (height, width). The resolution of the images varied from 72 to 150 pixels/inch, and the bit depth of the images was 8 bits.

Data Augmentation Techniques
Data augmentation is an important method in deep learning applications and research, mainly utilized for two reasons. The first reason is data scarcity, which impedes the adaptation of deep learning models to the domain of interest. Few images are usually not enough for a deep learning framework to train on [28], especially in cases where the classification should be based on deep features and not obvious and low-level characteristics (e.g., colors). With data augmentation, the initial training set can be broadly expanded by applying a variety of transformations on the original images. In this way, the model learns to ignore irrelevant characteristics and improves its spatial capabilities [29]. For example, applying random rotations directs the model towards looking for patterns at varying positions.
In the present research, the following augmentations were applied to the training sets to expand the available data and to increase the generalization capabilities of the experimental deep learning network:
a. Random rotations.
b. Height and width shifts.
The reader should note that data augmentation was performed on-line. During each 10-fold repetition, the augmented images were supplied to the classification model, whilst the test sets remained untouched. In this way, each training image was augmented to produce contextual images by performing the abovementioned augmentations.
Random rotations were restricted to −20 to 20 degrees, and height and width shifts were restricted to ±20 pixels. The ±20 degree of rotation was empirically selected to avoid excessive rotations, whilst letting the model develop robustness to spatial discrepancies between the image findings, for example, the position of the lungs.
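A minimal NumPy sketch of such an on-line augmentation step is given below. It samples a rotation within ±20 degrees and integer shifts within ±20 pixels and applies them to a grayscale image through nearest-neighbour resampling. The function name, the zero filling of uncovered pixels, and the interpolation choice are assumptions made for illustration; the original pipeline may have relied on library utilities with different defaults.

```python
import numpy as np

def random_rotate_shift(image, rng, max_angle_deg=20.0, max_shift_px=20):
    """Randomly rotate and shift a 2-D grayscale image (same output size)."""
    height, width = image.shape
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    shift_y, shift_x = rng.integers(-max_shift_px, max_shift_px + 1, size=2)

    rows, cols = np.mgrid[0:height, 0:width]
    center_y, center_x = (height - 1) / 2.0, (width - 1) / 2.0
    # Inverse mapping: undo the shift, then undo the rotation about the centre.
    y = rows - center_y - shift_y
    x = cols - center_x - shift_x
    src_y = np.rint(np.cos(angle) * y + np.sin(angle) * x + center_y).astype(int)
    src_x = np.rint(-np.sin(angle) * y + np.cos(angle) * x + center_x).astype(int)

    # Output pixels whose source falls outside the image are left at zero.
    inside = (src_y >= 0) & (src_y < height) & (src_x >= 0) & (src_x < width)
    augmented = np.zeros_like(image)
    augmented[inside] = image[src_y[inside], src_x[inside]]
    return augmented

rng = np.random.default_rng(42)
xray = rng.integers(0, 256, size=(400, 400), dtype=np.uint8)  # stand-in image
augmented = random_rotate_shift(xray, rng)
print(augmented.shape, augmented.dtype)
```

Because a new transformation is drawn every time an image is requested, only the training folds should pass through this function; the test folds stay untouched, as stated above.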

Experiments
The initial dataset included 14 classes. On the basis of this dataset, subsets were created according to Figure 2 and Table 2. The intention of the experimental phases and the methods utilized are summarized in Table 2. For all the experiments, the parameters of the model were retained. The batch size was 16, and the number of epochs varied from 30 to 40 according to the validation loss. All experiments were performed in a Python environment making use of the TensorFlow library. A computer with an Intel Core i5-9400F CPU at 2.90 GHz, 64 GB of RAM, and a GeForce RTX 2060 Super GPU was the main infrastructure for the experiments. In Figure 2, an overview of the study is presented.
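The repeated 10-fold cross-validation protocol can be sketched with the standard library alone. The generator below yields train and test index lists for every fold of every repetition. It performs a plain (non-stratified) shuffle split, which is an assumption, since the text does not state whether the folds were stratified; the function name and the seed are likewise illustrative.

```python
import random

def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=25, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh shuffle for every repetition
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold in range(n_folds):
            test_indices = folds[fold]
            train_indices = [
                index
                for other, members in enumerate(folds)
                if other != fold
                for index in members
            ]
            yield repeat, fold, train_indices, test_indices

# Small demonstration with 50 samples and 2 repetitions.
splits = list(repeated_kfold_indices(50, n_folds=10, n_repeats=2))
print(len(splits), "train/test splits")
```

With the default arguments, 25 repetitions of 10 folds give 250 training runs, each evaluated on a held-out tenth of the data.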

Results of Multiclass Classification
For the multiclass classification, MobileNet (v2) achieved sub-optimal performance, as presented in Table 3. The model achieved good classification for the bacterial pneumonia, normal, mass, COVID-19, and consolidation classes (confusion matrix is available in the Supplementary Material). Especially for COVID-19, 1095 true positives were recorded (out of 1281), corresponding to 85.48% accuracy. Moreover, only 12 false negatives were reported. This observation indicates that, despite the overall sub-optimal performance, the model correctly captured COVID-19 image characteristics that distinguish these images from the rest. Moreover, the normal class was adequately predicted, with 2439 true normal predictions and 36 predictions that were mistakenly identified as normal.
Table 3. Classification results. The mean accuracy over the complete 10-fold cross-validation and the standard deviation across the folds are also reported. Columns: Dataset, Accuracy (%), AUC Score (%).
Figure 3 illustrates the results of the multiclass classification and selected samples from the outputs of the Grad-CAM algorithm. The red areas of the image suggest the region where the model has captured significant features. Blue areas are considered neutral regions, where no features, or insignificant features, are found. The reader can observe that COVID-19 features were mainly discovered in the center of the respiratory system and that those regions indeed contained COVID-19 findings. Moreover, Figure 3 illustrates misclassified instances. For COVID-19, it was observed that the misclassified image did not contain any information in the center of the respiratory system, perhaps leading the model to falsely recognize specific patterns. In fact, it was observed that the model looked for patterns in the upper right of the image, which was a completely irrelevant region. This issue highlights the flaws of the model and its decision mechanism.
Taking into consideration both the good classification accuracy in distinguishing the COVID-19 class from the other classes and the Grad-CAM visualizations, it can be assumed that in the majority of COVID-19 X-rays, potential biomarkers are discovered. However, this assumption requires further investigation.

Results of Abnormality Detection (Two-Class)
Abnormality detection produced excellent results. As observed in Table 3, 95.45% accuracy was achieved. The total number of false negatives was 211, as the confusion matrix of Figure 4 suggests. The model thus showed a strong capability in distinguishing normal from abnormal X-ray scans. In Figure 4, it is observed that the Grad-CAM results confirmed the assumption that the model looks for patterns in the correct regions of the respiratory system.
Due to the fact that all types of infections were grouped together in one class (abnormal), the model learned global features explaining the presence of any disease and did not learn the visual differences that each disease may display in the image. However, there are still images where the Grad-CAM exposed some limitations and flaws of the model. In essence, there were images where the model was unable to locate the region of interest correctly, despite the correct classification.

Results of Abnormality Discrimination
The abnormality discrimination experiment produced poor performance due to the presence of many respiratory diseases, many of which produce overlapping X-ray results. Specifically, 62.26% accuracy was achieved. In the Supplementary Material, the confusion matrix is provided. It is observed that MobileNet (v2) achieved good classification results for COVID-19 (1110 true positives, 171 false negatives, 168 false positives), mass (2427 true positives, 78 false negatives, 14 false positives), and bacterial pneumonia (1108 true positives, 25 false negatives, 14 false positives). For the rest of the diseases, the discrimination task performed sub-optimally. As is observed in Figure 5, the validation accuracy did not improve, despite the improvement in the training accuracy. The same phenomenon applied to the validation loss. Those results highlighted the inability of the model to capture and learn discriminant features. Data augmentation has not been beneficial enough to improve its discrimination ability for the majority of the diseases. However, due to the fact that the aim of this study was focused on COVID-19, the reason behind the sub-optimal performance for multi-class classification was not further investigated in terms of the type of the extracted features. Moreover, the imbalance of the dataset hindered thorough and extensive evaluation. Several classes were underrepresented. As a result, a deep analysis on the extracted features of those classes would yield negligible outcomes.

Results of COVID-19 Detection
For the COVID-19 detection experiment, top performance was observed, with the classification accuracy reaching 89.88%. Specifically, as the confusion matrix of Figure 6 suggests, 1154 COVID-19 X-ray images were correctly identified out of 1281. The total number of false negatives was 127, whilst the total number of false positives was 170. The Grad-CAM output suggested that the model looked for COVID-19 related features, focusing on the upper respiratory system. For the non-COVID-19 class, the model based its predictions on the collection of different features found in various regions of the image.
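The reported figures can be cross-checked with a few lines of Python. The sketch assumes that the COVID-19 detection dataset contains 2935 images in total, which is the count that appears to belong to this dataset in Table 1; the number of true negatives is derived from that assumption and is not stated in the text. Under this assumption the computed accuracy agrees with the reported 89.88%.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics derived from a 2 x 2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Counts reported for the COVID-19 detection experiment.
tp, fn, fp = 1154, 127, 170
total_images = 2935  # assumption: dataset size read from Table 1
tn = total_images - tp - fn - fp  # derived, not reported in the text

metrics = binary_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Sensitivity is the quantity most directly tied to the 127 false negatives, so it is worth reporting next to the overall accuracy when missed COVID-19 cases are the main concern.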
A significant observation is that in every experiment, COVID-19 images were correctly classified, either as part of a multiclass dataset or as the major class in a two-class dataset. There is significant evidence that this stability derives from unique image features discovered by the model in those processes. The results of the upcoming reproducibility test favor this assumption.

Results of Feature Reproducibility in COVID-19 Detection
The two-class classification routine was repeated 25 times, and the reported accuracy was assessed for statistical significance. A one-sample t-test was performed, assuming that there is no difference in the mean accuracy score between the 25 runs (i.e., setting the second variable equal to the first obtained accuracy). Table 4 presents the accuracy of each run. As can be observed from Table 5, the p-value was greater than 0.05. Hence, there is no evidence that the mean accuracy obtained from the 25 runs deviated from the expected value. To summarize, the t-test results suggest that the model is stable in reproducing the particular results in terms of accuracy.
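The one-sample t-test described above can be sketched with the standard library. Only the four run accuracies that survive in Table 4 are used here, so the output illustrates the calculation and does not reproduce the reported 25-run result. The reference mean is set to the first obtained accuracy, as described in the text; the function name is illustrative.

```python
import math
import statistics

def one_sample_t(values, reference_mean):
    """Return the t statistic and degrees of freedom of a one-sample t-test."""
    n = len(values)
    sample_mean = statistics.mean(values)
    # Standard error of the mean, based on the sample standard deviation.
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (sample_mean - reference_mean) / standard_error, n - 1

accuracies = [89.88, 91.23, 88.54, 92.14]  # only the runs visible in Table 4
t_statistic, dof = one_sample_t(accuracies, reference_mean=accuracies[0])
print(f"t = {t_statistic:.3f} with {dof} degrees of freedom")
```

The p-value follows from the Student t distribution with the returned degrees of freedom; a statistics package such as SciPy (`scipy.stats.ttest_1samp`) provides it directly.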

Table 4. Classification results of 25-run 10-fold cross-validation when training and testing MobileNet (v2) using the COVID-19 detection dataset (two classes).
Run    Mean Accuracy (%)
1      89.88
2      91.23
3      88.54
4      92.14
It was observed that there was no significant variation of the accuracy over the 25 runs. As a result, the comparison between the Grad-CAM visualization outputs of the 25 runs was performed using the outputs of three runs. We performed a case-to-case examination of the similarity of the produced Grad-CAM images to inspect whether the suggested areas of interest remained consistent across the three independent trainings. The evaluation was conducted by two of the authors (J.A. and N.P.) by visually inspecting the suggested areas in terms of their relative position inside the image. The methodology of this experiment is better understood in Figure 7. Figure 8 illustrates the Grad-CAM outputs obtained by three independent trainings of MobileNet (v2). All parameters, hyper-parameters, and image sets were retained during the three separate trainings.
In approximately 98% of the visualized Grad-CAM maps, the features were reproduced and the suggested areas remained the same. It was noted that there was a disagreement between the three independent training-testing results for 2% of the images. The reader should note that Figure 8 ...
It was observed that a few discovered features are not always reproducible (2%). Figure 8 provides regions of specific images where the discovered features in the first training were not re-discovered during the second or third training. The classification accuracy remained top-level (approximately 90%) for each repetition. This is a conflicting situation. The reasons behind this phenomenon can vary: (a) Some of the COVID-19 images may contain annotations that are recognized by the model as features. Although the data were tested, the non-official nature of the dataset source led us to not be completely sure about the origin of the images and the pre-processing that may have taken place.

Discussion
Deep learning has enabled the extraction of a massive number of low- and high-level features from medical images. Those features may represent important biomarkers, closely related to the corresponding diseases. However, deep learning methods lack the ability to assess these features specifically. The extracted features are not well-defined and usually refer to combinations of findings inside the image. This issue derives from the millions of complex mathematical operations incorporated into deep models, which make tracking the extracted features difficult. It also raises concern about the trustworthiness of such models for medical image classification tasks. For the recent COVID-19 disease, deep learning has proven helpful in early detection, utilizing only X-ray scans. Little has yet been investigated as to why deep learning models yield top results across a variety of scientific papers.
This study focused on revealing evidence supporting the assumption that COVID-19 imprints specific patterns on X-rays that testify to its presence. The results provide strong evidence that MobileNet (v2) can capture those underlying signatures and reveal them. However, on many occasions, the MobileNet (v2) model was unable to locate the proper regions of interest, even when the classification was correct. In those cases, the decision was not reached on a correct basis. It is fair to assume that the model was deceived and that the associated features were irrelevant. This behavior raises many questions and mandates future research. Nevertheless, the majority of samples demonstrated correct model reasoning and merit further attention.
The experiments were based on the recently introduced Grad-CAM algorithm, which uses the gradients of the class score with respect to the feature maps of a convolutional layer, obtained by a backward pass through the trained model, to weight those maps and highlight the image regions that contributed to the decision. The experimental tests were repeated three times to investigate the reproducibility of the regions that contained the suggested features. It was found that in 98% of the samples, the suggested areas remained consistent. Moreover, the model consistently suggested specific regions of the image that helped in distinguishing COVID-19 from both normal X-rays and X-rays corresponding to other respiratory and lung diseases. Based on those experiments, it is fair to assume that, among the millions of extracted image features, there are potential features of medical importance.
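The core Grad-CAM computation can be sketched in a few lines. The sketch below is a simplified, framework-free illustration and not the implementation used in this study. It assumes that the feature maps of the chosen convolutional layer and the gradients of the class score with respect to them have already been extracted from the trained network. The function name `grad_cam` and the final scaling to [0, 1] (a common display convention) are assumptions.

```python
from typing import List

Matrix = List[List[float]]  # one H x W feature map or gradient map


def grad_cam(feature_maps: List[Matrix], gradients: List[Matrix]) -> Matrix:
    """Combine K feature maps into one H x W heatmap, Grad-CAM style.

    feature_maps[k][i][j]: activation of channel k at position (i, j).
    gradients[k][i][j]: derivative of the class score w.r.t. that activation.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight: global average of the gradient over all positions.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    # Weighted sum of the feature maps, followed by ReLU to keep only
    # the regions with a positive influence on the class score.
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, feature_maps)))
         for j in range(width)]
        for i in range(height)
    ]
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


# Toy example with two 2 x 2 channels (values are illustrative only).
feature_maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)  # [[0.5, 0.0], [0.0, 1.0]]
```

In practice, the resulting low-resolution heatmap is upsampled to the input image size and overlaid on the X-ray.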
Besides its demonstrated effectiveness, MobileNet (v2) is also suitable for mobile applications due to its inherently low computational requirements [21]. In the present work, a complete 40-epoch training of MobileNet (v2) took approximately 70 min using a dataset of 11,984 images (of size 400 × 400) while performing online data augmentation. The reader should recall that the experiments were performed using an ordinary computer. The trained model can process a new input image and provide both the classification and the Grad-CAM output in less than one second. This adds to the significance of our work because limited computational cost and low model complexity are highly desirable in modern medical technology solutions that operate in real time.
This study has a number of limitations. Firstly, due to COVID-19 data scarcity, every publicly available image dataset related to COVID-19 is incomplete in terms of clinical data, verification, specific annotations, demographic details, and more. Those issues hinder the development of models that approach the problem holistically. For example, Tartaglione et al. [30] highlight that either missing or imbalanced demographic information can result in biased models. Because of these issues, real-life evaluation is mandatory to verify the validity of the results. Secondly, expert opinions regarding each image involved in this study were also missing from the image datasets. Hence, it is not possible to compare the model's decisions with those of medical experts. This is an important limitation of the study, and we intend to suggest solutions in future research. Thirdly, this study used only the Grad-CAM algorithm for visualizing the suggested areas of interest. Although Grad-CAM is extensively used in related works, its performance can sometimes be sub-optimal [31]. Future studies can consider employing more explainability tools, such as saliency map visualization [31], LIME [14], and the SHapley Additive exPlanations (SHAP) method [32]. Moreover, the reader should recall that the model underperformed in abnormality discrimination, failing to provide acceptable classification metrics for a number of pulmonary defects and diseases. Although this study is focused on COVID-19 detection rather than abnormality discrimination, the inability of the model to discriminate other pulmonary diseases is a limitation that cannot be overlooked.
During the experiments, it was also revealed that a more accurate diagnosis of COVID-19 involves a two-stage approach (Figure 9). During the first stage, the input X-ray is analyzed for pathological findings, with an accuracy of 95.45%. If the image is abnormal, the second stage takes place, in which the X-ray is further analyzed for COVID-19 detection, with an accuracy of 89.89%. If the X-ray is not assigned to the COVID-19 class, an optional third stage may take place, where the image is analyzed for other abnormalities. The latter stage was not explored further in this study. The scope of the study is not to present a framework that exhibits classification metrics superior to the related works but to investigate the extracted image features as to their validity and importance. Nevertheless, the classification accuracy of the presented framework competes with the recent literature (Table 6). The reader should recall that this work utilizes a large collection of X-ray images that belong to many classes, which poses additional challenges to the classification model.
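The staged approach of Figure 9 can be sketched as a simple cascade. The sketch below is illustrative only: the function `staged_diagnosis`, its parameter names, the 0.5 decision threshold, and the stub classifiers are all assumptions, standing in for the trained normal-vs-abnormal and COVID-19-vs-non-COVID-19 MobileNet (v2) models.

```python
from typing import Callable, Optional

# Hypothetical interface: a binary classifier maps an image to a probability.
Classifier = Callable[[object], float]


def staged_diagnosis(
    image: object,
    abnormal_prob: Classifier,
    covid_prob: Classifier,
    other_label: Optional[Callable[[object], str]] = None,
    threshold: float = 0.5,
) -> str:
    """Run the staged classification and return a label for the image."""
    # Stage 1: normal vs. abnormal.
    if abnormal_prob(image) < threshold:
        return "normal"
    # Stage 2: COVID-19 vs. non-COVID-19, reached only by abnormal images.
    if covid_prob(image) >= threshold:
        return "COVID-19"
    # Optional stage 3: name another abnormality if a third model is supplied.
    if other_label is not None:
        return other_label(image)
    return "abnormal, non-COVID-19"


# Usage with stub classifiers in place of trained models.
print(staged_diagnosis("xray", lambda img: 0.9, lambda img: 0.8))  # COVID-19
print(staged_diagnosis("xray", lambda img: 0.1, lambda img: 0.8))  # normal
```

Keeping the stages as separate callables means the second model is evaluated only on images flagged as abnormal, which mirrors the order of decisions described above.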

Conclusions
For the present study, a collection of 11,984 images corresponding to COVID-19, 12 other respiratory-lung abnormalities, and normal X-ray scans was utilized. Five independent experiments were performed. In the first experiment, the 14-class dataset was used to evaluate MobileNet (v2) in distinguishing between all dataset classes. MobileNet (v2) was found to be superior to other related state-of-the-art CNNs in previous studies conducted by the authoring team [4,8]. In the second experiment, two-class (normal vs. abnormal) classification was performed. In the third experiment, a 13-class dataset was utilized to distinguish between abnormal classes. In the fourth experiment, two-class (COVID-19 vs. non-COVID-19) classification was performed. Finally, the last experiment was repeated three times in order to investigate the reproducibility of the extracted features and to assess the explainability of the model. Grad-CAM visualizations and accuracy metrics yielded strong evidence that COVID-19 image features can be detected with the deep learning approach, specifically with MobileNet (v2). Moreover, it was demonstrated that MobileNet (v2) is an effective CNN for automatic COVID-19 detection, which could even be embedded in portable diagnostic systems due to its inherently low computational cost and its ability to process a new image in less than a second, at least in this particular study. Finally, a staged classification approach is suggested for diagnosing COVID-19, which exhibits an accuracy of 89.89%.