Exploration of Interpretability Techniques for Deep COVID-19 Classification Using Chest X-ray Images

The outbreak of COVID-19 has shocked the entire world with its fairly rapid spread, and has challenged different sectors. One of the most effective ways to limit its spread is the early and accurate diagnosing of infected patients. Medical imaging, such as X-ray and computed tomography (CT), combined with the potential of artificial intelligence (AI), plays an essential role in supporting medical personnel in the diagnosis process. Thus, in this article, five different deep learning models (ResNet18, ResNet34, InceptionV3, InceptionResNetV2, and DenseNet161) and their ensemble, using majority voting, have been used to classify COVID-19, pneumoniæ and healthy subjects using chest X-ray images. Multilabel classification was performed to predict multiple pathologies for each patient, if present. Firstly, the interpretability of each of the networks was thoroughly studied using local interpretability methods—occlusion, saliency, input X gradient, guided backpropagation, integrated gradients, and DeepLIFT—and using a global technique—neuron activation profiles. The mean micro F1 score of the models for COVID-19 classifications ranged from 0.66 to 0.875, and was 0.89 for the ensemble of the network models. The qualitative results showed that the ResNets were the most interpretable models. This research demonstrates the importance of using interpretability methods to compare different models before making a decision regarding the best performing model.


Introduction
In 2020, the world witnessed a serious new global health crisis: the outbreak of the infectious COVID-19 disease, which is caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) 1,2. Due to its long incubation period and its highly contagious nature, it is important to identify infected cases early and isolate them from the healthy population. To date, viral nucleic acid detection using Reverse Transcription Polymerase Chain Reaction (RT-PCR) has been regarded as the gold standard diagnostic method 3. However, RT-PCR tests have been reported to suffer from a high rate of false negatives owing to laboratory and sample collection errors 4,5.
Medical imaging has emerged as a strong alternative candidate for screening COVID-19 cases and discriminating them from other conditions, as the majority of infected patients exhibit abnormalities on medical chest imaging [6][7][8]. In this context, chest radiography (CXR) and Computed Tomography (CT) are widely utilised in front-line hospitals for diagnosis [9][10][11]. In certain instances, chest CT images have been demonstrated to exhibit higher sensitivity than RT-PCR and have detected COVID-19 infections in patients with negative RT-PCR results 4,[11][12][13]. Nevertheless, there are numerous advantages that encourage the use of CXR imaging in clinical practice, such as faster diagnosis, infection control, and less harm to the patient than CT 14,15. Moreover, X-ray machines are far more readily available than CT scanners, especially in developing countries. In addition, with the help of portable X-ray machines, imaging can be performed in the isolation rooms, decreasing the risk of infection transmission during transportation to the CT room, as well as the time needed for disinfecting the CT equipment and room 16. Thus, despite its limitations, CXR is widely utilised for COVID-19 screening 16.
Airspace opacities or ground-glass opacities (GGO) are commonly reported radiological appearances of COVID-19 17,18. They are predominantly distributed in the bilateral, peripheral, and lower zones (90%) 19. However, these manifestations are very similar to those of various viral pneumoniae and other inflammatory and infectious lung diseases. Therefore, it is difficult for radiologists to discriminate COVID-19 from other types of pneumoniae 20. Expert radiologists are needed to achieve high diagnostic performance, and the diagnostic process is relatively long.
Artificial intelligence (AI) can strengthen the power of imaging tools to provide accurate diagnoses. Many AI applications have focused on infection quantification and identification to assist radiologists in decision-making. The classification of COVID-19 and other types of pneumonia has been investigated using deep learning techniques 6,21. However, due to their "black box" nature, the rationale behind the predictions of such techniques is often unknown; hence, they are considered too unreliable to be integrated into the clinical workflow. Interpretability techniques, which show the focus area of such deep learning methods, are needed to build the confidence of medical practitioners in these methods. Techniques that also involve interpretability to understand the reasoning performed by the model have been proposed 22. However, comparative studies of different models based on accuracy and interpretability, followed by verification of the interpretability results by doctors, have not been performed. Therefore, in this work, the authors considered state-of-the-art deep learning models to classify COVID-19 and similar pathologies, along with a thorough examination, involving doctors, of the interpretability of each of these models. Furthermore, motivated by the fact that one patient can have multiple pathologies at the same time, a multilabel classification was performed, a task that is not commonly performed in similar studies. Deep learning was chosen over interpretable non-deep-learning techniques because deep learning techniques have recently been observed to outperform others in various radiological applications [23][24][25].
The remainder of the paper is organised as follows: the second section presents and discusses several related works; the third section details the network models and interpretability techniques used here and delineates the approach to dataset creation. The fourth section presents the classification results and the interpretability analysis. The results are then analysed in the fifth section, and finally, the sixth section concludes the work and provides directions for further research.

Related works
The use of artificial intelligence (AI) in healthcare has been developed to support humans in decision making [26][27][28][29]. AI-based knowledge has been combined with medical imaging to enhance the accuracy of diagnoses of various diseases, such as respiratory infectious diseases 30 and pulmonary tuberculosis 31, including pandemic diseases such as H1N1 influenza 32.
The spread of COVID-19 has attracted many researchers to concentrate their efforts toward developing AI-based disease detection techniques for various medical imaging modalities. The assistance of deep learning has shown an improvement in binary diagnosis (presence or absence of COVID-19) from CXR images 33 and a reduction in the workload of front-line radiologists 34. Many efforts have been made to perform multiclass classification (COVID-19, other types of pneumonia, or healthy) to assist radiologists in decision making. Narin et al. 7 used ResNet50, InceptionV3, and InceptionResNetV2 models to classify patients with COVID-19 using CXR images. They demonstrated that the pre-trained ResNet50 model yields the highest accuracy (98%). However, accuracy is often deemed a misleading metric in the case of imbalanced datasets. Furthermore, they only discriminated between healthy subjects and COVID-19, but did not include the other types of pneumonia. Wang et al. 35 designed COVID-Net using CXR images for the classification of patients with bacterial pneumonia, viral pneumonia, COVID-19, and also healthy subjects, with a sensitivity of 91% for COVID-19 detection. Zhang et al. 6 37 proposed the Gen-ProtoPNet architecture that provides interpretable classifications of COVID-19 in CXR 37 and CT scans 38, resulting in F1 scores as high as 98%. Furthermore, Shorten et al. 39 provided a comprehensive survey of different applications of deep learning for COVID-19. On the other hand, De Falco et al. 40 proposed an interpretable, completely transparent evolutionary rule-based approach, but only managed to achieve an accuracy of around 80%. This demonstrates the possible trade-off between transparency and model performance. Deep learning methods that are inherently interpretable, or that are interpreted using post hoc methods, can mitigate this trade-off. Although the application of deep learning methods for COVID-19 lesion detection, including interpretability, is not an unexplored topic, systematic comparisons of different models in terms of interpretability and verification of the interpretability results by medical professionals are still missing. These are the aspects this paper seeks to address, while presenting the importance of evaluating or comparing models with respect to interpretability along with the classification accuracy. It is noteworthy that these problems and the message of this paper are not limited to COVID-19 classification, but are applicable to classification problems in general, especially in high-risk domains like medical imaging.
Although AI-based assistance was introduced in the field of radiology a long time ago, the decision-making mechanisms within these "black-box" methods remain questionable. Recently, research on interpretability has gained more focus. Different interpretability techniques, such as occlusion 41, saliency 42, guided backpropagation 43, integrated gradients 44, etc., have been introduced, demonstrating the potential to open these black boxes.

Network models
During the course of this research, various network architectures were explored and experimented with, including several variants of VGG 45, ResNet 46, ResNeXt 47, WideResNet 48, Inception 49, and DenseNet 50. Prior to training on the dataset of this research work, all the networks were initialised with weights pre-trained on ImageNet. After observing the results, five network architectures were shortlisted for further analysis and also used to create an ensemble using the majority voting strategy for better prediction performance. The models were selected based on different criteria, such as performance, complexity of the model, etc. The selected models are discussed in this section, and Table 1 shows the complexity of the models.
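The majority voting strategy can be illustrated with a minimal sketch. It assumes that each model's per-label outputs have already been thresholded to 0/1 and that a label is accepted when more than half of the models predict it; the text does not spell out these details, and the function name `majority_vote` and the example predictions are hypothetical.

```python
def majority_vote(model_predictions):
    """Combine binary multilabel predictions from several models.

    model_predictions: list of per-model label vectors (0/1), all of equal length.
    A label is switched on when more than half of the models predict it.
    """
    n_models = len(model_predictions)
    n_labels = len(model_predictions[0])
    votes = [sum(pred[j] for pred in model_predictions) for j in range(n_labels)]
    return [1 if 2 * v > n_models else 0 for v in votes]


# Hypothetical outputs of five models for the labels
# [pneumonia, viral pneumonia, COVID-19, ARDS]
predictions = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
]
ensemble = majority_vote(predictions)
print(ensemble)
```

With an odd number of models, as with the five networks here, ties cannot occur, so no tie-breaking rule is needed in this sketch.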
ResNet: At the nascent stage of deep learning, the deeper networks faced the problem of vanishing/exploding gradients 51,52, which hampered convergence. The deeper networks faced another obstacle called degradation, where the accuracy starts to saturate and degrade rapidly after a certain depth of the network. To overcome these problems, He et al. 46 designed a new network model called residual network or ResNet, where the authors came up with 'Skip Connection' identity mapping. This does not involve adding an extra hyperparameter or learnable parameter but just adding the output from a preceding layer to a subsequent layer. It unleashed the possibility of training deeper models whilst avoiding the aforementioned issues.
After comparing various versions of ResNet, during this research two different variants, ResNet18 and ResNet34, were chosen for further analysis.
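The skip connection described above can be sketched in a few lines. This is a deliberately simplified illustration: the residual branch uses small fully connected maps instead of the convolutions and batch normalisation of the actual ResNet blocks, and all names (`relu`, `linear`, `residual_block`) are hypothetical helpers.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, v):
    # plain matrix-vector product
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear maps with a ReLU in between."""
    fx = linear(w2, relu(linear(w1, x)))
    # the identity shortcut adds the block input without any extra parameters
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
out = residual_block(x, zero, zero)
print(out)
```

When the residual branch outputs zero, the block passes a non-negative input through unchanged, which is why stacking such blocks does not by itself make the mapping harder to learn.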
InceptionNet: An image can have thousands of salient features. In different images, the features of interest can lie in different parts of the image, which makes determining the appropriate kernel size for a convolutional network very difficult. A large kernel will have a greater focus on globally distributed information, while a smaller kernel will focus on local information. To overcome this problem, Szegedy et al. 49 came up with a new network architecture called InceptionNet or GoogLeNet. The authors used filters of multiple sizes operating on the same level, which makes the network "wider" rather than "deeper". To reduce the computational cost, the authors restricted the number of input channels by adding an extra 1x1 convolution before the 3x3 and 5x5 convolutions. Adding 1x1 convolutions is much cheaper than adding 5x5 convolutions. The authors introduced two auxiliary classifiers to avoid the problem of vanishing gradients, and an auxiliary loss is calculated on each of them. The total loss function is a weighted sum of the auxiliary losses and the real loss.
Excessive reduction in dimensions can cause a loss of information, also known as a "representational bottleneck". To overcome this problem and scale the network in ways that utilise the added computation as efficiently as possible, the authors of InceptionNet introduced new ideas in another publication (Szegedy et al. 53): factorising convolutions and aggressive regularisation. The authors factorised each 5x5 convolution into two 3x3 convolution operations to improve computational speed. Furthermore, they factorised convolutions of filter size nxn into a combination of 1xn and nx1 convolutions. This network is known as InceptionV2.
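The saving obtained from these factorisations can be checked with simple arithmetic. The sketch below counts convolution weights (ignoring biases) under the assumption that the number of input and output channels is the same for every layer; the channel count of 64 and the helper `conv_weights` are illustrative choices, not values from the paper.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights (ignoring biases) in a kh x kw convolution."""
    return kh * kw * c_in * c_out


c = 64  # hypothetical channel count, same for input and output

# one 5x5 convolution versus two stacked 3x3 convolutions
full_5x5 = conv_weights(5, 5, c, c)
two_3x3 = 2 * conv_weights(3, 3, c, c)
print(full_5x5, two_3x3, 1 - two_3x3 / full_5x5)

# one nxn convolution versus a 1xn followed by an nx1 convolution
n = 7
full_nxn = conv_weights(n, n, c, c)
asymmetric = conv_weights(1, n, c, c) + conv_weights(n, 1, c, c)
print(full_nxn, asymmetric)
```

Per input-output channel pair, the 5x5 kernel needs 25 weights and the two 3x3 kernels need 18, a reduction of 28%; the asymmetric factorisation replaces n*n weights by 2n.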
Szegedy et al. 53 have also proposed InceptionV3, which extends InceptionV2 further by factorising 7x7 convolutions, using label smoothing, and adding BatchNorm in the auxiliary classifiers. Label smoothing is a type of regularising component added to the loss formula that prevents the network from becoming too confident about a class.
InceptionV3 ranked in one of the top five positions during the initial trials and therefore was used for further analysis.
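Label smoothing, mentioned above as part of InceptionV3, is commonly implemented by mixing the one-hot target with a uniform distribution over the classes. The following sketch shows that formulation; the function name `smooth_labels` and the smoothing factor of 0.1 are illustrative assumptions rather than settings taken from this work.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


# the true class keeps most of the probability mass, the rest is spread evenly
target = smooth_labels([0, 0, 1, 0], epsilon=0.1)
print(target)
```

Because the target for the true class is below 1, the loss no longer rewards pushing the predicted probability of a single class arbitrarily close to 1.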

InceptionResNetV2: The different variants of InceptionNet and ResNet have shown very good performance with relatively low computational costs. With the hypothesis that residual connections would cause Inception network training to accelerate significantly, the authors of the original InceptionNet proposed InceptionResNet 54. In this network, the pooling operation inside the main inception modules was replaced by residual connections. Each Inception block is followed by a filter expansion layer (1x1 convolution without activation), which scales the dimensions of the filters back up before the residual addition, to match the input size. This is one of the networks used in this research, because of its performance on the dataset employed here.

DenseNet: Huang et al. 50 came up with a very simple architecture to ensure maximum information flow between layers of the network. By matching feature map size throughout the network, they connected all the layers directly to all of their subsequent layers, forming a densely connected neural network, known simply as DenseNet. DenseNet improved the information flow between layers through this different connectivity pattern. Unlike many other networks such as ResNet, DenseNets do not sum the output feature maps of a layer with the incoming feature maps but concatenate them.
In the preliminary trials of this study, DenseNet161 came out as a winner in terms of performance.Therefore, in this research DenseNet161 was included.

Interpretability techniques
Interpretability techniques can aid in understanding the reasoning of a network for its predictions. In general, the results of interpretability can be visualised using heatmaps, where higher values indicate a heightened focus. However, this may vary among different interpretability techniques. Typically, the heatmaps are overlaid on top of an input image to understand on which parts of the image the network focused to generate the predictions. The techniques that use a single image at a time for analysis are known as local interpretability techniques. On the other hand, a global interpretability technique pertains to comprehending how the model works in general, that is, the aggregated behaviour of the model based on the distribution of the data 55,56. Several techniques already exist. The methods explored in this research, namely Occlusion, Saliency, Input X Gradient, Integrated Gradients, Guided Backpropagation, DeepLIFT, and Neuron Activation Profiles, are explained briefly in this section.
Occlusion: Occlusion is one of the simplest interpretability techniques for image classification. This technique helps to understand which features of the image steer the network towards a particular prediction, or which parts of the image are the most important for the network to classify it. To answer this question, Zeiler et al. 41 systematically blocked different parts of the input image with a grey square box and monitored the output of the classifier. The grey square is moved across the image in a sliding-window manner, producing many occluded images, which are subsequently fed into the trained network to obtain probability scores for a given class for each mask position.
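The sliding-window procedure can be sketched as follows. The example is self-contained and therefore uses a toy scoring function in place of a trained network; `occlusion_map`, `toy_score`, the window size, the stride, and the grey fill value of 0.5 are all hypothetical choices made for illustration.

```python
def occlusion_map(image, score_fn, window=2, stride=2, fill=0.5):
    """Slide a grey square over `image` and record the drop in the class score.

    image: 2D list of floats; score_fn: callable returning a scalar score.
    Returns a 2D list with one entry per window position.
    """
    h, w = len(image), len(image[0])
    base = score_fn(image)
    heat = []
    for top in range(0, h - window + 1, stride):
        row = []
        for left in range(0, w - window + 1, stride):
            occluded = [r[:] for r in image]  # copy, so the input stays intact
            for i in range(top, top + window):
                for j in range(left, left + window):
                    occluded[i][j] = fill
            # a large drop means the occluded region mattered for the score
            row.append(base - score_fn(occluded))
        heat.append(row)
    return heat


# Hypothetical stand-in for a trained classifier: it only "looks" at the
# top-left 2x2 corner of the image.
def toy_score(img):
    return sum(img[i][j] for i in range(2) for j in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score)
print(heat)
```

Only the window covering the top-left corner changes the score, so the resulting map highlights exactly the region the toy model depends on.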
Saliency: In the context of visualisation, saliency refers to a topological representation of the unique features of an image. Saliency is one of the baseline approaches for the interpretation of deep learning models. The saliency method of Simonyan et al. 42 returns the gradients of the model output with respect to its inputs. Positive values in the gradients indicate that a small increase in the corresponding input value increases the prediction score.
Input X Gradient: Input X Gradient is an extension of the Saliency approach. Similarly to the saliency method of Simonyan et al. 42, this method of Kindermans et al. 57 also takes the gradients of the output with respect to the input, but additionally multiplies the gradients by the input feature values.
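The relation between Saliency and Input X Gradient can be made concrete on a toy model whose gradient is known in closed form. The sketch assumes a single sigmoid unit over three input features, with hypothetical weights; taking the absolute value of the gradient for the saliency map is a common convention and is an assumption here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Analytic gradient of the model output with respect to the input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


w, b = [2.0, -1.0, 0.0], 0.0  # hypothetical model parameters
x = [0.5, 1.0, 3.0]           # hypothetical input features

grad = gradient(x, w, b)
saliency = [abs(g) for g in grad]                      # gradient magnitude
input_x_gradient = [xi * g for xi, g in zip(x, grad)]  # element-wise product
print(saliency, input_x_gradient)
```

The third feature has a large input value but zero weight, so both attributions assign it zero importance, while the product with the input rescales the other two.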
Guided Backpropagation: Guided Backpropagation, also known as guided saliency, is another visualisation technique for deep learning classifiers. Guided backpropagation is a combination of vanilla backpropagation and deconvolution networks (DeConvNet) 43. In this method, only positive error signals are backpropagated, and the negative signals are set to zero while backpropagating through a ReLU unit 58.
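The difference between the three backward rules at a ReLU unit can be written out directly. The function names and example values below are hypothetical; the sketch only shows the masking applied to the incoming gradient, not a full network.

```python
def relu_backward_vanilla(x, grad_out):
    # standard backpropagation: pass the gradient where the forward input was positive
    return [g if xi > 0 else 0.0 for xi, g in zip(x, grad_out)]


def relu_backward_deconvnet(x, grad_out):
    # DeConvNet: pass only positive gradients, ignoring the forward input
    return [g if g > 0 else 0.0 for g in grad_out]


def relu_backward_guided(x, grad_out):
    # guided backpropagation: both conditions must hold
    return [g if (xi > 0 and g > 0) else 0.0 for xi, g in zip(x, grad_out)]


x = [1.0, -1.0, 2.0, -2.0]         # inputs to the ReLU in the forward pass
grad_out = [0.3, 0.4, -0.5, -0.6]  # gradients arriving from the layer above
print(relu_backward_vanilla(x, grad_out))
print(relu_backward_deconvnet(x, grad_out))
print(relu_backward_guided(x, grad_out))
```

Only the first position, with a positive forward input and a positive incoming gradient, survives the guided rule.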
Integrated Gradients: Sundararajan et al. 44 proposed a model interpretability technique that assigns an importance score to each feature of the input by approximating the integral of the gradients of the output with respect to the input, along the path from a given reference (baseline) to that input.
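The approximation of the path integral can be sketched with a Riemann sum on a toy model. The example assumes a single sigmoid unit with hypothetical weights, an all-zero baseline, a straight-line path, and a midpoint rule with 200 steps; none of these settings are taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def gradient(x, w):
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann-sum approximation of the path integral of gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # point on the straight line between the baseline and the input
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


w = [2.0, -1.0, 0.5]          # hypothetical model weights
x = [1.0, 0.5, 2.0]           # hypothetical input
baseline = [0.0, 0.0, 0.0]    # hypothetical reference input
attributions = integrated_gradients(x, baseline, w)
# Completeness: the attributions should sum to f(x) - f(baseline), up to the
# discretisation error of the Riemann sum.
print(sum(attributions), model(x, w) - model(baseline, w))
```

Comparing the sum of the attributions with the change in the model output is a useful sanity check for the number of integration steps.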
DeepLIFT: Deep Learning Important FeaTures or DeepLIFT, proposed by Shrikumar et al. 59, is a method to pixel-wise decompose the output prediction of a neural network on a specific input. This involves backpropagating the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its "reference activation", and then assigns contribution scores based on the difference. DeepLIFT can also reveal dependencies that might be missed by other approaches by optionally assigning separate considerations to positive and negative contributions. Unlike other gradient-based methods, it uses the difference from a reference, which permits DeepLIFT to propagate an importance signal even in situations where the gradient is zero.
Neuron Activation Profiles: The aforementioned interpretability techniques are local methods that help to understand single predictions of a neural network. To investigate model behaviour more generally, a global interpretability technique called Neuron Activation Profiles (NAPs) is employed 60,61. NAPs describe and contrast the activity of the neural network for sets of related inputs, for example, of different classes, using an averaging approach. Initially, the activation values in the layers of interest are obtained by computing a forward pass for every test image. Then, the average feature maps over each respective group are computed to characterise the group-specific activity. In addition to characterising the network activations for a group, further emphasis is given to the differences between the groups. To this end, the average over all groups is subtracted from each group's average. These normalised averaged activation values can be interpreted as the activation difference from the global average. Positive values indicate a characteristically high neuron activation compared to the entire data set, and negative values indicate a comparably low neuron activation. NAP values are particularly useful to identify which activations differ between groups of interest and correspondingly indicate the model's ability to distinguish between the classes according to the activations. When working with image data, visually interpretable plots of NAPs of feature maps can be created. For data that are not visually interpretable, NAPs can be further used for similarity analyses 61 or for dimensionality reduction-based visualisation 62.
In order to obtain useful averaging results, this method requires data in which the objects are at the same location in the images.This alignment is guaranteed through data preprocessing that resizes and crops the original images.
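The averaging and normalisation steps of NAPs can be sketched as below. The sketch assumes that the global average is the mean of the per-group averages (the alternative of averaging over all individual images would weight larger groups more strongly), and the helper names and activation values are hypothetical.

```python
def mean_map(maps):
    """Element-wise mean of a list of equally sized 2D feature maps."""
    n = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) / n for j in range(w)] for i in range(h)]


def activation_profiles(activations_by_group):
    """Per-group average activation minus the average over all group averages."""
    group_means = {g: mean_map(maps) for g, maps in activations_by_group.items()}
    global_mean = mean_map(list(group_means.values()))
    return {
        g: [[v - gv for v, gv in zip(row, grow)] for row, grow in zip(gm, global_mean)]
        for g, gm in group_means.items()
    }


# Hypothetical 1x2 feature maps of a single neuron for two groups of test images
activations = {
    "COVID-19": [[[1.0, 0.0]], [[3.0, 0.0]]],
    "healthy": [[[0.0, 2.0]], [[0.0, 2.0]]],
}
naps = activation_profiles(activations)
print(naps)
```

Positive entries mark positions where a group activates the neuron more strongly than the average over groups, and negative entries mark the opposite.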

Implementation
The models were implemented using PyTorch 63. An interpretability pipeline for PyTorch-based classification models was developed with the help of Captum 64. The code of this project is available on GitHub: https://github.com/soumickmj/diagnoPP. The pipeline was later made part of TorchEsegeta 65.
Training sessions were conducted using Nvidia GeForce 1080 Ti and 2080 Ti GPUs, each with 11GB of memory. The loss was calculated using Binary Cross-Entropy (BCE) with Logits, which combines the sigmoid layer with the BCE loss to achieve better numerical stability than using the sigmoid layer followed by the BCE loss separately. The numerical stability is achieved by using the log-sum-exp trick, which can prevent underflow/overflow errors. The loss was minimised by optimising the model parameters using the Adam optimiser 66, with a learning rate of 0.001 and a weight decay of 0.0001. A manual seed was used to ensure the reproducibility 67 of the models. Automatic Mixed Precision, provided by Apex 68, was used to speed up training and decrease GPU memory requirements.
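The numerical benefit of combining the sigmoid with the BCE loss can be demonstrated for a single logit. The stable expression below is the standard rewriting of the loss that avoids evaluating the exponential of a large positive number; it is a pure-Python illustration with hypothetical function names, not the PyTorch implementation itself.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def bce_naive(logit, target):
    # sigmoid followed by BCE, evaluated separately
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# both versions agree for moderate logits
print(bce_with_logits(0.3, 1.0), bce_naive(0.3, 1.0))
# the stable version stays finite for an extreme logit
print(bce_with_logits(-800.0, 1.0))
```

For the extreme logit, the naive version fails because the intermediate exponential overflows, whereas the stable version only ever exponentiates a non-positive number.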
The interpretability methods were applied to the models using Nvidia Tesla V100 GPUs, each having 32GB of memory. Some of the interpretability techniques could not be used on certain models owing to insufficient GPU memory caused by the complexities of the models.

Data Collection
The CXR images were collected from two public datasets. The first dataset was the COVID-19 image data collection by Cohen et al. 21,69, comprising 236 images of COVID-19, 12 images of COVID-19 and ARDS, 4 images of ARDS, 1 image of Chlamydophila, 1 image of Klebsiella, 2 images of Legionella, 12 images of Pneumocystis, 16 images of SARS, 13 images of Streptococcus, and 5 images without any pathological findings. The second dataset was the Chest X-ray Images (Pneumonia) dataset by Kermany et al. 70,71, which has a total of 1583 images of healthy subjects, 1493 images of viral pneumonia and 2780 of bacterial pneumonia. From this dataset, 500 images of healthy subjects, 250 images of viral pneumonia, and 250 images of bacterial pneumonia were randomly chosen. Fig 1 portrays the final data distribution considered for the work. This CXR image dataset comprises posteroanterior (PA), anteroposterior (AP), and anteroposterior supine (AP supine) radiographs. Whilst the AP view is not the preferred positioning and has disadvantages such as organ overlap that could interfere with network prediction 72, it is a technique commonly used for COVID-19 patients in a coma.
The hierarchical nature of the pathologies can be observed in this combined dataset (see Fig 2). For example, SARS and COVID-19 are subtypes of viral pneumonia. Similarly, Streptococcus, Klebsiella, Chlamydophila, and Legionella are subtypes of bacterial pneumonia, and Pneumocystis is a subtype of fungal pneumonia. Furthermore, viral, bacterial, and fungal pneumoniae are different types of pneumonia. Therefore, a patient having COVID-19 inherently has viral pneumonia. ARDS, which stands for acute respiratory distress syndrome, is a serious lung condition with a high mortality rate 73. It frequently develops alongside pathological conditions like nonpulmonary sepsis, aspiration, or pneumonia 74. Although the respiratory pathologies of ARDS (associated with or without COVID-19) and COVID-19 are similar, COVID-19 has different features that require different patient management, and a patient suffering from both could require additional care [75][76][77]. Therefore, the dataset, which comprises cases where a patient has both COVID-19 and ARDS, is suitable for multilabel classification.

Dataset Preparation
The final dataset was randomly divided into a training set, consisting of 60% of the unique subjects, and a test set, consisting of the remaining 40% of the subjects. Five-fold cross-validation (CV) was conducted to assess the generalisation capabilities of the models. The performance of the models during the 5-fold CV is reported alongside the overall comparison of the classifiers (Table 2). For the interpretability analysis, only the results from the first fold were used, as this yielded the highest micro F1 scores.

Pre-processing
The dataset used for the task comprises X-ray images collected at different centres using different protocols and varying in size and intensity. Therefore, all the images were initially pre-processed to have the same size. To make the image size uniform throughout the dataset, each image was interpolated, employing bicubic interpolation, to have 512 pixels on the longer side. The pixel count on the shorter side was determined so as to keep the aspect ratio of the original image. Subsequently, zero-padding was applied to the shorter side to make that side have 512 pixels, resulting in a 512 x 512 image. Image resizing was followed by percentile cropping, where the image intensity was cropped to the first and 95th percentile, and then the intensity was normalised to the range [0,1]. The percentile cropping normalisation minimises the effect of intensity variation due to non-biological factors.
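The geometry and intensity steps of this pre-processing can be sketched without any imaging library. The bicubic interpolation itself is omitted; the sketch only computes the target shape, the amount of zero-padding, and the percentile-based intensity normalisation. Distributing the padding evenly on both sides and using linear interpolation for the percentiles are assumptions, and all function names are hypothetical.

```python
def resized_shape(height, width, target=512):
    """Scale so that the longer side equals `target`, keeping the aspect ratio."""
    scale = target / max(height, width)
    return max(1, round(height * scale)), max(1, round(width * scale))


def padding(height, width, target=512):
    """Zero-padding (top, bottom, left, right) needed to reach target x target."""
    pad_h, pad_w = target - height, target - width
    return pad_h // 2, pad_h - pad_h // 2, pad_w // 2, pad_w - pad_w // 2


def percentile(values, q):
    """Linear-interpolation percentile of a flat list, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def clip_and_normalise(values, low_q=1, high_q=95):
    """Clip intensities to the given percentiles, then rescale to [0, 1]."""
    lo, hi = percentile(values, low_q), percentile(values, high_q)
    if hi == lo:
        return [0.0 for _ in values]
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


h, w = resized_shape(1024, 768)
print((h, w), padding(h, w))
pixels = clip_and_normalise([float(v) for v in range(101)])
print(min(pixels), max(pixels))
```

Clipping before rescaling means that a few extreme intensities, for example from burned-in annotations, do not compress the range used by the lung tissue.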

Classification Setup
In this multilabel classification setup, the model was trained to identify the disease and also its supertypes. Therefore, when a network encounters an image of a COVID-19 patient, it should ideally predict it as pneumonia, viral pneumonia, and COVID-19. When a network encounters an image of a patient having multiple pathologies (in this dataset, some patients have both COVID-19 and ARDS), the network should ideally classify it as pneumonia, viral pneumonia, COVID-19, as well as ARDS. Interpretability analysis was conducted for each label of each image in the test set.
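The expansion of a finding into its supertypes can be sketched as a walk up a parent map. The hierarchy below follows the relations described in the text; the mapping name `PARENT`, the helper functions, and the reduced class list with its ordering are hypothetical and serve only as an illustration of how multilabel targets could be built.

```python
# Parent of each label, following the hierarchy described in the text
PARENT = {
    "COVID-19": "viral pneumonia",
    "SARS": "viral pneumonia",
    "Streptococcus": "bacterial pneumonia",
    "Klebsiella": "bacterial pneumonia",
    "Chlamydophila": "bacterial pneumonia",
    "Legionella": "bacterial pneumonia",
    "Pneumocystis": "fungal pneumonia",
    "viral pneumonia": "pneumonia",
    "bacterial pneumonia": "pneumonia",
    "fungal pneumonia": "pneumonia",
}


def expand_labels(findings):
    """Add every supertype of each finding by walking up the hierarchy."""
    labels = set()
    for f in findings:
        while f is not None:
            labels.add(f)
            f = PARENT.get(f)
    return labels


def to_target(findings, classes):
    """Binary multilabel target vector in the order given by `classes`."""
    active = expand_labels(findings)
    return [1 if c in active else 0 for c in classes]


# hypothetical subset and ordering of the output labels
CLASSES = ["pneumonia", "viral pneumonia", "COVID-19", "ARDS"]
print(to_target(["COVID-19", "ARDS"], CLASSES))
```

ARDS has no parent in this map because the text does not place it under any type of pneumonia, so it only contributes its own label.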

Evaluation Metrics
In a multiclass setting, classifiers are generally evaluated with respect to precision, recall and F1 metrics. In a multilabel classification setting, these metrics are computed in two manners: macro and micro averaging 78.
As shown in Eq. 1, the macro-based metrics are first computed individually from the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) of each class/pathology and then averaged, where P denotes the number of classes and Metric ∈ {precision, recall, F1}.
$$\mathrm{Metric}_{\mathrm{macro}} = \frac{1}{P} \sum_{i=1}^{P} \mathrm{Metric}(TP_i, TN_i, FP_i, FN_i) \quad (1)$$
This manner of computing the metrics treats each pathology equally, and the metric values are therefore significantly influenced by the rarer labels.
In micro-based metrics, the TP, TN, FP, and FN of the individual classes/pathologies are first summed over all classes, and the metric is then computed from the pooled counts, as shown in Eq. 2.
$$\mathrm{Metric}_{\mathrm{micro}} = \mathrm{Metric}\left(\sum_{i=1}^{P} TP_i, \sum_{i=1}^{P} TN_i, \sum_{i=1}^{P} FP_i, \sum_{i=1}^{P} FN_i\right) \quad (2)$$
The micro-based metrics thus portray the aggregated contribution of all classes/pathologies, and the influence of the predictions from the minority classes becomes diluted among the contributions from the majority classes. This makes the micro-based metrics a suitable measure for estimating the overall performance of the classifier, particularly in scenarios involving imbalanced datasets. Given the significant imbalance in the utilised dataset, micro-based metrics have been considered for classifier evaluation 79.
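The difference between the two averaging schemes can be seen in a short worked example for the F1 score. The per-class counts below are invented for illustration (one frequent class that is classified well and one rare class that is classified poorly), and the function names are hypothetical.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """counts: list of (TP, FP, FN) per class; average of the per-class F1."""
    return sum(f1(*c) for c in counts) / len(counts)


def micro_f1(counts):
    """Pool TP, FP and FN over the classes, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


# Hypothetical counts: a frequent class classified well, a rare class poorly
counts = [(90, 5, 5), (1, 2, 3)]
print(macro_f1(counts), micro_f1(counts))
```

The macro score is pulled down strongly by the rare class, whereas the micro score stays close to the performance on the frequent class, which is the dilution effect described above.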

Overall comparisons of the classifiers
Fig. 3a shows that the overall performance of the classifiers over pathologies was similar. Among the non-Ensemble models, DenseNet161 performed the best in all metrics. Although InceptionResNetV2 was the most complex model among all, it yielded the poorest recall, which implies that the ability of the model to find pathology-affected cases was poor compared to less complex models. ResNet18 was the least complex model among the non-Ensemble classifiers, ranking second to DenseNet161 with respect to micro F1. The ensemble produced the best results and the minimum variance in the 5-fold cross-validation, as presented in Table 2. Another interesting observation concerns inactive feature maps (dead neurons). DenseNet161 had the highest percentage of such feature maps, as high as 99.22% for the middle layer. Although InceptionResNetV2 was the most complex, it had fewer inactive feature maps than DenseNet161. ResNets, the least complex models in this study, had the lowest percentage of inactive feature maps (48.44% and 60.16% for the middle layers of ResNet18 and ResNet34, respectively).

Interpretability of models
In the first subsection, different interpretability techniques are explored for the different classifiers with respect to the different diseases. The second subsection discusses how the different models performed for specific pathologies. All the given interpretability analyses (except those using the global method NAP) were performed for the specific input CXR image that is shown as the underlay. In the interpretability analysis using NAP, all images from the test set were used, as this method performs a global analysis.

Pathology based comparisons of local interpretability techniques for models
To visualise the results for a specific case, the models were interpreted using the local methods occlusion, saliency, input X gradient, guided backpropagation and integrated gradients; the results are shown in Fig. 7, Fig. 8 and Fig. 9. Apart from occlusion, the other interpretability techniques failed to run for DenseNet161 due to GPU memory limitations. For DeepLIFT, ResNets faced an additional challenge due to the ReLU operations used "in place" in those models. The models have to be modified to run DeepLIFT on them.
According to the clinical findings of the COVID-19 image data provided by Cohen et al. 21, multiple abnormalities of the lungs were located in the upper and lower pulmonary field, as well as the upper left part of the lung. The models classified this case as COVID-19, pneumonia, and viral pneumonia, responding to the pathology of lung infection. It can be seen that the focus area of the models for COVID-19 differs from the focus area for pneumonia and viral pneumonia. DenseNet161 and InceptionResNetV2 focused primarily on the right lung. InceptionV3, ResNet18, and ResNet34 covered both the right and left parts, including not only the lesion but also irrelevant regions outside the lung.
Local interpretability methods suffered mainly from false positives. In some cases, the occlusion did not detect the affected areas for DenseNet161 and InceptionResNetV2 and falsely marked the normal areas as positive, as shown in Fig. 7. Furthermore, for InceptionV3, it detected some positive patches, but falsely detected more areas as positive. Finally, for ResNets, occlusion was in general the most sensitive to positive areas and produced fewer false negatives. Guided backpropagation, saliency, integrated gradients, and DeepLIFT in general falsely detected normal lung areas as positive: they picked up normal bronchovascular markings as positive and did not mark the actual affected areas. The input X gradient detected some positive areas correctly for ResNet18, but falsely marked many normal areas. In general, the representations learnt by the ResNet models captured the most accurate regions as seen from most interpretability techniques, with fewer false negatives. Among the local interpretability techniques, occlusion provided the best guidance in finding clinically important areas, which were confirmed by medical experts.

Intense Interpretability
The failure case of the best performing model for COVID-19 classification: Although DenseNet161 performed the best among all models, it gave false negatives for some of the COVID-19 patients, whereas the rest of the models, including the ensemble, predicted these cases correctly. The occlusion results of the models can be observed in Fig. 4. This figure shows that DenseNet161 and InceptionResNetV2 did not focus on any affected areas, but rather on other regions (e.g. the normal right hilum). InceptionV3, ResNet18, and ResNet34 mainly focused on affected areas with good sensitivity. InceptionV3, however, had more false positives than the ResNets (e.g. outside the right lung).
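Why a majority-voting ensemble can still be correct when its strongest member fails can be seen in a small sketch. The per-label voting rule below is an assumption about how majority voting extends to the multilabel setting, and the votes are made-up values for illustration, not results from this study.

```python
from typing import List


def majority_vote(predictions: List[List[int]]) -> List[int]:
    """Per-label majority vote over binary multilabel predictions.

    predictions[m][j] is 1 if model m predicts label j, else 0.
    A label is kept only if strictly more than half of the models predict it.
    """
    n_models = len(predictions)
    n_labels = len(predictions[0])
    return [1 if 2 * sum(model[j] for model in predictions) > n_models else 0
            for j in range(n_labels)]


# Illustrative votes for the labels [COVID-19, pneumonia] (made-up values).
votes = [
    [0, 1],  # DenseNet161: misses COVID-19 in this toy case
    [1, 1],  # InceptionResNetV2
    [1, 0],  # InceptionV3
    [1, 1],  # ResNet18
    [1, 1],  # ResNet34
]
ensemble_prediction = majority_vote(votes)
```

With five models there are no ties, so the strict-majority rule is unambiguous; for an even number of models a tie-breaking rule would have to be chosen.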
Another analysis was performed with the CXR images of a 70-year-old woman who had had three days of cough, myalgia, and fever, without any recent overseas travel. A series of chest radiographs was obtained before confirmation of coronavirus infection, and follow-ups were done at three days, seven days, and nine days. The series shows the progression of radiographic changes. In the image prior to COVID-19, both models (DenseNet161 and ResNet18) falsely detected normal areas as relevant features. In the image of day 3, the doctor could not visually detect any affected area, although this image was taken on the third day after testing positive for COVID-19. This might indicate that when no substantial affected area can be seen in the image (i.e., day 3), the model might have been picking up some mild markers that cannot be confirmed visually. In the images of days seven and nine, DenseNet161 did not focus correctly on the affected regions and had both false positives and false negatives, while ResNet18 focused on the affected regions more accurately.
ResNet18 can be considered the overall winner, as it yielded high evaluation scores despite having the fewest network parameters. Furthermore, its interpretability analysis showed the location of the lesion, which allows this network to be used for follow-up or severity estimation, as illustrated in Fig. 5.
To find potentially exploitable features, the input averages (input layer NAPs) are first investigated in Fig. 6 (left). It can be observed that pneumonia images cover a smaller portion of the height dimension than the images of COVID-19 or healthy subjects. This means that there are dark top and bottom regions in the majority of pneumonia images. Based on this observation, the authors hypothesised that a model might exploit this non-biological feature.
To investigate this hypothesis, the feature map NAPs of DenseNet161 and ResNet18 are visualised for an early and a deep layer of each network, chosen as layers at representative depths. For DenseNet161, the ReLU-activated outputs of the first and last dense blocks were chosen. As representative layers of ResNet18, the outputs after the first and last residual connections were selected. For these layers, two exemplary feature map NAPs among those with the highest activity differences between the observed classes are shown in Fig. 6. In DenseNet161, one can clearly observe activation differences in both the border regions and the lung. For example, COVID-19 images are easy for the model to distinguish based on the activation difference corresponding to the absence of dark regions at the bottom and top of the images. In the deeper layer, the activation difference patterns do not resemble any interpretable structure, neither in the lungs nor in the lower and upper regions. This may explain why DenseNet161 performs well overall despite giving false negative COVID-19 results: instead of detecting COVID-19-specific features, it likely exploits features of the data that are correlated with, but not related to, the pathology. However, it does not appear that DenseNet161 uses the dark border regions as the main distinguishing factor. ResNet18, in contrast, is less likely to detect biologically irrelevant features. Although there are activation differences in the top and bottom areas of the images in the early layers, in most deep-layer feature maps the groups can be distinguished from each other most clearly by neuron activity in the (right) lung regions.
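The kind of group-wise comparison that a feature map NAP expresses can be sketched as follows. This is a deliberately simplified illustration, not the NAP implementation used in this study: it only averages the feature maps per group and subtracts the mean over the groups, so that positive values mean higher activity for a group than for the others (red in Fig. 6) and negative values mean lower activity (blue). The function names and toy data are assumptions made for the example.

```python
from typing import Dict, List

FeatureMap = List[List[float]]


def mean_map(maps: List[FeatureMap]) -> FeatureMap:
    """Element-wise mean of equally sized 2D feature maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]


def activation_profiles(groups: Dict[str, List[FeatureMap]]) -> Dict[str, FeatureMap]:
    """Group-mean feature maps, normalised by the mean over all groups."""
    group_means = {name: mean_map(maps) for name, maps in groups.items()}
    overall = mean_map(list(group_means.values()))
    return {
        name: [[gm[r][c] - overall[r][c] for c in range(len(gm[0]))]
               for r in range(len(gm))]
        for name, gm in group_means.items()
    }


# Toy 2x2 feature maps of one hypothetical neuron (made-up values).
toy_groups = {
    "covid": [[[0.0, 0.0], [2.0, 2.0]], [[0.0, 0.0], [4.0, 4.0]]],
    "healthy": [[[0.0, 0.0], [1.0, 1.0]]],
}
profiles = activation_profiles(toy_groups)
```

In this toy example the bottom row is more active for the "covid" group and less active for the "healthy" group, while the top row does not separate the groups at all. Averaging the group means, rather than all images, keeps a large group from dominating the reference.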

Discussion
The literature review showed that the diagnosis of COVID-19 has been treated as a multiclass classification task rather than a multilabel classification task. The datasets used in previous works vary in terms of the amount of data used for the classification task. In 7, the authors created a balanced dataset by combining the 50 COVID cases with 50 healthy cases from another dataset and reported the highest mean specificity score of 0.90 using InceptionV3. The others 6,8,35 performed a multiclass classification task on different imbalanced datasets using X-rays, and achieved a maximum mean specificity of 0.989, 0.979, and 0.971, respectively. In this work, InceptionResNetV2 achieved the highest specificity of 0.975, comparable to previous studies. However, in this research, the authors have used a different dataset, train-test split, and preprocessing techniques compared to previous works, which makes a direct comparison of the results with previous studies unfair.
It was observed that the less complex models were more interpretable, while having fewer dead neurons than the more complex ones. DenseNet161, which resulted in the highest F1 score, had the highest number of dead neurons and also had the worst focus areas according to the interpretability methods. The model that resulted in the second-best F1 score, ResNet18, was the least complex model in this study, while also having the best focus areas as indicated by the interpretability methods. This was further confirmed by a global interpretability method, NAPs, which showed that ResNet18 is less likely to detect biologically irrelevant features. It should be noted that in some cases, the network predicted the presence of COVID-19, while the doctors did not report any abnormalities.
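One common way to quantify dead neurons is to count the units whose post-ReLU activation is zero for every input in a dataset. The sketch below uses that definition; it is an assumption for illustration, since the exact criterion used in this study is not stated in this section, and the function name, threshold, and activations are hypothetical.

```python
from typing import List


def dead_neuron_fraction(activations: List[List[float]], eps: float = 0.0) -> float:
    """Fraction of neurons that never activate above eps on any sample.

    activations[i][j] is the post-ReLU output of neuron j for input sample i.
    """
    n_neurons = len(activations[0])
    dead = sum(
        1 for j in range(n_neurons)
        if all(sample[j] <= eps for sample in activations)
    )
    return dead / n_neurons


# Toy activations of four neurons on three samples (made-up values).
toy_activations = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0, 0.0],
]
fraction = dead_neuron_fraction(toy_activations)
```

Here neurons 0 and 2 never fire, so half of the units count as dead. A neuron that is silent on some samples but active on others, such as neuron 1, is not counted.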
There were a couple of cases where the network detected both viral and bacterial pneumonia. According to Morris et al. 80 and Shigeo et al. 81, the induction of viral infection could lead to secondary bacterial infection and increase the severity of symptoms. Though such cases were considered mispredictions for the current dataset based on the available labels, one could argue that the network was able to detect such instances.
The main motivation for performing a multilabel classification rather than a multiclass classification was to be able to predict multiple pathologies from the images if they were present. It was observed that all networks, including the ensemble, were able to correctly predict both COVID-19 and ARDS for the images in which both pathologies were present.
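The difference from a multiclass setup can be made concrete with a short sketch: each label receives an independent sigmoid decision, so several pathologies can be predicted for one image, and micro-averaged metrics pool the counts over all labels and samples. The label subset, the 0.5 threshold, the logits, and the function names are illustrative assumptions and are not taken from this study.

```python
import math
from typing import List, Tuple

LABELS = ["COVID-19", "pneumonia", "ARDS"]  # illustrative subset of labels


def predict_multilabel(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Independent sigmoid per label; more than one label can be positive."""
    return [1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in logits]


def micro_counts(y_true: List[List[int]],
                 y_pred: List[List[int]]) -> Tuple[int, int, int, int]:
    """Pool TP, FP, FN, TN over all samples and all labels."""
    tp = fp = fn = tn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    tp, fp, fn, _ = micro_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_specificity(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    _, fp, _, tn = micro_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


# Made-up logits for three images; the first one carries all three labels.
toy_logits = [[2.0, 1.5, 3.0], [-1.0, 0.5, -2.0], [-3.0, -2.0, -1.0]]
toy_truth = [[1, 1, 1], [0, 1, 0], [1, 0, 0]]
toy_preds = [predict_multilabel(row) for row in toy_logits]
f1 = micro_f1(toy_truth, toy_preds)
specificity = micro_specificity(toy_truth, toy_preds)
```

A softmax followed by an argmax would force exactly one label per image, so the first toy image could never be assigned COVID-19 and ARDS at the same time.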
Lastly, this study also showed that the models could classify lung pathologies from CXR images even when unwanted objects, such as annotations or labels, obscured parts of the radiographs.
used a ResNet-based model to classify COVID-19 and non-COVID-19 patients. They achieved a sensitivity of 96% and a specificity of 70.7%. Ghoshal et al. 36 presented a Dropweight-based Bayesian Convolutional Neural Network (BCNN) for CXR-based COVID-19 diagnosis. They found a drastic correlation between the accuracy of the prediction and the uncertainty of the model. Awareness of diagnosis decision uncertainty could endorse deep learning-based applications to be used more and more in clinical routine. Singh et al.

Figure 1. Distribution of CXR images for each infection type in the dataset

Figure 2. A hierarchy of pathological labels used in this study

Figure 3. Comparison of the classifiers based on micro metrics (a) and their performance for the different classes (b-f)

Figure 4. A case study of DenseNet161 failure using occlusion. The affected areas in the lungs have been annotated by medical experts.

Figure 5. Comparison using occlusion between DenseNet161 and ResNet18 for a specific COVID-19 follow-up case. The affected areas in the lungs have been annotated by medical experts.

Figure 6. Average input images and feature map NAPs in different models and layers for different pathologies and healthy subjects. Blue indicates lower activation of the respective neuron for this group compared to the other groups, red indicates higher activity.

Figure 7. Comparison of various interpretability techniques with respect to models for COVID-19 predictions against the manual annotation of the affected areas by medical experts.

Table 1. Number of trainable parameters in each model

Table 2. Performance of all the classifiers with respect to micro-based metrics over 5 folds