Object or Background: An Interpretable Deep Learning Model for COVID-19 Detection from CT-Scan Images

The new strains of the pandemic COVID-19 are still looming. It is important to develop multiple approaches for the timely and accurate detection of COVID-19 and its variants. Deep learning techniques are well proven for their efficiency in providing solutions to many social and economic problems. However, the transparency of the reasoning process of a deep learning model involved in a high-stakes decision is a necessity. In this work, we propose an interpretable deep learning model, Ps-ProtoPNet, to detect COVID-19 from medical images. Ps-ProtoPNet classifies the images by recognizing the objects in the images rather than their background. We demonstrate our model on a dataset of chest CT-scan images. The highest accuracy that our model achieves is 99.29%.


Introduction
The pandemic COVID-19 is looming as a grave menace to the world's population while several of its new strains are being identified. Some vaccines for COVID-19 have been developed. The detection of the virus is usually done with molecular tests, that is, tests that look for the virus by detecting the presence of the virus's RNA. The molecular tests include RT-PCR, CRISPR, isothermal nucleic acid amplification, digital polymerase chain reaction, microarray analysis, and next-generation sequencing [2]. The presence of the virus can also be detected from medical images, such as chest X-ray and CT images. Although RT-PCR is still the gold standard for COVID-19 testing, deep learning techniques that identify the virus from medical images can also be helpful in certain circumstances, such as the unavailability of RT-PCR kits. A deep learning model can also be used for pre-screening before RT-PCR testing. Many models have been proposed to detect COVID-19 from medical images, see [3][4][5][6][7][8][9][10][11][12][13][14][15]. However, these models lack interpretability/transparency in the reasoning process behind their predictions. So, we propose an interpretable deep learning model, the pseudo prototypical part network (Ps-ProtoPNet), and evaluate it on a dataset of CT-scan images, see Section 2.4. Ps-ProtoPNet is closely related to ProtoPNet [16], Gen-ProtoPNet [17] and NP-ProtoPNet [18], but strikingly different from these models.
A prototype represents a patch of an image. To classify a test image, ProtoPNet compares the different parts of the test image with the learned prototypes of images from all classes. The decision is then made based on a weighted combination of similarity scores [16]. To calculate the similarity scores between learned prototypes (with square spatial dimensions 1 × 1) and parts of the test image, ProtoPNet and NP-ProtoPNet use the L2 distance function, whereas Gen-ProtoPNet uses a generalized version of L2.
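The patch-versus-prototype comparison described above can be illustrated in a few lines. The following is a hedged NumPy sketch (not the authors' implementation) of the similarity score for a 1 × 1 prototype: the squared L2 distance to every latent location, inverted with the log ratio used by ProtoPNet-style models; the function name, shapes, and epsilon are assumptions.

```python
import numpy as np

# Hedged sketch (not the authors' code): similarity between a 1 x 1
# prototype and every spatial location of a latent feature map, using the
# squared L2 distance and a log-based inversion (ProtoPNet-style).
def similarity_map(latent, prototype, eps=1e-4):
    """latent: (D, H, W) feature map; prototype: (D,) vector (1 x 1 prototype)."""
    d2 = ((latent - prototype[:, None, None]) ** 2).sum(axis=0)
    return np.log((d2 + 1.0) / (d2 + eps))   # large where the distance is small

rng = np.random.default_rng(0)
z = rng.normal(size=(512, 7, 7))
p = z[:, 3, 4].copy()            # a prototype taken from one latent location
sims = similarity_map(z, p)
score = sims.max()               # the prototype's similarity score
assert int(sims.argmax()) == 3 * 7 + 4   # most similar at the source location
```

Because the prototype was cut from location (3, 4), the distance there is exactly zero and the similarity map peaks at that location.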
In this work, we present a theorem that quantifies the impact of a change in the weights of the dense layer on the logits, see Theorem 1. Ps-ProtoPNet chooses negative connections between the similarity scores and the logits of incorrect classes, as suggested by the theorem. Also, our model uses prototypes that can have square as well as rectangular spatial dimensions.
A model should classify an image of an object by identifying the object in the image instead of the background of the object. A model that uses prototypes of small spatial dimensions (1 × 1) can classify an image just on the basis of the background and give higher accuracy with a wrong reasoning process. For example, most parts of the images of birds of a sea species are not similar to any patch of the images of birds of a jungle species, so the images from these two classes can be classified on the basis of backgrounds. In another scenario, images of birds of different sea bird species can share the same water background over most of the image. Therefore, a model with prototypes of small spatial dimensions (1 × 1) can wrongly classify the images just on the basis of the background of the birds. On the other hand, the use of prototypes with dimensions equal to the dimensions of an image can also reduce the accuracy, because there can be only a few images that are similar to the whole image even though their parts can be similar. So, we need to use optimum spatial dimensions for the prototypes. To identify an image that has not been encountered before, humans may compare patches of the image with patches of images of known objects. Our model's reasoning is inspired by this behavior, and the comparison of image parts with learned prototypes is integral to the reasoning process of the model. That is, a new image is compared with the learned prototypes from all classes, and it is classified to the class whose prototypes are most similar to parts of the image. We have three classes of images: Covid, Normal and Pneumonia. Therefore, a COVID-19 CT image is distinguished from the pneumonia CT images based on the greater similarity of parts of the image with the prototypes.

Related Work
Numerous perspectives have emerged to explain convolutional neural networks, including post hoc interpretability analysis. A neural network with post hoc analysis is interpreted after the classifications are made by the model. Activation maximization [19][20][21][22][23][24][25], deconvolution [26], and saliency visualization [23,27,28,29] are some forms of the post hoc analysis approach. Nevertheless, these techniques do not shed light on the reasoning process with transparency. Another approach to make the reasoning process of neural networks clear is attention-based interpretability, which includes class activation maps (CAM) and part-based models. In this approach, a model aims to point out the parts of a test image that are its centers of attention [30][31][32][33][34][35][36][37][38][39][40][41]. However, these models do not point out the prototypes that are similar to parts of the test image.
Li et al. [42] developed a model that uses prototypes of the size of a whole image to find the similarity scores. A substantial improvement over this work was made by Chen et al. with the development of their model ProtoPNet [16]. The models Gen-ProtoPNet [17] and NP-ProtoPNet [18] are close variants of ProtoPNet.

Data
Many datasets of medical images are publicly available [43][44][45]. We used the dataset of chest CT-scan images of normal people, COVID-19 patients and pneumonia patients [44]. This dataset has 143,778 training images and 25,658 test images. The training dataset consists of 35,996, 25,496 and 82,286 CT-scan images of normal people, pneumonia patients and COVID-19 patients, respectively. The test dataset consists of 12,245, 7395 and 6018 CT-scan images of normal people, pneumonia patients and COVID-19 patients, respectively. We resized the images to the dimensions 224 × 224 as required by the base models. We put these images into three classes: Covid (first class), Normal (second class) and Pneumonia (third class).

Working Principle and Novelty of Ps-ProtoPNet
ProtoPNet classifies an image on the basis of a weighted combination of similarity scores [16]. For each class, a fixed number of prototypes is selected; we select 10 prototypes for each class. The model calculates the Euclidean distance of each prototype from each latent patch of the test image with spatial dimensions 1 × 1. These distances are then inverted, and the maximum of the inverted distances is called the similarity score of the prototype. Thus, for a given image, only one similarity score per prototype is obtained. In the dense layer, these similarity scores are multiplied with the weights to calculate the logits. During the training process, ProtoPNet performs a convex optimization of the last layer to make certain weights zero [16].
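The weighted combination of similarity scores can be sketched as a single matrix product. The numbers below are illustrative, not the trained model's weights; the shapes follow the setting described in this paper (3 classes, 10 prototypes per class).

```python
import numpy as np

# Toy sketch of the dense layer (illustrative values, not trained weights):
# 30 similarity scores (3 classes x 10 prototypes) are combined linearly
# into 3 logits, with weight 1 for a class's own prototypes and a negative
# weight for all other classes' prototypes.
rng = np.random.default_rng(1)
scores = rng.uniform(0.0, 0.05, size=30)   # one similarity score per prototype
m_w = np.full((3, 30), -1.0)               # negative connections to incorrect classes
for k in range(3):
    m_w[k, k * 10:(k + 1) * 10] = 1.0      # positive connections to the own class
logits = m_w @ scores
pred = int(np.argmax(logits))              # predicted class index
```

With this weight pattern each score contributes once with weight +1 and twice with weight −1, so high similarity to a class's own prototypes raises that class's logit while lowering the others.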
Theorem 1 finds the impact of the change in the weights on the logits. Therefore, along with the use of prototypes with spatial dimensions bigger than 1 × 1, Ps-ProtoPNet uses negative weights for the similarity scores that connect to incorrect classes. Thus, for a given CT-scan image as in Figure 1, Ps-ProtoPNet identifies the parts of the image where it thinks that this part of the image looks like that prototypical part, and this part of the image does not look like that prototypical part. In addition to the positive reasoning process, Ps-ProtoPNet does not do the convex optimization of the last layer, to retain the impact of the negative reasoning process on the image classification, whereas the ProtoPNet model emphasizes the positive reasoning process. The non-optimization of the last layer enabled us to write Theorem 1, because it ensures that the weights of the last layer do not change during the training process. It also reduces the training time considerably.

Figure 1. For a given CT-scan image, Ps-ProtoPNet identifies the parts of the image where it thinks that this part of the image looks like that prototypical part, and this part of the image does not look like that prototypical part.

Ps-ProtoPNet Architecture
In this section, we introduce and explain the architecture and the training procedure of our model Ps-ProtoPNet in the context of CT-scan images.
We construct our network over the state-of-the-art models VGG-16, VGG-19 [46], ResNet-34, ResNet-152 [47], DenseNet-121, and DenseNet-161 [48]. In this paper, these models are called baseline or base models. The base models were pretrained on ImageNet [49]. In Figure 2, we see that the model comprises the convolutional layers of one of the above base models followed by an additional 1 × 1 convolutional layer (we denote these convolutional layers together by f), which are in turn followed by a generalized [50,51] convolutional layer p_p of prototypical parts and a dense layer h with weight matrix m_w. The dense layer does not have any bias. We denote the parameters of f by w_conv. The activation function sigmoid is used for the additional convolutional layer.
We provide an explanation of our model with the base model VGG-16. For an input image x, let f(x) be the output of the convolutional layers f. The shape of f(x) is 512 × 7 × 7. Let P^k = {p^k_l}_{l=1}^{m} be the set of prototypes of a class k and P = {P^k}_{k=1}^{n} the set of prototypes of all classes, where m is the number of prototypes for each class and n is the total number of classes. In our case, m = 10 and n = 3, and the hyperparameter m = 10 is chosen randomly. For example, the prototypes p^1_1, p^1_2, ..., p^1_10 belong to the first class (Covid). The shape of each prototype is 512 × h × w, where 1 × 1 < h × w < 7 × 7, that is, h and w are neither simultaneously equal to 1 nor to 7. Hence, every prototype can be considered a representation of some prototypical part of some CT-scan image. As explained in Section 2.3, Ps-ProtoPNet calculates the similarity scores between an input image and the prototypical parts p^1_1-p^1_10, p^2_1-p^2_10 and p^3_1-p^3_10, see Figure 2. Note that the similarity score of the prototype p^1_1 (0.03955200) is greater than the similarity scores of p^2_1 (0.00021837) and p^3_10 (0.00023386). The complete list is given in the similarity score matrix S, see Section 2.6. The source images of the prototypes p^1_1, p^2_1 and p^3_10 are given in the third column of Figure 2. The model keeps track of the spatial relation between the convolutional output and the prototypical parts, and upsamples the parts to the size of the input image to point out the patches on the source images that correspond to the prototypes. The rectangles in the source images mark the parts of the source images from which the prototypical parts are taken. In the dense layer h, the matrix S is multiplied with m_w to obtain the logits. The logits for the first, second and third class are 0.5744, −0.5790 and −0.5787, respectively.
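The constraint on the prototype shapes stated above can be made concrete with a short enumeration; this is an illustrative check, not part of the model itself.

```python
# Sketch of the prototype-shape constraint stated above: the spatial
# dimensions h x w range over 1..7, but h and w are neither simultaneously
# equal to 1 nor simultaneously equal to 7.
allowed = [(h, w) for h in range(1, 8) for w in range(1, 8)
           if (h, w) != (1, 1) and (h, w) != (7, 7)]
assert (3, 4) in allowed                           # a rectangular shape is allowed
assert (1, 1) not in allowed and (7, 7) not in allowed
assert len(allowed) == 47                          # 49 shapes minus the two excluded
```

Only the two extreme shapes are excluded: 1 × 1 (pure background risk) and 7 × 7 (whole-image prototypes).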

The Training of Ps-ProtoPNet
We use a generalized version d of the distance function L2 (Euclidean distance). We consider the baseline VGG-16 to present d in this section. For a given image x, let z = f(x). The shape of f(x) is 512 × 7 × 7, where 512 is the depth of f(x) and 7 × 7 are the spatial dimensions of f(x). Let p be a prototype of the shape 512 × h × w, where 1 ≤ h, w ≤ 7, but h and w are neither simultaneously equal to 1 nor to 7. Since p can be any prototype of any class, p carries no subscript or superscript. The output z of the convolutional layers has (8 − h)(8 − w) patches of dimensions h × w. Hence, the square of the distance d(Z_ij, p) between the prototype p and the (i, j)-th patch Z_ij of z is:

d^2(Z_{ij}, p) = \sum_{a=1}^{h} \sum_{b=1}^{w} \sum_{c=1}^{512} (z_{(i+a-1)(j+b-1)c} - p_{abc})^2. (1)

For prototypes of spatial dimension 1 × 1, Equation (1) reduces to d^2(Z_{ij}, p) = \sum_{c=1}^{512} (z_{ijc} - p_{11c})^2, which is the square of the Euclidean distance between the prototype p and a patch of z, where p_{11c} = p_c. Therefore, the distance function d is a generalization of L2. The prototypical unit p_p calculates the following:

p_p(z) = \max_{Z \in patches(z)} \log\left(\frac{d^2(Z, p) + 1}{d^2(Z, p) + \epsilon}\right). (2)

Equation (2) tells us that a prototype p is more similar to the input image x when the distance between p and the nearest latent patch of x is smaller. The two training steps of our model are as follows.
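The generalized distance over rectangular patches can be sketched directly; the following NumPy sketch (shapes assumed from the text, not the authors' code) computes the squared distance between a prototype and every patch of the latent output.

```python
import numpy as np

# Hedged sketch of the generalized L2 distance: the squared distance
# between a (D, h, w) prototype and each of the (8 - h)(8 - w) patches of a
# (D, 7, 7) latent output.
def patch_distances(z, p):
    D, H, W = z.shape
    _, h, w = p.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = ((z[:, i:i + h, j:j + w] - p) ** 2).sum()
    return out

rng = np.random.default_rng(2)
z = rng.normal(size=(512, 7, 7))
p = z[:, 2:5, 1:5].copy()           # a 3 x 4 prototype cut from z itself
d2 = patch_distances(z, p)
assert d2.shape == (5, 4)           # (8 - 3) x (8 - 4) patches
assert d2.min() == d2[2, 1] == 0.0  # exact match at the source patch
```

Because the prototype was cut from z at position (2, 1), the distance there is exactly zero, matching the reduction to ordinary L2 when the patch coincides with the prototype.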
2.5.1. Optimization of All Layers before the Dense Layer

Suppose X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} are the sets of training images and corresponding labels, respectively. Our objective function is:

\min_{P, w_{conv}} \frac{1}{N} \sum_{i=1}^{N} CrsEnt(h \circ p_p \circ f(x_i), y_i) + \lambda_1 ClstCst + \lambda_2 SepCst, (3)

where ClstCst and SepCst are:

ClstCst = \frac{1}{N} \sum_{i=1}^{N} \min_{j: p_j \in P^{y_i}} \min_{z \in patches(f(x_i))} d^2(z, p_j), (4)

SepCst = -\frac{1}{N} \sum_{i=1}^{N} \min_{j: p_j \notin P^{y_i}} \min_{z \in patches(f(x_i))} d^2(z, p_j). (5)

Equation (4) tells us that a decrease in the cluster cost (ClstCst) clusters prototypes around their respective classes, whereas Equation (5) suggests that a decrease in the separation cost (SepCst) keeps prototypes away from their incorrect classes [16]. A drop in the cross entropy leads to improved classification, see the objective function (3). The hyperparameters λ_1 and λ_2 are selected from the set {0.4, 0.5, 0.7, 0.8, 0.9} using cross validation. Since m_w is the weight matrix of the last layer, m_w^{(i,j)} is the weight assigned to the connection between the similarity score of the j-th prototype and the logit of the i-th class.
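The cluster and separation costs for a single image can be sketched from the minimum squared patch distances to each prototype; the array below is illustrative, and the helper name is an assumption.

```python
import numpy as np

# Hedged sketch of the per-image cluster and separation terms described
# above: ClstCst takes the minimum over the image's own class prototypes,
# SepCst the (negated) minimum over all other classes' prototypes.
def clst_sep_costs(min_d2, label):
    """min_d2: (n_classes, m) array of min squared patch distances to prototypes."""
    clst = min_d2[label].min()                      # own-class prototypes
    sep = -np.delete(min_d2, label, axis=0).min()   # other classes' prototypes
    return clst, sep

min_d2 = np.array([[0.2, 0.9],    # class 0 (the image's own class)
                   [1.5, 2.0],    # class 1
                   [0.8, 1.1]])   # class 2
clst, sep = clst_sep_costs(min_d2, label=0)
assert clst == 0.2 and sep == -0.8
```

Minimizing clst pulls some own-class prototype toward the image's latent patches, while minimizing sep (a negated minimum) pushes the other classes' prototypes away.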
Theorem 1 finds the impact of the selection of the weights m_w^{(i,j)} on the logits. Therefore, for each class i, we set m_w^{(i,j)} = 1 for all j with p_j ∈ P^i, and for every incorrect class k ≠ i, the weight m_w^{(k,j)} is chosen from the set {−1, −0.9, −0.7, −0.5, −0.2, −0.1}. Since the distance function is nonnegative, the optimization of all layers except the last layer with the optimizer SGD helps Ps-ProtoPNet learn an informative latent space.

Push of Prototypical Parts
At this step, Ps-ProtoPNet pushes/projects each prototype onto the patch of the output f(x) of a training image x of its own class that has the smallest distance from the prototype. That is, Ps-ProtoPNet performs the following update:

p_j \leftarrow \arg\min_{z \in patches(f(x))} d(z, p_j).

Therefore, the prototype layer gets updated prototypical parts that are closer to their respective classes [16]. The patch of x that is the most similar to p is used for the visualization of p. The activation value of the patch must be at least the 94th percentile of all activation values of p_p [16].
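The push step can be sketched as a nearest-patch search over the latent outputs; this is a hedged sketch with assumed shapes and a toy number of images, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of the push step: each prototype is replaced by its nearest
# latent patch over the training images, so every prototype can later be
# visualized as an actual patch of some training CT-scan.
def push(prototype, latent_maps):
    best, best_d2 = None, np.inf
    for z in latent_maps:                    # latent output f(x) for each image
        D, H, W = z.shape
        _, h, w = prototype.shape
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                patch = z[:, i:i + h, j:j + w]
                d2 = ((patch - prototype) ** 2).sum()
                if d2 < best_d2:
                    best, best_d2 = patch.copy(), d2
    return best

rng = np.random.default_rng(3)
maps = [rng.normal(size=(8, 7, 7)) for _ in range(4)]
p = maps[2][:, 1:4, 2:5] + 0.01 * rng.normal(size=(8, 3, 3))  # near a real patch
projected = push(p, maps)
assert np.array_equal(projected, maps[2][:, 1:4, 2:5])  # snapped to the nearest patch
```

After the push, the prototype is literally a latent patch of a training image, which is what makes the later visualization of prototypes possible.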

Explanation of Ps-ProtoPNet with an Example
The test image in the first column of Figure 3 belongs to the first class (Covid). In the second column, the test image has some patches enclosed in green rectangles. These patches give the highest similarity scores to the corresponding prototypes in the third column. The prototypes in the third column are taken from the corresponding source images in the fourth column. The rectangles on the source images pinpoint the patches from which the corresponding prototypes are taken. The fifth column has the similarity scores of the prototypes, and the sixth column has the weights. The entries of the seventh column are obtained by multiplying the similarity scores with the corresponding weights. The logit (0.5744) of the first class is the sum of the entries of the seventh column. The logit for the first class can also be obtained by multiplying the first row of the weight matrix m_w with the similarity score matrix S. Similarly, the logits for the second class (−0.5790) and the third class (−0.5787) can be obtained by multiplying the second and third rows of the weight matrix with the similarity score matrix S. The transpose of the weight matrix m_w and the similarity score matrix S that we obtain from our experiments are as follows:

Results
In this section, we present the metrics given by our model and compare its performance with that of the other models.

The Metrics and Confusion Matrices
For a given class, true positives (TP) and true negatives (TN) are the numbers of items correctly predicted as belonging to the class and as not belonging to the class, respectively, see [52]. False positives (FP) and false negatives (FN) are the numbers of items incorrectly predicted as belonging to the class and as not belonging to the class, respectively, see [53]. The metrics accuracy, precision, recall and F1-score are [54][55][56]:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}, (6)

Recall = \frac{TP}{TP + FN}, \quad F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}. (7)

In Figures 4-9, the confusion matrices of Ps-ProtoPNet with the base models are given. For example, from the confusion matrix in Figure 4 and Equations (6) and (7), the accuracy of Ps-ProtoPNet is 98.83%, and the precision, recall and F1-score are 0.96, 0.98 and 0.97, respectively.
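The per-class metrics defined above follow directly from the raw counts; the following sketch uses illustrative counts, not the values of our confusion matrices.

```python
# Hedged sketch (not the authors' code) of the per-class metrics defined
# above, computed from raw counts.
def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Illustrative counts: 90 true positives and true negatives, 10 each of
# false positives and false negatives.
acc, prec, rec, f1 = metrics(tp=90, tn=90, fp=10, fn=10)
assert abs(acc - 0.9) < 1e-12 and abs(f1 - 0.9) < 1e-12
```

Note that the F1-score is the harmonic mean of precision and recall, so it equals both when they coincide.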


The Performance Comparison of the Models
The models Ps-ProtoPNet, Gen-ProtoPNet, NP-ProtoPNet and ProtoPNet are constructed over the convolutional layers of the base models. We trained and tested these models on the dataset of CT-scan images [44]. Although the accuracies of these models stabilize before 30 epochs (see Section 3.3), we trained and tested the models for 100 epochs.
The comparison of the performance in the metrics is given in Table 1. We observe from the third column of Table 1 that when we construct our model over the convolutional layers of VGG-16 and use prototypes of spatial dimensions 3 × 4, the accuracy, precision, recall and F1-score given by Ps-ProtoPNet are 98.83%, 0.96, 0.98 and 0.97, respectively. The accuracy, precision, recall and F1-score given by the models Gen-ProtoPNet, NP-ProtoPNet and ProtoPNet with the baseline VGG-16 are also listed in Table 1.

The Graphical Comparison of the Accuracies
In Figures 10-15, the accuracies given by Ps-ProtoPNet are graphically compared with the accuracies given by the other models. As mentioned in Section 3.2, the accuracies of these models stabilize before 30 epochs, but we trained and tested the models for 100 epochs on the dataset of CT-scan images [44]. In Figure 10, the comparison of the accuracies given by the models with the baseline VGG-16 is provided. The curves of colors green, purple, yellow, brown and blue sketch the accuracies of Ps-ProtoPNet, Gen-ProtoPNet, NP-ProtoPNet, ProtoPNet and VGG-16, respectively. Although it is hard to see the differences between the accuracies after they stabilize in Figures 10-15, the figures clearly show the differences between the accuracies before they stabilize.

The Test of Hypothesis for the Accuracies
Since accuracy is the proportion of correctly classified images among all test images, we can apply the test of hypothesis concerning the difference between two proportions. Let n be the size of the test dataset, and let x_1 and x_2 be the numbers of images correctly classified by model 1 and model 2, respectively. Let \hat{p}_1 = x_1/n and \hat{p}_2 = x_2/n, and let \hat{p} = (x_1 + x_2)/(2n) be the pooled proportion. The statistic for the test concerning the difference between two proportions is given by [57]:

Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\frac{2}{n}}}. (8)

Let p_1 and p_2 be the accuracies given by model 1 and model 2. Our hypotheses are as follows:

H_0: p_1 − p_2 = 0 (null hypothesis)
H_a: p_1 − p_2 ≠ 0 (alternative hypothesis)

We test the hypothesis at the significance level α = 0.05. Since the test is two-tailed, the p-value in each tail must be less than 0.025 to reject the null hypothesis. In the above hypotheses, p_1 is the accuracy given by Ps-ProtoPNet and p_2 represents the accuracies given by Gen-ProtoPNet, NP-ProtoPNet, ProtoPNet and the base models. We obtain the values of the test statistic Z from Equation (8), and the corresponding p-values from the standard normal table (Z-table). The complete list of p-values is given in Table 2. For example, when VGG-16 is used as the base model, the p-values obtained from the accuracy given by Ps-ProtoPNet in pairs with the accuracies given by Gen-ProtoPNet, NP-ProtoPNet, ProtoPNet and VGG-16 are 0.0002, 0.0002, 0.0002 and 0.0367, respectively. Since α = 0.05, we reject the null hypothesis for all the p-values listed in Table 2 except the five p-values written in bold. The bold p-values in the last column mean that the accuracies given by Ps-ProtoPNet are not statistically different from the accuracies given by the three base models. However, we can say with 95% confidence that the accuracies given by Ps-ProtoPNet are better than the corresponding accuracies given by Gen-ProtoPNet, NP-ProtoPNet and ProtoPNet, except in two cases.
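The pooled two-proportion statistic can be sketched as follows; the correct-classification counts below are illustrative (only the test-set size n = 25,658 comes from the paper), so the resulting Z value does not correspond to any entry of Table 2.

```python
import math

# Hedged sketch of the pooled two-proportion z statistic for two models
# evaluated on the same n test images; x1 and x2 are illustrative counts.
def z_two_prop(x1, x2, n):
    p1, p2 = x1 / n, x2 / n
    p = (x1 + x2) / (2 * n)                 # pooled proportion
    return (p1 - p2) / math.sqrt(p * (1 - p) * (2 / n))

z = z_two_prop(x1=25350, x2=25200, n=25658)
assert z > 1.96                             # significant at alpha = 0.05 (two-tailed)
```

Even a difference of 150 correctly classified images out of 25,658 yields a large Z value, because the variance of the pooled proportion shrinks with n.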

The Impact of Change in the Hyperparameters of the Last Layer
In this section, we prove a theorem analogous to Theorem 2.1 of [16]. Our experiments show that the weights m_w^{(i,j)} on the connections between the logit of a class i and the similarity scores of prototypes p_j ∉ P^i can hardly be made equal to 0 during training, an assumption made in Theorem 2.1 of [16]. Therefore, we do not assume this condition.
Theorem 1. Let h ∘ p_p ∘ f be a Ps-ProtoPNet. For a class k, let b^k_l and a^k_l be the l-th prototype of class k before and after the projection, respectively. Let x be an input image that is correctly classified by Ps-ProtoPNet before the projection, and let k be the correct class label of x. Suppose that:

A1: z^k_l = \arg\min_{z \in patches(f(x))} d(z, a^k_l);

A2: there exists some δ with 0 < δ < 1 such that:

A2a: for all incorrect classes k′ ≠ k and l ∈ {1, ..., m_{k′}}, we have d(a^{k′}_l, b^{k′}_l) ≤ θ d(z^{k′}_l, b^{k′}_l), where p_p is given by p_p(z) = \max_{Z \in patches(z)} \log\left(\frac{d^2(Z, p) + 1}{d^2(Z, p) + \epsilon}\right) and θ = \min(\sqrt{1 + δ} − 1, 1 − \sqrt{1 − δ});

A2b: for all l ∈ {1, ..., m_k}, we have d(a^k_l, b^k_l) ≤ (\sqrt{1 + δ} − 1) d(z^k_l, b^k_l) and d(z^k_l, b^k_l) ≤ \sqrt{1 − δ}.

Then, after the projection, the output logit for the correct class k can decrease at most by Δ = m \log((1 + δ)(2 − δ))\left(1 + \frac{n − 1}{r}\right), where −1/r is the weight assigned to incorrect classes and r is a positive real number.
Proof of Theorem 1. For any class k, let L_k(x, {p^k_l}_{l=1}^{m}) be the output logit for the input image x, where {p^k_l}_{l=1}^{m} denote the prototypes of class k. Since the negative connections between the similarity scores of incorrect classes and the logits are equal to −1/r, the output logit of the correct class k is

L_k(x) = \sum_{l=1}^{m} p_{p^k_l}(f(x)) - \frac{1}{r} \sum_{k' \neq k} \sum_{l=1}^{m} p_{p^{k'}_l}(f(x)).

Let Δ_k be the difference between the output logit of class k before and after the projection of the prototypes {p^k_l}_{l=1}^{m} to their nearest latent training patches, and let L_k(x, {b^k_l}_{l=1}^{m}) and L_k(x, {a^k_l}_{l=1}^{m}) denote the logits before and after the projection, respectively. Therefore,

\Delta_k = L_k(x, \{b^k_l\}_{l=1}^{m}) - L_k(x, \{a^k_l\}_{l=1}^{m}).
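The bound Δ in Theorem 1 is easy to evaluate numerically; the values below are illustrative choices (m = 10 and n = 3 match our setting, while r and δ are assumed for illustration).

```python
import math

# Illustrative evaluation of the bound Delta from Theorem 1:
# Delta = m * log((1 + delta)(2 - delta)) * (1 + (n - 1)/r),
# with m = 10 prototypes per class, n = 3 classes, weight -1/r with r = 1,
# and an assumed delta = 0.1.
m, n, r, delta = 10, 3, 1.0, 0.1
Delta = m * math.log((1 + delta) * (2 - delta)) * (1 + (n - 1) / r)
assert Delta > 0          # the correct-class logit can drop, but only boundedly
```

As δ → 0 the factor log((1 + δ)(2 − δ)) → log 2, so the bound stays finite; smaller prototype movements during the push (smaller δ) do not make the bound vanish, but they do shrink the admissible logit drop through assumptions A2a and A2b.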