Attention Map-Guided Visual Explanations for Deep Neural Networks

Abstract: Deep neural network models perform well in a variety of domains, such as computer vision, recommender systems, natural language processing, and defect detection. However, in areas such as healthcare, finance, and defense, users do not trust deep neural network models because they lack explainability. In this paper, we focus on attention-map-guided visual explanations for deep neural networks. We employ an attention mechanism to find the most important region of an input image. The Grad-CAM method is used to extract the feature map of a deep neural network, and the attention mechanism is then used to extract high-level attention maps. The attention map, which highlights the region of the image that is important for the target class, can be seen as a visual explanation of a deep neural network. We evaluate our method using two common metrics, average drop and percent increase, and we also propose a new metric. Experiments show that the proposed method works better than state-of-the-art explainable artificial intelligence methods. Compared with other methods, our approach provides a lower average drop and a higher percent increase, and it finds a more explanatory region, especially in the first twenty percent region of the input image.


Introduction
Deep neural networks (DNNs) have enabled tremendous improvements in a number of computer vision tasks, such as image classification [1,2], object detection [3-5], and semantic segmentation [6], and in other tasks, such as visual question answering [7] and autonomous driving [8]. However, DNNs are difficult to analyze and behave as black boxes. When designing a deep neural network model, most researchers emphasize the model's framework and its many internal parameters, but they cannot provide a correct explanation of the model's output when the model makes mistakes. As a result, users are unable to trust the network's decisions in industries such as healthcare, finance, and security. It is important to construct transparent models that can show users their reasoning. This will help with understanding failures, debugging, and identifying potential biases in training data.
To solve these problems, explainable artificial intelligence (XAI) technology has been proposed, and more and more researchers are working on this technology every year. XAI technology focuses on how to make a DNN model's decisions more transparent, understandable, and trustworthy to humans. To interpret a deep neural network model, it would be useful to generate an explanation map that highlights important regions that are most related to the model's decision. One common approach for interpreting deep neural network models is relying on the changes in the model output, such as the changes in prediction scores concerning the input images [9]. RISE [10] advocated a general approach that probes the model with randomly masked versions of the image and obtains the corresponding outputs without requiring access to its internals for each network architecture. LIME [11] draws random samples and builds an approximated linear decision model to interpret deep neural networks. However, it depends on super-pixels, which may or may not capture the relevant areas. Another approach, Grad-CAM [12], relies on the gradients by back-propagating the prediction score through the last convolutional layer and applying them as weights to combine the forward feature maps to produce explanations. However, an explanation using Grad-CAM has too much meaningless information, since the feature maps are not necessarily related to the target class.
In this paper, we propose an attention-map-guided visual explanation method for deep neural networks. We use an attention mechanism to generate the attention map from the feature map, which is produced by Grad-CAM. Figure 1 shows an overview of our method. We compare our approach with other state-of-the-art XAI methods using three metrics on the ImageNet dataset. The experimental results show that our method provides a better explanation than the other methods: it finds the most important region for the deep neural network, achieves a lower average drop and a higher percent increase, and uncovers a more explanatory region.

CAM-Based XAI Methods
Many methods for explaining the output of a model are now based on class activation mapping (CAM) [13]. Some researchers extend CAM by mixing backpropagated gradients with the feature maps of a certain convolutional layer to generate an explanation map. To generate the explanation map, they have mainly used prior position information, such as part-level bounding boxes and segmentation masks [14]. The CAM is essentially a weighted linear sum of the presence of visual patterns in various spatial regions. By simply upsampling the class activation map to the size of the input image, it can identify the image regions that are most important for the given category. In Figure 2, the global average pooling (GAP) layer converts the feature map into a feature vector, so that each channel of the feature map is represented by a single numerical value. CAM multiplies the weights corresponding to the bull mastiff class by the corresponding channels of the feature map and sums the results, producing a weighted linear sum. Using CAM, it is possible to observe which area the model is looking at. However, CAM has shortcomings; in particular, it requires replacing the model's fully connected layer with a global average pooling layer. Because changing the model's internal structure is inconvenient, users are reluctant to explain DNN models with CAM.
To address these problems, Selvaraju et al. [12] proposed using gradient calculations instead of GAP. Grad-CAM combines feature maps using gradient weights without any modification to the network structure. It lets the gradients of the target class flow into the final convolutional layer to build a coarse localization map that highlights the image regions that are essential for predicting the class.
Grad-CAM assigns priority values to each neuron for a specific choice using the gradient information flowing into the last convolutional layer of the CNN model.
CAM and Grad-CAM use a linear combination of activations to produce an explanation map. Grad-CAM++ is an enhancement of Grad-CAM that provides a visual explanation for the associated class by using a weighted combination of the positive partial derivatives of the target layer's feature maps with respect to a predetermined class score as weights. To create an enhanced visual explanation of multiple objects in a single image, SmoothGrad-CAM was created [15]. It is a simple method that draws random samples in the neighborhood of an input x and averages the resulting sensitivity maps, which visually sharpens gradient-based sensitivity maps. The gradient of the class score function with respect to the input image is a good starting point for SmoothGrad-CAM. Omeiza et al. [16] proposed Smooth Grad-CAM++, which combines SmoothGrad-CAM and Grad-CAM++. Smooth Grad-CAM++ creates visually sharper explanations of the input images and allows one to visualize a layer, a subset of feature maps, or a subset of neurons inside a feature map at each occurrence. Although these XAI methods can provide reasonable visualizations, the majority of them lack clear and sufficient theoretical backing. XGrad-CAM [17] was proposed to satisfy those needs as far as is feasible, and the studies on it show that it is a more sensitive and conservation-oriented variant of Grad-CAM. However, because the feature maps are not always connected to the target class, the outputs of activation-based approaches may collect too much irrelevant information.

Attention-Based Methods
Attention mechanisms are widely used in the field of natural language processing (NLP) as a way to improve the performance of models [18,19]. They have been employed extensively in sequential models using recurrent neural networks and long short-term memory (LSTM). A growing body of research applies attention mechanisms to computer vision tasks [20,21]. Researchers can use an attention method to extract high-level features and thereby improve the performance of a deep learning model. An attention mechanism in computer vision can be thought of as a dynamic selection process that is implemented by adaptively weighting features based on their relevance to the input. In the past few years, researchers have found that applying attention mechanisms to many image recognition tasks can provide good results. The authors of [22] created a global-and-local attention (GALA) module and incorporated it into a DNN model, and their experimental results show that the module can improve visual recognition performance. Increasingly, the attention mechanism is also being used in the XAI field. The authors of [23,24] offer spatial attention maps of the visual regions that the network attends to, which can be shown in a user-friendly manner. However, attention maps are only one piece of the puzzle. The attention technique filters away non-salient image content. Attention networks, on the other hand, must locate all potentially salient visual areas and forward them to the primary recognition network for a final decision, just as a human would use peripheral vision to determine that "something is there" before visually fixating on the item to determine what it is.
Kim et al. [25] used a visual attention model that highlights image regions that potentially influence a network's output and then applies a causal filtering step to determine which input regions actually influence the output. This produces more succinct visual explanations and exposes the network's behavior more accurately than other methods do. Their research first showed that training with attention does not degrade the performance of the end-to-end network. However, they used a convolutional feature extractor to extract the low-level feature map directly from the image. Thus, the explanation of the deep learning model is based on another deep learning model, and it is unclear whether the low-level features extracted directly from the input image are the same as those used by the model being explained.

Grad-CAM
Grad-CAM uses gradient calculations instead of GAP. As shown in Figure 3, it combines feature maps using gradient weights without any modification to the network structure. It lets the gradients of the target class flow into the final convolutional layer to build an explanation map that highlights the image regions that are essential for predicting the class. In our experiments, using Grad-CAM as the base gave the best results, so we built on it in our subsequent research.
Figure 3. Grad-CAM overview. The input image is processed by the CNN model, and a raw score for the specific class is obtained. The gradients are set to 0 for all classes except the target class ("bull mastiff"), which is set to 1. This signal is then back-propagated to the rectified convolutional feature maps, which are combined to produce the coarse red heat map that depicts where the model is looking.
The Grad-CAM technique computes the gradient of the class score $y^c$ with respect to the feature maps of the last convolutional layer. It then global-average-pools these gradients to obtain the weights $w_k^c$ (written $\alpha_k^c$ in Equation (4)).
Grad-CAM then forms the visual explanation as a weighted combination of the feature maps followed by a ReLU.
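The equations for this step are not reproduced in this version of the text. Assuming they follow the standard Grad-CAM formulation of Selvaraju et al. [12], the weights and the resulting map can be written as follows, where the name $L^c_{\text{Grad-CAM}}$ for the map is borrowed from [12] rather than from this text:

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c \, A^k \right)
```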
In Equation (4), the weight $\alpha_k^c$ represents a partial linearization of the deep network downstream from $A$, $Z$ is the total number of cells in a feature map, $y^c$ is the activation score for class $c$, and $A^k_{ij}$ is the activation of the cell at spatial location $(i, j)$ in the $k$-th feature map.
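As a rough illustration of the Grad-CAM computation described above, the following pure-Python sketch averages the gradients of each feature map to obtain one weight per map, forms the weighted sum, and applies a ReLU. The function name `grad_cam` and the toy inputs are hypothetical; in practice the feature maps and gradients would come from a forward and backward pass through the CNN.

```python
def grad_cam(feature_maps, gradients):
    """Combine K feature maps (each H x W) into one Grad-CAM map.

    feature_maps[k][i][j] plays the role of A^k_ij and gradients[k][i][j]
    the role of d y^c / d A^k_ij; both are assumed to be precomputed.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cells = height * width  # Z, the number of cells in one feature map

    # One weight per feature map: global-average-pool its gradients.
    alphas = [sum(sum(row) for row in grad) / cells for grad in gradients]

    # Weighted sum over k, followed by ReLU to keep positive evidence only.
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    return [[max(0.0, v) for v in row] for row in cam]


# Toy example: two 2 x 2 feature maps with hand-written gradients.
A = [[[1.0, 2.0], [0.0, 1.0]],
     [[0.0, 1.0], [3.0, 1.0]]]
dY_dA = [[[0.5, 0.5], [0.5, 0.5]],      # averages to a weight of 0.5
         [[-1.0, -1.0], [-1.0, -1.0]]]  # averages to a weight of -1.0
print(grad_cam(A, dY_dA))  # [[0.5, 0.0], [0.0, 0.0]]
```

In this toy case the second feature map receives a negative weight, so the cells where it dominates are suppressed by the ReLU.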

General Form
When we become aware of a scene in daily life, we focus our attention on discriminative areas and process them quickly. Almost all existing attention mechanisms can be summed up by Equation (5). In it, $g(x)$ generates the attention, which corresponds to attending to the discriminative regions, and $f(g(x), x)$ denotes that the input $x$ is processed based on the attention $g(x)$, which corresponds to processing the crucial regions and obtaining information.
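Equation (5) is not reproduced in this version of the text. Based on the description above, it presumably has the following general form (a reconstruction under that assumption, not the authors' verbatim equation):

```latex
\mathrm{Attention} = f\big(g(x),\, x\big)
```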

Channel-Spatial Attention Module
Inspired by Woo et al. [26], we designed our channel-spatial attention module. Distinct channels in different feature maps typically represent different objects in deep neural networks [27]. Channel attention adjusts the weight of each channel as needed and can be thought of as an object selection process that determines what to pay attention to. By exploiting the inter-channel relationship of features, we create a channel attention map, wherein each channel of the feature map acts as a feature detector. As shown in Figure 4, we aggregate the spatial information of a feature map by using both average-pooling and max-pooling operations, thereby generating the average-pooled feature AvgPool(F) and the max-pooled feature MaxPool(F). Both descriptors are then forwarded to a multi-layer perceptron (MLP) to produce a channel attention map $M_c$. In short, the channel attention is computed as in Equation (6).
We also created a spatial attention module. It differs from channel attention in that it focuses on where the informative component is located, which is complementary to channel attention. As shown in Figure 5, we apply average-pooling and max-pooling along the channel axis and then use a convolution layer to generate a spatial attention map.
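Equations (6) and (7) are referenced in the text but not reproduced in this version. Since the module is inspired by Woo et al. [26], a plausible reconstruction, which should be read as an assumption rather than the authors' exact formulation, is:

```latex
M_c(F) = \sigma\Big( \mathrm{MLP}\big(\mathrm{AvgPool}(F)\big) + \mathrm{MLP}\big(\mathrm{MaxPool}(F)\big) \Big)

M_s(F) = \sigma\Big( f^{7 \times 7}\big( [\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)] \big) \Big)
```

Here $F$ is the input feature map and the square brackets denote concatenation along the channel axis. In $M_c$ the pooling runs over the spatial dimensions, whereas in $M_s$ it runs along the channel axis. Sharing one MLP between the two pooled descriptors and summing its outputs are details carried over from [26].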
Pooling along the channel axis has been shown to help identify informative regions [28]. We use two pooling operations, producing the average-pooled feature AvgPool(F) and the max-pooled feature MaxPool(F), and a convolution layer then convolves them to generate our 2D spatial attention map. In Equation (7), $\sigma$ denotes the sigmoid function, and $f^{7 \times 7}$ represents a convolution operation with a filter size of 7 × 7. The benefit of the channel-spatial attention module is that it can adaptively identify essential objects and regions. By sequentially combining channel and spatial attention, our attention module leverages both the channel and spatial relationships of features to instruct the network on what to focus on and where to focus. It highlights helpful channels while also emphasizing informative spatial locations.
Figure 5. The spatial attention module pools two outputs along the channel axis and sends them to a convolution layer.
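To make the sequential channel-then-spatial combination concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the ReLU inside the MLP follows Woo et al. [26], the weights `w1`, `w2`, and `kernel` are hand-picked rather than learned, and the toy example uses a 3 × 3 kernel instead of 7 × 7 to stay small. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, w1, w2):
    """One weight per channel from avg- and max-pooled descriptors.

    feat is C x H x W (nested lists); w1 (hidden x C) and w2 (C x hidden)
    are the weights of a shared two-layer MLP, supplied by the caller.
    """
    def mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    flat = [[v for row in ch for v in row] for ch in feat]
    avg = mlp([sum(ch) / len(ch) for ch in flat])
    mx = mlp([max(ch) for ch in flat])
    return [sigmoid(a + m) for a, m in zip(avg, mx)]


def spatial_attention(feat, kernel):
    """H x W map from channel-wise avg/max pooling and a k x k convolution.

    kernel has shape 2 x k x k (one slice per pooled plane); zero padding
    keeps the output the same size as the input.
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    avg = [[sum(feat[c][i][j] for c in range(channels)) / channels
            for j in range(width)] for i in range(height)]
    mx = [[max(feat[c][i][j] for c in range(channels))
           for j in range(width)] for i in range(height)]
    pad = len(kernel[0]) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for plane, k2d in zip((avg, mx), kernel):
                for di, k_row in enumerate(k2d):
                    for dj, k in enumerate(k_row):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < height and 0 <= x < width:
                            total += k * plane[y][x]
            out[i][j] = sigmoid(total)
    return out


def channel_spatial_attention(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    mc = channel_attention(feat, w1, w2)
    feat = [[[mc[c] * v for v in row] for row in ch]
            for c, ch in enumerate(feat)]
    ms = spatial_attention(feat, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in feat]


# Toy input: 2 channels of size 3 x 3, with hand-picked (not learned) weights.
feat = [[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]],
        [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]
w1 = [[0.5, 0.5]]        # hidden layer with a single unit
w2 = [[1.0], [1.0]]
kernel = [[[0.1] * 3 for _ in range(3)] for _ in range(2)]
refined = channel_spatial_attention(feat, w1, w2, kernel)
print(refined[0][1][1])  # the centre activation, scaled down by both maps
```

Because both attention maps pass through a sigmoid, every activation is scaled by a factor between 0 and 1, and zero activations stay zero.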

Experimental Setup
Our experiments were conducted on the commonly used computer vision dataset ImageNet. They involved an objective evaluation of our method and its comparison with Grad-CAM, Grad-CAM++, XGrad-CAM, and Smooth Grad-CAM++. All images were resized to 3 × 224 × 224, transformed to tensors, and normalized to the range [0, 1]. As shown in Table 1, an AMD Ryzen 7 3700X was used as the CPU with a total of 64 GB of memory, and a GeForce RTX 2080 Ti was used as the GPU. Our software environment consisted of Python 3.6, PyTorch 1.8.1, Torchvision 0.9.1, and other libraries. We first tested VGG19 [29], ResNet-50 [30], and GoogLeNet [31] models pre-trained on ImageNet. According to Table 2, the ResNet-50 model performs best on the ImageNet dataset, so we chose it as the black-box model to be explained.

Evaluation Metrics
We followed the study presented in [32] for the objective evaluation of our proposed method. A heat map was created for each image using a visualization approach such as Grad-CAM, with the most relevant discriminative regions highlighted in red. The idea is to create an image that contains only the sub-regions of the original image that the visualization technique highlights. To evaluate the explanation map, the generated heat map was binarized so that the top 5, 10, 20, 25, or 50% of pixels were set to 1 and the rest to 0. Multiplying the original image point by point with this binarized heat map yields a visual explanation map. Figure 6 displays the visual explanation map generated by our method when the top 25% of the original image's pixels are retained. We compared the XAI methods on heat maps restricted to the same top x percent of pixels rather than on their full visual explanation maps. This ensures that one technique outperforms another not by highlighting more pixels but by capturing more relevant information within the same number of pixels. We evaluated the explanation maps produced by our method and other XAI methods using three metrics: (a) average drop in activation score, (b) percent increase in activation score, and (c) percentage in metric. All results were computed on the ImageNet dataset using the ResNet-50 model.
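The binarization and point-wise multiplication described above can be sketched in a few lines of pure Python. This is an illustrative sketch on a single-channel toy image; the function names are hypothetical, and the handling of ties at the threshold is an assumption the text does not specify.

```python
def top_fraction_mask(heatmap, fraction):
    """Return a 0/1 mask that keeps the `fraction` highest-valued pixels."""
    values = sorted((v for row in heatmap for v in row), reverse=True)
    keep = max(1, round(fraction * len(values)))
    threshold = values[keep - 1]
    # Ties at the threshold are all kept, so slightly more pixels may survive.
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def explanation_map(image, mask):
    """Point-wise product of a single-channel image and a 0/1 mask."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


heatmap = [[0.1, 0.9, 0.3, 0.2],
           [0.8, 0.7, 0.4, 0.0],
           [0.2, 0.6, 0.5, 0.1],
           [0.0, 0.3, 0.2, 0.1]]
image = [[10, 20, 30, 40],
         [50, 60, 70, 80],
         [90, 100, 110, 120],
         [130, 140, 150, 160]]
mask = top_fraction_mask(heatmap, 0.25)  # keep the top 25% (4 of 16 pixels)
print(explanation_map(image, mask))
# [[0, 20, 0, 0], [50, 60, 0, 0], [0, 100, 0, 0], [0, 0, 0, 0]]
```

For a colour image, the same mask would be applied to each channel in turn.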

Average Drop in Activation Score
A good explanation map will cover most of the parts of the object in the image that are important for the model's decision. As a result, providing a good explanation map instead of the whole image should cause only a small decline in the model's output score. In Equation (8), the metric is given as the percentage drop in the model's score when only an explanation map is provided as input.
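Equation (8) is not reproduced in this version of the text. Assuming it follows the usual definition of this metric, in which score increases are clipped to zero, it can be written as:

```latex
\text{Average Drop} = \frac{1}{N} \sum_{i=1}^{N} \frac{\max\!\left(0,\; Y_i^c - O_i^c\right)}{Y_i^c} \times 100
```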
Here, $Y_i^c$ is the activation score when the original image $i$ is provided as input, $O_i^c$ is the activation score when the explanation map is provided as input, and $N$ is the total number of images in the data.
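As a small worked example, the average drop can be computed from two lists of scores. The sketch below assumes the standard form of the metric, in which a score increase counts as a drop of zero; the function name and the score values are hypothetical and not taken from the paper's experiments.

```python
def average_drop(full_scores, masked_scores):
    """Mean relative drop in percent; score increases are clipped to zero."""
    drops = [max(0.0, y - o) / y for y, o in zip(full_scores, masked_scores)]
    return 100.0 * sum(drops) / len(drops)


Y = [0.8, 0.5, 0.9, 0.4]   # hypothetical scores on the original images
O = [0.6, 0.5, 0.45, 0.6]  # hypothetical scores on the explanation maps
print(average_drop(Y, O))  # about 18.75
```

Only the first and third images contribute here (relative drops of 25% and 50%); the fourth image, whose score rises, contributes zero.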

Percent Increase in Activation Score
When the context acts as noise for the class, presenting the explanation map instead of the whole image can raise the output activation score. This metric is defined as the percentage of images in the dataset for which the model's output score rises when only the explanation map is provided as input. Formally, this can be expressed as Equation (9):

$$\text{Percent Increase} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}_{Y_i^c < O_i^c}$$

where $\mathbb{1}_{Y_i^c < O_i^c}$ is an indicator function that returns 1 when its argument is true and 0 otherwise. Table 3 indicates that our method has a lower average drop and a higher percent increase.
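The percent increase can be computed in the same way as a simple count. This is a sketch with hypothetical names and score values, not results from the paper.

```python
def percent_increase(full_scores, masked_scores):
    """Percentage of images whose score rises on the explanation map."""
    increased = sum(1 for y, o in zip(full_scores, masked_scores) if y < o)
    return 100.0 * increased / len(full_scores)


Y = [0.8, 0.5, 0.9, 0.4]   # hypothetical scores on the original images
O = [0.6, 0.5, 0.45, 0.6]  # hypothetical scores on the explanation maps
print(percent_increase(Y, O))  # 25.0
```

Only the last image has a higher score on its explanation map, so one image out of four gives 25%. An unchanged score, as for the second image, does not count as an increase.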

Percentage in Metric
We created a new metric to demonstrate the results of our experiments. One of the key reasons for creating this metric is that it gives a more intuitive view of how well an XAI method performs. The percentage specifies how much of the input image to mask, and the masked image is fed into the original ResNet-50 model to check the performance of the XAI method. Using this metric, it is possible to visualize how well the XAI method performs and to present the results from the user's perspective. Table 4 shows the results of our experiment.

Results
As shown in Figure 7, the proposed method gave the clearest explanation of the particular features the model learned. For instance, the proposed method was able to find the most important portion of the bull mastiff's head. Additionally, the proposed method captured a larger amount of the class object (as seen in the dog image in Figure 7) and performed localization well. Table 3 shows that, in terms of average drop and percent increase, our method is better than the other XAI methods. A good explanation map will focus on most of the relevant parts of the object in the image. As a result, when we input an explanation map to the DNN model, we expect a low average drop and a high percent increase. The full explanation map is used as the input, and the ResNet-50 model provides a class score. If the explanation map concentrates on the most essential area in the image, the ResNet-50 model will provide a high class score. According to Equations (8) and (9), the better the explanation map performs, the lower the average drop and the higher the percent increase.

Discussion
First, the computing time needed to create a single attention-map-guided visual explanation map is longer than that required by other XAI methods. The reason is that we employ the attention mechanism to obtain a higher-level feature region for each feature map. Second, as seen in Tables 3 and 4, when we try to explain models such as ResNet-50, which use global average pooling followed by a single fully connected layer rather than several fully connected layers, our method performs only slightly better than other XAI methods. As shown in Figure 7, our method focused on more of the important region than the other XAI methods. The bull mastiff's head was fully captured in the five-percent and ten-percent images. This indicates that, for the ResNet-50 model, the features expressed in the head region of the bull mastiff are the most important.

Conclusions
In this work, we proposed a novel technique, attention-map-guided visual explanation, to produce explanation maps that explain the individual decisions of CNN-based models. It uses the Grad-CAM method to extract the feature map of a deep learning model and then uses the attention mechanism to extract the high-level attention map. We showed through objective evaluations that our method performs better than existing state-of-the-art XAI methods. In the future, we hope to apply the proposed method to medical diagnostics, and by explaining deep learning models, we hope to persuade doctors and patients of the veracity of good deep learning models' decisions. Our study has some limitations, in light of which our findings need to be interpreted carefully. First, as in most empirical studies, the research presented here was limited by the black-box model used. Second, although some of the image regions highlighted by the attention mechanism truly influence the output, others are spurious.