Rethinking Gradient Weight’s Influence over Saliency Map Estimation

Class activation maps (CAM) help formulate saliency maps that aid in interpreting a deep neural network's prediction. Gradient-based methods are generally faster than other branches of vision interpretability and are independent of human guidance. The performance of CAM-like studies depends on the governing model's layer response and the influence of the gradients. Typical gradient-oriented CAM studies rely on weighted aggregation for saliency map estimation, projecting each gradient map into a single weight value, which may lead to an over-generalized saliency map. To address this issue, we use a global guidance map to rectify the weighted aggregation operation during saliency estimation, so that the resulting interpretations are comparatively cleaner and instance-specific. We obtain the global guidance map by performing elementwise multiplication between the feature maps and their corresponding gradient maps. To validate our study, we compare the proposed method with nine different saliency visualizers. In addition, we use seven commonly used evaluation metrics for quantitative comparison. The proposed scheme achieves significant improvement over test images from the ImageNet, MS-COCO 14, and PASCAL VOC 2012 datasets.


Introduction
Deep neural networks have achieved superior performance on numerous vision tasks [1,2,3]. However, they remain complicated black boxes with huge numbers of unexplainable parameters that begin at a random initialization and settle at another unpredictable, still sub-optimal point. Because these transformations are non-linear and differ for each problem, interpretability remains unsolved.
The vision community relies on estimating saliency maps to decipher the decision-making process of deep networks. Through the saliency map, we try to bridge the unknown gap between the input space and the decision space. Saliency map generation approaches can be divided by whether they operate on the input space [4,5,6], the feature maps [7,8,9,10,11,12,13], or the propagation scheme [14,15,16,17,18]. Typical perturbation-based methods [19,20,21] probe the input space with different deviated versions of the image and obtain a unified saliency map through their underlying algorithms. Even though this approach produces better results, the process is relatively slow. CAM-like methods rely on the gradient for the given class and create a decent output within a very brief computation time.
This study explores the established framework of CAM-based methods to address current issues within vision-based interpretability. Similar studies usually rely on gradient information to produce saliency maps. Some replace the gradient dependency with score estimation to build the saliency map out of the feature maps. Nonetheless, these studies formulate saliency maps that often degrade on class-discriminative examples. Moreover, their weighted accumulation does not always cover the expected region for a given single-class instance, as the associated weights sometimes fail to address the local correspondence. On the other hand, the gradient-free methods take significant time to produce a score index for the feature maps.
Typically, within CAM-like studies the gradient maps are averaged into single values to produce the saliency map. However, this projection is not always efficient. In figure 1, we can see that the weighted multiplication still contains traces of unwanted classes. To address this issue, we first look at the problem setup. For a given image, we use a model that produces a fixed number of feature maps before the dense layers. This fixation limits our search space for the saliency map; the only variables are the gradient maps, which change according to the class.
Converting a gradient matrix to a single value prevents many significant gradients from influencing the overall accumulation. If we instead keep all the gradients and perform elementwise multiplication with the feature maps, the obtained map contains a better representation of the assigned class. This acquired map serves as an intermediate global guidance map for the usual local weighted multiplication. We use this guidance map to constrain the generated feature maps to respond only to the assigned class through element-wise matrix multiplication. After that, we perform the formal weighted accumulation to acquire the saliency map, followed by a carefully designed upscaling process. Thus, the produced saliency map is more responsive to the given class and its boundaries than previous approaches. The following lines summarize our overall contributions:
• A new saliency map generation scheme is formulated by introducing a global guidance map that incorporates the elementwise influence of the gradient tensor onto the saliency map.
• The acquired saliency map boundaries are crisper than those of contemporary studies and remain effective in single-class, multi-class, and multi-instance-single-class cases.
• To validate our study, we perform seven different metric analyses on three different datasets, and the proposed study achieves state-of-the-art performance in most cases.

Related Work
Backprop-based methods. Zisserman et al. [1,2] first introduced gradient calculation focused on computing the confidence score for explanation generation, and other backpropagation-based explanation studies followed [14,12,11]. However, their gradient employment and manipulation lead to several issues, as addressed by [12,11].
Instead of a single layer's gradient, Srinivas et al. [11] aggregate gradients from all convolutional layers.
Activation-based methods. Class activation mapping (CAM) methods are based on the study [7], where the authors select feature maps as the medium for creating an explanation map. GradCAM [9] is a weighted linear combination of the feature maps followed by a ReLU operation for a given image and model. Later, GradCAM++ [10] introduced the effect of higher-order gradients while including only positive elements, achieving a more precise representation than previous studies. However, gradients are not the only way to generate the saliency map, which inspired ScoreCAM [22] and AblationCAM [13]. X-GradCAM [8], an extension of GradCAM, follows the same underlying weighted multiplication as GradCAM. EigenCAM [23] introduces principal component analysis into developing the saliency map. CAMERAS [24] extends basic CAM studies with multiscale input, leveraging a fusion of multiscale feature and gradient map weighted multiplication.
Perturbation-based methods. Another group of studies treats the neural network as a "black box" and proposes an explanation map by probing the input space. These studies generate saliency maps by checking the response to a manipulated input space. By randomly blocking/blurring/masking some regions, these studies observe the forward-pass response for each case and finally aggregate the decisions to develop the saliency map. RISE [4] first introduced such analysis, followed by external perturbation studies [20,21] in which the authors rely upon an optimization procedure. SISE [5] selects feature maps from multiple layers, followed by attribution map generation and mask scoring to generate the saliency map. It was later improved through adaptive mask selection in the ADA-SISE [6] study.

Proposed methodology
This section describes the formulation of the proposed approach, shown in figure 2, for obtaining the saliency map for a given model and image.

Baseline formulation
Every trained model infers through the collective response of its feature maps for the given image. The current understanding of deep learning still lacks an ideal formulation for the extracted feature maps during inference. Activated feature maps can be sparse or shallow yet rich in spatial correspondence, and their collective distribution governs the final prediction.
Let φ be the model we take for the inference. For any image X ∈ R^(W×H×3), the model φ generates feature maps M. Say M^k_l is the k-th feature channel at the l-th layer, before the final dense layer. If we aggregate them all, we obtain a unified map that represents the collective correspondence of the model φ for the given image. This aggregation over the activated channels is as follows:

A_l = Σ_k M^k_l . (1)

Figure 2: Visual comparison between previous state-of-the-art studies and the proposed method. For the given demonstrations, our approach can mark the primary salient regions under challenging visual conditions. Additionally, our saliency maps are more concrete and leave almost no traces of the secondary-salient areas.
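This channel-wise aggregation can be sketched as follows. This is a minimal NumPy illustration of equation (1), not the authors' implementation; the function name and array layout are our own assumptions.

```python
import numpy as np

def aggregate_feature_maps(feature_maps: np.ndarray) -> np.ndarray:
    """Collapse a (K, H, W) stack of activated feature maps into one
    unified map A_l by plain channel-wise summation (eq. 1)."""
    return feature_maps.sum(axis=0)
```

In practice, the (K, H, W) tensor would come from a forward hook on the last convolutional layer of the network.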
This tensor A_l contains the global representation of all activated feature maps, which can serve to mark salient regions with careful tuning, as shown in figures 1g and 1o. In equation (1), we aggregated all of the feature representations into A_l; elements over a specific threshold may correspond to the primary class information. Nonetheless, this observation only holds for images with a single class and is not appropriate for dual-class scenarios, as shown in figure 1.
To achieve class-discriminatory behaviour, researchers [9,10,12,11] used weighted aggregation in equation (1) instead of a plain linear addition. To extract the weights, they first compute the gradient maps ∇_C with respect to the given class C from the feature maps M.
If Y^C is the class score [9] for the given image from the input model φ, then for each location (i, j) of the k-th feature map at the l-th layer, the corresponding gradient map is expressed as:

∇^k_(l,C)(i, j) = ∂Y^C / ∂M^k_l(i, j) . (2)

If each feature map holds Z elements, then the corresponding weight for each feature map M^k_l is:

λ^k_(l,C) = (1/Z) Σ_(i,j) ∇^k_(l,C)(i, j) , (3)

which is the mean value of ∇^k_(l,C). Hence, the regular baseline formulation [9] for the saliency map S^C is expressed as follows:

S^C = ReLU( Σ_k λ^k_(l,C) M^k_l ) . (4)
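The baseline of equations (2)-(4) can be sketched as below, assuming the gradient maps have already been obtained from backpropagation. This is an illustrative NumPy mock-up of the GradCAM-style baseline, not the authors' code.

```python
import numpy as np

def gradcam_saliency(feature_maps: np.ndarray, grad_maps: np.ndarray) -> np.ndarray:
    """Baseline saliency map: weight each (K, H, W) feature map by the
    mean of its gradient map, sum over channels, then apply ReLU."""
    weights = grad_maps.mean(axis=(1, 2))                   # lambda^k: mean over the Z elements
    saliency = np.tensordot(weights, feature_maps, axes=1)  # sum_k lambda^k * M^k
    return np.maximum(saliency, 0.0)                        # ReLU
```

Note that a channel with equal positive and negative gradients receives a near-zero weight, illustrating how the averaging step can erase significant gradient structure.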

Incorporating global guidance
Equation (4) achieves class-discriminatory behavior; however, it still faces challenges due to its formulation. If we investigate the above equation, we see that λ^k_(l,C) weighs the corresponding feature map M^k_l. A typical λ^k_(l,C) treats every member of the given M^k_l equally and increases/decreases their collective effect homogeneously. Therefore, we can still see traces of the other classes in the saliency map, as shown in figure 1. Other studies [5,6] have shown class-discriminatory performance through perturbations without relying upon gradients. However, those studies are time-expensive [11] and often require human interaction. Hence, it is natural to ask: can we still rely on the gradient-weighted operation and address the above issues?
In response, we propose the global guidance map. To obtain it, we perform a simple elementwise multiplication between the feature maps and their corresponding gradient maps:

G_M = Σ_k M^k_l ⊙ ∇^k_(l,C) . (5)

The idea behind the global guidance map is to focus only on the salient regions by limiting the operating zone of λ^k_(l,C). Since the multiplication in equation (5) considers the individual response of every element of the gradient maps, we can mark the class-specific regions in the feature maps. In this way, the captured guidance map successfully omits traces of the secondary classes from equation (4), unless the generated feature maps heavily overlap between categories, signifying possible misclassification.
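A minimal sketch of the guidance map computation, under our assumption that the per-channel elementwise products are accumulated over channels into a single map:

```python
import numpy as np

def global_guidance_map(feature_maps: np.ndarray, grad_maps: np.ndarray) -> np.ndarray:
    """Elementwise product of each feature map with its gradient map,
    summed over the K channels into one (H, W) guidance map G_M."""
    return (feature_maps * grad_maps).sum(axis=0)
```

Unlike the averaged weights, every gradient element contributes at its own spatial location, so locally strong gradients survive the accumulation.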
In summary, we first compute G_M and multiply it with each of the feature maps to obtain class-discriminative feature maps from the initial class-representative feature maps. Then, we weight each class-discriminative feature map by its λ^k_(l,C) and aggregate to obtain the desired representation. Hence, our proposed weighted-multiplicative aggregation is as follows:

S^C = ReLU( Σ_k λ^k_(l,C) (G_M ⊙ M^k_l) ) . (6)

In equation (6), the only reason we use λ^k_(l,C) is to gain a homogeneous increment similar to equation (4), which also helps to preserve visual integrity during the final upsampling operation. Since the guidance map G_M can successfully omit the secondary classes from any k-th feature map M^k_l by 'masking' the primary governing region, it also becomes possible to amplify the desired class region with the additional multiplicative help of λ^k_(l,C). Finally, typical smoothing and normalization are performed on the saliency map before post-aggregation upsampling.

Figure 4: Visual comparison between previous state-of-the-art studies and the proposed method. For the given demonstrations, our approach can mark the primary salient regions under challenging visual conditions. Additionally, our saliency maps are more concrete and leave almost no traces of the secondary-salient areas.
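The full pipeline of equation (6) might look like the following NumPy sketch. It assumes the guidance map is the channel-summed elementwise product of features and gradients, and it omits the smoothing, normalization, and upsampling steps; it is an illustration, not the authors' implementation.

```python
import numpy as np

def guided_saliency(feature_maps: np.ndarray, grad_maps: np.ndarray) -> np.ndarray:
    """Guidance-rectified saliency: mask every feature map with G_M,
    then apply the usual gradient-mean weighting and ReLU."""
    guidance = (feature_maps * grad_maps).sum(axis=0)   # G_M, eq. (5)
    weights = grad_maps.mean(axis=(1, 2))               # lambda^k, eq. (3)
    masked = guidance[None, ...] * feature_maps         # G_M elementwise with each M^k
    return np.maximum(np.tensordot(weights, masked, axes=1), 0.0)  # eq. (6)
```

Because every feature map is gated by the same guidance map before the weighted sum, regions that lack class-specific gradient response are suppressed everywhere at once.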
Performance evaluation

Datasets. Our experimental setup covers three widely used vision datasets: ImageNet, MS-COCO 14, and PASCAL-VOC 12. Among them, PASCAL-VOC 12 provides full segmentation annotations for the input images; hence, experiments using segmentation-oriented metrics are from this dataset. The remaining experiments do not involve segmentation labels and are applicable to all datasets mentioned above. For the ImageNet and MS-COCO 14 datasets, we randomly selected a few thousand images [22,10,8] for the experiments.
Compared studies. For visual comparison, given the availability and functional complexities, we consider the available official implementations of GradCAM [9], GradCAM++ [10], X-GradCAM [8], ScoreCAM [22], FullGrad [11], Smooth-FullGrad [11], CAMERAS [24], Integrated [12], and Relevance-CAM [25]. We report seven different quantitative analyses in addition to the visual demonstration. In our experiments, we extract saliency maps from pretrained VGG16 [26] and ResNet50 [27] networks for all of the above methods. A later section reports the quantitative results for ResNet50 in the respective tables. As with previous studies [22,12,10,8], a few thousand random images are selected from the datasets for our experiments. However, since the random images from previous studies are not made public, the results of our random selections may vary from the cited studies.
Figure 5: (a) shows the interpretation comparison between the proposed method and GradCAM++ [10]. The upper row is the response map for the primary "Albatross" bird class, and the second row is for the secondary class, the "Ruddy Turnstone" bird. The proposed method clearly presents the difference between VGG16 and ResNet50 in terms of interpretation, whereas GradCAM++ responses are similar across different networks and classes. (b) As we suspect that dataset bias might lead to such a decision for the given "Albatross" images, an inspection of the ImageNet dataset can clarify further. Upon examining typical Ruddy Turnstone images in the ImageNet dataset, we see that a stony shore is the background for most of them.

Visual demonstration
In this section, we show a qualitative comparison with previous methods. In figure 4, we include diverse sets of images from the datasets. The first five rows show the performance with ResNet50, and the later rows are for VGG16. Our image selection includes class-representative, class-discriminative, and multiple-instance examples. The proposed method captures the class-discriminative region with greater confidence, if not the entire class area itself, for the bicycle, sheep, sea-bird, and dog images. Our scheme captures the class region for the sea-bird image and, unlike other methods, excludes its reflection in the water.
Additionally, the proposed CAM bounds the whole dog as a dog and captures the sheep in a low-light environment more clearly than other methods. For images with dual classes, our method presents superior class-discriminative performance. Our study successfully bounds the horse regions in the horse images without leaving traces in other class regions. In contrast, compared studies often mark both the horse and the secondary class instances.

Quantitative analysis
To present our quantitative analysis, we perform the following experiments: the model's performance drop and increment due to the salient and context regions, the Pointing game score, the Dice score, and the IoU score. We present the scores for these metrics for ResNet50 on all datasets.
Performance due to the saliency region. If we had a perfect model and a perfect interpreter marking the spatial correspondence for the specific class, the network would provide a similar prediction for both the given image and the segmented salient image. Here, we first extract the salient region from the given image with the help of the given interpreter. Then we perform prediction on the original image and on the corresponding salient image [10] and check the performance drop for the given interpreter. The expectation is that a better interpreter excludes as much of the non-salient region as possible; hence, the performance drop will be as low as possible. Therefore, our first metric measures the performance drop with only the salient area as input. In some cases, prediction performance suffers due to the presence of strong spatial context.
Performance due to the context region. Similarly, if the saliency extraction is as good as expected, we can set up another experiment with the context region. In this setup, we first exclude the salient part from the given image to obtain the context image and predict on it. If the interpreter can successfully extract all the salient areas, the performance will drop by nearly 100 percent.
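The two drop metrics can be sketched as follows. Here `predict` stands in for a model's class-confidence function, and the 50%-of-max threshold used to binarize the saliency map is our own assumption, not a value stated in the text.

```python
import numpy as np

def confidence_drops(predict, image, saliency, rel_threshold=0.5):
    """Confidence drop when keeping only the salient region (lower is
    better) and when keeping only the context region (higher is better)."""
    mask = (saliency >= rel_threshold * saliency.max()).astype(image.dtype)
    base = predict(image)
    salient_only = predict(image * mask[..., None])         # context blacked out
    context_only = predict(image * (1.0 - mask)[..., None])  # salient region blacked out
    return base - salient_only, base - context_only
```

A toy `predict` such as `lambda x: x.mean()` is enough to check the bookkeeping; a real evaluation would pass the softmax score of the target class.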
Pointing game, Dice score, IoU. For image sets with segmentation labels, various segmentation evaluations can be calculated for saliency maps. We follow [17] to perform the pointing game for class-discriminative evaluation. In this metric, the ground-truth label is used to trigger each visualization approach, and the maximum active spot on the resulting heatmap is extracted. We then determine whether this highest saliency point falls inside the annotated bounding box of the object, counting it as a hit or a miss. The pointing game accuracy is calculated as

Hits_total / (Hits_total + Misses_total) ,

where a high value represents a better explanation for the model. The Dice score is a popular metric for analyzing segmentation performance; it is the ratio of the doubled intersection over the total number of elements in the two masks. IoU stands for intersection over union, another widespread metric for evaluating segmentation performance. This score ranges from 0 to 1 and signifies the overlap between the obtained mask and its corresponding ground truth.
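The three segmentation-oriented scores can be sketched with the illustrative helpers below; the thresholding that turns a saliency map into the binary `pred_mask` is assumed to happen beforehand.

```python
import numpy as np

def pointing_hit(saliency: np.ndarray, gt_mask: np.ndarray) -> bool:
    """Pointing game: does the saliency maximum land inside the ground truth?"""
    i, j = np.unravel_index(saliency.argmax(), saliency.shape)
    return bool(gt_mask[i, j])

def dice_iou(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Dice = 2|A∩B| / (|A|+|B|); IoU = |A∩B| / |A∪B|."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    dice = 2.0 * inter / (pred_mask.sum() + gt_mask.sum())
    return dice, inter / union
```

Accumulating `pointing_hit` over a dataset and dividing hits by hits plus misses yields the pointing game accuracy defined above.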
The proposed method performs better than or on par with the best-performing saliency generation methods in table 1. Here, we present the comparative data on the PASCAL VOC 2012 dataset for the ResNet50 model. Out of seven different performance tests, our method obtains the highest score for six of them. For the increase in the context zone, our study differs by only 0.03 from the best-performing result. In table 2, the proposed method achieves state-of-the-art performance for three out of four metrics. Table 3 also shows the best performance of our method for every metric on the ImageNet dataset. The PASCAL VOC 2012 dataset has corresponding segmentation masks, and we can obtain pseudo segmentation masks by thresholding the saliency maps. The achieved scores for the Pointing game, Dice, and IoU signify that our study captures the zones of interest better than the compared studies. To measure explainability, we also conducted a comparative analysis in figure 6 on three images from the PASCAL VOC 2012 dataset, presenting the insertion and deletion operations. Here, our method captures the most salient regions for single-class, dual-class, and dual-class multi-instance cases compared to GradCAM [9], GradCAM++ [10], and X-GradCAM [8].

Interpretation comparison
Saliency map generation is not only about capturing the class of interest as precisely as possible; a faithful interpretation is also a significant part of saliency generation studies. In other words, an interpretable method should explain why the underlying model is making a given prediction by marking the corresponding image region. We cannot present this for every image from the dataset, but a sophisticated example can show the difference from previous methods.
In figure 5a, we present the interpretation comparison between the proposed study and GradCAM++ [10].
Here, the top-1 class response is 'Albatross' for the given image. For VGG16, the proposed saliency map marks one of the Albatross birds and the water as context, but fails to mark the other Albatross bird. In contrast, with ResNet50 our guidance map marks both Albatross birds without marking the water context. For both networks, GradCAM++ [10] captures both of the birds and barely touches the water context. For this particular image, the proposed method presents a clearer interpretation difference between VGG16 and ResNet50.
In figure 5b, we show why the models might identify the given image as the 'Ruddy Turnstone' class. With our scheme, we interpret that the surrounding stones and water are features corresponding to the Ruddy Turnstone bird for both VGG16 and ResNet50. This interpretation makes more sense if we look at typical Ruddy Turnstone images from the ImageNet dataset (figure 5b), where most Ruddy Turnstone birds are shown against a stony sea-shore background. Hence, we can utilize this interpretation as a means of identifying dataset bias. On the other hand, GradCAM++ [10] shows the Albatross bird regions as the interpretation for the Ruddy Turnstone and almost ignores the associated context.

Conclusion
In this study, we present a novel extension of the traditional gradient-dependent saliency map generation scheme. The proposed method leverages element-wise multiplicative aggregation as guidance for the conventional weighted multiplicative summation and further improves salient region bounding. Additionally, we demonstrate our study's advanced class-discriminative performance and present evidence of better area framing with deeper networks. Furthermore, our method produces crisper saliency maps and significant quantitative improvements over three widely used datasets. In future work, we aim to integrate this study into other vision tasks.

Figure 1 :
Figure 1: Proposed global guidance maps G_M for the cat, proposed saliency map (S^C) with and without global guidance, and the same maps for the dog. The proposed global guidance map provides strong localization information as well as exclusion of the non-target class.

Figure 3 :
Figure 3: Proposed global guidance maps G_M for the cat, proposed saliency map (S^C) with and without global guidance (first row), and the same maps for the dog (second and third rows). The proposed global guidance map provides strong localization information as well as exclusion of the non-target class.

Figure 6 :
Figure 6: AUC demonstration for the insertion and deletion operation for images on the left side.The above analysis shows that the proposed method can capture the most salient regions for a single class, dual-class, and dual-class with multiple instances compared to the previous methods [9, 8, 10].

Table 1 :
Comparative evaluation in terms of salience zone drop and context zone increase, Pointing Game, Dice, and IoU (higher is better), and salience zone increase and context zone drop (lower is better) on the PASCAL VOC 2012 dataset for the ResNet50 model. The best scores are in bold and the second-best scores are underlined.

Table 2 :
Comparative performance drop and increment of the saliency and context zones on the MS-COCO 14 dataset for the ResNet50 model.

Table 3 :
Comparative performance drop and increment of the saliency and context zones on the ImageNet dataset for the ResNet50 model.