SC-SM CAM: An Efficient Visual Interpretation of CNN for SAR Images Target Recognition

Convolutional neural networks (CNNs) have achieved high accuracy in synthetic aperture radar (SAR) target recognition; however, the lack of transparency of CNNs is still a limiting or even disqualifying factor. Therefore, visually interpreting CNNs with SAR images has recently drawn increasing attention. Various class activation mapping (CAM) methods are adopted to discern the relationship between a CNN's decision and image regions. Unfortunately, most existing CAM methods are designed for optical images; thus, they usually yield a limited visualization effect for SAR images. Although the recently proposed Self-Matching CAM can obtain a satisfactory effect for SAR images, it is quite time-consuming because it requires hundreds of self-matching operations per image. G-SM-CAM reduces the time of such operations dramatically, but at the cost of visualization effect. To address the limitations of the above methods, we propose an efficient method, Spectral-Clustering Self-Matching CAM (SC-SM CAM). Spectral clustering is first adopted to divide feature maps into groups for efficient computation. In each group, similar feature maps are merged into an enhanced feature map with more concentrated energy in a specific region; thus, the saliency heatmaps match the target more accurately. Experimental results demonstrate that SC-SM CAM outperforms other SOTA CAM methods in both effect and efficiency.


Introduction
Synthetic aperture radar (SAR) imaging has been widely applied in remote sensing, geoscience, electronic reconnaissance, etc., due to its all-weather, day-and-night working conditions and high-resolution imaging ability [1][2][3][4]. Target recognition is usually deemed one of the most challenging tasks in SAR image processing, due to the blurred edges and heavy speckle noise in SAR images [5,6]. Therefore, a series of pre-processing procedures are required, including de-speckling [7], edge detection [8], region of interest (ROI) extraction [9], and feature fusion, before a classifier such as a support vector machine (SVM), perceptron, or decision tree is used to assign a SAR image to its most probable class. These multiple individual pre-processing steps are quite time-consuming and unfriendly for real-time applications. To resolve this, numerous deep-learning-based algorithms, especially convolutional neural networks (CNNs), are adopted to realize automatic target recognition (ATR). Ref. [6] adopted a CNN as a classifier in ATR tasks and obtained higher accuracy than SVM. Ref. [10] proposed a gradually distilled CNN with a small structure and low time complexity for ATR. Ref. [11] designed a large margin, softmax batch-normalization CNN (LM-NB-CNN), particularly for the ATR of ground vehicles. Ref. [12] proposed a lightweight, fully convolutional neural network based on a channel-attention mechanism, and obtained higher accuracy than other existing ATR methods.
The above CNN-based algorithms can replace the aforementioned pre-processing with an end-to-end structure; thus, the computing efficiency can be improved dramatically.
However, there is a dearth of analytical or mathematical interpretations of CNN's inner recognition mechanism; thus, CNN is still used as a "black box" [13,14]. The lack of transparency of CNN techniques may be a limiting or even disqualifying factor [15] in some special scenarios. In particular, if a single wrong decision can endanger human life and health (e.g., autonomous driving [16], the medical domain [17]) or cause significant monetary losses (e.g., electronic reconnaissance and countermeasures in remote sensing), relying on a data-driven system whose reasoning is incomprehensible may not be an option. To interpret the "black box", some visualization methods have been proposed to provide a saliency heatmap whose highlighted regions are most related to CNN's decision, such as RISE [18], LRP [19], XRAI [20], Deep Taylor [21], and class activation mapping (CAM) [22]. Recently, increasing attention has been drawn to CAM methods due to their intuitive effects; thus, numerous modified CAM methods have been proposed, such as Grad-CAM [23], Grad-CAM++ [24], XGrad-CAM [25], Ablation-CAM [26], Score-CAM [27], etc. Unfortunately, these CAM methods are all designed for optical images; thus, they show a very restricted visualization effect on SAR images. This is probably due to the difference in imaging mechanism and properties between SAR images and optical images, as discussed in the first paragraph.
To alleviate this limitation, we proposed a Self-Matching CAM, particularly for SAR images, obtaining a SOTA performance [28]. In the Self-Matching CAM, an operator termed "self-matching" is introduced to suppress energy that is irrelevant to the target in CNN's feature maps. Therefore, the Self-Matching CAM can highlight a region matching the target precisely for most SAR images. However, the Self-Matching CAM is still not a panacea: (1) it is quite time-consuming since hundreds of "self-matching" operations are required per image; (2) there is sometimes a deviation between the highlighted region and the target for a few SAR images. To boost the computational efficiency, Ref. [29] proposed Group-CAM, which divides the feature maps into several groups. Accordingly, the number of operations performed on feature maps can be reduced dramatically. However, this speed-up comes at the cost of visualization effect for SAR images because this straightforward strategy places feature maps with neighbouring channel indices into one group, although there is no obvious relationship among feature maps with neighbouring channel indices in a convolutional layer.
In this paper, an efficient CAM method, Spectral-Clustering Self-Matching CAM (SC-SM CAM), is proposed to visualize CNN's inner mechanism in ATR. The contributions of this paper can be summarized as follows: (1) SC-SM CAM provides a reasonable and interpretable grouping strategy instead of grouping by channel indices; thus, more highlighted pixels can be located in the target region; (2) SC-SM CAM is an efficient method. Unlike Group-CAM, which gains speed with a loss of effect, SC-SM CAM runs nearly twice as fast as the Self-Matching CAM, with a conspicuous improvement in visualization results.
The remainder of this paper is organized as follows. Section 2 introduces the basic theory of CAM and reviews several SOTA CAM methods, especially our previous work, the Self-Matching CAM. Section 3 elaborates the methodology of SC-SM CAM. Section 4 provides numerous experimental results from various perspectives to demonstrate the superiority of SC-SM CAM compared to other existing CAM methods. Section 5 discusses the experimental results and clarifies the confusion. Finally, Section 6 concludes this paper and discusses future work.

Related Work
In this section, we review several existing CAM methods from two categories: optical-based CAM and SAR-based CAM. The former contains numerous modified versions, while the latter refers specifically to the Self-Matching CAM in this paper. Besides, since Group-CAM is designed for optical images, we propose a modified version combined with Self-Matching CAM: Group-Self-Matching CAM (G-SM-CAM).

Optical-Based CAM
CAM was first proposed by Bolei Zhou et al. in Ref. [22] for CNNs with global average pooling (GAP) after the last convolutional layer. The spatial element of the heatmap $H^c_{\mathrm{CAM}}$ generated by CAM for a given class $c$ is defined by:
$$H^c_{\mathrm{CAM}} = \sum_{k} \alpha_k^c A_k \tag{1}$$
where $A_k$ denotes the feature map in the $k$-th channel of a convolutional layer. Note that GAP compresses each feature map to a single pixel and then connects it to neurons in fully connected layers; thus, the parameter $\alpha_k^c$ can be replaced with the weights $w_k^c$ between the last convolutional layer and its next fully connected layer. However, most SOTA CNN models have abandoned the GAP layer, so CAM cannot be directly performed on them. To improve generality, many researchers focus on modifications or manipulations of $\alpha_k^c$. Different definitions of $\alpha_k^c$ lead to different CAM methods, e.g., Grad-CAM [23] and Grad-CAM++ [24] utilize the first-order and second-order partial gradients of the prediction score $S_c$ with respect to $A_k$ to formulate $\alpha_k^c$, respectively. Ref. [25] proposed XGrad-CAM to enhance the interpretability of $\alpha_k^c$. Recently, Refs. [26,27] proposed two gradient-free methods, Ablation-CAM and Score-CAM, to avoid the negative influences of vanishing and exploding gradients.
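As a minimal illustration of the weighted sum that all of these CAM variants share, the following NumPy sketch combines a stack of feature maps with per-channel class weights. The function names `cam_heatmap` and `normalize01` are hypothetical, and the min-max rescaling for display is an assumption rather than part of the definition above; how the weights are obtained (GAP weights, gradients, ablation, etc.) is left to the caller.

```python
import numpy as np

def cam_heatmap(feature_maps, weights):
    """Weighted sum over channels: H = sum_k weights[k] * feature_maps[k].

    feature_maps: array of shape (K, h, w), one map per channel.
    weights: array of shape (K,), the per-channel weights for one class.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contract the channel axis; the result has shape (h, w).
    return np.tensordot(weights, feature_maps, axes=1)

def normalize01(heatmap):
    """Rescale a heatmap to [0, 1] for display (illustrative assumption)."""
    lo, hi = heatmap.min(), heatmap.max()
    if hi == lo:
        return np.zeros_like(heatmap)
    return (heatmap - lo) / (hi - lo)
```

Different CAM methods would call `cam_heatmap` with different `weights` while leaving the feature maps untouched.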

Self-Matching CAM
It is worth noting that the above optical-based CAM methods usually highlight a region that excessively covers the target in saliency heatmaps for SAR images. To alleviate this limitation, we proposed a Self-Matching CAM in Ref. [28] for SAR images. In the Self-Matching CAM, we introduce a "self-matching" operator to process the feature map $A_k$ instead of manipulating $\alpha_k^c$. Specifically, the input SAR image and all feature maps are first downsampled and upsampled, respectively, to the same size. Then, the Hadamard product of each feature map and the SAR image is adopted as the new feature map $\hat{A}_k$, formulated as follows:
$$\hat{A}_k = U(A_k) \circ D(I) \tag{2}$$
where $\circ$ denotes the Hadamard product operation, $I$ refers to the input SAR image, and $D(\cdot)$ and $U(\cdot)$ denote downsampling and upsampling, respectively. This processing is termed "self-matching", since only the elements relevant to the target itself are preserved in feature maps. More details about these CAM methods can be found in Ref. [28].
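The self-matching operator can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: nearest-neighbour resampling stands in for the downsampling and upsampling operators (the interpolation actually used is not specified here), and the names `resize_nearest`, `self_match`, and the `size` parameter are hypothetical.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize; works for both up- and downsampling."""
    img = np.asarray(img, dtype=float)
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

def self_match(feature_maps, image, size):
    """Resize all inputs to a common size and take the Hadamard product.

    feature_maps: array of shape (K, h, w).
    image: array of shape (H, W), the input SAR image.
    size: (out_h, out_w), the common size (an assumed free parameter).
    """
    out_h, out_w = size
    img = resize_nearest(image, out_h, out_w)
    maps = np.stack([resize_nearest(a, out_h, out_w) for a in feature_maps])
    # Broadcasting multiplies every map element-wise by the resized image,
    # so energy outside the bright (target) pixels is suppressed.
    return maps * img
```

Because the image acts as a multiplicative mask, activations that fall on dark background pixels are attenuated, which is the intuition behind the operator.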

Group-Self-Matching CAM
It should be noted that $\alpha_k^c$ in Self-Matching CAM can be obtained by any of the aforementioned CAM methods. Hence, similar to these CAM methods, the Self-Matching CAM also requires hundreds of "self-matching" operations per image (most SOTA CNN models have hundreds of convolutional filters in the last convolutional layer). To improve computing efficiency, G-SM-CAM utilizes a division strategy to divide the feature maps into $G$ groups as follows,
$$\tilde{A}_g = \sum_{k = gK/G}^{(g+1)K/G - 1} A_k, \quad g \in \{0, 1, \ldots, G-1\} \tag{3}$$
where $K/G$ is the number of feature maps in a group, and the number of merged feature maps $\tilde{A}_g$ is $G$, which is less than $K$. In this case, the number of "self-matching" operations can be reduced from hundreds to $G$. The following should be noted:
(1) The original Group-CAM adopts a series of operations such as mask smoothing, image blurring, and confidence calculation to estimate the weight of a specific feature map [29], while G-SM-CAM only adopts the division strategy to categorize feature maps into different groups; thus, no operation on weights is required.
(2) This division strategy places the feature maps with neighbouring indices into one group. However, there is no specific relationship among feature maps with neighbouring indices. This division strategy increases the speed at the cost of visualization effects, as discussed in Section 4.
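The index-based division strategy can be sketched as follows. This is a minimal illustration that assumes the maps within a group are merged by summation and that K is divisible by the number of groups; the function name `group_by_index` is hypothetical.

```python
import numpy as np

def group_by_index(feature_maps, num_groups):
    """Merge feature maps with neighbouring channel indices into groups.

    feature_maps: array of shape (K, h, w); K must be divisible by num_groups.
    Returns an array of shape (num_groups, h, w); each entry is the sum of
    K / num_groups consecutive maps (summation is an assumption here).
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    K = feature_maps.shape[0]
    if K % num_groups != 0:
        raise ValueError("K must be divisible by num_groups")
    per_group = K // num_groups
    grouped = feature_maps.reshape(num_groups, per_group, *feature_maps.shape[1:])
    return grouped.sum(axis=1)
```

Note that the grouping depends only on channel order, not on the content of the maps, which is exactly the weakness pointed out in item (2).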

Motivation
As discussed in Section 2, Self-Matching CAM is effective for SAR images but time-consuming, while G-SM-CAM can improve computing efficiency greatly, with a loss of effect. Therefore, it is natural to wonder whether there is a more reasonable division strategy, which can be embedded in a Self-Matching CAM instead of the straightforward strategy in Group-CAM. In fact, the problem of Group-CAM is that its strategy groups feature maps according to channel indices, so feature maps with little similarity end up in the same group. These dissimilar feature maps may introduce redundant information in the merged feature map $\tilde{A}$. Thus, it is very important to place feature maps with high similarity in the same group. In this paper, we adopt spectral clustering (SC) as the division strategy because (1) SC is a very efficient clustering method; (2) SC uses dimensionality reduction, and so is more suitable for high-dimensional data, e.g., feature maps in our experiments; (3) different from other, traditional clustering algorithms, such as K-means, SC only requires a similarity matrix among the data, so it is very effective for clustering sparse data, such as feature maps (a number of feature maps are all-zero) [30][31][32].

SC-SM CAM
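Denote the feature maps of the last convolutional layer in a CNN as $A_k$ ($k \in \{0, 1, \ldots, K-1\}$), where $K$ is the number of channels. We categorize $A_k$ into different groups by spectral clustering. Here, each $A_k$ is regarded as a vertex; thus, the similarity matrix $S \in \mathbb{R}^{K \times K}$ can be formulated by the Euclidean distance between two vertices:
$$S(i, j) = \| A_i - A_j \|_F \tag{4}$$
where $\| \cdot \|_F$ denotes the Frobenius norm of a matrix. The similarity matrix $S$ is a symmetric matrix. Next, we can calculate the adjacency matrix $W$ based on the K-nearest neighbor (KNN) [33]:
$$W(i, j) = \begin{cases} \exp\left(-\dfrac{S(i, j)^2}{2\sigma^2}\right), & A_i \in \mathrm{KNN}(A_j) \text{ or } A_j \in \mathrm{KNN}(A_i) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
where $\sigma$ controls the width of the neighborhoods, and the degree matrix is defined as a sum of the weights $W(i, j)$:
$$D(i, i) = \sum_{j} W(i, j) \tag{6}$$
Note that the degree matrix $D$ is a $K \times K$ diagonal matrix. Then, the Laplacian matrix $L$ can be obtained as
$$L = D - W \tag{7}$$
and the normalized Laplacian matrix:
$$\tilde{L} = D^{-1/2} L D^{-1/2} \tag{8}$$
where $\tilde{L}$ is a symmetric matrix. We seek the $\tilde{K}$ lowest eigenvalues of $\tilde{L}$ and their corresponding eigenvectors $y = [y_0, y_1, \cdots, y_{\tilde{K}-1}]^T$ ($T$ denotes the transpose of the matrix). Then, the eigen matrix $H$ can be formulated with $y$,
$$H = [y_0, y_1, \cdots, y_{\tilde{K}-1}] \in \mathbb{R}^{K \times \tilde{K}} \tag{9}$$
Each row of $H$ serves as the embedding of one feature map, and clustering these rows divides the feature maps into $m$ groups. Here, all the feature maps in the same group are summed into a clustered feature map $A^C_n = \sum_{i \in C_n} A_i$ ($n \in \{1, 2, \ldots, m\}$), where $C_n$ denotes the index set of the $n$-th group. In this case, hundreds of $A_k$ can be clustered into several representative feature maps $A^C_n$. Then, a set of new feature maps can be obtained by "self-matching":
$$\hat{A}^C_n = U(A^C_n) \circ D(I) \tag{10}$$
Finally, the saliency heatmap $H^c_{\mathrm{SC\text{-}SM}}$ is formulated as:
$$H^c_{\mathrm{SC\text{-}SM}} = \sum_{n=1}^{m} \alpha^c_n \hat{A}^C_n \tag{11}$$
where the weight $\alpha^c_n$ is obtained by any of the aforementioned CAM methods. The flowchart of SC-SM CAM is shown in Figure 1. The pseudo-code is presented in Algorithm 1.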
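The grouping stage described above can be sketched with NumPy alone. This is not the authors' implementation (Algorithm 1 is not reproduced here) but a minimal sketch under assumptions: a Gaussian kernel on a symmetrized KNN graph, the symmetric normalized Laplacian, and a plain K-means with farthest-point initialization on the rows of the eigen matrix. All function names and the parameters `num_neighbors`, `sigma`, and `num_eig` are hypothetical.

```python
import numpy as np

def kmeans(points, k, iters=50):
    """Plain K-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_groups(feature_maps, num_groups, num_neighbors=5, sigma=1.0, num_eig=None):
    """Assign each feature map (shape (K, h, w)) to one of num_groups clusters."""
    A = np.asarray(feature_maps, dtype=float)
    K = A.shape[0]
    X = A.reshape(K, -1)
    # Pairwise Frobenius distances between feature maps.
    sq = (X ** 2).sum(axis=1)
    S = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Symmetrized K-nearest-neighbour graph (self excluded).
    S_noself = S.copy()
    np.fill_diagonal(S_noself, np.inf)
    nn = np.argsort(S_noself, axis=1)[:, :num_neighbors]
    mask = np.zeros((K, K), dtype=bool)
    mask[np.repeat(np.arange(K), nn.shape[1]), nn.ravel()] = True
    mask = mask | mask.T
    W = np.where(mask, np.exp(-S ** 2 / (2.0 * sigma ** 2)), 0.0)
    # Degree, Laplacian, and symmetric normalized Laplacian.
    deg = W.sum(axis=1)
    L = np.diag(deg) - W
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; keep the lowest ones.
    _, vecs = np.linalg.eigh(L_sym)
    H = vecs[:, :(num_eig or num_groups)]
    return kmeans(H, num_groups)

def merge_groups(feature_maps, labels, num_groups):
    """Sum the feature maps that share a cluster label."""
    A = np.asarray(feature_maps, dtype=float)
    return np.stack([A[labels == n].sum(axis=0) for n in range(num_groups)])
```

The merged maps returned by `merge_groups` would then go through self-matching and the weighted sum, so only one self-matching operation per group is needed instead of one per channel.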

Experimental Results
In this section, the superiority of SC-SM CAM in both validity and efficiency is demonstrated by numerous experiments. We first apply all the aforementioned CAM methods to compare their class discriminative visualization in Section 4.2. Then, we apply an insertion task to investigate the concentration of highlighted pixels in saliency heatmaps in Section 4.3. Next, two ablation studies on two variable parameters are analyzed in Section 4.4. Finally, we compare the running time of SC-SM CAM with that of the Self-Matching CAM and G-SM-CAM to evaluate its computing efficiency in Section 4.5.

Experiment Setup
All experiments in this paper are conducted on the benchmark dataset MSTAR. MSTAR contains 5172 SAR images corresponding to 10 classes of military vehicles, 2536 for training and 2636 for validation. All SAR images have a size of 1 × 100 × 100 and are normalized to the range [0, 1]. AlexNet is adopted as the CNN classifier in our experiments (the optimizer is stochastic gradient descent (SGD), learning rate = 5 × 10^-4, and momentum = 0.9).

Class Discriminative Visualization
In this section, we first present a qualitative comparison of saliency heatmaps generated by the aforementioned CAM methods, including Grad-CAM, Grad-CAM++, XGrad-CAM, Ablation-CAM, Score-CAM, Self-Matching CAM, G-SM-CAM, and SC-SM CAM, as shown in Figure 2. Intuitively, the heatmaps of Self-Matching CAM, G-SM-CAM, and SC-SM CAM resemble the original target much more than those of the optical-based CAM methods. To further demonstrate this, we adopt intersection over union (IoU) to measure the similarity between the original SAR images and their corresponding heatmaps. The definition of IoU in our experiment is:
$$\mathrm{IoU} = \frac{\mathrm{Area\_overlap}}{\mathrm{Area\_union}} \tag{12}$$
where Area_overlap denotes the overlapped area of the highlighted region in the heatmap and its corresponding target area in the SAR image, and Area_union denotes the union of both parts, as shown in Figure 3. From Equation (12), a high value of IoU means a high similarity between the original images and CAM heatmaps. Note that the ground truth of each image is manually labeled at pixel level. In Table 1, we compute the IoU of each image in Figure 2.
Table 1. IoU for the SAR images and heatmaps. Each row corresponds to an original SAR image from the first to the tenth row in Figure 2.
To quantitatively measure the performances of various CAM methods, Ref. [29] utilizes the "occlusion test" and "conservation test" to demonstrate the superiority of the Self-Matching CAM. In the "occlusion test", the pixels most relevant to the target are occluded by pixel-wise multiplying the original SAR image with the binarized heatmap whose high-value elements are set to 0 with a threshold, while the "conservation test" preserves the pixels most relevant to the target. Then, the occluded or conserved images are sent to the CNN to measure the confidence_drop between $I$ and $\check{I}$, where $I$ refers to the original image and $\check{I}$ refers to the occluded or conserved image. According to the definition of confidence_drop, a high value of confidence_drop in the occlusion test and a low value in the conservation test mean that the most discriminative information is contained in the occluded or conserved pixels. Ref. [29] reports that only the Self-Matching CAM can simultaneously achieve a high confidence drop in the occlusion test and a low confidence drop in the conservation test. A more detailed analysis can be found in [29]. It is clear from the qualitative and quantitative evaluation that only the three SAR-based CAM methods can precisely locate the target with a highlighted region in saliency heatmaps, while the other, optical-based CAM methods show excessive highlighted regions, which over-cover the target. This clear superiority stems from the self-matching operation and matches the conclusion in Ref. [29]. Therefore, we will only discuss Self-Matching CAM, G-SM-CAM, and SC-SM CAM. Figure 2 shows some representative SAR images and the saliency heatmaps generated by the three methods. For the images with less noise (e.g., from the first to third rows in Figure 2), although SC-SM CAM may produce more speckles in the background, the most highlighted region matches the target more precisely than those of Self-Matching CAM and G-SM-CAM. For the images with heavy noise (e.g., from the fourth to sixth rows in Figure 2), all three CAM methods produce numerous speckles, whereas only SC-SM CAM can concentrate the highlighted pixels in the target area. Self-Matching CAM and G-SM-CAM even highlight a "wrong" region, irrelevant to the target, for the sixth image in Figure 2.
To show this comparison more vividly, we preserve a set of elements in the original SAR images according to the top 20% values in the corresponding saliency heatmaps, as shown in Figure 4 (from the fifth to seventh columns).
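The IoU measure and the occlusion and conservation masks discussed in this section can be sketched as follows. The binarization rule (keeping a fixed top fraction of heatmap values) is an assumption made for illustration, and the helper names `binarize_top`, `iou`, `occlude`, and `conserve` are hypothetical.

```python
import numpy as np

def binarize_top(heatmap, fraction):
    """Boolean mask of the top `fraction` of pixels by heatmap value.

    Ties at the threshold are all kept, so slightly more pixels than
    requested may be selected.
    """
    h = np.asarray(heatmap, dtype=float)
    n_keep = max(1, int(round(fraction * h.size)))
    threshold = np.sort(h.ravel())[-n_keep]
    return h >= threshold

def iou(highlight_mask, target_mask):
    """Intersection over union of two boolean masks."""
    a = np.asarray(highlight_mask, dtype=bool)
    b = np.asarray(target_mask, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def occlude(image, mask):
    """Zero out the highlighted pixels (occlusion test input)."""
    return np.asarray(image, dtype=float) * ~np.asarray(mask, dtype=bool)

def conserve(image, mask):
    """Keep only the highlighted pixels (conservation test input)."""
    return np.asarray(image, dtype=float) * np.asarray(mask, dtype=bool)
```

For example, `conserve(image, binarize_top(heatmap, 0.2))` keeps the pixels selected by the top 20% of heatmap values, in the spirit of the preserved images shown in Figure 4.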

Insertion Check
We implement an insertion check in this section. Here, the insertion check starts with an all-zero image and gradually recovers its contents according to the corresponding saliency heatmap. Specifically, at each step we replace 1% of the pixels of the all-zero image with the corresponding pixels of the original image, starting from the pixels with the highest saliency values, until the image is fully recovered. Figure 5 shows the recovered images of SC-SM CAM with different insertion percentages q. From Figure 5, we can see that, with only a small q (q ≤ 20%), the shape of the target can be recovered. This further demonstrates that pixels with the highest values in saliency heatmaps are accurately concentrated on the target region. To quantitatively evaluate these methods, we calculate the Area Under Curve (AUC) of the classification score after Softmax with different q [34,35].
AUC denotes the area under the receiver operating characteristic (ROC) curve. For a binary classification problem, the ROC curve is drawn by taking the false positive rate (FPR) as the abscissa and the true positive rate (TPR) as the ordinate. AUC reflects a model's performance: when AUC = 1, the model is a perfect classifier; when AUC = 0.5, the model is equivalent to a random classifier; when AUC < 0.5, the model is worse than a random classifier. This concept can be extended to multiclass classification problems by regarding the real label as true and other labels as false.
Firstly, we calculate the AUC of the six representative images in Figure 5. The results are shown in Figure 6. The AUC rises sharply while q is small and drops as q becomes larger. This is probably because when q is small, the re-introduced pixels are concentrated on the target region, which represents the most discriminative feature of the target; whereas, when q becomes larger, some sharp or "strange" edges are introduced, resulting in a low AUC. Hence, an earlier arrival of the maximal AUC means that the most highlighted pixels in the heatmaps are concentrated in the target. Without loss of generality, all 2636 validation images are sent to the CNN and the average AUC is calculated from q = 5% to q = 80%, as shown in Table 2. The highest AUC of SC-SM CAM appears when q = 15%, while it appears when q = 30% for the Self-Matching CAM and G-SM-CAM. This result further quantitatively validates the higher precision of SC-SM CAM.
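The insertion procedure can be sketched as a generator that restores pixels in descending order of saliency. This is an illustration under assumptions: ties in the heatmap are broken by sort order, and `score_fn` is a hypothetical stand-in for the trained CNN's score on the recovered image, not a real API.

```python
import numpy as np

def insertion_sequence(image, heatmap, step=0.01):
    """Yield (q, recovered) pairs, restoring `step` of the pixels at a time.

    Pixels are restored from the highest heatmap value to the lowest;
    q is the fraction of pixels restored so far.
    """
    image = np.asarray(image, dtype=float)
    order = np.argsort(np.asarray(heatmap, dtype=float).ravel())[::-1]
    n = image.size
    per_step = max(1, int(round(step * n)))
    recovered = np.zeros(image.shape)
    flat_img = image.ravel()
    flat_rec = recovered.ravel()  # view into `recovered` (C-contiguous)
    for start in range(0, n, per_step):
        idx = order[start:start + per_step]
        flat_rec[idx] = flat_img[idx]
        yield min(start + per_step, n) / n, recovered.copy()

def insertion_scores(image, heatmap, score_fn, step=0.01):
    """Evaluate a user-supplied score function at every insertion step."""
    return [(q, score_fn(rec)) for q, rec in insertion_sequence(image, heatmap, step)]
```

In practice `score_fn` would wrap a forward pass of the classifier; the resulting list of (q, score) pairs is what a per-q evaluation such as the one in Table 2 would be computed from.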

Ablation Study
Both G-SM-CAM and SC-SM CAM adopt the "grouping" strategy; thus, we investigate the influence of the group number G on the saliency heatmaps generated by G-SM-CAM and SC-SM CAM, respectively. The results are shown in Figure 7. Apparently, the number of speckles in the saliency heatmaps of G-SM-CAM decreases when G rises, while the saliency heatmaps of SC-SM CAM are nearly unrelated to G. This is because the groups in G-SM-CAM are formed according to channel indices. In this case, different G may place feature maps with huge divergence into one group. In comparison, spectral clustering can ensure that similar feature maps are placed in the same group in SC-SM CAM. According to our experimental results, we find that similar feature maps are categorized into several groups, whereas the all-zero feature maps are distributed among the rest of the groups when G changes. Here, we further investigate the optimal G for the 10 classes of SAR images. We record the running time and AUC when the top 15% highlighted pixels are conserved (q = 15% in the insertion check) with G = 1, 2, 4, 8, 16, 32, respectively, as shown in Figure 8. In general, the running time increases with G for each class. It is clear from the left subfigure in Figure 8 that the running time of SC-SM CAM is shorter than the median when G ≤ 4. Note that the running time of G = 1 is much shorter than that of other G. This is because SC-SM CAM degrades into G-SM-CAM when G = 1 (spectral clustering is not required). As for AUC, it is clear from the right subfigure in Figure 8 that the AUC is very low when G = 1 and G = 2, whereas it improves dramatically when G = 4 and then almost retains this high value when G > 4. This is because dissimilar feature maps are forced into the same group when G is small. In contrast, when G ≥ 4, the feature maps with high similarity can be placed in the same group; thus, the AUC is almost unchanged. This result intuitively matches the heatmaps with different G in Figure 7.
Based on the above analysis, we consider G = 4 the optimal group number for the MSTAR dataset, as it balances effect and efficiency. Therefore, our experiments are all conducted with G = 4 unless otherwise specified. Besides, it should be pointed out that G = 4 is only optimal for the MSTAR dataset, whereas the optimal G probably changes for other SAR image datasets. This ablation experiment further demonstrates that spectral clustering works effectively in grouping feature maps.

Computing Efficiency
In this section, we investigate the computational efficiency of our proposed method. We record the average running time of Self-Matching CAM, G-SM-CAM, and SC-SM CAM over the 2636 validation SAR images on an 8th Gen Intel Core(TM) i7-8700 CPU at 3.20 GHz, as shown in Table 3. It is clear from Table 3 that SC-SM CAM runs approximately twice as fast as Self-Matching CAM. It should be noted that, although G-SM-CAM runs much faster than the other methods, this high speed comes at the cost of visualization effects, whereas SC-SM CAM improves both the effect and the efficiency. In addition, we further study the effect of the number of eigenvectors on running time, as shown in Table 4. From Table 4, the number of eigenvectors has no conspicuous influence on running time.

Discussion
In our experiment, the effect and efficiency of SC-SM CAM are verified through both qualitative (class discriminative visualization and ablation study) and quantitative (insertion check and running time) analysis. Class discriminative visualization provides a vivid comparison of various CAM methods, especially the divergence among the three self-matching-based methods. The ablation study shows that the number of eigenvectors of the Laplacian matrix has a significant impact on the visualization effect of SC-SM CAM, whereas the number of clusters has a nominal influence. The insertion check further demonstrates that SC-SM CAM concentrates more high-value pixels in the target area in comparison to G-SM-CAM and Self-Matching CAM. The running time demonstrates the superiority of SC-SM CAM in terms of computational efficiency.
It should be noted that it is possible to strike a balance between the number of eigenvectors and the visualization effect. Seeking an optimal number of eigenvectors is our further research direction.

Conclusions
In this paper, we propose SC-SM CAM, an efficient visual interpretation algorithm of CNN, for target recognition of SAR images. In terms of visualization effects, compared with two SOTA SAR-based CAM methods, Self-Matching CAM and G-SM-CAM, SC-SM CAM can highlight the target area in saliency heatmaps more precisely. In comparison to G-SM-CAM, the fastest of these three methods at the cost of effect, SC-SM CAM increases speed without any loss of visualization effect. Numerous experimental results verify the validity of SC-SM CAM through quantitative and qualitative analyses. These findings may shed light on the understanding of the inner mechanism of CNN classification.

Data Availability Statement:
The experimental dataset adopted in this paper is the measured SAR ground stationary target data provided by the MSTAR program supported by the Defense Advanced Research Projects Agency (DARPA) of the United States. Both internationally and domestically, MSTAR is used as a benchmark dataset for research on SAR image processing. The sensors are high-resolution focused synthetic aperture radars with a resolution of 0.3 m × 0.3 m, which work in the X-band, and the polarization mode is HH. The MSTAR dataset contains SAR images of 10 classes of vehicle, namely 2S1 (self-propelled artillery), BMP2 (infantry fighting vehicle), BRDM2 (armored reconnaissance vehicle), BTR70 (armored transport vehicle), BTR60 (armored transport vehicle), D7 (bulldozer), T62 (tank), ZIL131 (cargo truck), ZSU234 (self-propelled anti-aircraft gun), and T72 (tank). The MSTAR dataset will be made available on request to the first author's email (zpfeng_1@stu.xidian.edu.cn).