SAR-BagNet: An Ante-hoc Interpretable Recognition Model Based on Deep Network for SAR Image

Abstract: Convolutional neural networks (CNNs) have been widely used in SAR image recognition and have achieved high recognition accuracy on some public datasets. However, due to the opacity of their decision-making mechanism, the reliability and credibility of CNNs are currently insufficient, which hinders their application in some important fields such as SAR image recognition. In recent years, various interpretable network structures have been proposed to discern the relationship between a CNN's decision and image regions. Unfortunately, most interpretable networks are designed for optical images and perform poorly on SAR images, and most of them cannot accurately explain the relationship between image parts and classification decisions. To address these problems, we present SAR-BagNet, a novel interpretable recognition framework for SAR images. SAR-BagNet provides a clear heatmap that accurately reflects the impact of each part of a SAR image on the final network decision. In addition to good interpretability, SAR-BagNet also has high recognition accuracy, achieving 98.25% test accuracy.


Introduction
Synthetic Aperture Radar (SAR) imaging is an imaging technology that generates an equivalent synthetic antenna array through the relative displacement between the radar and the imaging target. SAR imaging is less affected by weather and has a certain surface penetration ability, which makes it widely used in military target recognition, urban planning, environment monitoring, disaster assessment, and other fields [1][2][3][4][5]. Nowadays, it is increasingly important to obtain clear interpretations of SAR images. SAR image interpretation usually includes image segmentation, target detection, and recognition, among which target recognition is deemed the most challenging task [6]. Synthetic aperture radar automatic target recognition (SAR-ATR) technology has been widely used in SAR image recognition studies in recent years. SAR-ATR can be divided into two stages: first, representative features of the SAR image are extracted; then, a classifier assigns the image to one of a predetermined set of classes. The recognition features are crucial to SAR-ATR and have a significant impact on the success of the subsequent classifier. Most traditional SAR-ATR methods are designed based on rich theoretical models and expert knowledge [7]. These methods are highly interpretable, but hand-designing features requires considerable domain knowledge and a time-consuming, laborious design process; furthermore, the features of SAR images cannot be fully utilized. Traditional SAR-ATR algorithms include, but are not limited to, the scattering center model (SCM)-based method [8], the principal component analysis (PCA)-based method [9,10], the sparse representation method [11,12], and the multi-feature fusion method [13].
In recent years, with the rapid development of artificial intelligence technology, SAR-ATR based on the deep learning (DL) method has gradually become mainstream in this field.
The remainder of this work is organized as follows. To give a better understanding of SAR-BagNet, Section 2 introduces the basic theory and details of CAM and BagNet. In Section 3, the design ideas and network structure of SAR-BagNet are introduced. In Section 4, we show the experimental results of SAR-BagNet and compare them with several commonly used interpretable models. In Section 5, we clarify some questions about our proposed model and discuss the contributions of this work. Finally, Section 6 concludes this study and discusses future work.

Related Work
In order to explain CNNs, many methods have been proposed. In this section, we will introduce two ante-hoc interpretable methods that are closely related to our work.

CAM Methods
CAM was first proposed by Zhou, Khosla, et al. [26] and plays an influential role in the interpretation of CNNs. CAM was originally designed specifically for CNNs that have a global average pooling (GAP) layer after the last convolution layer. The function of GAP is to compress each feature map of the last convolution layer into a single value $P_k$, which is then fed to the fully connected layer to obtain the final classification score $S_c$. In this case, the single value $P_k$ can be expressed as:

$$P_k = \sum_i \sum_j A^k_{ij} \tag{1}$$

where $A^k_{ij}$ represents the value of the k-th feature map of the last convolutional layer at coordinates (i, j). The final classification score $S_c$ can be obtained from the equation:

$$S_c = \sum_k \omega^c_k P_k \tag{2}$$

where $\omega^c_k$ is the weight that corresponds to class c for the unit pooled from the feature map in the k-th channel. Then, the heatmap can be obtained by multiplying the weights of the fully connected layer and the feature maps of the last convolution layer. The spatial element of the CAM heatmap for class c can be obtained by:

$$L^c_{ij} = \sum_k \omega^c_k A^k_{ij} \tag{3}$$

In order to overcome the limitation imposed by the GAP structure of CAM, many CAM variants have been proposed in recent years, such as Grad-CAM [25], Grad-CAM++ [31], Ablation-CAM [32], and Score-CAM [33]. Grad-CAM is the most famous and widely used CAM-based method; Grad-CAM defines the weights $\omega^{grad}$ as:

$$\omega^{grad} = \frac{1}{Z} \sum_i \sum_j \frac{\partial S_c}{\partial A^k_{ij}} \tag{4}$$

where Z represents the number of pixels in the feature map. Thus, Grad-CAM can be applied to CNNs without changing the structure of the model as long as $S_c$ is a differentiable function of $A^k_{ij}$. However, the Grad-CAM method does not clearly explain why it uses the average of the gradients to weight each feature map, which poses a considerable risk when it is used to interpret CNNs.
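As a minimal sketch of the original CAM computation described above (pooled value per channel, class score as a weighted sum of pooled values, heatmap as the same weighted sum taken per location), the NumPy snippet below builds a heatmap from random stand-in feature maps. The array shapes, the random values, and the function name `cam_heatmap` are illustrative assumptions, and the pooled value is taken as the unnormalized spatial sum.

```python
import numpy as np

def cam_heatmap(feature_maps, fc_weights, class_idx):
    """Return the CAM heatmap and class score for one image.

    feature_maps: array (K, H, W), last-conv-layer activations A^k_ij.
    fc_weights:   array (C, K), weights w^c_k of the linear classifier.
    """
    w_c = fc_weights[class_idx]                         # (K,)
    pooled = feature_maps.sum(axis=(1, 2))              # P_k, one value per channel
    score = float(w_c @ pooled)                         # S_c = sum_k w^c_k P_k
    heatmap = np.tensordot(w_c, feature_maps, axes=1)   # L^c_ij = sum_k w^c_k A^k_ij
    return heatmap, score

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5, 5))    # hypothetical: 8 channels, 5 x 5 feature maps
W_fc = rng.standard_normal((3, 8))    # hypothetical: 3 classes
heatmap, score = cam_heatmap(A, W_fc, class_idx=1)
```

Under these assumptions the heatmap has the spatial size of the last feature map (5 × 5 here), and its entries add up to the class score, which is why each entry can be read as the contribution of one feature-map location.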
Overall, the original CAM is a method based on the internal mechanism of the CNN, which can reflect the decision process of the network to a certain extent. However, the generated heatmap has low resolution and an unclear correspondence with the input, so it cannot clearly reflect the influence of each input region on the decision result. If the model itself is interpretable, then no additional methods are needed to interpret the model and the problems described above can be avoided. BagNets are an interpretable model that inspired our work; we introduce their implementation and principle below.
Remote Sens. 2022, 14, 2150

DNN-Based BagNets Model
Before deep learning became popular in image recognition tasks, the Bag of Features (BoF) model was the preferred method in competitions. Before introducing DNN-based BagNets, we recount the main elements of a classic BoF model here. BoF representations can be described by analogy with bag-of-words representations. With bag-of-words, we count the number of occurrences of words from a vocabulary in a document. This vocabulary contains important words (but not common words, such as "this" or "the") and clusters of words (e.g., semantically similar words, such as "excellent" and "outstanding"). The counts of each word in the vocabulary are combined into a long term vector. This is called the bag-of-words document representation because the order of the words is lost. Similarly, the BoF representation is based on a visual vocabulary that represents clusters of local image features. The term vector of an image contains the number of occurrences of each visual word in the vocabulary. This term vector is used as the input of a classifier, such as a multilayer perceptron (MLP) or a support vector machine (SVM) [34].
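The term-vector construction can be illustrated with a small NumPy sketch. It assumes that each local feature is assigned to its nearest visual word by Euclidean distance, which is one common choice and is not specified in the text; the toy two-dimensional descriptors and the name `bof_term_vector` are likewise hypothetical.

```python
import numpy as np

def bof_term_vector(local_features, vocabulary):
    """Count how often each visual word occurs among an image's local features.

    local_features: array (N, D), one descriptor per image patch.
    vocabulary:     array (V, D), cluster centres acting as visual words.
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    diffs = local_features[:, None, :] - vocabulary[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest_word = dists.argmin(axis=1)
    # The histogram discards where each patch was located in the image.
    return np.bincount(nearest_word, minlength=len(vocabulary))

vocab = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]])
term_vector = bof_term_vector(feats, vocab)
```

Reordering the rows of `feats` leaves `term_vector` unchanged, which is the sense in which the spatial order of the local features is lost.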
BoF models are easy to interpret if the classifier is linear. In this case, the influence of a given part of the input on the classifier is independent of the rest of the input. The word linear here refers to the combination of a linear spatial aggregation (a simple average) and a linear classifier on top of the aggregated features. The fact that the classifier and the spatial aggregation are both linear and thus interchangeable allows us to pinpoint exactly how evidence from local image patches is integrated into one image-level decision. Based on this insight, Reference [30] constructs linear DNN-based BoF models (BagNets).
DNN-based BagNets are similar to CAM in that they also use a CNN with global average pooling and a linear classifier to extract class-specific heatmaps. However, in CAM the latent representations are extracted from the whole image, and it is unclear how the heatmaps in the latent space relate to the pixel space. In BagNets, the receptive field of the CNN is limited to very small image patches, making it possible to trace exactly how each image patch contributes to the final decision. The basic principle of DNN-based BagNets is shown in Figure 1. Figure 1a shows that each small image patch is input into the BagNet, which extracts features from the patch and generates an activation in the corresponding region of the heatmap. In this case, the class-c activation (logit) $L^c_{ij}$ of a q × q pixel patch of an image can be expressed by Equation (3). Figure 1b shows that BagNets yield one logit heatmap per class; these heatmaps are averaged spatially and the final class probability is obtained by a softmax layer. Then, the total score $S_c$ of an image for class c can be expressed as:

$$S_c = \frac{1}{n} \sum_i \sum_j L^c_{ij} = \frac{1}{n} \sum_i \sum_j \sum_k \omega^c_k A^k_{ij} \tag{5}$$

where n denotes the number of units in a feature map and $\omega^c_k$ is the weight that corresponds to class c for the unit pooled from the feature map in the k-th channel. The factor 1/n appears in the equation because there is a GAP layer after the last convolution layer.
It can be seen from Figure 1 that the decision results of DNN-based BagNets are obtained directly from the heatmaps, so this network architecture has good interpretability.
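The patch-wise principle can be sketched in NumPy as follows. This is a toy illustration, not the BagNet architecture: a single linear map (`weights`, a hypothetical stand-in for the deep patch feature extractor and classifier) scores every q × q patch independently, the per-class logit heatmaps are averaged spatially, and a softmax yields class probabilities. The 6 × 6 input with 2 × 2 patches and stride 2 follows the example of Figure 1.

```python
import numpy as np

def patch_logit_heatmap(image, patch_weights, q, stride):
    """Score every q x q patch independently with a linear map and collect the logits.

    image:         array (H, W), single-channel input.
    patch_weights: array (C, q*q), hypothetical linear patch classifier.
    Returns an array (C, H_out, W_out) of per-class patch logits.
    """
    rows = range(0, image.shape[0] - q + 1, stride)
    cols = range(0, image.shape[1] - q + 1, stride)
    heatmap = np.empty((patch_weights.shape[0], len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            patch = image[i:i + q, j:j + q].reshape(-1)
            heatmap[:, a, b] = patch_weights @ patch
    return heatmap

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
weights = rng.standard_normal((3, 4))          # 3 classes, 2 x 2 patches
heatmaps = patch_logit_heatmap(image, weights, q=2, stride=2)
scores = heatmaps.mean(axis=(1, 2))            # spatial average gives image-level logits
shifted = np.exp(scores - scores.max())
probs = shifted / shifted.sum()                # softmax over classes
```

Because both the averaging and the classifier are linear in this sketch, classifying the averaged patch gives the same logits as averaging the per-patch logits, so each heatmap entry is exactly the evidence contributed by one patch.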

Inspiration and Motivation
Although BagNets have excellent interpretability on optical images, they are not directly applicable to SAR images. This is mainly because these interpretable models are designed for high-resolution optical images with rich information, whereas SAR images are quite different. (1) The resolution of a SAR image is generally lower than that of an optical image, and SAR images contain strong noise. (2) In SAR images, the target usually occupies only a small area of the image, whereas in optical image datasets such as CUB-200 and CIFAR-10, the target usually occupies more than half of the image area. These differences require a model that can generate more refined heatmaps and achieve higher recognition accuracy for SAR image interpretation.
The original BagNets are able to produce a clear heatmap. However, these heatmaps are generated post hoc and cannot truly reflect the model recognition process. The ante-hoc heatmaps generated by BagNets differ greatly in resolution from the original image. This is mainly because BagNets only consider changes in the receptive field and not changes in the global stride. As shown in Figure 1, a BagNet with a receptive field of 2 × 2 and a global stride of 2 can only generate a 3 × 3 heatmap on a 6 × 6 image, which is not sufficient for SAR images. To better interpret SAR images, we need more detailed heatmaps. In addition, BagNets are designed based on ResNet-50 and have 50 convolutional layers. Generally, for relatively large datasets, neural networks with more convolutional layers are conducive to improving recognition accuracy, whereas for small datasets, overfitting is prone to occur. Due to the complexity of SAR image acquisition, it is difficult to build a large SAR image dataset, so a small network model is needed for SAR image recognition.
In order to solve the above problems, we propose the SAR-BagNet model. The detailed procedures of the SAR-BagNet and the specific difference between it and BagNets will be elucidated in what follows.

SAR-BagNet Model
For SAR image recognition, we want a network structure with both high recognition accuracy and good interpretability. During network design, we found three key factors that have an important influence on the accuracy and interpretability of the network: the receptive field, the global stride, and the network padding.
One of the most basic concepts in deep CNNs is the receptive field (RF). The value of each unit in a feature map depends on a region of the input of the convolutional network; this region is the receptive field of the unit [35]. Assume that a CNN with n convolutional layers has no pooling layer between the convolution layers; then, the receptive field on the input image corresponding to each unit in the feature map of the last convolution layer can be calculated as [36]:

$$RF_n = RF_{n-1} + (f_n - 1) \prod_{i=1}^{n-1} s_i \tag{6}$$

where $RF_n$ denotes the size of the receptive field of the n-th layer to be calculated, $RF_{n-1}$ denotes the calculated size of the receptive field at layer n − 1, $f_n$ denotes the size of the n-th convolution kernel, and $s_i$ denotes the stride of the i-th convolution layer.
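The recursion can be checked with a few lines of Python. The sketch below assumes the standard form of the rule, in which each layer enlarges the receptive field by (kernel size − 1) times the product of the strides of all earlier layers, starting from a receptive field of 1 at the input; the function name and the example layer lists are hypothetical.

```python
def receptive_field(layers):
    """Receptive field of the last layer of a pooling-free CNN.

    layers: sequence of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Nine 3 x 3 layers and a few 1 x 1 layers, all with stride 1 (hypothetical stack).
rf_stride1 = receptive_field([(3, 1)] * 9 + [(1, 1)] * 4)   # 1 + 9 * 2 = 19
# With strides larger than 1 the later layers grow the field faster.
rf_strided = receptive_field([(3, 2), (3, 2), (3, 1)])      # 1 + 2*1 + 2*2 + 2*4 = 15
```

With stride 1 everywhere, 1 × 1 layers leave the receptive field unchanged and each 3 × 3 layer adds 2, so the number of 3 × 3 layers alone determines the receptive field.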
In the BagNet architecture, the size of the receptive field will affect the precision of the heatmap. If the receptive field becomes smaller, the region corresponding to a unit on the heatmap will also become smaller, which will obviously increase the fineness of the heatmap. However, a smaller receptive field means that the image is segmented into smaller patches, which results in the loss of more characteristic information of the image and makes it harder for the network to classify objects.
Global stride represents the equivalent stride of a convolutional neural network on the input image. The global stride $S_g$ of a convolutional neural network is equal to the product of the strides of all its convolutional layers:

$$S_g = \prod_{i=1}^{n} s_i \tag{7}$$

We can see from Figure 1 that the global stride affects the number of patches into which a SAR image is segmented and the resolution of the heatmap. The more patches an image is segmented into, the more feature information of the image is input into the network, which is conducive to improving the recognition accuracy of the network.
The size of the heatmap determines its fineness. We can observe from Equation (3) that the size of the heatmap is the same as that of the feature map of the last convolution layer. The size $Q_n$ of the feature map of the n-th convolutional layer can be written as:

$$Q_n = \left[ \frac{W + 2P - RF_n}{S_g} \right] + 1 \tag{8}$$

where W denotes the size of the input image and P denotes the number of rings of padding added around the edges of the image. The symbol [] indicates rounding down after the calculation is completed. During the network design process, we find that the parameter P also has a great influence on SAR image recognition. In SAR-BagNet, we set P = 0 for all convolution layers; the reason for this is shown in Figure 2. Unlike colored optical images, SAR images are grayscale, and the high-brightness boundary of a SAR image together with the black padding boundary creates a local feature, which appears in the heatmap as high activation at the edge. This phenomenon can cause network misjudgment, which is undesirable. In Figure 2, the region marked by the red box is the region that generated high activation, and the corresponding region on the heatmap is shown as the darker red region. The local area marked by the green box in the SAR image has low brightness and differs little from the color of the black edges; the features formed in the green box area are not activated on the heatmap. According to the experimental results, in order to avoid introducing additional features, we set the network parameter P = 0. When P = 0, Equation (8) becomes:

$$Q_n = \left[ \frac{W - RF_n}{S_g} \right] + 1 \tag{9}$$

In the BagNets [30], the global stride $S_g$ is fixed at 8. The model only considers the influence of changes in the receptive field on the resolution of the heatmap and the recognition accuracy of the model, but it does not consider changes in global stride. Therefore, the ante-hoc heatmap obtained by BagNets has a low resolution and is not applicable to SAR images. According to Equation (9), for a given image of size W × W, there are two ways to increase the value of $Q_n$: one is to reduce the receptive field $RF_n$ and the other is to reduce the global stride $S_g$. Because the effects of the receptive field on the model's recognition accuracy and interpretability conflict, we explore the effect of the global stride on the model in order to obtain a model that achieves a high recognition rate and produces a fine heatmap.
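To make the trade-off concrete, the following Python sketch evaluates the heatmap size for a few settings. It assumes the size rule takes the form floor((W + 2P − RF) / S_g) + 1, with the global stride given by the product of the per-layer strides; the function names and the stride lists are hypothetical and chosen only for the arithmetic.

```python
import math

def global_stride(strides):
    """Product of the strides of all convolutional layers."""
    return math.prod(strides)

def heatmap_size(input_size, rf, stride, padding=0):
    """Side length of the heatmap: floor((W + 2P - RF) / S_g) + 1."""
    return (input_size + 2 * padding - rf) // stride + 1

# The 6 x 6 toy image with 2 x 2 patches and a global stride of 2.
toy_size = heatmap_size(6, rf=2, stride=2)
# A 100 x 100 input with RF = 19, once with global stride 1 and once with 8.
fine_size = heatmap_size(100, rf=19, stride=global_stride([1] * 9))
coarse_size = heatmap_size(100, rf=19, stride=global_stride([2, 2, 2]))
```

Under these assumptions the toy case gives a 3 × 3 heatmap, while for a 100 × 100 input with RF = 19 the heatmap shrinks from 82 × 82 at a global stride of 1 to 11 × 11 at a global stride of 8, which illustrates how strongly the global stride limits heatmap resolution.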
Based on the above analysis, BagNets with different global strides and different receptive fields were designed under the framework of ResNet-18 in order to achieve high accuracy and interpretability in SAR image recognition. The reason for adopting the ResNet-18 framework is that the ResNet-18 network has only 18 convolution layers, which is suitable for SAR image datasets with less data. We compared the recognition accuracy (on the validation set of the MSTAR dataset) of BagNets with RF = 13, RF = 19, and RF = 25 at global stride sizes of 1, 4, and 8, respectively, and the results are shown in Figure 3.
In Figure 3, with the increase in global stride size, the recognition accuracy of the different networks generally decreases. In addition, under the same global stride size, networks with larger receptive fields have higher recognition accuracy, which is consistent with the above analysis. When RF = 25 and $S_g$ = 1, the recognition accuracy of the model is the highest. However, large receptive fields mean poor interpretability. In order to balance interpretability and high recognition accuracy, we choose the model with RF = 19 and $S_g$ = 1, which has both a small receptive field and a high recognition accuracy. Meanwhile, according to Equation (9), when the size W of the input image is 100, the size $Q_n$ of the heatmap is 82. The size difference between the heatmap and the input image is small, which ensures that the model has good interpretability for SAR images. Based on the above experimental comparison, we propose the SAR-BagNet model, in which the receptive field is 19 and the global stride is 1. The specific SAR-BagNet structure is shown in Figure 4.
According to Equation (6), because the global stride is 1, whenever a convolution layer with a 1 × 1 kernel is added to the network, the size of the receptive field remains unchanged, whereas whenever a convolution layer with a 3 × 3 kernel is added, the size of the receptive field increases by 2. In the SAR-BagNet architecture, there are nine convolution layers with 3 × 3 kernels, so RF = 1 + 9 × 2 = 19.
SAR-BagNet is modified from the ResNet-18 framework by replacing the original convolution kernels with 1 × 1 and 3 × 3 convolution kernels. Each convolutional layer is followed by a BatchNorm layer and a ReLU layer. In the model, the stride of all convolution layers is 1 and the padding is 0; the downsampling operation is a simple 1 × 1 convolution layer with stride 1.

Experiments
In this section, we compare our model with ResNet-18 [37], ProtoPNet [28], and BagNets [30] on the commonly used public MSTAR dataset. For training, Adam is adopted as the optimizer, with learning rate $LR = 1 \times 10^{-3}$, $\beta_1 = 0.9$ (the exponential decay rate for the first-moment estimates), and $\beta_2 = 0.99$ (the exponential decay rate for the second-moment estimates).
MSTAR was launched in the mid-1990s by the Defense Advanced Research Projects Agency (DARPA). A high-resolution spotlight SAR was used to collect SAR images of various former Soviet military vehicles. The MSTAR dataset includes SAR images of 10 different classes of vehicles: 2S1 (self-propelled howitzer), BRDM2 (armored reconnaissance vehicle), BTR60 (armored personnel carrier), D7 (bulldozer), T72 (main battle tank), BMP2 (infantry fighting vehicle), BTR70 (armored personnel carrier), T62 (tank), ZIL131 (military truck), and ZSU234 (self-propelled antiaircraft gun), which are numbered from Class 0 to Class 9 in order. The 10 classes of targets with a depression angle of 15° were used as the training set, and the 10 classes of targets with a depression angle of 17° were used as the validation set.
In the MSTAR dataset, the original SAR images are grayscale; to avoid modifying the parameters of ProtoPNet and BagNets, all the SAR images are transformed into pseudo-RGB images (the gray image is copied into all three channels). In data preprocessing, we process the training dataset using normalization, horizontal and vertical rotation, random panning, and image brightness transformation to increase the generalization ability of the model. All the SAR images are cropped to the size of 100 × 100. Because ProtoPNet was trained with 224 × 224 images, the SAR images were upsampled to 224 × 224 during its training process. We selected BagNet-17 and BagNet-33 from the BagNets; the receptive fields of these two networks are 17 × 17 and 33 × 33, respectively.
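The cropping and pseudo-RGB steps can be sketched in NumPy as below. The text does not say where the 100 × 100 crop is taken, so a centre crop is assumed here; the 128 × 128 input size, the random stand-in image, and the function names are also hypothetical.

```python
import numpy as np

def center_crop(image, size):
    """Crop a (H, W) array to (size, size) around its centre."""
    top = (image.shape[0] - size) // 2
    left = (image.shape[1] - size) // 2
    return image[top:top + size, left:left + size]

def to_pseudo_rgb(gray):
    """Copy a (H, W) grayscale image into three identical channels -> (H, W, 3)."""
    return np.repeat(gray[:, :, None], 3, axis=2)

rng = np.random.default_rng(0)
chip = rng.random((128, 128))             # hypothetical 128 x 128 SAR chip
prepared = to_pseudo_rgb(center_crop(chip, 100))
```

The result has shape (100, 100, 3) with identical channels, so a network built for three-channel input can consume it without changes to its first layer.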

Comparison of Recognition Accuracy
ProtoPNet [28] and BagNets [30] are widely used interpretable models. It is important to point out that these two models have achieved recognition accuracy similar to that of traditional CNNs (e.g., AlexNet, ResNet-18) in optical image recognition tasks. In SAR image recognition, the recognition accuracy of each model on the validation set is shown in Table 1. From Table 1, ResNet-18 obtains the highest recognition accuracy, whereas ProtoPNet obtains the lowest. It can be seen that the ProtoPNet model has a low recognition accuracy on SAR images, which is mainly because of the great difference between SAR images and optical images. Among the BagNets, the recognition accuracy of BagNet-33 is higher than that of BagNet-17, which is mainly attributed to the larger receptive field of BagNet-33.

Table 1. The accuracy of the models on the validation set.

Models            Recognition Accuracy
ResNet-18 [37]    99.05%
BagNet-17 [30]    94.15%
BagNet-33 [30]    96.99%
ProtoPNet [28]    78.34%
SAR-BagNet        98.25%

The recognition accuracy of SAR-BagNet is higher than that of the other three interpretable models and slightly lower than that of ResNet-18. For the ResNet-18 network, the receptive field is 432 × 432. It is generally believed that the larger the receptive field of the network is, the richer the features that can be extracted from the image, including not only local features but also global features. Such a large receptive field is conducive to improving recognition accuracy, but it brings the problem of a lack of interpretability. Due to the small receptive field of SAR-BagNet, the global features in the images cannot be extracted by the network, so the accuracy of SAR-BagNet is slightly lower than that of ResNet-18, but it brings the advantage of good interpretability (see below). In some special application scenarios, some recognition accuracy can be sacrificed to obtain better interpretability.

Heatmap Comparison of Models
The heatmap can reflect the influence of each region in the SAR image on the model recognition result. Due to the existence of strong interference, we need a more accurate heatmap to explain the SAR image recognition. Our model is designed according to the characteristics of SAR images, which not only ensures high recognition accuracy but also generates a heatmap that can well reflect the influence of different regions of SAR images on recognition. To compare the interpretability of the models, we contrast the heatmaps generated by these models.
Because ResNet-18 has a global average pooling layer, the CAM method is used to generate the heatmap. In the BagNet model, we choose BagNet-33 with high recognition accuracy to obtain the heatmap. The heatmap for ProtoPNet is considered less convincing and reasonable in view of the low accuracy, 78.34%; thus, here only the heatmaps from ResNet-18, BagNet-33, and SAR-BagNet are shown.
In Figure 5, the red area represents a positive impact on the model's decision results, whereas the blue area represents a negative impact on the model's decision process. Darker areas indicate greater influence on the results. The positive and negative impact can be understood as follows: in order to distinguish a person's gender, certain characteristics such as hair length, clothing color, height, and facial features can be used as evidence. If a man has long hair, this feature has a negative impact on the results, and conversely, it has a positive impact on the results for a woman (generally, long hair is considered a female characteristic).
When the decision results of ResNet-18 are interpreted by the CAM method, the heatmap can only indicate broad regions. In addition, because the receptive field of ResNet-18 covers the whole image, the heatmap cannot pinpoint the specific regions of the image that drive the decision, resulting in weak interpretability. Compared with the heatmap generated by the CAM method, the heatmap generated by the BagNet-33 model can reflect which part of the image has a greater impact on the results, but it cannot resolve the finer structure of the target. The heatmap generated by our model not only accurately reflects the influence of each patch in the image on the decision result but also reflects, to a certain extent, the influence of small structures within the target. In the heatmaps of Figure 5a,f, the edge of the target is highlighted, indicating that the edge of the target has a strong positive influence on the classification results.
The heatmap generated by our model for class 3 is shown in Figure 5d. The red positions in the heatmap are not all target positions; red areas also appear in the background of the SAR image. Even so, the model classifies these SAR targets correctly. This indicates that SAR-BagNet's recognition of class 3 depends largely on background information and not just on the target. It is clearly unreasonable to classify class 3 targets using background information instead of target information. This phenomenon was also found in Reference [6], which reported that when the background information was blocked, the neural network could not recognize the target. Because the Self-Matching CAM method proposed in Reference [6] is not well interpretable, the author attributed this phenomenon to the network learning some information that is unrelated to the target but exists in different categories of SAR images, and did not explain what this information is. In practical applications, such potential risks are difficult to find if an uninterpretable model is applied to SAR image recognition. This illustrates the importance of model interpretability in the field of SAR image recognition.

Recognition Process of SAR-BagNet
The training process of SAR-BagNet is the same as that of ordinary convolutional networks, and the images do not need to be segmented manually. The trained SAR-BagNet learns the features of each class. When features similar to those of a class appear, SAR-BagNet generates a strong activation on the heatmap of the corresponding class. Because each image input into SAR-BagNet is a patch of the complete SAR image, the strength of the activation that a patch generates on the heatmap of each class shows which class of features the patch most closely resembles. The activated region of a patch on the heatmap corresponds to the region of that patch in the SAR image, so the impact of each image region on the recognition result can be determined from the heatmap. The complete recognition process of SAR-BagNet is shown in Figure 6.
In the process of image recognition, SAR-BagNet generates a heatmap for each class. The class activation of each patch of the input image is displayed as one value on the heatmap of that class, and the class activations of all patches constitute the complete heatmap. The average value of a heatmap is equivalent to the degree of matching between the input image and the class of that heatmap, so the image can be classified according to the average values of the heatmaps.
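The classification rule described here, averaging each class heatmap and choosing the class with the largest average, can be sketched in a few lines of NumPy. The heatmap values below are invented toy numbers and the function name `classify_from_heatmaps` is hypothetical; only the rule itself comes from the text.

```python
import numpy as np

def classify_from_heatmaps(heatmaps):
    """heatmaps: (C, H, W) array holding one map of patch-level evidence per class."""
    # The spatial mean of each heatmap is the matching degree for that class.
    scores = heatmaps.mean(axis=(1, 2))
    return int(np.argmax(scores)), scores

# Toy heatmaps for 3 classes on a 2 x 2 grid of patches (illustrative values only).
heatmaps = np.array([
    [[0.1, -0.2], [0.0, 0.1]],    # class 0, mean 0.0
    [[0.9, 0.4], [0.2, 0.5]],     # class 1, mean 0.5
    [[-0.3, 0.2], [0.1, -0.4]],   # class 2, mean -0.1
])

pred, scores = classify_from_heatmaps(heatmaps)
```

Because the class score is exactly the mean of the heatmap values, every heatmap cell contributes to the decision in a directly readable way.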
After obtaining the heatmap, we can find the corresponding region in the original SAR image, so as to determine the contribution degree of each patch in the SAR image to the model decision. As shown in Figure 6b, SAR-BagNet controls the receptive field and global stride size so that each patch of the SAR image corresponds strictly to a certain value on the heatmap. Such correspondence ensures that the model has good interpretability.
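The correspondence between a heatmap value and its image patch can be written down explicitly. The sketch below assumes a network without padding, so that the first heatmap cell sees the top-left patch; the receptive field of 19 pixels and global stride of 4 pixels are hypothetical values chosen for illustration, not the settings reported for SAR-BagNet.

```python
def patch_bounds(row, col, receptive_field, stride, offset=0):
    """Return (top, left, bottom, right) of the image patch seen by heatmap
    cell (row, col). bottom and right are exclusive pixel indices.

    offset is the image coordinate of the first patch (0 when no padding is used).
    """
    top = offset + row * stride
    left = offset + col * stride
    return top, left, top + receptive_field, left + receptive_field

# Hypothetical receptive field and global stride, for illustration only.
print(patch_bounds(0, 0, receptive_field=19, stride=4))  # (0, 0, 19, 19)
print(patch_bounds(2, 3, receptive_field=19, stride=4))  # (8, 12, 27, 31)
```

A smaller receptive field localizes the evidence more precisely, and a smaller global stride samples the image more densely, at the cost of less context per patch.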

Analysis of Salient Features
The principle of SAR-BagNet is analogous to that of the BoF model. In the BoF model mentioned above, we want the model to cluster words with similar meanings (e.g., "excellent" and "outstanding"), i.e., to produce similar activations for words with similar meanings. Likewise, we want SAR-BagNet to produce similar activations for similar patches. We selected similar SAR images from the same class for comparison, and the experimental results are shown in Figure 7. In Figure 7a, similar SAR images of class 4 have similar heatmaps, and in each of these images the target edge region is strongly activated on the heatmap. In Figure 7b, there is an obvious line-like feature in the patch in the red box; it may be caused by cavity scattering on the target, which does not exist on targets of the other classes. This feature generates strong activation on the heatmap in different SAR images. These results show that SAR-BagNet learns some robust classification features during training and applies them to the classification of SAR images.
We also compared the heatmaps of different classes for the same SAR image, as shown in Figure 8. The same area in the SAR image can have a positive effect on the heatmap of the correct class and a negative effect on the heatmap of a wrong class. For the images of classes 5 and 6, it is very difficult for humans to find category features in the SAR images that allow the targets to be classified correctly, whereas SAR-BagNet can extract such classification features from the target and classify it correctly.
In recent years, learning imaging has been widely used in SAR imaging [38,39]. However, it is difficult to objectively evaluate the effects of learning imaging. The statistical evaluation indexes used in natural image processing, such as image entropy and image contrast, are not completely suitable for radar images, and indexes such as mean square error, peak signal-to-noise ratio, and structural similarity require known target reference images, which makes them difficult to apply to measured radar data. Most evaluation of the imaging effect is therefore based on people's subjective impressions. From the point of view of recognition, the imaging effect and the final recognition effect may thus be inconsistent: a SAR image that people consider clear may contain little or no category information for the recognition model. SAR-BagNet can be used to objectively evaluate whether SAR objects generated by learning imaging contain category information.

Misclassification Interpretation of Models
In this section, we discuss the causes of the classification errors of SAR-BagNet in different categories. The per-class classification accuracy and the confusion matrix of SAR-BagNet are given in Table 2 and Figure 9.
From Table 2, we can see that the recognition accuracy of SAR-BagNet on BMP2 and BTR70 is relatively low, at 94.33% and 95.58%, respectively. The confusion matrix shows that the misclassifications of BMP2 are mostly concentrated in BTR70, whereas the misclassifications of BTR70 are mostly concentrated in BMP2. BMP2 is an infantry fighting vehicle and BTR70 is an armored personnel carrier, and the two classes of targets are very similar in appearance. Because SAR images contain strong noise, the local scattering characteristics of targets are easily disturbed, which leads to classification errors when SAR-BagNet classifies according to image patches. The BRDM2 class achieves the highest recognition accuracy, mainly because it differs greatly from the other nine classes in appearance and background, so SAR-BagNet can easily extract class information from its patch features. In summary, for targets with small appearance differences, the recognition accuracy of SAR-BagNet decreases due to the lack of global features, whereas for targets with large appearance differences, SAR-BagNet can achieve high recognition accuracy even though it uses only patch features.
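The per-class accuracy and the confusion matrix discussed above are related in a simple way, which the following sketch makes concrete. The labels are a small invented example with three classes, not the experimental data behind Table 2 or Figure 9, and the helper name `confusion_matrix` is an assumption.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

# Invented labels for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Per-class accuracy: correct predictions on the diagonal divided by row totals.
per_class_acc = cm.diagonal() / cm.sum(axis=1)
```

A pair of classes that are mutually confused, as described for BMP2 and BTR70, appears as two large off-diagonal entries that mirror each other across the diagonal.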
We selected several incorrectly classified images and compared the heatmaps of SAR-BagNet for the true class and the predicted class. The comparison results and the corresponding positions on the original images are shown in Figure 10. In Figure 10a, the target has a stronger positive impact on the predicted class than on the true class. In Figure 10b,c, the target area exerts a negative impact on the heatmap of the true class and a positive impact on the heatmap of the predicted class. These effects lead to the misclassification. Because the heatmaps reflect the influence of each part of the target on the classification result, we can compare them for incorrect classifications to locate the regions in the SAR image that cause the network to misclassify. Combined with the imaging mechanism of SAR images and the physical scattering characteristics of the target, this allows us to explore the deeper causes of the errors and thus improve the SAR imaging algorithm and the recognition model, which cannot be achieved with uninterpretable models.
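One simple way to carry out such a comparison is to subtract the true-class heatmap from the predicted-class heatmap and list the cells with the largest difference. This is a sketch of one possible procedure, not the analysis used to produce Figure 10; the heatmap values and the function name `misleading_cells` are invented for illustration.

```python
import numpy as np

def misleading_cells(heatmap_true, heatmap_pred, top_k=3):
    """Return the top_k heatmap cells that favor the predicted class most
    strongly over the true class, as (row, col, difference) tuples."""
    diff = heatmap_pred - heatmap_true
    # Indices of the flattened difference map, sorted from largest to smallest.
    order = np.argsort(diff, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(order, diff.shape)
    return [(int(r), int(c), float(diff[r, c])) for r, c in zip(rows, cols)]

# Toy 2 x 2 heatmaps (illustrative values only).
heatmap_true = np.array([[0.2, -0.5], [0.1, 0.0]])
heatmap_pred = np.array([[0.1, 0.6], [0.1, 0.3]])

cells = misleading_cells(heatmap_true, heatmap_pred, top_k=2)
```

Each returned cell can then be mapped back to its image patch through the receptive field and global stride, which identifies the image regions that pushed the decision toward the wrong class.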
Figure 10. Heatmaps of the true and predicted classes in SAR-BagNet for misclassified images. (a) The true class is Class 0 (2S1) and the predicted class is Class 9 (ZSU234). (b) The true class is Class 5 (BMP2) and the predicted class is Class 6 (BTR70). (c) The true class is Class 6 (BTR70) and the predicted class is Class 5 (BMP2).

Discussion
In this study, we verify the effectiveness of BagNet-based methods in SAR image recognition and interpretation and propose SAR-BagNet according to the characteristics of SAR images. This architecture is not fixed but can change according to the specific task. For example, for the ImageNet dataset, which has a large number of samples and categories, we can choose ResNet-50 or ResNet-101 as the basic framework, then explore how the receptive field and global stride affect the interpretability and recognition accuracy of the model, and finally select the appropriate model. The approach we propose is therefore not just a specific model but an architecture and a network design idea.
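When exploring such designs, the receptive field and global stride of a candidate network can be computed from its layer configuration with the standard recurrence for stacked convolutions. The layer lists below are hypothetical and do not describe SAR-BagNet, ResNet-50, or ResNet-101; dilation is assumed to be 1.

```python
def receptive_field_and_stride(layers):
    """layers: ordered list of (kernel_size, stride) pairs, without dilation.

    Returns (receptive_field, global_stride) in input pixels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf, jump

# Hypothetical layer stack, for illustration only.
baseline = [(3, 1), (3, 2), (1, 1), (3, 2), (1, 1)]
print(receptive_field_and_stride(baseline))   # (9, 4)

# Replacing a 3 x 3 convolution with a 1 x 1 convolution shrinks the receptive field.
narrower = [(3, 1), (1, 2), (1, 1), (3, 2), (1, 1)]
print(receptive_field_and_stride(narrower))   # (7, 4)
```

This makes the design trade-off explicit: kernel sizes control how much context each heatmap value sees, while strides control how densely the image is covered.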
Most of the red areas in the heatmap of Figure 5d are located in the background area. It is necessary to clarify that this does not mean that SAR-BagNet fails to learn the category information of class 3. As shown in Table 2, the recognition accuracy for class 3 is 99.66%, which indicates that most of the class information of class 3 is located in the background area, i.e., differences in the background can be used to distinguish class 3 from the other classes. Because traditional CNNs are not interpretable, it is unknown whether they also make use of the background information of class 3 for classification. Therefore, although traditional CNNs can achieve high recognition accuracy in SAR target recognition, their recognition results are highly risky because the network may be using information that is unrelated to the target to make its judgment.
Interpretability is an important characteristic and research topic of next-generation artificial intelligence systems. A model with strong interpretability enables users to better understand the decision-making process of the machine, so that they can judge the confidence of its results, which increases people's trust in the system. The SAR-BagNet architecture proposed in this work visualizes the process by which the model recognizes a SAR image and reduces the risk of recognition. It has strong practical significance for fields with high reliability requirements, such as military applications and disaster detection. In addition, SAR-BagNet can show the causes of model discrimination errors, which has application prospects for improving and objectively evaluating SAR imaging algorithms.

Conclusions
In this work, we propose SAR-BagNet, a model that provides a novel and accurate explanation for SAR image interpretation. SAR-BagNet was inspired by the BagNet model, but compared with BagNet, it generates clearer heatmaps and achieves higher recognition accuracy. SAR-BagNet is therefore particularly suitable for SAR images, whose resolution is low and whose texture features are not as vivid as those of optical images. In addition, because the heatmap generated by SAR-BagNet determines the classification result, the interpretation method adopted by SAR-BagNet is an ante-hoc interpretation method. Ante-hoc interpretation is directly faithful to the decision-making process and is more credible and reasonable than post-hoc interpretation. In comparison with other interpretable models, the proposed model precisely displays the influence of each region of the SAR image on the classification result rather than giving a rough coverage. This model will help to increase the reliability of SAR image classification results. In future work, we will combine the heatmap generated by SAR-BagNet with the imaging mechanism and physical characteristics of SAR images to explore the deeper recognition features of SAR images.