SAR ATR of Ground Vehicles Based on ESENet

Abstract: In recent studies, synthetic aperture radar (SAR) automatic target recognition (ATR) algorithms based on the convolutional neural network (CNN) have achieved high recognition rates on the moving and stationary target acquisition and recognition (MSTAR) dataset. However, in a SAR ATR task, the feature maps with little information that are automatically learned by the CNN disturb the classifier. We design a new enhanced squeeze and excitation (enhanced-SE) module to solve this problem, and then propose a new SAR ATR network, i.e., the enhanced squeeze and excitation network (ESENet). Compared with the available CNN structures designed for SAR ATR, the ESENet can extract more effective features from SAR images and obtain better generalization performance. On the MSTAR dataset containing pure targets, the proposed method achieves a recognition rate of 97.32%, exceeding the available CNN-based SAR ATR algorithms. Additionally, it shows robustness to large depression angle variation, configuration variants, and version variants.


Introduction
Synthetic aperture radar (SAR) has played a significant role in surveillance and battlefield reconnaissance, thanks to its all-day, all-weather, and high-resolution capability. In recent years, SAR automatic target recognition (ATR) of ground military vehicles has received intensive attention in the radar ATR community. However, SAR images usually have low resolution and they only contain the amplitude information of scattering centers. Thus, it is challenging to identify the targets in SAR images.
The MIT Lincoln Laboratory proposed the standard SAR ATR architecture, which consists of three stages: detection, discrimination, and classification [1]. In the detection stage, simple decision rules are used to find the bright pixels in SAR images and indicate the presence of targets. The output of this stage might include not only targets of interest, but also clutter, because these decision rules are far from perfect. In the following discrimination stage, a discriminator is designed to solve a two-class (target and clutter) classification problem, so that the probability of false alarm can be significantly reduced [2]. In the final classification stage, a classifier is designed to categorize each output image of the discrimination stage as a specific target type.
In the classification stage, there are three mainstream methods: template matching methods, model-based methods, and machine learning methods. For the template matching methods [3,4], the template database is generated from training samples according to some matching rules and the best match is then found by comparing each test sample to the template database. The common matching rules include the minimum mean square error, the minimum Euclidean distance, and the maximum correlation coefficient. In these template matching methods, the initial SAR images or sub-images cut from the initial SAR images serve as templates. However, SAR images are sensitive to azimuth angle, depression angle, and target structure. When there is a large difference between the training and test samples, the recognition performance decreases severely. Additionally, such methods suffer from severe overfitting [5]. Model-based methods were proposed to solve the above problems [6,7]. In the model-based methods, SAR images are predicted by a computer-aided design model and the modeling procedure is usually complicated.
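As a rough illustration of the template matching idea described above, the following sketch classifies a test image by the maximum correlation coefficient rule. The function names, the toy 8 × 8 random templates, and the labels are hypothetical and serve only as placeholders; they are not taken from references [3,4].

```python
import numpy as np

def correlation_coefficient(a, b):
    """Pearson correlation coefficient between two equally sized images."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def match_template(test_image, templates, labels):
    """Return the label of the template that correlates best with the test image."""
    scores = [correlation_coefficient(test_image, t) for t in templates]
    return labels[int(np.argmax(scores))]

# Toy data (hypothetical): three random "templates" and a slightly perturbed copy of one.
rng = np.random.default_rng(0)
templates = [rng.random((8, 8)) for _ in range(3)]
labels = ["A", "B", "C"]
noisy = templates[1] + 0.01 * rng.random((8, 8))
predicted = match_template(noisy, templates, labels)
```

Because the decision depends directly on pixel-level similarity, a change in azimuth or depression angle between the template and the test image lowers the score, which is the sensitivity noted in the text.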
SAR ATR algorithms that are based on machine learning methods can be further divided into two types, i.e., feature-based methods and deep learning methods. Feature-based methods [8,9] require features to be manually extracted from SAR images, while deep learning methods automatically extract features from SAR images. Thus, deep learning methods avoid the design of feature extractors. As a typical deep learning structure, the convolutional neural network (CNN) has been successfully applied in various fields, e.g., SAR image classification [10] and satellite image classification [11]. In particular, CNN-based methods outperform others in SAR ATR tasks due to their unique characteristics, which are suitable for two-dimensional image classification [12].
The MSTAR dataset serves as a benchmark for the evaluation and comparison of SAR ATR algorithms [13]. However, there is a high correlation between the target type and clutter in the MSTAR dataset, i.e., the SAR images of a specific target type may correspond to the same background clutter. It was demonstrated that, even if the target and shadow regions are removed, a traditional classifier still achieves high recognition accuracy (above 99%) on the remaining clutter [14]. In real-world situations, however, the target location may change, and a given target type may be accompanied by various background clutters rather than a fixed one. Therefore, we exclude such correlation by target region segmentation [15] and generate the MSTAR pure target dataset for a fair comparison and evaluation of SAR ATR algorithms.
The key factors in improving the recognition performance of SAR ATR algorithms that are based on CNN include: (i) SAR image preprocessing to extract features more effectively and easily; and, (ii) designing effective network structures that make full use of the extracted features from SAR images.
For SAR image preprocessing, Ding et al. [16] augmented the training set by image rotation and shifting to alleviate overfitting. Chen et al. generated the augmented training set by randomly cropping the initial 128 × 128 MSTAR images to 88 × 88 patches [12]. Wagner enlarged the training set by directly adding distorted SAR images to improve the robustness [17]. Lin et al. cropped the initial MSTAR images to 68 × 68 patches in order to reduce the computational burden of the CNN [18], and Shang et al. cropped the initial MSTAR images to 70 × 70 patches [19]. Wang et al. used a despeckling subnetwork to suppress speckle noise before inputting SAR images into a classification network [20].
For the design of CNN structures for SAR ATR, a traditional CNN structure that consists of convolutional layers, pooling layers, and a softmax classifier was proposed [16,21–23]. Later, Chen et al. designed A-convnet, where the number of unknown parameters is greatly reduced by removing the fully-connected layer [12]. Wagner replaced the softmax classifier in the traditional CNN structure with an SVM classifier and achieved high recognition accuracy [17,24]. Lin et al. proposed CHU-Nets, where a convolutional highway unit is inserted into the traditional CNN structure and the classification performance is improved on a limited-labeled training dataset [18]. Shang et al. added an information recorder to the CNN to remember and store the spatial features of the samples, and then used spatial similarity information of the recorded features to predict the unknown sample labels [19]. Kechagias-Stamatis et al. fused a convolutional neural network module with a sparse coding module under a decision-level scheme, which can adaptively alter the fusion weights based on the SAR images [25]. Pei et al. proposed a multiple-view DCNN (m-VDCNN) to extract the features from target images with different azimuth angles [26].
Generally, a CNN is a data-driven model and each pixel of the training and test samples directly participates in feature extraction. The correlation between the clutter in the training and test sets cannot be ignored, since input SAR images consist of both a target region and a clutter region. Additionally, for the available SAR ATR algorithms that are based on CNN, the softmax classifier directly applies the features that were extracted by the convolutional layers. However, a CNN may automatically learn useless feature maps, which prevent the classifier from effectively utilizing significant features [27,28]. Therefore, the available SAR ATR algorithms that are based on CNN ignore the negative effects of the feature maps with little information, and the recognition performance may degrade.
We propose a novel SAR ATR algorithm based on CNN to tackle the above-mentioned problems. The main contributions include: (i) an enhanced Squeeze and Excitation (SE) module is proposed to suppress feature maps with little information in a CNN by allocating different weights to feature maps according to the amount of information they contain; and, (ii) a modified CNN structure, i.e., the Enhanced Squeeze and Excitation Net (ESENet) incorporating the enhanced-SE module, is proposed. The experimental results on the MSTAR dataset without clutter show that the proposed network outperforms the available CNN structures designed for SAR ATR.
The remainder of this paper is organized as follows. Section 2 introduces the Squeeze and Excitation module. Section 3 introduces a novel SAR ATR method based on the ESENet, and discusses the mechanism of the enhanced-SE module, together with the structure of the ESENet in detail. Section 4 presents the experimental results to validate the effectiveness of the proposed network, and Section 5 concludes the paper.

Squeeze and Excitation Module
A typical CNN structure consists of a feature extractor and a classifier. The feature extractor is a multilayer structure that is formed by stacking convolutional layers and pooling layers. The feature maps of different hierarchies are extracted layer by layer, and then the feature maps of the last layer are used by the classifier for target recognition. In a typical feature extractor, the feature maps in the same layer are regarded as having the same importance to the next layer. However, such an assumption is usually violated in practice [29]. Figure 1 shows 16 feature maps extracted by the first convolutional layer of a typical CNN structure applied in a SAR ATR experiment. It is observed that some of the feature maps, e.g., the second feature map in the first row, only have several bright pixels, and contain less target structural information than others.
In a typical CNN structure, all of the feature maps with different importance in the same layer equally pass through the network. Thus, they make equal contributions to recognition, and such an equal mechanism disturbs the utilization of important feature maps that contain more information. We could apply the SE module, which allocates different weights to different feature maps in the same layer, to enhance significant feature maps and suppress others with less information [29].
Figure 2 illustrates the structure of an SE module. For an arbitrary input feature map tensor $U \in \mathbb{R}^{W \times H \times C}$, where $W \times H$ represents the size of the input feature map and $C$ represents the number of input feature maps, the SE module transforms $U$ into a new feature map tensor $X$, where $X$ shares the same size with $U$, i.e., $X \in \mathbb{R}^{W \times H \times C}$. Here, $r$ is a fixed hyperparameter of the SE module.
The computation of an SE module includes two steps, i.e., the squeeze operation $F_{sq}$ and the excitation operation $F_{ex}$. The squeeze operation obtains the global information of each feature map, while the excitation operation automatically learns the weight of each feature map. A simple implementation of the squeeze operation is global average pooling. For the feature map tensor $U \in \mathbb{R}^{W \times H \times C}$, such a squeeze operation outputs a description tensor $z \in \mathbb{R}^{C}$, where the $c$th element of $z$ is denoted by:

$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j), \tag{1}$$

where $u_c$ represents the $c$th feature map of $U$. The excitation operation is denoted by the following nonlinear function:

$$s = F_{ex}(z) = \sigma\left(\mathbf{W}_2\, \delta(\mathbf{W}_1 z)\right), \tag{2}$$

where $\delta$ is the rectified linear unit (ReLU) function, $\sigma$ is the sigmoid activation function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, $r$ is a fixed hyperparameter, and $s$ is the automatically-learned weight vector, which represents the importance of the feature maps. It can be seen from Equations (1) and (2) that the combination of the squeeze operation and the excitation operation learns the importance of each feature map independently from the network. Finally, the $c$th feature map that is produced by the SE module is denoted by:

$$x_c = F_{scale}(u_c, s_c) = s_c \cdot u_c, \tag{3}$$

where $s_c$ represents the weight of $u_c$ and $F_{scale}(u_c, s_c)$ represents the product of them.
As discussed above, the SE module computes and allocates weights to the corresponding feature maps. The feature maps with little information will be suppressed after being multiplied by weights that are much less than 1, while the others will remain almost unchanged after being multiplied by weights near 1.
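The squeeze, excitation, and scaling steps of the SE module can be sketched in a few lines of NumPy. This is a minimal forward-pass sketch only: the function name `se_forward`, the random weights, the 26 × 26 × 32 input size, and the omission of bias terms are assumptions made for illustration, and no training is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_forward(U, W1, W2):
    """Forward pass of an SE module (biases omitted for brevity).

    U:  feature map tensor of shape (W, H, C)
    W1: first fully-connected weights, shape (C // r, C)
    W2: second fully-connected weights, shape (C, C // r)
    """
    z = U.mean(axis=(0, 1))            # squeeze: global average pooling, shape (C,)
    hidden = np.maximum(W1 @ z, 0.0)   # first fully-connected layer followed by ReLU
    s = sigmoid(W2 @ hidden)           # excitation: one weight in (0, 1) per feature map
    X = U * s                          # scale: broadcast the weights over W and H
    return X, s

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, r = 32, 16
U = rng.standard_normal((26, 26, C))
W1 = 0.1 * rng.standard_normal((C // r, C))
W2 = 0.1 * rng.standard_normal((C, C // r))
X, s = se_forward(U, W1, W2)
```

The output `X` has the same shape as `U`, and each channel of `X` is the corresponding channel of `U` multiplied by a single scalar weight.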

SAR ATR Based on ESENet
In this section, we propose the enhanced-SE module according to the characteristics of the SAR data, and then design a new CNN structure for SAR ATR, namely the ESENet.
Figure 3 shows the main steps of the training and test stages to give a brief view of the proposed method. Firstly, image segmentation is utilized to remove the background clutter [15,30]. In what follows, we will explain the mechanisms of the ESENet in detail.

Overall Structure of the ESENet
In this part, we discuss the characteristics and general layout of the proposed ESENet. As shown in Figure 4, the ESENet consists of four convolutional layers, three max pooling layers, a fully-connected layer, an SE module, an enhanced-SE module, and an LM-softmax classifier [31]. There are 16 convolutional kernels of size 5 × 5 in the first convolutional layer, 32 kernels of size 3 × 3 in the second convolutional layer, 64 kernels of size 4 × 4 in the third convolutional layer, and 64 kernels of size 5 × 5 in the last convolutional layer. Batch normalization [32] is used in the first convolutional layer to accelerate the convergence. A max pooling layer with pooling size 2 × 2 and stride 2 is added after the first convolutional layer, the SE module, and the enhanced-SE module, respectively. The SE module is inserted in the middle of the network to preliminarily enhance the important feature maps. An enhanced-SE module is inserted before the last pooling layer to further suppress higher-level feature maps with little information. Subsequently, dropout is added to the third convolutional layer and the last convolutional layer. The fully-connected layer has 10 nodes. Finally, we apply the LM-softmax classifier for classification. Below, we introduce the key components of the proposed network in detail.
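To make the layout above more concrete, the following sketch traces the spatial size of the feature maps through the network. Several points are assumptions rather than statements from the text: the layer order is inferred from the description (the second convolutional layer feeding the SE module, the third feeding the enhanced-SE module), convolutions are taken to be unpadded with stride 1, and the input is taken to be a single-channel 60 × 60 image, the crop size used for the pure target dataset.

```python
def conv_out(size, kernel, stride=1):
    """Output size of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2, stride=2):
    """Output size of an unpadded max pooling layer."""
    return (size - pool) // stride + 1

# (name, kind, kernel size, output channels); the order is an assumption.
LAYERS = [
    ("conv1 16@5x5 + BN", "conv", 5, 16),
    ("pool1 2x2", "pool", 2, None),
    ("conv2 32@3x3", "conv", 3, 32),
    ("SE module", "keep", None, None),
    ("pool2 2x2", "pool", 2, None),
    ("conv3 64@4x4", "conv", 4, 64),
    ("enhanced-SE module", "keep", None, None),
    ("pool3 2x2", "pool", 2, None),
    ("conv4 64@5x5", "conv", 5, 64),
]

def trace_shapes(input_size=60, channels=1):
    """Return (name, height, width, channels) after every layer."""
    size, shapes = input_size, []
    for name, kind, k, out_ch in LAYERS:
        if kind == "conv":
            size, channels = conv_out(size, k), out_ch
        elif kind == "pool":
            size = pool_out(size, k, 2)
        # "keep": SE-type modules rescale channels and leave the shape unchanged
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name:20s} -> {h} x {w} x {c}")
```

Under these assumptions the last convolutional layer reduces the feature maps to 1 × 1 × 64, which would then feed the 10-node fully-connected layer.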

Enhanced Squeeze and Excitation Module
We discovered that, if the original SE module is inserted directly into a CNN designed for SAR ATR, most of the weights output by the sigmoid function become 1 (or almost 1); thus, the feature maps remain almost unchanged after being multiplied by the corresponding weights. Accordingly, the original SE module cannot effectively suppress the feature maps with little information.
To solve this problem, a modified SE module is proposed, i.e., the enhanced-SE module. Firstly, although global average pooling can compute global information of the current feature map, its ability to perceive that information accurately is limited. Thus, we design a new layer with learnable parameters to perceive global information regarding the current feature map, which is realized by replacing the global average pooling layer with a convolutional layer whose kernel size is the same as the size of the current feature map. Additionally, the first fully-connected layer is deleted, so the perceived global information directly joins the computation of the final output weights.
In the original SE module, the sigmoid function is utilized to avoid numerical explosion by transforming all the learned weights to (0,1); it is defined by [33]:

$$s(x) = \frac{1}{1 + e^{-x}}.$$

Although the sigmoid function is monotonically increasing, all of the large weights are transformed to almost 1 (e.g., the weight 2.5 becomes 0.9241 after the sigmoid transformation). Such a transformation does not help the network distinguish the importance of different feature maps. To solve the above problem, we design a new function $p(x)$, i.e., the enhanced-sigmoid function, which has a shift parameter $a$, a scale parameter $b$, and a power parameter $q$, and is built on the original sigmoid function $s(x)$. If a = 0, b = 1, and q = 1, then $p(x)$ is the same as $s(x)$. For a = 0, b = 1, and q = 2, Figure 5 shows the comparison between the sigmoid function and the enhanced-sigmoid function. If the input value falls in (−5,5), then the output of the enhanced-sigmoid function is smaller than the output of the sigmoid function (e.g., the weight 2.5 becomes 0.8540 after the enhanced-sigmoid transformation, which is clearly smaller than 0.9241).
Figure 6 shows the structure of the enhanced-SE module with the above modifications. Figure 7 shows an illustrative comparison between the feature maps output by the SE module and the enhanced-SE module in a SAR ATR task. Many feature maps become blank in Figure 7b, indicating that the enhanced-SE module suppresses feature maps with little information more effectively than the original SE module.
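The numerical values quoted above can be checked with a short script. The form `b * s(x - a) ** q` used below is an assumption: it is one parameterization that reduces to the sigmoid for a = 0, b = 1, q = 1 and reproduces the quoted value 0.8540 for q = 2, but the exact way the shift and scale parameters enter the authors' function may differ.

```python
import math

def sigmoid(x):
    """Original sigmoid s(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def enhanced_sigmoid(x, a=0.0, b=1.0, q=2.0):
    """Assumed form p(x) = b * s(x - a) ** q (a: shift, b: scale, q: power)."""
    return b * sigmoid(x - a) ** q

print(f"{sigmoid(2.5):.4f}")           # 0.9241
print(f"{enhanced_sigmoid(2.5):.4f}")  # 0.8540
```

For a = 0 and b = 1, squaring a value in (0, 1) always makes it smaller, so weights that the sigmoid pushes close to 1 are spread over a wider range, which is the effect the enhanced-SE module relies on.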

Other Components in the ESENet
The convolutional layer and pooling layer are the basic components in a typical CNN structure [34]. The convolutional layer often acts as a feature extractor, which convolves the input with a convolutional kernel to generate a new feature map. The pooling layer is a subsampling layer that reduces the number of trainable parameters of the network. By subsampling, the structural feature of the current layer is maintained and the impact of deformed training samples on feature extraction is reduced.
Neural networks are essentially utilized to fit the data distribution. If the training and test sets have different distributions, the convergence speed will decrease and the generalization performance will degrade. To tackle this problem, batch normalization is added after the first convolutional layer of the ESENet to accelerate network training and improve the generalization performance.
Dropout is a common regularization method that is utilized in deep neural networks [35]. This technique randomly samples the weights from the current layer with probability p and prunes them out, similar to an ensemble of sub-networks. Usually, it is adopted in layers with a large number of parameters to alleviate overfitting. In the proposed ESENet, the fully-connected layer has a small number of parameters, while the third convolutional layer and the fourth convolutional layer contain most of the trainable weights. Thus, we apply dropout in these two layers with p = 0.5 and p = 0.25, respectively.
Additionally, we replace the common softmax classifier with the LM-softmax classifier, which could improve the classification performance by adjusting the decision boundary of the features that were extracted by the CNN.
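The statement that the third and fourth convolutional layers hold most of the trainable weights can be checked with a rough parameter count. The sketch below assumes a single-channel input, one bias per output channel, and a fully-connected layer fed by a 1 × 1 × 64 feature map (which holds for a 60 × 60 input with unpadded convolutions); the parameters of the SE and enhanced-SE modules and of batch normalization are left out.

```python
# (name, kernel size, input channels, output channels), from the layer description.
CONV_LAYERS = [
    ("conv1", 5, 1, 16),
    ("conv2", 3, 16, 32),
    ("conv3", 4, 32, 64),
    ("conv4", 5, 64, 64),
]
FC_IN, FC_OUT = 64, 10  # assumption: conv4 output is 1 x 1 x 64

def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out

param_counts = {name: conv_params(k, ci, co) for name, k, ci, co in CONV_LAYERS}
param_counts["fc"] = FC_IN * FC_OUT + FC_OUT

total = sum(param_counts.values())
share = (param_counts["conv3"] + param_counts["conv4"]) / total
for name, count in param_counts.items():
    print(f"{name}: {count}")
print(f"conv3 + conv4 share of counted parameters: {share:.1%}")
```

Under these assumptions the two layers account for roughly 96% of the counted parameters, which is consistent with applying dropout there rather than in the small fully-connected layer.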

Parameter Settings and Training Method
We apply the gradient descent technique with weight decay and momentum in the training process [36], which is defined by:

$$\theta_{i+1} = \theta_i + \Delta\theta_{i+1}, \qquad \Delta\theta_{i+1} = \alpha\, \Delta\theta_i - \beta\, \varepsilon\, \theta_i - \varepsilon\, \frac{\partial L}{\partial \theta},$$

where $\Delta\theta_{i+1}$ is the variation of $\theta$ in the $(i+1)$th iteration, $\varepsilon$ is the learning rate, $\alpha$ is the momentum coefficient, $\beta$ is the weight decay coefficient, and $\partial L / \partial \theta$ is the derivative of the loss function $L$ with respect to $\theta$. In this paper, the base learning rate is set to 0.02, $\alpha$ is set to 0.9, and $\beta$ is set to 0.004. Subsequently, we adopt a multi-step iteration strategy, which updates the learning rate to $\varepsilon = \varepsilon/10$ each time the iteration number reaches 1000, 2000, 4000, etc. Additionally, we adopt a common training method that subtracts the mean of the training samples from both the training and test samples to accelerate the convergence of the CNN. In the enhanced-SE module, a is set to 0, b is set to 1, and q is set to 2. In the SE module, r is set to 16.
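The training rule and the multi-step schedule can be sketched as follows. The update form used here, Δθ ← αΔθ − βεθ − ε∂L/∂θ, is the common momentum-with-weight-decay rule and is assumed to match the one intended above. Only the three milestones named in the text are encoded; the toy quadratic loss and the function names are hypothetical.

```python
import numpy as np

BASE_LR, MOMENTUM, WEIGHT_DECAY = 0.02, 0.9, 0.004
MILESTONES = (1000, 2000, 4000)  # the text lists these and then "etc."

def learning_rate(iteration):
    """Divide the base learning rate by 10 at every milestone already reached."""
    drops = sum(iteration >= m for m in MILESTONES)
    return BASE_LR / (10 ** drops)

def sgd_step(theta, delta, grad, iteration):
    """One update with momentum and weight decay (assumed standard form)."""
    lr = learning_rate(iteration)
    delta = MOMENTUM * delta - WEIGHT_DECAY * lr * theta - lr * grad
    return theta + delta, delta

# Toy problem (hypothetical): minimize L = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
delta = np.zeros_like(theta)
for it in range(200):
    theta, delta = sgd_step(theta, delta, grad=theta, iteration=it)
```

On this toy loss the parameters decay toward zero; in the real network, `grad` would be the backpropagated gradient of the classification loss.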

Dataset Description
The training and test datasets are generated from the MSTAR dataset that was provided by DARPA/AFRL [13].The dataset was collected by Sandia National Laboratory SAR sensor platform in 1995 and 1996 using an X-band SAR sensor.It provides a nominal spatial resolution of 0.3 × 0.3 m in both range and azimuth, and the image size is 128 × 128.The publicly released datasets include ten categories of ground military vehicles, i.e., armored personnel carrier: BMP-2, BRDM-2, BTR-60, and BTR-70; tank: T62, T72; rocket launcher: 2S1; air defense unit: ZSU-234; truck: ZIL-131; and, bulldozer: D7.
The MSTAR dataset consists of two sub-datasets for performance evaluation in various scenarios, i.e., the standard operating conditions (SOC) dataset and the extended operating conditions (EOC) dataset. The SOC dataset consists of ten target categories at 17° and 15° depression angles, as shown in Table 1. Following common practice [12,21], images at the 17° depression angle serve as training samples and images at the 15° depression angle serve as test samples.
The EOC dataset includes EOC1 and EOC2, i.e., the large depression variation dataset and the variants dataset. There are four target categories in EOC1: 2S1, BRDM-2, T72, and ZSU-234. Images at the 17° depression angle serve as training samples and images at the 30° depression angle serve as test samples, as shown in Table 2. EOC2 contains two subsets, i.e., configuration variants and version variants. For the configuration variants, the training samples include BMP-2, BRDM-2, BTR-70, and T72, and the test samples only include variants of T72. For the version variants, the training samples include BMP-2, BRDM-2, BTR-70, and T72, and the test samples include variants of T72 and BMP-2. Detailed information is listed in Tables 3 and 4, respectively.
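The depression-angle protocol described above can be written down as a small configuration. The sketch below is only an illustration: the record layout `(label, depression_angle, image)`, the `PROTOCOLS` table, and the `split` helper are hypothetical. EOC2 is omitted because its split depends on the variant listings in Tables 3 and 4, which are not reproduced here.

```python
# Train/test protocol as described in the text; the record format is hypothetical.
PROTOCOLS = {
    # None means that all ten categories are used.
    "SOC": {"train_angle": 17, "test_angle": 15, "classes": None},
    "EOC1": {"train_angle": 17, "test_angle": 30,
             "classes": {"2S1", "BRDM-2", "T72", "ZSU-234"}},
}

def split(samples, protocol):
    """Split (label, depression_angle, image) records into train and test lists."""
    cfg = PROTOCOLS[protocol]
    wanted = cfg["classes"]
    train, test = [], []
    for label, angle, image in samples:
        if wanted is not None and label not in wanted:
            continue  # category not part of this experiment
        if angle == cfg["train_angle"]:
            train.append((label, image))
        elif angle == cfg["test_angle"]:
            test.append((label, image))
    return train, test
```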

Network Structures for Comparison
For the convenience of comparison, the traditional CNN and A-convnet structures are designed according to the size of the input image by referring to the structures given in references [12,21]. Subsequently, the structures yielding the highest classification accuracy are selected as the optimal ones, as shown in Figure 8a,b, respectively. Data augmentation methods, such as translation and rotation, are not applied in this paper.

Effect of Clutter and Data Generation
Reference [14] shows that, even when the target region has been removed from the original MSTAR images, the nearest neighbor classifier still achieves high classification accuracy, which indicates that the clutter in the training and test images of the MSTAR dataset is highly correlated. Reference [15] also shows that background clutter in the MSTAR dataset disturbs the recognition results of CNN. To mitigate the impact of background clutter on network training and testing, the target region is segmented out from the original SAR images according to references [15,30], as shown in Figure 9. Because the target only occupies a small region at the center of the original image, the original 128 × 128 image is cropped to 60 × 60 to reduce the computational cost. In this way, the pure target dataset utilized in this paper is generated.
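The cropping step can be sketched as follows. This is a minimal illustration of a center crop from 128 × 128 to 60 × 60 only; the target segmentation of references [15,30] is not reproduced, and the function name `center_crop` and the dummy image are assumptions made for the example.

```python
def center_crop(image, size=60):
    """Return the central size x size window of a 2-D image given as a list of rows."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image size")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

# A 128 x 128 dummy chip whose pixel value encodes its (row, column) position.
chip = [[(r, c) for c in range(128)] for r in range(128)]
patch = center_crop(chip)
```

For a 128 × 128 input, the window starts at row and column (128 − 60) / 2 = 34, so the crop keeps pixels 34 to 93 along each axis.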

Results of SOC
Table 5 shows the recognition results of ESENet and the other CNN structures under SOC. To validate the effectiveness of the proposed enhanced-SE module, we replace the enhanced-SE module in the ESENet with an original SE module and include the resulting SENet in Table 5 for comparison. Table 6 provides the confusion matrix of ESENet.

As shown in Table 5, the recognition accuracy of the traditional CNN, A-convnet, SENet, and ESENet under SOC is 94.79%, 95.04%, 96.63%, and 97.32%, respectively. Although the background clutter has been removed from the SAR images, the ESENet still obtains good recognition performance, as the recognition rate of every target type exceeds 90%. Table 5 shows that, by inserting the SE module into a common CNN structure, the SENet outperforms the traditional CNN structures for SAR ATR. Moreover, the comparison between the SENet and the ESENet shows that the enhanced-SE module outperforms the SE module in facilitating the feature extraction of CNN in a SAR ATR task. For a typical test sample, the feature maps of ESENet before and after transformation by the SE and the enhanced-SE modules are shown in Figure 10. It is observed that the feature maps of the second convolutional layer are almost unchanged after passing through the SE module. However, in the third convolutional layer, the feature maps with little information are effectively suppressed when they pass through the enhanced-SE module.
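To make the channel re-weighting concrete, the sketch below implements the original SE block in plain Python: a global average per channel (squeeze), a bottleneck of two fully-connected layers with ReLU and sigmoid (excitation), and a per-channel rescaling. It does not reproduce the enhanced-SE module, whose modified sigmoid is defined elsewhere in the paper. The function name `se_module`, the bias-free layers, and the toy weights are assumptions chosen for illustration. With r = 16 as stated above, the bottleneck would have C/16 units, whereas the toy example uses two channels and one hidden unit.

```python
import math

def se_module(feature_maps, w1, w2):
    """Original SE block: squeeze, excite (FC-ReLU-FC-sigmoid), then rescale."""
    # Squeeze: one scalar per channel (global average pooling).
    z = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in feature_maps]
    # Excitation: reduce to the bottleneck, apply ReLU, expand, apply sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Rescale: multiply every pixel of channel c by its weight in (0, 1).
    scaled = [[[s * px for px in row] for row in fmap]
              for s, fmap in zip(scales, feature_maps)]
    return scaled, scales

# Toy example with C = 2 channels and a one-unit bottleneck (hypothetical weights).
maps = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
out, scales = se_module(maps, w1=[[1.0, 0.0]], w2=[[4.0], [-4.0]])
```

Because the sigmoid output stays strictly between 0 and 1, the original block can attenuate a channel but never remove it entirely, which is the behavior the enhanced-SE module is designed to sharpen.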

Results of EOC1
Subsequently, the EOC1 dataset is utilized to evaluate the performance of ESENet under large depression angle variation. As shown in Table 7, the recognition accuracy of the traditional CNN, the A-convnet, the SENet, and the ESENet is 88.44%, 89.05%, 90.27%, and 93.40%, respectively, which shows that the ESENet outperforms the others under EOC1. However, the accuracy of the EOC1 experiment is lower than that of the SOC experiment. As expected, the large difference between the training and the test samples decreases the recognition accuracy, because the SAR image is sensitive to the variation of viewing angles.
Table 8 shows the confusion matrix of the ESENet under EOC1. It is observed that the recognition accuracy of T72 drops sharply, which might be caused by the similarity between T72 and ZSU-234 under large depression angle variation. As shown in Figure 12, the SAR image of T72 at the 30° depression angle exhibits a configuration similar to that of ZSU-234 at the 17° depression angle.

Results of EOC2
Variant recognition plays a significant role in SAR ATR. In the experiments under EOC2, we test the network's ability to distinguish objects with similar appearance. For the configuration variants dataset introduced in Table 3, Table 9 shows the recognition accuracy of the above-mentioned four networks and Table 10 shows the confusion matrix of the ESENet. The ESENet outperforms the others, and its recognition accuracy is 3% higher than that of the traditional CNN. For the version variants dataset introduced in Table 4, Table 11 lists the recognition accuracy of the four networks, and Table 12 lists the confusion matrix of the ESENet. It is observed that the ESENet has the best recognition performance among the four CNN structures.
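The confusion matrices and recognition rates reported in these experiments can be computed as in the following sketch. The function names and the toy labels are hypothetical, and the numbers produced here are not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def recognition_rates(matrix):
    """Per-class rate (diagonal over row sum) and overall accuracy."""
    per_class = [row[i] / sum(row) if sum(row) else float("nan")
                 for i, row in enumerate(matrix)]
    total = sum(map(sum, matrix))
    overall = sum(matrix[i][i] for i in range(len(matrix))) / total
    return per_class, overall

# Hypothetical labels used only to exercise the functions.
classes = ["BMP-2", "T72"]
truth = ["T72", "T72", "T72", "T72", "BMP-2", "BMP-2"]
preds = ["T72", "T72", "T72", "BMP-2", "BMP-2", "BMP-2"]
cm = confusion_matrix(truth, preds, classes)
per_class, overall = recognition_rates(cm)
```

In this toy case one of the four T72 samples is assigned to BMP-2, so the T72 row reads [1, 3] and its per-class rate is 0.75.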

Conclusions
Feature extraction plays an important role in the task of SAR ATR. This paper proposed the ESENet to solve the problem that feature maps with little information, which are automatically learned by CNN, decrease the SAR ATR performance. In this framework, we designed a new enhanced-SE module. The enhanced-SE module enhances the ability of CNN to suppress feature maps with little information by computing and allocating different weights to the corresponding feature maps. For the preprocessed MSTAR dataset, experiments have shown that the ESENet achieves higher recognition accuracy than the traditional CNN structure and A-convnet, and that it exhibits robustness to large depression angle variation, configuration variants, and version variants.
Future work will focus on network optimization, the design of multi-channel CNN structures for multi-dimensional feature extraction, and improving the robustness of the network to distorted datasets.

Figure 1. Illustration of feature maps learned by convolutional neural network (CNN) in a synthetic aperture radar automatic target recognition (SAR ATR) experiment.

Figure 2. The structure of the Squeeze and Excitation (SE) module.
Subsequently, the segmented training images are input into the ESENet to learn the weights, and all of the weights in the ESENet are fixed when the training stage ends. After that, the ESENet is used for classification. During the test stage, the segmented test images are input into the ESENet to obtain the classification results. The correlation between the clutter in the training and test images is excluded, because the clutter irrelevant to the target does not take part in the training and test stages of the ESENet.

Figure 3. Overview of the proposed SAR ATR method.

Figure 4. Structure of the Enhanced Squeeze and Excitation Net (ESENet).

Figure 5. Comparison between the sigmoid function and the enhanced-sigmoid function.

Figure 6. Structure of the enhanced-SE module.

Figure 7. Visualization of feature maps output by the SE module (a) and the enhanced-SE module (b).

Figure 8. Optimal structures of traditional CNN (a) and A-convnet (b) for the 60 × 60 input image.

Figure 10. Visualization of feature maps in the ESENet. (a) input image; (b) mean of training samples; (c) input image with the mean of the training samples removed; (d) feature maps of conv2; (e) feature maps of conv2 after passing through the SE module; (f) feature maps of conv3; and (g) feature maps of conv3 after passing through the enhanced-SE module.

For the purpose of illustration, we present a sample of BTR-70 in Figure 11a, which is misclassified as BTR-60 by the SENet, while the ESENet correctly classifies it. The feature maps of the third

Figure 11. Feature maps of conv3 in SENet and ESENet. (a) input image; (b) input image with the mean of training samples removed; (c) feature maps of conv3 in the SENet; (d) feature maps of conv3 in the SENet after passing through the SE module; (e) feature maps of conv3 in the ESENet; and (f) feature maps of conv3 in the ESENet after passing through the enhanced-SE module.

Table 1. Training and test samples for the standard operating conditions (SOC) experimental setup.

Table 2. Number of training and test samples for extended operating conditions (EOC)-1 (large depression variation).

Table 3. Number of training and test samples for EOC-2 (configuration variants).

Table 4. Number of training and test samples for EOC-2 (version variants).

Table 5. Recognition accuracy comparison under SOC.

Table 6. Confusion matrix of ESENet under SOC.