Multi-Aspect Convolutional-Transformer Network for SAR Automatic Target Recognition

: In recent years, synthetic aperture radar (SAR) automatic target recognition (ATR) has been widely used in both military and civilian ﬁelds. Due to the sensitivity of SAR images to the observation azimuth, the multi-aspect SAR image sequence contains more information for recognition than a single-aspect one. Nowadays, multi-aspect SAR target recognition methods mainly use recurrent neural networks (RNN), which rely on the order between images and thus suffer from information loss. At the same time, the training of the deep learning model also requires a lot of training data, but multi-aspect SAR images are expensive to obtain. Therefore, this paper proposes a multi-aspect SAR recognition method based on self-attention, which is used to ﬁnd the correlation between the semantic information of images. Simultaneously, in order to improve the anti-noise ability of the proposed method and reduce the dependence on a large amount of data, the convolutional autoencoder (CAE) used to pretrain the feature extraction part of the method is designed. The experimental results using the MSTAR dataset show that the proposed multi-aspect SAR target recognition method is superior in various working conditions, performs well with few samples and also has a strong ability of anti-noise.


Introduction
Synthetic aperture radar (SAR) is a high-resolution coherent imaging radar.As an active microwave remote sensing system, which is not affected by light and climatic conditions, SAR can achieve all-weather day-and-night earth detection [1].At the same time, SAR adopts synthetic aperture technology and matched filtering technology, which can realize long-distance high-resolution imaging.Therefore, SAR is of great significance in both military and civilian fields [2].
In recent years, with the development of SAR technology, the ability to obtain SAR data has been greatly improved.The early methods of manually interpreting SAR images cannot support the rapid processing of large amounts of SAR data due to their low time efficiency and high cost.How to quickly mine useful information from massive highresolution SAR image data and apply it to military reconnaissance, agricultural and forestry monitoring, geological survey and many other fields has become an important problem in SAR applications that needs to be solved urgently.Therefore, the automatic target recognition (ATR) [3] of SAR images to solve this problem has become a research hotspot.
Since a SAR image shows the scattering characteristics of the target to electromagnetic waves, the SAR image is very different from the optical image, which also has a great impact on the SAR ATR.Classical SAR ATR methods include a template-based method and model-based method.The template-based method is one of the earliest proposed methods, including direct template matching methods that calculate the similarity between the template formed by processing the training sample itself and the test sample for classification [4], and feature template matching methods that use classifiers such as SVM [5], KNN [6] and Bayes classifier [7] after extracting various features [8,9] for classification.The template-based method is simple in principle and easy to implement, but requires large and diverse training data to build a complete template library.To make up for the shortcomings of the template-based method, the model-based method [10] is proposed, which includes two parts: model construction and online prediction.
With the development of machine learning and deep learning technology, the automatic feature extraction ability of a neural network has attracted the attention of researchers, and then deep learning has been applied in SAR ATR.Initially, the neural network model in traditional computer vision (CV) was directly applied to SAR target recognition.For instance, Kang et al. transferred existing pretrained networks after fine-tuning [11].Unsupervised learning methods such as autoencoder [12] and deep belief network (DBN) [13] were also used to automatically learn SAR image features.Afterward, the network structure and loss function were designed for the specific task of target recognition using the amplitude information of SAR images, which were more in line with the requirements of the SAR target recognition task and undoubtedly achieved better recognition performance.Chen et al. designed a fully convolutional network (A-ConvNets) for recognition on the MSTAR target dataset [14].Lin et al. proposed the deep convolutional Highway Unit for ATR of a small number of SAR targets [15].Du et al. recommended the application of multi-task learning to SAR ATR to learn and share useful information from two auxiliary tasks designed to improve the performance of recognition tasks [16].Gao et al. proposed to extract polarization features and spatial features, respectively, based on a dual-branch deep convolution neural network (Dual-CNN) [17].Shang et al. designed deep memory convolution neural networks (M-Net), including an information recorder to remember and store samples' spatial features [18].Recently, the characteristics brought by the special imaging mechanism of SAR are being focused on, and some methods combining deep learning with physical models have appeared.Zhang et al. proposed a domain knowledge-powered two-stream deep network (DKTS-N), which incorporates a deep learning network with SAR domain knowledge [19].Feng et al. combined electromagnetic scattering characteristics with a depth neural network to introduce a novel method for SAR target classification based on an integration parts model and deep learning algorithm [20].Wang et al. recommended an attribute-guided multi-scale prototypical network (AG-MsPN) that obtains more complete descriptions of targets by subband decomposition of complex-valued SAR images [21].Zhao et al. proposed a contrastive-regulated CNN in the complex domain to obtain a physically interpretable deep learning model [22].Compared with traditional methods, deep learning methods have achieved better recognition results in SAR ATR.However, most of these current SAR ATR methods based on deep learning are aimed at single-aspect SAR images.
In practical applications, due to the special imaging principle of SAR, the same target will show different visual characteristics under different observation conditions, which also makes the performance of SAR ATR methods affected by various factors, such as environment, target characteristics and imaging parameters.The observation azimuth is also one of the influencing factors.The sensitivity of the scattering characteristics of artificial targets to the observation azimuth leads to a large difference in the visual characteristics of the same target at different aspects.Therefore, the single-aspect SAR image loses the scattering information related to the observation azimuth [23].The target recognition performance of single-aspect SAR is also affected by the aspect.
With the development of SAR systems, multi-aspect SAR technologies such as Circular SAR (CSAR) [24] can realize continuous observation of the same target from different observation azimuth angles.The images of the same target under different observation azimuth angles obtained by multi-angle SAR contain a lot of identification information.The multi-aspect SAR target recognition technology uses multiple images of the target obtained from different aspects and combines the scattering characteristics of different aspects to identify the target category.Compared with single-aspect SAR images, multiaspect SAR image sequences contain spatially varying scattering features [25] and provide more identification information for the same target under different aspects.On the other hand, multi-aspect SAR target recognition can improve the target recognition performance by fully mining the intrinsic correlation between multi-aspect SAR images.
The neural networks that use multi-aspect SAR image sequences for target recognition mainly include recurrent neural networks (RNN) and convolutional neural networks (CNN).Zhang et al. proposed multi-aspect-aware bidirectional LSTM (MA-BLSTM) [26], which extracts features from each multi-aspect SAR image through a Gabor filter, and further uses LSTM to store the sequence features in the memory unit and transmit through learnable gates.Similarly, Bai et al. proposed a bidirectional convolutional-recurrent network (BCRN) [27], which uses Deep CNN to replace the manual feature extraction process of MA-BLSTM.Pei et al. proposed a multiview deep convolutional neural network (MVD-CNN) [28], which uses a parallel CNN to extract the features of each multi-aspect SAR image, and then merges them one by one through pooling.Based on MVDCNN, Pei et al. improved the original network with the convolutional gated recurrent unit (ConvGRU) and proposed a multiview deep feature learning network (MVDFLN) [29].
Although these methods have obtained good recognition results, there are still the following problems: 1.
When using RNN or CNN to learn the association between multi-aspect SAR images, the farther the two images are in a multi-aspect SAR image sequence, the more difficult it is to learn the association between them.That is, the association will depend on the order of the image in the sequence.

2.
All current studies require a lot of data for training the deep networks, and the accuracy will drop sharply in the case of few samples.

3.
The existing approaches do not consider the influence of noise, which leads to a poor anti-noise ability of the model.
To address these problems, in this paper, we propose a multi-aspect SAR target recognition method based on convolutional autoencoder (CAE) and self-attention.After pre-training, the encoder of CAE will be used to extract the features of single-aspect SAR images in the multi-aspect SAR image sequence, and then the intrinsic correlation between images in the sequence will be mined through a transformer based on self-attention.
In this paper, it is innovatively proposed to mine the correlation between multiaspect SAR images through a transformer [30] based on self-attention.Vision transformer (ViT) [31] for optical image classification and the networks based on attention for singleaspect SAR ATR, such as the mixed loss graph attention network (MGA-Net) [32] and the convolutional transformer (ConvT) [33], extract representative features by determining the correlation between various parts of an image itself.Unlike them, the ideas of natural language processing (NLP) tasks are leveraged to mine the association between the semantic information of each image in the multi-aspect SAR image sequence.Because each image is correlated with other images in the same way in the calculation process of self-attention, the order dependence problem faced by existing methods will be avoided.Considering that self-attention loses local details, the CNN pre-trained by CAE with shared parameters is designed to extract local features for each image in the sequence.On the one hand, effective feature extraction provided by CNN can diminish the requirement for sample size.On the other hand, by minimizing the gap between the reconstructed image and the original input, the autoencoder ensures that the features extracted by the encoder can effectively represent the principal information of the original image.Thus CAE plays a vital role in anti-noise.
Compared with available multi-aspect SAR target recognition methods, the novelty as well as the contribution of the proposed method can be summarized as follows.

1.
A multi-aspect SAR target recognition method based on self-attention is proposed.Compared with existing methods, the calculation process of self-attention makes it not affected by the order of images in the sequence.To the best of our knowledge, this is the first attempt to apply a transformer based on self-attention to complete the recognition task of multi-aspect SAR image sequences.2.
CAE is introduced for feature extraction in our method, which is due to the additional consideration of the cases with few samples and noise compared with other methods and is created to improve the ability of the network to effectively extract the major features through pre-training and fine-tuning.

3.
Compared with the existing methods designed for multi-aspect SAR target recognition, our network obtains higher recognition accuracy on the MSTAR dataset and exhibits more robust recognition performance in version and configuration variants.Furthermore, our method demonstrates better in the recognition task with a small number of samples.Our method achieves stronger performance in anti-noise assessment as well.
The remainder of this paper is organized as follows: Section 2 describes the proposed network structure in detail.Section 3 presents the experimental details and results.Section 4 discusses the advantages and future work of the proposed method.Section 5 summarizes the full paper.

Overall Structure
As shown in Figure 1, the proposed multi-aspect SAR target recognition method consists of five parts, i.e., multi-aspect SAR image sequence construction, single-aspect feature extraction, multi-aspect feature learning, feature dimensionality reduction and target classification.Among them, feature extraction is implemented using CNN pretrained by CAE, and multi-aspect feature learning uses the transformer encoder [31]   Before feature extraction, single-aspect SAR images are used to construct multi-aspect SAR image sequences and also serve as the input to pre-train CAE, which includes the encoder that utilizes the multi-layer convolution-pooling structure to extract features and the decoder that utilizes the multi-layer deconvolution-unpooling structure to reconstruct images.The encoder of CAE after pre-training will be transferred to CNN for feature extraction of each image in the input multi-aspect SAR image sequence.In particular, the output of feature extraction for an image is a 1-D feature vector.Then, in the multiaspect feature learning structure, the vectorized features extracted from each image are spliced and added to position embedding to be the input of the transformer encoder based on multi-head self-attention.All the output features of the transformer encoder are reduced in dimension by 1 × 1 convolutions and then averaged.Finally, the softmax classifier gives the recognition result.
In the training process, the whole network obtains errors from the output and propagates back along the network to update parameters.It should be noted that the parameters of multi-layer CNN used for feature extraction can be frozen or updated with the entire network for fine-tuning.
In the following discussions, the details of each part of the proposed method and the training process will be introduced in turn, such as the loss function and so forth.

Multi-Aspect SAR Image Sequence Construction
Multi-aspect SAR images can be obtained by imaging the same target from different azimuth angles and different depression angles by radars on one or more platforms.Figure 2 shows a simple geometric model for multi-aspect SAR imaging.On this basis, multi-aspect SAR image sequences are built based on the following rules.Suppose is the image set sorted by azimuth angle for a specific class c i .C is the number of classes and n i is the number of images contained in one class.The azimuth of the image is ϕ(x i j ).For a given angle range θ and sequence length k, a window with fixed length k + 1 is placed along the original image set with a stride of 1 and the images in the window form k sequences of length k by permutation, then the sequence whose azimuth difference between any two images is smaller than θ is reserved as the training sample of the network.In addition, the final retained sequence samples are required to not contain duplicate samples.The process of multi-aspect SAR image sequence construction is summarized in Algorithm 1.In Algorithm 1, is the sequence set for a specific class c i .N i is the number of sequences contained in one class.Figure 3 shows an example of multi-aspect SAR image sequence construction.When the sequence length is set to 4, ideally, 12 sequence samples can be received from only 7 images.In this way, enough training samples can be obtained from limited raw SAR images.

Algorithm 1 Construct multi-aspect SAR image sequence
Input: angle range θ and sequence length k, raw SAR images

Feature Extraction Pre-Trained by CAE
Drawing on the idea of NLP, our method considers each image in a multi-aspect SAR image sequence to be equivalent to each word in a sentence.To effectively extract major features from each image in a multi-aspect SAR image sequence in parallel, CNN pre-trained by CAE with shared parameters is designed, which can reduce the number of learning parameters as well.Figure 4 shows the network structure of CAE, which consists of an encoder and decoder.The encoder is comprised of convolutional layers, pooling layers and the nonlinear activation function.The convolutional layer is the core structure, which extracts image features through convolution operations.The convolution layer in the neural network initializes a learnable convolution kernel, which is convolved with the input vector to obtain a feature map.Each convolution kernel has a bias, which is also trainable.The activation function is a mapping from the input to the output of the neural network, which increases the nonlinear properties of the neural network.ReLU, which is simple to calculate and can speed up the convergence of the network, is used as the activation function in the encoder of CAE.The convolutional layer is usually followed by the pooling layer, which plays the role of downsampling.Common pooling operations mainly include max pooling and average pooling.In this method, maximum pooling is selected; that is, it takes the largest value in the pooling window as the value after pooling.
For the lth convolution-maxpooling layer of the encoder, suppose x l−1 is the input and x l is the output feature map, and the input of the first layer x 0 is the input image x.Suppose W l is the convolution kernel in the lth layer, and b l is its bias.The feedforward propagation process of a convolution-maxpooling layer in the encoder can be expressed as: (1) where * and f DOW N denote the convolution and the pooling operation, respectively.σ represents the ReLU activation function, which is defined as: The decoder reconstructs the image according to the feature map.The decoder is also a multi-layer structure, which contains unmaxpooling layers, deconvolution layers and the ReLU activation function.Unpooling is the reverse operation of pooling, which restores the original information to the greatest extent by complimenting.In this work, unmaxpooling is chosen; that is, it is assigned the value to the position of the maximum value in the pooling window recorded during pooling, and we supplement 0 for the other positions in the pooling window.The deconvolution layer performs convolution between the feature map and the transposed convolution kernel so as to reconstruct the image based on the feature map.
For the lth unpooling-deconvolution layer of the decoder, suppose xl−1 is the input and xl is the output.The input of the decoder's first layer x0 is the output of the encoder x L .Suppose Ŵl is the convolution kernel in the lth layer, ŴT represents the transpose of the convolution kernel, and bl is the bias in the lth layer.The feedforward propagation process of an unpooling-deconvolution layer in the decoder can be expressed as: where f UP represents the unpooling operation.The output of the whole CAE x is the output of the decoder's last layer xL .CAE takes a single-aspect SAR image as input, and the output is a reconstructed image of the same size as the input image.The training of the CAE module will be detailed in Section 2.7.1.After the training, the encoder is transferred to extract features of each image in the sequence, the output feature vector is the output of the last layer of the encoder x L

Multi-Aspect Feature Learning Based on Self-Attention
The multi-aspect feature learning part of the proposed method is modified based on the transformer encoder to make it suitable for multi-aspect SAR target recognition.The feature vectors extracted from each image in the sequence are combined as , which is the input of position embedding.Positional embedding is proposed because self-attention does not consider the order between input vectors, but in translation tasks in NLP, the position of a word has an impact on the meaning of a sentence.As described in Section 2.2, multi-aspect SAR images are constructed into sequences according to azimuth angles, and the angle information of images in the sequence also needs to be recorded by position embedding.In this work, sine and cosine functions are used to calculate the positional embedding [30].The output of positional embedding X PE is input to the transformer encoder next.
The transformer encoder, which is the kernel structure of this part, is shown in Figure 5.The transformer encoder is composed of multiple layers, and each layer contains two residual blocks, which mainly include the multi-head self-attention (MSA) unit and multilayer perceptron (MLP) unit.Supposed there are N layers in the transformer encoder, the details of these two residual blocks in each layer will be introduced below.The first residual block is formed by adding the result of the input vector going through layer normalization (LN) [34] and MSA unit to itself.LN is used to implement normalization.Different from the commonly used batch normalization (BN), all the hidden units in a layer share the same normalization terms under LN.Thus, LN does not impose any constraint on the size of a mini-batch.MSA, the core of the transformer encoder, is to calculate the correlation as a weight by multiplying the query and the key and then using this weight to weighted sum the value to increase the weight of related elements in a sequence and reduce the weight of irrelevant elements.
Here the calculation process of the first residual block is described.Suppose Z n−1 ∈ R k×m is the input of the first residual block on the nth layer of the transformer encoder, and Z n ∈ R k×m is the output.Among them, k is the number of images in a multi-aspect SAR image sequence, m represents the channel number of each image's feature vectors, and the input of the first layer is the output of the position-embedding X PE .The input vector Z n−1 ∈ R k×m first passes through LN to get Y n−1 ∈ R k×m ; that is: where Φ LN indicates the process of LN.Then, Y n−1 is divided into H parts along the channel dimension.Suppose , and corresponds to a head of self-attention.In other words, where H is the number of heads in MSA.For each head, the input Y n−1 h is multiplied by three learnable matrices, the query matrix W n qh ∈ R k×k , the key matrix W n kh ∈ R k×k and the value matrix W n vh ∈ R k×k to obtain the query vector Q n h ∈ R k×d h , the key vector K n h ∈ R k×d h and the value vector V n h ∈ R k×d h , which can be formulated as: Then, the transposed matrix of K n h and Q n h are multiplied to obtain the initial correlation matrix Ân h ∈ R d h ×d h , and then a softmax operation is performed on Ân h column by column to obtain the correlation matrix A n h ∈ R d h ×d h .The calculation process is written as: V n h is multiplied by the weight matrix A n h to get the output Y n h ; that is: Then, the output vectors of each head Finally, to get the output of the first residual block Zn ∈ R k×m , the residual operation is applied to compute the summation of this block Z n−1 and the output of MSA Y n , which can be formulated as: The second residual block of each layer in the transformer encoder is composed of adding the results of the input vector through MLP to itself.The input vector of the second residual block Zn goes through LN, a fully connected sublayer with the nonlinear activation function and another fully connected sublayer in turn.The two fully connected sublayers expand and restore the vector dimension, respectively, thereby enhancing the expressive ability of the model.The number of neurons of the two fully connected sublayers is N f c1 and m, respectively.GELU is employed as the activation function, which performs well in transformer-based networks, and its definition is given as follows: where Φ(x) is the cumulative distribution function of the standard normal distribution.Finally, through the residual operation, the output of the second residual block Z n , which is also the output of the nth layer of the transformer encoder, is obtained.The calculation process of the second residual block is described as: where Φ FC1 , Φ FC2 represent the operation of the two fully connected sublayers separately.
To alleviate the overfitting issue that is prone to occur when the training sample is insufficient, dropout is added at the end of each fully connected sublayer.Dropout is implemented by ignoring part of the hidden layer nodes in each training batch, and to put it simply, p percent neurons in the hidden layer stop working during each training iteration.

Feature Dimensionality Reduction
After the multi-aspect feature learning process based on self-attention, features that contain intrinsic correlation information of multi-aspect SAR image sequence have been extracted with the size of k × m.Before using these features for classification, their dimension needs to be reduced.
MLP is the most commonly used feature dimensionality reduction method but one that lacks cross-channel information integration.Our proposed method uses a 1 × 1 convolution [35] for dimensionality reduction, which uses the 1 × 1 convolution kernel and adjusts the feature dimension by the number of convolution kernels.The 1 × 1 convolution realizes the combination of information between channels, thus reducing the loss of information during dimensionality reduction.At the same time, compared with MLP, the 1 × 1 convolution reduces the number of parameters.
The input of the 1 × 1 convolutional layer is the output of the transformer encoder Z N , and the output size is k × C, where C is the number of classes.Therefore, the number of convolution kernels of the 1 × 1 convolution layer is C. Before the convolution operation, dimension transformation is performed from Z N to Z N r ∈ R k×1×m .Suppose the output of dimensionality reduction is Z c , W 1×1 is the convolution kernel of the 1 × 1 convolutional layer and b 1×1 is the bias.The dimensionality reduction process is described as: After dimensionality reduction, dimensional transformation is performed on Z c ∈ R k×1×C to obtain Z ∈ R k×C for subsequent classification.

Classification
The classification process uses the softmax classifier.After dimensionality reduction, Z is averaged along the sequence dimension to get Zmean = [ z1 , • • • , zC ] T .Finally, the softmax operation is appiled upon Ẑmean to get the probability output

Training Process 2.7.1. Pre-Train of CAE and Layer Transfer
CAE is an unsupervised learning method.As described in Section 2.3, taking the single-aspect SAR images as input, after the forward propagation, reconstructed images will be obtained.Network optimization of CAE is achieved by minimizing the mean square error (MSE) between the reconstructed image and the original image; that is, using the MSE loss function.Suppose the total number of samples is N s .x is the input image, and x is the reconstructed image.The MSE loss function is defined as In this work, backpropagation (BP) is selected to minimize the loss function and optimize the CAE parameters.The aim of BP is to calculate the gradient and update the network parameters to achieve the global minimum of the loss function.The process of BP is to calculate the error of each parameter from the output layer to the input layer through the chain rule, which is related to the partial derivative of the loss function relative to the trainable parameters, and then the parameters are updated according to the gradient.When the network converges, it means that the encoder of CAE can extract enough information to reconstruct the single-aspect SAR image, which can prove the effectiveness of feature extraction.
After the CAE network converges, the parameters of the encoder are saved for subsequent layer transfer.The specific layer transfer operation is first when initializing network parameters; the CNN for feature extraction loads the parameters of the trained encoder of CAE.Next, in the process of network training, the parameters of each layer of the CNN can be frozen, which means that they will remain unchanged during the training process or continue to be optimized; that is, fine-tuning.It should be noted that only the first two layers of the CNN are frozen, and the last layer of the CNN is fine-tuned along with the overall network in our method.This is because the pre-training of CAE is carried out for single-aspect SAR images.Therefore, in order to effectively extract sufficient internal correlation information to support multi-aspect feature learning, it is necessary to fine-tune during the whole network training.At the same time, the parameters of the first two layers are frozen to maintain the effective extraction of the main features of each single-aspect image to ensure the noise resistance of the network.

Training of Overall Network
Taking the multi-aspect SAR image sequence as input, after the forward propagation, the predicted results given by the proposed network will be obtained.Network optimization is achieved by minimizing the cross entropy between data labels and the predicted result given by the network; that is, using the cross entropy loss function.Suppose the total number of the sequence samples is N m .For the ith sample, let T represent the probability output predicted by the network.When the sample belongs to the jth class, z i j = 1 and z i k = 0 (k = j).Then the cross entropy loss function can be written as: The proposed method also minimizes the loss function and optimizes the network parameters by BP, which is the same as most SAR ATR methods.The parameter updating in the BP process is affected by the learning rate, which determines how much the model parameters are adjusted according to the gradient in each parameter update.In the early stage of network training, there is a large difference between the network parameters and the optimal solution.Thus the gradient descent can be carried out faster with a larger learning rate.However, in the later stage of network training, gradually reducing the value of the learning rate will help the convergence of the network and make the network's approach to the optimal solution easier.Therefore, in this work, the learning rate is decreased by piecewise constant decay.

Experiments and Results
To verify the effectiveness of our proposed method, first, the network architecture setup is specified, and then the multi-aspect SAR image sequences are constructed using the MSTAR dataset under the standard operating condition (SOC) and extended operating condition (EOC), respectively.Finally, the performance of the proposed method has been extensively assessed by conducting experiments under different conditions.

Network Architecture Setup
In the experiment, the network instances were deployed whose input can be singleaspect images or multi-aspect sequences to comprehensively evaluate the recognition performance of the network.In this instance, the size of the input SAR image is 64 × 64.The encoder of CAE includes three convolution-maxpooling layers, which obtain 64, 64 and 256 feature maps, respectively.In each layer, the convolution operation with kernel size 7 × 7 and stride size 2 × 2 is followed by the max-pooling operation with kernel size 3 × 3 and stride size 2 × 2. The decoder of CAE includes three unpooling-deconvolution layers with the number of channels 64, 64 and 1, respectively.In each layer, the unpooling operation with kernel size 3 × 3 and stride size 2 × 2 is followed by the deconvolution operation with kernel size 7 × 7 and stride size 2 × 2. In the transformer encoder with 6 layers, MSA has 4 heads and the number of neurons of the 2 fully connected sublayers is 512 and 256 in turn.
Our proposed network is implemented by the deep learning toolbox Pytorch 1.9.1.All the experiments are conducted on a PC with an Intel Core i7-9750H CPU at 2.60 GHz, 16.0 G RAM, and a NVIDIA GeForce RTX 2060 GPU.The learning rate is 0.00001 when training CAE and starts from 0.001 with decay rate 0.9 every 30 epochs when training the whole network.The mini-batch size is set to 32, and the probability of dropout is 0.1.

Dataset
The MSTAR dataset [36], which was jointly released by the U.S. Defense Advanced Research Projects Agency (DARPA) and the U.S. Air Force Research Laboratory (AFRL), consists of high-quality SAR image data collected from ten stationary military vehicles (i.e., rocket launcher: 2S1; tank: T72 and T62; bulldozer: D7; armored personnel carrier: BMP2, BRDM2, BTR60 and BTR70; air defense unit: ZSU234; truck: ZIL131 ) through the X-band high-resolution Spotlight SAR by Sandia National Laboratory between 1996 and 1997.All images in the MSTAR dataset have a resolution of 0.3 × 0.3 m with HH polarization.The azimuth aspect range of imaging each target covers 0 • ∼360 • with an interval of 5 • ∼6 • .The optical images of ten targets and their corresponding SAR images are illustrated in Figure 6.
class.The rows of the confusion matrix correspond to the actual class of the target, and the columns show the class predicted by the network.From Tables 3-5, it can be observed that the recognition rate of our proposed method with 2, 3 and 4-aspect SAR image input sequences are all higher than 99.00% under SOC in the ten-class problem.Compared with the recognition rate shown in Table 2, it is proven that the multi-aspect SAR image sequence contains more recognition information than the single-aspect SAR image.In addition, from the improvement of the recognition rate in Tables 3-5 from 99.35%, 99.46% to 99.90%, it can be concluded that the self-attention process of our proposed method can effectively extract more internal correlation information of multi-aspect SAR images, so as to improve the recognition rates with the increase in the sequence length of multi-aspect SAR image sequence samples.

Results under EOC
Compared with SOC, the experiment under EOC is more difficult for target recognition due to the structure difference between the training set and testing set, which is often used to verify the robustness of the target recognition network.The experiments under EOC mainly include two experimental schemes, configuration variation (EOC-C) and version variation (EOC-V).
According to the original definition, EOC-V refers to targets of the same class that were built to different blueprints, while EOC-C refers to targets that were built to the same blueprints but had different post-production equipment added.The training sets under EOC-C and EOC-V are the same, which consist of four classes of targets (BMP2, BRDM2, BTR70 and T72), and the depression angle is 17 • .The testing set under EOC-C consists of images of two classes of targets (BMP2 and T72) with seven different configuration variations acquired at both 17 • and 15 • depression angles, and the testing set under EOC-V consists of images of T72 with five different version variations acquired at both 17 • and 15 • depression angles.The training and testing samples for the experiment under EOC-C and EOC-V are listed in Tables 6-8.

Class Type Depression Angle
Image Samples

Class Type Depression Angle
Image Samples

4-Aspect Sequences
T72/A32 The confusion matrices of experiments under EOC-C and EOC-V with single-aspect input images, 2, 3 and 4-aspect input sequences are summarized in Tables 9 and 10, respectively.
Table 9 shows the superior recognition performance of the proposed network in identifying BMP2 and T72 targets with configuration differences.The recognition rates of the proposed method reach 96.91%, 97.66% and 98.50% with 2, 3 and 4-aspect input sequences, respectively, which are all higher than 94.65% for the single-aspect input image.It can prove that, under EOC-C, the network can still learn more recognition information from multi-aspect images through self-attention so as to obtain better recognition performance.Table 10 shows the excellent performance of the proposed network in identifying T72 targets with version differences.With the increase in the input sequence length, the recognition rate of the network rises as well, from 98.12% for single-aspect input to 99.78% for four-aspect input.The above experiments indicate that the proposed network can achieve a high recognition rate when tested under different operating conditions, which confirms the application value of the proposed network in actual SAR ATR tasks.

Recognition Performance Comparison
In this section, our proposed network is compared with six other methods, i.e., joint sparse representation (JSR) [37], sparse representation-based classification (SRC) [38], data fusion [39], multiview deep convolutional neural network (MVDCNN) [28], bidirectional convolutional-recurrent network (BCRN) [27] and multiview deep feature learning network (MVDFLN) [29], which have been widely cited or recently published in SAR ATR.Among them, the first three are classical multi-aspect SAR ATR methods.JSR and SRC are two classical methods based on sparse representation, and data fusion refers to the fusion of multi-aspect features based on principal component analysis (PCA) and discrete wavelet transform (DWT).The last three are all deep learning multi-aspect SAR ATR methods.
Here, first, the recognition performance under SOC and EOC is compared between these methods.The recognition rates for each method under SOC and EOC are listed in Table 11.It should be noted that in order to objectively evaluate the performance of the method, it should be ensured that the datasets are as much the same as possible.Among the six methods, the original BCRN uses the image sequences with a sequence length of 15 as the input, which contains more identification information and requires a larger computational burden.That is, the original BCRN is difficult to compare with the other methods with an input sequence length of 3 or 4. Therefore, BCRN is implemented, and the results in Table 11 are obtained using the same four-aspect training and testing sequences as our method.
From Table 11, it is obvious that our method has a higher recognition rate than the other six methods in multi-aspect SAR ATR tasks, which proves that the combination of CNN and self-attention can learn the recognition information more effectively in multi-aspect SAR target recognition.Then, as shown in Table 12, compared with BCRN, our method greatly reduces the model size; that is, it greatly reduces the number of parameters and is in the same order of magnitude as MVDCNN.Considering the network structure of MVDCNN, when the sequence length increases, the network depth and the number of parallel branches will increase correspondingly.On the contrary, our method does not change the network structure when the sequence length increases, so it is more flexible, and the number of parameters increases slowly with the sequence length.As for the FLOPs, which represent the speed of network reasoning, it can be seen that our method still needs to be optimized.It is speculated that the amount of floating point operations mainly comes from the large convolution kernels for feature extraction and matrix operations for self-attention.

Discussion
For further discussion, the experiments on the network structure of feature extraction are carried out first.In order to obtain 1-D feature vectors, when a smaller convolution kernel is selected, the number of layers of CNN will increase accordingly.The recognition rates compared between the 6-layer CNN for feature extraction with the convolution kernel size of 3 × 3 and 3-layer CNN with the kernel size of 7 × 7 are shown in Table 13.From the results under EOC, it can be seen that the recognition performance of the larger convolution kernel network is better.Such a result is obtained because the larger convolution kernel can better extract the global information in raw images, which is beneficial to self-attention to learn common features in image sequences as a basis for classification.On the contrary, small convolution kernels are more concentrated on local information, which varies greatly from different angles.When the number of transformer encoder layers is different, the recognition performance of the whole network will also be different.The recognition accuracy under different operating conditions with different transformer encoder layers is shown in Table 14.When the transformer encoder has more layers, it means that the self-attention calculation has been carried out more times, so it is possible to mine more correlation information, which is also confirmed by the higher recognition accuracy achieved in the experiment.Next, in order to verify the recognition ability of the method with few samples, the training sequence is downsampled and the recognition rates of BCRN and MVDCNN are compared with our method when the number of training sequences is 50%, 25%, 10%, 5% and 2% of the original under SOC.As shown in Table 15, the recognition accuracy only decreases by 5% when the number of training sequence samples decreases to 2%, which is much less than 21.8% for BCRN and 11.03% for MVDCNN.As is known to all, the transformer needs a lot of data for training, which is mainly due to the lack of prior experience contained in the convolution structure, such as translation invariance and locality [30].In our proposed method, we use CNN to extract features first so as to make up for the lost local information.Therefore, the network also has excellent performance in the case of few samples.Finally, considering that in actual SAR ATR tasks, SAR images often contain noise, which has a great impact on the performance of SAR ATR because of the sensitivity of SAR images to noise, experiments are carried out to test the anti-noise performance of the network under SOC.As shown in Figure 7, the output of input image reconstruction by convergent CAE can reflect the main characteristics of the target in the input image but blur some other details.This indicates that when the image contains noise, the feature extraction network can filter the noise and extract the main features of the target.This is proven by the test image with noise with variance from 0.01 to 0.05 and the results of its convergent CAE reconstruction shown in Figure 8. Table 16 shows the recognition rates of the methods to be compared when the variance of noise increases from 0.01 to 0.05.Obviously, after pre-training, our method has excellent anti-noise ability.When the input image is seriously polluted by noise, the recognition rate of BCRN and the network without CAE is low, but the proposed method with CAE still maintains a high recognition rate, which shows that CAE plays an important role in anti-noise.In addition, to further verify the effectiveness of the pre-trained CAE, as shown in Table 17, CAE is replaced by other structures for experimental comparison.DS-Net [40] in the table is a feature extraction structure composed of dense connection and separable convolution.The experimental results show that compared with other structures, the proposed method with CAE does show better recognition performance under a variety of complex conditions, which proves the advantage of CAE in extracting major features.

Advantages
From the experiments in Section 3, it can be seen that compared with the existing methods, the proposed method has achieved higher recognition accuracy under both SOC and EOC.This proves the feasibility of self-attention under various complex conditions in multi-aspect SAR target recognition.
The experiment carried out with few samples in Section 3 shows that the proposed method can still achieve a higher recognition rate than other methods.It proves the advantages of the whole method in the case of small datasets, which make the proposed method more practical, considering the high cost of radar image acquisition.
The anti-noise experimental results in Section 3 verify the advantages of the proposed method.After pre-training, the anti-noise ability of the whole method is greatly enhanced, and a high recognition rate is obtained on the testing samples containing noise.Due to the characteristics of the coherent imaging system, the radar images almost certainly contain noise, so the excellent anti-noise performance makes the method more valuable for practical application.

Future Work
The future subsequent research will mainly focus on two directions.One is to reduce floating-point operands and improve reasoning speed, which is the main disadvantage of our proposed method.In order to achieve the goal, the two main structures of the network will be optimized, namely, the CNN structure whose FLOPs can be reduced by applying separable convolution [41] and the transformer encoder that can be accelerated by pruning [42] or improving the structure [43].
The other is to further improve the recognition performance of multi-aspect SAR ATR, for which some attempts will be made to explore the combination of deep learning and the physical characteristics brought by the special imaging mechanism of SAR.The method proposed in this paper only uses the amplitude information of SAR images for network training and testing.However, the complex SAR images contain more identification information, which can be used to train deeper networks or make up for the lack of information in small datasets.It is perhaps to extract and fuse the amplitude and phase information of complex SAR images with reference to Deep SAR-Net [44] or expand the convolutional neural network to the complex domain with reference to CV-CNN [45] so as to make full use of the information contained in complex SAR images.

Conclusions
In this paper, a multi-aspect SAR target recognition network based on CNN and self-attention has been presented.The overall network consists of the feature extraction layers, the multi-aspect feature learning layers, feature dimensionality reduction and classification.Specifically, after pre-training by single-aspect SAR images, the encoder of CAE is transferred for feature extraction of images in the multi-aspect SAR image sequence separately, and then the internal correlation between images in the sequence is learned by self-attention.Finally, after dimensionality reduction by 1 × 1 convolution, the feature vectors of images in the sequence are averaged and fed into the softmax layer for classification.Experiments on the MSTAR dataset show that the proposed method can obtain satisfactory recognition performance under a variety of complex conditions.At the same time, the recognition rate in the case of few training samples can still be close to that of the complete training set.Besides that, the anti-noise ability of the whole network is greatly enhanced after pre-training.

Figure 1 .
Figure 1.Basic architecture of the proposed multi-aspect SAR target recognition method.

Figure 2 .
Figure 2. A simple geometric model for multi-aspect SAR imaging.

Figure 4 .
Figure 4.The structure of CAE.

Figure 5 .
Figure 5.The structure of the transformer encoder.

Figure 7 .
Figure 7.Comparison between raw image and image reconstructed by CAE.

Figure 8 .
Figure 8.Comparison between raw image with noise with different variances and image reconstructed by CAE.
structure based on self-attention.

Table 2 .
Confusion matrix of a single-aspect experiment under SOC.

Table 3 .
Confusion matrix of a two-aspect experiment under SOC.

Table 4 .
Confusion matrix of a three-aspect experiment under SOC.

Table 5 .
Confusion matrix of a four-aspect experiment under SOC.

Table 6 .
The training set of experiment under EOC-C and EOC-V.

Table 7 .
The testing set of experiment under EOC-C.

Table 8 .
The testing set of experiment under EOC-V.

Table 9 .
Confusion matrix of experiments under EOC-C.

Table 10 .
Confusion matrix of experiments under EOC-V.

Table 11 .
Performance comparison between our method and other methods.

Table 12 .
Model size comparison with four-aspect input sequences.

Table 13 .
Performance comparison between different kernel sizes.

Table 14 .
Performance comparison between different layers of the transformer encoder.

Table 15 .
Recognition performance comparison with few samples.

Table 16 .
Comparison of anti-noise performance.

Table 17 .
Comparison between different structures for feature extraction.