FEF-Net: A Deep Learning Approach to Multiview SAR Image Target Recognition

Abstract: Synthetic aperture radar (SAR) is an advanced microwave imaging system of great importance. The recognition of real-world targets from SAR images, i.e., automatic target recognition (ATR), is an attractive but challenging issue. The majority of existing SAR ATR methods are designed for single-view SAR images. However, multiview SAR images contain more abundant classification information than single-view SAR images, which benefits automatic target classification and recognition. This paper proposes an end-to-end deep feature extraction and fusion network (FEF-Net) that can effectively exploit recognition information from multiview SAR images and can boost the target recognition performance. The proposed FEF-Net is based on a multiple-input network structure with some distinct and useful learning modules, such as deformable convolution and squeeze-and-excitation (SE). Multiview recognition information can be effectively extracted and fused with these modules. Therefore, excellent multiview SAR target recognition performance can be achieved by the proposed FEF-Net. The superiority of the proposed FEF-Net was validated based on experiments with the moving and stationary target acquisition and recognition (MSTAR) dataset.


Introduction
Synthetic aperture radar (SAR) [1][2][3] is an important modern microwave sensor system, with powerful capabilities, including high-resolution imaging, day-and-night use, and all-weather operation. Those qualities make it superior to other sensors, such as infrared and optical sensors, for some applications. With advances in SAR signal processing and imaging performance, people have been paying more attention to classifying or recognizing targets of interest from SAR images. Therefore, automatic target classification or recognition (ATR) has become an attractive but challenging problem in SAR research and application areas [4][5][6][7][8].
Generally, an SAR ATR system discovers regions of interest containing potential targets from the SAR image [9][10][11] and efficiently assigns those targets reliable and intelligent category labels [12]. Over the years, researchers focused on this field have proposed many novel SAR ATR approaches [7,8]. Many SAR ATR methods or algorithms have also been employed in the past few decades, such as support vector machine (SVM) [13], conditional Gaussian model (CGM) [14], adaptive boosting (AdaBoost) [15], sparse representation [16], and iterative graph thickening (IGT) [17].
The SAR ATR methods mentioned above generally perform well in applications. Nevertheless, many of these methods must extract handcrafted features from SAR targets, so sophisticated algorithms for ATR must be predesigned. With the rise of machine learning theory in recent years, ATR methods based on deep learning have quickly advanced [18,19]. They can spontaneously learn hierarchical features from the input data and can achieve remarkable performance in complex ATR tasks. Many novel works have shown deep neural networks to be powerful tools for SAR ATR [20][21][22][23].
Most SAR ATR methods focus on single-input SAR images. However, modern SAR sensors can obtain high-resolution target images from different views in practice. Researchers have indicated that SAR ATR benefits from multiview measurements, since multiview SAR images contain more abundant classification information than single-view images [24]. Thus, studies of multiview SAR ATR methods have started in recent years, achieving good recognition results [25][26][27][28].
Although multiview SAR images carry more classification information and show great potential for ATR, two important problems must be solved to improve ATR performance. SAR target images are sensitive to their imaging views, and the same target often exhibits geometric variations across multiview SAR images, such as orientation and shape variations. Hence, the first challenge is to effectively extract the inherent classification features from each view of the SAR image while accommodating these geometric variations. The second challenge is to further exploit the multiview classification information by employing an effective fusion mechanism that integrates the features extracted from multiple views. Therefore, a valid multiview SAR ATR approach should be able to extract the inherent classification features from each view and to fuse these features effectively. Meanwhile, we hope that the processes of feature extraction and fusion can be carried out spontaneously, without much manual intervention. Hence, an end-to-end deep learning network with feature extraction and fusion modules is a natural choice: it can extract and fuse useful features of the multiview SAR images through network construction and sample training, thereby achieving superior ATR results.
This paper proposes an end-to-end deep feature extraction and fusion network (FEF-Net) to address these two problems and to improve the multiview SAR ATR performance. Its network architecture is based on a multiple-input topology for multiview SAR ATR. Some specific modules, such as deformable convolution and squeeze-and-excitation (SE), are embedded in this network. The deformable convolutional layer can extract inherent classification information and can accommodate the geometric variations of the SAR target well, whereas the SE module fuses the features from the multiview SAR images together. Thus, the two problems in multiview SAR ATR, classification feature extraction and fusion of the input multiview SAR images, can be effectively resolved with FEF-Net. Therefore, the proposed network can exploit classification information from multiview SAR images and can achieve satisfactory ATR performance.
The main contributions of FEF-Net compared with existing SAR ATR methods are the following: (1) We designed a new deep neural network based on a multiple-input topological structure that can significantly improve SAR ATR performance; (2) the helpful classification features of the input multiview SAR images can be extracted and fused thoroughly through the implantation of distinct network modules; and (3) the proposed FEF-Net achieves excellent recognition performance compared with the available SAR ATR methods.
The remainder of this paper is organized as follows: Multiview SAR ATR is modeled and formulated in Section 2. Section 2 also details the proposed FEF-Net for multiview SAR ATR. The ATR performance of FEF-Net is evaluated in Section 3, and Section 4 provides the conclusion.

Problem Formulation

Figure 1 shows the multiview SAR ATR geometric model of a ground target. In a practical multiview SAR ATR pattern, the SAR system receives its returns and obtains multiview images of the ground target from different aspects and depressions. For simplicity, the depression is set as a constant here. When the view interval θ and view number k are provided, SAR can collect the ground target images in multiview imaging mode. Using these multiview SAR images, more classification information can be obtained than from the single-view pattern. Hence, the multiview SAR target recognition problem requires a valid multiple-input classifier to determine the most probable class label for the target of interest, which can be formulated as follows:

    y_i = f(x_#1, x_#2, ..., x_#k), y_i ∈ Y,    (1)

In Equation (1), f is the classifier with multiview SAR images as the input, y_i is the assigned target label, Y is the class label set, and x_#k is the kth view of the SAR image target with its aspect angle ϕ(x_#k), which satisfies the following conditions:

    0° ≤ ϕ(x_#1) < ϕ(x_#2) < ... < ϕ(x_#k) < 360°, ϕ(x_#i) − ϕ(x_#(i−1)) = θ, i = 2, ..., k.
FEF-Net was designed to solve the ATR problem with multiview SAR images based on this formulated model. The FEF-Net method is explained in the following subsection.

Proposed Method
This subsection proposes FEF-Net to solve these two difficulties and to improve the performance of multiview SAR ATR. The architecture of the FEF-Net for multiview SAR ATR is provided, along with details on specific modules of the network.

Network Framework
The basic framework of the FEF-Net instance with three inputs is shown in Figure 2. This deep neural network is based on a multiple-input topological structure: the branches are merged at a certain layer, extracting and fusing the classification information from the multiview SAR images.

As mentioned in Section 1, the key point in multiview SAR ATR is the effective extraction and fusion of classification features. Hence, the proposed FEF-Net begins with a deformable convolutional layer in each branch to extract the inherent classification feature from each view and to accommodate the geometric variations in the SAR target. Alternating pooling and convolutional layers follow within each branch to further extract the features from each view and to reduce the feature dimensions. After the feature extraction from each view, the three branch feature maps are concatenated. The merged feature maps are linked to an SE network module to further recalibrate the feature responses and to fuse the concatenated features of the multiview SAR images. Finally, the FEF-Net instance ends with a fully connected layer, and the softmax classifier performs the recognition decision.
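The branch-and-merge topology described above can be illustrated with a minimal NumPy sketch. The shapes are hypothetical, and a plain 2×2 max-pooling step stands in for the learned deformable-conv/pool stack of a real branch; only the per-view processing and channel-axis concatenation are what this sketch demonstrates.

```python
import numpy as np

def branch(x):
    # Placeholder for one FEF-Net branch: a single 2x2 max-pooling step
    # stands in for the deformable-conv/pool stack of a real branch.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Three views of the same target (hypothetical 8x8 single-channel images).
views = [np.random.default_rng(i).random((8, 8)) for i in range(3)]

# Each branch processes its own view independently...
feats = [branch(v) for v in views]          # each branch yields a 4x4 feature map

# ...and the branch outputs are concatenated along the channel axis,
# forming the merged feature maps that feed the SE module.
merged = np.stack(feats, axis=0)            # shape (3, 4, 4)
```

The merged tensor has one channel group per view, so later layers can recalibrate and fuse information across views rather than within a single image.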
From the basic architecture of the instance, we can see that the proposed FEF-Net can effectively extract and fuse the classification information from the input multiview SAR images, which benefits multiview SAR ATR. Specific modules in the proposed network are provided in the following discussion.

Deformable Convolution
The convolution operation is inspired by the process of the biological neuron in the visual cortex [29]. Supposing that the grid Ω represents the receptive field size and dilation, the convolution operation can be written as follows:

    z(p) = ∑_{p_n ∈ Ω} w(p_n) · a(p + p_n),    (2)

where z(p) represents the intensity of each location p on the output feature map z, w denotes the convolution kernel, p_n enumerates the locations in Ω, and a is the input feature map. An activation function, such as the rectified linear unit (ReLU), follows the convolution to enhance the nonlinear representation of the network.

The diagram for deformable convolution is shown in Figure 3a, which can be formulated as follows:

    z(p) = ∑_{p_n ∈ Ω} w(p_n) · a(p + p_n + ∆p_n),    (3)

where ∆p_n represents the additional offsets learned using a convolutional operation over the same input feature map. Thus, the sampling of the deformable convolution is from the irregular and offset locations p_n + ∆p_n. Since the offset ∆p_n is typically fractional, the deformable convolution should be performed based on interpolation:

    a(p̄) = ∑_q B(q, p̄) · a(q), p̄ = p + p_n + ∆p_n,    (4)

where q enumerates all integral spatial locations in a, and B(·, ·) denotes the interpolation kernel. By augmenting the spatial sampling locations with additional offsets, the deformable convolution can enhance the modeling of targets' geometric variations and can effectively extract the inherent features of the SAR images.
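The fractional-offset sampling described above can be sketched in NumPy. This is a sketch of the computation at a single output location only, assuming a 3×3 kernel and a bilinear interpolation kernel B; a real layer would also learn the offsets and vectorize over all locations and channels.

```python
import numpy as np

def bilinear(a, y, x):
    # Interpolated value a(p_bar): sum over the 4 integer neighbours q,
    # weighted by the separable bilinear kernel B(q, p_bar).
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for qy, qx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
        if 0 <= qy < a.shape[0] and 0 <= qx < a.shape[1]:
            val += a[qy, qx] * max(0.0, 1 - abs(y - qy)) * max(0.0, 1 - abs(x - qx))
    return val

def deform_conv_at(a, w, p, offsets):
    # z(p) = sum_n w(p_n) * a(p + p_n + dp_n) for a 3x3 sampling grid.
    # `offsets` is a list of 9 (dy, dx) pairs, one learned offset per tap.
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    z = 0.0
    for pn, dpn, wn in zip(grid, offsets, w.ravel()):
        z += wn * bilinear(a, p[0] + pn[0] + dpn[0], p[1] + pn[1] + dpn[1])
    return z
```

With all offsets set to zero, the computation reduces to the ordinary convolution of Equation (2), which is a useful sanity check for any deformable-convolution implementation.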

SE Module
The classification features extracted from multiple views are different. Thus, we employ an SE module [30] to adaptively recalibrate and fuse the concatenated feature responses from the multiview SAR images. Figure 3b shows a basic block diagram of the SE module. Let A = [a_1, a_2, ..., a_C] denote the input feature maps of the SE module, with a_l ∈ R^{H×W}, l = 1, 2, ..., C. The SE module squeezes the global spatial information of the input into a channel descriptor d ∈ R^C with a global average pooling operation:

    d_l = (1 / (H × W)) ∑_{i=1}^{H} ∑_{j=1}^{W} a_l(i, j).    (5)

A fully connected layer is employed to exploit the aggregated information from the squeeze step and to adaptively learn the recalibration, formulated as follows:

    s = σ(W d + b),    (6)

where s = [s_1, s_2, ..., s_C], W and b are the weight matrix and bias of the fully connected layer, which are trainable parameters computed by the network training method [31], and σ(·) denotes the sigmoid activation.
The final fused feature Ã = [ã_1, ã_2, ..., ã_C] of the SE module is obtained by recalibrating the input feature maps A as follows:

    ã_l = s_l · a_l, l = 1, 2, ..., C.    (7)

Through dynamic recalibration and fusion of the features from different views, the SE module can effectively help improve the feature discriminability and ATR performance of FEF-Net.
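The squeeze, excitation, and recalibration steps described above amount to only a few lines of NumPy. This is a minimal sketch with hypothetical shapes and a single fully connected layer, as in the formulation above (the original SE block uses a two-layer bottleneck; the paper's text describes one layer, which is what is sketched here).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_module(A, W, b):
    # A: (C, H, W) concatenated feature maps from the branches.
    d = A.mean(axis=(1, 2))          # squeeze: global average pooling, d in R^C
    s = sigmoid(W @ d + b)           # excite: fully connected layer + sigmoid
    return s[:, None, None] * A      # recalibrate: per-channel rescaling
```

Channels that the learned weights deem informative for recognition receive gates s_l near 1, while less useful channels are suppressed, which is how the module fuses the per-view features.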

Other Modules
Other helpful modules or operations, such as pooling, dropout, and softmax, are also necessary for FEF-Net. As an important module in FEF-Net, the pooling layer can extract the prominent features from the input feature map while reducing its dimensions. Here, we use a max-pooling operation in the proposed neural network.
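A non-overlapping max-pooling step (window size equal to stride, matching the WS/SS hyper-parameters listed for the instance in Table 2) can be sketched in NumPy as follows; the shapes are illustrative.

```python
import numpy as np

def max_pool2d(x, ws=2):
    # Non-overlapping max pooling: window size == stride == ws.
    # Keeps the maximum (most prominent) activation in each ws x ws window.
    h, w = x.shape
    th, tw = h - h % ws, w - w % ws          # trim edges that do not fill a window
    return x[:th, :tw].reshape(th // ws, ws, tw // ws, ws).max(axis=(1, 3))
```

Each output value is the strongest response in its window, so the feature map shrinks by a factor of ws per spatial dimension while the dominant scattering features survive.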
Dropout is an operation widely used to reduce the overfitting of the neural network. It enhances the robustness of the network's learning ability with random active neuron combinations. Dropout is used after the last convolutional layer in FEF-Net to increase the generalization.
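A common formulation of dropout is the "inverted" variant sketched below in NumPy (the paper does not state which variant it uses, so this is an assumption): survivors are rescaled by 1/(1 − p) at training time so that no rescaling is needed at test time.

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=None):
    # Inverted dropout: zero each activation with probability p during
    # training and rescale survivors by 1/(1-p); identity at test time.
    if not train:
        return x
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

The random masks force the network to learn redundant, robust feature combinations rather than relying on any single neuron.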
After all of the features of the multiview SAR images are extracted and fused, the feature maps are flattened and connected to a fully connected layer. Finally, the softmax classifier [32] outputs the predicted class of the target, as follows:

    P(y = j | z^(L)) = exp(z_j^(L)) / ∑_{k=1}^{K} exp(z_k^(L)), j = 1, 2, ..., K,    (8)

where z^(L) is the input feature vector to the softmax classifier, and K denotes the class number.
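The softmax computation above can be sketched in NumPy; subtracting the maximum score before exponentiation is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector z^(L).
    e = np.exp(z - z.max())
    return e / e.sum()
```

The output is a probability distribution over the K target classes, and the predicted label is simply the index of its largest entry.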

Loss Function and Network Training
The loss function used in FEF-Net is the cross-entropy loss [33]:

    L = − ∑_{j=1}^{K} t_j log P(y = j | z^(L)),    (9)

where t_j is the one-hot ground-truth label of the training sample. The training process of FEF-Net is similar to that of a standard SAR ATR neural network, although FEF-Net has a more complex network structure. The back-propagation algorithm can be used to calculate the gradients and update the network parameters to effectively train the network.
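For a one-hot label, the cross-entropy loss reduces to the negative log of the softmax probability assigned to the true class, which a short NumPy sketch makes concrete (illustrative helper names, single-sample case):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, label):
    # For a one-hot target, the loss is -log of the probability
    # the softmax assigns to the true class index `label`.
    return -np.log(softmax(z)[label])
```

With uniform scores over K classes the loss is log(K), and it decreases toward zero as the score of the true class dominates, which is the behaviour gradient descent exploits during training.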

Dataset
We selected raw SAR images from the moving and stationary target acquisition and recognition (MSTAR) dataset [34] for our experiments to assess the ATR performance of the proposed FEF-Net. The MSTAR program collected a significant quantity of SAR images to evaluate the performances of advanced SAR ATR methods. The MSTAR dataset includes a large number of 0.3 m × 0.3 m resolution SAR images processed with an X-band spotlight SAR sensor. Ten classes of targets were used in this experiment. The optical images and the corresponding SAR images of these targets are shown in Figure 4. Only part of the raw SAR images from these ten target classes, with a depression angle of 17°, were selected to generate multiview SAR image samples for training the network. The azimuth angles of the selected images for each class ranged from 0° to 360°. All raw images in the dataset with a depression angle of 15° were used to generate the testing multiview SAR images. The usage of raw SAR images is listed in Table 1. Additionally, the gray enhancement method with a power function [35] was employed to enhance the scattering information of the SAR target images. The view interval θ was set to 45°. The data augmentation method [27] was used to generate many multiview training samples from the selected raw SAR images. There were 48,764 multiview SAR image samples with a depression angle of 17° for deep network training. The testing samples were randomly selected from the multiview SAR images generated with a depression angle of 15°.

Network Configuration
The input SAR image size for the network instance was 80 × 80, and the dropout probability was set to 0.5 during the training phase. Table 2 lists the hyper-parameters of the FEF-Net instance in our experiment, which were determined by statistical validation and trials.

Table 2. FEF-Net instance configurations. Convolutional layers are represented as Conv., and their hyper-parameters are denoted as (number of feature maps)@(kernel size in convolution). "La_#b" represents the bth branch of the ath layer. "WS," "SS," and "NN" denote window size, stride size, and number of neurons in the network, respectively.

Recognition Results

Table 3 shows the recognition result of the proposed FEF-Net with a confusion matrix. The rows of the matrix are ground truths, and the columns are the predicted class labels. Each element in the confusion matrix denotes the recognition rate of FEF-Net for a specific target class. Table 3 shows that the recognition rate of the proposed FEF-Net was higher than 99.00% in the ten-class ATR problem. We can infer from these experimental results that the multiview SAR images of the same target contained large amounts of classification information. The proposed FEF-Net can effectively extract and fuse the classification features of the input multiview SAR images while using only a small amount of raw data to generate training samples. Therefore, it can exploit recognition information well from multiview SAR images and can achieve excellent ATR performance.

To visually show the classification capabilities of FEF-Net, some of the input multiview SAR samples and their output vectors in the fully connected layer were mapped into two-dimensional Euclidean space by the t-distributed stochastic neighbor embedding (t-SNE) algorithm [36], as shown in Figure 5. Figure 5a shows the input multiview SAR samples, and Figure 5b illustrates the corresponding outputs. We can observe that the visualization results of the original samples are mixed in Figure 5a and were difficult to classify.
After being processed by FEF-Net, the samples with the same class label became closer, whereas samples of different classes tended to end up far away from each other, as shown in Figure 5b. That allowed for easier classification and led to effective recognition results.

Although the proposed network instance in Figure 2 has only three branches, the architecture of FEF-Net is flexible and can have a different number of input branches. We conducted a group of experiments to test the recognition performances of FEF-Net instances with different numbers of views. Similarly to the previous experiment, some of the raw SAR images were selected to generate multiview samples for training the networks, and the testing samples were randomly selected for recognition result evaluation. Table 4 shows the raw SAR images, generated training samples, and recognition results of FEF-Net instances with two, three, and four input views. We can see from the experimental results that the recognition rates of FEF-Nets with two, three, and four views were all higher than 98.00%, while using only a small amount of raw data for training sample generation. As the number of input views increased, the recognition rate rose as well, reaching more than 99.30%. These experimental results indicate the flexibility and potential applications of FEF-Net for SAR ATR tasks with different input views.
The recognition rate for each ATR method is shown in Figure 6. Although the recognition rates of all the methods were more than 92.00%, their performances were different. The comparisons indicate that FEF-Net had superior recognition performance compared with the other five SAR ATR methods, demonstrating the reasonability and validity of FEF-Net.

Figure 6. Recognition performances of various methods.

Conclusions
Inherent classification feature extraction from each view and multiview feature fusion are two important issues for improving the performance of multiview SAR ATR. We presented a novel ATR approach based on FEF-Net with multiview SAR images. FEF-Net was designed with a multiple-input topological structure, including specific modules, such as deformable convolution and SE, and it has the capability of learning useful classification information from multiview SAR images. Thus, the two key problems, classification feature extraction and fusion, are solved with the proposed FEF-Net. Extensive experiments on the MSTAR dataset were conducted, which showed that the proposed multiview FEF-Net can achieve excellent recognition performance. Its top recognition rate was more than 99.00% in a ten-class problem. Additionally, it achieved superior recognition performance compared with the existing SAR ATR methods.
The proposed method attained satisfactory recognition results in SAR ATR because of its effective extraction and fusion of classification features. Although we used helpful learning modules in our network for this study, other promising feature extraction and fusion methods may also work, such as the spatial transformer technique and self-attention mechanism. Hence, these alternative methods are important issues worth studying in subsequent research.

Conflicts of Interest:
The authors declare no conflict of interest.