Hierarchical Fusion of Convolutional Neural Networks and Attributed Scattering Centers with Application to Robust SAR ATR

Abstract: This paper proposes a synthetic aperture radar (SAR) automatic target recognition (ATR) method via hierarchical fusion of two classification schemes, i.e., convolutional neural networks (CNN) and attributed scattering center (ASC) matching. CNN works with notably high effectiveness under the standard operating condition (SOC). However, it can hardly cope with the various extended operating conditions (EOCs) that are not covered by the training samples. In contrast, ASC matching can handle many EOCs related to local variations of the target by building a one-to-one correspondence between two ASC sets. Therefore, it is promising that both the effectiveness and efficiency of the ATR method can be improved by combining the merits of the two classification schemes. The test sample is first classified by CNN, and a reliability level is calculated based on the outputs of CNN. Once there is a notably reliable decision, the whole recognition process terminates. Otherwise, the test sample is further identified by ASC matching. To evaluate the performance of the proposed method, extensive experiments are conducted on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset under SOC and various EOCs. The results demonstrate the superior effectiveness and robustness of the proposed method compared with several state-of-the-art SAR ATR methods.


Introduction
As a microwave sensor, synthetic aperture radar (SAR) has the capability to work under all-day and all-weather conditions, thus providing a powerful tool for battlefield surveillance in modern wars. A SAR system sends electromagnetic pulses from an airborne or spaceborne platform to the area of interest and records the returned signals [1,2]. The range resolution of SAR images is determined by the signal bandwidth. In order to achieve high cross-range resolution, SAR collects data from multiple observation points and focuses the received information coherently. Afterwards, the acquired signals are transformed into the image domain using imaging algorithms, e.g., based on the fast Fourier transform (FFT) [3]. However, the main drawback of SAR images is the presence of speckle, which visually degrades the appearance of the images [4]. As a result, it is difficult to interpret SAR images with high performance. As one of the key steps in SAR image interpretation, automatic target recognition (ATR) has been researched intensively since the 1990s [5]. An ensemble SAR ATR system generally involves three stages: target detection [6], target discrimination [7], and target recognition [5]. A large-scale SAR image is first processed by target detection to find the potential regions of interest (ROIs), which possibly contain the targets of interest; in this stage, the background clutter can be eliminated. Afterwards, target discrimination is performed to reject the false alarms in the ROIs, which are possibly caused by man-made obstacles. Finally, the selected ROIs are sent to the target recognition module to determine the target labels. In this study, we focus on the third stage of the SAR ATR system, i.e., the target recognition algorithms.
A typical SAR ATR algorithm generally involves two parts: feature extraction and a classifier. Feature extraction aims to find low-dimensional representations of the original SAR images while maintaining the discriminative information for distinguishing different targets. In the past decades, many handcrafted features have been used for SAR ATR, including geometrical features, projection features and scattering center features. The geometrical features depict the shape and physical size of the target, such as the binary target region [8,9], target outline [10,11], shadow [12,13], etc. In [8], a region matching scheme is proposed for SAR ATR, in which the binary target region of the test image is directly compared with the corresponding regions from the template set, and a similarity measure is designed based on the region residuals filtered by morphological operations. Park et al. construct 12 features based on the target outline, which are used for target recognition. The target shadow is also validated to be discriminative for SAR ATR in [10]. The projection features are obtained by projecting the original image onto some specially designed bases. Typical methods for extracting the projection features are principal component analysis (PCA) [14], linear discriminant analysis (LDA) [14] and other manifold learning methods [15][16][17]. Mishra applies PCA and LDA to feature extraction of SAR images and compares their performances on target recognition [14]. The neighborhood geometric center scaling embedding is proposed in [16] by exploiting the inner structure of the training samples, which is demonstrated to be effective for SAR ATR. The scattering center features reflect the electromagnetic scattering characteristics of the target, such as the attributed scattering centers (ASCs) [18,19].
ASCs describe the local structures of the target by several physically relevant parameters and have been demonstrated to be notably effective for SAR ATR, especially under the extended operating conditions (EOCs) [19][20][21][22][23][24][25]. In [21], an ASC-matching method based on Bayesian theory is proposed with application to target recognition. Ding et al. propose several ways to apply ASCs to SAR ATR, e.g., one-to-one ASC matching [22][23][24] and ASC-based target reconstruction [25]. Recently, the 3-D scattering center model-based SAR ATR methods have drawn the researchers' interest, where a 3-D scattering center model is established to describe the target's electromagnetic scattering for feature prediction [26,27]. In the classification stage, the extracted features are classified to determine the target type of the test sample. With the fast development of pattern recognition and machine learning techniques, many advanced classifiers have been successfully applied to SAR ATR, including adaptive boosting (AdaBoost) [28], discriminative graphical models [29], support vector machines (SVM) [30,31] and sparse representation-based classification (SRC) [32,33]. Specifically, for features without unified forms, e.g., the unordered scattering centers, a similarity or distance measure is often first defined for these features. Afterwards, the target type is determined based on the maximum similarity or minimum distance [19][20][21][22][23][24].
Recently, deep learning has been shown to provide a powerful classification scheme for image interpretation, i.e., convolutional neural networks (CNN). CNN considers feature extraction and classification in a unified framework. As validated in several studies [34][35][36], the deep features learned by the convolution operations tend to have better discrimination capability for distinguishing different classes of targets. However, it should be noted that the performance of CNN is closely related to the completeness and coverage of the training samples. In the case of SAR ATR, training samples are quite scarce due to the limited access to data resources [37,38]. Moreover, the operating conditions in SAR ATR are complicated: there are many EOCs in the real-world environment, including variations of the target itself, background environments, SAR sensors, etc., which can hardly be covered by the training samples [5]. As reported for several CNN-based SAR ATR methods [39][40][41][42], they can achieve notably high recognition accuracies under the standard operating condition (SOC). However, their performances degrade drastically under various EOCs, even with different types of data augmentation. With little prior information about the operating conditions of the test samples, it is hard to evaluate whether the decisions from CNN are reliable or not.
In this study, a SAR ATR method is proposed via hierarchical fusion of CNN and ASC matching. Each test sample is first classified by CNN. Based on the outputs of CNN, e.g., the pseudo posterior probabilities from the softmax, a reliability level is calculated to evaluate the reliability of the decision. A preset threshold is used to judge whether the decision should be adopted. When the decision is judged to be unreliable, the test sample is passed to the classifier based on ASC matching. ASCs are local descriptors with rich, physically relevant information, and it has been demonstrated that they can handle various EOCs with good performance [20][21][22][23][24]. The test samples that cannot be reliably classified by CNN are likely drawn from EOCs; therefore, ASC matching tends to reach more reliable decisions for these samples. In this study, a one-to-one correspondence between the ASC set from the test image and that from the corresponding template is built using the Hungarian algorithm [22,43]. Afterwards, a similarity measure is defined, which comprehensively considers the possible outliers. Finally, the target type of the test sample is decided to be the class with the maximum similarity. Therefore, the hierarchical fusion of CNN and ASC matching can enhance both the effectiveness and robustness of the ATR method. In addition, via the hierarchical fusion, the strict demand on a single classifier is relieved. Although CNN and ASC matching may not achieve very good performances individually, they can complement each other to achieve a much better result. The main advantages of the proposed method are as follows. First, the excellent performance of CNN for SOC recognition is inherited in the proposed method: when a reliable decision is obtained by CNN, no further classification by ASC matching is necessary. Second, the robustness of ASCs to various EOCs is maintained in the proposed method: by building a one-to-one correspondence between two ASC sets, the local variations of the target caused by EOCs can be sensed.
The remainder of this paper is organized as follows. Section 2 describes the basic theory of CNN and the architecture of our networks. In Section 3, the classification scheme based on ASC matching is introduced. The detailed implementation of the proposed target recognition method is explained in Section 4. Extensive experiments on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset are conducted in Section 5. Section 6 discusses the rationale of the proposed method and states some future directions. Conclusions are summarized in Section 7 based on the experimental results.

Basic Theory
Owing to the fast development of deep learning techniques, CNN has become the most prevalent tool for image interpretation [34][35][36]. CNN combines feature learning and classification in a unified framework, thus avoiding the design of hand-crafted features. In detail, the convolution layers learn hierarchical features via the convolutional operations. In the classification stage, a multilayer perceptron classifier is used for decision making.
In the convolutional layer, the input feature maps from the previous layer, O_m^(l−1) (m = 1, ..., M), are connected to all the output feature maps O_n^(l). Denote O_m^(l−1)(x, y) and O_n^(l)(x, y) as the units of the mth input feature map and the nth output feature map at the position (x, y), respectively; then each unit in the output feature map is calculated as

O_n^(l)(x, y) = σ( Σ_{m=1}^{M} Σ_{p,q} O_m^(l−1)(x + p, y + q) · k_nm^(l)(p, q) + b_n^(l) ),    (1)

where k_nm^(l)(p, q) denotes the convolutional kernel, σ(·) represents the nonlinear activation function, and b_n^(l) is the bias. After the convolution layer, a pooling operation is usually performed, which not only effectively reduces the computational load but also makes the networks robust to some nuisance conditions like translation, distortion, etc. Different types of pooling operations are used in CNNs by choosing either the average or the maximum in a preset window with the size of h × w. For example, the max pooling is defined as

O_n^(l)(x, y) = max_{0 ≤ p < h, 0 ≤ q < w} O_n^(l−1)(x · h + p, y · w + q).    (2)
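The max pooling operation described here can be sketched in a few lines of pure Python; this is a minimal illustration with a non-overlapping h × w window (the function name is ours, not from the paper):

```python
def max_pool(fm, h=2, w=2):
    """Non-overlapping h x w max pooling on a 2-D feature map
    given as a list of lists (stride equals the window size)."""
    H, W = len(fm), len(fm[0])
    return [[max(fm[i * h + p][j * w + q]
                 for p in range(h) for q in range(w))
             for j in range(W // w)]
            for i in range(H // h)]

fm = [[1, 3, 2, 0],
      [4, 2, 1, 5],
      [0, 1, 3, 2],
      [2, 6, 1, 1]]
print(max_pool(fm))  # [[4, 5], [6, 3]]
```

Because the windows do not overlap, each 2 × 2 pooling stage halves both spatial dimensions, which is exactly the behavior used in the network below.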
In the classification stage, the softmax nonlinearity is applied to the output layer to determine the target label. It will output the posterior probabilities over each class and the target label will be decided as the class with the maximum probability.

Architecture of the Proposed CNN
Actually, there is no consensus on how to design CNNs for the specific application of SAR ATR. In the previous works, several different kinds of CNNs have been applied to SAR ATR and they all achieved very good performances [39][40][41][42]. Based on these works, this paper designs the architecture of CNN as Figure 1, which is composed of three convolution layers, three max pooling layers, and two fully-connected layers. The convolution stride is fixed to 1 pixel with no spatial zero padding. After each convolution layer, a max pooling is performed with a kernel size of 2 × 2 and a stride of 2 pixels. The rectified linear units (ReLU) activation function is applied to every hidden convolution layer.
Specifically for the MSTAR dataset used in this study, all the images are first cropped to 88 × 88 patches around the centroid. The detailed layout of our network is displayed in Table 1. The input image is filtered by 16 convolution filters with the size of 5 × 5 in the first convolution layer, producing 16 feature maps with the size of 84 × 84. After the first pooling layer, their sizes become 42 × 42. After the second convolution layer, there are 32 feature maps with the size of 38 × 38, which become 19 × 19 after pooling. After the third convolution layer and pooling layer, 64 feature maps with the size of 7 × 7 are obtained. The first fully-connected layer produces a 1024-dimensional vector, where the dropout regularization technique is used. The output layer is also a fully-connected layer with the softmax function, ensuring a final output size of 1 × 1 × 10, corresponding to the probabilities of the 10 classes of MSTAR targets.
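The feature-map sizes stated here can be checked with a short calculation, assuming 'valid' convolutions with stride 1 and non-overlapping 2 × 2 pooling as described above (the helper names are ours):

```python
def conv_out(size, kernel):
    # 'valid' convolution with stride 1 and no zero padding
    return size - kernel + 1

def pool_out(size, window=2):
    # non-overlapping pooling: stride equals the window size
    return size // window

s, trace = 88, []
for _ in range(3):                 # three conv (5x5) + max-pool (2x2) stages
    s = conv_out(s, 5); trace.append(s)
    s = pool_out(s);    trace.append(s)
print(trace)  # [84, 42, 38, 19, 15, 7]
```

The trace reproduces the sizes in Table 1, with the third convolution yielding 15 × 15 maps that the final pooling reduces to 7 × 7.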
During the training of the designed networks, the weights are initialized from Gaussian distributions with zero mean and a standard deviation of 0.01, and the biases are initialized with a small constant value of 0.1. The learning rate is initially 0.001 and decreases by a factor of 0.1 after 100 epochs. The batch size is set to 100. To train the proposed CNN, the deep learning toolbox in TensorFlow is used. The cropped MSTAR training images (detailed descriptions of the MSTAR dataset are presented in Section 5) are fed to the networks in Figure 1, and the hierarchical features are learned during the training process. The parameters of the whole network are obtained under the supervision of the target label of each training sample. As shown in Figure 2, the total loss decreases sharply and converges after about 1500 epochs of training. Figure 3 illustrates the original image and the internal state of the trained CNN: the convolution kernels and the 16 feature maps of the first convolution layer are shown in Figure 3b,c, respectively. It is clear that the global properties of the original image in Figure 3a are maintained in the feature maps. Afterwards, in the classification stage, the cropped test image is input to the trained CNN to decide its target type.


ASC Model
The high-frequency scattering of an electrically large target can be well approximated as a sum of the responses from individual scattering centers as Equation (3) [18]:

E(f, φ; θ) = Σ_{i=1}^{K} E_i(f, φ; θ_i),    (3)

where f denotes the frequency, φ represents the aspect angle, and K is the number of scattering centers. The backscattering field of a single scattering center can be described by the ASC model as follows:

E_i(f, φ; θ_i) = A_i · (j f/f_c)^{α_i} · exp(−j (4πf/c)(x_i cos φ + y_i sin φ)) · sinc((2πf/c) L_i sin(φ − φ_i)) · exp(−2π f γ_i sin φ),    (4)

where f_c is the radar center frequency and c is the propagation speed. In Equation (4), θ_i = [A_i, x_i, y_i, α_i, L_i, φ_i, γ_i] is the parameter set of the ith ASC. In detail, A_i denotes the complex amplitude; (x_i, y_i) are the spatial positions; α_i represents the frequency dependence; L_i and φ_i are the length and orientation of the distributed ASC, respectively; and γ_i is the aspect dependence of the localized ASC. The ASC attributes provide rich, physically relevant descriptions of the local structures of the target. (x_i, y_i) denotes the scattering center location in the image domain. α_i is a discrete parameter that takes on integer or half-integer values; its typical values are 1, 1/2, 0, −1/2, −1. The combination of the length and the frequency dependence can effectively reveal the geometrical structure of the ASC. For example, when α_i is 1 and L_i is nonzero, the ASC is assumed to have a dihedral shape. More explanations can be found in [18].
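As an illustration, the response of a single ASC can be evaluated numerically. The sketch below follows the standard ASC model form; the center frequency f_c = 9.6 GHz is our assumption matching the X-band MSTAR sensor, and the exact expression should be checked against [18]:

```python
import numpy as np

def asc_response(f, phi, A, x, y, alpha, L, phi_i, gamma, fc=9.6e9):
    """Response of a single ASC (sketch of the standard model).
    f: frequency (Hz); phi: aspect angle (rad); fc: center frequency,
    assumed 9.6 GHz here."""
    c = 3e8                                   # propagation speed (m/s)
    freq_term = A * (1j * f / fc) ** alpha    # frequency dependence
    pos_term = np.exp(-1j * 4 * np.pi * f / c
                      * (x * np.cos(phi) + y * np.sin(phi)))
    arg = 2 * np.pi * f / c * L * np.sin(phi - phi_i)
    sinc_term = np.sinc(arg / np.pi)          # np.sinc(t) = sin(pi*t)/(pi*t)
    asp_term = np.exp(-2 * np.pi * f * gamma * np.sin(phi))
    return freq_term * pos_term * sinc_term * asp_term

# a localized point scatterer (alpha = L = gamma = 0) at the origin
# responds with exactly its amplitude
print(abs(asc_response(9.6e9, 0.0, 1.0, 0.0, 0.0, 0, 0.0, 0.0, 0.0)))  # 1.0
```

Note the division by π when calling np.sinc, since NumPy uses the normalized sinc convention.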

Sparse Representation for ASC Extraction
For a single SAR image, there are only a few ASCs on the target. When the parameter space is gridded to form an over-complete dictionary, the parameter estimation of the ASCs can be formulated as a sparse representation problem [44,45]. Firstly, Equation (3) is rewritten as

s = D(θ)σ.    (5)

In Equation (5), s is the vector form of the measurements E(f, φ; θ); D(θ) is a parameterized redundant dictionary, in which each column is the vectorization of the measurements corresponding to one element of the parameter set θ; and σ is a complex sparse vector whose elements represent the relative amplitudes A. Considering the possible noise during data acquisition, the real measurements should be expressed as

s = D(θ)σ + n,    (6)

where n is modeled as additive white Gaussian noise with zero mean. Then, the ASCs can be extracted by solving the following problem:

σ̂ = argmin_σ ‖σ‖_0  s.t.  ‖s − D(θ)σ‖_2^2 ≤ ε,    (7)

where ε = ‖n‖_2^2 represents the noise level, which can be estimated from the original measurements; ‖·‖_0 denotes the l_0-norm; and σ̂ is the complex-valued amplitude estimate with respect to the dictionary D(θ). The optimization problem in Equation (7) is nondeterministic polynomial time hard (NP-hard) and thus computationally difficult to solve exactly. However, an approximate solution can be obtained by greedy algorithms, such as orthogonal matching pursuit (OMP) [45]. The detailed implementation of ASC extraction using OMP, which is used in this study, is described in Algorithm 1.
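A minimal OMP routine in the spirit of Algorithm 1 might look as follows. This is a sketch rather than the authors' exact implementation, and it assumes the dictionary columns are (approximately) normalized:

```python
import numpy as np

def omp(D, s, eps):
    """Orthogonal matching pursuit: greedily select dictionary atoms
    (columns of D) until the residual energy drops below eps."""
    residual = s.astype(complex)
    support, sigma = [], None
    while np.linalg.norm(residual) ** 2 > eps and len(support) < D.shape[1]:
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.conj().T @ residual)))
        if k in support:
            break
        support.append(k)
        # least-squares amplitudes over the selected support
        sigma, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        residual = s - D[:, support] @ sigma
    out = np.zeros(D.shape[1], dtype=complex)
    if support:
        out[support] = sigma
    return out
```

In practice, D(θ) is built by discretizing the ASC parameter space, and each recovered nonzero amplitude in σ̂ identifies one ASC together with the parameters of its grid cell.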

ASC Matching
The ASCs contain rich, physically relevant descriptions of the local structures of the target, such as the relative amplitude, spatial positions, length, etc. Therefore, the ASCs can be effectively used to sense the local variations of the target caused by various EOCs like configuration variance, depression angle variance, partial occlusion, etc. In this study, an ASC matching method is proposed for target recognition. A one-to-one correspondence between two ASC sets is first established. Then, the matched ASC pairs are evaluated to form a similarity measure for target recognition.
(1) Distance measure between individual ASCs
An essential prerequisite for building the one-to-one correspondence is properly evaluating the distance between two individual ASCs. This paper uses four attributes, i.e., [A, x, y, L], for distance evaluation because of their clear physical meanings and their stability during ASC extraction. For the test ASC set P = [p_1, p_2, . . . , p_M] and the template ASC set Q = [q_1, q_2, . . . , q_N], the distance between two individual ASCs is defined as Equation (8), which comprises three components. The first is the Euclidean distance between the spatial positions, i.e., [(p_ix − q_jx)^2 + (p_iy − q_jy)^2]^{1/2}. The second is the difference between the lengths, |p_iL − q_jL|. The attribute L is assumed to carry twofold uncertainty compared with the spatial positions because it is more difficult to estimate accurately. For the amplitude A, it is first normalized by its absolute value, and the difference is measured by an exponential function.
(2) ASC matching using the Hungarian algorithm
Based on the designed distance measure, this study uses the Hungarian algorithm to build the one-to-one correspondence between two ASC sets. Formulated as a bipartite graph matching problem, the correspondence with the lowest total distance between the two point sets can be found by the Hungarian algorithm [43].
The cost matrix for Hungarian matching is displayed in Table 2, where C_ij = d(p_i, q_j). In this study, the absolute amplitudes of the ASCs in both sets are first normalized. The cost of assigning p_i to q_j is the distance defined in Equation (8). In practical applications, the test ASC set may contain some false ASCs caused by the background noise. In addition, the template ASC set may have some missing ASCs due to deformations of the test target such as partial occlusion. Therefore, false ASCs (false alarms, FAs) and missing ASCs (missing alarms, MAs) should be considered during the Hungarian matching. The costs for the FAs and MAs in Table 2 are defined as follows:

C_FA(p_i) = (1/N) Σ_{j=1}^{N} d(p_i, q_j),  C_MA(q_j) = (1/M) Σ_{i=1}^{M} d(p_i, q_j).    (9)

That is, the cost of assigning a test ASC p_i to be a FA is the average cost of assigning it to all q_j (j = 1, 2, · · · , N), and the cost of assigning a template ASC q_j to be a MA is the average cost of assigning it to all p_i (i = 1, 2, · · · , M). To form a complete bipartite graph for Hungarian matching, the remaining costs in Table 2 are assigned as "∞" (i.e., infinity). The "∞" costs effectively prevent unsuitable matches; for example, the MAs will not be matched with the FAs.
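The augmented cost matrix and the resulting assignment can be sketched as below. The weighting inside asc_distance and the exact matrix layout are illustrative assumptions based on the description above (Equation (8) is not reproduced verbatim), and the exhaustive search over permutations stands in for the Hungarian algorithm, which would be used for realistic set sizes (e.g., scipy.optimize.linear_sum_assignment):

```python
from itertools import permutations

INF = float("inf")

def asc_distance(p, q):
    # p, q = (A, x, y, L) with normalized amplitudes; this weighting is an
    # illustrative stand-in for Equation (8), not the exact formula
    d_pos = ((p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2) ** 0.5
    d_len = 0.5 * abs(p[3] - q[3])      # L carries larger estimation uncertainty
    d_amp = abs(p[0] - q[0])            # amplitude term
    return d_pos + d_len + d_amp

def match(P, Q):
    """Assignment between test set P and template set Q with per-ASC
    false-alarm (FA) and missing-alarm (MA) dummy slots."""
    M, N = len(P), len(Q)
    size = M + N
    C = [[0.0] * size for _ in range(size)]
    for i in range(M):
        for j in range(N):
            C[i][j] = asc_distance(P[i], Q[j])
    for i in range(M):                  # FA cost: average over all q_j
        fa = sum(C[i][:N]) / N
        for j in range(N, size):
            C[i][j] = fa if j - N == i else INF
    for j in range(N):                  # MA cost: average over all p_i
        ma = sum(C[i][j] for i in range(M)) / M
        for i in range(M, size):
            C[i][j] = ma if i - M == j else INF
    # bottom-right block stays 0 so unused dummy rows/columns can pair up
    best, best_cost = None, INF
    for perm in permutations(range(size)):   # brute force; Hungarian in practice
        cost = sum(C[i][perm[i]] for i in range(size))
        if cost < best_cost:
            best, best_cost = perm, cost
    pairs = [(i, best[i]) for i in range(M) if best[i] < N]
    return pairs, best_cost

# two nearly identical ASC sets: both pairs should be matched
P = [(1.0, 0.0, 0.0, 0.0), (0.8, 5.0, 5.0, 0.0)]
Q = [(1.0, 0.1, 0.0, 0.0), (0.8, 5.0, 5.1, 0.0)]
print(match(P, Q)[0])  # [(0, 0), (1, 1)]
```

The "∞" entries forbid a test ASC from occupying another test ASC's FA slot (and likewise for MAs), mirroring the constraints described for Table 2.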

Similarity Evaluation
Based on the one-to-one correspondence built by Hungarian matching, both the matched ASC pairs and possible outliers are considered to define the similarity measure for two ASC sets as Equation (10).
where K_m denotes the number of matched ASC pairs; d_k represents the distance of the kth matched pair, which can be read from the cost matrix; and ω_k is the corresponding weight, defined as

ω_k = A_k / Σ_{k=1}^{K_m} A_k.    (11)

In Equation (11), A_k denotes the absolute amplitude of the matched test ASC. The test ASCs are taken as the baseline because they are compared with several different template ASC sets. The weights are defined based on the relative amplitudes for the following considerations. On the one hand, the strong ASCs with higher amplitudes tend to be estimated more stably during ASC extraction. On the other hand, the ASCs with higher amplitudes remain more stable under noise corruption or other interferences; as the noise corruption deteriorates, the ASCs with lower amplitudes are more likely to be submerged. Therefore, by assigning higher weights to the stronger ASCs, the similarity measure becomes more robust to the possible uncertainties during ASC extraction and to noise corruption.

Hierarchical Fusion of CNN and ASC Matching for SAR ATR
As reported in the relevant literature [39][40][41][42], CNN can achieve notably high accuracies under SOC or conditions similar to SOC. The ASC matching is more robust to conditions with local variations caused by EOCs like noise corruption, configuration variance, partial occlusion, etc. To combine their merits in a unified ATR system, a hierarchical fusion framework is proposed in this study. Figure 4 shows the general procedure of the proposed target recognition method. First, the test sample is classified by the designed CNN. The pseudo posterior probabilities from the softmax are used to define a reliability level as follows:

r = P_(1) / P_(2),    (12)

where P_i (i = 1, 2, · · · , C) denotes the probability corresponding to the ith class, and P_(1) and P_(2) are the highest and second highest probabilities, respectively. The reliability r takes a value larger than 1 and reflects the gap between the highest probability and the second highest one; a larger r indicates a more reliable decision. A threshold T is used to judge whether the decision from CNN should be adopted. With a reliability level higher than the threshold, the decision is assumed to be highly reliable, and the target type is directly decided by CNN. Otherwise, the test sample is passed to the ASC matching for further identification. The template samples are selected based on the estimated azimuth of the test image [19], and the target type is decided to be the class with the maximum similarity. The ASC matching makes a detailed analysis of the local structures of the targets and is thus more robust to various EOCs. By hierarchically fusing the two classification schemes, both the efficiency and robustness of the ATR method can be enhanced.
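The hierarchical decision rule can be sketched in a few lines. The reliability ratio follows the description of the reliability level above; classify and asc_fallback are hypothetical names standing in for the full pipeline:

```python
def reliability(probs):
    """Ratio of the highest to the second highest softmax probability,
    as described for the reliability level in the text."""
    top = sorted(probs, reverse=True)
    return top[0] / top[1]

def classify(probs, asc_fallback, T=1.1):
    """Hierarchical decision: keep the CNN label when the reliability
    exceeds T, otherwise defer to ASC matching. `asc_fallback` is a
    hypothetical callable standing in for the ASC-matching classifier."""
    if reliability(probs) > T:
        return max(range(len(probs)), key=probs.__getitem__)
    return asc_fallback()

print(classify([0.70, 0.20, 0.10], lambda: -1))   # reliable CNN decision: 0
print(classify([0.40, 0.38, 0.22], lambda: -1))   # deferred to ASC matching: -1
```

With T = 1.1 as used in the experiments, a top-two probability ratio of 0.70/0.20 = 3.5 is accepted directly, whereas 0.40/0.38 ≈ 1.05 triggers the ASC-matching fallback.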


Data Preparation and Experimental Setup
To experimentally evaluate the proposed method, the MSTAR dataset is used in this study, which is the benchmark dataset for SAR ATR. There are 10 military targets included in the dataset, which share similar appearances as shown in Figure 5. Their SAR images are captured by X-band SAR sensors with a resolution of 0.3 m × 0.3 m. The training and test samples used for the experiments are showcased in Table 3; they are collected at 17° and 15° depression angles, respectively.
For performance comparison, several state-of-the-art SAR ATR methods are used, including SVM [30], SRC [32] and A-ConvNet [39], as briefly described in Table 4. SVM and SRC are performed on 80-dimension PCA feature vectors extracted from the original images. A-ConvNet [39] is chosen as the representative of the CNN-based SAR ATR methods. The ASC matching method proposed in [22] is also compared, where a one-to-one correspondence between two ASC sets is built for similarity evaluation. In the following, the experiment is first conducted under SOC on the 10 classes of targets. Afterwards, several typical EOCs are used to comprehensively evaluate the robustness of the proposed method, including configuration variance, large depression angle variance, noise corruption and partial occlusion. Finally, the performance is evaluated under limited training samples to further examine its robustness.

Preliminary Verification
The recognition problem is first considered under SOC. The 10-class training and test samples in Table 3 are used, and the threshold T for the reliability level is first set to 1.1. Table 5 displays the detailed recognition results of the proposed method: each of the 10 targets is classified with a percentage of correct classification (PCC) over 98%. Table 6 compares the average PCCs of different methods under SOC. With the highest PCC of 99.41%, the proposed method outperforms the others by notable margins. A-ConvNet ranks second among all the methods, indicating the excellent classification capability of CNN. Under SOC, the training and test samples are quite similar, with only a small depression angle variance (2°) in this case. Therefore, most test samples can be correctly classified by CNN because of its powerful classification capability. Due to unpredictable factors during data acquisition, a few test samples may differ considerably from the training samples; as a result, they may not be reliably classified by the designed CNN and are passed to ASC matching for further determination. By combining the advantages of the two classification schemes, the final recognition performance of the proposed method is largely enhanced.

Performance under Different Thresholds
The threshold T directly determines whether the decision from CNN is considered reliable. Therefore, it has an important influence on the final recognition performance. By varying the threshold, the PCCs of the proposed method are plotted in Figure 6. The PCC peaks at T = 1.1, and the detailed results can be found in the former experiment. Although the PCC varies over the threshold interval, the average PCC over all the thresholds is still 98.91%, indicating the robustness of the proposed method. When the threshold is lower than 1, all the test samples are directly classified by the designed CNN. With a threshold slightly higher than 1, most of the test samples are determined by CNN and only a few are passed to ASC matching. Due to the excellent classification capability of CNN under SOC, the performance remains at a high level. In contrast, when the threshold is notably high, almost all the decisions are made by ASC matching. As the ASC matching is itself an effective SAR ATR method, the PCC does not fall too much. In the following experiments, the threshold is fixed at T = 1.1 in order to achieve better recognition performance.


Recognition under EOCs
A reliable SAR ATR system must be robust to the various EOCs that arise in real-world scenarios from variations of the target itself, the background environments, the sensors, etc. To comprehensively evaluate the proposed method, the following experiments are conducted under different types of EOCs, i.e., configuration variance, large depression angle variance, noise corruption and partial occlusion.

Configuration Variance
A certain military target may be modified into several different configurations for different applications. The different configurations share similar target shapes with some local variations. Table 7 showcases the training and test sets for this experiment; the configurations of BMP2 and T72 used for testing are not included in the training set. Figure 7 shows the optical images of four different configurations of T72, where several local differences can be found at the turret, the fuel drums, etc. Table 8 lists the detailed recognition results of the proposed method under configuration variance. All the configurations of BMP2 and T72 are classified with PCCs over 96%, resulting in an average of 98.64%. The performances of different methods are compared in Table 9. With the highest PCC, the proposed method is validated to be the most robust to configuration variance. It is also notable that the ASC method outperforms the remaining ones. In the ASC method, the one-to-one correspondence between the test and template ASC sets is built, which is beneficial for sensing the local variations of the target caused by configuration variance. In the proposed method, some test samples can still be reliably classified by the designed CNN, and the remaining ones obtain more accurate decisions from ASC matching. Therefore, the final recognition performance of the proposed method is effectively enhanced.

Large Depression Angle Variance
The test SAR images may be collected at depression angles different from those of the training samples. Figure 8 shows the SAR images of 2S1 at three depression angles, i.e., 17°, 30° and 45°. It is visible that images with large depression angle variances have quite different appearances in terms of target shape and scattering patterns [46,47]. The training and test sets for this experiment are showcased in Table 10, where three targets are included, i.e., 2S1, BRDM2 and ZSU23/4. Table 11 presents the detailed recognition results of the proposed method at different depression angles. At a 30° depression angle, the proposed method can still achieve a very high PCC of 97.80%. However, when the depression angle changes to 45°, the performance decreases significantly to 76.16%. The main reason is that the notably large depression angle variance causes much discrepancy between the training and test samples. Table 12 compares the performances of different methods under large depression angle variance. With the highest PCCs at both depression angles, the proposed method is demonstrated to be the most robust. The ASC method ranks second among all the methods, and its superiority becomes more remarkable at the 45° depression angle. Although the global appearance changes greatly under large depression angle variance, some local characteristics can still remain stable. Therefore, the ASCs can better serve target recognition in this situation. By combining the merits of CNN and ASC matching, the proposed method achieves the best performance.

Noise Corruption
The test images collected in real-world scenarios are often contaminated by noise from the background environment or the radar system. Hence, it is crucial that the recognition algorithm remains robust under possible noise corruption. To test the performance of the proposed method under noise corruption, noisy SAR images are first simulated by adding additive Gaussian noise to the original images according to a predefined signal-to-noise ratio (SNR) [48]. Figure 9 shows the noisy images at different SNRs. As the noise contamination deteriorates, more and more target characteristics are submerged in the noise, which definitely increases the difficulty of correct target recognition. Figure 10 plots the average PCCs of different methods under noise corruption. In comparison, the proposed method has the best robustness to noise corruption, with the highest PCC at each SNR. The ASC method outperforms SVM, SRC, and CNN at SNRs lower than 5 dB. The reasons can be analyzed from two aspects. On the one hand, the ASCs are noise-robust features, so the ASCs of noisy images can still be extracted with good precision and matched well with those from the template samples. On the other hand, the local variations caused by noise corruption can be better handled via the one-to-one correspondence between two ASC sets. In the proposed method, ASC matching effectively complements the designed CNN in coping with severely corrupted samples. Therefore, the final performance is significantly improved.
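The SNR-controlled corruption described above follows a standard protocol: the noise power is set from the image's signal power and the prescribed SNR. A minimal sketch of this simulation, with function name and interface chosen for illustration (not taken from the paper's code):

```python
import numpy as np

def add_noise_at_snr(image, snr_db, rng=None):
    """Corrupt an image with additive Gaussian noise so that the
    result has (approximately) the prescribed SNR in dB.

    SNR(dB) = 10 * log10(signal_power / noise_power), so
    noise_power = signal_power / 10**(snr_db / 10)."""
    rng = np.random.default_rng(rng)
    signal_power = np.mean(image.astype(float) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=image.shape)
    return image + noise
```

At 10 dB the noise power is one tenth of the signal power; at -5 dB the noise dominates, which matches the visual degradation reported in Figure 9.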

Partial Occlusion
The target may be occluded by obstacles or intentionally camouflaged. In this case, a part of the target may not be present in the captured SAR image. According to the SAR occlusion model in [20,49], a partially occluded image is simulated by removing a certain proportion of the target region of the original image from one of eight directions. Figure 11 shows the 20% occluded SAR images from four different directions, whereas the remaining ones are in the symmetrical directions. Figure 12 plots the average PCCs over the eight directions for different methods. With the highest PCC at each occlusion level, the proposed method is validated to be the most robust to partial occlusion. The ASC method outperforms the remaining ones when the occlusion level goes higher than 30%. The main reason is that the stable ASCs in occluded images can still be matched well. In the proposed method, ASC matching works cooperatively with the CNN to cope with severely occluded images. Therefore, the fused performance is much better than that of the others.
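Directional occlusion of this kind can be simulated by sweeping across the target region and zeroing out a leading fraction of its pixels. The following is a simplified sketch of such a model under stated assumptions (occluded pixels set to zero, direction given as a pixel-coordinate vector); it is not the exact code of the cited occlusion model.

```python
import numpy as np

def occlude_from_direction(image, target_mask, fraction, direction):
    """Remove `fraction` of the target pixels, sweeping in from one
    of eight directions (a sketch, not the cited model's exact code).

    direction : (dy, dx) vector, e.g. (1, 0) removes pixels starting
                from the top edge of the target region."""
    ys, xs = np.nonzero(target_mask)
    # Project every target pixel onto the occlusion direction; the
    # pixels with the smallest projection are occluded first.
    proj = ys * direction[0] + xs * direction[1]
    order = np.argsort(proj)
    n_remove = int(round(fraction * len(order)))
    occluded = image.copy()
    occluded[ys[order[:n_remove]], xs[order[:n_remove]]] = 0
    return occluded
```

Choosing the eight compass directions as (dy, dx) vectors reproduces the eight occlusion cases averaged in Figure 12.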

Limited Training Samples
Actually, the available training resource for SAR ATR is quite limited [37,38]. As a result, the training samples may only cover a certain proportion of the full 360° azimuth range. For experimental evaluation, we randomly select 1/2, 1/3, 1/4, 1/5 and 1/6 of each of the 10-class samples and then perform target recognition based on the reduced training set. As shown in Figure 13, the proposed method keeps the highest PCC at each reduction level, validating its best robustness to limited training samples. In addition, the ASC method achieves performance close to that of the proposed method and significantly outperforms the remaining ones. For SVM, SRC and CNN, performance is closely related to the completeness of the training set; when the training samples are reduced severely, their PCCs drop sharply. In the ASC method, the corresponding templates are selected based on the azimuth of the test image. In fact, the ASCs can remain stable within a certain azimuth interval (e.g., [−5°, 5°]) [50], so ASC matching can still be performed with good effectiveness. As a combination of CNN and ASC matching, the proposed method achieves the best performance mainly by inheriting the robustness of ASC matching.
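The per-class random reduction used in this experiment can be sketched as follows; the function name and interface are illustrative, assuming samples and labels stored as NumPy arrays.

```python
import numpy as np

def reduce_training_set(samples, labels, keep_ratio, rng=None):
    """Randomly keep `keep_ratio` (e.g. 1/2, 1/3, ..., 1/6) of the
    training samples of each class, as in the limited-sample test."""
    rng = np.random.default_rng(rng)
    keep_idx = []
    for c in np.unique(labels):
        idx = np.nonzero(labels == c)[0]
        n_keep = max(1, int(round(keep_ratio * len(idx))))
        # Sample without replacement within each class so every class
        # is reduced by (approximately) the same ratio.
        keep_idx.extend(rng.choice(idx, size=n_keep, replace=False))
    keep_idx = np.sort(np.asarray(keep_idx))
    return samples[keep_idx], labels[keep_idx]
```

Reducing each class by the same ratio keeps the class balance of the original training set, so the observed PCC drops reflect sample scarcity rather than class imbalance.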

Discussion
The experimental results on the MSTAR dataset validate the superior effectiveness and robustness of the proposed method under SOC and several EOCs compared with several state-of-the-art SAR ATR methods including SVM, SRC, A-ConvNet and the ASC matching method. In detail, the reasons behind the experimental results are discussed as follows.
(i) Experiment under SOC. Under SOC, the training and test samples are notably similar, with only a 2° depression angle difference. Consequently, all the methods achieve very high PCCs. Due to the powerful classification capability of CNN under SOC, most test samples are actually classified by CNN in the proposed method. The remaining ones can also be effectively classified by ASC matching because of its good performance. Hence, the hierarchical fusion of the two classification schemes can maintain excellent performance under SOC, which is demonstrated to outperform the others. In this case, the excellent performance of the proposed method mainly benefits from CNN. Meanwhile, ASC matching further improves the recognition performance by handling the few test samples that possibly differ greatly from the training ones.
(ii) Experiment under EOCs. EOCs like configuration variance, depression angle variance, noise corruption and partial occlusion probably cause some local variations of the target in the test SAR images. Therefore, the one-to-one correspondence between the local descriptors, i.e., ASCs, can better handle these situations. For classifiers like SVM, SRC and CNN, the training samples only include SAR images of intact targets with high SNRs. In addition, only a specific configuration is covered by the training set. Therefore, their performances degrade greatly under these EOCs. In the proposed method, when a test sample cannot be reliably classified by CNN, ASC matching can probably provide a correct decision. Therefore, by hierarchically fusing CNN and ASC matching, the robustness of the proposed method is enhanced. In this case, the superior robustness of the proposed method mainly benefits from the merits of ASC matching. However, for those EOCs which do not deviate severely from the training set (e.g., a small amount of added noise), CNN is likely to make correct decisions on them. Therefore, CNN can complement ASC matching to further improve ATR performance.
(iii) Experiment under limited training samples. With limited training samples, the classification capabilities of SVM, SRC and CNN are impaired greatly. For the ASC matching method, the template ASCs still share a high correlation with the test ASCs because the stability of ASCs can be maintained within a certain azimuth interval. Therefore, once the CNN cannot form a reliable decision for the test image, ASC matching can better cope with the situation.
All in all, in this study, CNN is adopted as the basic classifier, which operates with high effectiveness and efficiency when the test sample is covered by the training set. As a complement to CNN, the test samples that are severely corrupted by EOCs and can hardly be determined by CNN are further identified by ASC matching. The detailed analysis between two ASC sets helps make correct decisions under various EOCs. Therefore, the hierarchical fusion of the two classification schemes notably promotes the final ATR performance.
Future work can be conducted from two aspects. On the one hand, specific CNN architectures for SAR ATR should be studied to further improve the recognition performance. At the present stage, the CNNs for SAR ATR are mainly introduced from the field of optical image processing; networks specific to SAR image interpretation should be further researched. On the other hand, more efficient and robust classifiers can be incorporated into the proposed framework to further enhance the robustness of the ATR system. ASC matching is a representative local classifier, which performs target recognition by analyzing the local variations of the target. Other similar classification schemes may exist, which could further improve the robustness of SAR ATR.

Conclusions
A SAR ATR method by hierarchically fusing CNN and ASC matching is proposed in this study. A test sample is first classified by CNN. When there is no reliable decision, it will be further recognized by ASC matching. CNN can achieve notably high classification accuracy under SOC, when the test samples are covered by the training set. ASC matching can better cope with various EOCs related to the local variations of the target such as configuration variance, noise corruption, partial occlusion, etc. Therefore, the hierarchical fusion effectively inherits the high effectiveness of CNN under SOC and good robustness of ASC matching to various EOCs. Extensive experiments are conducted on the MSTAR dataset under SOC and typical EOCs including configuration variance, depression angle variance, noise corruption and partial occlusion. Based on the experimental results, several conclusions can be drawn as follows.
(i) CNN has powerful classification capability under SOC. Thus, it is a reasonable choice to use it as the basic classifier. In addition, ASC matching can also work very well under SOC because of the good discrimination capability of ASCs. Therefore, the hierarchical fusion of the two classification schemes can maintain excellent performance under SOC.
(ii) ASC matching can achieve very good robustness under different types of EOCs. The one-to-one correspondence between two ASC sets can sense the local variations of the target; thus, the resulting similarity measure can better handle these situations. Therefore, those samples which cannot be reliably classified by CNN are likely to obtain correct decisions via ASC matching.
(iii) The proposed method achieves the best performance under both SOC and EOCs compared with other state-of-the-art methods by combining the merits of the two classification schemes.
In conclusion, the proposed method has much potential to improve the ATR performance in practical applications.
Author Contributions: C.J. proposed the general idea of the method and performed the experiments. Y.Z. reviewed the idea and provided many constructive suggestions. This manuscript was written by C.J.