SAR Target Recognition via Meta-Learning and Amortized Variational Inference

The challenge of small data has emerged in synthetic aperture radar automatic target recognition (SAR-ATR). Most SAR-ATR methods are data-driven and require large amounts of training data, which are expensive to collect. To address this challenge, we propose a recognition model that incorporates meta-learning and amortized variational inference (AVI). Specifically, the model consists of global parameters and task-specific parameters. The global parameters, trained by meta-learning, construct a common feature extractor shared between all recognition tasks. The task-specific parameters, modeled by probability distributions, can adapt to new tasks with a small amount of training data. To reduce the computation and storage cost, the task-specific parameters are inferred by AVI implemented with set-to-set functions. Extensive experiments were conducted on a real SAR dataset to evaluate the effectiveness of the model. Comparisons with the latest SAR-ATR methods show the superior performance of our model, especially on recognition tasks with limited data.


Introduction
Synthetic aperture radar (SAR) is an active remote sensor with day-and-night, high-resolution, and wide-area imaging capabilities. Because of these unique capabilities, SAR is widely used in geoscience and remote sensing. Today, numerous SAR sensors are operating on spaceborne and airborne platforms and are imaging ground targets for surveillance and reconnaissance. For efficient interpretation of SAR image data, SAR automatic target recognition (SAR-ATR) systems are being developed. SAR-ATR aims to detect and recognize targets, such as trucks and armored personnel carriers, in SAR images. The workflow of an end-to-end SAR-ATR system includes three stages: detection, low-level classification, and high-level classification [1]. Once an SAR image enters the system, detectors, such as constant false-alarm rate (CFAR) detectors, locate candidate targets in the images [2]. The region of interest (ROI), consisting of the true target and background clutter, is extracted around each candidate target. The clutter is then analyzed and filtered out in the low-level classification. Finally, the class or even the model of the target is identified in the high-level classification. In this paper, we focus on the third stage, that is, high-level classification.
Traditional SAR-ATR methods are generally classified into two categories: feature-based and model-based. Feature-based methods extract discriminative features from SAR images and train classifiers with these features. The features can be extracted in the spatial domain, such as templates, each of which is an average representation of a target at a particular azimuth angle [3]. Feature extraction can also be carried out in the transformation domain.

In this paper, we solve SAR-ATR tasks with small data in a meta-learning framework. Simulated SAR data [30] are also introduced to compensate for the lack of real data. We build a meta-learning model consisting of global parameters and task-specific parameters. The model is meta-learned using sufficient simulated SAR data. After meta-learning, it retains the global parameters and uses real SAR data to update the task-specific parameters. To reduce the model uncertainty caused by small training data, the model places probability distributions over the task-specific parameters. These parameters, vast in number, are estimated by amortized variational inference (AVI) [31] to reduce the computation and storage cost. Most relevant to our method is the work of [32,33], in which amortized networks (e.g., plain neural networks) were used to approximate task-specific parameters. By contrast, our model uses variational inference to account for the errors introduced by the amortized approximation, thereby improving the training objective function. Furthermore, we propose a novel amortized network implemented with set-to-set functions [34] to boost the performance of AVI.
The contributions of this paper are summarized as follows:
(1) We propose a novel recognition model integrating meta-learning and AVI. The model can recognize new targets with a small amount of real data.
(2) To reduce the model uncertainty caused by small data, task-specific parameters of the model are modeled by probability distributions and are inferred by AVI.
(3) The amortized network of AVI is implemented with set-to-set functions, thereby improving its performance.

Model Framework
In this paper, SAR-ATR with small data is defined and solved in a probabilistic meta-learning framework. The task, a random combination of training samples, is the independent training unit of the model. Assume that we have a task set {τ_i}_{i=1}^M of M tasks, where the i-th task τ_i = {(x_ij, y_ij)}_{j=1}^N has N samples, each of which contains an image x_ij and its ground-truth label y_ij. The samples of τ_i are then split into two disjoint parts: a support set τ_i^s = {(x_ij^s, y_ij^s)}_{j=1}^L for training and a query set τ_i^q to evaluate training effectiveness.
A probabilistic graph is employed to visually specify how components depend on each other in the model framework. Figure 1 shows that the framework consists of global parameters θ, task-specific parameters φ_i, and samples that include support and query sets. Three principles guide the choice of model framework. First, we assume that all the tasks share a common structure parameterized by θ, which assists the model in solving new tasks with few training samples. Second, we employ the point estimate for θ because it is shared between tasks and its uncertainty decreases as the number of tasks increases. Third, the task-specific parameters φ_i are distributionally estimated to deal with the model uncertainty caused by small training data.
To simplify the symbols, the sample sets {x_ij^s}∀j, {y_ij^s}∀j, {x_ij^q}∀j, and {y_ij^q}∀j are abbreviated as x_i^s, y_i^s, x_i^q, and y_i^q, respectively. According to the maximum likelihood criterion, our goal is to train a model that utilizes the knowledge in (x_i^s, y_i^s) to predict the labels y_i^q of new images x_i^q accurately. The log-likelihood L is defined as:

L = Σ_{i=1}^{M} log p(y_i^q | x_i^q, x_i^s, y_i^s, θ),    (1)

where M is the total number of tasks. As shown in Figure 1, y_i^q is conditionally independent of x_i^s and y_i^s, and φ_i is conditionally independent of x_i^q. According to these dependency relationships, we factorize the joint distribution in (1) into two terms:

p(y_i^q | x_i^q, x_i^s, y_i^s, θ) = ∫ p(y_i^q | x_i^q, φ_i, θ) p(φ_i | x_i^s, y_i^s, θ) dφ_i,    (2)

where p(φ_i | x_i^s, y_i^s, θ), the posterior distribution over φ_i, is learned from the support set (x_i^s, y_i^s). The predictive distribution p(y_i^q | x_i^q, φ_i, θ) is inferred from the query data x_i^q and the learned φ_i. Computing (2) requires integration over all the values of φ_i, which is typically intractable for complex or large-scale models. Therefore, variational inference [35] is introduced to approximate the posterior p(φ_i | x_i^s, y_i^s, θ). This involves a variational distribution q(φ_i; λ_i), specified by a set of variational parameters λ_i. We rewrite the integral in (2) as expectations and use Jensen's inequality to obtain its lower bound L_VI:

L_VI = Σ_{i=1}^{M} ( E_{q(φ_i; λ_i)}[log p(y_i^q | x_i^q, φ_i, θ)] − KL(q(φ_i; λ_i) ‖ p(φ_i | x_i^s, y_i^s, θ)) ),    (3)

where the first term represents the predictive log-likelihood given a variational distribution q(φ_i; λ_i).
The second term is the Kullback-Leibler (KL) divergence between q(φ_i; λ_i) and p(φ_i | x_i^s, y_i^s, θ). Minimizing the KL divergence bridges the gap between the two distributions to eliminate approximation errors.
During meta-learning, large numbers of tasks are generated by randomly drawing training samples from the dataset. With the increasing number of tasks, learning variational parameters λ_i for each φ_i is challenging due to the cost of storage and computing. Therefore, we introduce AVI, which combines an amortized inference network with variational inference. AVI takes a task τ_i as input and uses an inference network shared between all tasks to predict φ_i. Thus, the optimization problem of (3) turns into:

L_AVI = Σ_{i=1}^{M} ( E_{q_ϕ}[log p(y_i^q | x_i^q, φ_i, θ)] − KL(q_ϕ(φ_i | x_i^s, y_i^s, θ) ‖ p(φ_i | x_i^s, y_i^s, θ)) ),    (4)

where q_ϕ(φ_i | x_i^s, y_i^s, θ) is the approximation of q(φ_i; λ_i), and the inference network is parameterized by ϕ. After performing AVI, the model only needs to learn and store the globally shared parameters (i.e., θ and ϕ) throughout the meta-learning process, while φ_i is predicted from the globally shared parameters and the training samples. Finally, the model performs the following optimization:

max_{θ, ϕ} Σ_{i=1}^{M} ( E_{q_ϕ}[log p(y_i^q | x_i^q, φ_i, θ)] − α KL(q_ϕ(φ_i | x_i^s, y_i^s, θ) ‖ p(φ_i | x_i^s, y_i^s, θ)) ),    (5)

where the first term of (5) is the cross-entropy loss of the query set, while the KL term can be viewed as a training regularizer. Because the two terms have different ranges of values, the KL term is down-weighted by the coefficient α. q_ϕ(φ_i | x_i^s, y_i^s, θ) is a Gaussian distribution with mean and covariance determined by the data (x_i^s, y_i^s) and the parameters θ and ϕ. p(φ_i | x_i^s, y_i^s, θ) is defined experimentally and is also a Gaussian distribution: its mean is identical to that of q_ϕ(φ_i | x_i^s, y_i^s, θ), but its covariance is an identity matrix. For the detailed calculation of the KL divergence, one can refer to [36].
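Because p(φ_i | x_i^s, y_i^s, θ) shares its mean with q_ϕ and has identity covariance, the KL term in (5) reduces to a simple function of the predicted variances. A minimal numpy sketch under those Gaussian assumptions (function names are ours, not the authors'):

```python
import numpy as np

def kl_to_identity_prior(sigma_sq):
    """KL( N(mu, diag(sigma_sq)) || N(mu, I) ).

    Since the means of q and p coincide, the mean term of the Gaussian
    KL divergence vanishes and only the variances contribute:
        KL = 0.5 * sum(sigma_sq - log(sigma_sq) - 1)
    """
    sigma_sq = np.asarray(sigma_sq, dtype=float)
    return 0.5 * np.sum(sigma_sq - np.log(sigma_sq) - 1.0)

def regularized_loss(cross_entropy, sigma_sq, alpha=1e-4):
    # Eq. (5) as a minimization: query-set cross-entropy plus the
    # down-weighted KL regularizer.
    return cross_entropy + alpha * kl_to_identity_prior(sigma_sq)
```

Note that the KL term is zero exactly when the predicted variances equal one, i.e., when q_ϕ matches the prior's covariance.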

Model Structure
As shown in Figure 2, the model is composed of three modules: a feature extractor f θ with global parameters, a classifier f φ with task-specific parameters, and a weight predictor f ϕ . Once the images enter the model, their low-dimensional features are extracted by f θ . The weight predictor f ϕ then uses support set features f θ (τ s ) to predict the weights of f φ . Finally, the predictive distribution p y q i x q i , φ i , θ is produced by f φ using query set features f θ (τ q ).
Sensors 2020, 20, x FOR PEER REVIEW 5 of 19

The detailed network configuration of the model is shown in Figure 3. The feature extractor uses a CNN that contains four convolutional blocks to map images into features. Each convolutional block sequentially performs convolution (Conv), batch normalization (BN), nonlinear activation, dropout, and max-pooling operations on the inputs. BN is essential in the network for accelerating the training convergence. Furthermore, it can effectively improve training stability, especially when training data are scarce. The classifier consists of an FC layer and a softmax layer, where C is the number of classes.
Before being sent to the feature extractor, the SAR images are first converted to a logarithmic scale, and their pixel values are normalized to [−1.0, 1.0]. The normalized images are then cropped to 64 × 64 to reduce the influence of clutter and noise. The first convolutional block convolves the input images with 16 filters with a kernel size of 5 × 5. Its outputs are 16 feature maps of size 32 × 32 due to spatial zero padding and a 2 × 2 max-pooling. The next three convolutional blocks perform similar processing on their input feature maps. Finally, the feature extractor outputs 64 feature maps of size 4 × 4 and flattens them into a 1024-D feature vector. The classifier determines the class to which the feature vector belongs. These feature vectors are also used to learn the weights of the classifier. The weight predictor uses set-to-set functions to transform the feature vectors, extracts distribution parameters (e.g., mean and variance), and samples the classifier weights from these parameters.
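The preprocessing chain above (logarithmic scale, normalization to [−1.0, 1.0], central 64 × 64 crop) can be sketched as follows. The exact log constant and crop policy are our assumptions, since the text does not specify them:

```python
import numpy as np

def preprocess(image, out_size=64):
    """Log-scale, normalize to [-1, 1], and center-crop an SAR image.

    `image` is a 2-D array of non-negative SAR amplitudes. A small
    epsilon guards the logarithm of zero-valued pixels; min-max
    normalization maps the log image to [-1, 1].
    """
    img = np.log10(np.asarray(image, dtype=float) + 1e-10)   # logarithmic scale
    img = 2.0 * (img - img.min()) / (img.max() - img.min()) - 1.0  # to [-1, 1]
    h, w = img.shape
    top, left = (h - out_size) // 2, (w - out_size) // 2
    return img[top:top + out_size, left:left + out_size]     # central crop
```

Feeding the 64 × 64 output through the four convolutional blocks (each halving the spatial size via 2 × 2 max-pooling) yields the 64 feature maps of size 4 × 4 described in the text.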

Weight Predictor
Predicting task-specific weights for the classifier is challenging since the training samples per task are limited. It has been observed that, in the last FC layer of a CNN, the weight vector and the input feature vector are highly similar in structure. Previous works [32,33] regarded the mean of the image features in one class as a class proxy. They fed the proxy into amortized networks to predict the classifier weights for this class. This idea is too rigid for cases in which the image features follow a complex distribution containing multiple cluster centers. Our model uses feature sets instead of feature means to infer the classifier weights. The feature sets are first transformed by set-to-set functions that take into account both the inter-class and the intra-class information. Next, the model uses the transformed features to infer the distribution of classifier weights.
Given a feature set of C classes, where Z_i represents the image features belonging to class i, the set-to-set transformation function f(·) is defined as:

f(Z_i) = Z_i + g(Z_i), with g(Z_i) = FC(Z_i ⊕ max_{j≠i} h(Z_j)),

where ⊕ denotes vector concatenation, and g(·) and h(·) are nonlinear mappings. We use cascaded FC layers with ReLU activation to implement these nonlinear mappings. Figure 4 shows the transformation process of feature sets. For each Z_i, its complement elements Z_j, j ≠ i, are first transformed into representations h(Z_j) by nonlinear mappings. The representations are aggregated as context data (inter-class information) by a maximum operator. Next, we concatenate Z_i with the context vector and input it to the FC layers to obtain the residual mapping g(Z_i).

g(Z_i) can also be regarded as a conditioned mapping that considers the other classes in the set. Finally, we add Z_i and g(Z_i) to obtain f(Z_i). As stated in [34], the maximum operator in Figure 4 can be replaced by a sum operator. However, we experimentally found that the maximum operator performs better than the sum operator.
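The residual set-to-set transformation described above can be sketched in numpy. Single-layer ReLU maps stand in for the paper's cascaded FC layers, and all names and weight shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def set_to_set(Z, Wh, Wg):
    """f(Z_i) = Z_i + g(Z_i), g(Z_i) = FC(Z_i concat max-context).

    Z  : (C, D) matrix of per-class feature vectors.
    Wh : (D, D) weights of a single-layer stand-in for h(.).
    Wg : (2D, D) weights of a single-layer stand-in for g(.).
    """
    C, D = Z.shape
    H = relu(Z @ Wh)                        # h(Z_j) for every class
    out = np.empty_like(Z)
    for i in range(C):
        mask = np.arange(C) != i
        context = H[mask].max(axis=0)       # max over the complement set
        g = relu(np.concatenate([Z[i], context]) @ Wg)  # residual mapping
        out[i] = Z[i] + g                   # residual connection
    return out
```

With zero weights the residual branch vanishes and the transform reduces to the identity, which is the usual sanity check for residual mappings.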

To reduce the model uncertainty caused by small data, the classifier weights are random variables inferred by AVI. In practice, the weights are the task-specific parameters φ_i, where i is the index of the task. We formulate φ_i as a stochastic matrix; thus, φ_i = [w_1, · · · , w_j, · · · , w_C] ∈ R^{D×C}, where C is the number of classes and D is the dimension of the feature vectors. The distribution of w_j is specified as a factorized Gaussian distribution N(w_j | μ_j, diag(σ_j^2)), where μ_j and diag(σ_j^2) are the mean vector and diagonal covariance matrix of w_j, respectively. The weight predictor f_ϕ uses set-to-set functions to transform the support set features f_θ(τ^s), produces the parameters μ_j and σ_j^2 for each w_j, and samples w_j with these parameters. To facilitate the backpropagation of gradients, f_ϕ samples w_j with a local reparameterization trick instead of sampling w_j directly. The sampling of w_j is defined as follows:

w_j = μ_j + σ_j ⊙ ε, ε ∼ N(0, I),

where w_j is represented by a linear function of the Gaussian variables ε, and ⊙ denotes the element-wise product.
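The reparameterized sampling step can be illustrated with numpy; in practice it would be written in an autodiff framework so that gradients reach μ and σ, and the function name here is ours:

```python
import numpy as np

def sample_classifier_weights(mu, sigma, rng=None):
    """Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, I).

    mu, sigma : (C, D) arrays of per-class weight means and standard
    deviations predicted by the weight predictor. Expressing the sample
    as a deterministic function of (mu, sigma) plus external noise is
    what allows gradients to flow back through the sampling step.
    """
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
    return mu + sigma * eps               # element-wise product
```

Setting σ to zero recovers a deterministic classifier with weights μ, which shows how the stochastic formulation strictly generalizes the point-estimate one.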

Training Details
The workflow of our model includes three stages: meta-learning, updating, and testing. During meta-learning, the model learns parameters θ and ϕ using simulated SAR data. In the updating stage, the model freezes θ and uses a small amount of real SAR data to update ϕ. The remaining real data are used to test the model. To mimic the meta-learning scenario, both real and simulated data are organized as N-way, K-shot classification tasks. To construct a task with support and query splits, we randomly chose N classes from the dataset and then collected (K + L) images from each class. These images were then divided into two disjoint subsets: a support set of size N × K and a query set of size N × L. We used N = 10, K = 5, and L = 15 in all the tests. Our model was trained by the ADAM optimizer [37] with a learning rate of 0.001. The regularization term (the KL divergence of Equation (5)) and dropout operation are only used in the meta-learning stage. The weight coefficient α and the drop rate are set to 0.0001 and 0.5, respectively. All the experiments were carried out on a computer configured with Intel i5-8400 CPU, GeForce GTX 1080Ti GPU and 16 GB RAM. The model needed to be trained for 14,000 iterations in the meta-learning stage and 300 iterations in the updating phase, and each iteration took 0.19 s. In the testing stage, it took 0.002 s for the model to recognize each target image.
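The N-way, K-shot task construction described above can be sketched as follows (the function name and data layout are assumptions for illustration):

```python
import random

def make_task(dataset, n_way=10, k_shot=5, l_query=15, seed=None):
    """Build one N-way, K-shot task with disjoint support/query splits.

    `dataset` maps a class label to a list of images (any objects).
    N classes are drawn at random; from each, K + L images are sampled
    without replacement and split into support (N*K) and query (N*L).
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)         # choose N classes
    support, query = [], []
    for c in classes:
        imgs = rng.sample(dataset[c], k_shot + l_query)  # K + L per class
        support += [(x, c) for x in imgs[:k_shot]]       # support set
        query += [(x, c) for x in imgs[k_shot:]]         # query set
    return support, query
```

With N = 10, K = 5, and L = 15 as in the text, each task has a support set of 50 samples and a query set of 150 samples.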

Datasets
The model was trained on simulated SAR data and then updated and tested on real SAR data. The real SAR data were collected by an SAR system operating in X-band (9.6 GHz), HH-polarization, and spotlight mode, in support of the Moving and Stationary Target Acquisition and Recognition (MSTAR) project. The target images were imaged over a 360° azimuth angle and had a spatial resolution of 0.3 × 0.3 m. A huge number of ground target images with various classes, azimuth angles, depression angles, and so on, were gathered. Some of these images are publicly available and are widely used as a benchmark for SAR-ATR testing [38]. Table 1 summarizes the publicly released MSTAR dataset, including ten classes, each with hundreds of images. The simulated dataset was generated and shared by Kusk et al. [30] at the Technical University of Denmark. The target's radar cross-section (RCS) was generated by electromagnetic computing software that takes the target's CAD model as input. The RCS was then passed to a postprocessing tool that models thermal noise, terrain clutter, and SAR focusing to produce simulated SAR images. The detailed simulation parameters are listed in Table 2. The simulated dataset includes seven vehicles: bulldozer, bus, car, hummer, motorbike, tank, and truck. Each vehicle contains two variants built from different CAD models. In our experimental setup, each variant was viewed as an independent class, thereby establishing a dataset containing images of fourteen types of targets. The size and resolution of the simulated images are consistent with those of the real images, but the imaging angles are more diverse. The simulated images were acquired at azimuth angles from 0° to 360° at 5° intervals and at a few depression angles. Table 3 lists the name, the CAD model, and the number of images per class in the simulated dataset. In each class, the number of images is the product of the number of azimuths and the number of depressions.

Reference Methods
To quantitatively evaluate the model, we employed several state-of-the-art recognition methods as references, which are summarized in Table 4. Among these methods, class-dependent structure preserving projection (CDSPP) and kernel robust locality discriminant projection (KRLDP) are based on discriminant projection, kernel sparse representation (KSR) and tri-task joint sparse representation (TJSR) are based on sparse representation, and the rest are deep learning methods. The experimental results of CNN, transfer learning (TFL), probabilistic meta-learning (PML), MobileNet, and predicting parameters from activations (PPA) were obtained from our implementations. Their network structures are similar to that of our model. The results of other methods in Table 4 are cited directly from their papers.

Results under Standard Operation Conditions
Operation conditions (OCs), the working environment in which SAR sensors acquire images, have a significant impact on the recognition performance of the SAR-ATR system. In this experiment, we evaluated the model under standard operation conditions (SOCs), where the training and testing images were acquired under similar configurations, depression angles, etc. After the meta-learning, real images with depression angles of 17° and 15° were used to update and test the model, respectively. Our model was compared with several reference methods using different amounts of training data. Note that the term "training data" used in all the experiments refers to real SAR data to facilitate comparison with reference methods. Figure 5 shows that the recognition rates of all methods (our model, PML, PPA, TFL, and CNN) rise rapidly with increasing data, and then saturate when reaching a certain amount of training data.
Our model achieves a recognition rate of more than 95% with only 30% of the data, indicating that it has excellent data use efficiency. When 10% of the training data are used, the recognition rate of the model is 89.7%, compared with 88.7% for PML, 87.4% for PPA, 80.2% for TFL, and 75.9% for CNN. In recognition tasks with small data, our model outperforms other methods by large margins. With more data, the recognition performance of all methods improves, and our model consistently achieves the best performance. When using 100% training data, the recognition rate for our model is 97.9%, which is 0.3%, 1.5%, 1.8%, and 1.8% better than those of the competitors, PML, PPA, TFL, and CNN, respectively. CNN performs the worst because it is trained only on the real data. By transferring knowledge from simulated to real data, the recognition rate of TFL is higher than that of CNN. Methods that use the meta-learning framework (i.e., our model, PML, and PPA) are superior to TFL, especially when training data are scarce. By introducing posterior distributions over parameters, our model can deal with the uncertainty caused by small data and hence perform well in small data scenarios.
The model was also compared with the deep transferred atrous-inception synthetic aperture radar network (TAI-SARNET), TAI-SARNET with transfer learning (TAI-SARNET-TF), and MobileNet. These methods are lightweight network architectures that can be used for recognition in small-data scenarios. TAI-SARNET-TF1 transfers prior knowledge from optical data, TAI-SARNET-TF2 transfers prior knowledge from SAR data, and TAI-SARNET-TF3 transfers knowledge from mixed data. Table 5 summarizes the recognition results with small sample sizes.
The results of MobileNet were obtained from our implementation, while the results of TAI-SARNET and TAI-SARNET-TF were from [39]. Our model performs better than the competitors in small-data scenarios. When the proportion of training data is 1/2, the recognition rate of our model is 97.0%, which is 3.8%, 2.7%, 0.9%, 3.4%, and 5.5% higher than those of the competitors. When the proportion decreases to 1/32, our model surpasses the competitors by large margins. Finally, we compared the model with several SAR-ATR methods proposed in recent years. The recognition results in Figure 6 were obtained with 100% training data. The recognition rate of the model is marginally lower than those of the all-convolutional network (A-ConvNet) and the micro convolutional neural network (MCNN), but it is still higher than those of most reference methods. Although our model focuses on recognition tasks in small-data scenarios, it can realize performance improvement with more training data and achieves excellent results.
Sensors 2020, 20, x FOR PEER REVIEW 11 of 19 Finally, we compared the model with several SAR-ATR methods proposed in recent years. The recognition results in Figure 6 were obtained with 100% training data. The recognition rate of the model is marginally lower than those of all-convolutional network (A-ConvNet) and micro convolutional neural network (MCNN), but it is still higher than those of most reference methods. Although our model focuses on recognition tasks in small-data scenarios, it can realize performance improvement with more training data and achieves excellent results.

Results under Depression Angle Variations
In this test, the depression angles of training and testing images are markedly different, which is one of the extended operation conditions (EOCs). Figure 7 compares the target images at various depression angles. When the depression angles are not significantly different (17° versus 30°), the target shapes are similar, with only slight differences in scattering centers. When the depression angles differ noticeably (17° versus 45°), the target shape, scattering pattern, and even the speckle noise of the two images are different. Following the referenced methods, images of three targets (2S1, BRDM2, and ZSU234) were selected to evaluate the model. As shown in Table 6, the training set contains 890 images collected at a depression angle of 17°, and the test set contains 1778 images at depression angles of 30° and 45°.

We compared the recognition rates of five methods at a depression angle of 30°, where training and testing images are slightly different. Figure 8 plots the recognition rates under different proportions of training data. When 10% of the training data is used, the recognition rates of our model, PML, PPA, TFL, and CNN are 92.9%, 92.1%, 90.4%, 89.0%, and 87.9%, respectively. Using the complete training data, the recognition rates of the five methods increase to 96.5%, 96.0%, 95.7%, 95.7%, and 95.5%, respectively.
The recognition rates increase with the amount of training data, and our model is always better than the four competitors. As shown in Figure 9, at a depression angle of 45°, the recognition performance of all methods deteriorates dramatically. When the proportion of training data is 10%, the recognition rates of our model, PML, PPA, TFL, and CNN are 78.7%, 78.0%, 76.4%, 52.5%, and 56.8%, respectively. When the proportion increases to 100%, the recognition rates of the five methods increase to 82.1%, 79.3%, 78.6%, 55.6%, and 62.2%, respectively. A drastic change in the depression angle significantly modifies the target's appearance and hence leads to a huge difference between training and testing images. This difference makes the recognition methods gain little from the training data, thereby degrading their recognition performance. Despite using simulated SAR data, TFL performs worse than CNN trained only by the real data. One possible explanation is that the method of fine-tuning network parameters, which is used by TFL, is not suitable for this scenario.
Finally, using the complete dataset, we compared the model with several reference SAR-ATR methods, and their recognition rates are plotted in Figure 10. Note that KRLDP, MCNN, and A-ConvNet only provide recognition results at a depression angle of 30°. When the depression angle of the test images (30°) is close to that of the training images (17°), all of the recognition rates remain above 90%. Our model is superior to all reference methods except KRLDP. When the testing depression angle increases to 45°, all of the recognition rates decrease sharply. Our model obtains a recognition rate of 82.1%, which surpasses the competing methods by large margins. This superior performance is due to the meta-learning and simulated data used in the model. Unlike the reference methods, which are trained only by real images with scarce depression angles, our model is meta-trained by a large number of simulated images with multiple depression angles. The learned angle-invariant global parameters keep our model relatively robust under depression angle variations.

Results under Configuration Variations
In this experiment, the targets used for training and testing have different configurations. Configuration refers to small appearance modifications, such as adding or removing fuel barrels, side skirts, and smoke grenade launchers. On the battlefield, targets of the same type but with different configurations should be classified into the same class. This is a challenging task because two targets that differ only in configuration have similar scattering characteristics. As shown in Figure 11, target images in three configurations are generally similar except for slight differences in the position and intensity of scattering centers. We trained and tested the model with four ground targets, of which T72 and BMP2 have three configuration variants indicated by different serial numbers. For T72 and BMP2, targets with the serial numbers 132 and 9563 were used for training, while the remaining ones, with the serial numbers S7, 812, C21, and 9566, were used for testing. For T62 and BTR60, the training and testing images have the same configuration but different depression angles. Table 7 summarizes the serial numbers and the numbers of images used in this test.

Figure 11 illustrates the target images in different configurations (titled by their serial numbers); all targets have an azimuth angle of 90° and a depression angle of 17°. We compared the model with several reference methods and provide the results in Figure 12. For a fair comparison, all results were obtained from the full training data. The recognition rate of our model is 93.8%, compared with 93.2% for PML, 92.9% for PPA, 93.9% for KSR, 91.2% for TJSR, and 92.2% for KRLDP. Our model performs better than most methods but slightly worse than KSR. The experimental results verify that the model can effectively solve recognition problems under configuration variations.

Evaluation of Model Calibration
With a considerable number of network parameters, a CNN obtains high predictive accuracy, but it tends to be overconfident and poorly calibrated. An overconfident CNN assigns a high confidence score (i.e., the softmax output at the end of the network) to the wrong class for inputs it has not seen before. This makes it unsuitable for SAR-ATR systems, which must provide not only predictions but also calibrated confidence measures. Only when the model is well calibrated can we use the confidence score to judge the reliability of the recognition results; results with low confidence can then be passed to image analysts or supervisors for further inspection. Bayesian methods [31] offer a practical framework to address this shortcoming. Our model uses AVI to infer posterior distributions over the task-specific parameters instead of calculating a point estimate of them. The posterior captures the uncertainty of these parameters and results in a well-calibrated model.
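The Bayesian prediction step can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the linear classifier head, the diagonal-Gaussian posterior parameters, and all shapes are our assumptions. Sampling task-specific weights from the inferred posterior and averaging the softmax outputs typically yields softer, better-calibrated confidence scores than a single point estimate.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bayesian_predict(features, w_mean, w_logvar, n_samples=50, seed=0):
    """Average softmax predictions over classifier weights sampled from a
    diagonal-Gaussian posterior q(w) = N(w_mean, exp(w_logvar))."""
    rng = np.random.default_rng(seed)
    probs = np.zeros((features.shape[0], w_mean.shape[1]))
    for _ in range(n_samples):
        # reparameterization: w = mu + sigma * eps
        w = w_mean + np.exp(0.5 * w_logvar) * rng.standard_normal(w_mean.shape)
        probs += softmax(features @ w)
    return probs / n_samples  # Monte Carlo estimate of class probabilities

# toy usage: 4 test feature vectors (dim 8), 3 classes
feat = np.random.default_rng(1).standard_normal((4, 8))
p = bayesian_predict(feat, np.zeros((8, 3)), np.zeros((8, 3)))
```

Integrating over sampled weights in this way is what lets the confidence score reflect parameter uncertainty rather than a single deterministic estimate.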
First, we used reliability diagrams [41] to visually assess the model calibration. A reliability diagram plots the expected predictive accuracy against the confidence score: the more closely the bars align with the diagonal, the lower the calibration error of the model. Figure 13 shows the reliability diagrams of our model and PPA for different amounts of training data. In each subplot of Figure 13, the expected accuracies are lower than the confidence scores, indicating a tendency towards overconfidence in both methods. However, the gaps between the expected accuracies and confidence scores of our model are smaller than those of PPA. Our model provides a more calibrated confidence score by inferring posterior distributions over parameters.
Figure 13. Reliability diagrams of (a) our model with 100% data, (b) our model with 50% data, (c) our model with 10% data, (d) PPA with 100% data, (e) PPA with 50% data, and (f) PPA with 10% data.
Second, we used the expected calibration error (ECE) and the maximum calibration error (MCE) to quantify the model calibration. ECE and MCE represent, respectively, the weighted average deviation and the worst-case deviation between the expected accuracy and the confidence score of each bin. The lower the calibration error scores, the better the model calibration. Table 8 shows that both the MCE and ECE of our model are always lower than those of PPA under the three different proportions of training data. It can also be observed that providing more training data reduces MCE and ECE, thereby achieving better model calibration.
The above experimental results verify that our model performs better than PPA in model calibration. Our model uses posterior distributions to capture the randomness of the task-specific parameters, while PPA treats these parameters as deterministic values. Integrating over the posteriors leads to a well-calibrated model.
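The ECE and MCE computations can be sketched in a few lines. This is a standard equal-width binning implementation; the 10-bin choice and the function names are our assumptions, not taken from the paper.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Compute ECE and MCE by grouping predictions into equal-width
    confidence bins. `confidences` holds max-softmax scores in (0, 1];
    `correct` is 1 where the prediction matched the true label."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += (mask.sum() / n) * gap   # weighted average deviation
        mce = max(mce, gap)             # worst-case deviation
    return ece, mce

# overconfident example: 95% confidence but only 25% accuracy
ece, mce = calibration_errors([0.95] * 4, [1, 0, 0, 0])
print(ece, mce)  # both 0.7: the single occupied bin has a 0.7 gap
```

A well-calibrated model places each bin's accuracy near its mean confidence, driving both scores towards zero.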

Models with Different Network Structures
This section discusses the recognition results of models with different network structures. We designed eight different networks, summarized in Table 9, where AVG denotes an average-pooling operation. The networks differ in the number of convolutional kernels and in whether average pooling is used at the end of the feature extractor. The last four rows of Table 9 show the implementation details of the weight predictor. Table 10 compares the recognition results of the different networks. All of the results were obtained under the SOC experimental setup. Networks using average pooling (B, D, F, H) are inferior to networks without it (A, C, E, G), suggesting that average pooling degrades the recognition performance. Average pooling reduces the dimensions of the feature vectors, which determine the number of hidden units in the FC layers. The capacity of the weight predictor shrinks as the number of hidden units decreases, thus degrading the recognition performance. Compared with average pooling, the number of convolutional kernels has less impact on recognition performance. Network C, used by our model, achieves the best result.
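The capacity argument can be made concrete with a small sketch. The shapes here are illustrative assumptions (a 64-channel, 6×6 feature map), not the paper's exact architecture: global average pooling collapses each H×W map to one value, shrinking the flattened feature vector, and hence the FC layers of the weight predictor, by a factor of H·W.

```python
import numpy as np

def feature_dim(channels, height, width, use_avg_pool):
    """Dimension of the flattened feature vector fed to the FC layers."""
    return channels if use_avg_pool else channels * height * width

def global_avg_pool(fmap):
    """Collapse a (C, H, W) feature map to a length-C vector."""
    return fmap.mean(axis=(1, 2))

# hypothetical feature-extractor output: 64 maps of size 6x6
fmap = np.ones((64, 6, 6))
print(feature_dim(64, 6, 6, use_avg_pool=False))  # 2304 inputs to the FC layers
print(feature_dim(64, 6, 6, use_avg_pool=True))   # 64: a 36x reduction in width
print(global_avg_pool(fmap).shape)                # (64,)
```

With 36 times fewer inputs, every FC layer of the weight predictor has proportionally fewer parameters, which is one plausible reading of the capacity loss observed for networks B, D, F, and H.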

Recognition Results under Different Amounts of Simulated Data
In this section, we analyze the influence of the simulated data on recognition performance. We randomly selected 20%, 40%, 60%, 80%, and 100% of the images from the simulated dataset to construct small datasets. During meta-learning, these small simulated datasets were used to train the model. After meta-learning, we used real SAR images at a depression angle of 17° to update the model and real images at the other depression angles (15°, 30°, 45°) to test it. Table 11 lists the test results for the different small simulated datasets. When using 100% of the simulated data, the recognition rates of the model at 15°, 30°, and 45° are 97.9%, 96.5%, and 82.1%, respectively. When the proportion drops to 20%, the recognition rates at the three test angles are reduced by 1.9%, 2.1%, and 5.3%, respectively. The recognition rate of the model decreases as the amount of simulated data decreases: the smaller the simulated dataset, the fewer azimuth angles, depression angles, and classes it contains. The model cannot learn enough prior knowledge from such a small dataset, which degrades its recognition performance. To further improve the recognition performance of the model, we should construct a complete simulated dataset. The dataset must contain a variety of target images, each covering complete azimuth and depression angles. In addition, the simulated images should have various ground clutters and speckle noises.
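The subsampling procedure described above can be sketched as follows. This is a hypothetical helper, assuming the simulated dataset is organized as per-class image lists; sampling each class independently keeps the class proportions of the reduced datasets close to the original.

```python
import random

def subsample_per_class(dataset, proportion, seed=0):
    """Randomly keep `proportion` of the images of every class so the
    reduced simulated dataset preserves the original class balance."""
    rng = random.Random(seed)
    subset = {}
    for cls, images in dataset.items():
        k = max(1, round(len(images) * proportion))  # keep at least one image
        subset[cls] = rng.sample(images, k)
    return subset

# toy dataset: class name -> list of image ids
sim = {"tank": list(range(100)), "truck": list(range(80))}
small = subsample_per_class(sim, 0.2)
print(len(small["tank"]), len(small["truck"]))  # 20 16
```

Fixing the random seed makes the 20%–100% subsets reproducible across meta-learning runs, which matters when comparing recognition rates between proportions.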

Conclusions
Recognition with small data has been a daunting problem in SAR-ATR because collecting sufficient real SAR data is difficult. In this paper, we propose a model incorporating meta-learning and AVI, which realizes knowledge transfer from simulated data to real data. With meta-learning and simulated SAR data, our model can recognize novel targets using small amounts of real SAR data. Moreover, inferring the posterior distributions with AVI allows the model to provide calibrated confidence scores in addition to predictions. Extensive experiments verify that our model obtains state-of-the-art results, especially in small-data scenarios.