Classification of PolSAR Image Using Neural Nonlocal Stacked Sparse Autoencoders with Virtual Adversarial Regularization

Abstract: Polarimetric synthetic aperture radar (PolSAR) has become increasingly popular in the past two decades, for it can derive multichannel features of ground objects, which contain more discriminative information than traditional SAR. In this paper, a neural nonlocal stacked sparse autoencoder with virtual adversarial regularization (NNSSAE-VAT) is proposed for PolSAR image classification. The NNSSAE first extracts nonlocal features by calculating the pairwise similarity of each pixel and its surrounding pixels using a neural network, which contains a multiscale feature extractor and a linear embedding layer. This feature extraction process relieves the negative influence of speckle noise and extracts discriminative nonlocal spatial information without carefully designed parameters. Then, the SSAE maps the center pixel and the extracted nonlocal features into a deep latent space, in which a Softmax classifier is utilized to conduct classification. Virtual adversarial training is introduced to regularize the network, which keeps the network from overfitting and makes the model more robust. The experimental results derived from three real PolSAR images show that the proposed NNSSAE-VAT method is robust and effective and achieves competitive performance compared with related methods.


Introduction
Synthetic Aperture Radar (SAR), as a kind of active imaging sensor, has the ability to monitor a wide area without the limitations of lighting conditions and weather. One SAR-based technique, namely Polarimetric SAR (PolSAR), not only inherits the advantages of SAR but also possesses a multi-polarization characteristic, by which the sensor can extract more spatial information. Due to these advantages, the analysis of PolSAR images has received extensive attention in the past decades. Combined with machine learning techniques, PolSAR has proved to be an efficient tool in many research fields, such as terrain classification, agriculture monitoring [1,2], disaster monitoring [3,4], target recognition and detection [5], and sea ice monitoring [6]. In this paper, we focus on PolSAR image classification. Many effective machine-learning approaches have been proposed in this field, such as the Wishart classifier [7,8], support vector machines (SVM) [9,10], sparse representation based classification (SRC) [11,12], random forests [13], and so on.
1. We propose the neural nonlocal feature extractor (NNFE), which introduces neural nonlocal similarity calculation into the PolSAR image classification task to extract nonlocal spatial information around each pixel. The nonlocal spatial features generated by NNFE are automatically optimized, which improves the robustness and the classification accuracy of the model.

2. In NNFE, we introduce a multiscale spatial information extractor implemented as convolution layers, which maps every pixel in image patches into a latent space and considers both their characteristics and their spatial information.

3. A virtual adversarial training regularization term is introduced into the loss function to further improve the classification accuracy and the generalization ability of the model.

The NNSSAE is validated on three real datasets, and the results show that our model achieves higher performance compared with other related models.

Nonlocal Feature Extraction Methods
Nonlocal methods utilize the spatial information of the pixels around a center pixel to construct its characteristics. They have many variants and a wide range of applications. Three nonlocal methods are introduced in this subsection. Considering that many symbols are easily confused with each other, we append the lower-case abbreviation of the corresponding method to the end of each symbol's subscript to indicate which method the symbol belongs to.
NLM is a widely used noise-reduction method that calculates a weighted sum by measuring the spatial similarity between the central pixel and its surrounding pixels. To unify the symbols used in the paper, all terms follow the ANSSAE [32] with minimal modifications. Every pixel in the PolSAR image is the center of three windows: the search window, the central window, and the neighborhood window, which are shown in Figure 1. Figures 1 and 2 consider the circumstance in which the search window is centered at the border of three different kinds of ground objects.

Figure 1. The diagram of the NLM [31] method. The olive box, the orange box, and the brown box denote the search window, central window, and neighborhood window, respectively. To concisely describe the main idea of NLM, the left-top pixel and the corresponding search window are taken as an example. Different colors of pixels represent different categories of ground objects.
Figure 2. The diagram of the adaptive nonlocal feature extractor (ANFE) [32]. The olive box, the orange box, and the brown box denote the search window, central window, and neighborhood window, respectively. To concisely describe the main idea of ANFE, the left-top pixel in the central window and the corresponding search window are taken as an example. Different colors of pixels represent different categories of ground objects.
Every pixel in the search window S(i) influences the noise-reduction estimate of the center pixel x_i. The central window of x_i is denoted as C(i), in which every pixel has a corresponding neighborhood window N(t), t ∈ C(i). The weighted sum of the pixels in C(i) equals the nonlocal-means-filtered center pixel x_{i,nlm}:

x_{i,nlm} = Σ_{j∈C(i), j≠i} ω_{ij,nlm} · x_j,   (1)
where the weights ω_{ij,nlm} are simultaneously normalized and have values between 0 and 1, i.e.,

Σ_{j∈C(i), j≠i} ω_{ij,nlm} = 1,   0 ≤ ω_{ij,nlm} ≤ 1.   (2)

The pixels in the neighborhood window N(t), t ∈ C(i), are the fundamental elements participating in the weights calculation of the central window C(i). The weights calculation for pixel i and pixel j is equivalent to calculating the normalized similarity metrics of the pixels in N(i) and N(j). The distance S_{ij,nlm} between N(i) and N(j) is defined by the Gaussian-weighted Euclidean distance, which is shown in Equation (3):

S_{ij,nlm} = ‖N(i) − N(j)‖²_{2,σ},   (3)
where j ∈ C(i) and j ≠ i. σ denotes the variance of the Gaussian kernel. A smaller S_{ij,nlm} indicates that N(i) and N(j) are closer under the Gaussian-weighted Euclidean distance criterion, and leads to a larger value of ω_{ij,nlm}. The calculation of ω_{ij,nlm} is shown in Equation (4):

ω_{ij,nlm} = exp(−S_{ij,nlm}/h²) / Q_{ij,nlm},   (4)
where Q_{ij,nlm} = Σ_{j∈C(i), j≠i} exp(−S_{ij,nlm}/h²) is the normalization scaling factor, and h denotes the filtering parameter.

ANFE is the feature extractor of the ANSSAE proposed by Reference [32]. It introduces a hard-decision vector →M_i into the original NLM method. To avoid ambiguities in the illustration of ANFE, let x_k be a pixel in N(i), where k ∈ N(i) and k ≠ i, and let m_{ik} be the k-th binary element of the →M_i vector. At the border of different land types, pixels in N(i) that have a different label from x_i are denoted as x_k⁻. Taking x_k⁻ into the weight calculation of NLM can lead to an unexpected weight decrease for pixels that are far away from the border but have the same label as x_i. To relieve this undesirable effect, →M_i is adopted to exclude the singular pixels x_k⁻ in N(i) from taking part in the similarity calculation, which is depicted in Figure 2.
The binary element m_{ik} is set to 0 when x_k is decided to be a singular pixel x_k⁻ and to 1 otherwise, where λ is a parameter related to the degree of the singular-pixel decision, which needs expert knowledge to be tuned. The modified similarity calculation masks the neighborhood windows with →M_i = (m_{i1}, m_{i2}, m_{i3}, ..., m_{iK}) before computing the Gaussian-weighted distance of Equation (3), where K is the number of pixels in the neighborhood window.

NLN is a kind of neural network that introduces the main idea of NLM into deep learning, enabling networks to capture the long-range correlations in feature maps [33]. As depicted in Figure 3, the weighted sum of the original features applied in nonlocal neural blocks generally follows the form of NLM shown in Equation (1), while the weight calculation takes learned features into consideration, i.e.,

ω_{ij,nln} = exp[θ(x_i)ᵀ · φ(x_j)] / Q_{i,nln},

where x_i and x_j are features of the previous layer, and θ(·) and φ(·) are linear embeddings. The nonlocal operation is implemented in the form of an exponential dot product, which is named "the embedded Gaussian form". Q_{i,nln} = Σ_{j∈C(i), j≠i} exp[θ(x_i)ᵀ · φ(x_j)] denotes the normalization term; thus, the equation applies a softmax operation to θ(x_i)ᵀ · φ(x_j). Then, the weighted sum y_{i,nln} equals the calculated weights multiplied by the corresponding linear embeddings g(x_j):

y_{i,nln} = Σ_{j∈C(i), j≠i} ω_{ij,nln} · g(x_j).

The output of the nonlocal operation equals the original feature x_{i,nln} plus the linear embedding h(y_{i,nln}) of y_{i,nln}. Figure 3 shows the structure of the nonlocal block.
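As a concrete illustration, the embedded-Gaussian weighting and residual output described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the learned linear embeddings θ, φ, g, h are stood in for by plain random matrices (the names `W_theta`, `W_phi`, `W_g`, `W_h` are illustrative), and the self term j = i is kept, as in standard NLN.

```python
import numpy as np

def nonlocal_block(x, W_theta, W_phi, W_g, W_h):
    """Embedded-Gaussian nonlocal operation (sketch).

    x: (n, c) array of n pixel features from the previous layer.
    W_theta, W_phi, W_g, W_h: stand-ins for the learned linear embeddings.
    """
    theta = x @ W_theta            # theta(x_i)
    phi = x @ W_phi                # phi(x_j)
    g = x @ W_g                    # g(x_j)
    logits = theta @ phi.T         # pairwise dot products theta(x_i)^T . phi(x_j)
    # exponential dot product with row-wise normalization, i.e., a softmax over j
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    y = weights @ g                # weighted sum of embeddings, y_i
    return x + y @ W_h             # residual output: x_i plus h(y_i)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W = [rng.normal(size=(4, 4)) * 0.1 for _ in range(4)]
out = nonlocal_block(x, *W)
```

Because the weights in each row are a softmax, they are non-negative and sum to one, matching the normalization term Q above.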

Virtual Adversarial Training Regularization
Applying deep models to small-size datasets often risks overfitting, which degrades the performance of the models during validation. Virtual adversarial training (VAT) is proposed to relieve the negative influence of overfitting and strengthen the robustness of models by making them insensitive to virtual adversarial disturbance. A brief implementation method and the solution to the virtual adversarial disturbance r_vadv are presented in this subsection. The main idea of VAT is depicted in Figure 4.

Virtual adversarial training is derived from adversarial training proposed by Reference [37]. The loss term of the original adversarial training is presented as follows:

L_adv(x_in, θ) = KL[ q(y | x_in), p(y | x_in + r_adv, θ) ],

where q(y | x_in) indicates the real label output distribution given the corresponding input sample x_in, and p(y | x_in + r_adv) denotes the distribution of the model output given the adversarially disturbed input x_in + r_adv.

KL[·,·] represents the Kullback-Leibler divergence. The aim of this loss term is to make the output distribution close to the real output distribution. However, this distribution is usually unknown. In [37], q(y | x_in) is simplified by the one-hot label vector h(y | x_in), which cannot be applied to unlabeled datasets. VAT [35,36] adopts the current estimation of the model output distribution to approximate q(y | x_in), i.e.,

L_vadv(x_in, θ) = KL[ p(y | x_in, θ̂), p(y | x_in + r_vadv, θ) ],

where θ denotes the model parameters and θ̂ represents the current state of the model parameters. r_vadv denotes the virtual adversarial disturbance given by input x_in and model parameters θ̂. References [35,36] also propose a fast computation method for r_vadv. We denote the KL divergence as D_div for simplification. Assume that the first and second order derivatives of D_div with respect to r always exist. As D_div reaches its minimum value when r = 0, the first order derivative of D_div with respect to r equals 0 at r = 0. Thus, the second order Taylor expansion of D_div is

D_div(r, x_in, θ̂) ≈ (1/2) rᵀ H(x_in, θ̂) r,

where H(x_in, θ̂) denotes the Hessian matrix of D_div with respect to r. Based on the assumption given above, r_vadv will be close to the first dominant eigenvector of H(x_in, θ̂) with amplitude ε:

r_vadv ≈ arg max_{r: ‖r‖₂ ≤ ε} { rᵀ H(x_in, θ̂) r } = ε · ū(x_in, θ̂),   (16)
where v̄ = v/‖v‖₂ is the direction vector of v and u(x_in, θ̂) is the dominant eigenvector. By using the power iteration and finite difference methods, the direction vector of this eigenvector can be approximated iteratively as follows:

d ← ḡ,   g = ∇_r D_div(r, x_in, θ̂) |_{r = ξd},   (17)

where d is a random vector before the first iteration and is updated iteratively T_pi times, and ξ is a small positive number used in the finite difference method. In this paper, a single iteration is used to fit the eigenvector.
Then, r_vadv can be derived by substituting the final d into Equation (16) to replace ū(x_in, θ̂). For more details, refer to References [35,36].
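The power-iteration procedure above can be sketched on a toy model. This is a minimal sketch under stated assumptions, not the paper's implementation: the model is a simple linear-softmax classifier (names `x`, `W` are illustrative), and the gradient of the KL divergence with respect to r is computed analytically for this toy model, standing in for the backpropagation/finite-difference step of Equation (17).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vat_perturbation(x, W, eps_len=1.0, xi=1e-6, t_pi=1, seed=0):
    """Power-iteration approximation of r_vadv for p(y|x) = softmax(x W).

    For this toy model, grad_r KL[p(.|x), p(.|x + r)] evaluated at r = xi*d
    equals W (q - p), where q is the prediction on the perturbed input.
    """
    p = softmax(x @ W)                   # current prediction p(y | x, theta_hat)
    rng = np.random.default_rng(seed)
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)               # random unit direction before iterating
    for _ in range(t_pi):                # power iteration, Equation (17)
        q = softmax((x + xi * d) @ W)    # prediction on the xi-shifted input
        g = W @ (q - p)                  # analytic grad_r of the KL at r = xi*d
        d = g / (np.linalg.norm(g) + 1e-12)
    return eps_len * d                   # Equation (16): r_vadv = eps * d_bar

x = np.array([0.5, -0.3])
W = np.array([[1.0, -0.5, 0.2],
              [0.3,  0.8, -1.0]])
r_vadv = vat_perturbation(x, W, eps_len=0.5)
```

The returned disturbance has length ε by construction, and one iteration (T_pi = 1) already aligns d with the dominant eigenvector direction, as used in the paper.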

Proposed Method
In this paper, an end-to-end deep neural network called NNSSAE is proposed for PolSAR image classification. The proposed model contains two types of deep modules, i.e., a multiscale similarity-based feature extractor and a stacked sparse autoencoder (SSAE). The feature extractor, named the neural nonlocal feature extractor (NNFE), is composed of feedforward neural networks and calculates the weighted sum of the PolSAR image patch. SSAE takes the output of NNFE as input to project the original pixel features and nonlocal features into a high-dimensional latent space. A Softmax classifier is introduced to conduct the final classification task. Meanwhile, we introduce a VAT regularization term in the loss function to relieve overfitting and achieve higher generalization ability. The whole framework of the model is depicted in Figure 5.


Input Feature Preparation
Every pixel in PolSAR data is represented as a 2 × 2 scattering matrix S that indicates the polarimetric scattering characteristics of the ground objects. The matrix S is given by

S = [ s_hh  s_hv ;  s_vh  s_vv ],

where s_hv is the element that is horizontally transmitting and vertically receiving polarized, and the definitions of the other three elements can be derived in a similar manner. Under the assumption that s_hv = s_vh, the scattering vector is given by

h = [ s_hh, √2·s_hv, s_vv ]ᵀ.

Following Reference [38], a covariance matrix of an n-look image is represented by

C = (1/n) Σ_{i=1}^{n} h_i h_iᴴ,   (21)

where h_i denotes the i-th one-look sample, and H represents the Hermitian transpose. From [38], we can derive the coherency matrix T from Equation (21).
Since the coherency matrix T is known to be a Hermitian matrix, we simply take the upper triangular part of the matrix as the input of our model. To avoid introducing complex numbers into our model, we separate the real and imaginary parts of the complex elements, i.e., we have data = [T11, Re(T12), Im(T12), Re(T13), Im(T13), T22, Re(T23), Im(T23), T33].
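This flattening step can be sketched as follows, assuming T is any 3 × 3 Hermitian NumPy array; the toy matrix below is built by averaging outer products h hᴴ of random one-look vectors, mirroring the n-look form of Equation (21).

```python
import numpy as np

def coherency_features(T):
    """Flatten a 3x3 Hermitian coherency matrix T into the 9-dimensional
    real input vector: upper triangle, with complex entries split into
    real and imaginary parts, following the data layout above."""
    return np.array([T[0, 0].real,
                     T[0, 1].real, T[0, 1].imag,
                     T[0, 2].real, T[0, 2].imag,
                     T[1, 1].real,
                     T[1, 2].real, T[1, 2].imag,
                     T[2, 2].real])

# toy Hermitian matrix: average of outer products h h^H over random
# one-look samples, as in the n-look covariance of Equation (21)
rng = np.random.default_rng(1)
looks = rng.normal(size=(8, 3)) + 1j * rng.normal(size=(8, 3))
T = sum(np.outer(h, h.conj()) for h in looks) / len(looks)
feat = coherency_features(T)
```

Because T is Hermitian, its diagonal is real and the lower triangle is redundant, so these 9 real numbers carry all of the matrix's information.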

Neural Nonlocal Feature Extractor
Nonlocal methods generally share the same core concept: calculating the weighted sum of the surrounding pixels to extract the nonlocal features of the center pixel. The difference between them lies in the weight calculation criteria. NNFE adopts a parameterized model to reveal the spatial features and uses the dot product of the feature vectors of the surrounding pixels and the center pixel as its weight calculation criterion. The parameters are updated by the back-propagation [39] algorithm, which relies on mining the information buried in the data rather than the careful selection adopted by hand-crafted algorithms. The structure of the NNFE is shown in Figure 6.
Before the neural nonlocal operation, a multiscale feature extraction operation is implemented on the original image patch S(i) to relieve the degradation effect that accompanies speckle noise. Inspired by NLN, we use the dot product rather than variants of the Euclidean distance, for the computational complexity of the former is negligible compared with the latter.
The multiscale feature extraction operation utilizes 1 × 1, 3 × 3 and 5 × 5 convolutional kernels, which is inspired by the Inception block [40]. These kernels of various scales can both effectively capture the discriminative features and substitute the complicated →M_i introduced by ANFE. The output of the 1 × 1 kernels represents the characteristics of the pixels themselves, which provides functionality like a soft version of →M_i. Then, features of different scales are concatenated to constitute the input of the nonlocal operation block, h_e. Through a 1 × 1 linear embedding layer, the feature of pixel i and the set of pixel features in the central window, h_e(j), j ∈ C(i) and j ≠ i, are used to calculate the corresponding weight set ω_{ij,nnfe}, j ∈ C(i) and j ≠ i, by the exponential dot product:

ω_{ij,nnfe} = exp[h_e(i) · h_e(j)] / Q_{i,nnfe},

where Q_{i,nnfe} = Σ_{j∈C(i), j≠i} exp[h_e(i) · h_e(j)] denotes the normalization factor. The weight set is then used for the weighted-sum feature calculation:

x_{i,nnfe} = Σ_{j∈C(i), j≠i} ω_{ij,nnfe} · x_j.

Apparently, the nonlocal feature vector x_{i,nnfe} has the same dimension as the original pixel vector x_i.
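A minimal NumPy sketch of this weight and feature calculation follows, with `h_e` standing in for the embedded features produced by the multiscale extractor; the window sizes and dimensions are illustrative, not the paper's settings.

```python
import numpy as np

def nnfe_weights(h_e, i):
    """Exponential-dot-product weights of NNFE (sketch). h_e is an (n, c)
    array of embedded features for the n pixels of the central window C(i);
    i indexes the center pixel. Returns weights over j in C(i), j != i."""
    logits = h_e @ h_e[i]                 # h_e(i) . h_e(j) for every j
    logits = np.delete(logits, i)         # exclude j = i
    w = np.exp(logits - logits.max())     # numerically stable exponentiation
    return w / w.sum()                    # normalize by Q_i,nnfe

def nnfe_feature(x, h_e, i):
    """Weighted sum over the original central-window pixels, so the
    nonlocal feature keeps the dimension of the original pixel vector."""
    w = nnfe_weights(h_e, i)
    return w @ np.delete(x, i, axis=0)

rng = np.random.default_rng(2)
x = rng.normal(size=(9, 9))      # 3x3 central window of 9-dim pixel vectors
h_e = rng.normal(size=(9, 5))    # stand-in embeddings from the extractor
f = nnfe_feature(x, h_e, i=4)
```

Note that the weighted sum is taken over the original pixel vectors, not the embeddings, which is exactly why x_{i,nnfe} keeps the original pixel dimension.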
Considering that NNFE is fully parameterized and thus needs to be optimized, and that SSAE needs initial feature vectors to conduct unsupervised pre-training, an auxiliary classifier based on Softmax regression is introduced to train NNFE to a proper initial state.
The main improvement of nonlocal blocks over traditional neural networks is that they are capable of analyzing the pairwise relationships of all pixels in the feature maps. By introducing weights based on adaptive similarity calculation, both NLN and NNFE have a larger field of view and are able to extract nonlocal features. The differences between the nonlocal operation introduced in NLN and the nonlocal operation proposed in this paper are twofold: First, the similarity calculation of NNFE focuses on the center pixel of the nonlocal feature maps, rather than on every pixel in the feature maps as in NLN. Second, the nonlocal feature vector extracted by NNFE is calculated simply by applying the weighted sum to the original central window, as this makes the nonlocal feature vector have a distribution similar to that of the original pixel's feature vector.

Stacked Sparse Autoencoders
A two-layer SSAE is implemented to map the original and the nonlocal features into latent space and finally conduct classification on optimized features using a Softmax classifier.
The training procedure of SSAE contains two steps: pre-training and fine-tuning. Assume that the SSAE model is composed of S autoencoders. The structure of the autoencoder is shown in Figure 7. At the pre-training stage, the input of the s-th autoencoder is denoted as m^s, s = 1, ..., S. This input is encoded as a hidden representation h^s and then decoded as a reconstructed representation y^s:

h^s = σ(W_e^s m^s + b_e^s),   y^s = σ(W_d^s h^s + b_d^s),
where W_e^s, b_e^s are the weight matrix and bias vector of the encoder, and W_d^s, b_d^s are the weight matrix and bias vector of the decoder. σ(·) represents a nonlinear function; here we use the sigmoid function, which has the expression σ(x) = 1/(1 + exp(−x)). The loss function of this sparse autoencoder is shown as follows:

L^s = (1/N) Σ_{n=1}^{N} ‖m_n^s − y_n^s‖₂² + γ (‖W_e^s‖₂² + ‖W_d^s‖₂²) + β Σ_{t=1}^{K_s} KL(ρ ‖ ρ̂_t^s),

where N is the number of training samples in a mini-batch, γ represents the scaling factor of the weight regularization term, and β is a scaling factor controlling the strength of the sparse penalty. ρ is the sparse parameter, usually holding a small value. ρ̂_t^s denotes the averaged response on this specific mini-batch of the t-th unit in the s-th autoencoder, and K_s is the number of units in the s-th autoencoder. KL(·‖·) represents the Kullback-Leibler divergence of two distributions. Minimizing the sparse penalty term in the loss function makes the average response of the hidden-layer units stay close to the small value ρ, i.e., close to zero. When the s-th autoencoder is trained, its hidden representation h^s becomes the input of the (s + 1)-th autoencoder, m^{s+1}; note that the input of the first autoencoder, m^1, is the input of the SSAE.
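The pre-training loss can be sketched as follows. This is an illustrative sketch: the relative weighting of the three terms (e.g., any 1/2 factors) is an assumption, not the paper's exact formulation, and the default γ, β, ρ values mirror the settings reported later in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_loss(m, We, be, Wd, bd, gamma=0.005, beta=0.02, rho=0.15):
    """Sparse-autoencoder pre-training loss (sketch): mean reconstruction
    error + weight decay (gamma) + Bernoulli-KL sparsity penalty (beta, rho)."""
    h = sigmoid(m @ We + be)                  # encoder output h^s
    y = sigmoid(h @ Wd + bd)                  # decoder reconstruction y^s
    recon = np.mean(np.sum((m - y) ** 2, axis=1))
    decay = gamma * (np.sum(We ** 2) + np.sum(Wd ** 2))
    rho_hat = np.clip(h.mean(axis=0), 1e-8, 1 - 1e-8)  # averaged unit response
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return recon + decay + beta * np.sum(kl)

rng = np.random.default_rng(3)
m = rng.uniform(size=(16, 9))                 # a mini-batch of 9-dim inputs
We, Wd = rng.normal(size=(9, 6)) * 0.1, rng.normal(size=(6, 9)) * 0.1
loss = sae_loss(m, We, np.zeros(6), Wd, np.zeros(9))
```

Each KL term is non-negative and vanishes only when a unit's average activation equals ρ, which is what drives the hidden responses toward the small target value.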

When the last autoencoder finishes its pre-training process, a Softmax classifier is connected after the hidden representation of the last autoencoder to predict the label distribution of every sample. The expression of the Softmax classifier is

p(y = k | h^S) = exp(w_kᵀ h^S + b_k) / Σ_{t=1}^{K_{S+1}} exp(w_tᵀ h^S + b_t),   k = 1, ..., K_{S+1},

where K_{S+1} is the number of categories contained in the dataset.

Loss Function in the Fine-Tuning Stage and the VAT Regularization Term
At the fine-tuning stage, virtual adversarial training and the supervised cross-entropy loss are combined to guide the back-propagation algorithm in optimizing the parameters of the model. We denote the dataset as {data_i} and the set of samples perturbed by VAT as {data_r}. The loss function at the fine-tuning stage is shown in Equation (28):

Loss = −(1/N) Σ_{i=1}^{N} l_iᵀ log(l̂_i) + α · L_vadv({data_r}, θ).   (28)
The first term is the cross-entropy term, which represents the difference between the distribution of the model predictions on the training samples and the distribution of the corresponding labels. In Equation (28), l_i represents the one-hot label vector and l̂_i represents the model prediction distribution. This term aims to make the predictions close to the labels, thus yielding better accuracy. The second term is the VAT regularization term, which measures the flatness of the model prediction distribution in the high-dimensional latent space around each training sample. The purpose of this term is to smooth the output distribution around the input data and make the model more robust.
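A minimal sketch of this combined objective, taking the model predictions on clean and VAT-perturbed inputs as given arrays (all names are illustrative; in the real model these come from forward passes through the network):

```python
import numpy as np

def finetune_loss(labels, preds, preds_vat, alpha=1.0, eps=1e-12):
    """Fine-tuning objective (sketch of Equation (28)): cross-entropy on the
    labeled samples plus alpha times the VAT term, i.e., the KL divergence
    between predictions on clean and on VAT-perturbed inputs."""
    ce = -np.mean(np.sum(labels * np.log(preds + eps), axis=1))
    vat = np.mean(np.sum(preds * (np.log(preds + eps) - np.log(preds_vat + eps)),
                         axis=1))
    return ce + alpha * vat

labels = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels l_i
preds = np.array([[0.9, 0.1], [0.2, 0.8]])    # model predictions l_hat_i
loss_clean = finetune_loss(labels, preds, preds)  # VAT term vanishes here
```

When the perturbed predictions equal the clean ones, the KL term is exactly zero and the loss reduces to the plain cross-entropy, which illustrates that VAT only penalizes sensitivity to the disturbance.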
The calculation of r vadv is concluded in Algorithm 1.

Algorithm 1. Calculation of Virtual Adversarial Training Disturbance
Input: Mini-batch size N, dataset {data_i};
Step 1: Randomly select N samples from the dataset {data_i} to construct a mini-batch.
Step 2: Generate a random unit vector d^n for each sample and update it T_pi times using the power iteration in Equation (17).
Step 3: Calculate r^n_vadv using Equation (16), substituting the final d̄^n for ū, on each sample.
Output: Virtual adversarial disturbance r_vadv.

Training Strategy
Algorithm 2 summarizes the details of the training process. The training procedure mainly contains three stages. First, in order to provide coarse nonlocal features to SSAE, the NNFE is pre-trained in a supervised manner; only samples in the training set are used in this stage. Then, SSAE takes both the center pixels and the initialized nonlocal feature vectors as input and extracts their latent information by unsupervised layerwise reconstruction training. Finally, the whole framework is fine-tuned in a supervised manner using the training samples with the regularization of VAT.

Algorithm 2. The training strategy of NNSSAE-VAT
Input: PolSAR Image, label map; Step 1: Select v% of samples per category from the labeled samples to construct training set.
Step 3: Use the auxiliary classifier to pretrain the NNFE with sample-label pairs by BP Algorithm.
Step 4: For the s-th autoencoder of the S autoencoders: train the autoencoder to reconstruct the input m^s, minimizing the L2 difference between the input m^s and the reconstructed result y^s.
Step 5: Use the main classifier to fine-tune the whole network with virtual adversarial training term in the loss function.

Experiments
In this section, three real PolSAR images are utilized to verify the effectiveness of the proposed method. The first dataset covering Flevoland, Netherlands, is obtained by AIRSAR platform in 1989.
The second dataset is the C-band Benchmark dataset acquired by the AIRSAR platform in 1991, covering another part of Flevoland. The third dataset is the full-polarimetric L-band image covering Foulum, Denmark, which was obtained by the EMISAR platform in 1998. To restrain the negative effects brought by speckle noise, all three PolSAR images are preprocessed by refined Lee filtering [41], and the window size of the filter is set to 5 for fair comparison. We use Python/TensorFlow to conduct the experiments, which are implemented on a computer with an Intel(R) Core™ i7-4790K CPU, 16 GB RAM, and an NVIDIA GTX960 GPU.
The proposed NNSSAE-VAT is compared with four relevant methods: SSAE [23], SSAE-LS [26], SRDNN [30], and ANSSAE [32]. Since all methods are AE-based classification models, their hyperparameters are set the same as in the respective original papers. The metrics used for comparison include overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (K).

Networks Architecture Settings and Analyses
For SSAE and SSAE-LS, the number of the hidden units in the two hidden layers are both set to 300. For SRDNN, the two hidden layers are set to have 150 and 40 units. As for ANSSAE, it is set to have five layers and its detailed configuration for each dataset will be illustrated in the following subsections, respectively. The Softmax classifier is connected after the final hidden layer to conduct classification task.
For autoencoder-based PolSAR image classification models, the classification accuracy depends on the effectiveness of their feature extractors. Thus, the architecture of NNFE should be carefully designed in order to fully utilize the strong representation capability of SSAE and achieve better performance. To this end, we determine the architecture of NNFE by conducting experiments on all the three PolSAR data used in this paper. During this experiment, only the auxiliary classifier is used, which is connected directly to the output of the NNFE. NNFE contains two parts of parameterized layers, i.e., the multiscale feature extractor layer and the 1 × 1 linear embedding layer. The number of kernels in feature extractor layer is set to 30 and the number of kernels in 1 × 1 linear embedding layer is set to 40. The result of this experiment is shown in Figure 8.
In Figure 8, 'S', 'M', and 'L' represent kernels of scale 1 × 1, 3 × 3, and 5 × 5, respectively, and the symbol '+' indicates the concatenation of multiple scales. From the results, it is apparent that the NNFE with kernels of all three scales outperforms the other variants. In general, NNFEs with kernels of multiple scales perform better than NNFEs with single-scale kernels. Combined with kernels of scale 1 × 1, the 'S+M+L' module achieves higher accuracy than the original 'M+L' module. One possible reason for this is that the 1 × 1 kernels represent features of the pixels themselves, which act like a soft version of →M_i. The number of kernels in the multiscale feature extractor layer is set to 30 for the 1 × 1, 3 × 3 and 5 × 5 kernels, and the number of kernels in the 1 × 1 linear embedding layer is set to 40. We set the SSAE part to γ = 0.005, β = 0.02, ρ = 0.15, S = 2, and each layer has 220 hidden units throughout the experiments.

Parameter Settings for Virtual Adversarial Regularization Term
According to [35,36], the essential parameters of VAT are the regularization weight α, the perturbation length ε, a small value ξ, and the number of power iterations T_pi. In the original papers, the value of ξ is fixed to 1.0 × 10^−6, for it is merely a small value used in the finite difference method, a numerical approximation approach. The parameter T_pi is the number of power iterations performed on the random vector to make it converge to the dominant eigenvector of the Hessian matrix H. We observe no significant improvement when T_pi is larger than 1, so we fix T_pi to 1. The regularization weight α controls the importance of the VAT term relative to the cross-entropy term in the loss function, and the perturbation length ε is the core parameter controlling the length of the VAT perturbation r_vadv.
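The power-iteration estimate of r_vadv can be sketched for a toy linear-softmax model, for which the gradient of the KL divergence with respect to the perturbation has a closed form; a real network would obtain this gradient by backpropagation instead:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vat_perturbation(x, W, eps=1.0, xi=1e-6, t_pi=1, seed=0):
    """Power-iteration estimate of the virtual adversarial perturbation
    r_vadv for a toy model p(y|x) = softmax(W @ x). For this model the
    gradient of KL(p(.|x) || p(.|x+r)) w.r.t. r is W.T @ (q - p); evaluating
    it at r = xi*d approximates the Hessian-vector product used by VAT."""
    rng = np.random.default_rng(seed)
    p = softmax(W @ x)
    d = rng.standard_normal(x.shape)
    d /= np.linalg.norm(d)                 # random unit starting vector
    for _ in range(t_pi):
        q = softmax(W @ (x + xi * d))      # finite-difference probe
        g = W.T @ (q - p)                  # KL gradient w.r.t. r at xi*d
        d = g / (np.linalg.norm(g) + 1e-12)
    return eps * d                         # perturbation of length eps

W = np.array([[1.0, -0.5], [-0.3, 0.8], [0.2, 0.1]])  # toy 3-class weights
r = vat_perturbation(np.array([0.4, -0.2]), W, eps=2.0)
```

During training, the VAT loss term then measures the KL divergence between the outputs at x and at x + r_vadv.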
Considering that the intention of combining VAT with NNSSAE is to regularize the model and relieve the negative effects of overfitting, we simply set α = 1.0 and search for the ε that achieves the best results for the proposed model. The experimental results on the three datasets with different values of ε are shown in Figure 9.
From the results we can see that when ε lies in [0.5, 3.0] the model is able to achieve promising performance. We also observe that the performance degrades when ε is smaller than 0.5 or larger than 4.0. When ε is smaller than 0.5, the regularization effect is limited because the perturbed data points are too close to the original data points; that is, the regions around the data points are too small to smooth the corresponding output distribution. When ε is larger than 4.0, the regions around the data points are too large and may unexpectedly include data points of other classes. Hence, we suggest a general range of ε from 0.5 to 3.0 for the NNSSAE-VAT model.

Influences of Different Sizes of the Training Set
We conduct experiments that test different sizes of the training set for all three datasets. The sizes chosen are 1%, 2%, 3%, 5%, 7%, and 10%. The results are shown in Figure 10. We find that the Flevoland dataset is comparatively 'easier' than the C-band Benchmark and Foulum datasets, because the Flevoland dataset generally does not contain classes that are easily confused with each other, although it has the largest number of classes. As evidence, from Tables 1-3 we can see that the accuracy of SSAE on the Flevoland dataset is 89.75% using only 1% labeled samples, while the accuracies are 89.65% and 90.97% using 10% labeled samples when SSAE is applied to the other two datasets. For this reason, we consider 1% sufficient as the training-set size for the Flevoland dataset, and we set the training-set size of the other two datasets to 10%.
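The training/testing splits above can be reproduced with per-class random sampling. The paper does not state whether sampling is stratified by class, so this sketch assumes it is:

```python
import numpy as np

def stratified_split(labels, frac, seed=0):
    """Randomly sample a `frac` fraction of labeled pixels from each class
    for training (as in the 1%/10% splits above); the rest is for testing.
    `labels` is a flat array of class ids; returns two index arrays."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n = max(1, int(round(frac * idx.size)))  # at least one sample per class
        train_idx.append(idx[:n])
    train_idx = np.concatenate(train_idx)
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

labels = np.repeat(np.arange(5), 200)     # toy map: 5 classes, 200 pixels each
tr, te = stratified_split(labels, 0.01)   # 1% of each class for training
```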

Flevoland Dataset
The Flevoland image has 750 × 1024 pixels, which are labeled with 15 categories. The number of labeled pixels is 161,327. Figure 11a shows the Pauli-RGB pseudo color image, Figure 11b,c show the corresponding ground truth, and Figure 11d-i show the classification results. In the experiment, 1% of the labeled samples are randomly selected to construct the training set, and the rest is used for testing. For SRDNN, the number of superpixels is set to 8000.
The accuracy performance is reported in Table 1, from which we can see that NNSSAE and NNSSAE-VAT achieve higher accuracy than the other methods. In the absence of spatial information, the SSAE is highly sensitive to speckle noise, especially when the size of the training set is limited. The SSAE-LS significantly improves the classification performance, for involving the pixels surrounding the center pixel makes the training samples more discriminable. Nonlocal methods further improve the accuracy and reduce misclassified spots in large homogeneous regions. The SRDNN clusters pixels in the original PolSAR image using the intensity and the spatial distance of pixels; in other words, the performance of the model essentially depends on the segmentation of superpixels. The ANSSAE achieves better performance than SRDNN, for it extracts nonlocal information in a more precise pixel-wise manner. The proposed NNSSAE shows even better accuracy than the two aforementioned nonlocal methods. Using deep neural embedding features, NNSSAE extracts more robust nonlocal information for every pixel in the image, which proves to have better discriminative capability than ANSSAE.
Furthermore, the proposed NNSSAE-VAT improves the accuracy by about 0.6% compared with NNSSAE, which demonstrates the effectiveness of the VAT regularization when the amount of training data is limited.

Benchmark Dataset
The C-band Benchmark dataset has 700 × 750 pixels with 7 labeled categories. The number of labeled pixels is 203,223, which means this dataset has the smallest image size but the largest number of labeled samples. Figure 12a shows the Pauli-RGB pseudo color image and Figure 12b shows the corresponding ground truth. In the experiment, 10% of the labeled samples are randomly selected to construct the training set, and the rest is used for testing. For SRDNN, the number of superpixels is set to 6500. For ANSSAE, λ is set to 0.0005, and the number of hidden units in each layer is set to 150.
The classification performance is presented in Table 2, and the corresponding classification results are shown in Figure 12c-h. From Figure 12c, it is apparent that SSAE misclassifies a considerably large part of barley as wheat. Besides, lucerne is often confused with beet, and pea is highly confused with potato. The result of SSAE indicates that spatial information is crucial for classes that can only be discriminated by analyzing the context of spatial regions. The SSAE-LS significantly improves the accuracy, but its result is still inferior to the nonlocal methods, for its field of view is insufficient to capture more discriminative features and it uses fixed isotropic decaying weights instead of adaptive feature-driven weights. The ANSSAE achieves better accuracy than SRDNN because, rather than averaging over a superpixel, it computes a similarity-based weighted average for each pixel. The NNSSAE obtains the best result among the compared methods, which shows that the nonlocal features extracted by NNFE are more efficient. Regularized by VAT, the NNSSAE-VAT further improves the performance, indicating that VAT is capable of making neural networks more robust.

Foulum Dataset
The Foulum image has 750 × 1100 pixels, which are labeled with 6 categories. The number of labeled pixels is 169,386. Figure 13a shows the Pauli-RGB pseudo color image and Figure 13b shows the corresponding ground truth. In the experiment, 10% of the labeled samples are randomly selected to construct the training set, and the rest is used for testing. The experimental results are shown in Figure 13c-h and Table 3. In Table 3, dense crops, broad leaves crops, and small stem crops are denoted as D.c, B.l.c, and S.s.c, respectively. For SRDNN, the number of superpixels is set to 9000. For ANSSAE, the parameter λ is set to 0.00055, and the number of hidden units in each layer is set to 80.
As shown in Table 3, the proposed NNSSAE model achieves higher overall accuracy than the methods that do not use NNFE as their feature extractor. Note that since the SSAE takes the original image pixels as input, its performance is considerably degraded by speckle noise. From the classification result in Figure 13c, it can be observed that the buildings, the forest, and part of the dense crops are easily confused without spatial information. By involving local spatial information, the SSAE-LS performs better than the SSAE, improving the OA by about 5%. The nonlocal methods go further by extracting spatial features based on various similarity calculations. NNSSAE and ANSSAE show comparable results, which indicates the robustness of the NNSSAE, for it achieves competitive performance without carefully selected parameters. With the regularization of VAT, the NNSSAE-VAT further improves the OA and AA compared with the NNSSAE.


Discussion
Feature extraction and classification are two essential modules for PolSAR image classification; an efficient feature extractor and a robust classifier are strongly linked with good classification performance. The proposed NNSSAE-VAT method uses NNFE to capture multiscale spatial features and uses a similarity-based weighted sum to compute a discriminative nonlocal feature vector without manually designed expert parameters. The experimental results have shown the effectiveness of NNFE. SSAE is utilized as the classifier of the proposed method, for it projects the input features into a deep latent space and achieves good classification performance.
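The similarity-based weighted sum at the heart of the nonlocal feature can be illustrated as follows. The weights here are a softmax over negative squared distances between embedded features; the actual similarity used by the network is learned, so this is only an illustrative stand-in:

```python
import numpy as np

def nonlocal_feature(center, neighbors):
    """Similarity-weighted sum over a pixel's neighborhood: weights come
    from the pairwise similarity of the embedded center pixel and its
    embedded neighbors, so dissimilar neighbors contribute little.
    center: (d,) embedding; neighbors: (n, d) embeddings."""
    d2 = ((neighbors - center) ** 2).sum(axis=1)   # squared distances
    w = np.exp(-d2)                                # unnormalized similarities
    w /= w.sum()                                   # normalized weights
    return w @ neighbors                           # (d,) nonlocal feature

center = np.array([1.0, 0.0])
neighbors = np.array([[1.0, 0.0],
                      [0.9, 0.1],
                      [5.0, 5.0]])                 # last row is an outlier
f = nonlocal_feature(center, neighbors)
```

Because the outlier neighbor receives a near-zero weight, the resulting feature stays close to the center pixel's embedding, which is how nonlocal averaging suppresses speckle noise without blurring across class boundaries.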
The VAT term is also introduced in the proposed method to regularize the model, because the strong capacity of deep learning models to fit the training data can cause overfitting when the training data is insufficient, which is exactly the situation in PolSAR image classification. As evidence, in Figure 10 we notice that applying VAT yields a relatively larger accuracy improvement when the size of the training set is small. Another benefit of applying VAT to NNSSAE is that, with the same model architecture, we can achieve similar performance with a smaller training set, as can also be seen in Figure 10. This means VAT could further reduce the workload of manually labeling ground objects.
The time consumption of the training procedures for all six methods is shown in Table 4, where "Prep" denotes the time used for pre-processing, "Train" denotes the training time, and "Total" denotes the time of the whole training procedure. From Table 4, we can also see that the pre-processing stage of the ANSSAE model is quite long, which could be a limitation in real applications. The training times of the proposed NNSSAE and NNSSAE-VAT are higher than those of the other methods, because these two models have neural-network-based feature extractors that need to be trained. Moreover, applying VAT to NNSSAE further increases the training time, for computing the virtual adversarial perturbation requires extra computation. A more capable structure for NNFE and a more efficient algorithm for calculating the virtual adversarial perturbation r_vadv could be investigated in the future. Considering the advantages of NNFE, we also plan to combine the NNFE with a fully convolutional network (FCN) [42] to further improve the classification accuracy and reduce the time consumption in our future work.

Conclusions
In this paper, we propose a method called NNSSAE-VAT for PolSAR image classification. The NNSSAE-VAT mainly contains three parts: the neural nonlocal feature extractor (NNFE), the SSAE, and the virtual adversarial training (VAT) regularization term. The NNFE uses a neural network to calculate the pairwise similarity between each pixel and its surrounding pixels, and uses these similarities as filtering weights to extract nonlocal features. The SSAE takes the combination of the original pixel and its neural nonlocal features to perform classification, and the VAT regularization term is introduced to make the model more robust. The experimental results derived from three real PolSAR images show that the proposed method achieves better results compared with previous methods.
Author Contributions: All the authors made important contributions to this work. R.W. designed the model. R.W. conceived the experiments and analyzed the results. R.W. wrote the paper and Y.W. provided instruction and advice for the preparation of the paper.

Conflicts of Interest:
The authors declare no conflict of interest.