Complex-Valued Convolutional Autoencoder and Spatial Pixel-Squares Refinement for Polarimetric SAR Image Classification



Introduction
Polarimetric synthetic aperture radar (PolSAR) image classification has been extensively used in topographic mapping, natural disaster monitoring, quantitative statistics on vegetation coverage, and urban and rural planning [1][2][3]. In recent years, deep learning models have been utilized to classify optical images and have achieved superior accuracy [4]. Nevertheless, the imaging mechanism of PolSAR images differs from that of optical images [5], so these models perform poorly when applied to PolSAR images directly [6].
Before deep learning models were applied to image classification, many traditional algorithms had been proposed. They focus on developing feature extractors and classifiers, and their design is divided into two parts. The first part designs filters associated with the corresponding features. For example, a wavelet transform filter is exploited to extract local features [7]. Markov random fields and Fisher discriminant analysis are employed to learn spatial features between adjacent pixels [8,9]. In addition, Gabor wavelet filtering is used to extract texture and edge information in different directions [10], and a 3D-Gabor filter is employed to generate multiple cubes for active learning [11]. The second part designs a classifier on the obtained features to accomplish the classification task, including the hierarchical classifier [12], the wavelet transform classifier [13], and the complex Wishart classifier [14]. Others, such as the k-nearest neighbor classifier used in [15], improve the classification accuracy significantly. Assisted by SVM and random forest classifiers [16,17], Uhlmann et al. annotated PolSAR images according to color features and artificially designed features of PolSAR images [18]. These algorithms have shown improved performance, but they still require hand-crafted feature extractors and classifiers selected by experience, which not only costs much design time but also produces poor generalization performance.
With the significant breakthrough of the convolutional neural network (CNN), which has performed well in optical image classification tasks [19], deep learning methods have been introduced into PolSAR image classification. For example, Zhang et al. exploited a stacked sparse autoencoder to extract spatial sparse features, reducing the effect of speckle noise at the pixel level [20]. Geng et al. applied deep recurrent encoding neural networks (DRENNs) to extract contextual information of SAR images [21]. In [22], a stacked autoencoder is first utilized to extract PolSAR features from a synthetic target database; then a classifier constructed from a multi-layer perceptron network is used to label the urban area. Nonetheless, these models require adequate training datasets to achieve high classification accuracy, and attaining sufficient training samples is difficult because of the rarity and confidentiality of remote sensing images. Consequently, Shang et al. added an information encoder to CNN to increase the utilization of samples [23]. Gao et al. obtained a joint feature map using CNN and Multiple Feature Learning to increase the discriminant performance of the features [24]. There are also many unsupervised feature extraction methods, such as the sparse autoencoder (SAE) [25], the convolutional autoencoder (CAE) [26], a multilayer autoencoder with a restriction using Euclidean distance [27], Discriminant Analysis with Graph Learning (DAGL) [28], multilayer autoencoders with self-paced learning (SPL) [29], the Wishart autoencoder (WAE) and Wishart convolutional autoencoder (WCAE) [30], and the Wishart deep belief network (W-DBN) [31]. Specifically, the prior information of the Wishart distribution of PolSAR data is used in WAE and WCAE, which increases the accuracy rate by over 2%. W-DBN is composed of the Wishart-Bernoulli restricted Boltzmann machine (WBRBM) and achieves better classification performance based on unsupervised pre-training and fine-tuning. However, only the real values of the coherency matrix or covariance matrix of the pixels of PolSAR images are used in these algorithms. To solve this problem, Zhang et al. introduced phase information to CNN and proposed a complex-valued CNN (CV-CNN) [6], which achieved comparable accuracy and verified the significance of phase information in PolSAR images. However, massive annotated training datasets are needed in CV-CNN.
To alleviate the problem that CAE cannot adequately extract the features of PolSAR images from a tiny amount of training data, a complex-valued convolutional autoencoder network (CV-CAE) is proposed in this paper. First, CV-CAE extracts features from unannotated complex-valued input patches; then a complex-valued fully connected network (CFC) is trained and CV-CAE is fine-tuned with annotated training data. Experiments with three typical datasets show that the classification accuracy can be further improved. Many post-processing methods have also been introduced into PolSAR image classification. Among them, Liu et al. proposed the Cleaning algorithm, in which Bayesian theory and local spatial information are employed to rectify the class of each pixel [31]. In [32], a refined spatial-anchor graph is proposed to reassign the border pixels using majority voting and distance measurement. These methods increase the classification accuracy by refining the pixels one by one. To improve the efficiency of post-processing, spatial pixel-squares refinement (SPF) is proposed in this paper, which exploits the blocky land cover structure of the preliminary classification map. SPF uses majority voting and a difference value to determine whether the refinement condition is met within a pixel-square (PixS), and then refines the class of all pixels within the PixS at once. Therefore, compared with pixel-level refinement, SPF obtains higher refinement efficiency. The proposed algorithm is evaluated on three PolSAR datasets and achieves better accuracy than the compared algorithms.
The rest of this paper is structured as follows. Section 2 describes the framework of the proposed CV-CAE and SPF in detail. Data preprocessing and experimental analysis are introduced in Section 3. The conclusion is given in Section 4.

Classification Based on CV-CAE Network
In our work, considering the phase and amplitude information of PolSAR images, the CV-CAE network is proposed by extending the unsupervised CAE model to the complex domain. To improve the efficiency of pixel-level refinement, a post-processing method, SPF, is adopted. The architecture and training process of CV-CAE, along with the implementation of SPF, are outlined in the following.

The Framework of the Proposed Algorithm
The framework of CV-CAE is depicted in Figure 1.

CV-CAE
CV-CAE consists of four complex-valued parts: input, encoding, decoding, and output. The configuration of CV-CAE is given in Table 1. The encoding part includes convolution and mean pooling, corresponding to the second and third layers. The next two layers, upsampling and deconvolution, are the components of the decoding part. The sigmoid activation function is utilized in CV-CAE. In Table 1, the structure and parameters of the convolutional layer and the mean pooling layer are represented by "Conv. feature mappings number (kernel size)/activation function" and "Mean-Po. stride (pooling size)". The structure and parameters of the next two layers are represented similarly. The classification network is formed with a fully connected layer. Output size is the number of output feature mappings, and N is the number of terrain types.
Spatial features play a pivotal role in the classification of PolSAR images. Therefore, the input of CV-CAE is a complex-valued patch cropped from the original PolSAR image. As shown in Figure 1, $C^{(i)}_{11}$, $C^{(i)}_{12}$, $C^{(i)}_{13}$, $C^{(i)}_{22}$, $C^{(i)}_{23}$, and $C^{(i)}_{33}$ are the complex-valued pixel values of the six channels in the $i$th input patch. Considering the terrain types of PolSAR images [33,34], a size of 12 × 12 is selected for the input patch. On the one hand, this size is big enough to contain the spatial features needed for classification. On the other hand, with a smaller input size, the computational efficiency is increased and the risk of over-fitting is reduced [23].
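In the encoding part, complex-valued convolution extracts discriminant features for the classification task from the complex-valued input patch. These features differ from those of real-valued convolution because they include both spatial and polarimetric information. All parameters in the complex-valued convolutional operation are complex values. Specifically, the $i$th complex input patch is $X^{(l)}_{ic} \in \mathbb{C}^{W_1 \times H_1 \times C}$, where $l$ is the layer number and $c$ is the channel index ($c = 1, 2, \cdots, C$). The output corresponding to the $i$th input is $y^{(l)}_{ik} \in \mathbb{C}^{W_3 \times H_3 \times K}$, where $k$ indexes the feature mappings ($k = 1, 2, \cdots, K$). The complex-valued convolution is defined as

$$ y^{(l)}_{ik} = \sum_{c=1}^{C} \Big[ \mathrm{real}\big(W^{(l)}_{ik}\big) * \mathrm{real}\big(X^{(l)}_{ic}\big) - \mathrm{imag}\big(W^{(l)}_{ik}\big) * \mathrm{imag}\big(X^{(l)}_{ic}\big) \Big] + j \sum_{c=1}^{C} \Big[ \mathrm{real}\big(W^{(l)}_{ik}\big) * \mathrm{imag}\big(X^{(l)}_{ic}\big) + \mathrm{imag}\big(W^{(l)}_{ik}\big) * \mathrm{real}\big(X^{(l)}_{ic}\big) \Big] + b^{(l)}_{k} \tag{1} $$

where real(•) and imag(•) are the real and imaginary parts of the complex value •, the character $*$ represents the convolutional operation, $W^{(l)}_{ik}$ is the convolutional kernel of size $W_2 \times H_2 \times C \times K$, and $b^{(l)}_{k}$ is the bias. Generally, kernels of size 3 × 3 or 5 × 5 are recommended because they are more effective in feature extraction than others [35,36]. In the complex domain, the number of trainable parameters is two times that of the real field, that is, $2 \times (W_2 \times H_2 \times C \times K + K)$. For a convolutional operation with stride $S$ and zero-padding $P$, the size of the feature mappings of the convolution result is calculated by

$$ W_3 = \frac{W_1 - W_2 + 2P}{S} + 1, \qquad H_3 = \frac{H_1 - H_2 + 2P}{S} + 1. \tag{2} $$

In Equation (1), only a linear transformation is performed on the input data. To improve the generalization and robustness of CV-CAE, nonlinear operations must be adopted. In neural networks, sigmoid and ReLU are the two commonly recommended activation functions [37]; they show good performance on nonlinear transformation and accelerate training. In CV-CAE, the complex-valued nonlinear operation applies the sigmoid to the real and imaginary parts separately:

$$ Y^{(l)}_{ik} = \frac{1}{1 + e^{-\mathrm{real}(y^{(l)}_{ik})}} + j\, \frac{1}{1 + e^{-\mathrm{imag}(y^{(l)}_{ik})}} \tag{3} $$

where $Y^{(l)}_{ik}$, of the same size as $y^{(l)}_{ik}$, is the result of the complex-valued nonlinear transformation.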
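The complex-valued convolution and activation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the sliding-window product is written without kernel flipping (the usual deep learning convention), and applying the sigmoid separately to the real and imaginary parts is an assumption that follows common practice in complex-valued networks.

```python
import numpy as np

def corr2d_valid(x, w):
    """Real-valued sliding-window product of a 2-D array x with kernel w (no padding, stride 1)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return out

def complex_conv(x, w, b):
    """x: (H, W, C) complex patch, w: (kh, kw, C, K) complex kernels, b: (K,) complex biases."""
    H, W, C = x.shape
    kh, kw, _, K = w.shape
    y = np.zeros((H - kh + 1, W - kw + 1, K), dtype=complex)
    for k in range(K):
        for c in range(C):
            xr, xi = x[:, :, c].real, x[:, :, c].imag
            wr, wi = w[:, :, c, k].real, w[:, :, c, k].imag
            real_part = corr2d_valid(xr, wr) - corr2d_valid(xi, wi)
            imag_part = corr2d_valid(xr, wi) + corr2d_valid(xi, wr)
            y[:, :, k] += real_part + 1j * imag_part
        y[:, :, k] += b[k]
    return y

def complex_sigmoid(y):
    """Assumed activation: sigmoid applied to real and imaginary parts separately."""
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    return sig(y.real) + 1j * sig(y.imag)

# Example with the sizes used in the paper: 12 x 12 x 6 input, twelve 5 x 5 kernels.
rng = np.random.default_rng(0)
x = rng.standard_normal((12, 12, 6)) + 1j * rng.standard_normal((12, 12, 6))
w = rng.standard_normal((5, 5, 6, 12)) + 1j * rng.standard_normal((5, 5, 6, 12))
b = rng.standard_normal(12) + 1j * rng.standard_normal(12)
y = complex_conv(x, w, b)
a = complex_sigmoid(y)
```

Splitting the computation into four real-valued products mirrors the real(•)/imag(•) decomposition and gives the same result as multiplying the complex numbers directly.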
Pooling reduces the dimension of its input features based on similarity and does not change the number of channels. By means of pooling, the pivotal features are preserved and the redundant information is reduced, so the computation and convergence of the network become more efficient. In neural networks, the most widely used pooling operations are max pooling and mean pooling. Pooling size and stride are the dominant parameters: appropriate values not only eliminate redundant information but also retain the discriminant features. Based on previous experience, a pooling size of 2 × 2 or 3 × 3 and a stride of 2 are commonly recommended.
Convolution without padding, with kernel size 5 and stride 1, is employed in the encoding part. The number of convolutional kernels is 12. In the complex domain, max pooling cannot be directly adopted, so mean pooling with pooling size 2 and stride 2 is exploited in CV-CAE. According to Equation (2), with a complex-valued input patch of size 12 × 12 × 6, the sizes of the feature mappings after the convolution and mean pooling operations are 8 × 8 × 12 and 4 × 4 × 12, respectively.
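The size arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the output-size expression is the standard one for a convolution or pooling window of a given size, stride, and zero-padding.

```python
import numpy as np

def output_size(in_size, window, stride=1, padding=0):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (in_size - window + 2 * padding) // stride + 1

def mean_pool(y, size=2, stride=2):
    """Mean pooling of a (H, W, K) feature array; works for complex values as well."""
    H, W, K = y.shape
    oh, ow = output_size(H, size, stride), output_size(W, size, stride)
    out = np.zeros((oh, ow, K), dtype=y.dtype)
    for r in range(oh):
        for c in range(ow):
            win = y[r * stride:r * stride + size, c * stride:c * stride + size, :]
            out[r, c, :] = win.mean(axis=(0, 1))
    return out

conv_size = output_size(12, 5, stride=1, padding=0)   # 12 x 12 input, 5 x 5 kernel
pool_size = output_size(conv_size, 2, stride=2)       # 2 x 2 mean pooling, stride 2
features = np.ones((conv_size, conv_size, 12), dtype=complex) * (1 + 2j)
pooled = mean_pool(features)
```

Unlike max pooling, the mean is well defined for complex numbers, which is why it can be applied to the complex feature mappings without modification.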
The decoding part consists of upsampling and deconvolution. It is the inverse process of encoding and aims to reconstruct the input of the encoding part. In upsampling, the feature mappings of the encoding part are extended by utilizing the location information retained in the pooling process. The extension method depends on the pooling operation. For inverse mean pooling, each pixel value in the feature maps is copied to all positions within the pooling window. Deconvolution, also called transposed convolution, is the inverse process of convolution. In deconvolution, the sparse image representation generated by upsampling is reconstructed to the same resolution as the input patch of the encoding part. The deconvolution result $\tilde{Y}^{(l)}_{ic}$ is calculated by a complex-valued transposed convolution of the upsampled feature mappings, and the parameters to be trained in deconvolution are the kernels $\tilde{W}^{(l)}_{ic}$ and biases $\tilde{b}^{(l)}_{c}$. In the decoding part, the input of the upsampling layer is the output feature mappings of the encoding part, whose size is 4 × 4 × 12. The output of upsampling has size 8 × 8 × 12. A kernel size of 5 × 5 and 6 output feature mappings are employed in deconvolution. Therefore, the output size of the decoding part is 12 × 12 × 6.
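The inverse mean pooling and the decoder sizes described above can be sketched as follows. The function names are hypothetical, and the transposed-convolution size formula is the standard one; it is used here only to check the 8 × 8 to 12 × 12 step.

```python
import numpy as np

def mean_unpool(y, size=2):
    """Inverse mean pooling: copy each value to every position of its pooling window."""
    return np.repeat(np.repeat(y, size, axis=0), size, axis=1)

def deconv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution."""
    return (in_size - 1) * stride - 2 * padding + kernel

rng = np.random.default_rng(0)
encoded = rng.standard_normal((4, 4, 12)) + 1j * rng.standard_normal((4, 4, 12))
upsampled = mean_unpool(encoded, size=2)      # 4 x 4 x 12 -> 8 x 8 x 12
decoded_size = deconv_output_size(8, 5)       # 8 -> 12 with a 5 x 5 kernel
```

With stride 1 and no padding, a 5 × 5 transposed convolution adds 4 pixels per dimension, which restores the 12 × 12 input resolution.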

Classification Network
The classification network consists of the trained encoding part of CV-CAE and the CFC. The encoding part has been elaborated in Section 2.1.1 and will not be repeated here. The input of CFC, $Y^{(l)}_{ik}$, is a vector obtained by reshaping the encoding result $\tilde{Y}^{(l)}_{ic}$. The number of input neurons is equal to the number of elements in this vector. The result of CFC is $O^{(l)}_{in}$, where $n$ indexes the neurons in the complex-valued output layer ($n = 1, 2, \cdots, N$) and $N$ is also the number of terrain types of the PolSAR image. $O^{(l)}_{in}$ is computed from the dot product (denoted by the character •) of the complex-valued weights $W^{(l)}_{n}$ with the input vector, plus the bias $b^{(l)}_{n}$. The parameters to be trained are the weights $W^{(l)}_{n}$ and the biases $b^{(l)}_{n}$; the number of biases equals the number of output-layer neurons $N$, which varies across datasets.

Network Training
Training in CV-CAE has two stages. First, unannotated datasets are utilized to train CV-CAE, and the trained encoding part is employed to extract features. Then, annotated datasets are applied to train the CFC and fine-tune the encoding part. The detailed procedure is as follows.

CV-CAE Training
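The training of CV-CAE minimizes the loss function $J(\theta)$ by optimizing the parameters $\theta$, so that the output reconstructs the input of CV-CAE. $\theta$ includes the convolutional kernels and biases of the encoding and decoding parts. The loss between the input $X^{(l)}_{ic}$ and the output $\tilde{Y}^{(l)}_{ic}$ is given by Equation (8), where $l = 1, 2, \cdots, L$ and $c = 1, 2, \cdots, C$ represent network layers and channel numbers respectively. The $\tilde{W}^{(l)}_{ic}$ and $\tilde{b}^{(l)}_{c}$ in decoding are updated iteratively by gradient descent with learning rate $\eta$:

$$ \tilde{W}^{(l)}_{ic} \leftarrow \tilde{W}^{(l)}_{ic} - \eta \frac{\partial J(\theta)}{\partial \tilde{W}^{(l)}_{ic}} \tag{9} $$

$$ \tilde{b}^{(l)}_{c} \leftarrow \tilde{b}^{(l)}_{c} - \eta \frac{\partial J(\theta)}{\partial \tilde{b}^{(l)}_{c}} \tag{10} $$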
As Equation (8) shows, $J(\theta)$ is a function of the parameters $\theta$. To solve Equations (9) and (10), the partial derivatives $\partial J(\theta) / \partial \tilde{W}^{(l)}_{ic}$ and $\partial J(\theta) / \partial \tilde{b}^{(l)}_{c}$ are needed. By imitating the real-valued solution process and extending the chain rule to the complex domain, the derivative with respect to $\tilde{W}^{(l)}_{ic}$ is expanded as in Equation (11). With Equations (6)-(8), the second and third terms in Equation (11) are zero, so only two terms remain. The derivative with respect to the bias $\tilde{b}^{(l)}_{c}$ is obtained with the same method, and the same update method is exploited in the encoding part. After training with the unannotated dataset, the discriminant features obtained by the encoding part of CV-CAE are used as the input of CFC.

Classification Network Training
Annotated datasets are applied to train the CFC in this section. In the real-valued convolutional autoencoder (RV-CAE), softmax is used as the output layer to obtain the probability of each category. However, a complex-valued output cannot be interpreted directly as the probability of each class. Therefore, the output layer is a complex-valued fully connected layer with $N$ neurons, and the mean square error (MSE) between the output of the CFC and the one-hot target vector is used as the loss function. In the complex domain, the ON value of the one-hot vector is recorded as $1 + j$, and the others are 0. The length of the vector is the number of classes of the dataset. The loss function of CFC (Equation (13)) is therefore the MSE between $O_{in}$ and $T_i$, where $O_{in}$ is the result of CFC and $T_i$ is the target corresponding to the input that produces $O_{in}$. In this part, the complex-valued weights $W^{(l)}_{n}$ and biases are updated in the same way as in the encoding part of CV-CAE.
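The complex one-hot target and the MSE loss described above can be sketched as follows. The function names are hypothetical, and the normalization of the loss (a mean over the output neurons of the squared magnitude of the error) is an assumption, since several equivalent scalings of the MSE are in use.

```python
import numpy as np

def complex_one_hot(label, num_classes):
    """Target vector whose ON entry is 1 + j and whose other entries are 0."""
    target = np.zeros(num_classes, dtype=complex)
    target[label] = 1 + 1j
    return target

def complex_mse(output, target):
    """Assumed loss: mean squared magnitude of the complex error."""
    diff = target - output
    return float(np.mean(diff.real ** 2 + diff.imag ** 2))

target = complex_one_hot(2, 5)
perfect = complex_mse(target, target)                       # 0.0 for an exact match
all_zero = complex_mse(np.zeros(5, dtype=complex), target)  # |1 + j|^2 / 5
```

Because the squared magnitude sums the squared real and imaginary errors, both parts of the output are pulled toward the target at the same time.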

Spatial Pixel-Squares Refinement
The goal of PolSAR image classification is to assign each pixel to one class. However, some pixels may be misclassified, which reduces the classification accuracy. To reduce this impact, this paper proposes a post-processing method called SPF, based on the blocky structure of PolSAR images. The whole algorithm is summarized in Algorithm 1. For a preliminary classification map of size $w \times h$, the numbers of times the PixS moves in the horizontal and vertical directions are $\lfloor w/s \rfloor$ and $\lfloor h/s \rfloor$, where $s$ is the stride of the PixS movement and $\lfloor \cdot \rfloor$ indicates rounding down. The size of the PixS is $r$ ($r$ represents the number of pixels in each row or column, $r \le s$). $\mathit{pixNum}_n$ represents the number of pixels of the $n$th class ($1 \le n \le r \times r$) in the PixS. In SPF, the most critical step is to determine the refinement condition, for which majority voting and the difference-value method are used as judgement rules. Specifically, majority voting is applied to find the class with the largest number of pixels, $\mathit{pixNum}_{max}$, in the PixS. Then $\mathit{pixNum}_{max}$ is compared with $(r \times r)/2$ to determine whether to continue processing the PixS or move to the next one, i.e.,

$$ (r \times r)/2 < \mathit{pixNum}_{max} < (r \times r) \tag{14} $$

where the lower bound $(r \times r)/2$ ensures that no more than one category can satisfy the refinement condition. A PixS that satisfies Equation (14) is called an unstable window. In an unstable window, the numbers of pixels belonging to each class, $\mathit{pixNum}_1, \mathit{pixNum}_2, \cdots, \mathit{pixNum}_n$, are calculated and sorted in descending order. The sorted queue can be represented as $\mathit{pixNum}_{max}, \mathit{pixNum}_{2nd\_max}, \cdots$.
In our work, SPF refines all the classes in an unstable window that satisfies the next refinement condition into one class. To reduce computational complexity, the difference-value method is employed to calculate the difference between the pixel counts of the first two classes in the queue. This difference is then compared with a preset threshold $\tau_0$, which gives the next refinement condition, Equation (15). If both Equations (14) and (15) hold, all pixels in the PixS are changed to the category with the largest number of pixels.
The diagram of SPF is shown in Figure 2: the left picture shows the unprocessed PixS, and the right picture shows the refined result. In SPF, the size of the PixS is one of the most crucial factors affecting the refinement results. Experiments show that a larger PixS incorrectly refines pixels of other classes at the edges of land cover, while a smaller PixS reduces the efficiency of refinement. The optimal result is obtained by setting the size of the PixS to 3 × 3 and the threshold $\tau_0$ to 3.
There are three different cases of PixS to be refined in Figure 3. Each digit in the PixS represents a pixel class.
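The refinement procedure can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, not the authors' Algorithm 1: the function name is hypothetical, and the second condition is assumed to be "difference between the two largest counts is at least $\tau_0$", because the exact inequality is not recoverable from the text.

```python
import numpy as np

def spf_refine(label_map, r=3, s=3, tau0=3):
    """Refine a preliminary label map square by square (PixS of size r, stride s)."""
    refined = label_map.copy()
    h, w = label_map.shape
    for top in range(0, h - r + 1, s):
        for left in range(0, w - r + 1, s):
            window = label_map[top:top + r, left:left + r]
            classes, counts = np.unique(window, return_counts=True)
            order = np.argsort(counts)[::-1]          # descending pixel counts
            n_max = counts[order[0]]
            # First condition: a strict majority that does not fill the whole square.
            if not (r * r / 2 < n_max < r * r):
                continue
            n_second = counts[order[1]]
            # Second condition (assumed form): the majority leads by at least tau0.
            if n_max - n_second >= tau0:
                refined[top:top + r, left:left + r] = classes[order[0]]
    return refined

noisy = np.array([[1, 1, 1],
                  [1, 2, 1],
                  [1, 1, 2]])
boundary = np.array([[1, 1, 2],
                     [1, 1, 2],
                     [1, 2, 2]])
refined_noisy = spf_refine(noisy)        # isolated pixels are overwritten
refined_boundary = spf_refine(boundary)  # a near-even split is left untouched
```

In the second example the majority class leads by only one pixel, so the square is treated as a likely class boundary and is not modified.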

PolSAR Data Preprocessing
The scattering characteristics of pixels in PolSAR images are represented by a scattering matrix $S$ [38]. It is defined as

$$ S = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix}. $$

Generally, the covariance matrix or the coherency matrix is used as the basic unit of a PolSAR image [39]. In CV-CAE, the covariance matrix is adopted. It contains all the polarization information of the object obtained by radar measurement and is deduced from the scattering matrix. The effectiveness of the covariance matrix has been verified in [40]. According to the reciprocity theorem, $S_{HV} = S_{VH}$, and the scattering vector is $x = \left[ S_{HH}, \sqrt{2} S_{HV}, S_{VV} \right]^{T}$. The covariance matrix can be calculated by the Kronecker product of $x$ with its conjugate transpose as follows

$$ C = x x^{H} = \begin{bmatrix} |S_{HH}|^{2} & \sqrt{2} S_{HH} S_{HV}^{*} & S_{HH} S_{VV}^{*} \\ \sqrt{2} S_{HV} S_{HH}^{*} & 2 |S_{HV}|^{2} & \sqrt{2} S_{HV} S_{VV}^{*} \\ S_{VV} S_{HH}^{*} & \sqrt{2} S_{VV} S_{HV}^{*} & |S_{VV}|^{2} \end{bmatrix} $$

where the superscripts $*$, $T$, and $H$ represent conjugation, transposition, and conjugate transposition respectively. To suppress the speckle noise of PolSAR images, multi-look processing is introduced into the covariance matrix

$$ C = \frac{1}{L} \sum_{i=1}^{L} x_i x_i^{H} $$

where $L$ is the number of looks and $x_i$ is the scattering vector of the $i$th look. It follows from the scattering properties of PolSAR images that the elements on the principal diagonal of the covariance matrix $C$ are real values. The rest are complex values that are conjugate at positions symmetric about the main diagonal, i.e., $C_{12}$ and $C_{21}$, $C_{13}$ and $C_{31}$, and $C_{23}$ and $C_{32}$ are conjugate pairs. To reduce redundancy while preserving the integrity of the input information, the upper triangular elements $\{C_{11}, C_{12}, C_{13}, C_{22}, C_{23}, C_{33}\}$ of $C$ are employed as the input of CV-CAE. Data normalization can effectively avoid the problems of vanishing and exploding gradients and improve the convergence efficiency of the proposed network [25]. So the real values (diagonal elements) and complex values (non-diagonal elements) of the input data need to be preprocessed. Taking the first channel $C_{11}$ as an example of real values, the normalized result $\bar{C}_{11}$ is obtained by subtracting the average of $C_{11}$ and dividing by its standard deviation, where $\mu_{C_{11}}$ and $\delta^{2}_{C_{11}}$ denote the average and standard deviation of $C_{11}$. Taking the second channel $C_{12}$ as an example of complex values, the same normalization is applied with the average $\mu_{C_{12}}$ and standard deviation $\delta^{2}_{C_{12}}$ of $C_{12}$. The other real values ($C_{22}$ and $C_{33}$) and complex values ($C_{13}$ and $C_{23}$) of the input data are treated in the same way.
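The preprocessing chain can be illustrated with a small NumPy sketch: building the multi-look covariance matrix from scattering vectors, keeping the six upper triangular elements, and normalizing a channel. The function names are hypothetical, and the standard deviation of a complex channel is assumed to be the root of the mean squared magnitude of the centered values.

```python
import numpy as np

def covariance_matrix(s_hh, s_hv, s_vv):
    """Multi-look covariance from L looks; each argument is a length-L complex array."""
    x = np.stack([s_hh, np.sqrt(2) * s_hv, s_vv])   # scattering vectors, shape (3, L)
    return x @ x.conj().T / x.shape[1]               # average of x_i x_i^H over the looks

def upper_triangle(c):
    """The six input channels C11, C12, C13, C22, C23, C33."""
    return np.array([c[0, 0], c[0, 1], c[0, 2], c[1, 1], c[1, 2], c[2, 2]])

def normalize_channel(channel):
    """Center and scale one channel; works for real and complex values (assumed form)."""
    mu = channel.mean()
    sigma = np.sqrt(np.mean(np.abs(channel - mu) ** 2))
    return (channel - mu) / sigma

rng = np.random.default_rng(1)
looks = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))  # 4 synthetic looks
C = covariance_matrix(looks[0], looks[1], looks[2])
channels = upper_triangle(C)
c12_image = rng.standard_normal(100) + 1j * rng.standard_normal(100)    # synthetic C12 values
c12_normalized = normalize_channel(c12_image)
```

The resulting matrix is Hermitian with a real diagonal, which is why the upper triangle carries all of its information.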

PolSAR Datasets for Experiment
In this paper, three PolSAR images are used to verify the performance of the proposed algorithm. These datasets were acquired with the Airborne SAR (AIRSAR) platform. Two of them show agricultural areas over Flevoland in the Netherlands. They are available online at https://earth.esa.int/web/guest/missions/esa-operational-eo-missions/envisat. The third one is AIRSAR data over San Francisco [30]. After preprocessing, the datasets are divided into training datasets and test datasets: 5% of the samples are used for training and the rest are used for testing. The spatial size of the test samples is 12 × 12 and the number of channels is 6, the same as for the training samples. Detailed analysis is given in the following experiments.

Comparative Algorithms
To objectively evaluate the effectiveness of the proposed method, our algorithm is compared against four state-of-the-art algorithms: RV-CAE, WAE, WCAE, and fixed-feature-size CNN (FFS-CNN) [41]. To ensure the fairness of the comparison, first, the input information content of RV-CAE should be equivalent to that of CV-CAE, so the input elements of RV-CAE are designed as {C 11 , C 22 , C 33 , real(C 12 ), imag(C 12 ), real(C 13 ), imag(C 13 ), real(C 23 ), imag(C 23 )}. Second, the number of parameters in CV-CAE and RV-CAE must be the same. Therefore, in the experiment, we configure the structure and the number of parameters of RV-CAE according to Table 2. In this table, "parameters" indicates the number of parameters in each layer. S Rw × S Rh and S Cw × S Ch represent the sizes of the feature mappings of mean pooling in RV-CAE and CV-CAE respectively. N is the number of terrain types. The structure of RV-CAE is the same as that of CV-CAE, but the input size of RV-CAE is 12 × 12 with 9 channels. Because the number of parameters in the complex domain is double that in the real domain, the number of feature mappings is set to 16 in RV-CAE so that its parameter count matches that of CV-CAE: the quantity of convolutional parameters is 5 × 5 × 9 × 16, which is equal to 5 × 5 × 6 × 12 × 2 in CV-CAE. In CV-CAE and RV-CAE, the number of parameters of the fully connected layer is the product of the numbers of neurons of the input layer and the output layer. In CV-CAE, the number of neurons of the input layer is S Cw × S Ch , while it is S Rw × S Rh in RV-CAE; these are the feature mappings of the mean pooling layer after reshaping.
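The parameter matching above is simple arithmetic and can be verified directly. The helper below is hypothetical and counts only the convolution kernel weights (biases excluded), with each complex weight counted as two real numbers.

```python
def conv_weight_count(kernel, in_channels, feature_maps, complex_valued=False):
    """Number of real scalars in the convolution kernels (biases excluded)."""
    count = kernel * kernel * in_channels * feature_maps
    return 2 * count if complex_valued else count

rv_params = conv_weight_count(5, 9, 16)                        # RV-CAE: 9 real input channels
cv_params = conv_weight_count(5, 6, 12, complex_valued=True)   # CV-CAE: 6 complex input channels
```

Both expressions evaluate to 3600, so the 16 real feature mappings of RV-CAE balance the 12 complex feature mappings of CV-CAE.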

Experiment on Flevoland Datasets of 14 Classes
The first experiment is carried out on the dataset over Flevoland, which is a subset of an L-band, full PolSAR image acquired by the AIRSAR platform in 1991. It is widely applied as benchmark data for PolSAR image classification research. The Pauli RGB image and the corresponding ground truth are exhibited in Figure 4a,b; the image size is 1020 × 1024 pixels. There are in total 14 identified classes: Potatoes, Fruit, Oats, Beet, Barley, Onions, Wheats, Beans, Peas, Maize, Flax, Rapeseed, Grass, and Lucerne. Each color indicates one class in the ground-truth map, and the corresponding legends are listed in Figure 4c. The structure of the network is shown in Figure 1. Hyperparameters were selected as follows. First, unsupervised training is employed to train CV-CAE with a learning rate of 0.001. Then the annotated training data are utilized to train CFC and fine-tune the encoding part of CV-CAE. In the supervised training process, the learning rate η is 0.48 and the batch size is 100. In CFC, the number of neurons of the input layer is 192, and that of the output layer is 14.
For convenience, the proposed method combining CV-CAE and SPF is abbreviated as CV-CAE+SPF. The classification results of the compared algorithms and the proposed algorithm are shown in Figure 5. Notably different results are highlighted by black rectangles. Comparing Figure 5a,c, the number of misclassified pixels of CV-CAE is clearly smaller than that of RV-CAE, and the regions within each class in the classification map of CV-CAE are smoother than those of RV-CAE. As shown in the lower left of Figure 5a,c, CV-CAE achieves more distinguishable edges. In Figure 5a,b, the number of misclassified pixels is further reduced by CV-CAE+SPF. CV-CAE+SPF achieves the best classification result among the three algorithms.
The classification accuracy of each class, the OA, and the Kappa coefficient are listed in Table 3, with the best results shown in bold. Table 3 shows that CV-CAE+SPF obtains higher accuracy than CV-CAE and RV-CAE. The OA of RV-CAE, CV-CAE, and CV-CAE+SPF is 98.34%, 98.7%, and 98.82% respectively. The Kappa coefficients are also improved by our algorithms, CV-CAE and CV-CAE+SPF. This indicates the effectiveness of our methods. Specifically, the proposed methods achieve an accuracy of 100% on Oats, and the accuracy on Beans is 92.7% for CV-CAE while it is only 82.9% for RV-CAE. These results illustrate that phase information is a crucial feature in PolSAR image classification tasks. In addition, the classification accuracy of CV-CAE+SPF is further improved compared with CV-CAE in Table 3, which indicates the success of SPF. Furthermore, another experiment is carried out to evaluate the effectiveness of the proposed SPF. The results show that the proposed SPF takes 4.39 s while improving the correct rate, whereas the compared algorithm (pixel-by-pixel refinement based on voting) takes 70.95 s while increasing the correct rate by only 0.04%. However, the proposed algorithm achieves a lower accuracy rate on Onions. From the confusion matrix of CV-CAE+SPF shown in Table 4 (each row indicates the true class and each column indicates the predicted class; 1 to 14 represent Potatoes, Fruit, Oats, Beet, Barley, Onions, Wheats, Beans, Peas, Maize, Flax, Rapeseed, Grass, and Lucerne), we can see that Beet, Wheats, Beans, and Maize account for a large share of the misclassified Onions pixels. Considering the ground truth in Figure 4b, the annotated area of Onions is smaller than that of other classes such as Potatoes, Barley, and Wheats. Consequently, many of the input patches are smaller than 12 × 12 and need zero padding, so the discriminant features cannot be extracted adequately.

In Figure 7b, many pixels of Rapeseed are misclassified into Grass and Wheat2. To evaluate the performance of the proposed method, the compared methods and the proposed methods are compared. It can be observed from Figure 7e,f that the number of misclassified pixels is lower than that of the compared algorithms. Therefore, CV-CAE and CV-CAE+SPF give the best performance. In addition, the intra-class smoothness and the inter-class distinctness of the proposed algorithms are better than those of the compared algorithms. The classification accuracy of the proposed algorithms and the compared algorithms is listed in Table 5.
CV-CAE and CV-CAE+SPF achieve better OA than the compared algorithms, followed by WCAE, WAE, and RV-CAE. In this experiment, WAE does not perform well in recognizing Beet, Potatoes, and Grass: its accuracy on these classes is lower than 85%, while CV-CAE+SPF achieves 93.09%, 89.24%, and 87.02% respectively. Moreover, RV-CAE cannot distinguish Potatoes, Grass, and Buildings clearly, and discriminates Potatoes and Grass with accuracies of 77.56% and 73.14%. The proposed CV-CAE improves the accuracy of these two classes by 10 points compared with RV-CAE, and also achieves 100% accuracy on Bare soil. Therefore, phase information can promote the improvement of classification accuracy. To show the effect of SPF, CV-CAE and CV-CAE+SPF are compared: the OA is increased by 1 point by CV-CAE+SPF, i.e., 94.31% compared with 93.31%. The result of FFS-CNN is higher than that of CV-CAE, but lower than that of CV-CAE+SPF. Furthermore, FFS-CNN is based on LeNet-5 and contains three convolutional layers with a convolutional kernel size of 3 × 3 and 100 feature mappings, so the number of parameters of FFS-CNN is much larger than that of the algorithm proposed in this paper.
To evaluate the generalization of the proposed SPF, it is also used to process the preliminary classification results of the compared methods. The OA of WAE, WCAE, RV-CAE, and FFS-CNN is improved by 0.78%, 1.03%, 1.29%, and 0.86% respectively.

The San Francisco dataset, acquired by the AIRSAR platform, is adopted in this experiment. The Pauli RGB image and the corresponding ground truth are shown in Figure 8a,b. Five colors in the ground-truth map represent five terrain types, which are water, vegetation, low-density urban, high-density urban, and developed. The legends are listed in Figure 8c. Figure 8b shows that most of the annotated areas are irregular. Thus, the complexity of this experiment is higher than that of the previous two experiments. In this experiment, the learning rate η is 0.6 and the number of output neurons is 5; the network structure and other hyperparameters are the same as in the above two experiments.
Table 6 lists the classification results of each algorithm. For WAE, the classification accuracy on Vegetation and Low-Density urban is 58.85% and 78.12%, while CV-CAE achieves a significant improvement on these classes. RV-CAE cannot distinguish High-Density urban clearly, with an accuracy of 80.76%. The performance of WCAE is better than that of WAE and RV-CAE on Vegetation, Low-Density urban, and High-Density urban. However, its accuracy on the Developed category is slightly lower than that of these two algorithms. According to the results summarized in Table 6, CV-CAE+SPF increases the OA by 1.5 points compared with WCAE. However, the recognition rate of CV-CAE+SPF on Vegetation and High-Density urban is lower than that on the other classes. From the confusion matrix of CV-CAE+SPF shown in Table 7 (each row indicates the true class and each column indicates the predicted class; 1 to 5 represent Water, Vegetation, Low-Density urban, High-Density urban, and Developed), we can see that a large proportion of the misclassified pixels of these two classes are assigned to Low-Density urban, because the features of these two classes are similar to those of Low-Density urban. This can also be verified in Figure 8b. Nevertheless, by combining phase information and SPF, CV-CAE+SPF gives the best performance. Its OA and Kappa coefficient are 97.03% and 0.96.
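For reference, the OA and Kappa coefficient reported in the tables can be computed from a confusion matrix laid out as described above (rows are true classes, columns are predicted classes). The sketch below uses the standard definitions of overall accuracy and Cohen's kappa; the function names are hypothetical, the example matrix is synthetic, and it is an assumption that the paper uses exactly these definitions.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of correctly classified samples: trace divided by total."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def kappa_coefficient(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    p_observed = np.trace(cm) / total
    p_expected = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Synthetic two-class confusion matrix, not taken from the paper.
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)
kappa = kappa_coefficient(cm)
```

For this synthetic matrix the OA is 0.85 and the kappa is 0.7, which shows how kappa discounts the agreement that would be expected by chance.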

Conclusions
CAE has demonstrated significant success in computer vision. To take advantage of the phase information of PolSAR images, RV-CAE is extended to the complex domain and CV-CAE is proposed. CV-CAE is designed to extract more discriminant features from the amplitude and phase information of a tiny amount of unannotated training data. To fit the classification task, a small amount of annotated training data is needed to adjust the classification network, whose convolution operation is initialized by the trained CV-CAE. We have tested the performance of the proposed CV-CAE on three PolSAR datasets and compared it against several similar models, including WAE, WCAE, and RV-CAE. CV-CAE achieves better performance than the compared algorithms. In addition, a post-processing method named SPF is proposed to further improve the performance. Benefitting from the blocky structure of land cover in PolSAR images, the proposed SPF refines the class of all pixels in a spatial square at the same time, which alleviates the time-consuming problem of pixel-level refinement. Compared with CV-CAE, CV-CAE+SPF further improves the classification accuracy. Future work will investigate ways of replacing the two-stage network with an end-to-end network to reduce the complexity and improve the efficiency of this network. We will also investigate faster and more efficient post-processing methods to achieve better results.
The framework of CV-CAE consists of feature extraction and classification, which are marked in Figure 1 with the red and blue dotted boxes respectively. First, the network in the red box extracts features. Then the classification network, formed by the trained encoding part of CV-CAE and the CFC, performs the classification task. In Figure 1, $C^{(i)}_{11}$ is the first channel value of each pixel in the $i$th input patch, $\hat{C}^{(i)}_{11}$ is the decoded value of $C^{(i)}_{11}$, $c_n$ is the $n$th value of the classification result, and $n$ indexes the terrain types.

Figure 1. CV-CAE architecture. Red and blue boxes are the structure of the CV-CAE and classification network respectively.

In the convolutional operation, $b^{(l)}_{k}$ is the bias, and the parameters of CV-CAE to be trained in the $l$th layer are $W^{(l)}_{ik}$ and $b^{(l)}_{k}$. In CFC, the number of input elements is 4 × 4 × 12, and the number of biases $b^{(l)}_{n}$ equals the number $N$ of output-layer neurons.

Figure 2. The refinement process of SPF. The shaded part represents $\mathit{pixNum}_{max}$, and the blank part represents other classes.

Figure 5. The classification results and the results overlaid with ground truth of our algorithm and RV-CAE. (a,d) are the results of CV-CAE. (b,e) are the results of CV-CAE+SPF. (c,f) are the results of RV-CAE.

Table 1. The Framework and Parameters Configuration of CV-CAE.

Table 2. The Structure and Parameters Number of RV-CAE and CV-CAE.
Fully Connected: S Rw × S Rh × N parameters in RV-CAE; S Cw × S Ch × N × 2 parameters in CV-CAE.

Table 3. The OA and Kappa Coefficient of Our Algorithms and the Compared Algorithms.

Table 4. The Confusion Matrix of CV-CAE+SPF.

Table 5. The OA and Kappa of Our Algorithms and the Compared Algorithms.

Table 6. The OA and Kappa Coefficient of Our Algorithms and the Compared Algorithms.

Table 7. The Confusion Matrix of CV-CAE+SPF.