Adversarial Reconstruction-Classification Networks for PolSAR Image Classification

Polarimetric synthetic aperture radar (PolSAR) image classification has been used increasingly widely in recent years. PolSAR image classification is a dense prediction problem, and the recently proposed fully convolutional networks (FCN) model, which handles dense prediction well, has great potential for this task. Nevertheless, several problems arise when FCN is applied to PolSAR image classification. Li et al. proposed the sliding window fully convolutional networks (SFCN) model to tackle these problems. However, SFCN achieves good classification results only when sufficient labeled training samples are available. To address this problem, we propose adversarial reconstruction-classification networks (ARCN), which are based on SFCN and introduce reconstruction-classification networks (RCN) and adversarial training. The merit of our method is twofold: (i) a single composite representation that encodes information for both supervised image classification and unsupervised image reconstruction can be constructed; (ii) by introducing adversarial training, the higher-order inconsistencies between the true image and the reconstructed image can be detected and revised. Our method achieves strong performance in PolSAR image classification with fewer labeled training samples. We have validated its performance by comparing it against several state-of-the-art methods. Experimental results obtained by classifying three PolSAR images demonstrate the effectiveness of the proposed method.


Introduction
Polarimetric synthetic aperture radar (PolSAR) image classification is one of the most prominent applications in geoscience remote sensing [1]. Over the last few years, a substantial amount of PolSAR image data has been put into use [2]. Consequently, PolSAR image classification has gained significant research attention [3,4], and many methods for this task have been proposed [4,5]. The majority of the available methods are based on physical scattering mechanisms, which are obtained through various polarimetric decomposition methods [6]. Polarimetric target decomposition is one of the most powerful and widely used approaches for PolSAR image classification; it describes a target through its physical scattering characteristics. Several polarimetric target decomposition methods are reported in the literature, including the Krogager decomposition [7], Cloude-Pottier decomposition [8], Pauli decomposition [9], Huynen decomposition [10], Freeman decomposition [11], and extensions of these methods [12,13]. In addition, some researchers have used the statistical distribution of PolSAR data to classify PolSAR images [14,15]. For instance, starting from the complex Wishart distributions of the covariance and coherency matrices, Lee et al. [15,16] used the Wishart distance for PolSAR image classification. Research on the PolInSAR technique has also received a lot of attention [17-21]. However, all the methods mentioned above depend heavily on a complex analysis of PolSAR data [22], and an extensive analysis of the physical mechanism is hard [23].
In recent years, the convolutional neural network (CNN) model, built from convolution, pooling and nonlinear transformation operations, has obtained good results in many applications [34,35], e.g., action recognition [36], semantic segmentation [37], image classification [38] and scene labeling [39]. Nonetheless, the existing CNN models are not well suited to PolSAR image classification, which is a dense prediction problem. In the existing CNN classification framework, the input is an image and the output is the class of that image, so the classification result cannot describe image detail [40]. Consequently, the existing CNN classification frameworks for PolSAR images [41-43] take the neighborhood of each pixel as the input image in order to obtain the class of that pixel. Because neighboring patches overlap heavily, this scheme has no advantage in memory occupation. Long et al. proposed the fully convolutional networks (FCN) model [40], which can be trained in an end-to-end, pixels-to-pixels manner. FCN converts the fully connected layers of the traditional CNN model into convolutional layers, which enables an efficient classification net for end-to-end dense learning. Accordingly, FCN has great potential for the PolSAR image classification task. Nevertheless, because each PolSAR image has a different size, FCN cannot be directly applied to PolSAR image classification: no single existing FCN framework is capable of processing all PolSAR images, and a larger input image also increases the complexity of the classification framework. Recently, Li et al. [44] proposed sliding window fully convolutional networks (SFCN) to tackle the problems of FCN in PolSAR image classification. The sliding window operation of SFCN is similar to that of CNN, and Li et al. [44] designed a new training framework for SFCN. Nevertheless, because of its relatively complex network architecture, SFCN cannot obtain excellent classification results with fewer labeled training samples.
Based on deep learning, Ghifary et al. [45] presented a new model called the deep reconstruction-classification network (DRCN) for object recognition. DRCN jointly learns a shared feature representation for two tasks: (i) supervised source data classification; and (ii) unsupervised target data reconstruction. In this way, the extracted feature representation can preserve discriminability while encoding meaningful information from the target data. Like standard neural networks, DRCN can be optimized by backpropagation. DRCN has obtained considerable improvement over the state-of-the-art methods in the cross-domain object recognition task.
Adversarial training has become the state-of-the-art method for generative image modeling. Luc et al. [46] proposed an adversarial training method that is very helpful for training segmentation models. In their approach, a convolutional segmentation network is trained together with an adversarial network that discriminates segmentation maps coming from the segmentation network from those coming from the ground truth. Their method can detect and revise the higher-order inconsistencies between the ground truth maps and the segmentation result maps.
Based on the SFCN, DRCN and adversarial training models, we propose adversarial reconstruction-classification networks (ARCN) in this paper. The merits of our method are as follows: (i) since our method is based on SFCN, it can be trained end-to-end, pixels-to-pixels, while taking the spatial information into account; (ii) our method jointly learns a shared feature representation for supervised image classification and unsupervised image reconstruction, so that the learned representation supports the classification of all samples, not just the labeled training samples; (iii) by introducing adversarial training, our method can enforce forms of higher-order consistency between the true image and the reconstructed image. The rest of this paper is organized as follows. The methods used to extract features from PolSAR images are given in Section 2. Section 3 describes the related work. Our proposed ARCN method is presented in Section 4. Section 5 reports the experimental results. Finally, discussion and conclusions are provided in Sections 6 and 7.

Coherency Matrix
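Following [22], PolSAR data can be expressed with the scattering matrix S given by Equation (1), which contains the polarimetric information of the data:

$$S = \begin{bmatrix} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{bmatrix}. \tag{1}$$

In the case of monostatic backscattering, the reciprocity theorem holds, i.e., $S_{hv} = S_{vh}$. The scattering matrix can then also be expressed as the scattering vector [47]:

$$k = [a, b, c]^{T}, \tag{2}$$

where $a = (S_{hh} + S_{vv})/\sqrt{2}$, $b = (S_{hh} - S_{vv})/\sqrt{2}$, and $c = \sqrt{2}\,S_{hv}$. In addition, the coherency matrix T of PolSAR data can be expressed as follows [22]:

$$T = k k^{H} = \begin{bmatrix} |a|^{2} & ab^{*} & ac^{*} \\ a^{*}b & |b|^{2} & bc^{*} \\ a^{*}c & b^{*}c & |c|^{2} \end{bmatrix} = \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ T_{21} & T_{22} & T_{23} \\ T_{31} & T_{32} & T_{33} \end{bmatrix}. \tag{3}$$

In recent years, strong results in PolSAR image classification have been obtained with the coherency matrix T [22,23].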
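As a rough illustration of the definitions in this section, the following NumPy sketch builds the Pauli scattering vector and a single-look coherency matrix from one scattering matrix. The function names `pauli_vector` and `coherency_matrix` and the sample values are assumptions made for illustration only; multilook data would additionally average T over neighboring pixels.

```python
import numpy as np

def pauli_vector(S):
    """Pauli scattering vector [a, b, c] of a reciprocal 2x2 scattering matrix."""
    s_hh, s_hv, s_vv = S[0, 0], S[0, 1], S[1, 1]
    # a = (S_hh + S_vv)/sqrt(2), b = (S_hh - S_vv)/sqrt(2), c = sqrt(2) * S_hv
    return np.array([s_hh + s_vv, s_hh - s_vv, 2.0 * s_hv]) / np.sqrt(2.0)

def coherency_matrix(k):
    """Single-look coherency matrix T = k k^H (3x3, Hermitian)."""
    return np.outer(k, np.conj(k))

# Hypothetical scattering matrix with S_hv == S_vh (reciprocity).
S = np.array([[1.0 + 0.5j, 0.2 - 0.1j],
              [0.2 - 0.1j, 0.8 - 0.3j]])
k = pauli_vector(S)
T = coherency_matrix(k)

# The trace of T equals the total backscattered power (span).
span = abs(S[0, 0]) ** 2 + 2 * abs(S[0, 1]) ** 2 + abs(S[1, 1]) ** 2
print(np.isclose(np.trace(T).real, span))
```

The diagonal entries of T are real and the off-diagonal entries are complex, which is why real and imaginary parts are treated separately when features are assembled.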

Cloude-Pottier Decomposition
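As described in the eigen-decomposition model [8], the coherency matrix T is decomposed as:

$$T = U_3 \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} U_3^{H}, \tag{4}$$

where $U_3 = [e_1, e_2, e_3]$ and $\lambda_i$ (i = 1, 2, 3) are the eigenvector matrix and the eigenvalues of T, respectively, with $\lambda_1 \ge \lambda_2 \ge \lambda_3$. Based on the eigen-decomposition model, Cloude and Pottier put forward the Cloude-Pottier decomposition model. The entropy H, the anisotropy A, and the mean alpha angle $\bar{\alpha}$ are defined as:

$$H = -\sum_{i=1}^{3} P_i \log_3 P_i, \qquad A = \frac{\lambda_2 - \lambda_3}{\lambda_2 + \lambda_3}, \qquad \bar{\alpha} = \sum_{i=1}^{3} P_i \arccos\left(|e_{1i}|\right), \qquad P_i = \frac{\lambda_i}{\sum_{j=1}^{3} \lambda_j}, \tag{5}$$

where $e_{1i}$ (i = 1, 2, 3) is the first element of $e_i$. The Cloude-Pottier decomposition model plays an important role in the PolSAR image classification task [6,8].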
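The quantities H, A and the mean alpha angle can be computed directly from the eigen-decomposition of T. The sketch below is a minimal NumPy version under the standard Cloude-Pottier definitions; the function name `cloude_pottier` and the diagonal test matrix are assumptions for illustration, not part of the original method description.

```python
import numpy as np

def cloude_pottier(T):
    """Return (H, A, mean_alpha) for a 3x3 Hermitian coherency matrix T."""
    eigvals, eigvecs = np.linalg.eigh(T)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder so lambda_1 >= lambda_2 >= lambda_3
    lam = np.clip(eigvals[order], 0.0, None)    # guard against tiny negative round-off
    vecs = eigvecs[:, order]
    p = lam / lam.sum()                         # pseudo-probabilities P_i
    nz = p > 0                                  # skip zero terms (0 * log 0 is taken as 0)
    H = -np.sum(p[nz] * np.log(p[nz]) / np.log(3.0))   # entropy with base-3 logarithm
    denom = lam[1] + lam[2]
    A = (lam[1] - lam[2]) / denom if denom > 0 else 0.0
    # alpha_i is the arccos of the magnitude of the first element of e_i
    alpha_i = np.arccos(np.clip(np.abs(vecs[0, :]), 0.0, 1.0))
    mean_alpha = np.sum(p * alpha_i)
    return float(H), float(A), float(mean_alpha)

# Hypothetical coherency matrix with distinct eigenvalues 4, 2 and 1.
T_example = np.diag([4.0, 2.0, 1.0]).astype(complex)
H, A, mean_alpha = cloude_pottier(T_example)
print(H, A, np.degrees(mean_alpha))
```

For an identity coherency matrix the three eigenvalues are equal, so the entropy reaches its maximum of 1 and the anisotropy is 0.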
The features we extract in this paper are divided into two parts, $F = [F_1, F_2]$. To construct the first part, $F_1$, we use the coherency matrix T: $F_1 = [T_{11}, T_{22}, T_{33}, \mathrm{Re}(T_{12}), \mathrm{Re}(T_{13}), \mathrm{Re}(T_{23})]$. Here, $\mathrm{Re}(T_{ij})$ and $\mathrm{Im}(T_{ij})$ represent the real and the imaginary parts of $T_{ij}$, respectively. The second part of the features, $F_2$, is constructed from the Cloude-Pottier decomposition features.

Sliding Window Fully Convolutional Networks
FCN is a natural extension of CNN to dense prediction problems such as image segmentation. To recover the input resolution at the output layer, FCN adds upsampling layers to the standard architectures, so that images of arbitrary size can be processed. FCN uses skip connections between the downsampling and upsampling paths to tackle the loss of resolution caused by the downsampling operation; the skip connections help the upsampling path recover fine-grained information from the downsampling layers. However, as described in Section 1, FCN cannot be directly applied to PolSAR image classification. Images of different sizes need different FCN frameworks, and larger input images generally make the network architecture harder to design, so it is impractical to design a dedicated framework for each PolSAR image. To design a unified FCN architecture for different PolSAR images, Li et al. introduced the sliding window operator in [44]. The sliding window operation of SFCN is similar to that of CNN, and the number of images obtained by the sliding window operation is given by

$$num = \left(\mathrm{ceil}\left(\frac{Height - W}{S}\right) + 1\right) \cdot \left(\mathrm{ceil}\left(\frac{Width - W}{S}\right) + 1\right), \tag{7}$$

where ceil denotes rounding up to the nearest integer, Height and Width respectively denote the height and width of the image, W and S respectively denote the size and stride of the sliding window, and num denotes the number of images obtained by the sliding window operation.
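The window count can be checked with a few lines of Python. This is a minimal sketch: `num_windows` follows the formula described above, while `window_starts` is a hypothetical helper that assumes the last window is shifted back so that it stays inside the image, a detail the text does not specify.

```python
import math

def num_windows(height, width, window, stride):
    """Number of sliding-window crops for an image of size height x width."""
    rows = math.ceil((height - window) / stride) + 1
    cols = math.ceil((width - window) / stride) + 1
    return rows * cols

def window_starts(length, window, stride):
    """Start offsets along one axis (assumption: the last window is clamped to the border)."""
    count = math.ceil((length - window) / stride) + 1
    return [min(i * stride, length - window) for i in range(count)]

# A 512 x 512 image with window size 128 and stride 64 gives 7 x 7 windows.
print(num_windows(512, 512, 128, 64))
print(window_starts(512, 128, 64))
```

When the image size is not a multiple of the stride, the rounding-up step adds one extra window per axis so that the border pixels are still covered.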

Deep Reconstruction-Classification Networks
DRCN jointly learns two tasks: (i) supervised classification of the source data, and (ii) unsupervised reconstruction of the target data. The two tasks share the encoding parameters, while their decoding parameters are separate. The core contribution of DRCN is to construct a single composite feature representation, which encodes information for both the classification of the source data and the structure of the target data. The purpose is for the learned supervised classification function to obtain good classification results on the target data. That is to say, the unsupervised reconstruction task can be regarded as an auxiliary task that supports the adaptation of the classification task.

Semantic Segmentation Using Adversarial Networks
In [46], Luc et al. use an adversarial training method to improve the performance of a segmentation model. Their method is more interested in enforcing higher-order consistency in general than in a very specific class of higher-order potentials. Motivated by the generative adversarial network (GAN) model [48], they enforce this consistency through adversarial training rather than by directly integrating higher-order potentials into a conditional random field (CRF) model. For this reason, their objective function consists of a conventional multi-class cross-entropy loss and an adversarial term. The adversarial term encourages the semantic segmentation model to generate result maps that an adversarial classification network cannot distinguish from the ground truth maps. Because the adversarial network can evaluate the joint configuration of multiple label variables, it can enforce forms of higher-order consistency.

Methodology
In this paper, we propose a novel PolSAR image classification method, which we refer to as adversarial reconstruction-classification networks (ARCN). We first present the architecture of reconstruction-classification networks (RCN), which is based on the SFCN model. Then, we give the architecture and training details of our proposed ARCN method.

Reconstruction-Classification Networks
RCN consists of two pipelines: (i) a supervised classification network, and (ii) an unsupervised reconstruction network, as shown in Figure 1. In addition, the classification network part of Figure 2 gives the framework of SFCN. The two pipelines can be divided into three functions: (i) the encoding function, which corresponds to the intersection of the red rectangle and the blue rectangle in Figure 1; (ii) the classification function, which corresponds to the remaining part of the red rectangle in Figure 1; (iii) the reconstruction function, which corresponds to the remaining part of the blue rectangle in Figure 1. That is to say, RCN has two pipelines with a shared encoding representation. The RCN model is optimized through multitask learning [49], i.e., it jointly learns the supervised classification and unsupervised reconstruction tasks. The purpose is for the encoding function to learn the commonality between the two tasks. In this way, excellent classification results can still be obtained when the number of labeled training samples is limited.
We now describe RCN more formally. Let $f_c : X \to Y_c$ be the supervised image classification pipeline and $f_r : X \to X$ be the unsupervised image reconstruction pipeline of RCN, where $X$ and $Y_c$ represent the input image space and the ground truth space, respectively. Define the three functions mentioned above: (i) the encoding function $g_{enc} : X \to F$, (ii) the reconstruction function $g_{rec} : F \to X$, and (iii) the classification function $g_{cla} : F \to Y_c$, where $F$ represents the encoding feature space. Given an input image $x \in X$, $f_c$ and $f_r$ can be described as follows:

$$f_c(x) = (g_{cla} \circ g_{enc})(x), \tag{8}$$

$$f_r(x) = (g_{rec} \circ g_{enc})(x). \tag{9}$$

Let $\Theta_c = \{\Theta_{enc}, \Theta_{cla}\}$ and $\Theta_r = \{\Theta_{enc}, \Theta_{rec}\}$ respectively represent the parameters of the supervised classification model and the unsupervised reconstruction model, where $\Theta_{enc}$, $\Theta_{rec}$, and $\Theta_{cla}$ are the parameters of the encoding, reconstruction, and classification functions, respectively. The aim is to find a shared encoding function $g_{enc}$ that supports both $f_c$ and $f_r$.
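The two pipelines are compositions that share one encoder. The toy sketch below shows only that structure: the linear maps `W_enc`, `W_cla` and `W_rec`, their sizes, and the random inputs are hypothetical stand-ins, not the convolutional and deconvolutional layers of the actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: Theta_enc is shared, Theta_cla and Theta_rec are task specific.
W_enc = rng.normal(size=(15, 8))   # 15 input features -> 8 shared features
W_cla = rng.normal(size=(8, 3))    # shared features -> 3 class scores
W_rec = rng.normal(size=(8, 15))   # shared features -> reconstructed input

def g_enc(x):
    """Encoding function g_enc: X -> F."""
    return np.tanh(x @ W_enc)

def g_cla(f):
    """Classification function g_cla: F -> Y_c (softmax over classes)."""
    z = f @ W_cla
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def g_rec(f):
    """Reconstruction function g_rec: F -> X."""
    return f @ W_rec

def f_c(x):
    """Classification pipeline: g_cla composed with g_enc."""
    return g_cla(g_enc(x))

def f_r(x):
    """Reconstruction pipeline: g_rec composed with g_enc."""
    return g_rec(g_enc(x))

x = rng.normal(size=(4, 15))   # four hypothetical 15-dimensional samples
y_hat = f_c(x)                 # class probabilities, shape (4, 3)
x_rec = f_r(x)                 # reconstructions, shape (4, 15)
```

Because both pipelines call `g_enc`, any gradient from either task updates the same encoding parameters, which is the mechanism by which the reconstruction task can support classification.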
We use adversarial training to train the unsupervised reconstruction model, as described next. Let $\Theta_a$ represent the parameters of the adversarial networks, and let $f_a : X \to Y_d$ be the adversarial networks, where $Y_d = [0, 1]$. Given an input image $x \in X$, the output $f_a(x)$ is the predicted probability that x is a true image rather than a reconstructed one. Given $N_1$ labeled training samples $(x_i, y_i)$, where $y_i \in \{0, 1\}^K$ is a one-hot vector, and $N_2$ unlabeled training samples $x_j$, the loss can be defined as:

$$L_{ARCN}(\Theta_c, \Theta_r, \Theta_a) = \sum_{i=1}^{N_1} l_c(f_c(x_i), y_i) - \lambda \sum_{j=1}^{N_2} \left[ l_a(f_a(x_j), 1) + l_a(f_a(f_r(x_j)), 0) \right]. \tag{11}$$

In the above, $l_c(\hat{y}, y) = -\sum_{k=1}^{K} y_k \ln \hat{y}_k$ represents the multi-class cross-entropy loss for the predictions $\hat{y}$, which are the softmax output. Similarly, $l_a(\hat{z}, z) = -[z \ln \hat{z} + (1 - z) \ln(1 - \hat{z})]$ represents the binary cross-entropy loss. We minimize the loss with respect to the parameters $\Theta_c$ and $\Theta_r$ of RCN, while maximizing it with respect to the parameters $\Theta_a$ of the adversarial networks.
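The two loss terms used here are the standard cross-entropy losses. The following sketch writes them out in NumPy; the function names, the clipping constant `eps`, and the sample predictions are assumptions added for illustration.

```python
import numpy as np

def multiclass_cross_entropy(y_hat, y, eps=1e-12):
    """l_c(y_hat, y) = -sum_k y_k ln(y_hat_k), with y one-hot and y_hat a softmax output."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0)   # avoid log(0)
    return float(-np.sum(np.asarray(y, dtype=float) * np.log(y_hat)))

def binary_cross_entropy(z_hat, z, eps=1e-12):
    """l_a(z_hat, z) = -[z ln(z_hat) + (1 - z) ln(1 - z_hat)] for a label z in {0, 1}."""
    z_hat = float(np.clip(z_hat, eps, 1.0 - eps))               # avoid log(0)
    return float(-(z * np.log(z_hat) + (1 - z) * np.log(1.0 - z_hat)))

# Hypothetical values: a 3-class prediction and two discriminator outputs.
loss_c = multiclass_cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])   # equals -ln(0.7)
loss_true = binary_cross_entropy(0.9, 1)    # true image scored as 0.9
loss_fake = binary_cross_entropy(0.2, 0)    # reconstructed image scored as 0.2
print(loss_c, loss_true, loss_fake)
```

The adversarial networks want `loss_true` and `loss_fake` to be small, whereas the reconstruction pipeline wants its outputs to be scored as true images.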

Training the Adversarial Model
Since only the terms in Equation (11) that contain $l_a$ depend on the adversarial networks, the loss of the adversarial model can be described as:

$$L_A(\Theta_a) = \sum_{j=1}^{N_2} \left[ l_a(f_a(x_j), 1) + l_a(f_a(f_r(x_j)), 0) \right]. \tag{12}$$

Training the adversarial model minimizes the above loss function. The architecture of our adversarial model is shown by the right black rectangle in Figure 2.

Training the RCN Model
Given the adversarial networks, training the RCN model is equivalent to minimizing the multi-class cross-entropy loss while degrading the performance of the adversarial networks. In this way, it becomes difficult for the adversarial model to distinguish the reconstructed image produced by the RCN model from the true image. The objective function of the RCN model can be described as follows:

$$L_{RCN}(\Theta_c, \Theta_r) = \sum_{i=1}^{N_1} l_c(f_c(x_i), y_i) - \lambda \sum_{j=1}^{N_2} l_a(f_a(f_r(x_j)), 0). \tag{13}$$

Similar to Goodfellow et al. [48], we replace the term $-\lambda l_a(f_a(f_r(x)), 0)$ with $\lambda l_a(f_a(f_r(x)), 1)$. That is to say, we maximize the probability that the adversarial model predicts $f_r(x)$ to be the true image instead of minimizing the probability that it predicts $f_r(x)$ to be the synthetic image. It is not difficult to prove that $-\lambda l_a(f_a(f_r(x)), 0)$ and $\lambda l_a(f_a(f_r(x)), 1)$ have the same set of critical points. The reason for this replacement is that it produces a stronger gradient signal when the adversarial model makes an accurate prediction of the true/synthetic image. Preliminary experiments in [46] showed that this replacement helps accelerate the training process. Therefore, the objective function of the RCN model is updated as follows:

$$L_{RCN}(\Theta_c, \Theta_r) = \sum_{i=1}^{N_1} l_c(f_c(x_i), y_i) + \lambda \sum_{j=1}^{N_2} l_a(f_a(f_r(x_j)), 1). \tag{14}$$

$L_{ARCN}$ is optimized by alternately minimizing $L_A$ and $L_{RCN}$ using ADAM [50]. We compute the training classification accuracy at each iteration. The algorithm stops in two cases: first, when the training classification accuracy of multiple consecutive iterations is higher than a predefined accuracy value; second, when the iteration number reaches a predefined maximum. Our proposed ARCN method is summarized in Algorithm 1. In addition, we use dropout regularization [51] during the minimization of $L_{RCN}$ to prevent overfitting.
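The claim about the stronger gradient signal can be made concrete with a small numerical check. The sketch assumes, as is common but not stated in the text, that the adversarial network ends in a sigmoid applied to a logit s, so that the predicted probability of "true image" is sigmoid(s); the function names are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_original(s):
    """d/ds of -l_a(sigmoid(s), 0) = ln(1 - sigmoid(s)); this equals -sigmoid(s)."""
    return -sigmoid(s)

def grad_replaced(s):
    """d/ds of l_a(sigmoid(s), 1) = -ln(sigmoid(s)); this equals sigmoid(s) - 1."""
    return sigmoid(s) - 1.0

# A very negative logit means the adversarial model confidently (and correctly)
# labels the reconstruction as synthetic.
for s in (-6.0, -2.0, 0.0):
    print(s, grad_original(s), grad_replaced(s))
```

When the adversarial model is confident that a reconstruction is synthetic, the gradient of the original term with respect to the logit is close to zero, while the gradient of the replacement term stays close to -1, so the RCN model keeps receiving a useful training signal.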

Experimental Results
As stated in Section 2, we use the coherency matrix T and the Cloude-Pottier decomposition features as our original features, and the feature dimension is 15. As a pre-processing step, we use a refined Lee filter [52] to reduce speckle noise. To validate the performance of our proposed method, we use the following three PolSAR images: Xi'an, China; Oberpfaffenhofen, Germany; and San Francisco, USA. The performance of the proposed method is compared against SVM [26], the sparse representation classifier (SRC) [53], SAE [33], CNN [54], and SFCN [44]. Overall accuracy (OA) and the Kappa coefficient [55] are used as evaluation criteria. All methods are run on a 3.20-GHz machine with 8.00 GB of RAM and an NVIDIA GTX 1050 Ti GPU.
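Both evaluation criteria can be computed from a confusion matrix. The sketch below uses the standard definitions of overall accuracy and Cohen's Kappa; the function name and the example counts are hypothetical and are not taken from the reported experiments.

```python
import numpy as np

def overall_accuracy_and_kappa(confusion):
    """OA and Kappa from a square confusion matrix (rows: reference, columns: prediction)."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    p_observed = np.trace(c) / n                                 # overall accuracy
    p_chance = np.sum(c.sum(axis=0) * c.sum(axis=1)) / n ** 2    # expected chance agreement
    kappa = (p_observed - p_chance) / (1.0 - p_chance)
    return float(p_observed), float(kappa)

# Hypothetical two-class confusion matrix.
oa, kappa = overall_accuracy_and_kappa([[20, 5],
                                        [10, 15]])
print(oa, kappa)
```

Kappa discounts the agreement expected by chance, so it is a stricter measure than OA when the class sizes are unbalanced.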

Xi'an
The first PolSAR image is a C-band multilook PolSAR image that covers western Xi'an, Shaanxi, China. The left of Figure 3a gives its PauliRGB image with the corresponding coordinates; its size is 512 × 512. The right of Figure 3a gives a photo of the nearby area of Xi'an from Google Maps. The corresponding ground truth map, acquired by referencing [56], is shown in Figure 3b. Overall, 237,416 pixels are labeled in Figure 3b. The Xi'an image mainly contains three classes: water, grass and building. The corresponding color code is shown in Figure 3c.

Oberpfaffenhofen
The second PolSAR image is an L-band multilook PolSAR image that covers Oberpfaffenhofen, Germany, and was provided by the E-SAR system of the German Aerospace Center. The size of this PolSAR image is 1300 × 1200, and its PauliRGB image is shown on the left of Figure 4a. The right of Figure 4a gives a photo of the nearby area of Oberpfaffenhofen from Google Maps. The ground truth map, obtained by referencing [43], is shown in Figure 4b. In total, there are 1,374,298 labeled pixels. The corresponding color code is shown in Figure 4c. From the ground truth map, we can see that there are three classes in this PolSAR image: open areas, wood land and built-up areas.

Parameter Setting
For the SVM method, the radial basis function (RBF) kernel is used. For the SRC method, we set the number of dictionary atoms to 15. For the SAE model, the dimensions of the middle layers are fixed to 300 and 100, respectively. For the CNN model, a 21 × 21 neighborhood is used for each pixel, and Figure 6 gives its classification architecture. The sliding window size and stride in Equation (7) are fixed to 128 and 64, respectively. The SFCN model has the same architecture as the classification network of the RCN model. Figures 1 and 2 show the architecture of our method. In all experiments, the proportions of labeled pixels per class used for training on the three PolSAR images are 1%, 0.2% and 0.1%, respectively. For SFCN and our proposed method, only the training pixels are involved in updating the network parameters in the training stage.

Xi'an Data Set
As previously stated, 1% of the labeled pixels are used for training and the rest are used for testing. The classification accuracy of our method and the five compared methods is shown in Table 1. From Table 1, we can see that our method has better classification accuracy, which is about 3.97%, 7.06%, 4.80%, 3.09% and 10.84% higher than that of the five compared methods, respectively. The classification results of Xi'an with the various methods are shown in Figure 7. As shown in Figure 7a-c, the results of SVM, SRC and SAE lack regional continuity. For example, the building in the upper left corner is misclassified as grass. From Figure 7d, we can see that CNN does not perform well at the boundaries between classes. This is because CNN uses a 21 × 21 neighborhood of each pixel as the input image, which makes pixels near class boundaries harder to classify. SFCN does not perform well in recognizing water, as shown in Figure 7e. Because the selected training samples are too few, SFCN cannot learn the internal structure of Xi'an. In contrast, as can be seen in Figure 7f, our method clearly outperforms the compared methods, with fewer misclassified pixels. Furthermore, we use white ellipses to highlight notably different classification results. Within the white ellipses, the building is classified better by our method than by the five compared methods. In summary, these results support the effectiveness of our method.

Oberpfaffenhofen Data Set
Out of the labeled pixels, 0.2% are used for training and the rest for testing. The classification accuracy on Oberpfaffenhofen obtained with our method and the compared methods is listed in Table 2. From Table 2, we can see that the OA of our method is about 8.02%, 8.96%, 7.15%, 3.75% and 2.04% higher than that of the five compared methods, respectively. Figure 8 shows the classification results of Oberpfaffenhofen with the different methods. For SVM, SRC and SAE, there are a large number of misclassified pixels in all classes, as can be seen in Figure 8a-c. For CNN, many pixels are confused between open areas and built-up areas in Figure 8d. From Figure 8e, we can see that the classification result of SFCN is good overall; however, SFCN does not perform well at the edges of classes. As shown in Figure 8f, our method obtains a better classification result than the methods used for comparison. The notably different classification results are highlighted by the white rectangles in Figure 8. A visual comparison of the results within the white rectangles shows that a large number of pixels are misclassified by the five compared methods. These observations demonstrate the effectiveness of our proposed ARCN method.

San Francisco Data Set
In this case, we use 0.1% of the labeled pixels for training and the remainder for testing. Table 3 lists the classification accuracy on San Francisco obtained by the six aforementioned methods. From Table 3, we can see that our method has higher classification accuracy than the five compared methods. Nevertheless, the gap between the OA of CNN and that of our method is not large. Figure 9 shows the classification results of San Francisco with the six aforementioned methods. As shown in Figure 9a-c, the classification results of SVM, SRC and SAE are relatively poor. From Figure 9d, we can see that the classification result of CNN has good regional continuity. This is because the structure of San Francisco is not complicated; for example, the water occupies a large area that contains no other classes. Furthermore, CNN can take the spatial information into account, which facilitates image classification. From Figure 9e, we can see that the classification result of SFCN is acceptable. We use white rectangles to highlight the notably different classification results in Figure 9. Comparing the results within the white rectangles, it can be concluded that our method acquires a better classification result than the five compared methods. In summary, these results demonstrate the effectiveness of our proposed ARCN method in classifying San Francisco.

Accuracy
As mentioned in Section 5.2, the proportions of training samples used for the three images are 1%, 0.2% and 0.1%, i.e., 2375, 2751 and 1807 labeled pixels. The number of samples selected with our method is much smaller than that used by the available state-of-the-art methods [22,33]. Nevertheless, our method still obtains comparatively better results. We believe that this is due to the following reasons: (i) because our method is based on SFCN, an efficient classification net for end-to-end dense learning can be learned; (ii) our method jointly learns a shared encoding representation for two tasks: supervised image classification and unsupervised image reconstruction. By optimizing the unsupervised image reconstruction task, the learnt encoding representation becomes highly abstract, which facilitates the image classification task. Since the learnt encoding representation can be used to reconstruct all the samples and to classify the labeled training samples, the remaining samples, which belong to the same distribution as the labeled training samples, can also be classified correctly with it; (iii) by introducing adversarial training, our method can enforce forms of higher-order consistency between the true image and the reconstructed image. For these reasons, our method can obtain excellent classification results with a small number of labeled training samples.

Execution Time
Table 4 summarizes the execution time of the various methods on the three PolSAR images, where "Train" denotes the training time, "Predict" denotes the time taken to classify the entire image, and "Total" denotes the sum of the training and prediction times. From Table 4, we can see that the execution time of our method is higher than that of the methods used for comparison. The reasons are as follows: (i) our method must process two tasks simultaneously, i.e., supervised image classification and unsupervised image reconstruction; (ii) our method introduces adversarial training, which is time consuming.

Memory Consumption
The memory consumption of the various methods on the three PolSAR images is given in Table 5, where the symbol "G" in the caption denotes gigabytes. From Table 5, we can see that the memory consumption of CNN is the highest among all the methods. The main reason is that, in the CNN classification framework used for PolSAR images, the neighborhood of each pixel is taken as the input in order to obtain the class of that pixel. Compared with CNN, the memory consumption of the other two compared methods and of our method is acceptable.

In summary, SVM, SRC and SAE cannot obtain satisfactory results in PolSAR image classification, because they do not take the spatial information of the image into consideration. Nevertheless, they perform very well in time consumption and memory occupation, although the prediction time of SVM is longer than that of the other two methods. CNN can take the spatial information into account by using the neighborhood of each pixel as the input image, and is consequently capable of obtaining acceptable classification results. However, using the neighborhood of each pixel as the input image also results in repeated memory consumption, which is the shortcoming of CNN in terms of memory. Because the framework of SFCN is more complex than that of CNN and SAE, SFCN needs more labeled training samples to obtain promising classification results; therefore, SFCN does not achieve ideal classification results in this paper. Our proposed ARCN method jointly learns a shared feature representation for supervised image classification and unsupervised image reconstruction while introducing adversarial learning. Therefore, our method can obtain competitive classification results with fewer labeled training samples. Furthermore, our method also performs well in memory consumption. However, our method still has room for improvement in terms of time consumption.

Figure 1. The diagram of RCN, where Conv denotes the convolutional layer, Pool denotes the max pooling layer, Deconv denotes the deconvolutional layer, the white, blue and black arrows denote the convolution, max pooling, and deconvolution operations, respectively, and "+" represents the add operation. RCN consists of two pipelines: (i) the classification network, indicated by the red rectangle, whose input and output are Image and Classification result, respectively, and (ii) the reconstruction network, indicated by the blue rectangle, whose input and output are Image and Image', respectively. In addition, the classification network is also the framework of SFCN.

Adversarial Reconstruction-Classification Networks

Adversarial Training for RCN
ARCN can be divided into two parts: RCN and the adversarial networks, as shown in Figure 2. The left black rectangle in Figure 2 represents the RCN model, which takes Image as input and produces the classification result and the reconstruction of the Image. The right black rectangle represents the adversarial networks, which take an image as input and produce a class label (1 = true image, 0 = synthetic image).

Figure 3. Xi'an. (a) Left: PauliRGB image with its coordinates; right: photo of the nearby area with its coordinates, from Google Maps. (b) Ground truth map. (c) Color code.

Figure 4. Oberpfaffenhofen. (a) Left: PauliRGB image with its coordinates; right: photo of the nearby area with its coordinates, from Google Maps. (b) Ground truth map. (c) Color code.

San Francisco
The third PolSAR image is a C-band multilook PolSAR image that covers the area around the bay of San Francisco, including the Golden Gate Bridge. Because this PolSAR image provides good coverage of both natural and man-made terrain, it has been widely used in PolSAR image classification. The size of this PolSAR image is 1800 × 1380, and its PauliRGB image is shown on the left of Figure 5a. The right of Figure 5a gives a photo of the nearby area of San Francisco from Google Maps. Figure 5b gives the corresponding ground truth map, which is acquired by referencing [43]. The number of labeled pixels is 1,804,087. This data set mainly contains five classes: ocean, vegetation, developed, low density urban and high density urban. The corresponding color code is shown in Figure 5c.

Figure 5. San Francisco. (a) Left: PauliRGB image with its coordinates; right: photo of the nearby area with its coordinates, from Google Maps. (b) Ground truth map. (c) Color code.

Figure 6. The architecture of CNN, where Image represents the input image, Conv denotes the convolutional layer, Pool denotes the max pooling layer, flat denotes the flatten operation, fc denotes the fully connected layer, and Result denotes the classification result.

Table 1. Classification results of Xi'an with various methods.

Table 2. Classification results of Oberpfaffenhofen with various methods.

Table 3. Classification results of San Francisco with various methods.

Table 4. Execution time of the three PolSAR images with different methods (s).

Table 5. Memory consumption of various methods corresponding to the three PolSAR images (G).