Land Cover Classiﬁcation for Polarimetric SAR Images Based on Vision Transformer

: Deep learning methods have been widely studied for Polarimetric synthetic aperture radar (PolSAR) land cover classiﬁcation. The scarcity of PolSAR labeled samples and the small receptive ﬁeld of the model limit the performance of deep learning methods for land cover classiﬁcation. In this paper, a vision Transformer (ViT)-based classiﬁcation method is proposed. The ViT structure can extract features from the global range of images based on a self-attention block. The powerful feature representation capability of the model is equivalent to a ﬂexible receptive ﬁeld, which is suitable for PolSAR image classiﬁcation at different resolutions. In addition, because of the lack of labeled data, the Mask Autoencoder method is used to pre-train the proposed model with unlabeled data. Experiments are carried out on the Flevoland dataset acquired by NASA/JPL AIRSAR and the Hainan dataset acquired by the Aerial Remote Sensing System of the Chinese Academy of Sciences. The experimental results on both datasets demonstrate the superiority of the proposed method. Author Conceptualization, and methodology, H.W.; software, H.W.; validation, H.W.; formal analysis, H.W.; investigation, H.W. and C.X.; resources, H.W.; data curation, writing—original draft preparation, H.W.; writing—review and editing,


Introduction
Polarimetric synthetic aperture radar (PolSAR) provides fully polarimetric backscattering observations of the earth's surface under all-weather and day-and-night conditions. It is widely applicable to land cover classification.
The existing PolSAR land cover classification methods can be divided into conventional methods without deep learning and deep learning methods. As for the conventional method, as early as the late 1980s, classification methods that utilized the complete polarimetric information was proposed by Kong [1] and Lim [2], based on the Bayes classifier and the complex Gaussian distribution. Lee [3] extended their method and proposed an optimal classifier based on the complex Wishart distribution, namely the Wishart classifier. These kinds of methods are known as the statistical methods for PolSAR land cover classification. To characterize the heterogeneity of the land cover scattering medium, the Wishart classifier has been extensively improved by generalizing Wishart distribution to many other complicated distributions [4][5][6][7]. Markov Random Fields [8][9][10] were also introduced to describe the association information between pixels. However, the statistical classification methods cannot describe the characteristics of the spatial structure of the land covers and perform poorly in the case of high resolution and complex scenarios.
Another conventional approach to PolSAR land cover classification is based on the feature representation of PolSAR images and the supervised classifiers. The target decomposition methods [11][12][13], which have clear physical interpretations, are widely used in feature representation. As different decomposition methods have their own applicabilities, they are often used in combination, and many land cover classification methods [14][15][16] were derived based on them. However, due to the complexity of land cover scatters, the classification methods based on hand-crafted features cannot achieve satisfactory performance.
As deep learning [17] has been widely used in various application fields, deep learning methods for PolSAR land cover classifications have also been widely studied. As convolutional neural networks (CNN) [18] have been widely applied in computer vision tasks, most PolSAR land cover classification deep learning methods are based on CNN. Zhou et al. [19] first used CNN for PolSAR land cover classification. The model consisted of two convolution layers followed by two fully connected layers with an input size of 8 × 8 around the interested pixels, and achieved convincing classification performance. Subsequently, various CNN-based land cover classification methods have been proposed. In terms of network architecture, Zhang et al. [20] proposed complex-valued CNN (CV-CNN) to adapt to the arithmetic characteristics of complex data. Dong et al. [21] introduced the 3-D convolution to extract features from both spatial and channel dimensions. In terms of input features, Chen et al. [22] studied the input features with roll invariance. Yang et al. [23] developed a feature selection model based on multiple hand-crafted polarimetric features. In terms of training strategies, Xie et al. [24] introduced semi-supervised learning. Liu et al. [25] and Zhao et al. [26] introduced adversarial learning to generate samples. In general, a wide variety of CNN-based deep learning methods have been proposed, and the classification performance has gradually improved.
In recent years, in addition to CNN, the Transformer-based method is worthy of attention. Transformer [27] is a self-attention-based architecture that was first used in natural language processing (NLP). The network architecture based on the self-attention mechanism has the capability to extract spatial correlation information in a global range, and thus, has a flexible feature representation capability. Inspired by NLP successes, multiple works have tried to incorporate self-attention mechanisms into computer vision tasks [28][29][30][31]. Carion et al. [30] proposed Detection Transformer (DETR) and applied Transformer to the field of object detection. DETR maintained the model backbone as CNN and used Transformer to generate box prediction. Dosovitskiy et al. [31] proposed Vision Transformer (ViT), which completely abandoned the convolution structure widely used in image processing. By dividing the input images into several local patches, ViT applied a standard transformer directly to the images with the fewest possible modifications, and outperformed the classic ResNet-like CNN architectures [32]. These works explored the potential of transformer structures for computer vision tasks, and subsequently, many improvements for transformer-based structures have been proposed. Touvron et al. [33] proposed Dataefficient image Transformers (DeiT), which used a teacher-student strategy to improve the performance of ViT when trained on insufficient amounts of data. Han et al. [34] pointed out that the attention inside the local patches is also essential, and a new structure called Transformer in Transformer (TNT) is proposed. Stude et al. [35] explored image segmentation methods based on the transformer structure. Liu et al. [36] proposed Swin Transformer, which can serve as a general-purpose backbone for computer vision.
The performance of deep learning methods is closely related to the amount of training data, and the same is true for transformer-based methods. To take advantage of large amounts of unlabeled data, self-supervised pre-training methods for transformer structures have also been studied. For CNN structures, self-supervised pre-training methods are mainly based on contrastive learning [37], which is an approach to the pre-train model with pseudo-labeled data generated from unlabeled data. In contrastive learning, the siamese network architecture and data augmentation were used to construct training samples, and the CNN model was pre-trained by contrastive loss [38,39]. This idea was generalized to ViT, and a self-supervised pre-training method was derived for transformer structures, namely MoCoV3 [40]. However, MoCoV3 requires an empirical training strategy to avoid the instability problem of the training process. To obtain a simple and effective self-supervised pre-training method for ViT, He et al. [41] used the idea of mask encoding from BERT [42], which is a self-supervised pre-training method for NLP. The idea of mask encoding was implemented by adding random masks on the input image, and reconstructing the masked part by an encoder-decoder structure. The derived method, namely Masked AutoEncoder (MAE), can effectively pre-train Vision Transformer on unlabeled data. Although Transformer has been widely studied in computer vision, its potential in PolSAR land cover classification has not been fully exploited. Recently, Dong et al. [43] explored the application of a shallow ViT (SViT) in PolSAR land cover classification. The good results of SViT demonstrate the potential and feasibility of the transformer structure in PolSAR image processing. However, SViT has two drawbacks. The first is that SViT has only one layer of transformer block, which cannot make full use of the flexible feature representation capability of the Transformer structure. Second, the input size is 16 × 16, which limits the receptive field of the model. The receptive of the model is the size of the detail of the input image that is used for the classification of a pixel. The performance of PolSAR land cover classification of a model is closely related to the receptive field [44]. A small receptive field is not sufficient to extract the features of the objects with a large space area, and the land cover objects usually occupy a large area of pixels in high-resolution PolSAR images. Moreover, the classification results obtained by a model with a small receptive field are susceptible to noise and heterogeneity of land cover objects. Therefore, to improve the performance of PolSAR land cover classification, it is necessary to enlarge the receptive field of the model.
To solve the aforementioned problems, a PolSAR land cover classification method based on the a Vision Transformer is proposed in this paper. To make full use of the flexible feature representation capabilities of ViT, the input size of the proposed model is set to an empirical size of 224 × 224. Moreover, the depth of the proposed model is increased compared to SViT. However, the growth of the model capacity will lead to an increase in the difficulty of training, and more training data is needed to ensure the performance of the model. Although the amount of PolSAR data is large, due to its high annotation cost, the amount of labeled data is scarce, which is not enough for the supervised learning of the proposed model. To address this issue, the MAE method is employed to pre-train the ViT backbone structure of the proposed model with the help of abundant PolSAR unlabeled images. After the backbone is pre-trained, an image-segmentation-based land cover classification model is fine-tuned on the labeled dataset.
The remainder of this paper is organized as follows. In Section 2, the existed typical CNN-based land cover classification method and dilated convolution are briefly introduced, and the proposed method is described in detail along with the MAE pre-training method. The results of comparison experiments and ablation experiments are given in Section 3. Some additional discussions of the experimental results are shown in Section 4. Finally, the research is concluded in Section 5.

Methods
In this section, the representation and preprocessing of PolSAR images are presented first. The typical CNN-based land cover classification method and dilated convolution are briefly introduced. Then, the proposed classification method is described in detail. The Vision Transformer [31], which is the backbone of the proposed model, and the detailed implementation of the classification method are described. Further, the Masked Autoencoder pre-training method [41] is introduced. In the pre-training phase, the ViT backbone is trained by the MAE method with a large number of unlabeled PolSAR images. Then the proposed model is fine-tuned on the labeled training set to train the classifier.

Representation and Preprocessing of PolSAR Image
Each pixel of the PolSAR images contains the polarimetric backscattering information of the corresponding resolution cell, which can be expressed by the Sinclair matrix S [45] as follows: where S qp represents the complex backscattering coefficient when the polarization of the incident field and scattered field is p and q, respectively (p, q ∈ {H, V}). In the monostatic backscattering case, the reciprocity of the target restricts the Sinclair matrix to be symmetric, which is S HV = S V H . Thus, the Sinclair matrix can be represented by a 3-D polarimetric target vector k called the Pauli vector, which becomes where (·) T means the transpose. Then, the polarimetric coherency matrix T can be obtained by where (·) * represents the complex conjugate, and · indicates temporal or spatial ensemble averaging, which is also known as the multilook operation. Noting that matrix T is a Hermitian matrix, the upper triangular elements of matrix T can be taken as the input of the network model, which can be expressed in a 9-D real vector f as follows: where Re(·) and Im(·) represent the real and imaginary parts of a complex number, respectively.
Usually there are some numerical problematic pixels in PolSAR images, which may make the model training process unstable. To avoid this issue, each element of f is constrained in a dynamic range, which is [Th min (i), Th max (i)] for each f i , where i = 1, 2, · · · , 9 and Th min (i), Th max (i) are the 2-th and 98-th percentile of f (i) in the whole image, respectively. Then, f is normalized to zero mean and unit variance in each image.

Typical CNN-Based Method and Dilated Convolution
A typical CNN-based PolSAR land cover classification method [19][20][21] usually receives the polarimetric features in a local window of a pixel as input, and outputs the land cover type of the pixel. The classification of the whole image is achieved based on a sliding window that traverses all pixels in the image. Limited by the small number of labeled samples, the network architectures are usually shallow convolutional neural networks, and the input size of the network usually does not exceed 16 × 16. Therefore, the receptive fields of these CNN-based methods are limited by the small input size.
For high-resolution images, the small receptive field is not enough to capture the spatial features of land cover objects. Therefore, to obtain fair comparison results with the proposed method on high-resolution images, dilated convolution [46] is used to increase the receptive fields of these CNN-based methods. At the same time, the input image size of the model is also increased.
The principle of dilated convolution can be illustrated by Figure 1. By introducing a parameter named dilation rate, the dilated convolution obtains a receptive field larger than that of the conventional convolution with the same kernel size. For a k-dilated convolution layer with kernel size w × w, the receptive field of the input to this layer is 1 + k(w − 1).

The Proposed Land Cover Classification Method
The convolutional neural network architecture [19] is widely used in PolSAR land cover classification, but its performance is limited by the receptive field of the model, especially in the high-resolution case. To address this problem, we introduce the Vision Transformer (ViT) [31] as the backbone of the model. Compared with the 16 × 16 input size in [43], to fully exploit the flexible receptive field of the transformer block, we choose a larger input size of 224 × 224. The size of 224 is a good empirical choice, which is also the input size of the original ViT model [31]. An overview of the ViT-based land cover classification method is shown in Figure 2. The PolSAR image is firstly sliced into several image patches and embedded to feature vectors by a linear projection. The feature embedding vectors are then fed to a feature encoder consisting of alternating stacks of multiheaded self-attention (MSA) and multi-layer perceptron (MLP) blocks further to extract the long-range correlation features of the image. Finally, each image patch is classified by a linear projection, and the final classification result is obtained by upsampling.

Feature Embedding
In the ViT model, the subsequent feature encoder receives a sequence of token embeddings; thus, an image feature embedding is performed to transform a 3-D PolSAR image into 2-D token embeddings. As shown in Figure 2, a PolSAR image I ∈ R H×W×C is sliced into N p patches I i ∈ R P×P×C , where i ∈ 1, · · · , N p and N p = HW/P 2 is the number of image patches. Each image patch I i is then embedded into a 1-D vector in R (PPC) and transformed into a 1-D patch embedding vector E Moreover, each image patch is given a corresponding position embedding E i pos ∈ R L . The 2-D position embedding is used in this paper, which is obtained by applying a sine and cosine transform to the location of the corresponding patch [27,31], as expressed in Equations (5) and (6).
As the sequence input of the subsequent feature encoder cannot maintain the position information of the image patch, the final feature embedding vector E (i) f ∈ R L for an image patch is obtained by adding the image patch embedding E (i) p and position embedding E i pos , so as to fuse the image information and position information of the patch. By performing the above operation on all N p image patches and concatenating these embeddings, the whole image is transformed into a 2-D feature embedding z (0) ∈ R N p ×L . The feature embedding process described above can be expressed as follows: where w ∈ R L/4 is a frequency vector for position embedding, and M is a parameter which is usually chosen as M = 10, 000 in many implementations [27]. x (i) , y (i) are the locations of the i-th image patch. To be pointed out, the class token is not used in the proposed method.

Feature Encoder
The feature encoder consists of layers of transformer blocks, as is shown in Figure 3, which is a cascade of multiheaded self-attention (MSA) block and MLP blocks. In each transformer block, the input features are firstly normalized by LayerNorm [47] and then fed into an MSA block. The MSA, which plays a similar role as convolution layers in CNN, can extract the spatial features of the images by capturing the long-range interaction between different image patches. The interaction information is represented by the weighted sum of the feature embeddings between patches, and the calculation of the weights is implemented by the self-attention block. Specifically, the self-attention block (SA) maps the feature embedding x i ∈ R L of each image patch into query vector q i ∈ R L , key vector k i ∈ R L , and value vector v i ∈ R L , through learnable linear matrix W Q ∈ R L×L , W K ∈ R L×L , and W V ∈ R L×L , respectively. Further, the weights between patch i and patch j are generated using the scaled dot-product function between the query vector q i and the key vector k j , and the output y i is obtained by the weighted sum of the value vectors v j , j = 1, 2, · · · , N p . If the vectors x i , q i , k i , and v i are packed together into the matrix forms X, Q, K, and V ∈ R N p ×L , then SA module can be expressed in matrix form as follows, where softmax is used to scale the dot product to valid weights, and 1 √ L is a scaling factor for numerical stability. Since the SA module is based on a global weighted summation, unlike the convolution operation which has the a limited receptive field, it is able to capture image spatial correlation features globally.
The multiheaded self-attention block consists of multiple SAs in parallel. Suppose that the MSA has n head heads and the input of the l-th MSA in the model is z (l−1) ∈ R N p ×L , then the MSA will cut z (l−1) into n head slices z is processed with a separate SA block. Then the n head outputs are concatenated and fused by a linear projection W O ∈ R L×L . The MSA can be expressed as follows, (11) MSA allows for a better exploitation of the correlations in the embedded data through the joint representation of separate self-attention at separate views.
In the transformer block, after MSA processing of the embedding vector, it is also layer normalized and transformed by an MLP block. The MLP block is implemented by a fully connected layer, followed by a GeLU activation layer [48], and another fully connected layer. The data dimension of the intermediate layer is L · r MLP , where r MLP is a pre-defined scale factor. Thus, the processing of a complete transformer block can be described by the following expression, where LN(·) represents the LayerNorm layer.

Land Cover Classifier
Instead of using a sliding window to capture the image patch centered on each pixel and classify all patches [19][20][21][22][23][24]43,44,49,50], the proposed method uses the segmentation method to implement pixel-by-pixel classification, i.e., the proposed method will assign a corresponding category to each pixel of the input image in a single model forward propagation. As shown in Figure 2, the feature z (D) ∈ R N p ×L put out by the feature encoder with D blocks is a stack of the feature vectors in R L corresponding to each image patch. A linear classifier is used to assign each feature vector a prediction vector, which consists of the predicted probabilities for each category, and bilinear upsampling is performed to obtain the final classification result of each pixel in the image.
In the training phase, the training set is the images with size H × W obtained by random cropping around the training sample pixels. In the inference phase, large PolSAR images are sliced into several blocks with size H × W. Then, the classification results of the blocks can be stitched together to obtain the results of the large PolSAR images. An overlap of 20% is introduced when the images are sliced to prevent inaccurate classification near the edge part of the image block. For pixels that appear in the overlapping area of multiple blocks, the classification result is determined by the superposition of the prediction vectors given by each block.
Under the image-segmentation-based classification scheme, the choice of patch size P will affect the resolution of the classification map. As the process of bilinear interpolation does not introduce additional knowledge, the pixel-by-pixel classification result totally depends on the classification results of small patches with size P × P. Consequently, objects much smaller than P × P are difficult to classify correctly, which may result in a loss of resolution for the classification map. . However, land covers usually have consistent types within a certain range, so this loss of resolution on the classification map usually does not cause a degradation in classification performance. Moreover, as the patch size P decreases, the number of image patches N p increases, leading to a rapid increase in the number of parameters of the ViT model, which will make the training progress more difficult. With a combination of the above factors, P = 8 is empirically chosen as the image patch size.
The immediate advantage of the image-segmentation-based approach over the method based on sliding windows is that only several forward propagations are required to classify a large image, leading to highly time efficiency for the inference of the model. For the sliding window method, to assign a class to each pixel in the image, one forward propagation is required for each pixel, resulting in the forward propagation of the model needing to be computed many times. When processing high-resolution PolSAR images, the model requires a large receptive field to obtain good classification performance, so each forward propagation of the model needs to process a large-sized input, which will cause a severe increase in the inference time consumption of the sliding-window-based method.
A quantitative comparison of inference time efficiency is shown in Table 1. The proposed method is compared with a sliding-window-based CNN method on an image of 2500 × 2500 pixels. The sliding-window-based CNN is implemented according to [19]. The hardware device used is a high performance server with a CPU of Intel(R) Core(TM) i9-10940X @ 3.30 GHz and a GPU of Nvidia RTX 3090 Ti, and the software implementation is based on Pytorch [51]. As can be seen from the quantitative results, although the slidingwindow-based method is based on a small convolutional network whose computation cost in a single forward propagation is low, its total computation cost for the whole 2500 × 2500 image is similar to the proposed method. For time efficiency, the costs of file I/O should also be taken into account, so the time efficiency of the proposed method is much higher than that of the sliding-window-based method. The advantage in inference time efficiency is the main reason why the proposed method adopts a segmentation-based classification method instead of the sliding-window-based method, which is more commonly used in PolSAR land cover classification.

Pre-Training Method
ViT-based models usually only perform well with a large number of training samples. However, the amount of labeled PolSAR images is usually small due to the high labeling cost of PolSAR images. However, the amount of unlabeled data is relatively large. To address this problem, the Masked Autoencoder (MAE) self-supervised training method [41] is used to pre-train the proposed model on unlabeled data. The framework of the MAE self-supervised pre-training method is shown in Figure 4. Similarly to the common autoencoder method, the MAE method reconstructs the original input based on the encoding features and trains the encoder by minimizing the reconstruction error. Unlike the common autoencoder, the MAE method is designed specifically for ViT. In the feature encoding stage, the image patches are randomly sampled and the remaining patches are masked. Only the unmasked patches are fed into the subsequent Transformer feature encoder. Assuming that G mask is a random permutation of the patch index [1, · · · , N p ], and N keep is the number of unmasked patches, then the encoding of the masked image can be expressed as G mask : 1, · · · , N p → RandomPermutation 1, · · · , N p , where z (0) is the feature embedding of the input image in Equation (8), and TransformerBlock(·) represents the transformer block given in Equation (12). The decoder consists of several transformer blocks with lower feature dimensions L dec . Suppose the feature dimension of the encoder is L. In the decoding stage, both encoded unmasked patch embeddings and mask tokens are linear mapped by W ed ∈ R L×L dec and together fed into the decoder. The mask tokens are zero vectors that indicate the presence of the missing patches in the encoding stage. Moreover, the position embedding is added to both unmask patch embeddings and the mask tokens before being fed into the decoder. Finally, the output of the decoder is mapped into R N p ×(P 2 ) and expanded into a reconstructed imageÎ of size H × W. The processing of the decoder can be expressed aŝ where z (l) enc is the output feature vector of the l-layer encoder. The reconstruction error is measured by the mean square error. As the diagonal and non-diagonal elements of the polarimetric coherency matrix have different physical interpretations, different weights are given to the reconstruction errors of the diagonal and non-diagonal elements. According to the notation of Equation (4), the reconstruction error can be expressed as where f is the vector representation of the origin coherency matrix, andf is the vector representation of the reconstructed coherency matrix.
Compared with the autoencoder without random masking, the MAE method uses the unmasked patches to reconstruct the masked patches, which means the reconstruction problem cannot be solved by trivial extrapolation from the input. The model will be trained to pay more attention to the implicit association information between image patches rather than the internal local features in the patches. Therefore, the model derived from MAE can achieve good feature representation with high-level semantics.

Data Description
To evaluate the classification performance, two datasets are used for the experiments. The first is the PolSAR images with P-, L-, and C-band, acquired in Flevoland, Netherlands by NASA/JPL AIRSAR. The pixel spacing is 6.66 m in range and 12.16 m in azimuth. The region of interest (ROI) has a size of 1100 × 1024 pixels. The land cover mainly includes several types of crops as well as buildings and roads, and the ground truth is from [52]. The geographic location of the images, the pseudo Pauli images of the three bands, and the ground truth are shown in Figure 5.  18 m, 0.12 m), respectively. The ROI includes 3 images with size 12, 500 × 10, 600 pixels, which are registered between different bands, and the pixel spacing in slant range and azimuth are 0.18 m and 0.12 m, respectively. The ground truth includes six categories: buildings, crops, moss, trees, roads, and water. The annotations are obtained by combining the in-site survey and the corresponding optical remote sensing images. The geographic information, the pseudo Pauli images of the three images in the ROI with L-, C-, and Ka-band, and their corresponding ground truth are shown in Figure 6.

Pre-Training Settings and Results
In the pre-training stage, a huge amount of unlabeled PolSAR images are used, including PolSAR images of different bands and resolutions acquired by AIRSAR, Radarsat-2, GaoFen-3, and the ARSSCAS. The band of the data varies from P-band to Ka-band, and the resolution varies from 10 m to 0.1 m. The total amount of the original data is about 500 GB.
For the model parameters, the input image size is chosen as H = W = 224 and the patch size is P = 8. The embedding dimension is L = 576, and the number of heads of the MSA is n head = 12. For the decoder, the embedding dimension and heads number are 224, 16, respectively. The layers of the decoder are set to 2. The depth and mask ratio of the encoder are compared for several parameters. Moreover, whether to perform pixel normalization in the reconstruction loss [41] is compared. For the training parameters, the number of training epochs is 500, including 20 warm-up epochs. The base learning rate is chosen to be 0.001, and the learning rate decays with a half-cycle cosine function. The optimizer is the AdamW [53] method, with a weight decay coefficient of 0.05. Data augmentation including random cropping, random flipping, and adding Gaussian noise is performed while training. In consideration of the speckle noise in the original PolSAR image, a Gaussian filter with σ = 1 is performed before the image is used as the reconstruction target.
The curves of the pre-training loss are shown in Figure 7. Figure 7a shows the loss curves for the case of different encoder depths (the number of transformer blocks in the encoder) with a fixed masking rate of 80% and no pixel regularization, which indicates that the depth of the encoder has little effect on the pre-training procedure. Figure 7b shows the loss curves for the case of different mask ratios and different approaches of pixel normalization, with a fixed encoder depth of 4. It can be seen that at mask ratio below 80%, the loss curve shows an unexpected inflection point at the end of the warm-up stage but does not affect the convergence of the final pre-training results. In addition, the loss at the end of the pre-training stage is smaller at lower mask ratios. Although the classification performance of the model does not depend on the loss in pre-training, the shape of the loss curve shows that the pre-training results are steadily convergent. A convergent pre-training result is the basis for discussing the subsequent classification performance.  Figure 8 shows the pre-training results of the model from the perspective of image reconstruction. It can be seen that the image reconstruction performance is similar when the depth of the encoder varies from 1 to 7. When the mask ratio increases from 20% to 80%, although the model cannot reconstruct the details, it can still recover the semantic information of the image. The results indicate that the model can extract semantic features from the fragments of the original images and use the correlation information between image patches to reconstruct the image.

Comparison Experiments
In the comparison experiments, the hyperparameters of the proposed method are set to Depth = 4 and Mask ratio = 80, without pixel normalization. The compared methods include WMM [7], SVM [14], CNN [19], CVCNN [20], 3DCNN [21], and SViT [43]. For all deep learning methods, the optimizer is chosen as AdamW, with a weight decay coefficient of 0.05. The initial learning rate is 0.001 and decays with a half-cycle cosine function. The number of training epochs is 100, with 10 warm-up epochs.
In the Flevoland dataset, 1% of the total labeled samples were randomly selected as training samples, which is 223 training samples for each class. The evaluation metrics include classification accuracy for each category, the overall accuracy, and the kappa coefficient. The experiments were carried out on the P-, L-, and C-band separately. Due to the small training sample size, the experiments were repeated 50 times to avoid randomness. The comparisons of results are shown in Tables 2-4, and the classification images are shown in Figure 9.
As seen from the classification results, the performance of the WMM and SVM were poor for categories with special spatial structures such as roads and buildings. The reason is that WMM and SVM only use the features of a single pixel. Moreover, for the crop categories, SVM and WMM can hardly achieve an accuracy of more than 90% due to the influence of speckle noise. As a result, there is a significant gap between their overall accuracy and that of the deep learning methods.
Among the four deep learning methods used as comparisons, the SViT has the best performance. In P-and C-band, the SViT had 4% greater overall accuracy compared with the other three methods. In the categories of beet, buildings, roads, and maize, the accuracies of SViT are significantly higher than the other three deep learning methods. In L-band, CNN, CV-CNN, and SViT all achieve about 95% overall accuracy, and obtain more than 95% accuracy in all the categories other than beet, grass, building, and roads. Table 2. Classification results on the P-band Flevoland dataset.

Method
CNN [19] CV-CNN [20] 3D-CNN [21] SViT [43] WMM [7] SVM [14] The For the proposed method, the classification performance is much better than the other four compared deep learning methods. An overall accuracy of about 98% is obtained in all three bands. Moreover, in the roads category, which is not well classified by the other four deep learning methods, the accuracy achieved by the proposed method is more than 90%. Moreover, as seen in Figure 9, in the crop region, the classification results of the proposed method are smooth, with almost no misclassification, while the other four deep learning methods have significant misclassifications.  [20]. (c,j,q) 3D-CNN [21]. (d,k,r) SViT [43]. (e,l,s) WMM [7]. (f,m,t) SVM [14]. (g,n,u) The proposed method.
For the Hainan dataset, the training set consists of 2000 pixel samples per category, and the proposed method is compared with CNN, CV-CNN, 3D-CNN, and SViT. Due to the high resolution of the Hainan dataset, the small receptive fields of the compared methods may make them unable to extract spatial features effectively. To achieve fair comparison results with the proposed method, the four compared methods were modified to have a larger receptive field. The 4-dilated convolution was used to replace the common convolution in CNN, CV-CNN, and 3D-CNN. The input size is increased to 64 × 64. To distinguish from the original compared methods, the modified methods are called CNN-Dilated, CV-CNN-Dilated, and 3D-CNN-Dilated methods, respectively. For the SViT method, the patch size is increased from 1 to 4. The input features are simplified so that only the 9 real numbers of the coherency matrix are fed to the network, and the dimension of the embedding vector is increased to 144. The modified SViT method, namely SViT-Large, also has an input size of 64 × 64. The four modified methods are also compared with the proposed method. The comparisons of classification results are shown in Table 5 and Figure 10.
From the experimental results of the Hainan dataset, it can be seen that none of the four original compared deep learning methods can obtain a reasonable classification performance. The water category has the highest accuracy, reaching hardly 80%. As seen in Figure 9a-d, the four compared methods produce a large number of misclassifications between the categories of trees, mosses, and crops. There are also misclassifications between the roads and water category, which hardly produce backscattering.
As seen in Table 5, the four modified methods all received substantial performance improvements, indicating that the expansion of receptive fields can significantly improve the classification results in the case of high-resolution PolSAR images. However, the overall accuracies of the modified methods are still below 90% in all three bands. For most categories, the classification accuracy is between 70% and 80%. As can be seen in Figure 10, there are still obvious misclassifications between the trees, crops, and moss categories.
For the proposed method, it can be seen that superior classification performance is achieved in all six categories of the Hainan dataset. For the categories of roads and moss, which are not well classified by the comparison method, the proposed method gets an improvement of about 10% to 20% in accuracy. In terms of overall accuracy, the proposed method achieves about 10% performance improvement compared with the comparison method in all three bands. From Figure 10, it can be seen that the proposed method is able to perform the land cover classification accurately, and the improvement is significant, especially in the pond area in the second image and the urban area in the third image (marked with a red ellipse). The results show the superiority of the proposed method in high-resolution image land cover classification. Table 5. Classification results on the Hainan dataset.

SViT-Larger
The

Ablation Experiments
To verify the effects of hyperparameters in the proposed method, ablation experiments were carried out on the Hainan dataset. The training settings are the same as for the aforementioned comparison experiments. The hyperparameters compared in the ablation experiments include whether pre-training is used, the depths of the model, the masking ratio in the pre-training stage, and whether pixel normalization is used in pre-training. Figure 11 shows the results of the ablation experiments. In Figure 11a, the models are compared based on overall accuracy at varying model depth and also whether or not pre-training was applied. It can be seen that pre-training improves the model classification performance significantly in almost all cases. Moreover, the effectiveness of pre-training increases as the depth of the model increases. The pretraining provides an improvement on the overall accuracy of about 1% to 2% at model depths below 3, while the improvement is more than 2% at model depths more than 4.
For model depth, it can be seen that in the absence of pre-training, the classification performance does not improve significantly as the model depth increases beyond 2, and sometimes decreases slightly. However, in the pre-trained case, the overall accuracy is almost saturated only after the model depth reaches 4. It indicates that the main factor limiting the performance is not the complexity of the model, but the amount of training samples. When pre-training is not used, the size of the training set is so small that all information in the training set can be totally exploited by a 2-layer transformer model and this results in a saturated performance. When a proper unsupervised pre-training method is introduced, because of the contribution of the image information in a large amount of unlabeled data, better classification performance can be achieved by increasing the model depth to 4.
The ablation results for the mask ratio and the choice of whether to use pixel normalization or not during pre-training are shown in Figure 11b. The optimal mask ratio is around 80%. When the mask ratio is increased from 20% to 80%, the classification performance has a trend to improve, and the increase in overall accuracy is about 1%. For the pixel normalization, the effect on the overall accuracy is negligible.

Influence of Receptive Field
Based on the experimental results, it can be seen that the receptive field of the model has a great influence on the classification performance. In the experimental results of the Flevoland dataset (Figure 9), it can be seen that there are significant small misclassification regions in the results of the compared method, and this phenomenon almost does not exist in the proposed method. The reason is that the proposed method uses inputs with size 224 and the transformer structure can extract features in the global range of the input image, which is equivalent to having a large receptive field. Compared with the other four deep learning methods, whose receptive fields only have sizes of 8 × 8 or 16 × 16, the proposed method is less sensitive to the speckle noise and the heterogeneity of the land covers.
Despite the shortcoming of the small receptive field of the compared methods, their overall accuracies in the Flevoland dataset are still above 90%. However, on the Hainan dataset, the overall accuracies of the four original compared methods all dropped significantly. When dilated convolution was introduced to CNN-based methods to enlarge the receptive fields, the size of the convolution kernel do not increase, but the overall accuracy rose to between 80% and 90%. Similarly, increasing the input size of SViT also resulted in a significant performance improvement. This indicates that when processing high-resolution images, a large receptive field is necessary for the network model to extract the spatial features effectively. Figure 12 intuitively illustrates the relationship between the receptive field and the spatial features of the PolSAR image at different resolutions. The red frames in Figure 12 are patches of 32 × 32 pixels. In the Flevoland dataset (with a pixel spacing of 6.66 m in range and 12.16 m in azimuth) , it can be seen that the patch in the red frame contains the local spatial structure information of the image (Figure 12a) . However, in the Hainan dataset (with pixel spacings of 0.18 m in range and 0.12 m in azimuth) , it is difficult to distinguish three patches obtained from roads, ponds, and buildings only by spatial structure information. Therefore, enlarging the receptive field is an intuitive approach to make use of spatial information in high-resolution images efficiently. To further study the effect of receptive field size on spatial feature extraction, the Grad-CAM method [54] is used to visualize the class activation map in the model. The class activation map can display the area that the model focuses on for a certain category, and in turn, it is possible to know whether the model learns effective features by analyzing the regions that the model focuses on. Visualizations of CNN, CNN-Dilated, and the proposed method are performed, as is shown in Figure 13. To adapt the Grad-Cam method, the proposed model is adjusted to output the average classification results for a specific region. It can be seen that for the CNN method with a small input size of 8 × 8, the activation map is almost irregular, which indicates that the model can hardly extract effective spatial features. As the dilated convolution is used and the receptive field is expanded, some spatial structure can be observed on its activation map, demonstrating that increasing the receptive field helps the model learn effective spatial features. For the proposed method, owing to the large input size and the Transformer structure that can extract features globally, the activation map has an apparent correspondence with the land cover, indicating that the proposed method can make full use of the spatial features in the image.

Potential Overfitting Problem
The classification performance of deep learning models can be affected by potential overfitting problems. In the experiments, the AdamW method with weight decay is adopted to prevent overfitting. Considering the huge performance difference between the compared methods on the Flevoland dataset and Hainan dataset, it is necessary to confirm further whether the model is overfitting. Figure 14 shows the curves of training loss and overall accuracy on C-band data. It can be seen from the curves that, with the increase in the training epoch, the training loss gradually decreases, and the overall accuracy is always in an increasing trend. In the compared methods and the proposed method, no obvious overfitting was observed.
Comparing the training curves of the same method on the Flevoland dataset and the Hainan dataset in Figure 14, it can be seen that the four original compared methods converged to a large loss value on the Hainan dataset. It indicates that the poor performance of the original compared methods on the Hainan dataset is not caused by overfitting, but by the poor fitting results limited by the model receptive fields.

Expected Performance on Complicated Classification Tasks
Although the proposed model is specific to the PolSAR land cover classification task, the feature encoder of the proposed method is pre-trained on unlabeled data in a task-agnostic way. Therefore, the derived feature encoder can be used for a variety of downstream tasks theoretically, including more complex land cover classification tasks, such as classification with noisy labels, multi-label classification, and the domain adaption problem between different sensors. It also has the potential to be used in other classification tasks, such as the classification and recognition of ships and vehicles.
From the perspective of data characteristics, most PolSAR images contain different types of land cover objects. Therefore, unlabeled land cover PolSAR data is sufficient, and it is not difficult for the feature encoder to learn to describe the feature of land covers. If a method is specifically designed for a more complex land cover classification task, the performance can be expected to be improved after integrating the proposed feature encoder. For other classification tasks, such as recognition of ships and vehicles, the pretrained feature encoder may not be suitable for describing the features of these targets because such objects usually occupy very few pixels in an image. Therefore, the application of the derived feature encoders to these tasks requires further discussion and validation.

Conclusions
In this paper, a Vision Transformer-based PolSAR image land cover classification method has been proposed. The multi-layer transformer structure, which has the capability to extract spatial associative information in the global range, is able to characterize the land cover objects of different sizes at various resolutions. Moreover, to address the issue of the scarcity of labeled data, the MAE pre-training method was introduced for pre-training the model with unlabeled data. The comparison experiments and ablation experiments were conducted on the Flevoland dataset and the Hainan dataset. The results of the comparison experiments demonstrated the superiority of the proposed method compared with other deep learning PolSAR land cover classification methods, especially in the high-resolution dataset of Hainan. The ablation experiments investigated the effect of hyperparameter settings of the proposed method on classification performance, which validated the effectiveness of pre-training, and provide a basis for the setting of hyperparameters. The performance difference between the proposed method and the compared methods is analyzed in the discussion on the receptive field and overfitting problems. The expected performance of the proposed feature encoder in other more complex classification tasks is also discussed, which will be the future work to verify and study further.