Global Polarimetric Synthetic Aperture Radar Image Segmentation with Data Augmentation and Hybrid Architecture Model

Machine learning and deep neural networks have shown satisfactory performance in the supervised classification of Polarimetric Synthetic Aperture Radar (PolSAR) images. However, the PolSAR image classification task still faces several challenges. First, the input formats currently used for this task inevitably involve tedious preprocessing. In addition, issues such as insufficient labels and model design also affect classification performance. To address these issues, this study proposes an augmentation method that better utilizes the labeled data and improves the input format of the model, and end-to-end global classification of PolSAR images is implemented on our proposed hybrid network, PolSARMixer. Experimental results demonstrate that, compared to existing methods, our approach reduces the steps required for PolSAR image classification, eliminating repetitive data preprocessing and significantly improving classification performance.


Introduction
Polarimetric Synthetic Aperture Radar (PolSAR) is a radar technology that adds polarization characteristics to traditional Synthetic Aperture Radar (SAR). PolSAR facilitates the acquisition of polarization information from the target, thereby enabling more comprehensive and precise detection and imaging. Additionally, PolSAR is an active remote sensing technology, unlike passive remote sensing technologies (such as optical remote sensing), which are limited by weather and time [1]. Active remote sensing acquires data by transmitting electromagnetic waves toward the target and receiving the reflected signals. Consequently, the method remains impervious to natural factors such as variations in light intensity, cloud cover, haze, and sunlight. This attribute gives PolSAR distinct advantages in promptly responding to emergencies and swiftly gathering target information. The application of PolSAR to ground target observation, such as land cover classification, is a very significant direction [2][3][4][5].
In the early stage, PolSAR image classification methods could be broadly categorized into two groups: those founded on scattering mechanisms [6][7][8] and those rooted in statistical approaches [9][10][11][12]. While both kinds of methods are characterized by simplicity, speed, and physical interpretability, their classification results tend to be coarse and imprecise. These approaches are primarily suited for the preliminary analysis of PolSAR data. With the emergence of deep neural networks, the data-driven self-learning representation mode of networks has attracted more and more attention. While the features acquired through deep networks may lack direct physical interpretability, their representational power significantly surpasses that of manually extracted features. Therefore, research has begun to introduce neural networks into PolSAR land cover classification. For example, Zhou et al. [13] used original polarimetric features from the matrix T as input on the Flevoland I dataset and applied a deep convolutional neural network to classification for the first time; the classifier achieved the best performance at that time. In addition, because PolSAR data contain rich polarization information, such as scattering amplitude, polarization direction, and degree of polarization, this complex information may degrade the performance of unsupervised algorithms [14][15][16][17][18][19]. In contrast, supervised algorithms that utilize labeled data effectively learn the complex relationships within the data to improve performance [20][21][22][23][24][25][26][27]. Therefore, extracting certain polarization features from the original scattering matrix as initial features and then letting a deep neural network learn higher-level and more complex features from the labeled data has become a more efficient and popular approach to PolSAR land cover classification.
At present, there are two primary strategies for supervised PolSAR land cover classification with deep neural networks, depending on the input of the model. The first strategy involves dividing the PolSAR image into small patches, with each patch representing a specific land cover class. Neural networks are then trained to recognize these small patches, thereby obtaining land cover classes across the entire image [28][29][30][31]. The second strategy, known as direct segmentation, entails feeding the PolSAR image into a neural network to segment the image directly, after which the trained model assigns each pixel to a specific land cover class [32][33][34]. Patch-based classification offers a high level of flexibility, and the model is easy to train, but it is sensitive to noise; moreover, because one patch represents one class, spatial information is lost, and features spanning multiple patches cannot be accurately captured, resulting in classification errors. In addition, manually assigning a land cover label to each small patch can be a time-consuming and challenging task, especially for large and complex datasets. The direct segmentation strategy can quickly and comprehensively capture the spatial distribution of land cover and has strong robustness and anti-noise ability. However, it places high demands on the model, and due to computer performance limitations and limited label data, cut and merge operations (cutting the image to 256 × 256 or less) are usually required on the original PolSAR images as input.
In general, due to the following issues, the performance of both the patch-based and direct segmentation strategies is still not ideal. (a) Redundant steps. Both require additional operations, patch construction or image cutting, which greatly increase the computational complexity. (b) Insufficient labeled data. PolSAR images typically have high resolution and cover a wide geographical area. Processing and annotating large-scale PolSAR data require a significant amount of time and computational resources. Due to the complexity of manual annotation, typically only some areas of the image contain labeled information. (c) Design of the model. In the domain of PolSAR image classification, models predominantly leverage Convolutional Neural Networks (CNNs) and Transformer architectures, each possessing distinctive advantages and limitations. CNNs excel at capturing local features, rendering them well-suited for conventional image tasks. However, when confronted with images featuring intricate boundaries, their restricted global contextual understanding may lead to suboptimal performance. In contrast, empowered by self-attention mechanisms, Transformer architectures demonstrate proficiency in handling global context, enhancing their capability to recognize irregular boundaries. Nevertheless, they exhibit a diminished capacity for capturing local features, involve higher computational complexity, and necessitate a substantial amount of labeled data for effective training. Effectively leveraging the strengths of both architectures is a matter worthy of consideration.
In view of these challenges, this paper first presents a novel input format for the model. The new format uses the results obtained after applying the data augmentation method proposed in this paper as input to the model. It aims to maximize the utilization of labeled data and overcome the shortcomings of previous strategies. Moreover, to accommodate this input format, a concise but superior model is proposed. Considering that Multilayer Perceptrons (MLPs) have been proven to achieve long-range dependencies while significantly reducing computational complexity compared to Transformers [35], the model adopts a hybrid architecture that combines CNN and MLP to extract features at different levels. This hybrid architecture not only reduces computational complexity but also allows objects to be observed from a multiscale and long-range perspective. In addition, attention mechanisms have become a dominant paradigm in deep learning [36]. In this model, cross-layer attention mechanisms are used to build global dependencies between features at different levels in PolSAR images.
The main contributions are summarized as follows: (1) A data augmentation technique is introduced, aimed at significantly improving the utilization of labeled data while mitigating spatial information interference. Based on this technique, we improved the input format without the need to construct patches or perform cut and merge operations on labeled data. This ensures the model can adapt to images of any size and swiftly conduct global inference. (2) A hybrid CNN-MLP architecture is proposed to classify PolSAR images. The architecture accepts input images of arbitrary size and outputs features extracted at different levels. (3) To further improve performance, a cross-layer attention module is used to establish relationships between different neural network layers, passing feature information from the shallow layers to the deep layers. This information transfer helps capture long-distance dependencies, improving the model's understanding of the data. (4) Three extensively recognized datasets are utilized to evaluate the efficacy of the proposed approach, and the experimental results demonstrate its superior performance and classification accuracy compared with other contemporary methods.
The rest of this paper is organized as follows: Section 2 reviews the relevant attention mechanisms, hybrid models, and segmentation models. In Section 3, we elaborate on the details of the proposed methods. Section 4 provides a comprehensive analysis of comparative experimental results on three extensively employed PolSAR images. Section 5 discusses the ablation experiments and analyzes the impact of hyperparameters in the proposed method. Lastly, in Sections 6 and 7, we outline potential directions for future research and draw conclusions.

Related Works

Segmentation Model
In the field of image segmentation, there are two key architectures. The first is based on the Fully Convolutional Network (FCN), which achieves downsampling and enlarges the receptive field through convolution operations. Deconvolution operations are then employed for upsampling to restore the original image size, thereby preserving the spatial information of the original input image and enabling pixel-level classification. U-Net, proposed by Ronneberger et al. [37], has been widely adopted due to its simple design and relatively small parameter count. However, its performance is relatively poor when dealing with larger images. The second important architecture adopts the Transformer structure, efficiently capturing global context through self-attention mechanisms and handling high-resolution images by independently processing image blocks. In this context, SETR, proposed by Zheng et al. [38], constructs an encoder solely from Transformers without performing downsampling operations, modeling the global context in each Transformer layer. While this encoder can be combined with a simple decoder to provide a powerful segmentation model, it requires higher computational complexity, and the interpretability of the self-attention mechanism has certain limitations.

Attention Mechanism
The Attention Mechanism (AM), initially introduced for Machine Translation, has become a key concept in neural networks. It is widely used in various Artificial Intelligence applications, including Natural Language Processing (NLP) [39], Speech [40], and Computer Vision (CV) [41]. The Visual Attention Mechanism (VAM) has become popular in many mainstream CV tasks for focusing on relevant regions within the image and capturing structural long-range dependencies between parts of the image [42]. Among the different attention mechanisms, self-attention, multi-head attention, and cross-attention are commonly used in CV tasks. Self-attention calculates the similarity between elements in an input sequence, updating each element's representation based on the resulting weights. Multi-head attention maps inputs to different subspaces, allowing the model to focus on diverse aspects simultaneously. Cross-attention addresses relationships between two sequences, with one sequence acting as the query vector and the other as the key-value vector, updating representations based on similarity weights.

Hybrid Model
Combining CNN and Transformer yields superior performance, and various approaches have been explored in recent studies. CMT, a visual network architecture, achieves enhanced performance by seamlessly integrating traditional convolution and Transformer [43]. It employs a multi-level Transformer with traditional convolutions inserted between layers to hierarchically extract local and global image features. Conformer, a dual-network structure, merges CNN-based local features with Transformer-based global representations for improved representation learning [44]. Touvron et al. [45] proposed DeiT, which utilizes a CNN model as a teacher network to optimize ViT with hard distillation. This introduces the inductive bias of the CNN model into the Transformer, reducing data requirements, enhancing training speed, and achieving better performance. DETR, introduced by Carion et al. [46], places a CNN before the Transformer. The CNN network learns two-dimensional features, reshaping low-resolution feature maps for input into the Transformer, resulting in improved learning speed and overall model performance. ViT-FRCNN, proposed by Beal et al. [47], places the Transformer before the CNN. After ViT, Faster R-CNN is appended sequentially as the target detection network, demonstrating the Transformer's capability to retain spatial information for effective target detection.

PolSAR Data Augmentation Method
When labeled data are concentrated in a specific region, the performance of the segmentation model can decline substantially when it is applied to segment the entire global image. This issue arises because, during training, the model tends to focus predominantly on the labeled area, potentially overlooking crucial information from other regions of the image. This phenomenon is frequently denoted as label skew or class imbalance. The model exhibits a learning bias because it lacks essential generalization capabilities, primarily stemming from its limited exposure to features from other regions. For instance, according to Equation (1), the spatial information of the labeled data x(i0, j0) is still retained in the feature map obtained after a multi-layer neural network, demonstrating that the spatial information associated with labels in PolSAR images has the potential to disrupt the model's capacity for global reasoning.
To address these issues, a proven effective method involves distributing labeled data across various regions of the image. Through the random allocation of labeled data to different areas of the image, the model can mitigate its tendency to rely too heavily on information from specific regions. This approach enables the model to gain a more comprehensive understanding of the diverse features and structures within the image, thereby improving its generalization capabilities. This adaptation enhances the model's ability to handle challenges such as noise, occlusion, and background changes, ultimately bolstering its robustness.
Therefore, we randomly scramble the label distribution without losing the original polarization characteristic information. The whole process can be expressed by the following formula: where X ∈ R^(H×W×C) is a PolSAR image and X′ is the new PolSAR image obtained. The three transformations f_sample, f_split, and f_merge can be formulated as follows: where ϑ is the sampling ratio, and ϑ% of the points are used for each category in the labeled data; σ⟨h;w⟩ means the original image is divided into small blocks x* ∈ R^(h×w×C) according to a fixed size ⟨h; w⟩; and τ shuffles the blocks and combines them into a new PolSAR image. For comparison, this paper maintains consistent shapes for both X′ and X.
Since this operation involves a direct split-merge process, it can be executed in parallel, allowing multiple distinct new images to be obtained simultaneously. The new images produced by data augmentation are used as the training dataset. The entire procedure is illustrated in Figure 1.
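The sample-split-merge procedure described above can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions: function and variable names are our own, blockwise reshaping is one straightforward way to realize f_split and f_merge, and the image dimensions are assumed to be multiples of the block size (otherwise the border is cropped).

```python
import numpy as np

def augment_polsar(X, labels, theta=0.3, block=(16, 16), rng=None):
    """Sketch of the proposed split-shuffle-merge augmentation.

    X      : (H, W, C) PolSAR feature image
    labels : (H, W) integer label map (0 = unlabeled)
    theta  : sampling ratio -- fraction of labeled points kept per class
    block  : the <h; w> block size used when splitting the image
    """
    rng = rng or np.random.default_rng()
    H, W, C = X.shape
    h, w = block

    # f_sample: keep theta of the labeled points for each class
    sampled = np.zeros_like(labels)
    for c in np.unique(labels[labels > 0]):
        idx = np.argwhere(labels == c)
        keep = idx[rng.choice(len(idx), int(theta * len(idx)), replace=False)]
        sampled[keep[:, 0], keep[:, 1]] = c

    # f_split: cut the image and label map into h x w blocks
    nh, nw = H // h, W // w
    xb = X[:nh * h, :nw * w].reshape(nh, h, nw, w, C)
    xb = xb.transpose(0, 2, 1, 3, 4).reshape(-1, h, w, C)
    lb = sampled[:nh * h, :nw * w].reshape(nh, h, nw, w)
    lb = lb.transpose(0, 2, 1, 3).reshape(-1, h, w)

    # f_merge: shuffle the blocks and reassemble an image of the same shape
    perm = rng.permutation(len(xb))
    xb, lb = xb[perm], lb[perm]
    Xp = xb.reshape(nh, nw, h, w, C).transpose(0, 2, 1, 3, 4).reshape(nh * h, nw * w, C)
    Lp = lb.reshape(nh, nw, h, w).transpose(0, 2, 1, 3).reshape(nh * h, nw * w)
    return Xp, Lp
```

Calling this several times with different random states yields the multiple distinct training images mentioned above, since each call produces an independent block permutation.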

Model
In this paper, we propose a model adapted to our input format, named PolSARMixer. The model's architecture is depicted in Figure 2 and primarily encompasses three key processes: (1) Shallow Feature Extraction: The whole PolSAR image is fed into the CNN network to extract preliminary features. The shallow features of PolSAR images at the i-th level are denoted Conv_low^i.
(2) Deep Feature Extraction: To improve the perception of small targets, the final output Conv_low^3 of the shallow feature extraction module is forwarded to the Feature-Mixing (FM) blocks to obtain FM_high. Through the stacking of multiple Feature-Mixing layers, FM_high integrates both high-level abstract features and generalization features, aiding the model in better comprehending the content of images, enhancing segmentation performance for complex scenes and objects, reducing sensitivity to noise and variations, and delivering more semantically rich segmentation results.
(3) Feature Fusion: High-level features provide abstract semantic information, while low-level features contain the details and basic structure of the image. Fusing these two types of information provides a more comprehensive understanding of the image and enhances the robustness of the model. To achieve greater utilization efficiency, FM_high is successively fused with the layers Conv_low^i through cross-layer attention (CLA) to obtain the multiscale high-low joint map.

Input of Model
The use of the matrix T as a representation of PolSAR images is widely recognized as an effective choice. This is because the elements of the matrix T offer physical interpretability, reflecting the scattering mechanisms and scattering types of targets and thereby contributing to a deeper understanding of the image content. Furthermore, the matrix T contains additional physical information, including the polarization characteristics and phase details of targets, which are crucial for enhancing the performance of deep learning models in classification tasks. Therefore, the input of the model is Input ∈ R^(H×W×9), where H and W are the height and width of the original PolSAR image, and the nine-dimensional vector can be represented as follows: where Re(*) and Im(*) represent the real and imaginary components of a complex value, respectively, and T* denotes the elements of the matrix T.
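Building the nine real channels from the complex coherency matrix can be sketched as follows. The channel ordering shown (diagonal powers, then real/imaginary parts of the upper-triangular elements) is one common convention and is an assumption; the paper's exact ordering may differ.

```python
import numpy as np

def t_matrix_to_input(T):
    """Build the 9-channel real-valued model input from the coherency matrix.

    T : (H, W, 3, 3) complex coherency matrix per pixel.
    Channel layout (assumed, not necessarily the authors' exact order):
    [T11, T22, T33, Re(T12), Im(T12), Re(T13), Im(T13), Re(T23), Im(T23)]
    """
    feats = [
        T[..., 0, 0].real, T[..., 1, 1].real, T[..., 2, 2].real,  # real diagonals
        T[..., 0, 1].real, T[..., 0, 1].imag,                      # off-diagonal T12
        T[..., 0, 2].real, T[..., 0, 2].imag,                      # off-diagonal T13
        T[..., 1, 2].real, T[..., 1, 2].imag,                      # off-diagonal T23
    ]
    return np.stack(feats, axis=-1).astype(np.float32)
```

Because T is Hermitian, its diagonal is real and the lower triangle is redundant, so these nine real numbers carry the full information of the 3 × 3 complex matrix.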

Feature Extractor
The shallow feature extractor is composed of three convolution layers with 3 × 3 kernels and a stride of 2, which scale the feature maps to obtain a larger receptive field and extract features at different levels. The highest-level feature map Conv_low^3 ∈ R^(H×W×C) extracted by the CNN network is first reorganized into L ∈ R^(S×C) in the Linear Projection module, where S = HW/p² is the number of non-overlapping image patches and p is the patch size. Then, L is linearly projected with the projection matrix M ∈ R^(C×C′) to obtain L′ ∈ R^(S×C′). L′ is fed to the stacked Feature-Mixing blocks to enhance the perception of small targets. As shown in Figure 3, Feature-Mixing relies solely on MLPs repeated in the spatial or feature channel domains, as well as basic matrix multiplication operations and data scale transformations, without the need for convolution or attention mechanisms. This design ensures a straightforward network structure and low computational complexity while maintaining global dependencies. It allows the model to capture the overall context and relationships, preserving contextual connections between objects and ensuring accurate segmentation of similar objects, thereby improving segmentation accuracy. The Feature-Mixer module is mainly composed of two sub-blocks. The first is spatial mixing: it acts on the rows of L′ and maps R^S → R^4S → R^S. The second is channel mixing: it acts on the columns of L′ and maps R^C′ → R^4C′ → R^C′. Each sub-block contains two fully connected layers and a LayerNorm layer applied independently to ensure that the data of each patch are normalized in the feature dimensions. The Feature-Mixer block can be written as follows: where the shapes of the input and output of the Feature-Mixing block remain the same. The patch size p is set to 2, and the output of the stacked Feature-Mixer blocks is obtained after reshaping.
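A minimal PyTorch sketch of one Feature-Mixer block following the description above: spatial mixing maps R^S → R^4S → R^S over the patch tokens, channel mixing maps R^C′ → R^4C′ → R^C′ over the feature dimension, and each sub-block uses two fully connected layers preceded by LayerNorm. The GELU activation and residual connections are assumptions based on typical MLP-Mixer designs, not details stated in the text.

```python
import torch
import torch.nn as nn

class FeatureMixerBlock(nn.Module):
    """Illustrative Feature-Mixer block (MLP-Mixer style) for tokens of
    shape (batch, S, C): token/spatial mixing followed by channel mixing."""

    def __init__(self, S, C):
        super().__init__()
        self.norm1 = nn.LayerNorm(C)
        # spatial mixing: R^S -> R^4S -> R^S
        self.spatial = nn.Sequential(nn.Linear(S, 4 * S), nn.GELU(), nn.Linear(4 * S, S))
        self.norm2 = nn.LayerNorm(C)
        # channel mixing: R^C -> R^4C -> R^C
        self.channel = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))

    def forward(self, x):  # x: (B, S, C)
        # spatial mixing acts on rows: transpose so the Linear mixes the S tokens
        x = x + self.spatial(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # channel mixing acts on columns (the feature dimension)
        x = x + self.channel(self.norm2(x))
        return x
```

As the text notes, the input and output shapes are identical, so these blocks can be stacked freely before the final reshape back to the spatial layout.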

Cross-Layer Attention
Cross-layer attention helps the model share information at different levels of abstraction; its workflow is shown in Figure 4. It allows the model to transmit and exchange information about the input data across different layers, thereby enhancing overall model performance. Secondly, cross-layer attention aids in capturing long-range relationships and dependencies, improving the model's representational capacity. Additionally, cross-layer attention enhances the model's robustness: by sharing information at different levels, the model can better adapt to variations and noise in the input data, leading to improved performance in various scenarios. First, 1 × 1 convolutions are used to transform the features of Conv_low^i into three feature maps with the same number of channels: Q_low^i, K_low^i, and V_low^i. In the same way, FM_high is used to obtain three feature maps: Q_high^j, K_high^j, and V_high^j. To obtain the cross-attention scores, the transposes of K_low^i/K_high^j and Q_high^j/Q_low^i are multiplied and then normalized using the SoftMax function. The result is multiplied by V_low^i/V_high^j to obtain the cross-attention maps CLA^i/CLA^j. Through matrix multiplication, the internal correlation of features is captured and long-range dependencies between features are obtained, which effectively models the context.
The cross-information between the low and high features is distinct, mutually independent, and complementary. The attention map of the high-low feature cross-fusion in the (j + 1)-th layer of the deep unit is expressed as follows: the iteration then proceeds layer by layer according to the above operation.
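The cross-attention computation described above can be sketched as follows. This is a hedged reading of the text, not the authors' exact module: each stream's queries attend to the other stream's keys/values (CLA^i = SoftMax(Q_high K_low^T) V_low and CLA^j = SoftMax(Q_low K_high^T) V_high), and the scaling factor and the final summation used to fuse the two maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Illustrative cross-layer attention between a low-level CNN feature
    map and the Feature-Mixing output, both shaped (B, C, H, W)."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions produce Q, K, V with the same number of channels
        self.q_low, self.k_low, self.v_low = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.q_high, self.k_high, self.v_high = (nn.Conv2d(channels, channels, 1) for _ in range(3))

    @staticmethod
    def _attend(q, k, v):
        B, C, H, W = q.shape
        q = q.flatten(2).transpose(1, 2)            # (B, HW, C)
        k = k.flatten(2)                            # (B, C, HW)
        v = v.flatten(2).transpose(1, 2)            # (B, HW, C)
        attn = F.softmax(q @ k / C ** 0.5, dim=-1)  # (B, HW, HW) similarity weights
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

    def forward(self, low, high):
        # each stream queries the other's keys/values
        cla_low = self._attend(self.q_high(high), self.k_low(low), self.v_low(low))
        cla_high = self._attend(self.q_low(low), self.k_high(high), self.v_high(high))
        return cla_low + cla_high  # fusion by summation is an assumption
```

Note that the HW × HW attention matrix makes this expensive at full resolution, which is consistent with applying it to downsampled feature maps rather than the raw image.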

Loss Function
In the training process, the target boundaries become extremely dispersed after data augmentation and the land cover categories are unevenly distributed, so the model is optimized with the Focal-Dice loss function to overcome these problems, which is written as follows: where p represents the target, q indicates the predicted probability, γ is the temperature coefficient, and α and β are weight factors; for easy convergence of the model, we set α > β.
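A combined Focal + Dice loss along these lines can be sketched as below. The exact weighting scheme the authors use is not fully specified in the text, so the combination `alpha * focal + beta * dice` (with α > β) and the default values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(logits, target, alpha=0.7, beta=0.3, gamma=2.0, eps=1e-6):
    """Sketch of a Focal-Dice loss for semantic segmentation.

    logits : (B, K, H, W) raw class scores; target : (B, H, W) class indices.
    alpha, beta : weight factors with alpha > beta (assumed combination).
    gamma : focusing / temperature coefficient of the focal term.
    """
    prob = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()

    # Focal term: (1 - q)^gamma down-weights easy, well-classified pixels
    q = (prob * onehot).sum(dim=1).clamp_min(eps)   # probability of the true class
    focal = ((1 - q) ** gamma * -q.log()).mean()

    # Dice term: penalizes poor per-class region overlap
    inter = (prob * onehot).sum(dim=(0, 2, 3))
    union = prob.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1 - (2 * inter / union.clamp_min(eps)).mean()

    return alpha * focal + beta * dice
```

The focal term addresses the dispersed, hard boundary pixels, while the Dice term directly optimizes region overlap, which is less sensitive to class imbalance than plain cross-entropy.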

Dataset
To verify the effectiveness of our proposed method and model, we selected three PolSAR images for experiments, which are described below.
(1) Flevoland I, AIRSAR, L-Band. Flevoland I is a widely used dataset for PolSAR classification, consisting of L-band data acquired over Flevoland, the Netherlands, in 1989 by the Airborne Synthetic Aperture Radar (AIRSAR). The image scene size of the dataset is 750 × 1024 pixels. Of these, 157,296 pixels are labeled with 15 different terrain types. The marked categories include stem beans, rapeseed, bare soil, potatoes, beets, wheat 2, peas, wheat 3, lucerne, barley, wheat, grass, forest, water, and buildings. The pseudocolor image, ground truth, and class legends are shown in Figure 5.
(2) Flevoland II, AIRSAR, L-Band. Flevoland II is another L-band quad-polarized dataset acquired by AIRSAR over Flevoland, the Netherlands, in 1991. The dataset has an image scene size of 1024 × 1024 pixels, covering a total of 14 different terrain types and 122,928 annotated pixels. The categories include rapeseed, potato, barley, maize, lucerne, peas, fruit, wheat, beans, beets, grass, onions, and oats. The pseudocolor image, ground truth, and class legends are shown in Figure 6.
(3) Oberpfaffenhofen, ESAR, L-Band. The Oberpfaffenhofen dataset was acquired by the L-band ESAR sensor over Oberpfaffenhofen, Germany. The image scene size of this dataset is 1300 × 1024 pixels, covering three different terrain types in total: Build-up Areas, Wood Land, and Open Areas. The pseudocolor image, ground truth, and class legends are shown in Figure 7.

Analysis Criteria of Performance
In this paper, Overall Accuracy (OA), Average Accuracy (AA), Mean Intersection over Union (mIoU), Mean Dice coefficient (mDice), and the Kappa coefficient are the criteria used to evaluate the model. OA measures the overall accuracy of the model on the entire dataset. AA averages the model's per-class accuracy, providing a more comprehensive assessment of performance. The Kappa coefficient is a statistical metric used to measure consistency or agreement in classification tasks. The mIoU measures the average overlap between the predicted and ground truth segmentation results, and the mDice measures their average similarity. These criteria are calculated as follows: where TP is the number of samples that are correctly classified, N is the total number of samples, P_o is the observed agreement, P_e is the expected agreement, C is the total number of classes, and p_i is the accuracy of the i-th class. TP_i represents the number of samples that the model correctly predicted as class i, FP_i represents the number of samples that the model incorrectly predicted as class i, and FN_i represents the number of samples of class i that the model incorrectly predicted as some other class.
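All five criteria can be computed from a single confusion matrix, following the standard definitions the text describes (here assuming integer labels from 0 and excluding pixels marked negative as unlabeled):

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute OA, AA, mIoU, mDice, and Kappa from flattened or 2-D
    prediction and ground-truth integer label arrays."""
    mask = gt >= 0  # unlabeled pixels (assumed marked < 0) are excluded
    cm = np.bincount(num_classes * gt[mask] + pred[mask],
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(0) - tp  # predicted as class i but belonging elsewhere
    fn = cm.sum(1) - tp  # class i pixels predicted as something else
    n = cm.sum()

    oa = tp.sum() / n
    aa = np.mean(tp / np.maximum(cm.sum(1), 1))                # mean per-class accuracy
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1))           # mean IoU
    mdice = np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1))  # mean Dice
    pe = (cm.sum(0) * cm.sum(1)).sum() / n ** 2                # expected agreement P_e
    kappa = (oa - pe) / (1 - pe)                               # P_o is OA here
    return oa, aa, miou, mdice, kappa
```

For a perfect prediction every metric evaluates to 1, which is a quick sanity check when wiring this into an evaluation loop.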

Parameters of Experiment
All experiments were run on Ubuntu 18.04 LTS with a 48 GB NVIDIA RTX 8000 GPU. All methods were implemented using the PyTorch deep learning framework. We constructed three different input formats for comparison. In the patch-based (PB) construction method, the size of each patch is 8 × 8. In the direct segmentation (DS) construction method, we divide the labeled areas into 32 × 32 sub-images. In the data augmentation (DA) formatted dataset, following the method outlined in Section 3.1, 30% of the points are selected to create the training images, while another 30% of the points are used to generate the validation images. All datasets are divided into training and validation sets at a 7:3 ratio. The configurations of the different input formats are shown in Table 1. We selected different classification models for the different input formats to compare with our method. The compared models include SVM, CNN [48], U-Net [37], and SETR [38]. The first two are suitable for PB, while the latter two can take images of arbitrary size, so we tested their performance separately under the DS and DA conditions. For all deep learning methods, we used Stochastic Gradient Descent (SGD) as the optimizer, with a weight decay coefficient of 0.05, momentum of 0.9, and an initial learning rate of 0.01. Additionally, PolyLR is employed as the learning rate scheduler, and the number of training epochs is fixed at 50.
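The optimizer and scheduler configuration above can be reproduced in PyTorch as follows. The polynomial decay power (0.9 here) is an assumption, as the text only names PolyLR without specifying it:

```python
import torch

def build_optimizer(model, epochs=50, base_lr=0.01):
    """SGD + PolyLR setup matching the stated hyperparameters
    (weight decay 0.05, momentum 0.9, initial lr 0.01, 50 epochs)."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=0.05)
    # PolyLR: lr = base_lr * (1 - epoch / epochs)^power, power assumed 0.9
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda ep: (1 - ep / epochs) ** 0.9)
    return opt, sched
```

Calling `sched.step()` once per epoch decays the learning rate smoothly toward zero over the 50 training epochs.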

Experiments Result
Table 2 presents the classification results, while Figure 8 depicts the visualized classification map of the Flevoland I dataset. The results in Table 2 indicate that the proposed model achieved an OA of 98.90%, outperforming SVM-PB (92.41%), CNN-PB (95.24%), U-Net-DS (94.68%), U-Net-DA (96.05%), SETR-DS (93.11%), and SETR-DA (98.29%). Comparative analysis across the other criteria also highlights the superior performance of PolSARMixer. Figure 8 provides a more detailed illustration of the issues with PB-format and DS-format data. In Figure 8c,d, noticeable noise points appear within the plots, and the boundaries are unclear. Additionally, in Figure 8e,f, small details within the plots are lost, and there is considerable transition disparity between segmented subplots. When U-Net is used with DA-format data, the transitions between subplots are smoother, the plot details are richer, and the evaluation criteria are higher. However, the global inference performance of SETR with DA-format data is poor. As can be seen from Figure 8h, its boundary ranges are not clear. The main reason is that the SETR model is not good at capturing local context when the label information of the dataset is limited. It is worth noting that our proposed PolSARMixer model demonstrates superior performance, offering clearer plot boundaries and fewer misclassifications. Table 3 displays the evaluation results for the Flevoland II dataset. The data reveal that the proposed model outperforms the other models in terms of AA, surpassing SVM-PB (91.17%), CNN-PB (96.98%), U-Net-DS (91.21%), SETR-DS (85.92%), U-Net-DA (96.61%), and SETR-DA (98.75%). Furthermore, the OA of PolSARMixer is notably higher than that of the other methods by 1.82%, 2.71%, 1.35%, 2.84%, 0.39%, and 0.30%, respectively. PolSARMixer also achieves an mIoU of 98.14%, an mDice of 99.06%, and a Kappa coefficient of 0.9940. Figure 9 depicts the class-wise land-cover classification maps for the Flevoland II region produced by the various methods. It can be seen that PB-format and DS-format data still suffer from excessive noise points and unclear boundaries. When U-Net is employed with DA-format data, as evident from Figure 9g, significant errors within large land parcels become noticeable. This is primarily due to its limited ability to capture long-range dependencies. From Figure 9i, it is evident that our model performs exceptionally well within both large and small land parcels. The parcels exhibit greater consistency, and the boundaries are notably clearer in the surrounding areas.
The class-wise land-cover classification maps produced by the various methods for the Oberpfaffenhofen dataset are shown in Figure 10. As the results in Table 4 show, the proposed approach obtains the best classification performance, reaching an OA of 95.32%, an AA of 93.85%, an mIoU of 89.60%, an mDice of 94.46%, and a Kappa of 0.9176. The other methods clearly have more misjudged points in each category, whereas our proposed method has only a few misjudged points in the Build-up category.

Ablation Experiments
In this section, we conduct a comprehensive assessment of the individual modules within our proposed methodology and their impact on classification performance. The results are succinctly summarized in Table 5. The data presented in Table 5 illustrate the substantial positive influence of both the Feature-Mixing and cross-layer attention modules on classification performance. The introduction of the Feature-Mixing module significantly improves performance across the different datasets: OA increased by 2.46%, 0.09%, and 1.23%; AA increased by 4.75%, 2.14%, and 0.99%; mIoU increased by 5.94%, 3.07%, and 2.24%; mDice increased by 3.94%, 1.67%, and 1.32%; and the Kappa coefficient increased by 2.69%, 0.10%, and 2.25%, respectively. This enhancement is primarily attributed to its capacity to extract high-level features through long-distance dependencies, thereby enabling the extraction of features with strong generalization and the realization of a higher-dimensional observation scale. The cross-layer attention module yields increases in OA of 1.81%, 0.06%, and 1.35%; in AA of 1.81%, 1.94%, and 1.75%; in mIoU of 5.30%, 2.88%, and 2.50%; in mDice of 3.07%, 1.57%, and 1.48%; and in the Kappa coefficient of 1.98%, 0.06%, and 2.51% across the three datasets. This module demonstrates effectiveness in capturing dependencies in spatial features across various layers, facilitating the efficient merging of information from different layers. Remarkably, when both modules are combined, the evaluation criteria reach their highest values on all datasets. This demonstrates the complementary and synergistic relationship between the two modules, which together yield a significant improvement in classification performance.

Impact of Data Augmentation
To verify the necessity of the data augmentation proposed in this paper, we further conducted experiments without it, keeping all other processing unchanged for consistency. As illustrated in Figure 11, the model's global reasoning capability deteriorates when data augmentation is not applied. A detailed analysis of Figure 11a,b reveals that the spatial layout inherent in the labeled data significantly disrupts the model, leading to inaccurate classification in unlabeled regions. Moreover, without data augmentation the model lacks generalization capability: even in scenarios with a substantial number of labels, such as Oberpfaffenhofen (Figure 11c), it still fails to delineate boundaries accurately. These comparative experiments underscore the efficacy of the proposed data augmentation method, which activates the model's capacity for global inference even when operating with a limited number of labels.

Impact of Shape of Blocks
In the proposed data augmentation method, the parameter ⟨h; w⟩ denotes the block size, which directly affects model performance. We therefore evaluated the impact of ⟨h; w⟩ on classification performance experimentally; the results are shown in Table 6. As Figure 12 shows, distinct block shapes yield varying effects across the datasets. For the Flevoland I dataset, employing ⟨8; 8⟩ increases classification errors within larger land blocks, as exemplified by the red mark in Figure 12a; this observation extends to the Flevoland II dataset as well. Conversely, a larger size such as ⟨32; 32⟩ increases classification errors for small land blocks, as illustrated by the red marks in Figure 12c,f; for the Flevoland II dataset in particular, this leads to a substantial drop in classification accuracy. For the Flevoland I and Flevoland II datasets, ⟨16; 16⟩ attains optimal performance across all criteria. Conversely, for the Oberpfaffenhofen dataset, setting ⟨h; w⟩ to ⟨32; 32⟩ minimizes misclassification and yields clearer boundary determination, while ⟨8; 8⟩ and ⟨16; 16⟩ perform poorly, as shown in Figure 12g,h. In summary, the parameter ⟨h; w⟩ has a substantial effect on classification performance and should be adjusted according to the size of the annotated regions: as the annotated regions grow larger, ⟨h; w⟩ should be increased to achieve optimal global inference.
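The block-based split-and-merge operation behind ⟨h; w⟩ can be sketched as follows. This is an assumed minimal form (jointly shuffling image and label blocks of size h × w and stitching them back together), not the authors' exact implementation, and the function name is ours:

```python
import numpy as np

def split_and_merge(image, label, h, w, rng=None):
    """Illustrative split-and-merge augmentation: cut the image/label pair
    into h x w blocks, shuffle the blocks jointly, and stitch them back."""
    rng = np.random.default_rng(rng)
    H, W = label.shape
    Hc, Wc = (H // h) * h, (W // w) * w        # crop to a multiple of the block size
    img, lab = image[:Hc, :Wc], label[:Hc, :Wc]
    positions = [(i, j) for i in range(0, Hc, h) for j in range(0, Wc, w)]
    blocks = [(img[i:i + h, j:j + w], lab[i:i + h, j:j + w]) for i, j in positions]
    order = rng.permutation(len(blocks))        # random joint shuffle
    new_img, new_lab = np.empty_like(img), np.empty_like(lab)
    for k, (i, j) in enumerate(positions):
        b_img, b_lab = blocks[order[k]]
        new_img[i:i + h, j:j + w] = b_img
        new_lab[i:i + h, j:j + w] = b_lab
    return new_img, new_lab
```

Because blocks are only permuted, the augmented pair contains exactly the same pixels and labels as the original, just rearranged.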

Impact of Sampling Ratio
To further reduce the proportion of labeled data while preserving the model's generalization performance, we employed proportional sampling of the labeled data during the data augmentation process. The previous experiments have already demonstrated the excellent performance of our method; in this section, we delve deeper into the impact of different sampling ratios on classifier performance. Sampling rates for each class range from 10% to 70% in steps of 10%. Figure 13 illustrates how the OA, AA, mIoU, mDice, and Kappa coefficients of the different datasets change as the training sample ratio changes: all criteria increase as the sampling proportion grows. It is particularly noteworthy that even at a sampling rate of 10%, our model still exhibits outstanding performance. Our proposed method therefore remains highly effective when little labeled data is available.
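Class-wise proportional sampling of this kind can be sketched as below. The function name and the convention that label 0 marks unlabeled pixels are our assumptions for illustration:

```python
import numpy as np

def sample_labels(label, ratio, ignore=0, rng=None):
    """Keep a fixed fraction of labeled pixels per class (class-balanced
    sampling); the remaining pixels are set to the 'unlabeled' value."""
    rng = np.random.default_rng(rng)
    out = np.full_like(label, ignore)
    for c in np.unique(label):
        if c == ignore:
            continue
        idx = np.flatnonzero(label == c)                 # flat indices of class c
        n_keep = max(1, int(len(idx) * ratio))           # keep at least one pixel
        keep = rng.choice(idx, size=n_keep, replace=False)
        out.flat[keep] = c
    return out
```

Sampling per class rather than globally keeps every class represented even at a 10% rate, which is why rare classes are not dropped from the training signal.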

Discussion
Analysis of the experimental results makes it evident that our proposed PolSAR image land classification method excels. As shown in Table 7, our approach achieves outstanding global inference results without intricate operations such as constructing patches or cutting and merging, significantly simplifying the process. Furthermore, the hybrid model incorporating a cross-layer attention mechanism exhibits similarly remarkable performance: it possesses exceptional capabilities for contextual reasoning and global dependency analysis, ensuring excellent generalization and robustness. However, we have also identified some limitations of the proposed data augmentation method. First, when comparing performance across datasets, especially Flevoland II, we observed skewed angles in the labeled data; splitting directly along squares may then lose spatial information and bias the model toward learning without angles. To address this issue, we are considering splitting along the skewed angles derived from the labels, with the aim of improving performance. Additionally, fixed-size blocks have proven inadequate for accommodating the actual distribution of land blocks. Using multiple block sizes in the split operation could therefore help the model adapt to the real-world distribution of land blocks, enhancing its ability to recognize land block boundaries.

Conclusions
In this study, land classification based on PolSAR images is studied as follows: (1) A data augmentation method is proposed that constructs new PolSAR images by splitting and merging the labeled regions of PolSAR images as model input, improving the global segmentation ability, generalization, and robustness of the model. (2) PolSARMixer, a hybrid CNN-MLP network comprising multi-layer feature extraction and a low-/high-level feature fusion module based on cross-layer attention, is developed. (3) The algorithm is evaluated on the Flevoland I, Flevoland II, and Oberpfaffenhofen datasets, confirming its advantages in accurate land cover classification with fewer processing steps. In addition, ablation experiments on the relevant hyperparameters of the proposed method are carried out, and directions for further improvement are discussed.

Figure 3 .
Figure 3. Flowchart of the Feature-Mixing. Furthermore, global dependencies benefit the data augmentation techniques proposed in this paper: after the shuffling and merging of label data, targets span multiple local regions, and introducing global dependencies ensures that the model considers the relationships across these local regions, leading to improved segmentation of coherent objects. The Feature-Mixer module is mainly composed of two sub-blocks. The first is spatial mixing: it acts on the rows of L′ and maps R^S → R^{4S} → R^S. The second is channel mixing, which acts on the columns of L′.
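The two mixing sub-blocks described in the caption can be sketched as follows. Only the spatial MLP's S → 4S → S shape comes from the text; the GELU activation, residual connections, channel-MLP expansion, and all names are illustrative assumptions in the style of Mixer-type networks:

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a GELU nonlinearity, applied along the last axis."""
    h = x @ w1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU
    return h @ w2 + b2

def feature_mixing(L, params_s, params_c):
    """Sketch of a Mixer-style block on L of shape (S, C): spatial mixing
    acts on the rows (token dimension S, expanded S -> 4S -> S as in the
    caption), channel mixing acts on the columns (feature dimension C)."""
    y = L + mlp(L.T, *params_s).T   # spatial mixing: transpose so S is the last axis
    return y + mlp(y, *params_c)    # channel mixing along the feature axis
```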

Figure 4 .
Figure 4. Architecture of the CLA.

Table 1 .
Configurations of different input formats.

Table 2 .
Classification results of different methods in the Flevoland I dataset.

Table 3 .
Classification results of different methods in the Flevoland II dataset.

Table 4 .
Classification results of different methods in the Oberpfaffenhofen dataset.

Table 5 .
Performance contribution of each module.

Table 6 .
Performance of different shapes of block.

Table 7 .
Comparison of different inputs of the model.