Learning to Adapt Adversarial Perturbation Consistency for Domain Adaptive Semantic Segmentation of Remote Sensing Images

Abstract: Semantic segmentation techniques for remote sensing images (RSIs) have been widely developed and applied. However, most segmentation methods depend on sufficiently annotated data for specific scenarios. When a large change occurs in the target scenes, model performance drops significantly. Therefore, unsupervised domain adaptation (UDA) for semantic segmentation has been proposed to alleviate the reliance on expensive per-pixel densely labeled data. In this paper, two key issues of existing domain adaptive (DA) methods are considered: (1) the factors that cause data distribution shifts in RSIs may be complex and diverse, and existing DA approaches cannot adaptively optimize for different domain discrepancy scenarios; (2) domain-invariant feature alignment based on adversarial training (AT) is prone to excessive feature perturbation, leading to overly robust models. To address these issues, we propose AdvCDA, a method that guides the model to adapt to adversarial perturbation consistency. We combine consistency regularization to treat inter-domain feature alignment as perturbation information in the feature space, and thus propose a joint AT and self-training (ST) DA method to further promote the generalization performance of the model. Additionally, we propose a confidence estimation mechanism that determines network stream training weights so that the model can adaptively adjust the optimization direction. Extensive experiments have been conducted on the Potsdam, Vaihingen, and LoveDA remote sensing datasets, and the results demonstrate that the proposed method significantly improves UDA performance in various cross-domain scenarios.


Introduction
Image segmentation has been widely researched as a basic remote sensing intelligent interpretation task [1][2][3][4]. In particular, semantic segmentation based on deep learning plays an important role as a pixel-level classification method in remote sensing interpretation tasks such as building extraction [5], land-cover classification [6], and change detection [7,8]. However, the prerequisite for good performance in existing fully supervised deep learning approaches is sufficiently annotated data. It is also essential that the training and test data follow identical distributions [9]. Once applied to unseen scenarios with different data distributions, model performance can degrade significantly [10][11][12]. This means that new data must be annotated and the model retrained to meet performance requirements, which demands considerable labor and time [13].
In practical applications, the domain discrepancy problem is prevalent in remote sensing images (RSIs) [14,15]. Different remote sensing platforms, payload imaging mechanisms, and photographic angles induce variations in image spatial resolution and object features [16]. Due to variations in seasons, geographic locations, illumination, and atmospheric radiation conditions, images from the same source may also show significant feature distribution differences [17]. The data distribution shift caused by the mix of these complex factors leads the segmentation network to perform poorly in the unseen target domain.
As a transfer learning paradigm [18], unsupervised domain adaptation (UDA) can improve the domain generalization performance of a model by transferring knowledge from annotated source domain data to the target domain [19]. This approach has been extensively researched in computer vision to address the domain discrepancy issue in natural image scenes [20]. Domain adaptive (DA) methods have also gained intensive attention in remote sensing [21]. Compared with natural images, RSIs contain more complex spatial detail and object boundary information, and homogeneous and heterogeneous phenomena are more common in the images. Additionally, the factors that generate domain discrepancies are more complex and diverse. Thus, solving the problem of domain discrepancies in RSIs becomes more challenging. Existing research focuses on three main approaches: UDA based on image transfer [17,22], UDA based on deep adversarial training (AT), and UDA based on self-training (ST) [23,24]. Image transfer methods achieve image-level alignment based on generative adversarial networks. AT-based methods (as shown in Figure 1a) reduce the feature distribution gap between the source and target domains by minimizing an adversarial loss to achieve feature-level alignment [25]. The ST approach (as shown in Figure 1b) focuses on generating high-confidence pseudolabels in the target domain, which then participate in the iterative training of the model to achieve a progressive transfer process [26,27]. One general conclusion about the DA performance of such models is AT+ST > ST > AT [27]. However, as shown in Figure 1c, combining ST and AT methods typically requires strong coupling between submodules, which leads to poor model stability during training [28]. Therefore, fine-tuning the network structure and the submodule parameters is generally needed, so that model performance depends on specific scenarios and loses scalability and flexibility. Recently, several studies have been conducted to optimize and improve this process, such as functionally decoupling AT and ST by constructing dual-stream networks [28] and using exponential moving average (EMA) techniques to construct teacher networks that smooth unstable features during training [29]. However, this also complicates the network architecture, increasing the spatial computational complexity and reducing training efficiency.
This paper combines the consistency regularization idea from semi-supervised learning and proposes a DA semantic segmentation method based on adversarial perturbation consistency to overcome the limitations of the aforementioned methods. Inspired by FixMatch [30], our approach first generates pseudolabels by predicting on weakly augmented target domain images. The same images are strongly augmented with the RandAugment (RA) [31] and ClassMix [32] techniques and then fed into the model for training. The supervision comes from the higher-quality pseudolabels generated by the weakly augmented branch, thus preserving output prediction consistency under diverse input perturbations. This process is termed the weak-to-strong consistency stream. Critically, AdvCDA provides feature-level perturbations in the feature space through AT for interdomain alignment, while leveraging the same weakly augmented branch to provide high-quality pseudolabels as supervised constraints. In this way, model generalization is improved by reducing the interdomain discrepancies, while model training stability is maintained through the supervisory constraints of the pseudolabel information. This process is termed the adversarial perturbation consistency stream. In addition, a confidence estimation mechanism is designed to assess the reliability of the two consistent perturbation processes, so that the model can adaptively optimize the learning direction according to the training scenes.
In this paper, the main contributions are summarized as follows:
1. We propose an AdvCDA method for high-resolution RSIs based on adversarial perturbation consistency. The method combines AT and ST strategies to provide feature perturbation information through interdomain alignment, improving the domain generalization of the model during the ST process. Moreover, the ST method provides high-quality labels that maintain the predictive consistency of the model during AT, thus alleviating the over-robustness that tends to arise during domain alignment.
2. We propose a confidence estimation mechanism to determine the learning weights of the weak-to-strong consistency stream and the adversarial perturbation consistency stream so that the model can adaptively adjust the optimization direction according to different scenarios. Our method has been effectively demonstrated in various domain discrepancy scenarios of high-resolution RSIs.

Image-Level Alignment for UDA
Image-level alignment reduces the data distribution shift between the source and target domains through image transfer methods [33,34]. This scheme generates pseudo images that are semantically identical to the source images but whose spectral distribution is similar to that of the target images [17]. Cycle-consistent adversarial domain adaptation (CyCADA) improves the semantic consistency of the image transfer process through a cycle consistency loss [35]. To preserve the semantic invariance of RSIs after transfer, ColorMapGAN designs a color transformation method without a convolutional structure [17]. Many UDA schemes adopt GAN-based style transfer methods [36] to align the data distributions of the source and target domains. ResiDualGAN [22] introduces scale information of RSIs based on DualGAN [37]. Some works also leverage non-adversarial transform methods, such as the Fourier transform-based FDA [38] and Wallis filtering [39], to reduce image domain discrepancies.

Feature-Level Alignment by AT
Adversarial feature alignment methods train an additional domain discriminator [19,40] to distinguish target samples from source samples and then train the feature network to fool the discriminator, thus generating a domain-invariant feature space [41]. Many works have made significant progress using AT to align the feature space distribution and reduce domain variance in RSIs. Wu et al. [42] focused on interdomain category differences and proposed class-aware domain alignment. Deng et al. [23] designed a scale discriminator to detect scale variation in RSIs. Considering regional diversity, Chen et al. [43] focused on difficult-to-align regions through a region-adaptive discriminator. Bai et al. [20] leveraged contrastive learning to align high-dimensional image representations between different domains. Lu et al. [44] designed global-local adversarial learning methods to ensure local semantic consistency across domains.

Self-Training for UDA
Self-training is a kind of semi-supervised learning [45] that takes high-confidence predictions as easy-to-transfer pseudolabels, which participate in the next iteration of training together with the corresponding target images, progressively realizing the knowledge transfer process [26,27]. Yao et al. [39] used the ST paradigm to improve model performance for building extraction on unseen data. CBST [26] designs class-balanced selectors for pseudolabels to prevent easy-to-predict classes from becoming dominant. ProDA [46] computes representation prototypes that represent the centers of category features to correct pseudolabels. CLUDA [47] constructs contrastive learning between different classes and different domains by mixing source and target domain images. Additionally, several works have attempted to combine ST and adversarial methods to improve domain generalization performance. However, these models are difficult to optimize and often require fine-tuning of the model parameters. Zhang et al. [48] established a two-stage training process of AT followed by ST. DecoupleNet [28] decouples ST and AT through two network branches to alleviate the difficulty of model training.

Consistency Regularization
Consistency regularization is generally employed to solve semi-supervised problems. Its essential idea is to preserve the output consistency of the model under different versions of input perturbations, thus improving the generalization ability of the model on test data [49,50]. FixMatch [30] establishes two network streams with weak and strong perturbation augmentation at the image level, using the weak perturbation to ensure high-quality outputs and the strong perturbation to better train the model. FeatMatch [51] extracts class-representative prototypes for feature-level augmentation transformations. Liu et al. [52] constructed dual-teacher networks to provide more rigorous pseudolabels for unlabeled test data. UniMatch [50] provides an auxiliary feature perturbation stream using a simple dropout mechanism. Several recent regularization models have been designed under the ST paradigm but fail to account for domain discrepancy scenes, so pure consistency regularization has not performed remarkably well in cross-domain scenes.

Materials and Methods
In this section, the general architecture of the proposed network is illustrated and each component of our approach is elaborated. We attempt to improve domain generalization performance through a combination of AT and ST methods. However, unlike existing work [28,29,53], we leverage the idea of consistency regularization [52,54,55] to preserve output prediction consistency during the feature alignment process, mitigating the instability issues easily induced by adversarial perturbation. Simultaneously, a confidence estimation mechanism is established to optimize the training direction for the various complicated domain discrepancy scenarios in RSIs. First, some preliminary work is introduced in Section 3.1. Then, the proposed adversarial perturbation consistency is described in Section 3.2, and the proposed confidence estimation mechanism is described in Section 3.3.

Preliminaries
In the DA semantic segmentation task, the source domain images are defined as $X_S = \{x_s^i\}_{i=1}^{N_S}$, with corresponding one-hot ground truth $Y_S = \{y_s^i\}_{i=1}^{N_S}$. The target domain images are defined as $X_T = \{x_t^i\}_{i=1}^{N_T}$, whose ground truth is not accessible to the model. Typically, the annotated source domain data are used to train model $G$ with parameters $\theta$, and the trained weights are then directly applied to the target domain. The supervised loss is formulated as

$$\mathcal{L}_{sup} = \frac{1}{B_S}\sum_{i=1}^{B_S} \ell_{ce}\big(G(x_s^i; \theta),\, y_s^i\big),$$

where $B_S$ is the batch size of the source domain data input to the model at each iteration, and $\ell_{ce}$ denotes the cross-entropy between the ground truth and the predicted probability distribution, set up here as multicategory cross-entropy. In general, if domain discrepancies exist between the source and target domains, the generalization ability of the model tends to be poor, and its performance in the target domain is usually suboptimal.
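As a hedged illustration (not the paper's code), the supervised source-domain loss above is just the mean pixel-wise cross-entropy between the softmax predictions and the integer label maps:

```python
# Minimal numpy sketch of the supervised cross-entropy loss L_sup.
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """probs: (B, C, H, W) softmax outputs; labels: (B, H, W) integer maps."""
    # pick the predicted probability of the ground-truth class at every pixel
    picked = np.take_along_axis(probs, labels[:, None, :, :], axis=1)[:, 0]
    return float(-np.log(picked + eps).mean())

# toy check: a confident, correct prediction gives a small loss
probs = np.full((1, 2, 2, 2), 0.1)
probs[:, 1] = 0.9                      # class 1 everywhere, probability 0.9
labels = np.ones((1, 2, 2), dtype=int)
loss = cross_entropy(probs, labels)    # = -log(0.9)
```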
Several strategies and methods [25,28,41,56] have been proposed to address the domain shift problem, among which AT and ST have become the two dominant DA methods [57]. In ST, the model generates pseudolabels for the target domain images and iteratively retrains on them so that the model adapts to the target domain. The overall objective is a linear combination of the supervised loss in the source domain and the unsupervised loss in the target domain:

$$\mathcal{L}_{st} = \frac{1}{B_T}\sum_{i=1}^{B_T} \mathbb{1}\big[\max G(x_t^i; \theta) \geq \tau\big]\, \ell_{ce}\big(G(x_t^i; \theta),\, \hat{y}_t^i\big), \quad \hat{y}_t^i = \arg\max\big(G(y \mid x_t^i, \theta)\big),$$

where $B_T$ is the batch size of the target domain data input to the model, $\tau$ is the default confidence threshold, usually set to 0.9 to select high-quality pseudolabels for the target domain, and $\hat{y}_t^i$ denotes the candidate pseudolabels from the target domain.
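The pseudolabel selection described above can be sketched as follows (an illustrative sketch; function names are ours): only pixels whose maximum predicted probability reaches the threshold $\tau = 0.9$ keep their argmax label, and the rest are ignored during training.

```python
# Sketch of confidence-thresholded pseudolabel selection for self-training.
import numpy as np

def select_pseudolabels(probs, tau=0.9, ignore_index=255):
    """probs: (C, H, W) softmax output for one target image."""
    conf = probs.max(axis=0)            # per-pixel confidence
    labels = probs.argmax(axis=0)       # candidate pseudolabels
    labels[conf < tau] = ignore_index   # low-confidence pixels are ignored
    return labels

probs = np.array([[[0.95, 0.6]],
                  [[0.05, 0.4]]])      # (C=2, H=1, W=2)
pl = select_pseudolabels(probs)        # pixel 0 -> class 0, pixel 1 -> ignored
```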
As a common concept in semi-supervised learning [30,51,58], consistency regularization [52,55] typically imposes random perturbations on unannotated data while constraining the model to maintain consistent output predictions. FixMatch [30] uses weak-to-strong consistency regularization to assign different levels of perturbation augmentation, dubbed weak perturbation $\mathcal{A}_w$ and strong perturbation $\mathcal{A}_s$, to each unannotated target domain image $x_t$:

$$\mathcal{L}_{w} = \frac{1}{B_T}\sum_{i=1}^{B_T} \mathbb{1}\big[\max \hat{G}(\mathcal{A}_w(x_t^i)) \geq \tau\big]\, \ell_{ce}\big(G(\mathcal{A}_s(x_t^i)),\, \arg\max \hat{G}(\mathcal{A}_w(x_t^i))\big),$$

where the teacher network $\hat{G}$ generates higher-quality pseudolabels from weakly perturbed target images, and the student network $G$ is the trainable segmentation network, which receives stronger perturbations of the same images for optimization. In our method, the teacher network $\hat{G}$ and the student network $G$ share weights.
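A minimal numpy sketch of this weak-to-strong consistency loss (an illustration under our own naming, not the paper's API): the teacher's confident predictions on the weak view supervise the student on the strong view.

```python
# FixMatch-style weak-to-strong consistency loss on per-pixel softmax outputs.
import numpy as np

def weak_to_strong_loss(teacher_probs, student_probs, tau=0.9, eps=1e-12):
    """teacher_probs, student_probs: (C, N) softmax outputs over N pixels."""
    conf = teacher_probs.max(axis=0)
    pseudo = teacher_probs.argmax(axis=0)
    mask = conf >= tau                         # supervise confident pixels only
    if not mask.any():
        return 0.0
    picked = student_probs[pseudo[mask], np.nonzero(mask)[0]]
    return float(-np.log(picked + eps).mean())

teacher = np.array([[0.95, 0.55], [0.05, 0.45]])   # confident on pixel 0 only
student = np.array([[0.70, 0.30], [0.30, 0.70]])
loss = weak_to_strong_loss(teacher, student)       # = -log(0.70), pixel 0 only
```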
AT obtains a domain-invariant feature space for the source and target domains by aligning the interdomain global feature distributions, which provides another effective way to alleviate the domain discrepancy problem. It generally consists of a segmentation network $G$ and a discriminative network $D$. The segmentation network can be divided into the feature extractor $F$ and the classifier $C$, where $G = C \circ F$. AT relies on the discriminative network $D$ to align the feature distributions extracted by the segmentation network in the source and target domains. Specifically, the segmentation network $G$ and the discriminative network $D$ are optimized alternately and iteratively in the following two steps [25,28,40]:
(1) First, $F$ and $C$ of the segmentation network are frozen, and only the discriminative network is optimized, which improves the ability of the discriminator $D$ to distinguish the output features of the two domains:

$$\mathcal{L}_{d} = -\frac{1}{B_S}\sum_{i=1}^{B_S} \log D\big(0 \mid F(x_s^i)\big) - \frac{1}{B_T}\sum_{i=1}^{B_T} \log D\big(1 \mid F(x_t^i)\big),$$

where 0 denotes the source domain and 1 denotes the target domain.
(2) The segmentation network $G$ not only conducts the supervised training task on the labeled source domain but also participates in the AT process. This is achieved by fixing the discriminative network $D$ and optimizing $F$ and $C$ of the segmentation network with the adversarial loss

$$\mathcal{L}_{adv} = -\frac{1}{B_T}\sum_{i=1}^{B_T} \log D\big(0 \mid F(x_t^i)\big).$$

The main purpose of the adversarial loss $\mathcal{L}_{adv}$ is to confuse the discriminator and encourage the segmentation network to perform interdomain alignment and learn domain-invariant features.
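The two alternating objectives can be sketched numerically (a hedged illustration operating directly on the discriminator's sigmoid outputs, with 0 = source and 1 = target as in the text; the feature extractor and optimization steps are omitted):

```python
# Numpy sketch of the alternating discriminator / adversarial losses.
import numpy as np

def discriminator_loss(d_src, d_tgt, eps=1e-12):
    """Step (1): push D toward 0 on source features and 1 on target features."""
    return float(-np.log(1 - d_src + eps).mean() - np.log(d_tgt + eps).mean())

def adversarial_loss(d_tgt, eps=1e-12):
    """Step (2): with D frozen, make target features look like source (label 0)."""
    return float(-np.log(1 - d_tgt + eps).mean())

# A discriminator that separates the domains well (low L_d) yields a high
# adversarial loss for the segmentation network, and vice versa: the two
# objectives pull in opposite directions.
d_src = np.array([0.1, 0.2])   # D's outputs on source-domain features
d_tgt = np.array([0.9, 0.8])   # D's outputs on target-domain features
ld = discriminator_loss(d_src, d_tgt)
ladv = adversarial_loss(d_tgt)
```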
In general, the ST method combined with consistency regularization shows good stability when the discrepancy between the source and target data distributions is small. However, in practical cases, the factors causing data distribution discrepancies in RSIs are often complicated. For complex domain discrepancy scenarios, the generalization performance of simple ST methods usually fails to meet requirements due to the impact of pseudolabel noise. Deep AT methods aim to reduce domain discrepancies through feature space alignment. However, the semantic segmentation task requires fine-grained feature alignment in high-dimensional space, which is prone to inducing noise disturbances that make the model overly robust and affect the stability of adversarial learning.
Based on the above issues, we propose a novel DA method for high-resolution RSIs based on adversarial perturbation consistency. We provide directional feature perturbation through AT and align the source domain features with the target domain to improve the domain generalization ability of the model. Additionally, combining consistency regularization with the ST paradigm maintains output prediction consistency after feature perturbation and improves the stability of AT. Moreover, to adapt to the complex domain discrepancy scenarios in RSIs, based on the complementary advantages of weak-to-strong and adversarial perturbation consistency, we further develop a confidence estimation mechanism for pseudolabels to constrain the direction of the decision boundary.

Adversarial Perturbation Consistency
To combine the AT and ST paradigms to improve the domain transfer performance of the model while ensuring model stability during training, and inspired by the consistency regularization idea of semi-supervised learning, we propose an adversarial perturbation consistency-based DA semantic segmentation method. Consistency regularization has achieved significant effects in the semi-supervised domain.
However, it is difficult to achieve breakthrough performance improvements when consistency regularization is applied directly to scenarios with large data distribution shifts between the source and target domains, mainly due to the lack of an effective feature alignment mechanism to reduce interdomain discrepancies. AT is an effective interdomain feature alignment method, but it relies on fine-grained alignment in high-dimensional feature space, which is prone to generating ineffective feature perturbations and causing instability in the training process. Hence, AdvCDA treats the AT process as a directional feature perturbation stream within consistency regularization to reduce the interdomain variance, while the output consistency is constrained by a consistency loss to maintain AT stability.
The framework of AdvCDA is shown in Figure 2. For source images with ground truth, we use a supervised loss to train the segmentation network and improve the semantic discrimination performance of the model for each category. For the target domain, we set up three branches to achieve domain transfer between the source and target images and improve the generalization of the model: the weak augmentation branch, the strong augmentation branch, and the adversarial perturbation branch. Similar to some existing semi-supervised methods [30,50], we provide different versions of input perturbations at the input level through weak and strong augmentation to improve the generalization of the model. However, due to domain shifts, consistency learning [51] at only the input image level is often insufficient; the model must maintain consistency at multiple levels under various perturbations to fully exploit its ability to learn generalized features. In particular, the goal of UDA is to align the feature spaces of different domains to reduce domain discrepancies. Therefore, building on weak-to-strong perturbation consistency learning [30], as shown in Figure 3a, we propose injecting adversarial perturbation information and maintaining the consistency of the output prediction under the adversarial perturbation. Specifically, as shown in Figure 3b, we separate image- and feature-level perturbations into individual network streams, allowing the model to directly achieve target consistency with each type of perturbation information.
We attempt to align the shallow feature space of the model between the source and target domains. This design reflects the fact that the domain discrepancies between the source and target domains are mostly represented in low-level feature information, such as spectral and textural differences caused by geographic location, atmospheric radiation conditions, or seasons. These features are generally captured by the shallow layers of the feature extractor, so we inject the adversarial perturbation information into the shallow network features, which captures domain-invariant features more accurately while preventing excessive invalid perturbations from affecting the stability of model training. The source and target domain images are mapped to shallow features $f_{low}^s$ and $f_{low}^t$, from which the predicted results are obtained through the remaining layers and the classifier. Specifically, to reduce the domain discrepancies and improve the generalization performance of the model in the target domain, we align the feature distributions of the source and target domains through AT by applying a discriminator in the shallow feature space. The adversarial loss is

$$\mathcal{L}_{adv} = -\frac{1}{B_T}\sum_{i=1}^{B_T} \log D\big(0 \mid f_{low}^{t,i}\big).$$

Training the discriminator is also required to improve its discriminative performance on the source and target domains. The discriminator loss is

$$\mathcal{L}_{d} = -\frac{1}{B_S}\sum_{i=1}^{B_S} \log D\big(0 \mid f_{low}^{s,i}\big) - \frac{1}{B_T}\sum_{i=1}^{B_T} \log D\big(1 \mid f_{low}^{t,i}\big),$$

where 0 denotes the source domain and 1 denotes the target domain. Through AT, the feature space of the target domain gradually converges to that of the source domain, yielding a domain-invariant feature space. The alignment process can thus be regarded as injecting a feature perturbation into the shallow feature space of the model, producing the new feature $f_{low}^{fp}$, from which we obtain the predicted results after feature perturbation. Fine-grained feature alignment in high-dimensional space is more prone to generating adversarial noise [59], causing a lack of stability when training DA methods. Therefore, we constrain the model to maintain consistent output predictions after noise perturbation based on the idea of consistency regularization, which helps improve the stability of the model. Eventually, the unsupervised loss in the target domain is reformulated as $\mathcal{L}_w$ and $\mathcal{L}_{fp}$, where $\mathcal{L}_w$ denotes the weak-to-strong consistency loss and $\mathcal{L}_{fp}$ denotes the adversarial perturbation consistency loss.
$$\mathcal{L}_{fp} = \frac{1}{B_T}\sum_{i=1}^{B_T} \mathbb{1}\big[\max \hat{G}(\mathcal{A}_w(x_t^i)) \geq \tau\big]\, \ell_{ce}\big(p_{fp}^i,\, \arg\max \hat{G}(\mathcal{A}_w(x_t^i))\big),$$

where $\mathcal{A}_w$ denotes the weak augmentation, $\hat{G}$ the weight-sharing teacher network, and $p_{fp}^i$ the prediction after adversarial feature perturbation. To adapt to the complicated domain discrepancies in RSIs, our framework is designed with a weak-to-strong consistency stream and an adversarial perturbation consistency stream, which skillfully combines the ST and AT methods to improve domain transfer performance while guaranteeing training stability. Specifically, AT plays a crucial role in the network by conducting interdomain alignment to reduce domain discrepancies. On the one hand, AT provides feature-level perturbations that allow the model to learn consistent features from richer perturbation information. On the other hand, feature alignment reduces the domain discrepancies between the source and target images to improve the domain generalization performance of the model. Meanwhile, consistency regularization enables the model to maintain strong stability during the co-learning of ST and AT, which fully exploits the potential for domain generalization.

Confidence Estimation Mechanism
In general, for large domain discrepancy scenarios, feature alignment by AT plays the primary role in reducing interdomain discrepancy and improving the generalization of the model. In contrast, ST methods are prone to pseudolabel noise that can lead to performance degradation [46]. For scenarios with small domain discrepancies, such as semi-supervised settings, the ST method can be sufficient to attain satisfactory results in the target domain. Therefore, for the weak-to-strong consistency and adversarial perturbation consistency streams, it is better to allow the model to adaptively optimize the learned weights of the two streams to handle uncertain domain discrepancy scenarios.
The key design question of this method is how to estimate the confidence of each stream to guide the model toward better transfer training. It is especially critical for ST methods to design confidence thresholds for pseudolabels: labels below the confidence threshold are generally considered incorrect predictions, while labels above the threshold are involved as candidate labels in the next iteration of training to improve model performance in the target domain. Based on this, as shown in Figure 3b, we propose a confidence estimation mechanism that estimates the training confidence of the two streams by calculating the similarity of the outputs of the strongly augmented branch and the adversarial perturbation branch to those of the weakly augmented branch, thus constraining the model to assign more training weight to the higher-quality consistency stream. In addition, both of the proposed consistency regularization streams conduct consistency-supervised learning based on weak augmentation; intuitively, the weakly augmented branch is more likely to produce high-quality prediction results. We define the final target domain loss as

$$\mathcal{L}_t = \lambda_1 \mathcal{L}_w + \lambda_2 \mathcal{L}_{fp},$$

where $\lambda_1$ and $\lambda_2$ are the key weights estimating the confidence of the two streams.
The weight values determine how strongly the corresponding stream influences training and gradient optimization, guiding the optimization direction of the model. When $\lambda_2 = 0$, the model degenerates into the semi-supervised model FixMatch [30]. Specifically, we use the similarity of the logit outputs of the strongly augmented branch and the adversarial perturbation branch to those of the weakly augmented branch, respectively, as the confidence estimates of the two streams:

$$c_{ws}^i = \mathrm{sim}\big(z_s^i,\, z_w^i\big), \quad c_{fp}^i = \mathrm{sim}\big(z_{fp}^i,\, z_w^i\big),$$

where $z_w^i$, $z_s^i$, and $z_{fp}^i$ denote the logit outputs of the weakly augmented, strongly augmented, and adversarial perturbation branches, and $c_{ws}^i$ and $c_{fp}^i$ are the confidence weights assigned to the weak-to-strong consistency and adversarial perturbation consistency streams. A higher value represents higher confidence in the corresponding stream, and the model tends to learn from the stream with high confidence. To avoid instability caused by scale variation in the weight values, we normalize the final weights:

$$\lambda_1 = \frac{c_{ws}}{c_{ws} + c_{fp}}, \quad \lambda_2 = \frac{c_{fp}}{c_{ws} + c_{fp}}.$$

In this case, the final loss used to train the segmentation network is

$$\mathcal{L} = \mathcal{L}_{sup} + \lambda_1 \mathcal{L}_w + \lambda_2 \mathcal{L}_{fp} + \mathcal{L}_{adv},$$

where $\mathcal{L}_{adv}$, as an adversarial loss, injects interdomain feature alignment perturbation information into the feature extractor before the gradient optimization of the segmentation network. The data distribution shifts between source and target images mainly manifest in the shallow information, so $\mathcal{L}_{adv}$ focuses primarily on domain-invariant features in the shallow feature space, and $\mathcal{L}_d$ is employed to individually train and optimize the discriminative network.
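As an illustrative sketch of the confidence estimation mechanism (assuming cosine similarity over flattened logits as the similarity measure, which the text does not pin down), each stream's weight is its similarity to the weakly augmented branch, normalized so the two weights sum to one:

```python
# Sketch of confidence-weighted stream balancing between the weak-to-strong
# and adversarial perturbation consistency streams.
import numpy as np

def cosine_sim(a, b, eps=1e-12):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def stream_weights(weak_logits, strong_logits, fp_logits):
    c_ws = cosine_sim(strong_logits, weak_logits)   # weak-to-strong confidence
    c_fp = cosine_sim(fp_logits, weak_logits)       # adversarial-stream confidence
    total = c_ws + c_fp
    return c_ws / total, c_fp / total               # lambda_1, lambda_2

weak = np.array([1.0, 0.0, 1.0])
lam1, lam2 = stream_weights(weak, weak.copy(), np.array([0.0, 1.0, 0.0]))
# the strong branch matches the weak branch exactly, so it gets the larger weight
```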
In addition, for weak-to-strong augmentation in consistency regularization learning, we leverage the ClassMix [32] augmentation strategy in the strongly augmented perturbations by mixing the foreground and background regions of the image to provide more diverse information about the perturbations, as illustrated in Figure 4. Compared to the commonly adopted CutMix [60] strategy, ClassMix has more advantages in maintaining the semantic integrity and the boundary information of each object in the images.
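A minimal sketch of ClassMix-style mixing (names and the half-the-classes selection rule follow the original ClassMix paper, not necessarily this paper's exact implementation): pixels belonging to roughly half of the classes in one image are pasted onto another image, so mixed boundaries follow whole class masks rather than CutMix's rectangular box.

```python
# Sketch of ClassMix: paste the pixels of a random subset of classes from
# image A onto image B, using A's label map to build the paste mask.
import numpy as np

def classmix(img_a, img_b, labels_a, rng=None):
    rng = np.random.default_rng(rng)
    classes = np.unique(labels_a)
    chosen = rng.choice(classes, size=max(1, len(classes) // 2), replace=False)
    mask = np.isin(labels_a, chosen)          # pixels of the chosen classes
    mixed = np.where(mask[..., None], img_a, img_b)
    return mixed, mask

img_a = np.zeros((4, 4, 3))
img_b = np.ones((4, 4, 3))
labels_a = np.zeros((4, 4), dtype=int)
labels_a[:2] = 1                              # class 1 occupies the top half
mixed, mask = classmix(img_a, img_b, labels_a, rng=0)
```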

Dataset Description
To validate the segmentation performance of AdvCDA under different domain discrepancy scenarios in RSIs, three benchmark datasets are used: the Potsdam, Vaihingen, and LoveDA datasets.
Potsdam dataset: The Potsdam dataset consists of 38 tiles of 5 cm resolution RSIs with a size of 6000 × 6000 pixels, annotated with six interpretation categories: impervious surfaces, buildings, trees, cars, low vegetation, and background. The dataset has red, green, blue, and near-infrared bands, and we use both the IRRG and RGB imaging modes in the experiments. In addition, we follow the same sample splitting method and crop the images to a 512 × 512 patch size [24]. A total of 4598 samples are generated and divided into 2904 training samples and 1694 test samples [22,24,61].
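The patch-splitting step can be sketched as below (a hedged illustration; the paper's exact cropping scheme is assumed to be non-overlapping tiling, which is consistent with the reported count: each 6000 × 6000 tile yields 11 × 11 = 121 patches of 512 × 512, and 121 × 38 tiles = 4598 samples).

```python
# Sketch: tile a large remote sensing image into non-overlapping patches.
import numpy as np

def crop_patches(image, size=512):
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]          # slices are views, no copies
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

tile = np.zeros((6000, 6000), dtype=np.uint8)      # one single-band Potsdam tile
patches = crop_patches(tile)                       # 11 x 11 = 121 patches
```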
Vaihingen dataset: The Vaihingen dataset contains the same interpretation categories as the Potsdam dataset, with an image resolution of 9 cm and only the IRRG imaging mode. The dataset contains 33 very-high-resolution true orthophoto (TOP) images. During data preprocessing, we also crop the images to a 512 × 512 size and divide them into 1296 training images and 440 test images [22,24,61].
LoveDA dataset: The LoveDA dataset provides both rural and urban land cover scenes and contains seven interpretation categories: building, road, water, barren, forest, agricultural land, and background. It contains 5987 high-resolution (0.3 m) images from three different cities, each with a size of 1024 × 1024. The urban scene contains 1156 training images, 677 validation images, and 820 test images, while the rural scene contains 2358 images, of which 1366 are used for training and 976 for testing [62,63]. On the LoveDA dataset, we focused our experiments on the remote sensing cross-domain task of rural-to-urban scenes.

Experimental Settings and Evaluation Metrics
All the network architectures in our experiments were implemented using the PyTorch framework. We primarily leveraged SegFormer [64] as our baseline segmentation model. During training, the SegFormer model was optimized with AdamW [65], with the momentum parameter set to 0.9 and the weight decay set to $10^{-2}$. The initial learning rates of the encoder and decoder were set to $6 \times 10^{-5}$ and $6 \times 10^{-4}$, respectively, and then decayed linearly with iterations. We set horizontal flipping and random rotation as the weak augmentation methods in consistency learning, while adding RandAugment [31] and ClassMix [32] as strong augmentation methods for the weak-to-strong consistency branch.
We comprehensively evaluated the performance of the model using the mean intersection over union (mIoU), obtained by calculating the intersection over union (IoU) for each category and then averaging. The IoU for each category is computed from the confusion matrix via three terms, true positive (TP), false positive (FP), and false negative (FN):

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}.$$

In addition, following the settings of [22,24,66], the F1 score was used to further evaluate the proposed method, which is defined as

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
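These metrics can be sketched as follows (an illustrative implementation, not the paper's evaluation code): per-class TP/FP/FN are read off a confusion matrix, from which IoU and F1 follow directly.

```python
# Sketch of per-class IoU and F1 computed from a confusion matrix.
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    idx = num_classes * gt.ravel() + pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, -1)

def iou_f1(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp           # pixels of class c that were missed
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)   # equals 2PR / (P + R)
    return iou, f1

pred = np.array([0, 0, 1, 1])
gt   = np.array([0, 1, 1, 1])
cm = confusion_matrix(pred, gt, 2)
iou, f1 = iou_f1(cm)                   # class 0: TP=1, FP=1, FN=0 -> IoU=0.5
```

The mIoU reported in the experiments is then simply `iou.mean()` over the evaluated categories.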

Comparisons with Other Methods
To verify the effectiveness of AdvCDA, we performed experiments in three kinds of domain discrepancy scenarios commonly observed in RSIs: cross-spectral, cross-space, and complex domain discrepancy scenarios.

Cross-Space Scenarios
We conducted experiments with the Potsdam (IRRG) dataset as the source domain and the Vaihingen (IRRG) dataset as the target domain. We focused primarily on practically meaningful goals, so five categories were evaluated: impervious surfaces, buildings, low vegetation, trees, and cars [23,53,67]. Objects in the two datasets have significant characteristic differences: the Potsdam images contain large buildings, narrow streets, and dense residential structures, whereas the Vaihingen images contain mostly free-standing structures and small buildings. The results are shown in Table 1. Compared to the existing state-of-the-art (SOTA) method, AdvCDA improves the mIoU performance by 2.84% and the mFscore performance by 2.01%. In terms of individual categories, our method significantly improved the results across the board, achieving the best IoU and F1 performance for impervious surfaces, cars, buildings, and trees, which indicates that the proposed DA method has a more robust and stable domain transfer ability. Note that ST-DASegNet and DAFormer, the best performers among the compared methods, both use a transformer (SegFormer) baseline, and the best performance of AdvCDA is likewise achieved with a transformer-based model. Furthermore, the qualitative visualization in Figure 5 shows intuitively that the proposed AdvCDA performed strongly in the Potsdam (IRRG) → Vaihingen (IRRG) cross-domain task.
Geographical discrepancies between urban and rural areas are also very common in practical remote sensing applications. Urban areas cover many building clusters and dense road grids compared to rural areas with more agricultural land, increasing the difficulty of model generalization. As shown in Table 2, we give the type of architecture for each method. For this rural → urban cross-domain task, the combination of ST and AT outperforms purely ST methods, while purely AT methods show the lowest performance. Furthermore, using SegFormer as the baseline model, the mIoU performance of our DA approach outperformed the baseline by 9.09%. Among the compared DA methods, DAFormer, ST-DASegNet, and our method are all transformer-based networks, which achieve a clear advantage over CNN-based DA methods, while our method achieved the best overall performance.
We provide the visualization results for the rural → urban task in Figure 6. Since the ground truth is not available for the test set, we show the validation data of the LoveDA dataset. AdvCDA has clear advantages in preserving the integrity and edge accuracy of the objects.

To further validate the effectiveness of the model in complex domain discrepancy scenarios, which represent more difficult and larger data distribution shifts, we conducted experiments on the Potsdam (RGB) → Vaihingen (IRRG) task. Note that this task involves both cross-spectral and cross-space discrepancies, and the same classes also exhibit large-scale variations, which poses a greater challenge to the generalization and stability of the model. Table 4 presents the quantitative comparison between AdvCDA and several existing DA methods. Compared to the simpler cross-spectral and cross-space scenarios, our approach has greater advantages in complex domain discrepancy scenarios: AdvCDA outperforms the best comparison method by 4.03% in mIoU and 4.26% in mFscore. The experimental results demonstrate that AdvCDA also achieves the best performance in complex scenarios combining cross-spectral and cross-space discrepancies.

In the comparison experiments, AdvCDA achieves significant advantages in the various domain discrepancy scenarios that are common in RSIs, which demonstrates both its effectiveness and its stability across different remote sensing transfer tasks.

Intuitively, in contrast to the pure ST approach, the key component of the proposed joint ST and AT paradigm is the additional adversarial alignment idea, which captures a domain-invariant feature space and promotes the generalization ability of the model. Therefore, we investigate the impact of the feature alignment module in the AT process when it acts on different feature layers of the segmentation network. With the transformer-based SegFormer [64] as the backbone, stage-1 to stage-4 of the backbone and the output layers were used as inputs to the discriminative network. AT only updates the gradients of the network layers preceding the current feature layer for feature alignment. The results shown in Figure 8 indicate that conducting feature alignment at stage-2 achieves the best DA results, whereas model performance tends to decrease when the feature alignment module is applied to deeper layers, such as stage-3 and stage-4, which may be because AT overly interferes with the feature parameters, resulting in an over-robust model. Feature alignment in the output space is commonly employed in AT methods to maintain the consistency of the output layouts of the source and target domains; however, our experiments show that AdvCDA achieves the best DA performance by applying adversarial feature interference at stage-2.

Effectiveness Analysis of Each Component
To validate the effectiveness of each component of the proposed AdvCDA, we conducted ablation experiments, reported in Table 5. FixMatch leverages weak-to-strong consistency regularization for ST, while our approach generalizes this idea to DA tasks. The key idea is to leverage consistency regularization to improve the stability of AT, thus combining ST and AT to boost DA performance, which we dub adversarial perturbation consistency (AdvC). The adversarial perturbation consistency acts on the feature layer of the model to complement the advantages of weak-to-strong perturbation consistency at the input level, improving the mIoU performance of the model by 3.26%. In addition, confidence estimation (CB) over the two streams of weak-to-strong and adversarial perturbation consistency, which enables adaptive optimization, is crucial for AdvCDA to maintain stable performance across different domain discrepancy scenarios; it further improves the mIoU performance of the model to 68.65% on the Potsdam (IRRG) → Vaihingen (IRRG) cross-domain task.

Table 6 shows the performance obtained by imposing different augmentation perturbation strategies on the strong augmentation branch for different cross-domain tasks. The baseline is augmented with horizontal flipping, rotation, and other common augmentation methods used in semantic segmentation models. Although the CutMix strategy can effectively improve the generalization ability of the model in semi-supervised learning tasks [52], it instead degrades the performance of the model in cross-domain scenes. We assume that CutMix augmentation corrupts the local semantic integrity of the classes and that the loss of semantic information further enlarges the discrepancies between the source and target domains. In contrast, ClassMix preserves complete object boundaries and mixes images from the source and target domains for augmentation, and the model performance is further improved. In addition, RandAugment (RA), a commonly used strategy in weak-to-strong consistency learning, improves the mIoU performance by 2.48% and 1.02% in the Potsdam (RGB) → Vaihingen (IRRG) and rural → urban cross-domain tasks, respectively.
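ClassMix, as discussed above, pastes the pixels of roughly half the classes present in one image onto another, so object boundaries stay intact. A minimal sketch on nested lists (the function name and shapes are illustrative, not the paper's implementation):

```python
import random

def classmix(img_a, lbl_a, img_b, lbl_b, seed=0):
    """ClassMix-style augmentation sketch: randomly select half of the classes
    present in (img_a, lbl_a) and paste their pixels onto (img_b, lbl_b),
    mixing the labels the same way. All inputs are 2D lists of equal shape."""
    rng = random.Random(seed)
    classes = sorted({c for row in lbl_a for c in row})
    chosen = set(rng.sample(classes, k=len(classes) // 2))
    h, w = len(img_a), len(img_a[0])
    mixed_img = [[img_a[i][j] if lbl_a[i][j] in chosen else img_b[i][j]
                  for j in range(w)] for i in range(h)]
    mixed_lbl = [[lbl_a[i][j] if lbl_a[i][j] in chosen else lbl_b[i][j]
                  for j in range(w)] for i in range(h)]
    return mixed_img, mixed_lbl
```

Because whole class masks are copied, the pasted regions keep complete object boundaries, unlike the rectangular cut of CutMix.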

Conclusions
In this paper, we propose a novel DA semantic segmentation method based on adversarial perturbation consistency to address the distribution discrepancies among different domains in RSIs. In the network architecture, we design a weak-to-strong consistency stream at the input level and an adversarial perturbation consistency stream at the feature level, aiming to further improve the domain generalization performance of the model through joint AT and ST. Crucially, considering the inherent instability of AT, we use consistency regularization to provide high-quality pseudolabels, preventing the over-robustness that can easily be induced by over-perturbation of the feature space during AT. Furthermore, we propose a confidence estimation mechanism to adaptively assign optimization weights to each stream and thus guide the model toward better domain transfer. The effectiveness of the proposed method is validated on three remote sensing benchmark datasets covering cross-space, cross-spectral, and complex domain discrepancy scenarios. Extensive experiments demonstrate the performance superiority of AdvCDA over existing UDA methods. Notably, AdvCDA improves mIoU performance by 4.03% and mFscore performance by 4.26% over existing SOTA methods in the Potsdam (RGB) → Vaihingen (IRRG) complex domain discrepancy scenario, further demonstrating that the adversarial perturbation consistency and confidence estimation mechanisms enable the model to adapt effectively in complex unseen scenarios. Nevertheless, our approach focuses on specific target domains and mainly studies the transfer of domain-specific knowledge to known target domains. In future work, we will further explore domain-generalized feature learning in the case of multiple target domains or unseen target domains.

Figure 1 .
Figure 1. General paradigm description of existing DA training methods. (a) AT-based DA approach. (b) Self-training (ST)-based DA approach. (c) Combined ST and AT DA approach.
where f_s and f_t are feature extractors whose inputs are the source images x_s and target images x_t, respectively. d denotes the domain indicator, where 0 denotes the source domain and 1 denotes the target domain; D(f_s(x_s)) and D(f_t(x_t)) denote the probabilities, estimated by the discriminator D, that the inputs come from the source and target domains, respectively.
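The equation this passage annotates was lost in extraction; a standard binary cross-entropy adversarial alignment objective consistent with this notation (an assumed reconstruction, not necessarily the paper's exact form) is:

```latex
\mathcal{L}_{adv} =
  -\,\mathbb{E}_{x_s}\!\left[\log D\big(f_s(x_s)\big)\right]
  -\,\mathbb{E}_{x_t}\!\left[\log\!\big(1 - D\big(f_t(x_t)\big)\big)\right]
```

The discriminator D is trained to minimize this loss, while the feature extractor is trained adversarially to fool it, aligning source and target feature distributions.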

Figure 2 .
Figure 2. Overall framework of AdvCDA. The source images are fed into the feature extractor and classifier, and the supervised loss is computed using the source predictions and the corresponding ground truth to help the segmentation network learn task-specific knowledge. The target images pass through a weak augmentation flow to obtain high-quality pseudolabels. The same target images are put through a strong augmentation flow and an adversarial perturbation flow to obtain two target predictions, which are used to minimize the consistency loss. The two consistency training processes are weak-to-strong consistency and adversarial perturbation consistency. The domain discriminator is part of the AT and generates feature perturbations at the network layer. Feature alignment of the source and target domains is performed to minimize domain discrepancies.

Figure 3 .
Figure 3. Comparison of consistency regularization pipelines. (a) Weak-to-strong consistency baseline framework. (b) The proposed adversarial perturbation consistency framework.

Figure 4 .
Figure 4. Weak-to-strong consistency with the introduction of ClassMix.

Figure 8 .
Figure 8. Performance of feature alignment modules on different network layers.

Table 1 .
Comparison results of AdvCDA with existing DA methods. The mIoU performance is validated on the test set of the Potsdam (IRRG) → Vaihingen (IRRG) task. The best results are highlighted in bold.

Table 2 .
Comparison results of AdvCDA with existing DA methods. The mIoU performance is validated on the test set of the rural → urban task. The best results are highlighted in bold.

Table 3 .
Comparison results of AdvCDA with existing DA methods. The mIoU performance is validated on the test set of the Potsdam (RGB) → Potsdam (IRRG) task. The best results are highlighted in bold.

Table 4 .
Comparison results of AdvCDA with existing DA methods. The mIoU performance is validated on the test set of the Potsdam (RGB) → Vaihingen (IRRG) task. The best results are highlighted in bold.

Table 5 .
Ablation experiments on the effectiveness of each component with the proposed approach.

Table 6 .
The performance of applying different augmentation perturbation strategies to strong augmentation branches in the tasks Potsdam RGB → Vaihingen IRRG and rural → urban.