Unsupervised Domain Adaptation for High-Resolution Coastal Land Cover Mapping with Category-Space Constrained Adversarial Network

Abstract: Coastal land cover mapping (CLCM) across image domains presents a fundamental and challenging segmentation task. Although adversarial domain adaptation methods have been proposed to address this issue, they typically implement distribution alignment via a global discriminator while ignoring the data structure. Additionally, the low inter-class variances and intricate spatial details of coastal objects may lead to poor representations. Therefore, this paper proposes a category-space constrained adversarial method to perform category-level adaptive CLCM. Focusing on the underlying category information, we introduce a category-level adversarial framework to align semantic features. We summarize two diverse strategies to extract category-wise domain labels for the source and target domains, where the latter is driven by self-supervised learning. Meanwhile, we generalize the lightweight adaptation module to multiple levels across a robust baseline, aiming to fine-tune the features at different spatial scales. Furthermore, self-supervised learning is also leveraged as an improvement strategy to optimize the result within segmented training. We examine our method on two converse adaptation tasks and compare it with other state-of-the-art models. The overall visualization results and evaluation metrics demonstrate that the proposed method achieves excellent performance in domain-adaptive CLCM with high-resolution remotely sensed images.


Introduction
Coastal land cover mapping (CLCM) provides a detailed and intuitive presentation of ground objects in the land-sea interaction zone, which is a necessary premise for land investigation, resource development, and eco-environment protection [1][2][3]. In the past decade, the continuous evolution of space and sensor technologies has brought remote sensing into the Big Data era [4]. An intuitive advance is the favorable circumstance for achieving mass production of land cover maps that meet large-scale and high-resolution needs. However, high-resolution remotely sensed (HRRS) images acquired in various scenarios are easily affected by uncontrollable factors, e.g., seasonal climates, regional conditions, and sensor models. Unfortunately, these factors may result in remarkable divergences in the appearance distributions of scenes and ground objects. Therefore, it remains a challenging task to automate large-area and high-precision CLCM production.

1. Referring to the characteristics of HRRS images in coastal areas, we propose a category-level UDA approach to achieve land cover mapping across image domains, which emphasizes the advantage of adversarial learning in generating and aligning the feature spaces.
2. For the category-level adaptation framework, we focus on the underlying category space of the target domain and introduce a category-wise discriminator to fine-tune the segmentation network. In light of heterogeneous situations, two different strategies are adopted to extract domain labels for the discriminator.
3. With the lower-level features concerning local details and the higher-level ones encoding global context, we integrate adaptation modules with a similar architecture into each feature stack, aiming to align the semantic features at multiple spatial scales.
4. Experiments on two coastal datasets demonstrate that the proposed method enables cross-domain CLCM and achieves excellent performance compared with other state-of-the-art models.
The remainder of this paper is arranged as follows. The background is reported in Section 2. Section 3 introduces our proposed method and presents its implementation details. Then, Section 4 describes the experimental procedures and results on two benchmark datasets, while we discuss the effectiveness of various designed modules in Section 5. Finally, our work and future research are concluded in Section 6.

Adversarial Learning
Adversarial learning has recently become popular and has been explored in generative tasks since Goodfellow et al. [19] proposed the Generative Adversarial Nets (GAN) as a pioneering report. Adversarial learning essentially presents a dynamic mini-max game where the generative adversarial method is divided into two antagonistic modules: a generative module G and a discriminative module D. Within the iterative training, G strives to generate imitative samples to deceive the discriminator by capturing the data distribution. Meanwhile, the target of D is to distinguish the generated distribution from the real one via a binary domain label. The whole process seeks G to minimize the divergence while updating D to maximize the separation, which can be formulated as follows:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{\hat{x} \sim P_G}[\log(1 - D(\hat{x}))],$$

where $P_{data}$ and $P_G$, respectively, indicate the real and generated distributions. The advantage of GAN over other generative approaches is that no complex sampling and inference are required [17]. Since then, numerous variants have served a breadth of visual tasks, e.g., image generation, style transfer, and image labeling. Using a deep fully convolutional framework, DCGAN [36] provides pioneering guidance for complex mapping, while CGAN [37] denotes an extension that makes it possible to link additional information such as the category relation of training samples. Presented as an excellent work, CycleGAN [28] performs unpaired image-to-image translation by adopting bi-directional consistency losses and adversarial losses. In summary, with its outstanding performance, the generative adversarial approach has become a fundamental strategy for unsupervised domain adaptation.
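As a toy illustration of this value function (not from the paper; NumPy-based, with a hypothetical one-dimensional affine-plus-sigmoid discriminator), a discriminator that separates the real and generated samples attains a higher value of V(D, G) than an indifferent one:

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w, b):
    # Toy 1-D discriminator: sigmoid of an affine score (illustrative).
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def value_function(x_real, x_fake, w, b):
    # V(D, G) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]
    d_real = discriminator(x_real, w, b)
    d_fake = discriminator(x_fake, w, b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

x_real = rng.normal(2.0, 0.5, 256)   # samples from P_data
x_fake = rng.normal(0.0, 0.5, 256)   # samples from P_G
# A discriminator separating the two modes scores higher than a blind one.
assert value_function(x_real, x_fake, w=4.0, b=-4.0) > value_function(x_real, x_fake, w=0.0, b=0.0)
```

G then updates to move $P_G$ toward $P_{data}$, shrinking exactly this gap.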

Self-Supervised Learning
Even though deep supervised learning has achieved outstanding successes in the past decade, it has a fatal flaw: excessive reliance on manual annotations. As an alternative, self-supervised learning (SSL) mines supervision information from the input data itself for training. Precisely, SSL captures pseudo labels via a semi-automatic process or a partial prediction leveraging the rest of the data. There are three summarized objective-based types: generative, contrastive, and generative-contrastive [38]. Generally, SSL benefits various downstream tasks without the need for expensive supervision information.
Denoted as a branch of semi-supervised learning, SSL has been used in various image-related tasks [39][40][41]. For instance, a broad span of domain adaptation applications [35,41] leverage SSL to learn the decision boundary between source and target data. These approaches promote the global feature matching of different data domains while performing well in class-wise alignment. The SSL approach is also employed to carry out pixel-level annotating when the ground truth is not accessible. Under this scenario, the related methods [12,31] are often guided by the cross-entropy loss between the dense prediction and the generated pseudo label. In our work, we leverage SSL to execute the pixel-level segmentation task in an unsupervised way. Note that the SSL strategy is simultaneously applied in our domain adaptation framework and segmentation module. First, we denote the dense predictions from the target data as the discriminator labels to update the adversarial adaptation network in training iterations. Second, the pseudo labels from the above adaptive predictions are regarded as the ground truth to fine-tune the segmentation network for target images. To a certain extent, SSL overcomes the defect of missing annotations and has made distinguished contributions.
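A minimal sketch of confidence-based pseudo-label extraction in the spirit described above; the threshold value and the ignore index are illustrative assumptions, not the paper's exact rule:

```python
import numpy as np

def pseudo_labels(probs, threshold=0.9):
    """Keep argmax predictions whose confidence exceeds `threshold`;
    mark the rest with an ignore index (-1) so they are excluded from
    the cross-entropy loss during fine-tuning."""
    labels = probs.argmax(axis=-1)
    confidence = probs.max(axis=-1)
    labels[confidence < threshold] = -1
    return labels

probs = np.array([[0.95, 0.03, 0.02],   # confident -> class 0
                  [0.40, 0.35, 0.25]])  # ambiguous -> ignored
assert pseudo_labels(probs).tolist() == [0, -1]
```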

Problem Setting
Having access to the source image set with dense annotations and the target image set without any references (Figure 1a), we focus on the problem of unsupervised domain adaptation for CLCM with HRRS images. The goal is to learn a pixel-level segmentation network in a supervised way and then achieve correct predictions for the target images in an unsupervised manner. Due to the divergence of the marginal and joint distributions of the two datasets (domains), deep convolutional models trained on source data often fail to generalize to the target space.

To address the adverse effect of domain shift, we resort to a generative adversarial framework that learns the feature mapping between the source and target domains. Conventional adversarial methods commonly leverage a global discriminator as the domain judge, which only aligns the global marginal distribution. In such cases, categories within the target domain may be misclassified (Figure 1b). Considering the underlying semantic structure, we fuse category information into the multi-level adversarial procedure by replacing the single naive discriminator with our multi-level category-wise discriminators. As illustrated in Figure 1c, this strategy implements local domain matching for category features at multiple scales while performing global domain alignment.

Overall Formulation
The proposed CsCANet serves the cross-domain CLCM task via a multi-level adversarial framework and an extra self-supervised learning module (Figure 2). Specifically, the whole architecture comprises three fully convolutional networks: a feature extractor F, a pixel-level classifier C, and category-wise discriminators $D_i$, where $i = 1, 2, \ldots, n$ denotes the level of the adversarial adaptation scheme. Source images $X_S = \{X_S^i\}_{i=0}^{N_S}$ and target images $X_T = \{X_T^i\}_{i=0}^{N_T}$ are given as the inputs of the network, where $N_S$ and $N_T$ indicate the numbers of respective samples. Then, the shared baseline network F generates multiple feature stacks (i.e., $F_i(X_S, \theta)$ and $F_i(X_T, \theta)$) for both the source and target domains at different spatial scales. Our target is to make the multi-level semantic features $F_i(X_S, \theta)$ and $F_i(X_T, \theta)$ close to each other. Hence, four category-wise discriminators with an analogous architecture are designed to achieve category-level domain adaptation, aligning the features at specific scales. Note that the source label $Y_S$ and target prediction $C(F_i(X_T, \theta), \mu)$ are, respectively, denoted as the domain labels for the discriminators. Furthermore, the prediction of the shared classifier C is presented as the result of the CLCM, which also serves to optimize the feature extractor and classifier via the segmentation losses.

With the proposed network, the joint loss objective for the hybrid adaptation task can be formulated from two primary modules:

$$L(X_S, X_T) = L_{seg}(X_S, X_T) + \lambda_{adv} L_{adv}(X_S, X_T),$$

where $L_{seg}$, consisting of $L_{seg}^S$ and $L_{seg}^T$, denotes the cross-entropy losses between the prediction and the label (truth or pseudo) in the source and target domains, and $L_{adv}$ indicates the adversarial losses that align the category-wise data distributions. Besides, $\lambda_{adv}$ presents the weight coefficient that keeps backward propagation steady. For the target domain, the loss $L_{seg}^T$ is individually used for self-supervised learning to fine-tune the segmentation network toward better adaptation.

Domain Labels Extraction Module
For cross-domain dense segmentation, each image contains numerous pixels that represent multiple instances. Exploring adversarial learning for domain adaptation in the right way is a vital premise. The majority of global adaptation approaches adopt single binary values (either "0" or "1") as the opposite domain labels (Figure 3a), ignoring the category-space constraints in the target domain. Our architecture instead introduces a category-wise discriminator for adaptively aligning the semantic features. For this purpose, extracting the category-wise domain labels is a crucial component module.

We seek the category information contained in both domains to construct the domain label for each sample. Since the ground truth in the target domain is not accessible, category-level alignment cannot rely on true target category information. Referring to self-supervised learning, treating the target label as a learnable hidden variable is a feasible choice. We use the prediction of C as the domain label to supervise the discriminator since the target domain shares the same semantic categories as the source data.
In general, there are two strategies for extracting domain labels, whose outcomes are category-wise hard and soft labels. For the former, shown in Figure 3b, one-hot encoding is a straightforward solution, which can be formulated as follows:

$$hl^{(i,k)} = \begin{cases} 1, & k = \arg\max_{k'} P^{(i,k')} \\ 0, & \text{otherwise,} \end{cases}$$

where $i \in (N = H \times W)$ denotes the pixel position and $P^{(i,k)}$ gives the softmax probability prediction of the $k$th category. In this plan, we generate domain labels from the most confident predictions and hope that they are mostly correct. As a result, the method only adopts the category with the highest confidence for domain adaptation, so this strategy heavily depends on the predicted outcomes of C.
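The one-hot hard-label extraction above can be sketched in NumPy (shapes simplified to pixels × classes):

```python
import numpy as np

def hard_labels(probs):
    # One-hot encode the most confident category per pixel (Figure 3b).
    n_classes = probs.shape[-1]
    return np.eye(n_classes)[probs.argmax(axis=-1)]

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
assert hard_labels(probs).tolist() == [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```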
Focusing on the adaptation problem of each category, we leverage the category-wise soft label as an alternative strategy (Figure 3c). Unlike the hard one, it utilizes the probability predictions of all the channels to implement category-level adaptation, denoted as:

$$sl^{(j,k)} = \frac{\exp\left(P^{(j,k)}\right)}{\sum_{k'} \exp\left(P^{(j,k')}\right)},$$

where $j \in (N = H \times W)$ denotes the pixel position and $P^{(j,k)}$ indicates the logit prediction of the $k$th category. In our proposed architecture, we simultaneously employ the two strategies to extract category-wise domain labels instead of a single one. For the source data, the hard process is selected, where we use the available ground truth as the domain label; this offers the highest confidence rather than the probability prediction of C. On the other hand, we adopt the soft label from the dense prediction per iteration to execute adversarial learning for the target domain, which in essence is a process of self-supervised learning.
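The soft label is then just the softmax of the logits over the category channel; a minimal NumPy sketch:

```python
import numpy as np

def soft_labels(logits):
    # Softmax over the category channel turns logits into per-class
    # probabilities used directly as the soft domain label (Figure 3c).
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.0]])
sl = soft_labels(logits)
assert np.isclose(sl.sum(), 1.0)   # a valid distribution per pixel
assert sl.argmax() == 0            # preserves the ranking of the logits
```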

Single-Level Adaptation Adversarial Framework
With generative adversarial learning, the domain adaptation flow is generally executed by alternately updating the segmentation network G and the discriminator D. To be specific, G is composed of the feature extractor F and the pixel-level classifier C, where G = (F → C). Our single-level adaptation framework, which focuses on the last feature stack output by F, also follows the above two procedures.
In our category-space constrained adaptation, we divide the output of D and the extracted domain labels into $k$ channels, aiming to encourage category-wise adversarial learning. This enables D to model more complex underlying structures between categories. During the training iterations, D is optimized to distinguish features across domains. The training objective can be written as:

$$L_d = -\sum_{i} \sum_{k} hl^{(i,k)} \log D\big(F(X_S)\big)^{(i,k)} - \sum_{j} \sum_{k} sl^{(j,k)} \log\Big(1 - D\big(F(X_T)\big)^{(j,k)}\Big),$$

where $hl^{(i,k)}$ and $sl^{(j,k)}$, respectively, indicate the label for source pixel $i$ and target pixel $j$.
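A NumPy sketch of this category-wise discriminator objective under the stated definitions (sigmoid outputs of D with shape pixels × classes; a simplified, unweighted form, not the paper's implementation):

```python
import numpy as np

def discriminator_loss(d_src, d_tgt, hl, sl, eps=1e-8):
    """Source outputs are pushed toward the hard labels hl; target
    outputs are pushed away from the soft labels sl."""
    src_term = -(hl * np.log(d_src + eps)).sum()
    tgt_term = -(sl * np.log(1.0 - d_tgt + eps)).sum()
    return src_term + tgt_term

hl = np.array([[1.0, 0.0]])
sl = np.array([[0.6, 0.4]])
# A discriminator that separates the domains incurs a lower loss.
good = discriminator_loss(np.array([[0.99, 0.5]]), np.array([[0.01, 0.01]]), hl, sl)
bad = discriminator_loss(np.array([[0.01, 0.5]]), np.array([[0.99, 0.99]]), hl, sl)
assert good < bad
```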
As an antagonistic procedure, G is trained with the segmentation loss $L_{seg}^S$ from the source domain and the adversarial loss $L_{adv}$ on the target space. This stage seeks to update F and C with D fixed. We begin by defining the cross-entropy loss $L_{seg}^S$ to enforce predictions close to the annotated ground truth:

$$L_{seg}^S = -\sum_{i} \sum_{k} Y_S^{(i,k)} \log P_S^{(i,k)},$$

where $Y_S$ and $P_S$ denote the ground truth and dense predictions for source samples.
Second, under the assumption that we do not diverge far from the target solution, the adversarial loss $L_{adv}$ encourages F to learn domain-invariant features by confusing D, which can be achieved as follows:

$$L_{adv} = -\sum_{j} \sum_{k} sl^{(j,k)} \log D\big(F(X_T)\big)^{(j,k)}.$$
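A matching sketch of this generator-side adversarial term (again simplified to pixels × classes arrays; illustrative only):

```python
import numpy as np

def adversarial_loss(d_tgt, sl, eps=1e-8):
    # The generator is rewarded when D scores target features as
    # source-like, weighted by the category-wise soft labels.
    return -(sl * np.log(d_tgt + eps)).sum()

sl = np.array([[0.7, 0.3]])
fooled = adversarial_loss(np.array([[0.95, 0.95]]), sl)
detected = adversarial_loss(np.array([[0.05, 0.05]]), sl)
assert fooled < detected   # lower loss when D is fooled
```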

Multi-Level Adaptation Adversarial Framework
Although high-level features contain rich semantic information, the low-level features equally carry critical contexts and spatial details, such as position relations, contour information, and small-scale objects, which are notably significant for coastal HRRS images. In the context of segmentation models, integrating multi-level features has demonstrated astonishing performance [42,43]. Motivated by these distinctive approaches, we embed additional domain adaptation modules in the low-level feature stacks to enhance adaptability at multiple spatial scales. The overall objective function can be extended from Equations (6) and (7):

$$L(X_S, X_T) = \lambda_{seg} L_{seg} + \sum_{i} \lambda_{adv}^i L_{adv}^i,$$

where $i$ denotes the level of the feature stacks, and $\lambda_{seg}$ and $\lambda_{adv}^i$ are weights to balance the losses. Our ultimate goal is to minimize the dense segmentation losses in G = (F → C) for both domains and maximize the probability of target features being considered source ones. The min-max flow follows the formulation:

$$\min_{F, C} \max_{D_i} L(X_S, X_T).$$
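The weighted multi-level combination can be sketched in plain Python; the level weights shown here are the ones reported later in the Training Details section:

```python
def total_loss(l_seg, l_adv_levels, lam_seg=1.0,
               lam_adv=(0.0001, 0.0002, 0.0005, 0.001)):
    # L = lam_seg * L_seg + sum_i lam_adv[i] * L_adv[i]
    return lam_seg * l_seg + sum(w * l for w, l in zip(lam_adv, l_adv_levels))

# With unit adversarial losses at all four levels:
assert abs(total_loss(2.0, [1.0, 1.0, 1.0, 1.0]) - 2.0018) < 1e-9
```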

Subdivided Modules
Our CsCANet is built on a fully convolutional architecture subdivided into a feature extractor, a pixel-level classifier, and four category-wise domain discriminators. It should be noted that the discriminators share a similar structure with different channel numbers. Below, we elaborate on their detailed compositional structures.
Feature extractor: According to our multi-level adaptation framework, the ResNet-101 [44] module pre-trained on ImageNet [45] is adopted as the backbone network to extract features at multiple scales. Following several advanced reports [3,30], we substitute dilated convolutional layers for the down-sampling layers in the last two residual blocks. This strategy keeps the output feature map at 1/8 the size of the input image, aiming to retain more spatial details without changing the scale of the pre-trained parameters.
Pixel-level classifier: Referring to the Deeplab system [10,11], we leverage the ASPP module as an efficient pixel-level classifier, which uses four convolutional layers with a kernel size of 3 × 3 and dilations of {6, 12, 18, 24} to form the network. This innovative structure expands the receptive field to capture long-range context. For this module, the weights and biases are initialized with the Xavier [46] method.
Domain discriminators: We implement the category-space constrained domain adaptation with category-wise discriminators. Each network consists of three convolutional layers with a kernel size of 3 × 3 and channel numbers of {128 × 2^i, 32 × 2^i, N_C}, where i = 1, 2, . . . , n denotes the level of the adversarial learning scheme and N_C gives the category number. Except for the last one, each convolutional layer is followed by a Leaky-ReLU [47] with a negative slope of 0.2. Besides, a bilinear up-sampling layer is used to reconstruct the resolution at the end of the discriminators. Similar to the classifier, we also use Xavier [46] to initialize the discriminators.
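The per-level channel configuration can be written as a small helper (an illustrative function, not from the paper's code):

```python
def discriminator_channels(level, n_classes):
    # Channel widths of the three 3x3 convolutions at adaptation
    # level i: {128 * 2^i, 32 * 2^i, N_C}.
    return [128 * 2**level, 32 * 2**level, n_classes]

assert discriminator_channels(1, 6) == [256, 64, 6]    # lowest level, 6 classes
assert discriminator_channels(4, 6) == [2048, 512, 6]  # highest level
```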

Training Details
In this section, our goal is to obtain a well-trained adaptive segmentation network for CLCM. Alternate adversarial training is driven by the objective function L, in which the segmentation loss $L_{seg}^S$ and the adversarial losses $L_{adv}^i$, respectively, serve the dense prediction and multi-level domain adaptation tasks. The proper scheduling of these two modules is crucial for network performance. Thus, a greater weight is given to $L_{seg}^S$, i.e., $\lambda_{seg} = 1$. For the multi-level $L_{adv}^i$, we employ smaller weights, i.e., $\lambda_{adv}^i = \{0.0001, 0.0002, 0.0005, 0.001\}$, since the low-level features carry less semantic information.
Furthermore, to train our proposed CsCANet, we find that performing segmented training with self-supervised learning is an effective strategy to accelerate the convergence of network parameters. During the first four-fifths of the iterations, we jointly train the source-based segmentation network and the domain discriminators to conduct adversarial learning in one stage. In detail, source images are first forwarded to optimize F and C with $L_{seg}^S$. The dense predictions $P_T$ are then generated from target images and fed to G and $D_i$ together with the source labels $Y_S$ for optimizing $L_{adv}^i$. For the rest of the iterations, self-supervised learning adopts the generated pseudo target labels $Y_t$ to fine-tune the segmentation network for the target domain with $L_{seg}^T$. Algorithm 1 gives the training process for our hybrid framework.

Algorithm 1. Training process for the hybrid framework.
Input: Source images $X_S$, source annotations $Y_S$, target images $X_T$, threshold T.
Initialized feature extractor F, pixel-level classifier C, and discriminators $D_i$.
Output: Well-trained F, C, and $D_i$ for adversarial learning; well-trained F and C for self-supervised learning.
for iter = 1 to max_iter do
    if iter ≤ 0.8 × max_iter then
        forward $X_S$, $Y_S$ to F and C; update F, C with $L_{seg}^S$
        forward $X_T$ to F and C to obtain $P_T$
        forward the source and target features to $D_i$; update F, C, $D_i$ with $L_{adv}^i$
    else
        generate pseudo labels $Y_t$ from $P_T$ with threshold T
        forward $X_T$, $Y_t$ to F and C; update F, C with $L_{seg}^T$
    end if
end for
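The two-phase schedule above can be sketched as follows; the phase names are illustrative:

```python
def training_schedule(total_iters):
    """Sketch of the segmented schedule: adversarial training for the
    first four fifths of the iterations, self-supervised fine-tuning
    of the segmentation network afterward."""
    switch = int(total_iters * 4 / 5)
    return ["adversarial" if it < switch else "self_supervised"
            for it in range(total_iters)]

phases = training_schedule(100_000)
assert phases[:80_000].count("adversarial") == 80_000
assert phases[80_000] == "self_supervised"
```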

Datasets Description
Two benchmark datasets, namely Shanghai and Zhejiang, are selected as the experimental data. As illustrated in Figure 4, the corresponding study areas are located in typical coastal regions, and both are characterized by multi-scale land cover categories with low inter-class variances. Their appearances reflect the unique geographical characteristics of the coastal zone. Obviously, data diversity exists in the spatial distributions of the Shanghai and Zhejiang datasets, where the former possesses more detailed information. Furthermore, due to the influences of seasonal factors and sensor modes, there are significant domain differences in spectral characteristics, which meets our experimental needs.
Shanghai dataset: This benchmark dataset is located in Fengxian District, Shanghai, where the adopted remotely sensed images were acquired in 2017 with a spatial resolution of 0.8 m/pixel. The original images cover approximately 46 square kilometers with a spatial extent of 11,776 × 6144 pixels and are composed of three bands: red (R), green (G), and blue (B). The images are further clipped into small patches with a size of 256 × 256 by employing a sliding window. As a result, there are 1104 images in the Shanghai dataset, where the ratio of the training set to the validation set is approximately 2:1, with 736 and 368 images, respectively.
Zhejiang dataset: This dataset is located in Xiaoshan District, Zhejiang Province, while the corresponding satellite images from the WorldView system were collected on 26 December 2016, with a spatial resolution of 0.5 m/pixel. It has been widely accepted that specific ground objects in remote sensing images have a constant scale range [48]. Thus, the images are resampled to a spatial resolution consistent with the Shanghai dataset. Besides, the same as the Shanghai dataset, the images only contain RGB channels and cover a region of approximately 61 square kilometers with a size of 12,800 × 7424 pixels. Similarly, the dataset contains 1450 images with a spatial extent of 256 × 256, and the numbers of training and validation images are 967 and 483, respectively.
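The reported patch counts follow from non-overlapping 256 × 256 tiling of the stated scene extents, which can be checked directly:

```python
def patch_count(width, height, patch=256):
    # Non-overlapping sliding-window tiling of the scene.
    return (width // patch) * (height // patch)

assert patch_count(11776, 6144) == 1104   # Shanghai dataset
assert patch_count(12800, 7424) == 1450   # Zhejiang dataset
```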
For both benchmark datasets, six land cover categories are defined and annotated at the pixel level (Figure 5). Specifically, the categories are cropland (Cropland), impervious surfaces (Imp. Surf.), water areas (Water), vegetative cover (Veg.), bare land (Bareland), and roads (Road). Table 1 gives the pixel statistics of each dataset. The land cover categories are heavily unbalanced, which poses a more significant challenge for dense segmentation. For example, the proportions of Cropland and Imp. Surf. are markedly larger than those of Bareland and Road.

Experimental Setting
To verify the feasibility and robustness of the proposed method, two independent but opposite experiments are implemented on the above datasets. The unpaired images from the source and target domains are randomly taken as the network inputs, and annotations are available only for the source data. In addition, preprocessing methods [49], i.e., mean subtraction and normalization, are applied to the training sets, adjusting the input images to accelerate the convergence of the weights and biases.
We implement the proposed CsCANet with the PyTorch toolbox [50]. All experiments are conducted on a machine with an Intel Core i7-9700K (eight cores), 16 GB of memory (RAM), an NVIDIA GeForce GTX 1080 GPU (8 GB), and an NVIDIA GeForce RTX 2080 GPU (8 GB). In the training procedure, we set 100 K iterations to obtain overall convergence with a batch size of 1. For training the segmentation network, SGD [51] is used as the optimizer, with momentum 0.9 and weight decay 0.0005. The learning rate is initialized to 2.5 × 10^−4 with a "poly" decay policy, multiplied by (1 − iter_step/total_step)^0.9. To train the discriminators, we leverage Adam [52] with momentum terms of 0.9 and 0.99. The initial learning rate is 10^−4 and decreases with the same policy as the segmentation network.
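The "poly" decay policy can be sketched as a small helper (the function name is illustrative):

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    # "poly" decay: lr = base_lr * (1 - step/total_steps)^power
    return base_lr * (1.0 - step / total_steps) ** power

assert poly_lr(2.5e-4, 0, 100_000) == 2.5e-4        # starts at the base rate
assert poly_lr(2.5e-4, 50_000, 100_000) < 2.5e-4    # decays monotonically
```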

Evaluation Metrics
We treat the CLCM with domain adaptation as a pixel-level, multi-category segmentation task, whose experimental results are generally evaluated via the generated confusion matrix. In it, TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives [53,54]. Based on these indexes, the following four metrics, i.e., per-class accuracy, overall accuracy, balanced F (F1) score, and intersection-over-union (IoU), are given to verify the effectiveness of our proposed CsCANet. Their detailed formulations are shown in Table 2, where C denotes the number of categories in both datasets. For all the metrics, a higher value indicates a better segmentation result.
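Assuming the standard definitions of these metrics (Table 2 gives the exact formulations), they can be derived from a C × C confusion matrix as in the following sketch; the function name is ours:

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute per-class accuracy, overall accuracy, F1, and IoU from a
    C x C confusion matrix (rows: reference labels, columns: predictions)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # true positives per class
    fp = cm.sum(axis=0) - tp               # false positives per class
    fn = cm.sum(axis=1) - tp               # false negatives per class
    pa = tp / cm.sum(axis=1)               # per-class accuracy
    oa = tp.sum() / cm.sum()               # overall accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # balanced F1 score
    iou = tp / (tp + fp + fn)              # intersection-over-union
    return pa, oa, f1, iou
```

The mean F1 (mF1) and mean IoU (mIoU) reported in the result tables correspond to the averages of `f1` and `iou` over all categories.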
Partial representative examples from the above converse tasks are illustrated in Figures 6 and 7. As a pioneering work, FCNs ITW undoubtedly yields the worst results, using only a primitive feature alignment in the final representation layer (Figures 6a and 7a). It is apparent from Figures 6c and 7c that AdaptSegNet achieves better performance than FCNs ITW by adapting the model in the output space. With pixel-level adaptation, the results of CyCADA and BDL are seriously affected by the early style transformation, which produces large misclassification areas, as shown in Figure 6b,f and Figure 7b,f. Even though CLAN and ADVENT further optimize the segmentation outcomes via category-level adaptation and entropy minimization, there are still shortcomings in recognizing ground objects with low inter-class variances (Figure 6d,e and Figure 7d,e). In addition, as shown in Figures 6h and 7h, FADA effectively solves the recognition issue of objects with similar characteristics. However, like the other methods mentioned above, it presents poor ability in classifying small-scale ground objects. As expected, the proposed method produces impressive segmentation results, as shown in Figures 6g and 7g. In practical terms, our CsCANet not only successfully recognizes ground objects with similar appearance but also performs well in adapting multi-scale features.
On the other hand, Tables 3 and 4 compare the evaluation metrics of our method with other competitive methods on both segmentation tasks under domain adaptation, including per-class accuracy (PA), overall accuracy (OA), mean F1 score (mF1), and mean IoU (mIoU). The comparison results demonstrate that our proposed CsCANet achieves excellent performance against the other algorithms. CsCANet acquires the highest OA, mF1, and mIoU of 80.48%, 71.56%, and 57.69% in Shanghai → Zhejiang, while the corresponding values of 70.55%, 65.33%, and 49.64% are also the best ones in Zhejiang → Shanghai. Compared with the well-known category-level CLAN, CsCANet achieves 7.35% and 4.55% improvements in mIoU, which verifies the advance of our category-level adversarial framework. Furthermore, CsCANet still performs excellently in per-class accuracy. For instance, it provides increases of 12.16% and 17.80% over the fine-grained FADA in Bareland and Road within the task Shanghai → Zhejiang.

Discussion
While deep neural networks have driven the progress of land cover mapping, their performance fundamentally relies on the network architecture and optimization strategies. In this section, we conduct several ablation studies and effectiveness analyses of these two primary factors. Note that all the comparative experiments are carried out on the CLCM task of Shanghai → Zhejiang.

Design of Domain Adaptation Framework
Our anticipated goal is to develop a hybrid framework for land cover mapping across image domains. Considering the complex characteristics of coastal ground objects, we propose a multi-level adaptation framework to adapt semantic features at different scales. Several comparative studies are executed to verify the effectiveness of our multi-level adversarial scheme. For all the methods, we use the pre-trained Deeplab v2 [9] as the baseline network.
As shown in Table 5, the domain adaptation approach achieves significant improvement in cross-domain land cover mapping. Compared with the baseline without any adaptation operation, our single-level CsCANet increases the mIoU by 11.76%. Additionally, the framework with multi-level generalization yields significant results of up to 77.52%, 69.25%, and 54.78% in terms of OA, mF1, and mIoU. On the other hand, Figure 8 visualizes partial representative outcomes. There is no doubt that the baseline method performs poorly because of the domain divergence between the datasets. In addition, multi-level CsCANet illustrates better recognition ability for ground objects with low intra-class variance and small scale compared to the single-level approach. The aforementioned contrastive studies strongly prove the feasibility and effectiveness of the domain adaptation method in coastal land cover mapping, especially our multi-level adaptation framework.

Design of Domain Labels Extraction Module
A conventional discriminator always employs a global adversarial loss to implement feature alignment via a binary domain label. Paying attention to the underlying category space in the target domain, we introduced a category-wise discriminator with two summarized modules to extract domain labels. Moreover, a mixed strategy was leveraged in our method, where the category-wise hard and soft labels were, respectively, applied to the source and target domains. In this subsection, we conduct ablation experiments to prove the superiority of our modules. Table 6 gives the comparison results on different domain label extraction modules.
The single strategy with category-wise hard or soft labels, respectively, achieves an increased mIoU of 2.83% and 2.65% compared to the naive one with a binary label. Notably, our data-based mixed approach further improves network performance while achieving the highest values of 77.52%, 69.25%, and 54.78% in the evaluation metrics. Meanwhile, Figure 9 offers specific references in per-class accuracy as additional verification. As can be seen, the mixed method presents better results in most categories, such as Bareland, Road, Cropland, etc. In summary, extracting domain labels with modules matched to the data structure leads to an outstanding improvement in network performance.
Figure 9. Per-class accuracies for land cover mapping. BL, HL, and SL, respectively, denote the binary label, category-wise hard label, and soft label. For all the categories, each method with a specific extraction module for domain labels corresponds to a broken line with a particular color.
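The two label-extraction strategies can be illustrated with the following hedged sketch; the function names and tensor layouts are our assumptions for illustration, not the exact implementation. The source domain derives category-wise hard labels from its ground-truth map, while the unlabeled target domain uses the softmax prediction as a soft label:

```python
import torch
import torch.nn.functional as F

def source_hard_labels(gt_mask, num_classes):
    """Category-wise hard domain labels from source annotations:
    one-hot encode a (H, W) ground-truth map into a (C, H, W) tensor."""
    return F.one_hot(gt_mask, num_classes).permute(2, 0, 1).float()

def target_soft_labels(logits):
    """Category-wise soft domain labels for the unlabeled target domain:
    softmax probabilities of the (C, H, W) segmentation output."""
    return torch.softmax(logits, dim=0)
```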

Effectiveness Analysis of Improvement Strategy
To further optimize the segmentation results, we employed the self-supervised learning approach as an improvement strategy to perform segmented training. Therefore, two comparative experiments under contrary settings are carried out to analyze its effectiveness. They are based on our multi-level adaptation framework and mixed approach to domain label extraction.
The comparison results from Table 7 confirm that self-supervised learning leads to a remarkable improvement, where the values of OA, mF1, and mIoU increase by 2.96%, 2.31%, and 2.91%. We also illustrate the specific variation of IoU for each category, as shown in Figure 10. Compared with the original one, it is apparent that CsCANet with self-supervised learning presents further improvements in all the object categories, especially Imp. Surf., Road, and Bareland. This ample evidence indicates the effectiveness of our improvement strategy with self-supervised learning.
Figure 10. Per-class intersection-over-union (IoU) for land cover mapping. For all the categories, each method corresponds to a histogram with a particular color.
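The self-supervised improvement strategy can be sketched as confidence-thresholded pseudo-labeling; the threshold value and ignore index below are illustrative assumptions, not the settings used in our experiments:

```python
import torch

def pseudo_labels(logits, threshold=0.9, ignore_index=255):
    """Generate pseudo-labels from target predictions for segmented
    retraining: keep the arg-max class where the confidence exceeds the
    threshold, and mark the remaining pixels as ignored."""
    prob = torch.softmax(logits, dim=0)      # (C, H, W) class probabilities
    conf, label = prob.max(dim=0)            # per-pixel confidence and class
    label[conf < threshold] = ignore_index   # drop uncertain pixels
    return label
```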

Conclusions
This paper proposes a novel category-level adaptive method to address the cross-domain CLCM with HRRS images. We take the adversarial framework with a category-wise discriminator as an alternative to the conventional one, then generalize it to multiple levels. Several state-of-the-art models are employed and compared to verify the superiority of the proposed method. The experimental results demonstrate that our approach successfully learns the transformed features and executes the domain adaptation procedure. In addition, the multi-level adversarial scheme is confirmed to be efficient in recognizing ground objects with low intra-class variances and intricate spatial details, ultimately achieving optimal performance in adaptive pixel-level segmentation. Furthermore, extensive ablation studies strongly confirm the effectiveness of our network architecture and improvement strategy. Nevertheless, our method merely takes the dense annotations from the source domain as supervised guidance. In future research, we will focus on other effective guidance information, such as semantic context and super-pixels, to further improve the network performance by implementing additional constraints.

Data Availability Statement:
The data presented in this work are available on request from the corresponding author. The data are not publicly available due to other ongoing studies.