Article

SCDA: A Style and Content Domain Adaptive Semantic Segmentation Method for Remote Sensing Images

1 College of Computer Science, South-Central Minzu University, Wuhan 430074, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100080, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4668; https://doi.org/10.3390/rs15194668
Submission received: 4 August 2023 / Revised: 15 September 2023 / Accepted: 20 September 2023 / Published: 23 September 2023
(This article belongs to the Special Issue Advances in Deep Fusion of Multi-Source Remote Sensing Images)

Abstract

Due to the differences in imaging methods and acquisition areas, remote sensing datasets can exhibit significant variations in both image style and content. In addition, ground objects can differ greatly in scale even within the same remote sensing image. These differences should be considered in remote sensing image segmentation tasks. Inspired by the recently developed domain generalization model WildNet, we propose a domain adaptation framework named “Style and Content Domain Adaptation” (SCDA) for semantic segmentation tasks involving multiple remote sensing datasets with different data distributions. SCDA uses residual style feature transfer (RSFT) in the shallow layers of the baseline network model to enable source domain images to obtain style features from the target domain and to reduce the loss of source domain content information. Considering the scale differences between ground objects in remote sensing images, SCDA uses the projections of the source domain images, the style-transferred source domain images, and the target domain images to construct a multiscale content adaptation learning (MCAL) loss. This enables the model to capture multiscale target domain content information. Experiments show that the proposed method has obvious domain adaptability in remote sensing image segmentation. When performing cross-domain segmentation tasks from Vaihingen IRRG to Potsdam IRRG, the mIoU is 48.64% and the F1 is 63.11%, marking improvements of 1.21% and 0.45%, respectively, over state-of-the-art methods. When performing cross-domain segmentation tasks from Vaihingen IRRG to Potsdam RGB, the mIoU is 44.38%, an improvement of 0.77% over the most advanced methods. In summary, SCDA improves the semantic segmentation of remote sensing images through domain adaptation of both style and content. It makes full use of several innovative modules and strategies to enhance the performance and stability of the model.

Graphical Abstract

1. Introduction

In recent years, computer technology, modern communication technology, and automatic interpretation methods for high-resolution remote sensing images (HRIs) have been increasingly applied to monitor changes on Earth’s surface [1]. Among these applications, target detection technology [2] can support national defense security through ship and vehicle detection [3]; change detection technology can be used to monitor forest area changes and building changes [4]; band selection technology can monitor specific targets [5]; and 3D reconstruction technology can reconstruct high-precision ground objects and their rich details [6]. Semantic segmentation of remote sensing images aims to accurately identify each pixel, which has great application value and potential. HRI segmentation approaches can be used for land cover classification, road extraction, building extraction, farmland classification, and vegetation identification, and they play a vital role in the fields of urban planning, environmental monitoring, transportation planning, disaster assessment, and land management [7,8,9,10].
In recent years, segmentation models based on convolutional neural networks (CNNs) have achieved remarkable progress in image semantic segmentation tasks [11]. The superior performance of CNN models depends heavily on a large amount of labeled training data. However, for semantic segmentation of remote sensing images, this prerequisite is often difficult to meet. HRIs typically have more complex structures than natural images, and the annotation procedure frequently requires a great deal of domain expert knowledge [12]. Cordts et al. [13] reported that pixel-level annotation of a city overhead optical image takes nearly 90 min. Considering the complexity and high time cost of pixel-level annotation for HRIs, it is necessary to investigate how to segment unlabeled target domain images with a segmentation model trained on labeled source domain samples. Since HRIs are often obtained by different sensors or from different geographical locations, the data distributions of different datasets usually differ considerably. Such variation can lead to a significant performance decline when a model trained on a source domain dataset is deployed on a target domain dataset. As shown in Figure 1, there are significant differences between the HRIs of different datasets. HRIs captured from different regions by different sensors not only have obvious differences in content but also show significant differences in style (color, texture). In addition, there are scale differences between different object types within HRIs. Therefore, it is necessary to reduce the domain gap in both the style and the content of HRIs while also accounting for the scale variations between different ground objects.
Unsupervised Domain Adaptation (UDA) [14,15,16,17,18,19] is a widely used technique for addressing differences in data distribution: it trains a model using only labeled source domain samples and unlabeled target domain samples in order to reduce the data discrepancy between domains. According to Xu et al. [14], current UDA methods can be divided into four categories, namely generative methods [19], adversarial methods [15,16], self-training methods [20], and hybrid training methods. Our research focuses on the first two categories. Generative methods focus on differences that alter the appearance of an image, such as color, contours, and texture. Inspired by the image-to-image translation capability of Generative Adversarial Networks (GANs), Bousmalis et al. [19] use CycleGAN to generate fake images that approximate the style of the target domain. The goal of these methods [17,18,19] is to generate images that resemble the appearance of the target domain, so that high-performance models can be trained with images and labels from the source domain while remaining consistent with the target domain. The goal of adversarial methods is to extract information at the feature level, the pixel level, or both and to learn domain-invariant features through domain discriminators. Ref. [21] tries to adjust the feature space distribution through a domain classifier. Long et al. [22] propose a framework that makes deep features more transferable through low-density separation of unlabeled target domain data in the deep feature space and further reduces domain variance by enhancing the statistical power of kernel embedding matching. Most of these classic UDA methods have primarily focused on classification tasks. However, more and more techniques have also been developed for other, more specific vision tasks, including the cross-domain semantic segmentation of HRIs.
Currently, in the field of domain adaptive semantic segmentation of HRIs, several methods [23,24,25,26,27,28] have designed cross-domain adaptation frameworks for dynamically changing imaging modes or geographic locations. Among these methods, GAN-based image style conversion techniques are commonly employed to adapt the network to the target domain. For example, the study in [28] employed DualGAN [29] to translate source domain images into the target domain, for the first time in HRI cross-domain semantic segmentation. Zhao et al. [23] propose a geometric consistency (GC) constraint that can be embedded into image conversion. Zhao et al. [30] also propose an architecture based on DualGAN, which uses an adjustment module inside the network to handle the scale difference between HRI datasets and uses residual connections to enhance the stability of the converted images. Additionally, some studies focus on leveraging domain discriminators and adversarial training at the feature and category levels to achieve domain adaptation. Liu et al. [26] develop a full-space domain adaptation framework to reduce domain differences in feature distributions through adversarial learning in image space, feature space, and output space. Chen et al. [31] propose a region and category adaptive domain discriminator (RCA-DD), aiming to emphasize the differences between regions and categories during the alignment process. There are also studies combining the above two approaches. For example, Cai et al. [25] propose an unsupervised domain adaptation method based on bidirectional image translation to exploit the advantages of both domains and overcome the performance issues caused by unidirectional domain adaptation. V2RNet [24] unifies the semantic structure of the source domain with the image style of the target domain by adding a semantic discriminator to the style transfer network SegGAN. Although these works [23,24,25,26,27,28] have achieved excellent results on the target domain, differences in image style and differences in image content often coexist in HRI datasets. None of the above studies treat HRIs as a combination of style and content or exploit both aspects to reduce domain differences. In addition, the disturbance of source content caused by style changes and the scale differences between feature categories in HRI datasets also need to be considered.
Therefore, we build on the architecture of WildNet [32] in our research. WildNet is a domain generalization method based on style transfer and contrastive learning that simultaneously adapts to differences in the style and content of datasets. However, WildNet [32] has only been studied on ordinary optical images and has not been applied to remote sensing HRIs. In addition, WildNet does not take into account the scale differences between different object types in HRIs, such as the size differences between the car and road categories. In our work, we adopt the foundational structure of WildNet to address the issue of the network becoming overly specialized to the style and content of a single source domain. At the same time, we improve and optimize this architecture according to the characteristics of remote sensing data. Firstly, considering that the source content is distorted when the target domain style is transferred to the source domain image in WildNet, we propose an improved style transfer module, residual style feature transfer (RSFT), which uses residual connections to enhance the stability of the converted content features and prevent the loss of source domain content. Then, to address the problem of scale differences between different object categories within the same HRI dataset, this study constructs a multiscale content adaptation learning (MCAL) loss based on the content features of the network to train the model to learn content information at multiple scales in the target domain. We refer to this DA framework for remote sensing image segmentation tasks as Style and Content Domain Adaptation (SCDA). The main contributions of this paper are as follows:
(1)
This paper draws on the multiloss function architecture of WildNet to realize domain adaptation with regard to both the style and the content of different images. This domain adaptation technology is then introduced into semantic segmentation tasks for remote sensing images.
(2)
In the improved model based on WildNet, a residual structure is added to the style transfer module, aiming to fully learn the style information of the target domain without losing the content information of the source domain data. Meanwhile, a multiscale mechanism is added at the output end of the model to learn content information at different scales in the target domain.
(3)
In summary, we propose a novel end-to-end domain adaptive semantic segmentation framework, SCDA, for remote sensing images. The proposed method achieves state-of-the-art performance on cross-domain semantic segmentation tasks between two open-source datasets, Vaihingen and Potsdam. When performing cross-domain segmentation tasks from Vaihingen to Potsdam, the mIoU values are 48.64% and 44.38%, improvements of 1.21% and 0.77%, respectively, compared with the most advanced methods.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the theory of our proposed framework. Section 4 introduces experimental settings and analyses the experimental results. Finally, conclusions are provided in Section 5.

2. Related Work

In Section 2.1, we introduce some basic knowledge about style transfer and contrastive learning, as well as domain adaptation methods based on style transfer and contrastive learning. In Section 2.2, as the basis of our approach, we introduce the main ideas of WildNet [32].

2.1. Domain Adaptation Techniques Based on Style Transfer and Contrastive Learning

Style transfer aims to transform an image into the style of another image while keeping the original content intact. The application of style transfer methods based on adaptive instance normalization (AdaIN) [18] to domain adaptation has attracted the attention of many researchers. For example, Marsden et al. [33] proposed a style transfer method based on AdaIN, which uses class-wise domain change information and obtains class-specific target moments via pseudolabeling to convert the style of labeled source images into the style of the target domain. Figure 2 shows the style transfer network using AdaIN. The AdaIN [18] module receives two types of input, a content input x and a style input y, and the channel-wise mean and standard deviation of x are matched to those of y to ensure style similarity. In other words, AdaIN achieves style transfer by changing the data distribution at the feature level, so the computational and storage overheads are small and it is easy to implement. Its main idea can be defined as
\mathrm{AdaIN}(x, y) = \sigma(y) \frac{x - \mu(x)}{\sigma(x)} + \mu(y)
where x and y denote the content input and the style input, respectively, μ(x) and σ(x) are the channel-wise mean and standard deviation of x, and μ(y) and σ(y) are the channel-wise mean and standard deviation of y.
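As a concrete illustration, Equation (1) can be written as a few lines of PyTorch. This is a minimal sketch rather than the implementation used by any of the cited methods; the (N, C, H, W) tensor layout and the small epsilon for numerical stability are our assumptions.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization (Equation (1)): match the channel-wise
    mean/std of the content features x to those of the style features y.
    Both inputs are assumed to have shape (N, C, H, W)."""
    mu_x = content.mean(dim=(2, 3), keepdim=True)
    std_x = content.std(dim=(2, 3), keepdim=True) + eps   # avoid division by zero
    mu_y = style.mean(dim=(2, 3), keepdim=True)
    std_y = style.std(dim=(2, 3), keepdim=True)
    return std_y * (content - mu_x) / std_x + mu_y
```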
Contrastive learning [34] is a learning strategy that aims to minimize the distance between positive samples in the embedding space while maximizing the distance to negative samples. At present, several works have proposed domain adaptation frameworks based on contrastive learning. For example, Wang et al. [35] developed a framework that accomplishes domain alignment by making the distance between target domain anchor images and cross-domain samples of the same category smaller than the distance to cross-domain samples of different categories. Among contrastive learning frameworks, SimCLR [38] is an exceptional framework for unsupervised contrastive learning. It uses differently augmented versions of the same image as positive samples, while treating distinct images as negative samples. This encourages the positive samples to cluster closely in the embedding space while keeping the negative samples apart. Therefore, a SimCLR-trained network can spontaneously cluster similar samples. The principle of SimCLR is shown in Figure 3.
Its main idea can be defined as
\mathcal{L}_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}
where sim(μ, ν) = μᵀν / (‖μ‖ ‖ν‖) denotes the dot product between the L2-normalized μ and ν, (i, j) denotes a positive pair of examples, 1[k ≠ i] ∈ {0, 1} is an indicator function that evaluates to 1 if k ≠ i, and τ denotes a temperature parameter.
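A compact PyTorch sketch of the NT-Xent loss in Equation (2) is given below; it assumes a batch of N positive pairs (two augmented views per image) and is intended only to make the formula concrete, not to reproduce the original SimCLR code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """NT-Xent loss of Equation (2); z1[i] and z2[i] are embeddings of two
    augmented views of the same image, each of shape (N, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), L2-normalized
    sim = z @ z.t() / tau                                 # cosine similarities sim(z_i, z_k)
    n = z1.size(0)
    # exclude self-similarity (k = i) from the denominator
    sim.fill_diagonal_(float('-inf'))
    # the positive of anchor i is its other view at index i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```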

2.2. WildNet: A Domain Generalization Model Based on Multiple Loss Functions

The main idea of WildNet is to enable the network to learn domain-general semantic information from actual situations. In general, WildNet contains four modules: Feature Stylization (FS), Style Extension Learning (SEL), Semantic Consistency Regularization (SCR), and Content Extension Learning (CEL). The first three modules together can realize a style adaptive segmentation, and CEL is a technique based on contrastive learning [36,37] which enables content adaptive segmentation. The architecture of WildNet is shown in Figure 4.
In WildNet, the FS method adds an AdaIN [18] layer to the shallow structure of the encoder to adjust the shallow features so that they acquire the style information of the unknown domain. Referring to Equation (1), FS is defined as follows:
z_l^{sw} = \sigma(z_l^w) \frac{z_l^s - \mu(z_l^s)}{\sigma(z_l^s)} + \mu(z_l^w)
where ϕ_l denotes the l-th layer of the network, and z_l represents the output features of the l-th layer. In the l-th layer, z_l^s represents the source domain style features, and z_l^w represents the wild domain style features. The source domain features stylized with the wild domain style, denoted as z_l^{sw}, are generated by transferring the wild domain style features z_l^w to the source domain features z_l^s. μ(z_l) and σ(z_l) are the channel-wise mean and standard deviation of the feature z_l, respectively.
CEL is based on the contrastive learning strategy [36,38,39] and consists of two loss functions, namely the pixel-wise source content extension loss L_SCE and the pixel-wise wild content extension loss L_WCE. L_SCE aims to ensure that source domain content features are not lost due to style transfer, while L_WCE trains the network to cluster content containing the same semantic information and to distinguish it from content containing other semantic information. Both are derived from Equation (2) and defined as follows:
\mathcal{L}_{SCE} = -\frac{1}{N_z} \sum_{i=1}^{N_z} \log \frac{\exp\left(z_i^{s} \cdot z_i^{sw} / \tau\right)}{\exp\left(z_i^{s} \cdot z_i^{sw} / \tau\right) + \sum_{j=1}^{N_z} \mathbb{1}_{i \neq j}^{s} \exp\left(z_i^{s} \cdot z_j^{sw} / \tau\right)}

\mathcal{L}_{WCE} = -\frac{1}{N_z} \sum_{i=1}^{N_z} \log \frac{\exp\left(z_i^{s} \cdot z_k^{w} / \tau\right)}{\exp\left(z_i^{s} \cdot z_k^{w} / \tau\right) + \sum_{j=1}^{N_z} \mathbb{1}_{i \neq j}^{s} \exp\left(z_i^{s} \cdot z_j^{w} / \tau\right)}
where z_i^s is the projected content feature of the i-th pixel of the source image, and z_i^{sw} is the projected content feature of the i-th pixel of the wild-domain-stylized source image. z_k^w is a wild content feature selected from the wild content feature dictionary Q ∈ R^{C_q × N_q} that is similar to the wild-domain-stylized source content feature. 1_{i≠j}^s is an indicator: it is 0 when the two features correspond to the same pixel (j = i) and 1 otherwise. N_z represents the number of pixels, and τ is the temperature parameter.
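The wild content dictionary Q mentioned above can be maintained as a simple first-in-first-out feature queue. The sketch below is our own assumption of one common way to implement such a queue (similar to momentum-contrast style dictionaries); WildNet's actual bookkeeping may differ. Here Q is stored as (N_q, C_q), i.e., the transpose of the notation above.

```python
import torch
import torch.nn.functional as F

class ContentDictionary:
    """FIFO queue holding N_q projected content features of dimension C_q."""
    def __init__(self, dim: int, capacity: int):
        self.queue = F.normalize(torch.randn(capacity, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, feats: torch.Tensor) -> None:
        """Insert a batch of (M, C_q) features, overwriting the oldest entries."""
        m = feats.size(0)
        idx = (self.ptr + torch.arange(m)) % self.queue.size(0)
        self.queue[idx] = F.normalize(feats, dim=1)
        self.ptr = int((self.ptr + m) % self.queue.size(0))
```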
However, this approach was primarily investigated and adapted for conventional optical images and did not consider the unique characteristics of High-Resolution Images (HRIs) in remote sensing, such as the significant scale difference between categories of ground objects. In order to adapt the multiloss function architecture of WildNet to domain adaptive semantic segmentation of remote sensing images, we propose a learning method that uses a multiscale contrast mechanism to focus on both larger and smaller regions simultaneously.

3. Materials and Methods

In this section, we describe the SCDA method in detail. SCDA is an improved domain adaptive semantic segmentation framework adapted to the data characteristics of HRIs. Figure 5 shows the principle of the proposed SCDA method.

3.1. Preliminaries

In Figure 5, x^s ∈ R^{H_s × W_s × C_s} denotes the source domain image, and x^t ∈ R^{H_t × W_t × C_t} denotes the target domain image, where H, W, and C denote the height, width, and number of channels of the image, respectively. y^s ∈ Z^{H_s × W_s × K} denotes the label of the source domain image x^s, where K denotes the number of categories. To reduce the impact of scale factors [30], the image sizes should satisfy the following condition:

\frac{H_s}{H_t} = \frac{W_s}{W_t}
ϕ denotes the semantic segmentation model, which mainly consists of two parts: the feature extractor (ϕ_feat) and the classifier (ϕ_cls). The basic semantic segmentation model used in this paper is DeepLab v3 [40], and the encoder structure is ResNet-50 [41]. The model is trained using a cross-entropy loss function:
\mathcal{L}_{ori} = -\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{k=1}^{K} y_{hwk}^{s} \log(p^{s})

p^{s} = \phi_{cls}(\phi_{feat}(x^{s}))

\phi_{cls}(z_i) = \mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{K} e^{z_k}}
where y_{hwk}^s is a one-hot vector representing the pixel class label of the source domain image, p^s is the prediction of the network, and ϕ_cls applies the Softmax function. On top of the above segmentation loss, which uses source domain samples for model training, we apply the RSFT, MCAL, SEL, and SCR methods during training to achieve style adaptation and content adaptation. SEL and SCR are the native modules of the WildNet architecture, while RSFT and MCAL are our improved versions of FS and CEL. In each training iteration, a pair of images, x^s and x^t, is randomly selected from the labeled source domain dataset D^s = {x^s, y^s} and the unlabeled target domain dataset D^t = {x^t} as the input to the model. In the feature extractor ϕ_feat, the source domain image x^s obtains the style features of the target domain image x^t through the RSFT process. The classifier ϕ_cls provides the prediction results for both the source domain image and the style-transferred source domain image. The projector ϕ_proj outputs the content embedding features of the source domain image, the style-transferred source domain image, and the target domain image, respectively. The model learns the similar features in the source and target domains using the SEL, SCR, and MCAL methods. After training, the model is evaluated on the validation set of the unseen target domain T.
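The supervised term of Equations (7)-(9) corresponds to a standard cross-entropy step on source images. The sketch below uses torchvision's deeplabv3_resnet50 as a stand-in for the DeepLab v3 / ResNet-50 baseline; the class count K = 6 and the label layout are assumptions, and the domain adaptation losses (SEL, SCR, MCAL) are not included here.

```python
import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet50

K = 6                                       # number of ground-object categories
model = deeplabv3_resnet50(num_classes=K)   # stand-in for phi_feat + phi_cls

def supervised_step(x_s: torch.Tensor, y_s: torch.Tensor) -> torch.Tensor:
    """Source-domain cross-entropy loss of Equation (7).
    x_s: (N, 3, H, W) images; y_s: (N, H, W) integer class labels."""
    logits = model(x_s)['out']              # (N, K, H, W) per-pixel scores
    # cross_entropy applies the softmax classifier of Equation (9) internally
    return F.cross_entropy(logits, y_s)
```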

3.2. Residual Style Feature Transfer

The studies in [42,43] suggest that overfitting to the source domain is often caused by the network overlearning the limited styles of the source domain. Moreover, variations in image style can affect semantic segmentation results: the performance of the base network model decreases significantly when the style of the source domain images changes after training. Zhao et al. [30] use a GAN with a residual structure for image translation so that the image is transformed into the target domain style without losing source domain information. Building on AdaIN [18], we add a simple residual structure to the FS module in WildNet to reduce the loss of source domain content information. The structure of RSFT is shown in Figure 6. Therefore, referring to Equation (3), the source domain features after target domain stylization, z_l^{st}, can be defined as
z_l^{st} = \sigma(z_l^t) \frac{z_l^s - \mu(z_l^s)}{\sigma(z_l^s)} + \mu(z_l^t) + k \times z_l^s
where k is a hyperparameter, ϕ_l denotes the l-th layer of the network, and z_l represents the output features of the l-th layer. In the l-th layer, z_l^s represents the source domain style features, and z_l^t represents the target domain style features. The source domain features stylized with the target domain style, denoted as z_l^{st}, are generated by transferring the target domain style features z_l^t to the source domain features z_l^s. μ(z_l) and σ(z_l) are the channel-wise mean and standard deviation of the feature z_l, respectively.
Following Equation (10), the distribution of z_l^s is renormalized with the channel-wise statistics of z_l^t. The target-stylized feature z_l^{st} is input into layer l+1, and z_{l+1}^{st} = ϕ_{l+1}(z_l^{st}) is output from that layer. z_{l+1}^{st} can be repeatedly swapped into the style of z_{l+1}^t as

z_{l+1}^{st} := \sigma(z_{l+1}^t) \frac{z_{l+1}^{st} - \mu(z_{l+1}^{st})}{\sigma(z_{l+1}^{st})} + \mu(z_{l+1}^t) + k \times z_{l}^{st}
Following this equation, the distribution of z_l^s is renormalized with the channel-wise statistics of z_l^t, and the source domain image acquires the target domain style at multiple levels within the network. However, as the network deepens, the output features are increasingly biased towards semantic information. Therefore, in this paper, RSFT is only applied to the shallow layers of the network.
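A minimal sketch of Equations (10) and (11) is shown below, reusing the adain helper sketched in Section 2.1. The default k, the layer list, and the number of stylized layers are illustrative assumptions only; the ablation in Section 4.6 examines these choices.

```python
def rsft(z_s: torch.Tensor, z_t: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Residual style feature transfer (Equation (10)): AdaIN-style statistic swap
    plus a scaled residual of the input source features to limit content loss."""
    return adain(z_s, z_t) + k * z_s

def encode_with_rsft(layers, x_s, x_t, k: float = 1.0, stylized_layers: int = 3):
    """Run source and target images through the encoder stages in `layers`,
    re-stylizing the source stream with target statistics in the shallow stages
    only (Equation (11)). Returns source, target, and stylized-source features."""
    z_s, z_t, z_st = x_s, x_t, x_s
    for idx, layer in enumerate(layers):
        z_s, z_t, z_st = layer(z_s), layer(z_t), layer(z_st)
        if idx < stylized_layers:            # RSFT only in the shallow layers
            z_st = rsft(z_st, z_t, k)
    return z_s, z_t, z_st
```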

3.3. Multiscale Content Adaptation Learning

Equations (4) and (5) address the problem that the network learns only limited source content by increasing the content diversity within the target domain classes in the embedding space. However, remote sensing images commonly exhibit large differences in scale between object categories. The above equations can only capture content information at a fixed scale and cannot attend to both small-scale and large-scale object categories at the same time. Therefore, we build on the above method and propose MCAL to address the large scale differences between ground object categories. The structure of MCAL is shown in Figure 7. The specific operation is as follows: a projector ϕ_proj is added in parallel with the classifier ϕ_cls after the feature extractor ϕ_feat. This projector maps the pixel-level features extracted by ϕ_feat into the embedding space. That is, after the source domain image x^s and the target domain image x^t are input into the feature extractor ϕ_feat, the output source domain feature z^s, target domain feature z^t, and target-domain-stylized source feature z^{st} are fed into the projector ϕ_proj, which outputs the source domain projected content feature z_proj^s, the target domain projected content feature z_proj^t, and the target-domain-stylized source projected content feature z_proj^{st}. It should be noted that the projected content feature z_i^s of the i-th pixel of the source domain image and the projected content feature z_i^{st} of the i-th pixel of the target-domain-stylized source image should carry the same semantic information.
In order to obtain common domain content information, the network should be able to cluster projected features containing the same content information and distinguish them from projected features containing other content information. Inspired by Equation (4), we define the source domain content adaptation loss L_SMCAL as follows:
\mathcal{L}_{SMCAL} = -\frac{\alpha}{N_l} \sum_{i=1}^{N_l} \log \frac{\exp\left(z_i^{s} \cdot z_i^{st} / \tau\right)}{\exp\left(z_i^{s} \cdot z_i^{st} / \tau\right) + \sum_{j=1}^{N_l} \mathbb{1}_{i \neq j}^{s} \exp\left(z_i^{s} \cdot z_j^{st} / \tau\right)} - \frac{1-\alpha}{N_s} \sum_{j=1}^{N_s} \log \frac{\exp\left(z_j^{s} \cdot z_j^{st} / \tau\right)}{\exp\left(z_j^{s} \cdot z_j^{st} / \tau\right) + \sum_{i=1}^{N_s} \mathbb{1}_{j \neq i}^{s} \exp\left(z_j^{s} \cdot z_i^{st} / \tau\right)}
where 1_{i≠j}^s is an indicator: it is 0 when z_j^{st} and z_i^s correspond to the same pixel and 1 otherwise, and analogously when i and j are exchanged. N_l represents the number of large-scale pixels, N_s represents the number of small-scale pixels, τ is the temperature parameter, and α ∈ [0, 1] is a hyperparameter. Equation (12) encourages the multiscale features z_i^s and z_i^{st} to move closer in the embedding space by reducing the distance between the source domain features and the target-domain-stylized source domain features while keeping them away from the other negative sample contents.
Based on WildNet, in each training iteration the projected content features of the target domain z_proj^t are filtered at multiple scales and stored in a dictionary Q ∈ R^{C_q × N_q}. Here, C_q is the channel number of the target domain projected content features, and N_q is the capacity of the dictionary. The dictionary Q stores content information at multiple scales, which may not exist in the source domain. By learning the effective target domain information in the dictionary, the network's adaptability to the content of the target domain can be enhanced. If domain-general semantic information can be filtered out of the dictionary Q, the network can adapt better to the target domain. In this work, z_k^t is the dictionary feature closest to the target-domain-stylized source feature z_i^{st} in the embedding space:
z_k^t = \arg\min_{q \in Q} \left\| z_i^{st} - q \right\|_2, \qquad z_k^t = \arg\min_{q \in Q} \left\| z_j^{st} - q \right\|_2
By filtering, via Equation (5), the target domain content information that is similar to the source domain content information, we can define the pixel-level content adaptation loss on the target domain, L_TMCAL, as follows:
\mathcal{L}_{TMCAL} = -\frac{\alpha}{N_l} \sum_{i=1}^{N_l} \log \frac{\exp\left(z_i^{s} \cdot z_k^{t} / \tau\right)}{\exp\left(z_i^{s} \cdot z_k^{t} / \tau\right) + \sum_{j=1}^{N_l} \mathbb{1}_{i \neq j}^{s} \exp\left(z_i^{s} \cdot z_j^{t} / \tau\right)} - \frac{1-\alpha}{N_s} \sum_{i=1}^{N_s} \log \frac{\exp\left(z_i^{s} \cdot z_k^{t} / \tau\right)}{\exp\left(z_i^{s} \cdot z_k^{t} / \tau\right) + \sum_{j=1}^{N_s} \mathbb{1}_{i \neq j}^{s} \exp\left(z_i^{s} \cdot z_j^{t} / \tau\right)}
Combining the source content adaptation loss in Equation (12) and the target domain content adaptation loss in Equation (14), the MCAL loss is defined as follows:
\mathcal{L}_{MCAL} = \mathcal{L}_{SMCAL} + \mathcal{L}_{TMCAL}
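The sketch below illustrates the two-scale weighting of Equation (12) and the dictionary lookup of Equation (13). It is a simplified rendering under several assumptions: projected features are L2-normalized, the split into large-scale and small-scale pixel sets is supplied externally, and the target-domain loss of Equation (14) follows the same pattern with the selected dictionary features z_k^t as positives.

```python
import torch
import torch.nn.functional as F

def scale_term(anchor: torch.Tensor, positive: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """One scale group of Equation (12): the matching pixel (diagonal) is the
    positive, every other projected feature in the group acts as a negative."""
    logits = anchor @ positive.t() / tau                        # (M, M) similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def mcal_source_loss(z_s, z_st, large_idx, small_idx, alpha=0.5, tau=0.1):
    """Two-scale source content adaptation loss sketching Equation (12).
    z_s, z_st: (P, D) projected features of the source image and its
    target-stylized version; large_idx / small_idx index the pixels treated
    as large-scale and small-scale, respectively."""
    l_large = scale_term(z_s[large_idx], z_st[large_idx], tau)
    l_small = scale_term(z_s[small_idx], z_st[small_idx], tau)
    return alpha * l_large + (1.0 - alpha) * l_small

def nearest_in_dictionary(z_st: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Equation (13): for each stylized source feature, select the closest
    target-domain content feature stored in the dictionary (here Q is (N_q, D))."""
    dist = torch.cdist(z_st, Q)                                 # (P, N_q) L2 distances
    return Q[dist.argmin(dim=1)]                                # selected z_k^t, (P, D)
```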

4. Experiments and Discussion

4.1. Datasets Description

In the experiment, we utilized two public remote sensing image datasets, Potsdam and Vaihingen, to assess the methodology proposed in this paper. These particular datasets were assembled by the International Society for Photogrammetry and Remote Sensing (ISPRS) [44], and they have been widely recognized and adopted as significant evaluation benchmarks for semantic segmentation tasks that rely on HRIs [28,45,46,47]. Table 1 provides a synopsis of the characteristics of the datasets.
The Potsdam dataset serves as a quintessential example of an urban landscape, characterized by expansive building clusters, narrow thoroughfares, and densely organized settlement structures. It encompasses six semantic categories: Clutter/background, Impervious surfaces, Car, Tree, Low vegetation, and Buildings. The Potsdam dataset includes Infrared–Red–Green (IR-R-G) channels, Red–Green–Blue (R-G-B) channels, and Red–Green–Blue–Infrared (R-G-B-IR) channels, among other imaging modes. This article employs the first two imaging modes (IR-R-G/R-G-B). In addition, the Potsdam dataset incorporates 38 True Orthophotos (TOPs), each with a resolution of 6000 × 6000 pixels.
The Vaihingen dataset depicts a relatively diminutive village, marked by an abundance of detached edifices and compact multistory constructions. It includes the same semantic categories as the Potsdam dataset and only contains the Infrared–Red–Green (IR-R-G) channel imaging mode. It includes 33 high-resolution True Orthophotos (VHR TOPs), with sizes ranging from 2000 × 2000 to 2000 × 3000 pixels.
In this paper, images from the Potsdam dataset and the Vaihingen dataset are cut into 896 × 896 and 512 × 512 patches, respectively, following Equation (6). After cutting, the Potsdam dataset totals 1764 images, while the Vaihingen dataset totals 1696 images.
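For illustration, non-overlapping tiling consistent with the patch sizes above could be implemented as follows; the exact cropping strategy used in the paper (overlap, padding of border tiles) is not stated, so this is only a plausible sketch.

```python
from typing import List
import torch

def tile(image: torch.Tensor, patch: int) -> List[torch.Tensor]:
    """Cut a (C, H, W) image into non-overlapping patch x patch tiles,
    discarding incomplete tiles at the right and bottom borders."""
    _, H, W = image.shape
    return [image[:, r:r + patch, c:c + patch]
            for r in range(0, H - patch + 1, patch)
            for c in range(0, W - patch + 1, patch)]

# Potsdam tiles (6000 x 6000) -> 896 x 896 patches; Vaihingen tiles -> 512 x 512
# patches, keeping the fixed source/target size ratio required by Equation (6).
```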

4.2. Experiment Settings

Considering the differences in HRIs as depicted in Figure 8, which are influenced by variations in imaging modes, color saturation, and geographical locations, experiments within this paper are executed in two distinct cross-domain scenarios. The first scenario pertains to cross-domain differences associated with geographical location changes, and the second entails alterations in both geographical location and imaging mode. A detailed overview of the experimental setup can be found in Table 2.
For the cross-domain scenario focusing on geographical location changes, the Vaihingen dataset, utilizing the IR-R-G imaging mode, is designated as the source domain, while the Potsdam dataset, operating with the same IR-R-G imaging mode, is identified as the target domain.
For the cross-domain scenario that incorporates changes in both geographical location and imaging mode, the Vaihingen dataset with the IR-R-G imaging mode serves as the source domain, whereas the Potsdam dataset with the R-G-B imaging mode is delineated as the target domain.
The experiments in this paper were implemented using the PyTorch deep learning framework. The operating system was 64-bit Windows 11, with an Intel Xeon(G) W-2223F CPU, 32 GB of memory, and an NVIDIA GeForce RTX 3080 Ti GPU with 12 GB of memory. The total training time is about 12 h. The software environment comprised Python 3.8, CUDA 11.3, and cuDNN 8.2. Training and testing were accelerated using the GPU. The specific parameters for this study included a batch size fixed at 2, an initial learning rate of 0.0001, and the Adam optimizer. The learning rate was adjusted using cosine annealing, and the entire process was iterated 100 times.
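The optimizer and schedule described above roughly correspond to the following PyTorch configuration. This is a hedged sketch: model, loader, and total_loss are placeholders, and details such as the scheduler's T_max are our assumptions.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# cosine annealing of the learning rate over the 100 training epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for x_s, y_s, x_t in loader:           # paired source/target batches, batch size 2
        optimizer.zero_grad()
        loss = total_loss(x_s, y_s, x_t)   # L_ori + SEL + SCR + L_MCAL (placeholder)
        loss.backward()
        optimizer.step()
    scheduler.step()
```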

4.3. Evaluation Metrics

For quantitative evaluation, this paper adopts the Intersection over Union (IoU), F1-score, and mean IoU (mIoU) metrics, given their comprehensive capacity to assess model performance. These metrics effectively encompass both precision and recall. Therefore, the IoU metric for each category, the F1-scores, and the mIoU are used in the evaluation of all implemented methods.
The IoU measures the overlap between two areas as the ratio of their intersection to their union, and its formula is as follows:
IoU = \frac{\left| P_t \cap Y_t \right|}{\left| P_t \cup Y_t \right|}
In the above, P_t represents the pixel-level prediction results on the target domain dataset obtained by the proposed method, while Y_t represents the actual pixel-level labels in the target domain. The F1-score is the harmonic mean of Precision and Recall, and its formula is as follows:
\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
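Both metrics can be accumulated from a per-class confusion matrix, using IoU = TP/(TP + FP + FN) and F1 = 2TP/(2TP + FP + FN), which are equivalent to the formulas above. The snippet below is a generic sketch, not the evaluation script used in the paper.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, label: np.ndarray, num_classes: int) -> np.ndarray:
    """Accumulate a K x K confusion matrix from flattened predictions and labels."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_f1(cm: np.ndarray):
    """Per-class IoU and F1 (plus mIoU) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + 1e-10)
    f1 = 2 * tp / (2 * tp + fp + fn + 1e-10)
    return iou, iou.mean(), f1
```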

4.4. Quantitative Results

In the domain adaptation scenario with geographical location changes, to determine the adaptability of the proposed method in this cross-domain scenario, a comparison was conducted against several advanced domain adaptive semantic segmentation methods, including BiSeNet [48], AdaptSegNet [15], MUCSS [28], and CCDA_LGFA [27]. The former two are UDA methods designed for natural images, while the latter two are state-of-the-art domain adaptive semantic segmentation methods for remote sensing images. The segmentation results are shown in Table 3. The columns list the domain adaptive semantic segmentation method, the evaluation metrics, the per-category results, and the average of the evaluation metrics. From the results, it is discernible that SCDA enhances the adaptability of the baseline network Deeplabv3 [40] to the above datasets and improves the segmentation quality. At the same time, the IoU and F1-scores of our proposed method are slightly higher than those of the most advanced methods in four categories (Impervious surfaces, Cars, Low vegetation, and Buildings), and the overall segmentation indicators are 0.77% and 0.45% higher, respectively, than those of the second-best method, CCDA_LGFA [27]. However, the segmentation performance of our method in the two categories of background and trees is not improved and is inferior to the CCDA_LGFA [27] and MUCSS [28] methods.
In the domain adaptation scenario where both the geographical location and the imaging mode change dynamically, in order to verify whether the proposed method can be extrapolated to such cross-domain adaptation scenarios, a comparative analysis was conducted against several advanced domain adaptive semantic segmentation methods, including BiSeNet [48], AdaptSegNet [15], MUCSS [28], and CCDA_LGFA [27]. The results are shown in Table 4. The proposed method is pre-eminent in the segmentation of Impervious surfaces and Cars, attesting that its performance is on par with the most sophisticated cross-domain methods and lending additional validation to the effectiveness of our method in HRI-based semantic segmentation. However, it is inferior to other methods in the four categories of Buildings, Background, Trees, and Low vegetation.
In addition, our model has 40.348 M parameters. For a tensor input of shape (1, 3, 512, 512), the computational cost is 138.906 GFLOPs; for an input of shape (1, 3, 896, 896), it increases to 425.397 GFLOPs.
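Counts like these are typically obtained with a profiling utility; one possible sketch using the third-party thop package is shown below. Whether the paper used thop, and whether it reports multiply-accumulate operations or FLOPs, is not stated, so treat this purely as an illustration (model is the placeholder segmentation network).

```python
import torch
from thop import profile   # third-party profiler; other FLOP counters work similarly

model.eval()
dummy = torch.randn(1, 3, 512, 512)          # use (1, 3, 896, 896) for the larger input
macs, params = profile(model, inputs=(dummy,))
# thop reports multiply-accumulate operations (MACs); some papers report these as FLOPs
print(f"params: {params / 1e6:.3f} M, MACs: {macs / 1e9:.3f} G")
```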

4.5. Visualization Results

Figure 9 and Figure 10, respectively, display the segmentation results of different domain adaptive semantic segmentation methods in the cross-domain scenarios where geographical location changes and both geographical location and imaging mode dynamically change. Each column corresponds to the input image, true labels, segmentation results of Deeplabv3 [40], BiSeNet [48], AdaptSegNet [15], MUCSS [28], CCDA_LGFA [27], and the proposed method in this paper.
SCDA excels in the majority of classes and is generally the best. The most significant issue is the cross-domain misclassification of the tree class, which leads to poor segmentation results for that category. The reason might be that, in the MCAL method, similar semantic information between target domain features and the stylized source domain content features does not necessarily mean that the two contents belong to the same category.

4.6. Ablation Study

To ascertain the efficacy of the resized images and the RSFT and MCAL techniques proposed in Section 3, ablation experiments were performed by successively incorporating these three techniques into the baseline. The test results are shown in Table 5. The left columns indicate whether the baseline, resized images, RSFT, and MCAL are used, respectively. The evaluation metrics are the mean mIoU and F1-score values. As shown in Table 5, in the cross-domain scenario where the geographical location changes (Vaihingen IRRG to Potsdam IRRG), the RSFT and MCAL methods provide the network with the style and content information of the target domain, respectively, which enhances the network's adaptability to target domain data and significantly improves semantic segmentation performance. Additionally, the "resize images" method resizes both the source and target domain images according to their resolution ratio before feeding them into the network, which also effectively enhances the network's adaptability. In cross-domain scenes where both the geographical location and the imaging mode change, the stylistic differences are significant, and the improvement in segmentation performance is most significant with the MCAL method. After using RSFT alone, the performance of the network drops slightly, indicating overfitting to the source domain data. By adding the MCAL method, the network pays more attention to content information at multiple scales in the target domain, and performance is further improved.
Table 6 shows the effect of the number of RSFT layers and the hyperparameter k on domain adaptation performance. In the table, a tuple indicates which layers of ResNet-50 use RSFT; the five elements represent the five stages of ResNet-50, where 1 means the stage uses RSFT and 0 means it does not. Experiments show that by using RSFT in the first three layers with k = 1, the network achieves the best domain adaptation effect, with mIoU values of 44.38% and 50.59% and F1 values of 57.43% and 66.42%, respectively. However, applying RSFT to deeper layers results in a slight decrease in performance. This may be because, as the layer depth increases, semantic content becomes more important than style. This demonstrates the need to apply RSFT appropriately within the network to avoid interfering with semantic information while increasing the target domain style information that helps train a generalizable model.
In Table 7, α = 0 or α = 1 means that only a single scale is used (either the large scale or the small scale), while α = 0.5 means that both scales are used at the same time. As shown in the table, when multiple scales are used, MCAL provides the network with category information at multiple scales and improves the domain adaptation performance of the network.

5. Conclusions

In this paper, we develop a domain adaptive semantic segmentation framework for high-resolution remote sensing images (HRIs). This framework is based on the image style transfer and contrastive learning approach implemented in WildNet, and it is improved according to the characteristics of HRIs, aiming to minimize domain differences between various remote sensing datasets. This enables the transfer of segmentation models pretrained on source domain remote sensing images to the unlabeled target domain. Firstly, the distribution shift caused by differences in image style mainly exists in the shallow layers of the network, and adjusting this distribution shift often causes the content of the source image to be lost. SCDA uses the residual style feature transfer (RSFT) method to make the source domain image obtain the style information of the target domain image while reducing the loss of source content information by retaining the source feature information through a residual connection. Secondly, to prevent the network from overfitting the content of the source domain, and because HRIs are characterized by significant scale variations across different object categories, we propose a multiscale content adaptation learning (MCAL) method that enables the network to learn category information at different scales in the target domain that is similar to the source domain, thereby adapting to the content of the target domain.
SCDA is a simple and effective framework for domain adaptive semantic segmentation. However, SCDA only reduces the domain difference in terms of the content and style of the image, without considering other image information, such as layout information or category depth information. Furthermore, SCDA excels at identifying both large-scale and small-scale feature categories, but its segmentation performance on other categories is weaker. Therefore, how to combine SCDA with methods that utilize image layout and category depth information, and how to increase attention to intermediate-scale object categories, are potential topics for future work.

Author Contributions

Conceptualization, H.X., H.C. and W.Y.; methodology, H.X. and W.Y.; software, H.X.; writing—original draft preparation, H.X.; writing—review and editing, W.Y., L.C., B.L. and L.R.; supervision, W.Y.; funding acquisition, W.Y., L.C. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China under Grant 61976226.

Data Availability Statement

The source code for our proposed method can be found at https://github.com/casualdays/SCDA, accessed on 12 September 2023.

Acknowledgments

We would like to thank the scholars who have open-sourced their research work. We also thank researchers who make datasets available to the public.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lyu, Y.; Vosselman, G.; Xia, G.S.; Yilmaz, A.; Yang, M.Y. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS J. Photogramm. Remote Sens. 2020, 165, 108–119. [Google Scholar] [CrossRef]
  2. Shang, X.; Song, M.; Wang, Y.; Yu, C.; Yu, H.; Li, F.; Chang, C.I. Target-constrained interference-minimized band selection for hyperspectral target detection. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6044–6064. [Google Scholar] [CrossRef]
  3. Li, J.; Xu, Z.; Fu, L.; Zhou, X.; Yu, H. Domain adaptation from daytime to nighttime: A situation-sensitive vehicle detection and traffic flow parameter estimation framework. Transp. Res. Part Emerg. Technol. 2021, 124, 102946. [Google Scholar] [CrossRef]
  4. Vega, P.J.S. Deep Learning-Based Domain Adaptation for Change Detection in Tropical Forests. Ph.D. Thesis, PUC-Rio, Rio de Janeiro, Brazil, 2021. [Google Scholar]
  5. Wang, P.; Wang, L.; Leung, H.; Zhang, G. Super-resolution mapping based on spatial–spectral correlation for spectral imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2256–2268. [Google Scholar] [CrossRef]
  6. Zhang, W.; Chen, H.; Chen, W.; Yang, S. A Rooftop-Contour Guided 3D Reconstruction Texture Mapping Method for Building using Satellite Images. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, IEEE, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3456–3459. [Google Scholar]
  7. Ding, L.; Zhang, J.; Bruzzone, L. Semantic segmentation of large-size VHR remote sensing images using a two-stage multiscale training architecture. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5367–5376. [Google Scholar] [CrossRef]
  8. Yang, H.L.; Crawford, M.M. Spectral and spatial proximity-based manifold alignment for multitemporal hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2015, 54, 51–64. [Google Scholar] [CrossRef]
  9. Sun, W.; Wang, R. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geosci. Remote Sens. Lett. 2018, 15, 474–478. [Google Scholar] [CrossRef]
  10. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
  11. Mou, L.; Hua, Y.; Zhu, X.X. Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569. [Google Scholar] [CrossRef]
  12. Yue, K.; Yang, L.; Li, R.; Hu, W.; Zhang, F.; Li, W. TreeUNet: Adaptive tree convolutional neural networks for subdecimeter aerial image segmentation. ISPRS J. Photogramm. Remote Sens. 2019, 156, 1–13. [Google Scholar] [CrossRef]
  13. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  14. Xu, M.; Wu, M.; Chen, K.; Zhang, C.; Guo, J. The eyes of the gods: A survey of unsupervised domain adaptation methods based on remote sensing data. Remote Sens. 2022, 14, 4380. [Google Scholar] [CrossRef]
  15. Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7472–7481. [Google Scholar]
  16. Zou, Y.; Yu, Z.; Kumar, B.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305. [Google Scholar]
  17. Tasar, O.; Happy, S.; Tarabalka, Y.; Alliez, P. SemI2I: Semantically consistent image-to-image translation for domain adaptation of remote sensing data. In Proceedings of the IGARSS 2020—2020 IEEE International Geoscience and Remote Sensing Symposium, IEEE, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1837–1840. [Google Scholar]
  18. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  19. Bousmalis, K.; Silberman, N.; Dohan, D.; Erhan, D.; Krishnan, D. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3722–3731. [Google Scholar]
  20. Shi, Y.; Du, L.; Guo, Y. Unsupervised domain adaptation for SAR target detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6372–6385. [Google Scholar] [CrossRef]
  21. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1180–1189. [Google Scholar]
  22. Long, M.; Cao, Y.; Cao, Z.; Wang, J.; Jordan, M.I. Transferable representation learning with deep adaptation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 3071–3085. [Google Scholar] [CrossRef] [PubMed]
  23. Zhao, D.; Yuan, B.; Gao, Y.; Qi, X.; Shi, Z. UGCNet: An Unsupervised Semantic Segmentation Network Embedded with Geometry Consistency for Remote-Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 5510005. [Google Scholar] [CrossRef]
  24. Zhao, D.; Li, J.; Yuan, B.; Shi, Z. V2RNet: An unsupervised semantic segmentation algorithm for remote sensing images via cross-domain transfer learning. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, IEEE, Brussels, Belgium, 11–16 July 2021; pp. 4676–4679. [Google Scholar]
  25. Cai, Y.; Yang, Y.; Zheng, Q.; Shen, Z.; Shang, Y.; Yin, J.; Shi, Z. BiFDANet: Unsupervised Bidirectional Domain Adaptation for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2022, 14, 190. [Google Scholar] [CrossRef]
  26. Liu, W.; Su, F.; Jin, X.; Li, H.; Qin, R. Bispace Domain Adaptation Network for Remotely Sensed Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5600211. [Google Scholar] [CrossRef]
  27. Zhang, B.; Chen, T.; Wang, B. Curriculum-style local-to-global adaptation for cross-domain remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  28. Li, Y.; Shi, T.; Zhang, Y.; Chen, W.; Wang, Z.; Li, H. Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2021, 175, 20–33. [Google Scholar] [CrossRef]
  29. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2849–2857. [Google Scholar]
  30. Zhao, Y.; Guo, P.; Sun, Z.; Chen, X.; Gao, H. ResiDualGAN: Resize-residual DualGAN for cross-domain remote sensing images semantic segmentation. Remote Sens. 2023, 15, 1428. [Google Scholar] [CrossRef]
  31. Chen, X.; Pan, S.; Chong, Y. Unsupervised domain adaptation for remote sensing image semantic segmentation using region and category adaptive domain discriminator. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  32. Lee, S.; Seong, H.; Lee, S.; Kim, E. WildNet: Learning domain generalized semantic segmentation from the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9936–9946. [Google Scholar]
  33. Marsden, R.A.; Wiewel, F.; Döbler, M.; Yang, Y.; Yang, B. Continual unsupervised domain adaptation for semantic segmentation using a class-specific transfer. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
  34. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  35. Wang, R.; Wu, Z.; Weng, Z.; Chen, J.; Qi, G.J.; Jiang, Y.G. Cross-domain contrastive learning for unsupervised domain adaptation. IEEE Trans. Multimed. 2022, 25, 1665–1673. [Google Scholar] [CrossRef]
  36. Kang, G.; Jiang, L.; Yang, Y.; Hauptmann, A.G. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4893–4902. [Google Scholar]
  37. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  38. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  39. Dwibedi, D.; Aytar, Y.; Tompson, J.; Sermanet, P.; Zisserman, A. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9588–9597. [Google Scholar]
  40. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Zhou, K.; Yang, Y.; Qiao, Y.; Xiang, T. Domain generalization with mixstyle. arXiv 2021, arXiv:2104.02008. [Google Scholar]
  43. Park, K.; Woo, S.; Shin, I.; Kweon, I.S. Discover, hallucinate, and adapt: Open compound domain adaptation for semantic segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 10869–10880. [Google Scholar]
  44. Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D. ISPRS Semantic Labeling Contest; ISPRS: Leopoldshöhe, Germany, 2014; Volume 1. [Google Scholar]
  45. Wang, Q.; Gao, J.; Li, X. Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes. IEEE Trans. Image Process. 2019, 28, 4376–4386. [Google Scholar] [CrossRef] [PubMed]
  46. Ji, S.; Wang, D.; Luo, M. Generative adversarial network-based full-space domain adaptation for land cover classification from multiple-source remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3816–3828. [Google Scholar] [CrossRef]
  47. Benjdira, B.; Bazi, Y.; Koubaa, A.; Ouni, K. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sens. 2019, 11, 1369. [Google Scholar] [CrossRef]
  48. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
Figure 1. Description of the characteristics of HRIs data: (a) images captured from different geographical regions by different sensors show obvious domain differences; (b) remarkable scale differences exist between different ground object categories (cars vs. buildings), even within the same HRI.
Figure 2. An overview of style transfer methods based on AdaIN.
Figure 3. An overview of SimCLR.
Figure 4. The overall architecture of WildNet.
Figure 5. The overall learning process of SCDA.
Figure 6. The structure of RSFT. Layer 0 shows the original structure, the FS structure in WildNet, and our improved structure. Only our structure is shown in layer 1. Layers 2 to 4 are similar to layer 1 and are not shown.
Figure 7. The structure of MCAL.
Figure 8. The two cross-domain settings. (a) Across different regions: the source domain is the Vaihingen dataset with IR-R-G (infrared, red, green) imaging, and the target domain is the Potsdam dataset, also with IR-R-G imaging. This setting evaluates how well the domain adaptation model performs when the geographical location changes (from Vaihingen to Potsdam) while the imaging mode stays the same, i.e., its ability to generalize to a new location with the same type of image data. (b) Across different regions and imaging sensors: the source domain is the Vaihingen dataset with IR-R-G imaging, and the target domain is the Potsdam dataset with R-G-B (red, green, blue) imaging. This setting tests the robustness of the model when both the geographical location (Vaihingen to Potsdam) and the imaging mode (IR-R-G to R-G-B) change, i.e., its ability to adapt to a new region and a different type of image data.
Figure 9. (a) Target Image, (b) GT, (c) Deeplabv3, (d) BiSeNet, (e) AdaptSegNet, (f) MUCSS, (g) CCDA_LGFA, (h) Ours.
Figure 10. (a) Target Image, (b) GT, (c) Deeplabv3, (d) BiSeNet, (e) AdaptSegNet, (f) MUCSS, (g) CCDA_LGFA, (h) Ours.
Table 1. Synopsis of the characteristics of the Potsdam and Vaihingen datasets.

| Description | Vaihingen | Potsdam |
|---|---|---|
| Sensor | Leica ADS80 | Leica ADS80 |
| Pixel | 2000 × 3000 | 6000 × 6000 |
| GSD | 5 cm/pixel | 9 cm/pixel |
| Processed size | 512 × 512 | 896 × 896 |
| Channel compositions | IR-R-G | R-G-B/IR-R-G/IR-R-G-B |
| Clutter/background samples | 0.70% | 4.80% |
| Impervious surface samples | 29.30% | 29.80% |
| Car samples | 1.40% | 1.80% |
| Tree samples | 22.50% | 14.50% |
| Low-vegetation samples | 19.30% | 21.00% |
| Buildings samples | 26.80% | 28.10% |
Table 2. Two different cross-domain scenarios.

| Description | Across Different Regions: Source | Across Different Regions: Target | Across Different Regions and Imaging Sensors: Source | Across Different Regions and Imaging Sensors: Target |
|---|---|---|---|---|
| Location | Vaihingen | Potsdam | Vaihingen | Potsdam |
| Imaging mode | IR-R-G | IR-R-G | IR-R-G | R-G-B |
| Labeled | ✓ | ✗ | ✓ | ✗ |
Table 3. Experimental results of cross-domain segmentation under geographical variation (Vaihingen IRRG to Potsdam IRRG). Each cell reports IoU/F1 (%).

| Methods | Background/Clutter | Impervious Surface | Car | Tree | Low Vegetation | Buildings | Overall |
|---|---|---|---|---|---|---|---|
| Deeplabv3 [40] | 9.3/16.86 | 49.18/65.93 | 38.51/55.6 | 7.67/14.24 | 29.32/45.34 | 36.96/53.97 | 28.49/41.99 |
| BiSeNet [48] | 29.01/44.97 | 22.70/36.99 | 0.69/1.36 | 41.56/58.71 | 26.12/41.42 | 39.61/34.61 | 23.5/36.34 |
| AdaptSegNet [15] | 8.36/15.33 | 49.55/64.64 | 40.95/58.11 | 22.59/36.79 | 34.43/61.5 | 48.01/63.41 | 33.98/49.96 |
| MUCSS [28] | 33.48/50.24 | 50.78/67.35 | 36.93/54.08 | 58.69/73.97 | 40.84/58.00 | 63.20/77.45 | 46.36/62.46 |
| CCDA_LGFA [27] | 12.31/24.59 | 64.39/78.59 | 59.35/75.08 | 37.55/54.60 | 47.17/63.27 | 66.44/79.84 | 47.87/62.66 |
| Ours | 19.68/33.88 | 66.91/81.17 | 70.90/83.97 | 15.23/28.44 | 48.79/66.58 | 70.34/84.59 | 48.64/63.11 |
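As a reading aid: the Overall columns in Tables 3 and 4 appear to be the unweighted means of the six per-class scores. The short check below (our own, not from the paper) reproduces the Overall entries of the "Ours" row in Table 3 up to rounding.

```python
# Per-class IoU and F1 (%) for the "Ours" row of Table 3 (Vaihingen IRRG to Potsdam IRRG).
iou = [19.68, 66.91, 70.90, 15.23, 48.79, 70.34]   # clutter, impervious, car, tree, low veg., buildings
f1  = [33.88, 81.17, 83.97, 28.44, 66.58, 84.59]

print(sum(iou) / len(iou))   # ~48.64, the reported Overall IoU
print(sum(f1) / len(f1))     # ~63.11 (63.105), the reported Overall F1
```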
Table 4. Experimental results of cross-domain segmentation under geographical and imaging mechanism variation (Vaihingen IRRG to Potsdam RGB). Each cell reports IoU/F1 (%).

| Methods | Clutter/Background | Impervious Surface | Car | Tree | Low Vegetation | Buildings | Overall |
|---|---|---|---|---|---|---|---|
| Deeplabv3 [40] | 6.99/13.04 | 42.98/60.12 | 38.01/55.08 | 0.53/1.06 | 1.59/3.13 | 29.09/45.05 | 19.86/29.58 |
| BiSeNet [48] | 23.66/38.26 | 17.74/30.12 | 0.99/1.95 | 32.67/49.24 | 18.42/31.11 | 12.64/22.43 | 17.69/28.85 |
| AdaptSegNet [15] | 6.11/11.5 | 37.66/59.55 | 42.31/55.95 | 30.71/45.41 | 15.1/25.81 | 54.25/70.31 | 31.02/44.75 |
| MUCSS [28] | 16.09/27.76 | 47.28/64.21 | 40.71/57.87 | 28.96/44.97 | 43.66/60.79 | 59.84/74.88 | 38.06/53.81 |
| CCDA_LGFA [27] | 13.27/23.43 | 57.65/73.14 | 56.99/72.27 | 35.87/52.80 | 29.77/45.88 | 65.44/79.11 | 43.17/57.77 |
| Ours | 15.76/26.73 | 63.31/76.02 | 66.57/80.77 | 12.63/21.8 | 40.89/58.22 | 67.63/81.01 | 44.38/57.43 |
Table 5. Ablation study for SCDA. The results are obtained from the tasks of cross-domain semantic segmentation from Vaihingen IRRG to Potsdam RGB and from Vaihingen IRRG to Potsdam IRRG.

| Experiment | Vaihingen IRRG to Potsdam RGB (mIoU/F1) | Vaihingen IRRG to Potsdam IRRG (mIoU/F1) |
|---|---|---|
| baseline [32] | 35.54/52.86 | 40.25/57.79 |
| resize images | 39.32/55.21 | 45.68/61.56 |
| with RSFT | 40.76/55.36 | 44.98/60.67 |
| with MCAL | 42.49/55.92 | 48.64/63.11 |
| with RSFT and MCAL | 44.38/57.43 | 50.59/66.42 |
Table 6. Evaluation results of RSFT under different hyperparameter settings. The results are obtained from the tasks of cross-domain semantic segmentation from Vaihingen IRRG to Potsdam RGB and from Vaihingen IRRG to Potsdam IRRG.

| RSFT Layer Setting | k | Vaihingen IRRG to Potsdam RGB (mIoU/F1) | Vaihingen IRRG to Potsdam IRRG (mIoU/F1) |
|---|---|---|---|
| RSFT = [1,0,0,0,0] | | 43.71/53.29 | 47.15/58.98 |
| RSFT = [1,1,0,0,0] | | 41.46/52.12 | 48.96/65.13 |
| RSFT = [1,1,1,0,0] | k = 1.0 | 44.38/57.43 | 50.59/66.42 |
| RSFT = [1,1,1,1,0] | | 43.33/52.51 | 47.40/59.21 |
| RSFT = [1,1,1,1,1] | | 42.53/52.26 | 46.33/58.78 |
| RSFT = [1,1,1,0,0] | k = 0 | 39.32/55.21 | 45.68/61.56 |
| | k = 0.5 | 38.14/53.37 | 46.30/62.56 |
| | k = 1.0 | 44.38/57.43 | 50.59/66.42 |
| | k = 1.5 | 40.76/54.22 | 49.02/65.67 |
| | k = 2.0 | 41.10/55.79 | 49.43/65.90 |
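Table 6 varies two things: which of the five backbone stages RSFT is applied to (the binary vector) and the transfer strength k. The snippet below is purely illustrative of how a residual, strength-controlled style injection could be wired; it assumes an AdaIN-style statistic swap and is not the paper's exact RSFT formulation.

```python
import torch

def residual_style_injection(src_feat: torch.Tensor, tgt_feat: torch.Tensor,
                             k: float = 1.0, eps: float = 1e-5) -> torch.Tensor:
    """Illustrative sketch: swap the per-channel statistics of the source features for those
    of the target features, then blend the result back residually with strength k.
    Inputs: (N, C, H, W). With k = 0 the source features pass through unchanged."""
    s_mean = src_feat.mean(dim=(2, 3), keepdim=True)
    s_std = src_feat.std(dim=(2, 3), keepdim=True) + eps
    t_mean = tgt_feat.mean(dim=(2, 3), keepdim=True)
    t_std = tgt_feat.std(dim=(2, 3), keepdim=True) + eps
    stylized = (src_feat - s_mean) / s_std * t_std + t_mean
    return src_feat + k * (stylized - src_feat)

# A setting such as RSFT = [1,1,1,0,0] would apply the injection only in the first three
# backbone stages (layer0-layer2) and leave layer3-layer4 untouched.
```

Consistent with this reading, the k = 0 row of Table 6 coincides with the "resize images" row of Table 5 on both tasks, i.e., disabling the transfer recovers the no-RSFT baseline.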
Table 7. Evaluation results of MCAL under different hyperparameter settings. The results are obtained from the tasks of cross-domain semantic segmentation from Vaihingen IRRG to Potsdam IRRG and from Vaihingen IRRG to Potsdam RGB.

| Hyperparameter Settings | Vaihingen IRRG to Potsdam IRRG (mIoU/F1) | Vaihingen IRRG to Potsdam RGB (mIoU/F1) |
|---|---|---|
| α = 0 | 48.00/62.80 | 41.71/52.99 |
| α = 0.5 | 50.59/66.42 | 44.38/57.43 |
| α = 1 | 48.84/63.00 | 42.06/54.15 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
