Unsupervised Domain Adaptation Semantic Segmentation of Remote Sensing Imagery with Scene Covariance Alignment

Cao, Kangjian; Wang, Sheng; Wei, Ziheng; Chen, Kexin; Chang, Runlong; Xu, Fu

doi:10.3390/electronics13245022

by Kangjian Cao ^1,2,3,†, Sheng Wang ^1,2,3,†, Ziheng Wei ^1,2,3, Kexin Chen ^1,2,3, Runlong Chang ^1,2,3 and Fu Xu ^1,2,3,*

¹ School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China

² Engineering Research Center for Forestry-Oriented Intelligent Information Processing, National Forestry and Grassland Administration, Beijing 100083, China

³ State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Electronics 2024, 13(24), 5022; https://doi.org/10.3390/electronics13245022

Submission received: 8 October 2024 / Revised: 17 December 2024 / Accepted: 18 December 2024 / Published: 20 December 2024

(This article belongs to the Special Issue Bridging the Gap between Deep Learning and Probabilistic Inference for Advancements in Robotics)


Abstract

Remote sensing imagery (RSI) segmentation plays a crucial role in environmental monitoring and geospatial analysis. However, in real-world applications, the domain shift between the source domain and the target domain often leads to severe degradation of model performance. Most existing unsupervised domain adaptation methods focus on aligning global-local domain features or category features, neglecting the variations of ground object categories within local scenes. To capture these variations, we propose a scene covariance alignment (SCA) model that guides the learning of scene-level features and addresses the domain adaptation challenge in RSI segmentation. Unlike traditional global feature alignment methods, SCA incorporates a scene feature pooling (SFP) module and a covariance regularization (CR) mechanism to extract and align scene-level features effectively, focusing on local regions with different scene characteristics between the source and target domains. Experiments on both the LoveDA and Yanqing land cover datasets demonstrate that SCA performs well in cross-domain RSI segmentation tasks, outperforming state-of-the-art baselines across various scenarios, including different noise levels, spatial resolutions, and environmental conditions.

Keywords: unsupervised domain adaptation; remote sensing imagery; semantic segmentation; covariance alignment

1. Introduction

Remote sensing imagery (RSI) plays a crucial role in numerous environmental monitoring and geospatial analysis applications, including land use classification, urban planning, disaster management, and agricultural monitoring [1,2]. The increasing availability of high-resolution RSI data from satellite and aerial platforms has enhanced our ability to monitor and analyze the Earth’s surface with unprecedented detail. Despite the advantages of high-resolution imagery, inherent variability in image characteristics such as spatial resolution, noise levels, and environmental conditions poses significant challenges for the automatic segmentation and interpretation of remote sensing data [3,4,5].

In recent years, deep learning methods have demonstrated remarkable performance in semantic segmentation of remote sensing imagery, particularly convolutional neural networks (CNNs). These models can automatically learn hierarchical features, ranging from low-level features (such as edges and textures) to high-level semantic features (such as object shapes and regional characteristics), significantly improving segmentation accuracy. However, traditional deep learning methods heavily rely on huge amounts of pixel-level labels [6], which require substantial human effort and time. Moreover, significant variations in data distributions across different regions or within the same region over time or captured by different sensors, as well as factors such as atmospheric effects, cloud cover, and terrain, further increase the complexity of RSI data [7,8]. These challenges often lead to a reduction in model accuracy.

To address these challenges, researchers have proposed unsupervised domain adaptation (UDA) methods. These methods train models using labeled data from a source domain while ensuring accurate predictions on unlabeled data from a target domain [9]. By employing techniques such as discrepancy metrics and adversarial learning, UDA aligns the feature distributions between the source and target domains, enabling the learning of domain-invariant features [10,11,12,13,14,15]. This reduces the dependency on labeled data, allows the model to better adapt to the feature distribution of the target domain, and ultimately enhances both the model’s generalization capability and the accuracy of image semantic segmentation.

Currently, unsupervised domain adaptation methods are mainly divided into adversarial methods [16,17], generative training methods [18,19], and self-training methods [20,21]. Among them, adversarial methods align cross-domain feature spaces and semantic structures at three levels: image-level [17,22], feature-level [23,24], and output-level [25,26], ensuring high visual consistency between source and target domain data. Tasar et al. [22] proposed a color mapping generation network that can convert the colors of training images to match those of target images without altering the object structures in the training images. Ma et al. [24] introduced a discriminator into the segmentation network to align cross-domain high-level features, thereby capturing global context. Additionally, some studies have leveraged the powerful long-range context modeling capability of transformers to enhance feature alignment in adversarial methods [12,13]. Generative training-based methods address input-level domain shift by modifying the visual features of images to minimize color and texture differences between source and target domain images. However, their effectiveness largely depends on the quality of the generated images. Self-training-based methods enhance the model by generating pseudo-labels for unlabeled target images, but they face challenges in producing high-confidence pseudo-labels and effectively utilizing them in the target domain. In addition, some studies have combined multiple approaches. Ran et al. [27] proposed a hybrid training method that integrates self-training and generative training methods. This approach reduces the negative impact of noise that may be introduced by generative training and improves the accuracy of pseudo-labels.

In high-resolution remote sensing images, land-cover objects and their spatial relationships are highly complex, and the same category of objects collected from different regions often exhibit significant feature differences. Traditional unsupervised domain adaptation methods typically capture global context by aligning high-level features [28] or focus on implicit local feature alignment and explicit category alignment [29,30]. Recent studies have simultaneously considered both global and local features. Ma et al. [31] proposed an adaptive method based on high- and low-frequency decomposition and developed a fully global-local adversarial learning UDA framework based on this method. This framework promotes domain alignment by capturing cross-domain dependencies at different levels while leveraging global-local context modeling between the two domains. Wang et al. [16] proposed a two-stage semantic segmentation framework that achieves fine-grained local alignment and category-level alignment on the foundation of global alignment.

Although existing methods align cross-domain features at both global and local levels, they fail to capture the complex, fine-grained differences between scenes and overlook variations in ground object categories within local scenes. To address domain shifts in the semantic segmentation of remote sensing imagery, we propose a scene covariance alignment (SCA) model. First, we design a scene feature pooling (SFP) module to extract multi-scale scene features from the domain and fuse them with category features. Next, we introduce a covariance regularization (CR) mechanism to maintain consistency of these scene features between the source and target domains. As a result, the distribution discrepancy of cross-domain scene features is reduced, ensuring effective alignment of local features in complex scenes and ultimately improving segmentation performance.

The contributions of this work are summarized as follows:

(1): We propose a novel scene-level feature extraction method for ground-object categories, which employs multi-scale feature pooling and attention mechanisms to capture the critical information within domain scene features.
(2): We propose a covariance regularization mechanism, which further reduces the distribution discrepancy between the source and target domains by aligning the covariance of features within the same scene and separating features across different scenes.
(3): Extensive experiments conducted on the LoveDA and Yanqing datasets confirm that the proposed SCA method achieves excellent performance in cross-domain segmentation of RSI.

The remainder of this paper is organized as follows: In Section 2, we describe the architecture of the SCA model and the proposed UDA mechanism in detail. Section 3 discusses the experimental setup and evaluation protocol, including the datasets used and performance comparison metrics. In Section 4, we present the experimental results, demonstrating the effectiveness of the SCA model in improving segmentation performance across a range of remote sensing scenarios. Finally, in Section 5, we conclude the paper, discussing the implications of our findings and potential directions for future work.

2. Materials and Methods

In the context of remote sensing image segmentation, domain adaptation presents unique challenges due to the inherent variability in image characteristics across different domains. The goal is to transfer knowledge from a labeled source domain to an unlabeled target domain with significant discrepancies in noise levels, spatial resolution, and environmental conditions. To address these challenges, we introduce a novel scene covariance alignment (SCA) model. The SCA model facilitates robust feature alignment between source and target domains by leveraging scene-level feature pooling and covariance regularization, thus enabling effective domain adaptation and improved generalization in complex environments.

The core of the SCA model builds on the DeepLabV2 framework with ResNet-50 as its backbone for feature extraction. This model is augmented by two key innovations: the scene feature pooling (SFP) module, which aggregates scene-specific features, and the covariance regularization (CR) mechanism, which reinforces feature alignment across domains. Together, these components ensure the model’s adaptability to domain shifts while preserving scene-level feature integrity. The framework of the SCA model is shown in Figure 1.

2.1. Scene Feature Pooling (SFP)

Remote sensing images exhibit complex spatial and scene characteristics, such as spatial distribution, texture, and spectral features, with unique feature distributions across different scenes. The scene feature pooling (SFP) module captures domain-specific multi-scale scene feature information and integrates it with land-cover class features to generate scene-level centroids for land-cover classes. The SFP module comprises three components: Multi-Scale Feature Pooling Layer, Context Attention Fusion Layer, and Scene-Level Centroid Generation.

The Multi-Scale Feature Pooling Layer applies multiple pooling operations with different window sizes to the feature map $F \in \mathbb{R}^{H \times W \times C}$. Each pooling output is dimensionally reduced to a unified channel number $C'$ via $1 \times 1$ convolution. The features from different scales are resized to the original feature map size $H \times W$ using bilinear interpolation and concatenated along the feature dimension, resulting in a comprehensive feature $F_{pool} \in \mathbb{R}^{H \times W \times C''}$, where $C'' = n \cdot C'$, and $n$ is the number of pooling windows.
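The pooling-and-resize step described above can be sketched on a single channel in plain Python. This is only an illustration under stated assumptions: "pooling windows" is read as adaptive average pooling to a `bins × bins` grid, nearest-neighbour upsampling stands in for the bilinear interpolation used in the paper, the $1 \times 1$ convolution is omitted, and the function names and the value $C' = 256$ are hypothetical.

```python
import math

def adaptive_avg_pool(grid, bins):
    """Average-pool a 2D list (H x W) into a bins x bins grid."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled

def upsample_nearest(grid, h, w):
    """Resize a bins x bins grid back to H x W (nearest neighbour, not bilinear)."""
    bins = len(grid)
    return [[grid[r * bins // h][c * bins // w] for c in range(w)] for r in range(h)]

def fused_channels(num_windows, reduced_channels):
    """Channel count after concatenating the branches: C'' = n * C'."""
    return num_windows * reduced_channels

# Toy 4 x 4 single-channel feature map.
feature = [[1.0, 2.0, 3.0, 4.0],
           [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0],
           [13.0, 14.0, 15.0, 16.0]]

# One resized branch per pooling scale; each branch has the original H x W size.
branches = [upsample_nearest(adaptive_avg_pool(feature, b), 4, 4) for b in (1, 2)]
print(fused_channels(4, 256))  # hypothetical C' = 256 with n = 4 windows
```

Each branch keeps the spatial size of the input, so the branches can be stacked along the channel axis, which is where the relation $C'' = n \cdot C'$ comes from.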

The Context Attention Fusion Layer captures key features and suppresses redundant information in the feature map through channel attention, spatial attention, and self-attention mechanisms. The outputs of the three mechanisms are denoted as $F_{c}$, $F_{s}$, and $F_{a}$, respectively, with the formulas as follows:

$$F_{c} = F \cdot A_{c} = F \cdot \sigma(\mathrm{FC}(\mathrm{ReLU}(\mathrm{FC}(F_{avg} + F_{max})))) \tag{1}$$

where $F_{avg}$ and $F_{max}$ represent the spatial average pooling and maximum pooling of the comprehensive feature map, respectively, and $A_{c}$ is the channel attention weight.

$$F_{s} = F \cdot A_{s} = F \cdot \sigma(\mathrm{Conv2D}([F_{s\_avg}, F_{s\_max}])) \tag{2}$$

where $F_{s\_avg}$ and $F_{s\_max}$ represent the channel-wise average pooling and maximum pooling of the comprehensive feature map, respectively, and $A_{s}$ is the spatial attention weight.

$$F_{a} = \mathrm{Softmax}(Q K^{T}) V \tag{3}$$

where $Q$, $K$, and $V$ are the linear transformations of the feature map. The fused scene feature is denoted as $F_{context} \in \mathbb{R}^{H \times W \times C''}$, obtained by combining $F_{c}$, $F_{s}$, and $F_{a}$.
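As a small illustration of the self-attention term in Equation (3), the following plain-Python sketch computes $\mathrm{Softmax}(QK^{T})V$ for tiny matrices. It assumes that $Q$, $K$, and $V$ are given as (positions × channels) matrices with the $H \times W$ positions flattened, and that the softmax is taken row-wise. The helper names are hypothetical, and no scaling factor is applied because Equation (3) does not include one.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(q, k, v):
    """F_a = Softmax(Q K^T) V with a row-wise softmax (assumed reading of Eq. (3))."""
    scores = matmul(q, transpose(k))
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

# With all-zero queries and keys every position attends uniformly,
# so each output row is the mean of the rows of V.
q = [[0.0, 0.0], [0.0, 0.0]]
k = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(self_attention(q, k, v))
```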

Finally, the Context Attention Fusion Layer’s scene features are combined with the coarse prediction output from the network to generate scene-level centroids $f \in \mathbb{R}^{N \times C''}$, where $N$ is the number of land-cover classes. The generation formula is as follows:

$$f = \frac{1}{H \times W} \sum_{k=1}^{H \times W} Y'_{k} \odot F_{context} \tag{4}$$

Here, $Y' \in \mathbb{R}^{N \times H \times W}$ represents the coarse prediction from the segmentation network, and $\odot$ denotes element-wise multiplication. The Multi-Scale Feature Pooling Layer captures diverse feature distributions of scenes through local-to-global multi-scale pooling. The Context Attention Fusion Layer enhances contextual feature representation by integrating channel attention, spatial attention, and self-attention mechanisms.
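One way to read Equation (4) is as a prediction-weighted average of the per-pixel scene features, computed once per class. The sketch below follows that reading. It assumes the $H \times W$ pixels are flattened into a single axis of length $P$, and the function name and toy values are hypothetical.

```python
def scene_centroids(coarse_pred, context):
    """Prediction-weighted class centroids (assumed reading of Eq. (4)).

    coarse_pred: N x P class scores, P = H * W flattened pixels.
    context:     P x C fused scene features.
    Returns N x C centroids with f[n][c] = (1 / P) * sum_k Y'[n][k] * F[k][c].
    """
    num_pixels = len(context)
    num_channels = len(context[0])
    centroids = []
    for class_scores in coarse_pred:
        centroid = [0.0] * num_channels
        for weight, feat in zip(class_scores, context):
            for c in range(num_channels):
                centroid[c] += weight * feat[c]
        centroids.append([v / num_pixels for v in centroid])
    return centroids

# Two classes, four pixels, two feature channels.
coarse_pred = [[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]]
context = [[2.0, 0.0], [4.0, 0.0], [0.0, 6.0], [0.0, 2.0]]
print(scene_centroids(coarse_pred, context))
```

Note that the normalization is by the total pixel count $H \times W$, as written in Equation (4), not by the number of pixels assigned to each class.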

2.2. Covariance Regularization for Robust Feature Alignment

To effectively align the category-specific scene features extracted by the scene feature pooling module between the source and target domains, we introduce a covariance regularization (CR) [32,33] mechanism. The core idea of this mechanism is to reduce the distribution discrepancy between the source and target domains by aligning the covariance of features within the same scene and separating the features between different scenes. Specifically, for different samples of the same scene, we want their feature representations to be as similar as possible, meaning the covariance matrix of the features within a scene should be “concentrated”. This ensures that samples of the same scene category have high consistency in the feature space. On the other hand, features from different scenes should be as separated as possible, meaning the covariance matrix between scenes should be “dispersed”. By doing so, we ensure a clear distinction between features of different scenes, thus reducing cross-scene confusion. The core of covariance regularization lies in calculating the correlation between scene features. For two feature vectors $f_{1}$ and $f_{2}$, we use the cosine similarity of their mean-centered versions as the measure of their correlation.

$$\mathrm{Corr}(f_{1}, f_{2}) = \frac{(f_{1} - \mu_{f_{1}})^{T} (f_{2} - \mu_{f_{2}})}{\| f_{1} - \mu_{f_{1}} \|_{2} \, \| f_{2} - \mu_{f_{2}} \|_{2}} \tag{5}$$

where $\mu_{f_{1}}$ and $\mu_{f_{2}}$ represent the mean vectors of $f_{1}$ and $f_{2}$. In this way, the calculated correlation value reflects the angular difference between the two centered feature vectors. A value closer to 1 indicates higher similarity, while a value closer to −1 indicates greater dissimilarity. To guide feature alignment, we design the following loss function for covariance regularization.

$$L_{CR}(f_{1}, f_{2}) = - \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \left( A_{ij}(\mathrm{Corr}(f_{1}, f_{2})) \right) \tag{6}$$

where $A_{ij}$ is defined as

$$A_{ij}(\mathrm{Corr}(f_{1}, f_{2})) = \begin{cases} \mathrm{Corr}(f_{1}, f_{2}) & \text{if } i = j \\ \max(1 - \mathrm{Corr}(f_{1}, f_{2}), \epsilon) & \text{if } i \neq j \end{cases} \tag{7}$$

where $\epsilon$ is a small value used to avoid taking the logarithm of zero. This loss function ensures that the elements on the diagonal (i.e., samples from the same scene) have high correlation, while the elements off the diagonal (i.e., samples from different scenes) have low correlation. Therefore, CR helps the model effectively align scene features between the source and target domains, improving the model’s generalization ability in cross-domain tasks.
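Equations (5) to (7) can be sketched in plain Python as follows. Several points here are assumptions and not statements from the paper: the mean $\mu_{f}$ is taken over the components of each vector, the indices $i$ and $j$ are read as selecting the $i$-th centroid of one set and the $j$-th centroid of the other, and the diagonal term is also clamped at $\epsilon$ so that the logarithm stays defined when a correlation is not positive. All names are hypothetical.

```python
import math

def corr(f1, f2):
    """Eq. (5): cosine similarity of the mean-centred vectors.

    Undefined (division by zero) if either vector is constant.
    """
    m1, m2 = sum(f1) / len(f1), sum(f2) / len(f2)
    d1 = [a - m1 for a in f1]
    d2 = [b - m2 for b in f2]
    num = sum(a * b for a, b in zip(d1, d2))
    den = math.sqrt(sum(a * a for a in d1)) * math.sqrt(sum(b * b for b in d2))
    return num / den

def cr_loss(set1, set2, eps=1e-6):
    """Eqs. (6)-(7) for two lists of N centroids (assumed pairing of i and j)."""
    n = len(set1)
    total = 0.0
    for i in range(n):
        for j in range(n):
            c = corr(set1[i], set2[j])
            # Diagonal: reward high correlation; the eps clamp here is an added assumption.
            # Off-diagonal: reward low correlation, clamped at eps as in Eq. (7).
            a = max(c, eps) if i == j else max(1.0 - c, eps)
            total += math.log(a)
    return -total / (n * n)

centroids_a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
aligned = cr_loss(centroids_a, centroids_a)
swapped = cr_loss(centroids_a, centroids_a[::-1])
print(aligned, swapped)
```

In this toy case the aligned pairing yields a much smaller loss than the swapped pairing, which matches the intent stated above: high correlation on the diagonal and low correlation off the diagonal.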

2.3. Transferability Analysis

The ability of a model to adapt to unseen domains with different characteristics is critical for effective domain adaptation in RSI segmentation. We analyze the transferability of the SCA model under varying conditions, including noise, resolution, scene complexity, and environmental factors such as cloud cover and contrast variations.

2.3.1. Noise Robustness

In real-world scenarios, remote sensing images are often degraded by noise, which can affect model performance. To assess the noise robustness of our model, we introduce a noise parameter $\sigma$ and model the noisy image $X_{noisy}$ as

$$X_{noisy} = X + \mathcal{N}(0, \sigma^{2}) \tag{8}$$

where $\mathcal{N}(0, \sigma^{2})$ represents Gaussian noise with zero mean and variance $\sigma^{2}$. Covariance regularization helps preserve feature consistency by minimizing the expected difference between the original and noisy features:

$$\mathbb{E}\left[ \| f(X) - f(X_{noisy}) \|_{2}^{2} \right] \leq K \sigma^{2} \tag{9}$$

Here, $K$ is a constant dependent on the model architecture. This ensures that the SCA model can maintain stable feature representations even in the presence of significant noise.
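The noise model of Equation (8) can be simulated with the standard library alone. The sketch below assumes pixel intensities scaled to $[0, 1]$ and applies no clipping after the noise is added, since the text does not specify either point. The function name is hypothetical.

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Eq. (8): add independent zero-mean Gaussian noise with std sigma to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]

rng = random.Random(0)                      # fixed seed for repeatability
clean = [[0.5] * 50 for _ in range(50)]     # flat toy image
noisy = add_gaussian_noise(clean, 0.05, rng)

# Empirical statistics of the injected noise.
residuals = [n - c for nr, cr in zip(noisy, clean) for n, c in zip(nr, cr)]
mean = sum(residuals) / len(residuals)
std = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
print(round(mean, 4), round(std, 4))
```

The empirical standard deviation of the residuals should be close to the chosen $\sigma$, which is a quick sanity check before evaluating a model on the perturbed images.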

2.3.2. Resolution Invariance

Remote sensing images are often captured at varying resolutions depending on sensor specifications or acquisition conditions. The SCA model addresses this challenge by employing a multi-scale feature extraction strategy, allowing it to generalize across different resolutions. For a given input image $X$, we generate scaled versions $\{X_{s}\}_{s=1}^{S}$, where $s$ represents the scale factor. The final feature representation is computed as

$$f_{multi}(X) = \frac{1}{S} \sum_{s=1}^{S} f(X_{s}) \tag{10}$$

Covariance regularization ensures alignment of features across different scales:

$$L_{scale} = \sum_{s_{1} \neq s_{2}} L_{CR}(f(X_{s_{1}}), f(X_{s_{2}})) \tag{11}$$

This multi-scale alignment enables the model to adapt to varying resolutions, a key requirement in real-world remote sensing applications.
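The structure of Equations (10) and (11) is sketched below in plain Python. The pairwise loss is passed in as a function. The squared distance used in the example is only a stand-in and not the covariance regularization loss, and the sum over $s_{1} \neq s_{2}$ is read literally as a sum over ordered pairs. The names are hypothetical.

```python
from itertools import permutations

def multi_scale_feature(features):
    """Eq. (10): element-wise mean of the per-scale feature vectors."""
    s = len(features)
    return [sum(vals) / s for vals in zip(*features)]

def scale_loss(features, pair_loss):
    """Eq. (11): sum of pair_loss over all ordered pairs of scales with s1 != s2."""
    return sum(pair_loss(a, b) for a, b in permutations(features, 2))

def squared_distance(a, b):
    """Stand-in pairwise loss for illustration only (not L_CR)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Feature vectors extracted from three hypothetical scaled copies of one image.
per_scale = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(multi_scale_feature(per_scale))
print(scale_loss([[0.0], [1.0], [3.0]], squared_distance))
```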

2.3.3. Scene Complexity Adaptation

Scene complexity in remote sensing images can vary significantly, from simple landscapes to highly heterogeneous environments. To handle these variations, we introduce a scene complexity measure $C(X)$ based on the entropy of local image patches:

$$C(X) = - \sum_{p \in P} \sum_{i} p_{i} \log p_{i} \tag{12}$$

where $P$ represents the set of local patches in the image, and $p_{i}$ is the normalized intensity histogram for each patch. The feature extraction process adapts to the scene complexity measure as follows:

$$f_{adapted}(X) = f(X) + \alpha \, C(X) \cdot g(X) \tag{13}$$

where $g(X)$ is an additional set of features designed to capture fine-grained details, and $\alpha$ is a learnable parameter. This mechanism allows the model to adjust dynamically to different levels of scene complexity, improving segmentation performance in complex environments.
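The entropy-based measure in Equation (12) can be illustrated with a short plain-Python sketch. The patch layout (non-overlapping square patches), the number of histogram bins, the natural logarithm, and the assumption that intensities lie in $[0, 1]$ are all illustrative choices that the text does not fix. The function names are hypothetical.

```python
import math

def patch_entropy(patch, num_bins=4):
    """Shannon entropy of a patch's normalised intensity histogram (values in [0, 1])."""
    counts = [0] * num_bins
    pixels = [v for row in patch for v in row]
    for v in pixels:
        counts[min(int(v * num_bins), num_bins - 1)] += 1
    probs = [c / len(pixels) for c in counts if c > 0]  # skip empty bins: 0 * log 0 := 0
    return -sum(p * math.log(p) for p in probs)

def scene_complexity(image, patch_size):
    """Eq. (12): sum of patch entropies over non-overlapping patches."""
    h, w = len(image), len(image[0])
    total = 0.0
    for r in range(0, h - patch_size + 1, patch_size):
        for c in range(0, w - patch_size + 1, patch_size):
            patch = [row[c:c + patch_size] for row in image[r:r + patch_size]]
            total += patch_entropy(patch)
    return total

flat = [[0.5] * 4 for _ in range(4)]
checker = [[0.0, 1.0, 0.0, 1.0],
           [1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0],
           [1.0, 0.0, 1.0, 0.0]]
print(scene_complexity(flat, 2), scene_complexity(checker, 2))
```

A uniform image gives zero complexity, while the checkerboard gives $\log 2$ per patch, so the measure grows with local heterogeneity as intended.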

2.3.4. Cloud Cover Compensation

Cloud cover poses a significant challenge in remote sensing, often obscuring important details in the imagery. To address this, we introduce a cloud detection module $D(X)$ that generates a cloud probability map. The feature extraction process is then modified as

$$f_{cloud}(X) = (1 - D(X)) \odot f(X) + D(X) \odot h(X) \tag{14}$$

where $h(X)$ represents cloud-specific features. Covariance regularization ensures smooth transitions between clear and cloudy regions, allowing the model to handle cloud interference effectively.
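Equation (14) is a per-position convex blend of two feature maps, weighted by the cloud probability. A minimal single-channel sketch is given below. The function name and the toy values are hypothetical, and the cloud detector itself is not modelled.

```python
def blend_cloud_features(clear_feat, cloud_feat, cloud_prob):
    """Eq. (14): (1 - D) * f + D * h, applied element-wise on single-channel maps."""
    return [[(1.0 - d) * f + d * h for f, h, d in zip(f_row, h_row, d_row)]
            for f_row, h_row, d_row in zip(clear_feat, cloud_feat, cloud_prob)]

f_map = [[2.0, 2.0, 2.0]]   # standard features f(X)
h_map = [[4.0, 4.0, 4.0]]   # cloud-specific features h(X)
d_map = [[0.0, 0.5, 1.0]]   # cloud probabilities D(X)
print(blend_cloud_features(f_map, h_map, d_map))
```

Where the probability is 0 the standard features pass through unchanged, where it is 1 they are fully replaced by the cloud-specific features, and intermediate values interpolate between the two.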

2.3.5. Contrast Normalization

Variation in image contrast is another challenge in remote sensing. To mitigate this, we apply a local contrast normalization (LCN) layer prior to feature extraction:

$$X_{LCN} = \frac{X - \mu_{local}}{\sqrt{\sigma_{local}^{2} + \epsilon}} \tag{15}$$

where $\mu_{local}$ and $\sigma_{local}$ represent the local mean and standard deviation computed over small image patches. This step ensures that the model is less sensitive to global contrast variations, improving the robustness of the feature extraction process.
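A plain-Python sketch of Equation (15) is given below. The square window, its radius, the truncation of windows at the image border, and the value of $\epsilon$ are assumptions made for illustration, and the function name is hypothetical.

```python
import math

def local_contrast_normalize(image, radius=1, eps=1e-6):
    """Eq. (15): subtract the local mean and divide by the local std (plus eps)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            # Square window around (r, c), truncated at the image border.
            window = [image[i][j]
                      for i in range(max(0, r - radius), min(h, r + radius + 1))
                      for j in range(max(0, c - radius), min(w, c + radius + 1))]
            mean = sum(window) / len(window)
            var = sum((v - mean) ** 2 for v in window) / len(window)
            row.append((image[r][c] - mean) / math.sqrt(var + eps))
        out.append(row)
    return out

pattern = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 1.0],
           [0.0, 1.0, 0.0]]
# Same pattern after a global gain and offset change (contrast and brightness shift).
shifted = [[2.0 * v + 1.0 for v in row] for row in pattern]
print(local_contrast_normalize(pattern)[1][1], local_contrast_normalize(shifted)[1][1])
```

The two printed values agree up to the small effect of $\epsilon$, which illustrates why the normalized input is less sensitive to global contrast changes.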

2.4. Training Procedure

The SCA model is trained using a stage-wise strategy designed to mitigate error propagation when generating pseudo-labels in the target domain. The total loss function is defined as

$$L = L_{CE}^{source} + \lambda_{1} L_{CE}^{target} + \lambda_{2} L_{ICR} + \lambda_{3} (L_{CCR} + L_{scale}) + \lambda_{4} L_{adapt} \tag{16}$$

Here, $L_{CE}^{source}$ and $L_{CE}^{target}$ represent the cross-entropy losses for the source and target domains, respectively. $L_{ICR}$ and $L_{CCR}$ denote the intra-domain and cross-domain covariance regularization losses. $L_{scale}$ enforces scale invariance, and $L_{adapt}$ accounts for scene complexity, cloud cover, and contrast normalization components. The hyperparameters $\lambda_{i}$ control the relative contribution of each term with their sum equal to 4.
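The weighting in Equation (16) can be written as a small helper. The default weights below are the values reported in the experimental settings (0.8, 0.8, 0.8, and 1.6), the individual loss values are placeholders, and the function name is hypothetical.

```python
def total_loss(ce_source, ce_target, icr, ccr, scale, adapt,
               lambdas=(0.8, 0.8, 0.8, 1.6)):
    """Eq. (16): weighted sum of the individual loss terms."""
    l1, l2, l3, l4 = lambdas
    return ce_source + l1 * ce_target + l2 * icr + l3 * (ccr + scale) + l4 * adapt

# Placeholder loss values, all set to 1.0 to expose the effective weights.
print(total_loss(1.0, 1.0, 1.0, 1.0, 1.0, 1.0))
print(sum((0.8, 0.8, 0.8, 1.6)))  # the weights are constrained to sum to 4
```

Because $\lambda_{3}$ multiplies the sum $L_{CCR} + L_{scale}$, those two terms always share one weight in this formulation.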

Optimization is carried out using stochastic gradient descent (SGD) with momentum. The parameter update rule is

$$\theta_{t+1} = \theta_{t} - \eta \left( \nabla L(\theta_{t}) + m (\theta_{t} - \theta_{t-1}) \right) \tag{17}$$

where $\eta$ represents the learning rate, $m$ is the momentum coefficient, and $\theta_{t}$ denotes the model parameters at iteration $t$. A polynomial learning rate decay schedule is used to ensure stable convergence:

$$\eta_{t} = \eta_{0} \left( 1 - \frac{t}{T} \right)^{p} \tag{18}$$

where $\eta_{0}$ is the initial learning rate, $T$ represents the total number of iterations, and $p$ denotes the decay power. This comprehensive training procedure, combined with the novel architectural innovations, allows the SCA model to achieve robust performance across a wide range of remote sensing tasks and domains.
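The polynomial decay schedule of Equation (18) is sketched below. The defaults $\eta_{0} = 0.01$ and $p = 0.9$ are the values reported in the experimental settings, while the total step count of 1000 used in the example is only an assumption about how $T$ is defined. Only the schedule is shown, not the parameter update.

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9):
    """Eq. (18): eta_t = eta_0 * (1 - t / T) ** p."""
    return base_lr * (1.0 - step / total_steps) ** power

# Learning rate at a few points of a hypothetical 1000-step run.
for step in (0, 250, 500, 750, 1000):
    print(step, poly_lr(step, 1000))
```

The rate starts at $\eta_{0}$, decreases monotonically, and reaches zero at $t = T$.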

3. Experiments and Results

In this section, we evaluate the effectiveness of the proposed scene covariance alignment (SCA) model for domain adaptation in remote sensing image (RSI) segmentation. Our experiments are designed to test the model’s ability to generalize across different domains with substantial variability in spatial resolution, noise levels, and environmental conditions. We compare our model against several state-of-the-art baselines and perform ablation studies to assess the contribution of each key component, including the scene feature pooling (SFP) module and covariance regularization (CR) mechanism. Furthermore, we analyze the model’s robustness to contrast, noise, resolution changes, and scene complexity.

3.1. Datasets

We evaluated the scene covariance alignment (SCA) model using two remote sensing datasets that present significant domain adaptation challenges due to differences in geographic location, sensor characteristics, and environmental conditions.

The LoveDA (A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. Retrieved from https://github.com/Junjue-Wang/LoveDA, accessed on 10 September 2024) dataset [34] is a large-scale land cover classification dataset, consisting of 5987 high-resolution remote sensing images (1024 × 1024 pixels, 0.3 m/pixel) from three Chinese cities: Nanjing, Changzhou, and Wuhan. The images provide three channels: Red, Green, and Blue (RGB), and cover seven land cover categories: Background, Building, Road, Water, Bare Land, Forest, and Agriculture. The dataset is divided into urban and rural scenes. The rural scene contains 2358 images, with 1366 used for training and 992 for testing. The urban scene contains 1833 images, with 1156 used for training and 677 for testing. The LoveDA dataset covers approximately 3000 square kilometers of land and exhibits rich intra-class and inter-class diversity.

The Yanqing dataset contains more than 500 high-resolution remote sensing images (2048 × 2048 pixels, 0.5 m/pixel) captured by the GaoJing-1 satellite over the Yanqing area in Beijing, China, covering an area of approximately 260 square kilometers. The images also provide three channels: Red, Green, and Blue, with radiometric calibration and atmospheric correction performed using the Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) algorithm [35]. A machine learning-based cloud detection algorithm is used to identify and mask cloud-covered areas. As our target domain, the Yanqing dataset presents unique geographical and environmental features compared to the LoveDA dataset, posing distinct challenges for domain adaptation.

For ground truth data, a team of remote sensing experts manually annotated 100 images from the Yanqing dataset for seven land cover classes consistent with the LoveDA dataset. We employed a rigorous cross-validation process, with each image independently annotated by two experts. Discrepancies were resolved through consensus discussions. To evaluate the model’s performance on the target domain, we randomly selected 20% of the annotated images as a held-out validation set.

To assess the SCA model’s robustness and transferability, we systematically augmented the LoveDA dataset and Yanqing dataset. We introduced two levels of Gaussian noise ($\sigma$ = 0.05, 0.1) to simulate sensor noise and atmospheric interference (Figure 2). Image contrast was altered using linear contrast stretching with values of −0.4, 0.4, 0.8, and 1.2 to mimic variations in illumination conditions (Figure 3). Furthermore, we generated 3/4, 1/2, and 1/4 resolution versions of the images using bilinear interpolation to evaluate the model’s performance at different spatial resolutions.

3.2. Experimental Settings

Experimental Environment. All models are implemented using the PyTorch framework, and all experiments are conducted in a Linux environment using an NVIDIA GeForce RTX 4090 24GB GPU.

Network Architecture and Training. The core of the SCA model builds on the DeepLabV2 framework with ResNet-50 as its backbone for feature extraction. All original images are cropped into 512 × 512 patches as input to the model, with three channels: R, G, and B. In the scene feature pooling module, the input is a feature map of size 2048 × 32 × 32. It is processed through pooling windows with sizes 1 × 1, 2 × 2, 3 × 3, and 6 × 6, followed by bilinear interpolation and feature fusion to obtain a consolidated feature map. During training, we used the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0005. The learning rate was initially set to 0.01, and a poly schedule with power 0.9 was applied. The hyperparameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ in the loss function are set to 0.8, 0.8, 0.8, and 1.6, respectively. In addition, we employ a staged process to train the model, preventing the accumulation of error-prone pseudo-labels generated during self-training. The maximum number of stages and iterations are set to 5 and 1000, respectively.

3.3. Evaluation Metrics

To rigorously assess the performance of the scene covariance alignment (SCA) model, we employed a comprehensive set of evaluation metrics. These metrics provide a multifaceted view of the model’s segmentation accuracy and its ability to generalize across diverse remote sensing scenes.

The primary metric used is the Mean Intersection over Union (mIoU), which quantifies the average overlap between the predicted segmentation and the ground truth across all categories. For a given class $i$, the IoU is calculated as

$$\mathrm{IoU}_{i} = \frac{| \mathrm{TP}_{i} |}{| \mathrm{TP}_{i} | + | \mathrm{FP}_{i} | + | \mathrm{FN}_{i} |} \tag{19}$$

where $\mathrm{TP}_{i}$, $\mathrm{FP}_{i}$, and $\mathrm{FN}_{i}$ represent the true positive, false positive, and false negative pixels for class $i$, respectively. The mIoU is then computed as the average IoU across all $N$ classes:

$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{IoU}_{i} \tag{20}$$

To complement the mIoU, we also report the Pixel Accuracy (PA), which provides a global measure of correctly classified pixels across the entire image. PA is defined as

$$\mathrm{PA} = \frac{\sum_{i=1}^{N} \mathrm{TP}_{i}}{\sum_{i=1}^{N} (\mathrm{TP}_{i} + \mathrm{FP}_{i})} \tag{21}$$
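Equations (19) to (21) can be computed from a confusion matrix, as in the following plain-Python sketch. The matrix layout (rows as ground-truth classes, columns as predicted classes) and the function names are assumptions for illustration. Classes that are absent from both the prediction and the ground truth would cause a division by zero and are not handled here.

```python
def per_class_counts(conf):
    """conf[i][j] = number of pixels with true class i predicted as class j."""
    n = len(conf)
    tp = [conf[i][i] for i in range(n)]
    fp = [sum(conf[r][i] for r in range(n)) - conf[i][i] for i in range(n)]
    fn = [sum(conf[i]) - conf[i][i] for i in range(n)]
    return tp, fp, fn

def miou(conf):
    """Eqs. (19)-(20): mean of the per-class IoU values."""
    tp, fp, fn = per_class_counts(conf)
    ious = [t / (t + p + f) for t, p, f in zip(tp, fp, fn)]
    return sum(ious) / len(ious)

def pixel_accuracy(conf):
    """Eq. (21): correctly classified pixels over all predicted pixels."""
    tp, fp, _ = per_class_counts(conf)
    return sum(tp) / (sum(tp) + sum(fp))

# Toy two-class confusion matrix with 10 pixels in total.
conf = [[3, 1],
        [1, 5]]
print(miou(conf), pixel_accuracy(conf))
```

Since every pixel is either a true positive or a false positive for its predicted class, the denominator of Equation (21) equals the total number of pixels.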

3.4. Baseline Methods

To evaluate the effectiveness of the scene covariance alignment (SCA) model, we conducted comparisons against five baseline methods, each representing a distinct approach to domain adaptation in remote sensing image segmentation. The first baseline is DeepLabV2 [36] with a ResNet-50 [37] backbone, a powerful semantic segmentation model without specific domain adaptation techniques. We also compare against the Domain-Adversarial Neural Network (DANN) [38], which employs adversarial learning to align feature distributions of source and target domains. The third is AdaptSegNet [39], which extends adversarial domain adaptation specifically to segmentation tasks by applying adversarial learning in the output space. The fourth is De-GLGAN [31], which designs an adaptive method based on high-low frequency decomposition and builds a global-local adversarial learning model based on this method. Finally, MemoryAdaptSegNet [40] incorporates an invariant feature memory module within adversarial learning to preserve domain-level information, overcoming the issue of insufficient pseudo-invariant features. These baselines provide a comprehensive framework for assessing the SCA model’s performance across diverse geographical and environmental conditions in remote sensing imagery.

3.5. Quantitative Results

Table 1 shows the performance comparison of the SCA model against the baselines for the domain adaptation tasks from LoveDA to Yanqing District.

The results demonstrate that the SCA model significantly outperforms baseline models in the domain adaptation task from LoveDA to Yanqing District. The SCA model achieves a 5.5% improvement in mIoU compared to AdaptSegNet, indicating its superior generalization capabilities across different geographic and environmental conditions.

3.6. Robustness to Contrast, Noise and Resolution Variations

As shown in Figure 4 and Figure 5, compared to the best baseline model (AdaptSegNet), the SCA model demonstrates better robustness to contrast and noise. It maintains higher mIoU scores in most cases, even under severe contrast changes and significant noise level differences.

We also tested the model’s robustness to resolution changes by downsampling the target domain images. Table 2 shows the mIoU results under different downsampling factors for the LoveDA to Yanqing District task.

The SCA model consistently outperforms the baselines across all downsampling factors, demonstrating its robustness to resolution changes, which is particularly relevant for handling remote sensing data from different sensors or acquisition conditions.

3.7. Ablation Studies

To demonstrate and quantify the contributions of each component within the SCA model to the unsupervised domain adaptation semantic segmentation task on remote sensing images, we conducted ablation studies. In three separate ablation experiments, we respectively removed the scene feature pooling module, the covariance regularization mechanism, and the multi-scale feature pooling operations from the SCA model. Table 3 presents the results of the ablation studies for the LoveDA to Yanqing District task.

The ablation study results show that the scene feature pooling module, the covariance regularization mechanism, and the multi-scale feature pooling operations play indispensable roles in the model. Removing these components significantly degrades the model’s performance. In particular, the scene feature pooling module is crucial for enabling the SCA model to adapt to complex and dynamic scene-level features.

In addition, we conducted experiments on the selection of the loss function hyperparameters λ_{i}. Table 4 presents the results of this experiment for the LoveDA to Yanqing District task. We observed that increasing the weight of the target domain prediction loss or the intra-domain covariance alignment loss, while reducing the weights of the other losses, decreased the final segmentation mIoU compared to using equal weights. In contrast, increasing the weight of the cross-domain covariance alignment loss, multi-scale loss, and complex scene loss improved the final segmentation mIoU. As shown in Table 4, the best result (63.1 mIoU) was achieved when λ_{1} = 0.8, λ_{2} = 0.8, λ_{3} = 0.8, and λ_{4} = 1.6. We therefore suggest that the weights of these latter losses can be increased appropriately. This indicates that the multi-scale feature pooling and covariance regularization mechanisms play an indispensable role in the model.
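The role of the weights λ_{i} can be illustrated with a minimal sketch of a weighted loss sum. It assumes that the weighted part of the objective has the form Σ_i λ_{i} L_{i}. Which loss each λ_{i} multiplies is defined in the training procedure and is not restated here, so the four loss values below are hypothetical placeholders.

```python
def weighted_total(lambdas, loss_terms):
    """Combine loss terms as sum_i lambda_i * L_i."""
    if len(lambdas) != len(loss_terms):
        raise ValueError("need exactly one weight per loss term")
    return sum(w * term for w, term in zip(lambdas, loss_terms))


# Weight settings taken from Table 4: equal weights and the best-performing row.
EQUAL_WEIGHTS = (1.0, 1.0, 1.0, 1.0)
BEST_WEIGHTS = (0.8, 0.8, 0.8, 1.6)

if __name__ == "__main__":
    # Hypothetical loss values for a single training step (not measured data).
    step_losses = (0.50, 0.20, 0.30, 0.10)
    print("equal weights:", weighted_total(EQUAL_WEIGHTS, step_losses))
    print("best weights: ", weighted_total(BEST_WEIGHTS, step_losses))
```

In a PyTorch training loop the same expression would be applied to scalar loss tensors before calling `backward()`.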

3.8. Qualitative Results

Figure 6 provides a visual comparison of the segmentation performance of the SCA model at different contrast levels. The SCA model demonstrates strong generalization capabilities, with the semantic segmentation results gradually improving as the contrast increases from −0.4 to +0.8, featuring clearer category boundaries and richer details. However, over-segmentation occurs when the contrast reaches +1.2.

Figure 7 highlights the model’s performance across different image resolutions. The SCA model produces clear and accurate segmentations across most resolution levels. However, because urban images contain a large amount of fine object detail, segmentation quality degrades significantly at ultra-low resolutions.

Figure 8 illustrates the segmentation results across different scenarios, including complex urban and rural environments. The SCA model exhibits more spatially coherent segmentations, particularly in heterogeneous landscapes.

These visual results strongly support the quantitative findings, further demonstrating the SCA model’s ability to produce accurate and robust segmentations across a wide range of challenging scenes, including low contrast, varying resolutions, and complex scene structures.

The experimental results presented in this work demonstrate the effectiveness of the SCA model in addressing the challenges of domain adaptation in remote sensing image (RSI) segmentation. The SCA model outperforms state-of-the-art baseline models in most cases, showcasing its ability to generalize across diverse target domains with varying noise levels, spatial resolutions, and environmental conditions.

3.9. Model Inference

Inference time, as part of model performance, is a critical consideration for actual deployment. Using an NVIDIA (Santa Clara, CA, USA) GeForce RTX 4090 24G and PyTorch 1.10, we conducted multiple inference tests on the LoveDA and Yanqing datasets. The results show that it takes an average of 0.489 s to infer a single 1024 × 1024 RGB remote sensing image. Additionally, we conducted the same inference tests on a 12 vCPU Intel(R) (Santa Clara, CA, USA) Xeon(R) Platinum 8352V CPU@2.10 GHz, which yielded an average inference time of 25.245 s.

The requirements for inference time in real-time applications depend on the specific definition of real-time and the application scenario. For tasks with non-continuous input, such as user-triggered analyses, this inference time is acceptable. However, for real-time video stream processing, this inference time is clearly inadequate.
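The reported per-image latencies can be converted into throughput to make the real-time argument explicit. The sketch below uses the two measured averages from this section. The 25 frames-per-second target is an assumed example of a video-stream requirement and is not a figure from the experiments.

```python
GPU_SECONDS_PER_IMAGE = 0.489   # RTX 4090 average reported in this section
CPU_SECONDS_PER_IMAGE = 25.245  # 12 vCPU Xeon Platinum 8352V average


def images_per_second(seconds_per_image):
    """Throughput implied by a per-image latency (sequential processing)."""
    return 1.0 / seconds_per_image


def meets_frame_rate(seconds_per_image, target_fps):
    """True if sequential inference keeps up with the given frame rate."""
    return images_per_second(seconds_per_image) >= target_fps


if __name__ == "__main__":
    print(f"GPU: {images_per_second(GPU_SECONDS_PER_IMAGE):.2f} images/s")
    print(f"CPU: {images_per_second(CPU_SECONDS_PER_IMAGE):.3f} images/s")
    print(f"GPU speed-up: {CPU_SECONDS_PER_IMAGE / GPU_SECONDS_PER_IMAGE:.1f}x")
    # Assumed target: a 25 frames-per-second video stream.
    print("keeps up with 25 fps:", meets_frame_rate(GPU_SECONDS_PER_IMAGE, 25))
```

At roughly two 1024 × 1024 images per second on the GPU, the model is suited to on-demand analysis but falls well short of typical video frame rates.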

4. Discussion

In this section, we discuss the key contributions of the SCA model and the broader implications of our findings while also exploring potential areas for future research.

4.1. Effectiveness of Scene-Level Feature Alignment

A central contribution of the SCA model is the introduction of scene-level feature alignment of ground object categories through the scene feature pooling (SFP) module and covariance regularization (CR) mechanism. The results from our ablation studies clearly indicate that both components are essential for achieving robust performance across different domains. By pooling scene-specific features and aligning their covariance across domains, the model ensures that critical scene-level characteristics are consistently captured, even in the presence of domain shifts. This is particularly important in remote sensing applications where geographic diversity and environmental variability can result in significant differences between source and target domains.
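The idea of aligning covariance across domains can be sketched with a generic second-order alignment loss, in the spirit of correlation-alignment methods. This is a simplified illustration under stated assumptions: features are plain lists of per-scene vectors, and the loss is the squared Frobenius distance between the source and target covariance matrices. The exact CR formulation of the SCA model may differ, for example in normalization or scene weighting.

```python
def covariance(features):
    """Sample covariance matrix of row vectors (n samples x d channels)."""
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in features:
        centered = [row[j] - means[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b] / (n - 1)
    return cov


def covariance_alignment_loss(source_feats, target_feats):
    """Squared Frobenius distance between the two covariance matrices."""
    cs = covariance(source_feats)
    ct = covariance(target_feats)
    d = len(cs)
    return sum((cs[a][b] - ct[a][b]) ** 2 for a in range(d) for b in range(d))


if __name__ == "__main__":
    # Toy pooled scene features (hypothetical values, 3 scenes x 2 channels).
    source = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    target = [[2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
    print(covariance_alignment_loss(source, source))  # identical statistics
    print(covariance_alignment_loss(source, target))  # shifted statistics
```

The loss is zero when both domains share the same second-order statistics and grows as the feature spread of the target domain departs from that of the source.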

Existing adversarial and self-training methods often focus on aligning global feature distributions and typically do not consider how the features of ground object categories vary across different scenes. The SCA model goes a step further by explicitly targeting scene-level features. This finer granularity in feature alignment ensures that the model can handle complex and heterogeneous environments more effectively, as evidenced by the model’s superior performance on both the LoveDA Rural to Urban and LoveDA Rural to Yanqing tasks.

4.2. Robustness to Contrast, Noise and Resolution Changes

The SCA model exhibits significant robustness to contrast and noise, as demonstrated by our experiments applying contrast and noise to both source and target domains. Covariance regularization plays a crucial role in this robustness, maintaining feature consistency even in environments with high contrast differences and high noise levels. This is particularly important for real-world remote sensing applications, where images are often affected by sensor noise, atmospheric interference, environmental lighting, or adverse weather conditions. The model’s ability to maintain high segmentation accuracy under such conditions makes it a strong candidate for practical deployment in challenging operational environments.
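The perturbations used in these robustness experiments can be sketched as follows. The noise levels (σ of 0.0, 0.05, and 0.1) and contrast values (−0.4 to 1.2) are those listed in the captions of Figures 2 and 3. The specific formulas are assumptions: pixel intensities are taken to lie in [0, 1], and the linear contrast stretch is modeled as scaling deviations from the image mean by a factor of 1 + c.

```python
import random


def clip01(x):
    """Clamp a value to the valid intensity range [0, 1]."""
    return min(1.0, max(0.0, x))


def adjust_contrast(pixels, c):
    """Linear stretch around the mean; c = 0.0 leaves the image unchanged."""
    mean = sum(pixels) / len(pixels)
    return [clip01(mean + (1.0 + c) * (p - mean)) for p in pixels]


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [clip01(p + rng.gauss(0.0, sigma)) for p in pixels]


if __name__ == "__main__":
    pixels = [0.2, 0.4, 0.6, 0.8]  # toy one-dimensional "image"
    for c in (-0.4, 0.0, 0.4, 0.8, 1.2):
        print("contrast", c, [round(p, 2) for p in adjust_contrast(pixels, c)])
    for sigma in (0.0, 0.05, 0.1):
        print("sigma", sigma, [round(p, 2) for p in add_gaussian_noise(pixels, sigma)])
```

Under this assumed formula, negative contrast values compress intensities toward the mean, while large positive values push the darkest and brightest pixels to the limits of the range.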

Similarly, the SCA model adopts a multi-scale feature extraction strategy to ensure robustness against resolution changes. This approach enables the model to learn scene features across multiple scales—from local to global—effectively adapting to common spatial resolution variations encountered in real-world applications. This is particularly important when dealing with remote sensing data from different sensors or platforms, as it reduces the amount of preprocessing required and mitigates the associated loss of information.
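A minimal sketch of multi-scale pooling is given below. It applies non-overlapping average pooling with several window sizes to a single-channel feature map, so that small windows preserve local detail and large windows summarize the whole scene. The window sizes (1, 2, 4) and the use of average pooling are assumptions made for illustration and are not the configuration of the SCA model.

```python
def average_pool(feature_map, window):
    """Non-overlapping average pooling over a 2D list.

    Assumes the height and width are divisible by the window size.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, h, window):
        row = []
        for left in range(0, w, window):
            block = [feature_map[i][j]
                     for i in range(top, top + window)
                     for j in range(left, left + window)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multi_scale_pool(feature_map, windows=(1, 2, 4)):
    """Return one pooled map per window size, from local to global."""
    return {k: average_pool(feature_map, k) for k in windows}


if __name__ == "__main__":
    fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    for window, pooled in multi_scale_pool(fmap).items():
        print(f"window {window}: {pooled}")
```

Features pooled at coarser scales change less when the input resolution changes, which is the intuition behind using several scales together.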

4.3. Handling Scene Complexity and Environmental Variability

The complexity of remote sensing images, particularly in terms of category distribution differences and category feature distribution differences, poses significant challenges for segmentation models. The SCA model addresses this by incorporating a scene complexity measure, which allows the model to adapt its feature extraction process dynamically based on local scene characteristics. This adaptive mechanism is crucial for improving segmentation performance in scenes with varying levels of complexity, such as densely built urban areas or heterogeneous agricultural landscapes.
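As a rough illustration of what a scene complexity measure can capture, the sketch below uses the Shannon entropy of the class distribution within a local patch. This is a hypothetical proxy chosen for clarity, not the measure defined for the SCA model: a patch covered by a single class has zero entropy, and a patch with many equally frequent classes has high entropy.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a patch."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical per-pixel labels for two small patches.
    farmland_patch = ["agriculture"] * 8
    urban_patch = ["building", "road", "building", "water",
                   "road", "forest", "building", "barren"]
    print("homogeneous patch:", class_entropy(farmland_patch))
    print("mixed urban patch:", round(class_entropy(urban_patch), 3))
```

A score of this kind could be used to weight or route local regions, so that heterogeneous patches receive more attention during alignment than homogeneous ones.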

4.4. Implications for Remote Sensing Applications

The strong performance of the SCA model across multiple domain adaptation tasks has significant implications for a wide range of remote sensing applications, including land-use classification, urban planning, disaster management, and environmental monitoring. The ability to generalize across domains without labeled data from the target domain reduces the reliance on extensive and costly ground truth labeling efforts, making it more feasible to deploy remote sensing segmentation models in new geographic regions or under changing environmental conditions.

Furthermore, the model’s robustness to noise, resolution variations, and environmental factors such as contrast and scene complexity enhances its practicality in real-world scenarios. Remote sensing applications often involve data from various sensors, acquired under different conditions, and the SCA model’s ability to adapt to these variations ensures that it can be applied in diverse operational contexts without significant performance degradation.

4.5. Under Low-Light and Noisy Conditions

In practical applications of remote sensing image segmentation, the imagery is often degraded by factors such as insufficient illumination, sensor noise, weather conditions, and blurring of target objects. In particular, remote sensing images captured in low-light environments suffer from reduced contrast and color saturation, along with increased sensor noise, which makes ground object features considerably harder to distinguish and resolve. Additionally, texture and boundary information may be overwhelmed by noise.

Under such conditions, the performance of the SCA model may decline. However, the multi-scale feature extraction and covariance regularization mechanisms employed by the model still offer certain advantages. Multi-scale feature extraction can identify useful information at different scales. Smaller pooling windows help capture local texture and boundary features, even when these features are disrupted by noise, while larger scales emphasize more stable global scene structural information. By fusing information across multi-scale feature spaces, the model can still extract a degree of discriminative features under low-light and high-noise conditions.

To mitigate the effects of low-light and high-noise conditions on the imagery, image enhancement structures [41,42,43] (e.g., denoising, deblurring, and contrast enhancement) can be incorporated before the model’s encoder. This improves the quality of the input data, enabling the SCA model to better align features and perform segmentation effectively.

4.6. Limitations and Future Work

While the SCA model achieves strong results, there are several areas that warrant further investigation. First, although the model handles scene complexity effectively, its performance could be further improved by integrating additional weather or atmospheric correction modules. This would allow the model to better handle extreme environmental conditions, such as heavy fog, snow, or seasonal vegetation changes.

Another limitation lies in the reliance on pseudo-labeling during unsupervised domain adaptation. While this technique helps improve performance on the target domain, errors in pseudo-label generation can propagate through the model and impact overall accuracy. Future work could explore more robust pseudo-labeling strategies or semi-supervised learning techniques to mitigate these errors.
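One common way to limit the propagation of pseudo-label errors is to keep only high-confidence predictions and mark the rest as ignored during training. The sketch below shows this generic confidence-thresholding strategy. It is not the pseudo-labeling procedure of the SCA model, and the threshold of 0.9 and the ignore index of 255 are assumed values.

```python
def select_pseudo_labels(probabilities, threshold=0.9, ignore_index=255):
    """Keep the arg-max class only where its probability reaches the threshold.

    `probabilities` holds one list of class probabilities per pixel.
    Pixels below the threshold receive `ignore_index` so that a loss
    function can skip them.
    """
    labels = []
    for probs in probabilities:
        best = max(range(len(probs)), key=lambda k: probs[k])
        labels.append(best if probs[best] >= threshold else ignore_index)
    return labels


if __name__ == "__main__":
    # Hypothetical softmax outputs for three target-domain pixels.
    predictions = [
        [0.95, 0.03, 0.02],  # confident: class 0
        [0.40, 0.35, 0.25],  # ambiguous: ignored
        [0.05, 0.92, 0.03],  # confident: class 1
    ]
    print(select_pseudo_labels(predictions))
```

Raising the threshold trades coverage for label quality: fewer target pixels contribute to training, but those that do are less likely to be wrong.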

Lastly, while the model has been tested on aerial and satellite imagery, its applicability to other forms of remote sensing data, such as hyperspectral or radar imagery, remains to be explored. Adapting the SCA model to these modalities could open new avenues for domain adaptation in even more challenging remote sensing applications.

5. Conclusions

The scene covariance alignment (SCA) model introduces a transformative approach to domain adaptation in remote sensing by emphasizing scene-level feature alignment. Central to its innovation are the scene feature pooling (SFP) module and covariance regularization (CR) mechanism, which collaboratively ensure robust performance across domains by aligning critical scene-level characteristics. This approach addresses challenges inherent in remote sensing, such as geographic diversity and environmental variability, and outperforms traditional adversarial and self-training methods that focus on global feature alignment.

This strong adaptability has important implications for practical applications in remote sensing, such as land-use classification, urban planning, disaster management, and environmental monitoring. The SCA model’s ability to generalize across domains without labeled target data reduces the reliance on costly ground-truth annotations, enabling deployment in new geographic regions or under varying environmental conditions. Its robustness to operational challenges makes it a promising tool for real-world remote sensing tasks, where data variability is inevitable.

While the SCA model sets a new benchmark in domain adaptation, areas for further improvement remain. Integrating weather and atmospheric correction modules could enhance performance under extreme conditions, such as heavy fog or seasonal vegetation changes. Additionally, refining pseudo-labeling techniques or exploring semi-supervised approaches could mitigate label noise and further enhance target domain performance. Expanding the model’s applicability to other remote sensing modalities, such as hyperspectral or radar imagery, presents another promising direction, potentially extending its impact to a broader range of challenging remote sensing applications.

In summary, the SCA model represents a significant step forward in robust and scalable remote sensing segmentation. Its innovative design not only addresses the complexity of scene-level feature variations but also establishes a framework for future research in domain adaptation, paving the way for broader, more effective deployment of remote sensing technologies.

Author Contributions

Conceptualization, S.W. and K.C. (Kangjian Cao); methodology, S.W.; software, K.C. (Kangjian Cao); validation, Z.W. and K.C. (Kexin Chen); formal analysis, S.W.; investigation, K.C. (Kangjian Cao); resources, R.C.; data curation, K.C. (Kexin Chen); writing—original draft preparation, K.C. (Kangjian Cao); writing—review and editing, S.W. and R.C.; visualization, Z.W.; supervision, F.X.; project administration, F.X.; funding acquisition, F.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, grant number 2022YFF1302700; the Emergency Open Competition Project of the National Forestry and Grassland Administration, grant number 202303; and the Outstanding Youth Team Project of Central Universities, grant number QNTD202308.

Data Availability Statement

The data presented in this study are available on request from the corresponding author because the data are part of an ongoing study.

Acknowledgments

We would like to express our sincere gratitude to the anonymous reviewers and the editorial team for their valuable feedback and insightful comments, which have significantly contributed to improving the quality of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Janga, B.; Asamani, G.P.; Sun, Z.; Cristea, N. A review of practical AI for remote sensing in earth sciences. Remote Sens. 2023, 15, 4112.
2. Yue, J.; Fang, L.; Ghamisi, P.; Xie, W.; Li, J.; Chanussot, J.; Plaza, A. Optical remote sensing image understanding with weak supervision: Concepts, methods, and perspectives. IEEE Geosci. Remote Sens. Mag. 2022, 10, 250–269.
3. Zhao, J.; Zhong, Y.; Hu, X.; Wei, L.; Zhang, L. A robust spectral-spatial approach to identifying heterogeneous crops using remote sensing imagery with high spectral and spatial resolutions. Remote Sens. Environ. 2020, 239, 111605.
4. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716.
5. Hossain, M.D.; Chen, D. Segmentation for Object-Based Image Analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote Sens. 2019, 150, 115–134.
6. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 102–118.
7. Peng, J.; Huang, Y.; Sun, W.; Chen, N.; Ning, Y.; Du, Q. Domain adaptation in remote sensing image classification: A survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9842–9859.
8. Tuia, D.; Persello, C.; Bruzzone, L. Domain adaptation for the classification of remote sensing data: An overview of recent advances. IEEE Geosci. Remote Sens. Mag. 2016, 4, 41–57.
9. Xu, M.; Wu, M.; Chen, K.; Zhang, C.; Guo, J. The eyes of the gods: A survey of unsupervised domain adaptation methods based on remote sensing data. Remote Sens. 2022, 14, 4380.
10. Chen, X.; Pan, S.; Chong, Y. Unsupervised domain adaptation for remote sensing image semantic segmentation using region and category adaptive domain discriminator. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4412913.
11. Bai, L.; Du, S.; Zhang, X.; Wang, H.; Liu, B.; Ouyang, S. Domain adaptation for remote sensing image semantic segmentation: An integrated approach of contrastive learning and adversarial learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628313.
12. Li, W.; Gao, H.; Su, Y.; Momanyi, B.M. Unsupervised domain adaptation for remote sensing semantic segmentation with transformer. Remote Sens. 2022, 14, 4942.
13. Zhang, J.; Xu, S.; Sun, J.; Ou, D.; Wu, X.; Wang, M. Unsupervised adversarial domain adaptation for agricultural land extraction of remote sensing images. Remote Sens. 2022, 14, 6298.
14. He, Z.; Xia, K.; Ghamisi, P.; Hu, Y.; Fan, S.; Zu, B. Hypervitgan: Semisupervised generative adversarial network with transformer for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 6053–6068.
15. Tu, J.; Mei, G.; Ma, Z.; Piccialli, F. SWCGAN: Generative adversarial network combining swin transformer and CNN for remote sensing image super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5662–5673.
16. Wang, L.; Xiao, P.; Zhang, X.; Chen, X. A fine-grained unsupervised domain adaptation framework for semantic segmentation of remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 4109–4121.
17. Li, J.; Zi, S.; Song, R.; Li, Y.; Hu, Y.; Du, Q. A stepwise domain adaptive segmentation network with covariate shift alleviation for remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
18. Zhao, Y.; Guo, P.; Sun, Z.; Chen, X.; Gao, H. ResiDualGAN: Resize-residual DualGAN for cross-domain remote sensing images semantic segmentation. Remote Sens. 2023, 15, 1428.
19. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232.
20. Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935.
21. Zou, Y.; Yu, Z.; Kumar, B.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305.
22. Tasar, O.; Happy, S.; Tarabalka, Y.; Alliez, P. ColorMapGAN: Unsupervised domain adaptation for semantic segmentation using color mapping generative adversarial networks. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7178–7193.
23. Zhang, J.; Liu, J.; Pan, B.; Shi, Z. Domain adaptation based on correlation subspace dynamic distribution alignment for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7920–7930.
24. Ma, X.; Zhang, X.; Wang, Z.; Pun, M.O. Unsupervised domain adaptation augmented by mutually boosted attention for semantic segmentation of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15.
25. Ni, H.; Liu, Q.; Guan, H.; Tang, H.; Chanussot, J. Category-level assignment for cross-domain semantic segmentation in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16.
26. Ning, Y.; Peng, J.; Liu, Q.; Sun, W.; Du, Q. Domain Invariant and Compact Prototype Contrast Adaptation for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14.
27. Ran, L.; Wang, L.; Zhuo, T.; Xing, Y.; He, H.; Zhang, Y. Ddf: A novel dual-domain image fusion strategy for remote sensing image semantic segmentation with unsupervised domain adaptation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13.
28. Song, S.; Yu, H.; Miao, Z.; Zhang, Q.; Lin, Y.; Wang, S. Domain adaptation for convolutional neural networks-based remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1324–1328.
29. Luo, Y.; Zheng, L.; Guan, T.; Yu, J.; Yang, Y. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2507–2516.
30. Wang, H.; Shen, T.; Zhang, W.; Duan, L.Y.; Mei, T. Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 642–659.
31. Ma, X.; Zhang, X.; Ding, X.; Pun, M.O.; Ma, S. Decomposition-based Unsupervised Domain Adaptation for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18.
32. Liu, Y.; Kang, X.; Huang, Y.; Wang, K.; Yang, G. Unsupervised domain adaptation semantic segmentation for remote-sensing images via covariance attention. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
33. Wu, L.; Lu, M.; Fang, L. Deep covariance alignment for domain adaptive remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
34. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733.
35. Anderson, G.P.; Felde, G.W.; Hoke, M.L.; Ratkowski, A.J.; Cooley, T.W.; Chetwynd Jr, J.H.; Gardner, J.; Adler-Golden, S.M.; Matthew, M.W.; Berk, A.; et al. MODTRAN4-based atmospheric correction algorithm: FLAASH (fast line-of-sight atmospheric analysis of spectral hypercubes). In Proceedings of the Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery VIII; SPIE: Bellingham, WA, USA, 2002; Volume 4725, pp. 65–71.
36. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
38. Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M. Domain-adversarial neural networks. arXiv 2014, arXiv:1412.4446.
39. Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7472–7481.
40. Zhu, J.; Guo, Y.; Sun, G.; Yang, L.; Deng, M.; Chen, J. Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level prototype memory. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–18.
41. Duong, M.T.; Lee, S.; Hong, M.C. Dmt-net: Deep multiple networks for low-light image enhancement based on retinex model. IEEE Access 2023, 11, 132147–132161.
42. Duong, M.T.; Lee, S.; Hong, M.C. Learning to Concurrently Brighten and Mitigate Deterioration in Low-Light Images. IEEE Access 2024, 12, 132891–132903.
43. Duong, M.T.; Nguyen Thi, B.T.; Lee, S.; Hong, M.C. Multi-Branch Network for Color Image Denoising Using Dilated Convolution and Attention Mechanisms. Sensors 2024, 24, 3608.

Figure 1. The framework of the scene covariance alignment (SCA).

Figure 2. The LoveDA dataset under different noise conditions. Subplots (a–c) introduce Gaussian noise with σ values of 0.0, 0.05, and 0.1 to simulate sensor noise and atmospheric interference.

Figure 3. The LoveDA dataset under different contrast conditions. Subplots (a–e) apply linear contrast stretches with values of −0.4, 0.0, 0.4, 0.8, and 1.2, to adjust the image contrast.

Figure 4. Comparison of model adaptation performance when the training and evaluation datasets have different contrast levels.

Figure 5. Comparison of model adaptation performance when the training and evaluation datasets have different noise levels.

Figure 6. Comparison of model segmentation effects at different contrasts. The upper part shows the same remote sensing image processed with varying contrast levels, increasing gradually from left to right. The lower part displays the segmentation results of the SCA model.

Figure 7. Comparison of model segmentation effects at different resolutions. The upper part shows the same remote sensing image processed with decreasing resolution levels from left to right. The lower part displays the segmentation results of the SCA model.

Figure 8. Visual comparison of model segmentation effects under different scenarios.

Table 1. Performance comparison between baseline models and the SCA model for the LoveDA → Yanqing District domain adaptation task.

Method	LoveDA → Yanqing (mIoU)	LoveDA → Yanqing (PA)
DeepLabV2 (ResNet-50)	48.3	73.1
DANN	54.2	76.4
AdaptSegNet	57.6	78.9
De-GLGAN	60.3	80.7
MemoryAdaptSegNet	62.9	80.1
SCA (Ours)	63.1	82.5

Table 2. mIoU results under different downsampling factors for the LoveDA → Yanqing District task.

Downsampling Factor	DeepLabV2	DANN	AdaptSegNet	SCA (Ours)
1× (Original)	48.3	54.2	57.6	63.1
0.75×	45.2	51.0	54.5	60.3
0.5×	41.8	47.5	50.6	56.4
0.25×	37.9	42.8	45.2	51.9

Table 3. Ablation study results for the LoveDA → Yanqing District task.

Method	mIoU	PA
SCA (Full Model)	63.1	82.5
Without SFP	57.3	77.6
Without Covariance Regularization (CR)	58.5	79.2
Without Multi-Scale Features	59.0	80.1

Table 4. mIoU results under different loss function hyperparameters λ_{i} for the LoveDA → Yanqing District task.

λ_{1}	λ_{2}	λ_{3}	λ_{4}	mIoU
1.0	1.0	1.0	1.0	62.2
1.5	0.8	0.8	0.9	62.1
2.0	0.6	0.6	0.8	60.3
0.8	1.5	0.8	0.9	61.3
0.6	2.0	0.6	0.8	58.6
0.8	0.8	1.5	0.9	61.0
0.6	0.6	2.0	0.8	62.9
0.8	0.8	0.8	1.6	63.1
0.6	0.6	1.0	2.0	62.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

