Elevation-Aware Domain Adaptation for Semantic Segmentation of Aerial Images

1 Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China
2 Land Satellite Remote Sensing Application Center, Ministry of Natural Resources of P.R. China, Beijing 100048, China
3 China Yangtze Power Co., Ltd., Yichang 443000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2529; https://doi.org/10.3390/rs17142529
Submission received: 8 May 2025 / Revised: 22 June 2025 / Accepted: 27 June 2025 / Published: 21 July 2025

Abstract

Recent advancements in Earth observation technologies have accelerated remote sensing (RS) data acquisition, yet cross-domain semantic segmentation remains challenged by domain shifts. Traditional unsupervised domain adaptation (UDA) methods often rely on computationally intensive and unstable generative adversarial networks (GANs). This study introduces elevation-aware domain adaptation (EADA), a multi-task framework that integrates elevation estimation (via digital surface models) with semantic segmentation to address distribution discrepancies. EADA employs a shared encoder and task-specific decoders, enhanced by a spatial attention-based feature fusion module. Experiments on the Potsdam and Vaihingen datasets under cross-domain settings (e.g., Potsdam IRRG → Vaihingen IRRG) show that EADA achieves state-of-the-art performance, with a mean IoU of 54.62% and an F1-score of 65.47%, outperforming single-stage baselines. Elevation awareness significantly improves the segmentation of height-sensitive classes, such as buildings, while maintaining computational efficiency. Compared with multi-stage approaches, EADA's end-to-end design reduces training complexity without sacrificing accuracy. These results demonstrate that incorporating elevation data effectively mitigates domain shifts in RS imagery. However, the lower accuracy on elevation-insensitive classes suggests the need for further refinement to enhance overall generalizability.

1. Introduction

In recent years, with the advancement of Earth observation technologies, diversification of imaging methods, and enhanced capabilities for acquiring remote sensing data, the constraints on Earth observation have been gradually reduced. As a result, remote sensing data has exhibited trends toward diversification and massive volume. The speed of data acquisition has accelerated, update cycles have shortened, timeliness has improved significantly, and both the complexity and degree of personalization are increasing. These characteristics strongly reflect the traits of big data [1,2]. Embedded within remote sensing big data lies enormous social, economic, and scientific value. However, in stark contrast to the rapidly advancing data acquisition capabilities is the relatively underdeveloped capacity for remote sensing information processing. This imbalance is particularly prominent in the context of high-resolution remote sensing imagery. As the most common unstructured data format in remote sensing, images play a critical role in intelligent analysis by enabling automatic interpretation and thematic information extraction. Therefore, developing theories and techniques for automated analysis and information mining tailored to remote sensing big data has become one of the frontier areas in international remote sensing science and technology.
Currently, with the rapid progress in high-resolution remote sensing image analysis technologies, related fields such as computer vision, artificial intelligence, and cognitive science in the domain of machine vision have also seen accelerated development. These technological advancements provide robust support for the automatic analysis and interpretation of high-resolution remote sensing images. Remote sensing image analysis is now transitioning from a computation-centric paradigm to a data-driven scientific paradigm [3,4]. Among the emerging approaches, image semantic segmentation (ISS) based on deep learning has gained significant attention as a research focus.
With the continuous development of monocular depth estimation in computer vision, deep neural networks are increasingly recognized for their ability to perceive spatial distance. Building on depth estimation research, some scholars have introduced these techniques into remote sensing elevation estimation, attempting to use end-to-end convolutional neural networks to predict elevation from single-view remote sensing images and thereby rapidly acquire three-dimensional models of a scene. For instance, Srivastava [5] employed a standard encoder–decoder architecture as the neural network framework for elevation prediction, extracting digital surface models (DSMs) of urban landscapes. The model incorporated two prediction branches for multi-task learning: semantic labels of surface features served as a supervision branch for semantic segmentation, and a joint loss function combining the semantic segmentation loss with the mean squared error of elevation was used during training. This multi-task setup allowed the two tasks to complement each other, improving the performance of both. Building on fully convolutional networks, Mou [6] introduced skip connections from the encoder's shallow layers into the decoder's deconvolution layers, using spatial edge features from early encoding layers to sharpen the boundaries of elevation predictions and thereby optimize the edge details of the DSM outputs. Amirkolaee [7] proposed an elevation-aware network based on U-Net, in which skip connections were added at each scale to better fuse local and global features, and specially designed upsampling modules were introduced at different decoder levels to enhance output resolution.
Unlike the aforementioned methods, some researchers have leveraged the strong fitting capability of conditional generative adversarial networks (GANs) by treating elevation estimation as an image-to-image translation problem [8,9,10,11]. In this framework, an improved GAN is used to generate urban scene images, aiming to produce visually superior estimates of objects within the scene. These methods typically apply classical encoder–decoder structures to regress height values for a given scene, but the decoder stage often underuses the structural and spatial information of the scene. While such approaches may produce visually appealing 2D height maps, converting them into 3D models often yields fragmented and visually inconsistent reconstructions. Elevation estimation from single-view remote sensing images alone is therefore not suitable for producing rigorous data products for downstream applications. Nonetheless, because they do not depend on multi-view or temporal sequences, such methods are well-suited as auxiliary tasks in other image processing workflows to enhance the accuracy of a primary task. Currently, the most successful adaptive methods for remote sensing image semantic segmentation often involve generative adversarial learning: before the semantic segmentation model is trained, image-level distribution matching is performed between the source and target domain data, and reducing the distribution differences between domains improves the segmentation model's performance on the target domain. However, training generative adversarial networks is time-consuming and unstable, and these coupled two-stage methods limit the automation of adaptive pipelines, since an image translation network must be trained before the segmentation model. Therefore, although hybrid methods based on generative adversarial learning have clear advantages in adaptation accuracy, there is still room for improvement in efficiency and computational cost.
Digital surface models (DSMs) and digital elevation models (DEMs) are intermediate data in the production of orthorectified images. In particular, the DSM reflects the undulation of the real surface and depicts the elevation of surface features such as terrain, buildings, and vegetation in fine detail. Elevation information and semantic attributes are thus both important, strongly correlated features for interpreting surface elements. As intermediate data in the production of high-spatial-resolution orthorectified remote sensing images, DSM elevation data can serve as common attribute data accessible in both domains.
Based on this observation, we design a multi-task learning framework comprising an elevation estimation task and a semantic segmentation task. DSM data are used as ground truth to train the elevation estimation network in a supervised manner in both domains, and the correlation between elevation estimation and semantic segmentation aids unsupervised domain adaptation on the target domain. Additionally, the difference in elevation estimation results between the two domains can be used to evaluate the confidence of semantic segmentation predictions, enabling pixel-level pseudo-label optimization that improves the self-training accuracy of the segmentation module. The framework can also be integrated with domain adaptation methods based on self-supervised consistency to enhance semantic consistency on unlabeled target-domain data. The three main contributions of our work can be summarized as follows:
  • We propose EADA, a novel cross-domain adaptation framework that jointly performs monocular elevation estimation and semantic segmentation for aerial imagery. By leveraging digital surface models (DSMs), intermediate products of orthorectification pipelines, as auxiliary data, EADA exploits the correlated feature representations of elevation and semantics to significantly enhance primary-task performance while generating high-precision DSMs for downstream applications.
  • The model incorporates a shared backbone network for efficient feature transfer, augmented by a task–feature correlation module. This module distills cross-task dependencies through spatial attention mechanisms, dynamically reinforcing mutually beneficial features while suppressing task-specific noise.
  • EADA achieves state-of-the-art adaptation performance on the Potsdam and Vaihingen benchmarks, surpassing both single-stage and multi-stage methods. The architecture is also highly extensible, readily integrating with existing UDA methods to enhance deployability in real-world scenarios.

2. Methods

In the following section, we give an overview of the proposed network architecture, describe its two constituent modules, and then introduce the optimization functions.

2.1. Overview

In the setting of UDA for semantic segmentation, we consider a labeled source domain $S$ and an unlabeled target domain $T$ that share the same set of semantic classes. The source domain is defined as $S = \{(x_S^1, y_S^1, e_S^1), \ldots, (x_S^m, y_S^m, e_S^m)\}$, the set of labeled training data, where $m$ is the number of labeled source samples, $x_S \in X_S$ are the RS images from the source domain, $y_S \in Y_S$ are the corresponding semantic segmentation labels, and $e_S \in E_S$ are the DSM data for the auxiliary task (elevation perception). Similarly, the unlabeled target data can be represented as $T = \{(x_T^1, e_T^1), \ldots, (x_T^n, e_T^n)\}$, where $n$ is the total number of unlabeled training samples, $x_T \in X_T$ are the RS images from the target domain, and $e_T \in E_T$ are the DSM labels for elevation perception. The overall architecture is shown in Figure 1.
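To make this setting concrete, the sketch below shows one way the source triplets and target pairs could be organized in PyTorch; the class and function names are hypothetical and not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Sample:
    """One training sample; `label` is None for target-domain data."""
    image: torch.Tensor            # RS image x, shape (C, H, W)
    dsm: torch.Tensor              # elevation map e, shape (1, H, W)
    label: Optional[torch.Tensor]  # semantic mask y, shape (H, W), or None

def make_domains(src_imgs, src_labels, src_dsms, tgt_imgs, tgt_dsms):
    # Source: (x_S, y_S, e_S) triplets; target: (x_T, e_T) pairs without labels.
    source = [Sample(x, e, y) for x, y, e in zip(src_imgs, src_labels, src_dsms)]
    target = [Sample(x, e, None) for x, e in zip(tgt_imgs, tgt_dsms)]
    return source, target
```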

2.2. Initial Prediction Module

The main task of the initial prediction module is to extract features with a convolutional neural network and then to produce initial predictions for both segmentation and elevation perception. EADA adopts DeepLabv3+ with a ResNet-101 backbone as the initial prediction model; its structure is shown in Figure 1. The distinguishing feature of DeepLabv3+ is its use of atrous convolution and atrous spatial pyramid pooling (ASPP), which consists of several parallel atrous convolutions with different rates, to extract multiscale features. We modify this structure for multi-task prediction and adaptive semantic segmentation: the feature extraction network and the ASPP structure of the encoder are retained, while the decoder is reconstructed into a multi-task prediction module in the upsampling stage. Specifically, the input RS images are first fed into a residual network consisting of four atrous convolutional stages, where the feature output of the second stage provides low-level features for the subsequent upsampling. The extracted image features are then fed into the spatial pyramid pooling module, which contains a 1 × 1 convolution and three 3 × 3 atrous convolutions with different rates to capture contextual information at multiple scales, plus a global average pooling layer for global feature extraction. The encoder output is generated by a 1 × 1 convolution and upsampling after fusing all the extracted features; these encoder features are shared by the elevation estimation and semantic segmentation tasks.
In the subsequent multi-task decoding stage, we design two separate branches for the two prediction tasks to generate independent features and predictions. Each branch contains two 3 × 3 convolutional layers and an output layer. For elevation prediction, since the DSM data of the source and target domains supervise elevation estimation within their own domains, the output layers for the two domains are separated. For semantic segmentation, since no labeled samples are available in the target domain, the two domains share the same output layer. This stage outputs domain-specific predictions for elevation estimation ($\hat{e}_S$, $\hat{e}_T$) and semantic segmentation ($\hat{y}_S$, $\hat{y}_T$), which are used to calculate the initial prediction loss, and it also outputs task-specific features ($F_{ele}$, $F_{seg}$) for the subsequent task feature correlation module.
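A minimal sketch of this shared-encoder, dual-decoder layout is given below. It is not the authors' released implementation; the encoder is treated as a black box (ResNet-101 backbone plus ASPP), and all module and argument names are assumptions.

```python
import torch
import torch.nn as nn

class InitialPredictionModule(nn.Module):
    def __init__(self, encoder: nn.Module, feat_ch: int = 256, num_classes: int = 6):
        super().__init__()
        self.encoder = encoder  # ResNet-101 + ASPP, shared by both tasks

        def branch() -> nn.Sequential:
            # Each task branch: two 3x3 convolutional layers (output layer separate).
            return nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            )

        self.seg_branch, self.ele_branch = branch(), branch()
        # Segmentation output layer is shared across domains (no target labels).
        self.seg_head = nn.Conv2d(feat_ch, num_classes, 1)
        # Elevation output layers are domain-specific (per-domain DSM supervision).
        self.ele_heads = nn.ModuleDict({
            "source": nn.Conv2d(feat_ch, 1, 1),
            "target": nn.Conv2d(feat_ch, 1, 1),
        })

    def forward(self, x: torch.Tensor, domain: str):
        feats = self.encoder(x)                  # shared encoder features
        f_seg = self.seg_branch(feats)           # task-specific features F_seg
        f_ele = self.ele_branch(feats)           # task-specific features F_ele
        y_init = self.seg_head(f_seg)            # initial segmentation logits
        e_init = self.ele_heads[domain](f_ele)   # initial elevation map
        return y_init, e_init, f_seg, f_ele
```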

2.3. Task Feature Correlation Module

The initial prediction module generates the elevation features $F_{ele}$ and the semantic features $F_{seg}$; the task feature correlation module is used to learn the correlation between semantics and elevation. This is achieved by incorporating two spatial attention blocks, which capture the mutual relationship between semantics and elevation. The design of this module is inspired by multi-task distillation [12]. Since some correlation features are not useful for the semantic segmentation task, the module adopts an attention mechanism that lets the network automatically learn to focus on, or to ignore, information from the other task's features. The distilled features are calculated by
$$F^{O}_{seg} = F_{seg} + (W \circledast F_{ele}) \odot \sigma(W \circledast F_{ele}),$$
$$F^{O}_{ele} = F_{ele} + (W \circledast F_{seg}) \odot \sigma(W \circledast F_{seg}),$$
where $F^{O}_{seg}$ and $F^{O}_{ele}$ are the fused semantic and elevation features, $W$ denotes the learnable weights of the convolution, $\circledast$ denotes the convolution operation, $\odot$ is element-wise multiplication, and $\sigma$ is the sigmoid function that normalizes the attention map. The fused features are then fed into the task-specific final decoders, which generate the final elevation estimation and semantic segmentation predictions through transposed convolution and upsampling.
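A minimal sketch of this distillation step is shown below. It assumes one learnable 3 × 3 convolution per direction; whether the two $W$s share weights is not specified in the text, so separate weights are an assumption.

```python
import torch
import torch.nn as nn

class TaskFeatureCorrelation(nn.Module):
    def __init__(self, ch: int = 256):
        super().__init__()
        self.w_ele = nn.Conv2d(ch, ch, 3, padding=1)  # W applied to F_ele
        self.w_seg = nn.Conv2d(ch, ch, 3, padding=1)  # W applied to F_seg

    def forward(self, f_seg: torch.Tensor, f_ele: torch.Tensor):
        a_ele = self.w_ele(f_ele)                        # W (*) F_ele
        a_seg = self.w_seg(f_seg)                        # W (*) F_seg
        f_seg_o = f_seg + a_ele * torch.sigmoid(a_ele)   # fused semantic features
        f_ele_o = f_ele + a_seg * torch.sigmoid(a_seg)   # fused elevation features
        return f_seg_o, f_ele_o
```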

2.4. Objective Function Optimization

The semantic loss of the initial prediction network and the task feature correlation module in the proposed EADA model adopts an objective that combines a distribution-based cross-entropy loss with a region-based Dice loss, so that local and global information are jointly exploited to improve the semantic prediction. The loss function is defined as follows:
$$\mathcal{L}_{seg} = -\sum_{i}^{c}\sum_{j}^{N} y_{ij} \log \hat{y}_{ij} + \left(1 - \frac{1}{c}\sum_{i=0}^{c} \frac{\sum_{j}^{N} 2\,y_{ij}\,\hat{y}_{ij} + \epsilon}{\sum_{j}^{N} y_{ij} + \sum_{j}^{N} \hat{y}_{ij} + \epsilon}\right),$$
where $y_{ij}$ and $\hat{y}_{ij}$ denote the $i$th channel at the $j$th pixel location of the reference labels and of the network's softmax output, respectively; $c$ denotes the total channel count, $N$ the total pixel count in a mini-batch, and $\epsilon$ a small constant added to avoid numerical problems.
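The combined objective can be sketched as follows; this is an illustrative PyTorch implementation under the usual one-hot convention, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Cross-entropy plus Dice loss. logits: (B, C, H, W); target: (B, H, W)."""
    ce = F.cross_entropy(logits, target)                  # distribution-based term
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                      # sum over batch and pixels
    inter = 2 * (probs * onehot).sum(dims) + eps
    denom = probs.sum(dims) + onehot.sum(dims) + eps
    dice = (inter / denom).mean()                         # average over channels
    return ce + (1 - dice)                                # region-based term added
```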
EADA generates two elevation estimation predictions for each domain (one from each module); together with the corresponding DSM data, these are used to calculate the elevation loss. We use the reverse Huber (BerHu) loss [13] as the elevation loss for both the initial prediction network and the task feature correlation module:
$$\mathcal{L}_{ele} = \operatorname{berHu}(e_z) = \begin{cases} |e_z|, & \text{if } |e_z| \le c, \\ \dfrac{e_z^{2} + c^{2}}{2c}, & \text{otherwise,} \end{cases}$$
where $e_z$ denotes the difference between the DSM and the predicted elevation, and $c$ denotes the threshold, set to $c = \max(|e_z|)/5$ in each gradient calculation epoch. Notably, when $e_z \in [-c, c]$, the BerHu loss is equal to the $L_1$ loss; otherwise, it equals the $L_2$ loss.
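A direct transcription of the BerHu loss into PyTorch might look like this; the per-batch threshold follows the $c = \max(|e_z|)/5$ rule above, and the small clamp guarding against division by zero is our addition.

```python
import torch

def berhu_loss(pred: torch.Tensor, dsm: torch.Tensor) -> torch.Tensor:
    e = pred - dsm                                     # elevation error e_z
    abs_e = e.abs()
    c = (abs_e.max() / 5.0).clamp(min=1e-6).detach()   # threshold c = max|e_z| / 5
    l1 = abs_e                                         # L1 branch for |e_z| <= c
    l2 = (e ** 2 + c ** 2) / (2 * c)                   # L2-like branch otherwise
    return torch.where(abs_e <= c, l1, l2).mean()
```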
The overall loss for the entire network combines the source- and target-domain losses of the initial prediction module and the task feature correlation module:
$$\begin{aligned} \mathcal{L} ={} & \mathcal{L}_{seg}(y_S, \hat{y}_S) + \mathcal{L}_{seg}(y_S, \hat{y}'_S) + \alpha\left(\mathcal{L}_{seg}(\tilde{y}_T, \hat{y}_T) + \mathcal{L}_{seg}(\tilde{y}_T, \hat{y}'_T)\right) \\ & + \beta\left(\mathcal{L}_{ele}(e_S, \hat{e}_S) + \mathcal{L}_{ele}(e_T, \hat{e}_T) + \mathcal{L}_{ele}(e_S, \hat{e}'_S) + \mathcal{L}_{ele}(e_T, \hat{e}'_T)\right), \end{aligned}$$
where $y_S$ and $\tilde{y}_T$ denote the semantic ground truth of the source domain and the semantic pseudo-label of the target domain, respectively; $\hat{y}_S$ and $\hat{y}'_S$ are the source-domain semantic outputs of the initial prediction module and the task feature correlation module, and $\hat{y}_T$ and $\hat{y}'_T$ are the corresponding target-domain outputs; $e_S$ and $e_T$ denote the DSM data of the source and target domains, and $\hat{e}_S$, $\hat{e}'_S$, $\hat{e}_T$, and $\hat{e}'_T$ the elevation predictions of both domains from the two modules; $\alpha$ and $\beta$ are hyperparameters weighting the target-domain semantic loss and the elevation loss of both domains.
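Assembling the pieces, the overall objective could be computed as below, reusing the `seg_loss` and `berhu_loss` sketches above; the dictionary layout is hypothetical.

```python
def total_loss(out: dict, alpha: float = 0.1, beta: float = 0.01):
    """out holds the two modules' predictions ('init', 'corr') plus labels."""
    seg_src = (seg_loss(out["init"]["y_s"], out["gt_s"])
               + seg_loss(out["corr"]["y_s"], out["gt_s"]))
    seg_tgt = (seg_loss(out["init"]["y_t"], out["pseudo_t"])
               + seg_loss(out["corr"]["y_t"], out["pseudo_t"]))
    ele = (berhu_loss(out["init"]["e_s"], out["dsm_s"])
           + berhu_loss(out["init"]["e_t"], out["dsm_t"])
           + berhu_loss(out["corr"]["e_s"], out["dsm_s"])
           + berhu_loss(out["corr"]["e_t"], out["dsm_t"]))
    return seg_src + alpha * seg_tgt + beta * ele
```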

3. Experiments

In this section, we first describe the datasets and experimental implementation details, and then we provide the experimental results.

3.1. Datasets

We evaluate our proposed framework on two standard large-scale RS image semantic segmentation datasets, Potsdam and Vaihingen [14]. In addition to optical remote sensing images, the Vaihingen and Potsdam datasets also contain DSM data generated by photogrammetry and dense image matching techniques. In the Potsdam dataset, for example, the MATCH-T DSM module of INPHO GmbH software (https://geospatial.trimble.com/zh-cn/products/software/trimble-inpho, accessed on 8 July 2025) was used to generate high-resolution DSM data after acquiring continuous images with high spatial resolution.

3.2. Elevation Normalization

Because the terrain height fluctuates within the coverage of continuously acquired imagery, raw DSM values cannot truly reflect the height attributes of surface elements (Figure 2). Taking buildings, a common surface element, as an example: the height of a building relative to the ground is its height attribute, but the building height in the DSM also contains the elevation of the ground, i.e., it is an absolute height. As shown in Figure 3, the most concentrated elevation range in each image corresponds to the ground, but the histograms show that the ground elevations of the first two images differ significantly because of the terrain. This phenomenon degrades the accuracy of feature information extraction, so the DSM data must be processed to filter out the terrain and obtain relative height information. In this paper, the data in both datasets are divided into ground and non-ground pixels using the LAStools toolbox, which builds on Axelsson's method [15]: for every off-ground pixel, the nearest ground point is taken as the reference low point, and the so-called normalized height is obtained by subtracting that ground height from the off-ground height. In the resulting DSM, the effect of differing ground heights on the relative heights of surface features is eliminated. As shown in Figure 3, the processed ground heights are concentrated in the same interval, and the DSM then represents the relative elevation of surface features.
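The normalization step can be sketched with NumPy/SciPy as follows, under the nearest-ground-point assumption described above; the actual processing in the paper was done with LAStools, so this is only an illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def normalize_dsm(dsm: np.ndarray, ground_mask: np.ndarray) -> np.ndarray:
    """dsm: 2-D absolute elevations; ground_mask: True where a pixel is ground."""
    # For every non-ground pixel, find the indices of the nearest ground pixel.
    _, idx = distance_transform_edt(~ground_mask, return_indices=True)
    rows, cols = idx
    ground_height = dsm[rows, cols]   # elevation of the nearest ground point
    ndsm = dsm - ground_height        # relative (normalized) height
    return np.clip(ndsm, 0.0, None)   # negative residuals clipped to zero (our choice)
```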

3.3. Contrast Experiment Setting and Implementation Details

We implement the proposed framework in PyTorch (v1.8) [16] and adopt DeepLabv3+ [17] with a ResNet-101 backbone as the segmentation architecture. The backbone is initialized from a model pretrained on ImageNet.
To evaluate the performance of EADA in single-stage domain adaptation, we focus on assessing the feasibility of the EADA multi-task framework and comparing its accuracy with that of other single-stage and multi-stage methods. Given the prominent role of generative adversarial learning in semantic segmentation adaptation, this study further investigates a two-stage EADA-based approach aimed at achieving optimal adaptation performance. In this variant, the data distributions of the source and target domains are first aligned using ResiDualGAN [10], followed by semantic segmentation model training with EADA. This method is referred to as RDG-EADA in the subsequent experiments.
The single-stage baseline methods used for comparison include DeepLabv3+ (without adaptation) [17], AdaptSegNet (adversarial learning in the output space) [18], and CSC-SSO (CSC with self-supervised consistency module only) [19]. The multi-stage domain adaptation methods compared in this study include GAN-RSDA (based on CycleGAN) [8], MUCSS (based on DualGAN) [20], CCDA (using curriculum learning) [21], RDG-OSA [10], and CSC-Aug [19].
For experimental deployment, all models were trained with the PyTorch 1.8 deep learning framework on an Ubuntu 20.04 operating system. The hardware setup included an Intel Core i7-7800X CPU with 16 GB RAM (Intel Corp., Santa Clara, CA, USA) and an NVIDIA 2080 Ti GPU (NVIDIA Corp., Santa Clara, CA, USA) with CUDA 10.0. During training, the loss weights for EADA were set to α = 0.1 and β = 0.01. The batch size was fixed at 2, and the Adam optimizer was employed to update the model parameters. The initial learning rate was set to 0.0001 and was dynamically reduced by a factor of 0.5 whenever the validation accuracy plateaued.
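The training configuration described above can be reproduced roughly as follows; `EADA`, `train_loader`, `validate`, and `num_epochs` are hypothetical placeholders.

```python
import torch

model = EADA()  # the multi-task network sketched in Section 2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate whenever validation accuracy plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5)

for epoch in range(num_epochs):
    for batch in train_loader:                  # batch size fixed at 2
        optimizer.zero_grad()
        loss = total_loss(model(batch), alpha=0.1, beta=0.01)
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))             # pass validation accuracy
```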

3.4. Results and Comparison

i. Single-stage methods
The quantitative results of the single-stage domain adaptation methods under the two cross-domain semantic segmentation scenarios are presented in the upper sections of Table 1 and Table 2, and the qualitative results are shown in Figure 4. In the adaptation scenario characterized primarily by spatiotemporal imaging differences (Potsdam IRRG → Vaihingen IRRG), EADA achieved a mean intersection over union (IoU) of 54.62% and an F1-score of 65.47% on the target domain, outperforming all other single-stage methods. Compared with multi-stage approaches, its performance surpassed all except RDG-OSA and CSC-Aug.
In a more challenging scenario involving both imaging modality and environmental differences (Potsdam RGB → Vaihingen IRRG), EADA achieved a mean IoU of 41.48% and an F1-score of 51.99% on the target domain. Even under this more complex domain shift, EADA still outperformed all competing single-stage methods. Among multi-stage methods, its performance was lower than that of CCDA, RDG-OSA, and CSC-Aug, but remained superior to the generative-model-based method GAN-RSDA.
Overall, the EADA multi-task cross-domain segmentation model demonstrates strong adaptability to variations in data distribution caused by differences in imaging modality and geographical environment, achieving high-accuracy unsupervised semantic segmentation in the target domain without requiring labeled data. Moreover, for terrain categories that are particularly sensitive to elevation, such as buildings and tall trees, EADA benefits significantly from the auxiliary elevation estimation task, achieving notable improvements across adaptation scenarios. For instance, in the Potsdam IRRG → Vaihingen IRRG scenario, the IoU and F1-score for the building class reached 84.04% and 91.28%, respectively; in the Potsdam RGB → Vaihingen IRRG scenario, these metrics further increased to 86.57% and 92.75%. These results significantly exceed the performance of all previously published remote sensing semantic segmentation adaptation methods.
From the perspective of training efficiency, EADA—being an end-to-end multi-task neural network—eliminates the need for multi-stage training procedures, thereby greatly enhancing the performance potential of single-stage domain adaptation approaches.
ii. Multi-stage methods
The RDG-EADA method proposed in this section is a multi-stage domain adaptation framework. In Stage I, it leverages a generative adversarial learning model, ResiDualGAN, for image-level distribution alignment in the input space; in Stage II, it combines the elevation-aware multi-task feature extraction network with self-supervised consistency training. The quantitative results of all multi-stage adaptation methods are presented in the lower sections of Table 1 and Table 2, and the qualitative results are shown in Figure 5. RDG-EADA achieves the highest segmentation accuracy among all methods in both cross-domain adaptation scenarios. However, its overall accuracy is only marginally better than that of CSC-Aug, which does not incorporate elevation awareness (relative mIoU gains of 1.5% and 0.9% in the two scenarios, respectively), suggesting diminishing returns in segmentation performance relative to the additional data requirements and computational cost.
From a class-wise perspective, RDG-EADA outperforms all other multi-stage methods across nearly all semantic categories, except for tall trees and miscellaneous/background. In fact, both the single-stage EADA and the multi-stage RDG-EADA, despite incorporating elevation awareness, suffer from significantly reduced segmentation accuracy for the miscellaneous/background class, which in turn lowers their overall performance. For instance, in the Potsdam IRRG → Vaihingen IRRG scenario, RDG-EADA yields an IoU of only 7.76% and an F1-score of 11.95% for the miscellaneous/background category, which are 27% and 23% lower in relative terms, respectively, than the results of the best-performing method, RDG-OSA. Therefore, if the overall accuracy were recalculated by excluding this minor class, which is characterized by sparse representation and highly subjective annotation, or by applying pixel-proportional weighted averaging, both RDG-EADA and EADA would show an even larger contribution to accuracy improvement. The reasons for the poor performance on this particular category, and the role of elevation estimation therein, are analyzed in the next section.
iii. Results of elevation estimation
In the EADA framework, single-view elevation estimation is incorporated as an auxiliary task to enhance the domain adaptation performance of the primary task, semantic segmentation. As such, no additional single-view elevation estimation methods are included for quantitative comparison in this section. Nevertheless, as a meaningful byproduct of the supervised multi-task model, the estimated elevation maps can serve as valuable information for specific Earth observation applications.
The elevation estimation results of EADA under both adaptation scenarios are illustrated in Figure 6. The model is capable of producing reasonably accurate relative elevation estimates from input imagery, particularly for land-cover types with distinct height characteristics, such as buildings and tall trees. EADA demonstrates the ability to resolve elevation information even under building shadows and to differentiate between visually similar categories, such as shrubs and tall trees, based on their elevation profiles.
Compared to digital surface models (DSMs) derived from photogrammetry, EADA—formulated as a regression-based estimation task—does not provide absolute elevation values and cannot fully capture the real-world height of surface features in the absence of geospatial metadata. However, it effectively conveys relative elevation patterns within a predefined range, which is sufficient for many qualitative Earth observation tasks. Furthermore, as a single-image-based estimation model, EADA does not require sequential data input, making it more lightweight and responsive compared to conventional elevation data acquisition methods that rely heavily on extensive data collection.

4. Discussion

4.1. Auxiliary Tasks of Elevation Estimation

The quantitative results show that the elevation-aware multi-task semantic segmentation adaptation method significantly improves the average accuracy of adaptive semantic segmentation. Qualitatively, EADA, which treats elevation estimation as an auxiliary task, performs well on height-sensitive features such as buildings and trees (as shown in Figure 7). In the Potsdam RGB → Vaihingen IRRG scenario, for example, the spectral difference between buildings and trees is particularly obvious because the band combinations differ. Compared with the baseline without elevation information, the IoU and F1-score for buildings improve by 63.0% and 34.3%, while the results for trees improve by 21.0% and 11.2%, respectively. In particular, the segmentation accuracy for buildings was better than that of all other methods, including the multi-stage ones: the results were 30.0% and 16.3% higher than those of AdaptSegNet, the best single-stage method, and 4.89% and 2.63% higher than those of CSC-Aug, the best multi-stage method. This is due to the strong correlation between discontinuities in the DSM data and feature edges in semantic segmentation; buildings and trees, which exhibit large elevation differences at their edges, benefit most in this situation.
However, for elevation-insensitive features, such as low vegetation, impervious surfaces, and clutter, the method shows limited improvement in segmentation accuracy; compared with other methods, accuracy even decreases in some scenarios. For example, in the Potsdam RGB → Vaihingen IRRG scenario, the segmentation accuracy for low vegetation, although improved over the baseline, is significantly lower than that of the other methods, and the accuracy for impervious surfaces is also slightly lower than that of the other compared methods. The main reason is the difference in spectral band combinations between the domains: low vegetation differs greatly between the Potsdam RGB and Vaihingen IRRG datasets. Another reason is that elevation estimation does not provide effective feature fusion for the semantic segmentation task, because the height difference between low vegetation and impervious surfaces is tiny; to a certain extent, this feature fusion even reduces the separation of these two classes in feature space. In the Potsdam IRRG → Vaihingen IRRG scenario, however, the accuracy for these two classes is acceptable. It can therefore be concluded that the elevation-aware multi-task adaptation method is not suitable for scenarios in which the elevation difference between target and background is small and the spectral difference between similar features across domains is large. For example, the scene shown in Figure 8 contains a high-rise parking lot, and the elevation signature of the cars on it (marked by red circles) prevents the semantic segmentation model from recognizing them.

4.2. Extensibility of Multi-Task Learning Models

The key to adversarial learning and self-supervised consistency training is to effectively reduce the distribution differences between the source and target domains in image or feature space. Multi-task learning methods, by contrast, focus on improving the model's ability to extract domain-invariant features. The focus of this paper is therefore on the construction of the feature extraction model rather than on the learning method: once the multi-task model is constructed, it can be trained with adversarial learning and self-supervised consistency methods, so only one model needs to be trained end-to-end to obtain a high-performance adaptive semantic segmentation approach. Among adaptive segmentation approaches, multi-stage models usually include a generative adversarial network for image-level distribution matching before the semantic segmentation model is trained. It should be noted that when the data distributions are aligned by a generative adversarial network before training, EADA performs only slightly better than the CSC-style adaptive segmentation approaches: in the Potsdam IRRG → Vaihingen IRRG experiments, the mean IoU of RDG-EADA is 59.02%, slightly higher than the 58.10% of CSC-Aug. The reason this gain is smaller than EADA's gain over other single-stage methods may be that more semantic errors occur during image translation by ResiDualGAN. Such errors do not affect the model's learning of the target-domain data distribution during consistency training, but they can severely impair the supervised training of image elevation estimation. For example, in Potsdam-to-Vaihingen image translation, the semantic translation between roofs and trees is often abnormal while the corresponding DSM remains unchanged; this mismatch between elevation and semantics degrades the elevation estimation task and thus weakens the correlated features it contributes to semantic segmentation. Therefore, combining multi-task learning with image-to-image style transfer can introduce task-correlation feature matching errors and reduce multi-task learning performance when the semantic fidelity of the translation is not guaranteed. Moreover, the accuracy gain diminishes relative to the investment in data and computing power. Overall, a suitable adaptation approach should be selected according to the actual task requirements, the feature information of interest, and the data available in the source and target domains (Table 3).

5. Conclusions

This paper proposes a multi-task learning method called elevation-aware domain adaptation (EADA), which utilizes the elevation information in DSM data to assist domain adaptation for RS image semantic segmentation. The method builds a multi-task feature extraction network for elevation estimation and semantic segmentation and exploits the correlation between the two tasks to improve both. The correlated task features of the source domain can be shared with or transferred to the target domain to improve target-domain segmentation accuracy. As a tightly coupled approach, EADA does not require the generative adversarial networks commonly used in hybrid methods for image translation, and therefore achieves state-of-the-art performance among published single-stage methods with only single-stage training. At the same time, its combination with generative alignment (RDG-EADA) outperforms previously proposed multi-stage UDA approaches. As a future direction, we plan to investigate class-aware feature fusion strategies that adaptively balance elevation cues and other contextual information to improve performance on elevation-insensitive classes.

Author Contributions

Conceptualization, Z.S.; methodology, P.G.; software, P.G.; validation, Z.S. and P.G.; formal analysis, Z.S. and X.L.; investigation, Z.L.; resources, Z.L.; data curation, Z.L.; writing—original draft preparation, Z.S.; writing—review and editing, X.L.; visualization, Z.S.; supervision, X.C.; project administration, X.C.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Yangtze River Three Gorges Group Co. Ltd., for scientific research study ‘Dynamic Remote Sensing of the Sand Content in Watersheds and Trend Analysis of Sand Transport’ (Z242302046) and the study ‘Research on Key Technologies of Remote Sensing Monitoring of Inland River Underwater Environment’ (Z522402008); Hubei Provincial Natural Science Foundation of China (2023AFD201).

Data Availability Statement

The Potsdam and Vaihingen datasets are published by the International Society for Photogrammetry and Remote Sensing (ISPRS) and can be accessed at https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/Default.aspx, accessed on 8 July 2025.

Acknowledgments

Our sincere gratitude goes out to the anonymous reviewers for their constructive comments and suggestions, which have substantially improved this paper.

Conflicts of Interest

Author Xinbo Liu was employed by the company China Yangtze Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chi, M.; Plaza, A.; Benediktsson, J.A.; Sun, Z.; Shen, J.; Zhu, Y. Big data for remote sensing: Challenges and opportunities. Proc. IEEE 2016, 104, 2207–2219.
  2. Liu, P.; Di, L.; Du, Q.; Wang, L. Remote sensing big data: Theory, methods and applications. Remote Sens. 2018, 10, 711.
  3. Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32.
  4. Zhang, L.; Zhang, L.; Du, B. Advances in Machine Learning for Remote Sensing and Geosciences. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
  5. Srivastava, S.; Volpi, M.; Tuia, D. Joint height estimation and semantic labeling of monocular aerial images with CNNs. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 5173–5176.
  6. Mou, L.; Zhu, X.X. IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network. arXiv 2018, arXiv:1802.10249.
  7. Amirkolaee, H.A.; Arefi, H. Height estimation from single aerial images using a deep convolutional encoder-decoder network. ISPRS J. Photogramm. Remote Sens. 2019, 149, 50–66.
  8. Benjdira, B.; Bazi, Y.; Koubaa, A.; Ouni, K. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sens. 2019, 11, 1369.
  9. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017; pp. 2849–2857.
  10. Zhao, Y.; Guo, P.; Sun, Z.; Chen, X.; Gao, H. ResiDualGAN: Resize-residual DualGAN for cross-domain remote sensing images semantic segmentation. Remote Sens. 2023, 15, 1428.
  11. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
  12. Li, W.-H.; Bilen, H. Knowledge distillation for multi-task learning. In Proceedings of the European Conference on Computer Vision 2020, Glasgow, UK, 23–28 August 2020; pp. 163–176.
  13. Lambert-Lacroix, S.; Zwald, L. The adaptive BerHu penalty in robust regression. J. Nonparametr. Stat. 2016, 28, 487–514.
  14. Gerke, M. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); University of Twente: Enschede, The Netherlands, 2014.
  15. Axelsson, P. DEM generation from laser scanner data using adaptive TIN models. Int. Arch. Photogramm. Remote Sens. 2000, 33, 110–117.
  16. Paszke, A. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
  17. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 801–818.
  18. Tsai, Y.-H.; Hung, W.-C.; Schulter, S.; Sohn, K.; Yang, M.-H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 7472–7481.
  19. Gao, H.; Zhao, Y.; Guo, P.; Sun, Z.; Chen, X.; Tang, Y. Cycle and self-supervised consistency training for adapting semantic segmentation of aerial images. Remote Sens. 2022, 14, 1527.
  20. Li, Y.; Shi, T.; Zhang, Y.; Chen, W.; Wang, Z.; Li, H. Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2021, 175, 20–33.
  21. Zhang, B.; Chen, T.; Wang, B. Curriculum-style local-to-global adaptation for cross-domain remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12.
Figure 1. The overall architecture.
Figure 2. Illustration of elevation normalization.
Figure 3. Comparison of histograms before and after elevation normalization.
Figure 4. Qualitative results of single-stage unsupervised self-adaptation for cross-domain semantic segmentation. (a) Target images. (b) Ground truth. (c) Baseline. (d) AdaptSegNet. (e) CSC-SSO. (f) EADA.
Figure 5. Qualitative results of multi-stage unsupervised self-adaptation for cross-domain semantic segmentation. (a) Target images. (b) Ground truth. (c) GAN-RSDA. (d) MUCSS. (e) CCDA. (f) RDG-OSA. (g) CSC-Aug. (h) RDG-EADA.
Figure 6. EADA elevation estimation results. (a,d) Target images. (b,e) Ground truth. (c,f) Results.
Figure 7. EADA results for building extraction and elevation estimation in the target domain: (a) overlay of building extraction results (in blue) from EADA and the target domain image; (b) elevation estimation results for the target domain from EADA.
Figure 8. Impact of elevation estimation on semantic information extraction: cars on a high-rise parking lot are not correctly identified.
Table 1. Quantitative results (%) of cross-domain semantic segmentation on the Potsdam IRRG → Vaihingen IRRG benchmark.

| Method | Metric | Clutter/Background | Impervious Surface | Car | Tree | Low Vegetation | Building | Overall |
|---|---|---|---|---|---|---|---|---|
| Single-stage methods | | | | | | | | |
| Baseline (DeepLabv3+) | IoU | 2.99 | 47.88 | 20.82 | 58.74 | 19.57 | 61.37 | 35.23 |
| | F1-score | 5.18 | 64.40 | 33.93 | 73.88 | 32.47 | 75.83 | 47.61 |
| AdaptSegNet | IoU | 6.32 | 62.50 | 29.31 | 55.74 | 40.30 | 70.41 | 44.10 |
| | F1-score | 9.67 | 76.66 | 44.81 | 71.36 | 57.01 | 82.50 | 57.00 |
| CSC-SSO | IoU | 9.76 | 70.03 | 38.18 | 57.27 | 37.23 | 76.15 | 48.10 |
| | F1-score | 14.11 | 82.24 | 54.96 | 72.70 | 53.78 | 86.40 | 60.70 |
| EADA | IoU | 5.06 | 69.69 | 58.07 | 63.20 | 45.58 | 85.66 | 54.62 |
| | F1-score | 8.09 | 81.79 | 72.83 | 76.92 | 61.01 | 92.18 | 65.47 |
| Multi-stage methods | | | | | | | | |
| GAN-RSDA | IoU | 7.26 | 57.32 | 20.04 | 44.27 | 35.47 | 65.35 | 38.28 |
| | F1-score | 10.32 | 72.60 | 32.53 | 61.04 | 51.99 | 78.84 | 51.22 |
| MUCSS | IoU | 11.16 | 65.94 | 26.30 | 50.49 | 39.85 | 69.07 | 43.80 |
| | F1-score | 14.70 | 79.15 | 40.77 | 66.76 | 56.55 | 81.53 | 56.58 |
| CCDA | IoU | \ | 58.64 | 28.17 | 53.28 | 30.39 | 60.60 | 46.22 |
| | F1-score | \ | 75.13 | 45.81 | 69.52 | 47.62 | 76.89 | 62.99 |
| RDG-OSA | IoU | 10.70 | 70.31 | 54.04 | 59.22 | 49.03 | 81.20 | 54.08 |
| | F1-score | 15.48 | 82.43 | 69.85 | 74.22 | 65.52 | 89.57 | 66.18 |
| CSC-Aug | IoU | 13.83 | 75.56 | 56.58 | 65.55 | 52.92 | 84.17 | 58.10 |
| | F1-score | 19.59 | 86.01 | 72.01 | 79.09 | 68.96 | 91.38 | 69.50 |
| RDG-EADA | IoU | 7.76 | 76.90 | 56.91 | 67.03 | 59.84 | 85.68 | 59.02 |
| | F1-score | 11.95 | 86.80 | 72.22 | 80.15 | 74.72 | 92.23 | 69.68 |
Table 2. Quantitative results (%) of cross-domain semantic segmentation on the Potsdam RGB → Vaihingen IRRG benchmark.

| Method | Metric | Clutter/Background | Impervious Surface | Car | Tree | Low Vegetation | Building | Overall |
|---|---|---|---|---|---|---|---|---|
| Single-stage methods | | | | | | | | |
| Baseline (DeepLabv3+) | IoU | 2.67 | 40.24 | 18.35 | 53.14 | 12.88 | 52.63 | 29.98 |
| | F1-score | 4.65 | 56.93 | 30.40 | 69.19 | 22.68 | 68.74 | 42.10 |
| AdaptSegNet | IoU | 6.26 | 55.91 | 34.09 | 47.56 | 23.18 | 65.97 | 38.83 |
| | F1-score | 9.55 | 71.44 | 50.34 | 64.17 | 37.22 | 79.36 | 52.01 |
| CSC-SSO | IoU | 2.47 | 48.99 | 35.63 | 49.54 | 21.39 | 61.31 | 36.56 |
| | F1-score | 4.31 | 65.35 | 52.14 | 66.05 | 35.00 | 75.89 | 49.79 |
| EADA | IoU | 0.18 | 45.09 | 39.25 | 63.84 | 15.69 | 84.82 | 41.48 |
| | F1-score | 0.34 | 60.99 | 55.42 | 77.20 | 26.46 | 91.51 | 51.99 |
| Multi-stage methods | | | | | | | | |
| Baseline | IoU | 2.29 | 48.27 | 25.73 | 42.16 | 23.34 | 64.33 | 34.35 |
| | F1-score | 3.50 | 64.79 | 40.20 | 59.03 | 37.55 | 78.13 | 47.20 |
| GAN-RSDA | IoU | 5.87 | 54.21 | 27.95 | 43.73 | 26.94 | 68.76 | 37.91 |
| | F1-score | 8.77 | 70.04 | 42.89 | 60.53 | 42.09 | 81.26 | 50.93 |
| MUCSS | IoU | 12.38 | 64.47 | 43.43 | 52.83 | 38.37 | 76.87 | 48.06 |
| | F1-score | 21.55 | 77.76 | 60.05 | 69.62 | 55.94 | 86.95 | 61.98 |
| CCDA | IoU | 9.84 | 62.59 | 54.22 | 56.31 | 37.86 | 79.33 | 50.02 |
| | F1-score | 14.55 | 76.81 | 70.00 | 71.92 | 54.55 | 88.41 | 62.71 |
| RDG-OSA | IoU | 8.12 | 68.91 | 57.41 | 65.47 | 48.33 | 81.78 | 55.00 |
| | F1-score | 11.23 | 81.48 | 72.76 | 79.04 | 64.78 | 89.94 | 66.54 |
| CSC-Aug | IoU | 3.80 | 73.50 | 51.22 | 62.62 | 55.32 | 86.57 | 55.51 |
| | F1-score | 5.88 | 84.58 | 67.29 | 76.85 | 71.07 | 92.75 | 66.41 |
Table 3. Methods under different scenarios.

| Method | Data Distribution Alignment | Self-Supervision | Elevation Information | mIoU_1 ¹ | mIoU_2 ² |
|---|---|---|---|---|---|
| Baseline | | | | 35.23 | 29.98 |
| RDG | ✓ | | | 51.89 | 45.14 |
| CSC-SSO | | ✓ | | 48.10 | 36.56 |
| CSC | ✓ | ✓ | | 58.10 | 55.00 |
| EADA | | ✓ | ✓ | 54.02 | 65.58 |
| RDG-EADA | ✓ | ✓ | ✓ | 59.02 | 55.51 |
1 Potsdam IRRG → Vaihingen IRRG. 2 Potsdam RGB → Vaihingen IRRG.