Article

Synergistic Semantic Segmentation and Height Estimation for Monocular Remote Sensing Images via Cross-Task Interaction

by Xuanang Peng 1,2, Shixin Wang 1, Futao Wang 1,2,*, Jinfeng Zhu 1, Suju Li 3, Longfei Liu 3 and Zhenqing Wang 1,2

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of Emergency Satellite Engineering and Application, Ministry of Emergency Management, China National Disaster Reduction Center of China, Beijing 100124, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(9), 1637; https://doi.org/10.3390/rs17091637
Submission received: 17 March 2025 / Revised: 25 April 2025 / Accepted: 29 April 2025 / Published: 5 May 2025

Abstract
Semantic segmentation and height estimation in remote sensing imagery are two pivotal, highly interrelated tasks for scene understanding. Although deep learning methods have achieved remarkable progress in these tasks in recent years, several challenges remain. Recent studies have shown that multi-task learning methods can enhance the complementarity of task-related features, thus maximizing the prediction accuracy of multiple tasks at a low computational cost. However, due to factors such as complex semantic categories and the inconsistent spatial scales of remotely sensed images, existing multi-task learning methods often fail to achieve better results on these two tasks. To address this issue, we propose the Cross-Task Mutual Enhancement Network (CTME-Net), a novel architecture designed to jointly perform height estimation and semantic segmentation on remote sensing imagery. Firstly, to generate discriminative initial features for each task branch and activate dedicated pathways for cross-task feature disentanglement, we design a universal initial feature embedding module for each downstream task. Secondly, to address the impact of redundancy in general features during global–local fusion, we develop an Adaptive Task-specific Feature Distillation Module that enhances the model’s ability to acquire task-specific features. Finally, we propose a task feature interaction module that refines features across tasks through mutual optimization, maximizing task-specific feature expression. We conduct extensive experiments on the ISPRS Vaihingen and Potsdam datasets to validate the effectiveness of our approach. The results demonstrate that our proposed method outperforms existing methods in both height estimation and semantic segmentation.

1. Introduction

Three-dimensional scene understanding has long been an open problem in the field of remote sensing. Compared with two-dimensional planar images, three-dimensional representations provide richer structural information, which helps to accurately reflect the state of objects and environments in the real world. They have a wide range of applications in autonomous driving, urban planning, disaster monitoring, and other fields. Traditional three-dimensional scene acquisition methods mainly include LiDAR scanning and oblique photogrammetry.
In recent years, driven by the rapid advancement of deep learning technologies and the improvement in the quality of remote sensing imagery, the significance of semantic segmentation (SS) and height estimation (HE) in 3D scene understanding has been increasingly recognized. As distinct tasks in computer vision, semantic segmentation aims to interpret the two-dimensional contextual information represented by individual pixels in a scene, while height estimation focuses more on the reconstruction of three-dimensional geometric information. For SS, recent studies [1,2,3,4] have extensively employed deep learning and attention mechanisms to model the contextual information and long-range dependencies of SS. However, uncertainties still exist in parsing and distinguishing objects with similar appearances in complex remote sensing scenes. For HE, studies [5,6,7,8,9,10] have utilized monocular images to extract geometric information through regression or classification methods. Nevertheless, due to the complexity of remote sensing scenes and the inherent ambiguity in color mapping, HE still has its limitations.
Typically, semantic segmentation and height estimation have been treated as separate tasks, with little consideration given to their interconnections and interactions. As research progresses, an increasing number of scholars have come to realize that the interaction between semantic segmentation (SS) and height estimation (HE) is complementary for both tasks. In fact, in complex remote sensing scenes, geometric and semantic information can be mutually utilized. As shown in Figure 1b, for instance, in SS, the geometric information provided by HE can enhance the discriminative power of the semantic features and help resolve in-scene ambiguities by re-determining the boundaries of the classes and regions of different objects through their statistical distribution properties. In the HE task, the semantic information provided by SS can serve as explicit or implicit prior information, or as additional supervision, providing hierarchical supervisory signals for HE. Therefore, a unified approach for the joint learning of SS and HE is a feasible and very promising research direction: with a reasonable model architecture, both problems can be solved well by a single model while saving computational resources as much as possible.
Multi-task learning (MTL) aims to leverage the implicit connections between related tasks to enhance the performance across tasks, maximizing computational efficiency while delivering highly accurate results. Compared to single-task learning (STL), MTL can reduce redundant computations and memory usage by sharing visual features among tasks through well-designed models, uncovering potential inter-task relationships and fostering positive feedback among tasks. In the field of computer vision, MTL methods have been continuously emerging and applied to various task scenarios, such as semantic segmentation, surface estimation, and depth estimation.
However, multi-task learning (MTL) methods developed for natural images cannot be directly applied to the field of remote sensing. Due to the complex spectral characteristics of remote sensing images, significant semantic inconsistencies, and complex spatial interpretability, such methods are not well suited for remote sensing scene understanding. For example, objects of the same category may have vastly different height values; conversely, objects of different categories may have very similar spectral properties, such as building roofs and roads, or low vegetation and lawns. These interfering factors can be fatal to MTL networks, potentially causing the network weights to move in undesirable directions. Therefore, when employing MTL methods for remote sensing scene understanding, capturing the relationships between pixels is of paramount importance. Both the multi-level features and the feature interactions between different tasks must be built on this principle.
To address the aforementioned challenges, in this paper, we proposed a novel MTL method to correlate semantic segmentation (SS) and height estimation (HE), termed the Cross-Task Mutual Enhancement Network (CTME-Net). As shown in Figure 1a, firstly, recognizing the significance of the initial features in a multi-task learning framework, we designed a universal Initial Task-specific Feature Embedding Module (ITFEM) for each downstream task. By decoupling the initial general features and employing real labels for supervised learning, we obtained highly discriminative initial features. This process laid the foundation for subsequent cross-task feature disentanglement. Secondly, considering the inherent feature discrepancies between different tasks, we introduced an Adaptive Task-specific Feature Distillation Module (ATFDM). This module adaptively evaluated the importance of shared features for each task by analyzing their specific requirements, thereby achieving precise feature allocation and optimization. This efficient and flexible strategy not only ensured the generality of shared features across tasks but also tailored task-specific features for each individual task, enabling the simultaneous capturing of both shared and task-specific characteristics. Finally, to further explore the higher order common patterns between the two tasks, we proposed the Attention-Based Task Interaction Module (ABTIM), which facilitated information exchange between tasks through an attention mechanism. Specifically, the module computed attention weight scores between task features, controlling the information flow through learnable parameters, and deeply mined the potential positive feedback from one task to another, thereby promoting task interaction and collaborative optimization. This design allowed the model to share useful information across tasks while mitigating the interference between them. Experimental results demonstrate that our proposed CTME-Net achieves significant performance improvements on multiple benchmark datasets, particularly excelling in cross-task feature sharing and task-specific feature extraction. Compared with existing multi-task learning methods, CTME-Net not only obtains better results in SS and HE tasks but also exhibits a stronger robustness and generalization ability.
In summary, the main contributions of this article are summarized as follows:
(1)
Unlike traditional single-task models, we propose a novel multi-task framework for remote sensing image scene understanding, the Cross-Task Mutual Enhancement Network (CTME-Net), designed to simultaneously address semantic segmentation and height estimation.
(2)
To address the inconsistency of feature information across different tasks, we introduced an initial feature embedding module at the decoder’s initial stage, using ground truth labels for preliminary supervised learning to generate highly discriminative initial features. Meanwhile, we designed an Adaptive Task-specific Feature Distillation Module to extract more beneficial task-specific information from general features at various stages.
(3)
A task interaction module based on a cross-attention mechanism was designed to establish hierarchical attention mapping between semantic features and height features, enhancing feature responses in regions of geometric–semantic consistency. Through this approach, we achieved dynamic interaction and information fusion between features.
This paper is organized as follows: a brief review of the related works is given in Section 2, including semantic segmentation, height estimation, and multi-task learning. The details of our proposed method are illustrated in Section 3. Section 4 introduces the datasets, evaluation indicators, and implementation details. Extensive experimental results and evaluations are reported in Section 5, which also includes the ablation studies of each component. Finally, we conclude our work in Section 6.

2. Related Works

2.1. Single-Task Learning

Convolutional neural networks (CNNs) and Transformers, renowned for their robust feature representation capabilities, have achieved significant advancements across a multitude of computer vision applications. The conventional practice involves training distinct networks tailored to individual tasks, with a primary focus on optimizing task-specific metrics. This approach is commonly referred to as single-task learning (STL). In this work, we direct our attention to two such tasks: semantic segmentation (SS) and height estimation (HE).

2.1.1. Semantic Segmentation

Semantic segmentation (SS) involves the pixel-level classification of images to recognize and interpret visual scenes. Traditional methods have relied on low-level visual features for region division, often lacking deep semantic information. Over the past decade, convolutional neural networks (CNNs) have significantly enhanced the performance of SS, enabling richer semantic understanding. Fully Convolutional Networks (FCNs) [11] pioneered the transformation of the last three fully connected layers into convolutional layers, addressing semantic-level image segmentation. The introduction of skip architectures combined low-level detail information with high-level semantic information, ensuring the robustness and accuracy of the network. This work can be regarded as a milestone in SS methods. Since then, numerous models [12,13,14,15,16,17,18] based on the FCN framework have been developed, exploring various innovations of CNNs in semantic segmentation. For instance, PSPNet [13] proposed a pyramid pooling architecture capable of extracting long-range dependencies across different scales. DeeplabV3+ [16] continued the previous series of network architectures by applying depthwise separable convolutions to the ASPP and decoder modules, significantly improving computational efficiency. Fu et al. [18] introduced a dual-attention network that captured spatial and channel feature dependencies through self-attention, effectively gathering global and contextual information. SETR [19] was the first to use a Vision Transformer as an encoder for semantic segmentation, achieving remarkable results. Zhang et al. [20] proposed an Attention Mask (ATM) decoder module that leveraged spatial information from attention maps to generate mask predictions for each category, enabling effective and efficient semantic segmentation. Shi et al. [21] utilized self-attention and cross-attention in Vision Transformers to select multi-scale features, optimizing the process of contextual information selection. Each of these approaches offers innovations from a different perspective.
In the task of the semantic segmentation of remote sensing images, to achieve precise segmentation results, researchers have proposed numerous advanced methods. Mou et al. [1] enhanced the representation of image features by modeling long-range spatial relationships in remote sensing images through the introduction of spatial and channel relationship modules. Ma et al. [2] designed a semantic segmentation model for small objects that activated small object features while suppressing large-scale backgrounds, thereby more effectively distinguishing small objects. Liu et al. [3] proposed a SwinTransformer method with Upper Head to address the complex background samples and inconsistent class distributions in remote sensing data. Additionally, Reed et al. [4] focused on the interpretation of scale-specific information in remote sensing images by coding and reconstructing high-frequency and low-frequency images at different scales. Other approaches [22,23,24,25,26,27] investigated, from diverse viewpoints, the method of simultaneously maintaining long-range dependencies and local detail integrity in image processing.
In summary, while existing methods have achieved remarkable performance through meticulously designed network architectures, their fundamental limitation lies in the excessive focus on semantic correlations while neglecting the effective role of geometric contextual information in SS. This paper addresses this gap by integrating height estimation tasks to explore the geometric distribution patterns inherent in SS, thereby providing hierarchical geometric supervision signals to enhance the robustness of SS in complex scenarios.

2.1.2. Height Estimation

The goal of height estimation (HE) is to obtain the height value represented by each pixel in an image, capturing spatial geometric information on a per-pixel basis to interpret the scene. HE has a wide range of applications in urban planning, damage monitoring, and disaster prediction. Traditional methods often relied on photogrammetry [28], SAR bathymetry [29,30], and LiDAR processing [31]; however, these techniques often depended on expensive hardware and imposed strict requirements on the input data, making the HE task not only costly but also time-consuming. With the significant advancements in neural networks in recent years, some works have begun to attempt to use simple inputs, such as single satellite images, to obtain the height information of surface objects. Mou et al. [5] designed a convolutional–deconvolutional architecture for height estimation and trained the network in an end-to-end manner. Amirkolaee et al. [6] proposed an upsampling method to preserve feature information as much as possible during the resolution enhancement process. PLNet [32] employed a progressive learning network, adopting a coarse-to-fine strategy to accurately predict the height information of objects within images. Li et al. [8] suggested dividing the height values into spacing-increasing intervals, transforming the regression problem into an ordinal regression problem, and combining the ASPP module to obtain multi-scale information. MTBR-Net [7] explored the extraction of building height information from single oblique images by designing an offset task for height estimation. Zhao et al. [10] observed that the relative elevation values of pixels in remote sensing images are highly correlated with semantic categories, so they adopted a GAN-based framework and improved height estimation by using semantic guidance.
In short, existing HE methods model the distribution pattern of height values only by mining connections within the geometric information to obtain the corresponding height map. Due to the specificity of remote sensing imaging, this is very challenging in complex remote sensing scenarios. In this paper, SS is introduced as an aid to HE, constraining the spatial continuity of height values by providing contextual semantic information as complementary cues, thus improving the accuracy and applicability of the model in complex remote sensing scenes.

2.2. Multi-Task Learning

Multi-task learning (MTL) aims to enhance model generalization by simultaneously capturing the shared features and unique attributes across different tasks. MTL methods have historically been divided into hard parameter sharing and soft parameter sharing. However, with the advancement of research, MTL methods are now more commonly categorized into encoder-focused and decoder-focused approaches based on the network location where tasks exchange or share information or features [33]. PAD-Net [34] was one of the earliest decoder-focused architectures, proposing a novel prediction distillation network structure for multi-task learning, which improved task predictions by extracting inter-task feature information. Zhang et al. [35], adopting a similar architecture to PAD-Net, proposed the concept of “task affinity” based on statistical observations of data, clarifying the modeling of feature similarities. MTI-Net [36] extended the distillation idea to multiple feature scales, referred to as “multi-scale multi-modal” distillation. InvPT++ [37] introduced an attention transfer mechanism to preserve as much attention information from the previous layer as possible during the decoding process. EMA-Net [38] proposed a lightweight framework for capturing cross-task information at multi-scale hierarchies with a low parameter cost. Agiza et al. [39] focused on the multi-task learning paradigm in the large-model domain, addressed the parameter-space decoupling of MTL, and proposed a multi-task learning approach based on LoRA fine-tuning. Wang et al. [40] leveraged prompts as induced priors for each task, thereby controlling the flow of task-specific information across tasks. Of course, with the development of research, numerous methods which do not fall into the above categories have also been proposed [41,42,43,44,45], each elucidating the role of their method in MTL from a different perspective.
In the remote sensing community, works based on MTL have reported encouraging results. For example, Srivastava et al. [46] first proposed using a multi-task CNN to jointly learn SS and HE. Zheng et al. [46] designed a novel pyramid-on-pyramid network (Pop-Net) based on an encoder–dual-decoder framework, which could simultaneously predict semantic labels and normalized Digital Surface Models. Wang et al. [47] introduced a boundary attention module into the multi-task learning framework, focusing more on local details while simultaneously predicting height, segmentation results, and boundary information. Xing et al. [48] proposed task-aware feature separation modules and cross-task adaptive propagation modules (CAPM) to improve the performance of SS and HE. Liu [49] adopted the concept of task affinity to promote the interaction between SS and HE as much as possible, while introducing a gating network to encode each task separately. Mao et al. [50] improved the gating network and effectively conducted feature interaction between branches. In addition, Zhang et al. [51] attempted to combine the super-resolution task with SS for joint learning. Gao et al. [52] utilized contrastive learning to combine SS and HE for joint learning.
Compared to existing methods, our approach places greater emphasis on the complex relationships and interactions between features. We obtain highly discriminative initial features for different task branches, which serve as the raw materials to activate task-specific pathways. By extracting more enriched task-specific features from the shared general features at multiple scales, we enhance the specialized representational capacity of each task. Simultaneously, we conduct inter-task feature interaction at appropriate stages, thereby achieving collaborative optimization and information sharing between tasks. This effectively conveys the latent associations between different tasks and promotes positive feedback, further improving the model’s performance and adaptability in complex scenarios.

3. Methodology

In this section, we will introduce the various modules in our proposed CTME-Net network architecture, including the Initial Task-specific Feature Embedding Module (ITFEM), Adaptive Task-specific Feature Distillation Module (ATFDM), and Attention-Based Task Interaction Module (ABTIM). Subsequently, we will introduce the loss functions and optimization methods used in the network. We list the abbreviations and their corresponding meanings in our work in Table 1 for ease of reading.

3.1. Overview

As illustrated in Figure 2, the proposed multi-task learning (MTL) framework is encapsulated within an encoder–decoder architecture. The core components of this network architecture include a shared backbone network for multi-scale image feature extraction, an Initial Task-specific Feature Embedding Module (ITFEM) for the dual-branch tasks, an Adaptive Task-specific Feature Distillation Module (ATFDM), and an Attention-Based Task Interaction Module (ABTIM). In the implementation process, the input image X is first passed through the shared backbone network to extract multi-level feature representations. Subsequently, the ITFEM generates baseline features for each task branch. To extract task-specific features from the shared general features, the ATFDM selectively fuses global features with task-specific features, balancing the completeness of global contextual information while minimizing the impact of irrelevant and redundant features on the tasks. Moreover, to capture the potential linkages between tasks, the ABTIM is employed across different network layers to interact between the task-specific features at various scales. This mechanism enables the capturing of inter-task associations and enhances the positive knowledge transfer effect. Detailed architectural specifications and implementation strategies for each component will be systematically elaborated upon in subsequent sections.

3.2. Initial Task-Specific Feature Embedding Module

In multi-task learning frameworks, task-specific preliminary predictions play a pivotal role by generating highly discriminative initial features for each task branch, thereby eliminating redundant computational operations associated with feature disentanglement during the initialization phase. Moreover, as the fundamental component for activating task-specific inference pathways, these preliminary predictions serve as essential prerequisites for achieving cross-task feature decoupling. To realize this objective, we have developed a universal initial decoder architecture applicable to all downstream tasks. While maintaining structural consistency across tasks, the parameter updates for each task-specific decoder are independently conducted during training. This decoder architecture comprises two cascaded modules: a preliminary feature calibration module followed by a hierarchical feature integration module. The subsequent sections provide a comprehensive exposition of our design methodology.
As illustrated in Figure 3, the preliminary feature calibration module employs strip-wise pooling operations along horizontal and vertical spatial axes to construct axial-aware feature representations. This design utilizes an elongated pooling kernel configuration that effectively establishes long-range feature dependencies while preserving feature resolution, with constrained pooling regions further suppressing background noise interference. Specifically, the encoder-generated initial features $F \in \mathbb{R}^{C \times H \times W}$ undergo parallel average pooling operations along the horizontal and vertical spatial axes, yielding directional global context vectors $v_X \in \mathbb{R}^{C \times H \times 1}$ and $v_Y \in \mathbb{R}^{C \times 1 \times W}$ that capture horizontal and vertical spatial statistics, respectively. Subsequently, these vectors undergo 1D convolution operations to amplify directional discriminative information, followed by Group Normalization and Sigmoid activation to generate spatial attention maps. The computational workflow can be formally described as follows:
$Z_h = \sigma\big(GN(F_h(v_X))\big)$
$Z_w = \sigma\big(GN(F_w(v_Y))\big)$
Here, $\sigma$ denotes the Sigmoid activation function, $GN$ stands for Group Normalization, and $F_h$ and $F_w$ denote 1D convolutions with a kernel size of 7. We perform element-wise multiplication between the initial features $F$ and the horizontal/vertical attention weights $Z_h$ and $Z_w$, thereby generating the globally attentive feature representation $F_{attn}$ enriched with comprehensive spatial–semantic information for subsequent task processing. The hierarchical feature integration module consists of two cascaded convolutional blocks, each comprising a 3 × 3 convolution, a batch normalization operation, and a ReLU activation (denoted as Conv-BN-ReLU). The first-stage block operates on the attention-calibrated features $F_{attn}$ to capture localized contextual patterns, while the second-stage block further aggregates multi-scale semantic features through deep nonlinear transformations, ultimately producing high-dimensional discriminative representations $F_{pre} \in \mathbb{R}^{C \times H \times W}$. This hierarchical architecture enhances feature discriminability through progressive nonlinear mappings while maintaining parameter efficiency.
Following the preliminary decoding stage, a 1 × 1 convolutional layer is employed to generate task-specific preliminary predictions, which are supervised by task-specific labels through dedicated loss functions. Subsequently, we perform channel-wise fusion between the preliminary predictions and the original features $F$, resulting in task-enhanced feature representations with superior expressiveness for downstream processing.
$F_c = \mathrm{Conv}\big(\mathrm{Concat}(F_{pre}, \varphi(F_{pre}))\big)$
Here, $\varphi$ denotes the projection operation implemented via 1 × 1 convolution. This methodological framework enables the progressive disentanglement of task-specific feature representations during the initial decoding phase, generating highly targeted and discriminative feature embeddings for individual tasks. Such an architectural design not only enhances the model’s multi-task adaptability through specialized feature conditioning but also improves robustness against complex environmental perturbations while maintaining high prediction fidelity. The synergistic integration of task-aware feature separation and cross-task information preservation ultimately achieves efficient parameter utilization and collectively elevates the overall model performance across diverse application scenarios.
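To make the data flow above concrete, the following PyTorch sketch traces the calibration, integration, prediction, and fusion steps of the ITFEM. The layer widths, the group count of the Group Normalization, the 1D kernel length of 7, and the exact operands of the final fusion step are illustrative assumptions rather than the authors’ released configuration.

```python
import torch
import torch.nn as nn

class ITFEMSketch(nn.Module):
    """Illustrative Initial Task-specific Feature Embedding Module (assumed sizes)."""

    def __init__(self, channels: int, out_channels: int, groups: int = 8):
        super().__init__()
        # Strip-wise calibration: 1D convolutions along each spatial axis (kernel length assumed 7).
        self.conv_h = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        self.conv_w = nn.Conv1d(channels, channels, kernel_size=7, padding=3)
        self.gn_h = nn.GroupNorm(groups, channels)
        self.gn_w = nn.GroupNorm(groups, channels)
        # Hierarchical feature integration: two cascaded Conv-BN-ReLU blocks.
        self.integrate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.predict = nn.Conv2d(channels, out_channels, 1)   # preliminary task prediction (aux-supervised)
        self.project = nn.Conv2d(out_channels, channels, 1)   # projection back to feature space
        self.fuse = nn.Conv2d(2 * channels, channels, 1)      # channel-wise fusion (one reading of the fusion step)

    def forward(self, f: torch.Tensor):
        v_x = f.mean(dim=3)                                               # horizontal strip pooling -> (B, C, H)
        v_y = f.mean(dim=2)                                               # vertical strip pooling   -> (B, C, W)
        z_h = torch.sigmoid(self.gn_h(self.conv_h(v_x))).unsqueeze(-1)    # (B, C, H, 1)
        z_w = torch.sigmoid(self.gn_w(self.conv_w(v_y))).unsqueeze(-2)    # (B, C, 1, W)
        f_attn = f * z_h * z_w                                            # axis-aware recalibration of F
        f_pre = self.integrate(f_attn)                                    # F_pre
        pred = self.predict(f_pre)                                        # supervised by the task labels
        f_c = self.fuse(torch.cat([f_pre, self.project(pred)], dim=1))
        return f_c, pred
```

For the semantic segmentation branch, out_channels would correspond to the number of classes; for the height branch, a single channel.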

3.3. Adaptive Task-Specific Feature Distillation Module

Cross-scale feature modeling has emerged as a standard paradigm in dense prediction tasks, where hierarchical representation spaces spanning microscopic details to macroscopic contexts are constructed through the integration of local and global information, effectively alleviating the representational limitations inherent in single-scale features for complex scenarios. However, in conventional multi-task learning frameworks, the direct fusion mechanism for shared multi-scale features frequently induces negative transfer effects between tasks, leading to the mutual interference of task-specific discriminative patterns in high-dimensional feature spaces. This phenomenon ultimately manifests as the asymmetric degradation of cross-task generalization capabilities. To address this challenge, we propose an Adaptive Task-specific Feature Distillation Module (ATFDM) based on soft parameter sharing principles and gated mechanisms, which allocates dedicated feature refinement pathways for each task branch. Leveraging the task-specific features generated by the initial decoder as activation triggers, ATFDM dynamically captures inter-task parameter discrepancies to facilitate collaborative yet discriminative learning. This design enables the selective distillation of task-relevant multi-scale information while suppressing feature conflicts, thereby optimizing the trade-off between cross-task knowledge transfer and task-specific representation preservation.
As depicted in Figure 4, our framework implements the adaptive distillation of multi-scale global features from the backbone network. This process involves generating task-specific attention score maps to extract discriminative task-aware features, which are subsequently fused with localized representations to achieve global–local feature integration with task-optimized expressiveness. The core innovation resides in a task-selective gating module that dynamically modulates feature propagation pathways. Formally, we let $B_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ denote the shared feature maps from stage $i$ of the multi-task backbone network, $F_i^T \in \mathbb{R}^{C_i \times H_i \times W_i}$ represent the task-specific features for task $T$ at stage $i$, and $F_{i+1}^T \in \mathbb{R}^{C_{i+1} \times H_{i+1} \times W_{i+1}}$ denote the propagated features for subsequent processing. The learned task-specific attention score map $A_i$ can be formulated as:
$A_i = M_i^T\big(\mathrm{Conv}(\mathrm{Concat}(B_i, F_i^T))\big)$
Here, $M_i^T$ denotes the Gated Map Block, which is designed to generate feature weights correlated with the target task. This module comprises two consecutive Conv-BN-ReLU (CBR) layers, both with 3 × 3 kernels. The generic features extracted from the backbone network are integrated with task-specific features derived from the upper layers through a channel-wise fusion-squeeze operation, which initially aligns and compresses feature channels for dimensional compatibility. Subsequently, two 3 × 3 convolutional layers are employed to excavate spatial dependencies, explicitly modeling local interactions between task-specific features and global generic representations. Note that the number of channels has been aligned in this process so as to satisfy the calculation requirements. The resultant similarity scores act as conditional selectors for the backbone features, performing element-wise multiplication with the generic shared features $B_i$ from the backbone network. This operation yields task-adaptive features $f_i^T$, which retain task-discriminative patterns while preserving the backbone’s generalized representations.
$f_i^T = A_i \otimes B_i$
Here, $\otimes$ denotes element-wise multiplication. Finally, to facilitate the progressive optimization of deep task-specific representations and to accumulate task-specific feature expressions as much as possible, we follow the residual connection idea. The resulting task-specific features are combined with the task-specific features from the previous layer and then integrated through a feature recalibration module $\phi_\theta$ to generate the starting task-specific features for the next layer.
$F_{i+1}^T = \phi_\theta\big(f_i^T + F_i^T\big)$
We apply this module at multiple decoder stages beyond the initial layer and experiment with its placement; a detailed ablation analysis is presented in Section 5.
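The gating flow described in this subsection can be sketched as follows. The Sigmoid squashing of the score map, the element-wise addition used for the residual combination, and the assumption that the shared and task-specific features are already spatially and channel-wise aligned are illustrative choices, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

def cbr(cin: int, cout: int, k: int = 3) -> nn.Sequential:
    """Conv-BN-ReLU helper used below."""
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ATFDMSketch(nn.Module):
    """Illustrative Adaptive Task-specific Feature Distillation Module."""

    def __init__(self, channels: int):
        super().__init__()
        self.squeeze = nn.Conv2d(2 * channels, channels, 1)            # channel fusion-squeeze
        self.gated_map = nn.Sequential(cbr(channels, channels),
                                       cbr(channels, channels),
                                       nn.Sigmoid())                   # Gated Map Block producing A_i
        self.recalibrate = cbr(channels, channels)                     # feature recalibration phi_theta

    def forward(self, b_i: torch.Tensor, f_task: torch.Tensor) -> torch.Tensor:
        # b_i: shared backbone feature at stage i; f_task: task-specific feature from the previous
        # stage, assumed already resized and channel-aligned to b_i.
        a_i = self.gated_map(self.squeeze(torch.cat([b_i, f_task], dim=1)))
        f_distilled = a_i * b_i                                        # element-wise selection of shared features
        return self.recalibrate(f_distilled + f_task)                  # residual accumulation -> next-stage feature
```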

3.4. Attention-Based Task Interaction Module

In the framework of multi-task learning, the semantic and spatial geometric information of objects often exhibit significant correlation and coupling. These two types of information interact through heterogeneous multi-source features to form complex nonlinear mappings. This characteristic is particularly evident in remote sensing tasks, such as the relationship between the semantic and height information of objects. Such interdependencies inevitably lead to mutual influences. On one hand, the two tasks are complementary, providing additional useful information for each other’s solutions. Height estimation, as a typical ill-posed problem, is constrained by the irreversible mapping from infinite 3D scenes to 2D projections, where many possible 3D scenes may correspond to the same 2D image. The deep feature information implicitly modeled during semantic segmentation, such as shadow distribution, texture patterns, lighting conditions, and object geometry, can provide implicit regularization constraints for height estimation. This effectively reduces the dimensionality of the solution space, aids in determining the corresponding 3D scene, and improves the accuracy of height estimation. Similarly, height estimation can reciprocally assist semantic segmentation, achieving mutual enhancement. For example, semantic boundaries are spatially consistent with regions of abrupt height changes [46]. Categories with similar materials but different functions can be accurately distinguished through gradient features of elevation. On the other hand, the heterogeneity of cross-task feature spaces may also lead to negative transfer effects. For instance, although roads and low vegetation have clear semantic boundaries, their similar height characteristics can interfere with their distinction.
In view of the above discussion, we can draw two conclusions: firstly, the task features of semantic segmentation and height estimation can be exploited by each other, and it is possible to compensate for the discriminative power of both through feature interaction in the network; secondly, simple fusion interaction is risky, and a reasonable architecture must be adopted to capture the commonalities between the two task feature patterns in order to efficiently and accurately achieve a favorable positive migration effect. Therefore, we propose the Attention-Based Task Interaction Module (ABTIM) based on the cross-attention mechanism, which fully follows the above idea and performs prior attention score computation on height features and semantic features, respectively, to capture the commonality scores between the tasks, and then searches for the positive feedback connections while maximizing the potential connections between the tasks.
Our Attention-Based Task Interaction Module (ABTIM) is depicted in Figure 5. Specifically, it comprises two branches: the semantic branch and the height branch. These branches are structurally consistent, and both are constructed with self-attention modules. Taking the height estimation branch as an example, the objective is to embed geometric height information into semantic information, thereby enhancing the generalization of semantic features and mitigating the contamination caused by inconsistencies in height information.
Initially, the height features $F_H$, obtained through feature distillation, serve as the input. To alleviate the memory load associated with cross-attention computation, the input features are downsampled in advance. Subsequently, linear projections are applied to generate the corresponding query, key, and value matrices.
$Q_H = XW_Q, \quad K_H = XW_K, \quad V_H = XW_V$
Here, $W_Q$, $W_K$, and $W_V$ denote the learned projection matrices, initialized with default random values and optimized through iterative training to yield suitable projections. Similarly, for the input semantic features $F_S$, we generate $Q_S$, $K_S$, and $V_S$ using the same approach. These matrices are subsequently utilized for cross-attention computation.
$Attn_{HS} = \mathrm{Mask}\Big(\mathrm{Matmul}\big(\mathrm{Matmul}(Q_H, K_S^T), \tfrac{1}{C_A}\big)\Big)$
$Attn_{SH} = \mathrm{Mask}\Big(\mathrm{Matmul}\big(\mathrm{Matmul}(Q_S, K_H^T), \tfrac{1}{C_A}\big)\Big)$
In our ABTIM, $Attn_{HS}$ and $Attn_{SH}$ represent the unidirectional attention mappings from height features to semantic features and from semantic features to height features, respectively. These attention modules capture the correlations between task-specific features, enabling the dynamic interaction between semantic and spatial geometric information. Specifically, $Attn_{HS}$ uses the spatial geometric information from the height features as the query basis to search for semantically relevant information that is beneficial for height estimation. This enhances the semantic segmentation task’s understanding of object shapes and spatial layouts. Conversely, $Attn_{SH}$ leverages the implicit semantic information from the semantic segmentation task to provide semantic constraints for height estimation, aiding in the interpretation of ambiguous scenes. Through this bidirectional attention mechanism, the model effectively integrates cross-task features, thereby improving the overall performance of multi-task learning. To prevent numerical instability, we employ a scaling factor $C_A$ and utilize a mask operation to ignore invalid positions, thereby enhancing computational efficiency.
We use the Softmax function to transform these similarities into a probability distribution representing the value of the attentional weight of one task over another:
$Attn_{HS} = \mathrm{Softmax}(Attn_{HS})$
$Attn_{SH} = \mathrm{Softmax}(Attn_{SH})$
Then, we apply the obtained attention weight values to the values of the corresponding tasks to generate the final output vector by weighted summation. This process essentially extracts the most relevant information for the current task from another task and integrates it into a more expressive feature representation:
$T_{HS} = Attn_{HS} \times V_S$
$T_{SH} = Attn_{SH} \times V_H$
In practice, we employ a multi-head cross-attention mechanism, which segments the input features into multiple independent subspaces (heads) and computes the cross-attention weights in parallel within each subspace. This mechanism enables the model to learn diverse features and relationships across different subspaces, thereby more comprehensively capturing the complex dependencies within the input data. Specifically, the multi-head attention mechanism first projects the input queries, keys, and values into multiple subspaces via linear transformations, with each subspace corresponding to one head. Each head independently computes the attention scores and performs weighted summation of the values. The outputs from all heads are subsequently concatenated and passed through another linear transformation to produce the final output. This approach not only enhances the model’s expressive power but also allows for parallel computation, thereby accelerating the training process. The resulting interactive feature representation is formulated as follows:
$T_{HS} = \mathrm{Concat}\big(T_{HS}^1, T_{HS}^2, \ldots, T_{HS}^h\big) W_O$
$T_{SH} = \mathrm{Concat}\big(T_{SH}^1, T_{SH}^2, \ldots, T_{SH}^h\big) W_O$
where $h$ denotes the number of attention heads and $W_O$ is the output projection matrix.
We augment the original height features $F_H$ by adding the transformed cross-attention output $T_{HS}$ and subsequently process the resultant features through a Feed-Forward Network (FFN) to enhance the learned task-specific high-level representations. Similarly, we apply the same procedure to amplify the semantic features. Within the Attention-Based Task Interaction Module, the cross-attention mechanism enables the model to dynamically allocate attention weights across the feature spaces of different tasks. This allows each task to extract the most valuable information from the features of other tasks, effectively conveying the latent inter-task linkages and fostering positive feedback between tasks. Through this mechanism, the ABTIM dynamically captures the intricate relationships between tasks, thereby further enhancing the overall performance of the model in multi-task learning scenarios.
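As a minimal sketch of the bidirectional interaction described above, the following code uses PyTorch’s built-in multi-head attention in place of the hand-written masked cross-attention; the prior downsampling, the mask operation, and the FFN width and its residual form are therefore simplifications and assumptions for illustration.

```python
import torch
import torch.nn as nn

class ABTIMSketch(nn.Module):
    """Illustrative bidirectional cross-task attention between height and semantic features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Queries of one task attend over keys/values of the other task.
        self.attn_h2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_s2h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_h = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_s = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_h: torch.Tensor, f_s: torch.Tensor):
        # f_h, f_s: (B, C, H, W) height / semantic features, assumed already downsampled.
        b, c, h, w = f_h.shape
        tok_h = f_h.flatten(2).transpose(1, 2)                 # (B, HW, C) token sequences
        tok_s = f_s.flatten(2).transpose(1, 2)
        t_hs, _ = self.attn_h2s(tok_h, tok_s, tok_s)           # height queries over semantic keys/values
        t_sh, _ = self.attn_s2h(tok_s, tok_h, tok_h)           # semantic queries over height keys/values
        tok_h = tok_h + t_hs                                   # augment height features with T_HS
        tok_s = tok_s + t_sh                                   # augment semantic features with T_SH
        tok_h = tok_h + self.ffn_h(tok_h)                      # FFN refinement (residual form assumed)
        tok_s = tok_s + self.ffn_s(tok_s)
        f_h_out = tok_h.transpose(1, 2).reshape(b, c, h, w)
        f_s_out = tok_s.transpose(1, 2).reshape(b, c, h, w)
        return f_h_out, f_s_out
```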

3.5. Loss Functions

We use different loss functions for the supervised training of each task. For the semantic segmentation task, we use the cross-entropy loss, defined as follows:
$L_{SS} = \frac{1}{N}\sum_i L_i = -\frac{1}{N}\sum_i \sum_{c=1}^{M} y_{ic}\log p_{ic}$
Here, $M$ denotes the number of classes; $y_{ic}$ represents the indicator function, taking a value of 1 if the true class of sample $i$ is $c$, and 0 otherwise; and $p_{ic}$ represents the predicted probability that sample $i$ belongs to class $c$.
For the height estimation task, we use the L1 loss, defined as follows:
$L_{HE} = \frac{1}{N}\sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|$
where $h_i$ and $\hat{h}_i$ represent the true and predicted height values, respectively. In addition to the final prediction, our total loss function includes the loss used for supervision in the initial decoder. In the initial decoder, the same task branch uses the same loss function; for example, the initial decoder of the SS branch also uses the cross-entropy loss. Thus, the loss function for each task branch can be represented as follows:
$L_{SS} = L_{SS\_Final} + \omega_{Aux} L_{SS\_Pre}$
$L_{HE} = L_{HE\_Final} + \omega_{Aux} L_{HE\_Pre}$
In order to facilitate the optimization of the overall loss function, considering the consistent direction and rate of descent of the loss function for the same task, we set $\omega_{Aux}$ to 1, shifting the optimization objectives to the task level. Thus, the goal of optimizing the multi-task model becomes optimizing the loss functions of the two different tasks, $L_{SS}$ and $L_{HE}$.
Inspired by the literature [53], we use homoskedastic uncertainty to balance the loss function. Homoskedastic uncertainty is independent of the inputs and depends on the inherent uncertainty of the task; thus, by transforming homoskedastic uncertainty into the weight of the loss, the model can have the ability to dynamically adjust the loss. We introduce two learnable parameters that optimize the variance through a logarithmic form, thus adaptively tuning the model’s learning process. The weight of each task can be expressed by the following equation:
$\omega_i = e^{-\alpha_i} = \frac{1}{\sigma_i^2}, \quad i = 1, 2, \ldots, n$
where $\alpha_i$ represents the learnable parameter we introduce and $\sigma_i^2$ represents the variance. This setup ensures that the weight of each task is determined by the log-variance parameter without directly optimizing the variance itself, avoiding the gradient explosion that occurs when the variance tends to 0. Meanwhile, after taking logarithms, $\alpha_i$ ranges over the real numbers and the optimization process is more stable.
In summary, our total loss can be written in the following form:
$L_{Overall} = \omega_{SS} L_{SS} + \omega_{HE} L_{HE} + \alpha_{SS} + \alpha_{HE} = e^{-\alpha_{SS}} L_{SS} + e^{-\alpha_{HE}} L_{HE} + \alpha_{SS} + \alpha_{HE}$
In this formula, $\alpha_{SS}$ and $\alpha_{HE}$ represent the two learnable parameters we introduce. Meanwhile, we add them to the total loss as regularization terms to prevent the variance from tending to infinity.
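A compact sketch of this uncertainty-weighted objective is given below. It assumes the auxiliary (initial-decoder) predictions have already been resized to label resolution, and the zero initialization of the learnable log-variance parameters is an illustrative choice.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Illustrative total loss combining SS and HE branches with learnable log-variance weights."""

    def __init__(self, aux_weight: float = 1.0):
        super().__init__()
        self.alpha_ss = nn.Parameter(torch.zeros(1))   # learnable log-variance for the SS task
        self.alpha_he = nn.Parameter(torch.zeros(1))   # learnable log-variance for the HE task
        self.ce = nn.CrossEntropyLoss()
        self.l1 = nn.L1Loss()
        self.aux_weight = aux_weight                   # omega_Aux (set to 1 in the paper)

    def forward(self, ss_final, ss_pre, ss_gt, he_final, he_pre, he_gt):
        # Per-task losses: final prediction plus auxiliary (initial-decoder) prediction.
        loss_ss = self.ce(ss_final, ss_gt) + self.aux_weight * self.ce(ss_pre, ss_gt)
        loss_he = self.l1(he_final, he_gt) + self.aux_weight * self.l1(he_pre, he_gt)
        # L = exp(-alpha_SS) * L_SS + exp(-alpha_HE) * L_HE + alpha_SS + alpha_HE
        total = (torch.exp(-self.alpha_ss) * loss_ss
                 + torch.exp(-self.alpha_he) * loss_he
                 + self.alpha_ss + self.alpha_he)
        return total.squeeze()
```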

4. Experiment

In this section, we first introduce the two public datasets used in the experiments. We then provide the implementation details and evaluation metrics used in the experiments. Subsequently, we compare our proposed method with state-of-the-art multi-task algorithms on the aforementioned datasets, demonstrating its effectiveness. Finally, ablation studies are conducted to further analyze the contributions of each proposed module.

4.1. Dataset

The Vaihingen dataset contains 33 high-resolution aerial images with an average size of approximately 2500 × 2000 pixels. The dataset contains orthorectified images in three bands (near-infrared, red, and green), pixel-level semantic annotations, and the corresponding normalized Digital Surface Models (nDSMs). The spatial resolution of the dataset is 9 cm and there are six semantic labels (impervious surfaces, building, low vegetation, tree, car, and background clutter) in the semantic annotations. The training set consists of 16 images according to the official ISPRS division method, while the remaining 17 images are used as the test set.
The Potsdam dataset contains 38 high-resolution aerial images, each of which has a size of 6000 × 6000 pixels, and also provides orthoimages containing four bands (red, green, blue, and near-infrared), pixel-level semantic annotations, and the corresponding normalized Digital Surface Models (nDSMs). The dataset adopts the same semantic classification system as the Vaihingen dataset, which also includes six categories of semantic labels (impervious surfaces, building, low vegetation, tree, car, and background clutter). In this study, we select the red, green, and blue (RGB) channels from the original four-band data for model training. Following the instructions of ISPRS, the training set includes 24 images (note that image 7_10 is excluded due to annotation errors), while the remaining 14 images are used as the test set. The spatial distribution characteristics of typical samples in the dataset are illustrated in Figure 6.

4.2. Evaluation Metrics

In this paper, we use different indicators to evaluate different tasks. For SS, we use mean intersection over union (mIoU), overall accuracy (OA), and F1 score to quantify the performance of models:
$IoU = \frac{N_{TP}}{N_{TP} + N_{FP} + N_{FN}}$
$mIoU = \frac{1}{L}\sum_{l=1}^{L} IoU_l$
$OA = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{FP} + N_{FN} + N_{TN}}$
$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$
where $N_{TP}$, $N_{FP}$, $N_{FN}$, and $N_{TN}$ are the numbers of true positive, false positive, false negative, and true negative pixels, respectively. Following standard evaluation protocols, we compute the mean F1 score (mF1) and mean intersection over union (mIoU) across the five foreground classes (impervious surfaces, buildings, low vegetation, trees, and cars).
For HE, we quantitatively evaluate the results using four indicators, namely, absolute relative error (AbsRel), mean absolute error (MAE), root mean square error (RMSE), and accuracy with threshold $\delta_i$. The specific formulas are as follows:
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( h_i - \hat{h}_i \right)^2}$
$\delta_i = \max\!\left(\frac{H_g}{H_e}, \frac{H_e}{H_g}\right) < 1.25^i, \quad i \in \{1, 2, 3\}$
$AbsRel = \frac{1}{N}\sum_{i=1}^{N} \frac{\left| h_i - \hat{h}_i \right|}{h_i}$
$MAE = \frac{1}{N}\sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|$
where $N$ represents the total number of pixels in the image, $h_i$ is the ground-truth height, and $\hat{h}_i$ is the predicted height value.
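The metrics above can be computed as in the following NumPy sketch. The epsilon guard against zero ground-truth heights and the averaging over all classes of the confusion matrix (rather than only the five foreground classes) are simplifications for illustration.

```python
import numpy as np

def height_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
    """Per-image AbsRel, MAE, RMSE, and delta_i accuracies for height estimation."""
    pred, gt = pred.ravel().astype(float), gt.ravel().astype(float)
    abs_err = np.abs(pred - gt)
    mae = abs_err.mean()
    rmse = np.sqrt(((pred - gt) ** 2).mean())
    abs_rel = (abs_err / np.maximum(gt, eps)).mean()                        # eps guards zero heights
    ratio = np.maximum(pred / np.maximum(gt, eps), gt / np.maximum(pred, eps))
    deltas = [(ratio < 1.25 ** i).mean() for i in (1, 2, 3)]
    return abs_rel, mae, rmse, deltas

def segmentation_metrics(conf: np.ndarray):
    """mIoU, OA, and mean F1 from a confusion matrix (rows: ground truth, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1.0)
    f1 = 2.0 * tp / np.maximum(2.0 * tp + fp + fn, 1.0)
    oa = tp.sum() / conf.sum()
    return iou.mean(), oa, f1.mean()
```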

4.3. Implementation Details

All experiments in this study were conducted using PyTorch to build the models and trained on an NVIDIA GeForce RTX 4090 GPU. Both CTME-Net and the comparison methods used backbones pre-trained on ImageNet. Following much previous work, we used ResNet101 as the backbone network in CTME-Net. We randomly cropped 512 × 512 patches from the training images as the input for network training. The total number of training epochs was set to 100, with a batch size of 4 and an initial learning rate of 0.0005. To facilitate rapid convergence, we employed the AdamW optimizer with weight decay (decay coefficient of 0.001 and beta values of 0.5 and 0.999). A cosine annealing strategy was used to adjust the learning rate, with the minimum learning rate set to 0.00001 and a 5-epoch warm-up phase to ensure stable training. For data augmentation, we applied three methods: horizontal flipping, vertical flipping, and rotation, with a probability of 0.5.
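The optimizer and scheduler settings listed above correspond roughly to the following PyTorch setup; the stand-in model and the omission of the warm-up wrapper are placeholders rather than the actual training script.

```python
import torch
import torch.nn as nn

# Stand-in model; the real network would be CTME-Net with a ResNet101 backbone.
model = nn.Conv2d(3, 6, 3)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              weight_decay=1e-3, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... inner loop over 512 x 512 random crops (batch size 4) with flip / rotation
    # augmentation: forward pass, multi-task loss, loss.backward(), optimizer.step(),
    # optimizer.zero_grad() ...
    scheduler.step()   # cosine annealing per epoch; the 5-epoch warm-up wrapper is omitted
```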

5. Results and Ablation Analysis

5.1. Experiment Results

To verify the effectiveness of the proposed method, we compare our approach with other state-of-the-art methods on the Vaihingen and Potsdam datasets, including single-task learning (STL) methods that perform only height estimation or semantic segmentation, as well as multi-task learning (MTL) methods that jointly execute height estimation and semantic segmentation. The comparative results are presented in Table 2 and Table 3.
In Table 2, we report the quantitative evaluation results on the Vaihingen dataset. Compared to the state-of-the-art methods, our approach achieves superior performance in both semantic segmentation (SS) and height estimation (HE). Specifically, for SS, our model attains an average F1 score of 89.6%, an overall accuracy (OA) of 90.8%, and a mean intersection over union (mIoU) of 81.5%. For HE, our model achieves an absolute relative error (AbsRel) of 0.791, a mean absolute error (MAE) of 1.184, and a root mean squared error (RMSE) of 1.790, with higher $\delta_i$ accuracy. The results demonstrate that, compared to single-task frameworks, our proposed method not only achieves competitive results in the segmentation task but also reaches state-of-the-art performance in most of the height estimation metrics. Moreover, compared to other multi-task frameworks, our approach is more effective and reliable, outperforming other methods in the majority of metrics. For the few metrics on which our method is slightly inferior, it still achieves results very close to the best. We present some qualitative results in Figure 7. As shown in Figure 7, compared to single-task models, our model’s segmentation results exhibit clearer and more distinct boundaries, with a significant reduction in misclassification and omission. Additionally, the predicted height values for pixels within the same category are smoother and more consistent, and the omission in height prediction is notably reduced, thanks to the embedding of semantic information. This indicates that we can leverage semantic information to enhance the generation of height information, while also using the abrupt changes in height information to constrain the accuracy of semantic segmentation. BAMTL [47], which incorporates boundary detection, also confirms this viewpoint, thus achieving better performance in MAE and RMSE.
Similarly, in addition to the Vaihingen dataset, we further report the results on the Potsdam dataset. As shown in Table 3, similar to the results on the Vaihingen dataset, our method outperforms the state-of-the-art method in the majority of metrics. We also present some qualitative results on the Potsdam dataset to demonstrate the effectiveness of our approach, as shown in Figure 8. The results from both datasets indicate that our proposed strategy is more effective and reliable.

5.2. Ablation Analysis

In this section, we perform extensive ablation experiments on the Vaihingen dataset to investigate the effectiveness of our core ideas and proposed model designs. We adopt the same backbone and keep the details of the experiments described earlier unchanged in all the experiments.
We initially conduct ablation studies on the three proposed modules. The quantitative results of the ablation studies are presented in Table 4, with a total of six methodological outcomes: (1) STL-S: Results of the single-task learning method for semantic segmentation. (2) STL-H: Results of the single-task learning method for height estimation. (3) MTL-Baseline: A multi-task learning network without any additional modules, comprising a shared encoder and two task-specific heads. (4) MTL-B+ITFEM: Results of adding the proposed ITFEM to the baseline. (5) MTL-B+ITFEM+ATFDM: Results of adding the proposed ATFDM to the MTL-B+ITFEM. (6) MTL-B+ITFEM+ATFDM+ABTIM (CTME-Net), which is the multi-task model incorporating all the proposed modules. The visual results of the ablation studies are also displayed in Figure 9.
As shown in Table 4, the first and second rows correspond to the experimental results of the single-task models on the Vaihingen dataset. The third row represents the results of the MTL-Baseline. It is evident that a simple multi-task model using hard parameter sharing leads to a certain degree of performance degradation across all metrics. We attribute this phenomenon to the negative transfer effect caused by the indiscriminate sharing of parameters at the lower levels during multi-task training, resulting in conflicts between task pairs and inconsistent optimization directions, thereby preventing the model from finding a Pareto optimal solution. This further underscores the necessity of selective feature extraction for tasks. The fourth row illustrates the effect of adding the ITFEM to the baseline model. It can be observed that preliminary feature disentanglement, combined with label supervision, leads to a certain recovery in model performance across various metrics, especially the semantic segmentation metrics, which almost match those of the single-task model. Subsequently, the fifth row demonstrates the results of adding both the ITFEM and ATFDM to the baseline model. Through the embedding of preliminary features and the selective fusion of multi-scale feature information, our multi-task model achieves further improvements in metrics for both tasks. It is worth noting that the addition of ATFDM results in a slight decline in the AbsRel metric. We posit that this indicates the incomplete disentanglement and fusion of semantic and height information, highlighting the necessity for further task feature interaction. The last row presents the complete experimental results of our CTME-Net on the Vaihingen dataset. After incorporating the task feature interaction mechanism, it is evident that all metrics have significantly improved, strongly corroborating the effectiveness and reliability of our proposed CTME-Net. Figure 9 displays some of the visual results of the ablation studies, revealing that the effective integration of each module not only compensates for misclassification and omission in segmentation but also enhances the accuracy and robustness of height estimation. This fusion strategy, by integrating multi-source information and leveraging the advantages of multi-task learning, enables the model to better understand the geometric structure and spatial relationships of objects in complex scenarios.
To further analyze model performance, we conduct systematic ablation studies on the position and number of the feature interaction modules. The results indicate that the deployment position of the feature interaction modules significantly impacts the overall model performance. When positioned closer to the input layer of the network, the feature interaction module can capture the direct associations between features more promptly, thereby effectively enhancing the model’s ability to model complex patterns. However, when placed towards the later stages of the decoder, the feature interaction module not only fails to further improve model performance but also leads to a decline in the relevant metrics. We speculate that this phenomenon may be attributed to two main reasons. On one hand, the later stages of the decoder are more sensitive to minor changes in input features, causing the model to experience significant fluctuations during optimization. This can result in the model becoming trapped in local optima rather than converging to a global optimal state, ultimately leading to a decrease in performance metrics. On the other hand, the shallow feature information fused at the later stages of the decoder has relatively lower semantic content, focusing primarily on global information and being less sensitive to fine-scale changes. Under such circumstances, the feature interaction module may not be able to fully leverage its advantages and may even introduce noise or redundant information, thereby weakening the overall model performance. We determine the optimal deployment position and number of feature interaction modules through a series of ablation experiments, with the specific results presented in Table 5. Here, L1, L2, and L3 denote the locations where the interactions occur, where L1 corresponds to the first stage of the decoder, L2 to the second stage of the decoder, and L3 to the third stage of the decoder.
We also examine the effect of the number of attention heads in the ABTIM; the results are reported in Table 6. Multiple heads allow the model to learn different relationships between features in separate subspaces, improving cross-feature interaction. The first three rows of the table reflect this: the performance of both semantic segmentation and height estimation improves as the number of heads increases, peaking at 8 heads. However, too many heads introduce parameter redundancy, pushing the interactions toward near-identical attention patterns and degrading performance, as the last two rows of the table confirm.
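As a concrete illustration of the bidirectional, multi-head cross-attention interaction ablated here, the sketch below lets each task branch query the other's features with a configurable number of heads. The class name, tensor layout, and use of PyTorch's nn.MultiheadAttention are assumptions for illustration, not the exact ABTIM implementation.

```python
# Hedged sketch of a bidirectional multi-head cross-attention block in the
# spirit of ABTIM: semantic features attend to height features and vice versa.
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.sem_from_height = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.height_from_sem = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_sem = nn.LayerNorm(dim)
        self.norm_height = nn.LayerNorm(dim)

    def forward(self, f_sem, f_height):
        # f_sem, f_height: (B, N, C) token sequences flattened from feature maps.
        sem_upd, _ = self.sem_from_height(query=f_sem, key=f_height, value=f_height)
        height_upd, _ = self.height_from_sem(query=f_height, key=f_sem, value=f_sem)
        # Residual connections keep each task's own representation dominant.
        f_sem = self.norm_sem(f_sem + sem_upd)
        f_height = self.norm_height(f_height + height_upd)
        return f_sem, f_height

# Example: 8 heads gave the best trade-off in the ablation; many more heads
# mainly add redundant, near-identical attention patterns.
block = BidirectionalCrossAttention(dim=256, num_heads=8)
f_s, f_h = block(torch.randn(2, 1024, 256), torch.randn(2, 1024, 256))
```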

6. Conclusions

This paper proposes a multi-task learning network, termed CTME-Net, designed to jointly perform semantic segmentation and height estimation. Capitalizing on the strong correlation between semantic and height features, the network disentangles task features and lets them interact to obtain task-specific representations. In the disentanglement stage, an Initial Task-specific Feature Embedding Module (ITFEM) first separates semantic and height features; by supervising the separated features with ground truth, the network obtains initial task-specific representations. To further improve performance by collecting task-beneficial information from the global features, an Adaptive Task-specific Feature Distillation Module (ATFDM) is introduced: based on a gating mechanism, it selects and fuses task-beneficial features from the multi-scale general features at different stages of the decoder. Finally, to explore the latent associations between tasks, compensate for insufficient task interaction, and promote positive synergy, a cross-attention-based feature interaction module is designed. This module enables dynamic interaction and information fusion between semantic and spatial features, enriching both dimensions of the representation; as a result, the model gains a more comprehensive understanding of the input data and exhibits stronger adaptability and generalization in multi-task learning scenarios. Extensive experiments on two public datasets demonstrate the effectiveness and superiority of the proposed method. Future work will explore additional task interaction mechanisms, introduce further effective information to enrich task representations, and optimize memory usage and inference speed to comprehensively improve model performance.
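To make the gating idea behind the ATFDM concrete, the sketch below shows one plausible form of gated distillation: a sigmoid gate computed from the task-specific and shared features decides how much of the shared multi-scale feature is absorbed into the task branch. The layer names and exact layout are assumptions, not the module as implemented in the paper.

```python
# Illustrative sketch of a gating-based feature distillation step: only the
# gated (task-beneficial) part of the shared feature is merged into the branch.
import torch
import torch.nn as nn

class GatedTaskDistillation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.Sigmoid(),                      # per-channel, per-pixel selection weights
        )
        self.fuse = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    def forward(self, task_feat, shared_feat):
        # task_feat, shared_feat: (B, C, H, W) at the same decoder stage.
        g = self.gate(torch.cat([task_feat, shared_feat], dim=1))
        return self.fuse(task_feat + g * shared_feat)
```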

Author Contributions

Conceptualization, X.P., F.W. and Z.W.; methodology, X.P.; software, X.P.; validation, X.P.; formal analysis, X.P.; investigation, X.P.; resources, X.P.; data curation, X.P. and J.Z.; writing—original draft preparation, X.P.; writing—review and editing, F.W., S.L. and L.L.; visualization, X.P.; supervision, F.W. and S.L.; project administration, F.W.; funding acquisition, F.W. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China under Grant 2021YFB3901201.

Data Availability Statement

No new data were generated in this study. The data used in this study are publicly available. All datasets mentioned in this paper are available online (https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/Default.aspx, accessed on 15 May 2024).

Acknowledgments

The authors would like to thank the International Society for Photogrammetry and Remote Sensing (ISPRS) for providing the Vaihingen and Potsdam datasets for research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mou, L.; Hua, Y.; Zhu, X.X. Relation Matters: Relational Context-Aware Fully Convolutional Network for Semantic Segmentation of High-Resolution Aerial Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569. [Google Scholar] [CrossRef]
  2. Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606216. [Google Scholar] [CrossRef]
  3. Liu, Y.; Mei, S.; Zhang, S.; Wang, Y.; He, M.; Du, Q. Semantic Segmentation of High-Resolution Remote Sensing Images Using an Improved Transformer. In Proceedings of the 2022 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2022), Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3496–3499. [Google Scholar]
  4. Reed, C.J.; Gupta, R.; Li, S.; Brockman, S.; Funk, C.; Clipp, B.; Keutzer, K.; Candido, S.; Uyttendaele, M.; Darrell, T. Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, ICCV, Paris, France, 2–3 October 2023; pp. 4065–4076. [Google Scholar]
  5. Mou, L.; Zhu, X. IM2HEIGHT: Height Estimation from Single Monocular Imagery via Fully Residual Convolutional-Deconvolutional Network. arXiv 2018, arXiv:1802.10249. [Google Scholar]
  6. Amirkolaee, H.A.; Arefi, H. Height Estimation from Single Aerial Images Using a Deep Convolutional Encoder-Decoder Network. ISPRS J. Photogramm. Remote Sens. 2019, 149, 50–66. [Google Scholar] [CrossRef]
  7. Li, W.D.; Meng, L.; Wang, J.; He, C.; Xia, G.-S.; Lin, D. 3D Building Reconstruction from Monocular Remote Sensing Images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), Montreal, QC, Canada, 10–17 October 2021; pp. 12528–12537. [Google Scholar]
  8. Li, X.; Wang, M.; Fang, Y. Height Estimation from Single Aerial Images Using a Deep Ordinal Regression Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6000205. [Google Scholar] [CrossRef]
  9. Mao, Y.; Chen, K.; Zhao, L.; Chen, W.; Tang, D.; Liu, W.; Wang, Z.; Diao, W.; Sun, X.; Fu, K. Elevation Estimation-Driven Building 3-D Reconstruction from Single-View Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608718. [Google Scholar] [CrossRef]
  10. Zhao, W.; Persello, C.; Stein, A. Semantic-Aware Unsupervised Domain Adaptation for Height Estimation from Single-View Aerial Images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 372–385. [Google Scholar] [CrossRef]
  11. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  13. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE Computer Society: Los Alamitos, CA, USA, 2017; pp. 6230–6239. [Google Scholar]
  14. Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Honolulu, HI, USA, 2017; pp. 5168–5177. [Google Scholar]
  15. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context Encoding for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  17. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–17 June 2019; pp. 5212–5221. [Google Scholar]
  18. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–17 June 2019; pp. 3141–3149. [Google Scholar]
  19. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 6877–6886. [Google Scholar]
  20. Zhang, B.; Tian, Z.; Tang, Q.; Chu, X.; Wei, X.; Shen, C.; Liu, Y. SegViT: Semantic Segmentation with Plain Vision Transformers. In Proceedings of the Advances in Neural Information Processing Systems 35 (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  21. Shi, H.; Hayat, M.; Cai, J. Transformer Scale Gate for Semantic Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Paris, France, 2–3 October 2023; pp. 3051–3060. [Google Scholar]
  22. Almarzouqi, H.; Saoud, L.S. Semantic Labeling of High-Resolution Images Using EfficientUNets and Transformers. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4402913. [Google Scholar] [CrossRef]
  23. Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images. Remote Sens. 2022, 14, 1956. [Google Scholar] [CrossRef]
  24. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003. [Google Scholar] [CrossRef]
  25. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  26. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820. [Google Scholar] [CrossRef]
  27. Hanyu, T.; Yamazaki, K.; Tran, M.; McCann, R.A.; Liao, H.; Rainwater, C.; Adkins, M.; Cothren, J.; Le, N. AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation. Remote Sens. 2024, 16, 2930. [Google Scholar] [CrossRef]
  28. Raggam, J.; Buchroithner, M.; Mansberger, R. Relief Mapping Using Nonphotographic Spaceborne Imagery. ISPRS J. Photogramm. Remote Sens. 1989, 44, 21–36. [Google Scholar] [CrossRef]
  29. Pinheiro, M.; Reigber, A.; Scheiber, R.; Prats-Iraola, P.; Moreira, A. Generation of Highly Accurate DEMs Over Flat Areas by Means of Dual-Frequency and Dual-Baseline Airborne SAR Interferometry. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4361–4390. [Google Scholar] [CrossRef]
  30. Ka, M.-H.; Shimkin, P.E.; Baskakov, A.I.; Babokin, M.I. A New Single-Pass SAR Interferometry Technique with a Single-Antenna for Terrain Height Measurements. Remote Sens. 2019, 11, 1070. [Google Scholar] [CrossRef]
  31. Yang, X.; Wang, C.; Xi, X.; Wang, P.; Lei, Z.; Ma, W.; Nie, S. Extraction of Multiple Building Heights Using ICESat/GLAS Full-Waveform Data Assisted by Optical Imagery. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1914–1918. [Google Scholar] [CrossRef]
  32. Xing, S.; Dong, Q.; Hu, Z. Gated Feature Aggregation for Height Estimation from Single Aerial Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6003705. [Google Scholar] [CrossRef]
  33. Vandenhende, S.; Georgoulis, S.; Van Gansbeke, W.; Proesmans, M.; Dai, D.; Van Gool, L. Multi-Task Learning for Dense Prediction Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3614–3633. [Google Scholar] [CrossRef]
  34. Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 675–684. [Google Scholar]
  35. Zhang, Z.; Cui, Z.; Xu, C.; Yan, Y.; Sebe, N.; Yang, J. Pattern-Affinitive Propagation across Depth, Surface Normal and Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–17 June 2019; pp. 4101–4110. [Google Scholar]
  36. Vandenhende, S.; Georgoulis, S.; Van Gool, L. MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 527–543. [Google Scholar]
  37. Ye, H.; Xu, D. InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7493–7508. [Google Scholar] [CrossRef]
  38. Sinodinos, D.; Armanfard, N. Cross-Task Affinity Learning for Multitask Dense Scene Predictions. arXiv 2024, arXiv:2401.11124. [Google Scholar] [CrossRef]
  39. Agiza, A.; Neseem, M.; Reda, S. MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024; pp. 16196–16205. [Google Scholar]
  40. Wang, S.; Li, J.; Zhao, Z.; Lian, D.; Huang, B.; Wang, X.; Li, Z.; Gao, S. TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3 January 2024; pp. 914–923. [Google Scholar]
  41. Ye, H.; Xu, D. TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  42. Meyerson, E.; Miikkulainen, R. Beyond Shared Hierarchies: Deep Multitask Learning through Soft Layer Ordering. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  43. Maninis, K.-K.; Radosavovic, I.; Kokkinos, I. Attentive Single-Tasking of Multiple Tasks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 16–17 June 2019; pp. 1851–1860. [Google Scholar]
  44. Guo, P.; Lee, C.-Y.; Ulbricht, D. Learning to Branch for Multi-Task Learning. In Proceedings of the International Conference on Machine Learning, Shenzhen, China, 15–17 February 2020. [Google Scholar]
  45. Wallingford, M.; Li, H.; Achille, A.; Ravichandran, A.; Fowlkes, C.; Bhotika, R.; Soatto, S. Task Adaptive Parameter Sharing for Multi-Task Learning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7551–7560. [Google Scholar]
  46. Srivastava, S.; Volpi, M.; Tuia, D. Joint Height Estimation and Semantic Labeling of Monocular Aerial Images with CNNs. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 5173–5176. [Google Scholar]
  47. Wang, Y.; Ding, W.; Zhang, R.; Li, H. Boundary-Aware Multitask Learning for Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 951–963. [Google Scholar] [CrossRef]
  48. Xing, S.; Dong, Q.; Hu, Z. SCE-Net: Self- and Cross-Enhancement Network for Single-View Height Estimation and Semantic Segmentation. Remote Sens. 2022, 14, 2252. [Google Scholar] [CrossRef]
  49. Liu, W.; Sun, X.; Zhang, W.; Guo, Z.; Fu, K. Associatively Segmenting Semantics and Estimating Height from Monocular Remote-Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  50. Mao, Y.; Sun, X.; Huang, X.; Chen, K. Light: Joint Individual Building Extraction and Height Estimation from Satellite Images Through a Unified Multitask Learning Network. In Proceedings of the 2023 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2023), Vancouver, BC, Canada, 18–22 June 2023; pp. 5320–5323. [Google Scholar]
  51. Zhang, Q.; Yang, G.; Zhang, G. Collaborative Network for Super-Resolution and Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4404512. [Google Scholar] [CrossRef]
  52. Gao, Z.; Sun, W.; Lu, Y.; Zhang, Y.; Song, W.; Zhang, Y.; Zhai, R. Joint Learning of Semantic Segmentation and Height Estimation for Remote Sensing Image Leveraging Contrastive Learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614015. [Google Scholar] [CrossRef]
  53. Cipolla, R.; Gal, Y.; Kendall, A. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar]
  54. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10 October 2021; pp. 7242–7252. [Google Scholar]
  55. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. MetaFormer Is Actually What You Need for Vision. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10809–10819. [Google Scholar]
  56. Ji, Y.; Chen, Z.; Xie, E.; Hong, L.; Liu, X.; Liu, Z.; Lu, T.; Li, Z.; Luo, P. DDP: Diffusion Model for Dense Visual Prediction. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 21684–21695. [Google Scholar]
  57. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 17256–17267. [Google Scholar]
  58. Alidoost, F.; Arefi, H.; Tombari, F. 2D Image-To-3D Model: Knowledge-Based 3D Building Reconstruction (3DBR) Using Single Aerial Images and Convolutional Neural Networks (CNNs). Remote Sens. 2019, 11, 2219. [Google Scholar] [CrossRef]
  59. Carvalho, M.; Saux, B.L.; Trouvé-Peloux, P.; Almansa, A.; Champagnat, F. On Regression Losses for Deep Depth Estimation. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2915–2919. [Google Scholar]
  60. Liu, C.-J.; Krylov, V.A.; Kane, P.; Kavanagh, G.; Dahyot, R. IM2ELEVATION: Building Height Estimation from Single-View Aerial Imagery. Remote Sens. 2020, 12, 2719. [Google Scholar] [CrossRef]
  61. Carvalho, M.; Le Saux, B.; Trouvé-Peloux, P.; Champagnat, F.; Almansa, A. Multitask Learning of Height and Semantics from Aerial Images. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1391–1395. [Google Scholar] [CrossRef]
  62. Ye, H.; Xu, D. TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1 October 2023; pp. 21771–21780. [Google Scholar]
Figure 1. (a) Pipeline of our method. We follow the basic multi-tasking architecture by focusing on the feature interactions between tasks for the purpose of task-specific representation. (b) A visual demonstration of the close connection between SS and HE. In the HE task, the discriminant criterion provided by SS provides additional supervisory information for the regression of HE, constraining the range of height variations for the same and different categories. In the SS task, the geometric information distribution status provided by HE provides stronger support for SS discrimination, further overcoming the ambiguity of SS in complex scenarios.
Figure 2. Structure of the proposed CTME-Net for the joint prediction of SS and HE. We use red arrows to denote semantic feature streams, blue arrows to denote height feature streams, black arrows to denote generic feature streams from the backbone network, and brown arrows to denote interactive feature streams. For best viewing results, please refer to the color version.
Figure 3. The details of the proposed ITFEM are as depicted. At the top is the overall structure of the ITFEM. The dotted box below is the structure of the preliminary feature calibration module, which performs feature calibration on the initial features to achieve a preliminary decoupling result. We employ ground truth labels for supervised learning, embedding the preliminary predictions into the initial features and passing them to subsequent modules.
Figure 4. The overall structure of our proposed ATFDM. We first calculate the similarity weights in the task-specific features and the generic shared network, and subsequently, we extract the corresponding features from the shared network according to the weights. We omit part of the operation schematic for aesthetics.
Figure 5. The overall structure of our proposed ABTIM. Bidirectional cross-attention computation ensures that the tasks extract favorable information from each other, thus further enhancing task-specific feature representation.
Figure 6. Samples from the Vaihingen and Potsdam datasets.
Figure 7. Qualitative results of HE and SS on the Vaihingen dataset. The input images are cropped to 512 × 512 for better visualization. (a) Input images. (b) Semantic segmentation ground truth. (c) Predictions of the STL method. (d) Predictions of the CTME-Net method (ours). (e) Height estimation ground truth. (f) Predictions of the STL method. (g) Predictions of the CTME-Net method (ours). For best viewing results, please refer to the color version.
Figure 8. Qualitative results of HE and SS on the Potsdam dataset. The input images are cropped to 512 × 512 for better visualization. (a) Input images. (b) Semantic segmentation ground truth. (c) Predictions of the STL method. (d) Predictions of the CTME-Net method (ours). (e) Height estimation ground truth. (f) Predictions of the STL method. (g) Predictions of the CTME-Net method (ours). For best viewing results, please refer to the color version.
Figure 9. Visualization results of ablation studies on the Vaihingen dataset. (a) Input images. (b) Ground truth. (c) Predictions of the MTL-Baseline. (d) Predictions of the MTL-Baseline+ITFEM. (e) Predictions of the MTL-Baseline+ITFEM+ATFDM. (f) Predictions of the CTME-Net method (ours). We have marked the specific difference information using red boxes. For best viewing results, please refer to the color version.
Table 1. Main abbreviations and their corresponding meanings in our work.
Abbreviation | Meaning
STL          | Single-task learning
MTL          | Multi-task learning
ITFEM        | Initial Task-specific Feature Embedding Module
ATFDM        | Adaptive Task-specific Feature Distillation Module
ABTIM        | Attention-Based Task Interaction Module
Table 2. Quantitative results on the ISPRS Vaihingen dataset. Bold indicates the best performance.
Type | Method                 | OA ↑ | mF1 ↑ | mIoU ↑ | AbsRel ↓ | MAE ↓ | RMSE ↓ | δ1 ↑  | δ2 ↑  | δ3 ↑
SS   | FCN [11]               | 86.5 | 83.7  | 72.6   | -        | -     | -      | -     | -     | -
SS   | DANet [18]             | 90.4 | 88.7  | 80.0   | -        | -     | -      | -     | -     | -
SS   | Segmenter [54]         | 89.9 | 88.2  | 79.4   | -        | -     | -      | -     | -     | -
SS   | PoolFormer [55]        | 90.3 | 89.6  | 81.4   | -        | -     | -      | -     | -     | -
SS   | DDP [56]               | -    | 89.5  | 80.1   | -        | -     | -      | -     | -     | -
SS   | EfficientViT [57]      | 89.4 | 87.6  | 80.5   | -        | -     | -      | -     | -     | -
HE   | Amirkolaee et al. [6]  | -    | -     | -      | 1.179    | 1.487 | 2.197  | 0.305 | 0.496 | 0.599
HE   | IM2HEIGHT [5]          | -    | -     | -      | 1.009    | 1.485 | 2.253  | 0.317 | 0.512 | 0.609
HE   | 3DBR [58]              | -    | -     | -      | 0.948    | 1.379 | 2.074  | 0.338 | 0.540 | 0.641
HE   | D3Net [59]             | -    | -     | -      | 2.016    | 1.314 | 2.123  | 0.369 | 0.533 | 0.644
HE   | IM2ELEVATION [60]      | -    | -     | -      | 0.956    | 1.226 | 1.882  | 0.399 | 0.587 | 0.671
HE   | PLNet [32]             | -    | -     | -      | 0.833    | 1.178 | 1.775  | 0.386 | 0.599 | 0.702
MTL  | Srivastava et al. [46] | 79.3 | 72.6  | -      | 4.415    | 1.861 | 2.729  | 0.217 | 0.385 | 0.517
MTL  | Carvalho et al. [61]   | 86.1 | 82.3  | -      | 1.882    | 1.262 | 2.089  | 0.405 | 0.562 | 0.663
MTL  | BAMTL [47]             | 88.4 | 86.9  | -      | 1.064    | 1.078 | 1.762  | 0.451 | 0.617 | 0.714
MTL  | TaskExpert [62]        | 88.8 | 86.3  | -      | 1.037    | 1.338 | 1.989  | 0.428 | 0.647 | 0.760
MTL  | InvPT++ [37]           | 88.7 | 86.0  | -      | 0.830    | 1.334 | 2.009  | 0.379 | 0.638 | 0.768
MTL  | Ours                   | 90.8 | 89.6  | 81.5   | 0.791    | 1.184 | 1.790  | 0.498 | 0.703 | 0.806
Table 3. Quantitative results on the ISPRS Potsdam dataset. Bold indicates the best performance.
Type | Method                 | OA ↑ | mF1 ↑ | mIoU ↑ | AbsRel ↓ | MAE ↓ | RMSE ↓ | δ1 ↑  | δ2 ↑  | δ3 ↑
SS   | FCN [11]               | 85.6 | 87.6  | 78.3   | -        | -     | -      | -     | -     | -
SS   | DANet [18]             | 89.9 | 91.2  | 84.1   | -        | -     | -      | -     | -     | -
SS   | Segmenter [54]         | 91.0 | 92.3  | 86.5   | -        | -     | -      | -     | -     | -
SS   | PoolFormer [55]        | 91.1 | 92.6  | 86.5   | -        | -     | -      | -     | -     | -
SS   | DDP [56]               | -    | 92.4  | 86.1   | -        | -     | -      | -     | -     | -
SS   | EfficientViT [57]      | 89.6 | 90.1  | 84.2   | -        | -     | -      | -     | -     | -
HE   | Amirkolaee et al. [6]  | -    | -     | -      | 0.537    | 1.926 | 3.507  | 0.394 | 0.640 | 0.775
HE   | IM2HEIGHT [5]          | -    | -     | -      | 0.518    | 2.200 | 4.141  | 0.534 | 0.680 | 0.763
HE   | 3DBR [58]              | -    | -     | -      | 0.409    | 1.751 | 3.439  | 0.605 | 0.742 | 0.823
HE   | D3Net [59]             | -    | -     | -      | 0.391    | 1.681 | 3.055  | 0.601 | 0.742 | 0.830
HE   | IM2ELEVATION [60]      | -    | -     | -      | 0.429    | 1.744 | 3.516  | 0.638 | 0.767 | 0.839
HE   | PLNet [32]             | -    | -     | -      | 0.318    | 1.201 | 2.354  | 0.639 | 0.833 | 0.912
MTL  | Srivastava et al. [46] | 80.1 | 79.9  | -      | 0.624    | 2.224 | 3.740  | 0.412 | 0.597 | 0.720
MTL  | Carvalho et al. [61]   | 83.2 | 82.2  | -      | 0.441    | 1.838 | 3.281  | 0.575 | 0.720 | 0.808
MTL  | BAMTL [47]             | 91.3 | 90.9  | -      | 0.291    | 1.223 | 2.407  | 0.685 | 0.819 | 0.897
MTL  | TaskExpert [62]        | 90.7 | 90.2  | -      | 0.273    | 1.292 | 2.513  | 0.650 | 0.818 | 0.898
MTL  | InvPT++ [37]           | 91.1 | 90.6  | -      | 0.253    | 1.210 | 2.402  | 0.673 | 0.829 | 0.904
MTL  | Ours                   | 92.0 | 92.9  | 87.0   | 0.256    | 1.147 | 2.420  | 0.698 | 0.839 | 0.912
Table 4. Ablation studies on the Vaihingen dataset. Bold indicates the best performance.
Method                       | OA ↑ | mF1 ↑ | mIoU ↑ | AbsRel ↓ | MAE ↓ | RMSE ↓ | δ1 ↑  | δ2 ↑  | δ3 ↑
STL-S                        | 89.8 | 88.9  | 80.7   | -        | -     | -      | -     | -     | -
STL-H                        | -    | -     | -      | 0.827    | 1.245 | 1.912  | 0.480 | 0.682 | 0.785
MTL-Baseline                 | 90.1 | 88.4  | 80.3   | 0.830    | 1.253 | 1.933  | 0.473 | 0.677 | 0.778
MTL-Baseline+ITFEM           | 90.4 | 89.1  | 81.0   | 0.807    | 1.230 | 1.887  | 0.488 | 0.685 | 0.782
MTL-Baseline+ITFEM+ATFDM     | 90.6 | 89.4  | 81.3   | 0.810    | 1.222 | 1.843  | 0.490 | 0.687 | 0.790
CTME-Net (Full)              | 90.8 | 89.6  | 81.5   | 0.791    | 1.184 | 1.790  | 0.498 | 0.703 | 0.806
Table 5. Ablation studies on the number and the location of ABTIM on the Vaihingen dataset. Bold indicates the best performance.
L1 | L2 | L3 | mF1 ↑ | mIoU ↑ | RMSE ↓ | δ1 ↑
   |    |    | 89.5  | 81.3   | 1.796  | 0.492
   |    |    | 89.3  | 81.2   | 1.811  | 0.489
   |    |    | 88.7  | 80.4   | 1.890  | 0.477
   |    |    | 89.6  | 81.5   | 1.790  | 0.498
   |    |    | 89.0  | 80.7   | 1.823  | 0.480
   |    |    | 89.0  | 80.8   | 1.814  | 0.484
Table 6. Ablation studies on the number of attention heads in ABTIM on the Vaihingen dataset. Bold indicates the best performance.
Number of heads | mF1 ↑ | mIoU ↑ | RMSE ↓ | δ1 ↑
1               | 89.3  | 81.3   | 1.810  | 0.490
4               | 89.4  | 81.4   | 1.797  | 0.494
8               | 89.6  | 81.5   | 1.790  | 0.498
16              | 89.5  | 81.4   | 1.795  | 0.496
32              | 89.4  | 81.3   | 1.803  | 0.495