SCE-Net: Self- and Cross-Enhancement Network for Single-View Height Estimation and Semantic Segmentation

Abstract: Single-view height estimation and semantic segmentation have received increasing attention in recent years and play an important role in the photogrammetry and remote sensing communities. The height information and semantic information of images are correlated, and recent works have shown that multi-task learning methods can exploit complementary task-related features to improve the predictions of both tasks. Despite this progress, how to effectively extract and fuse height features and semantic features is still an open issue. In this paper, a self- and cross-enhancement network (SCE-Net) is proposed to jointly perform height estimation and semantic segmentation on single aerial images. A feature separation-fusion module is constructed to effectively separate and fuse height features and semantic features based on an attention mechanism for feature representation enhancement across tasks. In addition, a height-guided feature distance loss and a semantic-guided feature distance loss are designed based on deep metric learning to achieve task-aware feature representation enhancement. Extensive experiments are conducted on the Vaihingen dataset and the Potsdam dataset to verify the effectiveness of the proposed method. The experimental results demonstrate that the proposed SCE-Net outperforms state-of-the-art methods in both height estimation and semantic segmentation.


Introduction
In recent years, with the rapid development of aerospace technology, remote sensing imagery analysis for high-resolution images acquired by aerial or satellite sensors has received extensive attention. Learning height information from single aerial images, one of the important tasks in remote sensing imagery analysis, can provide geometric information for 3D reconstruction of ground scenes, and is widely used in a variety of applications, such as urban planning [1], change detection [2], and disaster monitoring [3]. Recently, deep learning has driven tremendous progress in the photogrammetry and remote sensing communities [4][5][6][7]. Height estimation from single aerial images likewise mainly adopts deep-learning-based methods, including methods based on convolutional neural networks (CNNs), methods based on generative adversarial networks (GANs), and methods based on multi-task learning (MTL).
Compared with other images, remote sensing images have more complex spectral characteristics, where objects with different heights may have similar appearances due to similar materials, such as building roofs and roads. When a deep neural network extracts features from a single image, it may generate mismatched height feature relationships, resulting in inaccurate height estimation. Generally, there is a geometric correlation between the height information and semantic information of a remote sensing scene. Compared with treating height estimation and semantic segmentation as two independent tasks, multi-task learning methods can utilize the height features and semantic features extracted from the image to achieve information complementarity, and then leverage multi-source supervised information to improve the predictive performance. Therefore, this paper performs height estimation and semantic segmentation from single aerial images simultaneously in a unified framework.
Several recent works have shown that height estimation and semantic segmentation can benefit from each other, mainly based on the implicit assumption that changes in height generally correspond to changes in class [8,9]. However, although the height cues and semantic cues are related, they are not completely consistent. For example, objects within the same class may have different heights, while objects with the same height may belong to different classes. Therefore, straightforward fusion (summation or concatenation) of height features and semantic features lets inconsistent features negatively impact the other shared features, leading to less accurate predictions. In addition, estimating height from single images is generally regarded as a pixel-level height regression task. However, the wide range of height values makes it challenging to obtain an accurate height value directly. Under the direct regression paradigm, existing methods generally suffer from slow convergence or sub-optimal solutions.
In this paper, a self- and cross-enhancement network (SCE-Net) is proposed to jointly learn height information and semantic labels from single aerial images under the framework of multi-task learning. Specifically, the SCE-Net first exploits the backbone network to extract shared features for both tasks from the input image. Then, a feature separation-fusion module (FSFM) is designed to effectively separate task-aware features from the shared features and fuse cross-task features based on an attention mechanism to achieve cross-enhancement of task-related feature representation. In addition, to address the problem that the height range is large and difficult to regress, the height range is discretized into several intervals, and a height-guided feature distance loss and a semantic-guided feature distance loss are designed to accomplish self-enhancement of feature representation based on the deep metric learning method. To verify the effectiveness of the proposed method, extensive experiments are conducted on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset. Experimental results demonstrate that the proposed method outperforms recent state-of-the-art height estimation methods and achieves performance comparable to that of the compared semantic segmentation methods.
The main contributions include the following:
• A multi-task learning network, called self- and cross-enhancement network (SCE-Net), is proposed to simultaneously perform height estimation and semantic segmentation from single aerial images under a unified framework.
• To effectively integrate the height and semantic cues of the scene, a feature separation-fusion module (FSFM) is constructed to separate the shared image features into task-aware features, and selectively fuse the cross-task features based on an attention mechanism.
• A height-guided feature distance loss and a semantic-guided feature distance loss are designed to achieve task-guided representation enhancement using the deep metric learning method.
The paper is organized as follows: a brief review of related works is given in Section 2, including height estimation, semantic segmentation, and multi-task learning. Section 3 introduces the proposed self- and cross-enhancement network (SCE-Net), datasets, evaluation indicators, and implementation details. Extensive experimental results and evaluations are reported in Section 4. In Section 5, the effectiveness of each component proposed in this work is analyzed and discussed. Finally, Section 6 concludes this work.

Related Works
In this section, the related works are briefly introduced, including height estimation, semantic segmentation, and multi-task learning.

Height Estimation
The height information of remote sensing images can be obtained by stereo-pair photogrammetry [10,11], SAR interferometry [12,13], LiDAR processing, etc., but these methods are usually expensive and require high computational costs or expert interpretation. With the development of deep learning technology, recent works have paid more attention to estimating the height information from single images based on deep learning networks, including convolutional neural networks (CNNs) and generative adversarial networks (GANs).
The methods based on supervised learning mainly use the height ground truth corresponding to the image to supervise the network training. The model trained on the training dataset can be applied to other images to achieve height estimation from single aerial images. Mou et al. [14] proposed a fully convolutional-deconvolutional network to learn the height information from single images in an end-to-end manner, and the predicted height map can assist in building instance segmentation. Zhang et al. [15] proposed a multi-path fusion network, using a multi-path feature fusion module and a residual upsampling block to obtain a predicted height map with good scene structure preservation. Amirkolaee et al. [16] proposed a deep convolutional encoder-decoder network to recover the precise geometry of the object by combining the global and local features, and used a post-processing method to obtain a seamless and continuous predicted height map. Li et al. [17] used a deep ordinal regression network to estimate the depth interval, and employed an ordered loss function to improve the height estimation performance. Liu et al. [18] proposed a CNN-based method to achieve height estimation from single images, and utilized the learned height map to infer the shape of 3D buildings and the footprints of 2D buildings. Xing et al. [19] proposed a progressive learning network that aggregates high-level and low-level features using an attention mechanism, and gradually refines the predicted height map in a coarse-to-fine manner through a progressive refinement module. Mo et al. [20] proposed a soft-aligned gradient-chaining network to generate the height difference between adjacent pixels in a specific direction, and designed a robust surface-based soft alignment method to improve the training quality of the network. Karatsiolis et al. [21] proposed a deep learning model to estimate the height from single images, and verified the effectiveness of the proposed method on two experimental datasets.
With the wide application of generative adversarial networks in various fields of computer vision, some recent works employ generative adversarial networks for image-to-image translation to generate corresponding height maps from images. Ghamisi et al. [22] proposed to use conditional generative adversarial nets to generate elevation information from single remote sensing images. Paoletti et al. [23] proposed to use unpaired images to learn elevation from optical remote sensing images based on variational autoencoders and generative adversarial networks. Panagiotou et al. [24] applied conditional generative adversarial networks to learn digital elevation models of scenes from satellite images.

Semantic Segmentation
The semantic segmentation task predicts the semantic label for each pixel of the input image. The development of deep learning techniques in recent years has produced significant improvements in semantic segmentation. The FCN [25] replaced the fully connected layers with convolutional layers, achieving efficient image semantic segmentation. Although FCN utilizes deconvolutional layers to restore the resolution of prediction results, multiple pooling and downsampling operations in the network lead to the loss of spatial details. To address this problem, some works adopted an encoder-decoder architecture to obtain high-resolution semantic segmentation results [26][27][28]. U-Net [29] and its variants [30][31][32] used an encoder-decoder network to fuse shallow and deep information through skip connections for achieving semantic segmentation with more accurate boundaries. Contextual information in images, especially in remote sensing images, is particularly important for semantic segmentation. To fuse the multi-scale information of the image, some works used dilated convolution/atrous convolution to expand the receptive field, improving the performance of semantic segmentation [33][34][35][36][37][38][39]. DeepLab V3+ [40] used atrous spatial pyramid pooling (ASPP) to achieve multi-scale feature fusion and employed a decoding module to refine the boundary details. Another commonly used strategy is to obtain global contextual information by modeling the spatial relationship of images to improve the accuracy of pixel-level segmentation results [41][42][43][44]. Considering the complexity and diversity of remote sensing image scenes, some works used multi-scale training or feature fusion to obtain long-distance dependencies, alleviating the scale change problem of objects in the scene [45][46][47]. In addition, some works utilized multi-modal data, mainly 3D elevation data (DSM or nDSM), to assist the semantic segmentation of remote sensing images, improving the accuracy of semantic prediction [48,49].

Multi-Task Learning
In recent years, multi-task learning has been advocated to exploit complementary information and improve the predictive performance of each task [50][51][52][53][54]. For remote sensing images, the height information of the scene usually correlates with the semantic labels, so recent works usually perform these two tasks jointly under a unified framework. Srivastava et al. [8] proposed to utilize convolutional neural networks to simultaneously predict height information and semantic labels from single images. Zheng et al. [9] performed simultaneous height estimation and semantic segmentation on input images, and improved the quality of prediction results by fusing the features of the two tasks. Carvalho et al. [55] employed an encoder to extract features from images and predicted the corresponding height maps and semantic labels through different decoders. Mahmud et al. [56] proposed a boundary-aware multi-task deep-learning-based framework to simultaneously predict height and semantic segmentation results, enabling 3D architectural modeling from single images. Wang et al. [57] proposed a multi-task learning network that simultaneously learns height information, semantic segmentation results, and edge information. Different from these methods, this paper focuses on the relationships between height features and semantic features, and improves the performance of height estimation and semantic segmentation by selectively fusing features between tasks and enhancing the representation of task-aware features.

Materials and Methods
In this section, an overview of the proposed self- and cross-enhancement network (SCE-Net) is first given. Then, the feature separation-fusion module (FSFM) and task-guided representation enhancement are introduced, including the height-guided feature distance loss and the semantic-guided feature distance loss. After that, the multi-task objective function, datasets, evaluation indicators, and implementation details are described. The notation for important symbols in this work is shown in Table 1.

Overview
In this paper, pixel-level height maps and semantic labels are simultaneously predicted from single aerial images under the multi-task learning framework. The proposed SCE-Net employs an encoder-decoder architecture, and the whole network consists of three parts: a backbone for feature extraction, a feature separation-fusion module, and a multi-task predictor. Unlike single-task learning methods, the SCE-Net includes a shared encoder and two task-related decoders. The overall network architecture is shown in Figure 1. Concretely, the network adopts ResNet-50 or ResNet-101 as the backbone. For a three-channel input image, the output of the network is a one-channel height map and a semantic segmentation result with as many channels as there are classes. In the encoding process, the image is downsampled multiple times to obtain feature maps with sizes 1/4, 1/8, and 1/16 of the input size in turn. Then, several upsampling operations are performed in the two decoder branches to obtain the height map and the semantic segmentation result, each with the same resolution as the original input. Furthermore, the network adopts skip connections to preserve detailed information lost during the multiple downsampling operations in the encoder.
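As a rough illustration of the tensor sizes implied by this description, the following sketch computes the encoder feature-map sizes and the two output shapes for a given input. The function name `scenet_shapes` is hypothetical, and the value of 6 classes is only an example input, not a setting taken from this section.

```python
def scenet_shapes(height, width, num_classes, strides=(4, 8, 16)):
    """Spatial sizes implied by the overview (illustrative only).

    strides: downsampling factors of the encoder stages (1/4, 1/8, 1/16).
    """
    encoder = [(height // s, width // s) for s in strides]
    return {
        "encoder": encoder,                              # per-stage (H, W)
        "height_map": (1, height, width),                # one-channel regression output
        "semantic_map": (num_classes, height, width),    # one channel per class
    }


shapes = scenet_shapes(512, 512, num_classes=6)
print(shapes["encoder"])
```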
Considering that height estimation and semantic segmentation are closely related but do not have a one-to-one relation, this paper constructs a feature separation-fusion module (FSFM), which first separates the features extracted from the input image into task-aware features and then selectively fuses the features from the other branch to obtain consistent task-related features. Then, two task-guided feature distance losses are designed based on deep metric learning to enhance the representation of the two task-aware features. The network is trained in an end-to-end manner by optimizing a multi-task objective function. The feature separation-fusion module (FSFM), task-guided representation enhancement, and the multi-task objective function are explained in the following sections.

Feature Separation-Fusion Module
In existing multi-task learning methods, the features of the height and semantic branches are usually fused by direct summation or concatenation. However, we believe it is desirable to select relevant and consistent features from the two tasks for handling each task. To this end, this work constructs a feature separation-fusion module (FSFM) based on an attention mechanism. The FSFM consists of two components: a task-aware feature separation module (TFSM) and two cross-task feature fusion modules (CFFM). The TFSM separates the shared features extracted from the input image into height-aware features and semantic-aware features, and the CFFM selects the beneficial features from the other branch for fusion. The topologies of the TFSM and the CFFM are illustrated in Figure 2. Specifically, the TFSM employs a symmetric structure, which contains two branches with the same architecture but different weights. As seen from Figure 2a, the upper branch of the TFSM is the height estimation branch, which outputs height features (red features), while the lower branch is the semantic segmentation branch, which outputs semantic features (cyan features). Here, we take the height branch as an example. The shared features are first downsampled by a global average pooling (GAP) layer, and feature integration is then performed by a fully connected (FC) layer. Next, an attention map in the channel dimension is obtained through a sigmoid function. After that, the shared features are weighted by this attention map and added to the original shared features to obtain the height-aware features. Finally, the height-aware features are integrated through three consecutive convolutional blocks (CBR), each of which is composed of a convolutional layer, a batch normalization layer, and a rectified linear unit (ReLU) function. Similarly, the semantic branch obtains semantic-aware features through the same operations as the height branch.
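The channel-attention step of one TFSM branch (GAP, FC, sigmoid, reweighting, residual addition) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `channel_gate`, the square C x C shape of the FC layer, and the toy weights are all hypothetical, and the three CBR blocks that follow are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(shared, fc_weight, fc_bias):
    """Channel attention with a residual connection.

    shared:    C feature maps, each a list of H rows with W floats.
    fc_weight: C x C matrix of the (assumed) fully connected layer.
    fc_bias:   C biases.
    Returns (attention, task_aware), task_aware = shared * attention + shared.
    """
    channels = len(shared)
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in shared]
    # FC layer followed by a sigmoid gives per-channel weights in (0, 1).
    attention = [
        sigmoid(sum(fc_weight[o][c] * pooled[c] for c in range(channels)) + fc_bias[o])
        for o in range(channels)
    ]
    # Reweight every channel and add the original shared features back.
    task_aware = [
        [[v * attention[c] + v for v in row] for row in shared[c]]
        for c in range(channels)
    ]
    return attention, task_aware


shared = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy FC weights for the example
attention, height_aware = channel_gate(shared, identity, [0.0, 0.0])
```

Because the two branches share this structure but not their weights, the semantic branch would call the same function with its own `fc_weight` and `fc_bias`.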
For the obtained task-aware features, the CFFM is utilized to fuse the features of one branch with the beneficial features of the other branch. In the CFFM, the height-aware features and the semantic-aware features are first concatenated and passed through a 3 × 3 convolutional layer, and then split into a height branch and a semantic branch. For the height branch, the features are fed into a 3 × 3 convolutional layer and a sigmoid function to obtain an attention map, which is used for feature selection of the height-aware features in the spatial dimension. The height-aware features are multiplied by this attention map and added to the features of the semantic branch to achieve cross-task feature fusion. After that, the size of the obtained features is increased by a factor of 2 using an upsampling operation. The semantic branch is similar to the height branch; the difference is that the attention map is used to weight the semantic-aware features, and the weighted features are then added to the features in the height branch to complete the cross-task feature fusion.
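The spatial gating at the heart of the CFFM can be illustrated with a short sketch that follows the description above literally. It is an assumption-laden simplification: the 3 × 3 convolutions are not implemented, so the attention logits are taken as given inputs, features are flattened to one value per pixel, and the function name `cross_task_fuse` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_task_fuse(own, other, attention_logits):
    """Gate `own` features per pixel, then add the other branch's features.

    own, other:       per-pixel feature values of the two branches.
    attention_logits: per-pixel outputs of the (omitted) 3 x 3 convolution.
    """
    return [o * sigmoid(a) + x for o, x, a in zip(own, other, attention_logits)]


height_feat = [2.0, 4.0, 6.0]
semantic_feat = [1.0, 1.0, 1.0]
logits = [0.0, 10.0, -10.0]  # neutral, strongly kept, strongly suppressed
fused_height = cross_task_fuse(height_feat, semantic_feat, logits)
```

A large negative logit suppresses the gated feature at that pixel, so only the other branch's contribution remains there; this is the mechanism by which inconsistent features can be filtered out before fusion.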
The feature separation-fusion module can effectively aggregate the relevant features between height estimation and semantic segmentation. By using semantic features to constrain the representation of height features more accurately, the height-spreading phenomenon across different classes is reduced.

Task-Guided Representation Enhancement
Based on the above feature separation-fusion, a novel task-guided representation enhancement method is designed to refine the height-aware features and the semantic-aware features. Considering the local geometric relationship of the scene, the height-aware features of objects with the same height should be similar, whereas the height-aware features of objects with large height differences should be significantly different. Similarly, the semantic-aware features within the same class should be as similar as possible, and the semantic-aware features across different classes should be largely different. Therefore, two task-guided feature distance losses are designed based on the deep metric learning method, namely the height-guided feature distance loss and the semantic-guided feature distance loss, to accomplish the representation enhancement of height features and semantic features.

Height-Guided Feature Distance Loss
The wide range of height values usually leads to slow convergence or a sub-optimal solution when regressing pixel-level height from single images. Moreover, neighboring pixels usually have close height values, and the corresponding height features are similar. To facilitate the representation enhancement of height-aware features, the entire height range is first discretized into multiple intervals. Then, the features of the same height interval are constrained to be similar, and the features of different height intervals are constrained to be different.
For general remote sensing images, most pixels have smaller height values, and a few pixels have larger height values. However, predictions for these large height values are often subject to large uncertainties. To avoid overfocusing on such pixels with large heights, the spacing-increasing discretization method in [58] is employed to uniformly discretize the height range in the log space. The formula for height interval discretization is as follows:

$$t_i = \exp\left(\log(\alpha) + \frac{\log(\beta/\alpha) \cdot i}{K}\right),$$

where $\alpha$ and $\beta$ are the lower and upper bounds of the whole height range, $t_i \in \{t_0, t_1, \ldots, t_K\}$ are the discrete thresholds, and $K$ is the number of height intervals.
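As a small worked example of log-space discretization, the sketch below computes K + 1 thresholds using the spacing-increasing form from [58] and assigns a height to its interval. The function names are hypothetical, and the sketch assumes α > 0; how zero heights are shifted before taking logarithms is not specified here.

```python
import math


def sid_thresholds(alpha, beta, k):
    """Return thresholds t_0..t_K spaced uniformly in log space."""
    if alpha <= 0 or beta <= alpha or k < 1:
        raise ValueError("require 0 < alpha < beta and k >= 1")
    return [
        math.exp(math.log(alpha) + math.log(beta / alpha) * i / k)
        for i in range(k + 1)
    ]


def height_interval(h, thresholds):
    """Index of the interval [t_i, t_{i+1}) containing h, clamped to 0..K-1."""
    k = len(thresholds) - 1
    for i in range(k):
        if h < thresholds[i + 1]:
            return i
    return k - 1


thresholds = sid_thresholds(1.0, 16.0, 4)
print([round(t, 6) for t in thresholds])
```

With α = 1, β = 16, and K = 4 the intervals double in width at each step, so small heights are resolved finely while large, more uncertain heights share coarse bins.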
Owing to the local geometric consistency of the image, the pixels within a small adjacent region usually have similar height values. Therefore, local patches are first cropped from the whole image in a left-to-right, top-to-bottom manner. For each local patch, pixels are divided into three groups, namely, anchor pixel, positive pixels, and negative pixels. The central pixel of the local patch is regarded as the anchor pixel, pixels in the same height interval as the anchor pixel are positive pixels, and pixels in different height intervals from the anchor pixel are negative pixels. Correspondingly, the feature distance between the positive pixels and the anchor pixel is defined as $d_h^+$, and the feature distance between the negative pixels and the anchor pixel is defined as $d_h^-$; the formulas are as follows:

$$d_h^+(i) = \frac{1}{|P_i^+|} \sum_{j \in P_i^+} \left\lVert \hat{F}_h^d(i) - \hat{F}_h^d(j) \right\rVert_2,$$

$$d_h^-(i) = \frac{1}{|P_i^-|} \sum_{j \in P_i^-} \left\lVert \hat{F}_h^d(i) - \hat{F}_h^d(j) \right\rVert_2,$$

where $i$ represents the location of the anchor pixel, $|\cdot|$ represents the number of elements in the set, $P_i^+$ is the set of positive pixels, $P_i^-$ is the set of negative pixels, and $\hat{F}_h^d = F_h^d / \lVert F_h^d \rVert$ is the normalized height feature.

To make the features in the same height interval more similar and the features in different height intervals more distant, the feature distance between the positive pixels and the anchor pixel should be reduced, while the feature distance between the negative pixels and the anchor pixel should be increased. For this purpose, this work adopts the triplet loss [59][60][61] in deep metric learning, as follows:

$$\ell_h(i) = \max\left(0,\; d_h^+(i) - d_h^-(i) + m_h\right),$$

where $m_h$ is a margin: once the feature distance between the negative pixels and the anchor pixel exceeds the distance between the positive pixels and the anchor pixel by the threshold $m_h$, the loss term is no longer optimized.
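The anchor/positive/negative grouping and the margin-based loss for a single patch can be sketched as follows. This is an illustrative reading rather than the authors' code: the Euclidean distance on L2-normalized features, the exclusion of the anchor from its own positive set, and the optional `min_count` guard (which returns zero when either group is too small) are assumptions, and the name `patch_distance_loss` is hypothetical.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def mean_distance(feats, anchor, indices):
    """Mean Euclidean distance from the anchor feature to the given pixels."""
    total = 0.0
    for j in indices:
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(feats[anchor], feats[j])))
    return total / len(indices)


def patch_distance_loss(features, intervals, anchor, margin, min_count=0):
    """Triplet-style loss for one patch.

    features:  per-pixel feature vectors of the patch (flattened).
    intervals: per-pixel height-interval index (or any group label).
    anchor:    index of the anchor pixel (the patch centre in the paper).
    """
    feats = [normalize(f) for f in features]
    group = intervals[anchor]
    positives = [j for j, k in enumerate(intervals) if j != anchor and k == group]
    negatives = [j for j, k in enumerate(intervals) if k != group]
    # Skip patches with too few positives or negatives.
    if len(positives) <= min_count or len(negatives) <= min_count:
        return 0.0
    d_pos = mean_distance(feats, anchor, positives)
    d_neg = mean_distance(feats, anchor, negatives)
    return max(0.0, d_pos - d_neg + margin)


features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
intervals = [0, 0, 1, 1]
loss_small_margin = patch_distance_loss(features, intervals, anchor=0, margin=0.5)
loss_large_margin = patch_distance_loss(features, intervals, anchor=0, margin=2.0)
```

In this toy patch the positive pixel coincides with the anchor after normalization and the negatives are orthogonal to it, so the loss vanishes for a small margin and stays positive only when the margin exceeds the achieved separation.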
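To reduce the influence of noise, this work sets a condition for this loss: the loss term is calculated only when the number of positive pixels and the number of negative pixels are both greater than the threshold $T_h$. Therefore, for an anchor pixel $i$, the height-guided feature distance loss is defined as

$$L_{he}(i) = \begin{cases} \max\left(0,\; d_h^+(i) - d_h^-(i) + m_h\right), & |P_i^+| > T_h \text{ and } |P_i^-| > T_h, \\ 0, & \text{otherwise.} \end{cases}$$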

Semantic-Guided Feature Distance Loss
In the same spirit, a semantic-guided feature distance loss is designed to refine the semantic features. Specifically, the center pixel of the local image patch is taken as the anchor pixel, the pixels of the same class as the anchor pixel are taken as positive pixels, and the pixels of different classes from the anchor pixel are taken as negative pixels. Intuitively, if the number of negative pixels is 0, all pixels in the patch belong to the same class. When the numbers of positive pixels and negative pixels are both greater than 0, the image patch contains objects from different classes.
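The feature distance between the positive pixels and the anchor pixel, $d_s^+$, and the feature distance between the negative pixels and the anchor pixel, $d_s^-$, are defined as

$$d_s^+(i) = \frac{1}{|P_i^+|} \sum_{j \in P_i^+} \left\lVert \hat{F}_s^d(i) - \hat{F}_s^d(j) \right\rVert_2,$$

$$d_s^-(i) = \frac{1}{|P_i^-|} \sum_{j \in P_i^-} \left\lVert \hat{F}_s^d(i) - \hat{F}_s^d(j) \right\rVert_2,$$

where $P_i^+$ and $P_i^-$ here denote the positive and negative pixel sets determined by class, and $\hat{F}_s^d = F_s^d / \lVert F_s^d \rVert$ is the normalized semantic feature. The corresponding semantic-guided feature distance loss is as follows:

$$L_{se}(i) = \max\left(0,\; d_s^+(i) - d_s^-(i) + m_s\right),$$

where $m_s$ is the feature distance threshold for semantic features.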

Multi-Task Objective Function
In addition to the aforementioned height-guided feature distance loss and semantic-guided feature distance loss, the height ground truth and semantic labels are used as the supervision information for network training.
For the height estimation, following [55,62], this work adopts the L1 loss as the height loss term:

$$L_h = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|,$$

where $h$ denotes the height ground truth, $\hat{h}$ denotes the predicted height value, $i$ is the pixel index in the image, and $N$ is the total number of valid pixels.
For the semantic segmentation, the multi-class cross-entropy loss is employed as the semantic loss term as follows:

$$L_s = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log\left(P_{ic}\right),$$

where $y_{ic}$ is 1 when the true class of pixel $i$ is $c$, and 0 otherwise; $P_{ic}$ is the predicted probability that pixel $i$ belongs to class $c$; and $C$ is the number of semantic classes.
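Finally, the overall multi-task objective function is formulated as follows:

$$L = \lambda_h L_h + \lambda_s L_s + \lambda_{he} L_{he} + \lambda_{se} L_{se},$$

where $\lambda_h$, $\lambda_s$, $\lambda_{he}$, and $\lambda_{se}$ are the weights of the respective loss terms.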
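To make the combination of supervision terms concrete, the following sketch evaluates an L1 height loss, a multi-class cross-entropy loss, and their weighted sum with the two feature distance losses. The function names are hypothetical, the normalization of the cross-entropy by the number of pixels is an assumption, and the weights in the example are placeholders rather than values reported by the authors.

```python
import math


def l1_height_loss(pred, target):
    """Mean absolute error over the valid pixels passed in."""
    return sum(abs(t - p) for p, t in zip(pred, target)) / len(target)


def cross_entropy_loss(probs, labels, eps=1e-12):
    """probs: per-pixel class probabilities; labels: per-pixel class index."""
    total = sum(math.log(max(p[c], eps)) for p, c in zip(probs, labels))
    return -total / len(labels)


def multi_task_loss(l_h, l_s, l_he, l_se, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms; weights are placeholders."""
    w_h, w_s, w_he, w_se = weights
    return w_h * l_h + w_s * l_s + w_he * l_he + w_se * l_se


l_h = l1_height_loss([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
l_s = cross_entropy_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
total = multi_task_loss(l_h, l_s, 0.2, 0.1, weights=(1.0, 1.0, 0.5, 0.5))
```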

Datasets
To verify the effectiveness of the proposed SCE-Net, extensive experiments are performed on two public datasets, namely, the Vaihingen dataset and the Potsdam dataset, provided by ISPRS Working Group II/4. In the experiments, the normalized digital surface models (nDSMs) in [63] are used as the height ground truth.
Vaihingen: It consists of 33 tiles of different sizes; each tile contains the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of the TOP is 9 cm. The TOP contains near-infrared, red, and green bands (IRRG), while the nDSM has one band. According to the official dataset partition, this work uses 16 tiles to construct the training set, and the remaining 17 tiles to form the testing set.
Potsdam: It contains 38 tiles of the same size, including the true orthophoto (TOP) and the corresponding nDSM. The ground sampling distance of this dataset is 5 cm. The training images and testing images in the experiments contain three bands: red, green, and blue (RGB). According to the official dataset partition, 24 images are used for training, and the remaining 14 images are used for testing.
Samples from the Vaihingen and Potsdam datasets are shown in Figure 3. Due to the large size of the original tiles, small patches of 512 × 512 are randomly cropped from the raw tiles as the input images for training and testing in the experiments. When compared with other methods, the predictions of the image patches are stitched together and the results of the whole tiles are quantitatively evaluated.
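One simple way to cover a whole tile with fixed-size patches before stitching the predictions is sketched below. It is only an assumption about how tiling could be done: the stride, the handling of the last row and column, and the way overlapping predictions are merged are not described in this section, and `tile_origins` is a hypothetical helper.

```python
def tile_origins(length, patch, stride):
    """Start offsets so that patches of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    origins = list(range(0, length - patch + 1, stride))
    # Add a final patch flush with the border if the stride leaves a gap.
    if origins[-1] != length - patch:
        origins.append(length - patch)
    return origins


def tile_boxes(height, width, patch=512, stride=512):
    """Top-left corners (row, col) of all patches covering a tile."""
    return [
        (r, c)
        for r in tile_origins(height, patch, stride)
        for c in tile_origins(width, patch, stride)
    ]


boxes = tile_boxes(1200, 1024)
print(boxes)
```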

Evaluation Indicators
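In this paper, following [16,57], six indicators are used to evaluate the performance of height estimation. The height evaluation indicators include absolute relative error (absRel), mean absolute error (MAE), root mean square error (RMSE), and accuracy with thresholds ($\delta_k$, $k \in \{1, 2, 3\}$). The specific formulas are as follows:

$$\mathrm{absRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| h_i - \hat{h}_i \right|}{h_i},$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( h_i - \hat{h}_i \right)^2},$$

$$\delta_k = \frac{1}{N} \left| \left\{ i : \max\left( \frac{h_i}{\hat{h}_i}, \frac{\hat{h}_i}{h_i} \right) < 1.25^k \right\} \right|,$$

where $N$ represents the total number of pixels in the image, $i$ denotes the pixel index in the image, $h$ is the height ground truth, and $\hat{h}$ is the predicted height value.

Referring to [8,25], five evaluation indicators for semantic segmentation are adopted: overall pixel accuracy (OA), the accuracy of the overall semantic segmentation; per-class pixel accuracy (AA), the average segmentation accuracy over the different classes; mean intersection over union (mIoU), the intersection-over-union ratio between the ground truth and the predicted semantic labels, averaged over classes; mean F1 score (mF1), the harmonic mean of precision and recall, averaged over classes; and the kappa coefficient (Kappa), a coefficient measuring the agreement between the prediction and the ground truth.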
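The six height indicators can be computed as in the sketch below, which uses the conventional definitions from the depth-estimation literature (thresholds of 1.25, 1.25², and 1.25³ for the accuracy indicators). The function name `height_metrics` is hypothetical, and clamping values with `eps` to avoid division by zero on ground-level pixels is an assumption, not a detail given in this section.

```python
import math


def height_metrics(pred, target, eps=1e-6):
    """absRel, MAE, RMSE and threshold accuracies for flattened height maps."""
    n = len(target)
    abs_err = [abs(t - p) for p, t in zip(pred, target)]
    # Ratio max(h / h_hat, h_hat / h), with values clamped away from zero.
    ratios = [
        max(max(t, eps) / max(p, eps), max(p, eps) / max(t, eps))
        for p, t in zip(pred, target)
    ]
    metrics = {
        "absRel": sum(e / max(t, eps) for e, t in zip(abs_err, target)) / n,
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(e * e for e in abs_err) / n),
    }
    for k in (1, 2, 3):
        metrics[f"delta{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics


m = height_metrics(pred=[2.0, 4.0], target=[2.0, 6.0])
print(m)
```

Lower values are better for the three error indicators, while higher values are better for the three threshold accuracies.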

Implementation Details
The proposed SCE-Net is implemented in the PyTorch framework on a single Tesla V100 with 32 GB of GPU memory. The network uses ResNet-50 or ResNet-101 pretrained on ImageNet as the backbone to extract shared features from the input image. During training, the input of the network is an image of size 512 × 512 randomly cropped from the original tiles. The predicted height map and the semantic segmentation result output by the network are both of size 512 × 512. The batch size is 4, and the total number of epochs is 50 for the network with ResNet-50 and 80 for the network with ResNet-101. The initial learning rate is $5 \times 10^{-4}$ and is then decreased using polynomial decay with power 0.9. During training, Adam is adopted as the optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. To prevent overfitting, three data augmentation methods are applied: horizontal flipping, vertical flipping, and rotation by an angle within [−1.25, 1.25], with probability 0.5.
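The polynomial learning-rate decay mentioned above is commonly written as the base rate multiplied by (1 − step / total)^power. A minimal sketch under that assumption follows; whether the schedule is stepped per iteration or per epoch is not stated here, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / total_steps) ** power."""
    step = min(max(step, 0), total_steps)  # keep the base non-negative
    return base_lr * (1.0 - step / total_steps) ** power


schedule = [poly_lr(5e-4, epoch, 50) for epoch in range(51)]
print(schedule[0], schedule[25], schedule[50])
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final step.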

Results
To validate the effectiveness of the proposed method, the proposed SCE-Net is compared with recent state-of-the-art methods, including single-task learning methods (STL) that only perform either height estimation or semantic segmentation, and multi-task learning methods (MTL) that jointly perform height estimation and semantic segmentation. The qualitative and quantitative results on the Vaihingen dataset and the Potsdam dataset are reported in the following sections.

Comparisons with State-of-the-Art Methods
Different from the single-task learning methods, the proposed SCE-Net adopts a multi-task learning framework and exploits the correlation between height information and semantic information to achieve collaborative learning between tasks. Unlike other multi-task learning methods, the proposed method selectively fuses features from related tasks rather than directly adding or concatenating them, so that the selected features are more relevant to each task, and uses deep metric learning to enhance the representation of features.
The quantitative evaluation of the height estimation results on the Vaihingen dataset is shown in Table 2. It can be seen that the SCE-Net achieves better results than the comparison methods in five out of six metrics. We believe that the MAE of [57] is slightly better than that of the proposed method because their introduction of boundary detection provides more supervised information. The semantic segmentation results on the Vaihingen dataset are reported in Table 3. The experimental results in Table 3 demonstrate that the semantic segmentation performance of the proposed method exceeds that of the comparison methods. This shows that the SCE-Net can not only utilize semantic features to improve the quality of height estimation results, but also make full use of height information to improve the accuracy of semantic segmentation. The qualitative results of height estimation and semantic segmentation on the Vaihingen dataset and the Potsdam dataset are illustrated in Figures 4 and 5. The visualization results of some local images are shown in Figures 6 and 7. It can be seen that the height estimation and semantic segmentation of the SCE-Net are accurate and reliable. Overall, the proposed SCE-Net achieves better performance than the compared state-of-the-art methods on the Vaihingen dataset. This section also reports the height estimation results and semantic segmentation results on the Potsdam dataset, as shown in Tables 4 and 5, respectively. Similarly, the results of the proposed method in both height estimation and semantic segmentation are better than those of the comparison methods on this dataset.

Ablation Studies
To verify the effectiveness of each component in the SCE-Net, this section conducts ablation studies on the Vaihingen dataset. The experiments in this section are performed on images of size 512 × 512. The quantitative results of the ablation experiments are shown in Table 6. There are five methods in Table 6: (a) a single-task learning network that estimates height from single images (STL_H); (b) a single-task learning network that performs semantic segmentation from single images (STL_S); (c) a multi-task learning network that consists of a shared encoder and two decoders for height estimation and semantic segmentation (MTL_B), hereinafter referred to as the baseline network; (d) the baseline network with the proposed FSFM module (MTL_B+FSFM); (e) the SCE-Net, namely, the baseline network with the FSFM module and the task-guided representation enhancement method (MTL_B+FSFM+TRE).
It can be seen from Table 6 that, when the same network is used, the results of the single-task learning networks are slightly better than those of the multi-task baseline network; however, after adding the FSFM module, both height estimation and semantic segmentation are improved. This shows that the FSFM module can take advantage of multi-task learning and exploit the features shared between related tasks. The SCE-Net achieves the best performance on both tasks, indicating that the proposed task-guided representation enhancement method can improve the feature representation ability of the network. The visualization results of the different networks are shown in Figure 8, and these qualitative results also illustrate the effectiveness of each component in the SCE-Net.

Discussion
In this paper, the proposed SCE-Net adopts a multi-task learning framework and achieves satisfactory performance on both height estimation and semantic segmentation. Although there is a correlation between class semantics and height, evident differences also exist; in particular, objects from the same class can have substantial height variations. With this consideration in mind, this work proposes a feature separation-fusion module (FSFM) that selectively fuses height features and semantic features to prevent inconsistent features across tasks from negatively impacting the predictive ability of the network. Some examples of the predicted height maps and error maps for the baseline network and the network with the FSFM module are shown in Figure 9. It can be seen that, compared with the baseline network, the results of the network with the FSFM module have smaller errors, which demonstrates that the FSFM module can effectively select features from related tasks. Furthermore, in the Vaihingen dataset, most buildings have uneven roofs and trees have large height differences. This reflects the geometric inconsistency of objects that belong to the same class but have different heights. Therefore, this section also shows the attention maps learned by the network for the selection of semantic features and their fusion with height features. The attention maps indicate that attention is less focused on places with the same semantic class but large height variations. This also shows that the FSFM module can effectively fuse the relevant features between the two tasks.
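The difference between direct fusion and selective fusion can be illustrated with a minimal NumPy sketch. It is not the FSFM implementation: the gating function, the projection weights `w_gate`, and all tensor shapes are hypothetical and are chosen only to show how a per-pixel attention map in [0, 1] can down-weight semantic features before they are added to height features, whereas summation and concatenation pass every feature through unchanged.

```python
import numpy as np

def fuse_sum(f_h, f_s):
    """Direct summation of height and semantic features, each of shape (C, H, W)."""
    return f_h + f_s

def fuse_cat(f_h, f_s):
    """Direct concatenation along the channel axis, giving shape (2C, H, W)."""
    return np.concatenate([f_h, f_s], axis=0)

def fuse_gated(f_h, f_s, w_gate):
    """Hypothetical attention-gated fusion (an assumption, not the paper's FSFM).

    w_gate has shape (2C,) and acts like a 1x1 convolution that maps the
    concatenated features to one logit per pixel.
    """
    stacked = np.concatenate([f_h, f_s], axis=0)             # (2C, H, W)
    logits = np.tensordot(w_gate, stacked, axes=([0], [0]))  # (H, W)
    attn = 1.0 / (1.0 + np.exp(-logits))                     # sigmoid, values in [0, 1]
    # Semantic features are scaled per pixel before being added to height features.
    return f_h + attn[None, :, :] * f_s, attn

# Usage with random features: 4 channels on an 8 x 8 grid.
rng = np.random.default_rng(0)
f_h = rng.normal(size=(4, 8, 8))
f_s = rng.normal(size=(4, 8, 8))
w_gate = rng.normal(size=8)
fused, attn = fuse_gated(f_h, f_s, w_gate)
print(fused.shape, attn.shape)
```

In this sketch, a pixel whose attention value is close to zero keeps its height features almost untouched, which mirrors the observation that attention is lower where the semantic class is the same but the height varies strongly.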
In addition, the results of the FSFM module are compared with two other feature fusion methods, and the experimental results are shown in Table 7. The three methods in Table 7 are as follows: (a) the height and semantic features are fused by direct summation (B+Sum); (b) the height and semantic features are fused by direct concatenation (B+Cat); (c) the height and semantic features are fused by the proposed FSFM module (B+FSFM). As seen from the table, the FSFM module outperforms the other two feature fusion methods. This shows that the FSFM module can more effectively utilize the features between related tasks and improve the prediction performance of the model. The FSFM module extracts task-aware features from shared features and integrates features from related task branches based on an attention mechanism. On this basis, a task-guided representation enhancement method is employed to refine the task-aware features. In this method, interval discretization is first performed for the height range, and then a height-guided feature distance loss is designed for the height intervals and a semantic-guided feature distance loss is designed for the semantic classes.
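The two steps of the task-guided representation enhancement can be sketched as follows. This is a minimal illustration under stated assumptions, not the loss used in SCE-Net: the uniform binning, the contrastive form of the loss, the `margin` parameter, and the function names are all hypothetical. The defaults of 30 intervals over a 0-25.5 m range follow the values reported in this section, which gives a bin width of 0.85 m.

```python
import numpy as np

def discretize_heights(heights, h_min=0.0, h_max=25.5, num_intervals=30):
    """Map height values (in metres) to interval indices 0..num_intervals-1.

    Uniform binning is an assumption made for this sketch.
    """
    width = (h_max - h_min) / num_intervals
    idx = np.floor((np.asarray(heights, dtype=float) - h_min) / width).astype(int)
    return np.clip(idx, 0, num_intervals - 1)

def feature_distance_loss(features, labels, margin=1.0):
    """Generic contrastive-style loss over the pixels of one local patch.

    features: (N, D) array, one feature vector per pixel.
    labels:   (N,) integer array, height-interval or semantic-class indices.
    Same-label pairs are pulled together; different-label pairs are pushed
    at least `margin` apart. This form is assumed, not taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)   # pairwise distances (N, N)
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)      # same label, excluding self-pairs
    neg = ~same
    pull = (dist[pos] ** 2).mean() if pos.any() else 0.0
    push = (np.maximum(0.0, margin - dist[neg]) ** 2).mean() if neg.any() else 0.0
    return pull + push

# Usage: a 7 x 7 patch (49 pixels) with 16-dimensional features.
rng = np.random.default_rng(0)
patch_heights = rng.uniform(0.0, 25.5, size=49)
patch_features = rng.normal(size=(49, 16))
intervals = discretize_heights(patch_heights)
print(feature_distance_loss(patch_features, intervals))
```

Passing semantic class indices instead of height-interval indices as `labels` gives the corresponding semantic-guided variant of the same sketch.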
Here, the influence of the number of height intervals on height estimation and semantic segmentation is assessed. The experimental results of discretizing the height range into different numbers of intervals (10, 20, 30, 40, 50, and 60) are reported in Table 8. It can be seen that the height estimation performance improves as the number of height intervals increases, reaches its best when the number of intervals is 30, and then gradually degrades as the number of intervals increases further. This is because the height range of the Vaihingen dataset and the Potsdam dataset is 0-25.5 m. If the number of height intervals is too small, each interval contains a wide range of height values, so the height-guided feature distance loss introduces a large error when it constrains the features in the same interval to be consistent. When the number of height intervals is too large, the height feature is close to the pixel-level feature, resulting in inaccurate height prediction. Therefore, in future work, the number of height intervals can be adaptively adjusted according to the approximate height range of the dataset used. It is worth noting that when the number of height intervals changes, the results of semantic segmentation remain basically unchanged. This shows that the improvement in height estimation does not come at the expense of semantic segmentation performance, and that height discretization with a proper number of intervals is necessary.
Since the task-guided representation enhancement method is performed on local patches, the impact of the local patch size on height estimation and semantic segmentation is also assessed. Since the height variations of the scenes are usually more pronounced than the semantic class variations, this work chooses different local patch sizes for the height branch and the semantic branch. For the height branch, the image patch size is set to 5 × 5, 7 × 7, and 9 × 9, and the experimental results are shown in Table 9. It can be seen that the height estimation results with 7 × 7 image patches are the best. The experimental results show that the image patches should be neither too small nor too large for height estimation. Small image patches contain few pixels, whose height values may all belong to the same height interval, which makes the height-guided feature distance loss less effective. If the image patches are too large, the pixels in the same patch may have substantial height differences, which could blur the feature expression of different height intervals. For the semantic branch, this work chooses image patch sizes of 9 × 9, 11 × 11, and 13 × 13. The experimental results are shown in Table 10. It can be seen that the results are the best when the image patch size is 11 × 11. Similar to height features, pixels in a small image patch are more likely to belong to the same class, while a large image patch may contain different object classes, making it difficult to obtain an optimal feature representation. In the experiments, the proposed method chooses image patches of size 7 × 7 for the height branch and size 11 × 11 for the semantic branch.
Furthermore, the computational time of the SCE-Net on the Vaihingen dataset and the Potsdam dataset is analyzed, as shown in Table 11. As in Section 4.2, B represents the baseline network, FSFM represents the feature separation-fusion module, and TRE represents the task-guided representation enhancement method. It can be seen that for the Vaihingen dataset, when the backbone adopts ResNet-50, the average inference time of the baseline network for images of size 512 × 512 is 0.034 s, the total time for the testing dataset is 13.796 s, and the average inference time for each original tile is 0.811 s. When the FSFM module is added, the average inference time for 512 × 512 images is 0.035 s, the total time for the testing images is 14.219 s, and each original tile takes 0.836 s on average. Further, when TRE is added, the inference time of the model does not increase, because the feature representation enhancement of the height features and semantic features is not included in the testing stage. When the backbone adopts ResNet-101, the average inference time for 512 × 512 images is 0.039 s, the total inference time for the testing images is 15.761 s, and the average inference time per original tile is 0.927 s. For the Potsdam dataset, when the backbone adopts ResNet-50, the average inference time for 512 × 512 images is 0.033 s, the total time is 68.205 s, and each original tile takes about 4.871 s. When the backbone adopts ResNet-101, the inference time for each 512 × 512 image is about 0.037 s, the total inference time for the testing images is 76.140 s, and the average inference time per original tile is 5.438 s; the increase arises because ResNet-101 is more complex than ResNet-50. The experimental results demonstrate that, compared with the baseline network, the proposed SCE-Net can effectively improve the height estimation and semantic segmentation performance with little increase in computational time.
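As a quick consistency check on the timings quoted above, the short script below derives two quantities from them: the number of original tiles implied by dividing each total time by the per-tile time, and the relative overhead of adding the FSFM module on the Vaihingen dataset. The dictionary keys and helper names are illustrative only; the numbers are copied from the text, and the tile counts are implied values rather than figures stated in the paper.

```python
# Timings in seconds as quoted in the text: (per 512 x 512 image, total, per original tile).
timings = {
    "Vaihingen / ResNet-50 / B": (0.034, 13.796, 0.811),
    "Vaihingen / ResNet-50 / B+FSFM": (0.035, 14.219, 0.836),
    "Vaihingen / ResNet-101": (0.039, 15.761, 0.927),
    "Potsdam / ResNet-50": (0.033, 68.205, 4.871),
    "Potsdam / ResNet-101": (0.037, 76.140, 5.438),
}

def implied_tile_count(total_s, per_tile_s):
    """Number of original tiles implied by the total and per-tile times."""
    return round(total_s / per_tile_s)

def relative_overhead(base_total_s, new_total_s):
    """Fractional increase of the total inference time over a baseline."""
    return (new_total_s - base_total_s) / base_total_s

tile_counts = {
    name: implied_tile_count(total, per_tile)
    for name, (_, total, per_tile) in timings.items()
}
fsfm_overhead = relative_overhead(
    timings["Vaihingen / ResNet-50 / B"][1],
    timings["Vaihingen / ResNet-50 / B+FSFM"][1],
)

for name, count in tile_counts.items():
    print(f"{name}: about {count} tiles")
print(f"FSFM overhead on Vaihingen (ResNet-50): {fsfm_overhead:.1%}")
```

The ratios come out at 17 tiles for every Vaihingen configuration and 14 tiles for every Potsdam configuration, and the FSFM module adds roughly 3% to the total Vaihingen inference time, which is consistent with the statement that the computational cost increases only slightly.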

Conclusions
Recently, several works in the literature have demonstrated that height estimation and semantic segmentation can benefit from each other and improve the performance of both tasks. In this paper, a multi-task learning network, called the self-and cross-enhancement network (SCE-Net), is proposed to simultaneously learn height information and semantic labels from single aerial images. Considering that height information and class semantics are generally related, but not always in complete agreement, a feature separation-fusion module is constructed to achieve effective separation and fusion of height features and semantic features, improving the feature representation across tasks. To obtain a more discriminative feature representation, this work constrains the height-aware features of objects with the same height to be similar, and the height-aware features of objects with large height differences to be as different as possible. Similarly, semantic-aware features within the same class are constrained to be similar, while semantic-aware features across different classes are constrained to be largely different. In this way, both the height-aware feature representation and the semantic-aware feature representation are enhanced by the height-guided feature distance loss and the semantic-guided feature distance loss, achieving better performance in height estimation and semantic segmentation. Extensive experimental results on two public datasets demonstrate the effectiveness of the proposed SCE-Net. In future work, the height estimation capability could be further improved by exploring more suitable multi-scale contextual information, for example, by using ASPP-like [40] techniques for multi-scale feature learning, or by training the network under a multi-scale training paradigm.

Figure 1. The overall network architecture of SCE-Net.

Figure 4. The qualitative results of height estimation and semantic segmentation on the Vaihingen dataset. (a) Images. (b) Height ground truth. (c) Predicted height maps. (d) Semantic labels. (e) Semantic segmentation results.

Figure 5. The qualitative results of height estimation and semantic segmentation on the Potsdam dataset. (a) Images. (b) Height ground truth. (c) Predicted height maps. (d) Semantic labels. (e) Semantic segmentation results.

Figure 6. The visualization results of local images on the Vaihingen dataset. (a) Images. (b) Height ground truth. (c) Predicted height maps. (d) Semantic labels. (e) Semantic segmentation results.

Figure 7. The visualization results of local images on the Potsdam dataset. (a) Images. (b) Height ground truth. (c) Predicted height maps. (d) Semantic labels. (e) Semantic segmentation results.

Figure 9. The visualization results of baseline network and the network with FSFM module on the Vaihingen dataset. (a) Images. (b) Height ground truth. (c) Predicted height maps of the baseline network. (d) Error maps of the baseline network. (e) Predicted height maps of the network with FSFM module. (f) Error maps of the network with FSFM module. (g) Attention maps.

Table 1. Notation of important symbols in this work.

Table 2. Quantitative evaluation of the height estimation results on the Vaihingen dataset.

Table 3. Quantitative evaluation of the semantic segmentation results on the Vaihingen dataset.

Table 4. Quantitative evaluation of the height estimation results on the Potsdam dataset.

Table 5. Quantitative evaluation of the semantic segmentation results on the Potsdam dataset.

Table 6. Ablation studies on the Vaihingen dataset.

Table 7. Experimental results of different feature fusion methods on the Vaihingen dataset.

Table 8. Experimental results with different height interval numbers on the Vaihingen dataset.

Table 9. Experimental results under different height patch sizes on the Vaihingen dataset. Here, h and s denote the patch sizes of the height branch and the semantic branch, respectively.

Table 10. Experimental results under different semantic patch sizes on the Vaihingen dataset. Here, h and s denote the patch sizes of the height branch and the semantic branch, respectively.

Table 11. Computational time of different combinations of the used modules in the SCE-Net on the Vaihingen dataset and the Potsdam dataset.