StyHighNet: Semi-Supervised Learning Height Estimation from a Single Aerial Image via Unified Style Transferring

Recovering height information from a single aerial image is a key problem in computer vision and remote sensing. Supervised learning methods have achieved impressive results, but, due to domain bias, a trained model cannot be directly applied to a new scene. In this paper, we propose a novel semi-supervised framework, StyHighNet, that accurately estimates height from a single aerial image of a new city while requiring only a small amount of labeled data. The core idea is to transfer multi-source images to a unified style, so that unlabeled data provide their appearance distribution as an additional supervision signal. The framework contains three sub-networks: (1) the style transferring sub-network, which maps multi-source images into unified style distribution maps (USDMs); (2) the height regression sub-network, which predicts height maps from the USDMs; and (3) the style discrimination sub-network, which distinguishes the sources of the USDMs. The style transferring sub-network shoulders dual responsibilities. On the one hand, it must compute USDMs with distinctive characteristics, so that the height regression sub-network can accurately estimate the height maps. On the other hand, the USDMs must have a consistent distribution in order to confuse the style discrimination sub-network and thereby achieve domain adaptation. Unlike previous methods, our style distribution function is learned without supervision, which gives it greater flexibility and better accuracy. Furthermore, when the style discrimination sub-network is disabled, the framework can also be used for supervised learning. We performed qualitative and quantitative evaluations on two public datasets, Vaihingen and Potsdam. Experiments show that the framework achieves superior performance in both supervised and semi-supervised learning modes.


Introduction
With the development of remote sensing and image acquisition technology, high-resolution aerial images are widely used, e.g., in urban planning, disaster monitoring, and emergency management. If height information could be automatically extracted from aerial images, it would further raise the level of intelligence of downstream applications, such as automated city modeling [1,2] and augmented reality [3,4]. However, extracting height from a single image is an ill-posed problem [5], especially for scenes with complex structure. Most traditional solutions are based on handcrafted visual features and probabilistic graphical models (PGMs); they rely on strong assumptions about the geometry of the scene and therefore suffer from limited flexibility and stability [6]. In recent years, with the growth of deep learning and the emergence of large-scale datasets, image-to-height mapping can be trained end-to-end. Most such methods use an encoder-decoder network structure [7], where the encoder extracts multi-scale spatial features and the decoder gradually up-samples these features to the original size.
The main contributions of this work are as follows:
1. We propose a novel network framework that learns height estimation from a single image in a semi-supervised manner, based on unified style transferring.
2. We generate a small-scale synthetic dataset automatically through city modeling software and a game engine to make up for the lack of real-world data.
3. We design a set of loss functions that enable the three sub-networks to work together in an orderly way to achieve the goal of semi-supervised learning.
We conducted quantitative and qualitative evaluations on two public datasets, Vaihingen and Potsdam. The experiments showed that our framework outperforms previous methods in both supervised and semi-supervised modes. We also verified the effects of hyperparameters through an ablation study.

Related Works
In this section, we review the three aspects most relevant to the method proposed in this paper, namely monocular depth/height estimation, domain adaptation, and style transfer.

Monocular Depth/Height Estimation
The purpose of monocular depth/height estimation is to determine the depth/height value of each pixel in an image. It is a basic problem in many computer vision tasks and has received extensive attention. Early methods are mainly based on handcrafted features and probabilistic graphical models (PGMs). Saxena et al. [16,17] used Markov Random Fields (MRFs) combined with local and global features to infer depth from a monocular image, and introduced super-pixels to enforce neighboring constraints. Comber et al. [18] calculated building heights from the relationship between the length of the shadow and the position of the sun. Qi et al. [19,20] used information provided by Google Earth and proposed the corner shadow length ratio (CSLR) to calculate building heights. These methods rely on strong assumptions about the input image and thus have limited practical applicability.
In terms of deep learning, Baig et al. [21] used sparse coding to estimate the depth of the entire scene. The authors of [22,23] used a two-scale network to learn the mapping from RGB images to depth. Since then, multiple improved versions have appeared [24–27]. In the field of remote sensing, several networks for predicting height have been proposed [5,28–31]. The above methods generally adopt encoder-decoder structures, where the encoder extracts multi-scale features and the decoder up-samples and combines these features to regress the pixel-wise height. However, due to the lack of high-quality, large-scale training data, these supervised learning methods suffer from problems of stability and completeness [32]. Recently, Xie et al. [33] proposed a self-supervised learning method, the Deep3D network, which predicts the depth map from stereo images without training labels: it reconstructs a virtual right image from the predicted depth map and the known camera translation, and uses its consistency relative to the left image as the main learning signal. Godard et al. [34] used bilinear interpolation and left-right consistency cross-checking to obtain higher accuracy. Although such methods achieve quality superior to their supervised counterparts, the stereo image pairs require strict synchronization and calibration, which still limits the available training data. Zhou et al. [35] simultaneously estimated the depth map and the ego-motion between adjacent frames of a monocular video, which further lowered the requirements on training data; however, dynamic objects in the scene violate the assumption of rigid transformation, leading to blurred and incomplete results. Subsequent work improved on this through off-line masking [36–38], optical flow [39], or on-line masking [32,40,41].
The authors of [42,43] proposed a semi-supervised depth estimation method that combines LiDAR labels with the novel-view consistency of adjacent frames to ensure correct predictions. The authors of [44] proposed a method that learns single-image depth estimation in a semi-supervised manner through the relationship between semantic labels and geometric information. Although the above methods achieve high fidelity on the training data, they do not consider the cross-domain setting. Moreover, in the field of remote sensing, isolated images without spatiotemporally adjacent frames are the mainstream data format. Therefore, we take advantage of both height labels and a unified style distribution as learning signals to achieve accuracy and domain adaptability simultaneously.

Domain Adaptation
Because comprehensive training datasets for depth/height estimation are lacking, synthetic datasets [9–11] have been generated to complement real-world datasets, owing to their low cost and pixel-perfect labels. However, the inevitable bias introduced by the virtual modeling and rendering process means that networks trained on synthetic images cannot be directly applied to real-world scenes. Zhou et al. [12] proposed a fine-tuning method that retrains the model on a small amount of target data, but it faces the issues of overfitting and catastrophic forgetting [13]. Domain adaptation methods [8,15,45–49] minimize the difference between the source data and the target data by means of a pre-trained model, but they tend to fail when the difference between the two data sources is large (e.g., two cities). Here, we learn a unified style distribution without supervision to avoid such adaptation failures.

Style Transfer
Gatys et al. [50] were the first to convert source images to another style via a convolutional neural network. Subsequent methods either directly update the pixel values of the output image [51–54] or learn a specified image style from a large amount of training data [55–59]. Among them, the Gram matrix is commonly used to evaluate the consistency of the distributions. Inspired by this idea, we transfer multi-source images to a unified style distribution while preserving distinctive characteristics to ensure the robustness of height estimation.

Method
In the following subsections, we describe the proposed framework: the pipeline overview, the running mechanism, and the loss functions.

Pipeline Overview
The framework is composed of three sub-networks: (1) the style transferring network $N_t$, which converts the original images $X_* \in \mathbb{R}^{H \times W \times 3}$ from multiple sources into style distribution maps $T_* \in \mathbb{R}^{H \times W \times C_t}$, where $* \in \{sup, sem, syn\}$ denotes the three types of input images: $sup$ refers to real data with a large number of labels, $sem$ to real data with a small number of labels, and $syn$ to synthetic data; (2) the height regression network $N_h$, which regresses the height maps $Y_* \in \mathbb{R}^{H \times W \times 1}$ from $T_*$; and (3) the style discrimination network $N_d$, which takes $T_*$ as input and outputs $D_* \in \mathbb{R}^{H \times W \times 3}$, the per-pixel probability distribution over the source categories of $T_*$. These three sub-networks are coupled through three loss functions ($\mathrm{loss}_h$, $\mathrm{loss}_d$, $\mathrm{loss}_c$) to achieve height estimation and domain adaptation, as shown in Figure 1.

Implementation Mechanism of StyHighNet
In our pipeline, two workflows are trained simultaneously: supervised learning of height regression (involving $N_t$ and $N_h$) and unsupervised learning of the unified style distribution (involving $N_t$ and $N_d$). $N_t$ thus undertakes dual tasks across these two workflows, which is what makes the learning semi-supervised.

Supervised Height Regression
Unlike previous supervised methods [29,60], our height estimation adopts a dual-network serial inference strategy. The style transferring network $N_t$ converts the multi-source images $X = \{X_* \mid * = sup, sem, syn\}$ into the style distribution maps $T$, which the height regression network $N_h$ then regresses to the corresponding height maps $Y$. There are three types of input data for $N_t$, each playing a different role: (1) real data with many labels, $X_{sup}$, are the main force of supervised learning and form the source domain in terms of domain adaptation; (2) real data with few labels, $X_{sem}$, are small in number but provide key guidance for the style distribution and form the target domain in terms of domain adaptation; and (3) synthetic data, $X_{syn}$, complement $X_{sup}$ because of their low cost and pixel-perfect labels.
We employ a popular encoder-decoder structure [7] for both $N_t$ and $N_h$. The encoder adopts the MobileNetV2 architecture [61] to improve computational efficiency. The decoder uses deconvolution as the up-sampling function. Feature maps of the same size in the encoder and decoder are linked by skip connections to preserve geometric details. The input and output sizes of the two networks are the same, and the number of channels of $T$ is set to 3 for convenience of visualization and analysis. The output activation functions of $N_t$ and $N_h$ are both sigmoid. The specific network structure is shown in Figure 2.

Unsupervised Style Transferring
The task of unsupervised style transferring is completed jointly by the style transferring network $N_t$ and the style discrimination network $N_d$. Their relationship is similar to that of the generator and discriminator in Generative Adversarial Networks (GANs) [62]. $N_d$ classifies the source category of $T = \{T_* \mid * = sup, sem, syn\}$; its number of output channels equals the number of source categories (here, 3), and its activation function is softmax, which yields a probability distribution over the categories. $N_t$ tries to confuse $N_d$ by making the distributions of $T$ from multi-source images as similar as possible, thereby achieving domain adaptation. The unified style distribution is not known in advance; it is learned without supervision and tends to stabilize during the adversarial process. However, our design differs from the classic generative adversarial network [62] in two respects: (1) $N_d$ performs classification for each pixel rather than summarizing all pixels into a single scalar decision, which improves control and analysis. (2) Our style distribution maps $T$ are derived from the multi-source images $X$ instead of from a random vector. We use the same network structure for $N_d$, as shown in Figure 2.

Figure 2. Sub-network architecture. The three sub-networks use the same network structure; only the input and output data differ according to the specific task.
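To make the per-pixel classification performed by the discriminator concrete, the sketch below applies a softmax over three source logits at a single pixel. It is illustrative only: the function name `pixel_softmax` and the example logits are hypothetical and are not taken from the paper's implementation.

```python
import math

def pixel_softmax(logits):
    """Softmax over the source-category logits of one pixel."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the categories (sup, sem, syn) at one pixel.
probs = pixel_softmax([2.0, 0.5, -1.0])
```

Applying this at every pixel yields an H × W × 3 map of source probabilities, rather than one scalar decision per image.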

Semi-Supervised Learning
During training, the two workflows described above are carried out at the same time, so the style transferring network $N_t$ shoulders dual tasks simultaneously. On the one hand, together with the height regression network $N_h$, it learns under supervision a style transferring function from which the height map of multi-source images can be recovered; the characteristics of the style distribution map $T$ need to be distinctive to allow accurate height regression. On the other hand, it works against the style discrimination network $N_d$ in an adversarial manner to make the $T$ from different sources as similar as possible, achieving domain adaptation. Therefore, images without labels can also contribute supervising signals through the style distribution. Note that only the labeled data enter the height regression workflow, while all of the data enter the style transferring workflow, which forms a semi-supervised learning mechanism. In the training phase, these two workflows are performed cooperatively in parallel.

Loss Functions
The style transferring sub-network $N_t$ and the height regression sub-network $N_h$ participate in supervised learning to recover the height map from multi-source images. The binary cross-entropy (BCE) loss $\mathrm{loss}_h$ is used to optimize the parameters of these two sub-networks. In its definition, $Y$ is the predicted height map, $F(\cdot, \cdot)$ denotes the network mapping function, $N$ is the number of pixels, $i$ is the pixel index, and $\hat{Y}$ is the corresponding height label.
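For reference, a standard binary cross-entropy consistent with the symbols defined above can be written as below. This is a sketch, not a quotation of the paper's equation: it assumes that heights are normalized to [0, 1] (in line with the sigmoid output), that the loss is averaged over all $N$ pixels, and that $M_t$ and $M_h$ denote the parameters of $N_t$ and $N_h$ ($M_h$ is a name introduced here by analogy with $M_t$).

```latex
\begin{aligned}
Y &= F\bigl(M_h, F(M_t, X)\bigr), \\
\mathrm{loss}_h &= -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[ \hat{Y}_i \log Y_i + (1 - \hat{Y}_i) \log (1 - Y_i) \Bigr].
\end{aligned}
```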
In the optimization process, the style distribution maps $T = F(M_t, X)$ output by the style transferring sub-network $N_t$ are originally unconstrained, so images from different data sources may have different styles, which leads to domain bias. To this end, we introduce a style discrimination network $N_d$ to unify the style distribution, which involves two losses ($\mathrm{loss}_d$ and $\mathrm{loss}_c$). $\mathrm{loss}_d$ evaluates how well the data categories are classified and is implemented with the cross-entropy function [63], as in semantic segmentation tasks [64]. In contrast, $\mathrm{loss}_c$ aims at confusing the style discrimination network $N_d$, making the $T$ from the three data categories as similar as possible. In their definitions, $D \in \mathbb{R}^{H \times W \times 3}$ is the output of the style discrimination sub-network $N_d$, normalized by a softmax activation, and $D^l_{m,i}$ is the discriminant probability inferred for the $m$-th data source category at pixel $i$ in the $l$-th channel, where $m, l \in \{0, 1, 2\}$.
If we treat $D$ as an RGB image, the style discrimination network $N_d$ tries to output three pure-color images for the three data categories: red for $T_{sup}$, green for $T_{sem}$, and blue for $T_{syn}$. In Equation (4), the target category is always set to 0, because $X_{sup}$ provides the most learning signals, which helps avoid excessive smoothness. The other style distribution maps ($T_{sem}$ and $T_{syn}$) are constrained to be close to $T_{sup}$ to accomplish the task of domain adaptation.
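A plausible form of the two discriminator-related losses is sketched below, reading $D^{l}_{m,i}$ as the probability assigned to channel $l$ at pixel $i$ for an input from source $m$. The structure follows the description above: $\mathrm{loss}_d$ rewards the channel that matches the true source $m$, whereas $\mathrm{loss}_c$ always targets channel 0. The normalization by $3N$, the natural logarithm, and the inclusion of $m = 0$ in the sum of $\mathrm{loss}_c$ are assumptions that the text does not settle.

```latex
\begin{aligned}
\mathrm{loss}_d &= -\frac{1}{3N} \sum_{m=0}^{2} \sum_{i=1}^{N} \log D^{m}_{m,i}, \\
\mathrm{loss}_c &= -\frac{1}{3N} \sum_{m=0}^{2} \sum_{i=1}^{N} \log D^{0}_{m,i}.
\end{aligned}
```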
The height regression sub-network $N_h$ and the style discrimination sub-network $N_d$ are optimized by $\mathrm{loss}_h$ and $\mathrm{loss}_d$, respectively, as they are both independent modules. However, the style transferring sub-network $N_t$ is a dual-task module, so it is optimized with a combined loss function, Equation (5), in which the coefficient $\lambda$ is a fusing weight, set to 0.1 in practice.
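The interplay of the losses can be illustrated with a small pure-Python sketch. It assumes an additive fusion, $\mathrm{loss}_t = \mathrm{loss}_h + \lambda\,\mathrm{loss}_c$, which is one natural reading of a "fusing weight" but is not confirmed by the text; the function names, the toy probabilities, and the value `loss_h=0.3` are all hypothetical.

```python
import math

def cross_entropy(prob_maps, targets):
    """Mean negative log-probability of the target channel.

    prob_maps[m][i] holds the 3 softmax probabilities at pixel i for an
    input from source m; targets[m] is the channel to reward for source m.
    """
    total, count = 0.0, 0
    for m, pixels in enumerate(prob_maps):
        for probs in pixels:
            total -= math.log(probs[targets[m]])
            count += 1
    return total / count

def discriminator_loss(prob_maps):
    # loss_d: each source m should be classified as its own channel m.
    return cross_entropy(prob_maps, targets=[0, 1, 2])

def confusion_loss(prob_maps):
    # loss_c: every source should look like category 0 (sup).
    return cross_entropy(prob_maps, targets=[0, 0, 0])

def style_transfer_loss(loss_h, loss_c, lam=0.1):
    # Assumed additive fusion; lam is the fusing weight (0.1 in the text).
    return loss_h + lam * loss_c

# Toy example: two pixels per source, with a fairly confident discriminator.
D = [
    [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],  # inputs from sup
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],  # inputs from sem
    [[0.1, 0.1, 0.8], [0.1, 0.2, 0.7]],  # inputs from syn
]
loss_d = discriminator_loss(D)
loss_c = confusion_loss(D)
loss_t = style_transfer_loss(loss_h=0.3, loss_c=loss_c)
```

In this toy case the discriminator separates the sources well, so `loss_d` is small while `loss_c` is large; training $N_t$ against `loss_c` pushes the sem and syn style maps toward the sup style.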

Experiment
To verify the performance of StyHighNet, we built a synthetic dataset and carried out qualitative and quantitative evaluations on two open datasets, Vaihingen and Potsdam. We also performed an ablation study to observe the effects of hyperparameters.

Datasets
The Vaihingen dataset includes 33 regions of different sizes, each containing a top view taken from the mosaic and the corresponding height map. The ground sampling distance of both types of images is 9 cm. The height maps were generated with Trimble INPHO 5.3 software, and the top views were stitched with Trimble INPHO OrthoVista. To avoid data loss, these 33 areas were cut from the central part of the reconstructed scene, and interpolation was used to fill in missing data.
The Potsdam dataset contains 38 areas of the same size, where top views and height maps are both taken from the mosaic with a 5 cm sampling distance. The top view images are in TIFF format and come in different channel combinations: (1) IRRG with three channels (IR-R-G); (2) RGB with three channels (R-G-B); and (3) RGBIR with four channels (R-G-B-IR). Users can choose the appropriate channel mode; here we use the RGB mode. The height maps are also in TIFF format but with one channel, encoded as 32-bit floating-point values in meters.
The synthetic dataset, similar to the one in [65], is generated automatically by modeling software and a game engine. Objects are randomly distributed in the virtual city, including roads, buildings, trees, lawns, etc. The 3D models, containing shapes, materials, and textures, are imported into the game engine in OBJ format. The color maps and height maps are sampled and rendered at random positions, both in PNG format. Some examples are given in Figure 3.

Implementation Details
We implemented the proposed network using the open-source deep learning framework PyTorch [66]. For training, we used the Adam optimizer [67] with $lr = 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The learning rate was scheduled via exponential decay with $d = 0.96$. The total number of epochs was set to 50 with a batch size of 32, on a workstation equipped with four NVIDIA 1080 Ti GPUs, for all experiments in this work.
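As a quick illustration of the schedule, the sketch below computes the learning rate for each of the 50 epochs. It assumes that the decay factor 0.96 is applied once per epoch, which the text does not state explicitly; the function name `exponential_lr` is hypothetical. In PyTorch, the same behavior would typically be obtained with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)` stepped once per epoch.

```python
def exponential_lr(epoch, base_lr=1e-4, decay=0.96):
    """Learning rate after `epoch` decay steps (assumed: one step per epoch)."""
    return base_lr * decay ** epoch

# Learning rates for epochs 0..49.
schedule = [exponential_lr(e) for e in range(50)]
```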
All three sub-networks adopted the U-Net architecture [7] with a MobileNetV2 [61] encoder and a deconvolutional decoder. The outputs of all sub-networks were passed through a sigmoid activation for normalization, except for the style discrimination sub-network, whose output was activated by a softmax function for pixel-wise classification. The two workflows of height regression and style transferring were parallel at the macro level and serial at the micro level, meaning that they were trained in turn on each batch.
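The "in turn on each batch" schedule can be sketched as follows. This is a schematic under stated assumptions rather than the authors' training loop: the dictionary key `height`, the two step callbacks, and the order of the two steps within a batch are all hypothetical. Only labeled samples are routed to the height regression step, while every sample reaches the style transferring step.

```python
def train_epoch(batches, height_step, style_step):
    """Run both workflows in turn on every batch.

    height_step receives only the labeled samples of a batch;
    style_step receives the whole batch, labeled or not.
    """
    log = []
    for batch in batches:
        labeled = [s for s in batch if s["height"] is not None]
        if labeled:
            log.append(("height", height_step(labeled)))
        log.append(("style", style_step(batch)))
    return log

# Stub steps that simply report how many samples they received.
batches = [[{"height": 1.5}, {"height": None}, {"height": None}]]
log = train_epoch(batches, height_step=len, style_step=len)
```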
To avoid overfitting, we augmented images before feeding them to the network, using random rotation in the range $[-\pi, +\pi]$ as well as random contrast, brightness, and color adjustment in the range $[0.8, 1.2]$, with a probability of 50%. The images were also randomly cropped to 512 × 512 for training and 1024 × 1024 for testing. Training and testing data were randomly split at a ratio of 6:4. All test results shown in this section are averages over five independent experiments. For the Potsdam dataset, all original data were down-sampled by a factor of 2 to increase the sampling distance from 5 to 10 cm.
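The augmentation parameters described above could be drawn as in the sketch below. It is a minimal illustration under assumptions: the text does not say whether the 50% chance applies to each adjustment independently or to the whole set, nor whether it also gates the rotation; here the rotation is always sampled and each photometric factor is gated independently. The function and key names are hypothetical.

```python
import math
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters."""
    params = {"angle": rng.uniform(-math.pi, math.pi)}
    for name in ("contrast", "brightness", "color"):
        # Each photometric change is applied with probability 0.5 (assumed
        # to be decided independently); 1.0 means "leave unchanged".
        params[name] = rng.uniform(0.8, 1.2) if rng.random() < 0.5 else 1.0
    return params

rng = random.Random(0)  # fixed seed for reproducibility
params = sample_augmentation(rng)
```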
We used the same numerical metrics as in [29,60] to evaluate the quality of height regression: the root-mean-square error (RMSE) and the zero-mean normalized cross-correlation (ZNCC). In their definitions, $x$ and $y$ denote the output and the ground truth, respectively, each with $n$ pixels; $\mu_x$ and $\mu_y$ are the mean values of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are their standard deviations.
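Both metrics are straightforward to compute. The sketch below uses their standard definitions, $\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_i (x_i - y_i)^2}$ and $\mathrm{ZNCC} = \tfrac{1}{n}\sum_i \frac{(x_i - \mu_x)(y_i - \mu_y)}{\sigma_x \sigma_y}$, which are assumed to match those of [29,60]; the population standard deviation (division by $n$) is likewise an assumption. Images are flattened to plain lists for simplicity.

```python
import math

def rmse(x, y):
    """Root-mean-square error between two equally long sequences."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)

def zncc(x, y):
    """Zero-mean normalized cross-correlation (population statistics)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    sigma_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return cov / (sigma_x * sigma_y)

# Toy example: the prediction is the ground truth shifted by one unit.
pred = [0.0, 1.0, 2.0, 3.0]
truth = [1.0, 2.0, 3.0, 4.0]
error = rmse(pred, truth)
correlation = zncc(pred, truth)
```

The toy example shows why the two metrics are complementary: a constant offset is penalized by RMSE but leaves ZNCC at its maximum, since ZNCC measures only structural agreement.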

Supervised Mode
Our framework supports supervised learning by simply omitting the style discrimination sub-network. In this learning mode, the two datasets (Vaihingen and Potsdam) were trained separately, as only one data source ($X_{sup}$) is needed. $X_{sup}$ was first fed into the style transferring sub-network $N_t$ to obtain a style distribution map $T$, and $T$ was then fed to the height regression sub-network $N_h$ to regress the height maps; only $\mathrm{loss}_h$ was minimized, optimizing both sub-networks jointly. As shown in Table 1, RMSE and ZNCC were improved by 2% and 3%, respectively, compared to the state-of-the-art work [60]. Visualized results are shown in Figure 4 and compared with IMG2DSM [29]; our results are visibly sharper than those of IMG2DSM [29].

Figure 4. From left to right: input images, ground truth, predicted height maps of IMG2DSM (our implementation), predicted height maps of our method, height difference maps of IMG2DSM, and height difference maps of our method.

Table 1. Comparison of height estimation results in supervised learning mode with the previous works IMG2DSM [29] and MPFupsion [60]. Best results in each category are in bold.

Although StyHighNet needs two cascaded sub-networks to predict the height maps, it still achieves a high level of time and space efficiency. All sub-networks in StyHighNet were implemented with the lightweight MobileNetV2 structure [61], which contains only 12M parameters and predicts a 1024 × 1024 image in just 50 ms.

Inner-Domain Semi-Supervised Learning
In the inner-domain semi-supervised mode, the training data of each dataset were further split into two parts, images with labels and images without labels, to simulate the circumstance in which many images of one city exist but few of them are labeled due to the cost of annotation. We performed the experiments on the two datasets (Vaihingen and Potsdam) separately, with the ratio of labeled images set to 20%, 50%, and 80%. In this mode, all three sub-networks ($N_t$, $N_h$, and $N_d$) were trained as described in Section 3.2.3, and three loss functions ($\mathrm{loss}_t$, $\mathrm{loss}_h$, and $\mathrm{loss}_d$) were involved, with the fusing weight $\lambda$ in Equation (5) set to 0.1. We compared the results to those of the supervised mode introduced in the last section, which used only the labeled images for learning, as shown in Table 2. The results of the semi-supervised mode are superior to those of the supervised mode because the extra (unlabeled) data were used to constrain the style distribution maps, thus avoiding overfitting. The visualization results are shown in Figure 5.
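The data partition for this mode can be sketched as below: a random 6:4 train/test split followed by a split of the training part into labeled and unlabeled subsets. The function name, the fixed seed, and the choice to split at the level of individual items are assumptions made for illustration; the text does not say whether the split is done per region or per cropped tile.

```python
import random

def split_dataset(items, train_ratio=0.6, labeled_ratio=0.2, seed=0):
    """Return (labeled_train, unlabeled_train, test) as disjoint lists."""
    rng = random.Random(seed)
    shuffled = items[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_ratio)
    train, test = shuffled[:n_train], shuffled[n_train:]
    n_labeled = round(len(train) * labeled_ratio)
    return train[:n_labeled], train[n_labeled:], test

# 100 hypothetical image ids, with 20% of the training images labeled.
labeled, unlabeled, test = split_dataset(list(range(100)))
```

Calling the function with `labeled_ratio` set to 0.5 or 0.8 gives the other two configurations evaluated in this section.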

Inter-Domain Semi-Supervised Learning
The inter-domain semi-supervised learning mode was also designed for situations in which labeled images are lacking. In contrast to the inner-domain mode, this mode focuses on the problem of domain bias, in which a model trained on one city is difficult to apply to another city, a situation that is very common in practice. We used all the training data from one city together with a small percentage of labeled data (20%) from another city as the supervised signals for height regression, and the remaining unlabeled data were used for unsupervised learning of the style distribution. We used the same parameters as in the previous section to train and test the model and compared the results with the supervised learning method and fine-tuning [12], with and without synthetic data, as shown in Table 3. The inter-domain configuration achieves the best result, as the unlabeled data helped to constrain and unify the style distribution. Furthermore, the use of synthetic data enhanced the performance significantly. The visualized results are shown in Figure 6.

Ablation Study
We examined two hyperparameters: the number of channels of the style map, $n_t$, and the loss function for height regression, $\mathrm{loss}_h$. For $n_t$, we chose 1, 3, and 5, as shown in Table 4. We observed that overall performance improves as $n_t$ increases, since a thicker style map carries more features for height regression. However, the gain is small when $n_t$ increases from 3 to 5, as a three-channel style map can already describe the latent information for this task. For $\mathrm{loss}_h$, we compared the binary cross-entropy (BCE) loss used in this work with the root-mean-square-error (RMSE) loss. We found that the BCE loss outperforms the RMSE version, as BCE tends to produce sharper results, which are more suitable for building-like objects.

Conclusions
In this paper, we propose a novel framework, named StyHighNet, for semi-supervised learning of height estimation from a single aerial image. StyHighNet consists of three sub-networks with the same structure, used for style transferring, height regression, and style discrimination, respectively. These sub-networks are optimized in turn within two workflows: supervised height regression and unsupervised style transferring. We created a synthetic dataset and performed qualitative and quantitative analysis on two public datasets, Vaihingen and Potsdam. The experiments indicated that StyHighNet is superior in both supervised and semi-supervised learning modes. In the inter-domain semi-supervised learning mode in particular, StyHighNet effectively addresses the problem of domain bias when labels are lacking. Two hyperparameters, the number of channels in the style distribution map and the choice of loss function for height regression, were analyzed in the ablation study.