IM2ELEVATION: Building Height Estimation from Single-View Aerial Imagery

Estimation of the Digital Surface Model (DSM) and building heights from single-view aerial imagery is a challenging, inherently ill-posed problem that we address in this paper by resorting to machine learning. We propose an end-to-end trainable convolutional-deconvolutional deep neural network architecture that learns a mapping from a single aerial image to a DSM for analysis of urban scenes. We perform multisensor fusion of aerial optical and aerial light detection and ranging (Lidar) data to prepare the training data for our pipeline. Dataset quality is key to successful estimation performance: typically, a substantial amount of misregistration artifacts is present due to georeferencing/projection errors, sensor calibration inaccuracies, and scene changes between acquisitions. To overcome these issues, we propose a registration procedure that improves Lidar and optical data alignment by relying on Mutual Information, followed by a Hough transform-based validation step to adjust misregistered image patches. We validate our building height estimation model on a high-resolution dataset captured over central Dublin, Ireland: a Lidar point cloud from 2015 and optical aerial images from 2017. These data allow us to validate the proposed registration procedure and perform 3D model reconstruction from single-view aerial imagery. We also report state-of-the-art performance of our proposed architecture on several popular DSM estimation datasets.


Introduction
High-resolution orthorectified imagery acquired by aerial or satellite sensors is well known to be a rich source of information with high geolocation accuracy. These images are widely used in geographic information systems (GIS), for instance, for detection of man-made objects (e.g., buildings), urban monitoring, and planning. However, optical images do not contain height information, which limits the scope of analysis that can be performed from aerial optical capture alone. Point clouds, on the other hand, provide complementary 3D information. Stereo image pairs [1], structure from motion (SfM) [2], or Light Detection and Ranging (Lidar) laser-scanning technology are traditionally used to obtain point clouds. These methods provide 3D information with various levels of accuracy, which can then be converted to a Digital Surface Model (DSM) and in turn stored as grayscale imagery. The latter can be used directly to annotate height values on the aerial image. However, these methods require high computational resources and laborious fine-tuning. Moreover, using stereo image pairs or SfM requires image matching to estimate camera poses across views acquired at different times. The height information is extracted via triangulation from pairs of consecutive views, and therefore single-view imagery cannot be used by these techniques.
In this work, we focus on the scenario where building height information is to be extracted in a fully automated mode from a single airborne or satellite optical image, without relying on the availability of any further contemporary or historical imagery, point clouds, or GIS records. To this end, we propose a convolutional neural network (CNN) architecture, IM2ELEVATION, that takes a single optical image as input and produces an estimated DSM image as output, see Figure 1. Our architecture is inspired by [3,4] and features additional skip connections and postprocessing convolutional layers, which deliver the highest performance on the task at hand. To enable the training process, we use a set of point cloud data that serves as ground truth reference. In particular, we investigate the use of publicly available point cloud data of Dublin city center recorded in 2015 [5] for extracting the training building height information. This is used in combination with more recent aerial imagery collected by Ordnance Survey Ireland (OSI) in 2017. These data are of higher resolution than other alternative open source datasets, see, e.g., [6]. We also explore the co-registration required to perform inference on these multisensor data, which is needed because of the misalignment in space and time between the streams. Figure 2 presents the Dublin dataset used for this study and visualizes the height estimation process. Using a supervised learning approach for automatically converting an aerial image (input) into a DSM image (output), our first step (see Section 3) is to create a training dataset of aligned pairs (aerial-DSM). Using the resulting training set, we then train a CNN for inferring building heights (see Section 4). Our approach is validated against state-of-the-art benchmarks in Section 5 and discussed in Section 6, and we conclude the paper in Section 7.

Related Works
Using multiple sources of data for inference has a number of applications in mapping (see Section 2.1), but presents challenges for machine learning techniques, where the quality of the data used for training directly impacts the performance of the resulting models. Here, we focus on buildings and the corresponding roof shapes, which are important, for instance, for solar panels [7] and 5G deployment [8] (see Section 2.2). Capturing 3D data directly is in practice more expensive than capturing multispectral aerial imagery; research efforts have therefore been devoted to extracting depth information from single-view images (see Section 2.3), later extended to processing aerial images to infer heights (see Section 2.4). We also report a brief review of registration techniques applied to point clouds and aerial imagery (Section 2.5), as registration is inevitably the first step in dealing with incoming new data, see, e.g., Figure 2.

Fusion of Heterogeneous Data Streams
In recent years, various sources of multimedia input and GIS data have been used in numerous applications such as mapping and 3D city reconstruction. For instance, social media data and satellite imagery used together provide an opportunity to measure and monitor the impact of flooding [9]. Social media data has the advantage of providing up-to-date, geolocated information and sentiment [10]. Images posted on social media are, however, too sparse and noisy for applications such as 3D city reconstruction. Street-level imagery (e.g., Google Street View) has been used for 3D reconstruction [11], for object geolocation such as poles for asset management [12], and trees for monitoring biodiversity [13]. Street-level imagery, when used with building shape information extracted from OpenStreetMap (OSM), also provides opportunities to generate lightweight 3D models of cities usable with game engine technologies [10]. Extending the point of view from street level to aerial, Liu et al. [14] proposed to use OSM to segment and propagate labels to a point cloud from drone imagery collected over Trinity College Dublin campus in 2017 [15].

Mapping of Buildings and Rooftops
Detection and segmentation of buildings in satellite and aerial imagery are standard processes for populating and efficiently updating modern maps [16][17][18]. For instance, Lafarge et al. [17] presented an approach for building reconstruction from a single Digital Surface Model (DSM), with buildings treated as an assemblage of simple urban structures extracted from a library of 3D parametric blocks. As an alternative to DSM, Benedek et al. [16] introduced a probabilistic method that integrates building extraction with change detection in remotely sensed image pairs.
Beyond building footprint segmentation in 2D and 2.5D (DSM) aerial imagery, knowledge of roof geometry is also essential for instance in detecting and assessing the potential of rooftop solar photovoltaic (PV) installations on local electricity networks for sustainable development. Palmer et al. [19] proposed an approach for roof PV suitability based on slope and aspect using aircraft-based Lidar data, building footprint data, GIS tools, and aerial photography. Song et al. [20] proposed to extract the 2D rooftop outlines and 3D rooftop parameters retrieved from high-resolution remote sensing image data and DSM considering five rooftop categories (flat rooftops, shed rooftops, hipped rooftops, gable rooftops, and mansard rooftops).

Monocular Depth Estimation from Images
While stereo vision has been intensively researched to exploit triangulation cues from two or more images, it cannot be applied when only a single image is available. Instead, numerous monocular cues, such as texture variations, pixel gradients, occlusion, and color/haze, can be used to infer the underlying 3D structure. Single-view depth estimation methods can be classified into two major groups: random field-based approaches and deep convolutional neural network (DCNN) approaches. Saxena et al. [21,22] segmented an image into small patches (super-pixels) and applied Markov Random Fields to infer depth for each segmented patch; their model is trained on several features at different scales [21,22]. A similar approach by Wei et al. [23] introduced a hierarchical representation of scenes with a depth prediction model based on Conditional Random Fields (CRF). The latter encoded interactions within and across different layers, jointly exploiting local and global 3D information. More recently, DCNN-based models have been actively investigated [24][25][26][27][28][29]. Eigen et al. [30] proposed two DCNN pipelines extracting features at different scales: one regressing the global depth structure from a single image, and another refining the prediction with finer local features. Both outputs were concatenated and collectively refined in the final layer. This was later extended to handle multiple tasks such as semantic segmentation and surface normal estimation [24].
Laina et al. [28] and Xu et al. [26] demonstrated the use of CRFs with DCNN-based models. These hybrid methods capitalize on pairwise pixel interactions, whereas plain DCNNs consider only unary potentials during training. In a similar fashion, Long et al. [27] leveraged a fully connected CRF as part of a fully convolutional network to jointly maximize the posterior when predicting semantic labels. The encoder-decoder framework has become a common trend in DCNN architectures [3,29,[31][32][33]]. The encoder transforms the image into a latent space containing its high-level features; the decoder increases the spatial resolution of the feature maps generated by the encoder, enabling the network to regress the input to the desired output. Laina et al. [28] used ResNet-50 [32] as the backbone of their DCNN architecture and replaced the fully connected layer with up-projection blocks, which act as a reverse pooling operation, scaling the spatial resolution up to half the input size. Mal et al. [29] used up-projection blocks [28] in deconvolutional layers to learn a mapping from a sparse depth map to a dense depth map. Both [28,29] used similar encoder-decoder architectures to predict depth from a single image; however, their results underperform in preserving object shapes.
Hu et al. [4] combined the up-projection blocks with the concept of CNN skip connections [34]. In addition, the features from the encoder are fused into a multifeature fusion (MFF) block of matching size, which is then concatenated to the output of the decoder. Furthermore, they use gradients and normals as part of the loss function so that the designed network is more aware of edges and shapes during training.

Aerial Image Height Estimation
Unlike monocular depth estimation in computer vision, there has been relatively little work on estimating height from a single aerial image. Methods used in monocular depth prediction carry over to remote sensing, where the task amounts to estimating the distance between the sensor and the observed scene. Height estimation from a single view is an ill-posed problem: a single 2D image may correspond to an infinite number of possible 3D scenes. The ambiguity lies in the mapping from a 3-channel RGB image to a single-channel height map. Fully convolutional-deconvolutional network architectures are a useful tool capable of guiding the model through the process of learning this ambiguous mapping with considerable accuracy. Alidoost et al. [35] proposed to regress an RGB image to a DSM, together with the linear line structure of roofs in three different classes: eave, ridge, and hip lines. The lines were used as additional information in their building reconstruction process.
Architectures with skip connections are capable of retaining global edge features and have recently been used for building height estimation [36,37]. Similar architectures were used to jointly estimate height and semantic labels [38,39] in order to improve accuracy by sharing training information across these two complementary tasks. While skip connections have been widely used to directly concatenate features of the same spatial resolution from encoder to decoder, we establish the superior performance of multiscale feature fusion from encoder blocks (see Section 4).
Ghamisi et al. [40] exploited a conditional Generative Adversarial Network (cGAN) to generate a synthesized DSM from a single aerial image. Bittner et al. [41] used a cGAN to refine building roof surfaces by regressing a DSM from stereo aerial images to level of detail (LoD) 2 building shapes.

Point Cloud and Aerial Imagery Registration
Here, we briefly review relevant methods for the specific type of coregistration required in our work: between point clouds and optical aerial imagery.
In feature-based registration methods, different types of features, such as points, linear features, and patches, are extracted to establish correspondence between modalities. Habib et al. [42] extracted line and patch features. Kwak et al. [43] used centroid points of roofs. Peng et al. [44] used lines and intersections (corners). Zhang et al. [45] used objects from the point cloud to form a geometric constraint (e.g., a polygon), under the assumption that all points within the 2.5D constraint fall into the same corresponding object in the optical imagery; the constraint was then used to find the optimal registration solution. This method is similar to Liu et al. [14], who registered polygons from OSM to 3D point cloud data. A different approach from Chen et al. [46] employed deep learning to register polygons with 2D aerial images; the network learned the transformation that moves polygons to their correct positions. However, feature-based methods rely on physical correspondences, which are not always available, especially when using datasets acquired at substantially different time points.
Information theory has been used extensively for multimodal registration. Statistical similarity can be measured by Mutual Information (MI), which exploits the statistical dependencies between the data to be registered to derive similarity measurements [47]. Using MI, Mastin et al. [48] proposed to optimize the rendering from a point cloud by tuning extrinsic camera parameters. Parmehr et al. [49] registered aerial imagery onto Lidar point clouds by maximizing the combined mutual information from the grayscale-encoded height, return-pulse intensity, and optical imagery.

Data Preprocessing and Registration
We outline the employed data preprocessing for Lidar data in Section 3.1 and for aerial imagery in Section 3.2. The registration procedure we propose to align our two heterogeneous and asynchronous datasets (aerial images 2017 and Lidar point cloud 2015) is presented in Section 3.3, and then validated to improve robustness (see Section 3.4). The preprocessing allows us to create a dataset for training and testing our CNN model that converts aerial images (input) into DSM (output).

Preprocessing of Lidar Data
Our purpose here is to generate DSM ground truth from the available Lidar point cloud. These DSM images are then used as the reference for training and testing our CNN.
In our case study, we work on a 2015-captured Lidar dataset that covers the city center of Dublin, spanning approximately 5.6 km² including partially covered areas [5]. The flight altitude was approximately 300 m, and there are 41 flight path strips in total. The final Lidar point cloud was generated from the registration of points from these strips. As this point cloud contains more than 1.4 billion points, it is split into 0.25 km² tiles to be loaded and processed efficiently. The point cloud density inside the projected area varies from 238 to 348 points/m². There is approximately 3.5 km² of complete point cloud coverage, equivalent to 14 tiles. The average accuracy of this point cloud is 0.03 m with a maximum deviation of 0.3 m. Tiles are back-projected to DSM (the output of our neural network). The ground sample distance (GSD) of the generated DSM is set to 15 cm, the resolution of the aerial images employed. The DSM generated from the Lidar point cloud has holes caused by low point density in certain areas. To enhance DSM completeness, we fill these holes by interpolating from the closest available points. In addition, the DSM is saved in 16-bit image format, which preserves more detail in local gradients (see Figure 3).
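The rasterization and hole-filling steps above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name, the max-per-cell gridding scheme, and nearest-neighbour hole filling are our assumptions, keeping the highest return per 15 cm cell and filling empty cells from their nearest populated neighbours.

```python
import numpy as np
from scipy.interpolate import NearestNDInterpolator

def rasterize_dsm(points, gsd=0.15, grid_shape=None):
    """Rasterize an (N, 3) array of Lidar points (x, y, z) into a DSM grid.

    Each cell keeps the highest z value that falls into it (surface,
    not terrain); empty cells are filled from the nearest populated cell.
    The result could then be quantized and stored as a 16-bit image.
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    cols = ((xy[:, 0] - origin[0]) / gsd).astype(int)
    rows = ((xy[:, 1] - origin[1]) / gsd).astype(int)
    if grid_shape is None:
        grid_shape = (rows.max() + 1, cols.max() + 1)
    dsm = np.full(grid_shape, -np.inf)
    # keep the maximum elevation per cell
    np.maximum.at(dsm, (rows, cols), points[:, 2])
    # fill holes (cells that received no point) from nearest valid cells
    valid = np.isfinite(dsm)
    if not valid.all():
        rr, cc = np.nonzero(valid)
        interp = NearestNDInterpolator(np.column_stack([rr, cc]), dsm[valid])
        hr, hc = np.nonzero(~valid)
        dsm[hr, hc] = interp(np.column_stack([hr, hc]))
    return dsm
```

For example, three points placed 15 cm apart produce a 2 × 2 grid in which the fourth, empty cell is filled from a neighbouring cell.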
The return-pulse intensity is also obtained from the Lidar point cloud. It is merged with the DSM (Figure 4) as an additional source of information for validating our registration process. Our unit voxel volume is set to 15 cm × 15 cm × 15 cm to match the resolution of the aerial imagery. The density of points in the Lidar point cloud is observed to be higher on vertical walls, which translate to edges in aerial imagery and DSMs.

Preprocessing of Aerial Orthorectified Imagery
The aerial imagery we use in our study was captured by OSI in 2017 and covers the entire extent of the Lidar 2015 dataset at 15 cm geometric resolution. Note that the Lidar 2015 dataset [5] also provides aerial imagery; however, it only partially covers Dublin city (see Figure 2). We therefore deal with two distinct datasets, and there are two major challenges: (1) the scenes changed between the two captures (in 2015 and in 2017), and (2) the two data streams are not properly aligned with each other even though both are georeferenced. Small deviations can be observed that originate from different projection methods and sensor calibration errors.
We first remove the areas that change substantially between the acquisition dates of the two datasets; this mainly concerns dynamic objects, e.g., cars and buses. Only areas with DSM values above 2.5 m from the ground are kept to mitigate such discrepancies. We then convert the RGB optical images to grayscale so that they can be registered with the DSM via Mutual Information (explained in Section 3.3). Return-pulse intensity and DSM images are merged (see Figure 4) and then used with the optical image to validate the quality of our MI registration (explained in Section 3.4). This approach is chosen because the DSM has an intensity distribution more similar to optical imagery than the return-pulse intensity, which leads to better MI registration performance. Moreover, fusing the DSM and the return-pulse intensity retains more distinctive features along edges, which boosts the Hough line extractor's performance.

Registration with Mutual Information
Mutual Information is employed to carry out registration between the aerial imagery and the DSM. MI [49] measures the statistical similarity between datasets, based on the intensity distributions within the images:

$$MI(I_P, I_D) = H(I_P) + H(I_D) - H(I_P, I_D),$$

where $H(I) = -\sum_{i=1}^{N} p(i)\log p(i)$ is the marginal (Shannon) entropy of image $I$ [50], $N$ is the number of bins used for the histograms of the images (estimated distribution $p$), and $H(I_P, I_D) = -\sum_{i=1}^{N}\sum_{j=1}^{N} p(i,j)\log p(i,j)$ is the joint entropy. The aerial image $I_P$ is chosen as the reference onto which the DSM image $I_D$ is aligned. The statistical dependence between $I_P$ and $I_D$ is computed using probability density functions (pdf) estimated via histogramming, a non-parametric approach. A one-plus-one evolutionary algorithm [51] is used as the optimizer; it perturbs the transformation parameters in different directions to find the maximum MI value. In the case of misregistration, the joint pdf $p(i,j)$ is spread out, which increases the value of the joint entropy; after appropriate registration, the images are expected to yield the smallest joint entropy. The parametric registration transformation between the images is restricted to a translation. Disabling rotation and scaling operations is a necessary trade-off to enable efficient and robust registration, as otherwise the process is strongly affected by smaller objects other than buildings. The translation $T$ is estimated as

$$\hat{T} = \arg\max_{T} \; MI\big(I_P, T(I_D)\big).$$

We ran 1000 iterations to obtain the optimal solution $\hat{T}$, initializing $T$ from the GPS location available with the images. Note that the initial level of registration between the available Lidar and optical imagery, both of which are georeferenced, allows only approximate matching, and we resort to this iterative procedure to improve the quality of local matching. We denote the aligned DSM image as $\hat{I}_D$. The registration process can fail, as demonstrated in Figure 5, if a scene has undergone substantial changes between the acquisitions.
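The MI objective can be sketched as follows. This is an illustrative implementation assuming a joint-histogram pdf estimate; for clarity, the search over translations is brute force rather than the one-plus-one evolutionary optimizer used in our pipeline, and all function names are hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """MI between two equally sized grayscale images via a joint histogram."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = hist / hist.sum()                       # joint pdf p(i, j)
    px = p.sum(axis=1, keepdims=True)           # marginal of a
    py = p.sum(axis=0, keepdims=True)           # marginal of b
    nz = p > 0                                  # avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def register_translation(ref, mov, max_shift=8):
    """Exhaustive search for the integer translation maximizing MI.

    Circular shifts (np.roll) keep the example short; a real pipeline
    would crop to the overlapping region instead.
    """
    best, best_t = -np.inf, (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(mov, dy, axis=0), dx, axis=1)
            mi = mutual_information(ref, shifted)
            if mi > best:
                best, best_t = mi, (dy, dx)
    return best_t, best
```

Shifting a random image by a known offset and registering it back recovers the inverse translation, since MI is maximal when the images coincide.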
To detect such situations, the Hough line extractor is applied to the aerial images and the corresponding fusion of DSM and intensities (Section 3.4). We compute the distances between the Hough lines detected in the two modalities and validate matching pairs. Invalid examples are adjusted using the mean translation vector of their neighbors. This criterion ensures the quality of the registration samples. The aim is to create a good training dataset for our network: the quality of the training data is crucial for the high performance of the resulting neural network.

Figure 5. MI registration between return-pulse intensities (green) and optical image (purple) may fail when the scene content is significantly different between the two datasets (panels: initial misalignment; successful registration; failed registration).

Patch Adjustment via Hough Lines Validation
To enhance the quality of the pairs $\{(I_P^k, \hat{I}_D^k)\}$ used for training our deep neural network, we exploit the Hough space to retain the good pairs of images registered with MI. Given two images $I_P$ (optical imagery) and $I_Q$ (fusion of DSM and return-pulse intensity), two sets of Hough lines are extracted, each line parameterized by its polar coordinates $(\rho, \theta)$. The distance between parallel lines ($\theta_Q = \theta_P$) from $I_P$ and $I_Q$ is computed as $d = |\rho_P - \rho_Q|$. Denoting $e$ and $\hat{e}$ the line distances before and after registration, respectively, we validate a pair of patches having $n$ corresponding lines when

$$\sum_{i=1}^{n} \hat{e}_i < \sum_{i=1}^{n} e_i.$$

Parallel lines are grouped by searching for the closest matching lines within a range $e < \delta$ before registration, to reduce the impact of failed registration. To correct the detected cases of failed patch registration, the missing translation vectors are interpolated from the adjacent correctly registered neighbor patches (referred to as adjustment in the following).
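The validation and adjustment logic above can be sketched as follows, under the assumption that each patch contributes its summed line distances before ($e$) and after ($\hat{e}$) registration, and that patches lie on a regular grid; the 4-neighbour mean used here is our simplification of the neighbours' mean-vector adjustment.

```python
import numpy as np

def validate_patches(e_before, e_after):
    """A patch passes validation when its matched Hough lines
    moved closer overall after registration."""
    return np.asarray(e_after) < np.asarray(e_before)

def adjust_translations(T, valid):
    """Replace the translation of each failed patch by the mean translation
    of its valid 4-neighbours on the patch grid (hypothetical layout).

    T: (rows, cols, 2) grid of per-patch (dy, dx) translations.
    valid: (rows, cols) boolean mask from validate_patches.
    """
    T = np.array(T, dtype=float)
    out = T.copy()
    rows, cols = valid.shape
    for r in range(rows):
        for c in range(cols):
            if valid[r, c]:
                continue
            neigh = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            vs = [T[i, j] for i, j in neigh
                  if 0 <= i < rows and 0 <= j < cols and valid[i, j]]
            if vs:                       # leave untouched if no valid neighbour
                out[r, c] = np.mean(vs, axis=0)
    return out
```

On a 2 × 2 patch grid with one failed patch, the failed patch inherits the average of its two valid neighbours' translations.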

CNN Network Design
We propose a deep learning architecture based on that presented in Hu et al. [4]. Specifically, we use the fully convolutional-deconvolutional network presented in Figure 6, which regresses an RGB (3-channel) image to a DSM (1-channel) image. A Squeeze-and-Excitation network [3] is used as the encoder to extract features, generating five encoding blocks; the decoding and feature fusion stages are illustrated in Figure 7. Finally, the last deconvolution layer D3 is concatenated with F1, forming R0, which is fed into convolution layers R1-R4 in order to refine the final prediction map. Each convolutional layer except R4 features batch normalization and a rectified linear unit (ReLU) activation function. The details of the input/output channels and the feature sizes are reported in Table 1. Further details on the standard architectural elements can be found in [4,28].
Our architecture employs a skip connection that concatenates features directly from the encoder block with those from the multiscale features block. We observe a significant empirical improvement in our results due to this strategy, see Section 5.2. The network loss function guides the training process toward the best fitting function modeling the training data. We construct a loss consisting of an $L_1$-norm term (Equation (7)), a surface normal term (Equation (9)), and a spatial gradient term (Equation (8)):

$$L = l_{depth} + \lambda\, l_{grad} + \mu\, l_{normal}, \qquad (6)$$

where $\lambda, \mu \in \mathbb{R}^+$ are weight coefficients. The depth term is

$$l_{depth} = \frac{1}{n}\sum_{i=1}^{n} \ln\big(e_i + \alpha\big), \qquad (7)$$

where $e_i$ is the $L_1$ norm between the estimated height $\hat{h}_i$ and the ground-truth height $h_i$, and $\alpha > 0$ is a slack parameter ensuring $\ln(e_i + \alpha) \in \mathbb{R}$ has a lower bound (in practice we always set $\alpha = 0.5$). In order to make the network more aware of the edge structure, we introduce

$$l_{grad} = \frac{1}{n}\sum_{i=1}^{n} \Big( \ln\big(|\nabla_x(e_i)| + \alpha\big) + \ln\big(|\nabla_y(e_i)| + \alpha\big) \Big), \qquad (8)$$

where $(x, y)$ are the spatial coordinates in the residual image $e$. Finally, $l_{normal}$ is a cost on the normals to the surface of the estimated DSM with respect to its ground truth:

$$l_{normal} = \frac{1}{n}\sum_{i=1}^{n} \left( 1 - \frac{\langle n_i^{est}, n_i^{gt} \rangle}{\sqrt{\langle n_i^{est}, n_i^{est} \rangle}\,\sqrt{\langle n_i^{gt}, n_i^{gt} \rangle}} \right), \qquad (9)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors, which encodes the orientation of the surface. This cost penalizes subtle changes of gradient that may not be captured by Equation (8) and guides the network to learn more geometric features.
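Equations (6)-(9) can be sketched in NumPy as follows; the actual training code operates on PyTorch tensors, and the forward differences for the gradients and the normal construction from $(-\partial h/\partial x, -\partial h/\partial y, 1)$ are standard choices we assume here rather than details taken from the released implementation.

```python
import numpy as np

def im2elevation_loss(pred, gt, alpha=0.5, lam=1.0, mu=1.0):
    """Sketch of L = l_depth + lam * l_grad + mu * l_normal.

    pred, gt: (H, W) height maps.
    """
    e = np.abs(pred - gt)                              # per-pixel L1 residual
    l_depth = np.log(e + alpha).mean()                 # Equation (7)

    gx = np.diff(e, axis=1)                            # d e / dx (forward diff)
    gy = np.diff(e, axis=0)                            # d e / dy
    l_grad = (np.log(np.abs(gx) + alpha).mean()        # Equation (8)
              + np.log(np.abs(gy) + alpha).mean())

    def normals(h):
        # surface normal (-dh/dx, -dh/dy, 1), cropped to a common shape
        dx = np.diff(h, axis=1)[:-1, :]
        dy = np.diff(h, axis=0)[:, :-1]
        n = np.stack([-dx, -dy, np.ones_like(dx)], axis=-1)
        return n / np.linalg.norm(n, axis=-1, keepdims=True)

    n_est, n_gt = normals(pred), normals(gt)
    l_normal = (1.0 - (n_est * n_gt).sum(axis=-1)).mean()  # Equation (9)
    return l_depth + lam * l_grad + mu * l_normal
```

With `pred == gt` the residual vanishes and the loss reduces to $3\ln(0.5)$ for $\alpha = 0.5$, $\lambda = \mu = 1$; any constant offset raises only the depth term.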

Results
Our dataset consists of 14 tiles, each split into 169 patches. Each patch is 500 × 500 pixels in size with a 250-pixel overlap to increase training data volume. Of the 2366 patches, 1999 are used for training and 367 for testing, and for comparisons between several preprocessing pipeline scenarios: no registration (base), registration (MI), and registration with invalid-patch adjustment (adjustment). In addition, we augment the training set by randomly flipping and jittering the training data. The Adam solver is used with weight decay 0.001 and learning rate 0.0001; the learning rate is dropped by 10 percent every 5 epochs. The CNN is implemented with the PyTorch package (the implementation will be made available upon publication at https://github.com/speed8928/IMELE). Figure 8 demonstrates the evolution of the training/test loss throughout the training process: both the MI and patch-adjusted datasets show a steady loss decrease, whereas the base dataset (without preprocessing) exhibits large fluctuations (unstable training loss). Similar behavior is observed for the training/test Mean Absolute Error (MAE). We notice that beyond epoch 60 the test performance plateaus. Figure 8. The training and testing loss on original data (base), MI-registered, and after Hough line patch adjustment. The registered data demonstrate more stable convergence than non-registered data.
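For reference, the step schedule above can be written as a small helper; we read "dropped by 10 percent every 5 epochs" as multiplication by 0.9, which is an assumption, since a different decay factor is also a possible reading of the text.

```python
def learning_rate(epoch, base_lr=1e-4, drop=0.9, every=5):
    """Step decay: the rate is multiplied by `drop` after every `every`
    epochs (drop=0.9 is our reading of 'dropped by 10 percent')."""
    return base_lr * drop ** (epoch // every)
```

In PyTorch, the equivalent behavior is typically obtained with a step scheduler attached to the optimizer rather than a hand-rolled function.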
In Section 5.1 we assess the registration with Mutual Information. Section 5.2 presents our CNN improved performance when using registered and non-registered training sets.

MI Registration and Hough Line Validation
To evaluate the accuracy of registration, the building vector layers provided by OSI are used as ground truth after being transformed into binary maps. This additional data source is used solely to validate the performance of the registration pipeline and is not part of CNN training. To evaluate the performance of MI and the effect of validation with the Hough transform, 11 tiles were selected. For each tile, 169 patches were generated, giving a total number of testing samples $K = 11 \times 169 = 1859$. Figure 9 illustrates the data used in our experiments. For each test sample $k \in \{1, \cdots, K\}$ (see Figure 9a), the total number of valid pixels associated with the ground-truth building label is denoted $n_g^k$. The DSM is converted into a binary building map by setting pixels with values less than 2.5 m above ground to 0, and all other pixels to 1. The intersection defined by Equation (10) gives the average proportion of pixels from the DSM building binary map (denoted $M_h^k$, see Figure 9e) that coincide with the ground-truth building binary map (denoted $M_g^k$, see Figure 9c), normalized by $n_g^k$ (the number of valid pixels in $M_g^k$):

$$\text{Intersection} = \frac{1}{K} \sum_{k=1}^{K} \frac{|M_h^k \cap M_g^k|}{n_g^k}. \qquad (10)$$

Table 2 highlights the quantitative improvements associated with the use of the MI and Hough validation preprocessing steps. Misregistered patches identified by the Hough transform are adjusted by a local translation, which allows a more meaningful computation of the intersection for this evaluation (Method III). The MI value after patch adjustment declines slightly, whereas the Intersection value increases. Further validation of the preprocessing is reported in the next section in conjunction with the proposed CNN. Table 2. MI and Intersection obtained by applying registration (higher values are better). $n_g$ = 1.34 bn; number of patches = 1859. Note that we employ only a subset of the training data due to the partial coverage of the building binary map. Best results in bold.
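A per-patch sketch of the binarization and intersection score in Equation (10); function names are ours, and the published evaluation may differ in details such as the handling of invalid pixels.

```python
import numpy as np

def dsm_to_building_map(dsm, ground, threshold=2.5):
    """Binarize a DSM: 1 where the surface is more than `threshold`
    metres above ground, 0 elsewhere."""
    return ((dsm - ground) > threshold).astype(np.uint8)

def building_intersection(M_h, M_g):
    """Per-patch score: fraction of ground-truth building pixels (M_g == 1)
    that are also marked as building in the DSM-derived map M_h.
    Equation (10) averages this quantity over all K patches."""
    n_g = int(M_g.sum())
    if n_g == 0:
        return float("nan")          # no ground-truth buildings in this patch
    return float((M_h.astype(bool) & M_g.astype(bool)).sum() / n_g)
```

For instance, a 2 × 2 patch with three ground-truth building pixels of which two are recovered scores 2/3.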

Height Inference with CNN
We now report the performance of the height-inference CNN on the preprocessed Dublin dataset. The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are presented in Table 3. The comparison with the baseline architecture of Hu et al. [4] demonstrates that the proposed CNN improvements are significant for the height estimation problem. We note that using registered data for training improves only the RMSE and deteriorates the MAE compared to non-registered training data; however, improvement in both is achieved with registered data after patch adjustment. Table 3. DSM estimation accuracy measured by MAE and RMSE (lower values are better): Hu et al. [4] vs. IM2ELEVATION (proposed) with different types of registration between Lidar and aerial data. The loss employs λ = 1, µ = 1 for all methods. Best results in bold. We also evaluate our pipeline on popular benchmark datasets [54,55], see Table 4. After retraining on these data, our network outperforms other state-of-the-art methods [35,37,39,40] on the DFC2018 and Potsdam datasets. We do not apply any of the preprocessing developed throughout Section 3 to these data, to ensure comparability with benchmark methods. We employ approximately 25% randomly selected tiles for testing and use the rest to train our pipeline for each of the 3 considered benchmark datasets. Note that in our work we employ solely the DSM information and make no use of semantic labels, which are an additional and expensive (effort-wise) source of complementary information. Nevertheless, our method still outperforms some of the benchmark methods that perform multitask learning. In particular, IM2ELEVATION outperforms the other methods in MAE on the DFC2018 dataset. Among the single-task learning models, our model reaches the highest RMSE accuracy on the Potsdam dataset. We also observe that the accuracy obtained on the Dublin dataset is in general at a comparable level, which suggests that this dataset is reasonably balanced and our results can be generalized.
We demonstrate the obtained DSM estimation results in a full tile in Figure 10. It can be readily observed that the building contours present high variance in general. This can partially be attributed to remaining artifacts after registration and orthorectification processes. Buildings substantially taller than average in the scene appear to be more challenging for monocular height estimation as well.

Figure 10. DSM of a 500 × 500 m tile in the Dublin inner city center, with northing from 316,000 to 316,500 and easting from 234,000 to 234,500, in TM65 projection (panels: aerial imagery; DSM thresholded 2.5 m above ground (ground truth); prediction of DSM, MAE = 2.23).

Height Inference from Stereoscopic Imagery
Height inference on stereo images is reported in this section for comparison with our height inference performed on monocular airborne imagery. In remote sensing, structure from motion (SfM) is often adopted to infer the elevation of the landscape [56]. A point cloud can be generated by rectifying a selected pair of images and computing the disparity map, which is similar to a DSM but of unknown scale. The composite point cloud generated from the historical stereo images is fused with a 2D map to normalize the scale. Imagery of the same 15 cm/pixel GSD is used; the density of the generated point cloud is 40 points/m². The point cloud is back-projected to 2D imagery as a DSM with a GSD of 15 cm/pixel. The DSM from stereo images is evaluated on the testing dataset, and the outcome is reported in Table 5.

Discussion
We present a panel of detailed examples of DSM reconstruction obtained with our pipeline and from stereo imagery in Figure 11. The heat maps highlight the difference between our estimated DSM and the corresponding ground truth DSM. We observe that our estimates exhibit less salt-and-pepper-type noise and succeed in extracting the outlines of the buildings on a par with the stereoscopic imagery input.
To investigate the potential of height reconstruction for visualization purposes, we now report the results of mesh reconstruction in Figure 12. Specifically, the estimated DSM is back-projected to a 3D point cloud using the pixel georeferencing information and elevation values, and reconstructed into a 3D mesh by 2.5D Delaunay triangulation [57]. The points are triangulated in 2D space using their row and column geolocation coordinates, and the edges between vertices are inherited from the pixel grid. Finally, the input aerial images are overlaid with the resulting building mesh. The Eye-Dome Lighting shader [58] is applied to facilitate depth visualization.
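The 2.5D triangulation step can be sketched with SciPy's Delaunay routine: triangles are computed in 2D image space only, and the elevation is then draped onto the vertices. This is an illustrative sketch under our own naming, not the authors' implementation of [57]:

```python
import numpy as np
from scipy.spatial import Delaunay

def dsm_to_mesh(dsm, gsd=0.15):
    """Build a 2.5D mesh from a DSM raster.

    Vertices: one per valid pixel, placed at (col*gsd, row*gsd, height).
    Faces: Delaunay triangles computed on the 2D pixel coordinates only,
    so the surface is single-valued in elevation (2.5D, not full 3D).
    """
    rows, cols = np.nonzero(np.isfinite(dsm))
    verts2d = np.column_stack([cols, rows]).astype(float)
    tri = Delaunay(verts2d)  # triangulate in 2D image space
    verts3d = np.column_stack([cols * gsd, rows * gsd, dsm[rows, cols]])
    return verts3d, tri.simplices
```

The returned vertex and face arrays can be exported to any standard mesh format for overlay with the aerial imagery.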
In Figure 13, we attempt to perform DSM estimation on a different image dataset. To this end, we collect Google Maps satellite images covering the same geographic extent as the crops in Figure 11 using the Google API. We then perform inference on these without any additional retraining or fine-tuning of our CNN pipeline. We observe that the estimation results are substantially worse, owing to the different properties of the Google images, which result from a completely different (and not explicitly disclosed) preprocessing applied to them. This highlights the necessity of pipeline fine-tuning when the image source changes. As demonstrated by the results reported in Section 5.2 for the Potsdam and DFC2018 experiments, once retrained, the IM2ELEVATION pipeline is capable of producing state-of-the-art results.
In Figure 14 we use ArcGIS functionality to postprocess the predicted DSM, applying building footprint constraints (we use OSI-provided footprint data) via the Local Government 3D Basemaps package (https://solutions.arcgis.com/local-government/help/local-government-scenes/get-started/). Simple roof shapes can be reconstructed from the 2.5D data in CityGML [59] style, reaching level of detail 2, i.e., LoD2. It is worth mentioning that the work of Alidoost et al. [35] reached a similar level of precision, but they used additional data during training.
The training data used in this study covers solely urban areas of high density and lacks any examples of lower-density or suburban scenes. Poor performance of our trained pipeline is therefore observed where buildings are sparsely distributed in the scene, see Figure 15. This performance drop is due to the absence or weakness of supporting cues for height inference, such as neighboring buildings and shadows. The much stronger presence of tall vegetation would also require training the pipeline on a more relevant suburban dataset.

Conclusions
We have proposed a pipeline for inferring heights from aerial images using machine learning trained with two asynchronous heterogeneous datasets: aerial imagery and a Lidar point cloud, where the latter is employed only during the training phase. We mitigated the impact of temporal and spatial mismatches in our custom dataset by removing areas that had changed between the acquisition dates (e.g., new buildings) and by proposing registration tools for the two data streams to correct small translation artifacts. Our pipeline manages to perform building height estimation on average within 1.5 m of the ground truth (which in this study is a Lidar point cloud with an average height accuracy of 0.03 m), outperforming the estimation results obtained from stereo imagery. We demonstrated how the estimated DSM can be used for inferring 3D building shapes and 2D building footprints.
Future directions of research to enable a higher level of accuracy in recovering the heights and shapes of roofs include augmenting our training dataset with computer-graphics-simulated data and adding other input data streams such as street-view imagery. The IM2ELEVATION pipeline for DSM inference provides an alternative to stereo reconstruction when multiple aerial views are not available. Nevertheless, integrating elements of this pipeline may provide highly relevant information for stereo reconstruction when such data are available. In the future we envisage merging multiple DSMs (i.e., stereo-generated and AI-generated) to reduce noise and improve accuracy, in addition to considering shape priors inherent in man-made infrastructure (e.g., planar surfaces).
Author Contributions: C.-J.L. developed the software and ran the experiments; P.K. and G.K. prepared and provided the OSI data used in the experiments; C.-J.L., V.A.K., and R.D. conceived the experiments and wrote the paper. All authors have read and agreed to the published version of the manuscript.