IMG2nDSM: Height Estimation from Single Airborne RGB Images with Deep Learning

: Estimating the height of buildings and vegetation in single aerial images is a challenging problem. A task-focused Deep Learning (DL) model that combines architectural features from successful DL models (U-NET and Residual Networks) and learns the mapping from a single aerial imagery to a normalized Digital Surface Model (nDSM) was proposed. The model was trained on aerial images whose corresponding DSM and Digital Terrain Models (DTM) were available and was then used to infer the nDSM of images with no elevation information. The model was evaluated with a dataset covering a large area of Manchester, UK, as well as the 2018 IEEE GRSS Data Fusion Contest LiDAR dataset. The results suggest that the proposed DL architecture is suitable for the task and surpasses other state-of-the-art DL approaches by a large margin.


Introduction
Aerial images are widely used in geographic information systems (GIS) for a plethora of interesting tasks, such as: urban monitoring and planning [1][2][3], agricultural development [4], landscape change detection [5][6][7] and disaster mitigation planning and recovery [8], as well as aviation [9,10]. However, these images are predominantly twodimensional (2D) and constitute a poor source of three-dimensional (3D) information, hindering the adequate understanding of vertical geometric shapes and relations within a scene. Ancillary 3D information improves the performance of many GIS tasks and facilitates the development of tasks that require a geometric analysis of the scene, such as digital twins for smart cities [11] and forest mapping [12]. In such cases, the most popular type of this complementary 3D information is the form of a Digital Surface Model (DSM). The DSM is often obtained with a Light Detection and Ranging Laser Scanner (LiDAR) or an Interferometric Synthetic-Aperture Radar (InSAR), a Structure-from-Motion (SfM) methodology [13], or by using stereo image pairs [14]. Structure from motion is a technique for estimating 3D structures from 2D image sequences. The main disadvantages of SfM include the possible deformation of the modeled topography, its over-smoothing effect, the necessity for optimal conditions during data acquisition and the requirement of a ground control point [15]. Like SfM, DSM estimation by stereo image pairs requires difficult and sophisticated acquisition techniques, precise image pairing and relies on triangulations from pairs of consecutive views. LiDAR sensors can provide accurate height estimations and have recently become affordable [16]. However, LiDAR sensors suffer from poor performance when complex reflective and refractive bodies (such as water) are present and can return irrational values in such cases, especially where there are multiple light paths from reflections in a complex scene. Despite these specific performance issues, LiDAR remains a commonly used technology for the acquisition of DSMs. Furthermore, multitask training has proven to be beneficial, especially when the height estimation is combined with image segmentation. Srivastava et al. [39] proposed a joint height estimation and semantic labeling of monocular aerial images with convolutional neural networks (CNNs). Carvalho et al. [40] used multitask learning for both the height prediction and the semantics of aerial images. A different approach to the height estimation problem used a generative model that produced the height map of an aerial image when given the image as the input. This strategy employs the conditional setting of the GAN and performs image-to-image translation, i.e., the model translates an aerial image to a DSM. Ghamisi and Yokoya [41] used this exact approach for their IMG2DSM model. Similarly, Panagiotou et al. [42] estimated the Digital Elevation Models (DEMs) of aerial images.

Materials and Methods
This section discusses the datasets used to train and evaluate the model, the technical aspects of the methods and the techniques used in the Deep Learning model. The model's architecture is also presented, along with the task-specific design features that make it appropriate for height predictions. For further information on neural networks and Deep Learning, please refer to references [43,44].

Datasets and Data Pre-Processing
Two relatively large datasets were used to develop and evaluate the proposed height prediction model, namely: a Manchester area dataset, compiled by the authors, and an IEEE GRSS data fusion contest dataset. The focus of the Manchester area dataset is on estimating the height of buildings, while the focus of the IEEE GRSS data fusion contest dataset is on estimating the height of all objects in the images.

Manchester Area Dataset
The first dataset used to train the model comprised images and LiDAR DEMs (DSMs and DTMs), all from the Trafford area of Manchester, UK. The aerial photography was from Digimap [45] (operated by EDINA [46]). Both the LiDAR DEMs and the RGB images were geospatially aligned according to the Ordnance Survey National Grid reference system, a system using a transverse Mercator projection with a straight-line grid system built over an Airy 1830 ellipsoid. The reference grid system choice was essentially arbitrary, as the model used input image sections small enough for alignment deviations to be insignificant. Furthermore, as the model was trained to build nDSMs from the input images, model nDSM outputs were aligned to the image pixel locations. The LiDAR data belongs to the UK Environment Agency [47]. It covered approximately 130 km 2 , comprising roughly 8000 buildings. The RGB images had a resolution of 0.25 m by 0.25 m, and the LiDAR resolution was 1 m by 1 m. The RGB images and the LiDAR maps were acquired at different dates; hence, there were data inconsistencies resulting from new constructions or demolished buildings. Such inconsistencies constituted a barrier to the training of a DL model, yet were representative of real-world problems, especially given that many wide-area LiDAR datasets are compiled as composite images from LiDAR flights on multiple dates. Due to the low LiDAR resolution, this dataset was not appropriate for estimating the height of vegetation; thus, the analysis focused only on buildings. Since segmentation labels for the differentiation of what was vegetation and what was not were not available, a threshold height value of 1.5 m was used to distinguish buildings. This approach occluded low vegetation and cars, which was desirable, since the cars were mobile objects and, thus, the source of additional inconsistency. Furthermore, vegetation was a highly variable entity that was easily removed from the environment and was not necessary for many applications. The model was trained with RGB aerial images as the input and the normalized DSMs (nDSM = DSM-DTM) as the target. nDSMs ignore the altitude information of the terrain and concentrate on the heights of objects. Figure 1 shows examples of different areas from the Manchester dataset and their corresponding Remote Sens. 2021, 13, 2417 4 of 21 ground truth nDSMs. Figure 2 shows the DTM and DSM of Figure 1 (bottom image) and demonstrates some flaws in their specific dataset. normalized DSMs (nDSM = DSM-DTM) as the target. nDSMs ignore the altitude information of the terrain and concentrate on the heights of objects. Figure 1 shows examples of different areas from the Manchester dataset and their corresponding ground truth nDSMs. Figure 2 shows the DTM and DSM of Figure 1 (bottom image) and demonstrates some flaws in their specific dataset.

IEEE GRSS Data Fusion Contest Dataset
The Data Fusion 2018 Contest Dataset (DFC2018) [48,49] is part of a set of community data provided by the IEEE Geoscience and Remote Sensing Society (GRSS). The Multispectral LiDAR Classification Challenge data was used herein. The RGB images in the dataset had a 0.05 m by 0.05 m resolution, and the LiDAR resolution was 0.5 m by 0.5 m. The data belonged to a 4.172 × 1.20 (2 km 2 ) area. Given the higher resolution of the RGB images, this dataset was more suitable for estimating the vegetation height than the Manchester area dataset. Figure 3 shows example pairs of RGB images and their corresponding nDSMs from this dataset. The ratio between the resolution of the RGB images and the resolution of their corresponding LiDAR scans affected the design of the depth-predicting model. Like in the Manchester area dataset, the model must handle the resolution difference between its input and its output and predict a nDSM that is several times smaller than the RGB image. Since the two datasets have different resolutions between the RGB images and the LiDAR scans, the same model cannot be used for both cases. Consequently, the models differ in their input/output sizes and the resolution reduction they must apply. For the Remote Sens. 2021, 13, 2417 5 of 21 most part, the models used for the two datasets are very similar, but slight architectural modifications are applied to cope with the resolution differences. Notably, some of these unexpected values in the DSM map (left) correspond to a river, which illustrates a well-known problem of LiDAR measurements near highly reflective and refractive surfaces with multiple light paths. Such erroneous values raise significant problems regarding the training of the model. Thus, they are detected during data preprocessing and excluded from the training data (see Section 2.4). They are also excluded from the validation and test data to avoid inaccurate performance evaluations. Overall, these values roughly comprise 10% of the dataset but lead to a larger amount of discarded data, since any candidate patch containing even a pixel of undetermined or irrational value is excluded from the training pipeline. This figure is best seen in color.

IEEE GRSS Data Fusion Contest Dataset
The Data Fusion 2018 Contest Dataset (DFC2018) [48,49] is part of a set of community data provided by the IEEE Geoscience and Remote Sensing Society (GRSS). The Multispectral LiDAR Classification Challenge data was used herein. The RGB images in the dataset had a 0.05 m by 0.05 m resolution, and the LiDAR resolution was 0.5 m by 0.5 m. The data belonged to a 4.172 × 1.20 (2 km 2 ) area. Given the higher resolution of the RGB images, this dataset was more suitable for estimating the vegetation height than the Manchester area dataset. Figure 3 shows example pairs of RGB images and their corresponding nDSMs from this dataset. The ratio between the resolution of the RGB images and the resolution of their corresponding LiDAR scans affected the design of the depth-predicting model. Like in the Manchester area dataset, the model must handle the resolution difference between its input and its output and predict a nDSM that is several times smaller than the RGB image. Since the two datasets have different resolutions between the RGB images and the LiDAR scans, the same model cannot be used for both cases. Consequently, the models differ in their input/output sizes and the resolution reduction they must apply. For the most part, the models used for the two datasets are very similar, but slight architectural modifications are applied to cope with the resolution differences.  Figure 1. The color bar for each heat map indicates the color-coding of the DEMs in meters above sea level. Both heat maps have several undetermined or irrational (extremely high or low) values shown in black color. Notably, some of these unexpected values in the DSM map (left) correspond to a river, which illustrates a well-known problem of LiDAR measurements near highly reflective and refractive surfaces with multiple light paths. Such erroneous values raise significant problems regarding the training of the model. Thus, they are detected during data preprocessing and excluded from the training data (see Section 2.4). They are also excluded from the validation and test data to avoid inaccurate performance evaluations. Overall, these values roughly comprise 10% of the dataset but lead to a larger amount of discarded data, since any candidate patch containing even a pixel of undetermined or irrational value is excluded from the training pipeline. This figure is best seen in color.

Data Preparation
The model operating on the Manchester area dataset used image patches of sizes 256 × 256 × 3, while the model operating on the DFC2018 dataset used image patches of sizes 520 × 520 × 3. The specific input sizes determined that the former model output a map of size 64 × 64 and the latter had an output of 52 × 52, since the resolution ratios of the images to LiDAR datasets were 4 and 10, respectively: every 4 pixels in one aerial image of the Manchester dataset correspond to 1 pixel in the respective nDSM, and 10 pixels in one aerial image of the DFC2018 dataset corresponded to 1 pixel in the respective nDSM. The 64 × 64 and the 52 × 52 sizes of the model outputs offered a compromise between the computational and memory requirements during training and sufficient scenery area consideration when calculating a nDSM, i.e., basing the estimation on several neighboring structures in the input image for achieving better accuracy. Various output sizes were experimented with, and it was discovered that predicting larger nDSMs tends to achieve a slightly better accuracy at the cost of increased memory usage and computational requirements.

Model Description
In this section, the technical aspects of the methods and techniques used in the proposed DL model are discussed. The model's architecture is presented, together with the task-specific design features that make it appropriate for depth prediction. The authors call the model presented herein "IMG2nDSM", because it maps an aerial photography image to a nDSM. , the corresponding nDSMs and the color bars of the heat maps indicating the color-coding in meters. The RGB images on the left of each pair have a size of 5000 × 5000 pixels, while the size of the nDSMs is 500 × 500 pixels. The heat maps are shown as the same size as the aerial images for demonstration reasons. This figure is best seen in color.

Data Preparation
The model operating on the Manchester area dataset used image patches of sizes 256 × 256 × 3, while the model operating on the DFC2018 dataset used image patches of sizes 520 × 520 × 3. The specific input sizes determined that the former model output a map of size 64 × 64 and the latter had an output of 52 × 52, since the resolution ratios of the images to LiDAR datasets were 4 and 10, respectively: every 4 pixels in one aerial image of the Manchester dataset correspond to 1 pixel in the respective nDSM, and 10 pixels in one aerial image of the DFC2018 dataset corresponded to 1 pixel in the respective nDSM. The 64 × 64 and the 52 × 52 sizes of the model outputs offered a compromise between the computational and memory requirements during training and sufficient scenery area consideration when calculating a nDSM, i.e., basing the estimation on several neighboring structures in the input image for achieving better accuracy. Various output sizes were experimented with, and it was discovered that predicting larger nDSMs tends to achieve The proposed architecture shares some similarity with semantic segmentation models, where the model must predict the label of each pixel in an image and, thus, partition it into segments. The segmentation may have the size of the input image or a scaleddown size. In this study, instead of labels, the real values corresponding to the elevation of each pixel were predicted in a down-scaled version of each RGB image. As with the semantic segmentation task, several DL models are suitable for learning the task of predicting nDSM values. A popular DL architecture, the U-Net model [50], was chosen due to its efficiency and effectiveness on tasks based on pixel-level manipulations like semantic segmentation [51][52][53] and as its architectural scheme easily combines with other DL techniques to introduce task-specific enhancements.
The U-Net architecture was implemented with residual blocks both in the encoder and the decoder mechanisms. Specifically, three types of residual blocks were used, as shown in Figure 4: • a down-sampling residual block (DRBLK) and • an up-sampling residual block (URBLK).
A typical residual block contains two convolutional layers at the data path and a convolutional layer with a kernel size of one at the residual connection path. The downsampling residual block differs in the stride used at the first convolutional layer and the skip connection. Using a higher stride at these convolutions, the previous feature maps were downscaled by a factor s (here, s = 2) at the first convolutional layer of the block and the skip connection, which resulted in smaller feature maps. The up-sampling residual block used subpixel convolutional upscaling [54] in the first layer of the block. Subpixel upscaling was performed in two steps, with the first step calculating a representation comprising feature maps of sizes h × w × s 2 c, where s was the upscaling factor, and h × w × c was the size of the input feature maps. The second step of the process applied a reshape operation on the feature maps and produced a representation containing feature maps of sizes 2h × 2w × c. The skip connection of the upscaling residual block also applied subpixel upscaling. The detailed architecture of the residual blocks is shown in Figure 4.  The input/output sizes of the models are a compromise between having a manageable input size in terms of the computational requirements and having a sufficient output map size and performance. The few differences between the two models are necessary for applying different reduction factors between the input and the output of the two datasets, as dictated by the RGB/nDSM resolution ratio of each dataset. Specifically, the number of layers, the number of channels and the kernel sizes of the convolutional layers of the model trained on the DFC2018 dataset were different from the ones used in the model trained on the Manchester dataset. This was due to the requirement of a larger resolution reduction. However, these changes were applied at the initial and the last layers of the model to maintain the U-NET scheme unaltered. Figures 5 and 6 show the detailed architectures of both models and the sizes of the feature maps after each layer.
The model dealing with the Manchester area dataset had 164 layers (including concatenation layers and residual block addition layers) and consisted of approximately 125 million trainable parameters. The model dealing with the DFC2018 dataset had 186 layers (including concatenation layers and residual block addition layers) but had fewer parameters to handle the higher memory requirements during training due to the larger input size. Precisely, it consisted of 104 million trainable parameters. The only differences with the model used for the Manchester area dataset were (a) the addition of some convolu- A very similar model for both datasets was used, with minor changes regarding the input patch size and the size of the output prediction. The Manchester area dataset had an RGB image over a nDSM resolution ratio equal to 4, so the neural network dealing with this dataset reduced the input size from 256 × 256 × 4 to an output size of 64 × 64. On the other hand, the DFC2018 dataset had an RGB image over the depth map resolution ratio equal to 10, so the neural network dealing with this dataset reduced the input size from 520 × 520 × 4 to output size 52 × 52.
The input/output sizes of the models are a compromise between having a manageable input size in terms of the computational requirements and having a sufficient output map size and performance. The few differences between the two models are necessary for applying different reduction factors between the input and the output of the two datasets, as dictated by the RGB/nDSM resolution ratio of each dataset. Specifically, the number of layers, the number of channels and the kernel sizes of the convolutional layers of the model trained on the DFC2018 dataset were different from the ones used in the model trained on the Manchester dataset. This was due to the requirement of a larger resolution reduction. However, these changes were applied at the initial and the last layers of the model to maintain the U-NET scheme unaltered. Figures 5 and 6 show the detailed architectures of both models and the sizes of the feature maps after each layer.   The model dealing with the Manchester area dataset had 164 layers (including concatenation layers and residual block addition layers) and consisted of approximately 125 million trainable parameters. The model dealing with the DFC2018 dataset had 186 layers (including concatenation layers and residual block addition layers) but had fewer parameters to handle the higher memory requirements during training due to the larger input size. Precisely, it consisted of 104 million trainable parameters. The only differences with the model used for the Manchester area dataset were (a) the addition of some convolutional layers with "valid" padding to achieve the correct output size and (b) the reduction of the parameters of the convolutional layers.

Training Details
Simple augmentations were applied to the patches during training: rotations of 90, 180 and 270 degrees; small-value color shifting and contrast variations. Patches where the elevation maps contained incomplete or extreme elevation values (>100 m) were ignored in both datasets. Moreover, specifically for the Manchester area dataset, small elevation values (<1.5 m) were replaced with zeros to prevent the model from considering nonstationary objects and low vegetation. This preprocessing is important in part because of the time difference between the acquisition of the RGB images and the LiDAR point clouds, which results in inconsistencies between the images and the elevation maps due to the presence of mobile objects like cars in the viewing field of either of the two sensors (RGB or LiDAR). The time of acquisition inconsistency in the Manchester area dataset also introduced inconsistencies in the vegetation height and, occasionally, in building heights (demolished buildings or newly constructed buildings). Consequently, regarding the Manchester area dataset, the model was focused on predicting the elevation of human-built structures like houses, factories and public buildings. The DFC2018 dataset had better resolution, and no inconsistencies were observed. This fact facilitated the prediction of the vegetation height as well; thus, a threshold filter was not applied to the ground truth nDSMs for the DFC2018 dataset.
The models were trained with the Adam optimizer [56] and a learning rate of 1 × 10 −4 , which decreased by a factor of 10 each time the error plateaued for several iterations. Both datasets were randomly split into three sets, each containing images of sizes 1000 × 1000 for the Manchester dataset and 5000 × 5000 for the DFC2018 dataset: a training set (70%), a validation set (15%) and a test set (15%). The validation set was used for hyperparameter fitting and then merged with the training set for retraining the models. The test set was only used to report the models' performances. The models were trained for 5 different random dataset splits, and the average of the performances on the test sets was reported. The pixel-wise Mean Absolute Error (MAE) was used between the ground truth elevation maps and the predicted output as the loss function during training. The Mean Squared Error (MSE) was also considered, but it was found that the MAE performed slightly better, probably because it does not penalize outliers as much as the MSE. The Root Mean Squared Error (RMSE) performance was also reported as an additional evaluation metric. All parameters were initialized with the He normal technique [57].

Results
The model presented herein achieved a MAE of 0.59 m and an RMSE of 1.4 m for the Manchester area dataset, as well as a MAE of 0.78 m and an RMSE of 1.63 m for the DFC2018 dataset. The lower error values on the first dataset most likely occurred due to ignoring the small nDSM values (<1.5 m), increasing the model accuracy in the prediction of buildings and human-made structures. The proposed architecture improved on the results of Carvallo et al. [40] and Liu et al. [17] by a significant margin (see Table 1), although a direct comparison cannot be accurate, since all approaches use random data splits.

Height Prediction for the Manchester Area Dataset
The estimated heights of areas in the Manchester Area test set are depicted in Figure 7. The estimations are shown in the form of heat maps for better visualization (i.e., providing a more precise display of the relative height values) and evaluation purposes. Since the model operates on patches of size 256 × 256 × 3, the RGB images were divided into several patches with overlapping regions of 16 pixels. Then, the model predicted a nDSM for each patch and, finally, the estimated maps were recombined to create the overall nDSM for the RGB image. During the recombination process, the outer 16 pixels of each predicted map were ignored to achieve a more natural blending and avoid artifacts. Interestingly, the model avoided spiky estimations like the ones indicated with note 1 in the images of Figure 7: ground truth LiDAR maps occasionally contain points of unnaturally high values compared to neighboring points that constitute false readings that occur for several reasons. These reasons relate mainly to the physical properties of the LiDAR sensor and the environmental conditions during data acquisition (see Section 1 for a discussion on this). Furthermore, some incorrect readings may have values that lie in the boundary of reasonable LiDAR values and are difficult to discriminate from incomplete readings with irrational values. Such spiky readings naturally occurred in the training set as well. Nevertheless, the model was not affected by such inconsistencies in the training set, and its estimates corresponding to spiky measurements in the "ground truth" data were closer to the actual ground truth (see Figure 7, note 1).
Moreover, the Manchester Area dataset contains several inconsistencies in regard to structures that were missing either from the RGB images or from the nDSM due to different acquisition times between the two data types. Such inconsistencies are shown in Figure 7 (indicated as note 2): In these cases, some structures presented in the ground truth nDSM were missing from the RGB images; however, the model correctly predicted the corresponding regions containing the inconsistencies as undeveloped spaces. This behavior was, of course, desired and demonstrated effectively that the IMG2nDSM model presented herein was robust to false training instances. Furthermore, the results revealed some additional cases that indicated that the model was doing a good job estimating the heights of buildings, surpassing the quality of the ground truth map, notably with noisy data on specific structures with known forms. Note 3 in Figure 7 demonstrates such a case where the ground truth map seems noisy, given the known form of the apex roof structure, while the estimation of the model is more detailed and smoother. This raises the point that, although the model performance was calculated against the LiDAR data as the "ground truth", it sometimes outperformed the LiDAR data and generated results closer to the actual ground truth. Remote Sens. 2021, 13, x FOR PEER REVIEW 13 of 22 Right: The elevation heat maps as predicted by the model. Note 1 shows cases of spurious points in the ground truth that the model correctly avoids estimating. Note 2 shows occasional inconsistencies in the dataset due to the different acquisition times of the RGB images and the LiDAR measurements. Although these inconsistencies are also evident in the training set, the model is robust to such problematic training instances. Note 3 shows cases where the model produces betterquality maps than the ground truth in terms of the surface smoothness and level of detail, as the LiDAR data contains noisy values.

Height Prediction for the DFC2018 Dataset
The estimated nDSMs of the consecutive areas of the DFC2018 test set are illustrated in Figure 8. As in the case of the Manchester Area test set, the RGB images were divided into overlapping patches, and the model predicted the nDSM for each of the patches. The only difference was that the size of the patches for this dataset was 520 × 520 × 3 pixels. The estimated nDSMs were fused together, as described in Section 3.1, to create the nDSM of the entire area. Right: The elevation heat maps as predicted by the model. Note 1 shows cases of spurious points in the ground truth that the model correctly avoids estimating. Note 2 shows occasional inconsistencies in the dataset due to the different acquisition times of the RGB images and the LiDAR measurements. Although these inconsistencies are also evident in the training set, the model is robust to such problematic training instances. Note 3 shows cases where the model produces better-quality maps than the ground truth in terms of the surface smoothness and level of detail, as the LiDAR data contains noisy values.

Height Prediction for the DFC2018 Dataset
The estimated nDSMs of the consecutive areas of the DFC2018 test set are illustrated in Figure 8. As in the case of the Manchester Area test set, the RGB images were divided into overlapping patches, and the model predicted the nDSM for each of the patches. The only difference was that the size of the patches for this dataset was 520 × 520 × 3 pixels. The estimated nDSMs were fused together, as described in Section 3.1, to create the nDSM of the entire area. Remote Sens. 2021, 13, x FOR PEER REVIEW 14 of 22 Note 1 indicates an area that contains a group of trees and is magnified in Figure 9 to demonstrate how the model treats vegetation in the RGB images. Note 1 indicates an area that contains a group of trees and is magnified in Figure 9 to demonstrate how the model treats vegetation in the RGB images.
mation and demonstrates the tendency of the model to overestimate the volume of foliage. It is thought that this behavior contributed to the higher MAE that the model scored on the DFC2018 dataset compared to the better performance on the Manchester area dataset. As described above, the latter dataset had a lower resolution and more inconsistencies, but the model training ignored the vegetation and low-standing objects to its favor. However, despite the associated higher MAE, this behavior of the model with a vegetation height estimation could be beneficial under some circumstances, such as projects that focus on tree counting, monitoring tree growth or tree coverage in an area [12].

Model Analysis
The very good results of the model, as shown in Table 1, result from its carefully designed architecture, which was selected after many experiments and trials with various alternate options. The initial form of the model was a basic model with the U-NET scheme proposed in reference [50] with typical residual blocks (Figure 4a) only, max-pooling (down-sampling) layers and nearest-neighbor interpolation (up-sampling) layers. Then, the basic model was improved upon by replacing individual architectural features with ones that improved the performance. The modifications that affected the performance the most are listed according to their contribution (higher contribution first): • Use of the up-sampling residual block (URBLK), as shown in Figure 4c, instead of the nearest-neighbor interpolation.

•
Use of the down-sampling residual block (DRBLK) with strided convolutions, as shown in Figure 4b, instead of max-pooling. The predicted nDSMs looked very similar to the ground truth. The higher resolution of the RGB images and the consistency between the RGB and the LiDAR measurements in terms of the data acquisition time had a positive impact on the model's performance. For this dataset, the model could estimate the vegetation height accurately. Regarding vegetation, the model consistently overestimated the area covered by foliage, as it filled the space between the foliage. Note 1 in the second row of Figure 8 (located at the ground truth nDSM) shows the height measurements for a group of trees. Figure 9 shows the magnification of that area, the magnified ground truth map and the model's height estimation and demonstrates the tendency of the model to overestimate the volume of foliage. It is thought that this behavior contributed to the higher MAE that the model scored on the DFC2018 dataset compared to the better performance on the Manchester area dataset. As described above, the latter dataset had a lower resolution and more inconsistencies, but the model training ignored the vegetation and low-standing objects to its favor. However, despite the associated higher MAE, this behavior of the model with a vegetation height estimation could be beneficial under some circumstances, such as projects that focus on tree counting, monitoring tree growth or tree coverage in an area [12].

Model Analysis
The very good results of the model, as shown in Table 1, result from its carefully designed architecture, which was selected after many experiments and trials with various alternate options. The initial form of the model was a basic model with the U-NET scheme proposed in reference [50] with typical residual blocks (Figure 4a) only, max-pooling (downsampling) layers and nearest-neighbor interpolation (up-sampling) layers. Then, the basic model was improved upon by replacing individual architectural features with ones that improved the performance. The modifications that affected the performance the most are listed according to their contribution (higher contribution first):

•
Use of the up-sampling residual block (URBLK), as shown in Figure 4c, instead of the nearest-neighbor interpolation.

•
Use of the down-sampling residual block (DRBLK) with strided convolutions, as shown in Figure 4b, instead of max-pooling.
• Modification of the basic U-NET scheme so that the first two concatenation layers are applied before the up-sampling steps and not after them, as originally proposed in reference [50]. • Use of "same" instead of "valid" padding in the U-NET scheme.

•
Replace the ReLU activation functions with PReLUs.
The first three modifications (the use of URBLKs and DRBLKs and the changes in the concatenation layers positions) enabled the model to surpass the performances of other state-of-the-art works, while the remaining modifications (the use of "same" padding and PReLUs) further increased the performance in favor of the proposed model, widening the gap. Overall, the proposed model relied on a task-specific architecture for achieving good results in predicting the nDSM of a scene from an aerial image.

Investigation of the Model's Reliance on Shadows
The proposed DL models apply representation learning to discover the features needed to learn the task. Besides structural and scene geometry features, shadows are an important geometric cue, and they directly correlate with the height of the structures that cause them. Van Dijk et al. [58] reported that an object with shadow is more likely to be detected by a DL model trained for depth estimation than an object without a shadow. Christie et al. [59] cast shadows in each LiDAR map that matched the shadows observed in the RGB image to improve the reliability of image matching.
To investigate whether the model was indeed considering shadows for predicting the nDSMs, an experiment was conducted to observe how the model changed its prediction after the manipulation of shadows in the image patch. Specifically, a small square mask (window) was moved over the image patch, altering the values of the area under the mask. The sliding window set the underlying values to zero and, thus, simulated the presence of a shadow. The effect of this value adjustment on the model's prediction was observed to significantly affect the nDSM prediction (see Figure 10). Shadow removal was also attempted by replacing dark shadow pixels in an image with higher values, but this did not influence the prediction as significantly as the shadow addition. This observation implies that the model has some prediction dependency on both the detection of the features of a structure and its associated shadow. Removing the shadow of a structure resulted in a slightly lower height prediction for the structure, but this did not fool the model into ignoring the shadowless building. In the other case, simulating a large shadow anywhere in the image caused the model to increase its elevation prediction for any evident object around that shadow. Figure 10 shows how the model's prediction is affected by sliding a zero-value masking window over the RGB images. When the masking window adjoined the structures in the images, the model was driven into predicting larger height values for these structures. In the case that there were no structures adjoining the artificial shadow, the impact of the shadow manipulation was negligible.
In summary, the results of this experiment suggest that the model does consider shadows for computing the height of the structures in the RGB image, yet is not dependent on them. A shadow that is not associated with a nearby structure does not fool the model into predicting a building nearby. Thus, there is evidence that the model combines the presence of a shadow together with the features of an object to compute its height.

Limitations
Despite the overall promising results of the proposed model, there were still some cases where the model did not perform correctly. Buildings were well-represented in both datasets, and thus, the model could predict their heights with little error. The same applied to vegetation in the DFC2018 dataset. However, for objects that were rarely seen in the data (e.g., objects that were tall and thin simultaneously, such as light poles and telecommunication towers), the model sometimes failed to estimate their heights correctly. In cases of very scarce objects, the model treated them as if they did not exist. Rarely seen tall objects that were not bulky or whose structure had empty interior spaces were tough for the model to assess. Examples of such failed cases are shown in Figure 11. The leading cause of the problem was is the under-representation of these structures in the dataset. This can be mitigated by introducing more images containing these objects during training.

Limitations
Despite the overall promising results of the proposed model, there were still some cases where the model did not perform correctly. Buildings were well-represented in both datasets, and thus, the model could predict their heights with little error. The same applied to vegetation in the DFC2018 dataset. However, for objects that were rarely seen in the data (e.g., objects that were tall and thin simultaneously, such as light poles and telecommunication towers), the model sometimes failed to estimate their heights correctly. In cases of very scarce objects, the model treated them as if they did not exist. Rarely seen Figure 10. Using a sliding window to investigate whether the model uses shadows for object height estimation. Each test case is presented in pairs of consecutive rows, the upper row showing the RGB image with the position of the sliding masking window (black square) and the lower row the prediction at the model output. The artificial shadow implied by the square black box influences the height estimation of the buildings close to the shadow by increasing their height predictions (the values on the predicted map corresponding to the buildings that are close to the shadow are seen to be brighter in the image and, thus, higher in value). The estimated heights of buildings that are not near the implied shadow are not affected. The artificial shadow causes the model to predict a higher elevation for buildings that are in the shadow's proximity. This figure is better seen in color.
Although the model performs well, it is acknowledged that it has many parameters. However, it is quite fast when predicting the nDSM of an individual patch, especially when the model runs on a Graphical Processing Unit (GPU). Inferring the nDSM of a large area requires the splitting of the RGB image into several patches. Using a GPU, the estimation of the nDSMs of all patches was performed in parallel by processing a batch (or batches) of patches, taking advantage of the hardware and its parallel computing capabilities.
for the model to assess. Examples of such failed cases are shown in Figure 11. The leading cause of the problem was is the under-representation of these structures in the dataset. This can be mitigated by introducing more images containing these objects during training.
Although the model performs well, it is acknowledged that it has many parameters. However, it is quite fast when predicting the nDSM of an individual patch, especially when the model runs on a Graphical Processing Unit (GPU). Inferring the nDSM of a large area requires the splitting of the RGB image into several patches. Using a GPU, the estimation of the nDSMs of all patches was performed in parallel by processing a batch (or batches) of patches, taking advantage of the hardware and its parallel computing capabilities.

Discussion
Obtaining the heights of objects in aerial photography with hardware equipment can be costly, time-consuming and require human expertise and sophisticated instruments. Furthermore, the acquisition techniques of such data are demanding and require specialized operators. On the other hand, inferring this data solely from aerial RGB images is easier, faster and especially helpful if the availability of image pairs is limited for a certain terrain modeling task. Height estimations from aerial imagery are difficult due to its illposed nature, yet DL techniques offer a promising perspective towards providing adequate solutions to the task.
The authors proposed a model, named IMG2nDSM, with a task-focused DL architecture that tackled the problem with very good results, which were better than the state-ofthe-art ones to date. The model was tested on two different datasets: one with 0.25 m by

Discussion
Obtaining the heights of objects in aerial photography with hardware equipment can be costly, time-consuming and require human expertise and sophisticated instruments. Furthermore, the acquisition techniques of such data are demanding and require specialized operators. On the other hand, inferring this data solely from aerial RGB images is easier, faster and especially helpful if the availability of image pairs is limited for a certain terrain modeling task. Height estimations from aerial imagery are difficult due to its ill-posed nature, yet DL techniques offer a promising perspective towards providing adequate solutions to the task.
The authors proposed a model, named IMG2nDSM, with a task-focused DL architecture that tackled the problem with very good results, which were better than the state-ofthe-art ones to date. The model was tested on two different datasets: one with 0.25 m by 0.25 m image resolution, 1-m LiDAR resolution and different acquisition times (thus, it had spatial inconsistencies) and one with 0.05 m by 0.05 m image resolution and 0.5-m LiDAR resolution. The first dataset (capturing lower resolution images) covered the Trafford area in Manchester, UK, while the second dataset was part of the 2018 IEEE GRSS Data Fusion Contest. The first dataset was used to estimate building heights only, while the second dataset was used to estimate both buildings and vegetation heights. Despite the inconsistencies encountered in the first dataset, the effectiveness of the model indicated its high robustness and ability to build domain knowledge without resorting to dataset memorization. This indication was also suggested by the fact that data curation or special prepossessing, besides data augmentation, was not employed.
The authors aspired to the idea that the possibility of deriving high-precision digital elevation models from RGB images without expensive equipment and high costs will accelerate global efforts in various application domains that require the geometric analysis of areas and scenes. Such domains include urban planning and digital twins for smart cities [11], tree growth monitoring and forest mapping [12], modeling ecological and hydrological dynamics [60], detecting farmland infrastructures [61], etc. Such lowcost estimations of building heights will allow policymakers to understand the potential revenue of rooftop photovoltaics based on the yearly access to sunshine [62] and law enforcement to verify whether urban/or rural infrastructures comply with the local land registry legislation.
Finally, it is noted that the model experienced some cases of poor performance with tall and thin and generally under-represented objects. This issue can be solved by including more examples of such objects in the training images, which is an aspect of future works.

Conclusions
A DL model, IMG2nDSM, was proposed for inferring the heights of objects in single aerial RGB images. The model was trained with aerial images and their corresponding nDSMs acquired from LiDAR point clouds, but only the RGB images were required during the inference. The model was tested on two datasets, and its performance was significantly better than the other state-of-the-art methods. The results proved that the model built good domain knowledge and sometimes produced results that were better compared to the LiDAR data when assessing the ground truth scenario. The model's behavior regarding vegetation height estimations was also analyzed, and some failed cases were reported.
Future research directions and model improvements include the reduction of failed cases for under-represented structures in aerial imagery, such as rarely seen special-purpose structures with electronic devices, telecommunication towers and energy transmission towers. The value of the proposed methodology stems from its convenient and easy application and the fact that it only requires RGB images during inference. Achieving the height estimation task from single RGB images without requiring LiDAR or any other information greatly reduces the cost, required effort and time and the difficulties that emerge from using complex data acquisition techniques or complex analytical computations.