A Multi-Scale Mapping Approach Based on a Deep Learning CNN Model for Reconstructing High-Resolution Urban DEMs

The shortage of high-resolution urban digital elevation model (DEM) datasets has been a challenge for modelling urban floods and managing flood risk. A solution is to develop effective approaches to reconstruct high-resolution DEMs from their low-resolution equivalents, which are more widely available. However, the current high-resolution DEM reconstruction approaches mainly focus on natural topography. Few attempts have been made for urban topography, which is typically an integration of complex man-made and natural features. This study proposes a novel multi-scale mapping approach based on a convolutional neural network (CNN) to deal with the complex characteristics of urban topography and reconstruct high-resolution urban DEMs. The proposed multi-scale CNN model is first trained using urban DEMs that contain topographic features at different resolutions, and then used to reconstruct the urban DEM at a specified (high) resolution from a low-resolution equivalent. A two-level accuracy assessment approach is also designed to evaluate the performance of the proposed urban DEM reconstruction method in terms of numerical accuracy and morphological accuracy. The proposed DEM reconstruction approach is applied to a 121 km² urbanized area in London, UK. Compared with other commonly used methods, the current CNN-based approach produces superior results, providing a cost-effective and innovative method to acquire high-resolution DEMs in other data-scarce environments.


Introduction
Digital elevation models (DEMs) have been widely used in many fields such as landform evolution, soil erosion modeling and other geo-simulations (Bishop et al., 2012; Liu et al., 2015; Mondal et al., 2017; Li and Wong, 2010). In particular, DEMs provide indispensable data to support water resources management and flood risk assessment (Moore et al., 1991; O'Loughlin et al., 2016). In urban flood risk assessment, the availability of high-resolution urban DEMs is crucial for the accurate representation of complex urban topographic features and required for a reliable prediction of flood inundation to inform risk calculation (Ramirez et al., 2016; Leitão and de Sousa, 2018).
The common ways of acquiring high-resolution urban DEMs include ground surveying and remote sensing through light detection and ranging (LiDAR), interferometric synthetic aperture radar (InSAR) and other techniques (Shan and Aparajithan, 2005; Rossi and Gernhardt, 2013; Le Besnerais et al., 2008). These approaches are usually labor-intensive and expensive, hindering their wider application at a large scale (e.g., across an entire city). As such, high-resolution urban DEMs are not always available, especially for cities in developing countries. This imposes a barrier for many applications, including the development of effective urban flood risk management strategies, which need to be informed by high-resolution flood modelling results. Hence, it is necessary to develop alternative and more cost-effective approaches to construct high-resolution urban DEMs to support a wide range of applications.
Although high-resolution urban DEMs are not always available, low-resolution DEMs, on the other hand, are relatively easy to access. For example, there exist a range of open-access global DEMs, including Shuttle Radar Topography Mission (SRTM) and ALOS World 3D (Hawker et al., 2018). Thus, it may be desirable to develop effective techniques to enhance the quality of low-resolution DEMs to subsequently obtain high-resolution urban DEMs. Most of the existing high-resolution DEM reconstruction methods are developed for natural terrains, which may be generally classified into three categories, i.e., DEM interpolation, DEM enhancement and learning-based DEM reconstruction.
The DEM interpolation methods, commonly including inverse distance weighting (IDW), bilinear interpolation (BI) and cubic convolution (CC), are generally implemented according to spatial autocorrelation, i.e., the correlation of the ground elevations between two points is inversely related to the distance between them (also known as Tobler's first law of geography) (Aguilar et al., 2005; Heritage et al., 2009; Wise, 2011; Arun, 2013; Tan et al., 2018). DEM interpolation methods have been widely applied to acquire high-resolution DEMs, but the resulting products can never include topographic details that are not contained in the low-resolution DEMs. DEM enhancement methods attempt to restore the lost topographic features by introducing extra information to enhance the quality of low-resolution DEMs. The extra information may be derived from additional elevation points, contours, land-use maps and flood extents (Tran et al., 2014; Yue et al., 2015; Mason et al., 2016; Yue et al., 2017; Li et al., 2017), etc. DEM enhancement methods can effectively reconstruct high-resolution DEMs by fusing multiple DEMs and datasets at different scales or from various sources. Nevertheless, the extra high-accuracy topographic information required by this type of method is still hard to acquire, especially over a large extent. The learning-based approaches generate high-resolution DEMs by establishing the correlation between low- and high-resolution DEMs through a training process (Xu et al., 2015; Chen et al., 2016; Moon and Choi, 2016). Learning-based models can be trained to learn from multi-dimensional information, which may potentially produce high-resolution DEMs of better quality than the aforementioned alternative approaches. However, less research has been done on this topic, and the existing learning-based models are relatively simple and not suitable for applications in complex urban environments.
Most of the existing DEM reconstruction methods are developed for and applied to natural terrains. Reconstruction of urban high-resolution DEMs faces extra challenges, and direct application of the existing methods in complex urban environments is questionable and may not be feasible. Due to human interventions, urban topography is typically an intricate synthesis of man-made and natural features. In most cases, man-made features are predominant, which may create abrupt changes to the topography at different scales. For flood modelling, the key urban structures/features may influence and even control the underlying hydrological processes and must be accurately represented in urban DEMs (Mark et al., 2004; Ozdemir et al., 2013; Leitão and de Sousa, 2018). In addition, there is a strong need to develop new approaches that support multi-scale reconstruction, so that urban DEMs can be efficiently reconstructed at specified resolutions from a low-resolution equivalent.
Whilst cities are covered by man-made topographic features of different types and scales, they are planned and built according to specific regulations and codes, and urban topography commonly presents a high level of self-similar features, especially for cities in the same region. This is particularly suitable for the application of learning-based approaches. For example, the Convolutional Neural Network (CNN) (LeCun et al., 2015; Schmidhuber, 2015) is a deep learning technique designed to automatically and adaptively learn spatial hierarchies of image features and has been successfully applied in image recognition and many other fields, such as machine translation and autonomous driving (Abdel-Hamid et al., 2014; Chen et al., 2015; Gu et al., 2018). An urban gridded DEM can be effectively regarded as an image. With the availability of localized DEMs of different resolutions, a CNN model may be trained to recognize the urban topographic features and used to reconstruct high-resolution DEMs from low-resolution ones across a much larger area. Although it may still be cost-prohibitive over a large area, the use of unmanned aerial vehicles (UAVs) to acquire high-resolution urban DEMs in localized areas is feasible and has been widely adopted (James and Robson, 2014; Gonçalves and Henriques, 2015; Leitão et al., 2016). Therefore, this paper aims to develop an innovative approach that combines a deep-learning CNN model and localized high-resolution urban DEMs to substantially improve the quality of low-resolution urban DEMs and subsequently reconstruct high-resolution urban DEMs for a large area.
The rest of this paper is arranged as follows: Section 2 introduces the proposed multi-scale mapping approach for urban DEM reconstruction, followed by the introduction of two-level accuracy assessment framework in Section 3; Section 4 describes the experiments undertaken to assess the effectiveness of high-resolution urban DEM reconstruction; further discussion is given in Section 5; and finally several remarks are summarized in Section 6.

A CNN-based multi-scale mapping approach
A Multi-Scale Mapping approach based on CNN (MSM-CNN) is developed to reconstruct high-accuracy urban DEMs at higher resolutions from a low-resolution dataset, as illustrated in Fig. 1. Herein, the low-resolution urban DEM is denoted as $X$, and the corresponding datasets at higher resolutions are denoted as $Y^{2}, Y^{4}, \ldots, Y^{2^n}$, where the superscript $2^n$ indicates that the urban DEM is at $2^n$ times higher resolution than DEM $X$ and $n$ is a positive integer. The goal here is to reconstruct any high-resolution urban DEM $F^{2^n}(X)$ from the low-resolution DEM $X$ such that $F^{2^n}(X)$ is as close to the ground truth dataset $Y^{2^n}$ as possible, which is achieved by training a CNN to learn the mapping $F$.

Network architecture
The detailed network architecture is shown in Fig. 1a, which consists of several subnetworks. Each of these subnetworks performs a 2-time reconstruction of its input urban DEM. Previous work has shown that networks with skip connections bypassing certain intermediate layers can achieve better performance (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016). Skip connections are therefore added between the input and output of each subnetwork. Specifically, the input urban DEM of each subnetwork is interpolated to twice its original resolution using a nearest neighbor assignment (NNS) method, and the interpolated urban DEM is then directly added to the output of the feature-learning network. The NNS interpolation only increases the resolution of the input data without generating any new information on ground elevation.
As a result, the skip connections encourage the feature-learning networks to effectively learn and predict the missing topographic details from the low-resolution urban DEMs to generate high-resolution datasets. Since each subnetwork only performs a 2-time reconstruction, the proposed architecture can effectively train a single network to construct urban DEMs at different higher resolutions.
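The cascade of 2-time subnetworks with NNS skip connections can be illustrated with a minimal sketch. It uses plain Python lists in place of raster tensors, and `zero_residual` is a hypothetical stand-in for a trained feature-learning network; none of the function names come from the original work.

```python
from typing import Callable, List

Grid = List[List[float]]


def nn_upsample_2x(dem: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: every cell becomes a 2 x 2 block."""
    out: Grid = []
    for row in dem:
        wide = [v for v in row for _ in range(2)]  # duplicate along columns
        out.append(wide)
        out.append(list(wide))                     # duplicate along rows
    return out


def subnetwork(dem: Grid, predict_residual: Callable[[Grid], Grid]) -> Grid:
    """One 2x stage: upsampled input (skip path) plus predicted detail."""
    base = nn_upsample_2x(dem)
    detail = predict_residual(dem)  # must already have the upsampled size
    return [[b + d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, detail)]


def reconstruct(dem: Grid, stages: List[Callable[[Grid], Grid]]) -> List[Grid]:
    """Chain 2x stages and return the output at every scale (2x, 4x, ...)."""
    outputs = []
    for stage in stages:
        dem = subnetwork(dem, stage)
        outputs.append(dem)
    return outputs


def zero_residual(dem: Grid) -> Grid:
    """Hypothetical placeholder for a trained feature-learning network."""
    return [[0.0] * (2 * len(dem[0])) for _ in range(2 * len(dem))]


coarse = [[1.0, 2.0], [3.0, 4.0]]
fine_2x, fine_4x = reconstruct(coarse, [zero_residual, zero_residual])
```

With a zero residual the output reduces to pure NNS upsampling, which makes explicit that all new topographic detail has to come from the feature-learning path.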
In the proposed architecture, the feature-learning network is a key component of each subnetwork. It is important to consider not only the representation capability but also the computational efficiency when building a feature-learning network. As the convolutional operations in convolutional layers are linear, they lack the ability to model non-linear relationships in the data. The general method to address this issue is to include layers with nonlinear projections after the convolutional operations. Hence, a rectified linear unit (ReLU), formulated as max(0, x), is employed to account for nonlinearity, an approach widely reported for its effectiveness (Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016). Herein, every convolutional layer is followed by a ReLU unless otherwise stated. In addition, the information distillation block (IDB) proposed by Hui et al. (2018) is introduced as the basic element due to its proven performance. Each feature-learning network includes two IDBs. Fig. 1b shows the structure of each IDB, which includes a stack of several convolutional layers. After the first three layers in each IDB, the output feature maps are split into s parts. The part with a proportion of 1/s is concatenated with the output of the next three layers. Such a structure creates skip connections and combines both long- and short-path features. In the current study, s is set to 4. Finally, the last layer in each IDB combines these features from different paths. Additionally, in each feature-learning network, two convolutional layers with 3 × 3 × 64 filters (i.e., 64 filters with a width and height of 3) are used before the IDBs to extract features of the input low-resolution urban DEM as the basis of the high-resolution reconstruction; after the IDBs, a transposed convolutional layer is applied to project the output data to the DEM at the targeted (high) resolution.
An advantage of the proposed multi-scale architecture over single-scale architectures is that multi-scale supervision is introduced to regularize the intermediate features of the urban DEM, which pushes the output of each subnetwork to become as close to the high-resolution 'true' DEM as possible. As a result, the network is directed to learn and restore the lost information in the input low-resolution DEM progressively, which results in an adaptive adjustment of the reconstruction error. Therefore, such a structure can learn the mapping from the low- to high-resolution urban DEM more effectively than direct mapping learning without any intermediate supervision. The adopted multi-scale supervision enables effective reconstruction of quality-enhanced urban DEMs at any specified high resolution. Computing losses at intermediate network layers to guide the learning process has been widely used in deep neural network architectures (Szegedy et al., 2015; Lai et al., 2017). In this paper, we introduce this principle to the domain of DEM reconstruction for the first time.

Network training
This section introduces the process of network training, in which the loss of elevation information is a critical issue that must be carefully considered. Let $R_i$ be the $2^i$-time reconstruction result and $Y_i$ be the corresponding ground truth. The loss $L_i$ between $R_i$ and $Y_i$ is computed over all cells, where $R_{i,j}$ and $Y_{i,j}$ are the $j$-th elements of $R_i$ and $Y_i$, respectively, and $N$ is the number of cells. The overall loss $L$ is the sum of the losses at all scales, $L = \sum_{i} L_i$. Theoretically, a weighted sum could achieve a better balance among the losses at different scales. However, preliminary experiments revealed that the sum with equal weights is sufficient to achieve good performance. The network is trained by standard back-propagation with the Adam optimizer (Kingma and Ba, 2015) and a batch size of 64. The weight decay is set to 0.0001, and the learning rate is initially set to 0.0001 and reduced by a factor of 10 after 250 thousand iterations.
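As an illustration of the equally weighted multi-scale loss and the learning-rate schedule, a minimal sketch follows. Two points are assumptions rather than statements from the text: the per-scale loss is taken to be the mean absolute difference, and the learning rate is assumed to drop only once. DEMs are flattened to plain lists of elevations.

```python
from typing import Sequence


def scale_loss(recon: Sequence[float], truth: Sequence[float]) -> float:
    """Per-scale loss L_i. Mean absolute difference is an ASSUMPTION here."""
    if len(recon) != len(truth):
        raise ValueError("reconstruction and ground truth differ in size")
    return sum(abs(r - y) for r, y in zip(recon, truth)) / len(recon)


def total_loss(recons: Sequence[Sequence[float]],
               truths: Sequence[Sequence[float]]) -> float:
    """Overall loss L: equally weighted sum of the per-scale losses."""
    return sum(scale_loss(r, y) for r, y in zip(recons, truths))


def learning_rate(iteration: int, base: float = 1e-4,
                  drop_at: int = 250_000) -> float:
    """Start at 1e-4 and divide by 10 once `drop_at` iterations are done.

    A single drop is an ASSUMPTION; the text does not say whether the
    reduction is repeated.
    """
    return base if iteration < drop_at else base / 10.0
```

For example, `total_loss([[1.0, 2.0], [0.0, 0.0, 0.0, 4.0]], [[1.0, 4.0], [0.0, 0.0, 0.0, 0.0]])` adds a 2-time loss of 1.0 and a 4-time loss of 1.0.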
In the training process, we divide the training dataset of urban DEMs (see subsection 4.1) into blocks with a size of 500 by 500. Each block also has an overlapping of 250 cells in both horizontal and vertical direction with its neighbors. During each forward-backward pass of the network, a batch of 64 blocks is randomly selected for each training area, and then a patch from each block is randomly cropped, followed by forming the batch of training data through concatenating all of the patches. The size of a patch is chosen to meet the computational capacity, which depends on the number of scales in the network.
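The block tiling and random patch cropping described above can be sketched as follows. The patch size of 128 cells, the sampling with replacement and all function names are illustrative assumptions; only the block size (500), the overlap (250) and the batch size (64) are taken from the text.

```python
import random
from typing import List, Tuple

Origin = Tuple[int, int]


def block_origins(n_rows: int, n_cols: int,
                  size: int = 500, overlap: int = 250) -> List[Origin]:
    """Top-left corners of size x size blocks that overlap their neighbours."""
    stride = size - overlap
    return [(r, c)
            for r in range(0, n_rows - size + 1, stride)
            for c in range(0, n_cols - size + 1, stride)]


def random_patch_origin(block: Origin, patch: int,
                        size: int = 500, rng=random) -> Origin:
    """Top-left corner of a random patch lying fully inside one block."""
    r0, c0 = block
    return (r0 + rng.randrange(size - patch + 1),
            c0 + rng.randrange(size - patch + 1))


def sample_batch(origins: List[Origin], patch: int,
                 batch_size: int = 64, rng=random) -> List[Origin]:
    """Pick `batch_size` blocks at random and crop one patch from each."""
    blocks = [rng.choice(origins) for _ in range(batch_size)]
    return [random_patch_origin(b, patch, rng=rng) for b in blocks]


# A hypothetical 1000 x 1000 training area gives a 3 x 3 grid of blocks.
origins = block_origins(1000, 1000)
batch = sample_batch(origins, patch=128, rng=random.Random(0))
```

The returned corners would then be used to slice the low-resolution patch and the matching ground-truth patches at each scale.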

Two-level accuracy assessment
To evaluate the performance of the proposed urban DEM reconstruction method, a two-level assessment approach is designed to quantify the numerical accuracy and morphological accuracy of the resulting products. Herein, the numerical accuracy is a quantification of elevation error at cell locations, while the morphological accuracy is a region-scale quantification of morphology variance between the urban DEM and ground truth.

Numerical accuracy
Numerical accuracy is assessed by the difference of pointwise elevation between the reconstructed and 'true' urban DEMs. Three well-known metrics, i.e., mean absolute error (MAE), root mean square error (RMSE) and standard deviation (STD), are employed to quantify the numerical accuracy. MAE and RMSE take their standard forms, $$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - y_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - y_i\right)^2},$$ and STD measures the spread of the elevation errors, where $n$ is the total count of valid grid cells, $x_i$ denotes the elevation of the reconstructed urban DEM and $y_i$ refers to the reference data.
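A minimal sketch of the three metrics is given below, with valid cells flattened into lists. MAE and RMSE follow their standard definitions; taking STD as the population standard deviation of the signed elevation errors is an assumption, since other conventions (e.g. an n - 1 denominator) are also in use.

```python
import math
import statistics
from typing import Sequence


def mae(x: Sequence[float], y: Sequence[float]) -> float:
    """Mean absolute error between reconstructed (x) and reference (y)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error between reconstructed and reference cells."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def error_std(x: Sequence[float], y: Sequence[float]) -> float:
    """Population standard deviation of the signed errors (ASSUMPTION)."""
    return statistics.pstdev([a - b for a, b in zip(x, y)])


recon = [1.0, 2.0, 3.0, 4.0]
ref = [1.0, 1.0, 5.0, 4.0]
scores = (mae(recon, ref), rmse(recon, ref), error_std(recon, ref))
```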

Morphological accuracy
A DEM not only represents the ground elevation at each of its cells, but also reveals the structure of the topography. As the skeleton of topography, topographic structure decides the spatial pattern of geomorphology (Wilson, 2012). Hence, the accuracy in representing the topographic structure is an essential indicator for DEM quality assessment. In the case of urban topography, the topographic structure may be mainly reflected by road networks and building clusters that have a significant impact on surface runoff and flow processes. Accordingly, the morphological accuracy, i.e., the assessment of feature difference, can be evaluated by measuring the variances of the road profiles and building boundaries between the reconstructed urban DEM and the reference data.
The road-profile variance is measured through the following steps: 1) input the road centerlines and merge subsections of each centerline to ensure road integrity; 2) densify vertices along each road centerline stepped by the cell size of the reconstructed urban DEM; 3) obtain the road centerline profiles from both the reconstructed DEM and the reference data; and 4) apply the Pearson's correlation coefficient (PCC) to quantify the variance between profiles for each road, and take the average and standard deviation (STD) of PCC to denote the difference.
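The PCC is calculated as follows: $$\mathrm{PCC} = \frac{\sum_{i=1}^{m}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{m}\left(x_i - \bar{x}\right)^2}\,\sqrt{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}}$$ where $m$ represents the number of profile vertices, $x_i$ and $y_i$ are the values of the reconstructed and reference profiles being compared, and $\bar{x}$ and $\bar{y}$ are their respective means.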
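Step 4 of the road-profile procedure can be sketched as follows, assuming the profiles have already been sampled at the densified vertices. The function names are illustrative, and using the population standard deviation for the spread of PCC across roads is an assumption.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two elevation profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("PCC is undefined for a perfectly flat profile")
    return cov / (sx * sy)


def summarize_roads(
    pairs: List[Tuple[Sequence[float], Sequence[float]]]
) -> Tuple[float, float]:
    """Average and standard deviation of PCC over all roads."""
    values = [pearson(recon, ref) for recon, ref in pairs]
    return statistics.mean(values), statistics.pstdev(values)


roads = [([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),   # same shape as the reference
         ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])]   # mirrored profile
mean_pcc, std_pcc = summarize_roads(roads)
```

A PCC close to 1 means the reconstructed profile follows the shape of the reference profile even if a constant elevation offset remains, which is why it complements the pointwise metrics.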
The variance of the building boundaries can be measured through three steps: Step 1 is to count the reference data by: 1) preprocessing the building polygons by merging the adjacent polygons and deleting small and discrete patches according to an area threshold of 20 m²; 2) obtaining the boundary line of each building patch and converting all boundary lines to a raster format using the cell size of the reconstructed DEM; and 3) counting the boundary grid cells as the reference truth.
Step 2 is to extract the building boundaries from the reconstructed urban DEM by: 1) highlighting the boundaries between features (e.g. the boundary where a building meets a road) by an edge-enhancement (or high-pass) filter in the ArcGIS software; 2) screening the candidates of boundary grid cells via an edge threshold of 1; and 3) obtaining the boundary cells using a thinning tool available in the ArcGIS software.
Step 3 is to quantify the variance by: 1) selecting the boundary cells from Step 2 according to the location of the reference boundary lines with no buffer, and buffers of 1, 2, and 3 times of the cell size, respectively; and 2) calculating the ratio between the number of selected cells and that of the reference truth successively.
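Step 3 can be sketched with boundary cells stored as sets of (row, column) indices. The square neighbourhood used as the buffer shape and the function names are assumptions; in the original workflow the boundary extraction itself is carried out with ArcGIS tools and is not reproduced here.

```python
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int]


def buffer_cells(cells: Set[Cell], k: int) -> Set[Cell]:
    """All cells within k cells of a reference cell (square neighbourhood)."""
    return {(r + dr, c + dc)
            for r, c in cells
            for dr in range(-k, k + 1)
            for dc in range(-k, k + 1)}


def boundary_ratio(extracted: Set[Cell], reference: Set[Cell], k: int) -> float:
    """Selected extracted boundary cells divided by reference boundary cells."""
    selected = extracted & buffer_cells(reference, k)
    return len(selected) / len(reference)


def ratios_by_buffer(extracted: Set[Cell], reference: Set[Cell],
                     buffers: Iterable[int] = (0, 1, 2, 3)) -> Dict[int, float]:
    """Ratio for no buffer and for buffers of 1, 2 and 3 cell sizes."""
    return {k: boundary_ratio(extracted, reference, k) for k in buffers}


reference = {(0, 0), (0, 1), (0, 2), (0, 3)}
extracted = {(0, 0), (1, 1), (2, 2), (5, 5)}
result = ratios_by_buffer(extracted, reference)
```

Because the denominator is the reference cell count while the buffer widens the selection area, the ratio is not strictly bounded by 1 for thick extracted boundaries, which is one reason the thinning step in Step 2 matters.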

Experiments and results

Study area and data
As one of the largest cities in the world, London, UK is highly urbanized with a population of 8 million and is selected as the study area. We first train the MSM-CNN model using three […]
[…] with MSM-CNN reconstruction is slightly inferior to that of BI but better than that of CC and IDW; MSM-CNN also returns a similar but slightly higher RMSE than the other reconstruction methods. […] profiles as shown later in Fig. 9.
Overall, the numerical accuracy of the MSM-CNN reconstructions is mostly higher than that achieved by other interpolation methods. Meanwhile, it is noted that the differences in numerical accuracy between MSM-CNN and other interpolation methods are not significant, which appears to contrast with the visual analysis of the reconstruction results presented in Fig. 5. The reason may be that the local elevation variation of urban topography in the reconstructed area is relatively small, and the overall statistics may not effectively reflect the small differences. It is therefore necessary to further investigate this by considering the morphological accuracy for quality analysis as well as conducting numerical accuracy assessment in groups, such as slope ranges and land covers.

Vertical accuracy based on slope classification
We further investigate the vertical accuracy based on slope classification. The topographic features are divided into ten ranges according to slope, and then MAE and RMSE are respectively calculated for each of these ranges (Fig. 6). Table 2 lists the average MAEs and RMSEs for all of the ten slope ranges. Herein, the slope data is derived from the original 0.5 m urban DEM. From Fig. 6a-c, a general increasing trend is observed for both MAEs and RMSEs calculated for the different reconstruction results as the slope gradually increases. This indicates that the urban terrain relief as indicated by the slope factor has an obvious influence on the vertical accuracy of DEM reconstruction. As shown in Table 2, among all four approaches, MSM-CNN returns the highest accuracy, confirmed by low RMSE and MAE for the reconstructions from all of the adopted low-resolution urban DEMs. The superior accuracy is maintained across all slope ranges until the slope is ≥ 100%, which covers 76% of the whole reconstruction area. […] takes up 24% of the total area, the influence on the reconstruction results is evident. The findings may also explain the overall accuracy assessment result in Table 1, where the MSM-CNN reconstruction result from the 8 m DEM is slightly 'less accurate' than those obtained using other interpolation methods.
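The grouping of errors by slope range can be sketched as below. The class boundaries in the example are hypothetical, since the ten ranges used in the study are not listed here; slope values are in percent and cells are flattened into lists.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def mae_by_slope(recon: Sequence[float], ref: Sequence[float],
                 slope: Sequence[float],
                 edges: Sequence[float]) -> List[Optional[float]]:
    """MAE per slope class; `edges` are ascending class boundaries in percent.

    A cell with slope s falls in class k where edges[k-1] <= s < edges[k];
    classes without any cell return None.
    """
    sums = [0.0] * (len(edges) + 1)
    counts = [0] * (len(edges) + 1)
    for x, y, s in zip(recon, ref, slope):
        k = bisect_right(edges, s)
        sums[k] += abs(x - y)
        counts[k] += 1
    return [total / n if n else None for total, n in zip(sums, counts)]


# Hypothetical boundaries: < 10 %, 10-100 %, >= 100 %.
per_class = mae_by_slope(recon=[1.0, 2.0, 3.0, 4.0],
                         ref=[1.0, 1.0, 1.0, 1.0],
                         slope=[5.0, 50.0, 50.0, 150.0],
                         edges=[10.0, 100.0])
```

The same grouping applies to land-cover classes by replacing the slope lookup with a class label per cell.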

Vertical accuracy based on land cover classification
For urban topography, terrain change is closely related to land cover types. Therefore, the vertical accuracy of the reconstructed DEMs from different approaches is also analyzed for various land covers. Herein, the urban land cover is divided into five types for analysis, including roads, buildings, natural environment, multi-surface and other. Natural environment is defined to include those areas representing the geographic extent of natural environments and terrain. Multi-surface comprises all of the man-made surfaces that are mainly around buildings, such as yards and plazas. Except for the first four types, the rest is classified as 'other'. Fig. 7 illustrates the distribution of different types of land cover in a sample area within the case study site.
[…] topography. In urbanized cities, natural environment is commonly much less dominant than other land cover types, which indicates that its influence on the overall accuracy of DEM reconstruction is small. It is interesting to note that, for the road and building land cover types, the MAEs of the MSM-CNN DEMs reconstructed from all three low-resolution urban DEMs are much smaller than those of the other reconstructions. These are the two major land cover types in the urbanized area and cover approximately 40% of the total area in the current study site. The performance analysis results demonstrate that the current MSM-CNN approach offers better capability in restoring urban topographic structure with high fidelity. In addition, the errors calculated for the multi-surface land cover type are relatively high for all reconstructions although the corresponding topography inherently has a low relief. A possible reason may be that vegetation was not removed from the original 0.5 m urban DEM created from LiDAR data. Vegetation cover may significantly affect the reconstruction accuracy because its elevation changes irregularly and behaves like random noise, which is difficult to reconstruct reliably from low-resolution urban DEMs.
[…] DEMs, the oscillations in the BI, CC and IDW products are so strong that the centerline profiles can no longer be recognized as a road. Moreover, the three CC road profiles unexpectedly show many deep ditches, which are again inconsistent with normal urban road morphology. The results confirm the much superior capability of the proposed MSM-CNN model in reliably reproducing urban morphology. Based on the previous accuracy assessment results, BI produces better reconstruction results than the other two interpolation methods. Therefore, the following analysis is focused on comparing the morphological accuracy between the MSM-CNN and BI reconstruction results. Table 3

Accuracy evaluation based on building boundary reconstruction
Using the extraction method described in Subsection 3.2, building boundaries are delineated from the MSM-CNN and BI reconstructed DEMs for comparison, as shown in Fig. 10, in which the reference boundary data is also presented in the vector format. As shown in Fig. 10a for 16-time reconstructions, the overall shapes of the boundaries are reasonably well reproduced by MSM-CNN, although certain fine-level details are smoothed out, as expected. However, almost no building boundary can be detected from the BI reconstruction. To quantify the morphological accuracy of building boundary reconstruction, the percentage of correctly restored boundary cells is calculated and plotted in Fig. 11.

Influence of training data on the reconstruction performance
It has been widely recognized that the quality of training data has a major influence on the performance of a deep learning model (LeCun et al., 2015). For MSM-CNN, the reconstruction accuracy is potentially influenced by three factors: 1) typicality; 2) coverage; and 3) scale of the training data. Typicality requires that the training data should represent the typical features of the urban topography to be reconstructed. Ideally, the training data should cover typical sample areas of the reconstruction site. The recent development in UAV remote sensing technology has made rapid acquisition of high-accuracy topographic data a common practice (Pajares, 2015; Florinsky, 2018). Although the application of this technology to a large area is still expensive, it is entirely feasible to survey a number of smaller sample (typical) areas within the reconstruction site to reduce the cost. In theory, the larger the area the training dataset covers, the more features of the urban topography can be learned. Nevertheless, large coverage of training data inevitably increases the cost of obtaining the sample datasets and training the learning model. Therefore, it is necessary to find a balance between the reconstruction accuracy and the coverage of training datasets. For the implementation of the multi-scale reconstruction approach, this work applies the NNS resampling method to produce the low-resolution urban DEMs. For real-world applications, existing low-resolution urban DEMs will be used instead of resampled data. Meanwhile, the range between the lower and upper resolutions used for training should preferably cover the scale range of the high-resolution reconstruction. In addition, it is worth noting that there may be a scale limit for the low-resolution urban DEM to be reconstructed, which implies that the reconstructed result may not reach the expected accuracy if the input low-resolution urban DEM becomes too coarse.
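The generation of low-resolution training inputs by NNS resampling can be sketched as follows. Keeping the top-left cell of each block is an assumption, as the text does not state which cell is retained; the factors 4, 8 and 16 correspond to going from the 0.5 m DEM to 2, 4 and 8 m.

```python
from typing import List

Grid = List[List[float]]


def nn_downsample(dem: Grid, factor: int) -> Grid:
    """Keep one cell per factor x factor block (top-left cell, an ASSUMPTION)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [row[::factor] for row in dem[::factor]]


# Hypothetical 4 x 4 grid standing in for a high-resolution DEM tile.
fine = [[float(4 * r + c) for c in range(4)] for r in range(4)]
half = nn_downsample(fine, 2)      # 2 x 2 grid
quarter = nn_downsample(fine, 4)   # single cell
```

Unlike averaging, this resampling copies existing elevations without smoothing, so the coarse grid keeps abrupt steps such as building edges wherever a retained cell happens to fall on them.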

Enhancing reconstruction quality with additional terrain information
The quantitative assessment presented above indicates that the reconstruction accuracy varies with land covers, slope ranges and details of man-made constructions. This implies that the features of urban DEMs may be better learned by including additional terrain information to improve reconstruction quality. For example, land covers provide dominant features of urban topography. Land cover types may be considered in the learning process by distinguishing different types of topographic features, e.g. buildings, roads, water surfaces and natural environments (i.e., natural terrain with relatively high relief). Terrain attributes, such as slope, curvature or roughness (Wilson and Gallant, 2000), define the multi-dimensional features of urban topography and may also be considered to improve the proposed deep learning process. These attributes can be straightforwardly derived from the corresponding urban DEMs; once the multi-layer attributes are classified, the weight of each layer may also be considered to facilitate a better learning process. Semantic knowledge is another source of information that may be considered. Herein, topographic semantics refers to the rules of urban construction, for example, the transversal and longitudinal gradients of roads and the relative relationship between the upper and lower parts of a side slope. The semantic knowledge may be utilized to refine the urban topography. Overall, MSM-CNN can be further improved to accommodate more topographic information to enhance its performance, which should be investigated in future research.
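As an illustration of deriving a terrain attribute from the DEM itself, the sketch below computes slope in percent with finite differences and pairs it with the elevation grid as a second input layer. The differencing scheme (central differences inside, one-sided at the edges) and the two-layer layout are assumptions, not part of the proposed model.

```python
import math
from typing import List

Grid = List[List[float]]


def slope_percent(dem: Grid, cell_size: float) -> Grid:
    """Slope in percent from finite differences; needs at least a 2 x 2 grid."""
    n_rows, n_cols = len(dem), len(dem[0])
    out: Grid = []
    for r in range(n_rows):
        r0, r1 = max(r - 1, 0), min(r + 1, n_rows - 1)
        row = []
        for c in range(n_cols):
            c0, c1 = max(c - 1, 0), min(c + 1, n_cols - 1)
            # Central difference inside the grid, one-sided at the edges.
            dz_dx = (dem[r][c1] - dem[r][c0]) / ((c1 - c0) * cell_size)
            dz_dy = (dem[r1][c] - dem[r0][c]) / ((r1 - r0) * cell_size)
            row.append(100.0 * math.hypot(dz_dx, dz_dy))
        out.append(row)
    return out


# A ramp rising 0.5 m per 0.5 m cell has a slope of 100 % everywhere.
ramp = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
slope = slope_percent(ramp, cell_size=0.5)
layers = [ramp, slope]  # hypothetical two-layer input: elevation and slope
```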

Conclusions
In this paper, we have proposed an innovative approach, MSM-CNN, to reconstruct high-resolution urban DEMs from low-resolution equivalents. In order to effectively account for the complexity of urban topography, a multi-scale CNN model is utilized to enhance the reconstruction quality. After the correlations between the low- and high-resolution urban DEMs are learned by the designed training process, the urban DEM at a specified high resolution can be accurately restored from a low-resolution equivalent.
A two-level accuracy assessment procedure including both numerical accuracy and morphological accuracy is also designed to evaluate the performance of the proposed MSM-CNN method, by comparing with other DEM reconstruction methods including IDW, BI and CC. The results show that the high-resolution urban DEMs of 0.5 m can be effectively restored by MSM-CNN from the low-resolution urban DEMs of 2, 4 and 8 m. Also, the MSM-CNN reconstructions are consistently better than the results produced by other methods, in terms of the visual assessment, and also numerical and morphological accuracy.
The promising results presented in this work demonstrate that MSM-CNN has the potential for use in generating high-resolution urban DEMs from low-resolution DEMs, instead of surveying the whole city. In recent years, a number of commercial global DEM products have been released to provide better resolution to represent urban topography, e.g., ALOS AW3D, NEXTMAP World 10, and WorldDEM. These provide a rich data source for applying MSM-CNN to reconstruct high-resolution DEMs for cities, which will have profound implications in many applications, including supporting the use of modern flood modelling tools to facilitate more accurate urban flood risk assessment.