2.1. Study Site
Yangon city (
Figure 2a), the former capital of Myanmar, was selected as the study area due to its intense urban expansion within the last two decades. As per the 2014 Myanmar population census [
44], urban Yangon has 5.16 million inhabitants, an increase of 85% over 1983 estimates. Over roughly the same period, between 1979 and 2009, Yangon’s urban area expanded about 5-fold [
45], most of which took place within the last decade. Apart from this, Yangon lies in one of the world’s most disaster-prone countries. Yangon is situated on hilly terrain surrounded by a river and is at high risk of earthquakes and floods. The country was affected by Cyclone Nargis in 2008 and the Shan State Earthquake in 2011, which displaced several thousand people. Alarmingly, simulations of future urban expansion have shown that development will continue in flood-prone and earthquake-risk areas [
46]. A land cover map of Yangon that shows built-up areas, water-bodies, vegetation, and fallow land for the year 2015 is presented in
Figure 2b. Land cover types were classified using cloud-free Landsat-8 surface reflectance imagery available in Google Earth Engine [
47]. In this paper, the fallow-land class refers to non-cultivated agricultural land and other bare lands, while the vegetation class refers to both forests and agricultural land with crops. Central Yangon has seen vertical expansion through the construction of several new buildings alongside older industrial and residential areas and colonial buildings. Rapid horizontal expansion has taken place from the center to the periphery, stretching the built-up boundary.
2.2. Data Used
SRTM: The Shuttle Radar Topography Mission (SRTM) was an international effort led by NASA and the US National Geospatial-Intelligence Agency (NGA). Its DSM was processed from C-band and X-band radar imagery collected from two antennae atop the Space Shuttle during an 11-day mission in February 2000 [
48] and has an absolute vertical error of less than 9 m [
49]. Until 2014, the global dataset was available at a 3-arcsecond posting for regions outside of the US. In 2015, the LP DAAC (Land Processes Distributed Active Archive Center) released the NASA SRTM Version 3.0 Global 1-arcsecond dataset (SRTMGL1) [
50]. At a global scale, the 1-arcsecond version (SRTMGL1) has the same root-mean-square error (RMSE) of 10.3 m as its 3-arcsecond version [
51]. Its RMSE ranges from 5.9 m in urban areas to 10.4 m in bushland [
32,
52]. In this research, the 1-arcsecond (approximately 30 m at the equator) SRTMGL1 was used and is subsequently referred to as SRTM. It is available from the USGS EarthExplorer website [
53].
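The relation between an angular posting and its ground spacing can be checked with a short calculation. The following is a minimal sketch under a spherical approximation; the function name is illustrative, and the latitude used for Yangon (about 16.8° N) is an assumption that is not stated in the text.

```python
import math

WGS84_EQUATORIAL_RADIUS_M = 6378137.0  # semi-major axis of the WGS84 ellipsoid


def arcsecond_spacing_m(arcseconds, latitude_deg=0.0):
    """Approximate east-west ground distance of an angular posting.

    Spherical approximation: the length of an arc of longitude shrinks
    with the cosine of the latitude.
    """
    angle_rad = math.radians(arcseconds / 3600.0)
    return WGS84_EQUATORIAL_RADIUS_M * angle_rad * math.cos(math.radians(latitude_deg))


equator_1as = arcsecond_spacing_m(1.0)                    # 1-arcsecond posting at the equator
equator_3as = arcsecond_spacing_m(3.0)                    # 3-arcsecond posting at the equator
yangon_1as = arcsecond_spacing_m(1.0, latitude_deg=16.8)  # assumed latitude of Yangon
```

Under these assumptions, the 1-arcsecond posting corresponds to slightly under 31 m at the equator and somewhat less in the east-west direction at Yangon's latitude, which is consistent with the "approximately 30 m" stated above.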
ASTER: The Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) Global Digital Elevation Model Version 2 (GDEM V2) is a DSM produced by NASA and Japan’s Ministry of Economy, Trade and Industry (METI). It is freely available at a 1-arcsecond posting from the USGS EarthExplorer. The DSM was generated from nadir and backward-looking visible and near-infrared imagery from the ASTER sensor aboard NASA’s Terra satellite. It was compiled from over 1.5 million scenes acquired between 2000 and 2009 and released in 2011 [
54]. GDEM V2 is an improved version of the earlier GDEM V1 in terms of spatial resolution and coverage, water body mask, and horizontal and vertical accuracy [
55]. Still, its elevation values are noisier because a smaller correlation kernel was used to enhance the horizontal resolution. The RMSE of the ASTER GDEM varies with location [
32,
56] and is influenced by the land cover type, varying from 15.1 m in forested mountainous areas [
54] to 23.3 m in urban areas [
57]. In this study, ASTER GDEM V2 was used and is further referred to as ASTER.
TanDEM-X: TanDEM-X (TerraSAR-X Add-On for Digital Elevation Measurements) was launched in 2010 by the German Aerospace Center (DLR) with the aim of generating WorldDEM, a consistent global DSM. Its identical twin, TerraSAR-X, was launched earlier, in 2007, and both satellites collect microwave imagery with X-band single-polarized SAR antennae. A distinctive feature of this mission is its bistatic data collection mode, in which the two satellites orbit with a short baseline and image the same location at the same time. This greatly reduces the effects of atmospheric disturbances. Marconcini et al. [
58] demonstrated promising results of building height extraction over the Yellow River Delta, China using preliminary TanDEM-X DEM. Wessel et al. [
59] validated the 12 m resolution TanDEM-X DEM with GPS measurements scattered over the United States and established its RMSE accuracy for urban (1.4 m) and vegetation areas (1.8 m). Its vertical RMSE over the mostly urban Tokyo was evaluated as 3.2 m [
60], with higher errors occurring over built-up and vegetation classes. The final WorldDEM is publicly available at a 90 m resolution. The 12 m and 30 m resolution versions are available free of charge for research proposals (through DLR) and are sold for commercial use (through Airbus Defence and Space). As part of a research project, a pair of TanDEM-X HH-polarization images in ascending orbit, acquired in StripMap mode (ground spatial resolution between 2 and 3 m) on 6 September 2011, was obtained. The incidence angle of the master image was 44.57° with a height of ambiguity of 50.14 m. A 12 m TanDEM-X InSAR DSM was generated in [
60] and upsampled to a 5 m resolution for comparison with other DSM products.
AW3D: The ALOS World 3D (AW3D©JAXA) DSM, publicly released by JAXA in 2016, is the most recent DSM considered in this paper. The AW3D DSM was generated using images from PRISM’s (Panchromatic Remote-Sensing Instrument for Stereo Mapping) front, nadir, and backward-looking panchromatic bands aboard ALOS (Advanced Land Observing Satellite). PRISM sensors were in operation between 2006 and 2011 and acquired imagery at a 2.5 m resolution which was processed with a 5 m grid spacing to generate a global elevation dataset, AW3D [
61]. The AW3D DSM is commercially distributed at a 5 m resolution, while a 30 m downsampled dataset (known as ‘AW3D30’) is publicly available. The AW3D DSM generally meets the 5 m RMSE target height accuracy as per its producers [
61]. However, Takaku et al. [
61] found slope-dependent errors, with errors greater than 5 m occurring for slope angles larger than 30 degrees. Using longitudinal profiles of airport runways, Caglar et al. [
62] found that AW3D30 has an RMSE of 1.78 m and contains an elevation anomaly due to sensor noise and the processing algorithm. Takaku et al. [
61] found a mostly positive bias, while Caglar et al. [
62] identified a negative bias in elevation estimation. In the Philippines, AW3D30’s RMSE varies from 4.3 m in urban areas to 6.8 m in areas with dense vegetation [
32]. Estoque et al. [
63] found that heights filtered from the AW3D5 DSM are more accurate for lower buildings (e.g., ground truth building height <100 m) in less dense cities than for high-rise buildings and denser cities. In this research, a commercial 5 m DSM [
64] was obtained as part of the research project, while the freely distributed 30 m AW3D DSM was downloaded from [
65]. The 5 m resolution and 30 m resolution AW3D DSMs are henceforth referred to as AW3D5 and AW3D30, respectively.
Reference data: Ideally, the heights obtained from ground control points should be used as references. A higher-resolution surface model can also be used as a reference when ground control data are unavailable [
66]. A high-resolution DSM was generated from 0.5 m resolution commercial GeoEye-1 stereo image pairs acquired in 2013 over Yangon. The DSM was then resampled to 4 m using PCI Geomatica 2015 software. The digital terrain model (DTM) was extracted by the in-built Wallis filter, which is a local adaptive filter that is useful for areas with significant shadow. The DSM generated with GeoEye-1 image pairs has a vertical RMSE accuracy ranging from 0.57 m in flat areas to 0.87 m in urban areas [
67]. The completeness of the DSM in urban areas is 63.23% due to occlusion resulting from a high base/height (B/H) ratio (ratio of the image-pair distance to the height of the sensor) and the convergence angle of the imaging geometry [
67]. In the pair used in this research, the stereo images also had different acquisition times, which affected the quality of the generated DSM at some locations. For example, image matching was inaccurate over pagodas with metallic roof plates because they appeared different in the two images of the stereo pair owing to the changed sun-view angle. This led to improper registration and erroneous height estimation.
Stable structures: Since the open DSMs (AW3D30, ASTER, and SRTM) were acquired in different years, their DBHs cannot be compared directly in a fast-developing city like Yangon. To overcome this limitation, ‘stable structures’ were identified for comparison. These are buildings that were consistently present between 2003 and 2011 and can be identified visually from historical imagery in Google Earth Pro. The year 2003 is the earliest year for which high-resolution optical imagery is available. Care was taken to select only those structures that appear without any errors in the GeoEye DBH. In total, 52 ‘stable structures’ were identified, which included large pagodas and temples, colonial buildings, a palace, government offices, a sports complex, large hotels, and residential apartments. Some examples are shown in
Figure 3. A polygon was drawn manually around each stable structure’s footprint.
All DSMs used in this research are summarized in
Table 1. All DSMs and DBHs were referenced to the World Geodetic System (WGS84) horizontal datum and the Earth Gravitational Model 1996 (EGM96) vertical (geoid) datum. Registration that is accurate to the pixel level is desirable for comparison. Since the DSMs were originally not georegistered with each other, we co-registered each DSM and DBH with the reference GeoEye DSM. Thirty ground control points for high-resolution DSMs and 15 tie-points spread evenly over the study area were selected for each co-registration. This was performed in the map registration module of the software ENVI 4.7 (Exelis Visual Information Solutions, Boulder, CO, USA) using a rotation, scaling, and translation technique, followed by cubic convolution resampling. Separate co-registration of DSMs and DBHs was done to prevent the influence of interpolation on height estimation.
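To illustrate the rotation, scaling, and translation step, the following minimal sketch fits a least-squares similarity transform to a set of tie-points. It assumes a single uniform scale factor and uses hypothetical function names; it is not the ENVI implementation, whose internals may differ, and the resampling step is omitted.

```python
import cmath


def fit_rst(source_pts, target_pts):
    """Least-squares rotation-scale-translation mapping source onto target.

    Points are (x, y) tuples. Each point is treated as a complex number z,
    and the model is z' = c * z + t, where c encodes scale and rotation.
    Returns (scale, rotation_rad, (tx, ty)).
    """
    src = [complex(x, y) for x, y in source_pts]
    dst = [complex(x, y) for x, y in target_pts]
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    # Closed-form least-squares solution on the centered coordinates.
    num = sum((s - src_mean).conjugate() * (d - dst_mean) for s, d in zip(src, dst))
    den = sum(abs(s - src_mean) ** 2 for s in src)
    c = num / den
    t = dst_mean - c * src_mean
    return abs(c), cmath.phase(c), (t.real, t.imag)


def apply_rst(points, scale, rotation_rad, translation):
    """Apply the fitted transform to a list of (x, y) points."""
    c = cmath.rect(scale, rotation_rad)
    t = complex(*translation)
    moved = [c * complex(x, y) + t for x, y in points]
    return [(z.real, z.imag) for z in moved]
```

With at least two distinct tie-points the solution is determined; additional points, such as the 15 or 30 used here, average out individual picking errors.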
2.3. DBH Generation
There are several types of building extraction based on the desired or possible details, ranging from building footprints to building roof contours [
23]. As per the study objective and data limitations, the focus was on building height extraction. A DBH is different from a digital building model (DBM), which is a more comprehensive 3D representation of buildings and includes all aspects of the building geometry [
6]. DBH is considered a normalized DSM (nDSM) over built-up class pixels. An nDSM is calculated as the difference in elevation values between the DSM and DTM (digital terrain model, also known as a bare earth model). The extraction of an nDSM requires distinguishing ground from non-ground pixels by generating a DTM. Most algorithms first generate the DTM from a photogrammetric DSM by identifying pixels which are part of the local terrain [
68]. There are several methods for identifying non-ground pixels, but they often assume that the terrain is smooth and that a large height difference exists between neighboring ground and non-ground points [
69]. Deep learning approaches have resulted in high-accuracy building extraction (overall accuracy > 95%), with very high resolution imagery [
70,
71]. However, these networks are designed for small-sized images (e.g., 256 × 256 pixels, 512 × 512 pixels, etc.) to prevent memory overloading, which can produce discontinuous artifacts [
72]. Many such models rely on a fully connected neural network [
73], which is pre-trained on an RGB image repository (ImageNet [
74]), and exploit features common to RGB intensity images and depth images, such as edges, corners, and end-points [
72]. In the case of a coarse-resolution DSM, such features are not clearly visible, and we were skeptical of their performance with coarse resolution. Recognizing these possible limitations, a morphological approach—a multi-directional processing and slope-dependent filtering technique called ‘MSD filtering’ [
75]—was used for DTM generation in light of its consideration of the terrain slope and overall simplicity in implementation [
69]. The MSD filtering technique is an extension of a similar technique developed for an ALS DSM [
76]. MSD filtering is effective for extracting a DTM from a sub-meter-resolution DSM over hilly, sloping terrain [
75]. An enhancement of MSD filtering, the ‘network of ground points’ technique, also exists [
77] and does not need to consider the slope angle. However, as admitted by Mousa et al. [
77], this probably holds true only for very high resolution DSMs. Therefore, we implemented the MSD method instead of the ‘network of ground points’ method. MSD filtering has also been used to generate a DTM for the alignment of high-resolution optical and SAR images in urban areas [
78].
The MSD filtering technique requires four parameters to generate a DTM: the Gaussian smoothing kernel size, the scanline filter extent, the height threshold, and the slope threshold. Each DSM pixel was tested for ground membership by comparing it with the other pixels within the predefined scanline filter extent along each of eight directions. A pixel identified as ground in more than five directions was labeled as a terrain pixel by majority voting. To enable the comparison, a local reference terrain slope was first generated by 2D Gaussian smoothing. Then, the pixel’s height was compared with that of the lowest pixel within the scanline filter extent. If this height difference exceeded the height threshold, the pixel was classified as non-ground. Otherwise, the slope between the current pixel and the successive pixel in the scanline direction was examined: if it was greater than the slope threshold, the pixel was labeled as non-ground; if it was positive and less than the slope threshold, the pixel was given the same label as the previous pixel; otherwise, the pixel was labeled as ground. This resulted in a raster containing only ground points and holes, the latter being locations where non-ground points exist. Thereafter, a linear interpolation routine from the SciPy library for Python [
79] was used to fill the holes for generating the DTM. The nDSM was generated by subtracting the DTM from its DSM.
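The steps described above can be sketched for a single scanline. The code below is a simplified, one-dimensional illustration and not the implementation used in this study: the function names are hypothetical, the window is assumed to be centered on the pixel, the Gaussian-smoothed reference terrain is omitted, and a plain linear interpolation along the line stands in for the two-dimensional SciPy interpolation.

```python
import math


def classify_scanline(heights, spacing_m, extent_m, height_thr_m, slope_thr_deg):
    """Label each pixel of one scanline as ground (True) or non-ground (False)."""
    half = max(1, int(round(extent_m / spacing_m)))  # window half-width in pixels
    labels = []
    for i, h in enumerate(heights):
        window = heights[max(0, i - half): i + half + 1]
        if h - min(window) > height_thr_m:
            labels.append(False)       # too high above the lowest nearby pixel
            continue
        if i == 0:
            labels.append(True)        # no previous pixel to compare slopes with
            continue
        slope_deg = math.degrees(math.atan((h - heights[i - 1]) / spacing_m))
        if slope_deg > slope_thr_deg:
            labels.append(False)       # steep upward step
        elif slope_deg > 0:
            labels.append(labels[-1])  # gentle rise: inherit the previous label
        else:
            labels.append(True)        # flat or descending: ground
    return labels


def is_ground_by_vote(direction_labels):
    """Majority vote: ground if labeled ground in more than five of eight directions."""
    return sum(direction_labels) > 5


def fill_holes_linear(heights, ground):
    """Build a DTM profile by interpolating linearly across non-ground pixels."""
    idx = [i for i, g in enumerate(ground) if g]
    if not idx:
        raise ValueError("no ground pixel found on this scanline")
    dtm = []
    for i, h in enumerate(heights):
        if ground[i]:
            dtm.append(h)
            continue
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is None:
            dtm.append(heights[right])  # hole at the start: hold nearest ground height
        elif right is None:
            dtm.append(heights[left])   # hole at the end: hold nearest ground height
        else:
            w = (i - left) / (right - left)
            dtm.append(heights[left] + w * (heights[right] - heights[left]))
    return dtm


def ndsm(dsm, dtm):
    """Normalized DSM: surface height minus terrain height."""
    return [s - t for s, t in zip(dsm, dtm)]
```

For a profile in which a 25 m roof sits between ground at about 10 m and 11 m, the sketch labels the roof pixels as non-ground, interpolates the terrain beneath them, and returns roof heights of roughly 14 m in the nDSM.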
Parameter Selection
The GeoEye nDSM was used as a reference to choose suitable parameter values for the scanline extent, height threshold, and slope threshold. The parameters for the Gaussian smoothing filter were set to a 100 m kernel size and a 25 m standard deviation to generate the initial local terrain. After trying various combinations of height difference thresholds and slope thresholds, 3 m and 30° were chosen, respectively, as they captured the greatest number of structures. A 3 m height difference threshold approximately corresponds to a one-story construction. Lower values of the height difference threshold lead to underestimation of the ground terrain, while higher values lead to overestimation. One drawback to the MSD scanline approach arises when no ground pixels lie within the eight directional scanlines [
77]. This can happen when a structure is contiguous and larger than the scanline extent. The scanline filter extent was therefore set to 100 m or more, which improves the chance of finding a ground pixel within the scanline, since a contiguous urban structure is unlikely to be larger than 100 m in all scanline directions.
The AW3D5 nDSM was generated with a scanline extent of 300 m, a height threshold of 3 m, and a slope threshold of 30°. Setting a lower value for the scanline filter extent (<300 m) underestimated the structures’ footprints and heights; e.g., a scanline extent of 150 m lowered the overall mean height estimate by 0.2 m compared with the DBH generated with a scanline filter extent of 300 m. This was more pronounced for tall structures. Similarly, the TanDEM-X nDSM was generated with a scanline extent of 100 m, a height threshold of 3 m, and a slope threshold of 30°. The same parameters used for AW3D5 were deemed fit to extract the nDSM from AW3D30 and ASTER GDEM V2. Due to the low differentiation between ground and non-ground points in SRTM, the height threshold parameter was lowered to 2 m. In the AW3D5 and TanDEM-X nDSMs thus generated, about 10% of the pixels had negative heights, out of which 90% of the values were between m and 0 m. In the SRTM, ASTER, and AW3D30 nDSMs, 20% of the pixels had negative values, out of which 90% were between m and 0 m. These negative heights were removed.
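The parameter choices above can be collected in a small lookup table, together with a sketch of the negative-height removal. The dictionary keys and function name are illustrative. Two points are assumptions rather than statements from the text: SRTM is taken to share the AW3D5 scanline extent and slope threshold, and "removed" is interpreted as replacing negative values with a no-data marker.

```python
# MSD filtering parameters per DSM, as reported in this section.
MSD_PARAMS = {
    "AW3D5":    {"scanline_extent_m": 300, "height_thr_m": 3, "slope_thr_deg": 30},
    "TanDEM-X": {"scanline_extent_m": 100, "height_thr_m": 3, "slope_thr_deg": 30},
    "AW3D30":   {"scanline_extent_m": 300, "height_thr_m": 3, "slope_thr_deg": 30},
    "ASTER":    {"scanline_extent_m": 300, "height_thr_m": 3, "slope_thr_deg": 30},
    # Assumption: only the height threshold differs from AW3D5 for SRTM.
    "SRTM":     {"scanline_extent_m": 300, "height_thr_m": 2, "slope_thr_deg": 30},
}


def remove_negative_heights(ndsm_values, nodata=None):
    """Replace negative nDSM heights with a no-data marker.

    Returns the cleaned list and the fraction of pixels that were negative.
    """
    cleaned = [h if h >= 0 else nodata for h in ndsm_values]
    negative_fraction = sum(1 for h in ndsm_values if h < 0) / len(ndsm_values)
    return cleaned, negative_fraction
```

Reporting the negative fraction alongside the cleaned values makes it easy to compare a new nDSM with the proportions of negative pixels quoted above.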
2.4. Vertical Accuracy Assessment
There are several accuracy metrics for roof level and roof plane level evaluations [
80]. Recent additions include shape similarity and positional accuracy metrics [
81] and a threshold-free metric based on the overlap between extracted and reference roof planes [
80]. However, the coarse resolution of the DBHs prevents the application of such advanced metrics. For example, in a 30 m gridded DBH, roof planes are not visible except on very large structures that span several hundred meters. Therefore, pixel-based and object-based height accuracies were evaluated with conventional statistical metrics. Object-based heights were derived as the mean pixel height within the footprint polygon of each stable structure. The vertical accuracy of the estimated datasets (DBH and DTM) was analyzed by calculating descriptive statistics of the difference between the estimated height and the reference height. These statistics were the root-mean-square error (RMSE), mean error (ME), mean absolute error (MAE), and standard deviation (SD). The RMSE describes the overall magnitude of the deviation of the estimated dataset from the reference dataset. The ME describes the bias toward underestimation (negative ME) or overestimation (positive ME) with respect to the reference dataset. The MAE describes the average error magnitude irrespective of sign. The SD represents the spread of the errors around the mean error (for unbiased errors, the mean error is zero), so a low SD value means less variation in the errors. For any DBH or DTM extracted from DSM D, with an image containing n pixels or objects, the error metrics with respect to the reference DBH or DTM were calculated as shown in Equations (1)–(4).
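The four statistics can be sketched as follows, with the error defined as the estimated height minus the reference height so that a positive ME indicates overestimation. The function name is illustrative, and the SD is assumed here to use the population form (division by n); a sample form with n − 1 is equally plausible.

```python
import math


def error_metrics(estimated, reference):
    """Return ME, MAE, RMSE, and SD of the height errors (estimated - reference)."""
    errors = [e - r for e, r in zip(estimated, reference)]
    n = len(errors)
    me = sum(errors) / n                                  # bias (signed)
    mae = sum(abs(e) for e in errors) / n                 # average magnitude
    rmse = math.sqrt(sum(e * e for e in errors) / n)      # deviation from zero
    sd = math.sqrt(sum((e - me) ** 2 for e in errors) / n)  # spread around ME (assumed population form)
    return {"ME": me, "MAE": mae, "RMSE": rmse, "SD": sd}
```

With the population form, RMSE² = ME² + SD², so the RMSE and SD coincide only when the errors are unbiased. The same function serves the pixel-based and object-based evaluations, with the inputs being either pixel heights or per-structure mean heights.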
Finally, in accordance with Rutzinger et al. [82], the correspondence of a building footprint within the stable structure polygon was checked pixel-wise. For this method, true positive (TP, when the footprint exists in the reference as well as in the DBH), false negative (FN, when the footprint is incorrectly identified as ground), true negative (TN, when ground is correctly identified as ground), and false positive (FP, when a ground pixel is identified as a footprint) pixels within each stable structure polygon were identified. The completeness and correctness were computed according to Equations (5) and (6):
$$\mathrm{Completeness} = \frac{\|TP\|}{\|TP\| + \|FN\|} \quad (5)$$
$$\mathrm{Correctness} = \frac{\|TP\|}{\|TP\| + \|FP\|} \quad (6)$$
where $\|\cdot\|$ denotes the number of pixels.
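The pixel-wise check can be sketched as follows, assuming the usual definitions of completeness, TP/(TP + FN), and correctness, TP/(TP + FP). The function name and the flat list representation of the masks are illustrative choices.

```python
def footprint_scores(reference_mask, extracted_mask):
    """Count TP, FN, FP, TN and derive completeness and correctness.

    Both masks are equal-length sequences of truthy (building) and
    falsy (ground) values for the pixels inside one structure polygon.
    """
    tp = fn = fp = tn = 0
    for ref, ext in zip(reference_mask, extracted_mask):
        if ref and ext:
            tp += 1   # building in both reference and DBH
        elif ref:
            fn += 1   # building missed by the DBH
        elif ext:
            fp += 1   # ground labeled as building
        else:
            tn += 1   # ground in both
    completeness = tp / (tp + fn) if (tp + fn) else float("nan")
    correctness = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "completeness": completeness, "correctness": correctness}
```

Completeness penalizes missed footprint pixels, while correctness penalizes ground pixels that are labeled as building, so the two should be read together.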