Single-Image Building Height Estimation Using Spatial Distribution-Aware Optimization in Complex Urban Areas

Xie, Yakun; Tu, Jiaxing; Zhao, Yaoji; Xia, Ruifeng; Song, Wen; Feng, Dejun; Hu, Ya

doi:10.3390/rs18050801


by Yakun Xie ¹,², Jiaxing Tu ¹, Yaoji Zhao ¹, Ruifeng Xia ¹, Wen Song ³,*, Dejun Feng ¹ and Ya Hu ¹

¹ Faculty of Geosciences and Engineering, Southwest Jiaotong University, Chengdu 610097, China

² State Key Laboratory of Bridge Intelligent and Green Construction, Southwest Jiaotong University, Chengdu 611756, China

³ School of Architecture, Southwest Jiaotong University, Chengdu 610097, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 801; https://doi.org/10.3390/rs18050801

Submission received: 12 January 2026 / Revised: 1 March 2026 / Accepted: 3 March 2026 / Published: 5 March 2026

(This article belongs to the Section Remote Sensing Image Processing)


Highlights

What are the main findings?

A spatial distribution-aware framework is proposed for single-image building height estimation by integrating shadow-based geometric inversion with urban spatial typology classification.
Extensive experiments on 11,168 buildings across 13 Chinese cities demonstrate high accuracy and robustness (MAE = 2.07 m, RMSE = 2.56 m, R² = 0.99) under diverse urban morphologies.

What are the implications of the main findings?

The proposed method enables cost-efficient and scalable large-area building height mapping using single high-resolution optical imagery, reducing dependence on stereo data or dense elevation measurements.
Reliable 3D urban height information is provided to support urban planning, population and energy assessment, climate modeling, and disaster risk analysis in complex urban environments.

Abstract

Building height is a fundamental parameter for characterizing urban three-dimensional structure and supporting applications such as urban planning, population estimation, and energy assessment. However, traditional shadow-based height inversion methods often suffer from occlusion, shadow overlap, and orientation inconsistencies when applied to heterogeneous urban environments. This study proposes a single-image building height estimation method that explicitly incorporates spatial distribution characteristics to enhance robustness and estimation accuracy. Shadow lengths are first robustly extracted using a fishnet–Pauta strategy, followed by a multi-scenario scaling coefficient model accommodating different sun–sensor geometric configurations. Urban areas are then subdivided into high-rise, mid-to-high-rise mixed, and dense low-rise zones using DBSCAN clustering and a composite indicator system. For each spatial type, tailored optimization strategies—including neighborhood-weighted correction, similarity-constrained local regression, and median smoothing—are applied to suppress systematic biases and local outliers. Experiments on 11,168 buildings across 13 Chinese cities demonstrate strong overall performance, achieving an MAE of 2.07 m, an RMSE of 2.56 m, and an R² of 0.99. The proposed method outperforms existing approaches and remains highly stable across diverse urban morphologies, providing a scalable solution for large-area building height mapping from single high-resolution imagery.

Keywords: height estimation; spatial distribution; remote sensing image; shadows

1. Introduction

With the continuous advancement of global urbanization, the urban population has been expanding rapidly. According to a report by the United Nations, approximately 68% of the world’s population is projected to reside in cities by 2050 [1]. Constrained by limited land resources, urban spatial development has gradually shifted from two-dimensional expansion to three-dimensional growth, with high-rise and super high-rise buildings increasingly becoming prominent features of urban landscapes [2,3,4]. Against this backdrop, building height has emerged as a critical parameter for characterizing urban spatial structure, functional distribution, and vertical development. It plays a vital role in urban planning [5,6,7] and population estimation [8,9,10], as well as energy assessment and climate simulation [11,12]. Consequently, developing building height estimation methods that are low-cost, efficient, and highly adaptable has become a central topic in urban remote sensing research.

Remote sensing data provide a rich foundation for building height estimation, with mainstream technical approaches primarily including Light Detection and Ranging (LiDAR) [13,14], Synthetic Aperture Radar (SAR) [15,16], and optical remote sensing imagery [17,18]. LiDAR acquires dense three-dimensional point clouds through high-precision laser ranging, enabling highly accurate building height estimation when combined with digital elevation models (DEMs) and building footprints [19]. However, LiDAR data acquisition is costly and often constrained by weather conditions and terrain complexity, which limits its applicability for large-scale mapping and frequent temporal updates. SAR-based approaches estimate building height indirectly by applying interferometric synthetic aperture radar (InSAR) techniques to derive surface phase differences, offering all-weather and all-day observation capabilities [20,21]. Nevertheless, in dense urban environments, SAR performance is frequently affected by speckle noise, multipath scattering, and coherence loss, resulting in reduced stability and accuracy. Optical remote sensing imagery has therefore become one of the most widely used data sources for building height estimation, owing to its high spatial resolution, rich texture information, and convenient accessibility [22,23].

Among optical approaches [24,25], reconstructing a Digital Surface Model (DSM) from stereo imagery has become the mainstream solution. The basic workflow generates a disparity map from a stereo image pair, constructs the DSM, subtracts a Digital Elevation Model (DEM) to obtain a normalized DSM (nDSM), and then derives building heights from the nDSM [26]. Liu et al. combined ZY-3 stereo imagery with the Semi-Global Matching (SGM) algorithm to construct DSMs and extracted building heights using a morphological Top-Hat transformation, achieving high-resolution building height estimates across multiple cities [27]. Wang et al. applied a stereo-pair algorithm to GF-7 imagery of the Beijing region, which improved DSM completeness and detail preservation in complex urban scenes [22]. To address the systematic underestimation of tall buildings, Zhang et al. proposed a stereo matching method incorporating building roof footprint constraints [23]. Experiments conducted in Yingde, Guangdong (8653 buildings) and Xi’an, Shaanxi (40 buildings) demonstrated that this strategy significantly alleviated DSM underestimation. However, the method relies heavily on accurately annotated building footprint data, which limits its scalability for large-area applications.

To further improve large-scale estimation accuracy, Cao and Huang introduced ZY-3 multi-view imagery and a multi-task deep neural network to estimate building heights across 42 cities in China [28]. By integrating multispectral imagery with elevation labels to construct a regression framework, their study demonstrated the feasibility of combining multi-view data and deep learning for large-scale, high-resolution building height estimation. To address the mismatch issues of traditional SGM methods in regions with insufficient texture, large-scale height variation, or severe occlusion, Chen et al. adopted the StereoNet network to reconstruct disparity maps. Experiments conducted in several cities, including Chongqing, Tianjin, and Guangzhou, showed a substantial improvement in height estimation accuracy for tall buildings, achieving more than a 40% reduction in RMSE compared with conventional SGM methods for buildings higher than 60 m [29]. Meanwhile, to mitigate the influence of shadows on disparity matching in GF-7 imagery, Liu et al. applied histogram equalization for shadow compensation and combined building compactness analysis with zonal statistics to perform object-oriented building height estimation from nDSMs. This strategy enhanced adaptability in heterogeneous urban environments, such as dense low-rise zones and high-rise commercial zones [30]. Nevertheless, existing approaches generally depend on high-quality stereo imagery, accurately delineated building footprints, or labeled elevation samples. When confronted with strong intra-urban heterogeneity in spatial distribution—such as variations in building density, functional mixing, and structural compactness—these methods often exhibit notable estimation inconsistencies.

In recent years, low-cost building height estimation methods have gradually expanded beyond reliance on single optical imagery to incorporate non-shadow-based imaging techniques. These approaches include the use of street-view imagery or perspective images to reconstruct building heights through stereo matching or structured light reconstruction, achieving similarly low-cost estimation performance [31]. However, street-view and perspective imagery exhibit limited adaptability in complex urban environments, particularly in high-density areas where severe occlusion and complicated illumination conditions can substantially degrade height estimation accuracy. By contrast, geometric height inversion methods based on building shadows, among the earliest techniques for estimating building height from optical imagery, remain effective for small- to medium-scale areas with clear imaging conditions. Owing to their simple algorithmic structure, low data requirements, and high computational efficiency, these methods continue to be widely used [32,33]. Building height is estimated by measuring shadow length and applying geometric models that incorporate solar elevation angles and sensor parameters [34]. For example, Liasis and Stavrou developed an automatic method for extracting shadow axes and boundaries, achieving a height estimation variance of 4.13% across 198 buildings [35]. Xie et al. proposed a “tangent plus fishnet tangent” strategy combined with RMU-Net for accurate shadow boundary segmentation, constraining height estimation errors within 2 m for 131 buildings [17]. More recently, the integration of ICESat-2 ATL03 photon data as reference height samples for global fitting and correction of shadow-based height inversion models has emerged as a promising direction to further enhance estimation reliability [36,37].

However, shadow-based height inversion methods exhibit notable limitations in complex urban environments, particularly in areas characterized by strong structural heterogeneity. First, in high-density urban settings, building shadows are frequently obstructed or overlapped by adjacent structures and tree canopies, which complicates shadow contour extraction and leads to unstable shadow length measurements, thereby degrading height estimation accuracy [38,39,40]. Second, variations in building orientation relative to solar illumination across different urban regions can introduce systematic bias in shadow length measurement, especially in areas with irregular layouts or highly diverse orientation patterns [41,42]. In addition, traditional shadow-based height inversion approaches often rely on image metadata (e.g., solar elevation angle) or require prior building height samples for calibration, which limits their degree of automation and scalability, making them less suitable for large-area height estimation across multifunctional urban environments [17].

To address the aforementioned challenges, this study proposes a single-image building height estimation method that explicitly incorporates spatial distribution characteristics. The proposed framework integrates spatial typology classification with a region-specific, multi-strategy optimization scheme, enabling building height estimation to be adaptively adjusted according to different urban spatial scenarios. By transforming the height inversion task from a globally uniform estimation problem into a spatially differentiated optimization process, the method effectively mitigates estimation errors arising from heterogeneous building distributions. Moreover, because the proposed method relies solely on a single high-resolution image, it reduces data dependency compared to stereo- or multi-source-based approaches and simplifies the data preparation process. This characteristic facilitates practical application in large-area building height estimation tasks. The main contributions of this paper are summarized as follows:

(1): This study introduces an urban building spatial distribution classification mechanism into the single-image building height estimation framework. By explicitly considering spatial heterogeneity, region-specific height optimization strategies are designed for three typical urban spatial types—high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones—thereby improving estimation accuracy and robustness under diverse spatial distribution conditions.
(2): A joint shadow processing algorithm that integrates fishnet partitioning with the Pauta criterion is developed for shadow length measurement and outlier suppression. This strategy significantly enhances the stability and reliability of shadow extraction under challenging conditions, such as occlusion, shadow overlap, and complex illumination.

The remainder of this paper is organized as follows: Section 2 describes the proposed methodology, including spatial typology classification, shadow length extraction, and multi-strategy height optimization. Section 3 presents the experimental design and results. Section 4 provides discussion and comparison with existing methods. Section 5 concludes the paper and outlines future research directions.

2. Methods

Unlike conventional unified modeling approaches that assume a globally consistent relationship between image-derived features and building height, this study proposes a single-image building height estimation framework that explicitly accounts for the spatial heterogeneity of urban environments. The proposed framework integrates geometric height inversion based on building shadows with spatial distribution-aware optimization, as illustrated in Figure 1. Based on annotated building roofs and shadow boundaries, building shadow lengths are robustly extracted using a fishnet–Pauta strategy and subsequently converted into preliminary height estimates through a scale factor model under three sun–sensor geometric configurations. Rather than directly pursuing absolute height accuracy at the individual building level, the framework characterizes urban spatial patterns using relative height statistics, spatial density, and functional heterogeneity, which serve as key constraints for subsequent spatially differentiated optimization.

Buildings are clustered using DBSCAN and categorized into three representative spatial types: high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones. Through this spatial partitioning, the building height inversion task is reformulated from a globally uniform estimation problem into a region-specific optimization process, in which local spatial context and building distribution characteristics are explicitly incorporated. For each spatial type, differentiated optimization strategies—including neighborhood-weighted correction, similarity-constrained local regression, and median smoothing—are applied to suppress region-dependent systematic biases and local outliers. The proposed method is evaluated on 11,168 buildings across 13 representative cities, demonstrating high accuracy, robustness, and applicability under diverse urban morphologies.

2.1. Preliminary Building Height Extraction

2.1.1. Shadow-Based Building Height Calculation

To enable efficient and automated building height estimation, this study establishes three typical shadow-based height inversion models according to the geometric relationship between building shadows and the sun–sensor configuration. These models correspond to three representative scenarios: (i) the sun and sensor are oriented in the same direction, (ii) the azimuth difference between the sun and sensor exceeds 180°, and (iii) the azimuth difference lies between 0° and 180°, as illustrated in Figure 2. Differences in shadow visibility and observation geometry across the image lead to distinct projection patterns of building shadows, which in turn require the adoption of corresponding height calculation formulations to ensure accurate geometric modeling.

Figure 2a illustrates the case in which the sun and sensor are oriented in the same direction. In this configuration, α denotes the solar elevation angle, β denotes the sensor elevation angle, AB represents the building height, BC denotes the shaded façade portion of the building, BD corresponds to the total shadow length, and CD indicates the shadow segment observable in the remote sensing image. Under this geometric condition, the effective shadow length measured from the image is CD, and the building height AB can be calculated using Equation (1):

$$AB = CD \times \frac{\tan\alpha \, \tan\beta}{\tan\beta - \tan\alpha} \tag{1}$$
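One way to see where Equation (1) comes from is sketched below. It assumes that BD is the full ground shadow and that BC is the part of it hidden behind the building from the sensor's viewpoint. This reading of Figure 2a is an assumption, since the figure is not reproduced here.

```latex
\begin{aligned}
BD &= AB \cot\alpha, \qquad BC = AB \cot\beta, \\
CD &= BD - BC = AB\,(\cot\alpha - \cot\beta)
    = AB \, \frac{\tan\beta - \tan\alpha}{\tan\alpha \, \tan\beta}, \\
AB &= CD \times \frac{\tan\alpha \, \tan\beta}{\tan\beta - \tan\alpha}.
\end{aligned}
```

Under this reading the visible shadow CD is positive only when β > α, that is, when the sensor elevation angle exceeds the solar elevation angle.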

According to Equation (1), the building height depends solely on the shadow length measured from the remote sensing image and the fixed sun–sensor parameters at the time of image acquisition. Building height is therefore linearly proportional to the detected shadow length under a given imaging geometry. By defining the proportionality coefficient as $k = \frac{\tan\alpha \, \tan\beta}{\tan\beta - \tan\alpha}$, Equation (1) can be simplified to Equation (2):

$$AB = CD \times k \tag{2}$$

Figure 2b illustrates the scenario in which the azimuth difference between the sun and the sensor exceeds 180°. Under this configuration, the sensor captures the complete shadow cast by the building, so the shaded façade segment BC equals zero. The building height AB can then be calculated using Equation (3):

$$AB = BD \times \tan\alpha \tag{3}$$

Similarly, according to Equation (3), the building height is linearly proportional to the shadow length detected in the image under the given imaging geometry. By defining the proportionality coefficient as $k_1 = \tan\alpha$, Equation (3) can be simplified to Equation (4):

$$AB = BD \times k_1 \tag{4}$$

Figure 2c depicts the case in which the azimuth difference between the sun and the sensor lies between 0° and 180°. Under this condition, the influence of the sensor azimuth on shadow detection must be explicitly considered, as this configuration represents the most common scenario for building shadows in optical remote sensing imagery. The geometric relationship among the sensor, the sun, and the building is illustrated in Figure 2c, where γ denotes the solar azimuth angle, δ denotes the sensor azimuth angle, and ε represents the angle between the building orientation and the shadow projection measured in the clockwise direction. Based on this geometric relationship, the building height can be calculated using Equation (5):

$$AB = \frac{DE \times \sin\varepsilon}{\cot\alpha \, \sin\varepsilon - \cot\beta \, \sin(\varepsilon + \gamma - \delta)} \tag{5}$$

According to Equation (5), the building height is linearly proportional to the shadow length detected in the image under the given geometric configuration. By defining the proportionality coefficient as $k_2 = \frac{\sin\varepsilon}{\cot\alpha \, \sin\varepsilon - \cot\beta \, \sin(\varepsilon + \gamma - \delta)}$, Equation (5) can be simplified to Equation (6):

$$AB = DE \times k_2 \tag{6}$$

Although the three scenarios differ in their geometric configurations, the building height inversion can be uniformly formulated as a multiplicative relationship, building height = shadow length × scale factor (k), where the scale factor is jointly determined by the solar elevation angle, sensor elevation angle, building orientation, and their relative angular relationships.
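The three scale factors can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, all angles are assumed to be given in degrees, and the formulas are taken from Equations (1), (3), and (5) as written above.

```python
import math


def k_same_direction(alpha_deg, beta_deg):
    """Scale factor of Equations (1)-(2): sun and sensor oriented in the same direction."""
    ta = math.tan(math.radians(alpha_deg))
    tb = math.tan(math.radians(beta_deg))
    if tb <= ta:
        # With beta <= alpha the visible shadow CD would not be positive.
        raise ValueError("sensor elevation must exceed solar elevation")
    return ta * tb / (tb - ta)


def k_full_shadow(alpha_deg):
    """Scale factor of Equations (3)-(4): the complete shadow is visible."""
    return math.tan(math.radians(alpha_deg))


def k_oblique(alpha_deg, beta_deg, gamma_deg, delta_deg, eps_deg):
    """Scale factor of Equations (5)-(6): azimuth difference between 0 and 180 degrees."""
    a, b, g, d, e = (math.radians(v) for v in
                     (alpha_deg, beta_deg, gamma_deg, delta_deg, eps_deg))
    denom = math.sin(e) / math.tan(a) - math.sin(e + g - d) / math.tan(b)
    if denom == 0:
        raise ValueError("degenerate geometry: scale factor is undefined")
    return math.sin(e) / denom


# Example: solar elevation 45 degrees, sensor elevation 60 degrees,
# and a visible shadow of 20 m measured in the image.
height = 20.0 * k_same_direction(45.0, 60.0)   # roughly 47 m
```

A useful sanity check on the formulas: when the solar and sensor azimuths coincide (γ = δ), the expression for k₂ reduces algebraically to k, so `k_oblique(45, 60, 150, 150, 70)` and `k_same_direction(45, 60)` agree up to rounding.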

2.1.2. Shadow Length Calculation and Gross Error Elimination

To improve the accuracy of building height estimation, precise measurement of shadow length is a critical prerequisite. This study proposes a shadow length estimation method that combines the fishnet strategy with the Pauta criterion to effectively mitigate errors arising from complex shadow geometries, ambiguous boundaries, and noise interference commonly encountered in traditional approaches. Specifically, a set of evenly spaced parallel vector lines is generated within the shadow region along the solar azimuth to form a feature line set. Each line segment represents the projected shadow length in the solar direction and serves as an input for subsequent building height inversion. However, due to factors such as terrain obstruction, uneven illumination, and image noise, some shadow line segments may exhibit abnormal deviations. Directly using the mean or median of all line segments may therefore amplify estimation errors. To address this issue, a gross-error elimination strategy based on the Pauta criterion is introduced to iteratively refine the feature line set. By computing the mean and standard deviation of the line segment lengths, a confidence interval is constructed according to the 3σ principle, and outliers exceeding this range are removed to enhance the stability and robustness of shadow length estimation. The inter-line spacing is determined through experimental analysis and optimized based on empirical results to balance extraction accuracy and computational efficiency. Finally, the shadow length is obtained as the mean value of the filtered valid line segments. The overall algorithmic workflow is summarized in Algorithm 1.

Algorithm 1 Shadow Length Calculation Method (Combining the Fishnet Method and Pauta Criterion)

Input: S = shadow polygon, θ = solar azimuth angle, d = fishnet line spacing
Output: filtered shadow line segment length

// Feature line generation
generate a set of fishnet lines L_net in the shadow region S based on θ and d
for each line in L_net do
    calculate the intersection points between the line and the building corners
    calculate the length of the intersection line and record it in the length list L_lengths
    if the intersection line does not exceed the corner range then
        keep the intersection line and update the target contour
    else
        remove the out-of-bounds intersection line
    end if
end for
// Gross error elimination
calculate the standard deviation σ and arithmetic mean X of L_lengths
repeat
    for each L_i in L_lengths do
        calculate the residual V_i = |L_i − X|
        if V_i ≤ 3σ then
            retain L_i
        else
            remove L_i
        end if
    end for
    recalculate the mean X and standard deviation σ of L_lengths
until all errors are removed
return L_lengths
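The two stages of Algorithm 1 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the shadow polygon is a list of (east, north) vertices in map units, the solar azimuth is measured clockwise from north, the corner-range check of the algorithm is replaced by simple clipping of each line to the polygon, and the sample standard deviation is used for σ.

```python
import math
import statistics


def fishnet_lengths(polygon, azimuth_deg, spacing):
    """Lengths of parallel lines along the solar azimuth, clipped to the polygon."""
    theta = math.radians(azimuth_deg)
    ux, uy = math.sin(theta), math.cos(theta)   # unit vector along the azimuth
    vx, vy = uy, -ux                            # unit vector across the lines
    # Express each vertex as (s, t): s along the lines, t across them.
    pts = [(x * ux + y * uy, x * vx + y * vy) for x, y in polygon]
    t_min = min(t for _, t in pts)
    t_max = max(t for _, t in pts)

    lengths = []
    t0 = t_min + spacing / 2.0                  # centre the first line in its strip
    while t0 < t_max:
        crossings = []
        for (s1, t1), (s2, t2) in zip(pts, pts[1:] + pts[:1]):
            if (t1 <= t0 < t2) or (t2 <= t0 < t1):   # edge straddles the line
                crossings.append(s1 + (t0 - t1) / (t2 - t1) * (s2 - s1))
        crossings.sort()
        # Consecutive crossing pairs bound the parts of the line inside the polygon.
        inside = sum(b - a for a, b in zip(crossings[0::2], crossings[1::2]))
        if inside > 0:
            lengths.append(inside)
        t0 += spacing
    return lengths


def pauta_filter(lengths):
    """Iteratively drop values whose residual from the mean exceeds 3 sigma."""
    kept = list(lengths)
    while len(kept) >= 2:
        mean = statistics.mean(kept)
        sigma = statistics.stdev(kept)          # sample standard deviation (assumption)
        survivors = [v for v in kept if abs(v - mean) <= 3.0 * sigma]
        if len(survivors) == len(kept):         # nothing removed: converged
            break
        kept = survivors
    return kept


def shadow_length(polygon, azimuth_deg, spacing):
    """Mean of the fishnet segment lengths that survive the 3-sigma filter."""
    return statistics.mean(pauta_filter(fishnet_lengths(polygon, azimuth_deg, spacing)))
```

One property of the 3σ rule is worth keeping in mind when choosing the line spacing: with the sample standard deviation, a single value can deviate from the mean by at most (n − 1)/√n standard deviations, so no segment can be rejected unless a shadow is sampled by at least 11 lines.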

2.2. Building Spatial Distribution Classification

Urban buildings exhibit substantial heterogeneity in height, density, spatial arrangement, and functional composition, which directly affects the accuracy of single-image building height estimation. Conventional methods often struggle to maintain robustness across diverse urban scenarios—such as high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones—resulting in systematic or localized estimation errors. To address this issue, a spatial distribution classification framework is proposed (Figure 3), which integrates building height characteristics, spatial density, and functional heterogeneity. Specifically, DBSCAN is first employed to identify spatial clusters of buildings. Subsequently, a three-dimensional indicator system is constructed to classify urban areas into four categories: high-rise zones, mid-to-high-rise mixed zones, dense low-rise zones, and others. By explicitly accounting for spatial distribution patterns, this classification framework enhances the adaptability and accuracy of shadow-based height inversion in complex urban environments.

2.2.1. Building Cluster Analysis Based on DBSCAN

To identify the initial spatial clustering structure of buildings, this study employs the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise). A key advantage of DBSCAN is that it does not require the number of clusters to be predefined and can identify clusters with arbitrary shapes based on local density characteristics, making it particularly suitable for analyzing building groups with complex spatial distributions. By introducing DBSCAN into the framework, customized optimization strategies can be applied to different building distribution areas, thereby improving the accuracy and stability of building height estimation. DBSCAN relies on two critical parameters: the neighborhood radius ε (unrelated to the angle ε in Equation (5)) and the minimum number of points minPts, which define the spatial connectivity range among buildings and the minimum local density required to form a cluster, respectively. Given a set of buildings $B = \{b_1, b_2, \dots, b_n\}$, a building is identified as a core object if at least minPts other buildings fall within its ε-neighborhood. All density-connected core objects, together with their associated boundary objects, collectively constitute a building cluster.

In terms of parameter configuration, the ε value is determined by considering spatial variations in building distribution. Specifically, ε is set to the third quartile ($Q_3$) of the nearest-neighbor distance distribution within the study area, ensuring that more than 80% of spatial associations among buildings are effectively captured. The nearest-neighbor distance is calculated as follows. First, the distance between each building and all other buildings is computed, and the average distance is obtained. This average distance is then used as the radius of a search circle, within which the point closest to the circumference is identified. The characteristic distance k is subsequently determined as the mode of these distances [43]. The use of $Q_3$ enables ε to adaptively reflect differences in building density: in high-density zones, $Q_3$ assumes smaller values, capturing shorter inter-building distances, whereas in low-density zones, larger $Q_3$ values accommodate more dispersed building patterns. The minPts parameter is dynamically adjusted according to building density gradients. It is set to 5 in high-density urban areas and reduced to 3 in urban–rural transition zones [44]. This setting is supported by spatial autocorrelation analysis of building centroids, yielding a Moran’s I value of 0.67 (p < 0.01), which indicates strong spatial dependence in building distribution.
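The parameter choice and the clustering step can be illustrated with a small, dependency-free sketch. It makes several assumptions that are not taken from the paper: buildings are represented by centroid coordinates, the nearest-neighbor distance is the plain Euclidean distance to the closest other centroid (a simplification of the procedure in [43]), and the function names are hypothetical. The core-object test follows the definition above, counting other buildings only.

```python
import math
import statistics


def estimate_eps(points):
    """Third quartile (Q3) of the nearest-neighbor distances between centroids."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return statistics.quantiles(nearest, n=4)[2]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n                      # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:      # not a core object
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:              # former noise becomes a border object
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts: # expand only through core objects
                queue.extend(neighbors[j])
    return labels
```

For real datasets a library implementation such as scikit-learn's `DBSCAN` would normally be preferred for speed. Its `min_samples` parameter counts the point itself, so it corresponds to minPts + 1 under the "other buildings" definition used here.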

Through spatial distribution classification, targeted optimization strategies can be applied according to the distribution characteristics of buildings in different areas. This strategy not only improves the accuracy of building height estimation but also enhances the overall stability and robustness of the proposed approach.

2.2.2. Classification Indicators and Rules

To provide a detailed characterization of urban building spatial structures and to support the differentiated adaptation of height estimation strategies, this study proposes a zoning scheme that classifies urban buildings based on height characteristics, spatial density, and functional mix. Using this scheme, building clusters are categorized into four typical spatial forms: high-rise zones, mid-to-high-rise mixed zones, dense low-rise zones, and other zones.

(1): Height characteristics indicator:

Building height distribution is the core criterion for distinguishing different urban functional zones. Conventional mean-based indicators are highly sensitive to extreme values, while median-based measures tend to lose discriminative power in areas with mixed building heights. To address these limitations, the 75th percentile of building height ($H_{75}$) is adopted as the primary height indicator. This metric effectively captures the dominant height level within a building cluster while maintaining strong sensitivity to the presence of high-rise buildings.

$$H_{75} = \inf \{ h \in \mathbb{R} : F_N(h) \geq 0.75 \} \tag{7}$$

Here, $F_N(h)$ denotes the empirical distribution function (EDF) of building heights, representing the proportion of buildings with heights less than or equal to $h$. The operator $\inf$ indicates the infimum, i.e., the greatest lower bound of the set. Accordingly, $H_{75}$ is defined as the minimum height value $h$ such that at least 75% of the buildings have heights less than or equal to $h$.
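Read literally, Equation (7) selects an observed height rather than an interpolated value. The short sketch below illustrates this; the function name is hypothetical, and the result can differ slightly from library percentile routines that interpolate between neighboring values.

```python
import math


def empirical_quantile(heights, q):
    """Smallest observed height h whose empirical CDF value F_N(h) is at least q."""
    ordered = sorted(heights)
    # F_N reaches q at the ceil(q * N)-th order statistic (1-based).
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Eight buildings: six of them (75%) are no taller than 18 m.
h75 = empirical_quantile([3, 6, 9, 12, 15, 18, 21, 24], 0.75)   # 18
```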

(2): Density characteristic indicator:

Building space density is a key parameter for distinguishing urban functional zones. However, traditional density metrics often exhibit limited effectiveness when applied to building groups with complex layouts or irregular spatial distributions. To address this limitation, this study introduces an improved standardized density index D, which more accurately captures the degree of spatial compactness within local building clusters.

$$D = \frac{\sum_{i=1}^{n} A_i^{\mathrm{footprint}}}{\min\left(A^{\mathrm{MBR}},\; 1.2\,A^{\mathrm{convexhull}}\right)} \tag{8}$$

Here, $A_i^{\mathrm{footprint}}$ denotes the footprint area of the i-th building, and $A^{\mathrm{MBR}}$ and $A^{\mathrm{convexhull}}$ denote the areas of the minimum bounding rectangle (MBR) and the convex hull of the building cluster, respectively. By taking the denominator as the smaller of the MBR area and 1.2 times the convex hull area, the proposed formulation ensures robustness while preventing density overestimation. This constraint effectively enhances the discrimination between densely built areas and spatially dispersed building distributions.
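A small numeric illustration of Equation (8) shows how the denominator is chosen. The function name and the example areas are hypothetical, and the MBR and convex hull areas are assumed to have been computed beforehand with a GIS or geometry library.

```python
def density_index(footprint_areas, mbr_area, convex_hull_area):
    """Standardized density D of Equation (8); all areas share the same unit."""
    reference_area = min(mbr_area, 1.2 * convex_hull_area)
    return sum(footprint_areas) / reference_area


# Elongated cluster: the MBR (12,000 m^2) is much larger than the convex hull
# (9,000 m^2), so 1.2 x hull = 10,800 m^2 becomes the reference area.
d_elongated = density_index([2000.0, 2500.0, 1500.0], 12000.0, 9000.0)

# Compact cluster: the MBR (10,000 m^2) is below 1.2 x hull and is used directly.
d_compact = density_index([2000.0, 2500.0, 1500.0], 10000.0, 9000.0)
```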

(3): Mixedness indicator:

To effectively identify mid-to-high-rise mixed zones characterized by strong heterogeneity and functional complexity, this study constructs a composite mixedness index M that jointly captures height variability and functional diversity. The index consists of two complementary dimensions. The height dispersion term quantifies internal variations in building height and is expressed using a standardized interquartile range, which is robust to extreme values. The functional entropy term is derived from information entropy theory and provides a quantitative measure of the degree of functional mixing among different building types within the cluster.

$$M = \frac{H_{75} - H_{25}}{H_{50}} \times \left( -\sum_{k=1}^{K} p_k \ln p_k \right) \tag{9}$$

Here, $H_{75}$, $H_{25}$, and $H_{50}$ represent the 75th percentile, 25th percentile, and median of building height, respectively. $K$ is the total number of functional types (e.g., residential, commercial, industrial), and $p_k$ represents the proportion of buildings in the cluster that belong to the k-th functional type. The functional type (e.g., commercial, residential, or mixed-use) is inferred by analyzing each building’s height difference (Height_Dif) and distance to the nearest building (Distance_T).
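The mixedness index of Equation (9) can be sketched as below. The sketch assumes that a functional label is already available for each building (the inference from Height_Dif and Distance_T is not reproduced), uses the same order-statistic quantile as the definition in Equation (7), and relies on hypothetical function names.

```python
import math
from collections import Counter


def empirical_quantile(heights, q):
    """Smallest observed height whose empirical CDF value is at least q."""
    ordered = sorted(heights)
    return ordered[max(1, math.ceil(q * len(ordered))) - 1]


def mixedness_index(heights, function_labels):
    """Height dispersion (IQR over median) times the entropy of functional types."""
    h25 = empirical_quantile(heights, 0.25)
    h50 = empirical_quantile(heights, 0.50)
    h75 = empirical_quantile(heights, 0.75)
    dispersion = (h75 - h25) / h50

    total = len(function_labels)
    shares = [count / total for count in Counter(function_labels).values()]
    entropy = -sum(p * math.log(p) for p in shares)   # natural log, as in Eq. (9)
    return dispersion * entropy


# Four buildings, two functional types in equal shares:
# dispersion = (30 - 10) / 20 = 1 and entropy = ln 2.
m = mixedness_index([10.0, 20.0, 30.0, 40.0], ["res", "res", "com", "com"])
```

Because the two terms are multiplied, a cluster with a single functional type has zero entropy and therefore M = 0 regardless of how much its building heights vary.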

After deriving the building height, density, and mixedness indicators, this study establishes spatial typology classification rules by jointly integrating these three dimensions. As summarized in Table 1, urban areas are categorized into four typical spatial types. High-rise zones are characterized by pronounced vertical development and intensive land use, and are commonly associated with central business districts (CBDs) and urban sub-centers. Mid-to-high-rise mixed zones correspond to comprehensive urban areas with moderate building heights and a high degree of functional diversity. Dense low-rise zones are composed of compact clusters of low-rise buildings, which are frequently observed in industrial parks or logistics-related areas on urban fringes. Areas that do not satisfy the above criteria are classified as “other”. It should be noted that in high-rise zones, the dominant height and density characteristics are sufficient for reliable classification, while mixedness plays a negligible role in such height-driven environments. Therefore, no mixedness constraint is imposed for this category.

2.3. Height Optimization Method for Buildings in Multiple Spatial Distribution Types

After completing the initial height inversion using shadow length and scale coefficient models, this study proposes a spatial-distribution-aware height optimization strategy to further mitigate errors induced by building density, structural differences, and scale extrapolation effects. Taking the spatial classification results as a guiding framework, the method first examines the distinct error characteristics across high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, in order to identify the dominant factors affecting estimation accuracy. It then incorporates neighborhood structure, local height patterns, and spatial similarity to construct differentiated optimization models. Without relying on external elevation data, this strategy achieves structural correction and significantly improves height estimation accuracy.

2.3.1. Analysis of Different Spatial Distribution Characteristics

Based on the spatial distribution classification system, the study area is divided into three typical zones: high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones. To improve estimation accuracy, this section analyzes differences in spatial structure, shadow morphology, and error characteristics across these categories, thereby providing a basis for targeted height optimization.

① High-rise zones: Buildings in high-rise zones are tall, densely distributed, and structurally regular, which generally produce clear and consistent shadows. Although proportional coefficient–based inversion performs well in these areas, shadow length remains highly sensitive to solar altitude. In addition, dense building layouts may lead to local shadow obstruction or mismatches in fishnet sampling lines. Therefore, a neighborhood-weighted correction strategy is required to suppress localized errors.

② Mid-to-high-rise mixed zones: These zones exhibit pronounced spatial heterogeneity, characterized by interspersed high- and low-rise buildings and diverse functional compositions. Large height variations often result in unstable shadow measurements and frequent fitting inconsistencies. Moreover, shadow pairing ambiguity is common in such mixed environments. To address these issues, regression-based correction constrained by height similarity and local spatial trends is employed.

③ Dense low-rise zones: Dense low-rise zones consist of compactly arranged buildings with relatively uniform heights. Short shadows and blurred boundaries increase the difficulty of shadow extraction, while shadow merging further reduces measurement stability. Consequently, median-filter-based smoothing is applied to enhance robustness and improve height estimation accuracy in these areas.

2.3.2. Height Optimization for Buildings in Different Spatial Distributions

To address the distinct error characteristics and structural differences observed in building height estimation across high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, this study proposes three corresponding optimization strategies. These strategies respectively employ neighborhood-weighted correction, local linear regression, and median smoothing to enhance the accuracy and robustness of the preliminary height inversion results.

① High-rise zones:

In high-rise zones, building heights are relatively concentrated, so buildings within the same cluster form a highly consistent group. On this basis, abnormal values caused by local obstructions or fishnet breaks are corrected with a neighborhood-weighted method. Specifically, for each building i, the deviation between its initial height inversion value h_{i}^{(0)} and the neighborhood average value \bar{h}_{i} is calculated, and the building is tested against the following condition:

|h_{i}^{(0)} - \bar{h}_{i}| > τ \times σ_{i}

(10)

Here, σ_{i} denotes the height standard deviation within the neighborhood of building i, and τ is the tolerance threshold. If this condition is met, building i is considered an outlier, and its corrected height h_{i}^{(opt)} is computed as follows:

h_{i}^{(opt)} = \frac{\sum_{j = 1}^{n} ω_{ij} \times h_{j}^{(0)}}{\sum_{j = 1}^{n} ω_{ij}}

(11)

Here, n is the number of neighboring buildings, and ω_{ij} is the spatial weighting coefficient between building i and its neighboring building j, calculated using the following formula:

ω_{ij} = \frac{1}{d_{ij}^{2}}

(12)

Here, d_{ij} is the Euclidean distance between building i and building j.
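The correction defined by Equations (10)–(12) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name, the fixed search `radius` used to define the neighborhood, the exclusion of building i from its own neighborhood statistics, and the use of the population standard deviation are all assumptions introduced here.

```python
import math
from statistics import mean, pstdev


def neighborhood_weighted_correction(buildings, radius, tau):
    """Return corrected heights for a list of (x, y, h0) tuples."""
    corrected = []
    for i, (xi, yi, h0) in enumerate(buildings):
        # Neighborhood of building i: all other buildings within `radius`
        # (a hypothetical neighborhood definition).
        neighbors = []
        for j, (xj, yj, hj) in enumerate(buildings):
            d = math.hypot(xj - xi, yj - yi)
            if j != i and 0.0 < d <= radius:
                neighbors.append((d, hj))
        if len(neighbors) < 2:
            corrected.append(h0)  # too few neighbors to judge; keep the value
            continue

        heights = [h for _, h in neighbors]
        h_bar, sigma = mean(heights), pstdev(heights)
        if abs(h0 - h_bar) > tau * sigma:                    # Eq. (10)
            weights = [1.0 / d ** 2 for d, _ in neighbors]   # Eq. (12)
            h_opt = sum(w * h for w, h in zip(weights, heights)) / sum(weights)
            corrected.append(h_opt)                          # Eq. (11)
        else:
            corrected.append(h0)
    return corrected


# Hypothetical cluster: three consistent towers and one implausible estimate.
towers = [(0.0, 0.0, 90.0), (10.0, 0.0, 92.0), (0.0, 10.0, 88.0), (10.0, 10.0, 300.0)]
fixed = neighborhood_weighted_correction(towers, radius=20.0, tau=2.0)
```

In this toy case only the fourth building satisfies the outlier condition, and its height is replaced by the inverse-squared-distance weighted mean of the three preliminary neighbor heights. Note that the weights always use the preliminary heights h_{j}^{(0)}, so the result does not depend on the order in which buildings are processed.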

② Mid-to-high-rise mixed zones:

In mid-to-high-rise mixed zones, building heights exhibit substantial variability. Traditional spatial-distance-based weighted interpolation methods often fail to capture the true height relationships among buildings, which may lead to errors such as “high compensating for low” or “low compensating for high”. Moreover, building layouts in these areas commonly display local continuity or gradual height transition patterns, which simple mean-based smoothing cannot represent. To mitigate these issues, this study proposes a joint correction strategy that combines a height-similarity screening mechanism with a local linear regression model. For a target building i, buildings within its spatial neighborhood N(i) that exhibit similar preliminary inversion heights are first selected to construct a similarity-constrained neighborhood N^{*}(i). Specifically, neighboring buildings satisfying the following condition are retained:

N^{*}(i) = \{ j \in N(i) \mid |h_{j}^{(0)} - h_{i}^{(0)}| < ϵ \}

(13)

Here, h_{j}^{(0)} denotes the preliminary inversion height of building j, h_{i}^{(0)} represents the corresponding value for the target building i, and ϵ is the similarity tolerance threshold (in meters), which is used to filter neighboring buildings with comparable preliminary heights.

Based on the resulting similarity-constrained neighborhood, a local linear regression model is then employed to refine the height estimate of the target building i. A spatial position variable is introduced to characterize the local consistency of buildings in terms of imaging geometry, illumination conditions, and occlusion environment. Since preliminary height inversion errors exhibit pronounced spatial correlation, a local regression model based on spatial position can suppress region-scale systematic errors within similarity-constrained neighborhoods. It is assumed that the preliminary inversion height of building j exhibits a linear relationship with its spatial position variable x_{j}:

h_{j}^{(0)} = a \times x_{j} + b + ε_{j}

(14)

Here, a and b are the regression coefficients of the local linear regression model, estimated by fitting the data within the similarity-constrained neighborhood, and ε_{j} is the error term of the regression model, representing the residual of building j. The optimized height h_{i}^{(opt)} of the target building i is then obtained using the following regression equation:

h_{i}^{(opt)} = a \times x_{i} + b

(15)

where x_{i} denotes the spatial position variable of the target building i.
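A minimal Python sketch of the screening and regression steps in Equations (13)–(15) is given below. It rests on several assumptions that the text does not settle: the spatial position variable is treated as a single scalar per building, the target building is excluded from the fit, the coefficients are estimated by ordinary least squares, and the preliminary height is kept unchanged when the fit is not identifiable. The function name is hypothetical.

```python
def local_regression_height(target, neighbors, eps):
    """Refine the height of one target building.

    target    -- (x_i, h0_i): position variable and preliminary height
    neighbors -- list of (x_j, h0_j) for the buildings in N(i)
    eps       -- similarity tolerance threshold in metres
    """
    x_i, h_i = target

    # Eq. (13): keep only neighbors with a similar preliminary height.
    similar = [(x, h) for x, h in neighbors if abs(h - h_i) < eps]
    if len(similar) < 2:
        return h_i  # assumption: too few points to fit a line

    # Eq. (14): ordinary least-squares fit of h0 = a * x + b.
    n = len(similar)
    x_mean = sum(x for x, _ in similar) / n
    h_mean = sum(h for _, h in similar) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in similar)
    if sxx == 0.0:
        return h_i  # assumption: identical positions, slope undefined
    a = sum((x - x_mean) * (h - h_mean) for x, h in similar) / sxx
    b = h_mean - a * x_mean

    # Eq. (15): evaluate the fitted line at the target position.
    return a * x_i + b


# Hypothetical neighborhood with a gradual height trend and one dissimilar tower.
h_opt = local_regression_height(
    target=(2.0, 50.0),
    neighbors=[(0.0, 40.0), (1.0, 44.0), (3.0, 52.0), (4.0, 56.0), (2.5, 120.0)],
    eps=15.0,
)
```

The 120 m neighbor is discarded by the similarity screen, so it cannot pull the estimate upward; the remaining four buildings define the local trend from which the target height is read.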

③ Dense low-rise zones:

In dense low-rise zones, buildings are generally low, densely distributed, and exhibit little height variation. In preliminary inversion results, abnormal fluctuations often occur due to blurred boundaries or noise interference. To address this issue, a median smoothing strategy based on a sliding window is employed to correct outliers. First, outliers are identified using the detection criterion defined in Equation (10): a building whose preliminary inversion height h_{i}^{(0)} meets this criterion is regarded as an unstable estimate and marked for correction. This step can effectively identify extreme outliers caused by shadow errors, boundary fusion, and other factors. For buildings marked as outliers, their heights are corrected using the median height of buildings in their neighborhood. A local window Ω(i) centered on building i is defined, and the optimized building height h_{i}^{(opt)} is calculated as follows:

h_{i}^{(opt)} = \mathrm{median}\left( \{ h_{j}^{(0)} \mid j \in Ω(i),\ h_{j}^{(0)} \text{ is retained} \} \right)

(16)

Here, Ω(i) denotes the set of buildings that fall within the sliding window used for median filtering. Through this approach, height estimation errors caused by shadow merging and blurred building contours can be effectively corrected, thereby improving the stability and accuracy of building height inversion in dense low-rise zones.
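The two-step procedure (outlier flagging with Equation (10), then median replacement with Equation (16)) might be sketched as follows. This is only an illustration under stated assumptions: "retained" is interpreted here as "not flagged as an outlier", each window is passed in as an explicit list of neighbor indices, and the function name is hypothetical.

```python
from statistics import mean, median, pstdev


def median_smooth(heights, windows, tau):
    """Median smoothing of flagged outliers.

    heights -- preliminary heights h0, one per building
    windows -- windows[i] lists the indices of the buildings in the
               window around building i (excluding i itself)
    tau     -- tolerance threshold of Eq. (10)
    """
    # Step 1: flag unstable estimates with the criterion of Eq. (10).
    flagged = []
    for i, h in enumerate(heights):
        nb = [heights[j] for j in windows[i]]
        flagged.append(len(nb) >= 2 and abs(h - mean(nb)) > tau * pstdev(nb))

    # Step 2: replace each flagged height by the median of the retained
    # (assumed: non-flagged) preliminary heights in its window, Eq. (16).
    smoothed = list(heights)
    for i, is_outlier in enumerate(flagged):
        if is_outlier:
            retained = [heights[j] for j in windows[i] if not flagged[j]]
            if retained:
                smoothed[i] = median(retained)
    return smoothed


# Hypothetical low-rise block in which one estimate is clearly implausible.
h0 = [6.0, 7.0, 6.5, 25.0, 6.0, 7.5, 6.5]
win = [[j for j in range(len(h0)) if j != i] for i in range(len(h0))]
smoothed = median_smooth(h0, win, tau=2.0)
```

Excluding flagged neighbors from the median keeps one bad estimate from contaminating the correction of another, which matters when shadow merging affects several adjacent buildings at once.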

3. Experimental Results

3.1. Experimental Data and Evaluation Criteria

3.1.1. Study Area

To assess the adaptability and robustness of the proposed building height estimation method across diverse urban spatial structures, this study selected 13 representative Chinese cities: Beijing, Shanghai, Tianjin, Shijiazhuang, Harbin, Baotou, Wuhan, Chongqing, Lanzhou, Kunming, Nanning, Haikou, and Lhasa. The city selection followed three main principles: (1) broad geographical coverage across plains, mountains, hills, and plateaus; (2) diverse urban morphologies, ranging from high-density cores to low-density dispersed areas, with building types spanning high-rise residential, commercial, and low-rise industrial structures; and (3) variation in city scale and development stage, spanning megacities to mid-sized regional cities. This selection strategy ensures strong representativeness and scenario diversity. The final dataset comprises 11,168 buildings with manually annotated roof and shadow information across the 13 cities, providing a solid foundation for comprehensive method validation (Figure 4).

To obtain reliable reference data for method validation, building roof contours and corresponding shadows were manually annotated using 0.45 m WorldView-III imagery. Manual annotation was adopted to provide stable and controlled inputs for shadow length measurement and spatial optimization, rather than to treat boundary extraction as a research focus of this study (Figure 5). Specifically, manual labeling was employed solely to ensure accurate shadow length extraction under complex urban conditions and does not constitute a required operational component of the proposed framework. To minimize subjectivity, all annotations followed unified guidelines, and only clear, unobstructed shadows were labeled to avoid interference from adjacent buildings, vegetation, or terrain. In addition, radiometric calibration and orthorectification were applied to all images to ensure geometric consistency, and solar and sensor parameters were extracted from metadata for subsequent height inversion.

For reference height validation, this study employed floor-count data from AutoNavi Maps (https://ditu.amap.com/ (accessed on 2 March 2026)) as ground-truth information. Building heights were estimated by multiplying the reported number of floors by a uniform floor height of 3 m. Although actual floor heights may vary among different building types, this assumption provides reliable approximations for large-scale analyses. Cao and Huang manually validated AutoNavi floor counts across 42 cities and reported an RMSE of 1.19 m, demonstrating the reliability of these data for large-area building height estimation [28].

3.1.2. Evaluation Criteria

To quantitatively evaluate the accuracy of building height inversion results, this study selects the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as the primary evaluation metrics. Both metrics are widely used in error analysis and effectively characterize the numerical deviation between estimated and reference heights [45]. MAE measures the average absolute difference between the inverted height and the reference height, as defined in Equation (17):

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |H_{i}^{inv} - H_{i}^{ref}|

(17)

where n denotes the number of samples, H_{i}^{inv} represents the inversion height of the i-th sample, and H_{i}^{ref} denotes the corresponding reference height. A smaller MAE indicates a lower average estimation error and, consequently, higher inversion accuracy.

RMSE places greater emphasis on large deviations by squaring the residuals before averaging, thereby reflecting the influence of extreme errors on overall estimation performance, as expressed in Equation (18):

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (H_{i}^{inv} - H_{i}^{ref})^{2}}

(18)

By jointly considering MAE and RMSE, the performance of building height estimation can be comprehensively assessed from two complementary perspectives: the overall error magnitude and the degree of error dispersion. This dual-metric evaluation is particularly suitable for comparing estimation performance across different urban spatial distribution types.
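For concreteness, the two metrics in Equations (17) and (18) can be computed as in the short Python sketch below. The function names and the three sample heights are illustrative only and are not taken from the study data.

```python
import math


def mae(inverted, reference):
    """Mean absolute error, Eq. (17)."""
    return sum(abs(a - b) for a, b in zip(inverted, reference)) / len(inverted)


def rmse(inverted, reference):
    """Root mean square error, Eq. (18)."""
    squared = sum((a - b) ** 2 for a, b in zip(inverted, reference))
    return math.sqrt(squared / len(inverted))


# Hypothetical heights in metres: residuals are -2, +1 and -1.
h_inv = [10.0, 22.0, 29.0]
h_ref = [12.0, 21.0, 30.0]
sample_mae = mae(h_inv, h_ref)
sample_rmse = rmse(h_inv, h_ref)
```

RMSE is never smaller than MAE on the same residuals, and the gap between them widens as a few large errors come to dominate, which is why the pair is informative about error dispersion.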

3.2. Results of Building Spatial Distribution Classification

3.2.1. Building Clustering Results

Based on the preliminary inversion of building heights, the DBSCAN clustering method described in Section 2 is applied to perform spatial clustering of buildings within the study area. The clustering outcomes are illustrated in Figure 6, where different colors denote distinct building clusters. The clustering results for Baotou, Haikou, Shijiazhuang, and Beijing exhibit relatively uniform building distributions with clear clustering patterns. Specifically, Baotou exhibits highly concentrated clusters with only a few outliers, indicating a relatively homogeneous spatial structure. Haikou shows a more dispersed clustering pattern with a larger number of outliers, reflecting greater spatial fragmentation. Shijiazhuang presents a more complex clustering structure characterized by a higher number of clusters, suggesting increased variability in building distribution. Beijing, in contrast, demonstrates a moderately concentrated clustering pattern accompanied by a limited number of outliers.

The clustering results for Harbin, Lhasa, Kunming, and Lanzhou indicate relatively high building density, accompanied by a noticeable number of outliers (Figure 7). However, these outliers are distributed along the image boundaries and therefore do not reflect deficiencies in clustering performance. Instead, their presence is mainly attributable to the large spatial extent and broad coverage of the study images. Within each city, the dominant clusters remain compact and well-defined, with a relatively large number of clusters identified. This pattern reflects a generally uniform spatial distribution of buildings combined with high local density, demonstrating that the DBSCAN-based clustering effectively captures the intrinsic spatial structure of buildings in these urban environments.

3.2.2. Building Classification Results

Based on the aforementioned clustering and classification methods, this paper conducted a spatial typology analysis of building clusters in selected urban subregions. The classification results are shown in Figure 8, where buildings are categorized into four spatial types: high-rise zones, mid-to-high-rise mixed zones, dense low-rise zones, and other zones. The results reveal pronounced inter-city differences in both the composition and structural characteristics of these spatial types.

Across the sampled subregions, the proportion of high-rise zones varies notably. Cities such as Lanzhou (487 buildings), Shijiazhuang (296), and Chongqing (324) contain large high-rise clusters, reflecting strong vertical development, whereas Baotou and Lhasa exhibit no high-rise zones, indicating dominance of mid- to low-rise structures. Mid-to-high-rise mixed zones constitute the predominant spatial type in most cities, particularly Harbin (997), Kunming (536), and Wuhan (450), where substantial height variability and complex internal spatial structures are observed.

Dense low-rise zones are prominent in Kunming (979) and Lhasa (1286), indicating extensive and compact low-rise development. Conversely, cities such as Nanning, Tianjin, and Chongqing contain relatively fewer dense low-rise clusters, suggesting a greater presence of residential or administrative land-use patterns. “Other” zones, which consist mainly of scattered buildings located at urban fringes or in areas with lower inversion quality, are more evident in cities such as Lhasa (229) and Harbin (89).

It should be noted that these results are derived from sampled local subregions rather than entire cities. Nevertheless, the observed inter-city differences demonstrate the proposed method’s ability to effectively capture localized spatial structures and reveal meaningful associations between functional zoning and building development patterns.

3.3. Building Height Optimization Results

To systematically evaluate the applicability of the proposed spatial-distribution-based optimization strategy across multiple cities, building height inversion results from 13 cities were compared before and after optimization, and the corresponding MAE and RMSE trends were analyzed (Figure 9). The results show that the proposed optimization consistently improves estimation accuracy across all cities, with particularly pronounced error reductions observed in cities exhibiting higher initial inversion errors.

Before optimization, the absence of clustering and spatial classification resulted in substantial inversion errors in several cities, primarily due to unmitigated shadow occlusion and overlap effects. In Nanning and Lanzhou, RMSE values reached 86.65 m and 77.67 m, with MAE exceeding 50 m, highlighting poor adaptability in high-rise-dense or complex-terrain areas. Although cities such as Baotou, Shijiazhuang, Shanghai, and Chongqing showed lower pre-optimization errors, noticeable biases still existed. After applying the optimization strategy, all 13 cities experienced substantial error reductions, with MAE and RMSE consistently reduced to the range of 1.09–3.44 m and with error dispersion substantially suppressed. Nanning showed the most pronounced improvement, with an RMSE reduction exceeding 80 m. Cities including Harbin, Kunming, and Wuhan achieved error reductions of approximately 89%, demonstrating effective correction in mid-to-high-rise mixed zones. Northern cities reached MAE/RMSE values of 1.97/2.50 m, while southern cities reached 2.16/2.60 m, possibly influenced by differences in vegetation and shading complexity. Overall, the optimization not only corrected large errors in challenging regions but also enhanced stability in cities with initially good performance (e.g., Baotou and Shijiazhuang), demonstrating strong generalization capability and spatial adaptability.

To evaluate nationwide performance, height inversion results from all 13 cities (11,168 buildings) were integrated and compared with reference heights (Figure 10). The proposed method demonstrates excellent overall accuracy, achieving an R² of 0.99, an MAE of 2.07 m, and an RMSE of 2.56 m, with most deviations confined within ±3 m. The fitted regression line (Y = X − 0.28) closely aligns with the ideal Y = X line, indicating only minor systematic underestimation (approximately 0.3 m) and no observable scale or bias drift across height ranges. The sample height distribution is strongly concentrated in low- and mid-rise buildings (0–60 m), which account for approximately 84% of all samples, while high-rise (13%) and super-high-rise (<3%) buildings represent smaller proportions. Despite this imbalance, the model maintains a stable linear relationship and consistent prediction performance even for tall buildings. A limited number of outliers deviating from the ideal regression line are primarily associated with super-high-rise buildings and high-density urban core areas. These cases are typically characterized by complex architectural forms, severe shadow overlap from adjacent tall structures, vegetation interference, or irregular roof geometries (Figure 11a–f), which introduce additional uncertainty in shadow length measurement and consequently affect height inversion accuracy. This behavior is consistent with the error patterns observed in extreme height ranges discussed in Section 4.1 and does not indicate a systematic bias of the proposed method.
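To make the reading of the fitted line concrete, the sketch below fits estimated heights against reference heights by ordinary least squares and reports the slope, intercept, and R². The paper does not specify how its regression line or R² was computed, so the least-squares fit and the squared Pearson correlation used here are assumptions, as are the function name and the toy data.

```python
def fit_line_and_r2(reference, estimated):
    """Least-squares line estimated = slope * reference + intercept, plus R^2."""
    n = len(reference)
    mx = sum(reference) / n
    my = sum(estimated) / n
    sxx = sum((x - mx) ** 2 for x in reference)
    syy = sum((y - my) ** 2 for y in estimated)
    sxy = sum((x - mx) * (y - my) for x, y in zip(reference, estimated))

    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


# Hypothetical data: every estimate sits 0.28 m below its reference height.
ref = [9.0, 30.0, 60.0, 99.0]
est = [h - 0.28 for h in ref]
slope, intercept, r2 = fit_line_and_r2(ref, est)
```

With a slope of one, a negative intercept is a constant offset: every building is underestimated by the same amount regardless of height. A slope different from one would instead indicate an error that grows or shrinks with building height.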

3.4. Building Height Estimation Performance in Different Cities

To further evaluate the adaptability and robustness of the proposed model under varying height structures, building distributions, and functional complexities, three representative urban areas—Beijing, Nanning, and Lhasa—were selected for localized comparative analysis. Figure 12 illustrates the relationships between predicted building heights and reference values for each area. The selection of these cities was purposefully designed to cover distinct urban conditions: Beijing represents a metropolitan area with a relatively balanced height structure; Nanning is characterized by a large number of super-high-rise buildings and complex building types; Lhasa is dominated by low-rise buildings, with concentrated sample heights but significant terrain variations. This diversity provides an ideal sample for testing the model’s generalization capabilities under different structural conditions.

As shown in Figure 12, Beijing exhibits a wide range of building heights with a relatively balanced sample distribution. The proposed model demonstrates stable predictive performance, achieving an R² of 0.98, an MAE of 2.46 m, and an RMSE of 3.10 m. The fitted regression slope is close to unity (0.97), with a positive intercept, indicating that prediction errors are evenly distributed without a pronounced systematic bias. Nanning includes a large number of super-high-rise buildings exceeding 100 m in height, with the tallest reaching 164.75 m, and encompasses various functional types such as residential, office, and commercial buildings. Despite challenges such as a wide height range and strong building heterogeneity, the model maintains excellent fitting performance (R² = 0.99, MAE = 2.79 m, RMSE = 3.44 m), validating its adaptability in extreme high-rise building environments. In contrast, Lhasa is characterized by building heights concentrated within the 0–30 m range, a relatively small urban scale, and complex terrain conditions. Even under these constraints, the proposed method achieves satisfactory prediction accuracy (R² = 0.85, MAE = 2.28 m, RMSE = 2.69 m). The fitting slope (0.95) is slightly less than 1, and the intercept is negative, indicating a certain degree of underestimation bias.

To assess the model’s generalizability across diverse urban contexts, Figure 13 presents the fitting results for the remaining 10 representative cities. Overall, the model performs consistently well, with R² values from 0.97 to 0.99, indicating strong linear correlation and effective capture of height variability. MAE ranges from 1.09 m (Baotou) to 2.64 m (Lanzhou), and RMSE ranges from 1.44 m to 3.03 m, showing concentrated errors without evident outliers or systematic deviation. Slight underestimation is observed in high-rise clusters of cities such as Lanzhou and Tianjin; however, the fitted regression lines remain close to the ideal Y = X line, with slopes approaching unity and intercepts near zero. This indicates stable and unbiased model behavior across different urban forms. In addition, the density-based color distribution shows that high-density building clusters (highlighted in red) are tightly aligned with the main regression trend, confirming the model’s strong predictive capability in complex urban core areas.

Overall, the proposed method demonstrates robust predictive capability across various typical urban environments. It maintains error convergence and linear consistency in both heterogeneous high-rise regions and structurally homogeneous low-rise areas, further validating its strong spatial generalization and cross-structural adaptability.

3.5. Visualization Analysis of Building Height Modeling Results

To visually validate the effectiveness of the proposed building height estimation method, a three-dimensional building visualization model was constructed based on the extracted height data. By integrating building height information with boundary contours, a three-dimensional representation of each building with realistic spatial form was generated. This visualization process not only aids in analyzing the rationality and accuracy of height calculation results from a spatial perspective, but also provides intuitive evidence for evaluating the adaptability of the method across different cities and building types. Figures 14 and 15 present modeling results for typical urban areas. The overall distribution reveals that cities like Beijing, Chongqing, and Nanning feature dense clusters of high-rise buildings with complex structures, while cities like Lhasa and Baotou are primarily composed of low-rise buildings with lower overall heights. The data show an average building height of 36.70 m, with the tallest building in Nanning reaching 178.57 m. The spatial forms of buildings in each city are well reproduced in the model, demonstrating that the proposed method exhibits strong stability, accuracy, and versatility in building height estimation and 3D representation, making it suitable for modeling tasks in complex urban environments.

4. Discussion

4.1. Error Analysis Across Different Height Levels

To evaluate the adaptability of the proposed model across different height ranges, all buildings were grouped into four categories: low-rise (0–30 m), mid-rise (30–60 m), high-rise (60–100 m), and super-high-rise (>100 m). The height intervals were redefined using independent reference height data, which were introduced in Section 3.1.1. These reference data were applied to categorize buildings into distinct height ranges, ensuring that the height intervals are independent of the model’s output and preventing any potential bias in performance evaluation. As shown in Figure 16, the fitting performance generally improves with increasing building height, with R² values rising from 0.83 to 0.95, indicating that height variations in taller buildings are more effectively captured. The MAE reaches its minimum value (1.92 m) in the 30–60 m range and increases for super-high-rise buildings (2.86 m), reflecting the greater challenges associated with long, distorted, or overlapping shadows cast by very tall structures. Systematic trends are also observed across height ranges. For buildings below 60 m, the fitted slopes are slightly greater than 1 with negative intercepts (approximately −0.60 m), indicating mild overestimation. In contrast, for buildings exceeding 60 m, the slopes fall below 1 with positive intercepts, revealing a tendency toward underestimation in high-rise and super-high-rise buildings. Despite these trends, the model maintains strong linear consistency and stable performance across all height scales.
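The grouping described above can be illustrated with a short Python sketch that assigns each building to a height class using its reference height and then computes the MAE per class. The treatment of boundary values (upper bounds inclusive) is an assumption, since the stated intervals share their end points; the function names and sample values are hypothetical.

```python
from collections import defaultdict

# Upper bounds in metres; heights above the last bound are super-high-rise.
CLASS_BOUNDS = [(30.0, "low-rise"), (60.0, "mid-rise"), (100.0, "high-rise")]


def height_class(reference_height):
    """Assign a height class from the reference height, not the estimate."""
    for upper, name in CLASS_BOUNDS:
        if reference_height <= upper:
            return name
    return "super-high-rise"


def mae_by_class(reference, estimated):
    """Mean absolute error of the estimates within each height class."""
    errors = defaultdict(list)
    for ref_h, est_h in zip(reference, estimated):
        errors[height_class(ref_h)].append(abs(est_h - ref_h))
    return {name: sum(vals) / len(vals) for name, vals in errors.items()}


# Hypothetical reference and estimated heights in metres.
per_class = mae_by_class([12.0, 27.0, 45.0, 75.0, 120.0],
                         [13.0, 25.0, 46.0, 72.0, 125.0])
```

Binning on the reference height rather than on the model output keeps a badly estimated building in its true class, so the error it contributes is attributed to the correct height range.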

The larger estimation errors for super-high-rise buildings primarily stem from their complex spatial and imaging characteristics (Figure 17). First, the long shadows cast by skyscrapers often overlap with surrounding buildings. Under varying lighting conditions, changes in shadow length and direction can blur building boundaries, compromising the stability of height inversion. Additionally, super-high-rise buildings typically exhibit complex geometric forms, with substantial structural differences between tower sections and podiums, complicating contour extraction. Moreover, the high density of skyscraper clusters reduces inter-building spacing, further increasing uncertainties in image segmentation and boundary identification. Together, these factors contribute to relatively larger measurement errors in super-high-rise building height estimation.

Figure 18 further examines the error distribution, showing near-normal, symmetric residuals centered around zero. Errors for low-rise buildings are almost entirely within ±5 m, indicating high accuracy. Mid-rise buildings also exhibit a compact error range with minimal outliers. High-rise buildings exhibit minimal systematic bias (mean residual +0.15 m), with 80% of samples within ±2.5 m. For super-high-rise buildings, the median residual (+2.3 m) and denser upper-tail distribution reveal a clear overestimation tendency, with only a few extreme cases exceeding +10 m. This bias is likely due to shadow occlusion and distortion caused by the super-tall structures, which complicate stable shadow extraction and result in height overestimation.

4.2. Comparison with Existing Building Height Calculation Methods

To comprehensively evaluate the effectiveness of the building height calculation method proposed in this paper, several representative existing techniques were selected for comparative analysis. Due to data acquisition limitations, a direct comparison at the same resolution and within the same region was not feasible. However, precision comparisons were conducted through qualitative analysis. Table 2 summarizes the basic principles, study areas, sample sizes, and corresponding MAE and RMSE for each method. The traditional shadow method estimates building height based on shadow length, which is simple and effective for areas with good lighting and sparse building distribution. However, it produces significant errors in densely populated urban areas (MAE: 4.08 m). The BIRCH clustering combined with a random forest model performs well in local areas (with a minimum MAE of 1.723 m), but its generalization is limited due to the small sample size. The roof contour-constrained stereo matching method achieves high accuracy in buildings with clear structures (MAE: 2.31 m), but it is highly dependent on image quality and stereo models. Adaptive photon selection technology shows some adaptability across wide areas, but its RMSE reaches 8.1 m, indicating insufficient stability.

In contrast, the method proposed in this paper achieves the most stable performance in both key metrics: MAE (2.07 m) and RMSE (2.56 m). Its advantage lies in combining spatial distribution features of buildings for classification modeling and applying optimized strategies for different spatial types, which not only improves accuracy but also enhances the model’s adaptability in complex urban environments. However, this method still has room for improvement in challenging scenarios, such as areas with extreme occlusion or dense overlapping.

5. Conclusions and Future Work

This study proposes a single-image building height estimation method based on high-resolution optical imagery, aiming to address the uneven estimation accuracy caused by spatial heterogeneity in complex urban environments. Unlike traditional approaches that rely on a unified modeling assumption, the proposed framework incorporates spatial distribution characteristics into the height inversion and optimization process, thereby reformulating building height estimation as a spatially heterogeneous, region-specific optimization problem. The method integrates robust shadow length extraction based on a fishnet strategy, a multi-scenario scale factor model for preliminary height inversion, and spatial typology classification based on DBSCAN clustering combined with multiple indicators. By partitioning urban areas into high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, differentiated optimization strategies—including neighborhood-weighted correction, local linear regression, and median smoothing—are applied to effectively suppress region-correlated systematic errors and local outliers.

Experiments conducted on 11,168 buildings across 13 representative cities in China demonstrate that the proposed method achieves a mean absolute error (MAE) of 2.07 m, a root mean square error (RMSE) of 2.56 m, and a coefficient of determination (R²) of 0.99. The results confirm that the method not only outperforms multiple existing approaches but also maintains high consistency and stability across diverse urban layouts and height ranges. Owing to its low cost, high efficiency, and broad applicability, the proposed method provides reliable support for urban planning, population and energy assessment, climate modeling, and disaster risk analysis. Despite its effectiveness, limitations remain in extremely high-density urban areas where shadows are severely obstructed by neighboring buildings or vegetation. In addition, the spatial classification thresholds are empirically derived for typical Chinese urban morphologies and may require adjustment for cities with fundamentally different development patterns. Future work will focus on integrating automatic roof and shadow extraction models and incorporating multi-source data, such as synthetic aperture radar (SAR) or LiDAR, to further enhance adaptability and generalization under complex illumination and terrain conditions.

Author Contributions

Y.X., W.S., J.T. and Y.Z. proposed the initial idea and research framework of this study. Y.X. and R.X. designed and performed the experiments. Y.X., R.X. and Y.H. collected, processed, and analyzed the experimental data. W.S. and D.F. provided important methodological guidance and constructive suggestions. Y.X. drafted and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the National Natural Science Foundation of China (42301473), the Sichuan Science and Technology Program (2024YFFK0421, 2024NSFSC0074, 2025ZNSFSC0322), and Chengdu Science and Technology Program (2025-YF05-00009-SN).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors wish to thank the editors and reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shareef, S. The impact of urban morphology and building’s height diversity on energy consumption at urban scale: The case study of Dubai. Build. Environ. 2021, 194, 107675.
2. Huang, H.B.; Chen, P.M.; Xu, X.Q.; Liu, C.X.; Wang, J.; Liu, C.; Clinton, N.; Gong, P. Estimating building height in China from ALOS AW3D30. ISPRS J. Photogram. Remote Sens. 2022, 185, 146–157.
3. Qonita, M.; Giyarsih, S.R. Smart city assessment using the Boyd Cohen smart city wheel in Salatiga, Indonesia. GeoJournal 2023, 88, 479–492.
4. Fan, K.X.; Lin, A.Q.; Wu, H.; Xu, Z.C. Pano2Geo: An efficient and robust building height estimation model using street-view panoramas. ISPRS J. Photogram. Remote Sens. 2024, 215, 177–191.
5. Cao, Q.; Luan, Q.Z.; Liu, Y.P.; Wang, R.Q. The effects of 2D and 3D building morphology on urban environments: A multi-scale analysis in the Beijing metropolitan region. Build. Environ. 2021, 192, 107635.
6. Zhao, M.Z.; Wang, J. A new method of feature line integration for construction of DEM in discontinuous topographic terrain. Environ. Earth Sci. 2022, 81, 397.
7. Therias, A.; Rafiee, A. City digital twins for urban resilience. Int. J. Digit. Earth 2023, 16, 4164–4190.
8. Chen, H.G.; Wu, B.; Yu, B.L.; Chen, Z.Q.; Wu, Q.S.; Lian, T.; Wang, C.G.; Li, Q.X.; Wu, J.P. A new method for building-level population estimation by integrating LiDAR, nighttime light, and POI data. J. Remote Sens. 2021, 2021, 9803796.
9. Palacios-Lopez, D.; Esch, T.; MacManus, K.; Marconcini, M.; Sorichetta, A.; Yetman, G.; Zeidler, J.; Dech, S.J.; Tatem, A.; Reinartz, P. Towards an improved large-scale gridded population dataset: A Pan-European study on the integration of 3D settlement data into population modelling. Remote Sens. 2022, 14, 325.
10. Wu, B.; Yang, C.S.; Wu, Q.S.; Wang, C.X.; Wu, J.P.; Yu, B.L. A building volume adjusted nighttime light index for characterizing the relationship between urban population and nighttime light intensity. Comput. Environ. Urban Syst. 2023, 99, 101911.
11. Yan, Z.; Wang, P.; Xu, F.; Sun, X.; Diao, W. AIR-PV: A benchmark dataset for photovoltaic panel extraction in optical remote sensing imagery. Sci. China Inf. Sci. 2023, 66, 140307.
12. Zhang, Z.X.; Chen, M.; Zhong, T.; Zhu, R.; Qian, Z.; Zhang, F.; Yang, Y.; Zhang, K.; Santi, P.; Wang, K.C.; et al. Carbon mitigation potential afforded by rooftop photovoltaic in China. Nat. Commun. 2023, 14, 2347.
13. Park, Y.J.; Guldmann, J.M. Creating 3D city models with building footprints and LIDAR point cloud classification: A machine learning approach. Comput. Environ. Urban Syst. 2019, 75, 76–89.
14. Yan, Y.Z.; Huang, B. Estimation of building height using a single street view image via deep neural networks. ISPRS J. Photogram. Remote Sens. 2022, 192, 83–98.
15. Guida, R.; Iodice, A.; Riccio, D. Height retrieval of isolated buildings from single high-resolution SAR images. IEEE Trans. Geosci. Remote Sens. 2010, 48, 2967–2979.
16. Li, W.C.; Tao, X.J.; Liu, D.; Wang, L.; Li, Z.Y.; Wu, J.J.; Yang, J.Y. An improved iterative simulation and matching scheme for building height retrieval from SAR image. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
17. Xie, Y.K.; Feng, D.J.; Xiong, S.F.; Zhu, J.; Liu, Y.G. Multi-scene building height estimation method based on shadow in high resolution imagery. Remote Sens. 2021, 13, 2862.
18. Esch, T.; Brzoska, E.; Dech, S.; Leutner, B.; Palacios-Lopez, D.; Metz-Marconcini, A.; Marconcini, M.; Roth, A.; Zeidler, J. World Settlement Footprint 3D-A first three-dimensional survey of the global building stock. Remote Sens. Environ. 2022, 270, 112877.
19. Li, Y.; Wu, B. Relation-constrained 3D reconstruction of buildings in metropolitan areas from photogrammetric point clouds. Remote Sens. 2021, 13, 129.
20. Wegner, J.D.; Ziehn, J.R.; Soergel, U. Combining high-resolution optical and InSAR features for height estimation of buildings with flat roofs. IEEE Trans. Geosci. Remote Sens. 2013, 52, 5840–5854.
21. Sun, Y.; Mou, L.C.; Wang, Y.Y.; Mobtazeri, S.; Zhu, X.X. Large-scale building height retrieval from single SAR imagery based on bounding box regression networks. ISPRS J. Photogram. Remote Sens. 2022, 184, 79–95.
22. Wang, J.Y.; Hu, X.L.; Meng, Q.Y.; Zhang, L.L.; Wang, C.Y.; Liu, X.C.; Zhao, M.F. Developing a method to extract building 3D information from GF-7 data. Remote Sens. 2021, 13, 4532.
23. Zhang, C.N.; Cui, Y.F.; Zhu, Z.Y.; Jiang, S.; Jiang, W.S. Building height extraction from GF-7 satellite images based on roof contour constrained stereo matching. Remote Sens. 2022, 14, 1566.
24. Karatsiolis, S.; Kamilaris, A.; Cole, I. Img2ndsm: Height estimation from single airborne RGB images with deep learning. Remote Sens. 2021, 13, 2417.
25. Geiß, C.; Brzoska, E.; Pelizari, P.A.; Lautenbach, S.; Taubenböck, H. Multi-target regressor chains with repetitive permutation scheme for characterization of built environments with remote sensing. Int. J. Appl. Earth Obs. 2022, 106, 102657.
26. Wen, D.W.; Huang, X.; Zhang, A.L.; Ke, X.L. Monitoring 3D building change and urban redevelopment patterns in inner city areas of Chinese megacities using multi-view satellite imagery. Remote Sens. 2019, 11, 763.
27. Liu, C.; Huang, X.; Wen, D.W.; Chen, H.J.; Gong, J.Y. Assessing the quality of building height extraction from ZiYuan-3 multi-view imagery. Remote Sens. Lett. 2017, 8, 907–916.
28. Cao, Y.X.; Huang, X. A deep learning method for building height estimation using high-resolution multi-view imagery over urban areas: A case study of 42 Chinese cities. Remote Sens. Environ. 2021, 264, 112590.
29. Chen, P.M.; Huang, H.B.; Liu, J.Y.; Wang, J.; Liu, C.; Zhang, N.; Su, M.; Zhang, D.J. Leveraging Chinese GaoFen-7 imagery for high-resolution building height estimation in multiple cities. Remote Sens. Environ. 2023, 298, 113802.
30. Liu, R.; Zhang, H.S.; Yip, K.H.A.; Ling, J.; Lin, Y.Y.; Huang, H.B. Automatic building height estimation with shadow correction over heterogeneous compact cities using stereo Gaofen-7 data at sub-meter resolution. Building 2023, 69, 106283.
31. Zhao, Y.X.; Qi, J.Z.; Korn, F.; Wang, X.Y. Scalable building height estimation from street scene images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18.
32. Izadi, M.; Saeedi, P. Three-dimensional polygonal building model estimation from single satellite images. IEEE Trans. Geosci. Remote Sens. 2011, 50, 2254–2272.
33. Zhao, X.W.; Shan, L.; Sun, Z.C.; Ma, H.Q.; Kong, X.H.; Xu, Y.Q. Detection of building shadows in high-resolution remote sensing images by using improved DeepLabV3+. Remote Sens. Lett. 2025, 16, 290–301.
34. Takaku, J.; Tadono, T.; Tsutsui, K.; Ichikawa, M. Validation of ‘AW3D’ global DSM generated from Alos Prism. ISPRS Ann. Photogram. Remote Sens. Spatial Inf. Sci. 2016, 3, 25–31.
35. Liasis, G.; Stavrou, S. Satellite images analysis for shadow detection and building height estimation. ISPRS J. Photogram. Remote Sens. 2016, 119, 437–450.
36. Dandabathula, G.; Sitiraju, S.R.; Jha, C.S. Retrieval of building heights from ICESat-2 photon data and evaluation with field measurements. Environ. Res. Infrastruct. Sustain. 2021, 1, 011003.
37. Lao, J.Y.; Wang, C.; Zhu, X.X.; Xi, X.H.; Nie, S.; Wang, J.L.; Cheng, F.; Zhou, G.Q. Retrieving building height in urban areas using ICESat-2 photon-counting LiDAR data. Int. J. Appl. Earth Obs. 2021, 104, 102596.
38. Li, Q.Y.; Mou, L.C.; Hua, Y.S.; Shi, Y.L.; Chen, S.N.; Sun, Y.; Zhu, X.X. 3DCentripetalNet: Building height retrieval from monocular remote sensing imagery. Int. J. Appl. Earth Obs. 2023, 120, 103311.
39. Zhao, Y.; Wu, B.; Li, Q.X.; Yang, L.; Fan, H.C.; Wu, J.P.; Yu, B.L. Combining ICESat-2 photons and Google Earth Satellite images for building height extraction. Int. J. Appl. Earth Obs. 2023, 117, 103213.
40. Cai, P.L.; Guo, J.X.; Li, R.K.; Xiao, Z.; Fu, H.Y.; Guo, T.Z.; Zhang, X.P.; Li, Y.S.; Song, X.F. Automated Building Height Estimation Using Ice, Cloud, and Land Elevation Satellite 2 Light Detection and Ranging Data and Building Footprints. Remote Sens. 2024, 16, 263.
41. Xu, W.Q.; Feng, Z.Y.; Wan, Q.; Xie, Y.K.; Feng, D.J.; Zhu, J. Building height extraction from high-resolution single-view remote sensing images using shadow and side information. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 17, 6514–6528.
42. Lu, S.Q.; Lu, H.L.; Chen, Y.; Zhang, C.R.; Miao, C.H. Multi-angle analysis of shading-based modelling: Extracting the urban building height based on ZY-3 three-line-array camera. Int. J. Remote Sens. 2025, 46, 4241–4273.
43. Sawant, K. Adaptive methods for determining DBSCAN parameters. Int. J. Innov. Sci. Eng. Technol. 2014, 1, 329–334.
44. Starczewski, A.; Goetzen, P.; Er, M.J. A new method for automatic determining of the DBSCAN parameters. J. Artif. Intell. Soft Comput. Res. 2020, 10, 209–221.
45. Huang, X.; Cheng, F.; Bao, Y.L.; Wang, C.; Wang, J.L.; Wu, J.N.; He, J.L.; Lao, J.Y. Urban building height extraction accommodating various terrain scenes using ICESat-2/ATLAS data. Int. J. Appl. Earth Obs. 2024, 130, 103870.
46. Chang, J.X.; Jiang, Y.H.; Tan, M.L.; Wang, Y.M.; Wei, S.D. Building Height Extraction Based on Spatial Clustering and a Random Forest Model. ISPRS Int. J. Geo-Inf. 2024, 13, 265.
47. Cai, B.W.; Shao, Z.F.; Huang, X.; Zhou, X.C.; Fang, S.H. Deep learning-based building height mapping using Sentinel-1 and Sentinel-2 data. Int. J. Appl. Earth Obs. 2023, 122, 103399.
48. Wu, B.; Huang, H.L.; Zhao, Y. Utilizing building offset and shadow to retrieve urban building heights with ICESat-2 photons. Remote Sens. 2023, 15, 3786.

Figure 1. Methodological Workflow Diagram.

Figure 2. Shadow-Based Building Height Inversion Model.

Figure 3. Flowchart of Building Spatial Distribution Classification.

Figure 4. The 13 Cities in China Selected in this Study.

Figure 5. Image and Label.

Figure 6. Building Clustering Results for Beijing, Shijiazhuang, Baotou, Lanzhou, Harbin, and Lhasa. Different colors represent spatial clusters identified by DBSCAN.

Figure 7. Building Clustering Results for Tianjin, Chongqing, Haikou, Wuhan, Nanning, Kunming, and Shanghai. Different colors represent spatial clusters identified by DBSCAN.

Figure 8. Building Spatial Classification Results in Local Subregions of Different Cities.

Figure 9. Comparison of MAE and RMSE Variations before and after Optimization.

Figure 10. Scatter plot of predicted versus reference building heights for all 11,168 buildings across 13 cities.

Figure 11. Illustration of Primary Sources Causing Estimation Outliers.

Figure 12. Comparison of Predicted and Reference Building Heights in Beijing, Nanning, and Lhasa.

Figure 13. Comparison of Predicted and Reference Building Heights in 10 Representative Cities.

Figure 14. 3D Building Modeling Results for Haikou, Kunming, Lhasa, Baotou, and Beijing. The color coding represents different height ranges: red for 0–30 m, yellow for 30–60 m, blue for 60–100 m, and dark blue for buildings over 100 m. The white borders indicate sub-regions requiring close magnification, which are detailed in the three-dimensional diagram on the right.

Figure 15. 3D Building Modeling Results for Harbin, Nanning, Shijiazhuang, Chongqing, and Shanghai. The color coding represents different height ranges: red for 0–30 m, yellow for 30–60 m, blue for 60–100 m, and dark blue for buildings over 100 m. The white borders indicate sub-regions requiring close magnification, which are detailed in the three-dimensional diagram on the right.

Figure 16. Error Analysis of Building Height Estimation across Different Height Levels.

Figure 17. Representative challenging scenarios for super-high-rise building height estimation.

Figure 18. Violin Plots of Relative Errors across Different Height Levels.

Table 1. Spatial Morphology Classification Rules.

Classification Type	$H_{75}$	$D$	$M$	Examples of Typical Functional Areas
High-rise zones	≥70 m	≥0.7	/	CBD, high-density residential areas
Mid-to-high-rise mixed zones	≥40 m & <70 m	≥0.5	≥0.5	Urban complexes, mixed-use development zones
Dense low-rise zones	≤40 m	≥0.6	<0.3	Industrial parks, logistics and warehousing areas
Other	Not Meeting the Above Conditions			Urban fringe, low-density mixed-use areas
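
The rules in Table 1 can be read as a simple decision function. The sketch below is one possible encoding, assuming each cluster has already been summarized by the three indicators in the table header ($H_{75}$ in metres, $D$, and $M$), whose definitions are given in Section 2.2.2 and are treated here as plain numbers. The function name `classify_zone` and the returned labels are hypothetical, and the order in which the rules are checked is an assumption.

```python
def classify_zone(h75, d, m):
    """Map the Table 1 indicators of one building cluster to a zone label.

    h75: the H_75 height indicator in metres
    d, m: the D and M indicators from Table 1 (treated as plain numbers)
    """
    # High-rise zones: M is not used ("/" in the table).
    if h75 >= 70 and d >= 0.7:
        return "high-rise"
    # Mid-to-high-rise mixed zones.
    if 40 <= h75 < 70 and d >= 0.5 and m >= 0.5:
        return "mid-to-high-rise mixed"
    # Dense low-rise zones.
    if h75 <= 40 and d >= 0.6 and m < 0.3:
        return "dense low-rise"
    # Anything that meets none of the rules above.
    return "other"


# Hypothetical indicator values, for illustration only.
labels = [
    classify_zone(85.0, 0.8, 0.1),
    classify_zone(55.0, 0.6, 0.7),
    classify_zone(20.0, 0.65, 0.1),
    classify_zone(55.0, 0.6, 0.2),
]
```

The height ranges of the mixed and low-rise rules both include exactly 40 m, but their conditions on $M$ (at least 0.5 versus below 0.3) cannot hold together, so at most one rule fires for any cluster.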

Table 2. Comparison of the Proposed Method with Existing Building Height Estimation Methods.

Method	Basic Principle	Study Area	Number of Buildings	MAE (m)	RMSE (m)
Zhao et al. [39]	Shadow-based height inversion method	Shanghai	15,966	4.08	5.34
Chang et al. [46]	BIRCH-based clustering and random forest model	Hohhot	5240	2.284	3.715
Chang et al. [46]	BIRCH-based clustering and random forest model	Ordos	7202	1.723	3.052
Zhang et al. [23]	Roof contour-constrained stereo matching method	Yingde	8653	2.31	3.01
Cai et al. [47]	Adaptive Photon Selection Technique	New York	17,399	3.00	8.1
Wu et al. [48]	Integration of building offset and shadow length	Shanghai	8216	4.70	6.75
Ours	Considering spatial distribution types	13 cities in China	11,168	2.07	2.56

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xie, Y.; Tu, J.; Zhao, Y.; Xia, R.; Song, W.; Feng, D.; Hu, Y. Single-Image Building Height Estimation Using Spatial Distribution-Aware Optimization in Complex Urban Areas. Remote Sens. 2026, 18, 801. https://doi.org/10.3390/rs18050801
