1. Introduction
With the continuous advancement of global urbanization, the urban population has been expanding rapidly. According to a report by the United Nations, approximately 68% of the world’s population is projected to reside in cities by 2050 [1]. Constrained by limited land resources, urban spatial development has gradually shifted from two-dimensional expansion to three-dimensional growth, with high-rise and super high-rise buildings increasingly becoming prominent features of urban landscapes [2,3,4]. Against this backdrop, building height has emerged as a critical parameter for characterizing urban spatial structure, functional distribution, and vertical development. It plays a vital role in urban planning [5,6,7], population estimation [8,9,10], and energy assessment and climate simulation [11,12]. Consequently, developing building height estimation methods that are low-cost, efficient, and highly adaptable has become a central topic in urban remote sensing research.
Remote sensing data provide a rich foundation for building height estimation, with mainstream technical approaches primarily including Light Detection and Ranging (LiDAR) [13,14], Synthetic Aperture Radar (SAR) [15,16], and optical remote sensing imagery [17,18]. LiDAR acquires dense three-dimensional point clouds through high-precision laser ranging, enabling highly accurate building height estimation when combined with digital elevation models (DEMs) and building footprints [19]. However, LiDAR data acquisition is costly and often constrained by weather conditions and terrain complexity, which limits its applicability for large-scale mapping and frequent temporal updates. SAR-based approaches estimate building height indirectly by applying interferometric synthetic aperture radar (InSAR) techniques to derive surface phase differences, offering all-weather and all-day observation capabilities [20,21]. Nevertheless, in dense urban environments, SAR performance is frequently affected by speckle noise, multipath scattering, and coherence loss, resulting in reduced stability and accuracy. Optical remote sensing imagery has therefore become one of the most widely used data sources for building height estimation, owing to its high spatial resolution, rich texture information, and convenient accessibility [22,23].
In comparison with LiDAR and SAR, optical remote sensing imagery also offers strong visual interpretability, which makes it one of the most widely applicable data sources for building height estimation [24,25]. Among optical-based approaches, reconstructing a Digital Surface Model (DSM) from stereo imagery has become the mainstream solution. The basic workflow involves generating a disparity map from stereo image pairs, constructing the DSM, subtracting a Digital Elevation Model (DEM) to obtain a normalized DSM (nDSM), and subsequently deriving building heights [26]. In practical applications, Liu et al. employed ZY-3 stereo imagery combined with the Semi-Global Matching (SGM) algorithm to construct DSMs, and further extracted building height information using morphological Top-Hat transformation, achieving high-resolution building height estimates across multiple cities [27]. Wang et al. applied a stereo-pair algorithm to GF-7 imagery in the Beijing region, which effectively improved DSM completeness and detail preservation in complex urban scenes [22]. To address the systematic underestimation of tall buildings, Zhang et al. proposed a stereo matching method incorporating building roof footprint constraints [23]. Experiments conducted in Yingde, Guangzhou (8653 buildings) and Xi’an, Shaanxi (40 buildings) demonstrated that this strategy significantly alleviated DSM underestimation. However, the study also revealed that the method relies heavily on accurately annotated building footprint data, which limits its scalability for large-area applications.
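The nDSM step of this workflow can be illustrated with a minimal Python sketch. It assumes the DSM and DEM are already co-registered grids in metres and that a binary footprint mask is available. The function names and the choice of the median as the per-building statistic are illustrative assumptions, not details taken from the cited studies.

```python
import statistics

def normalized_dsm(dsm, dem):
    """Cell-wise nDSM = DSM - DEM for two grids of identical shape."""
    return [[s - t for s, t in zip(s_row, t_row)] for s_row, t_row in zip(dsm, dem)]

def building_height(ndsm, footprint_mask):
    """Median nDSM value over the cells flagged in the footprint mask (assumed statistic)."""
    values = [
        v
        for v_row, m_row in zip(ndsm, footprint_mask)
        for v, m in zip(v_row, m_row)
        if m
    ]
    if not values:
        raise ValueError("footprint mask selects no cells")
    return statistics.median(values)

# Toy 3 x 3 grids: a building on flat ground at 40 m elevation.
dsm = [[52, 52, 40], [52, 53, 40], [40, 40, 40]]
dem = [[40, 40, 40], [40, 40, 40], [40, 40, 40]]
mask = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
height = building_height(normalized_dsm(dsm, dem), mask)
```

A robust statistic such as the median limits the influence of single mismatched cells near roof edges; other choices (for example a high percentile) are equally plausible.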
To further improve large-scale estimation accuracy, Cao and Huang introduced ZY-3 multi-view imagery and a multi-task deep neural network to estimate building heights across 42 cities in China [28]. By integrating multispectral imagery with elevation labels to construct a regression framework, their study demonstrated the feasibility of combining multi-view data and deep learning for large-scale, high-resolution building height estimation. To address the mismatch issues of traditional SGM methods in regions with insufficient texture, large-scale height variation, or severe occlusion, Chen et al. adopted the StereoNet network to reconstruct disparity maps. Experiments conducted in several cities, including Chongqing, Tianjin, and Guangzhou, showed a substantial improvement in height estimation accuracy for tall buildings, achieving more than a 40% reduction in RMSE compared with conventional SGM methods for buildings higher than 60 m [29]. Meanwhile, to mitigate the influence of shadows on disparity matching in GF-7 imagery, Liu et al. applied histogram equalization for shadow compensation and combined building compactness analysis with zonal statistics to perform object-oriented building height estimation from nDSMs. This strategy enhanced adaptability in heterogeneous urban environments, such as dense low-rise zones and high-rise commercial zones [30]. Nevertheless, existing approaches generally depend on high-quality stereo imagery, accurately delineated building footprints, or labeled elevation samples. When confronted with strong intra-urban heterogeneity in spatial distribution (such as variations in building density, functional mixing, and structural compactness), these methods often exhibit notable estimation inconsistencies.
In recent years, low-cost building height estimation methods have gradually expanded beyond reliance on single optical imagery to incorporate non-shadow-based imaging techniques. These approaches include the use of street-view imagery or perspective images to reconstruct building heights through stereo matching or structured light reconstruction, achieving height estimation at similarly low cost [31]. However, street-view and perspective imagery exhibit limited adaptability in complex urban environments, particularly in high-density areas where severe occlusion and complicated illumination conditions can substantially degrade height estimation accuracy. By contrast, geometric height inversion methods based on building shadows, among the earliest techniques for estimating building height from optical imagery, remain effective for small- to medium-scale areas with clear imaging conditions. Owing to their simple algorithmic structure, low data requirements, and high computational efficiency, these methods continue to be widely used [32,33]. Building height is estimated by measuring shadow length and applying geometric models that incorporate solar elevation angles and sensor parameters [34]. For example, Liasis and Stavrou developed an automatic method for extracting shadow axes and boundaries, achieving a height estimation variance of 4.13% across 198 buildings [35]. Xie et al. proposed a “tangent plus fishnet tangent” strategy combined with RMU-Net for accurate shadow boundary segmentation, constraining height estimation errors within 2 m for 131 buildings [17]. More recently, the integration of ICESat-2 ATL03 photon data as reference height samples for global fitting and correction of shadow-based height inversion models has emerged as a promising direction to further enhance estimation reliability [36,37].
However, shadow-based height inversion methods exhibit notable limitations in complex urban environments, particularly in areas characterized by strong structural heterogeneity. First, in high-density urban settings, building shadows are frequently obstructed or overlapped by adjacent structures and tree canopies, which complicates shadow contour extraction and leads to unstable shadow length measurements, thereby degrading height estimation accuracy [38,39,40]. Second, variations in building orientation relative to solar illumination across different urban regions can introduce systematic bias in shadow length measurement, especially in areas with irregular layouts or highly diverse orientation patterns [41,42]. In addition, traditional shadow-based height inversion approaches often rely on image metadata (e.g., solar elevation angle) or require prior building height samples for calibration, which limits their degree of automation and scalability, making them less suitable for large-area height estimation across multifunctional urban environments [17].
To address the aforementioned challenges, this study proposes a single-image building height estimation method that explicitly incorporates spatial distribution characteristics. The proposed framework integrates spatial typology classification with a region-specific, multi-strategy optimization scheme, enabling building height estimation to be adaptively adjusted according to different urban spatial scenarios. By transforming the height inversion task from a globally uniform estimation problem into a spatially differentiated optimization process, the method effectively mitigates estimation errors arising from heterogeneous building distributions. Moreover, because the proposed method relies solely on a single high-resolution image, it reduces data dependency compared to stereo- or multi-source-based approaches and simplifies the data preparation process. This characteristic facilitates practical application in large-area building height estimation tasks. The main contributions of this paper are summarized as follows:
- (1) This study introduces an urban building spatial distribution classification mechanism into the single-image building height estimation framework. By explicitly considering spatial heterogeneity, region-specific height optimization strategies are designed for three typical urban spatial types (high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones), thereby improving estimation accuracy and robustness under diverse spatial distribution conditions.
- (2) A joint shadow processing algorithm that integrates fishnet partitioning with the Pauta criterion is developed for shadow length measurement and outlier suppression. This strategy significantly enhances the stability and reliability of shadow extraction under challenging conditions, such as occlusion, shadow overlap, and complex illumination.
The remainder of this paper is organized as follows:
Section 2 describes the proposed methodology, including spatial typology classification, shadow length extraction, and multi-strategy height optimization.
Section 3 presents the experimental design and results.
Section 4 provides discussion and comparison with existing methods.
Section 5 concludes the paper and outlines future research directions.
2. Methods
Unlike conventional unified modeling approaches that assume a globally consistent relationship between image-derived features and building height, this study proposes a single-image building height estimation framework that explicitly accounts for the spatial heterogeneity of urban environments. The proposed framework integrates geometric height inversion based on building shadows with spatial distribution-aware optimization, as illustrated in Figure 1. Based on annotated building roofs and shadow boundaries, building shadow lengths are robustly extracted using a fishnet–Pauta strategy and subsequently converted into preliminary height estimates through a scale factor model under three sun–sensor geometric configurations. Rather than directly pursuing absolute height accuracy at the individual building level, the framework characterizes urban spatial patterns using relative height statistics, spatial density, and functional heterogeneity, which serve as key constraints for subsequent spatially differentiated optimization.
Buildings are clustered using DBSCAN and categorized into three representative spatial types: high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones. Through this spatial partitioning, the building height inversion task is reformulated from a globally uniform estimation problem into a region-specific optimization process, in which local spatial context and building distribution characteristics are explicitly incorporated. For each spatial type, differentiated optimization strategies—including neighborhood-weighted correction, similarity-constrained local regression, and median smoothing—are applied to suppress region-dependent systematic biases and local outliers. The proposed method is evaluated on 11,168 buildings across 13 representative cities, demonstrating high accuracy, robustness, and applicability under diverse urban morphologies.
2.1. Preliminary Building Height Extraction
2.1.1. Shadow-Based Building Height Calculation
To enable efficient and automated building height estimation, this study establishes three typical shadow-based height inversion models according to the geometric relationship between building shadows and the sun–sensor configuration. These models correspond to three representative scenarios: (i) the sun and sensor are oriented in the same direction, (ii) the azimuth difference between the sun and sensor exceeds 180°, and (iii) the azimuth difference lies between 0° and 180°, as illustrated in Figure 2. Differences in shadow visibility and observation geometry across the image lead to distinct projection patterns of building shadows, which in turn require the adoption of corresponding height calculation formulations to ensure accurate geometric modeling.
Figure 2a illustrates the case in which the sun and sensor are oriented in the same direction. In this configuration, α denotes the solar elevation angle, β denotes the sensor elevation angle, AB represents the building height, BC denotes the shaded façade portion of the building, BD corresponds to the total shadow length, and CD indicates the shadow segment observable in the remote sensing image. Under this geometric condition, the effective shadow length measured from the image is CD, and the building height AB can be calculated using Equation (1):
According to Equation (1), the building height depends solely on the shadow length measured from the remote sensing image and the fixed sun–sensor parameters at the time of image acquisition. This relationship indicates that building height is linearly proportional to the detected shadow length under a given imaging geometry. By defining a proportionality coefficient, Equation (1) can be simplified to Equation (2):
Figure 2b illustrates the scenario in which the azimuth difference between the sun and the sensor exceeds 180°. Under this configuration, the sensor is able to capture the complete shadow cast by the building, so the shaded façade segment BC equals zero. Based on this geometry, the building height AB can be calculated using Equation (3):
Similarly, according to Equation (3), the building height is linearly proportional to the shadow length detected in the image under the given imaging geometry. By defining a proportionality coefficient, Equation (3) can be simplified to Equation (4):
Figure 2c depicts the case in which the azimuth difference between the sun and the sensor lies between 0° and 180°. Under this condition, the influence of the sensor azimuth on shadow detection must be explicitly considered, as this configuration represents the most common scenario for building shadows in optical remote sensing imagery. In Figure 2c, γ denotes the solar azimuth angle, δ denotes the sensor azimuth angle, and ε represents the angle between the building orientation and the shadow projection measured in the clockwise direction. Based on this geometric relationship, the building height can be calculated using Equation (5):
According to Equation (5), the building height is linearly proportional to the shadow length detected in the image under the given geometric configuration. By defining a proportionality coefficient, Equation (5) can be simplified to Equation (6):
Although the three scenarios differ in their geometric configurations, the building height inversion can be uniformly formulated as a multiplicative relationship in which the building height equals the measured shadow length multiplied by a scale factor. The scale factor is jointly determined by the solar elevation angle, sensor elevation angle, building orientation, and their relative angular relationships.
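As a sketch of this multiplicative form, the following Python snippet computes scale factors for the first two configurations. The formulas are derived here from the stated geometry under the assumptions of flat terrain, a vertical façade, and BC being the ground length of shadow hidden from the sensor. For the same-direction case, $BD = AB/\tan\alpha$, $BC = AB/\tan\beta$, and $CD = BD - BC$, which gives $AB = CD \cdot \tan\alpha \tan\beta / (\tan\beta - \tan\alpha)$. For the full-shadow case, $CD = BD$ and $AB = CD \cdot \tan\alpha$. These expressions are a reconstruction rather than a quotation of the numbered equations, and the function names are hypothetical. The third configuration is left out because it additionally depends on the azimuth angles and the building orientation.

```python
import math

def scale_factor_same_side(alpha_deg: float, beta_deg: float) -> float:
    """Scale factor when the building hides part of its own shadow from the sensor.

    alpha_deg: solar elevation angle, beta_deg: sensor elevation angle (degrees).
    """
    tan_a = math.tan(math.radians(alpha_deg))
    tan_b = math.tan(math.radians(beta_deg))
    if tan_b <= tan_a:
        # The hidden part would be at least as long as the whole shadow.
        raise ValueError("no visible shadow: sensor elevation must exceed solar elevation")
    return tan_a * tan_b / (tan_b - tan_a)

def scale_factor_full_shadow(alpha_deg: float) -> float:
    """Scale factor when the sensor sees the complete shadow (BC = 0)."""
    return math.tan(math.radians(alpha_deg))

def building_height(shadow_length_m: float, k: float) -> float:
    """Height as visible shadow length multiplied by the scale factor."""
    return k * shadow_length_m

# Example: solar elevation 45 degrees, sensor elevation atan(2), about 63.4 degrees.
k_same = scale_factor_same_side(45.0, math.degrees(math.atan(2.0)))
h_full = building_height(20.0, scale_factor_full_shadow(45.0))
```

Because the scale factor depends only on the acquisition geometry, it can be computed once per image and reused for every building in that scene.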
2.1.2. Shadow Length Calculation and Gross Error Elimination
To improve the accuracy of building height estimation, precise measurement of shadow length is a critical prerequisite. This study proposes a shadow length estimation method that combines the fishnet strategy with the Pauta criterion to effectively mitigate errors arising from complex shadow geometries, ambiguous boundaries, and noise interference commonly encountered in traditional approaches. Specifically, a set of evenly spaced parallel vector lines is generated within the shadow region along the solar azimuth to form a feature line set. Each line segment represents the projected shadow length in the solar direction and serves as an input for subsequent building height inversion. However, due to factors such as terrain obstruction, uneven illumination, and image noise, some shadow line segments may exhibit abnormal deviations. Directly using the mean or median of all line segments may therefore amplify estimation errors. To address this issue, a gross-error elimination strategy based on the Pauta criterion is introduced to iteratively refine the feature line set. By computing the mean and standard deviation of the line segment lengths, a confidence interval is constructed according to the 3σ principle, and outliers exceeding this range are removed to enhance the stability and robustness of shadow length estimation. The inter-line spacing is determined through experimental analysis and optimized based on empirical results to balance extraction accuracy and computational efficiency. Finally, the shadow length is obtained as the mean value of the filtered valid line segments. The overall algorithmic workflow is summarized in Algorithm 1.
| Algorithm 1 Shadow Length Calculation Method (Combining the Fishnet Method and Pauta Criterion) |
| Input: shadow region $S$, solar azimuth $\gamma$, line spacing $d$ |
| Output: filtered shadow line segment length |
| 1: // Feature line generation |
| 2: generate a set of fishnet lines $F$ in the shadow region based on $\gamma$ and $d$ |
| 3: for each line in $F$ do |
| 4: calculate the intersection points between the lines and the building corners |
| 5: calculate the length of each intersection line and record it in the length list $L$ |
| 6: if the intersection line does not exceed the corner range then |
| 7: keep the intersection line and update the target contour |
| 8: else |
| 9: remove the out-of-bounds intersection lines |
| 10: end if |
| 11: end for |
| 12: // Gross error elimination |
| 13: calculate the standard deviation $\sigma$ and arithmetic mean $\bar{l}$ of $L$ |
| 14: repeat |
| 15: for each $l_i$ in $L$ do |
| 16: calculate the residual $v_i = l_i - \bar{l}$ |
| 17: if $\lvert v_i \rvert \le 3\sigma$ then |
| 18: retain $l_i$ |
| 19: else |
| 20: remove $l_i$ |
| 21: end if |
| 22: end for |
| 23: recalculate the mean $\bar{l}$ and standard deviation $\sigma$ of $L$ |
| 24: until all errors are removed |
| 25: return $\bar{l}$ |
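The gross-error elimination stage of Algorithm 1 can be sketched in a few lines of Python. The sketch assumes the fishnet segment lengths have already been measured; the function names, the use of the sample standard deviation, and the iteration cap are illustrative choices rather than details given in the text.

```python
import statistics

def pauta_filter(lengths, k=3.0, max_iter=100):
    """Iteratively drop values whose residual from the mean exceeds k * sigma."""
    kept = list(lengths)
    for _ in range(max_iter):
        if len(kept) < 3:
            break  # too few samples to estimate a meaningful spread
        mean = statistics.fmean(kept)
        sigma = statistics.stdev(kept)  # sample standard deviation (assumed)
        if sigma == 0:
            break  # all remaining values are identical
        filtered = [x for x in kept if abs(x - mean) <= k * sigma]
        if len(filtered) == len(kept):
            break  # nothing removed in this pass: converged
        kept = filtered
    return kept

def shadow_length(lengths):
    """Shadow length as the mean of the segments that survive filtering."""
    return statistics.fmean(pauta_filter(lengths))

# Twenty consistent segments plus one gross error (e.g. a merged shadow).
segments = [10.0] * 10 + [10.2] * 5 + [9.8] * 5 + [50.0]
kept_segments = pauta_filter(segments)
estimated_length = shadow_length(segments)
```

One property worth keeping in mind: with the sample standard deviation, the largest possible standardized residual among n values is (n - 1)/√n, which is below 3 for n ≤ 10. The 3σ rule therefore cannot reject anything unless at least 11 segments are sampled per shadow, which is one reason the line spacing matters.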
2.2. Building Spatial Distribution Classification
Urban buildings exhibit substantial heterogeneity in height, density, spatial arrangement, and functional composition, which directly affects the accuracy of single-image building height estimation. Conventional methods often struggle to maintain robustness across diverse urban scenarios (such as high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones), resulting in systematic or localized estimation errors. To address this issue, a spatial distribution classification framework is proposed (Figure 3), which integrates building height characteristics, spatial density, and functional heterogeneity. Specifically, DBSCAN is first employed to identify spatial clusters of buildings. Subsequently, a three-dimensional indicator system is constructed to classify urban areas into four categories: high-rise zones, mid-to-high-rise mixed zones, dense low-rise zones, and others. By explicitly accounting for spatial distribution patterns, this classification framework enhances the adaptability and accuracy of shadow-based height inversion in complex urban environments.
2.2.1. Building Cluster Analysis Based on DBSCAN
To identify the initial spatial clustering structure of buildings, this study employs the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise). A key advantage of DBSCAN is that it does not require the number of clusters to be predefined and is capable of identifying clusters with arbitrary shapes based on local density characteristics, making it particularly suitable for analyzing building groups with complex spatial distributions. By introducing DBSCAN into the framework, customized optimization strategies can be applied to different building distribution areas, thereby improving the accuracy and stability of building height estimation. DBSCAN relies on two critical parameters: the neighborhood radius ε and the minimum number of points minPts, which define the spatial connectivity range among buildings and the minimum local density required to form a cluster, respectively. Given a set of buildings, a building is identified as a core object if at least minPts other buildings fall within its ε-neighborhood. All density-connected core objects, together with their associated boundary objects, collectively constitute a building cluster.
In terms of parameter configuration, the ε value is determined by considering spatial variations in building distribution. Specifically, ε is set to the third quartile ($Q_3$) of the nearest-neighbor distance distribution within the study area, ensuring that more than 80% of spatial associations among buildings are effectively captured. The nearest-neighbor distance is calculated as follows. First, the distance between each building and all other buildings is computed, and the average distance is obtained. This average distance is then used as the radius of a search circle, within which the point closest to the circumference is identified. The characteristic distance k is subsequently determined as the mode of these distances [43]. The use of $Q_3$ enables ε to adaptively reflect differences in building density: in high-density zones, $Q_3$ assumes smaller values, capturing shorter inter-building distances, whereas in low-density zones, larger $Q_3$ values accommodate more dispersed building patterns. The minPts parameter is dynamically adjusted according to building density gradients. It is set to 5 in high-density urban areas and reduced to 3 in urban–rural transition zones [44]. This setting is supported by spatial autocorrelation analysis of building centroids, yielding a Moran’s I value of 0.67 (p < 0.01), which indicates strong spatial dependence in building distribution.
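To make the parameterization concrete, the following Python sketch derives ε from the third quartile of plain nearest-neighbor distances and then runs a minimal DBSCAN over building centroids. It is a simplification under stated assumptions: the nearest-neighbor distance is taken as the distance to the closest other centroid (not the search-circle procedure described above), the quartile uses the inclusive interpolation of Python's `statistics.quantiles`, and all function names are hypothetical. Following the wording of the text, a core object needs at least `min_pts` other buildings within ε; note that scikit-learn's `min_samples` counts the point itself.

```python
import math
import statistics

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

def third_quartile(values):
    """Q3 with inclusive linear interpolation (assumed quartile convention)."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise reached from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core object: keep expanding
    return labels

# Two compact blocks of four buildings and one isolated building.
centroids = [(0, 0), (1, 0), (0, 1), (1, 1),
             (10, 10), (11, 10), (10, 11), (11, 11),
             (50, 50)]
eps = third_quartile(nearest_neighbor_distances(centroids))
labels = dbscan(centroids, eps, min_pts=2)  # toy value; the text uses 5 or 3
```

The brute-force neighbor search is quadratic in the number of buildings; for city-scale data a spatial index would be the natural replacement.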
Through spatial distribution classification, targeted optimization strategies can be applied according to the distribution characteristics of buildings in different areas. This strategy not only improves the accuracy of building height estimation but also enhances the overall stability and robustness of the proposed approach.
2.2.2. Classification Indicators and Rules
To provide a detailed characterization of urban building spatial structures and to support the differentiated adaptation of height estimation strategies, this study proposes a zoning scheme that classifies urban buildings based on height characteristics, spatial density, and functional mix. Using this scheme, building clusters are categorized into four typical spatial forms: high-rise zones, mid-to-high-rise mixed zones, dense low-rise zones, and other zones.
- (1) Height characteristics indicator:
Building height distribution is the core criterion for distinguishing different urban functional zones. Conventional mean-based indicators are highly sensitive to extreme values, while median-based measures tend to lose discriminative power in areas with mixed building heights. To address these limitations, the 75th percentile of building height ($H_{75}$) is adopted as the primary height indicator. This metric effectively captures the dominant height level within a building cluster while maintaining strong sensitivity to the presence of high-rise buildings.
$$H_{75} = \inf\{\, h : F(h) \ge 0.75 \,\}$$
Here, $F(h)$ denotes the empirical distribution function (EDF) of building heights, representing the proportion of buildings with heights less than or equal to $h$. The operator $\inf$ indicates the infimum, i.e., the greatest lower bound of the set. Accordingly, $H_{75}$ is defined as the minimum height value such that at least 75% of the buildings have heights less than or equal to $H_{75}$.
- (2) Density characteristic indicator:
Building space density is a key parameter for distinguishing urban functional zones. However, traditional density metrics often exhibit limited effectiveness when applied to building groups with complex layouts or irregular spatial distributions. To address this limitation, this study introduces an improved standardized density index D, which more accurately captures the degree of spatial compactness within local building clusters.
The index is defined in terms of the footprint area of the i-th building, the area of the minimum bounding rectangle (MBR), and the convex hull area of the building cluster. By constraining the denominator to no more than 1.2 times the smaller of these two reference areas, the proposed formulation ensures robustness while preventing density overestimation. This constraint effectively enhances the discrimination between densely built areas and spatially dispersed building distributions.
- (3) Mixedness indicator:
To effectively identify mid-to-high-rise mixed zones characterized by strong heterogeneity and functional complexity, this study constructs a composite mixedness index M that jointly captures height variability and functional diversity. The index consists of two complementary dimensions. The height dispersion term quantifies internal variations in building height and is expressed using a standardized interquartile range, which is robust to extreme values. The functional entropy term is derived from information entropy theory and provides a quantitative measure of the degree of functional mixing among different building types within the cluster.
The height dispersion term is built from the 75th percentile, 25th percentile, and median of building height. The functional entropy term is built from the total number of functional types (e.g., residential, commercial, industrial) and the proportion of the k-th functional type, that is, the share of buildings of that type within the cluster. The functional type (e.g., commercial, residential, or mixed-use) is inferred by analyzing each building’s height difference (Height_Dif) and distance to the nearest building (Distance_T).
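The percentile and mixedness ingredients described in this list can be sketched in Python as follows. The percentile follows the infimum definition over the empirical distribution function. The dispersion is taken as the interquartile range divided by the median, and the entropy as Shannon entropy with the natural logarithm; both forms are assumptions consistent with the description, and how the two terms are combined into M is deliberately not shown. The density index is omitted for the same reason. Function names are hypothetical.

```python
import math
from collections import Counter

def percentile_inf(heights, p):
    """Smallest height h such that the share of buildings with height <= h is at least p."""
    if not heights:
        raise ValueError("heights must not be empty")
    ordered = sorted(heights)
    n = len(ordered)
    for rank, h in enumerate(ordered, start=1):
        if rank / n >= p:
            return h
    return ordered[-1]

def height_dispersion(heights):
    """Interquartile range standardized by the median (assumed form)."""
    h75 = percentile_inf(heights, 0.75)
    h25 = percentile_inf(heights, 0.25)
    h50 = percentile_inf(heights, 0.50)
    return (h75 - h25) / h50

def functional_entropy(types):
    """Shannon entropy of the functional-type proportions, natural log (assumed form)."""
    counts = Counter(types)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

cluster_heights = [3, 6, 9, 12, 15, 18, 21, 24]
cluster_types = ["residential", "residential", "commercial", "commercial"]
h75 = percentile_inf(cluster_heights, 0.75)
dispersion = height_dispersion(cluster_heights)
entropy = functional_entropy(cluster_types)
```

The infimum definition always returns an observed building height, unlike interpolating percentile conventions, which can return a value between two buildings.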
After deriving the building height, density, and mixedness indicators, this study establishes spatial typology classification rules by jointly integrating these three dimensions. As summarized in Table 1, urban areas are categorized into four typical spatial types. High-rise zones are characterized by pronounced vertical development and intensive land use, and are commonly associated with central business districts (CBDs) and urban sub-centers. Mid-to-high-rise mixed zones correspond to comprehensive urban areas with moderate building heights and a high degree of functional diversity. Dense low-rise zones are composed of compact clusters of low-rise buildings, which are frequently observed in industrial parks or logistics-related areas on urban fringes. Areas that do not satisfy the above criteria are classified as “other”. It should be noted that in high-rise zones, the dominant height and density characteristics are sufficient for reliable classification, while mixedness plays a negligible role in such height-driven environments. Therefore, no mixedness constraint is imposed for this category.
2.3. Height Optimization Method for Buildings in Multiple Spatial Distribution Types
After completing the initial height inversion using shadow length and scale coefficient models, this study proposes a spatial-distribution-aware height optimization strategy to further mitigate errors induced by building density, structural differences, and scale extrapolation effects. Taking the spatial classification results as a guiding framework, the method first examines the distinct error characteristics across high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, in order to identify the dominant factors affecting estimation accuracy. It then incorporates neighborhood structure, local height patterns, and spatial similarity to construct differentiated optimization models. Without relying on external elevation data, this strategy achieves structural correction and significantly improves height estimation accuracy.
2.3.1. Analysis of Different Spatial Distribution Characteristics
Based on the spatial distribution classification system, the study area is divided into three typical zones: high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones. To improve estimation accuracy, this section analyzes differences in spatial structure, shadow morphology, and error characteristics across these categories, thereby providing a basis for targeted height optimization.
① High-rise zones: Buildings in high-rise zones are tall, densely distributed, and structurally regular, which generally produce clear and consistent shadows. Although proportional coefficient–based inversion performs well in these areas, shadow length remains highly sensitive to solar altitude. In addition, dense building layouts may lead to local shadow obstruction or mismatches in fishnet sampling lines. Therefore, a neighborhood-weighted correction strategy is required to suppress localized errors.
② Mid-to-high-rise mixed zones: These zones exhibit pronounced spatial heterogeneity, characterized by interspersed high- and low-rise buildings and diverse functional compositions. Large height variations often result in unstable shadow measurements and frequent fitting inconsistencies. Moreover, shadow pairing ambiguity is common in such mixed environments. To address these issues, regression-based correction constrained by height similarity and local spatial trends is employed.
③ Dense low-rise zones: Dense low-rise zones consist of compactly arranged buildings with relatively uniform heights. Short shadows and blurred boundaries increase the difficulty of shadow extraction, while shadow merging further reduces measurement stability. Consequently, median-filter-based smoothing is applied to enhance robustness and improve height estimation accuracy in these areas.
2.3.2. Height Optimization for Buildings in Different Spatial Distributions
To address the distinct error characteristics and structural differences observed in building height estimation across high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, this study proposes three corresponding optimization strategies. These strategies respectively employ neighborhood-weighted correction, local linear regression, and median smoothing to enhance the accuracy and robustness of the preliminary height inversion results.
① High-rise zones:
In high-rise zones, building heights are relatively concentrated and uniformly distributed, forming a highly consistent group within the same cluster. Accordingly, to address abnormal values caused by local shadow obstruction or broken fishnet sampling lines, this paper applies a neighborhood-weighted correction. Specifically, for each building $i$, the deviation between its initial height inversion value $H_i$ and the neighborhood average value $\bar{H}_i$ is calculated, and the following condition is tested:

$$\left| H_i - \bar{H}_i \right| > k\,\sigma_i \tag{10}$$

Here, $\sigma_i$ denotes the height standard deviation within the neighborhood of building $i$, and $k$ is the tolerance threshold. If the condition is met, building $i$ is considered an outlier, and its corrected height $H_i'$ is computed as the weighted average over its neighboring buildings $j$:

$$H_i' = \frac{\sum_{j} w_{ij} H_j}{\sum_{j} w_{ij}}$$
Here, $w_{ij}$ is the spatial weighting coefficient between building $i$ and its neighboring building $j$, calculated using the following formula:

$$w_{ij} = \frac{1}{d_{ij}}$$

Here, $d_{ij}$ is the Euclidean distance between building $i$ and building $j$.
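The outlier test and weighted correction described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the radius-based neighborhood, the inverse-distance form of the weights, the exclusion of the building itself from its neighborhood, and the default values of `radius` and `k` are all assumptions, and the function name is hypothetical.

```python
import numpy as np

def neighborhood_weighted_correction(heights, coords, radius=100.0, k=2.0):
    """Replace outlier heights with a distance-weighted neighbor average.

    heights: (n,) preliminary inversion heights in meters.
    coords:  (n, 2) building centroid coordinates in meters.
    radius, k: illustrative values, not taken from the paper.
    """
    heights = np.asarray(heights, dtype=float)
    coords = np.asarray(coords, dtype=float)
    corrected = heights.copy()
    # Pairwise Euclidean distances d_ij between building centroids.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for i in range(len(heights)):
        # Assumed neighborhood: every other building within the radius.
        mask = (dist[i] <= radius) & (dist[i] > 0)
        if mask.sum() < 2:
            continue  # too few neighbors to estimate a spread
        h_nb = heights[mask]
        mean_nb, std_nb = h_nb.mean(), h_nb.std()
        # Outlier test: |H_i - mean| > k * sigma.
        if abs(heights[i] - mean_nb) > k * std_nb:
            w = 1.0 / dist[i][mask]  # assumed inverse-distance weights
            corrected[i] = np.sum(w * h_nb) / np.sum(w)
    return corrected
```

Corrections are computed from the original preliminary heights rather than from already corrected values, so the result does not depend on the order in which buildings are visited.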
② Mid-to-high-rise mixed zones:
In mid-to-high-rise mixed zones, building heights exhibit substantial variability. Traditional spatial-distance-based weighted interpolation methods often fail to capture the true height relationships among buildings, which may lead to errors such as “high compensating for low” or “low compensating for high”. Moreover, building layouts in these areas commonly display local continuity or gradual height transition patterns, making simple mean-based smoothing insufficient for representing such structural characteristics. To mitigate these issues, this study proposes a joint correction strategy that combines a height-similarity screening mechanism with a local linear regression model. For a target building $i$, buildings within its spatial neighborhood that exhibit similar preliminary inversion heights are first selected to construct a similarity-constrained neighborhood $N(i)$. Specifically, neighboring buildings $j$ satisfying the following condition are retained:

$$\left| H_j - H_i \right| \le \epsilon$$

Here, $H_j$ denotes the preliminary inversion height of building $j$, and $H_i$ represents the corresponding value for the target building $i$. The parameter $\epsilon$ is the similarity tolerance threshold (in meters), which is used to filter neighboring buildings with comparable preliminary heights.

Based on the resulting similarity-constrained neighborhood, a local linear regression model is then employed to refine the height estimate of the target building $i$. A spatial position variable is introduced to characterize the local consistency of buildings in terms of imaging geometry, illumination conditions, and occlusion environment. Since preliminary height inversion errors exhibit pronounced spatial correlation, a local regression model based on spatial position can effectively suppress region-scale systematic errors within similarity-constrained neighborhoods. It is assumed that the preliminary inversion height $H_j$ of building $j$ exhibits a linear relationship with its spatial position variable $x_j$:

$$H_j = \beta_0 + \beta_1 x_j + e_j, \quad j \in N(i)$$
Here, $\beta_0$ and $\beta_1$ are the regression coefficients of the local linear regression model, estimated by fitting the data within the similarity-constrained neighborhood, and $e_j$ is the error term of the regression model, representing the residual of building $j$. The optimized height $H_i'$ of the target building $i$ is then obtained using the following regression equation:

$$H_i' = \beta_0 + \beta_1 x_i$$

where $x_i$ denotes the spatial position variable of the target building $i$.
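The similarity screening and local regression steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the spatial position variable is treated as a single scalar per building (for example, a coordinate along a dominant axis), the target building is left out of the fit, the fallback for too few screened neighbors is a choice made here, and the function name and the default `eps` are hypothetical.

```python
import numpy as np

def local_regression_correction(heights, positions, neighbors, i, eps=5.0):
    """Refine the height of building i from a similarity-screened linear fit.

    heights:   (n,) preliminary inversion heights in meters.
    positions: (n,) scalar spatial position variable (assumed form).
    neighbors: indices of buildings in the spatial neighborhood of i.
    eps:       similarity tolerance in meters (illustrative value).
    """
    heights = np.asarray(heights, dtype=float)
    positions = np.asarray(positions, dtype=float)
    nb = np.asarray(neighbors, dtype=int)
    # Similarity screening: keep neighbors with |H_j - H_i| <= eps.
    keep = nb[np.abs(heights[nb] - heights[i]) <= eps]
    # A line needs at least two distinct positions; otherwise keep H_i.
    if len(keep) < 2 or positions[keep].max() == positions[keep].min():
        return float(heights[i])
    # Least-squares fit of H_j = b0 + b1 * x_j over the screened neighbors.
    b1, b0 = np.polyfit(positions[keep], heights[keep], deg=1)
    # Evaluate the fitted line at the target building's position.
    return float(b0 + b1 * positions[i])
```

Screening before fitting keeps a much taller or shorter neighbor from pulling the regression line, which is the "high compensating for low" error the text describes.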
③ Dense low-rise zones:
In dense low-rise zones, buildings are generally low, densely distributed, and exhibit little height variation. In preliminary inversion results, abnormal fluctuations often occur due to blurred boundaries or noise interference. To address this issue, a median smoothing strategy based on a sliding window is employed to correct outliers. First, outliers are identified using the outlier detection criterion defined in Equation (10): if a building meets this criterion, its preliminary inversion height $H_i$ is regarded as an unstable estimate and marked for correction. This step effectively identifies extreme outliers caused by shadow errors, boundary fusion, and other factors. For buildings marked as outliers, the height is corrected using the median height of the buildings in their neighborhood. A local window $W_i$ centered on building $i$ is defined, and the optimized building height $H_i'$ is calculated as follows:

$$H_i' = \operatorname{median}\left\{ H_j \mid j \in W_i \right\}$$

Here, $W_i$ represents the sliding window used in median filtering, that is, the set of buildings around building $i$. Through this approach, height estimation errors caused by shadow merging and blurred building contours can be effectively corrected, thereby improving the stability and accuracy of building height inversion in dense low-rise zones.
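The median smoothing step can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the window is assumed to be a circular region of radius `window` around each building, the flagged building is excluded from its own median, the outlier test is assumed to compare the deviation from the neighborhood mean against `k` standard deviations, and the function name and default values are hypothetical.

```python
import numpy as np

def median_smooth_outliers(heights, coords, window=50.0, k=2.0):
    """Replace outlier heights with the median height inside a local window.

    heights: (n,) preliminary inversion heights in meters.
    coords:  (n, 2) building centroid coordinates in meters.
    window, k: illustrative values, not taken from the paper.
    """
    heights = np.asarray(heights, dtype=float)
    coords = np.asarray(coords, dtype=float)
    smoothed = heights.copy()
    # Pairwise Euclidean distances between building centroids.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    for i in range(len(heights)):
        # Assumed window: buildings within the radius, excluding i itself.
        in_window = dist[i] <= window
        in_window[i] = False
        if in_window.sum() < 2:
            continue  # too few neighbors for a stable median
        h_nb = heights[in_window]
        # Assumed outlier test: deviation from the mean exceeds k * sigma.
        if abs(heights[i] - h_nb.mean()) > k * h_nb.std():
            smoothed[i] = np.median(h_nb)
    return smoothed
```

Only buildings flagged as outliers are modified, so stable estimates pass through unchanged and the smoothing does not flatten genuine height differences.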
5. Conclusions and Future Work
This study proposes a single-image building height estimation method based on high-resolution optical imagery, aiming to address the uneven estimation accuracy caused by spatial heterogeneity in complex urban environments. Unlike traditional approaches that rely on a unified modeling assumption, the proposed framework incorporates spatial distribution characteristics into the height inversion and optimization process, thereby reformulating building height estimation as a spatially heterogeneous, region-specific optimization problem. The method integrates robust shadow length extraction based on a fishnet strategy, a multi-scenario scale factor model for preliminary height inversion, and spatial typology classification based on DBSCAN clustering combined with multiple indicators. By partitioning urban areas into high-rise zones, mid-to-high-rise mixed zones, and dense low-rise zones, differentiated optimization strategies—including neighborhood-weighted correction, local linear regression, and median smoothing—are applied to effectively suppress region-correlated systematic errors and local outliers.
Experiments conducted on 11,168 buildings across 13 representative cities in China demonstrate that the proposed method achieves a mean absolute error (MAE) of 2.07 m, a root mean square error (RMSE) of 2.56 m, and a coefficient of determination ($R^2$) of 0.99. The results confirm that the method not only outperforms multiple existing approaches but also maintains high consistency and stability across diverse urban layouts and height ranges. Owing to its low cost, high efficiency, and broad applicability, the proposed method provides reliable support for urban planning, population and energy assessment, climate modeling, and disaster risk analysis. Despite its effectiveness, limitations remain in extremely high-density urban areas where shadows are severely obstructed by neighboring buildings or vegetation. In addition, the spatial classification thresholds are empirically derived for typical Chinese urban morphologies and may require adjustment for cities with fundamentally different development patterns. Future work will focus on integrating automatic roof and shadow extraction models and incorporating multi-source data, such as synthetic aperture radar (SAR) or LiDAR, to further enhance adaptability and generalization under complex illumination and terrain conditions.