Urban Building Footprints Extraction and Heights Estimation from High-Resolution Spaceborne Remote Sensing Imagery Using a CNN-Transformer Network

Zhang, Yuan; Deng, Jiayi; Yan, Wenjia

doi:10.3390/rs18101484


by Yuan Zhang ¹,*, Jiayi Deng ¹ and Wenjia Yan ²

¹ School of Geographic Sciences, East China Normal University, Shanghai 200241, China
² Shanghai Information Technology Research Center, Shanghai 200125, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1484; https://doi.org/10.3390/rs18101484

Submission received: 29 March 2026 / Revised: 29 April 2026 / Accepted: 29 April 2026 / Published: 9 May 2026

(This article belongs to the Special Issue Applications of Remote Sensing Imagery for Urban Areas (Second Edition))


Highlights

What are the main findings?

A hybrid CNN–Transformer network (SECT-Net) is proposed to extract building footprints and shadows from high-resolution optical imagery.
A shadow-geometry-based strategy enables building height estimation using only single-date optical imagery and satellite acquisition parameters, eliminating the need for stereo data, LiDAR, or DSM labels.

What are the implications of the main findings?

The proposed framework significantly reduces data dependency and acquisition costs, providing a practical solution for large-scale 2D–3D urban mapping.
The method can be extended to multi-temporal imagery for building change detection and disaster assessment, supporting remote sensing-based urban analysis and management.

Abstract

Accurate building footprint extraction and reliable height estimation from high-resolution multispectral remote sensing imagery remain challenging in complex urban environments with diverse building structures, heterogeneous shadow patterns, and widespread occlusions. This study proposes a sparse edge-aware convolutional-transformer neural network model, SECT-Net, for precise extraction of building footprints and their cast shadows from high-resolution Jilin-1 multispectral images covering Shanghai, China. A shadow-based height estimation workflow was then developed to derive building heights from their shadow lengths. Results show that SECT-Net achieves high performance in building footprint extraction, with an IoU of 77.96%, an F1 score of 87.62%, and an overall accuracy of 97.16%. Heights were estimated for more than 750,000 buildings across Shanghai, with R² = 0.74 and RMSE = 5.66 m. A slight systematic underestimation of building heights is attributed to occlusions of high-rise buildings and interference from vegetation in residential communities in dense central urban areas. This study demonstrates that the proposed SECT-Net can accurately extract building footprints from high-resolution spaceborne remotely sensed images. The estimated building heights provide a reliable basis for urban morphology analysis and building monitoring in urban planning and management.

Keywords: building footprints; shadow length; building heights; Jilin-1 image; deep learning; SECT-Net

1. Introduction

Rapid urbanization has resulted in more than half of the global population residing in urban areas, and nearly two-thirds of future population growth is expected to occur in cities by the mid-21st century [1]. The growth of urban populations drives intensive land use and the vertical expansion of buildings, resulting in highly dense, compact urban environments. As the primary setting for human activities, the built environment critically shapes urban morphology, socio-economic structures, thermal landscapes, and ecological conditions [2]. Although two-dimensional (2D) urban features, such as building footprints, impervious surfaces, and urban boundaries, have attracted extensive interest, the vertical dimension of urban structures has received relatively limited attention [3]. Three-dimensional (3D) urban structures significantly influence ventilation corridors, pollution dispersion [4], surface runoff, urban heat islands [5,6], population distribution, transportation patterns, and carbon emissions [7,8,9]. Therefore, building height, as a key 3D parameter, is essential for population estimation, energy and carbon assessment, urban monitoring, disaster evaluation, 3D modeling, and urban land planning.

Building footprints are fundamental data for modeling urban spatial structures and are widely applied in urban expansion monitoring, disaster management, and building statistics. Traditional extraction methods primarily rely on high-resolution remote sensing image processing and classification rules, such as thresholding, edge detection, and morphological operations [10,11,12]. These approaches typically exploit spectral, geometric, and spatial distribution characteristics and perform well in built-up scenes with clear structures. However, they often fail to accurately distinguish buildings from other highly reflective objects in complex urban environments with diverse building types and severe occlusions, which reduces their robustness and adaptability. Object-based image analysis (OBIA) has emerged as an effective alternative by segmenting high-resolution remote sensing images into meaningful objects and classifying them using multidimensional features, such as spectral, textural, and shape attributes [13,14]. Some methods that combine traditional image segmentation with machine learning classifiers (e.g., SVM, Random Forest, and AdaBoost) have significantly enhanced the accuracy and interpretability of building recognition [15,16]. However, these approaches still suffer from complex feature engineering, parameter sensitivity, and limited scalability and generalization in large-scale applications [17].

In recent years, deep learning (DL) techniques have become the dominant paradigm for automatic extraction of building footprints from remote sensing imagery. Pixel-wise semantic segmentation models based on convolutional neural networks (CNNs), such as U-Net, SegNet, and the DeepLab series, have been widely applied due to their ability to capture local spatial details and delineate connected building regions [18,19]. Instance segmentation frameworks, including Mask R-CNN and YOLO-based methods, further enable accurate building localization and boundary extraction, particularly in dense urban scenes [20,21]. More recently, Transformer-based architectures, such as Swin Transformer and SegFormer, have demonstrated superior performance by modeling long-range dependencies and fusing multi-scale contextual information, thereby enhancing building perception in complex urban environments [22]. Some studies further incorporate attention mechanisms, edge-aware modules, and deep supervision to improve sensitivity to building boundaries, morphological details, and spatial context [23,24,25].

Despite these advances, both CNNs and Transformers exhibit inherent limitations. CNN-based models are effective in capturing fine-grained local features and preserving boundary details but suffer from limited receptive fields and insufficient global context modeling. In contrast, Transformers excel at capturing long-range dependencies but often struggle to preserve precise spatial details and sharp object boundaries due to their patch-based representations and global attention mechanisms. In addition, their computational complexity may further limit their efficiency in high-resolution remote sensing tasks [26].

To address these challenges, hybrid CNN-Transformer architectures have recently emerged as a promising solution. Representative hybrid architectures, such as TransUNet [27] and CoAtNet [28], have demonstrated the effectiveness of integrating convolutional operations with Transformer-based global modeling in dense prediction tasks. In the field of remote sensing, UNetFormer [29], CMTFNet [30] and CTMFNet [31] further adapt this paradigm to complex scenarios characterized by large spatial extents and multi-scale object distributions, achieving improved segmentation performance in urban scene understanding.

Along with methodological advances, several large-scale building footprint data products have been released, including GABLE [32], 90_cities_BRA [33], and the East Asian Buildings dataset [34], which provide extensive spatial coverage at national and continental scales. Despite their wide applicability, most existing footprint products rely primarily on CNN-based architectures such as DeepLabv3+, BE-Net, and OCRNet. Due to limited receptive fields and insufficient modeling of global context, these methods often produce irregular or fragmented footprints, fail to fully extract large buildings, ignore small structures, or generate internal holes within building polygons [30]. These issues restrict their reliability for fine-grained urban analysis and 3D modeling applications.

Building height estimation represents another critical yet challenging component of 3D urban mapping. Conventional field-based surveying methods are highly accurate but labor-intensive, costly, and impractical for large-scale applications [35]. Remote sensing-based approaches, including stereo imagery, LiDAR, and interferometric SAR (InSAR), provide effective alternatives. LiDAR can acquire high-precision 3D point clouds but is constrained by high acquisition costs and limited spatial coverage [36,37]. InSAR enables large-area height estimation but suffers from phase unwrapping errors and instability in dense urban areas [38,39]. Shadow-based methods estimate building height using geometric relationships among shadow length, solar illumination parameters, and satellite viewing geometry, offering a cost-effective solution with single-temporal high-resolution optical imagery [40,41,42]. However, their accuracy can be degraded by blurred shadow boundaries, overlapping shadows, or uneven terrain.

Several building-height products have also been developed, such as GHSL-2023 [43], WSF 3D [44], and CNBH-10m [45], which provide gridded height information at regional or global scales. Nevertheless, their relatively coarse spatial resolution makes it difficult to associate height values with individual buildings. Only a limited number of building footprint datasets include height attributes. For example, GABLE adopts a DL-based height estimation framework supervised by stereo-derived DSM data [32]. However, the availability and temporal update frequency of such high-quality DSM products remain limited, particularly in rapidly developing cities. 3D-GloBFP employs an XGBoost-based regression framework but requires extensive multi-source data, including remote sensing imagery, socioeconomic indicators, DEM, and DSM [46], resulting in high data dependence and computational complexity. Other products, such as CMAB, provide rich building attributes, including height, function, and age, but are limited to partial urban areas and lack continuous spatial coverage [47]. Moreover, many recent methods rely on high-quality reference height datasets for supervised learning, whose temporal representativeness directly constrains the timeliness of derived building height products, and updating such datasets is often costly and time-consuming.

To overcome these limitations, this study develops a building-centric representation framework with reduced data dependency for large-scale urban analysis using high-resolution optical imagery. Within this framework, building footprint extraction and shadow detection are formulated as essential intermediate tasks that provide critical geometric cues for subsequent height inference. A hybrid CNN-Transformer network is designed to jointly model local spatial details and global contextual dependencies, enabling reliable extraction of building footprints and shadows from single-source optical imagery. The extracted shadow information is then integrated into a geometry-driven formulation for building height estimation based on satellite imaging parameters. Compared with existing approaches that rely on stereo imagery, LiDAR, or supervised learning with DSM labels, the proposed framework requires only high-resolution optical imagery and satellite acquisition geometry, thereby significantly reducing data dependence and acquisition costs. The framework is demonstrated on Shanghai as a case study, showing its potential for automated generation of building footprints and heights. The proposed framework can be further extended to multi-temporal imagery for building change detection and disaster impact assessment, offering substantial potential for urban management and remote sensing-based 3D urban analysis.

2. Materials and Methods

2.1. Study Area and Dataset

2.1.1. Study Area

The study area, Shanghai, is located on the eastern coast of China (Figure 1). As one of the most densely populated and highly urbanized megacities in China, Shanghai serves as a national hub for the economy, finance, and shipping. The terrain in Shanghai is extremely flat, with slopes generally below 2° and negligible elevation variation, which minimizes terrain-induced effects in shadow-based height estimation. The central urban area is characterized by dense and diverse building types, including high-rise commercial, residential, and industrial structures, with significant height variability. In contrast, suburban districts (e.g., Qingpu, Fengxian, and Chongming) consist of low-density, irregularly distributed buildings mixed with vegetation, water bodies, and farmlands. Such a pronounced urban–rural gradient provides a representative landscape for multi-scale building extraction and height estimation.

2.1.2. Remotely Sensed Data

The bi-temporal remote sensing data used in this study were acquired from the Jilin-1 high-resolution optical satellite constellation, operated by Chang Guang Satellite Technology Co., Ltd. (Changchun, China). The multispectral data, with a spatial resolution of 1 m, provide detailed information on building boundaries and texture. The first dataset, acquired on 15 May 2024 with minimal cloud cover, was used for building footprint extraction. The second dataset, acquired on 16 February 2024 under lower solar elevation conditions, provides elongated building shadows suitable for shadow detection and height estimation. To ensure spatial consistency, the two datasets were geometrically co-registered at the pixel level, enabling accurate alignment between building footprints and their corresponding shadows.

A total of 500 image patches, each with a size of 512 × 512 pixels, were selected across Shanghai. A stratified sampling strategy was adopted to account for spatial heterogeneity: homogeneous regions in rural areas were sampled sparsely, while complex urban areas were sampled more densely. This improves the representativeness of the samples and supports model generalization.

2.1.3. Label Generation for Building Footprints and Shadows

Ground-truth labels for building footprints and shadows were manually annotated based on the Jilin-1 imagery. For building footprints, only clear rooftop boundaries with regular geometric structures were labeled, while temporary structures and ambiguous objects were excluded. For shadow labels, only shadows cast by buildings were annotated, based on their spatial relationship with buildings and solar illumination direction. Shadows from trees, vehicles, and other objects were excluded. Ambiguous cases, such as dark rooftops, were carefully distinguished using spatial context and illumination direction. Buildings partially occluded by vegetation were labeled according to visible rooftop boundaries. To ensure label quality, all annotations were independently reviewed by a second annotator, and inconsistencies were corrected through manual verification.

2.1.4. Reference Data for Building Height Validation

Building height validation was conducted using two types of reference data. First, a 1 m resolution digital surface model (DSM) derived from Gaofen-7 (GF-7) stereo imagery was used, covering approximately 70% of Shanghai. The GF-7 data provide reliable elevation information due to their high spatial and vertical accuracy. Second, high-resolution UAV LiDAR data were used for local validation. The LiDAR-derived DSMs (6 cm and 10 cm resolution) were used to obtain accurate height measurements for 110 buildings. Building heights were derived using a quantile-based approach to reduce noise and edge effects. Specifically, rooftop elevation was estimated using the 95th percentile of DSM values within each building footprint, while ground elevation was estimated using the 5th percentile within a 3 m buffer outside the building. The height was calculated as the difference between these two values.
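The quantile-based reference height described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear-interpolation percentile rule, and the sample DSM values are assumptions, and the extraction of DSM pixels inside the footprint and inside the 3 m buffer is assumed to have been done beforehand.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    data = sorted(values)
    if not data:
        raise ValueError("percentile of an empty sequence is undefined")
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def reference_height(roof_dsm, buffer_dsm, roof_q=95.0, ground_q=5.0):
    """Rooftop elevation (P95 inside the footprint) minus ground elevation (P5 in the buffer)."""
    return percentile(roof_dsm, roof_q) - percentile(buffer_dsm, ground_q)


# Hypothetical DSM samples in metres: one rooftop spike and one ground pit as noise.
roof = [34.0] * 99 + [60.0]
ground = [1.0] + [4.0] * 99
height = reference_height(roof, ground)
```

In this toy case the isolated spike and pit fall outside the 95th and 5th percentiles, so the derived height stays at 30 m, which is the kind of robustness to noise and edge effects that motivates the quantile-based choice.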

2.2. Methodology

2.2.1. Network Architecture for Extracting Building Footprints and Shadows

A sparse edge-aware convolution-transformer neural network (SECT-Net) is proposed to extract building footprints and shadows. It is built upon the UNetFormer architecture [29], which combines a lightweight CNN encoder and a Transformer-based decoder with a global-local attention mechanism for efficient multi-scale contextual modeling.

To adapt the framework to building footprint extraction from high-resolution Jilin-1 remote sensing images, three modifications are made: (1) Multi-scale edge supervision module (MESM): During multi-scale feature extraction in the encoder, MESM explicitly supervises edge information at different resolutions to enhance the network’s ability to capture fine building contour details; (2) Dual-path CNN-Transformer block (DP-CTB): Inspired by the sparse token transformers [48], the original global-local Transformer module is replaced with a dual-path CNN-Transformer block with sparse global attention. This design preserves the ability to capture global contextual dependencies while alleviating redundant global interactions through sparse token sampling; (3) Multi-scale auxiliary supervision (MAS): Auxiliary prediction heads are attached to intermediate layers of the decoder to provide additional supervision signals for multi-level features.

The overall architecture of SECT-Net comprises a ResNet50-based encoder and a DP-CTB-based decoder (Figure 2). The encoder begins with a convolutional stem (Conv Stem) and sequentially stacks four residual blocks (ResBlocks) to extract multi-scale features at 1/4, 1/8, 1/16, and 1/32 of the input resolution. During the encoding stage, the MESM is applied to each scale of the feature map to explicitly enhance the learning of the building’s edge information. In the decoding phase, multi-scale features are fused via a weighted integration mechanism and subsequently passed through a series of DP-CTB modules for progressive feature aggregation and semantic reconstruction. Finally, the feature refinement head [29] refines the fused features to produce predictions of building footprints and shadows.

(1): Multi-scale Edge Supervision Module (MESM)

Edge information is a critical fine-grained feature in building extraction. To enhance boundary modeling, the proposed MESM introduces explicit edge supervision at multiple encoder stages and injects edge features into the decoder using scale-specific strategies.

Specifically, the encoder outputs multi-scale feature maps $F_{1} \in \mathbb{R}^{C_{1} \times H_{1} \times W_{1}}$ and $F_{2} \in \mathbb{R}^{C_{2} \times H_{2} \times W_{2}}$, corresponding to the shallow and intermediate layers. For each scale, a lightweight edge head $EH_{i}$ is designed to generate the corresponding edge prediction:

$$E_{i} = EH_{i}(F_{i}), \quad i = 1, 2, \tag{1}$$

where $E_{i} \in \mathbb{R}^{1 \times H_{i} \times W_{i}}$ denotes the predicted edge map at the $i$-th scale. $EH_{i}$ is implemented as a lightweight convolutional module composed of a 3 × 3 convolution, batch normalization, and ReLU activation, followed by a 1 × 1 convolution for channel projection.

To effectively utilize edge information, different integration strategies are adopted for different scales. For the intermediate stage, the predicted edge map $E_{2}$ is concatenated with $F_{2}$ along the channel dimension to form an enhanced feature representation:

$$F_{2}^{aug} = \mathrm{Concat}(F_{2}, \sigma(E_{2})), \tag{2}$$

where $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension and $\sigma(\cdot)$ is the sigmoid activation.

For the shallow stage, the finer edge prediction $E_{1}$ is used to guide feature refinement via a spatial gating mechanism:

$$F_{1}^{aug} = F_{1} \odot (1 + G_{1}), \tag{3}$$

where $\odot$ denotes element-wise multiplication, and $G_{1}$ is a learnable gating map derived from $E_{1}$ via a 1 × 1 convolution. This mechanism enables the network to focus on fine-grained boundary details during reconstruction.

During training, edge ground truths are generated by rasterizing building footprint vectors and extracting inner boundaries, producing single-channel masks for supervision.
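As a rough illustration of how a single-channel edge mask can be obtained from a rasterized footprint mask, the sketch below marks building pixels that touch at least one non-building pixel. The function name, the 4-neighbourhood rule, and the treatment of tile borders as background are assumptions made for illustration; the text only states that inner boundaries are extracted.

```python
def inner_boundary(mask):
    """Return a 0/1 mask of foreground pixels that touch background (4-neighbourhood)."""
    rows, cols = len(mask), len(mask[0])
    edge = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # edge pixels are always building pixels (inner boundary)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                # Assumption: pixels outside the tile count as background.
                outside = not (0 <= rr < rows and 0 <= cc < cols)
                if outside or not mask[rr][cc]:
                    edge[r][c] = 1
                    break
    return edge


# Hypothetical 5 x 5 tile with a 3 x 3 building in the centre.
footprint = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
edges = inner_boundary(footprint)
```

For this tile the result is the one-pixel ring of the building, with the central pixel left unmarked.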

(2): Dual-path CNN-Transformer Block (DP-CTB)

To balance global structural consistency and local detail accuracy, a DP-CTB is proposed (Figure 3a). This module consists of a local convolutional branch and a sparse global attention branch.

Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the local branch aims to capture fine-grained spatial and channel interactions using depth-wise separable convolution followed by channel recalibration via a squeeze-and-excitation mechanism, yielding a local feature representation $X_{local}$.

To facilitate global context modeling, the global branch first projects the input into a reduced feature space via a 1 × 1 convolution. A Spatial-Channel Token Sampler (Figure 3b) is then employed to select informative tokens from both spatial and channel dimensions.

For spatial token sampling, a three-route strategy is adopted to enhance diversity and robustness: block-aware sampling ensures uniform spatial coverage, boundary-aware sampling emphasizes high-gradient regions (e.g., object contours) via gradient magnitude estimation, and region-aware sampling focuses on semantically salient areas. The sampled tokens are aggregated as:

$$T_{s} = \mathrm{Merge}(T_{block}, T_{boundary}, T_{region}), \tag{4}$$

where $\mathrm{Merge}(\cdot)$ denotes concatenation followed by duplicate removal.

In parallel, channel tokens are selected based on channel importance derived from globally pooled features and further aggregated to produce channel-wise modulation.

In practice, the numbers of spatial and channel tokens are fixed to $K_{s} = 196$ and $K_{c} = 32$, respectively. For spatial sampling, the proportions of block-aware, boundary-aware, and region-aware tokens are set to 0.3, 0.3, and 0.4, respectively.
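Because 196 × 0.3 = 58.8 is not an integer, the stated proportions have to be rounded to integer token counts per route. The text does not specify the rounding rule, so the sketch below uses a largest-remainder allocation purely as an assumption; the function name is hypothetical.

```python
import math


def allocate_tokens(total, proportions):
    """Split `total` tokens across routes by largest-remainder rounding (assumed rule)."""
    raw = [total * p for p in proportions]
    counts = [math.floor(x) for x in raw]
    leftover = total - sum(counts)
    # Give the remaining tokens to the routes with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Block-aware, boundary-aware, and region-aware budgets for K_s = 196.
budget = allocate_tokens(196, (0.3, 0.3, 0.4))
```

Under this assumed rule the three routes receive 59, 59, and 78 tokens; since the merge step removes duplicates, the number of distinct tokens actually attended to may be smaller than the nominal budget.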

Based on the sampled tokens, position-enhanced sparse attention is performed:

$$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T} + Q P_{K}^{T} + P_{Q} K^{T}}{\sqrt{d}}\right) V, \tag{5}$$

where $Q$, $K$, and $V$ denote the query, key, and value matrices, and $P_{Q}$ and $P_{K}$ represent learnable positional embeddings.
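A small numerical sketch of Equation (5) may help to make the three logit terms concrete. It is written in plain Python for a single attention head; the matrix shapes (queries and $P_{Q}$ of size n × d, keys and $P_{K}$ of size m × d), the use of the column count of Q as $d$, and all function names are assumptions rather than details confirmed by the text.

```python
import math


def matmul_t(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def softmax(row):
    """Numerically stable softmax of one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_enhanced_attention(q, k, v, p_q, p_k):
    """Softmax((Q K^T + Q P_K^T + P_Q K^T) / sqrt(d)) V for one head."""
    d = len(q[0])  # assumed: d is the token feature dimension
    qk, qpk, pqk = matmul_t(q, k), matmul_t(q, p_k), matmul_t(p_q, k)
    out = []
    for i in range(len(q)):
        logits = [(qk[i][j] + qpk[i][j] + pqk[i][j]) / math.sqrt(d) for j in range(len(k))]
        weights = softmax(logits)
        out.append([sum(weights[j] * v[j][c] for j in range(len(k))) for c in range(len(v[0]))])
    return out


# With a zero query and zero query-position term, all logits vanish and the
# output reduces to the plain average of the value rows.
attended = position_enhanced_attention(
    q=[[0.0, 0.0]], k=[[1.0, 0.0], [0.0, 1.0]], v=[[2.0], [4.0]],
    p_q=[[0.0, 0.0]], p_k=[[0.5, 0.5], [0.1, 0.2]],
)
```

Setting both positional embeddings to zero recovers standard scaled dot-product attention, which is one way to read the two extra terms: they add content-to-position and position-to-content interactions to the usual content-to-content score.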

The attended tokens are projected back to the spatial domain via cross-attention, producing a spatial global feature map $X_{S} \in \mathbb{R}^{C_{r} \times H \times W}$. In parallel, channel attention yields a channel-enhanced feature map $X_{C} \in \mathbb{R}^{C_{r} \times H \times W}$.

The spatial and channel global features are concatenated and projected:

$$X_{global} = \mathrm{Conv}([X_{S}, X_{C}]). \tag{6}$$

Finally, the outputs of the local and global branches are adaptively fused:

$$X_{out} = \alpha X_{local} + (1 - \alpha) X_{global}, \tag{7}$$

where $\alpha \in [0, 1]$ is a learnable balance factor.

The DP-CTB effectively integrates fine-grained local details with long-range structural dependencies through a structure-aware sparse modeling strategy. Instead of performing dense global interactions, the proposed token sampling mechanism selectively focuses on informative spatial regions (e.g., boundaries and salient areas), thereby reducing redundant computations while preserving representative global context.

(3): Multi-scale Auxiliary Supervision (MAS)

To enhance semantic consistency and facilitate optimization, a multi-scale auxiliary supervision module is introduced into the multi-stage decoder features.

Specifically, the decoder generates three intermediate feature maps at different stages:

$$h_{4} \in \mathbb{R}^{C \times H_{4} \times W_{4}}, \quad h_{3} \in \mathbb{R}^{C \times H_{3} \times W_{3}}, \quad h_{2} \in \mathbb{R}^{C \times H_{2} \times W_{2}}, \tag{8}$$

where $h_{4}$ originates from the deepest decoder stage and contains the richest semantic information, while $h_{2}$ comes from a shallower stage and preserves more local details.

For each feature map, a lightweight auxiliary head $AH_{i}(\cdot)$ is designed to generate a probability map at the corresponding scale:

$$A_{i} = AH_{i}(h_{i}), \quad i = 2, 3, 4, \tag{9}$$

where $A_{i} \in \mathbb{R}^{1 \times H_{i} \times W_{i}}$ denotes the predicted probability map at the $i$-th scale. Each $AH_{i}$ consists of a convolutional block with a 3 × 3 convolution, followed by batch normalization and ReLU activation, a dropout layer, and a 1 × 1 convolution for channel projection.

During training, these auxiliary predictions are up-sampled to the original input resolution to compute the loss and provide supervision. This design facilitates gradient propagation to shallow layers, accelerates model convergence, and enhances the semantic representation capability across multiple scales.

(4): Loss Function

In the training phase, the proposed network is supervised by a composite loss function that consists of three components: a principal loss $L_{principal}$, an auxiliary loss $L_{aux}$, and an edge loss $L_{edge}$. The overall loss can be expressed as:

$$L_{total} = L_{principal} + \lambda_{aux} L_{aux} + \lambda_{edge} L_{edge}, \tag{10}$$

where $\lambda_{aux}$ and $\lambda_{edge}$ are the weight coefficients of the auxiliary and edge losses, respectively. In our experiments, $\lambda_{aux}$ is set to 0.4 and $\lambda_{edge}$ is set to 0.2 by default.

The principal loss supervises the final prediction of building footprints and shadows. It is formulated as a combination of the soft Dice loss $L_{dice}$ and the cross-entropy loss $L_{ce}$:

$$L_{principal} = L_{dice} + L_{ce}, \tag{11}$$

where $L_{dice}$ denotes the soft Dice loss computed on predicted probabilities, and $L_{ce}$ denotes the pixel-wise cross-entropy loss.
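For a single binary target (for example, building versus background), Equation (11) can be sketched in plain Python as below. The binary formulation, the smoothing constants, and the function names are assumptions; the text does not state whether footprints and shadows are handled as separate binary problems or as one multi-class problem.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - Dice coefficient computed on predicted probabilities (flattened pixels)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean pixel-wise binary cross-entropy with probability clipping."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def principal_loss(probs, targets):
    """L_principal = L_dice + L_ce for one binary prediction map."""
    return soft_dice_loss(probs, targets) + binary_cross_entropy(probs, targets)


labels = [1, 0, 1, 0]
good = principal_loss([1.0, 0.0, 1.0, 0.0], labels)   # confident and correct
poor = principal_loss([0.5, 0.5, 0.5, 0.5], labels)   # uninformative prediction
```

The Dice term responds to region overlap and is less sensitive to class imbalance, while the cross-entropy term provides per-pixel gradients, which is a common motivation for summing the two.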

The auxiliary loss is designed to guide intermediate feature representations within the decoder. It is applied to the outputs of the auxiliary heads, which process features from multiple decoder stages to produce auxiliary predictions. Following the same formulation as the principal loss, each auxiliary prediction is supervised using a combination of the Dice loss $L_{dice}$ and the cross-entropy loss $L_{ce}$:

$$L_{aux} = \sum_{i \in \{2, 3, 4\}} \alpha_{i} \left(L_{dice}^{i} + L_{ce}^{i}\right), \tag{12}$$

where $\alpha_{i}$ denotes the weight assigned to the auxiliary prediction at stage $i$, and $L_{dice}^{i}$ and $L_{ce}^{i}$ denote the losses computed for the $i$-th auxiliary prediction. In our implementation, the auxiliary losses are weighted as $(\alpha_{4}, \alpha_{3}, \alpha_{2}) = (0.1, 0.2, 0.4)$, where shallower features are given larger weights to emphasize fine-grained details.

During training, both edge predictions are supervised at the original image resolution. To address the severe class imbalance between boundary and non-boundary pixels, a dynamically weighted binary cross-entropy loss is adopted. In addition, a Dice loss is introduced to enforce structural consistency of the predicted edges. The final edge loss is defined as a weighted combination of the two scales:

$$L_{edge} = 0.4 L_{edge}^{1} + 0.6 L_{edge}^{2}. \tag{13}$$
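The way the weights reported in Equations (10), (12), and (13) combine can be sketched as follows. The per-term loss values are taken as given inputs (each auxiliary value standing for the Dice plus cross-entropy loss at that stage); the function name, dictionary layout, and numeric inputs are hypothetical.

```python
# Weights reported in the text.
AUX_STAGE_WEIGHTS = {4: 0.1, 3: 0.2, 2: 0.4}   # alpha_4, alpha_3, alpha_2
EDGE_SCALE_WEIGHTS = {1: 0.4, 2: 0.6}          # shallow and intermediate edge heads


def total_loss(principal, aux_by_stage, edge_by_scale, lambda_aux=0.4, lambda_edge=0.2):
    """Combine principal, auxiliary, and edge losses as in Equations (10), (12), (13)."""
    aux = sum(w * aux_by_stage[i] for i, w in AUX_STAGE_WEIGHTS.items())
    edge = sum(w * edge_by_scale[i] for i, w in EDGE_SCALE_WEIGHTS.items())
    return principal + lambda_aux * aux + lambda_edge * edge


# Hypothetical loss values, all set to 1.0 to expose the effective weights.
loss = total_loss(1.0, {2: 1.0, 3: 1.0, 4: 1.0}, {1: 1.0, 2: 1.0})
```

With unit inputs the auxiliary branch contributes 0.4 × 0.7 = 0.28 and the edge branch 0.2 × 1.0 = 0.20, so the principal loss remains the dominant term in the objective.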

2.2.2. Building Height Estimation Based on Shadow Lengths

Shadow-based height estimation from single-view high-resolution imagery is widely used. The building height is calculated from the extracted shadow length projected onto the ground and several key imaging-geometry parameters, such as the solar elevation angle, solar azimuth, sensor elevation angle, and sensor azimuth (Figure 4).

For the case where the sun and the sensor are on opposite sides of the building, the sensor can capture the complete ground shadow BC of the building (Figure 4a). The building height H is calculated by:

$$H = L_{s} \times \tan\beta. \tag{14}$$

When the sun and sensor are on the same side of the building, part of the shadow (segment BE in Figure 4b) may be occluded by the building itself and thus cannot be fully observed. In this case, the shadow length measured in the image is denoted as $L_{s}$, which corresponds to the projection along the sensor’s line of sight (EC) and is derived from the detected building rooftop area and the shadow region (Figure 4b). Building height $H$ is then calculated using the following geometric correction formula [49]:

$$H = \frac{L_{s} \cdot \tan\beta \cdot \tan\alpha \cdot \sin\gamma}{\tan\alpha \cdot \sin\gamma - \tan\beta \cdot \sin\theta}, \tag{15}$$

where $\alpha$ denotes the sensor elevation angle, $\beta$ the solar elevation angle, $\theta$ the sensor azimuth, and $\gamma$ the solar azimuth. $L_{s}$ denotes the measured shadow length in the image, which corresponds to the complete ground shadow BC in the opposite-side case (Figure 4a) and to the observable shadow segment EC when the ground shadow is partially occluded (Figure 4b).
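Equations (14) and (15) translate directly into code. The sketch below is a literal transcription under the assumption that all angles are given in degrees and the shadow length in metres; the function names and the example angles are hypothetical.

```python
import math


def height_opposite_side(shadow_len, sun_elev_deg):
    """Equation (14): full ground shadow visible, H = L_s * tan(beta)."""
    return shadow_len * math.tan(math.radians(sun_elev_deg))


def height_same_side(shadow_len, sun_elev_deg, sun_az_deg, sat_elev_deg, sat_az_deg):
    """Equation (15): shadow partly hidden behind the building in the sensor's view."""
    tan_b = math.tan(math.radians(sun_elev_deg))   # beta: solar elevation
    tan_a = math.tan(math.radians(sat_elev_deg))   # alpha: sensor elevation
    sin_g = math.sin(math.radians(sun_az_deg))     # gamma: solar azimuth
    sin_t = math.sin(math.radians(sat_az_deg))     # theta: sensor azimuth
    denominator = tan_a * sin_g - tan_b * sin_t
    if abs(denominator) < 1e-12:
        raise ValueError("degenerate sun/sensor geometry")
    return shadow_len * tan_b * tan_a * sin_g / denominator


# Illustrative geometry (not the acquisition parameters of any specific scene).
h_full = height_opposite_side(10.0, 45.0)
h_near_nadir = height_same_side(10.0, 45.0, 150.0, 89.9999, 150.0)
h_oblique = height_same_side(10.0, 45.0, 150.0, 80.0, 150.0)
```

Two sanity checks follow from the formula itself: as the sensor approaches nadir, $\tan\alpha$ grows and Equation (15) converges to Equation (14); for an oblique sensor on the sun's side, the same measured length maps to a larger height, because part of the true shadow is hidden.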

The Jilin-1 images used for shadow-based height estimation were acquired on 16 February 2024. The solar elevation and azimuth angles at acquisition time were 40.8° and 149.2°, respectively. The satellite’s elevation and azimuth angles were 81.1° and 148.8°, respectively. Both the sun and the sensor were therefore positioned on the same side of the buildings. Shadow occlusion detection was first conducted for each extracted building using its footprint and the shadow geometry (Figure 5). Building footprints often fall within shadows cast by neighboring buildings. When a large proportion of the footprint area is covered by neighboring shadows, the building’s own shadow region becomes difficult to identify reliably, leading to unstable shadow-length measurements. In this study, an overlap ratio threshold of 70% was adopted to filter out severely occluded samples. Buildings exceeding this threshold were excluded from further analysis.
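The occlusion filter can be illustrated with a simple pixel-set formulation. Representing footprints and neighbouring shadows as sets of (row, column) pixels, and defining the overlap ratio as the shadowed share of the footprint area, are assumptions made for this sketch; the function names are hypothetical.

```python
def shadow_overlap_ratio(footprint_pixels, neighbour_shadow_pixels):
    """Fraction of a footprint covered by shadows cast by neighbouring buildings."""
    footprint = set(footprint_pixels)
    if not footprint:
        raise ValueError("footprint has no pixels")
    return len(footprint & set(neighbour_shadow_pixels)) / len(footprint)


def keep_building(footprint_pixels, neighbour_shadow_pixels, threshold=0.70):
    """Keep a building unless its overlap ratio exceeds the 70% threshold."""
    return shadow_overlap_ratio(footprint_pixels, neighbour_shadow_pixels) <= threshold


# Hypothetical 1 x 10 pixel footprint and two neighbouring-shadow configurations.
footprint = [(0, c) for c in range(10)]
light_shadow = [(0, c) for c in range(5)]    # covers 50% of the footprint
heavy_shadow = [(0, c) for c in range(8)]    # covers 80% of the footprint
kept = keep_building(footprint, light_shadow)
dropped = not keep_building(footprint, heavy_shadow)
```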

According to the solar azimuth angle, projection sampling points were uniformly generated along the building boundary (N = 50). Only points located on the shadow-facing side were retained, resulting in approximately 20–30 valid sampling points per building. For each valid sampling point, pixel-level shadow tracing along the solar illumination direction continues until a non-shadow pixel is encountered. As a result, multiple candidate shadow-length samples were generated for each building. To reduce the effects of roof self-occlusion and overshadowing by nearby buildings, all candidate shadow-length samples were subjected to a $3\sigma$ outlier filtering procedure [50]. Specifically, the mean $\mu$ and standard deviation $\sigma$ were first computed for the set of shadow-length samples $L_{i}$, and samples were removed when satisfying:

$$|L_{i} - \mu| > 3\sigma. \tag{16}$$

For each building, the effective shadow length was defined as the mean of all high-quality shadow-length samples that passed the quality control procedure. This representative shadow length was subsequently used to estimate the building height by Equation (15).
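The filtering and averaging step can be sketched as below. A single filtering pass and the population standard deviation are assumptions, since the text does not say whether the filter is iterated or which estimator of $\sigma$ is used; the function name and sample values are hypothetical.

```python
import statistics


def effective_shadow_length(samples):
    """Mean of the shadow-length samples that pass a single 3-sigma filter."""
    if not samples:
        raise ValueError("no shadow-length samples for this building")
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # assumed: population standard deviation
    kept = [x for x in samples if abs(x - mu) <= 3.0 * sigma]
    return statistics.fmean(kept)


# Hypothetical traces in metres: 20 consistent samples and one that ran into
# a neighbouring building's shadow.
traces = [10.0] * 20 + [100.0]
length = effective_shadow_length(traces)
```

A 3σ rule of this kind can only reject a sample when enough samples are available (with the population estimator, no sample can deviate from the mean by more than √(n − 1) standard deviations), so the 20–30 valid points per building matter for the filter to have any effect.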

2.2.3. Comparison Methods and Evaluation Metrics

In our experiments, several representative deep learning methods, including U-Net [51], DeepLabV3+ [52], Swin-UNet [53], SegFormer [54], and UNetFormer [29], are compared with the proposed SECT-Net. These methods cover both CNN-based and recent Transformer-based segmentation architectures to ensure a comprehensive evaluation. To ensure a fair comparison, all models were trained on the same dataset with identical training/validation/test splits (8:1:1) at the patch level, where each sample corresponds to a 512 × 512 non-overlapping image tile, ensuring no spatial overlap between samples. All models were trained for 100 epochs using the AdamW optimizer with a batch size of 8. For CNN-based models (e.g., U-Net and DeepLabV3+) and CNN-Transformer hybrid models (e.g., UNetFormer and SECT-Net), a ResNet50 backbone was adopted with a unified learning rate of 6 × 10⁻⁴. For Transformer-based models (e.g., Swin-UNet and SegFormer), we followed their commonly recommended training configurations to ensure stable and competitive performance. In terms of architecture, Swin-UNet employs a hierarchical Swin Transformer as the encoder, while SegFormer adopts the MiT-B2 backbone, both providing model capacities broadly comparable to ResNet50. To mitigate overfitting and improve model performance, data augmentation techniques, including random vertical and horizontal flipping and random rotation, are applied to the training dataset.
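As a minimal sketch of the patch-level 8:1:1 split, the snippet below shuffles patch identifiers with a fixed seed and partitions them. The seed, the random shuffling, and the function name are assumptions; the text specifies only the ratio and that tiles do not overlap.

```python
import random


def split_patches(patch_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle patch IDs reproducibly and split them into train/validation/test."""
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)  # assumed: random assignment with a fixed seed
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# 500 non-overlapping 512 x 512 tiles, identified here by hypothetical integer IDs.
train, val, test = split_patches(range(500))
```

For the 500 annotated patches this yields 400 training, 50 validation, and 50 test tiles.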

Three widely used pixel-level evaluation metrics, namely intersection-over-union (IoU), F1-score, and overall accuracy (OA), were employed to quantitatively evaluate the segmentation performance of building footprints and shadows. The metrics are computed from pixel-wise classification results on the test set and averaged over all test images. They are calculated as follows:

IoU = \frac{TP}{TP + FP + FN},

(17)

F1\text{-}score = \frac{2 \times TP}{2 \times TP + FP + FN},

(18)

OA = \frac{TP + TN}{TP + TN + FP + FN},

(19)

where TP (true positive) denotes the number of pixels correctly predicted as the target class (i.e., building footprint or shadow, depending on the evaluated task). TN (true negative) denotes the number of pixels correctly predicted as non-target class. FP (false positive) represents the number of non-target pixels incorrectly predicted as the target class. FN (false negative) represents the number of target pixels incorrectly predicted as the non-target class.
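As a minimal sketch of Equations (17)–(19), the snippet below counts TP, FP, FN, and TN from a pair of binary masks and derives the three metrics. The nested-list mask representation and the function names `confusion_counts` and `segmentation_metrics` are assumptions made for illustration; the sketch does not guard against empty denominators.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two equally sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn, tn):
    """Return (IoU, F1-score, OA) following Equations (17)-(19)."""
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return iou, f1, oa


# Toy masks: one correct target pixel, one false alarm, one miss, three true negatives.
pred = [[1, 1, 0], [0, 0, 0]]
truth = [[1, 0, 0], [1, 0, 0]]
print(segmentation_metrics(*confusion_counts(pred, truth)))
```

When IoU and F1-score are derived from the same confusion counts they are linked by F1 = 2·IoU / (1 + IoU), which offers a quick consistency check on reported values.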

For building height estimation, both Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are used for the evaluation:

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {({\hat{H}}_{i} - H_{i})}^{2}},

(20)

MAE = \frac{1}{N} \sum_{i = 1}^{N} |{\hat{H}}_{i} - H_{i}|,

(21)

where N is the total number of buildings used for height evaluation, and {\hat{H}}_{i} and H_{i} represent the predicted and reference building heights for the i-th sample, respectively.
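Equations (20) and (21) translate directly into code. The following sketch uses made-up heights purely for illustration, and the function names `rmse` and `mae` are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root mean square error, Equation (20)."""
    n = len(predicted)
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)


def mae(predicted, reference):
    """Mean absolute error, Equation (21)."""
    n = len(predicted)
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / n


# Illustrative heights in metres (not data from this study).
predicted = [10.0, 20.0, 30.0]
reference = [12.0, 18.0, 33.0]
print(rmse(predicted, reference), mae(predicted, reference))
```

Because RMSE squares each residual before averaging, it is never smaller than MAE and reacts more strongly to a few large errors, such as those of very tall buildings.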

3. Results

3.1. Performance Assessment of Models for Mapping Building Footprints and Shadows

The evaluation metrics in Table 1 present the classification performance of the proposed SECT-Net and several representative deep learning approaches for building footprint and shadow extraction.

Overall, SECT-Net demonstrates competitive and consistently strong performance across most evaluation metrics. For building footprint extraction, SECT-Net achieves the highest scores in terms of IoU (77.96%), F1-score (87.62%), and OA (97.16%). For shadow extraction, SECT-Net also attains strong performance, with IoU, F1-score, and OA values of 75.01%, 85.72%, and 97.75%, respectively.

To further assess whether the observed performance differences are statistically significant, we conducted a statistical significance analysis using a one-sided Wilcoxon signed-rank test. The one-sided test was adopted based on an a priori hypothesis that SECT-Net improves performance over baseline methods.

Specifically, the test was performed on the paired distributions of per-image Overall Accuracy (OA) calculated for each individual image in the test set (N = 50). The statistical comparison was conducted on per-image samples rather than aggregated mean values. The results are summarized in Table 2.
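For readers who wish to reproduce this kind of paired comparison, a one-sided Wilcoxon signed-rank test can be sketched in plain Python as below. The text does not state which software was used, so this is only an illustration: the function name `wilcoxon_signed_rank_greater` is hypothetical, zero differences are dropped, and the p-value relies on the normal approximation without tie or continuity corrections, which is an assumption suited to sample sizes such as N = 50.

```python
import math


def wilcoxon_signed_rank_greater(x, y):
    """One-sided paired test of H1: values in x tend to exceed those in y.

    Returns (W_plus, p_value), where W_plus is the sum of ranks of the
    positive differences and p_value uses the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w_plus, p_value


# Hypothetical per-image OA values for two models on the same five images.
oa_model_a = [0.972, 0.968, 0.975, 0.981, 0.964]
oa_model_b = [0.970, 0.969, 0.971, 0.975, 0.960]
print(wilcoxon_signed_rank_greater(oa_model_a, oa_model_b))
```

In practice a library routine such as `scipy.stats.wilcoxon` with `alternative="greater"` would usually be preferred, since it can compute exact p-values for small samples.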

For building footprint extraction, SECT-Net demonstrates statistically significant improvements (p < 0.05) over all baseline methods. In contrast, for shadow extraction, statistically significant improvements are observed compared with DeepLabv3+, SegFormer, and UNetFormer, while the differences with U-Net and Swin-UNet are not statistically significant (p ≥ 0.05).

This indicates that, although SECT-Net achieves slightly higher average performance in shadow extraction, the improvement is not consistently observed across all test samples. The variability in shadow characteristics, such as low contrast and irregular boundaries, may lead to unstable performance differences between models, thereby reducing statistical significance.

As illustrated in Figure 6, the proposed SECT-Net produces building segmentation results with more coherent structures and improved completeness, effectively reducing fragmented regions and internal holes. Traditional CNN-based methods, such as U-Net and DeepLabv3+, are more susceptible to local appearance variations and image noise due to their limited receptive fields. As a result, they often produce irregular boundaries and suffer from fragmented predictions or internal holes. Transformer-based approaches, including Swin-UNet and SegFormer, improve the overall structural consistency by modeling global context. However, they may still suffer from boundary ambiguity or incomplete delineation in regions with complex building layouts or low contrast. UNetFormer shows competitive performance in scenes with clear foreground–background contrast, but its predictions degrade in challenging areas with blurred boundaries or closely spaced buildings. In contrast, by jointly integrating multi-scale edge supervision and global-local contextual modeling, SECT-Net achieves more accurate and structurally consistent building delineation, particularly in complex urban scenes.

Building shadows exhibit distinctive tonal characteristics in high-resolution imagery, allowing generally consistent shadow locations and spatial distributions across methods (Figure 7). However, notable differences remain in shadow completeness and boundary continuity. CNN-based methods, such as U-Net and DeepLabv3+, tend to produce fragmented shadow regions with noticeable internal holes. They are also prone to misclassifying dark objects (e.g., dark rooftops or the shadow of overpasses) as building shadows. Transformer-based approaches, including Swin-UNet and SegFormer, significantly alleviate the issues of internal holes and speckle noise by incorporating global context modeling. Nevertheless, they still exhibit limitations in precise boundary localization, often leading to over-smoothed predictions or undesired merging of adjacent shadow regions. UNetFormer achieves relatively competitive performance in regions with clear shadow–background contrast. However, its predictions become less reliable in challenging scenarios, such as closely spaced buildings or blurred shadow boundaries, where local ambiguities dominate. In contrast, the proposed SECT-Net achieves consistent improvements by effectively integrating multi-scale edge supervision with global-local contextual modeling. It produces more complete shadow regions with clearer boundaries, while suppressing misclassification and reducing shadow adhesion effects.

3.2. Spatial Distribution of Building Footprints and Heights

Using the proposed SECT-Net, we produced complete and spatially consistent building footprints across the whole of Shanghai. As shown in Figure 8, the extracted buildings exhibit diverse morphological characteristics under heterogeneous urban contexts. Specifically, typical patterns include mixed-use urban areas integrating residential, commercial, and service functions (Figure 8a), large-scale industrial complexes with regular geometries (Figure 8b), low-density villa-type residential areas (Figure 8c), rural residential areas with scattered homesteads (Figure 8d), coastal port and logistics zones with large structured facilities (Figure 8e), and urban villages characterized by highly compact and irregular building layouts (Figure 8f).

Within the study area, the heights of 755,996 buildings were derived using the shadow-based height estimation method (Figure 9d) and aligned with a local LoD-1 3D building model (Figure 9f). In terms of height statistics, 571,440 buildings (76%) are shorter than 10 m, indicating that Shanghai is predominantly composed of low-rise residential structures. Buildings with a height of 10 to 30 m account for 19% of the total (143,204). Only 323 high-rise buildings exceed 100 m in height (i.e., <1% of the total), and they are mainly concentrated in central urban areas such as Huangpu, Jing’an, and Hongkou, as well as in the Lujiazui area of Pudong. The spatial distribution of high-rise buildings exhibits a clear core-periphery pattern, with dense concentrations in central business districts, and only a few are scattered in the suburbs.

Accuracy assessment was conducted for 559,614 buildings using reference heights extracted from the GF-7-derived DSM. The residual analysis (Figure 10a) indicates that 73% of height errors fall within the interval of [−5.0 m, 5.0 m]. Overall, a mean bias of −2.29 m indicates a slight underestimation. In the scatterplot of estimated vs. reference heights, the vast majority of points lie near the 1:1 line, indicating good agreement between predictions and reference values (Figure 10b). Regression analysis yields an R² of 0.74, an RMSE of 5.66 m, and an MAE of 3.96 m. However, the fitted trend line consistently lies below the 1:1 line, further confirming a systematic underestimation bias.
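The residual statistics quoted above (mean bias, share of errors within ±5 m, and R²) can be computed along the following lines. This is a sketch under assumptions: residuals are taken as estimated minus reference, so that a negative bias means underestimation, and R² is taken as the squared Pearson correlation of a simple linear fit, which the text does not state explicitly. The function names and sample values are hypothetical.

```python
from statistics import fmean


def residual_summary(estimated, reference, tolerance_m=5.0):
    """Return (mean bias, share of residuals within +/- tolerance_m)."""
    residuals = [e - r for e, r in zip(estimated, reference)]  # negative = underestimation
    bias = fmean(residuals)
    share_within = sum(abs(res) <= tolerance_m for res in residuals) / len(residuals)
    return bias, share_within


def squared_pearson_r(estimated, reference):
    """R^2 of a simple linear regression, i.e. the squared Pearson correlation."""
    mean_e, mean_r = fmean(estimated), fmean(reference)
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimated, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov * cov / (var_e * var_r)


# Illustrative heights in metres (not data from this study).
estimated = [8.0, 18.0, 31.0, 40.0]
reference = [10.0, 20.0, 30.0, 50.0]
print(residual_summary(estimated, reference), squared_pearson_r(estimated, reference))
```

A high squared correlation can coexist with a non-zero bias, which is why the mean residual and the position of the trend line relative to the 1:1 line are reported separately.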

To provide a more comprehensive evaluation, we conducted a stratified error analysis across five height intervals (Table 3). The results show a clear increase in error with building height. For low-rise buildings (<10 m), the estimation is most accurate (RMSE = 3.49 m, MAE = 2.75 m). Errors increase gradually for 10–30 m (RMSE = 6.70 m) and more noticeably for 30–100 m (RMSE = 9.69–15.50 m). For high-rise buildings (>100 m), the error rises sharply (RMSE = 60.87 m, MAE = 41.96 m), indicating significant performance degradation. Overall, the method performs reliably for low- and medium-height buildings, but its accuracy decreases substantially for very tall structures.

By removing samples heavily affected by building occlusion or vegetation-induced shadow contamination (accounting for approximately 6% of the total samples), the accuracy of height estimation substantially improves, with R² increasing from 0.74 to 0.80 and RMSE decreasing from 5.66 m to 4.46 m (Figure 10c). These results provide quantitative evidence that occlusion and vegetation interference are major contributors to the observed systematic underestimation.

In addition to the GF-7 DSMs, UAV LiDAR-derived building heights were employed to assess local-scale estimation accuracy. The estimated heights of 110 buildings were compared with those derived from UAV LiDAR (Figure 10d). The results reveal a moderate positive correlation between the estimated and reference heights, yielding an R² of 0.49, an RMSE of 6.64 m, and an MAE of 5.44 m. Most samples are concentrated in the low- to medium-height range (<60 m), where the estimated heights generally align with the reference values with moderate dispersion. Nevertheless, the regression slope of 0.894 and a negative intercept indicate a persistent tendency toward underestimation, particularly for three taller buildings, consistent with the bias pattern observed in the GF-7 DSM-based validation.

To further examine the impact of the selected filtering threshold on height estimation, a sensitivity analysis was conducted based on different shadow overlap ratios (Table 4).

Although the 50% threshold yields the lowest RMSE (5.646 m), the improvement compared with the 70% threshold (5.662 m) is marginal (only 0.016 m), while more than 4000 additional buildings would be excluded. In contrast, higher thresholds, such as 80% and 90%, retain more severely occluded samples and lead to noticeable accuracy degradation. Considering both estimation accuracy and sample completeness, the 70% threshold provides the best trade-off and was therefore adopted in this study.

4. Discussion

4.1. Ablation Test of SECT-Net

To evaluate the contribution of each component in the proposed SECT-Net, ablation experiments were conducted on the Jilin-1 dataset for both building footprint and shadow extraction. The baseline model was implemented using the UNetFormer architecture with a ResNet-50 backbone and its original specification. Quantitative results for both tasks are reported in Table 5.

As shown in Table 5, the introduction of MESM improves the IoU of building footprints from 75.72% to 77.05% and shadow IoU from 74.13% to 74.93%, demonstrating its effectiveness in enhancing boundary localization for both object types. By leveraging multi-scale feature aggregation and explicit edge supervision, MESM produces sharper and more complete contours. However, due to its strong sensitivity to local gradient variations, MESM is easily affected by texture and tone changes within building roofs, leading to internal holes and occasional false positives in both footprint and shadow predictions.

When DP-CTB is incorporated, the IoU increases to 77.40% for buildings and 74.66% for shadows. As shown in Figure 11, DP-CTB effectively captures long-range contextual dependencies, thereby suppressing noise and improving regional consistency in both tasks. Compared with MESM, DP-CTB demonstrates greater ability to suppress speckle noise and reduce internal holes, yielding more complete and smoother segmentation results. To verify the effectiveness of the MAS module, we conducted comparative experiments by integrating MAS alone into the baseline model. As shown in Table 5, the MAS module improves the baseline IoU to 76.84% (building) and 74.55% (shadow). MAS provides multi-scale semantic supervision, mitigating gradient vanishing and enhancing feature interaction between deep and shallow layers. This improves model consistency and training stability and reduces speckle noise, while qualitative results show better alignment between predictions and ground truth (Figure 11).

Furthermore, when MESM is jointly applied with DP-CTB and MAS, the model achieves its best performance with an IoU reaching 77.96% for buildings and 75.01% for shadows. This indicates that the three modules act in a complementary manner rather than as independent components. Specifically, MESM enhances boundary awareness, DP-CTB captures long-range contextual dependencies, and MAS improves multi-scale feature consistency during optimization. Their joint use promotes effective feature interaction. Edge-enhanced features support global context modeling, while global representations help suppress local noise, resulting in more accurate and stable segmentation.

To further validate the selection of key hyperparameters in the proposed DP-CTB, a sensitivity analysis was conducted on the token sampling strategy, including the number of spatial tokens (K_{s}), the number of channel tokens (K_{c}), and the sampling ratios of block-aware, boundary-aware, and region-aware tokens. The quantitative results are presented in Table 6.

For K_{s}, performance peaks at 196 and degrades beyond that, indicating that excessive spatial tokens may introduce redundant information and unnecessary computational overhead. For K_{c}, increasing the number of channel tokens beyond 32 brings only marginal improvement in shadow extraction while slightly reducing building segmentation accuracy, suggesting limited benefit from overly dense channel interactions. For sampling ratios, single-strategy sampling consistently underperforms mixed strategies, confirming that block, boundary, and region cues are complementary. The ratio 0.3/0.3/0.4 achieves the best trade-off across both tasks.
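To make the sampling ratios concrete, the sketch below splits a spatial token budget into integer counts per strategy. It rests on several assumptions that are not confirmed by the text: that the ratios partition the K_{s} spatial tokens, that they are listed in the order block/boundary/region, and that fractional counts are resolved with a largest-remainder rule. The function name `split_token_budget` is hypothetical.

```python
import math


def split_token_budget(total_tokens, ratios):
    """Split `total_tokens` into integer counts proportional to `ratios`.

    Uses the largest-remainder rule so the counts always sum to the budget
    (an assumed rounding scheme, chosen only for illustration).
    """
    raw = [total_tokens * ratio for ratio in ratios]
    counts = [math.floor(value) for value in raw]
    leftover = total_tokens - sum(counts)
    # Hand the leftover tokens to the entries with the largest fractional parts.
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# K_s = 196 with the 0.3/0.3/0.4 ratio (order assumed: block, boundary, region).
print(split_token_budget(196, (0.3, 0.3, 0.4)))
```

Under these assumptions the best-performing setting corresponds to 59 block-aware, 59 boundary-aware, and 78 region-aware tokens.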

4.2. Comparison of Different Building Footprint Products

A quantitative comparison was conducted between our produced building footprint product and three existing large-scale datasets—GABLE, 90_cities_BRA, and East Asian Buildings—on the Shanghai test dataset (Table 7). In addition to pixel-wise metrics (IoU, F1-score, and OA), we further incorporate boundary- and instance-sensitive metrics to evaluate the building footprint products comprehensively. Specifically, the boundary F1-score (BF1) [55] is used to assess contour accuracy by measuring the correspondence between predicted and ground-truth boundaries within a predefined tolerance distance. In this study, a tolerance of 5 pixels is adopted following Guo et al. [23]. For instance-level evaluation, object-level precision, recall, and F1-score were computed on vectorized building polygons. A predicted polygon was regarded as a true positive if its IoU with a ground-truth polygon exceeded 0.5 under one-to-one matching [56].
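The instance-level evaluation can be illustrated with a small sketch. To keep the example self-contained, each building is represented as a set of pixel coordinates rather than a vector polygon, and one-to-one matching is done greedily in order of decreasing IoU; both choices are assumptions, as are the names `rect`, `polygon_iou`, and `object_level_scores`.

```python
def rect(row0, col0, row1, col1):
    """Pixel set of an axis-aligned rectangle (stand-in for a building polygon)."""
    return {(r, c) for r in range(row0, row1) for c in range(col0, col1)}


def polygon_iou(a, b):
    """IoU of two buildings represented as pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def object_level_scores(predictions, truths, iou_threshold=0.5):
    """Object-level precision, recall, and F1 under one-to-one matching.

    A prediction is a true positive if its IoU with a still-unmatched
    ground-truth object exceeds `iou_threshold`.
    """
    candidates = sorted(
        ((polygon_iou(p, t), i, j)
         for i, p in enumerate(predictions)
         for j, t in enumerate(truths)),
        reverse=True,
    )
    matched_pred, matched_truth = set(), set()
    for iou, i, j in candidates:
        if iou <= iou_threshold:
            break  # remaining pairs are below the threshold
        if i in matched_pred or j in matched_truth:
            continue
        matched_pred.add(i)
        matched_truth.add(j)
    tp = len(matched_pred)
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(truths) if truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


truths = [rect(0, 0, 10, 10), rect(20, 20, 30, 30)]
predictions = [rect(0, 0, 10, 8), rect(20, 25, 30, 35), rect(50, 50, 55, 55)]
print(object_level_scores(predictions, truths))
```

In this toy case only the first prediction (IoU = 0.8) counts as a true positive; the shifted second prediction (IoU ≈ 0.33) and the spurious third one do not, giving a precision of 1/3 and a recall of 1/2.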

Overall, our method achieves substantially higher performance across pixel-level, boundary-level, and instance-level metrics. In particular, it attains the highest IoU (71.58%), F1-score (83.44%), OA (96.09%), and BF1 (88.35%), indicating superior spatial consistency and contour completeness. At the instance level, our method also yields markedly higher object-level precision, recall, and F1-score under an IoU threshold of 0.5, demonstrating a stronger ability to preserve complete building objects in the vectorized footprint product.

The visual investigation reveals characteristic structural errors in the existing datasets when applied to complex urban environments in Shanghai (Figure 12). For the GABLE dataset, noticeable geometric inconsistencies are observed, including overlapping, misalignments, and discontinuities in the vectorized building outlines. These issues are likely related to its instance-based extraction framework with cascade contour refinement, which might be less stable in highly dense urban environments [32]. The 90_cities_BRA dataset contains numerous fragmented and isolated patches, which highlight the limitations of the CNN-based DeepLabv3+ architecture in forming coherent object-level representations [33]. The East Asian Buildings dataset tends to over-regularize building outlines via boundary enhancement, resulting in inaccurate rooftop boundaries for irregular or multi-section buildings [34]. These results indicate that our proposed method maintains robust performance across heterogeneous urban environments, providing a reliable basis for subsequent building height estimation and three-dimensional urban modeling.

In contrast, our method demonstrates a stronger ability to distinguish buildings from non-building objects such as vehicles, temporary structures, and paved surfaces, while effectively avoiding contour breakage and excessive geometric regularization. Compared with existing data products, our resulting building footprints exhibit markedly improved visual consistency, structural integrity, and object discrimination accuracy. By leveraging a DL-based roof segmentation model and a robust contour vectorization strategy, our model accurately identifies more buildings in large-scale urban areas. These improvements stem from the model’s enhanced capacity to learn building-related textures, materials, and structural patterns. This highlights its superior generalization performance in structurally complex urban environments.

4.3. Comparison of Different Building Height Products

As shown in Table 8, the shadow-based method using a single-temporal Jilin-1 image achieves an RMSE of 5.7 m. Although this value falls within the range reported by several existing large-scale building height products, it should be noted that some of these products are generated at coarser spatial resolutions (e.g., 10 m or 30 m), and their accuracies are not strictly quantified. Nevertheless, the comparison provides a general indication that the proposed method can achieve competitive performance while relying on significantly simpler data inputs. These existing products are typically generated using data-driven frameworks that integrate heterogeneous multi-source datasets. In recent years, monocular depth estimation has emerged as a promising alternative for building height retrieval [57,58]. For example, GABLE [32] employs a DL-based DSM estimation network trained on stereo-derived reference heights. Although such learning-based approaches can achieve superior accuracy, their performance depends heavily on the availability of high-quality reference height datasets and on complex training pipelines.

Data-driven approaches typically require multiple heterogeneous datasets and reference height data to implement regression. Monocular depth estimation methods primarily rely on highly accurate reference-height data for supervised training and often require transfer learning to generalize to regions without pixel-level height labels [59,60]. In contrast, our proposed shadow-based method exploits explicit geometric relationships between building shadows and solar illumination, enabling height estimation with minimal data requirements. This advantage makes it particularly suitable for large-scale, data-scarce, or time-sensitive urban applications, where multi-view reconstruction or heavily supervised learning-based methods are difficult to deploy or maintain.

4.4. Uncertainty Analysis and Limitations

Despite the promising performance of the proposed framework, several sources of uncertainty may affect the accuracy of building height estimation. These uncertainties mainly arise from errors in shadow extraction and footprint delineation, as well as the propagation of these errors through the geometry-based estimation process. In addition, inconsistencies may be introduced by the use of multi-temporal imagery.

First, uncertainties are primarily introduced by errors in shadow extraction and footprint delineation, and their propagation through the geometry-based estimation process. Occlusion, vegetation interference, and low contrast may affect the accuracy of the extracted shadow length L_{s}, while footprint boundary inaccuracies may further bias its measurement. According to Equation (15), the height estimation formula can be equivalently rewritten as H = K \cdot L_{s}, where K is a constant determined by the solar and sensor geometry for a given image. This implies a linear error propagation relationship ΔH = K \cdot ΔL_{s}. For the Jilin-1 imagery used in this study, K \approx 1.003, indicating negligible geometric error amplification. Therefore, the overall uncertainty is mainly dominated by upstream extraction errors rather than the geometric transformation itself.
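A short numerical illustration of this linear propagation is given below. Only K ≈ 1.003 comes from the text; the 2 m shadow-length error is a hypothetical input, and the function names are introduced for illustration.

```python
K = 1.003  # geometric constant reported in the text for the Jilin-1 imagery


def height_error(delta_shadow_m, k=K):
    """Height error implied by a shadow-length error: delta_H = K * delta_L_s."""
    return k * delta_shadow_m


def shadow_error_for_height_error(delta_height_m, k=K):
    """Inverse relation: shadow-length error that would explain a given height error."""
    return delta_height_m / k


print(height_error(2.0))                     # hypothetical 2 m shadow-length error
print(shadow_error_for_height_error(-2.29))  # shadow shortfall matching the reported mean bias
```

A 2 m error in shadow length thus maps to about 2.006 m in height. Read in reverse, and assuming the whole bias originated in the shadow measurement, the mean height bias of −2.29 m reported in Section 3.2 would correspond to shadows that are about 2.28 m too short on average.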

Second, the use of multi-temporal imagery for footprint and shadow extraction may introduce occasional mismatches. Although geometric co-registration is applied, discrepancies may still occur due to local urban dynamics, such as temporary structures or construction activities. As a result, shadows extracted from one acquisition may not perfectly correspond to building footprints from another, leading to localized shadow-footprint mismatches. However, given the relatively short temporal interval between the two acquisitions, such inconsistencies are generally limited and only affect a small number of cases.

These uncertainties are further reflected in a systematic bias observed in the height estimation results. Although the shadow-based height estimation generally shows strong agreement with the DSM reference heights, a consistent underestimation is observed (Figure 10). This bias is mainly attributed to occlusion effects in dense urban areas, as widely reported in previous shadow-based studies [49,61]. The complex and irregular geometries of high-rise buildings make accurate footprint delineation challenging. In addition, the close spacing of buildings in dense urban areas leads to severe shadow occlusion, resulting in shortened observable shadow lengths (Figure 13a–c). For relatively low-rise buildings, vegetation cover may further obscure shadow boundaries, introducing additional negative bias (Figure 13d–f).

In particular, the limitation becomes more pronounced for high-rise buildings (>100 m), where the RMSE reaches 60.87 m. Although such buildings account for only a small proportion of the total samples (322 buildings), their large errors substantially affect the stratified accuracy results. This indicates that single-temporal shadow-based methods are less reliable for extremely high-rise cases, especially in dense urban cores.

Future work will focus on improving height estimation for these challenging samples by incorporating multi-temporal imagery acquired under different solar elevation conditions. Images with higher solar elevation angles (e.g., May) produce shorter shadows, which can reduce occlusion and shadow projection errors, while winter images provide longer and clearer shadows for geometric estimation. Combining multi-temporal observations and multi-view results may further improve the robustness and reliability of high-rise building height retrieval.
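The dependence of shadow length on solar elevation can be illustrated with the textbook flat-terrain relation L = H / tan(elevation). This is a simplification that ignores the sensor geometry and is not Equation (15) of this study; the elevation angles and building height below are hypothetical, as is the function name.

```python
import math


def flat_ground_shadow_length(height_m, solar_elevation_deg):
    """Shadow length on flat ground for a vertical object: L = H / tan(elevation)."""
    return height_m / math.tan(math.radians(solar_elevation_deg))


# A hypothetical 100 m building under a low and a high sun.
for elevation in (30.0, 70.0):
    print(elevation, round(flat_ground_shadow_length(100.0, elevation), 1))
```

Under this simplification, the shadow of a 100 m building shrinks from roughly 173 m at 30° elevation to roughly 36 m at 70°, which illustrates why imagery acquired at higher solar elevation is less prone to shadows being cut off by neighbouring buildings.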

5. Conclusions

In complex urban environments, accurately characterizing building outlines, shadows, and heights remains challenging due to spatial complexity and the occlusion effect. This study systematically investigated the automated extraction of city-scale urban building footprints and height information from high-resolution spaceborne remote sensing imagery. We proposed a hybrid architecture, SECT-Net, that integrates the strengths of both CNNs and Transformers. The experimental results demonstrate that SECT-Net achieves competitive and generally improved performance compared with several representative deep learning segmentation networks, thereby improving the stability and reliability of building height inversion.

Our study demonstrates the feasibility of high-resolution multispectral images acquired by the Jilin-1 satellite for building mapping and height estimation of Shanghai, China. Compared with several existing building datasets, our extracted building footprints exhibit better geometric integrity, fewer fragmented areas, and higher confidence in complex urban settings. With the high-quality building footprints, a shadow-based height estimation framework was developed to extract heights for over 750,000 buildings across the entire city. When evaluated against two reference datasets (GF-7 DSM and UAV LiDAR DSM), the estimated building heights achieved an RMSE ranging from 4.5 to 6.6 m, demonstrating favorable accuracy.

Our proposed systematic workflow demonstrates promising performance in extracting building footprints and estimating building heights in complex urban environments. However, the present study was only validated in Shanghai, which is characterized by relatively flat terrain and regular urban morphology. The generalization of the proposed method needs to be further investigated for other regions with diverse terrain conditions (e.g., mountainous areas) and varying urban structures. Satellite remote sensing is moving steadily toward the routine acquisition of high-resolution imagery, which will support the precise and rapid extraction of buildings at large scales. Future studies will focus on extensive cross-region validation and methodological improvement. In addition, we will explore the joint use of occlusion-aware shadow modeling, multi-source heterogeneous data fusion, and multi-view imaging for high-precision 3D city model reconstruction. Overall, this study provides a potential pathway for LoD-1 3D city modeling to support fine-scale urban management in smart cities.

Author Contributions

Conceptualization, Y.Z.; methodology, J.D. and Y.Z.; software, J.D.; validation, J.D. and W.Y.; formal analysis, J.D. and W.Y.; investigation, J.D. and W.Y.; resources, W.Y.; data curation, J.D. and Y.Z.; writing—original draft preparation, J.D.; writing—review and editing, Y.Z.; visualization, J.D. and Y.Z.; supervision, Y.Z.; project administration, W.Y.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are available by contacting the corresponding author.

Acknowledgments

This study was supported by Chang Guang Satellite Technology Co., Ltd., which provided the high-resolution Jilin-1 imagery. The authors would like to thank the anonymous reviewers and the Editor for their constructive comments and valuable suggestions, which have significantly improved the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

United Nations, Department of Economic and Social Affairs, Population Division. World Urbanization Prospects: The 2025 Revision; United Nations: New York, NY, USA, 2025; Available online: https://population.un.org/wup/assets/Publications/undesa_pd_2024_key_messages_wup_2025.pdf (accessed on 26 March 2026).
Xu, Y.; Ren, C.; Ma, P.; Ho, J.; Wang, W.; Lau, K.K.-L.; Lin, H.; Ng, E. Urban morphology detection and computation for urban climate research. Landsc. Urban Plan. 2017, 167, 212–224. [Google Scholar] [CrossRef]
Zhu, Z.; Zhou, Y.; Seto, K.C.; Stokes, E.C.; Deng, C.; Pickett, S.T.A.; Taubenböck, H. Understanding an urbanizing planet: Strategic directions for remote sensing. Remote Sens. Environ. 2019, 228, 164–182. [Google Scholar] [CrossRef]
Yang, F.; Zhong, K.; Chen, Y.; Kang, Y. Simulations of the impacts of building height layout on air quality in natural-ventilated rooms around street canyons. Environ. Sci. Pollut. Res. 2017, 24, 23620–23635. [Google Scholar] [CrossRef] [PubMed]
van Hove, L.W.A.; Jacobs, C.M.J.; Heusinkveld, B.G.; Elbers, J.A.; van Driel, B.L.; Holtslag, A.A.M. Temporal and spatial variability of urban heat island and thermal comfort within the Rotterdam agglomeration. Build. Environ. 2015, 83, 91–103. [Google Scholar] [CrossRef]
Vojinovic, Z.; Seyoum, S.D.; Mwalwaka, J.M.; Price, R.K. Effects of model schematisation, geometry and parameter values on urban flood modelling. Water Sci. Technol. 2011, 63, 462–467. [Google Scholar] [CrossRef]
Creutzig, F.; Baiocchi, G.; Bierkandt, R.; Pichler, P.-P.; Seto, K.C. Global typology of urban energy use and potentials for an urbanization mitigation wedge. Proc. Natl. Acad. Sci. USA 2015, 112, 6283–6288. [Google Scholar] [CrossRef]
Engelfriet, L.; Koomen, E. The impact of urban form on commuting in large Chinese cities. Transportation 2018, 45, 1269–1295. [Google Scholar] [CrossRef]
Schug, F.; Frantz, D.; Van Der Linden, S.; Hostert, P. Gridded population mapping for Germany based on building density, height and type from Earth Observation data using census disaggregation and bottom-up estimates. PLoS ONE 2021, 16, e0249044. [Google Scholar] [CrossRef]
Huang, X.; Zhang, L.P. A multidirectional and multiscale morphological index for automatic building extraction from multispectral GeoEye-1 imagery. Photogramm. Eng. Remote Sens. 2011, 77, 721–732. [Google Scholar] [CrossRef]
Tupin, F.; Roux, M. Detection of building outlines based on the fusion of SAR and optical features. ISPRS J. Photogramm. Remote Sens. 2003, 58, 71–82. [Google Scholar] [CrossRef]
Sirmacek, B.; Unsalan, C. Urban-area and building detection using SIFT keypoints and graph theory. IEEE Trans. Geosci. Remote Sens. 2009, 47, 1156–1167.
Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16.
Tan, Y.; Yu, Y.; Xiong, S.; Tian, J. Semi-automatic building extraction from very high resolution remote sensing imagery via energy minimization model. In Proceedings of 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 657–660.
Khoshelham, K.; Nardinocchi, C.; Frontoni, E.; Mancini, A.; Zingaretti, P. Performance evaluation of automated approaches to building detection in multi-source aerial data. ISPRS J. Photogramm. Remote Sens. 2010, 65, 123–133.
Koc-San, D.; Turker, M. A model-based approach for automatic building database updating from high-resolution space imagery. Int. J. Remote Sens. 2012, 33, 4193–4218.
Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28.
Li, W.; He, C.; Fang, J.; Zheng, J.; Fu, H.; Yu, L. Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data. Remote Sens. 2019, 11, 403.
Liu, P.H.; Liu, X.P.; Liu, M.X.; Shi, Q.; Yang, J.X.; Xu, X.C.; Zhang, Y.Y. Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sens. 2019, 11, 830.
Wen, Q.; Jiang, K.Y.; Wang, W.; Liu, Q.J.; Guo, Q.; Li, L.L.; Wang, P. Automatic building extraction from Google Earth images under complex backgrounds based on deep instance segmentation network. Sensors 2019, 19, 333.
Xie, Y.Q.; Cai, J.N.; Bhojwani, R.; Shekhar, S.; Knight, J. A locally-constrained YOLO framework for detecting small and densely-distributed building footprints. Int. J. Geogr. Inf. Sci. 2020, 34, 777–801.
Wang, L.; Fang, S.; Meng, X.; Li, R. Building Extraction with Vision Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625711.
Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252.
Li, Y.; Hong, D.; Li, C.; Yao, J.; Chanussot, J. HD-Net: High-resolution decoupled network for building footprint extraction via deeply supervised body and boundary decomposition. ISPRS J. Photogramm. Remote Sens. 2024, 209, 51–65.
Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. RSSFormer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064.
Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in Remote Sensing: A Survey. Remote Sens. 2023, 15, 1860.
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying convolution and attention for all data sizes. arXiv 2021, arXiv:2106.04803.
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612.
Song, P.; Li, J.; An, Z.; Fan, H.; Fan, L. CTMFNet: CNN and transformer multiscale fusion network of remote sensing urban scene imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5900314.
Sun, X.; Huang, X.; Mao, Y.; Sheng, T.; Li, J.; Wang, Z.; Lu, X.; Ma, X.; Tang, D.; Chen, K. GABLE: A first fine-grained 3D building model of China on a national scale from very high resolution satellite imagery. Remote Sens. Environ. 2024, 305, 114057.
Zhang, Z.; Qian, Z.; Zhong, T.; Chen, M.; Zhang, K.; Yang, Y.; Zhu, R.; Zhang, F.; Zhang, H.; Zhou, F.; et al. Vectorized rooftop area data for 90 cities in China. Sci. Data 2022, 9, 66.
Shi, Q.; Zhu, J.; Liu, Z.; Guo, H.; Gao, S.; Liu, M.; Liu, Z.; Liu, X. The last puzzle of global building footprints—Mapping 280 million buildings in East Asia based on VHR images. J. Remote Sens. 2024, 4, 0138.
He, T.; Hu, Y.; Li, F.; Chen, Y.; Zhang, M.; Zheng, Q.; Dong, B.; Ren, H. An improved height sampling approach used for global urban building height mapping. Int. J. Appl. Earth Obs. Geoinf. 2025, 141, 104633.
Haithcoat, T.L.; Song, W.; Hipple, J.D. Building footprint extraction and 3-D reconstruction from LIDAR data. In Proceedings of IEEE/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, Rome, Italy, 8–9 November 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 74–78.
Aravinth, J.; Lavenya, R.; Shanmukha, K.; Vaishnav, K. Evaluation and analysis of building height with LiDAR data. In Proceedings of 2018 3rd International Conference on Communication and Electronics Systems, Coimbatore, India, 15–16 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 397–402.
Dubois, C.; Thiele, A.; Hinz, S. Building detection and building parameter retrieval in InSAR phase images. ISPRS J. Photogramm. Remote Sens. 2016, 114, 228–241.
Liu, W.; Suzuki, K.; Yamazaki, F. Height estimation for high-rise buildings based on InSAR analysis. In Proceedings of 2015 Joint Urban Remote Sensing Event (JURSE), Lausanne, Switzerland, 30 March–1 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–4.
Liasis, G.; Stavrou, S. Satellite images analysis for shadow detection and building height estimation. ISPRS J. Photogramm. Remote Sens. 2016, 119, 437–450.
Shao, Y.; Taff, G.N.; Walsh, S.J. Shadow detection and building-height estimation using IKONOS data. Int. J. Remote Sens. 2011, 32, 6929–6944.
Wang, J.; Wang, X. Information extraction of building height and density based on quick bird image in Kunming, China. In Proceedings of 2009 Joint Urban Remote Sensing Event, Shanghai, China, 20–22 May 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1–8.
Pesaresi, M.; Politis, P. GHS-BUILT-H R2023A-GHS Building Height, Derived from AW3D30, SRTM30, and Sentinel2 Composite (2018); European Commission, Joint Research Centre: Brussels, Belgium, 2023; pp. 26–32.
Esch, T.; Brzoska, E.; Dech, S.; Leutner, B.; Palacios-Lopez, D.; Metz-Marconcini, A.; Marconcini, M.; Roth, A.; Zeidler, J. World settlement footprint 3D—A first three-dimensional survey of the global building stock. Remote Sens. Environ. 2022, 270, 112877.
Wu, W.-B.; Ma, J.; Banzhaf, E.; Meadows, M.E.; Yu, Z.-W.; Guo, F.-X.; Sengupta, D.; Cai, X.-X.; Zhao, B. A first Chinese building height estimate at 10 m resolution (CNBH-10 m) using multi-source earth observations and machine learning. Remote Sens. Environ. 2023, 291, 113578.
Che, Y.; Li, X.; Liu, X.; Wang, Y.; Liao, W.; Zheng, X.; Zhang, X.; Xu, X.; Shi, Q.; Zhu, J.; et al. 3D-GloBFP: The first global three-dimensional building footprint dataset. Earth Syst. Sci. Data 2024, 16, 5357–5374.
Zhang, Y.; Zhao, H.; Long, Y. CMAB: A multi-attribute building dataset of China. Sci. Data 2025, 12, 430.
Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441.
Li, Z.; Ji, S.; Fan, D.; Yan, Z.; Wang, F.; Wang, R. Reconstruction of 3D information of buildings from single-view images based on shadow information. ISPRS Int. J. Geo-Inf. 2024, 13, 62.
Xie, Y.; Feng, D.; Xiong, S.; Zhu, J.; Liu, Y. Multi-Scene building height estimation method based on shadow in high resolution imagery. Remote Sens. 2021, 13, 2862.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 833–851.
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure Transformer for medical image segmentation. arXiv 2021, arXiv:2105.05537.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with Transformers. arXiv 2021, arXiv:2105.15203.
Csurka, G.; Larlus, D.; Perronnin, F. What is a good evaluation measure for semantic segmentation? In Proceedings of British Machine Vision Conference, Bristol, UK, 9–13 September 2013; BMVA Press: Bristol, UK, 2013.
Chen, S.; Ogawa, Y.; Zhao, C.; Sekimoto, Y. Large-Scale Building Footprint Extraction from Open-Sourced Satellite Imagery via Instance Segmentation Approach. In Proceedings of IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6284–6287.
Li, Q.; Mou, L.; Hua, Y.; Shi, Y.; Chen, S.; Sun, Y.; Zhu, X.X. 3DCentripetalNet: Building height retrieval from monocular remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 120, 103311.
Lei, Y.; Jiang, W. A relation aware and edge preserving height refinement network for single-view height estimation from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12988–13002.
Zhao, W.; Persello, C.; Stein, A. Semantic-aware unsupervised domain adaptation for height estimation from single-view aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 372–385.
Hong, Z.; Wu, T.; Persello, C.; Zhao, W. Dual-domain representation alignment for unsupervised height estimation from cross-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 2026, 231, 443–457.
Zhao, Y.; Wu, B.; Li, Q.; Yang, L.; Fan, H.; Wu, J.; Yu, B. Combining ICESat-2 photons and Google Earth Satellite images for building height extraction. Int. J. Appl. Earth Obs. Geoinf. 2023, 117, 103213.

Figure 1. Study area and the selected building samples.

Figure 2. Overall architecture of SECT-Net.

Figure 3. Attention mechanism: (a) structure of the sparse global-local transformer block; (b) details of the spatial-channel token sampler.

Figure 4. Schematic view of the geometric relationship between buildings and shadows: (a) Sensor on the opposite side of the sun, (b) Sensor on the same side as the sun.

Figure 5. Workflow for estimating building heights from shadow lengths. Red outlines denote the building footprint boundaries used for self-occlusion detection.

Figure 6. Examples of extracted building footprints with the proposed SECT-Net vs. other DL models on the test set. Red boxes highlight representative regions for local detail comparison among different methods.

Figure 7. Examples of extracted building shadows with SECT-Net vs. other DL models on the test set. Red boxes highlight representative regions for local detail comparison among different methods.

Figure 8. Spatial distribution of extracted building footprints across Shanghai. Representative sub-regions highlight heterogeneous urban structural patterns, including central urban areas (a), industrial zones (b), low-density residential (villa) areas (c), suburban residential areas (d), coastal port and logistics zones (e), and urban villages characterized by irregular and high-density informal settlements (f). The red bounding boxes indicate the extracted building footprint boundaries.

Figure 9. Spatial distribution of estimated building heights in Shanghai. (a–c) Enlarged views of representative urban scenarios, including (a) suburban residential areas, (b) urban residential areas, and (c) industrial zones. (d) City-scale distribution of building heights. (e,f) Detailed visualization of the Lujiazui area, including (e) a 2D map of building height variations and (f) the corresponding LoD-1 3D representation.

Figure 10. Accuracy assessment of building height estimation: (a) frequency histogram of height estimation errors; (b) scatterplot of the estimated vs. GF7 DSM-derived reference heights for all buildings; (c) the same scatterplot after removing buildings severely affected by shadow occlusion; (d) scatterplot of the estimated vs. UAV LiDAR-derived building heights. The dashed line represents the 1:1 reference line, and the solid line indicates the linear regression fit.

Figure 11. Visualization of ablation results for building footprint and shadow extraction on the Jilin-1 test set. Red boxes highlight representative regions for local detail comparison among different methods.

Figure 12. Visual comparison of building rooftops with existing building footprint datasets. The green polygons represent the ground-truth building footprint boundaries, while the red polygons denote the building footprints provided by the different datasets.

Figure 13. Occlusion scenarios: (a–c) shadows obstructed by clusters of taller buildings; (d–f) shadows of lower buildings blurred by tree canopies. The red polygons indicate the building footprint boundaries overlaid for reference.
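
The two sun and sensor configurations named in the Figure 4 caption can be illustrated with the classical shadow-geometry relations. The sketch below is not the authors' workflow: the function name `height_from_shadow` and its parameters are hypothetical, and it assumes flat ground, a vertical wall, a shadow length measured along the sun azimuth, and a sensor azimuth lying in the same vertical plane as the sun. Under those assumptions the full shadow of a building of height H has length H·cot(sun elevation), and a sensor on the sun's side sees that length reduced by H·cot(sensor elevation).

```python
import math


def height_from_shadow(shadow_len_m, sun_elev_deg, sensor_elev_deg=None, same_side=False):
    """Estimate building height (m) from a measured shadow length (m).

    Illustrative assumptions (not taken from the paper): flat ground,
    vertical wall, shadow measured along the sun azimuth, and sensor
    azimuth in the same vertical plane as the sun azimuth.
    """
    cot_sun = 1.0 / math.tan(math.radians(sun_elev_deg))
    if not same_side:
        # Sensor opposite the sun: the whole ground shadow is visible,
        # so L = H * cot(sun) and H = L / cot(sun).
        return shadow_len_m / cot_sun
    if sensor_elev_deg is None:
        raise ValueError("sensor_elev_deg is required when same_side=True")
    cot_sensor = 1.0 / math.tan(math.radians(sensor_elev_deg))
    # Sensor on the sun's side: the building hides H * cot(sensor) of its
    # own shadow, so the visible length is H * (cot(sun) - cot(sensor)).
    visible_per_metre = cot_sun - cot_sensor
    if visible_per_metre <= 0:
        raise ValueError("shadow is fully hidden by the building for these angles")
    return shadow_len_m / visible_per_metre


# Example: a 20 m shadow under a 45 degree sun, sensor opposite the sun.
print(round(height_from_shadow(20.0, 45.0), 2))
```

When the sensor and sun azimuths are not aligned, the occluded part has to be projected onto the shadow direction, which this simplified version does not attempt.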

Table 1. Quantitative comparison of building footprint and shadow extraction on the test set.

Method	$\mathrm{IoU}_{\mathrm{footprint}}$ (%)	$\mathrm{F1}_{\mathrm{footprint}}$ (%)	$\mathrm{OA}_{\mathrm{footprint}}$ (%)	$\mathrm{IoU}_{\mathrm{shadow}}$ (%)	$\mathrm{F1}_{\mathrm{shadow}}$ (%)	$\mathrm{OA}_{\mathrm{shadow}}$ (%)
U-Net	63.27	74.07	96.12	57.49	69.86	97.60
DeepLabv3+	62.52	73.83	95.90	63.65	77.79	96.69
Swin-UNet	74.39	85.32	96.72	74.65	85.49	97.72
SegFormer	73.62	84.81	96.47	71.70	83.52	97.45
UNetFormer	75.72	86.15	96.91	74.13	85.14	97.67
SECT-Net	77.96	87.62	97.16	75.01	85.72	97.75
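
For reference, the three pixel-level scores in Table 1 can be computed from a binary confusion matrix as sketched below. This uses the standard definitions of IoU, F1, and overall accuracy, which the paper presumably follows; the function name `binary_metrics` and the flat 0/1 mask representation are illustrative assumptions.

```python
def binary_metrics(pred, truth):
    """Return IoU, F1, and overall accuracy for two flat 0/1 masks.

    Standard definitions, with the positive class being the target
    (building footprint or shadow):
      IoU = TP / (TP + FP + FN)
      F1  = 2*TP / (2*TP + FP + FN)
      OA  = (TP + TN) / all pixels
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    oa = (tp + tn) / total if total else 0.0
    return {"iou": iou, "f1": f1, "oa": oa}


# Toy masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(pred, truth))
```

Because OA also counts true negatives, it stays high whenever background pixels dominate, which is consistent with the OA columns in Table 1 varying far less than the IoU columns.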

Table 2. Statistical significance test results based on per-image Overall Accuracy (OA).

Comparison	Building Footprint OA (p-Value)	Building Shadow OA (p-Value)
SECT-Net vs. U-Net	<0.001	0.0886
SECT-Net vs. DeepLabv3+	<0.001	<0.001
SECT-Net vs. Swin-UNet	0.0018	0.9999
SECT-Net vs. SegFormer	<0.001	<0.001
SECT-Net vs. UNetFormer	<0.001	<0.001

Statistical significance was evaluated using a one-sided Wilcoxon signed-rank test. Differences were considered significant at p < 0.05.
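
To make the test behind Table 2 concrete, the following is a minimal pure-Python sketch of a one-sided Wilcoxon signed-rank test on paired per-image scores. It is not the authors' code: the function name `wilcoxon_greater_exact` is hypothetical, zero differences are dropped, tied absolute differences receive average ranks, and the p-value is obtained by enumerating all sign patterns, which is only practical for a small number of pairs. For realistic sample sizes a library routine such as `scipy.stats.wilcoxon` with `alternative="greater"` would be the usual choice.

```python
from itertools import product


def wilcoxon_greater_exact(x, y):
    """One-sided Wilcoxon signed-rank test of H1: x tends to exceed y.

    Returns (W_plus, p_value). The p-value is exact under the null
    hypothesis that each difference is equally likely to be + or -.
    Enumeration costs 2**n, so use only for small n.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Test statistic: sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Count sign patterns whose positive-rank sum is at least as large.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy per-image scores for two models (made-up numbers, not from the paper).
model_a = [12, 23, 30, 44, 55]
model_b = [10, 20, 31, 40, 50]
print(wilcoxon_greater_exact(model_a, model_b))
```

With a one-sided alternative, a p-value close to 1 means the paired differences give no support for the tested direction, which is how the 0.9999 entry in Table 2 should be read.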

Table 3. Stratified accuracy assessment of building height estimation across different height ranges.

Height Range	RMSE (m)	MAE (m)	N
<10 m	3.487	2.753	343,184
10–30 m	6.697	5.389	186,722
30–50 m	9.692	7.305	19,450
50–100 m	15.499	11.246	9936
>100 m	60.865	41.961	322

Table 4. Sensitivity analysis of occlusion filtering threshold.

Threshold	Building Count	R²	RMSE (m)	MAE (m)
50%	555,494	0.738	5.646	3.960
60%	557,762	0.737	5.658	3.962
70%	559,515	0.736	5.662	3.964
80%	561,194	0.733	5.701	3.975
90%	562,809	0.729	5.751	3.987
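
The RMSE and MAE columns of Tables 3 and 4 follow the usual definitions, and the stratification in Table 3 amounts to computing them per height bin. The sketch below shows one way to do this; the function name `stratified_errors` is hypothetical, and two choices are assumptions rather than details from the paper: buildings are binned by their reference height, and each bin includes its lower bound and excludes its upper bound.

```python
import math

# Height ranges of Table 3 in metres: [low, high).
BINS = [(0, 10), (10, 30), (30, 50), (50, 100), (100, math.inf)]


def stratified_errors(estimated, reference, bins=BINS):
    """Return (low, high, RMSE, MAE, N) for every non-empty height bin.

    Assumption: a building is assigned to a bin by its reference height.
    """
    rows = []
    for low, high in bins:
        errors = [e - r for e, r in zip(estimated, reference) if low <= r < high]
        if not errors:
            continue  # skip bins that contain no buildings
        n = len(errors)
        rmse = math.sqrt(sum(d * d for d in errors) / n)
        mae = sum(abs(d) for d in errors) / n
        rows.append((low, high, rmse, mae, n))
    return rows


# Toy heights in metres (made-up values, not from the paper).
estimated = [5.0, 8.0, 20.0, 120.0]
reference = [6.0, 6.0, 25.0, 150.0]
for low, high, rmse, mae, n in stratified_errors(estimated, reference):
    print(f"{low}-{high} m: RMSE={rmse:.3f} MAE={mae:.3f} N={n}")
```

RMSE weights large errors more heavily than MAE, so a widening gap between the two in the tall-building bins points to a few very large individual errors.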

Table 5. Comparison of ablation experiment results of the SECT-Net.

Method	$\mathrm{IoU}_{\mathrm{building}}$ (%)	$\mathrm{F1}_{\mathrm{building}}$ (%)	$\mathrm{OA}_{\mathrm{building}}$ (%)	$\mathrm{IoU}_{\mathrm{shadow}}$ (%)	$\mathrm{F1}_{\mathrm{shadow}}$ (%)	$\mathrm{OA}_{\mathrm{shadow}}$ (%)
baseline	75.72	86.15	96.91	74.13	85.14	97.67
+MESM	77.05	86.90	97.08	74.93	85.67	97.77
+DP-CTB	77.40	87.34	97.12	74.66	85.49	97.74
+MAS	76.84	86.90	97.03	74.55	85.42	97.72
+MESM + DP-CTB + MAS	77.96	87.62	97.16	75.01	85.72	97.75

Table 6. Sensitivity analysis of token sampling strategy.

$K_{s}$	$K_{c}$	$\mathrm{ratio}_{\mathrm{block}}$	$\mathrm{ratio}_{\mathrm{boundary}}$	$\mathrm{ratio}_{\mathrm{region}}$	$\mathrm{IoU}_{\mathrm{building}}$ (%)	$\mathrm{IoU}_{\mathrm{shadow}}$ (%)
64	32	0.3	0.3	0.4	76.15	74.11
128	32	0.3	0.3	0.4	76.88	74.57
196	32	0.3	0.3	0.4	77.40	74.66
256	32	0.3	0.3	0.4	76.51	74.15
196	8	0.3	0.3	0.4	76.35	73.89
196	16	0.3	0.3	0.4	76.65	74.44
196	32	0.3	0.3	0.4	77.40	74.66
196	64	0.3	0.3	0.4	77.06	74.71
196	32	1.0	0.0	0.0	76.74	74.42
196	32	0.0	1.0	0.0	76.71	74.25
196	32	0.0	0.0	1.0	76.84	74.57
196	32	0.3	0.3	0.4	77.40	74.66
196	32	0.3	0.2	0.5	77.05	74.59
196	32	0.5	0.2	0.3	77.00	74.73
196	32	0.4	0.4	0.2	77.03	74.60

Table 7. Evaluation metrics of building footprint products on the test set.

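Dataset	Pixel-Level IoU (%)	Pixel-Level F1 (%)	Pixel-Level OA (%)	Boundary-Level BF1 (%)	Instance-Level $P_{0.5}^{\mathrm{IoU}}$ (%)	Instance-Level $R_{0.5}^{\mathrm{IoU}}$ (%)	Instance-Level $\mathrm{F1}_{0.5}^{\mathrm{IoU}}$ (%)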
GABLE [32]	45.61	62.65	92.20	67.23	17.69	35.89	23.70
90_cities_BRA [33]	51.89	68.33	93.02	78.82	26.93	39.79	32.12
East Asian Buildings [34]	41.85	59.01	91.05	71.49	15.45	30.85	20.59
Ours	71.58	83.44	96.09	88.35	68.50	54.35	60.61
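
The instance-level columns of Table 7 appear to be precision, recall, and F1 at an IoU threshold of 0.5. A minimal sketch of such a score is given below, assuming that a predicted building counts as a true positive when it overlaps a not-yet-matched reference building with IoU of at least 0.5. The function name `instance_prf`, the greedy one-to-one matching, and the representation of each building as a set of pixel coordinates are illustrative assumptions, not the evaluation protocol confirmed by the paper.

```python
def instance_prf(pred_instances, gt_instances, iou_threshold=0.5):
    """Instance-level precision, recall, and F1 at a given IoU threshold.

    Each instance is a set of (row, col) pixel coordinates. Matching is
    greedy and one-to-one: every reference instance can be matched once.
    """
    matched_gt = set()
    true_positives = 0
    for pred in pred_instances:
        best_index, best_iou = None, 0.0
        for index, gt in enumerate(gt_instances):
            if index in matched_gt:
                continue
            union = len(pred | gt)
            iou = len(pred & gt) / union if union else 0.0
            if iou > best_iou:
                best_index, best_iou = index, iou
        if best_index is not None and best_iou >= iou_threshold:
            matched_gt.add(best_index)
            true_positives += 1
    precision = true_positives / len(pred_instances) if pred_instances else 0.0
    recall = true_positives / len(gt_instances) if gt_instances else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one good match (IoU 0.75), two spurious predictions, one missed building.
gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(10, 10), (10, 11)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(20, 20)}, {(30, 30)}]
print(instance_prf(pred, gt))
```

Unlike the pixel-level scores, this measure penalizes merged or fragmented buildings, so a product can reach a moderate pixel IoU while scoring much lower at instance level.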

Table 8. Comparison of building height products under different data and methodological settings.

Product	Data Source	Method	Product Type	RMSE
CNBH-10 m [45]	Sentinel-1, PALSAR, Sentinel-2, LUOJIA 1–01, World population, SRTM, WSF 2019, Baidu building height	Random Forest	10 m Raster	6.1 m
GABLE [32]	Beijing-3, Gaofen-7, WorldView, WSF 2019	Semantic Flow Field-driven DSM Estimation network	Vector	3.7 m
3D-GloBFP [46]	Sentinel-1, Sentinel-2, DEM, DSM, World population, Nighttime light, Baidu building height	XGBoost	Vector	1.9–14.6 m
CMAB [47]	GES imagery, Amap road, POI, Baidu building height	XGBoost	Vector	7.6 m
He et al. [35]	Sentinel-1, Sentinel-2, SRTM, PALSAR, VIIRS, WSF 2019, ALOS AW3D 30, GEDI	Random Forest	30 m Raster	4.7–10.1 m
Ours	Jilin-1	Shadow-based algorithm	Vector	4.5–5.7 m

