Article

High-Rise Building Area Extraction Based on Prior-Embedded Dual-Branch Neural Network

1 College of Surveying & Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China
2 Key Laboratory of Computational Optical Imaging Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, No. 9 Deng Zhuang South Road, Beijing 100094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 167; https://doi.org/10.3390/rs18010167
Submission received: 18 November 2025 / Revised: 25 December 2025 / Accepted: 30 December 2025 / Published: 4 January 2026

Highlights

What are the main findings?
  • Prior-Embedded Dual-Branch Neural Network (PEDNet) demonstrates strong generalization across wide geographic regions and large timestamp spans in high-rise building area (HRB) extraction from Sentinel-2 imagery by successfully balancing global features with local details and embedding diverse prior information.
What are the implications of the main findings?
  • This robust method enables more frequent and accurate HRB extraction at the national scale, supporting urban planning and environmental assessment.

Abstract

High-rise building areas (HRBs) play a crucial role in providing social and environmental services during modern urbanization. Their large-scale, long-term spatial distribution characteristics have significant implications for fields such as urban planning and regional climate analysis. However, existing studies are largely limited to local regions and fixed-time-phase images. They are also affected by variations in remote sensing image acquisition and scene characteristics, such as regional architectural styles, illumination conditions, seasons, and sensor differences, which makes robust extraction across time and regions challenging. To address these challenges, we propose an improved method for extracting HRBs that uses a Prior-Embedded Dual-Branch Neural Network (PEDNet). The dual-path design balances global features with local details. More importantly, we employ a window attention mechanism to introduce diverse prior information as embedded features. By integrating these features, our method becomes more robust to variations in HRB image characteristics. We conducted extensive experiments using Sentinel-2 data from four typical cities. The results demonstrate that our method outperforms traditional models, such as FCN and U-Net, as well as more recent high-performance segmentation models, including DeepLabV3+ and BuildFormer. It effectively captures HRB features in remote sensing images, adapts to complex conditions, and provides a reliable tool for urban monitoring across wide geographic spans and imaging timestamps. It has practical applications for optimizing urban planning and improving the efficiency of resource management.

1. Introduction

Driven by rapid urbanization and land resource limitations, China’s urban areas are shifting from “horizontal expansion” to “vertical growth”. HRBs are central to this shift and serve as key indicators of a city’s development potential. Compared to low-rise clusters, HRBs maximize vertical space use, reshaping city layouts and significantly affecting urban functions. They act as mixed-use spaces, combining commerce, residential, and office areas, directly influencing population movement and public service distribution. Their widespread distribution reflects economic vitality but also challenges infrastructure capacity. Therefore, accurately mapping HRBs’ spatial distribution is vital for urban planning and sustainable growth [1].
Currently, there is no universal definition of HRBs in academia, with different studies emphasizing various aspects based on objectives: engineering standards often use height thresholds (such as 24 or 27 m) as main criteria (e.g., the “Unified Standard for Civil Building Design GB 50252-2019”) [2]; in contrast, geographical research emphasizes their spatial clustering, focusing on “building clusters” as unified entities. In this study, aligned with “regional adaptability”, HRBs are defined as building clusters with an average height over 25 m, exhibiting continuous spatial distribution, and forming functional links with surrounding land uses. This definition preserves height as a measurable indicator while highlighting cluster characteristics as urban functional units, aligning with the extraction features of “areal targets” in remote sensing images [3].
From a remote sensing interpretation perspective, extracting HRBs remains challenging, particularly when HRBs are represented at the area or regional scale as continuous urban clusters rather than isolated individual structures. Due to their complex three-dimensional configurations and dense spatial aggregation, HRBs are highly susceptible to imaging geometry effects, such as off-nadir viewing and georeferencing uncertainties, which may cause spatial displacement and shape distortion at the regional level. In dense urban environments, height-induced occlusions and building overlaps further result in spectral mixing and blurred area boundaries, while variations in illumination conditions, seasonal changes, and surrounding land cover context introduce substantial intra-class variability across regions and imaging timestamps. These factors collectively complicate the discrimination of HRBs from other urban land cover types, posing significant challenges for robust and transferable area-level extraction methods.
Existing deep learning-based methodologies for HRB extraction remain encumbered by considerable limitations, particularly for models built upon Convolutional Neural Networks (CNNs). Representative CNN-based segmentation models, such as Fully Convolutional Networks (FCNs) [4] and U-Net [5], are paradigmatic examples of this category. As lightweight approaches applied to single-source Sentinel-2 imagery (10 m resolution, nadir observation), FCNs have achieved relatively acceptable extraction accuracy [6,7]. However, their performance is severely constrained by an intrinsic flaw of CNN architectures: the imbalance between local and global feature modeling. CNN operations rely on local receptive fields, which restrict FCNs to capturing only nearby spatial features while failing to effectively model global dependencies between spatially distant regions [8,9,10,11]. This deficiency renders FCNs incapable of comprehending the intrinsic connections between HRBs and their surrounding environments in complex urban scenes, where contextual correlations frequently extend over extensive spatial scales.
To alleviate this limitation of FCNs, more advanced CNN-based segmentation frameworks, such as DeepLabV3+, have been proposed by incorporating atrous convolution and multi-scale context aggregation [12]. While these models improve the capture of contextual information to some extent, their receptive fields remain implicitly constrained by convolutional operations, limiting their ability to model global spatial relationships across large urban regions. More recently, dual-branch architectures that integrate CNNs for local feature extraction and Vision Transformers (ViTs) for global context modeling have been explored, such as BuildFormer [13], achieving breakthroughs in building segmentation within a single region [14,15]. However, despite their success in addressing the local–global modeling imbalance, such architectures still face a significant limitation: they fail to account for the diversity inherent in HRB imagery. HRB imagery exhibits considerable variability in spectral, textural, and spatial characteristics, attributable to cross-regional variations such as differing latitudes, topographies, and climates, as well as cross-timestamp changes such as seasonal shifts in illumination and vegetation coverage. This variability is not fully incorporated into the dual-branch model, which limits its ability to generalize across regions or times.
To overcome the above shortcomings, this study proposes a novel approach, PEDNet, which integrates prior information into the model. Remote sensing images contain inherent prior information, such as imaging time (timestamp) and geographic location, which significantly influence image formation [16,17,18,19]. These factors, including seasonal changes in illumination, regional climate, and architectural styles, shape HRB features and can be embedded into the model as prior knowledge. By incorporating such information, PEDNet enables the model to better capture correlations between contextual factors and HRB features, enhancing segmentation performance across diverse scenarios [20,21,22,23]. PEDNet adopts a dual-path design that combines the strengths of CNNs for capturing fine-grained spatial details and ViTs for modeling global context, while effectively addressing local–global feature relationships and variations in HRB imagery [24,25].
The contributions of this paper can be summarized as follows:
  • This study proposes a prior-aware dual-path segmentation framework, termed PEDNet, for the extraction of HRBs from optical remote sensing imagery. The performance of PEDNet is systematically evaluated and compared with representative baseline models, including classical CNN-based architectures (e.g., FCN), a strong CNN-based segmentation model (DeepLabV3+), and a modern Transformer-based building extraction model (BuildFormer). The results demonstrate that PEDNet achieves superior performance and robustness in complex urban scenarios.
  • In scenarios involving data combination across multiple regions and time periods, the study validates the extraction advantages of the PEDNet model, thereby substantiating its capacity to circumvent the limitations associated with “local models”. The model demonstrates enhanced robustness in HRB extraction tasks across multiple regions and all seasons, providing a feasible basis for a single model to achieve wide geographic span and cross-timestamp HRB extraction.
The paper is structured as follows: Section 2 details the study data, the PEDNet model architecture, and the experimental design. Section 3 presents the extraction results and the accuracy analysis. Section 4 discusses the model extraction results, summarizes the research, and outlines future directions. To facilitate understanding of the proposed framework and related components, a list of abbreviations used in this paper is provided in Table 1.

2. Materials and Methods

2.1. Data

This study selected four provincial capital cities in China, spanning from north to south—Harbin, Beijing, Zhengzhou, and Guangzhou—as the study area (see Figure 1). The cities were chosen for their extensive geographical range—from Harbin near the 45°N latitude line to Guangzhou near the 23°N latitude line—which encompasses a broad climate zone from cold temperate to subtropical regions. This large geographical span leads to notable differences in various aspects, primarily architectural styles and landscape features. Additionally, during the same period, the four cities exhibit significantly different lighting conditions. Harbin, located at a high latitude, experiences a low solar elevation angle and weak sunlight intensity. In contrast, Guangzhou, situated closer to the equator, exhibits a high solar elevation angle and strong sunlight intensity. Beijing and Zhengzhou lie between these extremes. These variations directly affect the spectral characteristics of remote sensing imagery, creating optimal conditions for evaluating the model’s robustness. The variations arise from temporal and spatial dynamics. Temporally, the interplay of solar and satellite geometry plays a crucial role: solar geometry is governed by the Earth’s rotational and orbital positions, leading to shallower sun angles in Harbin and steeper ones in Guangzhou due to axial tilt, while satellite geometry includes off-nadir angles and overpass times that modulate lighting patterns. Spatially, the latitudinal gradient creates inherent differences in solar irradiance regimes, and regional disparities shaped by geography, topography, and climate amplify these effects, thereby reinforcing the robustness of the test conditions. Atmospheric conditions also differ across the four regions, with variations in water vapor and aerosol content at different latitudes, causing more pronounced attenuation and interference of spectral information in the images. These atmospheric variations further amplify the differences in imaging characteristics, collectively creating an ideal testing scenario for verifying the model’s robustness in extracting HRBs under complex regional variation.
To validate the model’s effectiveness across diverse scenarios, this study constructed an HRBs detection dataset encompassing multiple regions and imaging time points. As shown in Table 2 below, the dataset includes Sentinel-2 satellite remote sensing imagery for the following four study cities, along with their corresponding city and region codes: T49QGF (Guangzhou), T49SGU (Zhengzhou), T50TMK (Beijing), and T51TYL (Harbin), respectively. All images are 512 × 512 pixel RGB band TIFF files with a spatial resolution of 10 m. They utilize the blue band (B2), green band (B3), and red band (B4). The image file naming format follows the convention “City ID_Imaging Timestamp_Region ID” (e.g., T49QGF_20180115_HRBs_002), which enables systematic organization and retrieval of spatial and temporal information. The selected images exhibit high imaging quality and cover representative acquisition dates across all four seasons. All baseline images were acquired from Sentinel-2 surface reflectance products processed by the Sen2Cor algorithm, and scenes with significant cloud or cloud-shadow coverage were excluded during data acquisition. As a result, no explicit cloud masking was required. Temporal information (imaging timestamp) and spatial identifiers (city ID) were treated as image-level metadata and uniformly associated with each image patch, rather than being defined at the individual pixel level. Consequently, all input channels share identical valid spatial regions, and no cross-dimensional inconsistency in masked areas occurs. Ground truth HRB labels were generated based on a height-driven criterion, where buildings with an average height exceeding 25 m were classified as high-rise structures and encoded as pixel-wise segmentation masks. This threshold follows commonly adopted national building standards in China and has been consistently adopted in our previous related studies on high-rise building area extraction using the same dataset. To further enhance labeling reliability, the identified HRB regions were verified using auxiliary visual references, including ground-level photographs and publicly available street-view imagery. The dataset covers most sampling areas in the four cities and includes multiple imaging time points. For each city, independent areas with distinct imaging dates were selected for testing to ensure no overlap with the training data, thereby enabling a rigorous evaluation of the model’s generalization capability across different spatial and temporal scenarios. This data organization balances detection accuracy and computational efficiency, laying a solid foundation for validating the robustness of the proposed method.
Figure 2 shows high-resolution satellite images from different locations in the four study cities, along with the corresponding HRBs annotation results. Each sample covers a ground area of approximately 5 km × 5 km. The spatial and timestamp diversity of the samples provides a representative basis for training the model and verifying its performance.

2.2. PEDNet

2.2.1. Network Architecture

This study designed the PEDNet model to robustly extract HRBs. The model’s structure is shown in Figure 3. The model’s encoder has two parallel feature extraction paths. One is the ViT feature path, which focuses on global semantic feature extraction. This path uses the Prior-Embedded Window Attention (PEWA) structure within the Prior-Embedded Block (PEBlock) to capture prior information. This enables the model to effectively capture long-range feature associations across regions and time phases. The second path is a convolutional feature path. This path uses the local receptive field advantage of convolutional neural networks to focus on extracting fine-grained spatial details and local semantic information. These two paths operate in tandem to extract remote sensing image features from global and local perspectives. Ultimately, the decoder fuses the two feature streams to output the HRB extraction results.
The model accepts 512 × 512 image blocks with a spatial resolution of 10 m as input. Each input sample consists of three RGB channels, two timestamp information channels, and one spatial information channel. The timestamp channels encode the imaging year and day of year (DOY), while the spatial channel represents the corresponding city identifier. Timestamp and spatial information are treated as image-level metadata and are broadcast to constant-valued 512 × 512 feature maps to match the spatial resolution of the input image. These feature maps are concatenated with the RGB channels to form the final input tensor, such that all pixels within the same image patch share identical timestamps and spatial priors. The combined inputs are then processed by PEDNet to extract multi-scale feature representations.
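As a minimal sketch, the input construction described above could be implemented as follows in PyTorch; the helper name build_input_tensor is illustrative and not part of the released code.

```python
# Minimal sketch (assumed implementation, not the authors' released code) of how
# image-level priors can be broadcast to constant-valued channels and stacked
# with the RGB bands to form the 6-channel input described above.
import torch

def build_input_tensor(rgb, year, doy, city_idx):
    """rgb: (3, 512, 512) float tensor; year, doy, city_idx: scalar priors."""
    _, h, w = rgb.shape
    year_ch = torch.full((1, h, w), float(year))      # timestamp channel 1: imaging year
    doy_ch = torch.full((1, h, w), float(doy))        # timestamp channel 2: day of year
    city_ch = torch.full((1, h, w), float(city_idx))  # spatial channel: city identifier
    # All pixels of a patch share the same priors, so the metadata channels are constant maps.
    return torch.cat([rgb, year_ch, doy_ch, city_ch], dim=0)  # (6, 512, 512)

x = build_input_tensor(torch.rand(3, 512, 512), year=2018, doy=15, city_idx=0)
```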
During data loading, the City ID and imaging timestamp are extracted from the image file name. The imaging timestamp is first interpreted as a calendar date and then converted into the corresponding year and day of year (Julian day) to provide a season-aware temporal representation. The City ID is used as a discrete spatial identifier, mapped to an integer index, and transformed into a continuous tensor via a learnable embedding layer. The year and day of year are encoded by separate learnable embedding layers and combined to form a unified time embedding. Inside the network, the spatial embedding derived from the City ID and the time embedding derived from the imaging timestamp are propagated alongside the visual features and incorporated into the attention learning process within each Prior-Embedded Block. This enables the model to adaptively capture region- and time-dependent characteristics during feature learning.
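The following sketch illustrates, under the assumption that the file-name convention described in Section 2.1 is used and that the embedding dimension is 32, how the City ID and imaging timestamp could be parsed and mapped to learnable spatial and time embeddings; the class and helper names are hypothetical.

```python
# Illustrative sketch (assumed helper names, not the released code) of parsing the
# "CityID_Timestamp_HRBs_RegionID" convention and embedding the priors.
import datetime
import torch
import torch.nn as nn

CITY_TO_INDEX = {"T49QGF": 0, "T49SGU": 1, "T50TMK": 2, "T51TYL": 3}

def parse_priors(filename):
    """e.g. 'T49QGF_20180115_HRBs_002' -> (city index, year, day of year)."""
    city, stamp = filename.split("_")[:2]
    date = datetime.datetime.strptime(stamp, "%Y%m%d")
    return CITY_TO_INDEX[city], date.year, date.timetuple().tm_yday

class PriorEmbedding(nn.Module):
    def __init__(self, dim=32, num_cities=4, num_years=16, base_year=2015):
        super().__init__()
        self.base_year = base_year
        self.city_emb = nn.Embedding(num_cities, dim)  # spatial embedding (SE)
        self.year_emb = nn.Embedding(num_years, dim)   # time embedding: year
        self.doy_emb = nn.Embedding(367, dim)          # time embedding: day of year

    def forward(self, city_idx, year, doy):
        se = self.city_emb(city_idx)
        te = self.year_emb(year - self.base_year) + self.doy_emb(doy)  # unified time embedding
        return se, te

city, year, doy = parse_priors("T49QGF_20180115_HRBs_002")
se, te = PriorEmbedding()(torch.tensor([city]), torch.tensor([year]), torch.tensor([doy]))
```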
The ViT feature path includes a Stem module and four Stage modules. The Stem module performs preliminary feature extraction on the input image through two convolution operations. Each Stage then progressively refines the features using a multi-layer PEBlock structure that integrates the Prior-Embedded Window Attention (PEWA) mechanism and a Multi-Layer Perceptron (MLP). While various types of prior information exist, this study focuses on learning from the prior information of image spatial location and imaging timestamp. Specifically, PEWA learns this prior information by combining spatial encoding from the Spatial Embedding (SE), which encodes the City ID, and time encoding from the Time Embedding (TE), which encodes the imaging timestamp. Stages 2 through 4 also use patch merging to reduce spatial resolution and increase channel dimensionality, while incorporating the spatial and timestamp encodings as attention learning information to enhance feature expression.
Figure 4 below shows the prior-embedded window attention structure diagram. PEWA’s design closely aligns with the core requirements for extracting HRBs from remote sensing imagery, and its advantages are particularly evident in complex scenarios. Traditional attention mechanisms often suffer from spectral imaging feature confusion when processing remote sensing data because they ignore geographical spatial differences and imaging timestamp variations. PEWA addresses this issue by using spatial and timestamp dual-bias encoding to create a triple learnable feature consisting of “feature similarity + spatial prior + timestamp features”. This enables the model to selectively enhance the discriminative power of building features.
From a mechanism implementation perspective, the core process of PEWA consists of four steps: feature mapping generation, spatial and timestamp encoding fusion, similarity calculation, and attention output. First, a 1 × 1 convolution is used to generate the $Q$, $K$, and $V$ matrices from the input feature map $X \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width, respectively. This lays the foundation for attention calculation, as shown in the formula:
$$Q = \mathrm{Conv}_{1\times1}(X), \quad K = \mathrm{Conv}_{1\times1}(X), \quad V = \mathrm{Conv}_{1\times1}(X), \tag{1}$$
where $Q, K, V \in \mathbb{R}^{C \times H \times W}$ preserve the spatial structure and channel information of the input features.
Secondly, to incorporate spatial and timestamp prior information, PEWA generates spatial and timestamp encodings through SE and TE. Spatial tokens project discrete regional identifiers (such as city codes) onto matching dimensions through an embedding layer and a 3 × 3 convolution, as shown in the formula:
$$\mathrm{SEProj}(r) = \mathrm{Tanh}\left(\mathrm{Conv}_{3\times3}\left(\mathrm{SpatialEmbedding}_{region}(r)\right)\right), \tag{2}$$
Similarly, the time tokens convert timestamps (e.g., dates) into periodic features and project them, as shown in the formula:
$$\mathrm{TEProj}(t) = \mathrm{Tanh}\left(\mathrm{Conv}_{3\times3}\left(\mathrm{TimeEmbedding}_{time}(t)\right)\right), \tag{3}$$
The outputs of the two branches are superimposed to form the spatial and timestamp attention bias, which is used to post-modulate the attention results. The formula is as follows:
$$\mathrm{Attention}_{\mathrm{STBias}} = \mathrm{SEProj}(r) + \mathrm{TEProj}(t), \tag{4}$$
Furthermore, PEWA uses 16 × 16 window partitioning to balance building-scale coverage and background interference. To address the computational bottleneck of traditional self-attention, whose complexity is $O(N^2)$ (with $N = 256$ being the number of pixels per window), L2 normalization and a Taylor series approximation of the softmax function are used to linearize the weight calculations. The similarity between the $i$-th query $q_i \in \mathbb{R}^{C}$ and the $j$-th key $k_j \in \mathbb{R}^{C}$ within the window is defined as shown in the formula:
$$\mathrm{sim}(q_i, k_j) = 1 + \left(\frac{q_i}{\lVert q_i \rVert_2}\right)^{T} \left(\frac{k_j}{\lVert k_j \rVert_2}\right), \tag{5}$$
where $\lVert \cdot \rVert_2$ is the L2 norm, ensuring the validity of the Taylor approximation ($\exp(x) \approx 1 + x$).
Then, the attention output is the weighted sum of the value features, as shown in the formula:
$$\mathrm{Attention}(Q, K, V)_i = \frac{\sum_{j=1}^{N} \mathrm{sim}(q_i, k_j)\, v_j}{\sum_{j=1}^{N} \mathrm{sim}(q_i, k_j)}, \tag{6}$$
Finally, the attention modulation is completed and the spatial- and timestamp-modulated attention is output as follows:
$$\mathrm{Attention}_{\mathrm{STE}} = \mathrm{Attention}(Q, K, V) + \mathrm{Attention}_{\mathrm{STBias}}, \tag{7}$$
This formulation reduces the complexity to $O(N)$, improving the efficiency of the 16 × 16 window calculations severalfold. It is not only suitable for efficiently processing high-resolution remote sensing images, but also supports the model’s ability to accurately extract HRBs from remote sensing images across space and time within a large window range. This is achieved through the spatial and timestamp encodings, which strengthen the model’s ability to learn HRB characteristics in complex scenarios.
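To make the mechanism concrete, the following simplified PyTorch sketch reproduces Equations (1)–(7): Q/K/V generation by 1 × 1 convolutions, 16 × 16 window partitioning, the L2-normalized Taylor-linearized similarity, and the additive spatial–timestamp bias. Layer names, the prior embedding dimension, and the way the bias is broadcast over the feature map are assumptions rather than the authors’ exact implementation.

```python
# Simplified sketch of the PEWA computation (Equations (1)-(7)); assumed layer
# names and shapes, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEWA(nn.Module):
    def __init__(self, dim, prior_dim=32, window=16):
        super().__init__()
        self.window = window
        self.q = nn.Conv2d(dim, dim, 1)   # Eq. (1): 1x1 convolutions generate Q, K, V
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # SEProj / TEProj: prior embedding -> 3x3 conv -> tanh (Eqs. (2)-(3))
        self.se_proj = nn.Sequential(nn.Conv2d(prior_dim, dim, 3, padding=1), nn.Tanh())
        self.te_proj = nn.Sequential(nn.Conv2d(prior_dim, dim, 3, padding=1), nn.Tanh())

    def forward(self, x, se, te):
        """x: (B, C, H, W) features; se, te: (B, prior_dim) image-level prior embeddings."""
        B, C, H, W = x.shape
        w = self.window
        q, k, v = self.q(x), self.k(x), self.v(x)

        def to_windows(t):  # (B, C, H, W) -> (B * num_windows, w*w, C)
            t = t.reshape(B, C, H // w, w, W // w, w).permute(0, 2, 4, 3, 5, 1)
            return t.reshape(-1, w * w, C)

        q, k, v = to_windows(q), to_windows(k), to_windows(v)
        q = F.normalize(q, dim=-1)                       # L2-normalised queries
        k = F.normalize(k, dim=-1)                       # L2-normalised keys
        sim = 1.0 + q @ k.transpose(-2, -1)              # Taylor-linearised similarity, Eq. (5)
        # Eq. (6); the sums can be factorised to obtain the O(N) form discussed in the
        # text, the explicit matrix form is kept here only for readability.
        attn = (sim @ v) / sim.sum(dim=-1, keepdim=True)

        # fold windows back to (B, C, H, W)
        attn = attn.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        attn = attn.reshape(B, C, H, W)

        # spatial + timestamp attention bias broadcast over the map, Eqs. (4) and (7)
        se_map = se[:, :, None, None].expand(-1, -1, H, W)
        te_map = te[:, :, None, None].expand(-1, -1, H, W)
        return attn + self.se_proj(se_map) + self.te_proj(te_map)

out = PEWA(dim=64)(torch.rand(1, 64, 64, 64), torch.rand(1, 32), torch.rand(1, 32))
```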
In PEBlock, spatial and timestamp features are learned through PEWA. During the Patch Merging process from Stage 2 to Stage 4, spatial and time tokens work collaboratively. Spatial encoding provides prior information by encoding spatial attributes (e.g., regional encoding of Harbin and Guangzhou), while timestamp embeddings adjust dynamic weights based on features corresponding to different imaging timestamps. These two processes jointly form a spatial and timestamp dual-bias mechanism, enabling PEBlock to adapt to spatial heterogeneity across regions while capturing features corresponding to different imaging timestamps within the same region. This enhances the encoder’s ability to learn features from complex scenes at multiple scales. Finally, the encoder takes multi-scale features from the four stages as input for the decoder.
Meanwhile, the convolutional feature path extracts low-level details through three ConvBNAct operations: the first uses a 3 × 3 convolution (stride 2) to reduce the input to 256 × 256; the second uses the same 3 × 3 convolution (stride 2) to reduce it to 128 × 128; and the third 3 × 3 convolution (stride 1) maintains the size and adjusts it to the decoder dimension (e.g., 384 dimensions). Finally, the fused features from the Feature Pyramid Network (FPN) output are added element-wise, and high-frequency details are supplemented via convolution.
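A minimal sketch of this detail path is given below, assuming a 6-channel input and intermediate widths of 64 and 128; only the final 384-dimensional projection is stated in the text, so the other widths are illustrative.

```python
# Sketch (assumed widths) of the convolutional detail path: three ConvBNAct blocks
# that downsample 512 -> 256 -> 128 and project to the decoder dimension.
import torch
import torch.nn as nn

def conv_bn_act(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

conv_path = nn.Sequential(
    conv_bn_act(6, 64, stride=2),     # 512 x 512 -> 256 x 256
    conv_bn_act(64, 128, stride=2),   # 256 x 256 -> 128 x 128
    conv_bn_act(128, 384, stride=1),  # keep 128 x 128, match the decoder width (384)
)

detail = conv_path(torch.rand(1, 6, 512, 512))  # -> (1, 384, 128, 128)
```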
The decoder primarily uses FPN to achieve feature fusion and upsampling, restoring the HRB label image from the encoded features. FPN unifies the dimensions of features at various scales, then performs cross-scale feature fusion, followed by upsampling operations to further process features at each layer, ultimately fusing them into a high-resolution feature map. The FPN-output features are fused with the features from the convolutional feature path to enhance the spatial details of label recovery, employing a refined dual-branch feature complementarity strategy.
The fused features from each decoding stage are processed and finally output through the convolutional operation of the head module with a feature dimension corresponding to the number of categories (in our case, two: HRBs and others). The upsampling in the decoder primarily uses bilinear interpolation to restore feature map dimensions. As the final layer, the softmax function converts the decoded features into probabilities for HRBs and other categories, and the argmax function selects the label with the highest probability for each pixel, yielding a pixel-wise mapping of the HRBs mask, accurately recovering the HRB label image from the encoded features.

2.2.2. Loss Function

When extracting HRBs from remote sensing images, challenges such as loss of edge details and difficulty maintaining geometric structures are encountered. Additionally, there is an imbalance in the distribution of land cover types within the image: HRBs, the extraction target, typically occupy a small proportion of the image and belong to the minority category, while other land cover types constitute the majority. This significant difference in category distribution causes the model to favor the majority category during training, which severely affects the accuracy with which HRBs are recognized. To address this issue and improve recognition accuracy, this study adopts a strategy combining multiple loss functions. The model is designed with a “total loss = geometry constraint loss + edge loss” structure to achieve multidimensional optimization, where the edge loss is decomposed into a soft cross-entropy–dice joint loss, a binary cross-entropy loss, and a focal loss. Specifically, the geometry constraint loss, cross-entropy–dice joint loss, binary cross-entropy loss, and focal loss are combined by adding them together with certain weights. This approach leverages the advantages of each loss function to guide the model to focus more on the minority HRB category. The mathematical expression for the combined loss is as follows:
$$L_{total} = L_{geometry} + L_{edge}, \tag{8}$$
where $L_{geometry}$ is the geometry constraint loss and $L_{edge}$ is the edge loss. The edge loss is further decomposed as follows:
$$L_{edge} = L_{scd} + L_{bce} + L_{focal}, \tag{9}$$
where $L_{scd}$ is the soft cross-entropy–dice joint loss, $L_{bce}$ is the binary cross-entropy loss, and $L_{focal}$ is the focal loss.
During the experimental implementation phase, the total loss is calculated using a layered weighted fusion strategy. This strategy adjusts the weighting coefficients of each loss in order to balance the optimization objectives of “geometric structure preservation” and “category imbalance mitigation”. In the experiments, $L_{te}$ denotes the total experimental loss, and its mathematical expression is as follows:
$$L_{te} = \alpha L_{geometry} + \beta L_{scd} + \gamma L_{bce} + \delta L_{focal}, \tag{10}$$
where $\alpha$, $\beta$, $\gamma$, and $\delta$ are weighting parameters whose specific values are given in the experimental setup (Section 2.4).
Geometric constraint loss is a regularization term that enhances the continuity of the model’s output space. It is often used in tasks requiring spatial structure consistency, such as semantic segmentation. By penalizing drastic changes in predicted values between adjacent pixels, geometric constraint loss guides the model to produce smoother images that align with the spatial distribution of natural images. Its mathematical expression is:
$$L_{geometry} = \frac{1}{N}\left(\sum_{i,j}\left|\nabla_x \hat{Y}_{i,j}\right| + \sum_{i,j}\left|\nabla_y \hat{Y}_{i,j}\right|\right), \tag{11}$$
where $\hat{Y}_{i,j}$ is the predicted probability that pixel $(i, j)$ belongs to the HRBs, and $\nabla_x$ and $\nabla_y$ are the horizontal and vertical gradient operators used to measure differences between adjacent pixels.
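For illustration, Equation (11) can be implemented as a total-variation-style penalty on the predicted probability map; the function below is a sketch with hypothetical names, not the released code.

```python
# Sketch of the geometry constraint loss in Equation (11).
import torch

def geometry_loss(prob):
    """prob: (B, H, W) predicted HRB probabilities, i.e. Y-hat in Equation (11)."""
    dx = (prob[:, :, 1:] - prob[:, :, :-1]).abs()  # horizontal differences |grad_x Y-hat|
    dy = (prob[:, 1:, :] - prob[:, :-1, :]).abs()  # vertical differences |grad_y Y-hat|
    return (dx.sum() + dy.sum()) / prob.numel()    # averaged over the pixels

loss = geometry_loss(torch.rand(2, 512, 512))
```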
The Soft Cross-Entropy–Dice joint loss function is composed of two classic, complementary loss functions: Soft Cross-Entropy Loss and Dice Loss. This combination balances pixel-level classification accuracy and overall structural similarity. Its mathematical expression is:
$$L_{scd} = L_{softCE} + L_{dice}, \tag{12}$$
Soft Cross-Entropy Loss is a smoothed version of the standard cross-entropy loss. It introduces a small smoothing factor during calculation to mitigate the impact of label noise or mislabeling and improve the model’s sensitivity to rare categories. It penalizes the difference between the predicted probability distribution and the true labels, prompting the model to output a distribution closer to the true category distribution. Its mathematical expression is:
$$L_{softCE} = -\sum_{i=1}^{N} \tilde{y}_i \log(p_i), \tag{13}$$
where $\tilde{y}_i = (1 - 0.05)\, y_i + \frac{0.05}{2}$ ($y_i$ is the true label and 0.05 is the smoothing factor), $p_i$ is the predicted probability, and $N$ is the total number of pixels.
Dice Loss is a loss function commonly used in image segmentation tasks. It is designed based on the Dice coefficient and aims to maximize the overlap between the predicted region and the true label region. It is particularly suitable for situations where the foreground and background categories are imbalanced, and can effectively improve the model’s ability to recognize small objects or rare categories. Its formula is:
$$L_{dice} = 1 - \frac{2\sum_{i=1}^{N} p_i y_i + 0.05}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} y_i^2 + 0.05}, \tag{14}$$
where 0.05 is the smoothing factor that prevents the denominator from equaling zero.
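A compact sketch of the soft cross-entropy and Dice terms of Equations (13) and (14) follows; the per-pixel sum of Equation (13) is averaged here to keep the magnitude independent of image size, which is an implementation choice of this sketch rather than part of the definition.

```python
# Illustrative implementation of the soft cross-entropy and Dice terms
# (Equations (13)-(14)); the 0.05 smoothing constant follows the text.
import torch

def soft_ce_dice(prob, target, smooth=0.05):
    """prob, target: (N,) flattened HRB probabilities and binary (0/1) labels."""
    y_smooth = (1 - smooth) * target + smooth / 2                   # label smoothing
    soft_ce = -(y_smooth * torch.log(prob.clamp_min(1e-7))).mean()  # averaged sum of Eq. (13)
    dice = 1 - (2 * (prob * target).sum() + smooth) / (
        (prob ** 2).sum() + (target ** 2).sum() + smooth)           # Dice loss, Eq. (14)
    return soft_ce + dice

l_scd = soft_ce_dice(torch.rand(512 * 512), torch.randint(0, 2, (512 * 512,)).float())
```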
Binary cross-entropy loss calculates the difference between the model’s predicted and actual boundary maps. This guides the model to better learn object edge information. This loss function strongly constrains the accuracy of boundary area predictions, improving the clarity and structural integrity of segmentation results. Its mathematical expression is:
$$L_{bce} = -\frac{1}{N}\sum_{i=1}^{N} \left[ E_Y(i)\log\left(E_{\hat{Y}}(i)\right) + \left(1 - E_Y(i)\right)\log\left(1 - E_{\hat{Y}}(i)\right) \right], \tag{15}$$
where $E_Y$ and $E_{\hat{Y}}$ are the soft boundaries extracted from the true labels and the predicted probability maps, respectively, with the mathematical expression as follows:
$$E_Y = \sigma\left(\mathrm{Laplacian}(Y)\right), \quad E_{\hat{Y}} = \sigma\left(\mathrm{Laplacian}(\hat{Y})\right), \tag{16}$$
where $\mathrm{Laplacian}$ is the Laplace edge operator (a 3 × 3 convolution kernel), and $\sigma$ is the sigmoid function, which maps the edge response to [0, 1].
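The edge-aware term of Equations (15) and (16) can be sketched as follows, using a standard 3 × 3 Laplacian kernel; the exact kernel used by the authors is not specified, so this choice is an assumption.

```python
# Sketch of the edge-aware BCE term (Equations (15)-(16)): soft boundaries are
# obtained with a 3x3 Laplacian followed by a sigmoid, then compared with BCE.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)  # assumed 3x3 Laplacian kernel

def soft_edges(mask):
    """mask: (B, 1, H, W) label or probability map -> soft edge map in (0, 1)."""
    return torch.sigmoid(F.conv2d(mask, LAPLACIAN, padding=1))

def edge_bce(prob, target):
    return F.binary_cross_entropy(soft_edges(prob), soft_edges(target))

l_bce = edge_bce(torch.rand(1, 1, 512, 512), torch.randint(0, 2, (1, 1, 512, 512)).float())
```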
Focal loss is effective in highly imbalanced datasets. It prevents the model from overemphasizing the majority class, thereby improving the detection and segmentation performance of the minority class. In the task of extracting high-rise buildings, focal loss focuses on pixels of HRBs that are difficult to classify by reducing the weight of “easily classified background pixels”. The mathematical formula is:
$$L_{focal} = -\frac{1}{N}\sum_{i=1}^{N} \alpha_t \left(1 - p_t\right)^{\gamma} \log(p_t), \tag{17}$$
where $p_t$ is the probability that a pixel belongs to its true class (for the positive class, HRBs, $p_t = p$; for the negative class, background, $p_t = 1 - p$); $\alpha_t = 0.25$ is the positive-class weight, balancing the ratio of positive and negative samples; and $\gamma = 2.0$ controls the degree to which hard-sample weights are amplified.
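Putting the pieces together, a sketch of the focal term (Equation (17)) and the weighted total loss of Equation (10) follows; the helper functions refer to the sketches above, and the weights correspond to those reported in Section 2.4. The convention of applying the complement weight to the background class is a common assumption, not stated explicitly in the text.

```python
# Sketch of the focal loss (Equation (17)) and the weighted total loss (Equation (10)).
import torch

def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """prob, target: (N,) HRB probabilities and binary (0/1) labels."""
    p_t = prob * target + (1 - prob) * (1 - target)        # probability of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)  # alpha for HRBs, 1 - alpha for background
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-7))).mean()

def total_loss(l_geometry, l_scd, l_bce, l_focal,
               alpha=0.7, beta=1.0, gamma=10.0, delta=1.1):
    # weights from Section 2.4 (alpha = 0.7, beta = 1.0, gamma = 10.0, delta = 1.1)
    return alpha * l_geometry + beta * l_scd + gamma * l_bce + delta * l_focal
```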

2.3. Evaluation Metrics

This study uses the $F_1$ score, intersection over union (IoU), and overall accuracy (OA) as evaluation metrics for the model’s HRB extraction performance. These metrics are widely used in remote sensing image segmentation and target extraction tasks, with the following definitions:
The $F_1$ score is the harmonic mean of precision and recall and is used to evaluate the model’s accuracy and completeness in identifying the HRB category. The formula is:
$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \tag{18}$$
where the precision formula is:
$$Precision = \frac{TP}{TP + FP}, \tag{19}$$
and the recall formula is:
$$Recall = \frac{TP}{TP + FN}, \tag{20}$$
Intersection over union (IoU) evaluates segmentation accuracy by calculating the overlap between the predicted results and the true labels using the following formula:
$$IoU = \frac{TP}{TP + FN + FP}, \tag{21}$$
Overall accuracy (OA) reflects the proportion of correctly classified pixels and is calculated as follows:
$$OA = \frac{TP + TN}{TP + TN + FN + FP}, \tag{22}$$
In the above formulas, $TP$ (true positive) is the number of pixels correctly predicted as HRBs, $FP$ (false positive) is the number of non-HRB pixels misclassified as HRBs, $FN$ (false negative) is the number of HRB pixels misclassified as non-HRBs, and $TN$ (true negative) is the number of pixels correctly predicted as non-HRBs. All quantities are obtained from the confusion matrix.
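These metrics can be computed directly from confusion-matrix counts, as in the following sketch; the small epsilon guards against division by zero and is not part of the definitions.

```python
# Sketch computing F1, IoU, and OA (Equations (18)-(22)) from flattened binary arrays.
import numpy as np

def hrb_metrics(pred, label, eps=1e-7):
    tp = np.sum((pred == 1) & (label == 1))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    tn = np.sum((pred == 0) & (label == 0))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fn + fp + eps)
    oa = (tp + tn) / (tp + tn + fn + fp)
    return f1, iou, oa

pred = np.random.randint(0, 2, (512, 512)).ravel()
label = np.random.randint(0, 2, (512, 512)).ravel()
print(hrb_metrics(pred, label))
```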

2.4. Experimental Design

This study validated model performance using the same dataset under a unified experimental setup. All experiments were conducted in a 64-bit Ubuntu 20.04.1 LTS environment equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB memory) and CUDA 12.4, with all models implemented using PyTorch 2.4.0 and Python 3.8.19. All models were trained for 70 epochs using the AdamW optimizer with an initial learning rate of 1 × 10⁻³ and a weight decay of 0.0025. A CosineAnnealingWarmRestarts scheduler was employed to dynamically adjust the learning rate during training. The batch size was set to 1 for both training and validation. To ensure fair comparisons and to eliminate the impact of training differences on performance, the optimizer, loss formulation, and training strategy were kept identical across all models, which were trained on this hardware platform using the remote sensing dataset containing images from multiple regions and time periods. Specifically, the total loss in its weighted form, given by Equation (10), was employed with α = 0.7, β = 1.0, γ = 10.0, and δ = 1.1. These weights were determined empirically through preliminary experiments and are introduced to balance region-level supervision and boundary-aware constraints during optimization. The weighted loss primarily serves as an auxiliary mechanism to stabilize training rather than as a performance-critical hyperparameter.
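A minimal sketch of this training configuration is shown below; the restart period T_0 of the scheduler and the placeholder network are assumptions, since they are not reported in the text.

```python
# Sketch of the stated training setup: AdamW (lr = 1e-3, weight decay = 0.0025),
# CosineAnnealingWarmRestarts, 70 epochs, batch size 1; T_0 is an assumed value.
import torch

model = torch.nn.Conv2d(6, 2, 1)  # placeholder network standing in for PEDNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0025)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(70):
    # ... iterate over the training loader, compute the weighted total loss, backpropagate ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```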
The dataset follows a unified naming convention, “City ID_Imaging Timestamp_Region ID”. For example, “T49QGF_20180115_HRBs_001” corresponds to City T49QGF, acquired on 15 January 2018, with Region ID 001. The dataset includes samples from four cities (T49QGF, T49SGU, T50TMK, and T51TYL), each containing annotated regions from multiple imaging timestamps and spatial locations. In total, 60 image samples were constructed across all cities, covering diverse urban environments and seasonal conditions.
To validate the model’s generalization ability to unseen scenes within the same city, a spatial and timestamp non-overlap partitioning strategy was adopted. For each city, combinations of imaging timestamps and region IDs were split into training, validation, and testing sets with strict non-overlap in both temporal (e.g., 15 January 2018 vs. 11 March 2018) and spatial (e.g., 001 vs. 002) dimensions. Overall, 36 samples were used for training, 12 for validation, and 12 for testing. For example, the training set includes samples such as T49QGF_20180115_001 and T49QGF_20180311_002, while the test set contains previously unseen samples such as T49QGF_20181219_003.
To objectively validate model effectiveness, all participating models—including PEDNet and its ablation variants, U-Net, FCN, DeepLabV3+, and BuildFormer—were trained and tested on this dataset using identical training and test splits. The training set enabled the learning of HRB features from known scenes across cities, while the test set evaluated each model’s ability to predict unseen scenes in terms of both spatial regions and imaging timestamps. This design ensured that performance differences between models stemmed solely from architectural variations rather than data or training discrepancies. U-Net, FCN, DeepLabV3+, and BuildFormer were implemented in their standard forms and trained using image features only, as none of these models explicitly incorporates mechanisms for learning spatial or timestamp priors. In contrast, PEDNet is explicitly designed to incorporate and model such prior information through its prior-aware embedding architecture. During inference, overlapping image blocks were adopted, and predictions in overlapping regions were fused during the mosaicking process to alleviate boundary artifacts and ensure spatial continuity in the final HRBs segmentation maps.
The experiments were divided into two parts. First, ablation experiments on the PEDNet model analyzed the impact of the SE and TE modules on the PEWA mechanism by enabling or disabling these components and comparing F1 score, IoU, and OA across different configurations. Second, comparative experiments quantitatively evaluated PEDNet against classical CNN-based models (U-Net and FCN), a high-performance CNN-based segmentation model (DeepLabV3+), and a recent CNN–Transformer hybrid architecture (BuildFormer) on the same dataset. These experiments collectively validate the advantages of PEDNet’s prior-aware dual-path structure for HRB extraction tasks under cross-regional and cross-timestamp scenarios.

3. Results and Analysis

3.1. Results of PEDNet Model

To validate the role of SE and TE as built-in PEWA mechanisms, this study designed four ablation experiments based on the PEDNet model. All four configurations were trained and tested on the same training and test data. The focus was on analyzing the impact of SE and TE on feature extraction by controlling their switch states within the PEWA module:
  • PEDNet_Base: Disables the SE and TE mechanisms in PEWA, retaining only the basic attention structure. PEWA then relies solely on the raw spectral and spatial features of pixels within the window for self-attention calculations. Feature dimension reduction is achieved through multiple layers of PEBlock (including the PEWA base module and MLP) and PatchMerging without introducing spatial or timestamp information.
  • PEDNet_SE: Enables the SE mechanism in PEWA and disables the TE mechanism. Specifically, the SE module converts regional input information (e.g., city codes) into high-dimensional feature vectors. These vectors are then mapped by the Spatial Projection Layer (Spatial Proj) into spatial attention corresponding to the number of attention heads. These vectors participate in the calculation of attention weights within the window, enabling the model to learn regional spatial features.
  • PEDNet_TE: Enables the TE mechanism in PEWA while disabling SE. The TE module converts timestamp information (year and Julian day) into high-dimensional feature vectors. These vectors are then mapped by the Temporal Projection Layer (Temporal Proj) to generate temporal attention. This enables timestamp information to participate in PEWA’s attention modulation learning.
  • PEDNet_SE + TE: Enables both the SE and TE mechanisms in PEWA. Both mechanisms act as dual biases that collaborate in the attention calculation process. Spatial and timestamp information are learned in parallel during the window attention calculation of PEWA.
All experiments are conducted on the same dataset and hardware environment. Comparing the differences in F1 scores, IoU, and OA metrics among the four model groups quantifies the individual and combined effects of SE and TE as built-in PEWA mechanisms. In the PEDNet model ablation experiments, all configurations use the same base parameters. The window size is set to 16; the number of attention heads is 4, 8, 16, and 32 in order of stage; the MLP ratio is 4; and the dropout rate is 0.1. All models are trained for the same number of iterations. Four experimental models were constructed by controlling the on/off states of SE and TE in PEWA: PEDNet_Base, PEDNet_SE, PEDNet_TE, and PEDNet_SE + TE. The performance metrics of these models under different configurations are shown in Table 3.
As shown in Table 3, PEDNet_SE achieves the highest performance across all metrics, with F1, IoU, and OA values of 62.8%, 45.8%, and 91.3%, respectively. These results suggest that enabling only the SE mechanism in PEWA is the optimal configuration under the current settings. PEDNet_TE is the second-best performing network, with F1, IoU, and OA values of 61.4%, 44.3%, and 91.2%, respectively, suggesting that the model achieves good overall classification accuracy with the TE mechanism alone. PEDNet_SE + TE’s relatively low F1 and IoU values suggest that enabling both the SE and TE mechanisms simultaneously in the current setup may result in feature interference or insufficient synergy, which could reduce the model’s predictive capability for small objects and categories with few samples. PEDNet_Base is the worst-performing network, with metrics lower than those of the other configurations, which further validates the effectiveness of PEWA in enhancing model performance when the SE and TE mechanisms are enabled. Overall, the PEDNet_SE model, which enables only the SE mechanism, performed best in this ablation experiment.
To further validate the model’s segmentation performance in real-world scenarios, two typical samples were selected for visual comparison. The T49SGU_20181229_HRBs_003 image, captured in Zhengzhou at approximately 34°N, was used for the winter imaging analysis, and the T51TYL_20180918_HRBs_002 image, obtained in Harbin at approximately 45°N, was used for the autumn imaging analysis. Because of the differences in both latitude and season, the two samples exhibit divergent imaging characteristics. Zhengzhou experiences elevated levels of particulate matter in winter due to the influx of cold air, resulting in diminished image contrast. Harbin, by contrast, has higher atmospheric humidity from water vapor in autumn along with lower particulate matter concentrations, producing a slight fogging effect in parts of the scene. Furthermore, there is a clear disparity in illumination: the solar altitude angle in Zhengzhou during winter is approximately 30°, while in Harbin during autumn the illumination is more oblique owing to its higher latitude, resulting in short, straight shadows in the Zhengzhou sample and elongated, diffuse shadows in the Harbin sample. These imaging conditions, shaped by geographical location and timestamp differences, offer an ideal visual basis for validating the segmentation robustness of the model in complex imaging environments. The visualization results demonstrate that the PEDNet series models exhibit significant advantages in feature extraction and detail preservation.
Figure 5 shows a comparison of the test image results from the T49SGU_20181229_HRBs_003 (Zhengzhou (T49SGU) Region 3) ablation experiment. The visualization results show that the PEDNet_Base model significantly underperforms compared to other models that incorporate embedding mechanisms. Its segmentation results demonstrate poor integrity and accuracy in the HRB regions. The PEDNet_SE model demonstrates outstanding performance in capturing HRBs. It accurately and comprehensively identifies HRBs and showcases strong feature extraction and target recognition capabilities. The PEDNet_TE and PEDNet_SE + TE models capture a similar number of HRBs, but the PEDNet_SE + TE model handles details slightly worse than the PEDNet_TE model and has some difficulty capturing subtle HRB features. This may be due to interference caused by the simultaneous application of the two embedding mechanisms, affecting the model’s precision in extracting detailed information.
Figure 6 below compares the ablation test results for the T51TYL_20180918_HRBs_002 sample (Harbin (T51TYL), Region 2). A comparative analysis of the visualization results reveals that the PEDNet_Base model exhibits a substantially diminished HRB capture capability in comparison to models that incorporate prior information. Its segmentation results demonstrate evident fragmentation and omission issues in HRB regions (e.g., the small-scale HRBs within the red box are barely recognized), exhibiting poor completeness and boundary accuracy. The PEDNet_SE model demonstrates the most prominent performance in capturing HRBs, not only accurately identifying large contiguous HRBs but also achieving high coverage of the scattered, small-scale HRBs within the red box. This suggests stronger feature extraction and detail resolution capabilities, effectively preserving the distribution patterns of building clusters. The PEDNet_TE and PEDNet_SE + TE models demonstrate comparable HRB capture efficiency; however, substantial disparities emerge in the refinement of details. The PEDNet_SE + TE model manifests blurred boundaries in HRB regions of limited sample size, as illustrated by the red box, and its capacity for recognizing fine-grained features is inferior to that of the PEDNet_TE model. This phenomenon is likely attributable to the interplay of the spatial and time embedding mechanisms, which introduces interference and compromises the model’s capacity to discern nuanced features within the scene.
To further analyze the performance degradation observed when SE and TE are jointly applied, prediction-guided attention maps are visualized and compared across PEDNet_SE, PEDNet_TE, and PEDNet_SE + TE, as shown in Figure 7. As illustrated, the attention responses of PEDNet_SE are more spatially consistent with the HRBs mask, exhibiting concentrated activation within building regions and clearer correspondence with the target areas. The PEDNet_TE model also highlights HRBs-related regions, but its attention distribution is comparatively less focused than that of PEDNet_SE. In contrast, the PEDNet_SE + TE model shows more scattered and less target-aligned attention patterns, with high-response regions partially deviating from HRB areas. This indicates that, when spatial and timestamp embeddings are applied simultaneously without explicit coordination, their combined guidance may interfere with each other, resulting in weakened attention concentration on HRB regions and consequently reduced segmentation precision.
In summary, the visualization results of two cross-regional samples acquired at different time periods, together with the quantitative results reported in the table, demonstrate that the proposed method effectively distinguishes high-rise buildings from non-building areas in complex urban scenes. This advantage can be attributed to the integration of spatial heterogeneity and timestamp priors, as well as the attention mechanism, which enhances the model’s ability to focus on structurally consistent building features while suppressing responses from background regions such as roads, vegetation, and open spaces. Under the same experimental settings, the PEDNet_SE model, which exclusively incorporates the SE mechanism, exhibits superior segmentation performance across diverse scenarios in terms of both visual quality and quantitative metrics.

3.2. Comparison with Traditional Methods

To comprehensively evaluate the performance of the proposed PEDNet model for high-rise building area segmentation, this study conducted cross-model comparative experiments using representative semantic segmentation architectures. Classical CNN-based models, including U-Net and Fully Convolutional Networks (FCN), were adopted as foundational baselines, while DeepLabV3+ was selected as a strong CNN-based segmentation model with enhanced multi-scale context modeling. In addition, BuildFormer was included as a representative Transformer-based architecture specifically designed for building extraction tasks. A quantitative comparison was performed to assess the effectiveness of PEDNet in extracting high-rise building area categories, with all models evaluated on identical training and test sets using F1 score, IoU, and OA metrics. The comparative results are summarized in Table 4.
The quantitative experimental results show that the PEDNet_SE model outperforms the U-Net, FCN, DeepLabV3+, and BuildFormer models in terms of the F1 score, IoU, and OA. PEDNet_SE achieves an F1 score of 62.8%, an IoU of 45.8%, and an OA of 91.3%. The U-Net model records an F1 score of 54.8%, an IoU of 42.3%, and an OA of 90.2%, while the FCN model achieves an F1 score of 55.8%, an IoU of 38.1%, and an OA of 90.1%, with a notably lower IoU value. Among the more advanced models, DeepLabV3+ achieves an F1 score of 56.2%, an IoU of 42.5%, and an OA of 90.5%, and BuildFormer achieves an F1 score of 57.8%, an IoU of 43.4%, and an OA of 90.6%. Compared with these baseline models, PEDNet_SE shows higher values across these metrics.
In order to further validate the segmentation performance of the model in real-world scenarios, particularly to highlight the advantages of the model proposed in this study compared to classical models such as U-Net and FCN in cross-timestamp and cross-spatial scenarios, we selected two typical samples for visualization comparison: The T49SGU_20180930_HRBs_002 image, captured in Zhengzhou at approximately 34°N, was utilized for autumn imaging, while the T50TMK_20180212_HRBs_002 image, obtained in Beijing at approximately 39°N, was employed for winter imaging. Due to the disparity in geographical location between North China and Central China, as well as the marked seasonal variations between autumn and winter, these two samples manifest substantial disparities in imaging characteristics. The autumn atmosphere in Zhengzhou is characterized by moderate humidity, with slight dust contributing to a warmer color tone in the image and clear spectral boundaries. In contrast, Beijing’s winter atmosphere is dry and frequently hazy, resulting in poor transparency and blurred building contours. Furthermore, discrepancies in solar elevation angles (approximately 50° in Zhengzhou in September and approximately 25° in Beijing in February) give rise to divergent shadow patterns, manifesting as short, straight, concentrated shadows in Zhengzhou and long, diffuse shadows in Beijing. The imaging conditions, influenced by geographical location and the changing seasons, offer a standard sample for assessing the reliability of the model in diverse cross-timestamp and cross-spatial environments.
As illustrated in Figure 8, a comparison of test results for the T49SGU_20180930_HRBs_002 (Zhengzhou (T49SGU), Region 2) sample is provided. Subsequent analysis of the visualization results suggests that the PEDNet_SE model attains better segmentation performance for high-rise building areas. Within the red-highlighted region, the HRB object contours segmented by the PEDNet_SE model are clearly discernible, with more complete retention of spatial details, enabling the identification of smaller building units. The U-Net model demonstrates a certain degree of target fragmentation and edge blurring in its segmentation results for this region. The FCN model produces comparatively coarse segmentation results, exhibiting limited ability to distinguish small building targets, which leads to inaccurate segmentation of some HRBs. The DeepLabV3+ model shows moderate improvements in regional continuity compared with U-Net and FCN; however, partial fragmentation and omissions are still observed in dense building areas. The BuildFormer model exhibits a higher number of misclassified regions, and some closely adjacent building areas are not effectively separated, resulting in merged regions and reduced boundary clarity.
As illustrated in Figure 9, a comparison of the test results for the T50TMK_20180212_HRBs_002 sample (Beijing (T50TMK), Region 2) is presented. The visualization results indicate that the PEDNet_SE model produces clearer and more continuous segmentation results for high-rise building areas. Within the red-highlighted regions, PEDNet_SE effectively covers large building clusters while also preserving smaller and scattered building units, with boundary details of fragmented objects relatively well maintained. Compared with U-Net and FCN, PEDNet_SE exhibits a small number of false-positive detections in this sample, reflected by slight over-identification in marginal areas. The U-Net model shows evident target fragmentation, where continuous building areas are incorrectly divided into discrete patches, accompanied by blurred boundary transitions, while the FCN model produces comparatively coarse segmentation results with limited recognition of small-scale building targets, leading to omissions and imprecise boundaries. The DeepLabV3+ model demonstrates improved regional continuity relative to U-Net and FCN, but incomplete separation of adjacent building areas and partial omissions are still observed in dense regions. The BuildFormer model preserves the overall structure of major building clusters; however, misclassified regions remain apparent, and narrow gaps between closely adjacent buildings are not consistently separated, resulting in merged building areas.
In summary, the visualization results and quantitative metrics obtained from the two cross-regional samples acquired at different time periods are mutually consistent. The results show that the PEDNet_SE model achieves better segmentation performance than U-Net and FCN, particularly in alleviating target fragmentation and coarse recognition, with improved feature extraction accuracy and more complete retention of spatial details. In the highlighted regions, PEDNet_SE demonstrates stronger capability in delineating the contours and boundaries of small-scale building units while maintaining the integrity of larger building clusters in complex urban scenes. These advantages are consistently reflected in both the quantitative evaluation metrics and the visual comparison results, indicating that, under the current experimental setup, PEDNet_SE exhibits more stable segmentation performance across different regions and imaging times compared with other models.

4. Discussion

According to visualization results obtained from multiple test samples across four different cities and regions, slight performance variations can be observed among cities. These variations are likely related to inherent differences in urban environments and imaging conditions, such as climatic factors affecting illumination and seasonal appearance, variations in architectural styles and building density influencing spatial continuity, as well as local differences in Sentinel-2 imaging quality. Such factors inevitably affect the visual characteristics of HRBs and contribute to performance differences across cities, which represent common challenges in cross-regional urban remote sensing tasks.
It should be noted that this study focuses on regional-scale mapping of HRBs rather than precise delineation of individual building footprints. Accordingly, explicit instance-level boundary regularization is not the primary focus of the proposed framework. Nevertheless, geometric consistency is partially encouraged through the geometry constraint loss and edge-aware loss, which promote spatial smoothness and boundary continuity at the regional level and help suppress isolated or spurious predictions. Variations in building sizes within the segmentation masks naturally arise from the heterogeneous spatial organization and density of high-rise buildings across different urban environments. More explicit boundary regularization and instance-level refinement strategies are left for future investigation.
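To make the role of these regional regularization terms more concrete, the sketch below shows one common way of combining a segmentation loss with a total-variation smoothness term and a gradient-based edge term in PyTorch. The specific formulations, weighting coefficients, and function names are illustrative assumptions, not the exact geometry constraint and edge-aware losses used in PEDNet.

```python
import torch
import torch.nn.functional as F

def smoothness_loss(prob):
    """Total-variation style penalty on the predicted HRB probability map,
    encouraging spatially smooth, region-level predictions (assumed form)."""
    dh = (prob[:, :, 1:, :] - prob[:, :, :-1, :]).abs().mean()
    dw = (prob[:, :, :, 1:] - prob[:, :, :, :-1]).abs().mean()
    return dh + dw

def edge_aware_loss(prob, target):
    """Penalize disagreement between predicted and reference mask gradients,
    promoting boundary continuity (assumed finite-difference form)."""
    def grad(x):
        return x[:, :, 1:, :] - x[:, :, :-1, :], x[:, :, :, 1:] - x[:, :, :, :-1]
    ph, pw = grad(prob)
    th, tw = grad(target)
    return (ph - th).abs().mean() + (pw - tw).abs().mean()

def total_loss(logits, target, lam_geo=0.1, lam_edge=0.1):
    """Segmentation loss plus hypothetical geometry and edge terms.
    logits, target: (B, 1, H, W); target is a float 0/1 mask."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return bce + lam_geo * smoothness_loss(prob) + lam_edge * edge_aware_loss(prob, target)
```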
From a broader perspective, the findings of this study are consistent with our previous research [6,7] and other related work showing that incorporating global contextual information can alleviate the limitations of purely convolutional architectures in complex urban scenes. Building upon these observations, the present study further demonstrates that explicitly embedding region-level and timestamp prior information can enhance the robustness of HRB extraction under cross-regional and cross-timestamp scenarios. Moreover, ablation experiments and attention visualization analyses indicate that while individual spatial or timestamp priors can effectively guide feature learning, their direct combination without explicit coordination may lead to competing attention responses, highlighting the importance of structured prior integration. In this context, recent foundation segmentation models, such as the Segment Anything Model (SAM), offer promising capabilities for general-purpose visual understanding; however, adapting such models to low-resolution, area-level urban remote sensing tasks—without reliance on prompt-based interaction—remains an open challenge and a meaningful direction for future research.
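To illustrate what embedding region-level and timestamp priors can look like in practice, the following PyTorch sketch adds a learned per-tile spatial vector (SE) and a sinusoidal day-of-year vector (TE) to window-attention tokens. The embedding table, the DOY encoding, and the additive fusion are plausible assumptions for exposition and do not reproduce the exact PEWA module.

```python
import math
import torch
import torch.nn as nn

class PriorEmbedding(nn.Module):
    """Illustrative prior embedding: a learned per-tile spatial vector plus a
    sinusoidal day-of-year (DOY) vector, broadcast-added to window tokens."""

    def __init__(self, dim, num_tiles):
        super().__init__()
        assert dim % 2 == 0, "dim assumed even for sin/cos concatenation"
        self.spatial = nn.Embedding(num_tiles, dim)  # SE: one vector per Sentinel-2 tile
        self.dim = dim

    def doy_encoding(self, doy):
        """TE: map day of year (1-366) to a smooth periodic vector."""
        half = self.dim // 2
        freqs = torch.arange(half, device=doy.device).float() + 1.0
        angle = 2 * math.pi * doy.float().unsqueeze(-1) / 366.0 * freqs
        return torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)

    def forward(self, tokens, tile_id, doy):
        """tokens: (B, N, dim); tile_id, doy: (B,) integer tensors."""
        prior = self.spatial(tile_id) + self.doy_encoding(doy)  # (B, dim)
        return tokens + prior.unsqueeze(1)                      # broadcast over all tokens
```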

5. Conclusions

In this study, we propose PEDNet, a prior-embedded dual-branch neural network for regional-scale extraction of high-rise building areas from cross-spatial and cross-timestamp Sentinel-2 imagery. By integrating convolutional feature learning with Transformer-based global context modeling and explicitly embedding region-level and timestamp prior information, PEDNet effectively enhances robustness to spatial heterogeneity and temporal variations in complex urban scenes. Experimental results demonstrate that the proposed framework achieves superior performance in HRB extraction compared with representative CNN-based and hybrid segmentation models. While the current implementation focuses on a single dataset and area-level mapping, the proposed prior-aware design provides a flexible foundation for future extensions. Further work will explore structured prior coordination mechanisms, such as channel-wise gating or attention-guided dynamic weighting strategies, and the integration of multi-source data or foundation models, to improve generalization under more diverse urban and imaging conditions.
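As one hypothetical instantiation of the channel-wise gating mentioned above, the sketch below learns per-channel weights that mix the spatial and time priors before they are injected into the feature tokens; this is an illustrative design for future work, not a component of the current PEDNet implementation.

```python
import torch
import torch.nn as nn

class GatedPriorFusion(nn.Module):
    """Hypothetical channel-wise gating between a spatial prior (SE) and a
    time prior (TE) before they are added to the feature tokens."""

    def __init__(self, dim):
        super().__init__()
        # Gate is conditioned on both priors and outputs per-channel weights in [0, 1].
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens, se, te):
        """tokens: (B, N, dim); se, te: (B, dim) prior vectors."""
        g = self.gate(torch.cat([se, te], dim=-1))  # (B, dim) channel-wise weights
        prior = g * se + (1.0 - g) * te             # convex per-channel mix of the two priors
        return tokens + prior.unsqueeze(1)
```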

Author Contributions

Conceptualization, L.L. and Q.S.; methodology, L.L., G.C. and Q.S.; software, Q.S. and L.L.; validation, Q.S. and G.C.; formal analysis, Q.S.; investigation, L.L.; resources, L.L.; data curation, Q.S.; writing—original draft preparation, Q.S.; writing—review and editing, L.L.; visualization, G.C.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 41971327).

Data Availability Statement

The Sentinel-2 images used in the experiments were all obtained from Copernicus Data Hub (https://browser.dataspace.copernicus.eu/, accessed on 11 August 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, Y.; Liu, M.; Hu, Y.; Li, C.; Xiong, Z. Analysis of Three-Dimensional Space Expansion Characteristics in Old Industrial Area Renewal Using GIS and Barista: A Case Study of Tiexi District, Shenyang, China. Sustainability 2019, 11, 1860. [Google Scholar] [CrossRef]
  2. Gu, J.; Zhu, Q.; Du, Z.J.; Zhang, J.B.; Liu, Y.L.; Xu, S.W.; Lu, Q.; Shan, L.X.; Liu, D.M.; Zhang, L.P.; et al. Unified Standards for Civil Building Design (GB 50352-2019). Constr. Sci. Technol. 2021, 13, 52–56. [Google Scholar] [CrossRef]
  3. Jung, S.; Lee, K.; Lee, W.H. Object-Based High-Rise Building Detection Using Morphological Building Index and Digital Map. Remote Sens. 2022, 14, 330. [Google Scholar] [CrossRef]
  4. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  5. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  6. Li, L.; Zhu, J.; Cheng, G.; Zhang, B. Detecting High-Rise Buildings from Sentinel-2 Data Based on Deep Learning Method. Remote Sens. 2021, 13, 4073. [Google Scholar] [CrossRef]
  7. Yao, S.; Li, L.; Cheng, G.; Zhang, B. Analyzing Long-Term High-Rise Building Areas Changes Using Deep Learning and Multisource Satellite Images. Remote Sens. 2023, 15, 2427. [Google Scholar] [CrossRef]
  8. Shaloni; Dixit, M.; Agarwal, S.; Gupta, P. Building Extraction from Remote Sensing Images: A Survey. In Proceedings of the 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Online, 18–19 December 2020; pp. 966–971. [Google Scholar]
  9. Dunaeva, A. Building Footprint Extraction from Stereo Satellite Imagery Using Convolutional Neural Networks. In Proceedings of the 2019 International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), Novosibirsk, Russia, 21–27 October 2019; pp. 0557–0559. [Google Scholar]
  10. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  11. Shleifer, S.; Weston, J.; Ott, M. Normformer: Improved transformer pretraining with extra normalization. arXiv 2021, arXiv:2110.09456. [Google Scholar] [CrossRef]
  12. Wang, Y.; Yang, L.; Liu, X.; Yan, P. An improved semantic segmentation algorithm for high-resolution remote sensing images based on DeepLabv3+. Sci. Rep. 2024, 14, 9716. [Google Scholar] [CrossRef] [PubMed]
  13. Wang, L.; Fang, S.; Meng, X.; Li, R.; Sensing, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625711. [Google Scholar] [CrossRef]
  14. An, K.; Wang, Y.; Chen, L.; Wang, Y. A Dual-Branch Network Based on ViT and Mamba for Semantic Segmentation of Remote Sensing Image. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Zhuhai, China, 22–24 November 2024; pp. 1–6. [Google Scholar]
  15. Zhang, Y.; Cheng, J.; Su, Y.; Deng, C.; Xia, Z.; Tashi, N. Global Adaptive Second-Order Transformer for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5640417. [Google Scholar] [CrossRef]
  16. Li, J.; Sun, D. Building Extraction from Remote Sensing Images by Fusing Mask R-CNN and Attention Mechanisms. Int. J. Netw. Secur. 2025, 27, 356–367. [Google Scholar]
  17. Yuan, Q.; Xia, B. Cross-level and multiscale CNN-Transformer network for automatic building extraction from remote sensing imagery. Int. J. Remote Sens. 2024, 45, 2893–2914. [Google Scholar] [CrossRef]
  18. Yiming, T.; Tang, X.; Shang, H. A shape-aware enhancement Vision Transformer for building extraction from remote sensing imagery. Int. J. Remote Sens. 2024, 45, 1250–1276. [Google Scholar] [CrossRef]
  19. Yuan, W.; Xu, W. MSST-Net: A Multi-Scale Adaptive Network for Building Extraction from Remote Sensing Images Based on Swin Transformer. Remote Sens. 2021, 13, 4743. [Google Scholar] [CrossRef]
  20. Sirko, W.; Brempong, E.A.; Marcos, J.T.; Annkah, A.; Korme, A.; Hassen, M.A.; Sapkota, K.; Shekel, T.; Diack, A.; Nevo, S. High-resolution building and road detection from sentinel-2. arXiv 2023, arXiv:2310.11622. [Google Scholar] [CrossRef]
  21. Dong, S.; Meng, X. Text Semantics-Driven Remote Sensing Image Feature Extraction. Spacecr. Recovery Remote Sens. 2024, 45, 82–91. [Google Scholar]
  22. Lu, Q.; Qin, J.; Yao, X.; Wu, Y.; Zhu, H. Buildings extraction of GF-2 remote sensing image based on multi-layer perception network. Remote Sens. Nat. Resour. 2021, 33, 75–84. [Google Scholar] [CrossRef]
  23. Guo, W.; Zhang, Q. Building extraction using high-resolution satellite imagery based on an attention enhanced full convolution neural network. Remote Sens. Nat. Resour. 2021, 33, 100–107. [Google Scholar] [CrossRef]
  24. Hu, Y.; Wang, Z.; Huang, Z.; Liu, Y. PolyBuilding: Polygon transformer for building extraction. ISPRS J. Photogramm. Remote Sens. 2023, 199, 15–27. [Google Scholar] [CrossRef]
  25. Chen, K.; Zou, Z.; Shi, Z. Building Extraction from Remote Sensing Images with Sparse Token Transformers. Remote Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
Figure 1. Research Area Location Map.
Figure 2. Typical regional image slices and their masks.
Figure 3. PEDNet Model Structure Diagram.
Figure 4. Prior-embedded window attention structure diagram.
Figure 5. The PEDNet model ablation experiment was conducted in the Zhengzhou region in December. The image test effect comparison is highlighted by the red box.
Figure 6. The PEDNet model ablation experiment was conducted in the Harbin region in September. The image test effect comparison is highlighted by the red box.
Figure 7. The PEDNet model attention visualization comparison was conducted in the Beijing region in February. The attention response differences are highlighted by the red box.
Figure 8. Comparison of test results from different models in the Zhengzhou region in September, with key parts highlighted in red boxes.
Figure 9. Comparison of test results from different models in the Beijing region in February, with key parts highlighted in red boxes.
Table 1. List of abbreviations used in this paper.
Abbreviation | Full Name
HRBs | High-Rise Building Areas
PEDNet | Prior-Embedded Dual-branch Network
PEWA | Prior-Embedded Window Attention
PEBlock | Prior-Embedded Block
SE | Spatial Embedding
TE | Time Embedding
DOY | Day of Year
Table 2. Spatial and timestamp distribution of the Sentinel-2 image data used in this experiment.
Data Group | T51TYL (Harbin) | T50TMK (Beijing) | T49SGU (Zhengzhou) | T49QGF (Guangzhou)
Data A | 22 March 2018 | 12 February 2018 | 22 February 2018 | 15 January 2018
Data B | 23 June 2018 | 14 June 2018 | 07 June 2018 | 11 March 2018
Data C | 18 September 2018 | 05 September 2018 | 30 September 2018 | 14 June 2018
Data D | 10 December 2018 | 19 December 2018 | 29 December 2018 | 02 October 2018
Table 3. PEDNet Model Ablation Experiment Accuracy Comparison Table.
Model | F1/% | IoU/% | OA/%
PEDNet_Base | 57.4 | 40.3 | 91.1
PEDNet_SE | 62.8 | 45.8 | 91.3
PEDNet_TE | 61.4 | 44.3 | 91.2
PEDNet_SE + TE | 60.5 | 43.4 | 91.2
Table 4. Comparison table of experimental accuracy for different models.
Model | F1/% | IoU/% | OA/%
U-Net | 54.8 | 42.3 | 90.2
FCN | 55.8 | 38.1 | 90.1
DeepLabV3+ | 56.2 | 42.5 | 90.5
BuildFormer | 57.8 | 43.4 | 90.6
PEDNet_SE | 62.8 | 45.8 | 91.3
