Article

Building Instance Extraction via Multi-Scale Hybrid Dual-Attention Network

1 Faculty of Humanities and Arts, Macau University of Science and Technology, Macau 999078, China
2 Faculty of Innovation Engineering, Macau University of Science and Technology, Macau 999078, China
* Authors to whom correspondence should be addressed.
Buildings 2025, 15(17), 3102; https://doi.org/10.3390/buildings15173102
Submission received: 6 August 2025 / Revised: 25 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025

Abstract

Accurate building instance segmentation from high-resolution remote sensing images remains challenging due to complex urban scenes featuring occlusions, irregular building shapes, and heterogeneous textures. To address these issues, we propose a novel Multi-Scale Hybrid Dual-Attention Network (MS-HDAN), which integrates a dual-stream encoder, multi-scale feature extraction, and a hybrid attention mechanism. Specifically, the encoder is designed with a Local Feature Extraction Pathway (LFEP) and a Global Context Modeling Pathway (GCMP), enabling simultaneous capture of structural details and long-range semantic dependencies. A Local-Global Collaborative Perception Enhancement Module (LG-CPEM) is introduced to fuse the outputs from both streams, enhancing contextual representation. The decoder adopts a hierarchical up-sampling structure with skip connections and incorporates a dual-attention module to refine boundary-level details and suppress background noise. Extensive experiments on benchmark urban building datasets demonstrate that MS-HDAN significantly outperforms existing state-of-the-art methods, particularly in handling densely distributed and structurally complex buildings. The proposed framework offers a robust and scalable solution for real-world applications, such as urban planning, where precise building segmentation is crucial.

1. Introduction

Building instance segmentation from high-resolution remote sensing images, typically with a spatial resolution finer than 1 m per pixel, has emerged as a fundamental technique. It plays a crucial role in the broader digital transformation of the construction sector, where the integration of remote sensing, Building Information Modeling (BIM), and digital twin technologies is reshaping urban planning and management practices. Accurate identification of building boundaries enables applications such as 3D city model reconstruction and population density analysis. Traditional building extraction methods typically rely on manually constructed spectral–texture features or geometric cues, such as best-fit plane estimation and elevation differences, combined with machine learning classifiers to complete the recognition task [1]. Although these approaches are effective in relatively regular scenes, they struggle to cope with the complexity and diversity of urban environments. As convolutional neural networks (CNNs) became established as the core technical framework, deep learning transformed the field. Improved U-Net-based architectures (e.g., ResUNet and Attention U-Net) optimize performance by integrating cross-layer skip connections with attention modules. Although such networks excel at capturing local details, they remain limited in modeling long-range spatial correlations between buildings. This challenge is particularly evident in densely built-up areas, typically characterized by a building coverage ratio exceeding 60%. The Transformer architecture overcomes this limitation through its global context-awareness paradigm, while Multi-Layer Perceptron (MLP) [2] architectures open a new modeling path through feature channel mixing mechanisms [3].
Despite the rapid advancement of deep learning techniques, accurate building instance extraction from high-resolution remote sensing imagery remains a challenging task. This difficulty arises not only in urban areas but also in non-urban settings, where scale differences, occlusions, and diverse background conditions remain unresolved issues. First, architectural entities in urban areas exhibit significant variability in shape, scale, and spectral appearance, requiring models to dynamically adapt their receptive fields to capture diverse structures effectively. Second, vegetation occlusion, shadow interference, and adjacency to other buildings often cause building contours to break or remain incomplete, leading to missed detections. Third, due to the inherent class imbalance between sparse building targets and extensive background areas, models must possess fine-grained feature discrimination capabilities to maintain boundary precision. While recent studies have attempted to address these challenges individually—such as employing attention mechanisms for modeling long-range dependencies [4], deformable convolutions for adaptive feature sampling [5], and multi-scale fusion strategies for handling size variations [6]—their effectiveness remains limited in complex urban environments due to the lack of a holistic, integrated solution that simultaneously addresses geometric adaptability, contextual reasoning, and structural refinement.
To address the challenges, we propose a framework named the Multi-Scale Hybrid Dual-Attention Network (MS-HDAN). Built upon a U-Net-style encoder–decoder architecture, MS-HDAN introduces a dual-stream encoder that includes a Local Feature Extraction Pathway (LFEP) and a Global Context Modeling Pathway (GCMP). The LFEP utilizes pre-activation residual blocks to preserve edge and building outline information. At the same time, the GCMP employs a Gated Dual-Attention BiFormer Block (GDABB) to capture long-range semantic relations. To enhance interaction between the two streams, we design a Local-Global Collaborative Perception Enhancement Module (LG-CPEM), which fuses semantic and structural features through a combination of standard and deformable convolutions. It improves adaptability to varying building geometries while maintaining semantic consistency. Regarding the decoder, an attention-guided reconstruction module incorporates spatial and channel attention information to refine boundaries and restore spatial resolution. The hierarchical skip connections ensure the propagation of multi-scale context and structural information.
The main contributions of this paper include the following:
1. We propose a Multi-Scale Hybrid Dual-Attention Network (MS-HDAN) that integrates local detail modeling with global semantic reasoning for accurate building instance segmentation.
2. We introduce the Gated Dual-Attention BiFormer Block (GDABB) for dynamic global feature selection, and the Local-Global Collaborative Perception Enhancement Module (LG-CPEM) for adaptive feature alignment under geometric variations.
3. A dual-attention guided reconstruction module is designed to refine building boundaries and enhance spatial detail consistency.
The structure of this paper is as follows. Section 2 reviews related work on building instance extraction. Section 3 details the proposed Multi-Scale Hybrid Dual-Attention Network (MS-HDAN). Section 4 presents the experimental setup, results, and analysis. Section 5 provides further discussion, and Section 6 concludes the paper.

2. Related Work

2.1. Deep Learning and Attention Mechanisms for Building Extraction

The iterative advancement of deep learning has steadily improved the accuracy of building extraction from remote sensing images. As the founding framework for pixel-level semantic segmentation, fully convolutional networks (FCNs) established the technical foundation for end-to-end prediction [7]. On this basis, U-Net [8] deepened feature expression through a symmetric encoder–decoder architecture and cross-layer skip connections, establishing itself as the benchmark model for building extraction. ResUNet [9] optimizes deep network training by integrating residual learning modules, and Attention U-Net [10] employs a spatial attention gate mechanism to enhance feature focusing on target regions.
In recent years, Transformer-based models have gradually attracted attention in remote sensing applications. The Swin Transformer adopts a hierarchical self-attention structure to effectively capture long-range dependencies in high-resolution remote sensing images and improve the feature discrimination of ground targets [3,11]. However, the Transformer architecture is inefficient at extracting local features and requires substantial computational resources. To address this problem, CNN-Transformer hybrid networks deeply fuse the local perception advantages of convolution with the global modeling ability of self-attention, effectively improving the accuracy and robustness of remote sensing image interpretation [12]. The attention mechanism exhibits powerful feature focusing in complex terrain scenes, providing a new solution to the feature redundancy commonly found in remote sensing images. The Dual Attention Network (DANet) [4] introduced the simultaneous use of position and channel attention for scene segmentation. HANet establishes a multi-level attention architecture to enable cross-scale feature interaction and adaptive fusion [13]. RADANet, a road-enhanced deformable attention network, combines the geometric modeling benefits of deformable convolutions with the dynamic weighting of the attention mechanism through an optimization scheme tailored to linear objects such as road networks [14]. Building extraction from remote sensing images benefits particularly from attention modules that perform dynamic channel feature enhancement: squeeze-and-excitation block-based [15] channel attention mechanisms have demonstrated effectiveness by focusing on important feature channels, and the Convolutional Block Attention Module (CBAM) [16] adds spatial attention to complement channel-wise feature recalibration.

2.2. Deformable Convolutions and Hybrid Architectures for Building Extraction

Deformable convolutional networks solved the problem of fixed geometric transformations in standard CNNs by learning dynamic sampling locations [5]. This capability is particularly valuable for building extraction, where structures exhibit diverse shapes and orientations. Refine-UNet [17] demonstrated how Atrous Spatial Pyramid Pooling (ASPP) can complement deformable convolutions to capture multi-scale building features. Recent work has explored the integration of deformable operations with attention mechanisms. The Feature-Fusion Segmentation Network (FFS-Net) [18] showed that combining deformable convolutions with attention can improve boundary accuracy for irregular objects. However, these approaches typically apply deformable operations at fixed network stages, whereas our LG-CPEM integrates them throughout the feature hierarchy for continuous geometric adaptation. The resurgence of MLP-based architectures, exemplified by MLP-Mixer [2], offered new perspectives on feature mixing without convolution or self-attention. Gated MLP variants such as gMLP [19] introduced data-dependent projections to control information flow. These architectures demonstrated competitive performance on image classification while maintaining computational efficiency. However, their application to dense prediction tasks, such as building extraction, remains largely unexplored. The proposed GateMLP module differs from previous MLP approaches in several key aspects. First, it operates on feature patches rather than individual tokens, thereby preserving spatial relationships that are crucial for boundary delineation. Second, the gating mechanism incorporates sigmoid activation and element-wise multiplication, enabling finer control over feature propagation. Third, it integrates seamlessly with convolutional features through residual connections, combining the strengths of both paradigms.
The last decade has witnessed remarkable progress in building extraction methodologies, from early CNN-based approaches to recent hybrid architectures. In summary, existing methods have addressed individual aspects of the problem, such as geometric adaptation through deformable convolutions, context modeling via attention mechanisms [20], and handling of scale variation via multi-scale feature fusion [21]. However, they treat each component separately and lack a systematic solution that simultaneously optimizes local detail preservation, global context integration, and adaptive feature selection. Although existing building extraction methods can reach accuracy levels above 85–90% on metrics such as IoU or F1-score in specific scenarios, their limitations become evident in complex urban environments, where buildings with diverse scales, shapes, and structural appearances, together with occlusions and background interference, significantly reduce robustness. For example, dramatic variation in the geometry, scale distribution, and spectral appearance of buildings still challenges the adaptability of existing models. Meanwhile, vegetation occlusion, shadow interference, and the influence of adjacent buildings can cause contour breakage and missed detections. Therefore, improving the localization accuracy of building boundaries and the completeness of extracted regions in complex urban environments remains a key open problem.

3. Method

When semantic segmentation of urban buildings is performed on remote sensing images, traditional image segmentation methods often exhibit insufficient feature expression capabilities in complex scenes and high-resolution images. At the same time, these methods also have limitations in modeling spatial semantic information, especially in multi-scale feature fusion and long-distance dependency, making it challenging to capture the structural and semantic information. To solve these problems, we propose a Multi-Scale Hybrid Dual-Attention Network (MS-HDAN).
Let $x \in \mathbb{R}^{H \times W \times C}$ represent an urban remote sensing image, where $H \times W$ is the resolution and $C$ is the number of channels. Our goal is to predict the pixel-wise label for segmentation. Our model is built upon the classic U-Net architecture, utilizing an encoder–decoder framework. It integrates multi-scale feature extraction and a hybrid dual-attention mechanism to enhance the representation of contextual information and improve segmentation accuracy in complex urban scenes. Specifically, the encoder is designed as a dual-stream structure, where the main branch focuses on semantic features and the auxiliary branch provides complementary structural details. After feature extraction, the decoder employs a stepwise up-sampling strategy, combined with skip connections from the encoder, to reconstruct the spatial resolution. Additionally, a dual-attention mechanism is integrated into the decoder path to enhance feature fusion and improve the reconstruction of structural details, such as building edges. This design significantly improves the segmentation accuracy of complex urban scenes in remote sensing images. The flowchart is shown in Figure 1.

3.1. Dual-Stream Encoder

To effectively extract diverse features from urban remote sensing images, we designed the dual-stream encoder of MS-HDAN. It is composed of two processing pathways: a Local Feature Extraction Pathway (LFEP) and a Global Context Modeling Pathway (GCMP). The LFEP is designed to capture spatial patterns and structural textures based on convolutional operations, while the GCMP captures long-range dependencies and semantic coherence across the remote sensing images. At each layer, the outputs of the two pathways are fused through the designed Local-Global Collaborative Perception Enhancement Module (LG-CPEM).
Firstly, we convert each remote sensing image $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of image patches of size $P \times P$. Each patch $M_i \in \mathbb{R}^{C \times P \times P}$ is then flattened and assembled into the sequence matrix $X_M \in \mathbb{R}^{N \times (C \cdot P \cdot P)}$, where $N = \frac{H \times W}{P^2}$ is the number of patches. Secondly, we project the image patches $X_M$ into a $D$-dimensional embedding space using a learnable linear projection:
$$Z = \{ X_M^1 W, \; X_M^2 W, \; \ldots, \; X_M^N W \} + E_p$$
where $W \in \mathbb{R}^{(P \cdot P \cdot C) \times D}$ is the projection matrix, $Z \in \mathbb{R}^{N \times D}$ denotes the embedded patches, and $E_p$ denotes the learnable position embeddings.
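The patch-embedding step above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code; the sizes (`H = W = 8`, `P = 4`, `C = 3`, `D = 16`) and the random weights are hypothetical assumptions chosen only to make the shapes concrete.

```python
import numpy as np

# Illustrative patch embedding: flatten P x P patches, project to D dims,
# and add learnable position embeddings (all weights random here).
H, W, C, P, D = 8, 8, 3, 4, 16
N = (H * W) // (P * P)                      # number of patches

rng = np.random.default_rng(0)
x = rng.random((H, W, C))                   # input image
W_proj = rng.random((P * P * C, D))         # learnable linear projection W
E_p = rng.random((N, D))                    # learnable position embeddings

patches = []
for i in range(0, H, P):
    for j in range(0, W, P):
        patches.append(x[i:i+P, j:j+P, :].reshape(-1))  # flatten each patch
X_M = np.stack(patches)                     # sequence matrix, (N, P*P*C)

Z = X_M @ W_proj + E_p                      # embedded patch sequence, (N, D)
```

Each row of `Z` corresponds to one spatial patch, which is what the GCMP consumes as a token sequence.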
Here, we input Z and x into the dual-stream encoder. Specifically, Z is fed into the GCMP, and x is fed into the LFEP. The overall structure can be expressed as
$$F = E_{DS}(Z, x)$$
where $E_{DS}$ denotes the dual-stream encoder. It consists of $N$ blocks:
$$F_L^0 = x, \quad F_G^0 = Z$$
$$F_L^i = f_L(F_L^{i-1}), \quad F_G^i = f_G(F_G^{i-1})$$
$$F^i = f_{fuse}^i(F_L^i, F_G^i)$$
where $f_L$ represents the LFEP, $f_G$ represents the GCMP, and $f_{fuse}$ represents the LG-CPEM.

3.1.1. Global Context Modeling Pathway

To enhance the model’s ability to model long-distance dependencies and global semantics in complex urban scenes, we introduce the GCMP based on the Gated Dual-Attention BiFormer Block (GDABB). It emphasizes informative features for improving discriminative representation. Specifically, GDABB is an improved version of BiFormer. BiFormer demonstrates strong potential in modeling long-range dependencies. However, the MLP has limited expressive capability when dealing with complex feature relationships. At the same time, GateMLP can adaptively adjust the information flow according to the input by introducing a dynamic gating mechanism. It enhances the feature selectivity and expressive capability. Meanwhile, the conventional normalization strategy can be stably trained, but it is insufficient in modeling complex feature interactions. For this reason, we introduce the dual-attention mechanism to explore the relationship between features from multiple perspectives to enrich the semantic expression.
Given the input Z, the GCMP can be expressed as
$$F_G = PM_i(H_{GDB}^i(Z)), \quad i = 1, 2, 3$$
where $PM$ denotes the patch merge, $H_{GDB}$ denotes the GDABB, and $i$ indexes the $i$-th round of the structure.
To better illustrate the GCMP, we now provide a detailed description of the Gated Dual-Attention BiFormer Block (GDABB), which serves as the fundamental unit of GCMP. Given that feature representation Y, the GDABB first applies a depthwise convolution branch to extract localized spatial patterns. Then, the Bi-Level Routing Attention is employed to model the global dependencies. Also, to further enhance the expressive diversity of global context modeling, we propose a dual-way parallel attention mechanism. Specifically, we adopt a structurally symmetric dual-path attention mechanism. This scheme aims to model global dependencies from both pre-normalization and post-normalization perspectives, thereby achieving complementary modeling of attention representations.
$$Y_1 = Y + DWConv_{3 \times 3}(Y)$$
where $DWConv$ denotes the depthwise convolution.
$$Y_2 = Y_1 + BRA(LN(Y_1)) + LN(BRA(Y_1))$$
where $BRA$ represents the Bi-Level Routing Attention and $LN$ represents layer normalization; the two terms correspond to the pre-normalization and post-normalization attention paths, respectively.
BRA is a dynamic, query-aware sparse attention mechanism. It takes the query $Q$, key $K$, and value $V$ as input and calculates the adjacency matrix $A$. A routing index matrix $I$ is then constructed to contain the top-$k$ connections for each region. Finally, the keys and values are gathered to ensure high computational efficiency during sparse attention.
$$A = Q K^T$$
$$I = \mathrm{TopK}(A)$$
$$K_G = \mathrm{gather}(K, I), \quad V_G = \mathrm{gather}(V, I)$$
where $K_G$ and $V_G$ are the gathered keys and values, and $\mathrm{TopK}$ denotes the operation of selecting the top-$k$ strongest connections.
Finally, we apply the attention based on the gathered key and value.
$$O_A = \mathrm{Att}(Q, K_G, V_G) + DWConv(V)$$
$$\mathrm{Att}(Q, K_G, V_G) = \mathrm{Softmax}\!\left( \frac{Q K_G^T}{\sqrt{d_k}} \right) V_G$$
where $d_k$ denotes the dimensionality of the key and query embeddings and $DWConv$ represents the depthwise separable convolution.
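The routing and gathering steps above can be made concrete with a toy NumPy sketch. All sizes (`R` regions, `d` dimensions, `k` routed links) and the random tensors are illustrative assumptions, and the region-partitioning details of the full BRA are omitted for brevity.

```python
import numpy as np

# Toy sketch of Bi-Level Routing Attention: top-k routing, gather, then
# sparse attention restricted to the routed regions.
rng = np.random.default_rng(0)
R, d, k = 6, 8, 2

Q = rng.normal(size=(R, d))        # region-level queries
K = rng.normal(size=(R, d))        # region-level keys
V = rng.normal(size=(R, d))        # region-level values

A = Q @ K.T                        # adjacency matrix, A = Q K^T
I = np.argsort(-A, axis=1)[:, :k]  # I = TopK(A): k strongest links per region

K_G = K[I]                         # gathered keys,   shape (R, k, d)
V_G = V[I]                         # gathered values, shape (R, k, d)

# scaled dot-product attention over only the k gathered entries per region
logits = np.einsum('rd,rkd->rk', Q, K_G) / np.sqrt(d)
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w = w / w.sum(axis=1, keepdims=True)
O = np.einsum('rk,rkd->rd', w, V_G)
```

Because each query attends to only `k` of `R` regions, the attention cost scales with `R * k` rather than `R * R`, which is the efficiency argument behind the gather step.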
After calculating the attention, a gated feedforward network is introduced to fuse the features adaptively. This step is formulated as
$$O = \mathrm{GateMLP}(LN(O_A))$$
$$\mathrm{GateMLP}(X) = \left[ DWConv(\mathrm{GELU}(X W_1 + b_1)) \odot \sigma(X W_1 + b_g) \right] W_2 + b_2$$
$$\mathrm{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2} \left[ 1 + \mathrm{erf}\!\left( \frac{x}{\sqrt{2}} \right) \right]$$
Here, $O_A$ is the output of the $BRA$ stage and $O$ is the final output of the block. $\odot$ denotes element-wise multiplication, $\sigma$ denotes the sigmoid activation function, $\Phi(x)$ denotes the cumulative distribution function of the standard normal distribution, and $\mathrm{erf}$ is the Gaussian error function.
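The gating computation can be sketched as below. This is a hedged NumPy illustration, not the authors' implementation: the depthwise convolution is replaced by the identity for brevity, and all sizes and weights are assumptions.

```python
import numpy as np
from math import erf

# Sketch of the GateMLP gating: a GELU content branch modulated element-wise
# by a sigmoid gate, then projected back (DWConv omitted here).
def gelu(x):
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
N, D, D_h = 4, 8, 16
X  = rng.normal(size=(N, D))
W1 = rng.normal(size=(D, D_h)); b1 = np.zeros(D_h)
bg = np.zeros(D_h)
W2 = rng.normal(size=(D_h, D)); b2 = np.zeros(D)

content = gelu(X @ W1 + b1)           # content branch
gate    = sigmoid(X @ W1 + bg)        # gate values lie in (0, 1)
O       = (content * gate) @ W2 + b2  # element-wise gated fusion, projection
```

Because the gate is computed from the input itself, the information flow is data-dependent: features the gate drives toward 0 are suppressed, which is the adaptive selectivity the text describes.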

3.1.2. Local Feature Extraction Pathway

To enhance the model’s ability to extract fine-grained features that correspond to subtle pixel-level variations in remote sensing images, the Local Feature Extraction Pathway (LFEP) is designed. Specifically, we adopt the Pre-activation ResNet (PRN) block as the backbone structure to extract discriminative, detailed features in local regions. Unlike the standard ResNet structure, Pre-activation ResNet places the batch normalization and activation function (ReLU) before the convolution operation, effectively mitigating the problem of gradient vanishing. It makes the local feature expression more stable and training more efficient.
Given the input x, the backbone structure can be expressed as
$$F_L = \mathrm{LFEP}(x)$$
$$\mathrm{LFEP}(x) = PRN^3(\mathrm{Conv}(x))$$
where $PRN^3$ denotes applying the $PRN$ block three times.
$$PRN(x) = x + \mathrm{Conv}_{3 \times 3}(\mathrm{ReLU}(BN(x)))$$
where $BN$ denotes batch normalization and $\mathrm{ReLU}$ is the activation function.
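The pre-activation ordering (BN, then ReLU, then convolution, plus an identity path) can be illustrated with a toy sketch. This is an assumption-laden simplification: BN is reduced to per-feature standardization and the 3 × 3 convolution to a plain linear map, purely to show the ordering and the untouched identity path.

```python
import numpy as np

# Toy illustration of PRN(x) = x + Conv(ReLU(BN(x))) with simplified ops.
def bn(x, eps=1e-5):
    # stand-in for batch normalization: per-feature standardization
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(2)
x = rng.normal(size=(32, 8))       # (batch, features)
W = rng.normal(size=(8, 8)) * 0.1  # stand-in for the 3x3 convolution

out = x + relu(bn(x)) @ W          # BN -> ReLU -> "Conv", plus identity
```

Note that the identity path bypasses BN and ReLU entirely, so gradients can flow through `x` unchanged; this is the mechanism behind the mitigated gradient vanishing mentioned above.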

3.1.3. Local-Global Collaborative Perception Enhancement Module

To effectively fuse local details with global semantic information, we propose the Local-Global Collaborative Perception Enhancement Module (LG-CPEM). This module enhances the discriminative ability of the overall feature representation by efficiently fusing the global and local information from both paths. Specifically, LG-CPEM introduces a differentiated feature enhancement strategy. On the one hand, features output from the global path are introduced into a standard 3 × 3 convolution to preserve the semantic backbone. On the other hand, the deformable convolution is applied to the features of the local path to enhance its modeling capability. The dual-path features are processed and then fused by element-by-element addition to obtain a synergistically enhanced multi-scale feature representation. This fusion approach not only enables complementary perception at the semantic and structural levels but also enhances computational efficiency and structural flexibility.
Given the extracted F L from the LFEP and F G from GCMP, the overall process can be stated as
$$F_{LG} = \mathrm{Conv}_{3 \times 3}(F_G) + \mathrm{DeformConv}_{3 \times 3}(F_L)$$
where $\mathrm{DeformConv}$ denotes the deformable convolution applied to the local-path features, consistent with the description above.
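The fusion idea can be sketched in simplified single-channel form. This is an illustrative assumption, not the real LG-CPEM: true deformable convolution learns fractional offsets with bilinear sampling, whereas here fixed integer per-tap offsets stand in for it, and the two input streams are generic arrays `F_a` and `F_b`.

```python
import numpy as np

# Simplified LG-CPEM-style fusion: a standard 3x3 convolution on one stream
# plus an offset-shifted ("deformable", integer offsets only) 3x3 convolution
# on the other, fused by element-wise addition.
rng = np.random.default_rng(4)
H = W = 8
F_a = rng.normal(size=(H, W))    # one encoder stream
F_b = rng.normal(size=(H, W))    # the other encoder stream
k_std = rng.normal(size=(3, 3))  # standard conv kernel
k_def = rng.normal(size=(3, 3))  # "deformable" conv kernel
offsets = rng.integers(-1, 2, size=(3, 3, 2))  # per-tap (dy, dx) offsets

def conv3x3(F, k, offs=None):
    out = np.zeros_like(F)
    for y in range(H):
        for x in range(W):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    dy, dx = i - 1, j - 1
                    if offs is not None:             # shift the sampling tap
                        dy += int(offs[i, j, 0]); dx += int(offs[i, j, 1])
                    yy = min(max(y + dy, 0), H - 1)  # clamp at borders
                    xx = min(max(x + dx, 0), W - 1)
                    acc += k[i, j] * F[yy, xx]
            out[y, x] = acc
    return out

F_LG = conv3x3(F_a, k_std) + conv3x3(F_b, k_def, offsets)  # element-wise sum
```

The offset table moves each kernel tap off the regular grid, which is what lets the deformable branch adapt its receptive field to irregular building geometry.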

3.2. Attention-Guided Decoder

The decoder structure of the traditional U-Net generally suffers from insufficient information fusion and insufficient detail recovery. Therefore, we construct an attention-guided decoder. Specifically, we adopt a skip connection based on dual attention (DA) for feature enhancement. The decoder not only adopts the up-sampled features from the previous layer but also jointly utilizes the fused local–global output features ($F^i$) provided by the encoder. Firstly, the fused feature $F^i$ is enhanced by the DA to achieve a more accurate and contextually consistent reconstruction. Then, the enhanced feature is concatenated with the decoding feature $D^{i+1}$ from the previous layer and up-sampled. The process can be formulated as follows:
$$F_S^i = DA(F^i)$$
where $F_S^i$ denotes the features enhanced by the dual-attention mechanism.
$$D^i = \mathrm{UpSampling}(\mathrm{Concat}(D^{i+1}, F_S^i))$$
DA consists of spatial attention (SPA) and channel attention (CA). For spatial attention, given the feature $A$, it first generates the feature maps $B$, $C$, and $D$, then performs matrix multiplication and applies the softmax to generate the spatial attention map; $D$ is used for the feature weighting of SPA. For channel attention, given the feature $A$, it first performs matrix multiplication between $A$ and the transpose of $A$, then applies the softmax to generate the channel attention map; $A$ itself is used for the feature weighting of CA. After weighting, the two attention outputs are summed. The overall process can be expressed as
$$DA(x) = SPA(x) + CA(x)$$
$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}$$
$$SPA(x) = \alpha \sum_{i=1}^{N} s_{ji} D_i + A_j$$
$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}$$
$$CA(x) = \beta \sum_{i=1}^{C} x_{ji} A_i + A_j$$
where $\alpha$ and $\beta$ are learnable parameters, $N$ denotes the number of pixels, and $C$ denotes the number of channels.
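The two attention maps can be computed compactly in NumPy. This sketch makes the common simplifying assumption that $B$, $C$, and $D$ all equal $A$ (i.e., the 1 × 1 projections are identities), and the sizes and the $\alpha$, $\beta$ values are illustrative.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Compact sketch of the dual-attention equations, with B = C = D = A.
rng = np.random.default_rng(3)
C_ch, N = 4, 9                        # channels, flattened pixel positions
A = rng.normal(size=(C_ch, N))        # feature map: channels x pixels
alpha, beta = 1.0, 1.0

# spatial attention: N x N map over pixel positions, normalized over i
S = softmax(A.T @ A, axis=0)          # S[i, j] = s_ji from B_i . C_j
SPA = alpha * (A @ S) + A             # sum_i s_ji * D_i, plus residual A_j

# channel attention: C x C map over channels, normalized over i
X = softmax(A @ A.T, axis=0)          # X[i, j] = x_ji from A_i . A_j
CA = beta * (X.T @ A) + A             # sum_i x_ji * A_i, plus residual A_j

DA = SPA + CA
```

The residual term `+ A` in both branches keeps the original features intact, so the attention acts as a learned refinement rather than a replacement.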

4. Experiments

4.1. Dataset Details

Two public datasets are used for experimental evaluation: Wuhan and Shanghai, taken from the Building Instances of Typical Cities in China (BITC) dataset [22]. It contains 1231 images of Shanghai and 1448 images of Wuhan. Each image is 500 × 500 pixels with a spatial resolution of 0.29 m.

4.2. Evaluation Metrics

For evaluation, we employ three standard metrics: accuracy, F1-score, and mean intersection over union (mIoU) [23]:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$F1\text{-}score = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$mIoU = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k + FN_k}$$
where $TP$, $TN$, $FP$, and $FN$ represent the numbers of true positives, true negatives, false positives, and false negatives, respectively. $K$ denotes the number of classes, and $TP_k$, $FP_k$, and $FN_k$ represent the true positives, false positives, and false negatives for the $k$-th class.
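For reference, the three metrics can be implemented directly from flattened prediction and label masks. This is a minimal pure-Python sketch of the standard definitions, not the authors' evaluation code; the example masks at the end are made up.

```python
# Standard segmentation metrics from binary (0/1) prediction and label lists.
def confusion(pred, gt):
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))
    return tp, tn, fp, fn

def accuracy(pred, gt):
    tp, tn, fp, fn = confusion(pred, gt)
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(pred, gt):
    tp, _, fp, fn = confusion(pred, gt)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def miou(pred, gt, classes=(0, 1)):
    # per-class IoU (treating each class in turn as positive), then averaged
    ious = []
    for c in classes:
        p = [int(v == c) for v in pred]
        g = [int(v == c) for v in gt]
        tp, _, fp, fn = confusion(p, g)
        ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)

# toy masks, 8 pixels each (illustrative only)
pred = [1, 1, 0, 0, 1, 0, 1, 0]
gt   = [1, 0, 0, 0, 1, 1, 1, 0]
acc, f1, m = accuracy(pred, gt), f1_score(pred, gt), miou(pred, gt)
```

Note that mIoU averages over both the building and background classes, so a model that over-predicts background is penalized even when pixel accuracy stays high.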

4.3. Performance Evaluation

To conduct a comprehensive evaluation of the model’s performance, we performed experiments using the proposed method on the Shanghai-BITC and Wuhan-BITC datasets.
Table 1 presents the quantitative results on the Shanghai-BITC dataset, comparing the performance of five segmentation models. Overall, our proposed method consistently achieves higher performance across all metrics, reaching an accuracy of 0.9451, an F1-score of 0.9445, and an mIoU of 0.8445, which highlights its improved segmentation ability and generalization. Among the other mainstream models, TransUNet also performs well (accuracy: 0.9329, F1-score: 0.9325, mIoU: 0.8325), verifying the effectiveness of fusing CNN and Transformer architectures for improving semantic understanding and spatial perception. DeepLabV3 benefits from its Atrous Spatial Pyramid Pooling (ASPP) structure, which effectively enhances multi-scale context capture, achieving an mIoU of 0.7841. UNeXt consistently improves over the base U-Net across all evaluation metrics. Although these differences are not statistically significant, they reveal a positive trend, suggesting that the optimized decoding structure and skip-connection mechanism help preserve information. Although SwinUNet introduces a hierarchical window attention mechanism, it does not surpass TransUNet and DeepLabV3 on any index, especially mIoU (0.7513), which is only slightly higher than that of the basic U-Net, indicating that its balance between local and global features still has deficiencies. These results collectively demonstrate the effectiveness of the proposed method, particularly in feature extraction and boundary preservation.
Table 2 presents the results of the Wuhan-BITC dataset. The accuracy, F1-score, and mIoU of all models on this dataset are generally lower than their performance on the Shanghai-BITC dataset. Among them, our proposed method still achieves the highest accuracy (0.9232), F1-score (0.9221), and mIoU (0.7638). DeepLabV3 also shows stable performance on Wuhan-BITC, with an accuracy of 0.9139 and an mIoU of 0.7427. In contrast, TransUNet’s performance slightly declines on this dataset, with mIoU dropping from 0.8325 to 0.7033. SwinUNet achieves an mIoU of 0.7056, which is slightly higher than that of TransUNet, but still significantly lower than its performance on the Shanghai-BITC dataset. The performance of UNeXt and U-Net is further weakened, especially the mIoU of UNeXt, which decreases to 0.6613, likely due to weaker boundary representation.
In summary, the proposed method maintains consistently strong performance across two datasets with different distributions and complexities, highlighting its robustness and generalization ability.

4.4. Visualization Results

Figure 2 and Figure 3 display the segmentation visualization results for the two remote sensing datasets, Shanghai-BITC and Wuhan-BITC, respectively. Each figure shows the original image, the ground-truth labels, and the segmentation outputs of the five mainstream models and the proposed method. On the Shanghai-BITC dataset (Figure 2), the traditional U-Net and its variant UNeXt are prone to target sticking or breaking, and edge details are not well preserved. DeepLabV3 has a slight advantage in extracting the backbone structure but still suffers from deformation or mis-segmentation in fine areas. SwinUNet and TransUNet alleviate these problems to some extent and are notably more complete in reconstructing the general contours. In contrast, the proposed method better restores building edges, slender structures, and narrow gap areas, and the generated prediction maps are closer to the real labels in structural continuity and geometric alignment. On the Wuhan-BITC dataset (Figure 3), the scenes are more complex, substantially increasing the difficulty of prediction. Overall, U-Net and UNeXt exhibit significant background leakage and boundary collapse in several areas, and their performance is volatile in small target regions. DeepLabV3 retains larger regions but has limited ability to recognize fine edges. SwinUNet and TransUNet perform slightly better in the third and fourth rows of the figure, identifying the general building layouts; however, residual errors are still visible, mainly around complex boundaries and textured regions, where misclassifications and noise are more likely to occur. Experimental results across multiple scenes demonstrate that the proposed method consistently maintains higher accuracy and mIoU than the other models, indicating improved robustness.

4.5. Confusion Matrices

To further analyze the classification performance of the models at the pixel level, Figure 4 and Figure 5 show the confusion matrix results of each segmentation model on the Shanghai-BITC and Wuhan-BITC datasets, respectively. The proposed method performs best on the Shanghai-BITC dataset (Figure 4). In contrast, the Class 1 recognition rates of U-Net and UNeXt are 0.80 and 0.81, both lower than the 0.85 achieved by the proposed method, particularly at the intersection of background and foreground. TransUNet and DeepLabV3 show relatively consistent performance across datasets, with smaller performance fluctuations compared to other models. In particular, TransUNet demonstrates a stronger ability to recognize background regions. In contrast, SwinUNet’s performance is relatively unstable, indicating that there are still obvious omissions in the detection of target regions, such as buildings. On the Wuhan-BITC dataset (Figure 5), the overall classification performance is degraded, reflecting the challenges of this dataset, such as its complex background and fuzzy target boundaries. Nevertheless, the proposed method still maintains high accuracy in both prediction classes, reaching 0.94 for Class 0 and 0.82 for Class 1. These results are higher than those of other methods, particularly in classifying foreground buildings, and demonstrate stronger robustness.

4.6. Ablation Experiment

To verify the impact of the key modules in the proposed method on overall performance, we conduct ablation experiments on the Shanghai-BITC and Wuhan-BITC datasets. Two main modules are examined: (1) the GDAB (Gated Dual-Attention BiFormer) block, which models relationships among features from multiple perspectives to enrich semantic expression; and (2) the DA (dual-attention) mechanism, which improves the discriminability of feature representations. Table 3 and Table 4 compare the complete model with its two ablated versions (w/o GDAB, w/o DA) on the Shanghai-BITC and Wuhan-BITC datasets, respectively. The complete model achieves the highest scores on all three metrics on both datasets. This improvement should be interpreted as a consistent trend rather than a statistically proven difference, since no statistical hypothesis testing (e.g., p-value analysis) was conducted. Removing the GDAB module decreases mIoU by 2.41 and 1.23 percentage points on the Shanghai-BITC and Wuhan-BITC datasets, respectively, suggesting that GDAB contributes to improving boundary accuracy and reducing false detections. Removing the DA module lowers the F1-score, indicating that this mechanism plays a crucial role in enhancing feature discrimination. Overall, the observed differences support the effectiveness of the proposed modules.
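The percentage-point drops quoted above follow directly from the mIoU columns of Tables 3 and 4; as a quick arithmetic check (values copied from the tables):

```python
def drop_in_points(full, ablated):
    """Absolute mIoU decrease expressed in percentage points."""
    return round((full - ablated) * 100, 2)

# mIoU values from Table 3 (Shanghai-BITC) and Table 4 (Wuhan-BITC).
shanghai = {"full": 0.8458, "wo_gdab": 0.8217, "wo_da": 0.8216}
wuhan    = {"full": 0.7638, "wo_gdab": 0.7515, "wo_da": 0.7571}

print(drop_in_points(shanghai["full"], shanghai["wo_gdab"]))  # 2.41
print(drop_in_points(wuhan["full"], wuhan["wo_gdab"]))        # 1.23
```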

5. Discussion

5.1. Model Effectiveness in Complex Urban Scenes

With the acceleration of urbanization, high-resolution remote sensing images are playing an increasingly important role in urban planning, land use, and disaster monitoring [28]. As the core element of urban space, accurate segmentation of buildings is crucial for constructing high-quality urban information maps [29]. However, traditional semantic segmentation methods often suffer from insufficient feature expressiveness and limited spatial semantic modeling when faced with complex urban scenes and architectural structures of varying scales.
Our proposed MS-HDAN addresses these issues by combining local detail preservation with global semantic reasoning, producing more coherent boundary predictions without relying on a single type of feature representation. Rather than restating the technical details of each module, we emphasize here that the overall framework improves both contextual understanding and structural fidelity in challenging urban scenarios. The experimental results demonstrate that MS-HDAN consistently outperforms existing mainstream methods, validating its adaptability and robustness in complex urban scenes. In particular, the model shows clear improvements in densely built-up areas and in regions with ambiguous boundaries, where maintaining regional consistency is especially difficult.

5.2. Potential Impact on Urban Decision Making

MS-HDAN has shown outstanding performance in high-precision building semantic segmentation, providing more accurate and automated data support for urban management and decision making [30]. In practical applications, the proposed method can be widely used for tasks such as urban spatial information extraction, building surveys, urban expansion monitoring, and planning evaluation. For example, during urban renewal, renovation of old residential areas, and infrastructure expansion, decision-makers can rely on the model's segmentation results to quickly identify building distribution density, boundary morphology, and structural characteristics, thereby developing more targeted and scientific intervention strategies.
By introducing the dual-stream structure and dual-attention mechanism, MS-HDAN models the structural details and semantic contours of buildings in complex urban environments more accurately, effectively alleviating the misjudgment and missed detection problems of traditional methods in densely populated urban areas. Therefore, this method shows potential practical value and may provide useful technical references for urban intelligent governance and data-driven spatial decision making, although further validation in real-world applications is needed.

5.3. Limitations and Future Work

Although the proposed MS-HDAN demonstrates significant advantages in the semantic segmentation of urban buildings, certain limitations remain. First, the model's complex structure significantly increases its parameter count and computational cost compared to traditional segmentation models, which may become a limiting factor on resource-constrained edge devices or in applications requiring high real-time performance. Second, model training still relies heavily on high-quality, pixel-level annotated data (i.e., remote sensing images with manually delineated building boundaries serving as ground truth). In practical remote sensing applications, building labels are costly to obtain, and annotation accuracy is easily affected by human subjectivity, which can significantly hinder cross-regional generalization and large-scale deployment.
Given these problems, future research can proceed in several directions. First, model lightweighting strategies such as structural pruning, knowledge distillation, or neural architecture search (NAS) can be explored to substantially reduce model complexity while preserving segmentation performance, meeting the practical needs of edge computing and real-time inference. Second, multi-source remote sensing data (such as hyperspectral imagery, LiDAR, or multi-temporal images) can be fused to further enhance the model's perception and segmentation of complex terrain structures, occluded areas, and multi-scale targets. In addition, segmentation results can be integrated more deeply with tasks such as urban functional zoning, building type recognition, or 3D modeling, providing more accurate and multidimensional data support for intelligent decision-making systems in urban management, disaster response, and spatial planning.
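Of the lightweighting directions mentioned, knowledge distillation is the most straightforward to sketch: a compact student network is trained to match the temperature-softened output distribution of the full teacher model. The following pure-Python fragment illustrates only the soft-label loss term; the function names and toy logits are hypothetical and are not part of the proposed method:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of class logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions; zero when they match."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Per-pixel logits for (background, building): the student is pulled
# toward the teacher's soft prediction rather than a hard 0/1 label.
teacher = [2.0, -1.0]
student = [1.0, 0.0]
loss = distillation_loss(teacher, student)
print(loss)  # positive; shrinks toward 0 as the student matches the teacher
```

In a real training loop this term would be applied per pixel and combined with the ordinary cross-entropy loss against the ground-truth mask.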

6. Conclusions

In this paper, we propose a novel Multi-Scale Hybrid Dual-Attention Network (MS-HDAN) for accurate building instance segmentation in high-resolution remote sensing images. To address the challenges posed by complex urban scenes, including geometric variability and semantic ambiguity, MS-HDAN integrates a dual-stream encoder that captures both local structural details and global contextual semantics. Beyond its methodological contributions, the approach can support applications such as BIM-based modeling, GIS data updates, and urban spatial monitoring. Potential users include urban planning authorities, city developers, and agencies involved in disaster response and land-use management. The method may also be aligned with emerging standards for urban spatial data exchange, although further investigation is needed. Some limitations remain, including the model's relatively high complexity and its reliance on pixel-level annotations. Future work may explore lightweight architectures, multi-source data integration, and closer linkage with planning and construction workflows. Overall, MS-HDAN provides an incremental improvement over existing approaches, with potential value for integration into digital urban management practices.

Author Contributions

Conceptualization: Q.H. and Y.P.; data curation: C.Z. and Y.L.; formal analysis: Q.H. and C.Z.; investigation: Y.P. and Y.L.; methodology: Q.H. and Y.P.; project administration: K.U. and J.C.; resources: Q.H. and Y.P.; software: Y.P.; visualization: Q.H., Y.P. and Y.L.; writing—original draft: Q.H., Y.P., Y.L., C.Z., J.C. and K.U.; writing—review and editing: Q.H., Y.P., K.U. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research work is funded by the project FRG-25-064-FA and granted by the Research Fund of Macao University of Science and Technology (FRG-MUST).

Data Availability Statement

The datasets used in this study are publicly available and can be accessed from the following repository: the Building Instances of Typical Cities in China (BITC) dataset is available at https://aistudio.baidu.com/datasetdetail/250828 (accessed on 25 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Raghavan, R.; Verma, D.C.; Pandey, D.; Anand, R.; Pandey, B.K.; Singh, H. Optimized building extraction from high-resolution satellite imagery using deep learning. Multimed. Tools Appl. 2022, 81, 42309–42323.
  2. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272.
  3. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021.
  4. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
  5. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
  6. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  9. Liu, Y.; Shen, J.; Yang, L.; Bian, G.; Yu, H. ResDO-UNet: A deep residual network for accurate retinal vessel segmentation from fundus images. Biomed. Signal Process. Control 2023, 79, 104087.
  10. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999.
  11. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715.
  12. Yin, M.; Chen, Z.; Zhang, C. A CNN-Transformer network combining CBAM for change detection in high-resolution remote sensing images. Remote Sens. 2023, 15, 2406.
  13. Han, C.; Wu, C.; Guo, H.; Hu, M.; Chen, H. HANet: A hierarchical attention network for change detection with bitemporal very-high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3867–3878.
  14. Dai, L.; Zhang, G.; Zhang, R. RADANet: Road augmented deformable attention network for road extraction from complex high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602213.
  15. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
  16. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
  17. Qiu, W.; Gu, L.; Gao, F.; Jiang, T. Building extraction from very high-resolution remote sensing images using Refine-UNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6002905.
  18. Liu, X.; Peng, Y.; Lu, Z.; Li, W.; Yu, J.; Ge, D.; Xiang, W. Feature-fusion segmentation network for landslide detection using high-resolution remote sensing images and digital elevation model data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4500314.
  19. Rajagopal, A.; Nirmala, V. Convolutional gated MLP: Combining convolutions and gMLP. In Proceedings of the International Conference on Big Data, Machine Learning and Applications, Orlando, FL, USA, 6–9 December 2021.
  20. Zuo, R.; Zhang, G.; Zhang, R.; Jia, X. A deformable attention network for high-resolution remote sensing images semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4406314.
  21. Chen, J.; Yi, J.; Chen, A.; Jin, Z. EFCOMFF-Net: A multiscale feature fusion architecture with enhanced feature correlation for remote sensing image scene classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604917.
  22. Wu, K.; Zheng, D.; Chen, Y.; Zeng, L.; Zhang, J.; Chai, S.; Xu, W.; Yang, Y.; Li, S.; Liu, Y.; et al. A dataset of building instances of typical cities in China. Chin. Sci. Data 2021, 6, 182–190.
  23. You, D.; Wang, S.; Wang, F.; Zhou, Y.; Wang, Z.; Wang, J.; Xiong, Y. EfficientUNet+: A building extraction method for emergency shelters based on deep learning. Remote Sens. 2022, 14, 2207.
  24. Valanarasu, J.M.J.; Patel, V.M. UNeXt: MLP-based rapid medical image segmentation network. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 23–33.
  25. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  26. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
  27. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022.
  28. Al Shafian, S.; Hu, D. Integrating machine learning and remote sensing in disaster management: A decadal review of post-disaster building damage assessment. Buildings 2024, 14, 2344.
  29. Dabove, P.; Daud, M.; Olivotto, L. Revolutionizing urban mapping: Deep learning and data fusion strategies for accurate building footprint segmentation. Sci. Rep. 2024, 14, 13510.
  30. Arulananth, T.; Kuppusamy, P.; Ayyasamy, R.K.; Alhashmi, S.M.; Mahalakshmi, M.; Vasanth, K.; Chinnasamy, P. Semantic segmentation of urban environments: Leveraging U-Net deep learning model for cityscape image analysis. PLoS ONE 2024, 19, e0300767.
Figure 1. The overall model structure of MS-HDAN.
Figure 2. Visualization results for the Shanghai-BITC dataset.
Figure 3. Visualization results for the Wuhan-BITC dataset.
Figure 4. Confusion matrix results for the Shanghai-BITC dataset.
Figure 5. Confusion matrix results for the Wuhan-BITC dataset.
Table 1. Comparison of segmentation models on the Shanghai-BITC dataset.

Model                    Accuracy   F1-Score   mIoU
U-Net [8]                0.9034     0.9014     0.7466
UNeXt [24]               0.9121     0.9111     0.7693
DeepLabV3 [25]           0.9208     0.9187     0.7841
TransUNet [26]           0.9329     0.9325     0.8325
SwinUNet [27]            0.9046     0.9031     0.7513
Our proposed method      0.9451     0.9445     0.8445
Table 2. Comparison of segmentation models on the Wuhan-BITC dataset.

Model                    Accuracy   F1-Score   mIoU
U-Net [8]                0.8972     0.8922     0.6919
UNeXt [24]               0.8789     0.8762     0.6613
DeepLabV3 [25]           0.9139     0.9121     0.7427
TransUNet [26]           0.9017     0.8971     0.7033
SwinUNet [27]            0.9025     0.8993     0.7056
Our proposed method      0.9232     0.9221     0.7638
Table 3. Ablation study results on the Shanghai-BITC dataset.

Method           Accuracy   F1-Score   mIoU
Our proposed     0.9451     0.9447     0.8458
w/o GDAB         0.9363     0.9354     0.8217
w/o DA           0.9344     0.9345     0.8216
Table 4. Ablation study results on the Wuhan-BITC dataset.

Method           Accuracy   F1-Score   mIoU
Our proposed     0.9232     0.9221     0.7638
w/o GDAB         0.9182     0.9161     0.7515
w/o DA           0.9191     0.9177     0.7571
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Q.; Peng, Y.; Zhang, C.; Lin, Y.; U, K.; Chen, J. Building Instance Extraction via Multi-Scale Hybrid Dual-Attention Network. Buildings 2025, 15, 3102. https://doi.org/10.3390/buildings15173102

