Article

Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction

1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
* Authors to whom correspondence should be addressed.
Sensors 2025, 25(17), 5356; https://doi.org/10.3390/s25175356
Submission received: 17 July 2025 / Revised: 21 August 2025 / Accepted: 27 August 2025 / Published: 29 August 2025
(This article belongs to the Section Optical Sensors)

Abstract

Building extraction from high-resolution remote sensing imagery is critical for urban planning and disaster management, yet remains challenging due to significant intra-class variability in architectural styles and multi-scale distribution patterns of buildings. To address these limitations, we propose the Multi-Scale Guided Context-Aware Network (MSGCANet), a Transformer-based multi-scale guided context-aware network. Our framework integrates a Contextual Exploration Module (CEM) that synergizes asymmetric and progressive dilated convolutions to hierarchically expand receptive fields, enhancing discriminability for dense building features. We further design a Window-Guided Multi-Scale Attention Mechanism (WGMSAM) to dynamically establish cross-scale spatial dependencies through adaptive window partitioning, enabling precise fusion of local geometric details and global contextual semantics. Additionally, a cross-level Transformer decoder leverages deformable convolutions for spatially adaptive feature alignment and joint channel-spatial modeling. Experimental results show that MSGCANet achieves IoU values of 75.47%, 91.53%, and 83.10%, and F1-scores of 86.03%, 95.59%, and 90.78% on the Massachusetts, WHU, and Inria datasets, respectively, demonstrating robust performance across these datasets.

1. Introduction

The identification and localization of buildings constitute a fundamental task in regional planning, carrying significant implications for urban planning [1] and demographic analysis [2]. This geospatial capability provides critical support for governmental decision-making in urban planning while delivering essential datasets for urban environmental monitoring [3], disaster risk management [4], and sustainable urban development initiatives [5].
Although buildings in remote sensing imagery exhibit significant variations in spatial distribution and spectral features, early research primarily focused on two universally identifiable characteristics of their geometric morphology: straight edges and shadow casting [6]. Specifically, edge features have been effectively utilized through image processing techniques such as edge detection [7] and the Hough transform [8], enabling accurate extraction of geometric boundaries for regular buildings. Meanwhile, shadow features [9,10] have been incorporated as crucial auxiliary discriminant features, substantially improving building extraction accuracy. With modern societal development, high-rise buildings demonstrate increasing complexity and diversity in both morphological configurations and material compositions, presenting distinct functional advantages over traditional low-rise structures. However, conventional edge and shadow feature extraction methods suffer from significant performance degradation because it is difficult to construct robust and universal modeling frameworks capable of adapting to such complexity [11], which poses new challenges to existing technical systems.
To address these limitations, deep-learning-based methods have emerged as highly promising solutions [12]. Conventional convolutional neural networks (CNNs) [13] demonstrate remarkable capabilities in local feature extraction through their hierarchical learning architecture. However, the fully connected layers at the final stage of the architecture require fixed-size input dimensions, which introduces inherent limitations when processing building extraction tasks characterized by significant scale variations and morphological diversity [14,15].
The Fully Convolutional Network (FCN), proposed by Shelhamer et al. [16], is specifically designed for semantic segmentation tasks. In contrast to traditional CNNs, FCNs can process input images of arbitrary sizes and generate corresponding prediction masks with matched spatial dimensions, thereby significantly enhancing the preservation of spatial information in end-to-end pixel-wise prediction tasks. The core architecture employs an encoder–decoder framework, where the encoder is typically constructed from classical backbone networks (e.g., VGG [15], ResNet [17], Res2Net [18], and PVTv2 [19]), generating high-dimensional, low-resolution feature representations through multi-level feature extraction. The decoder progressively restores spatial resolution via upsampling operations, remapping high-level semantic encodings back to the pixel space to achieve end-to-end fully convolutional semantic segmentation [20]. Currently, researchers have proposed various innovative decoding architectures. Several studies [21,22,23,24,25,26,27,28,29] enrich feature representations by aggregating multi-scale contextual information to capture building-related cues, while others [30,31,32,33,34,35,36] employ channel-wise or spatial attention mechanisms for dynamic feature calibration. Additionally, some works [37,38,39,40,41,42,43,44] incorporate boundary supervision during the decoding phase to refine segmentation performance. Furthermore, cascade-style decoding frameworks [45,46,47,48,49,50,51] adopt coarse-to-fine progressive refinement strategies, systematically improving prediction accuracy through multi-stage decoding processes.
Although FCN-based high-resolution remote sensing building extraction has achieved significant progress, notable style variations among different buildings (as illustrated by the blue bounding boxes in Figure 1, where roof colors and shapes exhibit substantial differences) lead to markedly divergent edge features and geometric structures of buildings within the same image, making unified modeling challenging for traditional edge detection methods. Furthermore, modern architectures employ diverse construction materials, resulting in significant variations in material properties and spectral-spatial characteristics, all of which contribute to pronounced intra-class variability among buildings of the same category. Taking the buildings highlighted by red bounding boxes in Figure 1 as examples, building scales span a wide range from tens or hundreds of pixels to thousands, demonstrating systematic multi-scale distribution patterns. While current decoding mechanisms have proposed respective solutions addressing either intra-class variability or multi-scale characteristics, most approaches focus exclusively on one aspect. Recent CNN–Transformer hybrid architectures [52,53,54,55] tackle intra-class variability and multi-scale distribution. However, their fixed multi-scale design cannot adapt to the diverse building scales, and spatial misalignment, together with semantic gaps, still hinders effective feature fusion. These limitations lead to systematic errors, including the omission of small-scale buildings and material-related misclassifications, which become particularly evident when processing modern architectural clusters characterized by morphological diversity and large-scale variations.
To address these challenges, we propose a multi-scale guided context-aware network based on the Transformer architecture, termed Multi-Scale Guided Context-Aware Network (MSGCANet). Our network first employs a pretrained Pyramid Vision Transformer (PVTv2) [19] to generate multi-level feature maps. Building upon this foundation, we innovatively propose the following: (1) Contextual Exploration Module: By synergizing asymmetric convolutions with progressive dilated convolutions, this module establishes hierarchical contextual representations through a multi-scale receptive field expansion mechanism, generating enhanced context-aware features. (2) Window-Guided Multi-Scale Attention Mechanism: This mechanism explicitly establishes cross-scale spatial feature dependencies through a dynamic window partitioning strategy, enabling dynamic feature fusion while preserving local structural details and enhancing global contextual awareness. (3) The cross-level Transformer decoder: As a cross-level feature integration decoder, it performs unified alignment of multi-level semantic features and jointly conducts channel-spatial modeling, effectively achieving dual enhancement of both semantic representation and detail preservation. This framework ultimately yields refined building extraction results.
The proposed MSGCANet significantly enhances multi-scale feature modeling capabilities while achieving synergistic optimization of semantic understanding and geometric details. The key contributions of this work include the following:
  • We propose a hierarchical receptive field expansion mechanism based on asymmetric convolutions and progressive dilated convolutions, which synergistically integrates dynamic multi-scale feature fusion with residual-guided optimization to significantly enhance contextual modeling capabilities in dense prediction tasks;
  • We propose a dynamic multi-scale feature fusion method that uses hierarchical attention for adaptive cross-scale aggregation, preserving local geometry and balancing global–local contexts. Integrated into a Transformer-based decoder, it employs deformable convolutions for spatially adaptive feature alignment;
  • We demonstrate refined building extraction performance on multiple public datasets [56,57,58], with significant improvements in evaluation metrics that validate the proposed MSGCANet.
The remainder of this paper is organized as follows. Section 2 presents a comprehensive review of related work on semantic segmentation decoder mechanisms. Section 3 provides a detailed analysis and introduction of the proposed MSGCANet architecture and its components. Section 4 presents the experimental setup and results analysis. Section 5 conducts ablation studies on the proposed modules. Finally, Section 6 concludes the paper.

2. Related Work

The innovation of the proposed MSGCANet primarily lies in its decoding strategy design. Therefore, this section systematically reviews the evolutionary trajectory and technical limitations of existing decoding methods. Existing decoder methods can be primarily categorized into the following types:
(1) Multi-scale context-based methods: These approaches [21,22,23,24,25,26,27,28] construct multi-scale context modules to characterize foreground objects with varying scales and appearances. The ASPP module proposed in DeepLab [21] achieves multi-scale context capture in a single forward pass through parallel multi-branch architectures and dynamic receptive field adjustment. Subsequently, Zhao et al. [22] developed the Pyramid Pooling Module (PPM), which for the first time realized unified modeling of global and local contexts. Chen et al. [23] integrate atrous convolution with pooling operations in a four-branch parallel architecture, while another research direction [24,25] introduces graph neural network frameworks that map multi-scale features to graph nodes and model cross-scale dependencies via attention mechanisms. State-of-the-art approaches [26,27,28] address the persistent trade-off between fine-grained semantic understanding and geometric precision in conventional methods through cross-domain feature interaction mechanisms. A representative example is the dual-attention selective kernel network developed by Sultonov et al. [28], which achieves feature refinement via adaptive multi-scale feature fusion integrated with cascaded channel-spatial attention mechanisms.
(2) Attention-based methods: These approaches [30,31,32,33,34,35,36] employ synergistic integration of self-attention and cross-attention mechanisms to model long-range dependencies for enhanced feature consistency. Several studies [30,31] have demonstrated that joint modeling of spatial attention (PAM) and channel attention (CAM) can effectively capture global dependencies. With further research development, some works [32,33] have integrated frequency-domain attention with multi-scale detection methods into attention mechanisms. The dual-path hierarchical attention framework proposed by Wang et al. [34] combines window-based linear self-attention for global context modeling with convolutional spatial detail preservation to achieve efficient building extraction from high-resolution remote sensing imagery. Current research focuses on task-specific lightweight attention mechanisms, as exemplified by the GCM+LCM module proposed by Zhai et al. [35] for UAV change detection tasks, which effectively captures bi-temporal differences through global–local contrastive mechanisms. Further advancing this direction, Fu et al. [36] developed a complementarity-aware fusion mechanism that explicitly decouples shared and distinct features between convolutional and Transformer branches, enforcing triplet constraints to maximize cross-branch feature interactions while adaptively exchanging local patterns and global dependencies through gated feature recalibration, thereby achieving dynamic fusion of local and global contextual representations.
(3) Boundary-supervised methods: These approaches [37,38,39,40,41,42,43,44] enhance the geometric precision of segmentation boundaries through boundary-supervised loss functions. The technical evolution exhibits three characteristic phases: (a) an initial stage employing post-processing methods such as CRF [37] for boundary refinement; (b) an intermediate stage indirectly enhancing boundary representation via multi-scale feature fusion [38]; and (c) a recent stage addressing core challenges in conventional approaches (boundary ambiguity, small target omission, and complex background interference) through three pivotal innovations: optimization of loss functions [39,40], enhancement of boundary features via multimodal/multi-scale feature interaction [41,42], and implementation of novel network architectures [43,44], collectively advancing segmentation precision.
(4) Coarse-to-fine methods: These approaches [45,46,47,48,49] employ multi-stage optimization to progressively refine segmentation results from coarse predictions to sub-pixel boundary delineation. The framework proposed by Li et al. [45] pioneered difficulty-aware progressive optimization through hierarchical network partitioning, while the recursive framework by Jing et al. [46] further advanced multi-scale feature fusion. Researchers have also developed specialized networks for building extraction, including the boundary refinement strategy by Guo et al. [47], the ASPP-based multi-scale fusion architecture by Sheikh et al. [48], and the dual-task coordination mechanism integrating vector extraction with semantic segmentation proposed by Liu et al. [49].
While existing decoder research has made significant progress, most approaches still rely on predefined parameter configurations, lacking dynamic scale adaptability and exhibiting insufficient alignment between low-level features and high-level semantics. Specifically, although multi-scale context methods capture multi-scale information through fixed modules, they struggle to dynamically adjust receptive fields or fusion strategies based on input features, resulting in inaccurate feature representation when handling buildings with substantial scale variations. Attention mechanisms, while capable of modeling long-range dependencies, overly depend on static weight configurations, frequently leading to spatial detail loss or computational redundancy during global feature integration. Boundary-supervised methods, despite incorporating loss function optimization, continue to face fundamental challenges, including boundary ambiguity, small target omission, and complex background interference, which constrain their ability to enhance building geometric precision. Furthermore, coarse-to-fine strategies suffer from error propagation in initial predictions that compromises final boundary refinement. These limitations collectively hinder decoders’ capacity to dynamically accommodate the cross-scale representation requirements of buildings, adversely affecting multi-level feature expression from microscopic details to macroscopic spatial layouts.
Moreover, existing methods predominantly address intra-class variations and multi-scale features in isolation, neglecting their synergistic effects. In recent years, researchers have proposed hybrid CNN-Transformer architectures as representative fusion approaches [52,53,54,55] that leverage multimodal feature complementarity, offering innovative solutions to simultaneously tackle both intra-class variability and multi-scale challenges. These methods employ local-global feature coordination and dynamic contextual modeling to jointly optimize local sensitivity and global consistency, thereby significantly mitigating the performance bottlenecks inherent in traditional single-perspective approaches. However, static multi-scale modeling struggles to adapt to the actual scale distribution of buildings, while spatial misalignment and semantic gaps during feature fusion constrain cross-modal synergistic effects. Additionally, the computational overhead of attention mechanisms limits their practical application to high-resolution imagery. These inherent limitations collectively result in constrained accuracy when processing modern architectural clusters characterized by morphological diversity and substantial scale variations.

3. Methodology

3.1. Overview

To tackle the challenges associated with building extraction from high-resolution remote sensing images, we propose MSGCANet, a multi-scale guided, context-aware Transformer architecture (Figure 2).
The network extracts hierarchical features $\{F_i\}_{i=1}^{4}$ at four distinct levels using the PVTv2 backbone network [19]. Each feature map $F_i$ is subsequently enhanced by the Contextual Exploration Module (CEM), which captures multi-scale contextual information through a parallel multi-branch architecture, ultimately producing refined features $\{F_i\}_{i=1}^{4}$. To address the intra-class variability and multi-scale distribution characteristics of building features in remote sensing imagery, MSGCANet innovatively designs a decoding mechanism: first, feature alignment is achieved, followed by element-wise averaging to generate the unified feature representation $X$ that preserves hierarchical information. The integrated feature $X$ is then processed by the Window-Guided Multi-Scale Attention Mechanism (WGMSAM) to obtain level-specific features $\{T_i\}_{i=1}^{4}$, which simultaneously capture local details and global contextual information while maintaining the structural integrity of buildings. These features are further integrated through a cross-layer Transformer decoder to produce the reconstructed feature $X_{\mathrm{attn}}$. The final building extraction prediction map $P$ is generated through convolutional operations followed by upsampling.

3.2. Contextual Exploration Module

The PVTv2 backbone network [19] exhibits three critical issues in feature extraction: (1) high-level feature maps suffer from significant spatial resolution reduction due to repeated downsampling operations, leading to local detail loss and edge blurring; (2) simple upsampling or convolutional operations between different hierarchical feature maps lack effective cross-scale interaction mechanisms, resulting in insufficient fusion of semantic and spatial information; (3) the fixed-size window partitioning scheme significantly constrains cross-window global interaction in deep feature maps, thereby restricting receptive field expansion and consequently impairing comprehensive contextual information extraction from the feature representations.
To reduce the complexity of subsequent decoding processes and achieve more precise predictions, we propose an innovative Contextual Exploration Module (Figure 3) to enhance multi-level feature representations. The module adopts a parallel multi-branch architecture that systematically constructs geometrically progressive receptive field expansion through its concurrent processing branches. This innovative design generates multi-scale overlapping contextual perception zones, enabling simultaneous extraction of local textural details, mid-range structural patterns, and global spatial relationships, which form a critical foundation for subsequent multi-scale window decoding operations.
Specifically, the input feature map $F_i \in \mathbb{R}^{H \times W \times C_{in}}$ is processed through four independent paths:
  • The base path employs a 1 × 1 pointwise convolution for channel transformation:
    $$Y_0 = \mathrm{Conv}_{1 \times 1}(F_i) + b_0$$
    where $\mathrm{Conv}_{1 \times 1}$ transforms the channel dimension from $C_{in}$ to $C_{out}$. The output feature $Y_0 \in \mathbb{R}^{H \times W \times C_{out}}$ strictly maintains the original spatial resolution with a 1 × 1 pixel receptive field, specifically designed for capturing microscopic-scale features. Notably, this initial channel transformation convolution is consistently incorporated in all subsequent three branches, ensuring identical feature space foundations prior to further operations.
  • The primary expansion path adopts a four-stage cascade structure: a 1 × 1 channel compression convolution, followed by a 1 × 3 and 3 × 1 asymmetric convolution pair, and finally a 3 × 3 dilated convolution with rate 3. This yields output feature $Y_1 \in \mathbb{R}^{H \times W \times C_{out}}$ with an effective 7 × 7 receptive field.
  • The intermediate expansion path extends the primary path by using a 1 × 5 and 5 × 1 asymmetric convolution pair combined with a rate-5 3 × 3 dilated convolution, producing output feature $Y_2 \in \mathbb{R}^{H \times W \times C_{out}}$ with a 19 × 19 receptive field.
  • The advanced expansion path further implements a 1 × 7 and 7 × 1 asymmetric convolution pair coupled with a rate-7 3 × 3 dilated convolution, expanding the receptive field of output feature $Y_3 \in \mathbb{R}^{H \times W \times C_{out}}$ to 31 × 31 pixels, specifically optimized for capturing the global context of large-scale targets.
The branch outputs are concatenated sequentially from $Y_0$ to $Y_3$ along the channel dimension through a deterministic concatenation operation to construct a high-dimensional feature representation $Y_{cat} \in \mathbb{R}^{H \times W \times 4C_{out}}$, which mathematically manifests as a linear reorganization of channel indices:
$$Y_{cat}(h, w, k) = \begin{cases} Y_0(h, w, k), & 0 \le k < C_{out} \\ Y_1(h, w, k - C_{out}), & C_{out} \le k < 2C_{out} \\ Y_2(h, w, k - 2C_{out}), & 2C_{out} \le k < 3C_{out} \\ Y_3(h, w, k - 3C_{out}), & 3C_{out} \le k < 4C_{out} \end{cases}$$
This concatenation operation preserves the spatial structure of each branch feature by stacking them linearly along the channel direction at $(h, w)$ coordinates. The concatenated feature tensor is then fed into a learnable feature fusion layer, in which a 3 × 3 convolution first linearly combines the multi-branch features across all channels, followed by a ReLU activation that introduces non-linear mappings by suppressing negative responses and enhancing positive activations, allowing the module to capture richer cross-scale and cross-channel interactions:
$$Y_{fused} = \mathrm{Conv}_{3 \times 3}(Y_{cat})$$
where $\mathrm{Conv}_{3 \times 3}$ denotes the standard convolution operation with kernel size 3 × 3, and its weight parameters are $\Theta \in \mathbb{R}^{3 \times 3 \times 4C_{out} \times C_{out}}$. Notably, the 3 × 3 convolution in the fusion layer not only compresses feature channels but, more importantly, establishes a dynamic weighting mechanism for cross-scale features, enabling adaptive contribution adjustment based on input content.
To maintain feature stability, we incorporate residual learning. The original input $F_i$ undergoes channel alignment via a 1 × 1 convolution to produce the residual feature $Y_{res} \in \mathbb{R}^{H \times W \times C_{out}}$. The final enhanced feature $F_i$ is obtained by:
$$F_i = \mathrm{ReLU}(Y_{fused} + Y_{res})$$
The enhanced output features $F_i$ incorporate multi-scale contextual information ranging from local details to global semantics through a hierarchical contextual exploration mechanism, while systematically preserving the original fine-grained details.
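The concatenation, fusion, and residual steps of the module can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the four branch outputs are taken as given random arrays, the 3 × 3 convolution uses zero "same" padding, biases are omitted, and the names `conv3x3`, `fuse_branches`, `theta`, and `w_res` are hypothetical. The sketch follows the two equations as written, applying ReLU once after the residual sum.

```python
import numpy as np

def conv3x3(x, weight):
    """Zero-padded 3x3 convolution. x: (H, W, C_in); weight: (3, 3, C_in, C_out)."""
    H, W, _ = x.shape
    x_p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, weight.shape[-1]))
    # Accumulate the contribution of each of the nine kernel taps.
    for dy in range(3):
        for dx in range(3):
            out += x_p[dy:dy + H, dx:dx + W, :] @ weight[dy, dx]
    return out

def fuse_branches(f_in, branches, theta, w_res):
    """Concatenate branch outputs, fuse them, and add the residual path."""
    y_cat = np.concatenate(branches, axis=-1)   # (H, W, 4 * C_out)
    y_fused = conv3x3(y_cat, theta)             # (H, W, C_out)
    y_res = f_in @ w_res                        # 1x1 conv as a per-pixel matmul
    return y_cat, np.maximum(y_fused + y_res, 0.0)

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
H, W, C_in, C_out = 5, 6, 3, 2
f_in = rng.normal(size=(H, W, C_in))
branches = [rng.normal(size=(H, W, C_out)) for _ in range(4)]  # stand-ins for Y_0..Y_3
theta = rng.normal(size=(3, 3, 4 * C_out, C_out))
w_res = rng.normal(size=(C_in, C_out))
y_cat, enhanced = fuse_branches(f_in, branches, theta, w_res)
```

Channel block `k` of `y_cat` holds branch `k`, which mirrors the piecewise index mapping given for $Y_{cat}$.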

3.3. Window-Guided Multi-Scale Attention Mechanism

This part presents the Window-Guided Multi-Scale Attention Mechanism (Figure 4) in MSGCANet, which computes the final output tensor $X_{out} \in \mathbb{R}^{B \times N \times C}$ through self-attention operations performed within localized windows of varying sizes, effectively integrating multi-scale contextual information.
The flattening operation compromises the inherent 2D spatial relationships among pixels, hindering the window mechanism's capacity to effectively capture local contextual information. To mitigate this issue, the WGMSAM framework transforms the input feature map $X \in \mathbb{R}^{B \times N \times C}$ (where $B$ denotes batch size, $N = H \times W$ represents the spatial token count, and $H$, $W$, and $C$ correspond to height, width, and channel dimensions, respectively) into a 4D tensor $X \in \mathbb{R}^{B \times H \times W \times C}$, thereby reconstructing the original spatial structure.
Before performing local feature capture via shifted windows, the input feature map must undergo boundary padding to guarantee that its spatial dimensions $(H, W)$ are exactly divisible by the current window size $w_k$. The multi-scale window mechanism (where $w_k \in \{w_1, \dots, w_K\}$) demonstrates distinct functional specialization for contextual information capture: small windows ($w_1$) specialize in local fine-grained feature extraction, large windows ($w_K$) establish long-range semantic dependencies, and the intermediate scales ($w_2, \dots, w_{K-1}$) construct hierarchical pathways. However, this architectural design inherently invalidates conventional static padding methods or fixed input-size constraints due to its computational demands. We therefore introduce an adaptive dynamic padding strategy that computes the required padding amounts $p_h^{(k)}$ and $p_w^{(k)}$ for each window scale $k$, producing the padded feature map $X_p^{(k)} \in \mathbb{R}^{B \times H_p^{(k)} \times W_p^{(k)} \times C}$, with $H_p^{(k)} = H + p_h^{(k)}$ and $W_p^{(k)} = W + p_w^{(k)}$.
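The text does not give a closed form for the padding amounts. One common choice, assumed here, is the smallest non-negative padding that makes each dimension a multiple of the window size. The function name `window_padding` and the example sizes are hypothetical.

```python
def window_padding(height, width, window_size):
    """Smallest (p_h, p_w) >= 0 making height and width divisible by window_size.

    Assumed formula; the source only states that padding depends on the window scale.
    """
    pad_h = (window_size - height % window_size) % window_size
    pad_w = (window_size - width % window_size) % window_size
    return pad_h, pad_w

# Hypothetical feature-map size and window sizes, for illustration only.
H, W = 30, 45
for w_k in (4, 7, 16):
    p_h, p_w = window_padding(H, W, w_k)
    print(f"w_k={w_k}: pad=({p_h}, {p_w}), padded size=({H + p_h}, {W + p_w})")
```

The outer modulo keeps the padding at zero when a dimension is already divisible, so no extra row or column of windows is created.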
For the padded feature maps $X_p^{(k)}$ corresponding to different window scales, we implement a non-overlapping partitioning strategy based on window size $w_k \times w_k$. Specifically, we divide the feature map into $n_h^{(k)} = H_p^{(k)} / w_k$ windows along the height dimension and $n_w^{(k)} = W_p^{(k)} / w_k$ windows along the width dimension. The window size $w_k$ directly determines the grid partitioning configuration, with larger $w_k$ values resulting in fewer windows ($n_h \times n_w$) and vice versa. Through this partitioning scheme, the feature map is transformed into a windowed representation $X_w^{(k)} \in \mathbb{R}^{(B \cdot n_h^{(k)} \cdot n_w^{(k)}) \times (w_k \cdot w_k) \times C}$, where each $w_k \times w_k$ window preserves the complete spatial neighborhood relationships of the original feature map. At this point, the original batch dimension $B$ is expanded to $B \times n_h^{(k)} \times n_w^{(k)}$, thereby allowing efficient parallel computation at the window level.
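The partitioning can be expressed with a reshape and an axis permutation. The sketch below is a NumPy illustration under the assumption that the input is already padded to a multiple of the window size; `window_partition` and `window_reverse` are hypothetical names, and the reverse function is included only to show that the operation is lossless.

```python
import numpy as np

def window_partition(x, w):
    """(B, H, W, C) -> (B * n_h * n_w, w * w, C); H and W must be divisible by w."""
    B, H, W, C = x.shape
    n_h, n_w = H // w, W // w
    x = x.reshape(B, n_h, w, n_w, w, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)        # group the two window-grid axes together
    return x.reshape(B * n_h * n_w, w * w, C)

def window_reverse(windows, w, B, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    C = windows.shape[-1]
    n_h, n_w = H // w, W // w
    x = windows.reshape(B, n_h, n_w, w, w, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, C)

# Illustrative sizes only: B=2, H=8, W=12, C=3, window size 4.
x = np.arange(2 * 8 * 12 * 3, dtype=np.float32).reshape(2, 8, 12, 3)
windows = window_partition(x, 4)             # 2 * 2 * 3 = 12 windows of 16 tokens
restored = window_reverse(windows, 4, 2, 8, 12)
```

Because the window index is folded into the leading axis, any batched operation applied to `windows` runs over all windows in parallel.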
For the window features $X_w^{(k)}$ obtained at each window scale $k$, multi-head attention computation is performed, where each independent attention head $h_m$ ($m = 1, \dots, M$) possesses its dedicated parameter matrices $W_m^Q, W_m^K, W_m^V \in \mathbb{R}^{d_h \times d_h}$ (where $d_h = C / M$ denotes the dimension per head). The input features are projected into different representation spaces through the three independent linear transformation matrices $W_m^Q$, $W_m^K$, and $W_m^V$:
$$Q_m^{(k)} = X_w^{(k)} W_m^Q, \quad K_m^{(k)} = X_w^{(k)} W_m^K, \quad V_m^{(k)} = X_w^{(k)} W_m^V$$
where, for each window scale $k$ and attention head $h_m$, the query vectors $Q_m^{(k)}$ actively retrieve relevant features, the key vectors $K_m^{(k)}$ establish correspondences, and the value vectors $V_m^{(k)}$ preserve the original representations.
Subsequently, the computation proceeds with matrix multiplication between the query vectors $Q_m^{(k)}$ and the transposed key vectors $\left(K_m^{(k)}\right)^{T}$, yielding the raw similarity dot products $Q_m^{(k)} \left(K_m^{(k)}\right)^{T}$ that capture pairwise affinities between all spatial positions within each attention head. The final normalized attention weights $A_m^{(k)}$ are computed as:
$$A_m^{(k)} = \mathrm{softmax}\left(\frac{Q_m^{(k)} \left(K_m^{(k)}\right)^{T}}{\sqrt{d_h}} + R_{rel}^{k}\right)$$
where each element $a_{ij}$ quantifies the dynamic association strength between query position $i$ and key position $j$, satisfying the probability normalization condition $\sum_j a_{ij} = 1$. This formulation establishes an adaptive feature aggregation mechanism within the local window.
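A single attention head over a batch of windows can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the projection matrices are given shape `(C, d_h)` so that the matrix products are defined for `C`-dimensional tokens, the relative position term is passed in as a ready-made `(tokens, tokens)` array, and all names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention_head(x_w, w_q, w_k, w_v, rel_bias):
    """One head of windowed self-attention.

    x_w: (num_windows, tokens, C); w_q, w_k, w_v: (C, d_h) (assumed shape);
    rel_bias: (tokens, tokens), added to the scaled scores before softmax.
    Returns the attention weights and the weighted sum of the values.
    """
    q, k, v = x_w @ w_q, x_w @ w_k, x_w @ w_v
    d_h = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h) + rel_bias
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn, attn @ v

# Illustrative sizes only: 6 windows of 4 x 4 tokens, C = 8, d_h = 4.
rng = np.random.default_rng(0)
x_w = rng.normal(size=(6, 16, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
attn, head_out = window_attention_head(x_w, w_q, w_k, w_v, np.zeros((16, 16)))
```

Subtracting the row maximum before exponentiation leaves the softmax unchanged but avoids overflow for large scores.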
Notably, we further incorporate window-wise relative position encoding $R_{rel}^{k}$ to inject geometric structural information, where the encoding's coordinate range dynamically adapts to the window size $w_k$. This encoding is implemented through a learnable relative position bias matrix $\mathbf{B} \in \mathbb{R}^{(2w_k - 1) \times (2w_k - 1)}$. For any two positions $i = (i_x, i_y)$ and $j = (j_x, j_y)$ within the window, their relative displacement $(\Delta x, \Delta y) = (i_x - j_x + w_k - 1, \; i_y - j_y + w_k - 1)$ serves as the index pair used to retrieve the corresponding bias term $b_{\Delta x, \Delta y}$ from $\mathbf{B}$. This parameterization strictly guarantees translation equivariance: when the input features undergo translation, the attention weights adapt according to relative positional relationships while preserving their fundamental capacity for geometric structure modeling. In implementation, we initialize the bias matrix using a truncated normal distribution and enable parameter sharing across different window sizes through bilinear interpolation, allowing the model to adaptively handle multi-scale features without retraining position encoding parameters.
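The index computation described above can be sketched in NumPy. The sketch assumes tokens are ordered row by row inside the window and fills the bias table with small random values from a plain (not truncated) normal distribution purely for illustration; the cross-scale sharing by bilinear interpolation is not shown. `relative_position_index` and `gather_bias` are hypothetical names.

```python
import numpy as np

def relative_position_index(w):
    """Offsets (dx, dy) for every token pair in a w x w window.

    Returns an integer array of shape (w*w, w*w, 2) with entries in [0, 2w - 2],
    following (i_x - j_x + w - 1, i_y - j_y + w - 1).
    """
    ys, xs = np.meshgrid(np.arange(w), np.arange(w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=-1)   # (w*w, 2) as (x, y)
    return coords[:, None, :] - coords[None, :, :] + (w - 1)

def gather_bias(bias_table, w):
    """Look up the (w*w, w*w) bias for all token pairs from a (2w-1, 2w-1) table."""
    rel = relative_position_index(w)
    return bias_table[rel[..., 0], rel[..., 1]]

# Illustrative window size and table values (assumed, not from the paper).
w_k = 3
bias_table = np.random.default_rng(0).normal(scale=0.02, size=(2 * w_k - 1, 2 * w_k - 1))
r_rel = gather_bias(bias_table, w_k)          # (9, 9), ready to add to attention scores
```

Two token pairs with the same displacement read the same table entry, which is what makes the bias depend only on relative position.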
Compared with alternative position encoding schemes, our window-relative position encoding demonstrates distinct advantages. Absolute position encoding captures global positional information but violates the translation invariance that is fundamental to dense prediction tasks, and rotary position encoding (RoPE) cannot exploit its long-sequence advantages in local window scenarios. In contrast, our approach directly models local geometric relationships while maintaining translation equivariance. This design enables the attention matrix to significantly enhance the model's perception of regular spatial patterns.
The attention weights are applied to perform a weighted summation of the value vectors, yielding the computed results $T_m$ ($m = 1, \dots, M$) for each attention head within the window features:
$$T_m = A_m^{(k)} V_m^{(k)}, \quad m = 1, \dots, M$$
The outputs from all attention heads are then concatenated along the channel dimension to obtain the enhanced window features $X^{(k)}_{w}$ at the corresponding scale. This enables each window feature to integrate enriched information captured through diverse attention patterns across multiple heads.
Subsequently, the resultant features undergo inverse transformation processing to restore the original spatial dimensions, thereby maintaining dimensional consistency with the input features and completing the feature computation pipeline for a single window scale. By performing average aggregation on the computation results from different window scales, the final output $X_{out} \in \mathbb{R}^{B \times N \times C}$ effectively integrates multi-scale contextual information, achieving an optimal balance between local details and global structures.
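The per-scale pipeline (pad, partition, process, restore, crop) and the final averaging can be sketched end to end. This is a simplified NumPy illustration with several assumptions: zero padding is added at the bottom and right, the padded region is cropped after restoring, and the per-window attention is replaced by a caller-supplied `window_fn` placeholder. All function names are hypothetical.

```python
import numpy as np

def partition(x, w):
    """(B, H, W, C) -> (B * n_h * n_w, w * w, C)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // w, w, W // w, w, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, w * w, C)

def reverse(windows, w, B, H, W):
    """Inverse of partition: back to (B, H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(B, H // w, W // w, w, w, C).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, C)

def multi_scale_window_apply(x_seq, H, W, window_sizes, window_fn):
    """Apply window_fn at every window size and average the restored outputs.

    x_seq: (B, N, C) with N = H * W. window_fn maps (num_windows, tokens, C)
    to an array of the same shape; it stands in for windowed attention.
    """
    B, N, C = x_seq.shape
    x = x_seq.reshape(B, H, W, C)                 # recover the 2D layout
    outputs = []
    for w in window_sizes:
        p_h, p_w = (-H) % w, (-W) % w             # assumed: minimal bottom/right padding
        x_p = np.pad(x, ((0, 0), (0, p_h), (0, p_w), (0, 0)))
        win = window_fn(partition(x_p, w))
        y = reverse(win, w, B, H + p_h, W + p_w)[:, :H, :W, :]   # crop the padding
        outputs.append(y.reshape(B, N, C))
    return np.mean(outputs, axis=0)               # average over window scales

# Illustrative sizes only; the identity function stands in for attention.
x_seq = np.random.default_rng(0).normal(size=(2, 6 * 10, 4))
x_out = multi_scale_window_apply(x_seq, 6, 10, window_sizes=(2, 4, 8), window_fn=lambda t: t)
```

With the identity placeholder the output reproduces the input, which is a quick check that padding, partitioning, and restoration are mutually consistent at every scale.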

3.4. Cross-Level Transformer Decoder

Based on the Window-Guided Multi-Scale Attention Mechanism, MSGCANet proposes a cross-level Transformer decoder built upon hybrid multi-scale window contextual attention.
For the multi-scale feature maps $\{F_i\}_{i=1}^{4}$ with $F_i \in \mathbb{R}^{B \times C_i \times H_i \times W_i}$, extracted by the backbone and enhanced by the Receptive Field Block (RFB), we first project each level into a standardized feature space: the features with varying channel dimensions $C_i$ are mapped to a unified embedding space ($C_{embed} = 256$), followed by batch normalization and ReLU activation. Spatial alignment is then performed by resampling all features to the base resolution $(H_1, W_1)$ of the highest-resolution feature map $F_1$. This preserves shallow-level fine-grained details while ensuring precise registration of deep semantic features on the high-resolution grid.
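Multi-level features are thus projected into a unified representation space through normalization and alignment. We fuse them via arithmetic averaging, where $\bar{F}_i$ denotes the projected and resampled feature of level $i$ and $L$ is the number of levels:
$$X = \frac{1}{L} \sum_{i=1}^{L} \bar{F}_i$$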
The unified tensor X preserves alignment with the highest-resolution input: shallow features retain fine local details, while upsampled deep features provide global contextual information through their superpixel representations.
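The alignment and averaging step can be illustrated with a small single-channel sketch. The paper does not state the interpolation mode, so nearest-neighbor resampling is used here purely as an assumption; the channel projection to the embedding space is omitted and the function names are hypothetical.

```python
def upsample_nearest(grid, out_h, out_w):
    """Resample a 2D grid to (out_h, out_w) with nearest-neighbor lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def fuse_levels(levels):
    """Align every level to the resolution of levels[0], then average them."""
    base_h, base_w = len(levels[0]), len(levels[0][0])
    aligned = [upsample_nearest(f, base_h, base_w) for f in levels]
    n = len(aligned)
    return [[sum(a[r][c] for a in aligned) / n for c in range(base_w)]
            for r in range(base_h)]


def constant_map(size, value):
    """Helper that builds a size x size map filled with one value."""
    return [[value] * size for _ in range(size)]


# Four toy levels at decreasing resolution (placeholder values only).
levels = [constant_map(8, 1.0), constant_map(4, 3.0),
          constant_map(2, 5.0), constant_map(1, 7.0)]
fused_map = fuse_levels(levels)
```

Every cell of `fused_map` is the mean of the four level values at that location, and the output keeps the 8 x 8 resolution of the first level.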
Then, the fused feature tensor $X \in \mathbb{R}^{B \times C \times H \times W}$ is fed directly into a series of $L$ cascaded Transformer blocks for iterative refinement, where the output of each block serves as input to the next:
$$X^{(l)} = \mathrm{Block}_l\left(X^{(l-1)}\right), \quad l = 1, \ldots, L$$
For an individual block, the input feature $X^{(l-1)} \in \mathbb{R}^{B \times D \times H \times W}$ is first flattened into a sequence $X_{flat}^{(l-1)} \in \mathbb{R}^{B \times N \times D}$ ($N = HW$), followed by layer normalization (LayerNorm) with mean $\mu$ and standard deviation $\sigma$ computed along the channel dimension. Subsequently, spatial feature interaction is implemented through the Window-Guided Multi-Scale Attention Mechanism (WGMSAM, see Section 3.3). Multi-scale windows operate in parallel across feature levels, achieving implicit specialization through parameter differentiation of attention heads. The final output $X_{out} \in \mathbb{R}^{B \times (H \cdot W) \times C}$ integrates computational results from all window scales. As this output is in sequential form, tensor reshaping is required to restore the spatial structure, followed by fusion with the input $X^{(l-1)} \in \mathbb{R}^{B \times C \times H \times W}$ through a residual connection:
$$X'^{(l-1)} = X^{(l-1)} + \mathrm{Reshape}(X_{out})$$
To further enhance feature representation capability, layer normalization is first applied to $X'^{(l-1)}$, followed by feature transformation through a channel-mixing MLP. The final output $X''^{(l-1)}$ is generated by combining the MLP output with the original features through a residual connection, which simultaneously preserves spatial dependencies and enhances semantic representation capability.
The block output $X''^{(l-1)}$ propagates as input $X^{(l)}$ to the subsequent block, enabling iterative refinement through cascaded processing: shallow blocks capture local details via small-window attention, while deep blocks model global semantics through large-window attention, with multi-scale window attention implicitly encoding hierarchical interactions via parameter sharing. The final output $X^{(L)} \in \mathbb{R}^{B \times C \times H \times W}$ integrates cross-level features for joint detail-semantic representation.
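The control flow of one block (normalization, attention, residual connection, normalization, MLP, residual connection) and the cascade over blocks can be sketched as below. This is a structural sketch under stated assumptions: tokens are already flattened into a list of channel vectors, LayerNorm has no learnable affine parameters, and the `attention` and `mlp` callables are hypothetical stand-ins rather than WGMSAM or the authors' MLP.

```python
import math


def layer_norm(tokens, eps=1e-5):
    """Normalize each token (a list of channel values) along the channel axis."""
    out = []
    for t in tokens:
        mu = sum(t) / len(t)
        var = sum((v - mu) ** 2 for v in t) / len(t)
        out.append([(v - mu) / math.sqrt(var + eps) for v in t])
    return out


def add(a, b):
    """Element-wise sum of two token sequences (the residual connection)."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def block(x, attention, mlp):
    """One block: attention sub-layer and MLP sub-layer, each with a residual."""
    after_attention = add(x, attention(layer_norm(x)))
    after_mlp = add(after_attention, mlp(layer_norm(after_attention)))
    return after_mlp


def run_decoder(x, blocks):
    """Feed the output of each block into the next one."""
    for attention, mlp in blocks:
        x = block(x, attention, mlp)
    return x


def zeros_like(tokens):
    """Stand-in sub-layer that contributes nothing to the residual stream."""
    return [[0.0 for _ in t] for t in tokens]


def half_scale(tokens):
    """Stand-in channel-mixing MLP that simply scales its input."""
    return [[0.5 * v for v in t] for t in tokens]


x = [[1.0, 3.0], [2.0, 6.0]]              # N = 2 tokens, D = 2 channels
unchanged = run_decoder(x, [(zeros_like, zeros_like)] * 2)
refined = run_decoder(x, [(zeros_like, half_scale)])
```

With sub-layers that output zeros, the residual paths make the cascade an identity mapping, which is a convenient sanity check before plugging in real attention and MLP modules.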

4. Experiment

4.1. Dataset

For experimental validation, we employ three publicly available building extraction datasets: the WHU Building Dataset, the Massachusetts Buildings Dataset, and the Inria Building Dataset. The detailed specifications of each dataset are described below:
  • The Massachusetts Buildings Dataset [57] encompasses 151 aerial images of the Boston region. Each image measures 1500 × 1500 pixels, covering 2.25 km² at 1 m spatial resolution, with a total coverage of approximately 340 km². The original partitioning designates 137 images for training, 10 for testing, and 4 for validation. During training, images and their corresponding labels were randomly cropped to 512 × 512 pixel patches, while validation and testing images were padded to 1536 × 1536 pixels to ensure divisibility by 32. Padded regions were excluded from the evaluation metrics to preserve assessment accuracy (as illustrated in Figure 5a).
  • The WHU Building Dataset [56] incorporates both aerial and satellite imagery subsets. This investigation focuses on the aerial imagery subset acquired in Christchurch, New Zealand. Spanning 450 km², the subset contains annotations for over 220,000 distinct buildings extracted from source imagery with 0.075 m spatial resolution. The processed dataset comprises 8189 image tiles at 0.3 m resolution, allocated as follows: 4736 for training, 1036 for validation, and 2416 for testing (as illustrated in Figure 5b).
  • The Inria Building Dataset [58] incorporates 360 orthorectified color aerial images at 0.3 m spatial resolution, encompassing five representative urban zones in the United States (Austin, Chicago, Kitsap) and Europe (Tyrol, Vienna) with an aggregate coverage of 810 km² (405 km² each for the training and test sets). Following official partitioning protocols, this investigation employed stratified sampling by randomly designating 1 to 5 images per city for validation while allocating the remaining images for training. Data preprocessing began with zero-padding of the original 5000 × 5000 pixel images to 5120 × 5120 pixels, which were then split into 512 × 512 pixel patches. After quality control removed patches containing no buildings, the refined dataset contained 9737 training samples and 1942 validation samples (as illustrated in Figure 5c).
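The padded sizes quoted above can be checked with a few lines of arithmetic. Both 1536 and 5120 are the next multiple of 512 above the original image size; treating 512 as the padding multiple is an observation about the numbers, not a statement of the authors' procedure, and the helper names are hypothetical.

```python
def padded_size(size, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to `size`."""
    return -(-size // multiple) * multiple


def tile_origins(padded, patch):
    """Top-left corners of non-overlapping patch x patch tiles in a square image."""
    return [(top, left)
            for top in range(0, padded, patch)
            for left in range(0, padded, patch)]


massachusetts = padded_size(1500, 512)   # padded side length for evaluation
inria = padded_size(5000, 512)           # padded side length before tiling
inria_tiles = tile_origins(inria, 512)   # tiles per padded Inria image
divisible_by_32 = massachusetts % 32 == 0 and inria % 32 == 0
```

Each padded 5120 × 5120 Inria image yields a 10 × 10 grid of 512 × 512 patches before empty patches are filtered out.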

4.2. Evaluation Metrics

To comprehensively evaluate the performance of MSGCANet, we employed four key metrics including Intersection over Union (IoU) [59], Precision (P) [60], Recall (R) [60], and F1-score (F1) [61].
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
where True Positives (TPs) denote pixels correctly identified as buildings, False Positives (FPs) represent non-building pixels misclassified as buildings, False Negatives (FNs) correspond to undetected actual building pixels, and True Negatives (TNs) indicate correctly classified non-building pixels.
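A minimal sketch of these four metrics, computed from flat binary masks (1 = building, 0 = background), is given below. The function names and the toy masks are illustrative assumptions. The snippet also uses the algebraic identity $F1 = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, which follows directly from the definitions above.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, and FN for binary labels (1 = building, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def building_metrics(tp, fp, fn):
    """IoU, Precision, Recall, and F1 from pixel counts."""
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


def f1_from_iou(iou):
    """F1 expressed through IoU; both depend only on TP, FP, and FN."""
    return 2 * iou / (1 + iou)


# Toy masks for illustration only (not data from the paper).
pred = [1, 1, 1, 0, 0, 1, 0, 0]
truth = [1, 1, 0, 0, 1, 1, 0, 0]
tp, fp, fn = confusion_counts(pred, truth)
metrics = building_metrics(tp, fp, fn)
```

Applied to the IoU values reported in the abstract, `f1_from_iou` gives F1-scores within roughly 0.01 percentage points of the reported ones, so the reported IoU and F1 figures are mutually consistent up to rounding or averaging effects.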

4.3. Experimental Settings

All experiments were conducted on an NVIDIA GeForce RTX 3090 GPU (24 GB memory) using PyTorch 1.8.1 (CUDA 11.1). The training protocol integrated three components: AdamW optimization [62] with cosine learning rate scheduling, data augmentation via random horizontal and vertical flipping, and dataset-specific parameter configurations. Following the parameter configuration in [34], our experimental setup was specified as follows: the WHU dataset employed an initial learning rate of $10^{-3}$ with a batch size of 12, the Massachusetts dataset used a learning rate of $5 \times 10^{-4}$ with a batch size of 2, and the Inria dataset used the same $5 \times 10^{-4}$ learning rate as Massachusetts with a batch size of 12.
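The dataset-specific settings and the shape of a cosine learning rate schedule can be summarized in a short sketch. The learning rates are read here as $10^{-3}$ and $5 \times 10^{-4}$; the schedule length `total_steps` and the floor `min_lr` are assumptions, since the paper does not report them, and the closed form below is the standard cosine annealing curve rather than the authors' exact scheduler.

```python
import math

# Per-dataset settings as stated in the text.
TRAIN_CONFIG = {
    "WHU": {"lr": 1e-3, "batch_size": 12},
    "Massachusetts": {"lr": 5e-4, "batch_size": 2},
    "Inria": {"lr": 5e-4, "batch_size": 12},
}


def cosine_lr(base_lr, step, total_steps, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed schedule length, for illustration only.
total_steps = 100
whu_lr = TRAIN_CONFIG["WHU"]["lr"]
schedule = [cosine_lr(whu_lr, s, total_steps) for s in range(total_steps + 1)]
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and decays smoothly to the floor at the final step.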
All experiments were conducted under the same environment and repeated five times to calculate confidence intervals, ensuring the robustness and reliability of the results.
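One common way to turn five repeated runs into a confidence interval is the Student-t interval sketched below. The paper does not state which interval it uses, so this choice is an assumption, and the input values are abstract placeholders rather than reported results.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776


def confidence_interval_95(values):
    """95% confidence interval of the mean for exactly five repeated runs."""
    n = len(values)
    if n != 5:
        raise ValueError("the hard-coded critical value assumes five runs")
    mean = statistics.mean(values)
    half_width = T_CRIT_95_DF4 * statistics.stdev(values) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Placeholder scores for illustration only (not results from the paper).
runs = [1.0, 2.0, 3.0, 4.0, 5.0]
low, high = confidence_interval_95(runs)
```

A narrow interval relative to the gap between two configurations is what the later ablation discussion refers to when it describes improvements as stable.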

4.4. Compared Methods

For a comprehensive and objective comparison, this study selects ten representative methods for performance benchmarking against MSGCANet. We employ DeepLab v3+ [21], with its atrous convolution-based feature refinement, as a baseline method for general semantic segmentation. For building extraction specifically, the selected state-of-the-art approaches include CBRNet [47], which ensures segmentation consistency through contextual modeling; BuildFormer [34] and BOMSC-Net [25], which use a Transformer architecture and multi-scale feature fusion, respectively; CLGFF-Net [36], which combines a convolutional and a Transformer branch; DFF-Net [27], which optimizes boundary recognition through dynamic feature filtering; and CICF-Net [26], which employs cross-modal interaction.
As quantitatively demonstrated in Table 1, Table 2 and Table 3 through evaluation metrics including Intersection over Union (IoU), F1-score, Precision, and Recall, we systematically compare the performance disparities among these methods, with particular emphasis on their capabilities in handling multi-scale building structures and intra-class variations.

4.5. Evaluation on Massachusetts Building Dataset

  • Quantitative Comparison: As shown in Table 1, MSGCANet demonstrates excellent building extraction performance on the Massachusetts dataset. Among the key evaluation metrics, MSGCANet surpasses all compared methods in IoU, F1-score, and Precision, achieving 75.47%, 86.03%, and 87.55%, respectively. Its Recall reaches 84.50%, showing a clear overall advantage and reflecting the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
  • Visual Comparison: Figure 6 presents three sets of visual comparisons for building extraction. As shown, in the first row, other methods miss buildings within the red-boxed regions and produce incomplete building contours, whereas MSGCANet accurately and completely extracts the building outlines, closely matching the ground truth. In the second row, the T-shaped building in the red box is incompletely captured by other methods, while MSGCANet produces contours nearly identical to the ground truth. In the third row, for the elongated buildings within the red-boxed area, BOMSC-Net and CLGFF-Net exhibit over-detection errors, and BuildFormer and DFF-Net miss certain buildings. Only MSGCANet successfully extracts the building group accurately and completely, with minimal deviation from the ground truth. These results demonstrate MSGCANet’s superiority in preserving structural details and maintaining complete building contours.

4.6. Evaluation on WHU Building Dataset

  • Quantitative Comparison: As shown in Table 2, MSGCANet demonstrates outstanding building extraction performance on the WHU dataset. In several key evaluation metrics, MSGCANet outperforms all compared methods, achieving an IoU of 91.53%, an F1-score of 95.59%, and a Precision of 95.65%. Additionally, with a Recall of 95.46%, MSGCANet shows a clear overall advantage, highlighting the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
  • Visual Comparison: To more intuitively demonstrate the advantages of MSGCANet, Figure 7 presents the building extraction results of various comparative methods. In the first image, the building region in the lower-left corner, highlighted by a red box, suffers from missed detections in all other methods, with incomplete or entirely undetected contours. MSGCANet, however, successfully extracts complete and accurate building contours, showing high consistency with the ground truth. In the second image, the three small buildings highlighted by a red box are incompletely detected by other methods, while MSGCANet accurately captures both the number and contours of the buildings, detecting all three completely. In the third image, the dense building cluster in the lower-right corner is partially missed by other methods, resulting in incomplete extraction, whereas MSGCANet achieves complete extraction with results closely aligned with the ground truth, demonstrating very high accuracy.

4.7. Evaluation on Inria Building Dataset

  • Quantitative Comparison: Table 3 presents a performance comparison of various methods on the Inria dataset. MSGCANet outperforms all other methods across all key metrics, achieving an IoU of 83.10%, an F1-score of 90.78%, a Precision of 91.98%, and a Recall of 89.55%. This demonstrates its clear overall advantage and highlights the effectiveness of the multi-scale collaborative mechanism in building extraction tasks.
  • Visual Comparison: Figure 8 presents three representative cases. In the first image, the building highlighted by the red box is incompletely detected by all other methods, whereas MSGCANet successfully extracts the full building contour with high accuracy, closely matching the ground truth. In the second image, the buildings within the red box show noticeable contour deviations and shape distortions in other methods, while MSGCANet achieves the most accurate and faithful representation of the building shapes. In the third image, the small building in the lower-left corner of the red box is not perfectly detected by any method, including ours; however, the large building on the right is extracted with the most complete and precise contours by MSGCANet, demonstrating overall superior performance compared to the other methods.

5. Discussion

In this section, we conduct comprehensive experiments on three building datasets to validate the effectiveness of our proposed key components. Using PVTv2 as the encoder combined with a conventional decoding strategy as our baseline model, we systematically evaluate the contributions of CEM and WGMSAM. Furthermore, we specifically examine the performance variations under different window configurations in WGMSAM.

5.1. Effectiveness of Contextual Exploration Module

Building upon the feature representations extracted by the PVTv2 encoder, CEM employs a parallel multi-branch architecture to achieve multi-level feature enhancement. As quantitatively demonstrated in Table 4, the CEM-enhanced model exhibits statistically significant improvements in extraction accuracy metrics across all three standard building extraction datasets compared to the baseline model.
To evaluate the effectiveness of our proposed modules, we conducted statistical significance tests based on five independent trials for each configuration on the WHU, Massachusetts, and Inria datasets. The full model (baseline + CEM + WGMSAM) consistently achieves the best performance with narrow confidence intervals, indicating stable and reliable improvements. Both the CEM and WGMSAM modules contribute positively when applied individually, as evidenced by their increased IoU and F1 metrics relative to the baseline. The integration of both modules leads to further gains, with the combined model outperforming all partial configurations across all datasets. Statistical tests confirm that these improvements are significant, highlighting the complementary strengths of the CEM and WGMSAM modules and validating the effectiveness of contextual and attention mechanisms in enhancing building extraction accuracy.
Visual feature analysis in Figure 9 further reveals that in CEM-augmented feature maps, the highlighted regions corresponding to building areas show substantially improved spatial alignment with ground truth annotations, while the feature discriminability between buildings and background is markedly enhanced. This improvement manifests as increased inter-class dispersion and intra-class compactness in the feature map’s color distribution. These collective advancements validate CEM’s dual advantages in enhancing feature discriminability while preserving geometric details, particularly demonstrating its critical role in achieving precise boundary delineation of buildings in complex scenes.

5.2. Effectiveness of Window-Guided Multi-Scale Attention Mechanism

The WGMSAM module represents the core algorithmic innovation of MSGCANet, with its exceptional performance in building extraction tasks being clearly demonstrated in Table 4. This novel multi-scale window attention mechanism achieves significant performance improvements across three benchmark datasets. Our comparative analysis reveals that while the standalone implementation of WGMSAM already shows substantial gains in both IoU and F1-score metrics, its synergistic integration with the CEM module yields even more pronounced performance enhancements.
To rigorously validate these improvements, we conducted statistical significance tests over five independent runs. The full model (baseline + CEM + WGMSAM) consistently achieves the best results with narrow confidence intervals. Paired t-tests confirm that the increases in both IoU and F1 metrics are statistically significant compared to the baseline and single-module configurations, underscoring the robustness and complementary nature of combining CEM with WGMSAM.
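A paired t-test over five runs reduces to a short computation on the per-run differences, sketched below. Pairing runs by index (for example, by shared random seed) is an assumption, the function name is hypothetical, and the scores are placeholders rather than values from Table 4.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 pairs).
T_CRIT_95_DF4 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the mean per-run difference between two configurations."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err


# Placeholder scores for illustration only (not results from the paper).
full_model = [2.0, 3.0, 4.0, 5.0, 6.5]
baseline = [1.0, 1.5, 3.0, 3.5, 5.0]
t_stat = paired_t_statistic(full_model, baseline)
significant = abs(t_stat) > T_CRIT_95_DF4
```

The difference is declared significant at the 5% level when the absolute t statistic exceeds the critical value for four degrees of freedom.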
These results provide compelling evidence for the superior capability of WGMSAM’s multi-scale window attention mechanism in facilitating cross-window interactions and dynamic receptive field control, and the effective complementary relationship between the multi-scale attention mechanism and the context exploration module.

5.3. Analysis About the Windows of Window-Guided Multi-Scale Attention Mechanism

We conducted a comprehensive comparative analysis of different window configuration schemes in the WGMSAM module across three building extraction datasets in Table 5. The quantitative results for single-window configurations reveal distinct performance patterns: (1) small windows ($2 \times 2$) achieve superior performance on the WHU dataset compared to Massachusetts and Inria, demonstrating their particular efficacy for high-resolution building clusters; (2) medium windows ($4 \times 4$) show optimal performance on Inria, validating their adaptability to complex urban architectures; (3) large windows ($8 \times 8$) consistently attain the highest single-window performance across all datasets, underscoring the critical importance of global context capture.
Statistical analysis over five independent trials confirms that the multi-window configuration combining all three scales (2, 4, 8) consistently achieves the highest IoU and F1-scores across all datasets. This configuration shows statistically significant improvements compared to all single- and dual-window variants, supported by narrow confidence intervals. These findings underscore the robustness and effectiveness of multi-scale context integration in improving building extraction accuracy.
Feature map visualizations (Figure 10) further illustrate these findings: Small windows generate discrete activation points that precisely align with individual building contours in ground truth annotations, though with occasional local omissions. Medium windows produce block-like activation regions that effectively capture spatial relationships among medium-sized building clusters, albeit with minor background noise incorporation. Large windows create extensive high-activation zones that comprehensively cover building group layouts, albeit with increased edge blurring and reduced boundary sharpness compared to ground truth.
The multi-window configuration demonstrates significant improvements over single-window setups. Specifically, (1) the $2 \times 2$ + $8 \times 8$ combination achieves optimal performance on the WHU and Inria datasets, while (2) the $4 \times 4$ + $8 \times 8$ configuration shows superior results on Massachusetts, with both cases validating the complementary nature of local-global window scales. Most notably, the triple-window combination delivers the best overall performance, exhibiting measurable gains over both the best single-window and dual-window configurations, thereby confirming the necessity of multi-scale collaboration.

5.4. Limitations and Future Work

While the proposed method has achieved significant progress in building extraction tasks, several noteworthy limitations warrant further investigation. At the feature representation level, the current model demonstrates limited capability in capturing vertical structural characteristics of super-tall buildings, primarily due to the inherent constraints of 2D convolutional neural networks in modeling three-dimensional spatial information. Particularly in dense urban scenarios with skyscraper clusters, the coupling relationship between complex façade reflectance properties and rooftop features remains insufficiently modeled. Furthermore, in areas with severe shadow occlusion, the segmentation accuracy still exhibits approximately 15% relative degradation, which is directly attributable to local feature distortion induced by shadows.
To address these limitations, we propose to develop a synergistic architecture incorporating 3D convolutional branches and deformable attention mechanisms to enhance the representation of building volumetric structures. Additionally, we plan to implement physics-based rendering algorithms for shadow generation, which will simulate illumination conditions at various solar elevation angles to construct more challenging training samples. These two directions constitute our primary improvement strategies.
Notably, the core concept of multi-scale contextual cooperative modeling in our approach can be extended to other geospatial extraction tasks with significant scale variations, such as road network extraction and farmland boundary detection. This provides valuable insights for developing a universal framework for remote sensing image interpretation.

6. Conclusions

This study addresses the critical challenges of intra-class variability and multi-scale distribution characteristics in high-resolution remote sensing image building extraction tasks by proposing a Transformer-based Multi-Scale Guided Context-Aware Network (MSGCANet). First, we construct a hierarchical receptive field expansion mechanism (CEM) based on asymmetric and progressive dilated convolutions, which significantly enhances contextual representation capabilities for dense prediction tasks through dynamic multi-scale feature fusion and residual-guided optimization. Second, we innovatively propose a dynamically adjustable multi-scale feature fusion method (WGMSAM), which establishes explicit hierarchical attention fields to achieve adaptive cross-scale feature aggregation while preserving local geometric constraints. Furthermore, we design a Transformer-based cross-level decoder architecture that utilizes deformable convolutions for spatial-adaptive alignment of multi-level semantic features, coupled with joint channel-spatial modeling for dual optimization. Extensive experiments on WHU, Massachusetts, and Inria datasets demonstrate that MSGCANet significantly outperforms state-of-the-art methods in edge integrity and small-target detection accuracy, validating its superior generalization capability in complex scenarios.
Although the dynamic window partitioning strategy effectively establishes cross-scale spatial dependencies, the adaptive determination of optimal combination weights for multi-scale windows remains an unresolved challenge. Particularly for building clusters with significant scale variations, fixed-ratio window combinations may inadequately adapt to scene characteristics. Future research could focus on developing scene-aware dynamic scale adaptation mechanisms, potentially through lightweight gating networks, to achieve adaptive allocation of window weights, thereby further enhancing the model’s scale adaptability in complex architectural scenarios.
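To make the proposed direction concrete, the sketch below contrasts uniform averaging with a softmax-weighted combination of per-scale outputs. It is a speculative illustration only: the logits would come from a gating network that does not exist in the present work, and all names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_scale_fusion(scale_outputs, logits):
    """Combine per-scale feature vectors with softmax weights.

    scale_outputs: one flat feature vector per window scale.
    logits: one score per scale, e.g. from a hypothetical gating network.
    """
    weights = softmax(logits)
    length = len(scale_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, scale_outputs))
            for i in range(length)]


# Placeholder per-scale outputs for three window scales (illustration only).
outputs = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]
uniform = weighted_scale_fusion(outputs, [0.0, 0.0, 0.0])
skewed = weighted_scale_fusion(outputs, [0.0, 0.0, 5.0])
```

Equal logits reproduce the arithmetic averaging used in the current design, so a gating mechanism of this form would contain the fixed-ratio combination as a special case.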

Author Contributions

Conceptualization, M.Y.; methodology, M.Y.; software, M.Y.; validation, M.Y.; formal analysis, J.L.; investigation, W.H.; resources, J.L. and W.H.; data curation, M.Y.; writing—original draft preparation, M.Y.; writing—review and editing, J.L. and W.H.; visualization, M.Y.; supervision, J.L. and W.H.; project administration, W.H.; funding acquisition, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Emergency Management Science and Technology Project, grant number 2025YJ021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rathore, M.M.; Ahmad, A.; Paul, A.; Rho, S. Urban planning and building smart cities based on the Internet of Things using Big Data analytics. Comput. Netw. 2016, 101, 63–80.
  2. Xie, Y.; Weng, A.; Weng, Q. Population Estimation of Urban Residential Communities Using Remotely Sensed Morphologic Data. IEEE Geosci. Remote. Sens. Lett. 2015, 12, 1111–1115.
  3. Wang, H.; Wei, Y.; Liu, Y.; Cao, Y.; Liu, R.; Ning, X. Evaluation of Chinese Urban Land-Use Efficiency (Sdg11.3.1) Based on High-Precision Urban Built-up Area Data. In Proceedings of the IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, Athens, Greece, 7–12 July 2024; pp. 858–862.
  4. Wu, F.; Wang, C.; Zhang, B.; Zhang, H.; Gong, L. Discrimination of Collapsed Buildings from Remote Sensing Imagery Using Deep Neural Networks. In Proceedings of the IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 2646–2649.
  5. Sasmoko.; Wijaksono, S.; Indrianti, Y.; Rahmayati, Y. Empirical Study on the Effect of Green Building and Risk Management on Economic Quality and Sustainability in the Indonesian Sustainable Architecture Index. In Proceedings of the 2024 International Conference on ICT for Smart Society (ICISS), Yogyakarta, Indonesia, 4–5 September 2024; pp. 1–5.
  6. Shackelford, A.; Davis, C.; Wang, X. Automated 2-D building footprint extraction from high-resolution satellite multispectral imagery. In Proceedings of the IGARSS 2004 - 2004 IEEE International Geoscience and Remote Sensing Symposium, Anchorage, AK, USA, 20–24 September 2004; Volume 3, pp. 1996–1999.
  7. Krishnamachari, S.; Chellappa, R. An energy minimization approach to building detection in aerial images. In Proceedings of the ICASSP ’94. IEEE International Conference on Acoustics, Speech and Signal Processing, Adelaide, SA, Australia, 19–22 April 1994; Volume 5, pp. V/13–V/16.
  8. Jung, C.; Schramm, R. Rectangle detection based on a windowed Hough transform. In Proceedings of the 17th Brazilian Symposium on Computer Graphics and Image Processing, Curitiba, Brazil, 20 October 2004; pp. 113–120.
  9. Irvin, R.; McKeown, D. Methods for exploiting the relationship between buildings and their shadows in aerial imagery. IEEE Trans. Syst. Man Cybern. 1989, 19, 1564–1575.
  10. Huang, X.; Zhang, L. Morphological Building/Shadow Index for Building Extraction From High-Resolution Imagery Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 161–172.
  11. Tang, L.; Xie, W.; Hang, J. Automatic high-rise building extraction from aerial images. In Proceedings of the Fifth World Congress on Intelligent Control and Automation (IEEE Cat. No.04EX788), Hangzhou, China, 15–19 June 2004; Volume 4, pp. 3109–3113.
  12. Li, W.; Liu, H.; Wang, Y.; Li, Z.; Jia, Y.; Gui, G. Deep Learning-Based Classification Methods for Remote Sensing Images in Urban Built-Up Areas. IEEE Access 2019, 7, 36274–36284.
  13. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90.
  15. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
  16. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  18. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
  19. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424.
  20. Jia, H.; Yang, W.; Wang, L.; Li, H. Uncertainty-Guided Segmentation Network for Geospatial Object Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5824–5833.
  21. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
  22. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. arXiv 2017, arXiv:1612.01105.
  23. Chen, Y.; Cheng, H.; Yao, S.; Hu, Z. Building extraction from high-resolution remote sensing imagery based on multi-scale feature fusion and enhancement. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, XLIII-B3-2022, 55–60.
  24. Liu, Y.; Zhao, Z.; Zhang, S.; Huang, L. Multiregion Scale-Aware Network for Building Extraction From High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–10.
  25. Zhou, Y.; Chen, Z.; Wang, B.; Li, S.; Liu, H.; Xu, D.; Ma, C. BOMSC-Net: Boundary Optimization and Multi-Scale Context Awareness Based Building Extraction From High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17.
  26. Chen, X.; Xiao, P.; Zhang, X.; Muhtar, D.; Wang, L. A Cascaded Network with Coupled High-Low Frequency Features for Building Extraction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 10390–10406.
  27. Chen, J.; Liu, B.; Yu, A.; Quan, Y.; Li, T.; Guo, W. Depth Feature Fusion Network for Building Extraction in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16577–16591.
  28. Sultonov, F.; Yun, S.; Kang, J.M. DASK-Net: A Lightweight Dual-Attention Selective Kernel Network for Efficient Dense Prediction in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16.
  29. Li, J.; Wei, Y.; Wei, T.; He, W. A Comprehensive Deep-Learning Framework for Fine-Grained Farmland Mapping from High-Resolution Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15.
  30. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149.
  31. Guo, H.; Shi, Q.; Du, B.; Zhang, L.; Wang, D.; Ding, H. Scene-Driven Multitask Parallel Attention Network for Building Extraction in High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4287–4306.
  32. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. arXiv 2021, arXiv:2012.11879.
  33. Das, P.; Chand, S. AttentionBuildNet for Building Extraction from Aerial Imagery. In Proceedings of the 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 19–20 February 2021; pp. 576–580.
  34. Wang, L.; Fang, S.; Meng, X.; Li, R. Building Extraction with Vision Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
  35. Zhai, Y.; Li, W.; Xian, T.; Jia, X.; Zhang, H.; Tan, Z.; Zhou, J.; Zeng, J.; Philip Chen, C.L. CAS-Net: Comparison-Based Attention Siamese Network for Change Detection with an Open High-Resolution UAV Image Dataset. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
  36. Fu, W.; Xie, K.; Fang, L. Complementarity-Aware Local–Global Feature Fusion Network for Building Extraction in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5617113.
  37. de Oliveira Junior, L.A.; Medeiros, H.R.; Macêdo, D.; Zanchettin, C.; Oliveira, A.L.I.; Ludermir, T. SegNetRes-CRF: A Deep Convolutional Encoder-Decoder Architecture for Semantic Image Segmentation. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6.
  38. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
  39. Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-Shaped CNN Building Instance Extraction Framework with Edge Constraint for High-Spatial-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6106–6120.
  40. Han, T.; Ma, J.; Wang, C.; Luo, Y.; Fan, H.; Marcato, J.; Zhang, X.; Chen, Y. CityInsight: Incorporating Dual-Condition based Diffusion Model into Building Footprint Segmentation from Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63.
  41. Jung, H.; Choi, H.S.; Kang, M. Boundary Enhancement Semantic Segmentation for Building Extraction from Remote Sensed Image. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12.
  42. Cao, S.; Feng, D.; Liu, S.; Xu, W.; Chen, H.; Xie, Y.; Zhang, H.; Pirasteh, S.; Zhu, J. BEMRF-Net: Boundary Enhancement and Multiscale Refinement Fusion for Building Extraction from Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16342–16358.
  43. Zhu, X.; Zhang, X.; Zhang, T.; Tang, X.; Chen, P.; Zhou, H.; Jiao, L. Semantics and Contour Based Interactive Learning Network for Building Footprint Extraction. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13.
  44. Tang, S.; Wang, X.; Pan, C.; Ji, R.; Zhou, C.; Tan, K. Poly BRBLE: A Boundary Refinement-Based Individual Building Localization and Extraction Model Combined with Regularization. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–17. [Google Scholar] [CrossRef]
  45. Li, X.; Liu, Z.; Luo, P.; Loy, C.C.; Tang, X. Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade. arXiv 2017, arXiv:1704.01344. [Google Scholar] [CrossRef]
  46. Jing, L.; Chen, Y.; Tian, Y. Coarse-to-Fine Semantic Segmentation from Image-Level Labels. IEEE Trans. Image Process. 2020, 29, 225–236. [Google Scholar] [CrossRef]
  47. Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252. [Google Scholar] [CrossRef]
  48. Sheikh, M.A.A.; Maity, T.; Kole, A. IRU-Net: An Efficient End-to-End Network for Automatic Building Extraction from Remote Sensing Images. IEEE Access 2022, 10, 37811–37828. [Google Scholar] [CrossRef]
  49. Liu, Z.; Shi, Q.; Ou, J. LCS: A Collaborative Optimization Framework of Vector Extraction and Semantic Segmentation for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  50. Li, J.; He, W.; Li, Z.; Guo, Y.; Zhang, H. Overcoming the uncertainty challenges in detecting building changes from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2025, 220, 1–17. [Google Scholar] [CrossRef]
  51. Li, J.; He, W.; Cao, W.; Zhang, L.; Zhang, H. UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608513. [Google Scholar] [CrossRef]
  52. Huo, Y.; Gang, S.; Guan, C. FCIHMRT: Feature Cross-Layer Interaction Hybrid Method Based on Res2Net and Transformer for Remote Sensing Scene Classification. Electronics 2023, 12, 4362. [Google Scholar] [CrossRef]
  53. Chen, B.; Zou, X.; Zhang, Y.; Li, J.; Li, K.; Xing, J.; Tao, P. LEFormer: A Hybrid CNN-Transformer Architecture for Accurate Lake Extraction from Remote Sensing Imagery. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 5710–5714. [Google Scholar] [CrossRef]
  54. Gibril, M.B.A.; Al-Ruzouq, R.; Bolcek, J.; Shanableh, A.; Jena, R. Building Extraction from Satellite Images Using Mask R-CNN and Swin Transformer. In Proceedings of the 2024 34th International Conference Radioelektronika (RADIOELEKTRONIKA), Zilina, Slovakia, 17–18 April 2024; pp. 1–5. [Google Scholar] [CrossRef]
  55. Patel, S. Hybrid CNN-Transformer for Aerial Object Detection: A Novel Architecture for Enhanced Detection Accuracy. In Proceedings of the 2025 International Conference on Machine Learning and Autonomous Systems (ICMLAS), Prawet, Thailand, 10–12 March 2025; pp. 693–698. [Google Scholar] [CrossRef]
  56. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  57. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  58. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can Semantic Labeling Methods Generalize to Any City? The Inria Aerial Image Labeling Benchmark. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017. [Google Scholar]
  59. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  60. Sung, K.K.; Poggio, T. Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 39–51. [Google Scholar] [CrossRef]
  61. Joachims, T. Text categorization with Support Vector Machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, ECML’98, Berlin/Heidelberg, Germany, 21–23 April 1998; pp. 137–142. [Google Scholar] [CrossRef]
  62. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. [Google Scholar] [CrossRef]
Figure 1. Challenges confronting conventional decoder mechanisms: (a) intra-class variability among buildings of the same category (highlighted by blue bounding boxes); (b) multi-scale distribution characteristics of building features (highlighted by red bounding boxes).
Figure 2. Architecture of Multi-Scale Guided Context-Aware Network (MSGCANet).
Figure 3. Structure of Contextual Exploration Module.
Figure 4. Structure of Window-Guided Multi-Scale Attention Mechanism.
Figure 5. Examples of images and corresponding labels in the dataset.
Figure 6. Visual comparison of the Massachusetts Building Dataset.
Figure 7. Visual comparison of the WHU Building Dataset.
Figure 8. Visual comparison of the Inria Building Dataset.
Figure 9. Feature visualization before/after the Contextual Exploration Module.
Figure 10. Feature visualization of the output of the WGMSAM.
Table 1. Performance comparison on the Massachusetts dataset. The best values for each metric are highlighted in bold.

| Method | Year | IoU (%) | F1 (%) | Pre (%) | Rec (%) |
|---|---|---|---|---|---|
| Deeplab v3+ | 2018 | 69.90 | 82.28 | 83.81 | 80.81 |
| CBRNet | 2021 | 74.55 | 85.42 | 86.50 | 84.36 |
| BuildFormer | 2022 | 75.03 | 85.73 | 86.69 | 84.79 |
| BOMSC-Net | 2022 | 74.71 | 85.13 | 86.64 | 83.68 |
| CLGFF-Net | 2024 | 75.33 | 85.93 | 85.03 | **86.85** |
| DFF-Net | 2024 | 72.60 | 84.20 | 87.20 | 81.30 |
| CICF-Net | 2024 | 75.17 | 85.83 | - | - |
| MSGCANet | - | **75.47 ± 0.014** | **86.03 ± 0.012** | **87.55 ± 0.015** | 84.50 ± 0.013 |
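The four metrics reported in these comparison tables can be illustrated with a minimal sketch. It assumes the standard pixel-level definitions for binary building masks (IoU = TP / (TP + FP + FN), and so on); the evaluation code actually used for the tables is not shown in this article, and the function names `building_metrics` and `f1_from_iou` as well as the toy masks are hypothetical.

```python
def building_metrics(pred, target):
    """Return IoU, F1, precision, and recall for two flat binary masks (1 = building)."""
    pairs = list(zip(pred, target))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # building predicted as building
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # background predicted as building
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # building predicted as background

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "f1": f1, "precision": precision, "recall": recall}


def f1_from_iou(iou):
    """For metrics derived from the same confusion counts, F1 = 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


# Toy 8-pixel example (hypothetical data, not taken from any dataset).
pred = [1, 1, 1, 0, 0, 1, 0, 0]
target = [1, 1, 0, 0, 1, 1, 0, 0]
metrics = building_metrics(pred, target)
print(metrics)
print(f1_from_iou(metrics["iou"]))
```

Because F1 and IoU are both functions of the same three counts, the identity in `f1_from_iou` offers a quick consistency check on a reported row: the Deeplab v3+ IoU of 69.90% in Table 1 corresponds to an F1 of about 82.28%, which matches the reported value. The identity holds only when both metrics are aggregated from the same counts, so small deviations can appear when scores are averaged per image or per run.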
Table 2. Performance comparison on the WHU dataset. The best values for each metric are highlighted in bold.

| Method | Year | IoU (%) | F1 (%) | Pre (%) | Rec (%) |
|---|---|---|---|---|---|
| Deeplab v3+ | 2018 | 86.63 | 93.39 | 92.91 | 93.88 |
| CBRNet | 2021 | 91.40 | 95.51 | 95.31 | 95.70 |
| BuildFormer | 2022 | 90.73 | 95.14 | 95.15 | 95.14 |
| BOMSC-Net | 2022 | 90.15 | 94.80 | 95.14 | 94.50 |
| CLGFF-Net | 2024 | 91.30 | 95.45 | 95.01 | **95.89** |
| DFF-Net | 2024 | 90.50 | 95.00 | 95.40 | 94.60 |
| CICF-Net | 2024 | 91.45 | 95.53 | - | - |
| MSGCANet | - | **91.53 ± 0.013** | **95.59 ± 0.012** | **95.65 ± 0.015** | 95.46 ± 0.014 |
Table 3. Performance comparison on the Inria dataset. The best values for each metric are highlighted in bold.

| Method | Year | IoU (%) | F1 (%) | Pre (%) | Rec (%) |
|---|---|---|---|---|---|
| Deeplab v3+ | 2018 | 76.80 | 86.88 | 87.35 | 86.40 |
| CBRNet | 2021 | 81.10 | 89.56 | 89.93 | 89.20 |
| BuildFormer | 2022 | 81.24 | 89.71 | 90.65 | 88.78 |
| BOMSC-Net | 2022 | 78.18 | 87.75 | 87.93 | 87.58 |
| CLGFF-Net | 2024 | 82.48 | 90.40 | 91.86 | 88.99 |
| DFF-Net | 2024 | 77.90 | 87.60 | 88.80 | 86.30 |
| CICF-Net | 2024 | 81.28 | 89.67 | - | - |
| MSGCANet | - | **83.10 ± 0.015** | **90.78 ± 0.012** | **91.98 ± 0.017** | **89.55 ± 0.013** |
Table 4. Ablation study results on the test datasets are reported as mean ± standard deviation over five independent runs to indicate variability, along with 95% confidence intervals to assess the reliability of the improvements. Statistical significance tests were conducted to validate the results. A, B, and C denote the baseline, CEM module, and WGMSAM module, respectively.

| A | B | C | WHU IoU | WHU F1 | Mass IoU | Mass F1 | Inria IoU | Inria F1 |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 88.40 ± 0.021 | 93.97 ± 0.018 | 72.51 ± 0.025 | 84.13 ± 0.017 | 78.42 ± 0.019 | 87.95 ± 0.022 |
|  |  |  | 90.61 ± 0.020 | 95.08 ± 0.017 | 73.75 ± 0.019 | 85.25 ± 0.023 | 80.40 ± 0.015 | 89.06 ± 0.027 |
|  |  |  | 91.18 ± 0.026 | 95.36 ± 0.022 | 74.66 ± 0.018 | 85.07 ± 0.026 | 82.19 ± 0.021 | 90.11 ± 0.025 |
|  |  |  | **91.55 ± 0.016** | **95.59 ± 0.018** | **75.45 ± 0.027** | **86.03 ± 0.020** | **83.11 ± 0.015** | **90.77 ± 0.017** |

Note: The symbols indicate: ↑ higher value is better, bold best performance, ✓ module enabled.
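The "mean ± standard deviation over five independent runs" and "95% confidence interval" reporting described in the caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes the sample standard deviation (n − 1 denominator) and a two-sided Student's t interval with 4 degrees of freedom, neither of which the caption specifies, and the five scores and the name `summarize_runs` are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom,
# i.e. for n = 5 runs. Assumption: a t-interval is used; the caption does
# not say how the confidence intervals were computed.
T_CRIT_95_DF4 = 2.776


def summarize_runs(scores):
    """Return (mean, sample std, 95% CI) for exactly five per-run scores."""
    if len(scores) != 5:
        raise ValueError("T_CRIT_95_DF4 is only valid for five runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical IoU scores (%) from five runs; not values from the paper.
runs = [91.53, 91.55, 91.56, 91.54, 91.57]
mean, std, (ci_low, ci_high) = summarize_runs(runs)
print(f"{mean:.2f} ± {std:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only five runs the t critical value (2.776) is noticeably larger than the normal approximation (1.96), so the choice between the two changes the interval width by roughly 40%.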
Table 5. Ablation study results on the test datasets are reported as mean ± standard deviation over five independent runs to indicate variability, along with 95% confidence intervals to assess the reliability of the improvements. Statistical significance tests were conducted to validate the results.

| Scales | WHU IoU | WHU F1 | Mass IoU | Mass F1 | Inria IoU | Inria F1 |
|---|---|---|---|---|---|---|
| 2 | 90.52 ± 0.015 | 94.98 ± 0.020 | 73.70 ± 0.018 | 85.10 ± 0.025 | 79.98 ± 0.017 | 88.79 ± 0.016 |
| 4 | 90.83 ± 0.012 | 95.08 ± 0.014 | 73.50 ± 0.020 | 85.05 ± 0.022 | 80.40 ± 0.021 | 89.01 ± 0.023 |
| 8 | 90.97 ± 0.018 | 95.32 ± 0.010 | 73.55 ± 0.011 | 85.11 ± 0.019 | 80.88 ± 0.020 | 89.29 ± 0.014 |
| 2, 4 | 91.21 ± 0.020 | 95.40 ± 0.025 | 74.70 ± 0.022 | 85.62 ± 0.021 | 82.30 ± 0.019 | 89.88 ± 0.015 |
| 2, 8 | 91.45 ± 0.025 | 95.52 ± 0.018 | 74.95 ± 0.030 | 85.65 ± 0.016 | 82.65 ± 0.023 | 90.11 ± 0.022 |
| 4, 8 | 91.40 ± 0.022 | 95.41 ± 0.019 | 75.12 ± 0.027 | 85.77 ± 0.021 | 82.50 ± 0.020 | 90.05 ± 0.018 |
| 2, 4, 8 | **91.57 ± 0.028** | **95.59 ± 0.023** | **75.34 ± 0.025** | **85.92 ± 0.027** | **83.02 ± 0.026** | **90.68 ± 0.021** |

Note: The symbols indicate: ↑ higher value is better, bold best performance.

Share and Cite

MDPI and ACS Style

Yu, M.; Li, J.; He, W. Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction. Sensors 2025, 25, 5356. https://doi.org/10.3390/s25175356

AMA Style

Yu M, Li J, He W. Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction. Sensors. 2025; 25(17):5356. https://doi.org/10.3390/s25175356

Chicago/Turabian Style

Yu, Mengxuan, Jiepan Li, and Wei He. 2025. "Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction" Sensors 25, no. 17: 5356. https://doi.org/10.3390/s25175356

APA Style

Yu, M., Li, J., & He, W. (2025). Multi-Scale Guided Context-Aware Transformer for Remote Sensing Building Extraction. Sensors, 25(17), 5356. https://doi.org/10.3390/s25175356
