1. Introduction
Automatic and precise extraction of building information from High Spatial Resolution (HSR) remote sensing imagery constitutes a fundamental task in domains such as urban planning, dynamic monitoring, national geographic census, disaster emergency response, and 3D digital city modeling [1,2,3]. However, due to the dense distribution, complex structures, and significant scale variations of buildings in urban environments, coupled with susceptibility to shadows, vegetation occlusion, and illumination changes, building contours are prone to blurred boundaries and irregular shapes, posing significant challenges for high-precision extraction [4,5,6].
Early research typically relied on hand-crafted features such as spectral, textural, and geometric attributes, combined with classifiers like Support Vector Machines (SVMs) or Random Forests for building recognition; however, such methods struggle to maintain stable generalization capabilities under complex background conditions [1] and are ill-suited for complex and variable urban scenarios. The advancement of deep learning technologies, particularly Fully Convolutional Networks (FCNs) represented by encoder–decoder architectures [7], has significantly propelled the progress of building extraction. As a classic implementation of this architecture, U-Net [8] and its variants utilize Skip Connections to fuse shallow spatial information with deep semantic information, gaining widespread application in the remote sensing field [8]. However, its simple feature concatenation approach results in a Semantic Gap, and the inevitable loss of high-frequency spatial details during encoder down-sampling leads to insufficient recognition of small building entities and suboptimal segmentation accuracy in boundary regions [9,10,11]. To address U-Net’s deficiencies in multi-scale context modeling, subsequent researchers have proposed a series of improved models. For instance, Zhao et al. [12] proposed PSPNet (Pyramid Scene Parsing Network) [13], which introduces a Pyramid Pooling Module (PPM) to aggregate features at different scales for global context acquisition. However, the fixed grid pooling operation employed by the PPM is relatively rigid, easily leading to over-smoothing or loss of local detail information. Chen et al. [14] proposed DeepLabV3+, which employs Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale information using atrous convolutions with varying dilation rates. Nevertheless, the sparse sampling characteristic of ASPP tends to generate Gridding Effects, restricting the perception of fine boundaries and potentially causing voids within large buildings [14]. He et al. [15] proposed APCNet, attempting to mitigate the limitations of the PPM via an Adaptive Context Module (ACM) that dynamically computes affinities between local regions, yet there remains room for improvement in balancing global guidance with local detail correlation. UPerNet combines the Feature Pyramid Network (FPN) with the PPM, aiming to unify perceptual information across different levels [2,16], and is frequently utilized in conjunction with modern backbones like ConvNeXt [16]. Although UPerNet excels in multi-task processing, it inherits the inherent defects of the PPM regarding boundary detail handling when applied to the specific task of building extraction.
A comprehensive analysis of the aforementioned mainstream models, such as U-Net [9], DeepLabV3+ [14], PSPNet [12,13], APCNet [15], and UPerNet [16], reveals three critical common issues remaining in current research:
1. Insufficient long-range dependency modeling.
Traditional CNN architectures, constrained by the local receptive fields of convolution kernels, struggle to establish explicit correlations between global pixels, causing models to prioritize learning local textures over global structures and significantly limiting generalization capabilities across domains (e.g., transferring from the WHU dataset to the Ganzhou urban dataset) [17,18]. Although models like DeepLabV3+ attempt to expand the receptive field via atrous convolution, they still fail to effectively capture the spatial layout relationships between buildings [17].
2. Inadequate multi-scale object modeling.
Modules like the PPM and ASPP employ fixed-scale pooling or dilation rates, making it difficult to adaptively match the extreme scale variations of buildings in remote sensing imagery and resulting in markedly low recall rates for small and dense buildings [17,19]. Research indicates that existing methods perform poorly when handling buildings with immense scale disparities, particularly in areas with dense building distribution, where fixed-scale feature extraction strategies fail to effectively cover all targets [19].
3. Weak feature spatial adaptability.
During feature fusion (such as combining down-sampled and up-sampled features), models lack spatial position awareness, treating all pixels “equally” without prioritizing high-frequency areas like building boundaries, leading to severe edge blurring and artifacts [20]. This issue is particularly pronounced in building boundary extraction, where traditional methods often lose crucial shape detail information [20].

These common issues severely constrain the performance of building extraction models in practical applications, necessitating breakthrough innovations in global dependency modeling, adaptive multi-scale feature extraction, and spatial perception mechanisms.
In recent years, State Space Models (SSMs) have demonstrated superior global dependency capture capabilities and linear complexity in sequence modeling tasks, providing a new and more efficient pathway for long-range feature modeling in remote sensing building extraction [21,22,23,24,25,26]. Simultaneously, architectures such as edge enhancement networks and multi-scale attention fusion strategies have further improved model sensitivity to building boundaries and local geometric structures. Addressing the aforementioned challenges, this paper proposes a high-resolution remote sensing building extraction network that fuses multi-scale sequence modeling with spatial adaptive enhancement. The method utilizes UPerNet (with a ConvNeXt-Tiny backbone) as the foundational framework and introduces a dedicated PyramidSSM-based neck (PyramidSSMNeck) as the primary design for structured multi-scale feature projection, alignment, and fusion, upon which it further integrates three enhancement components (S6 (SSM-based), LSKNet [27,28], and SAFM (Spatial Adaptive Feature Modulation) [29]) that provide complementary improvements mainly reflected in boundary delineation. Specifically, PyramidSSMNeck emphasizes structured cross-scale feature projection, alignment, and aggregation to strengthen multi-scale representation; S6 enhances long-sequence contextual modeling to better capture global dependencies; the LSKNet module, by introducing a Large Selective Kernel mechanism, enables the network to dynamically adjust its spatial receptive field, adaptively capturing multi-scale spatial patterns; and the SAFM module dynamically modulates feature responses based on spatial positional information, enhancing the recognition precision of high-frequency details in boundary regions. Overall, PyramidSSMNeck contributes the dominant improvements in region-level metrics, whereas S6, LSKNet, and SAFM provide additional gains that are primarily reflected in boundary-sensitive evaluation; improved cross-domain transferability is observed for the proposed full framework in WHU → Ganzhou experiments. Experimental results on the public WHU Building Dataset [30], the INRIA Dataset [31], and a self-constructed Ganzhou urban building dataset validate the effectiveness and superiority of the proposed method.
The main innovations and contributions of this work are summarized as follows:
(1) We propose a PyramidSSMNeck-based building extraction architecture built on the UPerNet (ConvNeXt-Tiny) baseline, which strengthens multi-scale feature alignment and fusion for HSR imagery under complex scale variation and boundary ambiguity.
(2) On top of the proposed PyramidSSMNeck, we integrate three enhancement components—S6 for long-range context modeling, LSKNet for spatially adaptive receptive-field selection, and SAFM for spatial refinement—to provide additional gains that are primarily reflected in boundary quality.
(3) Extensive experiments on the WHU, INRIA, and Ganzhou datasets demonstrate consistent gains in both region- and boundary-sensitive metrics (e.g., IoU/BIoU), as well as improved transfer performance under the WHU → Ganzhou cross-domain setting.
2. Research Methods and Principles
To effectively address the unique challenges inherent in HSR remote sensing imagery—such as drastic scale variations in building objects, complex spatial distributions, strong background noise interference, and long-range contextual dependencies—this paper designs a building semantic segmentation network that integrates multi-scale sequence modeling with spatial adaptive enhancement. As illustrated in Figure 1, the model adopts UPerNet equipped with ConvNeXt-Tiny as the baseline framework, constructing a holistic architecture composed of a backbone network (ConvNeXt-Tiny), a multi-scale feature-enhanced neck (PyramidSSMNeck), and a decoding head (UPerHead). The core philosophy of the proposed model is to strengthen region-level representation through structured cross-scale feature projection, alignment, and fusion in PyramidSSMNeck, while S6, LSKNet, and SAFM provide additional refinement that is more evident in boundary preservation and fine-detail integrity.
2.1. ConvNeXt-Tiny Feature Extraction Module
As illustrated in Figure 1, ConvNeXt-Tiny adopts a hierarchical structure comprising four stacked stages (Stage 0 to Stage 3) to progressively down-sample input features, systematically increasing channel dimensions (from 96 to 768) while reducing spatial resolution (from 128 × 128 to 16 × 16). This process generates four feature maps (F1, F2, F3, and F4) at distinct semantic levels, providing rich multi-scale inputs for the subsequent neck and decoder modules. The fundamental building unit of ConvNeXt-Tiny is the ConvNeXt block [32] (see the bottom of Figure 1), which incorporates the following key design elements:
1. Large Kernel Depthwise Convolution [33]:
The core component of this block is a 7 × 7 large-kernel depthwise convolution. In contrast to traditional 3 × 3 convolution kernels, the 7 × 7 large kernel significantly expands the model’s Effective Receptive Field (ERF), enabling the capture of broader spatial contextual information. Simultaneously, the depthwise convolution format ensures computational efficiency.
2. Layer Normalization (LN):
Regarding the normalization strategy, this block substitutes Layer Normalization (LN) for the Batch Normalization (BN) commonly employed in convolutional networks. As a standard component of Transformers, LN offers more stable training dynamics across varying batch sizes.
3. Inverted Bottleneck:
This block adopts an inverted bottleneck design derived from the Feed-Forward Network (FFN) of Transformers. As illustrated in Figure 1, channel dimensions (C) are first expanded by a factor of 4 to 4C via a 1 × 1 convolution, undergo a non-linear transformation through the Gaussian Error Linear Unit (GELU) activation function, and are finally compressed back to C via another 1 × 1 convolution. This “narrow-wide-narrow” architecture compels the model to learn complex feature transformations within a higher-dimensional (4C) feature space while restricting the computationally intensive large-kernel convolution to the narrower (C) channel dimension, striking a delicate balance between performance and efficiency.
4. Residual Connections and DropPath:
By combining residual connections (as in ResNet) with DropPath, a structural regularization technique, this block ensures stable gradient propagation within deep networks and enhances the model’s generalization capabilities.
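To make the “narrow-wide-narrow” channel flow of the inverted bottleneck concrete, the following NumPy sketch applies the expand–GELU–compress transformation per pixel as plain matrix products. All names are ours, and the 7 × 7 depthwise convolution, LayerNorm, DropPath, and biases of the real ConvNeXt block are deliberately omitted; this is a structural illustration, not the reference implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def inverted_bottleneck(x, w_up, w_down):
    """Narrow-wide-narrow channel MLP of the ConvNeXt block, applied per pixel.

    x:      (H, W, C) feature map
    w_up:   (C, 4C) weights standing in for the expanding 1x1 convolution
    w_down: (4C, C) weights standing in for the compressing 1x1 convolution
    """
    # expand C -> 4C, apply GELU, compress 4C -> C, then add the residual
    return x + gelu(x @ w_up) @ w_down
```

With zero weights the block collapses to the identity, which illustrates why the residual path keeps gradients stable in deep stacks.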
2.2. PyramidSSMNeck Feature Enhancement Module
Subsequently, these multi-scale features are fed into the proposed PyramidSSMNeck module for feature fusion and enhancement. The PyramidSSMNeck module represents the core innovation of this study, designed to serve as a “neck” bridging the encoder and decoder and to specifically address two critical challenges in semantic segmentation of remote sensing imagery: (1) global contextual dependency: accurate building recognition (e.g., distinguishing between roofs and roads with similar textures) relies heavily on long-range spatial relationships; and (2) scale diversity: building objects in remote sensing imagery exhibit vast size variations, ranging from small shacks occupying a few pixels to large complexes spanning hundreds of pixels. As illustrated in Figure 1, PyramidSSMNeck receives multi-scale feature maps from the four stages of ConvNeXt-Tiny. Initially, a “Projection Layer” is employed to unify these four feature maps of varying dimensions into a consistent channel count. Subsequently, the feature map at each scale is independently processed by a pivotal PyramidSSM Block for deep feature enhancement. Finally, all enhanced feature maps are fused during the “Feature Alignment and Fusion” stage via up-sampling, concatenation, and convolution operations, providing prepared pyramidal feature inputs for the UPerHead decoder.
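The projection–enhancement–alignment–fusion flow of the neck can be sketched as below. The einsum-based 1 × 1 projection, the identity stand-in for the PyramidSSM Block, the nearest-neighbour alignment, and the omission of the final fusion convolution are all simplifying assumptions of ours, not the actual implementation.

```python
import numpy as np

def upsample_nearest(x, factor):
    # (C, H, W) -> (C, H*factor, W*factor) by nearest-neighbour repetition
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def neck_sketch(feats, proj_ws, enhance):
    """Structured projection -> per-scale enhancement -> alignment and fusion.

    feats:   [F1..F4], each (C_i, H_i, W_i), with spatial size halving per stage
    proj_ws: per-level (C_i, C_out) matrices standing in for the Projection
             Layer's 1x1 convolutions
    enhance: stand-in callable for the PyramidSSM Block
    """
    # unify channel counts across the four levels
    projected = [np.einsum('chw,cd->dhw', f, w) for f, w in zip(feats, proj_ws)]
    # enhance each scale independently
    enhanced = [enhance(p) for p in projected]
    # align everything to the F1 resolution and concatenate
    base = enhanced[0].shape[1]
    aligned = [upsample_nearest(e, base // e.shape[1]) for e in enhanced]
    return np.concatenate(aligned, axis=0)
```

With four levels projected to a common channel count, the fused output simply stacks the aligned pyramid levels, ready for a decoder head.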
The PyramidSSM Block serves as the core computational unit of the PyramidSSMNeck, as depicted in Figure 2. It employs a meticulously designed Sequential Pipeline, wherein features successively traverse three sub-modules—S6, LSKNet, and SAFM—to achieve progressive enhancement of complementary information.
The S6 module, derived from the Mamba architecture, functions as a Selective State Space Model (SSM) [34]. Its primary design objective is to efficiently capture long-range dependencies within sequential data. The fundamental principle of SSMs is grounded in continuous-time systems, wherein the evolution of the hidden state h(t) is governed by Ordinary Differential Equations (ODEs):

h′(t) = A h(t) + B x(t), (1)

y(t) = C h(t). (2)

Here, A denotes the state matrix, while B and C represent the input and output transformation matrices, respectively. To facilitate implementation on digital computing hardware, the continuous system necessitates discretization. In this study, we employ the Zero-Order Hold (ZOH) principle to transform the continuous parameters (A, B) into their discrete counterparts (Ā, B̄) via a learnable timescale parameter Δ:

Ā = exp(ΔA), (3)

B̄ = (ΔA)⁻¹ (exp(ΔA) − I) · ΔB. (4)

Following the discretization defined in Equation (4), the SSM can be efficiently computed in a recurrent form:

h_t = Ā h_{t−1} + B̄ x_t, (5)

y_t = C h_t. (6)
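As a minimal sketch, the ZOH discretization and recurrent scan above can be written out for the diagonal-state case used in Mamba-style SSMs, where exp(ΔA) reduces to an element-wise exponential. Array names are ours, and the parameters are fixed here, whereas the actual S6 module generates them from the input.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """ZOH discretization for a diagonal state matrix.

    a:     (N,) diagonal entries of A
    b:     (N,) entries of B
    delta: scalar timescale (learnable in S6; a constant here)
    """
    a_bar = np.exp(delta * a)        # A_bar = exp(delta * A)
    b_bar = (a_bar - 1.0) / a * b    # B_bar = (delta*A)^-1 (exp(delta*A) - I) * delta*B
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Recurrent form: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:                    # scan over the (flattened spatial) sequence
        h = a_bar * h + b_bar * x_t
        ys.append(float(c @ h))
    return np.array(ys)
```

With a strongly decaying state (a < 0 and a large Δ), the recurrence forgets the past almost completely and the output tracks the input, which illustrates how Δ controls how much history the state retains.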
The revolutionary attribute of the S6 module lies in its “selectivity”. Unlike traditional SSMs, the key parameters of S6 are not static but input-dependent. As illustrated in the S6 module component of Figure 2, input features are projected via x_proj and subsequently split to dynamically generate these parameters. This mechanism empowers the model to “selectively” determine which information to propagate or forget along the spatial sequence, thereby achieving content-aware long-range information interaction. Following the capture of global context by the S6 module, the features are fed into the Large Selective Kernel Network (LSKNet) module. Designed to address the issue of scale diversity in remote sensing imagery, this module enables the network to dynamically adjust its spatial receptive field in response to the input content.
The core mechanism of LSKNet (as illustrated in the LSKNet block of Figure 2) performs spatially selective fusion, which differs from methods such as SKNet [35] that conduct selection mainly along the channel dimension. For building extraction in HSR remote sensing imagery, spatial selection is particularly suitable because buildings exhibit substantial scale variation and morphological diversity, and many ambiguities (e.g., adjacent small buildings and irregular boundaries) are location-dependent. Therefore, spatially adaptive receptive-field selection helps adjust responses according to local structure. The operational workflow proceeds as follows:
1. Multi-Branch Large-Kernel Convolution:
Input features U are processed in parallel through four depthwise convolution branches equipped with varying large kernel sizes, yielding four distinct feature maps:

U_i = F_i^dw(U), i = 1, 2, 3, 4, (7)

where F_i^dw denotes the i-th large-kernel depthwise convolution branch.
2. Spatial Selection Weight Generation:
Within the “Selection Path,” these four feature maps undergo element-wise summation. The aggregated result is subsequently processed by a “Selection Block” (comprising a 1 × 1 convolution, Batch Normalization, ReLU activation, and a Softmax function) to generate four distinct sets of spatial attention maps:

[SA_1, SA_2, SA_3, SA_4] = Softmax(SelectionBlock(U_1 + U_2 + U_3 + U_4)). (8)

As illustrated in Figure 2 (LSKNet), the Selection Block outputs a four-channel score map, where each channel corresponds to one convolutional branch. The Softmax in Equation (8) is applied across the four branches at each spatial location (h, w), producing pixel-wise weights SA_i(h, w) that satisfy SA_1(h, w) + SA_2(h, w) + SA_3(h, w) + SA_4(h, w) = 1. Each SA_i is an H × W spatial weight map and is broadcast along the channel dimension when reweighting U_i in Equation (9).
3. Weighted Fusion:
Within the “Fusion Path,” each feature map U_i undergoes element-wise multiplication with its corresponding spatial weight map SA_i, followed by a summation of the results:

V = SA_1 ⊙ U_1 + SA_2 ⊙ U_2 + SA_3 ⊙ U_3 + SA_4 ⊙ U_4. (9)

The resultant fused feature V is added to the original input U via a residual path to yield the final output of the LSKNet module:

Y = U + V. (10)

This mechanism empowers the network to dynamically and adaptively select the optimal receptive-field scale (i.e., convolution kernel size) for each spatial pixel location within the image.
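A minimal NumPy sketch of this spatially selective fusion follows. The learned Selection Block (1 × 1 convolution, BN, ReLU) is reduced to a channel-mean score per branch, which is an assumption made purely for brevity; only the per-pixel softmax across the four branches and the residual path are kept faithfully.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_select(u, branch_outputs):
    """Pixel-wise selection across large-kernel branches.

    u:              (C, H, W) input of the LSKNet module
    branch_outputs: list of four (C, H, W) maps from the depthwise branches
    """
    stacked = np.stack(branch_outputs)            # (4, C, H, W)
    scores = stacked.mean(axis=1)                 # (4, H, W) stand-in branch scores
    w = softmax(scores, axis=0)                   # weights sum to 1 at every pixel
    v = (w[:, None, :, :] * stacked).sum(axis=0)  # weights broadcast over channels
    return u + v                                  # residual path: Y = U + V
```

When all four branches happen to agree, the softmax weights become uniform and the fusion reduces to the shared branch output plus the residual, which is the expected degenerate behaviour.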
As illustrated in the SAFM section of Figure 2, this module employs a multi-branch, multi-kernel hybrid channel–spatial attention mechanism. Input features are evenly partitioned along the channel dimension into four distinct “Chunks”. Each “Chunk” is processed by a depthwise convolutional layer equipped with a specific spatial kernel size, enabling the network to capture spatial information at varying scales across different channel groups. The four processed “Chunks” are subsequently re-concatenated along the channel dimension. Finally, a 1 × 1 convolution, followed by BN, GELU activation, and a residual connection, is utilized to aggregate this cross-channel, multi-scale information, yielding the refined feature map.
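The split-process-concat structure of SAFM can be sketched as follows. The per-chunk depthwise convolutions are replaced by caller-supplied stand-in callables, and the trailing 1 × 1 convolution, BN, and GELU are collapsed into a bare residual addition, so this is a structural sketch under those assumptions rather than the actual module.

```python
import numpy as np

def safm_sketch(x, chunk_ops):
    """Split-process-concat structure of SAFM.

    x:         (C, H, W) feature map with C divisible by 4
    chunk_ops: four callables, stand-ins for the depthwise convolutions
               with different kernel sizes applied to each chunk
    """
    chunks = np.split(x, 4, axis=0)  # partition channels into four "Chunks"
    mixed = np.concatenate([op(c) for op, c in zip(chunk_ops, chunks)], axis=0)
    return x + mixed                 # residual connection
```

Because each chunk sees a different spatial operator, different channel groups end up encoding different receptive-field scales while the overall tensor shape is preserved.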
2.3. UPerHead Decoder Module
The four enhanced feature layers output by the PyramidSSMNeck are fed into the UPerHead decoder module. As illustrated in Figure 3, UPerHead employs a dual-branch parallel architecture that efficiently integrates the strengths of the Pyramid Pooling Module (PPM) and the FPN. For clarity, the notation used in Figure 3 is summarized in Table 1.
The PSP (Pyramid Pooling) branch specializes in capturing global context, operating exclusively on the deepest feature layer F4, which encompasses the richest semantic information. This branch applies multi-scale adaptive pooling (P1–P4) to F4, followed by transformation via 1 × 1 convolutions (T1–T4). Subsequently, all resultant maps are upsampled (U1–U4) and concatenated to generate the PSP Out feature map:

PSP Out = Concat(F4, U1(T1(P1(F4))), U2(T2(P2(F4))), U3(T3(P3(F4))), U4(T4(P4(F4)))). (11)
Simultaneously, the FPN facilitates the top-down fusion of multi-scale features. It establishes lateral connections (L1–L4) via 1 × 1 convolutions and employs a top-down “upsample-add-refine” strategy to progressively fuse high-level semantics with low-level details layer by layer:

FPN_i = Refine(L_i(F_i) + Upsample(FPN_{i+1})), i = 3, 2, 1. (12)

Here, FPN_i denotes the FPN output at level i, L_i(F_i) represents the lateral input, and the Refine process consists of a 3 × 3 convolution.
Finally, in the “Final Fusion” stage, the outputs from the FPN branch (FPN1, FPN2, FPN3) and the output of the PSP branch (PSP Out) are resized to a unified resolution of 128 × 128 and concatenated. The concatenated features are subsequently processed through a 3 × 3 convolution (Bottleneck) and a 1 × 1 convolutional classification head to yield the final segmentation prediction:

Pred = Conv_1×1(Conv_3×3(Concat(FPN1, FPN2, FPN3, PSP Out))). (13)
This dual-branch fusion architecture empowers the model to simultaneously preserve global semantic consistency and local spatial boundary details.
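The pyramid-pooling side of this dual-branch design can be sketched with a simple adaptive average pooling. The pooling scales (1, 2, 3, 6) follow PSPNet's common setting and the 1 × 1 transforms T1–T4 are omitted, so this is an assumption-laden sketch rather than the exact configuration used here.

```python
import numpy as np

def adaptive_avg_pool(x, s):
    """Average-pool (C, H, W) down to (C, s, s) over near-equal spatial bins."""
    rows = np.array_split(np.arange(x.shape[1]), s)
    cols = np.array_split(np.arange(x.shape[2]), s)
    out = np.empty((x.shape[0], s, s))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[:, i, j] = x[:, r][:, :, c].mean(axis=(1, 2))
    return out

def upsample_to(x, H, W):
    """Nearest-neighbour resize of (C, h, w) to (C, H, W)."""
    ri = np.arange(H) * x.shape[1] // H
    ci = np.arange(W) * x.shape[2] // W
    return x[:, ri][:, :, ci]

def psp_branch(f4, scales=(1, 2, 3, 6)):
    """Pool F4 at several scales, resize back, and concatenate with F4."""
    C, H, W = f4.shape
    pooled = [upsample_to(adaptive_avg_pool(f4, s), H, W) for s in scales]
    return np.concatenate([f4] + pooled, axis=0)
```

The scale-1 context channels carry the global mean of each input channel, which is exactly the kind of image-level prior that helps the decoder suppress road/roof confusions far from any local evidence.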