Building Extraction Network with Gated Mamba-CNN and Wavelet-Based Boundary Enhancement

Yang, Dongjie; Yang, Yuanwei; Gao, Xianjun; Huang, Rujing; Gao, Xinlong; Han, Kuikui; Guo, Kangliang; Tao, Yuan

doi:10.3390/rs18111773


by Dongjie Yang ¹,², Yuanwei Yang ¹,³,*, Xianjun Gao ¹, Rujing Huang ¹, Xinlong Gao ¹, Kuikui Han ¹, Kangliang Guo ¹ and Yuan Tao ⁴

¹ School of Geosciences, Yangtze University, Wuhan 430100, China

² School of Information and Design, Zhejiang Industry Polytechnic College, Shaoxing 312000, China

³ State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan 430063, China

⁴ Central & Southern China Municipal Engineering Design and Research Institute Co., Ltd., Wuhan 430010, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1773; https://doi.org/10.3390/rs18111773

Submission received: 30 March 2026 / Revised: 1 May 2026 / Accepted: 7 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Applications of Remote Sensing Imagery for Urban Areas (Second Edition))


Highlights

What are the main findings?

GWNet combines gated Mamba-CNN modeling and wavelet-based boundary enhancement to improve building extraction by jointly strengthening global-local feature representation and boundary recovery.
GWNet achieves the best overall performance on the WHU, Massachusetts, and WHU Satellite I datasets, while ablation results show that GMC mainly improves region completeness and WBO mainly enhances contour quality.

What are the implications of the main findings?

Adaptive global-local feature fusion is effective for reducing errors caused by spectral heterogeneity, shadow occlusion, and complex background interference in remote sensing building extraction.
Wavelet-based high-frequency enhancement provides a simple and robust strategy for preserving building boundaries and improving model generalization across scenes with different resolutions and complexities.

Abstract

Building extraction from high-resolution remote sensing imagery remains challenging due to spectral heterogeneity, complex background interference, and incomplete boundary delineation. To address these challenges, we propose GWNet, which integrates gated Mamba-CNN modeling with wavelet-based boundary enhancement. Specifically, a Gated Mamba-CNN Module (GMC) is embedded into the medium- and low-resolution branches to jointly capture local texture features and long-range dependencies. In addition, a channel-wise gating mechanism is introduced to adaptively balance global contextual information and local structural details, thereby alleviating fragmented predictions and internal holes within the same building caused by variations in roof materials, while reducing the misclassification between buildings and background objects such as roads and bare land. Furthermore, a Wavelet Boundary Optimization Module (WBO) is designed to exploit multi-directional high-frequency components extracted by fixed Haar wavelet filters, thereby enhancing the representation of building boundaries and corners. This design mitigates boundary blurring, incomplete contours, and missed detections caused by the loss of high-frequency edge information during downsampling. Extensive experiments on four public datasets, namely WHU, Massachusetts, WHU Satellite I, and Potsdam, demonstrate the effectiveness and robustness of GWNet across diverse spatial resolutions and scene complexities. In particular, GWNet achieves IoU/BIoU scores of 90.68%/66.88% on the WHU dataset, 73.02%/93.19% on the Massachusetts dataset, 63.86%/83.77% on the WHU Satellite I dataset, and 83.21%/58.96% on the Potsdam dataset, consistently outperforming several competitive methods. Qualitative results further confirm that GWNet produces more complete building regions and sharper, more continuous boundaries. These findings validate the effectiveness of the proposed global–local feature extraction mechanism and wavelet-based boundary enhancement strategy.

Keywords: building extraction; semantic segmentation; Mamba; wavelet

1. Introduction

Building extraction from high-resolution remote sensing imagery (HRSI) is a fundamental task in applications such as urban planning, updating geographic databases, disaster assessment, and land-use monitoring [1]. With the continuous improvement in HRSI resolution, the texture, shape, and spatial structure of buildings can be represented more explicitly, providing favorable conditions for fine-grained extraction. However, at the same time, the variations in scale, material, morphology, and imaging conditions of buildings in complex scenes have become increasingly pronounced, making accurate building extraction still highly challenging. Previous studies have shown that building extraction from optical HRSI has long been constrained by complex background interference and the difficulty of preserving boundary details [2].

In recent years, deep learning (DL)-based semantic segmentation methods have significantly advanced the performance of building extraction. In particular, convolutional neural networks (CNNs), owing to their strong local feature extraction capability [3], can effectively capture the textures, corners, and local shape characteristics of buildings, and have therefore become the mainstream technical paradigm for this task. However, the receptive field enlargement in conventional CNNs typically relies on stacked convolutions, pooling, or dilated convolutions, which remain focused on local neighborhood modeling and are relatively limited in capturing long-range dependencies and global contextual information [4,5]. When buildings appear in scenes with complex backgrounds or substantial roof material variations, relying solely on local features often leads to unstable target recognition, thereby resulting in missed detections, false positives, and discontinuities in overall building structures.

To compensate for the limitations of convolutional networks in global modeling, researchers have recently begun introducing global context modeling mechanisms into remote sensing (RS) semantic segmentation. In particular, the visual state space model Mamba has demonstrated considerable potential for modeling global dependencies in HRSI, owing to its ability to capture long-range dependencies with linear computational complexity. Existing hybrid Mamba-CNN architectures have shown that combining the local detail modeling capability of CNNs with the long-range dependency modeling capability of Mamba can effectively improve RS semantic segmentation performance. Nevertheless, most existing methods primarily focus on the fusion of global and local features. For the spectral heterogeneity problem in building extraction caused by complex backgrounds, illumination variations, and material differences, how to adaptively regulate the contributions of global and local information across different channels remains an open question.

In addition to regional semantic modeling, boundary quality is another critical issue in building extraction. Although deep networks can learn stronger semantic discrimination, high-frequency edge information is progressively attenuated during repeated downsampling and upsampling, leading to blurred building contours and incomplete boundaries. Existing studies have addressed this issue from the perspectives of boundary-assisted learning, edge branch construction, and boundary refinement modules, and have demonstrated that collaborative modeling of region learning and boundary learning is beneficial for contour preservation [6,7]. Meanwhile, wavelet transforms decompose features into low-frequency semantic information and high-frequency detail information, with the latter particularly sensitive to edges, textures, and structural variations, thus offering a new perspective for boundary enhancement. However, most existing wavelet-CNN studies mainly introduce wavelet transforms as general downsampling, pooling, multi-resolution representation, or image-restoration operators. For example, Wavelet CNNs integrate wavelet-based frequency-domain analysis into CNNs to supplement spatial-domain feature representation [8], while MWCNN embeds discrete wavelet transform and inverse wavelet transform into a U-Net-like architecture to enlarge the receptive field and improve the trade-off between computational efficiency and restoration performance [9]. In contrast, building extraction requires not only general detail preservation but also explicit recovery of geometrically regular and continuous building contours. Therefore, this study focuses on using fixed Haar high-frequency sub-bands as boundary-sensitive priors to enhance building edges, corners, and contour transitions.

To address the above issues, this study proposes GWNet (Gated Mamba-CNN and Wavelet-based Network) for building extraction from HRSI. The proposed network adopts a hierarchical multi-branch architecture. A Gated Mamba-CNN Module is embedded into the medium- and low-resolution branches, where a parallel local CNN path and a global Mamba path are employed to achieve adaptive global-local feature fusion, thereby enhancing the model’s joint representation capability for semantic consistency and local structural details of buildings in complex scenes. In addition, a Wavelet Boundary Optimization Module is introduced into the low-resolution branch. By exploiting multi-directional boundary responses extracted by fixed Haar wavelet filters and applying residual enhancement, this module compensates for the loss of high-frequency details caused by repeated downsampling and improves the recovery of building boundaries.

The main contributions are as follows:

First, a building extraction network, namely GWNet, is proposed by integrating gated Mamba-CNN modeling and wavelet-based boundary enhancement, thereby improving building extraction accuracy in complex scenes through global-local collaborative modeling and frequency-domain high-frequency edge enhancement.

Second, a Gated Mamba-CNN Module is designed to adaptively regulate the contributions of local detail features and global contextual features via a channel-wise gating mechanism, thereby improving the model’s adaptability to spectral heterogeneity and complex background interference.

Third, a Wavelet Boundary Optimization Module is developed to enhance the representation of building boundaries by utilizing high-frequency sub-band information from fixed Haar wavelets, thereby alleviating contour blurring and incomplete boundary extraction.

Extensive experiments conducted on four public datasets, namely WHU, Massachusetts, WHU Satellite I, and Potsdam, verify the effectiveness and robustness of the proposed method. Specifically, GWNet achieves IoU/F1/BIoU scores of 90.68%/95.11%/66.88% on WHU, 73.02%/84.41%/93.19% on Massachusetts, 63.86%/77.94%/83.77% on WHU Satellite I, and 83.21%/90.83%/58.96% on Potsdam, consistently outperforming multiple competitive methods. These results demonstrate that the proposed global–local collaborative modeling and wavelet-based boundary enhancement strategy can effectively improve region completeness and boundary preservation in HRSI building extraction.

2. Related Work

2.1. Remote Sensing Building Extraction Methods

Building extraction is a fundamental task in the interpretation of HRSI. Early methods mainly relied on handcrafted spectral, texture, shadow, and geometric features. However, such methods are highly sensitive to imaging conditions, variations in roof materials, and complex backgrounds, and therefore exhibit limited generalization ability [10]. With the development of DL, semantic segmentation methods based on fully convolutional networks have gradually become the mainstream technical paradigm for building extraction [11]. Although DL has substantially advanced building extraction performance, it still faces significant challenges in complex scenes, including large-scale variations, strong background interference, pronounced intra-class differences, and difficulties in preserving boundaries [2].

To address these challenges, existing studies have proposed improvements based on multi-scale feature fusion, context enhancement, and boundary constraints. Ran et al. enhanced the collaboration between high-level semantic representations and low-level spatial details through a coarse-to-fine prediction refinement strategy and a boundary refinement module, thereby alleviating issues such as internal holes in large buildings, missed detections of small buildings, and blurred edges [12]. Li et al. introduced an auxiliary boundary learning task to encourage the network to focus on both building regions and boundary information during training, thereby improving contour preservation capability [13]. Although these methods improve building extraction accuracy to some extent, preserving global semantic consistency and local structural details under complex background conditions remains an unresolved problem.

2.2. Global–Local Collaborative Modeling Methods

CNNs possess strong local receptive-field modeling capabilities in RS semantic segmentation and can effectively extract fine-grained features such as textures, corners, and edges. Therefore, they have long served as the backbone of extraction tasks. However, convolution operations are inherently biased toward local neighborhood modeling and are relatively limited in capturing long-range dependencies and large-scale contextual information. In scenes with significant variations in building scale, severe shadow occlusions, and strong visual similarity between background objects and buildings, this limitation can easily lead to insufficient global semantic understanding [14,15]. In contrast, the recently emerging visual state space model, Mamba, can model long-sequence dependencies with linear computational complexity and is considered to offer both efficiency advantages and strong potential for global dependency modeling in HRSI analysis [16]. As a result, it has rapidly become an important research direction in RS vision tasks, especially for scenarios involving large-format imagery and long-range dependency modeling [17].

2.3. Boundary Enhancement and Frequency-Domain Modeling Methods

Building extraction requires not only high regional segmentation accuracy but also regular, continuous, and geometrically consistent building contours. However, deep networks tend to lose high-frequency details during successive downsampling and upsampling operations, often resulting in blurred boundaries, rounded corners, and incomplete elongated building structures. To address this issue, various boundary enhancement strategies have been proposed, including auxiliary boundary-detection tasks, edge-branch construction, boundary-refinement modules, and coarse-to-fine progressive optimization frameworks [6,7]. These studies have demonstrated that collaborative modeling of region and boundary information is beneficial for improving contour completeness and shape fidelity.

In addition to boundary modeling in the spatial domain, frequency-domain enhancement has also become an important research direction for improving detail representation in CNN-based vision models. Wavelet transforms can decompose feature maps into low-frequency components related to coarse semantic structures and high-frequency components associated with edges, textures, and local structural variations. Therefore, wavelet-based representations have been widely explored to compensate for the detail loss caused by conventional convolution, pooling, and downsampling operations. For example, Wavelet CNNs introduce wavelet-based multi-resolution analysis into convolutional networks to supplement frequency-domain information and enhance spatial feature representation [8]. MWCNN further embeds discrete wavelet transform and inverse wavelet transform into a U-Net-like architecture, demonstrating the effectiveness of wavelet-CNN integration in enlarging the receptive field and improving the trade-off between computational efficiency and image restoration performance [9].

Recent studies have also introduced wavelet-based spatial–frequency fusion into semantic segmentation and remote sensing interpretation. For instance, Wavelet-CNet employs wavelet cross fusion and detail enhancement to improve RGB-thermal semantic segmentation [18]. Hua et al. introduced wavelet feature enhancement and spatial–frequency domain fusion into remote sensing semantic segmentation, showing that frequency-domain information can complement spatial-domain CNN features and improve segmentation accuracy [19]. These studies indicate that wavelet-based frequency-domain modeling is effective for enhancing structural details and improving multi-scale feature representation.

Nevertheless, most existing wavelet-CNN methods are mainly designed for image restoration, texture representation, multimodal fusion, or generic semantic segmentation. In these methods, wavelet transforms are usually used as general downsampling, pooling, multi-resolution representation, or feature-fusion operators. They do not explicitly focus on the incomplete boundaries, discontinuous contours, and weakened high-frequency structural cues commonly encountered in high-resolution remote sensing building extraction. Different from these methods, the proposed Wavelet Boundary Optimization Module (WBO) is specifically designed for building boundary enhancement. It employs fixed Haar high-frequency filters to extract horizontal, vertical, and diagonal boundary-sensitive responses from low-resolution feature maps, and then injects these responses back into the network through residual enhancement. This design provides a lightweight and parameter-free frequency-domain prior for recovering building edges, corners, and contour transitions.

In summary, existing building extraction methods have achieved substantial progress in regional semantic feature extraction, global context aggregation, and boundary detail recovery. However, two limitations remain. First, when strong spectral confusion exists between buildings and background objects in complex scenes, current methods are still insufficient in jointly modeling global semantics and local details. Second, during boundary recovery, high-frequency structural information is not fully exploited, resulting in blurred and incomplete building contours. To this end, this study develops an improved building extraction network that integrates adaptive global–local feature allocation with wavelet-based high-frequency boundary enhancement.

3. Methodology

3.1. Model Architecture

The overall architecture of GWNet (Gated Mamba-CNN and Wavelet-based Network) is illustrated in Figure 1. The proposed network adopts an encoder–decoder architecture, in which the Gated Mamba-CNN Module (GMC) is embedded into the medium- and low-resolution feature stages. Through channel-wise gating, GMC adaptively integrates global contextual information with local structural details, thereby improving the network’s discriminative representation capability. Meanwhile, a Wavelet Boundary Optimization Module (WBO) is introduced at the bottom of the lowest-resolution branch. Based on fixed Haar wavelet filters, this module extracts multi-directional boundary responses to enhance the representation of building edges, corners, and contour details, compensating for the loss of high-frequency information during downsampling and yielding more complete and clearer building boundaries.

3.2. Gated Mamba-CNN Module

To alleviate the spectral heterogeneity problem in building extraction caused by roof material variations, shadow occlusion, and complex background interference, this study designs a Gated Mamba-CNN Module (GMC). As illustrated in Figure 2, the module consists of five components: Local CNN Branch, Global Mamba Branch, Channel-wise Gating, Adaptive Fusion, and Residual Enhancement. The core idea is to model local details and global contextual information in parallel on the same input feature map. Specifically, the local branch focuses on extracting neighborhood features such as building edges, corners, and textures. In contrast, the global branch employs the Mamba state-space model to capture long-range dependencies and maintain overall structural consistency. Subsequently, a channel-wise gating mechanism is introduced to adaptively regulate the contributions of the two feature types across different channels, and a residual enhancement strategy is further adopted to preserve the original feature representation, thereby improving both training stability and feature representation capability. According to the implementation, GMC is primarily embedded in the medium- and low-resolution branches to strengthen global-local collaborative modeling while avoiding excessive computational overhead.

3.2.1. Global–Local Parallel Feature Extraction

Let the input feature map be denoted as $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$, and $W$ represent the batch size, the number of channels, and the spatial height and width, respectively. The module first constructs local and global representations in parallel from the same input feature map.

(1): Local branch: local detail modeling via depthwise separable convolution

The local branch employs depthwise separable convolution to extract local building texture. The computation process is as follows:

$$F_{l} = \mathrm{BN}_{2}(\mathrm{Conv}_{1 \times 1}(\delta(\mathrm{BN}_{1}(\mathrm{DWConv}_{3 \times 3}(X))))) \tag{1}$$

where $\mathrm{DWConv}_{3 \times 3}(\cdot)$ denotes a channel-wise independent (depthwise) $3 \times 3$ convolution, $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ pointwise convolution, $\mathrm{BN}(\cdot)$ represents batch normalization, and $\delta(\cdot)$ denotes the ReLU activation function.
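
As a rough illustration of Eq. (1), the following plain-Python sketch applies a depthwise 3 × 3 convolution, a ReLU, and a 1 × 1 pointwise convolution to a small (C, H, W) feature map. It is only a sketch under stated assumptions: both batch-normalization layers are omitted, zero padding with stride 1 is assumed, and the function names and weight layouts are hypothetical rather than taken from the authors' implementation.

```python
def depthwise_conv3x3(x, kernels):
    """Apply one 3x3 kernel per channel (assumed zero padding, stride 1).

    x: nested list of shape (C, H, W); kernels: nested list of shape (C, 3, 3).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in range(3):
                    for dj in range(3):
                        ii, jj = i + di - 1, j + dj - 1
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding
                            acc += kernels[c][di][dj] * x[c][ii][jj]
                out[c][i][j] = acc
    return out


def relu(x):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def pointwise_conv(x, weight):
    """1x1 convolution mixing channels; weight has shape (C_out, C_in)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][c] * x[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)]
            for o in range(len(weight))]


def local_branch(x, dw_kernels, pw_weight):
    # Eq. (1) without the two batch-normalization layers (assumption).
    return pointwise_conv(relu(depthwise_conv3x3(x, dw_kernels)), pw_weight)


# Two channels on a 2x2 map; identity kernels make the result easy to check.
identity3x3 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
x = [[[1.0, -2.0], [3.0, 4.0]], [[-1.0, 0.5], [2.0, -3.0]]]
y = local_branch(x, [identity3x3, identity3x3], [[1.0, 0.0], [0.0, 1.0]])
```

With identity kernels the branch reduces to a ReLU of the input, which makes the data flow easy to verify. In the depthwise step each channel is filtered on its own, and channels are mixed only in the pointwise step.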

(2): Global branch: long-range dependency modeling based on Mamba

To enhance modeling of large-scale spatial contextual information and long-range pixel dependencies, the global branch first reshapes the two-dimensional feature map into a one-dimensional sequence. Let $L = H \times W$. Then, the input feature $X$ can be represented as the sequence

$$S = \mathrm{Flatten}(X), \quad S \in \mathbb{R}^{B \times L \times C} \tag{2}$$

Considering that directly performing long-sequence modeling over all channels would incur substantial computational overhead, the channel dimension is divided into $K$ groups, with the default setting of $K = 4$. Accordingly, the channel dimensionality of each group is defined as follows:

$$d = \frac{C}{K} \tag{3}$$

To clearly describe the channel-grouping process, $S$ is divided along the channel dimension and reorganized into a grouped sequence $G$. Specifically, $S \in \mathbb{R}^{B \times L \times C}$ is first reshaped into $\mathbb{R}^{B \times L \times K \times d}$, then permuted to $\mathbb{R}^{B \times K \times L \times d}$, and finally the batch dimension and group dimension are merged. The grouped sequence is defined as follows:

$$G = \mathrm{Group}(S, K), \quad G \in \mathbb{R}^{(B \times K) \times L \times d} \tag{4}$$

This operation allows Mamba to model each channel group independently while keeping the spatial sequence length L unchanged. Subsequently, layer normalization and Mamba-based state-space modeling are applied to the grouped sequence:

$$N = \mathrm{LN}(G) \tag{5}$$

$$Z = M(N) + \lambda N, \quad Z \in \mathbb{R}^{(B \times K) \times L \times d} \tag{6}$$

where $\mathrm{LN}(\cdot)$ denotes layer normalization, $M(\cdot)$ denotes the Mamba sequence modeling operator, and $\lambda$ is a learnable residual scaling factor for the normalized grouped sequence. This design preserves the advantage of the state-space model in capturing long-range dependencies while reducing computational cost through channel grouping and batch merging. After Mamba modeling, $Z$ is restored from the grouped representation to the original sequence form, denoted as $U \in \mathbb{R}^{B \times L \times C}$. Then, $U$ is reshaped back to the two-dimensional feature map and normalized by batch normalization to obtain the global feature representation:

$$F_{g} = \mathrm{BN}_{g}(\mathrm{Restore2D}(U)), \quad F_{g} \in \mathbb{R}^{B \times C \times H \times W} \tag{7}$$

where Group(·) denotes the channel-grouping operation, and Restore2D(·) denotes the inverse transformation that restores the sequence representation to the original spatial feature format.

Overall, the local branch is more effective at characterizing fine-grained structural information, such as edges, corners, and textures. In contrast, the global branch is better suited to encoding cross-region contextual information and overall building structural consistency. The parallel modeling of these two feature types provides a foundation for subsequent adaptive feature allocation.
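
The shape bookkeeping of the global branch in Eqs. (2)–(7) can be sketched in plain Python on nested lists. This is an illustration under loudly stated assumptions: `toy_scan` is a simple causal running average standing in for the Mamba operator $M(\cdot)$ (it is not Mamba), layer normalization carries no learnable affine parameters, $\lambda$ is a fixed float instead of a learned value, and the final batch normalization $\mathrm{BN}_{g}$ is omitted. All function names are hypothetical.

```python
import math


def flatten(x):
    """(B, C, H, W) -> (B, L, C) with L = H * W, as in Eq. (2)."""
    return [[[img[c][h][w] for c in range(len(img))]
             for h in range(len(img[0])) for w in range(len(img[0][0]))]
            for img in x]


def group(s, K):
    """(B, L, C) -> (B * K, L, d) with d = C // K, as in Eqs. (3)-(4)."""
    C = len(s[0][0])
    assert C % K == 0, "C must be divisible by K"
    d = C // K
    return [[tok[k * d:(k + 1) * d] for tok in seq]
            for seq in s for k in range(K)]


def restore_seq(z, K):
    """(B * K, L, d) -> (B, L, C): inverse of group()."""
    B, L = len(z) // K, len(z[0])
    return [[[v for k in range(K) for v in z[b * K + k][t]]
             for t in range(L)] for b in range(B)]


def restore_2d(u, H, W):
    """(B, L, C) -> (B, C, H, W): inverse of flatten()."""
    C = len(u[0][0])
    return [[[[seq[h * W + w][c] for w in range(W)] for h in range(H)]
             for c in range(C)] for seq in u]


def layer_norm(tok, eps=1e-5):
    mean = sum(tok) / len(tok)
    var = sum((v - mean) ** 2 for v in tok) / len(tok)
    return [(v - mean) / math.sqrt(var + eps) for v in tok]


def toy_scan(seq, decay=0.5):
    """Stand-in for M(.): causal running average along the sequence."""
    state, out = [0.0] * len(seq[0]), []
    for tok in seq:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, tok)]
        out.append(state)
    return out


def global_branch(x, K, lam=1.0):
    H, W = len(x[0][0]), len(x[0][0][0])
    g = group(flatten(x), K)                              # Eqs. (2)-(4)
    z = []
    for seq in g:
        n = [layer_norm(tok) for tok in seq]              # Eq. (5)
        m = toy_scan(n)                                   # placeholder for M(N)
        z.append([[a + lam * b for a, b in zip(mt, nt)]   # Eq. (6)
                  for mt, nt in zip(m, n)])
    return restore_2d(restore_seq(z, K), H, W)            # Eq. (7) without BN


# B = 1, C = 4, H = W = 2, split into K = 2 groups of d = 2 channels each.
x = [[[[float(c * 4 + h * 2 + w) for w in range(2)] for h in range(2)]
      for c in range(4)]]
s = flatten(x)
g = group(s, 2)
f_g = global_branch(x, K=2)
```

Merging the group dimension into the batch dimension means the sequence operator sees $B \times K$ independent sequences of length $L$ and width $d$, so the sequence length is unchanged while the per-sequence channel width shrinks by a factor of $K$.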

3.2.2. Gated Adaptive Fusion and Residual Enhancement

After obtaining the local feature $F_{l}$ and the global feature $F_{g}$, a channel-wise gating mechanism is further introduced to dynamically regulate the relative contributions of global and local information across different channels according to the input content. First, global average pooling is applied to the original input feature $X$ to obtain a channel descriptor:

$$z = \mathrm{GAP}(X) \in \mathbb{R}^{B \times C \times 1 \times 1} \tag{8}$$

Then, a gating function composed of two $1 \times 1$ convolutions is employed to generate the gating weights, which can be formulated as follows:

$$\alpha = \sigma(W_{2}\, \delta(W_{1} z)), \quad \alpha \in \mathbb{R}^{B \times C \times 1 \times 1} \tag{9}$$

Here, $W_{1}$ and $W_{2}$ denote the weight matrices of the two convolutional transformations, respectively, $r$ represents the channel reduction ratio, and $\sigma(\cdot)$ denotes the Sigmoid activation function. The resulting $\alpha \in [0, 1]$ characterizes the preference of each channel for global features. When $\alpha_{c}$ approaches 1, the corresponding channel relies more heavily on global contextual information; conversely, when $\alpha_{c}$ approaches 0, the channel places greater emphasis on local textures and boundary responses. Under the modulation of the gating weights, the module performs channel-wise adaptive fusion of the global and local features, which is formulated as follows:

$$F = \alpha \odot F_{g} + (1 - \alpha) \odot F_{l} \tag{10}$$

where $\odot$ denotes element-wise multiplication. Compared with direct addition or simple concatenation, this fusion strategy can dynamically regulate the contribution of each branch according to different input scenes and semantic channels, thereby better adapting to the spectral heterogeneity and background ambiguity commonly encountered in building extraction tasks.

Finally, to preserve the original feature representation, alleviate gradient degradation during deep network training, and improve the stability of the fused output, residual injection and nonlinear activation are further introduced, yielding the final output:

$$Y = \delta(F + X) \tag{11}$$

where $Y \in \mathbb{R}^{B \times C \times H \times W}$ denotes the final output of the module. The residual enhancement enables the module to effectively inherit the original feature information after introducing global modeling and gated fusion, thereby avoiding over-smoothing and feature drift.
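
The gating, fusion, and residual steps in Eqs. (8)–(11) can be traced on a tiny example. The sketch below is plain Python for a single sample of shape (C, H, W). It assumes a squeeze-and-excitation-style layout in which $W_{1}$ has shape $(C/r, C)$ and $W_{2}$ has shape $(C, C/r)$; the source does not spell these shapes out, and the weights and function names here are hypothetical.

```python
import math


def global_avg_pool(x):
    """(C, H, W) -> list of C channel means, as in Eq. (8)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def gate(z, w1, w2):
    """Eq. (9): alpha = sigmoid(W2 relu(W1 z)).

    Assumed shapes: w1 is (C // r, C) and w2 is (C, C // r).
    """
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-t)) for t in logits]


def gmc_output(x, f_g, f_l, w1, w2):
    alpha = gate(global_avg_pool(x), w1, w2)
    return [[[max(0.0,                                  # Eq. (11): ReLU
                  alpha[c] * f_g[c][i][j]               # Eq. (10): global part
                  + (1.0 - alpha[c]) * f_l[c][i][j]     # Eq. (10): local part
                  + x[c][i][j])                         # residual input
              for j in range(len(x[c][i]))]
             for i in range(len(x[c]))]
            for c in range(len(x))]


# C = 2 channels, reduction ratio r = 2, hypothetical weights.
x = [[[1.0, 1.0]], [[2.0, 2.0]]]
f_g = [[[5.0, 5.0]], [[5.0, 5.0]]]
f_l = [[[-1.0, -1.0]], [[-1.0, -1.0]]]
w1 = [[1.0, 1.0]]
y_split = gmc_output(x, f_g, f_l, w1, [[10.0], [-10.0]])  # alpha close to [1, 0]
y_even = gmc_output(x, f_g, f_l, w1, [[0.0], [0.0]])      # alpha = [0.5, 0.5]
```

In `y_split` the first channel follows the global feature and the second follows the local feature, while `y_even` averages the two branches. This is the per-channel behavior that Eq. (10) describes for $\alpha_{c}$ near 1, near 0, and in between.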

The detailed computational procedure of the proposed GMC is summarized in Algorithm 1.

Algorithm 1 Gated Mamba-CNN Module

Input: Input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$; number of channel groups $K$; Mamba operator $M(\cdot)$

Output: Refined feature map $Y \in \mathbb{R}^{B \times C \times H \times W}$

1: $R \leftarrow X$

2: Local CNN branch

3: $F_{l} \leftarrow \mathrm{BN}_{2}(\mathrm{PWConv}(\delta(\mathrm{BN}_{1}(\mathrm{DWConv}_{3 \times 3}(X)))))$

4: Global Mamba branch

5: $L \leftarrow H \times W$; $d \leftarrow C / K$

6: $S \leftarrow \mathrm{Flatten}(X)$

7: $G \leftarrow \mathrm{Group}(S, K)$

8: $N \leftarrow \mathrm{LN}(G)$

9: $Z \leftarrow M(N) + \lambda N$

10: $U \leftarrow \mathrm{RestoreSeq}(Z, K)$

11: $F_{g} \leftarrow \mathrm{BN}_{g}(\mathrm{Restore2D}(U))$

12: Channel-wise gated adaptive fusion

13: $z \leftarrow \mathrm{GAP}(R)$

14: $\alpha \leftarrow \sigma(W_{2}\, \delta(W_{1} z))$

15: $F \leftarrow \alpha \odot F_{g} + (1 - \alpha) \odot F_{l}$

16: Residual enhancement

17: $Y \leftarrow \delta(F + X)$

18: return $Y$

3.3. Wavelet Boundary Optimization Module

To enhance the representation of building boundaries, this study designs a Wavelet Boundary Optimization Module (WBO) with an architecture illustrated in Figure 3. Given the feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ from the lowest-resolution branch, the module first employs a fixed two-dimensional Haar high-frequency filter bank, $\{K_{LH}, K_{HL}, K_{HH}\}$, to perform grouped convolution with a stride of 2 on each channel independently, thereby extracting high-frequency responses in the horizontal, vertical, and diagonal directions and obtaining a multi-directional high-frequency feature representation.

The Haar wavelet filter bank is adopted for three main reasons. First, Haar wavelets provide a simple and efficient way to decompose feature responses into directional high-frequency components, which are naturally sensitive to horizontal, vertical, and diagonal boundary variations. Second, the fixed-parameter design introduces no additional learnable parameters, thereby avoiding extra training burden and reducing the risk of overfitting, especially when the number of training samples is limited. Third, compared with general edge operators such as Sobel or Laplacian filters, Haar wavelets provide a more structured frequency-domain decomposition that can be naturally embedded into CNN feature maps through grouped convolution. Unlike previous wavelet-CNN methods that mainly use wavelet transforms for general feature downsampling, pooling, or image restoration, WBO explicitly enhances building boundary responses in the lowest-resolution branch, where high-frequency contour information is most likely to be weakened during repeated downsampling.
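
To make the stride-2 Haar filtering concrete, the sketch below applies three fixed 2 × 2 high-pass kernels to each channel independently, which is what a grouped convolution with one group per channel amounts to. The kernel signs, the 1/2 scaling, and the mapping of the names LH, HL, and HH to directions follow one common convention and are assumptions here, since the source does not list the kernel values. Function names are hypothetical.

```python
# 2x2 Haar high-pass kernels with 1/2 scaling (assumed convention).
HAAR_HIGH = {
    "LH": [[0.5, 0.5], [-0.5, -0.5]],   # difference between the two rows
    "HL": [[0.5, -0.5], [0.5, -0.5]],   # difference between the two columns
    "HH": [[0.5, -0.5], [-0.5, 0.5]],   # diagonal difference
}


def haar_highpass(channel):
    """Stride-2 filtering of one (H, W) channel; H and W must be even."""
    H, W = len(channel), len(channel[0])
    return {
        name: [[sum(k[a][b] * channel[2 * i + a][2 * j + b]
                    for a in range(2) for b in range(2))
                for j in range(W // 2)] for i in range(H // 2)]
        for name, k in HAAR_HIGH.items()
    }


def wbo_high_frequency(x):
    """(C, H, W) -> (3C, H/2, W/2), ordered channel by channel."""
    out = []
    for channel in x:
        bands = haar_highpass(channel)
        out.extend([bands["LH"], bands["HL"], bands["HH"]])
    return out


# A vertical step edge located inside the first 2x2 block of every row pair.
edge = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]
bands = haar_highpass(edge)
stacked = wbo_high_frequency([edge, edge])
```

For this input only the column-difference band responds, and only in the blocks that contain the step. An edge that happens to coincide with a 2 × 2 block boundary would produce no response at this stride, which is one reason such responses are used here as a residual cue added to the original features instead of a standalone edge map.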

Subsequently, the multi-directional high-frequency sub-bands are fused by a $1 \times 1$ convolution, batch normalization, and ReLU activation to produce a compact edge-enhanced representation $E$, which is then restored to the original spatial resolution via bilinear interpolation. The computation is formulated as follows:

$$E = \mathrm{Up}(\delta(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(H))), H, W) \tag{12}$$

where the innermost $H$ denotes the stacked high-frequency sub-band responses (with $3C$ channels at half the spatial resolution), the trailing $H$ and $W$ denote the target height and width, $\mathrm{Up}(\cdot)$ denotes bilinear upsampling, $\delta(\cdot)$ denotes the ReLU activation function, $\mathrm{BN}(\cdot)$ represents batch normalization, and $\mathrm{Conv}_{1 \times 1}$ denotes a $1 \times 1$ convolution. Finally, the module outputs the enhanced feature $X^{w}$ in a residual manner, which is defined as

$$X^{w} = X + E \tag{13}$$

By explicitly introducing high-frequency priors in the frequency domain, this design compensates for the loss of boundary details caused by repeated downsampling in convolutional networks, thereby improving contour completeness and the structural recovery of buildings. The procedure is summarized in Algorithm 2.

Algorithm 2 Wavelet Boundary Optimization Module

Input: Input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$

Output: Boundary-enhanced feature map $X^{w} \in \mathbb{R}^{B \times C \times H \times W}$

1: Construct fixed 2D Haar high-frequency filter bank

2: $W_{haar} \leftarrow \{K_{LH}, K_{HL}, K_{HH}\}$

3: Extract multi-directional high-frequency responses

4: $H \leftarrow \mathrm{GroupConv}(X, W_{haar}, \mathrm{stride} = 2, \mathrm{groups} = C)$

5: // $H \in \mathbb{R}^{B \times 3C \times H/2 \times W/2}$

6: Fuse high-frequency subbands

7: $\tilde{E} \leftarrow \delta(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(H)))$

8: Recover original spatial resolution

9: $E \leftarrow \mathrm{Up}(\tilde{E}, H, W)$

10: Residual boundary optimization

11: $X^{w} \leftarrow X + E$

12: return $X^{w}$
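
Steps 6 to 11 of Algorithm 2 can be sketched as follows for a single sample. The sketch takes already computed sub-band maps as input, fuses them with a 1 × 1 convolution and ReLU, upsamples with bilinear interpolation, and adds the result to the input. Batch normalization is omitted, the interpolation uses half-pixel centers (an assumption; the source does not state the alignment rule), and the fusion weights and function names are hypothetical.

```python
def fuse_1x1(bands, weight):
    """1x1 convolution + ReLU over (3C, h, w) sub-bands; weight is (C, 3C)."""
    h, w = len(bands[0]), len(bands[0][0])
    return [[[max(0.0, sum(wt * bands[k][i][j] for k, wt in enumerate(row)))
              for j in range(w)] for i in range(h)] for row in weight]


def bilinear_resize(ch, out_h, out_w):
    """Bilinear interpolation of one channel using half-pixel centers."""
    in_h, in_w = len(ch), len(ch[0])
    out = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1, wy = min(y0 + 1, in_h - 1), y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1, wx = min(x0 + 1, in_w - 1), x - x0
            top = ch[y0][x0] * (1.0 - wx) + ch[y0][x1] * wx
            bottom = ch[y1][x0] * (1.0 - wx) + ch[y1][x1] * wx
            row.append(top * (1.0 - wy) + bottom * wy)
        out.append(row)
    return out


def wbo_residual(x, bands, weight):
    """Eqs. (12)-(13) without batch normalization: X^w = X + Up(E~)."""
    H, W = len(x[0]), len(x[0][0])
    fused = fuse_1x1(bands, weight)
    up = [bilinear_resize(ch, H, W) for ch in fused]
    return [[[x[c][i][j] + up[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(len(x))]


# C = 1, H = W = 4; three 2x2 sub-bands with a response in the left column.
x = [[[1.0] * 4 for _ in range(4)]]
bands = [
    [[0.0, 0.0], [0.0, 0.0]],     # first sub-band
    [[-1.0, 0.0], [-1.0, 0.0]],   # second sub-band
    [[0.0, 0.0], [0.0, 0.0]],     # third sub-band
]
x_w = wbo_residual(x, bands, [[0.0, -1.0, 0.0]])
```

The residual form leaves the input untouched wherever the fused high-frequency response is zero, so the module can only add boundary evidence on top of the existing features.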

4. Experiment

4.1. Datasets

To comprehensively evaluate the effectiveness and generalization capability of the proposed model, experiments were conducted on four public datasets: WHU [20], Massachusetts [21], the WHU Satellite I dataset, and Potsdam [22].

4.1.1. WHU Dataset

The WHU building dataset, a high-quality benchmark dataset, was adopted in this study. It has a spatial resolution of 0.3 m and contains 150 HRSI, each with a size of 6800 × 7200 pixels. In line with the experimental settings, all images were cropped into 512 × 512-pixel patches for training, validation, and testing. The training, validation, and test sets contain 4736, 1036, and 2416 image patches, respectively.

4.1.2. Massachusetts Dataset

In addition, the Massachusetts dataset, with lower spatial resolution, was employed to complement the WHU dataset and provide a more thorough evaluation of the generalization ability of the proposed network across different resolutions. This dataset has a spatial resolution of 1 m and consists of 151 HRSI, each of size 1500 × 1500 pixels. Similarly, all images were cropped into 512 × 512 patches for training, validation, and testing. The training, validation, and test sets contain 1066, 36, and 90 image patches, respectively.

4.1.3. WHU Satellite I Dataset

Furthermore, the WHU Satellite I dataset was included in the experiments. This dataset was selected because its spatial resolution varies considerably, ranging from approximately 0.3 m to 2.5 m, providing a more challenging scenario for evaluating the adaptability of the proposed model across different resolution conditions. It contains 51 HRSI, which were also cropped into 512 × 512 patches for training, validation, and testing. The training, validation, and test sets contain 488, 30, and 51 image patches, respectively.

4.1.4. Potsdam Dataset

The Potsdam dataset was acquired in the urban area of Potsdam, Germany, and comprises 38 very high-resolution aerial scenes with a spatial resolution of 0.05 m and an image size of 6000 × 6000 pixels. It represents typical urban environments in Potsdam, including large-scale building blocks and densely arranged small buildings. After patch generation, 3456, 1008, and 1008 samples were used for training, validation, and testing, respectively.
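The reported Potsdam sample counts are consistent with tiling each 6000 × 6000 scene into a 12 × 12 grid of 512 × 512 patches. The sketch below checks this arithmetic. The evenly spaced, slightly overlapping placement used in the hypothetical helper `tile_starts` is an assumption, since the exact cropping stride is not stated here.

```python
def tile_starts(length, patch):
    """Evenly spaced patch origins covering [0, length) with ceil(length / patch) patches."""
    n = -(-length // patch)  # ceiling division
    if n == 1:
        return [0]
    # Integer spacing so that the last patch ends exactly at the image border.
    return [(i * (length - patch)) // (n - 1) for i in range(n)]


starts = tile_starts(6000, 512)
patches_per_scene = len(starts) ** 2
total_patches = 38 * patches_per_scene

print(len(starts), patches_per_scene, total_patches)  # 12 144 5472
print(3456 + 1008 + 1008)                             # 5472
```

Under this assumption the 3456/1008/1008 split corresponds to 24, 7, and 7 whole scenes, respectively.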

4.2. Experimental Details

4.2.1. Experimental Environment

All experiments were conducted on Windows with an NVIDIA GeForce RTX 3090 GPU, using PyTorch 2.1.1 as the DL framework. The remaining implementation details are listed in Table 1. In addition, all compared methods were run on the same platform using their official code.

Although different epoch numbers were used across datasets, this setting was primarily determined by the number of cropped training patches. Since the WHU training set contains substantially more image patches than the Massachusetts dataset, 50 epochs on WHU and 200 epochs on Massachusetts yield comparable total training iterations, ensuring a fair optimization process across datasets.
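The claim of comparable training effort can be checked with the patch counts given in Section 4.1. The sketch below counts patch presentations (training patches × epochs); the number of iterations additionally depends on the batch size, which is not restated here, so the ratio carries over to iterations only under the assumption that the same batch size is used on both datasets.

```python
train_patches = {"WHU": 4736, "Massachusetts": 1066}
epochs = {"WHU": 50, "Massachusetts": 200}

# Total number of training patches seen over the whole schedule.
presentations = {name: train_patches[name] * epochs[name] for name in train_patches}
ratio = presentations["WHU"] / presentations["Massachusetts"]

print(presentations)    # {'WHU': 236800, 'Massachusetts': 213200}
print(round(ratio, 2))  # 1.11
```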

4.2.2. Evaluation Metrics

The evaluation metrics adopted in this study include intersection over union (IoU), overall accuracy (OA), precision (P), recall (R), F1 score (F1), structural similarity index (SSIM), and boundary intersection over union (BIoU) [24].

OA = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%

(14)

P = \frac{TP}{TP + FP} \times 100\%

(15)

R = \frac{TP}{TP + FN} \times 100\%

(16)

F1 = \frac{2 \times P \times R}{P + R} \times 100\%

(17)

IoU = \frac{TP}{TP + FP + FN} \times 100\%

(18)

BIoU = \frac{|(G_{d} \cap G) \cap (P_{d} \cap P)|}{|(G_{d} \cap G) \cup (P_{d} \cap P)|} \times 100\%

(19)

where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. In BIoU, G and P are the ground-truth and predicted target sets, while G_{d} and P_{d} denote the pixels within a distance d from the boundaries of G and P, respectively. The parameter d is set following [24].
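Equations (14)–(19) can be illustrated on a toy pair of binary masks. The following is a minimal NumPy sketch rather than the evaluation code used in the experiments. The boundary band G_{d} \cap G is approximated by removing a d-times eroded mask (3 × 3 square element) from the mask itself, and both this approximation and the value d = 1 are assumptions chosen for the example.

```python
import numpy as np


def region_metrics(pred, gt):
    """OA, P, R, F1, and IoU in percent from two binary masks (Eqs. 14-18)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return {
        "OA": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "P": p,
        "R": r,
        "F1": 2.0 * p * r / (p + r) if p + r else 0.0,
        "IoU": 100.0 * tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


def erode(mask, d):
    """Binary erosion with a 3x3 square, applied d times; outside the image is background."""
    m = np.asarray(mask, dtype=bool)
    rows, cols = m.shape
    for _ in range(d):
        padded = np.pad(m, 1, constant_values=False)
        out = np.ones_like(m)
        for dy in range(3):
            for dx in range(3):
                out &= padded[dy:dy + rows, dx:dx + cols]
        m = out
    return m


def boundary_band(mask, d):
    """Mask pixels lying within roughly d pixels of the mask contour."""
    m = np.asarray(mask, dtype=bool)
    return m & ~erode(m, d)


def boundary_iou(pred, gt, d=1):
    """BIoU in percent (Eq. 19); returns 0.0 when both boundary bands are empty."""
    bp, bg = boundary_band(pred, d), boundary_band(gt, d)
    union = int(np.sum(bp | bg))
    return 100.0 * int(np.sum(bp & bg)) / union if union else 0.0


# Toy example: a 6x6 building predicted one pixel too far to the right.
gt = np.zeros((10, 10), dtype=bool)
gt[2:8, 2:8] = True
pred = np.zeros((10, 10), dtype=bool)
pred[2:8, 3:9] = True

m = region_metrics(pred, gt)
biou = boundary_iou(pred, gt, d=1)
print({k: round(v, 2) for k, v in m.items()}, round(biou, 2))
```

In this example a one-pixel shift leaves the region IoU at about 71.4% but lowers the BIoU to about 33.3%, which illustrates why BIoU is the more sensitive indicator of contour localization.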

SSIM(x, y) = \frac{(2 μ_{x} μ_{y} + C_{1})(2 σ_{xy} + C_{2})}{(μ_{x}^{2} + μ_{y}^{2} + C_{1})(σ_{x}^{2} + σ_{y}^{2} + C_{2})}

(20)

where μ_{x} and μ_{y} denote the means of x and y, σ_{x}^{2} and σ_{y}^{2} denote the variances, σ_{xy} denotes the covariance, and C_{1} and C_{2} are constants for numerical stability.
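Equation (20) can be evaluated directly from image statistics. The sketch below computes a single global SSIM value over the whole image, whereas common SSIM implementations average the same expression over local windows; the constants C_{1} = (0.01 L)^2 and C_{2} = (0.03 L)^2 with data range L are the widely used defaults and are an assumption here, since their values are not specified in this section.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Eq. (20) evaluated once over the whole image (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


mask = np.zeros((10, 10))
mask[2:8, 2:8] = 1.0

same = global_ssim(mask, mask)            # identical masks
inverted = global_ssim(mask, 1.0 - mask)  # structurally opposite masks
print(round(same, 4), inverted < 0)
```

Identical masks give a value of 1, while an inverted mask has negative covariance with the original and therefore a negative score.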

4.3. Results and Analysis

4.3.1. WHU Dataset Analysis

The WHU dataset has a spatial resolution of 0.3 m, preserving abundant local details, such as building textures, corners, and fine boundary structures. Therefore, this dataset places high demands on the model not only in regional recognition and structural preservation, but also in accurate boundary localization under dense building distributions and complex background interference.

Table 2 reports the quantitative comparison results on the WHU dataset. As shown, GWNet achieves the best overall performance among all compared methods. Specifically, its OA, P, R, IoU, F1, BIoU, and SSIM reach 98.91%, 94.99%, 95.23%, 90.68%, 95.11%, 66.88, and 96.36, respectively, ranking first across all seven metrics. Compared with the second-best results, GWNet improves OA, P, R, IoU, F1, BIoU, and SSIM by 0.19, 0.36, 0.44, 1.53, 0.85, 0.21, and 0.31 points, respectively. The simultaneous superiority in IoU, F1, and BIoU indicates that the proposed method not only improves regional overlap and extraction completeness but also achieves more accurate boundary localization and contour consistency.
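As a quick consistency check on the WHU figures, F1 follows from P and R by Equation (17), and IoU follows from F1 alone through the identity IoU = F1 / (2 − F1), which holds when both are expressed as fractions and results from eliminating FP + FN between Equations (17) and (18). The sketch below applies both relations to the reported P and R; small deviations of about 0.01 on other datasets would be expected from rounding.

```python
p, r = 94.99, 95.23                # reported WHU precision and recall (percent)

f1 = 2 * p * r / (p + r)           # Eq. (17) with P and R in percent
iou_from_f1 = 100 * f1 / (200 - f1)  # IoU = F1 / (2 - F1), rewritten for percent values

print(round(f1, 2), round(iou_from_f1, 2))  # 95.11 90.68
```

Both values agree with the reported F1 of 95.11% and IoU of 90.68%.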

Figure 4 further shows that GWNet yields more complete building bodies in dense building blocks, large-area roofs, and elongated or geometrically regular structures. Compared with other methods, GWNet more effectively suppresses false detections caused by roads, bare land, and surrounding clutter, while reducing adhesion between adjacent buildings and internal holes within the same building. Moreover, the boundary extraction results in Figure 5 demonstrate that the contours predicted by GWNet are closer to the reference boundaries, especially at sharp corners, long straight edges, and narrow structural parts, where competing methods still exhibit broken boundaries, blurred edges, or contour deviation.

These results show that the advantage of GWNet on the WHU dataset mainly comes from the complementary effects of GMC and WBO. On the one hand, GMC enhances global–local collaborative feature extraction by jointly modeling local textures and long-range contextual dependencies, which helps preserve the semantic continuity and structural integrity of building regions in complex scenes. On the other hand, WBO enhances high-frequency boundary responses using Haar wavelets, thereby improving corner preservation, contour sharpness, and boundary completeness. This interpretation is also consistent with the ablation results, where GMC contributes more to region overlap and object completeness, while WBO contributes more to boundary refinement and BIoU improvement.

4.3.2. Massachusetts Dataset Analysis

The Massachusetts dataset has a spatial resolution of 1 m, which is lower than that of the WHU dataset. In this lower-resolution setting, building boundary details, small-structure elements, and separation cues between adjacent buildings are more likely to be weakened. Therefore, this dataset is better suited to evaluating the robustness of the proposed method with respect to regional completeness, structural continuity, and boundary recovery under insufficient fine-detail information.

As listed in Table 3, GWNet still achieves the best overall performance on the Massachusetts dataset. Specifically, its OA, R, IoU, F1, BIoU, and SSIM reach 94.20%, 84.35%, 73.02%, 84.41%, 93.19, and 80.98, respectively, ranking first in six of the seven metrics. Although its Precision is slightly lower than that of several competing methods, GWNet improves OA, R, IoU, F1, BIoU, and SSIM by 0.63, 4.01, 3.49, 2.38, 1.23, and 1.65 points, respectively, over the second-best results. In particular, the substantial gain in Recall indicates that GWNet is more effective at suppressing missed detections. At the same time, the concurrent improvements in IoU and F1 confirm that the model achieves better extraction completeness and overall segmentation quality under low-resolution conditions.

The qualitative comparison in Figure 6 shows that, in dense residential areas, regular large-roof buildings, and urban scenes with complex backgrounds, GWNet produces more continuous and complete building regions. Compared with other methods, it reduces internal fragmentation, weak responses in small buildings, and adhesion between adjacent objects. Meanwhile, the boundary visualization in Figure 7 further demonstrates that GWNet recovers building contours more accurately, especially for rectangular and oblique structures, where the predicted boundaries are more closed, smoother, and closer to the reference contours, with fewer jagged artifacts and missed edge segments.

These observations indicate that the superiority of GWNet on the Massachusetts dataset is still closely related to the collaborative roles of GMC and WBO. GMC strengthens the representation of main building bodies by jointly extracting global context and local texture cues, which is particularly beneficial for maintaining semantic consistency and suppressing confusion between buildings and backgrounds when local details are weak. WBO, in contrast, compensates for the loss of edge and contour information caused by low resolution and repeated downsampling, thereby improving boundary integrity and geometric fidelity. This interpretation is also supported by the ablation results, which show that GMC contributes more to IoU and regional completeness, whereas WBO provides stronger gains in BIoU and boundary refinement.

4.3.3. WHU Satellite I Dataset Analysis

The WHU Satellite I dataset spans a wide spatial resolution range from approximately 0.3 m to 2.5 m and includes scenes with substantial scale variation, diverse imaging conditions, and complex backgrounds. Therefore, it provides a more challenging benchmark for evaluating the robustness and generalization capability of the proposed method under cross-scene and multi-resolution conditions.

As shown in Table 4, GWNet achieves the best performance on the WHU Satellite I dataset in terms of OA, IoU, F1, and BIoU, reaching 88.71%, 63.86%, 77.94%, and 83.77, respectively. Compared with the second-best results, GWNet improves OA, IoU, F1, and BIoU by 0.38, 3.79, 2.89, and 2.56 points, respectively. Although GWNet does not rank first in Precision, Recall, or SSIM, its clear superiority on the key metrics most directly related to overall segmentation quality demonstrates that it achieves a better balance between regional extraction and boundary recovery in complex multi-resolution scenes.

Figure 8 further shows that GWNet performs better in scenes with densely distributed small buildings, elongated structures, complex orientations, and large-scale variations. Compared with the competing methods, GWNet reduces missed detections, internal holes, and adhesion between adjacent buildings, while preserving better structural continuity. In addition, the boundary visualization in Figure 9 shows that GWNet produces clearer and more accurate boundary localization for long strip-shaped buildings, oblique building edges, and densely distributed small targets. Even in scenes with complex roof textures and large resolution changes, its extracted contours remain more complete and closer to the reference boundaries.

These results indicate that the performance gain of GWNet on the WHU Satellite I dataset can also be attributed to the complementary mechanisms of GMC and WBO. GMC improves the model’s ability to jointly capture global contextual layout and local structural details, which is crucial for maintaining semantic consistency across scales and reducing confusion in complex heterogeneous scenes. WBO further strengthens multi-directional high-frequency boundary responses, making it more effective at recovering corners, elongated contours, and boundary transitions under large variations in resolution. Overall, the strong performance of GWNet on this dataset demonstrates that the proposed method is not only effective under a single-resolution setting but also highly robust under multi-resolution and cross-scene conditions.

4.3.4. Potsdam Dataset Analysis

The Potsdam dataset contains building scenes with relatively regular geometric structures and fine man-made boundaries, making it suitable for evaluating the model’s ability to preserve geometric shapes, recover complete building regions, and delineate accurate contours in challenging local areas.

As shown in Table 5, GWNet achieves the best overall performance on the Potsdam dataset. Specifically, its OA, R, IoU, F1, BIoU, and SSIM reach 95.11%, 87.46%, 83.21%, 90.83%, 58.96, and 93.61, respectively, ranking first in six of the seven evaluation metrics. Compared with the second-best results, GWNet improves OA, R, IoU, F1, BIoU, and SSIM by 0.84, 0.72, 2.85, 1.72, 2.05, and 0.97 points, respectively. Although its Precision is slightly lower than that of UANet and MAP-Net, the consistent advantages in OA, IoU, F1, BIoU, and SSIM indicate that GWNet achieves stronger regional overlap, better contour consistency, and higher overall structural fidelity.

From the visualization results in Figure 10, it can be observed that GWNet performs better in scenes with large, elongated roofs, isolated small structures, oblique building boundaries, and locally complex edge regions. Compared with other methods, GWNet produces more complete building masks, suppresses spurious responses around the targets, and better preserves the overall geometric shape of buildings. Figure 11 further confirms that GWNet provides more accurate boundary localization, especially along long straight edges, angular corners, and narrow protruding parts, where competing methods still suffer from broken contours, boundary deviation, or noisy edge responses.

These results further demonstrate the effectiveness of GMC and WBO on geometrically structured building scenes. GMC improves global–local feature interaction, allowing the network to better maintain the integrity of building bodies while preserving local shape cues. WBO strengthens the representation of high-frequency contour information, thereby enhancing edge sharpness, corner preservation, and contour continuity. As a result, GWNet achieves superior performance not only in regional segmentation accuracy but also in boundary quality and geometric detail recovery on the Potsdam dataset.

5. Discussion

5.1. Effectiveness and Complementarity of GMC and WBO

To further verify the effectiveness and complementarity of the Gated Mamba-CNN Module (GMC) and the Wavelet Boundary Optimization Module (WBO), ablation experiments were conducted across three datasets: Massachusetts, WHU Satellite I, and WHU. These three datasets cover low-resolution scenes, multi-resolution complex scenes, and high-resolution scenes with rich structural details, respectively, thereby enabling a more comprehensive evaluation of the contributions of the two modules to regional representation and boundary recovery. As summarized in Table 6, both modules consistently improve the baseline on all three datasets. At the same time, the complete GWNet achieves the best overall performance, demonstrating the effectiveness of combining global–local collaborative modeling with wavelet-based boundary enhancement.

As shown in Table 6, on the Massachusetts dataset, the baseline achieves OA, P, R, IoU, F1, BIoU, and SSIM values of 93.38%, 85.01%, 78.25%, 68.76%, 81.49%, 91.93, and 79.33, respectively. After introducing WBO, the model improves to 93.85%, 86.12%, 79.83%, 70.73%, 82.86%, 92.03, and 80.60, whereas introducing GMC yields 93.93%, 85.97%, 80.57%, 71.21%, 83.19%, 91.78, and 80.61, respectively. Compared with the baseline, GMC brings larger gains in IoU and F1, whereas WBO yields a higher BIoU (92.03 versus 91.78). This indicates that GMC contributes more to strengthening global–local feature interaction and improving regional semantic completeness, thereby enhancing building-body integrity and regional overlap. In contrast, WBO is more effective in enhancing contour localization and boundary consistency, leading to clearer building outlines and better geometric refinement under low-resolution conditions.

A similar trend can also be observed on the WHU Satellite I dataset, which exhibits greater resolution variation and more complex scene heterogeneity. Relative to the baseline, Base+WBO improves IoU, F1, and BIoU by 3.01, 2.37, and 3.97 points, respectively, while Base+GMC improves the same metrics by 3.05, 2.39, and 3.72 points. These results again suggest that GMC plays a stronger role in improving region completeness and semantic consistency across different scales. In contrast, WBO contributes more to contour refinement and boundary recovery in complex multi-resolution scenes. On the WHU dataset, the same pattern remains evident: Base+GMC achieves a larger IoU gain than Base+WBO, while Base+WBO yields a larger BIoU improvement, further confirming the complementary functional roles of the two modules.

The visualization results in Figure 12 further support the above quantitative findings. Across the three datasets, the response maps produced by GMC are more concentrated in the main building bodies and internal semantic regions, indicating that GMC is more effective at enhancing region continuity, suppressing internal holes, and improving the completeness of building extraction. By contrast, the response maps produced by WBO exhibit stronger activations along building edges, corners, and separation regions between adjacent buildings, showing greater sensitivity to boundary structures and contour transitions. The final results preserve both strong regional responses and clear boundary activations, demonstrating that the joint use of GMC and WBO enables collaborative enhancement of global–local feature extraction and boundary optimization.

When GMC and WBO are introduced simultaneously, the complete GWNet achieves the best overall performance on all three datasets. Specifically, on Massachusetts, GWNet reaches 73.02% IoU, 84.41% F1, 93.19 BIoU, and 80.98 SSIM; on WHU Satellite I, it achieves 63.86% IoU, 77.94% F1, 83.77 BIoU, and 74.71 SSIM; and on WHU, it reaches 90.68% IoU, 95.11% F1, 66.88 BIoU, and 96.36 SSIM. These results indicate that GMC and WBO are not redundant; rather, they complement each other from distinct perspectives. GMC primarily improves building-region representation and global–local semantic consistency, whereas WBO primarily enhances contour detail recovery and boundary localization. Their combination, therefore, enables GWNet to achieve a better balance among regional completeness, structural fidelity, and boundary accuracy.
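The per-module gains on the Massachusetts dataset can be tabulated directly from the values quoted in this section. The short sketch below only subtracts the baseline from each variant; the dictionary keys are labels chosen for this example.

```python
rows = {
    "Base":     {"IoU": 68.76, "F1": 81.49, "BIoU": 91.93, "SSIM": 79.33},
    "Base+WBO": {"IoU": 70.73, "F1": 82.86, "BIoU": 92.03, "SSIM": 80.60},
    "Base+GMC": {"IoU": 71.21, "F1": 83.19, "BIoU": 91.78, "SSIM": 80.61},
    "GWNet":    {"IoU": 73.02, "F1": 84.41, "BIoU": 93.19, "SSIM": 80.98},
}

base = rows["Base"]
# Gain of every variant over the baseline, in points.
gains = {
    name: {metric: round(values[metric] - base[metric], 2) for metric in base}
    for name, values in rows.items()
    if name != "Base"
}

for name, gain in gains.items():
    print(name, gain)
```

The IoU gains are 1.97 points for WBO alone, 2.45 for GMC alone, and 4.26 for the full model, while the BIoU changes are +0.10, −0.15, and +1.26 points. The BIoU gain of the full model is thus much larger than either single-module change, which is in line with the complementary roles described above.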

5.2. Analysis of the Gate Mechanism and High-Frequency Filter Selection

To further analyze the internal mechanisms of the proposed modules, additional ablation experiments were conducted on the Massachusetts dataset. Five variants were compared, including the baseline network, the network without the gate mechanism, the networks using Sobel and Laplacian filters instead of Haar wavelet filters, and the complete GWNet. The results are reported in Table 7.

As shown in Table 7, the complete GWNet achieves the best overall performance, with OA, R, IoU, F1, BIoU, and SSIM values of 94.20%, 84.35%, 73.02%, 84.41%, 93.19, and 80.98, respectively. Compared with the baseline, GWNet improves IoU, F1, BIoU, and SSIM by 4.26, 2.92, 1.26, and 1.65 points, respectively. These improvements indicate that the proposed design effectively enhances both building-region completeness and boundary quality.

In the GMC gate mechanism, removing the gate results in clear performance degradation. The IoU, F1, BIoU, and SSIM of Base + GMC without Gate decrease to 53.05%, 69.33%, 82.22, and 72.25, respectively, which are much lower than those of the complete GWNet. This result indicates that directly combining global and local features without adaptive regulation may weaken feature discrimination and cause unstable feature fusion. In contrast, the proposed gate mechanism adaptively balances global contextual information and local structural details, allowing the network to better preserve complete building regions under complex background interference.
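To make the role of adaptive regulation concrete, the sketch below shows one common form of channel-wise gating: a per-channel weight in (0, 1) is predicted from globally pooled features and used to blend the global and local branches. This is a hypothetical illustration and not the GMC definition; the pooling step, the linear layer (`w`, `b`), and the convex-combination form are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(f_global, f_local, w, b):
    """Blend two (B, C, H, W) feature maps with a per-channel gate.

    w has shape (C, 2C) and b has shape (C,); both are hypothetical parameters.
    """
    # Global average pooling of the concatenated branches gives a (B, 2C) descriptor.
    pooled = np.concatenate([f_global, f_local], axis=1).mean(axis=(2, 3))
    # One gate value per sample and channel, broadcast over the spatial dimensions.
    gate = sigmoid(pooled @ w.T + b)[:, :, None, None]
    return gate * f_global + (1.0 - gate) * f_local, gate


channels = 4
f_global = np.full((2, channels, 8, 8), 2.0)  # stand-in for the global (Mamba) branch
f_local = np.zeros((2, channels, 8, 8))       # stand-in for the local (CNN) branch

w = np.zeros((channels, 2 * channels))
balanced, gate = gated_fusion(f_global, f_local, w, np.zeros(channels))
favor_global, _ = gated_fusion(f_global, f_local, w, np.full(channels, 50.0))

print(gate.shape, float(balanced.mean()), float(favor_global.mean()))
```

With zero parameters the gate equals 0.5 and the output is the plain average of the two branches, which corresponds to an ungated, fixed fusion; a learned gate can instead shift individual channels toward either branch depending on the input.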

For the high-frequency filter selection in WBO, replacing Haar wavelet filters with Sobel or Laplacian filters also results in lower performance. The Sobel-based variant achieves an IoU of 66.15% and a BIoU of 90.34, while the Laplacian-based variant obtains an IoU of 63.81% and a BIoU of 88.74. In comparison, the complete GWNet achieves an IoU of 73.02% and a BIoU of 93.19. Specifically, GWNet outperforms the Sobel-based variant by 6.87 points in IoU and 2.85 points in BIoU, and outperforms the Laplacian-based variant by 9.21 points in IoU and 4.45 points in BIoU. These results demonstrate that Haar wavelet filters are more effective for enhancing building boundaries and contour structures.

This advantage can be attributed to the structured multi-directional decomposition of Haar wavelets. Sobel and Laplacian filters mainly emphasize local gradient or second-order edge responses, whereas Haar wavelet filters can extract horizontal, vertical, and diagonal high-frequency components in a unified manner. These components are more consistent with the geometric characteristics of building boundaries, such as straight edges, corners, and contour transitions. In addition, the fixed Haar filters introduce no additional learnable parameters, keeping the WBO lightweight and reducing the risk of overfitting. Therefore, the additional ablation experiments further confirm the effectiveness of the channel-wise gate mechanism in GMC and the rationality of using fixed Haar wavelet filters in WBO.
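The difference between the filter families can be seen on small test patches. The sketch below lists the standard 3 × 3 Sobel and Laplacian kernels for reference and evaluates the three 2 × 2 Haar high-frequency kernels on a horizontal edge, a vertical edge, and a corner. The 1/2 scaling and the LH/HL naming of the Haar kernels are assumptions, and the patches are illustrative only.

```python
import numpy as np

# Standard first-order (Sobel) and second-order (Laplacian) edge kernels.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

# Haar high-frequency kernels: one per direction.
HAAR = {
    "LH": 0.5 * np.array([[1, 1], [-1, -1]], dtype=float),   # horizontal edges
    "HL": 0.5 * np.array([[1, -1], [1, -1]], dtype=float),   # vertical edges
    "HH": 0.5 * np.array([[1, -1], [-1, 1]], dtype=float),   # diagonal structure
}

PATCHES = {
    "horizontal_edge": np.array([[1, 1], [0, 0]]),
    "vertical_edge": np.array([[1, 0], [1, 0]]),
    "corner": np.array([[1, 0], [0, 0]]),
}


def haar_responses(patch):
    """Response of each Haar subband to a single 2x2 patch."""
    return {name: float(np.sum(patch * kernel)) for name, kernel in HAAR.items()}


table = {name: haar_responses(patch) for name, patch in PATCHES.items()}
# All of these kernels sum to zero, so none of them responds to flat regions.
zero_sum = all(abs(k.sum()) < 1e-12
               for k in (SOBEL_X, SOBEL_Y, LAPLACIAN, *HAAR.values()))

for name, responses in table.items():
    print(name, responses)
print(zero_sum)
```

In this toy setting each straight edge excites exactly one Haar subband with a response of 1, whereas the corner excites all three subbands with a response of 0.5. Sobel provides two directional gradient channels and the Laplacian a single isotropic channel, so neither offers a separate diagonal response.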

6. Conclusions

This study proposes GWNet, a building extraction network that integrates a Gated Mamba-CNN Module (GMC) and a Wavelet Boundary Optimization Module (WBO), to address the common challenges in building extraction from HRSI, including spectral heterogeneity, complex background interference, and incomplete boundary delineation. Specifically, GMC enhances the discriminative capability of the model for building targets by jointly modeling local texture information and long-range contextual dependencies in parallel, while employing a channel-wise gating mechanism to achieve adaptive fusion of global and local features. In contrast, WBO uses fixed Haar wavelet high-frequency filters to extract multi-directional edge responses and compensates for boundary, corner, and contour details that are easily lost during downsampling, thereby improving boundary recovery and structural completeness.

Extensive experiments on four public datasets demonstrate the effectiveness and robustness of the proposed method across scenes with different spatial resolutions and complexity levels. GWNet achieves IoU/F1/BIoU scores of 90.68%/95.11%/66.88 on WHU, 73.02%/84.41%/93.19 on Massachusetts, 63.86%/77.94%/83.77 on WHU Satellite I, and 83.21%/90.83%/58.96 on Potsdam. These quantitative results, together with the visual comparisons, confirm that GWNet can produce more complete building regions and clearer boundaries under complex backgrounds, low-resolution conditions, multi-scale variations, and densely distributed urban scenes.

In addition, the ablation experiments on the Massachusetts, WHU Satellite I, and WHU datasets further verify the complementarity of the two core modules. GMC is more effective in improving the completeness and regional overlap of building bodies, whereas WBO shows greater advantages in boundary detail recovery and contour refinement. When the two modules are combined, the model achieves the best performance on metrics such as IoU, F1, BIoU, and SSIM. The response maps in Figure 12 further reveal that GMC mainly enhances responses in the main building regions, whereas WBO places greater emphasis on boundary and corner regions. Their combination, therefore, enables simultaneous enhancement of regional semantic modeling and boundary detail recovery.

Overall, the proposed GWNet provides an effective and robust solution for building extraction from HRSI. Future work will focus on model lightweighting, adaptive modeling of multi-scale targets, and more refined boundary-constraint strategies to further improve the generalization capability and practical value of the proposed method in ultra-large-format imagery, complex urban scenes, and real-world applications.

Author Contributions

Conceptualization, D.Y.; methodology, D.Y.; validation, D.Y.; investigation, D.Y., Y.Y., X.G. (Xinlong Gao), R.H., X.G. (Xianjun Gao), K.H. and K.G.; data curation, D.Y., R.H., Y.T. and X.G. (Xianjun Gao); writing—original draft preparation, D.Y. and R.H.; writing—review and editing, D.Y.; visualization, D.Y. and K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tianjin Key Laboratory of Rail Transit Navigation Positioning and Spatio-temporal Big Data Technology, No. TKL2026A06, Open Project of National Key Laboratory for Waterway Traffic Control (26-3-3), Tibet Autonomous Region Science and Technology Major Project XZ202402ZD0001, Key Project of the Scientific Research Plan of Hubei Provincial Department of Education (D20231304), the China National Science and Technology Major Project (Grant 2024ZD1001003), Open Fund of National Engineering Laboratory for Digital Construction and Evaluation Technology of Urban Rail Transit (No. 2023ZH01), Open Fund of Key Laboratory of Mine Environmental Monitoring and Improving around Poyang Lake, Ministry of Natural Resources (No. MEMI-2021-2022-08), Tianjin Science and Technology Plan Project (No. 23YFYSHZ00190, No. 23YFZCSN00280), Hunan Natural Science Foundation Department Joint Fund (No. 2024JJ8327), Jiangxi Provincial Natural Science Foundation (No. 20232ACB204032), Hunan Provincial Department of Natural Resources Science and Technology Project (20230153CH).

Data Availability Statement

Publicly available datasets were analyzed in this study. The Massachusetts, WHU, and WHU Satellite I datasets can be found at https://www.cs.toronto.edu/~vmnih/data/ (accessed on 6 May 2026) and https://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 6 May 2026).

Conflicts of Interest

Author Yuan Tao was employed by the Central & Southern China Municipal Engineering Design and Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Yang, D.; Gao, X.; Yang, Y.; Jiang, M.; Guo, K.; Liu, B.; Li, S.; Yu, S. CSA-Net: Complex Scenarios Adaptive Network for Building Extraction for Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 938–953.
2. Yang, D.; Gao, X.; Yang, Y.; Guo, K.; Han, K.; Xu, L. Advances and future prospects in building extraction from high-resolution remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6994–7016.
3. Cheng, X.; Han, K.; Xu, J.; Li, G.; Xiao, X.; Zhao, W.; Gao, X. SPFDNet: Water extraction method based on spatial partition and feature decoupling. Remote Sens. 2024, 16, 3959.
4. Xu, T.; Gao, X.; Yang, Y.; Xu, L.; Xu, J.; Wang, Y. Construction of a semantic segmentation network for the overhead catenary system point cloud based on multi-scale feature fusion. Remote Sens. 2022, 14, 2768.
5. Zhang, G.; Gao, X.; Yang, Y.; Wang, M.; Ran, S. Controllably deep supervision and multi-scale feature fusion network for cloud and snow detection based on medium-and high-resolution imagery dataset. Remote Sens. 2021, 13, 4805.
6. Hu, A.; Wu, L.; Chen, S.; Xu, Y.; Wang, H.; Xie, Z. Boundary shape-preserving model for building mapping from high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610217.
7. Zhu, Y.; Huang, B.; Fan, Y.; Usman, M.; Chen, H. Iterative Polygon Deformation for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704314.
8. Fujieda, S.; Takayama, K.; Hachisuka, T. Wavelet convolutional neural networks. arXiv 2018, arXiv:1805.08620.
9. Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; Zuo, W. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Salt Lake City, UT, USA, 2018; pp. 773–782.
10. Chang, J.; Gao, X.; Yang, Y.; Wang, N. Object-oriented building contour optimization methodology for image classification results via generalized gradient vector flow snake model. Remote Sens. 2021, 13, 2406.
11. Gao, X.; Yang, J.; Xie, X.; Yang, Y.; Wang, N.; Cao, X.; Du, B.; Tan, M.; Xu, L.; Kou, Y. DG 2-TCR: An Adaptive Clouds Removal Network for Optical Remote Sensing Images Using SAR-Driven Dual-Flow Fusion Guidance. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5619016.
12. Ran, S.; Gao, X.; Yang, Y.; Li, S.; Zhang, G.; Wang, P. Building multi-feature fusion refined network for building extraction from high-resolution remote sensing images. Remote Sens. 2021, 13, 2794.
13. Li, Y.; Hong, D.; Li, C.; Yao, J.; Chanussot, J. HD-Net: High-resolution decoupled network for building footprint extraction via deeply supervised body and boundary decomposition. ISPRS J. Photogramm. Remote Sens. 2024, 209, 51–65.
14. Xie, Y.; Zhu, J.; Cao, Y.; Feng, D.; Hu, M.; Li, W.; Zhang, Y.; Fu, L. Refined extraction of building outlines from high-resolution remote sensing imagery based on a multifeature convolutional neural network and morphological filtering. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1842–1855.
15. Wei, S.; Ji, S.; Lu, M. Toward automatic building footprint delineation from aerial images using CNN and regularization. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2178–2189.
16. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. Ultralight vm-unet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation. arXiv 2024, arXiv:2403.20035.
17. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
18. Zhang, W.; Zhang, Q.; Yan, Y. Wavelet-CNet: Wavelet Cross Fusion and Detail Enhancement Network for RGB-Thermal Semantic Segmentation. Sensors 2026, 26, 1067.
19. Hua, C.; Ren, F. FESW-UNet: A Dual-Domain Attention Network for Sorghum Aphid Segmentation. Sensors 2026, 26, 458.
20. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
21. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013.
22. Bradbury, K.; Brigman, B.; Collins, L.; Johnson, T.; Lin, S.; Newell, R.; Park, S.; Suresh, S.; Wiesner, H.; Xi, Y. Aerial imagery object identification dataset for building and road detection, and building height estimation. Figshare 2016.
23. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
24. Cheng, B.; Girshick, R.; Dollár, P.; Berg, A.C.; Kirillov, A. Boundary IoU: Improving object-centric image segmentation evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Nashville, TN, USA, 2021; pp. 15334–15342.
25. Zhu, Q.; Liao, C.; Hu, H.; Mei, X.; Li, H. MAP-Net: Multiple Attending Path Neural Network for Building Footprint Extraction from Remote Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6169–6181.
26. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A fully convolutional neural network for automatic building extraction from high-resolution remote sensing images. Remote Sens. 2020, 12, 1050.
27. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625711.
28. Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252.
29. Yuan, W.; Zhang, X.; Shi, J.; Wang, J. LiteST-Net: A hybrid model of lite swin transformer and convolution for building extraction from remote sensing image. Remote Sens. 2023, 15, 1996.
30. Bose, S.; Chowdhury, R.S.; Pal, D.; Bose, S.; Banerjee, B.; Chaudhuri, S. Multiscale probability map guided index pooling with attention-based learning for road and building segmentation. ISPRS J. Photogramm. Remote Sens. 2023, 206, 132–148.
31. Yao, S.; Liu, D.; Li, T.; Li, S.; Ren, W.; Cao, X. UAGLNet: Uncertainty-Aggregated Global–Local Fusion Network with Cooperative CNN–Transformer for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5601714.
32. Li, J.; He, W.; Cao, W.; Zhang, L.; Zhang, H. UANet: An Uncertainty-Aware Network for Building Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5608513.

Figure 1. The overall architecture of GWNet.

Figure 2. Gated Mamba-CNN Module.

Figure 3. Wavelet Boundary Optimization Module.

Figure 4. Visualization results on the WHU dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 5. Boundary extraction results on the WHU dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 6. Visualization results on the Massachusetts dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 7. Boundary extraction results on the Massachusetts dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 8. Visualization results on the WHU Satellite I dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 9. Boundary extraction results on the WHU Satellite I dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively.

Figure 10. Visualization results on the Potsdam dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively.

Figure 11. Boundary extraction results on the Potsdam dataset. (a–e) Sample scenes; A–O denote the image, label, BRRNet, BuildFormer, CBRNet, HD-Net, LiteST-Net, MA-FCN, MAP-Net, MFCNN, MSSDMPA-Net, UltraLight VM-UNet, UAGLNet, UANet, and our method (GWNet), respectively. The red boxes mark regions that require particular attention.

Figure 12. (a–e) Visualization results of the ablation experiment.

Table 1. Parameter Settings.

Parameter	Setting
Optimizer	Adam [23]
Initial learning rate	0.0001
Batch size	4
Epochs (Massachusetts dataset)	200
Epochs (WHU dataset)	50
Epochs (Potsdam dataset)	50
Epochs (WHU Satellite I dataset)	200
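
As a minimal sketch, the settings in Table 1 could be collected in a single configuration object so that each dataset picks up its own epoch budget. The class name `TrainConfig`, its field names, and the `epochs_for` helper are hypothetical and chosen only for illustration; only the values come from Table 1, and the paper's actual training code is not shown in this section.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the Table 1 settings (names are illustrative)."""

    optimizer: str = "Adam"      # Table 1: Optimizer
    initial_lr: float = 1e-4     # Table 1: Initial learning rate 0.0001
    batch_size: int = 4          # Table 1: Batch size
    # Epoch budget per dataset, as listed in Table 1.
    epochs: Dict[str, int] = field(default_factory=lambda: {
        "Massachusetts": 200,
        "WHU": 50,
        "Potsdam": 50,
        "WHU Satellite I": 200,
    })

    def epochs_for(self, dataset: str) -> int:
        """Return the epoch count for a dataset, failing loudly on unknown names."""
        if dataset not in self.epochs:
            raise KeyError(f"no epoch setting for dataset {dataset!r}")
        return self.epochs[dataset]


cfg = TrainConfig()
for name in cfg.epochs:
    print(f"{name}: {cfg.epochs_for(name)} epochs, "
          f"lr={cfg.initial_lr}, batch={cfg.batch_size}")
```

Keeping the per-dataset epochs in one mapping makes the difference between the 50-epoch and 200-epoch runs explicit instead of scattering it across scripts.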

Table 2. Quantitative evaluation results on the WHU dataset.

Methods	OA/%	P/%	R/%	IoU/%	F1/%	BIoU/%	SSIM/%
MFCNN [14]	98.25	94.59	89.41	85.06	91.93	65.43	94.57
MA-FCN [15]	98.49	92.94	93.60	87.39	93.27	65.85	95.46
MAP-Net [25]	98.55	94.07	92.79	87.66	93.42	66.09	95.39
BRRNet [26]	98.41	94.04	91.55	86.52	92.77	65.48	95.09
BuildFormer [27]	98.48	93.03	93.31	87.21	93.17	65.58	95.19
CBRNet [28]	98.72	93.74	94.79	89.15	94.26	66.50	96.05
LiteST-Net [29]	98.58	93.84	93.36	87.97	93.60	66.67	95.63
MSSDMPA-Net [30]	98.57	94.63	92.36	87.76	93.48	65.84	95.60
HD-Net [13]	98.28	90.78	94.13	85.92	92.43	65.31	95.12
UltraLight VM-UNet [16]	97.23	87.09	88.22	78.02	87.65	61.72	92.50
UAGLNet [31]	98.30	91.90	92.95	85.91	92.42	65.51	95.00
UANet [32]	98.57	93.62	93.56	87.95	93.59	66.17	95.66
GWNet	98.91	94.99	95.23	90.68	95.11	66.88	96.36
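
The P, R, IoU, and F1 columns of Table 2 are linked by standard identities: F1 is the harmonic mean of P and R, and IoU = P·R / (P + R − P·R). The sketch below recomputes F1 and IoU from the reported P and R for three rows of the table. It assumes the metrics were computed from aggregate pixel counts, which this section does not state; the function names and the 0.05-point tolerance for rounding are illustrative choices, not part of the paper.

```python
def f1_from_pr(p: float, r: float) -> float:
    """F1 as the harmonic mean of precision and recall (inputs in percent)."""
    return 2.0 * p * r / (p + r)


def iou_from_pr(p: float, r: float) -> float:
    """IoU from precision and recall (inputs and output in percent).

    From IoU = TP / (TP + FP + FN) it follows that 1/IoU = 1/P + 1/R - 1,
    hence IoU = P*R / (P + R - P*R) when P and R are fractions.
    """
    p, r = p / 100.0, r / 100.0
    return 100.0 * (p * r) / (p + r - p * r)


# (P, R, reported IoU, reported F1) copied from three rows of Table 2.
TABLE2_ROWS = {
    "MFCNN": (94.59, 89.41, 85.06, 91.93),
    "CBRNet": (93.74, 94.79, 89.15, 94.26),
    "GWNet": (94.99, 95.23, 90.68, 95.11),
}

TOL = 0.05  # assumed tolerance (percentage points) for two-decimal rounding


def row_is_consistent(p: float, r: float, iou: float, f1: float,
                      tol: float = TOL) -> bool:
    """True if reported IoU and F1 agree with the values implied by P and R."""
    return (abs(f1_from_pr(p, r) - f1) <= tol
            and abs(iou_from_pr(p, r) - iou) <= tol)


for name, (p, r, iou, f1) in TABLE2_ROWS.items():
    print(f"{name}: F1 {f1_from_pr(p, r):.2f} (reported {f1:.2f}), "
          f"IoU {iou_from_pr(p, r):.2f} (reported {iou:.2f}), "
          f"consistent={row_is_consistent(p, r, iou, f1)}")
```

A check of this kind only tests internal consistency of the reported numbers. It says nothing about OA, BIoU, or SSIM, which cannot be derived from P and R alone.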