1. Introduction
Automatic and precise extraction of building information from High Spatial Resolution (HSR) remote sensing imagery constitutes a fundamental task in domains such as urban planning, dynamic monitoring, national geographic census, disaster emergency response, and 3D digital city modeling [1,2,3]. However, due to the dense distribution, complex structures, and significant scale variations of buildings in urban environments, coupled with susceptibility to shadows, vegetation occlusion, and illumination changes, building contours are prone to blurred boundaries and irregular shapes, posing significant challenges for high-precision extraction [4,5,6].
Early research typically relied on hand-crafted features such as spectral, textural, and geometric attributes, combined with classifiers like Support Vector Machines (SVMs) or Random Forests for building recognition; however, such methods struggle to maintain stable generalization capabilities under complex background conditions [1] and are ill-suited for complex and variable urban scenarios. The advancement of deep learning technologies, particularly Fully Convolutional Networks (FCNs) represented by encoder–decoder architectures [7], has significantly propelled the progress of building extraction. As a classic implementation of this architecture, U-Net [8] and its variants utilize Skip Connections to fuse shallow spatial information with deep semantic information, gaining widespread application in the remote sensing field [8]. However, its simple feature concatenation approach results in a Semantic Gap, and the inevitable loss of high-frequency spatial details during encoder down-sampling leads to insufficient recognition of small building entities and suboptimal segmentation accuracy in boundary regions [9,10,11]. To address U-Net’s deficiencies in multi-scale context modeling, subsequent researchers have proposed a series of improved models. For instance, Zhao et al. [12] proposed PSPNet (Pyramid Scene Parsing Network) [13], which introduces a Pyramid Pooling Module (PPM) to aggregate features at different scales for global context acquisition. However, the fixed grid pooling operation employed by the PPM is relatively rigid, easily leading to over-smoothing or loss of local detail information. Chen et al. [14] proposed DeepLabV3+, which employs Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale information using atrous convolutions with varying dilation rates. Nevertheless, the sparse sampling characteristic of ASPP tends to generate Gridding Effects, restricting the perception of fine boundaries and potentially causing voids within large buildings [14]. He et al. [15] proposed APCNet, attempting to mitigate the limitations of the PPM via an Adaptive Context Module (ACM) that dynamically computes affinities between local regions, yet there remains room for improvement in balancing global guidance with local detail correlation. UPerNet combines the Feature Pyramid Network (FPN) with the PPM, aiming to unify perceptual information across different levels [2,16], and is frequently utilized in conjunction with modern backbones like ConvNeXt [16]. Although UPerNet excels in multi-task processing, it inherits the inherent defects of the PPM regarding boundary detail handling when applied to the specific task of building extraction.
A comprehensive analysis of the aforementioned mainstream models, such as U-Net [9], DeepLabV3+ [14], PSPNet [12,13], APCNet [15], and UPerNet [16], reveals three critical common issues remaining in current research:
1. Insufficient long-range dependency modeling.
Traditional CNN architectures, constrained by the local receptive fields of convolution kernels, struggle to establish explicit correlations between global pixels, causing models to prioritize learning local textures over global structures and significantly limiting generalization capabilities across domains (e.g., transferring from the WHU dataset to the Ganzhou urban dataset) [17,18]. Although models like DeepLabV3+ attempt to expand the receptive field via atrous convolution, they still fail to effectively capture the spatial layout relationships between buildings [17].
2. Inadequate multi-scale object modeling.
Modules like the PPM and ASPP employ fixed-scale pooling or dilation rates, making it difficult to adaptively match the extreme scale variations of buildings in remote sensing imagery and resulting in markedly low recall rates for small and dense buildings [17,19]. Research indicates that existing methods perform poorly when handling buildings with immense scale disparities, particularly in areas with dense building distribution, where fixed-scale feature extraction strategies fail to effectively cover all targets [19].
3. Weak feature spatial adaptability.
During feature fusion (such as combining down-sampled and up-sampled features), models lack spatial position awareness, treating all pixels “equally” without prioritizing high-frequency areas like building boundaries, leading to severe edge blurring and artifacts [20]. This issue is particularly pronounced in building boundary extraction, where traditional methods often lose crucial shape detail information [20].

These common issues severely constrain the performance of building extraction models in practical applications, necessitating breakthrough innovations in global dependency modeling, adaptive multi-scale feature extraction, and spatial perception mechanisms.
In recent years, State Space Models (SSMs) have demonstrated superior global dependency capture capabilities and linear complexity in sequence modeling tasks, providing a new and more efficient pathway for long-range feature modeling in remote sensing building extraction [21,22,23,24,25,26]. Simultaneously, architectures such as edge enhancement networks and multi-scale attention fusion strategies have further improved model sensitivity to building boundaries and local geometric structures. Addressing the aforementioned challenges, this paper proposes a high-resolution remote sensing building extraction network that fuses multi-scale sequence modeling with spatial adaptive enhancement. The method utilizes UPerNet (with a ConvNeXt-Tiny backbone) as the foundational framework and introduces a dedicated PyramidSSM-based neck (PyramidSSMNeck) as the primary design for structured multi-scale feature projection, alignment, and fusion, upon which it further integrates three enhancement components (S6 (SSM-based), LSKNet [27,28], and SAFM (Spatial Adaptive Feature Modulation) [29]) that provide complementary improvements mainly reflected in boundary delineation. Specifically, PyramidSSMNeck emphasizes structured cross-scale feature projection, alignment, and aggregation to strengthen multi-scale representation; S6 enhances long-sequence contextual modeling to better capture global dependencies; the LSKNet module, by introducing a Large Selective Kernel mechanism, enables the network to dynamically adjust its spatial receptive field, adaptively capturing multi-scale spatial patterns; and the SAFM module dynamically modulates feature responses based on spatial positional information, enhancing the recognition precision of high-frequency details in boundary regions. Overall, PyramidSSMNeck contributes the dominant improvements in region-level metrics, whereas S6, LSKNet, and SAFM provide additional gains that are primarily reflected in boundary-sensitive evaluation; improved cross-domain transferability is observed for the proposed full framework in WHU → Ganzhou experiments. Experimental results on the public WHU Building Dataset [30], the INRIA Dataset [31], and a self-constructed Ganzhou urban building dataset validate the effectiveness and superiority of the proposed method.
The main innovations and contributions of this work are summarized as follows:
(1) We propose a PyramidSSMNeck-based building extraction architecture built on the UPerNet (ConvNeXt-Tiny) baseline, which strengthens multi-scale feature alignment and fusion for HSR imagery under complex scale variation and boundary ambiguity.
(2) On top of the proposed PyramidSSMNeck, we integrate three enhancement components—S6 for long-range context modeling, LSKNet for spatially adaptive receptive-field selection, and SAFM for spatial refinement—to provide additional gains that are primarily reflected in boundary quality.
(3) Extensive experiments on the WHU, INRIA, and Ganzhou datasets demonstrate consistent gains in both region- and boundary-sensitive metrics (e.g., IoU/BIoU), as well as improved transfer performance under the WHU → Ganzhou cross-domain setting.
2. Research Methods and Principles
To effectively address the unique challenges inherent in HSR remote sensing imagery—such as drastic scale variations in building objects, complex spatial distributions, strong background noise interference, and long-range contextual dependencies—this paper designs a building semantic segmentation network that integrates multi-scale sequence modeling with spatial adaptive enhancement. As illustrated in Figure 1, the model adopts UPerNet equipped with ConvNeXt-Tiny as the baseline framework, constructing a holistic architecture composed of a backbone network (ConvNeXt-Tiny), a multi-scale feature-enhanced neck (PyramidSSMNeck), and a decoding head (UPerHead). The core philosophy of the proposed model is to strengthen region-level representation through structured cross-scale feature projection, alignment, and fusion in PyramidSSMNeck, while S6, LSKNet, and SAFM provide additional refinement that is more evident in boundary preservation and fine-detail integrity.
2.1. ConvNeXt-Tiny Feature Extraction Module
As illustrated in Figure 1, ConvNeXt-Tiny adopts a hierarchical structure comprising four stacked stages (Stage 0 to Stage 3) to progressively down-sample input features, systematically increasing channel dimensions (from 96 to 768) while reducing spatial resolution (from 128 × 128 to 16 × 16). This process generates four feature maps (F1, F2, F3, and F4) at distinct semantic levels, providing rich multi-scale inputs for the subsequent neck and decoder modules. The fundamental building unit of ConvNeXt-Tiny is the ConvNeXt block [32] (see the bottom of Figure 1), which incorporates the following key design elements:
1. Large Kernel Depthwise Convolution [33]:
The core component of this block is a 7 × 7 large-kernel depthwise convolution. In contrast to traditional 3 × 3 convolution kernels, the 7 × 7 large kernel significantly expands the model’s Effective Receptive Field (ERF), enabling the capture of broader spatial contextual information. Simultaneously, the depthwise convolution format ensures computational efficiency.
2. Layer Normalization (LN):
Regarding the normalization strategy, this block substitutes Layer Normalization (LN) for the Batch Normalization (BN) commonly employed in convolutional networks. As a standard component of Transformers, LN offers more stable training dynamics across varying batch sizes.
3. Inverted Bottleneck:
This block adopts an inverted bottleneck design derived from the Feed-Forward Network (FFN) of Transformers. As illustrated in Figure 1, channel dimensions (C) are first expanded by a factor of 4 to 4C via a 1 × 1 convolution, undergo a non-linear transformation through the Gaussian Error Linear Unit (GELU) activation function, and are finally compressed back to C via another 1 × 1 convolution. This “narrow-wide-narrow” architecture compels the model to learn complex feature transformations within a higher-dimensional (4C) feature space while restricting the computationally intensive large-kernel convolution to the narrower (C) channel dimension, striking a delicate balance between performance and efficiency.
4. Residual Connections and DropPath:
By combining residual connections (as in ResNet) with DropPath, a structural regularization technique, this block ensures stable gradient propagation within deep networks and enhances the model’s generalization capabilities.
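To make the “narrow-wide-narrow” channel flow of the inverted bottleneck concrete, the following NumPy sketch applies the expand–GELU–compress transformation per pixel as plain matrix products. All names are ours, and the 7 × 7 depthwise convolution, LayerNorm, DropPath, and biases of the real ConvNeXt block are deliberately omitted; this is a structural illustration, not the reference implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def inverted_bottleneck(x, w_up, w_down):
    """Narrow-wide-narrow channel MLP of the ConvNeXt block, applied per pixel.

    x:      (H, W, C) feature map
    w_up:   (C, 4C) weights standing in for the expanding 1x1 convolution
    w_down: (4C, C) weights standing in for the compressing 1x1 convolution
    """
    # expand C -> 4C, apply GELU, compress 4C -> C, then add the residual
    return x + gelu(x @ w_up) @ w_down
```

With zero weights the block collapses to the identity, which illustrates why the residual path keeps gradients stable in deep stacks.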
2.2. PyramidSSMNeck Feature Enhancement Module
Subsequently, these multi-scale features are fed into the proposed PyramidSSMNeck module for feature fusion and enhancement. The PyramidSSMNeck module represents the core innovation of this study, designed to serve as a “neck” bridging the encoder and decoder and to specifically address two critical challenges in semantic segmentation of remote sensing imagery: (1) global contextual dependency: accurate building recognition (e.g., distinguishing between roofs and roads with similar textures) relies heavily on long-range spatial relationships; and (2) scale diversity: building objects in remote sensing imagery exhibit vast size variations, ranging from small shacks occupying a few pixels to large complexes spanning hundreds of pixels. As illustrated in Figure 1, PyramidSSMNeck receives multi-scale feature maps from the four stages of ConvNeXt-Tiny. Initially, a “Projection Layer” is employed to unify these four feature maps of varying dimensions into a consistent channel count. Subsequently, the feature map at each scale is independently processed by a pivotal PyramidSSM Block for deep feature enhancement. Finally, all enhanced feature maps are fused during the “Feature Alignment and Fusion” stage via up-sampling, concatenation, and convolution operations, providing prepared pyramidal feature inputs for the UPerHead decoder.
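The projection–enhancement–alignment–fusion flow of the neck can be sketched as below. The einsum-based 1 × 1 projection, the identity stand-in for the PyramidSSM Block, the nearest-neighbour alignment, and the omission of the final fusion convolution are all simplifying assumptions of ours, not the actual implementation.

```python
import numpy as np

def upsample_nearest(x, factor):
    # (C, H, W) -> (C, H*factor, W*factor) by nearest-neighbour repetition
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def neck_sketch(feats, proj_ws, enhance):
    """Structured projection -> per-scale enhancement -> alignment and fusion.

    feats:   [F1..F4], each (C_i, H_i, W_i), with spatial size halving per stage
    proj_ws: per-level (C_i, C_out) matrices standing in for the Projection
             Layer's 1x1 convolutions
    enhance: stand-in callable for the PyramidSSM Block
    """
    # unify channel counts across the four levels
    projected = [np.einsum('chw,cd->dhw', f, w) for f, w in zip(feats, proj_ws)]
    # enhance each scale independently
    enhanced = [enhance(p) for p in projected]
    # align everything to the F1 resolution and concatenate
    base = enhanced[0].shape[1]
    aligned = [upsample_nearest(e, base // e.shape[1]) for e in enhanced]
    return np.concatenate(aligned, axis=0)
```

With four levels projected to a common channel count, the fused output simply stacks the aligned pyramid levels, ready for a decoder head.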
The PyramidSSM Block serves as the core computational unit of the PyramidSSMNeck, as depicted in Figure 2. It employs a meticulously designed Sequential Pipeline, wherein features successively traverse three sub-modules—S6, LSKNet, and SAFM—to achieve progressive enhancement of complementary information.
The S6 module, derived from the Mamba architecture, functions as a Selective State Space Model (SSM) [34]. Its primary design objective is to efficiently capture long-range dependencies within sequential data. The fundamental principle of SSMs is grounded in continuous-time systems, wherein the evolution of the hidden state h(t) is governed by Ordinary Differential Equations (ODEs):

h′(t) = A h(t) + B x(t), (1)

y(t) = C h(t). (2)

Here, A denotes the state matrix, while B and C represent the input and output transformation matrices, respectively. To facilitate implementation on digital computing hardware, the continuous system necessitates discretization. In this study, we employ the Zero-Order Hold (ZOH) principle to transform the continuous parameters (A, B) into their discrete counterparts (Ā, B̄) via a learnable timescale parameter Δ:

Ā = exp(ΔA), (3)

B̄ = (ΔA)⁻¹ (exp(ΔA) − I) · ΔB. (4)

Following the discretization defined in Equation (4), the SSM can be efficiently computed in a recurrent form:

h_t = Ā h_{t−1} + B̄ x_t, (5)

y_t = C h_t. (6)
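As a minimal sketch, the ZOH discretization and recurrent scan above can be written out for the diagonal-state case used in Mamba-style SSMs, where exp(ΔA) reduces to an element-wise exponential. Array names are ours, and the parameters are fixed here, whereas the actual S6 module generates them from the input.

```python
import numpy as np

def zoh_discretize(a, b, delta):
    """ZOH discretization for a diagonal state matrix.

    a:     (N,) diagonal entries of A
    b:     (N,) entries of B
    delta: scalar timescale (learnable in S6; a constant here)
    """
    a_bar = np.exp(delta * a)        # A_bar = exp(delta * A)
    b_bar = (a_bar - 1.0) / a * b    # B_bar = (delta*A)^-1 (exp(delta*A) - I) * delta*B
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Recurrent form: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:                    # scan over the (flattened spatial) sequence
        h = a_bar * h + b_bar * x_t
        ys.append(float(c @ h))
    return np.array(ys)
```

With a strongly decaying state (a < 0 and a large Δ), the recurrence forgets the past almost completely and the output tracks the input, which illustrates how Δ controls how much history the state retains.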
The revolutionary attribute of the S6 module lies in its “selectivity”. Unlike traditional SSMs, the key parameters of S6 are not static but input-dependent. As illustrated in the S6 module component of Figure 2, input features are projected via x_proj and subsequently split to dynamically generate these parameters. This mechanism empowers the model to “selectively” determine which information to propagate or forget along the spatial sequence, thereby achieving content-aware long-range information interaction. Following the capture of global context by the S6 module, the features are fed into the Large Selective Kernel Network (LSKNet) module. Designed to address the issue of scale diversity in remote sensing imagery, this module enables the network to dynamically adjust its spatial receptive field in response to the input content.
The core mechanism of LSKNet (as illustrated in the LSKNet block of Figure 2) performs spatially selective fusion, which differs from methods such as SKNet [35] that conduct selection mainly along the channel dimension. For building extraction in HSR remote sensing imagery, spatial selection is particularly suitable because buildings exhibit substantial scale variation and morphological diversity, and many ambiguities (e.g., adjacent small buildings and irregular boundaries) are location-dependent. Therefore, spatially adaptive receptive-field selection helps adjust responses according to local structure. The operational workflow proceeds as follows:
1. Multi-Branch Large-Kernel Convolution:
Input features U are processed in parallel through four depthwise convolution branches equipped with varying large kernel sizes, yielding four distinct feature maps:

U_i = F_i^dw(U), i = 1, 2, 3, 4, (7)

where F_i^dw denotes the i-th large-kernel depthwise convolution branch.
2. Spatial Selection Weight Generation:
Within the “Selection Path,” these four feature maps undergo element-wise summation. The aggregated result is subsequently processed by a “Selection Block” (comprising a 1 × 1 convolution, Batch Normalization, ReLU activation, and a Softmax function) to generate four distinct sets of spatial attention maps:

[SA_1, SA_2, SA_3, SA_4] = Softmax(SelectionBlock(U_1 + U_2 + U_3 + U_4)). (8)

As illustrated in Figure 2 (LSKNet), the Selection Block outputs a four-channel score map, where each channel corresponds to one convolutional branch. The Softmax in Equation (8) is applied across the four branches at each spatial location (h, w), producing pixel-wise weights SA_i(h, w) that satisfy SA_1(h, w) + SA_2(h, w) + SA_3(h, w) + SA_4(h, w) = 1. Each SA_i is an H × W spatial weight map and is broadcast along the channel dimension when reweighting U_i in Equation (9).
3. Weighted Fusion:
Within the “Fusion Path,” each feature map U_i undergoes element-wise multiplication with its corresponding spatial weight map SA_i, followed by a summation of the results:

V = SA_1 ⊙ U_1 + SA_2 ⊙ U_2 + SA_3 ⊙ U_3 + SA_4 ⊙ U_4. (9)

The resultant fused feature V is added to the original input U via a residual path to yield the final output of the LSKNet module:

Y = U + V. (10)

This mechanism empowers the network to dynamically and adaptively select the optimal receptive-field scale (i.e., convolution kernel size) for each spatial pixel location within the image.
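A minimal NumPy sketch of this spatially selective fusion follows. The learned Selection Block (1 × 1 convolution, BN, ReLU) is reduced to a channel-mean score per branch, which is an assumption made purely for brevity; only the per-pixel softmax across the four branches and the residual path are kept faithfully.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_select(u, branch_outputs):
    """Pixel-wise selection across large-kernel branches.

    u:              (C, H, W) input of the LSKNet module
    branch_outputs: list of four (C, H, W) maps from the depthwise branches
    """
    stacked = np.stack(branch_outputs)            # (4, C, H, W)
    scores = stacked.mean(axis=1)                 # (4, H, W) stand-in branch scores
    w = softmax(scores, axis=0)                   # weights sum to 1 at every pixel
    v = (w[:, None, :, :] * stacked).sum(axis=0)  # weights broadcast over channels
    return u + v                                  # residual path: Y = U + V
```

When all four branches happen to agree, the softmax weights become uniform and the fusion reduces to the shared branch output plus the residual, which is the expected degenerate behaviour.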
As illustrated in the SAFM section of Figure 2, this module employs a multi-branch, multi-kernel hybrid channel–spatial attention mechanism. Input features are evenly partitioned along the channel dimension into four distinct “Chunks”. Each “Chunk” is processed by a depthwise convolutional layer equipped with a specific spatial kernel size, enabling the network to capture spatial information at varying scales across different channel groups. The four processed “Chunks” are subsequently re-concatenated along the channel dimension. Finally, a 1 × 1 convolution, followed by BN, GELU activation, and a residual connection, is utilized to aggregate this cross-channel, multi-scale information, yielding the refined feature map.
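The split-process-concat structure of SAFM can be sketched as follows. The per-chunk depthwise convolutions are replaced by caller-supplied stand-in callables, and the trailing 1 × 1 convolution, BN, and GELU are collapsed into a bare residual addition, so this is a structural sketch under those assumptions rather than the actual module.

```python
import numpy as np

def safm_sketch(x, chunk_ops):
    """Split-process-concat structure of SAFM.

    x:         (C, H, W) feature map with C divisible by 4
    chunk_ops: four callables, stand-ins for the depthwise convolutions
               with different kernel sizes applied to each chunk
    """
    chunks = np.split(x, 4, axis=0)  # partition channels into four "Chunks"
    mixed = np.concatenate([op(c) for op, c in zip(chunk_ops, chunks)], axis=0)
    return x + mixed                 # residual connection
```

Because each chunk sees a different spatial operator, different channel groups end up encoding different receptive-field scales while the overall tensor shape is preserved.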
2.3. UPerHead Decoder Module
The four enhanced feature layers output by the PyramidSSMNeck are fed into the UPerHead decoder module. As illustrated in Figure 3, UPerHead employs a dual-branch parallel architecture that efficiently integrates the strengths of the Pyramid Pooling Module (PPM) and the FPN. For clarity, the notation used in Figure 3 is summarized in Table 1.
The PSP (Pyramid Pooling) branch specializes in capturing global context, operating exclusively on the deepest feature layer F4, which encompasses the richest semantic information. This branch applies multi-scale adaptive pooling (P1–P4) to F4, followed by transformation via 1 × 1 convolutions (T1–T4). Subsequently, all resultant maps are upsampled (U1–U4) and concatenated to generate the PSP Out feature map:

PSP Out = Concat(F4, U1(T1(P1(F4))), U2(T2(P2(F4))), U3(T3(P3(F4))), U4(T4(P4(F4)))). (11)
Simultaneously, the FPN facilitates the top-down fusion of multi-scale features. It establishes lateral connections (L1–L4) via 1 × 1 convolutions and employs a top-down “upsample-add-refine” strategy to progressively fuse high-level semantics with low-level details layer by layer:

FPN_i = Refine(L_i(F_i) + Upsample(FPN_{i+1})), i = 3, 2, 1. (12)

Here, FPN_i denotes the FPN output at level i, L_i(F_i) represents the lateral input, and the Refine process consists of a 3 × 3 convolution.
Finally, in the “Final Fusion” stage, the outputs from the FPN branch (FPN1, FPN2, FPN3) and the output of the PSP branch (PSP Out) are resized to a unified resolution of 128 × 128 and concatenated. The concatenated features are subsequently processed through a 3 × 3 convolution (Bottleneck) and a 1 × 1 convolutional classification head to yield the final segmentation prediction:

Pred = Conv_1×1(Conv_3×3(Concat(FPN1, FPN2, FPN3, PSP Out))). (13)
This dual-branch fusion architecture empowers the model to simultaneously preserve global semantic consistency and local spatial boundary details.
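The pyramid-pooling side of this dual-branch design can be sketched with a simple adaptive average pooling. The pooling scales (1, 2, 3, 6) follow PSPNet's common setting and the 1 × 1 transforms T1–T4 are omitted, so this is an assumption-laden sketch rather than the exact configuration used here.

```python
import numpy as np

def adaptive_avg_pool(x, s):
    """Average-pool (C, H, W) down to (C, s, s) over near-equal spatial bins."""
    rows = np.array_split(np.arange(x.shape[1]), s)
    cols = np.array_split(np.arange(x.shape[2]), s)
    out = np.empty((x.shape[0], s, s))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[:, i, j] = x[:, r][:, :, c].mean(axis=(1, 2))
    return out

def upsample_to(x, H, W):
    """Nearest-neighbour resize of (C, h, w) to (C, H, W)."""
    ri = np.arange(H) * x.shape[1] // H
    ci = np.arange(W) * x.shape[2] // W
    return x[:, ri][:, :, ci]

def psp_branch(f4, scales=(1, 2, 3, 6)):
    """Pool F4 at several scales, resize back, and concatenate with F4."""
    C, H, W = f4.shape
    pooled = [upsample_to(adaptive_avg_pool(f4, s), H, W) for s in scales]
    return np.concatenate([f4] + pooled, axis=0)
```

The scale-1 context channels carry the global mean of each input channel, which is exactly the kind of image-level prior that helps the decoder suppress road/roof confusions far from any local evidence.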