1. Introduction
Semantic segmentation of remote sensing imagery represents a core task at the intersection of computer vision and remote sensing. It fundamentally involves pixel-wise classification—assigning a semantic label to each pixel in the image—to produce an output semantic map that matches the spatial resolution of the input. This enables precise delineation of land cover types and accurate boundary characterization. Owing to its high practical value, this technique underpins a wide range of critical applications, including urban planning [1,2,3], land resource utilization [4,5,6,7], lane analysis for autonomous vehicles [8,9,10,11], environmental monitoring [12], disaster response [13], and cropland change detection [14].
Early traditional approaches to semantic segmentation of remote sensing imagery primarily relied on handcrafted features combined with classical models, which can be broadly categorized into two groups. The first group comprises statistical learning methods. For instance, Maulik et al. [15] proposed an improved differential evolution-based automatic fuzzy clustering algorithm (MoDEAFC), which enhances global optimization through an adaptive mutation strategy and employs the Xie–Beni (XB) index as the fitness function to automatically determine the number of clusters and optimal partitioning. Rutherford et al. [16] developed a regression modeling framework tailored for simulating vegetation succession. The second group consists of machine learning techniques: Inglada et al. [17] combined geometric handcrafted features with support vector machines (SVMs); Du et al. [18] integrated random forests with GIS data for multispectral classification; Tatsumi et al. [19] extracted statistical features from EVI time series and coupled them with random forests for crop classification; and Fu et al. [20] improved hyperspectral vegetation classification accuracy by selecting low-redundancy spectral features via between-class scatter matrices and fusing them with Gabor spatial features. Despite these efforts, traditional methods suffer from significant limitations—they rely heavily on expert-designed features, exhibit poor generalization, and struggle with challenges such as ambiguous object boundaries and spectral mixing in complex scenes, thereby failing to meet the demands of precise remote sensing interpretation.
In recent years, deep learning methods have rapidly advanced and been widely adopted in semantic segmentation of remote sensing imagery. The U-Net model proposed by Ronneberger et al. [21], featuring an encoder–decoder architecture with skip connections, has become a foundational baseline in this field. Zhou et al. [22] introduced U-Net++, which employs nested skip connections to optimize feature fusion and effectively mitigate the semantic gap. PSPNet, proposed by Zhao et al. [23], incorporates a pyramid pooling module to enhance multi-scale contextual representation. SegNet, developed by Badrinarayanan et al. [24], utilizes the first 13 convolutional layers of VGG16 as its encoder and reuses pooling indices for upsampling, establishing itself as an efficient and lightweight classic for remote sensing segmentation. DeepLabV3+, proposed by Chen et al. [25], combines atrous convolution with the Atrous Spatial Pyramid Pooling (ASPP) module to further improve adaptability to objects of varying scales. However, these CNN-based approaches remain limited by their restricted receptive fields, making it difficult to capture long-range dependencies. Moreover, their skip connections typically rely on simple feature concatenation, often leading to information redundancy and degraded segmentation accuracy.
With the evolution of deep learning, semantic segmentation of remote sensing imagery has gradually entered the era of attention mechanisms and Transformers. In terms of attention, Woo et al. [26] proposed CBAM, which sequentially applies channel and spatial attention modules to precisely enhance the representation of critical features. Li et al. [27] introduced SCAttNet, which embeds lightweight channel and spatial attention modules to adaptively refine features, significantly improving small-object segmentation performance in high-resolution remote sensing images. CE-Net, proposed by Gu et al. [28], leverages a context encoder to effectively aggregate multi-scale contextual information for enhanced segmentation accuracy. Liu et al. [29] presented AFNet, an adaptive fusion network that employs a Scale Feature Attention Module (SFAM) to accommodate objects of varying sizes, a Scale-Layer Attention Module (SLAM) to align receptive fields for easily confused classes, and an Adjacent Confidence Score Refinement (ACSR) module to optimize classification. Transformer models [30], owing to their powerful global modeling capacity, have become a research hotspot. Liu et al. [31] proposed the Swin Transformer, which constructs hierarchical feature maps via shifted windows, achieving linear computational complexity. Models such as SETR [32] and Swin-Unet [33] break the local receptive field limitation of CNNs but suffer from high computational overhead and suboptimal performance on small objects. In hybrid architectures, Li et al. [34] proposed MACU-Net, which innovatively adopts multi-scale skip connections and Asymmetric Convolution Blocks (ACBs) to optimize feature fusion efficiency. Nevertheless, these models still exhibit notable shortcomings: Transformer-based architectures struggle to scale to high-resolution remote sensing imagery, while asymmetric convolutions focus solely on local feature extraction and lack consideration of the global semantic context.
Unlike conventional semantic segmentation tasks on medium-resolution datasets, high-resolution remote sensing images exhibit several distinctive characteristics, including extremely large spatial dimensions, complex multi-scale object distributions, and long-range spatial dependencies across distant regions. These properties significantly increase the difficulty of feature modeling, as local convolutional operations struggle to capture global contextual relationships, while global modeling mechanisms may lose fine-grained spatial details. Although image cropping is adopted during training to accommodate GPU memory limitations, the core challenge remains how to effectively model both long-range contextual interactions and fine-grained boundary structures within high-resolution imagery. Therefore, the proposed GS-USTNet is specifically designed to enhance global–local feature interaction and adaptive spatial modeling, which are critical for high-resolution remote sensing semantic segmentation.
In summary, despite significant advances in semantic segmentation of remote sensing imagery, several critical challenges remain. First, the parameter-sharing mechanism of standard convolutions is rigid and uniform, making it inflexible for adapting to highly heterogeneous regions commonly found in remote sensing scenes, and thus failing to simultaneously capture discriminative features of diverse land cover types such as buildings and water bodies. Second, conventional skip connections typically employ naive feature concatenation, which introduces substantial noise and redundancy, severely degrading boundary delineation accuracy. Third, existing adaptive convolution approaches (e.g., Asymmetric Convolution Blocks, ACBs) focus solely on local feature modulation while neglecting guidance from the global semantic context, limiting their performance in complex scenarios. To address these issues, we propose GS-USTNet, which integrates a Global–Local Adaptive Convolution (GLAConv) module for dynamic, context-aware filtering and a Skip-Guided Attention (SGA) mechanism to refine information flow across skip connections. The main contributions of this work are summarized as follows:
- 1. We present GS-USTNet, a novel U-shaped architecture tailored for high-resolution remote sensing image segmentation. By organically integrating GLAConv and SGA, our model effectively tackles key challenges, including high intra-class variation, ambiguous boundaries, and coexisting multi-scale objects, offering a new paradigm for high-precision segmentation in complex remote sensing scenes.
- 2. We design the Global–Local Adaptive Convolution (GLAConv) module, which dynamically models the dependency between global context and local responses to generate content-aware convolutional weights. This enables truly spatially and semantically adaptive feature extraction, significantly enhancing the representational capacity for heterogeneous regions such as urban–rural fringes and mixed-use areas.
- 3. We propose the Skip-Guided Attention (SGA) mechanism, which introduces a learnable spatial–channel joint gating strategy during decoding to adaptively select and reweight features from encoder skip connections. This effectively suppresses background noise and redundant information, substantially improving boundary detail recovery under class imbalance or complex backgrounds.
- 4. We conduct comprehensive experiments on two authoritative remote sensing segmentation benchmarks—WHDLD and GID. Results show that our method achieves overall accuracies (OAs) of 86.31% and 87.89%, and F1-scores of 76.57% and 67.30%, respectively, outperforming current mainstream approaches. Ablation studies further validate the effectiveness and synergistic benefits of each proposed component.
3. Experiments and Results Analysis
3.1. Loss Function and Implementation Details
In remote sensing image semantic segmentation, the model is required to classify every pixel in the input image, which constitutes a typical multi-class pixel-wise prediction task. To this end, this paper adopts the multi-class cross-entropy loss as the optimization objective to measure the discrepancy between the predicted outputs and the ground truth annotations.
Let the input image have spatial dimensions $H \times W$ and $C$ semantic classes. For any pixel location $(i, j)$, the model outputs a prediction vector of length $C$, where the $c$-th element $p_{i,j,c}$ denotes the predicted probability that the pixel belongs to class $c$. The corresponding ground truth label is denoted as $y_{i,j}$. The overall cross-entropy loss is then defined as

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \mathbb{1}\left[y_{i,j} = c\right] \log p_{i,j,c},$$

where $\mathbb{1}[\cdot]$ is the indicator function that equals 1 if the condition is true and 0 otherwise, and $N$ denotes the total number of valid pixels involved in the loss computation.
Considering that remote sensing semantic segmentation datasets often contain invalid or unlabeled regions—such as padded borders or noisy areas—we incorporate an ignore mechanism during loss calculation. Specifically, pixels with ground truth label value 255 are excluded from the loss, meaning that they do not contribute to backpropagation or parameter updates. This strategy effectively prevents corrupted or undefined labels from interfering with model training, thereby improving training stability and convergence.
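The ignore mechanism can be sketched in a few lines of pure Python. The function name and the toy probabilities below are illustrative only, not taken from the paper's implementation (which relies on a framework-level cross-entropy with an ignore index):

```python
import math

IGNORE_LABEL = 255  # label value marking invalid or unlabeled pixels

def masked_cross_entropy(probs, labels):
    """Mean cross-entropy over valid pixels only (illustrative sketch)."""
    total, n_valid = 0.0, 0
    for p, y in zip(probs, labels):
        if y == IGNORE_LABEL:
            continue  # ignored pixels contribute neither loss nor gradient
        total += -math.log(p[y])  # -log probability of the true class
        n_valid += 1
    return total / n_valid

# Toy example: two valid pixels and one ignored pixel (label 255).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 255]
loss = masked_cross_entropy(probs, labels)  # averaged over the 2 valid pixels
```

Because the third pixel carries the ignore label, the average is taken over two pixels only, exactly as described above.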
In practice, the loss function is implemented as

$$\mathcal{L} = \mathrm{CrossEntropy}(P, Y),$$

where $P$ and $Y$ denote the model’s predicted probability maps and the ground truth label map, respectively.
Overall, the cross-entropy loss directly enforces pixel-level classification accuracy. When combined with the proposed GLAConv module and Skip-Guided Attention (SGA) mechanism, it enables the network to maintain strong overall performance while placing greater emphasis on boundary regions and fine-grained structures in complex scenes, thereby enhancing the overall segmentation accuracy for high-resolution remote sensing imagery.
All models are trained for 200 epochs using the Adam optimizer. The learning rate follows a cosine annealing schedule, decaying from its initial value to a fixed minimum. A batch size of 16 is used throughout training. Input images are resized to the respective network input sizes for the GID and WHDLD datasets. Random horizontal flipping is applied as the only data augmentation strategy. The loss function is standard cross-entropy, with label value 255 (denoting invalid or unlabeled regions in GID) ignored during backpropagation. All experiments are conducted on an NVIDIA GeForce RTX 4080 Super GPU using PyTorch 1.12.1. To ensure a fair comparison, identical training protocols are applied to all baseline methods.
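As a sketch of how such a schedule behaves, the following pure-Python function implements standard cosine annealing. The initial and minimum learning rates used here (1e-3 and 1e-6) are illustrative placeholders, not the paper's actual settings:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min):
    """Learning rate at a given (0-indexed) epoch under cosine annealing."""
    cos_factor = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr_min + (lr_init - lr_min) * cos_factor

# Illustrative settings only: 200 epochs, decaying from 1e-3 toward 1e-6.
schedule = [cosine_annealing_lr(e, 200, 1e-3, 1e-6) for e in range(200)]
```

The schedule starts at the initial rate and decreases monotonically along a half-cosine, reaching the minimum at the final epoch.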
3.2. Datasets and Evaluation Metrics
3.2.1. Datasets
We conduct experiments on two benchmark datasets—the Wuhan Dense Labeling Dataset (WHDLD) [38] and the Gaofen Image Dataset (GID)—with representative samples shown in Figure 5 and Figure 6.
The WHDLD is a publicly available dataset specifically designed for semantic segmentation of high-resolution urban scenes. It comprises imagery acquired from the Gaofen-1 and ZY-3 satellites over the Wuhan metropolitan area in China. The original multispectral and panchromatic bands were fused and resampled to produce standardized RGB products with a spatial resolution of 2 m per pixel. The dataset contains 4940 remote sensing images of uniform size, stored in true-color RGB format to preserve natural spectral characteristics. Each image is paired with a pixel-aligned ground truth label map, where every pixel is precisely annotated into one of six semantic classes: bare soil, building, pavement, vegetation, road, and water.
The GID used in this study is derived from Gaofen-2 satellite imagery and consists of 150 large-scale RGB images covering 60 cities across China. Each original image has a ground sampling distance of 4 m and corresponds to approximately 506 km² of geographic coverage, thereby providing rich land cover information for fine-grained land use classification [39]. To facilitate model training and evaluation, all images were uniformly cropped into non-overlapping patches. GID employs a systematic and fine-grained annotation scheme encompassing 15 land cover categories with clear semantic meanings, namely, industrial land, urban residential, rural residential, transportation land, paddy field, irrigated farmland, dry cropland, orchard, arbor forest, shrubland, natural grassland, artificial grassland, river, lake, and pond. This multi-class, high-resolution labeling framework enables comprehensive modeling of complex land cover patterns.
To enhance training stability and generalization capability and to align with the input requirements of the proposed network, we perform systematic data preprocessing on both the WHDLD and GID datasets.
For the GID dataset, the original patches are further resized to the network input size to improve training efficiency, while the WHDLD images retain their native resolution. Given the differences in label formats between the two datasets, we apply dataset-specific normalization procedures: in GID, the ignore label value (255) in the single-channel ground truth maps is remapped to class index 14 (the 15th class) to prevent interference during training; for WHDLD, RGB pseudo-color label maps are converted into categorical index maps via a predefined color-to-class mapping, corresponding to six land cover classes—bare soil, building, pavement, road, vegetation, and water.
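The two label-normalization steps can be illustrated as follows. Note that the RGB color table below is hypothetical, chosen only to make the example self-contained; the real WHDLD color-to-class mapping is defined by the dataset release:

```python
# Hypothetical RGB color table for illustration only -- the actual WHDLD
# color-to-class mapping is defined by the dataset release.
WHDLD_COLOR_TO_INDEX = {
    (128, 128, 128): 0,  # bare soil
    (255, 0, 0):     1,  # building
    (192, 192, 0):   2,  # pavement
    (255, 255, 0):   3,  # road
    (0, 255, 0):     4,  # vegetation
    (0, 0, 255):     5,  # water
}

def remap_gid_labels(label_map, ignore_value=255, target_index=14):
    """GID: remap the ignore value in a single-channel label map to class 14."""
    return [[target_index if v == ignore_value else v for v in row]
            for row in label_map]

def rgb_to_index(rgb_label_map, color_table):
    """WHDLD: convert an RGB pseudo-color label map to categorical indices."""
    return [[color_table[tuple(px)] for px in row] for row in rgb_label_map]
```

In practice these operations are applied once, offline, so that both datasets feed the network single-channel integer index maps.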
Due to the extremely large spatial dimensions of high-resolution remote sensing images, a cropping strategy is employed during training to fit the data into GPU memory. It is important to note that cropping is only a training strategy for computational feasibility, rather than a replacement for high-resolution modeling. The proposed GS-USTNet is specifically designed to handle the intrinsic characteristics of high-resolution imagery, such as large spatial coverage and complex contextual relationships, which remain present within each cropped region.
Subsequently, a series of data augmentation techniques are applied exclusively to the training set, including random horizontal flipping, random rotation, color jittering, and Gaussian noise injection, to improve the model’s robustness to varying imaging conditions and scene appearances. All input RGB images are normalized by scaling pixel values to the range [0, 1], which accelerates convergence and enhances numerical stability during optimization. Finally, both datasets are split into training, validation, and test subsets at an approximate ratio of 8:1:1. The detailed partition statistics are summarized in Table 1.
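An 8:1:1 partition of this kind can be sketched as follows; the fixed seed and shuffling policy are illustrative assumptions, not the paper's exact protocol:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and partition a list of samples into train/val/test subsets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# WHDLD contains 4940 images; an 8:1:1 split yields 3952/494/494 samples.
train, val, test = split_dataset(range(4940))
```

Assigning the remainder to the test set guarantees every sample lands in exactly one subset.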
3.2.2. Evaluation Metrics
To comprehensively evaluate the performance of GS-USTNet on remote sensing image semantic segmentation, we adopt four standard metrics: overall accuracy (OA), Mean Accuracy (MA), Mean Intersection over Union (mIoU), and mean F1-score. All metrics are computed on the test set. The confusion matrix for a binary classification case is illustrated in Table 2, where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively. These quantities form the basis for computing the aforementioned evaluation metrics.
Overall accuracy (OA) measures the proportion of correctly classified pixels across all classes:

$$\mathrm{OA} = \frac{\sum_{c=1}^{C} \mathrm{TP}_c}{N},$$

where $\mathrm{TP}_c$ denotes the number of true positive pixels for class $c$, $C$ is the total number of classes, and $N$ is the total number of valid pixels. While OA provides an intuitive measure of global segmentation performance, it is sensitive to class imbalance and can be dominated by majority classes.
Mean Accuracy (MA) mitigates this bias by first computing the per-class accuracy and then averaging across all classes:

$$\mathrm{MA} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c},$$

where $\mathrm{FN}_c$ is the number of false negatives for class $c$. MA offers a more balanced assessment of model performance across diverse land cover categories.
Mean Intersection over Union (mIoU) is one of the most widely used metrics in semantic segmentation, quantifying the spatial overlap between predictions and ground truth:

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

where $\mathrm{FP}_c$ denotes the number of false positives for class $c$.
The F1-score for class $c$ combines precision and recall into a single harmonic mean:

$$F1_c = \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c} = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

where $\mathrm{Precision}_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FP}_c)$ and $\mathrm{Recall}_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FN}_c)$; the mean F1-score is obtained by averaging $F1_c$ over all $C$ classes.
By integrating these complementary metrics, our evaluation framework assesses GS-USTNet from multiple perspectives—global correctness, per-class fairness, and region-wise overlap—ensuring a rigorous, objective, and convincing validation of its segmentation capability in complex remote sensing scenarios.
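All four metrics can be derived from a single confusion matrix. The helper below is a minimal pure-Python sketch (it assumes every class appears at least once, so no zero-division guard is included):

```python
def segmentation_metrics(cm):
    """OA, MA, mIoU and mean F1 from a C x C confusion matrix, where
    cm[t][p] counts pixels of true class t predicted as class p.
    Assumes every class occurs at least once (no zero-division guard)."""
    C = len(cm)
    n = sum(sum(row) for row in cm)                     # total valid pixels
    tp = [cm[c][c] for c in range(C)]                   # diagonal entries
    fn = [sum(cm[c]) - cm[c][c] for c in range(C)]      # missed pixels per class
    fp = [sum(cm[r][c] for r in range(C)) - cm[c][c] for c in range(C)]

    oa = sum(tp) / n
    ma = sum(tp[c] / (tp[c] + fn[c]) for c in range(C)) / C
    miou = sum(tp[c] / (tp[c] + fp[c] + fn[c]) for c in range(C)) / C
    mf1 = sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in range(C)) / C
    return oa, ma, miou, mf1

# Toy 2-class confusion matrix: rows are ground truth, columns are predictions.
oa, ma, miou, mf1 = segmentation_metrics([[8, 2], [1, 9]])
```

Running the four formulas off one matrix keeps the metrics mutually consistent, since they all share the same TP, FP, and FN counts.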
3.3. Comparative Experiments
3.3.1. Quantitative Evaluation
To validate the effectiveness of the proposed GS-USTNet for remote sensing image semantic segmentation, we conduct quantitative comparisons with a variety of state-of-the-art methods on two representative datasets: the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID). The compared methods include U-Net [21], U-Net++ [22], DeepLabV3+ [40], PSPNet [23], CE-Net [28], FGC [41], DATUNet [42], MACUNet [34], and the original USTNet [35]. The evaluation metrics include the number of model parameters (Params), overall accuracy (OA), Average Accuracy (AA), F1-score, and Mean Intersection over Union (mIoU). The selected comparison methods include several representative and widely adopted classical semantic segmentation architectures, such as FCN-based, encoder–decoder, and attention-enhanced frameworks. These models are commonly used as benchmark baselines in remote sensing semantic segmentation and provide a fair and stable reference for evaluating architectural improvements. Although more recent Transformer-based models have been proposed, our primary goal is to demonstrate the effectiveness of the proposed global–local adaptive mechanism compared with well-established segmentation paradigms. Future work will include comparisons with more recent large-scale Transformer-based architectures. The results are summarized in Table 3 and Table 4, where the best performance in each column is underlined.
As shown in Table 3, GS-USTNet achieves the best overall performance on WHDLD, consistently outperforming all competitors across all four core metrics. Specifically, GS-USTNet attains an OA of 86.31%, which represents a significant improvement of 2.88 percentage points over the baseline USTNet. Moreover, its AA, F1-score, and mIoU reach 75.00%, 76.57%, and 64.02%, respectively—ranking first among all methods.
Compared to classical architectures such as U-Net and its variants (U-Net++, CE-Net), GS-USTNet improves mIoU by 5.11%, 2.64%, and 3.14%, respectively. This demonstrates that the proposed Global–Local Adaptive Convolution (GLAConv) module effectively enhances multi-scale feature representation, while the Skip-Guided Attention (SGA) mechanism provides superior boundary delineation in complex urban scenes. Notably, despite DATUNet’s significantly larger model size (82.43 M parameters), its segmentation performance remains inferior to GS-USTNet. In contrast, GS-USTNet achieves better results with only 13.75 M parameters, indicating an excellent trade-off between accuracy and computational complexity. These results confirm the strong feature modeling capability and robustness of GS-USTNet in densely labeled remote sensing scenarios.
On the more challenging GID dataset, GS-USTNet again demonstrates superior performance, as shown in Table 4. It achieves the highest scores across all four metrics, with an OA of 87.89% and an mIoU of 65.96%, representing improvements of 4.93 and 11.54 percentage points over the original USTNet, respectively. The GID dataset contains multiple land cover classes with severe class imbalance, posing a greater challenge to model generalization. The significantly higher AA (67.32%) of GS-USTNet indicates that the SGA mechanism effectively guides the decoder to reconstruct features for minority classes, thereby improving per-class fairness without compromising overall accuracy.

Furthermore, compared to canonical segmentation models such as DeepLabV3+ and PSPNet, GS-USTNet improves mIoU by 7.31% and 15.26%, respectively, further validating its strong adaptability to multi-scale objects and irregular boundaries in high-resolution remote sensing imagery.

In summary, the consistent superiority of GS-USTNet across both WHDLD and GID datasets highlights its robustness in diverse remote sensing scenarios. This success can be attributed to two key innovations: (1) the GLAConv module dynamically integrates global contextual information with local receptive responses, enhancing spatial structure modeling; and (2) the Skip-Guided Attention mechanism alleviates attention dispersion in U-Net-like architectures under complex backgrounds, significantly improving boundary fidelity and fine-grained object segmentation. With a moderate model size (13.75 M parameters), GS-USTNet achieves state-of-the-art performance across multiple core metrics, fully demonstrating its effectiveness and practical value for remote sensing image semantic segmentation.
3.3.2. Qualitative Analysis
To further evaluate the practical performance of GS-USTNet in remote sensing image semantic segmentation, we conduct a qualitative comparison on representative regions from the Wuhan Dense Labeling Dataset (WHDLD), as shown in Figure 7. Each color corresponds to a specific land cover class: gray denotes bare soil, red represents buildings, olive yellow indicates paved areas, yellow signifies roads, green stands for vegetation, and blue marks water bodies.
From the overall segmentation results, GS-USTNet demonstrates a strong ability to distinguish between different land cover types in complex scenes with multiple coexisting classes. Its predictions exhibit high spatial consistency with the ground truth annotations. Notably, in regions where buildings are closely adjacent to other land covers, the model accurately delineates red building areas and effectively suppresses leakage into neighboring vegetation (green) or paved areas (olive yellow), thereby significantly reducing inter-class confusion.

For linear or boundary-sharp objects with large-scale variations—such as roads and water bodies—GS-USTNet also shows superior structural integrity. The visual results reveal that road regions (yellow) maintain good connectivity with notably fewer fragmentation artifacts. Water bodies (blue) exhibit complete contours and sharp boundaries against surrounding vegetation or bare soil, indicating robust fine-grained structure modeling and boundary discrimination capabilities.

Moreover, for spectrally similar classes such as paved areas and bare soil, several baseline methods suffer from noticeable misclassifications, whereas GS-USTNet stably distinguishes gray bare soil from olive yellow paved regions, leading to improved regional completeness and semantic coherence. This observation suggests that the integration of the Global–Local Adaptive Convolution (GLAConv) module enhances the model’s discriminative power by effectively fusing contextual information with local textural cues.
In summary, the visual results on WHDLD provide compelling qualitative evidence of GS-USTNet’s advantages in complex remote sensing scenarios. The model not only preserves global structural fidelity but also achieves finer discrimination at multi-class boundaries, offering intuitive support for the quantitative improvements reported earlier.
To further validate the generalization capability of GS-USTNet under diverse and complex land cover conditions, we also perform qualitative comparisons on selected representative regions from the Gaofen Image Dataset (GID), as illustrated in Figure 8. The GID dataset features rich land cover types, fine-grained semantic categories, and intricate spatial distributions, posing significant challenges to multi-class discrimination and boundary delineation. The color coding is as follows: red—industrial land; magenta—urban residential; light brown—rural residential; pink—transportation land; dark green—paddy fields; light green—irrigated farmland; gray-green—dry cropland; purple—orchards; dark purple—broadleaf forest; light purple—shrubland; yellow—natural grassland; olive yellow—artificial grassland; dark blue—rivers; cyan—lakes; bright blue—ponds. In general, GS-USTNet’s predictions on GID align closely with the ground truth in both the spatial structure and semantic distribution. In densely built-up areas such as industrial zones and urban residential regions, the model clearly separates different types of constructed land, with sharp boundaries between red and magenta regions and minimal inter-class confusion. Compared to baseline methods, GS-USTNet exhibits stronger discriminability between spectrally similar classes like urban and rural residential areas.
For agricultural and vegetation-related categories, the model maintains excellent spatial continuity across paddy fields, irrigated farmland, and dry cropland. The green-shaded regions appear structurally coherent without excessive fragmentation or over-smoothing. In areas where orchards, broadleaf forests, and shrublands intermingle, GS-USTNet successfully preserves distinct boundaries between these vegetation subtypes, highlighting its strength in modeling complex vegetative structures. Regarding water bodies, the model produces stable predictions across large-scale regions: rivers (dark blue), lakes (cyan), and ponds (bright blue) all exhibit clear contours and natural transitions at interfaces with surrounding farmland or built-up areas. Even in narrow river channels or small water bodies, GS-USTNet maintains better connectivity and reduces fragmentation or misclassification compared to competing methods.
In conclusion, although the visualization includes only a subset of compared models, the representative samples from GID clearly demonstrate that GS-USTNet achieves superior semantic consistency and spatial structural integrity in highly complex, multi-class land cover scenarios. These qualitative findings are fully consistent with the quantitative improvements observed in OA, AA, F1-score, and mIoU, further confirming the effectiveness and robustness of GS-USTNet for semantic segmentation of high-resolution remote sensing imagery.
3.3.3. Computational Complexity Analysis
To quantitatively substantiate our claim that the proposed GS-USTNet maintains high computational efficiency, we conduct a comparative analysis of model complexity in terms of the number of trainable parameters (Params) and floating-point operations (FLOPs). All models are evaluated on the GID dataset. The results are summarized in Table 5.
As shown in Table 5, the introduction of the GLAConv and SGA modules increases the model’s parameter count from 8.47 M (USTNet) to 13.75 M. However, this increase in capacity comes with only a modest rise in computational cost, as the FLOPs grow from 3.3099 G to 4.8011 G. This demonstrates that our architectural enhancements are highly efficient; the significant performance gains reported in Section 3.3 (e.g., +4.93% OA and +11.54% mIoU over USTNet on GID) are achieved without imposing a substantial burden on inference speed or hardware resources. The results confirm that GS-USTNet offers an excellent trade-off between accuracy and efficiency, making it well suited for practical remote sensing applications.
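As a rough illustration of how parameter and FLOP figures of this magnitude arise, the helper below estimates the cost of a single convolutional layer. It is a back-of-envelope sketch, not the profiling tool used in this work, and FLOP-counting conventions (e.g., whether one multiply-accumulate counts as one or two FLOPs) differ between tools:

```python
def conv2d_cost(c_in, c_out, k, h, w, bias=True):
    """Parameters and multiply-accumulate FLOPs of one k x k convolution
    on a c_in x h x w input with stride 1 and 'same' padding."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    flops = c_out * c_in * k * k * h * w  # one MAC per weight per output pixel
    return params, flops

# A single 3x3, 64->128 convolution on a 256x256 feature map already costs
# on the order of a few GFLOPs, which is why whole-network budgets sit in
# the single-digit-GFLOP range at these input sizes.
params, flops = conv2d_cost(64, 128, 3, 256, 256)
```

Summing such per-layer estimates over a network is the standard way lightweight profilers arrive at totals like those in Table 5.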
3.4. Ablation Study
To systematically validate the effectiveness and rationality of the key components in GS-USTNet, we conduct a series of ablation experiments on the WHDLD dataset. The study primarily focuses on two proposed modules: the Global–Local Adaptive Convolution (GLAConv) and the Skip-Guided Attention (SGA) mechanism. By incrementally integrating these components into the baseline architecture, we analyze their individual and combined contributions to the overall segmentation performance.
Specifically, we construct four model variants:
- 1. USTNet: the original baseline without either proposed module;
- 2. USTNet+GLAC: USTNet equipped with only the GLAConv module;
- 3. USTNet+SGA: USTNet enhanced with only the SGA mechanism;
- 4. GS-USTNet: the full model incorporating both GLAConv and SGA.
All variants are trained under identical optimization settings and evaluated using the same metrics: number of parameters (Params), overall accuracy (OA), Average Accuracy (AA), F1-score, and Mean Intersection over Union (mIoU). The results are summarized in Table 6.
Ablation 1: Effect of GLAConv. To assess the impact of GLAConv on feature representation, we integrate it into the encoder–decoder backbone of USTNet, yielding USTNet+GLAC. This module dynamically fuses global contextual cues with local receptive responses to generate content-aware convolutional weights, thereby enhancing feature discriminability. As shown in Table 6, USTNet+GLAC achieves marginal improvements in F1-score (72.39%) and mIoU (60.23%) compared to the baseline, indicating that GLAConv contributes to better intra-class consistency and boundary delineation. However, its gains in OA and AA are limited, suggesting that adaptive convolution alone is insufficient to guide the decoder toward salient regions effectively. This implies that GLAConv’s benefits are best realized when coupled with a higher-level attention mechanism.
Ablation 2: Effect of SGA. We then evaluate the Skip-Guided Attention mechanism by constructing USTNet+SGA. SGA leverages skip connections between the encoder and decoder to apply joint spatial–channel attention during feature reconstruction, mitigating attention dispersion in complex backgrounds—a common issue in U-Net-like architectures.
The results show significant performance gains: OA increases to 85.56% and mIoU to 63.49%, representing improvements of 2.13 and 3.41 percentage points over the baseline, respectively. Notably, the substantial gains in AA and F1-score confirm that SGA enhances per-class fairness and refines fine-grained structures, particularly at object boundaries. This validates the critical role of SGA in remote sensing semantic segmentation.
Ablation 3: Synergistic Effect of GLAConv and SGA. Finally, we combine both modules to form the complete GS-USTNet. As reported in Table 6, GS-USTNet achieves the best performance across all metrics: OA = 86.31%, AA = 75.00%, F1-score = 76.57%, and mIoU = 64.02%, outperforming the baseline by 2.88, 3.62, 4.34, and 3.94 percentage points, respectively. Moreover, it consistently surpasses the single-module variants, demonstrating strong complementarity between GLAConv (feature-level modeling) and SGA (attention-guided decoding).
Although the full model incurs a moderate increase in parameter count (from 8.47 M to 13.75 M), the performance gain is substantial, reflecting a favorable trade-off between accuracy and complexity.
In summary, the ablation study quantitatively confirms the individual efficacy and synergistic interaction of GLAConv and SGA. GLAConv enriches feature representation by adaptively integrating global context and local details during encoding, while SGA refines decoding through guided attention, improving both class-wise discrimination and boundary fidelity. Their combination yields consistent and stable improvements, underscoring the rationality of the proposed architecture and its advantage in global–local collaborative modeling for complex remote sensing scenes.
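The four metrics used throughout the ablation (OA, AA, F1-score, mIoU) all derive from the class confusion matrix via their standard definitions, which can be computed as follows (macro-averaged F1; the sketch assumes every class occurs at least once in both ground truth and predictions, so no division by zero arises):

```python
import numpy as np

def segmentation_metrics(cm):
    """OA, AA, macro F1, and mIoU from a KxK confusion matrix cm,
    where cm[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    gt = cm.sum(axis=1).astype(float)    # ground-truth pixels per class
    pred = cm.sum(axis=0).astype(float)  # predicted pixels per class
    oa = tp.sum() / cm.sum()             # overall accuracy
    recall = tp / gt
    precision = tp / pred
    aa = recall.mean()                   # average (per-class) accuracy
    f1 = (2 * precision * recall / (precision + recall)).mean()
    iou = tp / (gt + pred - tp)          # per-class intersection over union
    return oa, aa, f1, iou.mean()
```

For example, a two-class matrix `[[3, 1], [1, 3]]` gives OA = AA = F1 = 0.75 and mIoU = 0.6, illustrating why mIoU is typically the strictest of the four.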
4. Discussion
This work presents a comprehensive experimental evaluation of GS-USTNet on two representative remote sensing benchmarks: the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID). The evaluation encompasses quantitative comparisons, ablation studies, and qualitative visual analysis, providing multi-faceted evidence of the model’s effectiveness and architectural soundness.
In quantitative comparisons, GS-USTNet consistently outperforms a wide range of state-of-the-art methods—including classical architectures (U-Net, DeepLabV3+, PSPNet), lightweight designs (FGC, MACUNet), and recent advances (DATUNet, USTNet)—across all core metrics (OA, AA, F1-score, mIoU) on both datasets. Notably, the performance margin is more pronounced on GID, a dataset characterized by high inter-class similarity, severe class imbalance, and large-scale variations. This highlights GS-USTNet’s superior generalization capability and robustness in complex land cover scenarios.
The ablation study further corroborates the design rationale. GLAConv provides stable performance gains with negligible parameter overhead, while SGA significantly boosts both overall accuracy and per-class fairness. Their integration yields the best results, confirming a synergistic relationship between adaptive feature modeling and attention-guided decoding.
Qualitative analysis reinforces these findings. Across diverse scenes—ranging from dense urban areas to intricate agricultural–forestry–water systems—GS-USTNet produces segmentation maps with high semantic coherence, sharp boundaries, and minimal misclassification. The visual fidelity aligns closely with ground truth annotations and supports the quantitative trends, thereby enhancing the credibility of our conclusions.
Collectively, the experimental results demonstrate that GS-USTNet achieves a compelling balance between efficiency and accuracy. The proposed global–local collaborative modeling strategy, coupled with the Skip-Guided Attention mechanism, offers a practical and effective solution for fine-grained semantic segmentation in complex remote sensing imagery. This work not only advances the state of the art but also provides a solid foundation for future research and real-world applications in geospatial intelligence.
5. Conclusions
To address key challenges in remote sensing image semantic segmentation—such as significant intra-class variation, ambiguous object boundaries, and the coexistence of multi-scale land cover objects—we propose an enhanced segmentation model, GS-USTNet, built upon the USTNet framework. The proposed architecture integrates two core components: a Global–Local Adaptive Convolution (GLAConv) module and a Skip-Guided Attention (SGA) mechanism. GLAConv strengthens global contextual modeling during feature extraction by dynamically fusing local and global cues, while SGA enhances discriminative capability at critical regions during the decoding phase through attention guidance derived from skip connections. Together, these modules significantly improve segmentation accuracy and result stability in complex remote sensing scenes. Extensive experiments on two representative benchmarks—the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID)—demonstrate that GS-USTNet consistently outperforms a wide range of state-of-the-art and recently proposed methods across multiple evaluation metrics, including overall accuracy (OA), average accuracy (AA), F1-score, and mean intersection over union (mIoU). Ablation studies further validate the individual contributions and synergistic effects of GLAConv and SGA, while qualitative visual comparisons intuitively illustrate the model’s superior performance in boundary preservation, fine-detail recovery, and inter-class discrimination.
Collectively, both quantitative and qualitative results confirm that GS-USTNet exhibits strong generalization capability and robustness in scenarios characterized by complex land cover distributions and diverse semantic categories. Nevertheless, certain limitations remain. First, the model size of GS-USTNet is larger than that of the original USTNet, which may hinder deployment in resource-constrained environments; further optimization for efficiency is therefore warranted. Second, the current study focuses exclusively on optical remote sensing imagery and does not yet exploit the complementary potential of multi-source data, such as Synthetic Aperture Radar (SAR) imagery, Digital Surface Models (DSMs), or multi-temporal sequences.
Future work will focus on three directions: (1) lightweight architectural design to reduce computational overhead; (2) multi-modal fusion strategies for joint modeling of heterogeneous geospatial data; and (3) enhanced cross-region generalization to improve real-world applicability. These efforts aim to broaden the practical utility and scalability of GS-USTNet in operational remote sensing applications. In summary, GS-USTNet delivers competitive and stable performance relative to representative baseline models; although the improvements on certain metrics are moderate, the consistent gains across multiple datasets validate the effectiveness of integrating Global–Local Adaptive Convolution and Skip-Guided Attention. Further validation against more recent large-scale segmentation frameworks will be pursued in future research to more comprehensively assess the generalization capability of the proposed approach.