1. Introduction
Semantic segmentation of remote sensing imagery represents a core task at the intersection of computer vision and remote sensing. It fundamentally involves pixel-wise classification—assigning a semantic label to each pixel in the image—to produce an output semantic map that matches the spatial resolution of the input. This enables precise delineation of land cover types and accurate boundary characterization. Owing to its high practical value, this technique underpins a wide range of critical applications, including urban planning [1,2,3], land resource utilization [4,5,6,7], lane analysis for autonomous vehicles [8,9,10,11], environmental monitoring [12], disaster response [13], and cropland change detection [14].
Early traditional approaches to semantic segmentation of remote sensing imagery primarily relied on handcrafted features combined with classical models, which can be broadly categorized into two groups. The first group comprises statistical learning methods. For instance, Maulik et al. [15] proposed an improved differential evolution-based automatic fuzzy clustering algorithm (MoDEAFC), which enhances global optimization through an adaptive mutation strategy and employs the Xie–Beni (XB) index as the fitness function to automatically determine the number of clusters and optimal partitioning. Rutherford et al. [16] developed a regression modeling framework tailored for simulating vegetation succession. The second group consists of machine learning techniques: Inglada et al. [17] combined geometric handcrafted features with support vector machines (SVMs); Du et al. [18] integrated random forests with GIS data for multispectral classification; Tatsumi et al. [19] extracted statistical features from EVI time series and coupled them with random forests for crop classification; and Fu et al. [20] improved hyperspectral vegetation classification accuracy by selecting low-redundancy spectral features via between-class scatter matrices and fusing them with Gabor spatial features. Despite these efforts, traditional methods suffer from significant limitations—they rely heavily on expert-designed features, exhibit poor generalization, and struggle with challenges such as ambiguous object boundaries and spectral mixing in complex scenes, thereby failing to meet the demands of precise remote sensing interpretation.
In recent years, deep learning methods have rapidly advanced and been widely adopted in semantic segmentation of remote sensing imagery. The U-Net model proposed by Ronneberger et al. [21], featuring an encoder–decoder architecture with skip connections, has become a foundational baseline in this field. Zhou et al. [22] introduced U-Net++, which employs nested skip connections to optimize feature fusion and effectively mitigate the semantic gap. PSPNet, proposed by Zhao et al. [23], incorporates a pyramid pooling module to enhance multi-scale contextual representation. SegNet, developed by Badrinarayanan et al. [24], utilizes the first 13 convolutional layers of VGG16 as its encoder and reuses pooling indices for upsampling, establishing itself as an efficient and lightweight classic for remote sensing segmentation. DeepLabV3+, proposed by Chen et al. [25], combines atrous convolution with the Atrous Spatial Pyramid Pooling (ASPP) module to further improve adaptability to objects of varying scales. However, these CNN-based approaches remain limited by their restricted receptive fields, making it difficult to capture long-range dependencies. Moreover, their skip connections typically rely on simple feature concatenation, often leading to information redundancy and degraded segmentation accuracy.
With the evolution of deep learning, semantic segmentation of remote sensing imagery has gradually entered the era of attention mechanisms and Transformers. In terms of attention, Woo et al. [26] proposed CBAM, which sequentially applies channel and spatial attention modules to precisely enhance the representation of critical features. Li et al. [27] introduced SCAttNet, which embeds lightweight channel and spatial attention modules to adaptively refine features, significantly improving small-object segmentation performance in high-resolution remote sensing images. CE-Net, proposed by Gu et al. [28], leverages a context encoder to effectively aggregate multi-scale contextual information for enhanced segmentation accuracy. Liu et al. [29] presented AFNet, an adaptive fusion network that employs a Scale Feature Attention Module (SFAM) to accommodate objects of varying sizes, a Scale-Layer Attention Module (SLAM) to align receptive fields for easily confused classes, and an Adjacent Confidence Score Refinement (ACSR) module to optimize classification. Transformer models [30], owing to their powerful global modeling capacity, have become a research hotspot. Liu et al. [31] proposed the Swin Transformer, which constructs hierarchical feature maps via shifted windows, achieving linear computational complexity. Models such as SETR [32] and Swin-Unet [33] break the local receptive field limitation of CNNs but suffer from high computational overhead and suboptimal performance on small objects. In hybrid architectures, Li et al. [34] proposed MACU-Net, which innovatively adopts multi-scale skip connections and Asymmetric Convolution Blocks (ACBs) to optimize feature fusion efficiency. Nevertheless, these models still exhibit notable shortcomings: Transformer-based architectures struggle to scale to high-resolution remote sensing imagery, while asymmetric convolutions focus solely on local feature extraction and lack consideration of the global semantic context.
Unlike conventional semantic segmentation tasks on medium-resolution datasets, high-resolution remote sensing images exhibit several distinctive characteristics, including extremely large spatial dimensions, complex multi-scale object distributions, and long-range spatial dependencies across distant regions. These properties significantly increase the difficulty of feature modeling, as local convolutional operations struggle to capture global contextual relationships, while global modeling mechanisms may lose fine-grained spatial details. Although image cropping is adopted during training to accommodate GPU memory limitations, the core challenge remains how to effectively model both long-range contextual interactions and fine-grained boundary structures within high-resolution imagery. Therefore, the proposed GS-USTNet is specifically designed to enhance global–local feature interaction and adaptive spatial modeling, which are critical for high-resolution remote sensing semantic segmentation.
In summary, despite significant advances in semantic segmentation of remote sensing imagery, several critical challenges remain. First, the parameter-sharing mechanism of standard convolutions is rigid and uniform, making it inflexible for adapting to highly heterogeneous regions commonly found in remote sensing scenes, and thus failing to simultaneously capture discriminative features of diverse land cover types such as buildings and water bodies. Second, conventional skip connections typically employ naive feature concatenation, which introduces substantial noise and redundancy, severely degrading boundary delineation accuracy. Third, existing adaptive convolution approaches (e.g., Asymmetric Convolution Blocks, ACBs) focus solely on local feature modulation while neglecting guidance from the global semantic context, limiting their performance in complex scenarios. To address these issues, we propose GS-USTNet, which integrates a Global–Local Adaptive Convolution (GLAConv) module for dynamic, context-aware filtering and a Skip-Guided Attention (SGA) mechanism to refine information flow across skip connections. The main contributions of this work are summarized as follows:
- 1. We present GS-USTNet, a novel U-shaped architecture tailored for high-resolution remote sensing image segmentation. By organically integrating GLAConv and SGA, our model effectively tackles key challenges, including high intra-class variation, ambiguous boundaries, and coexisting multi-scale objects, offering a new paradigm for high-precision segmentation in complex remote sensing scenes.
- 2. We design the Global–Local Adaptive Convolution (GLAConv) module, which dynamically models the dependency between global context and local responses to generate content-aware convolutional weights. This enables truly spatially and semantically adaptive feature extraction, significantly enhancing the representational capacity for heterogeneous regions such as urban–rural fringes and mixed-use areas.
- 3. We propose the Skip-Guided Attention (SGA) mechanism, which introduces a learnable spatial–channel joint gating strategy during decoding to adaptively select and reweight features from encoder skip connections. This effectively suppresses background noise and redundant information, substantially improving boundary detail recovery under class imbalance or complex backgrounds.
- 4. We conduct comprehensive experiments on two authoritative remote sensing segmentation benchmarks—WHDLD and GID. Results show that our method achieves overall accuracies (OAs) of 86.31% and 87.89%, and F1-scores of 76.57% and 67.30%, respectively, outperforming current mainstream approaches. Ablation studies further validate the effectiveness and synergistic benefits of each proposed component.
3. Experiments and Results Analysis
3.1. Loss Function and Implementation Details
In remote sensing image semantic segmentation, the model is required to classify every pixel in the input image, which constitutes a typical multi-class pixel-wise prediction task. To this end, this paper adopts the multi-class cross-entropy loss as the optimization objective to measure the discrepancy between the predicted outputs and the ground truth annotations.
Let the input image have spatial dimensions $H \times W$ and $C$ semantic classes. For any pixel location $(i, j)$, the model outputs a prediction vector of length $C$, where the $c$-th element $p_{i,j,c}$ denotes the predicted probability that the pixel belongs to class $c$. The corresponding ground truth label is denoted as $y_{i,j}$. The overall cross-entropy loss is then defined as

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{c=1}^{C} \mathbb{1}\left[y_{i,j} = c\right] \log p_{i,j,c},$$

where $\mathbb{1}[\cdot]$ is the indicator function that equals 1 if the condition is true and 0 otherwise, and $N$ denotes the total number of valid pixels involved in the loss computation.
Considering that remote sensing semantic segmentation datasets often contain invalid or unlabeled regions—such as padded borders or noisy areas—we incorporate an ignore mechanism during loss calculation. Specifically, pixels with ground truth label value 255 are excluded from the loss, meaning that they do not contribute to backpropagation or parameter updates. This strategy effectively prevents corrupted or undefined labels from interfering with model training, thereby improving training stability and convergence.
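The ignore mechanism can be sketched in a few lines of pure Python. The function name and the toy probabilities below are illustrative only, not taken from the paper's implementation (which relies on a framework-level cross-entropy with an ignore index):

```python
import math

IGNORE_LABEL = 255  # label value marking invalid or unlabeled pixels

def masked_cross_entropy(probs, labels):
    """Mean cross-entropy over valid pixels only (illustrative sketch)."""
    total, n_valid = 0.0, 0
    for p, y in zip(probs, labels):
        if y == IGNORE_LABEL:
            continue  # ignored pixels contribute neither loss nor gradient
        total += -math.log(p[y])  # -log probability of the true class
        n_valid += 1
    return total / n_valid

# Toy example: two valid pixels and one ignored pixel (label 255).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 255]
loss = masked_cross_entropy(probs, labels)  # averaged over the 2 valid pixels
```

Because the third pixel carries the ignore label, the average is taken over two pixels only, exactly as described above.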
In practice, the loss function is implemented as

$$\mathcal{L} = \mathrm{CrossEntropy}(P, Y),$$

where $P$ and $Y$ denote the model’s predicted probability maps and the ground truth label map, respectively.
Overall, the cross-entropy loss directly enforces pixel-level classification accuracy. When combined with the proposed GLAConv module and Skip-Guided Attention (SGA) mechanism, it enables the network to maintain strong overall performance while placing greater emphasis on boundary regions and fine-grained structures in complex scenes, thereby enhancing the overall segmentation accuracy for high-resolution remote sensing imagery.
All models are trained for 200 epochs using the Adam optimizer. The learning rate follows a cosine annealing schedule, decaying from its initial value to a fixed minimum. A batch size of 16 is used throughout training. Input images are resized to the respective network input sizes for the GID and WHDLD datasets. Random horizontal flipping is applied as the only data augmentation strategy. The loss function is standard cross-entropy, with label value 255 (denoting invalid or unlabeled regions in GID) ignored during backpropagation. All experiments are conducted on an NVIDIA GeForce RTX 4080 Super GPU using PyTorch 1.12.1. To ensure a fair comparison, identical training protocols are applied to all baseline methods.
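As a sketch of how such a schedule behaves, the following pure-Python function implements standard cosine annealing. The initial and minimum learning rates used here (1e-3 and 1e-6) are illustrative placeholders, not the paper's actual settings:

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min):
    """Learning rate at a given (0-indexed) epoch under cosine annealing."""
    cos_factor = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return lr_min + (lr_init - lr_min) * cos_factor

# Illustrative settings only: 200 epochs, decaying from 1e-3 toward 1e-6.
schedule = [cosine_annealing_lr(e, 200, 1e-3, 1e-6) for e in range(200)]
```

The schedule starts at the initial rate and decreases monotonically along a half-cosine, reaching the minimum at the final epoch.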
3.2. Datasets and Evaluation Metrics
3.2.1. Datasets
We conduct experiments on two benchmark datasets—the Wuhan Dense Labeling Dataset (WHDLD) [38] and the Gaofen Image Dataset (GID)—with representative samples shown in Figure 5 and Figure 6.
The WHDLD is a publicly available dataset specifically designed for semantic segmentation of high-resolution urban scenes. It comprises imagery acquired from the Gaofen-1 and ZY-3 satellites over the Wuhan metropolitan area in China. The original multispectral and panchromatic bands were fused and resampled to produce standardized RGB products with a spatial resolution of 2 m per pixel. The dataset contains 4940 remote sensing images of uniform size, stored in true-color RGB format to preserve natural spectral characteristics. Each image is paired with a pixel-aligned ground truth label map, where every pixel is precisely annotated into one of six semantic classes: bare soil, building, pavement, vegetation, road, and water.
The GID used in this study is derived from Gaofen-2 satellite imagery and consists of 150 large-scale RGB images covering 60 cities across China. Each original image has a ground sampling distance of 4 m and corresponds to approximately 506 km² of geographic coverage, thereby providing rich land cover information for fine-grained land use classification [39]. To facilitate model training and evaluation, all images were uniformly cropped into non-overlapping patches. GID employs a systematic and fine-grained annotation scheme encompassing 15 land cover categories with clear semantic meanings, namely, industrial land, urban residential, rural residential, transportation land, paddy field, irrigated farmland, dry cropland, orchard, arbor forest, shrubland, natural grassland, artificial grassland, river, lake, and pond. This multi-class, high-resolution labeling framework enables comprehensive modeling of complex land cover patterns.
To enhance training stability and generalization capability and to align with the input requirements of the proposed network, we perform systematic data preprocessing on both the WHDLD and GID datasets.
For the GID dataset, the original patches are further resized to the network input size to improve training efficiency, while the WHDLD images retain their native resolution. Given the differences in label formats between the two datasets, we apply dataset-specific normalization procedures: in GID, the ignore label value (255) in the single-channel ground truth maps is remapped to class index 14 (the 15th class) to prevent interference during training; for WHDLD, RGB pseudo-color label maps are converted into categorical index maps via a predefined color-to-class mapping, corresponding to six land cover classes—bare soil, building, pavement, road, vegetation, and water.
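The two label-normalization steps can be illustrated as follows. Note that the RGB color table below is hypothetical, chosen only to make the example self-contained; the real WHDLD color-to-class mapping is defined by the dataset release:

```python
# Hypothetical RGB color table for illustration only -- the actual WHDLD
# color-to-class mapping is defined by the dataset release.
WHDLD_COLOR_TO_INDEX = {
    (128, 128, 128): 0,  # bare soil
    (255, 0, 0):     1,  # building
    (192, 192, 0):   2,  # pavement
    (255, 255, 0):   3,  # road
    (0, 255, 0):     4,  # vegetation
    (0, 0, 255):     5,  # water
}

def remap_gid_labels(label_map, ignore_value=255, target_index=14):
    """GID: remap the ignore value in a single-channel label map to class 14."""
    return [[target_index if v == ignore_value else v for v in row]
            for row in label_map]

def rgb_to_index(rgb_label_map, color_table):
    """WHDLD: convert an RGB pseudo-color label map to categorical indices."""
    return [[color_table[tuple(px)] for px in row] for row in rgb_label_map]
```

In practice these operations are applied once, offline, so that both datasets feed the network single-channel integer index maps.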
Due to the extremely large spatial dimensions of high-resolution remote sensing images, a cropping strategy is employed during training to fit the data into GPU memory. It is important to note that cropping is only a training strategy for computational feasibility, rather than a replacement for high-resolution modeling. The proposed GS-USTNet is specifically designed to handle the intrinsic characteristics of high-resolution imagery, such as large spatial coverage and complex contextual relationships, which remain present within each cropped region.
Subsequently, a series of data augmentation techniques are applied exclusively to the training set, including random horizontal flipping, random rotation, color jittering, and Gaussian noise injection, to improve the model’s robustness to varying imaging conditions and scene appearances. All input RGB images are normalized by scaling pixel values to the range [0, 1], which accelerates convergence and enhances numerical stability during optimization. Finally, both datasets are split into training, validation, and test subsets at an approximate ratio of 8:1:1. The detailed partition statistics are summarized in Table 1.
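An 8:1:1 partition of this kind can be sketched as follows; the fixed seed and shuffling policy are illustrative assumptions, not the paper's exact protocol:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and partition a list of samples into train/val/test subsets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# WHDLD contains 4940 images; an 8:1:1 split yields 3952/494/494 samples.
train, val, test = split_dataset(range(4940))
```

Assigning the remainder to the test set guarantees every sample lands in exactly one subset.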
3.2.2. Evaluation Metrics
To comprehensively evaluate the performance of GS-USTNet on remote sensing image semantic segmentation, we adopt four standard metrics: overall accuracy (OA), Mean Accuracy (MA), Mean Intersection over Union (mIoU), and mean F1-score. All metrics are computed on the test set. The confusion matrix for a binary classification case is illustrated in Table 2, where TP, FN, FP, and TN denote true positives, false negatives, false positives, and true negatives, respectively. These quantities form the basis for computing the aforementioned evaluation metrics.
Overall accuracy (OA) measures the proportion of correctly classified pixels across all classes:

$$\mathrm{OA} = \frac{\sum_{c=1}^{C} \mathrm{TP}_c}{N},$$

where $\mathrm{TP}_c$ denotes the number of true positive pixels for class $c$, $C$ is the total number of classes, and $N$ is the total number of valid pixels. While OA provides an intuitive measure of global segmentation performance, it is sensitive to class imbalance and can be dominated by majority classes.
Mean Accuracy (MA) mitigates this bias by first computing the per-class accuracy and then averaging across all classes:

$$\mathrm{MA} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c},$$

where $\mathrm{FN}_c$ is the number of false negatives for class $c$. MA offers a more balanced assessment of model performance across diverse land cover categories.
Mean Intersection over Union (mIoU) is one of the most widely used metrics in semantic segmentation, quantifying the spatial overlap between predictions and ground truth:

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

where $\mathrm{FP}_c$ denotes the number of false positives for class $c$.
The F1-score for class $c$ combines precision and recall into a single harmonic mean:

$$F1_c = \frac{2 \cdot \mathrm{Precision}_c \cdot \mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c} = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

where $\mathrm{Precision}_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FP}_c)$ and $\mathrm{Recall}_c = \mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FN}_c)$; the mean F1-score is obtained by averaging $F1_c$ over all $C$ classes.
By integrating these complementary metrics, our evaluation framework assesses GS-USTNet from multiple perspectives—global correctness, per-class fairness, and region-wise overlap—ensuring a rigorous, objective, and convincing validation of its segmentation capability in complex remote sensing scenarios.
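All four metrics can be derived from a single confusion matrix. The helper below is a minimal pure-Python sketch (it assumes every class appears at least once, so no zero-division guard is included):

```python
def segmentation_metrics(cm):
    """OA, MA, mIoU and mean F1 from a C x C confusion matrix, where
    cm[t][p] counts pixels of true class t predicted as class p.
    Assumes every class occurs at least once (no zero-division guard)."""
    C = len(cm)
    n = sum(sum(row) for row in cm)                     # total valid pixels
    tp = [cm[c][c] for c in range(C)]                   # diagonal entries
    fn = [sum(cm[c]) - cm[c][c] for c in range(C)]      # missed pixels per class
    fp = [sum(cm[r][c] for r in range(C)) - cm[c][c] for c in range(C)]

    oa = sum(tp) / n
    ma = sum(tp[c] / (tp[c] + fn[c]) for c in range(C)) / C
    miou = sum(tp[c] / (tp[c] + fp[c] + fn[c]) for c in range(C)) / C
    mf1 = sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in range(C)) / C
    return oa, ma, miou, mf1

# Toy 2-class confusion matrix: rows are ground truth, columns are predictions.
oa, ma, miou, mf1 = segmentation_metrics([[8, 2], [1, 9]])
```

Running the four formulas off one matrix keeps the metrics mutually consistent, since they all share the same TP, FP, and FN counts.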
3.3. Comparative Experiments
3.3.1. Quantitative Evaluation
To validate the effectiveness of the proposed GS-USTNet for remote sensing image semantic segmentation, we conduct quantitative comparisons with a variety of state-of-the-art methods on two representative datasets: the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID). The compared methods include U-Net [21], U-Net++ [22], DeepLabV3+ [40], PSPNet [23], CE-Net [28], FGC [41], DATUNet [42], MACUNet [34], and the original USTNet [35]. The evaluation metrics include the number of model parameters (Params), overall accuracy (OA), Average Accuracy (AA), F1-score, and Mean Intersection over Union (mIoU). The selected comparison methods include several representative and widely adopted classical semantic segmentation architectures, such as FCN-based, encoder–decoder, and attention-enhanced frameworks. These models are commonly used as benchmark baselines in remote sensing semantic segmentation and provide a fair and stable reference for evaluating architectural improvements. Although more recent Transformer-based models have been proposed, our primary goal is to demonstrate the effectiveness of the proposed global–local adaptive mechanism compared with well-established segmentation paradigms. Future work will include comparisons with more recent large-scale Transformer-based architectures. The results are summarized in Table 3 and Table 4, where the best performance in each column is underlined.
As shown in Table 3, GS-USTNet achieves the best overall performance on WHDLD, consistently outperforming all competitors across all four core metrics. Specifically, GS-USTNet attains an OA of 86.31%, which represents a significant improvement of 2.88 percentage points over the baseline USTNet. Moreover, its AA, F1-score, and mIoU reach 75.00%, 76.57%, and 64.02%, respectively—ranking first among all methods.
Compared to classical architectures such as U-Net and its variants (U-Net++, CE-Net), GS-USTNet improves mIoU by 5.11%, 2.64%, and 3.14%, respectively. This demonstrates that the proposed Global–Local Adaptive Convolution (GLAConv) module effectively enhances multi-scale feature representation, while the Skip-Guided Attention (SGA) mechanism provides superior boundary delineation in complex urban scenes. Notably, despite DATUNet’s significantly larger model size (82.43 M parameters), its segmentation performance remains inferior to GS-USTNet. In contrast, GS-USTNet achieves better results with only 13.75 M parameters, indicating an excellent trade-off between accuracy and computational complexity. These results confirm the strong feature modeling capability and robustness of GS-USTNet in densely labeled remote sensing scenarios.
On the more challenging GID dataset, GS-USTNet again demonstrates superior performance, as shown in Table 4. It achieves the highest scores across all four metrics, with an OA of 87.89% and an mIoU of 65.96%, representing improvements of 4.93 and 11.54 percentage points over the original USTNet, respectively. The GID dataset contains multiple land cover classes with severe class imbalance, posing a greater challenge to model generalization. The significantly higher AA (67.32%) of GS-USTNet indicates that the SGA mechanism effectively guides the decoder to reconstruct features for minority classes, thereby improving per-class fairness without compromising overall accuracy.

Furthermore, compared to canonical segmentation models such as DeepLabV3+ and PSPNet, GS-USTNet improves mIoU by 7.31% and 15.26%, respectively, further validating its strong adaptability to multi-scale objects and irregular boundaries in high-resolution remote sensing imagery.

In summary, the consistent superiority of GS-USTNet across both WHDLD and GID datasets highlights its robustness in diverse remote sensing scenarios. This success can be attributed to two key innovations: (1) the GLAConv module dynamically integrates global contextual information with local receptive responses, enhancing spatial structure modeling; and (2) the Skip-Guided Attention mechanism alleviates attention dispersion in U-Net-like architectures under complex backgrounds, significantly improving boundary fidelity and fine-grained object segmentation. With a moderate model size (13.75 M parameters), GS-USTNet achieves state-of-the-art performance across multiple core metrics, fully demonstrating its effectiveness and practical value for remote sensing image semantic segmentation.
3.3.2. Qualitative Analysis
To further evaluate the practical performance of GS-USTNet in remote sensing image semantic segmentation, we conduct a qualitative comparison on representative regions from the Wuhan Dense Labeling Dataset (WHDLD), as shown in Figure 7. Each color corresponds to a specific land cover class: gray denotes bare soil, red represents buildings, olive yellow indicates paved areas, yellow signifies roads, green stands for vegetation, and blue marks water bodies.
From the overall segmentation results, GS-USTNet demonstrates a strong ability to distinguish between different land cover types in complex scenes with multiple coexisting classes. Its predictions exhibit high spatial consistency with the ground truth annotations. Notably, in regions where buildings are closely adjacent to other land covers, the model accurately delineates red building areas and effectively suppresses leakage into neighboring vegetation (green) or paved areas (olive yellow), thereby significantly reducing inter-class confusion.

For linear or boundary-sharp objects with large-scale variations—such as roads and water bodies—GS-USTNet also shows superior structural integrity. The visual results reveal that road regions (yellow) maintain good connectivity with notably fewer fragmentation artifacts. Water bodies (blue) exhibit complete contours and sharp boundaries against surrounding vegetation or bare soil, indicating robust fine-grained structure modeling and boundary discrimination capabilities.

Moreover, for spectrally similar classes such as paved areas and bare soil, several baseline methods suffer from noticeable misclassifications, whereas GS-USTNet stably distinguishes gray bare soil from olive yellow paved regions, leading to improved regional completeness and semantic coherence. This observation suggests that the integration of the Global–Local Adaptive Convolution (GLAConv) module enhances the model’s discriminative power by effectively fusing contextual information with local textural cues.
In summary, the visual results on WHDLD provide compelling qualitative evidence of GS-USTNet’s advantages in complex remote sensing scenarios. The model not only preserves global structural fidelity but also achieves finer discrimination at multi-class boundaries, offering intuitive support for the quantitative improvements reported earlier.
To further validate the generalization capability of GS-USTNet under diverse and complex land cover conditions, we also perform qualitative comparisons on selected representative regions from the Gaofen Image Dataset (GID), as illustrated in Figure 8. The GID dataset features rich land cover types, fine-grained semantic categories, and intricate spatial distributions, posing significant challenges to multi-class discrimination and boundary delineation. The color coding is as follows: red—industrial land; magenta—urban residential; light brown—rural residential; pink—transportation land; dark green—paddy fields; light green—irrigated farmland; gray-green—dry cropland; purple—orchards; dark purple—broadleaf forest; light purple—shrubland; yellow—natural grassland; olive yellow—artificial grassland; dark blue—rivers; cyan—lakes; bright blue—ponds. In general, GS-USTNet’s predictions on GID align closely with the ground truth in both the spatial structure and semantic distribution. In densely built-up areas such as industrial zones and urban residential regions, the model clearly separates different types of constructed land, with sharp boundaries between red and magenta regions and minimal inter-class confusion. Compared to baseline methods, GS-USTNet exhibits stronger discriminability between spectrally similar classes like urban and rural residential areas.
For agricultural and vegetation-related categories, the model maintains excellent spatial continuity across paddy fields, irrigated farmland, and dry cropland. The green-shaded regions appear structurally coherent without excessive fragmentation or over-smoothing. In areas where orchards, broadleaf forests, and shrublands intermingle, GS-USTNet successfully preserves distinct boundaries between these vegetation subtypes, highlighting its strength in modeling complex vegetative structures. Regarding water bodies, the model produces stable predictions across large-scale regions: rivers (dark blue), lakes (cyan), and ponds (bright blue) all exhibit clear contours and natural transitions at interfaces with surrounding farmland or built-up areas. Even in narrow river channels or small water bodies, GS-USTNet maintains better connectivity and reduces fragmentation or misclassification compared to competing methods.
In conclusion, although the visualization includes only a subset of compared models, the representative samples from GID clearly demonstrate that GS-USTNet achieves superior semantic consistency and spatial structural integrity in highly complex, multi-class land cover scenarios. These qualitative findings are fully consistent with the quantitative improvements observed in OA, AA, F1-score, and mIoU, further confirming the effectiveness and robustness of GS-USTNet for semantic segmentation of high-resolution remote sensing imagery.
3.3.3. Computational Complexity Analysis
To quantitatively substantiate our claim that the proposed GS-USTNet maintains high computational efficiency, we conduct a comparative analysis of model complexity in terms of the number of trainable parameters (Params) and floating-point operations (FLOPs). All models are evaluated on the GID dataset. The results are summarized in Table 5.
As shown in Table 5, the introduction of the GLAConv and SGA modules increases the model’s parameter count from 8.47 M (USTNet) to 13.75 M. However, this increase in capacity comes with only a modest rise in computational cost, as the FLOPs grow from 3.3099 G to 4.8011 G. This demonstrates that our architectural enhancements are highly efficient; the significant performance gains reported in Section 3.3 (e.g., +4.93% OA and +11.54% mIoU over USTNet on GID) are achieved without imposing a substantial burden on inference speed or hardware resources. The results confirm that GS-USTNet offers an excellent trade-off between accuracy and efficiency, making it well suited for practical remote sensing applications.
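As a rough illustration of how parameter and FLOP figures of this magnitude arise, the helper below estimates the cost of a single convolutional layer. It is a back-of-envelope sketch, not the profiling tool used in this work, and FLOP-counting conventions (e.g., whether one multiply-accumulate counts as one or two FLOPs) differ between tools:

```python
def conv2d_cost(c_in, c_out, k, h, w, bias=True):
    """Parameters and multiply-accumulate FLOPs of one k x k convolution
    on a c_in x h x w input with stride 1 and 'same' padding."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    flops = c_out * c_in * k * k * h * w  # one MAC per weight per output pixel
    return params, flops

# A single 3x3, 64->128 convolution on a 256x256 feature map already costs
# on the order of a few GFLOPs, which is why whole-network budgets sit in
# the single-digit-GFLOP range at these input sizes.
params, flops = conv2d_cost(64, 128, 3, 256, 256)
```

Summing such per-layer estimates over a network is the standard way lightweight profilers arrive at totals like those in Table 5.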
3.4. Ablation Study
To systematically validate the effectiveness and rationality of the key components in GS-USTNet, we conduct a series of ablation experiments on the WHDLD dataset. The study primarily focuses on two proposed modules: the Global–Local Adaptive Convolution (GLAConv) and the Skip-Guided Attention (SGA) mechanism. By incrementally integrating these components into the baseline architecture, we analyze their individual and combined contributions to the overall segmentation performance.
Specifically, we construct four model variants:
- 1. USTNet: the original baseline without either proposed module;
- 2. USTNet+GLAC: USTNet equipped with only the GLAConv module;
- 3. USTNet+SGA: USTNet enhanced with only the SGA mechanism;
- 4. GS-USTNet: the full model incorporating both GLAConv and SGA.
All variants are trained under identical optimization settings and evaluated using the same metrics: number of parameters (Params), overall accuracy (OA), Average Accuracy (AA), F1-score, and Mean Intersection over Union (mIoU). The results are summarized in Table 6.
Ablation 1: Effect of GLAConv. To assess the impact of GLAConv on feature representation, we integrate it into the encoder–decoder backbone of USTNet, yielding USTNet+GLAC. This module dynamically fuses global contextual cues with local receptive responses to generate content-aware convolutional weights, thereby enhancing feature discriminability. As shown in Table 6, USTNet+GLAC achieves marginal improvements in F1-score (72.39%) and mIoU (60.23%) compared to the baseline, indicating that GLAConv contributes to better intra-class consistency and boundary delineation. However, its gains in OA and AA are limited, suggesting that adaptive convolution alone is insufficient to guide the decoder toward salient regions effectively. This implies that GLAConv’s benefits are best realized when coupled with a higher-level attention mechanism.
Ablation 2: Effect of SGA. We then evaluate the Skip-Guided Attention mechanism by constructing USTNet+SGA. SGA leverages skip connections between the encoder and decoder to apply joint spatial–channel attention during feature reconstruction, mitigating attention dispersion in complex backgrounds—a common issue in U-Net-like architectures.
The results show significant performance gains: OA increases to 85.56% and mIoU to 63.49%, representing improvements of 2.13 and 3.41 percentage points over the baseline, respectively. Notably, the substantial gains in AA and F1-score confirm that SGA enhances per-class fairness and refines fine-grained structures, particularly at object boundaries. This validates the critical role of SGA in remote sensing semantic segmentation.
Ablation 3: Synergistic Effect of GLAConv and SGA. Finally, we combine both modules to form the complete GS-USTNet. As reported in Table 6, GS-USTNet achieves the best performance across all metrics: OA = 86.31%, AA = 75.00%, F1-score = 76.57%, and mIoU = 64.02%, outperforming the baseline by 2.88, 3.62, 4.34, and 3.94 percentage points, respectively. Moreover, it consistently surpasses the single-module variants, demonstrating strong complementarity between GLAConv (feature-level modeling) and SGA (attention-guided decoding).
Although the full model incurs a moderate increase in parameter count (from 8.47 M to 13.75 M), the performance gain is substantial, reflecting a favorable trade-off between accuracy and complexity.
In summary, the ablation study quantitatively confirms the individual efficacy and synergistic interaction of GLAConv and SGA. GLAConv enriches feature representation by adaptively integrating global context and local details during encoding, while SGA refines decoding through guided attention, improving both class-wise discrimination and boundary fidelity. Their combination yields consistent and stable improvements, underscoring the rationality of the proposed architecture and its advantage in global–local collaborative modeling for complex remote sensing scenes.
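The four metrics used throughout the ablation (OA, AA, F1-score, mIoU) all derive from the class confusion matrix via their standard definitions, which can be computed as follows (macro-averaged F1; the sketch assumes every class occurs at least once in both ground truth and predictions, so no division by zero arises):

```python
import numpy as np

def segmentation_metrics(cm):
    """OA, AA, macro F1, and mIoU from a KxK confusion matrix cm,
    where cm[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    gt = cm.sum(axis=1).astype(float)    # ground-truth pixels per class
    pred = cm.sum(axis=0).astype(float)  # predicted pixels per class
    oa = tp.sum() / cm.sum()             # overall accuracy
    recall = tp / gt
    precision = tp / pred
    aa = recall.mean()                   # average (per-class) accuracy
    f1 = (2 * precision * recall / (precision + recall)).mean()
    iou = tp / (gt + pred - tp)          # per-class intersection over union
    return oa, aa, f1, iou.mean()
```

For example, a two-class matrix `[[3, 1], [1, 3]]` gives OA = AA = F1 = 0.75 and mIoU = 0.6, illustrating why mIoU is typically the strictest of the four.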
4. Discussion
This work presents a comprehensive experimental evaluation of GS-USTNet on two representative remote sensing benchmarks: the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID). The evaluation encompasses quantitative comparisons, ablation studies, and qualitative visual analysis, providing multi-faceted evidence of the model’s effectiveness and architectural soundness.
In quantitative comparisons, GS-USTNet consistently outperforms a wide range of state-of-the-art methods—including classical architectures (U-Net, DeepLabV3+, PSPNet), lightweight designs (FGC, MACUNet), and recent advances (DATUNet, USTNet)—across all core metrics (OA, AA, F1-score, mIoU) on both datasets. Notably, the performance margin is more pronounced on GID, a dataset characterized by high inter-class similarity, severe class imbalance, and large-scale variations. This highlights GS-USTNet’s superior generalization capability and robustness in complex land cover scenarios.
The ablation study further corroborates the design rationale. GLAConv provides stable performance gains with negligible parameter overhead, while SGA significantly boosts both overall accuracy and per-class fairness. Their integration yields the best results, confirming a synergistic relationship between adaptive feature modeling and attention-guided decoding.
Qualitative analysis reinforces these findings. Across diverse scenes—ranging from dense urban areas to intricate agricultural–forestry–water systems—GS-USTNet produces segmentation maps with high semantic coherence, sharp boundaries, and minimal misclassification. The visual fidelity aligns closely with ground truth annotations and supports the quantitative trends, thereby enhancing the credibility of our conclusions.
Collectively, the experimental results demonstrate that GS-USTNet achieves a compelling balance between efficiency and accuracy. The proposed global–local collaborative modeling strategy, coupled with the Skip-Guided Attention mechanism, offers a practical and effective solution for fine-grained semantic segmentation in complex remote sensing imagery. This work not only advances the state of the art but also provides a solid foundation for future research and real-world applications in geospatial intelligence.
5. Conclusions
To address key challenges in remote sensing image semantic segmentation—such as significant intra-class variation, ambiguous object boundaries, and the coexistence of multi-scale land cover objects—we propose an enhanced segmentation model, GS-USTNet, built upon the USTNet framework. The proposed architecture integrates two core components: a Global–Local Adaptive Convolution (GLAConv) module and a Skip-Guided Attention (SGA) mechanism. GLAConv strengthens global contextual modeling during feature extraction by dynamically fusing local and global cues, while SGA enhances discriminative capability at critical regions during the decoding phase through attention guidance derived from skip connections. Together, these modules significantly improve segmentation accuracy and result stability in complex remote sensing scenes. Extensive experiments on two representative benchmarks—the Wuhan Dense Labeling Dataset (WHDLD) and the Gaofen Image Dataset (GID)—demonstrate that GS-USTNet consistently outperforms a wide range of state-of-the-art and recently proposed methods across multiple evaluation metrics, including overall accuracy (OA), average accuracy (AA), F1-score, and mean intersection over union (mIoU). Ablation studies further validate the individual contributions and synergistic effects of GLAConv and SGA, while qualitative visual comparisons intuitively illustrate the model’s superior performance in boundary preservation, fine-detail recovery, and inter-class discrimination.
Collectively, both quantitative and qualitative results confirm that GS-USTNet exhibits strong generalization capability and robustness in scenarios characterized by complex land cover distributions and diverse semantic categories. Nevertheless, certain limitations remain. First, the model size of GS-USTNet is larger than that of the original USTNet, which may hinder deployment in resource-constrained environments; further optimization for efficiency is therefore warranted. Second, the current study focuses exclusively on optical remote sensing imagery and does not yet exploit the complementary potential of multi-source data, such as Synthetic Aperture Radar (SAR) imagery, Digital Surface Models (DSMs), or multi-temporal sequences.
Future work will focus on three directions: (1) lightweight architectural design to reduce computational overhead; (2) multi-modal fusion strategies for joint modeling of heterogeneous geospatial data; and (3) enhanced cross-region generalization to improve real-world applicability. These efforts aim to broaden the practical utility and scalability of GS-USTNet in operational remote sensing applications. In summary, GS-USTNet delivers competitive and stable performance relative to representative baseline models; although the improvements on certain metrics are moderate, the consistent gains across multiple datasets validate the effectiveness of integrating Global–Local Adaptive Convolution and Skip-Guided Attention. Further validation against more recent large-scale segmentation frameworks will be pursued in future research to more comprehensively assess the generalization capability of the proposed approach.