Article

GC2F-Net: A Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network for Remote Sensing Semantic Segmentation

1 School of Artificial Intelligence and Software, Liaoning Petrochemical University, Fushun 113001, China
2 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(10), 1600; https://doi.org/10.3390/rs18101600
Submission received: 19 March 2026 / Revised: 9 May 2026 / Accepted: 13 May 2026 / Published: 16 May 2026

Highlights

What are the main findings?
  • Combining category prior information with frequency-assisted decoding effectively balances global semantic modeling and local detail.
  • The proposed GC2F-Net achieves competitive and robust performance on multiple high-resolution remote sensing benchmark datasets, with significant advantages in scenes containing small objects and ambiguous boundaries.
What are the implications of the main findings?
  • The proposed framework improves global semantic consistency and spatial-frequency collaborative representation for remote sensing segmentation.
  • The framework is promising for practical applications requiring structural integrity and fine-grained delineation, such as land-cover interpretation, urban mapping, and traffic monitoring.

Abstract

Semantic segmentation of high-resolution remote sensing images is an important foundation for urban mapping and land-cover interpretation. However, objects in remote sensing scenes usually exhibit large variations in scale, significant intra-class differences, and complex background interference. As a result, existing methods still suffer from insufficient global semantic modeling, boundary blurring, and small-object omission in complex high-resolution scenes. To address these challenges, this paper proposes a Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network (GC2F-Net). Specifically, ResNet-50 is adopted as the encoder, and a Global Category-Center Module is utilized to generate a global category-center prior based on deep features, which is then combined with a Fourier Global Enhancement Module to enhance deep features in the frequency domain. During the decoding stage, a Local Category-Aware Frequency Attention Module is employed to progressively refine feature representations under the guidance of the global category-center prior, thereby jointly improving global semantic consistency and local detail recovery. Experimental results demonstrate that GC2F-Net achieves robust and competitive segmentation performance on multiple public remote sensing semantic segmentation datasets. The proposed method provides an effective spatial-frequency collaborative modeling paradigm for the semantic segmentation of high-resolution remote sensing images.

1. Introduction

In recent years, advances in remote sensing technology have enabled the efficient acquisition of high-resolution images from multiple platforms, such as satellites and unmanned aerial vehicles (UAVs). These images provide critical data for ground object recognition and scene understanding [1,2]. High-resolution remote sensing images contain rich fine-grained texture and structural information, making them well suited for ground object classification and mapping. Consequently, they have been widely used in land resource management [3], urban mapping [4], traffic monitoring [5], and natural disaster assessment [6]. Remote sensing image semantic segmentation aims to perform pixel-level annotation and fine-grained delineation of ground object categories, thereby supporting the interpretation and application of complex surface information [7].
However, early semantic segmentation methods were predominantly designed for natural images or low-resolution remote sensing images. As a result, they often failed to fully exploit the complex spatial structures and detailed features of high-resolution remote sensing images. Although recent methods have significantly improved segmentation accuracy, semantic segmentation of high-resolution remote sensing images remains challenging. This is because these images often contain large variations in object scale, significant intra-class differences, complex textures, and fine-grained boundaries. These characteristics make it difficult to accurately distinguish small objects, preserve object boundaries, and maintain category consistency in complex scenes [8]. Therefore, effectively exploiting high-resolution spatial information for accurate ground object and scene segmentation remains an important problem.
More specifically, semantic segmentation of complex remote sensing scenes still faces three major challenges. First, targets in high-resolution remote sensing images exhibit large variations in object scale and significant intra-class differences. Moreover, small objects and fine structures strongly depend on local details for accurate recognition. Existing methods may weaken local discriminability when global contextual information is introduced, thereby leading to blurred boundaries and the omission of small objects. Second, frequency-domain information is beneficial for enhancing texture and structural representations. However, existing frequency-domain enhancement strategies mostly rely on global filtering or fixed-band modulation, and they lack adaptive control mechanisms tailored to the semantic requirements of different categories. This may amplify noisy textures or introduce background interference while enhancing details. Finally, multiscale feature fusion during decoding is commonly used to progressively restore spatial resolution and detailed information. However, existing fusion strategies often rely on convolution or pixel-level attention without explicit category-level constraints. As a result, they are vulnerable to background distraction, which may impair category consistency and cross-scene generalization.
To address the above challenges, this paper proposes a Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network (GC2F-Net) for the semantic segmentation of high-resolution remote sensing images. The proposed method integrates category prior modeling, frequency-domain enhancement, and category-aware decoding optimization within a unified framework to improve global semantic consistency and local detail recovery. Experimental results on three representative datasets demonstrate the effectiveness, robustness, and generalization ability of GC2F-Net. The main contributions of this paper are summarized as follows:
1. GC2F-Net is proposed for semantic segmentation of high-resolution remote sensing images. Within a unified framework, category prior modeling, frequency-domain enhancement, and category-aware optimization in the decoding stage are integrated to achieve collaborative modeling of global semantics and local details.
2. A Global Category-Center Module (GCCM) and a Fourier Global Enhancement Module (FGEM) are designed to strengthen global structural representation and texture-detail modeling. GCCM generates an image-adaptive global category-center prior to provide explicit category-level guidance, while FGEM introduces adaptive frequency-domain enhancement to complement spatial features with richer structural and detail information.
3. A Local Category-Aware Frequency Attention Module (LCFA) is proposed to refine decoding features under the guidance of the category prior. By incorporating category-aware interactions and frequency-aware recalibration, LCFA enhances the discrimination and recovery of fine-grained structures, including object boundaries and small objects.

2. Related Work

This section reviews related studies on traditional feature-based methods, CNN-based methods, Transformer-based methods, and frequency-domain methods. It further discusses their limitations to clarify the motivation and positioning of GC2F-Net.

2.1. Traditional Remote Sensing Segmentation Methods

Early semantic segmentation methods for remote sensing images mainly relied on handcrafted features and traditional classifiers [9,10]. However, these methods have limited capability in feature representation and global context modeling. As a result, they often struggle to achieve robust segmentation results in high-resolution remote sensing images with complex textures and fine-grained boundaries. Consequently, research has gradually shifted toward deep learning-based methods.

2.2. CNN-Based Methods

In recent years, the rapid development of deep learning techniques has greatly promoted semantic segmentation [11]. Methods based on convolutional neural networks (CNNs), owing to their powerful feature extraction and representation capabilities, have become the mainstream paradigm in the field of semantic segmentation [12]. Among early semantic segmentation methods, FCN replaced fully connected layers with convolutional layers and pioneered end-to-end pixel-level prediction, laying the foundation for subsequent semantic segmentation models [13], while U-Net adopted an encoder–decoder architecture with skip connections, effectively integrating shallow detailed information with deep semantic information and achieving favorable performance in various segmentation tasks [14]. Subsequently, PSPNet [15] and the DeepLab series [16] constructed multiscale contextual representations by introducing pyramid pooling and dilated convolution mechanisms, which improved segmentation accuracy in scenarios with scale variations. Meanwhile, FPN [17] achieved multiscale feature fusion through a top-down feature pyramid architecture and lateral connections. HRNet [18] maintained high-resolution representations and performed multiscale feature interaction, thereby preserving local detail information while improving contextual fusion. In terms of global context modeling, DANet [19] captured long-range dependencies through position attention and channel attention mechanisms, while OCRNet [20] enhanced the semantic consistency of regions belonging to the same category by exploiting object-level regional representations.
Overall, CNN-based methods demonstrate strong advantages in local detail modeling, multiscale feature fusion, and high-resolution representation learning. However, their context modeling capacity is still constrained by the locality of convolution operations. As a result, they may fail to maintain category-level semantic consistency in large and complex remote sensing scenes, especially when ground objects exhibit substantial scale variations, strong intra-class differences, or complex background interference.

2.3. Transformer-Based Methods

To address the limitations of CNNs in global context modeling, researchers introduced the Transformer [21] into computer vision tasks, leading to remarkable progress in image classification and semantic segmentation. By using self-attention to model relationships between arbitrary positions, the Transformer can effectively capture global contextual information. ViT [22] first adopted a pure Transformer architecture as an image feature extractor, providing a new paradigm for global context modeling in vision tasks. Swin Transformer [23] employed a hierarchical architecture and shifted window self-attention, thereby achieving a balance between multiscale feature representation and global context modeling while keeping computational complexity under control. However, in high-resolution scenes, affected by the patch partitioning scheme and the global information aggregation mechanism, Transformer models still exhibit deficiencies in local detail representation and boundary detail delineation [24]. Therefore, an increasing number of studies have attempted to construct CNN–Transformer architectures to integrate the local detail representation ability of convolution operations with the global context modeling capability of self-attention. For example, DAFormer [25] and SegFormer [26] enhanced global context modeling by introducing Transformer representations or adopting hybrid designs. Meanwhile, UNetFormer [27] embedded Transformer modules into a U-shaped encoder–decoder architecture, strengthening global context modeling while preserving local detailed information.
Overall, Transformer-based and CNN–Transformer methods improve long-range dependency modeling and global context aggregation. However, most of them still perform global–local interaction mainly in the spatial domain. In high-resolution remote sensing scenes with complex textures, ambiguous boundaries, and small objects, spatial-domain fusion alone may be insufficient to recover fine structures. Moreover, these methods usually lack explicit category-level priors to guide the decoding process, which may lead to category inconsistency in regions with strong background interference.

2.4. Frequency-Domain Methods

In recent years, researchers have explored the enhancement of deep feature representations from a frequency-domain perspective. By explicitly modeling information from different frequency bands, these studies have aimed to address the limitations of spatial-domain representations in describing complex textures and boundary details, and consequently have proposed various frequency-domain modeling methods for semantic segmentation [28]. A typical strategy is to decompose features into low-frequency and high-frequency components using the Fourier transform [29] and wavelet transform [30], followed by targeted modeling of information in different frequency bands. In general, low-frequency components usually correspond to global structural information and large-scale variations, whereas high-frequency components mainly characterize local details and edge information. On this basis, SFFNet [31] obtained low-frequency and high-frequency components through wavelet decomposition and designed cross-domain alignment and selection mechanisms to achieve spatial-frequency fusion, thereby enhancing the representation capability for regions with complex textures and fine-grained boundaries. FSDENet [32] combined the fast Fourier transform (FFT) and Haar wavelets within the U-Net framework to jointly model low-frequency structural information and high-frequency detailed information. This design improved global structural consistency and enhanced the representation of local details such as boundaries and textures. SF3Net [33] integrated frequency-domain enhancement with spatial-domain feature aggregation, thereby achieving a balance between segmentation accuracy and computational efficiency in high-resolution remote sensing image segmentation. In addition, GFNet [34] constructed a global filtering operator from a frequency-domain perspective, transforming global information interaction in the spatial domain into frequency-domain operations. This design enabled more efficient acquisition of global contextual information and enhanced the capability for long-range dependency modeling.
Overall, frequency-domain modeling can complement spatial-domain representations by enhancing texture, edge, and structural information. However, most existing frequency-domain strategies, such as GFNet [34], SFFNet [31], FSDENet [32], and SF3Net [33], mainly focus on global filtering, frequency decomposition, or spatial-frequency feature fusion. Although these designs are effective for improving structural representation, frequency enhancement is generally performed in a category-agnostic manner. That is, the same frequency modulation strategy is often applied to the entire image or feature map without explicitly considering the distinct semantic requirements of different categories. In complex remote sensing scenes, this may amplify irrelevant background textures or weaken category-specific boundary cues. Different from these methods, GC2F-Net explicitly constructs image-adaptive global category-center priors and uses them to guide frequency-aware decoding refinement. Therefore, GC2F-Net further emphasizes the coordinated use of category-level semantic priors, frequency-domain enhancement, and local decoding optimization within a unified framework.

3. Materials and Methods

To address the challenges in semantic segmentation of high-resolution remote sensing images, this paper proposes GC2F-Net. This section provides a systematic description of its overall architecture and key modules.

3.1. Overall Framework

The proposed GC2F-Net adopts an encoder–decoder architecture, and its overall framework is illustrated in Figure 1. In the encoding stage, ResNet-50 is employed as the backbone network to progressively extract hierarchical features through four residual stages, and a convolutional layer is introduced after each stage for channel alignment, thereby generating multiscale features for subsequent decoding and feature fusion. In the decoding stage, the GCCM is used to construct a category prior and provide category-level semantic constraints for subsequent decoding, while the FGEM performs frequency-domain enhancement on deep features to improve global structural representation and texture-detail modeling. Under the guidance of the category prior, the LCFA optimizes feature representations through local category-aware interaction and feature recalibration. Through the collaborative effects of the above modules, the model can effectively integrate global semantics and local details during the progressive restoration of spatial resolution. Intuitively, GCCM, FGEM, and LCFA are introduced to address category inconsistency, insufficient global structural representation, and detail degradation, respectively. In addition, skip connections are introduced between the encoder and decoder to concatenate features at corresponding scales along the channel dimension, thereby alleviating the loss of details caused by downsampling and enhancing the representation capability for target boundaries and structures.
Specifically, GC2F-Net employs the encoder to progressively extract multiscale features, which can be formulated as follows:
$$F_l = E_l(F_{l-1}), \quad l = 1, 2, 3, 4, \quad F_0 = I,$$
where $I \in \mathbb{R}^{C \times H \times W}$ denotes the input remote sensing image, and $C$, $H$, and $W$ represent the number of channels, height, and width, respectively. $F_l \in \mathbb{R}^{C \times H \times W}$ denotes the output feature of stage $l$ (with $C$, $H$, and $W$ here taking stage-dependent values), $E_l(\cdot)$ represents the mapping function of the corresponding residual stage, and $l$ is the stage index.
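To make the recursion concrete, the following sketch lists the feature shapes a standard ResNet-50 would produce. The function name `resnet50_stage_shapes` and the 512 × 512 input size are illustrative assumptions. The output strides (4, 8, 16, 32) and channel widths (256, 512, 1024, 2048) are those of the standard ResNet-50, taken before the channel-alignment convolutions, whose output widths are not specified here.

```python
def resnet50_stage_shapes(height, width):
    """Return (channels, height, width) of F_1..F_4 for a standard ResNet-50.

    Assumes the usual output strides 4, 8, 16, 32 and an input size that is
    divisible by 32. Channel counts are those before any channel alignment.
    """
    channels = (256, 512, 1024, 2048)
    strides = (4, 8, 16, 32)
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]


# Hypothetical 512 x 512 input tile.
shapes = resnet50_stage_shapes(512, 512)
for stage, shape in enumerate(shapes, start=1):
    print(f"F_{stage}: {shape}")
```

Under these assumptions the deepest feature $F_4$ sits at 1/32 of the input resolution, which matches the coarsest decoding scale used later.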
GC2F-Net constructs a category prior on the deepest features, and the overall mapping of GCCM is formulated as follows:
$$C_{\mathrm{gr}} = G(F_4),$$
where $G(\cdot)$ denotes the mapping function of GCCM, and $C_{\mathrm{gr}} \in \mathbb{R}^{K \times C}$ represents the global category-center matrix after frequency-guided channel recalibration, where $K$ denotes the number of classes and $C$ denotes the channel dimension of the category centers.
Meanwhile, the FGEM performs frequency-domain enhancement on the deep features and fuses them with the original features in a residual manner:
$$F_4^{\mathrm{fused}} = F_4 + \mathcal{F}_{\mathrm{fg}}(F_4),$$
where $\mathcal{F}_{\mathrm{fg}}(\cdot)$ denotes the nonlinear mapping of the FGEM, and $F_4^{\mathrm{fused}}$ represents the deep feature after frequency-domain enhancement and residual fusion.
On this basis, LCFA takes $F_4^{\mathrm{fused}}$ as input and introduces $C_{\mathrm{gr}}$ as the category prior at the two scales of $1/32$ and $1/16$ to perform local category-aware interaction and modeling on local features, thereby enhancing the discriminative responses related to target categories and suppressing background interference during the progressive restoration of spatial resolution. Subsequently, the decoder restores spatial details through progressive upsampling, skip connections, and convolution operations, and finally outputs the pixel-level classification probability map.

3.2. Global Category-Center Module

During the decoding process of semantic segmentation, although deep features possess strong semantic representation capability, their information is still mainly organized at the pixel level and lacks explicit category-level priors that can be used to constrain the decoding process. As a result, category semantics are difficult to propagate stably and effectively to subsequent stages. To provide more stable semantic guidance, GCCM is introduced on the deepest feature $F_4 \in \mathbb{R}^{C \times H \times W}$ to explicitly construct image-adaptive category-center priors. As illustrated in Figure 2, the GCCM mainly consists of two parts, the Dual-Scale Category-aware Aggregation Module and the Frequency-Domain Center Refinement Module. The former performs category-aware aggregation on deep features in the spatial domain based on coarse-grained predictions, whereas the latter combines frequency-domain statistics to conduct global recalibration on the category centers. In this way, GCCM can be intuitively regarded as a category-level summarization mechanism, which converts dense pixel features into compact category centers and helps reduce background-induced category confusion during subsequent decoding.
Specifically, to improve the stability of the global category centers, the GCCM adopts a dual-scale branch to downsample the deep features and the coarse predictions, which can be formulated as follows:
$$F^{(s)} = \mathrm{AvgPool}_s(F_4), \quad P^{(s)} = \mathrm{AvgPool}_s(P_{\mathrm{coarse}}), \quad s \in \{1, 2\},$$
where $P_{\mathrm{coarse}} \in \mathbb{R}^{K \times H \times W}$ denotes the coarse-grained prediction map obtained from $F_4$ through a $1 \times 1$ convolution, which is used to generate category-aware aggregation weights; $\mathrm{AvgPool}_s(\cdot)$ denotes the average pooling operation at scale $s$; $F^{(s)} \in \mathbb{R}^{C \times H_s \times W_s}$ and $P^{(s)} \in \mathbb{R}^{K \times H_s \times W_s}$ denote the feature map and the coarse-grained prediction map at scale $s$, respectively, where $s = 1$ denotes the original-scale branch and $s = 2$ denotes the downsampled branch; $C$ denotes the number of channels; and $K$ denotes the number of categories.
After obtaining the feature $F^{(s)}$ and the coarse-grained prediction $P^{(s)}$ at scale $s$, in order to construct category-aware aggregation weights along the spatial dimension, Softmax normalization is first applied to $P^{(s)}$ over the spatial dimension, thereby obtaining the weight distribution of each category at different spatial locations:
$$a_{k,i}^{(s)} = \frac{\exp\left(p_{k,i}^{(s)}\right)}{\sum_{j=1}^{N_s} \exp\left(p_{k,j}^{(s)}\right)}, \quad i = 1, \ldots, N_s, \quad k = 1, \ldots, K, \quad s \in \{1, 2\},$$
where $p_{k,i}^{(s)}$ denotes the prediction score of the $i$-th spatial location belonging to the $k$-th category at scale $s$, $a_{k,i}^{(s)}$ denotes the corresponding category-aware aggregation weight, and $N_s = H_s W_s$ is the number of spatial locations.
Subsequently, the above weights are used to perform weighted aggregation of the features, thereby obtaining the category center of the $k$-th category at scale $s$:
$$c_k^{(s)} = \sum_{i=1}^{N_s} a_{k,i}^{(s)} f_i^{(s)}, \quad k = 1, \ldots, K,$$
where $f_i^{(s)} \in \mathbb{R}^{C}$ denotes the feature vector at the $i$-th spatial location at scale $s$, and $c_k^{(s)} \in \mathbb{R}^{C}$ denotes the category center of the $k$-th category at scale $s$. By stacking all category centers along the category dimension, the category-center matrix $C^{(s)} \in \mathbb{R}^{K \times C}$ is obtained. Subsequently, a lightweight nonlinear transformation and gating operation are applied to $C^{(s)}$ to suppress noisy categories and enhance the representation of effective categories. Finally, the dual-scale category centers are fused to obtain the initial global category centers:
$$C_{\mathrm{g}} = \frac{1}{|S|} \sum_{s \in S} C^{(s)}, \quad S = \{1, 2\},$$
where $C_{\mathrm{g}} \in \mathbb{R}^{K \times C}$ denotes the initial global category centers obtained by dual-scale fusion, and $|S|$ denotes the number of scale branches.
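The dual-scale aggregation above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the non-overlapping pooling, and the random inputs are assumptions, and the lightweight nonlinear transformation and gating applied to each $C^{(s)}$ are omitted because their exact form is not given.

```python
import numpy as np


def avg_pool2d(x, s):
    """Non-overlapping s x s average pooling on an array of shape (C, H, W)."""
    if s == 1:
        return x
    c, h, w = x.shape
    x = x[:, : h - h % s, : w - w % s]          # drop rows/cols that do not fit
    return x.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))


def category_centers(feat, logits):
    """Spatial-softmax weighted aggregation: (C, H, W), (K, H, W) -> (K, C)."""
    f = feat.reshape(feat.shape[0], -1)          # (C, N)
    p = logits.reshape(logits.shape[0], -1)      # (K, N)
    p = p - p.max(axis=1, keepdims=True)         # stabilise the exponential
    a = np.exp(p)
    a /= a.sum(axis=1, keepdims=True)            # softmax over the N locations
    return a @ f.T                               # row k is the center c_k


def global_category_centers(feat, logits, scales=(1, 2)):
    """Average the per-scale center matrices (gating step omitted)."""
    centers = [
        category_centers(avg_pool2d(feat, s), avg_pool2d(logits, s))
        for s in scales
    ]
    return np.mean(centers, axis=0)              # (K, C)


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))       # hypothetical F_4 with C = 8
logits = rng.normal(size=(3, 4, 4))     # hypothetical P_coarse with K = 3
C_g = global_category_centers(feat, logits)
print(C_g.shape)
```

Because the aggregation weights of each category sum to one over the spatial locations, every center is a convex combination of pixel features, so a constant feature map yields centers equal to that constant.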
Furthermore, the Frequency-Domain Center Refinement Module takes $F_4$ as input and first performs a two-dimensional real-valued Fourier transform on it to obtain the frequency-domain representation $\hat{F}_4 = \mathcal{F}(F_4)$. Subsequently, its amplitude spectrum $M = |\hat{F}_4|$ is computed, and the channel-wise frequency-domain statistics $z = \mathrm{GAP}(M)$ are extracted through global average pooling. On this basis, an MLP followed by a Sigmoid function generates the channel weight vector $w = \mathrm{MLP}(z)$, which is then applied to the initial global category centers $C_{\mathrm{g}} \in \mathbb{R}^{K \times C}$ for channel recalibration, thereby yielding the frequency-refined global category centers $C_{\mathrm{gr}} \in \mathbb{R}^{K \times C}$:
$$(C_{\mathrm{gr}})_{k,c} = w_c \, (C_{\mathrm{g}})_{k,c},$$
where $k$ and $c$ denote the category index and channel index, respectively, and $w_c$ denotes the weight of the $c$-th channel predicted from the frequency-domain statistics. This process enables the category centers to more fully encode the key frequency information contained in the deep features, thereby improving their representation stability and cross-scene generalization capability. Through the above frequency-guided channel recalibration, the model is able to enhance the responses of discriminative frequency-domain channels, thereby obtaining more robust global category-center priors. In this way, the GCCM provides explicit category-level constraints for subsequent feature updates during the decoding stage, enabling the category prior to effectively participate in the subsequent pixel-level interaction.
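A minimal NumPy sketch of this frequency-guided recalibration is given below. The two-layer MLP with a ReLU hidden layer, the weight matrices `w1` and `w2`, and the function names are assumptions made for illustration; the text only states that an MLP and a Sigmoid are used.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def frequency_channel_weights(feat, w1, w2):
    """Channel weights from amplitude-spectrum statistics.

    feat: (C, H, W); w1: (C_hidden, C); w2: (C, C_hidden).
    Returns a vector w with entries in (0, 1), one per channel.
    """
    spectrum = np.fft.rfft2(feat, axes=(-2, -1))   # complex, (C, H, W // 2 + 1)
    magnitude = np.abs(spectrum)                   # amplitude spectrum M
    z = magnitude.mean(axis=(-2, -1))              # GAP over the frequency plane
    hidden = np.maximum(w1 @ z, 0.0)               # ReLU hidden layer (assumed)
    return sigmoid(w2 @ hidden)


def refine_centers(centers, w):
    """(C_gr)[k, c] = w[c] * (C_g)[k, c], broadcast over the K categories."""
    return centers * w[None, :]


rng = np.random.default_rng(1)
feat = rng.normal(size=(6, 8, 8))            # hypothetical F_4 with C = 6
centers = rng.normal(size=(4, 6))            # hypothetical C_g with K = 4
w = frequency_channel_weights(feat, rng.normal(size=(3, 6)), rng.normal(size=(6, 3)))
C_gr = refine_centers(centers, w)
print(w.shape, C_gr.shape)
```

Note that the same weight $w_c$ rescales channel $c$ of every category center, so the recalibration changes channel emphasis without mixing categories.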

3.3. Fourier Global Enhancement Module

Since the category prior mainly provides semantic constraints at the level of category vectors, it cannot directly enhance global structural representations at the feature-map level. Therefore, a frequency-domain enhancement branch is introduced at the feature-map level to perform global enhancement on deep features. Intuitively, GCCM provides category-level guidance on what semantic regions should be emphasized, whereas FGEM further enhances how global structures and texture-related frequency responses are represented in the deep feature map. As shown in Figure 3, unlike the GCCM, which constructs global category-center priors, the FGEM operates directly on the deep feature map $F_4 \in \mathbb{R}^{C \times H \times W}$ of the encoder. While preserving the spatial resolution, it exploits frequency-domain information to enhance global structural representation and adaptively fuses the result with the original spatial features, thereby strengthening structural and detail cues without replacing the original spatial representation. It finally outputs the deep feature representation $F_4^{\mathrm{fused}}$ for subsequent decoding and for the LCFA.
Overall, the FGEM can be regarded as a nonlinear transformation applied to the deep encoder feature $F_4 \in \mathbb{R}^{C \times H \times W}$. Specifically, a two-dimensional real-valued Fourier transform (rFFT2) with orthogonal normalization is first performed on $F_4$ channel by channel along the spatial dimensions, yielding the complex spectrum:
$$\hat{F}(u, v) = \mathcal{F}\left(F_4(x, y)\right),$$
where $\mathcal{F}(\cdot)$ denotes the two-dimensional real-valued Fourier transform, and $\hat{F}(u, v)$ denotes the corresponding complex-valued frequency-domain representation. The complex spectrum is then decomposed into the amplitude spectrum and the phase spectrum:
$$A = |\hat{F}(u, v)|, \quad \phi = \arg \hat{F}(u, v),$$
where $A$ and $\phi$ denote the amplitude spectrum and phase spectrum, respectively.
To preserve spatial structural information, the FGEM keeps the phase spectrum $\phi$ unchanged and only modulates the amplitude spectrum $A$. For each sample and channel, the amplitude spectrum is normalized over the frequency plane as:
$$\mu_{b,c} = \mathrm{Mean}_{(u,v) \in \Omega}\left(A_{b,c,u,v}\right), \quad \sigma_{b,c} = \mathrm{Std}_{(u,v) \in \Omega}\left(A_{b,c,u,v}\right) + \epsilon, \quad A_{b,c,u,v}^{\mathrm{norm}} = \frac{A_{b,c,u,v} - \mu_{b,c}}{\sigma_{b,c}},$$
where $b$ and $c$ denote the sample index and channel index, respectively, $\Omega$ denotes the set of frequency coordinates, and $\epsilon = 10^{-5}$ is used for numerical stability.
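The amplitude spectrum is then modulated by a bounded gating function of its normalized value:
$$\tilde{A}_{b,c,u,v} = A_{b,c,u,v} \left(1 + \alpha \tanh\left(A_{b,c,u,v}^{\mathrm{norm}}\right)\right),$$
where $\alpha$ is a learnable scalar parameter initialized to 0, which controls the modulation strength of the amplitude spectrum. With $\alpha = 0$, the modulation reduces to the identity, so the frequency branch initially reproduces its input.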
The modulated amplitude spectrum is recombined with the original phase spectrum and transformed back to the spatial domain:
$$\tilde{F}(u, v) = \tilde{A}(u, v) \exp\left(j \phi(u, v)\right), \quad X_{\mathrm{enh}} = \mathcal{F}^{-1}\left(\tilde{F}(u, v)\right).$$
Finally, the enhanced feature is mapped by a learnable $1 \times 1$ convolution and fused with the input feature through a residual connection:
$$F_4^{\mathrm{global}} = F_4 + \mathrm{Conv}_{1 \times 1}\left(X_{\mathrm{enh}}\right), \quad F_4^{\mathrm{global}} \in \mathbb{R}^{C \times H \times W},$$
where $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a learnable channel-mapping convolution.
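The global branch described above (transform, amplitude normalization, bounded modulation, inverse transform, and residual mapping) can be sketched in NumPy as follows. The function name `fgem_global_branch` and the representation of the $1 \times 1$ convolution as a bias-free channel-mixing matrix `conv_weight` are assumptions for illustration.

```python
import numpy as np


def fgem_global_branch(feat, alpha, conv_weight, eps=1e-5):
    """Phase-preserving amplitude modulation with a residual connection.

    feat: (B, C, H, W) real array; alpha: scalar modulation strength;
    conv_weight: (C, C) matrix standing in for a bias-free 1x1 convolution.
    """
    h, w = feat.shape[-2:]
    spec = np.fft.rfft2(feat, axes=(-2, -1), norm="ortho")
    amp, phase = np.abs(spec), np.angle(spec)

    # Normalise the amplitude over the frequency plane, per sample and channel.
    mu = amp.mean(axis=(-2, -1), keepdims=True)
    sigma = amp.std(axis=(-2, -1), keepdims=True) + eps
    amp_norm = (amp - mu) / sigma

    # Bounded modulation: the factor stays within [1 - |alpha|, 1 + |alpha|].
    amp_mod = amp * (1.0 + alpha * np.tanh(amp_norm))

    # Recombine with the untouched phase and return to the spatial domain.
    spec_mod = amp_mod * np.exp(1j * phase)
    x_enh = np.fft.irfft2(spec_mod, s=(h, w), axes=(-2, -1), norm="ortho")

    # 1x1 convolution as channel mixing, followed by the residual connection.
    mapped = np.einsum("oc,bchw->bohw", conv_weight, x_enh)
    return feat + mapped


rng = np.random.default_rng(2)
x = rng.normal(size=(2, 4, 8, 8))           # hypothetical batch of F_4
f_global = fgem_global_branch(x, alpha=0.3, conv_weight=np.eye(4))
print(f_global.shape)
```

With `alpha=0.0` the modulation factor is exactly one, so `x_enh` reproduces the input up to floating-point error; this is one way to see why initializing $\alpha$ to 0 lets training start from an unmodified spectrum.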
Although the frequency-domain global branch enhances global structural representation, the original spatial features still preserve local textures and edge details. Therefore, a lightweight channel-adaptive fusion strategy is adopted. The spatial-branch feature $F_4^{\mathrm{spatial}}$ and the global-branch feature $F_4^{\mathrm{global}}$ are concatenated, followed by global average pooling, a learnable $1 \times 1$ convolution, and a Sigmoid function to generate the channel weight $w$. The fused feature is obtained as
$$F_4^{\mathrm{fused}} = w \odot F_4^{\mathrm{spatial}} + (1 - w) \odot F_4^{\mathrm{global}},$$
where $F_4^{\mathrm{spatial}} = F_4$, $\odot$ denotes element-wise multiplication, and $w \in (0, 1)^{C \times 1 \times 1}$ denotes the channel-wise fusion weight. Through this design, the FGEM provides enhanced deep representations for subsequent multiscale decoding and the LCFA.
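The channel-adaptive fusion can be illustrated with the short NumPy sketch below. The gate parameters `gate_weight` and `gate_bias` stand in for the learnable $1 \times 1$ convolution applied to the pooled descriptor and, like the function name, are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_adaptive_fusion(f_spatial, f_global, gate_weight, gate_bias):
    """Per-channel convex combination of the spatial and global branches.

    f_spatial, f_global: (C, H, W); gate_weight: (C, 2C); gate_bias: (C,).
    Returns the fused feature (C, H, W) and the gate w of shape (C, 1, 1).
    """
    stacked = np.concatenate([f_spatial, f_global], axis=0)   # (2C, H, W)
    pooled = stacked.mean(axis=(1, 2))                        # GAP -> (2C,)
    w = sigmoid(gate_weight @ pooled + gate_bias)[:, None, None]
    return w * f_spatial + (1.0 - w) * f_global, w


rng = np.random.default_rng(3)
f_spatial = rng.normal(size=(4, 5, 5))      # hypothetical F_4^spatial
f_global = rng.normal(size=(4, 5, 5))       # hypothetical F_4^global
fused, w = channel_adaptive_fusion(
    f_spatial, f_global, rng.normal(size=(4, 8)), np.zeros(4)
)
print(fused.shape, w.shape)
```

Since $w$ lies in $(0, 1)$, each fused value lies between the corresponding values of the two branches, so neither branch can be amplified beyond its own range by the fusion step.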

3.4. Local Category-Aware Frequency Attention Module

The category prior can provide explicit category-level constraints for the decoding process. However, if it is merely treated as additional input information, it is often difficult for the category prior to establish an explicit correspondence with pixel-level discrimination, especially in boundary regions and small-object regions, where category confusion and background-induced false responses are more likely to occur. Therefore, the key purpose of LCFA is to make the category prior directly participate in local feature refinement rather than only serving as auxiliary global information. To this end, LCFA is introduced at the $1/32$ and $1/16$ scales during the decoding stage. This module injects the category prior into the local feature updating process through pixel-category center interaction, and recalibrates feature responses by incorporating frequency statistics, thereby improving the discrimination capability for fine-grained structures. In this way, each spatial position can adaptively select relevant category-center information under global semantic guidance, which helps refine ambiguous boundaries and small-object regions. The architecture of the LCFA at the $1/32$ scale is illustrated in Figure 4.
First, the input features are projected into pixel tokens, while the global category centers are projected into category tokens. Specifically, a $1 \times 1$ convolution is applied to the local feature $X$, followed by flattening along the spatial dimension to generate the Query. Meanwhile, a linear projection is performed on the global category centers $C_{\mathrm{gr}}$ to generate the Key and Value:
$$\mathrm{Query} = \mathrm{Flatten}\left(\mathrm{Conv}_{1 \times 1}(X)\right) \in \mathbb{R}^{N \times C}, \quad N = H \times W,$$
$$\mathrm{Key} = \phi_K(C_{\mathrm{gr}}), \quad \mathrm{Value} = \phi_V(C_{\mathrm{gr}}), \quad \mathrm{Key}, \mathrm{Value} \in \mathbb{R}^{K \times C},$$
where $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes the $1 \times 1$ convolution used for channel projection, $\mathrm{Flatten}(\cdot)$ denotes the flattening operation over the spatial dimensions, and $\phi_K(\cdot)$ and $\phi_V(\cdot)$ denote the linear projection functions.
Subsequently, category-level attention is performed by taking the pixel tokens as the Query and the category tokens as the Key and Value, followed by Softmax normalization along the category dimension, so that each spatial position can adaptively select the relevant category centers and aggregate them to obtain a category-aware representation:
$$A = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{C}}\right) \in \mathbb{R}^{N \times K}, \quad O = A V \in \mathbb{R}^{N \times C},$$
where $Q$, $K$, and $V$ inside the Softmax and the product abbreviate the Query, Key, and Value defined above (in the dimension $N \times K$, $K$ is the number of categories), $\sqrt{C}$ is the scaling factor, $A \in \mathbb{R}^{N \times K}$ denotes the association weights between pixel tokens and category centers, and $O \in \mathbb{R}^{N \times C}$ denotes the category-aware feature representation obtained by weighted aggregation of the category value tokens.
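The pixel-to-category attention can be sketched in NumPy as below. The projection matrices `w_q`, `w_k`, and `w_v` stand in for the $1 \times 1$ convolution and the two linear projections (biases omitted), and the scaling by the square root of the channel dimension follows the standard scaled dot-product form; all names are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponential
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def category_attention(x, centers, w_q, w_k, w_v):
    """Cross-attention from pixel tokens to category centers.

    x: (C, H, W) local feature; centers: (K, C) category centers;
    w_q, w_k, w_v: (C, C) projection matrices.
    Returns the category-aware map (C, H, W) and the weights (N, K).
    """
    c, h, w = x.shape
    q = (w_q @ x.reshape(c, -1)).T                 # (N, C): 1x1 conv, then flatten
    k = centers @ w_k.T                            # (K, C)
    v = centers @ w_v.T                            # (K, C)
    attn = softmax(q @ k.T / np.sqrt(c), axis=1)   # softmax over the K categories
    out = attn @ v                                 # (N, C)
    return out.T.reshape(c, h, w), attn


rng = np.random.default_rng(4)
x = rng.normal(size=(8, 4, 4))            # hypothetical decoder feature X
centers = rng.normal(size=(3, 8))         # hypothetical C_gr with K = 3
proj = [rng.normal(size=(8, 8)) for _ in range(3)]
x_att, attn = category_attention(x, centers, *proj)
print(x_att.shape, attn.shape)
```

The attention matrix has only $N \times K$ entries, so the cost grows linearly with the number of pixels rather than quadratically as in pixel-to-pixel self-attention.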
Next, $O$ is restored to the spatial layout and passed through a 1 × 1 convolution to obtain the category-aware feature $X_{\mathrm{att}} \in \mathbb{R}^{C \times H \times W}$. To preserve local texture details while injecting the category prior, the LCFA concatenates $X_{\mathrm{att}}$ and the original feature $X$ along the channel dimension, and then applies a 3 × 3 convolution to obtain the category-enhanced feature:
$$X_{\mathrm{ca}} = \mathrm{Conv}_{3\times 3}\big(\mathrm{Cat}(X, X_{\mathrm{att}})\big) \in \mathbb{R}^{C \times H \times W},$$
where $\mathrm{Cat}(\cdot)$ denotes concatenation along the channel dimension, and $\mathrm{Conv}_{3\times 3}(\cdot)$ is used for detail compensation. Unlike the GCCM, which mainly performs recalibration at the category level, the LCFA explicitly establishes the correspondence between local features and category centers at the pixel level, enabling each spatial position to adaptively select more relevant category-center information under the guidance of the category prior, thereby enhancing the responses of target regions and suppressing background interference. In addition, an auxiliary prediction map is generated within the module for auxiliary supervision during the training stage.
Relying on spatial attention alone is insufficient to characterize the different roles that individual channels play in frequency structures. To this end, a Spectral-weight Gating Module (SWG) is introduced into the LCFA to perform joint recalibration of channels and frequencies on $X_{\mathrm{ca}}$. Specifically, a two-dimensional real-valued Fourier transform is first applied to the input feature to obtain the amplitude spectrum. Subsequently, the average responses of the low-frequency band and the high-frequency band are computed separately, and their ratio is used to construct a channel-wise frequency descriptor. Finally, a channel gating weight $g$ is generated through a lightweight mapping and a Sigmoid activation, which is used to modulate the category-enhanced feature in a channel-wise manner, thereby yielding the output of the LCFA:
$$X_{\mathrm{LCFA}} = g \odot X_{\mathrm{ca}},$$
where $g \in \mathbb{R}^{C \times 1 \times 1}$ denotes the channel gating weight, and $\odot$ denotes element-wise multiplication. When the high-frequency energy of a certain channel within a local window is significantly higher than its low-frequency energy, the SWG assigns a larger weight to that channel so as to emphasize structure-sensitive components such as edges and textures; conversely, channels with weaker frequency-domain contributions or those dominated by noise are suppressed. Unlike the image-level spectral modulation in the FGEM, the SWG in the LCFA operates on decoding-stage features in a window-based manner, making it more suitable for characterizing local details and target boundaries.
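The gating idea can be illustrated with a simplified NumPy sketch. Several choices here are assumptions made only for illustration and are not stated in the text: the radial `cutoff` separating low from high frequencies, the use of the high-to-low ratio as the descriptor, an affine map (`scale`, `bias`) standing in for the lightweight mapping, and operating on the whole feature map rather than on local windows.

```python
import numpy as np

def spectral_weight_gate(x, cutoff=0.25, scale=1.0, bias=0.0, eps=1e-6):
    """x: (C, H, W) feature map. Returns (gate of shape (C,), gated feature)."""
    c, h, w = x.shape
    amp = np.abs(np.fft.rfft2(x, axes=(-2, -1)))     # amplitude spectrum, (C, H, W//2+1)
    fy = np.fft.fftfreq(h)[:, None]                  # vertical frequencies in cycles/sample
    fx = np.fft.rfftfreq(w)[None, :]                 # non-negative horizontal frequencies
    radius = np.sqrt(fy ** 2 + fx ** 2)
    low = radius <= cutoff                           # assumed low-frequency band
    low_mean = amp[:, low].mean(axis=1)              # mean low-band response per channel
    high_mean = amp[:, ~low].mean(axis=1)            # mean high-band response per channel
    descriptor = high_mean / (low_mean + eps)        # channel-wise frequency descriptor
    gate = 1.0 / (1.0 + np.exp(-(scale * descriptor + bias)))  # sigmoid
    return gate, x * gate[:, None, None]

# Toy input: channel 0 is flat (low-frequency only), channel 1 is a checkerboard.
h = w = 16
yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
x = np.stack([np.ones((h, w)), (-1.0) ** (yy + xx)])
gate, out = spectral_weight_gate(x)
print(gate)
```

In this toy case the checkerboard channel, whose energy sits entirely in the high band, receives the larger gate, which matches the behaviour the text describes for edge- and texture-sensitive channels.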
In the network, one LCFA layer is applied at the 1/32 scale by taking $F_4^{\mathrm{fused}}$ as input to obtain deep category-aware enhanced features. Subsequently, another LCFA layer is applied at the 1/16 scale to the upsampled intermediate-level features. The two scales share the same set of global category centers $C_{\mathrm{gr}}$, but model local–global interaction separately under different spatial resolutions, thereby enabling the category prior to be progressively propagated to finer-grained spatial locations. The outputs of the multiscale LCFA are fused with shallow detailed features through skip connections and upsampling operations, and are then passed to the final prediction head to produce the pixel-level segmentation results.

4. Experiments

4.1. Datasets

To evaluate the proposed method, experiments were conducted on three representative semantic segmentation datasets, namely ISPRS Vaihingen (ISPRS 2D Semantic Labeling Benchmark: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx, accessed on 9 May 2026), ISPRS Potsdam (ISPRS 2D Semantic Labeling Benchmark: https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/semantic-labeling.aspx, accessed on 9 May 2026), and UAVid (UAVid dataset: https://uavid.nl/, accessed on 9 May 2026). These datasets differ substantially in terms of imaging platform, viewing perspective, spatial resolution, and scene complexity, thereby providing a comprehensive benchmark for evaluating the generalization capability and robustness of the proposed model under diverse scene conditions.
The ISPRS Vaihingen dataset consists of high-resolution aerial orthophotos acquired over Vaihingen, Germany, together with pixel-level annotations. It has a spatial resolution of 9 cm and mainly covers typical urban scenes. Following the commonly used setting, a six-class semantic segmentation task was conducted, including impervious surfaces, buildings, low vegetation, trees, cars, and clutter. For dataset partitioning, images with IDs 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 32, 34, and 37 were used for training, image 30 was used for validation, and the remaining 17 images were used for testing. The original large-format images were cropped into patches of 512 × 512 pixels as model inputs.
The ISPRS Potsdam dataset, also released by ISPRS, is a widely used benchmark for high-resolution remote sensing image semantic segmentation. It consists of 38 aerial images with a spatial resolution of 5 cm, and each original image is approximately 6000 × 6000 pixels in size. The dataset follows the same six-class annotation scheme as the Vaihingen dataset. For dataset partitioning, images with IDs 2_11, 2_12, 3_10, 3_11, 3_12, 4_10, 4_11, 4_12, 5_10, 5_11, 5_12, 6_7, 6_8, 6_9, 6_10, 6_11, 6_12, 7_7, 7_8, 7_9, 7_10, 7_11, and 7_12 were used for training, image 2_10 was used for validation, and the remaining 14 images were used for testing. The original large-format images were cropped into patches of 512 × 512 pixels as model inputs.
The UAVid dataset is a widely used high-resolution UAV image dataset for urban scene semantic segmentation. It is captured from an oblique viewing perspective and contains complex urban elements, including roads, buildings, vehicles, pedestrians, and vegetation. The dataset consists of 42 video sequences, from which annotated image frames are extracted, and is annotated into eight semantic classes. Following the official sequence-based split, 20 sequences were used for training, 7 sequences were used for validation, and the remaining 15 sequences were used for testing. The original 4K images were cropped into patches of 512 × 512 pixels as model inputs.
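All three datasets are cropped into 512 × 512 patches. The text does not state the stride or how image borders are handled, so the following sketch assumes non-overlapping tiles with the last row and column shifted back to stay inside the image; `tile_origins` and `tile_boxes` are hypothetical helper names.

```python
def tile_origins(length, patch=512, stride=512):
    """Start offsets of patches along one axis; the last patch is shifted
    back so that it ends exactly at the image border (assumed policy)."""
    if length <= patch:
        return [0]  # image smaller than a patch: padding would be needed
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)
    return origins

def tile_boxes(height, width, patch=512, stride=512):
    """(top, left, bottom, right) boxes covering a height x width image."""
    return [(y, x, y + patch, x + patch)
            for y in tile_origins(height, patch, stride)
            for x in tile_origins(width, patch, stride)]

# A Potsdam-sized image of about 6000 x 6000 pixels.
boxes = tile_boxes(6000, 6000)
print(len(boxes))  # 144 patches (12 per axis)
```

A smaller `stride` would produce overlapping patches, which is a common way to reduce seam artifacts at inference time.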

4.2. Implementation Details

All experiments were conducted on a workstation equipped with an NVIDIA RTX 3090 GPU. The proposed model was implemented based on the PyTorch deep learning framework (v2.1.0) with Python (v3.10) and CUDA (v12.1). ResNet-50 was adopted as the encoder backbone, upon which GC2F-Net was constructed. Unless otherwise specified, the input images were cropped into patches of 512 × 512 pixels, and the batch size was set to 6.
For training, the AdamW optimizer was adopted, with the initial learning rate and weight decay set to $1 \times 10^{-4}$ and $1 \times 10^{-2}$, respectively. The number of training epochs was set to 150 for Vaihingen, 80 for Potsdam, and 80 for UAVid. The learning rate was updated using a polynomial decay strategy with the power set to 0.9. In addition, 1000 warm-up iterations were introduced at the early stage of training to improve training stability. For data augmentation, in addition to basic normalization, random horizontal and vertical flipping, random rotation, and geometric transformations were applied during training. Moreover, additional random cropping and flipping were performed on image patches containing relatively many vehicle targets to alleviate the training bias caused by the insufficient samples of small-object categories.
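The schedule described above (warm-up followed by polynomial decay with power 0.9) can be sketched as a plain function of the iteration index. The linear shape of the warm-up, the decay being measured over the post-warm-up iterations, and the example `total_steps` value are assumptions; the default `base_lr` of 1e-4 is taken as the initial learning rate.

```python
def learning_rate(step, total_steps, base_lr=1e-4, power=0.9, warmup_steps=1000):
    """Learning rate at a given iteration (0-indexed)."""
    if step < warmup_steps:
        # Linear warm-up from base_lr / warmup_steps up to base_lr (assumed shape).
        return base_lr * (step + 1) / warmup_steps
    # Polynomial decay over the remaining iterations.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power

total_steps = 20000  # hypothetical iteration budget
for step in (0, 500, 999, 1000, 10000, 20000):
    print(step, learning_rate(step, total_steps))
```

In a PyTorch training loop the same function could be wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to `base_lr` instead of the absolute value.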

4.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed model in semantic segmentation tasks, mean Intersection over Union (mIoU), Overall Accuracy (OA), and mean F1-score (mF1) are adopted as the overall evaluation metrics, while the IoU of each category is also reported. Specifically, mIoU is used to measure the regional overlap between the predictions and the ground-truth annotations, OA reflects the overall classification accuracy over all pixels, and mF1 more comprehensively characterizes the overall segmentation performance across different categories. The above metrics are calculated as follows:
$$\mathrm{mIoU} = \frac{1}{N}\sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k + FN_k}, \quad \mathrm{OA} = \frac{\sum_{k=1}^{N} TP_k}{\sum_{k=1}^{N} (TP_k + FP_k)}, \quad \mathrm{mF1} = \frac{1}{N}\sum_{k=1}^{N} \frac{2\,\mathrm{Precision}_k\,\mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k},$$
where $N$ denotes the number of valid categories, and $TP_k$, $FP_k$, and $FN_k$ denote the numbers of true positives, false positives, and false negatives for category $k$, respectively. $\mathrm{Precision}_k$ and $\mathrm{Recall}_k$ denote the precision and recall of category $k$, respectively, which are defined as $\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}$ and $\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}$.
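These three metrics can all be derived from a single confusion matrix. The sketch below is a minimal NumPy version; the function names and the `ignore_index` value of 255 for unlabeled pixels are assumptions introduced for illustration.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows index the ground-truth class, columns the predicted class."""
    valid = target != ignore_index
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return (mIoU, OA, mF1) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as k but labeled otherwise
    fn = cm.sum(axis=1) - tp          # labeled k but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / max(cm.sum(), 1)
    return iou.mean(), oa, f1.mean()

target = np.array([0, 0, 1, 1, 2, 255])   # last pixel is unlabeled
pred = np.array([0, 1, 1, 1, 2, 0])
miou, oa, mf1 = segmentation_metrics(confusion_matrix(pred, target, num_classes=3))
print(miou, oa, mf1)
```

For this toy input the per-class IoUs are 1/2, 2/3, and 1, so mIoU is 13/18, OA is 4/5, and mF1 is 37/45.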

4.4. Loss Function

To jointly account for pixel-wise classification accuracy and region-level overlap quality, the weighted sum of the cross-entropy (CE) loss and the multiclass Dice loss is adopted as the training objective. The CE loss facilitates stable optimization of pixel-wise category prediction, whereas the Dice loss constrains the consistency between the predicted regions and the ground-truth annotations and generally exhibits stronger robustness to class imbalance. The overall loss is formulated as follows:
$$\mathcal{L} = \lambda_{\mathrm{ce}} \mathcal{L}_{\mathrm{ce}} + \lambda_{\mathrm{dice}} \mathcal{L}_{\mathrm{dice}},$$
where $\lambda_{\mathrm{ce}} = 1.0$, and $\lambda_{\mathrm{dice}}$ is initially set to 0.3 and gradually adjusted to 0.4 during training to impose a stronger constraint on region-level overlap. The cross-entropy loss is defined as:
$$\mathcal{L}_{\mathrm{ce}} = -\frac{1}{|\Omega|} \sum_{i \in \Omega} \log p_{i, y_i},$$
where $\Omega$ denotes the set of valid pixels, and $p_{i, y_i}$ denotes the predicted probability of pixel $i$ for its ground-truth category $y_i$. Unlabeled invalid pixels are ignored in the loss computation and excluded from backpropagation. The Dice loss is used to measure the regional overlap between the predictions and the annotations. In implementation, Softmax is first applied to the logits to obtain probabilities, and a mask is then imposed on invalid pixel positions to exclude them from the statistics. Subsequently, the Dice value is computed for each category and averaged. The multiclass Dice loss is defined as:
$$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{i \in \Omega} p_{i,c}\, g_{i,c} + \epsilon}{\sum_{i \in \Omega} p_{i,c}^{2} + \sum_{i \in \Omega} g_{i,c}^{2} + \epsilon},$$
where $C$ denotes the number of categories, $g_{i,c}$ denotes the ground-truth label in one-hot form, and $\epsilon$ denotes the smoothing term. In implementation, the Dice loss is computed using squared terms in the denominator.
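A compact NumPy sketch of the combined objective is given below for a single image. It is illustrative only: the `ignore_index` value of 255, the smoothing value `eps=1.0`, and the linear ramp used in `dice_weight` for moving the Dice weight from 0.3 to 0.4 are assumptions, since the text does not specify them.

```python
import numpy as np

def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ce_dice_loss(logits, target, lam_ce=1.0, lam_dice=0.3, ignore_index=255, eps=1.0):
    """logits: (C, H, W); target: (H, W) integer labels. Returns (total, ce, dice)."""
    num_classes = logits.shape[0]
    valid = target != ignore_index                    # mask out unlabeled pixels
    prob = softmax(logits, axis=0)[:, valid]          # (C, M) probabilities on valid pixels
    labels = target[valid]                            # (M,)
    one_hot = np.eye(num_classes)[labels].T           # (C, M)
    ce = -np.log(prob[labels, np.arange(labels.size)] + 1e-12).mean()
    inter = (prob * one_hot).sum(axis=1)
    denom = (prob ** 2).sum(axis=1) + (one_hot ** 2).sum(axis=1)  # squared terms
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    return lam_ce * ce + lam_dice * dice, ce, dice

def dice_weight(epoch, total_epochs, start=0.3, end=0.4):
    """Assumed linear ramp of the Dice weight over training."""
    t = min(max(epoch / max(1, total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * t

target = np.array([[0, 1], [2, 255]])
good = np.zeros((3, 2, 2))
for c in range(3):
    good[c][target == c] = 10.0                       # confident, correct logits
uniform = np.zeros((3, 2, 2))                         # uninformative logits
loss_good, _, dice_good = ce_dice_loss(good, target, lam_dice=dice_weight(0, 150))
loss_uniform, _, _ = ce_dice_loss(uniform, target, lam_dice=dice_weight(0, 150))
print(loss_good, loss_uniform)
```

The confident, correct logits give a loss close to zero, while the uniform logits are penalized by both terms.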

5. Results

5.1. Comparative Experiments

Unless otherwise specified, all comparison methods are evaluated under the same training and inference settings described in Section 4.2 and with the same evaluation metrics. To verify the effectiveness of GC2F-Net, comparative experiments are conducted on the three datasets against several representative methods, including DANet [19], DeepLabV3 [35], FPN [17], PSPNet [15], OCRNet [20], SegFormer-R50 [26], and UNetFormer [27]. Specifically, SegFormer-R50 refers to a ResNet-50 encoder combined with a SegFormer-inspired decoding head. This setting was used to keep the backbone consistent with other ResNet-50-based comparison methods, enabling a controlled comparison under the same encoder configuration. Therefore, SegFormer-R50 should be regarded as a ResNet-50-based controlled variant, rather than the original Transformer-based SegFormer architecture.
In addition to segmentation accuracy, the model complexity and inference efficiency of different methods are compared to more comprehensively evaluate the trade-off between accuracy and efficiency. It should be noted that the number of parameters, FLOPs, and FPS are mainly determined by the model architecture and the unified testing settings, and therefore remain consistent across different datasets, whereas mIoU varies with the dataset. To avoid repeatedly presenting the same efficiency axis ranges, the ISPRS Vaihingen dataset is selected for visualization to illustrate the distribution of different methods in terms of accuracy and efficiency.
As shown in Figure 5 and Table 1, GC2F-Net achieves the best segmentation performance, with an mIoU of 76.90% and 26.06 M parameters. Compared with other representative methods, the proposed model obtains a favorable balance between segmentation accuracy and model efficiency. In terms of computational cost, GC2F-Net requires 59.26 G FLOPs and reaches 103.56 img/s, showing acceptable inference efficiency under the same evaluation setting. Overall, GC2F-Net achieves a competitive performance–efficiency trade-off among segmentation accuracy, model size, and inference speed, which further demonstrates its effectiveness for high-resolution remote sensing semantic segmentation.

5.1.1. Vaihingen Dataset

On the ISPRS Vaihingen dataset, as shown in Table 2, GC2F-Net achieves the best overall performance among the compared methods. The mIoU, mF1, and OA reach 76.90%, 85.86%, and 91.56%, respectively. Compared with SegFormer-R50, which is the strongest competing method, GC2F-Net improves mIoU, mF1, and OA by 2.34, 1.27, and 0.86 percentage points, respectively. These results indicate that the proposed method exhibits clear advantages in both overall segmentation accuracy and category consistency.
From the category-level results, GC2F-Net performs particularly well on the Building and Car classes, indicating that the proposed model possesses stronger structural representation and fine-grained discrimination capabilities for regular-structure targets and small-scale objects. Meanwhile, it also outperforms the other compared methods on the Clutter category, demonstrating higher robustness in complex backgrounds and unstructured regions. As shown in Figure 6, GC2F-Net produces smoother and more coherent segmentation results in boundary regions and identifies small objects more accurately. In particular, in building edges and road regions, its predicted boundaries are more consistent with the ground truth (GT). In summary, GC2F-Net effectively preserves fine-grained details while improving category consistency, thereby achieving more reliable segmentation results.

5.1.2. Potsdam Dataset

On the Potsdam dataset, as shown in Table 3, GC2F-Net achieves an mIoU, mF1, and OA of 79.44%, 87.41%, and 91.23%, respectively. Compared with FPN, which is the strongest competing method, the proposed method improves these three metrics by 0.57, 0.60, and 0.12 percentage points, respectively, indicating that it likewise exhibits favorable segmentation performance in this scene.
From the category-level results, GC2F-Net maintains strong competitiveness across multiple major categories and achieves better performance on key categories such as Building and Low Vegetation, indicating that the proposed method can stably improve the segmentation accuracy of the main ground-object categories. Meanwhile, it also performs robustly on the complex background category Clutter, demonstrating its capability to suppress background interference and contributing to the improvement of overall mIoU and mF1. As shown in Figure 7, GC2F-Net produces more coherent prediction results over large-area regions with less fragmentation and presents clearer contours at boundary locations such as buildings and roads. In regions with adjacent vegetation and strong background interference, category confusion is effectively alleviated, indicating that the proposed method has advantages in both global context modeling and local detail recovery.

5.1.3. UAVid Dataset

Because the pixel-level ground-truth (GT) annotations of the UAVid test set are not publicly available, this paper follows the common evaluation protocol and reports results on the validation set (Table 4). Overall, GC2F-Net achieves the best performance in terms of mIoU and mF1, reaching 67.04% and 78.65%, respectively, while the OA reaches 87.63%, indicating that the proposed method possesses strong overall segmentation capability and category consistency in complex urban scenes.
Figure 8 presents the qualitative comparison results on the UAVid dataset, from which it can be observed that GC2F-Net exhibits better connectivity and more accurate boundary localization for structured objects. It also effectively reduces structural discontinuities and boundary overflow in elongated regions and boundary transition areas. In addition, in small-object regions, its predictions are more consistent with the GT in spatial distribution, with reduced omission and misclassification errors. These results indicate that GC2F-Net possesses strong local detail recovery capability and stable category consistency in complex urban scenes. Meanwhile, due to factors such as the small scale of the Human category, complex shape variations, susceptibility to occlusion, and imbalanced category distribution, the IoU values of all methods for this category remain relatively low overall, whereas the proposed method still achieves results superior to those of most compared methods.
In summary, GC2F-Net not only achieves collaborative improvement in segmentation accuracy and robustness under complex scene conditions, but also alleviates the difficulty of jointly accounting for global context modeling and local detail representation in semantic segmentation of high-resolution remote sensing images. The experimental results demonstrate that the proposed method achieves stable and competitive performance on all three datasets, exhibiting favorable cross-dataset generalization capability.

5.2. Ablation Experiments

To verify the effectiveness of the key modules in GC2F-Net, namely GCCM, FGEM, and LCFA, and to analyze the contribution of each module to the overall performance improvement, ablation experiments are conducted. Considering that the variation trends of the experimental results are generally consistent across different datasets, the ISPRS Vaihingen dataset is selected in the main text for analysis. The baseline model (Baseline) is defined as follows: under the same backbone network and overall encoder–decoder framework, the GCCM, FGEM, and LCFA modules are removed, and only conventional multiscale feature extraction and decoding fusion are retained, thereby constructing a minimal comparison model without the proposed modules. Except for the module configuration, the remaining training and inference settings, data partitioning, and evaluation metrics are kept consistent with those used in the comparative experiments to ensure fairness and reproducibility.
As shown in Table 5, the complete model achieves 76.90%, 85.86%, and 91.56% in terms of mIoU, mF1, and OA, respectively, representing improvements of 2.74, 2.40, and 0.46 percentage points over the Baseline. These results indicate that the proposed modules significantly improve the overall segmentation accuracy and enhance category consistency. In terms of computational cost, compared with the baseline model, the complete model introduces only slight increases in the number of parameters and FLOPs, indicating that these performance gains are obtained with only a small additional computational overhead, thereby demonstrating a favorable trade-off between efficiency and accuracy.

5.2.1. Influence of Global Category-Center Module

After introducing GCCM into the Baseline, mIoU increases from 74.16% to 74.86%, while mF1 and OA improve from 83.46% and 91.10% to 83.98% and 91.19%, respectively. These results indicate that global category-center modeling provides stable category-level constraints for feature learning, thereby yielding consistent performance gains. As shown in Figure 9, compared with the baseline model, the introduction of GCCM leads to more stable predictions in small-scale regions surrounding vehicles and fewer misclassifications; in regions with relatively cluttered backgrounds, scattered noisy predictions are also reduced. Overall, GCCM improves the category consistency of predictions by explicitly introducing category prior and achieves stable performance gains under complex background conditions.

5.2.2. Influence of Fourier Global Enhancement Module

After introducing FGEM into the Baseline, mIoU and mF1 improve from 74.16% and 83.46% to 75.08% and 84.41%, respectively. These results indicate that frequency-guided feature enhancement effectively improves overall segmentation accuracy and category consistency. As shown in Figure 10, after introducing FGEM, the prediction results in complex background regions become more coherent, fragmentation is reduced, and discontinuities and noise near elongated structures are alleviated. In summary, FGEM mainly improves the robustness of the model in complex backgrounds and fine-grained structures by enhancing global structural representation and compensating for texture and detailed information, thereby yielding stable performance gains.

5.2.3. Influence of Local Category-Aware Frequency Attention Module

After introducing LCFA into the Baseline, mIoU increases from 74.16% to 75.11%, while mF1 and OA improve to 84.34% and 91.20%, respectively, indicating that local category-aware interaction enhances category discrimination capability during the decoding stage and yields stable performance gains. As shown in Figure 11, after introducing LCFA, the boundaries of structured objects such as buildings and roads become clearer, and the overflow phenomenon is reduced. Meanwhile, in regions with strong background interference, misclassification and adhesion phenomena are effectively suppressed. Overall, LCFA enhances the local detail recovery capability and category consistency of the model in complex regions by improving the fine-grained discrimination capability at boundaries and complex transition regions, and is one of the key modules driving the performance improvement.

5.2.4. Module-Removal Ablation Study

To verify the role of each module in the complete model, GCCM, FGEM, and LCFA are removed from the full model one by one, and the performance changes after removing each individual module are compared. The results show that removing any one of these modules leads to a decline in overall performance, with mIoU decreasing by 0.83, 1.28, and 2.04 percentage points, respectively, and mF1 decreasing by 0.61, 1.02, and 1.96 percentage points, respectively. These results indicate that GCCM, FGEM, and LCFA all play positive roles in improving the model performance.
As reported in Table 6, removing GCCM, FGEM, or LCFA leads to performance degradation, demonstrating the effectiveness of each component. Figure 12 presents the qualitative comparison results. Without GCCM, the prediction results are more prone to block-wise misclassification caused by insufficient category consistency. After removing FGEM, the prediction stability in small-object and fine-grained structure regions decreases, and the noise and fragmentation become more pronounced. When LCFA is removed, boundary overflow and local adhesion are more likely to occur in category transition regions, thereby leading to incomplete target contours. In summary, GCCM, FGEM, and LCFA play complementary roles in category prior modeling, global structural representation enhancement, and fine-grained discrimination in boundaries and complex transition regions, respectively, jointly supporting the overall performance advantages of GC2F-Net.
In summary, the ablation experiments verify the effectiveness and complementarity of the key modules in GC2F-Net. Introducing GCCM, FGEM, or LCFA individually brings stable performance gains over the baseline framework, indicating that category prior modeling, frequency-domain enhancement, and local category-aware interaction each make independent contributions to performance improvement. Meanwhile, removing any of these modules from the complete model leads to performance degradation, demonstrating that all three are important components for improving segmentation performance. Overall, GCCM, FGEM, and LCFA collaboratively enhance category consistency, local detail recovery, and boundary discrimination from the perspectives of category-prior modeling, frequency-domain enhancement, and local category-aware interaction, respectively. This enables GC2F-Net to achieve more robust performance gains with only modest additional computational overhead.

6. Discussion

Systematic experiments conducted on the ISPRS Vaihingen, ISPRS Potsdam, and UAVid datasets validate the effectiveness and generalization ability of GC2F-Net for semantic segmentation in complex scenes. The advantages of GC2F-Net are mainly reflected in its improved overall evaluation metrics and its enhanced prediction quality for regions with complex textures, class boundaries, and fine-grained structures of small objects. On the Vaihingen and Potsdam datasets, GC2F-Net achieves superior results compared with existing methods in terms of mIoU, mF1, and OA. On the UAVid dataset, the proposed method also demonstrates strong scene adaptation capability, indicating favorable generalization performance under different imaging platforms, resolutions, and scene complexities.
To further discuss the robustness of GC2F-Net in practical remote sensing scenarios, we directly tested the same trained checkpoint on perturbed test images from the Vaihingen dataset. Two common visual perturbations were considered, including brightness variation and Gaussian blur, which simulate illumination changes and imaging blur that may occur during remote sensing image acquisition. As shown in Table 7, GC2F-Net achieves an mIoU of 76.90% on the clean test set. Under brightness variation and Gaussian blur, the model maintains mIoUs of 72.47% and 75.42%, respectively, with corresponding mIoU drops of 4.43 and 1.48 percentage points. These results indicate that GC2F-Net maintains relatively stable segmentation performance under common visual perturbations, further supporting its robustness in complex remote sensing scenes.
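The two perturbations can be reproduced in spirit with a few NumPy operations. The brightness factor and blur strength used in the paper are not stated, so the values below (`factor=1.3`, `sigma=1.0`) and the helper names are assumptions for illustration; the last lines simply recompute the reported mIoU drops from the quoted scores.

```python
import numpy as np

def adjust_brightness(image, factor):
    """image: float array in [0, 1] with shape (H, W, 3)."""
    return np.clip(image * factor, 0.0, 1.0)

def gaussian_kernel1d(sigma):
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()                      # normalized so brightness is preserved

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with reflect padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    conv = lambda v: np.convolve(v, k, mode="valid")
    tmp = np.apply_along_axis(conv, 0, padded)   # filter along rows
    return np.apply_along_axis(conv, 1, tmp)     # then along columns

rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))              # stand-in for a test patch
brighter = adjust_brightness(image, 1.3)
blurred = gaussian_blur(image, sigma=1.0)

clean = 76.90
perturbed = {"brightness": 72.47, "blur": 75.42}   # mIoU values quoted in the text
drops = {name: round(clean - miou, 2) for name, miou in perturbed.items()}
print(blurred.shape, drops)
```

The perturbed images would then be passed through the unchanged trained checkpoint, and the metrics recomputed exactly as for the clean test set.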
Compared with conventional spatial-domain-based remote sensing semantic segmentation methods, the experimental results indicate that relying solely on spatial contextual modeling makes it difficult to simultaneously capture global structural representation and boundary detail recovery. By integrating frequency-domain information with spatial features, GC2F-Net enhances global structural representation while improving the discriminative capability of local details, thereby exhibiting more robust segmentation performance in regions with complex textures and category boundaries.
Although GC2F-Net achieves competitive overall performance, the category-level results indicate that it still has limitations in several challenging scenarios. On the Potsdam dataset, some categories, such as Tree and Car, do not always achieve the highest IoU among all comparison methods. For the Tree category, misclassification may occur between trees and low vegetation due to their similar spectral and texture distributions, especially around ambiguous boundary regions. For the Car category, omission or boundary deviation may still occur when vehicles are small, densely distributed, or partially occluded by shadows. On the UAVid dataset, the Human category remains particularly challenging because pedestrians usually occupy only a very small number of pixels and are easily affected by occlusion, scale variation, motion blur, and oblique-view imaging. In addition, cluttered urban backgrounds and fragmented object boundaries may still lead to local misclassification. These observations indicate that GC2F-Net still has room for improvement in extremely small-object recognition, ambiguous vegetation boundary delineation, and highly complex urban scenes.
In addition to the above category-level limitations, spatial-frequency collaborative modeling introduces additional feature transformation and fusion processes, which may increase model complexity and computational cost. Therefore, achieving a better balance between cross-domain information interaction, segmentation accuracy, and model efficiency remains an important challenge for future research.
In the future, more lightweight and efficient spatial-frequency collaborative modeling strategies will be explored to reduce model complexity while preserving the complementary advantages of spatial and frequency domains. Specifically, we will investigate lightweight frequency-aware modules, reduced-channel feature transformation, and dynamic feature interaction mechanisms to decrease parameters and FLOPs. In addition, multimodal information fusion will be further studied by incorporating complementary data sources, such as DSM and SAR, to improve the robustness of semantic segmentation in complex remote sensing scenes. Future research will also explore small-model-based collaborative frameworks and multi-agent systems, in which lightweight modules cooperate for spatial feature extraction, frequency-domain enhancement, and decision fusion, aiming to balance segmentation accuracy, efficiency, robustness, and practical deployment. For large-scale practical applications, overlap-based slicing and MapReduce-style parallel inference will also be investigated to reduce slicing-grid effects on small or thin objects while improving large-scale inference efficiency.

7. Conclusions

This paper proposes a spatial-frequency collaborative segmentation network for semantic segmentation of high-resolution remote sensing images. The GCCM, FGEM, and LCFA modules in GC2F-Net achieve collaborative modeling of global structural representations and local detail information through category prior fusion, frequency-domain feature enhancement, and local category-aware interaction, thereby improving the segmentation accuracy and robustness of the model in complex scenes. Experimental results demonstrate that GC2F-Net achieves robust and competitive segmentation performance on the ISPRS Vaihingen, ISPRS Potsdam, and UAVid datasets, with mIoU values of 76.90%, 79.44%, and 67.04%, respectively, indicating favorable generalization capability and application potential. Overall, the proposed method provides an effective modeling paradigm for semantic segmentation of high-resolution remote sensing images. Future work will explore lightweight models and multimodal information fusion to enhance the cross-scene generalization capability and segmentation performance.

Author Contributions

Conceptualization, T.L., L.G. and B.L.; methodology, T.L. and H.Y.; software, T.L.; formal analysis, T.L. and L.G.; investigation, T.L. and B.L.; writing—original draft preparation, T.L.; writing—review and editing, T.L., L.G., H.Y. and B.L.; supervision, J.X.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 61273239.

Data Availability Statement

The data and the code of this paper are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z.; Zhu, L. A Review on Unmanned Aerial Vehicle Remote Sensing: Platforms, Sensors, Data Processing Methods, and Applications. Drones 2023, 7, 398.
  2. Soylu, B.E.; Guzel, M.S.; Bostanci, G.E.; Ekinci, F.; Asuroglu, T.; Acici, K. Deep-Learning-Based Approaches for Semantic Segmentation of Natural Scene Images: A Review. Electronics 2023, 12, 2730.
  3. Mashala, M.J.; Dube, T.; Mudereri, B.T.; Ayisi, K.K.; Ramudzuli, M.R. A Systematic Review on Advancements in Remote Sensing for Assessing and Monitoring Land Use and Land Cover Changes Impacts on Surface Water Resources in Semi-Arid Tropical Environments. Remote Sens. 2023, 15, 3926.
  4. Qiu, W.; Gu, L.; Gao, F.; Jiang, T. Building Extraction From Very High-Resolution Remote Sensing Images Using Refine-UNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6002905.
  5. Butilă, E.V.; Boboc, R.G. Urban Traffic Monitoring and Analysis Using Unmanned Aerial Vehicles (UAVs): A Systematic Literature Review. Remote Sens. 2022, 14, 2072–4292.
  6. Román, A.; Tovar-Sánchez, A.; Larrad, M.; Rubiano-Sánchez, F.J.; Zafra, J.M.; Piñeiro, R.; Castillo, Á.; López, F.A.; Vela, A.L.; Allende, A.; et al. UAV Imagery in Natural Disasters: Real-Time Damage Assessment of Flash Flooding Events. Ecol. Inform. 2025, 91, 103433.
  7. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep-Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8370–8396.
  8. Zhang, L.; Zhang, L. Artificial Intelligence for Remote Sensing Data Analysis: A Review of Challenges and Opportunities. IEEE Geosci. Remote Sens. Mag. 2022, 10, 270–294.
  9. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 2072–4292.
  10. Zhou, Y.; Wu, W.; Wang, H.; Zhang, X.; Yang, C.; Liu, H. Identification of Soil Texture Classes Under Vegetation Cover Based on Sentinel-2 Data with SVM and SHAP Techniques. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3758–3770.
  11. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820.
  12. Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542.
  13. Vu, K.H.; Nguyen, D.P.; Pham, H.A. A Comprehensive Investigation into Semantic Segmentation and Its Applications. SN Comput. Sci. 2025, 6, 880.
  14. Xiao, L.; Song, J.; Xie, X.; Fan, C. Enhanced Medical Image Segmentation Using U-Net with Residual Connections and Dual Attention Mechanism. Eng. Appl. Artif. Intell. 2025, 153, 110794.
  15. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
  17. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
  18. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
  19. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3141–3149.
  20. Yuan, Y.; Chen, X.; Wang, J. Object-Contextual Representations for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 173–190.
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021.
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Los Alamitos, CA, USA, 10–17 October 2021; pp. 9992–10002.
  24. Liu, Y.; Gao, K.; Wang, H.; Yang, Z.; Wang, P.; Ji, S.; Huang, Y.; Zhu, Z.; Zhao, X. A Transformer-Based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 133, 104083.
  25. Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9914–9925.
  26. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; pp. 12077–12090.
  27. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-Like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
  28. Zhang, Y.; Jiang, L.; Chen, F.; Xie, J.; Zhang, B.; He, G.; Lin, S. SegCFT: Context-Aware Fourier Transform for Efficient Semantic Segmentation. Neurocomputing 2024, 596, 127946.
  29. Ruan, J.; Gao, J.; Xie, M.; Xiang, S. Learning Multi-Axis Representation in Frequency Domain for Medical Image Segmentation. Mach. Learn. 2025, 114, 10.
  30. Li, Y.; Liu, Z.; Yang, J.; Zhang, H. Wavelet Transform Feature Enhancement for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5644.
  31. Yang, Y.; Yuan, G.; Li, J. SFFNet: A Wavelet-Based Spatial and Frequency Domain Fusion Network for Remote Sensing Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617.
  32. Fu, J.; Yu, Y.; Wang, L. FSDENet: A Frequency and Spatial Domains-Based Detail Enhancement Network for Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 19378–19392.
  33. He, Y.; Lu, Z.; Huan, H. SF3Net: Frequency-Domain Enhanced Segmentation Network for High-Resolution Remote Sensing Imagery. Remote Sens. 2025, 17, 3734.
  34. Rao, Y.; Zhao, W.; Zhu, Z.; Zhou, J.; Lu, J. GFNet: Global Filter Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10960–10973.
  35. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
Figure 1. Overall framework of GC2F-Net.
Figure 2. Structure of GCCM.
Figure 3. Structure of FGEM.
Figure 4. Structure of 1/32 LCFA.
Figure 5. Performance–efficiency trade-off comparison of different methods on the ISPRS Vaihingen dataset. FLOPs and FPS are measured using a 512 × 512 input image on a single NVIDIA RTX 3090 GPU.
Figure 6. Qualitative visual comparison on the ISPRS Vaihingen dataset.
Figure 7. Qualitative visual comparison on the ISPRS Potsdam dataset.
Figure 8. Qualitative visual comparison on the UAVid dataset.
Figure 9. Comparison of segmentation results before and after applying GCCM.
Figure 10. Comparison of segmentation results before and after applying FGEM.
Figure 11. Comparison of segmentation results before and after applying LCFA.
Figure 12. Comparison with the baseline and full models under different module removal settings.
Table 1. Performance–efficiency comparison on the ISPRS Vaihingen dataset.

| Method | Params (M) | FLOPs (G) | FPS (img/s) | mIoU (%) |
|---|---|---|---|---|
| DANet | 40.36 | 51.68 | 146.95 | 66.94 |
| DeeplabV3 | 41.67 | 52.08 | 144.96 | 67.60 |
| FPN | 34.13 | 90.28 | 99.35 | 72.43 |
| PSPNet | 34.00 | 48.30 | 141.49 | 67.24 |
| OCRNet | 46.58 | 52.82 | 152.20 | 67.37 |
| SegFormer-R50 | 24.76 | 55.78 | 163.70 | 75.08 |
| UNetFormer | 24.11 | 48.22 | 111.36 | 72.49 |
| Ours | 26.06 | 59.26 | 103.56 | 76.90 |
Table 2. Quantitative comparison on the ISPRS Vaihingen dataset (%). Bold values indicate the best result in each column.

| Method | Backbone | ImSurf IoU | Bui. IoU | LowVeg IoU | Tree IoU | Car IoU | Clutter IoU | mIoU | mF1 | OA |
|---|---|---|---|---|---|---|---|---|---|---|
| DANet | ResNet-50 | 83.35 | 89.51 | 71.21 | 80.47 | 49.18 | 27.89 | 66.94 | 77.89 | 89.42 |
| DeeplabV3 | ResNet-50 | 83.78 | 89.69 | 70.75 | 80.38 | 45.73 | 35.26 | 67.60 | 78.77 | 89.50 |
| FPN | ResNet-50 | 86.44 | 90.86 | 73.81 | 82.40 | 77.70 | 23.36 | 72.43 | 81.42 | 90.92 |
| PSPNet | ResNet-50 | 83.76 | 89.77 | 71.41 | 80.54 | 47.37 | 30.63 | 67.24 | 78.25 | 89.60 |
| OCRNet | ResNet-50 | 84.24 | 90.29 | 72.07 | 80.46 | 47.20 | 29.96 | 67.37 | 78.25 | 89.81 |
| SegFormer-R50 | ResNet-50 | 87.18 | 91.35 | 73.58 | 82.36 | 79.81 | 36.20 | 75.08 | 84.28 | 91.16 |
| UNetFormer | ResNet-50 | 86.55 | 90.01 | 72.26 | 82.02 | 75.88 | 28.23 | 72.49 | 81.98 | 90.60 |
| Ours | ResNet-50 | **87.91** | **92.06** | **74.21** | **82.79** | **81.11** | **43.29** | **76.90** | **85.86** | **91.56** |
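As a quick consistency check on Table 2, the Python sketch below recomputes each method's mIoU as the unweighted mean of its six per-class IoU values (clutter included) and compares it with the reported mIoU column. The averaging rule is an assumption inferred from the numbers, not something stated in this section, and the names `TABLE2_ROWS`, `mean_iou`, and `max_deviation` are illustrative only.

```python
# Values copied from Table 2: (per-class IoU list, reported mIoU), all in %.
# Class order: ImSurf, Bui., LowVeg, Tree, Car, Clutter.
TABLE2_ROWS = {
    "DANet": ([83.35, 89.51, 71.21, 80.47, 49.18, 27.89], 66.94),
    "DeeplabV3": ([83.78, 89.69, 70.75, 80.38, 45.73, 35.26], 67.60),
    "FPN": ([86.44, 90.86, 73.81, 82.40, 77.70, 23.36], 72.43),
    "PSPNet": ([83.76, 89.77, 71.41, 80.54, 47.37, 30.63], 67.24),
    "OCRNet": ([84.24, 90.29, 72.07, 80.46, 47.20, 29.96], 67.37),
    "SegFormer-R50": ([87.18, 91.35, 73.58, 82.36, 79.81, 36.20], 75.08),
    "UNetFormer": ([86.55, 90.01, 72.26, 82.02, 75.88, 28.23], 72.49),
    "Ours": ([87.91, 92.06, 74.21, 82.79, 81.11, 43.29], 76.90),
}


def mean_iou(per_class_iou):
    """Unweighted mean over classes (assumed definition of mIoU)."""
    return sum(per_class_iou) / len(per_class_iou)


def max_deviation(rows):
    """Largest gap between the recomputed and the reported mIoU."""
    return max(abs(mean_iou(ious) - reported) for ious, reported in rows.values())


if __name__ == "__main__":
    for name, (ious, reported) in TABLE2_ROWS.items():
        print(f"{name:14s} recomputed={mean_iou(ious):.3f} reported={reported:.2f}")
    print(f"max deviation: {max_deviation(TABLE2_ROWS):.4f}")
```

Under this assumption every row matches the reported value to within two-decimal rounding (deviation below 0.01), which suggests that the clutter class is counted in the Vaihingen mIoU reported here.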
Table 3. Quantitative comparison on the ISPRS Potsdam dataset (%). Bold values indicate the best result in each column.

| Method | Backbone | ImSurf IoU | Bui. IoU | LowVeg IoU | Tree IoU | Car IoU | Clutter IoU | mIoU | mF1 | OA |
|---|---|---|---|---|---|---|---|---|---|---|
| DANet | ResNet-50 | 84.89 | 92.74 | 76.85 | 79.19 | 73.99 | **44.93** | 75.43 | 85.07 | 90.15 |
| DeeplabV3 | ResNet-50 | 82.49 | 90.56 | 75.91 | 78.70 | 69.90 | 38.59 | 72.69 | 82.97 | 89.03 |
| FPN | ResNet-50 | 87.38 | 93.57 | 77.92 | **81.01** | **93.17** | 40.18 | 78.87 | 86.81 | 91.11 |
| PSPNet | ResNet-50 | 85.28 | 92.60 | 76.54 | 79.53 | 77.33 | 41.96 | 75.54 | 84.98 | 90.16 |
| OCRNet | ResNet-50 | 85.20 | 93.01 | 77.50 | 80.06 | 75.73 | 42.68 | 75.70 | 85.11 | 90.42 |
| SegFormer-R50 | ResNet-50 | 85.85 | 92.61 | 76.12 | 79.63 | 92.38 | 33.78 | 76.73 | 85.03 | 90.17 |
| UNetFormer | ResNet-50 | 85.04 | 91.98 | 76.81 | 78.37 | 89.63 | 41.10 | 77.66 | 85.88 | 89.94 |
| Ours | ResNet-50 | **87.67** | **93.93** | **78.39** | 79.59 | 92.82 | 44.26 | **79.44** | **87.41** | **91.23** |
Table 4. Quantitative comparison on the UAVid dataset (%). Bold values indicate the best result in each column.

| Method | Bui. IoU | Road IoU | Tree IoU | Veg. IoU | Mo.Car IoU | St.Car IoU | Human IoU | Clutter IoU | mIoU | mF1 | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DANet | 90.83 | 76.19 | 76.89 | 67.47 | 56.22 | 52.73 | 12.36 | 64.71 | 62.18 | 73.85 | 87.17 |
| DeeplabV3 | **91.45** | 77.17 | 77.26 | 67.63 | 55.68 | 56.19 | 14.59 | 65.35 | 63.17 | 74.81 | 87.49 |
| FPN | 91.17 | **78.26** | 77.33 | 68.33 | 62.06 | 56.65 | 27.23 | **65.76** | 65.85 | 77.83 | **87.69** |
| PSPNet | 90.94 | 77.00 | 76.80 | 67.19 | 56.01 | 57.34 | 14.41 | 65.24 | 63.12 | 74.80 | 87.30 |
| OCRNet | 90.99 | 76.90 | 76.71 | 68.60 | 53.01 | 52.32 | 11.57 | 64.52 | 61.80 | 73.44 | 87.27 |
| SegFormer-R50 | 90.96 | 77.52 | 77.07 | 67.44 | 61.60 | 60.22 | **28.83** | 64.70 | 66.04 | 78.11 | 87.36 |
| UNetFormer | 90.93 | 77.60 | **77.56** | **68.70** | **68.40** | 59.45 | 19.82 | 63.90 | 65.80 | 77.29 | 87.53 |
| Ours | 91.04 | 77.65 | 77.28 | 68.67 | 68.22 | **63.10** | 25.57 | 64.81 | **67.04** | **78.65** | 87.63 |
Table 5. Ablation experiments on the ISPRS Vaihingen dataset.

| Method | mIoU (%) | mF1 (%) | OA (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|---|
| Baseline | 74.16 | 83.46 | 91.10 | 24.973 | 58.174 |
| Baseline+GCCM | 74.86 | 83.98 | 91.19 | 25.005 | 58.174 |
| Baseline+FGEM | 75.08 | 84.41 | 91.12 | 25.006 | 58.182 |
| Baseline+LCFA | 75.11 | 84.34 | 91.20 | 25.991 | 59.396 |
| Full | 76.90 | 85.86 | 91.56 | 26.056 | 59.405 |
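The ablation numbers in Table 5 can also be read additively. The sketch below uses only values copied from the table and compares the sum of the single-module increments over the baseline with the increment of the full model. The helper name `deltas` is illustrative, and the additive reading is a simplification added here, not an analysis taken from the paper.

```python
# (mIoU %, Parameters M, FLOPs G) copied from Table 5.
TABLE5 = {
    "Baseline": (74.16, 24.973, 58.174),
    "Baseline+GCCM": (74.86, 25.005, 58.174),
    "Baseline+FGEM": (75.08, 25.006, 58.182),
    "Baseline+LCFA": (75.11, 25.991, 59.396),
    "Full": (76.90, 26.056, 59.405),
}


def deltas(name, table=TABLE5):
    """Increase of (mIoU, params, FLOPs) over the baseline row."""
    base = table["Baseline"]
    return tuple(round(v - b, 3) for v, b in zip(table[name], base))


# Increment contributed by each module when added alone.
single = [deltas(n) for n in ("Baseline+GCCM", "Baseline+FGEM", "Baseline+LCFA")]
# Column-wise sum of the three single-module increments.
summed = tuple(round(sum(col), 3) for col in zip(*single))
full = deltas("Full")

if __name__ == "__main__":
    print("sum of single-module increments (mIoU, M, G):", summed)
    print("full-model increment            (mIoU, M, G):", full)
```

With these numbers the single-module mIoU gains sum to 2.57 points while the full model gains 2.74 points, and the parameter increments add up to the full model's 1.083 M, almost all of which (1.018 M) comes from LCFA.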
Table 6. Impact of different module removals on performance (%) on the ISPRS Vaihingen dataset. ✓ marks a module that is kept and × a module that is removed. ↓ indicates the decrease, in percentage points, compared with the full model.

| Method | GCCM | FGEM | LCFA | mIoU (%) | mF1 (%) | OA (%) |
|---|---|---|---|---|---|---|
| w/o GCCM | × | ✓ | ✓ | 76.07 (↓0.83) | 85.25 (↓0.61) | 91.32 (↓0.24) |
| w/o FGEM | ✓ | × | ✓ | 75.62 (↓1.28) | 84.84 (↓1.02) | 91.51 (↓0.05) |
| w/o LCFA | ✓ | ✓ | × | 74.86 (↓2.04) | 83.90 (↓1.96) | 91.29 (↓0.27) |
| Full | ✓ | ✓ | ✓ | 76.90 | 85.86 | 91.56 |
Table 7. Robustness evaluation of GC2F-Net under common visual perturbations on the Vaihingen dataset (%).

| Condition | OA (%) | mF1 (%) | mIoU (%) | mIoU Drop (%) |
|---|---|---|---|---|
| Clean | 91.56 | 85.86 | 76.90 | - |
| Brightness variation | 89.59 | 82.83 | 72.47 | 4.43 |
| Gaussian blur | 91.02 | 84.78 | 75.42 | 1.48 |
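The "mIoU Drop" column in Table 7 is an absolute difference in percentage points. The sketch below recomputes it from the table values and also reports the drop relative to the clean mIoU. The relative figure is not given in the paper and is added here only for illustration, and the function names are hypothetical.

```python
# mIoU values (%) copied from Table 7.
CLEAN_MIOU = 76.90
PERTURBED_MIOU = {
    "Brightness variation": 72.47,
    "Gaussian blur": 75.42,
}


def absolute_drop(clean, perturbed):
    """Drop in percentage points, as reported in the table."""
    return round(clean - perturbed, 2)


def relative_drop(clean, perturbed):
    """Drop as a percentage of the clean score (illustrative addition)."""
    return round(100.0 * (clean - perturbed) / clean, 2)


if __name__ == "__main__":
    for condition, miou in PERTURBED_MIOU.items():
        print(
            f"{condition:22s} "
            f"absolute={absolute_drop(CLEAN_MIOU, miou):.2f} points, "
            f"relative={relative_drop(CLEAN_MIOU, miou):.2f}% of clean mIoU"
        )
```

The absolute drops reproduce the table (4.43 and 1.48 points), which corresponds to losing roughly 5.8% and 1.9% of the clean mIoU, respectively.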
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, T.; Guo, L.; Xin, J.; Yu, H.; Li, B. GC2F-Net: A Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network for Remote Sensing Semantic Segmentation. Remote Sens. 2026, 18, 1600. https://doi.org/10.3390/rs18101600

AMA Style

Li T, Guo L, Xin J, Yu H, Li B. GC2F-Net: A Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network for Remote Sensing Semantic Segmentation. Remote Sensing. 2026; 18(10):1600. https://doi.org/10.3390/rs18101600

Chicago/Turabian Style

Li, Teng, Laide Guo, Junchang Xin, Hongfei Yu, and Bowen Li. 2026. "GC2F-Net: A Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network for Remote Sensing Semantic Segmentation" Remote Sensing 18, no. 10: 1600. https://doi.org/10.3390/rs18101600

APA Style

Li, T., Guo, L., Xin, J., Yu, H., & Li, B. (2026). GC2F-Net: A Global Category-Center Prior-Guided Spatial-Frequency Collaborative Network for Remote Sensing Semantic Segmentation. Remote Sensing, 18(10), 1600. https://doi.org/10.3390/rs18101600
