1. Introduction
With the rapid advancement in high-quality RS imaging technologies, the acquisition of large volumes of high-resolution RS imagery has become possible. At the same time, the progress of artificial intelligence and deep learning has provided powerful tools for analyzing such data. High-resolution RS images are capable of capturing detailed surface information and complex object characteristics, while also offering insights into ecological environments and human activities [1]. By extracting and analyzing relevant information from these images, a comprehensive understanding of natural environments and anthropogenic patterns can be achieved from multiple perspectives. This deeper understanding plays a critical role in supporting analysis, monitoring, and decision-making across a wide range of domains [2]. These capabilities have driven the application of RS semantic segmentation across several domains. In environmental monitoring, it supports land and water body segmentation [3] and disaster assessment [4]; in agricultural management, it enables farmland mapping [5] and crop yield estimation [6]; and in urban planning [7], it facilitates road extraction and infrastructure analysis [8]. Such diverse applications highlight the essential role of semantic segmentation in transforming raw RS imagery into actionable information for environmental assessment, resource management, and decision support.
Traditional image segmentation methods primarily rely on spectral, spatial, and textural features for segmenting RS images [9]. However, these approaches struggle to generalize to complex and diverse high-resolution RS data because of their dependence on handcrafted features and limited capacity for nonlinear representation. Consequently, their performance in large-scale and heterogeneous scenarios remains unsatisfactory. With the advent of deep learning, CNNs, Transformers, and Mamba have achieved remarkable progress in semantic segmentation of RS images. In recent years, CNNs have attained significant success in this field owing to their strong capability for local feature modeling. FCN [10], as a milestone in applying CNNs to image segmentation, substantially improved segmentation accuracy. Subsequently, U-Net [11], with its encoder–decoder architecture incorporating skip connections, has become a standard framework in image segmentation. Nevertheless, the inherent limitations of convolutional kernels restrict the receptive field, hindering the ability to capture long-range dependencies. To address this issue, CCNet [12] integrates a criss-cross attention module to capture long-distance contextual information. DANet [13] employs a dual attention mechanism to enhance feature representation in both the spatial and channel dimensions, thus improving segmentation precision. DeepLabv3 [14] combines dilated convolutions with an atrous spatial pyramid pooling (ASPP) structure, further enhancing performance. In addition, several studies have focused on alleviating boundary ambiguity. Ref. [15] replaced skip connections with boundary attention to more effectively restore object boundaries. Ref. [16] incorporated boundary information into multimodal feature fusion, improving the accuracy of boundary segmentation. Ref. [17] proposed a novel prototype matching strategy that jointly models object bodies and boundaries to mitigate feature aliasing at object edges.
Although CNNs have achieved remarkable success in many computer vision tasks, they are inherently constrained by the limited receptive field of convolutional kernels, making it challenging to capture multi-scale features in RS images. Transformers [18], leveraging self-attention mechanisms, effectively capture global contextual information and demonstrate strong performance in modeling long-range dependencies. Ref. [19] proposed combining a pyramid structure with attention mechanisms to enhance the model's ability to capture global information. Similarly, ref. [20] introduced lightweight attention modules that preserve global perception while reducing computational cost. However, Transformers often underperform in modeling fine-grained local details. To address this limitation, Swin Transformer [21] introduced a shifted window mechanism to capture multi-scale global features, achieving superior performance in fine-grained classification and boundary recognition. LGBSwin [22] further enhanced global modeling capability by integrating spatial and channel attention, adaptively fusing low- and high-level features to extract boundary information and mitigate boundary ambiguity. In addition, segmentation networks relying solely on CNNs are limited in capturing global representations, while pure Transformer-based networks tend to lose fine-grained local details. Consequently, researchers have increasingly explored hybrid architectures that integrate CNNs with Transformers. For instance, CCTNet [23] couples CNN and Transformer networks through a lightweight adaptive fusion module, effectively integrating local information with global context and demonstrating the effectiveness of hybrid models in RS image segmentation. LETNet [24] embeds Transformer and CNN components so that they complement each other's limitations. DAFormer [25] and SegFormer [26] combine the local feature extraction strength of CNNs with the long-range dependency modeling capability of Transformers, delivering outstanding performance in semantic segmentation, particularly in complex and dynamic scenarios. In recent years, Mamba [27] has emerged as a research focus due to its excellent long-range modeling ability and linear computational complexity. VMamba [28] was the first to apply Mamba to image segmentation, verifying its effectiveness in vision tasks. RS3Mamba [29] adopted a dual-encoder structure with Mamba and CNN, significantly improving segmentation accuracy through feature fusion. Overall, existing Transformer-based, hybrid CNN–Transformer, and Mamba architectures have demonstrated strong capabilities in capturing global contextual information and integrating multi-scale spatial features. However, these models primarily operate in the spatial domain and rarely exploit frequency-domain representations, limiting their ability to handle the complex textures, fine boundaries, and noise-prone details commonly found in remote sensing images. This gap motivates our study to incorporate frequency-domain feature learning into the segmentation framework, aiming to achieve a more balanced modeling of spatial and frequency information for improved segmentation robustness and accuracy.
Frequency-domain techniques have attracted significant attention in RS image semantic segmentation due to their ability to effectively capture both global and local features. Methods such as Fourier and wavelet transforms decompose RS images into frequency components, enabling more efficient extraction of fine details and global structures. FcaNet [30] introduces a frequency channel attention mechanism that enhances feature selection by leveraging both spatial and frequency information to improve the performance of CNNs. XNet [31] employs a wavelet-based network to integrate low- and high-frequency features, thereby improving semantic segmentation of biomedical images in both fully supervised and semi-supervised settings through the joint use of spatial and frequency data. SpectFormer [32] incorporates frequency-domain information and attention mechanisms into the ViT framework, enhancing its capability to handle visual tasks. In addition, SFFNet [33] adopts a wavelet-based spatial–frequency fusion network to effectively combine spatial and frequency information, thereby improving segmentation performance in remote sensing imagery. Moreover, RS images are often affected by high-frequency noise resulting from sensor limitations, atmospheric interference, and complex surface textures. Such noise overlaps with genuine high-frequency details, making it difficult to distinguish informative signals from irrelevant variations and often leading to degraded segmentation accuracy. To mitigate these effects, frequency-domain techniques have been introduced to enhance useful frequency information and suppress redundant components [34]. However, most existing approaches rely on one-dimensional frequency modeling, treating the frequency domain as an auxiliary enhancement rather than a primary representation. Their limited capacity for multi-dimensional frequency learning and lack of explicit noise suppression strategies can lead to the loss of critical spatial information, restricting their effectiveness in complex real-world scenarios.
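To make the general idea concrete (this is only an illustration of frequency-domain filtering, not the method proposed in this article), the following sketch suppresses the high-frequency content of a feature tensor with a simple low-pass mask in the Fourier domain; the circular mask shape and the cutoff value are arbitrary illustrative choices.

```python
import torch

def fourier_lowpass(x: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Suppress high-frequency content of a (B, C, H, W) tensor.

    `cutoff` is the fraction of the normalized frequency radius to keep;
    it is a purely illustrative parameter.
    """
    _, _, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    # Circular low-pass mask centred on the zero-frequency component.
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).float()
    spec = spec * mask  # keep low frequencies, zero out the rest
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)), norm="ortho")
    return out.real

x = torch.randn(1, 3, 64, 64)        # stand-in for a noisy image patch
print(fourier_lowpass(x).shape)      # torch.Size([1, 3, 64, 64])
```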
Based on the above analysis, we propose a spatial–frequency feature fusion network (SF3Net) with a U-shaped encoder–decoder architecture. The network effectively integrates frequency domain features while preserving spatial features enriched with semantic information. Specifically, to effectively extract comprehensive frequency domain features, we propose the frequency feature stereoscopic learning module (FFSL), which utilizes Fourier transform to capture frequency information from three directions. Moreover, the spatial feature aggregation module (SFAM) enhances spatial contextual features through weighted extraction. Subsequently, the spatial–frequency feature fusion module (SFFM) integrates frequency domain features from FFSL with spatial features from SFAM, enabling comprehensive feature fusion. In addition, a feature selection module (FSM) is introduced to select the shallow features of the encoder to compensate for the loss of detail during the encoder downsampling process.
As depicted in Figure 1, SF3Net achieves a good balance between performance and model size. The contributions of this article are as follows.
A Frequency Feature Stereoscopic Learning (FFSL) module is proposed, in which adaptive frequency-domain weights are learned to suppress high-frequency noise and enhance informative spectral components through multi-dimensional Fourier modeling.
A Spatial Feature Aggregation Module (SFAM) is designed to preserve structural details and aggregate spatial context, compensating for information loss during feature extraction.
SF3Net is constructed, integrating spatial features from SFAM and frequency features from FFSL through a Spatial–Frequency Feature Fusion Module (SFFM). This unified framework jointly models spatial and frequency information, enhancing multi-scale perception and achieving more accurate and robust segmentation in complex RS imagery.
2. Methodology
In this section, we present the overall structure of the proposed SF3Net and subsequently introduce four important modules in SF3Net, namely, SFFM, FFSL, SFAM, and FSM.
2.1. Network Architecture
The structure of SF3Net is shown in Figure 2; it is designed as a U-shaped encoder–decoder architecture. In the encoding phase, MobileNetV2 [35] is used for comprehensive spatial feature extraction, with a channel adapter adjusting the number of channels to [16, 32, 48, 96, 128] to meet the decoder's lightweight design requirements. In the decoding phase, the SFFM performs frequency-domain learning and spatial-domain enhancement on the raw features, which still contain redundant information, and then fuses the features processed in the two representation domains. The SFFM consists of two branches: the frequency-domain feature stereoscopic learning branch and the spatial-domain feature aggregation branch. The frequency-domain feature stereoscopic learning branch utilizes the Fast Fourier Transform (FFT) to map the raw features into the frequency domain, learning rich frequency-domain information from three mutually perpendicular directions. The spatial-domain feature aggregation branch uses two parallel dilated convolutions to expand the receptive field, followed by a series of average pooling, convolution, and activation operations to recalibrate the feature channels; soft pooling is then applied to aggregate pixel-level spatial information along the two spatial directions. Additionally, the FSM is placed in the skip connections to adaptively select shallow features from the encoder, compensating for the loss of detail information during the downsampling process.
Specifically, for an input image of height $h$ and width $w$, the encoding stage of feature extraction yields five features of different scales, $F_i$ with $i = 1, \dots, 5$, whose downsampled heights and widths are denoted by $H$ and $W$. The feature channels are then adjusted to match the decoder using a channel adapter consisting of a series of depthwise separable convolutions (DWConv), expressed as

$$\hat{F}_i = \mathrm{DWConv}(F_i), \quad i = 1, 2, 3, 4, 5,$$

which yields five channel-adapted features $\hat{F}_1, \dots, \hat{F}_5$ of different scales.
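As an illustrative (unofficial) sketch, such a channel adapter could be built from depthwise separable convolutions as follows; the block composition (GroupNorm and GELU) and the MobileNetV2 stage channels listed in `encoder_channels` are our assumptions, while the decoder channel counts come from the text.

```python
import torch
import torch.nn as nn

class DWConvAdapter(nn.Module):
    """Depthwise separable convolution mapping encoder channels to decoder channels."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.norm = nn.GroupNorm(4, out_ch)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

# Hypothetical MobileNetV2 stage channels; the decoder channels come from the text.
encoder_channels = [16, 24, 32, 96, 320]
decoder_channels = [16, 32, 48, 96, 128]
adapters = nn.ModuleList(
    DWConvAdapter(c_in, c_out) for c_in, c_out in zip(encoder_channels, decoder_channels)
)
```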
Subsequently, $\hat{F}_1, \dots, \hat{F}_5$ are used as the raw data for the decoding stage. In decoding stage 5, frequency-domain feature stereoscopic learning and spatial-domain feature aggregation are performed on $\hat{F}_5$, and the features from both domains are fused to produce the output features of the SFFM.
Next, the output features of the SFFM are upsampled and element-wise added to the most relevant shallow features, adaptively selected by the FSM, to form the input of the next decoder stage. Repeating this procedure yields the outputs of the five decoder stages, and a segmentation head is finally used to generate the pixel-level prediction.
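For illustration, one decoding step described above could be sketched as follows, with the SFFM and FSM treated as black boxes; the bilinear upsampling mode and the exact ordering of operations are assumptions on our part.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decoder_stage(sffm: nn.Module, fsm: nn.Module,
                  deep_feat: torch.Tensor, skip_feat: torch.Tensor) -> torch.Tensor:
    """One decoding step: SFFM on the deep feature, upsample, add the FSM-selected skip.

    Channel counts of the two inputs are assumed to already match.
    """
    fused = sffm(deep_feat)                                    # spatial-frequency fusion
    fused = F.interpolate(fused, size=skip_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
    return fused + fsm(skip_feat)                              # element-wise addition
```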
2.2. Spatial–Frequency Feature Fusion Module
We designed the SFFM to efficiently incorporate frequency-domain features with spatial features, as shown in Figure 2. The SFFM consists of two main branches: the spatial-domain feature aggregation branch and the frequency-domain feature stereoscopic learning branch. In the first branch, the SFAM is dedicated to spatial feature aggregation, enhancing spatial contextual information by using parallel dilated convolutions to expand the receptive field and applying soft pooling for pixel-level feature aggregation. Meanwhile, the FFSL branch performs frequency-domain feature learning, transforming the original features into the frequency domain to capture detailed frequency characteristics. The outputs of the two branches, $F_s$ and $F_f$, are combined and passed through a Group Normalization layer. The resulting features, $F_a$, are then fed into a feed-forward network (FFN). This design allows the SFFM to efficiently combine frequency- and spatial-domain features, leading to a more comprehensive feature representation for the semantic segmentation task. The SFFM can be represented as follows:

$$F_s = \mathcal{M}_{\mathrm{SFAM}}(F_{\mathrm{in}}), \quad F_f = \mathcal{M}_{\mathrm{FFSL}}(F_{\mathrm{in}}), \quad F_a = \mathrm{GN}(F_s \oplus F_f), \quad F_{\mathrm{out}} = \mathrm{FFN}(F_a),$$

where $\mathcal{M}_{\mathrm{SFAM}}$ and $\mathcal{M}_{\mathrm{FFSL}}$ denote the feature-mapping operations of the SFAM and the FFSL, respectively, ⊕ represents element-wise addition, and $F_{\mathrm{in}}$, $F_s$, $F_f$, $F_a$, and $F_{\mathrm{out}}$ represent the input features, the mapped spatial features, the frequency features, the aggregated features, and the output features of the SFFM, respectively.
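A minimal PyTorch sketch of this fusion step is given below, with the SFAM and FFSL passed in as submodules; the convolutional form and expansion ratio of the FFN, as well as the GroupNorm group count, are assumptions rather than details taken from the text.

```python
import torch
import torch.nn as nn

class SFFM(nn.Module):
    """Fuse the spatial (SFAM) and frequency (FFSL) branches, then refine with an FFN."""

    def __init__(self, sfam: nn.Module, ffsl: nn.Module, channels: int, expansion: int = 2):
        super().__init__()
        self.sfam, self.ffsl = sfam, ffsl
        self.norm = nn.GroupNorm(4, channels)
        self.ffn = nn.Sequential(                    # simple convolutional FFN (assumed form)
            nn.Conv2d(channels, channels * expansion, 1),
            nn.GELU(),
            nn.Conv2d(channels * expansion, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_spatial = self.sfam(x)                # spatial-domain feature aggregation branch
        f_freq = self.ffsl(x)                   # frequency feature stereoscopic learning branch
        f_agg = self.norm(f_spatial + f_freq)   # element-wise addition + Group Normalization
        return self.ffn(f_agg)
```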
2.3. Spatial Feature Aggregation Module
Although depthwise separable convolution effectively reduces memory consumption, it somewhat weakens the network's ability to extract spatial structural information. Moreover, the loss of small-scale features caused by the mutual occlusion of ground objects in RS images requires more precise spatial structural information to mitigate. To address this issue, we introduce the SFAM to encode spatial structural information more precisely, as shown in Figure 3. First, a dilated convolution with a dilation rate of 2 is applied to reduce the input channel size $C$ to $C/2$, expanding the receptive field while reducing computational complexity. Then, the receptive field is further expanded by two parallel dilated convolutions with dilation rates of 1 and 3, respectively, and the results of both branches are concatenated for subsequent processing. Next, feature recalibration along the channel dimension is performed using average pooling followed by two consecutive convolution and activation operations. A soft-pool [36] operation then introduces attention in both the horizontal and vertical dimensions to account for pixel relationships, allowing the SFAM to overcome the limitations of convolution operations in capturing global spatial structural information. Finally, a 1 × 1 convolution adjusts the number of channels from $C/2$ back to $C$. This enhances the network's ability to perceive spatial information in feature maps, thereby improving its representational capacity.
For an input feature $F \in \mathbb{R}^{H \times W \times C}$, the computation follows the pipeline described above. In the corresponding formulation, $i$, $j$, and $k$ denote the indices along the vertical direction, the horizontal direction, and the channel dimension, respectively; $D_{3,2}(\cdot)$ denotes a dilated convolution with kernel size 3 and dilation rate 2; $\mathrm{Cat}(\cdot)$ denotes concatenation along the channel dimension; $\mathrm{Conv}_{1\times 1}(\cdot)$ denotes a 1 × 1 convolution; and ⊙ stands for element-wise multiplication. Each dilated convolution layer is followed by Group Normalization and the GELU activation function. The two soft-pooled attention maps capture the pixel-level weights of the feature map along the two spatial directions; multiplying them element-wise with the recalibrated features, as shown in Equation (3), yields the position-aware output feature map.
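An illustrative sketch of this pipeline is given below. The directional soft pooling is implemented here as a softmax-weighted aggregation along each spatial axis, and the per-branch channel split, normalization, and activation choices are our interpretation rather than details confirmed by the text.

```python
import torch
import torch.nn as nn

def dilated_block(in_ch: int, out_ch: int, dilation: int) -> nn.Sequential:
    """3x3 dilated conv followed by GroupNorm and GELU (normalization/activation assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation, bias=False),
        nn.GroupNorm(4, out_ch),
        nn.GELU(),
    )

class SFAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2
        self.reduce = dilated_block(channels, mid, dilation=2)    # C -> C/2, larger receptive field
        self.branch1 = dilated_block(mid, mid // 2, dilation=1)   # parallel dilated convolutions
        self.branch3 = dilated_block(mid, mid // 2, dilation=3)
        self.recalibrate = nn.Sequential(                         # channel recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid, 1), nn.GELU(),
            nn.Conv2d(mid, mid, 1), nn.Sigmoid(),
        )
        self.expand = nn.Conv2d(mid, channels, 1)                 # C/2 -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.reduce(x)
        f = torch.cat([self.branch1(f), self.branch3(f)], dim=1)
        f = f * self.recalibrate(f)
        # Soft pooling along each spatial axis: softmax-weighted aggregation produces
        # one directional descriptor per axis, used here as position-aware attention.
        w_vert = (torch.softmax(f, dim=-2) * f).sum(dim=-2, keepdim=True)   # (B, C/2, 1, W)
        w_horiz = (torch.softmax(f, dim=-1) * f).sum(dim=-1, keepdim=True)  # (B, C/2, H, 1)
        f = f * w_vert * w_horiz
        return self.expand(f)

print(SFAM(48)(torch.randn(2, 48, 32, 32)).shape)   # torch.Size([2, 48, 32, 32])
```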
2.4. Frequency Feature Stereoscopic Learning Module
In the field of RS image segmentation, recent methods have primarily focused on obtaining richer spatial-domain information while often neglecting the significance of the frequency domain. In the spatial domain, the complex backgrounds of RS images frequently lead to blurred segmentation boundaries. In contrast, the frequency domain is more sensitive to grayscale variations, and different objects occupy distinct frequency bands, making them easier to differentiate [37]. Although prior work, such as [31,32,38], has introduced frequency-domain features, the global information obtained remains insufficient and lacks effective integration with spatial features. Moreover, RS images are often affected by high-frequency noise caused by sensor limitations, atmospheric interference, and fine-scale clutter. This noise overlaps with true edges and details, making it difficult to distinguish informative from irrelevant frequency components and thus reducing segmentation robustness. To overcome these issues and enhance segmentation accuracy, we design the FFSL module.
As shown in Figure 4, the input feature $X$ is split into four sub-features along the channel dimension, denoted as $X_1$, $X_2$, $X_3$, and $X_4$, each with a quarter of the channels of $X$. A depthwise separable convolution is then applied to $X_1$ to compute the local spatial feature $X_{\mathrm{local}}$, ensuring that sufficient spatial information is retained for subsequent operations:

$$X_1, X_2, X_3, X_4 = \mathrm{Split}(X), \qquad X_{\mathrm{local}} = \mathrm{DWConv}(X_1),$$

where $\mathrm{Split}(\cdot)$ denotes the splitting operation along the channel dimension.
We utilize the 2D fast Fourier transform (FFT) to extract frequency-domain features. The 2D FFT is an extension of the 1D FFT, applied across both spatial dimensions of an image: a 1D FFT is first computed along the horizontal axis for each row of the image, and a 1D FFT is then computed along the vertical axis for each column of the intermediate result. Combining the two steps yields the 2D FFT formulation:

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)},$$

where $M$ and $N$ are the dimensions of the image, $f(x, y)$ represents the pixel intensity at spatial coordinates $(x, y)$, $F(u, v)$ represents the frequency component at spatial frequencies $(u, v)$, and $j$ is the imaginary unit.
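The separability used in this derivation can be verified numerically with PyTorch: applying a 1D FFT along the rows and then along the columns reproduces the built-in 2D FFT.

```python
import torch

f = torch.randn(8, 8)                               # a small "image"
row_fft = torch.fft.fft(f, dim=1)                   # 1D FFT along each row
two_step = torch.fft.fft(row_fft, dim=0)            # then 1D FFT along each column
direct = torch.fft.fft2(f)                          # built-in 2D FFT
print(torch.allclose(two_step, direct, atol=1e-5))  # True
```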
Subsequently, 2D FFT operations are applied to $X_2$, $X_3$, and $X_4$ to transform them into the frequency domain. Unlike a conventional 2D FFT applied solely on the spatial plane, these transforms are generalized to different dimension pairs of the feature tensor, i.e., Channel–Height, Height–Width, and Channel–Width. Three learnable weights, $W_{ch}$, $W_{hw}$, and $W_{cw}$, each initialized with ones, serve as adaptive frequency filters that modulate the amplitude of different frequency components along their respective dimensions. Through task-driven backpropagation, these weights learn to attenuate noisy high-frequency signals while enhancing semantically meaningful low- and mid-frequency components. The computations are defined as follows:

$$\tilde{X}_{ch} = W_{ch} \odot \mathcal{F}_{CH}(X_2), \qquad \tilde{X}_{hw} = W_{hw} \odot \mathcal{F}_{HW}(X_3), \qquad \tilde{X}_{cw} = W_{cw} \odot \mathcal{F}_{CW}(X_4),$$

where $\mathcal{F}_{CH}$, $\mathcal{F}_{HW}$, and $\mathcal{F}_{CW}$ denote the FFTs over the Channel–Height, Height–Width, and Channel–Width dimension pairs, respectively, and ⊙ denotes the element-wise multiplication operation.
Next, the learned frequency-domain features $\tilde{X}_{ch}$, $\tilde{X}_{hw}$, and $\tilde{X}_{cw}$ are transformed back into the spatial-domain features $X_{ch}$, $X_{hw}$, and $X_{cw}$:

$$X_{ch} = \mathcal{F}^{-1}(\tilde{X}_{ch}), \qquad X_{hw} = \mathcal{F}^{-1}(\tilde{X}_{hw}), \qquad X_{cw} = \mathcal{F}^{-1}(\tilde{X}_{cw}),$$

where $\mathcal{F}^{-1}$ denotes the inverse fast Fourier transform.
The frequency-enhanced features are then fused pairwise, producing the fused features in the width–channel direction and in the height–channel direction and yielding global feature information enriched through frequency-domain weight learning. Finally, the four parts are concatenated to produce the output features of the FFSL module.
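Below is a compact, unofficial sketch of the FFSL flow in PyTorch. The assignment of sub-features to dimension pairs, the pairwise fusion scheme, the weight shapes, and the handling of the complex spectrum (taking the real part after the inverse transform) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FFSL(nn.Module):
    """Frequency Feature Stereoscopic Learning: FFTs over three dimension pairs.

    The feature-map size must be known so that the learnable frequency weights
    (initialized with ones) match the spectrum shape; this is an implementation choice.
    """

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        c = channels // 4
        self.dwconv = nn.Sequential(                 # local branch on the first split
            nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False),
            nn.Conv2d(c, c, 1, bias=False),
        )
        # Adaptive frequency filters for Channel-Height, Height-Width, Channel-Width.
        self.w_ch = nn.Parameter(torch.ones(c, height, width))
        self.w_hw = nn.Parameter(torch.ones(c, height, width))
        self.w_cw = nn.Parameter(torch.ones(c, height, width))

    @staticmethod
    def _filtered_fft(x: torch.Tensor, weight: torch.Tensor, dims) -> torch.Tensor:
        spec = torch.fft.fftn(x, dim=dims)            # transform over one dimension pair
        spec = spec * weight                          # learnable amplitude modulation
        return torch.fft.ifftn(spec, dim=dims).real   # back to the spatial domain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)     # split along the channel dimension
        local = self.dwconv(x1)                       # preserved local spatial detail
        f_ch = self._filtered_fft(x2, self.w_ch, dims=(1, 2))   # Channel-Height pair
        f_hw = self._filtered_fft(x3, self.w_hw, dims=(2, 3))   # Height-Width pair
        f_cw = self._filtered_fft(x4, self.w_cw, dims=(1, 3))   # Channel-Width pair
        fused_hc = f_ch + f_hw                        # height-channel direction (assumed pairing)
        fused_wc = f_cw + f_hw                        # width-channel direction (assumed pairing)
        return torch.cat([local, fused_hc, fused_wc, f_hw], dim=1)

print(FFSL(128, 32, 32)(torch.randn(2, 128, 32, 32)).shape)   # torch.Size([2, 128, 32, 32])
```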
2.5. Feature Selection Module
Although the skip connections in U-shaped networks partially compensate for the information loss during the encoder downsampling process, they also introduce noise from shallow features. To enable more refined selection and dynamic fusion of shallow features while reducing noise interference, we incorporate the FSM into the skip connections, as shown in Figure 5.
Specifically, for a feature of size (h, w, c) from the encoder, the feature is first refined through convolution, GroupNorm, and GELU activation, reducing the channel dimension to c/4 and producing the channel-reduced features. Next, a channel attention module is applied to obtain a channel attention map that adjusts the importance of each channel and further emphasizes critical features. Subsequently, maximum pooling and average pooling are applied in parallel and combined with a convolution and a Sigmoid activation to generate a spatial attention map, which filters the features along the spatial dimensions and highlights the information in salient regions. Finally, a convolution operation increases the channel dimension, restoring the feature to its original dimensionality and producing the output features of the FSM. Through this processing, the decoder's performance in detail recovery and fine feature extraction is effectively enhanced, improving the overall performance of the network.
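An illustrative sketch of the FSM is given below; the squeeze-and-excitation form of the channel attention and the 7 × 7 kernel of the spatial attention are assumptions, since the text does not specify these details.

```python
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Feature Selection Module: channel reduction, channel + spatial attention, expansion."""

    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 4
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.GroupNorm(4, mid), nn.GELU()
        )
        self.channel_attn = nn.Sequential(           # squeeze-and-excitation style gating
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid, 1), nn.GELU(),
            nn.Conv2d(mid, mid, 1), nn.Sigmoid(),
        )
        self.spatial_attn = nn.Conv2d(2, 1, 7, padding=3)   # on stacked max/avg channel maps
        self.expand = nn.Conv2d(mid, channels, 1)           # restore original channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.reduce(x)                                       # (h, w, c) -> (h, w, c/4)
        f = f * self.channel_attn(f)                             # channel attention
        stacked = torch.cat([f.max(dim=1, keepdim=True).values,
                             f.mean(dim=1, keepdim=True)], dim=1)
        f = f * torch.sigmoid(self.spatial_attn(stacked))        # spatial attention
        return self.expand(f)

print(FSM(64)(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```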
5. Discussion
Through comprehensive experiments on the Vaihingen, Potsdam, and agricultural datasets, we have thoroughly validated the capabilities of SF3Net for remote sensing image segmentation. In many domains, including land resource management, urban planning, and disaster monitoring, efficiency and accuracy remain the core requirements; the efficiency and precision of the proposed method make it well suited to these applications and indicate strong potential for broader adoption. Furthermore, SF3Net overcomes limitations of traditional spatial-domain segmentation methods by enhancing spatial feature representation and integrating frequency-domain features, performing particularly well in edge regions and areas with significant texture variation. This advancement further promotes the development of frequency–spatial feature fusion segmentation methods. However, experiments show that while FFSL introduces rich frequency-domain information and preserves some local spatial information, the Fourier transform cannot retain fine spatial details when converting the image to the frequency domain, which can lead to missegmentation. In addition, the Fourier transform requires both a forward and an inverse conversion, increasing computational overhead and affecting the real-time performance and efficiency of segmentation. On the other hand, the main advantage of the SFFM is that it compensates for the spatial information loss caused by FFSL in frequency-domain feature learning by fusing SFAM and FFSL. Furthermore, the FSM selectively fuses the most relevant shallow features, reducing information loss during the encoder's downsampling process and further compensating for FFSL's loss of spatial detail. In the future, we plan to explore contrastive learning and multimodal techniques to narrow the semantic gap between spatial- and frequency-domain features, improve fusion efficiency, and ultimately enhance semantic segmentation.