Article

FSSC-Net: A Frequency–Spatial Self-Calibrated Network for Task-Adaptive Remote Sensing Image Understanding

School of Software Engineering, Xi’an Jiaotong University, No. 28 Xian Ning West Road, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(5), 824; https://doi.org/10.3390/rs18050824
Submission received: 30 January 2026 / Revised: 3 March 2026 / Accepted: 5 March 2026 / Published: 6 March 2026
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • FSSC-Net enables task-aware frequency–spatial calibration for robust remote sensing.
  • Dynamic frequency selection and calibration achieve improvements across multiple tasks.
What are the implications of the main findings?
  • Frequency–spatial self-calibration provides an effective paradigm for addressing scale variation, dense object distribution, and background complexity in remote sensing imagery, highlighting the importance of explicitly modeling frequency-domain information alongside spatial features.
  • The proposed framework offers a general and transferable enhancement strategy that can be seamlessly integrated into existing deep models, enabling consistent performance improvements across multiple remote sensing tasks such as classification, detection, and segmentation.

Abstract

Although recent studies have achieved remarkable progress in remote sensing image understanding by fusing spatial- and frequency-domain features to leverage their complementary strengths, they still face two key limitations: frequency modeling remains rigid due to static constraints, limiting adaptability, and spatial–frequency fusion often suffers from poor generalization and instability across tasks and network depths. Our experiments reveal that the relative importance of low- and high-frequency components varies dynamically across feature hierarchies and training stages, indicating that frequency information is inherently task-dependent and stage-aware. Motivated by these observations, we propose the Frequency–Spatial Self-Calibrated Network (FSSC-Net), a task-driven framework for adaptive frequency modeling and collaborative spatial–frequency fusion. FSSC-Net incorporates a lightweight, plug-and-play self-calibrated frequency modeling mechanism, comprising a Dynamic Frequency Selection Module and a Task-Guided Calibration Fusion Module. This mechanism adaptively modulates frequency responses via soft masks, enabling dynamic extraction of task-relevant low- and high-frequency components and effective alignment between spatial- and frequency-domain features. Moreover, we present a systematic analysis of frequency importance across tasks and training stages, providing quantitative evidence for the necessity of task-calibrated frequency modeling. Extensive experiments on various benchmarks demonstrate that FSSC-Net consistently outperforms state-of-the-art methods, exhibiting strong task adaptability and robust cross-task generalization.

1. Introduction

Benefiting from the rapid development of backbone architectures and representation learning paradigms, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have achieved remarkable progress in remote sensing image analysis. However, compared with natural images, remote sensing imagery poses significantly greater challenges due to its wide spatial coverage, extreme scale variation, dense distribution of small objects, complex backgrounds, and diverse imaging conditions such as illumination changes, atmospheric interference, and sensor noise [1,2,3,4]. In particular, in high-resolution aerial or satellite images, small and weak targets rely heavily on subtle structural cues, including edges, contours, and local textures, which are highly susceptible to degradation during deep feature abstraction. These characteristics impose stringent requirements on the robustness and adaptability of visual representation models.
To address these challenges, increasing attention has been devoted to the fusion of spatial- and frequency-domain representations in remote sensing vision [5,6]. This paradigm seeks to exploit the inherent complementarity between spatial structures (e.g., object shapes, spatial layouts, and contextual relationships) and frequency characteristics (e.g., edges, textures, and repetitive patterns), thereby enhancing model discriminability in complex remote sensing scenarios. Notably, frequency-domain features are particularly effective in capturing fine-grained details and structural regularities that are critical for small object detection, land-cover classification, and boundary-sensitive semantic segmentation in remote sensing imagery.
Existing frequency-aware methods can be broadly categorized into three groups. The first group emphasizes high-frequency information, such as edges and textures, and enhances spatial-domain representations through explicit frequency guidance, detail injection, or noise suppression mechanisms [7,8,9]. These approaches have been widely applied to infrared small target detection [10], remote sensing object recognition [11], low-light or low-contrast image enhancement [12], as well as surface defect and change detection tasks [13]. The second group models spatial-domain and frequency-domain features as parallel branches and achieves collaborative optimization through feature concatenation [14], attention-based weighting [15], residual learning [16], or cross-domain fusion strategies [17]. These methods have demonstrated strong performance in salient object detection [18], remote sensing image classification, few-shot recognition, small object detection [19], and image super-resolution [20]. The third group incorporates task priors or domain knowledge to guide frequency modeling; for example, through curriculum learning strategies [21] or controlled frequency injection schemes, and has been applied to specialized domains such as marine seismic processing [22], medical image analysis [23], and fine-grained pose estimation [24].
Despite these advances, spatial–frequency collaborative frameworks for remote sensing imagery still face two fundamental limitations. First, frequency-domain modeling often lacks flexibility: most existing methods rely on static frequency masks or fixed transformation parameters, which struggle to adapt to the scale diversity, scene complexity, and sensor variability commonly encountered in remote sensing data. Second, spatial–frequency fusion strategies exhibit limited generalization across different tasks and network depths, resulting in unstable performance when transferred among classification, detection, and segmentation scenarios.
To further investigate this issue, we analyze the frequency calibration weights of each FSSC block in FSSC-Net at different training stages. Our observations show that the network exhibits distinct dependencies on low- and high-frequency components at different depths, and that these dependencies evolve dynamically during training. Specifically, the relative importance of high- and low-frequency channels varies across feature hierarchies, and the proportion of high-frequency features among the most important channels changes with training progress. This phenomenon indicates that, in remote sensing tasks, the significance of frequency-domain information is inherently hierarchical and stage-dependent, closely related to spatial resolution, semantic abstraction level, and task objectives. Consequently, static frequency modeling or fixed fusion strategies may fail to fully exploit frequency cues and may even introduce redundancy, constrain representational capacity, and ultimately degrade performance.
Motivated by these observations, we propose the Frequency–Spatial Self-Calibrated Network (FSSC-Net), a novel architecture specifically designed for task-driven frequency modeling and adaptive spatial–frequency fusion in remote sensing vision. The main contributions of this work are summarized as follows:
  • We propose FSSC-Net, an adaptive spatial–frequency collaborative framework tailored for remote sensing tasks. The network dynamically selects task-relevant frequency components based on data distributions and selectively integrates low- and high-frequency information to enhance spatial–frequency representations.
  • We design a lightweight and plug-and-play self-calibrated frequency modeling mechanism, composed of a Dynamic Frequency Selection Module and a Task-Guided Calibration Fusion Module. This mechanism enables soft and adaptive frequency response learning, improving the flexibility of frequency extraction and the robustness of spatial–frequency fusion across diverse tasks.
  • We introduce a systematic frequency importance analysis framework across tasks, network depths, and training stages, revealing task-specific frequency preferences and validating the necessity of task-calibrated frequency modeling.
  • Extensive experiments on benchmarks for image classification, object detection, and semantic segmentation demonstrate the effectiveness of FSSC-Net, highlighting its strong task adaptability and cross-task generalization capability in complex remote sensing scenarios.

2. Previous Studies and Methodology

2.1. Frequency-Domain Feature Extraction

Due to their advantages in edge enhancement, detail preservation, and noise suppression, frequency-domain features have been widely adopted in recent years across various vision tasks. A range of transformation methods, such as the Fast Fourier Transform (FFT) [25], Discrete Cosine Transform (DCT) [26], and wavelet transforms, serve as foundational tools for frequency-domain modeling. Fourier transforms (the FFT or, more generally, the Discrete Fourier Transform, DFT), which decompose signals into amplitude and phase components, are commonly used for frequency analysis and transformation. For instance, APFNN [27] applies 2D FFT with circular masks to extract high- and low-frequency features for pansharpening; SFAN [28] leverages FFT and masks for dehazing in remote sensing images; CESFusion [29] decomposes amplitude and phase via FFT to enhance inter-frequency interaction; and DSFI-CD [30] combines FFT with convolution to suppress noise in low-light conditions. Similarly, TSFNet [31] and LF-MDet [32] utilize FFT to achieve strong performance in remote sensing super-resolution and multimodal detection tasks, respectively. Wavelet transforms (e.g., DWT, Haar, WPT) offer excellent multi-scale decomposition capabilities, making them effective for preserving image details. BiDFNet [33] integrates DWT with cross-attention to achieve dynamic pansharpening; SFFNet [34] adopts Haar wavelets for remote sensing image segmentation; and 2D Haar wavelets are also employed in SAR ship detection [35] for joint spatial–frequency feature modeling. SFFD [36] uses WPT for directional object detection, while ISSP-Net [37] utilizes DWT for multimodal classification. The DCT is often used for high-frequency compensation and noise suppression.
For example, SFMNet [18] incorporates DCT with an adaptive high–low frequency strategy to detect salient RGB-T targets in low-light environments; IFENet [37] applies DCT for frequency separation to enhance multimodal feature fusion; FIOD-VUE [22] and SAFANet [38] both introduce DCT into underwater detection and arbitrary-scale super-resolution tasks, respectively. Despite their effectiveness, most of the above transformation-based methods rely on static basis functions or fixed masking strategies, limiting their adaptability across diverse tasks. For instance, the choice of wavelet basis and decomposition depth in DWT, or the radius and coefficient settings of masks in FFT/DCT, are often manually specified based on empirical rules. To address this limitation, we propose the Dynamic Frequency Selection Module (DFSM). DFSM employs a soft mask to achieve initial separation of high- and low-frequency components, and introduces a learnable weight matrix to dynamically adjust the frequency response range. This enables frequency-domain feature extraction that is both data-distribution aware and task-driven.
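To make the contrast with static masking concrete, the following is a minimal NumPy sketch of frequency splitting with a soft (sigmoid-shaped) mask in the FFT domain, in the spirit of the circular-mask methods discussed above. The mask shape, radius, and sharpness here are illustrative choices, not the exact formulation of any cited method; a learnable variant would treat them as trainable parameters.

```python
import numpy as np

def soft_radial_mask(h, w, radius, sharpness=2.0):
    """Sigmoid-shaped low-pass mask centred on the zero-frequency bin
    (after fftshift): close to 1 near the centre, smoothly decaying to 0."""
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    return 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))

def split_frequencies(img, radius=8):
    """Split a single-channel image into low- and high-frequency parts
    using a soft mask applied in the FFT domain."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    m = soft_radial_mask(*img.shape, radius)
    low = np.real(np.fft.ifft2(np.fft.ifftshift(spec * m)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(spec * (1.0 - m))))
    return low, high

img = np.random.rand(32, 32)
low, high = split_frequencies(img)
```

Because the soft mask and its complement sum to one at every frequency bin, the two components reconstruct the input exactly, so no information is discarded at the splitting stage; selection happens only in the subsequent weighting.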

2.2. Spatial–Frequency Feature Fusion

Effective fusion of spatial and frequency information is essential for enhancing the representational capacity of vision models. In recent years, attention mechanisms have been widely adopted to facilitate spatial–frequency collaboration, leveraging their adaptive focusing capability to improve the model’s perception of task-relevant regions. For instance, CESF [29] employs pooling and convolution to construct a spatial-channel attention module for spatial–frequency integration. SFFNet [34] introduces dual cross-attention to align spatial and frequency features. SFFI [18] and MSCViT [39] combine cross-modal attention with convolution for multi-source information fusion, while Dim2Clear [40] and MoME [28] utilize weighted strategies to integrate heterogeneous modal features. Some approaches also employ inverse frequency-domain transformations to achieve alignment across domains; for example, BiDFNet [33,41] applies inverse wavelet transforms for dual-domain alignment, while LF-MDet [32] and ISSP-Net [37] use inverse FFT for frequency restoration. Pyramid-based global context modeling methods, such as GCFPN [42], enhance spatial–frequency consistency across hierarchical levels. Inspired by residual learning, certain methods (e.g., BRCM [43,44]) transmit original spatial features via residual paths, enhancing information flow between spatial and frequency domains. Task-specific fusion designs, such as those in SAFANet [38] and PPIN [45], have also shown promising performance in specialized scenarios. However, most of these fusion strategies are manually designed for particular tasks or domains, lacking a unified and generalizable mechanism. Given the significant variation in dependency on high- and low-frequency features across tasks, fixed fusion schemes struggle to adapt effectively to multi-task scenarios. To address this issue, we propose the Task-Guided Calibration Fusion Module (TGCFM).
This module first reorganizes and separates high- and low-frequency features to better align with task characteristics. Then, a calibration block is introduced to perform channel-wise selection, preserving the most informative frequency components. Finally, a cross-attention mechanism is employed to complete the spatial–frequency fusion, enabling efficient and generalizable dual-domain collaborative modeling.

3. Methodology

In this section, we present the proposed Frequency–Spatial Self-Calibrated Network (FSSC-Net). Section 3.1 introduces the overall architecture of FSSC-Net. Section 3.2 describes the Dynamic Frequency Selection Module, which adaptively identifies and extracts task-relevant frequency components. Section 3.3 details the Spatial Granularity Self-Adaptive Module, designed to provide granularity-adaptive spatial features. Finally, Section 3.4 elaborates on the Task-Guided Calibration Fusion Module, which achieves task-aware adaptive fusion of spatial and frequency features through self-calibration.

3.1. Frequency–Spatial Self-Calibrated Network

To enhance the representational capacity of deep networks and improve performance across downstream tasks, it is essential to adopt frequency feature extraction and spatial–frequency fusion strategies that are more closely aligned with task requirements. To this end, we propose the FSSC-Net, a novel architecture built upon a task-guided calibration mechanism. FSSC-Net integrates three key modules across all stages: an adaptive frequency-domain feature extraction module, a spatial-domain multi-scale modeling module, and a task-aware spatial–frequency fusion module, enabling collaborative modeling across both domains.
As illustrated in Figure 1, FSSC-Net adopts a hierarchical architecture composed of four stages with progressively reduced spatial resolutions. Each stage consists of multiple FSSC Blocks, which serve as the fundamental building units of the network. In the figure, the green regions indicate the multi-granularity spatial feature extraction module, which consists of the MGPU (Multi-Granularity Perception Unit) and the GAU (Granularity Adaptation Unit); the blue regions denote the Dynamic Frequency Selection Module with high- and low-frequency branches; and the gray module represents the self-calibrated frequency–spatial feature fusion module. The red dashed box provides a schematic of the feature computation and information flow within an FSSC Block.
Each FSSC Block comprises three key components: (1) DFSM, (2) the Spatial Granularity Self-Adaptive Module (SGSAM), and (3) the TGCFM. Within each block, input features are processed in parallel by frequency- and spatial-domain branches to capture semantic structures and fine-grained spatial details, respectively. The spatial branch, inspired by window attention mechanisms, employs a hybrid design that combines parallel and cascaded multi-granularity branches to accommodate diverse task-specific granularity requirements. The cascaded window structure facilitates inter-scale information exchange, while learnable fusion weights allow the network to handle heterogeneous feature distributions effectively (see Figure 2b).
In the frequency branch, DFSM applies soft masks to divide deep features into adaptive high- and low-frequency components. A multi-scale frequency-domain pyramid is then constructed based on distinct index sets and fused via learnable weights to adapt to diverse application needs. SGSAM addresses the mismatch between fixed-scale modeling and task-specific spatial feature demands by introducing a multi-granularity perception mechanism and adaptive fusion strategy, enabling dynamic adjustment of spatial granularity and a balanced representation of local details and global semantics. Finally, TGCFM implements a task-aware spatial–frequency fusion strategy. By integrating frequency feature calibration with a cross-attention mechanism, it facilitates deep integration of spatial and frequency information, substantially enhancing collaborative modeling across both domains.
Overall, the key contribution of FSSC-Net lies in the proposed task-guided adaptive cross-domain collaboration framework, which overcomes the limitations of traditional frequency-domain feature extraction and fusion, namely insufficient flexibility and lack of task awareness. In the frequency feature extraction stage, the framework constructs dynamic frequency subspaces under task supervision, enabling more targeted frequency representations. In the spatial–frequency fusion stage, a calibrate-then-fuse two-stage strategy is employed: frequency features are first selectively filtered and optimized in a task-relevant manner, and then non-overlapping windowed cross-attention is applied to achieve efficient semantic alignment and cross-domain collaboration. This design allows the model to dynamically adjust spatial granularity, frequency allocation, and cross-domain interaction patterns according to different visual tasks, significantly enhancing representation capacity and task performance in complex scenarios while maintaining computational efficiency. Overall, FSSC-Net provides a systematic, task-driven solution for spatial–frequency collaborative modeling.

3.2. Dynamic Frequency Selection Module

The DFSM is designed to adaptively identify task-relevant frequency components from the DCT domain through a data-driven strategy. It enables the selective integration of high-frequency details and low-frequency semantic information to enhance representation flexibility. The module operates in two main stages: (1) Multi-Scale Frequency Component Construction: A soft mask mechanism is applied to construct multi-scale frequency-domain representations using different index sets (e.g., 32 × 32 and 16 × 16). Channel attention is then used to dynamically model the importance of each frequency component. This design allows the model to adaptively select relevant frequency features based on the intrinsic characteristics of the input data and the specific requirements of the task, preserving key components while suppressing redundancy. (2) Frequency Fusion and Weight Adjustment: To further improve the adaptability of frequency-domain features, DFSM incorporates learnable weights to perform dynamic fusion across different frequency components. This facilitates adaptive control over the granularity of frequency representation. Throughout the extraction process, the soft masks enable flexible alignment with various frequency distributions, while also suppressing noise and redundant responses, thus enhancing the robustness of the features. For each frequency representation built from different index sets, DFSM dynamically fuses them to balance information across multiple granularities. This results in a hierarchical frequency-domain representation that not only supports task-driven frequency selection but also leverages multi-scale information to improve feature expressiveness.
In summary, DFSM enables multi-scale collaboration and task-adaptive frequency modeling by dynamically constructing and fusing frequency components. It provides a more flexible and targeted frequency representation strategy for vision models. As illustrated in Figure 3, the specific pipeline of DFSM is as follows: Given an input feature map $f_{if} \in \mathbb{R}^{B \times C \times H \times W}$, adaptive average pooling (AAP) is first applied to transform it into a fixed-resolution representation $f_{AAP}$. Then, based on learnable soft masks denoted as $SoftMask^{low}_{set_i}$ and $SoftMask^{high}_{set_i}$, the module extracts multi-scale low-frequency and high-frequency components $\hat{f}^{low}_{set_i}$ and $\hat{f}^{high}_{set_i}$, respectively, from the corresponding index sets denoted as $set_i$. These features are then refined via a Squeeze-and-Excitation (SE) module for relevance filtering and noise suppression, and are element-wise multiplied with the original input $f_{if}$ to generate the refined frequency feature groups.
During fusion, features from different index groups within the high- and low-frequency components are further integrated via learnable parameters to produce the final multi-scale high-frequency feature $f^{high}_{ms}$ and low-frequency feature $f^{low}_{ms}$, respectively. The complete fusion process is described in Equation (1).
$$
\begin{aligned}
f_{AAP} &= \mathrm{AAP}_{h,w}\left(f_{if}\right), \quad f_{if} \in \mathbb{R}^{B \times C \times H \times W} \\
\hat{f}^{low}_{set_i} &= \mathrm{SE}\left(SoftMask^{low}_{set_i} \odot f_{AAP}\right) \\
\hat{f}^{high}_{set_i} &= \mathrm{SE}\left(SoftMask^{high}_{set_i} \odot f_{AAP}\right) \\
f^{low}_{ms} &= \sum_i \alpha_i \cdot \hat{f}^{low}_{set_i} \odot f_{if}, \quad \sum_i \alpha_i = 1 \\
f^{high}_{ms} &= \sum_j \beta_j \cdot \hat{f}^{high}_{set_j} \odot f_{if}, \quad \sum_j \beta_j = 1
\end{aligned}
$$
Owing to the aforementioned design, the DFSM does not perform selection within a single frequency spectrum. Instead, it constructs multiple frequency subspaces using different index sets, forming frequency representations that are better adapted to the current task. Specifically, DFSM does not apply a one-time global filter to the input features; rather, it independently calibrates each frequency subspace and applies different strategies to different subspaces. Finally, the module achieves collaborative optimization across multiple frequency subspaces through learnable fusion weights, dynamically balancing low- and high-frequency information while enhancing feature representation. Therefore, the core innovation of DFSM lies in the fact that it does not merely select the “most salient frequencies,” but constructs task-oriented multi-subspace frequency representations, supporting multi-structure collaboration and dynamic adaptation.
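As a rough illustration of the pipeline in Equation (1), the sketch below implements the DFSM data flow for a single sample in NumPy. Block average pooling stands in for AAP, random sigmoid-valued matrices stand in for the learnable soft masks, a mean-based sigmoid gate stands in for the SE module, and softmax-normalized scalars play the role of the learnable fusion coefficients $\alpha_i$ and $\beta_j$. All of these stand-ins are assumptions made for readability; in the actual network they are learned end to end.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def se_gate(feat):
    """SE-style channel gate: squeeze by global average pooling, then a
    sigmoid excitation (no learned weights here, for illustration)."""
    z = feat.mean(axis=(1, 2))                  # (C,) channel statistics
    g = 1.0 / (1.0 + np.exp(-z))                # sigmoid excitation
    return feat * g[:, None, None]

def dfsm_sketch(f, index_sets=(32, 16), seed=0):
    """Sketch of DFSM for one sample (C, H, W): pool to each index-set
    resolution, apply a soft low/high mask pair, gate with SE, upsample,
    modulate the input, and fuse with convex (sum-to-one) weights."""
    rng = np.random.default_rng(seed)
    C, H, W = f.shape
    alphas = softmax(rng.standard_normal(len(index_sets)))   # sum to 1
    betas = softmax(rng.standard_normal(len(index_sets)))    # sum to 1
    f_low, f_high = np.zeros_like(f), np.zeros_like(f)
    for a, b, s in zip(alphas, betas, index_sets):
        # block average pooling to s x s (H, W assumed divisible by s)
        pooled = f.reshape(C, s, H // s, s, W // s).mean(axis=(2, 4))
        # random soft mask in (0, 1) stands in for the learnable mask
        low_mask = 1.0 / (1.0 + np.exp(rng.standard_normal((s, s))))
        low = se_gate(pooled * low_mask)
        high = se_gate(pooled * (1.0 - low_mask))
        up = lambda x: np.repeat(np.repeat(x, H // s, 1), W // s, 2)
        f_low += a * up(low) * f      # element-wise modulation of the input
        f_high += b * up(high) * f
    return f_low, f_high

f = np.random.rand(8, 32, 32)
f_low, f_high = dfsm_sketch(f)
```

Note how each index set is calibrated independently before the weighted fusion, mirroring the per-subspace strategy described above.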

3.3. Spatial Granularity Self-Adaptive Module

In the Vision Transformer domain, window-based attention mechanisms have become a key strategy for balancing computational efficiency and model performance. Inspired by pioneering works such as Swin Transformer and P2T, we propose the Spatial Granularity Self-Adaptive Module (SGSAM) to address the limitations of conventional approaches that rely on fixed-scale convolutions or single-size windows, which often struggle to reconcile the dynamic needs of local detail modeling and global semantic understanding. SGSAM enables a dynamic adaptation of spatial feature granularity, providing efficient and flexible spatial representations. SGSAM is composed of two core components. (1) MGPU (Multi-Granularity Perception Unit): This unit constructs spatial representations with diverse perceptual granularity through a parallel attention-branch architecture, as illustrated in Figure 2a. (2) Granularity Adaptation Unit (GAU): This unit dynamically selects and integrates features with the most suitable granularity based on task-specific demands, forming an adaptive mapping between granularity and semantic requirement, as shown in Figure 2b. As a key spatial component within each FSSC Block, SGSAM significantly improves both the flexibility and accuracy of spatial feature representation. By adaptively adjusting feature granularity, it provides task-aware spatial inputs for downstream tasks such as object detection and scene segmentation. Moreover, it helps prevent mismatches between scale selection and actual feature requirements, which are common pitfalls in fixed-scale designs. Overall, SGSAM enhances the network’s spatial adaptability and semantic precision, leading to improved performance in complex vision scenarios.

3.3.1. Multi-Granularity Perception Unit (MGPU)

The MGPU is designed to construct multi-granular spatial representations under controlled computational cost. As shown in Figure 2, the input feature map $f_{if} \in \mathbb{R}^{B \times C \times H \times W}$ is first passed through a $1 \times 1$ convolution for channel compression, resulting in a compact representation $f_c \in \mathbb{R}^{B \times c \times H \times W}$, where $c < C$. It is worth emphasizing that we adopt a full-channel compression strategy instead of channel splitting, in order to avoid potential information loss.
The compressed feature $f_c$ is then fed into the MGPU for multi-granularity feature extraction. The MGPU consists of multiple parallel-cascaded attention branches, each utilizing a different window size, denoted as $Win_{s_1}$, $Win_{s_2}$, and $Win_{s_3}$, to establish spatial perception at varying granularities. Specifically, smaller windows (e.g., $3 \times 3$, $5 \times 5$) focus on capturing fine-grained patterns such as textures and edges, while larger windows (e.g., $7 \times 7$) are responsible for aggregating coarse-grained information, such as scene context and global semantics. As illustrated in Figure 2a, increasing window sizes lead to progressively larger receptive fields, thereby enabling richer spatial modeling.
Concretely, the compressed feature $f_c$ is reshaped into a sequence representation $f_{s_i} \in \mathbb{R}^{B \times N_i \times c}$ for each window size $s_i$, where $N_i = \frac{H}{s_i} \times \frac{W}{s_i}$. The MGPU adopts a parallel-cascaded architecture: the features extracted by a smaller-window branch $f_{w_i}$ are not only used as its own output but also serve as input to the next branch with a larger window. This progressive feature-passing mechanism facilitates hierarchical context modeling and enhances cross-scale consistency. The detailed computation process is defined in Equation (2):
$$
\begin{aligned}
f_c &= \mathrm{Conv}_{1 \times 1}\left(f_{if}\right), \quad f_c \in \mathbb{R}^{B \times c \times H \times W}, \; c < C \\
f_{s_i} &= \mathrm{Serialization}\left(f_c\right), \quad f_{s_i} \in \mathbb{R}^{B \times N_i \times c} \\
f_{w_i} &= \mathrm{Reverse}\left(\mathrm{MLP}\left(\mathrm{WinAttn}_{w_i}\left(f_{s_i}\right) + f_{w_{i-1}}\right)\right), \quad f_{w_i} \in \mathbb{R}^{B \times c \times H \times W}
\end{aligned}
$$
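A simplified NumPy rendering of the parallel-cascaded computation in Equation (2) follows. For brevity it drops the MLP, the learned Q/K/V projections, and the serialization/reverse reshapes, keeping only unprojected self-attention inside non-overlapping windows plus the cascade in which each branch's output feeds the next. The window sizes are chosen to divide the feature map evenly and are illustrative, not the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(f, win):
    """Plain self-attention inside non-overlapping win x win windows of a
    (c, H, W) map; Q = K = V = the window tokens (no projections)."""
    c, H, W = f.shape
    out = np.empty_like(f)
    for i in range(0, H, win):
        for j in range(0, W, win):
            tokens = f[:, i:i + win, j:j + win].reshape(c, -1).T  # (win*win, c)
            attn = softmax(tokens @ tokens.T / np.sqrt(c))
            out[:, i:i + win, j:j + win] = (attn @ tokens).T.reshape(c, win, win)
    return out

def mgpu_sketch(f, window_sizes=(4, 8, 16)):
    """Parallel-cascaded branches: each branch attends at its own window
    size, and its output (plus the previous branch's) is passed onward."""
    outputs, prev = [], np.zeros_like(f)
    for w in window_sizes:
        prev = window_attention(f, w) + prev   # cascade, as in Eq. (2)
        outputs.append(prev)
    return outputs

f = np.random.rand(8, 16, 16)
f_w1, f_w2, f_w3 = mgpu_sketch(f)
```

Each returned map corresponds to one granularity branch $f_{w_i}$ and would be passed to the GAU for weighted fusion.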

3.3.2. Granularity Adaptation Unit (GAU)

Leveraging the channel-compression feature extraction paradigm implemented in MGPU, each attention branch operates on the full input representation, thereby preserving complete spatial details during multi-granularity processing. To effectively integrate the complementary information captured by different window branches and enhance the model's adaptability to diverse tasks and data, the Granularity Adaptation Unit (GAU) introduces a dynamic weighting fusion mechanism and channel-wise attention calibration for efficient feature aggregation.
Specifically, the output features from each branch, denoted as $f_{w_1}, f_{w_2}, f_{w_3}$, are first modulated by learnable weights $\hat{\alpha}_{f_{w_1}}, \hat{\alpha}_{f_{w_2}}, \hat{\alpha}_{f_{w_3}}$, where $\hat{\alpha}_{f_{w_i}} \in (0, 1)$ and $\sum_i \hat{\alpha}_{f_{w_i}} = 1$. This design enables the model to adaptively adjust the contribution of different granularity features according to task requirements. For instance, in texture-intensive tasks, the weight of smaller-window branches is increased, while in tasks dominated by global semantics, larger-window branches receive greater emphasis.
The dynamically modulated features are then concatenated to form the fused representation $f_{concat} \in \mathbb{R}^{B \times (c_1 + c_2 + c_3) \times H \times W}$, where $c_1 = c_2 = c_3 = c$. To further enhance the discriminative capacity of the fused features, we perform channel-wise attention modeling on $f_{concat}$, which adaptively emphasizes informative channels while suppressing redundant responses. Through the joint effect of the learnable granularity weights and the channel attention, GAU achieves dynamic balancing of the multi-granular features $f_{mg}$ and reinforces cross-scale consistency. The detailed computation is formulated in Equation (3):
$$
\begin{aligned}
f_{concat} &= \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(f_{w_1} \odot \hat{\alpha}_{f_{w_1}},\, f_{w_2} \odot \hat{\alpha}_{f_{w_2}},\, \ldots,\, f_{w_i} \odot \hat{\alpha}_{f_{w_i}}\right)\right), \quad f_{w_i} \in \mathbb{R}^{B \times c \times H \times W} \\
f_{mg} &= f_{concat} \odot \left(\frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} f_{concat}(c, i, j)\right) + f_{concat}, \quad c = 1, 2, \ldots, c
\end{aligned}
$$
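The GAU computation in Equation (3) can be sketched as follows. A random channel-mixing matrix stands in for the $1 \times 1$ convolution, softmax-normalized scalars stand in for the learnable granularity weights, and the spatial-mean channel gate with a residual mirrors the second line of the equation. These stand-ins are assumptions for illustration; the real module learns all of them.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gau_sketch(branches, seed=0):
    """Sketch of the Granularity Adaptation Unit: weight each branch,
    concatenate, mix channels back to width c, then apply a mean-based
    channel gate plus a residual connection."""
    rng = np.random.default_rng(seed)
    c, H, W = branches[0].shape
    alphas = softmax(rng.standard_normal(len(branches)))     # in (0,1), sum to 1
    concat = np.concatenate([a * b for a, b in zip(alphas, branches)], axis=0)
    # random (c, 3c) matrix as a stand-in for the learned 1x1 convolution
    mix = rng.standard_normal((c, concat.shape[0])) / np.sqrt(concat.shape[0])
    fused = np.tensordot(mix, concat, axes=1)                # (c, H, W)
    gate = fused.mean(axis=(1, 2))                           # per-channel mean
    return fused * gate[:, None, None] + fused               # gate + residual

branches = [np.random.rand(8, 16, 16) for _ in range(3)]
f_mg = gau_sketch(branches)
```

The output $f_{mg}$ then serves as the spatial input to the fusion module described in Section 3.4.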

3.4. Task-Guided Calibration Fusion Module

The FSSC Block constructs a multi-modal representation by integrating the DFSM and the SGSAM in a collaborative manner. To achieve task-aware adaptive fusion of spatial and frequency features, we propose the TGCFM. This module employs a two-stage architecture to facilitate deep interaction and complementarity between the spatial and frequency domains.
As illustrated in Figure 4, the proposed TGCFM follows a two-stage design to achieve adaptive frequency–spatial collaboration. In the calibration stage, frequency features are selectively enhanced according to task-specific requirements, while in the fusion stage, a cross-attention [33] mechanism is employed to realize semantic alignment and information exchange between frequency and spatial representations.
The inputs to TGCFM consist of a frequency feature pyramid generated by the DFSM and multi-granularity spatial features extracted by the SGSAM. During the calibration phase, high-frequency and low-frequency features are first processed by separate 1 × 1 convolutions to learn task-specific representations for each frequency band. To further identify informative channels and suppress redundancy, channel-wise attention is applied to both branches, producing importance scores that are compared against a task-dependent threshold. Only the most informative channels are retained via a Top-K selection strategy, yielding calibrated frequency features that are both compact and task-adaptive.
In the fusion phase, the calibrated low-frequency features serve as queries (Q) to provide global semantic guidance, the high-frequency features act as keys (K) to convey fine-grained structural details, and the spatial features function as values (V) to deliver localized spatial context. To balance modeling capacity and computational efficiency, a window-based attention mechanism is adopted. Features from both the frequency and spatial domains are partitioned into non-overlapping windows and serialized prior to attention computation, enabling efficient cross-domain interaction within local regions. The overall fusion process is formulated in Equation (4), where $\mathrm{TopK}(\cdot)$ denotes the selection of the top-K channels with the highest importance scores, and $\mathrm{Serialize}(\cdot)$ represents the serialization of window-partitioned feature maps. Through this task-guided calibration and adaptive fusion strategy, TGCFM dynamically adjusts its frequency–spatial interaction patterns to varying task demands, leading to consistent performance improvements across diverse vision tasks.
$$
\begin{aligned}
f_{\text{low}}^{\text{calibrate}} &= \mathrm{Serialize}\big\{\mathrm{TopK}\big(\mathrm{CA}\big(\mathrm{Conv}_{1\times 1}\big(f_{\text{frequency}}^{\text{low}}\big)\big)\big)\big\} \\
f_{\text{high}}^{\text{calibrate}} &= \mathrm{Serialize}\big\{\mathrm{TopK}\big(\mathrm{CA}\big(\mathrm{Conv}_{1\times 1}\big(f_{\text{frequency}}^{\text{high}}\big)\big)\big)\big\} \\
f_{\text{spatial}}^{\text{win}} &= \mathrm{Serialize}\big\{f_{mg}\big\} \\
\mathrm{AttnFuse}\big(f_{\text{low}}^{\text{calibrate}}, f_{\text{high}}^{\text{calibrate}}, f_{\text{spatial}}^{\text{win}}\big) &= \mathrm{SoftMax}\!\left(\frac{W_Q f_{\text{low}}^{\text{calibrate}}\big(W_K f_{\text{high}}^{\text{calibrate}}\big)^{\top}}{\sqrt{d}} + P\right) W_V f_{\text{spatial}}^{\text{win}} \\
f_{\text{fused}} &= \mathrm{Reshape}\big(\mathrm{MLP}\big(\mathrm{LN}\big(\mathrm{Attn}_{\text{fused}}^{w}\big)\big) + f_{\text{if}}\big)
\end{aligned}
\tag{4}
$$
Overall, the core of TGCFM lies in its task-guided calibration prior to fusion. Rather than simply reusing all frequency features, it selectively enhances and retains the frequency components most relevant to the current task, reducing noise and redundancy at the source. During the calibration stage, different frequency bands are processed independently, and channel attention combined with a Top-K strategy enables task-adaptive feature selection and compact representation. In the subsequent fusion stage, non-overlapping windowed cross-attention constructs high-semantic-density cross-domain interactions using sparse channels, achieving fine-grained semantic alignment and efficient structural fusion. In summary, TGCFM integrates spatial and frequency information through a task-aware calibrate-then-fuse strategy, effectively improving feature representation quality and task adaptability.
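The fuse step can be illustrated with a minimal NumPy sketch of non-overlapping windowed cross-attention, with the calibrated low-frequency features as queries, the high-frequency features as keys, and the spatial features as values. The learned projections W_Q, W_K, W_V and the positional term P from Equation (4) are omitted for brevity, so this is a structural sketch rather than the exact TGCFM computation:

```python
import numpy as np

def window_cross_attention(f_low, f_high, f_spatial, win=2):
    """Structural sketch of TGCFM's fusion: windowed cross-attention with
    Q = calibrated low-frequency, K = calibrated high-frequency, and
    V = spatial features. Inputs: (H, W, C); H and W divisible by win."""
    H, W, C = f_low.shape
    out = np.zeros_like(f_spatial)
    for i in range(0, H, win):
        for j in range(0, W, win):
            # Serialize one non-overlapping window into (win*win, C) tokens.
            q = f_low[i:i + win, j:j + win].reshape(-1, C)
            k = f_high[i:i + win, j:j + win].reshape(-1, C)
            v = f_spatial[i:i + win, j:j + win].reshape(-1, C)
            logits = q @ k.T / np.sqrt(C)
            attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
            attn /= attn.sum(axis=-1, keepdims=True)      # row-wise SoftMax
            out[i:i + win, j:j + win] = (attn @ v).reshape(win, win, C)
    return out
```

Restricting attention to local windows keeps the cost linear in the number of windows rather than quadratic in the full token count, which is the efficiency motivation stated above.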

4. Experiments

This section presents a comprehensive evaluation of FSSC-Net and its core module, TGCFM, across three fundamental computer vision tasks. Section 4.1 begins with image classification experiments on the ImageNet-1K dataset, where the classification accuracy of different methods is systematically compared to validate the effectiveness of the proposed framework in general-purpose tasks. Subsequently, a remote sensing image classification task is conducted to assess the task adaptability of FSSC-Net and the generalization capability of its frequency-domain self-calibration mechanism. Section 4.2 and Section 4.3 report the performance of our approach on object detection and semantic segmentation tasks, respectively. Section 4.4 presents ablation studies that investigate the critical design choices of the TGCFM and analyze their contributions to the overall performance. Section 4.5 investigates key parameters, including the masking mechanism and the Top-K ratio, and assesses the statistical reliability of the results. Section 4.6 conducts an in-depth analysis of frequency dependence across different network depths within the classification task, as well as frequency preferences across different vision tasks. Finally, Section 4.7 further validates the effectiveness of TGCFM through Grad-CAM visualizations and t-SNE-based feature distribution analysis.

4.1. Image Classification on ImageNet-1K and AID

4.1.1. Image Classification Experiments on the ImageNet-1K Dataset

Setting: We evaluate the effectiveness of the proposed FSSC-Net on the ImageNet-1K dataset [46], which contains 1.28 M training images and 50 K validation images spanning 1000 classes. Top-1 accuracy is reported based on single-crop evaluation. All models are trained for 300 epochs on the ImageNet-1K dataset, with input images resized to 224 × 224. The batch size is set to 1024, and the initial learning rate is 0.01. We adopt the AdamW optimizer in combination with a cosine annealing learning rate schedule, including a 20-epoch linear warm-up phase. All experiments are conducted on four NVIDIA RTX 4090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).
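As a concrete reference, the schedule described above (cosine annealing with a 20-epoch linear warm-up over 300 epochs, base rate 0.01) can be written as a small helper; the function name and the zero floor for the final learning rate are our assumptions, not details stated in the paper:

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, total_epochs=300, warmup_epochs=20, min_lr=0.0):
    """Cosine-annealing learning rate with a linear warm-up phase."""
    if epoch < warmup_epochs:
        # Linear ramp up to base_lr over the warm-up epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```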
Results: We evaluate the effectiveness of TGCFM from two perspectives: (1) overall classification performance of FSSC-Net compared to existing state-of-the-art backbone networks, and (2) performance gains achieved by incorporating TGCFM into established architectures.
Table 1 reports the Top-1 classification accuracy of FSSC-Net on the ImageNet-1K dataset alongside several representative Transformer-based and convolution-based models. The symbol “–” denotes parameters that were not reported in the original studies. To avoid introducing assumptions or estimation bias, these values are intentionally left unfilled. Under comparable parameter counts and computational budgets, FSSC-Net achieves competitive or superior performance. For instance, FSSC-Net surpasses the Swin-T by 0.6%, and outperforms multi-scale models such as PVT-Small and the more complex PVT-Medium by 2.1% and 0.7%, respectively. In addition, when compared to classical CNNs like ResNet and ResNeXt, FSSC-Net consistently delivers higher accuracy, demonstrating the effectiveness of the proposed frequency–spatial fusion strategy as a complementary enhancement to conventional convolutional structures.
Table 2 summarizes the performance gains achieved by integrating TGCFM into standard backbone networks. In this table, the suffix “-TGCFM” denotes models augmented with the TGCFM. Notably, ResNet-50-TGCFM and Swin-Tiny-TGCFM achieve Top-1 accuracy improvements of 1.46% and 0.84%, respectively. These results validate the transferability and generalizability of TGCFM. Both FSSC-Net and TGCFM-augmented variants benefit from richer and more expressive feature representations, highlighting the potential of the proposed composite frequency modeling approach to enhance both convolution- and self-attention-based networks.

4.1.2. Remote Sensing Classification on the AID Dataset

Setting: To further evaluate the adaptability of FSSC-Net and the effectiveness of TGCFM in domain-specific scenarios beyond general-purpose tasks, we conduct extended experiments on remote sensing image classification. Specifically, we adopt the AID dataset, which contains 10,000 aerial images with a resolution of 600 × 600, covering 30 representative scene categories. The dataset is split into training and validation sets in an 8:2 ratio. We follow the same training and evaluation protocol used in FDNet for fair comparison. FDNet employs frequency-domain convolution, multi-scale frequency decoupling, and attention mechanisms to extract features in the frequency domain, effectively mitigating the information loss caused by spatial downsampling. This design has demonstrated improved classification accuracy and computational efficiency in remote sensing tasks.
Results: As summarized in Table 3, although FSSC-Net is designed as a general frequency–spatial framework rather than a dataset-specific solution, it achieves a Top-1 accuracy of 96.20% on the AID dataset, surpassing FDNet’s 95.92% by 0.28%. This confirms the strong task adaptability of FSSC-Net and the robustness of its frequency–spatial fusion mechanism in complex scenarios. Furthermore, when TGCFM is integrated into Swin-T, the model achieves a 4.38% improvement in classification accuracy, demonstrating the effectiveness and generality of adaptive frequency modeling and self-calibration mechanisms in unstructured tasks such as remote sensing image classification.
Figure 5 presents the visualization results of TGCFM on the AID dataset. Panels (a-1) and (a-2) illustrate the mean importance of low- and high-frequency channels at different network depths, as well as the proportion of high-frequency components among the top 75% most important channels, respectively. Panels (b) and (c) show histograms of channel importance for low- and high-frequency branches across different layers. Compared with the methods shown in Figure 6, it is evident that remote sensing image classification exhibits a distinct dependency on frequency-domain features, highlighting the unique frequency demands of this task. Together with the performance results of FSSC-Net and TGCFM reported in Table 3, these findings further validate the effectiveness and necessity of the TGCFM mechanism, demonstrating that FSSC-Net equipped with TGCFM can effectively self-calibrate and adapt to the specific frequency characteristics of remote sensing image classification.

4.2. Object Detection on General Detection Dataset MS COCO and Remote Sensing Dataset AITOD

To assess the effectiveness of the proposed method with a focus on remote sensing target detection, experiments are conducted on both the COCO benchmark and the AITOD [60] dataset. While COCO provides a widely used benchmark with diverse object scales and complex backgrounds for validating general detection robustness, AITOD is specifically designed for aerial imagery, where targets are predominantly small, densely distributed, and embedded in cluttered scenes.
Evaluations on AITOD highlight the ability of the proposed method to preserve fine-grained structural details and improve small-object localization in remote sensing scenarios, while the additional COCO results demonstrate that these gains do not come at the expense of general detection performance. Together, the results indicate that the proposed approach is well-suited for remote sensing object detection with strong generalization capability.

4.2.1. Object Detection on MS COCO

Setting: We evaluate the object detection performance of FSSC-Net on the MS COCO dataset [61], which comprises approximately 118 K training images, 5 K validation images, and 20 K test images. Backbone networks are pretrained on ImageNet with an input resolution of 224 × 224 and then integrated into the detection framework. Cascade RCNN is chosen as the detection architecture due to its widespread adoption and competitive performance in vision benchmarks.
During training, we utilize the multi-scale training strategy; input images are resized such that the shorter side is randomly sampled between 480 and 800 pixels, while the longer side is constrained to a maximum of 1333 pixels. The models are optimized using the AdamW optimizer with an initial learning rate of 0.0001, a weight decay of 0.05, and a batch size of 16. The training process is conducted over a 3× schedule (36 epochs). Model performance is evaluated using the official COCO evaluation API, with metrics including AP (Average Precision), AP50 and AP75. Additionally, we report the number of parameters and FLOPs for each model to assess the trade-off between accuracy and computational efficiency.
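The multi-scale resizing rule above can be made concrete with a short helper; the rounding behavior and function name are illustrative assumptions, since the paper follows the standard detection-framework implementation of this strategy:

```python
import random

def multiscale_resize_dims(h, w, short_range=(480, 800), max_long=1333, rng=None):
    """Target (height, width) under the multi-scale training rule: sample
    the shorter side from short_range, cap the longer side at max_long."""
    rng = rng if rng is not None else random.Random(0)
    target_short = rng.randint(*short_range)
    scale = target_short / min(h, w)
    if max(h, w) * scale > max_long:
        # Shrink further so the longer side does not exceed the cap.
        scale = max_long / max(h, w)
    return round(h * scale), round(w * scale)
```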
Results: As shown in Table 4, incorporating the pretrained FSSC-Net as the backbone in the Cascade R-CNN [62] framework yields superior performance across multiple key metrics when compared to existing CNN- and Transformer-based detectors. The comparison results are either sourced directly from the original publications or reproduced under identical training settings for fairness. Notably, FSSC-Net achieves higher detection accuracy under similar parameter budgets. For example, it outperforms Swin-Tiny and PVT-V2 by 0.4% and 0.8% in AP, respectively. Furthermore, FSSC-Net achieves nearly the same performance as MambaVision-T, trailing it by only 0.2% AP while using 14.89 M fewer backbone parameters. These results highlight the efficiency of FSSC-Net in scenarios with constrained model complexity, demonstrating the benefits of the proposed frequency–spatial fusion strategy in dense prediction tasks.

4.2.2. Object Detection on AITOD

Setting: We evaluate the remote sensing object detection performance of FSSC-Net on the AITOD dataset, which is specifically designed for aerial image target detection and contains densely distributed objects with extreme scale variations. AITOD focuses on small and very small targets captured from high-altitude platforms, posing significant challenges in terms of feature representation and localization accuracy. Backbone networks are pretrained on ImageNet with an input resolution of 224 × 224 and then incorporated into the detection framework. Following common practice in remote sensing detection, Cascade R-CNN is adopted as the detection architecture to ensure strong localization refinement for small objects. During training, a multi-scale strategy is employed to accommodate large variations in object size. Input images are resized by randomly sampling the shorter side within a predefined range while constraining the longer side to a fixed maximum resolution to balance accuracy and computational cost. The models are optimized using the AdamW optimizer with an initial learning rate of 0.0001, a weight decay of 0.05, and a batch size of 16. All models are trained under a 3× schedule. Evaluation is conducted using standard detection metrics, including AP, AP50, and AP75, with additional emphasis on small-object performance to reflect the characteristics of the AITOD dataset. Model complexity is further analyzed in terms of parameter count and FLOPs.
Results: As shown in Table 5, FSSC-Net achieves an overall AP of 30.4 on the AITOD dataset, demonstrating superior performance among CNN-based detectors and competitive results compared with recent remote sensing-oriented methods. The model further obtains 62.2 AP50 and 25.9 AP75, indicating robust detection and reliable localization under varying IoU thresholds in aerial imagery. At the category level, FSSC-Net exhibits clear advantages on representative remote sensing targets characterized by small object size and dense spatial distribution, including airplanes (39.8 AP), storage tanks (51.7 AP), ships (47.6 AP), and swimming pools (37.0 AP). These gains suggest that the proposed frequency–spatial fusion strategy effectively preserves fine-grained structural and boundary cues that are often degraded in deep feature hierarchies of aerial images. Overall, the results confirm that FSSC-Net is well-suited for remote sensing object detection, particularly in complex aerial scenes dominated by small-scale targets, validating its effectiveness on the AITOD benchmark.

4.3. Semantic Segmentation on ADE20K

Setting: We evaluate the semantic segmentation performance of FSSC-Net on the widely used ADE20K dataset [63], which comprises 25 K high-quality images with pixel-level annotations, including 20 K for training, 2 K for validation, and 3 K for testing. The dataset covers 150 semantic categories across diverse indoor and outdoor scenes. For the segmentation framework, we adopt UPerNet [64] as the base architecture and follow standard training and evaluation protocols. All input images are resized to a resolution of 512 × 512 during training.
Results: As shown in Table 6, FSSC-Net demonstrates competitive performance on the semantic segmentation task. When employed as the backbone in the UPerNet framework, it achieves higher or comparable accuracy relative to several state-of-the-art methods. Specifically, FSSC-Net outperforms MambaVision-T and Swin-Tiny by 0.2% and 0.4% in mIoU, respectively, further validating the effectiveness and generalizability of the proposed frequency–spatial fusion mechanism in dense prediction tasks. These results demonstrate that FSSC-Net effectively improves the model’s ability to capture and integrate multi-scale semantic information, thereby enabling more accurate and coherent pixel-level predictions in complex visual scenes.

4.4. Ablation Study

This section presents a series of ablation experiments aimed at investigating the role and individual contributions of each component within the proposed TGCFM mechanism. To this end, we incorporate TGCFM into Swin-T and systematically evaluate the impact of various design choices on overall model performance. Due to computational resource constraints, all ablation studies are conducted on the image classification task. The training protocol follows the settings described in previous sections, using the ImageNet dataset with a total of 300 training epochs.

4.4.1. Analysis of DFSM and TGCFM

In this section, we conduct ablation studies on the necessity of the Dynamic Frequency Selection Module (DFSM) and the TGCFM. For DFSM, we perform image classification experiments using both fixed masks and learnable soft masks to assess the flexibility and adaptability of frequency-domain modeling. For TGCFM, we compare it with a baseline approach that fuses features via channel-wise concatenation followed by a 1 × 1 convolution, aiming to validate the effectiveness of the proposed spatial–frequency collaborative fusion strategy. As shown in Table 7, in the frequency modeling component, the introduction of soft masks significantly improves classification performance, demonstrating the critical role of DFSM in enhancing frequency feature representation. In the spatial–frequency fusion part, the TGCFM, equipped with cross-attention, consistently outperforms the baseline across multiple metrics, confirming its importance in feature alignment and enhancement.

4.4.2. Analysis of Frequency Branches and Corresponding Components in DFSM

In this section, we conduct an ablation study to evaluate the effects of both the number of frequency branches in the DFSM and the number of frequency components within each branch. Table 8 presents the impact of varying the number of frequency-domain branches, while Table 9 examines how different quantities of frequency components within each branch influence overall model performance.
Table 8 shows the Top-1 classification accuracy of the Swin-TGCFM network under various configurations of frequency-domain branches. We conducted experiments with different combinations of high-frequency, low-frequency, and mid-frequency branches to assess both the necessity of incorporating frequency information and the limitations of relying on a single-frequency branch. The results indicate that increasing the number of frequency branches generally improves Top-1 accuracy. Specifically, using only the high-frequency or low-frequency branch yields gains of 0.8% and 0.65%, respectively, indicating that high-frequency features, which are rich in fine-grained details, are more beneficial for classification tasks than low-frequency features. When both high- and low-frequency branches are used together, accuracy improves further by 0.19% and 0.04% compared to a single branch, approaching a performance plateau. Adding a mid-frequency branch provides only marginal additional gains, indicating diminishing returns. These findings highlight that combining diverse frequency-domain features effectively complements spatial features and enhances model performance. This observation aligns with the conclusions drawn previously, which suggest that high-frequency information is inherently more difficult for neural networks to learn, and that improving the model’s ability to capture these features can significantly boost accuracy.
Table 9 presents the results of varying the number of frequency components within each frequency-domain branch. When the number of components is set to two, the model achieves a significant Top-1 accuracy improvement of +0.84%. Increasing the number of components to four further yields a smaller gain of +0.31%, indicating that a moderate number of frequency bands offers a favorable trade-off between performance and computational efficiency. These results demonstrate that the multi-component composite design effectively enhances the flexibility of feature representations and that overall performance tends to improve as more frequency components are incorporated. Notably, the model achieves its best computational efficiency with two components and its highest accuracy with four components, which may reflect the varying contributions of different frequency components. Based on this trade-off, we adopt the two-component configuration as the default setting in our final model design.

4.4.3. Ablation Analysis of the Multi-Size Window Design

We evaluate the effectiveness of the multi-scale window design in FSSC-Net. To capture spatial features at different receptive field sizes, the AMSWA module first compresses the input feature channels and then applies window attention with varying window sizes. To verify the necessity of this design, we conducted ablation experiments using different numbers of window branches within FSSC-Net. During these experiments, the overall parameter count and computational cost are kept approximately constant by proportionally adjusting the channel allocation across the window branches. As shown in Table 10, reducing the number of window branches leads to a noticeable decline in Top-1 accuracy. Notably, despite having comparable parameter and FLOP counts, the single-window variant performs 2.7% worse than the four-window version, clearly validating the effectiveness of the multi-scale window design in enhancing model performance.

4.4.4. Ablation Analysis on the Application Scope of TGCFM

To determine the optimal configuration of TGCFM, we progressively integrated it into different stages of the Swin architecture. The experimental results are presented in Table 11, where checkmarks denote the stages at which TGCFM is applied for feature enhancement. As observed, Top-1 accuracy increases consistently with the number of TGCFM-integrated stages, reaching its peak when all four stages incorporate TGCFM. Notably, full integration introduces only a marginal increase of 4.7% in parameter count. These results suggest that applying TGCFM across all stages of the backbone is an effective and efficient design choice.

4.4.5. Throughput and GPU Memory Comparison

We also evaluate the GPU memory usage and inference throughput of the Swin Transformer before and after integrating the proposed TGCFM mechanism, using batch sizes of 128 and 256, respectively. All experiments are conducted on an NVIDIA RTX 4090 GPU. As shown in Table 12, incorporating TGCFM leads to a moderate increase in resource consumption for both Swin-T and PVT-small. Specifically, the proposed modifications lead to an average inference throughput reduction of 17.49% to 22.67%, along with a GPU memory overhead increase of 21.5% to 31.3%. Additionally, the computational complexity (GFLOPs) increases marginally by 0.1 to 0.2, and the model size grows by 1.08 M to 2.24 M parameters. Despite this additional computational overhead, the trade-off is considered acceptable given the notable improvements (0.45% and 0.84%) in accuracy achieved by the model. It is also worth noting that most of the extra cost stems from the multi-branch architecture inherent in the TGCFM design. To further improve practical efficiency and deployment readiness, future work may explore optimization strategies such as matrix concatenation, shared computation across frequency branches, or pruning redundant frequency components to reduce memory and computation overhead.

4.4.6. Ablation Study on the Necessity of DFSM

To further validate the necessity of DFSM, we replaced it with FFT, DWT, and DCT for frequency modeling and measured the resulting changes in model performance. The detailed results are presented in Table 13. It can be observed that directly using fixed spectral transformations such as FFT, DWT, or DCT leads to varying degrees of performance degradation, indicating that single-spectrum modeling cannot replace the task-conditioned frequency subspace structure constructed by DFSM. This further demonstrates the effectiveness and necessity of DFSM.
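For context, the fixed-spectrum FFT baseline used in this comparison can be sketched as a hard radial split of the 2-D spectrum; the cutoff value and function name here are illustrative, and the contrast with DFSM is that DFSM replaces the hard mask below with learnable, task-conditioned soft masks:

```python
import numpy as np

def fixed_fft_band_split(x, cutoff=0.25):
    """Fixed-spectrum baseline: split a 2-D feature map into low- and
    high-frequency parts with a hard radial cutoff in the FFT domain.
    DFSM instead learns soft, task-conditioned masks over this spectrum."""
    H, W = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))
    yy, xx = np.mgrid[:H, :W]
    # Normalized radius from the spectrum center (0 = DC component).
    r = np.hypot(yy - H / 2, xx - W / 2) / (np.hypot(H, W) / 2)
    low_mask = (r <= cutoff).astype(float)
    low = np.fft.ifft2(np.fft.ifftshift(F * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(F * (1.0 - low_mask))).real
    return low, high
```

Because the two masks partition the spectrum, the two bands sum exactly back to the input, but the partition itself is frozen regardless of task, which is the limitation the ablation exposes.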

4.5. Investigation of Key Parameters

To validate the sensitivity of the masking mechanism and the necessity of the soft mask design, we conducted mechanism replacement experiments by substituting the original soft mask with three alternative strategies: No Mask, Hard Mask, and Learnable Mask. The results show that removing the mask leads to a significant performance drop (−1.2% mAP). Similarly, adopting a fixed-threshold Hard Mask or a Learnable Mask with only channel-wise learnable weights also results in varying degrees of performance degradation. This indicates that raw frequency features contain redundant or interfering components; without high–low frequency separation and task-guided selection, the effectiveness of spatial–frequency fusion is compromised, thereby confirming the structural necessity of selective modeling.
Further analysis in Table 14 reveals that although Hard Mask performs better than No Mask, its lack of continuous adjustment limits its ability to accommodate the fine-grained frequency requirements of different tasks. While Learnable Mask introduces continuous gating, it fails to achieve effective frequency subspace reconstruction, and its performance still falls short of our method. Overall, the advantage of DFSM lies not merely in channel reweighting, but in its task-guided structural selection and frequency subspace construction mechanism, which enables more targeted frequency modeling and more efficient spatial–frequency collaboration.
To evaluate the stability and rationality of the Top-K mechanism, we conducted a sensitivity analysis on its initial ratio. As shown in Table 15, the results indicate that the model is generally insensitive to the initial Top-K value, maintaining stable performance within approximately 0.8 ± 0.1, which suggests a relatively broad effective range for this parameter. However, when the ratio decreases to 0.6 or 0.5, performance drops significantly, indicating that excessive compression leads to the loss of informative frequency components and consequently weakens representation capability. This observation is consistent with the theoretical expectations of frequency subspace construction. Overall, the performance improvement does not stem from fine-grained tuning of the Top-K ratio, but rather from the structural design of task-guided frequency subspace construction. Essentially, Top-K serves as a robust and adaptive subspace compression mechanism.
Furthermore, to enhance the statistical reliability of our findings, we conducted three independent runs for both FSSC-Net and the Swin Transformer baseline on ImageNet. For Top-1 accuracy, FSSC-Net achieves 81.95 ± 0.04, compared with 80.30 ± 0.11 for Swin, yielding an average improvement of approximately 1.65% with a notably smaller standard deviation. For Top-5 accuracy, FSSC-Net obtains 95.52 ± 0.21, surpassing Swin’s 95.12 ± 0.015 by approximately 0.40% on average. These results demonstrate that FSSC-Net maintains stable performance under random initialization and achieves statistically significant improvements over a strong baseline. The consistently lower variance and clear accuracy gains indicate that the performance improvements are robust and reliable, rather than arising from incidental fluctuations.
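The mean ± std figures above are standard per-run statistics; the sketch below uses hypothetical per-run accuracies chosen only to be consistent with the reported 81.95 ± 0.04 (the actual per-run values are not listed in the paper):

```python
import statistics

def summarize_runs(accs):
    """Mean and sample standard deviation over independent training runs,
    the form in which the accuracy figures above are reported."""
    return statistics.mean(accs), statistics.stdev(accs)

# Hypothetical per-run Top-1 accuracies, illustrative only.
mean_acc, std_acc = summarize_runs([81.91, 81.95, 81.99])
```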

4.6. Analysis of Channel Importance and Frequency Dependency

Figure 6 presents an analysis of the high-frequency and low-frequency channel importance produced by the calibration modules in FSSC-Net across three major visual tasks: image classification, object detection, and semantic segmentation. We conduct analysis from two perspectives: intra-task variation—how frequency dependency differs across network depths within the same task; inter-task variation—how frequency preferences shift across different vision tasks.

4.6.1. Analysis of Frequency Dependency Across Training Stages in Classification Tasks

Figure 6(a-1–a-3) illustrates the average channel importance of high- and low-frequency branches (denoted by light and dark blue lines, respectively) across different FSSC blocks during training epochs in the image classification task. It also shows the proportion of high-frequency channels within the top 75% important channels (green line). Overall, the network exhibits a stronger reliance on low-frequency features during early training stages, as evidenced by higher average weights in low-frequency channels. As training progresses, the importance of high-frequency features increases, with the average proportion of high-frequency channels among the top 75% rising from 43.3% to 46.5%. This trend suggests that the network gradually shifts from coarse structural modeling to fine-grained edge and texture representation as training proceeds. This observation aligns well with the curriculum learning theory proposed in EfficientTrain++, which posits that models first learn easily distinguishable low-frequency information before refining high-frequency details.
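The "proportion of high-frequency channels within the top 75% important channels" metric can be computed as follows; pooling both branches' scores into a single ranking is our reading of the figure, and the helper name is illustrative:

```python
import numpy as np

def high_freq_ratio(low_scores, high_scores, top_frac=0.75):
    """Proportion of high-frequency channels among the top `top_frac`
    most important channels, pooling both branches into one ranking."""
    scores = np.concatenate([low_scores, high_scores])
    is_high = np.concatenate([np.zeros(len(low_scores), dtype=bool),
                              np.ones(len(high_scores), dtype=bool)])
    k = max(1, int(round(top_frac * len(scores))))
    top = np.argsort(scores)[::-1][:k]
    return is_high[top].mean()
```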
Furthermore, a depth-wise frequency preference is observed: shallow and deep blocks exhibit structurally different frequency tendencies. Deeper blocks (e.g., Block10–12) tend to emphasize high-frequency features, while shallow ones favor low-frequency components. These preferences remain consistent throughout training and are not due to random fluctuations, indicating that the frequency calibration mechanism in FSSC-Net effectively allocates attention across frequency components adaptively. This dynamic adjustment enhances the model’s expressiveness and depthwise adaptability to varying feature demands.

4.6.2. Analysis of Frequency Dependency Across Model Depth and Vision Tasks

Figure 6(a-3,b-1,b-2,c-1,c-2) present the average importance of high- and low-frequency channels, the high-frequency ratio within the top 75% important channels, and the histograms (Figure 6(d-1,d-2,e-1,e-2)) of channel importance (with a bin size of 0.05) across classification, segmentation, and detection tasks. The results reveal significant differences in frequency dependencies among the tasks. In the image classification task, deep layers—especially Block12—exhibit a strong preference for high-frequency features, reflecting the importance of detailed edges and textures for fine-grained categorization. In contrast, mid- and shallow layers (e.g., Block3–6) lean toward low-frequency components, supporting semantic abstraction and intermediate representation transformation. In semantic segmentation (Figure 6(b-1,b-2)), high- and low-frequency channel weights are more balanced overall. However, high-frequency features account for more than 50% of the top 75% important channels, averaging 51.02%. This ratio remains stable and slightly increases toward the deeper blocks, suggesting a growing need for high-frequency features in pixel-level boundary modeling. These details are critical for enhancing segmentation boundary precision.
In contrast, object detection (Figure 6(c-1,c-2)) shows a more complex frequency preference. The overall proportion of high-frequency channels among the top 75% is approximately 50.13%, with different layers exhibiting distinct patterns: Blocks 4–6 favor low-frequency features, potentially for capturing object contours and coarse localization. Shallow and deeper blocks, however, focus more on high-frequency information to support edge enhancement and refined bounding box localization. This layered dependency pattern echoes the dual demands of object detection, which requires both precise localization and accurate classification across scales.
To further verify differences between task-specific frequency modeling strategies, we analyze the histograms of channel importance across tasks (see Figure 6(d-1,d-2,e-1,e-2,f-1,f-2)). These histograms demonstrate clear differences in high/low frequency dependency that align with the core objectives of each task. In classification (d-1,d-2), both frequency branches show a core-dominated, auxiliary-supported distribution: low-frequency channels maintain some dispersion to support global semantic extraction, while high-frequency channels tend to be highly polarized, focusing on critical details. Detection (f-1,f-2) presents a stage-wise dependency: early blocks require a blend of high- and low-frequency features for multidimensional cues (edges, textures, initial localization), while later stages filter features by retaining only highly significant or marginal channels, aligning with the dual-task nature of detection. In segmentation (e-1,e-2), a balanced moderate dependency emerges. The importance of both high- and low-frequency channels is tightly clustered in the 0.45–0.55 range, indicating the task’s emphasis on deep aggregation of local detail and contextual semantics. By avoiding domination from extremely important or redundant channels, the model ensures fine-grained pixel-level accuracy.
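The channel-importance histograms (bin size 0.05) can be reproduced from the calibration scores with a few lines; the [0, 1] score range is assumed from the sigmoid-style importance values discussed above:

```python
import numpy as np

def importance_histogram(scores, bin_size=0.05):
    """Histogram of channel-importance scores over [0, 1] with the
    fixed bin width used in the figure's distribution plots."""
    n_bins = int(round(1.0 / bin_size))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(scores, bins=edges)
    return counts, edges
```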
Overall, these visualizations reveal clear differences in the reliance on low- and high-frequency features across tasks, as well as substantial variation in frequency preferences across different network depths. These task-specific frequency preferences not only reflect the fundamental differences in modeling strategies among vision tasks, but also strongly support the necessity of task- and data-driven frequency feature extraction and self-calibrated spatial–frequency fusion, while validating the rationale and effectiveness of our task-calibrated frequency modeling mechanism. Specifically, FSSC-Net, through the integration of the DFSM and TGCFM, enables adaptive modulation in the frequency domain and task-aligned feature fusion, a design that significantly enhances the network’s generalization and performance across multi-task, multi-scale, and multi-target visual scenarios.

4.6.3. Quantitative Analysis of the Importance of Frequency Features

To further analyze the practical contributions of different frequency components and the calibration mechanism in TGCFM, we design intervention-based ablation experiments, in which specific structural components are deliberately removed. Specifically, three intervention settings are considered: (1) removing the low-frequency branch (–Low Part); (2) removing the high-frequency branch (–High Part); and (3) removing the entire frequency calibration module (–Calibration Part). The experimental results are reported in Table 16 (δ denotes the performance change relative to the full model).
From the results, several key observations can be drawn. First, both high- and low-frequency components are indispensable. Removing either branch leads to a significant performance drop (approximately −2.8% mAP), indicating that the two types of frequency components play distinct yet complementary roles in object detection and jointly support effective representation learning. Second, the frequency calibration mechanism is the primary driver of performance improvement. When the entire calibration module is removed, performance drops by more than 11% mAP, demonstrating that task-guided modeling in the frequency subspace, rather than simple feature concatenation or fusion, is the key source of the gains. Finally, this intervention-based study establishes a clear causal relationship at the structural level, showing that the observed improvements stem from the task-driven frequency calibration mechanism itself rather than from heuristic design choices or stacked attention effects. These results provide solid empirical evidence for the substantive contribution of TGCFM.
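The three interventions can be mimicked on toy data. The following sketch is purely illustrative: `calibrated_fusion` and its scalar gates are hypothetical stand-ins for TGCFM’s task-guided channel weights, not the actual implementation; it only shows how each ablation switch changes the fused output.

```python
import numpy as np

def calibrated_fusion(x_low, x_high, gate_low, gate_high,
                      use_low=True, use_high=True, use_calibration=True):
    """Toy stand-in for the three interventions: drop the low-frequency branch,
    drop the high-frequency branch, or bypass calibration (plain summation).
    The gates play the role of task-guided channel weights."""
    parts = []
    if use_low:
        parts.append(gate_low * x_low if use_calibration else x_low)
    if use_high:
        parts.append(gate_high * x_high if use_calibration else x_high)
    return sum(parts) if parts else np.zeros_like(x_low)

x_low, x_high = np.ones(4), 2.0 * np.ones(4)
full = calibrated_fusion(x_low, x_high, 0.8, 0.2)                              # full model
no_high = calibrated_fusion(x_low, x_high, 0.8, 0.2, use_high=False)           # "-High Part"
no_calib = calibrated_fusion(x_low, x_high, 0.8, 0.2, use_calibration=False)   # "-Calibration Part"
```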

4.7. Model Visualization Analysis

To gain a deeper understanding of the working mechanism of TGCFM, we conducted a systematic visualization analysis from both internal and external perspectives. Specifically, Figure 7 illustrates the frequency branches within DFSM and presents a t-SNE projection based on samples from the ImageNet dataset to depict the resulting feature distributions. Figure 8 shows bounding box and heatmap visualizations for remote sensing object detection, enabling a comparison of the actual detection performance between FSSC-Net and the Baseline. Following this, Figure 9 provides comparative heatmap visualizations before and after incorporating TGCFM under representative scenarios, including multi-target scenes, occlusion, simple backgrounds, and complex backgrounds. Together, these visualizations facilitate an in-depth analysis of how TGCFM reshapes the model’s attention patterns and enhances its focus on task-relevant regions.

4.7.1. Internal Feature Visualization Analysis

Figure 7a visualizes the frequency-domain features extracted at Stage 1 and Stage 2 of Swin-TGCFM, where High and Low denote the responses of the high-frequency and low-frequency branches, respectively. At Stage 1, which is closer to the input layer, the extracted features preserve a large amount of low-level physical information from the input image. As a result, the frequency representations exhibit a strong correspondence with the original image in both the pixel and frequency domains: the high-frequency branch predominantly responds to edges, textures, and other fine-grained local structures, whereas the low-frequency branch captures smoother and more homogeneous regions. At this stage, the high- and low-frequency components demonstrate clear complementarity in terms of physical characteristics. In contrast, at Stage 2, the features—shaped by interactions between the two frequency branches—gradually evolve into more abstract semantic representations. The visualizations increasingly transcend low-level physical boundaries and place greater emphasis on task-relevant foreground regions, indicating enhanced semantic perception. These observations suggest that the adaptive extraction of high- and low-frequency features not only preserves fine-grained details and global contextual information, but also facilitates the learning of task-relevant representations through cross-frequency interaction. The Grad-CAM visualizations in Figure 7b, taken from Stage 4 of both the baseline Swin-Tiny and the TGCFM-enhanced Swin-Tiny, reveal that TGCFM guides the model to allocate closer attention to semantically relevant regions. This improvement can be attributed to TGCFM’s task-guided calibration mechanism, which selectively amplifies the frequency components most informative for the current task, and its spatial–frequency fusion, which aligns these calibrated features with spatial representations.
Consequently, the network focuses more effectively on target regions, resulting in sharper localization patterns and more complete coverage of object boundaries.
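For intuition, the high-/low-frequency decomposition visualized in Figure 7a can be approximated with a radial soft mask in the Fourier domain. This is a simplified, hypothetical stand-in: DFSM’s soft masks are learned and data-dependent, whereas the mask below is a hand-set sigmoid of the normalized frequency radius.

```python
import numpy as np

def split_frequencies(img, cutoff=0.25, sharpness=20.0):
    """Split a 2-D array into complementary low-/high-frequency parts with a
    radial sigmoid soft mask in the Fourier domain (a fixed, illustrative
    analogue of a learned soft mask)."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-(h // 2):(h + 1) // 2, -(w // 2):(w + 1) // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)                  # normalized frequency radius
    low_mask = 1.0 / (1.0 + np.exp(sharpness * (radius - cutoff)))   # soft low-pass
    low = np.real(np.fft.ifft2(np.fft.ifftshift(f * low_mask)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(f * (1.0 - low_mask))))
    return low, high

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
low, high = split_frequencies(img)
```

Because the two masks sum to one, the two parts reconstruct the input exactly, mirroring the complementarity of the branches noted above; the low part is visibly smoother than the input.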
Similarly, the t-SNE projections in Figure 7c of high-dimensional features for 20 ImageNet classes show that TGCFM-enhanced features exhibit tighter intra-class clustering and increased inter-class separation. These changes directly reflect TGCFM’s ability to perform adaptive frequency selection: by modulating task-relevant frequency channels, the module produces more discriminative representations, which in turn improve class separability. Thus, both the Grad-CAM and t-SNE visualizations substantiate that TGCFM enhances the network’s feature representation quality through explicit task-adaptive frequency calibration and spatial–frequency fusion.
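The qualitative t-SNE observation can be backed by a simple numeric proxy. The sketch below is an illustrative metric of our own (not used in the paper): the ratio of mean intra-class distance to mean inter-class centroid distance, which decreases with tighter clustering and wider separation.

```python
import numpy as np

def cluster_separability(features, labels):
    """Mean intra-class distance to the class centroid, divided by the mean
    inter-class centroid distance; lower values indicate tighter clusters and
    wider separation (a crude numeric proxy for what t-SNE shows visually)."""
    feats = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = {c: feats[labels == c].mean(axis=0) for c in classes}
    intra = np.mean([np.linalg.norm(f - centroids[c]) for f, c in zip(feats, labels)])
    cents = np.stack([centroids[c] for c in classes])
    pair = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=-1)
    inter = pair[np.triu_indices(len(cents), k=1)].mean()
    return float(intra / inter)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
centers = np.array([[0.0, 0.0], [5.0, 0.0]])
tight = centers[labels] + 0.1 * rng.standard_normal((100, 2))   # TGCFM-like clusters
loose = centers[labels] + 2.0 * rng.standard_normal((100, 2))   # baseline-like clusters
```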

4.7.2. External Attention Visualization Analysis

As shown in Figure 9, to evaluate the effectiveness of the TGCFM, we visualize and compare the feature maps generated by Swin and Swin-TGCFM across Stages 1 to 4. In the heatmap visualizations, color variations reflect the model’s focus of attention: deeper red regions indicate areas with richer information content and greater influence on the decision-making process, whereas blue regions are relatively suppressed and contribute less to the final output. This visualization facilitates an intuitive interpretation of the model’s decision basis.
The rows highlighted by red dashed boxes correspond to the results produced by Swin-TGCFM, while red bounding boxes in the images denote the target regions. As observed, incorporating TGCFM enables the model to localize targets more accurately, with the activated regions in the feature maps exhibiting stronger spatial alignment with the corresponding objects in the original images. Subfigures (a–d) illustrate representative scenarios, including (a) occluded objects, (b) multiple objects, (c) simple backgrounds, and (d) complex backgrounds.

4.7.3. Detection Behavior Visualization Analysis

To better understand the behavior of the proposed detection algorithm and its target perception capability in complex remote sensing scenarios, we conduct qualitative visualizations from two complementary perspectives: detection result performance and feature response distribution. First, the predicted bounding boxes are overlaid on the original AITOD images to visually assess the model’s localization accuracy as well as false positives and missed detections under challenging conditions, including multi-scale targets, dense distributions, and complex backgrounds. This visualization provides an intuitive evaluation of the detector’s overall performance and robustness in real-world remote sensing scenes. Second, to further investigate the key regions and discriminative features involved in the model’s detection decisions, heatmap visualizations are employed to analyze the internal feature responses. By comparing the response distributions over target regions and background areas, the heatmaps reveal whether the model can effectively focus on target-relevant regions while suppressing background interference, thereby offering intuitive evidence of the effectiveness of the proposed method in feature representation and small-object perception.
As shown in Figure 8a, the bounding box visualizations of FSSC-Net, the Baseline, and the Ground Truth are presented. The red dashed boxes indicate the target locations in the original images, while the red solid boxes correspond to the enlarged views. It can be observed that the detection results of FSSC-Net are highly consistent with the ground-truth annotations, achieving more accurate and stable localization. In contrast, the Baseline produces a considerable number of false positives and redundant detections. Overall, FSSC-Net demonstrates superior capability in localizing small objects in remote sensing images and significantly reduces false detections due to its more discriminative feature representations. Figure 8b illustrates the heatmap visualizations of the models, where warmer colors indicate higher attention responses and cooler colors denote lower responses. Compared with the Baseline, FSSC-Net exhibits more concentrated attention on true target regions while maintaining lower responses to irrelevant background areas. In contrast, the Baseline shows relatively dispersed attention over the targets and excessive activation in background regions, further highlighting the advantage of FSSC-Net in feature modeling and target discrimination.
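The contrast between concentrated and dispersed attention can be quantified. The following sketch is an illustrative metric of our own (`attention_on_targets` is hypothetical, not part of FSSC-Net): it measures the fraction of total heatmap response that falls inside annotated target boxes.

```python
import numpy as np

def attention_on_targets(heatmap, boxes):
    """Fraction of the total (non-negative) heatmap response that falls inside
    the target boxes, each given as (x0, y0, x1, y1) in pixel coordinates;
    higher values mean attention is concentrated on targets, not background."""
    mask = np.zeros(heatmap.shape, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return float(heatmap[mask].sum() / heatmap.sum())

uniform = np.ones((8, 8))        # dispersed, baseline-like attention
focused = np.zeros((8, 8))
focused[1:4, 1:4] = 1.0          # response concentrated on the target region
box = [(1, 1, 4, 4)]             # a single 3x3 target box
```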

5. Discussion

We comprehensively validate the generalization and robustness of FSSC-Net across mainstream computer vision tasks, including image classification, object detection, and semantic segmentation. By embedding the self-calibrated frequency modeling and fusion mechanism, comprising DFSM and TGCFM, into popular architectures such as the Swin Transformer, we further demonstrate the necessity and effectiveness of task-driven self-calibration mechanisms for frequency-aware representation learning. Despite its advantages, FSSC-Net still presents certain limitations. The integration of multi-branch frequency modeling and cross-attention-based fusion inevitably increases the model’s parameter count and inference cost, which may pose challenges in resource-constrained environments. Additionally, although FSSC-Net exhibits strong performance in standard vision tasks, its adaptability to more complex scenarios, such as multi-modal fusion and cross-domain recognition, has yet to be systematically investigated. Overall, FSSC-Net provides a novel and effective modeling paradigm for frequency-domain representation and frequency–spatial fusion, offering new insights into improving task adaptability and feature expressiveness in visual models. In future work, we plan to further explore its adaptability across diverse tasks and complex environments, while also investigating lightweight network design and efficient inference strategies to enhance the deployability and practical scope of FSSC-Net in real-world applications.

6. Conclusions

To address the limited flexibility of frequency-domain feature extraction and the constrained generalization of existing spatial–frequency fusion strategies in remote sensing vision tasks, this paper proposes the Frequency–Spatial Self-Calibrated Network (FSSC-Net). The network is specifically designed to tackle the challenges of remote sensing imagery, including large spatial coverage, extreme scale variations, dense distributions of small objects, and complex backgrounds. By integrating a task- and data-driven frequency modeling mechanism with an efficient spatial–frequency collaborative strategy, FSSC-Net enables adaptive and robust joint representation learning.
Specifically, FSSC-Net comprises a DFSM and a TGCFM. DFSM adaptively modulates frequency responses through a soft-mask mechanism, enabling task-relevant selection of low- and high-frequency components, while TGCFM aligns spatial-domain and frequency-domain features under task guidance. This design substantially enhances the network’s ability to preserve fine-grained structural information, such as edges, contours, and textures, which is critical for small object detection, land-cover classification, and boundary-sensitive semantic segmentation in remote sensing imagery.
Extensive experiments on general vision benchmarks and the AID remote sensing image classification benchmark demonstrate that FSSC-Net consistently outperforms existing state-of-the-art methods across multiple task settings, effectively generalizing to complex remote sensing scenarios while exhibiting strong robustness and cross-task adaptability. Ablation studies further validate the effectiveness of the proposed self-calibrated dual-domain collaboration strategy in improving semantic consistency and task alignment. Moreover, analysis of frequency calibration behaviors across different network depths and training stages reveals systematic dynamic patterns, providing empirical support for the necessity of adaptive, task-aware frequency modeling in remote sensing vision.
In conclusion, FSSC-Net not only introduces a novel paradigm for frequency-domain representation learning in remote sensing imagery but also advances spatial–frequency fusion toward adaptive, task-aware collaborative modeling. Future work will explore the scalability of this framework in multi-modal remote sensing data fusion, real-time large-scale inference, and resource-constrained deployment scenarios, further extending its applicability in practical remote sensing applications.

Author Contributions

Conceptualization, H.Y. and B.Z.; methodology, H.Y.; validation, H.Y.; formal analysis, H.Y.; investigation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, B.Z.; visualization, H.Y.; supervision, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Special Project for the Integration of “Two Chains”—Qinchuangyuan General Window Industrial Cluster Project for the Integration of Two Chains, grant number 2022QCY-LL-72.

Data Availability Statement

The original data presented in the study are openly available from the following publicly accessible sources: COCO dataset at https://cocodataset.org/#home (accessed on 5 August 2025); ImageNet dataset at https://image-net.org/ (accessed on 6 April 2021); AITOD dataset at https://github.com/jwwangchn/AI-TOD (accessed on 25 September 2025); ADE20K dataset at https://groups.csail.mit.edu/vision/datasets/ADE20K/ (accessed on 15 January 2026).

Acknowledgments

The authors would like to thank the editors and reviewers for their constructive comments, which helped improve the quality of this manuscript. The authors also sincerely acknowledge Pingping Luo and Kanhua Yu for their valuable discussions, insightful suggestions, and general academic support during the course of this research. In addition, thanks are extended to members of the research team and all those who provided assistance during the survey and experimental process.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall architecture of FSSC-Net.
Figure 2. Structure of the Spatial Granularity Self-Adaptive Module (SGSAM). (a) represents the multi-granularity perception part, and (b) represents the granularity adaptation part.
Figure 3. Detailed structure of the Dynamic Frequency Selection Module.
Figure 4. Structure of the Task-Guided Calibration Fusion Module (TGCFM).
Figure 5. Visualization of TGCFM’s importance in the model trained on the AID dataset. (a-1,a-2) illustrate the mean importance of low- and high-frequency channels at different network depths. (b,c) show histograms of channel importance for low- and high-frequency branches across different layers.
Figure 6. (a-1–a-3,b-1,b-2,c-1,c-2) illustrate the average frequency-domain channel importance within the calibration modules of 12 FSSC-Blocks across three major vision tasks: image classification, semantic segmentation, and object detection. The light blue and dark blue curves represent the mean importance scores of low- and high-frequency features, respectively, while the green curve denotes the proportion of high-frequency channels among the top 75% most important channels. Specifically, (a-1–a-3) show how frequency importance evolves across different training stages in the classification task; (b-1,b-2,c-1,c-2) present the frequency importance scores of the calibration modules in segmentation and detection tasks, respectively, based on models pre-trained on the classification task. (d-1,d-2,e-1,e-2,f-1,f-2) display histograms of frequency importance weights within the calibration modules of 12 FSSC-Blocks for the three vision tasks. Detailed analysis is in Section 4.5.
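The statistics plotted in Figure 6 can be illustrated directly from a block's calibration weights. The sketch below uses random stand-in weights (the array sizes and value ranges are illustrative assumptions, not the trained model's parameters) to show how the three reported quantities are computed: the per-branch mean importance (the light/dark blue curves) and the share of high-frequency channels among the top 75% most important channels (the green curve).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in calibration weights for one FSSC-Block (sizes and value
# ranges are illustrative assumptions, not the trained model's weights):
low_w = rng.uniform(0.2, 1.0, size=96)   # low-frequency channel importance
high_w = rng.uniform(0.0, 0.8, size=96)  # high-frequency channel importance

# Mean importance per branch (the light/dark blue curves).
mean_low, mean_high = low_w.mean(), high_w.mean()

# Share of high-frequency channels among the top 75% most important
# channels overall (the green curve).
all_w = np.concatenate([low_w, high_w])
k = int(0.75 * all_w.size)                   # 75% of 192 channels = 144
top_idx = np.argsort(all_w)[::-1][:k]        # indices of the top-k channels
high_ratio = np.mean(top_idx >= low_w.size)  # fraction that are high-frequency

print(mean_low, mean_high, high_ratio)
```

Repeating this per block and per training stage yields curves of the kind shown in the figure.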
Figure 7. (a) Visualization of the high-frequency and low-frequency branch features in Stages 1 and 2 of Swin-TGCFM. (b) Grad-CAM of Stage 4 for Swin-Tiny and Swin-Tiny-TGCFM. (c) t-SNE visualization of feature distributions before and after applying TGCFM.
Figure 8. Visualization results on the AITOD dataset, including predicted bounding boxes and corresponding heatmaps. The red dashed line in the figure indicates the target area, the solid red line represents the enlarged result, and the other colors correspond to the detection results.
Figure 9. Visualization of the heatmaps generated by Swin and Swin-TGCFM from Stage 1 to Stage 4. The area within the solid red line in the figure is the region where the target is located.
Table 1. Comparative results of FSSC-Net and recent state-of-the-art methods on the ImageNet-1K classification task.
| Backbone | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|
| PVT-Small | 24.5 | 224² | 3.8 | 79.8% | - |
| PVT-Medium | 44.2 | 224² | 6.7 | 81.2% | - |
| MambaVision-T [47] | 31.8 | 224² | 4.4 | 82.3% | - |
| TNT-S [48] | 23.8 | 224² | 5.2 | 81.5% | 95.7% |
| ResNet-50 | 25.6 | 224² | 4.1 | 76.2% | 92.9% |
| DeiT-Small [49] | 22.1 | 224² | 4.6 | 81.2% | 95.4% |
| ResNeXt-50-32x4d [50] | 25.0 | 224² | 4.3 | 77.8% | - |
| Res2Net-50 | 25.7 | 224² | 4.2 | 78.0% | 93.5% |
| T2T-ViTt-14 [51] | 21.5 | 224² | 5.2 | 80.7% | - |
| Swin-T | 29.0 | 224² | 4.5 | 81.3% | - |
| PVTv2-B2 | 25.4 | 224² | 4.0 | 82.0% | - |
| EfficientNet-B3 [52] | 12.0 | 224² | 1.8 | 81.6% | - |
| ResNet-101 | 44.7 | 224² | 7.9 | 77.4% | - |
| ResNet-152 | 60.2 | 224² | 11.6 | 78.3% | - |
| DiCaN (101; 48, 48) [53] | 44.3 | 224² | 9.5 | 81.3% | 95.5% |
| FSSC-Net | 29.6 | 224² | 4.5 | 81.9% | 95.4% |
Table 2. Evaluation of performance gains brought by TGCFM across different methods.
| Backbone | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|
| PVT-Small | 24.1 | 224² | 3.7 | 79.12% | 94.13% |
| PVT-Small-TGCFM | 25.0 | 224² | 3.8 | 79.57% | 94.27% |
| Swin-T | 29.0 | 224² | 4.5 | 80.32% | 95.13% |
| Swin-T-TGCFM | 31.2 | 224² | 4.7 | 81.16% | 95.33% |
| ResNet-50 | 25.6 | 224² | 4.1 | 78.01% | 94.04% |
| ResNet-50-TGCFM | 29.9 | 224² | 4.8 | 79.47% | 94.78% |
Table 3. Comparison of accuracy between FSSC-Net and other remote sensing image classification methods on the AID Dataset.
| Backbone | Resolution | Accuracy | Precision | Recall |
|---|---|---|---|---|
| PVT-Small-TGCFM | 224² | 93.95% | 93.83% | 93.15% |
| Swin-T | 224² | 89.57% | 89.50% | 89.31% |
| Swin-T-TGCFM | 224² | 95.80% | 95.62% | 95.34% |
| ResNet-50 | 224² | 94.70% | 94.68% | 94.41% |
| ConvNeXt [54] | 224² | 92.80% | 92.81% | 92.40% |
| MPViT [55] | 224² | 92.83% | 92.94% | 92.56% |
| MFC2Net [56] | 224² | 93.68% | 93.63% | 93.33% |
| SwiftFormer [57] | 224² | 93.47% | 93.52% | 93.12% |
| WCNNs [58] | 224² | 92.88% | 92.82% | 92.61% |
| WDTCNN [59] | 224² | 93.92% | 93.97% | 93.64% |
| FDNet | 224² | 95.92% | 95.90% | 95.77% |
| FSSC-Net | 224² | 96.20% | 95.98% | 95.84% |
Table 4. Evaluation of object detection results on MS COCO val2017 using the Cascade R-CNN framework.
| Backbone | Method | Param (M) | Resolution | FLOPs (G) | AP | AP50 | AP75 | Train Schedule |
|---|---|---|---|---|---|---|---|---|
| PVT-Small | RetinaNet | 44.10 | 800/1333 | - | 43.0 | 65.3 | 46.9 | |
| Swin-T | Cascade Mask R-CNN | 72.77 | 800/1333 | 745 | 50.5 | 69.3 | 54.9 | |
| ResNet-50 | Mask R-CNN | 82.00 | 800/1280 | 739 | 46.3 | 64.3 | 50.5 | |
| TNT-S | Faster R-CNN | 48.10 | - | - | 41.5 | 64.1 | 44.5 | |
| Res2Net-101 | Cascade R-CNN | - | 800/1333 | - | 43.0 | 63.5 | - | - |
| PVT-V2-B2 | Sparse R-CNN | 45.00 | 800/1280 | - | 50.1 | 69.5 | 54.9 | |
| ConvNeXt-T | Cascade Mask R-CNN | 86.00 | 800/1280 | 741 | 50.4 | 69.1 | 54.8 | |
| MambaVision-T | Cascade Mask R-CNN | 86.00 | 800/1280 | 740 | 51.1 | 70.0 | 55.6 | |
| FSSC-Net | Cascade R-CNN | 79.80 | 800/1333 | 820 | 50.9 | 69.8 | 55.2 | |
Table 5. Evaluation of remote sensing object detection results on AITOD using the Mask R-CNN framework (per-class AP: AI = airplane, BR = bridge, ST = storage tank, SH = ship, SP = swimming pool, VE = vehicle, PE = person, WM = wind mill).

| Method | Backbone | AI | BR | ST | SH | SP | VE | PE | WM | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TridentNet | ResNet-50 | 7.6 | 0.0 | 10.8 | 9.6 | 2.5 | 7.0 | 1.9 | 0.0 | 4.9 | 13.1 | 2.7 |
| RetinaNet | ResNet-50 | 0.4 | 0.0 | 2.4 | 10.0 | 0.0 | 5.8 | 1.0 | 0.0 | 2.4 | 8.4 | 0.7 |
| Mask R-CNN | ResNet-50 | 20.0 | 1.9 | 17.4 | 18.6 | 8.0 | 10.9 | 3.8 | 0.0 | 10.1 | 22.7 | 7.6 |
| ATSS | ResNet-50 | 0.5 | 9.8 | 16.8 | 26.4 | 0.6 | 12.0 | 4.0 | 1.2 | 8.9 | 22.5 | 5.3 |
| Cascade R-CNN | ResNet-50 | 21.0 | 6.7 | 20.3 | 21.0 | 7.7 | 12.8 | 4.8 | 0.0 | 11.8 | 25.6 | 9.5 |
| FCOS | ResNet-50 | 17.2 | 1.6 | 21.2 | 19.8 | 0.8 | 13.3 | 4.9 | 0.0 | 9.8 | 24.1 | 6.2 |
| SSD | VGG-16 | 10.9 | 2.1 | 10.7 | 14.3 | 1.8 | 9.2 | 2.2 | 0.9 | 6.5 | 21.0 | 2.4 |
| YOLOv3 | Darknet-53 | 19.1 | 9.7 | 21.7 | 20.9 | 4.7 | 14.5 | 5.2 | 2.7 | 12.3 | 35.3 | 5.6 |
| KLDNet | ResNet-50 | 13.6 | 18.7 | 35.7 | 42.3 | 5.2 | 24.9 | 9.3 | 6.7 | 19.6 | - | - |
| DNTR | ResNet-50 | - | - | - | - | - | - | - | - | 26.2 | 56.7 | 20.2 |
| CAF2ENet-M | ResNet-50 | 39.7 | 22.8 | 43.7 | 50.6 | 24.6 | 34.9 | 17.1 | 8.9 | 30.2 | 63.7 | 25.0 |
| MENet | Swin-T | 27.5 | 16.4 | 37.4 | 42.1 | 18.8 | 24.9 | 10.2 | 8.2 | 23.2 | 56.2 | 15.0 |
| Ours | FSSC-Net | 39.8 | 32.3 | 51.7 | 47.6 | 9.6 | 37.0 | 16.8 | 8.4 | 30.4 | 62.2 | 25.9 |
Table 6. Performance comparison on semantic segmentation tasks using the ADE20K benchmark.
| Backbone | Method | Param (M) | Resolution | FLOPs (G) | mIoU |
|---|---|---|---|---|---|
| PVT-Small | Semantic FPN | 28.2 | 512² | 44.5 | 39.8 |
| MambaVision-T | Semantic FPN | 31.8 | 512² | - | 46.0 |
| Swin-T | UPerNet | 29.0 | 512² | 39.3 | 45.8 |
| ResNet-50 | UPerNet | 28.5 | 512² | 45.6 | 36.7 |
| TNT-S | Trans2Seg | 23.8 | 512² | 30.4 | 43.6 |
| PVT-V2-B2 | Semantic FPN | 25.4 | 512² | 45.8 | 45.2 |
| ConvNeXt-T | UPerNet | 29.0 | 512² | 23.3 | 46.7 |
| FSSC-Net | UPerNet | 29.6 | 512² | 40.6 | 46.2 |
Table 7. Ablation results of DFSM and TGCFM on classification performance.
| Backbone | F-Mask | S-Mask | Scale | C-Conv | LW-Add | X-Attn | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| FSSC-Net | | | | | | | 81.11% | 95.07% |
| FSSC-Net | | | | | | | 81.41% | 95.14% |
| FSSC-Net | | | | | | | 81.77% | 95.26% |
| FSSC-Net | | | | | | | 81.40% | 95.15% |
| FSSC-Net | | | | | | | 81.92% | 95.40% |
Table 8. Ablation analysis of classification performance with varying numbers of frequency branches in DFSM.
| Backbone | Fb. High | Fb. Low | Fb. Mid | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|
| Swin-T | | | | 29.00 | 224² | 4.50 | 80.32% | 95.13% |
| Swin-T | | | | 29.89 | 224² | 4.55 | 80.97% | 95.27% |
| Swin-T | | | | 29.89 | 224² | 4.55 | 81.12% | 95.12% |
| Swin-T | | | | 31.24 | 224² | 4.68 | 81.16% | 95.33% |
| Swin-T | | | | 32.97 | 224² | 4.96 | 81.22% | 95.29% |
Table 9. Evaluation of classification performance under different frequency component allocations per TGCFM branch.
| Backbone | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|
| Swin-T + fc.0 | 29.00 | 224² | 4.50 | 80.32% | 95.13% |
| Swin-T + fc.1 | 29.62 | 224² | 4.51 | 80.76% | 95.26% |
| Swin-T + fc.2 | 31.24 | 224² | 4.68 | 81.16% | 95.33% |
| Swin-T + fc.3 | 32.32 | 224² | 4.77 | 81.19% | 95.32% |
| Swin-T + fc.4 | 33.41 | 224² | 4.87 | 81.47% | 95.64% |
Table 10. Ablation analysis of window number impact on FSSC-Net performance.
| Backbone | win.11 | win.7 | win.5 | win.3 | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| FSSC-Net | | | | | 31.5 | 224² | 4.78 | 79.23% | 93.80% |
| FSSC-Net | | | | | 31.2 | 224² | 4.74 | 81.08% | 94.68% |
| FSSC-Net | | | | | 30.3 | 224² | 4.56 | 81.49% | 95.11% |
| FSSC-Net | | | | | 29.6 | 224² | 4.50 | 81.92% | 95.40% |
Table 11. Ablation analysis on the effect of TGCFM integration at different Swin-T stages.
| Backbone | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Param (M) | Resolution | FLOPs (G) | Top-1 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| Swin-T | | | | | 29.00 | 224² | 4.50 | 80.32% | 95.13% |
| Swin-T | | | | | 29.01 | 224² | 4.52 | 80.94% | 95.18% |
| Swin-T | | | | | 29.06 | 224² | 4.55 | 81.05% | 95.22% |
| Swin-T | | | | | 29.62 | 224² | 4.61 | 81.12% | 95.31% |
| Swin-T | | | | | 31.24 | 224² | 4.68 | 81.16% | 95.33% |
Table 12. Analyzing the impact of TGCFM on throughput and GPU memory usage in Swin-T and PVT-Small models.
| Backbone | Param (M) | Resolution | FLOPs (G) | Batch of 128 | Batch of 256 | Memory (MiB) |
|---|---|---|---|---|---|---|
| Swin-T | 29.00 | 224² | 4.5 | 1566 | 1548 | 10,717 |
| Swin-T + TGCFM | 31.24 | 224² | 4.7 | 1292 | 1263 | 13,031 |
| PVT-Small | 31.24 | 224² | 3.7 | 1801 | 1902 | 11,693 |
| PVT-Small + TGCFM | 32.32 | 224² | 3.8 | 1455 | 1471 | 15,357 |
Table 13. Quantitative comparison of frequency domain feature extraction methods for classification.
| Method | δmAP | δAP50 |
|---|---|---|
| FFT | −0.25 | +0.33 |
| DWT | −0.83 | −0.24 |
| DCT | −0.23 | −0.01 |
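As a reference for the comparison above, the sketch below shows one common way to realize FFT-based low/high-frequency feature extraction with a radial soft mask; the `cutoff` and `sharpness` values are illustrative assumptions, not the values used in DFSM.

```python
import numpy as np

def fft_split(x, cutoff=0.25, sharpness=20.0):
    """Split a 2-D feature map into low- and high-frequency parts with a
    radial soft mask in the FFT domain. `cutoff` and `sharpness` are
    illustrative hyper-parameters, not the paper's exact values."""
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)                # normalized frequency radius
    low_mask = 1.0 / (1.0 + np.exp(sharpness * (radius - cutoff)))
    spec = np.fft.fft2(x)
    low = np.fft.ifft2(spec * low_mask).real           # smooth structures
    high = np.fft.ifft2(spec * (1.0 - low_mask)).real  # edges and textures
    return low, high

x = np.random.default_rng(1).standard_normal((32, 32))
low, high = fft_split(x)
# The two masks sum to one, so the parts reconstruct the input exactly.
print(np.max(np.abs(low + high - x)))
```

Because the low-pass and high-pass masks sum to one at every frequency, the decomposition is lossless; DWT- or DCT-based variants would replace the transform while keeping the same masking idea.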
Table 14. Analysis of accuracy across different masks.
| Method | δmAP | δAP50 |
|---|---|---|
| hard mask | −0.45 | −0.06 |
| learnable mask | −0.34 | −0.4 |
| no mask | −1.2 | −0.53 |
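The difference between the mask variants compared above is easy to see on a 1-D slice of frequency radii: a hard mask is a binary step at the cutoff, while a soft (sigmoid) mask decays smoothly and stays differentiable, which is what allows gradient-based calibration of the frequency split. The sharpness value here is an illustrative assumption.

```python
import numpy as np

radius = np.linspace(0.0, 0.5, 6)  # sample normalized frequency radii
cutoff = 0.25

# Hard mask: a binary step at the cutoff; not differentiable, so the
# frequency split cannot be tuned by back-propagation.
hard = (radius <= cutoff).astype(float)

# Soft mask: a sigmoid transition; differentiable everywhere, so the
# effective cutoff can be learned (sharpness 20.0 is an illustrative
# assumption).
sharpness = 20.0
soft = 1.0 / (1.0 + np.exp(sharpness * (radius - cutoff)))

print(hard)  # [1. 1. 1. 0. 0. 0.]
print(np.round(soft, 3))
```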
Table 15. Analysis of different initialization K values.
| TopK | δmAP | δAP50 |
|---|---|---|
| 0.9 | +0.07 | +0.01 |
| 0.7 | −0.03 | −0.12 |
| 0.6 | −0.24 | −0.09 |
| 0.5 | −0.51 | −0.12 |
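The initialization K above controls what fraction of frequency coefficients is initially retained. A minimal sketch of such a Top-K selection by spectral magnitude (an illustration of the general mechanism, not the paper's exact implementation):

```python
import numpy as np

def topk_retain(spec_mag, k_ratio):
    """Zero out all but the top `k_ratio` fraction of frequency
    coefficients by magnitude (a sketch of Top-K frequency selection,
    not the paper's exact implementation)."""
    flat = spec_mag.ravel()
    k = max(1, int(k_ratio * flat.size))
    thresh = np.sort(flat)[::-1][k - 1]  # k-th largest magnitude
    return spec_mag * (spec_mag >= thresh)

x = np.random.default_rng(2).standard_normal((16, 16))
mag = np.abs(np.fft.fft2(x))
kept = topk_retain(mag, 0.9)
print(np.mean(kept > 0))  # ≈ 0.9 of the coefficients survive
```

A larger K keeps the selection close to a full-spectrum pass-through at the start of training, which is consistent with K = 0.9 performing best in the ablation.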
Table 16. Ablation analysis of frequency preference.
| Strategy | δmAP | δAP50 |
|---|---|---|
| Low Part | −2.82 | −1.55 |
| High Part | −2.77 | −1.59 |
| Calibration Part | −11.16 | −6.7 |
Yuan, H.; Zhang, B. FSSC-Net: A Frequency–Spatial Self-Calibrated Network for Task-Adaptive Remote Sensing Image Understanding. Remote Sens. 2026, 18, 824. https://doi.org/10.3390/rs18050824