1. Introduction
Benefiting from the rapid development of backbone architectures and representation learning paradigms, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have achieved remarkable progress in remote sensing image analysis. However, compared with natural images, remote sensing imagery poses significantly greater challenges due to its wide spatial coverage, extreme scale variation, dense distribution of small objects, complex backgrounds, and diverse imaging conditions such as illumination changes, atmospheric interference, and sensor noise [1,2,3,4]. In particular, in high-resolution aerial or satellite images, small and weak targets rely heavily on subtle structural cues, including edges, contours, and local textures, which are highly susceptible to degradation during deep feature abstraction. These characteristics impose stringent requirements on the robustness and adaptability of visual representation models.
To address these challenges, increasing attention has been devoted to the fusion of spatial- and frequency-domain representations in remote sensing vision [5,6]. This paradigm seeks to exploit the inherent complementarity between spatial structures (e.g., object shapes, spatial layouts, and contextual relationships) and frequency characteristics (e.g., edges, textures, and repetitive patterns), thereby enhancing model discriminability in complex remote sensing scenarios. Notably, frequency-domain features are particularly effective in capturing fine-grained details and structural regularities that are critical for small object detection, land-cover classification, and boundary-sensitive semantic segmentation in remote sensing imagery.
Existing frequency-aware methods can be broadly categorized into three groups. The first group emphasizes high-frequency information, such as edges and textures, and enhances spatial-domain representations through explicit frequency guidance, detail injection, or noise suppression mechanisms [7,8,9]. These approaches have been widely applied to infrared small target detection [10], remote sensing object recognition [11], low-light or low-contrast image enhancement [12], as well as surface defect and change detection tasks [13]. The second group models spatial-domain and frequency-domain features as parallel branches and achieves collaborative optimization through feature concatenation [14], attention-based weighting [15], residual learning [16], or cross-domain fusion strategies [17]. These methods have demonstrated strong performance in salient object detection [18], remote sensing image classification, few-shot recognition, small object detection [19], and image super-resolution [20]. The third group incorporates task priors or domain knowledge to guide frequency modeling, for example through curriculum learning strategies [21] or controlled frequency injection schemes, and has been applied to specialized domains such as marine seismic processing [22], medical image analysis [23], and fine-grained pose estimation [24].
Despite these advances, spatial–frequency collaborative frameworks for remote sensing imagery still face two fundamental limitations. First, frequency-domain modeling often lacks flexibility: most existing methods rely on static frequency masks or fixed transformation parameters, which struggle to adapt to the scale diversity, scene complexity, and sensor variability commonly encountered in remote sensing data. Second, spatial–frequency fusion strategies exhibit limited generalization across different tasks and network depths, resulting in unstable performance when transferred among classification, detection, and segmentation scenarios.
To further investigate this issue, we analyze the frequency calibration weights of each FSSC block in our proposed FSSC-Net at different training stages. Our observations show that the network exhibits distinct dependencies on low- and high-frequency components at different depths, and that these dependencies evolve dynamically during training. Specifically, the relative importance of high- and low-frequency channels varies across feature hierarchies, and the proportion of high-frequency features among the most important channels changes with training progress. This phenomenon indicates that, in remote sensing tasks, the significance of frequency-domain information is inherently hierarchical and stage-dependent, closely related to spatial resolution, semantic abstraction level, and task objectives. Consequently, static frequency modeling or fixed fusion strategies may fail to fully exploit frequency cues and may even introduce redundancy, constrain representational capacity, and ultimately degrade performance.
Motivated by these observations, we propose the Frequency–spatial Self-Calibrated Network (FSSC-Net), a novel architecture specifically designed for task-driven frequency modeling and adaptive spatial–frequency fusion in remote sensing vision. The main contributions of this work are summarized as follows:
We propose FSSC-Net, an adaptive spatial–frequency collaborative framework tailored for remote sensing tasks. The network dynamically selects task-relevant frequency components based on data distributions and selectively integrates low- and high-frequency information to enhance spatial–frequency representations.
We design a lightweight and plug-and-play self-calibrated frequency modeling mechanism, composed of a Dynamic Frequency Selection Module and a Task-Guided Calibration Fusion Module. This mechanism enables soft and adaptive frequency response learning, improving the flexibility of frequency extraction and the robustness of spatial–frequency fusion across diverse tasks.
We introduce a systematic frequency importance analysis framework across tasks, network depths, and training stages, revealing task-specific frequency preferences and validating the necessity of task-calibrated frequency modeling.
Extensive experiments on benchmarks for image classification, object detection, and semantic segmentation demonstrate the effectiveness of FSSC-Net, highlighting its strong task adaptability and cross-task generalization capability in complex remote sensing scenarios.
3. Methodology
In this section, we present the proposed Frequency–Spatial Self-Calibration Network (FSSC-Net).
Section 3.1 introduces the overall architecture of FSSC-Net. Section 3.2 describes the Dynamic Frequency Selection Module, which adaptively identifies and extracts task-relevant frequency components. Section 3.3 details the Spatial Granularity Self-Adaptive Module, designed to provide granularity-adaptive spatial features. Finally, Section 3.4 elaborates on the Task-Guided Calibration Fusion Module, which achieves task-aware adaptive fusion of spatial and frequency features via self-calibration.
3.1. Frequency–Spatial Self-Calibrated Network
To enhance the representational capacity of deep networks and improve performance across downstream tasks, it is essential to adopt frequency feature extraction and spatial–frequency fusion strategies that are more closely aligned with task requirements. To this end, we propose the FSSC-Net, a novel architecture built upon a task-guided calibration mechanism. FSSC-Net integrates three key modules across all stages: an adaptive frequency-domain feature extraction module, a spatial-domain multi-scale modeling module, and a task-aware spatial–frequency fusion module, enabling collaborative modeling across both domains.
As illustrated in Figure 1, FSSC-Net adopts a hierarchical architecture composed of four stages with progressively reduced spatial resolutions. Each stage consists of multiple FSSC Blocks, which serve as the fundamental building units of the network. In the figure, the green regions indicate the multi-granularity spatial feature extraction module, which consists of the MGPU (Multi-Granularity Perception Unit) and the GAU (Granularity Adaptation Unit); the blue regions denote the Dynamic Frequency Selection Module with high- and low-frequency branches; and the gray module represents the self-calibrated frequency–spatial feature fusion module. The red dashed box provides a schematic of the feature computation and information flow within an FSSC Block.
Each FSSC Block comprises three key components: (1) the Dynamic Frequency Selection Module (DFSM), (2) the Spatial Granularity Self-Adaptive Module (SGSAM), and (3) the Task-Guided Calibration Fusion Module (TGCFM). Within each block, input features are processed in parallel by frequency- and spatial-domain branches to capture semantic structures and fine-grained spatial details, respectively. The spatial branch, inspired by window attention mechanisms, employs a hybrid design that combines parallel and cascaded multi-granularity branches to accommodate diverse task-specific granularity requirements. The cascaded window structure facilitates inter-scale information exchange, while learnable fusion weights allow the network to handle heterogeneous feature distributions effectively (see Figure 2b).
In the frequency branch, DFSM applies soft masks to divide deep features into adaptive high- and low-frequency components. A multi-scale frequency-domain pyramid is then constructed based on distinct index sets and fused via learnable weights to adapt to diverse application needs. SGSAM addresses the mismatch between fixed-scale modeling and task-specific spatial feature demands by introducing a multi-granularity perception mechanism and adaptive fusion strategy, enabling dynamic adjustment of spatial granularity and a balanced representation of local details and global semantics. Finally, TGCFM implements a task-aware spatial–frequency fusion strategy. By integrating frequency feature calibration with a cross-attention mechanism, it facilitates deep integration of spatial and frequency information, substantially enhancing collaborative modeling across both domains.
Overall, the key contribution of FSSC-Net lies in the proposed task-guided adaptive cross-domain collaboration framework, which overcomes the limitations of traditional frequency-domain feature extraction and fusion, namely insufficient flexibility and lack of task awareness. In the frequency feature extraction stage, the framework constructs dynamic frequency subspaces under task supervision, enabling more targeted frequency representations. In the spatial–frequency fusion stage, a calibrate-then-fuse two-stage strategy is employed: frequency features are first selectively filtered and optimized in a task-relevant manner, and then non-overlapping windowed cross-attention is applied to achieve efficient semantic alignment and cross-domain collaboration. This design allows the model to dynamically adjust spatial granularity, frequency allocation, and cross-domain interaction patterns according to different visual tasks, significantly enhancing representation capacity and task performance in complex scenarios while maintaining computational efficiency. Overall, FSSC-Net provides a systematic, task-driven solution for spatial–frequency collaborative modeling.
3.2. Dynamic Frequency Selection Module
The DFSM is designed to adaptively identify task-relevant frequency components from the DCT domain through a data-driven strategy. It enables the selective integration of high-frequency details and low-frequency semantic information to enhance representation flexibility. The module operates in two main stages: (1) Multi-Scale Frequency Component Construction: A soft mask mechanism is applied to construct multiscale frequency-domain representations using different index sets (e.g., 32 × 32 and 16 × 16). Channel attention is then used to dynamically model the importance of each frequency component. This design allows the model to adaptively select relevant frequency features based on the intrinsic characteristics of the input data and the specific requirements of the task, preserving key components while suppressing redundancy. (2) Frequency Fusion and Weight Adjustment: To further improve the adaptability of frequency-domain features, DFSM incorporates learnable weights to perform dynamic fusion across different frequency components. This facilitates adaptive control over the granularity of frequency representation. Throughout the extraction process, the soft masks enable flexible alignment with various frequency distributions, while also suppressing noise and redundant responses, thus enhancing the robustness of the features. For each frequency representation built from different index sets, DFSM dynamically fuses them to balance information across multiple granularities. This results in a hierarchical frequency-domain representation that not only supports task-driven frequency selection but also leverages multi-scale information to improve feature expressiveness.
In summary, DFSM enables multi-scale collaboration and task-adaptive frequency modeling by dynamically constructing and fusing frequency components. It provides a more flexible and targeted frequency representation strategy for vision models. As illustrated in Figure 3, the DFSM pipeline proceeds as follows. Given an input feature map, adaptive average pooling (AAP) is first applied to transform it into a fixed spatial resolution. Then, based on a pair of learnable soft masks, the module extracts multi-scale low-frequency and high-frequency components from the corresponding index sets. These components are refined via a Squeeze-and-Excitation (SE) module for relevance filtering and noise suppression, and are element-wise multiplied with the original input to generate the refined frequency feature groups. During fusion, features from different index groups within the high- and low-frequency components are further integrated via learnable parameters to produce the final multi-scale high-frequency and low-frequency features, respectively. The complete fusion process is described in Equation (1).
Owing to the aforementioned design, the DFSM does not perform selection within a single frequency spectrum. Instead, it constructs multiple frequency subspaces using different index sets, forming frequency representations that are better adapted to the current task. Specifically, DFSM does not apply a one-time global filter to the input features; rather, it independently calibrates each frequency subspace and applies different strategies to different subspaces. Finally, the module achieves collaborative optimization across multiple frequency subspaces through learnable fusion weights, dynamically balancing low- and high-frequency information while enhancing feature representation. Therefore, the core innovation of DFSM lies in the fact that it does not merely select the “most salient frequencies,” but constructs task-oriented multi-subspace frequency representations, supporting multi-structure collaboration and dynamic adaptation.
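To make the soft-mask frequency decomposition concrete, the following NumPy sketch splits a single-channel feature map into low- and high-frequency components in the DCT domain. The radial sigmoid mask, together with the `radius` and `sharpness` parameters standing in for DFSM's learnable soft masks and index sets, is an illustrative assumption rather than the paper's exact formulation:

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis: D @ x applies a 1-D DCT along axis 0.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    d[0] /= np.sqrt(2.0)
    return d

def soft_frequency_split(x, radius, sharpness=4.0):
    """Split a (H, W) map into low/high components with a soft radial
    mask in the DCT domain; `radius`/`sharpness` are stand-ins for the
    learnable mask parameters described in the text."""
    h, w = x.shape
    Dh, Dw = dct_matrix(h), dct_matrix(w)
    spec = Dh @ x @ Dw.T                                   # 2-D DCT
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dist = np.sqrt(u ** 2 + v ** 2)                        # DCT-index radius
    low_mask = 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))  # soft mask
    low = Dh.T @ (spec * low_mask) @ Dw                    # inverse DCT per band
    high = Dh.T @ (spec * (1.0 - low_mask)) @ Dw
    return low, high
```

Because the two soft masks sum to one and the DCT basis is orthonormal, the two bands reconstruct the input exactly, so no information is discarded before the SE-style reweighting step.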
3.3. Spatial Granularity Self-Adaptive Module
In the Vision Transformer domain, window-based attention mechanisms have become a key strategy for balancing computational efficiency and model performance. Inspired by pioneering works such as Swin Transformer and P2T, we propose the Spatial Granularity Self-Adaptive Module (SGSAM) to address the limitations of conventional approaches that rely on fixed-scale convolutions or single-size windows, which often struggle to reconcile the dynamic needs of local detail modeling and global semantic understanding. SGSAM enables dynamic adaptation of spatial feature granularity, providing efficient and flexible spatial representations. SGSAM is composed of two core components. (1) Multi-Granularity Perception Unit (MGPU): this unit constructs spatial representations with diverse perceptual granularity through a parallel attention-branch architecture, as illustrated in Figure 2a. (2) Granularity Adaptation Unit (GAU): this unit dynamically selects and integrates features with the most suitable granularity based on task-specific demands, forming an adaptive mapping between granularity and semantic requirements, as shown in Figure 2b. As a key spatial component within each FSSC Block, SGSAM significantly improves both the flexibility and accuracy of spatial feature representation. By adaptively adjusting feature granularity, it provides task-aware spatial inputs for downstream tasks such as object detection and scene segmentation. Moreover, it helps prevent mismatches between scale selection and actual feature requirements, which are common pitfalls in fixed-scale designs. Overall, SGSAM enhances the network's spatial adaptability and semantic precision, leading to improved performance in complex vision scenarios.
3.3.1. Multi-Granularity Perception Unit (MGPU)
The MGPU is designed to construct multi-granular spatial representations under controlled computational cost. As shown in Figure 2, the input feature map is first passed through a channel-compression convolution, producing a compact representation with a reduced channel dimension. It is worth emphasizing that we adopt a full-channel compression strategy instead of channel splitting, in order to avoid potential information loss.
The compressed feature is then fed into the MGPU for multi-granularity feature extraction. The MGPU consists of multiple parallel-cascaded attention branches, each utilizing a different window size to establish spatial perception at a distinct granularity. Specifically, smaller windows focus on capturing fine-grained patterns such as textures and edges, while larger windows are responsible for aggregating coarse-grained information, such as scene context and global semantics. As illustrated in Figure 2a, increasing window sizes lead to progressively larger receptive fields, thereby enabling richer spatial modeling.
Concretely, the compressed feature is reshaped into a sequence representation for each window size. The MGPU adopts a parallel-cascaded architecture: the features extracted by a smaller-window branch are not only used as that branch's output but also serve as input to the next branch with a larger window. This progressive feature-passing mechanism facilitates hierarchical context modeling and enhances cross-scale consistency. The detailed computation process is defined in Equation (2).
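The parallel-cascaded branch structure described above can be sketched as follows. The single-head attention with identity Q/K/V projections and the window sizes (2, 4, 8) are simplifying assumptions chosen for illustration, not the network's actual configuration:

```python
import numpy as np

def window_attention(x, win):
    """Single-head self-attention within non-overlapping win x win windows.
    Q = K = V = raw tokens (learned projections omitted for brevity)."""
    h, w, c = x.shape
    out = np.empty_like(x)
    for i in range(0, h, win):
        for j in range(0, w, win):
            t = x[i:i + win, j:j + win].reshape(-1, c)   # tokens in one window
            logits = t @ t.T / np.sqrt(c)
            a = np.exp(logits - logits.max(axis=-1, keepdims=True))
            a /= a.sum(axis=-1, keepdims=True)           # row-wise softmax
            out[i:i + win, j:j + win] = (a @ t).reshape(win, win, c)
    return out

def mgpu(x, window_sizes=(2, 4, 8)):
    """Parallel-cascaded branches: each branch attends at its own window
    size, and its output also feeds the next (larger-window) branch."""
    outs, cur = [], x
    for win in window_sizes:
        cur = window_attention(cur, win)
        outs.append(cur)                                 # every branch is an output
    return outs
```

Each branch's output is collected for fusion while also being cascaded forward, which is what lets larger-window branches aggregate context already refined at finer granularity.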
3.3.2. Granularity Adaptation Unit (GAU)
Leveraging the channel-compression feature extraction paradigm implemented in MGPU, each attention branch operates on the full input representation, thereby preserving complete spatial details during multi-granularity processing. To effectively integrate the complementary information captured by different window branches and enhance the model's adaptability to diverse tasks and data, the Granularity Adaptation Unit (GAU) introduces a dynamic weighting fusion mechanism and channel-wise attention calibration for efficient feature aggregation.
Specifically, the output features from the individual branches are first modulated by a set of learnable, normalized weights, one per branch. This design enables the model to adaptively adjust the contribution of different granularity features according to task requirements. For instance, in texture-intensive tasks, the weight of smaller-window branches is increased, while in tasks dominated by global semantics, larger-window branches receive greater emphasis.
The dynamically modulated features are then concatenated to form the fused representation. To further enhance the discriminative capacity of the fused features, we perform channel-wise attention modeling on this representation, which adaptively emphasizes informative channels while suppressing redundant responses. Through the joint effect of the learnable granularity weights and the channel attention, GAU achieves dynamic balancing of multi-granular features and reinforces cross-scale consistency. The detailed computation is formulated in Equation (3).
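The weight-then-concatenate-then-gate scheme can be sketched as below. The softmax normalization of the branch weights and the SE-style sigmoid channel gate are assumptions standing in for the learnable weights and channel attention described in the text:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gau(branch_feats, weight_logits):
    """GAU sketch: softmax-normalized learnable weights modulate each
    branch, the results are concatenated along channels, then an
    SE-style channel gate recalibrates the fused representation."""
    w = softmax(np.asarray(weight_logits, dtype=float))   # alpha_k >= 0, sum = 1
    fused = np.concatenate(
        [wk * f for wk, f in zip(w, branch_feats)], axis=-1)
    squeeze = fused.mean(axis=(0, 1))                     # global average per channel
    gate = 1.0 / (1.0 + np.exp(-squeeze))                 # sigmoid channel attention
    return fused * gate, w
```

Because the logits are trainable, gradient descent can shift weight toward small-window branches for texture-heavy data or large-window branches for semantics-heavy data, exactly the adaptive behavior motivated above.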
3.4. Task-Guided Calibration Fusion Module
The FSSC Block constructs a multi-modal representation by integrating the DFSM and the SGSAM in a collaborative manner. To achieve task-aware adaptive fusion of spatial and frequency features, we propose the TGCFM. This module employs a two-stage architecture to facilitate deep interaction and complementarity between the spatial and frequency domains.
As illustrated in Figure 4, the proposed TGCFM follows a two-stage design to achieve adaptive frequency–spatial collaboration. In the calibration stage, frequency features are selectively enhanced according to task-specific requirements, while in the fusion stage, a cross-attention [33] mechanism is employed to realize semantic alignment and information exchange between frequency and spatial representations.
The inputs to TGCFM consist of a frequency feature pyramid generated by the DFSM and multi-granularity spatial features extracted by the SGSAM. During the calibration phase, high-frequency and low-frequency features are first processed by separate 1 × 1 convolutions to learn task-specific representations for each frequency band. To further identify informative channels and suppress redundancy, channel-wise attention is applied to both branches, producing importance scores that are compared against a task-dependent threshold. Only the most informative channels are retained via a Top-K selection strategy, yielding calibrated frequency features that are both compact and task-adaptive.
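The calibration stage's Top-K channel selection can be sketched as follows. Scoring channels with a sigmoid of their global-average responses is an assumption made for illustration; the actual importance scores come from the module's learned channel attention:

```python
import numpy as np

def calibrate_topk(feat, k):
    """Calibration-stage sketch: channel importance scores (here, a
    sigmoid of global-average responses, an assumed stand-in for the
    learned channel attention) rank channels, and only the top-k
    channels are retained, yielding a compact calibrated feature."""
    scores = 1.0 / (1.0 + np.exp(-feat.mean(axis=(0, 1))))  # (C,) importance
    keep = np.sort(np.argsort(scores)[-k:])                 # indices of top-k channels
    return feat[..., keep], keep
```

Retaining only the top-k channels is what makes the subsequent cross-attention operate on a sparse, task-relevant channel set rather than the full frequency spectrum.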
In the fusion phase, the calibrated low-frequency features serve as queries (Q) to provide global semantic guidance, the high-frequency features act as keys (K) to convey fine-grained structural details, and the spatial features function as values (V) to deliver localized spatial context. To balance modeling capacity and computational efficiency, a window-based attention mechanism is adopted. Features from both the frequency and spatial domains are partitioned into non-overlapping windows and serialized prior to attention computation, enabling efficient cross-domain interaction within local regions. The overall fusion process is formulated in Equation (4), in which one operator selects the top-K channels with the highest importance scores and another serializes the window-partitioned feature maps. Through this task-guided calibration and adaptive fusion strategy, TGCFM dynamically adjusts its frequency–spatial interaction patterns to varying task demands, leading to consistent performance improvements across diverse vision tasks.
Overall, the core of TGCFM lies in its task-guided calibration prior to fusion. Rather than simply reusing all frequency features, it selectively enhances and retains the frequency components most relevant to the current task, reducing noise and redundancy at the source. During the calibration stage, different frequency bands are processed independently, and channel attention combined with a Top-K strategy enables task-adaptive feature selection and compact representation. In the subsequent fusion stage, non-overlapping windowed cross-attention constructs high-semantic-density cross-domain interactions using sparse channels, achieving fine-grained semantic alignment and efficient structural fusion. In summary, TGCFM integrates spatial and frequency information through a task-aware calibrate-then-fuse strategy, effectively improving feature representation quality and task adaptability.
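The fusion stage's windowed cross-attention (low-frequency Q, high-frequency K, spatial V) can be sketched as below; the learned projection matrices are omitted, so the raw features stand in for Q, K, and V directly:

```python
import numpy as np

def windowed_cross_attention(q_feat, k_feat, v_feat, win):
    """Fusion-stage sketch: low-frequency features as queries,
    high-frequency features as keys, spatial features as values,
    attended within non-overlapping win x win windows."""
    h, w, c = q_feat.shape
    out = np.empty_like(v_feat)
    for i in range(0, h, win):
        for j in range(0, w, win):
            # Serialize each window into a token sequence.
            q = q_feat[i:i + win, j:j + win].reshape(-1, c)
            k = k_feat[i:i + win, j:j + win].reshape(-1, c)
            v = v_feat[i:i + win, j:j + win].reshape(-1, v_feat.shape[-1])
            logits = (q @ k.T) / np.sqrt(c)
            a = np.exp(logits - logits.max(axis=-1, keepdims=True))
            a /= a.sum(axis=-1, keepdims=True)            # row-wise softmax
            out[i:i + win, j:j + win] = (a @ v).reshape(win, win, -1)
    return out
```

Restricting attention to non-overlapping windows keeps the cost linear in the number of tokens while still letting low-frequency semantics steer which high-frequency details modulate the spatial values.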
4. Experiments
This section presents a comprehensive evaluation of FSSC-Net and its core module, TGCFM, across three fundamental computer vision tasks.
Section 4.1 begins with image classification experiments on the ImageNet-1K dataset, where the classification accuracy of different methods is systematically compared to validate the effectiveness of the proposed framework in general-purpose tasks. Subsequently, a remote sensing image classification task is conducted to assess the task adaptability of FSSC-Net and the generalization capability of its frequency-domain self-calibration mechanism. Section 4.2 and Section 4.3 report the performance of our approach on object detection and semantic segmentation tasks, respectively. Section 4.4 presents ablation studies that investigate the critical design choices of the TGCFM and analyze their contributions to the overall performance. Section 4.5 conducts an in-depth analysis of frequency dependence across different network depths within the classification task, as well as frequency preferences across different vision tasks. Finally, Section 4.6 validates the effectiveness of TGCFM through Grad-CAM visualizations and t-SNE-based feature distribution analysis.
4.1. Image Classification on ImageNet-1K and AID
4.1.1. Image Classification Experiments on the ImageNet-1K Dataset
Setting: We evaluate the effectiveness of the proposed FSSC-Net on the ImageNet-1K dataset [46], which contains 1.28 M training images and 50 K validation images spanning 1000 classes. Top-1 accuracy is reported based on single-crop evaluation. All models are trained for 300 epochs on the ImageNet-1K dataset, with input images resized to 224 × 224. The batch size is set to 1024, and the initial learning rate is 0.01. We adopt the AdamW optimizer in combination with a cosine annealing learning rate schedule, including a 20-epoch linear warm-up phase. All experiments are conducted on four NVIDIA RTX 4090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).
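The stated learning-rate recipe (base rate 0.01, 300 epochs, 20-epoch linear warmup, cosine annealing) can be written out explicitly; the floor value `min_lr=0.0` is an assumption, since the text does not specify one:

```python
import math

def lr_at_epoch(epoch, base_lr=0.01, total=300, warmup=20, min_lr=0.0):
    """Cosine-annealing schedule with linear warmup matching the stated
    recipe; min_lr is an assumed floor not given in the text."""
    if epoch < warmup:
        # Linear ramp from base_lr/warmup up to base_lr.
        return base_lr * (epoch + 1) / warmup
    # Cosine decay over the remaining epochs.
    t = (epoch - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The warmup ramp ends exactly at the base rate, so the cosine phase starts from 0.01 without a discontinuity.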
Results: We evaluate the effectiveness of TGCFM from two perspectives: (1) overall classification performance of FSSC-Net compared to existing state-of-the-art backbone networks, and (2) performance gains achieved by incorporating TGCFM into established architectures.
Table 1 reports the Top-1 classification accuracy of FSSC-Net on the ImageNet-1K dataset alongside several representative Transformer-based and convolution-based models. The symbol “–” denotes parameters that were not reported in the original studies. To avoid introducing assumptions or estimation bias, these values are intentionally left unfilled. Under comparable parameter counts and computational budgets, FSSC-Net achieves competitive or superior performance. For instance, FSSC-Net surpasses Swin-T by 0.6%, and outperforms multi-scale models such as PVT-Small and the more complex PVT-Medium by 2.1% and 0.7%, respectively. In addition, when compared to classical CNNs such as ResNet and ResNeXt, FSSC-Net consistently delivers higher accuracy, demonstrating the effectiveness of the proposed frequency–spatial fusion strategy as a complementary enhancement to conventional convolutional structures.
Table 2 summarizes the performance gains achieved by integrating TGCFM into standard backbone networks. In this table, the suffix “-TGCFM” denotes models augmented with the TGCFM. Notably, ResNet-50-TGCFM and Swin-Tiny-TGCFM achieve Top-1 accuracy improvements of 1.46% and 0.84%, respectively. These results validate the transferability and generalizability of TGCFM. Both FSSC-Net and TGCFM-augmented variants benefit from richer and more expressive feature representations, highlighting the potential of the proposed composite frequency modeling approach to enhance both convolution- and self-attention-based networks.
4.1.2. Remote Sensing Classification on the AID Dataset
Setting: To further evaluate the adaptability of FSSC-Net and the effectiveness of TGCFM in domain-specific scenarios beyond general-purpose tasks, we conduct extended experiments on remote sensing image classification. Specifically, we adopt the AID dataset, which contains 10,000 aerial images with a resolution of 600 × 600, covering 30 representative scene categories. The dataset is split into training and validation sets in an 8:2 ratio. We follow the same training and evaluation protocol used in FDNet for fair comparison. FDNet employs frequency-domain convolution, multi-scale frequency decoupling, and attention mechanisms to extract features in the frequency domain, effectively mitigating the information loss caused by spatial downsampling. This design has demonstrated improved classification accuracy and computational efficiency in remote sensing tasks.
Results: As summarized in Table 3, FSSC-Net achieves strong performance on the AID dataset. Notably, although FSSC-Net is designed as a general frequency–spatial framework rather than a dataset-specific solution, it achieves a Top-1 accuracy of 96.20%, surpassing FDNet's 95.92% by 0.28 percentage points. This confirms the strong task adaptability of FSSC-Net and the robustness of its frequency–spatial fusion mechanism in complex scenarios. Furthermore, when TGCFM is integrated into Swin-T, the model achieves a 4.38% improvement in classification accuracy, demonstrating the effectiveness and generality of adaptive frequency modeling and self-calibration mechanisms in unstructured tasks such as remote sensing image classification.
Figure 5 presents the visualization results of TGCFM on the AID dataset. Panels (a-1) and (a-2) illustrate the mean importance of low- and high-frequency channels at different network depths, as well as the proportion of high-frequency components among the top 75% most important channels, respectively. Panels (b) and (c) show histograms of channel importance for low- and high-frequency branches across different layers. Compared with the methods shown in Figure 6, it is evident that remote sensing image classification exhibits a distinct dependency on frequency-domain features, highlighting the unique frequency demands of this task. Together with the performance results of FSSC-Net and TGCFM reported in Table 3, these findings further validate the effectiveness and necessity of the TGCFM mechanism, demonstrating that FSSC-Net equipped with TGCFM can effectively self-calibrate and adapt to the specific frequency characteristics of remote sensing image classification.
4.2. Object Detection on General Detection Dataset MS COCO and Remote Sensing Dataset AITOD
To assess the effectiveness of the proposed method with a focus on remote sensing target detection, experiments are conducted on both the COCO benchmark and the AITOD [60] dataset. While COCO provides a widely used benchmark with diverse object scales and complex backgrounds for validating general detection robustness, AITOD is specifically designed for aerial imagery, where targets are predominantly small, densely distributed, and embedded in cluttered scenes.
Evaluations on AITOD highlight the ability of the proposed method to preserve fine-grained structural details and improve small-object localization in remote sensing scenarios, while the additional COCO results demonstrate that these gains do not come at the expense of general detection performance. Together, the results indicate that the proposed approach is well-suited for remote sensing object detection with strong generalization capability.
4.2.1. Object Detection on MS COCO
Setting: We evaluate the object detection performance of FSSC-Net on the MS COCO dataset [61], which comprises approximately 118 K training images, 5 K validation images, and 20 K test images. Backbone networks are pretrained on ImageNet with an input resolution of 224 × 224 and then integrated into the detection framework. Cascade R-CNN is chosen as the detection architecture due to its widespread adoption and competitive performance in vision benchmarks.
During training, we utilize a multi-scale training strategy: input images are resized such that the shorter side is randomly sampled between 480 and 800 pixels, while the longer side is constrained to a maximum of 1333 pixels. The models are optimized using the AdamW optimizer with an initial learning rate of 0.0001, a weight decay of 0.05, and a batch size of 16. The training process is conducted over a 3× schedule (36 epochs). Model performance is evaluated using the official COCO evaluation API, with metrics including AP (Average Precision), AP50, and AP75. Additionally, we report the number of parameters and FLOPs for each model to assess the trade-off between accuracy and computational efficiency.
Results: As shown in
Table 4, incorporating the pretrained FSSC-Net as the backbone in the Cascade R-CNN [
62] framework yields superior performance across multiple key metrics compared with existing CNN- and Transformer-based detectors. The comparison results are either taken directly from the original publications or reproduced under identical training settings for fairness. Notably, FSSC-Net achieves higher detection accuracy under similar parameter budgets; for example, it outperforms Swin-Tiny and PVT-V2 by 0.4% and 0.8% in AP, respectively. Furthermore, FSSC-Net nearly matches MambaVision-T, trailing by only 0.2% AP while using 14.89 million fewer backbone parameters. These results highlight the efficiency of FSSC-Net under constrained model complexity and demonstrate the benefits of the proposed frequency–spatial fusion strategy in dense prediction tasks.
4.2.2. Object Detection on AITOD
Setting: We evaluate the remote sensing object detection performance of FSSC-Net on the AITOD dataset, which is specifically designed for aerial image target detection and contains densely distributed objects with extreme scale variations. AITOD focuses on small and very small targets captured from high-altitude platforms, posing significant challenges in terms of feature representation and localization accuracy. Backbone networks are pretrained on ImageNet with an input resolution of 224 × 224 and then incorporated into the detection framework. Following common practice in remote sensing detection, Cascade R-CNN is adopted as the detection architecture to ensure strong localization refinement for small objects. During training, a multi-scale strategy is employed to accommodate large variations in object size. Input images are resized by randomly sampling the shorter side within a predefined range while constraining the longer side to a fixed maximum resolution to balance accuracy and computational cost. The models are optimized using the AdamW optimizer with an initial learning rate of 0.0001, a weight decay of 0.05, and a batch size of 16. All models are trained under a 3× schedule. Evaluation is conducted using standard detection metrics, including AP, AP50, and AP75, with additional emphasis on small-object performance to reflect the characteristics of the AITOD dataset. Model complexity is further analyzed in terms of parameter count and FLOPs.
Results: As shown in
Table 5, FSSC-Net achieves an overall AP of 30.4 on the AITOD dataset, demonstrating superior performance among CNN-based detectors and competitive results compared with recent remote sensing-oriented methods. The model further obtains 62.2 AP50 and 25.9 AP75, indicating robust detection and reliable localization under varying IoU thresholds in aerial imagery. At the category level, FSSC-Net exhibits clear advantages on representative remote sensing targets characterized by small object size and dense spatial distribution, including airplanes (39.8 AP), storage tanks (51.7 AP), ships (47.6 AP), and swimming pools (37.0 AP). These gains suggest that the proposed frequency–spatial fusion strategy effectively preserves fine-grained structural and boundary cues that are often degraded in the deep feature hierarchies of aerial images. Overall, the results confirm that FSSC-Net is well-suited for remote sensing object detection, particularly in complex aerial scenes dominated by small-scale targets, validating its effectiveness on the AITOD benchmark.
4.3. Semantic Segmentation on ADE20K
Setting: We evaluate the semantic segmentation performance of FSSC-Net on the widely used ADE20K dataset [
63], which comprises 25 K high-quality images with pixel-level annotations: 20 K for training, 2 K for validation, and 3 K for testing. The dataset covers 150 semantic categories across diverse indoor and outdoor scenes. For the segmentation framework, we adopt UPerNet [
64] as the base architecture and follow standard training and evaluation protocols. All input images are resized to a resolution of 512 × 512 during training.
Results: As shown in
Table 6, FSSC-Net demonstrates competitive performance on the semantic segmentation task. When employed as the backbone in the UPerNet framework, it achieves higher or comparable accuracy relative to several state-of-the-art methods. Specifically, FSSC-Net outperforms MambaVision-T and Swin-Tiny by 0.2% and 0.4% in mIoU, respectively, further validating the effectiveness and generalizability of the proposed frequency–spatial fusion mechanism in dense prediction tasks. These results demonstrate that FSSC-Net effectively improves the model’s ability to capture and integrate multi-scale semantic information, thereby enabling more accurate and coherent pixel-level predictions in complex visual scenes.
4.4. Ablation Study
This section presents a series of ablation experiments that investigate the role and individual contribution of each component of the proposed TGCFM mechanism. To this end, we incorporate TGCFM into Swin-T and systematically evaluate the impact of various design choices on overall model performance. Due to computational resource constraints, all ablation studies are conducted on the image classification task. The training protocol follows the settings described in previous sections, using the ImageNet dataset with a total of 300 training epochs.
4.4.1. Analysis of DFSM and TGCFM
In this section, we conduct ablation studies to evaluate the effectiveness of the two constituents of TGCFM: the Dynamic Frequency Selection Module (DFSM) and the spatial–frequency fusion strategy. For DFSM, we perform image classification experiments with both fixed masks and learnable soft masks to assess the flexibility and adaptability of frequency-domain modeling. For the fusion strategy, we compare TGCFM with a baseline that fuses features via channel-wise concatenation followed by a 1 × 1 convolution, aiming to validate the effectiveness of the proposed spatial–frequency collaborative fusion. As shown in
Table 7, in the frequency modeling component, the introduction of soft masks significantly improves classification performance, demonstrating the critical role of DFSM in enhancing frequency feature representation. In the spatial–frequency fusion part, the TGCFM, equipped with cross-attention, consistently outperforms the baseline across multiple metrics, confirming its importance in feature alignment and enhancement.
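The fixed-mask versus soft-mask contrast in this ablation can be illustrated with a minimal NumPy sketch. This is our own simplification of the idea (function names, the radial cutoff, and the sigmoid parameterization are assumptions); DFSM itself learns its mask parameters end-to-end:

```python
import numpy as np

def radial_frequencies(h, w):
    """Normalized radial distance of each FFT bin from the (centered) DC term."""
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    return np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

def frequency_split(x, cutoff=0.25, softness=None):
    """Split an image into low/high-frequency parts via a radial FFT mask.

    softness=None gives a fixed binary mask; a float gives a sigmoid "soft
    mask" whose cutoff and steepness could be made learnable, as in DFSM.
    """
    spec = np.fft.fftshift(np.fft.fft2(x))
    r = radial_frequencies(*x.shape)
    if softness is None:
        low_mask = (r <= cutoff).astype(float)                    # hard mask
    else:
        low_mask = 1.0 / (1.0 + np.exp((r - cutoff) / softness))  # soft mask
    low = np.fft.ifft2(np.fft.ifftshift(spec * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * (1.0 - low_mask))).real
    return low, high
```

Because the two masks sum to one at every frequency bin, the low- and high-frequency parts always reconstruct the input exactly; the soft variant only redistributes energy smoothly around the cutoff instead of truncating it.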
4.4.2. Analysis of Frequency Branches and Corresponding Components in DFSM
In this section, we conduct an ablation study to evaluate the effects of both the number of frequency branches in the DFSM and the number of frequency components within each branch.
Table 8 presents the impact of varying the number of frequency-domain branches, while
Table 9 examines how different quantities of frequency components within each branch influence overall model performance.
Table 8 shows the Top-1 classification accuracy of the Swin-TGCFM network under various configurations of frequency-domain branches. We conducted experiments with different combinations of high-, low-, and mid-frequency branches to assess both the necessity of incorporating frequency information and the limitations of relying on a single frequency branch. The results indicate that increasing the number of frequency branches generally improves Top-1 accuracy. Specifically, using only the high-frequency or low-frequency branch yields gains of 0.8% and 0.65%, respectively, indicating that high-frequency features, rich in fine-grained details, are more beneficial for classification than low-frequency features. When both high- and low-frequency branches are used together, accuracy improves by a further 0.19% and 0.04% over the respective single branches, approaching a performance plateau. Adding a mid-frequency branch provides only marginal additional gains, indicating diminishing returns. These findings highlight that combining diverse frequency-domain features effectively complements spatial features and enhances model performance. This observation aligns with previous conclusions that high-frequency information is inherently more difficult for neural networks to learn, and that improving the model’s ability to capture such features can significantly boost accuracy.
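Conceptually, the low-, mid-, and high-frequency branches compared in Table 8 correspond to disjoint radial sub-bands of the spectrum. A minimal decomposition can be written as follows (the cutoff values and function name are illustrative assumptions, not the paper's):

```python
import numpy as np

def band_partition(x, edges=(0.12, 0.5)):
    """Decompose an image into len(edges)+1 radial frequency bands.

    edges are normalized radial cutoffs; two edges give a low/mid/high split.
    The bands are disjoint, so they sum back to the original image exactly.
    """
    h, w = x.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    spec = np.fft.fftshift(np.fft.fft2(x))
    idx = np.digitize(r, edges)  # 0 = lowest band, len(edges) = highest
    bands = []
    for k in range(len(edges) + 1):
        bands.append(np.fft.ifft2(np.fft.ifftshift(spec * (idx == k))).real)
    return bands
```

Dropping a band from this list mimics the single- and dual-branch configurations of the ablation: the remaining bands no longer sum to the input, and whatever structure lived in the removed band is simply unavailable to the model.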
Table 9 presents the results of varying the number of frequency components within each frequency-domain branch. When the number of components is set to two, the model achieves a significant Top-1 accuracy improvement of +0.84%. Increasing the number of components to four further yields a smaller gain of +0.31%, indicating that a moderate number of frequency bands offers a favorable trade-off between performance and computational efficiency. These results demonstrate that the multi-component composite design effectively enhances the flexibility of feature representations and that overall performance tends to improve as more frequency components are incorporated. Notably, the model achieves its best computational efficiency with two components and its highest accuracy with four components, which may reflect the varying contributions of different frequency components. Based on this trade-off, we adopt the two-component configuration as the default setting in our final model design.
4.4.3. Ablation Analysis of the Multi-Size Window Design
We evaluate the effectiveness of multi-scale window design in FSSC-Net. To capture spatial features at different receptive field sizes, the AMSWA module first compresses the input feature channels and then applies window attention with varying window sizes. To verify the necessity of this design, we conducted ablation experiments using different numbers of window branches within FSSC-Net. During these experiments, the overall parameter count and computational cost are kept approximately constant by proportionally adjusting the channel allocation across the window branches. As shown in
Table 10, reducing the number of window branches leads to a noticeable decline in Top-1 accuracy. Notably, despite having comparable parameter and FLOP counts, the single-window variant performs 2.7% worse than the four-window version, clearly validating the effectiveness of the multi-scale window design.
4.4.4. Ablation Analysis on the Application Scope of TGCFM
To determine the optimal configuration of TGCFM, we progressively integrated it into different stages of the Swin architecture. The experimental results are presented in
Table 11, where checkmarks denote the stages at which TGCFM is applied for feature enhancement. As observed, Top-1 accuracy increases consistently with the number of TGCFM-equipped stages, peaking when all four stages incorporate TGCFM. Notably, full integration introduces only a modest 4.7% increase in parameter count. These results suggest that applying TGCFM across all stages of the backbone is an effective and efficient design choice.
4.4.5. Throughput and GPU Memory Comparison
We also evaluate the GPU memory usage and inference throughput of the Swin Transformer before and after integrating the proposed TGCFM mechanism, using batch sizes of 128 and 256, respectively. All experiments are conducted on an NVIDIA RTX 4090 GPU. As shown in
Table 12, incorporating TGCFM leads to a moderate increase in resource consumption for both Swin-T and PVT-Small. Specifically, the proposed modifications reduce average inference throughput by 17.49% to 22.67% and increase GPU memory usage by 21.5% to 31.3%. In addition, computational complexity grows marginally, by 0.1 to 0.2 GFLOPs, and model size increases by 1.08 M to 2.24 M parameters. Despite this additional overhead, the trade-off is acceptable given the notable accuracy improvements (0.45% and 0.84%) achieved by the model. It is also worth noting that most of the extra cost stems from the multi-branch architecture inherent in the TGCFM design. To further improve practical efficiency and deployment readiness, future work may explore optimization strategies such as matrix concatenation, shared computation across frequency branches, or pruning redundant frequency components to reduce memory and computation overhead.
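Throughput is typically measured by timing repeated forward passes after a warm-up. The CPU-only sketch below uses a stand-in NumPy workload (the actual measurements ran full models on an RTX 4090; on GPU one would additionally synchronize the device around the timer and read peak memory, e.g. via torch.cuda.max_memory_allocated):

```python
import time
import numpy as np

def measure_throughput(fn, batch, n_warmup=3, n_iters=10):
    """Crude throughput (samples/s) of a callable on a fixed batch."""
    for _ in range(n_warmup):  # warm caches before timing
        fn(batch)
    t0 = time.perf_counter()
    for _ in range(n_iters):
        fn(batch)
    elapsed = time.perf_counter() - t0
    return n_iters * len(batch) / elapsed

# Stand-in "model": a batched matrix product.
weights = np.random.default_rng(0).standard_normal((64, 64))

def model(xb):
    return xb @ weights
```

With batch sizes of 128 and 256, as used in the paper's comparison, the same harness reports how the TGCFM-augmented backbone trades samples per second for accuracy.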
4.4.6. Ablation Study on the Necessity of DFSM
To further validate the necessity of DFSM, we replaced it with FFT, DWT, and DCT for frequency modeling and measured the resulting changes in model performance. The detailed results are presented in
Table 13. It can be observed that directly using fixed spectral transformations such as FFT, DWT, or DCT leads to varying degrees of performance degradation, indicating that single-spectrum modeling cannot replace the task-conditioned frequency subspace structure constructed by DFSM. This further demonstrates the effectiveness and necessity of DFSM.
4.5. Investigation of Key Parameters
To validate the sensitivity of the masking mechanism and the necessity of the soft mask design, we conducted mechanism replacement experiments by substituting the original soft mask with three alternative strategies: No Mask, Hard Mask, and Learnable Mask. The results show that removing the mask leads to a significant performance drop (−1.2% mAP). Similarly, adopting a fixed-threshold Hard Mask or a Learnable Mask with only channel-wise learnable weights also results in varying degrees of performance degradation. This indicates that raw frequency features contain redundant or interfering components; without high–low frequency separation and task-guided selection, the effectiveness of spatial–frequency fusion is compromised, thereby confirming the structural necessity of selective modeling.
Further analysis in
Table 14 reveals that although Hard Mask performs better than No Mask, its lack of continuous adjustment limits its ability to accommodate the fine-grained frequency requirements of different tasks. While Learnable Mask introduces continuous gating, it fails to achieve effective frequency subspace reconstruction, and its performance still falls short of our method. Overall, the advantage of DFSM lies not merely in channel reweighting, but in its task-guided structural selection and frequency subspace construction mechanism, which enables more targeted frequency modeling and more efficient spatial–frequency collaboration.
To evaluate the stability and rationality of the Top-K mechanism, we conducted a sensitivity analysis on its initial ratio. As shown in
Table 15, the results indicate that the model is largely insensitive to the initial Top-K value, maintaining stable performance for initial ratios within approximately 0.8 ± 0.1, which suggests a relatively broad effective range for this parameter. However, when the ratio decreases to 0.6 or 0.5, performance drops significantly, indicating that excessive compression discards informative frequency components and consequently weakens representation capability. This observation is consistent with the theoretical expectations of frequency subspace construction. Overall, the performance improvement does not stem from fine-grained tuning of the Top-K ratio, but from the structural design of task-guided frequency subspace construction; in essence, Top-K serves as a robust and adaptive subspace compression mechanism.
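The Top-K compression studied here amounts to retaining only the strongest fraction of frequency coefficients. A magnitude-based NumPy sketch of the idea follows (our own simplification; the actual module selects within learned frequency subspaces and adapts the ratio during training):

```python
import numpy as np

def topk_frequency_mask(x, ratio=0.8):
    """Retain the top-`ratio` fraction of FFT coefficients by magnitude.

    Returns the reconstruction from the kept coefficients and the fraction
    of bins actually retained.
    """
    spec = np.fft.fft2(x)
    mag = np.abs(spec).ravel()
    k = max(1, int(round(ratio * mag.size)))
    # Value of the k-th largest magnitude; everything below it is zeroed.
    threshold = np.partition(mag, mag.size - k)[mag.size - k]
    mask = np.abs(spec) >= threshold
    return np.fft.ifft2(spec * mask).real, mask.mean()
```

By Parseval's relation, zeroing coefficients can only remove energy, which matches the observation above: ratios down to about 0.7 discard mostly weak bins, while 0.5 to 0.6 starts cutting into informative components.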
Furthermore, to enhance the statistical reliability of our findings, we conducted three independent runs for both FSSC-Net and the Swin Transformer baseline on ImageNet. For Top-1 accuracy, FSSC-Net achieves 81.95 ± 0.04, compared with 80.30 ± 0.11 for Swin, yielding an average improvement of approximately 1.65% with a notably smaller standard deviation. For Top-5 accuracy, FSSC-Net obtains 95.52 ± 0.21, surpassing Swin’s 95.12 ± 0.015 by approximately 0.40% on average. These results demonstrate that FSSC-Net maintains stable performance under random initialization and achieves statistically significant improvements over a strong baseline. The consistently lower variance and clear accuracy gains indicate that the performance improvements are robust and reliable, rather than arising from incidental fluctuations.
4.6. Analysis of Channel Importance and Frequency Dependency
Figure 6 presents an analysis of the high-frequency and low-frequency channel importance produced by the calibration modules in FSSC-Net across three major visual tasks: image classification, object detection, and semantic segmentation. We analyze the results from two perspectives: intra-task variation, i.e., how frequency dependency differs across network depths within the same task, and inter-task variation, i.e., how frequency preferences shift across different vision tasks.
4.6.1. Analysis of Frequency Dependency Across Training Stages in Classification Tasks
Figure 6(a-1–a-3) illustrates the average channel importance of high- and low-frequency branches (denoted by light and dark blue lines, respectively) across different FSSC blocks during training epochs in the image classification task. It also shows the proportion of high-frequency channels within the top 75% important channels (green line). Overall, the network exhibits a stronger reliance on low-frequency features during early training stages, as evidenced by higher average weights in low-frequency channels. As training progresses, the importance of high-frequency features increases, with the average proportion of high-frequency channels among the top 75% rising from 43.3% to 46.5%. This trend suggests that the network gradually shifts from coarse structural modeling to fine-grained edge and texture representation as training proceeds. This observation aligns well with the curriculum learning theory proposed in EfficientTrain++, which posits that models first learn easily distinguishable low-frequency information before refining high-frequency details.
Furthermore, a depth-wise frequency preference is observed: shallow and deep blocks exhibit structurally different frequency tendencies. Deeper blocks (e.g., Block10–12) tend to emphasize high-frequency features, while shallow ones favor low-frequency components. These preferences remain consistent throughout training and are not due to random fluctuations, indicating that the frequency calibration mechanism in FSSC-Net effectively allocates attention across frequency components adaptively. This dynamic adjustment enhances the model’s expressiveness and depthwise adaptability to varying feature demands.
4.6.2. Analysis of Frequency Dependency Across Model Depth and Vision Tasks
Figure 6(a-3,b-1,b-2,c-1,c-2) present the average importance of high- and low-frequency channels, the high-frequency ratio within the top 75% important channels, and the histograms (
Figure 6(d-1,d-2,e-1,e-2)) of channel importance (with a bin size of 0.05) across classification, segmentation, and detection tasks. The results reveal significant differences in frequency dependencies among the tasks. In the image classification task, deep layers—especially Block12—exhibit a strong preference for high-frequency features, reflecting the importance of detailed edges and textures for fine-grained categorization. In contrast, mid- and shallow layers (e.g., Block3–6) lean toward low-frequency components, supporting semantic abstraction and intermediate representation transformation. In semantic segmentation (
Figure 6(b-1,b-2)), high- and low-frequency channel weights are more balanced overall. However, high-frequency features account for more than 50% of the top 75% important channels, averaging 51.02%. This ratio remains stable and slightly increases toward the deeper blocks, suggesting a growing need for high-frequency features in pixel-level boundary modeling. These details are critical for enhancing segmentation boundary precision.
In contrast, object detection (
Figure 6(c-1,c-2)) shows a more complex frequency preference. The overall proportion of high-frequency channels among the top 75% is approximately 50.13%, with different layers exhibiting distinct patterns: Blocks 4–6 favor low-frequency features, potentially for capturing object contours and coarse localization. Shallow and deeper blocks, however, focus more on high-frequency information to support edge enhancement and refined bounding box localization. This layered dependency pattern echoes the dual demands of object detection, which requires both precise localization and accurate classification across scales.
To further verify differences between task-specific frequency modeling strategies, we analyze the histograms of channel importance across tasks (see
Figure 6(d-1,d-2,e-1,e-2,f-1,f-2)). These histograms demonstrate clear differences in high/low frequency dependency that align with the core objectives of each task. In classification (d-1,d-2), both frequency branches show a core-dominated, auxiliary-supported distribution: low-frequency channels maintain some dispersion to support global semantic extraction, while high-frequency channels tend to be highly polarized, focusing on critical details. Detection (f-1,f-2) presents a stage-wise dependency: early blocks require a blend of high- and low-frequency features for multidimensional cues (edges, textures, initial localization), while later stages filter features by retaining only highly significant or marginal channels, aligning with the dual-task nature of detection. In segmentation (e-1,e-2), a balanced moderate dependency emerges. The importance of both high- and low-frequency channels is tightly clustered in the 0.45–0.55 range, indicating the task’s emphasis on deep aggregation of local detail and contextual semantics. By avoiding domination from extremely important or redundant channels, the model ensures fine-grained pixel-level accuracy.
Overall, these visualizations reveal clear differences in the reliance on low- and high-frequency features across tasks, as well as substantial variation in frequency preferences across different network depths. These task-specific frequency preferences not only reflect the fundamental differences in modeling strategies among vision tasks, but also strongly support the necessity of task- and data-driven frequency feature extraction and self-calibrated spatial–frequency fusion, while validating the rationale and effectiveness of our task-calibrated frequency modeling mechanism. Specifically, FSSC-Net, through the integration of the DFSM and TGCFM, enables adaptive modulation in the frequency domain and task-aligned feature fusion, a design that significantly enhances the network’s generalization and performance across multi-task, multi-scale, and multi-target visual scenarios.
4.6.3. Quantitative Analysis of the Importance of Frequency Features
To further analyze the practical contributions of different frequency components and the calibration mechanism in TGCFM, we design intervention-based ablation experiments, in which specific structural components are deliberately removed. Specifically, three intervention settings are considered: (1) removing the low-frequency branch (–Low Part); (2) removing the high-frequency branch (–High Part); and (3) removing the entire frequency calibration module (–Calibration Part). The experimental results are reported in
Table 16 (δ denotes the performance change relative to the full model).
From the results, several key observations can be drawn. First, both high- and low-frequency information are indispensable. Removing either the high-frequency or low-frequency branch leads to a significant performance drop (approximately −2.8% mAP), indicating that the two types of frequency components play distinct yet complementary roles in object detection and jointly support effective representation learning. Second, the frequency calibration mechanism is the primary driver of performance improvement. When the entire calibration module is removed, the performance drops by more than 11% mAP, demonstrating that task-guided modeling in the frequency subspace is the key source of performance gains, rather than simple feature concatenation or fusion operations. Finally, this intervention-based study establishes a clear causal relationship at the structural level, showing that the observed performance improvements stem from the task-driven frequency calibration mechanism itself, rather than heuristic design choices or stacked attention effects. These results provide solid empirical evidence for the substantive contribution of TGCFM.
4.7. Model Visualization Analysis
To gain a deeper understanding of the working mechanism of TGCFM, we conducted a systematic visualization analysis from both internal and external perspectives. Specifically,
Figure 7 illustrates the frequency branches within DFSM and presents a t-SNE projection based on samples from the ImageNet dataset to depict the resulting feature distributions.
Figure 8 shows bounding box and heatmap visualizations for remote sensing object detection, enabling a comparison of the actual detection performance between FSSC-Net and the Baseline. Following this,
Figure 9 provides comparative heatmap visualizations before and after incorporating TGCFM under representative scenarios, including multi-target scenes, occlusion, simple backgrounds, and complex backgrounds. Together, these visualizations facilitate an in-depth analysis of how TGCFM reshapes the model’s attention patterns and enhances its focus on task-relevant regions.
4.7.1. Internal Feature Visualization Analysis
Figure 7a visualizes the frequency-domain features extracted at Stage 1 and Stage 2 of Swin-TGCFM, where High and Low denote the responses of the high-frequency and low-frequency branches, respectively. At Stage 1, which is closer to the input layer, the extracted features preserve a large amount of low-level physical information from the input image. As a result, the frequency representations exhibit a strong correspondence with the original image in both the pixel and frequency domains: the high-frequency branch predominantly responds to edges, textures, and other fine-grained local structures, whereas the low-frequency branch captures smoother and more homogeneous regions. At this stage, the high- and low-frequency components demonstrate clear complementarity in terms of physical characteristics. In contrast, at Stage 2, the features—shaped by interactions between the two frequency branches—gradually evolve into more abstract semantic representations. The visualizations increasingly transcend low-level physical boundaries and place greater emphasis on task-relevant foreground regions, indicating enhanced semantic perception. These observations suggest that the adaptive extraction of high- and low-frequency features not only preserves fine-grained details and global contextual information, but also facilitates the learning of task-relevant representations through cross-frequency interaction. The Grad-CAM visualizations in
Figure 7b, taken from Stage 4 of both the baseline Swin-Tiny and the TGCFM-enhanced Swin-Tiny, reveal that TGCFM guides the model to allocate attention more closely to semantically relevant regions. This improvement can be attributed to TGCFM’s task-guided calibration mechanism, which selectively amplifies the frequency components most informative for the current task, and its spatial–frequency fusion, which aligns these calibrated features with spatial representations. Consequently, the network focuses more effectively on target regions, resulting in sharper localization patterns and more complete coverage of object boundaries.
Similarly, the t-SNE projections in
Figure 7c, which embed high-dimensional features of 20 ImageNet classes, show that TGCFM-enhanced features exhibit tighter intra-class clustering and increased inter-class separation. These changes directly reflect TGCFM’s ability to perform adaptive frequency selection: by modulating task-relevant frequency channels, the module produces more discriminative representations, which in turn improve class separability. Thus, both the Grad-CAM and t-SNE visualizations substantiate that TGCFM enhances the network’s feature representation quality through explicit task-adaptive frequency calibration and spatial–frequency fusion.
4.7.2. External Attention Visualization Analysis
As shown in
Figure 9, to evaluate the effectiveness of the TGCFM, we visualize and compare the feature maps generated by Swin and Swin-TGCFM across Stages 1 to 4. In the heatmap visualizations, color variations reflect the model’s focus of attention: deeper red regions indicate areas with richer information content and greater influence on the decision-making process, whereas blue regions are relatively suppressed and contribute less to the final output. This visualization facilitates an intuitive interpretation of the model’s decision basis.
The rows highlighted by red dashed boxes correspond to the results produced by Swin-TGCFM, while red bounding boxes in the images denote the target regions. As observed, incorporating TGCFM enables the model to localize targets more accurately, with the activated regions in the feature maps exhibiting stronger spatial alignment with the corresponding objects in the original images. Subfigures (a–d) illustrate representative scenarios, including (a) occluded objects, (b) multiple objects, (c) simple backgrounds, and (d) complex backgrounds.
4.7.3. Detection Behavior Visualization Analysis
To better understand the behavior of the proposed detection algorithm and its target perception capability in complex remote sensing scenarios, we conduct qualitative visualizations from two complementary perspectives: detection result performance and feature response distribution. First, the predicted bounding boxes are overlaid on the original AITOD images to visually assess the model’s localization accuracy as well as false positives and missed detections under challenging conditions, including multi-scale targets, dense distributions, and complex backgrounds. This visualization provides an intuitive evaluation of the detector’s overall performance and robustness in real-world remote sensing scenes. Second, to further investigate the key regions and discriminative features involved in the model’s detection decisions, heatmap visualizations are employed to analyze the internal feature responses. By comparing the response distributions over target regions and background areas, the heatmaps reveal whether the model can effectively focus on target-relevant regions while suppressing background interference, thereby offering intuitive evidence of the effectiveness of the proposed method in feature representation and small-object perception.
As shown in
Figure 8a, the bounding box visualizations of FSSC-Net, the Baseline, and the Ground Truth are presented. The red dashed boxes indicate the target locations in the original images, while the red solid boxes correspond to the enlarged views. It can be observed that the detection results of FSSC-Net are highly consistent with the ground-truth annotations, achieving more accurate and stable localization. In contrast, the Baseline produces a considerable number of false positives and redundant detections. Overall, FSSC-Net demonstrates superior capability in localizing small objects in remote sensing images and significantly reduces false detections due to its more discriminative feature representations.
Figure 8b illustrates the heatmap visualizations of the models, where warmer colors indicate higher attention responses and cooler colors denote lower responses. Compared with the Baseline, FSSC-Net exhibits more concentrated attention on true target regions while maintaining lower responses to irrelevant background areas. In contrast, the Baseline shows relatively dispersed attention over the targets and excessive activation in background regions, further highlighting the advantage of FSSC-Net in feature modeling and target discrimination.
5. Discussion
We comprehensively validate the generalization and robustness of FSSC-Net across mainstream computer vision tasks, including image classification, object detection, and semantic segmentation. By embedding the self-calibrated frequency modeling and fusion mechanism, comprising DFSM and TGCFM, into popular architectures such as the Swin Transformer, we further demonstrate the necessity and effectiveness of task-driven self-calibration mechanisms for frequency-aware representation learning. Despite its advantages, FSSC-Net still presents certain limitations. The integration of multi-branch frequency modeling and cross-attention-based fusion inevitably increases the model’s parameter count and inference cost, which may pose challenges in resource-constrained environments. Additionally, although FSSC-Net exhibits strong performance in standard vision tasks, its adaptability to more complex scenarios, such as multi-modal fusion and cross-domain recognition, has yet to be systematically investigated. Overall, FSSC-Net provides a novel and effective modeling paradigm for frequency-domain representation and frequency–spatial fusion, offering new insights into improving task adaptability and feature expressiveness in visual models. In future work, we plan to further explore its adaptability across diverse tasks and complex environments, while also investigating lightweight network designs and efficient inference strategies to enhance the deployability and practical scope of FSSC-Net in real-world applications.
6. Conclusions
To address the limited flexibility of frequency-domain feature extraction and the constrained generalization of existing spatial–frequency fusion strategies in remote sensing vision tasks, this paper proposes the Frequency–Spatial Self-Calibrated Network (FSSC-Net). The network is specifically designed to tackle the challenges of remote sensing imagery, including large spatial coverage, extreme scale variations, dense distributions of small objects, and complex backgrounds. By integrating a task- and data-driven frequency modeling mechanism with an efficient spatial–frequency collaborative strategy, FSSC-Net enables adaptive and robust joint representation learning.
Specifically, FSSC-Net comprises a DFSM and a TGCFM. DFSM adaptively modulates frequency responses through a soft-mask mechanism, enabling task-relevant selection of low- and high-frequency components, while TGCFM aligns spatial-domain and frequency-domain features under task guidance. This design substantially enhances the network’s ability to preserve fine-grained structural information, such as edges, contours, and textures, which is critical for small object detection, land-cover classification, and boundary-sensitive semantic segmentation in remote sensing imagery.
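To make the soft-mask idea concrete, the following is a minimal NumPy sketch of frequency modulation via a smooth radial mask, not the paper's actual DFSM implementation: the function names, the sigmoid-shaped radial mask, and the scalar re-weighting coefficient `alpha` are all illustrative assumptions standing in for the learned, task-driven calibration described above.

```python
import numpy as np

def soft_frequency_mask(shape, cutoff=0.15, sharpness=40.0):
    """Radial sigmoid mask in the 2D frequency plane: close to 1 for
    low frequencies, smoothly decaying toward 0 for high frequencies.
    `cutoff` and `sharpness` are illustrative hyperparameters; in a
    learned variant they would be predicted from the input features."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequency coordinates
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequency coordinates
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return 1.0 / (1.0 + np.exp(sharpness * (radius - cutoff)))

def modulate_frequency(feat, cutoff=0.15, sharpness=40.0, alpha=0.5):
    """Soft-mask modulation of a single-channel feature map.

    The map is transformed with a 2D FFT; the soft mask keeps the
    low-frequency component while (1 - mask) isolates high-frequency
    detail (edges, textures). `alpha` re-weights the two components
    before the inverse transform, mimicking adaptive selection of
    low- versus high-frequency content.
    """
    spec = np.fft.fft2(feat)
    mask = soft_frequency_mask(feat.shape, cutoff, sharpness)
    low = spec * mask           # smooth structures, layouts
    high = spec * (1.0 - mask)  # fine-grained structural cues
    recombined = alpha * low + (1.0 - alpha) * high
    return np.real(np.fft.ifft2(recombined))
```

Because the mask is soft rather than a hard band-pass cut, the low- and high-frequency branches overlap smoothly and their relative weighting can be adjusted continuously; in FSSC-Net this weighting is calibrated by the task signal rather than fixed as a hyperparameter.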
Extensive experiments on general vision benchmarks and the AID remote sensing image classification benchmark demonstrate that FSSC-Net consistently outperforms existing state-of-the-art methods across multiple task settings, effectively generalizing to complex remote sensing scenarios while exhibiting strong robustness and cross-task adaptability. Ablation studies further validate the effectiveness of the proposed self-calibrated dual-domain collaboration strategy in improving semantic consistency and task alignment. Moreover, analysis of frequency calibration behaviors across different network depths and training stages reveals systematic dynamic patterns, providing empirical support for the necessity of adaptive, task-aware frequency modeling in remote sensing vision.
In conclusion, FSSC-Net not only introduces a novel paradigm for frequency-domain representation learning in remote sensing imagery but also advances spatial–frequency fusion toward adaptive, task-aware collaborative modeling. Future work will explore the scalability of this framework in multi-modal remote sensing data fusion, real-time large-scale inference, and resource-constrained deployment scenarios, further extending its applicability in practical remote sensing applications.