4.1. Datasets and Evaluation Metrics
4.1.1. Datasets
We conduct systematic evaluations of the proposed method on three representative publicly available aerial and remote sensing object detection datasets: AI-TOD [30], VisDrone [31], and DIOR [32]. These datasets differ significantly in scene complexity, object scale distribution, and category diversity, enabling a comprehensive assessment of the model’s detection performance and generalization ability across diverse application scenarios.
AI-TOD is a high-resolution remote sensing image dataset specifically constructed for small-object detection in aerial scenes. It comprises 28,036 images with 700,621 annotated object instances, primarily collected from real-world scenarios using drones and airborne platforms. The dataset covers various typical aerial environments, including urban roads, parking lots, ports, and residential areas. AI-TOD contains 8 object categories: airplane, bridge, storage tank, ship, swimming pool, vehicle, windmill, and basketball court. A distinguishing characteristic of AI-TOD is that many objects occupy only a small number of pixels in the images, exhibiting extremely small scales, dense spatial distributions, and complex backgrounds. Compared with conventional remote sensing or natural scene detection datasets, AI-TOD places a stronger emphasis on the detection difficulty of ultra-small objects, whose sizes are often limited to a few dozen pixels or even smaller, while also presenting substantial scale variations and class imbalance. These properties impose higher demands on models in terms of shallow-layer detail modeling, multi-scale feature fusion, and preservation of high-frequency structural information. Consequently, AI-TOD has become an important benchmark for evaluating small-object detection algorithms in aerial and remote sensing scenarios.
VisDrone is collected using a variety of real-world UAV platforms and covers complex scenes such as urban streets, residential areas, campuses, commercial districts, and transportation hubs. The dataset exhibits significant viewpoint changes, altitude variations, and background diversity. It contains over 2.6 million annotated object instances with bounding boxes or point annotations. Specifically, the VisDrone-DET detection subset includes 10 common object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor, primarily focusing on traffic participants and pedestrians. Images in VisDrone vary widely in resolution, ranging from 540 × 960 to 2000 × 1500 pixels, and the dataset exhibits notable class imbalance. These factors pose significant challenges for models in multi-scale feature representation, contextual information modeling, and fine-grained structure preservation. As a result, VisDrone is widely used to evaluate the robustness and generalization capability of object detection algorithms in complex aerial scenes, particularly for small and densely distributed objects.
DIOR is a large-scale, high-quality optical remote sensing object detection dataset designed to provide a unified benchmark for multi-category detection tasks in remote sensing scenarios. The dataset consists of high-resolution images from multiple sources, covering diverse geographic regions, imaging conditions, and land cover types. It encompasses a wide range of scenes, including airports, ports, urban areas, industrial parks, and farmland. DIOR contains 20 object categories, such as airplane, ship, vehicle, bridge, harbor, stadium, windmill, and storage tank, covering transportation infrastructure, industrial facilities, and public infrastructure. DIOR is characterized by large-scale variations, comprehensive category coverage, and high scene complexity, making it suitable for evaluating overall detection performance and scalability in remote sensing scenarios. It also serves as a complementary benchmark to datasets focusing primarily on small objects.
4.1.2. Implementation Details and Evaluation Metrics
All experiments were conducted on an Ubuntu 22.04 operating system, using a platform equipped with four NVIDIA RTX 4090 GPUs (24 GB memory per card), and implemented with the PyTorch 2.0.1 deep learning framework. The proposed method adopts YOLO11 [33] as the baseline detector, with the P2 feature level explicitly incorporated during training as a shallow information source to preserve high-resolution spatial details. During training, the model is optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The initial learning rate is set to 0.01 and decayed following a cosine annealing schedule. A warm-up strategy is applied during the first 5 epochs, linearly increasing the learning rate. The batch size is set to 16, and training is performed for a total of 200 epochs. To ensure fair comparison across datasets, all input images are uniformly resized to 800 × 800 pixels during both training and testing. For frequency feature extraction, a standard two-dimensional Discrete Wavelet Transform (DWT) is implemented using the PyWavelets library. Specifically, the one-dimensional low-pass and high-pass decomposition filters of the selected mother wavelet are first obtained and then combined via outer-product operations to construct the corresponding 2D wavelet kernels (LL, LH, HL, HH). These wavelet filters are registered as fixed tensors using register_buffer, ensuring they remain non-trainable and are excluded from gradient updates. The DWT is implemented as a channel-wise depthwise convolution, allowing each feature channel to be processed independently. A stride of 2 is applied to simultaneously achieve single-level frequency decomposition and spatial downsampling.
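The wavelet-filter construction and depthwise-convolution decomposition described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact code: the Haar decomposition pair is hard-coded here (the paper obtains filters from PyWavelets, e.g., `pywt.Wavelet(name).dec_lo` / `.dec_hi`), and the output subband layout is an assumption.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT2D(nn.Module):
    """Single-level 2D DWT implemented as a fixed depthwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        s = 1.0 / math.sqrt(2.0)
        lo = torch.tensor([s, s])    # Haar low-pass decomposition filter
        hi = torch.tensor([s, -s])   # Haar high-pass decomposition filter
        # Outer products of the 1D filters yield the four 2D kernels
        # (LL, LH, HL, HH), as described in the text.
        subbands = [torch.outer(a, b) for a in (lo, hi) for b in (lo, hi)]
        weight = torch.stack(subbands).unsqueeze(1)   # (4, 1, 2, 2)
        weight = weight.repeat(channels, 1, 1, 1)     # (4*C, 1, 2, 2)
        # register_buffer keeps the filters fixed (non-trainable,
        # excluded from gradient updates).
        self.register_buffer("weight", weight)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # groups=C processes each feature channel independently; stride=2
        # performs single-level decomposition and 2x downsampling at once.
        out = F.conv2d(x, self.weight, stride=2, groups=self.channels)
        b, _, h, w = out.shape
        # Reshape to (B, C, 4, H/2, W/2): four subbands per channel.
        return out.view(b, self.channels, 4, h, w)
```

On a constant input, only the LL subband responds while the three high-frequency subbands are zero, which is a quick sanity check that the kernels separate low- and high-frequency content as intended.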
For quantitative evaluation, we adopt standard metrics including AP, AP50, AP75, APvt (very tiny), APt (tiny), APs (small), APm (medium), and APl (large). Here, AP (or mAP) denotes the mean Average Precision computed over multiple IoU thresholds from 0.5 to 0.95 with a step size of 0.05. AP50 and AP75 represent the average precision at IoU thresholds of 0.5 and 0.75, respectively. The scale-specific metrics APvt, APt, APs, APm, and APl evaluate the detection performance for ultra-tiny, tiny, small, medium, and large objects, enabling a fine-grained analysis of the model’s ability to handle objects of different scales.
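As a minimal illustration of the averaging convention above (the detection-to-ground-truth matching and precision-recall integration steps are omitted), the IoU-threshold sweep behind AP/mAP can be sketched as:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# AP is averaged over ten IoU thresholds, 0.50 to 0.95 in steps of 0.05;
# AP50 and AP75 are the single-threshold cases at 0.5 and 0.75.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def mean_ap(ap_per_threshold):
    """Average per-threshold AP values into the final AP/mAP figure."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

The scale-specific metrics (APvt through APl) apply the same computation restricted to ground-truth objects within each size bracket.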
4.2. Comparison with State-of-the-Art Methods
(1) Table 1 compares the detection performance of different methods on the DIOR dataset. Based on methodological characteristics, these methods can be broadly categorized into four groups. Two-stage Detectors first generate region proposals via an RPN and then perform classification and bounding box regression, typically achieving high localization accuracy; representative methods include Faster R-CNN, Cascade R-CNN, and Mask R-CNN. One-stage Detectors predict classes and bounding boxes directly on feature maps, offering higher computational efficiency and suitability for dense object scenarios; representative methods include RetinaNet, FCOS, and ASSD. Transformer-based Detectors leverage self-attention mechanisms to model global dependencies, capturing cross-scale and long-range features and improving performance in complex scenes; examples include DETR, Deformable DETR, ACI-former, and Swin Transformer. Finally, Specialized or Multi-feature Networks are optimized for small objects, remote sensing images, or multi-feature fusion tasks, enhancing detection capabilities through multi-scale features, adaptive weighting, or spatial-channel attention; representative methods include TMAFNet, AFGMFNet, AGMF-Net, SDPNet, and BAFNet (Table 2).
It can be observed that the proposed method achieves significant advantages in both overall performance and the majority of individual categories, attaining an mAP of 78.6%, ranking first among all compared approaches. Compared with representative transformer-based methods such as Swin, Deformable DETR, and ACI-former, our method achieves a stable improvement of 1.9–5.1% mAP, demonstrating its stronger overall detection capability in complex remote sensing scenarios.
At the category level, our method achieves state-of-the-art performance in multiple representative categories, including BF, BC, BR, ETS, GTF, STA, STO, and TC. Specifically, BF, BC, and TC reach AP values of 94.3%, 91.8%, and 95.9%, respectively, significantly outperforming existing methods. These objects typically possess regular geometric structures and well-defined boundaries, with discriminative cues highly dependent on high-frequency structures and precise spatial localization. By explicitly introducing shallow high-frequency compensation within the feature pyramid, our method effectively mitigates the progressive degradation of fine-grained details in deep features, thereby substantially improving detection performance for such objects.
Moreover, for categories with elongated structures or large-scale spans, such as bridge (BR), expressway toll station (ETS), and ground track field (GTF), our method also achieves the best results, with AP values of 55.2%, 86.8%, and 87.8%, respectively. For these objects, which are prone to cross-scale spatial misalignment, the proposed bias-guided cross-scale spatial alignment mechanism effectively enhances the geometric consistency between shallow and deep features, improving localization stability and overall detection accuracy.
Further analysis indicates that while methods such as TBNet, LSK-Net, and AFGMFNet exhibit strong performance on certain individual categories, their overall performance shows considerable fluctuation across categories. In contrast, our method achieves top or second-best performance in 12 out of 20 categories, demonstrating a more balanced and stable detection capability. Taken together, the performance improvements of our method are not due to coincidental optimization of a few categories, but result from a systematic enhancement of shallow detail representation, cross-scale feature consistency, and high-frequency structure modeling. This validates its effectiveness and strong generalization ability across multiple remote sensing object categories.
(2) Table 2 presents a comparison of the proposed method with several mainstream detection algorithms on the AI-TOD dataset. Our method achieves significant advantages in both overall performance and small-object-related metrics, reaching 30.87% AP, 62.7% AP50, and 26.83% AP75, demonstrating the best or highly competitive performance among all compared approaches. Compared with traditional two-stage methods such as Faster R-CNN and Cascade R-CNN, as well as classic single-stage methods like RetinaNet and FCOS, our approach achieves a substantial improvement of 15–20% AP, validating its effectiveness in ultra-small-object detection scenarios.
Since AI-TOD predominantly consists of ultra-small objects, it imposes higher requirements on shallow detail modeling and multi-scale feature representation. From the scale-specific metrics, our method achieves 16.78% APvt, 31.56% APt, and 35.47% APs, all surpassing existing methods. These results indicate that the proposed cross-scale frequency compensation and spatial alignment mechanisms effectively mitigate the deficiency in high-frequency details and localization information in deep features.
At the category level, our method demonstrates superior performance on multiple representative small-object classes. Specifically, airplane, storage tank, and vehicle achieve AP values of 40.4%, 52.2%, and 37.3%, respectively. These objects are typically small, densely distributed, and highly dependent on edge and local texture cues. By incorporating shallow high-frequency information into the feature pyramid, our method enhances the deep features’ representation of fine-grained structures, significantly improving detection accuracy. Moreover, in urban scenarios, our method accurately identifies vehicles beside buildings or yachts within docks even under complex backgrounds or shadow occlusions, reflecting its ability to strengthen object feature representation.
Further comparison with small-object-specific remote sensing methods, such as BAFNet, shows that while these methods may achieve advantages in certain categories or scales, their overall performance exhibits considerable fluctuation. In contrast, our approach maintains a more balanced performance across overall AP, small-object metrics, and multi-category detection, slightly surpassing BAFNet in AP (30.87% vs. 30.5%) and further widening the margin on key metrics such as AP50, AP75, and APvt, demonstrating stronger robustness and generalization capability.
In summary, the performance improvements of our method on AI-TOD mainly result from systematic modeling of ultra-small-object discriminative details and cross-scale geometric consistency, rather than local optimization for individual categories. The experimental results fully validate the effectiveness and generalization potential of the proposed method for small-object detection in complex remote sensing scenarios.
(3) Table 3 summarizes the performance comparison between the proposed method and several mainstream detection algorithms on the VisDrone dataset. It can be observed that our method achieves significant advantages in both overall detection accuracy and across objects of different scales, reaching 36.2% AP, 56.5% AP50, and 36.1% AP75, outperforming all compared methods in overall performance. Compared with traditional two-stage detectors such as Faster R-CNN and Cascade R-CNN, as well as classic single-stage methods like RetinaNet, CenterNet, and YOLOF, our method achieves a substantial improvement of 10–20% AP. Moreover, compared with recent high-performance transformer-based detectors such as DINO and RT-DETR, our approach maintains clear advantages on key metrics including AP and AP75, indicating stronger discriminative capability and more precise localization in complex aerial scenarios.
From the scale perspective, VisDrone contains numerous small objects with large-scale variations and dense distributions, which imposes high demands on feature pyramid detail representation and cross-scale consistency. Our method achieves 25.2% APs, 49.8% APm, and 56.2% APl, all significantly surpassing existing approaches. In particular, for small-object detection (APs), our method improves by 2.8% over BAFNet and nearly 9% over Cascade R-CNN and DINO, demonstrating the effectiveness of the proposed method in fine-grained object modeling. Further comparisons with methods optimized for aerial scenarios show that while BAFNet and FENet exhibit competitive performance in overall AP or AP50, their improvements on high-IoU thresholds or multi-scale metrics are relatively limited. In contrast, our method not only achieves the best overall AP but also further extends its advantages on AP75 and scale-specific metrics (APm/APl), indicating its ability to maintain precise localization while achieving stable detection across multiple object scales.
In summary, the performance gains of our method on VisDrone are not due to local optimization for a single scale or individual categories. Instead, they result from systematic enhancements in shallow high-frequency detail modeling, cross-scale feature consistency, and spatial alignment mechanisms, enabling the model to achieve higher detection accuracy and robustness in densely populated, scale-variant, and complex aerial scenes. These results further validate the effectiveness and strong generalization ability of the proposed method in small-object-dense detection tasks.
4.4. Ablation Studies
This section presents a systematic ablation study to evaluate the effectiveness of CFBA-FPN and its individual components, including CFCI and BCSA. All experiments are conducted on the VisDrone dataset and compared with the baseline under identical backbone architectures, training strategies, and evaluation protocols. Unless otherwise specified, performance is consistently assessed using AP, APs, APm, and APl as evaluation metrics.
4.4.1. Effectiveness of Cross-Scale Frequency Calibration Injection Module
Table 4 reports the detection performance when deploying the Cross-Scale Frequency Calibration Injection (CFCI) module at different levels of the feature pyramid, aiming to analyze its effectiveness across varying spatial resolutions and semantic hierarchies. All experiments are conducted under identical backbone networks and training configurations. Overall, introducing CFCI at any pyramid level consistently yields performance gains, validating its effectiveness in cross-scale feature calibration.
When CFCI is applied only to the shallow P3 layer, the overall AP increases by 0.4%, with the most notable improvement observed for small objects (APs +0.9%). This indicates that calibrating high-resolution shallow features effectively enhances fine-grained object representations. As CFCI is extended to multiple pyramid levels, performance further improves. In particular, jointly introducing CFCI at the adjacent P3 and P4 levels leads to the most pronounced gains, achieving a +1.2% improvement in overall AP and consistent enhancements across objects of different scales. This suggests that jointly calibrating semantically contiguous feature levels helps alleviate semantic and spatial inconsistencies across scales.
In contrast, deploying CFCI only at non-adjacent levels or exclusively at middle-to-high pyramid layers results in relatively limited performance improvements. When CFCI is simultaneously applied to P3, P4, and P5, the model achieves the best overall performance, with AP reaching 34.7% (+1.6%), along with substantial gains for both small and large object detection (APs +1.8% and APl +5.1%). These results demonstrate that multi-level collaborative calibration effectively integrates fine-grained details from shallow layers with rich semantic information from deeper layers, enabling progressive cross-scale feature enhancement.
In summary, the ablation study quantitatively verifies the effectiveness of CFCI across multiple pyramid levels, particularly within hierarchically contiguous feature pyramids, and further substantiates the rationality of the proposed progressive cross-scale calibration design.
4.4.2. Effectiveness of Bias-Guided Cross-Scale Spatial Alignment Module
Table 5 presents the detection performance when deploying the Bias-Guided Cross-Scale Spatial Alignment (BCSA) module at different levels of the feature pyramid, aiming to analyze its effectiveness in spatial alignment across various scales and semantic hierarchies. All experiments were conducted using the same backbone network and training configuration.
Overall, introducing BCSA at different pyramid levels consistently improves performance over the baseline, indicating that explicitly modeling cross-scale spatial offsets positively contributes to enhancing feature alignment. When BCSA is applied only to the shallow P3 layer, the overall AP gain is limited (+0.3%), although slight improvements are observed in both APs and APl, suggesting that single-layer spatial alignment provides only modest gains in scale robustness.
Performance improvements become more pronounced when BCSA is applied across multiple levels. Specifically, introducing BCSA at adjacent levels P3 and P4 results in a +1.4% increase in overall AP, with a notable gain at AP75 (+1.3%), demonstrating that cross-scale spatial alignment effectively enhances localization accuracy under high-IoU thresholds. In contrast, deploying BCSA at non-adjacent levels (P3 and P5) or exclusively at middle-to-high levels (P4 and P5) still improves overall AP, but the gains are slightly lower than those of the adjacent-level configuration, suggesting that large semantic gaps may reduce the stability of spatial offset modeling.
When BCSA is simultaneously applied to P3, P4, and P5, the model achieves optimal performance, with overall AP increasing to 35.1% (+2.0% over the baseline) and significant gains observed across all object scales, particularly for small objects (APs +2.5%). This indicates that multi-level collaborative spatial alignment effectively mitigates cumulative spatial offsets in cross-scale features, thereby enhancing the localization accuracy and robustness of fused features.
In summary, the ablation study demonstrates that BCSA exhibits stronger spatial alignment capability across multiple pyramid levels, especially within hierarchically contiguous feature pyramids, and experimentally validates the rationality of its design for alleviating cross-scale spatial offsets and improving high-precision localization.
4.4.3. Joint Effect of Frequency Calibration and Spatial Alignment
Table 6 presents the ablation results of jointly applying the cross-scale frequency compensation and spatial alignment modules, where “×” and “√” indicate whether each module is adopted. The results indicate that both mechanisms independently improve detection performance: when only frequency compensation is introduced, the overall AP increases to 34.7%, with small-object performance rising to 23.9%; when only spatial alignment is applied, the overall AP further improves to 35.1%, and APs reaches 24.6%, demonstrating that explicitly modeling cross-scale spatial offsets is particularly effective for small-object localization.
When both modules are enabled simultaneously, the model achieves optimal performance, with overall AP increasing to 36.2% and APs to 25.2%, showing additional gains over the single-module configurations. This indicates that frequency-aware compensation and spatial alignment exhibit strong complementarity in cross-scale feature modeling: the former focuses on enhancing discriminative representations under scale variations, while the latter effectively mitigates spatial misalignment during cross-scale feature fusion.
However, simply stacking these two modules within a conventional FPN is insufficient to fully exploit their synergistic potential. Due to the lack of constraints in the feature injection process and inconsistent cross-scale interactions, feature misalignment and redundant fusion may still occur, particularly affecting small-object detection. To address this, we propose CFBA-FPN, a customized feature pyramid structure designed for the collaborative modeling of frequency compensation and spatial alignment. In CFBA-FPN, frequency-calibrated features are progressively injected into each pyramid level under geometric-aware and gating-controlled guidance, explicitly coordinating frequency enhancement and spatial alignment at the structural level.
As shown in Table 6, CFBA-FPN achieves substantial improvements in both overall AP and APs compared to other configurations. Compared with the simple combination of frequency compensation and spatial alignment, CFBA-FPN consistently enhances small-object detection performance, fully validating the necessity and effectiveness of structured feature propagation and coordinated fusion to unleash the complementarity of the two mechanisms.
4.4.4. Ablation Study of High-Frequency Residual and Existence-Aware Gating in CFCI
Table 7 presents an ablation analysis of the High-Frequency Residual (HFR) and the existence-aware gating within CFCI. When HFR is removed, the overall AP drops by 0.7%, with notable decreases in AP75 and APs, indicating that high-frequency residuals play a critical role in modeling fine-grained structures and achieving high-precision localization.
Further removing the existence-aware gating results in a more significant performance degradation, with overall AP decreasing to 35.0% and APs dropping by 1.4%. This demonstrates that explicitly modeling valid regions and suppressing irrelevant responses is essential for stable feature injection during cross-scale frequency compensation.
These results collectively validate the complementarity and necessity of HFR and the existence-aware gating within CFCI, whose synergistic effect underpins the model’s performance advantages in small-object detection and at high-IoU thresholds.
4.4.5. Ablation Study of Confidence Gating
Table 8 presents an ablation analysis of the confidence gating mechanism within BCSA. When the confidence gating is removed, the overall AP drops from 36.2% to 35.4%, with AP75 and APs decreasing by 1.1% and 1.3%, respectively, while AP50 remains largely unchanged. This indicates that confidence gating primarily affects high-precision localization and small-object detection performance.
These results suggest that incorporating a confidence-aware gating mechanism during cross-scale spatial alignment helps suppress the influence of low-confidence or noisy offsets on feature fusion, thereby enhancing the stability and reliability of spatial alignment. Overall, the confidence gating plays a critical role in fully leveraging BCSA’s advantages in fine-grained object localization and under high-IoU thresholds.
4.4.6. Ablation Study of Mother Wavelet in CFCI
As shown in Table 9, the choice of mother wavelet has only a minor influence on the detection performance of CFBA-FPN. The Haar wavelet achieves the best results (mAP: 78.6%, AP50: 94.1%), while db2 and sym2 show slight performance decreases, with overall variations remaining small.
These results indicate that the proposed framework is not highly sensitive to the specific wavelet type, since DWT in our design mainly serves as a frequency decoupling operator, and the primary performance gains come from the subsequent semantic gating and frequency calibration mechanisms.
4.4.7. Computational Cost Analysis
Table 10 compares the computational cost and inference speed of CFBA-FPN with the baseline and other popular detectors. CFBA-FPN introduces a modest increase in parameters and FLOPs compared to the baseline (Params: 27.82 M vs. 25.33 M; GFLOPs: 71.83 vs. 68.25), while maintaining a relatively high inference speed (49.9 FPS). In contrast, Cascade R-CNN and RetinaNet incur substantially higher computational costs and lower FPS, demonstrating that CFBA-FPN achieves a favorable trade-off between detection performance and efficiency.
4.4.8. Analysis of Computational Cost and Performance
Table 11 compares the precision gain per unit of computational cost (GFLOPs) among different methods on small-object remote sensing datasets. CFBA-FPN achieves an mAP increase of 4.8 with an additional 9.11 GFLOPs, corresponding to a GFLOPs-per-mAP ratio of 1.9, which is lower than most other methods and indicates a more efficient trade-off between accuracy improvement and computational overhead. In particular, while methods like GLSDet achieve similar absolute mAP gains, their unit-cost efficiency is much lower (5.66 GFLOPs/mAP), highlighting that CFBA-FPN delivers higher performance gains relative to the additional computation required.
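The unit-cost figure quoted above is simply the ratio of added compute to accuracy gain; a trivial check with the numbers from the text:

```python
def gflops_per_map_point(extra_gflops: float, map_gain: float) -> float:
    """Additional GFLOPs paid per point of mAP improvement (lower is better)."""
    return extra_gflops / map_gain

# CFBA-FPN figures from the comparison: +9.11 GFLOPs for +4.8 mAP.
print(round(gflops_per_map_point(9.11, 4.8), 2))  # ≈ 1.9
```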