4.1. Datasets and Evaluation Metrics
4.1.1. Datasets
We conduct systematic evaluations of the proposed method on three representative publicly available aerial and remote sensing object detection datasets: AI-TOD [30], VisDrone [31], and DIOR [32]. These datasets differ significantly in scene complexity, object scale distribution, and category diversity, enabling a comprehensive assessment of the model’s detection performance and generalization ability across diverse application scenarios.
AI-TOD is a high-resolution remote sensing image dataset specifically constructed for small-object detection in aerial scenes. It comprises 28,036 images with 700,621 annotated object instances, primarily collected from real-world scenarios using drones and airborne platforms. The dataset covers various typical aerial environments, including urban roads, parking lots, ports, and residential areas. AI-TOD contains 8 object categories: airplane, bridge, storage tank, ship, swimming pool, vehicle, windmill, and basketball court. A distinguishing characteristic of AI-TOD is that many objects occupy only a small number of pixels in the images, exhibiting extremely small scales, dense spatial distributions, and complex backgrounds. Compared with conventional remote sensing or natural scene detection datasets, AI-TOD places a stronger emphasis on the detection difficulty of ultra-small objects, whose sizes are often limited to a few dozen pixels or even smaller, while also presenting substantial scale variations and class imbalance. These properties impose higher demands on models in terms of shallow-layer detail modeling, multi-scale feature fusion, and preservation of high-frequency structural information. Consequently, AI-TOD has become an important benchmark for evaluating small-object detection algorithms in aerial and remote sensing scenarios.
VisDrone is collected using a variety of real-world UAV platforms and covers complex scenes such as urban streets, residential areas, campuses, commercial districts, and transportation hubs. The dataset exhibits significant viewpoint changes, altitude variations, and background diversity. It contains over 2.6 million annotated object instances with bounding boxes or point annotations. Specifically, the VisDrone-DET detection subset includes 10 common object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor, primarily focusing on traffic participants and pedestrians. Images in VisDrone vary widely in resolution, ranging from 540 × 960 to 2000 × 1500 pixels, and the dataset exhibits notable class imbalance. These factors pose significant challenges for models in multi-scale feature representation, contextual information modeling, and fine-grained structure preservation. As a result, VisDrone is widely used to evaluate the robustness and generalization capability of object detection algorithms in complex aerial scenes, particularly for small and densely distributed objects.
DIOR is a large-scale, high-quality optical remote sensing object detection dataset designed to provide a unified benchmark for multi-category detection tasks in remote sensing scenarios. The dataset consists of high-resolution images from multiple sources, covering diverse geographic regions, imaging conditions, and land cover types. It encompasses a wide range of scenes, including airports, ports, urban areas, industrial parks, and farmland. DIOR contains 20 object categories, such as airplane, ship, vehicle, bridge, harbor, stadium, windmill, and storage tank, covering transportation infrastructure, industrial facilities, and public infrastructure. DIOR is characterized by large-scale variations, comprehensive category coverage, and high scene complexity, making it suitable for evaluating overall detection performance and scalability in remote sensing scenarios. It also serves as a complementary benchmark to datasets focusing primarily on small objects.
4.1.2. Implementation Details and Evaluation Metrics
All experiments were conducted on an Ubuntu 22.04 operating system, using a platform equipped with four NVIDIA RTX 4090 GPUs (24 GB memory per card), and implemented with the PyTorch 2.0.1 deep learning framework. The proposed method adopts YOLO11 [33] as the baseline detector, with the P2 feature level explicitly incorporated during training as a shallow information source to preserve high-resolution spatial details. During training, the model is optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The initial learning rate is set to 0.01 and decayed following a cosine annealing schedule. A warm-up strategy is applied during the first 5 epochs, linearly increasing the learning rate. The batch size is set to 16, and training is performed for a total of 200 epochs. To ensure fair comparison across datasets, all input images are uniformly resized to 800 × 800 pixels during both training and testing. For frequency feature extraction, a standard two-dimensional Discrete Wavelet Transform (DWT) is implemented using the PyWavelets library. Specifically, the one-dimensional low-pass and high-pass decomposition filters of the selected mother wavelet are first obtained and then combined via outer-product operations to construct the corresponding 2D wavelet kernels (LL, LH, HL, HH). These wavelet filters are registered as fixed tensors using register_buffer, ensuring they remain non-trainable and are excluded from gradient updates. The DWT is implemented as a channel-wise depthwise convolution, allowing each feature channel to be processed independently. A stride of 2 is applied to simultaneously achieve single-level frequency decomposition and spatial downsampling.
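The wavelet-filter construction and depthwise-convolution decomposition described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact code: the Haar decomposition pair is hard-coded here (the paper obtains filters from PyWavelets, e.g., `pywt.Wavelet(name).dec_lo` / `.dec_hi`), and the output subband layout is an assumption.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT2D(nn.Module):
    """Single-level 2D DWT implemented as a fixed depthwise convolution."""

    def __init__(self, channels: int):
        super().__init__()
        s = 1.0 / math.sqrt(2.0)
        lo = torch.tensor([s, s])    # Haar low-pass decomposition filter
        hi = torch.tensor([s, -s])   # Haar high-pass decomposition filter
        # Outer products of the 1D filters yield the four 2D kernels
        # (LL, LH, HL, HH), as described in the text.
        subbands = [torch.outer(a, b) for a in (lo, hi) for b in (lo, hi)]
        weight = torch.stack(subbands).unsqueeze(1)   # (4, 1, 2, 2)
        weight = weight.repeat(channels, 1, 1, 1)     # (4*C, 1, 2, 2)
        # register_buffer keeps the filters fixed (non-trainable,
        # excluded from gradient updates).
        self.register_buffer("weight", weight)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # groups=C processes each feature channel independently; stride=2
        # performs single-level decomposition and 2x downsampling at once.
        out = F.conv2d(x, self.weight, stride=2, groups=self.channels)
        b, _, h, w = out.shape
        # Reshape to (B, C, 4, H/2, W/2): four subbands per channel.
        return out.view(b, self.channels, 4, h, w)
```

On a constant input, only the LL subband responds while the three high-frequency subbands are zero, which is a quick sanity check that the kernels separate low- and high-frequency content as intended.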
For quantitative evaluation, we adopt standard metrics including AP, AP50, AP75, APvt (very tiny), APt (tiny), APs (small), APm (medium), and APl (large). Here, AP (or mAP) denotes the mean Average Precision computed over multiple IoU thresholds from 0.5 to 0.95 with a step size of 0.05. AP50 and AP75 represent the average precision at IoU thresholds of 0.5 and 0.75, respectively. The scale-specific metrics APvt, APt, APs, APm, and APl evaluate the detection performance for ultra-tiny, tiny, small, medium, and large objects, enabling a fine-grained analysis of the model’s ability to handle objects of different scales.
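As a minimal illustration of the averaging convention above (the detection-to-ground-truth matching and precision-recall integration steps are omitted), the IoU-threshold sweep behind AP/mAP can be sketched as:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# AP is averaged over ten IoU thresholds, 0.50 to 0.95 in steps of 0.05;
# AP50 and AP75 are the single-threshold cases at 0.5 and 0.75.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

def mean_ap(ap_per_threshold):
    """Average per-threshold AP values into the final AP/mAP figure."""
    return sum(ap_per_threshold) / len(ap_per_threshold)
```

The scale-specific metrics (APvt through APl) apply the same computation restricted to ground-truth objects within each size bracket.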
4.2. Comparison with State-of-the-Art Methods
(1) Table 1 compares the detection performance of different methods on the DIOR dataset. Based on methodological characteristics, these methods can be broadly categorized into four groups. Two-stage Detectors first generate region proposals via an RPN and then perform classification and bounding box regression, typically achieving high localization accuracy; representative methods include Faster R-CNN, Cascade R-CNN, and Mask R-CNN. One-stage Detectors predict classes and bounding boxes directly on feature maps, offering higher computational efficiency and suitability for dense object scenarios; representative methods include RetinaNet, FCOS, and ASSD. Transformer-based Detectors leverage self-attention mechanisms to model global dependencies, capturing cross-scale and long-range features and improving performance in complex scenes; examples include DETR, Deformable DETR, ACI-former, and Swin Transformer. Finally, Specialized or Multi-feature Networks are optimized for small objects, remote sensing images, or multi-feature fusion tasks, enhancing detection capabilities through multi-scale features, adaptive weighting, or spatial-channel attention; representative methods include TMAFNet, AFGMFNet, AGMF-Net, SDPNet, and BAFNet (Table 2).
It can be observed that the proposed method achieves significant advantages in both overall performance and the majority of individual categories, attaining an mAP of 78.6%, ranking first among all compared approaches. Compared with representative transformer-based methods such as Swin, Deformable DETR, and ACI-former, our method achieves a stable improvement of 1.9–5.1% mAP, demonstrating its stronger overall detection capability in complex remote sensing scenarios.
At the category level, our method achieves state-of-the-art performance in multiple representative categories, including BF, BC, BR, ETS, GTF, STA, STO, and TC. Specifically, BF, BC, and TC reach AP values of 94.3%, 91.8%, and 95.9%, respectively, significantly outperforming existing methods. These objects typically possess regular geometric structures and well-defined boundaries, with discriminative cues highly dependent on high-frequency structures and precise spatial localization. By explicitly introducing shallow high-frequency compensation within the feature pyramid, our method effectively mitigates the progressive degradation of fine-grained details in deep features, thereby substantially improving detection performance for such objects.
Moreover, for categories with elongated structures or large-scale spans, such as bridge (BR), expressway toll station (ETS), and ground track field (GTF), our method also achieves the best results, with AP values of 55.2%, 86.8%, and 87.8%, respectively. For these objects, which are prone to cross-scale spatial misalignment, the proposed bias-guided cross-scale spatial alignment mechanism effectively enhances the geometric consistency between shallow and deep features, improving localization stability and overall detection accuracy.
Further analysis indicates that while methods such as TBNet, LSK-Net, and AFGMFNet exhibit strong performance on certain individual categories, their overall performance shows considerable fluctuation across categories. In contrast, our method achieves top or second-best performance in 12 out of 20 categories, demonstrating a more balanced and stable detection capability. Taken together, the performance improvements of our method are not due to coincidental optimization of a few categories, but result from a systematic enhancement of shallow detail representation, cross-scale feature consistency, and high-frequency structure modeling. This validates its effectiveness and strong generalization ability across multiple remote sensing object categories.
(2) Table 2 presents a comparison of the proposed method with several mainstream detection algorithms on the AI-TOD dataset. Our method achieves significant advantages in both overall performance and small-object-related metrics, reaching 30.87% AP, 62.7% AP50, and 26.83% AP75, demonstrating the best or highly competitive performance among all compared approaches. Compared with traditional two-stage methods such as Faster R-CNN and Cascade R-CNN, as well as classic single-stage methods like RetinaNet and FCOS, our approach achieves a substantial improvement of 15–20% AP, validating its effectiveness in ultra-small-object detection scenarios.
Since AI-TOD predominantly consists of ultra-small objects, it imposes higher requirements on shallow detail modeling and multi-scale feature representation. From the scale-specific metrics, our method achieves 16.78% APvt, 31.56% APt, and 35.47% APs, all surpassing existing methods. These results indicate that the proposed cross-scale frequency compensation and spatial alignment mechanisms effectively mitigate the deficiency in high-frequency details and localization information in deep features.
At the category level, our method demonstrates superior performance on multiple representative small-object classes. Specifically, airplane, storage tank, and vehicle achieve AP values of 40.4%, 52.2%, and 37.3%, respectively. These objects are typically small, densely distributed, and highly dependent on edge and local texture cues. By incorporating shallow high-frequency information into the feature pyramid, our method enhances the deep features’ representation of fine-grained structures, significantly improving detection accuracy. Moreover, in urban scenarios, our method accurately identifies vehicles beside buildings or yachts within docks even under complex backgrounds or shadow occlusions, reflecting its ability to strengthen object feature representation.
Further comparison with small-object-specific remote sensing methods, such as BAFNet, shows that while these methods may achieve advantages in certain categories or scales, their overall performance exhibits considerable fluctuation. In contrast, our approach maintains a more balanced performance across overall AP, small-object metrics, and multi-category detection, slightly surpassing BAFNet in AP (30.87% vs. 30.5%) and further widening the margin on key metrics such as AP50, AP75, and APvt, demonstrating stronger robustness and generalization capability.
In summary, the performance improvements of our method on AI-TOD mainly result from systematic modeling of ultra-small-object discriminative details and cross-scale geometric consistency, rather than local optimization for individual categories. The experimental results fully validate the effectiveness and generalization potential of the proposed method for small-object detection in complex remote sensing scenarios.
(3) Table 3 summarizes the performance comparison between the proposed method and several mainstream detection algorithms on the VisDrone dataset. It can be observed that our method achieves significant advantages in both overall detection accuracy and across objects of different scales, reaching 36.2% AP, 56.5% AP50, and 36.1% AP75, outperforming all compared methods in overall performance. Compared with traditional two-stage detectors such as Faster R-CNN and Cascade R-CNN, as well as classic single-stage methods like RetinaNet, CenterNet, and YOLOF, our method achieves a substantial improvement of 10–20% AP. Moreover, compared with recent high-performance transformer-based detectors such as DINO and RT-DETR, our approach maintains clear advantages on key metrics including AP and AP75, indicating stronger discriminative capability and more precise localization in complex aerial scenarios.
From the scale perspective, VisDrone contains numerous small objects with large-scale variations and dense distributions, which imposes high demands on feature pyramid detail representation and cross-scale consistency. Our method achieves 25.2% APs, 49.8% APm, and 56.2% APl, all significantly surpassing existing approaches. In particular, for small-object detection (APs), our method improves by 2.8% over BAFNet and nearly 9% over Cascade R-CNN and DINO, demonstrating the effectiveness of the proposed method in fine-grained object modeling. Further comparisons with methods optimized for aerial scenarios show that while BAFNet and FENet exhibit competitive performance in overall AP or AP50, their improvements on high-IoU thresholds or multi-scale metrics are relatively limited. In contrast, our method not only achieves the best overall AP but also further extends its advantages on AP75 and scale-specific metrics (APm/APl), indicating its ability to maintain precise localization while achieving stable detection across multiple object scales.
In summary, the performance gains of our method on VisDrone are not due to local optimization for a single scale or individual categories. Instead, they result from systematic enhancements in shallow high-frequency detail modeling, cross-scale feature consistency, and spatial alignment mechanisms, enabling the model to achieve higher detection accuracy and robustness in densely populated, scale-variant, and complex aerial scenes. These results further validate the effectiveness and strong generalization ability of the proposed method in small-object-dense detection tasks.
4.4. Ablation Studies
This section presents a systematic ablation study to evaluate the effectiveness of CFBA-FPN and its individual components, including CFCI and BCSA. All experiments are conducted on the VisDrone dataset and compared with the baseline under identical backbone architectures, training strategies, and evaluation protocols. Unless otherwise specified, performance is consistently assessed using AP, APs, APm, and APl as evaluation metrics.
4.4.1. Effectiveness of Cross-Scale Frequency Calibration Injection Module
Table 4 reports the detection performance when deploying the Cross-Scale Frequency Calibration Injection (CFCI) module at different levels of the feature pyramid, aiming to analyze its effectiveness across varying spatial resolutions and semantic hierarchies. All experiments are conducted under identical backbone networks and training configurations. Overall, introducing CFCI at any pyramid level consistently yields performance gains, validating its effectiveness in cross-scale feature calibration.
When CFCI is applied only to the shallow P3 layer, the overall AP increases by 0.4%, with the most notable improvement observed for small objects (APs +0.9%). This indicates that calibrating high-resolution shallow features effectively enhances fine-grained object representations. As CFCI is extended to multiple pyramid levels, performance further improves. In particular, jointly introducing CFCI at the adjacent P3 and P4 levels leads to the most pronounced gains, achieving a +1.2% improvement in overall AP and consistent enhancements across objects of different scales. This suggests that jointly calibrating semantically contiguous feature levels helps alleviate semantic and spatial inconsistencies across scales.
In contrast, deploying CFCI only at non-adjacent levels or exclusively at middle-to-high pyramid layers results in relatively limited performance improvements. When CFCI is simultaneously applied to P3, P4, and P5, the model achieves the best overall performance, with AP reaching 34.7% (+1.6%), along with substantial gains for both small and large object detection (APs +1.8% and APl +5.1%). These results demonstrate that multi-level collaborative calibration effectively integrates fine-grained details from shallow layers with rich semantic information from deeper layers, enabling progressive cross-scale feature enhancement.
In summary, the ablation study quantitatively verifies the effectiveness of CFCI across multiple pyramid levels, particularly within hierarchically contiguous feature pyramids, and further substantiates the rationality of the proposed progressive cross-scale calibration design.
4.4.2. Effectiveness of Bias-Guided Cross-Scale Spatial Alignment Module
Table 5 presents the detection performance when deploying the Bias-Guided Cross-Scale Spatial Alignment (BCSA) module at different levels of the feature pyramid, aiming to analyze its effectiveness in spatial alignment across various scales and semantic hierarchies. All experiments were conducted using the same backbone network and training configuration.
Overall, introducing BCSA at different pyramid levels consistently improves performance over the baseline, indicating that explicitly modeling cross-scale spatial offsets positively contributes to enhancing feature alignment. When BCSA is applied only to the shallow P3 layer, the overall AP gain is limited (+0.3%), although slight improvements are observed in both APs and APl, suggesting that single-layer spatial alignment provides only modest gains in scale robustness.
Performance improvements become more pronounced when BCSA is applied across multiple levels. Specifically, introducing BCSA at adjacent levels P3 and P4 results in a +1.4% increase in overall AP, with a notable gain at AP75 (+1.3%), demonstrating that cross-scale spatial alignment effectively enhances localization accuracy under high-IoU thresholds. In contrast, deploying BCSA at non-adjacent levels (P3 and P5) or exclusively at middle-to-high levels (P4 and P5) still improves overall AP, but the gains are slightly lower than those of the adjacent-level configuration, suggesting that large semantic gaps may reduce the stability of spatial offset modeling.
When BCSA is simultaneously applied to P3, P4, and P5, the model achieves optimal performance, with overall AP increasing to 35.1% (+2.0% over the baseline) and significant gains observed across all object scales, particularly for small objects (APs +2.5%). This indicates that multi-level collaborative spatial alignment effectively mitigates cumulative spatial offsets in cross-scale features, thereby enhancing the localization accuracy and robustness of fused features.
In summary, the ablation study demonstrates that BCSA exhibits stronger spatial alignment capability across multiple pyramid levels, especially within hierarchically contiguous feature pyramids, and experimentally validates the rationality of its design for alleviating cross-scale spatial offsets and improving high-precision localization.
4.4.3. Joint Effect of Frequency Calibration and Spatial Alignment
Table 6 presents the ablation results of jointly applying the cross-scale frequency compensation and spatial alignment modules, where “×” and “√” indicate whether each module is adopted. The results indicate that both mechanisms independently improve detection performance: when only frequency compensation is introduced, the overall AP increases to 34.7%, with small-object performance rising to 23.9%; when only spatial alignment is applied, the overall AP further improves to 35.1%, and APs reaches 24.6%, demonstrating that explicitly modeling cross-scale spatial offsets is particularly effective for small-object localization.
When both modules are enabled simultaneously, the model achieves optimal performance, with overall AP increasing to 36.2% and APs to 25.2%, showing additional gains over the single-module configurations. This indicates that frequency-aware compensation and spatial alignment exhibit strong complementarity in cross-scale feature modeling: the former focuses on enhancing discriminative representations under scale variations, while the latter effectively mitigates spatial misalignment during cross-scale feature fusion.
However, simply stacking these two modules within a conventional FPN is insufficient to fully exploit their synergistic potential. Due to the lack of constraints in the feature injection process and inconsistent cross-scale interactions, feature misalignment and redundant fusion may still occur, particularly affecting small-object detection. To address this, we propose CFBA-FPN, a customized feature pyramid structure designed for the collaborative modeling of frequency compensation and spatial alignment. In CFBA-FPN, frequency-calibrated features are progressively injected into each pyramid level under geometric-aware and gating-controlled guidance, explicitly coordinating frequency enhancement and spatial alignment at the structural level.
As shown in Table 6, CFBA-FPN achieves substantial improvements in both overall AP and APs compared to other configurations. Compared with the simple combination of frequency compensation and spatial alignment, CFBA-FPN consistently enhances small-object detection performance, fully validating the necessity and effectiveness of structured feature propagation and coordinated fusion to unleash the complementarity of the two mechanisms.
4.4.4. Ablation Study of High-Frequency Residual and Existence-Aware Gating in CFCI
Table 7 presents an ablation analysis of the High-Frequency Residual (HFR) and the existence-aware gating within CFCI. When HFR is removed, the overall AP drops by 0.7%, with notable decreases in AP75 and APs, indicating that high-frequency residuals play a critical role in modeling fine-grained structures and achieving high-precision localization.
Further removing the existence-aware gating results in a more significant performance degradation, with overall AP decreasing to 35.0% and APs dropping by 1.4%. This demonstrates that explicitly modeling valid regions and suppressing irrelevant responses is essential for stable feature injection during cross-scale frequency compensation.
These results collectively validate the complementarity and necessity of HFR and the existence-aware gating within CFCI, whose synergistic effect underpins the model’s performance advantages in small-object detection and at high-IoU thresholds.
4.4.5. Ablation Study of Confidence Gating
Table 8 presents an ablation analysis of the confidence gating mechanism within BCSA. When the confidence gating is removed, the overall AP drops from 36.2% to 35.4%, with AP75 and APs decreasing by 1.1% and 1.3%, respectively, while AP50 remains largely unchanged. This indicates that confidence gating primarily affects high-precision localization and small-object detection performance.
These results suggest that incorporating a confidence-aware gating mechanism during cross-scale spatial alignment helps suppress the influence of low-confidence or noisy offsets on feature fusion, thereby enhancing the stability and reliability of spatial alignment. Overall, the confidence gating plays a critical role in fully leveraging BCSA’s advantages in fine-grained object localization and under high-IoU thresholds.
4.4.6. Ablation Study of Mother Wavelet in CFCI
As shown in Table 9, the choice of mother wavelet has only a minor influence on the detection performance of CFBA-FPN. The Haar wavelet achieves the best results (mAP: 78.6%, AP50: 94.1%), while db2 and sym2 show slight performance decreases, with overall variations remaining small.
These results indicate that the proposed framework is not highly sensitive to the specific wavelet type, since DWT in our design mainly serves as a frequency decoupling operator, and the primary performance gains come from the subsequent semantic gating and frequency calibration mechanisms.
4.4.7. Computational Cost Analysis
Table 10 compares the computational cost and inference speed of CFBA-FPN with the baseline and other popular detectors. CFBA-FPN introduces a modest increase in parameters and FLOPs compared to the baseline (Params: 27.82 M vs. 25.33 M; GFLOPs: 71.83 vs. 68.25), while maintaining a relatively high inference speed (49.9 FPS). In contrast, Cascade R-CNN and RetinaNet incur substantially higher computational costs and lower FPS, demonstrating that CFBA-FPN achieves a favorable trade-off between detection performance and efficiency.
4.4.8. Analysis of Computational Cost and Performance
Table 11 compares the precision gain per unit of computational cost (GFLOPs) among different methods on small-object remote sensing datasets. CFBA-FPN achieves an mAP increase of 4.8 with an additional 9.11 GFLOPs, corresponding to a GFLOPs-per-mAP ratio of 1.9, which is lower than most other methods and indicates a more efficient trade-off between accuracy improvement and computational overhead. In particular, while methods like GLSDet achieve similar absolute mAP gains, their unit-cost efficiency is much lower (5.66 GFLOPs/mAP), highlighting that CFBA-FPN delivers higher performance gains relative to the additional computation required.
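The unit-cost figure quoted above is simply the ratio of added compute to accuracy gain; a trivial check with the numbers from the text:

```python
def gflops_per_map_point(extra_gflops: float, map_gain: float) -> float:
    """Additional GFLOPs paid per point of mAP improvement (lower is better)."""
    return extra_gflops / map_gain

# CFBA-FPN figures from the comparison: +9.11 GFLOPs for +4.8 mAP.
print(round(gflops_per_map_point(9.11, 4.8), 2))  # ≈ 1.9
```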