1. Introduction
Object detection in Unmanned Aerial Vehicle (UAV) imagery has become increasingly important across diverse applications including precision agriculture [1], traffic monitoring [2], disaster assessment [3], and infrastructure inspection [4,5]. However, the aerial perspective introduces substantial technical challenges. Objects of interest typically occupy a small fraction of the image area, while atmospheric effects such as haze and turbulence, variable illumination conditions, and dense spatial distributions create additional complexity for accurate detection. Recent advances in general object detection have demonstrated the effectiveness of both CNN-based and transformer-based architectures across a wide range of visual recognition tasks [6], providing a solid foundation for addressing these aerial-specific challenges.
Contemporary object detection frameworks can be categorized into anchor-based, region-based, and transformer-based approaches. Anchor-based single-stage detectors, such as SSD [7] and the YOLO series [8,9,10,11,12,13,14,15], achieve real-time inference speeds through predefined anchor boxes combined with non-maximum suppression (NMS). However, several factors limit their effectiveness in UAV scenarios. The fixed geometric configurations of anchor boxes may not adequately match the scale distribution characteristics of aerial imagery, where object size variance substantially exceeds that of ground-level datasets. Additionally, NMS post-processing operates sequentially, preventing full utilization of parallel computing architectures, and threshold-based suppression can inadvertently eliminate valid detections in densely populated regions [16].
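The suppression behavior described above can be illustrated with a minimal greedy NMS sketch. The box coordinates and scores are made-up illustrative values, and the helper names `iou` and `greedy_nms` are hypothetical; this is not the implementation used by any of the cited detectors.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:  # sequential: each decision depends on the boxes kept so far
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two distinct but heavily overlapping objects (e.g., adjacent pedestrians)
# plus one isolated object. All values are illustrative.
boxes = [(10, 10, 30, 50), (16, 10, 36, 50), (100, 100, 120, 140)]
scores = [0.90, 0.85, 0.80]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

With a threshold of 0.5 the second box (IoU of about 0.54 with the first) is discarded even though it may correspond to a separate object, while a looser threshold of 0.65 keeps all three.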
Two-stage region-based methods, exemplified by the R-CNN family [17], employ iterative refinement to improve localization precision. These approaches typically require substantial computational resources, which constrains deployment to server-based platforms rather than edge devices. The resulting latency from network transmission becomes prohibitive for time-sensitive applications such as autonomous navigation.
Transformer-based detection frameworks have recently demonstrated the feasibility of end-to-end learning without hand-crafted components. DETR [18] introduced set prediction using global self-attention mechanisms, eliminating the need for anchor boxes and NMS in its architectural design. However, the quadratic complexity of self-attention restricts practical input resolutions. Deformable DETR [19] addressed this limitation through sparse sampling at learnable offsets, while DN-DETR [20] accelerated convergence by incorporating denoising training mechanisms. RT-DETR [21] achieved competitive inference speeds through hybrid CNN–Transformer encoders and introduced IoU-aware query selection to reduce reliance on NMS, representing a notable advancement in transformer-based detection efficiency. It is worth noting that while DETR-family architectures are designed to operate without NMS, in practice a lightweight NMS post-processing step is often applied in dense detection scenarios (such as UAV imagery) to suppress residual duplicate predictions, as we adopt in our experimental protocol for fair comparison across all evaluated methods.
Despite these advances, several observations suggest opportunities for improvement in the context of UAV-based small object detection. Analysis of gradient flow during training reveals substantial disparities across pyramid levels in existing multi-scale fusion strategies, where features at different resolution levels typically receive uniform weighting. This phenomenon correlates with observable performance gaps between small- and large-object detection accuracies. Additionally, standard IoU-based loss functions apply identical penalty curves regardless of object scale or detection difficulty. For small or partially occluded objects where intersection-over-union values tend to be lower, gradient magnitudes can decrease substantially, potentially impeding optimization convergence for challenging samples. Furthermore, model compactness remains critical for deployment on resource-constrained UAV platforms, necessitating architectures that balance detection performance with parameter efficiency. However, it should be recognized that reducing theoretical computational metrics (parameters and GFLOPs) does not always translate directly into proportional inference speed improvements, as actual throughput depends on hardware-specific factors including memory access patterns, operator fusion, and parallelization efficiency.
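The scale sensitivity of IoU mentioned above can be made concrete with a small worked example. It assumes square boxes and a purely horizontal localization error; `shifted_iou` is a hypothetical helper introduced only for illustration.

```python
def shifted_iou(side, shift):
    """IoU between a side x side box and a copy shifted horizontally by `shift` pixels."""
    overlap_width = max(0.0, side - shift)
    intersection = overlap_width * side
    union = 2 * side * side - intersection
    return intersection / union


small = shifted_iou(16, 4)    # 16 x 16 object, 4-pixel localization error
large = shifted_iou(128, 4)   # 128 x 128 object, same 4-pixel error
```

The same 4-pixel error yields an IoU of 0.6 for the 16 × 16 box but roughly 0.94 for the 128 × 128 box, which is why a scale-agnostic IoU penalty treats small objects more harshly.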
Recent works have explored modifications to the RT-DETR architecture. Li et al. [22] incorporated self-attention upsampling modules and refined the DIoU loss function. Lin et al. [23] replaced the ResNet backbone with MobileNetV2 to reduce parameters, while He et al. [24] proposed a scale adaptive feature pyramid network to enhance multi-scale object detection performance. Wu et al. [25] substituted RepViT as the backbone and integrated HiLo attention mechanisms. These approaches primarily focus on individual components (the backbone architecture, neck design, or loss function) rather than coordinated treatment of scale-related challenges across the detection pipeline.
Addressing these observations regarding model compactness and scale-adaptive optimization, this work proposes CBW-DETR, a framework that integrates architectural efficiency with scale-aware mechanisms throughout the detection pipeline. The framework encompasses three principal components:
To reduce model complexity while preserving feature diversity, a restructured residual block architecture partitions channel processing into selective pathways. Parallel branches with different receptive field configurations capture multi-scale contextual information, while a subset of channels bypasses processing through identity connections, reducing parameter count and theoretical computation without compromising representational capacity.
Addressing gradient flow imbalance across pyramid levels, a bidirectional feature pyramid network introduces learnable compensation mechanisms, weighted according to feature map spatial dimensions. This formulation enables differential treatment of features during both forward propagation and gradient backpropagation, facilitating balanced optimization across scales.
For scale-adaptive optimization, the framework incorporates a loss formulation that implements category-specific statistical tracking. Through exponential moving averages and gradient modulation based on scale-dependent loss distributions, the mechanism amplifies optimization signals for under-performing categories while stabilizing convergence for well-optimized samples.
Experimental validation on VisDrone2019 and DOTA datasets demonstrates improvements in detection accuracy alongside significant reductions in model parameters (28.1%) and theoretical computation (18.0%). These reductions in model footprint come at a moderate cost in inference throughput due to memory-access-intensive operations in the proposed modules. Ablation studies quantify the individual and combined effects of the proposed components across different object scale categories, including a detailed analysis of inference latency contributions from each module.
3. Methodology
3.1. Framework Overview
The proposed CBW-DETR framework addresses model complexity reduction and multi-scale feature representation challenges in UAV-based object detection through three coordinated architectural innovations. Building upon RT-DETR as the baseline detector, the framework integrates an adaptive receptive field feature extraction module with multi-resolution wavelet-domain enhancement, a cross-scale attention-based feature pyramid with deformable fusion, and an uncertainty-aware adaptive loss with dynamic gradient modulation.
As illustrated in Figure 1, the framework follows a design philosophy centered on dynamic adaptability rather than static architectural choices. This adaptability manifests at three levels of the detection pipeline. At the feature extraction level, ContextGFE learns optimal receptive field combinations through gating mechanisms while incorporating multi-resolution wavelet-domain features to capture both local spatial patterns and scale-dependent edge characteristics. At the feature fusion level, SAFPN employs spatial-variant compensation factors and cross-scale attention to enable adaptive feature aggregation that accounts for gradient flow differences across pyramid levels. At the supervision level, ASIoU integrates uncertainty estimation with dynamic gradient scaling to achieve robust optimization across object scales. The coordinated design ensures that modifications at one level complement enhancements at other levels, creating synergistic improvements throughout the detection pipeline rather than isolated optimizations. It should be noted that while the proposed modules are designed to reduce model parameters and theoretical computation (GFLOPs), certain operations (such as multi-branch parallel convolutions, wavelet transforms involving spatial data reshuffling, and deformable convolutions with irregular memory access patterns) are memory-access-intensive rather than compute-bound, which may result in actual inference throughput that does not scale proportionally with theoretical computation reductions.
3.2. Baseline Architecture
RT-DETR [21] provides the foundation through its transformer architecture designed for real-time object detection, whose overall structure is illustrated in Figure 2. The architecture comprises three principal components working in sequence. First, a ResNet-18 backbone with BasicBlock residual modules processes input images to generate multi-scale feature representations {S3, S4, S5} at spatial strides of {8, 16, 32} pixels. Second, a hybrid encoder processes these features through Attention-based Intra-scale Feature Interaction (AIFI) that refines features within each scale, and CNN-based Cross-scale Feature Fusion (CCFF) that aggregates information across different scales. Third, a transformer decoder with IoU-aware query selection generates final object predictions through iterative refinement. While RT-DETR was architecturally designed to reduce reliance on NMS through its IoU-aware query selection mechanism, in dense detection scenarios characteristic of UAV imagery, a lightweight NMS post-processing step is commonly applied in practice to suppress residual duplicate predictions from the fixed set of object queries. In our experimental protocol, NMS is uniformly applied to all evaluated methods to ensure fair comparison (see Section 4.1 for details).
While this architecture achieves strong performance on standard benchmarks, it employs several fixed computational patterns that limit adaptability. The backbone applies uniform convolutions across all feature channels without distinguishing between redundant and informative channels. The feature fusion mechanism uses simple element-wise addition with uniform weights, ignoring scale-dependent characteristics of gradient flow during backpropagation. The supervision employs scale-agnostic loss functions that apply identical optimization pressure regardless of object size or detection difficulty.
Our modifications introduce adaptive mechanisms at each of these levels to enable data-driven optimization. Rather than relying on hand-crafted architectural choices, the proposed components learn to allocate representational resources and optimization signals based on the specific characteristics of aerial imagery.
3.3. Context-Guided Feature Extraction with Adaptive Receptive Fields
The ContextGFE module replaces standard BasicBlock residual modules in backbone stages S3, S4, and S5 to address computational redundancy while maintaining representational capacity. The key insight motivating this design is that aerial imagery exhibits distinct patterns at multiple spatial scales simultaneously. Small objects such as pedestrians or vehicles require fine-grained local features to capture subtle appearance details, while cluttered backgrounds benefit from broader contextual information to distinguish foreground from background. Rather than using fixed kernel sizes that commit to a single scale, ContextGFE dynamically allocates representational resources across multiple receptive fields based on input characteristics.
Figure 3 illustrates the complete module architecture. The design follows a three-stage pipeline: dimensionality reduction to eliminate redundant channels, parallel multi-scale processing with dynamic weighting, and attention-based feature refinement.
Given input features with C channels, the module first reduces channel dimensionality through a lightweight projection: a convolution that maps the C input channels to a smaller number of channels, followed by a parametric ReLU activation. This reduction serves two purposes: it eliminates redundant information present in highly correlated feature channels, and it decreases the computational burden of subsequent multi-scale processing. The use of parametric ReLU allows the network to learn optimal negative slope values, providing more flexibility than standard ReLU.
Rather than committing to fixed kernel sizes, the module employs a lightweight gating network to dynamically determine the importance of different receptive field sizes. The gate produces softmax-normalized weights for four kernel configurations: 3 × 3 for local details, 5 × 5 for mid-range patterns, 7 × 7 for broader context, and a dilated convolution with dilation rate 3 for capturing long-range dependencies efficiently. The softmax operation ensures these weights form a valid probability distribution, allowing the network to allocate computational emphasis across scales in a principled manner. We note that the four parallel convolution branches, while reducing total GFLOPs through the preceding channel reduction, introduce fragmented memory access operations that limit GPU parallelization efficiency, an issue we quantitatively analyze in Section 4.4.
The multi-scale spatial features are computed through a weighted combination:
This formulation implements a soft attention mechanism over receptive field sizes. Rather than hard-selecting a single scale, the module blends features from all scales with learned weights, providing greater flexibility and smoother gradient flow.
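A minimal sketch of this soft selection over receptive fields is given below. It operates on plain Python lists rather than feature tensors, and the names `softmax` and `blend_branches`, as well as the way the gate logits are obtained, are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_branches(branch_outputs, gate_logits):
    """Weighted sum of per-branch feature vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    length = len(branch_outputs[0])
    return [
        sum(w * out[j] for w, out in zip(weights, branch_outputs))
        for j in range(length)
    ]


# Four branches (e.g., 3x3, 5x5, 7x7, dilated), each producing a 2-element feature.
branches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
uniform = blend_branches(branches, [0.0, 0.0, 0.0, 0.0])   # equal weights
peaked = blend_branches(branches, [10.0, 0.0, 0.0, 0.0])   # first branch dominates
```

Equal logits reduce the blend to a plain average of the branches, whereas a dominant logit approaches hard selection of one branch while remaining differentiable.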
To complement spatial convolutions with multi-resolution edge representation, a parallel branch applies the Discrete Wavelet Transform (DWT) [39] to decompose features into hierarchical sub-bands. Using Haar wavelets, the DWT decomposes the input feature map into one approximation sub-band and three detail sub-bands. The approximation sub-band captures low-frequency structural information, while the three detail sub-bands explicitly encode horizontal, vertical, and diagonal edge orientations, respectively. This spatially localized multi-resolution decomposition is particularly beneficial for small-object boundary delineation in UAV imagery, as it directly models edge structures at multiple scales rather than relying solely on global frequency statistics. However, the DWT and inverse DWT operations involve spatial data reshuffling across feature map dimensions, which constitutes a memory-bandwidth-bound operation rather than a compute-bound one, contributing to the gap between theoretical GFLOP reduction and actual inference throughput. The learnable filter applied to the sub-band coefficients is implemented as a two-layer bottleneck network (with a filter reduction ratio of 2; see Section 4.1). This bottleneck architecture encourages the network to learn compact wavelet-domain representations by selectively amplifying informative sub-band coefficients while suppressing noise-dominated components.
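For readers unfamiliar with the decomposition, a one-level 2D Haar transform can be sketched in a few lines. This is a generic orthonormal Haar implementation on a single-channel list-of-lists, not the module's code; the function names and the labeling of the three detail outputs as horizontal, vertical, and diagonal follow one common convention and are assumptions here.

```python
def haar_dwt2(x):
    """One-level 2D Haar DWT of a list-of-lists with even height and width."""
    rows, cols = len(x) // 2, len(x[0]) // 2
    approx = [[0.0] * cols for _ in range(rows)]
    horiz = [[0.0] * cols for _ in range(rows)]
    vert = [[0.0] * cols for _ in range(rows)]
    diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            approx[i][j] = (a + b + c + d) / 2.0  # low-frequency content
            horiz[i][j] = (a + b - c - d) / 2.0   # top vs. bottom: horizontal edges
            vert[i][j] = (a - b + c - d) / 2.0    # left vs. right: vertical edges
            diag[i][j] = (a - b - c + d) / 2.0    # diagonal structure
    return approx, horiz, vert, diag


def haar_idwt2(approx, horiz, vert, diag):
    """Inverse of haar_dwt2; reconstructs the original 2D array."""
    rows, cols = len(approx), len(approx[0])
    x = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, d = approx[i][j], horiz[i][j], vert[i][j], diag[i][j]
            x[2 * i][2 * j] = (s + h + v + d) / 2.0
            x[2 * i][2 * j + 1] = (s + h - v - d) / 2.0
            x[2 * i + 1][2 * j] = (s - h + v - d) / 2.0
            x[2 * i + 1][2 * j + 1] = (s - h - v + d) / 2.0
    return x


# A 2x2 patch containing a horizontal edge (bright top row, dark bottom row).
edge_bands = haar_dwt2([[1.0, 1.0], [0.0, 0.0]])
```

Each output coefficient reads a 2 × 2 neighborhood with stride 2 and the inverse scatters values back, which is the kind of data reshuffling the text identifies as memory-bound.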
The spatial and wavelet-domain features are concatenated and projected back to the original channel dimension by a convolution that integrates information from both domains. This fusion enables the network to leverage complementary information where spatial features provide localization cues while wavelet features contribute multi-resolution edge context that aids in distinguishing small objects from background clutter.
The fused features undergo refinement through a joint spatial-channel attention mechanism. Unlike standard channel attention that operates solely on global statistics, this mechanism considers both spatial and channel dimensions. The spatial attention component identifies important spatial locations by applying max pooling and mean pooling along the channel dimension. The concatenation of max-pooled and mean-pooled features provides complementary information about salient spatial regions. The 7 × 7 convolution kernel allows the attention mechanism to consider local context when determining spatial importance.
In parallel, the channel attention component identifies informative feature channels using global average pooling (GAP), which aggregates spatial information into channel-wise statistics. The squeeze–excitation mechanism with reduction ratio 16 creates a bottleneck that forces the network to learn compact channel importance representations.
The final output integrates both attention mechanisms through multiplicative interaction:
where ⊗ denotes the outer product and ⊙ represents element-wise multiplication. The outer product creates a full spatial-channel attention map that can model joint dependencies. The residual connection ensures that the module can preserve useful information from the input when the transformed features provide limited additional value.
3.4. Scale-Aware Feature Pyramid with Cross-Scale Attention
Standard feature pyramid networks combine features from different pyramid levels through uniform fusion operations, typically element-wise addition or concatenation with fixed weights. However, this uniform treatment fails to account for two important characteristics. First, features at different pyramid levels exhibit distinct statistical properties where higher-level features tend to be more semantically rich but spatially coarse, while lower-level features provide precise localization with limited semantic content. Second, gradient flow during backpropagation naturally favors certain pyramid levels based on their spatial resolution, creating imbalances that can bias optimization.
SAFPN addresses these issues through explicit compensation mechanisms that account for scale-dependent characteristics and attention-based fusion that enables adaptive information aggregation across pyramid levels, as illustrated in Figure 4.
The design employs spatial-variant compensation factors that adapt to local feature characteristics. Rather than applying global scalar weights uniformly across all spatial locations, the module learns pixel-wise compensation:
where
is a learned spatial map specific to pyramid level
i. The denominator provides scale-dependent normalization that ensures the compensation magnitude is inversely related to feature map size. Higher-resolution features receive stronger compensation to balance their naturally weaker gradient signals. The learnable component
is computed through:
This formulation allows the network to determine local compensation strength based on the characteristics of features at adjacent pyramid levels.
Before fusion, the module employs cross-scale spatial attention to model the relevance of features across different pyramid levels. This attention mechanism uses a lightweight query–key–value formulation:
where
. The projection matrices
with
reduce computational cost while maintaining representational capacity. This attention mechanism enables the network to dynamically determine which spatial locations in the coarser feature map are most relevant for refining each location in the finer feature map.
The top-down pathway propagates semantic information from coarser to finer scales. At each pyramid level
, features are fused through:
where the compensation factor
modulates the combined features before depthwise convolution refines them. The learnable weights
and
allow the network to balance contributions from upsampled features and original backbone features.
The bottom-up pathway performs reverse fusion to incorporate fine-grained details into higher-level features. A key challenge in multi-scale fusion is spatial misalignment, where objects may occupy different relative positions across pyramid levels due to downsampling operations. To address this, the module employs deformable convolution with learned offsets:
The offset field
is predicted based on both original backbone features and top-down features, allowing the network to learn optimal spatial alignment patterns. The deformable convolution then applies:
where DeformConv applies convolution at adaptively determined spatial locations. Note that while deformable convolution does not significantly increase theoretical GFLOPs, its irregular memory access patterns—sampling features at learned offset positions rather than regular grid locations—are less amenable to GPU parallelization, contributing to inference latency beyond what GFLOPs alone would suggest. This formulation enables three-way fusion incorporating aligned top-down features, compensated backbone features, and downsampled features from the previous bottom-up level.
For the highest pyramid level
, which lacks coarser features for bottom-up fusion:
All fusion weights are initialized to 1.0 and optimized end-to-end during training. An analysis of the converged values of these weights and their implications for scale-specific feature contribution is provided in Section 4.4.
3.5. Adaptive Scale IoU Loss with Uncertainty Estimation
Standard IoU-based losses apply uniform optimization pressure to all detection samples regardless of their characteristics. However, objects of different scales exhibit distinct optimization dynamics, where smaller objects typically require more iterations to achieve accurate localization, while larger objects may converge rapidly. Additionally, prediction uncertainty varies across samples, with some detections exhibiting high confidence while others remain ambiguous. Applying uniform loss weighting to such heterogeneous samples can result in inefficient optimization.
ASIoU introduces a principled approach to adaptive supervision that modulates optimization intensity based on two factors: the relative difficulty of each sample within its scale category, and the uncertainty of the prediction.
The framework first partitions objects into scale categories based on ground-truth bounding box area: small (area below 32² pixels), medium (32² to 96² pixels), and large (above 96² pixels). These thresholds align with standard definitions used in benchmark datasets.
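The partition can be sketched as a small helper. The 32² and 96² pixel boundaries are the ones given for the scale-specific AP metrics in Section 4.1; that the loss uses exactly these values, and how an area falling exactly on a boundary is assigned, are assumptions made here. The name `scale_category` is hypothetical.

```python
SMALL_MAX_AREA = 32 ** 2    # 1024 pixels
MEDIUM_MAX_AREA = 96 ** 2   # 9216 pixels


def scale_category(width, height):
    """Assign a ground-truth box to a scale category by its pixel area."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:   # boundary handling is an assumption
        return "medium"
    return "large"


categories = [scale_category(w, h) for w, h in [(20, 20), (50, 50), (100, 100)]]
```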
For each predicted bounding box, the module estimates prediction uncertainty through Monte Carlo dropout. During training, the module performs
T stochastic forward passes with dropout enabled:
where
represents the bounding box prediction in the
t-th forward pass, and
is the mean prediction. The variance
quantifies prediction fluctuation across different dropout masks. High variance indicates uncertain predictions while low variance suggests confident predictions.
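The statistics described here amount to a per-coordinate mean and variance over the T stochastic passes. The sketch below assumes the T predictions are already available as tuples of box coordinates and uses the population variance (division by T); both the helper name `mc_statistics` and that normalization are assumptions.

```python
def mc_statistics(predictions):
    """Per-coordinate mean and variance over T stochastic box predictions."""
    t = len(predictions)
    dims = len(predictions[0])
    mean = [sum(p[d] for p in predictions) / t for d in range(dims)]
    var = [
        sum((p[d] - mean[d]) ** 2 for p in predictions) / t
        for d in range(dims)
    ]
    return mean, var


# Five passes (T = 5, matching the setting reported in Section 4.1) for one box
# given as (cx, cy, w, h); the numbers themselves are illustrative.
passes = [
    (50.0, 40.0, 12.0, 8.0),
    (51.0, 40.0, 12.0, 8.0),
    (49.0, 40.0, 12.0, 8.0),
    (50.0, 40.0, 12.0, 8.0),
    (50.0, 40.0, 12.0, 8.0),
]
mean_box, var_box = mc_statistics(passes)
```

In this example only the center x-coordinate fluctuates across passes, so it is the only coordinate with non-zero variance.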
Rather than tracking loss statistics with uniform weights, the module employs uncertainty-weighted exponential moving averages for each scale category:
The uncertainty weight attenuates the influence of unreliable samples:
where the temperature parameter
controls sensitivity to uncertainty. Predictions with high uncertainty receive lower weights in the moving average, preventing noisy samples from distorting category-specific statistics.
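One way such an uncertainty-weighted moving average could be written is shown below. The exponential form of the weight, `exp(-variance / temperature)`, is an assumed functional form chosen for illustration (the text only states that higher uncertainty yields lower weight); the default momentum 0.1 and temperature 0.5 are the small-object settings listed in Section 4.1, and the function names are hypothetical.

```python
import math


def uncertainty_weight(variance, temperature=0.5):
    """Map prediction variance to a weight in (0, 1]; the exp form is an assumption."""
    return math.exp(-variance / temperature)


def update_ema(ema, sample_loss, variance, momentum=0.1, temperature=0.5):
    """EMA update whose step size shrinks for uncertain predictions."""
    step = momentum * uncertainty_weight(variance, temperature)
    return (1.0 - step) * ema + step * sample_loss


confident = update_ema(ema=1.0, sample_loss=2.0, variance=0.0)   # full step
uncertain = update_ema(ema=1.0, sample_loss=2.0, variance=2.0)   # damped step
```

A confident sample moves the category average by the full momentum step, while an uncertain sample with the same loss moves it far less.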
The core gradient modulation mechanism scales loss contributions based on the ratio between current sample loss and the category-specific moving average:
When a sample’s loss exceeds the category average, the ratio exceeds 1.0 and the power-law relationship amplifies gradients, focusing optimization on difficult cases. The exponent
adapts over training epochs:
This formulation starts with moderate amplification and gradually increases emphasis on hard samples as training progresses.
The distance-based weighting component provides spatial regularization that adapts to object scale:
where
and
denote predicted and ground-truth box centers. Normalizing by ground-truth area
ensures that the effective distance threshold scales appropriately. Larger objects tolerate larger center offsets proportionally.
An auxiliary aspect ratio prediction task provides additional regularization:
Operating in log-space ensures symmetry with respect to aspect ratio inversions and provides stable gradients.
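The symmetry property is easy to verify numerically. The sketch below uses an absolute difference of log aspect ratios; the choice of absolute difference (rather than, say, a squared or smooth-L1 penalty) and the name `log_aspect_ratio_loss` are assumptions.

```python
import math


def log_aspect_ratio_loss(pred_w, pred_h, gt_w, gt_h):
    """Absolute difference of log aspect ratios (assumed penalty form)."""
    return abs(math.log(pred_w / pred_h) - math.log(gt_w / gt_h))


# Ground truth is square; one prediction is twice as wide, the other twice as tall.
too_wide = log_aspect_ratio_loss(2.0, 1.0, 1.0, 1.0)
too_tall = log_aspect_ratio_loss(1.0, 2.0, 1.0, 1.0)

# The same comparison on raw ratios is asymmetric: |2 - 1| = 1 but |0.5 - 1| = 0.5.
raw_wide = abs(2.0 / 1.0 - 1.0)
raw_tall = abs(1.0 / 2.0 - 1.0)
```

In log-space both errors cost ln 2, whereas a penalty on the raw ratio would charge the too-wide prediction twice as much as the too-tall one.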
The complete ASIoU loss integrates all components:
where the multiplicative combination of
and
enables joint consideration of relative difficulty and spatial quality.
3.6. Framework Integration
The complete CBW-DETR framework integrates the three proposed components within the RT-DETR architecture through coordinated modifications at different pipeline stages. The backbone network employs ContextGFE modules at stages S3, S4, and S5, replacing standard BasicBlock residual modules. These adaptive feature extraction modules process input images to generate multi-scale feature representations with reduced parameter count and theoretical computation while maintaining representational capacity through dynamic receptive field selection and multi-resolution wavelet-domain enhancement.
SAFPN receives these multi-scale backbone features and processes them through bidirectional fusion pathways incorporating spatial-adaptive compensation and cross-scale attention. The top-down pathway enriches features with semantic information from coarser scales, while the bottom-up pathway incorporates fine-grained details from higher-resolution features. The output pyramid features exhibit balanced gradient flow characteristics and effective integration of information across scales.
These refined pyramid features feed into the transformer decoder, which maintains the standard RT-DETR architecture with IoU-aware query selection and iterative refinement through six transformer layers. The decoder generates object predictions that are supervised through the ASIoU loss, which provides scale-adaptive optimization pressure accounting for both relative sample difficulty and prediction uncertainty.
The training objective combines classification and localization objectives:
where focal loss addresses class imbalance in the classification task. The loss is computed only for matched query–target pairs determined through Hungarian matching that finds the optimal bipartite assignment minimizing combined classification and localization costs.
4. Experiments
4.1. Experimental Setup
All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB memory), Intel Core i9-10900K CPU, and 64 GB RAM. The software environment consisted of Ubuntu 20.04 LTS, CUDA 12.4, cuDNN 8.9.0, PyTorch 2.2.0, and Python 3.9.13. Models were implemented using the MMDetection [40] toolbox, version 3.1.0, for standardized training and evaluation.
Models were trained from scratch using the AdamW optimizer [41] with momentum parameters of 0.9 and 0.999, a weight decay of 0.0001, and a gradient clipping threshold of 0.1. The learning rate followed cosine annealing from the initial value 0.0001 to a minimum of 0.000001 with 5-epoch linear warm-up. Training ran for 400 epochs with batch size 8, requiring approximately 36 h per model. Input images were resized to 640 × 640 pixels with aspect-ratio-preserving padding (value 114). Data augmentation included random horizontal flipping (probability 0.5), random scaling (range [0.8, 1.2]), HSV augmentation (hue ± 0.015, saturation [0.3, 1.7], value [0.6, 1.4]), mosaic augmentation (probability 0.5 for first 350 epochs), mixup (probability 0.1, alpha 0.5), random translation (maximum 0.1 × image size), and random rotation (±10 degrees). Images were normalized using ImageNet statistics. No augmentation was applied during validation and testing.
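The learning-rate schedule can be reproduced with a short function. The endpoints (0.0001 and 0.000001), the 5-epoch warm-up, and the 400-epoch length come from the text; the per-epoch (rather than per-iteration) granularity and the warm-up starting value are assumptions, and `learning_rate` is a hypothetical name rather than a toolbox API.

```python
import math


def learning_rate(epoch, total_epochs=400, warmup_epochs=5,
                  base_lr=1e-4, min_lr=1e-6):
    """Linear warm-up followed by cosine annealing, evaluated per epoch."""
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warm-up epochs (assumed start).
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(401)]
```

The schedule peaks at the initial value once warm-up ends and decays monotonically to the minimum at the final epoch.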
The total loss combined focal loss for classification (weight 1.0, focusing parameter 2.0, balance factor 0.25), L1 loss for box regression (weight 5.0), and ASIoU for localization (weight 2.0). For ASIoU, scale-specific parameters were: initial focusing exponents (small 2.0, medium 1.5, large 1.0), momentum factors (small 0.1, medium 0.15, large 0.2), uncertainty temperature 0.5, spatial bandwidth (small 0.5, medium 0.3, large 0.2), and aspect ratio loss weight 0.5. Hungarian matching [18] used costs weighted at 2.0 for classification, 5.0 for L1 regression, and 2.0 for IoU. Positive assignment required an IoU above 0.7, while an IoU below 0.3 designated negative samples.
Performance was evaluated using COCO-style metrics [42]: AP at IoU thresholds 0.5 (AP50), 0.75 (AP75), and averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP); scale-specific AP for small (area less than 32² pixels), medium (32² to 96² pixels), and large (above 96² pixels) objects; model parameters (millions); computational cost (GFLOPs at 640 × 640 input); model size (MB); inference speed (FPS on RTX 3090 with batch size 1); and latency (milliseconds per image). Efficiency metrics were averaged over 1000 iterations after 100 warm-up iterations. All reported FPS values include the complete inference pipeline encompassing model forward pass, confidence thresholding, and NMS post-processing, ensuring fair and consistent timing across all evaluated methods.
For ContextGFE, channel partition used 8 groups with kernel sizes {3, 5, 7} and dilation rate 3. The wavelet-domain branch employed one-level Haar wavelet decomposition with a filter reduction ratio of 2, while the attention reduction ratio was 16 with a 7 × 7 spatial kernel. For SAFPN, the cross-scale attention projection dimension was
, the compensation stability constant was 0.000001, and deformable convolution [26] sampled 9 offset points. All fusion weights were initialized to 1.0. For uncertainty estimation, Monte Carlo dropout performed 5 forward passes with probability 0.1. The transformer decoder used 6 layers with 8 attention heads, feedforward dimension 1024, dropout 0.1, and 300 object queries.
While RT-DETR was architecturally designed to reduce reliance on NMS through IoU-aware query selection, in dense UAV detection scenarios with significant object overlap, a lightweight NMS post-processing step is commonly applied in practice to suppress residual duplicate predictions from the fixed set of 300 object queries. To ensure fair comparison across all detector families—including YOLO-series methods that inherently require NMS—we uniformly apply NMS with IoU threshold 0.65 and confidence threshold 0.01 to all evaluated methods throughout our experiments. We acknowledge that this departs from the purely end-to-end paradigm of DETR-family architectures. The fast inference mode used a confidence threshold of 0.25 with top-100 selection before NMS. For reproducibility, all random seeds were fixed at 42 (Python 3.8, NumPy 1.21, PyTorch 1.12, CUDA 11.3) with deterministic algorithms enabled, though this reduced training speed by approximately 5–8%.
4.2. Datasets
To experimentally verify the effectiveness of the CBW-DETR algorithm in the field of small object detection, we selected the VisDrone2019 and DOTA datasets for our experiments.
The VisDrone2019 dataset [29], created by the AISKYEYE team at Tianjin University, is one of the most representative benchmark datasets in the field of drone vision. The dataset collects visual data from scenarios in different cities under various environments and weather conditions. It contains 400 video clips and 8629 static images. The data covers 10 typical urban object classes, such as pedestrians, cars, buses, and bicycles. In the experiment, we divided the 8629 images into 6471 images for the training set, 1610 images for the testing set, and the remaining 548 images for the validation set, an approximate 7:2:1 ratio.
Additionally, we used the DOTA dataset [30] to verify the generalizability of the algorithm. The DOTA dataset is a large-scale dataset designed for object detection in remote sensing images, characterized by dense objects and significant scale variations. Since the original image sizes vary considerably (ranging from 800 × 800 to 4000 × 4000 pixels), direct use for training is difficult and affects the training results. Therefore, we cropped the original images into 1024 × 1024 pixel patches with a stride of 824 pixels, yielding an overlap ratio of approximately 19.5%. Patches containing no annotated objects were discarded. This produced a total of 21,046 images, of which 15,749 were used for training and 5297 for testing. During inference, predictions from overlapping patches were merged using NMS with an IoU threshold of 0.5 to eliminate duplicate detections at patch boundaries. All models evaluated on DOTA used horizontal bounding box (HBB) annotations and followed the identical cropping and merging protocol to ensure fair comparison.
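The cropping arithmetic can be checked with a short sketch. The patch size and stride are those stated above; how the last patch along an axis is placed (here shifted back so it ends at the image border) and how images smaller than one patch are handled are assumptions, and `tile_origins` is a hypothetical helper.

```python
def tile_origins(length, patch=1024, stride=824):
    """Start offsets of patches along one image axis."""
    if length <= patch:
        return [0]                       # image no larger than one patch (assumed handling)
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)   # shift last patch back to the border (assumed)
    return origins


overlap_pixels = 1024 - 824              # 200 pixels shared by neighboring patches
overlap_ratio = overlap_pixels / 1024    # 0.1953125, i.e., about 19.5%
origins_4000 = tile_origins(4000)
```

For a 4000-pixel axis this gives five patch positions, so a 4000 × 4000 image is covered by 25 patches before empty ones are discarded.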
4.3. Evaluation Metrics
Precision (P), Recall (R), Average Precision (AP), mean Average Precision (mAP), the number of model parameters (Parameters), and the computational cost (GFLOPs) were used as evaluation metrics in the experiments. The formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\,dR$$
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}$$
Here, True Positive (TP) is the number of positive samples correctly identified by the model; False Positive (FP) is the number of negative samples incorrectly classified as positive; and False Negative (FN) is the number of positive samples incorrectly classified as negative. N is the total number of object classes, and AP_i is the average precision of the i-th class. mAP50 is the mean of the per-class AP values computed at an Intersection over Union (IoU) threshold of 0.5; mAP50:95 follows the usual convention of averaging mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
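As a worked illustration of these metrics, the sketch below computes precision, recall, and an all-point interpolated AP from a ranked list of detections. The function names are hypothetical, and the interpolation scheme is an assumption: evaluation toolkits differ in detail (for example, sampling precision at 101 recall points), so the values need not match a specific toolkit exactly.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    is_tp:  flags for detections sorted by descending confidence,
            True where the detection matched a ground-truth box.
    num_gt: number of ground-truth boxes of this class.
    """
    if num_gt == 0:
        return 0.0
    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_tp:
        tp += bool(hit)
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the maximum precision at any higher recall
    # (the usual monotone envelope of the curve).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of per-class AP over the N classes."""
    return sum(ap_per_class) / len(ap_per_class)


# Three ranked detections (hit, miss, hit) against two ground-truth boxes.
ap = average_precision([True, False, True], num_gt=2)  # 5/6, about 0.833
```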
4.4. Comparison with State-of-the-Art Methods
To evaluate whether CBW-DETR achieves favorable accuracy–complexity trade-offs in small object detection, we compared it on the VisDrone2019 dataset with current mainstream algorithms, including YOLOv5, YOLOv8, YOLOv10, YOLOv11, Deformable DETR, and DINO, as well as with recently proposed improved algorithms. The experimental results are shown in Table 1.
The CBW-DETR algorithm achieved 64.7%, 49.9%, 51.5%, and 32.5% in Precision, Recall, mAP50, and mAP50:95 respectively, all of which were the highest among the compared methods in this study. In terms of model complexity, the CBW-DETR algorithm has the smallest parameter count (14.3 M) and model size (28.0 MB) among all compared methods, representing significant reductions compared to heavier architectures such as YOLOv5l (47.9 M), YOLOv8l (43.6 M), Deformable DETR (40.0 M), and DINO (47.0 M).
Regarding inference speed, CBW-DETR achieves 73.6 FPS, which is notably lower than the RT-DETR-R18 baseline (94.1 FPS), YOLOv5m (120.1 FPS), YOLOv8m (103.5 FPS), and YOLOv11m (95.4 FPS). As analyzed in detail in Section 4.6.1, this reduction is attributable to memory-access-intensive operations in ContextGFE and SAFPN rather than increased arithmetic computation. However, CBW-DETR remains faster than several compared methods, including YOLOv8l (63.5 FPS), YOLOv11l (70.4 FPS), Improved RT-DETR [45] (54.6 FPS), and Frequency-Enhanced [46] (69.0 FPS), and substantially exceeds the 30 FPS real-time threshold commonly required for UAV applications. In summary, among the methods evaluated in this study, CBW-DETR achieves the highest detection accuracy with a compact model footprint, while maintaining real-time inference capability at a moderate throughput cost relative to its direct baseline.
4.5. Generalization Experiments
To evaluate the generalization ability of the CBW-DETR algorithm on other aerial object datasets, we conducted experiments on the DOTA dataset. We benchmarked CBW-DETR against a broad set of methods, including YOLO-series variants, transformer-based detectors, and recent improved models, following the same scope as the VisDrone2019 evaluation. All models were trained and evaluated on the same DOTA split with the identical preprocessing protocol described in Section 4.2 (1024 × 1024 crops, stride 824, overlap 19.5%, horizontal bounding boxes). The comparison results are shown in Table 2.
From the table, it can be observed that CBW-DETR achieves 77.9% Precision, 70.4% Recall, 73.3% mAP50, and 48.5% mAP50:95 on the DOTA dataset, obtaining the best detection accuracy results across all metrics among the compared methods. In terms of model complexity, CBW-DETR maintains its compact footprint with 14.3 M parameters and a 28.0 MB model size, consistent with the VisDrone2019 results. The improvement patterns on DOTA are broadly consistent with those observed on VisDrone2019: CBW-DETR achieves meaningful accuracy gains over the RT-DETR-R18 baseline (+1.8% mAP50, +1.7% mAP50:95) while significantly reducing parameters (28.1%) and theoretical computation (18.0%). Compared with heavier architectures such as YOLOv8l and DINO, CBW-DETR achieves higher accuracy with substantially fewer parameters and lower computation.
Regarding inference speed on DOTA, CBW-DETR achieves 71.2 FPS, which is lower than the RT-DETR-R18 baseline (96.3 FPS) for the same reasons analyzed in Section 4.6.1. This is consistent with the throughput reduction pattern observed on VisDrone2019 and reflects the inherent characteristics of the proposed modules rather than dataset-specific factors. In summary, the cross-dataset results demonstrate that CBW-DETR’s accuracy improvements and model complexity reductions generalize across different aerial imagery domains, validating the framework’s robustness. However, the throughput–accuracy trade-off also persists consistently across datasets.
4.6. Ablation Study
To understand the individual and combined contributions of the proposed ContextGFE, SAFPN, and ASIoU modules to the overall performance, a series of ablation experiments was designed on the VisDrone2019 dataset. The results are shown in Table 3.
Compared with the original RT-DETR-R18 baseline, the full CBW-DETR algorithm reduced theoretical computation by 18.0%, parameters by 28.1%, and model size by 27.4%, while increasing Precision by 2.6%, Recall by 3.0%, mAP50 by 3.3%, and mAP50:95 by 3.2%. The following conclusions can be drawn from the module-level analysis.
Adopting the ContextGFE module alone improved detection accuracy (mAP50 +1.2%) while significantly reducing model complexity (parameters reduced by 5.2 M, computation reduced by 13.1 GFLOPs), demonstrating the module’s advantages in feature extraction and lightweight design. The Haar wavelet decomposition in its frequency-domain branch further contributes to edge feature capture, with the detail sub-bands providing spatially localized multi-resolution representations that benefit small-object boundary delineation. Adopting the SAFPN module alone improved detection performance (mAP50 +1.0%) at a minor increase in computational cost (+2.0 GFLOPs), especially for small objects, supporting the effectiveness of the scale compensation weighted fusion strategy. Using the ASIoU loss function alone improved mAP50 by 1.6% with no additional parameters or computation, indicating that its scale-adaptive dynamic gradient adjustment mechanism effectively optimizes the regression of targets at different scales.
When the three modules were combined, a parameter reduction of 28.1% and a theoretical computation reduction of 18.0% were achieved together with a 3.3% increase in mAP50. The combined gain is somewhat smaller than the sum of the individual gains (1.2% + 1.0% + 1.6% = 3.8%), which indicates that the modules are largely complementary, with only limited overlap in the errors they correct.
4.6.1. Inference Latency Analysis
Although the proposed modules reduce theoretical computation (GFLOPs) and parameter count, the actual inference speed decreased from 94.1 FPS (baseline) to 73.6 FPS (full CBW-DETR), a 21.8% reduction in throughput. To understand the sources of this discrepancy between theoretical and empirical efficiency, we conducted a per-module latency breakdown on the RTX 3090 with 640 × 640 input resolution, as shown in Table 4.
The analysis reveals three primary sources of the throughput reduction. First, ContextGFE’s four parallel convolution branches with different kernel sizes (3 × 3, 5 × 5, 7 × 7, dilated 3 × 3) require separate memory allocation and kernel launches, creating fragmented GPU utilization despite the preceding channel reduction. The DWT/IDWT operations contribute an additional 0.5 ms due to spatial data reshuffling across feature map dimensions, which is memory-bandwidth-bound rather than compute-bound. Second, SAFPN’s deformable convolutions sample features at learned offset positions rather than regular grid locations, introducing irregular memory access patterns that prevent efficient cache utilization and hardware-level memory coalescing. Third, the cross-scale attention mechanism, while lightweight in terms of GFLOPs, introduces sequential softmax and matrix multiplication operations that create pipeline stalls.
Despite this throughput reduction, we emphasize that 73.6 FPS substantially exceeds the commonly adopted 30 FPS real-time threshold for UAV applications and remains practical for deployment scenarios. The primary efficiency advantage of CBW-DETR lies in its significantly reduced model footprint: 14.3 M parameters and a 28.0 MB model size represent 28.1% and 27.4% reductions, respectively, which directly benefit memory-constrained edge deployment platforms where model storage and runtime memory are the binding constraints, rather than raw computational throughput.
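To relate the reported throughput figures to per-image latency, the arithmetic can be written out as below. This is a back-of-the-envelope sketch that assumes latency is simply the reciprocal of FPS (batch size 1, no pipelining); the variable names are illustrative.

```python
def latency_ms(fps):
    """Average per-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps


baseline_fps, cbw_fps = 94.1, 73.6  # values reported in the text

baseline_ms = latency_ms(baseline_fps)  # about 10.6 ms
cbw_ms = latency_ms(cbw_fps)            # about 13.6 ms
added_ms = cbw_ms - baseline_ms         # about 3.0 ms extra per image

# Relative throughput reduction, about 0.218 (21.8%).
throughput_drop = (baseline_fps - cbw_fps) / baseline_fps

# Per-frame budget implied by the 30 FPS real-time threshold, about 33.3 ms.
budget_30fps_ms = latency_ms(30.0)
```

Under this assumption the proposed modules add roughly 3 ms per image, leaving the full model well inside the 33.3 ms budget implied by the 30 FPS threshold.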
4.6.2. NMS Effect Analysis
To quantify the contribution of NMS post-processing to the overall detection performance and to clarify the distinction between the architectural design of DETR-family methods and the evaluation protocol adopted in this work, we report CBW-DETR results with and without NMS in Table 5.
As shown in Table 5, applying NMS yields a 1.3% improvement in mAP50 and a notable 3.4% increase in Precision, primarily by suppressing duplicate predictions in densely populated regions. Recall remains unchanged: NMS cannot introduce new predictions, and here it removed only redundant ones without discarding correct detections. This result confirms that while CBW-DETR can operate without NMS and still achieve competitive accuracy (50.2% mAP50), the lightweight post-processing step provides meaningful performance gains in dense UAV scenarios, justifying its inclusion in our evaluation protocol. The NMS processing adds only 0.4 ms per image (Table 4), a negligible latency cost relative to the accuracy benefit.
4.6.3. Wavelet-Domain Enhancement Analysis
To investigate the effectiveness of the Discrete Wavelet Transform (DWT) as the frequency-domain enhancement strategy in ContextGFE, we conducted comparative experiments between FFT-based and DWT-based decomposition, as shown in Table 6. DWT with Haar wavelets achieves superior performance (mAP50: 49.4% vs. 49.1%), which we attribute to its spatially localized multi-resolution decomposition property. Unlike FFT, which operates globally in the frequency domain, DWT’s detail sub-bands directly encode horizontal, vertical, and diagonal edge orientations at multiple scales, providing more targeted feature representations for small-object boundary delineation in UAV imagery. Based on these findings, we adopt DWT as the frequency-domain enhancement component in the final ContextGFE design.
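The way Haar detail sub-bands encode edge orientation can be seen in a minimal single-level 2D Haar transform. The sketch below is a generic textbook formulation on plain Python lists, not the ContextGFE implementation; the function names and the LL/LH/HL/HH labels are assumptions (naming conventions for the detail sub-bands vary between libraries).

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns (LL, LH, HL, HH), each half the input size, using the
    orthonormal scaling in which every 2x2 block is combined and halved.
    """
    rows, cols = len(x), len(x[0])
    assert rows % 2 == 0 and cols % 2 == 0, "expects even height and width"
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, rows, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, cols, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # approximation (local average)
            r_lh.append((a + b - c - d) / 2)  # top vs bottom: horizontal edges
            r_hl.append((a - b + c - d) / 2)  # left vs right: vertical edges
            r_hh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; reconstructs the original 2D list."""
    rows, cols = len(ll), len(ll[0])
    out = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for i in range(rows):
        for j in range(cols):
            s, h, v, d = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            out[2 * i][2 * j] = (s + h + v + d) / 2
            out[2 * i][2 * j + 1] = (s + h - v - d) / 2
            out[2 * i + 1][2 * j] = (s - h + v - d) / 2
            out[2 * i + 1][2 * j + 1] = (s - h - v + d) / 2
    return out


# A 2x2 block with a horizontal edge (top row dark, bottom row bright):
# only the horizontal-detail sub-band responds.
ll, lh, hl, hh = haar_dwt2([[1, 1], [3, 3]])
```

Each output coefficient depends only on one 2 × 2 neighbourhood, which is the spatial localization contrasted with the global FFT above; applying the transform again to the LL sub-band gives the next coarser scale.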
4.6.4. SAFPN Fusion Weight Convergence Analysis
To verify that the learnable fusion weights in SAFPN actively contribute to scale-specific feature compensation rather than remaining at their initialization values, we tracked the converged values of these weights after 400 epochs of training on VisDrone2019. The results are presented in Table 7.
The converged weights deviate meaningfully from their initialization value of 1.0, with deviations ranging from 9% to 28% (the individual values are listed in Table 7). Several observations can be drawn. In the top-down pathway, the network learns to assign a higher weight to upsampled coarse-scale features that carry rich semantic information, while moderately attenuating the original backbone features, which contain more redundant spatial detail at the current level. This asymmetry suggests that for UAV imagery with small objects, the semantic context from higher pyramid levels is more valuable than the local features at each level during top-down refinement. In the bottom-up pathway, the compensation factor-modulated backbone features receive increased weight, indicating that the spatial compensation mechanism effectively enhances backbone features to the point where the network prefers them over the downsampled contributions from finer levels. These results confirm that the learnable weights are actively adapting to the data distribution and contributing to the scale-aware fusion strategy, rather than remaining inert at their initialization values.
4.7. Loss Function Comparison
To evaluate the performance of ASIoU, quantitative comparative experiments were conducted on the VisDrone2019 dataset, comparing ASIoU with GIoU (baseline algorithm), DIoU, CIoU, EIoU, FocalEIoU, InnerFocalEIoU, ShapeIoU, InnerDIoU, InnerCIoU, and InnerEIoU. The comparison results are shown in Table 8. In the table, ASIoU achieved the highest mAP50 value and Precision value, reaching 49.8% and 63.5%, respectively, and performed well in Recall and mAP50:95 metrics. Comprehensively evaluated, ASIoU demonstrates the most balanced performance across all metrics, indicating its effectiveness as a well-suited loss function for UAV-based small object detection.
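For reference, the baseline loss in this comparison, GIoU, can be sketched in a few lines. This follows the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union) and is not ASIoU, whose formulation is not restated in this section. The function names are illustrative, and boxes are assumed to be non-degenerate (x1, y1, x2, y2) tuples.

```python
def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes; value in (-1, 1]."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose


def giou_loss(pred, target):
    """GIoU regression loss: 0 for a perfect match, up to 2 for distant boxes."""
    return 1.0 - giou(pred, target)


# Disjoint unit boxes: plain IoU is 0 regardless of distance, but GIoU
# still changes as the gap grows, so the loss keeps a useful signal.
near = giou((0, 0, 1, 1), (2, 0, 3, 1))  # -1/3
far = giou((0, 0, 1, 1), (4, 0, 5, 1))   # -3/5
```

This sensitivity for non-overlapping boxes matters for small objects, where a shift of a few pixels can remove all overlap between prediction and ground truth.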
4.8. Visualization Analysis
To more comprehensively demonstrate the performance of the CBW-DETR algorithm, this section provides intuitive comparisons in two aspects: comparing various important performance indicators before and after model improvement, and showing visual detection results.
First, as shown in Figure 5, on the VisDrone2019 dataset test data, the four curves clearly reflect the overall performance improvements in detection tasks before and after model enhancement. The CBW-DETR algorithm shows consistent improvements in Precision, Recall, mAP50, and mAP50:95 throughout the training process. This demonstrates that the ContextGFE module, SAFPN module, and ASIoU loss function effectively enhance the model’s sensitivity and accuracy for small object detection, enabling the model to demonstrate stronger recognition capabilities in complex scenes.
Four different scenarios were selected from the VisDrone2019 dataset to compare the detection effects of the model before and after improvement, as shown in Figure 6. For the daytime scenario shown in Figure 6a, the CBW-DETR algorithm identifies smaller objects more accurately than the RT-DETR-R18 model, demonstrating improved sensitivity to fine-grained targets. In the dense object scenario of Figure 6b, the CBW-DETR algorithm can effectively identify and locate numerous objects, providing more accurate recognition in dense scenes and demonstrating strong capability for handling complex scenarios. In the night environment of Figure 6c, the CBW-DETR algorithm maintains stable performance and can effectively identify dense objects under low-light conditions. In the occluded scenario of Figure 6d, facing partially visible targets, the CBW-DETR algorithm can still accurately identify the objects, providing more complete and accurate detection results.
To visualize the model’s attention more intuitively, Grad-CAM++ was used to generate heatmaps on the VisDrone2019 dataset. The heatmap represents the degree of feature response through color gradients, where red areas represent the highest activation values, indicating the regions of highest model attention. As shown in the comparison between Figure 7a and Figure 7c, the RT-DETR-R18 algorithm exhibits missed detections and false positives, while the CBW-DETR algorithm accurately detects vehicles. In Figure 7b,d, RT-DETR-R18 shows notable false detection problems, whereas the CBW-DETR algorithm successfully avoids such issues. The heatmap comparison reveals that CBW-DETR produces more focused and concentrated attention regions on actual object locations, suggesting that the proposed ContextGFE and SAFPN modules enable the model to better distinguish foreground objects from background clutter. These qualitative results are consistent with the quantitative improvements observed in Table 1 and Table 3, providing visual evidence for the effectiveness of the proposed architectural designs.
5. Conclusions
This paper presents CBW-DETR, a lightweight transformer-based framework designed specifically for small object detection in UAV imagery. Through three coordinated innovations consisting of ContextGFE for compact feature extraction, SAFPN for scale-aware feature fusion, and ASIoU for adaptive optimization, the framework achieves notable improvements in detection accuracy alongside significant reductions in model parameters (28.1%) and theoretical computation (18.0%) compared to the RT-DETR-R18 baseline.
Experimental validation on VisDrone2019 demonstrates that CBW-DETR substantially reduces model complexity compared to the baseline RT-DETR while achieving improvements in detection accuracy across all evaluation metrics. Cross-dataset validation on DOTA, conducted against the same scope of comparison methods as on VisDrone2019, confirms generalization capability across diverse aerial imagery characteristics. Among the methods evaluated in this study, CBW-DETR achieves the highest detection accuracy with a compact model footprint of 14.3 M parameters and 28.0 MB, demonstrating favorable accuracy–complexity trade-offs compared to recent YOLO variants and transformer-based detectors.
The proposed ContextGFE module achieves parameter and computation reductions through adaptive receptive field selection and multi-resolution wavelet-domain enhancement mechanisms. Specifically, the incorporation of Discrete Wavelet Transform with Haar wavelets enables spatially localized decomposition into approximation and detail sub-bands, where the detail coefficients explicitly encode horizontal, vertical, and diagonal edge orientations at multiple scales. This property benefits small-object boundary delineation in UAV imagery, and the ablation experiments show an advantage over FFT-based global frequency decomposition (49.4% vs. 49.1% mAP50). The SAFPN module introduces spatial-variant compensation factors and cross-scale attention mechanisms to address gradient flow imbalance across pyramid levels, with particularly notable improvements in small-object recall performance. Convergence analysis of the learnable fusion weights confirms that they actively adapt to the data distribution, providing meaningful scale-specific contributions rather than remaining at initialization values. The ASIoU loss function implements uncertainty-aware gradient modulation and scale-specific optimization strategies, enhancing localization accuracy across objects of varying sizes.
Visualization analysis demonstrates robust detection performance across challenging scenarios, including nighttime conditions with limited illumination, dense object distributions with significant overlap, and scenes with severe occlusions. Grad-CAM++ attention heatmaps reveal that CBW-DETR focuses more effectively on salient object regions while suppressing background interference compared to baseline methods, providing qualitative validation for the effectiveness of the proposed architectural designs.
5.1. Limitations
Despite the improvements achieved, several limitations of the current work should be acknowledged. First, while CBW-DETR reduces theoretical computation (GFLOPs) and model parameters, the actual inference throughput decreases from 94.1 FPS to 73.6 FPS (a 21.8% reduction) compared to the baseline. As discussed in Section 4.6.1, this discrepancy arises because the proposed modules introduce memory-access-intensive computations that are less amenable to GPU parallelization than the standard convolutions they replace; the main contributors are ContextGFE’s multi-branch parallel convolutions and DWT/IDWT spatial data reshuffling, SAFPN’s deformable convolutions with irregular memory access patterns, and cross-scale attention with sequential softmax operations. Although 73.6 FPS comfortably exceeds the 30 FPS real-time threshold for most UAV applications, this throughput–accuracy trade-off should be carefully considered for latency-critical deployments. Second, all inference speed measurements were conducted exclusively on an NVIDIA RTX 3090 GPU. The latency characteristics of the proposed modules may differ substantially on resource-constrained edge deployment platforms commonly used in UAV systems (e.g., NVIDIA Jetson Orin Nano, mobile SoCs), where memory bandwidth limitations could further exacerbate the gap between theoretical and empirical efficiency. Benchmarking on such platforms remains an important validation step that is not covered in the current work. Third, while the generalization experiments on the DOTA dataset include a broad set of comparison methods, generalization claims would be strengthened by evaluating additional aerial detection benchmarks beyond VisDrone2019 and DOTA. Fourth, while RT-DETR was architecturally designed to operate without NMS, our experimental protocol applies NMS uniformly to all methods for fair comparison. This means the reported results do not reflect a purely end-to-end detection paradigm, and the 1.3% mAP50 improvement attributable to NMS (Section 4.6.2) should be considered when interpreting the results.
5.2. Future Work
Future research directions include investigating operator-level optimization strategies (such as kernel fusion and custom CUDA kernels for the DWT/IDWT and multi-branch convolution operations) to narrow the gap between theoretical computation reduction and actual inference throughput. Conducting comprehensive deployment benchmarks on edge computing platforms commonly used in UAV systems would provide practical validation of the framework’s real-world applicability. Expanding the cross-dataset evaluation to include additional aerial detection benchmarks with a broader set of comparison methods would strengthen generalization claims. Other promising directions include investigating adaptive query selection mechanisms for the transformer decoder to further enhance detection performance on extremely small objects, exploring learnable wavelet basis functions as alternatives to fixed Haar wavelets to enable data-driven multi-resolution decomposition, integrating temporal information for improved consistency in video-based detection scenarios, and extending the framework to multi-task learning paradigms that jointly optimize detection with complementary tasks such as object tracking and instance segmentation.