CBW-DETR: A Lightweight Detection Transformer for Small Object Detection in UAV Imagery

Qin, Suning; Cheng, Ke; Wang, Yuanquan

doi:10.3390/electronics15102010

by Suning Qin ¹, Ke Cheng ¹,* and Yuanquan Wang ²

¹ School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212000, China
² School of Artificial Intelligence, Hebei University of Technology (HeBUT), Tianjin 300401, China
* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2010; https://doi.org/10.3390/electronics15102010

Submission received: 3 April 2026 / Revised: 24 April 2026 / Accepted: 7 May 2026 / Published: 9 May 2026

Abstract

Small object detection in Unmanned Aerial Vehicle (UAV) imagery faces critical challenges, including extreme scale variations, dense spatial distributions, and stringent computational constraints for real-time deployment. To address these challenges, this paper proposes a CBW-based Detection Transformer (CBW-DETR), an enhanced transformer-based detection framework that integrates architectural efficiency with scale-aware mechanisms throughout the detection pipeline. The framework comprises three coordinated innovations. First, a Context-Guided Feature Extraction (ContextGFE) module reduces model parameters and theoretical computational cost through adaptive receptive field selection and wavelet-domain enhancement while maintaining representational capacity. Second, a Scale-Aware Feature Pyramid Network (SAFPN) employs spatial-variant compensation factors and cross-scale attention to facilitate balanced gradient flow across pyramid levels, particularly benefiting small object detection. Third, an Adaptive Scale IoU (ASIoU) loss function implements uncertainty-aware gradient modulation and scale-specific optimization to enhance localization accuracy for objects of varying sizes. Extensive experiments on the VisDrone2019 and DOTA (Dataset for Object Detection in Aerial Images) datasets demonstrate that CBW-DETR achieves substantial improvements in detection accuracy while reducing model parameters by 28.1% and theoretical computation by 18.0% compared to the Real-Time Detection Transformer-R18 (RT-DETR-R18) baseline. These reductions in model complexity come at a moderate cost in inference throughput (73.6 frames per second (FPS) vs. 94.1 FPS), attributable to memory-access-intensive operations introduced by multi-branch convolutions and wavelet transforms. Among the evaluated detectors, including You Only Look Once (YOLO) series variants and transformer-based methods, CBW-DETR achieves competitive detection accuracy with a notably compact model footprint. Visualization analysis confirms its robust performance across diverse challenging scenarios, including nighttime conditions, dense object distributions, and severe occlusions, validating the framework’s practical applicability for UAV-based detection applications.

Keywords: RT-DETR; lightweight; small object detection; feature fusion; loss function; wavelet transform

1. Introduction

Object detection in Unmanned Aerial Vehicle (UAV) imagery has become increasingly important across diverse applications including precision agriculture [1], traffic monitoring [2], disaster assessment [3], and infrastructure inspection [4,5]. However, the aerial perspective introduces substantial technical challenges. Objects of interest typically occupy a small fraction of the image area, while atmospheric effects such as haze and turbulence, variable illumination conditions, and dense spatial distributions create additional complexity for accurate detection. Recent advances in general object detection have demonstrated the effectiveness of both CNN-based and transformer-based architectures across a wide range of visual recognition tasks [6], providing a solid foundation for addressing these aerial-specific challenges.

Contemporary object detection frameworks can be categorized into anchor-based, region-based, and transformer-based approaches. Anchor-based single-stage detectors, such as SSD [7] and the YOLO series [8,9,10,11,12,13,14,15], achieve real-time inference speeds through predefined anchor boxes combined with non-maximum suppression (NMS). However, several factors limit their effectiveness in UAV scenarios. The fixed geometric configurations of anchor boxes may not adequately match the scale distribution characteristics of aerial imagery, where object size variance substantially exceeds that of ground-level datasets. Additionally, NMS post-processing operates sequentially, preventing full utilization of parallel computing architectures, and threshold-based suppression can inadvertently eliminate valid detections in densely populated regions [16].
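The two NMS limitations mentioned above can be illustrated with a minimal sketch of greedy NMS. The box format `(x1, y1, x2, y2)`, the helper names, and the example boxes and thresholds are illustrative assumptions, not taken from any of the cited detectors.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Each decision depends on the boxes already kept,
        # which is why the loop is inherently sequential.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two adjacent objects whose boxes overlap heavily, plus one isolated object.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores, iou_threshold=0.5)
```

In this toy case the first two boxes have an IoU of 80/120 ≈ 0.67, so the second box is discarded at a threshold of 0.5 even if it belongs to a distinct neighbouring object; raising the threshold to 0.7 keeps all three boxes.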

Two-stage region-based methods, exemplified by the R-CNN family [17], employ iterative refinement to improve localization precision. These approaches typically require substantial computational resources, which constrains deployment to server-based platforms rather than edge devices. The resulting latency from network transmission becomes prohibitive for time-sensitive applications such as autonomous navigation.

Transformer-based detection frameworks have recently demonstrated the feasibility of end-to-end learning without hand-crafted components. DETR [18] introduced set prediction using global self-attention mechanisms, eliminating the need for anchor boxes and NMS in its architectural design. However, the quadratic complexity of self-attention restricts practical input resolutions. Deformable DETR [19] addressed this limitation through sparse sampling at learnable offsets, while DN-DETR [20] accelerated convergence by incorporating denoising training mechanisms. RT-DETR [21] achieved competitive inference speeds through hybrid CNN–Transformer encoders and introduced IoU-aware query selection to reduce reliance on NMS, representing a notable advancement in transformer-based detection efficiency. It is worth noting that while DETR-family architectures are designed to operate without NMS, in practice a lightweight NMS post-processing step is often applied in dense detection scenarios (such as UAV imagery) to suppress residual duplicate predictions, as we adopt in our experimental protocol for fair comparison across all evaluated methods.

Despite these advances, several observations suggest opportunities for improvement in the context of UAV-based small object detection. Analysis of gradient flow during training reveals substantial disparities across pyramid levels in existing multi-scale fusion strategies, where features at different resolution levels typically receive uniform weighting. This phenomenon correlates with observable performance gaps between small- and large-object detection accuracies. Additionally, standard IoU-based loss functions apply identical penalty curves regardless of object scale or detection difficulty. For small or partially occluded objects where intersection-over-union values tend to be lower, gradient magnitudes can decrease substantially, potentially impeding optimization convergence for challenging samples. Furthermore, model compactness remains critical for deployment on resource-constrained UAV platforms, necessitating architectures that balance detection performance with parameter efficiency. However, it should be recognized that reducing theoretical computational metrics (parameters and GFLOPs) does not always translate directly into proportional inference speed improvements, as actual throughput depends on hardware-specific factors including memory access patterns, operator fusion, and parallelization efficiency.

Recent works have explored modifications to the RT-DETR architecture. Li et al. [22] incorporated self-attention upsampling modules and refined the DIoU loss function. Lin et al. [23] replaced the ResNet backbone with MobileNetV2 to reduce parameters, while He et al. [24] proposed a scale adaptive feature pyramid network to enhance multi-scale object detection performance. Wu et al. [25] substituted RepViT as the backbone and integrated HiLo attention mechanisms. These approaches primarily focus on individual components—either the backbone architecture, neck design, or loss function—rather than coordinated treatment of scale-related challenges across the detection pipeline.

Addressing these observations regarding model compactness and scale-adaptive optimization, this work proposes CBW-DETR, a framework that integrates architectural efficiency with scale-aware mechanisms throughout the detection pipeline. The framework encompasses three principal components:

To reduce model complexity while preserving feature diversity, a restructured residual block architecture partitions channel processing into selective pathways. Parallel branches with different receptive field configurations capture multi-scale contextual information, while a subset of channels bypasses processing through identity connections, reducing parameter count and theoretical computation without compromising representational capacity.
Addressing gradient flow imbalance across pyramid levels, a bidirectional feature pyramid network introduces learnable compensation mechanisms weighted according to feature map spatial dimensions. This formulation enables differential treatment of features during both forward propagation and gradient backpropagation, facilitating balanced optimization across scales.
For scale-adaptive optimization, the framework incorporates a loss formulation that implements category-specific statistical tracking. Through exponential moving averages and gradient modulation based on scale-dependent loss distributions, the mechanism amplifies optimization signals for under-performing categories while stabilizing convergence for well-optimized samples.

Experimental validation on VisDrone2019 and DOTA datasets demonstrates improvements in detection accuracy alongside significant reductions in model parameters (28.1%) and theoretical computation (18.0%). These reductions in model footprint come at a moderate cost in inference throughput due to memory-access-intensive operations in the proposed modules. Ablation studies quantify the individual and combined effects of the proposed components across different object scale categories, including a detailed analysis of inference latency contributions from each module.

2. Related Work

2.1. Small Object Detection in Aerial Imagery

Object detection in aerial imagery presents distinct challenges compared to ground-level scenarios due to extreme scale variations, dense spatial distributions, and imaging condition variability. Early approaches adapted general-purpose detectors through multi-scale feature pyramid architectures. Feature pyramid networks (FPNs) [26] construct multi-scale representations through top-down pathways and lateral connections, forming the foundation for many subsequent methods. PANet [27] augmented FPNs with an additional bottom-up path to enhance feature propagation across scales. BiFPN [28] introduced weighted bidirectional feature fusion, improving efficiency by removing nodes with single input edges.

The VisDrone dataset [29] has become a standard benchmark for evaluating UAV-based detection methods, containing diverse scenarios with objects spanning wide scale ranges. DOTA [30] provides additional challenges through oriented bounding box annotations for aerial imagery. These datasets have driven the development of specialized methods for aerial object detection.

Recent UAV detection methods have explored various architectural modifications for small object detection. These approaches demonstrate the importance of specialized designs for aerial detection scenarios, though they primarily focus on CNN-based architectures without fully leveraging recent advances in transformer-based methods.

2.2. Transformer-Based Object Detection

The introduction of transformers to computer vision [31,32] enabled new paradigms for object detection. DETR [18] formulated detection as a set prediction problem using bipartite matching between predicted and ground-truth boxes, eliminating the architectural dependency on hand-crafted components like anchor generation and non-maximum suppression. However, DETR suffered from slow convergence, requiring hundreds of training epochs, and from high computational complexity due to quadratic self-attention.
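The one-to-one matching idea can be sketched on a toy example. This is a simplified illustration: the cost here is only an L1 box distance (DETR's actual matching cost also includes classification and GIoU terms), the brute-force search stands in for the Hungarian algorithm, and all names and box values are hypothetical.

```python
from itertools import permutations


def match_cost(pred, gt):
    """L1 distance between two boxes (cx, cy, w, h); a stand-in for the full cost."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def bipartite_match(preds, gts):
    """Brute-force minimum-cost one-to-one assignment (tiny inputs only).

    Assumes len(preds) >= len(gts). Returns (pred_index, gt_index) pairs.
    """
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return [(p, g) for g, p in enumerate(best_assign)]


preds = [(0.50, 0.50, 0.20, 0.20), (0.10, 0.10, 0.05, 0.05), (0.80, 0.30, 0.10, 0.10)]
gts = [(0.12, 0.10, 0.05, 0.06), (0.52, 0.50, 0.20, 0.20)]
pairs = bipartite_match(preds, gts)
```

Each ground-truth box is assigned to exactly one prediction; the unmatched prediction (index 2 here) would be supervised as "no object", which is what removes the need for duplicate suppression in the original design.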

Subsequent works addressed these limitations through various mechanisms. Deformable DETR [19] replaced dense attention with deformable attention modules that sample features at learned offset positions, reducing complexity while maintaining effectiveness. DN-DETR [20] introduced denoising training that adds noise to ground-truth boxes during training, providing auxiliary supervision signals that accelerate convergence. More broadly, the integration of attention mechanisms and multi-scale feature learning has driven significant progress in general object detection, as comprehensively surveyed in the recent literature [6].

RT-DETR [21] achieved real-time inference speeds through architectural innovations including an efficient hybrid encoder combining convolutional and transformer modules, IoU-aware query selection, and uncertainty-minimal query selection during inference. The framework demonstrated competitive accuracy–speed trade-offs compared to YOLO-series detectors [8,9,13,14,15]. While RT-DETR was architecturally designed to reduce reliance on NMS through its IoU-aware query selection mechanism, it is important to note that in dense detection scenarios—such as those encountered in UAV imagery—a lightweight NMS post-processing step is often still applied in practice to suppress residual duplicate predictions from the fixed set of object queries. This distinction between architectural design intent and practical deployment protocol is relevant when interpreting performance comparisons across different detector families.

Recent adaptations of RT-DETR for specific applications have explored domain-specific modifications. Li et al. [22] optimized RT-DETR for insulator defect detection through improved feature extraction. Tang et al. [2] developed a lightweight RT-DETR variant for real-time traffic light detection. Pan et al. [3] proposed improvements for agricultural applications. These works demonstrate the versatility of the RT-DETR architecture across diverse applications, though systematic optimization for UAV-based small object detection remains underexplored.

2.3. Loss Functions for Bounding Box Regression

Bounding box regression loss functions have evolved from simple $\ell_{1}$ and $\ell_{2}$ distances to IoU-based formulations that directly optimize detection metrics. GIoU [33] addressed the gradient vanishing problem of IoU loss when boxes have zero overlap by incorporating the area of the smallest enclosing box. DIoU and CIoU [34] introduced distance-based penalties and aspect ratio consistency to accelerate convergence and improve shape prediction.
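The zero-overlap behaviour that motivates GIoU can be checked numerically. The sketch below uses the standard definition GIoU = IoU − (|C| − |A ∪ B|)/|C|, where C is the smallest enclosing box; the function name and the example boxes are illustrative assumptions.

```python
def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (enclose - union) / enclose
    return iou, giou


near = iou_and_giou((0, 0, 2, 2), (3, 0, 5, 2))   # disjoint, 1 unit apart
far = iou_and_giou((0, 0, 2, 2), (8, 0, 10, 2))   # disjoint, 6 units apart
```

Both pairs have an IoU of exactly zero, so an IoU loss cannot distinguish them, whereas GIoU is −0.2 for the nearer pair and −0.6 for the farther one and therefore still provides a signal that decreases with distance.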

Recent works have explored adaptive weighting mechanisms. Focal Loss [35] introduced a modulating factor to down-weight well-classified examples, addressing class imbalance in dense object detection. EIoU [36] decomposes box regression into width, height, and center distance components with separate loss terms for more fine-grained optimization.

Shape-IoU [37] proposed a more accurate metric considering both bounding box shape and scale, particularly beneficial for objects with varying aspect ratios. However, existing loss functions generally apply uniform weighting across different object scales and detection difficulties. For small object detection in UAV imagery, where optimization difficulty varies significantly across scales, adaptive loss formulations that modulate gradient magnitudes based on object-specific characteristics remain an open research direction.

2.4. Efficient Network Architectures for Detection

The deployment of object detectors on resource-constrained platforms has driven research into efficient architectures. MobileNets [38] introduced depthwise separable convolutions to reduce parameters and computational costs while maintaining reasonable accuracy. The evolution from YOLOv7 [13] to YOLOv10 [15] demonstrates continuous improvements in balancing detection performance with computational efficiency through architectural innovations.

Recent lightweight detection methods for UAV applications have explored various efficiency strategies. The tension between computational efficiency and feature expressiveness remains central to architecture design. For UAV-based detection, additional constraints include limited onboard computational resources and real-time processing requirements. Partitioning strategies and selective pathway activation based on channel subsets represent promising directions for achieving better efficiency–accuracy trade-offs in multi-scale feature extraction scenarios.

It is worth noting, however, that theoretical efficiency metrics such as parameter count and GFLOPs do not always correlate linearly with actual inference speed. Architectural choices that reduce arithmetic operations may simultaneously introduce memory-access-intensive or irregularly structured computations—such as multi-branch parallel convolutions, wavelet transforms, and deformable convolutions—that are less amenable to hardware parallelization on modern GPUs. Consequently, a holistic evaluation of lightweight architectures should consider both theoretical complexity and empirical throughput on target deployment platforms.

3. Methodology

3.1. Framework Overview

The proposed CBW-DETR framework addresses model complexity reduction and multi-scale feature representation challenges in UAV-based object detection through three coordinated architectural innovations. Building upon RT-DETR as the baseline detector, the framework integrates an adaptive receptive field feature extraction module with multi-resolution wavelet-domain enhancement, a cross-scale attention-based feature pyramid with deformable fusion, and an uncertainty-aware adaptive loss with dynamic gradient modulation.

As illustrated in Figure 1, the framework follows a design philosophy centered on dynamic adaptability rather than static architectural choices. This adaptability manifests at three levels of the detection pipeline. At the feature extraction level, ContextGFE learns optimal receptive field combinations through gating mechanisms while incorporating multi-resolution wavelet-domain features to capture both local spatial patterns and scale-dependent edge characteristics. At the feature fusion level, SAFPN employs spatial-variant compensation factors and cross-scale attention to enable adaptive feature aggregation that accounts for gradient flow differences across pyramid levels. At the supervision level, ASIoU integrates uncertainty estimation with dynamic gradient scaling to achieve robust optimization across object scales. The coordinated design ensures that modifications at one level complement enhancements at other levels, creating synergistic improvements throughout the detection pipeline rather than isolated optimizations. It should be noted that while the proposed modules are designed to reduce model parameters and theoretical computation (GFLOPs), certain operations—such as multi-branch parallel convolutions, wavelet transforms involving spatial data reshuffling, and deformable convolutions with irregular memory access patterns—are memory-access-intensive rather than compute-bound, which may result in actual inference throughput that does not scale proportionally with theoretical computation reductions.

3.2. Baseline Architecture

RT-DETR [21] provides the foundation through its transformer architecture designed for real-time object detection, whose overall structure is illustrated in Figure 2. The architecture comprises three principal components working in sequence. First, a ResNet-18 backbone with BasicBlock residual modules processes input images to generate multi-scale feature representations $\{F_{3}, F_{4}, F_{5}\}$ at spatial strides of $\{8, 16, 32\}$ pixels. Second, a hybrid encoder processes these features through Attention-based Intra-scale Feature Interaction (AIFI) that refines features within each scale, and CNN-based Cross-scale Feature Fusion (CCFF) that aggregates information across different scales. Third, a transformer decoder with IoU-aware query selection generates final object predictions through iterative refinement. While RT-DETR was architecturally designed to reduce reliance on NMS through its IoU-aware query selection mechanism, in dense detection scenarios characteristic of UAV imagery, a lightweight NMS post-processing step is commonly applied in practice to suppress residual duplicate predictions from the fixed set of object queries. In our experimental protocol, NMS is uniformly applied to all evaluated methods to ensure fair comparison (see Section 4.1 for details).

While this architecture achieves strong performance on standard benchmarks, it employs several fixed computational patterns that limit adaptability. The backbone applies uniform $3 \times 3$ convolutions across all feature channels without distinguishing between redundant and informative channels. The feature fusion mechanism uses simple element-wise addition with uniform weights, ignoring scale-dependent characteristics of gradient flow during backpropagation. The supervision employs scale-agnostic loss functions that apply identical optimization pressure regardless of object size or detection difficulty.

Our modifications introduce adaptive mechanisms at each of these levels to enable data-driven optimization. Rather than relying on hand-crafted architectural choices, the proposed components learn to allocate representational resources and optimization signals based on the specific characteristics of aerial imagery.

3.3. Context-Guided Feature Extraction with Adaptive Receptive Fields

The ContextGFE module replaces standard BasicBlock residual modules in backbone stages S3, S4, and S5 to address computational redundancy while maintaining representational capacity. The key insight motivating this design is that aerial imagery exhibits distinct patterns at multiple spatial scales simultaneously. Small objects such as pedestrians or vehicles require fine-grained local features to capture subtle appearance details, while cluttered backgrounds benefit from broader contextual information to distinguish foreground from background. Rather than using fixed kernel sizes that commit to a single scale, ContextGFE dynamically allocates representational resources across multiple receptive fields based on input characteristics.

Figure 3 illustrates the complete module architecture. The design follows a three-stage pipeline: dimensionality reduction to eliminate redundant channels, parallel multi-scale processing with dynamic weighting, and attention-based feature refinement.

Given input features $X \in \mathbb{R}^{C \times H \times W}$, the module first reduces channel dimensionality through a lightweight projection:

$$X_{r} = \mathrm{PReLU}(\mathrm{BN}(W_{r} * X)) \tag{1}$$

where $W_{r}$ is a $1 \times 1$ convolution that projects from $C$ to $C/2$ channels. This reduction serves two purposes: it eliminates redundant information present in highly correlated feature channels, and it decreases the computational burden of subsequent multi-scale processing. The use of parametric ReLU allows the network to learn optimal negative slope values, providing more flexibility than standard ReLU.

Rather than committing to fixed kernel sizes, the module employs a lightweight gating network to dynamically determine the importance of different receptive field sizes:

$$g = \mathrm{Softmax}(W_{g} \cdot \mathrm{AdaptiveAvgPool}(X_{r})) \in \mathbb{R}^{4} \tag{2}$$

where $g = [g_{3}, g_{5}, g_{7}, g_{d}]$ represents learned weights for four different kernel configurations: $3 \times 3$ for local details, $5 \times 5$ for mid-range patterns, $7 \times 7$ for broader context, and dilated $3 \times 3$ with dilation rate 3 for capturing long-range dependencies efficiently. The softmax operation ensures these weights form a valid probability distribution, allowing the network to allocate computational emphasis across scales in a principled manner. We note that the four parallel convolution branches, while reducing total GFLOPs through the preceding channel reduction, introduce fragmented memory access operations that limit GPU parallelization efficiency, an issue we quantitatively analyze in Section 4.4.
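The efficiency argument for the dilated branch follows from the standard relation $k_{\text{eff}} = k + (k - 1)(d - 1)$ between kernel size $k$, dilation $d$, and the side length of the covered window. The short sketch below tabulates this for the four branches; the helper names are illustrative, and the weight counts assume dense (non-depthwise) kernels, which the text does not specify.

```python
def effective_kernel(k, dilation=1):
    """Side length of the window covered by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def weights_per_channel_pair(k):
    """Weights of a dense k x k kernel per (input, output) channel pair (assumption)."""
    return k * k


# (kernel size, dilation) for the four branches named in the text.
branches = {"3x3": (3, 1), "5x5": (5, 1), "7x7": (7, 1), "dilated 3x3": (3, 3)}
summary = {
    name: (effective_kernel(k, d), weights_per_channel_pair(k))
    for name, (k, d) in branches.items()
}
```

Under these assumptions the dilated $3 \times 3$ branch spans the same $7 \times 7$ window as the $7 \times 7$ branch while using 9 rather than 49 weights per channel pair, at the price of sampling that window sparsely.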

The multi-scale spatial features are computed through a weighted combination:

$$F_{\text{spatial}} = \sum_{k \in \{3, 5, 7, d\}} g_{k} \cdot \mathrm{Conv}_{k}(X_{r}[: C/4]) \tag{3}$$

This formulation implements a soft attention mechanism over receptive field sizes. Rather than hard-selecting a single scale, the module blends features from all scales with learned weights, providing greater flexibility and smoother gradient flow.
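The soft selection in Equations (2) and (3) can be sketched with plain lists. In this minimal illustration the branch outputs are given directly as small vectors rather than computed by convolutions, and the gate logits stand in for $W_{g} \cdot \mathrm{AdaptiveAvgPool}(X_{r})$; all names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend(branch_outputs, gate_logits):
    """Weighted sum of per-branch feature vectors with weights softmax(gate_logits)."""
    gates = softmax(gate_logits)
    length = len(branch_outputs[0])
    fused = [
        sum(g * out[i] for g, out in zip(gates, branch_outputs))
        for i in range(length)
    ]
    return gates, fused


# Stand-ins for the outputs of the 3x3, 5x5, 7x7 and dilated branches.
branch_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
gates, fused = blend(branch_outputs, gate_logits=[0.0, 0.0, 0.0, 0.0])
sharp_gates, sharp_fused = blend(branch_outputs, gate_logits=[50.0, 0.0, 0.0, 0.0])
```

With equal logits every branch receives weight 0.25 and the result is the plain average; with one dominant logit the blend approaches hard selection of that branch while remaining differentiable in the gate.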

To complement spatial convolutions with multi-resolution edge representation, a parallel branch applies the Discrete Wavelet Transform (DWT) [39] to decompose features into hierarchical sub-bands:

$$F_{\text{freq}} = \mathrm{IDWT}(\phi(\mathrm{DWT}(X_{r}[C/4 : C/2]))) \tag{4}$$

where DWT decomposes the input feature map into one approximation sub-band $LL$ and three detail sub-bands $\{LH, HL, HH\}$ using Haar wavelets. The approximation sub-band $LL$ captures low-frequency structural information, while the detail sub-bands $LH$, $HL$, and $HH$ explicitly encode horizontal, vertical, and diagonal edge orientations, respectively. This spatially localized multi-resolution decomposition is particularly beneficial for small-object boundary delineation in UAV imagery, as it directly models edge structures at multiple scales rather than relying solely on global frequency statistics. However, the DWT and inverse DWT operations involve spatial data reshuffling across feature map dimensions, which constitutes a memory-bandwidth-bound operation rather than a compute-bound one, contributing to the gap between theoretical GFLOP reduction and actual inference throughput. The learnable filter $\phi$ is implemented as a two-layer network:

$$\phi(Z) = \mathrm{ReLU}(W_{f}^{(2)} \cdot \mathrm{ReLU}(W_{f}^{(1)} \cdot Z)) \tag{5}$$

with $W_{f}^{(1)} \in \mathbb{R}^{C/8 \times C/4}$ and $W_{f}^{(2)} \in \mathbb{R}^{C/4 \times C/8}$. This bottleneck architecture encourages the network to learn compact wavelet-domain representations by selectively amplifying informative sub-band coefficients while suppressing noise-dominated components.
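A single-level 2D Haar decomposition and its inverse can be written in a few lines, which also makes the strided "reshuffling" access pattern visible. This is a minimal sketch on one channel with the learnable filter $\phi$ omitted (treated as identity); the orthonormal 1/2 scaling and the sign convention that assigns horizontal edges to `lh` are assumptions, since conventions differ between libraries.

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT of a matrix with even height and width."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Strided reads: every output coefficient gathers a 2x2 input block.
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (approximation)
            lh[i][j] = (a + b - c - d) / 2  # top vs. bottom: horizontal edges
            hl[i][j] = (a - b + c - d) / 2  # left vs. right: vertical edges
            hh[i][j] = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; scatters each coefficient quadruple back to a 2x2 block."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, v, u, t = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (s + v + u + t) / 2
            x[2 * i][2 * j + 1] = (s + v - u - t) / 2
            x[2 * i + 1][2 * j] = (s - v + u - t) / 2
            x[2 * i + 1][2 * j + 1] = (s - v - u + t) / 2
    return x


# A 4x4 map with a horizontal edge between the first and second rows.
image = [[0.0] * 4, [1.0] * 4, [1.0] * 4, [1.0] * 4]
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
```

For this input only the `lh` sub-band is non-zero (in its first row), and the inverse transform recovers the input exactly, so any change in $F_{\text{freq}}$ relative to the input comes from the filter applied between the two transforms.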

The spatial and wavelet-domain features are concatenated and projected back to the original channel dimension:

$$F_{\text{fused}} = W_{p} * \mathrm{Concat}([F_{\text{spatial}}, F_{\text{freq}}]) \tag{6}$$

where $W_{p}$ is a $1 \times 1$ convolution that integrates information from both domains. This fusion enables the network to leverage complementary information where spatial features provide localization cues while wavelet features contribute multi-resolution edge context that aids in distinguishing small objects from background clutter.

The fused features undergo refinement through a joint spatial-channel attention mechanism. Unlike standard channel attention that operates solely on global statistics, this mechanism considers both spatial and channel dimensions. The spatial attention component identifies important spatial locations:

$$A_{\text{spatial}} = \sigma(\mathrm{Conv}_{7 \times 7}(\mathrm{Concat}[\max(F_{\text{fused}}), \mathrm{mean}(F_{\text{fused}})])) \tag{7}$$

where $\max(\cdot)$ and $\mathrm{mean}(\cdot)$ denote pooling operations along the channel dimension. The concatenation of max-pooled and mean-pooled features provides complementary information about salient spatial regions. The $7 \times 7$ convolution kernel allows the attention mechanism to consider local context when determining spatial importance.

In parallel, the channel attention component identifies informative feature channels:

$$A_{\text{channel}} = \sigma(W_{c}^{(2)}(\mathrm{ReLU}(W_{c}^{(1)}(\mathrm{GAP}(F_{\text{fused}}))))) \tag{8}$$

where GAP denotes global average pooling that aggregates spatial information into channel-wise statistics. The squeeze–excitation mechanism with reduction ratio $r = 16$ creates a bottleneck that forces the network to learn compact channel importance representations.

The final output integrates both attention mechanisms through multiplicative interaction:

$$Y = X + (A_{\text{spatial}} \otimes A_{\text{channel}}) \odot F_{\text{fused}} \tag{9}$$

where $\otimes$ denotes the outer product and $\odot$ represents element-wise multiplication. The outer product creates a full spatial-channel attention map that can model joint dependencies. The residual connection ensures that the module can preserve useful information from the input when the transformed features provide limited additional value.
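The shape bookkeeping of Equation (9) is easy to verify on a tiny tensor. In the sketch below the attention maps are supplied directly as already-activated values (the convolution and bottleneck layers of Equations (7) and (8) are not modelled), tensors are nested lists in $C \times H \times W$ order, and the function name is illustrative.

```python
def joint_attention_residual(x, fused, a_spatial, a_channel):
    """Compute Y = X + (A_spatial outer A_channel) * F_fused on C x H x W nested lists.

    a_spatial is an H x W map and a_channel a length-C vector; their outer
    product assigns one attention value to every (channel, row, column) triple.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [
                x[c][h][w] + a_channel[c] * a_spatial[h][w] * fused[c][h][w]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for c in range(channels)
    ]


x = [[[1.0, 1.0]], [[1.0, 1.0]]]            # C = 2, H = 1, W = 2
fused = [[[10.0, 10.0]], [[10.0, 10.0]]]
a_spatial = [[1.0, 0.5]]                    # second location is half as important
a_channel = [1.0, 0.0]                      # second channel is fully suppressed
y = joint_attention_residual(x, fused, a_spatial, a_channel)
```

The suppressed channel passes the input through unchanged, which is the behaviour the residual connection is meant to guarantee when the transformed features add little.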

3.4. Scale-Aware Feature Pyramid with Cross-Scale Attention

Standard feature pyramid networks combine features from different pyramid levels through uniform fusion operations, typically element-wise addition or concatenation with fixed weights. However, this uniform treatment fails to account for two important characteristics. First, features at different pyramid levels exhibit distinct statistical properties where higher-level features tend to be more semantically rich but spatially coarse, while lower-level features provide precise localization with limited semantic content. Second, gradient flow during backpropagation naturally favors certain pyramid levels based on their spatial resolution, creating imbalances that can bias optimization.

SAFPN addresses these issues through explicit compensation mechanisms that account for scale-dependent characteristics and attention-based fusion that enables adaptive information aggregation across pyramid levels, as illustrated in Figure 4.

The design employs spatial-variant compensation factors that adapt to local feature characteristics. Rather than applying global scalar weights uniformly across all spatial locations, the module learns pixel-wise compensation:

$$A_{i} = 1 + \frac{\beta_{i}(h, w)}{\sqrt{H_{i} W_{i} + \epsilon}} \tag{10}$$

where $\beta_{i}(h, w) \in \mathbb{R}^{H_{i} \times W_{i}}$ is a learned spatial map specific to pyramid level $i$. The denominator provides scale-dependent normalization that ensures the compensation magnitude is inversely related to feature map size. Higher-resolution features receive stronger compensation to balance their naturally weaker gradient signals. The learnable component $\beta_{i}$ is computed through:

$$\beta_{i} = \mathrm{Sigmoid}(\mathrm{Conv}_{3 \times 3}(\mathrm{Concat}[P_{i}^{\text{in}}, \mathrm{Upsample}(P_{i+1}^{\text{in}})])) \tag{11}$$

This formulation allows the network to determine local compensation strength based on the characteristics of features at adjacent pyramid levels.
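Because $\beta_{i}$ is a sigmoid output, Equation (10) as printed confines $A_{i}$ to the interval $(1, 1 + 1/\sqrt{H_{i} W_{i} + \epsilon})$. The sketch below evaluates that upper bound for the three pyramid levels; the $640 \times 640$ input size and the value of $\epsilon$ are assumptions made for illustration and are not stated in this section.

```python
import math


def compensation_upper_bound(height, width, eps=1e-6):
    """Supremum of A_i = 1 + beta / sqrt(H * W + eps), reached as beta -> 1."""
    return 1.0 + 1.0 / math.sqrt(height * width + eps)


input_size = 640  # hypothetical input resolution
bounds = {
    stride: compensation_upper_bound(input_size // stride, input_size // stride)
    for stride in (8, 16, 32)
}
```

Under these assumptions the bound is about 1.0125 for the stride-8 map, 1.025 for stride 16, and 1.05 for stride 32, so the factor stays close to one at every level and acts as a gentle per-pixel rescaling rather than a large gain.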

Before fusion, the module employs cross-scale spatial attention to model the relevance of features across different pyramid levels. This attention mechanism uses a lightweight query–key–value formulation:

$$Q_{i} = W_{Q} P_{i}^{\text{in}}, \quad K_{i+1} = W_{K}\, \mathrm{Interpolate}(P_{i+1}^{\text{td}}) \tag{12}$$

$$S_{i, i+1} = \mathrm{Softmax}\left(\frac{Q_{i} K_{i+1}^{T}}{\sqrt{d_{k}}}\right) V_{i+1} \tag{13}$$

where $V_{i+1} = W_{V} P_{i+1}^{\text{td}}$. The projection matrices $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d_{k} \times C}$ with $d_{k} = C/8$ reduce computational cost while maintaining representational capacity. This attention mechanism enables the network to dynamically determine which spatial locations in the coarser feature map are most relevant for refining each location in the finer feature map.
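Equation (13) is standard scaled dot-product attention with queries from the finer level and keys and values from the coarser one. The sketch below works on already-projected token matrices (rows are spatial positions, columns are the $d_{k}$ channels) and assumes that keys and values are built from the same set of coarse positions, so that the attention weights and $V_{i+1}$ have matching sizes; the helper names and toy values are illustrative.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise numerically stable softmax."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_scale_attention(q, k, v):
    """q: N_fine x d_k, k and v: N_coarse x d_k. Returns (weights, output)."""
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return weights, matmul(weights, v)


q = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]   # three fine-level positions
k = [[1.0, 0.0], [0.0, 1.0]]                 # two coarse-level positions
v = [[1.0, 2.0], [3.0, 4.0]]
weights, out = cross_scale_attention(q, k, v)
```

The first two fine positions attend almost exclusively to the coarse position whose key they align with, while the uninformative third query falls back to a uniform average of the coarse values.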

The top-down pathway propagates semantic information from coarser to finer scales. At each pyramid level

i \in {4, 3}

, features are fused through:

P_{i}^{t d} = DWConv (A_{i} ⊙ (S_{i, i + 1} + w_{1}^{t d} Upsample (P_{i + 1}^{t d}) + w_{2}^{t d} P_{i}^{i n}))

(14)

where the compensation factor

A_{i}

modulates the combined features before depthwise convolution refines them. The learnable weights

w_{1}^{\mathrm{td}}

and

w_{2}^{\mathrm{td}}

allow the network to balance contributions from upsampled features and original backbone features.

The bottom-up pathway performs reverse fusion to incorporate fine-grained details into higher-level features. A key challenge in multi-scale fusion is spatial misalignment, where objects may occupy different relative positions across pyramid levels due to downsampling operations. To address this, the module employs deformable convolution with learned offsets:

\Delta p_{i} = \mathrm{Conv}_{3 \times 3}\left(\mathrm{Concat}\left[P_{i}^{\mathrm{in}}, P_{i}^{\mathrm{td}}\right]\right)

(15)

The offset field

Δ p_{i}

is predicted based on both original backbone features and top-down features, allowing the network to learn optimal spatial alignment patterns. The deformable convolution then applies:

P_{i}^{\mathrm{out}} = \mathrm{DeformConv}\left(P_{i}^{\mathrm{td}}, \Delta p_{i}\right) + w_{1}^{\mathrm{bu}} A_{i} \odot P_{i}^{\mathrm{in}} + w_{2}^{\mathrm{bu}}\, \mathrm{Downsample}\left(P_{i-1}^{\mathrm{out}}\right)

(16)

where DeformConv applies convolution at adaptively determined spatial locations. Note that while deformable convolution does not significantly increase theoretical GFLOPs, its irregular memory access patterns—sampling features at learned offset positions rather than regular grid locations—are less amenable to GPU parallelization, contributing to inference latency beyond what GFLOPs alone would suggest. This formulation enables three-way fusion incorporating aligned top-down features, compensated backbone features, and downsampled features from the previous bottom-up level.

For the highest pyramid level

P_{5}

, which lacks coarser features for bottom-up fusion:

P_{5}^{\mathrm{out}} = \mathrm{DWConv}\left(A_{5} \odot \left(w_{1}^{\mathrm{bu}} P_{5}^{\mathrm{in}} + w_{2}^{\mathrm{bu}} P_{5}^{\mathrm{td}}\right)\right)

(17)

All fusion weights

\{w_{1}^{\mathrm{td}}, w_{2}^{\mathrm{td}}, w_{1}^{\mathrm{bu}}, w_{2}^{\mathrm{bu}}\}

are initialized to 1.0 and optimized end-to-end during training. An analysis of the converged values of these weights and their implications for scale-specific feature contribution is provided in Section 4.6.4.

3.5. Adaptive Scale IoU Loss with Uncertainty Estimation

Standard IoU-based losses apply uniform optimization pressure to all detection samples regardless of their characteristics. However, objects of different scales exhibit distinct optimization dynamics, where smaller objects typically require more iterations to achieve accurate localization, while larger objects may converge rapidly. Additionally, prediction uncertainty varies across samples, with some detections exhibiting high confidence while others remain ambiguous. Applying uniform loss weighting to such heterogeneous samples can result in inefficient optimization.

ASIoU introduces a principled approach to adaptive supervision that modulates optimization intensity based on two factors: the relative difficulty of each sample within its scale category, and the uncertainty of the prediction.

The framework first partitions objects into scale categories based on ground-truth bounding box area:

s = \begin{cases} \text{small}, & A_{gt} < 32^{2} \\ \text{medium}, & 32^{2} \leq A_{gt} < 96^{2} \\ \text{large}, & A_{gt} \geq 96^{2} \end{cases}

(18)

These thresholds align with standard definitions used in benchmark datasets.
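The partition in Equation (18) amounts to two threshold comparisons on the ground-truth area. The helper below is a minimal illustration; the function name is hypothetical.

```python
def scale_category(gt_area):
    """Equation (18): bucket a ground-truth box by its area in pixels."""
    if gt_area < 32 ** 2:
        return "small"
    if gt_area < 96 ** 2:
        return "medium"
    return "large"


# A 20 x 20 pedestrian box falls in the small bucket, a 128 x 128 bus in the large one.
examples = [scale_category(20 * 20), scale_category(64 * 64), scale_category(128 * 128)]
```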

For each predicted bounding box, the module estimates prediction uncertainty through Monte Carlo dropout. During training, the module performs T stochastic forward passes with dropout enabled:

\sigma_{B}^{2} = \frac{1}{T} \sum_{t=1}^{T} \left\| B_{t} - \bar{B} \right\|^{2}

(19)

where

B_{t}

represents the bounding box prediction in the t-th forward pass, and

\bar{B} = \frac{1}{T} \sum_{t = 1}^{T} B_{t}

is the mean prediction. The variance

σ_{B}^{2}

quantifies prediction fluctuation across different dropout masks. High variance indicates uncertain predictions while low variance suggests confident predictions.
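Given the T box predictions collected from the stochastic passes, Equation (19) reduces to a mean and a mean squared distance. The sketch below assumes each prediction is a plain list of box coordinates; how the passes are produced (dropout placement, box parameterization) is outside its scope, and the function name is hypothetical.

```python
def box_variance(samples):
    """Equation (19): mean prediction and mean squared L2 deviation over T passes.

    `samples` is a list of T box predictions, each a list of coordinates.
    """
    T = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / T for d in range(dim)]
    var = sum(sum((s[d] - mean[d]) ** 2 for d in range(dim)) for s in samples) / T
    return mean, var


# Two hypothetical passes that disagree only on the bottom-right corner.
mean_box, sigma2 = box_variance([[0.0, 0.0, 2.0, 2.0], [0.0, 0.0, 4.0, 4.0]])
```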

Rather than tracking loss statistics with uniform weights, the module employs uncertainty-weighted exponential moving averages for each scale category:

\bar{L}_{s}^{(t)} = \left(1 - m_{s} \cdot w_{\mathrm{unc}}\right) \bar{L}_{s}^{(t-1)} + m_{s} \cdot w_{\mathrm{unc}} \cdot L_{\mathrm{IoU}}^{(t)}

(20)

The uncertainty weight attenuates the influence of unreliable samples:

w_{\mathrm{unc}} = \exp\left(-\frac{\sigma_{B}^{2}}{\tau}\right)

(21)

where the temperature parameter

τ

controls sensitivity to uncertainty. Predictions with high uncertainty receive lower weights in the moving average, preventing noisy samples from distorting category-specific statistics.
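Equations (20) and (21) can be read as an exponential moving average whose update rate is scaled down for uncertain samples. The following sketch uses the temperature (0.5) and the small-object momentum (0.1) reported in Section 4.1 as example values; the function names are hypothetical.

```python
import math


def uncertainty_weight(variance, tau=0.5):
    """Equation (21): weight in (0, 1] that shrinks as prediction variance grows."""
    return math.exp(-variance / tau)


def update_scale_ema(prev_avg, loss, momentum, variance, tau=0.5):
    """Equation (20): moving average with an uncertainty-scaled update rate."""
    rate = momentum * uncertainty_weight(variance, tau)
    return (1.0 - rate) * prev_avg + rate * loss


# A confident sample (zero variance) moves the average; a very uncertain one barely does.
confident = update_scale_ema(prev_avg=1.0, loss=2.0, momentum=0.1, variance=0.0)
uncertain = update_scale_ema(prev_avg=1.0, loss=2.0, momentum=0.1, variance=100.0)
```

With zero variance the update reduces to a standard exponential moving average with momentum m_s.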

The core gradient modulation mechanism scales loss contributions based on the ratio between current sample loss and the category-specific moving average:

R_{s}(e) = \left(\frac{L_{\mathrm{IoU}}}{\bar{L}_{s} + \epsilon}\right)^{\gamma_{s}(e)}

(22)

When a sample’s loss exceeds the category average, the ratio exceeds 1.0 and the power-law relationship amplifies gradients, focusing optimization on difficult cases. The exponent

\gamma_{s}(e)

adapts over training epochs:

\gamma_{s}(e) = \gamma_{s}^{\mathrm{init}} \cdot \left(1 + \lambda \cdot \frac{e}{E_{\mathrm{total}}}\right)

(23)

This formulation starts with moderate amplification and gradually increases emphasis on hard samples as training progresses.
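A short numerical sketch of Equations (22) and (23) follows. The growth coefficient λ and the constant ϵ are not reported in this section, so the defaults below (λ = 0.5, ϵ = 1e-6) are purely hypothetical; the initial exponent 2.0 and the 400-epoch budget are the small-object setting and training length from Section 4.1.

```python
def gamma_schedule(gamma_init, epoch, total_epochs, lam=0.5):
    """Equation (23): exponent grows linearly from gamma_init to gamma_init * (1 + lam).

    lam = 0.5 is a hypothetical value chosen for illustration only.
    """
    return gamma_init * (1.0 + lam * epoch / total_epochs)


def difficulty_ratio(loss, scale_avg, gamma, eps=1e-6):
    """Equation (22): power-law ratio of sample loss to the category average."""
    return (loss / (scale_avg + eps)) ** gamma


gamma_start = gamma_schedule(2.0, epoch=0, total_epochs=400)
gamma_end = gamma_schedule(2.0, epoch=400, total_epochs=400)
hard = difficulty_ratio(loss=0.8, scale_avg=0.4, gamma=gamma_start)  # above average
easy = difficulty_ratio(loss=0.2, scale_avg=0.4, gamma=gamma_start)  # below average
```

A sample with twice the category-average loss is weighted by roughly 2^γ, while a sample at half the average is weighted by roughly 2^(-γ).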

The distance-based weighting component provides spatial regularization that adapts to object scale:

RW_{s} = \exp\left(-\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{A_{gt} \cdot \sigma_{s}^{2}}\right)

(24)

where

(x, y)

and

(x_{gt}, y_{gt})

denote predicted and ground-truth box centers. Normalizing by ground-truth area

A_{gt}

ensures that the effective distance threshold scales appropriately. Larger objects tolerate larger center offsets proportionally.
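The effect of the area normalization in Equation (24) can be checked numerically. The sketch assumes that the "spatial bandwidth" values listed in Section 4.1 (0.5 for small, 0.2 for large objects) correspond to σ_s; the function name and the example boxes are hypothetical.

```python
import math


def distance_weight(center, gt_center, gt_area, sigma):
    """Equation (24): Gaussian-style weight on the squared center offset."""
    dx = center[0] - gt_center[0]
    dy = center[1] - gt_center[1]
    return math.exp(-(dx * dx + dy * dy) / (gt_area * sigma ** 2))


# The same 4-pixel horizontal offset applied to a 16 x 16 and a 128 x 128 object.
w_small = distance_weight((104.0, 100.0), (100.0, 100.0), gt_area=16 * 16, sigma=0.5)
w_large = distance_weight((104.0, 100.0), (100.0, 100.0), gt_area=128 * 128, sigma=0.2)
```

Under these assumed settings the offset lowers the weight of the small object far more than that of the large one, in line with the proportional tolerance described above.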

An auxiliary aspect ratio prediction task provides additional regularization:

L_{\mathrm{aspect}} = \mathrm{SmoothL1}\left(\log\frac{w}{h} - \log\frac{w_{gt}}{h_{gt}}\right)

(25)

Operating in log-space ensures symmetry with respect to aspect ratio inversions and provides stable gradients.
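The symmetry property can be verified with a few lines of Python. The Smooth L1 transition point (beta = 1.0) is an assumption, since the section does not report it, and the function names are hypothetical.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear beyond (beta = 1.0 assumed)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def aspect_loss(w, h, w_gt, h_gt):
    """Equation (25): Smooth L1 on the difference of log aspect ratios."""
    return smooth_l1(math.log(w / h) - math.log(w_gt / h_gt))


# A box twice too wide and a box twice too tall are penalized equally.
too_wide = aspect_loss(20.0, 10.0, 10.0, 10.0)
too_tall = aspect_loss(10.0, 20.0, 10.0, 10.0)
```

In linear space the same two errors would be 1.0 and 0.5 in aspect-ratio units, so the penalty would depend on the direction of the error.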

The complete ASIoU loss integrates all components:

L_{\mathrm{ASIoU}} = R_{s}(e) \cdot RW_{s} \cdot L_{\mathrm{IoU}} + \alpha \cdot L_{\mathrm{aspect}}

(26)

where the multiplicative combination of

R_{s}(e)

and

RW_{s}

enables joint consideration of relative difficulty and spatial quality.

3.6. Framework Integration

The complete CBW-DETR framework integrates the three proposed components within the RT-DETR architecture through coordinated modifications at different pipeline stages. The backbone network employs ContextGFE modules at stages S3, S4, and S5, replacing standard BasicBlock residual modules. These adaptive feature extraction modules process input images to generate multi-scale feature representations

\{F_{3}, F_{4}, F_{5}\}

with reduced parameter count and theoretical computation while maintaining representational capacity through dynamic receptive field selection and multi-resolution wavelet-domain enhancement.

SAFPN receives these multi-scale backbone features and processes them through bidirectional fusion pathways incorporating spatial-adaptive compensation and cross-scale attention. The top-down pathway enriches features with semantic information from coarser scales, while the bottom-up pathway incorporates fine-grained details from higher-resolution features. The output pyramid features

\{P_{3}^{\mathrm{out}}, P_{4}^{\mathrm{out}}, P_{5}^{\mathrm{out}}\}

exhibit balanced gradient flow characteristics and effective integration of information across scales.

These refined pyramid features feed into the transformer decoder, which maintains the standard RT-DETR architecture with IoU-aware query selection and iterative refinement through six transformer layers. The decoder generates object predictions that are supervised through the ASIoU loss, which provides scale-adaptive optimization pressure accounting for both relative sample difficulty and prediction uncertainty.

The training objective combines classification and localization objectives:

L_{total} = λ_{cls} L_{focal} + λ_{box} L_{ASIoU}

(27)

where focal loss addresses class imbalance in the classification task. The loss is computed only for matched query–target pairs determined through Hungarian matching that finds the optimal bipartite assignment minimizing combined classification and localization costs.

4. Experiments

4.1. Experimental Setup

All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3090 GPU (24 GB memory), Intel Core i9-10900K CPU, and 64 GB RAM. The software environment consisted of Ubuntu 20.04 LTS, CUDA 12.4, cuDNN 8.9.0, PyTorch 2.2.0, and Python 3.9.13. Models were implemented using MMDetection [40] toolbox version 3.1.0 for standardized training and evaluation.

Models were trained from scratch using the AdamW optimizer [41] with momentum parameters of 0.9 and 0.999, a weight decay of 0.0001, and a gradient clipping threshold of 0.1. The learning rate followed cosine annealing from the initial value 0.0001 to a minimum of 0.000001 with 5-epoch linear warm-up. Training ran for 400 epochs with batch size 8, requiring approximately 36 h per model. Input images were resized to 640 × 640 pixels with aspect-ratio-preserving padding (value 114). Data augmentation included random horizontal flipping (probability 0.5), random scaling (range [0.8, 1.2]), HSV augmentation (hue ± 0.015, saturation [0.3, 1.7], value [0.6, 1.4]), mosaic augmentation (probability 0.5 for first 350 epochs), mixup (probability 0.1, alpha 0.5), random translation (maximum 0.1 × image size), and random rotation (±10 degrees). Images were normalized using ImageNet statistics. No augmentation was applied during validation and testing.
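For reference, the learning-rate schedule described above can be sketched as a per-epoch function. The base rate, minimum rate, warm-up length, and epoch count are taken from the text; the warm-up starting value (one fifth of the base rate) and the per-epoch rather than per-iteration granularity are assumptions.

```python
import math


def learning_rate(epoch, total_epochs=400, warmup_epochs=5,
                  base_lr=1e-4, min_lr=1e-6):
    """Linear warm-up followed by cosine annealing from base_lr to min_lr."""
    if epoch < warmup_epochs:
        # Assumed ramp: base_lr / warmup_epochs at epoch 0, base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(401)]
```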

The total loss combined focal loss for classification (weight 1.0, focusing parameter 2.0, balance factor 0.25), L1 loss for box regression (weight 5.0), and ASIoU for localization (weight 2.0). For ASIoU, scale-specific parameters were: initial focusing exponents (small 2.0, medium 1.5, large 1.0), momentum factors (small 0.1, medium 0.15, large 0.2), uncertainty temperature 0.5, spatial bandwidth (small 0.5, medium 0.3, large 0.2), and aspect ratio loss weight 0.5. Hungarian matching [18] used costs weighted at 2.0 for classification, 5.0 for L1 regression, and 2.0 for IoU. Positive assignment required an IoU above 0.7, while an IoU below 0.3 designated negative samples.

Performance was evaluated using COCO-style metrics [42]: AP at IoU thresholds of 0.5 (AP50) and 0.75 (AP75), and AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP); scale-specific AP for small (area less than 32² pixels), medium (32² to 96² pixels), and large (above 96² pixels) objects; model parameters (millions); computational cost (GFLOPs at 640 × 640 input); model size (MB); inference speed (FPS on RTX 3090 with batch size 1); and latency (milliseconds per image). Efficiency metrics were averaged over 1000 iterations after 100 warm-up iterations. All reported FPS values include the complete inference pipeline encompassing model forward pass, confidence thresholding, and NMS post-processing, ensuring fair and consistent timing across all evaluated methods.

For ContextGFE, channel partition used 8 groups with kernel sizes {3, 5, 7} and dilation rate 3. The wavelet-domain branch employed one-level Haar wavelet decomposition with a filter reduction ratio of 2, while the attention reduction ratio was 16 with a 7 × 7 spatial kernel. For SAFPN, the cross-scale attention projection dimension was

C / 8

, the compensation stability constant was 0.000001, and deformable convolution [26] sampled 9 offset points. All fusion weights were initialized to 1.0. For uncertainty estimation, Monte Carlo dropout performed 5 forward passes with a dropout probability of 0.1. The transformer decoder used 6 layers with 8 attention heads, feedforward dimension 1024, dropout 0.1, and 300 object queries.

While RT-DETR was architecturally designed to reduce reliance on NMS through IoU-aware query selection, in dense UAV detection scenarios with significant object overlap, a lightweight NMS post-processing step is commonly applied in practice to suppress residual duplicate predictions from the fixed set of 300 object queries. To ensure fair comparison across all detector families, including YOLO-series methods that inherently require NMS, we uniformly apply NMS with an IoU threshold of 0.65 and a confidence threshold of 0.01 to all evaluated methods throughout our experiments. We acknowledge that this departs from the purely end-to-end paradigm of DETR-family architectures. The fast inference mode used a confidence threshold of 0.25 with top-100 selection before NMS. For reproducibility, all random seeds (Python, NumPy, PyTorch, and CUDA) were fixed at 42 with deterministic algorithms enabled, though this reduced training speed by approximately 5–8%.
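The post-processing step can be illustrated with a plain greedy NMS routine using the thresholds stated above. This is a generic, class-agnostic sketch, not the toolbox implementation used in the experiments; whether suppression is applied per class is not stated here, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.65, conf_thr=0.01):
    """Greedy NMS: keep the highest-scoring box, drop overlapping lower-scoring ones."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes, one distinct box, and one below the confidence threshold.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.005]
kept = nms(boxes, scores)
```

In the toy input the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the distant third box is retained.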

4.2. Datasets

To experimentally verify the effectiveness of the CBW-DETR algorithm in the field of small object detection, we selected the VisDrone2019 and DOTA datasets for our experiments.

The VisDrone2019 dataset [29], created by the AISKYEYE team at Tianjin University, is one of the most representative benchmark datasets in the field of drone vision. The dataset collects visual data from scenarios in different cities under various environments and weather conditions. It contains 400 video clips and 8629 static images. The data covers 10 typical urban object classes, such as pedestrians, cars, buses, and bicycles. In the experiment, we divided the 8629 images according to a 7:2:1 ratio, resulting in 6471 images for the training set, 1610 images for the testing set, and the remaining 548 images for the validation set.

Additionally, we used the DOTA dataset [30] to verify the generality of the algorithm. DOTA is a large-scale dataset designed for object detection in remote sensing images, characterized by dense objects and significant scale variations. Since the original image sizes vary considerably (ranging from 800 × 800 to 4000 × 4000 pixels), using the images directly makes training difficult and degrades the results. Therefore, original images were cropped into 1024 × 1024 pixel patches with a stride of 824 pixels, so that adjacent patches overlap by 200 pixels, an overlap ratio of approximately 19.5%. Patches containing no annotated objects were discarded. This produced a total of 21,046 images, of which 15,749 were used for training and 5297 for testing. During inference, predictions from overlapping patches were merged using NMS with an IoU threshold of 0.5 to eliminate duplicate detections at patch boundaries. All models evaluated on DOTA used horizontal bounding box (HBB) annotations and followed the identical cropping and merging protocol to ensure fair comparison.
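The cropping geometry can be sketched along one image axis as follows. The patch size and stride come from the text; the border handling (shifting the last patch back so it ends at the image edge, and using a single patch for images smaller than 1024 pixels) is an assumption, since the protocol for edges is not specified.

```python
def patch_origins(length, patch=1024, stride=824):
    """Start offsets of sliding-window patches along one image axis."""
    if length <= patch:
        return [0]  # assumed: a single (padded) patch for small images
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)  # assumed: shift the last patch to cover the border
    return origins


overlap_ratio = (1024 - 824) / 1024  # 200 overlapping pixels out of 1024
origins_4000 = patch_origins(4000)
```

For a 4000-pixel axis this yields five start offsets, so a 4000 × 4000 image would produce 25 candidate patches before empty ones are discarded.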

4.3. Evaluation Metrics

Precision (P), Recall (R), Average Precision (AP), mean Average Precision (mAP), Model Parameters (Parameters), and Model Computation (GFLOPs) were used as evaluation metrics in the experiments. The formulas are as follows:

P = \frac{TP}{TP + FP}

(28)

R = \frac{TP}{TP + FN}

(29)

AP = \int_{0}^{1} P(R)\, dR

(30)

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}

(31)

True Positive (TP) is the number of positive samples correctly identified by the model; False Positive (FP) is the number of negative samples incorrectly classified as positive; and False Negative (FN) is the number of positive samples incorrectly classified as negative. N is the total number of object classes. mAP50 is the mean of the per-class AP values computed at an Intersection over Union (IoU) threshold of 0.5, and mAP50:95 averages this quantity over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
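Equations (28) to (31) can be illustrated on a toy ranking of detections. The sketch approximates the integral in Equation (30) by summing precision over recall increments without interpolation, which is a simplification of the interpolated procedure used by COCO-style evaluators; all names and numbers are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Equations (28) and (29)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """Equation (30), approximated over detections sorted by descending confidence.

    `is_tp[k]` is True when the k-th ranked detection matches a ground-truth box.
    """
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        recall = tp / num_gt
        ap += (tp / (tp + fp)) * (recall - prev_recall)
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap):
    """Equation (31): mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


p, r = precision_recall(tp=8, fp=2, fn=4)
ap_toy = average_precision([True, False, True], num_gt=2)
map_toy = mean_average_precision([ap_toy, 0.5])
```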

4.4. Comparison with State-of-the-Art Methods

To evaluate whether CBW-DETR achieves favorable accuracy–complexity trade-offs in small object detection, we compared it on the VisDrone2019 dataset with current mainstream detectors, including YOLOv5, YOLOv8, YOLOv10, YOLOv11, Deformable DETR, and DINO, as well as with recently proposed improved algorithms. The experimental results are shown in Table 1.

The CBW-DETR algorithm achieved 64.7%, 49.9%, 51.5%, and 32.5% in Precision, Recall, mAP50, and mAP50:95 respectively, all of which were the highest among the compared methods in this study. In terms of model complexity, the CBW-DETR algorithm has the smallest parameter count (14.3 M) and model size (28.0 MB) among all compared methods, representing significant reductions compared to heavier architectures such as YOLOv5l (47.9 M), YOLOv8l (43.6 M), Deformable DETR (40.0 M), and DINO (47.0 M).

Inference speed is the one metric on which CBW-DETR trails several competitors. CBW-DETR achieves 73.6 FPS, which is notably lower than the RT-DETR-R18 baseline (94.1 FPS), YOLOv5m (120.1 FPS), YOLOv8m (103.5 FPS), and YOLOv11m (95.4 FPS). As analyzed in detail in Section 4.6.1, this reduction is attributable to memory-access-intensive operations in ContextGFE and SAFPN rather than increased arithmetic computation. However, CBW-DETR remains faster than several compared methods, including YOLOv8l (63.5 FPS), YOLOv11l (70.4 FPS), Improved RT-DETR [45] (54.6 FPS), and Frequency-Enhanced [46] (69.0 FPS), and substantially exceeds the 30 FPS real-time threshold commonly required for UAV applications. In summary, among the methods evaluated in this study, CBW-DETR achieves the highest detection accuracy with a compact model footprint, while maintaining real-time inference capability at a moderate throughput cost relative to its direct baseline.

4.5. Generalization Experiments

To evaluate the generalization ability of the CBW-DETR algorithm on other aerial object datasets, we conducted comprehensive experiments on the DOTA dataset. To ensure a thorough generalization evaluation, we benchmark CBW-DETR against a broad set of methods including YOLO-series variants, transformer-based detectors, and recent improved models, following the same scope as the VisDrone2019 evaluation. All models were trained and evaluated on the same DOTA split with the identical preprocessing protocol described in Section 4.2 (1024 × 1024 crops, stride 824, overlap 19.5%, horizontal bounding boxes). The comparison results are shown in Table 2.

From the table, it can be observed that CBW-DETR achieves 77.9% Precision, 70.4% Recall, 73.3% mAP50, and 48.5% mAP50:95 on the DOTA dataset, obtaining the best detection accuracy results across all metrics among the compared methods. In terms of model complexity, CBW-DETR maintains its compact footprint with 14.3 M parameters and a 28.0 MB model size, consistent with the VisDrone2019 results. The improvement patterns on DOTA are broadly consistent with those observed on VisDrone2019: CBW-DETR achieves meaningful accuracy gains over the RT-DETR-R18 baseline (+1.8% mAP50, +1.7% mAP50:95) while significantly reducing parameters (28.1%) and theoretical computation (18.0%). Compared with heavier architectures such as YOLOv8l and DINO, CBW-DETR achieves higher accuracy with substantially fewer parameters and lower computation.

Regarding inference speed on DOTA, CBW-DETR achieves 71.2 FPS, which is lower than the RT-DETR-R18 baseline (96.3 FPS) for the same reasons analyzed in Section 4.6.1. This is consistent with the throughput reduction pattern observed on VisDrone2019 and reflects the inherent characteristics of the proposed modules rather than dataset-specific factors. In summary, the cross-dataset results demonstrate that CBW-DETR’s accuracy improvements and model complexity reductions generalize across different aerial imagery domains, validating the framework’s robustness. However, the throughput–accuracy trade-off also persists consistently across datasets.

4.6. Ablation Study

To understand the individual and combined contributions of the proposed ContextGFE, SAFPN, and ASIoU modules to the overall performance, a series of ablation experiments were designed on the VisDrone2019 dataset. The results are shown in Table 3.

Compared with the original RT-DETR-R18 baseline, the full CBW-DETR model reduced theoretical computation by 18.0%, parameters by 28.1%, and model size by 27.4%, while increasing Precision by 2.6%, Recall by 3.0%, mAP50 by 3.3%, and mAP50:95 by 3.2%. The following conclusions can be drawn from the progressive module addition analysis.

Adopting the ContextGFE module alone improved detection accuracy (mAP50 +1.2%) while significantly reducing model complexity (parameters reduced by 5.2 M, computation reduced by 13.1 GFLOPs), supporting the module’s advantages in feature extraction and lightweight design. The Haar wavelet decomposition in the frequency-domain branch further contributes to edge feature capture, with the detail sub-bands providing spatially localized multi-resolution representations that benefit small-object boundary delineation. Adopting the SAFPN module alone improved detection performance (mAP50 +1.0%), especially for small objects, at a minor increase in computational cost (+2.0 GFLOPs), supporting the effectiveness of the scale-compensated weighted fusion strategy. Using the ASIoU loss function alone improved localization accuracy (mAP50 +1.6%) with no additional parameters or computation, indicating that its scale-adaptive dynamic gradient adjustment mechanism can effectively optimize the regression of targets at different scales. When the three modules were combined, the model achieved a 28.1% parameter reduction and an 18.0% reduction in theoretical computation together with a 3.3% increase in mAP50, indicating that the improvement strategies are complementary rather than redundant.

4.6.1. Inference Latency Analysis

It is important to note that while the proposed modules reduce theoretical computation (GFLOPs) and parameter count, the actual inference speed decreased from 94.1 FPS (baseline) to 73.6 FPS (full CBW-DETR), representing a 21.8% reduction in throughput. To understand the sources of this discrepancy between theoretical and empirical efficiency, we conducted a per-module latency breakdown analysis on the RTX 3090 with 640 × 640 input resolution, as shown in Table 4.
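The throughput figures quoted above translate into per-image latency as follows. This is simple arithmetic on the two reported FPS values; the helper names are hypothetical.

```python
def latency_ms(fps):
    """Per-image latency in milliseconds implied by a throughput in frames per second."""
    return 1000.0 / fps


def relative_drop(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


baseline_ms = latency_ms(94.1)   # RT-DETR-R18 baseline
full_ms = latency_ms(73.6)       # full CBW-DETR
drop_pct = round(relative_drop(94.1, 73.6) * 100, 1)
```

The implied latencies are roughly 10.6 ms and 13.6 ms per image, so the 21.8% throughput reduction corresponds to about 3 ms of additional end-to-end time per frame.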

The analysis reveals three primary sources of the throughput reduction. First, ContextGFE’s four parallel convolution branches with different kernel sizes (3 × 3, 5 × 5, 7 × 7, dilated 3 × 3) require separate memory allocation and kernel launches, creating fragmented GPU utilization despite the preceding channel reduction. The DWT/IDWT operations contribute an additional 0.5 ms due to spatial data reshuffling across feature map dimensions, which is memory-bandwidth-bound rather than compute-bound. Second, SAFPN’s deformable convolutions sample features at learned offset positions rather than regular grid locations, introducing irregular memory access patterns that prevent efficient cache utilization and hardware-level memory coalescing. Third, the cross-scale attention mechanism, while lightweight in terms of GFLOPs, introduces sequential softmax and matrix multiplication operations that create pipeline stalls.

Despite this throughput reduction, we emphasize that 73.6 FPS substantially exceeds the commonly adopted 30 FPS real-time threshold for UAV applications and remains practical for deployment scenarios. The primary efficiency advantage of CBW-DETR lies in its significantly reduced model footprint: 14.3 M parameters and a 28.0 MB model size represent 28.1% and 27.4% reductions, respectively, which directly benefit memory-constrained edge deployment platforms where model storage and runtime memory are the binding constraints, rather than raw computational throughput.

4.6.2. NMS Effect Analysis

To quantify the contribution of NMS post-processing to the overall detection performance and to clarify the distinction between the architectural design of DETR-family methods and the evaluation protocol adopted in this work, we report CBW-DETR results with and without NMS in Table 5.

As shown in Table 5, applying NMS yields a 1.3% improvement in mAP50 and a notable 3.4% increase in Precision, primarily by suppressing duplicate predictions in densely populated regions. The Recall remains unchanged since NMS only removes redundant predictions rather than introducing new ones. This result confirms that while CBW-DETR can operate without NMS and still achieve competitive accuracy (50.2% mAP50), the lightweight post-processing step provides meaningful performance gains in dense UAV scenarios, justifying its inclusion in our evaluation protocol. The NMS processing adds only 0.4 ms per image (Table 4), representing a negligible latency cost relative to the accuracy benefit.

4.6.3. Wavelet-Domain Enhancement Analysis

To investigate the effectiveness of the Discrete Wavelet Transform (DWT) as the frequency-domain enhancement strategy in ContextGFE, we conducted comparative experiments between FFT-based and DWT-based decomposition, as shown in Table 6. DWT with Haar wavelets achieves a higher mAP50 (49.4% vs. 49.1%), which we attribute to its spatially localized multi-resolution decomposition property. Unlike FFT, which operates globally in the frequency domain, DWT’s detail sub-bands

\{LH, HL, HH\}

directly encode horizontal, vertical, and diagonal edge orientations at multiple scales, providing more targeted feature representations for small-object boundary delineation in UAV imagery. Based on these findings, we adopt DWT as the frequency-domain enhancement component in the final ContextGFE design.

4.6.4. SAFPN Fusion Weight Convergence Analysis

To verify that the learnable fusion weights in SAFPN actively contribute to scale-specific feature compensation rather than remaining at their initialization values, we tracked the converged values of

\{w_{1}^{\mathrm{td}}, w_{2}^{\mathrm{td}}, w_{1}^{\mathrm{bu}}, w_{2}^{\mathrm{bu}}\}

after 400 epochs of training on VisDrone2019. The results are presented in Table 7.

The converged weights deviate meaningfully from their initialization value of 1.0, with deviations ranging from 9% to 28%. Several observations can be drawn. In the top-down pathway, the network learns to assign a higher weight (

w_{1}^{\mathrm{td}} = 1.28

) to upsampled coarse-scale features that carry rich semantic information, while moderately attenuating original backbone features (

w_{2}^{\mathrm{td}} = 0.83

) that contain more redundant spatial detail at the current level. This asymmetry suggests that for UAV imagery with small objects, the semantic context from higher pyramid levels is more valuable than the local features at each level during top-down refinement. In the bottom-up pathway, the compensation factor-modulated backbone features receive increased weight (

w_{1}^{\mathrm{bu}} = 1.14

), indicating that the spatial compensation mechanism in

A_{i}

effectively enhances backbone features to the point where the network prefers them over the downsampled contributions (

w_{2}^{\mathrm{bu}} = 0.91

) from finer levels. These results confirm that the learnable weights are actively adapting to the data distribution and contributing to the scale-aware fusion strategy, rather than remaining inert at their initialization values.

4.7. Loss Function Comparison

To evaluate the performance of ASIoU, quantitative comparative experiments were conducted on the VisDrone2019 dataset, comparing ASIoU with GIoU (baseline algorithm), DIoU, CIoU, EIoU, FocalEIoU, InnerFocalEIoU, ShapeIoU, InnerDIoU, InnerCIoU, and InnerEIoU. The comparison results are shown in Table 8. ASIoU achieved the highest mAP50 and Precision, reaching 49.8% and 63.5%, respectively, and performed well in Recall and mAP50:95. Overall, ASIoU demonstrates the most balanced performance across all metrics, indicating that it is well suited as a loss function for UAV-based small object detection.

4.8. Visualization Analysis

To demonstrate the performance of the CBW-DETR algorithm more comprehensively, this section provides two kinds of comparison: training curves of the main performance indicators before and after model improvement, and visual detection results.

First, as shown in Figure 5, the four curves on the VisDrone2019 test set reflect the overall detection performance before and after model enhancement. The CBW-DETR algorithm shows consistent improvements in Precision, Recall, mAP50, and mAP50:95 throughout the training process. This indicates that the ContextGFE module, SAFPN module, and ASIoU loss function effectively enhance the model’s sensitivity and accuracy for small object detection, giving the model stronger recognition capabilities in complex scenes.

Four different scenarios were selected from the VisDrone2019 dataset to compare the detection effects of the model before and after improvement, as shown in Figure 6. For the daytime scenario shown in Figure 6a, the CBW-DETR algorithm identifies smaller objects more accurately than the RT-DETR-R18 model, demonstrating improved sensitivity to fine-grained targets. In the dense object scenario of Figure 6b, the CBW-DETR algorithm can effectively identify and locate numerous objects, providing more accurate recognition in dense scenes and demonstrating strong capability for handling complex scenarios. In the night environment of Figure 6c, the CBW-DETR algorithm maintains stable performance and can effectively identify dense objects under low-light conditions. In the occluded scenario of Figure 6d, facing partially visible targets, the CBW-DETR algorithm can still accurately identify the objects, providing more complete and accurate detection results.

To more intuitively visualize the model’s attention mechanism, Grad-CAM++ technology was used for heatmap visualization on the VisDrone2019 dataset. The heatmap represents the degree of feature response through color gradients, where red areas represent the highest activation values, indicating the regions of highest model attention. As shown in the comparison between Figure 7a and Figure 7c, the RT-DETR-R18 algorithm exhibits missing detections and false positives, while the CBW-DETR algorithm accurately detects vehicles. In Figure 7b,d, RT-DETR-R18 shows notable false detection problems, whereas the CBW-DETR algorithm successfully avoids such issues. The heatmap comparison reveals that CBW-DETR produces more focused and concentrated attention regions on actual object locations, suggesting that the proposed ContextGFE and SAFPN modules enable the model to better distinguish foreground objects from background clutter. These qualitative results are consistent with the quantitative improvements observed in Table 1 and Table 3, providing visual evidence for the effectiveness of the proposed architectural designs.

5. Conclusions

This paper presents CBW-DETR, a lightweight transformer-based framework designed specifically for small object detection in UAV imagery. Through three coordinated innovations consisting of ContextGFE for compact feature extraction, SAFPN for scale-aware feature fusion, and ASIoU for adaptive optimization, the framework achieves notable improvements in detection accuracy alongside significant reductions in model parameters (28.1%) and theoretical computation (18.0%) compared to the RT-DETR-R18 baseline.

Experimental validation on VisDrone2019 demonstrates that CBW-DETR substantially reduces model complexity compared to the baseline RT-DETR while achieving improvements in detection accuracy across all evaluation metrics. Cross-dataset validation on DOTA, conducted against a comparably broad set of methods, confirms generalization capability across diverse aerial imagery characteristics. Among the methods evaluated in this study, CBW-DETR achieves the highest detection accuracy with a compact model footprint of 14.3 M parameters and 28.0 MB, demonstrating favorable accuracy–complexity trade-offs compared to recent YOLO variants and transformer-based detectors.

The proposed ContextGFE module achieves parameter and computation reductions through adaptive receptive field selection and multi-resolution wavelet-domain enhancement mechanisms. Specifically, the incorporation of Discrete Wavelet Transform with Haar wavelets enables spatially localized decomposition into approximation and detail sub-bands, where the detail coefficients explicitly encode horizontal, vertical, and diagonal edge orientations at multiple scales. This property proves particularly beneficial for small-object boundary delineation in UAV imagery, and ablation experiments confirm its superiority over FFT-based global frequency decomposition. The SAFPN module introduces spatial-variant compensation factors and cross-scale attention mechanisms to address gradient flow imbalance across pyramid levels, with particularly notable improvements in small-object recall performance. Convergence analysis of the learnable fusion weights confirms that they actively adapt to the data distribution, providing meaningful scale-specific contributions rather than remaining at initialization values. The ASIoU loss function implements uncertainty-aware gradient modulation and scale-specific optimization strategies, enhancing localization accuracy across objects of varying sizes.

Visualization analysis shows robust detection performance across challenging scenarios, including nighttime conditions with limited illumination, dense object distributions with significant overlap, and scenes with severe occlusions. Grad-CAM++ attention heatmaps indicate that CBW-DETR focuses more effectively on salient object regions and suppresses background interference better than the RT-DETR-R18 baseline, providing qualitative support for the effectiveness of the proposed architectural designs.

5.1. Limitations

Despite the improvements achieved, several limitations of the current work should be acknowledged. First, while CBW-DETR reduces theoretical computation (GFLOPs) and model parameters, the actual inference throughput decreases from 94.1 FPS to 73.6 FPS (a 21.8% reduction) compared to the baseline. As discussed in the inference latency analysis, this discrepancy arises because the proposed modules introduce memory-access-intensive computations that are less amenable to GPU parallelization than the standard convolutions they replace. The main contributors are ContextGFE’s multi-branch parallel convolutions and DWT/IDWT spatial data reshuffling, SAFPN’s deformable convolutions with irregular memory access patterns, and cross-scale attention with sequential softmax operations. Although 73.6 FPS comfortably exceeds the 30 FPS real-time threshold for most UAV applications, this throughput–accuracy trade-off should be carefully considered for latency-critical deployments. Second, all inference speed measurements were conducted exclusively on an NVIDIA RTX 3090 GPU. The latency characteristics of the proposed modules may differ substantially on resource-constrained edge deployment platforms commonly used in UAV systems (e.g., NVIDIA Jetson Orin Nano, mobile SoCs), where memory bandwidth limitations could further exacerbate the gap between theoretical and empirical efficiency. Benchmarking on such platforms remains an important validation step that is not covered in the current work. Third, while the generalization experiments on the DOTA dataset have been expanded to include a broad set of comparison methods, future work could further strengthen generalization claims by evaluating additional aerial detection benchmarks beyond VisDrone2019 and DOTA. Fourth, while RT-DETR was architecturally designed to operate without NMS, our experimental protocol applies NMS uniformly to all methods for fair comparison. This means the reported results do not reflect a purely end-to-end detection paradigm, and the 1.3-percentage-point mAP50 improvement attributable to NMS (50.2% without NMS vs. 51.5% with NMS, as shown in the NMS effect analysis) should be considered when interpreting the results.
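The relative reductions quoted in this paper can be reproduced from the reported baseline and CBW-DETR values. The short sketch below does that arithmetic; the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, while the numbers are the RT-DETR-R18 and CBW-DETR entries reported for VisDrone2019.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# (baseline RT-DETR-R18, CBW-DETR) pairs as reported on VisDrone2019.
reported = {
    "params_M": (19.9, 14.3),  # model parameters in millions
    "gflops": (57.2, 46.9),    # theoretical computation
    "fps": (94.1, 73.6),       # measured throughput on an RTX 3090
}

# Round to one decimal place, matching the precision used in the text.
reductions = {
    name: round(relative_reduction(base, ours), 1)
    for name, (base, ours) in reported.items()
}
print(reductions)
```

The first two entries are savings, whereas the third is a cost: throughput falls by a larger fraction than the theoretical computation does, which is the gap between theoretical and empirical efficiency discussed in the first limitation.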

5.2. Future Work

Future research directions include investigating operator-level optimization strategies (such as kernel fusion and custom CUDA kernels for the DWT/IDWT and multi-branch convolution operations) to narrow the gap between theoretical computation reduction and actual inference throughput. Conducting comprehensive deployment benchmarks on edge computing platforms commonly used in UAV systems would provide practical validation of the framework’s real-world applicability. Expanding the cross-dataset evaluation to include additional aerial detection benchmarks with a broader set of comparison methods would strengthen generalization claims. Other promising directions include investigating adaptive query selection mechanisms for the transformer decoder to further enhance detection performance on extremely small objects, exploring learnable wavelet basis functions as alternatives to fixed Haar wavelets to enable data-driven multi-resolution decomposition, integrating temporal information for improved consistency in video-based detection scenarios, and extending the framework to multi-task learning paradigms that jointly optimize detection with complementary tasks such as object tracking and instance segmentation.

Author Contributions

Conceptualization, S.Q. and K.C.; methodology, S.Q.; software, S.Q.; validation, S.Q., K.C. and Y.W.; formal analysis, S.Q.; investigation, S.Q.; resources, K.C. and Y.W.; data curation, S.Q.; writing—original draft preparation, S.Q.; writing—review and editing, K.C. and Y.W.; visualization, S.Q.; supervision, K.C.; project administration, K.C.; funding acquisition, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available. The VisDrone2019 dataset can be accessed from the official website: https://github.com/VisDrone/VisDrone-Dataset (accessed on 2 April 2026). The DOTA dataset is available at https://captain-whu.github.io/DOTA/dataset.html (accessed on 2 April 2026). All experimental code and related materials supporting the conclusions of this article are available from the corresponding author (Ke Cheng, email: chengke1972@just.edu.cn) upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Tan, L.; Liu, Z.; Huang, X.; Li, D.; Wang, F. A transformer-based UAV instance segmentation model TFYOLOv7. Signal Image Video Process. 2024, 18, 3299–3308.
2. Tang, C.; Li, Y.; Wang, L.; Li, W. Real-time traffic light detection based on lightweight improved RT-DETR. J. Real-Time Image Process. 2025, 22, 82.
3. Pan, J.; Song, S.; Guan, Y.; Jia, W. Improved Wheat Detection Based on RT-DETR Model. IAENG Int. J. Comput. Sci. 2024, 52, 705–719.
4. Ren, Y.; Huang, L.; Du, F.; Yao, X. An efficient and lightweight skin pathology detection method based on multi-scale feature fusion using an improved RT-DETR model. J. South. Med. Univ. 2025, 45, 409–421.
5. Liang, N.; Liu, W. Small Target Detection Algorithm for Traffic Signs Based on Improved RT-DETR. Eng. Lett. 2025, 33, 140.
6. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974.
7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
9. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 7263–7271.
10. Gai, R.; Chen, N.; Yuan, H. A detection algorithm for cherry fruits based on the improved YOLO-v4 model. Neural Comput. Appl. 2023, 35, 13895–13906.
11. Wu, T.H.; Wang, T.W.; Liu, Y.Q. Real-time vehicle and distance detection based on improved YOLOv5 network. In Proceedings of the 2021 3rd World Symposium on Artificial Intelligence (WSAI), Guangzhou, China, 18–20 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 24–28.
12. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
13. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 7464–7475.
14. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.-M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
15. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
16. Salscheider, N.O. FeatureNMS: Non-Maximum Suppression by Learning Feature Embeddings. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7848–7854.
17. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2961–2969.
18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the 2020 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
19. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159.
20. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query Denoising. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 13619–13627.
21. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
22. Li, D.; Yang, P.; Zou, Y. Optimizing Insulator Defect Detection with Improved DETR Models. Mathematics 2024, 12, 1507.
23. Lin, C.; Zhong, Y.; Kong, Y.; Chen, R.; Xie, Z. Real-time Detection of Strawberry Leaf Blight Based on Improved RT-DETR. Inf. Technol. Informatiz. 2025, 1, 79–82.
24. He, L.; Jiang, M.; Ohbuchi, R.; Furuya, T.; Zhang, M.; Li, P. Scale Adaptive Feature Pyramid Networks for 2D Object Detection. Sci. Program. 2020, 2020, 8839979.
25. Wu, C.; Zhang, D.; Zhang, L.; Chen, R.; Mao, S. Research on Pine Cone Detection in Forest Based on RT-DETR. J. For. Sci. 2025, 61, 25–37.
26. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125.
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768.
28. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10781–10790.
29. Du, D.W.; Zhu, P.F.; Wen, L.Y.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 213–226.
30. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3974–3983.
31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Curran Associates: Red Hook, NY, USA, 2017; Volume 30.
32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
33. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 658–666.
34. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the 2020 AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; AAAI Press: Palo Alto, CA, USA, 2020; Volume 34, pp. 12993–13000.
35. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988.
36. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157.
37. Zhang, H.; Zhang, S. Shape-IOU: More Accurate Metric Considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663.
38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
39. Mallat, S.G. A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693.
40. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
41. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 2019 International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
42. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 740–755.
43. Zhang, R.F.; Du, Y.T.; Cheng, X.H. Small Target Detection Algorithm BiEO-YOLOv8s from UAV Perspective. Laser Optoelectron. Prog. 2025, 62, 0437002.
44. Nie, Y.; Lai, H.C.; Gao, G.X. Small Target Detection and Tracking with Improved YOLOv7+ByteTrack. Comput. Eng. Appl. 2024, 60, 189–202.
45. Cheng, X.M.; Zhang, X.S.; Cao, B.J.; Song, C.L. Research on Small Object Detection Method Based on Improved RT-DETR. Comput. Eng. Appl. 2025, 61, 144–155.
46. Li, J.; Wang, X.M. Aerial Small Object Detection Algorithm Based on Frequency Enhancement and Fine-Grained Fusion. J. Shaanxi Univ. Sci. Technol. 2025, 5, 175–186.
47. Xie, G.B.; Li, X.; Lin, Z.Y. Drone-DETR Algorithm for Object Detection in UAV Images. Electron. Opt. Control 2025, 32, 70.
48. Gao, W.F.; Yi, Y.X.; Huang, L.L.; Li, L.Y.; Li, H.; Xie, J. An Efficient UAV Aerial Photography-Based Small Target Detection Algorithm. Control Decis. 2025, 8, 2525–2533.

Figure 1. Overall architecture of CBW-DETR with three stages: ContextGFE for adaptive multi-scale extraction, SAFPN for cross-scale attention fusion, and ASIoU for uncertainty-aware supervision.

Figure 2. RT-DETR baseline architecture showing the three-stage pipeline. Our modifications target model complexity in feature extraction, gradient balance in feature fusion, and scale-adaptive optimization in supervision.

Figure 3. ContextGFE architecture combining adaptive receptive field selection, wavelet-domain enhancement, and joint spatial-channel attention. The gating mechanism learns to emphasize relevant scales for each input. The * denotes the convolution operation in the PCConv module.

Figure 4. SAFPN architecture with bidirectional pathways. Spatial-adaptive compensation factors and cross-scale attention enable balanced gradient flow and effective multi-scale feature integration.

Figure 5. Performance curve comparison between RT-DETR-R18 and CBW-DETR on VisDrone2019.

Figure 6. Visual comparison between RT-DETR-R18 (top row) and CBW-DETR (bottom row) on different scenarios: (a) daytime natural environment; (b) daytime urban road; (c) nighttime illuminated area; (d) nighttime low-light environment. Red circles highlight regions where the baseline model fails to detect objects, while the proposed CBW-DETR successfully identifies them.

Figure 7. Grad-CAM++ heatmap comparison between RT-DETR-R18 (top row) and CBW-DETR (bottom row): (a,b) RT-DETR-R18 heatmaps; (c,d) CBW-DETR heatmaps. Red circles highlight regions where the baseline model exhibits missing detections or false positives, which are correctly identified by CBW-DETR.

Table 1. Comparison with state-of-the-art methods on VisDrone2019. Bold values indicate the best performance.

Model	P%	R%	mAP50/%	mAP50:95/%	Param/M	GFLOPs	FPS	Size/MB
YOLOv5m	50.3	37.9	36.3	19.2	22.1	52.3	120.1	42.6
YOLOv5l	45.1	35.2	38.7	24.3	47.9	114.2	77.9	91.9
YOLOv8m	55.7	44.3	40.9	24.3	25.8	78.7	103.5	50.8
YOLOv8l	57.4	45.3	45.7	28.1	43.6	165.2	63.5	85.6
YOLOv10m	54.1	40.8	42.0	25.8	16.5	63.5	80.5	31.9
YOLOv10l	55.2	42.5	44.2	27.2	25.7	126.4	91.3	49.7
YOLOv11m	53.7	42.5	43.8	26.8	20.0	67.7	95.4	38.6
YOLOv11l	56.2	42.9	46.8	29.0	25.3	86.8	70.4	38.6
Deformable DETR	52.6	31.2	42.2	27.1	40.0	196.0	73.2	76.0
DINO	57.1	35.2	46.2	29.4	47.0	279.0	89.3	91.2
RT-DETR-R18	62.1	46.9	48.2	29.3	19.9	57.2	94.1	38.6
BiEO-YOLOv8s [43]	57.1	46.5	45.8	26.3	34.2	—	—	—
YOLOv7 + ByteTrack [44]	56.0	44.0	46.8	—	11.3	—	—	—
Improved RT-DETR [45]	64.3	48.8	50.8	31.7	14.6	49.6	54.6	—
Frequency-Enhanced [46]	62.9	49.3	51.1	32.2	14.3	50.3	69.0	—
Drone-DETR [47]	62.6	48.7	50.4	31.1	19.1	68.8	—	—
CBW-DETR (Ours)	64.7	49.9	51.5	32.5	14.3	46.9	73.6	28.0

Table 2. Generalization experiments on DOTA dataset. Bold values indicate the best performance in each column.

Model	P%	R%	mAP50/%	mAP50:95/%	Param/M	GFLOPs	FPS	Size/MB
YOLOv5m	68.2	60.5	63.8	41.2	22.1	52.3	122.4	42.6
YOLOv5l	70.1	62.3	65.7	43.1	47.9	114.2	79.6	91.9
YOLOv8m	72.5	64.1	67.3	44.0	25.8	78.7	105.2	50.8
YOLOv8l	74.3	66.2	69.5	45.8	43.6	165.2	64.8	85.6
YOLOv10m	71.8	63.5	66.4	43.5	16.5	63.5	82.1	31.9
YOLOv10l	73.6	65.4	68.7	45.1	25.7	126.4	93.0	49.7
YOLOv11m	72.1	64.8	67.9	44.3	20.0	67.7	97.1	38.6
YOLOv11l	74.8	67.1	70.6	46.2	25.3	86.8	72.3	38.6
Deformable DETR	71.4	61.8	65.1	42.7	40.0	196.0	74.5	76.0
DINO	75.2	67.5	71.2	47.0	47.0	279.0	90.8	91.2
RT-DETR-R18	77.4	68.9	71.5	46.8	19.9	57.2	96.3	38.6
UAV-Based Detection [48]	—	—	67.9	45.2	19.4	110.0	—	—
CBW-DETR (Ours)	77.9	70.4	73.3	48.5	14.3	46.9	71.2	28.0

Table 3. Ablation study on VisDrone2019 dataset. ✓ denotes using the corresponding module, and × denotes not using it.

ContextGFE	SAFPN	ASIoU	P%	R%	mAP50/%	mAP50:95/%	Param/M	GFLOPs	FPS	Size/MB
×	×	×	62.1	46.9	48.2	29.3	19.9	57.2	94.1	38.6
✓	×	×	62.3	48.6	49.4	30.5	14.7	44.1	78.8	28.8
×	✓	×	63.8	47.8	49.2	30.2	19.5	59.2	87.7	37.9
×	×	✓	63.5	47.7	49.8	30.5	19.9	57.2	94.6	38.6
✓	✓	×	63.1	49.3	50.1	30.9	14.3	46.9	76.3	28.0
✓	×	✓	63.7	49.2	51.0	31.5	14.7	44.1	77.8	28.8
×	✓	✓	64.1	48.5	50.5	31.2	19.5	59.2	89.7	37.9
✓	✓	✓	64.7	49.9	51.5	32.5	14.3	46.9	73.6	28.0

Table 4. Per-module inference latency breakdown (ms per image, RTX 3090, 640 × 640 input). Bold values denote the total latency.

Module	Latency (ms)	Primary Bottleneck
Backbone (BasicBlock, baseline)	3.8	Compute-bound
Backbone (ContextGFE)	5.6	Memory access (multi-branch + DWT)
– Multi-branch convolutions	+1.1	Fragmented parallel paths
– DWT/IDWT operations	+0.5	Spatial data reshuffling
– Gating + attention	+0.2	Sequential dependencies
Neck (CCFF, baseline)	2.1	Compute-bound
Neck (SAFPN)	3.4	Memory access (deformable conv)
– Deformable convolution	+0.7	Irregular memory access
– Cross-scale attention	+0.4	QKV projections + softmax
– Compensation factors	+0.2	Element-wise operations
Encoder + Decoder	4.3	Compute-bound
NMS post-processing	0.4	Sequential
Total (baseline)	10.6	—
Total (CBW-DETR)	13.7	—
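The stage-level entries of the latency breakdown can be checked for internal consistency with a few lines of arithmetic. The sketch below sums the per-stage latencies and converts each total into an implied throughput; the dictionary keys and helper names are hypothetical, and the values are the stage totals from the table.

```python
# Stage-level latencies in ms per image (RTX 3090, 640 x 640 input).
baseline_ms = {"backbone": 3.8, "neck": 2.1, "encoder_decoder": 4.3, "nms": 0.4}
cbw_detr_ms = {"backbone": 5.6, "neck": 3.4, "encoder_decoder": 4.3, "nms": 0.4}


def total_latency_ms(stages):
    """Sum of per-stage latencies for one image."""
    return sum(stages.values())


def implied_fps(latency_ms):
    """Throughput implied by a per-image latency, assuming no batching or overlap."""
    return 1000.0 / latency_ms


# Extra latency that each stage adds relative to the baseline.
overhead_ms = {
    stage: round(cbw_detr_ms[stage] - baseline_ms[stage], 1)
    for stage in baseline_ms
}

baseline_total = round(total_latency_ms(baseline_ms), 1)
cbw_detr_total = round(total_latency_ms(cbw_detr_ms), 1)
print(baseline_total, cbw_detr_total, overhead_ms)
print(implied_fps(baseline_total), implied_fps(cbw_detr_total))
```

The implied throughputs land close to, but not exactly on, the measured 94.1 FPS and 73.6 FPS, which is expected if the FPS figures and the per-module timings come from separate measurement runs; that interpretation is an assumption, since the measurement protocol is not restated here. The backbone accounts for the larger share of the added latency (1.8 ms of 3.1 ms).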

Table 5. Effect of NMS post-processing on CBW-DETR performance (VisDrone2019). Bold values indicate the best performance.

Configuration	P%	R%	mAP50/%	mAP50:95/%
CBW-DETR (w/o NMS, top-300 queries)	61.3	49.9	50.2	31.6
CBW-DETR (with NMS, IoU = 0.65)	64.7	49.9	51.5	32.5
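For readers unfamiliar with the post-processing step compared in this table, the following is a minimal sketch of greedy non-maximum suppression in NumPy. It assumes axis-aligned boxes in `[x1, y1, x2, y2]` format with positive area and performs class-agnostic suppression; whether the paper applies NMS per class is not stated in this section, so that choice, like the function names, is an assumption. Only the IoU threshold of 0.65 is taken from the table.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def greedy_nms(boxes, scores, iou_threshold=0.65):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)  # candidate indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too strongly.
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep


# Toy detections (assumption): two near-duplicates and one separate box.
toy_boxes = [[0, 0, 10, 10], [1, 0, 11, 10], [20, 20, 30, 30]]
toy_scores = [0.9, 0.8, 0.7]
kept = greedy_nms(toy_boxes, toy_scores, iou_threshold=0.65)
```

In the toy example the second box overlaps the first with an IoU of about 0.82 and is suppressed. Removing such duplicates discards lower-scoring redundant predictions, which is consistent with the table showing higher precision with NMS at unchanged recall.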

Table 6. Comparison of frequency-domain enhancement strategies in ContextGFE on VisDrone2019. Bold values indicate the best performance.

Frequency Branch	P%	R%	mAP50/%	mAP50:95/%
w/o frequency branch	61.8	47.6	48.5	29.7
FFT	62.0	48.3	49.1	30.2
DWT (Haar, Ours)	62.3	48.6	49.4	30.5

Table 7. Converged values of SAFPN learnable fusion weights after 400 epochs on VisDrone2019 (initialized at 1.0).

Weight	Init	Converged	Interpretation
$w_{1}^{t d}$	1.0	1.28	Upsampled coarse-scale features receive higher emphasis
$w_{2}^{t d}$	1.0	0.83	Original backbone features slightly down-weighted in top-down path
$w_{1}^{b u}$	1.0	1.14	Compensated backbone features weighted up in bottom-up path
$w_{2}^{b u}$	1.0	0.91	Downsampled fine-scale features slightly attenuated
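One way to read the converged weights is as the coefficients of a weighted sum of two feature maps in each pathway. The sketch below illustrates that reading for the top-down path only. The exact SAFPN fusion formula is not restated in this section, so the unnormalized weighted sum, the nearest-neighbour upsampling, and all function names are assumptions made for illustration; only the weight values 1.28 and 0.83 come from the table.

```python
import numpy as np


def upsample_nearest_2x(features):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return features.repeat(2, axis=1).repeat(2, axis=2)


def weighted_fusion(coarse, fine, w_coarse, w_fine):
    """Hypothetical top-down fusion: weighted sum of upsampled coarse and fine maps."""
    upsampled = upsample_nearest_2x(coarse)
    if upsampled.shape != fine.shape:
        raise ValueError("coarse map must have half the spatial size of the fine map")
    return w_coarse * upsampled + w_fine * fine


# Constant toy feature maps (assumption) make the effect of the weights easy to read.
coarse = np.ones((4, 2, 2))
fine = np.ones((4, 4, 4))

fused_at_init = weighted_fusion(coarse, fine, 1.0, 1.0)         # both weights at 1.0
fused_converged = weighted_fusion(coarse, fine, 1.28, 0.83)      # converged top-down weights

# Relative share of the upsampled coarse-scale branch after convergence.
coarse_share = 1.28 / (1.28 + 0.83)
```

Under this reading, the upsampled coarse-scale branch moves from an equal share at initialization to roughly 61% of the combined weight after training, which matches the interpretation column for the top-down path.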

Table 8. Comparison of loss function performance on VisDrone2019. Bold values indicate the best performance in each column.

Loss Function	P%	R%	mAP50/%	mAP50:95/%
ASIoU (Ours)	63.5	47.7	49.8	30.5
GIoU (Baseline)	62.1	46.9	48.2	29.3
DIoU	63.1	47.6	48.3	29.3
CIoU	62.1	46.9	48.2	29.3
EIoU	62.3	47.6	48.7	30.0
FocalEIoU	62.1	47.2	48.2	29.6
InnerFocalEIoU	62.6	47.6	48.4	29.9
ShapeIoU	62.6	47.5	48.7	30.0
InnerDIoU	61.7	47.1	47.3	29.1
InnerCIoU	63.3	47.4	48.4	29.3
InnerEIoU	61.2	48.1	48.2	29.2
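As a reference point for the comparison above, the baseline GIoU loss can be written in a few lines. The sketch below follows the standard definition from Rezatofighi et al., GIoU = IoU - (|C| - |A ∪ B|) / |C| with C the smallest enclosing box, for axis-aligned boxes in `[x1, y1, x2, y2]` format with positive area. The ASIoU formulation is not restated in this section and is therefore not reproduced; the function names are illustrative.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    return iou - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """GIoU loss: 0 for identical boxes, approaching 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)


identical = giou_loss((0, 0, 2, 2), (0, 0, 2, 2))
overlapping = giou_loss((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, the enclosing-box term keeps the loss informative for non-overlapping boxes: the disjoint pair above has IoU 0 but a GIoU loss greater than 1, so the loss still grows as the boxes move apart.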

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qin, S.; Cheng, K.; Wang, Y. CBW-DETR: A Lightweight Detection Transformer for Small Object Detection in UAV Imagery. Electronics 2026, 15, 2010. https://doi.org/10.3390/electronics15102010

