1. Introduction
In recent years, the decline in UAV production costs, together with advances in artificial intelligence and electronic information engineering, has placed UAVs at the forefront of international remote sensing research [1]. UAVs have seen widespread deployment in agricultural monitoring [2], Internet of Things frameworks [3], infrastructure inspection [4], and intelligent transportation systems [5]. Embedded real-time detection techniques play a key role in these applications, particularly in real-time object detection [6], target tracking [7], autonomous obstacle avoidance [8], and data analysis, where embedded models must identify small targets quickly and accurately. However, UAV remote sensing imagery inherently contains a high density of small objects, and onboard real-time computation is heavily constrained by existing hardware. The limited computational and memory resources of UAV platforms therefore pose a significant challenge for real-time small-object detection on embedded UAVs [3].
In UAV remote sensing imagery, target objects often occupy fewer than 32 × 32 pixels and are therefore classified as small objects [9]. To address this, Zhang et al. [10] proposed a multi-scale vehicle detection adversarial network that combines a multiscale detector with a dedicated small-object subnet. Gong et al. [11] enhanced YOLOv5 with a Normalized Attention Module (NAM)—dubbed SPHYOLOv5—to improve small-object localization, while Hamzenejadi et al. [12] incorporated lightweight attention mechanisms into the YOLO backbone to balance real-time speed with precision. Liu et al. [13] adopted a passive integration strategy to yield a more compact network architecture optimized for UAV deployment. Despite these gains in detection accuracy, the resulting models remain too large and computation-heavy for real-time inference on resource-constrained embedded platforms.
To address the requirements of small-object detection and embedded deployment, Zhao et al. [14] introduced RT-DETR, which removes the inference delay caused by non-maximum suppression (NMS) and outperforms YOLO-based detectors in both accuracy and speed. This real-time detector is therefore highly suitable for practical deployment under stringent latency constraints. Furthermore, Huang et al. [15] proposed DEIM, a recent state-of-the-art (SOTA) object detector designed to balance small-object detection performance, computational efficiency, and matching efficiency. Nonetheless, it encounters two significant challenges in practical implementation. The first is that the Transformer's global attention mechanism calculates pairwise relationships among all feature vectors, so inference complexity grows quadratically with the number of tokens. The second is that the training process converges slowly: the one-to-one matching strategy of the Hungarian algorithm yields very few positive samples and many low-quality matches, limiting the model's ability to learn, particularly for small-object detection.
In this study, we introduce a novel approach known as the Adaptive UAV Hardware-Focused Network for Target Detection (AUHF-DETR) to mitigate two key challenges in embedded small-object detection. This method builds upon the RT-DETR-r18 framework and features a lightweight backbone for small-object detection, complemented by a spatial attention module. These elements are crucial for satisfying the real-time detection requirements of embedded UAV platforms. The core distinction between our approach and DEIM lies in the encoder architecture: AUHF-DETR employs a single encoder (built on a CNN backbone), whereas DEIM utilizes a dual-encoder design. In summary, our method not only inherits RT-DETR’s exceptional inference speed and maintains stable accuracy across varying object scales, but also—thanks to its simpler single-encoder structure—is better suited for real-time deployment on edge-device GPUs than DEIM.
We evaluated our approach on the VisDrone2019, CARPK, and HIT-UAV datasets. Despite the complexity of the environments, the high density of small objects, and the multimodal characteristics of these public remote sensing benchmarks, AUHF-DETR consistently delivers impressive performance.
The main contributions of this work are as follows:
In the backbone, the WTConv-Block and AdaRes-Block enhance feature extraction, bridging the gap between remote-sensing object-detection models and the real-time demands of UAVs. Reversible connections allow for gradually decoupling small-object features during forward propagation, improving recognition accuracy;
In the attention module, we introduce the PSA component. Unlike the baseline model’s global self-attention, PSA adaptively computes relationships among feature vectors within each ROI, significantly reducing model complexity and meeting embedded deployment constraints;
In the encoder stage, we propose a small-object-focused pyramid structure, BDFPN. To enhance training, we increase the number of targets per image to generate more positive samples, providing dense supervision signals. This approach accelerates convergence and increases the accuracy of small object detection;
For the loss function, we design a small-object penalty term with a scaling parameter. By dynamically adjusting this scaling factor, the model flexibly computes IoU for small objects, elevating the loss weight on small-object localization and mitigating the adverse effects of bounding-box size variation on detection performance.
The structure of this paper is as follows:
Section 2 reviews the evolution of object detection in remote sensing.
Section 3 provides a detailed description of AUHF-DETR, our method for real-time target detection embedded in UAVs.
Section 4 presents the experimental evaluation—including ablation studies, comparative assessments, visualization experiments, and embedded PX4 simulation—conducted on the VisDrone2019, CARPK, and HIT-UAV datasets to evaluate the effectiveness of the proposed approach.
Section 5 discusses how AUHF-DETR overcomes the challenges of real-time small-object detection on embedded UAV platforms. Finally, Section 6 concludes the paper and highlights the model's potential value.
3. Proposed Method
3.1. Overall Framework
The Adaptive UAV Hardware-Focused Network for Target Detection (AUHF-DETR) is a lightweight, embedded detection Transformer that utilizes a spatial attention mechanism; the overall architecture is illustrated in Figure 1. Built on RT-DETR-r18, the model comprises four main components: a backbone network, an encoder, a decoder, and a detection head. High-resolution UAV remote-sensing images are first processed by the WTC-AdaResNet backbone, which extracts hierarchical semantic features through a multi-scale pyramid structure. The final pyramid stage (S5) is directed to the PSA module for ROI-based feature interaction. Simultaneously, multi-scale features from the backbone (S2, S3, S4, and F5) are sent to the encoder for effective feature fusion. In the encoder, we integrate the baseline's core component—the RepC3 module, a convolutional block derived from YOLOv5's C3 design and used in RT-DETR—which enhances computational efficiency through partial connections and re-parameterization. Finally, the decoder and detection head work together to predict bounding boxes, facilitating precise target localization.
To specifically address the challenges of embedded small-object detection in UAV remote sensing imagery, we introduce wavelet convolution into the backbone's convolutional blocks, creating a wide-field multi-frequency dilation module (WTConv-Block). This design effectively reduces the parameter inflation caused by enlarged receptive fields, thereby decreasing the overall model size. We also propose a resolution-adaptive algorithm to build a cross-scale feature decoupling module (AdaRes-Block), which facilitates reversible feature flow between sub-networks after sampling and thus preserves richer details of small objects. The Partition Split Spatial Attention (PSA) module uses spatial attention weights to restrict long-range pairwise computations, addressing the complexity mismatch between remote-sensing detectors and onboard real-time requirements. Our Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN) conducts efficient, layer-wise sampling consistent with UAV real-time paradigms, further enhancing embedded deployability. Finally, in the decoder, we implement an Inner-MPDIoU bounding-box regression strategy: by selecting queries that minimize uncertainty, we consolidate the encoder's differentiated features for large, medium, and small objects, improving small-object localization accuracy. The following subsections describe each module in detail, and Figure 2 presents the complete AUHF-DETR architecture.
3.2. Improvement of Backbone (WTC-AdaResNet)
In embedded UAV remote-sensing object detection, reconciling model size with onboard constraints while accurately detecting small targets is critical. To this end, we integrate a wavelet convolution module and an adaptive-resolution module into the backbone, yielding the wide-receptive-field WTConv-Block and the adaptive reversible-resolution AdaRes-Block. The resulting backbone, WTC-AdaResNet, addresses the computational mismatch between conventional remote-sensing detectors and embedded UAV real-time requirements.
As shown in Figure 2, WTC-AdaResNet consists of only 23 convolutional layers and produces feature maps down to 1/16 of the input resolution. In contrast to mainstream backbones that employ more downsampling layers, our design significantly reduces both the parameter count and the computational overhead. To better preserve small-object details, we first stack three CBR layers to produce a 1/2-resolution feature map and then apply a MaxPool layer to obtain a 1/4-resolution map. In Stages 1–3, we parallelize three WTConv-Blocks to capture richer detail through large receptive fields without altering the spatial resolution. It is important to note that the Stage 4 AdaRes-Block does not operate directly on the 1/16 (or lower) resolution feature maps. Instead, it ingests the 1/4-resolution features produced by Stage 3 and applies reversible connections and cross-scale fusion at that resolution to ensure that fine-grained details are fully preserved. Meanwhile, the WTConv-Blocks in the first three stages have already leveraged multi-frequency, small-kernel convolutions to expand the local receptive field, supplying richer contextual cues for small-object detection. During feature propagation, the AdaRes-Block generates a 1/16-resolution feature map; its reversible connections effectively mitigate the information loss typically encountered in deep architectures. For the BDFPN inputs, we extract features from backbone layers 4, 5, 6, and 10, promoting feature reuse and bolstering information propagation.
3.2.1. WTConv-Block
To address the parameter explosion that accompanies larger convolutional kernels in small-object detection, we integrate WTConv into our backbone. WTConv employs multi-frequency dilated convolutions to dynamically expand the receptive field without a proportional increase in parameters. Conventional CNNs, constrained by fixed kernel sizes, often lack sufficient global context for accurately detecting small targets. While many detectors boost accuracy by enlarging kernels, this strategy incurs substantial parameter growth, creating performance bottlenecks and undermining the balance between model size and detection precision on UAV platforms. In contrast, WTConv delivers enhanced global feature perception and improved detection efficiency with minimal computational overhead.
WTConv [44], introduced at ECCV 2024, solves the parameter inflation issue inherent in expanding CNN receptive fields, allowing efficient capture of both local and global features. We embed WTConv into the downsampling stages of RT-DETR's backbone, optimizing the convolutional network to reduce parameter count and computational complexity. This integration enables AUHF-DETR to be deployed on UAV hardware, meeting the real-time detection requirements of remote-sensing imagery.
We adopt the efficient Haar wavelet transform, which—unlike Daubechies or Meyer wavelets—effectively captures contextual semantics while limiting computational cost. This transform decomposes an image spatially into high- and low-frequency components via depthwise convolution and downsampling. In one dimension, it employs the low-pass kernel $\frac{1}{\sqrt{2}}[1,\ 1]$ and the high-pass kernel $\frac{1}{\sqrt{2}}[1,\ -1]$; in two dimensions, the operation extends to four separable filters for depthwise convolution:

$$f_{LL}=\frac{1}{2}\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix},\quad f_{LH}=\frac{1}{2}\begin{bmatrix}1 & -1\\ 1 & -1\end{bmatrix},\quad f_{HL}=\frac{1}{2}\begin{bmatrix}1 & 1\\ -1 & -1\end{bmatrix},\quad f_{HH}=\frac{1}{2}\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}$$

The four filters decompose the input UAV remote-sensing image $X$ into a low-frequency component $X_{LL}$ and high-frequency components $X_{LH}$, $X_{HL}$, and $X_{HH}$:

$$[X_{LL},\ X_{LH},\ X_{HL},\ X_{HH}] = \mathrm{WT}(X)$$
The WTConv module applies hierarchical wavelet decomposition to split the input signal into subbands at different frequencies, thereby enhancing sensitivity to low-frequency information. To alleviate the trade-off between convolution-kernel size and parameter count, we first filter and downsample the high- and low-frequency components via the wavelet transform, then apply small convolutional kernels to each subband, and finally reconstruct the fused output using the inverse wavelet transform (IWT). Figure 3 illustrates the basic architecture of the WTConv module. The entire process can be summarized as

$$Y = \mathrm{IWT}\big(\mathrm{Conv}\big(w,\ \mathrm{WT}(X)\big)\big),$$

where $X$ denotes the input tensor, $w$ is the convolutional kernel, and WT and IWT refer to the wavelet transform and its inverse, respectively.
WTConv efficiently decouples convolutional operations across multiple frequency bands, enabling an expanded effective receptive field with smaller kernels and reduced computational overhead. This design empowers the network to capture both local and global features of small objects in remote sensing imagery, transcending the local-only constraints of traditional convolutions. As a result, the model learns richer semantic representations, leading to superior small-object detection accuracy. By constraining the size of the wavelet kernels, WTConv aggregates low-frequency information over a broader spatial extent without inflating parameter counts, thereby addressing the challenge of deploying large-scale detectors on resource-limited UAV platforms.
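To make the idea concrete, the following is a minimal sketch of a single-level Haar wavelet convolution: fixed 2 × 2 Haar analysis filters perform the transform, a small depthwise convolution processes each subband, and the transposed convolution with the same (orthonormal) filters acts as the inverse transform. This is a simplification for illustration; the actual WTConv [44] cascades multiple wavelet levels and keeps a base convolution path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels):
    # Four 2x2 Haar analysis filters (LL, LH, HL, HH), scaled by 1/2 so the set is orthonormal.
    ll = torch.tensor([[1., 1.], [1., 1.]]) / 2
    lh = torch.tensor([[1., -1.], [1., -1.]]) / 2
    hl = torch.tensor([[1., 1.], [-1., -1.]]) / 2
    hh = torch.tensor([[1., -1.], [-1., 1.]]) / 2
    f = torch.stack([ll, lh, hl, hh])                     # (4, 2, 2)
    # Repeat per input channel for a grouped (depthwise) convolution: (4*C, 1, 2, 2).
    return f.unsqueeze(1).repeat(channels, 1, 1, 1)

class SimpleWTConv(nn.Module):
    """Single-level Haar WT -> small conv per subband -> inverse WT."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.channels = channels
        self.register_buffer("wt", haar_filters(channels))
        # Small depthwise convolution applied jointly to the four subbands of every channel.
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2, groups=4 * channels)

    def forward(self, x):
        c = self.channels
        # Wavelet transform: stride-2 depthwise conv with the fixed Haar filters.
        sub = F.conv2d(x, self.wt, stride=2, groups=c)        # (B, 4C, H/2, W/2)
        sub = self.subband_conv(sub)                          # small-kernel conv per subband
        # Inverse wavelet transform: transposed conv with the same orthonormal filters.
        return F.conv_transpose2d(sub, self.wt, stride=2, groups=c)

x = torch.randn(1, 16, 64, 64)
print(SimpleWTConv(16)(x).shape)   # torch.Size([1, 16, 64, 64])
```

Because the scaled Haar filter bank is orthonormal, the transform/inverse pair alone reconstructs the input exactly; the learnable subband convolution is what enlarges the effective receptive field at low parameter cost.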
3.2.2. AdaRes-Block
To address missed detections of small objects in UAV remote-sensing imagery, we introduce the Adaptive Resolution Module (AdaRes-Block). This module integrates multiple subnetworks (SubNets), allowing information to flow reversibly between them and thus preventing the feature loss common in conventional CNNs. As illustrated in Figure 4a, the AdaRes-Block is inserted at Stage 4 of the backbone, while the first three stages stack multiple wavelet-convolution downsampling modules to enlarge the receptive field of the feature maps, thereby enhancing the model's ability to capture the fine-grained edges and textures of small objects. The AdaRes-Block receives feature maps from the preceding stage and, by leveraging reversible connections, progressively decouples and preserves features during forward propagation—unlike traditional networks, which often compress or discard information. These reversible inter-SubNet connections retain rich spatial details from shallow layers while providing stronger semantic representations at higher resolutions in deeper layers. As a result, detection performance for small objects in UAV applications is markedly improved.
In Figure 4b, the core component of the AdaRes-Block is the AdaRes unit, which creates reversible connections among feature maps of different resolutions during feature propagation, allowing repeated feature reuse. As shown in Figure 4c, increasing the number of SubNet layers enables the model to capture progressively richer detailed features, resulting in superior performance on complex detection tasks. However, deeper SubNet configurations also significantly increase the parameter count, heightening dependency on high-performance computing resources—contrary to the real-time constraints of embedded UAV deployment. To strike a balance between SubNet depth and model compactness, we conducted a series of experiments; the results indicate that one or two SubNet layers offer the best trade-off between model size and detection accuracy. Accordingly, we introduce two lightweight variants: AUHF-DETR-S and AUHF-DETR-M. The detailed block architecture and the reversible connection mechanism are illustrated in Figure 5.
In small-object detection tasks, AdaRes implements a two-fold process of high-resolution feature preservation and multi-scale feature fusion. For high-resolution feature preservation (Figure 5a), the input image first passes through the STEM module, where a depthwise convolution extracts initial features. These features are then normalized via LayerNorm and fed into the reversible core component, SubNet. In SubNet, bidirectional reversible connections preserve high-resolution details, addressing the loss of spatial detail—and the consequent localization error—for small objects caused by repeated downsampling in deep networks. The forward propagation is formalized in Equation (4): features traverse multiple levels (Level 0–3), with each level's output linearly weighted by a learnable parameter $\alpha$ and summed with the original input features to prevent unidirectional information loss for small targets. By retaining rich feature representations at deeper layers, the AdaRes mechanism enhances the model's capacity and efficiency in learning and expressing small-object details.
$$c_0^{(t)} = F_0\!\left(x,\ c_1^{(t-1)}\right) + \alpha \cdot c_0^{(t-1)} \qquad (4)$$

Here, $c_0^{(t-1)}$ denotes the Level 0 feature map from the previous iteration, $c_1^{(t-1)}$ denotes the historical feature map from the adjacent higher level, and $c_0^{(t)}$ denotes the updated feature map. $\alpha$ is a learnable dynamic weight parameter that adjusts the contribution of $c_0^{(t-1)}$ to the current feature representation. Finally, $F_0(\cdot)$ represents the feature extraction operation at Level 0, where $x$ is the input feature map for this level.
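The following is a minimal sketch of an AdaRes-style reversible column implementing the additive update of Equation (4), under the simplifying assumption that all levels share one resolution; the level operations, module names, and zero initial states are illustrative placeholders rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ReversibleColumn(nn.Module):
    """Sketch of Equation (4): c_i_new = F_i(x, c_{i+1}) + alpha_i * c_i."""
    def __init__(self, channels, num_levels=4):
        super().__init__()
        # Placeholder level operations F_i; the real blocks are wavelet/ConvNeXt-style units.
        self.levels = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 3, padding=1) for _ in range(num_levels)]
        )
        self.alpha = nn.Parameter(torch.ones(num_levels))   # learnable reversible weights

    def forward(self, x, states):
        # states[i]: level-i feature map carried over from the previous iteration/column.
        new_states = []
        for i, level in enumerate(self.levels):
            higher = states[i + 1] if i + 1 < len(states) else torch.zeros_like(states[i])
            fused = level(torch.cat([x, higher], dim=1))      # F_i(x, c_{i+1})
            # Additive coupling: invertible whenever alpha_i != 0, so c_i can be recovered.
            new_states.append(fused + self.alpha[i] * states[i])
        return new_states

channels = 32
col = ReversibleColumn(channels)
x = torch.randn(1, channels, 40, 40)
states = [torch.zeros(1, channels, 40, 40) for _ in range(4)]
print([s.shape for s in col(x, states)])
```

The additive coupling is the key design choice: because the previous-iteration state can be recovered from the output, no level is forced to discard the fine-grained information it received.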
In the implementation of multi-scale feature fusion, to address the insufficient extraction of shallow details (edges, textures) and deep contextual semantics in conventional networks for small-object detection, we propose the fusion strategy illustrated in Figure 5c. Within each SubNet's Level component, this method aligns feature resolutions across adjacent levels. In the downsampling branch, a Conv2D operation halves the spatial dimensions of higher-level features to match the lower-level resolution (e.g., Level 1 → Level 2); in the upsampling branch, a linear transformation followed by nearest-neighbor interpolation upsamples lower-level features to satisfy higher-level requirements (e.g., Level 2 → Level 1). The aligned features are then passed to the ConvNeXt block depicted in Figure 5d, where depthwise-separable convolutions, pointwise convolutions, GELU activations, and LayerNorm collectively encode richer semantic information within the enlarged receptive field. In UAV ground-target detection scenarios, this fidelity-preserving fusion approach markedly improves the detection accuracy of small objects in remote sensing imagery.
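A minimal sketch of this alignment-and-refinement step is shown below, assuming one target level that receives a finer neighbour (via strided convolution) and a coarser neighbour (via 1 × 1 mapping plus nearest-neighbour upsampling); channel counts and the 7 × 7 depthwise kernel are illustrative choices, not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNeXtStyleBlock(nn.Module):
    """Depthwise conv -> LayerNorm -> pointwise expand -> GELU -> pointwise project, with residual."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)   # depthwise convolution
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Linear(dim, 4 * dim)                        # pointwise (1x1) as Linear
        self.pw2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        res = x
        x = self.dw(x).permute(0, 2, 3, 1)                        # (B, H, W, C) for norm/linear
        x = self.pw2(F.gelu(self.pw1(self.norm(x))))
        return res + x.permute(0, 3, 1, 2)

class LevelAlign(nn.Module):
    """Align adjacent-level features to the current level's resolution, then refine."""
    def __init__(self, c_finer, c_coarser, c_cur):
        super().__init__()
        self.down = nn.Conv2d(c_finer, c_cur, 3, stride=2, padding=1)  # downsampling branch
        self.lin = nn.Conv2d(c_coarser, c_cur, 1)                      # linear mapping before upsample
        self.refine = ConvNeXtStyleBlock(c_cur)

    def forward(self, finer, coarser):
        d = self.down(finer)                                           # halve the finer level
        u = F.interpolate(self.lin(coarser), size=d.shape[-2:], mode="nearest")
        return self.refine(d + u)

align = LevelAlign(c_finer=64, c_coarser=256, c_cur=128)
out = align(torch.randn(1, 64, 80, 80), torch.randn(1, 256, 20, 20))
print(out.shape)   # torch.Size([1, 128, 40, 40])
```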
Compared with other feature fusion or preservation techniques (e.g., dense connections), the proposed reversible connection design offers advantages across multiple dimensions. In terms of information completeness, dense connections concatenate all features from previous layers, often leading to excessive channel dimensionality. In contrast, reversible connections perform additive coupling under fixed resolution, preserving all input information without increasing the number of channels. For fine-grained small object feature retention, traditional skip connections typically fuse shallow and deep features at fixed levels. However, AdaRes leverages reversible mechanisms to align and fuse features across multiple resolutions within each SubNet, ensuring bidirectional flow and reuse of both low-level details and high-level semantics during every forward pass. Regarding embedded real-time deployment, while dense connections significantly increase parameters and computation, our reversible design maintains a lightweight architecture that meets the constraints of embedded systems while achieving competitive small object detection accuracy with only 1–2 SubNet configurations. Overall, the AdaRes-Block not only improves detail preservation for small objects but also satisfies the dual requirements of computational efficiency and memory footprint in UAV-based real-time detection scenarios.
3.3. Partition Split Spatial Attention (PSA)
To tackle the low inference speed (FPS) and the quadratic computational complexity caused by the global self-attention mechanism in traditional DETR, we present the PSA module. By utilizing a spatial attention strategy that limits pairwise computations to local regions, PSA significantly decreases model complexity while increasing detection FPS.
As illustrated in Figure 6, the PSA module consists of two parallel branches with shared parameters: a low-frequency adaptive feature interaction branch and a high-frequency lightweight feature extraction branch. Designed to minimize computational complexity, PSA effectively processes fine-grained semantic details in remote sensing images and enhances intra-scale feature fusion. By leveraging spatial attention, it adaptively computes pairwise relationships among feature vectors within each ROI, significantly reducing overall model complexity. In small-object detection, conventional global self-attention mechanisms struggle to capture long-range dependencies without incurring high computational costs, as they focus on global features and overlook the distinction between high- and low-frequency components—critical for detecting small targets rich in fine details. As shown in Figure 6a, the high-frequency branch employs depthwise separable convolutions to enhance feature resolution, effectively capturing edges and minute textures in UAV imagery while markedly improving small-object recognition. Figure 6b depicts the low-frequency branch, which uses average pooling to generate spatial attention weights, allowing the model to aggregate broader contextual information. By encoding low-frequency features, the module extracts global structural cues while focusing computation on ROI-specific vectors, eliminating the redundant operations inherent in global attention and thus significantly lowering computational load while increasing inference FPS. PSA restructures the input feature map from four dimensions (batch, channel, width, height) to five dimensions (batch, channel, width, height, sampling factor), partitioning spatial information into smaller regions and reducing complexity. This design yields higher FPS without adding parameters and significantly enhances small-object detection performance.
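The core cost saving comes from restricting attention to local partitions. The following is a minimal sketch of that partition-and-attend idea: the feature map is split into non-overlapping p × p regions and self-attention is computed only among the tokens of each region. The partition size and the single-branch structure are illustrative; the real PSA adds the high-/low-frequency branches described above.

```python
import torch
import torch.nn as nn

class PartitionedSelfAttention(nn.Module):
    """Self-attention restricted to p x p spatial partitions (local ROIs)."""
    def __init__(self, dim, num_heads=4, partition=4):
        super().__init__()
        self.p = partition
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W); H and W divisible by p
        B, C, H, W = x.shape
        p = self.p
        # Regroup pixels into (B * number_of_regions, p*p, C): tokens only within each region.
        x = x.view(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B * (H // p) * (W // p), p * p, C)
        out, _ = self.attn(x, x, x)              # pairwise attention among p*p tokens per region
        # Restore the original (B, C, H, W) layout.
        out = out.reshape(B, H // p, W // p, p, p, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)

psa = PartitionedSelfAttention(dim=256, num_heads=4, partition=4)
print(psa(torch.randn(2, 256, 16, 16)).shape)    # torch.Size([2, 256, 16, 16])
```

With N = H × W tokens overall and only p² tokens per region, the pairwise cost scales with N · p² rather than N², which is the source of the FPS gain discussed above.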
To clarify the fundamental differences between AUHF-DETR and the YOLO architecture: YOLO variants employ convolutional prediction heads at the end of the backbone, directly performing multi-scale convolutional regression and classification on feature maps. In contrast, AUHF-DETR inserts a PSA module after multi-scale feature fusion (detailed in Section 3.2.2), leveraging locally partitioned self-attention to enhance detail and enable global information exchange, while fully preserving the Transformer's query-key-value operations and end-to-end matching pipeline.
3.4. Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN)
Lin et al. [45] first proposed the Feature Pyramid Network (FPN), which fuses multi-scale feature maps from pyramid levels P3 through P7 via a top-down pathway with lateral connections. Building on this, PANet [46] augments FPN with a complementary bottom-up path to reinforce feature propagation and fusion, albeit at the cost of increased model complexity. To retain performance while streamlining the architecture, BiFPN [47] prunes low-utility nodes—those with only a single input—and employs bidirectional pathways for efficient multi-scale feature fusion. Moreover, BiFPN introduces learnable scalar weights on each input edge, enabling adaptive calibration of multi-resolution features.
Building on this, we propose a Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN) specifically designed for small-object detection to accelerate early-stage convergence. BDFPN tackles the positive-sample scarcity inherent in DETR’s Hungarian one-to-one assignment by increasing the number of object instances per image during training. This results in denser positive samples, richer supervisory signals, and higher-quality matches to expedite convergence. Additionally, BDFPN adaptively selects optimal feature-fusion pathways and includes high-resolution shallow-layer features, improving the network’s ability to concentrate on critical details of ground-level small targets.
Figure 7 presents a comparison of various pyramid network architectures.
BDFPN enables cross-scale feature fusion by integrating convolutional modules, upsampling layers, RepC3 blocks, and a dynamic fusion unit. As shown in Figure 8, three distinct fusion strategies are adopted. Within the pyramid structure, the dynamic fusion unit first aggregates the input features along the channel dimension and employs a CBS block to produce spatially adaptive weights. These weights are normalized via the Softmax function to match the number of input features. Subsequently, each weight is applied element-wise to its corresponding feature map, and the weighted maps are summed to yield a high-resolution representation that captures both global context and fine details. The dynamic fusion mechanism adaptively adjusts the contribution of each feature map, enabling efficient multi-scale integration, accelerating training convergence, and improving small-object detection accuracy.
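The following is a minimal sketch of such a dynamic fusion unit, assuming the input feature maps have already been resized to a common resolution; the CBS block is approximated here by Conv + BatchNorm + SiLU, and the exact layer choices are illustrative.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Weighted fusion of N same-resolution feature maps with spatially adaptive weights."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        # CBS-style block producing one weight map per input feature.
        self.weight_gen = nn.Sequential(
            nn.Conv2d(channels * num_inputs, num_inputs, kernel_size=1),
            nn.BatchNorm2d(num_inputs),
            nn.SiLU(),
        )

    def forward(self, feats):                     # feats: list of (B, C, H, W) tensors
        stacked = torch.cat(feats, dim=1)         # aggregate along the channel dimension
        w = self.weight_gen(stacked)              # (B, N, H, W) spatially adaptive weights
        w = torch.softmax(w, dim=1)               # normalise across the N inputs
        fused = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
        return fused                              # (B, C, H, W)

fusion = DynamicFusion(channels=128, num_inputs=3)
feats = [torch.randn(1, 128, 40, 40) for _ in range(3)]
print(fusion(feats).shape)   # torch.Size([1, 128, 40, 40])
```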
3.5. Inner-MPDIoU Regression for Small Object Detection
In the RT-DETR model, the classification task is optimized with a cross-entropy loss, while bounding-box regression uses the L1 loss and the GIoU loss; target matching is performed with the Hungarian algorithm. The GIoU loss is given in Equation (5), and the total loss of RT-DETR is defined in Equation (6):

$$\mathcal{L}_{GIoU} = 1 - \mathrm{IoU}\left(B_{gt}, B_{prd}\right) + \frac{\left|C \setminus \left(B_{gt} \cup B_{prd}\right)\right|}{|C|} \qquad (5)$$

$$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{L1} + \lambda_{2}\,\mathcal{L}_{GIoU} \qquad (6)$$

In Equation (5), $B_{gt}$ and $B_{prd}$ represent the ground-truth and predicted bounding boxes, respectively, and $C$ denotes the smallest enclosing box covering both $B_{gt}$ and $B_{prd}$. The final term of the GIoU loss reflects the proportion of the area within $C$ that lies outside the union of $B_{gt}$ and $B_{prd}$, effectively penalizing predictions that are far from the ground truth. In Equation (6), $\mathcal{L}_{cls}$ denotes the classification loss, $\lambda_{1}$ weights the regression loss $\mathcal{L}_{L1}$, which is computed using the Smooth L1 loss, and $\mathcal{L}_{GIoU}$ evaluates the overlap quality between bounding boxes. The weights $\lambda_{1}$ and $\lambda_{2}$ balance the contributions of each loss component.
Overall, GIoU enhances localization performance by addressing zero-overlap cases and incorporating a penalty term based on the enclosing box. However, for small object detection, the enclosing box C tends to be disproportionately large, which reduces the penalty term and causes GIoU to closely resemble the standard IoU loss, providing limited additional information. Furthermore, GIoU is not sufficiently sensitive to variations in aspect ratio and suffers from weak penalization, negatively affecting performance in challenging detection scenarios. To overcome these limitations, we redesign the bounding box regression loss and introduce the Inner-MPDIoU loss.
The MPDIoU loss integrates three elements: the IoU term, the minimum point distance (MPD) metric, and a normalization component. The IoU term quantifies the overlap between the prediction and the ground truth via the intersection-to-union ratio, as defined in Equation (7):

$$\mathrm{IoU} = \frac{\left|B_{gt} \cap B_{prd}\right|}{\left|B_{gt} \cup B_{prd}\right|} \qquad (7)$$

where $B_{gt}$ denotes the ground-truth bounding-box region and $B_{prd}$ denotes the predicted bounding-box region.

The innovation of the MPDIoU loss lies in the introduction of two minimum-point-distance terms to balance spatial discrepancies between bounding boxes. As defined in Equations (8) and (9), the positional offset is measured by the Euclidean distances between the top-left and bottom-right corner points of the predicted and ground-truth boxes. To prevent inconsistent gradient magnitudes caused by varying box sizes, these distance terms are normalized by the width and height of the original image. The complete loss formulation is then given in Equation (10):

$$d_1^2 = \left(x_1^{prd} - x_1^{gt}\right)^2 + \left(y_1^{prd} - y_1^{gt}\right)^2 \qquad (8)$$

$$d_2^2 = \left(x_2^{prd} - x_2^{gt}\right)^2 + \left(y_2^{prd} - y_2^{gt}\right)^2 \qquad (9)$$

$$\mathcal{L}_{MPDIoU} = 1 - \mathrm{IoU} + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2} \qquad (10)$$

where $\left(x_i^{prd},\ y_i^{prd}\right)$ denotes the coordinates of the predicted box's corner (top-left for $i=1$, bottom-right for $i=2$), $\left(x_i^{gt},\ y_i^{gt}\right)$ denotes the corresponding ground-truth corner coordinates, and $w^2 + h^2$ is the sum of the squared width $w$ and height $h$ of the original image.
To enhance the model's small-object detection capability and accelerate convergence, we extend the MPDIoU loss by incorporating auxiliary bounding boxes from the Inner-IoU framework, thereby formulating the novel Inner-MPDIoU loss function with an added small-object term. By introducing a dynamic scaling factor (ratio), the model can adaptively compute the relevant IoU values, increasing the weight assigned to small objects. This modification not only accelerates convergence during training but also decreases both false positives and false negatives for small targets in remote sensing imagery. The formal definition of Inner-MPDIoU is as follows:

$$b_l = x_c - \frac{w \cdot ratio}{2},\quad b_r = x_c + \frac{w \cdot ratio}{2},\quad b_t = y_c - \frac{h \cdot ratio}{2},\quad b_b = y_c + \frac{h \cdot ratio}{2}$$

$$\mathcal{L}_{Inner\text{-}MPDIoU} = \mathcal{L}_{MPDIoU} + \mathrm{IoU} - \mathrm{IoU}^{inner}$$

where $b_l$, $b_r$, $b_t$, and $b_b$ denote the left, right, top, and bottom boundary coordinates of the ratio-scaled auxiliary box built from the ground-truth or predicted box, respectively, $\mathrm{IoU}^{inner}$ is the IoU computed on these auxiliary boxes, and $w \cdot ratio$ and $h \cdot ratio$ denote the width and height of the contracted region within the ground-truth box.
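A simplified sketch of this loss is shown below, combining the MPDIoU corner-distance terms with Inner-IoU-style auxiliary boxes scaled by ratio. The exact weighting of the small-object term follows the formulation above; the combination rule, default ratio, and function names here are illustrative assumptions rather than the released implementation.

```python
import torch

def box_area(b):
    return (b[:, 2] - b[:, 0]).clamp(min=0) * (b[:, 3] - b[:, 1]).clamp(min=0)

def pairwise_iou(a, b, eps=1e-7):
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    return inter / (box_area(a) + box_area(b) - inter + eps)

def shrink(box, ratio):
    # Inner-IoU auxiliary box: same centre, width/height scaled by the ratio factor.
    cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
    w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

def inner_mpdiou_loss(pred, target, img_w, img_h, ratio=0.8):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    iou = pairwise_iou(pred, target)
    # MPDIoU: squared distances of the two corner points, normalised by the image size.
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou_loss = 1.0 - iou + (d1 + d2) / (img_w ** 2 + img_h ** 2)
    # Inner-IoU on the ratio-scaled auxiliary boxes re-weights small-object overlap.
    inner_iou = pairwise_iou(shrink(pred, ratio), shrink(target, ratio))
    return mpdiou_loss + iou - inner_iou

p = torch.tensor([[10., 10., 20., 22.]])
t = torch.tensor([[12., 11., 21., 23.]])
print(inner_mpdiou_loss(p, t, img_w=640, img_h=640))
```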
3.6. Model’s Decoder and Transformer Structure
In this section, we first present the decoder architecture of our model, and then detail the connections and innovations of AUHF-DETR with Transformer techniques to better situate our work within the relevant technical context.
3.6.1. Structure of the Model Decoder
The AUHF-DETR decoder inherits its architecture from RT-DETR [14], comprising ten identical Transformer decoder layers. In each layer, a set of learnable query vectors first undergoes multi-head self-attention to capture global context dependencies among queries. The updated queries are then fused with encoder feature maps via multi-head cross-attention, enabling query-key-value-based feature interaction. Following the attention modules, two feed-forward networks with GELU activations, residual connections, and layer normalization ensure feature diversity and training stability. The decoder's output queries are projected by two separate multilayer perceptrons (MLPs) into bounding-box regression vectors and class prediction scores. Whereas the original RT-DETR optimizes with an L1 + GIoU composite loss, AUHF-DETR replaces the regression term with our proposed Inner-MPDIoU loss—augmented with a small-object penalty—to strengthen gradient responses and localization accuracy for small targets. Finally, all predicted boxes are paired one-to-one with ground-truth boxes via the minimum-uncertainty matching mechanism, completing the end-to-end detection pipeline.
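A simplified sketch of one such decoder layer is shown below (query self-attention, cross-attention against the flattened encoder features, a feed-forward block, and separate MLP heads); the hidden dimension, number of queries, and single-FFN structure are illustrative simplifications of the RT-DETR design.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Query self-attention -> cross-attention with encoder features -> feed-forward."""
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, memory):
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.n2(q + self.cross_attn(q, memory, memory)[0])   # keys/values from the encoder
        return self.n3(q + self.ffn(q))

dim, num_queries = 256, 300
layer = DecoderLayer(dim)
queries = torch.randn(1, num_queries, dim)          # learnable object queries
memory = torch.randn(1, 20 * 20, dim)               # flattened encoder feature map
out = layer(queries, memory)
# Separate MLP heads project each query to box coordinates and class scores.
box_head, cls_head = nn.Linear(dim, 4), nn.Linear(dim, 10)
print(box_head(out).shape, cls_head(out).shape)     # (1, 300, 4) (1, 300, 10)
```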
3.6.2. Related to Transformer Technology
AUHF-DETR builds upon RT-DETR, inheriting the core Transformer framework: end-to-end query-key-value feature interaction via self-attention and one-to-one matching through the minimum-uncertainty query algorithm. On this foundation, we introduce three targeted enhancements to accommodate both small-object detection and real-time embedded deployment requirements:
First, we introduce PSA spatial attention (detailed in Section 3.3). In a conventional Transformer, global self-attention computes pairwise similarities across all positions, resulting in $O(N^2)$ complexity for $N$ feature tokens. PSA alleviates this by partitioning the feature map into local ROIs and processing each through separate high- and low-frequency branches to extract fine-grained textures and global structures, respectively. Self-attention is then computed only within each ROI, reducing the complexity to $O(N \cdot M)$, where $M \ll N$ is the number of tokens in a single ROI;
Second, we propose multi-scale cross-layer dynamic fusion via BDFPN (detailed in Section 3.4) with learnable weights. Analogous to multi-head cross-attention aggregating information across feature maps, BDFPN employs learnable scalar weights to adaptively allocate attention among multi-scale inputs, ensuring that small, medium, and large target features all receive adequate focus and accelerating convergence through denser positive samples;
Third, for decoder query and localization loss optimization, we retain DETR's decoder-query mechanism, ensuring that each query maps to a unique detection target. To further enhance small-object localization accuracy, we introduce the Inner-MPDIoU loss (detailed in Section 3.5), which combines the Transformer's end-to-end, NMS-free matching advantage with IoU-driven regression, thereby boosting gradient responsiveness for small bounding boxes.
4. Experiments and Results
4.1. Dataset and Implementation Details
We evaluate our approach on three publicly available aerial datasets: VisDrone2019 [48], CARPK [49], and HIT-UAV [50]. The details of these three datasets are as follows:
(a) VisDrone2019: This dataset includes a total of 10,209 images captured across diverse urban and suburban environments, such as city roads, residential areas, and industrial districts, as illustrated in Figure 9a,b. The imagery was collected by UAVs operating at varying altitudes, camera angles, and weather conditions. The dataset is divided into 6471 training images, 548 validation images, and 1580 test images, and covers ten object categories. The overall class distribution is visualized in Figure 10a.
(b) CARPK: This dataset consists of images extracted from HD video recorded over four different car parks, comprising 989 training images and 459 validation images. It contains a single, densely distributed class of small targets. Figure 9c shows sample images from this dataset, in which the targets are even smaller and more numerous; the class distribution is shown in Figure 10b.
(c) HIT-UAV: This dataset comprises 2898 thermal infrared images collected by UAVs across diverse scenes such as campuses, highways, and parking areas. It is characterized by densely distributed small targets belonging to five categories. Designed to support UAV applications in low-light environments, the dataset enhances detection capabilities under challenging illumination. In our experiments, we used 2029 images for training, 290 for validation, and the remaining 579 for testing. Figure 9d presents sample images from the dataset, which were collected with infrared sensors to evaluate the model's ability to process multispectral data; the dataset's class distribution is shown in Figure 10.
All experiments were carried out using the PyTorch framework on an NVIDIA GeForce RTX 4090 GPU under Ubuntu 24.04 with CUDA 12.1. We adopted a warm-up strategy to prevent drastic oscillations at the start of training. The hyperparameter settings are given in Table 1.
4.2. Evaluation Metrics
Following the MS COCO evaluation protocol, we mainly adopt two types of criteria for different purposes:
For model accuracy, we adopt $AP$, $AP_{50}$, and $AP_{75}$ as evaluation metrics. $AP$ represents the average precision over all categories, while $AP_{50}$ and $AP_{75}$ denote the average precision calculated at IoU thresholds of 0.5 and 0.75 over all categories. Furthermore, to measure performance at different object scales, we adopt three additional metrics: $AP_S$, $AP_M$, and $AP_L$.
For real-time requirements, we use the number of frames processed per second (FPS) as the evaluation index. A detector should typically run at no less than 30 FPS to satisfy real-time processing constraints.
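For reference, a common way to measure FPS for a PyTorch detector is sketched below, with warm-up iterations and GPU synchronisation before and after timing; the model and input shape are placeholders, not the exact benchmarking script used in this work.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640), warmup=50, iters=200):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                  # warm-up to stabilise clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()             # wait for all kernels before stopping the timer
    return iters / (time.perf_counter() - start)

# Example with a stand-in module; replace with the actual detector.
print(f"{measure_fps(torch.nn.Conv2d(3, 16, 3, padding=1)):.1f} FPS")
```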
4.3. Loss Function Exploration Experiments
From the preceding discussion of loss functions, GIoU has certain limitations for small-object detection from the UAV perspective. To evaluate the effectiveness of our proposed Inner-MPDIoU loss, we designed a series of experiments to benchmark mainstream loss functions within the AUHF-DETR framework. Specifically, we integrated each loss function into AUHF-DETR and assessed its performance on the VisDrone2019 validation set. The results of these comparisons are shown in Table 2.
For the small-object detection term in the loss function, the core concept of Inner-IoU is implemented by introducing a dynamic scaling factor (ratio). We conducted experiments across a range of ratio values to identify the hyperparameter that best enhances small-object detection from the UAV perspective. The results of these experiments are summarized in Table 3.
The results demonstrate that the Inner-MPDIoU loss function achieves the best performance in both detection accuracy and inference speed. Accordingly, we adopt Inner-MPDIoU in place of the original GIoU loss for the bounding-box regression task.
4.4. Model Balancing Exploration Experiments
The goal of this exploration is to balance detection accuracy, inference speed, and model size for real-time embedded UAV remote-sensing detection. While we initially balanced these factors in the backbone by varying the number of SubNet layers, this balance applies solely within the encoder. Therefore, we further control algorithmic complexity by adjusting three key parameters—network width, depth, and maximum channel count—to pinpoint the optimal model configuration for our task.
During this process, we discovered that reducing any single parameter in isolation (width, depth, or maximum channels) negatively impacts overall performance; achieving a balanced trade-off among all three is therefore essential. In this section, we design and evaluate various configurations to examine how these parameters affect model performance; the results are summarized in Table 4.
The parameter count directly reflects a model's adaptability across devices with varying computational capabilities. As shown in Table 4, GFLOPs increase with both network depth and width; however, when the depth factor exceeds a certain threshold, the total parameter count surpasses the deployment limit of the baseline. Conversely, setting the depth factor to 0.43 and the width factor to 0.45 produces an acceptable parameter count but leads to a significant drop in GFLOPs, failing to meet high-performance detection requirements. Balancing these observations with the performance of models at different maximum channel settings, we selected a depth factor of approximately 0.4, a width factor of roughly 0.6, and a maximum channel count between 512 and 768. We then evaluated various parameter combinations on the VisDrone2019 validation set to determine whether higher accuracy can be maintained with a reduced model footprint. The results of these experiments are summarized in Table 5.
As shown in Table 5, configuring the model with a depth factor of 0.45, a width factor of 0.60, and a maximum channel count of 512 yields the best inference speed and the smallest footprint; however, its AP is only 27.8%, indicating relatively low detection accuracy. Weighing these factors, we adopt a depth factor of 0.43, a width factor of 0.60, and a maximum channel count of 512 as the base configuration. With this setup, the model achieves an AP of 30.9%—the highest in our experiments—while maintaining a balanced trade-off between parameter count and inference speed, thus satisfying the requirements for embedded real-time detection.
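To make the scaling factors concrete, the sketch below shows one common way depth, width, and max-channel multipliers are applied to a base layer specification (values mirror the chosen configuration of 0.43 / 0.60 / 512); the base stage widths and repeat counts are hypothetical placeholders, not the actual AUHF-DETR layer table.

```python
import math

def scale_config(base_channels, base_repeats, depth=0.43, width=0.60, max_channels=512):
    """Apply depth/width multipliers and clamp channel widths to the maximum."""
    channels = [min(max_channels, int(round(c * width))) for c in base_channels]
    repeats = [max(1, math.ceil(r * depth)) for r in base_repeats]
    return channels, repeats

# Hypothetical base stage widths and block repeat counts.
base_channels = [64, 128, 256, 512, 1024]
base_repeats = [3, 6, 6, 3, 3]
print(scale_config(base_channels, base_repeats))
# -> ([38, 77, 154, 307, 512], [2, 3, 3, 2, 2])
```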
4.5. Ablation Experiments
Ablation experiments on the VisDrone2019 validation set demonstrate that our four key modifications collectively enhance both real-time performance and small-object detection accuracy. First, we replace the original BasicBlock with the WTConv-Block—a lightweight, large-receptive-field convolution module—and the AdaRes-Block, which enables adaptive-resolution, reversible feature propagation within the backbone. Second, we substitute the AIFI module with our PSA spatial attention mechanism, eliminating exhaustive pairwise computations and reducing parameter overhead. Third, we integrate the BDFPN into the encoder to mitigate the scarcity of positive samples caused by Hungarian one-to-one matching and to improve small-object detectability. Finally, we adopt the Inner-MPDIoU loss function—enhanced with a small-object penalty term—to expedite convergence and further improve localization precision for small targets.
Table 6 reports detailed performance metrics for each modification, demonstrating that every enhancement produces measurable gains over the RT-DETR-r18 baseline. Substituting the backbone with WTC-AdaResNet leads to significant improvements across all metrics—most notably in small-object accuracy. Integrating the PSA spatial attention module further refines feature representation, increasing overall AP by 1.8% while slightly decreasing model size. Introducing the BDFPN structure markedly boosts detection accuracy by 2.6% without a substantial rise in parameter count. These architectural improvements collectively reduce the model footprint and speed up inference, achieving an initial balance among size, accuracy, and speed. Finally, with all improvements applied, the model achieves a 5.9% increase in AP and a 3.12% reduction in parameters, boosting the inference speed to 65.7 FPS—nearly twice the baseline's 35 FPS—fully meeting the requirements for real-time deployment. In summary, the AUHF-DETR model shows significant improvements over the baseline in all respects, meets real-time detection needs on embedded UAV systems, and strikes a favorable balance between model size, inference speed, and detection accuracy.
4.6. Comparisons of Performance
To evaluate the detection performance improvements of AUHF-DETR, we compared our final variants against current mainstream detectors. We developed two configurations: AUHF-DETR-S (10.29 M parameters) and AUHF-DETR-M (19.55 M parameters). In AUHF-DETR-S, the AdaRes-Block contains a single SubNet layer, and the per-stage channel widths are reduced from [64, 128, 256, 512] to [56, 112, 224, 448], cutting the parameter count while maintaining detection accuracy—ideal for UAVs with limited computing power. AUHF-DETR-M utilizes two SubNet layers configured with channels [64, 128, 256, 512], yielding a total of 19.55 million parameters. Although this increases the parameter count, it provides richer feature representations and further enhances detection performance, making it suitable for higher-capacity UAV platforms. As shown in Table 7, compared with RT-DETR-r18, AUHF-DETR-M achieves relative improvements of 5.9% in $AP$, 4.8% in $AP_{50}$, 10.5% in $AP_{75}$, 4.4% in $AP_S$, 2.2% in $AP_M$, and 2.4% in $AP_L$, while reducing the parameter count by 3.12%. These significant gains demonstrate AUHF-DETR's efficiency for UAV ground-target detection and confirm that the reduced model footprint facilitates real-time deployment on embedded UAV systems.
Notably, the AUHF-DETR-S model delivers excellent performance across all object scales: $AP_S$ reaches 23.8%, a 2.1% gain over RT-DETR-r18. Moreover, the model balances detection accuracy, inference speed, and model size well. Specifically, its computational complexity is 23 GFLOPs—higher than lightweight models such as YOLOv11-N (6.5 GFLOPs) and YOLOv11-S (21.5 GFLOPs), yet significantly lower than computation-intensive detectors like Deformable DETR (173 GFLOPs). Importantly, with only 10.29 M parameters (49.0% fewer than the baseline), AUHF-DETR-S achieves the goal of lightweight design, enabling real-time onboard deployment, and its inference speed of 86.9 FPS fully meets real-time detection requirements. Compared with recent Transformer-based detectors—DEIM (CVPR 2025), Deformable DETR (ICLR 2021), and Swin DETR (ICCV 2021)—our model demonstrates a marked advantage in average precision (AP), confirming the effectiveness of our enhancements. In summary, particularly for real-time multi-scale object detection, AUHF-DETR exhibits outstanding performance, underscoring its potential as an embedded UAV detection model. To further evaluate the model's robustness, comparative experiments were also conducted on the HIT-UAV and CARPK datasets.
On the HIT-UAV dataset, most studies report mAP@IoU = 0.5 as their evaluation metric. While this provides quick and straightforward feedback, it does not fully capture a model's robustness across varying object scales and multiple IoU thresholds. To assess AUHF-DETR's ability to detect small objects more comprehensively, we adopt a more rigorous evaluation protocol. The results are presented in Table 8 and Table 9.
In the three comparative experiments, our model consistently achieves SOTA performance on all evaluated datasets, effectively balancing detection accuracy, inference speed, and model footprint. Moreover, AUHF-DETR exhibits robust generalization capabilities across different sensor types, such as visible and infrared modalities, showcasing its remarkable adaptability. This makes AUHF-DETR uniquely suited for seamless integration into UAV hardware platforms for real-time object detection, a capability that few existing models can deliver simultaneously.
4.7. Visualization Experiment
This study performs a visual analysis of the enhanced AUHF-DETR model using heatmaps and qualitative detection results. Additionally, UAV aerial imagery captured over various scenes in Fuzhou City was used for inference to further evaluate the model's generalization ability in real-world environments.
In this section, we use the EigenGradCAM [59] method to create heatmaps for both our enhanced model and leading detectors, visually highlighting each model's focus areas, as shown in Figure 11. We selected a variety of scenarios—such as urban bus stops and nighttime cityscapes—and compared heatmaps across different optical sensors to better illustrate the model's generalizability. Compared with other methods, AUHF-DETR pays significantly stronger attention to small objects, effectively detecting extremely small vehicles and pedestrians at long distances. Furthermore, the WTC-AdaResNet backbone design directs the model's focus toward object centers, improving bounding-box regression accuracy and overall detection performance.
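For readers who wish to reproduce this kind of visualization, the sketch below uses the open-source pytorch-grad-cam package, which provides an EigenGradCAM implementation; a torchvision ResNet is used as a stand-in network, and in practice the detector and one of its backbone stages would be passed instead. The attribute paths and input are illustrative assumptions.

```python
import numpy as np
import torch
from pytorch_grad_cam import EigenGradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision.models import resnet18

# Stand-in network; in practice, pass the detector and one of its backbone stages.
model = resnet18(weights=None).eval()
target_layers = [model.layer4[-1]]

rgb_img = np.random.rand(224, 224, 3).astype(np.float32)        # placeholder frame in [0, 1]
input_tensor = torch.from_numpy(rgb_img).permute(2, 0, 1).unsqueeze(0)

cam = EigenGradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor)[0]                # (H, W) activation map in [0, 1]
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
print(overlay.shape)                                             # (224, 224, 3) heatmap overlay
```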
Figure 12 presents the inference results of SOTA detectors alongside AUHF-DETR on the VisDrone2019-val dataset. To highlight the model's robustness, we include both daytime and nighttime scenes, focusing on scenarios with densely clustered small targets such as pedestrians and bicycles. In the images, red circles indicate false detections, green circles denote missed detections, and cyan circles mark false positives (FP). In the top-to-bottom comparison, competing algorithms frequently miss small objects (e.g., pedestrians and bicycles) due to limited feature-extraction capacity. Additionally, in complex scenarios involving motorbikes or bicycles with riders, they often misclassify the rider as a pedestrian (ground truth labeled as "others") or fail to detect them entirely. By contrast, AUHF-DETR not only recovers the majority of "others" targets but also detects and correctly classifies more small objects under low-light nighttime conditions.
To provide a more objective assessment of AUHF-DETR's inference performance, we compared it with mainstream detectors on the HIT-UAV and CARPK datasets, as illustrated in Figure 13. In the first column, YOLO-based models generate false positives on irrelevant objects in front of buildings. In the third column, RT-DETR misclassifies occluded vehicles as other motor vehicles, reducing detection accuracy. In the fifth column, under extremely dense scenes, SOTA detectors struggle to localize single-class objects. In contrast, AUHF-DETR accurately identifies all object categories in both infrared imagery and highly congested scenarios. Overall, AUHF-DETR demonstrates superior detection accuracy and robustness in the infrared domain and in heavily occluded, densely populated environments.
To further evaluate the model's generalization capability, we captured UAV aerial imagery over urban Fuzhou during morning and evening peak hours and ran inference on it, as shown in Figure 14. The results reveal only a small number of false positives and missed detections; the model accurately identifies tiny pedestrians and fast-moving objects, demonstrating exceptional detection performance and further corroborating its strong generalization. Additionally, the model's compact size facilitates deployment on other hardware platforms, fully satisfying the real-time detection requirements of UAV systems.
4.8. Exploratory Experiments on Detection Speed Using Embedded GPUs
To demonstrate the feasibility of deploying our proposed model for real-time detection on UAVs, we conducted speed tests of various mainstream models on an embedded GPU. The experiments were executed on the widely used NVIDIA Jetson AGX Xavier platform, with GPU-related specifications provided in Table 10.
Based on human visual perception of real-time performance, a model achieving ≥35 FPS is deemed suitable for onboard UAV deployment. We benchmarked the inference speeds of the YOLO series, the RT-DETR series, and our proposed variants on an NVIDIA Jetson AGX Xavier, with the results reported in Table 11. All of our variant models exceeded 35 FPS on the embedded GPU, thus satisfying the real-time detection requirements of UAVs. Notably, AUHF-DETR-M not only reduces the parameter count compared to the baseline but also lowers model complexity, achieving faster inference on the embedded GPU (47 FPS vs. 39 FPS; Table 11). In contrast, Transformer-based models exhibit lower FPS than the YOLO series due to their inherently higher computational complexity. In future work, we will investigate techniques such as pruning and knowledge distillation to further reduce model complexity and boost detection speed on embedded GPUs.
4.9. UAV Embedded Simulation Experiments
To demonstrate the model's embedded deployment more accurately, we conducted UAV simulation experiments based on PX4 within the ROS [60] framework, embedding AUHF-DETR in the UAV to emulate its detection performance in real-world scenarios. We constructed a virtual environment in Gazebo and mounted a monocular camera on the UAV for vehicle detection and tracking. Flight trajectories were managed using QGroundControl (QGC). The expected simulation result is shown in Figure 15a, where the UAV's captured imagery is streamed in real time to the display window at the upper left.
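A minimal sketch of how such a detection node can be wired into ROS is shown below: the node subscribes to the simulated camera topic, converts each frame with cv_bridge, and runs inference. The topic name and the detect() placeholder are illustrative assumptions; they are not the actual simulation code or model loader used in this work.

```python
#!/usr/bin/env python3
"""Minimal ROS node sketch: stream the simulated UAV camera through a detector."""
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def detect(frame):
    # Placeholder for the actual AUHF-DETR inference call on a BGR numpy frame.
    return []

def on_image(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    boxes = detect(frame)
    rospy.loginfo("detections: %d", len(boxes))

rospy.init_node("auhf_detr_detector")
# Hypothetical PX4/Gazebo camera topic; adjust to the actual vehicle/camera plugin.
rospy.Subscriber("/iris/camera/image_raw", Image, on_image, queue_size=1)
rospy.spin()
```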
On the UAV platform, we deployed the RT-DETR-r18 and AUHF-DETR-S models—both trained on the VisDrone2019 dataset—onto the drone and performed vehicle detection at various altitudes under QGC guidance, as illustrated in Figure 15b. Evaluations from different viewing angles and heights show that RT-DETR produces both false positives (within the red circle, a car is incorrectly classified as a truck) and missed detections (within the yellow circle, numerous small objects are missed). In contrast, the proposed AUHF-DETR consistently achieves high detection performance at both low and high altitudes while maintaining an inference speed of 86.9 FPS—fully meeting real-time operational requirements.
5. Discussion
The experimental results indicate that our network achieves strong performance on the VisDrone2019, HIT-UAV, and CARPK datasets. In this section, we provide an in-depth discussion and analysis of these findings.
In the VisDrone2019 dataset, pedestrians and motorbike/bicycle riders (ground truth labeled as "others") exhibit very similar shapes and sizes. As shown in the last column of Figure 12, algorithms such as YOLOv11, YOLOv10, and RT-DETR-r18 often merge these categories to boost overall accuracy or simply miss the "others" class altogether. However, accurately distinguishing pedestrians from "others" is critical for real-world applications. Our two variant models resolve these subtle distinctions effectively, avoiding both false positives and missed detections even when object features are nearly identical.
In the HIT-UAV dataset, small pedestrian targets in thermal infrared imagery are prone to being misclassified as background noise and subsequently ignored. Furthermore, the characteristics of thermal infrared sensors can produce spurious false objects under high-temperature conditions. As illustrated by the model outputs in Figure 13, these issues become especially evident in scenes with multiple targets and complex backgrounds. For example, in the first column of YOLOv11's detections, a white high-temperature object in front of a building is wrongly identified as a pedestrian, significantly reducing detection accuracy. Similarly, in the CARPK dataset, a model's capability to handle extremely dense occlusion is critical. Our proposed AUHF-DETR not only produces zero false positives in infrared remote sensing images but also maintains high detection precision under heavy occlusion (Table 9), highlighting its superior adaptability and generalization.
In UAV small-object detection tasks, high-altitude remote sensing imagery contains many small targets. As shown in Table 7 and Table 8 and Figure 11, Figure 12 and Figure 13, mainstream detectors such as RT-DETR and DEIM exhibit relatively low $AP_S$. Our method not only achieves superior $AP_S$ but also detects a greater number of small objects in the inference visualizations, demonstrating strong robustness.
In a series of embedded-GPU speed comparisons across mainstream detectors, our approach exhibited superior robustness and faster inference. As shown in Table 10 and Table 11, on the NVIDIA Jetson AGX Xavier platform, AUHF-DETR-S not only achieved the highest inference speed (68 FPS) but also had the smallest parameter count among models such as YOLOv11 and RT-DETR, underscoring its suitability for embedded deployment. Furthermore, as illustrated in Figure 14 and Figure 15, when deployed in the Gazebo simulation environment using the PX4 and ROS frameworks, AUHF-DETR delivered exceptional performance in UAV embedded simulation trials. The model also demonstrated strong generalization capability in real-time UAV detection over urban Fuzhou. In summary, the proposed method is well suited for integration into UAV systems for real-time onboard detection.
However, under extreme conditions, such as complete darkness in the visible-light spectrum at night and spurious artifacts in the thermal-infrared band under high-temperature scenarios, AUHF-DETR cannot perform all-weather observations. To mitigate these adverse effects, our future work will focus on integrating multispectral remote sensing data with knowledge distillation techniques to capture critical cross-modal information, ultimately developing a robust model capable of all-weather onboard UAV detection.
6. Conclusions
In this paper, we present AUHF-DETR, an innovative RT-DETR-based model designed for real-time small object detection in embedded UAV remote sensing applications. Our compact variant consists of just 10.29 million parameters while achieving an impressive inference speed of 68 FPS (AGX Xavier), fully meeting the requirements for onboard real-time deployment. AUHF-DETR addresses the key challenges associated with small object detection and embedded environments through several significant innovations. First, we developed the WTC-AdaResNet backbone to extract multi-scale object features with minimal parameter overhead. Second, we replace global self-attention with a PSA module to mitigate the exponential computational demands of traditional Transformer attention mechanisms. Third, we introduce a BDFPN specifically tailored for UAV small object detection, which accelerates convergence by alleviating the issue of limited positive matches typically found in the Hungarian one-to-one assignment. Lastly, we enhance the loss function with a dynamically scaled small object penalty term, further improving localization accuracy for small targets.
Experimental results on the VisDrone2019, CARPK, and HIT-UAV multimodal public datasets indicate that AUHF-DETR significantly outperforms RT-DETR and various other Transformer variants. Unlike many other object detectors, AUHF-DETR is compatible with UAV hardware platforms, achieves excellent inference speed on embedded GPUs, and excels in small-object detection, striking an optimal balance between detection performance and model size. Furthermore, the model accurately identifies all object categories in UAV remote sensing images of urban areas in Fuzhou and demonstrates excellent detection performance in the PX4 embedded simulation experiments.
However, for UAV remote-sensing detection tasks—especially under the extremely dense occlusion shown in Figure 12—AUHF-DETR still produces some missed and false detections of small targets, such as pedestrians and bicycles. Similarly, the model cannot perform all-day detection under extreme lighting conditions (e.g., completely dark visible-light scenes). To address these limitations, our future work will focus on enhancing the detection of densely occluded objects without increasing the model size and on incorporating multispectral datasets with knowledge distillation to capture key cross-modal information, ultimately developing an all-weather remote-sensing detection model capable of robustly detecting congested targets.