1. Introduction
In recent years, the decline in UAV production costs, together with advances in artificial intelligence and electronic information engineering, has placed UAVs at the forefront of international remote sensing research [1]. UAVs have seen widespread deployment in agricultural monitoring [2], Internet of Things frameworks [3], infrastructure inspection [4], and intelligent transportation systems [5]. Embedded real-time detection techniques play a key role in these applications, particularly in real-time object detection [6], target tracking [7], autonomous obstacle avoidance [8], and data analysis, where embedded models must identify small targets quickly and accurately. However, UAV remote sensing imagery inherently contains a high density of small objects, and onboard real-time computation is heavily constrained by existing hardware. The limited computational and memory resources of UAV platforms therefore pose a significant challenge for real-time small-object detection on embedded UAVs [3].
In UAV remote sensing imagery, target objects often occupy fewer than 32 × 32 pixels and are therefore classified as small objects [9]. To address this, Zhang et al. [10] proposed a multi-scale vehicle detection adversarial network that combines a multiscale detector with a dedicated small-object subnet. Gong et al. [11] enhanced YOLOv5 with a Normalized Attention Module (NAM)—dubbed SPHYOLOv5—to improve small-object localization, while Hamzenejadi et al. [12] incorporated lightweight attention mechanisms into the YOLO backbone to balance real-time speed with precision. Liu et al. [13] adopted a passive integration strategy to yield a more compact network architecture optimized for UAV deployment. Despite these gains in detection accuracy, the resulting models remain too large and computation-heavy for real-time inference on resource-constrained embedded platforms.
To address the requirements of small-object detection and embedded deployment, Zhao et al. [14] introduced RT-DETR, which removes the inference delay caused by non-maximum suppression (NMS) and outperforms YOLO-based detectors in both accuracy and speed. This real-time detector is therefore highly suitable for practical deployment under stringent latency constraints. Furthermore, Huang et al. [15] proposed DEIM, a recent state-of-the-art (SOTA) object detector designed to balance small-object detection performance, computational efficiency, and matching efficiency. Nonetheless, it encounters two significant challenges in practical implementation. The first is that the Transformer's global attention mechanism calculates pairwise relationships among all feature vectors, so inference complexity grows quadratically with the number of tokens. The second is that the training process converges slowly: the one-to-one matching strategy of the Hungarian algorithm yields very few positive samples and many low-quality matches, limiting the model's ability to learn, particularly for small-object detection.
In this study, we introduce a novel approach known as the Adaptive UAV Hardware-Focused Network for Target Detection (AUHF-DETR) to mitigate two key challenges in embedded small-object detection. This method builds upon the RT-DETR-r18 framework and features a lightweight backbone for small-object detection, complemented by a spatial attention module. These elements are crucial for satisfying the real-time detection requirements of embedded UAV platforms. The core distinction between our approach and DEIM lies in the encoder architecture: AUHF-DETR employs a single encoder (built on a CNN backbone), whereas DEIM utilizes a dual-encoder design. In summary, our method not only inherits RT-DETR’s exceptional inference speed and maintains stable accuracy across varying object scales, but also—thanks to its simpler single-encoder structure—is better suited for real-time deployment on edge-device GPUs than DEIM.
We evaluated our approach on the VisDrone2019, CARPK, and HIT-UAV datasets. Despite the complexity of the environments, the high density of small objects, and the multimodal characteristics of these public remote sensing benchmarks, AUHF-DETR consistently delivers impressive performance.
The main contributions of this work are as follows:
In the backbone, the WTConv-Block and AdaRes-Block enhance feature extraction, bridging the gap between remote-sensing object-detection models and the real-time demands of UAVs. Reversible connections allow for gradually decoupling small-object features during forward propagation, improving recognition accuracy;
In the attention module, we introduce the PSA component. Unlike the baseline model’s global self-attention, PSA adaptively computes relationships among feature vectors within each ROI, significantly reducing model complexity and meeting embedded deployment constraints;
In the encoder stage, we propose a small-object-focused pyramid structure, BDFPN. To enhance training, we increase the number of targets per image to generate more positive samples, providing dense supervision signals. This approach accelerates convergence and increases the accuracy of small object detection;
For the loss function, we design a small-object penalty term with a scaling parameter. By dynamically adjusting this scaling factor, the model flexibly computes IoU for small objects, elevating the loss weight on small-object localization and mitigating the adverse effects of bounding-box size variation on detection performance.
The structure of this paper is as follows:
Section 2 reviews the evolution of object detection in remote sensing.
Section 3 provides a detailed description of AUHF-DETR, our method for real-time target detection embedded in UAVs.
Section 4 presents the experimental evaluation—including ablation studies, comparative assessments, visualization experiments, and embedded PX4 simulation—conducted on the VisDrone2019, CARPK, and HIT-UAV datasets to evaluate the effectiveness of the proposed approach.
Section 5 discusses how AUHF-DETR overcomes the challenges of real-time small-object detection on embedded UAV platforms. Finally, Section 6 concludes the paper and highlights the model's potential value.
3. Proposed Method
3.1. Overall Framework
The Adaptive UAV Hardware-Focused Network for Target Detection (AUHF-DETR) is a lightweight, embedded detection Transformer that utilizes a spatial attention mechanism; the overall architecture is illustrated in Figure 1. Built on RT-DETR-r18, the model comprises four main components: a backbone network, an encoder, a decoder, and a detection head. High-resolution UAV remote-sensing images are first processed by the WTC-AdaResNet backbone, which extracts hierarchical semantic features through a multi-scale pyramid structure. The final pyramid stage (S5) is directed to the PSA module for ROI-based feature interaction. Simultaneously, multi-scale features from the backbone (S2, S3, S4, and F5) are sent to the encoder for effective feature fusion. In the encoder, we integrate the baseline's core component—the RepC3 module, a convolutional block derived from YOLOv5's C3 design and used in RT-DETR—which enhances computational efficiency through partial connections and re-parameterization. Finally, the decoder and detection head work together to predict bounding boxes, facilitating precise target localization.
To specifically address the challenges of embedded small-object detection in UAV remote sensing imagery, we introduce wavelet convolution into the backbone's convolutional blocks, creating a wide-field multi-frequency dilation module (WTConv-Block). This design effectively reduces the parameter inflation caused by enlarged receptive fields, thereby decreasing the overall model size. We also propose a resolution-adaptive algorithm to build a cross-scale feature decoupling module (AdaRes-Block), which facilitates reversible feature flow between sub-networks after sampling and thus preserves richer details of small objects. The Partition Split Spatial Attention (PSA) module uses spatial attention weights to restrict long-range pairwise computations, addressing the complexity mismatch between remote-sensing detectors and onboard real-time requirements. Our Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN) conducts efficient, layer-wise sampling consistent with UAV real-time paradigms, further enhancing embedded deployability. Finally, in the decoder, we implement an Inner-MPDIoU bounding-box regression strategy: by selecting queries that minimize uncertainty, we consolidate the encoder's differentiated features for large, medium, and small objects, improving small-object localization accuracy. The following subsections describe each module in detail, and Figure 2 presents the complete AUHF-DETR architecture.
3.2. Improvement of Backbone (WTC-AdaResNet)
In embedded UAV remote-sensing object detection, reconciling model size with onboard constraints while accurately detecting small targets is critical. To this end, we integrate a wavelet convolution module and an adaptive-resolution module into the backbone, yielding the wide-receptive-field WTConv-Block and the adaptive reversible-resolution AdaRes-Block. The resulting backbone, WTC-AdaResNet, addresses the computational mismatch between conventional remote-sensing detectors and embedded UAV real-time requirements.
As shown in Figure 2, WTC-AdaResNet consists of only 23 convolutional layers and produces feature maps down to 1/16 of the input resolution. In contrast to mainstream backbones that employ more downsampling layers, our design significantly reduces both the parameter count and the computational overhead. To better preserve small-object details, we first stack three CBR layers to produce a 1/2-resolution feature map and then apply a MaxPool layer to obtain a 1/4-resolution map. In Stages 1–3, we parallelize three WTConv-Blocks to capture richer detail through large receptive fields without altering the spatial resolution. It is important to note that the Stage 4 AdaRes-Block does not operate directly on the 1/16 (or lower) resolution feature maps. Instead, it ingests the 1/4-resolution features produced by Stage 3 and applies reversible connections and cross-scale fusion at that resolution to ensure that fine-grained details are fully preserved. Meanwhile, the WTConv-Blocks in the first three stages have already leveraged multi-frequency, small-kernel convolutions to expand the local receptive field, supplying richer contextual cues for small-object detection. During feature propagation, the AdaRes-Block generates a 1/16-resolution feature map; its reversible connections effectively mitigate the information loss typically encountered in deep architectures. For the BDFPN inputs, we extract features from backbone layers 4, 5, 6, and 10, promoting feature reuse and bolstering information propagation.
3.2.1. WTConv-Block
To address the parameter explosion that accompanies larger convolutional kernels in small-object detection, we integrate WTConv into our backbone. WTConv employs multi-frequency dilated convolutions to dynamically expand the receptive field without a proportional increase in parameters. Conventional CNNs, constrained by fixed kernel sizes, often lack sufficient global context for accurately detecting small targets. While many detectors boost accuracy by enlarging kernels, this strategy incurs substantial parameter growth, creating performance bottlenecks and undermining the balance between model size and detection precision on UAV platforms. In contrast, WTConv delivers enhanced global feature perception and improved detection efficiency with minimal computational overhead.
WTConv [44], introduced at ECCV 2024, solves the parameter inflation issue inherent in expanding CNN receptive fields, allowing efficient capture of both local and global features. We embed WTConv into the downsampling stages of RT-DETR's backbone, optimizing the convolutional network to reduce parameter count and computational complexity. This integration enables AUHF-DETR to be deployed on UAV hardware, meeting the real-time detection requirements of remote-sensing imagery.
We adopt the efficient Haar wavelet transform, which—unlike Daubechies or Meyer wavelets—effectively captures contextual semantics while limiting computational cost. This transform decomposes an image spatially into high- and low-frequency components via depthwise convolution and downsampling. In one dimension, it employs the low-pass kernel $\frac{1}{\sqrt{2}}[1,\ 1]$ and the high-pass kernel $\frac{1}{\sqrt{2}}[1,\ -1]$; in two dimensions, the operation extends to four separable filters for depthwise convolution:

$$f_{LL}=\frac{1}{2}\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix},\quad f_{LH}=\frac{1}{2}\begin{bmatrix}1 & -1\\ 1 & -1\end{bmatrix},\quad f_{HL}=\frac{1}{2}\begin{bmatrix}1 & 1\\ -1 & -1\end{bmatrix},\quad f_{HH}=\frac{1}{2}\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}$$

The four filters decompose the input UAV remote-sensing image $X$ into a low-frequency component $X_{LL}$ and high-frequency components $X_{LH}$, $X_{HL}$, and $X_{HH}$:

$$[X_{LL},\ X_{LH},\ X_{HL},\ X_{HH}] = \mathrm{WT}(X)$$
The WTConv module applies hierarchical wavelet decomposition to split the input signal into subbands at different frequencies, thereby enhancing sensitivity to low-frequency information. To alleviate the trade-off between convolution-kernel size and parameter count, we first filter and downsample the high- and low-frequency components via the wavelet transform, then apply small convolutional kernels to each subband, and finally reconstruct the fused output using the inverse wavelet transform (IWT). Figure 3 illustrates the basic architecture of the WTConv module. The entire process can be summarized as

$$Y = \mathrm{IWT}\big(\mathrm{Conv}\big(w,\ \mathrm{WT}(X)\big)\big),$$

where $X$ denotes the input tensor, $w$ is the convolutional kernel, and WT and IWT refer to the wavelet transform and its inverse, respectively.
WTConv efficiently decouples convolutional operations across multiple frequency bands, enabling an expanded effective receptive field with smaller kernels and reduced computational overhead. This design empowers the network to capture both local and global features of small objects in remote sensing imagery, transcending the local-only constraints of traditional convolutions. As a result, the model learns richer semantic representations, leading to superior small-object detection accuracy. By constraining the size of the wavelet kernels, WTConv aggregates low-frequency information over a broader spatial extent without inflating parameter counts, thereby addressing the challenge of deploying large-scale detectors on resource-limited UAV platforms.
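To make the idea concrete, the following is a minimal sketch of a single-level Haar wavelet convolution: fixed 2 × 2 Haar analysis filters perform the transform, a small depthwise convolution processes each subband, and the transposed convolution with the same (orthonormal) filters acts as the inverse transform. This is a simplification for illustration; the actual WTConv [44] cascades multiple wavelet levels and keeps a base convolution path.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels):
    # Four 2x2 Haar analysis filters (LL, LH, HL, HH), scaled by 1/2 so the set is orthonormal.
    ll = torch.tensor([[1., 1.], [1., 1.]]) / 2
    lh = torch.tensor([[1., -1.], [1., -1.]]) / 2
    hl = torch.tensor([[1., 1.], [-1., -1.]]) / 2
    hh = torch.tensor([[1., -1.], [-1., 1.]]) / 2
    f = torch.stack([ll, lh, hl, hh])                     # (4, 2, 2)
    # Repeat per input channel for a grouped (depthwise) convolution: (4*C, 1, 2, 2).
    return f.unsqueeze(1).repeat(channels, 1, 1, 1)

class SimpleWTConv(nn.Module):
    """Single-level Haar WT -> small conv per subband -> inverse WT."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.channels = channels
        self.register_buffer("wt", haar_filters(channels))
        # Small depthwise convolution applied jointly to the four subbands of every channel.
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=kernel_size // 2, groups=4 * channels)

    def forward(self, x):
        c = self.channels
        # Wavelet transform: stride-2 depthwise conv with the fixed Haar filters.
        sub = F.conv2d(x, self.wt, stride=2, groups=c)        # (B, 4C, H/2, W/2)
        sub = self.subband_conv(sub)                          # small-kernel conv per subband
        # Inverse wavelet transform: transposed conv with the same orthonormal filters.
        return F.conv_transpose2d(sub, self.wt, stride=2, groups=c)

x = torch.randn(1, 16, 64, 64)
print(SimpleWTConv(16)(x).shape)   # torch.Size([1, 16, 64, 64])
```

Because the scaled Haar filter bank is orthonormal, the transform/inverse pair alone reconstructs the input exactly; the learnable subband convolution is what enlarges the effective receptive field at low parameter cost.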
3.2.2. AdaRes-Block
To address missed detections of small objects in UAV remote-sensing imagery, we introduce the Adaptive Resolution Module (AdaRes-Block). This module integrates multiple subnetworks (SubNets), allowing information to flow reversibly between them and thus preventing the feature loss common in conventional CNNs. As illustrated in Figure 4a, the AdaRes-Block is inserted at Stage 4 of the backbone, while the first three stages stack multiple wavelet-convolution downsampling modules to enlarge the receptive field of the feature maps, thereby enhancing the model's ability to capture the fine-grained edges and textures of small objects. The AdaRes-Block receives feature maps from the preceding stage and, by leveraging reversible connections, progressively decouples and preserves features during forward propagation—unlike traditional networks, which often compress or discard information. These reversible inter-SubNet connections retain rich spatial details from shallow layers while providing stronger semantic representations at higher resolutions in deeper layers. As a result, detection performance for small objects in UAV applications is markedly improved.
In Figure 4b, the core component of the AdaRes-Block is the AdaRes unit, which creates reversible connections among feature maps of different resolutions during feature propagation, allowing repeated feature reuse. As shown in Figure 4c, increasing the number of SubNet layers enables the model to capture progressively richer detailed features, resulting in superior performance on complex detection tasks. However, deeper SubNet configurations also significantly increase the parameter count, heightening dependency on high-performance computing resources—contrary to the real-time constraints of embedded UAV deployment. To strike a balance between SubNet depth and model compactness, we conducted a series of experiments; the results indicate that one or two SubNet layers offer the best trade-off between model size and detection accuracy. Accordingly, we introduce two lightweight variants: AUHF-DETR-S and AUHF-DETR-M. The detailed block architecture and the reversible connection mechanism are illustrated in Figure 5.
In small-object detection tasks, AdaRes implements a two-fold process of high-resolution feature preservation and multi-scale feature fusion. For high-resolution feature preservation (Figure 5a), the input image first passes through the STEM module, where a depthwise convolution extracts initial features. These features are then normalized via LayerNorm and fed into the reversible core component, SubNet. In SubNet, bidirectional reversible connections preserve high-resolution details, addressing the loss of spatial detail—and the consequent localization error—for small objects caused by repeated downsampling in deep networks. The forward propagation is formalized in Equation (4): features traverse multiple levels (Level 0–3), with each level's output linearly weighted by a learnable parameter $\alpha$ and summed with the original input features to prevent unidirectional information loss for small targets. By retaining rich feature representations at deeper layers, the AdaRes mechanism enhances the model's capacity and efficiency in learning and expressing small-object details.
$$c_0^{(t)} = F_0\!\left(x,\ c_1^{(t-1)}\right) + \alpha \cdot c_0^{(t-1)} \qquad (4)$$

Here, $c_0^{(t-1)}$ denotes the Level 0 feature map from the previous iteration, $c_1^{(t-1)}$ denotes the historical feature map from the adjacent higher level, and $c_0^{(t)}$ denotes the updated feature map. $\alpha$ is a learnable dynamic weight parameter that adjusts the contribution of $c_0^{(t-1)}$ to the current feature representation. Finally, $F_0(\cdot)$ represents the feature extraction operation at Level 0, where $x$ is the input feature map for this level.
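The following is a minimal sketch of an AdaRes-style reversible column implementing the additive update of Equation (4), under the simplifying assumption that all levels share one resolution; the level operations, module names, and zero initial states are illustrative placeholders rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ReversibleColumn(nn.Module):
    """Sketch of Equation (4): c_i_new = F_i(x, c_{i+1}) + alpha_i * c_i."""
    def __init__(self, channels, num_levels=4):
        super().__init__()
        # Placeholder level operations F_i; the real blocks are wavelet/ConvNeXt-style units.
        self.levels = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 3, padding=1) for _ in range(num_levels)]
        )
        self.alpha = nn.Parameter(torch.ones(num_levels))   # learnable reversible weights

    def forward(self, x, states):
        # states[i]: level-i feature map carried over from the previous iteration/column.
        new_states = []
        for i, level in enumerate(self.levels):
            higher = states[i + 1] if i + 1 < len(states) else torch.zeros_like(states[i])
            fused = level(torch.cat([x, higher], dim=1))      # F_i(x, c_{i+1})
            # Additive coupling: invertible whenever alpha_i != 0, so c_i can be recovered.
            new_states.append(fused + self.alpha[i] * states[i])
        return new_states

channels = 32
col = ReversibleColumn(channels)
x = torch.randn(1, channels, 40, 40)
states = [torch.zeros(1, channels, 40, 40) for _ in range(4)]
print([s.shape for s in col(x, states)])
```

The additive coupling is the key design choice: because the previous-iteration state can be recovered from the output, no level is forced to discard the fine-grained information it received.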
In the implementation of multi-scale feature fusion, to address the insufficient extraction of shallow details (edges, textures) and deep contextual semantics in conventional networks for small-object detection, we propose the fusion strategy illustrated in Figure 5c. Within each SubNet's Level component, this method aligns feature resolutions across adjacent levels. In the downsampling branch, a Conv2D operation halves the spatial dimensions of higher-level features to match the lower-level resolution (e.g., Level 1 → Level 2); in the upsampling branch, a linear transformation followed by nearest-neighbor interpolation upsamples lower-level features to satisfy higher-level requirements (e.g., Level 2 → Level 1). The aligned features are then passed to the ConvNeXt block depicted in Figure 5d, where depthwise-separable convolutions, pointwise convolutions, GELU activations, and LayerNorm collectively encode richer semantic information within the enlarged receptive field. In UAV ground-target detection scenarios, this fidelity-preserving fusion approach markedly improves the detection accuracy of small objects in remote sensing imagery.
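A minimal sketch of this alignment-and-refinement step is shown below, assuming one target level that receives a finer neighbour (via strided convolution) and a coarser neighbour (via 1 × 1 mapping plus nearest-neighbour upsampling); channel counts and the 7 × 7 depthwise kernel are illustrative choices, not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNeXtStyleBlock(nn.Module):
    """Depthwise conv -> LayerNorm -> pointwise expand -> GELU -> pointwise project, with residual."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)   # depthwise convolution
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Linear(dim, 4 * dim)                        # pointwise (1x1) as Linear
        self.pw2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        res = x
        x = self.dw(x).permute(0, 2, 3, 1)                        # (B, H, W, C) for norm/linear
        x = self.pw2(F.gelu(self.pw1(self.norm(x))))
        return res + x.permute(0, 3, 1, 2)

class LevelAlign(nn.Module):
    """Align adjacent-level features to the current level's resolution, then refine."""
    def __init__(self, c_finer, c_coarser, c_cur):
        super().__init__()
        self.down = nn.Conv2d(c_finer, c_cur, 3, stride=2, padding=1)  # downsampling branch
        self.lin = nn.Conv2d(c_coarser, c_cur, 1)                      # linear mapping before upsample
        self.refine = ConvNeXtStyleBlock(c_cur)

    def forward(self, finer, coarser):
        d = self.down(finer)                                           # halve the finer level
        u = F.interpolate(self.lin(coarser), size=d.shape[-2:], mode="nearest")
        return self.refine(d + u)

align = LevelAlign(c_finer=64, c_coarser=256, c_cur=128)
out = align(torch.randn(1, 64, 80, 80), torch.randn(1, 256, 20, 20))
print(out.shape)   # torch.Size([1, 128, 40, 40])
```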
Compared with other feature fusion or preservation techniques (e.g., dense connections), the proposed reversible connection design offers advantages across multiple dimensions. In terms of information completeness, dense connections concatenate all features from previous layers, often leading to excessive channel dimensionality. In contrast, reversible connections perform additive coupling under fixed resolution, preserving all input information without increasing the number of channels. For fine-grained small object feature retention, traditional skip connections typically fuse shallow and deep features at fixed levels. However, AdaRes leverages reversible mechanisms to align and fuse features across multiple resolutions within each SubNet, ensuring bidirectional flow and reuse of both low-level details and high-level semantics during every forward pass. Regarding embedded real-time deployment, while dense connections significantly increase parameters and computation, our reversible design maintains a lightweight architecture that meets the constraints of embedded systems while achieving competitive small object detection accuracy with only 1–2 SubNet configurations. Overall, the AdaRes-Block not only improves detail preservation for small objects but also satisfies the dual requirements of computational efficiency and memory footprint in UAV-based real-time detection scenarios.
3.3. Partition Split Spatial Attention (PSA)
To tackle the low inference speed (FPS) and the quadratic computational complexity caused by the global self-attention mechanism in traditional DETR, we present the PSA module. By utilizing a spatial attention strategy that limits pairwise computations to local regions, PSA significantly decreases model complexity while increasing detection FPS.
As illustrated in Figure 6, the PSA module consists of two parallel branches with shared parameters: a low-frequency adaptive feature interaction branch and a high-frequency lightweight feature extraction branch. Designed to minimize computational complexity, PSA effectively processes fine-grained semantic details in remote sensing images and enhances intra-scale feature fusion. By leveraging spatial attention, it adaptively computes pairwise relationships among feature vectors within each ROI, significantly reducing overall model complexity. In small-object detection, conventional global self-attention mechanisms struggle to capture long-range dependencies without incurring high computational costs, as they focus on global features and overlook the distinction between high- and low-frequency components—critical for detecting small targets rich in fine details. As shown in Figure 6a, the high-frequency branch employs depthwise separable convolutions to enhance feature resolution, effectively capturing edges and minute textures in UAV imagery while markedly improving small-object recognition. Figure 6b depicts the low-frequency branch, which uses average pooling to generate spatial attention weights, allowing the model to aggregate broader contextual information. By encoding low-frequency features, the module extracts global structural cues while focusing computation on ROI-specific vectors, eliminating the redundant operations inherent in global attention and thus significantly lowering computational load while increasing inference FPS. PSA restructures the input feature map from four dimensions (batch, channel, width, height) to five dimensions (batch, channel, width, height, sampling factor), partitioning spatial information into smaller regions and reducing complexity. This design yields higher FPS without adding parameters and significantly enhances small-object detection performance.
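The core cost saving comes from restricting attention to local partitions. The following is a minimal sketch of that partition-and-attend idea: the feature map is split into non-overlapping p × p regions and self-attention is computed only among the tokens of each region. The partition size and the single-branch structure are illustrative; the real PSA adds the high-/low-frequency branches described above.

```python
import torch
import torch.nn as nn

class PartitionedSelfAttention(nn.Module):
    """Self-attention restricted to p x p spatial partitions (local ROIs)."""
    def __init__(self, dim, num_heads=4, partition=4):
        super().__init__()
        self.p = partition
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W); H and W divisible by p
        B, C, H, W = x.shape
        p = self.p
        # Regroup pixels into (B * number_of_regions, p*p, C): tokens only within each region.
        x = x.view(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(B * (H // p) * (W // p), p * p, C)
        out, _ = self.attn(x, x, x)              # pairwise attention among p*p tokens per region
        # Restore the original (B, C, H, W) layout.
        out = out.reshape(B, H // p, W // p, p, p, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)

psa = PartitionedSelfAttention(dim=256, num_heads=4, partition=4)
print(psa(torch.randn(2, 256, 16, 16)).shape)    # torch.Size([2, 256, 16, 16])
```

With N = H × W tokens overall and only p² tokens per region, the pairwise cost scales with N · p² rather than N², which is the source of the FPS gain discussed above.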
To clarify the fundamental differences between AUHF-DETR and the YOLO architecture: YOLO variants employ convolutional prediction heads at the end of the backbone, directly performing multi-scale convolutional regression and classification on feature maps. In contrast, AUHF-DETR inserts a PSA module after multi-scale feature fusion (detailed in Section 3.2.2), leveraging locally partitioned self-attention to enhance detail and enable global information exchange, while fully preserving the Transformer's query-key-value operations and end-to-end matching pipeline.
3.4. Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN)
Lin et al. [45] first proposed the Feature Pyramid Network (FPN), which fuses multi-scale feature maps from pyramid levels P3 through P7 via a top-down pathway with lateral connections. Building on this, PANet [46] augments FPN with a complementary bottom-up path to reinforce feature propagation and fusion, albeit at the cost of increased model complexity. To retain performance while streamlining the architecture, BiFPN [47] prunes low-utility nodes—those with only a single input—and employs bidirectional pathways for efficient multi-scale feature fusion. Moreover, BiFPN introduces learnable scalar weights on each input edge, enabling adaptive calibration of multi-resolution features.
Building on this, we propose a Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN) specifically designed for small-object detection to accelerate early-stage convergence. BDFPN tackles the positive-sample scarcity inherent in DETR’s Hungarian one-to-one assignment by increasing the number of object instances per image during training. This results in denser positive samples, richer supervisory signals, and higher-quality matches to expedite convergence. Additionally, BDFPN adaptively selects optimal feature-fusion pathways and includes high-resolution shallow-layer features, improving the network’s ability to concentrate on critical details of ground-level small targets.
Figure 7 presents a comparison of various pyramid network architectures.
BDFPN enables cross-scale feature fusion by integrating convolutional modules, upsampling layers, RepC3 blocks, and a dynamic fusion unit. As shown in Figure 8, three distinct fusion strategies are adopted. Within the pyramid structure, the dynamic fusion unit first aggregates the input features along the channel dimension and employs a CBS block to produce spatially adaptive weights. These weights are normalized via the Softmax function to match the number of input features. Subsequently, each weight is applied element-wise to its corresponding feature map, and the weighted maps are summed to yield a high-resolution representation that captures both global context and fine details. The dynamic fusion mechanism adaptively adjusts the contribution of each feature map, enabling efficient multi-scale integration, accelerating training convergence, and improving small-object detection accuracy.
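The following is a minimal sketch of such a dynamic fusion unit, assuming the input feature maps have already been resized to a common resolution; the CBS block is approximated here by Conv + BatchNorm + SiLU, and the exact layer choices are illustrative.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Weighted fusion of N same-resolution feature maps with spatially adaptive weights."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        # CBS-style block producing one weight map per input feature.
        self.weight_gen = nn.Sequential(
            nn.Conv2d(channels * num_inputs, num_inputs, kernel_size=1),
            nn.BatchNorm2d(num_inputs),
            nn.SiLU(),
        )

    def forward(self, feats):                     # feats: list of (B, C, H, W) tensors
        stacked = torch.cat(feats, dim=1)         # aggregate along the channel dimension
        w = self.weight_gen(stacked)              # (B, N, H, W) spatially adaptive weights
        w = torch.softmax(w, dim=1)               # normalise across the N inputs
        fused = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
        return fused                              # (B, C, H, W)

fusion = DynamicFusion(channels=128, num_inputs=3)
feats = [torch.randn(1, 128, 40, 40) for _ in range(3)]
print(fusion(feats).shape)   # torch.Size([1, 128, 40, 40])
```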
3.5. Inner-MPDIoU Regression for Small Object Detection
In the RT-DETR model, the classification task is optimized with a cross-entropy loss, while bounding-box regression uses the L1 loss and the GIoU loss; target matching is performed with the Hungarian algorithm. The GIoU loss is given in Equation (5), and the total loss of RT-DETR is defined in Equation (6):

$$\mathcal{L}_{GIoU} = 1 - \mathrm{IoU}\left(B_{gt}, B_{prd}\right) + \frac{\left|C \setminus \left(B_{gt} \cup B_{prd}\right)\right|}{|C|} \qquad (5)$$

$$\mathcal{L}_{total} = \mathcal{L}_{cls} + \lambda_{1}\,\mathcal{L}_{L1} + \lambda_{2}\,\mathcal{L}_{GIoU} \qquad (6)$$

In Equation (5), $B_{gt}$ and $B_{prd}$ represent the ground-truth and predicted bounding boxes, respectively, and $C$ denotes the smallest enclosing box covering both $B_{gt}$ and $B_{prd}$. The final term of the GIoU loss reflects the proportion of the area within $C$ that lies outside the union of $B_{gt}$ and $B_{prd}$, effectively penalizing predictions that are far from the ground truth. In Equation (6), $\mathcal{L}_{cls}$ denotes the classification loss, $\lambda_{1}$ weights the regression loss $\mathcal{L}_{L1}$, which is computed using the Smooth L1 loss, and $\mathcal{L}_{GIoU}$ evaluates the overlap quality between bounding boxes. The weights $\lambda_{1}$ and $\lambda_{2}$ balance the contributions of each loss component.
Overall, GIoU enhances localization performance by addressing zero-overlap cases and incorporating a penalty term based on the enclosing box. However, for small object detection, the enclosing box C tends to be disproportionately large, which reduces the penalty term and causes GIoU to closely resemble the standard IoU loss, providing limited additional information. Furthermore, GIoU is not sufficiently sensitive to variations in aspect ratio and suffers from weak penalization, negatively affecting performance in challenging detection scenarios. To overcome these limitations, we redesign the bounding box regression loss and introduce the Inner-MPDIoU loss.
The MPDIoU loss integrates three elements: the IoU term, the minimum point distance (MPD) metric, and a normalization component. The IoU term quantifies the overlap between the prediction and the ground truth via the intersection-to-union ratio, as defined in Equation (7):

$$\mathrm{IoU} = \frac{\left|B_{gt} \cap B_{prd}\right|}{\left|B_{gt} \cup B_{prd}\right|} \qquad (7)$$

where $B_{gt}$ denotes the ground-truth bounding-box region and $B_{prd}$ denotes the predicted bounding-box region.

The innovation of the MPDIoU loss lies in the introduction of two minimum-point-distance terms to balance spatial discrepancies between bounding boxes. As defined in Equations (8) and (9), the positional offset is measured by the Euclidean distances between the top-left and bottom-right corner points of the predicted and ground-truth boxes. To prevent inconsistent gradient magnitudes caused by varying box sizes, these distance terms are normalized by the width and height of the original image. The complete loss formulation is then given in Equation (10):

$$d_1^2 = \left(x_1^{prd} - x_1^{gt}\right)^2 + \left(y_1^{prd} - y_1^{gt}\right)^2 \qquad (8)$$

$$d_2^2 = \left(x_2^{prd} - x_2^{gt}\right)^2 + \left(y_2^{prd} - y_2^{gt}\right)^2 \qquad (9)$$

$$\mathcal{L}_{MPDIoU} = 1 - \mathrm{IoU} + \frac{d_1^2}{w^2 + h^2} + \frac{d_2^2}{w^2 + h^2} \qquad (10)$$

where $\left(x_i^{prd},\ y_i^{prd}\right)$ denotes the coordinates of the predicted box's corner (top-left for $i=1$, bottom-right for $i=2$), $\left(x_i^{gt},\ y_i^{gt}\right)$ denotes the corresponding ground-truth corner coordinates, and $w^2 + h^2$ is the sum of the squared width $w$ and height $h$ of the original image.
To enhance the model's small-object detection capability and accelerate convergence, we extend the MPDIoU loss by incorporating auxiliary bounding boxes from the Inner-IoU framework, thereby formulating the novel Inner-MPDIoU loss function with an added small-object term. By introducing a dynamic scaling factor (ratio), the model can adaptively compute the relevant IoU values, increasing the weight assigned to small objects. This modification not only accelerates convergence during training but also decreases both false positives and false negatives for small targets in remote sensing imagery. The formal definition of Inner-MPDIoU is as follows:

$$b_l = x_c - \frac{w \cdot ratio}{2},\quad b_r = x_c + \frac{w \cdot ratio}{2},\quad b_t = y_c - \frac{h \cdot ratio}{2},\quad b_b = y_c + \frac{h \cdot ratio}{2}$$

$$\mathcal{L}_{Inner\text{-}MPDIoU} = \mathcal{L}_{MPDIoU} + \mathrm{IoU} - \mathrm{IoU}^{inner}$$

where $b_l$, $b_r$, $b_t$, and $b_b$ denote the left, right, top, and bottom boundary coordinates of the ratio-scaled auxiliary box built from the ground-truth or predicted box, respectively, $\mathrm{IoU}^{inner}$ is the IoU computed on these auxiliary boxes, and $w \cdot ratio$ and $h \cdot ratio$ denote the width and height of the contracted region within the ground-truth box.
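A simplified sketch of this loss is shown below, combining the MPDIoU corner-distance terms with Inner-IoU-style auxiliary boxes scaled by ratio. The exact weighting of the small-object term follows the formulation above; the combination rule, default ratio, and function names here are illustrative assumptions rather than the released implementation.

```python
import torch

def box_area(b):
    return (b[:, 2] - b[:, 0]).clamp(min=0) * (b[:, 3] - b[:, 1]).clamp(min=0)

def pairwise_iou(a, b, eps=1e-7):
    lt = torch.max(a[:, :2], b[:, :2])
    rb = torch.min(a[:, 2:], b[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    return inter / (box_area(a) + box_area(b) - inter + eps)

def shrink(box, ratio):
    # Inner-IoU auxiliary box: same centre, width/height scaled by the ratio factor.
    cx, cy = (box[:, 0] + box[:, 2]) / 2, (box[:, 1] + box[:, 3]) / 2
    w, h = (box[:, 2] - box[:, 0]) * ratio, (box[:, 3] - box[:, 1]) * ratio
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

def inner_mpdiou_loss(pred, target, img_w, img_h, ratio=0.8):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2)."""
    iou = pairwise_iou(pred, target)
    # MPDIoU: squared distances of the two corner points, normalised by the image size.
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou_loss = 1.0 - iou + (d1 + d2) / (img_w ** 2 + img_h ** 2)
    # Inner-IoU on the ratio-scaled auxiliary boxes re-weights small-object overlap.
    inner_iou = pairwise_iou(shrink(pred, ratio), shrink(target, ratio))
    return mpdiou_loss + iou - inner_iou

p = torch.tensor([[10., 10., 20., 22.]])
t = torch.tensor([[12., 11., 21., 23.]])
print(inner_mpdiou_loss(p, t, img_w=640, img_h=640))
```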
3.6. Model’s Decoder and Transformer Structure
In this section, we first present the decoder architecture of our model, and then detail the connections and innovations of AUHF-DETR with Transformer techniques to better situate our work within the relevant technical context.
3.6.1. Structure of the Model Decoder
The AUHF-DETR decoder inherits its architecture from RT-DETR [14], comprising ten identical Transformer decoder layers. In each layer, a set of learnable query vectors first undergoes multi-head self-attention to capture global context dependencies among queries. The updated queries are then fused with encoder feature maps via multi-head cross-attention, enabling query-key-value-based feature interaction. Following the attention modules, two feed-forward networks with GELU activations, residual connections, and layer normalization ensure feature diversity and training stability. The decoder's output queries are projected by two separate multilayer perceptrons (MLPs) into bounding-box regression vectors and class prediction scores. Whereas the original RT-DETR optimizes with an L1 + GIoU composite loss, AUHF-DETR replaces the regression term with our proposed Inner-MPDIoU loss—augmented with a small-object penalty—to strengthen gradient responses and localization accuracy for small targets. Finally, all predicted boxes are paired one-to-one with ground-truth boxes via the minimum-uncertainty matching mechanism, completing the end-to-end detection pipeline.
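A simplified sketch of one such decoder layer is shown below (query self-attention, cross-attention against the flattened encoder features, a feed-forward block, and separate MLP heads); the hidden dimension, number of queries, and single-FFN structure are illustrative simplifications of the RT-DETR design.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Query self-attention -> cross-attention with encoder features -> feed-forward."""
    def __init__(self, dim=256, heads=8, ffn_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, memory):
        q = self.n1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.n2(q + self.cross_attn(q, memory, memory)[0])   # keys/values from the encoder
        return self.n3(q + self.ffn(q))

dim, num_queries = 256, 300
layer = DecoderLayer(dim)
queries = torch.randn(1, num_queries, dim)          # learnable object queries
memory = torch.randn(1, 20 * 20, dim)               # flattened encoder feature map
out = layer(queries, memory)
# Separate MLP heads project each query to box coordinates and class scores.
box_head, cls_head = nn.Linear(dim, 4), nn.Linear(dim, 10)
print(box_head(out).shape, cls_head(out).shape)     # (1, 300, 4) (1, 300, 10)
```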
3.6.2. Related to Transformer Technology
AUHF-DETR builds upon RT-DETR, inheriting the core Transformer framework: end-to-end query-key-value feature interaction via self-attention and one-to-one matching through the minimum-uncertainty query algorithm. On this foundation, we introduce three targeted enhancements to accommodate both small-object detection and real-time embedded deployment requirements:
First, we introduce PSA spatial attention (detailed in Section 3.3). In a conventional Transformer, global self-attention computes pairwise similarities across all positions, resulting in $O(N^2)$ complexity for $N$ feature tokens. PSA alleviates this by partitioning the feature map into local ROIs and processing each through separate high- and low-frequency branches to extract fine-grained textures and global structures, respectively. Self-attention is then computed only within each ROI, reducing the complexity to $O(N \cdot M)$, where $M \ll N$ is the number of tokens in a single ROI;
Second, we propose multi-scale cross-layer dynamic fusion via BDFPN (detailed in Section 3.4) with learnable weights. Analogous to multi-head cross-attention aggregating information across feature maps, BDFPN employs learnable scalar weights to adaptively allocate attention among multi-scale inputs, ensuring that small, medium, and large target features all receive adequate focus and accelerating convergence through denser positive samples;
Third, for decoder query and localization loss optimization, we retain DETR's decoder-query mechanism, ensuring that each query maps to a unique detection target. To further enhance small-object localization accuracy, we introduce the Inner-MPDIoU loss (detailed in Section 3.5), which combines the Transformer's end-to-end, NMS-free matching advantage with IoU-driven regression, thereby boosting gradient responsiveness for small bounding boxes.
4. Experiments and Results
4.1. Dataset and Implementation Details
We evaluate our approach on three publicly available aerial datasets: VisDrone2019 [48], CARPK [49], and HIT-UAV [50]. The details of these three datasets are as follows:
(a) VisDrone2019: This dataset includes a total of 10,209 images captured across diverse urban and suburban environments, such as city roads, residential areas, and industrial districts, as illustrated in Figure 9a,b. The imagery was collected by UAVs operating at varying altitudes, camera angles, and weather conditions. The dataset is divided into 6471 training images, 548 validation images, and 1580 test images, and covers ten object categories. The overall class distribution is visualized in Figure 10a.
(b) CARPK: This dataset consists of images extracted from HD video recorded over four different car parks, comprising 989 training images and 459 validation images. It contains a single, densely distributed class of small targets. Figure 9c shows sample images from this dataset, in which the targets are even smaller and more numerous; the class distribution is shown in Figure 10b.
(c) HIT-UAV: This dataset comprises 2898 thermal infrared images collected by UAVs across diverse scenes such as campuses, highways, and parking areas. It is characterized by densely distributed small targets belonging to five categories. Designed to support UAV applications in low-light environments, the dataset enhances detection capabilities under challenging illumination. In our experiments, we used 2029 images for training, 290 for validation, and the remaining 579 for testing. Figure 9d presents sample images from the dataset, which were collected with infrared sensors to evaluate the model's ability to process multispectral data; the dataset's class distribution is shown in Figure 10.
All experiments were carried out using the PyTorch framework on an NVIDIA GeForce RTX 4090 GPU under Ubuntu 24.04 with CUDA 12.1. We adopted a warm-up strategy to prevent drastic oscillations at the start of training. The hyperparameter settings are given in Table 1.
4.2. Evaluation Metrics
Following the MS COCO evaluation protocol, we mainly adopt two types of criteria for different purposes:
For model accuracy, we adopt $AP$, $AP_{50}$, and $AP_{75}$ as evaluation metrics. $AP$ represents the average precision over all categories, while $AP_{50}$ and $AP_{75}$ denote the average precision calculated at IoU thresholds of 0.5 and 0.75 over all categories. Furthermore, to measure performance at different object scales, we adopt three additional metrics: $AP_S$, $AP_M$, and $AP_L$.
For real-time requirements, we use the number of frames processed per second (FPS) as the evaluation index. A detector should typically run at no less than 30 FPS to satisfy real-time processing constraints.
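For reference, a common way to measure FPS for a PyTorch detector is sketched below, with warm-up iterations and GPU synchronisation before and after timing; the model and input shape are placeholders, not the exact benchmarking script used in this work.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 640, 640), warmup=50, iters=200):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                  # warm-up to stabilise clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()             # wait for all kernels before stopping the timer
    return iters / (time.perf_counter() - start)

# Example with a stand-in module; replace with the actual detector.
print(f"{measure_fps(torch.nn.Conv2d(3, 16, 3, padding=1)):.1f} FPS")
```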
4.3. Loss Function Exploration Experiments
From the preceding discussion of loss functions, GIoU has certain limitations for small-object detection from the UAV perspective. To evaluate the effectiveness of our proposed Inner-MPDIoU loss, we designed a series of experiments to benchmark mainstream loss functions within the AUHF-DETR framework. Specifically, we integrated each loss function into AUHF-DETR and assessed its performance on the VisDrone2019 validation set. The results of these comparisons are shown in Table 2.
For the small-object detection term in the loss function, the core concept of Inner-IoU is implemented by introducing a dynamic scaling factor (ratio). We conducted experiments across a range of ratio values to identify the hyperparameter that best enhances small-object detection from the UAV perspective. The results of these experiments are summarized in Table 3.
The results demonstrate that the Inner-MPDIoU loss function achieves the best performance in both detection accuracy and inference speed. Accordingly, we adopt Inner-MPDIoU in place of the original GIoU loss for the bounding-box regression task.
4.4. Model Balancing Exploration Experiments
The goal of this exploration is to balance detection accuracy, inference speed, and model size for real-time embedded UAV remote-sensing detection. While we initially balanced these factors in the backbone by varying the number of SubNet layers, this balance applies solely within the encoder. Therefore, we further control algorithmic complexity by adjusting three key parameters—network width, depth, and maximum channel count—to pinpoint the optimal model configuration for our task.
During this process, we discovered that reducing any single parameter in isolation (width, depth, or maximum channels) negatively impacts overall performance; achieving a balanced trade-off among all three is therefore essential. In this section, we design and evaluate various configurations to examine how these parameters affect model performance; the results are summarized in Table 4.
The parameter count directly reflects a model's adaptability across devices with varying computational capabilities. As shown in Table 4, GFLOPs increase with both network depth and width; however, when the depth factor exceeds a certain threshold, the total parameter count surpasses the deployment limit of the baseline. Conversely, setting the depth factor to 0.43 and the width factor to 0.45 produces an acceptable parameter count but leads to a significant drop in GFLOPs, failing to meet high-performance detection requirements. Balancing these observations with the performance of models at different maximum channel settings, we selected a depth factor of approximately 0.4, a width factor of roughly 0.6, and a maximum channel count between 512 and 768. We then evaluated various parameter combinations on the VisDrone2019 validation set to determine whether higher accuracy can be maintained with a reduced model footprint. The results of these experiments are summarized in Table 5.
As shown in Table 5, configuring the model with a depth factor of 0.45, a width factor of 0.60, and a maximum channel count of 512 yields the best inference speed and the smallest footprint; however, its AP is only 27.8%, indicating relatively low detection accuracy. Weighing these factors, we adopt a depth factor of 0.43, a width factor of 0.60, and a maximum channel count of 512 as the base configuration. With this setup, the model achieves an AP of 30.9%—the highest in our experiments—while maintaining a balanced trade-off between parameter count and inference speed, thus satisfying the requirements for embedded real-time detection.
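To make the scaling factors concrete, the sketch below shows one common way depth, width, and max-channel multipliers are applied to a base layer specification (values mirror the chosen configuration of 0.43 / 0.60 / 512); the base stage widths and repeat counts are hypothetical placeholders, not the actual AUHF-DETR layer table.

```python
import math

def scale_config(base_channels, base_repeats, depth=0.43, width=0.60, max_channels=512):
    """Apply depth/width multipliers and clamp channel widths to the maximum."""
    channels = [min(max_channels, int(round(c * width))) for c in base_channels]
    repeats = [max(1, math.ceil(r * depth)) for r in base_repeats]
    return channels, repeats

# Hypothetical base stage widths and block repeat counts.
base_channels = [64, 128, 256, 512, 1024]
base_repeats = [3, 6, 6, 3, 3]
print(scale_config(base_channels, base_repeats))
# -> ([38, 77, 154, 307, 512], [2, 3, 3, 2, 2])
```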
4.5. Ablation Experiments
Ablation experiments on the VisDrone2019 validation set demonstrate that our four key modifications collectively enhance both real-time performance and small-object detection accuracy. First, we replace the original BasicBlock with the WTConv-Block—a lightweight, large-receptive-field convolution module—and the AdaRes-Block, which enables adaptive-resolution, reversible feature propagation within the backbone. Second, we substitute the AIFI module with our PSA spatial attention mechanism, eliminating exhaustive pairwise computations and reducing parameter overhead. Third, we integrate the BDFPN into the encoder to mitigate the scarcity of positive samples caused by Hungarian one-to-one matching and to improve small-object detectability. Finally, we adopt the Inner-MPDIoU loss function—enhanced with a small-object penalty term—to expedite convergence and further improve localization precision for small targets.
Table 6 reports detailed performance metrics for each modification, demonstrating that every enhancement produces measurable gains over the RT-DETR-r18 baseline. Substituting the backbone with WTC-AdaResNet leads to significant improvements across all metrics—most notably in small-object accuracy. Integrating the PSA spatial attention module further refines feature representation, increasing overall AP by 1.8% while slightly decreasing model size. Introducing the BDFPN structure markedly boosts detection accuracy by 2.6% without a substantial rise in parameter count. These architectural improvements collectively reduce the model footprint and speed up inference, achieving an initial balance among size, accuracy, and speed. Finally, with all improvements applied, the model achieves a 5.9% increase in AP and a 3.12% reduction in parameters, boosting the inference speed to 65.7 FPS—nearly twice the baseline's 35 FPS—fully meeting the requirements for real-time deployment. In summary, the AUHF-DETR model shows significant improvements over the baseline in all respects, meets real-time detection needs on embedded UAV systems, and strikes a favorable balance between model size, inference speed, and detection accuracy.
4.6. Comparisons of Performance
To evaluate the detection performance improvements of AUHF-DETR, we compared our final variants against current mainstream detectors. We developed two configurations: AUHF-DETR-S (10.29 M parameters) and AUHF-DETR-M (19.55 M parameters). In AUHF-DETR-S, the AdaRes-Block contains a single SubNet layer, and the per-stage channel widths are reduced from [64, 128, 256, 512] to [56, 112, 224, 448], cutting the parameter count while maintaining detection accuracy—ideal for UAVs with limited computing power. AUHF-DETR-M utilizes two SubNet layers configured with channels [64, 128, 256, 512], yielding a total of 19.55 million parameters. Although this increases the parameter count, it provides richer feature representations and further enhances detection performance, making it suitable for higher-capacity UAV platforms. As shown in Table 7, compared with RT-DETR-r18, AUHF-DETR-M achieves relative improvements of 5.9% in $AP$, 4.8% in $AP_{50}$, 10.5% in $AP_{75}$, 4.4% in $AP_S$, 2.2% in $AP_M$, and 2.4% in $AP_L$, while reducing the parameter count by 3.12%. These significant gains demonstrate AUHF-DETR's efficiency for UAV ground-target detection and confirm that the reduced model footprint facilitates real-time deployment on embedded UAV systems.
Notably, the AUHF-DETR-S model delivers excellent performance across all object scales: $AP_S$ reaches 23.8%, a 2.1% gain over RT-DETR-r18. Moreover, the model balances detection accuracy, inference speed, and model size well. Specifically, its computational complexity is 23 GFLOPs—higher than lightweight models such as YOLOv11-N (6.5 GFLOPs) and YOLOv11-S (21.5 GFLOPs), yet significantly lower than computation-intensive detectors like Deformable DETR (173 GFLOPs). Importantly, with only 10.29 M parameters (49.0% fewer than the baseline), AUHF-DETR-S achieves the goal of lightweight design, enabling real-time onboard deployment, and its inference speed of 86.9 FPS fully meets real-time detection requirements. Compared with recent Transformer-based detectors—DEIM (CVPR 2025), Deformable DETR (ICLR 2021), and Swin DETR (ICCV 2021)—our model demonstrates a marked advantage in average precision (AP), confirming the effectiveness of our enhancements. In summary, particularly for real-time multi-scale object detection, AUHF-DETR exhibits outstanding performance, underscoring its potential as an embedded UAV detection model. To further evaluate the model's robustness, comparative experiments were also conducted on the HIT-UAV and CARPK datasets.
On the HIT-UAV dataset, most studies report mAP@IoU = 0.5 as their evaluation metric. While this provides quick and straightforward feedback, it does not fully capture a model's robustness across varying object scales and multiple IoU thresholds. To assess AUHF-DETR's ability to detect small objects more comprehensively, we adopt a more rigorous evaluation protocol. The results are presented in Table 8 and Table 9.
In the three comparative experiments, our model consistently achieves SOTA performance on all evaluated datasets, effectively balancing detection accuracy, inference speed, and model footprint. Moreover, AUHF-DETR exhibits robust generalization capabilities across different sensor types, such as visible and infrared modalities, showcasing its remarkable adaptability. This makes AUHF-DETR uniquely suited for seamless integration into UAV hardware platforms for real-time object detection, a capability that few existing models can deliver simultaneously.
4.7. Visualization Experiment
This study performs a visual analysis of the enhanced AUHF-DETR model using heatmaps and qualitative detection results. Additionally, UAV aerial imagery captured over various scenes in Fuzhou City was used for inference to further evaluate the model's generalization ability in real-world environments.
In this section, we use the EigenGradCAM [59] method to create heatmaps for both our enhanced model and leading detectors, visually highlighting each model's focus areas, as shown in Figure 11. We selected a variety of scenarios—such as urban bus stops and nighttime cityscapes—and compared heatmaps across different optical sensors to better illustrate the model's generalizability. Compared with other methods, AUHF-DETR pays significantly stronger attention to small objects, effectively detecting extremely small vehicles and pedestrians at long distances. Furthermore, the WTC-AdaResNet backbone design directs the model's focus toward object centers, improving bounding-box regression accuracy and overall detection performance.
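For readers who wish to reproduce this kind of visualization, the sketch below uses the open-source pytorch-grad-cam package, which provides an EigenGradCAM implementation; a torchvision ResNet is used as a stand-in network, and in practice the detector and one of its backbone stages would be passed instead. The attribute paths and input are illustrative assumptions.

```python
import numpy as np
import torch
from pytorch_grad_cam import EigenGradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision.models import resnet18

# Stand-in network; in practice, pass the detector and one of its backbone stages.
model = resnet18(weights=None).eval()
target_layers = [model.layer4[-1]]

rgb_img = np.random.rand(224, 224, 3).astype(np.float32)        # placeholder frame in [0, 1]
input_tensor = torch.from_numpy(rgb_img).permute(2, 0, 1).unsqueeze(0)

cam = EigenGradCAM(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor)[0]                # (H, W) activation map in [0, 1]
overlay = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
print(overlay.shape)                                             # (224, 224, 3) heatmap overlay
```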
Figure 12 presents the inference results of SOTA detectors alongside AUHF-DETR on the VisDrone2019-val dataset. To highlight the model's robustness, we include both daytime and nighttime scenes, focusing on scenarios with densely clustered small targets such as pedestrians and bicycles. In the images, red circles indicate false detections, green circles denote missed detections, and cyan circles mark false positives (FP). In the top-to-bottom comparison, competing algorithms frequently miss small objects (e.g., pedestrians and bicycles) due to limited feature-extraction capacity. Additionally, in complex scenarios involving motorbikes or bicycles with riders, they often misclassify the rider as a pedestrian (ground truth labeled as "others") or fail to detect them entirely. By contrast, AUHF-DETR not only recovers the majority of "others" targets but also detects and correctly classifies more small objects under low-light nighttime conditions.
To provide a more objective assessment of AUHF-DETR's inference performance, we compared it with mainstream detectors on the HIT-UAV and CARPK datasets, as illustrated in Figure 13. In the first column, YOLO-based models generate false positives on irrelevant objects in front of buildings. In the third column, RT-DETR misclassifies occluded vehicles as other motor vehicles, reducing detection accuracy. In the fifth column, under extremely dense scenes, SOTA detectors struggle to localize single-class objects. In contrast, AUHF-DETR accurately identifies all object categories in both infrared imagery and highly congested scenarios. Overall, AUHF-DETR demonstrates superior detection accuracy and robustness in the infrared domain and in heavily occluded, densely populated environments.
To further evaluate the model's generalization capability, we captured UAV aerial imagery over urban Fuzhou during morning and evening peak hours and ran inference on it, as shown in Figure 14. The results reveal only a small number of false positives and missed detections; the model accurately identifies tiny pedestrians and fast-moving objects, demonstrating exceptional detection performance and further corroborating its strong generalization. Additionally, the model's compact size facilitates deployment on other hardware platforms, fully satisfying the real-time detection requirements of UAV systems.
4.8. Exploratory Experiments on Detection Speed Using Embedded GPUs
To demonstrate the feasibility of deploying our proposed model for real-time detection on UAVs, we conducted speed tests of various mainstream models on an embedded GPU. The experiments were executed on the widely used NVIDIA Jetson AGX Xavier platform, with GPU-related specifications provided in Table 10.
Based on human visual perception of real-time performance, a model achieving ≥35 FPS is deemed suitable for onboard UAV deployment. We benchmarked the inference speeds of the YOLO series, the RT-DETR series, and our proposed variants on an NVIDIA Jetson AGX Xavier, with the results reported in Table 11. All of our variant models exceeded 35 FPS on the embedded GPU, thus satisfying the real-time detection requirements of UAVs. Notably, AUHF-DETR-M not only reduces the parameter count compared to the baseline but also lowers model complexity, achieving faster inference on the embedded GPU (47 FPS vs. 39 FPS; Table 11). In contrast, Transformer-based models exhibit lower FPS than the YOLO series due to their inherently higher computational complexity. In future work, we will investigate techniques such as pruning and knowledge distillation to further reduce model complexity and boost detection speed on embedded GPUs.
4.9. UAV Embedded Simulation Experiments
To demonstrate the model's embedded deployment more accurately, we conducted UAV simulation experiments based on PX4 within the ROS [60] framework, embedding AUHF-DETR in the UAV to emulate its detection performance in real-world scenarios. We constructed a virtual environment in Gazebo and mounted a monocular camera on the UAV for vehicle detection and tracking. Flight trajectories were managed using QGroundControl (QGC). The expected simulation result is shown in Figure 15a, where the UAV's captured imagery is streamed in real time to the display window at the upper left.
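A minimal sketch of how such a detection node can be wired into ROS is shown below: the node subscribes to the simulated camera topic, converts each frame with cv_bridge, and runs inference. The topic name and the detect() placeholder are illustrative assumptions; they are not the actual simulation code or model loader used in this work.

```python
#!/usr/bin/env python3
"""Minimal ROS node sketch: stream the simulated UAV camera through a detector."""
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()

def detect(frame):
    # Placeholder for the actual AUHF-DETR inference call on a BGR numpy frame.
    return []

def on_image(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    boxes = detect(frame)
    rospy.loginfo("detections: %d", len(boxes))

rospy.init_node("auhf_detr_detector")
# Hypothetical PX4/Gazebo camera topic; adjust to the actual vehicle/camera plugin.
rospy.Subscriber("/iris/camera/image_raw", Image, on_image, queue_size=1)
rospy.spin()
```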
On the UAV platform, we deployed the RT-DETR-r18 and AUHF-DETR-S models—both trained on the VisDrone2019 dataset—onto the drone and performed vehicle detection at various altitudes under QGC guidance, as illustrated in Figure 15b. Evaluations from different viewing angles and heights show that RT-DETR produces both false positives (within the red circle, a car is incorrectly classified as a truck) and missed detections (within the yellow circle, numerous small objects are missed). In contrast, the proposed AUHF-DETR consistently achieves high detection performance at both low and high altitudes while maintaining an inference speed of 86.9 FPS—fully meeting real-time operational requirements.
5. Discussion
The experimental results indicate that our network achieves strong performance on the VisDrone2019, HIT-UAV, and CARPK datasets. In this section, we provide an in-depth discussion and analysis of these findings.
In the VisDrone2019 dataset, pedestrians and motorbike/bicycle riders (ground truth labeled as "others") exhibit very similar shapes and sizes. As shown in the last column of Figure 12, algorithms such as YOLOv11, YOLOv10, and RT-DETR-r18 often merge these categories to boost overall accuracy or simply miss the "others" class altogether. However, accurately distinguishing pedestrians from "others" is critical for real-world applications. Our two variant models resolve these subtle distinctions effectively, avoiding both false positives and missed detections even when object features are nearly identical.
In the HIT-UAV dataset, small pedestrian targets in thermal infrared imagery are prone to being misclassified as background noise and subsequently ignored. Furthermore, the characteristics of thermal infrared sensors can produce spurious false objects under high-temperature conditions. As illustrated by the model outputs in Figure 13, these issues become especially evident in scenes with multiple targets and complex backgrounds. For example, in the first column of YOLOv11's detections, a white high-temperature object in front of a building is wrongly identified as a pedestrian, significantly reducing detection accuracy. Similarly, in the CARPK dataset, a model's capability to handle extremely dense occlusion is critical. Our proposed AUHF-DETR not only produces zero false positives in infrared remote sensing images but also maintains high detection precision under heavy occlusion (Table 9), highlighting its superior adaptability and generalization.
In UAV small-object detection tasks, high-altitude remote sensing imagery contains many small targets. As shown in Table 7 and Table 8 and Figure 11, Figure 12 and Figure 13, mainstream detectors such as RT-DETR and DEIM exhibit relatively low $AP_S$. Our method not only achieves superior $AP_S$ but also detects a greater number of small objects in the inference visualizations, demonstrating strong robustness.
In a series of embedded-GPU speed comparisons across mainstream detectors, our approach exhibited superior robustness and faster inference. As shown in Table 10 and Table 11, on the NVIDIA Jetson AGX Xavier platform, AUHF-DETR-S not only achieved the highest inference speed (68 FPS) but also had the smallest parameter count among models such as YOLOv11 and RT-DETR, underscoring its suitability for embedded deployment. Furthermore, as illustrated in Figure 14 and Figure 15, when deployed in the Gazebo simulation environment using the PX4 and ROS frameworks, AUHF-DETR delivered exceptional performance in UAV embedded simulation trials. The model also demonstrated strong generalization capability in real-time UAV detection over urban Fuzhou. In summary, the proposed method is well suited for integration into UAV systems for real-time onboard detection.
However, under extreme conditions, such as complete darkness in the visible-light spectrum at night and spurious artifacts in the thermal-infrared band under high-temperature scenarios, AUHF-DETR cannot perform all-weather observations. To mitigate these adverse effects, our future work will focus on integrating multispectral remote sensing data with knowledge distillation techniques to capture critical cross-modal information, ultimately developing a robust model capable of all-weather onboard UAV detection.
6. Conclusions
In this paper, we present AUHF-DETR, an innovative RT-DETR-based model designed for real-time small object detection in embedded UAV remote sensing applications. Our compact variant consists of just 10.29 million parameters while achieving an impressive inference speed of 68 FPS (AGX Xavier), fully meeting the requirements for onboard real-time deployment. AUHF-DETR addresses the key challenges associated with small object detection and embedded environments through several significant innovations. First, we developed the WTC-AdaResNet backbone to extract multi-scale object features with minimal parameter overhead. Second, we replace global self-attention with a PSA module to mitigate the exponential computational demands of traditional Transformer attention mechanisms. Third, we introduce a BDFPN specifically tailored for UAV small object detection, which accelerates convergence by alleviating the issue of limited positive matches typically found in the Hungarian one-to-one assignment. Lastly, we enhance the loss function with a dynamically scaled small object penalty term, further improving localization accuracy for small targets.
Experimental results on the VisDrone2019, CARPK, and HIT-UAV multimodal public datasets indicate that AUHF-DETR significantly outperforms RT-DETR and various other Transformer variants. Unlike many other object detectors, AUHF-DETR is compatible with UAV hardware platforms, achieves excellent inference speed on embedded GPUs, and excels in small-object detection, striking an optimal balance between detection performance and model size. Furthermore, the model accurately identifies all object categories in UAV remote sensing images of urban areas in Fuzhou and demonstrates excellent detection performance in the PX4 embedded simulation experiments.
However, for UAV remote-sensing detection tasks—especially under the extremely dense occlusion shown in Figure 12—AUHF-DETR still produces some missed and false detections of small targets, such as pedestrians and bicycles. Similarly, the model cannot perform all-day detection under extreme lighting conditions (e.g., completely dark visible-light scenes). To address these limitations, our future work will focus on enhancing the detection of densely occluded objects without increasing the model size and on incorporating multispectral datasets with knowledge distillation to capture key cross-modal information, ultimately developing an all-weather remote-sensing detection model capable of robustly detecting congested targets.