Article

UAV Small Target Detection Method Based on Frequency-Enhanced Multi-Scale Fusion Backbone

1
School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
2
Chery Automobile Co., Ltd., Wuhu 241000, China
*
Author to whom correspondence should be addressed.
Drones 2026, 10(2), 106; https://doi.org/10.3390/drones10020106
Submission received: 17 December 2025 / Revised: 26 January 2026 / Accepted: 28 January 2026 / Published: 2 February 2026

Highlights

What are the main findings?
  • The RT-DETR backbone is redesigned around a frequency-enhanced multi-scale feature fusion module and a grouped multi-kernel interaction module.
  • We integrated Shape-NWD into the loss computation to make the objective shape-aware.
What are the implications of the main findings?
  • An end-to-end object detection model tailored for UAV aerial-view scenarios is proposed.
  • Compared with RT-DETR, our method achieves improvements in both AP and AP50 while largely preserving the real-time performance of RT-DETR.

Abstract

Despite the widespread adoption of UAV-based object detection, traditional YOLO architectures are bottlenecked by their reliance on NMS, which complicates deployment on edge devices due to limited support across hardware acceleration platforms. While end-to-end models such as RT-DETR eliminate this bottleneck, they suffer from severe feature degradation for small targets caused by the inherent conflict between deep downsampling and detail preservation. To bridge this gap, we propose a Frequency-Enhanced Real-Time Detection framework specifically designed for UAV perspectives. Unlike standard backbones, our design incorporates a Frequency-Enhanced Multi-Scale Fusion module, which transforms features into the frequency domain to explicitly amplify high-frequency components essential for small object localization. Additionally, a Grouped Multi-Kernel Interaction module is introduced to dynamically capture multi-scale contextual information. Furthermore, we integrate Shape-NWD into the loss computation by introducing shape weight coefficients and scale correlation factors, directing focus toward the intrinsic attributes of bounding boxes to enhance regression accuracy for tiny targets. Experimental results on the VisDrone dataset demonstrate that our method improves the Average Precision by 0.9% and AP50 by 1.1% compared to the baseline, with consistent gains observed on the UAVVaste dataset.

1. Introduction

As UAV applications have become increasingly widespread, object detection on UAV platforms has attracted extensive attention [1]. The elevated viewing angle of UAV-captured imagery gives targets an extremely low pixel proportion, making it difficult for models to distinguish targets from background noise. Occlusion is also more pronounced: targets may be blocked by buildings or trees, or overlap with other targets, which directly causes the loss of effective features. Additionally, these models must be deployed on the embedded hardware of UAVs, which is severely constrained in computational power, energy consumption, and storage capacity, while the detection task still demands strict real-time performance. The YOLO series stands as a cornerstone of real-time object detection. Defined by its one-stage detection framework, the series has struck an impressive balance between inference speed and detection precision through successive evolutionary advancements, including refined backbones (e.g., CSPDarknet [2]) and PANet [3] necks, as well as the transition from anchor-based to anchor-free methodologies. For mainstream industrial use cases, YOLO models are typically the go-to choice thanks to their straightforward deployment, well-developed ecosystem, and efficiency in detecting objects of conventional sizes.
Most state-of-the-art UAV object detection methods adopt single-stage frameworks, largely built upon enhanced versions of the YOLO family [4,5]. These methods rely heavily on manually designed components, such as NMS [6,7] and handcrafted anchors, and their performance is significantly affected by three factors: (1) the post-processing steps used to merge near-duplicate predictions, (2) the design of the anchor set, (3) the heuristic rules for assigning ground-truth boxes to anchors.
Recent advancements in UAV-based surveillance have witnessed the widespread deployment of YOLO-based architectures. For example, Fadan et al. [8] and Alqahtani et al. [9] proposed effective frameworks for dynamic traffic surveillance and border control, respectively. Although these studies confirm the practicality of YOLO in specific application scenarios, they mainly focus on system-level integration rather than addressing the inherent small-target feature loss in the backbone network.
Furthermore, in terms of robustness, Munir et al. [10,11] have conducted extensive analysis on the impact of adverse weather conditions and image distortions on UAV-based detection, proposing methods such as YOLO-RAW to mitigate environmental degradation effects. Unlike these approaches that address external noise sources (e.g., fog or rain), our method specifically targets the internal structural degradation of fine-grained features induced by deep downsampling operations.
To address this issue, we derive inspiration from the field of image reconstruction. Abro et al. [12] reviewed strategies in Fourier ptychographic microscopy, emphasizing the effectiveness of frequency-domain analysis in recovering high-resolution details. Although originally applied to microscopy, we innovatively adapt this frequency-domain perspective to a real-time object detection backbone to explicitly recover the high-frequency edge information of tiny UAV targets—information that is typically lost in spatial-domain convolutions.
By comparison, end-to-end models [13,14,15] do not require NMS or depend on manual prior knowledge, thus rendering them an excellent option for UAV-based target detection. DETR [16] is a popular end-to-end target detection model in recent years, which introduces the transformer architecture into target detection tasks. However, its high computational cost and poor real-time performance prevent it from being well applied to tasks with high real-time requirements.
Zhao et al. proposed RT-DETR [16], the first real-time end-to-end target detector, which eliminates the dependence on manual prior knowledge. Built on the DETR framework, RT-DETR incorporates a Hybrid Encoder to improve the handling of multi-scale objects: by efficiently fusing multi-layer features, it enables the model to capture fine-grained image details, thus enhancing detection precision. Uncertainty minimization-based query selection is also incorporated into the model. As a result, RT-DETR surpasses state-of-the-art YOLO models in both inference speed and detection accuracy. Nevertheless, its architecture prioritizes multi-scale feature fusion and efficient inference, with insufficient capability to extract fine-grained features and preserve the details essential for small targets.
This leads to a lower recognition accuracy for small-scale and low-resolution targets. Existing state-of-the-art DETR-based models [17,18,19,20,21] are tailored for natural scenes. As shown in Figure 1, UAV-captured images suffer from issues including a high ratio of small targets, diverse target scales, and background clutter, which bring considerable challenges when applying mainstream DETR architectures to UAV image analysis.
Table 1 summarizes the characteristics and limitations of existing mainstream methods. The YOLO series has established an efficient standard for multi-scale feature fusion by introducing Feature Pyramid Networks (FPN) and Path Aggregation Networks (PANet), allowing it to adapt well to object detection at different scales. However, despite the effectiveness of its fusion strategy, its reliance on Non-Maximum Suppression (NMS) remains a major bottleneck for edge-side deployment, limiting parallel inference efficiency. In contrast, while the DETR series eliminates NMS through an end-to-end paradigm, its high computational cost makes real-time performance difficult to achieve. Although RT-DETR strikes a balance between speed and NMS elimination, its feature fusion mechanism operates primarily on the output of the backbone network. This delayed fusion implies that, during the backbone's deep downsampling, the inherent low-pass filtering effect of convolution may already have irreversibly over-smoothed the high-frequency details crucial for small targets. To address this core limitation, our method does not follow the traditional spatial-domain fusion path: it moves the fusion inside the backbone network and introduces the frequency-domain dimension, ensuring that the fine-grained features of tiny targets are explicitly preserved and effectively enhanced before deep semantic features are formed.
UAV-captured images are more intricate than conventional natural images. Object detection on aerial imagery must contend with extremely small object sizes, significant scale variations, and frequent occlusions. As shown in Figure 2, the recognition accuracy of RT-DETR on the $AP_s$ metric lags behind that of the YOLO series. Studies on multi-scale feature fusion [25,26] and bounding-box regression loss functions [27] are therefore critical for enhancing detection accuracy.
To improve the performance of RT-DETR on aerial imagery, this paper proposes an end-to-end target detection architecture designed specifically for the UAV perspective. We achieve multi-scale feature fusion by improving ResNet [29], addressing the loss of small-target detail during the backbone's downsampling process. Subsequently, factors related to the shape and scale of the bounding box itself are incorporated into the IoU loss calculation [30,31,32].
Low-level feature maps exhibit high resolution, enabling the capture of fine details like object edges and textures. However, their semantic information is insufficient, which hinders the differentiation of objects from the background. High-level feature maps possess abundant semantic information, yet significant detail degradation occurs. In 2017, Kaiming He and colleagues presented the Feature Pyramid Network [33].
Feature pyramids address this issue by integrating deep and shallow features, while enhancing small target localization and multi-scale feature representation. However, backbone networks [34] still face difficulties in integrating and preserving shallow-layer information, leading to the problem of feature mismatch. Therefore, multi-scale feature fusion [35] is required to integrate feature maps of different levels, enabling the model to not only localize small targets via high-resolution features but also accurately classify them through deep semantic features, while adapting to the detection requirements of targets at various scales.
Although existing multi-scale feature fusion methods such as the Feature Pyramid Network and its variants [36,37,38,39] have alleviated the problem of scale variation to some extent through cross-level connections, these approaches remain inherently limited by feature operations performed solely in the spatial domain. In drone aerial photography, targets are extremely small, and their key features are mainly reflected in high-frequency components such as edges and textures. Through the continuous downsampling operations of the backbone network, these fragile high-frequency details are easily submerged or smoothed by low-frequency background noise, resulting in irreversible loss of high-frequency information. Although existing spatial-domain attention mechanisms can enhance salient regions, they cannot explicitly separate and recover these lost spectral details, thus facing inherent performance bottlenecks in tiny object detection. In contrast, the proposed FFFE module introduces frequency-domain analysis and uses Fast Fourier Transform to accurately locate and enhance high-frequency components, breaking through the limitations of pure spatial-domain methods in preserving tiny object features from the perspective of physical principles.
Our main contributions are summarized as follows:
  • An enhanced end-to-end real-time target detector derived from RT-DETR is proposed, exhibiting superior performance over the baseline RT-DETR in small target detection;
  • A ResNet-based modified backbone is proposed to enable multi-scale feature fusion, resolving the problem of small target detail degradation in the downsampling stage;
  • The shape and scale of bounding boxes are introduced into the IoU loss computation, so that these intrinsic box attributes directly influence regression performance.

2. Materials and Methods

The method proposed in this paper is built upon the RT-DETR architecture, specifically tailored for UAV small target detection tasks. As illustrated in Figure 3, the complete framework consists of three core components: the Backbone, the Hybrid Encoder, and the Transformer Decoder. The processing pipeline operates as follows: First, the input image is processed by our redesigned Frequency-Enhanced Multi-Scale Backbone. In this stage, we integrate the Grouped Multi-Kernel Interaction module to replace standard convolutional layers, enabling the capture of fine-grained features across varying receptive fields. Subsequently, the feature maps undergo processing via the Frequency-Enhanced Multi-Scale Feature Fusion module, which leverages frequency-domain transformations to explicitly amplify high-frequency edge information critical for small targets. Following feature extraction, the multi-scale features (S3, S4, S5) are fed into the Hybrid Encoder for intra-scale interaction and cross-scale fusion. Finally, the Transformer Decoder decodes the object queries, and the model generates the final bounding boxes and class predictions in an end-to-end manner, supervised by the proposed Shape-NWD loss function.

2.1. Frequency-Enhanced Multi-Scale Feature Fusion Module

A fundamental dilemma exists in conventional backbones (e.g., ResNet, VGGNet) [40,41,42,43,44]: low-level features are abundant in spatial positional information but lack sufficient semantic representation, whereas high-level features possess robust semantic information but suffer from severe spatial degradation. Crucially, the progressive spatial downsampling in these architectures inherently acts as a low-pass filter. This mechanism causes the high-frequency information—such as the sharp edges and fine textures essential for identifying small UAV targets—to be irreversibly smoothed out or submerged by background noise. Consequently, small targets can often be identified but not precisely localized in deep layers.
In contrast to conventional spatial or channel attention mechanisms (e.g., CBAM [45], SE [46]) that predominantly depend on local pixel correlations or channel-wise weighting, frequency-domain analysis introduces a distinctive global viewpoint for feature enhancement [47,48,49]. As shown in Figure 4, in UAV aerial scenarios, small targets typically appear as high-frequency components in the image, and this information is prone to irreversible loss during the spatial downsampling processes of traditional backbones. Although spatial attention mechanisms can focus on specific regions, they often face difficulties in distinguishing weak signals of tiny targets from background noise in low-resolution feature maps. In contrast, frequency-domain enhancement utilizes the global receptive field characteristic of the Fourier Transform to explicitly separate features into high- and low-frequency components. This capability enables the model to directly target and amplify the high-frequency details that characterize small targets, without being affected by spatial positional biases. Therefore, theoretically, frequency-domain enhancement has more structural advantages than purely spatial strategies in addressing the specific challenges of UAV-based small object detection.
Solely relying on spatial-domain attention cannot explicitly recover these missing spectral details. To address this fundamental limitation, we introduce the Frequency-Enhanced Multi-Scale Feature Fusion (FFFE) module. By leveraging the Fast Fourier Transform, this module explicitly isolates and amplifies the high-frequency components corresponding to small objects. Through a mechanism involving channel splitting, frequency-domain transformation, and complementary mapping, the FFFE embeds shallow spatial details into deep semantic features layer by layer. This design transforms the feature extraction paradigm from purely spatial perception to a dual-domain collaborative learning, effectively alleviating the spatial-semantic mismatch and ensuring that critical fine-grained details are preserved for accurate small target localization.
As depicted in Figure 5, the frequency-enhanced multi-scale feature fusion module first splits the input feature map into two subsets, $X_1$ and $X_2$: 1/4 of the channels are assigned to $X_1$ and the remaining 3/4 to $X_2$. This design is primarily intended to avoid a sharp increase in computational complexity. Frequency enhancement and multi-scale feature fusion are applied only to $X_1$, whereas $X_2$ is merely concatenated with the processed features of $X_1$. By limiting the enhancement to a small fraction of the channels, the module preserves the underlying features while reducing the computational cost.
The feature map on the left side of the figure denotes $X_2$, while the one on the right represents the processed feature map of $X_1$. $X_1$ first passes through a 1 × 1 convolution to adjust its channel dimension for subsequent computations, followed by a GELU activation to introduce non-linearity, yielding $X_{conv}$:
$X_{conv} = \mathrm{GELU}\big(\mathrm{Conv}_{1\times 1}(X_1)\big)$
Next, we apply the Fast Fourier Transform to $X_{conv}$ to convert it from the spatial domain to the frequency domain. In parallel, global average pooling is applied to $X_{conv}$ in the spatial domain to extract global channel statistics, producing a channel-wise descriptor vector. This vector then passes through a 1 × 1 convolution to generate a set of frequency-domain filtering coefficients. Element-wise multiplication between these coefficients and the corresponding frequency features amplifies or suppresses different frequency components in a weighted manner. Finally, the filtered frequency features are converted back to the spatial domain via the Inverse Fast Fourier Transform, yielding the enhanced feature map:
$X_{sp} = \mathrm{IFFT}\big(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(X_{conv})) \odot \mathrm{FFT}(X_{conv})\big)$
The main purpose of this step is to address the problem that small targets, such as pedestrians and vehicles in UAV images, have an extremely low pixel ratio, and their high-frequency information (e.g., edges and textures) is easily submerged by the background. Low-frequency components correspond to regions with slowly varying grayscales in the image (e.g., large-area sky and ground backgrounds), determining the overall contour of the image but lacking details. High-frequency components correspond to regions with rapidly varying grayscales, such as the edges of small targets and wheel textures, containing abundant detailed information and serving as the key to recognizing small and occluded targets in UAV detection. High-frequency components are highly susceptible to loss during the downsampling process of traditional backbones. Through frequency enhancement, we accurately preserve and amplify high-frequency details useful for detection while suppressing low-frequency background interference, providing more robust feature support for subsequent target localization and classification.
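The frequency-enhancement step can be sketched in a few lines. This is an illustrative NumPy version in which the learned 1 × 1 convolution that produces the filtering coefficients is replaced by a fixed placeholder mapping (the actual module learns these coefficients):

```python
import numpy as np

def frequency_enhance(x: np.ndarray) -> np.ndarray:
    """Illustrative sketch of the FFFE frequency step:
    FFT -> channel-wise filter coefficients -> IFFT.
    x has shape (C, H, W); the learned Conv1x1 that produces the
    coefficients is replaced here by a fixed placeholder mapping."""
    # move every channel into the frequency domain
    freq = np.fft.fft2(x, axes=(-2, -1))
    # global average pooling yields one descriptor per channel
    gap = x.mean(axis=(-2, -1), keepdims=True)        # shape (C, 1, 1)
    # placeholder for the learned 1x1 conv: a smooth mapping to weights
    coeffs = 1.0 + np.tanh(gap)
    # element-wise reweighting amplifies/suppresses frequency components
    filtered = coeffs * freq
    # back to the spatial domain; the imaginary part is numerical noise
    return np.real(np.fft.ifft2(filtered, axes=(-2, -1)))
```

Because the coefficients are per-channel scalars here, this toy version only rescales whole channels; the learned version can weight individual frequency bins.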
After obtaining the frequency-enhanced feature map $X_{sp}$, the next step is multi-scale feature fusion. We apply 1 × 1, 3 × 3, and 5 × 5 convolutions to $X_{sp}$ in parallel, extracting contextual information over different ranges with kernels of varying sizes. The extracted features are then combined by weighted summation rather than channel concatenation, which keeps the computational load low when frequency enhancement is applied again later. Fusing features of different scales within the same set of channels yields a representation that superimposes information from all scales. However, direct summation may lead to conflicts or masking between features of different scales; for instance, strong responses from large targets may overshadow the weak responses of small targets.
To address this issue, we introduce a locally adaptive attention weight generator. Different local regions of UAV images place different demands on local-detail, medium-scale, and larger-scale features, so the weights must be allocated dynamically by an attention mechanism. We concatenate the feature maps of the three scales, so that each spatial position of the resulting $X_{cat}$ carries information from all three receptive fields.
Initially, a 1 × 1 convolution is applied to reduce the channel dimension, mitigating computational complexity while fusing cross-scale information.
$X_{tmp} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X_{cat}))\big)$
A lightweight 3 × 3 convolutional layer, followed by a SoftMax activation function, is then utilized to generate weight maps for the three scales.
$W_1, W_2, W_3 = \mathrm{SoftMax}\big(\mathrm{Conv}_{3\times 3}(X_{tmp})\big)$
This 3 × 3 convolution has 3 output channels, corresponding to the three distinct scales. By incorporating this locally adaptive attention weight generator, the original static and uniform feature summation is converted into dynamic and context-aware intelligent fusion. This enables the model to adaptively select the most appropriate feature scale based on the local content of the image, thereby effectively resolving the issues of feature conflict and overshadowing. It is particularly suitable for addressing the core challenges of large target scale differences and uneven distribution in UAV images.
$X_s = W_1 \odot \mathrm{Conv}_{1\times 1}(X_{sp}) + W_2 \odot \mathrm{Conv}_{3\times 3}(X_{sp}) + W_3 \odot \mathrm{Conv}_{5\times 5}(X_{sp})$
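As a toy illustration of this dynamic fusion: the learned two-layer weight generator is replaced here by a per-scale channel mean, so the sketch demonstrates only the per-pixel softmax weighting over three scale branches, not the learned behavior:

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = 0) -> np.ndarray:
    # numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_scale_fusion(f1, f2, f3):
    """Toy sketch of the locally adaptive weight generator: per-pixel
    softmax weights over three scale branches (each f_i has shape
    (C, H, W)). The learned Conv1x1/Conv3x3 stack is replaced by a
    per-scale channel mean purely for illustration."""
    logits = np.stack([f1.mean(0), f2.mean(0), f3.mean(0)])  # (3, H, W)
    w = softmax(logits, axis=0)          # weights sum to 1 at each pixel
    return w[0] * f1 + w[1] * f2 + w[2] * f3
```

Since the three weight maps sum to one at every pixel, a branch with a stronger local response is emphasized without completely suppressing the others, which is the intended remedy for feature overshadowing.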
Subsequently, we perform a second frequency enhancement on $X_s$. The first enhancement operates on the original features, while the second operates on features that have already integrated multi-scale contextual information; it therefore refines richer, higher-level semantics and can make more informed frequency selections, for instance further suppressing background noise while emphasizing frequency patterns specific to target categories.
$X_{s1} = \mathrm{IFFT}\big(\mathrm{FFT}(\mathrm{Conv}_{1\times 1}(X_s)) \odot \mathrm{Conv}_{1\times 1}(X_s)\big)$
However, frequency enhancement is also a double-edged sword—overemphasizing high-frequency components may simultaneously amplify noise and irrelevant textures in the image. Therefore, we introduce two learnable parameters, α and β, which serve as channel-wise weights. These two parameters dynamically determine how much original spatial information to retain and how much frequency-enhanced information to incorporate in the final output, thereby avoiding noise introduction caused by over-enhancement. The core idea of this mechanism is to enable the model to independently judge through the learnable parameters: frequency enhancement achieves better performance for certain regions or features, while the original spatial features are sufficient for others.
$X_F = \alpha \odot X_{s1} + \beta \odot X_s$
After obtaining $X_F$, it is fused with the initial bypass branch $X_2$. $X_2$ carries 3/4 of the channel information of the original input and has not undergone any non-linear transformation or potential information loss; concatenating it with $X_F$ therefore preserves the integrity of the original information. By dividing the feature stream into processed and unprocessed parts, we benefit from the performance gains of complex transformations while retaining the gradient flow and richness of the original features.
$X_{Final} = \mathrm{GELU}\big(\mathrm{Conv}_{1\times 1}(\mathrm{Concat}(X_F, X_2))\big)$
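The channel-split bypass described above can be sketched as follows, with the enhancement path left as a pluggable function and the trailing 1 × 1 convolution and GELU omitted for brevity:

```python
import numpy as np

def fffe_split_bypass(x: np.ndarray, enhance) -> np.ndarray:
    """Sketch of the FFFE channel split: 1/4 of the channels go through
    the (pluggable) enhancement path, the remaining 3/4 bypass it
    untouched, and the two streams are concatenated. The module's
    final Conv1x1 + GELU is omitted here."""
    c = x.shape[0]
    x1, x2 = x[: c // 4], x[c // 4:]          # 1/4 enhanced, 3/4 bypass
    return np.concatenate([enhance(x1), x2], axis=0)
```

With an identity `enhance`, the input passes through unchanged, which makes the bypass property easy to verify in isolation.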

2.2. Grouped Multi-Kernel Interaction Module

UAV imagery is characterized by extreme scale variations, where targets can range from a few pixels to hundreds of pixels within the same frame. However, traditional convolution layers typically utilize fixed-size kernels (e.g., 3 × 3). This design imposes a fixed receptive field that cannot simultaneously adapt to such diverse scales, leading to significant feature detail loss during the downsampling process.
To overcome this structural rigidity and address the detail degradation, this paper introduces the Grouped Multi-Kernel Interaction (GMKI) module, with its detailed architecture illustrated in Figure 6. By grouping input channels and applying parallel convolutions with varying kernel sizes (3 × 3, 5 × 5, 7 × 7), we effectively simulate multi-scale perception within a single layer. This mechanism allows the network to capture local fine-grained details of small targets while simultaneously perceiving global contextual information for larger objects, fundamentally solving the single-scale view limitation of standard backbones.
In aerial images, a small target may occupy only a few dozen, or even a handful of, pixels. After multiple convolutional layers and downsampling steps, this limited feature information is highly susceptible to being lost or submerged by background noise. Meanwhile, traditional convolutional layers typically use fixed-size kernels, which fixes the receptive field of the neurons in that layer. Small kernels have a small receptive field and excel at capturing details and local features, but struggle to obtain a target's contextual information. Large kernels have a large receptive field and can capture broader context and global structure, but tend to introduce excessive background noise and lose fine-grained features. The grouped multi-kernel interaction module introduced in this paper is designed to give the network multi-scale perception within a single layer.
The input channels are split into G groups, with G set to 4 in this paper. The input feature map $X_{in}$ is partitioned along the channel dimension into four distinct sub-tensors.
$X_1, X_2, X_3, X_4 = \mathrm{Split}(X_{in}, G)$
Then, a different depth-wise convolution is applied to each group; because the branches run in parallel, the latency is usually lower than applying the kernels sequentially. The first group uses a 3 × 3 depth-wise convolution, the second 5 × 5, the third 7 × 7, and the fourth group is left unprocessed to retain the original information.
$Y_1 = \mathrm{DWConv}_{3\times 3}(X_1),\quad Y_2 = \mathrm{DWConv}_{5\times 5}(X_2),\quad Y_3 = \mathrm{DWConv}_{7\times 7}(X_3),\quad Y_4 = X_4$
Each group of channels is responsible for one scale. Features of large targets are not polluted by the detailed textures of small kernels, and features of small targets are not submerged by the smoothing effect of large kernels, which is crucial for scenes with drastic scale changes in aerial images.
Dividing the input channels into four groups for parallel computation has a serious side effect: information does not flow between groups. Therefore, after concatenating the four groups, we perform a channel shuffle so that channels from different scales become adjacent in memory and can be fused by the following convolution. First, a reshape transforms the tensor from $(B, C, H, W)$ to $(B, 4, C/4, H, W)$. A transposition then swaps the group and channel axes, giving $(B, C/4, 4, H, W)$. Finally, a flatten operation returns the tensor to 4D with shape $(B, C, H, W)$. Within the shuffled feature map, each set of four consecutive channels contains exactly one channel from each of the four groups.
$X_{shuffle} = \mathrm{Shuffle}\big(\mathrm{Concat}(Y_1, Y_2, Y_3, Y_4)\big)$
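The reshape–transpose–flatten sequence is the standard channel-shuffle trick; a batch-free NumPy sketch:

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int = 4) -> np.ndarray:
    """Channel shuffle as described above, without the batch axis:
    (C, H, W) -> (G, C/G, H, W) -> swap the two leading axes ->
    flatten back to (C, H, W)."""
    c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))
```

With C = 8 and G = 4, the channel order [0, 1, ..., 7] becomes [0, 2, 4, 6, 1, 3, 5, 7]: every run of four consecutive output channels draws exactly one channel from each original group.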
Following the channel shuffle, a 1 × 1 pointwise convolution is employed to integrate cross-scale information, dynamically assigning weights to features associated with varying receptive fields for effective fusion. The convolution here is not only performing fusion but also feature selection, letting the network dynamically adjust attention to different scales according to the current image content. Subsequently, the output is processed through a Batch Normalization layer followed by a SiLU activation function to introduce essential non-linearity.
$X_{fusion} = \mathrm{SiLU}\big(\mathrm{BN}(\mathrm{PWConv}_{1\times 1}(X_{shuffle}))\big)$
Finally, spatial downsampling is performed through a depth-wise convolution with a stride of 2 to output the final feature map.
$X_{output} = \mathrm{DWConv}_{3\times 3}(X_{fusion})$
The grouped multi-kernel interaction module in this paper groups the channels to achieve multi-scale kernel collaboration, breaks inter-group barriers through channel shuffle, and finally fuses all features. This retains the lightweight advantage of depth-wise convolution while compensating for its feature-isolation defect.
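A compact PyTorch sketch of the whole GMKI block, following the equations above; layer hyperparameters such as padding are our assumptions, since the text does not specify them:

```python
import torch
import torch.nn as nn

class GMKI(nn.Module):
    """Sketch of the Grouped Multi-Kernel Interaction module: channel
    split, parallel 3x3/5x5/7x7 depth-wise convs plus an identity
    branch, channel shuffle, pointwise fusion, stride-2 downsampling.
    Padding choices are illustrative assumptions."""
    def __init__(self, c: int):
        super().__init__()
        assert c % 4 == 0
        g = c // 4
        self.dw3 = nn.Conv2d(g, g, 3, padding=1, groups=g)
        self.dw5 = nn.Conv2d(g, g, 5, padding=2, groups=g)
        self.dw7 = nn.Conv2d(g, g, 7, padding=3, groups=g)
        self.pw = nn.Conv2d(c, c, 1)          # cross-scale fusion
        self.bn = nn.BatchNorm2d(c)
        self.act = nn.SiLU()
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1, groups=c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x1, x2, x3, x4 = x.chunk(4, dim=1)
        y = torch.cat([self.dw3(x1), self.dw5(x2), self.dw7(x3), x4], dim=1)
        # channel shuffle: (B, 4, C/4, H, W) -> transpose -> flatten
        y = y.view(b, 4, c // 4, h, w).transpose(1, 2).reshape(b, c, h, w)
        y = self.act(self.bn(self.pw(y)))
        return self.down(y)
```

A forward pass halves the spatial resolution while keeping the channel count, matching the downsampling role the module plays in the backbone.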

2.3. Loss Function

RT-DETR employs the GIoU loss function for bounding box regression. GIoU is specifically designed to resolve the issue where traditional IoU becomes ineffective or inefficient when optimizing non-overlapping bounding boxes and those with inclusion relationships. Its core innovation resides in the introduction of the minimum enclosing rectangle, which fuses the overlap degree with spatial positional information. This allows the loss function to effectively guide model optimization even in complex scenarios, particularly suitable for UAV aerial small target detection.
However, when using GIoU for bounding box regression in the UAV aerial small target detection scenario, its performance is unsatisfactory, especially when the IoU is low. Due to the small size of small targets’ bounding boxes, minor coordinate deviations can lead to a sharp drop in IoU. Additionally, non-overlapping or inclusion relationships between ground truth and bounding boxes occur frequently. To address the aforementioned issues, this paper introduces Shape-NWD. Its essence lies in fusing the core design of Shape-IoU (i.e., shape weight and scale awareness) with NWD, which is specifically designed for small target detection. This resolves the limitation of traditional NWD that fails to consider the impact of the bounding box’s own shape and scale, improving the localization and size matching accuracy of small targets.
In conventional NWD, the equal-weighted center distance is substituted with a shape-aware weighted center distance, with a higher weight assigned to the center deviation along the short-side direction. Meanwhile, the equal-weighted length-width discrepancies in traditional NWD are indirectly incorporated into the shape weight, prioritizing the size deviations associated with shape matching between the predicted bounding box and the ground truth. Here, w_{gt} and h_{gt} are the width and height of the ground-truth box, and scale is a hyperparameter used to adjust shape sensitivity, set to 0.5 in this paper. W_y weights the vertical center deviation and W_x the horizontal one.
W_y = \frac{2\,(w_{gt})^{scale}}{(w_{gt})^{scale} + (h_{gt})^{scale}}, \qquad W_x = \frac{2\,(h_{gt})^{scale}}{(w_{gt})^{scale} + (h_{gt})^{scale}}
First, the length-width discrepancies between the predicted bounding box and ground truth are computed, retaining the size difference measurement mechanism of conventional NWD to ensure sensitivity to the dimensions of extremely small targets—a critical requirement for UAV aerial small target detection. The term weight denotes a hyperparameter within the formulation, which is set to a value of 2.
B = \frac{(w - w_{gt})^2 + (h - h_{gt})^2}{\mathit{weight}^2}, \qquad D = \sqrt{W_x\,(x_c - x_c^{gt})^2 + W_y\,(y_c - y_c^{gt})^2 + B}
Subsequently, the equal-weighted center distance is substituted with the shape-weighted center distance, which is then fused with the size-difference term to compute Shape-NWD. An exponential function maps the total distance D to a metric value within the range [0, 1], where values closer to 1 denote a more accurate match between the predicted and GT bounding boxes. Here, C is a hyperparameter related to the target scale of the dataset and is set to 25 in this paper.
\mathrm{NWD}_{shape} = e^{-\frac{D}{C}}
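Putting the pieces together, a minimal reference implementation of the Shape-NWD metric, under our reading of the formulas above (shape weights W_x/W_y, size term B, shape-weighted distance D, exponential mapping), might look as follows. The box layout (x_c, y_c, w, h) and the default hyperparameters scale = 0.5, weight = 2, and C = 25 follow the text; everything else is an illustrative sketch rather than the authors' exact code:

```python
import math

def shape_nwd(pred, gt, scale=0.5, weight=2.0, C=25.0):
    """Shape-weighted Normalized Wasserstein Distance between two boxes
    given as (x_c, y_c, w, h). Returns a similarity in (0, 1], where 1
    means a perfect match between prediction and ground truth."""
    xc, yc, w, h = pred
    xg, yg, wg, hg = gt
    # Shape weights: center deviation along the short side is weighted more.
    denom = wg**scale + hg**scale
    Wy = 2.0 * wg**scale / denom
    Wx = 2.0 * hg**scale / denom
    # Size-difference term inherited from conventional NWD.
    B = ((w - wg)**2 + (h - hg)**2) / weight**2
    # Shape-weighted total distance, then exponential mapping to (0, 1].
    D = math.sqrt(Wx * (xc - xg)**2 + Wy * (yc - yg)**2 + B)
    return math.exp(-D / C)

print(shape_nwd((10, 10, 4, 8), (10, 10, 4, 8)))  # perfect match -> 1.0
print(shape_nwd((11, 10, 4, 8), (10, 10, 4, 8)))  # small offset -> just below 1
```

Note that, unlike IoU, the metric degrades smoothly for non-overlapping boxes, which is precisely the property exploited for tiny targets.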

3. Results

3.1. Datasets

The VisDrone dataset is a large-scale UAV vision benchmark constructed by the AISKYEYE team at the Machine Learning and Data Mining Laboratory (MLDM Lab) of Tianjin University. Specifically tailored for diverse computer vision tasks under UAV perspectives, it accurately replicates real-world scenario challenges in UAV applications. As a core and highly influential benchmark in UAV computer vision research and practical applications, it has been widely adopted globally. This study employs the VisDrone dataset for UAV target detection experiments, aligning with the research focus on small aerial targets. The dataset is collected from 14 cities across China, with a geographical span of thousands of kilometers, ensuring diverse regional characteristics. Its scenarios encompass diverse environments, including urban high-rise districts, rural farmlands, and other landscapes, with images captured under varying weather conditions (e.g., sunny, cloudy) and diverse lighting environments. Additionally, it covers both sparsely distributed and extremely crowded target scenarios, fully satisfying the testing requirements of models in complex real-world applications—a key characteristic for evaluating UAV-based detection systems. The dataset defines multiple common target categories, including pedestrians, vehicles, bicycles, motorcycles, and others, making it highly relevant for practical tasks such as traffic surveillance, pedestrian recognition, and vehicle tracking. Compared with ground-view datasets, it accurately captures the unique characteristics of UAV perspectives—small targets, cluttered backgrounds, and frequent occlusions—effectively filling the gap in UAV-specific benchmark datasets. 
Furthermore, its high annotation accuracy and extensive scene coverage enable it to effectively address key challenges in UAV target detection, such as missed detections of small aerial targets and false detections in dense scenarios—critical evaluation criteria for the proposed model in this study.
To fully assess the model’s performance, this study validates the proposed approach on the VisDrone2019 dataset—a large-scale benchmark with abundant samples, widely recognized for its representativeness in UAV target detection. Figure 7 and Figure 8 present the detailed distribution of sample images, annotation counts, and bounding box dimensions for the dataset. To further analyze the scale characteristics of UAV targets, Figure 8 visualizes the scale distribution of ground truth bounding boxes across ten categories using violin plots. In this visualization, the vertical axis represents the square root of the bounding box area in pixels, providing a standardized measure of target size. The width of each “violin” represents the probability density of targets at a given scale: wider sections indicate a higher frequency of occurrence within that specific dimension range. To provide a clear reference for object classification, two horizontal dashed lines are plotted at 32 and 96 pixels, corresponding to the standard thresholds for small and medium objects, respectively. The plot reveals a distinct pattern: for categories such as ‘Pedestrian’, ‘People’, and ‘Bicycle’, the distribution is heavily concentrated in the bottom region below the 32-pixel line, statistically confirming the overwhelming prevalence of tiny targets in UAV-captured imagery. In contrast, vehicle categories like ‘Van’ and ‘Bus’ exhibit a more elongated and diffuse distribution, extending significantly into the medium and large scale ranges. This visualization effectively demonstrates the drastic intra-class scale variations and the core challenge of small object detection that our method addresses.
As illustrated in the figure, the dataset exhibits the following key characteristics and challenges for UAV target detection: the targets are characterized by a large quantity, low resolution, small size, and high inter-class similarity. Specifically, the “pedestrian” and “people” categories are highly indistinguishable in the dataset. From the UAV aerial perspective, the targets exhibit a wide range of size variations and more pronounced mutual occlusion, presenting significant challenges to the performance of target detection algorithms. Additionally, there is a significant class imbalance in annotation counts. For instance, the dataset contains 144,867 bounding box annotations for the “Car” category, compared to only 3246 annotations for the “Awning-tricycle” category. This class imbalance imposes stricter demands on the robustness of detection models, as they must effectively generalize to both frequent and rare target categories. Both the overall bounding box size distribution and the size distribution within individual categories indicate that, except for the “Van”, “Truck”, “Car”, and “Bus” categories (which include more medium and large-sized bounding boxes), the majority of bounding boxes are concentrated within 100 × 100 pixels. Notably, small targets (≤50 × 50 pixels) account for the largest proportion of the dataset. Consequently, the ability to detect small targets is paramount for models deployed in UAV-based target detection tasks—a core focus of the proposed method in this study.
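For reference, the scale bucketing used in the Figure 8 discussion (square root of the bounding box area compared against the 32- and 96-pixel thresholds) can be expressed in a few lines. The boxes below are hypothetical examples, not dataset statistics:

```python
def scale_bucket(w: float, h: float) -> str:
    """Classify a box by sqrt(area) in pixels, using the 32/96-pixel
    thresholds plotted as dashed lines in Figure 8."""
    s = (w * h) ** 0.5
    if s < 32:
        return "small"
    if s < 96:
        return "medium"
    return "large"

boxes = [(18, 40), (60, 70), (120, 200)]  # hypothetical (w, h) pairs
print([scale_bucket(w, h) for w, h in boxes])  # ['small', 'medium', 'large']
```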
The entire dataset is partitioned into a training set, a validation set, and a test set, consisting of 6471, 548, and 1610 images, respectively. Given that the proposed model is deployed on UAV platforms with constrained resources and limited space through embedded devices, only backbone network variants of ResNet18 and ResNet50 are utilized—a design choice aligned with the computational efficiency requirements of aerial deployment. Training and testing are performed on a local Ubuntu operating system, with the following hardware specifications: Intel (R) Core (TM) i9-12900K CPU, GeForce RTX 4090 GPU, 16 GB of system memory, and 24 GB of GPU memory. The software environment for testing is identical to that of training to ensure experimental consistency. The proposed model is an RT-DETR-based variant, trained with a batch size of 4 for 400 epochs. Standard COCO evaluation metrics are used, including mean Average Precision, AP50 and AP across multiple target scales (i.e., small, medium, and large targets). All input images are resized to a resolution of 640 × 640 pixels.

3.2. Comparisons with Other Object Detection Networks

Based on the original RT-DETR, this paper introduces a feature fusion module with frequency enhancement and a grouped multi-kernel interaction module, and optimizes the IoU loss function. RT-DETR-R18 and RT-DETR-R50 are adopted as the baseline models for comparative experiments. The models proposed in this work are denoted Model-R18 and Model-R50, distinguished by their backbone architectures.
On the VisDrone dataset, compared with the baseline RT-DETR-R18, the proposed R18-based model achieves a 0.8% improvement in Average Precision and a 1.3% increase in AP50. Similarly, compared with the baseline, the R50-based model achieves a 1.0% improvement in AP and a 0.9% increase in AP50. In addition, we compared our method with other target detectors that have similar computational costs to the R18 and R50 models in this paper, and the results show that our method is also superior to other methods in terms of accuracy.
This study focuses on aerial image target detection methods suitable for practical deployment, characterized by low hardware requirements and high generalizability. Consequently, the comparative models are restricted to one-stage target detection algorithms, as they offer convenient deployment and broad applicability—key advantages for UAV-based real-world applications. State-of-the-art DETR-series models, various YOLO-series models, and classical improved YOLO variants are selected as comparative baselines. These models cover a broad spectrum of architectural configurations, and extensive existing works have validated their reliability and stability across diverse application scenarios, ensuring the fairness and rigor of comparative evaluations. The comparative experimental results are summarized in Table 2.
To further demonstrate the generalizability of the proposed approach, we additionally evaluated it on the UAVVaste dataset. The results, presented in Table 3, indicate that the proposed model retains a competitive edge over other state-of-the-art models. UAVVaste contains far fewer samples than VisDrone; the strong results there show that the proposed method does not depend on large-scale annotated data, highlighting its data efficiency.
To provide a more intuitive visualization of the comparative results listed in Table 3, we normalized various performance metrics. Crucially, for cost-related indicators where lower values are preferred (e.g., Parameters and GFLOPs), we applied a ‘1-minus’ inversion operation to the normalized values. This transformation unifies the evaluation criteria, ensuring that for all metrics, a value closer to 1 indicates superior performance. As shown in Figure 9a, we use normalized bar charts to describe these experimental results in detail. In addition, in the normalized score heatmap in Figure 9b, the closer the color is to dark green, the better the performance. Our Model-R18 has relatively balanced performance in all aspects and maintains a balance between detection accuracy and real-time performance. Our Model-R50 achieves the highest accuracy in AP, APs, and AP50, demonstrating superior performance.
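The normalization scheme described above (min-max scaling per metric, with a '1-minus' inversion for cost-type metrics) can be sketched as follows. The AP and GFLOPs values used here are hypothetical placeholders, not the numbers from Table 3:

```python
def normalize(values, lower_is_better=False):
    """Min-max normalize one metric column to [0, 1]; invert cost-type
    metrics (e.g., Parameters, GFLOPs) so that 1 always means 'better'."""
    lo, hi = min(values), max(values)
    if hi == lo:
        norm = [1.0] * len(values)  # degenerate column: all models tie
    else:
        norm = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - n for n in norm] if lower_is_better else norm

# Hypothetical AP (%) and GFLOPs columns for three models.
ap     = normalize([27.0, 27.8, 28.5])
gflops = normalize([57, 79, 136], lower_is_better=True)
print(ap[2], gflops[0])  # best AP and cheapest model both map to 1.0
```

After this transformation every column can be drawn on the same [0, 1] axis, which is what makes the bar charts and heatmap in Figure 9 directly comparable across metrics.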
Although there exists a wealth of research on object detection based on multi-scale feature fusion, existing literature in this field mainly focuses on spatial-domain attention mechanisms (e.g., CBAM, SE) or path aggregation methods (e.g., PANet, BiFPN). While these methods perform well in general object detection, they often struggle to distinguish extremely tiny targets from complex background noise in UAV scenarios, as spatial downsampling inevitably blurs high-frequency details. The core advantage of the method proposed in this paper lies in the introduction of the frequency-domain dimension into the backbone network.
As shown in Table 4, we compare our method with the latest SOTA models released in 2024, namely YOLOv10 and YOLOv11, as well as HIC-YOLOv5, which is specifically optimized for UAVs. Compared with YOLOv11, although YOLOv11-M incorporates an improved C3k2 module to enhance feature extraction, our Model-R18 achieves a 1.6% AP improvement with a comparable number of parameters. This demonstrates that even the most advanced spatial feature extractors are less effective than our frequency enhancement strategy when dealing with small UAV targets. Compared with HIC-YOLOv5, our method achieves a significant 1.5% advantage in AP. This indicates that simply increasing the spatial receptive field, as HIC-YOLOv5 does, is less accurate than explicitly recovering high-frequency information via FFT. The data in Table 3 further corroborates this finding. On the UAVVaste dataset, our Model-R18 achieves an AP of 36.2%, explicitly surpassing the baseline RT-DETR-R18 at 35.8% and YOLOv11-S at 27.3%. In summary, the scientific contribution of this paper is not only the improvement in accuracy, but also the proof that “frequency-domain and spatial-domain joint modeling” has superior theoretical upper bounds and practical performance compared to pure spatial-domain methods in solving the feature smoothing problem of small UAV targets.

3.3. Ablation Experiment

Ablation studies are performed using the Model-R18 variant on the VisDrone2019 dataset to evaluate the contribution of each proposed module to detection accuracy. In the tables, FFFE denotes the proposed Frequency-Enhanced Multi-Scale Feature Fusion module, GMKI the Grouped Multi-Kernel Interaction module, and SN the Shape-NWD loss.
The ablation results presented in Table 5 offer empirical evidence regarding the specific contributions of our proposed modules. Specifically, the introduction of the FFFE module results in a notable improvement in detection accuracy, increasing the AP50 from 44.8% to 45.4%. In the context of UAV imagery, where targets are frequently blurred or low-contrast, this improvement is significant. This supports our hypothesis that explicitly encoding features in the frequency domain aids in recovering high-frequency details that are typically lost during spatial downsampling. The distinct rise in AP50 indicates that the detector becomes more capable of distinguishing valid objects from background noise.
Furthermore, the integration of the GMKI module delivers the most substantial single-step improvement in overall AP, increasing performance from 27.0% to 27.5%. Unlike the localized improvements provided by SN or FFFE, GMKI enhances the global representation through multi-kernel interactions. This consistent improvement in the strict AP metric demonstrates that GMKI effectively handles the drastic scale variations inherent in drone datasets, enabling the model to maintain robustness across different object sizes and viewing angles.

4. Discussion

Our research mainly focuses on constructing an end-to-end real-time target detection model suitable for UAV platforms based on RT-DETR. The main contributions lie in improving the backbone using a multi-scale feature fusion module with frequency enhancement and a grouped multi-kernel interaction module, while also introducing Shape-NWD. As indicated by the experimental results, our improvements are effective, increasing target detection accuracy while preserving the real-time performance of RT-DETR. The improvement in experimental results goes beyond mere numerical increments; it empirically validates the theoretical hypotheses put forward in this study. Specifically, the substantial improvement in small object detection accuracy observed on both the VisDrone and UAVVaste datasets can be directly ascribed to the FFFE module's effectiveness in preserving high-frequency information. In the baseline RT-DETR, the feature representations of small targets are prone to being smoothed out in deep convolutional layers. The introduction of frequency-domain enhancement enables effective reweighting and reactivation of these faint edge signals. Furthermore, while the GMKI module incurs a marginal computational overhead, the qualitative results presented in Figure 10 show significantly enhanced robustness in dense and occluded scenarios. This demonstrates that the multi-kernel interaction effectively captures richer contextual semantics, thereby compensating for the limitations of single receptive fields in traditional architectures.
Figure 10 presents detection results of the model on part of the test set. The experiment selects various typical detection scenarios as test samples, including complex backgrounds, target occlusions, dim lighting, dense targets, small targets, and top-down shooting angles, all of which place high demands on the robustness of the detection model. The results show that the proposed model completes the detection task, accurately identifying and localizing targets in each of these scenarios.
Inference results indicate that the performance limitations of the current model are mainly observed in two core scenarios. First, as illustrated in Figure 11, in dense crowd environments, the model exhibits insufficient fine-grained discrimination capability between the ‘Pedestrian’ and ‘People’ categories. In such scenarios, the visual feature boundaries between these two categories become blurred due to diverse individual postures, high target density, severe occlusions, and shape overlaps, thereby making accurate distinction difficult. Second, as qualitatively visualized in Figure 12, misclassifications occur frequently among categories with high inter-class similarity. Typical failure cases include the confusion among rigid vehicles (e.g., ‘Van’, ‘Car’, and ‘Truck’) and non-rigid cycles (e.g., ‘Bicycle’, ’Motor’, ‘Awning-tricycle’ and ‘Tricycle’). Under the overhead perspective of UAVs, these targets share highly similar geometric structures and appearances. The loss of texture details in small-scale representations further exacerbates the difficulty of distinguishing them, leading to false label predictions.
Beyond classification confusion, the model exhibits performance bottlenecks on certain small-scale targets. As shown in Table 6, the metrics for ‘Bicycle’ and ‘Awning-tricycle’ are noticeably lower than those of other categories. Given the fixed imaging quality and the strict requirement for a lightweight architecture, overcoming the deficiency in small target feature extraction to improve recall remains a critical challenge and a primary direction for future research in UAV-based detection.
Although the improved model proposed in this paper achieves better detection accuracy than the baseline in UAV aerial target detection, with significant improvements in AP and AP50, the multi-scale frequency enhancement module and the grouped multi-kernel interaction module increase the model's overall GFLOPs relative to the baseline and reduce the FPS metric to some degree, as detailed in Table 7. Nevertheless, the model achieves a minimum frame rate of 69 FPS, substantially higher than the real-time threshold for practical UAV applications, meeting or exceeding real-time processing requirements in most task scenarios. However, in scenarios involving high-dynamic maneuvering, dense target distributions, or the need for sustained high-frame-rate stability, there remains room to further optimize the model's inference efficiency, enabling higher-performance real-time detection while preserving its accuracy advantages.
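For context, FPS figures such as those in Table 7 are typically obtained by timing repeated forward passes after a warm-up phase. The sketch below uses a dummy workload in place of a real model and omits the device synchronization that accurate GPU timing would additionally require:

```python
import time

def measure_fps(infer, n_warmup=10, n_runs=100):
    """Average frames-per-second of an inference callable; warm-up
    iterations are excluded so one-off initialization cost is ignored."""
    for _ in range(n_warmup):
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    elapsed = time.perf_counter() - t0
    return n_runs / elapsed

# Stand-in for a real forward pass (hypothetical workload).
dummy_infer = lambda: sum(i * i for i in range(1000))
print(f"{measure_fps(dummy_infer):.0f} FPS")
```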
To assess the practical application value of the proposed method in UAV engineering, we conducted a feasibility analysis based on the computational requirements observed during the experiments. The current validation was performed on a high-performance workstation equipped with an NVIDIA GeForce RTX 4090 GPU, an Intel Core i9-12900K CPU, and 24 GB of video memory. Under this configuration, the proposed Model-R18 exhibits a computational complexity of 79 GFLOPs and achieves an inference speed of 126 FPS. Considering the practical requirements of UAV engineering applications, the feasibility of edge-side deployment is a critical aspect of this study. Although the validation experiments were conducted on a high-performance GPU, the proposed architecture demonstrates significant advantages for real-world engineering implementation. Based on the computational metrics, the target deployment platforms for this model are embedded computing units such as the NVIDIA Jetson Orin NX or AGX Orin, which are widely used in UAV systems. To achieve real-time inference under constrained computational resources, the deployment pipeline will utilize the TensorRT inference engine. The model will be exported to the standard ONNX (Open Neural Network Exchange) format and then compiled into a TensorRT engine to maximize hardware utilization. A key advantage of our design is that the proposed GMKI module relies primarily on standard depth-wise separable convolutions. Unlike complex dynamic attention mechanisms that may suffer from latency bottlenecks on edge devices, these standard operators enjoy high-level optimization support in hardware acceleration libraries (e.g., cuDNN), ensuring efficient execution on edge-side GPUs. To balance power consumption and speed, we propose adopting FP16 inference or INT8 Post-Training Quantization for edge deployment. 
Given the robustness of the RT-DETR architecture, the precision loss from quantization is expected to be negligible, while inference throughput can be improved by 2–3 times. This optimization ensures the system maintains low end-to-end latency, satisfying the strict real-time requirements of UAV flight control systems.

5. Conclusions

In this study, we proposed a real-time end-to-end object detection framework specifically designed for UAV aerial imagery, building upon the RT-DETR architecture. By redesigning the ResNet backbone to incorporate the Frequency-Enhanced Multi-Scale Feature Fusion module and the Grouped Multi-Kernel Interaction module, we effectively addressed the inherent challenges of feature mismatch and detail loss in aerial scenarios. Furthermore, integrating the Shape-NWD loss function significantly improved regression accuracy for small and variable-scale targets. Experimental results on the VisDrone2019 dataset show that our proposed Model-R18 and Model-R50 outperform the baseline RT-DETR, achieving AP improvements of 0.8% and 1.0%, and AP50 increases of 1.3% and 0.9%, respectively. Overall, the proposed model balances accuracy and efficiency: despite increased computational complexity, the Model-R50 variant maintains an inference speed of 69 FPS, which fully meets the real-time requirements for UAV deployment.
Despite these promising results, our current approach has certain limitations. First, the introduction of frequency enhancement and multi-kernel interactions inevitably increases computational costs, leading to a reduction in inference speed. Second, as discussed, the model still struggles with fine-grained classification in extremely dense crowds, particularly in distinguishing between visually similar categories such as “Pedestrian” and “People” due to severe occlusion and blurred feature boundaries.
Future research will focus on overcoming current limitations to enable robust edge deployment. To address the increased computational cost caused by frequency enhancement and multi-kernel interactions, we plan to explore structural reparameterization and hardware-aware optimization techniques. These methods aim to reduce inference latency on embedded UAV platforms while preserving the precision improvements achieved by the proposed modules. Furthermore, to reduce class ambiguity in dense crowds (e.g., distinguishing pedestrians from people), we intend to integrate fine-grained feature decoupling or contrastive learning objectives into the detection head. This will improve the model’s ability to learn more discriminative representations for visually similar targets in complex aerial scenes.

Author Contributions

Conceptualization, Y.Z. and D.Z.; methodology, Y.Z. and D.Z.; software, Y.Z., D.Z. and Y.H.; validation, Y.Z., D.Z. and Z.W.; formal analysis, Y.Z. and D.Z.; investigation, Y.Z., D.Z., Y.H. and Z.W.; resources, D.Z. and Y.H.; data curation, Y.Z., D.Z., Y.H. and Z.W.; writing—original draft preparation, Y.Z. and D.Z.; writing—review and editing, Y.Z. and D.Z.; visualization, Y.Z. and D.Z.; supervision, Y.H. and Z.W.; project administration, Y.Z., D.Z.; funding acquisition, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/Zd-hub/UAV-Small-Target-Detection-Method-Based-on-Frequen-cy-Enhanced-Multi-Scale-Fusion-Backbone (accessed on 25 January 2026).

Conflicts of Interest

Author Zhou Wang was employed by the company Chery Automobile Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

FFFE	Frequency-Enhanced Multi-Scale Feature Fusion Module
GMKI	Grouped Multi-Kernel Interaction Module
UAV	Unmanned Aerial Vehicle
GFLOPs	Giga Floating-Point Operations

References

1. Mittal, P.; Singh, R.; Sharma, A. Deep learning-based object detection in low-altitude UAV datasets: A survey. Image Vis. Comput. 2020, 104, 104046.
2. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
3. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
4. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
5. Lou, H.; Duan, X.; Guo, J.; Liu, H.; Gu, J.; Bi, L.; Chen, H. DC-YOLOv8: Small-size object detection algorithm based on camera sensor. Electronics 2023, 12, 2323.
6. Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910.
7. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
8. Fadan, A.M.; Abro, G.E.M.; Khan, A.M. A UAV-Based Framework for Dynamic Traffic Surveillance in Dhahran, Saudi Arabia. In Proceedings of the 2025 IEEE 15th International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 22–23 August 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 114–118.
9. Alqahtani, S.K.; Abro, G.E.M. Autonomous drone-based border surveillance using real-time object detection with Yolo. In Proceedings of the 2025 IEEE 15th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 24–25 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 564–569.
10. Munir, A.; Siddiqui, A.J.; Hossain, M.S.; El-Maleh, A. YOLO-RAW: Advancing UAV Detection with Robustness to Adverse Weather Conditions. IEEE Trans. Intell. Transp. Syst. 2025, 26, 7857–7873.
11. Munir, A.; Siddiqui, A.J.; Anwar, S.; El-Maleh, A.; Khan, A.H.; Rehman, A. Impact of adverse weather and image distortions on vision-based UAV detection: A performance evaluation of deep learning models. Drones 2024, 8, 638.
12. Abro, G.E.M.; Horain, P.; Damurie, J. Image reconstruction & calibration strategies for Fourier ptychographic microscopy—A brief review. In Proceedings of the 2022 IEEE 5th International Symposium in Robotics and Manufacturing Automation (ROMA), Malacca, Malaysia, 6–8 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
13. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
14. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022.
15. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
16. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229.
17. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.
18. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
19. Dai, Z.; Cai, B.; Lin, Y.; Chen, J. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
20. Yao, Z.; Ai, J.; Li, B.; Zhang, C. Efficient detr: Improving end-to-end object detector with dense prior. arXiv 2021, arXiv:2104.01318.
21. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022.
22. Sohan, M.; Ram, T.S.; Reddy, C.V.R. A review on yolov8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics; Springer: Singapore, 2024; pp. 529–545.
23. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 1–21.
24. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
25. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329.
26. Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 3507014.
27. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
28. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
30. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia (MM’16), Amsterdam, The Netherlands, 15–19 October 2016.
31. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
32. Zhang, H.; Zhang, S. Shape-iou: More accurate metric considering bounding box shape and scale. arXiv 2023, arXiv:2312.17663.
33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
34. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
35. Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224.
36. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
37. Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019.
  38. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  39. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020. [Google Scholar]
  40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  41. Ozge Unel, F.; Ozkalayci, B.O.; Cigla, C. The power of tiling for small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  42. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  43. Targ, S.; Almeida, D.; Lyman, K. Resnet in resnet: Generalizing residual architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  44. Wu, Z.; Shen, C.; Van Den Hengel, A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recog. 2019, 90, 119–133. [Google Scholar] [CrossRef]
  45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  46. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  47. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  48. Rao, Y.; Zhao, W.; Zhu, Z.; Lu, J.; Zhou, J. Global filter networks for image classification. Adv. Neural Inf. Process. Syst. 2021, 34, 980–993. [Google Scholar]
  49. Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.-K.; Ren, F. Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 13–19 June 2020. [Google Scholar]
  50. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 for small object detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 6614–6619. [Google Scholar]
Figure 1. Challenges in UAV-based object detection. The four areas marked with yellow boxes are typical challenging scenes captured from the UAV perspective, which involve difficult cases such as dense small targets and occlusions.
Figure 2. In object detection experiments conducted on the COCO2017 dataset, RT-DETR-R50 and RT-DETR-R101 perform worse than the YOLO series in terms of the AP metric, indicating that RT-DETR exhibits relatively poor performance in small object detection [28].
Figure 3. Global architecture diagram. (a) The overall architecture of the proposed improved RT-DETR. The green and dark yellow blocks represent feature vectors, illustrating the vector transmission process. In the CCFF module, the light yellow and light blue blocks denote the downsampling and upsampling operations, respectively. (b) Architecture of the frequency-enhanced backbone network. Solid arrows represent the normal data flow; the dashed arrows in (b) point to the feature maps of different depths output by the backbone network.
Figure 4. Left: the original image; right: the result after high-pass filtering. The filtered result shows that the high-frequency components indeed preserve more texture details of small targets.
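The effect shown in Figure 4 can be reproduced with a minimal FFT-based sketch (not the paper's implementation; the `cutoff` radius is an illustrative assumption): zeroing the low-frequency coefficients suppresses smooth background regions, while the edges and texture of a small target survive.

```python
import numpy as np

def high_pass_filter(img, cutoff=8):
    """Keep only high-frequency FFT components: zero out all
    coefficients within `cutoff` of the (shifted) spectrum center."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    f[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

# A tiny bright "target" on a flat background: after filtering,
# the uniform background (pure low frequency) vanishes while the
# target's edges and texture remain.
img = np.zeros((64, 64))
img[30:34, 30:34] = 1.0
hp = high_pass_filter(img)
```

A perfectly uniform image is removed entirely by this filter, which is exactly why high-frequency emphasis favors small-target detail over background.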
Figure 5. The specific structure of the frequency-enhanced multi-scale feature fusion module.
Figure 6. Architecture diagram of the grouped multi-kernel interaction module.
Figure 7. Distribution of sample images in the VisDrone dataset.
Figure 8. Distribution of object scales across different categories in the VisDrone dataset. The dashed lines at 32 and 96 pixels represent the thresholds for small and medium objects, respectively, highlighting the prevalence of tiny targets in UAV-captured imagery.
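The 32- and 96-pixel thresholds in Figure 8 follow the COCO convention of bucketing objects by the square root of box area. A minimal helper (hypothetical names, a sketch of that convention rather than the dataset's own tooling) makes the bucketing explicit:

```python
def size_class(w, h, small=32, medium=96):
    """COCO-style size bucket for a box of width w and height h
    (pixels): sqrt(w*h) < 32 -> 'small', < 96 -> 'medium',
    otherwise 'large'."""
    s = (w * h) ** 0.5
    if s < small:
        return "small"
    if s < medium:
        return "medium"
    return "large"

# A typical distant pedestrian in UAV imagery (~20 px on a side)
# falls in the 'small' bucket that dominates VisDrone.
bucket = size_class(20, 20)  # -> 'small'
```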
Figure 9. Normalized comparative bar charts and heatmaps of model metrics. The best values are highlighted with black boxes. Our Model-R50 achieves the best performance in terms of AP, AP50, and APs. Downward arrows indicate that lower values are better for the corresponding metric; upward arrows indicate that higher values are better. (a) Normalized performance comparison of object detection models. (b) Model performance heatmap.
Figure 10. Model inference results on the VisDrone dataset.
Figure 11. Non-normalized confusion matrix of Model-R18 on the VisDrone2019 validation set, displaying the raw number of detection instances.
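Once each detection has been matched to a ground-truth label (the IoU-matching step that detection evaluation also requires is omitted here), the raw counts in a non-normalized confusion matrix like Figure 11 accumulate in a few lines. The helper below is a hypothetical sketch, not the evaluation code used in the paper:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Non-normalized confusion matrix: rows are ground-truth
    classes, columns are predicted classes, entries are raw counts."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# e.g. with class 0 = "car" and class 1 = "van", one car
# misread as a van lands in cell cm[0, 1]:
cm = confusion_matrix([0, 0, 1], [0, 1, 1], n_classes=2)
```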
Figure 12. Visualization of typical misclassification cases on the VisDrone2019 dataset. Red boxes denote the model’s incorrect predictions; green boxes denote the ground truth. For clarity, all other prediction boxes have been removed, keeping only the incorrect ones. The cases shown include “car” predicted as “van”, “pedestrian” predicted as “people”, “awning-tricycle” predicted as “tricycle”, and “motor” predicted as “bicycle”.
Table 1. Summary of state-of-the-art object detection methods and their limitations.
| Representative Approaches | Key Advantages | Key Limitations |
| --- | --- | --- |
| YOLOv8 [22], YOLOv9 [23], YOLOv11 [24] | High inference speed; mature ecosystem. | Dependence on NMS creates deployment bottlenecks; performance drops significantly for dense tiny objects due to heuristic anchor matching. |
| DETR, Deformable DETR | Eliminates NMS; captures global context. | Slow convergence speed; extremely high computational complexity. |
| RT-DETR | Real-time speed with end-to-end architecture. | Feature degradation during deep downsampling causes loss of fine-grained details for small targets. |
Table 2. Experimental results on the VisDrone-2019.
| Model | Params (M) | GFLOPs | FPS | AP | AP50 |
| --- | --- | --- | --- | --- | --- |
| Real-time Object Detectors | | | | | |
| YOLOv8-M | 25.9 | 78.9 | 156 | 24.6 | 40.7 |
| YOLOv8-L | 43.7 | 165.2 | 79 | 26.1 | 42.7 |
| YOLOv9-S | 7.2 | 26.7 | 160 | 22.7 | 38.3 |
| YOLOv9-M | 20.1 | 76.8 | 110 | 25.2 | 42.0 |
| YOLOv10-M | 15.4 | 59.1 | 254 | 24.5 | 40.5 |
| YOLOv10-L | 24.4 | 120.3 | 163 | 26.3 | 43.1 |
| YOLOv11-S | 9.4 | 21.3 | 265 | 23.0 | 38.7 |
| YOLOv11-M | 20.0 | 67.7 | 162 | 25.9 | 43.1 |
| Object Detectors for UAV Imagery | | | | | |
| HIC-YOLOv5 [50] | 9.4 | 31.7 | 82 | 26.0 | 44.3 |
| End-to-end Object Detectors | | | | | |
| DETR | 60 | 187 | 28 | 24.1 | 40.1 |
| Deformable DETR | 40 | 173 | 19 | 27.1 | 42.2 |
| RT-DETR-R18 | 20 | 60 | 183 | 26.7 | 44.6 |
| RT-DETR-R50 | 42 | 136 | 89 | 28.4 | 47.0 |
| Our Model | | | | | |
| Model-R18 | 20.8 | 79 | 126 | 27.5 | 45.9 |
| Model-R50 | 42.9 | 172 | 69 | 29.4 | 47.9 |
The bold text indicates the best performance.
Table 3. Experimental results on UAVVaste.
| Model | Params (M) | GFLOPs | FPS | APS | APM | AP | AP50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv11-S | 9.4 | 21.3 | 255 | 27.3 | 48.7 | 27.8 | 63.0 |
| HIC-YOLOv5 | 9.4 | 31.2 | 80 | 30.8 | 20.9 | 30.5 | 65.1 |
| RT-DETR-R18 | 20.0 | 57.3 | 184 | 35.8 | 64.8 | 36.3 | 72.6 |
| RT-DETR-R50 | 42.0 | 129.9 | 69 | 37.0 | 62.3 | 37.4 | 73.5 |
| Model-R18 | 20.8 | 79 | 127 | 36.2 | 64.2 | 36.9 | 73.1 |
| Model-R50 | 42.9 | 172 | 71 | 37.3 | 61.1 | 37.5 | 74.2 |
Table 4. Comparison of technical characteristics and performance limitations across different detection paradigms.
| Category | Representative Models | Fusion Strategy | AP | AP50 |
| --- | --- | --- | --- | --- |
| General Real-time Detectors | YOLOv8 | FPN + PANet | 24.6/26.1 (+2.9/3.3) | 40.7/42.7 (+4.8/7.2) |
| | YOLOv9 | | 22.7/25.2 (+4.8/4.2) | 38.3/42.0 (+7.6/5.9) |
| | YOLOv10 | | 24.5/26.3 (+3.0/3.1) | 40.5/43.1 (+5.4/4.8) |
| | YOLOv11 | | 23.0/25.9 (+4.5/3.5) | 38.7/43.1 (+7.2/4.8) |
| UAV-Specialized Detectors | HIC-YOLOv5 [50] | CBAM/SE/Reweighting | 26.0 (+1.5) | 44.3 (+1.6) |
| End-to-end Object Detectors | RT-DETR | AIFI + CCFF | 26.7/28.4 (+0.8/1.0) | 44.6/47.0 (+1.3/0.9) |
| Ours | Model-R18/R50 | FFFE + GMKI | 27.5/29.4 | 45.9/47.9 |
For the YOLO series, each version is evaluated at two distinct scales. The values to the left and right of the forward slash (“/”) correspond to the lightweight and large-capacity models, respectively. Similarly, for RT-DETR and the proposed method, this notation denotes models equipped with the ResNet18 and ResNet50 backbones, respectively. The bold text indicates the best performance.
Table 5. Results of the ablation study.
| Baseline | SN | FFFE | GMKI | AP | AP50 |
| --- | --- | --- | --- | --- | --- |
| √ | | | | 26.7 | 44.6 |
| √ | √ | | | 26.8 | 44.8 |
| √ | √ | √ | | 27.0 | 45.4 |
| √ | √ | √ | √ | 27.5 | 45.9 |
The symbol “√” indicates that the corresponding module is used.
Table 6. Detailed detection performance of Model-R18 across individual categories, evaluated by Recall and AP50.
| Categories | Recall | AP50 | Categories | Recall | AP50 |
| --- | --- | --- | --- | --- | --- |
| pedestrian | 48.1 | 53.4 | truck | 34.5 | 36.4 |
| people | 43.4 | 45.1 | tricycle | 32.3 | 32.4 |
| bicycle | 19.9 | 18.2 | awning-tricycle | 18.8 | 18.3 |
| car | 84.1 | 85.9 | bus | 60.2 | 62.5 |
| van | 46.4 | 50.5 | motor | 56.6 | 56.8 |
Table 7. Comparison of model performance metrics.
| Model | GFLOPs | FPS |
| --- | --- | --- |
| RT-DETR-R18 | 60 | 183 |
| RT-DETR-R50 | 136 | 89 |
| Model-R18 | 79 | 126 |
| Model-R50 | 172 | 69 |
MDPI and ACS Style

Zhong, Y.; Zhao, D.; Han, Y.; Wang, Z. UAV Small Target Detection Method Based on Frequency-Enhanced Multi-Scale Fusion Backbone. Drones 2026, 10, 106. https://doi.org/10.3390/drones10020106
