3.1. The Framework of DAU-YOLO
Figure 1 illustrates the architecture of the proposed network, DAU-YOLO, which builds upon YOLOv11. YOLOv11 has achieved state-of-the-art (SOTA) performance across various domains, offering enhanced feature extraction, optimized efficiency, and competitive accuracy with fewer parameters than YOLOv8, making it a strong baseline. Our goal is to further improve small-object detection accuracy without significantly altering the network architecture. The backbone consists of one stem layer and four stage layers. The stem layer is an RFA module, and each stage layer consists of one RFA module and one C3k2 module. The RFA module, which introduces the Receptive-Field Attention mechanism into standard convolution, retains more small-object information during the original 3 × 3 convolutional downsampling. In the neck, we add the Dynamic Attention Upsampling (DAU) module to the standard PAFPN [15] to better integrate features across scales. Among these features, the shallow ones are especially crucial, since they preserve a large amount of small-object information; the module therefore applies dynamic attention mechanisms to extract shallow features more effectively. Finally, to minimize the parameter count, we experimented with removing the deep-feature detection head after adding the shallow-feature detection head, keeping the number of parameters as small as possible. To accommodate varying accuracy requirements and hardware constraints, DAU-YOLO is released in multiple versions, including nano (n), small (s), medium (m), large (l), and extra-large (x), mirroring YOLOv11. All versions share the same overall architecture, differing only in network depth and the number of parameters per layer. The parameter size of each version is detailed in Table 1.
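To make the backbone layout concrete, the following PyTorch-style sketch wires one RFA stem and four RFA + C3k2 stages as described above. The `RFAConv` and `C3k2` classes are assumed to match the sketches given in Section 3.2; the channel widths are illustrative assumptions, not the released configurations.

```python
import torch.nn as nn

# Hypothetical wiring of the DAU-YOLO backbone described above:
# an RFA stem followed by four (RFA downsample + C3k2) stage layers.
# RFAConv and C3k2 are sketched in Section 3.2; widths are illustrative.
class DAUYOLOBackbone(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.stem = RFAConv(3, widths[0])               # stem layer
        self.stages = nn.ModuleList([
            nn.Sequential(
                RFAConv(widths[i], widths[i + 1]),      # stride-2 downsampling
                C3k2(widths[i + 1], widths[i + 1]),     # feature extraction
            )
            for i in range(4)
        ])

    def forward(self, x):
        x = self.stem(x)
        feats = []                                      # P1..P4 for the neck
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```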
3.2. RFA Module
In DAU-YOLO, the Receptive-Field Attention module and the C3k2 module are utilized to ensure high-quality feature extraction and effective image downsampling. The C3k2 module is a newly introduced feature extraction component in YOLOv11. It divides the input features into two parts: one part is processed directly through standard convolution operations, while the other passes through multiple C3k structures or bottleneck structures with variable convolutional kernels (e.g., 3 × 3). Finally, the two feature streams are concatenated and fused using a 1 × 1 convolution. This design maintains a lightweight structure while effectively extracting features in complex scenarios. The structure of the C3k2 block is shown in Figure 2.
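The split–transform–concatenate pattern described above can be sketched as follows. This is a minimal PyTorch illustration of the described data flow, not the exact Ultralytics implementation; the bottleneck standing in for the C3k structures is a plain residual block for brevity.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Plain residual bottleneck standing in for the C3k/bottleneck structures.
    def __init__(self, c, k=3):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, k, padding=k // 2, bias=False)
        self.conv2 = nn.Conv2d(c, c, k, padding=k // 2, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c), nn.BatchNorm2d(c)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return x + self.bn2(self.conv2(self.act(self.bn1(self.conv1(x)))))

class C3k2(nn.Module):
    # Sketch of the C3k2 data flow: split the input features, transform one
    # branch with stacked bottlenecks, then fuse both with a 1x1 convolution.
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_in // 2
        self.split = nn.Conv2d(c_in, 2 * c_mid, 1, bias=False)   # produces both branches
        self.blocks = nn.Sequential(*[Bottleneck(c_mid) for _ in range(n)])
        self.fuse = nn.Conv2d(2 * c_mid, c_out, 1, bias=False)   # 1x1 fusion

    def forward(self, x):
        a, b = self.split(x).chunk(2, dim=1)   # one part kept, one part transformed
        return self.fuse(torch.cat((a, self.blocks(b)), dim=1))
```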
The Receptive-Field Attention module replaces the standard convolution in YOLOv11, providing an efficient approach to feature extraction and downsampling. This approach not only highlights the importance of the various features within the receptive-field window but also enhances the spatial representation of receptive-field features. The ultra-small objects to be detected span only a single-digit number of pixels in a 1920 × 1080 image, and after the image is resized to 640 × 640 for network input, their pixel count becomes smaller still. During downsampling, the weight distribution within each sliding window is therefore crucial. Accordingly, we drew inspiration from Receptive-Field Attention and integrated it into the backbone, enabling maximal extraction and differentiation of local small-object information.
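As a concrete illustration (assuming aspect-preserving letterbox resizing, so the horizontal scale factor governs the reduction):

$$ s = \frac{640}{1920} = \frac{1}{3}, \qquad \text{so a } 12 \times 12 \text{ px object occupies only about } 4 \times 4 \text{ px at the network input.} $$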
As shown in Figure 3, the Receptive-Field Attention module consists of two paths. The first path computes the attention map. If the convolutional kernel size is $k \times k$, then after downsampling and extraction of receptive-field spatial characteristics, the input feature $X \in \mathbb{R}^{b \times c \times h \times w}$ becomes a receptive-field feature $X_{rf} \in \mathbb{R}^{b \times ck^2 \times h \times w}$, where $b$, $c$, $h$, and $w$ denote the batch size, channels, height, and width, respectively. To reduce computational complexity and accelerate training, we reshape the feature and use AvgPool2d to aggregate the global information of each receptive-field feature. We then apply softmax to highlight the importance of each feature within the receptive-field representation, obtaining the attention map $A_{rf}$. The calculation formula is given in Equation (1):

$$A_{rf} = \mathrm{Softmax}\left(\mathrm{AvgPool2d}\left(X_{rf}\right)\right) \qquad (1)$$
The other path obtains the receptive-field feature. We apply a CBR operation (Convolution + BatchNorm + ReLU) to the input $X$ and reshape the result to $F_{rf} \in \mathbb{R}^{b \times ck^2 \times h \times w}$. The convolution is a $k \times k$ group convolution, designed to let the information within each receptive field interact. The calculation formula is given in Equation (2):

$$F_{rf} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{k \times k}(X)\right)\right) \qquad (2)$$

Finally, the results of the two paths are combined by element-wise multiplication (denoted $\otimes$), giving the receptive-field attention feature $F$. The computation of RFA can generally be formulated as Equation (3):

$$F = A_{rf} \otimes F_{rf} = \mathrm{Softmax}\left(\mathrm{AvgPool2d}\left(X_{rf}\right)\right) \otimes \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{k \times k}(X)\right)\right) \qquad (3)$$
Subsequently, $F$ needs to be reshaped and processed through a convolutional layer to transform its shape into $(b, c_n, h, w)$, which is then fed into the C3k2 module. The size of $c_n$ is set manually, typically as $c_n = 2c_{n-1}$ (doubling at each stage), where $n$ represents the index of the stage layer.
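Putting Equations (1)–(3) together, the following PyTorch sketch shows one way to realize the RFA module with stride-2 downsampling. It follows the two-path structure above, but the specific layer choices (the group convolution that extracts the receptive fields, the local average pooling, and the final stride-$k$ fusion convolution) are our assumptions, not the exact released code.

```python
import torch
import torch.nn as nn

class RFAConv(nn.Module):
    """Sketch of the RFA module (Eqs. (1)-(3)) with stride-2 downsampling.
    The k x k group convolution and the final stride-k fusion convolution
    are assumptions consistent with the description, not the released code."""

    def __init__(self, c_in, c_out, k=3, stride=2):
        super().__init__()
        self.k = k
        # Receptive-field extraction: a k x k group conv expands every input
        # channel into its k*k receptive-field positions -> (b, c*k*k, h, w).
        self.unfold = nn.Conv2d(c_in, c_in * k * k, k, stride=stride,
                                padding=k // 2, groups=c_in, bias=False)
        self.bn = nn.BatchNorm2d(c_in * k * k)
        self.act = nn.ReLU(inplace=True)
        # Eq. (1): AvgPool2d aggregates each receptive-field feature before softmax.
        self.pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        # Fusion: rearrange to (b, c, k*h, k*w) and fuse with a stride-k conv,
        # yielding the (b, c_n, h, w) tensor that feeds the C3k2 module.
        self.fuse = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=k, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        b, c, k = x.shape[0], x.shape[1], self.k
        x_rf = self.unfold(x)                              # (b, c*k*k, h, w)
        h, w = x_rf.shape[2], x_rf.shape[3]
        # Eq. (1): attention over the k*k positions of each receptive field.
        a_rf = self.pool(x_rf).view(b, c, k * k, h, w)
        a_rf = torch.softmax(a_rf, dim=2)
        # Eq. (2): CBR path producing the receptive-field feature F_rf.
        f_rf = self.act(self.bn(x_rf)).view(b, c, k * k, h, w)
        # Eq. (3): element-wise product of attention map and feature.
        f = (a_rf * f_rf).view(b, c, k, k, h, w)
        # Rearrange receptive fields spatially, then fuse/downsample.
        f = f.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * k, w * k)
        return self.fuse(f)
```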
3.3. DAU Module
In the proposed network, the neck retains a PAFPN architecture. PAFPN adds a bottom–up pathway after the top–down pathway of FPN, reinforcing the feature hierarchy by propagating precise localization signals from lower layers through bottom–up path augmentation. Building upon the two original top–down and bottom–up layers of YOLOv11, we make two major modifications. First, a novel module named DAU (Dynamic Attention Upsampling) is introduced at the top of the neck. This module enhances small-object feature extraction through upsampling and maximizes information utilization via spatial-diffusion and task-aware attention mechanisms. Second, the positions of the two bottom–up layers are adjusted: the bottom–up layer at the deepest stage is removed, and a new one is added at the top. This allows full interaction between the DAU module and the other neck modules while keeping the increase in parameters small.
In the backbone, apart from the stem layer, there are four stage layers, each performing downsampling. We refer to the outputs of these four layers as P1, P2, P3, and P4. In YOLOv11, the features P2, P3, and P4 are concatenated with the output features of the next lower top–down layer in the neck, followed by a C3k2 operation. It is generally understood that, during downsampling and feature extraction, shallow layers undergo fewer downsampling operations and thus preserve richer fine details of small objects, whereas deep layers primarily capture global information about objects and their backgrounds. The network therefore aims to fully utilize the shallow features of P1. To achieve this, an initial upsampling operation is performed within the DAU module. The DAU module fully integrates the feature P1 with the output of top–down layer 2 and processes them through the C3k2 module for effective feature extraction. The extracted features are then enhanced by a spatial-diffusion block and a task-aware block. The structure of the DAU module is shown in Figure 4.
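A minimal sketch of this wiring is given below, assuming the `C3k2` sketch from Section 3.2 and the `SpatialDiffusionBlock` and `TaskAwareBlock` sketches that follow Equations (4) and (5); the channel sizes and the 2× upsampling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DAUModule(nn.Module):
    # Sketch of the DAU module: upsample the top-down feature to P1 resolution,
    # fuse it with the shallow P1 feature via C3k2, then refine the result with
    # spatial-diffusion and task-aware attention (applied sequentially).
    def __init__(self, c_p1, c_td, c_out):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.c3k2 = C3k2(c_p1 + c_td, c_out)
        self.spatial = SpatialDiffusionBlock(c_out)   # Eq. (4)
        self.task = TaskAwareBlock(c_out)             # Eq. (5)

    def forward(self, p1, top_down):
        x = torch.cat((p1, self.upsample(top_down)), dim=1)  # fuse shallow + deep
        x = self.c3k2(x)
        return self.task(self.spatial(x))             # sequential attention
```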
The use of these two attention mechanisms stems from a careful consideration of the feature characteristics before the detection layer. First, since the PAFPN structure is employed, selecting the most appropriate scale among multiple scales is crucial; however, because this method specifically targets small-object detection, and subsequent experiments demonstrate that the P1 layer provides the optimal scale, no scale-attention block is added, which reduces parameter complexity. Second, in the spatial domain, regions of interest should receive higher attention weights to obtain a good semantic representation. Finally, the detection head serves multiple tasks: YOLO-based detectors must output the box, cls, and dfl losses. We therefore adopt a task-aware attention mechanism, enabling adaptive channel-wise attention that effectively prioritizes different tasks. The two attention blocks are applied sequentially.
To better distinguish target objects from adjacent objects and the background, we introduce a learnable offset into the standard convolution. Considering the high dimensionality of the spatial domain, the spatial-diffusion block is divided into two steps: (1) utilizing deformable convolution [40] to enable the attention mechanism to learn sparser representations and (2) aggregating features within the same spatial region. The computation of the spatial-diffusion block can generally be formulated as Equation (4):

$$\pi_S(F) \cdot F = \frac{1}{N} \sum_{k=1}^{N} F\left(p_k + \Delta p_k\right) \cdot \Delta m_k \qquad (4)$$

where $N$ is the number of sparse sampling locations, $p_k + \Delta p_k$ is a location shifted by the self-learned spatial offset $\Delta p_k$ to focus on a discriminative region, and $\Delta m_k$ is a self-learned importance scalar at location $p_k$.
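One way to realize this block is with a modulated deformable convolution, where the offsets play the role of $\Delta p_k$ and the modulation scalars the role of $\Delta m_k$. The sketch below uses torchvision's `deform_conv2d`; the offset and modulation prediction branches are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SpatialDiffusionBlock(nn.Module):
    # Sketch of Eq. (4): a modulated deformable convolution whose offsets
    # (delta p_k) and modulation scalars (delta m_k) are predicted from the
    # input feature, sampling N = k*k sparse locations per output position.
    def __init__(self, c, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(c, c, k, k) * 0.01)
        # Predict 2 offsets (x, y) and 1 modulation scalar per sampling location.
        self.offset = nn.Conv2d(c, 2 * k * k, 3, padding=1)
        self.mask = nn.Conv2d(c, k * k, 3, padding=1)

    def forward(self, x):
        offset = self.offset(x)                    # self-learned delta p_k
        mask = torch.sigmoid(self.mask(x))         # self-learned delta m_k
        return deform_conv2d(x, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```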
Task-aware attention is essentially a form of channel attention. We integrate a task-aware block after the spatial-diffusion block to facilitate joint learning and enhance the generalization of the object representation. It adaptively selects the optimal activation function for each channel, dynamically switching channels of features on and off to prioritize different tasks. The computation of the task-aware block can be formulated as Equation (5):

$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\; \alpha^2(F) \cdot F_c + \beta^2(F)\right) \qquad (5)$$

where $F_c$ is the feature slice at the $c$-th channel and $[\alpha^1, \beta^1, \alpha^2, \beta^2]^{T} = \theta(\cdot)$ is a hyper-function that learns to control the activation thresholds. The inspiration for the hyper-function originates from Dynamic ReLU [41].
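A compact sketch of this block, following the Dynamic-ReLU-style formulation that Equation (5) cites, is shown below; the pooling-based hyper-function and its identity-centered initialization are our assumptions.

```python
import torch
import torch.nn as nn

class TaskAwareBlock(nn.Module):
    # Sketch of Eq. (5): a hyper-function theta (global pool + small MLP)
    # predicts per-channel (alpha1, beta1, alpha2, beta2), and the output is
    # the channel-wise max of the two learned linear activations.
    def __init__(self, c, reduction=4):
        super().__init__()
        self.theta = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(c, c // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(c // reduction, 4 * c),
        )

    def forward(self, x):
        b, c = x.shape[0], x.shape[1]
        # Squash hyper-function outputs into (-1, 1), centered so the block
        # starts near the identity mapping (alpha1 ~ 1, others ~ 0).
        params = 2 * torch.sigmoid(self.theta(x)) - 1
        a1, b1, a2, b2 = params.view(b, 4, c, 1, 1).unbind(1)
        a1 = a1 + 1.0                                   # default slope of 1
        return torch.max(a1 * x + b1, a2 * x + b2)      # Eq. (5)
```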
Since the DAU module introduces an additional upsampling, a further top–down layer is required in the PAFPN, resulting in four detection layers. However, experimental results indicate that removing the lowest-resolution (deepest) detection layer has little impact on accuracy. Therefore, in this work, we omit that detection layer along with its corresponding bottom–up layer, minimizing the parameter count without compromising accuracy.