AD-YOLO: A Unified Method for Traffic-Dense and Small Object Detection in UAV Images

Deng, Yu; Hu, Yucong; Ye, Yun; Xu, Pengpeng

doi:10.3390/drones10050338

¹ School of Civil Engineering and Transportation, South China University of Technology, Wushan Road, Guangzhou 510641, China

² School of Traffic and Transportation Engineering, Central South University, Shaoshan South Road, Changsha 410075, China

³ Faculty of Maritime and Transportation, Ningbo University, Fenghua Road, Ningbo 315832, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 338; https://doi.org/10.3390/drones10050338

Submission received: 9 March 2026 / Revised: 28 April 2026 / Accepted: 29 April 2026 / Published: 1 May 2026

(This article belongs to the Topic Unmanned Vehicles Technology and Embodied Intelligence Systems for Intelligent Transportation)

Highlights

What are the main findings?

AD-YOLO improves small object detection in traffic-dense UAV images by integrating adaptive orientation-aware feature extraction, dual-path cross-scale feature fusion, and a reparameterized large-kernel fusion module.
AD-YOLO outperforms baseline models in detection accuracy with acceptable computational costs, demonstrating its strong robustness and application potential under complex aerial perspectives.

What are the implications of the main findings?

Jointly modeling object orientations, multi-scale contexts, and bidirectional feature interactions contributes to enhancing the detection of densely distributed, scale-varying objects in UAV images.
AD-YOLO offers a concise yet effective approach to boosting detection performance on dense and small objects in UAV imagery, without requiring extensive modifications to the original framework.

Abstract

The densely distributed, scale-varying objects in unmanned aerial vehicle (UAV) images, together with their dynamic, diverse, and unconstrained backgrounds, make conventional detection methods prone to missed detections, false alarms, and localization biases. To improve UAV vision tasks, we propose AD-YOLO, a unified method tailored for small object detection in traffic-dense settings. First, a module combining an adaptive rotation convolution unit and grouped directional attention with mixed-kernel features is introduced to enhance the model’s orientation invariance and multi-scale discrimination. Then, a dual-path collaborative feature pyramid network is proposed to jointly refine the model’s semantic and spatial details via a multi-directional context aggregation path and a hierarchical semantic progressive fusion path. Last, a hierarchically dense reparameterized large-kernel module is designed to produce broader receptive fields with reduced computational complexity. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that AD-YOLO outperforms state-of-the-art methods in detection accuracy while maintaining favorable computational efficiency.

Keywords:

UAV images; object detection; adaptive rotation convolution; multi-scale feature; large-kernel convolution

1. Introduction

Traffic object detection, as a core perception technology in intelligent transportation systems, plays an irreplaceable role in downstream tasks such as traffic state prediction, traffic violation recognition, vehicle accident analysis, and scene understanding in automated vehicles [1,2]. Traditional detection approaches based on fixed ground cameras or vehicle-mounted sensors have inherent limitations in coverage range, deployment flexibility, and viewpoint diversity. Recently, unmanned aerial vehicles (UAVs), which offer high-altitude, wide-area coverage and low geometric distortion, have become increasingly prevalent in traffic monitoring, emergency response, and road condition assessment [3]. Nevertheless, because UAVs fly at high altitudes, traffic objects such as vehicles and pedestrians typically occupy only a few pixels in aerial imagery and are characterized by extremely small scales, blurred textures, and arbitrary orientations. These densely distributed, scale-varying targets, coupled with dynamic, diverse, and unconstrained backgrounds, render traditional detection models susceptible to missed detections, false alarms, and localization biases [4].

At present, single-stage detectors, represented by the you only look once (YOLO) series [5], have been widely adopted in various computer vision tasks, owing to their superior efficiency, strong adaptability, and excellent scalability [6]. However, existing architectures are primarily fine-tuned for ground-view scenarios, which involve moderately scaled targets with clear structural features. This makes them less effective in detecting small, dense targets in UAV-captured imagery. While intermediate strategies such as cascaded architectures and attention mechanisms help improve detection accuracy [7,8], these strategies inevitably increase model parameters and computational overhead.

Therefore, improving detection accuracy while keeping computational cost in check has emerged as a key challenge in UAV-based traffic object detection. To fill this gap, we propose AD-YOLO, a unified YOLO variant that integrates an Adaptive guidance (AG) module and a Dual-path collaborative feature pyramid network (DPCFPN). AD-YOLO advances state-of-the-art (SOTA) algorithms for UAV-based vision tasks in three respects: feature extraction enhancement, cross-scale fusion optimization, and lightweight network design. It is designed to balance detail preservation, semantic enhancement, and computational efficiency. Specifically, our main contributions are summarized as follows:

(1): To mitigate information loss during the feature extraction of diverse traffic objects, we propose the AG module, which boosts the backbone network’s capability to extract multi-orientation and multi-scale object features via two key components: the adaptive rotational convolution unit (ARCUnit) and the group directional attention mechanism with mixed kernels (GDA-MK).
(2): To alleviate the fine-detail loss caused by downsampling and the information redundancy induced by upsampling in conventional feature pyramid architectures, we propose the DPCFPN. By coupling the multi-directional context aggregation path (MDCAP) with the hierarchical semantic progressive fusion path (HSPFP), the DPCFPN synergistically captures fine-grained spatial details and high-level abstract semantics, thereby improving detection consistency for multi-scale objects.
(3): To enhance the representation of deep features embedded in the DPCFPN, we introduce the hierarchically dense reparameterized large-kernel (HDRepLK) module. Without significantly increasing the number of model parameters, HDRepLK effectively expands the network’s receptive field and enhances its capacity to fuse multi-scale contextual information.
(4): Extensive experiments conducted on two mainstream UAV-based traffic object datasets, VisDrone2019 [4] and UAVDT [9], demonstrate that AD-YOLO outperforms SOTA methods in detection accuracy with acceptable computational costs, verifying its strong robustness and promising potential for application to complex aerial perspectives.

The remainder of this paper is organized as follows. Section 2 critically reviews related work. Section 3 defines the research problem and delineates the proposed AD-YOLO. Section 4 presents experimental studies to illustrate the merits of AD-YOLO, with major findings, limitations, and potential research directions summarized in Section 5.

2. Literature Review

This section presents a systematic review of the research progress in traffic object detection, with a particular focus on methods tailored for conventional traffic scenarios and techniques dedicated to small-object detection in UAV aerial imagery. The advantages and limitations of these approaches are analyzed in detail to establish a solid theoretical foundation and technical background for the proposed work.

2.1. Traditional Traffic Object Detection

Earlier studies focused primarily on images captured via fixed ground cameras or on-board vision systems, where objects appear relatively large, imaging distances are short, and backgrounds remain relatively stable, all of which favor high-accuracy detection. In such settings, two-stage detectors such as Fast R-CNN [10] and Faster R-CNN [11] have shown superior accuracy, owing to their strong feature extraction and classification capabilities, whereas single-stage detectors such as the YOLO series [5,12,13,14,15,16,17] and the single shot multi-box detector [18] offer high detection speeds with respectable accuracy.

To cater to the unique demands of traffic monitoring, numerous studies have proposed forms of task-oriented optimization. For example, Mhalla et al. [19] established a multi-object detection system for traffic surveillance based on the Fast R-CNN, which significantly improved the pedestrian detection rate in surveillance videos and effectively enhanced detection performance under nighttime conditions. Chen et al. [20] developed an edge-computing framework using a pruned YOLOv3 model to enable efficient vehicle detection under fixed viewpoints. Charouh et al. [21] integrated a background subtraction-based preprocessing module into YOLOv5, reducing computational loads by localizing motion-active regions.

However, these ad hoc methods are likely subject to performance degradation when applied to UAV imagery, in which traffic objects such as vehicles and pedestrians often occupy only a few to several dozen pixels, exhibiting extremely small scales, blurred textures, and arbitrary orientations. Targets captured by UAVs are also densely packed and embedded in complex, heterogeneous backgrounds. The feature extraction and receptive field designs of conventional detectors—refined for large, well-defined objects—may struggle to capture sufficient discriminative information from such tiny, irregular instances, resulting in weak feature representations, inaccurate localizations, and high miss rates.

2.2. Traffic Object Detection in UAV Images

To tackle challenges in UAV aerial imagery such as tiny object scales, complex backgrounds, diverse viewpoints, and high target densities, existing studies have primarily explored two key directions: feature extraction and feature fusion.

2.2.1. Feature Extraction Mechanisms

In YOLO-based frameworks, the feature extraction module, typically embedded within the backbone network, plays a pivotal role in determining the model’s capacity to perceive and discriminate target information. To address the challenges of small object detection and diverse background interference, researchers have advanced feature extraction through two primary strategies: modifying convolution structures to enhance geometric adaptability and multi-scale representation, and integrating attention mechanisms to dynamically emphasize salient features and suppress noise.

Convolutional Deformation Mechanisms

To improve robustness against geometric variations and background clutter, several deformable convolution variants have been explored [22,23]. By introducing learnable offset parameters to dynamically adjust sampling locations, deformable convolution networks enable adaptive receptive field modeling for non-rigid and arbitrarily oriented objects. Shin et al. [24] incorporated a deformable convolution into YOLOv5, which demonstrated improved small object detection performance on aerial datasets. Xu et al. [25] integrated multi-scale deformable convolutions with an adaptive fusion attention mechanism for efficient detection and background suppression on drone platforms. Likewise, Peng et al. [26] applied deformable convolutions in a neck network to adaptively adjust receptive fields and suppress background interference.

At the same time, large-kernel convolution has gained popularity because of its ability to expand receptive fields without significantly increasing computational costs. Wang et al. [27] designed a lightweight network using large-kernel and depthwise convolutions combined with large receptive field attention and SIoU loss to enhance feature localization. Shi et al. [28] introduced a large spatial kernel selective attention fusion module, while Wang et al. [29] embedded large-kernel attention into the Res-VAN module, achieving efficient large-field modeling with reduced computational overhead.

Dilated convolution [30] offers another promising means of expanding receptive fields while preserving spatial resolution. By adjusting dilation rates, it captures long-range contextual information without adding parameters. A representative application of dilated convolution is the atrous spatial pyramid pooling module [31]. In drone sensing, where scale variations and background complexity are prominent, dilated convolutions are often integrated across backbone, neck, and head networks. For instance, Qiu et al. [32] employed large dilated kernels in the backbone, multi-scale dilated convolutions in the neck, and a multi-scale detection head, which significantly boosted feature expressiveness and localization accuracy for small objects.

Attention Mechanisms

To enhance the network’s focus on critical regions, attention modules have been incorporated into YOLO architectures. These modules adaptively weight feature maps, amplifying responses to target areas while suppressing background noise. Jiang et al. [33] introduced the convolutional block attention module (CBAM) [34] into YOLOv5s, enabling dual-dimensional feature recalibration to enhance sensitivity to salient regions. Subsequent works have further validated CBAM’s effectiveness in dense and overlapping scenarios [35]. Meanwhile, Wang et al. [36] combined a ghost bottleneck with coordinate attention to improve YOLOv5’s focus on key regions, while Li et al. [37] embedded multi-head channel and spatial trans-attention modules into YOLOv8. Collectively, these studies demonstrate that attention mechanisms effectively guide feature learning toward semantically meaningful regions, substantially improving detection performance in low-contrast, occluded, and densely packed aerial images.

2.2.2. Feature Fusion Networks

Feature pyramid networks (FPN) [38] and path aggregation networks (PANet) [39] have become cornerstones of multi-scale object detection. Enhancing the efficiency and fidelity of cross-scale feature propagation has thus emerged as a crucial research direction. Ghiasi et al. [40] leveraged a neural architecture search to discover optimal cross-scale connection patterns, while Gong et al. [41] introduced a learnable fusion factor to modulate the flow of deep semantic information into shallow layers, enhancing the discrimination of small objects. Shi et al. [42] designed a high-frequency spatially aware FPN, incorporating spatial dependency modules to preserve local structural details. Similarly, Meng et al. [43] introduced an adaptive FPN [44] to enhance non-adjacent layer interactions, addressing limited cross-layer information flow in conventional FPNs.

2.3. Transformer-Based Methods for UAV Object Detection

To enhance the representation of small objects in aerial imagery, several studies have integrated transformer modules into existing detection frameworks. For example, Wang et al. [8] proposed PETNet, a YOLO-based prior-enhanced transformer network for aerial image detection, where transformer structures were introduced to strengthen contextual modeling and improve the perception of small targets. Similarly, Xu et al. [25] developed MSDC-DETR by combining multi-scale deformable convolutions with adaptive fusion attention, demonstrating that transformer-style detection pipelines can also achieve competitive performance in UAV object detection tasks.

Recent studies have also explored hybrid architectures that integrate convolutional operations with transformer-based attention mechanisms. Liao et al. [45] fused transformer-based attention with hybrid convolutional blocks, thereby enhancing feature representation under sparse and occluded conditions. Meanwhile, He et al. [46] proposed a dual-path transformer-based feature pyramid network, which enables progressive feature fusion from coarse to fine scales and demonstrates superior performance in small object detection and boundary localization.

2.4. Summary of Limitations in Existing Research

Although tremendous progress has been made in UAV-based object detection, current methods still face three critical challenges:

Inadequate modeling of geometric variations: Standard convolutions and generic attention mechanisms lack explicit orientation alignment and joint direction–scale collaborative modeling. Even with the adoption of deformable convolutions [24,25,26,27,28,29,47] or dimension-independent attention modules [33,35,36,37], existing models still struggle to effectively capture arbitrary orientations, weak textures, and dense layouts.
Inefficient cross-scale feature fusion: Commonly used feature pyramids such as PANet [39] and adaptive FPN [44] rely primarily on feedforward fusion mechanisms. This reliance restricts the deep-shallow feature interaction necessary to mitigate detail attenuation and maintain feature consistency across scales in dense, small-object detection [48,49].
Trade-off between detection accuracy and computational efficiency: Superior models often achieve performance gains through module stacking, leading to excessive parameters and prohibitive computational costs. Conversely, lightweight designs typically sacrifice detection accuracy in complex aerial scenes. There remains a lack of unified architectures that inherently balance representational capacities with computational efficiency, particularly for resource-constrained aerial perception systems.

3. Methodology

To address the aforementioned issues, we propose AD-YOLO, a unified detection method designed to adaptively perceive object orientation, perform bi-directional fine-grained feature fusion, and maintain strong discriminative capability within a parameter-efficient architecture. We first give an overview of AD-YOLO in Section 3.1. We then present the technical details of the proposed AG, DPCFPN, and HDRepLK modules in Section 3.2, Section 3.3, and Section 3.4, respectively.

3.1. Overview

As Figure 1 illustrates, built on the classical architecture of YOLOv8 [15], AD-YOLO comprises three components: the backbone network, the neck network, and the detection head.

The backbone network aims to extract multi-level features from input images. To mitigate feature loss and missed detections caused by extremely small objects with arbitrary orientations, the AG module is introduced to enhance the representation of multi-directional and multi-scale targets.

The neck network serves as the cornerstone for multi-scale feature fusion. We propose the DPCFPN, which establishes a dual-path collaborative mechanism between deep semantic information and shallow high-resolution features. This design not only preserves the contextual information embedded in high-level features but also propagates spatial details essential for small target detection throughout the network. On this basis, we further design an efficient feature fusion unit, HDRepLK. HDRepLK uses a densely connected structure that combines large-kernel features and dilated convolutions for efficient multi-scale feature fusion, together with a reparameterization strategy that removes redundant branches during inference. It thereby controls parameter counts and computational costs without sacrificing representational capacity.

The detection head retains the standard decoupled architecture of YOLOv8. Specifically, AD-YOLO adopts an anchor-free mechanism to simultaneously and efficiently predict bounding box coordinates and category probabilities, avoiding the cumbersome hyperparameter tuning associated with traditional anchor-based detection approaches.

3.2. Adaptive Guidance Module

As Figure 2 shows, the AG module consists of two core components: ARCUnit and GDA-MK.

3.2.1. Adaptive Rotational Convolution Unit

As an adaptive rotation convolution unit integrating learnable rotational offsets with an orientation-aware mechanism, ARCUnit consists of two successive CBS modules (comprising 3 × 3 convolution, batch normalization, and sigmoid linear unit (SiLU) activation function) that sandwich the core ARC module, together with residual connections to facilitate gradient propagation and feature reuse. Unlike standard convolutions that perform sampling on a fixed regular grid, ARC [50] dynamically adjusts convolutional sampling locations via rotation offsets learned to fit local geometric structures. This design enables effective modeling of spatial transformations such as object rotation and deformation, thus enhancing the model’s robustness toward objects with diverse orientations and irregular shapes.

Specifically, given feature $x_{i,j} \in \mathbb{R}^{C \times H \times W}$ at spatial location $(i, j)$ in the input feature map, an auxiliary network $f_{aux}(\cdot)$ first predicts $K$ candidate rotation angles $\Delta\theta_{k}$ and their corresponding normalized weights $w_{k}$:

$$f_{aux}(x_{i,j}) = \{\Delta\theta_{k}, w_{k}\}_{k=1}^{K}, \quad \sum_{k=1}^{K} w_{k} = 1, \quad w_{k} \geq 0 \tag{1}$$

where $f_{aux}(\cdot)$ consists of a $3 \times 3$ depthwise separable convolution followed by a rectified linear unit (ReLU) activation function, with minimal parameter overheads and only a small additional computational cost. In our implementation, we set $K = 4$. Each $\Delta\theta_{k} \in [-\pi/2, \pi/2]$ represents the rotational offset angle for the $k$th candidate direction, and $w_{k}$ denotes the importance weight of that direction. For example, when the target is a car parked diagonally, the network assigns a higher $w_{k}$ to the direction aligned with the car’s major axis, thereby enhancing the feature response along that orientation. Conversely, directions misaligned with object edges receive lower weights, suppressing noise and irrelevant responses.

ARC then uses a standard $3 \times 3$ convolution kernel with a set of relative sampling coordinates $P_{n} = (p_{n}^{x}, p_{n}^{y})$, $n = 1, \dots, 9$, where $P_{n} \in \{(-1,-1), (-1,0), \dots, (1,1)\}$ represents integer offsets relative to the kernel center. Each sampling point $P_{n}$ is rotated by angle $\Delta\theta_{k}$ around the kernel center, resulting in floating-point coordinates $\tilde{P}_{n,k} = (\tilde{p}_{n,k}^{x}, \tilde{p}_{n,k}^{y})$ in continuous space as:

$$\tilde{P}_{n,k} = R(\Delta\theta_{k}) \cdot P_{n} \tag{2}$$

where the rotation matrix is defined as $R(\Delta\theta_{k}) = \begin{bmatrix} \cos\Delta\theta_{k} & -\sin\Delta\theta_{k} \\ \sin\Delta\theta_{k} & \cos\Delta\theta_{k} \end{bmatrix}$.
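The rotation in Equation (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation; the function name `rotated_offsets` and the row-by-row ordering of the nine offsets are assumptions made for the example.

```python
import numpy as np

def rotated_offsets(delta_theta: float) -> np.ndarray:
    """Rotate the nine integer offsets of a 3x3 kernel about its center.

    Returns an array of shape (9, 2) holding the rotated (p_x, p_y) pairs.
    The ordering of the offsets is an assumption of this sketch.
    """
    # Integer offsets P_n in {-1, 0, 1}^2, listed row by row.
    grid = np.array([(px, py) for px in (-1, 0, 1) for py in (-1, 0, 1)],
                    dtype=float)
    c, s = np.cos(delta_theta), np.sin(delta_theta)
    rotation = np.array([[c, -s], [s, c]])  # R(delta_theta) as in Equation (2)
    # Each row of `grid` is one P_n, so R @ P_n for every n is grid @ R.T.
    return grid @ rotation.T

# Example: a 45-degree rotation moves the corner offsets onto the axes.
offsets = rotated_offsets(np.pi / 4)
```

The kernel center stays at the origin for every angle, and each offset keeps its distance from the center, so only the sampling directions change.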

After rotation, the actual sampling location in the input feature map at position $(i, j)$ becomes $(i + \tilde{p}_{n,k}^{x}, j + \tilde{p}_{n,k}^{y})$.

Since this location typically has non-integer coordinates, the model retrieves the feature response via bilinear interpolation. Moreover, as the rotated sampling locations may fall outside the valid spatial range of the input feature map, boundary handling is required prior to interpolation. In our implementation, out-of-range coordinates are clipped to the nearest valid boundary position and then used for bilinear interpolation. This strategy ensures stable feature extraction near image borders and enhances the reproducibility of the ARC operation. Given a sampling point $(x, y) = (i + \tilde{p}_{n,k}^{x}, j + \tilde{p}_{n,k}^{y})$, the feature values at its four nearest integer-coordinate neighbors are:

$$V_{11} = x(\lfloor x \rfloor, \lfloor y \rfloor), \quad V_{12} = x(\lfloor x \rfloor, \lceil y \rceil), \quad V_{21} = x(\lceil x \rceil, \lfloor y \rfloor), \quad V_{22} = x(\lceil x \rceil, \lceil y \rceil) \tag{3}$$

where $x(\cdot, \cdot)$ on the right-hand side denotes the input feature map evaluated at integer coordinates. The interpolated value is then computed as:

$$v = (1 - \Delta x)(1 - \Delta y) V_{11} + (1 - \Delta x) \Delta y V_{12} + \Delta x (1 - \Delta y) V_{21} + \Delta x \Delta y V_{22} \tag{4}$$

where $\Delta x = x - \lfloor x \rfloor$ and $\Delta y = y - \lfloor y \rfloor$ are the fractional offsets. This interpolation operation is differentiable, enabling end-to-end training through backpropagation.
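The clipping rule and Equations (3) and (4) can be illustrated on a single-channel map as follows. This is a sketch under stated assumptions: the function name `bilinear_sample` is hypothetical, the first array index is taken as the $x$ direction, and the upper neighbor is computed as floor plus one (capped at the border), which matches the ceiling whenever the fractional offset is non-zero.

```python
import numpy as np

def bilinear_sample(feature: np.ndarray, x: float, y: float) -> float:
    """Sample a 2-D map at a fractional (x, y), clipping to the border first."""
    h, w = feature.shape
    # Clip out-of-range coordinates to the nearest valid boundary position.
    x = float(np.clip(x, 0, h - 1))
    y = float(np.clip(y, 0, w - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    # Upper neighbours; min() keeps them inside the map at the far border.
    # When x or y is an integer the matching weight below is zero anyway.
    x1, y1 = min(x0 + 1, h - 1), min(y0 + 1, w - 1)
    dx, dy = x - x0, y - y0  # fractional offsets of Equation (4)
    v11, v12 = feature[x0, y0], feature[x0, y1]
    v21, v22 = feature[x1, y0], feature[x1, y1]
    return float((1 - dx) * (1 - dy) * v11 + (1 - dx) * dy * v12
                 + dx * (1 - dy) * v21 + dx * dy * v22)

feature = np.arange(16, dtype=float).reshape(4, 4)  # feature[i, j] = 4*i + j
center = bilinear_sample(feature, 1.5, 1.5)         # average of 5, 6, 9, 10 -> 7.5
corner = bilinear_sample(feature, -3.0, 10.0)       # clipped to feature[0, 3]
```

Every step is a product or sum of the inputs, so the value is differentiable with respect to both the feature values and the fractional coordinates.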

Finally, the output feature at location $(i, j)$, $y_{i,j}$, is generated as:

$$y_{i,j} = \sum_{k=1}^{K} w_{k} \cdot \sum_{n=1}^{N} W_{n} \cdot v \tag{5}$$

where $W_{n}$ represents the shared convolutional weights, with the same set of filter parameters used across all rotation directions, $N = 9$ is the number of sampling points of the $3 \times 3$ kernel, and $v$ is the interpolated value of Equation (4) at the sampling point indexed by $(n, k)$. This design ensures model efficiency and preserves the translation-equivariant property of standard convolutions.
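Putting Equations (2) to (5) together for one output location gives the following single-channel sketch. It is an illustration under simplifying assumptions, not the ARC module itself: the angles and weights are passed in directly instead of being predicted by the auxiliary network, and the names `bilinear` and `arc_response` are hypothetical.

```python
import numpy as np

def bilinear(feature, x, y):
    """Vectorised bilinear lookup with border clipping (Equations (3)-(4))."""
    h, w = feature.shape
    x = np.clip(x, 0, h - 1)
    y = np.clip(y, 0, w - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, h - 1)
    y1 = np.minimum(y0 + 1, w - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feature[x0, y0]
            + (1 - dx) * dy * feature[x0, y1]
            + dx * (1 - dy) * feature[x1, y0]
            + dx * dy * feature[x1, y1])

def arc_response(feature, i, j, kernel, angles, weights):
    """Output y_{i,j} of Equation (5) for a single-channel map.

    kernel  : (3, 3) shared weights W_n
    angles  : K candidate rotations delta_theta_k (radians)
    weights : K non-negative direction weights w_k that sum to one
    """
    grid = np.array([(px, py) for px in (-1, 0, 1) for py in (-1, 0, 1)],
                    dtype=float)
    flat_kernel = np.asarray(kernel, dtype=float).reshape(-1)  # same order as grid
    out = 0.0
    for theta, w_k in zip(angles, weights):
        c, s = np.cos(theta), np.sin(theta)
        rotated = grid @ np.array([[c, -s], [s, c]]).T       # Equation (2)
        samples = bilinear(feature, i + rotated[:, 0], j + rotated[:, 1])
        out += w_k * float(flat_kernel @ samples)            # inner sum over n
    return out
```

With a single direction at angle zero and weight one, the result reduces to an ordinary $3 \times 3$ convolution response at $(i, j)$, which is a convenient sanity check for the shared-weight design.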

The ARCUnit allows the model to automatically learn the dominant directional distribution of targets under varying scenarios during training and to focus on the most discriminative sampling orientations, thereby significantly enhancing its feature representation capability for rotated small targets under oblique UAV viewpoints.

3.2.2. Group Directional Attention Mechanism with Mixed Kernels

Although ARCUnit adaptively adjusts sampling orientations based on local geometric structures and strengthens feature representation for arbitrarily rotated targets, in UAV vision tasks, traffic targets typically exhibit scale variations (e.g., coexistence of small motorcycles and large trucks). Relying solely on the orientation adaptation seems insufficient to fully capture rich multi-scale contextual cues. To tackle this limitation, we introduce the GDA-MK module after ARCUnit to boost the model’s cross-scale contextual perception capability.

As Figure 2 illustrates, GDA-MK comprises three cascaded substructures: the directional awareness branch [51], the mixed-kernel local enhancement branch, and the attention-based interaction fusion module.

Let us first divide the input feature $F \in \mathbb{R}^{C \times H \times W}$ into $G$ groups along the channel dimension. In the directional awareness branch, each sub-channel group $F_{g} \in \mathbb{R}^{\frac{C}{G} \times H \times W}$ undergoes average pooling along the height and width dimensions, generating vertical and horizontal context vectors $S_{h}^{g} \in \mathbb{R}^{\frac{C}{G} \times 1 \times W}$ and $S_{w}^{g} \in \mathbb{R}^{\frac{C}{G} \times H \times 1}$, respectively. They are subsequently processed to generate directional-aware features $A_{g}$ as follows:

$$A_{HW}^{g} = \mathrm{Convolution}_{1 \times 1}(\mathrm{Concatenation}(S_{h}^{g}, \mathrm{Permute}(S_{w}^{g}))) \tag{6}$$

$$S_{H}^{g}, S_{W}^{g} = \mathrm{Split}(A_{HW}^{g}, [H, W]), \quad S_{H}^{g} \in \mathbb{R}^{\frac{C}{G} \times H \times 1}, \quad S_{W}^{g} \in \mathbb{R}^{\frac{C}{G} \times 1 \times W} \tag{7}$$

$$A_{g} = GN(F_{g} \times \sigma(S_{H}^{g}) \times \sigma(S_{W}^{g})) \tag{8}$$

where $A_{HW}^{g}$ refers to the directional attention feature of group $g$, $S_{H}^{g}$ and $S_{W}^{g}$ denote attention weights along height and width, respectively, $\sigma(\cdot)$ is the sigmoid function, and $GN$ denotes group normalization.
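A shape-level sketch of Equations (6) to (8) for one channel group is given below. Several details are assumptions made for illustration only: the $1 \times 1$ convolution is modelled as a channel-mixing matrix `conv_weight`, the split follows the concatenation order (first the $W$ entries, then the $H$ entries), and group normalization uses a single group without learnable scale or shift.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def directional_branch(f_g, conv_weight, eps=1e-5):
    """Directional awareness branch for one group f_g of shape (c, H, W)."""
    c, h, w = f_g.shape
    s_h = f_g.mean(axis=1, keepdims=True)            # (c, 1, W): pooled over height
    s_w = f_g.mean(axis=2, keepdims=True)            # (c, H, 1): pooled over width
    # Permute s_w to (c, 1, H) and concatenate along the last axis: (c, 1, W + H).
    stacked = np.concatenate([s_h, s_w.transpose(0, 2, 1)], axis=2)
    # A 1x1 convolution only mixes channels, so it is a matrix product here.
    a_hw = np.einsum("oc,cxl->oxl", conv_weight, stacked)
    s_width = a_hw[:, :, :w]                         # (c, 1, W)
    s_height = a_hw[:, :, w:].transpose(0, 2, 1)     # (c, H, 1)
    gated = f_g * sigmoid(s_height) * sigmoid(s_width)   # broadcast gating
    # Single-group normalisation without affine parameters (an assumption).
    return (gated - gated.mean()) / np.sqrt(gated.var() + eps)

rng = np.random.default_rng(0)
group = rng.normal(size=(4, 5, 6))
a_g = directional_branch(group, np.eye(4))           # identity channel mixing
```

The two gates broadcast along complementary axes, so every spatial position is rescaled by the product of a row weight and a column weight before normalization.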

In the mixed-kernel local enhancement branch, a parallel dual-path structure is harnessed for multi-granularity feature extraction. One path uses a standard $3 \times 3$ depthwise convolution ($L_{s}$) to capture fine-grained local details, while the other employs a $7 \times 7$ large-kernel depthwise separable convolution ($L_{l}$) to expand the receptive field and extract wide-field contexts. The mixed-kernel local enhanced features of group $g$, $L_{g}$, can then be computed as:

$$L_{g} = CBG(\mathrm{Concatenation}(L_{s}(F_{g}), L_{l}(F_{g}))) \tag{9}$$

where $CBG$ denotes the computational module composed of a $1 \times 1$ standard convolution, a batch normalization layer, and a Gaussian error linear unit (GeLU).
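Equation (9) can be sketched with a small NumPy depthwise convolution. The sketch simplifies the branch in ways that should be read as assumptions: both paths use plain depthwise kernels (the pointwise step of the separable $7 \times 7$ path is folded into the final $1 \times 1$ mixing), batch normalization is omitted, and the tanh approximation of GELU is used. The names `depthwise_conv` and `mixed_kernel_branch` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv(x, kernels):
    """Zero-padded depthwise convolution; x: (C, H, W), kernels: (C, k, k), k odd."""
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    # windows[c, h, w] is the k x k patch of channel c centred at (h, w).
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))
    return np.einsum("chwij,cij->chw", windows, kernels)

def gelu(z):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def mixed_kernel_branch(f_g, k3, k7, pointwise):
    """L_g for one group: 3x3 and 7x7 paths, concatenation, 1x1 mixing, GELU."""
    local = depthwise_conv(f_g, k3)                   # fine-grained details
    wide = depthwise_conv(f_g, k7)                    # wider receptive field
    stacked = np.concatenate([local, wide], axis=0)   # (2C, H, W)
    mixed = np.einsum("oc,chw->ohw", pointwise, stacked)  # 1x1 conv back to C
    return gelu(mixed)                                # batch norm omitted here

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))
l_g = mixed_kernel_branch(x, rng.normal(size=(3, 3, 3)),
                          rng.normal(size=(3, 7, 7)), rng.normal(size=(3, 6)))
```

Because each depthwise kernel touches one channel only, the $7 \times 7$ path adds 49 weights per channel rather than 49 per channel pair, which is what keeps the wide path inexpensive.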

To enable cross-branch information synergy, we propose the attention-based interaction fusion module. First, global average pooling is applied to $A_{g} \in \mathbb{R}^{C \times H \times W}$ and $L_{g} \in \mathbb{R}^{C \times H \times W}$, yielding global descriptor vectors $v_{A_{g}} \in \mathbb{R}^{C \times 1}$ and $v_{L_{g}} \in \mathbb{R}^{C \times 1}$, respectively. Then, cross-branch correlation is computed by projecting the global descriptor of one branch onto the spatial features of the other. This projection measures their semantic responses across all spatial locations and generates cross-branch attention responses $S_{A_{g} \to L_{g}}$ and $S_{L_{g} \to A_{g}}$, as follows:

$$S_{A_{g} \to L_{g}} = v_{A_{g}}^{\top} \, \mathrm{reshape}(L_{g}, [C, HW]) \in \mathbb{R}^{1 \times HW} \tag{10}$$

$$S_{L_{g} \to A_{g}} = v_{L_{g}}^{\top} \, \mathrm{reshape}(A_{g}, [C, HW]) \in \mathbb{R}^{1 \times HW} \tag{11}$$

where $\top$ refers to transposition and $\mathrm{reshape}$ denotes the tensor reshaping operation. This configuration enables each branch to perceive the global semantic emphasis of its counterpart and generate a spatial response map for cross-branch feature modulation, thereby facilitating effective inter-branch feature interaction and refinement.

Subsequently, $S_{A_{g} \to L_{g}}$ and $S_{L_{g} \to A_{g}}$ are reshaped and passed through the sigmoid function to produce spatial attention maps $\hat{S}_{A_{g} \to L_{g}}$ and $\hat{S}_{L_{g} \to A_{g}}$, respectively, as follows:

$$\hat{S}_{A_{g} \to L_{g}} = \sigma(\mathrm{reshape}(S_{A_{g} \to L_{g}}, [1, H, W])) \in \mathbb{R}^{1 \times H \times W} \tag{12}$$

$$\hat{S}_{L_{g} \to A_{g}} = \sigma(\mathrm{reshape}(S_{L_{g} \to A_{g}}, [1, H, W])) \in \mathbb{R}^{1 \times H \times W} \tag{13}$$

The modulated features $\tilde{A}_{g}$ and $\tilde{L}_{g}$ can then be calculated as:

$$\tilde{A}_{g} = \hat{S}_{L_{g} \to A_{g}} \odot A_{g} \in \mathbb{R}^{C \times H \times W} \tag{14}$$

$$\tilde{L}_{g} = \hat{S}_{A_{g} \to L_{g}} \odot L_{g} \in \mathbb{R}^{C \times H \times W} \tag{15}$$

where $\odot$ is the Hadamard product. $\tilde{A}_{g}$ and $\tilde{L}_{g}$ are fused with $F_{g}$ to achieve adaptive interaction and integration between multi-scale context and directional-aware features:

$$Output_{F_{g}} = \tilde{L}_{g} + \tilde{A}_{g} + F_{g} \tag{16}$$

where $Output_{F_{g}}$ denotes the output of the $g$th subgroup.
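The interaction fusion of Equations (10) to (16) maps directly onto a few array operations. The sketch below is illustrative; the function name `interaction_fusion` is hypothetical and the inputs are assumed to share one shape `(c, H, W)`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interaction_fusion(f_g, a_g, l_g):
    """Cross-branch modulation and residual fusion for one channel group."""
    c, h, w = a_g.shape
    # Global average pooling gives one descriptor per branch, shape (c,).
    v_a = a_g.mean(axis=(1, 2))
    v_l = l_g.mean(axis=(1, 2))
    # Project each descriptor onto the other branch's spatial features
    # (Equations (10) and (11)); each result has H*W entries.
    s_a_to_l = v_a @ l_g.reshape(c, h * w)
    s_l_to_a = v_l @ a_g.reshape(c, h * w)
    # Reshape to (1, H, W) and squash to (0, 1) (Equations (12) and (13)).
    att_a_to_l = sigmoid(s_a_to_l.reshape(1, h, w))
    att_l_to_a = sigmoid(s_l_to_a.reshape(1, h, w))
    # Hadamard modulation, broadcast over channels (Equations (14) and (15)).
    a_mod = att_l_to_a * a_g
    l_mod = att_a_to_l * l_g
    return l_mod + a_mod + f_g                        # Equation (16)

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 5, 6))
l = rng.normal(size=(4, 5, 6))
out = interaction_fusion(f, np.zeros_like(f), l)      # equals 0.5 * l + f
```

When one branch is all zeros its descriptor is zero, the sigmoid returns 0.5 everywhere, and the other branch is simply halved; this edge case shows that the attention map is driven by the counterpart branch rather than by the branch being modulated.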

Finally, the output across all subgroups is concatenated as:

$$Output_{F} = \mathrm{Concatenation}(Output_{F_{0}}, Output_{F_{1}}, \dots, Output_{F_{G-1}}) \tag{17}$$
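The grouping step and the final concatenation of Equation (17) amount to a split along the channel axis followed by the inverse operation. In the sketch below, `apply_per_group` and `group_fn` are hypothetical names, and `group_fn` stands in for the per-group processing described above.

```python
import numpy as np

def apply_per_group(feature, num_groups, group_fn):
    """Split (C, H, W) into G channel groups, process each, and concatenate."""
    # np.split raises ValueError unless C is divisible by num_groups.
    groups = np.split(feature, num_groups, axis=0)
    outputs = [group_fn(g) for g in groups]
    return np.concatenate(outputs, axis=0)            # Equation (17)

x = np.arange(24.0).reshape(4, 2, 3)
same = apply_per_group(x, 2, lambda g: g)             # identity per group
```

Since each group is handled independently, any per-group operation that preserves the channel count leaves the overall tensor shape unchanged.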

The GDA-MK, together with the ARCUnit, constructs a direction-scale collaborative enhancement architecture: ARCUnit is primarily responsible for orientation sensitivity calibration, while GDA-MK focuses on multi-scale contextual modeling. This complementary design may strengthen the model’s representation capacity for traffic targets with diverse scales and arbitrary orientations in complex UAV aerial scenes, providing more discriminative feature embeddings for downstream detection tasks.

3.3. Dual-Path Collaborative Feature Pyramid Network

As a prevailing feature fusion architecture, PANet [39] significantly enhances the representation of FPN [38], allowing cross-level integration of multi-scale features through both top-down and bottom-up pathways. However, its configuration leaves several critical limitations. First, its unidirectional, sequential feature fusion, with insufficient cross-stage connectivity and multi-path interaction, may constrain information flow and limit feature complementarity. Second, due to repeated non-linear transformations and downsampling, fine-grained spatial details in shallow features progressively degrade, potentially impairing small object localization under scale variations. Third, the fragile gradient propagation caused by a single backward path can lead to unstable training and reduced generalization.

Built on PANet, we propose the DPCFPN, as shown in Figure 3. The DPCFPN establishes a dual-path collaborative architecture consisting of a main branch and an auxiliary branch [52], aiming to unify high-level semantic guidance with low-level spatial detail preservation. The auxiliary branch comprises two core components: the MDCAP and the HSPFP. The former integrates multi-scale contextual information to strengthen the guidance of high-level semantics on low-level features, while the latter leverages lightweight cross-stage connections to mitigate the degradation of spatial details during feature propagation. These two components work collaboratively to form a bidirectional closed-loop feature interaction mechanism, achieving a dynamic balance between semantic enhancement and spatial reconstruction.

3.3.1. Multi-Directional Context Aggregation Path

As an auxiliary branch introduced into the standard upsampling path, MDCAP injects early high-resolution feature maps preserved in the backbone network directly into the current fusion node, enabling cross-stage feature reuse and spatial detail compensation. Specifically, at the $n$th upsampling fusion stage, the network receives three heterogeneous inputs: $C_{n-1}^{high}$ (features from a high-resolution level of the backbone, which are rich in geometric priors such as edges and textures), $C_{n}^{mid}$ (same-scale backbone features with original semantics and structural information), and ${F'}_{n+1}^{low\uparrow}$ (high-level semantic features from a lower-resolution level reconstructed via upsampling, which possess strong class discriminability), which are fused as follows:

$$F'_{n} = \mathrm{Concatenation}\left(\mathrm{CBS}\left(C_{n-1}^{high}\right),\ C_{n}^{mid},\ U_{bilinear}\left({F'}_{n+1}^{low\uparrow}\right)\right) \tag{18}$$

where $F'_{n}$ denotes the output of the aggregation path, $\mathrm{CBS}$ refers to a module composed of a $3 \times 3$ standard convolution, batch normalization, and SiLU activation function, and $U_{bilinear}$ represents bilinear upsampling.

By aggregating contextual information from coarser-grained, original, and fine-grained scales during feature generation, MDCAP enhances multi-scale feature interaction and helps alleviate information attenuation and semantic discontinuities that may arise in single-path fusion.

3.3.2. Hierarchical Semantic Progressive Fusion Path

After enhancing shallow features via MDCAP, to further improve cross-level semantic coherence and spatial structural stability in the feature pyramid, we introduce another auxiliary path (HSPFP) into the downsampling path, enabling progressive refinement of enhanced shallow features to deep semantic representations. Specifically, at the $n$th downsampling fusion stage, the network receives four heterogeneous input sources: ${F'}_{n-1}^{high\downarrow}$ (output from the high-resolution level of the aggregation path, which is downsampled to provide a spatial detail prior), ${F''}_{n-1}^{mid\downarrow}$ (deep features from the previous HSPFP output, which are downsampled for alignment to convey high-level context), ${F'}_{n}^{mid}$ (current-level output of the aggregation path, which carries the joint representation after multi-source fusion), and ${F'}_{n+1}^{low\uparrow}$ (output from the low-resolution level of the aggregation path, which is upsampled to inject high-level semantic guidance), which are concatenated as follows:

$$F''_{n} = \mathrm{Concatenation}\left(\mathrm{CBS}\left({F'}_{n-1}^{high\downarrow}\right),\ \mathrm{CBS}\left({F''}_{n-1}^{mid\downarrow}\right),\ {F'}_{n}^{mid},\ U_{bilinear}\left({F'}_{n+1}^{low\uparrow}\right)\right) \tag{19}$$

where $F''_{n}$ denotes the output of the $n$th downsampling layer.

Combined with the main branch, the auxiliary branch constructs a dual-path collaborative structure: the MDCAP module enhances cross-scale contextual aggregation, while HSPFP strengthens the progressive interaction between high-level semantic information and low-level spatial details. The cooperation of these two components improves cross-scale feature consistency and contributes to a favorable balance between detection accuracy and model computational complexity.
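To make the alignment implied by Equations (18) and (19) concrete, the following minimal sketch tracks only tensor shapes through one MDCAP node and one HSPFP node. It is not the authors' implementation. The stride-2 setting of the CBS block, the factor-2 upsampling, the channel counts, and all function names are assumptions introduced for illustration; the text specifies only that CBS uses a 3 × 3 convolution and that upsampling is bilinear. The low-resolution argument is taken here as the feature map before upsampling.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    channels: int
    height: int
    width: int


def cbs_down(x: Shape) -> Shape:
    """CBS block with a 3x3 kernel; stride 2 and padding 1 are assumptions."""
    out_h = (x.height + 2 * 1 - 3) // 2 + 1
    out_w = (x.width + 2 * 1 - 3) // 2 + 1
    return Shape(x.channels, out_h, out_w)


def upsample2x(x: Shape) -> Shape:
    """Bilinear upsampling; the factor of 2 is an assumption. Channels unchanged."""
    return Shape(x.channels, x.height * 2, x.width * 2)


def concat(*inputs: Shape) -> Shape:
    """Channel-wise concatenation; all inputs must share one spatial size."""
    sizes = {(x.height, x.width) for x in inputs}
    if len(sizes) != 1:
        raise ValueError(f"spatial sizes differ: {sorted(sizes)}")
    height, width = sizes.pop()
    return Shape(sum(x.channels for x in inputs), height, width)


def mdcap_node(c_high: Shape, c_mid: Shape, f_low: Shape) -> Shape:
    """Shape of F'_n in Eq. (18)."""
    return concat(cbs_down(c_high), c_mid, upsample2x(f_low))


def hspfp_node(f1_high: Shape, f2_prev: Shape, f1_mid: Shape, f1_low: Shape) -> Shape:
    """Shape of F''_n in Eq. (19)."""
    return concat(cbs_down(f1_high), cbs_down(f2_prev), f1_mid, upsample2x(f1_low))


# Hypothetical three-level pyramid for a 640 x 640 input (strides 8, 16, 32).
p3, p4, p5 = Shape(64, 80, 80), Shape(128, 40, 40), Shape(256, 20, 20)
print(mdcap_node(p3, p4, p5))      # Shape(channels=448, height=40, width=40)
print(hspfp_node(p3, p3, p4, p5))  # Shape(channels=512, height=40, width=40)
```

Under these assumptions, every input to a fusion node is brought to the resolution of the current level before concatenation, and the channel count of the concatenated map is the sum of the input channels.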

3.4. Hierarchically Dense Reparameterized Large-Kernel

The core modules of the neck network in YOLOv8 typically reuse the features extracted from the backbone network directly, lacking dedicated feature extraction mechanisms tailored for multi-branch feature fusion. This can result in insufficient feature interaction and limit the network’s semantic representation capability and overall detection performance.

To efficiently fuse the multi-source features output by MDCAP and HSPFP, we propose HDRepLK. By integrating hierarchical dense connections and large-kernel convolution reparameterization, HDRepLK expands the receptive field and enhances contextual modeling capability while keeping the parameter count under control. As Figure 4 illustrates, HDRepLK is constructed by stacking multiple bottleneck blocks, each incorporating a multi-branch parallel architecture that includes a large-kernel convolution ($17 \times 17$) and five dilated convolutions with varying dilation rates. This design facilitates the capture of rich multi-scale spatial information and long-range dependencies, deepening and broadening the feature fusion process.
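The general idea behind large-kernel reparameterization can be illustrated with a small single-channel sketch: a dilated $k \times k$ kernel with dilation $d$ is equivalent to a dense kernel of size $d(k-1)+1$ with zeros between its taps, so parallel dilated branches can be folded into one large kernel after training. The code below is a sketch of that general technique, not of HDRepLK itself. The kernel sizes and dilation rates in the usage note are hypothetical (this section does not list the five rates), batch-normalization folding is omitted, and all function names are illustrative.

```python
def dilate_kernel(kernel, dilation):
    """Insert (dilation - 1) zeros between taps: k x k -> d*(k-1)+1 square."""
    k = len(kernel)
    size = dilation * (k - 1) + 1
    dense = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dense[i * dilation][j * dilation] = kernel[i][j]
    return dense


def pad_to(kernel, size):
    """Zero-pad a square kernel symmetrically to size x size."""
    k = len(kernel)
    if k > size or (size - k) % 2:
        raise ValueError("kernel cannot be centred in the target size")
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i + off][j + off] = kernel[i][j]
    return out


def merge_branches(large, small_branches):
    """Fold (kernel, dilation) branches into a copy of the large kernel."""
    size = len(large)
    merged = [row[:] for row in large]
    for kernel, dilation in small_branches:
        equivalent = pad_to(dilate_kernel(kernel, dilation), size)
        for i in range(size):
            for j in range(size):
                merged[i][j] += equivalent[i][j]
    return merged


def conv2d_same(image, kernel, dilation=1):
    """Single-channel cross-correlation with zero padding ('same' size)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    centre = (k // 2) * dilation
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i * dilation - centre, x + j * dilation - centre
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out
```

For example, a $3 \times 3$ branch with dilation 8 maps to a $17 \times 17$ dense kernel and can be added directly to a $17 \times 17$ kernel; a single convolution with the merged kernel then reproduces the sum of the branch outputs, so the extra branches add no cost at inference.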

4. Experiments

4.1. Datasets

Our method was evaluated on two publicly available benchmarks: VisDrone2019 [4] and UAVDT [9]. VisDrone2019 was collected by UAVs from various altitudes and viewpoints, covering diverse traffic scenarios. The dataset comprises 10,209 images divided into training (6471 images), validation (548 images), and test (3190 images) sets. Because annotations for the official test split are not publicly available, we report results on the validation set to ensure a like-for-like comparison, consistent with previous studies [7,53,54,55].

As a large-scale UAV dataset recorded in unconstrained urban environments, UAVDT includes 23,829 training and 16,580 test images. The objects in UAVDT were finely annotated into three categories (car, bus, and truck) and three scales (small, medium, and large), with the scales following the standard COCO detection metrics. More than 60% of objects had a size of less than 32 pixels, making UAVDT particularly challenging for detection algorithms.
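For reference, the COCO convention assigns scale by bounding-box area, with thresholds at $32^2$ and $96^2$ pixels. The sketch below encodes that convention; reading the "32 pixels" in the text as this area-based rule is an assumption, and the function name is illustrative.

```python
def coco_scale(width: float, height: float) -> str:
    """Assign a COCO scale bucket from box width and height in pixels.

    A box whose area is exactly 32**2 or 96**2 goes to the larger bucket here.
    """
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


print(coco_scale(20, 20))    # small  (area 400)
print(coco_scale(50, 40))    # medium (area 2000)
print(coco_scale(100, 100))  # large  (area 10000)
```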

4.2. Experimental Setup

To ensure full reproducibility of our experimental results, the hardware environment and training protocols are elaborated in Table 1 and Table 2, respectively.

For the VisDrone2019 dataset, training was terminated if model performance on the official validation set did not improve for 30 consecutive epochs. In contrast, for the UAVDT dataset, all models were trained for the full 300 epochs without early stopping. Rather than fine-tuning, we trained AD-YOLO from scratch with randomly initialized weights. This practice gives full control over the model’s architecture and training process and avoids biases that may be inherited from pretrained weights. To further improve the model’s generalization capability, we also applied data augmentation techniques such as cropping, rotation, flipping, Mosaic, mix-up, and color adjustments to the training set.
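The stopping rule used for VisDrone2019 can be sketched as a patience counter. The monitored quantity is not named in the text, so the validation score below is a hypothetical stand-in (for instance, validation mAP), and the class name is illustrative.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 30):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score: float) -> bool:
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with patience 3 instead of 30 so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3)
scores = [0.10, 0.20, 0.20, 0.19, 0.18, 0.30]
stopped_at = next(i for i, s in enumerate(scores) if stopper.step(s))
print(stopped_at)  # 4
```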

4.3. Evaluation Metrics

We adopted the precision, recall, and mean average precision at different intersection-over-union (IoU) thresholds (i.e., mAP⁵⁰, mAP⁷⁵, and mAP^50:95) as primary evaluation metrics. In addition to the detection accuracy, we also measured the computational efficiency using parameter counts, frames per second (FPS), and giga floating-point operations (GFLOPs). To demonstrate the model’s adaptability to different objects, we further reported the average precision (AP) at 0.50 IoU across object categories and scales.
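The quantities underlying these metrics can be sketched as follows. The code shows the IoU between two axis-aligned boxes, the ten IoU thresholds averaged in mAP^50:95 under the COCO convention (0.50 to 0.95 in steps of 0.05), and precision and recall from match counts. Function names and the `(x1, y1, x2, y2)` box layout are illustrative choices, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# IoU thresholds averaged in mAP^50:95 (COCO convention).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

pred, truth = (0, 0, 10, 10), (5, 0, 15, 10)
overlap = iou(pred, truth)                       # 50 / 150 = 0.333...
hits = [t for t in IOU_THRESHOLDS if overlap >= t]
print(round(overlap, 3), hits)                   # 0.333 []
```

In this toy case the prediction would count as a false positive at every threshold, including 0.50, because its IoU with the ground truth is only one third.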

4.4. Comparison with SOTA Baselines

To evaluate the performance of AD-YOLO, we conducted a comprehensive comparison against a range of SOTA methods on both the VisDrone2019 and UAVDT datasets.

4.4.1. Overall Performance

As Table 3 shows, the proposed AD-YOLO outperforms all the baselines on the VisDrone2019 validation set. Specifically, compared with YOLOv5-M [14], AD-YOLO achieves notable improvements of 10.2% and 7.4% in mAP⁵⁰ and mAP^50:95, respectively. Against YOLOv8-M [15], AD-YOLO also increases mAP⁵⁰ by 6.5% while reducing the parameter count by 41%. Although YOLOv10-S [16], Hyper-YOLO [54], EL-YOLO [53], EBC-YOLO [55], and MSUD-YOLO [56] have some advantages in computational cost, AD-YOLO exceeds them in mAP⁵⁰ by 6%, 5.8%, 2.5%, 1.1%, and 3.4%, respectively. More importantly, compared with recently released YOLO variants such as YOLOv11-M [17] and CF-YOLO [7], AD-YOLO not only achieves higher precision, recall, mAP⁵⁰, and mAP^50:95, but does so with substantially fewer parameters, indicating a favorable balance between detection accuracy and computational efficiency.

Given that the UAVDT dataset encompasses a range of challenging traffic scenes such as foggy conditions, nighttime imaging, and cluttered backgrounds, it is unsurprising that model performance degrades markedly. As Table 4 shows, AD-YOLO yields an mAP⁵⁰ of 35.4% and an mAP^50:95 of 23.0% on the UAVDT test set, again outperforming almost all the baselines, including YOLOv5 [14], YOLOv8 [15], YOLOv10 [16], YOLOv11 [17], MSDC-DETR [25], OSD-YOLO [57], MFF-KD [58], AD-Det [59], and ST-YOLO [60]. Although PETNet [8] achieves an mAP⁵⁰ 3.2% higher than that of AD-YOLO, this gain comes at the expense of 83.0 million parameters, more than five times that of AD-YOLO.

4.4.2. Performance Comparison on Objects with Varying Shapes and Scales

Table 5 presents the comparison results across ten object categories on the VisDrone2019 validation set. In general, AD-YOLO surpasses the YOLO baselines by approximately 0.5–16.9% in AP. Notably, AD-YOLO consistently outperforms YOLOv11-M, one of the best-performing YOLO variants, with AP margins of 1.3%, 1.7%, 2.5%, 0.5%, 0.6%, 1.3%, 0.8%, 1.4%, 1.6%, and 1.0% for pedestrians, persons, bicycles, cars, vans, trucks, tricycles, awning-tricycles, buses, and motorcycles, respectively. Similarly, on the UAVDT test set, AD-YOLO is superior to YOLOv11-M by AP margins of 1.3%, 1.4%, and 8.1% for cars, trucks, and buses, respectively, as shown in Table 6.

The versatility of AD-YOLO is also illustrated by its superiority in detecting multi-scale objects. As Table 7 presents, although AD-YOLO has fewer parameters than the strongest baseline, YOLOv11-M (15.3 versus 20.0 million), it still exceeds YOLOv11-M by margins of 0.6%, 0.9%, and 1.6% in AP for small, medium, and large objects on the VisDrone2019 validation set, respectively. Similar findings can also be observed on the UAVDT test set.

Together, these results demonstrate that our model is fairly robust in detecting densely distributed, arbitrarily oriented targets with varying shapes and scales.
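The parameter comparisons quoted in Sections 4.4.1 and 4.4.2 can be checked with simple arithmetic on the figures stated in the text (83.0 million parameters for PETNet, 20.0 million for YOLOv11-M, and 15.3 million for AD-YOLO). The helper name below is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


AD_YOLO_M, YOLOV11_M, PETNET_M = 15.3, 20.0, 83.0  # parameters, in millions

petnet_ratio = PETNET_M / AD_YOLO_M
v11_reduction = relative_reduction(YOLOV11_M, AD_YOLO_M)
print(f"{petnet_ratio:.2f}x")    # 5.42x
print(f"{v11_reduction:.1f}%")   # 23.5%
```

PETNet is thus about 5.4 times the size of AD-YOLO, and AD-YOLO uses about 23.5% fewer parameters than YOLOv11-M.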

4.5. Ablation Studies

We conducted a comprehensive ablation study on the VisDrone2019 dataset to dissect the role played by AD-YOLO’s core modules. All models were trained for a fixed number of 100 epochs to improve training efficiency.

4.5.1. Effectiveness of the Group Directional Attention Mechanism with Mixed Kernels

To evaluate the effectiveness of GDA-MK, we first replaced the original C2f module in YOLOv8-M with the AG without GDA-MK. On this basis, we subsequently embedded several mainstream attention structures, including the global-attention [61], self-attention [62], selective kernel network [63], and CBAM [34], for comparative experiments. The results are shown in Table 8.

Interestingly, the global-attention and self-attention tend to overemphasize global semantic information during the pooling process, resulting in insufficient modeling of critical local details such as object edges and textures. Consequently, the generated feature representations become overly smoothed, leading to some degradation in mAP⁵⁰ and mAP^50:95. Although the selective kernel network enhances the recognition of traffic targets to some extent through its multi-branch, multi-receptive-field structure, it entails a substantial computational burden, with the parameter count increasing by 50%. CBAM, a well-known lightweight module that combines channel and spatial attention, seems insensitive to large-scale objects, as its AP for large objects is unchanged from the baseline. With its architecture of three cascaded structures, the proposed GDA-MK enhances the model’s multi-scale discriminative capability while maintaining reasonable complexity, achieving superior perception of targets with varying scales.

4.5.2. Effectiveness of the Dual-Path Collaborative Feature Pyramid Network

To validate the effectiveness of DPCFPN, we first conducted a comparative experiment with the original YOLOv8-M built on PANet and two popular multi-scale feature fusion networks, i.e., adaptive FPN and adaptively spatial feature fusion. As Table 9 shows, the proposed DPCFPN not only outperforms the three baselines with substantially higher mAP⁵⁰, $AP_{s}$, $AP_{m}$, and $AP_{l}$, but also achieves a lightweight design with substantially fewer parameters. Compared with PANet, adaptive FPN, and adaptively spatial feature fusion, DPCFPN reduces the number of parameters by 36%, 7%, and 39%, respectively.

We then conducted an ablation experiment to examine the individual contributions of MDCAP and HSPFP. According to Table 9, the dual-branch design of DPCFPN yields the highest mAP⁵⁰, exceeding the MDCAP-only and HSPFP-only variants by margins of 1.2% and 0.6%, while also reducing the parameter count by 29% and 4%, respectively. These results suggest that the superiority of DPCFPN lies in the collaboration between its two core components: MDCAP strengthens the guidance of high-level semantics on low-level features, whilst HSPFP leverages efficient cross-stage connections to alleviate the degradation of spatial details during feature propagation. This complementary design appears to be key to enhancing multi-scale feature fusion with a lighter architecture.

4.5.3. Component-Wise Ablation Analysis

To thoroughly evaluate the contributions of the AG, DPCFPN, and HDRepLK modules, we conducted a series of component-wise ablation experiments on the VisDrone2019 dataset using YOLOv8-M as the baseline. The results are presented in Table 10.

First, the introduction of the AG module into the backbone network of YOLOv8-M increases mAP⁵⁰, $AP_{s}$, $AP_{m}$, and $AP_{l}$ by margins of 2.7%, 1.7%, 2.5%, and 1.7%, respectively. Such substantial performance gains likely stem from the direction–scale collaborative mechanism of the AG module: the ARCUnit refines feature representations of rotated targets by adaptively adjusting sampling orientations, while the GDA-MK captures multi-scale contextual information.

Similarly, replacing the original PANet with the proposed DPCFPN increases mAP⁵⁰, $AP_{s}$, $AP_{m}$, and $AP_{l}$ by 4.4%, 2.8%, 4.0%, and 1.9%, respectively. These findings are largely expected because, by integrating two auxiliary branches into the backbone of PANet [39], DPCFPN formulates a bidirectional closed-loop feature fusion architecture, enabling it to unify high-level semantic guidance with lower-level spatial detail preservation. Notably, via a parameter-efficient module stacking design, DPCFPN reduces the total number of parameters by 32%, maintaining model compactness without sacrificing detection accuracy.

The standard C2f module is limited by its local receptive field and rigid architecture design. To fully exploit the rich semantic and spatial details embedded in the DPCFPN outputs, we replaced the C2f module in the neck network with HDRepLK, which further increases mAP⁵⁰ from 43.4% to 44.1%. The gain is more pronounced for large-scale objects, whose AP rises from 43.9% to 46.9%. A drop of 4.3% in $AP_{l}$ can also be found when the HDRepLK module is removed from AD-YOLO. These results indicate that HDRepLK, by expanding the receptive field and enhancing long-range contextual modeling, can more effectively harness the multi-source features produced by DPCFPN, particularly for recognizing large-scale objects.

In summary, all three components contribute synergistically to the outperformance of AD-YOLO, spanning feature extraction enhancement, cross-scale fusion optimization, and network compactness design.

4.6. Visualization Analysis

To shed light on the decision-making process of AD-YOLO, we used gradient-weighted class activation mapping (Grad-CAM) [65] to highlight critical regions for traffic object detection. The results are illustrated in Figure 5.

As Figure 5 shows, under complex traffic scenarios with dense pedestrian crowds, multi-scale vehicles, and unconstrained background interference, the heatmaps generated by AD-YOLO reveal stronger spatial focus and semantic consistency. Even for small objects in the distance (e.g., vehicles driving far away or pedestrians on crosswalks), AD-YOLO still maintains continuous responses, focusing precisely on the main parts of objects. In contrast, the baseline YOLOv8-M tends to exhibit diffuse activation regions, blurred boundaries, and localization biases, reflecting its limitations in cross-scale feature fusion and detail preservation. That is, whether for large-scale objects in the foreground or tiny targets at the far end of the image, AD-YOLO correctly localizes and activates their core regions.
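For readers unfamiliar with Grad-CAM, the core computation weights each feature channel by the spatial mean of its gradient, sums the weighted channels, and keeps only the positive part. The sketch below implements that step on plain nested lists. How the scalar detection score is chosen and back-propagated for a YOLO-style detector is not described in this section, so the activations and gradients are simply taken as given, and the function name and peak normalization are illustrative choices.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from [K][H][W] activations and matching gradients.

    Channel weight = spatial mean of the gradient; map = ReLU of the weighted
    sum over channels, scaled to [0, 1] when the peak is positive.
    """
    channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][y][x] for k in range(channels)))
            for x in range(width)
        ]
        for y in range(height)
    ]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: channel 0 supports the score, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

The location activated only by the negatively weighted channel is suppressed by the ReLU, which is why Grad-CAM maps highlight regions that increase the score rather than all strongly activated regions.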

To further evaluate the model’s performance under complex real-world conditions, we compared the detection results between the baseline YOLOv8-M and the proposed AD-YOLO. Specifically, Figure 6 illustrates a busy crossroad scenario, where targets are typically characterized by dense distribution and mutual occlusion. Figure 7 focuses on the models’ detection capabilities for extremely small objects from a UAV perspective, highlighting key challenges such as obliquely parked motorcycles, tiny targets that visually blend into the background, and high appearance similarity between targets and non-target objects. Figure 8 presents an unstructured street scene with a highly cluttered background, which includes interferences such as tree occlusions, complex road textures, and various non-standard vehicles (e.g., tricycles and awning-tricycles).

In the scene of Figure 6, with densely packed and crowded targets, AD-YOLO achieves substantial performance gains over the baseline YOLOv8-M, with 9.2%, 13.7%, and 9.8% improvements in AP for pedestrians, bicycles, and cars, respectively. Similarly, for the extremely small targets in Figure 7, YOLOv8-M exhibits low-confidence predictions and severe missed detections. In contrast, AD-YOLO accurately localizes and recognizes tiny pedestrians, persons, and motorcycles, achieving APs that are 10.5%, 12.5%, and 14.2% higher than those of YOLOv8-M, respectively. The versatility of AD-YOLO is also illustrated by its superiority in detecting shape-varying and arbitrarily oriented targets under highly cluttered background conditions. As Figure 8 exemplifies, AD-YOLO yields an AP of 67.2%, 51.6%, 90.9%, 50.4%, 59.4%, and 52.7% for pedestrians, persons, cars, tricycles, motorcycles, and trucks, respectively, again outperforming YOLOv8-M by margins of 10.4%, 5.7%, 12.9%, 9.3%, 10.7%, and 15.9% in AP.

In summary, across these diverse and challenging scenarios, AD-YOLO consistently demonstrates superior detection performance and robustness compared to the baseline YOLOv8-M model. By leveraging its powerful multi-scale context awareness and adaptive feature extraction mechanisms, AD-YOLO effectively suppresses background noise and overcomes the challenges associated with tiny target scales and arbitrary orientations. Consequently, AD-YOLO not only achieves high-confidence localization with more precise bounding boxes for variable-scale targets but also excels at distinguishing highly overlapping instances of the same category, validating its exceptional adaptability and practical potential in complex aerial traffic monitoring scenarios.

4.7. Deployment Feasibility

Realistic UAV-based object detection is typically characterized by dynamic environmental interference, constrained hardware resources, and diverse task requirements—factors that directly determine the practical value of a detection model. In this section, we particularly discuss the merits of AD-YOLO for deployment in realistic UAV scenarios.

First, for small and dense target detection—a core challenge in high-altitude UAV imaging—AD-YOLO’s DPCFPN effectively fuses multi-scale features and enhances its ability to represent small targets through hierarchical feature interaction. Experimental results on VisDrone2019, which contains a large number of small-scale targets, show that AD-YOLO outperforms baseline models (YOLOv5-M, YOLOv8-M, YOLOv10-M, and YOLOv11-M) for small target detection, with a 0.6–4.9% improvement in AP.

Second, AD-YOLO demonstrates robustness to arbitrary target orientations and dynamic backgrounds in realistic UAV scenarios. The AG module’s cross-branch correlation mechanism enables the model to adapt to the arbitrary orientations of targets (e.g., vehicles moving in different directions and pedestrians with random postures) by dynamically adjusting feature modulation strategies. Meanwhile, the data augmentation pipeline adopted during training further enhances the model’s adaptability to dynamic environmental changes.

Third, AD-YOLO adapts to varying UAV flight parameters in realistic scenarios, such as changes in flight altitude, speed, and viewing angle. When the UAV adjusts its altitude (from 50 m to 150 m), leading to variations in target scale (from small to medium), AD-YOLO’s HDRepLK module expands the receptive field without increasing computational costs, ensuring consistent detection performance across different scales. This adaptability is critical for real-world applications, where UAVs often need to adjust flight parameters to cover large areas or focus on specific targets.

Last, hardware compatibility and easy integration further enhance deployment feasibility. Built on the YOLO framework, which is widely supported by mainstream deep learning toolkits (e.g., TensorRT and OpenVINO), AD-YOLO can be readily deployed on on-board hardware, without extensive modification of the UAV’s existing control system.

5. Conclusions

In the present study, we proposed AD-YOLO, a unified detection method tailored for densely distributed, scale-varying, and arbitrarily oriented traffic objects in UAV images. Built on the mainstream architecture of a single-stage object detector, AD-YOLO differs from existing YOLO variants in three main aspects. First, the ARCUnit and GDA-MK modules were introduced to replace the original C2f module in the backbone network, enhancing the model’s robustness to object rotations and scale variations. Second, by embedding two auxiliary branches (MDCAP and HSPFP) into PANet, the proposed DPCFPN boosted the model’s detection accuracy while reducing model parameters by approximately 36%. Third, the HDRepLK module was designed to expand the effective receptive field and capture the long-range dependencies of multi-source fused features. Extensive experiments demonstrate that AD-YOLO achieves superior multi-scale detection capability and a favorable trade-off between detection accuracy and computational efficiency, with 15.3 million parameters, an mAP⁵⁰ of 45.4% and 35.4%, and an mAP^50:95 of 27.8% and 23.0% on the VisDrone2019 and UAVDT datasets, respectively. Future studies can further enhance model performance under diverse, dynamic, and unconstrained conditions by leveraging lightweight designs and incorporating additional modalities such as optical flow signals and point cloud data.

Author Contributions

Conceptualization, Y.D. and P.X.; Methodology, Y.D.; Software, Y.D.; Validation, Y.H. and Y.Y.; Formal analysis, P.X.; Investigation, Y.D. and P.X.; Data curation, Y.D.; Writing—original draft, Y.D. and P.X.; Writing—review & editing, Y.H. and Y.Y.; Visualization, Y.D. and P.X.; Supervision, Y.H. and P.X.; Project administration, P.X.; Funding acquisition, P.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 52302433) and the Natural Science Foundation of Guangdong Province (grant number 2024A1515011578). The APC was funded by the National Natural Science Foundation of China (grant number 52302433).

Institutional Review Board Statement

This study did not involve animal experiments and was conducted according to relevant ethical standards.

Data Availability Statement

VisDrone2019 can be accessed at https://github.com/VisDrone/VisDrone-Dataset (accessed on 28 April 2026), and UAVDT can be accessed at https://github.com/dataset-ninja/uavdt (accessed on 28 April 2026). The model code for AD-YOLO will be released upon acceptance of the manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.

References

Zhang, R.; Wang, B.; Zhang, J.; Bian, Z.; Feng, C.; Ozbay, K. When language and vision meet road safety: Leveraging multimodal large language models for video-based traffic accident analysis. Accid. Anal. Prev. 2025, 219, 108077.
Muhammad, K.; Hussain, T.; Ullah, H.; Ser, J.D.; Rezaei, M.; Kumar, N.; Hijji, M.; Bellavista, P.; de Albuquerque, V.H.C. Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22694–22715.
Outay, F.; Mengash, H.A.; Adnan, M. Applications of unmanned aerial vehicle (UAV) in road safety, traffic and highway infrastructure management: Recent advances and challenges. Transp. Res. Part A Policy Pract. 2020, 141, 116–129.
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
Vijayakumar, A.; Vairavasundaram, S. YOLO-based object detection models: A review and its applications. Multimed. Tools Appl. 2024, 83, 83535–83574.
Wang, C.; Han, Y.; Yang, C.; Wu, M.; Chen, Z.; Yun, L.; Jin, X. CF-YOLO for small target detection in drone imagery based on YOLOv11 algorithm. Sci. Rep. 2025, 15, 16741.
Wang, T.; Ma, Z.; Yang, T.; Zou, S. PETNet: A YOLO-based prior enhanced transformer network for aerial image detection. Neurocomputing 2023, 547, 126384.
Yu, H.; Li, G.; Zhang, W.; Huang, Q.; Du, D.; Tian, Q.; Nicu, S. The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159.
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Jocher, G. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 30 January 2026).
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/topics/yolov8 (accessed on 30 January 2026).
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
Jocher, G.; Qiu, J. Ultralytics YOLOv11. 2024. Available online: https://github.com/topics/yolo11 (accessed on 30 January 2026).
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
Mhalla, A.; Chateau, T.; Gazzah, S.; Amara, N.E.B. An embedded computer-vision system for multi-object detection in traffic surveillance. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4006–4018.
Chen, C.; Liu, B.; Wan, S.; Qiao, P.; Pei, Q. An edge traffic flow detection scheme based on deep learning in an intelligent transportation system. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1840–1852.
Charouh, Z.; Ezzouhri, A.; Ghogho, M.; Guennoun, Z. A resource-efficient CNN-based method for moving vehicle detection. Sensors 2022, 22, 1193.
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More deformable, better results. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9300–9308.
Shin, Y.; Shin, H.; Ok, J.; Back, M.; Youn, J.; Kim, S. DCEF2-YOLO: Aerial detection YOLO with deformable convolution–efficient feature fusion for small target detection. Remote Sens. 2024, 16, 1071.
Xu, X.; Xing, Z.; Sun, M.; Zhang, P.; Yang, K. Enhancing UAV object detection through multi-scale deformable convolutions and adaptive fusion attention. J. Supercomput. 2025, 81, 1301.
Peng, J.; Lv, K.; Wang, G.; Xiao, W.; Ran, T.; Yuan, L. MLSA-YOLO: A multi-level feature fusion and scale-adaptive framework for small object detection. J. Supercomput. 2025, 81, 528.
Wang, W.; Li, S.; Shao, J.; Jumahong, H. LKC-Net: Large kernel convolution object detection network. Sci. Rep. 2023, 13, 9535.
Shi, C.; Zheng, X.; Zhao, Z.; Zhang, K.; Su, Z.; Lu, Q. LSKF-YOLO: Large selective kernel feature fusion network for power tower detection in high-resolution satellite remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
Wang, Z.; Li, Y.; Liu, Y.; Meng, F. Improved object detection via large kernel attention. Expert Syst. Appl. 2024, 240, 122507.
Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
Qiu, Y.; Sha, F.; Niu, L. DKA-YOLO: Enhanced small object detection via dilation kernel aggregation convolution modules. IEEE Access 2024, 12, 187353–187366.
Jiang, T.; Li, C.; Yang, M.; Wang, Z. An improved YOLOv5s algorithm for object detection with an attention mechanism. Electronics 2022, 11, 2494.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Wang, J.; Wu, J.; Wu, J.; Wang, J.; Wang, J. YOLOv7 optimization model based on attention mechanism applied in dense scenes. Appl. Sci. 2023, 13, 9173.
Wang, S.; Liu, Y.; Wang, X.; Xu, J. An improved YOLO algorithm for UAV detection in formation flight. Signal Image Video Process. 2025, 19, 195.
Li, M.; Chen, Y.; Zhang, T.; Huang, W. TA-YOLO: A lightweight small object detection model based on multi-dimensional trans-attention module for remote sensing images. Complex Intell. Syst. 2024, 10, 5459–5473.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045.
Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective fusion factor in FPN for tiny object detection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1159–1167.
Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6896–6904.
Meng, X.; Yuan, F.; Zhang, D. Improved model MASW YOLO for small target detection in UAV images based on YOLOv8. Sci. Rep. 2025, 15, 25027.
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics; IEEE: Piscataway, NJ, USA, 2023; pp. 2184–2189.
Liao, D.; Zhang, J.; Tao, Y.; Jin, X. ATBHC-YOLO: Aggregate transformer and bidirectional hybrid convolution for small object detection. Complex Intell. Syst. 2025, 11, 38.
He, J.; Liu, B.; Chen, H. HDPNet: Hourglass vision transformer with dual-path feature pyramid for camouflaged object detection. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2025; pp. 8638–8647. [Google Scholar] [CrossRef]
Cao, D.; Chen, Z.; Gao, L. An improved object detection algorithm based on multi-scaled and deformable convolutional neural networks. Hum.-Centric Comput. Inf. Sci. 2020, 10, 14. [Google Scholar] [CrossRef]
Zhou, L.; Zhao, S.; Li, S.; Wang, Y.; Liu, Y.; Zuo, X. A lightweight object detection method based on fine-grained information extraction and exchange in UAV aerial images. Knowl.-Based Syst. 2025, 315, 113253. [Google Scholar] [CrossRef]
Wu, P.; Xu, Y.; Ma, Y.; Zhang, Y.; Xu, Y. LYA-YOLO: A lightweight and accurate YOLO model in drone aerial image scenes. Expert Syst. Appl. 2026, 321, 132166. [Google Scholar] [CrossRef]
Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive Rotated Convolution for Rotated Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6566–6577. [Google Scholar] [CrossRef]
Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for Accurate Object Detection. In Proceedings of the Pattern Recognition and Computer Vision, Urumqi, China, 18–20 October 2024; pp. 492–505. [Google Scholar] [CrossRef]
Xue, C.; Xia, Y.; Wu, M.; Chen, Z.; Cheng, F.; Yun, L. EL-YOLO: An efficient and lightweight low-altitude aerial objects detector for onboard applications. Expert Syst. Appl. 2024, 256, 124848. [Google Scholar] [CrossRef]
Feng, Y.; Huang, J.; Du, S.; Ying, S.; Yong, J.H.; Li, Y.; Ding, G.; Ji, R.; Gao, Y. Hyper-YOLO: When visual object detection meets hypergraph computation. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2388–2401. [Google Scholar] [CrossRef]
Luo, H.; Wang, Y.; Chen, Y.; Li, X.; Zhan, J.; Zuo, D. EBC-YOLO: A remote sensing target recognition model adapted for complex environments. Earth Sci. Inform. 2025, 18, 282. [Google Scholar] [CrossRef]
Zhao, X.; Zhang, H.; Zhang, W.; Ma, J.; Li, C.; Ding, Y.; Zhang, Z. MSUD-YOLO: A novel multiscale small object detection model for UAV aerial images. Drones 2025, 9, 429. [Google Scholar] [CrossRef]
Zhang, Y.; Chen, X.; Sun, S.; You, H.; Wang, Y.; Lin, J.; Wang, J. Vehicle detection in drone aerial views based on lightweight OSD-YOLOv10. Sci. Rep. 2025, 15, 25155. [Google Scholar] [CrossRef] [PubMed]
Li, M.; Liang, X.; Hu, Q.; Lin, Y.e.; Xia, C. Multi-scale feature fusion with knowledge distillation for object detection in aerial imagery. Eng. Appl. Artif. Intell. 2025, 158, 111518. [Google Scholar] [CrossRef]
Li, Z.; Lian, S.; Pan, D.; Wang, Y.; Liu, W. AD-Det: Boosting object detection in UAV images with Focused small objects and balanced tail classes. Remote Sens. 2025, 17, 1556. [Google Scholar] [CrossRef]
Yan, H.; Kong, X.; Wang, J.; Tomiyama, H. ST-YOLO: An enhanced detector of small objects in unmanned aerial vehicle imagery. Drones 2025, 9, 338. [Google Scholar] [CrossRef]
Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar] [CrossRef]
Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 510–519. [Google Scholar] [CrossRef]
Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar] [CrossRef]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]

Figure 1. The overall pipeline of AD-YOLO, which adopts the classical YOLOv8 architecture with a backbone network, a neck network, and a detection head. The original C2 with better gradient flow (C2F) modules in the backbone and neck are replaced by the proposed adaptive guidance (AG) module and the hierarchically dense reparameterized large-kernel (HDRepLK) module, respectively. The red and purple arrows denote the proposed multi-directional context aggregation path (MDCAP) and the hierarchical semantic progressive fusion path (HSPFP), respectively. CBS refers to convolution, batch normalization, and sigmoid linear unit. SPPF, UP, and Concat represent spatial pyramid pooling fast, upsampling, and concatenation, respectively.

Figure 2. The structure of AG, wherein ARCUnit enhances feature representations for rotated small objects, while GDA-MK strengthens contextual perception across scales. CBS refers to convolution, batch normalization, and sigmoid linear unit. CBG denotes convolution, batch normalization, and Gaussian error linear unit. LK and AGP represent large-kernel and adaptive global pooling, respectively. GN denotes group normalization.

Figure 3. The structure of DPCFPN. Two auxiliary branches, i.e., MDCAP and HSPFP, are embedded into PANet [39].

Figure 4. The structure of HDRepLK, which comprises a multi-branch parallel architecture with a large-kernel convolution and five dilated convolutions with varying dilation rates. CBS refers to convolution, batch normalization, and sigmoid linear unit. LKConv, DConv, and BN denote large-kernel convolution, dilated convolution, and batch normalization, respectively.
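
The caption above describes parallel large-kernel and dilated branches that are reparameterized. One common way to fold such branches into a single kernel at inference is to expand each dilated kernel into an equivalent sparse dense kernel and add it to the center of the large kernel. The sketch below illustrates that general idea on a single channel with one dilated branch and no batch normalization. All helper names (`dilate_kernel`, `merge_into`, `conv2d_same`) and the toy data are hypothetical, and this is not claimed to be the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]


def dilate_kernel(kernel: Matrix, rate: int) -> Matrix:
    """Insert (rate - 1) zeros between taps so a dilated kernel becomes dense."""
    k = len(kernel)
    size = (k - 1) * rate + 1
    dense = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            dense[i * rate][j * rate] = kernel[i][j]
    return dense


def merge_into(large: Matrix, small: Matrix) -> Matrix:
    """Add a smaller odd-sized kernel into the center of a larger one."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[i + offset][j + offset] += value
    return merged


def conv2d_same(image: Matrix, kernel: Matrix, rate: int = 1) -> Matrix:
    """Single-channel cross-correlation with zero padding (same output size)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    half = (k - 1) * rate // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i * rate - half, x + j * rate - half
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Toy data (assumed): a 7x7 large kernel and one 3x3 branch with dilation rate 2.
image = [[float((3 * r + 5 * c) % 7) for c in range(9)] for r in range(9)]
large_kernel = [[0.01 * (i + j) for j in range(7)] for i in range(7)]
small_kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]

# Training-time form: two parallel branches whose outputs are summed.
branch_sum = [
    [a + b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(conv2d_same(image, large_kernel),
                            conv2d_same(image, small_kernel, rate=2))
]

# Inference-time form: one convolution with the merged kernel.
merged_kernel = merge_into(large_kernel, dilate_kernel(small_kernel, rate=2))
single_pass = conv2d_same(image, merged_kernel)

max_diff = max(abs(a - b) for ra, rb in zip(branch_sum, single_pass)
               for a, b in zip(ra, rb))
print(f"max abs difference: {max_diff:.2e}")
```

Because convolution is linear, the two forms agree up to floating-point rounding. In a real network, each branch's batch normalization would first have to be folded into its kernel and bias before merging.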

Figure 5. Grad-CAM visualization of the model’s attention paid to detected objects.

Figure 6. Visual comparison of detection results in a busy crossroad scenario with densely distributed targets.

Figure 7. Visual comparison of detection results for extremely small objects.

Figure 8. Visual comparison of detection results in an unstructured street scene with a cluttered background.

Table 1. Hardware and software environment.

Component	Configuration
Central processing unit	Intel Core Ultra 7 (20-core)
Graphics processing unit	NVIDIA GeForce RTX 5090D (24 GB VRAM)
Memory	32 GB RAM
Software	Python 3.8.20, PyTorch 2.2.1, CUDA 12.1

Table 2. Detailed training hyper-parameter settings.

Hyper-Parameter	Value
Input resolution	640 × 640
Total epochs	300
Batch size	16
Optimizer	Stochastic gradient descent
Initial learning rate	0.01
Final learning rate	0.0001 (cosine decay)
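
Table 2 gives an initial learning rate of 0.01 that decays to 0.0001 along a cosine curve over 300 epochs. A minimal sketch of one common form of such a schedule is shown below. The function name `cosine_lr` and the per-epoch update are assumptions, and any warm-up phase the authors may have used is not modeled.

```python
import math

# Values taken from Table 2.
INITIAL_LR = 0.01
FINAL_LR = 0.0001
TOTAL_EPOCHS = 300


def cosine_lr(epoch: int,
              total_epochs: int = TOTAL_EPOCHS,
              lr_start: float = INITIAL_LR,
              lr_end: float = FINAL_LR) -> float:
    """Cosine-annealed learning rate for a given epoch (hypothetical helper).

    The rate equals lr_start at epoch 0 and reaches lr_end at epoch
    total_epochs, following half a cosine period in between.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_end + (lr_start - lr_end) * cosine


if __name__ == "__main__":
    for epoch in (0, 75, 150, 225, 300):
        print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

Under this form, the rate at the midpoint (epoch 150) is the average of the two endpoints, 0.00505.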

Table 3. Comparison of model overall performance on the VisDrone2019 validation set. P, R, and Para denote precision, recall, and parameter count (unit: million), respectively. Note that YOLOv5, YOLOv8, YOLOv10, YOLOv11, and AD-YOLO were trained from scratch under a unified experimental configuration, where model results were averaged over multiple runs using different random seeds. The results of other baseline models were directly derived from their original papers, where a fine-tuning strategy might be employed. “—” indicates that the corresponding results were neither reported in the original papers nor reproducible, due to the unavailability of the source code.

Model	P	R	mAP⁵⁰	mAP⁷⁵	mAP⁵⁰:⁹⁵	Para	FPS	GFLOPs
YOLOv5-S [14]	41.6	32.3	31.4	18.9	17.3	7.1	170	16.0
YOLOv5-M [14]	46.7	35.2	35.2	22.5	20.2	20.9	133	23.8
YOLOv8-S [15]	49.5	38.7	36.1	23.9	23.4	11.1	250	28.8
YOLOv8-M [15]	53.3	41.2	38.9	26.3	25.6	25.8	130	23.2
YOLOv10-S [16]	51.0	38.1	39.3	24.7	23.8	8.1	345	24.8
YOLOv10-M [16]	53.9	42.1	42.3	26.5	25.8	16.5	174	15.3
YOLOv11-S [17]	51.8	38.1	39.0	22.8	23.4	9.4	167	9.41
YOLOv11-M [17]	54.0	43.1	43.9	27.1	26.9	20.0	103	20.0
Hyper-YOLO [54]	50.6	38.8	39.6	—	23.8	11.2	—	39.0
EL-YOLO [53]	48.8	40.8	42.9	—	24.8	6.7	—	1.1
EBC-YOLO [55]	55.3	42.0	44.3	—	26.7	10.2	—	35.5
MSUD-YOLO [56]	53.0	42.0	43.4	—	25.6	6.8	134	—
CF-YOLO [7]	52.8	43.4	44.9	—	27.5	23.9	377	23.9
AD-YOLO (Ours)	56.9	43.5	45.4	27.8	27.8	15.3	192	14.1

Table 4. Comparison of model overall performance on the UAVDT test set. P, R, and Para denote precision, recall, and parameter count (unit: million), respectively. Note that YOLOv5, YOLOv8, YOLOv10, YOLOv11, and AD-YOLO were trained from scratch under a unified experimental configuration, where model results were averaged over multiple runs using different random seeds. The results of other baseline models were directly derived from their original papers, where a fine-tuning strategy might be employed. “—” indicates that the corresponding results were neither reported in the original papers nor reproducible, due to the unavailability of the source code.

Model	P	R	mAP⁵⁰	mAP⁷⁵	mAP⁵⁰:⁹⁵	Para	FPS	GFLOPs
YOLOv5-S [14]	36.3	30.3	28.6	17.3	16.9	7.1	204	16.0
YOLOv5-M [14]	40.4	35.6	31.4	20.8	19.3	20.9	165	23.8
YOLOv8-S [15]	36.2	30.2	29.4	18.4	17.4	11.1	285	28.8
YOLOv8-M [15]	35.8	34.2	30.3	18.9	17.8	25.8	160	23.2
YOLOv10-S [16]	35.4	27.4	26.3	15.9	15.0	8.1	370	24.8
YOLOv10-M [16]	39.7	29.8	29.4	17.3	16.9	16.5	217	15.3
YOLOv11-S [17]	44.8	33.9	31.2	17.7	17.4	9.4	190	9.4
YOLOv11-M [17]	39.6	30.1	31.8	21.8	19.9	20.0	155	20.3
MSDC-DETR [25]	—	—	30.6	—	18.6	19.6	—	59.1
OSD-YOLO [57]	42.3	33.1	31.5	—	17.8	1.6	—	7.9
MFF-KD [58]	—	—	33.9	23.5	21.3	10.5	—	—
AD-Det [59]	—	—	34.2	21.9	20.1	64.1	—	107.2
ST-YOLO [60]	—	—	33.4	—	—	9.00	—	20.1
PETNet [8]	—	—	38.6	22.3	21.5	83.0	—	63.9
AD-YOLO (Ours)	45.2	34.4	35.4	25.8	23.0	15.3	273	14.1

Table 5. Comparison of AP across ten object categories on the VisDrone2019 validation set.

Category	YOLOv5-M [14]	YOLOv8-M [15]	YOLOv10-M [16]	YOLOv11-M [17]	AD-YOLO
Pedestrian	42.9	45.4	46.7	47.8	49.1
Person	32.3	35.1	35.9	36.2	37.9
Bicycle	13.5	15.3	17.1	17.6	20.1
Car	74.9	78.3	81.5	81.8	82.3
Van	38.6	43.7	49.2	49.7	50.3
Truck	31.1	37.4	41.5	43.0	44.3
Tricycle	20.3	23.7	33.7	34.1	34.9
Awning-tricycle	10.9	14.4	17.9	18.2	19.6
Bus	47.3	52.1	61.9	62.6	64.2
Motorcycle	40.2	43.6	49.3	50.2	51.2
mAP⁵⁰	35.2	38.9	42.3	43.9	45.4

Table 6. Comparison of AP across three object categories on the UAVDT test set.

Model	Car	Truck	Bus	mAP⁵⁰
YOLOv5-M [14]	66.6	4.8	22.8	31.4
YOLOv8-M [15]	67.7	3.5	19.7	30.3
YOLOv10-M [16]	68.4	3.0	23.1	31.5
YOLOv11-M [17]	69.7	3.5	22.2	31.8
AD-YOLO (Ours)	71.0	4.9	30.3	35.4
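
The mAP⁵⁰ column of Table 6 is consistent with the unweighted mean of the three per-class AP values in each row. The short check below reproduces that arithmetic from the tabulated numbers. The dictionary layout and the helper `mean_ap` are illustrative assumptions, not part of the authors' evaluation code.

```python
# Per-class AP on the UAVDT test set, copied from Table 6.
per_class_ap = {
    "YOLOv5-M": {"car": 66.6, "truck": 4.8, "bus": 22.8},
    "YOLOv8-M": {"car": 67.7, "truck": 3.5, "bus": 19.7},
    "YOLOv10-M": {"car": 68.4, "truck": 3.0, "bus": 23.1},
    "YOLOv11-M": {"car": 69.7, "truck": 3.5, "bus": 22.2},
    "AD-YOLO": {"car": 71.0, "truck": 4.9, "bus": 30.3},
}

# mAP50 values as reported in the last column of Table 6.
reported_map50 = {
    "YOLOv5-M": 31.4,
    "YOLOv8-M": 30.3,
    "YOLOv10-M": 31.5,
    "YOLOv11-M": 31.8,
    "AD-YOLO": 35.4,
}


def mean_ap(class_ap: dict) -> float:
    """Unweighted mean of per-class AP values (hypothetical helper)."""
    return sum(class_ap.values()) / len(class_ap)


for model, class_ap in per_class_ap.items():
    computed = round(mean_ap(class_ap), 1)
    print(f"{model}: computed {computed}, reported {reported_map50[model]}")
```

The mean is unweighted, so the low truck AP pulls every model's mAP⁵⁰ well below its car AP.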

Table 7. Comparison of AP on objects with varying scales. ${AP}_{s}$, ${AP}_{m}$, and ${AP}_{l}$ denote AP for objects with small, medium, and large scales, respectively.

	VisDrone2019			UAVDT
	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$
YOLOv5-M [14]	11.3	27.6	32.6	10.7	25.5	32.1
YOLOv8-M [15]	13.6	36.1	39.1	10.1	23.6	29.6
YOLOv10-M [16]	14.5	37.6	45.8	11.1	24.2	28.7
YOLOv11-M [17]	15.6	38.4	49.6	11.5	25.7	31.1
AD-YOLO (Ours)	16.2	39.3	51.2	12.1	27.9	32.5

Table 8. Performance comparison of different attention mechanisms on the VisDrone2019 validation set. ${AP}_{s}$, ${AP}_{m}$, and ${AP}_{l}$ denote AP for objects with small, medium, and large scales, respectively. Para represents parameter count (unit: million).

Method	mAP⁵⁰	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$	Para	GFLOPs
YOLOv8-M [15]	38.9	12.6	33.4	42.0	25.8	23.2
ARCUnit (baseline)	40.5	13.4	34.2	42.6	22.8	20.3
+Global-attention [61]	39.5	12.9	33.5	42.2	23.6	21.1
+Self-attention [62]	40.3	13.2	33.9	42.4	22.8	20.5
+Selective kernel network [63]	41.3	13.9	35.5	42.4	33.9	28.6
+CBAM [34]	41.5	14.0	35.8	42.6	22.8	20.4
+GDA-MK	41.6	14.3	35.9	43.6	22.8	20.4

Table 9. Performance comparison of different feature fusion strategies against the baseline. ${AP}_{s}$, ${AP}_{m}$, and ${AP}_{l}$ denote AP for objects with small, medium, and large scales, respectively. Para represents parameter count (unit: million).

Method	mAP⁵⁰	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$	Para	GFLOPs
YOLOv8-M (baseline) [15]	38.9	12.6	33.4	42.0	25.9	23.2
+Adaptive FPN [44]	39.5	13.1	33.9	42.5	17.9	16.5
+Adaptively spatial feature fusion [64]	41.3	14.1	35.7	43.1	27.1	24.8
+MDCAP	42.1	14.6	36.1	42.1	23.4	21.2
+HSPFP	42.7	14.6	37.7	43.7	17.3	15.8
+DPCFPN (MDCAP+HSPFP)	43.3	15.4	37.4	43.9	16.6	15.1

Table 10. Ablation experiment on the VisDrone2019 validation set. ${AP}_{s}$, ${AP}_{m}$, and ${AP}_{l}$ denote AP for objects with small, medium, and large scales, respectively. Para represents parameter count (unit: million).

Baseline	AG	DPCFPN	HDRepLK	mAP⁵⁰	${AP}_{s}$	${AP}_{m}$	${AP}_{l}$	Para	GFLOPs
✓				38.9	12.6	33.4	42.0	25.86	23.2
✓	✓			41.6	14.3	35.9	43.6	22.76	20.4
✓		✓		43.3	15.4	37.4	43.9	16.63	15.1
✓		✓	✓	44.1	15.9	38.3	46.9	18.35	16.8
✓	✓	✓		42.7	14.8	36.8	43.9	13.53	12.4
✓	✓	✓	✓	45.4	16.2	39.3	51.2	15.26	14.1
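
To put the first and last rows of Table 10 side by side, the snippet below computes the mAP⁵⁰ gain and the relative reductions in parameter count and GFLOPs between the baseline and the full model. The dictionary keys and the helper `relative_reduction` are illustrative assumptions; the numbers are copied from the table.

```python
# First row (baseline only) and last row (all modules) of Table 10.
baseline = {"map50": 38.9, "params_m": 25.86, "gflops": 23.2}
full_model = {"map50": 45.4, "params_m": 15.26, "gflops": 14.1}


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before` (hypothetical helper)."""
    return (before - after) / before * 100.0


map50_gain = full_model["map50"] - baseline["map50"]
params_cut = relative_reduction(baseline["params_m"], full_model["params_m"])
gflops_cut = relative_reduction(baseline["gflops"], full_model["gflops"])

print(f"mAP50 gain: {map50_gain:.1f} points")
print(f"parameter reduction: {params_cut:.1f}%")
print(f"GFLOPs reduction: {gflops_cut:.1f}%")
```

With the tabulated values this gives a gain of 6.5 mAP⁵⁰ points alongside roughly 41.0% fewer parameters and 39.2% fewer GFLOPs.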

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
