Article

LHA-YOLO: A Lightweight and High-Accuracy Detector via Parallel Attention and Divide-and-Conquer Fusion for UAV Images

1
School of Physics and Electronics, Shanxi Datong University, Datong 037009, China
2
College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou 350002, China
3
School of Artificial Intelligence, Xidian University, Xi’an 710071, China
*
Authors to whom correspondence should be addressed.
Sensors 2026, 26(10), 2970; https://doi.org/10.3390/s26102970
Submission received: 3 April 2026 / Revised: 2 May 2026 / Accepted: 5 May 2026 / Published: 8 May 2026
(This article belongs to the Section Remote Sensors)

Abstract

Small-object detection in unmanned aerial vehicle (UAV) images poses significant challenges due to limited pixel representation, complex backgrounds, and insufficient feature discriminability. While one-stage detectors like YOLO offer a favorable speed-accuracy trade-off, their performance on small objects is often hampered by conflicts between semantic and spatial information during multi-scale feature fusion in existing networks. To address this, we propose LHA-YOLO, a lightweight and high-accuracy network based on YOLO11. The network is built upon two core innovations. The first is the Lightweight Feature Extraction Module (LFEM), which employs a parallel spatial-channel attention mechanism to extract discriminative cross-dimensional features efficiently and with low computational cost. The second is the Divide-and-Conquer Propagation Path (DCPP) strategy. This strategy decouples and separately optimizes the handling of semantic and spatial information within its bidirectional propagation paths. To achieve this, the top-down path utilizes the Channel Attention-guided Semantic Aggregation (CASA) module to enhance semantic consistency. In parallel, the bottom-up path employs the Spatial Attention-guided Detail Aggregation (SADA) module to preserve spatial fidelity. Extensive evaluation on the VisDrone and UAVDT datasets shows that LHA-YOLO strikes a favorable balance between performance and efficiency. On VisDrone, it improves mAP50 from 39.4% to 41.6% and mAP50–95 from 23.5% to 24.9% over YOLOv11s. On UAVDT, it raises mAP50 from 32.2% to 36.9% and mAP50–95 from 19.4% to 22.9%, while reducing GFLOPs from 21.3 to 18.8. These results confirm the efficacy of our design for real-time UAV applications.

1. Introduction

The rapid development of unmanned aerial vehicle (UAV) technology has significantly expanded its applications in fields such as precision agriculture [1], traffic monitoring [2], and disaster response [3]. A key enabler of these UAV applications is real-time object detection, especially small-object detection. Traditional methods for small-object detection in UAV imagery typically rely on handcrafted features [4] and sliding window designs [5], which often suffer from limited generalization capability and insufficient real-time performance. For instance, multi-threshold binarization has been proposed as an effective feature extraction technique, reducing training samples without losing accuracy [6]. In recent years, convolutional neural networks (CNNs) have advanced the field owing to their powerful representational capacity and real-time inference ability. CNN-based methods for small-object detection are primarily categorized into two-stage and one-stage models. Although two-stage detectors [7,8,9] based on Faster R-CNN [10] and Cascade R-CNN [11] generally achieve higher accuracy in detecting small objects, their high computational cost and slow inference speed make them unsuitable for real-time applications. As a result, one-stage detectors [12,13,14] based on YOLO series [15,16] and SSD [17] remain the most widely adopted solutions in practice. These one-stage models are capable of performing detection tasks at high speeds, with YOLO series in particular demonstrating a favorable balance between speed and accuracy. However, detecting small objects in UAV images remains a challenging task due to limited pixel representation of objects, complex backgrounds, and a lack of discriminative features. 
In response, various attention mechanisms [18,19,20] have been developed to alleviate these issues, as they can adaptively select key information by leveraging the local and global information of the image to improve feature discriminability and assist in feature extraction.
In this paper, we propose a Lightweight Feature Extraction Module (LFEM) that leverages a parallel spatial-channel attention mechanism to effectively correlate cross-dimensional features, thereby capturing discriminative features of small objects. The module is constructed primarily using convolutional blocks, multiple Multi-Dimension Feature Representation (MDFR) blocks, and residual connections to form a residual structure. This design preserves feature map information while sequentially processing it through n MDFR groups, facilitating the integration of original features with multi-stage semantic information to provide rich and hierarchical representations. Each MDFR employs a strategy where the input features are split into two equal parts, one for spatial detail extraction and the other for channel-wise semantic processing. Through effective concatenation, features from both dimensions are comprehensively combined at each pixel location, enhancing the representation of small-object-related features and improving computational efficiency. The proposed LFEM effectively captures spatial distributions of objects while simultaneously modeling inter-channel variations, thereby increasing discriminability among object categories.
In order to further enhance the feature representation of small objects, classic object detection algorithms use PA-FPN [21] or Bi-FPN [22] structures to fuse different levels of features. The PA-FPN employs a top-down pathway within the FPN [23] framework to propagate high-level semantic information to lower-level features, thereby enriching their semantic representation. Simultaneously, the PAN module introduces a bottom-up pathway that transmits fine-grained localization and detailed features from lower to higher levels, enhancing the spatial details of high-level feature maps. This design enables effective integration of both semantic and spatial information across different scales, leading to significant improvements in multi-scale object detection performance. However, the classic PA-FPN or Bi-FPN does not fully differentiate the functional differences between top-down and bottom-up information flows when directly merging multi-scale features. This may lead to conflicts between semantic and spatial information, especially impairing the detection accuracy of small objects. Given that the top-down path is mainly used to transmit contextual semantic information, it is necessary to further strengthen the semantic consistency between multi-scale features in this path to fully leverage its capacity for semantic modeling. Similarly, the bottom-up pathway should emphasize the preservation and highlighting of spatial details, particularly localized features essential for precise positioning.
To resolve the conflict between semantic and spatial information, we propose a Divide-and-Conquer Propagation Path (DCPP) strategy that processes the two information streams separately without introducing additional computational overhead. Different from the conventional direct fusion in the classic Bi-FPN structure, our DCPP employs a progressive feature aggregation method and replaces the traditional general attention mechanism with task-specific attention modules in different aggregation processes. Specifically, within the top-down propagation pathway, the Channel Attention-guided Semantic Aggregation (CASA) module utilizes a channel attention mechanism to dynamically recalibrate channel-wise weights. This emphasizes semantically relevant features while suppressing responses from redundant or noisy channels. Subsequently, a progressive aggregation strategy fuses multi-scale features in a stepwise manner, which is essential for maintaining semantic consistency throughout the top-down flow and improving both the efficiency and coherence of contextual semantic propagation. As a result, the CASA module not only enhances semantic alignment across different feature scales but also boosts the discriminative power of the fused representations. Correspondingly, in the bottom-up pathway, the Spatial Attention-guided Detail Aggregation (SADA) module focuses on the spatial dimensions of feature maps using a spatial attention mechanism. This preserves and enhances detailed information, particularly fine-grained features critical for accurate localization. The same progressive aggregation strategy then fuses the refined features, preserving the integrity of spatial details as they propagate upward and thereby improving both the efficiency and fidelity of the bottom-up information flow. 
By adopting this divided yet complementary bidirectional propagation strategy, our model achieves notable improvements in detection accuracy, especially for small objects against complex backgrounds in UAV images.
With the recent introduction of the YOLO11 model [24], we adopt YOLO11 as our baseline to evaluate its performance in small-object detection scenarios. We integrate the proposed LFEM and DCPP modules into the Backbone and Neck of YOLO11, forming a lightweight and high-accuracy network named LHA-YOLO. This enhanced network improves the capture of discriminative features from small objects while suppressing background noise, thereby strengthening feature extraction and representation capabilities. As a result, LHA-YOLO achieves notable improvements in detecting small objects in UAV images. The main contributions of this work can be summarized as follows:
(1)
We propose LHA-YOLO, a YOLO11-based multi-scale feature fusion network that integrates a novel feature extraction module and an aggregation strategy to enhance the accuracy and real-time stability of small-object detection for UAV applications.
(2)
We design a Lightweight Feature Extraction Module (LFEM) that employs a parallel spatial-channel attention mechanism to effectively correlate cross-dimensional representations, thereby efficiently extracting discriminative features for small objects with enhanced accuracy and minimal computational overhead.
(3)
We present a Divide-and-Conquer Propagation Path (DCPP) strategy, which decouples the processing of semantic and spatial information into dedicated pathways to address their inherent conflict, thus achieving enhanced discrimination and localization for small objects without adding notable computational cost.
(4)
We extensively evaluate the proposed method on challenging VisDrone and UAVDT datasets, achieving competitive results in both accuracy and inference speed, demonstrating its practical value for real-time UAV applications.

2. Related Work

This section reviews and evaluates existing research results from three aspects: the core evolution of the YOLO series, the improvement of small-object detection based on the YOLO framework, and the integration of attention mechanisms into small-object detection methods.

2.1. Evolution of the YOLO Series

The YOLO (You Only Look Once) series has been a cornerstone of real-time object detection research, renowned for its exceptional balance between speed and accuracy. Its evolution represents a continuous refinement of architectural design and training methodologies.
YOLOv1 [25] reframed object detection as a single regression problem. It directly predicted bounding boxes and class probabilities from an entire image. This unified architecture enabled end-to-end training and achieved remarkable inference speeds. However, its spatial constraints, such as limited predictions per grid, caused poor performance on small and densely packed objects. Its localization accuracy was also comparatively coarse. YOLOv2 [26] introduced several key improvements, including anchor boxes, batch normalization, and a higher resolution classifier, boosting recall and precision. YOLOv3 [27] adopted a deeper and more powerful backbone network, Darknet-53, which utilized residual connections. Crucially, it incorporated a multi-scale prediction mechanism inspired by FPN [23], detecting objects at three different scales. This architecture fundamentally enhanced its capability to detect small objects and became the baseline for subsequent research.
YOLOv4 [28] integrated numerous effective techniques such as Mosaic data augmentation, the CSPDarknet53 backbone, the PANet neck, and the CIoU loss, achieving state-of-the-art performance without sacrificing speed. Following this, the development of YOLO became more community-driven. YOLOv5 [29] gained widespread industrial adoption due to its superior engineering and flexible PyTorch framework. Subsequently, YOLOX [30] abandoned the anchor-based paradigm by introducing an anchor-free mechanism and a decoupled head, simplifying the pipeline. Building on these advances, YOLOv6 [31] and YOLOv7 [15] further refined the model through innovations such as structural re-parameterization and trainable bag-of-freebies strategies. YOLOv8 [16] further integrated anchor-free design and new loss functions, forming one of the strongest versions currently available. YOLOv9 [32] introduced Programmable Gradient Information (PGI) and a Generalized Efficient Layer Aggregation Network (GELAN) to mitigate the information bottleneck problem in deep networks. YOLOv10 [33] advanced the field by eliminating the need for Non-Maximum Suppression (NMS) during inference, thereby reducing latency and enabling fully end-to-end object detection.
As the latest iteration in the Ultralytics YOLO series, YOLOv11 [24] demonstrated notable improvements in accuracy, speed, and computational efficiency for real-time detection tasks. Building upon advancements from previous versions, YOLOv11 incorporated optimizations to its architecture and training methodology, making it highly suitable for a wide range of computer vision applications. Experimental results indicated that YOLOv11 had better performance in small-object detection from UAV imagery. Therefore, we selected YOLOv11 as the baseline for the experiments presented in this paper.

2.2. YOLO-Based Improvements for Small-Object Detection

Although the multi-scale design of the YOLO series since YOLOv3 has improved small-object detection, significant room for improvement remains for detecting objects that are small, dense, or blurry. This section mainly summarizes the latest improvement research on small-object detection based on classic architectures such as YOLOv5, YOLOv8, and YOLOv11.
Building upon YOLOv5, SCM-YOLO [34] introduced innovative lightweight modules that enhance spatial local information and adaptively fuse multi-scale feature information. MFFSODNet [35] introduced an additional prediction head that focuses on tiny objects and replaces the large-object head. Its authors also designed MSFEM to capture fine-grained information from small objects and BDFPN to achieve efficient multi-scale feature fusion. BiFPN-YOLO [36] incorporated a BiFPN as a replacement for the conventional PANet, enhancing multi-scale feature fusion. The study also explored alternatives to the Swish function by evaluating its performance against many other activation functions. Wang et al. [37] designed a fine-grained semantic fusion module and a TOCM module. These modules enhance the distinction between background and objects, thereby extending the model’s detection boundaries for both.
Building upon YOLOv8, LGA-YOLO [38] employed an MLKM to broaden the receptive field and enhance local features. It also used a DGAM to capture global contextual information and emphasize vehicle features in complex backgrounds. Luo et al. [39] proposed a channel priority attention dynamic snake convolution module to capture fine-grained details, and incorporated an MPDIoU and a DAT to boost detection efficiency while maintaining computational efficiency. LSOD-YOLO [40] introduced a lightweight cross-layer output reconstruction module that strengthens the integration of shallow and deep features. The authors also adopted a lightweight Dysample that preserves fine image details while keeping computational overhead low. Quan et al. [41] used an attention mechanism to capture long-range dependencies between distant pixels, introduced Slideloss to strengthen the learning of challenging samples, and employed ShapeIoU to improve bounding box regression by incorporating shape and scale awareness.
Building upon YOLOv11, PS-YOLO [42] incorporated an efficient FasterBiFFPN neck network. This network replaced the original PAFPN and enabled more effective multi-scale feature fusion. PS-YOLO also introduced an NWDLoss, which uses shared convolutions to learn common features across objects of different scales. Li et al. [43] incorporated an efficient channel attention mechanism to enhance feature discriminability, and modified the loss function to increase the model’s sensitivity to uncertain object regions. PC-YOLOs [44] restructured the hierarchical architecture of YOLO11. The authors incorporated a P2 layer for small-object detection and removed the P5 layer to reduce computational overhead and decrease model complexity. They also introduced a coordinate spatial attention mechanism that captures spatial and positional information critical for small objects.

2.3. Attention Mechanisms for Small-Object Detection

Attention mechanisms, which emulate the ability of the human visual system to focus on salient regions, offer a powerful approach to enhance relevant features and suppress redundant information. This capability is especially valuable in the context of small-object detection. For channel attention, the Squeeze-and-Excitation (SE) Network [45] is a foundational work that learns to recalibrate channel-wise feature responses by modeling interdependencies between channels. Integrating SE blocks or similar modules [46] into network architectures [47] enables the network to amplify discriminative features for small objects. For spatial attention, the Convolutional Block Attention Module (CBAM) [48] sequentially infers both channel and spatial attention maps. The spatial attention highlights the location of informative regions. After multi-scale fusion is applied in networks [49], the focus of the network can be guided to spatial positions that may contain small objects, thereby reducing the interference of cluttered backgrounds.
For self-attention and transformers, the core self-attention mechanism [50] of the transformer computes interactions between all position pairs in a feature map, enabling it to capture global contextual information directly. Transformer-based detectors [51] and their CNN hybrids [52] use global context to localize small objects, avoiding the long-range dependency loss of local convolutions. Recent works have further tailored transformer architectures specifically for UAV small-object detection. MSAE–DETR [53] introduces a multiscale adaptive enhancement detection transformer with dual-scale attention and frequency-domain modeling. HMF-DEIM [54] designs a lightweight hierarchical backbone with multi-domain feature blending and real-time inference. DRONet [55] builds on RT-DETR and introduces occlusion-aware modules for dense and occluded aerial scenes. MSA-DETR [56] integrates PercepConv and SODAttention modules to enhance multi-scale feature extraction and spatial attention.
Beyond the application of universal attention modules, some studies have designed specialized attention mechanisms tailored for object detection. For example, feature pyramid fusion attention [57] performs attention-guided feature selection and fusion across different hierarchical levels of a feature pyramid network. Context attention [58] is specifically designed to aggregate contextual information from regions surrounding potential objects. Meanwhile, multi-scale attention [59] enables collaborative reinforcement of feature responses from disparate scales, enhancing discriminability across varying object sizes. In summary, attention mechanisms offer an effective technical pathway for small-object detection by enabling feature recalibration and global relational modeling.

3. Proposed Method

3.1. Overview of YOLO11

Building upon the classic Backbone–Neck–Head architecture, YOLO11 (as shown in Figure 1) introduces significant innovations that achieve steady progress in real-time generic object detection. It achieves an optimal balance between accuracy, inference speed, and computational efficiency through novel network architecture designs and advanced training methods.
The backbone is designed for hierarchical multi-scale feature extraction from input images. Its architecture primarily comprises three key components: CBS blocks, C3K2 modules, and an SPPF module. The CBS block, composed of Convolution, Batch Normalization (BN), and SiLU, serves as a fundamental unit that performs feature transformation and downsampling. Its integrated BN layer and SiLU activation function ensure stable training and expressive feature maps. These features are subsequently processed by the C3K2 modules, which optimize information flow by splitting feature maps and applying grouped convolutions. These modules enhance feature representation capacity and computational efficiency. Finally, the SPPF (Spatial Pyramid Pooling Fast) module applies successive max-pooling operations, which is equivalent to parallel max-pooling with increasingly large kernel sizes, to aggregate rich multi-scale contextual information. This effectively improves the model’s ability to recognize objects across different scales without compromising its real-time inference speed.
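To illustrate how SPPF gathers multi-scale context cheaply, the one-dimensional sketch below checks that chaining a stride-1 max-pool of width 5 two and three times yields the same result as single pools of width 9 and 13. It is a simplified illustration in plain Python, not code from YOLO11: the helper name `max_pool_1d`, the window width of 5, and the sample signal are assumptions chosen for the example.

```python
def max_pool_1d(signal, window):
    """Stride-1 max-pool with 'same' output length; out-of-range positions are ignored."""
    half = window // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


# Arbitrary sample signal standing in for one row of a feature map.
signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

pool_once = max_pool_1d(signal, 5)
pool_twice = max_pool_1d(pool_once, 5)
pool_thrice = max_pool_1d(pool_twice, 5)

# Chained 5-wide pools cover the same neighbourhood as one 9- or 13-wide pool.
chained_matches_parallel = (
    pool_twice == max_pool_1d(signal, 9)
    and pool_thrice == max_pool_1d(signal, 13)
)
print(chained_matches_parallel)
```

Reusing the output of the previous pool means each step only scans a small window, which is why the chained form is the cheaper way to obtain several receptive-field sizes at once.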
The neck predominantly employs the classic path aggregation network-feature pyramid network (PAN-FPN) structure, which augments the conventional FPN by incorporating an additional bottom-up path. This pathway effectively reintegrates low-level spatial features with high-level semantic features, thereby enhancing the detection accuracy through multi-dimensional features. Notably, a C2PSA module is introduced prior to feature fusion to leverage attention mechanisms for guided integration. The C2PSA module utilizes multiple parallel attention mechanisms alongside feedforward networks, significantly improving global feature modeling capability. This design enables the network to better capture long-range dependencies and complex nonlinear interactions, ultimately strengthening feature representational power and increasing architectural flexibility across diverse application scenarios.
The detection head is responsible for generating the final predictions, which consist of bounding box coordinates, dimensions, and class probabilities. In the classification branch, depthwise convolution (DWConv) is employed in place of traditional convolution, reducing the number of parameters while preserving accuracy, thereby enhancing the computational efficiency of the model. The regression branch incorporates both standard convolution and deformable convolution to refine the localization performance and improve the accuracy of bounding box predictions. Overall, YOLO11 achieves a superior balance between detection performance and computational efficiency through its refined architectural design and optimized training pipeline.

3.2. Overview of the LHA-YOLO Architecture

Despite its state-of-the-art performance on generic object detection benchmarks, YOLO11 exhibits insufficient feature representation capabilities when applied to small-object detection in UAV images. To overcome these shortcomings, this paper proposes LHA-YOLO, a novel framework shown in Figure 2. The key research objectives include: designing a dedicated attention mechanism to effectively extract and enhance features of small objects; improving the feature pyramid network to achieve more efficient multi-scale feature fusion, facilitating the integration of deep semantic information with shallow spatial details; and incorporating strategies that enhance detection accuracy while maintaining high computational efficiency.
In the backbone network, our architecture retains the five feature extraction stages and the SPPF module from YOLO11. With the exception of the first stage, each feature extraction block primarily consists of a CBS block and a Lightweight Feature Extraction Module (LFEM). The LFEM is composed of a convolutional layer, n repeated Multi-Dimension Feature Representation (MDFR) blocks, and a set of residual connections. It is designed to effectively extract and enhance features of small objects while maintaining high computational efficiency.
In the neck network, we introduce a Divide-and-Conquer Propagation Path (DCPP) strategy to enhance the complementary advantages of both information streams without additional computational cost. Specifically, we integrate dedicated aggregation attention mechanisms into both propagation paths. In the top-down pathway, a Channel Attention-guided Semantic Aggregation (CASA) module strengthens semantic consistency across multi-scale features and enhances the discriminative power of the fused representations. In the bottom-up pathway, a Spatial Attention-guided Detail Aggregation (SADA) module progressively refines and aggregates features to emphasize spatial details, particularly the localized features critical for precise localization. This strategy facilitates the effective propagation of both contextual semantics and spatial details, thereby robustly strengthening multi-scale fusion.
As a result, the proposed LHA-YOLO model incorporates specialized modules to achieve a more comprehensive understanding of complex scenes and improve detection performance across diverse object scales.

3.3. Proposed LFEM

The Lightweight Feature Extraction Module (LFEM), as shown in Figure 3, is primarily composed of CBS blocks, multiple Multi-Dimension Feature Representation (MDFR) blocks, and residual connections. Within the LFEM architecture, input features are split into two pathways via convolutional operations: one branch is fed into n successive MDFR blocks, while the other is retained and later concatenated with the output of these MDFR blocks. As a result, the LFEM enhances multi-channel information integration, thereby improving discriminative feature extraction for small objects.
As shown in Figure 4, the MDFR block consists of two primary stages. The first-stage residual structure begins with a partial convolution (PConv) layer [60] employing a 3 × 3 kernel, which processes feature maps corresponding to one-fourth of the total channels. These processed channels are then concatenated with the remaining original channels to preserve consistency between the input and output channel dimensions. The combined features are subsequently passed through two pointwise convolution (PWConv) layers to produce the main branch features. Finally, the main branch features are added element-wise to the input feature map to generate the output. This efficient design facilitates comprehensive integration of channel information while maintaining low computational overhead. Specifically, PConv operates efficiently by convolving only a subset of input channels while leaving the remainder unchanged, producing feature maps that combine both original and transformed features. PWConv further compresses channel dimensionality to minimize parameters. Consequently, the LFEM achieves lower computational cost compared to traditional residual structures.
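The saving from convolving only one-fourth of the channels can be estimated with a quick multiply-accumulate (MAC) count. The sketch below is a back-of-the-envelope calculation rather than a measurement of LHA-YOLO: the feature-map size, the channel count, and the assumption that both pointwise layers keep the full channel width are illustrative choices that the text does not specify.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulate count of a dense convolution with 'same' output size."""
    return height * width * kernel * kernel * in_channels * out_channels


height, width, channels = 80, 80, 128   # illustrative feature-map size (assumption)
partial = channels // 4                 # PConv touches one-fourth of the channels

regular_macs = conv_macs(height, width, 3, channels, channels)
pconv_macs = conv_macs(height, width, 3, partial, partial)

# Assumption: both pointwise (1x1) layers keep the width at `channels`.
pwconv_macs = 2 * conv_macs(height, width, 1, channels, channels)
stage_macs = pconv_macs + pwconv_macs

print(f"3x3 conv        : {regular_macs:,} MACs")
print(f"PConv           : {pconv_macs:,} MACs ({regular_macs // pconv_macs}x fewer)")
print(f"PConv + 2 PWConv: {stage_macs / regular_macs:.1%} of the 3x3 conv")
```

Because the cost of a dense convolution scales with the product of input and output channels, restricting the 3 × 3 kernel to a quarter of the channels cuts that term by a factor of 16, and the pointwise layers then mix information across all channels at a much lower cost per pixel.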
In the second stage, a multi-attention coordination mechanism is employed. The output features from the first stage are split into two separate branches along the channel dimension, each receiving input features $F$ of size $H \times W \times \frac{C}{2}$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. Each branch uses multi-scale kernel sizes to divide the feature map into $k$ groups along the channel dimension, denoted as $F_1, F_2, \ldots, F_k$. The number of channels of each $F_i$ is $\frac{C}{2k}$. In the upper branch, each subgroup $F_i$ is processed by a spatial attention module to capture contextual information for small objects. The attention module learns importance weights that highlight relevant spatial regions. This process is defined as
$$F_s = F_i \times \sigma(\omega_s \cdot \mathrm{GMP}(F_i) + b_s).$$
$F_s$ is the spatial attention feature corresponding to subgroup feature $F_i$. $\sigma(\cdot)$ is the sigmoid activation function. GMP refers to the global maximum pooling operation of $F_i$ in the spatial dimension to extract spatial statistics, while $\omega_s$ and $b_s$ represent a convolutional weight matrix and bias term of size $H \times W \times 1$, which aid in encoding spatial representations. Similarly, the lower branch employs a channel attention mechanism to model inter-channel dependencies relevant to small objects. The operation is defined as
$$F_c = F_j \times \sigma(\omega_c \cdot \mathrm{GAP}(F_j) + b_c).$$
$F_c$ is the channel attention feature corresponding to subgroup feature $F_j$, which has $\frac{C}{2k}$ channels. GAP denotes the global average pooling operation of $F_j$ in the channel dimension, which produces a vector of size $1 \times 1 \times \frac{C}{2k}$ aggregating channel-wise statistics. The convolutional layer $\omega_c$ and bias $b_c$ have a kernel size of $1 \times 1$ with both input and output channels equal to $\frac{C}{2k}$. The sigmoid output is then broadcast element-wise to the spatial dimensions of $F_j$ for channel recalibration.
Finally, the two branches yield $k$ sets of spatial attention and channel attention feature maps, each with dimensions $H \times W \times \frac{C}{2k}$. These feature maps are then concatenated and undergo channel shuffling to ensure effective feature redistribution, followed by reconstruction into a new integrated feature representation. Each MDFR incorporates both channel and spatial transformations, enabling effective aggregation of small-object features while suppressing redundant background noise and interference.
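The split, gate, concatenate, and shuffle flow of this second stage can be sketched in a few lines. The version below is a deliberately simplified reading in plain Python, not the authors' implementation: the learnable convolutions are replaced by a scalar `weight` and `bias` (hypothetical placeholders), the spatial gate is assumed to take the maximum across the channels of a subgroup at each pixel, the multi-scale kernels are omitted, and features are stored as nested lists of shape C × H × W.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_gate(group, weight=1.0, bias=0.0):
    """Gate every pixel by sigmoid(weight * max-over-channels + bias)."""
    height, width = len(group[0]), len(group[0][0])
    gate = [[sigmoid(weight * max(ch[y][x] for ch in group) + bias)
             for x in range(width)] for y in range(height)]
    return [[[ch[y][x] * gate[y][x] for x in range(width)]
             for y in range(height)] for ch in group]


def channel_gate(group, weight=1.0, bias=0.0):
    """Gate every channel by sigmoid(weight * spatial-mean + bias)."""
    gated = []
    for ch in group:
        values = [v for row in ch for v in row]
        g = sigmoid(weight * sum(values) / len(values) + bias)
        gated.append([[v * g for v in row] for row in ch])
    return gated


def channel_shuffle(channels, groups):
    """Interleave channels so that neighbouring outputs come from different groups."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def mdfr_attention_stage(features, k=2):
    """features: list of C channels (each H x W); C must be divisible by 2 * k."""
    half = len(features) // 2
    size = half // k
    upper, lower = features[:half], features[half:]
    out = []
    for i in range(k):                      # upper branch: spatial attention
        out += spatial_gate(upper[i * size:(i + 1) * size])
    for j in range(k):                      # lower branch: channel attention
        out += channel_gate(lower[j * size:(j + 1) * size])
    return channel_shuffle(out, groups=2)   # mix the two branches


# Toy input: 8 channels of size 2 x 2.
features = [[[float(c + y + x) for x in range(2)] for y in range(2)] for c in range(8)]
result = mdfr_attention_stage(features, k=2)
order = channel_shuffle(list(range(8)), groups=2)
print(order)
```

The shuffle matters because concatenation alone would leave all spatially gated channels in the first half and all channel-gated ones in the second; interleaving them lets any following grouped or partial convolution see both kinds of feature.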

3.4. Proposed DCPP Strategy

As illustrated in Figure 5, the Divide-and-Conquer Propagation Path (DCPP) strategy processes the top-down and bottom-up information streams separately using dedicated attention-guided progressive aggregation. The top-down pathway emphasizes semantic consistency, while the bottom-up pathway preserves spatial details. This divided yet complementary design progressively aggregates contextual semantics and spatial details, thereby significantly enhancing the model’s capacity for multi-scale feature fusion and improving detection accuracy, especially for small objects in complex UAV images.
To enhance semantic consistency in the top-down pathway, we propose the Channel Attention-guided Semantic Aggregation (CASA) module, as shown in Figure 6. Its core mechanism dynamically recalibrates channel-wise feature responses to emphasize semantically rich information and suppress noise. Through a stepwise aggregation strategy, the module strengthens cross-scale semantic alignment and boosts the discriminative power of fused features. The Dual-Pooling Channel Attention (DPCA) processes the input features $F$ by applying both average pooling and max pooling operations along the spatial dimensions. To minimize computational overhead, the pooled feature maps are first compressed via a shared $1 \times 1$ convolutional layer, reducing the channel dimension to $\frac{1}{16}$ of the original. The channel dimension is then restored through another shared $1 \times 1$ convolution. The resulting two feature maps are combined via element-wise summation, followed by a sigmoid function to generate the final channel attention weights $\alpha_c$. This process is formulated as
$$\alpha_c = \sigma\left(\omega_c \cdot \mathrm{MP}(F) + \omega_c \cdot \mathrm{AP}(F)\right),$$
where $\mathrm{MP}(\cdot)$ denotes maximum pooling and $\mathrm{AP}(\cdot)$ denotes average pooling. $\omega_c$ consists of two $1 \times 1$ convolution layers and ReLU activation operations. $\sigma(\cdot)$ is the sigmoid activation function. The channel-refined features $F_{\alpha_c}$ are then obtained by element-wise multiplication of the attention weights $\alpha_c$ with the input features $F_{Input}$. Subsequently, the deep-level feature map $F^{d}_{output}$ is modulated by the complementary weights $(1 - \alpha_c)$ to produce a residual feature $F^{d}_{(1 - \alpha_c)}$. Finally, $F_{\alpha_c}$ and $F^{d}_{(1 - \alpha_c)}$ are summed to form the aggregated output $F^{c}_{agg}$. This process is formulated as
$$F^{c}_{agg} = \alpha_c \times F_{Input} + (1 - \alpha_c) \times F^{d}_{output}.$$
Notably, for the deepest feature map $F_5$, the output is generated solely by its channel-wise multiplication with the attention weights. This channel-refined feature is then propagated to shallower layers to convey enhanced semantic information. Consequently, the CASA module refines the quality of semantic propagation throughout the pathway, enriching shallow features with high-level semantics that are pivotal for accurate classification.
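The two CASA formulas can be traced numerically with the following minimal NumPy sketch. It makes several assumptions that the text does not confirm: the ReLU sits between the two shared 1 × 1 convolutions, the deep feature has already been resized to the shape of the input feature, and the weights are random stand-ins for learned parameters. The names `dpca`, `casa`, `w_down`, and `w_up` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpca(f, w_down, w_up):
    """Dual-pooling channel attention on f of shape (C, H, W); returns (C,) weights."""
    mp = f.max(axis=(1, 2))    # global max pooling over H and W
    ap = f.mean(axis=(1, 2))   # global average pooling over H and W

    def shared_mlp(v):
        # On a pooled vector, two 1x1 convolutions reduce to two matrix products.
        # Placing the ReLU between them is an assumption.
        return w_up @ np.maximum(w_down @ v, 0.0)

    return sigmoid(shared_mlp(mp) + shared_mlp(ap))

def casa(f_input, f_deep, w_down, w_up):
    """F_agg^c = alpha_c * F_Input + (1 - alpha_c) * F_output^d."""
    alpha = dpca(f_input, w_down, w_up)[:, None, None]  # broadcast over H and W
    return alpha * f_input + (1.0 - alpha) * f_deep

rng = np.random.default_rng(0)
C, H, W, r = 32, 8, 8, 16                        # r = 16 is the reduction ratio in the text
w_down = rng.standard_normal((C // r, C)) * 0.1  # random stand-ins for learned weights
w_up = rng.standard_normal((C, C // r)) * 0.1
f_input = rng.standard_normal((C, H, W))
f_deep = rng.standard_normal((C, H, W))          # assumed already resized to match f_input
alpha_c = dpca(f_input, w_down, w_up)
f_agg_c = casa(f_input, f_deep, w_down, w_up)
```

Because each weight lies strictly between 0 and 1, every output value is a per-channel convex combination of the input feature and the deep feature, so a channel that the attention de-emphasizes is filled in from the deeper level rather than discarded.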
To preserve and enhance spatial information in the bottom-up pathway, we propose the Spatial Attention-guided Detail Aggregation (SADA) module, as shown in Figure 7. Its core mechanism computes spatial attention weights to model location importance, thereby suppressing background noise while highlighting discriminative features crucial for object localization. Furthermore, a stepwise integration strategy is applied to aggregate these spatial-refined features, which improves both the efficiency and fidelity of spatial detail propagation across the network. The Dual-Pooling Spatial Attention (DPSA) module operates by aggregating channel information from the input features $F$ through both maximum and average pooling. To preserve fine-grained spatial details, the input features for this module are sourced directly from the multi-level outputs of the backbone network, consistent with the input to the DPCA module. Then, the pooled maps are fused via element-wise summation, and a sigmoid function is applied to generate the spatial attention weight map $\alpha_s$. This process is formulated as
$$\alpha_s = \sigma\left(\omega_s \cdot \mathrm{MP}(F) + \omega_s \cdot \mathrm{AP}(F)\right),$$
where $\mathrm{MP}(\cdot)$ denotes maximum pooling and $\mathrm{AP}(\cdot)$ denotes average pooling. $\omega_s$ denotes the $3 \times 3$ convolution and ReLU activation operations. $\sigma(\cdot)$ is the sigmoid activation function. The input features $F_{Input}$ are spatially refined by the attention map $\alpha_s$ to produce $F_{\alpha_s}$. Concurrently, the shallow features $F^{s}_{output}$ are passed through a complementary gate $(1 - \alpha_s)$ to form the residual feature $F^{s}_{(1 - \alpha_s)}$, preserving details that the spatial attention may have suppressed. The final aggregated feature $F^{s}_{agg}$ integrates these with the channel-aware feature $F^{c}_{agg}$ via summation, as formulated below:
$$F^{s}_{agg} = \alpha_s \times F_{Input} + (1 - \alpha_s) \times F^{s}_{output} + F^{c}_{agg}.$$
For the shallow feature map $F_3$, the output is generated solely by its element-wise multiplication with the spatial attention weights. This spatially refined feature is then propagated to deeper layers, conveying enhanced spatial information. By integrating such multi-scale spatial cues that benefit precise localization, the SADA module sharpens the perception of spatial structures in deep features, thereby significantly strengthening the representation and transmission of spatial details throughout the bottom-up pathway.
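A corresponding NumPy sketch of the SADA computation is given below. It reads the description literally and rests on assumptions: one shared 3 × 3 kernel with zero padding is applied to each pooled map, the ReLU follows the convolution, and all three feature maps already share one shape. The kernel is a random stand-in for learned weights, and the names `conv3x3`, `dpsa`, and `sada` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3(m, kernel):
    """Zero-padded 3x3 cross-correlation on a single-channel map m of shape (H, W)."""
    h, w = m.shape
    padded = np.pad(m, 1)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def dpsa(f, kernel):
    """Dual-pooling spatial attention on f of shape (C, H, W); returns an (H, W) map."""
    mp = f.max(axis=0)    # max pooling along the channel axis
    ap = f.mean(axis=0)   # average pooling along the channel axis

    def omega(m):
        # 3x3 convolution followed by ReLU (ordering assumed from the text).
        return np.maximum(conv3x3(m, kernel), 0.0)

    # With the ReLU directly before the sigmoid, weights fall in [0.5, 1).
    return sigmoid(omega(mp) + omega(ap))

def sada(f_input, f_shallow, f_agg_c, kernel):
    """F_agg^s = alpha_s * F_Input + (1 - alpha_s) * F_output^s + F_agg^c."""
    alpha = dpsa(f_input, kernel)[None, :, :]   # broadcast over channels
    return alpha * f_input + (1.0 - alpha) * f_shallow + f_agg_c

rng = np.random.default_rng(0)
C, H, W = 16, 8, 8
kernel = rng.standard_normal((3, 3)) * 0.1   # random stand-in for learned weights
f_input = rng.standard_normal((C, H, W))
f_shallow = rng.standard_normal((C, H, W))   # assumed already resized to match f_input
f_agg_c = rng.standard_normal((C, H, W))     # stand-in for the CASA output
alpha_s = dpsa(f_input, kernel)
f_agg_s = sada(f_input, f_shallow, f_agg_c, kernel)
```

Unlike the channel gate, the spatial gate here is shared by all channels at a given location, and the channel-aware feature enters as a plain additive term rather than through the gate.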
Unlike standard fusion strategies, our stepwise aggregation performs a weighted combination guided by attention mechanisms, adaptively emphasizing the most informative multi-level features with negligible computational overhead. This approach enhances the bidirectional propagation path, enabling refined feature integration within the pyramid network. Consequently, the model achieves superior multi-scale representation, improving its ability to interpret complex scenes and detect objects across various scales.

4. Experimental Results and Analysis

4.1. Experimental Datasets

We evaluate our method on two challenging UAV vehicle detection benchmarks: VisDrone [61] and UAVDT [62]. The VisDrone dataset is a standard and challenging small-object detection benchmark designed for advancing research in UAV applications. The dataset encompasses a variety of challenging scenarios, such as crowded city streets, busy traffic intersections, and highly dense pedestrian areas, while also considering various challenging environmental factors, such as nighttime, rainy weather, and foggy weather. This complexity makes it a robust testbed for evaluating detection models. The dataset consists of 8629 images with 6471 for training, 548 for validation, and 1610 for testing, which includes 10 predefined categories such as cars, pedestrians, and bicycles. A notable characteristic of VisDrone is its pronounced class imbalance: cars account for 40.9% of instances, while awning tricycles represent only 0.9%. Importantly, small objects dominate the dataset, comprising 62.4% of all instances. This combination of severe class imbalance and the prevalence of small objects makes accurate detection particularly challenging, aligning the benchmark closely with real UAV-based perception difficulties.
UAVDT comprises 39,850 images (23,258 training, 16,592 testing) with three vehicle categories (car, truck, bus) at 1080 × 540 resolution. Objects are also divided by size into small (area < $32^2$), medium ($32^2$ < area < $96^2$), and large (area > $96^2$). The dataset presents significant challenges due to low resolution, illumination variations, partial occlusions, and frequent viewpoint changes caused by UAV motion. These factors, together with the dominance of small and medium sized objects, make accurate detection highly demanding.
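The size split can be written as a short helper. This is only a sketch: the function name `size_category` is hypothetical, and because the text uses strict inequalities on both sides, assigning a box whose area equals exactly 32² or 96² to the larger bucket is an assumption.

```python
def size_category(width, height):
    """Classify a bounding box by pixel area using the 32^2 and 96^2 thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        # Boxes with area exactly 32^2 land here by assumption.
        return "medium"
    return "large"

examples = {
    (20, 20): size_category(20, 20),      # area 400
    (50, 50): size_category(50, 50),      # area 2500
    (100, 100): size_category(100, 100),  # area 10000
}
```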

4.2. Implementation Details

The experimental platform runs Ubuntu 20.04 with Python 3.11, PyTorch 2.8, and CUDA 12.8. The hardware configuration includes an i9-11900K CPU and an NVIDIA RTX GPU. To ensure a fair comparison, all models in this paper were trained with the same parameters and settings, from scratch and without pre-trained weights. All models were trained with the stochastic gradient descent (SGD) optimizer on the VisDrone dataset for 300 epochs. The training hyperparameters are an initial learning rate of 0.01, weight decay of 0.0005, momentum of 0.937, batch size of 16, and input image size of 640 × 640.
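For reproduction purposes, the settings above can be collected into a single configuration dictionary. The key names below follow the training arguments of the Ultralytics package, which is an assumption on our part: the text does not name the training framework, and the dataset and model file names in the closing comment are hypothetical.

```python
# Training settings from this section, keyed by Ultralytics-style argument names
# (the choice of framework is an assumption, not stated in the text).
train_args = {
    "epochs": 300,
    "imgsz": 640,            # input image size 640 x 640
    "batch": 16,
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "pretrained": False,     # training from scratch, no pre-trained weights
}

# Hypothetical usage, not executed here:
#   from ultralytics import YOLO
#   YOLO("yolo11s.yaml").train(data="VisDrone.yaml", **train_args)
```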

4.3. Evaluation Metrics

Following standard object detection practice, the proposed LHA-YOLO model and all baseline models were evaluated on the VisDrone validation set. The performance metrics are precision (P), recall (R), and mean average precision (mAP); the complexity metrics are the number of parameters in millions (Params, M), giga floating-point operations (GFLOPs), and frames per second (FPS).
Precision measures the accuracy of positive sample predictions. It quantifies the proportion of predicted bounding boxes that correctly classify and locate objects of interest. Recall measures the completeness of model predictions. It quantifies the proportion of actual ground truth objects that are successfully detected by the model. The mean average precision (mAP) is the predominant benchmark metric for evaluating object detection models. Average precision (AP) is defined as the area under the precision–recall curve for a single category, and mAP is the mean of the AP scores over all categories. In this paper, we adopt mAP50 and mAP50–95 as the primary evaluation metrics. mAP50 refers to the mean average precision at an IoU threshold of 0.5, while mAP50–95 denotes the mean average precision averaged over the 10 IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
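The definitions above can be made concrete with a small pure-Python sketch. It assumes all-point interpolation of the precision–recall curve, which is one common convention; evaluation toolkits differ in how they sample the curve, and the text does not state which variant is used. The function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(is_tp, num_gt):
    """AP for one class from detections sorted by descending confidence.

    is_tp[i] is True when detection i matched a ground-truth box at the
    chosen IoU threshold; num_gt is the number of ground-truth boxes.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += int(hit)
        precisions.append(tp / rank)   # P = TP / (TP + FP)
        recalls.append(tp / num_gt)    # R = TP / (TP + FN)
    # Make precision non-increasing by taking the running maximum from the right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the step-wise precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# The 10 IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))         # intersection 1, union 7
ap_example = average_precision([True, False, True], num_gt=2)
```

mAP50 averages this AP over all classes with matches decided at an IoU threshold of 0.5; mAP50–95 repeats the computation at each threshold in `IOU_THRESHOLDS` and averages the results.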

4.4. Experimental Comparison with YOLO11 Series Models

We comprehensively evaluate LHA-YOLO against YOLO11. As shown in Table 1, all LHA-YOLO variants consistently outperform their baseline counterparts in detection accuracy (P, R, mAP) across different scales. These improvements come with lower computational cost (Params and GFLOPs) and higher inference speed (FPS), especially for the LHA-YOLOl and LHA-YOLOx models.
Taking LHA-YOLOs and YOLO11s as an example, we further analyze per-class performance on VisDrone to assess the impact of class imbalance. Our method improves AP for the majority class, cars, from 79.5% to 81.1%, and for the minority class, awning-tricycles, from 14.7% to 16.5%. These consistent gains confirm that the proposed approach does not sacrifice minority-class accuracy for overall improvement.
The superior detection performance of our method is further illustrated through visual comparisons. Figure 8 presents example results between YOLO11s and LHA-YOLOs on challenging scenes. The visualizations reveal that YOLO11s suffers from significant false negatives, failing to detect many small objects. In contrast, LHA-YOLOs successfully identifies these challenging instances with more precise bounding boxes. YOLO11s is prone to generating duplicate or inaccurate detections in complex backgrounds, whereas LHA-YOLOs effectively suppresses such false positives while maintaining a high recall rate. This demonstrates its enhanced capability to discern objects in cluttered environments. These observations strongly support the quantitative metrics, collectively verifying that our proposed improvements substantially enhance detection reliability across diverse UAV scenarios.

4.5. Ablation Study

4.5.1. Investigation on Generalizability of LFEM and DCPP Strategy

We conduct ablation studies to validate the generalizability of the proposed Lightweight Feature Extraction Module (LFEM) and Divide-and-Conquer Propagation Path (DCPP) strategy. Both modules are integrated into two distinct detector architectures, YOLO8s and YOLO11s, and evaluated through benchmark experiments under standard protocols.
As shown in Table 2, the proposed LFEM and DCPP strategy bring consistent improvements across both YOLO8s and YOLO11s architectures, with particularly notable gains for YOLO11s. Notably, their combined use produces a synergistic effect, yielding performance superior to their individual contributions. Specifically, LHA-YOLOs achieves gains of 2.2 percentage points in mAP50 and 1.4 percentage points in mAP50–95 over YOLO11s. This demonstrates the compatibility and combined effect of the modules, confirming the generalizability of the proposed approach.
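The reported gains are absolute differences in mAP, not relative changes. The short sketch below separates the two readings using the VisDrone figures stated in the abstract (mAP50 from 39.4% to 41.6%, mAP50–95 from 23.5% to 24.9%); the helper name `gain` is hypothetical.

```python
def gain(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

map50_gain = gain(39.4, 41.6)      # YOLO11s -> LHA-YOLOs, mAP50 on VisDrone
map50_95_gain = gain(23.5, 24.9)   # YOLO11s -> LHA-YOLOs, mAP50-95 on VisDrone
```

Here `map50_gain` evaluates to (2.2, 5.6) and `map50_95_gain` to (1.4, 6.0), so the relative improvement is between two and four times the absolute figure quoted in the text.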
Furthermore, as shown in Table 2, the proposed modules enhance model efficiency. Specifically, integrating the LFEM and DCPP strategy reduces both Params and GFLOPs for both YOLO8s and YOLO11s. The efficiency gains are positively correlated with model size, meaning larger models achieve greater complexity reduction. This effect is clearly observed in the results for YOLO8s. Overall, the findings indicate that our method alleviates the performance–complexity trade-off through structural re-parameterization and computational optimization.
We visualize the contributions of each module in different scenarios, specifically on ground roads and elevated roads, as shown in Figure 9 and Figure 10, respectively. Compared to the YOLO11s baseline, the YOLO11s-LFEM model primarily reduces false positives by enhancing feature discriminability but tends to miss some small objects. Conversely, the YOLO11s-DCPP model improves the recall of small objects through better multi-scale feature integration, yet it is prone to generating false detections. The integration of both the LFEM and DCPP strategies in LHA-YOLO yields synergistic benefits, demonstrating that their complementary designs jointly address the core challenges of UAV-based detection.
In summary, the integrated design contributes to cross-architecture generalization. It achieves consistent performance gains alongside complexity reduction in both YOLO8s and YOLO11s. Ablation results indicate that the feature enhancement and attention mechanisms produce complementary effects, which together improve detection accuracy and lower computational costs across different architectural designs.

4.5.2. Analysis of MDFR Effectiveness

This section presents a systematic ablation analysis of the Multi-Dimension Feature Representation (MDFR) block within the LFEM. Experiments are conducted on YOLO11 under identical training conditions, with results detailed in Table 3.
First, the configuration with only the first-stage PConv-PWConv operations provides little benefit. It exceeds the standard YOLO11 baseline only at the medium scale (YOLO11m), indicating that a design focused purely on efficiency lacks sufficient representational capacity, particularly for smaller models. Subsequently, introducing the channel attention (CA) mechanism brings stable improvements across all scales. CA works by selectively amplifying informative feature channels. Independently, integrating the spatial attention (SA) mechanism also improves on the first-stage-only configuration, as SA refines localization by highlighting salient regions and filtering background clutter, which is essential for complex UAV scenes. Ultimately, the complete MDFR block, unifying both attention pathways, delivers the best performance. The parallel CA and SA operations complement each other: CA enhances semantic clarity for accurate classification, while SA sharpens spatial fidelity for precise regression. This synergy, grounded in an efficient first stage, enables a powerful and manageable multi-scale feature representation.
In summary, the ablation study validates that MDFR’s two-stage design, which combines efficient transformation with dual-attention refinement, achieves a good balance between efficiency and capacity. The progressive performance gain supports the architectural rationale and confirms MDFR’s suitability for real-time, high-accuracy UAV detection tasks.

4.5.3. Evaluation of CASA and SADA Modules Importance

This section aims to evaluate the importance of the proposed Channel Attention-guided Semantic Aggregation (CASA) and Spatial Attention-guided Detail Aggregation (SADA) modules within the DCPP strategy. Experiments are conducted on YOLO11s to assess various attention and aggregation strategies across its bidirectional pathways (Table 4). We test four core components per pathway: Dual-Pooling Channel Attention (DPCA), Dual-Pooling Spatial Attention (DPSA), conventional multi-scale feature fusion (CMFF), and our progressive multi-scale feature aggregation (PMFA). The complete DCPP strategy is realized by coupling DPCA with PMFA in the top-down pathway (CASA) and DPSA with PMFA in the bottom-up pathway (SADA).
The baseline YOLO11s model (row 3) establishes the performance reference. Introducing DPCA in the top-down pathway (row 4) enhances mAP50 by 0.3% and mAP50–95 by 0.2%. Extending this with DPSA in the bottom-up pathway (row 5) yields further gains of 0.5% in mAP50 and 0.5% in mAP50–95. This confirms that integrating attention mechanisms in both pathways enhances detection through complementary feature enhancement, where channel attention emphasizes semantically rich channels and spatial attention amplifies spatially significant regions. However, these CMFF-based improvements remain modest. In contrast, the complete DPCA and DPSA implementation with PMFA (row 6) achieves significant gains: a 1.0% increase in mAP50 and a 0.8% increase in mAP50–95. This demonstrates that PMFA effectively enables DPCA to strengthen semantic coherence and DPSA to preserve structural details.
To validate our pathway-specific design, we tested an alternative that merges DPCA and DPSA outputs via element-wise multiplication. Although this variant surpasses the baseline (final row, Table 4), it underperforms our DCPP strategy. This outcome highlights that the key lies in specialized, pathway-optimized application rather than simple fusion. Our dedicated approach allows each attention type to optimize its designated pathway, where channel attention enhances semantic coherence across scales and spatial attention preserves precise structural details.
In summary, the ablation study substantiates the individual importance of both the specialized attention modules and the progressive aggregation strategy. Theoretically, the conflict between semantic and spatial information arises from the distinct roles of the two pathways. The top-down pathway conveys high-level categorical semantics for classification. The bottom-up pathway captures low-level positional details for localization. Forcing a single fusion node to accommodate both types of information leads to gradient competition and feature interference during training. This is harmful for small objects, which require precise localization. DCPP resolves this conflict by decoupling the two streams. It applies task-specific attention and progressive aggregation to each pathway. As a result, the DCPP strategy integrates CASA and SADA to achieve superior detection performance through an optimal balance of semantic and spatial refinements. This validates the effectiveness of our dedicated architectural design over generic attention combination schemes.

4.6. Experimental Comparison with Mainstream Models

We conduct comprehensive benchmarks on the VisDrone and UAVDT datasets to evaluate the generalization of LHA-YOLO against contemporary one-stage, two-stage, and transformer-based detectors.

4.6.1. Evaluation on VisDrone Dataset

Results are summarized in Table 5. Compared to classical two-stage detectors such as Faster R-CNN and Cascade R-CNN, LHA-YOLOs achieves a higher mAP50, with improvements of 22% and 22.7%, respectively, while using roughly one-fifth of the parameters. In the comparison among YOLO series models, we primarily focus on small-scale models. Among them, LHA-YOLOs achieves the best performance on both mAP50 and mAP50–95. Although it does not have the lowest Params, its GFLOPs are the second lowest, only slightly higher than those of FFCA-YOLOs, so the model maintains highly competitive overall efficiency. Notably, LHA-YOLOs even outperforms the transformer-based detector RT-DETR-R18, whose parameter count is comparable to the medium-scale model LHA-YOLOm, while LHA-YOLOs uses only half the parameters of RT-DETR-R18. This demonstrates that LHA-YOLOs achieves an excellent balance between accuracy and model complexity, making it a highly efficient and practical choice for real-world deployment when both performance and resource consumption are considered.
Figure 11 provides visual evidence of the strengths of LHA-YOLO. In challenging UAV scenarios, it outperforms models such as RT-DETR-R18 by reducing false positives in clutter more reliably and detecting frequently missed small objects more accurately, showcasing its superior robustness. These results verify the reliable performance of our method under real-world complexity, consistent with the quantitative gains.

4.6.2. Evaluation on UAVDT Dataset

To further validate the generalization of our proposed method, we conducted extended experiments on the UAVDT dataset. Table 6 shows the quantitative results. LHA-YOLOs achieves the highest mAP50 (36.9%) and mAP50–95 (22.9%) among YOLO series models. At 18.8 GFLOPs, it remains competitive among lightweight YOLO variants. Figure 12 provides qualitative detection examples. LHA-YOLOs reliably detects small and dense objects under different lighting and weather conditions. It also handles background clutter well. These results confirm that our method generalizes across different UAV datasets.

4.7. Discussion

Based on experimental results, we analyze the performance of LHA-YOLO for UAV object detection. Compared to the baseline YOLO11 models, LHA-YOLO variants consistently improve detection accuracy across all scales. The results are reported in Table 1. The LFEM module reduces computational cost while increasing both mAP50 and mAP50–95. However, the ablation study in Table 3 shows a different finding. Removing the parallel-attention second stage from the MDFR block and keeping only the PConv-PWConv first stage causes a slight mAP drop on YOLO11n and YOLO11s. This indicates that the second-stage attention is essential for recovering representational capacity, especially for small models. The DCPP strategy decouples semantic and spatial information flows using CASA and SADA. This leads to further gains in mAP50 and mAP50–95. Applying DCPP alone without LFEM marginally increases the parameter count. This suggests that the benefit of decoupled propagation is best realized when combined with a lightweight backbone. Together, both modules reduce model complexity. The reduction in GFLOPs is more pronounced than the reduction in parameters. The reason is that DCPP introduces progressive aggregation and additional attention operations. These operations add some parameter overhead but still lower the overall computational cost. In comparisons with mainstream methods on VisDrone and UAVDT, LHA-YOLOs achieves the highest mAP50 and mAP50–95.
Nevertheless, several limitations remain. First, our validation is primarily based on two UAV datasets, VisDrone and UAVDT. Although these are widely used aerial detection benchmarks, the generalization to other small-object scenarios, such as the DOTA dataset for rotated objects, or general surveillance footage, has not been systematically evaluated. More experiments on diverse benchmarks are needed to establish broad applicability. Second, while our method achieves real-time inference on an NVIDIA RTX GPU, practical UAV deployment often involves resource-constrained edge devices like Jetson TX2 or Xavier NX. The inference speed, latency, and memory consumption on such platforms have not been measured, so additional optimization or model compression may be required for onboard real-time processing. Third, our preliminary analysis shows that the model occasionally fails to detect extremely tiny objects (e.g., those smaller than 16 × 16 pixels) or heavily overlapping large objects. Extreme scale variations remain challenging for the current design, indicating that further improvements in multi-scale feature adaptation are needed. Addressing these limitations is the primary direction of our future research.

5. Conclusions

In this study, we present LHA-YOLO, a lightweight and accurate object detection framework built upon the YOLO11 baseline. The network is designed for UAV-based small-object detection, aiming to improve accuracy while maintaining real-time performance. Specifically, we propose the Lightweight Feature Extraction Module (LFEM), which uses a parallel spatial-channel attention mechanism within MDFR blocks to effectively capture discriminative features of small objects with low computational cost. We also introduce the Divide-and-Conquer Propagation Path (DCPP) strategy, which decouples semantic and spatial information flows through channel-guided and spatial-guided attention modules and a progressive aggregation scheme. This design resolves the conflict between semantic and spatial information in multi-scale feature fusion. We conduct extensive experiments on the VisDrone and UAVDT datasets and analyze the results both quantitatively and qualitatively. The experimental results confirm that LHA-YOLO outperforms the baseline YOLO11 models and many mainstream detectors. Compared to other advanced methods, LHA-YOLO achieves higher detection accuracy with competitive model complexity and inference speed.
Despite these satisfactory results, we find room for further improvement. The computational complexity of LHA-YOLO, especially from the DCPP attention modules, could be reduced. The model also occasionally fails on extremely small objects and densely overlapping scenes. In future work, we plan to incorporate more efficient attention designs or model pruning techniques to lower the computational overhead. We will also explore stronger multi-scale feature adaptation and loss functions to improve performance on dense- and tiny-object scenarios.

Author Contributions

Conceptualization, J.Y. and X.P.; methodology, J.Y. and Q.P.; software, Q.P.; validation, J.Y. and Q.P.; formal analysis, J.Y.; investigation, J.Y., Q.P. and X.P.; resources, J.Y.; data curation, Q.P.; writing—original draft preparation, J.Y.; writing—review and editing, X.P. and Q.P.; visualization, Q.P.; supervision, J.Y. and X.P.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Program of Shanxi Province under Grant 202303021212246, and PhD Research Startup Foundation of Shanxi Datong University under Grant 2021-B-01.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: Unmanned aerial vehicle
LHA-YOLO: Lightweight and high-accuracy network based on YOLO11
LFEM: Lightweight feature extraction module
DCPP: Divide-and-conquer propagation path
CASA: Channel attention-guided semantic aggregation
SADA: Spatial attention-guided detail aggregation
MDFR: Multi-dimension feature representation
DPCA: Dual-pooling channel attention
DPSA: Dual-pooling spatial attention
PConv: Partial convolution
PWConv: Pointwise convolution
mAP: Mean average precision
GFLOPs: Giga floating-point operations

References

  1. Zhang, S.; Wang, X.; Lin, H.; Qiang, Z. A review of the application of UAV multispectral remote sensing technology in precision agriculture. Smart Agric. Technol. 2025, 12, 101406. [Google Scholar] [CrossRef]
  2. Almujally, N.A.; Wu, T.; Alhasson, H.F.; Hanzla, M.; Jalal, A.; Liu, H. UAV-based intelligent traffic surveillance using recurrent neural networks and Swin transformer for dynamic environments. Front. Neurorobot. 2025, 19, 1681341. [Google Scholar]
  3. Yang, L.; Zhang, X.; Li, Z.; Li, L.; Shi, Y. A LODBO algorithm for multi-UAV search and rescue path planning in disaster areas. Chin. J. Aeronaut. 2025, 38, 103301. [Google Scholar] [CrossRef]
  4. Liu, K.; Mattyus, G. Fast multiclass vehicle detection on aerial images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1938–1942. [Google Scholar]
  5. Teutsch, M.; Kruger, W. Robust and fast detection of moving vehicles in aerial videos using sliding windows. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 26–34. [Google Scholar]
  6. Rusyn, B.; Lutsyk, O.; Kosarevych, R.; Maksymyuk, T.; Gazda, J. Features extraction from multi-spectral remote sensing images based on multi-threshold binarization. Sci. Rep. 2023, 13, 19655. [Google Scholar] [CrossRef]
  7. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward Accurate and Efficient Object Detection on Drone Imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; Volume 36, pp. 1026–1033. [Google Scholar]
  8. Huang, H.; Li, L.; Ma, H. An Improved Cascade R-CNN-Based Target Detection Algorithm for UAV Aerial Images. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 232–237. [Google Scholar]
  9. Xie, B.; Wang, Y.; Han, M.; Wang, Y.; Chen, J. Density-Guided Two-Stage Small Object Detection in UAV Images. Expert Syst. Appl. 2025, 297, 129346. [Google Scholar] [CrossRef]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  12. Chen, Y.; Liu, Z. DFTD-YOLO: Lightweight Multi-Target Detection from Unmanned Aerial Vehicle Viewpoints. IEEE Access 2025, 13, 24672–24680. [Google Scholar] [CrossRef]
  13. Qu, J.; Li, Q.; Pan, J.; Sun, M.; Lu, X.; Zhou, Y.; Zhu, H. SS-YOLOv8: Small-size object detection algorithm based on improved YOLOv8 for UAV imagery. Multimed. Syst. 2025, 31, 42. [Google Scholar] [CrossRef]
  14. Wang, C.; Yi, H. DGBL-YOLOv8s: An Enhanced Object Detection Model for Unmanned Aerial Vehicle Imagery. Appl. Sci. 2025, 15, 2789. [Google Scholar] [CrossRef]
  15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  16. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics/tree/main/ultralytics/cfg/models/v8 (accessed on 17 December 2025).
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  18. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  19. Merget, D.; Rock, M.; Rigoll, G. Robust facial landmark detection via a fully-convolutional local-global context network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 781–790. [Google Scholar]
  20. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
  21. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
  22. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 14–19 June 2020; pp. 10781–10790.
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  24. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
  25. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  26. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
  27. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6.
  28. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
  29. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Yifu, Z.; Abhiram, V.; et al. Ultralytics YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 December 2025).
  30. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
  31. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  32. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21.
  33. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
  34. Qiang, H.; Hao, W.; Xie, M.; Tang, Q.; Shi, H.; Zhao, Y.; Han, X. SCM-YOLO for Lightweight Small Object Detection in Remote Sensing Images. Remote Sens. 2025, 17, 249.
  35. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 1–14.
  36. Doherty, J.; Gardiner, B.; Kerr, E.; Siddique, N. BiFPN-yolo: One-stage object detection integrating Bi-directional feature pyramid networks. Pattern Recognit. 2025, 160, 111209.
  37. Wang, M.; Zhang, B. Contrastive learning and similarity feature fusion for UAV image target detection. IEEE Geosci. Remote Sens. Lett. 2023, 21, 1–5.
  38. Zhang, Y.; Wang, W.; Ye, M.; Yan, J.; Yang, R. LGA-YOLO for Vehicle Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 5317–5330.
  39. Luo, W.; Yuan, S. Enhanced YOLOv8 for small-object detection in multiscale UAV imagery: Innovations in detection accuracy and efficiency. Digit. Signal Process. 2025, 158, 104964.
  40. Wang, H.; Liu, J.; Zhao, J.; Zhang, J.; Zhao, D. Precision and speed: LSOD-YOLO for lightweight small object detection. Expert Syst. Appl. 2025, 269, 126440.
  41. Quan, Z.; Sun, J. A Feature-Enhanced Small Object Detection Algorithm Based on Attention Mechanism. Sensors 2025, 25, 589.
  42. Zhong, H.; Zhang, Y.; Shi, Z.; Zhang, Y.; Zhao, L. PS-YOLO: A Lighter and Faster Network for UAV Object Detection. Remote Sens. 2025, 17, 1641.
  43. Li, Y.; Yan, H.; Li, D.; Wang, H. Robust Miner Detection in Challenging Underground Environments: An Improved YOLOv11 Approach. Appl. Sci. 2024, 14, 11700.
  44. Wang, Z.; Su, Y.; Kang, F.; Wang, L.; Lin, Y.; Wu, Q.; Li, H.; Cai, Z. Pc-yolo11s: A lightweight and effective feature extraction method for small target image detection. Sensors 2025, 25, 348.
  45. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
  46. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 14–19 June 2020; pp. 11534–11542.
  47. Yang, J.; Xie, X.; Wang, Z.; Zhang, P.; Zhong, W. Bi-directional information guidance network for UAV vehicle detection. Complex Intell. Syst. 2024, 10, 5301–5316.
  48. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  49. Ying, Z.; Zhou, J.; Zhai, Y.; Quan, H.; Li, W.; Genovese, A.; Piuri, V.; Scotti, F. Large-scale high-altitude uav-based vehicle detection via pyramid dual pooling attention path aggregation network. IEEE Trans. Intell. Transp. Syst. 2024, 25, 14426–14444.
  50. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  51. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
  52. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
  53. Shan, Y.; Li, X.; Yao, D.; Wang, E.; Sun, Y.; Liu, Y.; Bai, S.; Jia, D. MSAE–DETR: A multiscale adaptive-enhancement UAV small-object detection algorithm. Meas. Sci. Technol. 2026, 37, 075203.
  54. Ma, L.; Luo, Y.; Xu, J. HMF-DEIM: High-Fidelity Multi-Domain Fusion Transformer for UAV Small Object Detection. Sensors 2026, 26, 2187.
  55. Qian, J.; Tao, C.; Luo, X.; Gao, Z.; Wang, T.; Xiao, F.; Cao, F.; Zhang, Z. DRONet: Occlusion-mastering multi-object detection tailored for unmanned aerial vehicles. Displays 2026, 93, 103388.
  56. Li, Z.; Qi, L. MSA-DETR: A Multi-Scale Attention Augmented Model for Small Object Detection in UAV Imagery. Remote Sens. 2026, 18, 1179.
  57. Ma, T.; Yin, H. MAFPN: A mixed local-global attention feature pyramid network for aerial object detection. Remote Sens. Lett. 2024, 15, 907–918.
  58. Kim, M.; Lim, H.S.; Lee, S.; Kim, B.; Kim, G. Bi-directional contextual attention for 3d dense captioning. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 385–401.
  59. Ma, S.; Lu, H.; Liu, J.; Zhu, Y.; Sang, P. Layn: Lightweight multi-scale attention yolov8 network for small object detection. IEEE Access 2024, 12, 29294–29307.
  60. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031.
  61. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Seoul, Republic of Korea, 27 October–2 November 2019.
  62. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
  63. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtually, 11–17 October 2021; pp. 2778–2788.
  64. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1432.
Figure 1. The overall architecture of YOLO11.
Figure 2. The overall architecture of LHA-YOLO.
Figure 3. The overall architecture of LFEM.
Figure 4. The overall architecture of MDFR.
Figure 5. The overall architecture of the DCPP strategy.
Figure 6. The overall architecture of the CASA module. DPCA: Dual-Pooling Channel Attention. MP: Max pooling. AP: Average pooling.
Figure 7. The overall architecture of the SADA module. DPSA: Dual-Pooling Spatial Attention. MP: Max pooling. AP: Average pooling.
Figure 8. Qualitative comparison between YOLO11s (left) and LHA-YOLOs (right). The proposed method shows stronger robustness, with fewer false negatives on small objects and suppressed false positives in complex backgrounds.
Figure 9. Comparative visualization of module contributions on ground roads. Results from (a) baseline YOLO11s, (b) YOLO11s-LFEM, (c) YOLO11s-DCPP, and (d) the integrated LHA-YOLOs.
Figure 10. Comparative visualization of module contributions on elevated roads. Results from (a) baseline YOLO11s, (b) YOLO11s-LFEM, (c) YOLO11s-DCPP, and (d) the integrated LHA-YOLOs.
Figure 11. Qualitative detection results comparing RT-DETR-R18 (left column) and LHA-YOLOs (right column) on VisDrone images.
Figure 12. Qualitative detection results on UAVDT images.
Table 1. Performance comparison between LHA-YOLO network and YOLO11 network.

| Methods | P (%) | R (%) | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO11n | 44.6 | 33.8 | 33.3 | 19.4 | 2.58 | 6.3 | 526 |
| LHA-YOLOn | 44.6 | 34.0 | 34.0 | 19.8 | 2.5 | 5.8 | 588 |
| YOLO11s | 50.8 | 38.6 | 39.4 | 23.5 | 9.42 | 21.3 | 500 |
| LHA-YOLOs | 52.7 | 39.6 | 41.6 | 24.9 | 9.0 | 18.8 | 556 |
| YOLO11m | 55.4 | 42.2 | 43.6 | 26.6 | 20.04 | 67.7 | 333 |
| LHA-YOLOm | 56.8 | 43.2 | 45.3 | 28.0 | 18.5 | 58.3 | 357 |
| YOLO11l | 56.3 | 44.1 | 45.3 | 27.9 | 25.3 | 86.6 | 204 |
| LHA-YOLOl | 57.6 | 44.1 | 46.0 | 28.2 | 21.7 | 68.4 | 286 |
| YOLO11x | 57.3 | 44.9 | 46.6 | 29 | 56.8 | 194.5 | 122 |
| LHA-YOLOx | 59.5 | 45.7 | 47.7 | 29.5 | 48.8 | 153.4 | 167 |
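Table 2. Generalizability evaluation of the proposed LFEM and DCPP modules tested on YOLO8 and YOLO11.

| Model | Baseline | LFEM | DCPP | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO8s | ✓ | × | × | 38.8 | 23.2 | 11.1 | 28.5 |
| YOLO8s | ✓ | ✓ | × | 39.3 | 23.6 | 7.6 | 18.2 |
| YOLO8s | ✓ | × | ✓ | 39.7 | 23.9 | 11.8 | 27.8 |
| YOLO8s | ✓ | ✓ | ✓ | 40.4 | 24.1 | 8.3 | 17.5 |
| YOLO11s | ✓ | × | × | 39.4 | 23.5 | 9.42 | 21.3 |
| YOLO11s | ✓ | ✓ | × | 39.8 | 23.8 | 8.58 | 19.4 |
| YOLO11s | ✓ | × | ✓ | 40.4 | 24.3 | 9.81 | 21.6 |
| YOLO11s | ✓ | ✓ | ✓ | 41.6 | 24.9 | 9.0 | 18.8 |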
Note: ✓ denotes the adoption of the corresponding strategy, while × indicates its omission.
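Table 3. Ablation study of MDFR block components across YOLO11 architectures on VisDrone dataset.

| Model | PConv-PWConv / CA / SA | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- |
| YOLO11n | × × × | 33.3 | 19.4 | 2.58 | 6.3 |
| YOLO11n | × × | 32.8 | 19.0 | 2.2 | 5.4 |
| YOLO11n | × | 33.6 | 19.5 | 2.4 | 5.9 |
| YOLO11n | × | 33.2 | 19.0 | 2.3 | 5.8 |
| YOLO11n | | 33.6 | 19.2 | 2.37 | 6.1 |
| YOLO11s | × × × | 39.4 | 23.5 | 9.42 | 21.3 |
| YOLO11s | × × | 39.1 | 23.3 | 7.8 | 17.7 |
| YOLO11s | × | 39.7 | 23.9 | 8.24 | 19.1 |
| YOLO11s | × | 39.6 | 23.5 | 8.1 | 19.1 |
| YOLO11s | | 39.8 | 23.8 | 8.58 | 19.4 |
| YOLO11m | × × × | 43.6 | 26.6 | 20.04 | 67.7 |
| YOLO11m | × × | 44.0 | 26.7 | 16.9 | 55.8 |
| YOLO11m | × | 44.6 | 27.2 | 17.6 | 58.6 |
| YOLO11m | × | 44.5 | 27.2 | 17.2 | 58.6 |
| YOLO11m | | 44.8 | 27.5 | 18 | 60 |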
Note: ✓ denotes the adoption of the corresponding strategy, while × indicates its omission.
Table 4. Ablation study on CASA and SADA modules across YOLO11s on VisDrone dataset.

| Model | Top-Down Pathway (DPCA, DPSA, CMFF, PMFA) / Bottom-Up Pathway (DPCA, DPSA, CMFF, PMFA) | mAP50 (%) | mAP50–95 (%) |
| --- | --- | --- | --- |
| YOLO11s | × × × × × × | 39.4 | 23.5 |
| YOLO11s | × × × × × | 39.7 | 23.7 |
| YOLO11s | × × × × | 39.9 | 24.0 |
| YOLO11s | × × × × | 40.4 | 24.3 |
| YOLO11s | × × | 40.1 | 23.9 |
Note: ✓ denotes the adoption of the corresponding strategy, while × indicates its omission.
Table 5. Comparison of our LHA-YOLO11 with other state-of-the-art object detectors on VisDrone dataset.

| Model | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- |
| Faster RCNN [10] | 19.6 | - | 41.2 | 118.8 |
| Cascade RCNN [11] | 18.9 | - | 69.0 | 146.6 |
| YOLOv5s [29] | 38.9 | 23.2 | 9.12 | 23.8 |
| YOLOv6s [31] | 37.0 | 22.0 | 16.3 | 43.7 |
| YOLOv8s [16] | 38.8 | 23.2 | 11.1 | 28.5 |
| YOLOv9s [32] | 40.4 | 24.1 | 7.17 | 26.7 |
| YOLOv10s [33] | 38.8 | 23.4 | 7.22 | 21.4 |
| YOLOv11s [24] | 39.4 | 23.5 | 9.42 | 21.3 |
| TPH-YOLOv5s [63] | 39.3 | 23.6 | 9.2 | 23.1 |
| FFCA-YOLOs [64] | 37 | - | 2.33 | 17.4 |
| PS-YOLOs [42] | 40.7 | 24.2 | 5.53 | 20.0 |
| RT-DETR-R18 [52] | 41.4 | 23.2 | 19.9 | 57.0 |
| LHA-YOLOs | 41.6 | 24.9 | 9.0 | 18.8 |
Table 6. Comparison of our LHA-YOLO11 with other state-of-the-art object detectors on UAVDT dataset.

| Model | mAP50 (%) | mAP50–95 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- |
| YOLOv5s [29] | 31.3 | 17.7 | 9.12 | 23.8 |
| YOLOv6s [31] | 30.8 | 17.8 | 16.3 | 43.7 |
| YOLOv8s [16] | 30.6 | 18.0 | 11.1 | 28.5 |
| YOLOv9s [32] | 29.6 | 17.6 | 7.17 | 26.7 |
| YOLOv10s [33] | 31.6 | 19.3 | 7.22 | 21.4 |
| YOLOv11s [24] | 32.2 | 19.4 | 9.42 | 21.3 |
| RT-DETR-R18 [52] | 30.3 | 17.6 | 19.9 | 57.0 |
| LHA-YOLOs | 36.9 | 22.9 | 9.0 | 18.8 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI and ACS Style

Yang, J.; Pan, X.; Pan, Q. LHA-YOLO: A Lightweight and High-Accuracy Detector via Parallel Attention and Divide-and-Conquer Fusion for UAV Images. Sensors 2026, 26, 2970. https://doi.org/10.3390/s26102970
