3.1. Overview of YOLO11
Building upon the classic Backbone–Neck–Head architecture, YOLO11 (Figure 1) introduces architectural and training refinements that advance real-time generic object detection. It balances accuracy, inference speed, and computational efficiency through novel network architecture designs and advanced training methods.
The backbone is designed for hierarchical multi-scale feature extraction from input images. It comprises three key components: CBS blocks, C3K2 modules, and an SPPF module. The CBS block, composed of Convolution, Batch Normalization (BN), and SiLU activation, serves as the fundamental unit for feature transformation and downsampling; the BN layer stabilizes training, and the SiLU activation yields expressive feature maps. These features are subsequently processed by the C3K2 modules, which optimize information flow by splitting feature maps and applying grouped convolutions, enhancing both feature representation capacity and computational efficiency. Finally, the SPPF (Spatial Pyramid Pooling Fast) module applies successive max-pooling operations with a fixed kernel size and concatenates their outputs, which is equivalent to parallel max-pooling with progressively larger kernels, to aggregate rich multi-scale contextual information. This improves the model’s ability to recognize objects across different scales without compromising its real-time inference speed.
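As a minimal sketch of the pooling scheme inside SPPF, the following NumPy snippet assumes the common formulation in which three stride-1 max-pooling layers sharing one 5×5 kernel are chained and their outputs stacked with the input. The function names, the single-channel input, and the omission of the surrounding 1×1 convolutions are simplifications made here for illustration. The snippet checks that chaining reproduces parallel pooling with 9×9 and 13×13 kernels.

```python
import numpy as np

def max_pool_same(x, k):
    """Stride-1 max pooling over a 2-D map, padded with -inf to keep its size."""
    pad = k // 2
    h, w = x.shape
    padded = np.full((h + 2 * pad, w + 2 * pad), -np.inf)
    padded[pad:pad + h, pad:pad + w] = x
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def sppf_pooling(x, k=5):
    """Chain three k x k poolings and stack the input with every intermediate output."""
    y1 = max_pool_same(x, k)
    y2 = max_pool_same(y1, k)
    y3 = max_pool_same(y2, k)
    return np.stack([x, y1, y2, y3])

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
feats = sppf_pooling(x)

# Two chained 5x5 poolings cover a 9x9 window, three cover a 13x13 window.
same_as_9 = np.array_equal(feats[2], max_pool_same(x, 9))
same_as_13 = np.array_equal(feats[3], max_pool_same(x, 13))
print(feats.shape, same_as_9, same_as_13)
```

Reusing one small kernel lets each pooling stage build on the previous result, which is why the chained form is cheaper than evaluating the large windows independently.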
The neck employs the classic path aggregation network and feature pyramid network (PAN-FPN) structure, which augments the conventional FPN with an additional bottom-up path. This pathway reintegrates low-level spatial features with high-level semantic features, improving detection accuracy by combining both feature types. Notably, a C2PSA module is introduced prior to feature fusion so that attention guides the integration. The C2PSA module combines multiple parallel attention mechanisms with feedforward networks, improving global feature modeling capability. This design enables the network to better capture long-range dependencies and complex nonlinear interactions, strengthening feature representation and increasing architectural flexibility across diverse application scenarios.
The detection head is responsible for generating the final predictions, which consist of bounding box coordinates, dimensions, and class probabilities. In the classification branch, depthwise convolution (DWConv) is employed in place of traditional convolution, reducing the number of parameters while preserving accuracy, thereby enhancing the computational efficiency of the model. The regression branch incorporates both standard convolution and deformable convolution to refine the localization performance and improve the accuracy of bounding box predictions. Overall, YOLO11 achieves a superior balance between detection performance and computational efficiency through its refined architectural design and optimized training pipeline.
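The parameter saving from depthwise convolution can be illustrated with a short count. This is a sketch under assumptions: the channel widths (256) and the pairing of a 3×3 depthwise layer with a 1×1 pointwise layer are chosen here for illustration and are not taken from the YOLO11 configuration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

c_in, c_out, k = 256, 256, 3  # illustrative sizes, not from the paper

standard = conv_params(c_in, c_out, k)
depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
pointwise = conv_params(c_in, c_out, 1)              # 1 x 1 channel mixing
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 2))
```

For these sizes the depthwise-plus-pointwise pair needs roughly one ninth of the weights of the standard convolution, and the gap widens as the kernel size grows.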
3.2. Overview of the LHA-YOLO Architecture
Despite its state-of-the-art performance on generic object detection benchmarks, YOLO11 exhibits insufficient feature representation capability when applied to small-object detection in UAV images. To overcome this shortcoming, this paper proposes LHA-YOLO, a novel framework shown in Figure 2. The key research objectives are: (1) designing a dedicated attention mechanism to effectively extract and enhance features of small objects; (2) improving the feature pyramid network to achieve more efficient multi-scale feature fusion, integrating deep semantic information with shallow spatial details; and (3) incorporating strategies that enhance detection accuracy while maintaining high computational efficiency.
In the backbone network, our architecture retains the five feature extraction stages and the SPPF module from YOLO11. With the exception of the first stage, each feature extraction stage primarily consists of a CBS block and a Lightweight Feature Extraction Module (LFEM). The LFEM is composed of a convolutional layer, $n$ repeated Multi-Dimension Feature Representation (MDFR) blocks, and a set of residual connections. It is designed to effectively extract and enhance features of small objects while maintaining high computational efficiency.
In the neck network, we introduce a Divide-and-Conquer Propagation Path (DCPP) strategy to enhance the complementary advantages of both information streams with negligible additional computational cost. Specifically, we integrate dedicated aggregation attention mechanisms into both propagation paths. In the top-down pathway, a Channel Attention-guided Semantic Aggregation (CASA) module strengthens semantic consistency across multi-scale features and enhances the discriminative power of the fused representations. In the bottom-up pathway, a Spatial Attention-guided Detail Aggregation (SADA) module progressively refines and aggregates features to emphasize spatial details, particularly the localized features critical for precise localization. This strategy facilitates the propagation of both contextual semantics and spatial details, thereby strengthening multi-scale fusion.
As a result, the proposed LHA-YOLO model incorporates specialized modules to achieve a more comprehensive understanding of complex scenes and improve detection performance across diverse object scales.
3.3. Proposed LFEM
The Lightweight Feature Extraction Module (LFEM), shown in Figure 3, is primarily composed of CBS blocks, multiple Multi-Dimension Feature Representation (MDFR) blocks, and residual connections. Within the LFEM, input features are split into two pathways via convolutional operations: one branch is fed through $n$ successive MDFR blocks, while the other is retained and later concatenated with the output of these MDFR blocks. As a result, the LFEM enhances multi-channel information integration, thereby improving discriminative feature extraction for small objects.
As shown in Figure 4, the MDFR block consists of two primary stages. The first-stage residual structure begins with a partial convolution (PConv) layer [60] employing a $3 \times 3$ kernel, which processes feature maps corresponding to one-fourth of the total channels. These processed channels are then concatenated with the remaining original channels so that the input and output channel dimensions match. The combined features are subsequently passed through two pointwise convolution (PWConv) layers to produce the main branch features. Finally, the main branch features are added element-wise to the input feature map to generate the output. This design integrates channel information comprehensively while maintaining low computational overhead. Specifically, PConv convolves only a subset of the input channels and leaves the remainder unchanged, producing feature maps that combine original and transformed features, while PWConv further compresses channel dimensionality to minimize parameters. Consequently, the LFEM achieves lower computational cost than traditional residual structures.
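The partial convolution step can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: it assumes that the convolved subset is the first quarter of the channels, uses zero padding with stride 1, and omits the two PWConv layers and the residual addition. The helper names are hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Stride-1, zero-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Contract the input channels for this kernel tap.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], xp[:, dy:dy + h, dx:dx + wd])
    return out

def pconv(x, w, ratio=4):
    """Convolve the first C/ratio channels and pass the rest through unchanged."""
    c_p = x.shape[0] // ratio
    return np.concatenate([conv3x3(x[:c_p], w), x[c_p:]], axis=0)

def conv_flops(c, h, w, k=3):
    """Multiply-accumulate count of a k x k convolution with c input and c output channels."""
    return h * w * k * k * c * c

rng = np.random.default_rng(0)
c, h, w = 16, 8, 8
x = rng.standard_normal((c, h, w))
weights = rng.standard_normal((c // 4, c // 4, 3, 3))
y = pconv(x, weights)

# Convolving a quarter of the channels costs (1/4)^2 = 1/16 of the full convolution.
flops_ratio = conv_flops(c, h, w) // conv_flops(c // 4, h, w)
print(y.shape, flops_ratio)
```

Because the cost of a convolution grows with the product of input and output channels, restricting it to one-fourth of the channels reduces the multiply-accumulate count by a factor of sixteen while keeping the channel dimension unchanged.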
In the second stage, a multi-attention coordination mechanism is employed. The output features from the first stage are split into two separate branches along the channel dimension, each receiving input features $F$ of size $H \times W \times C$, where $H$, $W$, and $C$ denote the height, width, and number of channels, respectively. Each branch uses multi-scale kernel sizes to divide the feature map into $k$ groups along the channel dimension, denoted as $F_1$, $F_2$, …, $F_k$. The number of channels for $F_i$ is $C/k$. In the upper branch, each subgroup $F_i$ is processed by a spatial attention module to capture contextual information for small objects. The attention module learns importance weights that highlight relevant spatial regions. This process is defined as
$$F_i^{s} = \sigma\left(W_s \cdot \mathrm{GMP}(F_i) + b_s\right) \odot F_i,$$
where $F_i^{s}$ is the spatial attention feature corresponding to subgroup feature $F_i$, and $\sigma$ is the sigmoid activation function. GMP refers to the global maximum pooling operation of $F_i$ in the spatial dimension to extract spatial statistics, while $W_s$ and $b_s$ represent a convolutional weight matrix and bias term of size $H \times W \times 1$, which aid in encoding spatial representations. Similarly, the lower branch employs a channel attention mechanism to model inter-channel dependencies relevant to small objects. The operation is defined as
$$F_i^{c} = \sigma\left(W_c \cdot \mathrm{GAP}(F_i) + b_c\right) \odot F_i,$$
where $F_i^{c}$ is the channel attention feature corresponding to subgroup feature $F_i$, which has $C/k$ channels. GAP denotes the global average pooling operation of $F_i$ in the channel dimension, which produces a vector of size $1 \times 1 \times C/k$ aggregating channel-wise statistics. The convolutional layer $W_c$ and bias $b_c$ have a kernel size of $1 \times 1$, with both input and output channels equal to $C/k$. The sigmoid output is then broadcast element-wise across the spatial dimensions of $F_i$ for channel recalibration.
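The two attention branches can be sketched per subgroup in NumPy. This is a hedged illustration, not the authors' code: it assumes that the spatial branch pools by taking the maximum over channels (so that the pooled map matches the stated size of the spatial weight and bias), that the channel branch averages over the spatial positions, and that each attention map rescales its subgroup by element-wise multiplication. All names and tensor sizes are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(f, w, b):
    """f: (H, W, C_g); w, b: (H, W, 1). Returns f reweighted per spatial location."""
    pooled = f.max(axis=2, keepdims=True)      # (H, W, 1), max over channels (assumed)
    return sigmoid(w * pooled + b) * f

def channel_attention(f, w, b):
    """f: (H, W, C_g); w: (C_g, C_g) acting as a 1x1 convolution; b: (C_g,)."""
    pooled = f.mean(axis=(0, 1))               # (C_g,), average over H and W
    return sigmoid(w @ pooled + b) * f         # weights broadcast over H and W

rng = np.random.default_rng(0)
h, wd, c, k = 8, 8, 16, 4
f = rng.standard_normal((h, wd, c))
groups = np.split(f, k, axis=2)                # k subgroups with C / k channels each
c_g = c // k

w_s = 0.1 * rng.standard_normal((h, wd, 1))
b_s = np.zeros((h, wd, 1))
w_c = 0.1 * rng.standard_normal((c_g, c_g))
b_c = np.zeros(c_g)

spatial_out = [spatial_attention(g, w_s, b_s) for g in groups]
channel_out = [channel_attention(g, w_c, b_c) for g in groups]
print(len(spatial_out), spatial_out[0].shape, channel_out[0].shape)
```

Since the sigmoid keeps every weight between 0 and 1, each branch can only attenuate activations, which is how less relevant locations or channels are suppressed.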
Finally, the two branches yield $k$ sets of spatial attention and channel attention feature maps, each with spatial dimensions $H \times W$. These feature maps are then concatenated and undergo channel shuffling to ensure effective feature redistribution, followed by reconstruction into a new integrated feature representation. Each MDFR incorporates both channel and spatial transformations, enabling effective aggregation of small-object features while suppressing redundant background noise and interference.
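Channel shuffling is commonly implemented as a reshape, a transpose, and a second reshape. The sketch below assumes this standard formulation and a channel-last layout; the function name and the toy input are illustrative.

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (H, W, C). Interleave channels so each output group mixes all input groups."""
    h, w, c = x.shape
    assert c % groups == 0
    x = x.reshape(h, w, groups, c // groups)
    x = x.transpose(0, 1, 3, 2)                # swap the group and per-group axes
    return x.reshape(h, w, c)

# Eight channels labelled 0..7, stored as two groups of four.
x = np.arange(8).reshape(1, 1, 8)
shuffled = channel_shuffle(x, groups=2)
print(shuffled[0, 0].tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After shuffling, neighbouring channels come from different groups, so a subsequent grouped operation sees information from both attention branches.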
3.4. Proposed DCPP Strategy
As illustrated in Figure 5, the Divide-and-Conquer Propagation Path (DCPP) strategy processes the top-down and bottom-up information streams separately using dedicated attention-guided progressive aggregation. The top-down pathway emphasizes semantic consistency, while the bottom-up pathway preserves spatial details. This divided yet complementary design progressively aggregates contextual semantics and spatial details, thereby significantly enhancing the model’s capacity for multi-scale feature fusion and improving detection accuracy, especially for small objects in complex UAV images.
To enhance semantic consistency in the top-down pathway, we propose the Channel Attention-guided Semantic Aggregation (CASA) module, as shown in Figure 6. Its core mechanism dynamically recalibrates channel-wise feature responses to emphasize semantically rich information and suppress noise. Through a stepwise aggregation strategy, the module strengthens cross-scale semantic alignment and boosts the discriminative power of fused features. The Dual-Pooling Channel Attention (DPCA) processes the input features $F$ by applying both average pooling and max pooling operations along the spatial dimensions. To minimize computational overhead, the pooled feature maps are first compressed via a shared $1 \times 1$ convolutional layer, reducing the channel dimension to $1/r$ of the original, where $r$ is the reduction ratio. The channel dimension is then restored through another shared $1 \times 1$ convolution. The resulting two feature maps are combined via element-wise summation, followed by a sigmoid function to generate the final channel attention weights $M_c$. This process is formulated as
$$M_c = \sigma\left(f_{1 \times 1}(\mathrm{MaxPool}(F)) + f_{1 \times 1}(\mathrm{AvgPool}(F))\right),$$
where $\mathrm{MaxPool}$ denotes maximum pooling and $\mathrm{AvgPool}$ denotes average pooling, $f_{1 \times 1}$ consists of two $1 \times 1$ convolution layers and ReLU activation operations, and $\sigma$ is the sigmoid activation function. The channel-refined features $F_c$ are then obtained by element-wise multiplication of the attention weights $M_c$ with the input features $F$. Subsequently, the deep-level feature map $F_{\mathrm{deep}}$ is modulated by the complementary weights $1 - M_c$ to produce a residual feature $F_r^{c}$. Finally, $F_c$ and $F_r^{c}$ are summed to form the aggregated output $F_{\mathrm{CASA}}$. This process is formulated as
$$F_c = M_c \odot F, \qquad F_r^{c} = (1 - M_c) \odot F_{\mathrm{deep}}, \qquad F_{\mathrm{CASA}} = F_c + F_r^{c}.$$
Notably, for the deepest feature map, the output is generated solely by its channel-wise multiplication with the attention weights. This channel-refined feature is then propagated to shallower layers to convey enhanced semantic information. Consequently, the CASA module refines the quality of semantic propagation throughout the pathway, enriching shallow features with high-level semantics that are pivotal for accurate classification.
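The CASA aggregation described above can be sketched in NumPy. The sketch makes several assumptions that the text does not fix: the ReLU is placed between the two shared 1×1 convolutions, the reduction ratio is set to 4, and the deep-level feature map is taken to be already resampled to the shape of the input features. The names `dpca_weights`, `casa`, `w_down`, and `w_up` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpca_weights(f, w_down, w_up):
    """f: (C, H, W); w_down: (C // r, C); w_up: (C, C // r). Returns (C,) weights."""
    def shared(v):
        # Shared bottleneck: compress, ReLU, restore (1x1 convs act as matrices here).
        return w_up @ np.maximum(w_down @ v, 0.0)
    max_desc = f.max(axis=(1, 2))              # max pooling over H and W
    avg_desc = f.mean(axis=(1, 2))             # average pooling over H and W
    return sigmoid(shared(max_desc) + shared(avg_desc))

def casa(f, f_deep, w_down, w_up):
    """Blend the input with the deep-level map using complementary channel weights."""
    m = dpca_weights(f, w_down, w_up)[:, None, None]
    return m * f + (1.0 - m) * f_deep

rng = np.random.default_rng(0)
c, h, w, r = 16, 8, 8, 4
f = rng.standard_normal((c, h, w))
f_deep = rng.standard_normal((c, h, w))        # assumed already matched in shape
w_down = 0.1 * rng.standard_normal((c // r, c))
w_up = 0.1 * rng.standard_normal((c, c // r))

out = casa(f, f_deep, w_down, w_up)
print(out.shape)
```

Because the two gates sum to one, every output value is a convex combination of the input and the deep-level feature, so the module redistributes emphasis between the two sources rather than amplifying either.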
To preserve and enhance spatial information in the bottom-up pathway, we propose the Spatial Attention-guided Detail Aggregation (SADA) module, as shown in Figure 7. Its core mechanism computes spatial attention weights to model location importance, thereby suppressing background noise while highlighting discriminative features crucial for object localization. Furthermore, a stepwise integration strategy is applied to aggregate these spatially refined features, which improves both the efficiency and fidelity of spatial detail propagation across the network. The Dual-Pooling Spatial Attention (DPSA) module operates by aggregating channel information from the input features $F$ through both maximum and average pooling. To preserve fine-grained spatial details, the input features for this module are sourced directly from the multi-level outputs of the backbone network, consistent with the input to the DPCA module. Then, the pooled maps are fused via element-wise summation, and a sigmoid function is applied to generate the spatial attention weight map $M_s$. This process is formulated as
$$M_s = \sigma\left(f_{3 \times 3}(\mathrm{MaxPool}(F)) + f_{3 \times 3}(\mathrm{AvgPool}(F))\right),$$
where $\mathrm{MaxPool}$ denotes maximum pooling and $\mathrm{AvgPool}$ denotes average pooling, $f_{3 \times 3}$ denotes the $3 \times 3$ convolution and ReLU activation operations, and $\sigma$ is the sigmoid activation function. The input features $F$ are spatially refined by the attention map $M_s$ to produce $F_s$. Concurrently, the shallow features $F_{\mathrm{shallow}}$ are passed through a complementary gate $1 - M_s$ to form the residual feature $F_r^{s}$, preserving details that the spatial attention may have suppressed. The final aggregated feature $F_{\mathrm{SADA}}$ integrates these with the channel-aware feature $F_{\mathrm{CASA}}$ via summation, as formulated below:
$$F_s = M_s \odot F, \qquad F_r^{s} = (1 - M_s) \odot F_{\mathrm{shallow}}, \qquad F_{\mathrm{SADA}} = F_s + F_r^{s} + F_{\mathrm{CASA}}.$$
For the shallowest feature map, the output is generated solely by its element-wise multiplication with the spatial attention weights. This spatially refined feature is then propagated to deeper layers, conveying enhanced spatial information. By integrating such multi-scale spatial cues that benefit precise localization, the SADA module sharpens the perception of spatial structures in deep features, thereby significantly strengthening the representation and transmission of spatial details throughout the bottom-up pathway.
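A corresponding NumPy sketch of the SADA aggregation is given below. It rests on assumptions not fixed by the text: pooling is taken across the channel axis, one shared single-channel 3×3 kernel followed by ReLU filters each pooled map before the summation, and the shallow and channel-aware features are already matched in shape to the input. The names `dpsa_weights`, `sada`, and `conv3x3_single` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv3x3_single(m, kernel):
    """Stride-1, zero-padded 3x3 convolution of one (H, W) map with one (3, 3) kernel."""
    h, w = m.shape
    mp = np.pad(m, 1)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * mp[dy:dy + h, dx:dx + w]
    return out

def dpsa_weights(f, kernel):
    """f: (C, H, W). Pool over channels, filter each map, sum, squash. Returns (H, W)."""
    def branch(m):
        return np.maximum(conv3x3_single(m, kernel), 0.0)
    return sigmoid(branch(f.max(axis=0)) + branch(f.mean(axis=0)))

def sada(f, f_shallow, f_casa, kernel):
    """Gate the input and the shallow map with complementary weights, then add f_casa."""
    m = dpsa_weights(f, kernel)[None, :, :]
    return m * f + (1.0 - m) * f_shallow + f_casa

rng = np.random.default_rng(0)
c, h, w = 16, 8, 8
f = rng.standard_normal((c, h, w))
f_shallow = rng.standard_normal((c, h, w))     # assumed already matched in shape
f_casa = rng.standard_normal((c, h, w))        # channel-aware feature from the top-down path
kernel = 0.1 * rng.standard_normal((3, 3))

out = sada(f, f_shallow, f_casa, kernel)
print(out.shape)
```

Under the assumed ordering, the ReLU precedes the sigmoid, so the attention weights lie in [0.5, 1); a different placement of the activation would allow stronger suppression of background locations.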
Unlike standard fusion strategies, our stepwise aggregation performs a weighted combination guided by attention mechanisms, adaptively emphasizing the most informative multi-level features with negligible computational overhead. This approach enhances the bidirectional propagation path, enabling refined feature integration within the pyramid network. Consequently, the model achieves superior multi-scale representation, improving its ability to interpret complex scenes and detect objects across various scales.