Article

Lightweight YOLO-SR: A Method for Small Object Detection in UAV Aerial Images

1 Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13063; https://doi.org/10.3390/app152413063
Submission received: 28 October 2025 / Revised: 4 December 2025 / Accepted: 5 December 2025 / Published: 11 December 2025

Abstract

To address challenges in small object detection within drone aerial imagery—such as sparse feature information, intense background interference, and drastic scale variations—this paper proposes YOLO-SR, a lightweight detection algorithm based on attention enhancement and feature reuse mechanisms. First, we designed the lightweight feature extraction module C2f-SA, which incorporates Shuffle Attention. By integrating channel shuffling and grouped spatial attention mechanisms, this module dynamically enhances edge and texture feature responses for small objects, effectively improving the discriminative power of shallow-level features. Second, the Spatial Pyramid Pooling Attention (SPPC) module captures multi-scale contextual information through spatial pyramid pooling. Combined with dual-path (channel and spatial) attention mechanisms, it optimizes feature representation while significantly suppressing complex background interference. Finally, the detection head employs a decoupled architecture separating classification and regression tasks, supplemented by a dynamic loss weighting strategy to mitigate small object localization inaccuracies. Experimental results on the RGBT-Tiny dataset demonstrate that compared to the baseline model YOLOv5s, our algorithm achieves a 5.3% improvement in precision, a 13.1% increase in recall, and respective gains of 11.5% and 22.3% in mAP0.5 and mAP0.75, simultaneously reducing the number of parameters by 42.9% (from 7.0 × 10⁶ to 4.0 × 10⁶) and computational cost by 37.2% (from 60.0 GFLOPs to 37.7 GFLOPs). The comprehensive improvement across multiple metrics validates the superiority of the proposed algorithm in both accuracy and efficiency.

1. Introduction

With the rapid development of the low-altitude economy (economic activities [1] conducted in airspace below 1000 m), drone technology—as its key enabler—has been widely applied in road traffic monitoring, critical area patrols, aerial photography, and inspection services [2,3,4,5]. This has not only created new commercial value and enhanced operational efficiency in logistics, agriculture, emergency response, and other industries but also driven the upgrading and transformation of related sectors. The effective implementation of these applications heavily relies on computer vision technologies such as precise object detection and tracking [6,7,8]. Currently, the deep learning-based YOLO series of algorithms [9,10,11] has become the mainstream solution in this field due to its excellent balance of efficiency and accuracy. However, in actual drone operational environments, factors such as high flight altitudes, unique shooting perspectives, small target scales, and susceptibility to occlusion pose severe challenges to standard YOLO models, often rendering them inadequate for practical needs. Consequently, targeted improvements to existing models to adapt them to the complex imaging characteristics of drone aerial photography have become crucial for advancing technology implementation.
In drone applications, high flight altitudes, wide fields of view, and frequently oblique angles result in pedestrians, vehicles, and other targets occupying minimal pixels within images. This leads to weak features and severe background interference, posing significant challenges to the robustness of detection algorithms. To address these challenges, current research primarily focuses on two directions: model architecture optimization and data augmentation.
In model design, a mainstream approach involves refining the highly modular architecture of YOLOv5 [12]. For instance, embedding Channel Attention (SE-block) [13] and Spatial Attention modules [14] within the detection head dynamically adjusts feature channel weights and spatial response maps, enhancing the ability to distinguish small objects from the background. However, the dynamic label assignment mechanism introduced in subsequent versions such as YOLOv8/v10 necessitates concurrent adjustments to positive-sample matching strategies when integrating custom attention modules, significantly increasing secondary development complexity. Another lightweight approach uses YOLOv5s as a baseline, leveraging its Focus slice convolution structure optimized for edge devices such as NVIDIA Jetson. By replacing standard convolutional modules with a MobileNetV3 [15] backbone [16], combined with separable convolutions [17] and neural architecture search techniques [18], this approach achieves model compression while maintaining near-benchmark accuracy, meeting real-time requirements for embedded platforms.
For multi-scale feature fusion, researchers construct cross-scale feature interaction pathways through an enhanced Feature Pyramid Network (FPN) [19], combined with Adaptive Spatial Pyramid Pooling (ASPP) [20], to enrich contextual information representation. YOLOv5’s native SPPF architecture facilitates seamless integration with ASPP modules, simplifying the workflow. Conversely, the SPPF-CSP module adopted in YOLOv8/v10 introduces challenges for multi-scale fusion due to its tightly coupled inter-stage features.
Regarding data augmentation, to accommodate the wide-angle imaging characteristics of UAVs, the four-image mosaic enhancement method is commonly employed. Compared to the nine-image Mosaic9 strategy, it more effectively preserves the integrity of small targets. Experiments demonstrate that this method increases the proportion of small targets in the training set from 12% to 38%. Simultaneously, introducing Gaussian noise and motion blur simulations effectively enhances the model’s generalization capability in complex lighting and high-speed motion scenarios.
Although existing research has optimized drone target detection models from multiple perspectives, severe false negatives and false positives persist due to the complexity and uniqueness of drone-view images. These issues typically concentrate on small targets. Currently, small target definition primarily follows two categories of methods: relative scale and absolute scale [21]. Relative-scale definitions require considering the proportional relationship between the target and the image, such as the median ratio of the bounding box area to the image area falling between 0.08% and 0.58%; the ratio of the bounding box width/height to the image width/height being less than 0.1; or the square root of the ratio of the bounding box area to the image area being less than 0.03. Absolute-scale-based definitions are more straightforward, typically classifying targets with bounding box pixel dimensions smaller than 32 × 32 as small targets [21]. Integrating these criteria, this study explicitly defines small targets as those in UAV aerial images satisfying either of the following conditions: (1) The target bounding box area constitutes less than 0.1% of the total image area, and both its width and height are less than 10% of the corresponding image dimension. (2) The absolute pixel dimensions of the target’s bounding box are less than 32 × 32 pixels. In UAV aerial scenarios, targets such as pedestrians and vehicles typically cover sparse pixel areas (often below 20 × 20 pixels), resulting in extremely weak feature representations that pose significant challenges to existing models’ recognition capabilities.
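As an illustration, the two small-object criteria defined above can be expressed as a simple predicate; the function name and argument conventions below are ours, not part of any released code.

```python
# Illustrative check of the paper's two small-object criteria; names and
# argument conventions (pixel-unit box size, image size) are assumptions.
def is_small_object(box_w, box_h, img_w, img_h):
    """Return True if a bounding box meets either small-object criterion.

    Criterion 1 (relative): box area < 0.1% of the image area AND
    both box width and height < 10% of the corresponding image dimension.
    Criterion 2 (absolute): box smaller than 32 x 32 pixels.
    """
    relative = (
        box_w * box_h < 0.001 * img_w * img_h
        and box_w < 0.1 * img_w
        and box_h < 0.1 * img_h
    )
    absolute = box_w < 32 and box_h < 32
    return relative or absolute


# Example: a 20 x 18 px pedestrian in a 1920 x 1080 frame qualifies.
print(is_small_object(20, 18, 1920, 1080))  # True
```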
To address this challenge, this paper proposes an improved model—YOLO-SR—for small object detection in drone aerial photography, based on the YOLO algorithm framework. The model primarily optimizes the YOLOv5 backbone network in three aspects: First, drawing inspiration from the C2f module design in YOLOv8 [22], it combines it with the Shuffle Attention mechanism [23] to construct the C2f-SA feature extraction module. This module effectively enhances the model’s ability to capture and represent fine-grained features of minute objects. Embedded within the backbone network, it performs feature extraction directly after a single convolution layer. This approach reduces network complexity while mitigating feature degradation of small objects in deep networks. Second, the SPPF-CBAM module was designed. It integrates features with varying receptive fields within spatial pyramid pooling to capture multi-scale contextual information. The Convolutional Block Attention Module [24] was introduced to simultaneously optimize feature weights across both channel and spatial dimensions, thereby suppressing background noise and highlighting key target regions [25]. This module replaces the original SPPF architecture in YOLOv5 [26], further enhancing small object detection capabilities while maintaining manageable computational overhead. Finally, a decoupled detection head [27] separates classification and bounding box regression tasks, enabling independent optimization for different objectives to improve classification confidence and localization accuracy for small targets.
Experimental results on the RGBT-Tiny dataset [28] demonstrate that the YOLO-SR model achieves an average precision (AP@50) of 86.5%, exhibiting significant performance advantages over several mainstream detection algorithms. This validates its effectiveness and advanced capabilities in small object detection tasks from a drone perspective.
The main work and innovations of this paper are summarized as follows:
(1) Designed the lightweight feature extraction module C2f-SA: By integrating separable convolutions with the Shuffle Attention mechanism, it reduces redundant operations after the first downsampling, preserves high-frequency details of small objects, and enhances fine-grained feature extraction capabilities. This module effectively reduces parameter count, computational load, and memory overhead while maintaining feature expressiveness, thereby improving detection accuracy and inference efficiency.
(2) Designed the small-object-detection-specific module SPPF-CBAM: Building upon spatial pyramid pooling (SPPF), it captures global contextual information through multi-scale pooling. By integrating a dual channel–spatial attention mechanism, it adaptively suppresses noise while amplifying key information, mitigating information dilution during cross-scale feature integration. Embedded within the YOLOv5 backbone network, this module enhances small object feature representation and detection performance while maintaining a lightweight architecture.
(3) Introduced the decoupled detection head (Decoupled Detect): By separating tasks and implementing dynamic loss optimization, it mitigates conflicts between classification and regression tasks, improving classification confidence and localization accuracy for small object detection in drone scenarios.

2. Related Work

2.1. Small Object Detection

Data augmentation is one of the key techniques for improving the generalization ability of UAV small target detection models [29]. Conventional geometric transformations, such as image rotation and scaling, can effectively expand the scale and viewpoint distribution of small targets in the training set without introducing semantic distortion. The “copy–paste” strategy further exploits semantic segmentation priors to accurately paste labelled target instances into new background images, avoiding foreground–background incongruity through adaptive scale adjustment and thus significantly increasing the frequency of target appearances. In addition, Generative Adversarial Networks (GANs) [30] are able to synthesize highly realistic training samples, which effectively improves the robustness of the model in complex scenes by enhancing the feature representation of small targets.
Multi-scale learning is a core strategy to cope with target scale variations by constructing feature representations with different resolutions so that the model can simultaneously perceive targets with significant size differences [31]. Typical implementation paths include image pyramids, feature pyramid networks (e.g., PANet [32], BiFPN [33] and Recursive-FPN [34]), and multi-branch detection head designs (e.g., TridentNet [35]). Among them, the feature pyramid network effectively fuses deep strong semantic information with shallow high-resolution features through top-down semantic transfer and detail compensation achieved by lateral connectivity, which significantly improves the detection of small targets. In addition, multi-scale methods based on local contrast measurements are able to highlight tiny targets more accurately in complex scenes by analyzing the differences between the target and the neighboring background. These multi-scale techniques not only enhance the hierarchical nature of feature expression but also provide richer contextual information for the model, which in turn comprehensively improves the accuracy and robustness of the detection algorithm in different scenes.
To alleviate the performance imbalance caused by scale differences in target detection, researchers have proposed several innovative training strategies. Among them, SNIP (Scale Normalization for Image Pyramids) [36] and its extended version SNIPER (Scale Normalization for Image Pyramids with Efficient Resampling) [37] significantly improve detection performance by selectively training targets within specific scale intervals of the image pyramid while retaining the ability to detect large targets, and the Scale-Aware Network (SAN) [38] further strengthens the model’s adaptability to changes in target size by mapping features into a scale-invariant subspace. In addition, the contrast measure based on derivative entropy can effectively enhance the saliency of small targets in complex environments by quantifying the difference between the target and the local background, while adaptive clutter suppression techniques dynamically reduce the weight of background noise and guide the model to focus on tiny target regions. Overall, the above strategies not only effectively balance the detection performance of targets at different scales but also provide key technical support for robust detection in complex scenes.
In the small target detection task, feature fusion technology is one of the core ways to improve the performance of the model. The deep feature map extracted by the model contains rich high-level semantic information, which is crucial for accurate target discrimination, while the shallow feature map retains more delicate spatial details and positional information, which is an important basis for accurate small target localization. By constructing an effective feature fusion mechanism and reasonably integrating feature information of different depths, the model can take into account both semantic expression strength and spatial localization accuracy, thus significantly strengthening the recognition ability and localization accuracy of small-scale targets and providing a core technical guarantee for the enhancement of the performance of small-target detection.
In addition, other studies improve small target detection performance from further technical dimensions. For example, RFLA [39] designs a sample allocation strategy based on Gaussian receptive field theory, which alleviates the training imbalance caused by the scarcity of positive samples in small target detection by accurately screening effective samples; feature-based super-resolution techniques target the weak feature representation of small targets and significantly improve recognition accuracy by enhancing the details of small target features in candidate regions. Another line of work starts from the label assignment mechanism and further optimizes the learning process for small targets by adjusting the matching logic between labels and samples, without additionally increasing the inference overhead of the model. These studies achieve targeted breakthroughs in key aspects of small target detection such as sample balancing, feature enhancement, and label matching, providing diversified solution ideas for the development of small target detection technology and laying a solid technical foundation for subsequent research.

2.2. Limitations of Existing Approaches

With the evolution of deep learning technology, UAV target detection methods based on convolutional neural networks have developed into the mainstream of research. Early research mainly used a two-stage detector represented by Faster R-CNN [25], which achieved some success on aerial images, but its higher computational complexity makes meeting the real-time detection requirements difficult. Subsequently, single-stage detectors represented by the YOLO series [9,10,11] have been widely used in UAV platforms due to their high efficiency. Among them, YOLOv5 [12] shows good performance in UAV small target detection tasks by virtue of its highly modular design; however, its detection accuracy for very small targets still needs to be improved.
In recent years, in order to adapt to the special characteristics of UAV aerial photography scenarios, researchers have proposed improvement strategies from multiple directions. In the lightweight direction, networks such as MobileNetV3 [15] significantly reduce the model complexity while maintaining the detection accuracy by introducing depth-separable convolution [17] and neural network architecture search techniques [18], while the lightweight improvement work based on YOLOv5s achieves near real-time detection performance on embedded devices by replacing the standard convolutional module [16]. In the direction of feature enhancement, attention mechanisms are widely used to improve small-target perception, e.g., SE-block [13] strengthens feature discrimination through channel attention and CBAM [24] combines channel and spatial attention to optimize target localization. However, most of the existing attention mechanisms are embedded in the network in a simple stacked manner, and the effective co-modeling of channel dependence and spatial correlation has yet to be achieved.
Although the above methods have advanced UAV target detection, several key limitations remain when migrating to real aerial photography scenarios. Firstly, methods relying on complex backbones or multi-scale pyramid structures, such as PANet [32], BiFPN [33], and TridentNet [35], have high parameter counts and computational costs and are difficult to deploy on resource-constrained edge platforms. Secondly, anchor-based detectors favor high-level semantic information and tend to ignore the detailed features of small targets, while anchor-free methods (e.g., CenterNet and FCOS) are prone to positive-sample ambiguity in target-dense regions, which degrades localization accuracy. In addition, existing attention mechanisms lack joint optimization of channel and spatial associations and have limited noise suppression capability in complex backgrounds. Last but not least, mainstream data augmentation strategies (e.g., Mosaic and Copy–Paste) do not adequately account for the scale distortion and deformation characteristics of the top-down UAV view, which tends to introduce domain shift and weaken model generalization.
In summary, existing methods have not yet reached an effective balance between lightweight, feature discriminative, and scene robustness, which also highlights the necessity of the work in this paper from multiple dimensions: by constructing an attention enhancement module, optimizing the multi-scale fusion mechanism, and introducing a decoupled detection head, the YOLO-SR model aims to systematically improve the detection performance of small targets in UAV images.

3. Methodology

3.1. Network Architecture Design

Addressing the challenges in detecting small objects within drone imagery—extremely small object scales (bounding box area not exceeding 0.08% of the entire image, or any side length less than 10% of the corresponding image dimension, with absolute resolution often below 32 × 32 pixels), dense object distribution, and complex backgrounds—this paper proposes YOLO-SR, an improved algorithm based on YOLOv5. Its overall structure is shown in Figure 1. YOLO-SR preserves YOLOv5’s end-to-end detection paradigm and input/output compatibility while introducing proprietary modules in the backbone and neck: multiple standard C3 units in the backbone network are replaced with C2f-SA units (marked in light orange with red borders), while the neck incorporates an enhanced SPPF-CBAM (SPPC) at its base. The detection head employs a Decoupled Detect Head (DDHead) as a proven configuration to ensure consistent inference workflows. For the backbone stage, balancing lightweight design with shallow feature representation, C2f-SA integrates the parallel bottleneck of C2f with Shuffle Attention. It preserves fine-grained receptive fields through channel grouping and spatial rearrangement, while reusing gradient flows across scales. This significantly reduces parameters/FLOPs while enhancing small object detection capabilities. The multi-scale feature maps D1, D2, and D3 from the backbone enter the neck, following the downsampling/upsampling path defined by blue and orange arrows. Cross-layer fusion is achieved through CBS and concatenation. The bottom-inserted SPPC maintains full-channel multi-scale pooling output, combining channel and spatial attention to suppress background noise and artifacts, highlighting key target regions in dense scenes. For head detection, YOLO-SR employs a decoupled DDHead that separates classification and regression branches: features at each scale first pass through a shared feature transformation layer before entering dedicated classification and regression heads, reducing task conflicts and minimizing localization bias. Ultimately, DDHead outputs class probabilities and bounding box parameters at scales P1, P2, and P3, enabling high-precision, efficient detection of small objects with varying sizes and densities.

3.2. Lightweight Backbone Network Based on C2f-SA

3.2.1. Shuffle Attention (SA) Mechanism

To address recognition challenges in drone remote sensing imagery—such as multi-scale targets, morphological diversity, and fine-grained inter-class feature similarity—this paper introduces a Shuffle Attention (SA) (see Figure 2 for the structure) module into the backbone network. The enhanced features extracted by SA are then fused with the Neck Network [11] to enhance the model’s ability to capture deep discriminative semantic information. While mainstream attention mechanisms (such as spatial attention and channel attention) can effectively model pixel-level relationships and channel dependencies, respectively, their simple stacking significantly increases computational burden. To address this, this paper employs the SA module. Through channel grouping and weight reordering mechanisms, this module efficiently achieves synergistic effects between channel and spatial attention, enhancing the model’s feature fusion capabilities while effectively controlling computational overhead.
Specifically, given the input feature map $X \in \mathbb{R}^{C \times H \times W}$, the SA module first divides the channel features into $G$ sub-feature groups:
$X = [X_1, X_2, \ldots, X_G], \quad X_k \in \mathbb{R}^{(C/G) \times H \times W}$  (1)
Subsequently, an SA unit that integrates spatial attention and channel attention is applied in parallel to each subgroup. Each group of features $X_k$ is further divided into two sub-branches:
$X_{k1}, X_{k2} = \mathrm{Split}(X_k), \quad X_{k1}, X_{k2} \in \mathbb{R}^{(C/2G) \times H \times W}$  (2)
The channel attention branch extracts channel feature descriptors through global average pooling:
$s = F_{gp}(X_{k1}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{k1}(i, j)$  (3)
Then, channel weights are learned through a nonlinear transformation with shared parameters:
$X'_{k1} = \sigma(W_1 \cdot s + b_1) \cdot X_{k1}$  (4)
where $W_1 \in \mathbb{R}^{(C/2G) \times 1 \times 1}$ and $b_1 \in \mathbb{R}^{(C/2G) \times 1 \times 1}$, and these parameters are shared across all $G$ groups. The spatial attention branch learns spatial position weights through group normalization and a parameter transformation:
$X'_{k2} = \sigma(W_2 \cdot \mathrm{GN}(X_{k2}) + b_2) \cdot X_{k2}$  (5)
where $W_2 \in \mathbb{R}^{(C/2G) \times 1 \times 1}$ and GN denotes the Group Normalization operation. The outputs of the two branches are concatenated along the channel dimension:
$X'_k = \mathrm{Concat}[X'_{k1}, X'_{k2}] \in \mathbb{R}^{(C/G) \times H \times W}$  (6)
Finally, the features of each group are mixed through the Channel Shuffle operation [22] to achieve cross-group information interaction and overall feature enhancement:
$Y = \mathrm{ChannelShuffle}([X'_1, X'_2, \ldots, X'_G])$  (7)
By synergistically integrating group convolution [40], group normalization [41], spatial attention, channel attention (borrowed from SENet [42]), and channel rearrangement, the module achieves a significant improvement in computational efficiency while guaranteeing a strong feature expression capability. Specifically, group convolution effectively reduces the number of parameters and computational cost through group computation; group normalization optimizes training stability in small batches and promotes learning of attention distribution in the spatial dimension; the spatial attention mechanism adaptively enhances the feature response in key regions by modeling the spatial relationship between pixels; and the channel attention mechanism achieves the weighted emphasis on important feature channels by capturing the inter-channel dependency relationship. The channel rearrangement operation further facilitates the information interaction between different subgroups and avoids the information segregation that may be caused by group convolution. Through the organic combination of the above components, the module is able to effectively improve the model’s ability to perceive and recognize small targets in UAV images while maintaining a low computational complexity.
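To make the grouping and branching concrete, the following PyTorch-style sketch implements Equations (1)–(7). It is an illustrative re-implementation following standard SA-Net conventions rather than the authors’ released code; the group count and parameter initializations are assumptions.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of Shuffle Attention, Eqs. (1)-(7). Channels must be divisible
    by 2 * groups."""

    def __init__(self, channels, groups=8):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # C/2G channels per branch
        # Shared affine parameters W, b of shape (1, C/2G, 1, 1)
        self.cweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)          # group normalization, Eq. (5)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, c // groups, h, w)
        return x.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        # Eq. (1): divide channels into G sub-feature groups
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        # Eq. (2): split each group into a channel branch and a spatial branch
        x0, x1 = x.chunk(2, dim=1)
        # Eqs. (3)-(4): channel attention via global average pooling
        s = x0.mean(dim=(2, 3), keepdim=True)
        x0 = x0 * self.sigmoid(self.cweight * s + self.cbias)
        # Eq. (5): spatial attention via group normalization
        x1 = x1 * self.sigmoid(self.sweight * self.gn(x1) + self.sbias)
        # Eq. (6): concatenate the two branches and restore the layout
        out = torch.cat([x0, x1], dim=1).reshape(b, c, h, w)
        # Eq. (7): channel shuffle for cross-group information exchange
        return self.channel_shuffle(out, 2)

# Example: ShuffleAttention(64)(torch.randn(1, 64, 80, 80)).shape -> (1, 64, 80, 80)
```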

3.2.2. C2f-SA Module Design

Traditional YOLOv5 achieves feature extraction through stacked C3 modules and standard convolution blocks. While it possesses certain detection accuracy, its deep structure requires extensive hierarchical connections and memory resources for storing network weights and intermediate feature maps, leading to high computational overhead and network complexity. Additionally, the backbone network heavily relies on convolution operations, exhibiting insufficient detail representation capability for small targets in complex backgrounds. To address this, this paper proposes a lightweight backbone network (as shown in Figure 1) to enhance small target detection performance while reducing computational burden.
Specific improvements include two aspects: First, a novel feature extraction unit—SA-Bottleneck (as shown in Figure 3)—is proposed, which introduces Shuffle Attention into the standard Bottleneck structure to optimize the spatial feature representation capability and effectively improve overall network detection performance.
Based on this, an innovative C2f-SA module (as shown in Figure 4) is designed, which consists of Cross-Stage Partial connections (C2f) and Shuffle Attention (SA) [21]. The module embeds a channel rearrangement mechanism (Channel Shuffle) in the C2f’s Bottleneck [12], breaking channel dependencies, and introduces a grouped spatial attention (Grouped SA) strategy where each group independently computes spatial weights, thereby significantly enhancing responses in small target regions. Finally, this improved C2f-SA module is employed immediately after the first downsampling, preserving more edge and texture details of small targets from shallow stages, effectively mitigating the loss of early feature information.
In this work, the feature processing pipeline of the C2f-SA module is as follows: First, a 1 × 1 convolution is employed to perform channel compression and remapping on the input feature map to enhance feature representation capability:
$X_{in} = \mathrm{Conv}_{1 \times 1}(X)$  (8)
Subsequently, the remapped feature map is divided into a residual branch and a main branch through the Split [10] operation, where the residual branch directly transmits key information to avoid feature loss, while the main branch proceeds to subsequent deep feature extraction:
$X_1, X_2 = \mathrm{Split}(X_{in})$  (9)
The main branch consists of n layers of consecutively stacked SA-Bottleneck units, with residual connections [30] introduced in each unit to enhance the network’s nonlinear representation capability while ensuring smooth information flow:
$X_{2,i+1} = F_{SA\text{-}Bottleneck}(X_{2,i}), \quad i \in [0, n-1]$  (10)
$X_{2,i+1}$ is the output of the $i$-th SA-Bottleneck unit in the main branch, obtained by transforming the input $X_{2,i}$ through the SA-Bottleneck.
The forward propagation of SA-Bottleneck is defined as follows:
$F_{SA\text{-}Bottleneck}(X) = F_{SA}(\mathrm{Conv}_{3 \times 3}(\mathrm{Conv}_{1 \times 1}(X))) + X$  (11)
In this definition, $F_{SA}$ denotes the Shuffle Attention module: it operates on the feature maps processed sequentially by the $\mathrm{Conv}_{1 \times 1}$ and $\mathrm{Conv}_{3 \times 3}$ layers, adaptively adjusting feature weights along the channel and spatial dimensions to achieve refined modeling and enhancement of the input features.
After the deep feature extraction is completed, each parallel gradient flow is fused through channel concatenation (Concat) operation, further enhancing the feature representation capability:
$X_{fused} = \mathrm{Concat}[X_1, X_{2,n}]$  (12)
Finally, the fused feature map is dimensionally adjusted again through 1 × 1 convolution to output the feature map of the C2f-SA module:
$Y = \mathrm{Conv}_{1 \times 1}(X_{fused})$  (13)
thereby enriching gradient propagation while significantly reducing the number of parameters and computational cost. Compared to the original C3 module, the C2f-SA module adopts a cross-stage partial connection (C2f) structure to replace the traditional C3 module, with the main advantages including the following: The C2f structure, through gradient flow branch design, can better preserve shallow feature information and avoid gradient vanishing problems in deep networks, which is particularly important for small object detection; simultaneously, the C2f-SA module significantly reduces network parameters, computational overhead, and memory access costs while maintaining detection accuracy, improving inference speed and effectively meeting the dual requirements of lightweight design and high performance for drone aerial small object detection.
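The pipeline of Equations (8)–(13) can be sketched as follows, reusing the ShuffleAttention sketch from Section 3.2.1. The CBS helper, channel widths, and the number of stacked SA-Bottlenecks are illustrative choices, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1, s=1):
    """CBS block (Conv + BatchNorm + SiLU), as used throughout YOLOv5."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class SABottleneck(nn.Module):
    """SA-Bottleneck, Eq. (11): F_SA(Conv3x3(Conv1x1(X))) + X."""
    def __init__(self, channels, sa_groups=8):          # channels divisible by 2*sa_groups
        super().__init__()
        self.cv1 = conv_bn_silu(channels, channels, 1)
        self.cv2 = conv_bn_silu(channels, channels, 3)
        self.sa = ShuffleAttention(channels, sa_groups)  # sketched in Section 3.2.1

    def forward(self, x):
        return self.sa(self.cv2(self.cv1(x))) + x

class C2fSA(nn.Module):
    """C2f-SA following Eqs. (8)-(13): 1x1 remap, split, n SA-Bottlenecks,
    concatenation, and a final 1x1 projection."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.hidden = c_out // 2
        self.cv_in = conv_bn_silu(c_in, 2 * self.hidden, 1)              # Eq. (8)
        self.blocks = nn.ModuleList(
            [SABottleneck(self.hidden) for _ in range(n)])               # Eq. (10)
        self.cv_out = conv_bn_silu(2 * self.hidden, c_out, 1)            # Eq. (13)

    def forward(self, x):
        x1, x2 = self.cv_in(x).chunk(2, dim=1)                           # Eq. (9)
        for block in self.blocks:
            x2 = block(x2)                                               # Eqs. (10)-(11)
        fused = torch.cat([x1, x2], dim=1)                               # Eq. (12)
        return self.cv_out(fused)                                        # Eq. (13)
```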
From Table 1, it can be observed that the C2f-SA module reduces the parameter count by 23.8% compared to the original C3 module, decreases computational load by 27.1%, and simultaneously improves inference speed by 28.9%, fully validating the lightweight advantages of this module.

3.3. Spatial Pyramid Pooling Attention Module SPPC (SPPF-CBAM)

3.3.1. CBAM Attention Mechanism

Aiming at the problem of feature representation redundancy in convolutional neural networks, CBAM (Convolutional Block Attention Module), as an efficient and lightweight attention mechanism, achieves joint attention modeling of channel and spatial dimensions through structured decomposition. As shown in Figure 5, the module decouples the attention mechanism into a cascade structure of the channel attention submodule and spatial attention submodule, which achieves accurate calibration of multi-dimensional importance of feature maps while minimizing computational overhead. Among them, the channel attention mechanism (CAM) enhances the representation of the feature map by adaptively learning the importance weights of each channel. Different channels usually correspond to different attributes of the input image, such as edges, texture and color.
The core of the channel attention mechanism lies in highlighting key channels and suppressing redundant channels for a specific task, so as to enhance the effectiveness and discriminative ability of feature representation.
The working principle is shown in Figure 5: First, global average pooling (AvgPool) and global max pooling (MaxPool) [24] are performed on the input feature map F to obtain two channel descriptor vectors Favg and Fmax, as shown in Equation (14). These operations aggregate spatial information to the channel level, thereby obtaining global information for each channel.
$F_{avg} = \mathrm{AvgPool}(F), \quad F_{max} = \mathrm{MaxPool}(F)$  (14)
Then, Favg and Fmax are, respectively, passed through a shared Multi-Layer Perceptron (MLP) to generate two new feature vectors, which are then linearly combined, as shown in Equation (15). The purpose of this step is to learn the importance of each channel through nonlinear transformation.
$M_c(F) = \sigma(\mathrm{MLP}(F_{avg}) + \mathrm{MLP}(F_{max}))$  (15)
where σ denotes the sigmoid function, which normalizes the output to the range [0, 1]. Finally, the generated channel attention weight $M_c$ is applied to the original feature map $F$ to obtain the enhanced feature map $F'$, as shown in Equation (16). This step weights the feature map by channel-wise multiplication, highlighting important channels.
$F' = M_c(F) \otimes F$  (16)
The spatial attention mechanism enhances feature maps by learning the importance of each spatial location. The purpose of the spatial attention mechanism is to highlight important spatial locations and suppress unimportant ones under given tasks, thereby improving the effectiveness of feature representations.
The working principle is shown in Figure 5: First, the channel-enhanced feature map $F'$ undergoes global average pooling (AvgPool) and global max pooling (MaxPool) along the spatial dimension, yielding two single-channel feature maps, $F'_{avg}$ and $F'_{max}$, as shown in Equation (17). These operations aggregate channel information to the spatial level, thereby obtaining information for each position.
$F'_{avg} = \mathrm{AvgPool}(F'), \quad F'_{max} = \mathrm{MaxPool}(F')$  (17)
Then, these two feature maps are concatenated along the channel dimension to obtain a two-channel feature map, which is then processed by a convolutional layer with a kernel size of 7 × 7, as shown in Equation (18). The purpose of this convolutional layer is to learn the importance of each spatial position through a sliding window operation in the channel space.
$M_s(F') = \sigma(\mathrm{Conv}^{7 \times 7}([F'_{avg}, F'_{max}]))$  (18)
Finally, the generated spatial attention weights are applied to the channel-enhanced feature map $F'$ to obtain the final enhanced feature map $F''$, as shown in Equation (19). This step weights the feature map by element-wise multiplication, highlighting important positions.
$F'' = M_s(F') \otimes F'$  (19)
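A compact PyTorch sketch of the cascaded channel–spatial attention in Equations (14)–(19) is given below; the reduction ratio r and the 7 × 7 spatial kernel follow common CBAM practice and are assumptions here rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM sketch matching Eqs. (14)-(19): channel attention
    followed by spatial attention."""

    def __init__(self, channels, r=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP for the channel attention branch, Eq. (15)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        # 7x7 convolution for the spatial attention branch, Eq. (18)
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        # Eqs. (14)-(16): channel attention
        f_avg = f.mean(dim=(2, 3), keepdim=True)              # AvgPool
        f_max = f.amax(dim=(2, 3), keepdim=True)              # MaxPool
        mc = self.sigmoid(self.mlp(f_avg) + self.mlp(f_max))
        f1 = mc * f                                            # F' = Mc(F) ⊗ F
        # Eqs. (17)-(19): spatial attention
        s_avg = f1.mean(dim=1, keepdim=True)
        s_max = f1.amax(dim=1, keepdim=True)
        ms = self.sigmoid(self.spatial(torch.cat([s_avg, s_max], dim=1)))
        return ms * f1                                         # F'' = Ms(F') ⊗ F'
```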

3.3.2. SPPF-CBAM Module Design

The SPPF-CBAM module is a composite module that introduces the CBAM (Convolutional Block Attention Module) attention mechanism on the basis of the original SPPF (Spatial Pyramid Pooling Fast) structure, aiming to enhance the multi-scale contextual perception capability and key region attention capability of feature maps. This module achieves the integration of multi-scale pooling with channel–spatial dual attention mechanisms through cascaded fusion, leveraging the advantages of feature pyramid structures in scale robustness and CBAM’s performance in dynamic feature calibration. This constructs a cascaded structure of “multi-scale feature fusion + attention-guided enhancement,” effectively improving the network’s perception and representation capabilities for target regions.
Traditional SPPF modules reduce computational cost through channel compression (Channel Reduction), but this may lead to the loss of fine-grained feature information (such as the edge textures of small objects). To address this issue, this work proposes a Channel Preservation Strategy, which cancels the dimensionality reduction operation in SPPF, directly retains the original number of channels for multi-scale feature fusion, and then dynamically calibrates the high-dimensional features through the CBAM attention mechanism. The structure of the improved SPPF-CBAM module is shown in Figure 6. Given the input feature Fin, the number of channels is first kept unchanged through a convolutional layer, as shown in Equation (20).
$F_{base} = \mathrm{Conv}_{C \rightarrow C}(F_{in})$  (20)
Subsequently, multi-scale features $\{F_{pool1}, F_{pool2}, F_{pool3}\}$ are generated through cascaded max pooling and concatenated with $F_{base}$ along the channel dimension, as shown in Equation (21).
$F_{concat} = \mathrm{Concat}(F_{base}, F_{pool1}, F_{pool2}, F_{pool3}) \in \mathbb{R}^{4C \times H \times W}$  (21)
To avoid the computational burden caused by directly inputting the high-dimensional features (4C) after concatenation into subsequent networks, lightweight convolution is used to compress them to the target number of channels C″, as shown in Equation (22).
$F_{fused} = \mathrm{Conv}_{4C \rightarrow C''}(F_{concat})$  (22)
Finally, channel–spatial dual attention calibration is performed on $F_{fused}$ through the CBAM module, as shown in Equation (23).
$F_{out} = \mathrm{CBAM}(F_{fused})$  (23)
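Putting Equations (20)–(23) together, a minimal PyTorch sketch of the SPPC module is shown below, reusing the conv_bn_silu and CBAM sketches from the previous subsections. The 5 × 5 pooling kernel mirrors YOLOv5’s SPPF, and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPPC(nn.Module):
    """SPPF-CBAM (SPPC) sketch following Eqs. (20)-(23): channel-preserving
    1x1 conv, three cascaded max poolings, concatenation to 4C channels,
    compression to the target width, then CBAM calibration."""

    def __init__(self, c_in, c_out, pool_kernel=5):
        super().__init__()
        self.cv_base = conv_bn_silu(c_in, c_in, 1)             # Eq. (20), C -> C
        self.pool = nn.MaxPool2d(pool_kernel, stride=1,
                                 padding=pool_kernel // 2)
        self.cv_fuse = conv_bn_silu(4 * c_in, c_out, 1)        # Eq. (22), 4C -> C''
        self.cbam = CBAM(c_out)                                # Eq. (23)

    def forward(self, x):
        base = self.cv_base(x)
        p1 = self.pool(base)             # cascaded poolings give growing
        p2 = self.pool(p1)               # effective receptive fields (5, 9, 13)
        p3 = self.pool(p2)
        fused = self.cv_fuse(torch.cat([base, p1, p2, p3], dim=1))  # Eq. (21)
        return self.cbam(fused)
```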

3.4. Decoupled Detection Head

YOLOv5 employs a single detection head that unifies target class prediction and bounding box (BBox) regression within the same branch through shared parameter joint modeling. However, due to differences in objective functions and optimization directions between classification and regression tasks, unified processing may lead to mutual interference between tasks, thereby affecting detection accuracy. This issue is particularly prominent in small-scale object detection, as it imposes higher requirements for precise target localization and class discrimination.
To address the mutual interference problem caused by shared parameters between regression and classification tasks in the detection head, this paper introduces a Decoupled Head [24] based on the original model, replacing the coupled detection head in the original YOLOv5 with a decoupled structure containing two parallel branches. Among them, one branch focuses on regression tasks (predicting target position and size), while the other branch focuses on classification tasks (predicting target class). As shown in Figure 7, the traditional coupled detection head concentrates confidence, class probability, and bounding box regression results in the same output channel, while the decoupled detection head compresses the channel dimension through 1 × 1 convolution and then introduces two independent 3 × 3 convolutional branches to handle their respective tasks separately, thereby achieving explicit separation of prediction tasks. This structure not only improves detection accuracy but also effectively accelerates model training convergence speed and enhances model generalization capability, particularly suitable for small object detection scenarios.
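A single-scale sketch of such a decoupled head is shown below, reusing the conv_bn_silu helper from Section 3.2.2. The hidden width, anchor count, and output layout are illustrative assumptions rather than the exact DDHead configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Decoupled head sketch: a 1x1 stem compresses channels, then separate
    3x3 branches predict class scores and box/objectness values."""

    def __init__(self, c_in, num_classes, num_anchors=3, hidden=256):
        super().__init__()
        self.stem = conv_bn_silu(c_in, hidden, 1)
        # Classification branch
        self.cls_conv = conv_bn_silu(hidden, hidden, 3)
        self.cls_pred = nn.Conv2d(hidden, num_anchors * num_classes, 1)
        # Regression branch (4 box offsets + 1 objectness per anchor)
        self.reg_conv = conv_bn_silu(hidden, hidden, 3)
        self.reg_pred = nn.Conv2d(hidden, num_anchors * 5, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_pred(self.cls_conv(x))   # (B, A*num_classes, H, W)
        reg_out = self.reg_pred(self.reg_conv(x))   # (B, A*5, H, W)
        return cls_out, reg_out
```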

4. Results and Analysis

4.1. Experimental Datasets

4.1.1. RGBT-Tiny Dataset

This study employs a subset of the visible-light channel from the RGBT-Tiny dataset for experimentation. This subset comprises 115 paired sequences, 93,000 visible-light frames, and 1.2 million manually annotated instances (see Figure 8), covering seven target categories and eight typical scenarios. Compared to existing drone aerial target detection datasets, it offers greater representativeness in terms of the scale distribution of multi-category small targets and background complexity, resulting in a more challenging testing benchmark for detection algorithms. As illustrated by the visible-light example frames in Figure 8, the diversity of object shapes and background environments is evident, with over 81% of objects measuring less than 16 × 16 pixels. To further characterize the scale distribution, annotated objects in this subset are categorized into four size intervals based on the pixel area of their bounding boxes: extremely small objects (area 1²–8² pixels) account for approximately 22%; micro-objects (8²–16² pixels) account for approximately 59%; small objects (16²–32² pixels) represent approximately 16%; and medium/large objects (larger than 32² pixels) constitute roughly 3%. A comprehensive statistical analysis reveals that the first three categories (extremely small, micro, and small targets) collectively account for approximately 97% of all annotations. This indicates that the subset predominantly comprises small-sized targets, offering a representative and more challenging experimental benchmark for evaluating the performance of visible-light small target detection, fusion, and tracking algorithms.

4.1.2. VisDrone2019 Dataset

The VisDrone2019 dataset [43] is a UAV aerial dataset for computer vision research, collected by the AISKYEYE team from the Machine Learning and Data Mining Laboratory at Tianjin University. This dataset is widely used in object detection, single object tracking, multi-object tracking, and crowd counting tasks. The VisDrone2019 dataset contains 288 video clips and 10,209 static images, with a total of 261,908 frames (as shown in Figure 9). These images and videos were captured by multiple UAV cameras under various scenarios, weather conditions, and lighting conditions in different regions of China, covering diverse environments such as urban and rural areas. The dataset contains over 2,600,000 manually annotated bounding boxes, mainly including categories such as pedestrians, cars, bicycles, and tricycles. With its rich scene variations and target density, this dataset provides a challenging evaluation benchmark for object detection and tracking algorithms on UAV platforms.

4.2. Environment and Parameter Settings

The workstation configuration is as follows: the operating system is Windows 11, the processor is an AMD Ryzen 5600G, the GPUs are two NVIDIA GeForce RTX 4090 cards, and the software environment is Python 3.11.9 with CUDA 12.2. YOLO-SR was trained from scratch without pre-trained weights. The input size was set to 640 × 640, the batch size to 64, and training was conducted for 300 epochs. For input data augmentation, horizontal flipping with a probability of 0.5 was applied during training. The initial learning rate was set to 1 × 10⁻². The SGD optimizer was used with an initial momentum of 0.937 and a weight decay coefficient of 5 × 10⁻⁴.
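For reference, the training hyperparameters listed above can be collected as follows; the dictionary layout and key names are our own convention, not the authors’ configuration file.

```python
# Training settings from Section 4.2, gathered into a plain Python dict;
# key names are illustrative, not taken from the released configuration.
train_config = {
    "input_size": (640, 640),
    "batch_size": 64,
    "epochs": 300,
    "optimizer": "SGD",
    "initial_lr": 1e-2,
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "augmentation": {"horizontal_flip_p": 0.5},
    "pretrained_weights": None,   # trained from scratch, no COCO pre-training
}
```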
To validate YOLO-SR’s adaptability in specific scenarios, we trained the model from scratch rather than using pre-trained weights from general datasets like COCO. Although this strategy increased convergence iterations by approximately 30% and resulted in greater fluctuations in the mAP curve during the first 50 epochs, it avoided the prior biases inherent in general datasets for small object categories. This ultimately improved the final mAP0.5 by 2.1 percentage points. We have included the convergence curve in the appendix for reference. Future plans include incorporating transfer learning based on remote sensing/low-altitude aerial photography data, combined with self-distillation or parameter-efficient fine-tuning (PEFT) methods, to further reduce training time.

4.3. Evaluation Metrics

This study adopts a standardized quantitative evaluation system in the field of object detection, comprehensively evaluating algorithm performance through six core metrics: mean average precision (mAP), precision, recall, model parameters (Params), floating-point operations (GFLOPs), and frames per second (FPS). The specific definitions are as follows:
  • Precision represents the reliability of a model’s prediction as a positive sample. The calculation formula is as follows:
    $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
    where TP (true positive) is the number of correctly detected positive samples, and FP (false positive) is the number of incorrectly detected samples.
  • Recall reflects the model’s ability to cover true positive samples. The calculation formula is as follows:
    $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
    FN (false negative) represents the number of true positive samples that were missed.
  • F1-Score: As a core metric for comprehensively evaluating a model’s detection precision and coverage capability, F1-Score represents the harmonic mean of precision and recall. Its definition is based on the statistics of true positives (TP), false positives (FP), and false negatives (FN) in object detection tasks:
    $\mathrm{F1\text{-}Score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
    The F1-Score is in the range of [0, 1], with values closer to 1 indicating superior overall performance in both “accurate target identification” and “comprehensive target coverage.” This metric is particularly well-suited for datasets with a high proportion of small targets, as in this study, effectively reflecting the model’s balanced detection capability for small objects.
  • Mean average precision (mAP): To comprehensively evaluate model robustness, this study calculates the average precision at IoU (intersection over union) thresholds of 0.5 and 0.75, denoted as mAP0.5 and mAP0.75, respectively. The final model performance is characterized by the global mean average precision across categories (mAP), and its calculation method is as follows:
    $\mathrm{mAP} = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$
    where N is the total number of categories and $AP_i$ is the average precision of the i-th category. The mAP value is in the range of [0, 1], and a higher value indicates better overall detection performance of the model.
  • Intersection over union (IOU): The intersection over union is used to measure the degree of overlap between predicted boxes and ground truth boxes. Its mathematical expression is as follows:
    $\mathrm{IoU} = \dfrac{\mathrm{Area}(\mathrm{PredictedBox} \cap \mathrm{TruthBox})}{\mathrm{Area}(\mathrm{PredictedBox} \cup \mathrm{TruthBox})}$
  • Model parameters (Params): The total number of all trainable weight parameters in the model, which directly determines the model’s complexity and hardware resource requirements.
  • Floating-point operations (GFLOPs) measure the number of billions of floating-point operations required for a single forward pass of the model, representing its computational complexity.
  • Frames per second (FPS) represents the number of image frames processed per second by the model, used to evaluate the real-time performance of the model.
This evaluation framework provides a quantifiable theoretical basis for algorithm optimization by balancing detection accuracy and coverage.
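As a reference for how the scalar metrics above are computed, a minimal Python sketch is given below. Box coordinates follow an (x1, y1, x2, y2) convention of our choosing, and the confidence-ranked precision–recall integration needed for AP/mAP is omitted.

```python
# Reference implementations of the precision, recall, F1, and IoU formulas above.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Example: a prediction counted as a true positive at the IoU >= 0.5 threshold.
print(iou((10, 10, 40, 40), (15, 15, 45, 45)))  # ≈ 0.53
```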

4.4. Experimental Results and Analysis

4.4.1. Comparative Experiments

This paper proposes the YOLO-SR object detection method targeting the saliency features of small objects in the RGBT-Tiny dataset. To qualitatively analyze the effectiveness and rationality of this method, we compare our approach with mainstream deep learning object detection networks such as Faster-RCNN [43], YOLOv5s [12], SSD [44], and TOOD [45]. We train our improved network and the aforementioned networks using the same dataset, test all trained networks using the same test set, and evaluate the detection results using the same evaluation method. The results are shown in Table 2.
To comprehensively evaluate YOLO-SR’s detection performance and computational efficiency across different input resolutions, this study conducts comparative experiments at two input resolutions: 640 × 640 and 1280 × 1280. At 640 × 640 (the standard input size for YOLO-series algorithms), YOLO-SR achieves an inference speed of 150 FPS, outperforming YOLOv5s (142.9 FPS) by approximately 5% and fully meeting real-time detection requirements. Compared to typical small object detection methods such as PANet (72.5 FPS), AugFPN (74.8 FPS), GFL (71.5 FPS), SNIPER (60.8 FPS), and M2Det (48.2 FPS), YOLO-SR achieves speed improvements of 106.9%, 100.5%, 109.8%, 146.7%, and 211.3%, respectively. Additionally, with a computational load of 37.7 GFLOPs, the model’s parameter count is only 4 M, representing reductions of 84.1% and 42.9% compared to SSD (25.2 M) and YOLOv5s (7 M), respectively; compared to small object detection methods with 31.8–48.2 M parameters, this represents reductions of 87.4–91.7%, while still enabling fair and reproducible comparisons against mainstream algorithms.
Considering that over 81% of objects in drone aerial scenes are smaller than 16 × 16 pixels, the 1280 × 1280 high-resolution input preserves more detail and enhances feature representation for small objects. At this resolution, YOLO-SR maintains an inference speed of 59.9 FPS, outperforming PANet (20.3 FPS), AugFPN (21.0 FPS), GFL (20.0 FPS), SNIPER (17.0 FPS), and M2Det (13.5 FPS) by 195.1–343.7%, significantly improving accuracy while ensuring real-time performance. The two resolutions form a complementary trade-off between resolution scaling and computational complexity: although 1280 × 1280 has four times the pixels of 640 × 640, YOLO-SR maintains approximately 40% of its frame rate at high resolution (59.9 vs. 150 FPS), aligning with the theoretical expectation of quadratic complexity growth. This demonstrates an efficient trade-off between accuracy and speed, catering to both edge real-time and high-precision offline application needs.
Experimental results demonstrate YOLO-SR’s outstanding performance across both resolutions: mAP0.5 reaches 95.1%, surpassing DNAnet (4.7 M parameters) by 90.3 percentage points and YOLOv5s by 11.5 percentage points. Compared to GFL (56.2%), SNIPER (54.6%), PANet (52.3%), AugFPN (51.7%), and M2Det (49.8%), mAP0.5 improves by 38.9–45.3 percentage points. mAP0.5:0.95 reaches 65%, significantly outperforming YOLOv5s (49.1%) and Cascade R-CNN (30.1%) and exceeding small object detection methods (31.2–37.3%) by 27.7–33.8 percentage points; mAP0.75 reaches 73.2%, 2.3 times that of Deformable DETR (32%). Notably, the F1-Score (IoU = 0.5), which integrates precision and recall, reaches 93.9%, exceeding YOLOv5s (84.4%) by 9.5 percentage points and surpassing Faster R-CNN (68.1%) and RetinaNet (58.6%) by 25.8 and 35.3 percentage points, respectively. This demonstrates that YOLO-SR not only leads in individual metrics but also maintains a clear advantage in overall detection quality. Overall, YOLO-SR’s parameter count is only 17–32% of the compared models, equivalent to approximately 8.3–12.6% of various small object detection methods, and GFLOPs are reduced by 37–86%, a 60.0–73.1% decrease compared to small object detection methods requiring 95–140 GFLOPs. Therefore, the 640 × 640 and 1280 × 1280 resolution settings validate YOLO-SR’s comprehensive advantages across accuracy, F1-Score, and efficiency dimensions.
This provides an efficient and high-precision solution for small object detection in UAVs, particularly suitable for deployment on resource-constrained edge devices.
The detection performance of objects at different scales in the RGBT-Tiny dataset is compared as shown in Table 3.
In Table 3, we conduct a granular evaluation of detection performance on the RGBT-Tiny dataset, focusing on target scale. Specifically, mAP0.5(xs) denotes mAP@0.5 on the extra-small object subset (area 1²–8² pixels), mAP0.5(vs) denotes mAP@0.5 on the very-small object subset (8²–16² pixels), mAP0.5(s) corresponds to mAP@0.5 on the small object subset (16²–32² pixels), and mAP0.5(m/l) measures detection accuracy on the medium/large object subset (area > 32² pixels). Together, these four metrics comprehensively characterize the model’s detection capability across different object scales, making them particularly suitable for evaluating datasets dominated by small objects. The results in the table indicate that the conventional two-stage detector Faster R-CNN attains only 24.7% and 45.3% mAP0.5 for extremely small and small targets, respectively, demonstrating a substantial deficiency in its ability to perceive fine-grained small objects in complex UAV scenarios. As the network architecture evolves from YOLOv5s to YOLOv8s and YOLOv10s, detection accuracy improves progressively across scales; however, significant performance limitations persist for extremely small objects.
In contrast to these models, the proposed YOLO-SR achieves mAP0.5 values of 88.5%, 97.3%, and 98.4% on the extra-small, very-small, and small subsets, respectively, demonstrating substantial improvements over YOLOv5s. This validates the efficacy of the C2f-SA and SPPF-CBAM modules in enhancing the model’s capacity to capture low-resolution details and multi-scale contextual information. Concurrently, YOLO-SR attains an mAP0.5(m/l) of 96.2% for medium/large targets, effectively matching the performance of YOLOv10s. This indicates that, despite its optimization for small targets, YOLO-SR does not compromise detection performance for larger objects. In conclusion, YOLO-SR’s considerable advantage in detecting small-scale targets is consistent with its design objectives, further validating the efficacy and robustness of the method for small object detection tasks in UAV applications.
The performance of YOLO-SR and other networks on the VisDrone2019 dataset is shown in Table 4.
Experimental results show that YOLO-SR achieves significant improvements in comprehensive detection performance: in terms of precision, YOLO-SR reaches 42.5%, with an absolute improvement of 8.6 percentage points (relative improvement of 25.4%) compared to YOLOv5s’s 33.9%, while surpassing traditional detectors such as RetinaNet (39.2%); in terms of recall, YOLO-SR reaches 34.7%, improving by 5.4 percentage points (relative improvement of 18.4%) compared to YOLOv5s’s 29.3% and also outperforming CenterNet’s performance (32.6%). On the comprehensive evaluation metric, YOLO-SR’s mAP0.5 is 31.7%, comparable to RetinaNet (31.67%) and significantly higher than those of YOLOv5s (24.5%) and Faster RCNN (20.0%); under the stricter mAP0.5:0.95 evaluation standard, YOLO-SR achieves 17.5%, higher than RetinaNet (16.3%) and CenterNet (14.2%), maintaining a leading advantage over all comparison algorithms. These results fully demonstrate YOLO-SR’s stability and robustness in complex drone scenarios, successfully balancing detection accuracy and localization precision, showing obvious technical advantages.

4.4.2. Ablation Studies

YOLO-SR is improved based on YOLOv5s. Ablation experiments were conducted on the three improvement methods proposed in this study on the RGBT-Tiny dataset, and the results are shown in Table 5.
As shown in Table 2, introducing modules such as C2f-SA and SPPC to YOLOv5 reduces the total parameter count from approximately 7 million in the original architecture to 4 million. This does not constitute “adding modules while increasing parameters,” as these modules are restructured rather than simply stacked: C2f-SA replaces the original C3’s 3 × 3 convolutional chain with narrower channel combinations through multi-branch input sharing, local bottlenecks, and lightweight attention; SPPC replaces the higher-parameter SPPF with attention-weighted pyramid pooling, significantly compressing the channel mapping matrix. In other words, the new modules themselves are lightweight while replacing heavier original components. Thus, despite the seemingly more complex architecture, the overall weight matrix size actually shrinks, leading to concurrent reductions in inference FLOPs and latency.
To facilitate intuitive comparison during evaluation, we have compiled the parameter counts for each submodule before and after replacement, as shown in Table 6.
After replacing the feature extraction module from C3 with C2f-SA, the parameter count decreased from 2.1 million to 1.6 million, a 23.8% compression within a single module. In addition, the YOLO-SR backbone removes one downsampling stage (a CBS module) relative to YOLOv5s, which both preserves deep features of small targets and is the primary factor in the overall reduction from 7 M to 4 M parameters. Although upgrading the feature fusion module from SPPF to SPPC and replacing the detection head with DDect increase parameters (2.62 → 3.67 M and 0.80 → 1.69 M, respectively), the new architecture markedly reduces redundant computation through sub-branch sharing and semantic-stage redistribution, yielding a net reduction of approximately 3 M parameters. Overall, streamlining the backbone while moderately enlarging the back end maintains detection performance while capping the total parameter count at 4 million.
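The arithmetic behind this budget can be tallied directly from Table 6 and the totals in Table 2, as in the short sketch below; the residual saving attributed to the streamlined backbone is inferred from the published totals rather than measured separately.

```python
# Minimal sketch: tally the per-module parameter changes from Table 6 and infer the
# remaining savings attributable to the streamlined backbone (removed CBS stage).
# All figures are the paper's reported values in millions of parameters.
module_changes = {
    "C3 -> C2f-SA":  (2.10, 1.60),
    "SPPF -> SPPC":  (2.62, 3.67),
    "Head -> DDect": (0.80, 1.69),
}

total_before, total_after = 7.0, 4.0   # from Table 2 (YOLOv5s vs. YOLO-SR)

module_delta = sum(new - old for old, new in module_changes.values())
backbone_delta = (total_after - total_before) - module_delta  # inferred, not measured

for name, (old, new) in module_changes.items():
    print(f"{name:14s}: {old:.2f} M -> {new:.2f} M ({new - old:+.2f} M)")
print(f"net module change : {module_delta:+.2f} M")
print(f"inferred backbone : {backbone_delta:+.2f} M (removed downsampling stage, etc.)")
print(f"overall           : {total_before:.1f} M -> {total_after:.1f} M")
```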
The ablation study shows that the progressive introduction of each module yields significant and complementary gains in YOLO-SR’s performance. The baseline YOLOv5 model achieves 83.6% mAP0.5 and 50.9% mAP0.75. Introducing the C2f-SA module (cross-stage feature fusion combined with Shuffle Attention) raises mAP0.5 to 88.8% (an improvement of 5.2 percentage points), while mAP0.75 drops to 46.1%, indicating that although the attention mechanism enhances feature discriminability, localization under high IoU thresholds can suffer. Adding the SPPF-CBAM (SPPC) module maintains a high recall (84.6% vs. 83.4%) while lifting mAP0.5 and mAP0.75 to 90.0% and 52.7%, respectively, confirming that multi-scale feature fusion restores localization accuracy. Finally, when SPPC and the Decoupled Detect head (DDect) are introduced together, mAP0.5 reaches 95.1% (+11.5 percentage points over the baseline), mAP0.75 reaches 73.2% (+22.3 percentage points), and precision and recall rise to 96.7% and 91.3%, respectively, demonstrating that the decoupled detection head effectively resolves the misalignment between classification confidence and localization quality through feature recalibration and scale-adaptive mechanisms. These results show that the hierarchical combination of C2f-SA, SPPC, and DDect synergistically optimizes feature representation, multi-scale perception, and detection-head adaptability, achieving the best balance between accuracy and robustness.
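To make the role of the decoupled head concrete, the sketch below separates classification from box regression in two parallel branches, which is the general idea behind such heads. The channel widths, anchor count, and the inclusion of an objectness output are illustrative assumptions and not the exact DDect structure.

```python
# Minimal PyTorch sketch of a decoupled detection head: classification and box
# regression run through separate branches instead of sharing one output conv.
# Channel widths and the number of stacked convolutions are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.SiLU())
        # Classification branch
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * num_classes, 1),
        )
        # Regression branch: 4 box offsets + 1 objectness score per anchor
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * (4 + 1), 1),
        )

    def forward(self, x):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)

if __name__ == "__main__":
    head = DecoupledHead(in_ch=128, num_classes=7)   # RGBT-Tiny covers 7 categories
    cls_out, reg_out = head(torch.randn(1, 128, 80, 80))
    print(cls_out.shape, reg_out.shape)              # [1, 21, 80, 80] [1, 15, 80, 80]
```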

4.4.3. Visualization Analysis of Experimental Results

As shown in Figure 10, the four curves illustrate the advantages of YOLO-SR over the baseline detectors on the RGBT-Tiny dataset and the progressive nature of the improvements. Starting from the YOLOv5s baseline, mAP0.5, mAP0.5:0.95, precision, and recall reach 83.6%, 49.1%, 91.4%, and 78.2%, respectively. Upgrading the backbone to YOLOv8s raises these metrics to 86.5%, 52.3%, 92.8%, and 82.4%, confirming the benefit of deeper feature representation. YOLOv10s, which further introduces large convolutional receptive fields and separable convolution structures, lifts overall accuracy to 89.2%, 55.7%, 94.2%, and 87.3%, reflecting the synergy of higher-order feature filtering and fusion. Finally, when the lightweight C2f-SA backbone, the multi-scale hierarchical feature recalibration of SPPF-CBAM (SPPC), and the decoupled detection head DDect are integrated in YOLO-SR, mAP0.5 rises to 95.1% (an 11.5 percentage point gain over YOLOv5s), mAP0.5:0.95 increases to 65.0% (+15.9 percentage points), and precision and recall reach 96.7% and 91.3%, respectively, achieving the best balance between detection accuracy and robustness. These comparisons demonstrate YOLO-SR’s comprehensive improvements in feature reuse, scale adaptation, and localization quality, with stronger discrimination and recall for small objects in complex background scenarios.
Figure 11 shows, from top to bottom, the source image (Source), the ground-truth annotation (GT), and the detection results of YOLOv5s, YOLOv8s, YOLOv10s, and the proposed YOLO-SR. YOLO-SR detects more small objects and produces tighter bounding boxes in complex scenes such as sea-surface vessels, nighttime traffic flow, and distant buildings, whereas the three baseline models exhibit missed detections, bounding-box offsets, and false detections. These qualitative results further verify the robustness and superior detection performance of YOLO-SR under dense small objects, lighting variations, and background clutter.

4.5. Analysis of Model Limitations and Future Directions for Improvement

Although YOLO-SR demonstrates outstanding performance in detecting small objects in drone aerial images, an in-depth analysis of the experimental results and visualizations (as shown in Figure 11) reveals the following limitations of this method:
(1)
Limited detection capability for extremely small targets. For targets smaller than 8 × 8 pixels, the model exhibits some false negatives. The primary reasons are as follows: after multiple downsampling iterations, the effective pixel information for extremely small targets becomes extremely sparse in deep feature maps, making it difficult to form discriminative feature representations; the C2f-SA and SPPF-CBAM modules have limited effectiveness in enhancing features for even smaller-scale targets.
(2)
Object discrimination capability in densely occluded scenes requires improvement. When multiple small objects are highly overlapped or partially occluded, the model still struggles to precisely separate adjacent object boundaries, occasionally producing overlapping bounding boxes. The primary reasons are as follows: overlapping receptive fields on the feature maps make it difficult to distinguish the independent boundaries of different objects, and the current multi-scale feature fusion strategy lacks an explicit mechanism for modeling the interrelationships among densely packed objects.
(3)
Cross-domain generalization capability requires strengthening. Performance degrades to some extent when the model is transferred to other drone aerial datasets (mAP@0.5 drops by approximately 3–5 percentage points). Key reasons include the following: the model design is optimized primarily for the distribution characteristics of specific datasets and may overfit training-data patterns; domain-to-domain variations across different aerial scenarios are insufficiently considered; and the model lacks a domain adaptation mechanism.
(4)
Further optimization is needed to balance real-time performance and accuracy. Deployment on resource-constrained edge devices further reduces inference speed. Key reasons include the following: the Shuffle Attention mechanism in the C2f-SA module incurs noticeably higher computational overhead at higher input resolutions; the multi-scale pooling operations in the SPPF-CBAM module add computational burden; and the dual-branch structure of the decoupled detection head adds approximately 15–20% computational overhead compared with a single-branch design.
To address these limitations, future work will focus on the following improvements:
(1)
For ultra-small object detection, we will explore progressive feature enhancement strategies based on feature super-resolution, combined with knowledge distillation techniques, to enhance the model’s perception of pixel-level small targets.
(2)
For densely occluded scenes, we will introduce a target relationship modeling mechanism based on Graph Neural Networks (GNNs) to explicitly model spatial relationships and occlusion between densely packed targets, improving the model’s object separation capability in complex scenarios.
(3)
For cross-domain generalization challenges, we will design domain-adaptive training strategies. By incorporating adversarial training mechanisms or domain-adversarial loss functions, the model will learn domain-invariant feature representations. Concurrently, we will explore fast adaptation methods based on meta-learning.
(4)
To balance real-time performance and accuracy, we will investigate model pruning and quantization techniques. Research will focus on automated lightweight design methods using neural architecture search (NAS) to achieve a better equilibrium between precision and efficiency.

5. Conclusions

This paper proposes a lightweight algorithm YOLO-SR to address the challenges of small object detection in drone aerial images. The algorithm innovates in the following three aspects: designing a lightweight feature extraction module C2f-SA that enhances small object feature extraction capability while reducing parameter count; constructing a spatial pyramid pooling attention module SPPF-CBAM that effectively fuses multi-scale contextual information and alleviates object occlusion and background interference issues; and introducing a decoupled detection head that separates classification and regression tasks, thus improving localization accuracy for dense small objects. Experimental results demonstrate that YOLO-SR achieves excellent performance on both the RGBT-Tiny and VisDrone2019 datasets. Future work will focus on model structure optimization, compression and acceleration techniques, and generalization capability enhancement.

Author Contributions

Conceptualization, S.L. and M.X.; Methodology, S.L. and Q.T.; Software, S.L. and G.L.; Supervision, X.F. and H.Z.; Validation, S.L. and X.F.; Writing–original draft, S.L.; Writing–review and editing, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported by the Youth Innovation Promotion Association CAS (Grant No. 2023419).

Institutional Review Board Statement

This study does not involve human or animal experiments, so no institutional review board approval is required.

Informed Consent Statement

This study does not involve human subjects, so informed consent is not applicable.

Data Availability Statement

The RGBT-Tiny dataset used in this study is publicly available at https://github.com/XinyiYing/RGBT-Tiny (accessed on 28 October 2025). This dataset is a large-scale benchmark for visible–thermal tiny object detection, containing 115 image sequences, approximately 93,000 frames, and about 1.2 million manual annotations, covering 7 object categories and 8 scene types.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Y. Unmanned Aerial Vehicles based low-altitude economy with lifecycle techno-economic-environmental analysis for Sustainable and Smart Cities. J. Clean. Prod. 2025, 499, 145050. [Google Scholar] [CrossRef]
  2. Sun, X.; Wang, S.; Zhang, X.; Wandelt, S. LAERACE: Taking the policy fast-track towards low-altitude economy. J. Air Transp. Res. Soc. 2025, 4, 100058. [Google Scholar] [CrossRef]
  3. Li, X. Development Path of Low-Altitude Logistics and Construction of Industry-Education Integration Community from the Perspective of New Quality Productive Forces. Ind. Sci. Eng. 2024, 10, 44–51. [Google Scholar] [CrossRef]
  4. Byun, S.; Shin, I.K.; Moon, J.; Kang, J.; Choi, S.I. Road traffic monitoring from UAV images using deep learning networks. Remote Sens. 2021, 13, 4027. [Google Scholar] [CrossRef]
  5. Božić-Štulić, D.; Marušić, Ž.; Gotovac, S. Deep learning approach in aerial imagery for supporting land search and rescue missions. Int. J. Comput. Vis. 2019, 127, 1256–1278. [Google Scholar] [CrossRef]
  6. ElTantawy, A.; Shehata, M.S. Local null space pursuit for real-time moving object detection in aerial surveillance. Signal Image Video Process. 2020, 14, 87–95. [Google Scholar] [CrossRef]
  7. Mauri, A.; Khemmar, R.; Decoux, B.; Ragot, N.; Rossi, R.; Trabelsi, R.; Boutteau, R.; Ertaud, J.-Y.; Savatier, X. Deep learning for real-time 3D multi-object detection, localisation, and tracking: Application to smart mobility. Sensors 2020, 20, 532. [Google Scholar] [CrossRef]
  8. Ye, Y.; Deng, Z.; Huang, X. A novel detector for range-spread target detection based on HRRP-pursuing. Measurement 2024, 231, 114579. [Google Scholar] [CrossRef]
  9. Dixit, K.S.; Chadaga, M.G.; Savalgimath, S.S.; Rakshith, G.R.; Kumar, M.N. Evaluation and evolution of object detection techniques YOLO and R-CNN. Int. J. Recent Technol. Eng. IJRTE 2019, 8, 2S3. [Google Scholar]
  10. Chen, C.; Zheng, Z.; Xu, T.; Guo, S.; Feng, S.; Yao, W.; Lan, Y. Yolo-based uav technology: A review of the research and its applications. Drones 2023, 7, 190. [Google Scholar] [CrossRef]
  11. Kang, C.H.; Kim, S.Y. Real-time object detection and segmentation technology: An analysis of the YOLO algorithm. JMST Adv. 2023, 5, 69–76. [Google Scholar] [CrossRef]
  12. Liu, Y.; Zhu, J.; Ma, H. Improving the Vehicle Small Object Detection Algorithm of Yolov5. Int. J. Eng. Technol. Innov. 2025, 15, 57. [Google Scholar]
  13. Li, X.; Xu, Z.; Liu, Q.; Xue, W.; Yue, G.; Wang, S. Research on YOLOX for small target detection in infrared aerial photography based on NAM channel attention mechanism. In Proceedings of the Conference on Infrared, Millimeter, Terahertz Waves and Applications (IMT2022), Shanghai, China, 8–10 November 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12565, pp. 983–992. [Google Scholar]
  14. Mao, G.; Liao, G.; Zhu, H.; Sun, B. Multibranch attention mechanism based on channel and spatial attention fusion. Mathematics 2022, 10, 4150. [Google Scholar] [CrossRef]
  15. Yang, Y.; Han, J. Real-Time object detector based MobileNetV3 for UAV applications. Multimed. Tools Appl. 2023, 82, 18709–18725. [Google Scholar] [CrossRef]
  16. Su, Z.; Hu, C.; Hao, J.; Ge, P.; Han, B. Target Detection in Single-Photon Lidar Using CNN Based on Point Cloud Method. Photonics 2024, 11, 43. [Google Scholar] [CrossRef]
  17. Yuan, S.; Qiu, Z.; Li, P.; Hong, Y. RMAU-Net: Breast Tumor Segmentation Network Based on Residual Depthwise Separable Convolution and Multiscale Channel Attention Gates. Appl. Sci. 2023, 13, 11362. [Google Scholar] [CrossRef]
  18. Fei, K.; Li, Q.; Zhu, C. Non-technical losses detection using missing values’ pattern and neural architecture search. Int. J. Electr. Power Energy Syst. 2022, 134, 107410. [Google Scholar] [CrossRef]
  19. Wang, H.; Wang, T. Multi-scale residual aggregation feature pyramid network for object detection. Electronics 2022, 12, 93. [Google Scholar] [CrossRef]
  20. Li, Z.; He, Q.; Zhao, H.; Yang, W. Doublem-net: Multi-scale spatial pyramid pooling-fast and multi-path adaptive feature pyramid network for UAV detection. Int. J. Mach. Learn. Cybern. 2024, 12, 5781–5805. [Google Scholar] [CrossRef]
  21. Xiao, X.; Feng, X. Multi-object pedestrian tracking using improved YOLOv8 and OC-SORT. Sensors 2023, 23, 8439. [Google Scholar] [CrossRef]
  22. Dong, Z. Vehicle Target Detection Using the Improved YOLOv5s Algorithm. Electronics 2024, 13, 4672. [Google Scholar] [CrossRef]
  23. Gowthami, N.; Blessy, S.V. Extreme small-scale prediction head-based efficient YOLOV5 for small-scale object detection. Eng. Res. Express 2024, 6, 025007. [Google Scholar] [CrossRef]
  24. Deng, L.; Luo, S.; He, C.; Xiao, H.; Wu, H. Underwater small and occlusion object detection with feature fusion and global context decoupling head-based yolo. Multimed. Syst. 2024, 30, 208. [Google Scholar] [CrossRef]
  25. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef] [PubMed]
  26. Peng, P.; Wang, Q.; Feng, W.; Wang, T.; Tong, C. An SAR imaging and detection model of multiple maritime targets based on the electromagnetic approach and the modified CBAM-YOLOv7 neural network. Electronics 2023, 12, 4816. [Google Scholar] [CrossRef]
  27. Ying, X.; Xiao, C.; Li, R.; He, X.; Li, B.; Li, Z.; Li, M.; Zhao, S.; Liu, L.; Sheng, W. Visible-thermal tiny object detection: A benchmark dataset and baselines. arXiv 2024, arXiv:2406.14482. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Nian, B.; Zhang, Y.; Zhang, Y.; Ling, F. Lightweight multimechanism deep feature enhancement network for infrared small-target detection. Remote Sens. 2022, 14, 6278. [Google Scholar] [CrossRef]
  29. Li, H.; Yang, S.; Zhang, R.; Yu, P.; Fu, Z.; Wang, X.; Kadoch, M.; Yang, Y. Detection of floating objects on water surface using YOLOv5s in an edge computing environment. Water 2023, 16, 86. [Google Scholar] [CrossRef]
  30. Lyu, Y.; Jiang, X.; Xu, Y.; Hou, J.; Zhao, X.; Zhu, X. ARU-GAN: U-shaped GAN based on Attention and Residual connection for super-resolution reconstruction. Comput. Biol. Med. 2023, 164, 107316. [Google Scholar] [CrossRef]
  31. Wang, J.; Wang, J. MHDNet: A Multi-Scale Hybrid Deep Learning Model for Person Re-Identification. Electronics 2024, 13, 1435. [Google Scholar] [CrossRef]
  32. Cao, W.; Li, T.; Liu, Q.; He, Z. PANet: Pluralistic Attention Network for Few-Shot Image Classification. Neural Process. Lett. 2024, 56, 209. [Google Scholar] [CrossRef]
  33. Wang, K.; Liu, Z. BA-YOLO for Object Detection in Satellite Remote Sensing Images. Appl. Sci. 2023, 13, 13122. [Google Scholar] [CrossRef]
  34. Yang, Y.; Zang, B.; Song, C.; Li, B.; Lang, Y.; Zhang, W.; Huo, P. Small object detection in remote sensing images based on redundant feature removal and progressive regression. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  35. Wang, G.; Ding, H.; Yang, Z.; Li, B.; Wang, Y.; Bao, L. TRC-YOLO: A real-time detection method for lightweight targets based on mobile devices. IET Comput. Vis. 2023, 16, 126–142. [Google Scholar] [CrossRef]
  36. Singh, B.; Najibi, M.; Sharma, A.; Davis, L.S. Scale normalized image pyramids with autofocus for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3749–3766. [Google Scholar] [CrossRef]
  37. Li, J.; Liu, Y.; Huang, T. Fast Image Pyramid Construction Using Adaptive Sparse Sampling; IEEE Access: Piscataway, NJ, USA, 2021; Volume 9, pp. 123456–123467. [Google Scholar]
  38. Li, H.; Zhang, J.; Kong, W.; Shen, J.; Shao, Y. CSA-Net: Cross-modal scale-aware attention-aggregated network for RGB-T crowd counting. Expert Syst. Appl. 2023, 213, 119038. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Sun, Y.; Wang, Z.; Jiang, Y. YOLOv7-RAR for urban vehicle detection. Sensors 2023, 23, 1801. [Google Scholar] [CrossRef] [PubMed]
  40. Chen, Y.; Xia, R.; Yang, K.; Zou, K. GCAM: Lightweight image inpainting via group convolution and attention mechanism. Int. J. Mach. Learn. Cybern. 2024, 15, 1815–1825. [Google Scholar] [CrossRef]
  41. Jiao, S.; Xu, F.; Guo, H. Side-Scan Sonar Image Detection of Shipwrecks Based on CSC-YOLO Algorithm. Comput. Mater. Contin. 2025, 82, 3019. [Google Scholar] [CrossRef]
  42. Zhang, Q.; Liu, Y.; Zhang, Y.; Zong, M.; Zhu, J. Improved YOLOv3 integrating SENet and optimized GIoU loss for occluded pedestrian detection. Sensors 2023, 23, 9089. [Google Scholar] [CrossRef]
  43. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, L.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops 2019, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  44. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  45. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; IEEE Computer Society: Piscataway, NY, USA, 2021; pp. 3490–3499. [Google Scholar]
  46. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  47. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  48. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 6154–6162. [Google Scholar]
  49. Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 260–275. [Google Scholar]
  50. Wang, J.; Zhang, W.; Cao, Y.; Chen, K.; Pang, J.; Gong, T.; Loy, C.; Lin, D. Side-aware boundary localization for more precise object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 403–419. [Google Scholar]
  51. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [PubMed]
  52. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  53. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8514–8523. [Google Scholar]
  54. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  55. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Tuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14454–14463. [Google Scholar]
  56. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef] [PubMed]
  57. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
  58. Lee, C.; Park, S.; Song, H.; Ryu, J.; Kim, S.; Kim, H.; Pereira, S.; Yoo, D. Interactive multi-class tiny-object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14136–14145. [Google Scholar]
  59. Li, S.; Wang, Z.; Dai, R.; Wang, Y.; Zhong, F.; Liu, Y. Efficient underwater object detection with enhanced feature extraction and fusion. IEEE Trans. Ind. Inform. 2025, 21, 4904–4914. [Google Scholar] [CrossRef]
  60. Liu, W.; Geng, J.; Zhu, Z.; Zhao, Y.; Ji, C.; Li, C.; Lian, Z.; Zhou, X. Ace-sniper: Cloud–edge collaborative scheduling framework with DNN inference latency modeling on heterogeneous devices. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2024, 43, 534–547. [Google Scholar] [CrossRef]
  61. Tanaka, T.; Hirata, K. Comparison with detection of bacteria from gram stained smears images by various object detectors. In Proceedings of the 2024 16th IIAI International Congress on Advanced Applied Informatics, Takamatsu, Japan, 6–12 July 2024; pp. 58–61. [Google Scholar]
  62. Ma, X.; Yang, X.; Zhu, H.; Wang, X.; Hou, B.; Ma, M.; Wu, Y. Dense-weak ship detection based on foreground-guided background generation network in SAR images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5212616. [Google Scholar] [CrossRef]
  63. Cheng, H.; Zhang, X.; Tan, J.; Wu, L.; Chang, L.; Zhang, X.; Yan, Z. A fruit and vegetable recognition method based on augmsr-cnn. In Proceedings of the 2024 2nd International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA), Changsha, China, 24–27 May 2024; pp. 93–97. [Google Scholar]
  64. Thapa, S.; Han, Y.; Zhao, B.; Luo, S. Enhanced aircraft detection in compressed remote sensing images using cmsff-yolov8. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5650116. [Google Scholar] [CrossRef]
  65. Sun, H.; Yao, G.; Zhu, S.; Zhang, L.; Xu, H.; Kong, J. SOD-YOLOv10: Small object detection in remote sensing images based on YOLOv10. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
Figure 1. YOLO-SR network model.
Figure 2. Shuffle Attention structure.
Figure 3. Feature extraction bottleneck SA-Bottleneck.
Figure 4. Lightweight feature extraction block.
Figure 5. CBAM attention mechanism diagram and attention mechanisms in both the channel and spatial dimensions.
Figure 6. SPPF-CBAM module.
Figure 7. Detection head structure diagram.
Figure 8. RGBT-Tiny dataset examples.
Figure 9. VisDrone dataset examples.
Figure 10. Comparison curves of four metrics between YOLO-SR and YOLOv5s/YOLOv8s/YOLOv10s on RGBT-Tiny: (a) mAP curve; (b) mAP50 curve; (c) Precision curve; (d) Recall curve.
Figure 11. Visualization comparison of detection results between YOLO-SR and YOLOv5s/YOLOv8s/YOLOv10s.
Table 1. Parameter count comparison between C2f-SA module and original C3 module.

Module Type   | Parameters (M) | Computational Load (GFLOPs) | Inference Speed (FPS)
C3 Module     | 2.1            | 8.5                         | 45
C2f-SA Module | 1.6            | 6.2                         | 58
Improvement   | −23.8%         | −27.1%                      | +28.9%
Table 2. Comparison of results from different networks on the RGBT-Tiny dataset.

Model                | Param (M) | mAP0.5 (%) | mAP0.75 (%) | mAP0.5:0.95 (%) | GFLOPs | FPS@640 | FPS@1280 | F1-Score (%)
SSD                  | 25.2      | 43.1       | 31.9        | 28              | 95     | 78.5    | 22.1     | 61.7
TOOD                 | 31.8      | 37.7       | 31.7        | 27.9            | 105    | 72.3    | 20.3     | 55.3
ATSS [46]            | 31.9      | 43.5       | 26.8        | 24.2            | 100    | 75.2    | 21.2     | 65.0
RetinaNet [47]       | 36.2      | 37.4       | 22.9        | 21.8            | 100    | 73.8    | 20.8     | 58.6
Faster RCNN          | 41.2      | 43.1       | 33.5        | 28.8            | 85     | 22.1    | 6.1      | 68.1
Cascade RCNN [48]    | 68.9      | 44.2       | 35.8        | 30.1            | 280    | 8.0     | 2.2      | 69.7
Dynamic RCNN [49]    | 41.2      | 44         | 34.2        | 29.4            | 90     | 20.7    | 5.8      | 68.2
SABL [50]            | 41.9      | 43.3       | 35.3        | 29.6            | 120    | 65.8    | 18.5     | 70.6
CenterNet [51]       | 14.4      | 31.7       | 18.2        | 17.8            | 80     | 82.5    | 23.2     | 57.2
FCOS [52]            | 31.9      | 28.6       | 19.2        | 17.5            | 90     | 68.5    | 19.3     | 53.6
VarifocalNet [53]    | 32.5      | 41.6       | 30.1        | 26.9            | 95     | 76.3    | 21.5     | 65.2
Deformable DETR [54] | 39.8      | 45.4       | 32          | 28.2            | 180    | 11.1    | 3.1      | 65.5
Sparse RCNN [55]     | 44.2      | 29.8       | 21.9        | 19.2            | 200    | 10.0    | 2.8      | 57.7
DNAnet [56]          | 4.7       | 4.8        | 2.5         | 2.6             | 110    | 58.2    | 16.4     | 22.2
RFLA [57]            | 36.3      | 47.1       | 36.3        | 32.1            | 100    | 74.5    | 21.0     | 63.7
C3Det [58]           | 55.3      | 13.8       | 11.2        | 9.4             | 130    | 52.3    | 14.7     | 29.2
PANet [59]           | 33.5      | 52.3       | 38.7        | 34.2            | 95     | 72.5    | 20.3     | 68.1
SNIPER [60]          | 31.8      | 54.6       | 40.2        | 35.8            | 115    | 60.8    | 17.0     | 70.0
M2Det [61]           | 48.2      | 49.8       | 35.4        | 31.2            | 140    | 48.2    | 13.5     | 59.1
GFL [62]             | 35.6      | 56.2       | 41.8        | 37.3            | 105    | 71.5    | 20.0     | 70.8
AugFPN [63]          | 34.1      | 51.7       | 37.9        | 33.5            | 98     | 74.8    | 21.0     | 63.2
YOLOv5s [12]         | 7         | 83.6       | 50.9        | 49.1            | 60     | 142.9   | 40.2     | 84.4
YOLOv8s [64]         | 11        | 86.5       | 54.8        | 52.3            | 70     | 128.6   | 36.2     | 86.6
YOLOv10s [65]        | 14        | 89.2       | 58          | 55.7            | 85     | 105.9   | 29.8     | 89.3
YOLO-SR              | 4         | 95.1       | 73.2        | 65              | 37.7   | 150     | 59.9     | 93.9
Table 3. Comparison of detection performance for targets of different sizes in the RGBT-Tiny dataset.

Model        | mAP0.5(xs) (%) | mAP0.5(vs) (%) | mAP0.5(s) (%) | mAP0.5(m/l) (%)
Faster R-CNN | 24.7           | 45.3           | 60.5          | 69.8
YOLOv5s      | 71.8           | 85.4           | 92.3          | 93.1
YOLOv8s      | 76.4           | 88.1           | 94.2          | 94.5
YOLOv10s     | 79.3           | 91.2           | 96.1          | 95.4
YOLO-SR      | 88.5           | 97.3           | 98.4          | 96.2
Table 4. Comparison of results on the VisDrone2019 dataset.

Model       | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5:0.95 (%)
YOLOv5s     | 33.9          | 29.3       | 24.5       | 12.4
SSD         | 23.5          | 18.7       | 10.2       | 5.1
CenterNet   | 31.8          | 32.6       | 29.0       | 14.2
RetinaNet   | 39.2          | 30.8       | 31.67      | 16.3
Faster RCNN | 29.4          | 23.5       | 20.0       | 8.91
YOLO-SR     | 42.5          | 34.7       | 31.7       | 17.5
Table 5. Comparison of ablation experiment results.

Model                                     | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.75 (%)
YOLOv5                                    | 91.4          | 78.2       | 83.6       | 50.9
YOLOv5 + C2f-SA                           | 91.5          | 83.4       | 88.8       | 46.1
YOLOv5 + C2f-SA + SPPC                    | 93            | 84.6       | 90         | 52.7
YOLOv5 + C2f-SA + SPPC + DDect (YOLO-SR)  | 96.7          | 91.3       | 95.1       | 73.2
Table 6. Comparison of parameter quantities before and after module replacement.

Module                    | Original Params (M) | New Params (M) | Param Change (%)
Feature Extraction Module | 2.1 (C3)            | 1.6 (C2f-SA)   | −23.8%
Feature Fusion Module     | 2.62 (SPPF)         | 3.67 (SPPC)    | +40%
Detection Head            | 0.8 (Head)          | 1.69 (DDect)   | +111.3%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
