1. Introduction
With the in-depth integration of drone technology and computer vision, UAV-oriented ground target detection has evolved into a core supporting technology in fields such as remote sensing monitoring, smart agriculture, security patrols, and emergency rescue [1,2,3]. Compared with traditional ground monitoring and satellite remote sensing, UAVs offer flexibility, low cost, and high-resolution imaging, so they can quickly acquire ground observation data in complex scenarios and support the real-time perception and accurate positioning of ground targets [4,5]. However, UAV-view ground target detection faces several inherent challenges: an extremely high proportion of small targets (for instance, distant vehicles, pedestrians, and crop seedlings account for only 1–5% of image pixels); sparse feature information that is susceptible to interference from background clutter; significant variations in target scale, where large close-range targets and small distant targets coexist in the same scene; and complex imaging environments that are strongly affected by illumination changes, airflow vibration, and occlusion [6,7,8]. These problems cause traditional target-detection algorithms to suffer from missed detections, high false detection rates, and insufficient positioning accuracy in UAV scenarios, making it difficult for them to meet practical application requirements.
You Only Look Once (YOLO) [9] has been extensively applied in UAV-based ground target-detection tasks, thanks to its end-to-end inference capability and its balance between speed and accuracy [9,10,11]. As the latest iteration of the series, YOLOv11 further improves feature representation capability and inference efficiency by introducing the C3k2 feature-extraction module, the C2PSA attention mechanism, and an optimized detection head structure [12]. Nevertheless, the original design of YOLOv11 is primarily tailored for general scenarios, and it still exhibits obvious deficiencies in small-target detection for UAVs: first, the large stride of the feature-extraction layers (with a minimum stride of 8) results in the severe loss of detailed features of small targets during downsampling; second, the excessively high channel compression ratio (default e = 0.5) makes it challenging to retain the limited feature information of small targets; third, the attention mechanism is insufficiently refined, which prevents it from effectively focusing on small-target regions and suppressing interference from complex backgrounds [13,14]. Therefore, targeted improvements to YOLOv11 based on the characteristics of UAV small targets have both theoretical value and engineering significance, as they can enhance detection performance in low-pixel, strong-interference, and multi-scale scenarios.
As an authoritative benchmark dataset for UAV vision tasks, the VisDrone dataset is composed of 10,209 images collected at different altitudes and under different scene conditions, covering 10 common object categories including pedestrians, vehicles, and bicycles. Among them, small targets (pixel area < 32 × 32) account for more than 40%, which realistically reflects the complex scenarios of UAV-based ground detection [15]. In addition to VisDrone, DOTAv1 is another mainstream large-scale benchmark dedicated to aerial object-detection tasks, which is widely adopted in UAV remote sensing detection research [16]. This dataset contains 2806 high-resolution aerial images collected from diverse geographic regions and shooting perspectives, with a total of 188,282 annotated object instances across 15 categories such as airplanes, ships, storage tanks, and vehicles. Unlike ordinary UAV datasets that focus on horizontal bounding box detection, DOTAv1 features abundant arbitrarily oriented objects and densely distributed tiny targets, which poses great challenges to multi-scale and rotated object detection algorithms and allows a comprehensive evaluation of model robustness in complex aerial scenes. Conducting algorithm research and validation on these datasets helps ensure the practicality and generalization ability of the improved algorithm.
4. Results
4.1. Experimental Environment
The hardware configuration used for neural network training in this study is as follows: the graphics processing unit (GPU) is a single virtual GPU with 48 GB of video memory, and the central processing unit (CPU) is a virtual 20-core Intel® Xeon® Platinum 8470Q processor. These resources provide sufficient computational capability and memory to ensure stable model training.
4.2. Dataset
In this experiment, the VisDrone2019-DET dataset [15] is employed. As a large-scale benchmark for object detection under drone-borne imagery, this dataset was developed by the AI-Eye Team of the Machine Learning and Data Mining Laboratory at Tianjin University. It aims to promote research and development in the field of automatic understanding of drone vision data and provides a comprehensive and rigorous evaluation platform for object-detection algorithms in drone scenarios. This dataset consists of 10,209 static images, together with 288 video segments and 261,908 video frames. All data are acquired by cameras equipped on various drone platforms, covering diverse scenarios across 14 cities in China, among which urban and rural regions serve as the two primary environmental settings. It involves scenes with varying target densities such as sparse and dense distributions, and the collection process is conducted under diverse weather and lighting conditions. This can fully simulate the complex environmental constraints in real drone operations, demonstrating strong scene representativeness and practicality.
Regarding data annotation, the VisDrone2019-DET dataset employs meticulous manual annotation, with a total of over 2.6 million bounding boxes marked across 12 distinct object categories. Specifically, these categories include ignored regions (ID 0), pedestrians (ID 1), people (ID 2), bicycles (ID 3), cars (ID 4), vans (ID 5), trucks (ID 6), tricycles (ID 7), awning tricycles (ID 8), buses (ID 9), motorcycles (ID 10), and others (ID 11). Among them, there are 10 valid object categories (excluding ignored regions). Pedestrians and cars are the dominant categories, accounting for approximately 35% and 25%, respectively, which conforms to the object distribution characteristics of actual drone observation scenarios. Each annotation entry incorporates comprehensive attribute descriptions, such as the horizontal and vertical coordinates of the bounding box’s top-left corner (x, y), as well as its corresponding width and height, together with an evaluation validity flag (1 indicates inclusion in evaluation, 0 denotes exclusion), class ID, target truncation level (ranging from 0 to 2, corresponding to no truncation, partial truncation, and complete truncation in sequence), and occlusion level (ranging from 0 to 2, representing no occlusion, partial occlusion, and heavy occlusion, respectively). These multi-dimensional annotation details provide strong support for the robustness verification and fine-grained performance analysis of algorithms and can meet the training and testing requirements of object-detection algorithms in complex scenarios.
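The annotation layout described above can be illustrated with a short parsing sketch. It assumes each entry is a comma-separated line whose fields appear in the order listed in the text (x, y, width, height, validity flag, class ID, truncation, occlusion). The names `VisDroneBox`, `parse_annotation`, and `to_normalized` are hypothetical, and dropping the "ignored regions" and "others" classes while shifting the remaining IDs to 0–9 is an assumption made for illustration, not a documented step of this study.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VisDroneBox:
    left: int        # x of the top-left corner, in pixels
    top: int         # y of the top-left corner, in pixels
    width: int
    height: int
    valid: int       # 1 = included in evaluation, 0 = excluded
    category: int    # 0 (ignored regions) .. 11 (others)
    truncation: int  # 0..2
    occlusion: int   # 0..2


def parse_annotation(line: str) -> VisDroneBox:
    """Parse one comma-separated annotation entry (field order assumed from the text)."""
    fields = [int(v) for v in line.strip().split(",") if v.strip() != ""]
    return VisDroneBox(*fields[:8])


def to_normalized(
    box: VisDroneBox, img_w: int, img_h: int
) -> Optional[Tuple[int, float, float, float, float]]:
    """Return (class, cx, cy, w, h) normalized to [0, 1], or None if the box is skipped.

    Assumption: entries flagged invalid, ignored regions (ID 0), and others (ID 11)
    are skipped, and the 10 remaining classes are re-indexed to 0..9.
    """
    if box.valid == 0 or box.category in (0, 11):
        return None
    cx = (box.left + box.width / 2) / img_w
    cy = (box.top + box.height / 2) / img_h
    return (box.category - 1, cx, cy, box.width / img_w, box.height / img_h)
```

For example, a hypothetical entry `100,200,50,40,1,4,0,1` in a 1000 × 800 image describes a partially occluded car whose normalized center is (0.125, 0.275).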
The dataset is partitioned into three distinct subsets: training, validation, and test. Specifically, the training set comprises 6471 images, while the validation set contains 548 images. The test subset is further divided into a standard test set with 1610 images and an extended test set (including test-dev), amounting to 3190 images in total. The test-dev subset provides annotation information and can be used for publishing academic paper results, while the test-set-challenge subset is only for competitions without annotations provided. A key attribute of this dataset lies in its considerable variance in object scales: approximately 31.6% of targets are smaller than 32 × 32 pixels, while 70% are below 64 × 64 pixels, thus rendering it a representative dataset for dense small-object detection. Meanwhile, it presents real-scene challenges such as inconsistent image resolutions (common resolutions range from 1360 × 765 to 2000 × 1500), target occlusion, motion blur, and uneven illumination. These factors can effectively verify the adaptability and performance upper bound of object-detection algorithms from drone perspectives, and the dataset is widely applied in academic research and algorithm verification for drone-based object detection, small object detection, complex scene adaptation, and related fields.
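The scale statistics quoted above (share of targets below 32 × 32 and 64 × 64 pixels) can be recomputed from box sizes with a few lines. This is a minimal sketch that assumes "smaller than 32 × 32" means a box area below 32² pixels; the function name `size_fractions` and the sample boxes are hypothetical.

```python
def size_fractions(boxes, small=32, medium=64):
    """Return the fractions of boxes with area below small^2 and below medium^2.

    boxes: iterable of (width, height) pairs in pixels.
    """
    boxes = list(boxes)
    if not boxes:
        return 0.0, 0.0
    n_small = sum(1 for w, h in boxes if w * h < small * small)
    n_medium = sum(1 for w, h in boxes if w * h < medium * medium)
    return n_small / len(boxes), n_medium / len(boxes)


# Four illustrative boxes (not taken from the dataset).
frac_small, frac_medium = size_fractions([(10, 10), (20, 40), (50, 50), (100, 100)])
```

Other definitions of "small" (for example, both sides below 32 pixels) would give somewhat different percentages, which is one possible reason such figures vary between reports.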
Figure 5 presents several representative annotated samples from the VisDrone2019-DET dataset. These examples vividly illustrate the visualization of bounding box annotations and category labels for diverse objects under different scenarios and intuitively reflect the annotation protocols and object distribution characteristics of the dataset.
To evaluate the overall performance of the proposed algorithm more comprehensively, and to avoid drawing conclusions from a single dataset, this study additionally conducts comparative experiments on the DOTAv1 dataset [16], which also tests the generalization capability of the model across aerial detection scenarios. As a canonical large-scale benchmark for high-resolution remote sensing aerial object detection, DOTAv1 contains 2806 aerial images collected from diverse geographical scenes and shooting perspectives, covering 15 fine-grained object categories such as airplanes, ships, storage tanks, and large-sized vehicles, with more than 188,000 annotated target instances in total. Unlike VisDrone2019-DET, which adopts horizontal bounding box annotation, DOTAv1 specializes in oriented object detection, where most targets are arranged at arbitrary rotation angles and densely distributed across wide, complex backgrounds. These characteristics pose severe challenges for multi-scale feature learning and the high-precision localization of rotated tiny objects.
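To make the difference between oriented and horizontal annotation concrete, the sketch below parses one oriented-box entry and compares the polygon area with the area of its enclosing horizontal box. It assumes the commonly used DOTA text layout of eight corner coordinates followed by a category name and a difficulty flag; the function names and the sample line are hypothetical.

```python
def parse_dota_line(line):
    """Split 'x1 y1 x2 y2 x3 y3 x4 y4 category difficult' into its parts."""
    parts = line.split()
    coords = [float(v) for v in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    return points, parts[8], int(parts[9])


def enclosing_box(points):
    """Axis-aligned (horizontal) box that encloses the oriented quadrilateral."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


# A square rotated by 45 degrees (illustrative values).
points, category, difficult = parse_dota_line("10 0 20 10 10 20 0 10 ship 0")
x1, y1, x2, y2 = enclosing_box(points)
horizontal_area = (x2 - x1) * (y2 - y1)
oriented_area = polygon_area(points)
```

In this example the horizontal box covers twice the area of the oriented box, which illustrates why rotated targets are harder to localize tightly with horizontal boxes.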
Figure 6 displays typical annotated samples of the DOTAv1 dataset, which explicitly demonstrates the oriented bounding box annotation form and complex target distribution features of remote sensing aerial scenes.
4.3. Evaluation Metrics
To comprehensively assess the detection performance and engineering applicability of the proposed model, this experiment adopts six representative metrics widely used in object detection, namely precision (P), recall (R), mean average precision at an IoU threshold of 0.5 (mAP@50), mean average precision over IoU thresholds from 0.5 to 0.95 (mAP@50–95), the number of parameters (M), and GFLOPs. These metrics are used for comparative analysis along three dimensions: detection accuracy, localization accuracy, and model lightweight degree. GFLOPs (giga floating-point operations) serves as the metric for the model’s computational complexity. It represents the number of floating-point operations (in billions) required for a single forward pass of the model, and thus reflects the demand on computational resources; it is a count of operations, not a rate per second. Lower GFLOPs typically indicate higher computational efficiency, which is especially important for deployment on edge devices or real-time systems. The specific definitions, calculation methods, and evaluation significance of each metric are as follows. All metric computations follow the standard conventions of object-detection research, so as to ensure the reliability and comparability of the experimental results.
Precision serves as a key indicator for quantifying the accuracy of model detection outputs. It denotes the ratio of true positive samples to all instances predicted as positive by the model, which effectively characterizes the model’s capability to mitigate false-positive detections (i.e., misidentifying background or non-target objects as valid targets). This is of great significance for practical application scenarios of YOLO models, such as drone-based small-object detection and remote sensing monitoring, as it helps avoid decision-making errors caused by false-positive outputs. The corresponding calculation formula is expressed as follows:
$$P = \frac{TP}{TP + FP}$$
In the formula, True Positive (TP) denotes the number of samples that are truly positive and correctly predicted as positive by the model, corresponding to successfully detected targets. False Positive (FP) denotes the number of samples incorrectly identified as positive despite being negative in reality, namely background or non-target objects misjudged by the model. Generally, a higher precision value indicates more dependable detection outputs and fewer false-positive errors.
Recall is complementary to precision. It quantifies the model’s ability to identify genuine positive samples, i.e., the ratio of actual positive samples successfully detected by the model, reflecting its capacity to reduce missed detections. In object-detection tasks, particularly for small and densely distributed objects, the recall value directly determines whether the model can fully locate all intended targets. Its calculation formula is as follows:
$$R = \frac{TP}{TP + FN}$$
In the formula, False Negative (FN) denotes the number of samples that are truly positive yet misclassified as negative by the model, corresponding to targets that go undetected. A higher recall value signifies fewer omitted targets and a stronger capability to identify actual objects. Notably, precision and recall generally trade off against each other: pursuing higher precision may reduce recall, and vice versa. Consequently, the model’s overall detection capability should be evaluated by considering the two metrics together.
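As a worked illustration of the two definitions, the following minimal sketch computes precision and recall from detection counts. The function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the text.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Illustrative counts: 8 correct detections, 2 false alarms, 12 missed targets.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=12)
```

With these counts the detector is right in 80% of its predictions but finds only 40% of the targets, the kind of high-precision, low-recall profile that is common for dense small objects.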
The mean average precision (mAP@50) is evaluated at an IoU threshold of 0.5, focusing on the model’s recognition ability under relatively lenient localization criteria. In contrast, mAP@50–95 refers to the mean average precision computed across a series of IoU thresholds ranging from 0.5 to 0.95 with a step interval of 0.05, that is
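$$\mathrm{mAP@50\text{–}95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{mAP@}t$$
This metric more rigorously reflects the model’s comprehensive detection performance under different localization accuracy requirements and serves as a key indicator for evaluating detection quality.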
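The averaging over IoU thresholds can be sketched as follows. The snippet assumes boxes are given as corner coordinates (x1, y1, x2, y2) and that the per-threshold mAP values have already been computed elsewhere; `iou`, `IOU_THRESHOLDS`, and `map_50_95` are hypothetical names used only for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def map_50_95(map_at):
    """Average the per-threshold mAP values; map_at maps threshold -> mAP."""
    return sum(map_at[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

A prediction that overlaps its ground truth with an IoU of 0.6 counts as correct at thresholds 0.50, 0.55, and 0.60 only, so loosely localized detections contribute to mAP@50 but add little to mAP@50–95.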
Parameters (in units of M, meaning millions) serve as a key metric for evaluating the lightweight level of the model. They quantify the total amount of learnable parameters within the model, including weight and bias parameters in convolutional layers, fully connected layers, and other network structures. The scale of parameters is directly associated with the model’s memory footprint, training complexity, and inference efficiency: fewer parameters indicate a more lightweight model, which demands less storage space, consumes less GPU memory during training, and achieves faster inference, thus being more applicable to deployment on resource-constrained platforms. Conversely, larger parameter counts usually strengthen the model’s feature representation capability, but also lead to higher training difficulty, greater memory and storage consumption, and degraded inference speed. In this work, parameters are counted in millions (M) to enable an intuitive comparison of the lightweight characteristics among various YOLO architectures.
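To show how the two complexity metrics arise at the level of a single layer, the sketch below counts the parameters and floating-point operations of one standard convolution. It assumes the common convention that one multiply-accumulate equals two floating-point operations; profiling tools differ on this, so absolute GFLOPs values depend on the tool. The function names and layer sizes are hypothetical.

```python
def conv_params(k: int, c_in: int, c_out: int, bias: bool = False) -> int:
    """Learnable parameters of a k x k convolution (weights plus optional biases)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv_gflops(k: int, c_in: int, c_out: int, h_out: int, w_out: int) -> float:
    """Floating-point operations (in billions) for one forward pass of the layer.

    Assumption: one multiply-accumulate is counted as two operations.
    """
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs / 1e9


# Illustrative layer: 3 x 3 convolution, 16 -> 32 channels, 320 x 320 output map.
params = conv_params(3, 16, 32)
gflops = conv_gflops(3, 16, 32, 320, 320)
```

The parameter count does not depend on the feature-map size, whereas the operation count scales with it. This is why adding a high-resolution branch can raise GFLOPs considerably while leaving the parameter count almost unchanged.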
4.4. Comparative Experiments
4.4.1. Comparative Algorithm Setup
To objectively and fairly verify the detection performance and lightweight advantages of the proposed YOLOSO model, a variety of YOLO-series models provided by Ultralytics [56] were selected as the core comparison benchmarks, including YOLOv8n, YOLOv9t, YOLOv10n, and the baseline model of this study, YOLOv11n. In addition, on the VisDrone2019-DET dataset, comparative experiments are also conducted against current state-of-the-art improved YOLO models (such as SuperYOLO [34] and LS-YOLO [57]). To avoid distortion of the comparison results caused by differences in model scale and to ensure that all comparison models are at the same lightweight level, all the above official YOLO-series models use their smallest released variant (“n”, or “t” in the case of YOLOv9), with their parameters controlled within 3.5 M, which is consistent with the lightweight design goal of the proposed YOLOSO model. In addition, to further validate the model performance at medium and large scales, the RT-DETR-L model [58] was introduced for comparison. Considering its large number of parameters, the S-version of YOLOSO (YOLOSO-S) was used for a fair comparison at the same parameter scale to eliminate performance interference caused by model size differences. Furthermore, to verify the generalization of the proposed model, experiments are also carried out on the DOTAv1 dataset, where comparisons are made only against the official standard models.
To minimize the impact of experimental variables on the comparison outcomes, all models were trained using the same training dataset and a unified training protocol. The specific training configurations are as follows: we uniformly set the batch size to 16, set the initial learning rate to 0.01, and assign a weight decay coefficient of 0.0005 to ease overfitting. The stochastic gradient descent (SGD) optimizer is employed, with the total training epoch fixed at 200. Meanwhile, mainstream object-detection-training strategies, including Mosaic data augmentation, adaptive anchor computation, random cropping, and flipping, are applied equally across all models to guarantee training consistency.
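For readers who wish to reproduce a comparable setup, the stated hyperparameters could be expressed with the Ultralytics Python API roughly as follows. This is a sketch under assumptions rather than the authors' actual script: the dataset file `VisDrone.yaml`, the model file `yolo11n.yaml`, and the `split="test"` choice are assumptions, and the values in the two dictionaries are only those quoted in the text (the inference thresholds come from the testing-stage description).

```python
# Training hyperparameters quoted in the text, using Ultralytics argument names.
TRAIN_CFG = {
    "data": "VisDrone.yaml",   # assumed dataset config name
    "epochs": 200,
    "batch": 16,
    "lr0": 0.01,               # initial learning rate
    "weight_decay": 0.0005,
    "optimizer": "SGD",
}

# Inference settings quoted for the testing stage.
VAL_CFG = {
    "data": "VisDrone.yaml",   # assumed dataset config name
    "split": "test",           # assumption: evaluate on the test split
    "conf": 0.25,              # confidence threshold
    "iou": 0.7,                # IoU threshold for NMS
}


def train_and_validate(model_file: str = "yolo11n.yaml"):
    """Train one model with the shared protocol, then evaluate it."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    model = YOLO(model_file)
    model.train(**TRAIN_CFG)
    return model.val(**VAL_CFG)
```

Keeping the protocol in two shared dictionaries means every compared model receives identical settings by construction, which is the fairness condition the paragraph above describes.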
In the testing stage, identical inference parameters are adopted for all models: the confidence threshold is set to 0.25, and the IoU threshold for non-maximum suppression (NMS) is configured as 0.7. Such settings eliminate disturbances caused by inconsistent training and inference hyperparameters, thereby ensuring the reliability and accuracy of the quantitative comparison results.
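The role of the two inference thresholds can be illustrated with a plain-Python sketch of confidence filtering followed by greedy NMS. It is class-agnostic and uses hypothetical names and boxes; production pipelines typically apply NMS per class and in vectorized form, so this is an illustration of the principle and not the implementation used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thres=0.25, iou_thres=0.7):
    """Return indices of kept boxes, highest score first."""
    # Step 1: drop low-confidence predictions.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap a kept box too strongly.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.6, 0.1]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.82 and is suppressed, while the last box is removed by the confidence filter before NMS runs.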
The detection outputs of each model corresponding to the input image in Figure 7a are illustrated in Figure 7. It can be observed that the proposed YOLOSO model achieves high-precision detection toward small vehicles in the scene. To further compare the performance of different models, key indicators including precision (P), recall (R), mAP@50, mAP@50–95, and parameters (M) are compared, so as to quantitatively assess the overall performance of the YOLOSO model in UAV-borne ground small-object-detection tasks.
4.4.2. Comparison on VisDrone2019-DET
The performance comparison results on the VisDrone2019-DET test set are presented in Table 1. A dimension diagram is plotted based on these experimental results, as illustrated in Figure 8.
The comprehensive quantitative performance of all involved detection models on the VisDrone2019-DET dataset is summarized in Table 1. The comparison baselines consist mainly of mainstream lightweight YOLO-n series models, including YOLOv8n, YOLOv10n, and YOLOv11n. The comparison group also contains YOLOv9t, the improved detectors SuperYOLO and LS-YOLO, the high-performance real-time detector RT-DETR-L, the classical two-stage algorithm Faster-RCNN, and our proposed YOLOSO and its structurally enhanced variant YOLOSO-S. In terms of computational resource consumption, reflected by parameters (M) and GFLOPs, the parameter count and computational overhead of Faster-RCNN are far higher than those of all other competitors, which makes a direct comparison inappropriate; thus, its parameter and GFLOPs data are not listed in the table. Conventional lightweight YOLO-n models maintain extremely low computational costs, with parameters ranging from 2.59 M to 3.01 M and GFLOPs between 6.5 and 8.8. Such lightweight characteristics enable fast inference for edge deployment. In contrast, enhanced models such as SuperYOLO and LS-YOLO increase network complexity to pursue better feature extraction capability, resulting in a sharp rise in computational burden: their GFLOPs reach 20.9 and 42.5, respectively. Unlike the mainstream YOLO-n baselines, YOLOSO-S is not a lightweight model. Instead, it is a larger, upgraded version derived from the lightweight YOLO-n architecture. By embedding additional feature-enhancement modules and multi-scale detection branches, YOLOSO-S has a larger parameter count of 14.85 M and a higher computational cost of 66.5 GFLOPs. Although its computational cost is higher than that of the basic YOLO-n models, it is still lighter than the heavyweight RT-DETR-L (103.4 GFLOPs).
In terms of detection performance, four core evaluation metrics, namely precision, recall, mAP50, and mAP50–95, are adopted to evaluate model capabilities for challenging aerial small-target detection. The original YOLO-n counterparts present balanced but mediocre detection performance. The precision and recall of mainstream YOLO-n models are concentrated at approximately 42.4–43.1% and 31.6–32.9%, while their mAP50 and mAP50–95 do not exceed 33.6% and 19.3%, respectively. The modified detectors SuperYOLO and LS-YOLO fail to achieve an effective performance improvement over the vanilla YOLO-n series, and some of their indicators are even degraded. Our proposed basic YOLOSO model achieves prominent performance gains on the basis of YOLO-n frameworks, reaching 47.2% precision, 36.8% recall, 37.3% mAP50, and 22.0% mAP50–95, which outperforms all lightweight YOLO-n baselines and the other modified detectors. Benefiting from its more complex network structure and optimized detection strategies tailored for UAV scenarios, YOLOSO-S achieves the best results across all metrics among all contenders, with a precision of 56.1%, a recall of 43.0%, an mAP50 of 45.3%, and an mAP50–95 of 27.4%. To conclude, the basic YOLOSO realizes a favorable trade-off between computational complexity and detection accuracy, which is suitable for general real-time aerial detection tasks. As a high-precision, more complex variant, YOLOSO-S sacrifices part of the inference efficiency to achieve the highest detection performance in this comparison, which makes it valuable for high-precision UAV target-detection tasks.
4.4.3. Comparison on DOTAv1
To further validate the effectiveness of the proposed model, we conducted additional experiments on the DOTAv1 dataset, comparing our YOLOSO series against official YOLO models (YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n). Table 2 summarizes the comparative results in terms of parameters (M), GFLOPs, P, R, and detection accuracy (mAP50 and mAP50–95).
As shown in Table 2, the proposed YOLOSO model achieves the best overall detection performance on the DOTAv1 dataset. In terms of detection precision, YOLOSO reaches 62.2%, which is higher than all baseline models. Its recall is 26.3%, outperforming YOLOv8n, YOLOv9t, YOLOv10n, and YOLOv11n by 1.7, 4.1, 6.2, and 3.0 percentage points, respectively. For comprehensive detection accuracy, the mAP50 and mAP50–95 of our model are 27.3% and 14.9%, ranking first among all compared models.
In terms of model scale and computational overhead, YOLOSO has 3.56 M parameters and 12.4 GFLOPs, which is slightly larger than the lightweight YOLO series models. The increase in parameters and computation brings a significant improvement in detection accuracy, proving that the optimized structure of YOLOSO is effective for aerial target-detection tasks on the DOTA-v1 dataset. Although the model has a slight rise in computational complexity, it obtains obvious performance gains and balances detection accuracy and model applicability well.
4.4.4. Comparison of FPS of Different Models
To comprehensively evaluate the real-time detection performance and practical deployment capability of the proposed YOLOSO model, we further test and compare the Frames Per Second (FPS) of all comparative models on the DOTA-v1 dataset. Consistent with the above comparison experiments, we select the mainstream lightweight YOLO series models including YOLOv8n, YOLOv9t, YOLOv10n, and YOLOv11n as the baseline models. All models are tested under the same experimental hardware environment and parameter configuration to ensure the fairness and credibility of the FPS comparison results. The real-time inference speed of each model is statistically analyzed, and the differences in deployment efficiency and detection latency between the YOLOSO model and other baseline models are quantitatively discussed.
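A frame-rate measurement of the kind described above can be sketched as a simple timing loop. The helper `measure_fps` is hypothetical and not the procedure used in this study; it assumes the inference call is synchronous. With GPU inference, the device would additionally need to be synchronized before reading the clock, which is omitted here.

```python
import time


def measure_fps(infer, inputs, warmup=10):
    """Time `infer` over `inputs` and return frames per second.

    infer: callable that processes one input (assumed to block until done).
    warmup: number of untimed initial calls, to exclude start-up overhead.
    """
    inputs = list(inputs)
    for x in inputs[:warmup]:
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed if elapsed > 0 else float("inf")


# Stand-in workload in place of a real detector.
fps = measure_fps(lambda x: sum(range(1000)), range(50))
```

Because FPS depends on the hardware, the batch size, and whether pre- and post-processing are included in the timed region, values are comparable only when all models are measured with the same loop on the same machine.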
The detailed FPS, parameter quantity (M), computational complexity (GFLOPs), and detection accuracy (mAP50) of each model are listed in Table 3. It can be observed that the four baseline lightweight models maintain excellent real-time inference performance, with FPS values ranging from 26.53 to 35.8. Specifically, YOLOv10n achieves the highest FPS of 35.8, the fastest inference speed among all compared models. Compared with the baseline models, the proposed YOLOSO model has a reduced inference speed, with an FPS of 20.53, about 23% below the slowest baseline (26.53). This drop in real-time performance is mainly attributed to the increased model parameters and computational overhead brought by the optimized structural design.
Nevertheless, the moderate reduction in FPS is acceptable for practical aerial target-detection deployment scenarios. Compared with all baseline models, YOLOSO achieves a clear accuracy gain, with its mAP50 reaching 27.3%, which is 2.5, 4.1, 7.2, and 3.8 percentage points higher than that of YOLOv8n, YOLOv9t, YOLOv10n, and YOLOv11n, respectively. The improvement in aerial detection accuracy outweighs the loss of inference speed. In actual UAV remote sensing and aerial monitoring deployment tasks, high-precision target detection is the core demand, and an FPS of 20.53 can meet the real-time working requirements of conventional aerial detection scenarios. Therefore, the YOLOSO model realizes an effective trade-off between detection accuracy and inference efficiency, and the efficiency drop is reasonable for practical deployment.
4.5. Ablation Experiments
To clarify the specific contributions of the proposed optimization strategies, including the C2PSASO module, the C3k2SO module, the ED-CBAM module, and the overall structural modifications of the model, to detection performance and to further validate the rationality of the optimized design of the YOLOSO model, a set of controlled ablation experiments was performed based on the baseline YOLOv11n network. The experimental design focused on single optimization strategies and combined optimization strategies. By comparing the number of parameters, GFLOPs, precision (P), recall (R), mAP50, and mAP50–95 across different experimental schemes, the effectiveness of each optimized module was quantified. The experimental design is shown in Table 4.
The experimental performance of each ablation group on the VisDrone2019-DET test set is presented in Table 5. The five-dimensional graph generated from these results is displayed in Figure 9.
As shown in Experiment 2, replacing the original C2PSA with the proposed C2PSASO module slightly reduces model parameters (from 2.59 M to 2.44 M) and GFLOPs (from 6.5 to 6.3). Meanwhile, all detection metrics are improved: precision rises from 42.7% to 43.8%, recall increases from 32.0% to 34.8%, mAP50 goes up from 33.6% to 34.3%, and mAP50–95 climbs from 19.3% to 19.9%. This demonstrates that the C2PSASO module can streamline model computation while enhancing feature-extraction capability.
Experiment 3 adopts the C3k2SO module individually. It brings a notable performance gain: precision reaches 45.4%, recall 35.3%, mAP50 35.0%, and mAP50–95 20.1%. Nevertheless, this module introduces more parameters (3.82 M) and higher computational overhead (8.7 GFLOPs), indicating that its strong feature representation ability comes at the cost of moderately increased complexity.
By introducing the ED-CBAM attention module in Experiment 4, the model achieves better overall performance than the baseline YOLOv11n. The total parameters reach 2.6 M, which is only slightly higher than the baseline value of 2.59 M. It can be seen that using ED-CBAM alone will not lead to excessive expansion of the model size.
Experiment 5 optimizes only the overall model structure. Compared with the baseline, it achieves a substantial performance leap: precision, recall, mAP50, and mAP50–95 rise to 46.9%, 35.8%, 36.4%, and 21.7%, respectively. Although GFLOPs rise to 10.4, the parameter count increases only slightly to 2.67 M, showing that structural optimization is an efficient way to boost detection accuracy without enlarging the model, at the price of additional computation.
On the basis of structural modification, Experiment 6 further integrates the C2PSASO module. The parameters and GFLOPs decline marginally, and detection indicators see slight growth, which verifies the good compatibility between structural redesign and the C2PSASO module.
Experiment 7 combines structural optimization, C2PSASO, and C3k2SO modules. The detection performance is further elevated, and the overall parameters and computation are well controlled compared with the single use of C3k2SO. It proves that the joint application of multiple modules can balance model complexity and detection accuracy.
Experiment 8 is the complete YOLOSO model integrating all proposed strategies. It obtains the optimal overall performance across all groups, with precision of 47.2%, recall of 36.8%, mAP50 of 37.3% and mAP50–95 of 22.0%. The parameters and GFLOPs remain at a reasonable level.
In summary, each designed module and structural improvement contributes positively to detection performance. The combination of all optimization strategies achieves mutual complementation, enabling the YOLOSO model to obtain superior detection results while maintaining acceptable model scale and computational cost.
4.6. Experimental Summary
This experiment focuses on verifying the performance of the proposed YOLOSO object-detection model. All experimental trials were carried out in a hardware environment configured with a single vGPU featuring 48 GB of video memory and a 20-core virtual Intel® Xeon® Platinum 8470Q processor. Taking the VisDrone2019-DET drone small-object detection dataset as the benchmark, precision, recall, mAP50, mAP50–95, GFLOPs, and the number of model parameters were selected as core evaluation metrics. The effectiveness of the proposed model was verified through comparative experiments and ablation experiments. The dataset comprises 10,209 static images and 261,908 video frames, with over 2.6 million annotated bounding boxes covering 10 valid object categories. Among these objects, 31.6% are smaller than 32 × 32 pixels, making it a typical small-object dense dataset that can effectively simulate real UAV operating scenarios.
In comparative experiments, the proposed YOLOSO model is compared with representative lightweight YOLO architectures including YOLOv8n, YOLOv9t, YOLOv10n, and YOLOv11n, all of which have fewer than 3.5 M parameters, together with SuperYOLO, LS-YOLO, the medium-to-large-scale model RT-DETR-L, and Faster-RCNN. All comparison models are trained and evaluated under consistent experimental settings. The results indicate that the YOLOSO model contains 3.56 million parameters and requires 12.4 GFLOPs, slightly exceeding those of other lightweight counterparts, yet presents clear superiority in detection performance. Specifically, it achieves a precision of 47.2%, a recall of 36.8%, and an mAP50 of 37.3%, corresponding to increments of 4.5, 4.8, and 3.7 percentage points relative to YOLOv11n, respectively. Furthermore, on the DOTAv1 dataset, YOLOSO achieves 62.2% precision and 27.3% mAP50, outperforming all compared YOLO models. In terms of real-time inference performance, YOLOSO obtains a lower FPS of 20.53 compared with the baseline lightweight YOLO models (27.47 for YOLOv8n, 27.6 for YOLOv9t, 35.8 for YOLOv10n, and 26.53 for YOLOv11n) due to its increased parameter scale and computational complexity, but this moderate FPS reduction is acceptable for practical aerial deployment scenarios. The accuracy improvement compensates for the inference speed loss, and an FPS of 20.53 meets the real-time working requirements of conventional UAV remote sensing and aerial monitoring tasks, achieving a well-balanced trade-off between detection accuracy and deployment efficiency. The YOLOSO-S variant, which comprises 14.85 million parameters and requires 66.5 GFLOPs, achieves the best results across all metrics, with 56.1% precision, 43.0% recall, 45.3% mAP50, and 27.4% mAP50–95, at a computational cost that remains below that of RT-DETR-L.
Ablation studies further verify the efficacy of each individually improved module. Replacing C2PSA with the C2PSASO module alone reduces the number of parameters to 2.44 million and GFLOPs to 6.3, while improving mAP50 by 0.7 percentage points. Introducing the C3k2SO module alone improves mAP50 by 1.4 percentage points (to 35.0%), and overall structural optimization improves mAP50 by 2.8 percentage points (to 36.4%). The YOLOSO model, which combines all three improvements, achieves the best overall performance, with mAP50 increased by 3.7 percentage points (to 37.3%) compared with the baseline YOLOv11n.
Extensive experimental results adequately validate the rationality and superior performance of the proposed YOLOSO model. Relative to mainstream detection frameworks, YOLOSO demonstrates more favorable detection accuracy and flexible deployment potential in UAV-borne small-object detection scenarios. Benefiting from the C2PSASO module, the C3k2SO module, the ED-CBAM module, and global structural optimization, the model achieves a desirable balance between lightweight architectural design and high-precision inference. This work can serve as an effective reference for the advancement and optimization of lightweight object-detection models.
5. Discussion
Experimental results show that the proposed YOLOSO model outperforms YOLOv8n, YOLOv9t, YOLOv10n, SuperYOLO, LS-YOLO, and the baseline YOLOv11n in core metrics including precision, recall, and mAP50, with a gain of 3.7 percentage points in mAP50 over YOLOv11n. This accuracy gain comes at the cost of inference speed: YOLOSO runs at 20.53 FPS, versus 26.53 FPS for YOLOv11n, a reduction that we consider acceptable for conventional UAV remote sensing and aerial monitoring tasks. The primary causes of the performance improvement are summarized as follows. First, the newly added P2 high-resolution feature branch reduces the minimum detection stride from 8 to 4, alleviating the loss of detailed features of small objects during downsampling. This is consistent with the conclusion in existing studies that “high-resolution feature layers are crucial for improving small-object detection performance”. Furthermore, this paper constructs a four-scale detection framework of “P2-P3-P4-P5”, which strengthens detection consistency for multi-scale objects and compensates for the fact that a single high-resolution branch can hardly maintain satisfactory performance on large objects.
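The effect of halving the minimum stride can be illustrated with simple arithmetic. The sketch below assumes a 640 × 640 network input and the conventional strides of 4, 8, 16, and 32 for P2 to P5; neither the input size nor the function names come from the experiments reported here.

```python
# Assumed input resolution; the paper's training resolution may differ.
INPUT_SIZE = 640

# Conventional strides for the four detection levels.
LEVEL_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def feature_map_side(input_size: int, stride: int) -> int:
    """Side length (in cells) of the feature map at a given stride."""
    return input_size // stride


def cells_per_side(object_px: float, stride: int) -> float:
    """How many feature-map cells an object of object_px pixels spans per side."""
    return object_px / stride


if __name__ == "__main__":
    object_px = 16  # a small object, 16 x 16 pixels
    for level, stride in LEVEL_STRIDES.items():
        side = feature_map_side(INPUT_SIZE, stride)
        span = cells_per_side(object_px, stride)
        print(f"{level}: {side}x{side} grid, object spans {span:.2f} cells per side")
```

With these assumptions, a 16 × 16 pixel object spans 2 cells per side at stride 8 but 4 cells per side at stride 4, so the P2 branch represents it with four times as many feature-map locations.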
Second, the collaborative optimization of the two core modules, C3k2SO and C2PSASO, reduces the feature loss of small objects while enhancing fine-grained feature extraction and attention focusing. This is achieved by adjusting the channel compression ratios (0.25 for shallow modules and 0.75 for deep modules in C3k2SO; 0.25 in C2PSASO), optimizing the convolution kernel configuration (combining 1 × 3 and 3 × 1 convolutions), and increasing the number of attention heads from 4 to 8 in both modules. Compared with existing methods that optimize only a single module, the proposed strategy is more systematic and achieves a more favorable balance between feature extraction efficiency and accuracy.
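The following sketch illustrates how these three settings translate into layer sizes. It assumes that the compression ratio e sets the hidden width as int(c_out * e), as is common in YOLO-style blocks, that convolutions have no bias, and that channels are split evenly across attention heads; the channel counts used are illustrative and are not the actual YOLOSO configuration.

```python
def hidden_channels(out_channels: int, ratio: float) -> int:
    """Hidden width under the assumed rule hidden = int(out_channels * ratio)."""
    return int(out_channels * ratio)


def conv_weights(c_in: int, c_out: int, k_h: int, k_w: int) -> int:
    """Number of weights in a bias-free convolution with a k_h x k_w kernel."""
    return c_in * c_out * k_h * k_w


def head_dim(channels: int, num_heads: int) -> int:
    """Channels per attention head, assuming an even split."""
    if channels % num_heads != 0:
        raise ValueError("channels must be divisible by num_heads")
    return channels // num_heads


if __name__ == "__main__":
    c = 256  # illustrative channel count
    for ratio in (0.25, 0.5, 0.75):
        print(f"e={ratio}: hidden width {hidden_channels(c, ratio)}")

    # A 1x3 plus a 3x1 convolution versus a single 3x3 convolution.
    square = conv_weights(64, 64, 3, 3)
    asymmetric = conv_weights(64, 64, 1, 3) + conv_weights(64, 64, 3, 1)
    print(f"3x3: {square} weights, 1x3 + 3x1: {asymmetric} weights")

    for heads in (4, 8):
        print(f"{heads} heads: {head_dim(c, heads)} channels per head")
```

Under these assumptions, the 1 × 3 and 3 × 1 pair uses two thirds of the weights of a 3 × 3 kernel at the same width, and doubling the number of heads from 4 to 8 halves the channels per head, giving more but narrower attention subspaces.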
In the ablation experiments, the overall structural optimization contributes the most to the performance improvement, raising mAP50 by 2.8 percentage points over the baseline model (from 33.6% to 36.4%). This shows that the network structure is a key factor in small-object-detection performance. Of the two modules, C3k2SO provides the larger accuracy gain, while C2PSASO shows clear advantages in lightweight design. Their synergy enables YOLOSO to strike a balance between lightweight architecture and high precision, which is important for on-board deployment on UAVs. UAV platforms are typically constrained by limited computing power and memory and cannot support excessively large models. Although the parameter count of YOLOSO is around 3.56M, slightly higher than that of other lightweight YOLO models, it achieves a clear performance improvement. Meanwhile, the medium-to-large version YOLOSO-S has 14.85M parameters, 53.6% fewer than RT-DETR-L (32.0M), further supporting the proposed lightweight optimization strategy. In comparison with related works, the improvement strategies in this paper are more targeted. Most existing studies focus on single-dimensional optimization, whereas this work improves both the network structure and the core modules, which better matches the characteristics of UAV-based small objects: sparse feature representation, significant scale variations, and strong background clutter. It thus alleviates the prevalent problems of missed detection and false detection in such scenarios.
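The ablation figures and the parameter comparison can be checked with a few lines of arithmetic. The sketch below uses only numbers reported in this paper; the variable and function names are illustrative.

```python
BASELINE_MAP50 = 33.6  # YOLOv11n, in percent

# mAP50 gains (percentage points) when each change is applied on its own.
ablation_gain = {
    "C2PSASO only": 0.7,
    "C3k2SO only": 1.4,
    "structure only": 2.8,
}
COMBINED_GAIN = 3.7  # full YOLOSO versus the baseline


def resulting_map50(gain: float) -> float:
    """mAP50 reached after adding a gain to the baseline."""
    return round(BASELINE_MAP50 + gain, 1)


def param_reduction(small_m: float, large_m: float) -> float:
    """Percentage by which the smaller model reduces the parameter count."""
    return round(100.0 * (large_m - small_m) / large_m, 1)


sum_of_individual_gains = round(sum(ablation_gain.values()), 1)

if __name__ == "__main__":
    for name, gain in ablation_gain.items():
        print(f"{name}: +{gain} -> {resulting_map50(gain)}%")
    print(f"sum of individual gains: {sum_of_individual_gains}")
    print(f"combined gain: {COMBINED_GAIN}")
    print(f"YOLOSO-S vs RT-DETR-L: {param_reduction(14.85, 32.0)}% fewer parameters")
```

The individual gains sum to 4.9 percentage points, more than the 3.7 points obtained when all three changes are combined, which suggests that the benefits of the three improvements partly overlap rather than add up independently.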
Nevertheless, this study has certain limitations. First, the experiments are validated only on the VisDrone2019-DET and DOTAv1 datasets. Although these datasets cover various complex scenarios, they cannot fully represent all UAV application environments. Detection performance under harsh conditions such as heavy rain, dense fog, and high-altitude areas has not been verified, leaving room for improving model generalization. Second, despite its lightweight design, YOLOSO still requires further acceleration for real-time on-board inference, especially when processing high-resolution images. Third, the detection performance for heavily occluded and severely truncated small objects remains unsatisfactory, as such targets have extremely sparse feature information and are difficult to recognize effectively. Future work will focus on optimizing feature enhancement schemes for these extreme cases.
We also acknowledge that a more detailed analysis by object size bucket (e.g., tiny, small, medium) and by occlusion/truncation level would further strengthen the evaluation of our method. The VisDrone dataset provides occlusion and truncation flags that enable such a fine-grained assessment. While this analysis is beyond the scope of the current manuscript, we plan to conduct it in future work to provide a more comprehensive characterization of YOLOSO’s performance under varying object scales and occlusion conditions.
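As a starting point for such an analysis, the sketch below groups ground-truth boxes by size bucket and occlusion flag. It assumes the comma-separated VisDrone-DET annotation layout (left, top, width, height, score, category, truncation, occlusion), with categories 0 and 11 denoting ignored regions and "others"; the bucket thresholds and function names are hypothetical choices, and the sample lines are made up for illustration.

```python
from collections import Counter

# Hypothetical area thresholds (in pixels squared) for the size buckets.
TINY_MAX_AREA = 16 * 16
SMALL_MAX_AREA = 32 * 32
SKIPPED_CATEGORIES = {0, 11}  # assumed: ignored regions and "others"


def size_bucket(width: int, height: int) -> str:
    """Assign a box to a size bucket by its area."""
    area = width * height
    if area < TINY_MAX_AREA:
        return "tiny"
    if area < SMALL_MAX_AREA:
        return "small"
    return "medium+"


def parse_line(line: str) -> tuple[int, int, int, int, int]:
    """Return (width, height, category, truncation, occlusion) for one box."""
    fields = [f for f in line.strip().split(",") if f != ""]
    _left, _top, w, h, _score, category, truncation, occlusion = map(int, fields[:8])
    return w, h, category, truncation, occlusion


def bucket_counts(lines: list[str]) -> Counter:
    """Count valid boxes per (size bucket, occlusion level) pair."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip():
            continue
        w, h, category, _truncation, occlusion = parse_line(line)
        if category in SKIPPED_CATEGORIES:
            continue
        counts[(size_bucket(w, h), occlusion)] += 1
    return counts


# Made-up annotation lines, for illustration only.
sample = [
    "10,20,12,10,1,1,0,0",    # area 120  -> tiny, no occlusion
    "50,60,30,20,1,4,0,1",    # area 600  -> small, partial occlusion
    "100,100,80,60,1,4,1,2",  # area 4800 -> medium+, heavy occlusion
    "0,0,200,200,0,0,0,0",    # ignored region, skipped
]

if __name__ == "__main__":
    for key, n in sorted(bucket_counts(sample).items()):
        print(key, n)
```

This only counts ground-truth boxes per group; a full evaluation would additionally match predictions to each group and report recall or AP per bucket.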
Based on the above discussion, the results of this study have both theoretical value and engineering application prospects. Theoretically, the network design combining high-resolution branches and multi-scale fusion, together with the collaborative optimization of the C3k2SO and C2PSASO modules, provides new ideas and methods for improving lightweight object detectors in small-object scenarios, enriching the technical system of UAV-based small-object detection. For engineering applications, YOLOSO achieves a favorable trade-off between lightweight design and high precision, making it suitable for UAV remote sensing monitoring, security patrol, smart agriculture, and other practical scenarios. For instance, it can detect small pests and diseases in farmland to support precision agriculture and identify distant pedestrians and vehicles in security tasks to improve patrol efficiency and safety. Future research will address the limitations of this work by further refining the model structure, expanding validation datasets, enhancing robustness in extreme scenarios, and improving inference efficiency, so as to promote the practical deployment of UAV-based small-object-detection technology.
6. Conclusions
To address the core problems in UAV-based ground small-object detection, such as feature loss of small objects, poor scale adaptability, and strong background interference, this paper takes YOLOv11n as the basic framework and systematically optimizes it along three dimensions: network structure, core modules, and feature enhancement. The result is a small-object-enhanced detection algorithm (YOLOSO) suitable for UAV-based detection scenarios.
The optimized network structure effectively mitigates the feature loss of small objects. By adding a P2 high-resolution feature branch with a stride of 4, a four-scale detection system of “P2-P3-P4-P5” is constructed, reducing the minimum detection stride from 8 to 4, which significantly improves the ability to capture the details of tiny objects. Meanwhile, a bidirectional feature fusion strategy of “top-down + bottom-up” is adopted to realize deep interaction among multi-scale features, enhancing the network’s adaptability to scale variations and providing sufficient feature support for small-object detection. Experiments show that structural optimization alone improves mAP50 by 2.8 percentage points compared with the original baseline model.
The collaborative optimization of the two core modules, C3k2SO and C2PSASO, significantly improves the refinement and effectiveness of feature extraction. The C3k2SO module reduces small-object feature loss and enhances the ability to capture the local textures of small objects by adjusting the channel compression ratio (0.25 for shallow modules and 0.75 for deep modules), optimizing the convolution kernel configuration (a combination of 1 × 3 and 3 × 1 convolutions), and improving the attention mechanism (attention heads increased from 4 to 8, channel compression ratio reduced from 0.5 to 0.25, and an additional 1 × 1 convolutional layer). The C2PSASO module mitigates vanishing gradients for small-object features and strengthens feature focusing by reducing the channel compression ratio (from 0.5 to 0.25), increasing the number of attention heads (from 4 to 8), and adding residual connections with a 1 × 1 convolutional branch. Ablation experiments verify that replacing C2PSA with the C2PSASO module alone improves mAP50 by 0.7 percentage points, and introducing the C3k2SO module alone improves mAP50 by 1.4 percentage points. Their synergistic effect further boosts the detection capability of the optimized model.
Experimental results demonstrate that the YOLOSO model has 3.56M parameters, still within the range of lightweight models. Its recall and mAP50 reach 36.8% and 37.3%, respectively, improvements of 4.8 and 3.7 percentage points over the baseline YOLOv11n (32.0% recall and 33.6% mAP50), and it significantly outperforms mainstream lightweight models such as YOLOv8n, YOLOv9t, and YOLOv10n. In terms of real-time inference performance on aerial detection tasks, YOLOSO achieves 20.53 FPS on the DOTAv1 dataset. Although this is lower than that of the lightweight baseline models, including YOLOv8n (27.47 FPS), YOLOv9t (27.6 FPS), YOLOv10n (35.8 FPS), and YOLOv11n (26.53 FPS), we consider the reduction in inference speed acceptable for practical UAV deployment. The improvement in detection accuracy compensates for the loss of real-time performance, and the achieved frame rate meets the basic real-time requirements of conventional UAV remote sensing and aerial monitoring tasks, giving a reasonable balance between high-precision detection and practical deployment efficiency. The medium-to-large version YOLOSO-S reduces parameters by 53.6% compared with RT-DETR-L (14.85M vs. 32.0M), while all performance metrics are significantly improved (mAP50 of 45.3% vs. 37.8%), verifying the superiority of the model at different scales.