4.1. Experimental Environment
The VisDrone2019 dataset, curated and released by the AISKYEYE team at the Laboratory of Machine Learning and Data Mining, Tianjin University, is adopted as the core experimental data source in this work. The dataset is divided into three subsets of 6471 training images, 548 validation images, and 1610 test images, with annotations covering 10 object categories for detection tasks. The training subset alone contains 353,550 labeled instances. The ultra-small scale of the objects, combined with intricate background contexts, presents substantial challenges for small object detection in UAV aerial imagery. An example from the VisDrone2019 dataset is shown in Figure 7.
Among the 353,550 labeled object instances in the training split, small people and cars account for the largest proportion. Statistically, 212,630 objects occupy fewer than 32 × 32 pixels, of which 34,827 cover fewer than 10 × 10 pixels. These characteristics pose considerable challenges for detection models and degrade detection performance. The size distribution of these objects is shown in Figure 8.
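The size statistics above can be reproduced by bucketing each annotated box by its pixel area. The following is a minimal sketch, not the authors' script. It assumes annotations in the comma-separated VisDrone detection text format (left, top, width, height, score, category, truncation, occlusion) and assumes that "fewer than 32 × 32 pixels" means a box area below 1024 pixels. The function names and the sample lines are illustrative.

```python
from collections import Counter


def size_bucket(width, height):
    """Assign a box to a size bucket by pixel area (thresholds are assumptions)."""
    area = width * height
    if area < 10 * 10:
        return "tiny"    # fewer than 10 x 10 pixels
    if area < 32 * 32:
        return "small"   # at least 10 x 10 but fewer than 32 x 32 pixels
    return "larger"


def count_buckets(annotation_lines):
    """Count boxes per size bucket from VisDrone-style annotation lines."""
    counts = Counter()
    for line in annotation_lines:
        fields = line.strip().rstrip(",").split(",")
        if len(fields) < 8:
            continue  # skip malformed lines
        width, height = int(fields[2]), int(fields[3])
        counts[size_bucket(width, height)] += 1
    return counts


# Hypothetical annotation lines, for illustration only.
sample = [
    "10,20,8,9,1,1,0,0",     # 8 x 9 = 72 pixels
    "50,60,20,25,1,4,0,1",   # 20 x 25 = 500 pixels
    "5,5,100,80,1,9,0,0",    # 100 x 80 = 8000 pixels
]
counts = count_buckets(sample)
print(dict(counts))
```

Running the same counter over all training annotation files would give the per-bucket totals, provided the area-based definition matches the one used for the reported figures.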
In addition, most objects in the VisDrone2019 dataset are tiny, and such small objects are susceptible to annotation errors caused by occlusion, blur, and other adverse factors. To address this issue, the dataset first defines an ignored-region category, which marks areas that are difficult to annotate precisely because of low resolution or dense crowds. Second, it divides the occlusion degree of individual objects into three levels, namely 0 (no occlusion), 1 (partial occlusion), and 2 (severe occlusion). When training SODet-YOLO on this dataset, we remove all ignored regions and keep only objects with no occlusion or partial occlusion, so that only clear and accurately annotated samples are used for model optimization. This prevents the model from fitting ambiguous, incorrect, or unidentifiable annotations and allows it to concentrate on learning effective object features.
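The filtering rule described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' preprocessing code: it assumes the VisDrone detection annotation layout (left, top, width, height, score, category, truncation, occlusion) with category id 0 denoting ignored regions, and the helper names are hypothetical.

```python
IGNORED_REGION = 0        # assumed VisDrone category id for ignored regions
KEPT_OCCLUSION = {0, 1}   # 0 = no occlusion, 1 = partial occlusion


def keep_annotation(line):
    """Return True if the annotation should be used for training."""
    fields = line.strip().rstrip(",").split(",")
    category, occlusion = int(fields[5]), int(fields[7])
    return category != IGNORED_REGION and occlusion in KEPT_OCCLUSION


def filter_annotations(lines):
    """Drop ignored regions and severely occluded objects."""
    return [line for line in lines if keep_annotation(line)]


# Hypothetical annotation lines, for illustration only.
sample = [
    "10,20,30,40,1,4,0,0",   # car, no occlusion        -> kept
    "15,25,12,18,1,1,0,1",   # pedestrian, partial      -> kept
    "40,50,10,10,1,2,0,2",   # people, severe occlusion -> dropped
    "0,0,200,100,0,0,0,0",   # ignored region           -> dropped
]
kept = filter_annotations(sample)
print(len(kept), "of", len(sample), "annotations kept")
```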
The experiments are run on an RTX 4090 graphics card. All experiments are implemented in Python 3.9.21 with the PyTorch 1.12.0 framework, and the specific version details are presented in Table 1. The training parameters are set to 200 epochs, 4 workers, and a batch size of 12. Stochastic Gradient Descent (SGD) is adopted as the optimization algorithm, and training is terminated if there is no performance improvement for 50 consecutive epochs. Finally, the initial learning rate and the final learning rate are both set to 0.01, as detailed in Table 2.
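For readers who want to reproduce a comparable setup, the settings above can be collected in one place. The sketch below assumes the Ultralytics training interface, which the text does not name explicitly. The argument names (`epochs`, `workers`, `batch`, `optimizer`, `patience`, `lr0`, `lrf`) are Ultralytics conventions, and the model and dataset file names are placeholders. Note that in Ultralytics `lrf` is a fraction of `lr0` rather than an absolute final rate, so mapping the reported value onto it is an assumption.

```python
# Training settings as reported in the text, keyed by assumed Ultralytics names.
TRAIN_ARGS = {
    "epochs": 200,        # maximum number of training epochs
    "workers": 4,         # data-loading workers
    "batch": 12,          # batch size
    "optimizer": "SGD",   # stochastic gradient descent
    "patience": 50,       # stop after 50 epochs without improvement
    "lr0": 0.01,          # initial learning rate
    "lrf": 0.01,          # final learning-rate setting (assumed mapping)
}


def train(model_cfg="yolo11n.yaml", data_cfg="VisDrone.yaml"):
    """Launch training; both file names are placeholders, not the authors' files."""
    from ultralytics import YOLO  # imported lazily so the settings load without it

    model = YOLO(model_cfg)
    return model.train(data=data_cfg, **TRAIN_ARGS)


if __name__ == "__main__":
    for key, value in TRAIN_ARGS.items():
        print(f"{key}: {value}")
```

Calling `train()` requires the `ultralytics` package and a dataset configuration; the values in Table 2 remain the authoritative record of the settings.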
4.2. Ablation Experiments
As shown in Table 3, this ablation experiment uses YOLO11n as the base model. By gradually incorporating the P2 detection head, the FGA-AFPN neck network, the C3k2_IDC module, the MPDInterpIoU loss function, and the PPA module, it systematically verifies the impact of each improved module on both mAP@0.5 and the computational cost.
First, Model ② is obtained by adding the P2 detection head to YOLO11n. This model captures the detailed features of objects more effectively, improving the mAP@0.5 by 4.349 percentage points to 36.916%. Then, Model ③ is derived by replacing the neck network of Model ② with FGA-AFPN. This modification enables full fusion of features across different levels, and the built-in spatial and channel attention mechanisms further highlight salient features while suppressing background interference, raising the mAP@0.5 by a further 2.217 points to 39.133%. Introducing the C3k2_IDC module into Model ③ yields Model ④; this expands the receptive field and allows global information to be captured, giving a slight improvement of 0.335 points to 39.468%. To optimize the regression process, Model ④ is trained with the MPDInterpIoU loss function for bounding box regression to yield Model ⑤. This loss alleviates the gradient vanishing problem during early-stage regression. Moreover, the incorporated width–height deviation and center point distance terms further optimize bounding box regression, improving the mAP@0.5 by 0.849 points to 40.317%. Finally, the PPA module is introduced into Model ⑤ to preserve key features during downsampling, yielding Model ⑥ and boosting the mAP@0.5 by 1.17 points to 41.487%. The specific metrics of each model are shown in Figure 9.
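The reported gains form a simple additive chain, which can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above. The baseline value is implied by the first reported gain rather than quoted from Table 3, and the variable names are illustrative.

```python
# (model label, reported gain in percentage points, reported mAP@0.5 in %)
steps = [
    ("Model 2: + P2 head", 4.349, 36.916),
    ("Model 3: + FGA-AFPN", 2.217, 39.133),
    ("Model 4: + C3k2_IDC", 0.335, 39.468),
    ("Model 5: + MPDInterpIoU", 0.849, 40.317),
    ("Model 6: + PPA", 1.170, 41.487),
]

# Baseline YOLO11n mAP@0.5 implied by the first step (not quoted from Table 3).
baseline = round(steps[0][2] - steps[0][1], 3)

# Accumulate the gains and compare each running total with the reported value.
running = baseline
consistent = True
for name, gain, reported in steps:
    running = round(running + gain, 3)
    consistent = consistent and abs(running - reported) < 1e-6
    print(f"{name}: {running:.3f}% (reported {reported:.3f}%)")

total_gain = round(steps[-1][2] - baseline, 3)
print(f"implied baseline: {baseline:.3f}%, total gain: {total_gain:.3f} points")
```

Under these figures, the five modifications together account for a cumulative gain of about 8.92 percentage points over the implied baseline.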
To verify the effectiveness of MPDInterpIoU, Model ⑦ is obtained by training Model ④ with MPDIoU instead. Its mAP@0.5 reaches 39.795%, an increase of 0.327 percentage points over Model ④, which is well below the 0.849-point improvement achieved by MPDInterpIoU. We also conduct ablation studies on individual modifications to YOLO11n, separately incorporating the C3k2_IDC module, incorporating the PPA module, and training with the MPDInterpIoU loss function. Among these, the PPA module achieves the largest improvement in mAP@0.5 but also incurs the highest computational cost, while MPDInterpIoU yields the smallest improvement. A plausible reason is that although MPDInterpIoU accelerates bounding box regression for YOLO11n in the early training phase, further performance gains become difficult after approximately 200 iterations.
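For reference, the MPDIoU comparison loss can be sketched as below. This follows the commonly published MPDIoU definition, in which IoU is penalized by the squared distances between corresponding top-left and bottom-right corners, normalized by the squared image diagonal. It is not the authors' MPDInterpIoU, whose formulation is not reproduced here, and the function names are illustrative.

```python
def mpdiou(box_a, box_b, img_w, img_h):
    """MPDIoU of two (x1, y1, x2, y2) boxes in an img_w x img_h image."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU from the intersection rectangle.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between matching corners.
    d1_sq = (ax1 - bx1) ** 2 + (ay1 - by1) ** 2   # top-left corners
    d2_sq = (ax2 - bx2) ** 2 + (ay2 - by2) ** 2   # bottom-right corners
    norm = img_w ** 2 + img_h ** 2                # squared image diagonal

    return iou - d1_sq / norm - d2_sq / norm


def mpdiou_loss(box_a, box_b, img_w, img_h):
    """Regression loss: zero for identical boxes, larger for worse matches."""
    return 1.0 - mpdiou(box_a, box_b, img_w, img_h)


# Non-overlapping boxes still receive a distance-dependent loss value.
print(mpdiou_loss((0, 0, 10, 10), (20, 20, 30, 30), 100, 100))
```

Because the corner-distance terms remain informative when the boxes do not overlap, the loss still varies with box position in that case, which plain IoU does not.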
While the mAP@0.5 gradually improves as modifications accumulate, the computational complexity and parameter count increase accordingly. FGA-AFPN is among the most expensive modifications: it increases the average inference time per image by 10.8 ms and adds 0.63 M parameters and 4.2 GFLOPs. However, it improves the mAP@0.5 by 2.217 percentage points, second only to the P2 small object detection layer, indicating favorable overall effectiveness. Introducing the PPA module improves the mAP@0.5 by 1.17 points, but it adds 6.0 GFLOPs, more than FGA-AFPN, and increases the average inference time per image by 14.1 ms. Its overall effectiveness is therefore inferior to that of FGA-AFPN. The other modules and loss functions contribute relatively minor improvements, with correspondingly small increases in cost. Compared with YOLO11n, SODet-YOLO adds 1.19 M parameters and 19.7 GFLOPs, raising the training cost of the model. The average inference time increases from 25.5 ms to 60.2 ms, an increase of approximately 136%.
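The cost-effectiveness comparison between FGA-AFPN and PPA can be made concrete by dividing each accuracy gain by its added cost. The sketch below uses only the figures quoted above. The ratio metrics themselves are an illustrative choice, not a measure defined in the paper.

```python
# Reported per-module increments: mAP@0.5 gain (points), GFLOPs, latency (ms).
modules = {
    "FGA-AFPN": {"map_gain": 2.217, "gflops": 4.2, "latency_ms": 10.8},
    "PPA": {"map_gain": 1.17, "gflops": 6.0, "latency_ms": 14.1},
}


def gain_per_gflop(module):
    """mAP@0.5 points gained per added GFLOP."""
    return module["map_gain"] / module["gflops"]


def gain_per_ms(module):
    """mAP@0.5 points gained per added millisecond of inference time."""
    return module["map_gain"] / module["latency_ms"]


for name, module in modules.items():
    print(f"{name}: {gain_per_gflop(module):.3f} points/GFLOP, "
          f"{gain_per_ms(module):.3f} points/ms")

# Relative latency increase of SODet-YOLO over YOLO11n (25.5 ms -> 60.2 ms).
latency_increase_pct = (60.2 - 25.5) / 25.5 * 100
print(f"latency increase: {latency_increase_pct:.0f}%")
```

On both ratios FGA-AFPN returns more than twice as much accuracy per unit of cost as PPA, which is consistent with the ranking given in the text.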
Overall, starting from YOLO11n, SODet-YOLO progressively integrates the FGA-AFPN, C3k2_IDC, and PPA modules and adopts the novel MPDInterpIoU loss function. These designs specifically target common bottlenecks in UAV aerial tiny object detection, namely the feature degradation of small objects, the loss of shallow fine-grained information, and inadequate cross-level feature fusion, and together they achieve a marked rise in mAP@0.5. The deployment of high-resolution detection heads and multi-scale feature fusion strategies inevitably increases the model parameters, computational complexity, and inference time; nevertheless, these structural designs are indispensable for high-precision detection in UAV aerial scenarios.
The average single-frame inference time of SODet-YOLO is 60.2 ms, which is equivalent to about 16.6 FPS. Although this allows UAV aerial videos to be processed in real time, the real-time margin is relatively constrained. As the ablation experiments reveal, the PPA module causes the largest increase in average inference time and severely limits real-time detection performance. Accordingly, the PPA module can be discarded when computing resources are scarce, giving a lower average single-frame inference time at only a moderate cost in detection accuracy. In general, SODet-YOLO is mainly dedicated to improving the detection accuracy of tiny objects in UAV aerial scenes, and no lightweight optimization has been implemented in the current work. In future studies, we aim to substantially reduce computational cost while causing only a slight decline in detection accuracy.
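The frame-rate figures follow directly from the per-frame latency. The sketch below converts the reported 60.2 ms to FPS and estimates the frame rate with the PPA module removed. That estimate assumes the 14.1 ms increment reported for PPA in the ablation carries over unchanged, which is an approximation rather than a measured result.

```python
def fps(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


full_latency_ms = 60.2     # reported SODet-YOLO latency
ppa_increment_ms = 14.1    # reported latency added by the PPA module

full_fps = fps(full_latency_ms)

# Estimated latency without PPA (assumes the increment is simply subtracted).
no_ppa_latency_ms = full_latency_ms - ppa_increment_ms
no_ppa_fps = fps(no_ppa_latency_ms)

print(f"with PPA:    {full_latency_ms:.1f} ms -> {full_fps:.1f} FPS")
print(f"without PPA: {no_ppa_latency_ms:.1f} ms -> {no_ppa_fps:.1f} FPS (estimate)")
```

Under this assumption, dropping PPA would trade the 1.17-point mAP@0.5 gain reported in the ablation for roughly five additional frames per second.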
We additionally select three images at random and test them with each model from the ablation experiment; Figure 10 displays the detection results for each model.
In the first image, given the ultra-small size of distant objects in UAV aerial photographs, Model ① leaves a substantial number of objects undetected. As each module is introduced, detection performance gradually improves and missed detections are significantly reduced. In the second image, Model ① produces not only a large number of missed detections but also a misclassification: an entire row of Car objects in the lower-left area is labeled as Truck. After the successive improvements, most objects are detected, yet the row of Car objects in the lower-left area is still missed by the intermediate models; only the final Model ⑥ detects the objects in this area. This result demonstrates the strong performance of the improved model proposed in this paper. Finally, the results for the third image show that the number of missed detections decreases overall as more modules are added. Because Motor objects are extremely small in aerial photography scenarios, Model ① fails to extract their features sufficiently, leading to numerous missed detections. In contrast, the final Model ⑥ identifies a large number of the Motor objects that Model ① misses. Overall, Model ⑥ achieves the best detection capability while Model ① performs the worst, which is consistent with the ablation results.
4.3. Confusion Matrix Comparison Experiment
As illustrated in Figure 11, the vertical axis indicates the ground-truth categories and the horizontal axis corresponds to the predicted class labels. The proposed SODet-YOLO model attains higher classification accuracy than the baseline YOLO11n for the majority of object classes, with the sole exception of the Bus category, for which performance decreases slightly. The most substantial gain in the classification rate is observed for the People category, which increases from 0.43 to 0.69. This improvement substantiates the model’s ability to extract features associated with small objects. Overall, the proposed algorithm yields a marked enhancement in small object detection capability without materially compromising performance on larger-scale objects.
Although SODet-YOLO achieves improved recognition accuracy for nearly all categories compared with YOLO11n, the correct classification rates of several categories remain relatively low, including People, Bicycle, Van, Tricycle, and Awning-tricycle. Specifically, the People category still has a false-negative rate of approximately 31%. This is mainly because most People objects appear on motorcycles or bicycles, which causes severe occlusion and makes effective feature learning more difficult. In addition, categories such as Bicycle, Van, Tricycle, and Awning-tricycle have few samples in the training set, leading to an extremely imbalanced class distribution. Insufficient training on these categories further raises their false-negative rates. In summary, SODet-YOLO still has considerable room for improvement in classification accuracy.
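The per-class rates discussed above are read from the diagonal of a normalized confusion matrix. The sketch below assumes rows are ground-truth classes and columns are predictions, matching the axis description in the text, and that each row is normalized by its own total. The counts are hypothetical, chosen only so that the People row reproduces the reported rate of 0.69.

```python
def row_normalize(matrix):
    """Normalize each ground-truth row so that its entries sum to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def per_class_rates(matrix, labels):
    """Correct-classification and false-negative rate for each class."""
    normalized = row_normalize(matrix)
    rates = {}
    for i, label in enumerate(labels):
        correct = normalized[i][i]
        rates[label] = {"correct": correct, "false_negative": 1.0 - correct}
    return rates


# Hypothetical counts: rows = ground truth, columns = predictions.
labels = ["People", "Car", "background"]
matrix = [
    [69, 3, 28],   # People: 69 correct, 3 confused with Car, 28 missed
    [2, 90, 8],    # Car
    [5, 5, 0],     # background predicted as an object (false positives)
]
rates = per_class_rates(matrix, labels)
print(rates["People"])
```

With this normalization, a correct-classification rate of 0.69 for People corresponds directly to the false-negative rate of about 31% quoted above.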
4.4. Object Detection for Various Sizes
As shown in Figure 12, we calculate the AP for objects of different sizes on the VisDrone2019 test set. The results indicate that the improved model delivers relatively weak detection performance for objects smaller than 10 × 10 pixels, yet it still outperforms the baseline YOLO11n model. For objects between 10 × 10 and 32 × 32 pixels, both AP@[0.5:0.95] and AP@0.5 achieve their most substantial improvements. This verifies that the proposed algorithm markedly strengthens detection performance for small objects; for large objects, it remains slightly better than YOLO11n.
For objects smaller than 10 × 10 pixels, SODet-YOLO achieves only marginal performance improvements, and its overall detection capability in this range is still quite limited. One primary reason is that objects of this size make up a small proportion of the training set, with only 34,827 samples accounting for 9.9% of the total. In addition, severe class imbalance exists among these tiny objects, which are dominated by categories such as People, Pedestrian, and Motor. As a result, the model concentrates mainly on objects between 10 × 10 and 32 × 32 pixels and pays insufficient attention to most ultra-small objects. Another factor is the extreme scarcity of effective feature information for such miniature objects. Even though the feature extraction capability has been greatly improved and the loss of fine-grained features alleviated, the model still fails to extract adequately discriminative features from these ultra-small objects, which restricts the detection performance of SODet-YOLO in this size range.
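The share of each size range in the training set follows from the instance counts reported for VisDrone2019. The sketch below assumes that the 212,630 objects under 32 × 32 pixels include the 34,827 objects under 10 × 10 pixels. The variable names are illustrative.

```python
total_instances = 353_550   # labeled training instances
under_32 = 212_630          # objects smaller than 32 x 32 pixels
under_10 = 34_827           # objects smaller than 10 x 10 pixels (assumed subset)


def share(count, total=total_instances):
    """Percentage of the training instances represented by count."""
    return count / total * 100


share_under_10 = share(under_10)
share_10_to_32 = share(under_32 - under_10)
share_over_32 = share(total_instances - under_32)

print(f"< 10 x 10:          {share_under_10:.1f}%")
print(f"10 x 10 to 32 x 32: {share_10_to_32:.1f}%")
print(f">= 32 x 32:         {share_over_32:.1f}%")
```

Under this assumption the 10 × 10 to 32 × 32 range holds about half of all training instances, roughly five times as many as the sub-10 × 10 range, which is consistent with the model favoring the former.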
The AP@0.5 values for objects across different size intervals are computed for each model in the ablation study, as illustrated in Figure 13. For objects smaller than 10 × 10 pixels, the intrinsic difficulty of detection keeps improvements very limited, and the gains that do occur can be attributed mainly to the incorporation of the P2 detection head. For objects between 10 × 10 and 32 × 32 pixels, the AP increases progressively as the proposed modules are successively integrated, exhibiting the most pronounced gain among all size ranges. UAV-acquired aerial images show substantial scale variation: small objects constitute the majority, yet larger instances also appear. The proposed algorithm enhances detection accuracy for small objects without degrading performance on larger ones, because the combined modules also raise detection performance for large- and medium-sized objects. Although this enhancement is less marked than that achieved for small objects, the overall result represents a significant advance in small-object detection capability.
4.6. Experimental Results on Other Datasets
To further validate the generalization ability of the improved model, training experiments are carried out on the Aerial traffic image dataset for road traffic detection scenarios. This public dataset is hosted on the Kaggle platform and was created by Roboflow user Shaha; it provides annotated aerial videos of traffic scenes in Almaty captured by drones. The dataset contains 1710 training images, 558 validation images, and 440 test images, with object categories including PMT, articulated-bus, bus, car, freight, motorbike, small bus, and truck. Objects labeled as PMT are extremely scarce in the dataset, which makes it difficult to train this category adequately. As a result, the model performs well on the other categories, while the AP@0.5 of the PMT category is almost zero. We therefore exclude this category from the experimental analysis. The training parameters are otherwise unchanged, except that the number of training epochs is reduced from 200 to 100.
As illustrated by the experimental data in Table 4, both YOLO11n and the proposed SODet-YOLO achieve outstanding performance for large vehicles such as Freight, Truck, and Bus, with the improved model attaining marginally higher accuracy for these categories overall. The baseline model exhibits comparatively lower detection accuracy for small objects, specifically Car and Motorbike, with values of 73.6% and 20.6%, respectively, suggesting a tendency toward missed and false detections in small-object scenarios. In contrast, the improved model yields accuracies of 83.5% for Car and 56.1% for Motorbike, corresponding to improvements of 9.9 and 35.5 percentage points, respectively. These findings confirm that the proposed algorithm substantially enhances small-object detection performance.
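The gains quoted for Car and Motorbike are absolute differences in percentage points; the corresponding relative improvements are considerably larger for the weaker baseline class. The short sketch below computes both from the accuracies quoted above. The relative figures are derived here for illustration and are not reported in Table 4.

```python
# (baseline YOLO11n accuracy in %, SODet-YOLO accuracy in %), as quoted in the text.
accuracy = {
    "Car": (73.6, 83.5),
    "Motorbike": (20.6, 56.1),
}


def point_gain(baseline, improved):
    """Absolute gain in percentage points."""
    return round(improved - baseline, 1)


def relative_gain(baseline, improved):
    """Gain relative to the baseline value, in percent."""
    return (improved - baseline) / baseline * 100


for name, (baseline, improved) in accuracy.items():
    print(f"{name}: +{point_gain(baseline, improved)} points, "
          f"+{relative_gain(baseline, improved):.1f}% relative")
```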
In addition, we select two dedicated datasets for vehicle and pedestrian detection tasks, namely the Top-View Drone Car Detection Dataset and the Tiny Person Dataset. The Top-View Drone Car Detection Dataset is obtained from the Kaggle platform and split into training and validation sets, which contain 11,586 and 794 images, respectively, with only the vehicle category involved. The Tiny Person Dataset includes 1610 training images and 759 validation images, with a total of 72,651 annotations. It divides people into two categories: sea people and earth people. In the experiments, we unexpectedly observe that outstanding performance can be achieved on the Top-View Drone Car Detection Dataset after training for only 5 epochs. In contrast, because the objects in the Tiny Person Dataset are extremely tiny, favorable detection metrics cannot be obtained even with 200 training epochs. The specific experimental results are shown in Table 5 and Table 6.
As shown in Table 5 and Table 6, we compare the experimental results of YOLO11n and SODet-YOLO on the Top-View Drone Car Detection Dataset and the Tiny Person Dataset. Both models achieve excellent performance on the simple Top-View Drone Car Detection Dataset. Notably, SODet-YOLO surpasses YOLO11n in all evaluation metrics, with particularly substantial improvements in mAP@0.5 and Recall. Nevertheless, both methods obtain relatively limited performance on the challenging Tiny Person Dataset, because the objects in this dataset are extremely tiny and detection models struggle to extract effective detailed features. Even under such harsh conditions, SODet-YOLO still delivers better detection performance, which validates the effectiveness of the proposed model modifications.