4.3.1. Comparative Experiments on Anchor Box Parameter Settings
To validate the effectiveness of increasing the number of anchor boxes for small object detection, we conducted comparative experiments using a network with only two prediction layers. The number of anchor boxes was set to 3, 4, 5, 6, and 7 in turn. The experiments were performed on the VisDrone training dataset, with subsequent evaluation on the validation set. The experimental results are shown in Table 3.
In the table, we report the recall and GFLOPs obtained with each number of anchor boxes. GFLOPs denotes the number of floating-point operations (in billions); it can be interpreted as the computational cost and is used to measure the complexity of an algorithm. It can be observed that, when the number of anchor boxes was set to 4, the network achieved a noticeable improvement in recall. However, when the number of anchor boxes was increased to 5 or 6, the recall did not improve further and differed little from that obtained with 4 anchor boxes, indicating a certain degree of redundancy among the anchor boxes. Furthermore, when the number of anchor boxes was increased to 7, the recall decreased dramatically, suggesting that an excessive number of anchor boxes can negatively impact the network's performance. Thus, the most reasonable number of anchor boxes was determined to be 4. At this setting, the network's GFLOPs increased only from 128.9 to 129.1, with no significant increase in computational complexity, while the matching success rate between the dataset's targets and the prior boxes improved. Further increasing the number of anchor boxes not only failed to improve the network's performance significantly but also increased the computational workload. The experimental results indicate that, with 4 anchor boxes, the model's computational complexity remains within a reasonable range while the detection of small objects improves to some extent, so more small objects can be correctly detected.
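The matching success rate referred to above can be estimated offline by checking how many ground-truth boxes are covered by at least one anchor at a given IoU threshold. The following is a minimal sketch of such a check, assuming object widths and heights have already been extracted from the VisDrone labels; the 4-anchor values shown are placeholders for illustration, not the anchors used in this work.

```python
import numpy as np

def anchor_recall(gt_wh, anchors_wh, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by at least one anchor,
    using width-height IoU (boxes compared as if centre-aligned)."""
    gt = gt_wh[:, None, :]                            # (N, 1, 2)
    an = anchors_wh[None, :, :]                       # (1, K, 2)
    inter = np.minimum(gt, an).prod(axis=2)           # overlap area
    union = gt.prod(axis=2) + an.prod(axis=2) - inter
    iou = inter / union                               # (N, K)
    return (iou.max(axis=1) >= iou_thr).mean()

# Hypothetical example: ground-truth sizes (pixels) and a placeholder
# 4-anchor set for one prediction layer.
gt_wh = np.array([[12, 18], [25, 40], [8, 10], [60, 45]], dtype=float)
anchors_wh = np.array([[10, 13], [16, 30], [33, 23], [55, 40]], dtype=float)
print(f"anchor recall: {anchor_recall(gt_wh, anchors_wh):.2f}")
```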
4.3.3. Ablation Experiment
In this study, we conducted a series of ablation experiments to assess the individual contribution of each proposed enhancement to the model's performance. These enhancements include early feature extraction together with the removal of the large-object detection head (referred to as YOLOv7-T), the incorporation of an additional anchor box (YOLOv7-TA), the replacement of the CIoU loss function with EIoU (YOLOv7-TAE), and the substitution of the feature fusion module in the FPN with the attention-based self-adaptive fusion module introduced in this paper (SODCNN).
The experiments are evaluated on the validation set. Evaluation metrics including the number of parameters, mAP@0.5, mAP@0.5:0.95, AP on small, medium, and large targets, GFLOPs, FPS, and training time are employed to compare the performance of each model. The experimental results are presented in Figure 10 and Table 4.
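Of these metrics, the parameter count and GFLOPs are architecture-level quantities that can be obtained directly from the model. The sketch below shows one common way to do this, assuming the third-party thop profiler is available and using a torchvision backbone purely as a stand-in, since the SODCNN implementation is not reproduced here.

```python
import torch
from torchvision.models import resnet18   # stand-in; replace with the detector under test
from thop import profile                  # assumed dependency: pip install thop

model = resnet18()
dummy = torch.randn(1, 3, 640, 640)       # the input resolution used in this work

params = sum(p.numel() for p in model.parameters())
macs, _ = profile(model, inputs=(dummy,), verbose=False)

print(f"parameters: {params / 1e6:.2f} M")
# Reporting GFLOPs as 2 x MACs is one common convention in the YOLO literature.
print(f"GFLOPs (approx., 2 x MACs): {2 * macs / 1e9:.1f}")
```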
The results clearly demonstrate the efficacy of each enhancement. Notably, the large increase in GFLOPs for YOLOv7-T compared with YOLOv7 indicates a higher computational cost, but its mAP improved significantly from 49.57% to 52.57%, and the AP values on small, medium, and large targets as well as the detection speed at test time also improved substantially. Regarding training time, moving feature extraction to the shallower layers increased the training time from 13.89 h to 15.84 h for 250 epochs, while removing the large-object detection head reduced it to 14.71 h without affecting large-object detection; compared with the original YOLOv7, the training time of YOLOv7-T therefore did not change greatly. The overall performance gain highlights the importance of early feature extraction, as it effectively mitigates the loss of valuable information about small objects in the deep layers.
Moreover, the addition of an extra anchor box in YOLOv7-TA resulted in a modest but noticeable 0.43% improvement in mAP compared with YOLOv7-T, and the AP values on the three target sizes were also further improved. Meanwhile, the GFLOPs and training time increased only marginally, indicating the positive impact of a larger number of anchor boxes on detection performance, particularly for densely packed small objects. Furthermore, the analysis of convergence speed revealed that setting the number of anchor boxes to 4 accelerated the model's convergence to a certain extent.
Additionally, YOLOv7-TAE demonstrated enhanced detection accuracy while maintaining a parameter count and computational complexity comparable to those of YOLOv7-TA. Replacing the CIoU loss with the EIoU loss led the model to optimize its parameters in a more reasonable direction, contributing to the improved performance.
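For reference, EIoU augments the IoU term with separate penalties for centre distance, width difference, and height difference, each normalised by the smallest enclosing box. The following is a minimal PyTorch sketch of an EIoU loss for boxes in (x1, y1, x2, y2) format; it is written from the published EIoU formulation rather than from the authors' code.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4).
    L_EIoU = 1 - IoU + rho^2(centres)/c^2 + (dw^2)/cw^2 + (dh^2)/ch^2."""
    # Intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    iou = inter / union

    # Smallest enclosing box (normalising terms)
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    cw2 = (cx2 - cx1).pow(2) + eps   # squared enclosing width
    ch2 = (cy2 - cy1).pow(2) + eps   # squared enclosing height
    c2 = cw2 + ch2                   # squared enclosing diagonal

    # Centre-distance, width, and height penalties
    rho2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])).pow(2) / 4 \
         + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])).pow(2) / 4
    dw2 = ((pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])).pow(2)
    dh2 = ((pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])).pow(2)

    return (1 - iou + rho2 / c2 + dw2 / cw2 + dh2 / ch2).mean()
```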
Lastly, SODCNN exhibited further gains over YOLOv7-TAE: the introduction of the ECF structure improved the network while adding only a small number of parameters and little computational complexity, emphasizing the superiority of the proposed dynamic allocation of fusion weights over simple linear fusion. This adaptive fusion mechanism enables the model to flexibly and comprehensively leverage the information captured in the fused feature maps, leading to enhanced detection capabilities.
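The exact ECF structure is defined earlier in the paper; purely to illustrate the idea of replacing fixed linear fusion with learned, content-dependent weights, the sketch below shows one common realisation, in which per-pixel weights are predicted by 1 × 1 convolutions and normalised with a softmax. The module name and channel counts are illustrative and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse two same-shape feature maps with learned per-pixel weights
    (illustrative stand-in for attention-based self-adaptive fusion)."""
    def __init__(self, channels):
        super().__init__()
        # One scalar weight map per input branch
        self.w_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.w_b = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # Predict a weight for each branch at every spatial position,
        # then normalise so the two weights sum to 1.
        logits = torch.cat([self.w_a(feat_a), self.w_b(feat_b)], dim=1)  # (N, 2, H, W)
        weights = torch.softmax(logits, dim=1)
        return weights[:, 0:1] * feat_a + weights[:, 1:2] * feat_b

# Example: fuse a shallow feature map with an upsampled deep one of equal shape.
fuse = AdaptiveFusion(channels=256)
a, b = torch.randn(1, 256, 80, 80), torch.randn(1, 256, 80, 80)
print(fuse(a, b).shape)   # torch.Size([1, 256, 80, 80])
```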
Overall, these ablation experiments provide valuable insights into the effectiveness and significance of each improvement, highlighting the potential of the proposed enhancements for improving small object detection performance.
In order to validate the effectiveness of the proposed algorithm in practical applications, we conducted a visual analysis on the test set. We show the visualization results of YOLOv7 and SODCNN in three different scenes: daytime, nighttime, and motion blur, shown in Figure 11, Figure 12, and Figure 13, respectively. We also compared the number of targets detected by the two models in these detection scenarios in Table 5.
Figure 11 shows the visualization results in a daytime scene: Figure 11a is the original input image, and Figure 11b is the ground-truth labeled image, with the number of labeled targets counted in Table 5. Figure 11c,d show the detection results of YOLOv7 and SODCNN, respectively. According to the visualization results and Table 5, SODCNN correctly detects 16 targets, achieving higher detection accuracy than YOLOv7.
Figure 12 shows the visualization results in the nighttime scene. Comparing Figure 12c,d, our proposed algorithm is able to detect more small objects in the nighttime scene and accurately identify partially overlapping and occluded targets, increasing the number of detected targets from 83 to 107.
Figure 13 shows the detection performance in a motion blur scenario, where the texture details of small objects are lost and their features are distorted, resulting in a significant number of missed detections. However, as observed in Figure 13d, the improved model can still accurately identify pedestrians in the motion blur scene, and the number of correctly detected targets increased from 4 to 10, indicating enhanced robustness. By performing feature extraction in the shallow layers and adaptively fusing these features into the neck module, our model retains and effectively utilizes the features of small objects, allowing it to learn comprehensive information about small objects in complex scenes. Furthermore, the incorporation of additional anchor boxes addresses the issue of missed detections in crowded and overlapping scenarios. Overall, our proposed model exhibits superior adaptability for small object detection in complex scenes.
To verify the applicability of our algorithm to other detection tasks, we applied it to the CARPK dataset. This dataset, proposed by Hsieh et al. [43] in 2017, is a collection of nearly 90,000 cars from 4 different parking lots collected by drones, containing 989 images in the training set and 459 images in the validation set. The original YOLOv7 and the SODCNN proposed in this paper are compared under the same experimental conditions. The experimental results are shown in Table 6: with respect to the original YOLOv7, our proposed algorithm improves mAP@0.5 from 98.41% to 99.18% and mAP@0.5:0.95 from 71.63% to 74.26%. These results show that our model retains its superiority on different detection tasks and can be applied in different detection scenarios.
4.3.4. Comparison with Other Methods
According to Figure 8a, the number of labels varies greatly from class to class: bus, tricycle, and truck have relatively few labels, with 251, 532, and 750 labels, respectively, whereas car and pedestrian have far more, with 14,064 and 8844 labels, respectively. In order to verify the effectiveness of our algorithm on classes with different numbers of training labels, we compared the per-class AP of YOLOv7, TPH-YOLOv5, and our proposed algorithm. The experimental results are shown in Table 7: for every class, our proposed algorithm achieves the best performance, and it remains effective in improving detection accuracy for classes with few training labels. Compared with YOLOv7, our algorithm improves the AP of the small targets pedestrian, people, and bicycle by 6.8%, 5.6%, and 7.7%, respectively, and that of the relatively large targets car, van, and truck by 2.4%, 1.3%, and 3.3%, respectively. It can be seen that the improvement is more pronounced on small targets.
We conducted comparative experiments involving prominent object detection algorithms on the VisDrone dataset, including YOLOv3 [44], YOLOv4 [45], YOLOv5l, YOLOv6s [46], YOLOv8s [47], Cascade R-CNN, RetinaNet, TPH-YOLOv5, PicoDet [48], PP-YOLOE [49], and EL-YOLOv5s [50]. The data presented in Table 8 make it evident that our proposed algorithm surpasses these models. Specifically, our model achieves a remarkable improvement in mAP@0.5, surpassing YOLOv8s by 11.9%, and its mAP@0.5:0.95 is boosted by 6.65%; YOLOv8 is the latest member of the YOLO series of target detection algorithms. In addition, our proposed algorithm outperforms the mAP@0.5 of the two-stage algorithm Cascade R-CNN by 31.92%, and outperforms the anchor-free detectors PicoDet and PP-YOLOE by 19.92% and 14.43%, respectively. TPH-YOLOv5 and EL-YOLOv5s were proposed specifically for the VisDrone dataset, and the detection accuracy of our algorithm exceeds that of both models. These findings clearly establish the superior capability of our model in detecting small objects in complex scenes.
Our network has been designed with efficiency in mind, making it well-suited for edge computing environments. On our laboratory equipment, we conducted recognition tests on images with a resolution of 640 × 640, achieving a recognition rate of 105 frames per second. However, we acknowledge that, in resource-constrained environments such as drones, performance can pose a challenge.
Although we have undertaken extensive optimization efforts to maximize performance on drones, we still suggest that the computational resources available on the drone be upgraded to ensure that our network model can be loaded and run. We aim to achieve a minimum performance level of 3–5 frames per second, which we believe will effectively meet real-world application demands on resource-constrained devices.
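For reference, throughput figures such as the 105 FPS quoted above are typically obtained by timing repeated forward passes on a fixed 640 × 640 input after a warm-up phase. The sketch below illustrates such a measurement; the torchvision model is only a stand-in, since the SODCNN weights are not included here.

```python
import time
import torch
from torchvision.models import resnet18   # stand-in; replace with the trained detector

model = resnet18().eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
x = torch.randn(1, 3, 640, 640, device=device)   # resolution used in our tests

with torch.no_grad():
    for _ in range(10):                # warm-up passes
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    n = 100
    for _ in range(n):                 # timed passes
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"throughput: {n / elapsed:.1f} FPS")
```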