4.1. Dataset Description, Evaluation Metrics, and Experiment Setting
In the experiment, VisDrone-2019 [40] and BDD100K [41] were utilized as benchmark datasets. The VisDrone-2019 images were captured by a drone camera flying at different heights above various cities. The dataset consists of 8629 photographs annotated with 10 separate categories: people, pedestrians, tricycles, bicycles, cars, vans, trucks, awning tricycles, motorcycles, and buses. For the purposes of this experiment, 6471 images were designated as the training set, 548 images were allocated for testing, and 1610 images formed the validation set. In accordance with the evaluation protocol of the MS COCO dataset, target sizes are classified as follows: small targets have an area of less than $32^2$ pixels, medium targets satisfy $32^2 < \text{area} < 96^2$ pixels, and large targets have an area of more than $96^2$ pixels. The VisDrone dataset was examined in accordance with these criteria to ascertain the distribution of targets of varying sizes.
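To make this size criterion concrete, the following minimal Python sketch classifies a target by its pixel area using the MS COCO thresholds; the function name is ours, not part of any released toolkit.

```python
def coco_size_category(area: float) -> str:
    """Classify a target by bounding box area (in pixels) using the
    MS COCO thresholds of 32^2 and 96^2 applied in this section."""
    if area < 32 ** 2:   # fewer than 1024 pixels
        return "small"
    if area < 96 ** 2:   # between 1024 and 9216 pixels
        return "medium"
    return "large"
```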
As shown in Figure 8, small targets constitute a large share of the VisDrone dataset compared with medium and large targets. The images presented in Figure 9 further illustrate that the VisDrone dataset encompasses characteristics such as changing backgrounds (night and day), dense detection targets, and targets of varied dimensions.
Like the VisDrone dataset, the BDD100K dataset encompasses 10 distinct categories; however, it contains a significantly larger number of images. From the 100,000 images, 80,000 labeled images were chosen, with 70,000 allocated to the training set and 10,000 to the validation set.
In regard to the selection of evaluation metrics, this study utilizes AP, AP50, and AP75 to assess how accurately each object class is detected, in accordance with the evaluation protocol of the MS COCO dataset. This enables a comparison between the results of this YOLO model and those of other models. The term ‘AP’ refers to average precision over all categories, while ‘AP50’ and ‘AP75’ refer to average precision at IoU thresholds of 0.5 and 0.75, respectively, over all categories. In this study, target size also formed a basis of the performance evaluation: ‘APsmall’ is the average precision for small targets, i.e., those with an area smaller than $32^2$ pixels. Other metrics include precision and recall.
In this experiment, the model was trained on an NVIDIA A100 graphics card with CUDA 12.1, cuDNN 8.9.1, and Python 3.12. The batch size was 8, and 200 rounds of training were carried out. The other hyperparameter settings are shown in Table 2.
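For reproducibility, the training setup described above can be expressed as the following minimal sketch using the Ultralytics YOLO API; the dataset configuration file name and the input image size are assumptions, as the authors' training script is not reproduced here.

```python
from ultralytics import YOLO

# A minimal sketch of the training run in Section 4.1 (NVIDIA A100,
# batch size 8, 200 epochs); not the authors' released code.
model = YOLO("yolov8l.pt")      # YOLOv8L baseline that DFE-YOLO extends
model.train(
    data="VisDrone.yaml",       # hypothetical dataset config path
    epochs=200,                 # 200 rounds of training
    batch=8,                    # batch size 8
    device=0,                   # single GPU
    imgsz=640,                  # assumed input size; not stated in the text
)
```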
4.2. Ablation Studies
This section discusses the performance gains brought about by the changes implemented in the DFE-YOLO model. Ablation experiments were carried out on YOLOv8L to quantify the contribution of each module to overall performance. The VisDrone-DET 2019 test set was used to measure how much each module of DFE-YOLO improves the YOLOv8L network. The design of the ablation experiments is as follows:
- (1) Ablation A1: substitute the head with the FASFF four-detector head.
- (2) Ablation A2: use DySample instead of the original YOLO upsampling layer.
- (3) Ablation A3: modify the loss function from CIoU to EIoU (a sketch of the EIoU loss follows this list).
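For reference, the EIoU loss adopted in A3 can be sketched as follows. This is a minimal PyTorch rendering of the published EIoU formulation (IoU term, normalized center distance term, and separate width and height discrepancy terms), not DFE-YOLO's exact training code.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes in (x1, y1, x2, y2) format:
    1 - IoU + center distance term + width term + height term,
    each normalized by the smallest enclosing box."""
    # Intersection
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    # Union and IoU
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])

    # Squared center distance, normalized by the enclosing box diagonal
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2 +
            (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    dist_term = rho2 / (cw ** 2 + ch ** 2 + eps)

    # Width/height discrepancy terms: what EIoU adds over DIoU/CIoU
    w_term = (w1 - w2) ** 2 / (cw ** 2 + eps)
    h_term = (h1 - h2) ** 2 / (ch ** 2 + eps)

    return 1 - iou + dist_term + w_term + h_term
```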
To ensure fairness, the same experimental environment and training settings were employed in all ablation experiments. The ensuing table presents the experimental results.
From Table 3, the following conclusions were drawn:
Overall performance has improved. On the VisDrone-DET2019 dataset, DFE-YOLO outperforms YOLOv8L in terms of AP, AP50, and AP75.
FASFF contributes the most. In the VisDrone-2019 dataset, small objects account for a large proportion of targets, and the object density is high. Therefore, adding detection heads, and especially a small-target detection head, yields the most pronounced improvement. The results of ablation experiment A1 indicate that incorporating the enhanced small-target detection head increases the average precision at IoU = 0.50 (AP50) from 34.8% to 37.1%, while the average precision for small objects (APsmall) rises from 9.6% to 12.0%.
The DySample sampling strategy combined with FASFF further enhances detection performance. In the A2 experiment, integrating DySample as the upsampling strategy yields an additional improvement in AP on top of the gains achieved through FASFF. This confirms that the DySample strategy helps the model learn multi-scale target features through its dynamic sampling and balanced sample distribution, especially in complex scenes. FASFF extracts informative features from feature maps at different scales and adjusts fusion weights dynamically according to the target characteristics. When DySample works together with FASFF, DySample supplies higher-quality upsampled features, enabling FASFF to perform more precise and efficient feature fusion.
When EIoU loss is added on top of DySample and FASFF (A3), performance improves further. This combination enhances target box localization and bounding box regression, and increases precision for detecting small-scale objects. Consequently, the overall detection accuracy of the model improves further, as reflected by a 0.2% increase in both AP and AP50. Its ability to identify small-sized targets also improves, with a 0.2% increase in APsmall.
Table 3. Ablation study results. Bold numbers indicate the best performance in each column. ✓ indicates the inclusion of the corresponding module.
| Model | A1 | A2 | A3 | AP (%) | AP50 (%) | AP75 (%) | APsmall (%) |
|---|---|---|---|---|---|---|---|
| YOLOv8L | | | | 20.3 | 34.8 | 20.9 | 9.6 |
| DFE-YOLO | ✓ | | | 21.4 | 37.1 | 21.9 | 12.0 |
| DFE-YOLO | ✓ | ✓ | | 22.0 | 38.4 | 22.2 | 12.3 |
| DFE-YOLO | ✓ | ✓ | ✓ | **22.2** | **38.6** | **22.5** | **12.5** |
To validate the effectiveness of the proposed FASFF module, we compare it with several widely adopted multi-scale feature fusion strategies, including BiFPN [42], RepGFPN [43], and a customized PANet variant implemented in PaddleDetection [44]. As shown in Table 4, FASFF consistently outperforms these alternatives on the VisDrone test set. Notably, it achieves a 2.1% improvement in APsmall over the YOLOv8L baseline and a 1.6% gain over BiFPN. These results demonstrate the advantage of FASFF, particularly in enhancing small-object detection performance under complex urban surveillance scenes characterized by dense and multi-scale targets.
We also explore the effects of various upsampling techniques integrated within the FASFF module. As shown in Table 4, bilinear interpolation, transposed convolution, and our DySample-based strategy all provide performance improvements over the YOLOv8L baseline. Among them, DySample achieves the best overall performance, reaching 22.0% AP and 12.3% APsmall, while maintaining a high inference speed of 65.7 FPS. This balance between detection accuracy and runtime efficiency makes DySample particularly suitable for real-time traffic surveillance tasks on edge devices.
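To illustrate the point-sampling idea behind DySample, the following simplified PyTorch sketch predicts content-aware offsets and resamples features with grid_sample rather than applying a fixed interpolation kernel. It is our own illustrative approximation, not the reference DySample implementation; the offset scope factor and channel ordering are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointSamplingUpsampler(nn.Module):
    """Simplified DySample-style upsampler: a 1x1 conv predicts a 2-D
    offset per output pixel, and features are resampled at the offset
    grid positions (content-aware point sampling)."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        out_h, out_w = h * self.scale, w * self.scale

        # Predicted offsets, rearranged to one (dx, dy) pair per output pixel
        # and kept small; the 0.25 scope factor is an assumption of this sketch.
        offsets = F.pixel_shuffle(self.offset(x), self.scale)   # (b, 2, H, W)
        offsets = offsets.permute(0, 2, 3, 1) * 0.25 * (2.0 / max(h, w))

        # Base sampling grid in grid_sample's normalized [-1, 1] coordinates.
        ys = torch.linspace(-1.0, 1.0, out_h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, out_w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)

        # Bilinear resampling at the shifted positions; with zero offsets this
        # reduces to plain bilinear upsampling.
        return F.grid_sample(x, grid + offsets, align_corners=True)
```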
4.3. Comparative Experiments
To assess the advancement represented by the DFE-YOLO network, this study compares its performance against several notable object detection networks.
The data illustrated in Table 5 demonstrate that DFE-YOLO surpasses other target detection techniques in performance. At the high IoU threshold (AP75), the accuracy of DFE-YOLO (30.7%) is higher than that of CenterNet [45] (27.6%), QueryDet (28.8%), and Faster R-CNN combined with FPN and ResNet-101 (20.1%). At an IoU of 0.5, DFE-YOLO attains an accuracy of 48.4%, exceeding Cascade R-CNN (38.5%), DMNET [46] (48.1%), and RetinaNet (29.2%). These results show that DFE-YOLO has better target recognition capabilities than these classic networks.
Although our proposed model outperforms other models in the overall AP, AP50, and AP75 metrics, CenterNet demonstrates strong performance in the APsmall metric. As shown in Table 5, CenterNet achieves an APsmall score 0.3% higher than that of our model. This indicates that CenterNet has a relative advantage in capturing fine-grained spatial details. APsmall is a critical metric in dense traffic environments where small-scale targets, such as pedestrians and distant vehicles, frequently appear. Despite its lower overall AP, CenterNet’s superior APsmall highlights its competitiveness in specific detection scenarios.
The improved performance is due to several key factors. Compared with classic networks, DFE-YOLO not only inherits the backbone network of YOLOv8 but also adopts a four-detection-head architecture. The supplementary object detection head enhances the model’s effectiveness in recognizing small objects. Additionally, DFE-YOLO adopts adaptive spatial feature fusion (ASFF), which improves multi-scale object detection by dynamically fusing features across different resolutions. Moreover, in the feature upsampling stage, DySample selects more effective feature points through a point-sampling strategy instead of the conventional kernel-based approach. This yields more accurate upsampled features and improves the input quality for the FASFF detection head, thus boosting detection accuracy.
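The adaptive fusion mechanism can be summarized by the following minimal sketch, which assumes three feature maps already projected to a common resolution and channel count; the 1×1 weight layers are a simplification of the general ASFF idea, not DFE-YOLO's exact FASFF head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatialFusion(nn.Module):
    """Illustrative ASFF-style fusion: each level contributes a per-pixel
    weight logit, and the softmax-normalized weights (summing to 1 at every
    location) blend the three levels into one fused map."""

    def __init__(self, channels: int):
        super().__init__()
        self.w0 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w1 = nn.Conv2d(channels, 1, kernel_size=1)
        self.w2 = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, f0, f1, f2):
        # Per-pixel fusion weights learned from the features themselves.
        logits = torch.cat([self.w0(f0), self.w1(f1), self.w2(f2)], dim=1)
        a = F.softmax(logits, dim=1)                 # (b, 3, h, w)
        return f0 * a[:, 0:1] + f1 * a[:, 1:2] + f2 * a[:, 2:3]
```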
However, DFE-YOLO remains a member of the YOLO series; thus, a comparison with conventional detectors alone does not provide a comprehensive evaluation of its performance advantages. Consequently, further experiments are necessary to assess the performance of DFE-YOLO within the YOLO series, thereby substantiating its competitiveness within the YOLO framework. Within the same experimental environment, this study selected YOLOv5L, YOLOv8x (the version with the highest accuracy in the YOLOv8 series), YOLOv10L, and DFE-YOLO for comparative experiments, given that DFE-YOLO is based on YOLOv8L.
To better highlight the advantages of the DFE-YOLO architecture in complex target detection scenarios, Figure 10 shows a detection comparison in a typical complex scene. The scene is highly cluttered, with many small objects, some of which are occluded by other objects such as trees; the objects also vary widely in size and shape. As shown in Figure 10, the DFE-YOLO network has a lower missed-detection rate and a higher detection rate than the other YOLO networks.
From the results in Table 5, in comparison with YOLOv5L, YOLOv8x, and YOLOv10L, DFE-YOLO achieves higher accuracy on the VisDrone-2019 dataset than the compared networks. The VisDrone-2019 dataset is predominantly composed of aerial imagery obtained from drones, with a considerable emphasis on small targets. As a large-scale autonomous driving dataset, BDD100K contains complex environments such as rainy, daytime, and nighttime conditions. At the same time, compared with the 8629 images of VisDrone-2019, BDD100K offers a far larger amount of data, with 80K labeled images used in the experiments. The complexity of the dataset and the larger sample size provide a more rigorous test of the network’s scalability and of its handling of different target scales and environmental conditions.
From the results presented in Table 6, DFE-YOLO achieves higher accuracy than YOLO models such as YOLOv8L on the BDD100K dataset. In real applications, small- and long-range target detection is crucial for autonomous driving. In terms of small target detection accuracy, DFE-YOLO’s APsmall is 13.8%, which is 3% higher than that of YOLOv8L (10.8%) and higher than those of the other YOLO networks compared. This result shows that DFE-YOLO has certain advantages in small target detection.
Beyond excelling in detecting small targets, DFE-YOLO shows strong capability in medium and large target detection. From the data presented in Table 6, it can be seen that DFE-YOLO attains high accuracy across all target sizes compared to alternative YOLO models. In terms of medium target detection accuracy, DFE-YOLO’s APmedium is 36.5%, which is 1.8% higher than that of YOLOv8L (34.7%). This improvement is due to the innovative application of the FASFF strategy. FASFF not only adds a detection head that increases the feature-scale input for small targets, but also adopts adaptive spatial feature fusion, which enables DFE-YOLO to fuse different feature maps. This approach allows the network to integrate information from multiple resolutions more effectively, thereby enhancing the detection accuracy of diverse targets. At the same time, the experimental results indicate that the improvement from combining the DySample and FASFF strategies generalizes to autonomous driving scenarios. This dual enhancement approach not only increases the precision of small target detection but also ensures that the model sustains consistently high performance across all target sizes.
Although DFE-YOLO and YOLOv8L achieve the same accuracy in large target detection, the gains in small and medium target detection significantly raise the overall accuracy of DFE-YOLO. This is reflected in its higher overall average precision (AP) compared to the other networks.
The PR curve is employed in this work to evaluate DFE-YOLO, with experiments conducted on BDD100K. The precision value depicted in the PR curve in Figure 11 stands for mPrecision, i.e., the mean precision over all categories. The calculation procedure can be written as

$$\mathrm{mPrecision} = \frac{1}{N} \sum_{i=1}^{N} P_i,$$

where $N$ refers to the number of classes and $P_i$ indicates the precision for the i-th class.
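As a direct rendering of this formula, a minimal sketch (the function name is ours):

```python
import numpy as np

def mprecision(per_class_precision):
    """mPrecision = (1/N) * sum_i P_i: mean of the per-class precision
    values P_1..P_N at a given recall level."""
    return float(np.mean(np.asarray(per_class_precision, dtype=float)))
```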
The PR curve indicates the average precision trend of the different models at varying recall rates. As depicted in Figure 11, DFE-YOLO exhibits enhanced performance in comparison with the other YOLO models. Notably, at recall > 0.6, DFE-YOLO maintains high precision, suggesting a low false detection rate and strong detection stability at high recall rates.
The results of experiments on multiple datasets demonstrate that DFE-YOLO can tackle the difficulties of object detection in traffic monitoring scenes. With advanced feature fusion methods, a specialized small-target detection head, and dynamic sampling strategies, DFE-YOLO performs well in detecting targets across various sizes. More specifically, the outcomes on the BDD100K dataset further show that the model holds strong potential in increasingly complex settings and is suitable for wide real-world application.