4.4.1. Experimental Results of the Rural Road Dataset
Figure 8 shows the training loss (blue), validation loss (green), and mAP (red) curves of EMAF-Net over 300 training epochs on our self-built rural road object detection dataset. As the figure shows, training proceeds in two stages. In the first stage, the backbone is frozen and the model is trained for 50 epochs with the Adam optimizer; both training and validation loss drop sharply as the epochs increase, demonstrating rapid initial convergence. In the second stage, the entire network is unfrozen and trained for the remaining 250 epochs, with the optimizer switched to SGD and the learning rate decayed by a cosine annealing schedule. The loss decreases more gradually in this stage, and the validation loss stabilizes at around epochs 250–300, indicating good generalization: EMAF-Net maintains stable convergence over a long training run. Meanwhile, mAP rises steadily from 0 and plateaus at around 0.64, showing that with sufficient training and optimization EMAF-Net can effectively detect objects in rural road scenes and maintain high detection performance even under strict IoU thresholds.
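The two-stage schedule described above can be sketched in PyTorch as follows. The module names (`backbone`, `head`), the toy layers, and the hyperparameters (learning rates, momentum) are illustrative assumptions, not EMAF-Net's actual training code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the detector: a "backbone" and a detection "head".
# (Module names, layer sizes, and hyperparameters are illustrative assumptions.)
model = nn.Sequential()
model.add_module("backbone", nn.Linear(8, 8))
model.add_module("head", nn.Linear(8, 4))

FREEZE_EPOCHS, TOTAL_EPOCHS = 50, 300

# Stage 1: freeze the backbone and train only the head with Adam.
for p in model.backbone.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# ... run the first FREEZE_EPOCHS epochs with stage1_opt ...

# Stage 2: unfreeze the whole network, switch to SGD, and decay the
# learning rate with cosine annealing over the remaining 250 epochs.
for p in model.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    stage2_opt, T_max=TOTAL_EPOCHS - FREEZE_EPOCHS)
# ... train the remaining epochs, calling scheduler.step() once per epoch ...
```

Freezing the backbone first lets the randomly initialized head adapt quickly without disturbing pretrained features; the cosine schedule then lets the full network fine-tune with a smoothly decaying learning rate.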
Figure 9 shows the confusion matrix of EMAF-Net on the rural road object detection dataset. Each element represents the normalized ratio between the true class and the predicted class: larger values on the diagonal indicate higher classification accuracy, while off-diagonal values reflect the degree of confusion between classes. The figure shows that, even in scenes with complex environments and large variations in object scale, EMAF-Net detects multiple object categories with high accuracy. For example, the large-scale category car reaches a correct-classification proportion of 0.73, the highest among all categories, while tricycle and agricultural machinery reach 0.65 and 0.51, respectively. For small-scale categories, street light reaches 0.60 and traffic sign reaches 0.53. This demonstrates the model's ability to detect complex and diverse objects. The figure also reveals, however, that some categories are still confused to a degree. For instance, truck achieves a correct-classification rate of 0.63, but some samples are misdetected as car (a proportion of 0.07), and some motorcycle samples are misidentified as tricycle (also 0.07). These categories have similar shapes and appearances: the front-end appearance of trucks and cars in rural road scenes is very similar, making them easy to confuse. Motorcycles and tricycles are occasionally confused because riders frequently occlude details of the vehicle body; since their frontal appearances still differ slightly, however, most samples are correctly distinguished.
The remaining categories show no obvious confusion and are identified correctly.
Overall, the confusion matrix results of EMAF-Net indicate that the model has a high accuracy rate in object detection in rural road scenes, accompanied by a small amount of misclassification between categories. In addition, missed detections may still occur in cases where objects are partially occluded. For example, agricultural implements attached to machinery or pedestrians riding on agricultural vehicles may be partially hidden by the vehicle body, making their visual features incomplete. Such occlusion reduces the discriminative features available to the detector and may lead to missed detections in some complex scenes. This issue is common in real rural traffic environments and remains a challenging problem for object detection models.
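For reference, the row normalization used in such a confusion matrix (each true-class row divided by its total sample count) can be computed as in the following sketch; the example counts are made up for illustration and are not the paper's data.

```python
import numpy as np

def normalize_confusion(cm):
    """Row-normalize a confusion matrix so each row (true class) sums to 1.

    cm[i, j] = number of class-i samples predicted as class j.
    Rows with zero samples are left as all zeros.
    """
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

# Toy 2-class example (counts are illustrative, not from the paper):
cm = np.array([[73, 27],
               [ 7, 93]])
print(normalize_confusion(cm))  # diagonal entries 0.73 and 0.93
```

The diagonal of the normalized matrix is then the per-class recall, which is what the proportions quoted above (0.73 for car, 0.63 for truck, and so on) correspond to.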
Figure 10 visualizes the main category confusion pairs and their flow directions for EMAF-Net on the rural road dataset. Figure 10a presents the Top-K confused category pairs and their corresponding confusion ratios, and Figure 10b presents the main confusion flows from true category to predicted category in a bar-graph format, where thicker connection lines indicate higher confusion ratios. Overall, the Top-K analysis and the flow chart are consistent with the confusion matrix conclusions of Figure 9: EMAF-Net achieves stable discrimination for most categories in complex rural road scenarios.
To evaluate the detection performance of EMAF-Net, comparative experiments were conducted against three representative models: YOLOv7, Deformable DETR, and YOLOv11 [28]. These models were chosen for complementary reasons. YOLOv7 is a classic lightweight, CNN-based detector whose architecture is highly optimized for detection efficiency and fast inference, making it representative of resource-constrained scenarios. Deformable DETR represents Transformer-based detection: through its attention mechanisms and deformable attention modules it offers theoretical advantages in complex multi-object scenes, but at relatively high computational cost. YOLOv11 is a recent, improved member of the YOLO series that further enhances detection performance through parameter optimization and multi-scale feature extraction. Together, these three models span multiple design dimensions and generations of detectors, and thus provide a solid baseline for assessing the improvements of EMAF-Net in detection performance, model size, and real-time capability.
Table 2 presents the results of all selected object detection models on the self-built rural road dataset. To ensure a fair comparison, all models were trained for 300 epochs to guarantee sufficient convergence. It is worth noting, however, that Deformable DETR converged by around epoch 150 in this experiment; subsequent iterations yielded limited improvement, and some metrics even declined slightly due to overfitting. Its best results are therefore used as the final data for the comparative analysis in the table.
From the table, it can be seen that EMAF-Net shows significant advantages on all accuracy metrics. Its mAP@0.5 reached 64.05%, which is 2.11% higher than YOLOv11, 8.53% higher than YOLOv7, and 13.15% higher than Deformable DETR. Under the stricter mAP@0.5:0.95 metric, EMAF-Net reached 48.95%, exceeding YOLOv11 by 4.63%, YOLOv7 by 7.56%, and Deformable DETR by 18.45%. In terms of model size, EMAF-Net has only 18.3 M parameters, 18.6 M fewer than YOLOv7, 1.8 M fewer than YOLOv11, and 21.7 M fewer than Deformable DETR, fully demonstrating its lightweight design. In terms of computational complexity, EMAF-Net requires 38.5 GFLOPs, 66.2% less than YOLOv7 and 29.5% less than YOLOv11, while Deformable DETR requires 134.5% more computation than EMAF-Net, further illustrating EMAF-Net's computational efficiency. As for real-time performance, after export to ONNX format the model reached 184.62 FPS on an NVIDIA RTX 4090 GPU, 51.78% higher than YOLOv11, 27.6% higher than YOLOv7, and 137.32% higher than Deformable DETR. It is worth noting that although high-end GPUs achieve extremely high inference speeds, the compute available on edge devices in practical deployments is often limited. To verify applicability on typical edge hardware, video inference experiments were conducted on a representative NVIDIA GTX 1650 Ti GPU. EMAF-Net achieved an average inference speed of approximately 27 FPS (39 ms per frame) on this device, meeting the real-time requirements of rural road scene perception. The exported ONNX model can be deployed directly to most edge devices or applications without additional processing, demonstrating strong practical value.
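Throughput figures like the FPS numbers above can be obtained with a simple timing harness. The sketch below is framework-agnostic: `infer` is assumed to wrap a per-frame inference call, e.g. around a session built with `onnxruntime.InferenceSession` (the function and variable names here are illustrative, not the paper's benchmarking code).

```python
import time

def measure_fps(infer, frames, warmup=10):
    """Average FPS of `infer` over `frames`, after `warmup` untimed calls.

    Warm-up runs are excluded so that one-time costs (kernel compilation,
    memory allocation, cache warming) do not distort the measurement.
    """
    for f in frames[:warmup]:
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

In practice `infer` would wrap something like `session.run(None, {"images": frame})` for a session created from the exported ONNX file; the input name and file path depend on how the model was exported.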
Combining detection performance and computational efficiency, EMAF-Net achieved a balance between accuracy and lightweighting. Overall, EMAF-Net performed excellently in terms of detection accuracy, model lightweighting, and real-time performance, especially in complex rural road scenarios with limited resources. The experimental results verified the effectiveness of the model improvement strategy.
To more intuitively reveal the trade-off between accuracy and complexity across the models, a visual analysis of the key indicators was further conducted. Figure 11 illustrates the relationship between model complexity and detection accuracy (mAP@0.5 and the stricter mAP@0.5:0.95) from two perspectives: the number of parameters (Params) and computational complexity (GFLOPs). Bubble size is proportional to model complexity: in subplots (a) and (c) it is proportional to Params, and in subplots (b) and (d) to GFLOPs. Bubble color encodes the corresponding detection accuracy (mAP), with darker yellow indicating higher accuracy. This visualization allows a more intuitive comparison of each method's accuracy under different complexity constraints and verifies that EMAF-Net improves detection accuracy while remaining lightweight and efficient at inference, thereby demonstrating its advantages in both accuracy and efficiency.
Based on the quantitative results and complexity trade-off analysis described above, qualitative comparisons of detection results among different models in complex rural road scenarios are further presented.
Figure 12 shows the visual detection results of the four models under different light intensities, road surface materials, and weather conditions, allowing a comprehensive evaluation of each model's object recognition performance across environments. The images are ordered primarily by light intensity, transitioning from bright daytime scenes to a dark rainy scene, and cover rural roads with different surface materials, including asphalt roads and unpaved paths. The first image in the first column shows a relatively wide rural intersection under strong daylight; the road is asphalt, and the surrounding houses and power poles reflect relatively developed but simple rural traffic infrastructure. The second image shows a typical field-edge rural road with clearly visible agricultural machinery and small trucks; the surface is an unpaved path, the lighting is ample, and the surrounding crops and dense vegetation represent another common type of rural road scene. The third image shows a rural road at dusk, where gradually weakening light adds complexity to the detection task. The fourth image shows a mountain pasture under low light, with a slippery road surface and dense vegetation along the edges; the overall darkness of the scene reflects the transportation characteristics of remote rural areas. The fifth image shows an overcast, post-rain scene with a wet road surface and a view affected by raindrops, representing the complex and variable traffic and weather conditions of rural areas and posing an even greater challenge to detection.
From the visual detection results, the models differ markedly in their ability to identify objects in complex environments. In well-lit scenes, YOLOv7 and YOLOv11 capture common objects such as cars and banners with high confidence, but they struggle with harder cases such as the small, distant animals and people in the fourth row. Deformable DETR localizes bounding boxes robustly, with most boxes tightly enclosing object boundaries, but its confidence scores are generally low; in low-light environments (such as the rainy and dusk scenes) its confidence drops significantly, and it is prone to background interference that produces incorrect detections. For example, in the second image it missed both the partially occluded agricultural implements and the person on the agricultural machinery, and in the fifth image it detected the car with a confidence of only 0.795, whereas the other models were clearly above 0.9. In contrast, EMAF-Net delivers outstanding detection performance in complex scenes, especially for the agricultural implement, banner, person, and animal categories. In the first image, our model attains higher confidence than the other models in every category; for the banner category, the best YOLO-series confidence is 0.9, while ours is 0.93. For the traffic light, YOLOv7 scored only 0.66, Deformable DETR 0.712, and YOLOv11 only 0.58, whereas our model reached 0.73. In the fourth mountainous low-light scene, EMAF-Net accurately detected multiple categories, including car, animal, and person, with cleanly delineated bounding boxes; for the animal category in particular, YOLOv7 and YOLOv11 missed some objects, and Deformable DETR produced two background false positives. In the fifth low-light rainy scene, EMAF-Net still accurately captured the distant vehicle and achieved the highest confidence of 0.97.
Taken together, these results show that EMAF-Net performs exceptionally well on difficult categories and in low-light environments, and is markedly superior to the other models in detecting distant and small objects. These experiments further validate the robustness of EMAF-Net and its applicability to complex rural road environments.
4.4.2. Experimental Results of the BDD100K Dataset
To evaluate the generalization capability of the model on a public dataset, four representative detectors—YOLOv7, Deformable DETR, YOLOv11, and EMAF-Net—were compared on the BDD100K dataset. BDD100K is a public dataset covering rich driving scenarios, including varied weather, lighting conditions, and complex road environments, and is an important benchmark for verifying generalization ability. Each model was trained for 150 epochs, and the results are shown in Table 3.
From the table, it can be seen that EMAF-Net performs best on the BDD100K dataset, showing strong detection performance on the key metrics mAP@0.5 and mAP@0.5:0.95. Its mAP@0.5 reaches 45.46%, well above YOLOv7's 42.97% and Deformable DETR's 44.70%, and 0.14% higher than the next-best YOLOv11 at 45.32%, further consolidating its generalization ability. Under the stricter mAP@0.5:0.95 metric, EMAF-Net surpasses the other three models with 27.01%, verifying its adaptability to complex environments. In contrast, YOLOv7, as a classic object detection model, reaches only 42.97% mAP@0.5, reflecting its limited adaptability to complex scenes. Deformable DETR's overall accuracy is not significantly better than that of the other models, and its mAP@0.5:0.95 of only 23.90% indicates that it struggles to meet fine-grained detection requirements. YOLOv11 benefits from optimized multi-scale feature extraction, reaching 45.32% mAP@0.5; however, EMAF-Net, with its improved feature fusion strategy and robust design, generalizes better, achieving higher detection precision while retaining the advantage of a lightweight design.
Overall, EMAF-Net's performance on the BDD100K dataset leads the other models on all indicators and demonstrates excellent adaptability to complex environments and diverse objects, further verifying its generalization ability, the feasibility of the proposed improvements on public datasets, and its applicability to real-world tasks.
To more intuitively present the comparison of all models on the public BDD100K dataset, Figure 13 shows the visual detection results of the four models across various driving scenarios. BDD100K includes driving scenes from urban streets and rural highways and is an important benchmark for evaluating model generalization and object detection capability. These scenarios pose strict challenges for recognizing distant, small, occluded, and dynamic objects.
From the results in Figure 13, each model has its own strengths in complex scenes, but EMAF-Net performs most strongly overall. YOLOv7 produces accurate bounding boxes for some objects, such as cars in daytime scenes, but fails to recognize small objects such as the distant pedestrian in the fourth image and the traffic sign, bus, and car in the rainy, dark fifth image, and its confidence scores are generally low. Deformable DETR's confidence is also generally low: it missed a clearly visible car in the first image and the traffic sign in the second image, and its highest car confidence was only 0.871, below that of EMAF-Net. Its missed and false detections were especially evident in low light; in the second and fifth images it produced numerous false detections of the airplane class, revealing obvious limitations. Compared with the previous two models, YOLOv11 markedly improved accuracy on common categories but still struggled with distant objects and small objects against complex backgrounds: it failed to recognize the traffic light in the second image, detected one fewer distant car in the third image than EMAF-Net, and in the fifth image misidentified the traffic sign as a car (a sign that YOLOv7 did not recognize at all).
Compared with the above models, EMAF-Net demonstrated excellent detection performance under varied lighting and scene conditions, particularly for small objects, distant objects, and complex categories. In the first image, it accurately detected multiple categories, such as traffic signs and cars, with higher confidence than the other models; for the traffic sign, its confidence reached 0.61, clearly surpassing YOLOv11's 0.53, while the other two models failed to detect it. In the second image, its bicycle confidence reached 0.91, far exceeding the other models, and it correctly recognized a traffic light that all three other models missed; its traffic sign confidence was only 0.06 lower than YOLOv11's and 0.17 higher than YOLOv7's, while Deformable DETR failed to recognize the sign at all. In the fifth image, the night and rain scene, EMAF-Net's detection ability was particularly outstanding, recognizing more categories than the other models with significantly higher confidence.