4.1. Evaluation of Object Detection
To assess the individual and combined contributions of the proposed architectural components, a comprehensive ablation study was conducted on three key modules: Model A (Mamba-YOLO backbone), Model B (VSSBlock), and Model C (XSSBlock). The experimental results are summarized in
Table 4.
Starting from the baseline configuration, the model achieved a precision of 0.919, recall of 0.883, mAP@50 of 0.929, and mAP@50–95 of 0.707. Upon integrating the VSSBlock module (Model B), substantial improvements were observed—most notably, recall increased to 0.923 and mAP@50–95 improved to 0.756. These results underscore the effectiveness of VSSBlock in capturing multi-scale semantic features and enhancing the model’s ability to detect targets of varying sizes and under occlusion.
Subsequently, the inclusion of the XSSBlock module (Model C) led to a marginal decline in precision (from 0.919 to 0.917), while recall remained relatively high at 0.896. However, mAP@50–95 decreased to 0.713, below the 0.756 achieved with VSSBlock. This degradation may be attributed to the stacked SS2D-based state-space modeling in XSSBlock, which, in highly complex visual contexts, may introduce feature redundancy and compromise localization accuracy.
Despite this, when all three modules—Mamba-YOLO + VSSBlock + XSSBlock—were combined, the model achieved the best performance across all metrics, with precision rising to 0.945, mAP@50 reaching 0.950, and mAP@50–95 improving to 0.769. These results validate the complementarity and synergistic effect of the modules when used together.
Overall, the ablation study demonstrates that the integrated architecture strikes an effective balance between detection accuracy and robustness, confirming the benefit of combining local perception, global context modeling, and enhanced feature fusion in complex traffic detection scenarios.
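For reference, the relationship between the two reported metrics can be made explicit: mAP@50 is the AP at a single IoU threshold of 0.50, while mAP@50–95 averages AP over the ten COCO-style thresholds from 0.50 to 0.95. Below is a minimal sketch with illustrative per-threshold values, not the paper's measurements:

```python
import numpy as np

def map_over_iou(ap_per_iou):
    """Average AP over the COCO-style IoU thresholds 0.50, 0.55, ..., 0.95.

    mAP@50 is the single entry at IoU 0.50; mAP@50-95 averages all ten entries.
    """
    thresholds = np.arange(0.50, 0.96, 0.05)
    return float(np.mean([ap_per_iou[round(float(t), 2)] for t in thresholds]))

# Illustrative per-threshold AP values only (not taken from Table 4):
ap = {round(float(t), 2): 0.95 - 0.8 * (float(t) - 0.50) for t in np.arange(0.50, 0.96, 0.05)}
print("mAP@50   :", ap[0.50])
print("mAP@50-95:", round(map_over_iou(ap), 3))
```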
To comprehensively verify the object detection performance of the proposed model in traffic scenarios, we compare it against the current mainstream YOLO series models, the representative Transformer-based detector RT-DETR18, and the lightweight baseline Mamba-YOLO. The experimental results are shown in
Table 5.
The YOLO series is renowned for high accuracy. Specifically, YOLOv12s achieves 0.948 mAP@50 and 0.759 mAP@50–95 with 9.26 M parameters, 21.5 GFLOPs, and 115.7 FPS, remaining acceptable for real-time applications. RT-DETR18 delivers the best fine-grained localization (0.771 mAP@50–95), yet its 20.0 M parameters, 60.0 GFLOPs, and 55.3 FPS render it impractical for real-time ITS deployment. The original lightweight Mamba-YOLO is extremely efficient (5.8 M parameters, 13.2 GFLOPs, 120.1 FPS), yet its restricted representational capacity limits performance in occlusion and small-object scenarios (0.747 mAP@50–95).
Our enhanced model strikes a superior accuracy–complexity balance: 0.945 precision, 0.919 recall, 0.945 mAP@50, and 0.769 mAP@50–95, rivaling RT-DETR18 while requiring only 9.07 M parameters and 17.2 GFLOPs and running at 117.7 FPS. This substantial reduction in memory and computation, coupled with maintained precision, makes the model highly deployment-friendly. Although the lightweight trade-off marginally compromises fine-grained localization, the overall competitiveness remains compelling.
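For context on how the complexity figures in Table 5 are typically obtained, the sketch below counts trainable parameters in PyTorch; the toy network is a hypothetical stand-in rather than the proposed detector, and FLOPs would be profiled with a separate tool:

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameter count in millions (the 'Params (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Hypothetical stand-in network; the real detector is built from its own codebase.
toy_model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(), nn.Conv2d(64, 255, 1))
print(f"Params: {count_parameters_m(toy_model):.3f} M")
# GFLOPs are usually profiled with an external FLOP-counting tool,
# omitted here to avoid assuming a specific API.
```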
Thanks to its compact architecture and robust generalization, the model performs reliably across diverse scenes—urban intersections, non-motorized lanes, and nighttime surveillance—accurately detecting vehicles and pedestrians under challenging conditions, thus seamlessly aligning with modern ITS infrastructure requirements.
To rigorously assess the proposed model’s efficacy under multi-scale scenarios—particularly for small-object detection—we concentrate on two representative categories, pedestrian and motorcycle, and benchmark against three top-performing lightweight YOLO variants (YOLOv5s, YOLOv8s, and YOLOv12s). Quantitative comparison reveals that our method consistently delivers superior fine-grained localization and scale robustness.
As reported in
Table 6, the overall accuracy remains moderate, owing to the limited presence of motorcycles and pedestrians in the dataset. Nevertheless, our model attains the best AP@50–95 of 0.623 for motorcycles and 0.738 for pedestrians, thereby confirming its leading small-object detection capability under data-scarce conditions.
To comprehensively assess the deployment viability of our lightweight model, we benchmark both accuracy and real-time performance on an RTX 4090 desktop (NVIDIA Corporation, Santa Clara, CA, USA) and a Raspberry Pi 4B emulator (Raspberry Pi Foundation, Cambridge, UK; 4× Cortex-A72 @ 1.5 GHz). As shown in
Figure 8, on the RTX 4090 our model achieves a higher FPS than both YOLOv5n and YOLOv7-tiny while delivering the best mAP. On the emulated embedded platform, although its FPS is slightly lower than that of YOLOv5n and YOLOv7-tiny, it still sustains real-time inference with superior accuracy.
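The FPS figures can be reproduced with a standard warm-up-then-time loop. The sketch below assumes PyTorch and single-image inference, with the image size, iteration counts, and device string as illustrative defaults rather than the paper's exact protocol; on the emulated Pi the same loop would run with device="cpu".

```python
import time
import torch

@torch.no_grad()
def benchmark_fps(model, img_size=640, n_warmup=20, n_iters=200, device="cuda"):
    """Single-image inference throughput with warm-up and explicit CUDA sync."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(n_warmup):                  # stabilise clocks and fill caches
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)
```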
4.2. Image Processing Effect Evaluation
In traffic visual perception scenarios, raw images are frequently affected by environmental factors such as low illumination and haze, leading to reduced contrast, blurred edges, and poorly defined target structures. To systematically evaluate the practical impact of the proposed image preprocessing modules on detection performance, a controlled comparative experiment was designed in which the input variable was the presence or absence of image enhancement. This approach enables a quantitative assessment of the contribution and necessity of image defogging and illumination enhancement within the overall detection pipeline.
Two experimental conditions were constructed: (1) detection on the original, unprocessed images, and (2) detection on images preprocessed with the proposed defogging and illumination enhancement modules.
All experiments were conducted using the same detection network architecture and hyperparameter settings to ensure a fair comparison. The primary evaluation metrics were mAP@50 and mAP@50–95, providing insight into both coarse and fine-grained detection accuracy.
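A minimal sketch of this controlled protocol is given below; detector, dataset, evaluate_map, and enhance are hypothetical placeholders for the trained network, the evaluation split, the mAP evaluator, and the dehazing plus illumination enhancement pipeline.

```python
def run_condition(detector, dataset, evaluate_map, preprocess):
    """Evaluate the same detector under one input condition (raw or enhanced)."""
    predictions = [(detector(preprocess(img)), target) for img, target in dataset]
    return evaluate_map(predictions)  # returns mAP@50 and mAP@50-95

def compare_conditions(detector, dataset, evaluate_map, enhance):
    """Condition 1: raw images; condition 2: dehazed + illumination-enhanced images."""
    return {
        "raw": run_condition(detector, dataset, evaluate_map, lambda img: img),
        "enhanced": run_condition(detector, dataset, evaluate_map, enhance),
    }
```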
As shown in
Figure 9, the left image illustrates detection results under low-light conditions, while the right image displays results after illumination enhancement. In the original (unprocessed) image, severe underexposure leads to substantial information loss in dark regions. As a result, object boundaries appear indistinct, especially in the upper portion of the frame, where the model struggles to identify distant vehicles and pedestrians. Confidence scores for near-field vehicle detections are relatively low, with values of 0.88 and 0.85, and the confidence for a motorcycle at the bottom of the frame is only 0.78, reflecting the model’s limited perceptual capability in poorly illuminated environments.
In contrast, the enhanced image shows significant improvements in overall brightness, contrast, and structural visibility. The contours of vehicles are clearly defined, and target integrity is preserved. Detection coverage is notably expanded, extending from the mid-ground to distant areas of the image. Nearly all vehicles are successfully identified, with the front three vehicles receiving confidence scores of 0.90, 0.89, and 0.94, respectively. Even distant vehicles in the upper lanes are correctly detected, many with confidence values exceeding 0.90. Furthermore, the confidence score for the previously low-confidence motorcycle increases to 0.84, indicating enhanced model awareness across the scene.
These results confirm that image enhancement preprocessing effectively mitigates visual degradation caused by low-light conditions. It substantially improves the detectability, confidence distribution, and spatial coverage of targets, thereby enhancing the overall detection accuracy. The integration of such preprocessing modules offers strong practical support for reliable object detection in visually complex and dynamically illuminated traffic environments.
As shown in
Figure 10, the left image illustrates the detection outcome on the original, unprocessed input, whereas the right image presents the results after applying both defogging and illumination enhancement. In the original image, the presence of haze and low overall brightness leads to reduced visual clarity and blurred target boundaries, particularly for mid- and long-range objects. While most targets remain detectable, confidence scores show moderate degradation. Specifically, the detection confidence for the black vehicle on the left is 0.92, the motorcycle below registers 0.90, and the bicycle on the right—located in the non-motor vehicle lane—achieves 0.88. However, the image lacks clear visual stratification and sufficient contrast, which may compromise boundary localization accuracy.
In contrast, the right image—after preprocessing—demonstrates enhanced visual quality. The overall brightness is elevated, haze interference is effectively removed, and target textures and edges are significantly clearer. The detection confidence for most targets remains stable or shows slight improvement: the white vehicle in the main lane maintains a confidence of 0.94, the motorcycle registers 0.91, and the bicycle’s confidence increases to 0.90. The refined image structure facilitates more compact and accurate bounding box generation, improving edge delineation and spatial consistency.
These results collectively highlight the practical benefits of image enhancement. The preprocessing pipeline substantially improves overall visibility and fine-grained detail representation, leading to more stable and accurate detection results. Enhanced edge definition and texture recovery support superior structural perception and spatial localization, especially in multi-target traffic scenes.
By comparing the detection performance before and after illumination enhancement and dehazing, it becomes evident that image preprocessing plays a pivotal role in boosting detection system effectiveness. Illumination enhancement significantly increases the perceptibility and textural clarity of dark regions, enhancing the model’s sensitivity to low-light targets and increasing both the number and confidence of detections. Simultaneously, dehazing effectively mitigates structural degradation caused by blurring and occlusion, thereby improving the edge clarity, local detail expression, and detection stability.
Overall, the dual-stage image restoration process—comprising both dehazing and low-light enhancement—successfully addresses the perceptual degradation challenges in complex traffic environments. It lays a solid foundation for downstream detection tasks by delivering high-quality input, ultimately reinforcing the robustness, precision, and reliability of intelligent traffic perception systems.
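To make the dual-stage idea concrete, the sketch below chains classical stand-ins: dark-channel-prior dehazing followed by gamma correction and CLAHE on the luminance channel. These are illustrative substitutes for intuition only, not the paper's enhancement modules.

```python
import cv2
import numpy as np

def dehaze_dark_channel(img_bgr, omega=0.85, patch=15, t_min=0.1):
    """Simplified dark-channel-prior dehazing (classical stand-in, not the proposed module)."""
    img_f = img_bgr.astype(np.float32) / 255.0
    kernel = np.ones((patch, patch), np.uint8)
    dark = cv2.erode(img_f.min(axis=2), kernel)
    # Atmospheric light A: mean colour of the brightest 0.1% dark-channel pixels.
    n_top = max(1, int(dark.size * 0.001))
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n_top:], dark.shape)
    A = img_f[idx].mean(axis=0)
    t = 1.0 - omega * cv2.erode((img_f / A).min(axis=2), kernel)   # transmission map
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((img_f - A) / t + A, 0.0, 1.0)                  # float BGR in [0, 1]

def enhance_low_light(img_f, gamma=0.6, clip_limit=2.0):
    """Gamma lift plus CLAHE on luminance (stand-in for the illumination module)."""
    img8 = (np.power(img_f, gamma) * 255).astype(np.uint8)
    lab = cv2.cvtColor(img8, cv2.COLOR_BGR2LAB)
    lab[..., 0] = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8)).apply(lab[..., 0])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

def restore(img_bgr):
    """Dual-stage preprocessing: dehaze first, then correct illumination."""
    return enhance_low_light(dehaze_dark_channel(img_bgr))
```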
To rigorously evaluate the contribution of the proposed image enhancement module to the generalization of the object detector under adverse weather, we conducted an ablation study on the Cityscapes-Adverse dataset [
29] (2500 images: 20% hazy, 20% low-light, 60% normal). As listed in
Table 7, the baseline without any enhancement achieves 0.715 mAP@50–95. The haze-removal branch alone raises this to 0.742 (+2.7 pp), while the illumination-enhancement branch yields a larger gain, to 0.791 (+7.6 pp). Cascading the two modules further elevates the metric to 0.811, a total improvement of 9.6 pp, which substantiates the complementary effects of dehazing and illumination correction in the feature space and quantitatively confirms the efficacy of the proposed enhancement strategy.
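The reported gains follow directly from the Table 7 values; a short check of the percentage-point arithmetic:

```python
baseline = 0.715
for name, value in {"dehazing only": 0.742, "illumination only": 0.791, "cascade": 0.811}.items():
    print(f"{name}: +{(value - baseline) * 100:.1f} pp")   # +2.7, +7.6, +9.6
```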