This section presents multiple experiments and analyses performed on the established dataset, including ablation experiments, comparative experiments, failure analysis, practical deployment challenges, and some exploratory experiments. All experimental results reported in the tables of this section are obtained by evaluating the best validation weights (best.pt) on the test set, ensuring a fair and consistent evaluation protocol across all experiments.
4.1. Ablation Experiments
To analyze how different path aggregation networks influence object detection in nighttime traffic scenarios, we examine the neck structure of YOLOv8n in detail. Several feature fusion methods are evaluated, including Slim-neck [32], CCFF [33], and BiFPN. The influence of introducing P2 low-level features into the BiFPN network is also explored. Evaluations are performed on the self-built BDD100K nighttime test subset. For a fair comparison, Table 3 reports the results obtained by evaluating the best validation weights (best.pt) of each method on the test set.
Experimental findings reveal that effective feature fusion methods can improve nighttime traffic object detection performance. Among them, introducing P2 shallow features into the BiFPN network achieves the best overall performance with only a minimal increase in computation, reaching an mAP50 of 53.8% and an mAP50–95 of 29.3%, improvements of 0.7% and 0.8% over the original BiFPN, respectively. Meanwhile, it maintains moderate Params (2.78 M) and GFLOPs (8.1 G) and a fast detection speed (263 FPS). Nighttime images typically suffer from low contrast, high noise, and uneven illumination, leading to blurred object edges and loss of fine details. The weighted fusion mechanism in BiFPN assigns learnable weights to each input feature level, dynamically suppressing low-quality, noisy levels while enhancing high-quality, detail-rich levels. This adaptive weighting strategy enables the network to automatically reduce interference from low-SNR regions when fusing multi-scale features. The additionally introduced high-resolution P2 features preserve the fine edges and texture information of small objects, which are severely degraded in deeper features. After weighted fusion with deeper features, the shallow noise is suppressed, edge detail perception is enhanced, and detection performance for small and low-contrast targets is improved. This makes the BiFPN design particularly suitable for nighttime traffic object detection.
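The weighted fusion described above can be illustrated with the fast normalized fusion rule of the original BiFPN design, in which each input level receives a non-negative learnable weight and the weights are normalized by their sum. The following is a minimal sketch on flat lists of numbers, not the implementation used in the experiments; the function name `fast_normalized_fusion`, the toy feature values, and the weights are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature lists with non-negative, normalized weights."""
    # ReLU keeps every learnable weight non-negative.
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps
    # Weighted sum per position, divided by the weight total.
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(len(features[0]))
    ]


# Hypothetical example: a clean level and a noisy level.
clean = [1.0, 1.0, 1.0]
noisy = [5.0, -3.0, 4.0]
fused = fast_normalized_fusion([clean, noisy], weights=[0.9, 0.1])
# A negative raw weight is clipped to zero, so the noisy level is ignored.
clean_only = fast_normalized_fusion([clean, noisy], weights=[1.0, -2.0])
```

With a small weight on the noisy level, the fused output stays close to the clean level, which is the suppression behavior the paragraph above attributes to learnable fusion weights.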
To strengthen the model’s ability to extract fused features, an attention module is integrated between the output of the feature fusion network and the detection head, a location whose effectiveness has been well validated, as shown in Figure 1. We systematically evaluate a range of advanced and traditional attention mechanisms in different configurations to identify the best one for nighttime traffic conditions. The results are summarized in Table 4.
The experimental results demonstrate that lightweight attention modules with relatively simple structures, such as SE and CA, further improve model performance at negligible additional computational cost. As shown in Table 4, adopting the SE channel attention mechanism achieves the highest mAP50 (54.7%). Integrating the CA mechanism significantly enhances the model’s overall performance, maintaining a high mAP50 (54.6%) while achieving the highest mAP50–95 (30.1%) and a reasonably fast detection speed (233 FPS), with only a 0.01 M increase in Params. In contrast, more complex attention mechanisms such as MSDA [34], ACmix [35], and LSAK [36] do not demonstrate better synergistic effects: performance decreases or even falls below that of the baseline model. This suggests that complex computations may conflict with BiFPN’s dynamic weighted fusion mechanism, amplifying background noise at night.
Additionally, we attempt to introduce attention modules at the end of multiple feature layers (P3, P4, P5). This approach does not improve accuracy; instead, it increases network parameters and the computational burden. Consequently, we introduce only a single lightweight CA module at the end of the top P5 layer of the feature pyramid, which guides the model to focus on the fused high-level semantic features and spatial information with negligible added parameters and computational cost. Under low-light conditions, the intensity difference between objects and the background is extremely small. CA performs global average pooling separately along the horizontal and vertical directions, generating a pair of direction-aware feature maps and thereby preserving precise spatial location information. This decomposition enables the network to capture the position of objects in the image and, even when visual cues are extremely weak, guides the model to focus on regions where targets are likely to appear. In nighttime traffic scenes, vehicles and pedestrians often appear in specific spatial areas (e.g., the horizontal band of the road surface, the sides of the road). The row-wise and column-wise attention of CA can lock onto these areas, reducing noise interference from road textures and background regions and thus improving detection performance under low contrast. Heatmaps before and after introducing the CA module are shown in Figure 7 for three representative challenging scenes. In all cases, the CA mechanism increases the heat intensity in target-relevant regions and covers a larger spatial area, strengthening the model’s capability to aggregate spatial context cues.
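The directional pooling step of CA described above can be sketched on a single-channel map as follows. This is only an illustration of the pooling stage under simplifying assumptions: the full CA module additionally applies shared convolutions and sigmoid gating, which are omitted here, and the function name `directional_pooling` and the toy feature map are hypothetical.

```python
def directional_pooling(feature_map):
    """Average a 2D map (list of rows) along each spatial direction."""
    height = len(feature_map)
    width = len(feature_map[0])
    # Pool along the width: one value per row, so vertical position is kept.
    row_profile = [sum(row) / width for row in feature_map]
    # Pool along the height: one value per column, so horizontal position is kept.
    col_profile = [
        sum(feature_map[r][c] for r in range(height)) / height
        for c in range(width)
    ]
    return row_profile, col_profile


# Hypothetical 3x4 map with one bright response at row 1, column 2.
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 8.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
rows, cols = directional_pooling(fmap)
```

The peak of the row profile and the peak of the column profile together recover the position of the response, which is why the two pooled maps retain location information that a single global average would discard.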
Table 5 summarizes the final ablation experiment, reporting the evaluation outcomes for each model’s best validation weights (best.pt) on the test set. The AP curves for each nighttime traffic category of the baseline and proposed models, along with their corresponding mAP50 values, are depicted in Figure 8. The results show a clear synergistic effect between BiFPN feature fusion and the CA attention mechanism. Introducing the CA module alone has a negative impact on the model, reducing mAP50 by 1.0% compared to the baseline. However, introducing the CA module after BiFPN feature fusion leads to a considerable performance boost, increasing mAP50 from 51.5% to 54.6%. This indicates that BiFPN’s weighted bidirectional feature fusion provides richer semantic information and spatial features. The enriched feature representation allows the CA module to better distinguish key regions from background noise, improving the model’s discriminative power and detection precision.
To investigate why the CA module alone degrades performance, we place it at different positions in the YOLOv8n neck (after P3, after P4, after P5, and after all three layers). As shown in Table 6, the negative effect is highly position-dependent: CA after P3 or P4 yields an mAP50 close to the baseline (51.4% and 51.5%, respectively), while CA after P5 causes a noticeable drop to 50.5%. Adding CA after all layers also results in a slight drop (51.1%). This suggests that in the original YOLOv8n, the P5 layer already contains highly semantic but spatially coarse features; inserting CA after it introduces redundant attention that disrupts the original feature distribution. Shallower layers (P3, P4) retain richer spatial details, allowing CA to function without harming performance. In our final architecture, however, the P5 layer is fundamentally changed by BiFPN, which fuses multi-scale information from shallower levels (P2–P4). This enriched P5 layer now provides both semantic and spatial cues, making it a suitable location for CA to exploit context. Consequently, in the full model, CA placed after the BiFPN-enhanced P5 layer (BiFPN_P2+CA) achieves a clear performance gain. This analysis indicates that the degradation is not inherent to CA itself but depends on the feature richness of the layer where it is inserted.
Building upon the BiFPN and CA modules, the addition of the DySample upsampler further increases mAP50 by 2.0% while causing a slight decrease in mAP50–95. Further analysis reveals that this upsampler provides a pronounced improvement for particularly challenging categories such as motorbike and bike. Taking the motorbike category as an example (performance detailed in Table 7), compared to YOLOv8n, its Precision (P) increases substantially by 35.1%, Recall (R) increases by 6.1%, mAP50 increases by 20.8%, and mAP50–95 increases by 12.3%. Compared to the BiFPN_P2+CA model, introducing the DySample module causes a slight accuracy trade-off for some categories, but the overall performance remains superior to the original baseline, achieving the best comprehensive performance. This suggests that DySample’s dynamic upsampling mechanism particularly benefits the details of small and blurry-edged targets at night, alleviating information loss in challenging samples. The slight drop in mAP50–95 indicates a minor regression in localization precision at stricter IoU thresholds, which mainly originates from a small loss in precision for a few easy categories; this minor loss is acceptable from a safety perspective. In contrast, the detection performance for motorbikes and bikes is greatly improved. From a safety-critical viewpoint, a missed detection of a vulnerable road user (e.g., a motorcyclist or cyclist) by an autonomous driving system could lead to a fatal accident, while a slight reduction in localization precision for common objects such as cars is unlikely to directly cause a collision. Therefore, trading a marginal loss in localization precision for a substantial improvement in detecting rare but high-risk categories is a reasonable trade-off for autonomous driving safety. This is especially significant for object detection in complex nighttime traffic scenes and directly contributes to the reliability of collision avoidance systems in real-world driving scenarios.
The point-sampling nature of DySample is well suited to nighttime images, which often suffer from a low signal-to-noise ratio (SNR) and abrupt illumination changes. DySample resamples a bilinearly interpolated continuous feature map via learned content-aware offsets, thereby avoiding noise amplification in low-SNR regions. We adopt the LP-style variant with a static scope factor because dynamic offsets may become unstable under extreme brightness contrast (e.g., headlights adjacent to shadows), whereas a static factor ensures stable sampling behavior. The offset range factor is set to the theoretical marginal value that prevents sampling overlap [28], which avoids boundary artifacts that are particularly harmful to small, dimly lit objects such as distant pedestrians or occluded vehicles. To verify the effectiveness of the DySample parameters, we conduct a brief sensitivity analysis by varying the offset range factor and g. As shown in Table 8, the original configuration achieves the best overall performance, outperforming both alternative configurations. A larger offset range factor introduces background noise, while increasing g to 8 also leads to performance degradation, and both result in a drop in FPS. Moreover, adjusting the offset range factor or g cannot recover the slight mAP50–95 loss. These results confirm that the original values of the offset range factor and g strike the best trade-off between accuracy and efficiency for nighttime traffic detection.
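The idea of resampling at base positions shifted by scaled offsets can be illustrated in one dimension. The sketch below is a toy under stated assumptions, not the DySample implementation: the raw offsets are passed in directly rather than predicted by a learned layer, the function names are hypothetical, and the `scope` value used in the example is chosen only for illustration.

```python
def sample_linear(signal, pos):
    """Linearly interpolate `signal` at a fractional position (border-clamped)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    left = int(pos)
    right = min(left + 1, len(signal) - 1)
    frac = pos - left
    return (1.0 - frac) * signal[left] + frac * signal[right]


def upsample_with_offsets(signal, raw_offsets, scope, scale=2):
    """Upsample a 1D signal by sampling at base positions plus scaled offsets."""
    out = []
    for j in range(len(signal) * scale):
        # Base position of output sample j in input coordinates.
        base = (j + 0.5) / scale - 0.5
        # A static scope factor bounds how far an offset can move the sample.
        out.append(sample_linear(signal, base + scope * raw_offsets[j]))
    return out


signal = [0.0, 2.0]
# Zero offsets reduce the scheme to plain linear upsampling.
plain = upsample_with_offsets(signal, [0.0] * 4, scope=0.25)
# Non-zero offsets shift every sampling point to the right.
shifted = upsample_with_offsets(signal, [1.0] * 4, scope=0.25)
```

Because the scope factor multiplies every offset, a smaller value keeps each sampling point close to its base position, while a larger value lets samples drift toward neighboring content.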
Relative to the baseline, the YOLOv8n-BCD model achieves substantial overall gains. Precision (P) increases substantially from 56.8% to 66.1%, while Recall (R) improves by 1.8%. The mAP50 and mAP50–95 see gains of 5.1% and 3.4%, respectively, indicating stronger generalization capability in nighttime traffic scenes. Meanwhile, the model reduces Params from 3.01 M to 2.79 M and achieves a high frame rate of 208 FPS, which preserves its lightweight and real-time characteristics. The parameter reduction mainly originates from replacing PANet with BiFPN in the neck network, which decreases the neck parameters by approximately 0.24 M through the removal of single-input edge nodes. Additionally, incorporating the shallow P2 feature map into the fusion network introduces a minor increase of 0.01 M, while the addition of CA and DySample contributes only negligible overhead (0.01 M in total). The proposed model attains a better balance between model complexity and detection precision, which makes it more favorable for deployment on platforms such as vehicle-mounted vision sensors.
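The parameter accounting in the paragraph above can be checked with a few lines of arithmetic. The figures are those reported in the text (in millions of parameters); the variable names are ours and purely illustrative.

```python
# Values in millions of parameters, as reported in the text.
baseline_params = 3.01
neck_saving = 0.24           # PANet replaced by BiFPN
p2_overhead = 0.01           # additional shallow P2 branch
ca_dysample_overhead = 0.01  # CA and DySample together

final_params = round(
    baseline_params - neck_saving + p2_overhead + ca_dysample_overhead, 2
)
relative_reduction = round(
    100 * (baseline_params - final_params) / baseline_params, 1
)
```

Under these figures the components add up to the reported 2.79 M, which corresponds to a relative reduction of about 7.3% with respect to the baseline.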
4.2. Comparative Experiments
To verify the advantages of YOLOv8n-BCD, a comprehensive comparison is performed against mainstream object detection algorithms. Other lightweight detection architectures from the YOLO series [13] are systematically evaluated, including YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv8s, YOLOv9-tiny, YOLOv10n, YOLOv11n, and YOLOv12n. The nighttime-specific methods discussed in the related work (YOLO-FA, YOLO-D, and FP-ZeroDCE+YOLOv7) have not made their official code publicly available, and the datasets and annotation categories they use differ from ours, so a direct comparison would be unfair. These methods are therefore not included in this comparative study. All experiments are performed within a consistent environment using a unified training strategy. The comparison results are given in Table 9, which presents the performance of each model’s best validation weights (best.pt) on the test subset.
Among all models tested, YOLOv8n-BCD secures the highest mAP50 at 56.6% and mAP50–95 at 29.9%. In comparison with the more recent YOLOv11n and YOLOv12n frameworks, it improves by 3.7% and 6.5%, respectively. Although GFLOPs and Params increase slightly, it attains a faster runtime speed of 208 FPS. Compared to YOLOv8s (a larger version in the YOLOv8 framework), our method improves Precision (P) and mAP50 by 9.2% and 3.6%, respectively. Meanwhile, it significantly reduces both GFLOPs and Params, resulting in a more lightweight and effective network structure. In addition, we observe that YOLOv6n achieves the highest Precision (P) and YOLOv3-tiny attains the fastest detection speed (FPS). However, these advantages come at the expense of lower overall detection accuracy, higher computational cost, and increased parameter count, rendering both models less suitable for deployment on resource-constrained real-world vehicle sensor platforms.
Object detection in nighttime traffic scenes is extremely challenging due to low illumination, strong noise interference, and pronounced scale variations. Traditional lightweight detection frameworks exhibit low detection accuracy and high missed detection rates in such difficult scenarios. By employing efficient feature fusion, a lightweight attention module, and the dynamic upsampler, our method strikes a desirable equilibrium between network complexity and processing efficiency. This leads to higher recall, improved accuracy, and superior generalization performance.
Figure 9 provides a qualitative comparison of several well-performing YOLO series models in a real nighttime traffic scene. It is evident that the YOLOv8n-BCD model demonstrates relatively higher average accuracy among the four models. Notably, for a challenging bike instance, YOLOv8n-BCD correctly detects it, whereas YOLOv8n and YOLOv11n completely miss it. Although YOLOv5n also outputs a bounding box labeled “bike”, this detection is a false positive—the target does not actually exist in the scene.
Figure 10 presents the detection results of various models under a more challenging rainy night scene with severe road reflections and glare. The proposed model accurately detects two traffic lights despite complex light interference, whereas YOLOv5n and YOLOv8n produce multiple false positive detections of traffic lights. In the darker region on the left side of the image, YOLOv5n, YOLOv8n, and YOLOv11n all output false positive vehicle detections, while only YOLOv8n-BCD, unaffected by the dark area and reflections, yields no false alarms. This comparison validates the effectiveness of the proposed model in complex nighttime scenes: it can not only detect targets missed by other models but also avoid false positives, thereby improving detection performance for rare categories and vulnerable road users, demonstrating better generalization capability.
To further validate the generalization capability of the proposed model, we conduct an evaluation on an additional BDD100K nighttime test set consisting of 21,992 images, from which annotation-incompatible images and the previously selected 3500 images have been removed. We also apply oversampling to the training set of our self-built subset (3500 images) and retrain the model to alleviate class imbalance. Table 10 presents the performance of the baseline YOLOv8n, the original YOLOv8n-BCD (without oversampling), and the retrained YOLOv8n-BCD-OS (with oversampling) on this larger test set. On this broader data distribution, the original YOLOv8n-BCD already improves over the baseline, increasing mAP50 from 38.6% to 40.1%. After oversampling, the YOLOv8n-BCD-OS model improves further, reaching an mAP50 of 41.3%. This indicates that our model maintains stable generalization ability on large-scale real-world nighttime data. The detection performance for the two rare categories, bike and motorbike, on this test set is shown in Table 11. Our method improves detection accuracy for both categories, and the model with oversampling achieves further improvement, indicating that the class imbalance issue is mitigated and the model’s detection robustness for vulnerable road users is enhanced.
Experimental results show that the proposed model achieves a significant performance improvement over the baseline. To confirm that this improvement is not due to random fluctuations during training, the baseline YOLOv8n and the proposed YOLOv8n-BCD model are each independently trained three times using three different random seeds (0, 42, and 123) on the self-built nighttime training set. All trained models are evaluated on the larger BDD100K nighttime test set comprising 21,992 images, which has a broader data distribution and a larger sample size, thereby facilitating statistical inference. Table 12 reports the mAP50 values obtained from the three independent training runs. A paired t-test is conducted on these three pairs of mAP50 values. The results show that YOLOv8n-BCD achieves a statistically significant improvement over the baseline (mean mAP50 difference = 1.97%, t(2) = 8.43, p = 0.014). This supports attributing the observed performance gain to the model architecture rather than to random chance.
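The paired t-test above can be reproduced in a few lines. The sketch below computes the paired t statistic and, for the three-seed case (two degrees of freedom), uses the closed-form two-sided p-value of the Student t distribution. The per-seed scores in the example are hypothetical and are not the values in Table 12; only the check on t(2) = 8.43 uses a number reported in the text.

```python
import math


def paired_t_statistic(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    return mean, t, n - 1


def two_sided_p_df2(t):
    """Exact two-sided p-value of Student's t distribution with 2 degrees of freedom."""
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)


# Consistency check on the reported statistic t(2) = 8.43.
p_reported = two_sided_p_df2(8.43)

# Hypothetical per-seed scores (NOT the values in Table 12).
baseline_scores = [10.0, 11.0, 12.0]
proposed_scores = [12.0, 12.5, 14.5]
mean_diff, t_stat, dof = paired_t_statistic(baseline_scores, proposed_scores)
p_toy = two_sided_p_df2(t_stat)
```

Evaluating the closed form at the reported statistic gives a p-value of roughly 0.014, in agreement with the value stated above. With only three pairs the test has two degrees of freedom, so the result is sensitive to the spread of the per-seed differences.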
To evaluate the proposed model’s perception capability across varying lighting conditions, we extract and filter the daytime portion of the BDD100K dataset and construct a daytime test set of 22,380 images for assessing detection performance under normal lighting. Table 13 presents the results of the baseline and the proposed model on this daytime test set. YOLOv8n achieves a daytime mAP50 of 43.0%, while our proposed YOLOv8n-BCD improves mAP50 by 2.5% over the baseline, reaching 45.5%. Although our model is specifically designed for nighttime environments, the experiments show that it also delivers a notable performance gain in the daytime, indicating good generalization and robustness across different lighting conditions. Notably, all models achieve substantially higher absolute mAP50 in daytime scenes than at night on comparably sized datasets (e.g., 43.0% on the 22,380-image daytime set vs. 38.6% on the 21,992-image nighttime set for the baseline). This confirms that nighttime perception degrades due to complex lighting, low signal-to-noise ratio, and object blur, making nighttime object detection more challenging. The paired t-test results reported earlier indicate that our model still attains a statistically significant average mAP50 improvement of 1.97% at night. This demonstrates that even in complex nighttime traffic scenes, where the baseline performance is already low and further improvement is difficult to achieve, YOLOv8n-BCD can still provide stable and statistically significant performance gains. Such stable gains are practically significant for ensuring the safety of nighttime autonomous driving. Moreover, YOLOv8n-BCD reduces the parameter count by 7.3% compared to the baseline, and it outperforms the parameter-heavier baseline across multiple lighting conditions. This improvement stems from structural modifications of the proposed model rather than from an increase in model capacity, achieving better generalization while retaining a lightweight architecture. These architectural modifications are motivated by nighttime-specific degradation phenomena, yet also yield performance gains under normal lighting, suggesting that solving the more challenging nighttime problem produces more robust feature representations. Taken together, the daytime and nighttime evaluation results show that YOLOv8n-BCD is applicable across varying lighting environments, offering a reliable and efficient lightweight foundational module for real-time visual perception in autonomous driving under diverse lighting conditions.
4.3. Failure Analysis
Figure 11 shows the column-normalized confusion matrix of the proposed YOLOv8n-BCD model on the self-built nighttime test set. For the person, car, and traffic light categories, the proportions of correctly predicted samples are 50%, 69%, and 51%, respectively, all at or above one half, indicating that the model has a moderate ability to recognize common nighttime traffic objects. For the two difficult categories, bike and motorbike, only 36% of bikes and 17% of motorbikes are correctly predicted. Missed detections exist in all categories and are particularly severe for bikes and motorbikes. Nighttime images generally suffer from low contrast, uneven illumination, and blurred object edges, causing targets to blend into the background under low-light conditions, especially when objects are distant or partially occluded. This makes it difficult for the model to extract sufficiently discriminative features and leads to missed detections. Bikes and motorbikes are small and easily occluded; their edge information is highly susceptible to loss under insufficient nighttime lighting, and they are often disturbed by headlight halos, making it hard for the model to distinguish them from background noise.
Some motorbikes are misclassified as cars, which may be because the light spots produced by motorbike tail lamps or headlights at night resemble those of cars, causing the model to incorrectly classify them as cars in the absence of detailed shape information. Moreover, the confusion matrix reveals that a considerable number of persons, cars, and traffic lights are predicted by the model even though these objects do not actually exist (false positives). This is due to the widespread presence of distant small objects, partially occluded targets, and complex lighting variations in nighttime traffic scenes, such as water stains, reflections from streetlights and headlights, and shadows of roadside buildings. These textures and patterns share some similarity with real objects at low resolution, leading to numerous false alarms; adverse conditions such as rain, fog, and glare further exacerbate these misjudgments.
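For readers unfamiliar with column normalization, the following sketch shows how the percentages in such a confusion matrix are obtained. It assumes the common convention in which rows are predictions and columns are ground-truth classes, with a final background row that collects missed objects; the counts are hypothetical and do not correspond to Figure 11.

```python
def column_normalize(matrix):
    """Divide each column by its sum so that every non-empty column sums to 1."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_sums = [sum(matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Hypothetical counts: rows are predictions, columns are ground-truth classes.
# Class order: car, motorbike, background.
counts = [
    [70, 3, 5],
    [0, 2, 1],
    [30, 5, 0],
]
normalized = column_normalize(counts)
```

Under this convention, a diagonal entry is the fraction of ground-truth instances of a class that are correctly predicted, the background row gives the missed-detection rate, and the background column collects false positives.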
Figure 12 presents three typical failure cases. In the first case, all persons in an extremely dark shadowed area are missed, one bike is also missed, and another bike is mistakenly detected as a car. The illumination in this area is extremely low, and the intensity difference between the targets and the background almost disappears, making it impossible for the model to extract effective edge and texture features. The second case involves a complex rainy night traffic scene. Extensive water accumulation on the road creates mirror-like reflections, raindrops and fog on vehicle windows further interfere with visibility, and strong glare from oncoming headlights and streetlights causes two cars on the left side of the image to be completely missed. In addition, the model falsely detects multiple non-existent cars and persons in the reflective areas and halos, while distant, partially occluded small objects are also not recognized. The third case occurs in snowy weather with accumulated snow on both sides of the road. A distant motorbike is partially occluded by snow and appears with low resolution and blurriness; the model fails to capture its features and thus misses it. Analysis of these failure cases reveals that the model suffers from insufficient feature representation capability under extremely low illumination, particularly poor sensitivity to blurred, low-contrast targets. Moreover, under adverse snowy/rainy weather and strong lighting variations, the model is easily disturbed by reflections, glare, and occlusion, leading to both numerous false positives and missed detections of genuine targets.
In summary, although the proposed model achieves significant improvements over the baseline, further enhancement is still needed. The main bottlenecks in nighttime traffic scenes are summarized as follows. First, feature extraction for targets under extremely low illumination remains insufficient, leading to missed detections and false positives. Second, the model exhibits poor robustness to glare, reflections, and adverse weather conditions such as rain, snow, and fog, which also contribute to false positives. Third, the discriminative ability for partially occluded and distant small objects is still limited. These limitations point toward key directions for future model optimization.
4.4. Practical Deployment Challenges
Although the proposed YOLOv8n-BCD achieves a good balance between detection accuracy and model size (2.79 M parameters, 208 FPS inference speed), deploying it on embedded autonomous vehicle platforms still involves trade-offs among memory footprint, real-time latency, and energy consumption. Model quantization is a straightforward and effective technique. Saranya et al. [38] demonstrated that applying INT8 post-training quantization to YOLOv8n on the Jetson Orin Nano reduced inference latency from 164.9 ms to 94.7 ms, with only about 1% loss in mAP, confirming that INT8 quantization is a viable path for edge deployment. Furthermore, structured pruning can directly reduce the number of model parameters and the computational cost. In YOLOv8n-BCD, the parameter reduction mainly originates from replacing PANet with BiFPN in the neck network, where the removal of single-input edge nodes decreases the neck parameters by approximately 0.24 M. However, this level of reduction is still insufficient for real-world onboard deployment. Zhou et al. [39] employed the LAMP pruning method, which globally prunes unimportant channels while keeping the detection heads intact. Applying LAMP alone to YOLOv8n compressed the model size from 6.0 MB to 2.4 MB, substantially reducing parameters but causing a slight decrease in mAP50. This shows that LAMP effectively compresses the model at the expense of a minor accuracy loss, a deficiency that can be remedied when combined with other modules. For our nighttime traffic object detector, the combination of INT8 quantization and LAMP-style pruning is a promising approach to meeting the strict storage and power constraints of onboard vision sensors.
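To make the idea of INT8 quantization concrete, the sketch below applies symmetric per-tensor quantization to a short list of weights and measures the round-trip error. It is a generic illustration under simplifying assumptions and is not the pipeline used in [38]; practical post-training quantization toolchains also calibrate activation ranges on sample data. The function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights.
weights = [0.50, -1.27, 0.003, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the reconstruction error is bounded by half the scale. Very small weights may collapse to zero, which is one source of the accuracy loss that quantization can introduce.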
Beyond model compression, practical deployment of the perception module also requires seamless integration with downstream vehicle control and planning systems. The module must deliver not only accurate detections but also outputs within a bounded time so that downstream control can make reliable decisions; end-to-end latency must therefore remain within an acceptable range. Saranya et al. [38] employed fixed-priority scheduling, CPU core affinity, and WCET analysis to ensure that 98.11% of inferences meet the soft deadline of 150 ms. They also introduced a deadline-miss penalty model, showing that bounded overruns (<30 ms) remain within the tolerance of autonomous vehicle control loops. For nighttime autonomous driving, real-time requirements are particularly critical. Because of the inherent difficulties of night scenes (low contrast, blurred object boundaries, and severe occlusion), the perception module is already prone to missed detections and false positives. If uncontrolled delays are added, the downstream control system will not receive timely and reliable inputs, posing a serious threat to driving safety.
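A soft-deadline check of the kind described above can be expressed as a small summary over measured latencies. In the sketch below, the 150 ms deadline and the 30 ms overrun tolerance are taken from the text, whereas the function name `deadline_report` and the latency trace are hypothetical.

```python
def deadline_report(latencies_ms, deadline_ms=150.0, tolerance_ms=30.0):
    """Summarize soft-deadline behavior for a list of inference latencies."""
    n = len(latencies_ms)
    met = sum(1 for t in latencies_ms if t <= deadline_ms)
    overruns = [t - deadline_ms for t in latencies_ms if t > deadline_ms]
    return {
        "hit_rate_percent": 100.0 * met / n,
        "worst_overrun_ms": max(overruns, default=0.0),
        "all_overruns_bounded": all(o < tolerance_ms for o in overruns),
    }


# Hypothetical latency trace in milliseconds.
trace = [95.0, 102.5, 98.0, 161.0, 99.5]
report = deadline_report(trace)
```

Reporting the worst overrun alongside the hit rate matters because a high hit rate alone does not show whether the occasional miss stays within what the control loop can tolerate.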
Furthermore, a complete autonomous driving system relies not only on high-precision object detection but also on a clear pathway from detection outputs to executable driving decisions. Object detection is merely the first step in the autonomous driving pipeline; the key to safe autonomous navigation lies in converting 2D bounding boxes into spatial information that can be used for driving logic reasoning. In recent years, several studies have explored different technical routes for integrating detection results into the autonomous driving pipeline. Yu et al. [40] proposed YOLO MDE, which adds an extra depth prediction channel to the output layer of YOLOv4, unifying 2D object detection and monocular depth estimation within a single network architecture and enabling the system to directly output distance information while recognizing objects. This approach equips the autonomous driving system with preliminary spatial perception, allowing it to distinguish nearby obstacles from distant background and thereby providing a basis for braking or obstacle-avoidance decisions. However, distance information alone is insufficient to support complete driving decisions. The perception system must also understand the spatial context surrounding objects, especially the ego vehicle’s drivable area, to determine whether detected objects genuinely lie on the driving path and pose a real threat. The YOLOP model proposed by Wu et al. [41] simultaneously performs traffic object detection, drivable area segmentation, and lane detection in a single unified network, delivering 2D perceptual information that encompasses obstacle positions, safe traversable space, and road structure, thus laying a richer semantic foundation for subsequent path planning and risk assessment.
Although models like YOLOP have greatly enhanced environmental understanding on the 2D plane, they still essentially reason by projecting the world onto the image plane and cannot directly acquire the precise 3D coordinates, dimensions, and orientation of objects. Such 3D information is crucial for accurate obstacle avoidance and path planning in complex traffic scenarios (e.g., at night). To overcome this limitation, the system must map detection results into an ego-centric 3D coordinate system to obtain the complete spatial position, size, and orientation of objects, which typically relies on multi-sensor fusion techniques. In this direction, the C2L3-Fusion framework proposed by Ngo et al. [42] employs the CLOCs mechanism to perform decision-level fusion of 2D detections from YOLOv8 and 3D LiDAR point cloud detections from PointPillars, directly outputting refined 3D bounding boxes. By contrast, the work by Murendeni et al. [43] adopts a route of model-level modification and feature-level fusion. Taking YOLOv4 as the base framework, they extend the network output layers to simultaneously predict object depth, 3D dimensions, and orientation angle, and they introduce a multi-task loss function for joint optimization. This reconstructs the original 2D detector into a unified network capable of directly reasoning about 3D spatial information, while feature-level fusion of LiDAR point clouds and RGB images enhances depth estimation. These two approaches construct a complete transformation pathway from 2D detection to 3D spatial perception for autonomous driving systems, meeting the core demand of downstream control modules for precise spatial information.
Overall, 2D object detection results can be transformed into spatial logic that supports autonomous driving decisions through various technical pathways such as depth estimation, environmental understanding, and 3D spatial mapping. The YOLOv8n-BCD model proposed in this paper, as a lightweight nighttime traffic object detection vision module, can provide high-quality detection outputs for the nighttime autonomous driving system pipeline. Through the future integration of methods such as depth estimation and multi-sensor-based 3D spatial coordinate perception, it could offer a more reliable perception foundation for subsequent spatial reasoning and risk assessment. This establishes a cost-effective and efficient practical pathway for building robust nighttime autonomous driving perception systems under resource-constrained conditions.