An Efficient and Lightweight Model for Traffic Object Detection in Autonomous Vehicles Under Nighttime Conditions

Ou, Ruiyang; Du, Luyao; Chen, Wei; Liu, Huiheng

doi:10.3390/act15060313

¹ School of Automation, Wuhan University of Technology (WHUT), Wuhan 430070, China

² School of Physics and Electronic Engineering, Hubei University of Arts and Science, Xiangyang 441053, China

* Authors to whom correspondence should be addressed.

Actuators 2026, 15(6), 313; https://doi.org/10.3390/act15060313

Submission received: 11 March 2026 / Revised: 2 May 2026 / Accepted: 22 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Autonomous Vehicles Impact on Roads and Control Strategies)

Abstract

Traffic object detection based on camera sensors is a critical task for autonomous vehicles. However, in nighttime conditions with adverse lighting, several challenges arise: blurred object edges, large variations in object scale, and complex lighting conditions involving both overexposure and underexposure. As a result, it remains difficult for vision-based perception tasks to ensure reliable precision and rapid inference simultaneously. This paper proposes a novel, efficient, and lightweight vision module for detecting traffic objects in challenging nighttime environments, developed by enhancing the YOLOv8n architecture. First, a bidirectional weighted feature fusion method (BiFPN) is incorporated in the path aggregation network, and an additional shallow P2 feature map is introduced to fully utilize key information from features at different scales. Then, the coordinate attention (CA) module is inserted between the end of the feature pyramid and the detection head to capture both semantic and spatial information of the object. Finally, the dynamic upsampler (DySample) is employed to guide the model in focusing on the detailed features of challenging samples, thereby balancing accuracy across different object categories. A subset of nighttime traffic scenes is curated from the BDD100K dataset for the evaluation of the proposed approach. The experiments demonstrate that, relative to the baseline, our method raises the mean average precision ($\mathrm{mAP}_{50}$) from 51.5% to 56.6%, achieves a 7.3% decrease in parameter quantity, and maintains a fast inference speed of 208 FPS. For the challenging bike and motorbike categories, notable improvements in detection accuracy are achieved. Compared with other advanced YOLO-series models such as YOLOv11, the proposed model also exhibits significant performance advantages with a 3.7% higher $\mathrm{mAP}_{50}$. Furthermore, our model demonstrates good generalization performance on the larger BDD100K nighttime partition. The findings confirm that our approach significantly improves detection accuracy without compromising real-time processing, highlighting its potential as a lightweight vision module providing reliable perceptual inputs for autonomous vehicle control and safety actuators in challenging nighttime scenarios.

Keywords:

autonomous vehicles; nighttime environmental perception; traffic object detection; sensor reliability; neural networks

1. Introduction

Reliable environmental perception is crucial for the safe operation of autonomous vehicles. Camera sensors, with their excellent performance and low cost, serve as the cornerstone of modern vehicle perception systems. However, as passive sensors, the performance of standard RGB cameras heavily depends on ambient lighting conditions. In challenging traffic scenarios such as low-light nighttime environments and glaring vehicle headlights, image quality severely degrades, leading to significant performance deterioration [1]. Studies have shown that in low-light conditions, perception devices like camera sensors may suffer from complete failure of visual odometry. In glare-interference environments, the quality of camera data can decrease by 50–70%, with perception error rates reaching as high as 40% [2]. This decline in perception performance in complex nighttime traffic scenes poses a serious threat to the driving safety of intelligent vehicles.

Computer vision has undergone substantial progress over the past decade, fueled by the rise of deep learning. Convolutional Neural Networks (CNNs) [3] and Vision Transformer models [4] have demonstrated powerful representation learning capabilities and are widely applied in computer vision, including instance segmentation (e.g., Mask R-CNN [5]), image classification (e.g., GhostNet [6]), and object detection. Thottempudi et al. [7] noted that algorithm frameworks based on deep learning and sensor fusion have become the mainstream approach to addressing the challenges of environmental perception for autonomous driving in complex scenes. Classic two-stage detection algorithms, represented by Fast R-CNN [8] and Faster R-CNN [9], consist of two stages: region proposal and region recognition. They generate high-quality candidate boxes before performing detection and regression, providing higher accuracy while incurring greater processing time. Set-based detectors like DETR [10,11] formulate object detection as a direct set prediction problem using a transformer and bipartite matching, eliminating hand-crafted components such as anchors and NMS. However, they suffer from slow convergence, high computational cost, and poor performance on small objects, making them less suitable for complex traffic scenes where small and densely distributed targets are common. By comparison, one-stage detection networks such as YOLO [12,13] and SSD [14] reframe object detection as a regression task. The main benefit of these networks is their rapid inference capability coupled with reliable accuracy, which renders them highly appropriate for meeting the demanding real-time perception needs of autonomous driving systems. This offers a promising new avenue for mitigating the performance degradation of vision sensors caused by complex nighttime environments, which is fundamental for downstream vehicle control decisions and safety interventions.

While the application of YOLO series frameworks to nighttime traffic scenes has shown promise, substantial challenges remain due to the inherent difficulty of the task. Real-world nighttime scenarios present complex lighting conditions characterized by intertwined sources such as vehicle headlights and streetlights, resulting in localized glare and stark bright-dark contrasts. Images in low-light regions display low signal-to-noise ratios and dramatic detail loss. Objects exhibit large-scale variations, where small and distant objects are represented by limited pixels and exhibit less distinguishable features. These problems are compounded by severe occlusion in dense traffic and the presence of noticeable motion blur. Consequently, traditional models experience a marked increase in missed detections and a notable reduction in detection effectiveness. It is also essential to consider the model’s real-time capability and lightweight design in these challenging scenes. Balancing high performance and low complexity on resource-constrained vehicle platforms has become a critical challenge for automated driving systems (ADS).

To overcome these issues, we introduce YOLOv8n-BCD, an efficient and lightweight model designed to improve detection accuracy for traffic objects in nighttime scenarios. In the neck network, a Bidirectional Weighted Feature Pyramid Network (BiFPN) is employed, which replaces the traditional concatenation method with a weighting mechanism and incorporates higher-resolution feature maps during fusion. This enables the model to capture and fuse features from a broader range of spatial resolutions. A lightweight coordinate attention (CA) mechanism is introduced at the end of the feature output stage to fully utilize spatial contextual information and suppress noise. The dynamic upsampler (DySample) is adopted to adaptively focus on the detailed features of difficult samples, thereby reducing information loss. In summary, the principal contributions of this study are as follows:

This paper proposes a detector particularly tailored to object detection within nighttime traffic scenes. The neck network utilizes a Bidirectional Weighted Feature Pyramid Network along with an additional shallow P2 feature map, optimizing the feature representation and aggregation mechanisms for objects at multiple scales.
Our study systematically investigates the impact of different attention modules under nighttime conditions. A CA mechanism is introduced at the feature output end to focus on spatial context and key target features while suppressing interference from complex background noise. The DySample operation is employed to reduce information loss for difficult targets, significantly improving detection accuracy for these challenging categories.
A nighttime traffic scene subset is constructed based on the large-scale public autonomous driving dataset BDD100K. The proposed model is evaluated and compared with cutting-edge lightweight YOLO models on this self-built dataset, demonstrating its effectiveness and real-time performance.

2. Related Work

Object detection algorithms accomplish detection tasks by extracting spatial information and semantic features of objects to achieve accurate localization and classification of targets within an image. Currently, various detection algorithms have been applied in the field of autonomous driving environmental perception, with significant progress being made. For example, Tong et al. [15] developed MSSD, an enhanced traffic object detection algorithm derived from SSD. It employs a deep residual network and a bidirectional feature fusion structure to promote feature extraction capability and detection accuracy for small traffic objects. Lamichhane et al. [16] designed an improved sparse Transformer model (MST) based on the Vision Transformer, introducing a depth-aware attention mechanism combined with a cross-modal feature alignment module and a dynamic instance interaction mechanism. This significantly strengthens the generalization capability of object detection in complex traffic scenes. Notably, the YOLO family of models has received considerable attention for its superior efficiency and modern design as a single-stage detection approach. Xue et al. [17] developed YOLO-FSE, a lightweight vehicle detection model that improved upon YOLOv5 by incorporating the C3_Faster module from FasterNet and introducing a Shuffle Attention mechanism. This approach achieves a detection frame rate increase of over 40% and enhances the detection of small and distant vehicles. Wang et al. [18] proposed a lightweight detection algorithm, YOLOdrive, which designed a Spatial and Channel Reconstruction Convolution module (SCConvC2f) and optimizes YOLOv8n by combining the inverted residual structure, linear bottleneck layers, and depthwise separable convolution from MobileNetv2. This approach maintains high accuracy while substantially reducing model complexity.

Most existing detection methods perform well under adequate lighting but experience severe degradation in challenging conditions such as low-light nighttime or glare-intensive environments. Consequently, strengthening the adaptability and robustness of detection models in such complex environments remains a central research problem. Jain et al. [19] introduced a low-light enhancement approach leveraging Zero-Reference Deep Curve Estimation (Zero-DCE) and utilized pixel-wise adjustment curves to adaptively enhance image brightness and contrast. Zhang et al. [20] designed a feature-preserving low-light enhancement algorithm (FP-ZeroDCE) and integrated an Attentional Feature Fusion (AFF) module into YOLOv7, effectively improving low-light image quality and vehicle detection accuracy. Das et al. [21] proposed a detection framework named YOLO-D. It integrates a Low-Light Enhancement Module (LLEM) to improve image quality and employs a Multiscale Domain Adaptive Network (MS-DAN) to perform adversarial learning across multiple feature scales. This approach significantly improves the model’s robustness to varying lighting conditions. Although such integrated image preprocessing methods can improve model effectiveness, they tend to increase architectural complexity and parameter count. Therefore, it is essential to balance detection accuracy against model complexity. Li et al. [22] presented the YOLO-FA detector based on Type-1 Fuzzy Attention (T1FA). Instead of relying on image preprocessing, it reduces feature map uncertainty by re-weighting them with fuzzy entropy and captures multi-scale information by incorporating mixed depth convolution from the MetaFormer (MDFormer). This approach achieves high accuracy on the UA-DETRAC dataset with a better balance between precision and speed. Peng et al. [23] improved YOLOv8s for nighttime vehicle detection by integrating RepGFPN, a Deformable Attention Transformer (DAT), and the MPDIoU loss function, effectively addressing shallow feature loss, occluded target recognition, and bounding box distortion.

Over the past several years, multi-sensor hybrid models combining deep learning for perception have also yielded a series of achievements in challenging nighttime traffic scenes. Hazem Rashed et al. [24] proposed a camera–LiDAR fusion model named FuseMODNet for moving object detection. Using a three-stream CNN architecture, it deeply fuses image information with LiDAR motion information (which is unaffected by lighting), effectively improving the perception of moving targets in nighttime environments. Choi et al. [25] proposed a fusion strategy that integrates data from thermal infrared cameras and LiDAR sensors. This approach achieves sensor alignment through a dedicated calibration process, utilizes YOLOv4 to identify objects within thermal imagery, and fuses LiDAR point cloud data, significantly enhancing detection accuracy in nighttime scenes.

This paper proposes an efficient and lightweight detection method for difficult nighttime traffic scenarios. The objective is to improve object detection efficacy for blurred and small objects in nighttime environments, while maintaining a balance between precision and inference speed to satisfy the deployment requirements of vehicle-mounted vision sensors as a reliable perceptual foundation for autonomous vehicle control and safety systems.

3. Materials and Methods

3.1. Methodology

We propose YOLOv8n-BCD, a comprehensive framework for detecting objects in nighttime autonomous driving scenarios, constructed upon the classic lightweight YOLOv8n detector. The overview of the model architecture is presented in Figure 1, with the primary improvements including: a Bidirectional Feature Pyramid Network (BiFPN) for weighted feature fusion, a CA mechanism to capture spatial context, and a DySample upsampler to alleviate information loss.

3.1.1. Bidirectional Weighted Feature Fusion Network

In autonomous driving environmental perception under complex nighttime conditions, images captured by vision sensors commonly suffer from issues such as low contrast, poor signal-to-noise ratio, and uneven illumination. These problems cause key information to be mixed with irrelevant background noise and lead to significant degradation of features for objects at different scales. To address this, we improve the original model’s feature fusion network based on BiFPN’s dual-directional inter-scale connectivity and learnable fusion weights [26]. This enhancement facilitates dynamic information interaction across scales, thereby better extracting detailed semantic information.

Compared to the conventional Path Aggregation Network (PANet), we optimize the model at the structural level by adopting the BiFPN weighted bidirectional feature fusion mechanism and additionally incorporating shallow P2 features. As shown in Figure 2, the network builds upon the bottom-up and top-down feature fusion paths of PANet. It eliminates single-input edge nodes with limited impact on feature aggregation and introduces additional links between input and output nodes at corresponding levels. This constructs denser multi-scale feature interaction paths, fusing more features without substantially increasing computational cost. Furthermore, we introduce lower-level P2 features from the backbone network. Although feature maps at this level contain more noise interference, they possess higher resolution and retain more edge details and precise spatial location information. The influence of shallow noise can be suppressed after complementary fusion with deeper semantic information, aiding in the extraction and fusion of detailed features for small and blurred targets at night.

To better combine shallow-layer fine details with deep-layer semantic abstractions in nighttime autonomous driving, the BiFPN path aggregation network assigns an individual weight to each incoming feature map. It employs a fast normalized fusion strategy, enabling the network to dynamically determine the significance of different input features and balance the contributions of features at different resolutions. For $N$ input features to be fused, with $P_{i}$ denoting the feature map of the $i$-th input, the fast normalized fusion output $O$ is described by:

O = \sum_{i = 1}^{N} \frac{σ (w_{i}) \cdot P_{i}}{ϵ + \sum_{j = 1}^{N} σ (w_{j})}

(1)

where $w_{i}$ is a learnable weight associated with the $i$-th input feature $P_{i}$, used for weighted fusion of the extracted features; $ϵ = 0.0001$ is a small constant used to maintain numerical stability; and $σ(\cdot)$ denotes the Swish activation function, employed to ensure nonlinear normalization of the weights. Taking the P4 layer features as an example (Figure 2), the formulas for performing BiFPN feature fusion are as follows:

P_{4}^{mid} = \mathrm{Conv} \left( \frac{σ (w_{1}) \cdot P_{4}^{in} + σ (w_{2}) \cdot \mathrm{Upsampling} (P_{5}^{in})}{σ (w_{1}) + σ (w_{2}) + ϵ} \right)

(2)

P_{4}^{out} = \mathrm{Conv} \left( \frac{σ (w_{1}') \cdot P_{4}^{in} + σ (w_{2}') \cdot P_{4}^{mid} + σ (w_{3}') \cdot \mathrm{Downsampling} (P_{3}^{out})}{σ (w_{1}') + σ (w_{2}') + σ (w_{3}') + ϵ} \right)

(3)

where $P_{4}^{in}$ represents the original input features of this layer, $P_{4}^{mid}$ denotes the intermediate fused features from the top-down path, and $P_{4}^{out}$ is the final output fused feature of this layer. $\mathrm{Upsampling}(\cdot)$ and $\mathrm{Downsampling}(\cdot)$ represent upsampling and downsampling operations, respectively. A similar fusion scheme applies to other pyramid levels, with the indices adjusted accordingly.
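The fast normalized fusion of Equation (1) can be illustrated with a minimal sketch in plain Python. It assumes each feature map has already been resized to a common shape and flattened into a list of floats; the function names `swish` and `fast_normalized_fusion` are illustrative and are not taken from the authors' code. The convolution that follows the fusion in Equations (2) and (3) is omitted.

```python
import math


def swish(x: float) -> float:
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def fast_normalized_fusion(features, weights, eps=1e-4, act=swish):
    """Fuse equally sized feature maps following Eq. (1).

    features: list of N flat lists of floats (hypothetical toy inputs).
    weights:  list of N raw learnable scalars w_i.
    act:      gate applied to each weight; Swish follows the text above.
    """
    if len(features) != len(weights):
        raise ValueError("one weight is required per input feature")
    gates = [act(w) for w in weights]
    denom = eps + sum(gates)
    size = len(features[0])
    return [
        sum(g * f[k] for g, f in zip(gates, features)) / denom
        for k in range(size)
    ]


# Two toy "feature maps" with equal raw weights: the result is close to
# their element-wise average, slightly shrunk by eps in the denominator.
fused = fast_normalized_fusion([[1.0, 3.0], [3.0, 5.0]], [1.0, 1.0])
```

With equal weights the fusion reduces to a near-average of the inputs, while unequal weights shift the output toward the more heavily weighted map. The original BiFPN formulation gates the weights with ReLU to keep them non-negative; the `act` argument allows either choice to be tried in this sketch.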

For the additionally introduced P2 feature map, it is first downsampled via a convolution to match the spatial resolution of the P3 level. It is then fed, together with the original P3 feature and the previously fused P3 output, into a BiFPN node (see Figure 2, purple arrow). The learnable weights in the fast normalized fusion automatically adjust the contribution of each input. Although the P2 layer contains more noise due to its high resolution, the network learns to assign a lower weight to the entire P2 layer, thereby suppressing irrelevant noise while preserving fine edge details as much as possible. This dynamic weighting mechanism allows the model to benefit from the high-resolution P2 features without destabilizing the overall representation.

3.1.2. Coordinate Attention Mechanism

The multi-scale features fused by BiFPN contain rich contextual information but also a considerable amount of irrelevant background noise interference. Introducing a CA module [27] after feature fusion and before the detection head enables the model to emphasize important spatial regions and relevant contextual cues. This enhances the extraction of detailed features for small targets and reduces background interference without significantly increasing computational overhead, directly optimizing the quality of input features to the detection head.

Figure 3 provides a visualization of the procedures for embedding coordinate information and producing coordinate attention in the CA module. To preserve spatial positional information, the CA module separates 2D global pooling into two parallel 1D feature encoding processes. For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, it employs a pair of 1D global average pooling kernels, spanning $(H, 1)$ and $(1, W)$, to encode each channel along the horizontal and vertical axes separately. This generates two direction-sensitive feature representations, $z_{c}^{h}(h)$ and $z_{c}^{w}(w)$. For the $c$-th channel, the outputs at height $h$ and width $w$ can be expressed as:

z_{c}^{h} (h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c} (h, i)

(4)

z_{c}^{w} (w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c} (j, w)

(5)

where $C$, $H$, and $W$ refer to the channel number, spatial height, and spatial width of the input feature map, respectively.

To efficiently utilize the encoded coordinate information, the CA module first concatenates the generated aggregated feature maps $z_{c}^{h}(h)$ and $z_{c}^{w}(w)$ along the spatial dimension. A shared $1 \times 1$ convolutional transformation function $F_{1}$ is applied for nonlinear transformation and dimensionality reduction, producing an intermediate representation $f \in \mathbb{R}^{C / r \times (H + W)}$ that captures spatial characteristics across both horizontal and vertical axes:

f = δ (F_{1} ([z^{h}, z^{w}]))

(6)

where $[\cdot, \cdot]$ indicates the concatenation process, $δ$ stands for a nonlinear activation operator, and $r$ is a reduction ratio controlling the module’s complexity. Subsequently, the intermediate feature $f$ is partitioned spatially to obtain two distinct tensors: $f^{h} \in \mathbb{R}^{C / r \times H}$ and $f^{w} \in \mathbb{R}^{C / r \times W}$. Both tensors are then mapped back to the input channel size using individual $1 \times 1$ convolutional layers, $F_{h}$ and $F_{w}$, resulting in:

g^{h} = σ (F_{h} (f^{h}))

(7)

g^{w} = σ (F_{w} (f^{w}))

(8)

where $σ$ is the sigmoid activation function. The resulting outputs $g^{h}$ and $g^{w}$ are then expanded to serve as attention coefficients and are applied to the input features through element-wise multiplication. Thus, the coordinate attention block produces an output that can be described as:

y_{c} (i, j) = x_{c} (i, j) \times g_{c}^{h} (i) \times g_{c}^{w} (j)

(9)

This mechanism enables the CA module to dynamically emphasize spatial location information relevant to the target in the feature map, suppressing background noise and enhancing the representation capability for small targets and key details. Experimental results in subsequent sections indicate that placing the CA mechanism at the end of the feature fusion output of the $P_{5}$ layer, which contains the richest semantic information, can effectively improve detection performance.
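The directional pooling of Equations (4) and (5) and the reweighting of Equation (9) can be sketched for a single channel in plain Python. This is a simplified illustration under stated assumptions: the channel is a nested list of shape H x W, and the learned $1 \times 1$ convolutions $F_{1}$, $F_{h}$, and $F_{w}$ of Equations (6) to (8) are skipped, so `toy_coordinate_attention` applies the sigmoid directly to the pooled values. All function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def directional_pool(x):
    """Eqs. (4)-(5): average one H x W channel along width and along height."""
    height, width = len(x), len(x[0])
    z_h = [sum(row) / width for row in x]  # one value per row h
    z_w = [sum(x[j][w] for j in range(height)) / height for w in range(width)]
    return z_h, z_w


def apply_coordinate_attention(x, g_h, g_w):
    """Eq. (9): y(i, j) = x(i, j) * g_h(i) * g_w(j)."""
    return [
        [x[i][j] * g_h[i] * g_w[j] for j in range(len(x[0]))]
        for i in range(len(x))
    ]


def toy_coordinate_attention(x):
    """Assumption: learned convolutions are replaced by the identity."""
    z_h, z_w = directional_pool(x)
    g_h = [sigmoid(v) for v in z_h]
    g_w = [sigmoid(v) for v in z_w]
    return apply_coordinate_attention(x, g_h, g_w)


channel = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
row_means, col_means = directional_pool(channel)
```

Because the two attention vectors are indexed by row and by column, each output position is scaled by a product of one row coefficient and one column coefficient, which is how the module retains positional information that a single global pooling step would discard.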

3.1.3. Dynamic Upsampler DySample

In traffic object detection tasks, nearby vehicles may occupy a large portion of the image, while medium and small distant targets consist of only a few pixels. This leads to large object scale variations, a problem exacerbated at night by complex background noise and blurred target contours. We introduce the dynamic upsampler DySample [28], specifically employing the LP-style variant with a static scope factor. This approach helps guide the model to dynamically attend to information from objects at different scales, reduce information loss for difficult samples, and particularly improve the feature extraction capability for challenging categories.

The DySample upsampler module learns content-aware offsets in groups. These offsets are added to a bilinearly initialized grid to form a dynamic sampling grid, thereby adjusting the sampling location in the continuous space of the input feature for each output point. Figure 4 illustrates the workflow of the DySample upsampler. Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$ and an upsampling scale factor $s$, a linear transformation with $C$ input and $2 g s^{2}$ output channels (with $g$ denoting the total number of groups) is employed to produce offsets. These offsets are subsequently reorganized via pixel shuffle into $O \in \mathbb{R}^{2 g \times s H \times s W}$. The offset $O$ and the sampling set $S$ are defined as:

O = λ \cdot \mathrm{linear} (X)

(10)

S = G + O

(11)

where $λ = 0.25$ is a static scope factor that constrains the offset range to prevent sampling point overlap, $\mathrm{linear}(\cdot)$ is a linear projection operation implemented via a $1 \times 1$ convolution, and $G$ is the bilinear initialization grid. Finally, using the bilinear sampling function grid_sample in PyTorch 2.9.0, values are interpolated from $X$ according to the coordinates in the grid $S$, yielding the output feature map $X' \in \mathbb{R}^{C \times s H \times s W}$:

X' = \mathrm{grid\_sample} (X, S)

(12)

To enhance the module’s adaptability to different semantic features, the feature map is segmented into four channel-wise groups ($g = 4$) during the sampling phase. Each group is sampled independently, and the results are subsequently merged. This mechanism enables parallel processing of multi-scale information, extracts more edge detail features, and improves detection accuracy for difficult samples. The choice of these hyperparameters is validated by the sensitivity analysis presented in the subsequent ablation experiments.
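The idea behind Equations (10) to (12), sampling the input at base grid positions shifted by scaled offsets, can be sketched for a single channel in plain Python. Several simplifications are assumptions of this sketch rather than details from the paper: offsets are passed in directly instead of being produced by a $1 \times 1$ convolution and pixel shuffle, they are expressed in input-pixel units rather than the normalized coordinates that PyTorch's grid_sample expects, channel groups are omitted, and out-of-range positions are clamped to the border. The names `bilinear_sample` and `dynamic_upsample` are hypothetical.

```python
import math


def bilinear_sample(x, r, c):
    """Bilinearly interpolate an H x W map at continuous (row, col)."""
    height, width = len(x), len(x[0])
    r = min(max(r, 0.0), height - 1.0)  # clamp to the border
    c = min(max(c, 0.0), width - 1.0)
    r0, c0 = int(math.floor(r)), int(math.floor(c))
    r1, c1 = min(r0 + 1, height - 1), min(c0 + 1, width - 1)
    dr, dc = r - r0, c - c0
    top = x[r0][c0] * (1.0 - dc) + x[r0][c1] * dc
    bottom = x[r1][c0] * (1.0 - dc) + x[r1][c1] * dc
    return top * (1.0 - dr) + bottom * dr


def dynamic_upsample(x, offsets, s=2, lam=0.25):
    """Sample x at S = G + lam * offsets, a simplified Eqs. (10)-(12).

    offsets[i][j] is a (d_row, d_col) pair for output pixel (i, j);
    the output has shape (s*H) x (s*W).
    """
    height, width = len(x), len(x[0])
    out = []
    for i in range(s * height):
        row = []
        for j in range(s * width):
            base_r = (i + 0.5) / s - 0.5  # base grid G (pixel centers)
            base_c = (j + 0.5) / s - 0.5
            d_r, d_c = offsets[i][j]
            row.append(bilinear_sample(x, base_r + lam * d_r, base_c + lam * d_c))
        out.append(row)
    return out


x = [[0.0, 4.0]]
zero = [[(0.0, 0.0)] * 4 for _ in range(2)]
plain = dynamic_upsample(x, zero)  # zero offsets: ordinary bilinear upsampling
```

With all offsets set to zero the sketch reduces to standard bilinear upsampling, which matches the role of the bilinear initialization grid $G$. Non-zero offsets move individual sampling points, and the factor `lam` limits how far each point can move.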

3.2. Dataset

For performance assessment of the enhanced YOLOv8n model in nighttime traffic scenarios, a nighttime subset is constructed based on the BDD100K dataset [29]. The BDD100K dataset serves as a large-scale public resource focused on traffic object recognition in autonomous driving. All its data are collected by vision sensors mounted on real vehicles, comprising 100,000 high-resolution images that comprehensively cover real driving scenarios across diverse regions, time periods, and weather conditions.

Using LabelImg 1.8.6 and Python 3.11.13 scripts, we select 3500 high-quality images depicting nighttime traffic scenes from BDD100K. The number 3500 is chosen to balance the representativeness of rare categories with computational feasibility. Specifically, in the original BDD100K nighttime partition, the numbers of bike and motorbike instances are far fewer than those of categories such as car. A larger subset would further aggravate the class imbalance, leading to unstable training while also increasing computational cost. Therefore, our curated subset retains five common traffic object categories: person, car, bike, motorbike, and traffic light. It intentionally includes a high proportion of images containing bikes and motorbikes to enable a fairer evaluation of these vulnerable road users. Nevertheless, due to the original data distribution, the overall proportions of bikes and motorbikes remain relatively low. The selected images feature high instance density, with an average of approximately 19 annotated instances per image. They encompass various low-light conditions and objects at different scales, providing a representative picture of the complex nighttime traffic environment. Most images have a resolution of $1280 \times 720$, offering sufficient spatial detail for model learning. As shown in Figure 5, the established dataset includes various challenging real-world nighttime scenarios, ensuring targeted evaluation.

The dataset is randomly partitioned into training, validation, and test subsets at a ratio of 7:1.5:1.5, containing 2450, 525, and 525 images, respectively. The training set facilitates model parameter optimization, the validation set tracks training progress, and the test set assesses the model’s ability to generalize to novel samples. Stratified sampling based on instance counts is utilized during dataset partitioning to maintain balanced category representation. The specific annotation counts for each category are shown in Table 1. To further alleviate the class imbalance issue, we later apply $3 \times$ oversampling to the rare bike and motorbike categories in the training set of our constructed dataset, thereby increasing their proportion in the training set. Additionally, to further evaluate the generalization performance of the model, we remove annotation-incompatible images and the previously selected 3500 images from the full BDD100K nighttime partition, and construct an additional test set containing 21,992 nighttime traffic scene images. This extra test set is used to assess the model’s performance on a larger scale of data. Furthermore, to enable cross-lighting evaluation, we construct a daytime test set of 22,380 images from the daytime portion of the BDD100K dataset to assess the model’s generalization performance under normal lighting conditions.
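The split sizes quoted above follow directly from the 7:1.5:1.5 ratio, as the short sketch below shows. It uses a plain seeded shuffle for illustration only; the stratified sampling by instance counts described in the text is not reproduced here, and the function names and the seed are assumptions.

```python
import random


def split_counts(n, ratios=(7, 1.5, 1.5)):
    """Return (train, val, test) sizes for n items at the given ratio."""
    total = sum(ratios)
    n_train = round(n * ratios[0] / total)
    n_val = round(n * ratios[1] / total)
    return n_train, n_val, n - n_train - n_val


def random_split(items, ratios=(7, 1.5, 1.5), seed=0):
    """Shuffle with a fixed seed, then cut into three disjoint subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val, _ = split_counts(len(items), ratios)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )


train, val, test = random_split(range(3500))
print(len(train), len(val), len(test))  # 2450 525 525
```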

3.3. Experimental Setup

All experimental procedures in this study were performed using a personal laptop equipped with an Intel(R) Core(TM) Ultra 9 275HX CPU (base frequency 2.70 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 5070 Laptop GPU (8 GB VRAM). The software environment is based on the Windows 11 operating system, utilizing an isolated Python environment managed by Anaconda. The deep learning framework used is PyTorch 2.9.0 (CUDA 12.8) with Ultralytics 8.3.163 for YOLOv8, and the programming language is Python 3.11.13.

The training hyperparameters are listed in Table 2. Other parameters remain consistent with the default configuration of YOLOv8n. Mosaic data augmentation is deactivated during the last 10 epochs of training. The loss function employed in all experiments consists of bounding box regression loss, classification loss, and distribution focal loss, expressed as:

\mathrm{Loss} = λ_{box} \cdot L_{CIoU} + λ_{cls} \cdot L_{BCE} + λ_{dfl} \cdot L_{DFL}

(13)

where $λ_{box} = 7.5$ is the weight coefficient for bounding box loss, $λ_{cls} = 0.5$ is the weight coefficient for classification loss, $λ_{dfl} = 1.5$ is the weight coefficient for distribution focal loss, and the default CIoU [30] loss of YOLOv8n is employed for bounding box regression. This loss formulation follows the standard YOLOv8n configuration as implemented in the Ultralytics YOLOv8 framework [31].
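As a small worked example of Equation (13), the weighted sum can be written as a single function. The three component losses are assumed to be already computed scalars; the function name and the sample values are illustrative only.

```python
def total_loss(l_ciou, l_bce, l_dfl, w_box=7.5, w_cls=0.5, w_dfl=1.5):
    """Eq. (13): weighted sum of box, classification, and DFL losses."""
    return w_box * l_ciou + w_cls * l_bce + w_dfl * l_dfl


# Hypothetical component values, chosen only to show the weighting.
print(total_loss(1.0, 2.0, 2.0))  # 7.5 + 1.0 + 3.0 = 11.5
```

The comparatively large box weight means that, for component losses of similar magnitude, localization error contributes most to the total.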

Given the highly challenging nature of the dataset, shorter training cycles may lead to insufficient model learning. For some models, the final-epoch weights (last.pt) perform significantly better on the test subset than the weights with the best validation performance (best.pt). To ensure thorough learning and convergence, we extend the training to 600 epochs and apply an early stopping mechanism that halts training if no progress is observed for 200 consecutive epochs. The training loss curve for the YOLOv8n-BCD model is shown in Figure 6. Over the entire 600-epoch training process, the loss value decreases rapidly in the first 200 epochs, then slows down, and finally stabilizes at a low level, indicating that the model training has converged. During training with automatically adjusted batch size, the proposed YOLOv8n-BCD model achieves a peak GPU memory usage of approximately 7.04 GB and an average memory usage of 5.43 GB. For inference on a single 640 × 640 image with a batch size of 1, the peak memory usage is about 0.4 GB. The baseline YOLOv8n exhibits similar memory consumption.
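The early stopping rule described above (600 epochs at most, stop after 200 epochs without progress) can be sketched as follows. The sketch assumes that "progress" means a strict improvement of a single scalar validation score per epoch; the exact criterion used by the training framework is not specified in the text, and the function name is hypothetical.

```python
def stopping_epoch(val_scores, patience=200, max_epochs=600):
    """Return (last_epoch, best_epoch), both 1-based.

    last_epoch is where training ends (analogous to last.pt);
    best_epoch is where the validation score peaked (analogous to best.pt).
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return min(len(val_scores), max_epochs), best_epoch


# Toy run with a short patience so the behaviour is easy to trace.
scores = [0.1, 0.2] + [0.15] * 10
print(stopping_epoch(scores, patience=3))  # (5, 2)
```

The two returned epochs generally differ, which is why the final weights and the best validation weights can behave differently on the test subset.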

3.4. Evaluation Metrics

For a thorough appraisal of our model and an analysis of the role played by individual components, this paper employs Precision ($P$), Recall ($R$), mean Average Precision (mAP), parameter count (Params), giga floating-point operations (GFLOPs), Inference Time, Frames Per Second (FPS), and Model Size as evaluation metrics.

Precision reflects the correctness of the model’s predictions, quantified as the ratio of true positives to all instances classified as positive. Recall measures the model’s capacity to identify actual positives, defined as the fraction of true positives retrieved out of all real positive cases. They are expressed as:

P = \frac{T P}{T P + F P}

(14)

R = \frac{T P}{T P + F N}

(15)

where TP, FP, and FN denote the counts of true positives, false positives, and false negatives (missed detections), respectively. High precision indicates high dependability of the predicted outcomes, while high recall indicates fewer missed targets. The Average Precision (AP) for a given category is determined by computing the area beneath the corresponding Precision–Recall curve. The mean Average Precision (mAP) is subsequently derived by calculating the mean AP across all categories, providing a comprehensive criterion for assessing overall model effectiveness. They can be expressed as:

A P = \int_{0}^{1} P (R) d R

(16)

mAP = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}

(17)

where N denotes the total count of categories. For evaluation, we employ mAP_50 and mAP_50–95 as performance indicators. Specifically, mAP_50 refers to the mean Average Precision calculated at an IoU threshold of 0.5, while mAP_50–95 denotes the mean value of AP measured across ten IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05.
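Equations (14) to (17) can be illustrated with a small numerical sketch. The integration rule used for AP below (a right-to-left running maximum of precision followed by a rectangle sum) is one common convention and is an assumption here; the evaluation code used for the reported results may interpolate differently. All function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Eqs. (14)-(15): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Eq. (16): area under the P(R) curve for one category.

    Points must be sorted by increasing recall. Precision is first made
    non-increasing in recall (running maximum taken from the right),
    then the area is accumulated as rectangles between recall points.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Eq. (17): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# The ten IoU thresholds averaged by mAP_50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

For mAP_50, the per-category AP values are computed with matches accepted at IoU ≥ 0.5; for mAP_50–95, the same computation is repeated at each threshold in `IOU_THRESHOLDS` and the resulting mAP values are averaged.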

Parameter count (Params) quantifies the total trainable variables within the model architecture. GFLOPs, measured in billions of floating-point operations, indicates the total computation involved in executing a single forward inference through the model. Both metrics help measure the model’s structural complexity and processing efficiency. Inference Time is the average computation time required for the model to process a single image. The frame rate (FPS), reflecting the per-second image processing speed, directly indicates the model’s real-time processing capability and lightweight nature, expressed as:

FPS = \frac{1000}{\text{Inference Time (ms)}}

(18)
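As a quick check of Equation (18), the conversion between per-image latency and frame rate can be written as follows. The helper names are hypothetical; the only input taken from the paper is the reported frame rate.

```python
def fps_from_latency_ms(latency_ms):
    """Eq. (18): frames per second from per-image inference time in ms."""
    return 1000.0 / latency_ms


def latency_ms_from_fps(fps):
    """Inverse of Eq. (18): per-image inference time in ms from FPS."""
    return 1000.0 / fps
```

Under this relation, the 208 FPS reported for YOLOv8n-BCD corresponds to roughly 4.81 ms per image.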

4. Results and Discussion

This section presents multiple experiments and analyses performed on the established dataset, including ablation experiments, comparative experiments, failure analysis, practical deployment challenges, and some exploratory experiments. All experimental results reported in the tables of this section are obtained by evaluating the best validation weights (best.pt) on the test set, ensuring a fair and consistent evaluation protocol across all experiments.

4.1. Ablation Experiments

To analyze how various path aggregation networks influence object detection outcomes in nighttime traffic scenarios, we conduct a detailed exploration of the neck structures in YOLOv8n. Various feature fusion methods are tried, including Slim-neck [32], CCFF [33], and BiFPN. The influence of introducing P2 low-level features into the BiFPN network is also explored. Evaluations are performed on the self-built BDD100K nighttime test subset. For a fair comparison, Table 3 reports the evaluation results obtained by evaluating the best validation weights (best.pt) of each method on the test set.

Experimental findings reveal that effective feature fusion methods can improve nighttime traffic object detection performance. Among them, introducing P2 shallow features into the BiFPN network achieves the best overall performance with only a minimal increase in computation, reaching a peak mAP_50 of 53.8% and mAP_50–95 of 29.3%, representing improvements of 0.7% and 0.8% over the original BiFPN, respectively. Meanwhile, it maintains moderate Params (2.78 M), GFLOPs (8.1 G), and a fast detection speed (263 FPS). Nighttime images typically suffer from low contrast, high noise, and uneven illumination, leading to blurred object edges and loss of fine details. The weighted fusion mechanism in BiFPN assigns learnable weights to each input feature level, dynamically suppressing low-quality, noisy levels while enhancing high-quality, detail-rich levels. This adaptive weighting strategy enables the network to automatically reduce interference from low-SNR regions when fusing multi-scale features. The additionally introduced high-resolution P2 features preserve the fine edges and texture information of small objects, which are severely degraded in deeper features. After weighted fusion with deeper features, the shallow noise is suppressed, edge detail perception is enhanced, and detection performance for small and low-contrast targets is improved. This makes the BiFPN design particularly suitable for nighttime traffic object detection.
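The weighted fusion idea can be illustrated with a small sketch. It assumes the "fast normalized fusion" form from the original BiFPN formulation, in which each input level receives a non-negative learnable weight and the weights are normalized by their sum; this section does not state the exact formula used, so the form below is an assumption. Features are flattened to plain lists for brevity, and `weighted_fusion` is a hypothetical name.

```python
def weighted_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature vectors with normalized non-negative weights.

    features: list of feature vectors (already resized to a common shape)
    weights:  one learnable scalar per input level
    """
    # ReLU keeps every weight non-negative, so a level can be muted but
    # never subtracted.
    w = [max(0.0, wi) for wi in weights]
    total = sum(w) + eps  # eps avoids division by zero
    length = len(features[0])
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]
```

A level whose weight is driven to zero or below during training contributes nothing to the fused output, which is the mechanism by which a noisy level could be suppressed.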

To promote the model’s capability to extract fused features, an attention module is integrated between the output end of the feature fusion network and detection head—a location whose effectiveness has been well-validated, as shown in Figure 1. The experiment systematically evaluates the effects of various advanced and traditional attention mechanisms, experimenting with different configurations to identify the optimal one under nighttime traffic conditions. A summary of the results can be found in Table 4.

The experimental results demonstrate that lightweight attention modules with relatively simple structures, such as SE and CA, further improve model performance without increasing computational cost. As shown in Table 4, adopting the SE channel attention mechanism achieves the highest mAP_50 (54.7%). Integrating the CA mechanism significantly enhances the model’s overall performance, maintaining a high mAP_50 (54.6%) while achieving the highest mAP_50–95 (30.1%) and a reasonably fast detection speed (233 FPS), with only a 0.01 M increase in Params. In contrast, more complex attention mechanisms such as MSDA [34], ACmix [35], and LSAK [36] do not demonstrate better synergistic effects: performance decreases or even falls below that of the baseline model. This suggests that complex computations may conflict with BiFPN’s dynamic weighted fusion mechanism, amplifying background noise at night.

Additionally, we attempt to introduce attention modules at the end of multiple feature layers (P3, P4, P5). This approach does not lead to an improvement in accuracy. Instead, it increases network parameters and computational burden. Consequently, we introduce only a single lightweight CA module at the end of the top P5 layer of the feature pyramid, which guides the model to focus on the fused high-level semantic features and spatial information with negligible added parameters and computational costs. Under low-light conditions, the intensity difference between objects and the background is extremely small. CA performs global average pooling separately along the horizontal and vertical directions, generating a pair of direction-aware feature maps, thereby preserving precise spatial location information. This decomposition enables the network to capture the position of objects in the image, and even when visual cues are extremely weak, it guides the model to focus on regions where targets are likely to appear. In nighttime traffic scenes, vehicles and pedestrians often appear in specific spatial areas (e.g., the horizontal band of the road surface, the sides of the road). The row-wise and column-wise attention of CA can lock onto these areas, reducing noise interference from road textures and background regions, thus improving detection performance under low contrast. Heatmaps before and after introducing the CA module are shown in Figure 7 for three representative challenging scenes. In all cases, the CA mechanism consistently increases the heat intensity in target-relevant regions and covers a larger spatial area, strengthening the model’s capability to aggregate spatial context cues.
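The direction-aware pooling step described above can be sketched on a single-channel feature map. This is a deliberately simplified illustration: the full CA module passes the pooled descriptors through shared 1×1 convolutions, normalization, and a nonlinearity before the sigmoid, all of which are omitted here. The function names are hypothetical.

```python
import math


def directional_pooling(fmap):
    """Average a 2D map (list of rows) along each axis separately.

    Returns one value per row (pooled along the width) and one value
    per column (pooled along the height).
    """
    h, w = len(fmap), len(fmap[0])
    row_means = [sum(row) / w for row in fmap]
    col_means = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    return row_means, col_means


def coordinate_gate(fmap):
    """Reweight each position by a row gate times a column gate.

    Simplification: the sigmoid is applied directly to the pooled means,
    without the learned transforms of the real module.
    """
    row_means, col_means = directional_pooling(fmap)
    a_h = [1.0 / (1.0 + math.exp(-v)) for v in row_means]
    a_w = [1.0 / (1.0 + math.exp(-v)) for v in col_means]
    return [
        [fmap[i][j] * a_h[i] * a_w[j] for j in range(len(a_w))]
        for i in range(len(a_h))
    ]
```

Because the gate at position (i, j) is the product of a row term and a column term, a strong response in a particular row band or column band raises the weight of every position in that band.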

Table 5 summarizes the final ablation experiment, reporting the evaluation outcomes for each model’s best validation weights (best.pt) on the test set. The AP curves for each nighttime traffic category of the baseline and proposed models, along with their corresponding mAP_50 values, are depicted in Figure 8. From the results, a clear synergistic effect between BiFPN feature fusion and the CA attention mechanism is observed. Introducing the CA module alone has a negative impact on the model, reducing mAP_50 by 1.0% compared to the baseline. However, introducing the CA module after feature fusion leads to a considerable performance boost, increasing mAP_50 from 51.5% to 54.6%. This indicates that BiFPN’s weighted bidirectional feature fusion provides richer semantic information and spatial features. The enriched feature representation allows the CA module to better distinguish key regions from background noise, promoting the model’s discriminative power and detection precision.

To investigate why the CA module alone degrades performance, we place it after different positions in the YOLOv8n neck (P3, P4, P5, and all three layers). As shown in Table 6, the negative effect is highly position-dependent: CA after P3 or P4 yields mAP_50 close to the baseline (51.4% and 51.5%, respectively), while CA after P5 causes a noticeable drop to 50.5%. Adding CA after all layers also results in a slight drop (51.1%). This suggests that in the original YOLOv8n, the P5 layer already contains highly semantic but spatially coarse features; inserting CA after it introduces redundant attention that disrupts the original feature distribution. Shallower layers (P3, P4) retain richer spatial details, allowing CA to function without harming performance. In our final architecture, however, the P5 layer is fundamentally changed by BiFPN, which fuses multi-scale information from shallower levels (P2–P4). This enriched P5 layer now provides both semantic and spatial cues, making it a suitable location for CA to exploit context. Consequently, in the full model, CA placed after the BiFPN-enhanced P5 layer (BiFPN_P2+CA) achieves a clear performance gain. This analysis confirms that the degradation is not inherent to CA itself, but depends on the feature richness of the layer where it is inserted.

Building upon the BiFPN and CA modules, the addition of the DySample upsampler further increases the mAP_50 by 2.0% while causing a slight decrease in mAP_50–95. Further analysis reveals that this upsampler provides pronounced improvement for particularly challenging categories like motorbike and bike. Taking the motorbike category as an example (performance detailed in Table 7), compared to YOLOv8n, its Precision (P) increases substantially by 35.1%, Recall (R) increases by 6.1%, mAP_50 increases by 20.8%, and mAP_50–95 increases by 12.3%. Compared to the BiFPN_P2+CA model, introducing the DySample module causes a slight accuracy trade-off for some categories, but the overall performance remains superior to the original baseline, achieving the best comprehensive performance. This suggests that DySample’s dynamic upsampling mechanism particularly focuses on the details of small and blurry-edged targets at night, alleviating information loss in challenging samples. It should be noted that the slight drop in mAP_50–95 indicates a minor regression in localization precision at stricter IoU thresholds, which mainly originates from a small loss in precision for a few easy categories, and this minor loss is acceptable from a safety perspective. In contrast, the detection performance of motorbikes and bikes is greatly improved. From a safety-critical viewpoint, a missed detection of a vulnerable road user (e.g., a motorcyclist or cyclist) by an autonomous driving system could lead to a fatal accident, while a slight reduction in localization precision for common objects such as cars is unlikely to directly cause a collision. Therefore, trading a marginal loss in localization precision for a substantial improvement in detecting rare but high-risk categories is a reasonable trade-off for autonomous driving safety. This is especially significant for object detection in complex nighttime traffic scenes and directly contributes to the reliability of collision avoidance systems in real-world driving scenarios.

The point sampling nature of DySample is suitable for nighttime images, which often suffer from low signal-to-noise ratio (SNR) and abrupt illumination changes. DySample resamples a bilinearly interpolated continuous feature map via learned content-aware offsets, thereby avoiding noise amplification in low-SNR regions. We adopt the LP-style variant with a static scope factor because dynamic offsets may become unstable under extreme brightness contrast (e.g., headlights adjacent to shadows), whereas a static factor ensures stable sampling behavior. The offset range factor λ = 0.25 is the theoretical marginal value that prevents sampling overlap [28], which avoids boundary artifacts that are particularly harmful to small, dimly lit objects such as distant pedestrians or occluded vehicles. To verify the effectiveness of the DySample parameters, we conduct a brief sensitivity analysis by varying λ and g. As shown in Table 8, the original configuration (λ = 0.25, g = 4) achieves the best overall performance, outperforming λ = 0.5, g = 4 and λ = 0.25, g = 8. A larger λ introduces background noise, while increasing g to 8 also leads to performance degradation, and both result in a drop in FPS. Moreover, adjusting λ or g cannot recover the slight mAP_50–95 loss. These results confirm that λ = 0.25 and g = 4 strike the optimal trade-off between accuracy and efficiency for nighttime traffic detection.

Relative to the baseline, the YOLOv8n-BCD model achieves substantial overall gains. Precision (P) increases from 56.8% to 66.1%, while Recall (R) improves by 1.8%. The mAP_50 and mAP_50–95 see gains of 5.1% and 3.4%, respectively, indicating stronger generalization capability in nighttime traffic scenes. Meanwhile, the model reduces Params from 3.01 M to 2.79 M and achieves a high frame rate of 208 FPS, which preserves its lightweight and real-time characteristics. The parameter reduction mainly originates from replacing PANet with BiFPN in the neck network, which decreases the neck parameters by approximately 0.24 M through the removal of single-input edge nodes. Incorporating the shallow P2 feature map into the fusion network adds a minor 0.01 M, and the addition of CA and DySample contributes only negligible overhead (0.01 M in total). The proposed model attains a better balance between model complexity and detection precision, which makes it more favorable for deployment on platforms like vehicle-mounted vision sensors.
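The parameter budget in this paragraph can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text (in millions of parameters); the variable names are illustrative.

```python
baseline_params_m = 3.01      # YOLOv8n baseline
delta_bifpn_m = -0.24         # PANet replaced by BiFPN in the neck (approximate)
delta_p2_m = 0.01             # additional shallow P2 feature map
delta_ca_dysample_m = 0.01    # CA and DySample combined

final_params_m = (
    baseline_params_m + delta_bifpn_m + delta_p2_m + delta_ca_dysample_m
)
relative_reduction_pct = (
    100.0 * (baseline_params_m - final_params_m) / baseline_params_m
)

print(f"final: {final_params_m:.2f} M, reduction: {relative_reduction_pct:.1f}%")
```

The components sum to 2.79 M, a net saving of 0.22 M, which is the 7.3% reduction relative to the baseline cited in the comparative experiments.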

4.2. Comparative Experiments

To verify the advantages of YOLOv8n-BCD, a comprehensive comparison is performed against prevailing mainstream object detection algorithms. Other lightweight detection architectures from the YOLO series [13] are systematically evaluated, including YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv8s, YOLOv9-tiny, YOLOv10n, YOLOv11n, and YOLOv12n. However, since the nighttime-specific methods discussed in the related work (YOLO-FA, YOLO-D, and FP-ZeroDCE+YOLOv7) have not made their official code publicly available, and the datasets and annotation categories they use differ from ours, a direct comparison would be unfair. Therefore, these methods are not included in this comparative study. All experiments are performed within a consistent environment using a unified training strategy. The comparison results can be found in Table 9, presenting the performance of each model’s best validation weights (best.pt) on the test subset.

Among all models tested, YOLOv8n-BCD secures the highest mAP_50 at 56.6% and mAP_50–95 at 29.9%. In comparison with the more recent YOLOv11n and YOLOv12n frameworks, it improves mAP_50 by 3.7% and 6.5%, respectively. Although GFLOPs and Params increase slightly, it attains a faster runtime speed of 208 FPS. Compared to YOLOv8s (a larger version in the YOLOv8 framework), our method improves Precision (P) and mAP_50 by 9.2% and 3.6%, respectively. Meanwhile, it significantly reduces both GFLOPs and Params, resulting in a more lightweight and effective network structure. In addition, we observe that YOLOv6n achieves the highest Precision (P) and YOLOv3-tiny attains the fastest detection speed (FPS). However, these advantages come at the expense of lower overall detection accuracy, higher computational cost, and increased parameter count, rendering both models less suitable for deployment on resource-constrained real-world vehicle sensor platforms.

Object detection in nighttime traffic scenes is extremely challenging due to low illumination, strong noise interference, and pronounced scale variations. Traditional lightweight detection frameworks exhibit low detection accuracy and high missed detection rates in such difficult scenarios. By employing efficient feature fusion, a lightweight attention module, and the dynamic upsampler, our method strikes a desirable equilibrium between network complexity and processing efficiency. This leads to higher recall, improved accuracy, and superior generalization performance. Figure 9 provides a qualitative comparison of several well-performing YOLO series models in a real nighttime traffic scene. It is evident that the YOLOv8n-BCD model demonstrates relatively higher average accuracy among the four models. Notably, for a challenging bike instance, YOLOv8n-BCD correctly detects it, whereas YOLOv8n and YOLOv11n completely miss it. Although YOLOv5n also outputs a bounding box labeled “bike”, this detection is a false positive—the target does not actually exist in the scene. Figure 10 presents the detection results of various models under a more challenging rainy night scene with severe road reflections and glare. The proposed model accurately detects two traffic lights despite complex light interference, whereas YOLOv5n and YOLOv8n produce multiple false positive detections of traffic lights. In the darker region on the left side of the image, YOLOv5n, YOLOv8n, and YOLOv11n all output false positive vehicle detections, while only YOLOv8n-BCD, unaffected by the dark area and reflections, yields no false alarms. This comparison validates the effectiveness of the proposed model in complex nighttime scenes: it can not only detect targets missed by other models but also avoid false positives, thereby improving detection performance for rare categories and vulnerable road users, demonstrating better generalization capability.

To further validate the generalization capability of the proposed model, we conduct an evaluation on an additional BDD100K nighttime test set consisting of 21,992 images, from which annotation-incompatible images and the previously selected 3500 images have been removed. We also apply oversampling to the training set of our self-built subset (3500 images) and retrain the model to alleviate class imbalance. Table 10 presents the performance of the baseline YOLOv8n, the original YOLOv8n-BCD (without oversampling), and the retrained YOLOv8n-BCD-OS (with oversampling) on this larger test set. It can be observed that on this broader data distribution, the original YOLOv8n-BCD already achieves a certain improvement over the baseline, increasing mAP_50 from 38.6% to 40.1%. After oversampling, the YOLOv8n-BCD-OS model further improves its performance, reaching an mAP_50 of 41.3%. This indicates that our model maintains stable generalization ability on large-scale real-world nighttime data. The detection performance for the two rare categories, bike and motorbike, on this test set is shown in Table 11. It can be seen that our method effectively improves detection accuracy for both categories, and the model with oversampling achieves further improvement, indicating that the class imbalance issue is effectively mitigated and the model’s detection robustness for vulnerable road users is enhanced.

Experimental results show that the proposed model achieves a significant performance improvement over the baseline. To confirm that this improvement is not due to random fluctuations during training, the baseline YOLOv8n and the proposed YOLOv8n-BCD model are each independently trained three times using three different random seeds (0, 42, and 123) on the self-built nighttime training set. All trained models are evaluated on the larger BDD100K nighttime test set comprising 21,992 images, which has a broader data distribution and a larger sample size, thereby facilitating statistical inference. Table 12 reports the mAP_50 values obtained from the three independent training runs. A paired t-test is conducted on these three pairs of mAP_50 values. The results show that YOLOv8n-BCD achieves a statistically significant improvement over the baseline (mean ΔmAP_50 = 1.97%, t(2) = 8.43, p = 0.014). This confirms that the observed performance gain can be attributed to the effectiveness of the model architecture rather than random chance.
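The paired t-test above can be reproduced in outline with the standard library. The sketch computes the t statistic from paired per-seed scores and, for the three-pair case (two degrees of freedom), uses the closed-form Student t distribution to obtain the two-sided p-value. The function names are hypothetical, and the per-seed values of Table 12 are not reproduced here.

```python
import math
import statistics


def paired_t(baseline, proposed):
    """Return (mean difference, t statistic, degrees of freedom).

    baseline[i] and proposed[i] must come from the same seed. The
    differences must not all be identical, otherwise the standard
    error is zero and t is undefined.
    """
    diffs = [p - b for b, p in zip(baseline, proposed)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std, n - 1
    return mean_diff, mean_diff / std_err, n - 1


def two_sided_p_df2(t):
    """Two-sided p-value of Student's t with exactly 2 degrees of freedom.

    Uses the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)),
    which holds only for df = 2.
    """
    return 1.0 - abs(t) / math.sqrt(t * t + 2.0)
```

Evaluating `two_sided_p_df2(8.43)` gives approximately 0.014, in agreement with the reported p-value for t(2) = 8.43.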

To evaluate the proposed model’s perception capability across varying lighting conditions, we extract and filter the daytime portion of the BDD100K dataset and construct a daytime test set of 22,380 images for assessing detection performance under normal lighting. Table 13 presents the results of the baseline and the proposed model on this daytime test set. It can be observed that YOLOv8n achieves a daytime mAP_50 of 43.0%, while our proposed YOLOv8n-BCD improves mAP_50 by 2.5% over the baseline, reaching 45.5%. Although our model is specifically designed for nighttime environments, the experiments show that it also delivers a notable performance gain in the daytime, indicating good generalization and robustness across different lighting conditions. Notably, all models achieve substantially higher absolute mAP_50 in daytime scenes than at night on comparably sized datasets (e.g., 43.0% on the 22,380-image daytime set vs. 38.6% on the 21,992-image nighttime set for the baseline). This confirms that nighttime perception indeed degrades due to complex lighting, low signal-to-noise ratio, and object blur, making nighttime object detection more challenging. Previous paired t-test results indicate that our model still attains a statistically significant average mAP_50 improvement of 1.97% at night. This demonstrates that even in complex nighttime traffic scenes, where the baseline performance is already low and achieving further improvement is highly difficult, YOLOv8n-BCD can still provide stable and statistically significant performance gains. Such stable gains are practically significant for ensuring the safety of nighttime autonomous driving. Moreover, YOLOv8n-BCD reduces the parameter count by 7.3% compared to the baseline, and it outperforms the parameter-heavier baseline across multiple lighting conditions. This improvement stems from structural modifications of the proposed model rather than from an increase in model capacity, achieving better generalization while retaining a lightweight architecture. These architectural modifications are motivated by nighttime-specific degradation phenomena, yet also yield performance gains under normal lighting, suggesting that solving the more challenging nighttime problem produces more robust feature representations. Integrating both daytime and nighttime evaluation results, YOLOv8n-BCD demonstrates applicability across varying lighting environments, offering a reliable and efficient lightweight foundational module for real-time visual perception in autonomous driving under diverse lighting conditions.

4.3. Failure Analysis

Figure 11 shows the column-normalized confusion matrix of the proposed YOLOv8n-BCD model on the self-built nighttime test set. For the three categories of person, car, and traffic light, the proportions of correctly predicted samples are 50%, 69%, and 51%, respectively, all at or above half, indicating that the model has a certain ability to recognize common nighttime traffic objects. For the two difficult categories, bike and motorbike, only 36% of bikes and 17% of motorbikes are correctly predicted. Missed detections exist in all categories and are particularly severe for bikes and motorbikes. Nighttime images generally suffer from low contrast, uneven illumination, and blurred object edges, so targets blend into the background under low-light conditions, especially when they are distant or partially occluded. The model then cannot extract sufficient discriminative features, which leads to missed detections. Bikes and motorbikes are small and easily occluded; their edge information is highly susceptible to loss under insufficient nighttime lighting, and they are often disturbed by headlight halos, making it hard for the model to distinguish them from background noise.

Some motorbikes are misclassified as cars, which may be because the light spots produced by motorbike tail lamps or headlights at night resemble those of cars, causing the model to incorrectly classify them as cars in the absence of detailed shape information. Moreover, the confusion matrix reveals that a considerable number of persons, cars, and traffic lights are predicted by the model even though these objects do not actually exist (false positives). This is due to the widespread presence of distant small objects, partially occluded targets, and complex lighting variations in nighttime traffic scenes, such as water stains, reflections from streetlights and headlights, and shadows of roadside buildings. These textures and patterns share some similarity with real objects at low resolution, leading to numerous false alarms; adverse conditions such as rain, fog, and glare further exacerbate these misjudgments.

Figure 12 presents three typical failure cases. In the first case, all persons in an extremely dark shadowed area are missed, one bike is also missed, and another bike is mistakenly detected as a car. The illumination in this area is extremely low, and the intensity difference between the targets and the background almost disappears, making it impossible for the model to extract effective edge and texture features. The second case involves a complex rainy night traffic scene. Extensive water accumulation on the road creates mirror-like reflections, raindrops and fog on vehicle windows further interfere with visibility, and strong glare from oncoming headlights and streetlights causes two cars on the left side of the image to be completely missed. In addition, the model falsely detects multiple non-existent cars and persons in the reflective areas and halos, while distant, partially occluded small objects are also not recognized. The third case occurs in snowy weather with accumulated snow on both sides of the road. A distant motorbike is partially occluded by snow and appears with low resolution and blurriness; the model fails to capture its features and thus misses it. Analysis of these failure cases reveals that the model suffers from insufficient feature representation capability under extremely low illumination, particularly poor sensitivity to blurred, low-contrast targets. Moreover, under adverse snowy/rainy weather and strong lighting variations, the model is easily disturbed by reflections, glare, and occlusion, leading to both numerous false positives and missed detections of genuine targets.

In summary, although the proposed model achieves significant improvements over the baseline, further enhancement is still needed. The main bottlenecks in nighttime traffic scenes are summarized as follows. First, feature extraction for targets under extremely low illumination remains insufficient, leading to missed detections and false positives. Second, the model exhibits poor robustness to glare, reflections, and adverse weather conditions such as rain, snow, and fog, which also contribute to false positives. Third, the discriminative ability for partially occluded and distant small objects is still limited. These limitations point toward key directions for future model optimization.

4.4. Practical Deployment Challenges

Despite the good balance achieved by the proposed YOLOv8n-BCD between detection accuracy and model size (2.79 M parameters, 208 FPS inference speed), deploying it on embedded autonomous vehicle platforms still faces trade-offs among memory footprint, real-time latency, and energy consumption. Model quantization is a straightforward and effective technique. Saranya et al. [38] demonstrated that applying INT8 post-training quantization to YOLOv8n on the Jetson Orin Nano reduced inference latency from 164.9 ms to 94.7 ms, with only about 1% loss in mAP, confirming that INT8 quantization is a viable path for edge deployment. Furthermore, structured pruning can directly reduce the number of model parameters and computational cost. In YOLOv8n-BCD, the parameter reduction mainly originates from replacing PANet with BiFPN in the neck network, where the removal of single-input edge nodes decreases the neck parameters by approximately 0.24 M. However, this level of reduction is still insufficient for real-world onboard deployment. Zhou et al. [39] employed the LAMP pruning method, which globally prunes unimportant channels while keeping the detection heads intact. Applying LAMP alone to YOLOv8n compressed the model size from 6.0 MB to 2.4 MB, substantially reducing parameters but causing a slight decrease in mAP₅₀. This shows that LAMP effectively compresses the model at the expense of a minor accuracy loss, a deficiency that can be remedied when combined with other modules. For our nighttime traffic object detector, the combination of INT8 quantization and LAMP-style pruning represents an effective approach to meeting the strict storage and power constraints of onboard vision sensors.
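To make the quantization step concrete, the following is a minimal sketch of symmetric per-tensor INT8 quantization applied to a list of weights. It is a generic illustration under stated assumptions, not the procedure used in [38]: practical post-training quantization toolchains also calibrate activation ranges and often use per-channel scales. The function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [qi * scale for qi in q]
```

Each value is recovered to within half a quantization step (`scale / 2`), which is the source of the small accuracy loss that accompanies the latency gain; for the figures cited above, 164.9 ms to 94.7 ms corresponds to a speedup of about 1.74×.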

Beyond model compression, practical deployment of the perception module also requires seamless integration with downstream vehicle control and planning systems. This module must not only deliver accurate detection results but also provide timely outputs to downstream vehicle control systems within bounded time to enable reliable decision-making. End-to-end latency must remain within an acceptable range. Saranya et al. [38] employed fixed-priority scheduling, CPU core affinity, and WCET analysis to ensure that 98.11% of inferences meet the soft deadline of 150 ms. They also introduced a deadline-miss penalty model, showing that bounded overruns (<30 ms) remain within the tolerance of autonomous vehicle control loops. For nighttime autonomous driving, real-time requirements are particularly critical. Due to the inherent difficulties of night scenes—low contrast, blurred object boundaries, and severe occlusion—the perception module is already prone to missed detections and false positives. If uncontrollable delays are added, the downstream control system will not receive timely and reliable inputs, thereby posing a serious threat to driving safety.

Furthermore, a complete autonomous driving system relies not only on high-precision object detection but also on a clear pathway from detection outputs to executable driving decisions. Object detection is merely the first step in the autonomous driving pipeline; the key to safe autonomous navigation lies in converting 2D bounding boxes into spatial information that can be used for driving logic reasoning. In recent years, several studies have explored different technical routes for integrating detection results into the autonomous driving pipeline. Yu et al. [40] proposed YOLO MDE, which adds an extra depth prediction channel to the output layer of YOLOv4, unifying 2D object detection and monocular depth estimation within a single network architecture and enabling the system to directly output distance information while recognizing objects. This approach equips the autonomous driving system with preliminary spatial perception, allowing it to distinguish nearby obstacles from distant background and thereby providing a basis for braking or obstacle-avoidance decisions. However, distance information alone is insufficient to support complete driving decisions. The perception system must also understand the spatial context surrounding objects, especially the ego vehicle’s drivable area, to determine whether detected objects genuinely lie on the driving path and pose a real threat. The YOLOP model proposed by Wu et al. [41] simultaneously performs traffic object detection, drivable area segmentation, and lane detection in a single unified network, delivering 2D perceptual information that encompasses obstacle positions, safe traversable space, and road structure, thus laying a richer semantic foundation for subsequent path planning and risk assessment.

Although models like YOLOP have greatly enhanced environmental understanding on the 2D plane, they still essentially reason by projecting the world onto the image plane and cannot directly acquire the precise 3D coordinates, dimensions, and orientation of objects. Such 3D information is crucial for accurate obstacle avoidance and path planning in complex traffic scenarios (e.g., at night). To overcome this limitation, the system must map detection results into an ego-centric 3D coordinate system to obtain the complete spatial position, size, and orientation of objects, which typically relies on multi-sensor fusion. In this direction, the C2L3-Fusion framework proposed by Ngo et al. [42] employs the CLOCs mechanism to perform decision-level fusion of 2D detections from YOLOv8 and 3D LiDAR point cloud detections from PointPillars, directly outputting refined 3D bounding boxes. By contrast, Murendeni et al. [43] adopt model-level modification and feature-level fusion. Taking YOLOv4 as the base framework, they extend the network output layers to simultaneously predict object depth, 3D dimensions, and orientation angle, and introduce a multi-task loss function for joint optimization. This turns the original 2D detector into a unified network that reasons directly about 3D spatial information, while feature-level fusion of LiDAR point clouds and RGB images enhances depth estimation. Together, these two approaches illustrate a complete pathway from 2D detection to 3D spatial perception for autonomous driving systems, meeting the core demand of downstream control modules for precise spatial information.
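The geometric core of decision-level fusion, pairing each 2D camera detection with the image-plane projection of a 3D LiDAR detection, can be sketched as follows. This is a deliberate simplification: it uses a fixed IoU threshold and greedy one-to-one matching, whereas CLOCs learns the fusion from candidate pairs. The function names, the threshold, and the example boxes are hypothetical, and the projection of 3D boxes into the image is assumed to have been done already.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou_xyxy(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


def associate(
    dets_2d: Sequence[Box],
    projected_3d: Sequence[Box],
    iou_thresh: float = 0.5,  # hypothetical threshold
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one matching, highest IoU first; returns (i_2d, j_3d, iou)."""
    candidates = []
    for i, a in enumerate(dets_2d):
        for j, b in enumerate(projected_3d):
            score = iou_xyxy(a, b)
            if score >= iou_thresh:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)
    used_2d, used_3d, matches = set(), set(), []
    for score, i, j in candidates:
        if i in used_2d or j in used_3d:
            continue
        used_2d.add(i)
        used_3d.add(j)
        matches.append((i, j, score))
    return matches


camera_boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
lidar_boxes = [(21.0, 21.0, 31.0, 31.0), (0.0, 0.0, 10.0, 10.0)]
print(associate(camera_boxes, lidar_boxes))
```

Matched pairs inherit the 3D position, size, and orientation of the LiDAR detection together with the class evidence of the camera detection; unmatched detections from either sensor are the cases a learned fusion stage is meant to arbitrate.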

Overall, 2D object detection results can be transformed into spatial logic that supports autonomous driving decisions through various technical pathways such as depth estimation, environmental understanding, and 3D spatial mapping. The YOLOv8n-BCD model proposed in this paper, as a lightweight nighttime traffic object detection vision module, can provide high-quality detection outputs for the nighttime autonomous driving system pipeline. Through the future integration of methods such as depth estimation and multi-sensor-based 3D spatial coordinate perception, it could offer a more reliable perception foundation for subsequent spatial reasoning and risk assessment. This establishes a cost-effective and efficient practical pathway for building robust nighttime autonomous driving perception systems under resource-constrained conditions.

5. Conclusions

This paper introduces an improved object detection framework built upon YOLOv8n, tailored for challenging nighttime traffic environments to provide reliable perception for vehicle control and safety systems in automated driving. By integrating three key improvements, namely a bidirectional weighted feature fusion network (BiFPN), a lightweight CA mechanism, and a dynamic upsampling strategy (DySample), the model achieves efficient multi-scale feature aggregation, precise spatial-contextual modeling, and adaptive detail preservation for challenging samples. Evaluations on the self-constructed nighttime subset of BDD100K demonstrate clear performance improvements: the BiFPN structure raises mAP₅₀ by 2.3 percentage points while reducing parameters by 7.6%; the CA mechanism adds a further 0.8 points of mAP₅₀ through spatial-channel interdependency learning; and DySample contributes an additional 2.0 points by mitigating upsampling artifacts. Notably, the framework is markedly more robust for challenging categories (vulnerable road users), with mAP₅₀ improvements of 4.2 points for bikes and 20.8 points for motorbikes. In comparison with state-of-the-art YOLO variants, our model offers a superior balance between accuracy (56.6% mAP₅₀) and efficiency (208 FPS), alongside a compact architecture of merely 2.79 M parameters, thereby facilitating deployment on resource-limited autonomous vehicle platforms.
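The headline figures can be re-derived from the ablation results in Tables 5 and 7. The short sketch below does so, distinguishing absolute gains in percentage points from the relative parameter reduction. The numeric values are copied from those tables; the helper names are hypothetical, and the per-frame time is simply the reciprocal of the reported FPS, not a measured end-to-end latency.

```python
def point_gain(new_pct: float, old_pct: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new_pct - old_pct, 1)


def relative_reduction_pct(new: float, old: float) -> float:
    """Relative reduction in percent, rounded to one decimal."""
    return round(100.0 * (old - new) / old, 1)


def ms_per_frame(fps: float) -> float:
    """Average per-frame time implied by a throughput figure."""
    return 1000.0 / fps


# mAP50 (%) from Table 5: baseline, +BiFPN_P2, +CA, +DySample.
map50 = [51.5, 53.8, 54.6, 56.6]
stage_gains = [point_gain(b, a) for a, b in zip(map50, map50[1:])]

print("stage gains (points):", stage_gains)
print("total gain (points):", point_gain(map50[-1], map50[0]))
print("BiFPN_P2 parameter reduction (%):", relative_reduction_pct(2.78, 3.01))
print("motorbike mAP50 gain (points):", point_gain(45.2, 24.4))  # Table 7
print("per-frame time at 208 FPS (ms):", round(ms_per_frame(208), 2))
```

Reading the gains as percentage points matters for the rare classes: the motorbike result is a 20.8-point absolute increase from a 24.4% baseline, which is a much larger relative change than the number alone suggests.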

Despite these advancements, this study has several limitations: (1) The dataset suffers from class imbalance, with the rare categories (motorbike and bike) containing relatively few instances. Although we applied oversampling to mitigate this imbalance and validated the model on a larger BDD100K nighttime test set, the detection performance measured on the self-built nighttime test set may still be subject to statistical variability. (2) In extreme low-light environments with severe glare or near-zero illumination, sensor noise and insufficient discriminative features still lead to elevated missed-detection rates. (3) The detection capability for distant, extremely small, or heavily occluded targets still needs to be improved. (4) The model has not yet been tested in real-world nighttime driving scenarios on actual autonomous vehicles. Previous research [44] has shown that adverse environmental conditions, such as water splashes on wet roads and hailstones, can introduce spurious textures or high-frequency noise, leading to erroneous detections. This underscores the need to extend the BDD100K nighttime subset with similar real-world degradation scenarios and to explore multi-modal fusion strategies for robust perception under adverse conditions.

Therefore, future work will prioritize three directions: (1) dataset diversification to address long-tailed class distributions and to incorporate diverse illumination and weather conditions (e.g., fog, rain, tunnel transitions); (2) edge deployment optimization through model quantization and hardware-aware pruning for real-time inference on automotive embedded systems, together with evaluation on actual edge hardware platforms (e.g., NVIDIA Jetson Orin or automotive-grade SoCs) to measure inference speed, power consumption, and latency and thereby assess the model’s practical deployment potential; and (3) multi-modal fusion with LiDAR or thermal imaging to compensate for vision sensor limitations in extreme scenarios, along with integration of the proposed detector with depth estimation or 3D coordinate perception methods to further support autonomous driving decision-making. Practical validation via in-vehicle deployment and iterative refinement based on real-world feedback will be critical to meeting the stringent reliability requirements of automated driving systems (ADSs).
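The scale of the class imbalance noted in limitation (1) can be quantified from the training-set annotation counts in Table 1. The sketch below computes each category's share and an illustrative oversampling weight using a square-root repeat-factor heuristic. The paper does not specify its oversampling scheme, so the heuristic, the 5% threshold, and the use of instance shares rather than image shares are assumptions made only for illustration.

```python
import math
from typing import Dict

# Training-set annotation counts from Table 1.
TRAIN_COUNTS: Dict[str, int] = {
    "person": 8967,
    "car": 24597,
    "bike": 650,
    "motorbike": 277,
    "traffic light": 12564,
}


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of all annotations that belongs to each category."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def repeat_factors(shares: Dict[str, float], threshold: float = 0.05) -> Dict[str, float]:
    """Square-root repeat factor: categories rarer than the threshold get weight > 1.

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return {name: max(1.0, math.sqrt(threshold / s)) for name, s in shares.items()}


shares = class_shares(TRAIN_COUNTS)
factors = repeat_factors(shares)
for name in TRAIN_COUNTS:
    print(f"{name:13s} share={shares[name]:.4f} repeat_factor={factors[name]:.2f}")
```

Under these assumptions, only bike and motorbike receive a weight above 1, which matches the two categories singled out as rare; with 277 motorbike annotations in training and 66 in the test set, small absolute changes in matched detections translate into large swings in per-class mAP₅₀.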

Author Contributions

Conceptualization, R.O., L.D. and H.L.; methodology, R.O.; software, R.O.; validation, R.O.; formal analysis, R.O., L.D. and W.C.; investigation, R.O.; resources, L.D., W.C. and H.L.; data curation, R.O.; writing—original draft, R.O.; writing—review and editing, R.O., L.D., W.C. and H.L.; visualization, R.O.; supervision, H.L.; project administration, W.C.; funding acquisition, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Key Research and Development Project of Jiangxi Province (No. 20244BBG73004) and the Fundamental Research Funds for the Central Universities of China (WUT: 2024IVA043 and 104972024KFYd0012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The BDD100K dataset is publicly available. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, Y.; Moreau, J.; Ibanez-Guzman, J. Emergent Visual Sensors for Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2023, 24, 4716–4737.
2. Kumar, M.; Rattan, N.; Mondal, S. Sensor systems for autonomous vehicles: Functionality and reliability challenges in adverse environmental conditions. Measurement 2026, 258, 119215.
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105.
4. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021; Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 2 December 2025).
5. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
6. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586.
7. Thottempudi, P.; Jambek, A.B.B.; Kumar, V.; Acharya, B.; Moreira, F. Resilient object detection for autonomous vehicles: Integrating deep learning and sensor fusion in adverse conditions. Eng. Appl. Artif. Intell. 2025, 151, 110563.
8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
10. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
11. Shehzadi, T.; Hashmi, K.A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Object Detection with Transformers: A Review. Sensors 2025, 25, 6025.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
13. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M.; et al. YOLO advances to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274.
14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
15. Tong, G.; Cai, J.; Huang, L.; Shang, B.; Liu, J.; Chen, M. SSD Based Target Detection Algorithm for the Autonomous Driving Vision System. In Proceedings of the IEEE 2nd International Conference on Information Technology, Big Data, Artificial Intelligence (ICIBA), Chongqing, China, 17–19 December 2021; pp. 213–217.
16. Lamichhane, B.R.; Paudel, B.; Paudel, S.; Srijuntongsiri, G.; Horanont, T. MST: A Modified Sparse Transformer with depth-aware attention for multi-modal camera–LiDAR fusion in autonomous vehicles. Transp. Res. Interdiscip. Perspect. 2025, 34, 101571.
17. Xue, T.; Liu, Z.; Lan, S.; Zhang, Q.; Yang, A.; Li, J. YOLO-FSE: An Improved Target Detection Algorithm for Vehicles in Autonomous Driving. IEEE Internet Things J. 2025, 12, 13922–13933.
18. Wang, L.; Hua, S.; Zhang, C.; Yang, G.; Ren, J.; Li, J. YOLOdrive: A Lightweight Autonomous Driving Single-Stage Target Detection Approach. IEEE Internet Things J. 2024, 11, 36099–36113.
19. Jain, D.; Sathvika, C.; Meti, C.; Mallibhat, K. Low Light Image Enhancement for Autonomous Vehicle Applications. In Proceedings of the 6th International Conference on Emerging Technologies (INCET), Belgaum, India, 23–25 May 2025; pp. 1–6.
20. Zhang, J.; Peng, J.; Kong, X.; Wang, S.; Hu, J. Vehicle spatiotemporal distribution identification in low-light environment based on image enhancement and object detection. Adv. Eng. Inform. 2025, 65, 103165.
21. Das, P.P.; Ganguly, T.; Chaudhuri, R.; Deb, S. YOLO-D: A Domain Adaptive approach towards low light object detection. Procedia Comput. Sci. 2025, 258, 3042–3051.
22. Kang, L.; Lu, Z.; Meng, L.; Gao, Z. YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Expert Syst. Appl. 2024, 237, 121209.
23. Peng, L.; Jiang, L. Nighttime vehicle target detection based on visual features. Appl. Comput. Intell. 2026, 6, 23–37.
24. Rashed, H.; Ramzy, M.; Vaquero, V.; El Sallab, A.; Sistu, G.; Yogamani, S. FuseMODNet: Real-Time Camera and LiDAR Based Moving Object Detection for Robust Low-Light Autonomous Driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 2393–2402.
25. Choi, J.D.; Kim, M.Y. A Sensor Fusion System with Thermal Infrared Camera and LiDAR for Autonomous Vehicles: Its Calibration and Application. In Proceedings of the 12th International Conference on Ubiquitous Future Networks (ICUFN), Jeju Island, Republic of Korea, 17–20 August 2021; pp. 361–365.
26. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10778–10787.
27. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717.
28. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6004–6014.
29. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2633–2642.
30. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586.
31. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 October 2025).
32. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62.
33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
34. Jiao, J.; Tang, Y.M.; Lin, K.Y.; Gao, Y.; Ma, A.J.; Wang, Y.; Zheng, W.S. DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition. IEEE Trans. Multimed. 2023, 25, 8906–8919.
35. Pan, X.; Ge, C.; Lu, R.; Song, S.; Chen, G.; Huang, Z.; Huang, G. On the Integration of Self-Attention and Convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 805–815.
36. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention design in CNN. Expert Syst. Appl. 2024, 236, 121352.
37. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
38. Saranya, M.; Archana, N.; Rishi Koushik, G. Deadline-Adherent Edge AI for Intelligent Vehicles: Real-Time Obstacle and Traffic Light Detection Using Quantized YOLOv8n on Jetson Orin Nano. IET Intell. Transp. Syst. 2026, 20, e70135.
39. Zhou, W.; Wang, J.; Meng, X.; Wang, J.; Song, Y.; Liu, Z. MP-YOLO: Multidimensional feature fusion based layer adaptive pruning YOLO for dense vehicle object detection algorithm. J. Vis. Commun. Image Represent. 2025, 112, 104560.
40. Yu, J.; Choi, H. YOLO MDE: Object Detection with Monocular Depth Estimation. Electronics 2022, 11, 76.
41. Wu, D.; Liao, M.-W.; Zhang, W.-T.; Wang, X.-G.; Bai, X.; Cheng, W.-Q.; Liu, W.-Y. YOLOP: You Only Look Once for Panoptic Driving Perception. Mach. Intell. Res. 2022, 19, 550–562.
42. Ngo, T.B.; Ngo, L.; Phi, A.V.; Nguyen, T.T.H.T.; Nguyen, A.; Brown, J.; Perera, A. C2L3-Fusion: An Integrated 3D Object Detection Method for Autonomous Vehicles. Sensors 2025, 25, 2688.
43. Murendeni, R.; Mwanza, A.; Obagbuwa, I.C. Using a YOLO Deep Learning Algorithm to Improve the Accuracy of 3D Object Detection by Autonomous Vehicles. World Electr. Veh. J. 2025, 16, 9.
44. Wiseman, Y. Real-time monitoring of traffic congestions. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; pp. 501–505.

Figure 1. Overall architecture of the proposed nighttime traffic object detection model. The improvements (highlighted in red) include the BiFPN weighted fusion module, CA mechanism, and DySample upsampler. The bidirectional fusion structure of BiFPN is also illustrated. Here, Conv denotes standard convolution, and C2f is a lightweight CSP bottleneck featuring split-concat fusion in YOLOv8.

Figure 2. Architecture diagrams (adapted from [26]): (a) Original PANet; (b) Improved BiFPN with weighted fusion and shallow feature integration.

Figure 3. Flowchart of the CA mechanism processing (adapted from [27]).

Figure 4. Flowchart of the DySample dynamic upsampler (LP-style variant with static scope factor).

Figure 5. Six example images from the constructed nighttime traffic dataset: (a–f) show various challenging real-world nighttime scenarios.

Figure 6. Training loss curve of the YOLOv8n-BCD model. The horizontal axis denotes the number of training epochs, and the vertical axis denotes the loss value.

Figure 7. Comparison of feature heatmaps before and after introducing the CA mechanism. (a,b): multi-scale object scene; (c,d): dense small-object scene; (e,f): glare scene. In each pair, the left images (a,c,e) are the heatmaps before CA, and the right images (b,d,f) are the heatmaps after CA.

Figure 8. AP curves and mAP₅₀ for each nighttime traffic category, where the light-colored curve in (b) represents the baseline YOLOv8n.

Figure 9. Side-by-side detection visualization under a typical nighttime scene.

Figure 10. Side-by-side detection visualization under a rainy night with severe glare and road reflections.

Figure 11. Column-normalized confusion matrix of YOLOv8n-BCD.

Figure 12. Visual analysis of three typical failure cases (left column: ground truth; right column: model predictions): (a,b) extreme low-light conditions; (c,d) complex rainy night with glare and reflections; and (e,f) snowy weather with target occlusion.

Table 1. Annotation counts for each category in the dataset.

Category	Training Set	Validation Set	Test Set	Total
person	8967	1782	2083	12,832
car	24,597	5245	4935	34,777
bike	650	141	156	947
motorbike	277	66	66	409
traffic light	12,564	2702	2658	17,924
Total	47,055	9936	9898	66,889

Table 2. Hyperparameter settings.

Training Parameter	Value
Learning Rate Scheduler	Cosine Annealing
Optimizer	AdamW
Initial Learning Rate	0.01
Learning Rate Decay Factor	0.001
Weight Decay	0.05
Momentum	0.937
Epochs	600
Early Stopping Patience	200
Number of Workers	2
Image Size	640 × 640
Batch Size	Autobatch (−1)

Table 3. Experimental results for different feature fusion methods.

Model	mAP₅₀	mAP₅₀₋₉₅	GFLOPs	Params	FPS
YOLOv8n	51.5%	26.5%	8.1	3.01 M	303
Slim-neck [32]	50.3%	26.4%	7.3	2.80 M	154
CCFF [33]	53.4%	28.0%	6.6	1.97 M	263
BiFPN [26]	53.1%	28.5%	7.9	2.77 M	263
BiFPN_P2	53.8%	29.3%	8.1	2.78 M	263

Table 4. Experimental results for introducing different attention mechanisms.

Model	mAP₅₀	mAP₅₀₋₉₅	GFLOPs	Params	FPS
YOLOv8n	51.5%	26.5%	8.1	3.01 M	303
+BiFPN_P2	53.8%	29.3%	8.1	2.78 M	263
+MSDA [34]	49.6%	26.1%	8.3	3.05 M	208
+ACmix [35]	52.0%	27.8%	8.3	3.00 M	189
+LSKA (11) [36]	52.7%	27.6%	8.2	2.85 M	238
+EMA [37]	53.8%	29.4%	8.1	2.78 M	238
+CBAM	52.4%	28.2%	8.2	2.85 M	256
+CAA	54.5%	29.3%	8.2	2.92 M	244
+SE	54.7%	29.4%	8.1	2.79 M	244
+CA [27]	54.6%	30.1%	8.1	2.79 M	233
+EMA_3	51.4%	27.5%	8.2	2.78 M	200
+CA_3	53.4%	28.5%	8.1	2.80 M	204

Table 5. Results of ablation experiments.

Model	P	R	mAP₅₀	mAP₅₀₋₉₅	Params	GFLOPs	FPS
YOLOv8n	56.8%	41.7%	51.5%	26.5%	3.01 M	8.1	303
BiFPN_P2	62.0%	41.7%	53.8%	29.3%	2.78 M	8.1	263
CA	55.0%	40.9%	50.5%	26.4%	3.01 M	8.1	286
DySample	60.9%	41.5%	52.8%	28.0%	3.02 M	8.1	270
BiFPN_P2+CA	63.2%	41.6%	54.6%	30.1%	2.79 M	8.1	233
BiFPN_P2+CA+DySample	66.1%	43.5%	56.6%	29.9%	2.79 M	8.1	208

Table 6. Effect of CA placement on detection performance (CA module inserted alone).

Model	P	R	mAP₅₀	mAP₅₀₋₉₅
YOLOv8n	56.8%	41.7%	51.5%	26.5%
CA_P3	56.7%	42.2%	51.4%	27.1%
CA_P4	58.5%	40.9%	51.5%	27.2%
CA_P5	55.0%	40.9%	50.5%	26.4%
CA_all (P3, P4, P5)	59.2%	40.2%	51.1%	26.9%

Table 7. Module contribution for motorbike detection.

Model	P	R	mAP₅₀	mAP₅₀₋₉₅
YOLOv8n	33.3%	13.6%	24.4%	10.7%
BiFPN	40.0%	15.2%	26.8%	17.1%
CA	30.8%	12.1%	21.9%	11.3%
DySample	41.7%	15.2%	29.0%	16.0%
BiFPN+CA	47.4%	13.6%	31.8%	20.9%
BiFPN+CA+DySample	68.4%	19.7%	45.2%	23.0%

Table 8. Sensitivity analysis of DySample hyperparameters (λ and g).

Configuration	P	R	mAP₅₀	mAP₅₀₋₉₅	FPS
λ = 0.5, g = 4	59.5%	41.5%	52.6%	27.8%	118
λ = 0.25, g = 8	63.7%	40.4%	53.6%	29.4%	167
λ = 0.25, g = 4	66.1%	43.5%	56.6%	29.9%	208

Table 9. Results of comparative experiments.

Model	P	R	mAP₅₀	mAP₅₀₋₉₅	GFLOPs	Params	FPS
YOLOv3-tiny	57.7%	27.1%	43.9%	21.7%	18.9	12.13 M	588
YOLOv5n	60.9%	39.9%	52.4%	27.7%	7.1	2.50 M	294
YOLOv6n	68.2%	31.6%	51.1%	28.7%	11.7	4.23 M	370
YOLOv8n	56.8%	41.7%	51.5%	26.5%	8.1	3.01 M	303
YOLOv8s	56.9%	44.5%	53.0%	28.4%	28.4	11.13 M	244
YOLOv9-tiny	56.0%	40.8%	50.6%	26.8%	7.6	1.97 M	143
YOLOv10n	58.6%	37.1%	48.8%	26.1%	6.5	2.27 M	137
YOLOv11n	61.5%	41.2%	52.9%	28.6%	6.3	2.58 M	164
YOLOv12n	59.0%	37.7%	50.1%	26.5%	6.3	2.56 M	135
YOLOv8n-BCD	66.1%	43.5%	56.6%	29.9%	8.1	2.79 M	208

Table 10. Performance comparison on the additional BDD100K nighttime test set (21,992 images).

Model	P	R	mAP₅₀	mAP₅₀₋₉₅	GFLOPs	Params	FPS
YOLOv8n	44.2%	36.4%	38.6%	19.6%	8.1	3.01 M	303
YOLOv8n-BCD	46.1%	37.3%	40.1%	20.6%	8.1	2.79 M	208
YOLOv8n-BCD-OS	46.4%	38.9%	41.3%	21.0%	8.1	2.79 M	208

Table 11. Performance for the two rare categories on the additional BDD100K nighttime test set.

Model	Bike P	Bike R	Bike mAP₅₀	Bike mAP₅₀₋₉₅	Motorbike P	Motorbike R	Motorbike mAP₅₀	Motorbike mAP₅₀₋₉₅
YOLOv8n	21.5%	25.6%	18.8%	8.8%	20.6%	18.6%	16.8%	8.3%
YOLOv8n-BCD	23.1%	27.1%	20.2%	9.8%	24.8%	20.6%	18.2%	9.1%
YOLOv8n-BCD-OS	18.2%	31.7%	21.7%	10.3%	21.6%	30.9%	21.9%	11.4%

Table 12. Three-run mAP₅₀ results on the larger BDD100K test set.

Model	Seed = 0	Seed = 42	Seed = 123
YOLOv8n	38.6%	39.3%	39.1%
YOLOv8n-BCD	40.1%	41.5%	41.3%

Table 13. Performance comparison on the daytime test set.

Model	P	R	mAP₅₀	mAP₅₀₋₉₅	Params
YOLOv8n	53.2%	30.5%	43.0%	23.9%	3.01 M
YOLOv8n-BCD	55.9%	31.6%	45.5%	25.6%	2.79 M

