1. Introduction
Accurate traffic signal detection plays a crucial role not only in ensuring vehicle safety but also in optimizing traffic system performance. According to the World Health Organization, over 50% of urban traffic accidents occur at intersections, with a significant share attributed to signal violations or misinterpretation by drivers [1]. Reliable signal detection in autonomous systems can therefore reduce the likelihood of intersection-related collisions. Furthermore, precise and timely detection of traffic signals supports smoother traffic flow, reduces idling time, and contributes to fuel efficiency and lower emissions, which are key objectives in sustainable and intelligent transportation infrastructure [2,3]. Autonomous driving systems rely on robust traffic light detection for safe navigation, making real-time recognition of signal states a critical task [4]. Early approaches to traffic light detection used color thresholding, template matching, or mapping data, but these classical methods struggled with complex backgrounds and varied lighting [5]. The advent of deep learning, especially convolutional neural networks (CNNs), revolutionized object detection by learning hierarchical features automatically [6]. In particular, the You Only Look Once (YOLO) [3] family of one-stage detectors has enabled real-time detection with high accuracy. The original YOLO unified object classification and localization into a single network, achieving unprecedented speed without a considerable sacrifice in precision [7]. Subsequent YOLO versions (v2–v7) progressively improved detection performance by incorporating advances such as multi-scale feature fusion and improved training techniques. Recently, YOLOv8 pushed the state of the art further with an anchor-free decoupled head, an enhanced PAN-based feature fusion path, and a unified multi-task loss, resulting in superior small-object detection and simpler, faster inference compared with YOLOv7 [8]. Empirical studies confirm YOLOv8’s advantages; for example, Hoang et al. reported that YOLOv8 attained roughly 98–99% mAP on traffic light datasets, outperforming YOLOv5–v7 in both accuracy and real-time performance.
Despite these advances, traffic light detection remains challenging due to small object size, illumination variations, and environmental complexity [9,10]. Traffic lights are often distant and visually ambiguous, making detection difficult, especially under glare, weather effects, or occlusions [11]. Additionally, determining the light’s color state (red, yellow, green) under such challenging conditions adds a further layer of complexity [9]. These factors contribute to missed or incorrect detections, which can be hazardous in autonomous driving systems. Consequently, improving the robustness and reliability of detection remains a major research focus [12]. Moreover, autonomous systems also need to apply sensor fusion methods, such as the Adaptive Fuzzy Kalman Filter, to improve real-time vehicle positioning accuracy and thereby ensure dependable control and navigation [13].
To address these challenges, recent works have enhanced YOLO-based detection along several directions. One direction focuses on model efficiency for deployment in real-time embedded systems. For instance, Wang et al. built a lightweight YOLOv5s-based detector that reduced model size and computational load while maintaining high accuracy, making it suitable for low-power platforms such as the Jetson Nano [14]. Another line of work tackles small-object sensitivity using feature pyramid fusion and attention modules. Chuang et al. integrated a residual bi-fusion FPN with transformer-based attention to improve recall and precision on distant lights [15]. Similarly, Zhou et al. proposed KCS-YOLO, which adds a small-object detection layer and CBAM attention to enhance visibility under fog and low light [16].
Data augmentation is another key strategy, with synthetic data generation and domain adaptation helping to address rare or difficult scenarios. For example, Gakhar et al. introduced a method using Fourier domain adaptation to simulate hostile weather (fog, rain), improving detection robustness [17]. Such techniques complement architectural improvements by expanding model generalization.
In parallel with recent architectural advancements in object detection models such as YOLO, an increasing number of studies have highlighted the emerging role of generative artificial intelligence (AI) in vision-based transportation systems. Generative models, including Generative Adversarial Networks (GANs) and diffusion-based models, have been actively utilized to address key limitations in training data, such as data scarcity, rare event representation, and robustness under challenging visual conditions including glare, fog, and occlusion. Wang et al. [18] and Da et al. [19] presented comprehensive surveys outlining how generative AI is shaping the future of autonomous driving and transportation planning. These works emphasize practical applications such as synthetic training data generation, digital twin construction, and simulation-based infrastructure design.
Several recent studies have also demonstrated the direct benefits of generative AI in traffic signal detection tasks. Huang and Wang [20] developed a lightweight traffic light detector augmented with diffusion-based data generation, achieving over 98% accuracy in adverse lighting conditions. Zhu et al. [21] introduced ODGEN, a domain-specific synthetic data pipeline based on diffusion models, which improved YOLO detector performance by up to 25%. Similarly, Baik and Kim [22] used photorealistic human image synthesis to train YOLOv7 for detecting human traffic controllers, highlighting generative AI’s potential in addressing rare, safety-critical scenarios with limited real-world data.
Several recent studies have focused on improving the efficiency and accuracy of YOLOv8 through architectural modifications, particularly by incorporating Depth-Wise Separable Convolutions (DWSCs). These enhancements aim to reduce computational cost and model size while maintaining or improving detection performance. Zhang et al. [23] proposed a lightweight version of YOLOv8s by replacing standard convolutional layers with DWSCs for tomato detection. Their model achieved a notable mAP of 93.4% with a 27% reduction in model parameters and real-time inference at 138.8 FPS. Similarly, Wu et al. [24] introduced LCA-YOLOv8-Seg for surface crack detection by embedding depth-wise convolutions into a segmentation-based YOLOv8n framework. Their model achieved 0.945 IoU accuracy at 129 FPS with a significant reduction in weight size.
Yue et al. [25] developed LE-YOLO, a YOLOv8n variant that integrates DWSCs into a custom lightweight backbone named LHGNet. Although their model showed a mAP improvement of 5.4% on the VisDrone dataset and a reduction of parameters to 2.1 M, it suffered from reduced inference speed due to added complexity. Another variant, YOLOv8s-Depthwise, proposed by Naftali et al. [26], integrated DWSCs throughout the backbone and partially in the head. It achieved a competitive mAP (0.750) and significantly reduced model size (10.6 MB), while maintaining a respectable inference time (~0.027 s per frame). However, their implementation lacked comprehensive feature fusion strategies and modular integration across all stages.
Although the above works have demonstrated the potential of DWSCs in YOLOv8 architecture, most limit their modifications to the backbone or select layers without fully leveraging synergies with other architectural components such as SPPF, FPN, C2F modules, and dynamic upsampling.
In contrast to prior works, the architecture proposed in this study integrates depth-wise separable convolution blocks at multiple levels, including both the backbone and the head. This approach represents a comprehensive architectural restructuring rather than a simple substitution of standard convolutions. The main contributions of the proposed model can be summarized as follows:
Full-spectrum DWSC Deployment: In contrast to existing models that primarily use DWSCs in the backbone, our architecture extends DWSC integration across the entire feature processing pipeline, from input stages through to multi-scale detection heads.
Enhanced Multi-Scale Feature Fusion: The design includes multiple upsampling and concatenation layers, paired with C2F modules, to capture both shallow and deep semantic features efficiently.
SPPF Integration for Contextual Awareness: A strategic SPPF block enriches spatial features before final detection, contributing to better performance on objects affected by glare or occlusion.
End-to-End Lightweight Optimization: Despite increased modularity, the overall model achieves significant reductions in model size and improved inference speed, while maintaining or improving mAP on standard datasets.
This novel configuration outperforms existing YOLOv8 models by achieving a better balance among accuracy, model compactness, and real-time performance. To the best of our knowledge, this is the first work to holistically integrate DWSCs with YOLOv8’s FPN, SPPF, and C2F across both the backbone and head, demonstrating superior results without compromising detection robustness.
It is important to mention that traffic signal displays appear in various configurations beyond the standard circular indicators. These include directional arrow signals (e.g., green or red turn arrows), which are commonly used to control lane-specific movements and indicate permitted turning actions. While arrow-shaped signals serve distinct purposes in traffic control, their shapes, orientations, and activation states introduce additional complexity for vision-based recognition systems (Yeh, 2020; Yeh, 2021). In contrast, circular red, yellow, and green lights remain the most universally adopted form of vehicle signalization at intersections and are widely standardized across jurisdictions. For this reason, the present study focuses exclusively on circular signal states, which are essential for general vehicle navigation and decision-making in autonomous driving scenarios. Future research can consider extending the detection framework to accommodate arrow and other specialized signals through multi-class classification or shape-aware recognition models.
Although this study focuses on the detection of traffic light color states, future work can consider addressing broader real-world challenges. These include mitigating the effects of sun glare on visual perception [27,28], integrating Unscented Kalman Filters (UKFs) for improved state estimation and multi-sensor fusion [29,30], and incorporating Simultaneous Localization and Mapping (SLAM) frameworks for enhanced localization and navigation in dynamic traffic environments [27,28]. Additionally, recent advancements suggest that generative AI is increasingly being integrated with traditional computer vision methods in intelligent transportation systems. While not implemented in the present work, these techniques offer valuable potential for expanding training diversity, improving detection robustness, and enabling realistic simulation-driven development in future applications.
2. Enhanced YOLOv8 + DWSC for Traffic Light Detection
You Only Look Once (YOLO) is a cutting-edge, one-stage object detection model known for its real-time capability and simple architecture. Unlike conventional two-stage detectors, which first generate region proposals and then classify them, YOLO treats detection as a single regression problem and predicts bounding box coordinates and class probabilities directly from the entire image in a single forward pass through a CNN [7,31]. This single-model approach keeps YOLO both fast and accurate and makes it well suited to real-time applications such as autonomous driving, surveillance, and robotics [32].
The architecture of YOLO divides the input image into a (G × G) grid, where each grid cell is responsible for predicting bounding boxes along with their class confidence scores. The model then outputs a class probability map for efficient localization and identification of objects (Figure 1). The architecture enables the detection of multiple objects within an image frame while still achieving real-time performance. Its combination of speed and accuracy makes YOLO a model of choice for embedded and edge-computing use cases where inference speed is a paramount concern [33].
2.1. YOLOv8 Detection Model Structure
The YOLOv8 object detection model is a single-stage, anchor-free deep learning framework optimized for high-speed and high-accuracy detection tasks. It is composed of three main components: the backbone, the neck, and the detection head. This architecture enables efficient multiscale feature extraction, fusion, and object prediction, making it highly suitable for real-time scenarios such as traffic light signal recognition in autonomous driving applications.
The detection pipeline begins with input preprocessing, where the input image is resized (e.g., to 640 × 640 pixels), normalized to a range between 0 and 1, and optionally augmented using transformations such as mosaic augmentation, HSV adjustments, or image flipping.
The preprocessed image is passed through the backbone, which utilizes C2f modules to extract multiscale features. These modules enhance gradient flow and computational efficiency by enabling cross-stage feature propagation and partial computation reuse [8].
The extracted features are then processed by the neck, which employs a combination of Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) mechanisms to produce aggregated multi-scale feature maps. This fusion strategy allows for strong localization of small and large objects by integrating semantic and spatial information across scales.
The detection head is anchor-free and decoupled into separate branches for classification, bounding box regression, and objectness estimation:
The classification branch predicts object class probabilities.
The regression branch predicts bounding box coordinates.
The objectness branch outputs the confidence score indicating the presence of an object.
At inference time, sigmoid activations are applied to normalize predictions, and Non-Maximum Suppression (NMS) is performed to filter overlapping detections based on confidence and IoU thresholds. The final detections are returned as a set as follows:
$$\mathcal{D} = \{(b_i, c_i, s_i)\}_{i=1}^{N},$$
where each detection $b_i$ is associated with a class label $c_i$ and a confidence score $s_i$.
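To make the inference pipeline concrete, the following is a minimal Python sketch using the Ultralytics YOLO API; the weight file and image path are illustrative placeholders rather than the exact artifacts of this study, and the confidence/IoU thresholds are arbitrary example values.

```python
from ultralytics import YOLO

# Minimal sketch of the YOLOv8 inference pipeline described above.
# "yolov8n.pt" and "intersection.jpg" are illustrative placeholders.
model = YOLO("yolov8n.pt")

# Preprocessing (resize to 640x640, normalization) and NMS are handled internally;
# conf and iou are the confidence and IoU thresholds used to filter detections.
results = model.predict("intersection.jpg", imgsz=640, conf=0.25, iou=0.7)

for box in results[0].boxes:
    xyxy = box.xyxy[0].tolist()   # bounding box coordinates b_i
    cls_id = int(box.cls[0])      # class label c_i
    score = float(box.conf[0])    # confidence score s_i
    print(model.names[cls_id], score, xyxy)
```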
In this study, YOLOv8 serves as the base detection architecture. To enhance its performance under high-glare and illumination-variant conditions, especially in distinguishing yellow traffic lights, Depth-Wise Separable Convolutions (DWSCs) are integrated into the backbone and detection head. This modification improves computational efficiency and feature selectivity without sacrificing detection accuracy [34].
2.2. Proposed Enhanced YOLOv8
In this work, we developed an improved YOLOv8-based architecture specifically for traffic light state detection. The proposed architecture deploys Depth-Wise Separable Convolution (DWSC) blocks to maximize representational power with reduced computational overhead. Unlike prior architectures, DWSC blocks are inserted at multiple levels and in different parts of the model, in the backbone as well as the head; see Table 1 and Figure 2. Additionally, the proposed architecture includes a specialized signal-state classification module that reliably recognizes red, yellow, and green lights.
The detector’s adaptability to multi-scale object changes was enhanced through the combined use of DWSCs within C2F and SPPF modules, which support efficient feature extraction while preserving gradient flow via modular depth and residual links. Overall, the lightweight convolutional architecture, coupled with efficient color classification, offers improved detection accuracy, fewer false positives, and real-time efficiency, rendering it suitable for dynamic and complex driving environments.
Depth-Wise Separable Convolutions (DWSCs) for Enhanced Efficiency
Depth-Wise Separable Convolutions (DWSCs), which consist of depth-wise and point-wise (1 × 1) convolutions, are a promising architectural improvement to convolutional neural networks (CNNs) for real-time systems such as traffic light detection in embedded environments, because they combine low computational complexity and a small parameter count with competitive detection accuracy. DWSCs are therefore well suited to real-time settings where inference speed and model size are constrained, as shown in Figure 3.
In a conventional convolutional layer, each kernel acts on all input channels at once, making the computation of the output feature map expensive. DWSC instead breaks this process into two steps. The depth-wise convolution applies a separate kernel to each input channel and convolves the channels independently. This operation extracts spatial features per channel while maintaining the spatial dimensionality and reducing the computational cost considerably:
$$Y_d[i, j, m] = \sum_{p=1}^{P} \sum_{q=1}^{Q} W_d[p, q, m] \cdot X[i + p - 1,\, j + q - 1,\, m],$$
where
$Y_d$ is the depth-wise convolution output;
$W_d$ is the depth-wise filter.
The depth-wise convolution is followed by a point-wise (1 × 1) convolution that combines the spatially filtered outputs across all input channels. This 1 × 1 convolution computes a linear combination of the channels, enabling the model to fuse information from feature maps that may be spatially disjoint. Although simple in terms of operations, the point-wise stage is essential for merging the per-channel features produced by the depth-wise stage into a unified multi-channel representation. Together, the two stages are well suited to capturing small, color-dependent objects such as traffic lights.
Equation (3) gives the full computation of the Depth-Wise Separable Convolution (DWSC), where the operation is decomposed into two stages: a spatial convolution performed independently for each input channel (depth-wise), followed by a channel-wise linear combination using 1 × 1 convolutions (point-wise):
$$Y[i, j, k] = \sum_{m=1}^{M} W_p[m, k] \cdot \left( \sum_{p=1}^{P} \sum_{q=1}^{Q} W_d[p, q, m] \cdot X[i + p - 1,\, j + q - 1,\, m] \right). \tag{3}$$
In this formulation, $W_d$ denotes the depth-wise filters, which capture spatial features across each input channel independently, while $W_p$ represents the point-wise filters of shape $M \times K$, responsible for projecting the intermediate features into a new space by fusing information across all $M$ input channels to produce $K$ output channels. The final output tensor $Y[i, j, k] \in \mathbb{R}^{H \times W \times K}$ aggregates both local spatial patterns and inter-channel relationships. This architectural separation reduces computational complexity significantly: a traditional convolution with kernel size $P \times Q$, $M$ input channels, and $K$ output channels requires $P \times Q \times M \times K$ multiplications per output location, whereas the DWSC formulation reduces this to $P \times Q \times M + M \times K$, yielding a substantial computational cost reduction and making it highly efficient for real-time applications.
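To illustrate the decomposition, the following PyTorch sketch shows one possible DWSC block; the batch normalization and SiLU activation are assumptions chosen to match common YOLOv8 practice, not necessarily the exact block used in the proposed architecture.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: per-channel spatial filtering followed by 1x1 point-wise fusion."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        # Depth-wise stage: one k x k filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2, groups=in_ch, bias=False)
        # Point-wise stage: 1 x 1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # SiLU is the activation commonly used throughout YOLOv8

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter comparison for M = 64 input channels, K = 128 output channels, 3x3 kernel:
# standard conv: 3*3*64*128 = 73,728 weights; DWSC: 3*3*64 + 64*128 = 8,768 weights (~8.4x fewer).
block = DepthwiseSeparableConv(64, 128)
print(block(torch.randn(1, 64, 80, 80)).shape)  # -> torch.Size([1, 128, 80, 80])
```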
2.3. Model Training
The accuracy and reliability of the proposed real-time traffic light detector depend heavily on the training process. A robust training procedure is therefore employed in this paper to ensure the accuracy, robustness, and generalization of the proposed YOLOv8-DWSC model across diverse traffic scenarios. The training process consists of the following stages:
Initial Phase—Few-Shot Learning Foundation: The available labeled data consisted of a small set of traffic light images from which the model learned initial object properties. Starting from a small set of high-quality data corresponds to a few-shot learning foundation: the model can generalize from the few examples because of its ability to compose a hierarchy of features.
Transfer Learning with Pre-Trained Weights: Transfer learning was used to initialize the model with pre-trained YOLOv8 weights, which were further trained on the public GitHub dataset (2023) [36]. This approach improves training speed and convergence. With transfer learning, the model reuses features already learned from large-scale image datasets, such as shapes and edges, while adapting to traffic light signals. All layers were fine-tuned with lower learning rates to prevent overfitting.
Real-Time Data Augmentation Using Video Libraries: A dynamic, video-based data augmentation strategy was adopted to enhance dataset variability and authenticity. Traffic video sequences were used to capture multiple viewing angles, lighting situations, and motion distortions, with sample frames shown in Figure 4 and Figure 5. This enabled the model to learn driving characteristics from video sequences that reflect real-world situations. Karakan et al. [37] and Kim et al. [12] highlighted the improvements in object detectors when image data sourced from videos are used for training. Video frames provide temporal context and expose the model to variations in object size, motion blur, and occlusion, which helped the model generalize to motion-dominated scenes and improved traffic light detection.
Dataset Partitioning and Management: The whole dataset was organized into three sections:
Training Set: The training set includes manually annotated images as well as augmented frames from videos.
Validation Set: The validation dataset is used for hyperparameter tuning and overfitting detection in training.
Test Set: This is held out for final performance evaluation to check the generalization of the model.
The proposed model leverages few-shot learning, transfer learning, and real-time data augmentation from video sources to effectively address the training challenges posed by complex urban environments with diverse environmental and temporal conditions. Given this training approach, the enhanced detection performance of the YOLOv8-DWSC model is primarily attributed to its robustness in real-world scenarios.
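As an illustration of this training strategy (frame sampling from traffic videos followed by fine-tuning pre-trained YOLOv8 weights with a reduced learning rate and HSV/mosaic augmentation), a hedged sketch is given below; the file names, dataset YAML, and hyperparameter values are hypothetical placeholders, not the exact settings used in this work.

```python
import cv2
from pathlib import Path
from ultralytics import YOLO

def sample_frames(video_path: str, out_dir: str, every_n: int = 30) -> None:
    """Extract every n-th frame from a traffic video for annotation and augmentation."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg", frame)
        idx += 1
    cap.release()

if __name__ == "__main__":
    sample_frames("traffic_clip.mp4", "dataset/images/train")   # hypothetical video file
    model = YOLO("yolov8n.pt")                                   # start from pre-trained weights
    model.train(data="traffic_lights.yaml",                      # hypothetical dataset config
                epochs=80, imgsz=640, lr0=0.001,                 # lower initial LR for fine-tuning
                hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, mosaic=1.0)   # HSV and mosaic augmentation
```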
While standard datasets such as LISA were reviewed, this study prioritized real-time performance using a custom dataset derived from video capture in uncontrolled conditions. However, future work will include benchmarking on the LISA Traffic Light Dataset to enable broader comparability and validation.
2.4. Performance Metrics
To provide a detailed evaluation of the model’s performance in identifying traffic light signals, this subsection first presents key detection accuracy metrics, including precision, recall, F1 score, and mean Average Precision (mAP). These metrics offer critical insights into the model’s ability to accurately and consistently detect relevant objects under varying confidence thresholds. Following this, training loss indicators are discussed to assess the model’s learning behavior and convergence throughout the training process, complementing the accuracy metrics with optimization-based performance insights.
2.4.1. Detection Accuracy Metrics
To comprehensively evaluate the performance of the proposed object detection model, several accuracy metrics were employed. These metrics quantify the model’s ability to correctly identify and classify relevant instances while minimizing false detections. In this subsection, we detail key evaluation indicators including precision, recall, F1 score, and mean Average Precision (mAP), which collectively offer a robust framework for assessing detection reliability, sensitivity, and overall effectiveness across various confidence thresholds and intersection over union (IoU) levels.
Precision quantifies the ratio of correctly identified positive instances to the total number of instances that the model classified as positive. It reflects the model’s ability to avoid false positives and is defined as follows:
$$Precision = \frac{TP}{TP + FP},$$
where TP refers to true positives and FP represents false positives. Precision is widely used in object detection and classification tasks to measure the accuracy of positive predictions [40].
Recall measures the proportion of actual positive instances that are correctly identified by the model, thereby reflecting its sensitivity and completeness. It is expressed as follows:
$$Recall = \frac{TP}{TP + FN},$$
where FN denotes false negatives. A high recall indicates that most of the relevant instances are detected [40,41].
The F1 score serves as the harmonic mean of precision and recall, offering a balanced evaluation of the model’s ability to detect true positives while minimizing both false positives and false negatives. The F1 score is particularly valuable in real-time applications where both high precision and high recall are critical. It is defined as follows:
$$F1 = \frac{2 \cdot P \cdot R}{P + R},$$
where P and R are the precision and recall, respectively. This metric is extensively used in information retrieval and detection systems for its robustness on imbalanced datasets [42].
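For concreteness, the following short Python helper computes precision, recall, and the F1 score directly from raw detection counts; the example counts in the comment are arbitrary illustrative values.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 score from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 90 TP, 10 FP, 10 FN  ->  precision = 0.90, recall = 0.90, F1 = 0.90
print(detection_metrics(90, 10, 10))
```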
To further investigate the model’s performance under varying detection confidence levels, precision–confidence, recall–confidence, and F1–confidence curves were utilized. These performance plots enable an in-depth analysis of the model’s behavior across a range of confidence thresholds, providing insight into its calibration and robustness under conditions of uncertainty [43].
The mean Average Precision (mAP) was employed to summarize detection performance across multiple intersection over union (IoU) thresholds. This metric captures the area under the precision–recall curve and is formulated as follows:
$$AP = \int_{0}^{1} P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{c=1}^{N} AP_c,$$
where P(R) denotes the precision at a specific recall level R, AP_c is the average precision for class c, and N is the number of classes. The mAP is a standard benchmark in object detection challenges such as PASCAL VOC and MS COCO [44,45].
2.4.2. Training Loss Indicators
The model used for object detection in this paper is YOLOv8. During the training phase, the three main losses observed were box loss, classification loss, and distribution focal loss (DFL). Box loss measures how closely the predicted bounding boxes match the ground-truth boxes. Classification loss reflects the accuracy of the model’s predicted class probabilities. DFL refines the predicted box offsets. The loss functions are calculated after every epoch for both the training and validation datasets, reflecting the model’s convergence and generalization.
Box Loss (CIoU): Box loss measures how accurate the predicted bounding boxes are compared with the true bounding boxes. YOLOv8 uses the complete intersection over union (CIoU) loss [1], which is given as follows:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,$$
where $IoU$ is the intersection over union of the predicted and ground-truth boxes, $\rho(b, b^{gt})$ is the Euclidean distance between their centers, $c$ is the diagonal length of the smallest enclosing box that covers both boxes, $v$ measures the consistency of their aspect ratios, and $\alpha$ is a positive trade-off coefficient. This formulation encourages the predicted and true boxes to have improved overlap, spatial proximity, and shape consistency [40].
Classification Loss (Binary Cross-Entropy): YOLOv8 uses a binary cross-entropy (BCE) loss function for classification in a multi-label context [41]. The loss for each predicted class probability $\hat{p}$ and true label $y$ is given as follows:
$$\mathcal{L}_{cls} = -\left[ y \log(\hat{p}) + (1 - y) \log(1 - \hat{p}) \right].$$
This loss penalizes the incorrect classification of object categories and contributes to the model’s ability to distinguish between different classes.
Distribution Focal Loss (DFL): To improve the bounding box regression capabilities, YOLOv8 employs distribution focal loss (DFL) within the generalized focal loss (GFL) framework [42]. Instead of directly regressing the coordinates, DFL represents each coordinate as a discrete probability distribution over a fixed set of bins. The DFL loss is calculated by applying the cross-entropy loss between the predicted distribution and a soft target distribution defined by the two bins closest to the ground-truth value, as follows:
$$\mathcal{L}_{DFL}(S_i, S_{i+1}) = -\left[ (y_{i+1} - y) \log(S_i) + (y - y_i) \log(S_{i+1}) \right],$$
where $S$ is the predicted distribution over the bins and $y$ is the ground-truth offset, lying between the adjacent bin values $y_i$ and $y_{i+1}$.
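As an illustration of this two-bin cross-entropy weighting, the PyTorch sketch below implements the DFL computation for a batch of predicted per-coordinate distributions; the number of bins and tensor shapes are illustrative assumptions rather than the exact YOLOv8 implementation.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """DFL: cross-entropy against the two bins adjacent to a continuous regression target.

    pred_logits: (N, n_bins) raw scores over the discrete bins.
    target:      (N,) continuous ground-truth offsets in [0, n_bins - 1].
    """
    left = target.floor().long()         # index of the bin just below the target
    right = left + 1                     # index of the bin just above the target
    w_left = right.float() - target      # weight grows as the target nears the left bin
    w_right = target - left.float()      # weight grows as the target nears the right bin
    right = right.clamp(max=pred_logits.size(1) - 1)  # guard against the upper boundary
    loss = (F.cross_entropy(pred_logits, left, reduction="none") * w_left
            + F.cross_entropy(pred_logits, right, reduction="none") * w_right)
    return loss.mean()

# Example with 16 bins and a batch of 4 coordinate predictions.
print(distribution_focal_loss(torch.randn(4, 16), torch.tensor([3.2, 7.9, 0.5, 14.0])))
```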
In summary, the training and validation values of the box, classification, and DFL losses are recorded at each epoch. These curves are meaningful signals of the learning process. If the training loss is low and decreasing, the model is converging. A large gap between the training and validation losses indicates that the model is likely overfitting. Conversely, consistently high loss values may indicate underfitting or a lack of learning capacity. Loss curves are therefore used as diagnostic instruments during training.
Notice that the loss metrics here are different from the evaluation metrics, namely mean Average Precision (mAP), precision, and recall. Specifically, the loss metrics show how well the model fits the training dataset, while the evaluation metrics determine the model’s real performance during inference on unseen data. mAP is the main metric used to measure object detection results on standard benchmarks [43].
4. Results and Discussion
The results obtained from our experiments highlight the robustness of the proposed object detection model in accurately recognizing traffic light signals in diverse real-life situations. The precision–confidence curve (Figure 6) presents an analysis of the model’s performance, highlighting its consistently high precision in identifying red, yellow, and green traffic lights as the confidence level increases. This indicates that the model has a strong capacity to identify all signals accurately while minimizing false positives.
Further examination of the precision–recall curve, shown in Figure 7, highlights exceptional performance, with the model achieving high precision for red (0.992), yellow (0.99), and green (0.85) traffic lights. This precision, accompanied by dependable recall, showcases the model’s ability to accurately detect almost all instances of light signals.
The recall–confidence curve provides additional evidence, illustrating how the model maintains high recall while operating at lower confidence thresholds, guaranteeing comprehensive detection of traffic light colors.
The object detection model’s ability to recognize traffic light colors is demonstrated by the recall–confidence curve in Figure 8. The model (red curve) maintains high recall at various confidence levels, remaining near 1.0 even at lower thresholds. Recall decreases as the confidence threshold increases, which is expected because higher thresholds discard some detections in exchange for greater precision. Very few true positives are missed: the model identifies almost all traffic light colors even at lower confidence levels (high recall). Because missing a red light could be extremely dangerous, this high recall is necessary for autonomous navigation and real-time traffic monitoring. The model’s calibration behavior, with high recall in traffic light color detection, demonstrates its dependability for practical applications.
The F1–confidence curve in Figure 9 provides an analysis of the model’s performance in detecting traffic light colors. The red curve indicates that the model maintains a high F1 score across a range of confidence levels, peaking at around 0.8. The F1 score, the harmonic mean of precision and recall, shows that the model balances both metrics effectively. At lower confidence thresholds, the model achieves high recall, capturing most red traffic light instances. As confidence increases, precision improves, maintaining a high F1 score up to a confidence threshold of 0.8, beyond which it begins to decline. This suggests that the model is well calibrated, providing reliable performance in detecting traffic light colors. The high F1 score highlights the model’s suitability for real-time traffic monitoring and autonomous vehicle navigation, ensuring both accurate and comprehensive detection of red traffic signals.
Figure 10 presents the training and validation loss curves for the proposed YOLOv8 model during the traffic light detection task. The graph illustrates three primary components of the validation loss, including classification loss (val/cls_loss), bounding box regression loss (val/box_loss), and distribution focal loss (val/dfl_loss), each tracked across all training epochs.
As observed, all three loss components exhibit a consistent downward trend throughout training, indicating that the model is effectively minimizing error and learning appropriate feature representations over time. The classification loss steadily decreases, showing that the model’s ability to distinguish traffic light color classes improves. Similarly, the box loss exhibits a strong downward trajectory, reflecting improved precision in object localization and bounding box regression. The distribution focal loss also reduces smoothly, further supporting the stability and robustness of the model’s convergence process.
The absence of significant fluctuations or spikes across the loss curves suggests stable training dynamics and a well-behaved optimization trajectory. This trend reinforces that the proposed YOLOv8 architecture adapts effectively to the traffic light dataset, achieving reliable generalization performance without overfitting.
In addition, mAP50–95 also increases steadily across epochs. This improvement signifies the model’s growing precision and generalization capability across varying IoU thresholds. Overall, the consistent decline of validation loss components across training epochs demonstrates the enhanced learning capacity, improved localization accuracy, and classification reliability of the proposed model in the context of traffic light signal detection under real-world conditions.
The performance of the enhanced YOLOv8 model over 80 epochs is illustrated in Figure 11. The results demonstrate a consistent improvement in detection accuracy, precision, and recall throughout the training process, accompanied by a steady decrease in the loss value.
During the initial epochs, the model exhibits rapid learning, reflected in the steep increase in accuracy, precision, and recall. These metrics increase significantly in the early stages and gradually plateau as the model converges. The accuracy curve stabilizes at a high value, indicating consistent correct classification. Similarly, precision improves steadily, showing the model’s ability to reduce false-positive predictions, while recall increases continuously, highlighting enhanced detection of true positives with fewer missed objects.
The model validation performance metrics include the mean Average Precision (mAP) at intersection over union (IoU) thresholds of 0.50 (mAP50) and 0.50–0.95. The increasing recall trend supports a positive improvement in the model’s ability to detect true positives with reduced false-negative occurrences, particularly critical in traffic light recognition for autonomous driving.
The loss curve follows an inverse trend: it starts at a high value and decreases rapidly in the early epochs, eventually stabilizing at a low value. This indicates effective learning and minimal error as training progresses. The smoothness and monotonic nature of all curves, with no irregular oscillations, further confirm the stability and consistency of the optimization process.
Overall, these results suggest that the proposed enhanced YOLOv8 model achieves high generalization capability, reliable spatial detection, and accurate color recognition of traffic lights. Its stable training behavior and performance trends make it well-suited for real-world autonomous driving implementations, where safety-critical object detection is essential.
The modified YOLOv8 architecture results are displayed in Table 3. Training the YOLOv8 architecture to detect the red light involved the use of live videos for approximately 2 h, 45 min, and 10 s. The accuracy rates for traffic lights were 0.99 for red lights, 0.99 for yellow lights, and 0.85 for green lights. On average, the accuracy rate reached 94.3%.
Hence, the F1 score and confidence values incorporated into the performance curves, such as the precision–confidence or F1–confidence curves, are not assigned any specific units. The F1 score measures a model’s accuracy by considering both precision (the accuracy of positive predictions) and recall (the ability to find all positive instances), and is calculated as their harmonic mean, providing a single metric ranging from 0 to 1. Confidence scores likewise range from 0 to 1, with higher values corresponding to greater confidence in the prediction’s correctness. Therefore, neither the F1 score nor the confidence values carry particular units, with both falling within the 0 to 1 range.
The proposed modified YOLOv8 model offers key advantages over other models, including higher accuracy in detecting red lights (0.68 vs. 0.40 for YOLOv8 [33]), the ability to detect additional classes such as yellow lights, and improved generalization across varied conditions. While it slightly underperforms in green light detection, its overall balance and versatility make it a stronger choice for real-world applications, as shown in Table 4.
In comparison with the alternative models in Table 5, the proposed modified YOLOv8 model exhibits a well-rounded performance across precision, recall, and F1 score. The model’s precision score of 0.99 signifies a remarkable ability to predict the correct class accurately. Although it falls slightly behind the YOLOv8 models of [47] and [50] with a recall score of 0.90, its F1 score of 0.90 reflects a commendable balance between precision and recall. Notably, the modified YOLOv8 outperforms the other models in overall detection accuracy, as evidenced by its mAP@0.50 score of 1.00 at a threshold of 0.635. One limitation is its relatively lower recall, which might result in missing some detections that other models capture, potentially due to conditions and timing during image capture. Nevertheless, the high precision and mAP scores demonstrate that the proposed YOLOv8 model excels in accuracy and overall performance compared with existing YOLO models, despite its lower recall.
Furthermore, additional evaluation metrics are provided for a thorough assessment of model performance, in response to the recommendation to analyze performance more deeply rather than focusing exclusively on the precision score. The results in Table 6 indicate that the model has high precision across all classes of traffic light signals, with red and yellow signals being the dominant classes. From the normalized confusion matrix shown in Figure 12, it is observed that the green signal is classified correctly 85% of the time, while the red and yellow signals are consistently classified correctly. There remains some confusion between green signals and the background, indicating a tendency to misclassify due to noise or glare.
In Figure 13, we present the ROC curve and confusion matrix results for the proposed CNN model for traffic light color classification. The ROC curve produces an AUC of 0.99, indicating a highly predictive model that can accurately distinguish traffic light signals from background noise.
The confusion matrix corroborates these results, presenting 135 true negatives, 141 true positives, 10 false positives, and 14 false negatives. The results indicate the model’s high precision and recall values, which are important for guaranteeing the detection reliability in safety-critical scenarios. These results prove the model’s robustness and accuracy when using CNN for real-time color recognition.
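Applying the definitions from Section 2.4.1 to these confusion-matrix counts gives the corresponding precision and recall values (rounded to two decimals):
$$Precision = \frac{TP}{TP + FP} = \frac{141}{141 + 10} \approx 0.93, \qquad Recall = \frac{TP}{TP + FN} = \frac{141}{141 + 14} \approx 0.91.$$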