1. Introduction
Robust, real-time recognition of various types of traffic lights and road signs is a fundamental requirement for an autonomous vehicle navigation system. An accurate understanding of road elements, including traffic lights, regulatory and warning signs, lane-direction indicating marks, and vulnerable road users, directly impacts decision-making, safety assurance, energy efficiency, and compliance with traffic regulations [1,2,3,4,5]. Although recent advances in deep learning have significantly improved object detection performance in structured urban environments, a comprehensive understanding of the traffic scene remains a challenging problem due to environmental variability, diverse signaling conventions, and strict real-time constraints [5,6].
Among traffic control devices, traffic lights play a significant role, as they transmit safety-critical commands that must be interpreted correctly and consistently over time [7]. Beyond the standard static states (red, yellow, and green), real-world deployments often include arrow-based signals and flashing modes that indicate distinct operational meanings, such as permissive turns, cautionary states, or intersection control overrides [8,9]. However, many existing perception systems either neglect these advanced signal types or treat traffic-light recognition as a simplified color classification problem, limiting their applicability in complex urban intersections and mixed traffic scenarios [7,9,10].
Recent vision-based traffic perception systems commonly rely on convolutional neural network (CNN) detectors to identify traffic lights and signs directly from monocular video streams [6,7]. Although such approaches achieve high detection accuracy, they often struggle with temporal instability, illumination changes, partial occlusions, and glare effects [8,10]. Moreover, end-to-end learning-based models often attempt to infer traffic light states directly from single frames, which is inherently unreliable for flashing signals that require temporal context [7,11]. Implementing temporal reasoning using recurrent architectures can mitigate this issue, but often at the cost of increased computational complexity and reduced real-time feasibility.
Despite significant progress in traffic-light detection and classification, several limitations remain. First, most existing approaches operate on frame-level analysis and lack robust temporal reasoning, making them unreliable for interpreting flashing traffic signals that require sequential context. Second, many methods do not maintain consistent object identity across frames, which is essential for accurate temporal state interpretation in dynamic environments. Third, existing unified perception frameworks primarily focus on detection tasks and treat traffic lights as static objects, failing to address fine-grained state recognition, particularly for arrow-based and flashing signals. As a result, there is a lack of a real-time, vision-based framework that integrates multi-class detection, identity-aware tracking, and interpretable temporal reasoning for comprehensive traffic-light state recognition.
To address these limitations, this work develops a unified real-time multi-class traffic signal perception framework, HybridSignalNet, that integrates state-of-the-art object detection, efficient multi-object tracking, and temporal color-based traffic-light state classification. The system uses You Only Look Once, version 11n (YOLOv11n) to detect a comprehensive set of roadway elements, including circular and arrow traffic lights, lane-direction indicators, and road signs. Detected traffic light instances are tracked across consecutive frames using the ByteTrack multi-object tracking framework, enabling the maintenance of a fixed-length temporal buffer for each signal instance. An HSV-based color analysis module is applied to per-frame Regions of Interest (ROI) to infer instantaneous traffic light color states, while rule-based temporal reasoning over the buffered per-frame state observations is used to robustly distinguish between static and flashing traffic light behaviors for both circular and arrow signals.
Unlike conventional hybrid pipelines that primarily focus on component integration, the proposed framework introduces a methodological shift by explicitly decoupling spatial perception from interpretable temporal reasoning, enabling efficient and transparent modeling of complex traffic signal behaviors such as flashing states. The proposed HybridSignalNet further extends beyond traditional traffic-light detection systems by distinguishing between circular and arrow signals and accurately identifying multiple flashing patterns that convey distinct operational semantics. Furthermore, the system incorporates practical robustness mechanisms, such as inner-region cropping, Saturation-Value (SV) thresholding, and morphological filtering, to improve reliability in challenging real-world conditions without requiring additional model retraining or extensive data augmentation. By integrating detection, tracking, and temporal-state inference into a single optimized pipeline, this study provides a practical and deployable framework for comprehensive traffic-environment understanding. The proposed approach is validated on a multi-class dataset comprising traffic lights, road signs, and lane-direction markings, demonstrating strong performance across all classes while maintaining real-time operation. The main contributions of the proposed HybridSignalNet to the field of Intelligent Transportation Systems (ITSs) are as follows:
A unified real-time multi-class perception framework, termed HybridSignalNet, is proposed by integrating YOLOv11n-based object detection, ByteTrack-based multi-object tracking, and HSV-based temporal traffic-light state classification. The framework enables simultaneous multi-class recognition of heterogeneous roadway elements, including traffic lights, regulatory road signs, and lane-direction indicators. In particular, it provides temporally consistent detection and classification of both circular and arrow traffic lights across static and flashing operational states. Unlike conventional systems that focus only on basic static signals, the proposed framework supports comprehensive multi-state and multi-class traffic scene understanding required for real-world autonomous driving. To the best of our knowledge, few existing real-time perception frameworks comprehensively address flashing traffic light states for both circular and arrow-based signals within a unified multi-class pipeline.
The proposed system, HybridSignalNet, introduces a decoupled spatio-temporal perception architecture that separates deep-learning-based spatial detection from interpretable temporal state reasoning. Unlike conventional hybrid pipelines that primarily focus on component integration, this design represents a methodological shift by replacing sequence learning models (e.g., Long Short-Term Memory (LSTM)/ Gated Recurrent Units (GRU)) with a lightweight and deterministic rule-based temporal reasoning mechanism. This design reduces computational and training complexity, supports real-time operation, and improves decision interpretability. In addition, the modular structure facilitates integration with future perception modules and sensor-fusion components, making the framework suitable for practical autonomous vehicle deployment.
An interpretable color-based traffic-light state classification module is introduced using HSV color-space analysis combined with rule-based temporal reasoning. This design improves transparency and reliability compared to end-to-end black-box learning approaches, enabling deterministic interpretation of traffic signal behavior under varying illumination conditions.
The proposed framework achieves high detection and classification performance, attaining an F1-score of 91.3% across all traffic-related object classes and an F1-score of 96.7% for traffic-light state recognition. In addition, the system satisfies strict real-time constraints, achieving a steady-state throughput of 47.73 frames per second (fps) and a decision latency below 10 ms. These results demonstrate the suitability of the framework for deployment in intelligent transportation and autonomous driving systems.
The remainder of this paper is organized as follows:
Section 2 reviews the related work on traffic signal detection and classification and different models and methods used for that purpose;
Section 3 details our research methodology, including dataset construction, model design, experimental setup, and training strategy;
Section 4 presents and discusses the experimental results; and
Section 5 concludes the study with future research directions.
3. Research Methodology
This section presents the methodology used to develop HybridSignalNet. The proposed framework follows a systematic pipeline that begins with dataset collection and preprocessing, proceeds through model training and parameter optimization, and concludes with comprehensive performance evaluation. The system architecture consists of three core modules. First, a YOLOv11n-based detector is employed to detect and classify most traffic-related objects; however, for circular and arrow traffic lights, the model is used solely for detection. Second, a ByteTrack-based tracking module is applied to associate detected traffic lights across consecutive video frames, enabling temporal consistency. Finally, an HSV color-space-based classifier combined with rule-based temporal reasoning determines the final traffic light state, including red, yellow, green, flashing red, flashing yellow, red arrow, yellow arrow, green arrow, and flashing yellow arrow states. The overall system workflow is illustrated in Figure 1.
3.1. Dataset Collection and Curation
A comprehensive dataset was constructed for this study using real-world traffic recordings collected from multiple urban intersections located on the Mississippi State University campus and within the cities of Starkville and Columbus, Mississippi, USA. The recordings capture natural driving environments and include a wide range of traffic scenarios and environmental variations. To complement the real-world data and to better analyze temporal traffic-light behavior, simulated traffic-light videos were also generated using the RoadRunner simulation tool (MathWorks, Natick, MA, USA). The simulator was used to create both circular and arrow traffic-light configurations with controlled flashing patterns. The dataset covers a diverse set of traffic-related object categories, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. All real-world videos were captured using a GoPro HERO8 Black digital camera (GoPro, Inc., San Mateo, CA, USA) under normal driving conditions. The recordings were conducted at different times of the day to reflect realistic traffic scenarios. Each video was recorded at 30 fps, which provides sufficient temporal resolution for accurate object detection, frame-to-frame tracking, and analysis of traffic signal state transitions, including flashing behavior. In addition to the collected recordings, several publicly available datasets were incorporated to improve dataset diversity and generalization capability. Images from the LISA Traffic Sign Dataset and the Mapillary Traffic Sign Dataset were included to expand the diversity of object class appearances, roadway layouts, and environmental contexts. Furthermore, videos from the BDD100K dataset were utilized to evaluate the performance of the proposed system in detecting, tracking, and classifying traffic-light states.
The integration of both custom-collected data and established public datasets creates a heterogeneous and challenging evaluation benchmark, enabling the proposed system to be assessed under diverse real-world conditions rather than within a single controlled environment. Although the dataset incorporates multiple sources to enhance diversity, it is primarily collected from a specific geographic region (Mississippi, USA), and therefore full generalization to different countries with varying traffic signal designs and regulations is not guaranteed. Additionally, while the dataset captures illumination variability, it does not explicitly include adverse weather conditions such as heavy rain, fog, or snow.
3.2. Preprocessing Pipeline
Following data collection, several preprocessing steps were applied to prepare the dataset for model training and evaluation. These steps were designed to enhance data quality, ensure consistency, and improve the reliability of the detection, tracking, and classification processes. The preprocessing pipeline begins with video segmentation, in which the recorded and simulated videos are converted into individual image frames, followed by dataset cleaning to remove low-quality samples. Subsequently, the retained images are annotated to generate ground-truth labels for all classes. Data augmentation techniques are then applied to increase dataset diversity and improve model generalization. Finally, the complete dataset is partitioned into three non-overlapping subsets (training, validation, and testing) using a ratio of 70%:15%:15%, respectively. The following subsections describe the preprocessing steps in detail.
3.2.1. Video Segmentation and Frame Sampling
The recorded and simulated videos were segmented into individual frames to enable frame-level annotation and analysis. To reduce temporal redundancy and ensure scenario diversity, a uniform sampling strategy was employed. Frames were systematically extracted from different temporal segments of each video, rather than from every consecutive frame, thereby producing a dataset containing varied lighting, occlusion, and traffic conditions.
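The uniform sampling strategy can be illustrated with a minimal sketch. The exact sampling interval is not stated in the text, so the function name `uniform_sample_indices` and the choice of taking the middle frame of each equal-length segment are assumptions made only for illustration.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over a video (hypothetical helper)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more samples than there are frames.
    num_samples = min(num_samples, total_frames)
    # Split the video into equal-length segments and take each segment's middle frame.
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]
```

For example, a 10 s clip recorded at 30 fps contains 300 frames; requesting three samples yields one frame from each third of the clip rather than three consecutive, near-identical frames.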
3.2.2. Data Cleaning and Quality Control
A stringent quality filter was applied to the sampled frames. Any frames affected by excessive motion blur, image distortion, or camera instability were discarded, along with any samples that did not meet quality standards. This cleaning step ensured that only high-quality frames suitable for reliable detection, tracking, and classification were retained for subsequent training and evaluation, thereby improving the overall robustness and performance of the proposed system.
3.2.3. Image Annotation
The cleaned frames were annotated using Roboflow (Roboflow, Des Moines, IA, USA), a widely used computer vision annotation platform. Each frame was labeled with bounding boxes corresponding to relevant traffic-related objects. The annotation process covered 13 distinct classes, namely: Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. These annotations provided precise spatial ground truth information required for training and evaluating the YOLOv11n model. To ensure annotation accuracy and consistency, an automated validation script was developed to verify the correspondence between annotated class IDs and their true object categories, as well as the correctness of bounding box localization. Any detected inconsistencies or labeling errors were subsequently reviewed and corrected manually to guarantee high-quality annotations.
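The validation script itself is not described in detail. As a rough sketch of the kind of automated checks such a script might perform, the following function tests a single YOLO-format label line (`class_id x_center y_center width height`, all coordinates normalized to [0, 1]). The function name, the tolerance `EPS`, and the specific checks are assumptions; verifying that a class ID matches the true object category still requires manual review, as the text notes.

```python
NUM_CLASSES = 13  # number of annotated classes stated in the text
EPS = 1e-6        # assumed tolerance for floating-point rounding in label files


def validate_yolo_label_line(line: str, num_classes: int = NUM_CLASSES) -> list[str]:
    """Return a list of problems found in one YOLO label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        class_id = int(parts[0])
        x_c, y_c, w, h = (float(v) for v in parts[1:])
    except ValueError:
        return ["malformed field"]

    problems = []
    if not 0 <= class_id < num_classes:
        problems.append(f"class id {class_id} out of range")
    if w <= 0 or h <= 0:
        problems.append("non-positive box size")
    # The box must lie inside the normalized image [0, 1] x [0, 1].
    if (x_c - w / 2 < -EPS or x_c + w / 2 > 1 + EPS
            or y_c - h / 2 < -EPS or y_c + h / 2 > 1 + EPS):
        problems.append("box extends outside the image")
    return problems
```

Lines that return a non-empty list would be queued for the manual review step described above.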
3.2.4. Data Augmentation
To enhance the diversity and robustness of the dataset, a data augmentation pipeline was implemented. The applied techniques included horizontal and vertical flipping, random brightness and contrast adjustments, small random rotations, slight blurring, and color jittering. During augmentation, bounding-box annotations were automatically updated in YOLO format to ensure spatial consistency between images and labels. This augmentation strategy increased the effective dataset size and improved the model’s robustness to illumination variations and camera viewpoint changes.
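To make the label update concrete, the sketch below shows how a YOLO-format box (normalized `x_center`, `y_center`, `width`, `height`) changes under horizontal and vertical flipping. The function names are hypothetical, and the augmentation library actually used is not stated in the text.

```python
def hflip_yolo_box(box):
    """Mirror a YOLO box (class_id, x_center, y_center, width, height) left-right."""
    class_id, x_c, y_c, w, h = box
    # Only the horizontal center moves; the box size is unchanged.
    return (class_id, 1.0 - x_c, y_c, w, h)


def vflip_yolo_box(box):
    """Mirror a YOLO box top-bottom."""
    class_id, x_c, y_c, w, h = box
    # Only the vertical center moves; the box size is unchanged.
    return (class_id, x_c, 1.0 - y_c, w, h)
```

Note that this sketch leaves the class ID untouched. For direction-dependent classes such as Left Lane and Right Lane, a horizontal flip changes the meaning of the marking, so a practical pipeline would need either to remap those class IDs or to exclude them from flipping; the text does not state how this case was handled.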
3.2.5. Dataset Partitioning
The final dataset consists of 39,000 images, with 3000 images per class, ensuring balanced representation across all classes and supporting stable and reliable model training. The dataset was partitioned into three subsets: a training set (70%, 27,300 images) used for model learning, a validation set (15%, 5850 images) used for hyperparameter tuning, and a test set (15%, 5850 images) used for final unbiased performance evaluation.
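The split sizes follow directly from the ratios: 0.70 × 39,000 = 27,300 and 0.15 × 39,000 = 5850. A minimal sketch of a per-class (stratified) split that reproduces these counts is given below. Whether the split was actually performed per class is an assumption, as are the function name and the fixed random seed.

```python
import random


def split_per_class(items_by_class, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split each class independently into train/val/test lists (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for items in items_by_class.values():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * ratios[0])
        n_val = round(len(shuffled) * ratios[1])
        splits["train"] += shuffled[:n_train]
        splits["val"] += shuffled[n_train:n_train + n_val]
        # Whatever remains after train and val goes to the test set.
        splits["test"] += shuffled[n_train + n_val:]
    return splits
```

With 13 classes of 3000 images each, every class contributes 2100 training, 450 validation, and 450 test images, so class balance is preserved in all three subsets.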
3.3. YOLOv11-Based Object Detection
After preparing the dataset, a YOLOv11n model was trained to detect and classify all traffic-related classes, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. For arrow and circular traffic lights, YOLOv11n is used only for object detection, while traffic-light state classification is handled in subsequent stages. This design decouples spatial detection from temporal state reasoning and prevents the detector from learning flashing behaviors directly from image-level supervision. The internal structure of the employed detector, YOLOv11n, is illustrated in Figure 2 [53].
The YOLOv11 detector follows a single-stage object detection paradigm in which object localization and classification are performed simultaneously. As shown in Figure 2 [53], the architecture consists of three main components: a backbone network responsible for hierarchical feature extraction, a neck module that aggregates multi-scale features through up-sampling and feature concatenation, and a multi-scale detection head that produces bounding-box and class predictions at different spatial resolutions. This multi-scale prediction strategy is particularly important for detecting small and distant objects such as traffic lights in real-world urban driving environments.
Training was conducted using the hyperparameters summarized in Table 1. This detection stage provides accurate and temporally consistent bounding boxes for all traffic-related objects, forming a reliable foundation for downstream tracking and traffic-light state classification modules.
3.4. Traffic-Light Tracking Using ByteTrack
Following object detection, a ByteTrack-based multi-object tracking module is applied to associate traffic-light detections across consecutive video frames. Each detected traffic light is assigned a unique track identity that is maintained over time, ensuring that observations collected from different frames correspond to the same physical signal. Separate tracking streams are maintained for circular and arrow traffic lights to prevent identity mixing between structurally different signal types. This track-level identity preservation provides a stable temporal foundation for subsequent state reasoning. Overall, the ByteTrack-based tracking module provides reliable temporal consistency while remaining computationally efficient, making it well suited for real-time traffic monitoring and downstream signal state analysis, particularly for applications involving temporal reasoning such as flashing state detection.
3.5. HSV-Based Traffic-Light State Extraction
For each tracked traffic light, an ROI is extracted from the detected bounding box. The cropped ROI is then converted from the RGB color-space to the HSV (Hue-Saturation-Value) color space, which provides better separation of chromatic information under varying illumination conditions. Minimum threshold values are applied to the saturation and value channels to suppress dark or low-saturation pixels. Binary masks corresponding to red, yellow, and green colors are generated using predefined HSV ranges, followed by morphological operations to reduce noise and fill small gaps.
The HSV thresholds used for color segmentation were determined through a systematic, data-driven empirical procedure on the validation dataset. A representative subset was manually annotated at the pixel level for red, yellow, and green traffic light states under diverse illumination conditions, including daylight, nighttime, glare, shadows, overcast, and partial occlusion. The distributions of Hue, Saturation, and Value channels were analyzed for each color class, revealing strong separability in the Hue component, with overlap primarily caused by illumination variations affecting SV. Based on this analysis, initial threshold ranges were defined and subsequently refined using a grid search on the validation set to maximize the frame-level classification F1-score while maintaining robustness to illumination changes and sensor noise. In addition, minimum SV thresholds were introduced to suppress low-intensity and desaturated pixels, significantly reducing false positives caused by glare, reflections, and low-light conditions. The final thresholds were fixed based on validation performance and were not tuned on the test set, ensuring fair evaluation and avoiding overfitting. This empirical strategy provides a balance between interpretability, reproducibility, and practical robustness, while allowing explicit control over failure modes under challenging real-world conditions. The dominant color is determined by computing the ratio of active pixels for each color mask relative to the ROI area. This HSV-based color extraction provides frame-level color observations that are forwarded to the temporal reasoning stage.
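The frame-level color decision can be sketched as follows. This is a simplified, dependency-free illustration rather than the implementation used in this work: it relies on Python's standard `colorsys` module instead of an image-processing library, omits the inner-region cropping and morphological filtering steps, and uses placeholder thresholds (`HUE_RANGES`, `MIN_SATURATION`, `MIN_VALUE`, `MIN_ACTIVE_RATIO`), since the tuned values are not listed in the text. Returning "Off" when no color mask is sufficiently active is likewise an assumption.

```python
import colorsys

# Placeholder thresholds: hue in degrees, saturation and value in [0, 1].
HUE_RANGES = {
    "Red": [(0, 10), (340, 360)],  # red wraps around the hue circle
    "Yellow": [(40, 70)],
    "Green": [(90, 180)],
}
MIN_SATURATION = 0.4     # suppress desaturated pixels (glare, reflections)
MIN_VALUE = 0.5          # suppress dark pixels (housing, unlit lamps)
MIN_ACTIVE_RATIO = 0.05  # minimum share of ROI pixels for a color to count


def classify_roi(pixels):
    """Classify an ROI given as (r, g, b) tuples with 0-255 channels."""
    counts = {name: 0 for name in HUE_RANGES}
    total = 0
    for r, g, b in pixels:
        total += 1
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        if s < MIN_SATURATION or v < MIN_VALUE:
            continue  # pixel fails the minimum SV thresholds
        hue_deg = h * 360.0
        for name, ranges in HUE_RANGES.items():
            if any(lo <= hue_deg <= hi for lo, hi in ranges):
                counts[name] += 1
                break
    if total == 0:
        return "Off"
    # Dominant color: the mask with the largest ratio of active pixels to ROI area.
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= MIN_ACTIVE_RATIO else "Off"
```

Hue is expressed in degrees here for readability; image-processing libraries often use a different scale (for example, 0 to 179 for 8-bit images in OpenCV), so thresholds must be converted accordingly.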
3.6. Temporal Buffering for Traffic-Light State Reasoning
To enable robust traffic-light state classification over time, a temporal buffering mechanism is introduced for each tracked traffic light. For every unique track ID generated by the ByteTrack tracking module, a fixed-length sliding buffer is maintained to store recent color observations extracted from consecutive video frames. At each frame, the instantaneous color label obtained from the HSV-based analysis (i.e., Red, Yellow, or Green) is appended to the buffer. This temporal aggregation allows the system to reason over short-term color transitions rather than relying on single-frame decisions, which are often sensitive to noise, reflections, or motion blur. Flashing traffic light states are inferred by analyzing characteristic alternating patterns between “On” states (e.g., Red or Yellow) and the “Off” state within the temporal buffer using a rule-based fuzzy reasoning strategy. Importantly, flashing behaviors are not learned directly by the deep learning detector but are instead inferred through explicit temporal buffering and color-based logic. This approach improves interpretability, robustness, and generalization across varying flashing frequencies and environmental conditions.
It is important to clarify the distinction between frame-level observations and sequence-level interpretation in the proposed framework. While object detection and HSV-based color extraction are performed at the frame level, flashing behavior is not defined at the individual frame level. Instead, flashing traffic light states are inferred at the sequence level using the temporal buffer associated with each tracked object. The temporal reasoning module analyzes patterns of activation and deactivation across consecutive frames to determine whether a signal exhibits flashing behavior. Regarding correctness, flashing classification does not require observing a complete flashing cycle. A prediction is considered correct when sufficient temporal variation is present within the observation window, such as alternation between active and inactive signal states. Because the proposed temporal rules do not rely on strict periodicity, the system remains robust when only partial flashing cycles are observed. If no temporal variation is detected and the signal appears consistently active, the system classifies it as a static signal. This design improves robustness under partial visibility, occlusion, and limited observation windows.
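A minimal sketch of per-track buffering with a transition-counting rule is shown below. It is an illustration under stated assumptions, not the rule set used in this work: the buffer length, the transition threshold, the class name `TemporalStateReasoner`, and the use of an explicit "Off" label for unlit frames are all hypothetical. The set of permitted flashing states follows the states named in the text (flashing red, flashing yellow, and flashing yellow arrow).

```python
from collections import Counter, defaultdict, deque

BUFFER_LEN = 30      # hypothetical: about 1 s of history at 30 fps
MIN_TRANSITIONS = 2  # hypothetical: on/off changes required to call a signal flashing

# (color, is_arrow) pairs that may be reported as flashing, per the states in the text.
FLASHING_ALLOWED = {("Red", False), ("Yellow", False), ("Yellow", True)}


class TemporalStateReasoner:
    """Keeps one sliding buffer of per-frame color labels per track ID."""

    def __init__(self, buffer_len=BUFFER_LEN, min_transitions=MIN_TRANSITIONS):
        self.min_transitions = min_transitions
        self.buffers = defaultdict(lambda: deque(maxlen=buffer_len))

    def update(self, track_id, color, is_arrow=False):
        """Append one label ('Red', 'Yellow', 'Green', or 'Off') and return the state."""
        buf = self.buffers[track_id]
        buf.append(color)

        lit = [c for c in buf if c != "Off"]
        if not lit:
            return "Off"
        # Most frequent lit color in the window.
        dominant = Counter(lit).most_common(1)[0][0]

        # Count switches between lit and unlit frames; no strict periodicity is required.
        on_off = [c != "Off" for c in buf]
        transitions = sum(a != b for a, b in zip(on_off, on_off[1:]))

        flashing = (
            transitions >= self.min_transitions
            and (dominant, is_arrow) in FLASHING_ALLOWED
        )
        label = f"{dominant} Arrow" if is_arrow else dominant
        return f"Flashing {label}" if flashing else label
```

Because the rule counts transitions rather than matching a fixed period, a partially observed flashing cycle is still recognized, whereas a consistently lit signal produces no transitions and is reported as static. With a threshold this low, a single dropped frame could be mistaken for flashing, so a practical system would likely require a minimum duration for each off phase.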
To further illustrate the scope of the dataset, a custom dataset was constructed using 57 real-world traffic videos recorded at multiple urban intersections across the Mississippi State University campus and the cities of Starkville and Columbus, Mississippi, USA. These were complemented by simulated traffic-light videos generated using RoadRunner (MathWorks, Natick, MA, USA), as well as images sourced from publicly available benchmark datasets, including the LISA Traffic Sign Dataset and the Mapillary Traffic Sign Dataset. In addition, videos from the BDD100K dataset were incorporated to further enhance dataset diversity and to evaluate the proposed framework under large-scale real-world driving conditions for traffic-light recognition. The video sequences range from 10 to 60 min in duration, providing extensive temporal coverage across diverse driving scenarios.
From these combined sources, a total of 39,000 images were extracted and annotated, covering 13 traffic-related object classes: Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. To ensure balanced representation and prevent class bias during model training, each class contains exactly 3000 annotated images.
Table 2 summarizes the detailed specifications of the constructed dataset.
3.7. Experimental Environment
All experiments related to dataset preprocessing and the evaluation of the proposed framework were conducted on a high-performance computing cluster running Rocky Linux 9.2. The system was equipped with NVIDIA A100 80GB PCIe GPUs and an Intel Xeon Platinum 8362 processor with 64 CPU Cores and 1.5 TB of system memory.
Table 3 shows the detailed specification of the experimental environment.
3.8. Evaluation Metrics
To evaluate the performance of the proposed framework, standard evaluation metrics were used, including precision, recall, F1-score, mean Average Precision (mAP), Multiple Object Tracking Accuracy (MOTA), and Identity F1-Score (IDF1). Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive by the model, as expressed in Equation (1) [13,54,55].
$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
Recall evaluates the model’s ability to identify all actual positive instances, as shown in Equation (2) [13,54,55].
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
Here, TP indicates true positive, FP indicates false positive, and FN refers to the false negative.
The F1-score combines precision and recall into a single metric by computing their harmonic mean, as shown in Equation (3) [13].
$$\text{F1} = \frac{2 \times P \times R}{P + R} \tag{3}$$
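Here, in Equation (3), P denotes Precision and R denotes Recall. The mAP is defined as the mean of the Average Precision (AP) values computed across all object classes, as defined in Equation (4) [54,56].
$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$
Here, in Equation (4), N denotes the number of object classes and AP_i denotes the Average Precision of class i.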
Multiple Object Tracking Accuracy (MOTA) evaluates the overall tracking performance by jointly accounting for missed detections, false positives, and identity switches across all frames. MOTA is defined in Equation (5) [57,58].
$$\text{MOTA} = 1 - \frac{\sum_{t} \left( FN_t + FP_t + IDSW_t \right)}{\sum_{t} GT_t} \tag{5}$$
Here, in Equation (5), FN denotes the number of false negatives, FP denotes the number of false positives, IDSW denotes the number of identity switches, and GT denotes the number of ground-truth objects at frame t.
The Identity F1-Score (IDF1) measures the accuracy of identity preservation by evaluating how consistently tracked identities correspond to ground-truth identities over time. IDF1 is defined in Equation (6) [55].
$$\text{IDF1} = \frac{2 \, IDTP}{2 \, IDTP + IDFP + IDFN} \tag{6}$$
Here, in Equation (6), IDTP denotes identity true positives, IDFP denotes identity false positives, and IDFN denotes identity false negatives. This metric emphasizes identity continuity rather than pure detection accuracy.
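For reference, the count-based metrics above can be computed as in the following sketch, which uses the standard definitions of precision, recall, F1-score, MOTA, and IDF1. The function names and the zero-division conventions are assumptions made for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


def mota(fn_per_frame, fp_per_frame, idsw_per_frame, gt_per_frame):
    """MOTA from per-frame error counts and per-frame ground-truth counts."""
    errors = sum(fn_per_frame) + sum(fp_per_frame) + sum(idsw_per_frame)
    return 1.0 - errors / sum(gt_per_frame)


def idf1(idtp, idfp, idfn):
    """Identity F1-score from identity-level true/false positives and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

As a consistency check, `f1_score(0.968, 0.951)` evaluates to approximately 0.959, which agrees with the F1-score of 95.9% reported for the Circular Traffic Light class at a precision of 96.8% and a recall of 95.1%.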
4. Results and Discussion
In this section, we discuss the results obtained from the proposed framework, HybridSignalNet, for detecting and classifying traffic-related objects and traffic-light states. The evaluated object classes include Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, and Yield, as well as traffic-light states including Red, Green, Yellow, Flashing Red, Flashing Yellow, Red Arrow, Green Arrow, Yellow Arrow, and Flashing Yellow Arrow. Additionally, this section analyzes the conducted experiments and fine-tuning procedures used to determine the optimal parameters for the proposed framework. Representative qualitative and quantitative results are also presented to demonstrate the effectiveness and efficiency of HybridSignalNet in accurately detecting and classifying the considered traffic-related classes.
4.1. YOLOv11 Model Selection and Performance
YOLOv11 is one of the latest versions of the YOLO family of single-stage real-time object detectors. It is designed to achieve higher detection accuracy, lower latency, and improved computational efficiency compared to earlier YOLO versions [59]. YOLOv11 is available in several model variants, including nano (n), small (s), medium (m), large (l), and extra-large (x), offering different trade-offs between accuracy and computational cost.
In this experiment, three variants of YOLOv11 (nano, medium, and large) were evaluated on the constructed dataset. As expected, increasing model capacity improved detection performance. The large model achieved the highest detection accuracy, with a precision of 96.6%, recall of 95.2%, F1-score of 95.9%, mAP@0.5 of 96.4%, and mAP@0.5:0.95 of 84.7%. However, this performance gain came at a substantially higher computational cost and longer training time. In contrast, the nano model achieved a precision of 92.8%, recall of 90.0%, F1-score of 91.3%, mAP@0.5 of 92.3%, and mAP@0.5:0.95 of 73.5%, while requiring significantly fewer computational resources and less training time. Although its detection accuracy is lower than that of the larger variants, the nano model remains sufficiently accurate for reliable object localization.
The proposed system aims to achieve both high detection accuracy and real-time operation in practical traffic environments. Since the subsequent modules (object tracking and HSV-based temporal reasoning) refine the final traffic-light state estimation, extremely high detection accuracy is not strictly required. Instead, a lightweight detector capable of stable localization with low computational overhead is preferable. Therefore, YOLOv11n was selected as the base detector for the proposed framework. This design allows the system to maintain real-time capability while delegating temporal consistency and signal-state interpretation to the higher-level reasoning module.
Table 4 summarizes the quantitative performance comparison among the evaluated YOLOv11 variants.
The YOLOv11n model is used to detect multiple traffic-related object classes, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. For traffic lights, the detector performs only object localization rather than state classification. Following detection, an HSV color-space-based classification module is applied to determine the states of traffic lights. This module classifies detected circular and arrow traffic lights into their respective states: red, green, yellow, flashing red, flashing yellow, red arrow, green arrow, yellow arrow, and flashing yellow arrow. The detection performance for all object classes is summarized in Table 5.
As shown in Table 5, the detector consistently performs well across most traffic-related objects. The Circular Traffic Light class achieves the best performance, with a precision of 96.8%, recall of 95.1%, F1-score of 95.9%, mAP@0.5 of 96.9%, and mAP@0.5:0.95 of 83.5%. The Arrow Traffic Light class also achieves high accuracy, with a precision of 95.6%, recall of 93.8%, F1-score of 94.7%, mAP@0.5 of 95.8%, and mAP@0.5:0.95 of 80.1%. Other classes, including Stop, Yield, Speed Limit, Roundabout, and lane-marking categories, also demonstrate reliable detection performance, with F1-scores generally above 89% and strong mAP values. Lower performance is observed for Pedestrian and some lane-marking classes due to smaller object size, partial occlusions, and appearance variability in real traffic scenes. Importantly, the two most critical classes for the proposed framework, Arrow Traffic Light and Circular Traffic Light, achieve high localization accuracy. These detected traffic lights serve as input to the subsequent traffic light state classification module, which further classifies them into red, green, yellow, flashing red, flashing yellow, red arrow, green arrow, yellow arrow, and flashing yellow arrow states.
Overall, the detector achieves an average precision of 92.8%, recall of 90.0%, F1-score of 91.3%, mAP@0.5 of 92.3%, and mAP@0.5:0.95 of 73.5%, demonstrating that the selected lightweight detector is sufficiently accurate while maintaining real-time capability.
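The per-class F1-scores quoted above follow from the listed precision and recall as their harmonic mean. The short Python check below is an illustrative sketch rather than part of the original evaluation code; the function name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class precision/recall values quoted in the text.
circular = f1_score(96.8, 95.1)  # Circular Traffic Light
arrow = f1_score(95.6, 93.8)     # Arrow Traffic Light
print(f"Circular: {circular:.1f}%  Arrow: {arrow:.1f}%")
```

Rounded to one decimal place, both values agree with the F1-scores reported for the two traffic-light classes.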
4.2. Temporal Tracking Module: Integrating ByteTrack for State Recognition
The tracking module ensures consistent identification of traffic lights throughout the video sequence. Although the YOLO-based detector employed in this study is effective at localizing traffic lights, it operates independently on each frame and does not retain object identity over time. To address this limitation, a tracking module assigns a persistent track ID to each detected traffic light, allowing the system to follow the same physical signal across frames. This continuity is essential for temporal analysis and, in particular, for accurately recognizing flashing traffic-light states. In this experiment, three tracking models were evaluated: ByteTrack [
60], DeepSORT [
61], and an IoU-based tracker [
62].
Table 6 presents a quantitative comparison of these methods using standard multi-object tracking metrics, including Multiple Object Tracking Accuracy (MOTA) and Identity F1-Score (IDF1).
As shown in
Table 6, ByteTrack achieves the best overall performance on both evaluation metrics, with MOTA and IDF1 values of 90.7% and 88.2%, respectively. DeepSORT ranks second (MOTA of 84.2%, IDF1 of 79.5%), while the IoU-based tracker exhibits the lowest performance (MOTA of 72.5%, IDF1 of 68.9%). Based on these results, ByteTrack was selected as the tracking method for the proposed system due to its superior tracking accuracy and its ability to maintain consistent object identities across frames in dynamic traffic environments.
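To make the notion of a persistent track ID concrete, the following Python sketch shows a minimal greedy IoU-based association step, similar in spirit to the simplest baseline compared above. It is not ByteTrack, which additionally uses Kalman-filter motion prediction and a second association pass over low-confidence detections. The class name `GreedyIoUTracker` and the 0.3 IoU threshold are assumptions chosen for illustration, and unmatched tracks are dropped immediately rather than kept alive for a few frames.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Assigns track IDs by greedily matching boxes with the highest IoU."""

    def __init__(self, iou_threshold=0.3):  # threshold is a hypothetical value
        self.iou_threshold = iou_threshold
        self.tracks = {}        # track_id -> box from the previous frame
        self._ids = count(1)    # generator of fresh track IDs

    def update(self, boxes):
        """Return one track ID per input box, in the same order."""
        # All (IoU, track_id, box_index) candidates, best overlap first.
        pairs = sorted(
            ((iou(tb, b), tid, i)
             for tid, tb in self.tracks.items()
             for i, b in enumerate(boxes)),
            reverse=True,
        )
        assigned, used_tracks = {}, set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break  # remaining candidates overlap too little
            if tid in used_tracks or i in assigned:
                continue
            assigned[i] = tid
            used_tracks.add(tid)
        # Boxes without a match start new tracks.
        for i in range(len(boxes)):
            if i not in assigned:
                assigned[i] = next(self._ids)
        self.tracks = {assigned[i]: boxes[i] for i in range(len(boxes))}
        return [assigned[i] for i in range(len(boxes))]


tracker = GreedyIoUTracker()
print(tracker.update([(100, 50, 120, 90), (300, 40, 320, 80)]))
# Same two lights, slightly shifted and listed in the opposite order.
print(tracker.update([(302, 41, 322, 81), (101, 51, 121, 91)]))
```

In the second call each light keeps the ID it received in the first frame even though the detection order changed, which is the property the temporal buffer of each traffic light depends on.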
4.3. HSV-Based Traffic-Light State Classification
The classification module operates on ROIs corresponding to detected and tracked traffic lights, aiming to determine their final operational state accurately and consistently over time. For each tracked traffic light, the detected bounding box is cropped and converted into HSV color-space, where saturation and brightness thresholds are applied to suppress background noise and illumination artifacts. Color-specific masks are then used to identify red, yellow, and green signal activations at the frame level. To capture temporal behavior, each traffic light maintains a short history buffer of recent color observations, which enables the system to distinguish between static and flashing signals.
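The frame-level color decision described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the hue ranges, the saturation and value thresholds (`S_MIN`, `V_MIN`), the minimum active-pixel fraction, and the function name `classify_frame_color` are all hypothetical values chosen for the example.

```python
import colorsys

# Hypothetical hue ranges in degrees; red wraps around 0/360.
HUE_RANGES_DEG = {
    "red": [(0, 15), (340, 360)],
    "yellow": [(20, 65)],
    "green": [(80, 180)],
}
S_MIN, V_MIN = 0.4, 0.5   # assumed saturation/value thresholds
MIN_FRACTION = 0.05       # assumed minimum share of active pixels


def classify_frame_color(pixels):
    """Return 'red', 'yellow', 'green', or 'off' for a list of RGB pixels."""
    if not pixels:
        return "off"
    counts = {name: 0 for name in HUE_RANGES_DEG}
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s < S_MIN or v < V_MIN:
            continue  # dim or washed-out pixel: treat as background
        hue = h * 360
        for name, ranges in HUE_RANGES_DEG.items():
            if any(lo <= hue <= hi for lo, hi in ranges):
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] / len(pixels) >= MIN_FRACTION else "off"


# A crop with a lit red lamp (30% of pixels) on a dark housing.
crop = [(255, 0, 0)] * 30 + [(20, 20, 20)] * 70
print(classify_frame_color(crop))
```

The per-frame label produced this way is what would be appended to the history buffer of the corresponding track.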
Flashing states are inferred using a fuzzy temporal rule-based strategy that analyzes the co-occurrence and alternation of on and off states over time, rather than relying on strict periodicity. This approach allows the system to handle detection noise and partial occlusions while maintaining stable classification. Separate rule sets are applied for circular and arrow traffic lights to reflect their operational differences, and when the current frame is ambiguous, the last valid state is preserved to ensure temporal continuity.
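A simplified version of such a rule can be written in a few lines of Python. The sketch below is an assumption-laden illustration, not the rule set used in the paper: the thresholds (`min_on`, `min_off`, `min_transitions`, `static_on`) and the function name `infer_state` are hypothetical, and the separate rule sets for circular and arrow lights are not modeled. It declares a signal flashing when on and off observations co-occur in the buffer with enough alternations, declares it static when one color dominates, and otherwise keeps the last valid state.

```python
from collections import Counter, deque


def infer_state(buffer, last_state=None, min_on=0.25, min_off=0.25,
                min_transitions=2, static_on=0.8):
    """Infer a static or flashing state from buffered per-frame labels."""
    obs = list(buffer)
    colors = Counter(o for o in obs if o != "off")
    if not colors:
        return last_state  # nothing lit in the buffer: keep last valid state
    color, on_count = colors.most_common(1)[0]
    on_frac = on_count / len(obs)
    off_frac = obs.count("off") / len(obs)
    # Count switches between lit and unlit observations.
    transitions = sum(
        1 for a, b in zip(obs, obs[1:]) if (a == "off") != (b == "off")
    )
    if on_frac >= min_on and off_frac >= min_off and transitions >= min_transitions:
        return f"flashing {color}"
    if on_frac >= static_on:
        return color
    return last_state  # ambiguous evidence: keep last valid state


# 27-frame buffer: 5 frames on, 5 frames off, repeated.
history = deque((["yellow"] * 5 + ["off"] * 5) * 3, maxlen=27)
print(infer_state(history))
# A steady green light with one dropped observation stays static.
print(infer_state(["green"] * 13 + ["off"] + ["green"] * 13))
```

Because the rule counts fractions and transitions instead of checking a fixed period, a single missed observation does not change the decision.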
Table 7 presents a quantitative comparison between three color-based methods, RGB [
63], CIELAB (L*a*b*) [
64,
65,
66], and HSV color-space [
66,
67], in terms of precision, recall, and F1-score.
The RGB-based approach exhibits the weakest performance, achieving an F1-score of 85.3%, primarily due to its high sensitivity to illumination variations, shadows, and brightness changes. Since RGB directly encodes intensity information, it struggles to reliably separate color information under real-world lighting conditions. The CIELAB color-space significantly improves performance, achieving an F1-score of 94.7%. This improvement is attributed to its perceptual uniformity and partial separation of luminance from chromatic components. However, despite this improvement, CIELAB still exhibits moderate sensitivity to illumination changes and color overlap in complex outdoor scenes. The HSV color-space achieves the best overall performance, with a precision of 97.5%, recall of 95.9%, and an F1-score of 96.7%. The superior performance of HSV is due to its explicit separation of color (Hue) from illumination-related components (Saturation and Value), making it more robust to brightness fluctuations, shadows, and sensor noise. Based on these results, HSV was selected as the core color representation for traffic light state extraction in the proposed framework.
4.4. Impact of Morphological Filtering in the HSV-Based Traffic-Light Classification Module
This experiment analyzes the contribution of the morphological filtering stage in the proposed HSV-based traffic-light classification module. The experiment isolates the effect of the morphological operations while keeping all other components unchanged, including the object detector, tracking mechanism, HSV thresholds, inner-ROI cropping, and temporal buffering. After generating the binary color masks in the HSV color-space, the masks may contain fragmented regions and small holes due to illumination variations, reflections, and sensor noise. To refine these masks, a morphological closing operation (dilation followed by erosion) is applied to each color mask.
Table 8 presents the performance of the proposed HybridSignalNet framework in terms of precision, recall, and F1-score with and without morphological filtering.
The results demonstrate that morphological filtering improves all evaluation metrics, increasing the F1-score from 93.8% to 96.7%. The most noticeable improvement appears in recall, which indicates that the system becomes more capable of consistently detecting active signal regions. Therefore, morphological filtering is not merely a post-processing technique but a critical component that enhances the robustness and temporal stability of the proposed traffic-light state classification framework.
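The closing operation evaluated in this experiment can be illustrated on a small binary mask. The sketch below is a pure-Python version with a 3 × 3 square structuring element, which is an assumption; the structuring element actually used is not stated here. Pixels outside the image are ignored in both steps, and the helper names are hypothetical.

```python
def _morph(mask, rule):
    """Apply `rule` (any or all) over each pixel's in-bounds 3x3 neighborhood."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            out[r][c] = 1 if rule(window) else 0
    return out


def dilate(mask):
    return _morph(mask, any)   # pixel is set if any neighbor is set


def erode(mask):
    return _morph(mask, all)   # pixel is set only if all neighbors are set


def close_mask(mask):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask))


# 9x9 mask with a 5x5 lit region that has a one-pixel hole in its center.
mask = [[1 if 2 <= r <= 6 and 2 <= c <= 6 else 0 for c in range(9)]
        for r in range(9)]
mask[4][4] = 0
closed = close_mask(mask)
print(closed[4][4], sum(map(sum, closed)))
```

In this example the hole is filled while the outer extent of the region is unchanged, so the active-pixel count rises from 24 to 25. Small gaps of this kind are what would otherwise lower the measured fraction of lit pixels in a frame.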
4.5. Impact of Temporal Buffer Length on Traffic-Light Classification Robustness and Startup Latency
To robustly recognize traffic-light states, the proposed framework maintains a fixed-length temporal buffer for each tracked traffic light, storing recent frame-level color observations obtained from the HSV-based module. The buffer length determines the amount of temporal evidence available to infer the characteristic ON/OFF temporal alternation of flashing signals. A shorter buffer enables faster initial decisions but may fail to capture a complete flashing cycle under noise, glare, or intermittent detections. In contrast, a longer buffer improves temporal stability by aggregating more observations; however, it increases startup latency because the system must accumulate sufficient history before producing the first reliable state decision.
To justify the selected buffer length, we evaluated the system using buffer sizes of 9, 15, 27, and 45 frames while keeping all other components unchanged. For each configuration, we measured (i) the F1-score of overall traffic-light state classification (including both static and flashing states for circular and arrow traffic lights), and (ii) the startup latency, defined as the time required for the system to produce the first valid state decision. The results are summarized in
Table 9.
The results indicate that increasing the buffer length improves classification robustness, as the system becomes less sensitive to transient HSV noise and occasional missed ON/OFF observations. However, this improvement is accompanied by increased startup latency because more frames must be accumulated before stable temporal reasoning can be established. The selected buffer length of 27 frames provides the most balanced trade-off, achieving near-optimal classification performance while maintaining a sub-second startup delay. This balance is particularly important for real-time driving applications, where both reliability and timely decision-making are required.
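The relationship between buffer length and startup latency can be approximated by the time needed to fill the buffer. The Python sketch below assumes a 30 fps input stream, an assumption that is consistent with the 27-frame buffer and the 900 ms first-decision time reported in Section 4.12; it ignores per-frame processing time, and the function name is hypothetical.

```python
def buffer_fill_time_s(buffer_frames: int, input_fps: float = 30.0) -> float:
    """Time in seconds needed to collect `buffer_frames` observations."""
    return buffer_frames / input_fps


for n in (9, 15, 27, 45):
    print(f"{n:2d} frames -> {buffer_fill_time_s(n):.2f} s")
```

Under this assumption, 27 frames is the largest of the four evaluated buffer sizes whose fill time stays below one second.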
4.6. Impact of Inner-ROI Cropping and Saturation-Value (SV) Thresholding on System Performance
To evaluate the contribution of the proposed robustness mechanisms, a comparative analysis was conducted focusing on inner-ROI cropping and SV thresholding within the HSV-based traffic-light state extraction module. Four configurations of the HybridSignalNet framework were considered: (1) the full model with both inner-ROI cropping and SV thresholding enabled, (2) a variant without inner-ROI cropping, (3) a variant without SV thresholding, and (4) a variant without both mechanisms. All configurations were evaluated on the same test dataset under diverse real-world conditions, including illumination variations, glare, and partial occlusions.
The results are presented in
Table 10. The full model achieves the highest performance, with an F1-score of 96.7%. Removing inner-ROI cropping results in a noticeable performance drop, indicating the importance of suppressing background interference within the detected region. Similarly, removing SV thresholding reduces robustness under low-saturation and low-illumination conditions. The largest performance degradation is observed when both mechanisms are removed, where the F1-score decreases to 93.5%. This confirms that the two components provide complementary benefits, where inner-ROI cropping reduces spatial noise, while SV thresholding mitigates illumination-related artifacts. These findings demonstrate the effectiveness of the proposed design in improving traffic light state classification under challenging real-world conditions.
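Inner-ROI cropping can be pictured as shrinking the detected bounding box toward its center so that the housing edges and background are excluded before color analysis. The Python sketch below is only an illustration of that idea; the `shrink` ratio of 0.15 and the function name `inner_roi` are hypothetical, and the margin used by the authors is not stated in this section.

```python
def inner_roi(box, shrink=0.15):
    """Shrink an (x1, y1, x2, y2) box by `shrink` of its size on every side."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * shrink
    dy = (y2 - y1) * shrink
    return (round(x1 + dx), round(y1 + dy), round(x2 - dx), round(y2 - dy))


# A 40 x 100 pixel detection shrunk by 10% per side.
print(inner_roi((100, 50, 140, 150), shrink=0.1))
```

The cropped region would then be passed to the HSV conversion and SV thresholding steps in place of the full detection box.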
4.7. Traffic-Light State Classification and Error Analysis
To provide a comprehensive evaluation of the proposed HybridSignalNet framework,
Table 11 reports the classification performance across all traffic light states in terms of precision, recall, and F1-score. These results reflect the final state-level classification performance obtained by integrating HSV-based color feature extraction with temporal buffering and rule-based reasoning. The evaluation demonstrates consistently high accuracy across both static and dynamic (flashing) signal states, highlighting the effectiveness and robustness of the proposed hybrid spatio-temporal classification strategy.
For static traffic light states, including Red, Green, and Yellow, the system achieves consistently high performance, with F1-scores ranging from 97.0% to 98.4%. These results indicate reliable discrimination between dominant color states across varying lighting and environmental conditions. For flashing traffic light states, such as Flashing Red, Flashing Yellow, and Flashing Yellow Arrow, the performance is slightly lower but remains strong, with F1-scores between 94.5% and 95.4%. The modest performance reduction is expected due to the inherent temporal nature of flashing signals, where alternating “on” and “off” patterns introduce ambiguity at the frame level. Nevertheless, the temporal buffering mechanism effectively captures these periodic transitions, enabling accurate state inference. Arrow-based traffic light states, including Red Arrow, Yellow Arrow, and Green Arrow, also demonstrate robust performance, achieving F1-scores above 97.0%, confirming that the proposed approach generalizes effectively across different signal geometries. Overall, the system achieves an average precision of 97.5%, recall of 95.9%, and F1-score of 96.7%, demonstrating that the HSV-based classification combined with temporal reasoning provides a reliable solution for traffic-light state recognition, particularly for challenging flashing scenarios.
To further analyze the classification performance across all traffic light states, a confusion matrix is presented in
Figure 3. As shown in
Figure 3, the proposed HybridSignalNet demonstrates consistently high classification performance across all classes, with diagonal values ranging from 0.94 to 0.98, indicating strong class separability and reliable state recognition. The majority of predictions are concentrated along the diagonal, while minor misclassifications are primarily observed between temporally or visually similar states. For example, flashing red is occasionally confused with static red (3%), which is expected due to frame-level ambiguity when the signal is captured during its active phase. Similarly, confusion between yellow and flashing yellow (2–3%) occurs under low-light or reduced visibility conditions, where HSV-based color differentiation becomes less distinctive. In addition, arrow-based traffic lights exhibit slight misclassification (1–2%), primarily due to their smaller spatial footprint and sensitivity to occlusion and motion blur. Partial occlusion by surrounding objects or vehicles can distort the arrow shape, while glare from reflective surfaces may further degrade edge clarity, leading to occasional confusion with non-arrow states. These error patterns are consistent with real-world challenges in traffic-light classification and indicate that most errors arise from challenging environmental conditions rather than fundamental model limitations. Overall, the individual confusion rates remain low (at most 3% for any pair of states), demonstrating the robustness and stability of the proposed system under diverse and challenging real-world conditions.
4.8. Robustness Analysis
To provide a more comprehensive evaluation of system robustness, additional analysis is conducted under practical conditions, including tracking errors, HSV threshold sensitivity, and lighting variations.
Effect of Tracking Errors: The proposed framework relies on ByteTrack to maintain identity consistency across frames. While tracking errors such as ID switches or missed associations may disrupt the temporal buffer, their impact on classification is mitigated by the use of fixed-length temporal aggregation and rule-based reasoning. Since classification decisions are based on buffered observations rather than single-frame predictions, occasional tracking inconsistencies have limited influence on the final state inference. This design effectively acts as a temporal smoothing mechanism. However, severe or persistent tracking failures may degrade classification performance by corrupting the temporal history.
Sensitivity to HSV Thresholds: The HSV thresholds were determined through a systematic validation procedure, as described in
Section 3. However, beyond their selection, it is important to analyze their impact on system robustness. Experimental observations indicate that the proposed framework remains stable under moderate variations in threshold values, as the temporal reasoning mechanism compensates for minor frame-to-frame color fluctuations. In particular, the integration of SV thresholding plays a critical role in suppressing low-intensity noise and improving discrimination under varying illumination conditions. Nevertheless, extreme deviations in threshold selection can lead to inaccurate color segmentation and subsequent misclassification. This behavior highlights the inherent trade-off between sensitivity and robustness in color-based approaches and underscores the importance of careful threshold calibration for reliable deployment.
Impact of Lighting Variations: The dataset encompasses a diverse range of illumination conditions, including daytime variations, shadows, and glare, enabling evaluation under realistic visual environments. To enhance robustness against such variability, the proposed framework incorporates preprocessing techniques such as SV thresholding and morphological filtering. Experimental results presented in the preceding sections indicate that these components contribute to stable performance under moderate lighting variations. However, under extreme conditions—such as very low-light environments or severe glare—classification accuracy may still be affected due to degradation of reliable color information. This limitation highlights the inherent dependency of color-based approaches on illumination quality.
Overall, these observations demonstrate that the proposed framework maintains stable performance under moderate variations and disturbances, while highlighting potential limitations under extreme conditions.
4.9. Performance Comparison with Previous Works
Most existing traffic-light recognition studies primarily focus on static signal states (red, yellow, and green) and provide limited support for dynamic signal states. In particular, the recognition of flashing traffic lights, especially flashing arrow signals, remains largely underexplored in real-time perception systems. In this context, the proposed framework, HybridSignalNet, introduces a unified real-time solution capable of detecting and classifying both static and flashing traffic lights, including circular and arrow signals, within a single integrated framework. This design enables comprehensive traffic-signal understanding at intersections while maintaining real-time performance.
To evaluate its effectiveness, HybridSignalNet is compared against several recent state-of-the-art real-time traffic-light detection models reported in the literature. To ensure a fair comparison, all evaluated detection models were trained using identical experimental conditions. Each model was trained on the same dataset, using the same train/validation/test splits, image resolution (640 × 640), and batch size (16). No model received additional tuning or dataset-specific optimization. All experiments were conducted on the same hardware platform and evaluated using the same metrics. Default hyperparameters provided by the Ultralytics framework were used consistently across all YOLO-based models to avoid bias toward any specific architecture.
Table 12 presents a performance comparison in terms of precision, recall, and F1-score for static traffic-light detection and classification.
As shown in
Table 12, the proposed HybridSignalNet achieves the highest overall performance among all compared methods for static traffic-light detection and classification, attaining a precision of 98.4%, recall of 96.8%, and an F1-score of 97.6%. This performance surpasses that of Wan et al. [
71], which employed YOLOv9 and reported an F1-score of 96.7%, as well as the YOLOv7-based approach by De Guia and Deveraj [
70], which achieved an F1-score of 95.9%.
The method proposed by Zhu and Yian [
69], based on YOLOv5, demonstrated moderate performance with an F1-score of 95.1%, while the lowest performance was observed in the work of Naimi et al. [
68], which relied on a modified SSD architecture and achieved an F1-score of 92.7%. Beyond these quantitative results, it is important to note that the compared methods are primarily designed to handle static traffic-light states, whereas HybridSignalNet additionally supports robust recognition of flashing signals, including both circular and directional arrow flashing lights, without compromising real-time operation. This capability highlights the proposed framework’s substantial advancement over existing approaches and underscores its suitability for deployment in real-world autonomous driving environments.
While
Table 12 evaluates static traffic-light recognition performance, it does not capture the ability of models to handle temporal behaviors such as flashing signals. Therefore,
Table 13 presents a comparison with state-of-the-art temporal models.
As shown in
Table 13, HybridSignalNet demonstrates competitive classification performance compared to recent temporal deep learning approaches such as FlashLightNet, which employs a CNN–LSTM architecture for spatiotemporal modeling. In addition to improved classification accuracy, the proposed framework achieves significantly lower latency while maintaining competitive throughput, owing to its non-sequential, rule-based temporal reasoning mechanism. Unlike LSTM-based models that rely on sequential processing, the proposed approach enables efficient parallel frame processing, making it well-suited for latency-sensitive, safety-critical real-time applications. This design enables robust recognition of both static and flashing traffic-light states, including arrow signals, while maintaining strong performance (F1-score of 96.7%) and real-time efficiency. Furthermore, the proposed framework benefits from explicit object tracking (ByteTrack), which ensures temporal consistency and reduces instability across frames, a limitation commonly observed in frame-based or sequence-based models. These results highlight the effectiveness of the proposed approach in handling complex real-world traffic scenarios and support its suitability for deployment in ITSs and autonomous driving applications.
4.10. Computational Complexity Analysis
To further evaluate the efficiency and real-time capability of the proposed HybridSignalNet, a detailed computational complexity analysis is presented. This analysis includes the number of parameters, floating point operations (FLOPs), and per-module latency breakdown for each component of the system.
As shown in
Table 14, the majority of the computational cost is attributed to the YOLOv11n detection module, while the ByteTrack tracking and HSV-based reasoning modules introduce minimal overhead. Latency measurements were conducted on an NVIDIA A100 GPU using an input resolution of 640 × 640. These results confirm that HybridSignalNet achieves high computational efficiency while maintaining real-time performance. Furthermore, the modular design enables efficient processing without relying on computationally expensive recurrent architectures, making the proposed framework well-suited for deployment in real-world ITSs and autonomous driving applications.
4.11. Performance Samples of HybridSignalNet
This section presents qualitative samples demonstrating the performance of the proposed HybridSignalNet in detecting and recognizing traffic-related objects, with a particular emphasis on traffic-light state recognition, including both static and flashing states. The key novel contribution of this work lies in the accurate classification of all traffic-light states, especially flashing circular and arrow signals, which are rarely addressed in prior work.
Figure 4 and
Figure 5 provide representative examples illustrating the capability of HybridSignalNet to detect and classify both static and flashing traffic lights. Specifically,
Figure 4 demonstrates the system’s ability to recognize static arrow traffic-light states (red, yellow, and green arrows), while
Figure 5 highlights the accurate detection and classification of flashing arrow signals and flashing circular traffic lights under real-world conditions.
Based on the results shown in
Figure 4 and
Figure 5, the proposed model exhibits high performance and robustness in detecting and classifying traffic-light states. Detected traffic lights are highlighted using green bounding boxes, and their corresponding states are correctly identified through clear textual labels displayed above each traffic light. These results confirm the effectiveness of HybridSignalNet in handling both static and temporal (flashing) traffic-light patterns, validating its suitability for real-time intelligent transportation and autonomous-driving applications.
4.12. Real-Time Performance Analysis
The performance of HybridSignalNet is highly competitive. However, it is essential to verify that the system operates under real-time constraints. As shown in
Table 15, the proposed system satisfies real-time performance requirements, where the threshold values are selected based on previous related works. In [
60], real-time performance is evaluated using frames per second (fps) with a 30 fps threshold, and the proposed system in this paper exceeds this requirement, operating at ≥30 fps. In [
72], the system achieves a detection speed of 23.81 fps, which is considered suitable for many real-time applications, including driver-assistance systems. In [
73], the authors define real-time performance as achieving an inference speed faster than 30 fps, and their system achieves 48.8 fps (with RFB-c) or 58.1 fps (without RFB-c). Similarly, in [
74], real-time performance is measured using inference speed in fps, with real-time operation defined as achieving at least 30 fps.
Specifically, the proposed HybridSignalNet produces the first traffic-light state decision within 900 ms, which is below the defined real-time threshold of 1.00 s. This initial delay corresponds to a one-time startup cost required for temporal buffer filling. After the first decision, the system operates in a steady-state mode and requires only 6.77 ms to generate subsequent traffic light state decisions, which is well within the 10 ms latency requirement. Furthermore, the system achieves a sustained end-to-end throughput of 47.73 fps, exceeding the real-time requirement of 30 fps. These results demonstrate that HybridSignalNet is capable of reliable real-time traffic-light detection and classification, even when temporal analysis for flashing signals is required.
Although the proposed HybridSignalNet framework demonstrates strong robustness and high accuracy across multiple traffic scenarios, certain challenging conditions may still affect its performance. Detection accuracy can degrade when traffic lights appear extremely small or distant in the image, particularly under long-range highway conditions, where fine spatial details are limited. Similarly, heavy occlusions caused by trees, large vehicles, or structural elements may interrupt consistent detection and tracking, potentially leading to temporary identity switches or incomplete temporal buffers. In addition, non-standard traffic signal geometries or uncommon regional signal designs may reduce classification stability.
Furthermore, although the system achieves real-time performance, the reported results were obtained using an NVIDIA A100 GPU, which represents a high-end server-grade platform. Therefore, the reported FPS and latency should be interpreted as an upper-bound performance estimate rather than a direct indicator of embedded deployment capability. Nevertheless, the proposed framework is designed to be computationally efficient: the use of YOLOv11n and rule-based temporal reasoning reduces computational complexity, which makes the system a promising candidate for resource-constrained platforms. These limitations highlight important directions for future work. In particular, future research will focus on improving robustness under extreme visual conditions and evaluating the framework on embedded automotive hardware (e.g., NVIDIA Jetson platforms), including detailed analysis of latency, throughput, memory usage, and power efficiency.
Despite the strong performance of the proposed framework across diverse scenarios, certain limitations remain. In particular, variations in traffic signal designs and regulations across different countries may introduce domain shifts that affect generalization. Moreover, the current dataset does not extensively cover adverse weather conditions such as rain, fog, and snow. Future work will focus on cross-dataset evaluation, domain adaptation techniques, and improving robustness under diverse environmental and geographic conditions.
5. Conclusions
This paper presents HybridSignalNet, a unified real-time traffic signal perception framework that integrates deep learning-based object detection with efficient multi-object tracking and temporal color-based reasoning for comprehensive traffic signal understanding. By decoupling spatial detection from temporal state inference, the proposed system effectively addresses limitations of conventional single-frame or purely learning-based approaches, particularly for arrow-based and flashing traffic light signals that require temporal context. The use of YOLOv11n enables accurate detection of a diverse set of traffic-related objects, while ByteTrack ensures stable identity preservation across frames with low computational overhead. The HSV-based color analysis, combined with temporal buffering and rule-based reasoning, provides a transparent and robust mechanism for traffic-light state classification under varying illumination and environmental conditions. Experimental validation on a custom real-world dataset demonstrates strong performance across all detected object classes, as well as reliable recognition of both static and flashing traffic-light states, thereby confirming the effectiveness of the proposed hybrid architecture, HybridSignalNet, for real-time deployment. Importantly, the framework achieves this performance without relying on complex recurrent network architectures or extensive retraining procedures, which supports its practical implementation in ITSs and autonomous driving platforms.
Despite the strong performance of the proposed HybridSignalNet, several limitations remain. In particular, the current framework assumes that the detection and tracking modules provide reliable ROIs for temporal reasoning. In practice, missed detections or bounding box instability may propagate to the temporal module and affect state inference. Although the proposed system mitigates these effects through temporal buffering, track continuity, and last-valid-state preservation, a dedicated quantitative analysis of error propagation from detection and tracking into the temporal reasoning stage was not included in this study. This represents an important direction for future work.
Future work will extend the proposed framework to additional traffic signal types and more complex intersection scenarios, while exploring the integration of multi-sensor inputs such as radar and LiDAR to improve robustness under challenging conditions. To enhance contextual awareness, future versions will incorporate High-Definition (HD) maps and precise vehicle localization, including GNSS/INS fusion and visual SLAM, enabling predictive region-of-interest generation and reducing computational overhead in multi-signal environments. Another direction is to incorporate lane-direction cues to better associate detected traffic lights with the vehicle’s path, reducing false positives without relying on HD maps [
75]. Additionally, lightweight JPEG-based methods will be explored to estimate traffic congestion from images, extending HybridSignalNet’s multi-class roadway perception with low-overhead traffic-density awareness for safer intersection navigation and V2X cooperative routing without additional sensors [
76]. Finally, we will investigate multi-vehicle cooperative perception leveraging V2X communication to overcome occlusions and extend effective sensing range through shared detection and classification information.