HybridSignalNet: A Real-Time Unified Framework for Multi-Class Roadway Perception with Flashing and Arrow Traffic-Light Recognition

Khaled, Laith Bani; Rahman, Mahfuzur; Ebu, Iffat Ara; Ball, John E.

doi:10.3390/electronics15091964

Department of Electrical and Computer Engineering, James Worth Bagley College of Engineering, Mississippi State University, Starkville, MS 39762, USA

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Electronics 2026, 15(9), 1964; https://doi.org/10.3390/electronics15091964

Submission received: 14 March 2026 / Revised: 22 April 2026 / Accepted: 24 April 2026 / Published: 6 May 2026

(This article belongs to the Special Issue Intelligent Transportation Systems: Advances in Object Detection and Traffic Management)


Abstract

Reliable perception of roadway signals is critical for autonomous vehicles operating in complex urban environments, particularly when traffic lights convey safety-critical instructions through flashing and arrow indications that extend beyond conventional red, yellow, and green states. However, most existing vision-based approaches focus primarily on static traffic-light recognition and lack robust mechanisms for interpreting temporal behaviors such as flashing signals. To address this limitation, this paper proposes a unified real-time perception framework, termed HybridSignalNet, for multi-class recognition of traffic lights, road signs, and lane-related roadway elements. The framework combines spatial detection with temporal state reasoning to interpret both steady and flashing signal patterns in video streams. Experimental evaluation demonstrates strong performance across multiple object classes, achieving an average detection F1-score of 91.3%, while traffic-light state classification reaches 96.7%, including reliable identification of flashing and arrow-based signals. The proposed system operates in real-time and provides an interpretable and deployable solution for intelligent transportation systems and autonomous driving applications, particularly at signalized intersections where temporal signal understanding is essential for safe decision-making.

Keywords:

autonomous vehicles; traffic-light recognition; machine learning; ByteTrack; YOLOv11; intersection navigation

1. Introduction

Robust, real-time recognition of various types of traffic lights and road signs is a fundamental requirement for an autonomous vehicle navigation system. An accurate understanding of road elements, including traffic lights, regulatory and warning signs, lane-direction indicating marks, and vulnerable road users, directly impacts decision-making, safety assurance, energy efficiency, and compliance with traffic regulations [1,2,3,4,5]. Although recent advances in deep learning have significantly improved object detection performance in structured urban environments, a comprehensive understanding of the traffic scene remains a challenging problem due to environmental variability, diverse signaling conventions, and strict real-time constraints [5,6].

Among traffic control devices, traffic lights play a significant role, as they transmit safety-critical commands that must be interpreted correctly and consistently over time [7]. Beyond the standard static states (red, yellow, and green), real-world deployments often include arrow-based signals and flashing modes that indicate distinct operational meanings, such as permissive turns, cautionary states, or intersection control overrides [8,9]. However, many existing perception systems either neglect these advanced signal types or treat traffic-light recognition as a simplified color classification problem, limiting their applicability in complex urban intersections and mixed traffic scenarios [7,9,10].

Recent vision-based traffic perception systems commonly rely on convolutional neural network (CNN) detectors to identify traffic lights and signs directly from monocular video streams [6,7]. Although such approaches achieve high detection accuracy, they often struggle with temporal instability, illumination changes, partial occlusions, and glare effects [8,10]. Moreover, end-to-end learning-based models often attempt to infer traffic light states directly from single frames, which is inherently unreliable for flashing signals that require temporal context [7,11]. Implementing temporal reasoning using recurrent architectures can mitigate this issue, but often at the cost of increased computational complexity and reduced real-time feasibility. Despite significant progress in traffic-light detection and classification, several limitations remain. First, most existing approaches operate on frame-level analysis and lack robust temporal reasoning, making them unreliable for interpreting flashing traffic signals that require sequential context. Second, many methods do not maintain consistent object identity across frames, which is essential for accurate temporal state interpretation in dynamic environments. Third, existing unified perception frameworks primarily focus on detection tasks and treat traffic lights as static objects, failing to address fine-grained state recognition, particularly for arrow-based and flashing signals. As a result, there is a lack of a real-time, vision-based framework that integrates multi-class detection, identity-aware tracking, and interpretable temporal reasoning for comprehensive traffic-light state recognition.

To address these limitations, this work develops a unified real-time multi-class traffic signal perception framework, HybridSignalNet, that integrates state-of-the-art object detection, efficient multi-object tracking, and temporal color-based traffic-light state classification. The system uses You Only Look Once, version 11n (YOLOv11n) to detect a comprehensive set of roadway elements, including circular and arrow traffic lights, lane-direction indicators, and road signs. Detected traffic light instances are tracked across consecutive frames using the ByteTrack multi-object tracking framework, enabling the maintenance of a fixed-length temporal buffer for each signal instance. An HSV-based color analysis module is applied to per-frame Regions of Interest (ROI) to infer instantaneous traffic light color states, while rule-based temporal reasoning over the buffered per-frame state observations is used to robustly distinguish between static and flashing traffic light behaviors for both circular and arrow signals.

Unlike conventional hybrid pipelines that primarily focus on component integration, the proposed framework introduces a methodological shift by explicitly decoupling spatial perception from interpretable temporal reasoning, enabling efficient and transparent modeling of complex traffic signal behaviors such as flashing states. HybridSignalNet also goes beyond traditional traffic-light detection systems by distinguishing between circular and arrow signals and accurately identifying multiple flashing patterns that convey distinct operational semantics. Furthermore, the system incorporates practical robustness mechanisms, such as inner-region cropping, Saturation-Value (SV) thresholding, and morphological filtering, to improve reliability in challenging real-world conditions without requiring additional model retraining or extensive data augmentation. By integrating detection, tracking, and temporal-state inference into a single optimized pipeline, this study provides a practical and deployable framework for comprehensive traffic-environment understanding. The proposed approach is validated on a multi-class dataset comprising traffic lights, road signs, and lane-direction markings, demonstrating strong performance across all classes while maintaining real-time operation. The main contributions of HybridSignalNet to the field of Intelligent Transportation Systems (ITSs) are as follows:

A unified real-time multi-class perception framework, termed HybridSignalNet, is proposed by integrating YOLOv11n-based object detection, ByteTrack-based multi-object tracking, and HSV-based temporal traffic-light state classification. The framework enables simultaneous multi-class recognition of heterogeneous roadway elements, including traffic lights, regulatory road signs, and lane-direction indicators. In particular, it provides temporally consistent detection and classification of both circular and arrow traffic lights across static and flashing operational states. Unlike conventional systems that focus only on basic static signals, the proposed framework supports comprehensive multi-state and multi-class traffic scene understanding required for real-world autonomous driving. To the best of our knowledge, few existing real-time perception frameworks comprehensively address flashing traffic light states for both circular and arrow-based signals within a unified multi-class pipeline.
The proposed system, HybridSignalNet, introduces a decoupled spatio-temporal perception architecture that separates deep-learning-based spatial detection from interpretable temporal state reasoning. Unlike conventional hybrid pipelines that primarily focus on component integration, this design represents a methodological shift by replacing sequence learning models (e.g., Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs)) with a lightweight and deterministic rule-based temporal reasoning mechanism. This design reduces computational and training complexity, supports real-time operation, and improves decision interpretability. In addition, the modular structure facilitates integration with future perception modules and sensor-fusion components, making the framework suitable for practical autonomous vehicle deployment.
An interpretable color-based traffic-light state classification module is introduced using HSV color-space analysis combined with rule-based temporal reasoning. This design improves transparency and reliability compared to end-to-end black-box learning approaches, enabling deterministic interpretation of traffic signal behavior under varying illumination conditions.
The proposed framework achieves high detection and classification performance, attaining an F1-score of 91.3% across all traffic-related object classes and an F1-score of 96.7% for traffic-light state recognition. In addition, the system satisfies strict real-time constraints, achieving a steady-state throughput of 47.73 frames per second (fps) and a decision latency below 10 ms. These results demonstrate the suitability of the framework for deployment in intelligent transportation and autonomous driving systems.

The remainder of this paper is organized as follows: Section 2 reviews the related work on traffic signal detection and classification and different models and methods used for that purpose; Section 3 details our research methodology, including dataset construction, model design, experimental setup, and training strategy; Section 4 presents and discusses the experimental results; and Section 5 concludes the study with future research directions.

2. Related Work

Robust perception of traffic control elements is a foundational challenge for ITSs and autonomous driving. Research has evolved along several key paths to address the specific difficulties of detecting small, distant traffic lights and correctly interpreting their dynamic states. This section reviews these primary directions, highlighting their respective strengths and persistent limitations that motivate the present work.

2.1. Detector-Centric and Frame-Based Approaches

A dominant line of research treats Traffic-Light Recognition (TLR) primarily as a small-object detection problem, focusing on improving deep learning architectures for reliable localization. This includes modifying generic detectors like Single-Shot Detection (SSD) to fuse multi-scale features for small targets [12], developing task-specific single-stage networks such as DeepTLR and its hierarchical variant HDTLR [13,14], and enhancing YOLO-family models for better feature extraction [15,16]. A recent survey by Pavlitska et al. organizes these detector-based methods as modified generic detectors, multi-stage pipelines, and task-specific single-stage networks, confirming that small object size and background variability remain major bottlenecks [17]. Recent studies continue this trend, employing techniques like salience-sensitive loss in Deformable DETR to prioritize decision-critical lights [18] or meta-learning with YOLOv8 to focus on illuminated regions [19]. Extending beyond traffic lights, ensemble and hybrid CNN methods have shown high accuracy for related tasks like traffic sign recognition [20]. Beyond CNN-based detectors, Vision Transformers (ViT) have demonstrated strong promise for traffic perception tasks. Studies evaluating ViT variants against CNNs on traffic sign classification reveal competitive trade-offs [21], while novel architectures such as Evolutionary Algorithms with the Transformer architecture (EATFormer)-based attention modules achieve state-of-the-art accuracy on benchmark datasets [22]. Conditional multi-head transformers that dynamically adapt feature aggregation further improve robustness to occlusion and illumination changes [23]. While these methods achieve high per-frame detection accuracy, they operate fundamentally on static images. 
Consequently, they lack temporal persistence, struggle with flashing patterns, and cannot maintain stable object identities across video sequences, a critical shortfall for real-world driving, where consistent tracking is essential. Beyond perception, ITSs also benefit from optimization-based approaches to signal control. For instance, the Multi-Objective Particle Swarm Optimization (MCMOPSO) algorithm demonstrates how swarm-based optimization can effectively optimize traffic signal phase timing, highlighting the broader systems-level challenges that perception frameworks like HybridSignalNet must ultimately support [24].

2.2. Multi-Stage and Map-Assisted Pipelines

To improve robustness, many systems decouple detection from classification. These pipelines first localize traffic lights using detectors, attention models, or semantic segmentation and then apply a dedicated classifier to the cropped ROI to determine the state or pictogram [8,10,25,26,27]. Representative works include combining YOLOv5s with AlexNet for classification [28], using heuristic ROI algorithms with CNN classifiers for real-time detection [29], and employing Faster R-CNN with ResNet for joint traffic sign and light detection [30]. This “detect-then-classify” paradigm often reduces false positives. Some approaches further constrain the search space using high-definition (HD) map data and vehicle localization to predefine probable traffic light locations, enhancing long-range performance [31,32]. Recent ensemble learning methods with color-based augmentation explicitly detect individual signal lights before classification, achieving high accuracy without reliance on bounding boxes [31]. However, these systems often remain frame-by-frame in their analysis. They typically do not incorporate robust multi-object tracking to handle identity persistence through occlusions, and they remain vulnerable to errors if map data is inaccurate or unavailable. Furthermore, their classification modules are usually designed for static states and lack explicit mechanisms for temporal reasoning over flashing signals.

2.3. Hardware-Efficient and Real-Time Systems

To enable deployment in resource-constrained environments, several works have explored lightweight and hardware-accelerated designs. FPGA-based implementations using compressed models like YOLOv3-Tiny achieve real-time performance with high detection accuracy [33]. Other studies develop custom low-power CNN architectures optimized for traffic signal classification on embedded platforms, achieving near-perfect accuracy for basic color states [34]. Synthetic data generation has also been used to improve robustness in varied conditions [35]. While these systems address critical efficiency needs, they are typically limited to a small set of basic signal classes (red, yellow, green) and do not model complex states such as flashing lights or directional arrows, thus failing to meet the full requirements for urban autonomous navigation.

2.4. Temporal Consistency and Filtering Methods

The importance of temporal smoothing has long been recognized. Early systems used classical computer vision techniques such as CAMSHIFT tracking and temporal voting, or probabilistic filters such as Hidden Markov Models (HMMs) and Interacting Multiple Models (IMMs), to enforce valid state transitions and reduce flicker [7,36,37,38]. Laith et al. proposed a traffic-light classification model that combines object detection with fuzzy logic based on the HSV color space [39]. A modern exemplar is the aUToLights system, which integrates multi-camera detection with HMM-based filtering and leverages known traffic light transition rules to improve robustness under occlusion [40]. Similarly, multi-camera fusion with Labeled Multi-Bernoulli filtering demonstrates the effectiveness of probabilistic tracking [38]. While effective, such approaches often rely heavily on system-level priors, geometric assumptions, or multi-sensor fusion, which can limit their generalizability to new environments without such information. In contrast, our work adopts a generic vision-based multi-object tracker (ByteTrack) to maintain identity persistence, enabling temporal reasoning without dependence on map or geometric priors.

2.5. Deep Temporal Learning Models

With the success of recurrent neural networks, researchers have developed end-to-end spatiotemporal models for TLR. Frameworks like FlashLightNet combine CNNs with LSTM networks to directly learn and classify traffic light state sequences, including flashing behaviors, from video clips [41]. Related deep temporal models, including CNN-LSTM hybrid architectures, have been widely adopted for traffic signal phase prediction. More recently, transformer-based sequence modeling has been applied to vehicle light recognition: the TrVLR model employs a CNN backbone with a transformer encoder-decoder architecture. This task is structurally analogous to flashing traffic-light classification, and the model captures long-range temporal dependencies in light-state sequences, surpassing CNN-LSTM baselines on turn signal recognition [42]. In addition, CNN-GRU-LSTM-based deep learning frameworks have demonstrated strong performance in traffic forecasting. GRUs, a streamlined variant of recurrent neural networks, are particularly effective for modeling sequential data and have been successfully applied to anticipate temporal patterns in traffic volumes [43,44,45,46]. More advanced variants like transfer learning-based LSTM-GAN models improve adaptability under varying conditions [47], and hierarchical attention–LSTM networks demonstrate strong performance for broader traffic parameter forecasting [48]. While these methods provide powerful temporal modeling, they come with significant computational and training complexity, operate as black boxes that reduce system explainability, and are rarely integrated into a unified pipeline that also detects other critical roadway elements.

2.6. Multi-Task and Unified Perception Frameworks

A practical autonomous driving stack requires simultaneous understanding of multiple roadway elements. Research has thus progressed towards unified, real-time perception models. Works like YOLOP demonstrate the feasibility of jointly performing traffic object detection, lane detection, and drivable area segmentation in a single network [49,50]. Similarly, multi-task learning has been explored for traffic sign recognition to improve efficiency through shared feature extraction [47,48]. However, these comprehensive frameworks typically do not incorporate sophisticated temporal reasoning for traffic light states. They detect traffic lights as mere objects without classifying complex, time-dependent behaviors like flashing arrows or circles, creating a gap between whole-scene perception and actionable signal understanding. The most recent frontier in unified perception is end-to-end transformer-based autonomous driving. DriveTransformer addresses cumulative error in sequential perception-prediction-planning pipelines by enabling parallel task queries, sparse sensor representations, and streaming temporal fusion, achieving state-of-the-art results on both closed- and open-loop benchmarks [51]. Similarly, the End-to-end Multimodal Model for Autonomous driving (EMMA) reframes detection and classification as multimodal language generation within a large foundation model, demonstrating strong generalization across complex driving scenarios [52]. While these end-to-end frameworks represent a powerful scaling direction, they sacrifice interpretability and typically treat traffic lights as generic objects, without explicit mechanisms for classifying fine-grained states such as flashing arrows.

In summary, prior work has made significant strides in frame-based detection accuracy, temporal filtering, and multi-task network design. Yet, a fragmented landscape remains: frame-based detectors lack temporal identity; two-stage and map-assisted systems can be brittle; classical filters may over-rely on priors; deep temporal models sacrifice efficiency and interpretability; and unified perception pipelines omit detailed traffic-light state analysis. Consequently, there is a lack of a real-time, vision-based framework that unifies multi-class detection (lights, signs, directional indicators) with identity-aware temporal reasoning capable of robustly classifying the full spectrum of static and flashing states for both circular and arrow signals. HybridSignalNet is proposed to bridge this gap by integrating a YOLOv11n-based multi-class detector, a ByteTrack-based multi-object tracker for identity persistence, and an interpretable HSV-based temporal reasoning module within a deployable, efficient pipeline.

3. Research Methodology

This section presents the methodology used to develop HybridSignalNet. The proposed framework follows a systematic pipeline that begins with dataset collection and preprocessing, proceeds through model training and parameter optimization, and concludes with comprehensive performance evaluation. The system architecture consists of three core modules. First, a YOLOv11n-based detector is employed to detect and classify most traffic-related objects; however, for circular and arrow traffic lights, the model is used solely for detection. Second, a ByteTrack-based tracking module is applied to associate detected traffic lights across consecutive video frames, enabling temporal consistency. Finally, an HSV color-space-based classifier combined with rule-based temporal reasoning determines the final traffic light state, including red, yellow, green, flashing red, flashing yellow, red arrow, yellow arrow, green arrow, and flashing yellow arrow states. The overall system workflow is illustrated in Figure 1.

3.1. Dataset Collection and Curation

A comprehensive dataset was constructed for this study using real-world traffic recordings collected from multiple urban intersections located on the Mississippi State University campus and within the cities of Starkville and Columbus, Mississippi, USA. The recordings capture natural driving environments and include a wide range of traffic scenarios and environmental variations. To complement the real-world data and to better analyze temporal traffic-light behavior, simulated traffic-light videos were also generated using RoadRunner (MathWorks, Natick, MA, USA), a simulation tool. The simulator was used to create both circular and arrow traffic-light configurations with controlled flashing patterns. The dataset covers a diverse set of traffic-related object categories, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. All real-world videos were captured using a GoPro HERO8 Black digital camera (GoPro, Inc., San Mateo, CA, USA) under normal driving conditions. The recordings were conducted at different times of the day to reflect realistic traffic scenarios. Each video was recorded at 30 fps, which provides sufficient temporal resolution for accurate object detection, frame-to-frame tracking, and analysis of traffic signal state transitions, including flashing behavior. In addition to the collected recordings, several publicly available datasets were incorporated to improve dataset diversity and generalization capability. Images from the LISA Traffic Sign Dataset and the Mapillary Traffic Sign Dataset were included to expand the diversity of object class appearances, roadway layouts, and environmental contexts. Furthermore, videos from the BDD100K dataset were utilized to evaluate the performance of the proposed system in detecting, tracking, and classifying traffic-light states.
The integration of both custom-collected data and established public datasets creates a heterogeneous and challenging evaluation benchmark, enabling the proposed system to be assessed under diverse real-world conditions rather than within a single controlled environment. Although the dataset incorporates multiple sources to enhance diversity, it is primarily collected from a specific geographic region (Mississippi, USA), and therefore full generalization to different countries with varying traffic signal designs and regulations is not guaranteed. Additionally, while the dataset captures illumination variability, it does not explicitly include adverse weather conditions such as heavy rain, fog, or snow.

3.2. Preprocessing Pipeline

Following data collection, several preprocessing steps were applied to prepare the dataset for model training and evaluation. These steps were designed to enhance data quality, ensure consistency, and improve the reliability of the detection, tracking, and classification processes. The preprocessing pipeline begins with video segmentation, in which the recorded and simulated videos are converted into individual image frames, followed by dataset cleaning to remove low-quality samples. Subsequently, the retained images are annotated to generate ground-truth labels for all classes. Data augmentation techniques are then applied to increase dataset diversity and improve model generalization. Finally, the complete dataset is partitioned into three non-overlapping subsets (training, validation, and testing) using a ratio of 70%:15%:15%, respectively. The following subsections describe the preprocessing steps in detail.

3.2.1. Video Segmentation and Frame Sampling

The recorded and simulated videos were segmented into individual frames to enable frame-level annotation and analysis. To reduce temporal redundancy and ensure scenario diversity, a uniform sampling strategy was employed. Frames were systematically extracted from different temporal segments of each video, rather than from every consecutive frame, thereby producing a dataset containing varied lighting, occlusion, and traffic conditions.
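As an illustration of uniform sampling across temporal segments, the following minimal sketch picks one frame index from the middle of each equal-length segment of a video. The function name `uniform_sample_indices` and the choice of the segment midpoint are assumptions made for illustration; the paper does not specify the exact sampling interval or implementation.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return frame indices spread evenly over a video, one per temporal segment.

    Hypothetical helper: the midpoint-of-segment rule is an assumption,
    not the authors' documented procedure.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# Example: a 10 s clip at 30 fps (300 frames), keeping 5 frames.
print(uniform_sample_indices(total_frames=300, num_samples=5))
# [30, 90, 150, 210, 270]
```

Sampling by segment rather than by consecutive frames keeps near-duplicate frames out of the dataset while still covering the whole recording.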

3.2.2. Data Cleaning and Quality Control

A stringent quality filter was applied to the sampled frames. Any frames affected by excessive motion blur, image distortion, or camera instability were discarded, along with any samples that did not meet quality standards. This cleaning step ensured that only high-quality frames suitable for reliable detection, tracking, and classification were retained for subsequent training and evaluation, thereby improving the overall robustness and performance of the proposed system.

3.2.3. Image Annotation

The cleaned frames were annotated using Roboflow (Roboflow, Des Moines, IA, USA), a widely used computer vision annotation platform. Each frame was labeled with bounding boxes corresponding to relevant traffic-related objects. The annotation process covered 13 distinct classes, namely: Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. These annotations provided precise spatial ground truth information required for training and evaluating the YOLOv11n model. To ensure annotation accuracy and consistency, an automated validation script was developed to verify the correspondence between annotated class IDs and their true object categories, as well as the correctness of bounding box localization. Any detected inconsistencies or labeling errors were subsequently reviewed and corrected manually to guarantee high-quality annotations.

3.2.4. Data Augmentation

To enhance the diversity and robustness of the dataset, a data augmentation pipeline was implemented. The applied techniques included horizontal and vertical flipping, random brightness and contrast adjustments, small random rotations, slight blurring, and color jittering. During augmentation, bounding-box annotations were automatically updated in YOLO format to ensure spatial consistency between images and labels. This augmentation strategy increased the effective dataset size and improved the model’s robustness to illumination variations and camera viewpoint changes.
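The label update that accompanies a flip can be sketched as follows, assuming the standard YOLO annotation layout of a class ID followed by a normalized box center, width, and height. The `YoloBox` type and the two helper functions are hypothetical names introduced for illustration and are not taken from the authors' code.

```python
from typing import NamedTuple


class YoloBox(NamedTuple):
    """One YOLO-format label: all coordinates normalized to [0, 1]."""
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float


def hflip_box(box: YoloBox) -> YoloBox:
    # A horizontal flip mirrors the x coordinate; the box size is unchanged.
    return box._replace(x_center=1.0 - box.x_center)


def vflip_box(box: YoloBox) -> YoloBox:
    # A vertical flip mirrors the y coordinate; the box size is unchanged.
    return box._replace(y_center=1.0 - box.y_center)


label = YoloBox(class_id=4, x_center=0.25, y_center=0.375, width=0.1, height=0.2)
print(hflip_box(label))
# YoloBox(class_id=4, x_center=0.75, y_center=0.375, width=0.1, height=0.2)
```

This sketch leaves `class_id` unchanged. The text does not state whether labels of direction-dependent classes (for example, Left Lane and Right Lane) were remapped after flipping, so any such remapping would be an additional assumption.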

3.2.5. Dataset Partitioning

The final dataset consists of 39,000 images, with 3000 images per class, ensuring balanced representation across all classes and supporting stable and reliable model training. The dataset was partitioned into three subsets: a training set (70%, 27,300 images) used for model learning, a validation set (15%, 5850 images) used for hyperparameter tuning, and a test set (15%, 5850 images) used for final unbiased performance evaluation.
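The split sizes can be checked with a short sketch. It assumes the 70%:15%:15% ratio is applied within each class (a stratified split), which the text does not state explicitly; the function name `split_per_class`, the synthetic file names, and the random seed are illustrative assumptions.

```python
import random


def split_per_class(items_by_class, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle each class independently and cut it into train/val/test.

    Assumption: stratification by class; the paper only reports the totals.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for items in items_by_class.values():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = round(len(shuffled) * ratios[0])
        n_val = round(len(shuffled) * ratios[1])
        splits["train"] += shuffled[:n_train]
        splits["val"] += shuffled[n_train:n_train + n_val]
        # The remainder goes to the test set so that no image is dropped.
        splits["test"] += shuffled[n_train + n_val:]
    return splits


# 13 classes x 3000 placeholder image names = 39,000 images.
data = {f"class_{c}": [f"class_{c}_img_{i}" for i in range(3000)] for c in range(13)}
splits = split_per_class(data)
print({name: len(items) for name, items in splits.items()})
# {'train': 27300, 'val': 5850, 'test': 5850}
```

Per class this gives 2100 training, 450 validation, and 450 test images, which matches the reported totals of 27,300, 5850, and 5850.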

3.3. YOLOv11-Based Object Detection

After preparing the dataset, a YOLOv11n model was trained to detect and classify all traffic-related classes, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. For arrow and circular traffic lights, YOLOv11n is used only for object detection, while traffic-light state classification is handled in subsequent stages. This design decouples spatial detection from temporal state reasoning and prevents the detector from learning flashing behaviors directly from image-level supervision. The internal structure of the employed detector, YOLOv11n, is illustrated in Figure 2 [53].
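The decoupling described above can be pictured as a routing step placed after the detector: detections of the eleven sign, lane, and pedestrian classes are final, while the two traffic-light classes are passed on for tracking and state classification. The sketch below is a minimal illustration under that reading; the `Detection` type and `route_detections` function are hypothetical and do not correspond to any detector library API.

```python
from dataclasses import dataclass

# The two classes whose final state is decided downstream, per the text.
TRAFFIC_LIGHT_CLASSES = {"Arrow Traffic Light", "Circular Traffic Light"}


@dataclass
class Detection:
    class_name: str
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float


def route_detections(detections):
    """Split detector output into final labels and traffic lights to track."""
    final_labels, to_track = [], []
    for det in detections:
        if det.class_name in TRAFFIC_LIGHT_CLASSES:
            to_track.append(det)      # state is resolved by later stages
        else:
            final_labels.append(det)  # detector class is the final answer
    return final_labels, to_track


frame = [
    Detection("Stop", (10, 20, 60, 70), 0.93),
    Detection("Circular Traffic Light", (200, 15, 220, 60), 0.88),
    Detection("Arrow Traffic Light", (240, 15, 260, 60), 0.81),
]
final_labels, to_track = route_detections(frame)
print([d.class_name for d in final_labels])  # ['Stop']
print([d.class_name for d in to_track])
# ['Circular Traffic Light', 'Arrow Traffic Light']
```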

The YOLOv11 detector follows a single-stage object detection paradigm in which object localization and classification are performed simultaneously. As shown in Figure 2 [53], the architecture consists of three main components: a backbone network responsible for hierarchical feature extraction, a neck module that aggregates multi-scale features through up-sampling and feature concatenation, and a multi-scale detection head that produces bounding-box and class predictions at different spatial resolutions. This multi-scale prediction strategy is particularly important for detecting small and distant objects such as traffic lights in real-world urban driving environments.

Training was conducted using the hyperparameters summarized in Table 1. This detection stage provides accurate and temporally consistent bounding boxes for all traffic-related objects, forming a reliable foundation for downstream tracking and traffic-light state classification modules.

3.4. Traffic-Light Tracking Using ByteTrack

Following object detection, a ByteTrack-based multi-object tracking module is applied to associate traffic-light detections across consecutive video frames. Each detected traffic light is assigned a unique track identity that is maintained over time, ensuring that observations collected from different frames correspond to the same physical signal. Separate tracking streams are maintained for circular and arrow traffic lights to prevent identity mixing between structurally different signal types. This track-level identity preservation provides a stable temporal foundation for subsequent state reasoning. Overall, the ByteTrack-based tracking module provides reliable temporal consistency while remaining computationally efficient, making it well suited for real-time traffic monitoring and downstream signal state analysis, particularly for applications involving temporal reasoning such as flashing state detection.
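One way to keep a fixed-length observation history per tracked signal is sketched below. Track IDs are assumed to come from the tracker; no ByteTrack API is called here. The class name `TrackStateBuffers`, the buffer length of 30 observations (about one second at 30 fps), and the `"off"` label for frames with no lit color are all assumptions made for illustration.

```python
from collections import defaultdict, deque

BUFFER_LEN = 30  # assumed value: roughly one second of video at 30 fps


class TrackStateBuffers:
    """Fixed-length per-track history of frame-level color observations."""

    def __init__(self, maxlen: int = BUFFER_LEN):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._buffers = defaultdict(lambda: deque(maxlen=maxlen))

    def update(self, signal_type: str, track_id: int, color: str) -> list[str]:
        # Keying on (signal_type, track_id) keeps circular and arrow
        # streams separate even if the two trackers reuse the same IDs.
        key = (signal_type, track_id)
        self._buffers[key].append(color)
        return list(self._buffers[key])

    def drop(self, signal_type: str, track_id: int) -> None:
        # Free the history once the tracker retires a track.
        self._buffers.pop((signal_type, track_id), None)


buffers = TrackStateBuffers(maxlen=4)
for color in ["yellow", "off", "yellow", "off", "yellow"]:
    history = buffers.update("circular", 7, color)
print(history)  # ['off', 'yellow', 'off', 'yellow']
```

The history returned by `update` is the kind of buffered per-frame sequence that a downstream rule-based stage could inspect to separate steady from flashing behavior.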

3.5. HSV-Based Traffic-Light State Extraction

For each tracked traffic light, an ROI is extracted from the detected bounding box. The cropped ROI is then converted from the RGB color-space to the HSV (Hue-Saturation-Value) color space, which provides better separation of chromatic information under varying illumination conditions. Minimum threshold values are applied to the saturation and value channels to suppress dark or low-saturation pixels. Binary masks corresponding to red, yellow, and green colors are generated using predefined HSV ranges, followed by morphological operations to reduce noise and fill small gaps.

The HSV thresholds used for color segmentation were determined through a systematic, data-driven empirical procedure on the validation dataset. A representative subset was manually annotated at the pixel level for red, yellow, and green traffic light states under diverse illumination conditions, including daylight, nighttime, glare, shadows, overcast, and partial occlusion. The distributions of Hue, Saturation, and Value channels were analyzed for each color class, revealing strong separability in the Hue component, with overlap primarily caused by illumination variations affecting SV. Based on this analysis, initial threshold ranges were defined and subsequently refined using a grid search on the validation set to maximize the frame-level classification F1-score while maintaining robustness to illumination changes and sensor noise. In addition, minimum SV thresholds were introduced to suppress low-intensity and desaturated pixels, significantly reducing false positives caused by glare, reflections, and low-light conditions. The final thresholds were fixed based on validation performance and were not tuned on the test set, ensuring fair evaluation and avoiding overfitting. This empirical strategy provides a balance between interpretability, reproducibility, and practical robustness, while allowing explicit control over failure modes under challenging real-world conditions. The dominant color is determined by computing the ratio of active pixels for each color mask relative to the ROI area. This HSV-based color extraction provides frame-level color observations that are forwarded to the temporal reasoning stage.
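The dominant-color step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in this work: the hue ranges, the minimum saturation and value, the minimum pixel ratio, and the "Off" fallback label are all assumed placeholder values, since the tuned thresholds are not listed in this section.

```python
import colorsys

# Hypothetical hue ranges in degrees (0-360); NOT the tuned thresholds of the paper.
HUE_RANGES = {
    "Red": [(0.0, 15.0), (345.0, 360.0)],  # red wraps around the hue circle
    "Yellow": [(40.0, 70.0)],
    "Green": [(90.0, 180.0)],
}
MIN_SATURATION = 0.4  # assumed minimum S, suppresses desaturated pixels
MIN_VALUE = 0.4       # assumed minimum V, suppresses dark pixels
MIN_RATIO = 0.05      # assumed minimum active-pixel ratio for a valid color


def pixel_color(r, g, b):
    """Return the color label of one 8-bit RGB pixel, or None if it is masked out."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    if s < MIN_SATURATION or v < MIN_VALUE:
        return None
    hue_deg = h * 360.0
    for label, ranges in HUE_RANGES.items():
        if any(lo <= hue_deg <= hi for lo, hi in ranges):
            return label
    return None


def dominant_color(roi):
    """roi: list of rows of (r, g, b) tuples. Returns (label, per-color ratios)."""
    counts = {label: 0 for label in HUE_RANGES}
    total = 0
    for row in roi:
        for r, g, b in row:
            total += 1
            label = pixel_color(r, g, b)
            if label is not None:
                counts[label] += 1
    # Ratio of active pixels of each color mask relative to the ROI area.
    ratios = {label: (n / total if total else 0.0) for label, n in counts.items()}
    best = max(ratios, key=ratios.get)
    return (best if ratios[best] >= MIN_RATIO else "Off"), ratios
```

In this sketch an ROI with no sufficiently bright, saturated pixels is reported as "Off", which is the kind of frame-level observation the temporal stage needs in order to see the dark phase of a flashing signal.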

3.6. Temporal Buffering for Traffic-Light State Reasoning

To enable robust traffic-light state classification over time, a temporal buffering mechanism is introduced for each tracked traffic light. For every unique track ID generated by the ByteTrack tracking module, a fixed-length sliding buffer is maintained to store recent color observations extracted from consecutive video frames. At each frame, the instantaneous color label obtained from the HSV-based analysis (i.e., Red, Yellow, or Green) is appended to the buffer. This temporal aggregation allows the system to reason over short-term color transitions rather than relying on single-frame decisions, which are often sensitive to noise, reflections, or motion blur. Flashing traffic light states are inferred by analyzing characteristic alternating patterns between “On” states (e.g., Red or Yellow) and the “Off” state within the temporal buffer using a rule-based fuzzy reasoning strategy. Importantly, flashing behaviors are not learned directly by the deep learning detector but are instead inferred through explicit temporal buffering and color-based logic. This approach improves interpretability, robustness, and generalization across varying flashing frequencies and environmental conditions.

It is important to clarify the distinction between frame-level observations and sequence-level interpretation in the proposed framework. While object detection and HSV-based color extraction are performed at the frame level, flashing behavior is not defined at the individual frame level. Instead, flashing traffic light states are inferred at the sequence level using the temporal buffer associated with each tracked object. The temporal reasoning module analyzes patterns of activation and deactivation across consecutive frames to determine whether a signal exhibits flashing behavior. Regarding correctness, flashing classification does not require observing a complete flashing cycle. A prediction is considered correct when sufficient temporal variation is present within the observation window, such as alternation between active and inactive signal states. Because the proposed temporal rules do not rely on strict periodicity, the system remains robust when only partial flashing cycles are observed. If no temporal variation is detected and the signal appears consistently active, the system classifies it as a static signal. This design improves robustness under partial visibility, occlusion, and limited observation windows.
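The per-track buffering and sequence-level decision described above can be sketched as follows. This is an illustrative simplification, not the rule set used in this work: it replaces the fuzzy rules with crisp thresholds, and the names `TrackBuffers` and `infer_state` as well as the three `MIN_*` constants are hypothetical. The 27-frame length is the value selected in the buffer-length experiment.

```python
from collections import Counter, defaultdict, deque

BUFFER_LEN = 27                      # buffer length selected in the paper
FLASHING_COLORS = {"Red", "Yellow"}  # colors that have a flashing state class
MIN_TRANSITIONS = 1                  # assumed: a partial on/off cycle is enough
MIN_ON_FRAMES = 3                    # assumed guard against single-frame noise
MIN_OFF_FRAMES = 3                   # assumed guard against single-frame noise


def infer_state(observations):
    """Map a sequence of frame-level labels to a sequence-level state."""
    obs = list(observations)
    lit = [label != "Off" for label in obs]
    if not any(lit):
        return "Off"
    # Most frequent lit color in the window.
    dominant = Counter(l for l in obs if l != "Off").most_common(1)[0][0]
    # Count on/off alternations; strict periodicity is not required.
    transitions = sum(a != b for a, b in zip(lit, lit[1:]))
    flashing = (
        dominant in FLASHING_COLORS
        and transitions >= MIN_TRANSITIONS
        and lit.count(True) >= MIN_ON_FRAMES
        and lit.count(False) >= MIN_OFF_FRAMES
    )
    return f"Flashing {dominant}" if flashing else dominant


class TrackBuffers:
    """One fixed-length sliding buffer per track ID."""

    def __init__(self, maxlen=BUFFER_LEN):
        self._buffers = defaultdict(lambda: deque(maxlen=maxlen))

    def update(self, track_id, label):
        buf = self._buffers[track_id]
        buf.append(label)  # the oldest observation is dropped automatically
        return infer_state(buf)
```

For example, feeding five "Red", five "Off", and five "Red" observations to one track yields "Flashing Red", whereas a window that is consistently "Green" is reported as the static state "Green".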

The custom dataset was constructed using 57 real-world traffic videos recorded at multiple urban intersections across the Mississippi State University campus and the cities of Starkville and Columbus, Mississippi, USA. These were complemented by simulated traffic-light videos generated using RoadRunner (MathWorks, Natick, MA, USA), as well as images sourced from publicly available benchmark datasets, including the LISA Traffic Sign Dataset and the Mapillary Traffic Sign Dataset. In addition, videos from the BDD100K dataset were incorporated to further enhance dataset diversity and to evaluate the proposed framework under large-scale real-world driving conditions for traffic-light recognition. The video sequences range from 10 to 60 min in duration, providing extensive temporal coverage across diverse driving scenarios.

From these combined sources, a total of 39,000 images were extracted and annotated, covering 13 traffic-related object classes: Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. To ensure balanced representation and prevent class bias during model training, each class contains exactly 3000 annotated images. Table 2 summarizes the detailed specifications of the constructed dataset.

3.7. Experimental Environment

All experiments related to dataset preprocessing and the evaluation of the proposed framework were conducted on a high-performance computing cluster running Rocky Linux 9.2. The system was equipped with NVIDIA A100 80GB PCIe GPUs and an Intel Xeon Platinum 8362 processor with 64 CPU Cores and 1.5 TB of system memory. Table 3 shows the detailed specification of the experimental environment.

3.8. Evaluation Metrics

To evaluate the performance of the proposed framework, standard evaluation metrics were used, including precision, recall, F1-score, mAP, Multiple Object Tracking Accuracy (MOTA), and Identity F1-Score (IDF1). Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive by the model, as expressed in Equation (1) [13,54,55].

Precision = \frac{TP}{TP + FP}

(1)

Recall evaluates the model’s ability to identify all actual positive instances as shown in Equation (2) [13,54,55].

Recall = \frac{TP}{TP + FN}

(2)

Here, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

The F1-score combines precision and recall into a single metric by computing their harmonic mean, as shown in Equation (3) [13].

F1 = \frac{2PR}{P + R}

(3)

Here, in Equation (3), P denotes precision and R denotes recall. The mAP is the mean of the Average Precision (AP) values computed across all N object classes, as defined in Equation (4) [54,56].

mAP = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i}

(4)
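Equations (1)–(4) can be evaluated directly from raw counts. The following minimal sketch uses hypothetical helper names, and the per-class AP values in the usage note are illustrative only, not results from this work.

```python
def precision(tp, fp):
    """Equation (1): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (2): TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Equation (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """Equation (4): mean of the per-class AP values."""
    ap = list(ap_per_class)
    if not ap:
        raise ValueError("at least one class AP is required")
    return sum(ap) / len(ap)
```

As a consistency check, `f1_score(0.975, 0.959)` evaluates to approximately 0.967, which agrees with the precision, recall, and F1-score reported for traffic-light state classification. With illustrative AP values, `mean_average_precision([0.9, 0.8, 0.7])` is approximately 0.8.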

Multiple Object Tracking Accuracy (MOTA) evaluates the overall tracking performance by jointly accounting for missed detections, false positives, and identity switches across all frames. MOTA is defined as follows in Equation (5) [57,58].

MOTA = 1 - \frac{\sum_{t} (FN_{t} + FP_{t} + IDSW_{t})}{\sum_{t} GT_{t}}

(5)

Here, in Equation (5), FN_{t}, FP_{t}, and IDSW_{t} denote the numbers of false negatives, false positives, and identity switches at frame t, GT_{t} denotes the number of ground-truth objects at frame t, and the sums run over all frames.

The Identity F1-Score (IDF1) measures the accuracy of identity preservation by evaluating how consistently tracked identities correspond to ground-truth identities over time. IDF1 is defined as follows in Equation (6) [55].

IDF1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}

(6)

Here, in Equation (6), IDTP denotes identity true positives, IDFP denotes identity false positives, and IDFN denotes identity false negatives. This metric emphasizes identity continuity rather than pure detection accuracy.
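Equations (5) and (6) can be sketched in the same way. The function names and the per-frame tuple layout are assumptions made for illustration, and the counts in the usage note are toy values rather than measurements.

```python
def mota(per_frame):
    """Equation (5). per_frame: iterable of (fn, fp, idsw, gt) tuples, one per frame."""
    frames = list(per_frame)
    errors = sum(fn + fp + idsw for fn, fp, idsw, _ in frames)
    gt_total = sum(gt for _, _, _, gt in frames)
    if gt_total == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / gt_total


def idf1(idtp, idfp, idfn):
    """Equation (6): identity-level F1 score."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0
```

For four frames with five ground-truth objects each and one miss, one false positive, and one identity switch in total, `mota([(1, 0, 0, 5), (0, 1, 0, 5), (0, 0, 1, 5), (0, 0, 0, 5)])` gives 0.85. Note that MOTA can become negative when the accumulated errors exceed the number of ground-truth objects.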

4. Results and Discussion

In this section, we discuss the results obtained from the proposed framework, HybridSignalNet, for detecting and classifying traffic-related objects and traffic-light states. The evaluated object classes include Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, and Yield, as well as traffic-light states including Red, Green, Yellow, Flashing Red, Flashing Yellow, Red Arrow, Green Arrow, Yellow Arrow, and Flashing Yellow Arrow. Additionally, this section analyzes the conducted experiments and fine-tuning procedures used to determine the optimal parameters for the proposed framework. Representative qualitative and quantitative results are also presented to demonstrate the effectiveness and efficiency of HybridSignalNet in accurately detecting and classifying the considered traffic-related classes.

4.1. YOLOv11 Model Selection and Performance

YOLOv11 is one of the latest versions of the YOLO family of single-stage real-time object detectors. It is designed to achieve higher detection accuracy, lower latency, and improved computational efficiency compared to earlier YOLO versions [59]. YOLOv11 is available in several model variants, including nano (n), small (s), medium (m), large (l), and extra-large (x), offering different trade-offs between accuracy and computational cost.

In this experiment, three variants of YOLOv11 (nano, medium, and large) were evaluated on the constructed dataset. As expected, increasing model capacity improved detection performance. The large model achieved the highest detection accuracy, with a precision of 96.6%, recall of 95.2%, F1-score of 95.9%, mAP_{0.5} of 96.4%, and mAP_{0.5:0.95} of 84.7%. However, this performance gain came at a substantially higher computational cost and longer training time. In contrast, the nano model achieved a precision of 92.8%, recall of 90.0%, F1-score of 91.3%, mAP_{0.5} of 92.3%, and mAP_{0.5:0.95} of 73.5%, while requiring significantly less computational resources and training time. Although its detection accuracy is lower than that of the larger variants, the nano model remains sufficiently accurate for reliable object localization performance.

The proposed system aims to achieve both high detection accuracy and real-time operation in practical traffic environments. Since the subsequent modules (object tracking and HSV-based temporal reasoning) refine the final traffic-light state estimation, extremely high detection accuracy is not strictly required. Instead, a lightweight detector capable of stable localization with low computational overhead is preferable. Therefore, YOLOv11n was selected as the base detector for the proposed framework. This design allows the system to maintain real-time capability while delegating temporal consistency and signal-state interpretation to the higher-level reasoning module. Table 4 summarizes the quantitative performance comparison among the evaluated YOLOv11 variants.

The YOLOv11n model is used to detect multiple traffic-related object classes, including Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light. For traffic lights, the detector performs only object localization rather than state classification. Following detection, an HSV color-space-based classification module is applied to determine the states of traffic lights. This module classifies detected circular and arrow traffic lights into their respective states: red, green, yellow, flashing red, flashing yellow, red arrow, green arrow, yellow arrow, and flashing yellow arrow. The detection performance for all object classes is summarized in Table 5.

As shown in Table 5, the detector consistently performs well across most traffic-related objects. The Circular Traffic Light class achieves the best performance, with a precision of 96.8%, recall of 95.1%, F1-score of 95.9%, mAP_{0.5} of 96.9%, and mAP_{0.5:0.95} of 83.5%. The Arrow Traffic Light class also achieves high accuracy, with a precision of 95.6%, recall of 93.8%, F1-score of 94.7%, mAP_{0.5} of 95.8%, and mAP_{0.5:0.95} of 80.1%. Other classes, including Stop, Yield, Speed Limit, Roundabout, and lane-marking categories, also demonstrate reliable detection performance, with F1-scores generally above 89% and strong mAP_{0.5} values. Lower performance is observed for Pedestrian and some lane-marking classes due to smaller object size, partial occlusions, and appearance variability in real traffic scenes. Importantly, the two most critical classes for the proposed framework, Arrow Traffic Light and Circular Traffic Light, achieve high localization accuracy. These detected traffic lights serve as input to the subsequent traffic light state classification module, which further classifies them into red, green, yellow, flashing red, flashing yellow, red arrow, green arrow, yellow arrow, and flashing yellow arrow states.

Overall, the detector achieves an average precision of 92.8%, recall of 90.0%, F1-score of 91.3%, mAP_{0.5} of 92.3%, and mAP_{0.5:0.95} of 73.5%, demonstrating that the selected lightweight detector is sufficiently accurate while maintaining real-time capability.

4.2. Temporal Tracking Module: Integrating ByteTrack for State Recognition

The tracking module is a critical component that ensures consistent identification and tracking of traffic lights throughout the video sequence. Although the YOLO-based detector employed in this study is effective at localizing traffic lights, it operates independently on each frame and does not retain object identity over time. To address this limitation, a tracking module is incorporated to assign a persistent track ID to each detected traffic light, allowing the system to follow the same physical signal throughout the video sequence. This temporal continuity is particularly important for temporal analysis and for accurately recognizing flashing traffic-light states. In this experiment, three tracking models were evaluated: ByteTrack [60], DeepSORT [61], and an IoU-based tracker [62]. Table 6 presents a quantitative comparison of these methods using standard multi-object tracking metrics, including Multiple Object Tracking Accuracy (MOTA) and Identity F1-Score (IDF1).

As shown in Table 6, ByteTrack achieves the best overall performance on both evaluation metrics, with MOTA and IDF1 values of 90.7% and 88.2%, respectively. DeepSORT ranks second (MOTA of 84.2% and IDF1 of 79.5%), while the IoU-based tracker exhibits the lowest performance (MOTA of 72.5% and IDF1 of 68.9%). Based on these results, ByteTrack was selected as the tracking method for the proposed system due to its superior tracking accuracy and its ability to maintain consistent object identities across frames in dynamic traffic environments.

4.3. HSV-Based Traffic-Light State Classification

The classification module operates on ROIs corresponding to detected and tracked traffic lights, aiming to determine their final operational state accurately and consistently over time. For each tracked traffic light, the detected bounding box is cropped and converted into HSV color-space, where saturation and brightness thresholds are applied to suppress background noise and illumination artifacts. Color-specific masks are then used to identify red, yellow, and green signal activations at the frame level. To capture temporal behavior, each traffic light maintains a short history buffer of recent color observations, which enables the system to distinguish between static and flashing signals.

Flashing states are inferred using a fuzzy temporal rule-based strategy that analyzes the co-occurrence and alternation of on and off states over time, rather than relying on strict periodicity. This approach allows the system to handle detection noise and partial occlusions while maintaining stable classification. Separate rule sets are applied for circular and arrow traffic lights to reflect their operational differences, and when the current frame is ambiguous, the last valid state is preserved to ensure temporal continuity. Table 7 presents a quantitative comparison between three color-based methods, RGB [63], CIELAB (L*a*b*) [64,65,66], and HSV color-space [66,67], in terms of precision, recall, and F1-score.

The RGB-based approach exhibits the weakest performance, achieving an F1-score of 85.3%, primarily due to its high sensitivity to illumination variations, shadows, and brightness changes. Since RGB directly encodes intensity information, it struggles to reliably separate color information under real-world lighting conditions. The CIELAB color-space significantly improves performance, achieving an F1-score of 94.7%. This improvement is attributed to its perceptual uniformity and partial separation of luminance from chromatic components. However, despite this improvement, CIELAB still exhibits moderate sensitivity to illumination changes and color overlap in complex outdoor scenes. The HSV color-space achieves the best overall performance, with a precision of 97.5%, recall of 95.9%, and an F1-score of 96.7%. The superior performance of HSV is due to its explicit separation of color (Hue) from illumination-related components (Saturation and Value), making it more robust to brightness fluctuations, shadows, and sensor noise. Based on these results, HSV was selected as the core color representation for traffic light state extraction in the proposed framework.

4.4. Impact of Morphological Filtering in the HSV-Based Traffic-Light Classification Module

This experiment analyzes the contribution of the morphological filtering stage in the proposed HSV-based traffic-light classification module. The experiment isolates the effect of the morphological operations while keeping all other components unchanged, including the object detector, tracking mechanism, HSV thresholds, inner-ROI cropping, and temporal buffering. After generating the binary color masks in the HSV color-space, the masks may contain fragmented regions and small holes due to illumination variations, reflections, and sensor noise. To refine these masks, a morphological closing operation (dilation followed by erosion) is applied to each color mask. Table 8 presents the performance of the proposed HybridSignalNet framework in terms of precision, recall, and F1-score with and without morphological filtering.

The results demonstrate that morphological filtering improves all evaluation metrics, increasing the F1-score from 93.8% to 96.7%. The most noticeable improvement appears in recall, which indicates that the system becomes more capable of consistently detecting active signal regions. Therefore, morphological filtering is not merely a post-processing technique but a critical component that enhances the robustness and temporal stability of the proposed traffic-light state classification framework.
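To make the closing operation concrete, the following dependency-free sketch applies dilation followed by erosion to a binary mask. The 3 × 3 square structuring element and the border handling are assumptions, since the kernel used in this work is not stated here; a production system would typically use an optimized image-processing library instead.

```python
def _filter3x3(mask, combine, pad):
    """Apply `combine` (max or min) over each 3x3 neighborhood of a binary mask."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[ny][nx] if 0 <= ny < h and 0 <= nx < w else pad
                for ny in (y - 1, y, y + 1)
                for nx in (x - 1, x, x + 1)
            ]
            out[y][x] = combine(window)
    return out


def dilate(mask):
    # Out-of-image neighbors count as 0 so they never switch a pixel on.
    return _filter3x3(mask, max, 0)


def erode(mask):
    # Out-of-image neighbors count as 1 so the image border is not eroded.
    return _filter3x3(mask, min, 1)


def close_mask(mask):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(mask))
```

Applied to a 3 × 3 ring of active pixels with a one-pixel hole in its center, `close_mask` fills the hole while leaving the outer extent of the region unchanged, which is the gap-filling behavior that the recall improvement is attributed to.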

4.5. Impact of Temporal Buffer Length on Traffic-Light Classification Robustness and Startup Latency

To robustly recognize traffic-light states, the proposed framework maintains a fixed-length temporal buffer for each tracked traffic light, storing recent frame-level color observations obtained from the HSV-based module. The buffer length determines the amount of temporal evidence available to infer the characteristic ON/OFF temporal alternation of flashing signals. A shorter buffer enables faster initial decisions but may fail to capture a complete flashing cycle under noise, glare, or intermittent detections. In contrast, a longer buffer improves temporal stability by aggregating more observations; however, it increases startup latency because the system must accumulate sufficient history before producing the first reliable state decision.

To justify the selected buffer length, we evaluated the system using buffer sizes of 9, 15, 27, and 45 frames while keeping all other components unchanged. For each configuration, we measured (i) the F1-score of overall traffic-light state classification (including both static and flashing states for circular and arrow traffic lights), and (ii) the startup latency, defined as the time required for the system to produce the first valid state decision. The results are summarized in Table 9.

The results indicate that increasing the buffer length improves classification robustness, as the system becomes less sensitive to transient HSV noise and occasional missed ON/OFF observations. However, this improvement is accompanied by increased startup latency because more frames must be accumulated before stable temporal reasoning can be established. The selected buffer length of 27 frames provides the most balanced trade-off, achieving near-optimal classification performance while maintaining a sub-second startup delay. This balance is particularly important for real-time driving applications, where both reliability and timely decision-making are required.
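The relationship between buffer length and startup delay can be approximated with simple arithmetic. The sketch below assumes that the startup latency is dominated by the time needed to fill the buffer and that the video runs at 30 frames per second; both are assumptions for illustration, and the measured latencies are those reported in Table 9.

```python
def buffer_fill_time_s(buffer_len, fps):
    """Time in seconds needed to accumulate `buffer_len` frames at `fps` frames/s."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return buffer_len / fps


ASSUMED_FPS = 30.0  # assumed frame rate, not stated in this section
for n in (9, 15, 27, 45):
    print(n, "frames ->", round(buffer_fill_time_s(n, ASSUMED_FPS), 2), "s")
```

Under these assumptions a 27-frame buffer fills in about 0.9 s, consistent with the sub-second startup delay noted above, whereas a 45-frame buffer would need about 1.5 s.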

4.6. Impact of Inner-ROI Cropping and Saturation-Value (SV) Thresholding on System Performance

To evaluate the contribution of the proposed robustness mechanisms, a comparative analysis was conducted focusing on inner-ROI cropping and SV thresholding within the HSV-based traffic-light state extraction module. Four configurations of the HybridSignalNet framework were considered: (1) the full model with both inner-ROI cropping and SV thresholding enabled, (2) a variant without inner-ROI cropping, (3) a variant without SV thresholding, and (4) a variant without both mechanisms. All configurations were evaluated on the same test dataset under diverse real-world conditions, including illumination variations, glare, and partial occlusions.

The results are presented in Table 10. The full model achieves the highest performance, with an F1-score of 96.7%. Removing inner-ROI cropping results in a noticeable performance drop, indicating the importance of suppressing background interference within the detected region. Similarly, removing SV thresholding reduces robustness under low-saturation and low-illumination conditions. The largest performance degradation is observed when both mechanisms are removed, where the F1-score decreases to 93.5%. This confirms that the two components provide complementary benefits, where inner-ROI cropping reduces spatial noise, while SV thresholding mitigates illumination-related artifacts. These findings demonstrate the effectiveness of the proposed design in improving traffic light state classification under challenging real-world conditions.
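One plausible reading of inner-ROI cropping is that the detected bounding box is shrunk toward its center before color analysis, so that housing edges and background pixels are excluded. The sketch below illustrates that interpretation only; the function name `inner_roi` and the `keep` fraction are hypothetical and are not taken from this work.

```python
def inner_roi(box, keep=0.6):
    """Shrink an (x1, y1, x2, y2) box around its center.

    `keep` is the fraction of the width and height that is retained
    (an assumed parameter, not a value reported in the paper).
    """
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    x1, y1, x2, y2 = box
    dx = int(round((x2 - x1) * (1.0 - keep) / 2.0))  # margin removed on each side
    dy = int(round((y2 - y1) * (1.0 - keep) / 2.0))
    return (x1 + dx, y1 + dy, x2 - dx, y2 - dy)
```

For instance, `inner_roi((100, 50, 140, 150), keep=0.5)` returns `(110, 75, 130, 125)`, retaining the central half of the box in each dimension.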

4.7. Traffic-Light State Classification and Error Analysis

To provide a comprehensive evaluation of the proposed HybridSignalNet framework, Table 11 reports the classification performance across all traffic light states in terms of precision, recall, and F1-score. These results reflect the final state-level classification performance obtained by integrating HSV-based color feature extraction with temporal buffering and rule-based reasoning. The evaluation demonstrates consistently high accuracy across both static and dynamic (flashing) signal states, highlighting the effectiveness and robustness of the proposed hybrid spatio-temporal classification strategy.

For static traffic light states, including Red, Green, and Yellow, the system achieves consistently high performance, with F1-scores ranging from 97.0% to 98.4%. These results indicate reliable discrimination between dominant color states across varying lighting and environmental conditions. For flashing traffic light states, such as Flashing Red, Flashing Yellow, and Flashing Yellow Arrow, the performance is slightly lower but remains strong, with F1-scores between 94.5% and 95.4%. The modest performance reduction is expected due to the inherent temporal nature of flashing signals, where alternating “on” and “off” patterns introduce ambiguity at the frame level. Nevertheless, the temporal buffering mechanism effectively captures these periodic transitions, enabling accurate state inference. Arrow-based traffic light states, including Red Arrow, Yellow Arrow, and Green Arrow, also demonstrate robust performance, achieving F1-scores above 97.0%, confirming that the proposed approach generalizes effectively across different signal geometries. Overall, the system achieves an average precision of 97.5%, recall of 95.9%, and F1-score of 96.7%, demonstrating that the HSV-based classification combined with temporal reasoning provides a reliable solution for traffic-light state recognition, particularly for challenging flashing scenarios.

To further analyze the classification performance across all traffic light states, a confusion matrix is presented in Figure 3. As shown in Figure 3, the proposed HybridSignalNet demonstrates consistently high classification performance across all classes, with diagonal values ranging from 0.94 to 0.98, indicating strong class separability and reliable state recognition. The majority of predictions are concentrated along the diagonal, while minor misclassifications are primarily observed between temporally or visually similar states. For example, flashing red is occasionally confused with static red (3%), which is expected due to frame-level ambiguity when the signal is captured during its active phase. Similarly, confusion between yellow and flashing yellow (2–3%) occurs under low-light or reduced visibility conditions, where HSV-based color differentiation becomes less distinctive. In addition, arrow-based traffic lights exhibit slight misclassification (1–2%), primarily due to their smaller spatial footprint and sensitivity to occlusion and motion blur. Partial occlusion by surrounding objects or vehicles can distort the arrow shape, while glare from reflective surfaces may further degrade edge clarity, leading to occasional confusion with non-arrow states. These error patterns are consistent with real-world challenges in traffic-light classification and indicate that most errors arise from challenging environmental conditions rather than fundamental model limitations. Overall, the individual confusion rates remain low (3% or less for any pair of states), demonstrating the robustness and stability of the proposed system under diverse and challenging real-world conditions.

4.8. Robustness Analysis

To provide a more comprehensive evaluation of system robustness, additional analysis is conducted under practical conditions, including tracking errors, HSV threshold sensitivity, and lighting variations.

Effect of Tracking Errors: The proposed framework relies on ByteTrack to maintain identity consistency across frames. While tracking errors such as ID switches or missed associations may disrupt the temporal buffer, their impact on classification is mitigated by the use of fixed-length temporal aggregation and rule-based reasoning. Since classification decisions are based on buffered observations rather than single-frame predictions, occasional tracking inconsistencies have limited influence on the final state inference. This design effectively acts as a temporal smoothing mechanism. However, severe or persistent tracking failures may degrade classification performance by corrupting the temporal history.

Sensitivity to HSV Thresholds: The HSV thresholds were determined through a systematic validation procedure, as described in Section 3. However, beyond their selection, it is important to analyze their impact on system robustness. Experimental observations indicate that the proposed framework remains stable under moderate variations in threshold values, as the temporal reasoning mechanism compensates for minor frame-to-frame color fluctuations. In particular, the integration of SV thresholding plays a critical role in suppressing low-intensity noise and improving discrimination under varying illumination conditions. Nevertheless, extreme deviations in threshold selection can lead to inaccurate color segmentation and subsequent misclassification. This behavior highlights the inherent trade-off between sensitivity and robustness in color-based approaches and underscores the importance of careful threshold calibration for reliable deployment.

Impact of Lighting Variations: The dataset encompasses a diverse range of illumination conditions, including daytime variations, shadows, and glare, enabling evaluation under realistic visual environments. To enhance robustness against such variability, the proposed framework incorporates preprocessing techniques such as SV thresholding and morphological filtering. Experimental results presented in the preceding sections indicate that these components contribute to stable performance under moderate lighting variations. However, under extreme conditions—such as very low-light environments or severe glare—classification accuracy may still be affected due to degradation of reliable color information. This limitation highlights the inherent dependency of color-based approaches on illumination quality.

Overall, these observations demonstrate that the proposed framework maintains stable performance under moderate variations and disturbances, while highlighting potential limitations under extreme conditions.

4.9. Performance Comparison with Previous Works

Most existing traffic-light recognition studies primarily focus on static signal states (red, yellow, and green) and provide limited support for dynamic signal states. In particular, the recognition of flashing traffic lights, especially flashing arrow signals, remains largely underexplored in real-time perception systems. In this context, the proposed framework, HybridSignalNet, introduces a unified real-time solution capable of detecting and classifying both static and flashing traffic lights, including circular and arrow signals, within a single integrated framework. This design enables comprehensive traffic-signal understanding at intersections while maintaining real-time performance.

To evaluate its effectiveness, HybridSignalNet is compared against several recent state-of-the-art real-time traffic-light detection models reported in the literature. To ensure a fair comparison, all evaluated detection models were trained using identical experimental conditions. Each model was trained on the same dataset, using the same train/validation/test splits, image resolution (640 × 640), and batch size (16). No model received additional tuning or dataset-specific optimization. All experiments were conducted on the same hardware platform and evaluated using the same metrics. Default hyperparameters provided by the Ultralytics framework were used consistently across all YOLO-based models to avoid bias toward any specific architecture. Table 12 presents a performance comparison in terms of precision, recall, and F1-score for static traffic-light detection and classification.

As shown in Table 12, the proposed HybridSignalNet achieves the highest overall performance among all compared methods for static traffic-light detection and classification, attaining a precision of 98.4%, recall of 96.8%, and an F1-score of 97.6%. This performance surpasses that of Wan et al. [71], which employed YOLOv9 and reported an F1-score of 96.7%, as well as the YOLOv7-based approach by De Guia and Deveraj [70], which achieved an F1-score of 95.9%.

The YOLOv5-based method of Zhu and Yan [69] demonstrated moderate performance with an F1-score of 95.1%, while the lowest performance was observed for Naimi et al. [68], whose modified SSD architecture achieved an F1-score of 92.7%. Beyond these quantitative results, the compared methods are primarily designed to handle static traffic-light states, whereas HybridSignalNet additionally supports recognition of flashing signals, including both circular and directional arrow flashing lights, without compromising real-time operation. This capability distinguishes the proposed framework from existing approaches and underscores its suitability for deployment in real-world autonomous driving environments.

While Table 12 evaluates static traffic-light recognition performance, it does not capture the ability of models to handle temporal behaviors such as flashing signals. Therefore, Table 13 presents a comparison with state-of-the-art temporal models.

As shown in Table 13, HybridSignalNet demonstrates competitive classification performance compared to recent temporal deep learning approaches such as FlashLightNet, which employs a CNN–LSTM architecture for spatiotemporal modeling. In addition to improved classification accuracy, the proposed framework achieves significantly lower latency while maintaining competitive throughput, owing to its non-sequential, rule-based temporal reasoning mechanism. Unlike LSTM-based models, which rely on sequential processing, the proposed approach enables efficient parallel frame processing, making it well-suited for latency-sensitive, safety-critical applications. This design enables robust recognition of both static and flashing traffic-light states, including arrow signals, while maintaining strong performance (F1-score of 96.7%) and real-time efficiency. Furthermore, the proposed framework benefits from explicit object tracking (ByteTrack), which ensures temporal consistency and reduces the frame-to-frame instability commonly observed in frame-based or sequence-based models. These results highlight the effectiveness of the proposed approach in handling complex real-world traffic scenarios and support its suitability for deployment in ITSs and autonomous driving applications.
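To make the idea of non-sequential, rule-based temporal reasoning concrete, the following is a minimal sketch of how a per-track buffer of frame-level colour labels could be turned into a steady or flashing decision. It is not the authors' implementation: the class name `TrackStateBuffer`, the buffer length, and all thresholds are assumptions chosen for illustration.

```python
from collections import Counter, deque

BUFFER_LEN = 27         # assumed: roughly 0.9 s of frames at 30 fps
MIN_TRANSITIONS = 2     # assumed: on/off changes needed to call a signal flashing
MIN_ON_FRACTION = 0.2   # assumed: below this, too few lit frames to decide
MAX_ON_FRACTION = 0.8   # assumed: above this, the signal is treated as steady


class TrackStateBuffer:
    """Per-track buffer of frame labels: 'red', 'yellow', 'green' or 'off'."""

    def __init__(self, maxlen: int = BUFFER_LEN) -> None:
        self.labels = deque(maxlen=maxlen)
        self.last_valid = None  # returned whenever no new decision can be made

    def update(self, label: str):
        """Add one frame label and return the current state decision."""
        self.labels.append(label)
        if len(self.labels) < self.labels.maxlen:
            return self.last_valid  # buffer is still filling
        lit = [name for name in self.labels if name != "off"]
        if not lit:
            return self.last_valid  # nothing lit: keep the previous decision
        colour = Counter(lit).most_common(1)[0][0]
        on = [name == colour for name in self.labels]
        on_fraction = sum(on) / len(on)
        if on_fraction < MIN_ON_FRACTION:
            return self.last_valid  # too few lit frames to decide
        transitions = sum(a != b for a, b in zip(on, on[1:]))
        flashing = transitions >= MIN_TRANSITIONS and on_fraction <= MAX_ON_FRACTION
        self.last_valid = f"flashing {colour}" if flashing else f"steady {colour}"
        return self.last_valid


# Usage: a yellow light that is lit for 5 frames and dark for 5 frames.
buf = TrackStateBuffer()
pattern = (["yellow"] * 5 + ["off"] * 5) * 3
for label in pattern[:BUFFER_LEN]:
    state = buf.update(label)
print(state)  # flashing yellow
```

Because each decision depends only on simple counts over a fixed window, no recurrent state has to be propagated frame by frame, which is the property the comparison with LSTM-based models relies on.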

4.10. Computational Complexity Analysis

To further evaluate the efficiency and real-time capability of the proposed HybridSignalNet, a detailed computational complexity analysis is presented. This analysis includes the number of parameters, floating point operations (FLOPs), and per-module latency breakdown for each component of the system.

As shown in Table 14, the majority of the computational cost is attributed to the YOLOv11n detection module, while the ByteTrack tracking and HSV-based reasoning modules introduce minimal overhead. Latency measurements were conducted on an NVIDIA A100 GPU using an input resolution of 640 × 640. These results confirm that HybridSignalNet achieves high computational efficiency while maintaining real-time performance. Furthermore, the modular design enables efficient processing without relying on computationally expensive recurrent architectures, making the proposed framework well-suited for deployment in real-world ITSs and autonomous driving applications.
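The low overhead of the HSV-based stage is plausible because colour labelling reduces to a few comparisons per pixel. The sketch below illustrates this with the Python standard library only. The hue ranges, saturation and value thresholds, and the function names `pixel_colour` and `roi_colour` are hypothetical and are not the values used in the paper.

```python
import colorsys
from collections import Counter

# Hypothetical thresholds: hue in degrees [0, 360), saturation and value in [0, 1].
MIN_SATURATION = 0.5
MIN_VALUE = 0.5
HUE_RANGES = {
    "red": [(0, 20), (340, 360)],
    "yellow": [(35, 70)],
    "green": [(90, 180)],
}


def pixel_colour(r: int, g: int, b: int) -> str:
    """Map one 8-bit RGB pixel to 'red', 'yellow', 'green' or 'off'."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < MIN_SATURATION or v < MIN_VALUE:
        return "off"  # dark or washed-out pixel
    hue = h * 360
    for name, ranges in HUE_RANGES.items():
        if any(lo <= hue < hi for lo, hi in ranges):
            return name
    return "off"


def roi_colour(pixels, min_lit_fraction: float = 0.05) -> str:
    """Label a region of interest by its most common lit colour."""
    pixels = list(pixels)
    counts = Counter(pixel_colour(*p) for p in pixels)
    counts.pop("off", None)
    if not counts:
        return "off"
    name, n = counts.most_common(1)[0]
    return name if n / len(pixels) >= min_lit_fraction else "off"


# Usage: a mostly dark crop with a small bright-red lamp.
crop = [(255, 0, 0)] * 10 + [(10, 10, 10)] * 90
print(roi_colour(crop))  # red
```

In a real pipeline the same thresholds would normally be applied to whole image arrays at once rather than pixel by pixel; the per-pixel form is used here only to keep the example dependency-free.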

4.11. Performance Samples of HybridSignalNet

This section presents qualitative samples demonstrating the performance of the proposed HybridSignalNet in detecting and recognizing traffic-related objects, with a particular emphasis on traffic-light state recognition, including both static and flashing states. The key novel contribution of this work lies in the accurate classification of all traffic-light states, especially flashing circular and arrow signals, which are rarely addressed in prior work.

Figure 4 and Figure 5 provide representative examples illustrating the capability of HybridSignalNet to detect and classify both static and flashing traffic lights. Specifically, Figure 4 demonstrates the system’s ability to recognize static arrow traffic-light states (red, yellow, and green arrows), while Figure 5 highlights the accurate detection and classification of flashing arrow signals and flashing circular traffic lights under real-world conditions.

Based on the results shown in Figure 4 and Figure 5, the proposed model exhibits high performance and robustness in detecting and classifying traffic-light states. Detected traffic lights are highlighted using green bounding boxes, and their corresponding states are correctly identified through clear textual labels displayed above each traffic light. These results confirm the effectiveness of HybridSignalNet in handling both static and temporal (flashing) traffic-light patterns, validating its suitability for real-time intelligent transportation and autonomous-driving applications.

4.12. Real-Time Performance Analysis

The performance of HybridSignalNet is highly competitive. However, it is essential to verify that the system operates under real-time constraints. As shown in Table 15, the proposed system satisfies real-time performance requirements, where the threshold values are selected based on previous related works. In [60], real-time performance is evaluated using frames per second (fps) with a 30 fps threshold, and the system proposed in this paper meets this requirement, operating at ≥30 fps. In [72], the system achieves a detection speed of 23.81 fps, which is considered suitable for many real-time applications, including driver-assistance systems. In [73], the authors define real-time performance as achieving an inference speed faster than 30 fps, and their system achieves 48.8 fps (with RFB-c) or 58.1 fps (without RFB-c). Similarly, in [74], real-time performance is measured using inference speed in fps, with real-time operation defined as achieving at least 30 fps.

Specifically, the proposed HybridSignalNet produces the first traffic-light state decision within 900 ms, which is below the defined real-time threshold of 1.00 s. This initial delay corresponds to a one-time startup cost required for temporal buffer filling. After the first decision, the system operates in a steady-state mode and requires only 6.77 ms to generate subsequent traffic light state decisions, which is well within the 10 ms latency requirement. Furthermore, the system achieves a sustained end-to-end throughput of 47.73 fps, exceeding the real-time requirement of 30 fps. These results demonstrate that HybridSignalNet is capable of reliable real-time traffic-light detection and classification, even when temporal analysis for flashing signals is required.
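The timing figures above can be related to one another with simple arithmetic. The sketch below assumes that the temporal buffer fills at the 30 fps camera rate listed in Table 2; that link between the 900 ms start-up delay and a frame count is an inference for illustration, not a statement from the paper.

```python
CAMERA_FPS = 30.0          # dataset frame rate (Table 2)
FIRST_DECISION_MS = 900.0  # reported start-up delay
THROUGHPUT_FPS = 47.73     # reported end-to-end throughput

# Frames that arrive during the start-up delay at the camera rate.
startup_frames = round(FIRST_DECISION_MS / 1000.0 * CAMERA_FPS)

# Average end-to-end processing time per frame implied by the throughput.
frame_time_ms = 1000.0 / THROUGHPUT_FPS

# Time available per frame if the camera delivers 30 fps.
frame_budget_ms = 1000.0 / CAMERA_FPS

print(startup_frames)                   # 27
print(round(frame_time_ms, 2))          # 20.95
print(round(frame_budget_ms, 2))        # 33.33
print(frame_time_ms < frame_budget_ms)  # True
```

Under these assumptions the pipeline needs about 21 ms per frame against a budget of about 33 ms, which is consistent with the claim that throughput exceeds the 30 fps requirement. The reported 6.77 ms steady-state decision time covers only part of that per-frame cost.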

Although the proposed HybridSignalNet framework demonstrates strong robustness and high accuracy across multiple traffic scenarios, certain challenging conditions may still affect its performance. Detection accuracy can degrade when traffic lights appear extremely small or distant in the image, particularly under long-range highway conditions, where fine spatial details are limited. Similarly, heavy occlusions caused by trees, large vehicles, or structural elements may interrupt consistent detection and tracking, potentially leading to temporary identity switches or incomplete temporal buffers. In addition, non-standard traffic signal geometries or uncommon regional signal designs may reduce classification stability. Furthermore, although the system achieves real-time performance, the reported results were obtained using an NVIDIA A100 GPU, which is a high-end server-grade platform. The reported FPS and latency should therefore be interpreted as an upper-bound performance estimate rather than a direct indicator of embedded deployment capability. Nevertheless, the framework is designed to be computationally efficient: the use of YOLOv11n and rule-based temporal reasoning reduces computational complexity, which makes the system a promising candidate for resource-constrained platforms, although this remains to be verified. These limitations highlight important directions for future work. In particular, future research will focus on improving robustness under extreme visual conditions and evaluating the framework on embedded automotive hardware (e.g., NVIDIA Jetson platforms), including detailed analysis of latency, throughput, memory usage, and power efficiency.

Despite the strong performance of the proposed framework across diverse scenarios, certain limitations remain. In particular, variations in traffic signal designs and regulations across different countries may introduce domain shifts that affect generalization. Moreover, the current dataset does not extensively cover adverse weather conditions such as rain, fog, and snow. Future work will focus on cross-dataset evaluation, domain adaptation techniques, and improving robustness under diverse environmental and geographic conditions.

5. Conclusions

This paper presents HybridSignalNet, a unified real-time traffic signal perception framework that integrates deep learning-based object detection with efficient multi-object tracking and temporal color-based reasoning for comprehensive traffic signal understanding. By decoupling spatial detection from temporal state inference, the proposed system effectively addresses limitations of conventional single-frame or purely learning-based approaches, particularly for arrow-based and flashing traffic light signals that require temporal context. The use of YOLOv11n enables accurate detection of a diverse set of traffic-related objects, while ByteTrack ensures stable identity preservation across frames with low computational overhead. The HSV-based color analysis, combined with temporal buffering and rule-based reasoning, provides a transparent and robust mechanism for traffic-light state classification under varying illumination and environmental conditions. Experimental validation on a custom real-world dataset demonstrates strong performance across all detected object classes, as well as reliable recognition of both static and flashing traffic-light states, thereby confirming the effectiveness of the proposed hybrid architecture, HybridSignalNet, for real-time deployment. Importantly, the framework achieves this performance without relying on complex recurrent network architectures or extensive retraining procedures, which supports its practical implementation in ITSs and autonomous driving platforms.

Despite the strong performance of the proposed HybridSignalNet, several limitations remain. In particular, the current framework assumes that the detection and tracking modules provide reliable ROIs for temporal reasoning. In practice, missed detections or bounding box instability may propagate to the temporal module and affect state inference. Although the proposed system mitigates these effects through temporal buffering, track continuity, and last-valid-state preservation, a dedicated quantitative analysis of error propagation from detection and tracking into the temporal reasoning stage was not included in this study. This represents an important direction for future work.

Future work will extend the proposed framework to additional traffic signal types and more complex intersection scenarios, while exploring the integration of multi-sensor inputs such as radar and LiDAR to improve robustness under challenging conditions. To enhance contextual awareness, future versions will incorporate High-Definition (HD) maps and precise vehicle localization, including GNSS/INS fusion and visual SLAM, enabling predictive region-of-interest generation and reducing computational overhead in multi-signal environments. Another direction is to incorporate lane directional cues to better associate detected traffic lights with the vehicle’s path, reducing false positives without relying on HD maps [75]. Additionally, lightweight JPEG-based methods will be explored to estimate traffic congestion from images [76]. Such methods would extend HybridSignalNet’s multi-class roadway perception with traffic-density awareness at low overhead, supporting safer intersection navigation and V2X cooperative routing without additional sensors. Finally, we will investigate multi-vehicle cooperative perception leveraging V2X communication to overcome occlusions and extend effective sensing range through shared detection and classification information.

Author Contributions

Conceptualization, methodology, validation, formal analysis, investigation, L.B.K.; Resources, data curation, validation, investigation, data organization, writing—original draft preparation, L.B.K. and M.R.; Writing—manuscript, review and editing, formal analysis, visualization, L.B.K., M.R., I.A.E. and J.E.B.; Supervision, project administration, J.E.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset generated and analyzed in this study is not publicly available at this time due to ongoing research activities and pending formal permissions for distribution. However, it can be obtained from the corresponding author upon reasonable request, subject to a formal data-sharing agreement. The dataset will be made publicly accessible once the related projects are completed and all necessary approvals have been secured.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, L.; Teng, S.; Li, B.; Na, X.; Li, Y.; Li, Z.; Wang, J.; Cao, D.; Zheng, N.; Wang, F.Y. Milestones in Autonomous Driving and Intelligent Vehicles—Part II: Perception and Planning. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 6401–6415.
Bresson, G.; Alsayed, Z.; Yu, L.; Glaser, S. Simultaneous Localization and Mapping: A Survey of Current Trends in Autonomous Driving. IEEE Trans. Intell. Veh. 2017, 2, 194–220.
Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixao, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816.
Rahman, M.; Khaled, L.B.; Ebu, I.A.; Ball, J.E. V2I communication of autonomous vehicles with traffic light infrastructures of multiple intersections to optimize vehicle speed. In Proceedings of the Autonomous Systems: Sensors, Processing, and Security for Ground, Air, Sea, and Space Vehicles and Infrastructure 2025; SPIE: Bellingham, WA, USA, 2025; Volume 13474, pp. 32–48.
Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308.
Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Jensen, M.B.; Philipsen, M.P.; Møgelmose, A.; Moeslund, T.B.; Trivedi, M.M. Vision for looking at traffic lights: Issues, survey, and perspectives. IEEE Trans. Intell. Transp. Syst. 2016, 17, 1800–1815.
Behrendt, K.; Novak, L.; Botros, R. A deep learning approach to traffic lights: Detection, tracking, and classification. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2017; pp. 1370–1377.
Li, Z.; Zeng, Q.; Liu, Y.; Liu, J.; Li, L. An improved traffic lights recognition algorithm for autonomous driving in complex scenarios. Int. J. Distrib. Sens. Netw. 2021, 17, 15501477211018374.
Kim, H.K.; Yoo, K.Y.; Park, J.H.; Jung, H.Y. Traffic light recognition based on binary semantic segmentation network. Sensors 2019, 19, 1700.
Soetedjo, A.; Yamada, K. Fast and Robust Traffic Sign Detection. In Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics; IEEE: New York, NY, USA, 2005; Volume 2, pp. 1341–1346.
Müller, J.; Dietmayer, K. Detecting traffic lights by single shot detection. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2018; pp. 266–273.
Weber, M.; Wolf, P.; Zöllner, J.M. DeepTLR: A single deep convolutional network for detection and classification of traffic lights. In Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2016; pp. 342–348.
Weber, M.; Huber, M.; Zöllner, J.M. HDTLR: A CNN based hierarchical detector for traffic lights. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2018; pp. 255–260.
Wang, Q.; Zhang, Q.; Liang, X.; Wang, Y.; Zhou, C.; Mikulovich, V.I. Traffic lights detection and recognition method based on the improved YOLOv4 algorithm. Sensors 2021, 22, 200.
Liu, P.; Li, T. Traffic light detection based on depth improved yolov5. In Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE); IEEE: New York, NY, USA, 2023; pp. 395–399.
Pavlitska, S.; Lambing, N.; Bangaru, A.K.; Zöllner, J.M. Traffic light recognition using convolutional neural networks: A survey. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2023; pp. 2790–2796.
Greer, R.; Gopalkrishnan, A.; Landgren, J.; Rakla, L.; Gopalan, A.; Trivedi, M. Robust traffic light detection using salience-sensitive loss: Computational framework and evaluations. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2023; pp. 1–7.
Tammisetti, V.; Stettinger, G.; Cuellar, M.P.; Molina-Solana, M. Meta-YOLOv8: Meta-learning-enhanced YOLOv8 for precise traffic light color detection in ADAS. Electronics 2025, 14, 468.
Yildiz, G.; Ulu, A.; Dızdaroğlu, B.; Yildiz, D. Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition. IEEE Access 2023, 11, 69536–69552.
Zheng, Y.; Jiang, W. Evaluation of vision transformers for traffic sign classification. Wirel. Commun. Mob. Comput. 2022, 2022, 3041117.
Mingwin, S.; Shisu, Y.; Wanwag, Y.; Huing, S. Revolutionizing traffic sign recognition: Unveiling the potential of vision transformers. arXiv 2024, arXiv:2404.19066.
Naz, I.; Shah, J.H.; Tahir, A.; Marri, M.R.; Saleem, R.; Saeed, M.E.S. Attention to detail: A conditional multi-head transformer for traffic sign recognition. PLoS ONE 2025, 20, e0335341.
Yuen, M.C.; Ng, S.C.; Leung, M.F. A Competitive Mechanism Multi-Objective Particle Swarm Optimization Algorithm and Its Application to Signalized Traffic Problem. Cybern. Syst. 2021, 52, 73–104.
Lu, Y.; Lu, J.; Zhang, S.; Hall, P. Traffic signal detection and classification in street views using an attention model. Comput. Vis. Media 2018, 4, 253–266.
Wang, X.; Jiang, T.; Xie, Y. A method of traffic light status recognition based on deep learning. In Proceedings of the 2018 International Conference on Robotics, Control and Automation Engineering; Association for Computing Machinery: New York, NY, USA, 2018; pp. 166–170.
Haltakov, V.; Mayr, J.; Unger, C.; Ilic, S. Semantic segmentation based traffic light detection at day and at night. In Proceedings of the German Conference on Pattern Recognition; Springer: Cham, Switzerland, 2015; pp. 446–457.
Niu, C.; Li, K. Traffic light detection and recognition method based on YOLOv5s and AlexNet. Appl. Sci. 2022, 12, 10808.
Ouyang, Z.; Niu, J.; Liu, Y.; Guizani, M. Deep CNN-based real-time traffic light detector for self-driving vehicles. IEEE Trans. Mob. Comput. 2019, 19, 300–313.
Jayasinghe, O.; Hemachandra, S.; Anhettigama, D.; Kariyawasam, S.; Wickremasinghe, T.; Ekanayake, C.; Rodrigo, R.; Jayasekara, P. Towards real-time traffic sign and traffic light detection on embedded systems. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2022; pp. 723–728.
Chen, Y.C.; Lin, H.Y. Traffic Light Detection and Recognition using Ensemble Learning with Color-Based Data Augmentation. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2024; pp. 3199–3204.
Lin, H.Y.; Tsai, M.Y. Traffic light detection for nighttime driving with log-polar transform incorporated learning. Results Eng. 2025, 25, 103783.
Al Amin, R.; Hasan, M.; Wiese, V.; Obermaisser, R. FPGA-based real-time object detection and classification system using YOLO for edge computing. IEEE Access 2024, 12, 73268–73278.
Al Amin, R.; Hossain, M.S.A.; Schmid, L.T.; Wiese, V.; Obermaisser, R. Power efficient real-time traffic signal classification for autonomous driving using FPGAs. In Proceedings of the 2024 6th International Conference on Communications, Signal Processing, and their Applications (ICCSPA); IEEE: New York, NY, USA, 2024; pp. 1–5.
de Mello, J.P.V.; Tabelini, L.; Berriel, R.F.; Paixao, T.M.; De Souza, A.F.; Badue, C.; Sebe, N.; Oliveira-Santos, T. Deep traffic light detection by overlaying synthetic context on arbitrary natural images. Comput. Graph. 2021, 94, 76–86.
Gong, J.; Jiang, Y.; Xiong, G.; Guan, C.; Tao, G.; Chen, H. The recognition and tracking of traffic lights based on color segmentation and camshift for intelligent vehicles. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium; IEEE: New York, NY, USA, 2010; pp. 431–435.
Zhang, Y.; Xue, J.; Zhang, G.; Zhang, Y.; Zheng, N. A multi-feature fusion based traffic light recognition algorithm for intelligent vehicles. In Proceedings of the 33rd Chinese Control Conference; IEEE: New York, NY, USA, 2014; pp. 4924–4929.
Bach, M.; Reuter, S.; Dietmayer, K. Multi-camera traffic light recognition using a classifying Labeled Multi-Bernoulli filter. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2017; pp. 1045–1051.
Khaled, L.B.; Rahman, M.; Ball, J.E. Robust traffic light, road sign, and lane marking recognition including flashing red and yellow lights using deep learning techniques for intersection navigation. In Proceedings of the Autonomous Systems: Sensors, Processing, and Security for Ground, Air, Sea, and Space Vehicles and Infrastructure 2025; SPIE: Bellingham, WA, USA, 2025; Volume 13474, p. 1347402.
Wu, S.; Amenta, N.; Zhou, J.; Papais, S.; Kelly, J. aUToLights: A robust multi-camera traffic light detection and tracking system. In Proceedings of the 2023 20th Conference on Robots and Vision (CRV); IEEE: New York, NY, USA, 2023; pp. 89–96.
Khaled, L.B.; Rahman, M.; Ebu, I.A.; Ball, J.E. FlashLightNet: An End-to-End Deep Learning Framework for Real-Time Detection and Classification of Static and Flashing Traffic Light States. Sensors 2025, 25, 6423.
Zhou, J.; Yang, J.; Wu, X.; Zhou, W.; Wang, Y. TrVLR: A transformer-based vehicle light recognition method in vehicle inspection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19995–20005.
Islam, Z.; Abdel-Aty, M.; Mahmoud, N. Using CNN-LSTM to predict signal phasing and timing aided by High-Resolution detector data. Transp. Res. Part C Emerg. Technol. 2022, 141, 103742.
Singh, V.; Sahana, S.K.; Bhattacharjee, V. A novel CNN-GRU-LSTM based deep learning model for accurate traffic prediction. Discov. Comput. 2025, 28, 38.
Deekshetha, H.; Shreyas Madhav, A.; Tyagi, A.K. Traffic prediction using machine learning. In Evolutionary Computing and Mobile Sustainable Networks: Proceedings of ICECMSN 2021; Springer: Singapore, 2022; pp. 969–983.
Abdullah, S.M.; Periyasamy, M.; Kamaludeen, N.A.; Towfek, S.; Marappan, R.; Kidambi Raju, S.; Alharbi, A.H.; Khafaga, D.S. Optimizing traffic flow in smart cities: Soft GRU-based recurrent neural networks for enhanced congestion prediction using deep learning. Sustainability 2023, 15, 5949.
Alawaji, K.; Hedjar, R.; Zuair, M. Traffic sign recognition using multi-task deep learning for self-driving vehicles. Sensors 2024, 24, 3282.
Akbar, M.; Susilawati, I.; Jati, B.S.; Alamsyah, N. Multi-Task Learning for Traffic Sign Recognition using Multi-Scale Convolutional Neural Networks. Int. J. Adv. Data Inf. Syst. 2025, 6, 391–402.
Wu, D.; Liao, M.W.; Zhang, W.T.; Wang, X.G.; Bai, X.; Cheng, W.Q.; Liu, W.Y. Yolop: You only look once for panoptic driving perception. Mach. Intell. Res. 2022, 19, 550–562.
Wang, C.; Chen, X.; Jiao, Z.; Song, S.; Ma, Z. An Improved YOLOP Lane-Line Detection Utilizing Feature Shift Aggregation for Intelligent Agricultural Machinery. Agriculture 2025, 15, 1361.
Jia, X.; You, J.; Zhang, Z.; Yan, J. Drivetransformer: Unified transformer for scalable end-to-end autonomous driving. arXiv 2025, arXiv:2503.07656.
Hwang, J.J.; Xu, R.; Lin, H.; Hung, W.C.; Ji, J.; Choi, K.; Huang, D.; He, T.; Covington, P.; Sapp, B.; et al. Emma: End-to-end multimodal model for autonomous driving. arXiv 2024, arXiv:2410.23262.
Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 17 February 2026).
Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2006; pp. 233–240.
Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 17–35.
Zhu, M. Recall, precision and average precision. Dep. Stat. Actuar. Sci. Univ. Waterloo 2004, 2, 6.
Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309.
Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831.
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–21.
Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2017; pp. 3645–3649.
Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS); IEEE: New York, NY, USA, 2017; pp. 1–6.
Süsstrunk, S.; Buckley, R.; Swen, S. Standard RGB color spaces. In Proceedings of the Color and Imaging Conference; Society of Imaging Science and Technology: Springfield, VA, USA, 1999; Volume 7, pp. 127–134.
Commission Internationale de l’Eclairage (CIE). Colorimetry, 3rd ed.; Technical Report CIE 15.3:2004; CIE Central Bureau: Vienna, Austria, 2004.
Schanda, J. CIE colorimetry and colour displays. In Proceedings of the Color and Imaging Conference; Society of Imaging Science and Technology: Springfield, VA, USA, 1996; Volume 4, pp. 230–234.
Bora, D.J.; Gupta, A.K.; Khan, F.A. Comparing the performance of L* A* B* and HSV color spaces with respect to color image segmentation. arXiv 2015, arXiv:1506.01472.
Smith, A.R. Color gamut transform pairs. ACM Siggraph Comput. Graph. 1978, 12, 12–19.
Naimi, H.; Akilan, T.; Khalid, M.A. Fast traffic sign and light detection using deep learning for automotive applications. In Proceedings of the 2021 IEEE Western New York Image and Signal Processing Workshop (WNYISPW); IEEE: New York, NY, USA, 2021; pp. 1–5.
Zhu, Y.; Yan, W.Q. Traffic sign recognition based on deep learning. Multimed. Tools Appl. 2022, 81, 17779–17791.
De Guia, J.M.; Deveraj, M. Development of traffic light and road sign detection and recognition using deep learning. Development 2024, 15, 942–952.
Wan, H.; Ruan, J.; An, W.; Hao, R. A YOLOv9-Based Night-Time Signal Light Detection Method in Carla Environment. In Proceedings of the 2024 IEEE International Conference on Unmanned Systems (ICUS); IEEE: New York, NY, USA, 2024; pp. 852–857.
Zhang, H.; Qin, L.; Li, J.; Guo, Y.; Zhou, Y.; Zhang, J.; Xu, Z. Real-time detection method for small traffic signs based on Yolov3. IEEE Access 2020, 8, 64145–64156.
Chen, J.; Jia, K.; Chen, W.; Lv, Z.; Zhang, R. A real-time and high-precision method for small traffic-signs recognition. Neural Comput. Appl. 2022, 34, 2233–2245.
Wang, F.; Bai, J.; Wang, M.; Liu, B.; Xue, H.; Chen, J. Robust traffic sign detection in real-world harsh conditions: A pioneering benchmark dataset and attention-based methodology. Eng. Appl. Artif. Intell. 2026, 166, 113526.
Polley, N.; Pavlitska, S.; Boualili, Y.; Rohrbeck, P.; Stiller, P.; Bangaru, A.K.; Zöllner, J.M. TLD-READY: Traffic Light Detection-Relevance Estimation and Deployment Analysis. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2024; pp. 3800–3806.
Wiseman, Y. Real-time monitoring of traffic congestions. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT); IEEE: New York, NY, USA, 2017; pp. 501–505.

Figure 1. Methodological workflow of the proposed system.

Figure 2. Architecture overview of the YOLOv11 detector employed in the proposed HybridSignalNet framework. The model comprises a backbone for hierarchical feature extraction, a feature-aggregation neck for multi-scale representation, and a multi-scale detection head that predicts bounding boxes and class scores.

Figure 3. Confusion matrix for traffic-light state classification across nine classes, including static, flashing, and arrow traffic light states.

Figure 4. Sample results of HybridSignalNet for detecting and classifying static arrow traffic lights (red, yellow, and green).

Figure 5. Sample results of HybridSignalNet for detecting and classifying flashing circular and arrow traffic light states under real-world conditions.

Table 1. Training hyperparameters for YOLOv11n detector.

Parameter	Value
Input Image Resolution	640 × 640 pixels
Batch Size	16
Number of Epochs	580
Early Stopping Patience	100 epochs
Pretraining	COCO pretrained weights (transfer learning)
Loss Function	CIoU + BCE + DFL

Table 2. Custom dataset summary.

Attribute	Specification
Data Source	Real-world traffic videos captured at urban intersections around Mississippi State University and in Starkville and Columbus, Mississippi, USA; simulated traffic-light videos generated using RoadRunner; additional samples from the LISA Traffic Sign Dataset and the Mapillary Traffic Sign Dataset; and videos from the BDD100K dataset.
Number of Videos	57
Frame Rate	30 fps
Camera Device	GoPro HERO8
Number of Classes	13 Classes: Speed Limit, Left Lane, Straight Lane, Do Not Enter, Stop, Straight Left Lane, Right Lane, Straight Right Lane, Pedestrian, Roundabout, Yield, Arrow Traffic Light, and Circular Traffic Light
Dataset Split	Train (70%); Valid (15%); Test (15%)
Total Images	39,000 images
Images per Class	3000

Table 3. Experimental setup and software environment.

Category	Specification
Operating System	Rocky Linux 9.2 (Blue Onyx)
Access Method	Remote SSH via PuTTY
GPU	NVIDIA A100 80GB PCIe
Number of GPUs	6 (multi-GPU cluster)
CPU	Intel Xeon Platinum 8362 @ 2.80 GHz
CPU Cores	64 Cores
System Memory	1.5 TB RAM
CUDA Version	CUDA 12.1
Deep Learning Frameworks	Python 3.9.21, TensorFlow 2.19.1, PyTorch 2.5.1, OpenCV 4.11.0, NumPy 1.23.5, Ultralytics YOLO (v8.4.14), and Pandas 2.2.2.

Table 4. Performance comparison of YOLOv11 variants in terms of Precision, Recall, F1-score, mAP_0.5, mAP_0.5:0.95, and Training Time.

Version	Precision	Recall	F1-Score	mAP_0.5	mAP_0.5:0.95	Training Time (Hours)
Nano	92.8	90.0	91.3	92.3	73.5	10.23
Medium	95.4	93.7	94.5	95.0	80.9	19.42
Large	96.6	95.2	95.9	96.4	84.7	23.39

Table 5. Quantitative recognition performance of each class in terms of precision, recall, F1-score, mAP_0.5, and mAP_0.5:0.95.

Class	Precision	Recall	F1-Score	mAP_0.5	mAP_0.5:0.95
Speed Limit	95.0	93.4	94.2	94.6	77.9
Left Lane	90.9	88.5	89.7	90.8	70.5
Straight Lane	90.4	87.8	89.1	90.0	69.4
Do Not Enter	93.7	90.5	92.1	93.3	76.4
Stop	96.2	94.9	95.5	96.0	82.6
Straight Left Lane	91.6	89.0	90.3	91.2	72.0
Right Lane	91.2	88.3	89.7	90.6	71.3
Straight Right Lane	91.9	89.6	90.7	91.6	72.9
Pedestrian	88.2	81.9	84.9	87.3	63.2
Roundabout	94.0	91.2	92.6	93.6	77.1
Yield	94.5	92.3	93.4	94.0	78.2
Arrow Traffic Light	95.6	93.8	94.7	95.8	80.1
Circular Traffic Light	96.8	95.1	95.9	96.9	83.5
Average	92.8	90.0	91.3	92.3	73.5
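
The F1-Score column in Table 5 is consistent with the standard harmonic mean of precision and recall, F1 = 2PR/(P + R). The following is a minimal sketch that recomputes it for three rows of the table; it is not the authors' evaluation code, and the function name `f1_score` and the `rows` dictionary are illustrative assumptions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 5
rows = {
    "Speed Limit": (95.0, 93.4, 94.2),
    "Stop": (96.2, 94.9, 95.5),
    "Pedestrian": (88.2, 81.9, 84.9),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.1f}, reported {reported}")
```

The same relation reproduces the Average row (precision 92.8, recall 90.0) to within rounding of the reported 91.3.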

Table 6. Performance comparison of IoU-based, DeepSORT, and ByteTrack trackers in terms of MOTA and IDF1.

Tracking Method	MOTA	IDF1
IoU-based	72.5	68.9
DeepSORT	84.2	79.5
ByteTrack	90.7	88.2

Table 7. Performance comparison of color space-based traffic-light classification methods in terms of Precision, Recall, and F1-Score.

Method	Precision	Recall	F1-Score
RGB	86.1	84.5	85.3
CIELAB (L*a*b*)	95.6	93.8	94.7
HSV	97.5	95.9	96.7

Table 8. Performance evaluation of morphological closing in the proposed framework.

Setting	Precision	Recall	F1-Score
Without morphology	95.4	92.3	93.8
With morphology	97.5	95.9	96.7

Table 9. Effect of temporal buffer length on traffic-light classification robustness and startup latency.

Buffer Length (Frames)	F1-Score	Startup Latency (s)
9	94.1	0.30
15	95.6	0.50
27	96.7	0.90
45	96.8	1.50
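
The startup latencies in Table 9 match the time needed to fill the temporal buffer at the 30 fps frame rate listed in Table 2, that is, latency = buffer length / frame rate. The sketch below reproduces that arithmetic under the assumption that this relation is how the column was obtained; the constant and function names are hypothetical.

```python
FRAME_RATE_FPS = 30.0  # camera frame rate reported in Table 2


def startup_latency_s(buffer_frames: int, fps: float = FRAME_RATE_FPS) -> float:
    """Seconds needed to collect `buffer_frames` frames at `fps` frames per second."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return buffer_frames / fps


# Buffer lengths evaluated in Table 9
for n_frames in (9, 15, 27, 45):
    print(f"{n_frames} frames -> {startup_latency_s(n_frames):.2f} s")
```

Read against the table, extending the buffer from 27 to 45 frames adds 0.60 s of startup delay for a 0.1-point F1 gain.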

Table 10. Quantitative impact of inner-ROI cropping and SV thresholding on traffic-light classification performance.

Configuration	Precision	Recall	F1-Score
Full model (with ROI + SV threshold)	97.5	95.9	96.7
Without inner-ROI cropping	95.2	94.6	94.9
Without SV thresholding	95.8	95.0	95.4
Without both (ROI + SV)	93.8	93.2	93.5

Table 11. Traffic-light state classification performance of the proposed HybridSignalNet framework.

Traffic Light State	Precision	Recall	F1-Score
Red	98.9	97.2	98.0
Green	99.2	97.7	98.4
Yellow	98.0	96.0	97.0
Flashing Red	95.1	94.0	94.5
Flashing Yellow	96.4	94.5	95.4
Red Arrow	98.3	96.5	97.4
Yellow Arrow	97.6	96.8	97.2
Green Arrow	98.7	97.1	97.9
Flashing Yellow Arrow	95.4	93.6	94.5
Average	97.5	95.9	96.7

Table 12. Performance comparison with state-of-the-art methods on static traffic-light recognition.

Study	Precision	Recall	F1-Score
Naimi et al. [68]	92.5	93.1	92.7
Zhu and Yian [69]	96.2	94.0	95.1
De Guia and Deveraj [70]	96.5	95.4	95.9
Wan et al. [71]	97.3	96.1	96.7
Proposed Framework	98.4	96.8	97.6

Table 13. Performance comparison with state-of-the-art temporal methods for flashing traffic-light recognition.

Study	Precision	Recall	F1-Score	Throughput (FPS)	Latency (ms)
FlashLightNet [41]	96.3	95.6	95.9	55.3	14.7
HybridSignalNet (Proposed)	97.5	95.9	96.7	47.7	6.77

Table 14. Computational complexity and latency analysis of HybridSignalNet.

Module	Parameters	FLOPs	Latency (ms)
YOLOv11n Detection	2.6M	6.5 GFLOPs	5.10
ByteTrack Tracking	–	<0.1 GFLOPs	0.50
HSV-Based Reasoning	–	<0.1 GFLOPs	1.17
Total	2.6M	6.5 GFLOPs	6.77
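
The Total row of Table 14 is the sum of the per-module latencies. As a minimal sketch of that budget check, the snippet below adds the three reported values and compares the result with the 10 ms steady-state requirement listed in Table 15; the dictionary keys and variable names are illustrative assumptions, and only the numbers come from the tables.

```python
# Per-module latencies in milliseconds, copied from Table 14
module_latency_ms = {
    "YOLOv11n detection": 5.10,
    "ByteTrack tracking": 0.50,
    "HSV-based reasoning": 1.17,
}

STEADY_STATE_BUDGET_MS = 10.0  # real-time requirement from Table 15

total_ms = sum(module_latency_ms.values())
print(f"total per-frame latency: {total_ms:.2f} ms")
print(f"within {STEADY_STATE_BUDGET_MS:.0f} ms budget: {total_ms <= STEADY_STATE_BUDGET_MS}")
```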

Table 15. Real-time performance evaluation of HybridSignalNet.

Metric	Definition	Measured Value	Real-Time Requirement	Meets Requirement
Camera Frame Rate (fps)	Input video frame rate	30 fps	≥30 fps	Yes
Sequence Length (M)	Temporal window used for flashing detection	27	-	-
First Output End-to-End Latency	Time until the first valid output is produced (startup delay)	900 ms	≤1.00 s	Yes
Steady-State Latency (FIFO)	Time per decision during continuous detection after initial flashing recognition	6.77 ms	≤10 ms	Yes
Throughput After First Decision	Sustained processing rate in steady-state operation	47.73 fps	≥30 fps	Yes
