1. Introduction
The exponential growth of urban traffic volumes presents an increasingly critical challenge for modern city infrastructure and quality of life. As urban populations continue to expand, traditional fixed-time traffic signal systems have proven inadequate for managing dynamic traffic patterns, resulting in significant congestion, increased emissions, and substantial economic losses. This complex challenge has sparked intensive research into advanced traffic management solutions, particularly in the domain of adaptive signal control systems [1].
Recent advances in artificial intelligence and control theory have created new opportunities for adaptive traffic management, with Fuzzy Logic and reinforcement learning emerging as particularly promising approaches [2]. The Mamdani fuzzy inference system, specifically, has demonstrated notable potential in traffic control optimization through its intuitive framework and ability to incorporate human expertise [3]. This approach translates traffic engineers’ knowledge into a formal decision-making system through linguistic variables and inference rules, offering a transparent and interpretable control mechanism [4].
While both Mamdani fuzzy systems and reinforcement learning methods have proven effective individually, relatively few studies have conducted a direct, systematic comparison of the two under equivalent conditions [5]. Existing comparisons tend to be limited in scope, focusing on simplified scenarios or lacking robust evaluation metrics. For example, Ding et al. explored a hybrid fuzzy-Q-learning system but did not benchmark it against pure fuzzy or pure reinforcement approaches in a controlled simulation [6]. Similarly, Ouyang et al. evaluated different Q-learning strategies but did not contrast them with Fuzzy Logic [4]. A more comprehensive comparative analysis remains necessary to understand each approach’s strengths, weaknesses, and applicability to real-world urban traffic systems. Recent work by Vlahogianni et al. has started addressing this gap [7], yet the field still lacks standardized experimental frameworks and generalizable conclusions [8].
Additional recent contributions include Jafari et al. [9], who proposed a Fuzzy Logic system for optimizing traffic flow across multiple intersections using Takagi–Sugeno modeling, demonstrating superior performance over fixed-time controllers. Furthermore, Bi et al. [10] introduced a hybrid Type-2 fuzzy and Deep Reinforcement Learning (DRL) framework that dynamically adjusts green phases based on real-time traffic data, showing improved handling of uncertainty in single-intersection optimization. These methods reinforce the potential of intelligent and hybrid control systems in urban traffic management.
However, a clearly defined research gap persists: few studies have conducted an empirical, side-by-side comparison of Fuzzy Logic and Reinforcement Learning approaches using a shared simulation environment, comparable traffic patterns, and standardized evaluation metrics. Several recent reviews confirm this deficiency. For instance, Eom and Kim [1] provide a broad survey on traffic signal control problems but do not quantitatively compare adaptive models under uniform settings. Similarly, Qadri et al. [5] outline the evolution of traffic signal control strategies but highlight the need for empirical performance benchmarking across AI-based methods.
This paper addresses that research gap by conducting a direct comparison between Mamdani Fuzzy Logic and Q-learning under equivalent simulation parameters. The proposed adaptive control strategy for traffic light timings in [11] demonstrates how DRL can scale to complex urban topologies and dynamic flows. That work aimed to evaluate the trade-off between journey time and CO2 emissions by comparing fixed-time and adaptive management in a realistic situation. However, such systems require a thorough evaluation of trade-offs against interpretable alternatives like Fuzzy Logic, particularly for municipal adoption. The work herein positions itself at this intersection, addressing this gap by implementing and evaluating both approaches—Mamdani Fuzzy Logic and DRL. The controlled simulation environment SUMO (Simulation of Urban Mobility) was utilized, with vehicle distributions ranging from 150 to 300 vehicles per hour across different route segments [12]. SUMO’s detailed modeling of individual vehicle behavior, traffic signals, and road networks provides an ideal platform for testing adaptive control strategies in a reproducible manner [11].
Through this comparison, the rule-based decision-making process of the Mamdani fuzzy system is examined in relation to the adaptive policy development of reinforcement learning for efficient traffic flow management [13]. The practical importance of this analysis lies in its relevance to the design and deployment of intelligent traffic systems. The Mamdani fuzzy system’s strength lies in its use of expert-defined rules and its interpretability [3], while reinforcement learning offers adaptability and real-time policy optimization [14].
Recent studies show that smart traffic signal control has improved with new technology. Lin et al. [15] propose a combined method that uses fuzzy control and the Differential Evolution algorithm; this approach helps traffic flow more smoothly and reduces waiting times. Yang et al. [16] present a system using multi-agent deep reinforcement learning for retail supply chains, which is useful in areas with many sensors. In [17], the authors employ Flexible Neural Trees to change traffic signal phases, which helps manage congestion effectively. In [18], Wang et al. employ deep reinforcement learning for traffic light timing, resulting in shorter travel times and reduced traffic congestion. Lastly, a review of different reinforcement learning methods for adaptive traffic signal control is presented in [19], highlighting their benefits in city settings. Overall, these studies show the strong potential of AI for improving real-time traffic systems.
The main contributions of this paper are as follows:
A replicable experimental framework was developed using the SUMO simulator to enable a systematic comparison between Fuzzy Logic and reinforcement learning for traffic signal control. This framework ensures consistency across test scenarios and facilitates further research.
A fluidity indicator is introduced, which quantifies the smoothness of traffic flow transitions, offering a complementary perspective to traditional metrics such as waiting time and vehicle throughput.
A detailed comparative analysis of Mamdani fuzzy control and reinforcement learning was conducted, combining qualitative observations with quantitative results under varying traffic conditions.
Practical insights are offered for urban planners and traffic engineers by identifying the contexts in which each method performs best, thus supporting informed decision-making in the deployment of adaptive traffic management systems.
Thus, this study addresses a clearly defined gap in the literature by conducting a direct, systematic comparison between Mamdani Fuzzy Logic and Q-learning controllers under identical traffic simulation settings using SUMO. Unlike prior works that focus on either method in isolation, we employ a unified evaluation framework and introduce a novel traffic fluidity metric that complements conventional performance indicators. To our knowledge, this is the first study to benchmark these approaches side by side across multiple traffic densities in a scalable, replicable simulation environment, offering practical insights for adaptive urban traffic control.
The paper is structured as follows: Section 2, Materials and Methods, describes the Fuzzy Logic and Q-learning algorithms and their implementation. Section 3 is divided into two parts to present the results of testing the algorithms separately. Section 4 compares the previously mentioned outputs, and lastly, Section 5 concludes this paper.
2. Materials and Methods
2.1. Fuzzy Logic Approach
Fuzzy Logic control systems enhance traffic signal management by addressing the uncertainty and imprecision found in urban traffic. Unlike traditional methods that rely on exact models, Fuzzy Logic uses everyday language and expert-defined rules to reflect human decision-making in traffic scenarios.
This approach is beneficial for managing factors like vehicle numbers, wait times, and congestion, which are difficult to measure precisely. The Fuzzy Logic controller discussed herein uses the Mamdani inference system for easy rule formulation and clear decision-making. By categorizing measurements into terms like “low”, “medium”, and “high”, the system can analyze multiple traffic factors simultaneously, making informed timing decisions for signals. This method incorporates traffic engineering knowledge, facilitating clarity and maintainability, which is essential for user acceptance and ongoing use.
2.2. Inference System Design
The fuzzy inference mechanism used in this traffic control system is based on the Mamdani model, originally introduced by Ebrahim Mamdani in 1974 for controlling a steam engine. This well-established approach follows four main stages: fuzzification, rule evaluation, aggregation, and defuzzification. In the fuzzification phase, crisp inputs—such as vehicle counts and average waiting times—are mapped to linguistic terms like low, medium, or high using triangular membership functions. The rule evaluation stage applies Fuzzy Logic operators, typically using the MIN operator for AND conditions, to calculate the firing strength of each rule. These strengths are then used to truncate the output membership functions during the implication phase. The aggregation step combines these truncated outputs using the MAX operator, forming a unified fuzzy set. Finally, the centroid method (center of gravity) is applied in the defuzzification stage to compute the final green light duration, as presented in Equation (1):
y^{*} = \frac{\int y \, \mu(y) \, dy}{\int \mu(y) \, dy} \quad (1)
where y is the output variable (the green light duration) and \mu(y) is the aggregated membership function. This calculation produces a precise green light duration value that reflects the collective influence of all applicable rules. The Mamdani approach is particularly well-suited for traffic control as it allows domain knowledge to be incorporated as interpretable linguistic rules, handling traffic flow uncertainties and nonlinearities while maintaining transparency in decision-making processes.
For data collection, the system extracts information from the simulation environment through several mechanisms. Lane IDs are retrieved from the network by iterating through the edges and their respective lanes. Traffic light IDs are obtained directly from the network structure. Vehicle counts and average waiting times are continuously measured for specific lanes during simulation, while the system also identifies which lanes are controlled by each traffic light intersection.
The traffic control implementation involves retrieving the current traffic light program, modifying the duration of green phases based on the fuzzy controller’s output, and applying the new program settings to the traffic light. This process carefully maintains the structure of transition phases (yellow and red) to ensure safe traffic operation while focusing optimization efforts on the green phase durations.
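As a concrete illustration of this step, the sketch below shows one possible way to rewrite only the green phases of a traffic light program through TraCI. The function name and the assumption that the first returned program logic is the active one are illustrative rather than taken from the paper's code.

import traci

def apply_green_time(tls_id, green_duration):
    """Set the duration of every green phase of one traffic light,
    leaving yellow and red transition phases untouched."""
    logic = traci.trafficlight.getAllProgramLogics(tls_id)[0]  # assumed active program
    for phase in logic.phases:
        state = phase.state.lower()
        if "g" in state and "y" not in state:  # pure green phase, not a transition
            phase.duration = green_duration
    traci.trafficlight.setProgramLogic(tls_id, logic)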
Control decisions are executed by gathering traffic data for each specific traffic light, applying the fuzzy inference process to determine the optimal green time, and implementing the control decision along with recording performance metrics. The effectiveness of these decisions is evaluated using a reward function that combines waiting time reduction and vehicle throughput maximization, with higher rewards indicating better performance.
The implementation includes spatial analysis capabilities that generate congestion heat maps by tracking traffic conditions across the network. Congestion scores are calculated based on the combination of waiting time and vehicle count at different network locations. These scores are then visualized on a color-coded map of the network, enabling the identification of congestion hotspots and evaluation of the controller’s effectiveness in alleviating traffic congestion throughout the road network.
2.3. Rule Base and Membership Functions
The basic component of the system, the Fuzzy Logic controller, operates on two input variables. The first variable is the number of vehicles waiting at an intersection, defined on the range 0 to 50 vehicles and divided into three fuzzy sets: Low, Medium, and High. The Low fuzzy set is represented by a triangular membership function with parameters [0, 0, 20], Medium by [10, 25, 40], and High by [30, 50, 50]. The “low” category of vehicle numbers ([0, 0, 20]) forms a right-angled triangular membership function, ensuring that zero vehicles unambiguously represents “low” traffic density. The “medium” category ([10, 25, 40]) creates a symmetrical triangular function centered at 25, with an overlap over 10–20 vehicles that allows smooth transitions in fuzzy inference. The “high” category resembles the shoulder of a trapezoidal function, imposing full membership for counts of 50 and above, reflecting the capacity limits of the lane and the diminishing impact of additional vehicles. The second input variable is the average time that vehicles have to wait at the traffic light, defined on the interval [0, 60] seconds. In Figure 1, the Fuzzy Logic membership functions used in the traffic control system are presented.
Because triangular membership functions are straightforward and easy to comprehend, they were selected for this research; however, in the future, Gaussian curves may be considered for more accurate decision bounds. This approach strikes an effective balance between practical implementation and robust fuzzy modeling. The triangular functions facilitate straightforward transitions that traffic engineers can readily comprehend, while allowing the system to process these functions quickly in real-time. The overlapping regions between adjacent categories—such as the range of 10 to 20 vehicles, which lies between the ‘low’ and ‘medium’ classifications—constitute critical transition zones. These zones enable the traffic controller to simultaneously respond to multiple rules, facilitating gradual adjustments rather than abrupt changes as traffic conditions progressively change. Furthermore, the design of the ‘low’ and ‘high’ membership functions reflects realistic traffic patterns. When there are no cars, the system plainly indicates a ‘low’ status; when traffic congestion reaches 50 vehicles or more, the presence of more vehicles has less of an influence on the control strategy’s efficacy.
The controller’s output variable, green time, determines the duration of the green light phase with a domain defined in the range [15, 60] seconds. This variable is categorized into Short, Medium, and Long fuzzy sets with parameters [10, 10, 25], [20, 30, 40], and [35, 60, 60], respectively. The relationships between input and output variables are defined through nine fuzzy rules that collectively form a comprehensive rule base for the controller. These rules capture expert knowledge about traffic management, such as extending green light duration when vehicle count and waiting time are high and shortening it when these values are low. In Table 1, the 9 Fuzzy Logic rules for the proposed traffic control system are presented.
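For readers who want to reproduce this configuration, the following sketch builds an equivalent controller with the scikit-fuzzy library (an assumed choice; the paper only states that a Python fuzzy-inference package was used). The vehicle-count and green-time parameters follow the values above, the waiting-time breakpoints are illustrative, and only three of the nine rules from Table 1 are shown.

import numpy as np
import skfuzzy as fuzz
from skfuzzy import control as ctrl

# Input universes: vehicle count [0, 50], waiting time [0, 60] s
vehicle_count = ctrl.Antecedent(np.arange(0, 51, 1), 'vehicle_count')
waiting_time = ctrl.Antecedent(np.arange(0, 61, 1), 'waiting_time')
# Output universe: green time [15, 60] s
green_time = ctrl.Consequent(np.arange(15, 61, 1), 'green_time')

# Triangular membership functions with the parameters from Section 2.3
vehicle_count['low'] = fuzz.trimf(vehicle_count.universe, [0, 0, 20])
vehicle_count['medium'] = fuzz.trimf(vehicle_count.universe, [10, 25, 40])
vehicle_count['high'] = fuzz.trimf(vehicle_count.universe, [30, 50, 50])

# Waiting-time sets (breakpoints here are illustrative assumptions)
waiting_time['low'] = fuzz.trimf(waiting_time.universe, [0, 0, 25])
waiting_time['medium'] = fuzz.trimf(waiting_time.universe, [15, 30, 45])
waiting_time['high'] = fuzz.trimf(waiting_time.universe, [35, 60, 60])

green_time['short'] = fuzz.trimf(green_time.universe, [10, 10, 25])
green_time['medium'] = fuzz.trimf(green_time.universe, [20, 30, 40])
green_time['long'] = fuzz.trimf(green_time.universe, [35, 60, 60])

# Three of the nine rules (the full base pairs every input combination)
rules = [
    ctrl.Rule(vehicle_count['low'] & waiting_time['low'], green_time['short']),
    ctrl.Rule(vehicle_count['medium'] & waiting_time['medium'], green_time['medium']),
    ctrl.Rule(vehicle_count['high'] & waiting_time['high'], green_time['long']),
]

controller = ctrl.ControlSystemSimulation(ctrl.ControlSystem(rules))
controller.input['vehicle_count'] = 18   # vehicles queued
controller.input['waiting_time'] = 22    # average wait in seconds
controller.compute()                     # Mamdani inference + centroid defuzzification
print(round(controller.output['green_time'], 1))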
Algorithm 1 exemplifies the integration methodology between the Mamdani Fuzzy Logic controller and the SUMO traffic simulation environment. The Fuzzy Logic-based traffic light control system dynamically optimizes signal timing in the SUMO simulation environment. Using the TraCI Python interface (Python version 3.9), the controller regularly monitors both the waiting times at each intersection and the vehicle density. These metrics are used as inputs to a Mamdani fuzzy inference system with triangular membership functions, which classifies traffic conditions into linguistic categories.
To balance throughput with waiting time, the inference engine applies rules that correlate traffic conditions with appropriate green time durations. The system continuously adapts signal timing to evolving traffic patterns while simultaneously tracking performance metrics and generating congestion visualizations. Thus, intelligent traffic management and efficient response to variable traffic demands during the simulation can be achieved.
In the reward calculation function, traffic efficiency is measured using a formula that combines two key elements: a throughput reward (based on how many vehicles move) and a waiting penalty (which reflects delays). The throughput reward increases with vehicle count, while the waiting penalty grows as vehicles spend more time stopped. To calculate waiting time, the system checks each lane controlled by traffic signals and gathers data on how long vehicles are in them. To prevent extreme values from disrupting the system, a cap value is set for the maximum waiting time at 59 s, before it enters the Fuzzy controller.
The Fuzzy traffic control function records these metrics at regular intervals and stores them for later analysis. Error-handling measures were also integrated to provide sensible default values if something goes awry with the calculations. This approach ensures the simulation continues running smoothly even when problems occur. This reward system effectively translates real-world traffic conditions into numerical values that help guide the fuzzy controller’s decisions throughout the entire simulation process.
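A minimal sketch of this metric gathering and reward bookkeeping is given below, assuming the standard TraCI lane and traffic-light queries; the equal weighting of the two reward terms is an illustrative assumption, since the section above does not give exact coefficients for the fuzzy controller's reward.

import traci

MAX_WAIT = 59.0  # cap applied before the value enters the fuzzy controller

def lane_metrics(tls_id):
    """Vehicle count and capped average waiting time for the lanes of one traffic light."""
    lanes = set(traci.trafficlight.getControlledLanes(tls_id))
    vehicles = sum(traci.lane.getLastStepVehicleNumber(lane) for lane in lanes)
    total_wait = sum(traci.lane.getWaitingTime(lane) for lane in lanes)
    avg_wait = min(total_wait / max(vehicles, 1), MAX_WAIT)
    return vehicles, avg_wait

def fuzzy_reward(vehicles, avg_wait, w_throughput=1.0, w_wait=1.0):
    """Throughput reward minus waiting penalty; weights are illustrative."""
    return w_throughput * vehicles - w_wait * avg_wait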
Algorithm 1: Adaptive Fuzzy Logic Traffic Light Control |
Input: NETWORK_FILE, SUMO_CONFIG_FILE, max_steps, update_interval, necessary libraries |
Output: Traffic metrics, reward evolution |
Initialize data: |
| SUMO simulation and environment |
| Lanes and traffic_lights from network topology |
| Fuzzy Logic control system: |
| | Input/output variables |
| | Membership functions |
| | Define rules |
| Statistics tracker |
While step < max_steps do |
| Execute simulation; count current vehicles |
| For each traffic_light_id do |
| | Get current state of traffic and reward; calculate traffic metrics; input traffic state to fuzzy controller; get recommended green time; apply new action; calculate reward; update traffic analyzer |
| Compute average metrics, record statistics |
| Increase step |
Close simulation |
2.4. Q-Learning Approach
Reinforcement learning using Q-learning offers a practical method for controlling traffic signals by learning from interactions with the environment instead of relying on fixed expert rules. This technique allows the controller to identify the best signal timing based on a reward mechanism, making it responsive to fluctuating traffic conditions.
The choice of basic Q-learning over more complicated deep learning was made for a few reasons. First, it works effectively with discrete states, facilitating rapid decision-making in real-time. Second, its straightforward nature makes it simpler to comprehend and troubleshoot, assisting traffic engineers in ensuring the system meets regulatory standards. Finally, it requires fewer training episodes to reach optimal performance. In summary, this approach fine-tunes traffic signal timings through manageable trial-and-error processes, ultimately improving traffic management in urban areas.
2.5. State and Action Design
Q-learning is a model-free reinforcement learning technique that learns an optimal action-selection policy for any finite Markov Decision Process. An MDP is formally defined as a tuple (S, A, P, R, γ), where S represents the state space, A the set of possible actions, P the state transition probability function, R the reward function, and γ the discount factor; the agent seeks an action-selection policy π that maximizes expected return. Q-learning works by learning an action-value function that ultimately gives the expected utility of taking a given action in each state and following the optimal policy thereafter. The “Q” in Q-learning stands for quality, representing the estimation of long-term discounted reward when performing a specific action in a specific state, as defined by Equation (2):
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (2)
where Q(s_t, a_t) is the current estimation of the Q-value for state s_t and action a_t; \alpha is the learning rate; \gamma is the discount factor; r_{t+1} is the immediate reward; and \max_{a} Q(s_{t+1}, a) is the maximum estimated reward achievable in the next state s_{t+1}.
The optimal Q-function that the algorithm converges on is defined by the Bellman optimality condition in Equation (3):
Q^{*}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \right] \quad (3)
to which the learned estimates converge with sufficient exploration and learning iterations, as shown in Equation (4):
\lim_{t \to \infty} Q_t(s, a) = Q^{*}(s, a) \quad (4)
A critical aspect of applying Q-learning to traffic control is defining an appropriate state representation. The implementation uses a discretized state space to reduce complexity while capturing relevant traffic conditions. According to Equation (5), the state is represented as a tuple comprising several key traffic metrics:
s = (n_{\text{veh}}, t_{\text{wait}}, q_{\text{len}}, r_{\text{arr}}) \quad (5)
The component n_veh captures vehicle count in 6 bins (0–5, 6–10, 11–15, 16–20, 21–25, 26+), while t_wait represents waiting time across 6 bins (0–3 s, 4–7 s, 8–12 s, 13–20 s, 21–30 s, >30 s). The parameter q_len measures queue length in 4 bins, and r_arr captures arrival rate in 3 bins. This 6/6/4/3 discretization configuration was empirically optimized to balance state space granularity with computational efficiency, enabling the algorithm to capture meaningful traffic patterns while maintaining reasonable training times and memory requirements.
This discretization approach was chosen deliberately to balance granularity with computational efficiency. The state space must be sufficiently detailed to capture meaningful traffic patterns while remaining manageable for the learning algorithm. The specific bin boundaries were selected based on typical urban traffic dynamics. At low volumes, small changes in vehicle counts may have significant impacts. At higher volumes, larger increments are needed to reflect meaningful differences in congestion levels.
The green phase durations that the agent can select for traffic signals form the action space. They range from 5 to 60 s in 5 s increments. This discrete action space provides a balance between signal control adaptability and computational feasibility. The 5 s minimum ensures an adequate green time for safety and driver expectations, and the 60 s upper limit prevents excessive waiting for cross-traffic. The 5 s increment is sufficient to allow for adjustments while still maintaining a reasonable amount of action space for efficient learning.
The discretization approach using 6/6/4/3 bins for vehicle count, waiting time, queue length, and arrival rate, respectively, was chosen based on extensive empirical testing. This specific configuration represents an optimal compromise between state space granularity and computational efficiency.
For vehicle counts, the 6-bin structure (0–5, 6–10, 11–15, 16–20, 21–25, 26+) captures meaningful traffic density transitions, particularly distinguishing between low volumes (where small changes significantly impact traffic flow) and high volumes (where larger increments better represent congestion levels). Similarly, the 6-bin waiting time categorization provides sufficient resolution to detect critical threshold crossings in driver waiting experience without exponentially expanding the state space.
The 4-bin queue length and 3-bin arrival rate discretizations were determined through sensitivity analysis to be sufficient for capturing the essential dynamics of these parameters while minimizing redundant state representations. Alternative configurations with finer granularity (8/8/6/5) were tested but showed only marginal improvements in control performance while increasing the state space by over 300%, significantly hampering learning convergence rates.
This balanced approach enables the Q-learning algorithm to effectively capture meaningful traffic patterns while maintaining a manageable state space size that allows for reasonable training times and memory requirements. The specific boundaries were calibrated using real-world urban traffic data to ensure they represent practically significant thresholds in traffic management.
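The discretization and action space described above can be expressed compactly as follows; the queue-length and arrival-rate boundaries are not stated in the text, so the values used here are illustrative assumptions.

# Discretization of the traffic state into the 6/6/4/3 bins described above.
VEHICLE_BINS = [5, 10, 15, 20, 25]   # 6 bins: 0-5, 6-10, 11-15, 16-20, 21-25, 26+
WAIT_BINS = [3, 7, 12, 20, 30]       # 6 bins: 0-3, 4-7, 8-12, 13-20, 21-30, >30 s
QUEUE_BINS = [2, 5, 10]              # 4 bins (assumed boundaries)
ARRIVAL_BINS = [0.2, 0.5]            # 3 bins (assumed boundaries, vehicles/s)

ACTIONS = list(range(5, 61, 5))      # green durations: 5, 10, ..., 60 s

def bin_index(value, boundaries):
    """Index of the first boundary that value does not exceed; last bin otherwise."""
    for i, b in enumerate(boundaries):
        if value <= b:
            return i
    return len(boundaries)

def discretize_state(vehicles, wait, queue, arrival_rate):
    """Map raw traffic measurements onto the discrete state tuple of Equation (5)."""
    return (bin_index(vehicles, VEHICLE_BINS),
            bin_index(wait, WAIT_BINS),
            bin_index(queue, QUEUE_BINS),
            bin_index(arrival_rate, ARRIVAL_BINS))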
Algorithm 2, presented below, provides a more structured representation of the optimization process. The traffic management system uses reinforcement learning with Q-learning to facilitate the synchronization of traffic signals in SUMO simulations. Depending on traffic conditions, the agent sets green light times between 5 and 60 s by collecting the number of vehicles and waiting times at each intersection. It calculates rewards using multiple factors, including waiting penalties, throughput rewards, and congestion penalties, while gradually reducing exploration over time. The system maintains a replay buffer for batch learning, manages performance metrics, and allows for post-completion visualizations.
Algorithm 2: Q-Learning Adaptive Traffic Light Control |
Input: NETWORK_FILE, SUMO_CONFIG_FILE, max_steps, update_interval, necessary libraries |
Output: Traffic metrics, reward evolution |
Initialize data: |
| SUMO simulation and environment |
| Lanes and traffic_lights from network topology |
| Define green time durations |
| Q-learning agent: |
| | Possible actions |
| | Learning parameters |
| | Initialize replay buffer |
| Traffic analyzer for patterns |
While step < max_steps do |
| Execute simulation; count current vehicles |
| For each traffic_light_id do |
| | Get current state of traffic and reward; get optimal green time from RL agent; calculate traffic metrics; apply action to traffic light; if previous state exists: store experience; learn from replay buffer; update traffic analyzer |
| Compute average metrics, record statistics |
| Increase step |
Close simulation |
2.6. Reward Function Design
The reward function is designed to guide the agent toward minimizing waiting times and maximizing vehicle throughput. The calculation begins with a negative penalty proportional to vehicle waiting time, establishing waiting time reduction as a primary objective. This is complemented by a positive reward proportional to the number of vehicles at the intersection, which promotes throughput. Additionally, the system recognizes improvements over previous states by rewarding reductions in waiting time and increases in processed vehicles.
For scenarios of extreme congestion, characterized by waiting times exceeding 30 s combined with more than 20 vehicles present, the system applies a substantial negative penalty. This encourages the agent to avoid or quickly resolve severe congestion situations. Furthermore, the system calculates an efficiency metric as the ratio of vehicles processed to waiting time, rewarding efficient processing of vehicles.
The reward components are carefully weighted to balance these competing objectives. The waiting time penalty uses a coefficient of −2.0, making it a significant factor without overwhelming other considerations. Vehicle throughput is weighted at 1.5, slightly less than waiting time, reflecting that while throughput is important, reducing delays takes precedence. Improvements in waiting time receive a higher coefficient of 3.0, emphasizing the importance of continuous improvement in traffic flow.
To prevent numerical instability during learning, the implementation constrains input values and limits the final reward to between −1000 and 1000. These bounds prevent extreme rewards from destabilizing the learning process while still providing sufficient space for effective policy updates.
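A sketch of this reward shaping is shown below; the stated coefficients (−2.0, 1.5, 3.0) and the ±1000 clipping follow the text, while the magnitudes of the improvement-in-throughput term, the extreme-congestion penalty, and the efficiency bonus are assumptions.

def q_reward(wait, vehicles, prev_wait, prev_vehicles):
    """Reward combining waiting penalty, throughput, improvement terms,
    a congestion penalty, and an efficiency bonus (assumed magnitudes noted below)."""
    reward = -2.0 * wait                                   # waiting-time penalty
    reward += 1.5 * vehicles                               # throughput reward
    reward += 3.0 * max(0.0, prev_wait - wait)             # reduction in waiting time
    reward += 1.0 * max(0, vehicles - prev_vehicles)       # more vehicles processed (assumed weight)
    if wait > 30 and vehicles > 20:                        # extreme congestion
        reward -= 100.0                                    # substantial penalty (assumed magnitude)
    reward += vehicles / max(wait, 1.0)                    # efficiency: vehicles per unit of waiting
    return max(-1000.0, min(1000.0, reward))               # clip to keep learning numerically stable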
2.7. Experience Replay and Stability
The implementation includes experience replay through an enriched agent class. Previous experiences, stored as (state, action, reward, next state) tuples, are kept in a replay buffer, and the agent samples from this buffer to update Q-values.
This makes it possible to use past experiences more efficiently, by reusing them multiple times throughout the learning process. In addition, it reduces the correlation between consecutive training samples, which is very important, since reinforcement learning algorithms assume independent samples. It also improves the stability of the learning process and creates better convergence properties by breaking temporal dependencies.
The buffer used by the system is of fixed size and keeps a predetermined number of recent experiences. During the learning process, batches are randomly sampled from this buffer, which helps to break the temporal correlations inherent in sequential experiences. To provide a diversity of scenarios while maintaining computational efficiency, the buffer size was set to 2000 experiences and the batch size to 64.
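A minimal replay buffer with the stated capacity and batch size could look like this; the class name and method signatures are illustrative, not taken from the paper's code.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded first

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        """Random batch; sampling breaks the temporal correlation of consecutive steps."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))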
2.8. Traffic Analysis
The system includes a traffic analysis module that monitors and predicts traffic patterns. This module maintains historical data on vehicle counts and waiting times for each traffic signal over a sliding window of observations. By analyzing changes in vehicle counts between consecutive time steps, the system calculates flow rates that indicate how efficiently vehicles are moving through intersections. The module also identifies evolving traffic patterns, categorizing them as stable, increasing, decreasing, or fluctuating based on statistical analysis of recent observations.
Traffic patterns are utilized to anticipate future conditions, enabling proactive measures rather than mere reactive responses to congestion. Traffic conditions are categorized using a straightforward 0–4 scale, where a rating of 0 indicates smooth traffic flow, while a rating of 4 signifies significant congestion. These levels were devised through the integration of traffic engineering expertise and empirical testing to ensure their reliability. Under optimal conditions (level 0), vehicles transition through intersections with minimal delay. Conversely, at level 4, traffic flow deteriorates markedly, characterized by prolonged queues and wait times exceeding 45 s. As traffic becomes worse, the system implements more aggressive timing adjustments. Both vehicle count and waiting time indicators are continuously monitored. This approach helps the Q-learning system make more informed decisions and adapt to changing traffic patterns.
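A possible mapping onto the 0–4 scale is sketched below; only the level-4 threshold (waiting times above 45 s) appears in the text, so the intermediate thresholds and the vehicle-count condition are illustrative assumptions.

def congestion_level(avg_wait, vehicles):
    """Map current conditions onto the 0-4 congestion scale (assumed thresholds)."""
    if avg_wait > 45 or vehicles > 25:
        return 4          # severe congestion, long queues and waits above 45 s
    if avg_wait > 30:
        return 3
    if avg_wait > 15:
        return 2
    if avg_wait > 5:
        return 1
    return 0              # free-flowing traffic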
2.9. Directional Control
The implementation extends beyond basic signal timing adjustment by incorporating directional traffic control. The system identifies the lanes and directions controlled by each traffic signal, distinguishing between north–south, east–west, and turning movements. For each direction, the system determines congestion levels based on vehicle counts and waiting times, then calculates priority scores that reflect the relative need for green time.
Based on these priorities, the available green time is allocated proportionally, ensuring that busy directions receive increased attention, while providing a minimum service to all trips. The directional approach allows for the creation of clearer control strategies that can handle asymmetric traffic patterns (e.g., large one-way commuter flows during peak hours) and is of particular importance for intersections where traffic demands vary significantly from different approaches. The implementation uses an epsilon-greedy strategy to balance exploration and exploitation. The agent selects with probability epsilon (exploration rate) a random action to find potentially better strategies, and with probability 1-epsilon, selects the action with the highest Q value to capitalize on known efficient strategies.
The exploration rate is gradually decreased according to a decay schedule to improve learning over time. Initially, the rate is set to 0.5, allowing the system to explore different options early on. It is then multiplied by a decay factor of 0.9995 at regular intervals, gradually shifting the agent towards its learned policies, until it reaches a minimum value of 0.05, which preserves enough randomness for dealing with unexpected traffic patterns. This gradual transition ensures efficient exploration of the state space in the early stages while focusing on exploiting the acquired knowledge as the simulation progresses. To accommodate the complexity and dynamics of traffic patterns, the decay rate was chosen to be relatively slow, allowing the agent to continue adapting to new situations even after substantial learning.
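The epsilon-greedy selection and decay schedule described above can be summarized as follows; the Q-table is assumed to be a dictionary mapping discrete states to per-action values.

import random

EPSILON_START, EPSILON_MIN, EPSILON_DECAY = 0.5, 0.05, 0.9995

def select_action(q_table, state, actions, epsilon):
    """Epsilon-greedy choice over the discrete green-time actions."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore a random green time
    q_values = q_table.get(state, {})
    return max(actions, key=lambda a: q_values.get(a, 0.0))    # exploit the best known action

def decay_epsilon(epsilon):
    """Multiplicative decay applied at regular intervals, bounded below by the floor."""
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)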
The Q-learning algorithm’s parameters were chosen to ensure good performance and reliability. The learning rate was set to α = 0.1, as recommended for traffic control applications [19]. This value helps the system learn quickly while remaining stable. With this moderate rate, the agent can adjust to changing traffic patterns without reacting too strongly to short-term fluctuations.
The discount factor was configured to γ = 0.95 to focus on long-term rewards while staying responsive to current traffic conditions. This choice reflects how traffic flows and management decisions today can affect both immediate and future traffic performance. A higher value would focus too much on future rewards and might hinder timely responses. A lower value could lead to shortsighted actions, which would not be effective in optimizing traffic networks.
The exploration parameters were fine-tuned through initial simulations with various traffic scenarios. For the experience replay, the buffer’s size was configured to 2000 experiences to balance memory use and learning stability. Tests with smaller buffers (500 and 1000) lacked diversity for effective learning, while larger buffers (5000+) provided minimal improvements at much higher costs.
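Putting the parameters above together, a tabular update consistent with Equation (2), combined with replay over stored experiences, might look like this (it reuses the ReplayBuffer sketch from Section 2.7 and is illustrative rather than the paper's actual code):

ALPHA, GAMMA = 0.1, 0.95   # learning rate and discount factor discussed above

def q_update(q_table, state, action, reward, next_state, actions):
    """One tabular Q-learning update following Equation (2)."""
    q_sa = q_table.setdefault(state, {}).get(action, 0.0)
    best_next = max(q_table.get(next_state, {}).get(a, 0.0) for a in actions)
    q_table[state][action] = q_sa + ALPHA * (reward + GAMMA * best_next - q_sa)

def learn_from_replay(q_table, buffer, actions, batch_size=64):
    """Replay a randomly sampled batch of stored (s, a, r, s') experiences."""
    for state, action, reward, next_state in buffer.sample(batch_size):
        q_update(q_table, state, action, reward, next_state, actions)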
2.10. Performance Monitoring and Visualization
The system incorporates comprehensive performance monitoring and visualization capabilities to evaluate the effectiveness of the learning process and resulting control strategies. Throughout the simulation, the system tracks key metrics such as vehicle counts, waiting times, green times, and rewards. This data forms the basis for time-series plots that illustrate the evolution of performance metrics over the course of the simulation, revealing trends and patterns in the learning process.
The system generates histograms to display the distribution of various performance indicators, providing insights into the range and frequency of different traffic conditions and control responses. Correlation matrices help identify relationships between different metrics, such as how changes in green time affect waiting times or how vehicle counts influence rewards.
The traffic flow index is a composite measure that was created to assess how well traffic flows in a traffic light system, as calculated by Equation (6):
\text{Traffic Flow Index} = 100 \times \left( 0.5 \cdot \text{avg\_normalized\_speed} + 0.3 \cdot (1 - \text{stop\_time\_ratio}) + 0.2 \cdot \text{stops\_factor} \right) \quad (6)
It combines several important factors into a single, easy-to-interpret score on a scale from 0 to 100. The avg_normalized_speed parameter represents the average vehicle speed divided by the maximum speed allowed for each vehicle. It takes values between 0 and 1 (0 = all vehicles are stopped, 1 = all vehicles are traveling at the maximum speed allowed) and has the highest weight (50%) because speed is the most important indicator of traffic flow. The stop_time_ratio component is the ratio of the total time spent stopped to the duration of the traffic light cycle for all vehicles; it has a weight of 30%, reflecting the importance of reducing the time spent at a stop. Finally, the stops_factor parameter measures the frequency of stops (1 = no stops, 0 = all vehicles were stopped) and has a weight of 20%, because each stop and restart contributes to traffic inefficiency, even if it is short-lived. It is calculated according to Equation (7), where avg_stops is the average number of stops per vehicle.
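Under the weighting just described, the index can be computed as in the sketch below; the exact functional form, in particular the use of (1 − stop_time_ratio), is a reconstruction consistent with the stated weights rather than a verbatim copy of the authors' formula.

def traffic_flow_index(avg_normalized_speed, stop_time_ratio, stops_factor):
    """Composite 0-100 score using the 50/30/20 weighting described above."""
    score = (0.5 * avg_normalized_speed
             + 0.3 * (1.0 - stop_time_ratio)   # less time stopped raises the score
             + 0.2 * stops_factor)             # fewer stops raise the score
    return 100.0 * max(0.0, min(1.0, score))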
2.11. Simulation Environment
The proposed traffic light control system (Figure 2) was implemented using Python as the primary programming language, developed within the PyCharm IDE (academic version 2023.3.1).
The system architecture employs SUMO as the underlying traffic simulation engine, with direct Python integration achieved through the TraCI API. This integration enables programmatic control and data extraction from the simulation environment. The implementation leverages several key Python libraries: one for efficient numerical operations, one for implementing the fuzzy inference system, one for generating performance visualizations and analytical graphs, and a standard library module for persistent storage of simulation results.
The system follows a modular design approach, where the SUMO environment provides the foundation for traffic simulation based on network definition files (.net.xml) and simulation configuration files (.sumocfg). Within this framework, dedicated Python functions extract real-time traffic metrics, including vehicle counts and waiting times, directly from the simulation. These metrics are processed by the Fuzzy Logic controller, which applies membership functions and inference rules to determine optimal green light durations. The resulting control decisions are then applied to the simulation through TraCI commands that modify traffic light timing programs.
Rather than relying on predefined membership functions and inference rules, the Q-learning agent autonomously discovers effective control policies through repeated interaction with the environment. The environment is configured through network and configuration files that define the road network topology, traffic demand, and initial traffic signal control plans. Before proceeding with the simulation, the system verifies the existence of these configuration files to ensure proper operation.
The road network was configured as a comprehensive urban traffic system arranged in a 4 × 5 grid of intersections. The map displays a hierarchical structure with three distinct road types that transition from top to bottom, each serving different traffic needs. The uppermost section contains residential roads with narrower lanes, the highest priority designation, and lower speed limits featuring just one lane in each direction. These connect to the middle section’s “main” type roads which maintain a medium priority with slightly higher speed limits and two edges per lane. Finally, the bottom of the map contains arterial roads that, despite having the lowest priority, permit the highest speeds and feature three edges per lane for maximum capacity. The four intersection types illustrate this transition, with the top row of intersections uniquely operating without traffic signals while all others incorporate traffic light systems for flow control.
When examining the different intersection types in detail, a clear progression in complexity moving downward through the map can be observed (see Figure 3). Type 1 intersections located in the top row are simple crossroads with single-lane residential roads in each direction, allowing standard vehicle movements without traffic light regulation. Moving down, Type 2 intersections introduce traffic signals and wider roads with two edges per direction, forming a transition zone between residential and main road types. Type 3 intersections further increase in complexity, combining two-edge and three-edge road configurations with more sophisticated signaling to manage the intersection between main and arterial roads. Finally, Type 4 intersections at the bottom of the map represent the most complex configuration, with wide three-edge arterial roads in each direction, comprehensive lane arrangements for all possible turning movements, and advanced traffic signal systems to efficiently handle the higher volumes and speeds of traffic expected on these major thoroughfares.
Vehicle flows were generated artificially with predefined routes for each vehicle, with each simulation set to run for 5000 steps. The flow rates were carefully calibrated across five scenarios: the first four cases used a growing average number of vehicles per simulation step (approximately 100, 300, 500, and 1000 vehicles on the map per step, respectively).
3. Results
Experimental scenarios were conducted using the presented algorithms—a Fuzzy Logic method and a reinforcement learning (RL) algorithm based on Q-learning. For each method, five experiments were conducted, resulting in a comprehensive dataset that allowed us to analyze multiple performance metrics.
The analysis yielded several key output distributions: green phase durations across different traffic conditions, average waiting times at intersections, and the correlation between vehicle volume and waiting time at each simulation step. Additionally, the extracted visual representations show the evolution of average waiting times in relation to the reward values for each algorithm, providing insight into how the controllers adapted to changing traffic conditions over time. These distributions are illustrated in the graphs presented below.
The extracted statistics were compiled in the two tables of this section in order to measure and contrast the performance of both methods in various circumstances. Key performance indicators collected during each experimental run are displayed in these tables. These include the average number of vehicles on the map during the simulation period, the average waiting times for vehicles at intersections, the number of stops that vehicles must make during their journeys, the time lost per traffic signal cycle, and a traffic flow index that shows how efficiently traffic moves through the network overall. Together, these measures offer a strong foundation for evaluating the performance of the Fuzzy Logic and Q-learning techniques in a range of traffic situations.
The performance evaluation is based on four key metrics with specific measurement protocols: Average Waiting Time represents the mean time (seconds) vehicles spend stationary at intersections, calculated as total waiting time divided by total vehicle count per simulation cycle; Average Stops per Vehicle measures the total number of complete stops (speed = 0) divided by total vehicles throughout each vehicle’s network journey; Lost Time per Cycle quantifies inefficient time (seconds) during each signal cycle, calculated as cycle time minus effective green time utilization and averaged across all intersections; and Traffic Flow Index provides a composite efficiency score (0–100%) combining normalized speed, stop-time ratio, and stop frequency as defined in Equation (6).
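As an illustration of how such measurements can be collected from the simulation, the sketch below accumulates average waiting time and complete stops per vehicle with standard TraCI vehicle queries; the stop-speed threshold and the assumption of a one-second step length are illustrative, not part of the paper's measurement protocol.

import traci

def journey_stats(max_steps, stop_speed=0.1):
    """Average waiting time (s) and complete stops per vehicle over one run."""
    wait_steps, stop_counts, was_stopped = {}, {}, {}
    for _ in range(max_steps):
        traci.simulationStep()
        for vid in traci.vehicle.getIDList():
            stopped = traci.vehicle.getSpeed(vid) < stop_speed
            if stopped:
                wait_steps[vid] = wait_steps.get(vid, 0) + 1        # one second of waiting per step
                if not was_stopped.get(vid, False):
                    stop_counts[vid] = stop_counts.get(vid, 0) + 1  # a new complete stop
            was_stopped[vid] = stopped
    vehicles = max(len(was_stopped), 1)
    return sum(wait_steps.values()) / vehicles, sum(stop_counts.values()) / vehicles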
3.1. Fuzzy Logic Results
The simulation results in Figure 4, Figure 5, Figure 6 and Figure 7 demonstrate the performance characteristics of the fuzzy-based traffic light control algorithm under varying traffic densities. At a low traffic density of 100 cars, the system maintains relatively short green phase durations concentrated around 15 s, resulting in minimal average waiting times of 1–2 s for most vehicles (see Figure 4). As traffic density increases to 300 cars per simulation step in Figure 5, the algorithm adaptively extends green phases to a bimodal distribution centered around 26–28 s, although this leads to increased waiting times ranging from 0 to 6 s. Under moderate-high traffic conditions (500 cars; Figure 6), green phases become even more concentrated at longer durations (30–32 s), with waiting times showing a more uniform distribution across 0–8 s and notable peaks at both low (0–2 s) and moderate (6–8 s) waiting periods.
When traffic density reaches its maximum of 1000 cars, the system hits its adaptive limits. Green traffic light phases mainly last about 35 s, as presented in Figure 7, which causes waiting times to vary. In this case, some drivers are waiting up to 35 s, which shows higher congestion and lower system efficiency. The results show that the fuzzy controller can adjust signal timing based on traffic load, but its performance drops significantly when congestion is heavy.
The performance data presented in Table 2 reveals several significant patterns in the Fuzzy Logic traffic signal control algorithm’s behavior across escalating traffic densities. It demonstrates that as vehicle volume increases from 100 to 1000 vehicles per step, a systematic deterioration in overall system performance becomes evident across multiple metrics.
The results show that average waiting time experiences a substantial and consistent increase, rising from 2.67 s under light traffic conditions to 11.02 s at peak volume—representing a greater than fourfold increase. Concurrently, the Traffic Flow Index exhibits a steady decline from 81.67% efficiency at the lowest density to 68.24% at the highest, quantifying the progressive reduction in system effectiveness as congestion intensifies.
Lost Time per Cycle, as recorded in Table 2, similarly displays an upward trajectory, increasing from 16.35 s to 25.21 s with growing traffic density, indicating less efficient cycle utilization under heavier loads. Table 2 also reveals that the average stops per vehicle metric fluctuates somewhat nonlinearly, ranging between 0.21 and 0.34 across different scenarios.
Under minimal traffic conditions, the system shows significant variability in waiting times, with unpredictable fluctuations ranging between 0 and 8 time units (see Figure 8). When traffic density increases to moderate levels, these fluctuations become much more pronounced and display far less predictable patterns with higher-amplitude variations. The fuzzy controller shows a marked degradation in performance under critical traffic conditions, with waiting times escalating to approximately 35 time units, more than triple the values observed under minimal traffic. The reward accumulates along a gradually increasing trajectory, suggesting that the system incrementally improves its performance, but it plateaus near 80 units in critical conditions.
This suggests that some congestion levels exceed the fuzzy controller’s traffic flow optimization capabilities. The irregular behavior of the waiting time and reward curves indicates that the Fuzzy Logic approach responds dynamically to immediate traffic conditions; however, it lacks the consistency required for predictable long-term performance in higher-density traffic environments, even though it may adapt well to unexpected scenarios.
3.2. Q-Learning Results
The Q-learning-based traffic light control system exhibits distinctly different behavioral patterns compared to the Fuzzy Logic approach. At a low traffic density of 100 cars per step, the Q-learning algorithm converges to green phase durations primarily concentrated around 32 s (see Figure 9), which is significantly longer than the fuzzy system. However, it achieves superior performance, with most waiting times clustered within 0–1 s. As traffic density increases to 300 cars, the learned policy maintains green phases in the 30–35 s range with a more dispersed distribution, as shown below in Figure 10. Furthermore, waiting times remain heavily skewed toward minimal delays (0–2 s) with a longer tail extending to 12 s.
Under moderate to high traffic conditions (500 cars/simulation step), the Q-learning system demonstrates remarkable consistency, maintaining the 30–35 s green phase strategy while keeping the majority of waiting times under 4 s (Figure 11). At a maximum traffic density of 1000 cars, the results shown in Figure 12 indicate that the Q-learning controller demonstrates a bimodal waiting time distribution. There are peaks at both 0–1 s and 3–4 s, suggesting that the algorithm has learned to optimize for different traffic scenarios simultaneously. The Q-learning approach effectively learns optimal timing policies, minimizing waiting times across varying traffic densities and showcasing the benefits of reinforcement learning in complex traffic patterns.
Table 3 shows how the Q-learning traffic signal control algorithm performs at different traffic densities, revealing patterns that differ significantly from the Fuzzy Logic approach. What stands out is how remarkably stable certain values remain, even as traffic volume increases substantially.
The average waiting time documented in Table 3 shows only modest increases as traffic density rises, beginning at 2.57 s with 100 vehicles/step and reaching just 3.71 s at 1000 vehicles/step—representing only a 44% increase despite a tenfold growth in traffic volume. This contrasts sharply with the performance degradation observed in the Fuzzy Logic algorithm. Similarly, the Traffic Flow Index maintains impressively high values across all scenarios, ranging from 96.93% at the lowest density to 85.45% at the highest, indicating robust performance even under demanding conditions.
Figure 13 shows the remarkably consistent behavior patterns of the Q-learning algorithm across all traffic intensity scenarios. The system exhibits a well-defined cyclical pattern of waiting times under low traffic conditions, with frequent peaks of around 14 time units followed by sharp drops to almost zero values. As traffic intensities rise from moderate to critical levels, this cyclic pattern continues with remarkable consistency, retaining consistent amplitude and frequency characteristics. These cycles’ stability, even when facing severe traffic, points to strong scalability and resistance to performance deterioration as traffic volume increases. Reaching values of up to 250 units, the reward acquisition curve exhibits a characteristic stepped progression that is noticeably higher than those attained by the Fuzzy Logic method.
These distinct reward accumulation phases display variable learning thresholds at which the system significantly enhances its performance approach. The consistency of behavior at various traffic densities indicates that the Q-learning algorithm effectively recognizes basic traffic patterns and creates optimal control strategies that work at all congestion levels. The superior scalability and optimization capability of the Q-learning approach for long-term traffic management applications is demonstrated by this consistency in performance, especially the maintenance of reasonable waiting times even under critical traffic conditions.
To assess how effective the proposed Q-learning adaptive traffic control system is, a comparative analysis was conducted against a baseline fixed-time controller. The static controller was set up with predetermined signal timing plans defined in a tll.xml file, which had cycle times varying from 90 s for simpler intersections (including 41 s of green phases and 4 s of yellow transitions) to 144 s for more complex intersections (with 31 s of green phases, 5 s of yellow phases, and multiple signal groups). These static timing plans exemplified traditional traffic management methods commonly utilized in urban intersections, where phase durations remain fixed regardless of the actual demand for traffic. Both controllers were evaluated under similar high-traffic conditions, simulating approximately 500 vehicles per simulation step to represent congested traffic scenarios.
The experimental outcomes reveal considerable performance enhancements of the Q-learning adaptive algorithm compared to the static baseline across several critical metrics: average waiting time was lowered by 87.5% (from 24.99 s to 3.13 s), the average vehicle count dropped by 18.7% (from 568.75 vehicles to 462.52 vehicles), and the average stops per vehicle improved by 51.9% (from 4.89 stops to 2.36 stops). Furthermore, the adaptive system realized a 5.2% enhancement in the traffic flow index (from 89.47 to 94.15). These findings illustrate the superior performance of adaptive traffic control systems in comparison to traditional fixed-time strategies in high-density traffic situations, particularly emphasizing the shortcomings of static timing plans that fail to adapt to real-time traffic circumstances.
4. Discussion
Comparing the two sets of graphs (Figure 4, Figure 5, Figure 6 and Figure 7 for the Fuzzy controller and Figure 9, Figure 10, Figure 11 and Figure 12 for the Q-learning algorithm) reveals distinct behavioral patterns between the Fuzzy Logic and Q-learning traffic control approaches. The Fuzzy Logic controller demonstrates a bimodal distribution for green phase durations that is notably skewed toward longer intervals (30–35 s), while simultaneously exhibiting a secondary, less prominent peak at shorter durations (approximately 15 s). This asymmetrical distribution suggests a preference for extended green phases with occasional shorter interventions, indicating a potentially less balanced control strategy. In contrast, the Q-learning controller manifests a more symmetrical, approximately normal distribution centered around the 30–35 s range, with frequencies gradually diminishing toward both extremes of the spectrum (15–45 s), suggesting a more balanced optimization approach to signal timing.
The trends in reward evolution illustrated in Figure 8 and Figure 13 have important implications for practical system deployment and operational management. The highly variable and irregular reward curves seen with the Fuzzy Logic approach indicate unpredictable system behavior, which presents significant challenges for real-world implementation. These fluctuations, ranging from nearly zero to 80 units, make it difficult for traffic engineers to predict system performance under various traffic conditions and to establish meaningful performance benchmarks for evaluation. In contrast, the Q-learning approach shows remarkably stable and predictable reward curves, consistently reaching over 250 units. This stability offers considerable advantages for municipal systems. The consistent cyclical patterns enable traffic engineers to accurately forecast system behavior and effectively plan maintenance schedules. As a result, citizens benefit from predictable traffic flow patterns, which can foster public trust in intelligent traffic systems. Additionally, this stability simplifies monitoring procedures, allowing for automated anomaly detection; significant deviations from expected patterns can reliably trigger maintenance alerts.
The waiting time distributions further accentuate these algorithmic differences. The Fuzzy Logic controller generates a multimodal distribution with several peaks dispersed across the entire waiting time range (0–17.5 s), indicating inconsistent performance characteristics with clusters of waiting times occurring at various intervals. This pattern suggests that vehicles experience widely varying delays under the fuzzy control regime. Conversely, the Q-learning algorithm produces a right-skewed distribution with a pronounced concentration in the minimal waiting time range (0–4 s), followed by a consistent diminution in frequency as waiting times increase. This distribution pattern strongly indicates that the Q-learning approach more effectively minimizes vehicle delays throughout the network.
These comparative patterns suggest fundamentally different optimization strategies between the two algorithmic approaches. The Q-learning method appears to achieve more consistent and potentially superior performance, characterized by predictable signal timing patterns and systematically minimized vehicle waiting times. The Fuzzy Logic controller, while still functional, demonstrates greater variability in its control decisions, potentially leading to less optimal traffic flow with more irregular patterns of vehicle progression through the network. These differences highlight the potential advantages of reinforcement learning techniques for adaptive traffic signal control in dynamic urban environments.
The comprehensive performance profile in Table 2 suggests that while the Fuzzy Logic algorithm maintains acceptable efficiency during lighter traffic conditions, its performance capabilities become increasingly compromised as traffic volumes approach and exceed 500 vehicles per step. The most pronounced performance decline occurs in the transition from 500 to 1000 vehicles per step, potentially indicating a critical threshold beyond which the algorithm’s adaptive mechanisms struggle to maintain optimal traffic flow management.
The performance profile presented in Table 3 indicates that the Q-learning algorithm demonstrates superior adaptability to increasing traffic demands, maintaining relatively stable waiting times and high traffic flow indices even under heavy congestion. This consistency suggests that the reinforcement learning approach effectively learns optimal control strategies that scale well with increasing traffic volumes, exhibiting considerably greater resilience than the Fuzzy Logic method in managing heightened traffic densities.
5. Conclusions
This research has demonstrated quantitative differences in traffic signal control between Q-learning and Fuzzy Logic approaches. The Q-learning methodology exhibited outstanding stability across traffic densities, maintaining waiting times between 2.57 and 3.71 s despite a tenfold increase in vehicle volume—representing only a 44% degradation compared to the fourfold increase observed with the Fuzzy Logic controller. Likewise, the Traffic Flow Index remained consistently higher with Q-learning, declining from 96.93% to 85.45% under maximum load, versus the fuzzy approach’s drop from 81.67% to 68.24%. Across all traffic conditions, the cyclic pattern of waiting times with the Q-learning algorithm remained stable, indicating pattern recognition and optimization capabilities that outperformed the Fuzzy Logic approach, particularly in high-congestion scenarios.
These findings validate the replicable experimental framework established in this study and demonstrate the practical value of the comparative analysis for supporting evidence-based decision-making in adaptive traffic control deployment. The novel fluidity indicator introduced herein proved effective in distinguishing algorithm performance characteristics, while the systematic comparison methodology enables continued research into hybrid and context-aware traffic control strategies. The current study relies exclusively on simulated environments rather than real-world traffic data, which may not fully capture the actual complexity and unpredictability of urban traffic patterns. Although the simulations were thorough by design, the algorithms were not tested under extreme weather conditions, special events, or emergency scenarios that could greatly alter traffic dynamics. Additionally, the discrete state space representation carefully calibrated within the Q-learning approach necessarily simplifies continuous traffic parameters and can potentially overlook some subtle traffic behaviors.
Looking ahead, a number of promising extensions could improve the current work’s efficacy and relevance. By using neural networks to process high-dimensional, continuous state spaces without explicit discretization, a Deep Q-Network approach may be able to overcome the drawbacks of discrete state representations. This could preserve the self-improving features of the current Q-learning implementation while allowing for more sophisticated responses to intricate traffic patterns.
An additional worthwhile avenue for research is a multi-agent reinforcement learning system with inter-intersection coordination. Instead of treating each intersection as a separate control problem, such a system could optimize traffic flow across several connected intersections. The system’s responsiveness and state representation could be greatly improved by incorporating real-time visual feedback from traffic cameras.
Finally, integrating Vehicle-to-Everything (V2X) communication could directly incorporate data from vehicles into the traffic control system. Real-time information, including vehicle positions, speeds, destinations, and routes, may allow the reinforcement learning algorithm to develop control strategies that move traffic optimization beyond reactive control toward proactive control.
Building upon the demonstrated advantages of the Q-learning approach, these potential extensions could greatly advance the overall field of smart traffic management systems toward much more adaptive, efficient, and responsive solutions for many urban mobility challenges. As cities worldwide face increasing mobility demands and environmental pressures, the selection of adaptive traffic control systems becomes a critical infrastructure decision with long-lasting implications for urban efficiency, sustainability, and quality of life. This research provides municipal decision-makers, engineers, and researchers with quantitative evidence supporting the strategic implementation of reinforcement learning-based traffic management systems as a foundation for future smart city development.