Adaptive Traffic Signal Control Using Multi-Agent Reinforcement Learning: A Comparison of Control Strategies

Owais, Mahmoud; Mohammed, Badr O.; Kamal, Abdulrahman A.; Shaban, Abdulrahman; Mostafa, Ahmed H.; Hatem, Kareem; Emad, John; Younis, Salah T.; Ali, Samia A.; Abdel-Hakim, Alaa E.; Alkabbany, Islam M.

doi:10.3390/su18115702

by Mahmoud Owais ¹,*, Badr O. Mohammed ¹, Abdulrahman A. Kamal ², Abdulrahman Shaban ², Ahmed H. Mostafa ², Kareem Hatem ², John Emad ², Salah T. Younis ², Samia A. Ali ², Alaa E. Abdel-Hakim ² and Islam M. Alkabbany ²

¹ Civil Engineering Department, Faculty of Engineering, Assiut University, Assiut 71515, Egypt
² Electrical Engineering Department, Faculty of Engineering, Assiut University, Assiut 71575, Egypt
* Author to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5702; https://doi.org/10.3390/su18115702 (registering DOI)

Submission received: 11 May 2026 / Revised: 1 June 2026 / Accepted: 2 June 2026 / Published: 4 June 2026

(This article belongs to the Special Issue Sustainable and Smart Transportation Systems)

Abstract

Urban traffic congestion remains a persistent challenge for conventional fixed-time signal control, particularly under fluctuating and asymmetric demand. Although multi-agent reinforcement learning (MARL) has shown promise for adaptive traffic signal control, previous studies have often focused on isolated intersections, simplified synthetic networks, or deep-learning-based controllers without systematically comparing tabular and deep-value-based multi-agent approaches under equivalent operating conditions. This study addresses this gap by comparing three traffic signal control strategies: fixed-time control, multi-agent Tabular Q-Learning, and multi-agent Deep Q-Network (MADQN) control. The evaluation was conducted in a microscopic traffic simulation environment using two complementary testbeds: a synthetic two-intersection corridor, which enables controlled analysis of multi-agent coordination, and a real-world digital twin of the 25 January Corridor in Assiut, Egypt, which tests controller robustness under asymmetric geometry and realistic turning movements. The controllers were assessed under low-, medium-, and high-demand scenarios using queue length, cumulative delay, and Time-To-Collision as operational and safety-related indicators. The results show that MARL-based controllers generally outperform fixed-time control, but their relative performance depends on demand intensity and network complexity. MADQN provides stronger generalization in low-demand and queue-dissipation conditions, whereas Tabular Q-Learning remains highly competitive and can achieve superior delay reduction in several medium- and high-demand cases. These findings indicate that deeper MARL architectures are not universally superior; rather, adaptive signal control deployment should match the controller architecture to the operational objective, traffic demand regime, and practical complexity of the target corridor.

Keywords:

adaptive traffic signal control; multi-agent reinforcement learning; deep Q-network; tabular Q-learning; SUMO simulation; TraCI; fixed-time signal control; urban traffic congestion; queue length; traffic delay; Time-To-Collision; digital twin

1. Introduction

The intersection traffic signal control problem (ITSCP) is one of the many long-standing challenges in urban traffic management. It concerns the determination of appropriate signal-phase sequences, green durations, cycle lengths, and coordination strategies in response to changing traffic conditions. The problem is inherently complex because signal timing must account for stochastic vehicle arrivals, fluctuating demand levels, turning movements, geometric constraints, queue propagation, and operational objectives such as delay reduction, queue minimization, safety, and environmental performance [1]. With the continuous growth of urban populations and vehicle ownership, inefficient signal timing has become a major contributor to congestion, excessive delays, fuel consumption, and vehicular emissions [2].

Historically, traffic signal control strategies have evolved through several stages. The earliest and most widely implemented approach is fixed-time control, in which signal timings are predefined based on historical traffic patterns. Although fixed-time systems are simple, predictable, and easy to implement, they lack the flexibility required to respond to real-time fluctuations in traffic demand. Actuated control was later introduced to improve responsiveness by using detectors to extend or terminate green phases according to vehicle presence or gaps. However, actuated systems remain limited in their ability to optimize network-wide performance, particularly under asymmetric, oversaturated, or rapidly changing traffic conditions. Adaptive traffic signal control represents a more advanced stage, where signal plans are continuously adjusted using real-time traffic measurements. These systems improve operational performance but often depend on predefined models, calibrated parameters, or centralized optimization procedures that may become difficult to scale in large and complex networks [3,4].

In recent years, intelligent transportation systems (ITS) have increasingly incorporated artificial intelligence, sensing technologies, communication systems, and microscopic simulation platforms to improve traffic management. Within this development, reinforcement learning (RL) has emerged as a promising self-learning approach for adaptive traffic signal control. In RL-based signal control, each signalized intersection is treated as an agent that observes traffic states, selects signal actions, and updates its control policy according to rewards related to queue length, waiting time, delay, or other operational indicators. Early RL methods, such as Tabular Q-Learning, demonstrated the feasibility of learning adaptive signal policies; however, their scalability is limited when the state space becomes large or continuous [5,6].

Deep reinforcement learning (DRL) has therefore received growing attention as a more powerful framework for traffic signal control. By using neural networks to approximate value functions or policies, DRL can process high-dimensional traffic states and learn nonlinear relationships between traffic conditions and signal decisions. Several studies have shown that DRL-based controllers, including Deep Q-Networks, policy-gradient methods, and actor–critic models, can outperform fixed-time and conventional adaptive strategies under various traffic scenarios. Recent developments have also extended DRL applications toward multi-objective control, incorporating operational efficiency, emissions, and safety-related measures such as Time-To-Collision (TTC). Nevertheless, many DRL studies remain focused on isolated intersections or simplified synthetic networks, which limits the understanding of how these methods perform under coordinated multi-intersection and real-world corridor conditions [7,8,9].

Because urban traffic networks are naturally distributed systems, multi-agent reinforcement learning (MARL) has received increasing attention in traffic signal control research. In MARL-based control, each intersection can be represented as an autonomous agent that learns local signal decisions while interacting with neighboring intersections through traffic flow propagation. This structure is particularly suitable for arterial corridors and urban networks, where a decision at one intersection affects upstream queues, downstream discharge, and network-wide delay. However, MARL also introduces new challenges, including non-stationarity, coordination among agents, communication requirements, and the need to balance local and global objectives. Therefore, further investigation is required to clarify how different value-based multi-agent learning architectures perform under different demand levels, network geometries, and real-world traffic conditions [10,11].

Motivated by this need, the present study provides a detailed investigation of MARL for adaptive traffic signal control. This study compares three control strategies: a conventional fixed-time controller, a multi-agent Tabular Q-Learning controller, and a multi-agent Deep Q-Network (MADQN) controller. The evaluation is conducted using a SUMO-based microscopic traffic simulation integrated with Python 3.14.2 and TraCI. To examine both controlled and practical conditions, the controllers are tested on a synthetic two-intersection corridor and on a real-world digital twin of the 25 January Corridor in Assiut, Egypt. The controllers are evaluated under different traffic demand levels using queue- and delay-based performance measures, with TTC monitored as a complementary safety-related indicator.

The main contributions of this study can be summarized as follows:

It moves beyond isolated intersection evaluation by examining adaptive traffic signal control in multi-intersection corridor environments.
It provides a direct comparison between three control strategies under equivalent simulation conditions: fixed-time control, multi-agent Tabular Q-Learning, and multi-agent Deep Q-Network control.
It evaluates both tabular and deep-value-based MARL architectures under low-, medium-, and high-demand traffic scenarios, allowing their relative strengths and limitations to be assessed across different demand regimes.
It uses two complementary testbeds: a controlled synthetic two-intersection corridor and a real-world digital twin of the 25 January Corridor in Assiut, Egypt.
It assesses controller performance using queue length and cumulative delay as operational indicators, while monitoring Time-To-Collision as a complementary safety-related measure.
It provides practical insight into the conditions under which deeper MARL architectures are advantageous and when simpler tabular learning can remain competitive in corridor-level adaptive traffic signal control.

The remainder of this article is organized as follows. Section 2 reviews the relevant literature on traffic signal control, reinforcement learning, deep reinforcement learning, and multi-agent approaches. Section 3 describes the simulation environment, network topologies, calibration process, state representation, action space, and reward functions. Section 4 presents the algorithmic implementation of the fixed-time, Tabular Q-Learning, and MADQN controllers. Section 5 discusses the evaluation results achieved across the synthetic and real-world networks under different traffic demand scenarios. Finally, Section 6 summarizes the main findings, highlights the conclusions, and outlines directions for future research.

2. Literature Review

The intersection traffic signal control problem (ITSCP) has long been studied as a core challenge in urban traffic management because signal decisions directly affect delay, queue formation, travel time, emissions, and safety. ITSCP can be classified according to several dimensions, including network type, road-user composition, priority treatment, real-time control strategy, objective function, and signal-timing constraints. Reference [1] distinguished between isolated intersections, arterial corridors, and general urban networks and classified traffic signal control strategies into fixed-time, actuated, and adaptive approaches. Within this taxonomy, the present study focuses on arterial- and corridor-level adaptive control, where multiple signalized intersections interact through upstream and downstream traffic propagation.

Traditional fixed-time control remains widely used because it is simple, predictable, and easy to implement. However, its reliance on predefined timing plans makes it unsuitable for fluctuating and asymmetric traffic demand. Actuated control improves responsiveness by using detector information, yet it remains limited when network-wide coordination is required. Adaptive control, therefore, represents a more suitable direction for dynamic urban corridors, particularly when real-time traffic states can be used to update signal decisions. Simulation-based optimization has also contributed to this field. For example, reference [12] proposed a GA–SUMO simheuristic framework in which candidate green-time combinations are generated by a genetic algorithm and evaluated in SUMO using queue length, waiting time, time loss, and emissions. Their results showed improvements over PSO- and Webster-based timing. However, such approaches optimize timing plans iteratively under predefined conditions rather than learning decentralized policies that respond continuously to real-time traffic states.

The realism of adaptive traffic control evaluation depends strongly on the microscopic simulation environment used to represent traffic dynamics. SUMO has become a widely adopted platform for testing intelligent transportation systems because it supports detailed traffic flow modeling, signal control, routing, and external controller integration through TraCI. Reference [13] emphasized that SUMO represents junction movement through internal lanes, entry and exit links, right-of-way matrices, and conflict relationships, which are important because queue formation, delay accumulation, and safety-related indicators depend on how vehicles enter and clear intersections. Reference [14] demonstrated the feasibility of integrating RL with SUMO–TraCI for real-time traffic management decisions through a Q-learning-based vehicle route optimization framework. Similarly, reference [15] used SUMO to evaluate an intelligent traffic management framework in which a Deep-Neuro-Fuzzy model assigned dynamic road-segment weights under changing weather, road, and traffic conditions. Although these studies did not focus primarily on traffic signal control, they confirm the value of SUMO as a flexible platform for testing intelligent traffic management strategies before field deployment.

Recent studies have further highlighted the importance of realistic and calibrated simulation when evaluating adaptive control methods. Reference [5] proposed Crossflow, a Python-based SUMO tool that converts real-world turning movement counts into simulation-ready traffic flows for single or multiple intersections, showing that simulated counts can closely reproduce observed traffic conditions. Reference [16] similarly emphasized the need for a full “simulate-before-deploy” pipeline that includes realistic network geometry, signal-controller replication, detector mapping, scenario generation, calibration, and post-deployment comparison. These studies are directly relevant to the present work because reinforcement learning controllers can only be meaningfully evaluated when traffic demand, geometry, and operational conditions are sufficiently realistic.

RL has emerged as a promising approach for adaptive signal control because it allows signal agents to learn from interaction with the traffic environment. In early value-based formulations, Q-learning was used to estimate the long-term value of signal actions under different traffic states. Reference [17] proposed a neural network Q-learning framework for a single SUMO intersection, using lane-level halting vehicle counts derived from vehicle positions and speeds as the traffic state. Their controller outperformed fixed-time, gap-based, and time-loss-based strategies under multiple synthetic demand patterns, including major/minor road imbalance, turning imbalance, tidal flow, and time-varying demand. However, because the framework was limited to a single intersection, it motivates further investigation into multi-intersection and corridor-level control.

DRL extends traditional Q-learning by using neural networks to approximate value functions or policies in larger and more complex state spaces. Several studies have shown that DRL controllers can improve signal performance compared with conventional timing strategies. Reference [18], for example, proposed a hybrid KNN–DQN system in which KNN classifies traffic states using historical traffic features, while DQN adapts signal decisions through SUMO interaction. The hybrid controller reduced average waiting time and stop frequency and improved flow rate compared with fixed-time, KNN-only, and DRL-only baselines. Reference [19] proposed a Grouping-DQN framework that improves state representation by dividing approach lanes into spatial groups and assigning higher weights to cells closer to the intersection. Their dual-channel CNN processed vehicle position and speed matrices and achieved better queue and waiting-time performance than fixed timing, standard DQN, and A2C. Reference [20] addressed the computational burden of DQN training by introducing transfer learning, where previously learned neural-network weights are reused under related traffic demand or connected and autonomous vehicle penetration scenarios. Their results showed faster convergence and improvements in waiting time and fuel consumption, although their work remained focused on a single isolated intersection.

Policy-gradient and actor–critic approaches provide another important branch of DRL-based traffic control. Reference [21] presented an early vision-based policy-gradient controller that used raw image pixels from a Unity3D traffic simulator (version 2018.3.x) and processed them through a CNN to select traffic light actions. Although the experiment was limited to a simple two-phase intersection, it showed the feasibility of end-to-end visual DRL for signal control. Reference [6] proposed a PPO-TSC framework using lane queue length and vehicle waiting time as both state- and reward-related indicators. Their SUMO-based evaluation at a real-world single intersection showed that PPO-TSC reduced average travel time and time loss and improved average speed under peak conditions. Reference [8] extended policy-gradient control to distributed cooperative intersection management by combining local detector measurements with GNN-based prediction embeddings from neighboring intersections. Their PPO-based controller outperformed fixed-time and Webster baselines on a nine-intersection SUMO grid. These studies demonstrate the potential of policy-gradient methods, but they also show that much of the literature either focuses on isolated intersections or employs architectures that differ from the value-based multi-agent controllers examined in the present study.

RL has also been applied to vehicle-side and autonomous driving control problems, which provide useful methodological insights for traffic signal research. Reference [22] used PPO within a SUMO–Flow–RLlib framework to control leading autonomous vehicles at a non-signalized intersection under mixed-autonomy conditions. Their results showed that higher autonomous vehicle penetration improved speed, reduced delay, and lowered fuel consumption and emissions. Reference [10] applied PPO to high-level autonomous driving decisions across signalized, stop sign, uncontrolled, and mixed intersection scenarios using SUMO and CARLA. Reference [23] also showed that SUMO–TraCI can support communication-enabled autonomous platooning, where platoon leaders are controlled externally and follower vehicles use modified car-following logic. Although these studies do not directly optimize traffic signal phases, they confirm that compact state representations, simulation-based RL, and cooperative control can support future integration between infrastructure-side signal agents and connected or autonomous vehicles.

Because urban signalized networks are naturally distributed systems, multi-agent reinforcement learning (MARL) has become increasingly important for adaptive traffic signal control. Recent MARL surveys classify intelligent transportation methods according to both coordination structure and learning algorithm. Coordination structures include centralized training with centralized execution, centralized training with decentralized execution, and decentralized training with decentralized execution, while algorithmic categories include value-based methods, policy-gradient approaches, actor–critic frameworks, and communication-enhanced architectures [11]. This classification is useful for positioning the present study, which focuses on value-based MARL using multi-agent Tabular Q-Learning and multi-agent Deep Q-Network control under a decentralized operational logic.

Reference [2] developed one of the foundational MARL-based adaptive traffic signal control systems, MARLIN-ATSC, in which each signalized intersection is modeled as a learning agent. Their framework compared independent and integrated multi-agent control and showed that coordinated decentralized learning can improve network-level traffic performance compared with conventional fixed-time and actuated control. Reference [7] later proposed a cooperative DRL framework in which each intersection is controlled by a DQN agent, and neighboring agents exchange state, action, and reward information during policy learning. Their experiments on 2 × 2 and 2 × 3 SUMO grid networks showed improvements in average waiting time and queue length compared with several MARL baselines. Reference [24] further demonstrated the effectiveness of cooperative DQN-based MARL in a six-intersection SUMO network, where the MARL controller achieved lower average waiting times and average travel times than fixed-cycle, actuated, and policy-gradient single-agent controllers. These studies support the operational value of decentralized adaptive signal control over static or semi-actuated timing strategies.

More recent MARL studies have attempted to address coordination, non-stationarity, environmental performance, and transferability. Reference [9] introduced a causal-inference-based multi-agent reinforcement learning framework to capture unobserved time-varying dynamics caused by changing neighboring-agent policies. Their method jointly optimizes causal inference and MARL objectives and demonstrated transfer from synthetic environments to real traffic scenarios. Reference [25] proposed a cooperative multi-intersection DQN framework for CO₂ emission reduction, where each intersection agent considers its own state, action, and reward, as well as selected information from adjacent intersections. Tested on a real-world six-intersection SUMO corridor in Icheon City, their controller reduced cumulative waiting time and CO₂ emissions compared with fixed-time control. These studies show that explicit coordination can improve arterial-level performance beyond isolated intersection optimization.

Safety and environmental objectives have also become increasingly important in DRL and MARL signal control research. Reference [4] compared PPO, A2C, and DQN under reward functions targeting delay, stopped vehicles, TTC-based conflicts, and combined efficiency–safety–emission objectives. Their experiments on synthetic and real-world intersections showed that PPO generally achieved stable convergence, while their multi-intersection experiment found that decentralized control reduced delay, conflicts, and emissions more effectively than centralized control. Reference [26] proposed a formally constrained RL framework in which Time-To-Collision is used to detect vehicle-level conflicts during queue formation. Their approach checks whether TTC remains above a safety threshold and applies speed adaptation when conflicts are detected. In addition, reference [27] demonstrated that start-up lost time can significantly increase the congestion ratio, green time requirements, and average travel time at signalized intersections. These findings suggest that adaptive RL controllers should avoid excessive phase switching and be evaluated not only by operational efficiency but also by safety-related indicators and realistic signal transition constraints.

To further clarify the position of the present study relative to existing work, Table 1 summarizes representative recent studies on RL-, DRL-, and MARL-based traffic signal control. The comparison highlights the control method, simulation or test environment, main evaluation indicators, reported limitations, and the specific gap addressed by the present study.

Overall, the literature demonstrates strong progress in applying RL, DRL, and MARL to adaptive traffic signal control. However, several gaps remain. Many DQN and PPO studies are still limited to isolated intersections or simplified synthetic networks. Some cooperative MARL studies evaluate grid networks but do not directly compare tabular and deep-value-based multi-agent controllers under the same demand conditions. Other studies emphasize policy-gradient, transfer-learning, causal inference, or environmental objectives, but fewer works provide a focused comparison between multi-agent Tabular Q-Learning and multi-agent Deep Q-Network control in both controlled and realistic corridor environments. In addition, real-world digital-twin testing remains essential because signal control performance can change substantially under asymmetric geometry, realistic turning movements, and demand variations.

Accordingly, the present article contributes to the literature by providing a direct comparative evaluation of fixed-time control, multi-agent Tabular Q-Learning, and multi-agent Deep Q-Network control for adaptive traffic signal management. This study uses a SUMO–Python–TraCI simulation framework and evaluates the controllers under different traffic demand levels using queue- and delay-based performance measures, with TTC monitored as a complementary safety-related indicator. Unlike studies limited to isolated intersections, this article examines multi-agent control in corridor settings, including both a synthetic two-intersection testbed and a real-world digital twin of the 25 January Corridor in Assiut, Egypt. This allows the study to clarify the relative strengths and limitations of tabular and deep-value-based MARL controllers under varying network complexities, demand intensity, and operational performance objectives.

3. Simulation Environment

The experiments were conducted using the SUMO microscopic traffic simulator, an open-source platform widely used for planning, modeling, and analyzing urban traffic networks and frequently adopted for MARL-based traffic signal control experimentation [11]. Recent work in [5] further demonstrates SUMO’s suitability for intersection-level studies by converting real-world turning movement count data into SUMO-compatible vehicle flows and validating simulated counts against observed traffic volumes. The suitability of SUMO for intelligent traffic management prototyping is also supported by [15], whose authors used SUMO to evaluate an intelligent traffic management system (ITMS) framework with dynamic road-segment weights, route-choice behavior, traffic light simulation, and vehicular communication scenarios before real-world deployment.

Beyond its role as a microscopic traffic simulator, SUMO provides an explicit intersection-traversal model in which vehicles move through junctions using lane-to-lane links and internal lanes rather than abstract point-node movements. This is important for the present MARL evaluation because signal actions influence not only upstream queue discharge but also vehicle progression through internal junction space, right-of-way conflicts, and downstream queue propagation. Therefore, the use of SUMO supports a more realistic assessment of queue length, delay, and safety-related behavior under adaptive traffic signal control [13].

In the present study, SUMO is integrated with Python-based AI control through the Traffic Control Interface (TraCI), enabling real-time interaction between the reinforcement learning agents and the simulated traffic environment, consistent with earlier SUMO–TraCI reinforcement learning applications in both route optimization and traffic signal control [14,23,24,25]. This SUMO-centered architecture is also consistent with Flow/RLlib-based deep RL traffic studies, where SUMO provides microscopic vehicle dynamics while an external RL library updates policies from observed traffic states and returns control actions to the simulator [22]. The flexibility of SUMO–TraCI has also been demonstrated in autonomous platooning simulations, where platoon leaders were controlled by an external application while follower vehicles were managed through a modified car-following model inside SUMO. This supports the suitability of SUMO for future connected-vehicle extensions, although the present study intentionally focuses on detector-based traffic signal control rather than vehicle longitudinal control [23].
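As a rough illustration of this SUMO–TraCI coupling, the following Python sketch shows how an external controller might read halting-vehicle counts and apply a keep-or-switch decision at each simulation step. It is not the controller evaluated in this study: the configuration file name `corridor.sumocfg`, the traffic light ID `tls_0`, the 10 s minimum green, and the queue-comparison rule are hypothetical placeholders, and the rule only stands in for a learned policy. The 0.1 s step length is the value reported for this simulation.

```python
from typing import Dict, Sequence, Tuple

STEP_LENGTH_S = 0.1  # microscopic time step reported for this study


def split_queues(signal_state: str, controlled_lanes: Sequence[str],
                 halting_by_lane: Dict[str, int]) -> Tuple[int, int]:
    """Return (served, waiting) halting counts for the current signal state.

    signal_state has one character per controlled link (e.g. "GGrr");
    controlled_lanes lists the incoming lane of each link in the same order.
    """
    served, waiting = set(), set()
    for ch, lane in zip(signal_state, controlled_lanes):
        (served if ch in "Gg" else waiting).add(lane)
    waiting -= served  # a lane with at least one green link counts as served
    total = lambda lanes: sum(halting_by_lane.get(l, 0) for l in lanes)
    return total(served), total(waiting)


def keep_or_switch(served_queue: int, waiting_queue: int,
                   green_elapsed_s: float, min_green_s: float = 10.0) -> int:
    """Placeholder decision rule (assumption): 0 = keep phase, 1 = switch."""
    if green_elapsed_s < min_green_s:
        return 0
    return 1 if waiting_queue > served_queue else 0


def run(sumocfg: str = "corridor.sumocfg", tls_id: str = "tls_0") -> None:
    """Drive SUMO through TraCI; requires a local SUMO installation."""
    import traci  # imported lazily so the helpers above work without SUMO

    traci.start(["sumo", "-c", sumocfg, "--step-length", str(STEP_LENGTH_S)])
    try:
        lanes = traci.trafficlight.getControlledLanes(tls_id)
        last_phase, green_elapsed = traci.trafficlight.getPhase(tls_id), 0.0
        while traci.simulation.getMinExpectedNumber() > 0:
            traci.simulationStep()
            phase = traci.trafficlight.getPhase(tls_id)
            green_elapsed = green_elapsed + STEP_LENGTH_S if phase == last_phase else 0.0
            last_phase = phase
            state = traci.trafficlight.getRedYellowGreenState(tls_id)
            if "G" not in state and "g" not in state:
                continue  # yellow or all-red transition: no decision taken
            halting = {l: traci.lane.getLastStepHaltingNumber(l) for l in set(lanes)}
            served, waiting = split_queues(state, lanes, halting)
            if keep_or_switch(served, waiting, green_elapsed) == 1:
                # End the running phase now; the signal program then advances.
                traci.trafficlight.setPhaseDuration(tls_id, 0)
    finally:
        traci.close()
```

In this sketch, a "keep" decision simply leaves the loaded signal program running, so the phase still ends at its programmed duration. A learning agent would replace `keep_or_switch` and would typically act at a coarser decision interval than the 0.1 s simulation step.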

The simulation uses a 0.10 s microscopic time step. Similar SUMO–TraCI architectures have recently been used for multi-objective DRL traffic signal control in both synthetic and real-world intersection scenarios [4]. This SUMO–Python interaction framework is also consistent with formally constrained RL studies, where TTC values are continuously extracted from simulations and checked against safety constraints during the learning process [26]. A related intersection-RL study also used SUMO as a lightweight first-stage simulator before transferring the trained policy to CARLA, showing that simple simulation can accelerate early RL training but that vehicle dynamics and simulator differences introduce a non-trivial domain-adaptation challenge [10]. This supports the present use of SUMO for controlled algorithmic comparison while also indicating that future deployment-oriented validation should consider higher-fidelity simulation or field-calibrated dynamics.

3.1. Formal MDP/MARL Formulation and Agent Interaction Model

Before defining the simulation-specific state, action, and reward variables, the adaptive traffic signal control problem is formally described as a reinforcement learning decision process. For a single signalized intersection, the control problem can be represented as a Markov Decision Process (MDP), defined by the tuple,

$$M = \langle S, A, P, R, \gamma \rangle,$$

where $S$ denotes the set of traffic states observed by the controller, $A$ is the set of feasible signal control actions, $P(s' \mid s, a)$ is the transition probability from state $s$ to the next state $s'$ after applying action $a$, $R(s, a, s')$ is the immediate reward function, and $\gamma \in [0, 1]$ is the discount factor that determines the relative importance of future rewards. In the context of traffic signal control, the state includes detector-derived queue information and the current signal phase. The action corresponds to maintaining the current phase or switching to the next feasible phase, and the reward penalizes operational inefficiency such as queue accumulation or cumulative delay.
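To make the roles of the state, action, reward, and discount factor concrete, the following Python sketch encodes them for a single intersection and applies the standard one-step Q-learning update. The names `SignalState`, `queue_reward`, and `q_update`, as well as the learning rate and discount values, are illustrative assumptions and do not reproduce the state encoding or hyperparameters used in this study.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Tuple

KEEP, SWITCH = 0, 1  # action set A: hold the current phase or advance to the next


@dataclass(frozen=True)
class SignalState:
    """One element of S: detector-derived queues plus the active phase index."""
    queues: Tuple[int, ...]
    phase: int


QTable = DefaultDict[SignalState, List[float]]


def make_q_table() -> QTable:
    """Q-table with one value per action, initialised to zero."""
    return defaultdict(lambda: [0.0, 0.0])


def queue_reward(next_state: SignalState) -> float:
    """R(s, a, s'): negative total queue, so shorter queues score higher."""
    return -float(sum(next_state.queues))


def q_update(q: QTable, s: SignalState, a: int, r: float, s_next: SignalState,
             alpha: float = 0.1, gamma: float = 0.95) -> float:
    """Standard Q-learning step: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


# Usage: one transition in which keeping the phase clears one queued vehicle.
q = make_q_table()
s = SignalState(queues=(3, 1), phase=0)
s_next = SignalState(queues=(2, 1), phase=0)
new_value = q_update(q, s, KEEP, queue_reward(s_next), s_next, alpha=0.5, gamma=0.9)
print(new_value)  # -1.5
```

With an all-zero table, the target equals the immediate reward of -3, so the update moves Q(s, KEEP) halfway to it. A larger $\gamma$ would weight the estimated value of the next state more heavily once the table is populated.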

For a corridor with multiple signalized intersections, the problem becomes a multi-agent reinforcement learning (MARL) problem because each traffic light controller acts as an autonomous agent while being dynamically coupled with neighboring intersections through traffic flow propagation. The MARL system is formally defined by the tuple

$$G = \langle \mathcal{N}, S, \{A_i\}_{i=1}^{N}, P, \{R_i\}_{i=1}^{N}, \{O_i\}_{i=1}^{N}, \gamma \rangle,$$

where $\mathcal{N} = \{1, 2, \dots, N\}$ is the set of signalized intersection agents, $S$ is the global traffic state of the corridor, $A_i$ is the action space of agent $i$, $P(s' \mid s, a)$ is the environment transition function under the joint action $a = (a_1, a_2, \dots, a_N)$, $R_i$ is the reward received by agent $i$, $O_i$ is the local observation space available to agent $i$, and $\gamma$ is the discount factor. At each decision step $t$, the global state is denoted by $s_t \in S$, while each agent observes an input state $s_i^t$ (or observation $o_i^t$), selects an action $a_i^t \in A_i$, and contributes to the joint action vector $a_t = (a_1^t, a_2^t, \dots, a_N^t)$. The SUMO environment then evolves according to $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$, and each agent receives an immediate reward $r_i^t = R_i(s_t, a_t, s_{t+1})$. The objective of each agent is to learn a policy $\pi_i$ that maximizes the expected discounted return,

$$J_i[\pi_i] = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_i^t\right],$$

where $T$ is the simulation horizon. In this study, the reward is defined as a negative operational cost, based either on queue length or cumulative delay. Therefore, maximizing the return is equivalent to minimizing congestion-related penalties over time.
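As a small numerical illustration of the discounted return defined above, the following sketch evaluates the sum for one agent over a short finite horizon. The function name and the reward values are hypothetical and are not taken from the experiments.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * r_t for one agent over a finite horizon."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards: negative queue counts at three decision steps.
example_rewards = [-4.0, -2.0, -1.0]
# -4 + 0.5 * (-2) + 0.25 * (-1) = -5.25
print(discounted_return(example_rewards, gamma=0.5))
```

Because every reward is a negative cost, a return closer to zero corresponds to less accumulated congestion.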

The interaction among agents is indirect and occurs through the traffic environment. A green extension or phase switch at an upstream intersection affects vehicle discharge, downstream arrivals, queue spillback, and subsequent detector readings at neighboring intersections. Thus, although each agent executes its action locally, its decision changes the future observations and rewards of other agents. This coupling is expressed by the dependence of the next global state $s_{t+1}$ on the joint action vector $a_t$ rather than on any single isolated action.

Although the above formulation is related to the centralized training with decentralized execution (CTDE) paradigm, the implementation adopted in this study should not be interpreted as a strict CTDE architecture. In a strict CTDE setting, agents are typically trained using centralized information through a centralized critic, a value-decomposition module, a mixing network, or a centralized parameter-update mechanism, while decentralized policies are used during execution. This can be generally expressed as

$$Q_{\mathrm{tot}}(s_t, a_t) = f\left(Q_1(o_1^t, a_1^t), Q_2(o_2^t, a_2^t), \dots, Q_N(o_N^t, a_N^t)\right)$$

or, in actor-critic form, as a centralized critic $Q_i^{c}(s_t, a_t)$ that evaluates the joint state-action configuration during training while each agent executes its local policy independently.

By contrast, the present work implements an independent value-based MARL structure. Each traffic signal controller is modeled as an independent agent with its own Q-table in the tabular case or its own DQN model and replay buffer in the deep-value-based case. At decision step $t$, each agent independently selects a binary signal control action, $a_i^t \in \{\text{maintain}, \text{switch}\}$, according to its own policy or value function. The joint action applied to the SUMO environment is, therefore, $a_t = (a_1^t, a_2^t, \dots, a_N^t)$. However, the learning updates are performed independently for each agent rather than through a centralized critic or a joint value-decomposition mechanism. In some experiments, the state input used by each agent is based on a compressed network-level detector representation rather than a strictly local observation limited to a single intersection. This provides the agents with partial global traffic awareness. Nevertheless, the architecture remains independent MARL because each agent maintains its own value function, replay memory, and parameter updates, and no centralized critic, value-mixing function, or centralized training module is used. During execution, the decentralized action rule remains

$$a_i^t = \pi_i(s_i^t)$$

and the joint policy can be written as

$$\pi(a_t \mid s_t) = \prod_{i=1}^{N} \pi_i(a_i^t \mid s_i^t).$$

This formulation accurately reflects the implemented controller structure: traffic signal agents interact through the shared traffic environment, but their learning and decision-making processes are value-based, independent, and decentralized.
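The independent structure described above can be sketched for the tabular case as follows. This is a minimal illustration, not the authors' implementation: the class name, the epsilon-greedy exploration rule, and the values of `alpha`, `gamma`, and `epsilon` are assumptions introduced here. The sketch shows that each agent owns its Q-table and update rule, and that the joint action is simply the tuple of independently chosen local actions.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence, Tuple

ACTIONS = (0, 1)  # 0 = maintain current phase, 1 = switch to next phase


class IndependentQAgent:
    """One tabular Q-learning agent per intersection (no shared critic)."""

    def __init__(self, alpha=0.1, gamma=0.95, epsilon=0.1, seed=None):
        # alpha, gamma, and epsilon are hypothetical values for illustration.
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: [0.0, 0.0])  # state -> [Q(maintain), Q(switch)]

    def act(self, state: Hashable) -> int:
        """Epsilon-greedy choice using only this agent's own Q-table."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        values = self.q[state]
        return max(ACTIONS, key=lambda a: values[a])

    def update(self, state, action, reward, next_state) -> None:
        """Standard one-step Q-learning update, local to this agent."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def joint_action(agents: Sequence[IndependentQAgent], states: Sequence[Hashable]) -> Tuple[int, ...]:
    """Joint action = tuple of independently selected local actions."""
    return tuple(agent.act(state) for agent, state in zip(agents, states))
```

Coupling between agents enters only through the environment: the next state and reward passed to each `update` call depend on the joint action, even though no agent reads another agent's Q-table.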

3.2. Network Topologies: Synthetic vs. Real-World Validation

To comprehensively evaluate the RL controllers, experiments were conducted across two distinct network topologies. This two-level design is consistent with the ITSCP taxonomy in which test networks are commonly distinguished as isolated intersections, arterial networks, or general networks; the present study deliberately moves beyond isolated intersection evaluation by testing a two-intersection corridor and a real-world arterial digital twin [1]. To improve the clarity of the experimental setup, the network figures were annotated to show traffic flow directions, intersection identifiers, detector locations, and the control area associated with each signalized junction. The two networks are as follows:

The Synthetic “Two-Junction” Network: A plain, synthetic two-intersection corridor constructed entirely within the SUMO environment. It serves as the foundational, controlled testbed. By isolating variables within a symmetrical geometry, this network allows for a clear observation of baseline multi-agent coordination without the noise of real-world road irregularities; see Figure 1. In this network, 24 detectors are distributed to detect traffic movement.
The Real-World “25 January” Corridor (Assiut, Egypt): To assess the practical viability of the framework, the agents were subjected to a digital twin of the “25 January” intersection, a major real-world arterial corridor located in Assiut, Egypt. This network features complex, asymmetric geometries, varied lane capacities, and realistic turning movement proportions with 16 distributed detectors; see Figure 2. This digital-twin design is consistent with the study by [16], which showed that realistic SUMO modeling requires accurate lane geometry, traffic light configuration, detector mapping, and movement-specific trajectories, because map-based networks such as OSM may contain lane-count, direction, and signal-layout inaccuracies that reduce simulation credibility. It also agrees with recent data-driven SUMO intersection studies, which show that real-world intersection geometry and movement-specific demand are essential for producing credible microscopic traffic scenarios and for evaluating signal control interventions under true-to-life conditions.

Across both networks, the environmental state is monitored in real time via 60 m lane-area detectors placed on incoming road edges. These detectors simulate physical induction loops or camera-based tracking zones, feeding live halting vehicle counts directly to the RL agents [26].

3.3. Calibration and Validation

Calibration was treated as a prerequisite for credible RL evaluation rather than as a post-processing step, consistent with calibrated SUMO studies that use field-detector data, traffic volumes, vehicle parameters, and signal settings to evaluate intersection performance [27]. Earlier ITMS simulation work similarly emphasized that traffic management models require simulation or emulation before field implementation, particularly when real-time infrastructure and direct sensor integration are not yet available [15]. This approach is supported in the study by [16], which argued that calibrated SUMO models bridge the gap between simulation and field operation by matching geometry, signal-controller behavior, vehicle distribution, and driver behavior parameters before testing traffic control strategies.

Before evaluating the performance of reinforcement learning (RL) agents, it is critical to ensure that the simulation environment accurately reflects real-world traffic behavior. Calibration and validation of traffic demand, vehicle behavior, and theoretical benchmarks form the foundation of scientific rigor in this study, allowing meaningful assessment of RL control strategies under realistic conditions.

3.3.1. Grounding of Traffic Demand and Flow Assumptions

To replicate realistic urban intersection conditions, traffic volumes were carefully scaled across three demand levels, representing off-peak, standard, and peak traffic conditions:

Low Demand: Approximately 42 vehicles/h per lane, simulating light, off-peak flows typical of early morning or late-night periods.
Medium Demand: Roughly 350 vehicles/h per lane, representing standard operational conditions in urban arterials.
High Saturation Demand: Scaled from 840 to 1400 vehicles/h per lane, depending on the network topology, designed to push intersections towards peak congestion and test system resilience.

These values are consistent with empirical ranges for urban arterials, as reported in standard traffic engineering references, including the Highway Capacity Manual [28]. Such grounding ensures that improvements in agent performance are meaningful in real-world contexts rather than artifacts of arbitrary simulation parameters.

Turning movements were carefully configured to reflect network-specific geometry and traffic patterns, preserving realism while isolating the effects of traffic volume on congestion:

Synthetic Two-Junction Network: A controlled, symmetrical environment with 70% straight, 20% left, and 10% right turns. This configuration establishes a baseline for algorithmic evaluation, enabling a clear observation of agent coordination without the complexity of real-world asymmetry.
25 January Corridor (Digital Twin, Assiut, Egypt): An asymmetric real-world arterial, calibrated to 82% straight, 12% major turn, and 6% minor turn. These proportions reflect the dominance of through-movements observed on real signalized arterials and introduce realistic directional imbalances that challenge RL agents to coordinate adaptive signal timing effectively. This scenario-based design is consistent with earlier SUMO-based Q-network research showing that RL signal controllers should be trained and evaluated under structurally different demand patterns, including major/minor road imbalances, through/left-turn imbalances, tidal traffic, and time-varying traffic demands [17].

Vehicle arrivals were modeled using a Poisson process, a widely accepted assumption in traffic flow theory for representing stochastic and independent arrivals [29]. This approach introduces realistic variability in arrival patterns, ensuring that RL agents encounter naturally fluctuating traffic streams rather than artificially uniform flows.
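A Poisson arrival stream is equivalent to independent, exponentially distributed headways, which the following sketch uses to sample arrival times on a single lane. The function name and seed are hypothetical, and the paper does not state how the arrivals were generated inside SUMO, so this is only an illustration of the assumption, not of the simulation setup.

```python
import random
from typing import List


def poisson_arrival_times(rate_veh_per_hour: float, horizon_s: float, seed: int = 0) -> List[float]:
    """Sample arrival times (s) of a Poisson process on one lane."""
    if rate_veh_per_hour <= 0:
        return []
    rng = random.Random(seed)
    rate_per_s = rate_veh_per_hour / 3600.0
    times = []
    # Headways of a Poisson process are i.i.d. exponential with mean 1/rate.
    t = rng.expovariate(rate_per_s)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# Medium-demand level from the text: roughly 350 vehicles/h per lane.
arrivals = poisson_arrival_times(350.0, horizon_s=3600.0)
print(len(arrivals))  # varies around 350 from seed to seed
```

The count over one hour fluctuates around the nominal rate, which is the variability the agents are exposed to during training.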

3.3.2. Calibration of Vehicle Behavior

The simulation framework was implemented in SUMO (Simulation of Urban Mobility), a validated microscopic traffic simulator that incorporates empirical vehicle dynamics [30]. Vehicle parameters were carefully calibrated:

Maximum Speed: 13.9 m/s (≈50 km/h), representing typical urban arterial speed limits.

Acceleration/Deceleration: 2.6 m/s² and 4.5 m/s², respectively, reflecting realistic driver responses to traffic signals and neighboring vehicles.
Driver Imperfection Parameter (σ = 0.5): Introduced to simulate human variability, capturing stop-and-go behavior, heterogeneous vehicle response times, and non-uniform queue formation. This parameter choice is methodologically consistent with the study by [31], which found that driver imperfection (sigma) and reaction time (tau) had a stronger influence on simulated queue length and travel time than parameters such as minimum gap, acceleration, or deceleration. This supports the need to explicitly calibrate driver behavior parameters when modeling heterogeneous urban traffic.

This calibration ensures that vehicle trajectories and interactions reflect microscopic traffic behavior, providing a rich, realistic environment for RL agents to learn adaptive control policies.

3.3.3. Theoretical Validation Using Webster’s Delay Model

To ensure analytical fidelity, the performance of fixed-time control in SUMO was benchmarked against Webster’s delay model, a classical method for estimating average vehicle delay at signalized intersections under uniform arrival conditions [32]. In the present study, Webster’s delay formulation was used only as a theoretical reference for interpreting fixed-time delay behavior and not as an optimization method for generating the implemented signal timings. The SUMO fixed-time signal programs were predefined static timing plans, whereas the RL and MARL controllers adjusted their maintain/switch decisions dynamically based on detector-derived queue states. Therefore, Webster’s model was used to provide an analytical benchmark under conventional signal control assumptions, while the operational baseline in all simulation experiments remained the implemented SUMO fixed-time controller.

Low and Medium Demand: Simulation delays were consistently lower than Webster estimates, which is expected because SUMO incorporates stochastic arrivals through the Poisson process, while Webster assumes deterministic uniform arrivals. This validates that the simulation behaves realistically under non-saturated conditions.
High Demand (1400 vehicles/h): The degree of saturation (X = 3.18) exceeds the valid range of Webster’s formula, which applies only to undersaturated conditions (X < 1) [28]. Despite this, the simulation produced a finite mean delay of 129.8 s, demonstrating that fixed-time signals fail to handle oversaturation. This scenario provides a quantitative justification for testing adaptive RL controllers, highlighting the limitations of traditional methods under extreme demand.
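For reference, the benchmark above can be sketched with the commonly cited three-term form of Webster's delay expression. The paper does not state which variant or which saturation flow was used, so the 1800 vehicles/h saturation flow below is a hypothetical value; the 180 s cycle and 42 s green are taken from the synthetic fixed-time plan described in Section 4.1. The sketch also shows why the formula cannot be evaluated once the degree of saturation reaches 1.

```python
def webster_delay(cycle_s: float, green_s: float,
                  flow_veh_h: float, sat_flow_veh_h: float) -> float:
    """Average delay per vehicle (s) from Webster's three-term formula."""
    if flow_veh_h <= 0:
        raise ValueError("flow must be positive")
    lam = green_s / cycle_s          # effective green ratio g/C
    q = flow_veh_h / 3600.0          # arrival flow in vehicles/s
    s = sat_flow_veh_h / 3600.0      # saturation flow in vehicles/s
    x = q / (lam * s)                # degree of saturation
    if x >= 1.0:
        raise ValueError(f"degree of saturation x={x:.2f} >= 1: formula not valid")
    uniform_term = cycle_s * (1.0 - lam) ** 2 / (2.0 * (1.0 - lam * x))
    random_term = x ** 2 / (2.0 * q * (1.0 - x))
    correction = 0.65 * (cycle_s / q ** 2) ** (1.0 / 3.0) * x ** (2.0 + 5.0 * lam)
    return uniform_term + random_term - correction


# Medium demand on one approach; the saturation flow is an assumed value.
print(webster_delay(cycle_s=180.0, green_s=42.0, flow_veh_h=350.0, sat_flow_veh_h=1800.0))

# High demand: the same inputs with 1400 vehicles/h give x > 1 and raise ValueError.
```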

3.3.4. Summary and Implications for RL Evaluation

By combining empirical traffic volumes, network-specific turning ratios, stochastic vehicle arrivals, and validated vehicle dynamics, this calibration ensures that the simulation environment is both realistic and scientifically robust. These measures provide a controlled yet challenging framework for evaluating RL-based traffic signal controllers:

Traffic patterns remain consistent, allowing fair comparison across demand scenarios.
Vehicle behavior variability ensures that RL agents encounter realistic uncertainties.
Benchmarking against theoretical models provides quantitative evidence of the inadequacy of fixed-time control under high congestion, establishing a baseline for improvement.

Ultimately, this meticulous calibration underpins the validity of all subsequent RL experiments and ensures that observed performance gains reflect genuine adaptive capabilities rather than artifacts of the simulation setup [31]. This is consistent with recent evidence that empirically grounded traffic volume and turning movement inputs reduce the risk of “garbage-in, garbage-out” simulation outcomes and improve the credibility of traffic control evaluations in SUMO.

3.4. State Space Formulation

The state representation relies heavily on real-time queue dynamics. The number of queued (halting) vehicles on a lane-area detector is defined as Equation (1).

This detector-based formulation avoids assuming full CAV connectivity, which is important because DQN-TSC performance can depend strongly on the available vehicle information level under mixed CAV–HDV traffic [20].

$$q_i^t = \left| \{ k \in V_i^t \mid v_k(t) \leq \epsilon \} \right| \tag{1}$$

where $V_i^t$ = the set of all vehicles present within the detection zone of sensor $i$ at time step $t$; $k$ = an individual vehicle within that set; $v_k(t)$ = the velocity (speed) of vehicle $k$ at time $t$; $\epsilon$ = a small velocity threshold defining a “halted” or stopped state (e.g., 0.1 m/s); $|\cdot|$ = the cardinality operator, i.e., the total number of elements in the set.

At any decision step $t$, the raw state vector $S_{\mathrm{raw}}$ is a concatenation of the halting vehicle counts at all detectors ($q$) and the current active phase indices of the two traffic lights ($p$).

This queue-based state representation is consistent with [17], who encoded a signalized intersection state using lane-level halting vehicle counts derived from real-time vehicle position and speed, demonstrating that compact queue vectors can support neural network Q-learning without requiring full vehicle-trajectory prediction. A richer alternative is to augment local detector states with predictive graph embeddings from neighboring intersections, as shown in [8], although the present study retains compact detector-count states to preserve computational simplicity and direct interpretability.

This compact state design is consistent with graph-based MARL formulations in which each intersection agent observes local and neighboring traffic information, such as waiting time and approaching vehicle waves, to support cooperative signal control under partial observability [9]. The use of halting vehicle counts is supported in [24], who argued that halting numbers provide a cleaner traffic-state signal than raw occupancy because they focus on vehicles moving below a stopping-speed threshold and reduce noise from newly spawned or transient vehicles. This supports the present detector-based state formulation, in which queued vehicles are represented directly through real-time lane-area halting counts:

$$S_{\mathrm{raw}} = [q_1, q_2, \dots, q_n, p_1, p_2] \tag{2}$$

where $S_{\mathrm{raw}}$ = the raw, uncompressed state vector representing the environment before any discretization or normalization is applied; $q_1, \dots, q_n$ = the halting vehicle counts of Equation (1) at the $n$ lane-area detectors; $p_1, p_2$ = the current active signal-phase index (e.g., red, green, or yellow) for traffic lights 1 and 2.
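Equations (1) and (2) can be illustrated with a short sketch that counts halted vehicles from per-detector speed lists and concatenates the counts with the two phase indices. The function names and sample speeds are hypothetical. In a live SUMO run, the count would more typically be read from the lane-area detector itself (TraCI exposes a last-step halting number for such detectors) rather than recomputed from speeds.

```python
from typing import List, Sequence

HALT_SPEED_MS = 0.1  # epsilon in Equation (1), using the example value from the text


def halting_count(speeds_ms: Sequence[float], eps: float = HALT_SPEED_MS) -> int:
    """Equation (1): number of vehicles in a detector zone with speed <= eps."""
    return sum(1 for v in speeds_ms if v <= eps)


def raw_state(detector_speeds: Sequence[Sequence[float]], phases: Sequence[int]) -> List[int]:
    """Equation (2): halting counts of all detectors followed by phase indices."""
    queues = [halting_count(speeds) for speeds in detector_speeds]
    return queues + list(phases)


# Three hypothetical detectors and two traffic lights in phases 0 and 2.
speeds = [[0.0, 0.05, 7.2], [], [0.1, 3.0]]
print(raw_state(speeds, phases=(0, 2)))  # [2, 0, 1, 0, 2]
```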

To process these data, two distinct mathematical transformations were engineered:

Tabular Discretization: For the Tabular Q-Learning implementation, raw detector-level queue counts were compressed into discrete queue-density categories to prevent state-space explosion. At each decision step, the halting vehicle counts from lane-area detectors were first aggregated by incoming edge. The total queue count for each edge was then mapped into one of four discrete bins: bin 0 for no queued vehicles, bin 1 for 1–9 queued vehicles, bin 2 for 10–18 queued vehicles, and bin 3 for more than 18 queued vehicles. The resulting edge-level queue bins were concatenated with the current traffic signal-phase indices to form the tabular state representation used by the Q-table.
The selected four-bin discretization reflects a compromise between traffic-state resolution and learning tractability. A finer discretization would preserve more detailed queue information but would substantially increase the number of possible state–action pairs and slow tabular learning. Conversely, a coarser discretization would reduce Q-table size but could hide operationally important differences between light, moderate, and saturated queues. The selected thresholds preserve the distinction between empty-, low-, moderate-, and high-congestion states while keeping the Q-table computationally manageable for the two-intersection networks. A full sensitivity analysis of alternative discretization schemes was beyond the scope of the present study and is, therefore, identified as a future research direction.
DQN Normalization: Continuous queue counts are scaled by a factor of 10.0 and clipped to a bounded [0, 1] range to prevent gradient saturation in the neural network. The use of compact queue-based state features is consistent with recent PPO-TSC evidence showing that simplified traffic-state vectors can reduce training complexity while maintaining strong control performance. Ref. [6] reported that feature vector states based on queue length and waiting time trained more efficiently than high-dimensional DTSE representations. This compact state design is further supported in the study by [4], which reported that different state definitions produced broadly similar control performance, suggesting that low-resolution traffic-state inputs may be sufficient for effective DRL-based signal control.
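The two transformations above can be sketched as follows. The bin boundaries follow the thresholds stated in the text. Reading "scaled by a factor of 10.0" as division by 10 is an assumption made here, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def queue_bin(edge_queue: int) -> int:
    """Map an edge-level queue count to one of the four bins from the text."""
    if edge_queue <= 0:
        return 0  # no queued vehicles
    if edge_queue <= 9:
        return 1  # 1 to 9 queued vehicles
    if edge_queue <= 18:
        return 2  # 10 to 18 queued vehicles
    return 3      # more than 18 queued vehicles


def tabular_state(edge_queues: Sequence[int], phases: Sequence[int]) -> Tuple[int, ...]:
    """Hashable Q-table key: edge-level bins followed by phase indices."""
    return tuple(queue_bin(q) for q in edge_queues) + tuple(phases)


def dqn_features(queues: Sequence[float], scale: float = 10.0) -> List[float]:
    """Divide counts by `scale` (assumed reading) and clip to [0, 1]."""
    return [min(max(q / scale, 0.0), 1.0) for q in queues]


print(tabular_state([0, 4, 18, 25], phases=(1, 0)))  # (0, 1, 2, 3, 1, 0)
print(dqn_features([0, 5, 12]))                      # [0.0, 0.5, 1.0]
```

Under this reading, every queue of 10 or more vehicles maps to the same normalized value of 1.0, so the DQN input saturates at a lower queue length than the highest tabular bin boundary.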

A complementary approach is to encode the traffic state using spatial feature matrices rather than compact detector counts. Reference [19] represented each approach lane using vehicle position and speed matrices and assigned larger weights to cells closer to the intersection, reflecting the stronger influence of vehicles near the stop line on signal control decisions. Compared with this high-resolution representation, the present study adopts a compact detector-based state vector to reduce dimensionality and support scalable multi-agent execution. Future MADQN extensions could evaluate whether distance-weighted spatial occupancy and speed features improve performance in complex real-world corridors. A further extension is vision-based state encoding: the authors of [21] used raw intersection images as DRL state inputs processed through CNN layers, whereas the present study intentionally adopts detector-based queue counts to reduce dimensionality, preserve interpretability, and support scalable multi-agent execution.

3.5. Action Space, SMDP Lock, and Safety (TTC)

The agent operates with a binary action space of 0 (to maintain the current green phase) or 1 (to switch to the next logical phase). To ensure physical realism and prevent erratic light flickering, a Semi-Markov Decision Process (SMDP) lock is enforced, requiring a minimum green time (4.0 to 12.0 s). This minimum-green constraint is also consistent with SUMO’s microscopic intersection logic. SUMO’s improved intersection model was developed to avoid unrealistic step-length-dependent behavior by using predicted vehicle arrival times, foe-link conflicts, and safety gaps when determining whether a vehicle may enter an intersection. In this context, the SMDP lock prevents the RL controller from exploiting unrealistically rapid phase changes and keeps learned signal actions compatible with physical approach, entry, and intersection occupancy dynamics [13].
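One way to picture the minimum-green lock is as a filter between the agent's requested action and the action actually applied. The sketch below is an assumption about how such a lock could be structured, not the authors' code: the class name is hypothetical, yellow transitions are not modeled, and the 4.0 s value is the lower end of the range quoted above.

```python
class MinGreenLock:
    """Ignore switch requests until the active green has run for min_green_s."""

    def __init__(self, min_green_s: float):
        self.min_green_s = min_green_s
        self.elapsed_s = 0.0  # time spent in the current green phase

    def filter(self, requested_action: int, step_s: float) -> int:
        """Return the action actually applied (0 = maintain, 1 = switch)."""
        self.elapsed_s += step_s
        if requested_action == 1 and self.elapsed_s >= self.min_green_s:
            self.elapsed_s = 0.0  # a new green phase starts counting from zero
            return 1
        return 0


lock = MinGreenLock(min_green_s=4.0)
# The agent requests a switch at every 1 s step; only the fourth request passes.
print([lock.filter(1, step_s=1.0) for _ in range(5)])  # [0, 0, 0, 1, 0]
```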

This constraint is also consistent with lost-time sensitivity evidence showing that frequent phase transitions can reduce effective green time and increase travel time under saturated signalized intersection conditions [27], and with distributed RL traffic signal designs that preserve fixed phase sequences while imposing explicit transition and minimum-green constraints [8].

The importance of selecting an appropriate signal action interval is supported by recent DRL-TSC studies, although the optimal interval appears to be network- and controller-dependent. Reference [6] reported that a 10 s interval achieved the best balance between responsiveness and excessive phase switching in their PPO-TSC setting, whereas the authors of [4] found that 3 s and 5 s intervals allowed PPO, A2C, and DQN agents to converge faster and more stably than a 10 s interval. These findings support the present use of an SMDP minimum-green lock because the controller must remain responsive to queue changes while avoiding unrealistic signal flickering and excessive yellow-time losses.

The binary maintain/switch action formulation was selected to preserve the conventional signal-phase sequence while still allowing adaptive control. This design also follows the long-standing ITSCP view that fixed or constrained phase sequences can preserve driver expectancy, safety, and fairness, while minimum and maximum green-duration limits prevent skipped phases, excessive green extension, and unrealistic control behavior [1]. This design is more realistic for deployment than unrestricted dynamic phase selection because arbitrary phase jumps may confuse drivers, disrupt the expected signal order, and cause excessive waiting on non-prioritized approaches. Consistent with this logic, the authors of [25] maintained phase sequences in a cooperative MARL signal control model and imposed minimum and maximum green-time constraints to prevent both signal flickering and indefinite green extension for dominant traffic streams.

Traffic safety is often analyzed using Time-To-Collision (TTC) to identify potential conflicts between interacting vehicles that could jeopardize safety [26]. Classical gap acceptance research has long shown that intersection safety and crossing behavior depend on the temporal gaps accepted or rejected by drivers when entering or crossing a major traffic stream, making time-based surrogate measures relevant for evaluating traffic conflict risks at intersections [33].

TTC is calculated by looking at vehicles and their immediate leaders. The formula is

$$\mathrm{TTC} = \frac{d}{v_f - v_l} \tag{3}$$

where $d$ is the gap between the follower and the leader, $v_f$ is the follower speed, and $v_l$ is the leader speed. This is only valid if $v_f > v_l$ (closing in); otherwise, $\mathrm{TTC} = \infty$ (no collision risk).

At each evaluation, the TTC routine scans all vehicles, compares each vehicle’s gap and speed with those of its leader, and takes the lowest (most critical) TTC in the network at that moment. This metric is recorded every 100 steps. A standard TTC analysis defines a single crash (TTC approaching 0) as a system safety failure. A nominal default TTC value of 50 s is assigned when no critical vehicle-following interaction is detected, primarily to maintain graph-scale consistency and avoid missing TTC records; this value is not treated as a conventional collision risk threshold. By implementing the SMDP lock, the controller is constrained against unrealistic rapid phase switching, while TTC monitoring is used only as an independent diagnostic measure of potential vehicle-following conflict conditions. Since TTC is not included in the reward function, it does not directly guide policy learning; rather, it is used to evaluate the safety-related consequences of the learned queue- or delay-based signal control policies [26].
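The network-level TTC scan described above can be sketched as follows. The function names and the sample follower-leader pairs are hypothetical. Returning the 50 s default only when no closing pair exists is one reading of the text; whether the recorded value is also capped at 50 s is not stated, so no cap is applied here.

```python
import math
from typing import Iterable, Tuple

DEFAULT_TTC_S = 50.0  # nominal value from the text when no interaction is detected


def pair_ttc(gap_m: float, follower_speed: float, leader_speed: float) -> float:
    """Equation (3): TTC for one follower-leader pair, infinite if not closing."""
    closing_speed = follower_speed - leader_speed
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def network_min_ttc(pairs: Iterable[Tuple[float, float, float]],
                    default: float = DEFAULT_TTC_S) -> float:
    """Lowest finite TTC over all (gap, v_follower, v_leader) pairs."""
    finite = [ttc for ttc in (pair_ttc(*pair) for pair in pairs) if math.isfinite(ttc)]
    return min(finite) if finite else default


# Hypothetical snapshot: one closing pair and one pair that is separating.
snapshot = [(20.0, 10.0, 6.0), (15.0, 5.0, 8.0)]
print(network_min_ttc(snapshot))  # 5.0
print(network_min_ttc([]))        # 50.0
```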

In the present implementation, TTC is used as a surrogate safety-monitoring indicator rather than as a direct reward component. This separation was adopted to preserve a clear comparison between the operational reward objectives, namely, queue minimization and delay minimization, and the independent safety-related behavior of each controller. Therefore, TTC values are recorded and analyzed to identify potential vehicle-following conflict conditions, while the learning reward remains based on queue or delay penalties. This design allows this study to evaluate whether operationally efficient policies also produce acceptable surrogate safety behavior without explicitly biasing the reward toward TTC during training.

3.6. Objective and Reward Functions

The present study focuses on queue- and delay-based rewards because these metrics directly represent operational efficiency and reduce the risk of unintended reward exploitation. This choice is consistent with the broader ITSCP literature, where delay is the most frequently used performance index, and queue length is commonly treated as a complementary objective because it is closely correlated with delay and vehicle stops [1]. A complementary reward-design approach is to optimize network average speed while penalizing collisions, as used in [22] for PPO-based leading AV control. However, the present study retains direct queue- and delay-penalty rewards because signal control performance is more directly reflected in stopped vehicles and accumulated waiting time.

This reward-design choice is further supported in the study by [9], which formulated a spatially decomposable reward using queue length and waiting time to jointly capture congestion and travel delays. This design choice is also supported in the study by [6], which showed through ablation analysis that combining queue length and waiting time in both the state representation and reward design produced better traffic performance than using either feature alone. Earlier DRL traffic signal studies also used throughput-oriented rewards, such as assigning a positive reward for each vehicle passing through the junction [21].

Their findings reinforce the need for reward functions that are directly aligned with observable operational indicators rather than relying on unrelated or overly complex surrogate objectives. Similarly, reference [20] selected the total waiting time as the DQN reward because it captures both traffic volume and stopping duration, whereas queue length alone may not fully represent the temporal burden experienced by vehicles.

This reward-design logic is consistent with the study conducted by reference [19], which defined reward by using changes in the cumulative vehicle waiting-time and yellow-light duration, thereby aligning the objective with operational efficiency while discouraging excessive phase switching. A complementary reward-design strategy was proposed in reference [24], which minimized the standard deviation of halting vehicles across incoming approaches rather than penalizing only total queue length or waiting time. This formulation encourages the agent to equalize queue distribution among approaches and may improve perceived fairness among road users. Compared with that approach, the present study adopts direct queue- and delay-penalty rewards to prioritize network-level operational efficiency and maintain interpretability. Nevertheless, standard deviation-based rewards could be incorporated in future work as a fairness-oriented extension, particularly under asymmetric demand patterns.

The goal of each RL agent is to maximize the expected cumulative discounted reward over time:

$$J = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] \tag{4}$$

where $J$ = the expected cumulative discounted reward that the agent seeks to maximize; $\mathbb{E}$ = the expected value over all possible future trajectories; $\gamma$ = the discount factor (between 0 and 1), which determines the importance of long-term future rewards versus immediate, short-term rewards; $R(s_t, a_t)$ = the immediate reward received from the environment after taking action $a_t$ in state $s_t$.

Because traffic represents a penalty, the immediate rewards $R_t$ are negative. Two objective functions were modeled:

Queue Minimization: penalizes the total number of halted vehicles across the detectors [34]:

$$R_t = -\sum_{i=1}^{N} q_i(t) \tag{5}$$

Delay Minimization: penalizes the total accumulated waiting time ($w$) across all incoming edges ($M$) to align with a user-centric travel time reduction:

$$R_t = -\sum_{j=1}^{M} w_j(t) \tag{6}$$

where $R_t$ = the calculated immediate penalty/reward at time step $t$; $N$ = the total number of lane detectors in the network; $w_j(t)$ = the accumulated waiting time of vehicles on the incoming edge/lane $j$ at time $t$; $M$ = the total number of incoming edges/lanes being monitored for delay.
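Equations (5) and (6) reduce to negated sums, as the following sketch shows. The function names and input values are hypothetical; in practice the inputs would be the detector halting counts and the per-edge accumulated waiting times at the current decision step.

```python
from typing import Sequence


def queue_reward(halting_counts: Sequence[float]) -> float:
    """Equation (5): negative total of halted vehicles over all detectors."""
    return -float(sum(halting_counts))


def delay_reward(waiting_times_s: Sequence[float]) -> float:
    """Equation (6): negative total accumulated waiting time over incoming edges."""
    return -float(sum(waiting_times_s))


print(queue_reward([2, 0, 5]))          # -7.0
print(delay_reward([12.5, 0.0, 30.0]))  # -42.5
```

The two rewards differ in units (vehicles versus seconds) and therefore in scale, which is worth keeping in mind when the same learning rate is used for both objectives.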

Although TTC was not included in the primary reward functions of the present experiments, a safety-aware extension can be formulated by adding a TTC-based penalty term to the operational reward. For example, when the minimum TTC observed in the network falls below a predefined threshold, the reward can be penalized to discourage signal actions that increase close-following or conflict-prone interactions. A general safety-aware reward can, therefore, be expressed as a weighted combination of operational performance and TTC risk, where the queue or delay penalty is supplemented by an additional penalty for unsafe TTC events. This extension would allow the controller to jointly optimize efficiency and surrogate safety, but it requires careful calibration of the TTC threshold and penalty weight to avoid overly conservative signal policies that improve TTC only by suppressing traffic discharge and increasing delay. For clarity and reproducibility, the main simulation, traffic demand, detector, and learning settings used in the experimental framework are summarized in Table 2.
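The safety-aware extension outlined in the preceding paragraph could take, for example, the following thresholded form. This is only one possible formulation and was not used in the experiments; the penalty weight $\lambda \geq 0$ and the threshold $\mathrm{TTC}_{\mathrm{thr}}$ are hypothetical symbols introduced here, and $R_t$ denotes the operational reward of Equation (5) or (6).

```latex
R_t^{\mathrm{safe}} = R_t - \lambda \,\mathbb{1}\!\left[\mathrm{TTC}_{\min}(t) < \mathrm{TTC}_{\mathrm{thr}}\right]
```

Here $\mathrm{TTC}_{\min}(t)$ is the lowest TTC observed in the network at step $t$, and the indicator equals 1 only when that value falls below the threshold. Setting $\lambda = 0$ recovers the purely operational reward.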

4. Algorithmic Implementation

4.1. The Fixed-Time Baseline

The baseline controller operates on a fixed, repeated cycle. Fixed-time control represents the classical signal-timing strategy in which phase sequences and phase durations are predefined from historical traffic patterns, whereas adaptive strategies adjust control decisions using real-time or predicted traffic conditions [1]. To ensure transparency in the comparison with the RL-based controllers, the fixed-time baseline was implemented as a predefined static signal control benchmark rather than as an adaptive or optimized controller. The same network geometry, detector configuration, vehicle behavior parameters, demand levels, turning movement proportions, and simulation duration were used when evaluating the fixed-time, Tabular Q-Learning, and MADQN controllers. Therefore, performance differences were attributed to the control logic rather than to differences in network exposure or demand conditions.

The fixed-time baseline was implemented using predefined SUMO signal programs for each network. In the synthetic two-junction network, both Node2 and Node3 used the same fixed-time structure consisting of four green phases, each lasting 42 s, with a 3 s yellow transition after each green phase. Therefore, the total fixed-time cycle length for each synthetic intersection was 180 s. The phase sequence consisted of an east–west green phase, a north–south green phase, and two opposing-direction green phases, each followed by a yellow transition. Since both synthetic intersections used identical cycles with an offset of 0 s, the baseline was synchronized in cycle structure but did not include explicit progression-based offset coordination.

For the 25 January real-world corridor, the two fixed-time signal controllers used different cycle lengths. The southern intersection, represented by clusterJ11_J14_J16_J17#3 more, used a 111 s cycle composed of ten intervals with durations of 22, 3, 6, 4, 23, 3, 1, 23, 3, and 2 s. This controller managed twelve connection movements and included major east/north movements, protected movement extensions, southbound/northbound combinations, west/south entry flows, yellow transitions, and short clearance buffers. The northern intersection, represented by clusterJ15_J27_J28_J31#1 more, used a 109 s cycle composed of six intervals with durations of 40, 3, 20, 3, 20, and 3 s. This controller managed ten connection movements and alternated between major east–west movements, side-movement priority phases, and yellow-clearance intervals.

Because the two real-world intersections used mismatched cycle lengths of 111 s and 109 s, the fixed-time baseline for the 25 January Corridor was treated as an isolated/independent fixed-time control configuration rather than a fully coordinated progression-based arterial signal system. This is important because conventional offset-based arterial coordination typically requires a common cycle length, or an exact compatible cycle relationship, to maintain a stable phase relationship between intersections. Therefore, the fixed-time benchmark represents the static signal control program implemented in the SUMO network, not an optimized coordinated arterial signal system. The results are summarized in Table 3.

Webster’s method was not used to optimize or calibrate the fixed-time signal timings in the implemented SUMO signal programs. The fixed-time baseline, therefore, represents a predefined static timing benchmark rather than a Webster-optimized or adaptive baseline. Webster’s delay formulation was used only as a theoretical reference for interpreting fixed-time delay behavior under conventional signal control assumptions; it was not used to tune either the implemented baseline or the RL-based controllers.

The signal phase $p$ for traffic light $j$ is updated based solely on elapsed time:

$$p_{j}^{t} = [t] \bmod T_{cycle} \tag{7}$$

where $p_{j}^{t}$ = the active signal phase for traffic light $j$ at a given time $t$; $T_{cycle}$ = the total duration (in seconds or simulation steps) of one complete, pre-programmed traffic light cycle; $\bmod$ = the modulo operator used to continuously loop the fixed-time cycle.

While highly predictable, this system is incapable of adapting to real-time traffic surges and extraordinary network loads [18]. Fixed and pre-optimized timing plans may also remain vulnerable to start-up lost time because each phase transition reduces the effective green time available for vehicle discharge [27].
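As a minimal illustration of Equation (7), the sketch below maps elapsed time to the active interval of a pre-timed program. The modulo step gives the position inside the cycle. The lookup over cumulative interval durations is an added assumption about how that position is translated into a phase index, and the function name is hypothetical. The durations are those reported for the synthetic network (four 42 s green phases, each followed by a 3 s yellow).

```python
from itertools import accumulate

# Synthetic-network program: green/yellow pairs, 4 x (42 + 3) = 180 s.
SYNTHETIC_PROGRAM = [42, 3, 42, 3, 42, 3, 42, 3]

def fixed_time_phase(t, durations):
    """Return the index of the active interval at elapsed time t (s)."""
    cycle = sum(durations)
    position = t % cycle  # Eq. (7): position inside the repeating cycle
    for index, end in enumerate(accumulate(durations)):
        if position < end:
            return index
    raise AssertionError("unreachable: position is always below the cycle length")
```

For example, `fixed_time_phase(44, SYNTHETIC_PROGRAM)` falls in the first yellow interval (index 1), and the lookup wraps back to index 0 at `t = 180`.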

Between conventional fixed-time control and online reinforcement learning, simulation-based metaheuristics provide an intermediate class of signal control methods. In GA–SUMO frameworks, green-time plans are encoded as candidate solutions, evaluated in microscopic simulations, and improved iteratively using evolutionary operators such as selection, crossover, mutation, and elitism [12]. These methods are effective for identifying optimized timing plans under predefined traffic conditions, but their scalability becomes more challenging as the number of intersections, signal heads, and decision variables increases. The present study, therefore, focuses on RL and MARL controllers, which are designed to respond to real-time traffic states and support decentralized execution across multiple intersections.

4.2. The RL Architectures: Single-Agent vs. Multi-Agent

To replace the fixed-time baseline, two distinct reinforcement learning architectures were developed and tested against each other:

1. Centralized single-agent architecture. The initial RL architecture utilized a centralized Shared-Brain approach. In this setup, a single global agent observes the entire network state simultaneously and outputs a joint action to control all intersections at once. While this provides the agent with perfect global awareness, it often suffers from severe policy interference as the network scales.
2. Independent multi-agent architecture. To resolve the bottlenecks of the centralized brain, the system was upgraded to independent multi-agent reinforcement learning. Here, each physical intersection is instantiated as an autonomous embedded agent with localized state compression and independent memory buffers.

4.3. Tabular Q-Learning (Single- and Multi-Agents)

Both the single-agent and multi-agent architectures were first implemented using discrete Tabular Q-Learning. Tabular Q-Learning was retained in this study despite its known scalability limitations for three reasons. First, it provides an interpretable and computationally lightweight baseline for evaluating whether deep function approximation is necessary under the tested corridor conditions. Second, the state representation used in this work is intentionally compact, relying on discretized detector-based queue counts and a limited binary action space, which keeps the state-action space manageable for the two-intersection networks considered. Third, tabular methods remain practically relevant in small or moderately complex traffic control settings, especially where signal phases are fixed, the number of intersections is limited, detector measurements are aggregated into coarse traffic-density bins, and traffic patterns are sufficiently structured or recurrent. Therefore, Tabular Q-Learning is not presented as a scalable solution for large urban networks, but as a transparent reference controller and a practical option for low-dimensional adaptive signal control problems. Similar Q-table-based learning has previously been applied in SUMO for route optimization, where road-edge identifiers represented states, and feasible route choices represented actions [14].

The agent updates its discretized Q-table using the Bellman equation, blending previous knowledge with the newly acquired reward [34]:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \tag{8}$$

where $Q(s, a)$ = the current estimated Q-value (expected future reward) for being in state $s$ and taking action $a$; $\alpha$ = the learning rate, which determines to what extent newly acquired information overrides old information (between 0 and 1); $r$ = the immediate reward observed after taking the action; $\gamma$ = the discount factor applied to future rewards; and $\max_{a'} Q(s', a')$ = the maximum estimated Q-value for the next state $s'$ across all possible future actions $a'$.
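The update in Equation (8) can be sketched in a few lines. This is an illustrative sketch rather than the study's script: the dictionary-based Q-table, the function name, and the default values of `alpha` and `gamma` are assumptions, while the two-action space follows the binary maintain/switch formulation used in this work.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, n_actions=2):
    """Apply one tabular Q-learning update and return the TD error."""
    # Maximum estimated value over the actions available in the next state.
    best_next = max(q_table[(next_state, a)] for a in range(n_actions))
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return td_error

# Zero-initialized Q-table keyed by (state, action); states are hashable labels.
q_table = defaultdict(float)
```

Starting from a zero table, a reward of -4.0 with `alpha = 0.5` moves the visited entry halfway toward the target, giving -2.0.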

To balance exploration and exploitation, the agents utilize an epsilon-greedy policy:

$$a_{t} = \begin{cases} \text{random action}, & \text{with probability } \epsilon_{t} \\ \arg\max_{a} Q(s_{t}, a), & \text{with probability } 1 - \epsilon_{t} \end{cases} \tag{9}$$

where $a_{t}$ = the action selected by the agent at time step $t$; $\epsilon_{t}$ = the exploration rate, which dictates the chance that the agent takes a random action to explore the environment rather than exploiting its best-known action.
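A minimal sketch of the epsilon-greedy rule in Equation (9) follows. The start value of 1.0 and the floor of 0.1 are the values reported for this study. The multiplicative decay form and the decay factor of 0.999 are assumptions made for illustration, since the decay schedule is not specified here.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy branch: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)

def decay_epsilon(epsilon, decay=0.999, eps_min=0.1):
    """Assumed multiplicative decay, clipped at the minimum exploration rate."""
    return max(eps_min, epsilon * decay)
```

With `epsilon = 0.0` the rule is purely greedy, so `epsilon_greedy([0.2, 0.7], 0.0)` always returns action 1.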

4.4. Deep Q-Network (DQN and MADQN)

To overcome the limitations of discrete tabular methods, fully connected Deep Q-Network models were implemented for the DQN and MADQN controllers. The networks used compact multilayer perceptron architectures that map normalized detector-based queue states and signal-phase indices to Q-values for the discrete maintain/switch action space. The implemented architectures varied slightly across objectives and network settings. The queue-oriented DQN/MADQN runs used three hidden layers in the form of either 128–128–64 neurons or, in the larger 25 January queue-oriented configuration, 256–256–128 neurons. The delay-oriented DQN/MADQN runs used a compact two-hidden-layer architecture of 128–128 neurons in the multi-agent implementation, with a three-layer 128–128–64 configuration used in some single-agent synthetic runs.

These architectures were selected as practical baselines that balance representational capacity and computational tractability during online SUMO–TraCI training. The first hidden layers extract nonlinear relationships from normalized queue-density inputs, while the final hidden layer, when used, compresses the learned representation before Q-value prediction. ReLU activation was used in the hidden layers because it is computationally efficient, supports stable gradient propagation, and reduces the risk of vanishing gradients compared with saturating activation functions. The final layer used linear activation because the DQN estimates unconstrained Q-values for each discrete signal control action. These neural network configurations were not intended to represent globally optimal architectures; rather, they provide lightweight value-function approximators suitable for two-intersection corridor experiments.
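To give a sense of how lightweight these fully connected architectures are, the sketch below counts trainable weights and biases for the hidden-layer configurations listed above. The input dimension of 9 is a hypothetical value (for example, eight detector readings plus one phase index) and is not taken from the study. The two outputs correspond to the maintain/switch actions.

```python
def mlp_parameter_count(input_dim, hidden_layers, n_actions=2):
    """Count weights and biases of a fully connected network."""
    sizes = [input_dim, *hidden_layers, n_actions]
    # Each dense layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

HYPOTHETICAL_INPUT_DIM = 9  # assumption for illustration only

for layers in [(128, 128), (128, 128, 64), (256, 256, 128)]:
    print(layers, mlp_parameter_count(HYPOTHETICAL_INPUT_DIM, layers))
```

Under this assumed input size, the three configurations contain roughly 18 thousand, 26 thousand, and 102 thousand parameters, respectively.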

During training, the neural network minimizes the temporal-difference error between the predicted Q-value and the target network value. Equation (10) gives the objective in its squared-error form; in the implementation, Huber loss was applied to the same error term (Section 4.6):

$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q_{target}(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right] \tag{10}$$

where $L(\theta)$ = the loss function being minimized by the neural network during training; $\theta$ = the weights of the primary active network; $\theta^{-}$ = the weights of the target neural network, which are kept frozen and only updated periodically to keep the learning process stable; $Q(s, a; \theta)$ = the Q-value predicted by the primary active network; and $Q_{target}(s', a'; \theta^{-})$ = the target Q-value estimated by the stable target network.

The networks train iteratively on randomized mini-batches to optimize traffic signal policies on the fly [18]. The use of replay memory and a target network is consistent with earlier Q-network signal control implementations, where randomized experience replay was used to reduce temporal correlation between sequential samples, and the target network was updated periodically to stabilize Q-value learning [17].
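The target value inside Equation (10) and the Huber loss reported in Section 4.6 can be sketched for a single transition as follows. The function names, the discount factor of 0.99, and the Huber threshold `delta = 1.0` are assumptions made for illustration. In practice, the loss is averaged over a mini-batch by the deep learning framework.

```python
def td_target(reward, next_q_target, gamma=0.99, done=False):
    """Bootstrapped target: r + gamma * max_a' Q_target(s', a')."""
    return reward if done else reward + gamma * max(next_q_target)

def huber_loss(error, delta=1.0):
    """Quadratic for small TD errors, linear for large ones."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)
```

Compared with the squared error, the linear branch limits the influence of occasional large TD errors, which can occur when queues build up abruptly.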

Recent G-DQN evidence suggests that neural network architecture can substantially affect DRL-based traffic signal control. Reference [19] improved conventional DQN by using a dual-channel CNN that separately processes vehicle position and speed matrices before feature fusion. In contrast, the present MADQN implementation uses compact normalized queue-count inputs and fully connected layers, prioritizing computational simplicity and decentralized multi-agent scalability. This distinction clarifies the methodological scope of the present study and suggests that CNN-based spatial encoders may be tested in future MADQN variants when richer detector- or camera-derived state data are available. Unlike earlier policy-gradient traffic signal controllers that map visual states directly to action probabilities, such as the CNN-based model of [21], the present MADQN framework uses detector-count states, Q-value approximation, experience replay, target network stabilization, and decentralized multi-agent execution. This clarifies that the current comparison focuses on value-based MARL, while policy-gradient methods remain relevant to future benchmarks.

4.5. Coordination Strategy and Relation to CTDE

For the multi-agent architecture, the implemented framework follows an independent value-based MARL structure with decentralized execution rather than a strict CTDE implementation with a centralized critic or centralized training module. Each traffic signal is represented as an autonomous agent with its own Q-table or DQN model and its own action-selection process. The agents interact through the shared SUMO traffic environment, where the action of one intersection affects downstream arrivals, queue propagation, spillback, and the subsequent observations and rewards of neighboring agents. However, the implementation does not use a centralized critic, value-decomposition network, mixing network, or centralized parameter-update module. Therefore, the framework is described as an independent, decentralized value-based MARL implementation. CTDE is discussed as a related formal MARL paradigm and as a possible future extension rather than as the exact implemented training architecture.
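The independent structure described above can be sketched as one learner per signal, with no shared table, critic, or parameter-update module. The class name and the `alpha` and `gamma` defaults are hypothetical. `Node2` and `Node3` are the two synthetic-network intersections. Any coupling between agents arises only through the simulated traffic, which this sketch does not model.

```python
from collections import defaultdict

class IndependentQAgent:
    """Tabular learner that sees only its own local state and reward."""

    def __init__(self, n_actions=2, alpha=0.1, gamma=0.9):
        self.alpha = alpha
        self.gamma = gamma
        # One row of action values per visited local state.
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def greedy_action(self, local_state):
        values = self.q[local_state]
        return values.index(max(values))

    def learn(self, state, action, reward, next_state):
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])

# One agent per traffic signal; nothing is shared between them.
agents = {tls_id: IndependentQAgent() for tls_id in ("Node2", "Node3")}
```

An update applied to `agents["Node2"]` leaves the table of `agents["Node3"]` untouched, which is the property that distinguishes this design from a CTDE implementation with a centralized training module.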

This decentralized multi-agent formulation is consistent with distributed cooperative RL schemes in which each intersection operates as an autonomous local controller while using selected neighborhood information to improve coordination. Ref. [8] showed that GNN-based embeddings can provide each local controller with upstream and downstream traffic contexts without requiring centralized real-time action selection. Compared with that approach, the present study adopts a simpler value-based MARL design, using Tabular Q-Learning and MADQN to prioritize local responsiveness, interpretability, and scalable decentralized execution. Thus, the fixed-time baseline relied on predefined repeated signal cycles, whereas the MARL controllers used adaptive decentralized decisions learned independently by each signal agent through interaction with the shared SUMO environment. In the implemented fixed-time baseline, no explicit progression-based offset coordination was applied; therefore, the comparison evaluates static pre-timed operation against adaptive multi-agent control under the same network geometry and demand scenarios.

Despite the practical advantages of decentralized multi-agent value-based control, MARL-based traffic signal control remains subject to several known methodological challenges. First, the learning environment is non-stationary because each intersection agent updates its policy while neighboring agents are also learning, meaning that the transition and reward dynamics observed by one agent may change over time. Second, each agent operates under partial observability during decentralized execution since it observes detector-derived local traffic states rather than the complete network state. This can limit the agent’s ability to anticipate downstream congestion, spillback, or delayed effects of upstream signal decisions. Third, policy instability may occur during training, particularly when value estimates change due to simultaneous policy updates across agents. In the present MADQN implementation, replay memory, target network stabilization, ε-greedy exploration decay, and the SMDP minimum-green lock were used to reduce instability, but these mechanisms do not eliminate it completely. Finally, a local-versus-global optimization conflict may arise when an agent minimizes its own approach queues or delay while unintentionally worsening downstream congestion or network-wide performance. Therefore, the results should be interpreted as an evaluation of value-based decentralized MARL under the selected reward functions, detector layout, and corridor configurations rather than as a complete solution to all coordination challenges in large-scale urban networks.

This design is also supported by [7], which showed that cooperative DQN agents can improve multi-intersection signal control when neighboring intersections transmit recent state, action, and reward information to the local learning process. Compared with such explicit neighbor-transfer schemes, the present MADQN implementation adopts a simpler decentralized execution structure, reducing communication requirements while still enabling value-based multi-agent control under realistic corridor conditions.

4.6. General Learning Parameters

The learning-based controllers were implemented using a fully online SUMO–TraCI training protocol. The main implementation parameters used in the final scripts are summarized below to support reproducibility and clarify the training configuration.

The learning-based controllers were trained in a fully online manner for one complete SUMO simulation episode under each demand scenario, with each episode consisting of 10,000 simulation steps and a microscopic step length of 0.10 s. The DQN/MADQN agents used experience replay with randomized mini-batch sampling. In the multi-agent DQN/MADQN implementations, each intersection agent used an independent replay buffer with a maximum capacity of 200,000 transitions and a mini-batch size of 128 samples. The target networks were updated every 200 training steps using soft target network updates. The soft-update coefficient varied by objective, with τ = 0.01 in delay-oriented DQN/MADQN runs and τ = 0.002 in queue-oriented DQN/MADQN runs. Neural network weights were optimized using the Adam optimizer with Huber loss. The final MADQN scripts used a learning rate of 0.0002, while the broader DQN experiments used learning rates within the range of 0.0002–0.0005. The exploration policy was ε-greedy, initialized at ε = 1.0 and decayed toward a minimum value of 0.1 during online training.

It should be noted that the term “episode” in the present implementation refers to one complete online SUMO simulation horizon for a given controller, network, demand level, and optimization objective. The learning process was, therefore, conducted as fully online continuous learning within a 10,000-step microscopic simulation run rather than as an offline multi-episode training protocol. This design was adopted to maintain identical traffic-exposure conditions across the fixed-time, Tabular Q-Learning, and MADQN comparisons. Consequently, the reported learning curves and performance values should be interpreted as controlled online training outcomes rather than as evidence of fully converged multi-episode optimal policies.

The action space was binary, where action 0 maintained the current signal phase and action 1 switched to the next feasible phase in the predefined sequence. To prevent unrealistic signal flickering, an SMDP minimum-green lock was enforced. In the synthetic two-junction experiments, the minimum-green lock was 40 simulation steps, equivalent to 4 s, for delay-oriented controllers, and 120 simulation steps, equivalent to 12 s, for queue-oriented controllers. In the 25 January Corridor scripts, the minimum-green lock ranged from 30 to 60 simulation steps, equivalent to 3–6 s depending on the objective and controller. These implementation choices ensured that the learning controllers operated under realistic phase-transition constraints while retaining adaptive maintain/switch decision-making.
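A minimal sketch of the maintain/switch logic with a minimum-green lock is given below. It assumes that a switch request issued before the lock expires is simply ignored, and it omits yellow transitions. The class name is hypothetical, and the 40-step value is the delay-oriented synthetic setting (4 s at a 0.10 s step length).

```python
class MinGreenLock:
    """Binary maintain/switch control with a minimum-green constraint."""

    def __init__(self, min_green_steps, n_phases):
        self.min_green_steps = min_green_steps
        self.n_phases = n_phases
        self.phase = 0
        self.steps_in_phase = 0

    def step(self, action):
        """Advance one simulation step; action 0 = maintain, 1 = switch."""
        if action == 1 and self.steps_in_phase >= self.min_green_steps:
            # Lock has expired: move to the next phase in the fixed sequence.
            self.phase = (self.phase + 1) % self.n_phases
            self.steps_in_phase = 0
        else:
            # Maintain, or an early switch request that is ignored (assumption).
            self.steps_in_phase += 1
        return self.phase
```

With `MinGreenLock(40, 4)`, forty consecutive switch requests leave the phase unchanged and the forty-first request advances it, which prevents step-by-step flickering.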

The main implementation parameters of the DQN/MADQN training process are summarized in Table 4.

Training stability is an important consideration in DRL-based traffic signal control because the learning process is affected by stochastic vehicle arrivals, exploratory actions, and the non-stationarity introduced by multiple agents learning simultaneously. In the present implementation, training was performed in a fully online manner over a 10,000-step SUMO execution horizon. Learning behavior was monitored descriptively using the evolution of cumulative reward, queue length, delay, and TTC traces recorded at 100-step intervals. Reward oscillations were expected, particularly during the early part of training, because the agents followed an ε-greedy policy initialized at ε = 1.0 and gradually decayed toward ε = 0.1. Therefore, the reward curves were not expected to improve monotonically at every time step.

For the DQN/MADQN agents, replay memory, randomized mini-batch sampling, target network stabilization, soft target updates, ε-decay, and the SMDP minimum-green lock were used to reduce unstable value updates and unrealistic phase switching. These mechanisms improve stability but do not fully eliminate learning variability, especially in multi-agent settings where each agent’s policy changes while neighboring agents are also learning. The present evaluation, therefore, reports the behavior of the implemented online learning runs under controlled demand and network settings. Future work should extend this analysis by repeating training across multiple random seeds and reporting convergence curves, reward-oscillation statistics, and confidence intervals.

4.7. The Analysis Framework

Figure 3 presents the unified training architecture adopted for the proposed adaptive traffic signal control framework in the SUMO environment. The flowchart integrates the common interaction loop between the reinforcement learning agent and the microscopic traffic simulator while also distinguishing the learning mechanisms of the two implemented control paradigms: Tabular Q-Learning and Deep Q-Network/Multi-Agent Deep Q-Network (DQN/MADQN).

The workflow starts with the initialization phase, in which the SUMO traffic simulation environment is launched and linked to Python through the Traffic Control Interface (TraCI). At this stage, the learning structure is also initialized according to the selected controller. For the Tabular Q-Learning configuration, this consists of creating an empty or zero-initialized Q-table indexed by discretized traffic states and signal control actions. For the DQN/MADQN configuration, the framework initializes the main neural network, the target network, and the experience replay buffer used for off-policy learning. In the multi-agent setting, each controlled intersection is represented by an independent agent, while coordination is achieved under the adopted decentralized decision-making framework.

After initialization, the algorithm enters the episodic or continuous training loop by resetting the environment and observing the initial state. In the tabular implementation, the observed traffic information is transformed into a discretized state representation, whereas in the deep-learning implementation, the state is maintained in a normalized numerical form suitable for neural network input. This state typically includes queue-related traffic indicators collected from lane-area detectors, along with the active signal-phase information for each controlled intersection.
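The two state encodings can be contrasted with a short sketch. The bin edges, the queue normalization constant, the number of phases, and the scaling of the phase index are hypothetical choices made for illustration. The study's actual discretization and normalization settings are not reproduced here.

```python
def discretize_state(queue_counts, phase, bin_edges=(2, 5, 10)):
    """Tabular encoding: map each queue count to a coarse bin index."""
    bins = tuple(sum(count >= edge for edge in bin_edges) for count in queue_counts)
    return bins + (phase,)  # hashable tuple, usable as a Q-table key

def normalize_state(queue_counts, phase, max_queue=20, n_phases=4):
    """Deep-learning encoding: scale inputs to the range [0, 1]."""
    scaled = [min(count / max_queue, 1.0) for count in queue_counts]
    return scaled + [phase / (n_phases - 1)]
```

For queue counts `[0, 3, 7, 12]` in phase 1, the tabular encoding yields `(0, 1, 2, 3, 1)`, whereas the normalized encoding keeps the graded values that the neural network can interpolate between.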

At each training step, the controller checks whether the current step count is still below the predefined maximum number of training steps. If the stopping condition is reached, the training process terminates and the learned policy is exported, either as a trained Q-table or as saved deep-network parameters. Otherwise, the agent proceeds to the next decision cycle.

During each cycle, the agent selects a control action according to an epsilon-greedy exploration policy. Under this mechanism, the controller chooses a random action with probability ε to maintain exploration, and chooses the action with the highest current estimated Q-value with probability 1 − ε to exploit previously learned knowledge. The selected action is then applied to the traffic environment through SUMO–TraCI execution, which changes or maintains the active signal phase according to the control logic. The environment responds to this action by evolving to the next simulation state.

Following action execution, the framework observes the immediate reward and the next state. The reward is computed from the operational performance of the traffic network, typically using queue-based or delay-based penalty functions. At this point, the workflow branches into the algorithm-specific learning phase.

In the Tabular Q-Learning branch, learning is performed through direct temporal-difference updating of the Q-table. First, the temporal-difference (TD) error is computed as the difference between the current Q-value estimate and the target value formed from the observed reward and the maximum Q-value of the next state. Then, the Q-value corresponding to the current state–action pair is updated using the Bellman update equation. This process incrementally adjusts the Q-table to improve the policy over repeated environment interactions. After the update, the current state is replaced with the next state, the exploration rate ε is decayed according to the predefined schedule, and the algorithm returns to the next training step.

In the DQN/MADQN branch, the learning process is more elaborate because the Q-function is approximated via a neural network rather than stored explicitly in a table. After each environment interaction, the transition tuple $(s_{t}, a_{t}, r_{t}, s_{t+1})$ is stored in the experience replay buffer. The framework then checks whether the buffer size has exceeded the predefined mini-batch size. If sufficient experiences are available, a random mini-batch is sampled from the buffer. Using this sampled batch, the algorithm computes the target Q-values based on the reward and the output of the target network, which is a periodically synchronized copy of the main network used to stabilize training. The main network weights are then updated by minimizing the loss between predicted and target Q-values, typically through gradient descent using a robust loss function such as Huber loss or mean squared error.
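The replay mechanism described above can be sketched with a bounded buffer. The capacity of 200,000 transitions and the mini-batch size of 128 are the reported settings. The class and method names are hypothetical, and the sketch is not the study's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity=200_000):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size=128):
        """Return a random mini-batch, or None until the buffer exceeds batch_size."""
        if len(self.memory) <= batch_size:
            return None
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling uniformly at random breaks the temporal ordering of consecutive simulation steps, which is the decorrelation effect that the replay memory is meant to provide.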

After the neural network update, the internal state is advanced by setting the current state equal to the next state. The framework then checks whether the predefined target network update interval has been reached. If so, the weights of the main network are copied or softly transferred to the target network to maintain stable Q-target estimation during subsequent updates. Regardless of whether synchronization occurs, the exploration rate ε is decayed to gradually shift the agent from exploration-dominant behavior toward exploitation of the learned control policy. The algorithm then returns to the next training step and repeats the same sequence until the full training horizon is completed.
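The soft transfer of weights to the target network can be sketched as an interpolation applied at the update interval. The weights are represented here as flat lists of floats for simplicity, and the function names are hypothetical. The interval of 200 training steps and the τ values of 0.01 and 0.002 are the reported settings.

```python
def soft_update(target_weights, main_weights, tau=0.01):
    """Move each target weight a fraction tau toward the main-network weight."""
    return [(1.0 - tau) * t + tau * m for t, m in zip(target_weights, main_weights)]

def maybe_soft_update(step, target_weights, main_weights, tau=0.01, interval=200):
    """Apply the soft update only when the update interval is reached."""
    if step % interval != 0:
        return target_weights
    return soft_update(target_weights, main_weights, tau)
```

Setting `tau = 1.0` reduces the rule to a hard copy of the main network, while small values such as 0.002 make the target network change slowly between synchronizations.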

Overall, the unified flowchart highlights that both Tabular Q-Learning and DQN/MADQN share the same environment–interaction structure—state observation, epsilon-greedy action selection, action execution, reward collection, and state transition—while differing in the way the action-value function is learned and updated. The tabular method performs direct discrete-value updates and is, therefore, simpler and computationally lighter, whereas the deep-learning method relies on replay memory, mini-batch optimization, and target network synchronization to handle larger and more complex state spaces. This unified representation clarifies the algorithmic relationship between the two approaches and shows how both are embedded within the same SUMO-based adaptive traffic signal control framework.

5. Evaluation and Results

The experiments were evaluated using a controlled, fully online simulation protocol. For each network configuration, controller type, demand level, and optimization objective, the corresponding controller was executed for one complete SUMO simulation horizon of 10,000 steps, with a microscopic step length of 0.10 s. During each run, queue length, cumulative delay, reward evolution, and TTC were recorded at regular 100-step intervals and exported for post-processing. The reported values in the results, therefore, represent the outputs of the implemented controlled simulation runs rather than averages across multiple independent random-seed replications.
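The recording schedule implied by these settings can be worked out directly: 10,000 steps at 0.10 s correspond to 1000 s of simulated time, and a 100-step interval corresponds to one record every 10 s. In the sketch below, the function name is hypothetical, and the choice to write the first record after one full interval (rather than at step 0) is an assumption.

```python
STEP_LENGTH_S = 0.10       # microscopic step length stated in the text
HORIZON_STEPS = 10_000     # simulation horizon stated in the text
LOG_INTERVAL_STEPS = 100   # recording interval stated in the text

def logging_schedule(horizon=HORIZON_STEPS, interval=LOG_INTERVAL_STEPS,
                     step_length=STEP_LENGTH_S):
    """Return (step, simulated_seconds) pairs at which metrics are recorded."""
    # Assumption: the first record is written after one full interval.
    return [(step, round(step * step_length, 2))
            for step in range(interval, horizon + 1, interval)]
```

Under this assumption, each run yields 100 records per indicator, the last one at step 10,000 (1000 s).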

The same network geometry, detector layout, vehicle behavior parameters, demand files, turning movement assumptions, simulation duration, and performance metrics were used when comparing the fixed-time, Tabular Q-Learning, and MADQN controllers within each scenario. However, a full repeated-run statistical protocol with multiple random seeds was not implemented in the current version. Consequently, no statistical significance tests were applied, and no confidence intervals or standard deviations are reported for the main performance indicators. The reported results should be interpreted as scenario-based comparative simulation outcomes under identical traffic and controller settings rather than as statistical estimates of population-level controller performance. Although this approach is sufficient for demonstrating comparative behavior under the implemented networks, it does not quantify run-to-run uncertainty. Future work should repeat each experiment across multiple random seeds for each controller and demand level and report mean values, standard deviations, confidence intervals, and statistical significance tests.

To ensure fair comparison, all controllers were evaluated under identical traffic-exposure conditions within each scenario. The fixed-time, Tabular Q-Learning, and MADQN controllers used the same network geometry, detector placement, vehicle behavior parameters, demand levels, route files, turning movement assumptions, simulation step lengths, simulation horizons, and performance metrics. For the learning-based controllers, training and evaluation occurred online during the same 10,000-step execution horizon. The fixed-time controller was not assigned training episodes because it is a deterministic static benchmark. Instead, it was exposed to the same demand scenario and simulation duration as the RL-based controllers. Therefore, performance differences reflect differences in control logic under equivalent simulated traffic conditions while recognizing the methodological difference between static pre-timed control and adaptive online learning.

Because the learning-based controllers were trained and evaluated online within the same 10,000-step simulation horizon, the reported reward, queue length, and cumulative delay trajectories should be interpreted as within-run learning behavior. These trajectories indicate how the agents adapt during the controlled simulation episode, but they do not represent convergence statistics across repeated training episodes. Therefore, terms such as “convergence” and “stabilization” are used descriptively to indicate reduced fluctuation and more consistent control behavior within the observed run.

To assess learning behavior, cumulative delay and queue length trajectories were monitored over the 10,000-step online simulation horizon. These curves were used as descriptive indicators of training evolution rather than formal proof of convergence. During the early part of the simulation, reward, delay, and queue trajectories may fluctuate because the agents continue to explore actions under the ε-greedy policy. As ε decays, the controllers increasingly exploit learned action-value estimates, and the trajectories generally become more stable. However, because each scenario was evaluated using a single controlled online run rather than multiple independent training replications, convergence behavior should be interpreted descriptively rather than statistically.
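The shift from exploration to exploitation under a decaying ε-greedy policy can be illustrated with a short sketch. The exponential schedule and its constants (`eps_start`, `eps_min`, `decay`) are assumed values chosen for illustration, as are the two action values; the schedule actually used by the controllers is not restated here.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponential decay with a floor; all three constants are assumed values."""
    return max(eps_min, eps_start * decay ** step)


rng = random.Random(0)
q_values = [0.2, 0.8]  # hypothetical action values for two signal phases

for step in (0, 1000, 4000, 8000):
    eps = decayed_epsilon(step)
    action = epsilon_greedy(q_values, eps, rng)
    print(step, round(eps, 3), action)
```

With these assumed constants, early steps select actions almost at random, whereas late steps select the action with the higher estimated value in nearly every step.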

The learning profiles show a transition from exploration to exploitation. During the initial learning period, approximately from step 0 to step 4000, the agents exhibit relatively unstable behavior, which is reflected in the fluctuations and temporary increases in cumulative delay and queue length. This behavior is expected because the agents are still sampling alternative actions, updating Q-values, and forming an initial representation of the traffic control environment. After approximately step 4000, the effect of exploration gradually decreases as the agents increasingly rely on learned action-value estimates. By around step 8000, the learning curves show a more stable pattern, which suggests that signal decisions have become more consistent and that operational performance has improved within the observed run.
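Because stabilization is used here in a descriptive sense, one simple way to make it concrete is to compare the spread of a per-step metric in an early window and a late window of the run. The sketch below uses a synthetic trace whose fluctuation amplitude shrinks at steps 4000 and 8000; the trace, the amplitudes, and the helper name `window_std` are assumptions for illustration and do not reproduce the study's curves.

```python
import statistics


def window_std(series, start, end):
    """Population standard deviation of the metric over steps [start, end)."""
    return statistics.pstdev(series[start:end])


def amplitude(step):
    """Assumed fluctuation amplitude for each phase of the synthetic run."""
    if step < 4000:
        return 20.0
    if step < 8000:
        return 5.0
    return 1.0


# Synthetic per-step delay trace oscillating around 100 s; NOT data from the study.
trace = [100.0 + amplitude(t) * (1 if t % 2 == 0 else -1) for t in range(10_000)]

early = window_std(trace, 0, 4000)
late = window_std(trace, 8000, 10_000)
print(early, late)
```

A late-window spread well below the early-window spread is the descriptive sense in which a trajectory is said to stabilize; it is not a statistical test of convergence.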

The results also indicate differences in convergence smoothness between the learning architectures. Tabular Q-Learning is able to converge to a performance level below the fixed-time baseline; however, its learning trajectory remains more sensitive to local state variations because continuous traffic conditions must be represented through discrete queue-state bins. This leads to residual oscillations in the smoothed curves, particularly under higher demand. In contrast, the MADQN controller shows a smoother convergence pattern because the neural network approximation allows the agent to generalize across normalized continuous state inputs rather than relying on rigid state discretization. Therefore, the deep value-based approach demonstrates stronger stability in representing changing congestion conditions.
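The difference between the two state representations can be sketched as follows. The bin boundaries, the normalization cap, and the two queue vectors are assumed values for illustration and are not the settings used in the experiments.

```python
from bisect import bisect_right


def discretize_queues(queues, bin_edges):
    """Map each queue length to a bin index; the resulting tuple is a Q-table key."""
    return tuple(bisect_right(bin_edges, q) for q in queues)


def normalize_queues(queues, max_queue):
    """Scale queue lengths to [0, 1] so they can be fed to a neural network."""
    return [min(q, max_queue) / max_queue for q in queues]


bin_edges = [3, 8, 15]  # assumed bin boundaries (vehicles)
max_queue = 20          # assumed normalization cap (vehicles)

state_a = [4, 7, 0, 12]  # hypothetical queue lengths on four approaches
state_b = [7, 4, 2, 9]

print(discretize_queues(state_a, bin_edges), discretize_queues(state_b, bin_edges))
print(normalize_queues(state_a, max_queue), normalize_queues(state_b, max_queue))
```

In this example, the two traffic states fall into the same tabular key, so the tabular agent cannot distinguish them, whereas the normalized vectors remain distinct inputs to the network.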

The comparison between single-agent and multi-agent structures further highlights the importance of distributed control in corridor-based signal systems. Under the high-demand scenario in the two-junction network, the single-agent DQN converged to a cumulative delay plateau of approximately 372 s, whereas the MADQN approach converged to a lower value of approximately 332 s. This difference suggests that centralized control becomes less efficient as the joint state-action space expands, whereas the multi-agent structure allows each intersection to respond more directly to local traffic conditions while still benefiting from coordinated learning. The result supports the suitability of the centralized training, decentralized execution framework for multi-intersection traffic signal control.
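The growth of the joint state-action space can be made concrete with a small counting sketch. It assumes that every intersection has the same number of local states and actions and that each decentralized agent observes only its own intersection; the figures used (256 local states, 2 actions) are illustrative and are not taken from the experimental setup.

```python
def centralized_pairs(n_intersections, local_states, local_actions):
    """State-action pairs when one controller sees every intersection jointly."""
    return (local_states ** n_intersections) * (local_actions ** n_intersections)


def decentralized_pairs(n_intersections, local_states, local_actions):
    """Total state-action pairs when each agent handles only its own intersection."""
    return n_intersections * local_states * local_actions


# Assumed example: 4 approaches with 4 queue bins each (4**4 = 256 states), 2 phases.
for n in (1, 2, 4):
    print(n, centralized_pairs(n, 256, 2), decentralized_pairs(n, 256, 2))
```

Under these assumptions, the centralized count grows exponentially with the number of intersections, while the decentralized total grows linearly.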

Across demand levels, the convergence behavior further demonstrates the scalability of the MADQN controller. Under low-demand conditions, both the fixed-time controller and reinforcement learning controllers maintained near-zero delay, indicating that adaptive control did not introduce unnecessary switching or artificial congestion during off-peak operation. Under medium demand, the performance gap between the fixed-time and learning-based controllers became more visible, as the RL agents were able to adjust signal decisions to reduce queue growth and delay accumulation. Under high demand, the distinction became most pronounced. In the 25 January Corridor, with a demand level of 1400 vehicles/h, the fixed-time controller failed to contain cumulative delay, which exceeded 430 s. By contrast, the MADQN controller stabilized at approximately 147 s of delay, indicating that the learned policy was able to adapt to the asymmetric and directional loading pattern of the real-world arterial corridor.

Under low-demand conditions, the network remains well below capacity, and both single-agent and multi-agent RL controllers rapidly learn to avoid serving empty or lightly occupied approaches. This behavior reduces unnecessary phase activation, allowing delays to approach zero while queue lengths remain minimal. The result indicates that adaptive control does not introduce artificial congestion under off-peak conditions and can preserve efficient traffic progression even when demand is sparse.

The convergence results indicate that the reinforcement learning controllers not only improve operational performance relative to the fixed-time baseline but also exhibit meaningful learning stability after the exploratory phase. The MADQN framework provides the most consistent convergence behavior, particularly under complex or high-demand conditions, because it combines continuous state approximation with decentralized execution. These findings support the use of multi-agent deep reinforcement learning as a scalable adaptive signal control strategy for both controlled synthetic networks and more realistic arterial corridors.

5.1. Two-Junction Validation (Synthetic Control)

The synthetic network allowed for baseline evaluation of the control architectures without real-world geometric interference.

5.1.1. Low-Demand Performance

Figure 4 compares the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent two-junction network under low-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Since the traffic volume is relatively low, the main control challenge is not severe congestion but rather avoiding unnecessary stops caused by inefficient signal-phase allocation.

The fixed-time controller produced the poorest operational performance. Its cumulative delay reached 2,339,587.3 s, whereas Tabular QL and DQN reduced this value to 25,255.3 s and 26,070.3 s, respectively. These reductions correspond to 98.92% for Tabular QL and 98.89% for DQN compared with fixed-time control. This confirms that both RL-based controllers successfully avoided unnecessary stops under sparse traffic conditions by adapting signal phases to real-time vehicle arrivals.

Queue performance showed a similar improvement. Cumulative queue accumulation decreased from 52,409 under fixed-time control to 21,686 using Tabular QL and 20,433 using DQN. Therefore, Tabular QL reduced queue accumulation by 58.62%, while DQN achieved a slightly higher reduction of 61.01%. This indicates that DQN provided marginally better queue dissipation, although both RL methods clearly outperformed the rigid fixed-time controller.

A direct comparison between the two RL approaches shows that Tabular QL produced a cumulative delay 815.0 s lower than that of DQN, equivalent to a 3.13% reduction relative to DQN. Conversely, DQN reduced cumulative queue accumulation by 1253 vehicle-count units compared with Tabular QL, corresponding to a 5.78% lower cumulative queue accumulation relative to Tabular QL. This trade-off suggests that Tabular QL was marginally more effective in minimizing the total waiting time, whereas DQN was slightly more effective in preventing queue persistence.
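The percentages reported for this scenario follow from a single relative-reduction formula, 100 × (reference − value) / reference. The short sketch below recomputes them from the cumulative values quoted above; only the helper name `percent_reduction` is introduced here.

```python
def percent_reduction(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


# Single-agent two-junction network, low demand (values quoted in the text).
fixed_delay, tabular_delay, dqn_delay = 2_339_587.3, 25_255.3, 26_070.3  # seconds
fixed_queue, tabular_queue, dqn_queue = 52_409, 21_686, 20_433           # vehicle counts

results = {
    "Tabular QL delay vs fixed-time": percent_reduction(fixed_delay, tabular_delay),
    "DQN delay vs fixed-time": percent_reduction(fixed_delay, dqn_delay),
    "Tabular QL queue vs fixed-time": percent_reduction(fixed_queue, tabular_queue),
    "DQN queue vs fixed-time": percent_reduction(fixed_queue, dqn_queue),
    "Tabular QL delay vs DQN": percent_reduction(dqn_delay, tabular_delay),
    "DQN queue vs Tabular QL": percent_reduction(tabular_queue, dqn_queue),
}
for label, value in results.items():
    print(f"{label}: {value:.2f}%")
```

Note that the reference changes between comparisons: reductions against fixed-time control use the fixed-time value as the denominator, whereas the direct RL comparison uses the value of the weaker RL controller.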

Figure 5 compares the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent two-junction network under low-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Since the demand level is low, the main objective is to avoid unnecessary stops and maintain smooth traffic progression rather than resolving severe congestion.

The fixed-time controller produced the poorest delay performance. Its cumulative delay reached 2,339,587.3 s, whereas Tabular QL reduced this value to 23,830.3 s, and DQN further reduced it to 23,615.0 s. These values correspond to delay reductions of 98.98% for Tabular QL and 98.99% for DQN compared with fixed-time control. The instantaneous delay results also support this trend, as the average delay decreased from 223.68 s under fixed-time control to 2.72 s with Tabular QL and 2.21 s with DQN.

Queue performance showed a similar improvement. The cumulative queue accumulation decreased from 52,409 under fixed-time control to 23,646 using Tabular QL and 18,415 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 54.88%, while DQN achieved a larger reduction of 64.86% compared with fixed-time control. The instantaneous queue results also indicate better traffic discharge under RL control, with the average queue length decreasing from 5.35 vehicles under fixed-time control to 2.35 vehicles for Tabular QL and 1.81 vehicles for DQN.

A direct comparison between the two RL approaches shows that DQN achieved the best performance in both delay and queue metrics. DQN reduced cumulative delay by 215.3 s compared with Tabular QL, equivalent to a 0.90% lower cumulative delay relative to Tabular QL. More importantly, DQN reduced cumulative queue accumulation by 5231 vehicle-count units compared with Tabular QL, corresponding to a 22.12% lower cumulative queue accumulation relative to Tabular QL. This indicates that the advantage of DQN is more evident in queue dissipation than in delay reduction under low-demand conditions.

5.1.2. Medium-Demand Performance

As traffic volume increases, the rigid cycle of the fixed-time controller causes platoons to stop unnecessarily, leading to a steady accumulation of delay. The RL models substantially outperform the baseline here by dynamically adjusting green times to match incoming vehicle waves.

Figure 6 compares the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent two-junction network under medium-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Compared with the low-demand scenario, medium demand introduces higher vehicle arrival rates, making inefficient phase allocation more likely to generate persistent delay and queue accumulation.

The fixed-time controller produced the poorest delay performance. Its cumulative delay reached 22,789,575.7 s, whereas Tabular QL reduced this value to 769,256.5 s, and DQN reduced it to 982,356.4 s. These values correspond to delay reductions of 96.62% for Tabular QL and 95.69% for DQN compared with fixed-time control. This indicates that both RL-based controllers substantially reduced unnecessary waiting time by adapting the signal phases to the observed traffic state.

Queue performance also improved markedly under RL control. The cumulative queue accumulation decreased from 527,892 under fixed-time control to 204,977 using Tabular QL and 202,668 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 61.17%, while DQN achieved a slightly higher reduction of 61.61% compared with fixed-time control.

A direct comparison between the two RL approaches shows a trade-off between delay and queue performance. Tabular QL achieved a lower cumulative delay than DQN by 213,099.9 s, equivalent to a 21.69% lower cumulative delay relative to DQN. In contrast, DQN achieved a slightly lower cumulative queue accumulation than Tabular QL by 2309 vehicle-count units, corresponding to a 1.13% lower cumulative queue accumulation relative to Tabular QL. This indicates that Tabular QL was more effective in minimizing total waiting time, while DQN provided a small advantage in queue dissipation.

Figure 7 compares the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent two-junction network under medium-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Under medium demand, the traffic load becomes sufficiently high for poor signal coordination to generate persistent delay and queue accumulation, making adaptive control more important than in the low-demand case.

The fixed-time controller produced the poorest delay performance. Its cumulative delay reached 22,789,575.7 s, whereas Tabular QL reduced this value to 740,159.3 s, and DQN reduced it to 781,095.0 s. These values correspond to delay reductions of 96.75% for Tabular QL and 96.57% for DQN compared with fixed-time control. This confirms that both RL-based controllers substantially improved network performance by adapting signal decisions to real-time traffic states instead of following a rigid cycle.

Queue performance also improved markedly under RL control. The cumulative queue accumulation decreased from 527,892 under fixed-time control to 213,927 using Tabular QL and 207,449 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 59.48%, while DQN achieved a slightly higher reduction of 60.70% compared with fixed-time control.

A direct comparison between the two RL approaches shows a trade-off between delay and queue performance. Tabular QL achieved a lower cumulative delay than DQN by 40,935.7 s, equivalent to a 5.24% lower cumulative delay relative to DQN. Conversely, DQN reduced cumulative queue accumulation by 6478 vehicle-count units compared with Tabular QL, corresponding to a 3.03% lower cumulative queue accumulation relative to Tabular QL. This indicates that Tabular QL was slightly more effective in minimizing total delay, whereas DQN provided a modest advantage in queue dissipation.

5.1.3. High-Demand Performance

Under high demand, the fixed-time controller accumulates far more delay than either RL controller. In this scenario, the multi-agent configuration yields lower cumulative delay and lower cumulative queue accumulation than the single-agent configuration for both learning methods, which is consistent with a benefit from coordinating the two intersections.

Figure 8 compares the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent two-junction network under high-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Under high demand, the network operates close to saturation, making the controller’s ability to limit delay growth and queue accumulation more critical than in the low- and medium-demand cases.

The fixed-time controller produced the poorest delay performance. Its cumulative delay reached 68,661,929.4 s, whereas Tabular QL reduced this value to 3,458,562.3 s, and DQN reduced it to 3,710,575.1 s. These values correspond to delay reductions of 94.96% for Tabular QL and 94.60% for DQN compared with fixed-time control. This confirms that both RL-based controllers substantially mitigated delay accumulation under saturated traffic conditions.

Queue performance also improved under RL control, although the improvement was less pronounced than the delay reduction. The cumulative queue accumulation decreased from 1,163,843 under fixed-time control to 1,038,254 using Tabular QL and 1,017,254 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 10.79%, while DQN achieved a slightly higher reduction of 12.60% compared with fixed-time control.

A direct comparison between the two RL approaches shows a trade-off between delay and queue performance. Tabular QL achieved a lower cumulative delay than DQN by 252,012.8 s, equivalent to a 6.79% lower cumulative delay relative to DQN. Conversely, DQN reduced cumulative queue accumulation by 21,000 vehicle-count units compared with Tabular QL, corresponding to a 2.02% lower cumulative queue accumulation relative to Tabular QL. This indicates that Tabular QL was more effective in minimizing total delay, whereas DQN provided a modest advantage in queue reduction.

Figure 9 compares the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent two-junction network under high-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Under high demand, the traffic network operates close to saturation; therefore, controller performance is mainly reflected in its ability to limit the growth of cumulative delay and queue accumulation.

The fixed-time controller produced the poorest delay performance. Its cumulative delay reached 68,661,929.4 s, whereas Tabular QL reduced this value to 3,308,441.9 s, and DQN reduced it to 3,319,432.9 s. These values correspond to delay reductions of 95.18% for Tabular QL and 95.17% for DQN compared with fixed-time control. This confirms that both multi-agent RL controllers strongly mitigated the rapid delay growth observed under the fixed-cycle baseline.

Queue performance also improved under RL control. The cumulative queue accumulation decreased from 1,163,843 under fixed-time control to 1,010,218 using Tabular QL and 1,004,745 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 13.20%, while DQN achieved a slightly higher reduction of 13.67% compared with fixed-time control.

A direct comparison between the two RL approaches shows very similar performance, with only a small trade-off between delay and queue metrics. Tabular QL achieved a slightly lower cumulative delay than DQN by 10,991.0 s, which is equivalent to a 0.33% lower cumulative delay relative to DQN. Conversely, DQN reduced cumulative queue accumulation by 5473 vehicle-count units compared with Tabular QL, corresponding to a 0.54% reduction.
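The size of the multi-agent benefit in the high-demand two-junction scenario can be checked directly from the cumulative delays reported for Figures 8 and 9. The sketch below is a simple recomputation from those reported values; the helper name `percent_lower` is introduced only for this example, and the comparison inherits the single-run limitation noted earlier.

```python
def percent_lower(reference, value):
    """Percentage by which `value` is lower than `reference`."""
    return 100.0 * (reference - value) / reference


# Cumulative delay (s) reported for the high-demand two-junction scenario.
single_agent = {"Tabular QL": 3_458_562.3, "DQN": 3_710_575.1}
multi_agent = {"Tabular QL": 3_308_441.9, "DQN": 3_319_432.9}

gains = {
    name: percent_lower(single_agent[name], multi_agent[name])
    for name in single_agent
}
for name, gain in gains.items():
    print(f"{name}: multi-agent delay is {gain:.2f}% lower than single-agent")
```

Both gains are positive, and the gain is larger for DQN than for Tabular QL.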

5.2. 25 January Corridor (Real-World Digital Twin, Assiut, Egypt)

When applied to the complex geometry and asymmetric lane configurations of the 25 January Corridor digital twin in Assiut, both RL controllers reduced cumulative delay and cumulative queue accumulation relative to the fixed-time baseline at all three demand levels.

5.2.1. Low-Demand Performance

As in the synthetic network, the RL agents optimize green-light duration for the dominant traffic streams, cutting minor delays caused by the fixed-time baseline’s rigid cycles. Figure 10 compares the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent 25 January network under low-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Compared with the synthetic two-junction network, the 25 January Corridor represents a more realistic geometry; however, under low-demand conditions, the main control objective remains the reduction of unnecessary stops and queue formation.

The fixed-time controller produced the highest cumulative delay, reaching 62,482.0 s. Tabular QL reduced this value to 14,691.4 s, while DQN achieved the lowest cumulative delay of 9419.0 s. These values correspond to delay reductions of 76.49% for Tabular QL and 84.93% for DQN compared with fixed-time control. This indicates that both RL-based controllers improved signal responsiveness, with DQN showing a clearer advantage in reducing total waiting time.

Queue performance followed the same trend. The cumulative queue accumulation decreased from 4355 under fixed-time control to 2426 using Tabular QL and 2239 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 44.29%, while DQN achieved a higher reduction of 48.59% compared with fixed-time control.

A direct comparison between the two RL approaches shows that DQN outperformed Tabular QL in both operational metrics. DQN reduced cumulative delay by 5272.4 s compared with Tabular QL, equivalent to a 35.89% lower cumulative delay relative to Tabular QL. It also reduced cumulative queue accumulation by 187 vehicle-count units, corresponding to a 7.71% lower cumulative queue accumulation relative to Tabular QL.

Figure 11 compares the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent 25 January network under low-demand conditions. The figure evaluates operational efficiency using instantaneous and cumulative delay, as well as instantaneous and cumulative queue length. Under low-demand conditions, the main objective is to reduce unnecessary stopping and maintain smooth vehicle progression through the real-world corridor geometry.

The fixed-time controller produced the highest cumulative delay, reaching 62,482.0 s. Tabular QL reduced this value to 9965.7 s, while DQN achieved the lowest cumulative delay of 9554.7 s. These values correspond to delay reductions of 84.05% for Tabular QL and 84.71% for DQN compared with fixed-time control. This indicates that both multi-agent RL controllers substantially improved signal responsiveness under sparse traffic conditions.

Queue performance showed the same overall trend. The cumulative queue accumulation decreased from 4355 under fixed-time control to 2361 using Tabular QL and 2270 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 45.79%, while DQN achieved a slightly higher reduction of 47.88% compared with fixed-time control.

A direct comparison between the two RL approaches shows that DQN achieved the best performance in both delay and queue metrics. DQN reduced cumulative delay by 411.0 s compared with Tabular QL, equivalent to a 4.12% lower cumulative delay relative to Tabular QL. It also reduced cumulative queue accumulation by 91 vehicle-count units, corresponding to a 3.85% lower cumulative queue accumulation relative to Tabular QL.

5.2.2. Medium-Demand Performance

The irregular geometry of the Assiut intersection makes fixed-time control highly inefficient when dealing with asymmetric traffic bursts. By adapting to these bursts rather than relying on rigid cycles, the RL controllers substantially reduced total cumulative delay compared with the static baseline.

Figure 12 evaluates the performance of the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent 25 January network under medium-demand conditions. This scenario is more challenging than the low-demand case because the real-world 25 January Corridor has asymmetric geometry and higher vehicle arrival rates, making inefficient signal timing more likely to generate sustained delay and queue accumulation.

The fixed-time controller produced the highest cumulative delay, reaching 837,375.4 s. Tabular QL reduced this value to 569,692.0 s, while DQN reduced it to 595,823.1 s. These results correspond to cumulative delay reductions of 31.97% for Tabular QL and 28.85% for DQN compared with fixed-time control. This confirms that both RL-based controllers improved signal responsiveness in the real-world corridor, with Tabular QL achieving the stronger delay reduction in this scenario.

Queue performance also improved under RL control. The cumulative queue accumulation decreased from 48,703 under fixed-time control to 29,216 using Tabular QL and 29,693 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 40.01%, while DQN achieved a reduction of 39.03% compared with fixed-time control.

A direct comparison between the two RL approaches shows that Tabular QL outperformed DQN in both cumulative delay and cumulative queue accumulation. Tabular QL reduced cumulative delay by 26,131.1 s compared with DQN, equivalent to a 4.39% lower cumulative delay relative to DQN. It also reduced cumulative queue accumulation by 477 vehicle-count units, corresponding to a 1.61% lower cumulative queue accumulation relative to DQN.

Figure 13 evaluates the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent 25 January network under medium-demand conditions. This scenario tests the controllers under a realistic arterial corridor with asymmetric geometry and a moderate traffic load, where effective multi-agent coordination is required to reduce delay and prevent queue accumulation.

The fixed-time controller produced the highest cumulative delay, reaching 837,375.4 s. Tabular QL reduced this value to 272,075.4 s, while DQN reduced it to 295,685.7 s. These values correspond to cumulative delay reductions of 67.51% for Tabular QL and 64.69% for DQN compared with fixed-time control. This shows that both multi-agent RL controllers substantially improved delay performance by adapting signal decisions to real-time traffic conditions.

Queue performance also improved compared with the fixed-time baseline. The cumulative queue accumulation decreased from 48,703 under fixed-time control to 28,718 using Tabular QL and 29,612 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 41.03%, while DQN achieved a reduction of 39.20% compared with fixed-time control.

A direct comparison between the two RL approaches shows that Tabular QL outperformed DQN in both cumulative delay and cumulative queue accumulation. Tabular QL reduced the cumulative delay by 23,610.3 s compared with DQN, equivalent to a 7.98% lower cumulative delay relative to DQN. It also reduced cumulative queue accumulation by 894 vehicle-count units, corresponding to a 3.02% lower cumulative queue accumulation relative to DQN.

5.2.3. High-Demand Performance

Under high-demand conditions in the real-world 25 January Corridor, the RL-based controllers provided clear reductions in cumulative delay compared with the fixed-time baseline. However, queue improvements were more limited because the corridor operated close to saturation, where physical capacity constraints restrict the extent to which signal control adaptation alone can reduce queue accumulation.

Figure 14 evaluates the fixed-time controller, Tabular Q-Learning, and DQN for the single-agent 25 January network under high-demand conditions. This scenario represents a more congested operating condition in the real-world corridor, where the controller must limit delay growth and queue accumulation under increased traffic pressure and asymmetric intersection geometry.

The fixed-time controller produced the highest cumulative delay, reaching 4,244,813.4 s. Tabular QL reduced this value to 1,509,716.9 s, while DQN achieved a slightly lower cumulative delay of 1,488,098.8 s. These values correspond to cumulative delay reductions of 64.43% for Tabular QL and 64.94% for DQN compared with fixed-time control. This shows that both RL-based controllers substantially improved delay performance under high-demand real-world conditions, with DQN providing the stronger delay reduction.

Queue performance showed a more limited improvement because the network operates closer to capacity under high demand. The cumulative queue accumulation decreased from 219,221 under fixed-time control to 199,331 using Tabular QL and 203,415 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 9.07%, while DQN achieved a reduction of 7.21% compared with fixed-time control.

A direct comparison between the two RL approaches shows a trade-off between delay and queue performance. DQN reduced cumulative delay by 21,618.1 s compared with Tabular QL, equivalent to a 1.43% lower cumulative delay relative to Tabular QL. Conversely, Tabular QL reduced cumulative queue accumulation by 4084 vehicle-count units compared with DQN, corresponding to a 2.01% lower cumulative queue accumulation relative to DQN.

Figure 15 evaluates the fixed-time controller, Tabular Q-Learning, and DQN for the multi-agent 25 January network under high-demand conditions. This scenario represents the most demanding real-world case, combining high traffic pressure with the asymmetric geometry of the 25 January Corridor. Under these conditions, controller performance is mainly reflected in its ability to limit cumulative delay and queue growth near the physical capacity of the network.

The fixed-time controller produced the highest cumulative delay, reaching 4,244,813.4 s. Tabular QL reduced this value to 1,490,484.3 s, while DQN reduced it to 1,536,591.5 s. These values correspond to cumulative delay reductions of 64.89% for Tabular QL and 63.80% for DQN compared with fixed-time control. This confirms that both multi-agent RL controllers substantially improved delay performance under high-demand real-world traffic conditions.

Queue performance also improved under RL control, although the reduction was more limited because the network operates close to saturation. The cumulative queue accumulation decreased from 219,221 under fixed-time control to 198,048 using Tabular QL and 198,849 using DQN. Therefore, Tabular QL reduced cumulative queue accumulation by 9.66%, while DQN achieved a reduction of 9.29% compared with fixed-time control.

A direct comparison between the two RL approaches shows that Tabular QL achieved slightly better performance in both cumulative delay and cumulative queue accumulation. Tabular QL reduced cumulative delay by 46,107.2 s compared with DQN, equivalent to a 3.00% lower cumulative delay relative to DQN. It also reduced cumulative queue accumulation by 801 vehicle-count units, corresponding to a 0.40% lower cumulative queue accumulation relative to DQN.

5.3. Comparative Performance Analysis

Table 5 and Table 6 provide a consolidated comparison of the three signal control strategies—fixed-time, Tabular Q-Learning, and DQN/MADQN—across the two investigated networks: the synthetic two-junction network and the real-world 25 January Corridor in Assiut. The values reported in these tables represent the post-convergence performance of the controllers, with the minimum delay and queue values extracted after the 5000th simulation step to ensure that the comparison reflects stabilized learning behavior rather than early exploratory fluctuations.
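The extraction rule used for these tables, namely the minimum value observed after the 5000th simulation step, can be sketched as follows. The trace is synthetic, and the indexing convention (element 0 corresponds to step 1) is an assumption made for the example.

```python
def post_convergence_minimum(series, start_step=5000):
    """Minimum of a per-step metric over all steps after `start_step`.

    Assumes series[0] is step 1, so series[start_step:] begins at step start_step + 1.
    """
    tail = series[start_step:]
    if not tail:
        raise ValueError("series is shorter than start_step")
    return min(tail)


# Synthetic per-step delay trace (s); NOT data from the study.
trace = [50.0] * 10_000
trace[100] = 1.0    # transient low value during early exploration
trace[7000] = 20.0  # lowest value after the 5000th step

print(min(trace), post_convergence_minimum(trace))
```

The early transient minimum is excluded, which is why the tabulated values can differ from the global minimum of a trajectory.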

As illustrated in Table 5, the synthetic two-junction network demonstrates the clear operational benefit of reinforcement-learning-based control. Under high-demand conditions, the fixed-time controller produced a delay of 4613.5 s, while Tabular Q-Learning and DQN reduced this value substantially. In the multi-agent setting, DQN achieved a high-demand delay of 321.8 s, compared with 338.9 s for Tabular Q-Learning. The same ordering appears in the single-agent setting, where DQN reached 302.7 s compared with 329.6 s for Tabular Q-Learning. These results indicate that DQN can provide stronger delay minimization in the synthetic network when traffic demand is high and when the traffic state can benefit from continuous function approximation.

For medium-demand conditions in the two-junction network, Table 5 shows a more nuanced relationship between the two learning controllers. In the multi-agent setting, Tabular Q-Learning and DQN produced nearly identical delay values, 28.2 s and 28.6 s, respectively. However, DQN achieved a lower queue value of 12 compared with 15 for Tabular Q-Learning. In the single-agent setting, Tabular Q-Learning performed better in delay minimization, achieving 28.7 s compared with 46.6 s for DQN, while both methods maintained low queue values. This suggests that Tabular Q-Learning can remain highly competitive in controlled and moderately loaded environments where the discretized state representation is sufficient to capture the relevant traffic dynamics.

Under low-demand conditions, both RL controllers reduced delay to near-zero levels in the two-junction network. The fixed-time controller recorded a delay of 62.7 s, while Tabular Q-Learning and DQN reduced the delay to approximately 0.4–0.6 s. Queue values also remained at 1 across all controllers, indicating that under sparse traffic conditions, the primary advantage of RL is not queue reduction but the avoidance of unnecessary stops caused by rigid fixed cycles.

Table 6 extends the comparison to the real-world 25 January Corridor, where the geometry is more asymmetric, and the traffic patterns are more representative of practical arterial operation. Under high-demand conditions, the advantage of RL control remains clear. The fixed-time controller recorded a delay of 132.1 s, while Tabular Q-Learning reduced this value to approximately 9.1–9.2 s across both single-agent and multi-agent settings. DQN also improved substantially over fixed-time control, with delay values of approximately 18.5–18.6 s. These results indicate that, in the real-world corridor, Tabular Q-Learning produced the strongest high-demand delay performance, likely because the reduced and discretized state space was sufficient for capturing the dominant traffic patterns of the corridor.

For medium- and low-demand scenarios in the 25 January Corridor, Table 6 shows that all controllers maintained very low queue values, generally equal to 1. This indicates that the network did not experience severe queue accumulation under these demand levels. However, the delay values still reveal differences between the controllers. Under low demand, DQN achieved the lowest single-agent delay of 0.4 s, while Tabular Q-Learning achieved the lowest multi-agent delay of 0.8 s compared with 1.7 s for the fixed-time controller. Under medium demand, the differences are relatively small, suggesting that the corridor operates efficiently under moderate loading and that the main performance distinction becomes more visible under high demand.

The overall interpretation of Table 5 and Table 6 is summarized in Table 7, which compares the control strategies according to delay performance, queue management, adaptability, scalability, and recommended application. Fixed-time control remains the simplest strategy and may be acceptable under stable, predictable, and low-variation traffic conditions. However, its performance deteriorates when demand fluctuates or when the network becomes saturated because it cannot respond to real-time queue formation or directional imbalance.

Tabular Q-Learning provides a strong balance between performance, simplicity, and interpretability. It substantially reduces delay compared with fixed-time control and performs particularly well in simple or moderately complex networks where discretized traffic states can adequately represent congestion conditions. Its strong performance in the 25 January Corridor, especially under high-demand conditions, confirms that tabular learning can remain effective when the state-action space is manageable and when traffic patterns are relatively structured. These findings help define the practical operating domain of tabular methods. Tabular Q-Learning is most suitable for small corridors, isolated intersections, or limited multi-intersection systems where the number of signal phases and detector states can be discretized without excessive state-space growth. Its usefulness decreases as the number of intersections, detectors, phase combinations, demand regimes, or continuous traffic-state variables increases because the Q-table grows rapidly and learning becomes inefficient. In such larger or more heterogeneous networks, deep value approximation, graph-based MARL, or actor–critic methods are likely to be more appropriate.

DQN/MADQN offers stronger generalization capability because it processes normalized continuous traffic states rather than relying on fixed state bins. This makes it more suitable for complex, high-volume, or multi-intersection environments where traffic conditions vary dynamically. The DQN results in Table 5, particularly under high-demand synthetic scenarios, demonstrate its ability to reduce delay and improve queue dissipation. However, the comparison also shows that DQN does not dominate in every case. In some medium- and high-demand real-world scenarios, Tabular Q-Learning achieved lower delay, indicating that the best controller depends on network complexity, demand level, and the selected performance metric. The difference in performance between MADQN and Tabular Q-Learning can be interpreted in relation to traffic-state complexity and demand regime. Under low-demand and queue-dissipation conditions, traffic states are sparse, variable, and often characterized by short queues or rapidly changing detector counts. In these cases, MADQN benefits from neural network function approximation because it can generalize across similar traffic states rather than treating each discretized state independently. This enables smoother Q-value estimation and more flexible phase-maintenance or phase-switching decisions when the objective is to dissipate queues without excessive phase changes.

By contrast, under medium- and high-demand conditions, traffic states become more saturated, repetitive, and dominated by persistent queues. In such conditions, the discretized state representation used by Tabular Q-Learning can remain effective because the agent repeatedly observes similar congestion states and can learn stable action preferences for them. The four-bin discretization captures the main operational regimes—empty, low, moderate, and high queues—which may be sufficient when congestion patterns are recurrent. Therefore, the competitiveness of Tabular Q-Learning under higher demand does not contradict the generalization advantage of MADQN; rather, it indicates that simpler tabular control may remain practical when the state-action space is compact and when traffic states are repeatedly encountered.
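A four-bin discretization of this kind could be sketched as follows. The bin edges, the function names, and the choice to append the current phase to the state are assumptions made for illustration; the thresholds actually used in the study are not stated in this section.

```python
from bisect import bisect_right

# Hypothetical bin edges in vehicles: 0 -> empty, 1-4 -> low, 5-9 -> moderate, >=10 -> high.
BIN_EDGES = (1, 5, 10)
BIN_LABELS = ("empty", "low", "moderate", "high")


def discretize_queue(queue_len: int) -> int:
    """Map a detector queue count to one of four bin indices (0..3)."""
    if queue_len < 0:
        raise ValueError("queue length cannot be negative")
    return bisect_right(BIN_EDGES, queue_len)


def encode_state(queues, phase: int):
    """Build a hashable tabular state from per-approach queues and the current phase."""
    return (tuple(discretize_queue(q) for q in queues), phase)


state = encode_state([0, 3, 12, 7], phase=1)
print(state, [BIN_LABELS[b] for b in state[0]])
```

Because every saturated queue above the top edge maps to the same bin, recurrent congestion collapses onto a small number of states, which is the mechanism the paragraph above appeals to.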

These findings suggest that deep MARL architectures are not automatically superior in all operating regimes. Instead, controller selection should consider the network size, demand variability, state-space dimensionality, computational costs, interpretability, and the target performance metric. Because the RL agents were trained and evaluated within the same SUMO-based simulation framework, the generalization capability of the learned policies should be interpreted with caution. Although the use of both a synthetic two-intersection corridor and a real-world 25 January digital twin provides two distinct evaluation environments, the learned policies may still reflect the specific geometry, detector layout, turning ratios, demand levels, signal-phase structure, and vehicle behavior assumptions used during training. Therefore, strong performance in the tested scenarios does not necessarily guarantee equivalent performance under unseen traffic patterns, incident conditions, seasonal demand variations, different signal phasing schemes, or larger network topologies. This is particularly relevant for MADQN, which may learn useful generalized value approximations but may also overfit to repeated corridor-specific state transitions if the training scenarios are not sufficiently diverse. Consequently, the reported results should be viewed as evidence of scenario-specific robustness within the tested networks rather than definitive proof of transferability to all urban corridors.

5.4. Cross-Network Conclusions

Across both the synthetic two-junction network and the real-world 25 January Corridor, and across all demand profiles, the reinforcement learning controllers generally outperformed the fixed-time baseline in delay and, where queues formed, in queue management. The largest relative gains were observed under medium demand, where the fixed-time signal is most misaligned with actual traffic patterns and the RL agents have sufficient room to demonstrate adaptive behavior. These findings are consistent with the broader MARL-for-ITS literature, which reports that network-wide multi-agent control can reduce delay and improve throughput by allowing intersections to adapt to upstream and downstream traffic dynamics rather than optimizing isolated signal phases [11].

The observed superiority of adaptive controllers over fixed-time control is also consistent with recent hybrid KNN–DQN evidence. Reference [18] reported that a hybrid intelligent controller reduced the average waiting time by 48%, reduced the number of stops by more than 58%, and increased the flow rate by 57% compared with a conventional fixed-time system. Although their study considered a hybrid single-intersection/multi-intersection SUMO framework rather than the MARL comparison performed here, both studies converge on the same operational conclusion: controllers that respond to real-time traffic states substantially outperform static signal plans, particularly when demand varies or unexpected disturbances occur.

The superiority of adaptive DRL control over static signal timing is further supported by PPO-based evidence from [6], where PPO-TSC achieved lower average travel time and time loss and a higher average speed than DQN, DQN-DTSE, D3QN-DTSE, PPO-DTSE, Max-PPO, LSTM-PPO, and ELM-MP under single-intersection flat and peak traffic scenarios. These findings are also consistent with [25], who evaluated cooperative MARL signal control on six contiguous real-world intersections in Icheon City using SUMO. Their neighbor-aware cooperative controller reduced the cumulative waiting time by approximately 54% during off-peak conditions and 30% during peak conditions compared with fixed-time control. Although their objective emphasized environmental performance and their evaluation used waiting time rather than the delay/queue formulation adopted here, both studies support the same operational conclusion: adaptive multi-agent signal control is more effective than fixed-time control when traffic demand varies across interconnected intersections.

Multi-agent coordination provided a clear additional advantage over single-agent control in both networks, with decentralized phase switching enabling smoother flow through successive intersections and preventing the cascade congestion that undermines the fixed-time approach under a sustained load. The 25 January results confirm that decentralized RL can scale beyond the synthetic proof-of-concept to real-world arterial conditions in Assiut, though the absolute performance values are modulated by the physical capacity constraints and topology of the real network. This is consistent with [4], who compared centralized and decentralized DRL control for a multi-intersection corridor and found that decentralized control reduced delay and conflicts by 26.4% and 26.9%, respectively, relative to centralized control. Although their implementation used PPO and multi-objective rewards rather than the Tabular Q-Learning and MADQN controllers evaluated here, both studies support the same architectural conclusion: distributing signal control decisions across local intersection agents can improve scalability and network-level robustness.

A practical challenge for RL-based traffic signal control is the cold-start period required before a deep agent develops an effective policy. Reference [18] addressed this issue by using KNN to provide fast signal-timing recommendations based on historical traffic patterns while the DQN agent gradually improved through environmental interaction. This suggests that real-world deployment of MADQN controllers may benefit from a transitional layer that uses empirical rules, supervised classifiers, or historically calibrated policies until the reinforcement learning agents reach stable performance. Such a strategy could reduce unsafe or inefficient exploratory actions during early deployment.

5.5. Surrogate Safety Assessment Using Time-to-Collision

In addition to delay- and queue-based operational metrics, Time-To-Collision (TTC) was analyzed as a complementary surrogate safety indicator. Descriptive statistics were calculated for all case studies, agent paradigms, demand levels, TTC datasets, and control strategies using a nominal safe/default TTC value of 50 s. The analysis included the mean TTC, standard deviation, median TTC, minimum TTC, the conflict-focused mean TTC after excluding the 50 s default value, the percentage of unsafe TTC events below 3 s, the percentage of critical TTC events below 1.5 s, and the percentage of samples retained at the 50 s default value.
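The descriptive statistics listed above could be computed along the following lines. This is a minimal sketch, not the authors' analysis script: the function name `ttc_summary`, the dictionary keys, and the use of the population standard deviation are assumptions, while the 50 s default and the 3 s and 1.5 s thresholds are taken from the text.

```python
from statistics import mean, median, pstdev

TTC_DEFAULT = 50.0  # nominal safe/default value, as stated in the text


def ttc_summary(samples, default=TTC_DEFAULT, unsafe=3.0, critical=1.5):
    """Summarize TTC samples (seconds); values equal to `default` mean 'no interaction'."""
    if not samples:
        raise ValueError("at least one TTC sample is required")
    n = len(samples)
    conflicts = [t for t in samples if t < default]  # non-default samples only
    return {
        "mean": mean(samples),
        "std": pstdev(samples),  # population form; the sample form is an equally valid choice
        "median": median(samples),
        "min": min(samples),
        "conflict_mean": mean(conflicts) if conflicts else None,
        "pct_unsafe": 100.0 * sum(t < unsafe for t in samples) / n,
        "pct_critical": 100.0 * sum(t < critical for t in samples) / n,
        "pct_default": 100.0 * (n - len(conflicts)) / n,
    }


# Hypothetical samples: three non-interaction records and two close-following events.
print(ttc_summary([50.0, 50.0, 50.0, 2.0, 1.0]))
```

In this hypothetical sample the overall mean is dominated by the three default records, whereas the conflict-focused mean reflects only the two detected interactions.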

The 50 s value was used as a nominal safe/default TTC value when no critical vehicle-following interaction was detected, rather than as a conventional collision risk threshold. Therefore, higher mean TTC values should not be interpreted in isolation as evidence of superior traffic control. A controller may produce higher TTC values because it creates fewer close-following interactions through excessive stopping, lower vehicle discharge, or reduced vehicle interactions. For this reason, TTC was interpreted jointly with cumulative delay and queue accumulation.
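The caveat above can be made explicit with a simple decomposition. As a sketch, let $p_{\mathrm{def}}$ denote the share of samples retained at the 50 s default and $\overline{T}_{\mathrm{conf}}$ the conflict-focused mean over the remaining samples; both symbols are introduced here for illustration and are not notation from the study.

$$\overline{T} = 50\,p_{\mathrm{def}} + \left(1 - p_{\mathrm{def}}\right)\overline{T}_{\mathrm{conf}}$$

With purely hypothetical values, two controllers sharing $\overline{T}_{\mathrm{conf}} = 4$ s but with $p_{\mathrm{def}} = 0.9$ and $p_{\mathrm{def}} = 0.5$ would report mean TTC values of 45.4 s and 27.0 s. The gap would then come entirely from the share of default samples, not from how close the detected interactions were.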

To improve the readability of the main manuscript, the full TTC descriptive statistics table was moved to Appendix A as Table A1. The main text therefore focuses on the interpretation of the most relevant TTC trends, while Appendix A preserves the complete stratified statistics for transparency and reproducibility. As shown in Appendix A, Table A1, TTC behavior varied substantially with demand level and network topology. Low-demand scenarios generally produced higher mean TTC values and lower proportions of unsafe TTC events, whereas medium- and high-demand scenarios showed lower TTC values because of denser vehicle interactions and more frequent close-following conditions. Since TTC was used as an independent surrogate safety-monitoring indicator rather than as a direct reward component, these results should be interpreted jointly with the queue length and cumulative delay findings.

The TTC analysis should be interpreted as a surrogate safety assessment rather than as a fully safety-constrained optimization result. Since TTC was not included in the reward function, the reported TTC values provide an independent diagnostic measure of the safety-related consequences of the learned signal policies. This distinction is important because a controller may improve the mean TTC by reducing vehicle interactions through excessive stopping while still producing poor operational performance. Therefore, TTC is interpreted jointly with queue length and cumulative delay. Controllers that reduce delay and queue accumulation while maintaining low percentages of unsafe TTC events are considered more balanced from an efficiency–safety perspective.

Although Table A1 summarizes the TTC statistics at the case-study level, the following discussion refers to the full stratified TTC results, including the single- and multi-agent settings and the delay- and queue-based TTC datasets.

The TTC analysis shows that safety behavior is strongly affected by demand level and network topology. Under low-demand conditions, the RL-based controllers generally maintained high TTC values and reduced the proportion of non-default TTC samples compared with the fixed-time baseline. In the two-junction single-agent low-demand case, Tabular QL achieved the highest delay-based mean TTC of 48.89 s, with only 3% of the samples being below the 50 s default value compared with 34.70 s and 35% non-default samples for fixed-time control. For the corresponding queue-based TTC dataset, DQN achieved the highest mean TTC of 44.79 s and reduced the non-default samples to 14%. This indicates that, under sparse traffic conditions, the RL controllers improved operational performance while maintaining favorable TTC-based safety margins.

A similar pattern was observed in the two-junction multi-agent low-demand case. DQN achieved the highest delay-based mean TTC of 47.92 s and the highest queue-based mean TTC of 41.72 s, with lower non-default TTC proportions than fixed-time control. These findings suggest that multi-agent coordination can support smoother vehicle progression under low demand by reducing unnecessary stop-and-go interactions.

For medium- and high-demand scenarios in the two-junction network, TTC values decreased sharply across all controllers. Mean TTC values were generally close to 2 s in medium-demand conditions and around 1.5 s in high-demand ones, with approximately 98–99% of the samples falling below the 50 s default value. This reflects the higher frequency of close vehicle-following interactions as the network approaches saturation. Under these conditions, the differences between controllers became small, indicating that TTC was increasingly governed by traffic density and physical capacity limitations rather than signal control alone.

In the 25 January Corridor, the low-demand TTC results again showed favorable safety behavior for the RL controllers. In the single-agent low-demand scenario, DQN achieved the highest delay-based mean TTC of 49.18 s, with only 2% non-default TTC samples. For queue-based TTC, Tabular QL achieved the highest mean TTC of 48.81 s, while DQN remained very close at 48.62 s. In the multi-agent low-demand scenario, Tabular QL achieved the highest delay-based mean TTC of 49.02 s, whereas DQN achieved the highest queue-based mean TTC of 48.64 s. These results indicate that both RL approaches maintained favorable safety margins under sparse real-world traffic conditions.

For the medium- and high-demand 25 January scenarios, fixed-time control showed higher mean TTC values than the RL controllers in several datasets. However, this should not be interpreted as superior overall traffic performance. Since TTC values were capped at 50 s when no critical interaction was detected, a higher mean TTC can partly reflect a larger proportion of non-interaction samples, low-discharge conditions, or prolonged stopping rather than safer and more efficient traffic operation. This interpretation is particularly important because the same fixed-time controller produced substantially worse cumulative delay and queue accumulation in the operational analysis. Therefore, the TTC results confirm the need to interpret safety jointly with efficiency metrics.

5.6. Comparison with State-of-the-Art Studies

The present findings are broadly consistent with recent state-of-the-art studies showing that RL-, DRL-, and MARL-based traffic signal controllers can outperform fixed-time signal control under dynamic traffic demand. For example, previous Q-learning and DQN-based studies reported reductions in waiting time, queue length, stop frequency, or travel time compared with fixed-time or conventional baselines [17,18,19,20]. Similarly, recent PPO and multi-agent DRL studies demonstrated that adaptive learning-based controllers can improve traffic efficiency under single-intersection, grid-network, or corridor-level scenarios [6,7,8,24,25]. The results of the present study support these general findings because both multi-agent Tabular Q-Learning and MADQN produced improvements over the fixed-time baseline in several demand scenarios.

However, the present study differs from many previous works in three important respects. First, it provides a direct comparison between a simpler tabular value-based MARL controller and a deep value-based MADQN controller under equivalent simulation settings. Second, it evaluates both controllers using two complementary testbeds: a controlled synthetic two-intersection corridor and a real-world digital twin of the 25 January Corridor in Assiut, Egypt. Third, it examines performance across low-, medium-, and high-demand scenarios using queue length, cumulative delay, and TTC as operational and safety-related indicators. These aspects make the comparison more useful for practical controller selection because the results show that deeper DRL architectures are not universally superior; rather, the preferred controller depends on the demand level, network complexity, and the selected operational objective.

A strict numerical comparison with previous studies is difficult because published studies differ in network geometry, demand generation, signal phasing, detector configuration, state representation, reward design, training horizon, and evaluation metrics. Therefore, the comparison with state-of-the-art studies should be interpreted at the methodological and performance-trend levels rather than as a direct one-to-one benchmark. Within this context, the proposed framework is significant because it demonstrates that compact detector-based MARL controllers can improve adaptive signal control in both controlled and realistic corridor settings while also clarifying the trade-off between the simplicity of Tabular Q-Learning and the function-approximation capability of MADQN.

5.7. Computational Overhead, Training Behavior, and Scalability Considerations

Computational overhead, training behavior, convergence speed, and scalability are important considerations for interpreting the practical applicability of the proposed MARL framework. In the present implementation, Tabular Q-Learning is computationally lightweight because it updates explicit state-action values without neural network training, replay memory, or gradient-based optimization. This makes it easy to implement and well suited to small corridor networks with compact, discretized states. However, its main limitation is the rapid growth of the Q-table as the number of intersections, detectors, signal phases, and discretization levels increases. Therefore, although Tabular Q-Learning remains competitive in the tested two-intersection corridors, its scalability may become limited in larger urban networks.
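To illustrate why the tabular update is lightweight, the standard one-step Q-learning rule can be written in a few lines. This is a generic sketch, not the authors' code: the learning rate, discount factor, state labels, and helper names are assumptions, and only the maintain/switch action pair is taken from the framework described in this paper.

```python
from collections import defaultdict

ACTIONS = (0, 1)  # 0 = maintain the current phase, 1 = switch to the next phase


def make_q_table():
    """Q-table that lazily creates a zero-initialized row for each new state."""
    return defaultdict(lambda: [0.0] * len(ACTIONS))


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


q = make_q_table()
# Hypothetical transition: switching in state "s0" cost a reward of -4 (e.g. negative queue).
q_update(q, "s0", 1, reward=-4.0, next_state="s1", alpha=0.5, gamma=0.9)
print(q["s0"])
```

Each update touches a single table entry and needs one maximum over the action set, with no replay memory or gradient step involved.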

By contrast, MADQN introduces higher computational overhead during training because each agent requires neural network inference, mini-batch learning, replay-buffer management, and target network updates. Nevertheless, MADQN is generally more suitable for larger or continuous traffic-state representations because function approximation avoids the explicit enumeration of all state-action combinations. In terms of training behavior, the learning curves should be interpreted as evidence of policy stabilization during the simulation horizon rather than as formal proof of algorithmic convergence. During early training, exploration may produce unstable queues and delay trajectories; later, as exploration decreases, the agents increasingly exploit learned action-value estimates, and the performance trends tend to become more stable. A detailed wall-clock training-time analysis was beyond the scope of the present study and should be addressed in future work by using repeated runs and different network sizes.
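The shift from exploration to exploitation described above is commonly implemented with a decaying epsilon-greedy rule. The sketch below assumes a linear schedule with hypothetical start, end, and horizon values; the schedule actually used in the study is not specified in this section.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps decisions."""
    frac = min(max(step, 0) / decay_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice: random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for step in (0, 5_000, 10_000):
    print(step, round(epsilon_at(step), 3))
```

Early in training most decisions are random, which is one plausible source of the unstable queue and delay trajectories; once epsilon reaches its floor, decisions are almost always greedy with respect to the learned action values.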

From a deployment perspective, the trained controllers should be interpreted as simulation-tested prototypes rather than immediately deployable field systems. Practical implementation would require offline pre-training using historical and simulated demand scenarios, validation under unseen traffic patterns, possible online fine-tuning with safety constraints, and distributed execution across intersections to reduce real-time computational burden. As the number of intersections increases, additional work is required to evaluate the convergence speed, communication requirements, coordination among neighboring agents, and the effect of parallel training or centralized-training/decentralized-execution strategies on network-level scalability.

5.8. Scalability, Communication Overhead, and Computational Complexity Analysis

To support the generalization of the proposed adaptive signal control framework, it is important to discuss scalability, communication overhead, and computational complexity. Although the present experiments focus on a synthetic two-intersection corridor and a real-world two-intersection digital twin, the computational characteristics of the three controllers differ substantially when extended to larger networks.

For fixed-time control, the online computational cost is negligible because the signal plan is predefined and does not depend on real-time state evaluation. At each decision step, the controller simply follows the stored phase schedule. Therefore, its online decision complexity can be considered approximately O(1) per intersection. However, this computational simplicity comes at the cost of limited adaptability under fluctuating and asymmetric demand.

For Tabular Q-Learning, each agent stores and updates a Q-table indexed by the discretized traffic state and the available actions. If $N$ is the number of intersections, $|S_i|$ is the discretized state-space size of agent $i$, and $|A_i|$ is its action-space size, the memory requirement is approximately $O\left(\sum_{i=1}^{N} |S_i|\,|A_i|\right)$. The online action-selection cost for each agent is small because it only requires comparing Q-values across the available actions, approximately $O(|A_i|)$. The Q-value update is also computationally inexpensive. However, the main scalability limitation of Tabular Q-Learning is the exponential growth of the state space when more detectors, queue bins, signal phases, or neighboring-intersection states are included. Therefore, while Tabular Q-Learning is efficient for compact state representations and small corridors, its memory demand can become prohibitive in larger networks with high-dimensional observations.
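The growth described above can be made concrete with a small calculation. The sketch assumes, purely for illustration, that each agent's state is the tuple of per-detector queue bins combined with the current phase; the function and parameter names are hypothetical.

```python
def q_table_entries(n_detectors: int, n_bins: int, n_phases: int, n_actions: int) -> int:
    """Number of Q-values for one agent: |S_i| * |A_i| with |S_i| = n_bins**n_detectors * n_phases."""
    n_states = (n_bins ** n_detectors) * n_phases
    return n_states * n_actions


def total_entries(agents) -> int:
    """Sum of |S_i| * |A_i| over all agents, matching the memory estimate in the text."""
    return sum(q_table_entries(**agent) for agent in agents)


# Four queue bins, two phases, and maintain/switch actions; only the detector count varies.
for detectors in (4, 8, 16):
    print(detectors, q_table_entries(detectors, n_bins=4, n_phases=2, n_actions=2))
```

Under these assumptions, doubling the detectors from 4 to 8 raises the table from 1,024 to 262,144 entries per agent, and 16 detectors would require more than 17 billion entries.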

For MADQN, the tabular representation is replaced by a neural network function approximator. Let $P_i$ denote the number of trainable parameters in the DQN model of agent $i$. The online inference complexity of each agent is approximately proportional to the number of network parameters, $O(P_i)$, while the memory requirement includes the neural network parameters, target network parameters, and replay buffer. If the replay buffer stores $B_i$ transitions and each transition has state dimension $d_i$, the memory requirement can be approximated as $O\left(\sum_{i=1}^{N} \left(P_i + B_i d_i\right)\right)$. Compared with Tabular Q-Learning, MADQN has a higher computational cost per decision because neural network inference and training updates are required. However, it is more scalable for larger or continuous state spaces because its memory requirement depends on the neural network architecture and replay-buffer size rather than direct enumeration of all possible state-action combinations. This makes MADQN more appropriate when the number of detectors, state variables, or traffic scenarios increases.
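The per-agent memory estimate can be illustrated for a small fully connected network. The layer sizes, buffer length, and transition layout below are hypothetical and are not the architecture reported in the study; the point is only that the count depends on network and buffer size rather than on enumerating states.

```python
def mlp_param_count(layer_sizes) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def agent_memory_floats(layer_sizes, buffer_size: int, state_dim: int) -> int:
    """Rough count of stored numbers for one DQN agent: P_i terms plus B_i * d_i terms."""
    params = mlp_param_count(layer_sizes)
    # Assumed transition layout: state, next state, action, reward, done flag.
    per_transition = 2 * state_dim + 3
    # Online and target networks each hold a full copy of the parameters.
    return 2 * params + buffer_size * per_transition


layers = [8, 64, 64, 2]  # 8 state features, two hidden layers, maintain/switch outputs
print(mlp_param_count(layers), agent_memory_floats(layers, buffer_size=10_000, state_dim=8))
```

Adding detectors here only widens the input layer and each stored transition, so the growth is linear in the state dimension instead of exponential as in the tabular case.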

The implemented multi-agent framework also has low communication overhead because it does not rely on explicit inter-agent message passing, a centralized critic, value-decomposition network, mixing network, or centralized parameter-update module during execution. Each traffic signal agent observes its local detector-based traffic state and selects its own maintain/switch action. Coordination occurs indirectly through the shared traffic environment: the action of one intersection changes vehicle discharge, downstream arrivals, queue propagation, and spillback conditions, which then affect the future observations of neighboring agents. Therefore, the communication requirement during execution is mainly limited to local detector-to-controller data transfer and the TraCI communication between SUMO and the Python control process.
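The independent execution pattern described above can be sketched without any simulator calls. The class and function names, the minimum-green length, and the two-phase cycle are assumptions for illustration; in the actual framework the observations would come from SUMO detectors via TraCI.

```python
from dataclasses import dataclass

MAINTAIN, SWITCH = 0, 1


@dataclass
class SignalAgent:
    n_phases: int = 2
    min_green: int = 10      # hypothetical minimum-green lock, in decision steps
    phase: int = 0
    time_in_phase: int = 0

    def step(self, requested_action: int) -> int:
        """Apply the requested action subject to the minimum-green lock; return the active phase."""
        if requested_action == SWITCH and self.time_in_phase >= self.min_green:
            self.phase = (self.phase + 1) % self.n_phases
            self.time_in_phase = 0
        else:
            self.time_in_phase += 1
        return self.phase


def step_all(agents, policies, observations):
    """Each agent acts on its own local observation; no messages are exchanged."""
    return {name: agents[name].step(policies[name](observations[name])) for name in agents}


agents = {"J1": SignalAgent(min_green=2), "J2": SignalAgent(min_green=2)}
policies = {name: (lambda obs: SWITCH if obs["queue"] > 5 else MAINTAIN) for name in agents}
print(step_all(agents, policies, {"J1": {"queue": 9}, "J2": {"queue": 1}}))
```

Any coupling between J1 and J2 would arise only through the simulated traffic that each agent later observes, which is the indirect coordination mechanism described in the paragraph above.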

In terms of network-level scaling, if all agents are executed independently and in parallel, the per-step decision cost grows approximately linearly with the number of intersections, O(N), assuming a fixed local state dimension and fixed action space per intersection. If centralized coordination, explicit message passing, or CTDE with a centralized critic were introduced, the communication and computational costs would increase because the controller would need to process joint states, joint actions, or neighboring-agent information. Therefore, the current independent decentralized design favors practical scalability and low communication overheads, although it may sacrifice some coordination quality compared with more communication-intensive cooperative MARL architectures.

Overall, the scalability analysis indicates that fixed-time control is computationally simplest but least adaptive, Tabular Q-Learning is lightweight but limited by state-space growth, and MADQN provides better representational scalability at the cost of higher neural network computation and replay-memory requirements. The present results should, therefore, be interpreted as corridor-level evidence rather than definitive proof of city-scale generalization. Future work should evaluate the proposed framework on larger networks with more intersections and should report empirical runtime, memory consumption, communication latency, and performance degradation as the number of agents increases.

6. Conclusions

This study evaluated value-based multi-agent reinforcement learning (MARL) for adaptive traffic signal control in corridor-level urban traffic networks. Three control strategies were compared under equivalent SUMO–Python–TraCI simulation conditions: conventional fixed-time control, multi-agent Tabular Q-Learning, and multi-agent Deep Q-Network control (MADQN). The evaluation used two complementary testbeds: a controlled synthetic two-intersection corridor and a real-world digital twin of the 25 January Corridor in Assiut, Egypt. Controller performance was assessed under low-, medium-, and high-demand scenarios using queue length and cumulative delay as primary operational indicators, while Time-To-Collision (TTC) was monitored as a complementary surrogate safety measure.

The findings confirm that reinforcement-learning-based adaptive signal control can substantially improve corridor performance compared with fixed-time operation, particularly under fluctuating and asymmetric traffic demand. Fixed-time control remained simple and predictable but showed a limited ability to respond to changing traffic states, resulting in higher queue accumulation and cumulative delay as demand increased. In contrast, both MARL controllers adapted signal decisions to detector-based traffic conditions and provided more effective queue and delay management across the tested scenarios.

The comparison between Tabular Q-Learning and MADQN shows that deeper learning architectures are not universally superior. MADQN generally performed better in low-demand and queue-dissipation conditions, where its neural network approximation supported smoother generalization across continuous traffic-state variations. However, multi-agent Tabular Q-Learning remained highly competitive and achieved stronger delay reductions in several medium- and high-demand cases, especially when the discretized state-action space was compact and recurrent traffic patterns were repeatedly encountered. Therefore, controller selection should be guided by the operational objective, traffic demand regime, network complexity, computational cost, and implementation constraints.

The TTC analysis indicates that safety-related performance must be interpreted together with operational efficiency. Under low-demand conditions, the RL-based controllers generally maintained favorable TTC values and reduced unnecessary stop-and-go interactions. Under medium- and high-demand conditions, TTC values decreased across all controllers as dense vehicle interactions and capacity limitations became dominant. Since TTC was used as a monitoring indicator rather than as a direct reward component or safety constraint, the results should be interpreted as surrogate safety diagnostics rather than evidence of fully safety-constrained optimization. The use of SMDP minimum-green locks also helped maintain realistic signal operations by preventing unstable rapid phase switching.

Overall, this study demonstrates the practical potential of independent value-based MARL for adaptive signal control in both synthetic and real-world corridor environments. Nevertheless, these findings should be interpreted as simulation-based evidence rather than direct field-deployment performance. Future research should strengthen validation through field-calibrated data, including traffic counts, turning movements, queue measurements, travel time observations, signal-controller records, and video- or trajectory-based traffic data. It should also examine the robustness of detector-based control under realistic sensing and communication limitations, such as measurement noise, missing data, occlusion, latency, and detector malfunction. In addition, deployment-oriented studies should address integration with existing signal controllers, regulatory requirements, fail-safe operation, hardware-in-the-loop testing, pilot implementation, and before–after field evaluation.

Future work should also improve the methodological robustness and scalability of the proposed framework. This includes multi-episode and multi-seed training, convergence analysis, statistical significance testing, sensitivity analysis of learning parameters and state-discretization schemes, and evaluation under unseen demand patterns, incident scenarios, and additional real-world corridors. More advanced coordination mechanisms, such as neighbor communication, graph-based representations, shared or difference rewards, and policy-gradient or actor–critic MARL approaches, should be compared with the current independent value-based controllers. Finally, future studies should incorporate TTC directly into safety-aware or constrained reinforcement learning formulations, compare MARL with optimized non-RL baselines such as GA–SUMO or PSO–SUMO signal plans, and assess runtime, memory use, communication overhead, and performance scalability in larger urban networks. These extensions are necessary to bridge the gap between simulation-based learning and safe, reliable real-world deployment.

Author Contributions

Conceptualization, M.O., B.O.M. and I.M.A.; methodology, M.O., B.O.M., A.A.K. and A.S.; software, A.H.M., K.H. and J.E.; validation, M.O., B.O.M., S.T.Y. and I.M.A.; formal analysis, A.A.K., A.S., A.H.M. and K.H.; investigation, A.A.K., A.S., A.H.M., K.H. and J.E.; resources, M.O., S.T.Y., S.A.A. and A.E.A.-H.; data curation, A.A.K., A.S., A.H.M., K.H. and J.E.; writing—original draft preparation, B.O.M., A.A.K. and A.S.; writing—review and editing, M.O., B.O.M., S.T.Y., S.A.A., A.E.A.-H. and I.M.A.; visualization, A.H.M., K.H. and J.E.; supervision, M.O., S.T.Y., S.A.A., A.E.A.-H. and I.M.A.; project administration, M.O., B.O.M. and I.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed TTC Descriptive Statistics

Table A1. Descriptive statistics of Time-To-Collision (TTC) by demand level and control strategy.

Case Study	Demand Level	Control Strategy	N	Mean TTC ± SD (s)	Median TTC (s)	Min TTC (s)	Conflict-Focused Mean TTC Excluding 50 s Default (s)	TTC < 3 s (%)	TTC < 1.5 s (%)	Samples at Default TTC = 50 s (%)
Two-Junction	Low	Fixed-Time	400	34.70 ± 21.25	50.00	0.91	6.28	19.0	12.0	65.0
Two-Junction	Low	Tabular QL	400	44.06 ± 14.66	50.00	0.91	10.43	4.5	2.8	85.0
Two-Junction	Low	DQN	400	44.88 ± 13.98	50.00	0.93	9.06	2.8	2.0	87.5
Two-Junction	Medium	Fixed-Time	400	1.94 ± 6.87	0.94	0.91	0.96	98.0	96.0	2.0
Two-Junction	Medium	Tabular QL	400	2.01 ± 6.87	0.95	0.15	1.03	97.5	92.8	2.0
Two-Junction	Medium	DQN	400	2.02 ± 6.87	0.95	0.50	1.04	97.5	91.5	2.0
Two-Junction	High	Fixed-Time	400	1.53 ± 5.00	0.92	0.90	1.04	98.0	98.0	1.0
Two-Junction	High	Tabular QL	400	1.53 ± 5.00	0.92	0.90	1.04	98.0	98.0	1.0
Two-Junction	High	DQN	400	1.53 ± 5.00	0.93	0.90	1.04	98.0	98.0	1.0
25 January Corridor	Low	Fixed-Time	400	46.74 ± 11.49	50.00	0.93	9.23	4.0	3.0	92.0
25 January Corridor	Low	Tabular QL	400	48.62 ± 7.58	50.00	0.92	10.54	1.5	1.0	96.5
25 January Corridor	Low	DQN	400	48.78 ± 7.32	50.00	0.94	5.65	1.5	1.0	97.2
25 January Corridor	Medium	Fixed-Time	400	10.83 ± 16.63	1.84	0.91	5.99	60.0	44.0	11.0
25 January Corridor	Medium	Tabular QL	400	5.00 ± 11.48	1.20	0.33	2.38	82.0	65.5	5.5
25 January Corridor	Medium	DQN	400	3.58 ± 9.68	1.09	0.72	1.90	90.8	76.2	3.5
25 January Corridor	High	Fixed-Time	400	1.97 ± 6.87	0.96	0.91	0.99	98.0	98.0	2.0
25 January Corridor	High	Tabular QL	400	1.82 ± 6.15	0.94	0.00	1.08	98.0	96.0	1.5
25 January Corridor	High	DQN	400	1.87 ± 6.52	0.94	0.70	1.01	98.0	97.5	1.8

Note. TTC = 50 s represents the nominal safe/default value assigned when no critical vehicle-following interaction was detected; therefore, it should not be interpreted as a collision-risk threshold. The conflict-focused mean TTC was calculated after excluding the 50 s default value. TTC < 3 s was treated as an unsafe traffic conflict, while TTC < 1.5 s was treated as a critical conflict. Values are aggregated within each case study across single- and multi-agent paradigms and TTC objective datasets.
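
The descriptive columns in Table A1 can be reproduced from a list of per-sample TTC values. The following is a minimal Python sketch, assuming the 50 s default and the 3 s and 1.5 s thresholds stated in the note. The function name `summarize_ttc` and the sample values are hypothetical and are not taken from the study's scripts. The sketch uses the sample standard deviation, which is also an assumption, since the note does not state which estimator was used.

```python
from statistics import mean, median, stdev

DEFAULT_TTC = 50.0   # nominal value assigned when no critical interaction is detected
UNSAFE_TTC = 3.0     # TTC below this is treated as an unsafe conflict
CRITICAL_TTC = 1.5   # TTC below this is treated as a critical conflict


def summarize_ttc(samples):
    """Return Table A1-style descriptive statistics for a list of TTC values (s)."""
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    # Conflict-focused subset: drop samples that carry the 50 s default.
    conflicts = [t for t in samples if t != DEFAULT_TTC]
    return {
        "n": n,
        "mean": mean(samples),
        "sd": stdev(samples) if n > 1 else 0.0,  # sample SD (assumed estimator)
        "median": median(samples),
        "min": min(samples),
        "conflict_mean": mean(conflicts) if conflicts else None,
        "pct_below_3s": 100.0 * sum(t < UNSAFE_TTC for t in samples) / n,
        "pct_below_1_5s": 100.0 * sum(t < CRITICAL_TTC for t in samples) / n,
        "pct_default": 100.0 * (n - len(conflicts)) / n,
    }


# Hypothetical sample: six default readings and four detected interactions.
example = [50.0] * 6 + [0.9, 1.2, 2.5, 8.0]
stats = summarize_ttc(example)
```

When most samples sit at the default, as in the low-demand rows, the overall mean and median are dominated by 50 s, which is why the table reports the conflict-focused mean separately.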

References

1. Eom, M.; Kim, B.-I. The traffic signal control problem for intersections: A review. Eur. Transp. Res. Rev. 2020, 12, 50.
2. El-Tantawy, S.; Abdulhai, B.; Abdelgawad, H. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto. IEEE Trans. Intell. Transp. Syst. 2013, 14, 1140–1150.
3. Kumar, R.; Sharma, N.V.K.; Chaurasiya, V.K. Adaptive traffic light control using deep reinforcement learning technique. Multimed. Tools Appl. 2024, 83, 13851–13872.
4. Elharoun, M.; El-Badawy, S.M.; Shwaly, E.A.-E.; Shahdah, U.E. Adaptive traffic signal control using deep reinforcement learning: A multi-objective approach for single and multi-intersection scenarios. IATSS Res. 2025, 49, 481–492.
5. Maheshwari, H.; Yang, L.; Pazzi, R.W. Traffic intersection simulation using turning movement count data in SUMO: A case study of Toronto intersections. In Proceedings of the 2025 21st International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT); IEEE: Piscataway, NJ, USA, 2025; pp. 1–8.
6. Wang, L.; Zhang, G.; Yang, Q.; Han, T. An adaptive traffic signal control scheme with Proximal Policy Optimization based on deep reinforcement learning for a single intersection. Eng. Appl. Artif. Intell. 2025, 149, 110440.
7. Haddad, T.A.; Hedjazi, D.; Aouag, S. A deep reinforcement learning-based cooperative approach for multi-intersection traffic signal control. Eng. Appl. Artif. Intell. 2022, 114, 105019.
8. Guzmán, J.A.; Pizarro, G.; Núñez, F. A reinforcement learning-based distributed control scheme for cooperative intersection traffic control. IEEE Access 2023, 11, 57037–57045.
9. Yang, S.; Yang, B.; Zeng, Z.; Kang, Z. Causal inference multi-agent reinforcement learning for traffic signal control. Inf. Fusion 2023, 94, 243–256.
10. Gutiérrez-Moreno, R.; Barea, R.; López-Guillén, E.; Araluce, J.; Bergasa, L.M. Reinforcement learning-based autonomous driving at intersections in CARLA simulator. Sensors 2022, 22, 8373.
11. Donatus, R.; Ter, K.; Ajayi, O.-O.; Udekwe, D. Multi-agent reinforcement learning in intelligent transportation systems: A comprehensive survey. arXiv 2025, arXiv:2508.20315.
12. Qadri, S.S.S.M.; Almusawi, A.; Albdairi, M.; Esirgün, E. Optimizing Traffic Signal Timing at Urban Intersections: A Simheuristic Approach Using GA and SUMO. In Proceedings of the 2024 Innovations in Intelligent Systems and Applications Conference (ASYU), Ankara, Turkiye, 16–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6.
13. Krajzewicz, D.; Erdmann, J. Road intersection model in SUMO. In Proceedings of the 1st SUMO User Conference-SUMO 2013, Berlin, Germany, 15–17 May 2013; pp. 212–220.
14. Koh, S.S.; Zhou, B.; Yang, P.; Yang, Z.; Fang, H.; Feng, J. Reinforcement learning for vehicle route optimization in SUMO. In Proceedings of the 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, UK, 28–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1468–1473.
15. Shamim Akhter, M.; Quaderi, S.J.S.; Al Forhad, M.A.; Sumit, S.H.; Rahman, M.R. A SUMO based simulation framework for intelligent traffic management system. J. Traffic Logist. Eng. 2020, 8, 1–5.
16. Dobrilko, O.; Bublil, A. Leveraging SUMO for real-world traffic optimization: A comprehensive approach. In Proceedings of the SUMO Conference Proceedings, Berlin, Germany, 13–15 May 2024; pp. 179–194.
17. Guo, M.; Wang, P.; Chan, C.-Y.; Askary, S. A reinforcement learning approach for intelligent traffic signal control at urban intersections. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4242–4247.
18. Dhulkefl, E.J.; Abdulsattar, A.W.; Khudhur, Z.M.; Mahmood, T.A. Design of a hybrid intelligent traffic signal control system using nearest neighbor algorithm and deep reinforcement learning with SUMO simulator. J. Res. Eng. Comput. Sci. 2025, 3, 31–40.
19. Cao, K.; Wang, L.; Zhang, S.; Duan, L.; Jiang, G.; Sfarra, S.; Zhang, H.; Jung, H. Optimization control of adaptive traffic signal with deep reinforcement learning. Electronics 2024, 13, 198.
20. Song, L.; Fan, W. Traffic signal control under mixed traffic with connected and automated vehicles: A transfer-based deep reinforcement learning approach. IEEE Access 2021, 9, 145228–145237.
21. Garg, D.; Chli, M.; Vogiatzis, G. Deep reinforcement learning for autonomous traffic light control. In Proceedings of the 2018 3rd IEEE International Conference on Intelligent Transportation Engineering (ICITE), Singapore, 3–5 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 214–218.
22. Quang Tran, D.; Bae, S.-H. Proximal policy optimization through a deep reinforcement learning framework for multiple autonomous vehicles at a non-signalized intersection. Appl. Sci. 2020, 10, 5722.
23. Fernandes, P.; Nunes, U. Platooning of autonomous vehicles with intervehicle communications in SUMO traffic simulator. In Proceedings of the 13th International IEEE Conference on Intelligent Transportation Systems, Funchal, Portugal, 19–22 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1313–1318.
24. Kolat, M.; Kővári, B.; Bécsi, T.; Aradi, S. Multi-agent reinforcement learning for traffic signal control: A cooperative approach. Sustainability 2023, 15, 3479.
25. Kim, H.; Park, J.; Kim, D.; Jun, C. Cooperative control of intersection traffic signals based on multi-agent reinforcement learning for carbon dioxide emission reduction. IEEE Access 2025, 13, 33485–33495.
26. Barhoumi, O.; Zaki, M.H.; Tahar, S. Formally Constrained Reinforcement Learning for Traffic Signal Control at Intersections. In Proceedings of the 2025 IEEE International Systems Conference (SysCon), Montreal, QC, Canada, 7–10 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–8.
27. Singh, S.K.; Komolkiti, P.; Aswakul, C. Impact analysis of start-up lost time at major intersections on Sathorn Road using a Synchro optimization and a microscopic SUMO traffic simulation. IEEE Access 2017, 6, 6327–6340.
28. Transportation Research Board. Highway Capacity Manual, 6th ed.: A Guide for Multimodal Mobility Analysis; Transportation Research Board: Washington, DC, USA, 2016.
29. May, A.D. Traffic Flow Fundamentals; Prentice-Hall: Englewood Cliffs, NJ, USA, 1990.
30. Lopez, P.A.; Behrisch, M.; Bieker-Walz, L.; Erdmann, J.; Flötteröd, Y.-P.; Hilbrich, R.; Lücken, L.; Rummel, J.; Wagner, P.; Wießner, E. Microscopic traffic simulation using SUMO. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2575–2582.
31. Chowdhury, M.M.H.; Chakraborty, T. Calibration of SUMO microscopic simulation for heterogeneous traffic condition: The case of the city of Khulna, Bangladesh. Transp. Eng. 2024, 18, 100281.
32. Webster, F.V. Traffic Signal Settings; H.M. Stationery Office: London, UK, 1958.
33. Ashton, W.D. Gap-acceptance problems at a traffic intersection. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1971, 20, 130–138.
34. Owais, M.; Abulwafa, O.; Abbas, Y.A. When to decide to convert a roundabout to a signalized intersection: Simulation approach for case studies in Jeddah and Al-Madinah. Arab. J. Sci. Eng. 2020, 45, 7897–7914.

Figure 1. Microscopic simulation view of the synthetic two-junction network in SUMO, utilized for baseline multi-agent coordination testing. Arrows indicate traffic flow directions, node labels identify the controlled intersections, detector labels indicate the locations of the 24 lane-area detectors, and shaded/dashed boundaries indicate the signal control area of each junction.

Figure 2. (a) Satellite imagery of the “25 January” arterial corridor in Assiut, Egypt, illustrating the complex, asymmetric geometry of the real-world network; (b) digital-twin representation of the “25 January” corridor in the SUMO environment, capturing realistic turning lane configurations and detector placements. Arrows indicate traffic flow directions, intersection labels identify the controlled junctions, detector labels show the positions of the 16 lane-area detectors, and shaded/dashed boundaries indicate the signal control area of each intersection.

Figure 3. Unified training workflow for SUMO-based adaptive traffic signal control.

Figure 4. Single-agent paradigm for the low-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 5. Multi-agent paradigm for the low-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 6. Single-agent paradigm for the medium-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 7. Multi-agent paradigm for the medium-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 8. Single-agent paradigm for the high-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 9. Multi-agent paradigm for the high-demand scenario at the two-junction case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 10. Single-agent paradigm for the low-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 11. Multi-agent paradigm for the low-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 12. Single-agent paradigm for the medium-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 13. Multi-agent paradigm for the medium-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 14. Single-agent paradigm for the high-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Figure 15. Multi-agent paradigm for the high-demand scenario at the 25 January Corridor case study: (a) instantaneous per-step delay; (b) cumulative delay; (c) instantaneous per-step queue length; (d) cumulative queue accumulation.

Table 1. Comparison of representative recent studies and the novelty of the proposed work.

Study	Control Approach	Test Environment	Main Indicators	Main Limitation	Relevance to the Present Study
MARLIN-ATSC study [2]	Multi-agent reinforcement learning	Urban signalized network	Network-level traffic performance	Earlier MARL framework; limited comparison with modern deep-value-based methods	Supports the relevance of decentralized learning for adaptive signal control
SUMO–TraCI Q-learning study [17]	Q-learning/neural network Q-learning	Single SUMO intersection	Queue, waiting time, traffic imbalance scenarios	Limited to isolated intersection control	Motivates extension toward corridor-level multi-agent control
KNN–DQN study [18]	Hybrid KNN–DQN	SUMO intersection scenario	Waiting time, stop frequency, flow rate	Focused on a hybrid single-controller design	Provides evidence for DQN-based adaptive control but not tabular-deep MARL comparison
Grouping-DQN study [19]	DQN with spatial state grouping	Signalized intersection simulation	Queue length, waiting time	High-resolution state representation; mainly isolated intersection focus	Contrasts with the compact detector-based state representation used in this study
PPO-TSC study [6]	Proximal Policy Optimization	Real-world single intersection in SUMO	Travel time, time loss, average speed	Policy-gradient framework; not focused on value-based MARL comparison	Supports the use of queue/waiting-time features for adaptive control
Cooperative DQN MARL study [7]	Cooperative multi-agent DQN	2 × 2 and 2 × 3 SUMO grid networks	Average waiting time, queue length	Grid-based synthetic networks; limited real-world corridor validation	Supports cooperative MARL but differs from the corridor and digital-twin focus of this study
Cooperative MARL emission study [25]	Multi-intersection DQN	Real-world six-intersection SUMO corridor	Waiting time, CO₂ emissions	Focused mainly on cooperative DQN and emissions	Shows the value of real-world corridor testing but does not compare tabular and deep-value-based MARL
Multi-objective DRL study [4]	PPO, A2C, and DQN	Synthetic and real-world intersections	Delay, conflicts, emissions, TTC	Mainly compares DRL algorithms rather than tabular versus deep MARL	Supports the inclusion of TTC and multi-objective evaluation indicators
Constrained RL safety study [26]	Formally constrained RL	Simulation-based intersection control	TTC and safety constraints	Safety-oriented framework rather than comparative corridor-level MARL	Supports TTC as a complementary safety-related indicator
Present study	Fixed-time, multi-agent Tabular Q-Learning, and MADQN	Synthetic two-intersection corridor and real-world 25 January Corridor digital twin	Queue length, cumulative delay, TTC	Simulation-based evaluation requiring future field validation	Provides direct tabular-deep-value-based MARL comparison under equivalent demand scenarios and corridor conditions

Table 2. Summary of the main simulation, calibration, and learning settings used in the experimental framework.

Category	Setting	Value/Description
Simulation platform	Microscopic simulator	SUMO integrated with Python through TraCI
Network configurations	Synthetic network	Two-intersection synthetic corridor
	Real-world network	Digital twin of the 25 January Corridor, Assiut, Egypt
Controlled intersections	Synthetic network	Two signalized intersections
	25 January Corridor	Two signalized intersections representing the northern and southern corridor junctions
Detector configuration	Synthetic network	24 lane-area detectors
	25 January Corridor	16 lane-area detectors
Detector length	Lane-area detector length	60 m on incoming approaches
Simulation step	Microscopic simulation step length	0.10 s
Simulation horizon	Duration of each controlled online run	10,000 simulation steps
Traffic arrival process	Vehicle generation	Poisson arrival process
Demand levels	Low demand	Approximately 42 vehicles/h/lane
	Medium demand	Approximately 350 vehicles/h/lane
	High demand	Approximately 840–1400 vehicles/h/lane, depending on network topology
Turning ratios	Synthetic network	70% straight, 20% left turn, 10% right turn
	25 January Corridor	82% straight, 12% major turn, 6% minor turn
Vehicle parameters	Maximum speed	13.9 m/s, approximately 50 km/h
	Acceleration	2.6 m/s²
	Deceleration	4.5 m/s²
	Driver imperfection	Sigma = 0.5
State representation	Tabular Q-Learning	Discretized queue-density categories combined with the current signal phase
	MADQN	Normalized detector-based queue counts combined with the current signal phase
Tabular discretization	Queue bins	0 vehicles, 1–9 vehicles, 10–18 vehicles, and more than 18 vehicles
Action space	Signal control action	Binary action: maintain current phase or switch to the next feasible phase
Signal realism constraint	SMDP minimum-green lock	4–12 s in the synthetic network; 3–6 s in the 25 January Corridor, depending on controller and objective
Reward objectives	Queue objective	Penalty based on the total halted vehicles
Reward objectives	Delay objective	Penalty based on accumulated waiting time
Safety-related indicator	TTC	Time-To-Collision is monitored as a complementary safety-related metric
Fixed-time baseline	Synthetic network	180 s cycle at each intersection; four 42 s green phases, each followed by 3 s yellow phases
Fixed-time baseline	25 January Corridor	Independent fixed-time controllers with 111 s and 109 s cycles
Training mode	RL/MARL controllers	Fully online learning during the same simulation horizon
Exploration policy	Tabular Q-Learning and MADQN	Epsilon-greedy policy
Exploration settings	Initial and minimum epsilon	Epsilon initialized at 1.0 and decayed toward 0.1
MADQN replay memory	Replay buffer	Independent replay buffer for each intersection agent
MADQN replay memory	Maximum buffer capacity	200,000 transitions
MADQN mini-batch	Batch size	128 samples
MADQN target network	Target update interval	Every 200 training steps
MADQN target network	Soft-update coefficient	Tau = 0.01 for delay-oriented runs and tau = 0.002 for queue-oriented runs
MADQN optimizer	Optimizer and loss	Adam optimizer with Huber loss
MADQN learning rate	Learning rate	0.0002 in the final MADQN scripts
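
The tabular state and the binary action in Table 2 can be illustrated with a short Python sketch. It assumes the four queue bins listed above and a minimum-green lock expressed in simulation steps. The names `queue_bin`, `encode_state`, and `MinGreenLock` are hypothetical, the lock logic is one plausible reading of the constraint rather than the study's implementation, and the sketch does not call SUMO or TraCI.

```python
MAINTAIN, SWITCH = 0, 1  # binary action space from Table 2


def queue_bin(vehicles):
    """Map a detector queue count to one of the four bins in Table 2."""
    if vehicles < 0:
        raise ValueError("queue count must be non-negative")
    if vehicles == 0:
        return 0          # 0 vehicles
    if vehicles <= 9:
        return 1          # 1-9 vehicles
    if vehicles <= 18:
        return 2          # 10-18 vehicles
    return 3              # more than 18 vehicles


def encode_state(queue_counts, phase):
    """Build a hashable tabular state: one bin per detector plus the current phase."""
    return tuple(queue_bin(q) for q in queue_counts) + (phase,)


class MinGreenLock:
    """Suppress SWITCH requests until the current phase has run for min_green_steps."""

    def __init__(self, min_green_steps):
        self.min_green_steps = min_green_steps
        self.steps_in_phase = 0

    def filter(self, requested_action):
        """Return the action actually applied at this step and update the timer."""
        if requested_action == SWITCH and self.steps_in_phase >= self.min_green_steps:
            self.steps_in_phase = 0
            return SWITCH
        self.steps_in_phase += 1
        return MAINTAIN


# A 4 s lock at a 0.10 s step length corresponds to 40 steps.
lock = MinGreenLock(min_green_steps=40)
applied = [lock.filter(SWITCH) for _ in range(41)]
state = encode_state([0, 5, 12, 25], phase=2)
```

Returning the state as a tuple keeps it usable as a dictionary key, which is the usual way to index a Q-table in plain Python.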

Table 3. Fixed-time baseline configuration used in the simulation experiments.

Network	Intersection	Cycle Length	Phase Timing Structure	Coordination Logic	Webster Optimization
Synthetic two-junction network	Node2	180 s	Four 42 s green phases, each followed by 3 s yellow phase	Offset = 0 s; identical cycle structure; no explicit progression offset	Not used
Synthetic two-junction network	Node3	180 s	Four 42 s green phases, each followed by 3 s yellow phase	Offset = 0 s; identical cycle structure; no explicit progression offset	Not used
25 January Corridor	Southern intersection, clusterJ11_J14_J16_J17#3more	111 s	Ten intervals: 22, 3, 6, 4, 23, 3, 1, 23, 3, and 2 s	Isolated/independent fixed-time control; no common cycle with northern intersection	Not used
25 January Corridor	Northern intersection, clusterJ15_J27_J28_J31#1more	109 s	Six intervals: 40, 3, 20, 3, 20, and 3 s	Isolated/independent fixed-time control; no common cycle with southern intersection	Not used
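
As a quick arithmetic check on Table 3, the listed intervals can be summed and compared with the stated cycle lengths. The Python sketch below uses only the values in the table; the dictionary keys are shorthand labels introduced here. For the synthetic network the sum matches the 180 s cycle. For the two corridor intersections the itemized intervals sum to 90 s and 89 s, which is less than the stated 111 s and 109 s. The table does not say whether the difference corresponds to intervals that are not itemized, so the sketch only reports the gap.

```python
# Interval durations (s) as listed in Table 3; keys are shorthand labels.
plans = {
    "synthetic_node": {"stated_cycle": 180, "intervals": [42, 3] * 4},
    "corridor_south": {"stated_cycle": 111,
                       "intervals": [22, 3, 6, 4, 23, 3, 1, 23, 3, 2]},
    "corridor_north": {"stated_cycle": 109,
                       "intervals": [40, 3, 20, 3, 20, 3]},
}


def cycle_gap(plan):
    """Stated cycle length minus the sum of itemized intervals, in seconds."""
    return plan["stated_cycle"] - sum(plan["intervals"])


gaps = {name: cycle_gap(plan) for name, plan in plans.items()}
for name, gap in gaps.items():
    total = sum(plans[name]["intervals"])
    print(f"{name}: itemized sum = {total} s, gap = {gap} s")
```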

Table 4. DQN/MADQN implementation parameters.

Parameter	Value Used in the Implemented Scripts	Purpose
Simulation horizon	10,000 steps	Defines the online learning/evaluation duration
SUMO step length	0.10 s	Microscopic simulation resolution
Training episodes	1 online episode per demand scenario	Fully online learning during one simulation horizon
Action space	0 = maintain phase, 1 = switch phase	Keeps action control simple and physically interpretable
Replay memory, MADQN	200,000 transitions per agent	Stores past experiences and reduces temporal correlation
Replay memory, single-agent DQN	50,000–200,000 transitions	Buffer size varied between single-agent DQN delay and queue experiments
Mini-batch size	128 samples in MADQN; 64–128 across DQN scripts	Defines the number of samples used per gradient update
Target network update frequency	Every 200 steps	Stabilizes Q-value target estimation
Soft target update coefficient	τ = 0.01 for delay; τ = 0.002 for queue	Smoothly transfers online network weights to the target network
Optimizer	Adam	Adaptive gradient optimization
Loss function	Huber loss	Reduces sensitivity to unstable temporal-difference errors
Learning rate	0.0002 in final MADQN scripts; 0.0002–0.0005 across DQN scripts	Controls neural network weight updates
Discount factor	0.99 for most DQN/MADQN runs	Prioritizes long-term queue/delay reductions
Exploration policy	ε-greedy	Balances exploration and exploitation
ε schedule	1.0 to 0.1	Starts with broad exploration and gradually shifts toward exploitation
SMDP lock, synthetic delay objective	40 steps = 4 s	Prevents rapid switching
SMDP lock, synthetic queue objective	120 steps = 12 s	Prevents rapid switching
SMDP lock, 25 January Corridor	30–60 steps = 3–6 s	Objective-dependent phase-switching constraint
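
Two of the entries in Table 4, the Huber loss and the soft target update, can be written out in a few lines. The sketch below is plain Python rather than the study's neural-network code: weights are flat lists of floats, `delta = 1.0` in the Huber loss is an assumed threshold that the table does not specify, and the function names and numeric inputs are hypothetical.

```python
def huber(td_error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    a = abs(td_error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step DQN target: r + gamma * max over next-state target Q-values."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def soft_update(target_weights, online_weights, tau):
    """Blend online weights into the target network: (1 - tau) * target + tau * online."""
    if len(target_weights) != len(online_weights):
        raise ValueError("weight vectors must have the same length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_weights, online_weights)]


# Hypothetical numbers: the reward is a negative waiting-time penalty and the
# two next-state Q-values correspond to the maintain and switch actions.
target = td_target(reward=-2.0, next_q_values=[-10.0, -8.0])
loss = huber(target - (-9.0))  # online estimate Q(s, a) assumed to be -9.0
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```

With τ = 0.01 the target weights move 1% of the way toward the online weights per update, and with τ = 0.002 they move 0.2%, which matches the table's description of a smooth transfer.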

Table 5. Delay and queue values for fixed-time, Tabular Q-Learning, and DQN control on the synthetic two-junction network.

Agent Paradigm	Demand Level	Reward Objective	Fixed-Time	Tabular QL	DQN
Multi Agent	High Demand	Delay (s)	4613.5	338.9	321.8
		Queue	103	92	103
	Medium Demand	Delay (s)	1386	28.2	28.6
		Queue	37	15	12
	Low Demand	Delay (s)	62.7	0.4	0.4
		Queue	1	1	1
Single Agent	High Demand	Delay (s)	4613.5	329.6	302.7
		Queue	103	88	101
	Medium Demand	Delay (s)	1386	28.7	46.6
		Queue	37	11	12
	Low Demand	Delay (s)	62.7	0.6	0.4
		Queue	1	1	1
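
The size of the improvement over fixed-time control is easier to compare across demand levels as a relative reduction. The following Python sketch computes the percentage delay reduction from the multi-agent rows of Table 5. The helper name `pct_reduction` is hypothetical, and the figures are copied from the table rather than recomputed from simulation output.

```python
# Multi-agent delay values (s) from Table 5: (fixed-time, Tabular QL, DQN).
delay = {
    "high": (4613.5, 338.9, 321.8),
    "medium": (1386.0, 28.2, 28.6),
    "low": (62.7, 0.4, 0.4),
}


def pct_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - value) / baseline


reductions = {
    level: {
        "tabular_ql": pct_reduction(fixed, tabular),
        "dqn": pct_reduction(fixed, dqn),
    }
    for level, (fixed, tabular, dqn) in delay.items()
}

for level, r in reductions.items():
    print(f"{level}: Tabular QL {r['tabular_ql']:.1f}%, DQN {r['dqn']:.1f}%")
```

In these rows both learning controllers remove more than 90% of the fixed-time delay at every demand level, and the difference between Tabular QL and DQN is small by comparison.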

Table 6. Delay and queue values for fixed-time, Tabular Q-Learning, and DQN control on the 25 January Corridor network in Assiut.

Agent Paradigm	Demand Level	Reward Objective	Fixed-Time	Tabular QL	DQN
Multi Agent	High Demand	Delay (s)	132.1	9.2	18.5
		Queue	7	7	8
	Medium Demand	Delay (s)	0.1	0.7	0.7
		Queue	1	1	1
	Low Demand	Delay (s)	1.7	0.8	1.3
		Queue	1	1	1
Single Agent	High Demand	Delay (s)	132.1	9.1	18.6
		Queue	7	9	7
	Medium Demand	Delay (s)	0.1	0.1	0.4
		Queue	1	1	1
	Low Demand	Delay (s)	1.7	0.6	0.4
		Queue	1	1	1

Table 7. Comparative summary of traffic signal control strategies.

Evaluation Aspect	Fixed-Time Control	Tabular Q-Learning	DQN/MADQN
Control Logic	Predefined signal cycle	Learns adaptive actions from discretized traffic states	Learns adaptive actions from continuous/normalized traffic states
Delay Performance	Highest delay under variable demand	Strong delay reduction; often competitive under medium and high demand	Strong delay reduction; more stable in complex and dynamic conditions
Queue Performance	Queues accumulate due to static timing	Reduces queues through state-based phase decisions	Generally improves queue dissipation, especially in multi-agent settings
Adaptability	Low	Moderate	High
Scalability	Limited to predictable traffic patterns	Suitable for simple or moderately complex networks	More suitable for complex, high-volume, or multi-intersection networks
Main Advantage	Simple and easy to implement	Computationally light and interpretable	Better generalization and smoother learning behavior
Recommended Use	Stable, low-variation traffic conditions	Simple intersections or limited state spaces	Complex corridors and adaptive multi-junction control

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Owais, M.; Mohammed, B.O.; Kamal, A.A.; Shaban, A.; Mostafa, A.H.; Hatem, K.; Emad, J.; Younis, S.T.; Ali, S.A.; Abdel-Hakim, A.E.; et al. Adaptive Traffic Signal Control Using Multi-Agent Reinforcement Learning: A Comparison of Control Strategies. Sustainability 2026, 18, 5702. https://doi.org/10.3390/su18115702
