Article

Research on Intelligent Traffic Signal Control Based on Multi-Agent Deep Reinforcement Learning

1 College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China
2 Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang 110142, China
3 “Industrial Internet + Hazardous Chemicals Safety Production” Key Laboratory, Ministry of Emergency Management, Beijing 100054, China
4 China Academy of Industrial Internet, Beijing 100102, China
5 Computer Engineering Department, Paichai University, 155-40 Baejae-ro, Daejeon 35345, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 149; https://doi.org/10.3390/math14010149
Submission received: 21 October 2025 / Revised: 20 December 2025 / Accepted: 28 December 2025 / Published: 30 December 2025

Abstract

Although Adaptive Traffic Signal Control (ATSC) can alleviate congestion issues to some extent in traditional signal control systems, it still faces challenges in dealing with complex and dynamic traffic environments, such as difficulties in agent coordination, high computational complexity, and unstable optimization results. To address these challenges, this paper proposes a multi-agent deep reinforcement learning algorithm based on SENet, called SE-A3C. The SE-A3C algorithm enhances the feature extraction capability and adaptability of the neural network by introducing the Squeeze-and-Excitation (SE) module from SENet. This allows the model to focus more precisely on high-information features and capture interdependencies between different channels, thereby improving the model’s discriminative ability and decision-making performance. Additionally, the algorithm incorporates Nash equilibrium concepts to maintain a relative balance among agents during coordinated control, avoiding suboptimal competition between agents and significantly improving system stability and efficiency. Experimental results show that, compared to traditional A3C, DQN, and Ape-X algorithms, the SE-A3C algorithm significantly improves the efficiency of traffic signal control and the overall throughput of traffic flow in complex traffic scenarios.

1. Introduction

With the increase in global vehicle ownership, traffic congestion and pollution have become significant barriers to sustainable development. Intersections are the primary locations where traffic jams, accidents, and pollution emissions escalate [1]. Therefore, efficient traffic signal control is critical for improving urban traffic efficiency and reducing environmental pollution. Urban traffic signal control is essentially a sequential decision-making problem, making reinforcement learning (RL) methods well-suited for addressing such challenges. RL-based Adaptive Traffic Signal Control (ATSC) systems treat intersections as agents that adaptively adjust traffic signals based on the current traffic conditions (such as the number of vehicles on a lane, average speed, and average waiting time). Leveraging the advanced perception capabilities of deep neural networks, RL excels in learning and interpreting high-dimensional and continuous traffic features, enabling the system to autonomously adjust and optimize signal timing plans, thereby reducing the average vehicle waiting time and enhancing road capacity [2]. However, RL-based ATSC systems must process vast amounts of real-time data and rapidly compute optimal solutions, resulting in extremely high computational complexity. As such, developing efficient and stable optimization algorithms remains a significant challenge [3].
Currently, research on reinforcement learning-based intelligent traffic signal control systems remains primarily focused on single intersections. These systems typically use single-agent reinforcement learning approaches, where the agent observes traffic state information and executes the optimal traffic signal strategy for the current state, often neglecting the interactions between different intersections [4,5,6]. However, urban traffic systems generally operate in a regional context. To address this limitation, researchers have proposed multi-agent reinforcement learning (MARL) approaches [7,8], which aim to solve the coordination challenges between multiple intersections. Efficiently coordinating multiple intersections to improve travel efficiency has become a key focus and challenge in current applied research. Furthermore, the design of simulation experiments for some real-world traffic areas tends to be overly simplistic, significantly reducing the reliability of the experimental results. This limitation has motivated our research.
To address the aforementioned issues, this paper proposes the SE-A3C algorithm. This algorithm builds upon the Asynchronous Advantage Actor–Critic (A3C) framework, incorporating the concept of Nash equilibrium to maintain relative balance among agents. Additionally, the algorithm introduces an attention mechanism through the SE module from the Squeeze-and-Excitation Network (SENet), which adaptively recalibrates channel features, further enhancing the performance of the neural network. The main contributions of this paper are as follows:
  • Redefine the reinforcement learning model in the context of multi-agent signal optimization, incorporating the concepts of clustered sequences and Nash equilibrium to improve the design of states and rewards, thereby enhancing model performance.
  • Leverage an attention mechanism to enhance the neural network’s adaptability to feature channels, improving overall model performance.
  • Utilize the SUMO 1.14.1 simulation software to create various intersection scenarios, selecting the average vehicle queue length and cumulative queue time as performance evaluation metrics. Vehicle generation is modeled using the Weibull distribution to better reflect real-world traffic conditions.

2. Related Work

With the rapid development of Deep Reinforcement Learning (DRL), Intelligent Transportation Systems (ITS) have seen significant optimization and enhancement. DRL has had a profound impact on smart transportation, with widespread applications in traffic management, autonomous driving, and traffic prediction, significantly improving the efficiency, safety, and sustainability of transportation systems. For instance, the successful application of the BERT-based deep spatio-temporal network proposed by Cao et al. [9] in taxi demand prediction effectively demonstrates the strong potential of deep learning models in handling complex spatio-temporal problems, establishing a solid foundation for applying advanced neural network architectures to other traffic control tasks. Adaptive Traffic Signal Control (ATSC) is a key component of ITS, and its optimization problem is fundamentally a sequential decision-making challenge, making it highly suitable for solving through reinforcement learning to effectively reduce traffic congestion [10].
Research on ATSC has gradually shifted from single-intersection signal optimization to multi-intersection signal optimization [11], expanding solutions from single-agent to multi-agent scenarios. Cao et al. provided systematic theoretical support, offering practical application value for regional traffic signal optimization [12]. Building on this, Bokade et al. [13] proposed a novel communication protocol framework based on the Centralized Training and Decentralized Execution (CTDE) paradigm, which reduces the system’s additional communication overhead, thereby improving coordination efficiency among multiple agents. It is noteworthy that, alongside reinforcement learning-based approaches, several cutting-edge studies have explored fundamentally distinct technical pathways. For instance, Naderi et al. [14] proposed a Model Predictive Control (MPC)-based solution for lane-free and signal-free intersection crossing, while Malekzadeh et al. [15] investigated optimal internal boundary control for automated vehicles in lane-free environments. While these MPC-based approaches have demonstrated exceptional performance, their effectiveness heavily depends on precise system models and involves computationally intensive online optimization, thereby facing significant challenges in computational efficiency and scalability within large-scale networks.
As research progresses, scholars continue to explore the potential of multi-agent approaches in urban traffic management and attempt to optimize algorithms to tackle complex traffic signal control problems. In addition to optimizing intersection signal control, some researchers focus on more complex traffic environments. Mei et al. [16] noted that current research predominantly focuses on intersections within cities, with limited attention given to the impact of rail information. As a result, they refined the reward function coefficients and used SUMO simulation software to construct urban and rail intersections. Through simulations with two deep reinforcement learning algorithms, they demonstrated the effectiveness of deep reinforcement learning in such contexts.
Additionally, other researchers have proposed innovative methods to optimize traffic signal control. Zheng et al. [17] incorporated multi-agent characteristics into the Deep Q-Network (DQN) algorithm, effectively reducing the average travel waiting time and total queue length in a region, thus improving traffic efficiency. Zhang et al. [18] introduced a multi-agent deep reinforcement learning algorithm based on the Proximal Policy Optimization (PPO) algorithm, incorporating a parameter-sharing mechanism and lead-lag phase sequences. Experimental results demonstrated that this method outperformed existing deep reinforcement learning algorithms. Hassan et al. [19] also improved the DQN algorithm by sharing state information among agents to reduce traffic signal waiting times, while focusing on new features such as the signal state of the previous intersection, the distance between intersections, and the average speed between them, thereby enhancing the algorithm’s performance. Zhao et al. [20] integrated the design of traffic flow direction into the ATSC problem based on deep reinforcement learning, and this new extension helped with heuristic exploration of the design space. They further extended this design to the DQN algorithm, proposing an optimized DQN method capable of exploring diverse feasible design schemes and eventually converging to the optimal combination of traffic flow direction design and traffic signal control strategies. Zhu et al. [21] focused on optimizing traffic efficiency, global optimality, and convergence stability in ATSC problems, proposing a recommendation model based on the states and relative positions of neighboring agents, which provides action selection suggestions to other agents. Experiments showed that this method significantly improved performance across multiple evaluation metrics. Shen et al. [22] observed that decision-making among agents in previous multi-intersection ATSC studies was difficult to balance, and thus introduced a Nash equilibrium strategy to constrain agent decisions, accelerating the network’s convergence and improving efficiency. Experimental results demonstrated that this method is better suited for optimizing signals in multi-intersection scenarios. Wang et al. [23] pointed out that existing studies overlooked the impact of vehicles near intersections on traffic flow. They introduced a Convolutional Block Attention Module (CBAM) into the Dueling Double Deep Q Network (D3QN). Comparative experimental results showed that this algorithm effectively improved traffic efficiency.
As can be seen from the aforementioned studies, current multi-agent deep reinforcement learning still has limitations in the definition of the three key elements of reinforcement learning. There is also room for improvement in action selection strategies. In terms of model dependency, while MPC-based methods [14,15] demonstrate excellent control performance, they are heavily reliant on precise mathematical models and struggle to adapt to complex and dynamic real-world traffic environments. In contrast, existing DRL methods have reduced model dependency but remain insufficient in state representation and feature extraction. Regarding multi-agent coordination mechanisms, although Shen et al. [22] introduced a Nash equilibrium strategy to coordinate multi-agent decisions, they only applied it as an external constraint without deeply integrating it into the agents’ learning mechanisms, thereby limiting further improvements in coordination efficiency. In network architecture design, the convolutional block attention module incorporated by Wang et al. [23] in D3QN primarily focuses on spatial dimensions, lacking adequate modeling of inter-channel relationships in features, which hinders the full exploitation of critical information in traffic states. Building on previous research, this paper redefines the reinforcement learning model in the context of multi-agent signal optimization. By introducing clustered sequences and Nash equilibrium concepts, we redesign the state and reward elements of the reinforcement learning framework to more accurately reflect the traffic environment of multiple intersections, thereby improving the model’s performance. Furthermore, to address the limitations in current multi-agent signal control optimization research, where the neural network structures are relatively traditional and thus restrict performance, this paper proposes the SE-A3C algorithm based on the A3C framework. The SE-A3C algorithm introduces an attention mechanism by incorporating the Squeeze-and-Excitation (SE) module from SENet after the CNN layers, enhancing the neural network’s adaptability to different feature channels. This allows the network to more effectively learn and utilize the relationships between feature channels, significantly boosting its performance.

3. Traffic Signal Control Model

In this study, the SE module from SENet is introduced into both the Actor and Critic networks of the A3C algorithm, as shown in Figure 1. The structure of the Actor and Critic networks in the SE-A3C algorithm is designed as follows: In the input layer, the Actor network receives a position and speed matrix, while the Critic network’s input includes the position matrix, speed matrix, and the action. The input matrices pass through two convolutional layers for feature extraction, whereas the actions in the Critic network are fed directly into the fully connected layers. The extracted feature matrices are then processed by the SE module, where channel weights are computed and recalibrated. After recalibration, the feature matrices are flattened in the Flatten layer. Finally, fully connected layers are added to integrate and further extract features. All convolutional and fully connected layers use the ReLU activation function, while the final fully connected layer employs a linear activation function to output traffic signals that align with the action space. By incorporating the SE module, the SE-A3C algorithm not only improves the model’s stability and performance but also enhances the neural network’s ability to adapt to feature channels, enabling more effective capture and utilization of key traffic flow features.

3.1. State

In deep reinforcement learning, the state is used to describe the current characteristics of the environment, and the agent decides which action to take based on the state. In related research, to better capture the characteristics of intersections while minimizing the training time of the neural network, the state of the intersection is typically represented as a matrix, which serves as the input to the neural network [24]. For each regular four-way intersection, the two inbound lanes from each direction are divided into 10 equally sized cells, resulting in an 8 × 10 matrix. For complex intersections, the position and speed matrices are sized 12 × 10, as shown in Figure 2. In the position matrix, if a cell contains a vehicle, its state is set to 1; otherwise, it is set to 0. In the speed matrix, the speed is the average of all vehicles in the cell, and the speed matrix is normalized.
Although modeling intersections by discretizing inbound lanes into cells is a simple and efficient method, in practice, vehicles closer to the intersection have a greater impact on the traffic environment. To more accurately reflect the state of an intersection, this paper introduces the concept of grouped sequences. This concept originates from Sun Bin’s Art of War, where sequences are grouped according to specific patterns.
In this study, to account for the varying influence of vehicles based on their distance from the intersection, the 10 cells in each inbound lane (ordered from closest to farthest from the intersection) are divided into 4 groups: (a1), (a2, a3), (a4, a5, a6), (a7, a8, a9, a10). These groups are assigned weights of 0.4, 0.3, 0.2, and 0.1, respectively, to reflect the relative impact of vehicles on the traffic conditions.
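For illustration, the following minimal sketch (Python/NumPy) shows one way the weighted position and speed matrices described above could be assembled for a standard four-way intersection. The cell_vehicles data structure, the maximum speed of 13.89 m/s, and the direct multiplication of the group weights into both matrices are assumptions made for this sketch rather than details taken from the implementation.

```python
import numpy as np

# Group weights for the 10 cells of each inbound lane, ordered from the cell
# closest to the stop line to the farthest: (a1), (a2,a3), (a4,a5,a6), (a7..a10).
GROUP_WEIGHTS = np.array([0.4] * 1 + [0.3] * 2 + [0.2] * 3 + [0.1] * 4)

def build_state(cell_vehicles, max_speed=13.89, n_lanes=8, n_cells=10):
    """Build the weighted position and speed matrices for one intersection.

    cell_vehicles[lane][cell] is assumed to hold the speeds (m/s) of the
    vehicles currently inside that cell; how this is queried (e.g., via TraCI)
    is outside the scope of this sketch.
    """
    position = np.zeros((n_lanes, n_cells))
    speed = np.zeros((n_lanes, n_cells))
    for lane in range(n_lanes):
        for cell in range(n_cells):
            speeds = cell_vehicles[lane][cell]
            if speeds:                                           # cell is occupied
                position[lane, cell] = 1.0
                speed[lane, cell] = np.mean(speeds) / max_speed  # normalized speed
    # Broadcast the group weights across all lanes.
    return position * GROUP_WEIGHTS, speed * GROUP_WEIGHTS

# Example: an empty road network yields two all-zero 8 x 10 matrices.
empty = [[[] for _ in range(10)] for _ in range(8)]
pos, spd = build_state(empty)
print(pos.shape, spd.shape)   # (8, 10) (8, 10)
```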

3.2. Action

In deep reinforcement learning, an action refers to the interactive behavior selected and executed by the agent from the action set based on the current state. The selection and execution of actions influence both the environmental feedback and the agent’s subsequent decision-making process. In order to improve vehicle throughput at intersections and speed up traffic flow, the actions are defined as changing the signal lights to green for a specific set of lanes within one signal cycle. Specifically, the action set for a standard four-way intersection is defined as follows:
A = {EW, EWL, SN, SNL}
where EW refers to vehicles traveling in the east–west direction, allowing straight and right turns; EWL refers to vehicles traveling in the east–west direction, allowing only left turns; SN refers to vehicles traveling in the north–south direction, allowing straight and right turns; SNL refers to vehicles traveling in the north–south direction, allowing only left turns.
The action set for the complex six-way intersection is defined as follows:
A = {EW, EWL, SN, SNL, NESW, NESWL}
where the first four actions are the same as those for a standard intersection, NESW refers to vehicles traveling in the northeast–southwest direction, allowing straight and right turns, and NESWL refers to vehicles traveling in the northeast–southwest direction, allowing only left turns. To ensure smooth transitions between traffic signals, the signal sequence is set as the primary optimization target. The durations of the green and yellow lights are fixed at 12 s and 4 s, respectively, with the remaining time allocated to red lights. When the agent selects an action, if the current and previous actions are the same, the signal remains unchanged; if they differ, a yellow light is introduced between the red and green lights to provide a buffer during the transition.
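As an illustration of how an action and its yellow-light buffer could be applied in SUMO, the sketch below uses the TraCI call traci.trafficlight.setRedYellowGreenState. The traffic-light ID, the phase state strings, and their link ordering are placeholders that depend on the actual network definition; only the 12 s green and 4 s yellow durations are taken from the text.

```python
import traci

GREEN_SEC, YELLOW_SEC = 12, 4      # fixed green and yellow durations from the text

# Hypothetical mapping from the abstract action set of a standard intersection
# to SUMO signal-state strings (one character per controlled connection).
PHASE_STATES = {
    "EW":  "GGGgrrrrGGGgrrrr",
    "EWL": "rrrGrrrrrrrGrrrr",
    "SN":  "rrrrGGGgrrrrGGGg",
    "SNL": "rrrrrrrGrrrrrrrG",
}

def yellow_of(state):
    """Turn every green signal of a phase string into yellow for the buffer phase."""
    return state.replace("G", "y").replace("g", "y")

def apply_action(tls_id, action, prev_action):
    """Switch traffic light tls_id to `action`, inserting a yellow buffer only
    when the selected phase differs from the previous one."""
    if prev_action is not None and action != prev_action:
        traci.trafficlight.setRedYellowGreenState(tls_id, yellow_of(PHASE_STATES[prev_action]))
        for _ in range(YELLOW_SEC):
            traci.simulationStep()
    traci.trafficlight.setRedYellowGreenState(tls_id, PHASE_STATES[action])
    for _ in range(GREEN_SEC):
        traci.simulationStep()
```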

3.3. Reward

The design of the reward function is crucial in deep reinforcement learning, as it provides feedback to the model regarding the performance of previous actions during training. In a multi-agent traffic signal control system, despite the presence of multiple agents, the core objective of the traffic system remains maximizing the throughput at intersections. To achieve this, this paper introduces a Nash equilibrium strategy, in which local optimization and Nash equilibrium operate collaboratively to maximize the total system reward. The design of the reward function is closely tied to the optimization objectives of the model. The reward function for each intersection is defined as a weighted linear combination of the change in cumulative vehicle waiting time (T) and the cumulative duration of yellow lights (Y) when traffic conditions deteriorate:
R = k1T + k2Y,  k1 + k2 = 1,  k2 < k1
Among these terms, the value of T represents the difference between the total waiting time of all vehicles in the previous time step and that in the current time step. A positive value indicates a reduction in waiting time and an improvement in traffic conditions, whereas a negative value suggests worsening congestion. The value of Y denotes the cumulative duration of yellow lights during signal transitions. Given that the studied intersections experience heavy traffic flow, longer durations of red and green lights help increase vehicle throughput, while frequent phase transitions tend to cause congestion. k1 and k2 are the weights of T and Y, respectively. Since T directly reflects the level of congestion in the traffic system and has a greater impact on the optimization effect, it is assigned a higher weight.
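As a minimal sketch of the reward defined above, assuming the weights k1 = 0.8 and k2 = 0.2 listed in Table 2 and that the cumulative waiting time is queried once per control step, the per-intersection reward could be computed as follows; the sign convention of Y follows the equation above.

```python
K1, K2 = 0.8, 0.2      # weights from Table 2, satisfying k1 + k2 = 1 and k2 < k1

def step_reward(prev_total_wait, curr_total_wait, yellow_duration):
    """Reward for one control step at one intersection (R = k1*T + k2*Y).

    T is the change in cumulative vehicle waiting time: positive when waiting
    time decreased (traffic improved), negative when congestion worsened.
    Y is the cumulative yellow-light duration of the transition, entering the
    reward with the sign convention of the equation above.
    """
    T = prev_total_wait - curr_total_wait
    Y = yellow_duration
    return K1 * T + K2 * Y

# Example: total waiting time dropped from 480 s to 400 s with a 4 s yellow buffer.
print(step_reward(480.0, 400.0, 4.0))   # 0.8*80 + 0.2*4 = 64.8
```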
In the game G(a1, …, an; r1, …, rn), if the action ai of every agent i in the joint action (a1, …, an) is the best response strategy πi* to the actions of the other agents (a1, …, ai−1, ai+1, …, an), meaning that this action yields agent i its maximum reward ri, then the action combination (a1*, …, an*) is a Nash equilibrium of the game G. Each agent chooses the strategy that maximizes its own reward, assuming the strategies of other agents remain unchanged, and eventually all agents’ strategies reach a Nash equilibrium. Therefore, introducing Nash equilibrium strategies in a multi-agent traffic signal control system ensures that the total system reward is maximized.
To maintain Nash equilibrium in the environment, during each round of neural network training, it is necessary to ensure that the reward obtained by agent i after selecting its action is maximized, and that it cannot achieve higher rewards by altering its own strategy. Additionally, the actions selected by agents prior to i must remain unchanged, ensuring that each agent makes decisions while optimally responding to the strategies of the other agents.
R(π1*, …, πi*, …, π9*) ≥ R(π1*, …, πi, …, π9*)
R(π1*, …, πi*, πi+1, …, π9) ≥ R(π1*, …, πi, πi+1, …, π9)
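The best-response condition above can be made concrete with a small sketch that checks whether a given joint action is a Nash equilibrium; the per-agent reward functions and action sets are hypothetical stand-ins for the multi-intersection signal-control environment.

```python
def is_nash_equilibrium(joint_action, action_sets, reward_fns):
    """Return True if no agent can increase its own reward by unilaterally
    deviating from `joint_action`.

    action_sets[i] is the action set of agent i, and reward_fns[i](joint) is
    the reward agent i obtains under a joint action; both are placeholders.
    """
    for i in range(len(joint_action)):
        current = reward_fns[i](joint_action)
        for alt in action_sets[i]:
            deviated = joint_action[:i] + (alt,) + joint_action[i + 1:]
            if reward_fns[i](deviated) > current:   # agent i has a profitable deviation
                return False
    return True

# Toy example with two agents whose rewards favour choosing the same phase.
rewards = [lambda a: 1.0 if a[0] == a[1] else 0.0] * 2
print(is_nash_equilibrium(("EW", "EW"), [("EW", "SN")] * 2, rewards))   # True
```

Such an exhaustive check is only feasible for small joint-action sets and serves purely to illustrate the condition that Section 3.3 imposes during training.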

3.4. SE Module

In this paper, we incorporate the SE module from SENet into both the Actor and Critic networks of the A3C algorithm. The SE module is an attention mechanism in deep learning that is lightweight yet significantly enhances the performance of convolutional neural networks (CNNs) [25]. In the traditional A3C network architecture, after the input state undergoes feature extraction via convolutional layers, the relationships between feature channels are not explicitly modeled. This can lead to some channels having a minimal impact on the output, reducing the model’s accuracy and overall performance. The SE module is designed to address this issue. As shown in Figure 3, the SE module builds relationships between channels through the Squeeze and Excitation operations. During the Squeeze operation, the feature matrix is compressed into a feature vector using global average pooling, which captures the global information of each channel. Global average pooling computes the mean value of each channel across its spatial dimensions, producing a compressed representation of the channel features. In the Excitation operation, the interdependencies between channels are modeled, and importance weights are assigned to each channel. The model learns the weights of each channel via fully connected layers and nonlinear activation functions, ultimately producing a weight vector for each channel. During this process, the feature vector passes through two fully connected layers. The first fully connected layer reduces the number of channels to reduce computational complexity, using a ReLU activation function. The second fully connected layer restores the original number of channels to ensure matrix compatibility, using a Sigmoid activation function. Finally, the learned channel weights are used to recalibrate the original feature matrix. Each channel is multiplied by its corresponding weight, emphasizing important channels and suppressing less relevant ones. This recalibration process helps the network selectively focus on critical information channels while suppressing noise or unimportant data. The entire SE module process is represented in Equation (1).
X′ = Scale(X) = X ⊙ Sigmoid(W2 · ReLU(W1 · Pool(X)))   (1)
The matrix X′ represents the output feature matrix after a series of operations, while X denotes the original input feature matrix, which contains various attributes and dimensions of the raw data. The Pool operation refers to global average pooling, which aggregates spatial information from each feature channel into a single scalar value, thereby capturing global information for each channel. The weight matrices W1 and W2 are learned through training and optimization, enabling the neural network to better capture and fit the inherent patterns and structure within the data. The Scale operation represents element-wise matrix multiplication, where the recalibrated channel weights are applied to the original feature matrix. This operation effectively combines global context with local features, making the output feature matrix X′ more representative and discriminative. By focusing on the most relevant information, the model enhances its ability to capture critical features while suppressing irrelevant or noisy data, ultimately improving the network’s overall performance.
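To make Equation (1) concrete, the following sketch implements the SE recalibration as a PyTorch module; the choice of PyTorch, the reduction ratio of 4, and the example feature-map shape are assumptions of this sketch rather than details reported for the actual networks.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: X' = X * Sigmoid(W2 ReLU(W1 Pool(X)))."""

    def __init__(self, channels, reduction=4):       # reduction ratio is an assumed value
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)   # first FC: reduce channels
        self.fc2 = nn.Linear(channels // reduction, channels)   # second FC: restore channels

    def forward(self, x):                             # x: (batch, C, H, W)
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                        # Squeeze: global average pooling -> (b, C)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))    # Excitation: channel weights
        return x * w.view(b, c, 1, 1)                 # Scale: recalibrate each channel

# Example: recalibrate a hypothetical feature map produced by the convolutional layers.
features = torch.randn(1, 16, 8, 10)
print(SEBlock(16)(features).shape)                    # torch.Size([1, 16, 8, 10])
```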

3.5. A3C Algorithm

The Asynchronous Advantage Actor–Critic (A3C) algorithm is an improvement over the traditional Actor–Critic algorithm, designed to enhance the efficiency of reinforcement learning through multi-threaded asynchronous parallelism. In the Actor–Critic framework, the Actor is responsible for generating actions and interacting with the environment, while the Critic evaluates the Actor’s performance and guides policy updates using Temporal Difference (TD) error. A3C’s key innovation is the introduction of a global network and multiple worker threads. The global network stores the latest model parameters and shares them across all threads. Each thread independently collects data and trains within different environments, asynchronously uploading the updated gradients to the global network. Upon receiving these updates, the global network synchronizes the parameters and redistributes them to the threads for the next training iteration. This multi-threaded parallel approach enables A3C to rapidly collect and process large amounts of data, significantly improving training speed and performance. Since the threads operate in diverse environments, the increased variety of samples enhances the generalization capability of the policy, allowing the algorithm to better adapt to complex and dynamic environments.
The A3C algorithm consists of an Actor network and a Critic network, which together form its core architecture [26]. The Actor network is responsible for action selection. It takes the current state as input and outputs actions based on the learned policy. The policy can be either deterministic (where the Actor directly outputs the action) or stochastic (where the Actor outputs a probability distribution over possible actions). Through continuous adjustment and optimization, the Actor network learns how to select the optimal action in different states to maximize cumulative rewards.
The Critic network is responsible for evaluating the quality of the actions chosen by the Actor network. It estimates the expected cumulative rewards that can be obtained given a specific state and action by calculating a value function (usually a state-value function or an action-value function). The Critic’s feedback is used to guide the Actor in adjusting its policy. By minimizing the difference between the estimated value and the actual return, the Critic updates its network. The model structures of the Actor and Critic networks with the SE module from SENet are illustrated in Figure 4 and Figure 5.
The update formula for the Actor network is shown in Equation (2). The Actor network uses a policy function πθ, which outputs the probability of each action based on the current state s. An action a is selected based on these probabilities, and interaction with the environment yields the next state s' and reward r. The parameters θ of the Actor network are updated according to the following equation:
θ = θ + α∇θ log πθ(s, a)Q(s, a)   (2)
where θ represents the learnable parameters of the neural network, α is the learning rate, ∇θ log πθ(s, a) is the gradient of the log-probability of taking action a in state s under the current policy πθ, and Q(s, a) is the action-value function.
The update formula for the Critic network is shown in Equation (3). The Critic network uses the value function V to compute the value of the current state s and the next state s’, and calculates the TD error δ to update the parameters of the Critic network. The TD error is computed as follows:
δ = r + γQ(s', a') − Q(s, a)   (3)
where r is the immediate reward, γ is the discount factor, indicating the agent’s emphasis on future rewards—the higher the value, the more the agent focuses on future gains. Q(s’, a’) is the action-value function for the next state s’ and action a’, while Q(s,a) is the action-value function for the current state s and action a.
The goal of the Critic network is to update the parameters φ of the value function by minimizing the TD error. The specific update formula is shown in Equation (4):
φ = φ + βδ∇φV(s)   (4)
where β is the learning rate of the Critic network, and ∇φV(s) represents the gradient of the value function V(s) with respect to the parameters φ.
The action-value function is defined in Equation (5):
Q(s, a) = Σs' P(s' | s, a)[r(s, a) + γV(s')]   (5)
This equation expresses Q(s, a) as the sum, over all next states s', of the transition probability P(s' | s, a) of moving from state s to state s' after taking action a, multiplied by the immediate reward r(s, a) plus the discounted value γV(s') of the next state s'.
The state-value function is defined in Equation (6):
V(s) = Σa π(a | s)Q(s, a)   (6)
This equation expresses the state value V(s) as the weighted sum of the action-value function Q(s, a) over the actions a selected in state s according to the policy π(a | s).
The A3C algorithm is an improvement and extension of the A2C algorithm, and both use the same advantage function (7). The advantage function is defined as the difference between the action-value function and the state-value function, representing how advantageous a particular action is relative to the current state value. Specifically, if the advantage function is greater than 0, it indicates that the action is better than the average action; if the advantage function is less than 0, it indicates that the action is worse than the average action.
A(s, a) = Q(s, a) − V(s)   (7)
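A small numerical sketch of Equations (3) and (7) is given below; the reward and value estimates are made-up illustrative values, while the discount factor γ = 0.75 is the value listed in Table 2.

```python
# Illustrative values only; the reward and the value estimates are invented.
r, gamma = 1.5, 0.75          # immediate reward and discount factor (Table 2)
Q_sa, Q_next = 4.0, 6.0       # Q(s, a) and Q(s', a') from the Critic
V_s = 3.2                     # V(s) from the Critic

td_error = r + gamma * Q_next - Q_sa    # Eq. (3): 1.5 + 0.75*6.0 - 4.0 = 2.0
advantage = Q_sa - V_s                  # Eq. (7): 4.0 - 3.2 = 0.8

print(td_error)    # 2.0 -> the Critic underestimated the value of this transition
print(advantage)   # 0.8 -> action a is better than the average action in state s
```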
In summary, the SE-A3C Algorithm 1 process is as follows:
Algorithm 1 SE-A3C algorithm
1: Input
     Initialize the number of training episodes, the per-agent rollout length tmax, the per-agent step counter t = 1, and the global step counter T = 1
     Initialize the global Actor network parameters θ and global Critic network parameters θv, and the per-agent Actor network parameters θ′ and per-agent Critic network parameters θ′v
     Initialize the learning rate α, the discount factor γ, and the exploration probability ε
2: Repeat
3:   Initialize gradients: dθ = 0 and dθv = 0
4:   Synchronize agent network parameters: θ′ = θ and θ′v = θv
5:   Divide the lanes into groups to obtain the state st
6:   Repeat
7:     Select action at based on the Actor network
8:     Execute action at to obtain the new state st+1 and reward rt
9:     t = t + 1
10:    T = T + 1
11:   Until a terminal state is reached or t = tmax
12:   R = 0 if st is terminal, otherwise R = V(st; θ′v)
13:   for i ∈ {t − 1, …, 1} do
14:     R = ri + γR
15:     Accumulate gradient: dθ = dθ + ∇θ′ log π(ai | si; θ′)(R − V(si; θ′v))
16:     Accumulate gradient: dθv = dθv + ∂(R − V(si; θ′v))²/∂θ′v
17:   end for
18:   Perform asynchronous updates of the global network parameters θ and θv using the accumulated gradients dθ and dθv
19: End
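For reference, a compact sketch of the gradient-accumulation loop in lines 12–17 of Algorithm 1 is given below, written against PyTorch; the critic callable, the trajectory format, and the omission of the asynchronous push to the global network are simplifications of this sketch rather than details of the actual implementation.

```python
import torch

def accumulate_gradients(critic, trajectory, gamma=0.75, bootstrap_value=0.0):
    """Accumulate one worker's gradients for a rollout (lines 12-17 of Algorithm 1).

    trajectory is a list of (state, action_log_prob, reward) tuples collected by
    the worker, where action_log_prob is log pi(a_i | s_i; theta') as a scalar
    tensor attached to the Actor's graph; critic(state) returns the scalar
    V(s_i; theta'_v); bootstrap_value is V(s_t; theta'_v) for a non-terminal
    final state and 0 otherwise.
    """
    R = bootstrap_value
    actor_loss, critic_loss = 0.0, 0.0
    for state, log_prob, reward in reversed(trajectory):
        R = reward + gamma * R                     # n-step return: R = r_i + gamma * R
        advantage = R - critic(state)              # R - V(s_i; theta'_v)
        actor_loss = actor_loss - log_prob * advantage.detach()   # policy-gradient term
        critic_loss = critic_loss + advantage.pow(2)               # squared value error
    (actor_loss + critic_loss).backward()          # gradients accumulate in theta', theta'_v
```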

3.6. Architectural Analysis of SE-A3C Based on the A3C Baseline

To clearly articulate the architectural contributions of the proposed method, this study takes the A3C framework as the baseline and conducts a comparative analysis between the structural extensions introduced by SE-A3C and the common extension directions of A3C explored in recent related studies.
SE-A3C is designed as a structural extension of the classical Asynchronous Advantage Actor–Critic (A3C) framework. Its objective is to enhance the efficiency of traffic state representation and the consistency of multi-agent cooperative decision-making, while preserving the original learning paradigm and decentralized decision mechanism. Specifically, SE-A3C introduces a channel-wise feature recalibration mechanism at the feature modeling level, where the Squeeze-and-Excitation (SE) module is employed to adaptively reweight different traffic state feature channels, thereby improving the model’s ability to perceive informative features. To address the multi-source heterogeneity inherent in traffic state information, SE-A3C further adopts a grouped sequential state representation to structurally model different types of traffic flow features, making it more suitable for decentralized multi-agent learning. In addition, in multi-intersection traffic signal control scenarios, SE-A3C incorporates a game-theoretic coordination constraint by embedding Nash equilibrium concepts into the multi-agent decision-making process, which enhances decision consistency among neighboring intersections and improves overall system stability.
In recent years, a variety of studies have extended the A3C framework from different architectural perspectives. One line of research improves multi-intersection coordination by modifying the critic structure, such as introducing locally centralized or neighborhood-conditioned critics, while maintaining system scalability [27]. Another line of work incorporates explicit communication mechanisms into A3C-based multi-agent frameworks, enabling agents to achieve cooperative decision-making through information sharing and thus better supporting distributed deployment scenarios [28]. In addition, some studies integrate temporal modeling and spatial structure modeling into the A3C framework, for example, by combining recurrent neural networks (LSTM or GRU) with graph-based representations to capture both the temporal evolution of traffic flow and the spatial dependencies among intersections [29]. Other approaches extend A3C through coordination graphs, multi-step returns, or off-policy learning strategies to improve joint policy optimization and training stability in large-scale multi-agent environments [30]. These methods primarily focus on information sharing, spatio-temporal modeling, or optimization stability, rather than on improving the efficiency of traffic state feature representation.
Overall, the key distinction between SE-A3C and the aforementioned extension directions lies in the fact that SE-A3C does not seek performance improvements by introducing new learning paradigms, explicit communication mechanisms, or complex network structures. Instead, it enhances the original A3C framework by focusing on feature modeling and multi-agent decision consistency, while maintaining architectural simplicity. This structural extension provides a clear and consistent basis for evaluating the effectiveness of SE-A3C in the subsequent experimental studies.
To further clarify the architectural characteristics of SE-A3C, a detailed feature-level comparison with recent A3C-based architectural extensions is presented in Table 1.

4. Experimental Section

4.1. Experimental Environment

The experimental environment in this study is designed using the SUMO (Simulation of Urban Mobility) microscopic multi-modal traffic simulation software. The experimental scenario, as shown in Figure 6, represents a complex multi-intersection road network, primarily composed of several two-way lane intersections. Additionally, the scenario includes a complex six-way intersection (as shown in Figure 7), whose action set is independent of the other intersections due to its complexity. The traffic flow is generated using a Weibull distribution, which accurately simulates peak and off-peak traffic patterns, providing a suitable probabilistic model for data generation.

4.2. Experimental Setup

Lane Configuration: At each intersection, lanes are divided into inbound and outbound lanes. In each direction, there are four inbound lanes (two straight lanes, one left-turn lane, and one right-turn lane) and four outbound lanes. The driving directions of the vehicles are randomly assigned as straight, left-turn, or right-turn.
Traffic Signals: Each inbound lane and its corresponding outbound lane are controlled by a set of traffic signals. The signals are dynamically adjusted based on traffic flow and time. A red light indicates no entry, a green light allows vehicles to pass, and a yellow light serves as a buffer between red and green lights, allowing vehicles that have crossed the stop line to continue moving. Right-turning vehicles are typically not subject to traffic signals.
Phases and Signal Sequence: A phase refers to a period during which traffic flow from a specific direction is allowed to pass. For example, during a particular phase, the east–west traffic signals might allow straight and right-turn movements but not left turns; in this case, the straight lanes have a green light, while the left-turn lanes have a red light. The signal sequence defines the transition order between different phases at an intersection. By properly configuring the signal sequence, traffic flow between intersections can be coordinated to effectively reduce congestion.
Traffic Rules: The duration of each phase is fixed but can be adjusted based on experimental results. In each time step, there must be at least one lane with a green or yellow light; there cannot be a situation where all lights at the intersection are red simultaneously.
To make the experiment more consistent with real-world conditions, we used a Weibull distribution for simulating vehicle generation. The Weibull distribution is a highly flexible continuous probability distribution, whose shape parameter can be adjusted to fit different types of data, making it suitable for modeling a variety of datasets. The probability density function of the Weibull distribution is defined as:
f(x; λ, k) = (k/λ)(x/λ)^(k−1) e^(−(x/λ)^k),  x ≥ 0;  f(x; λ, k) = 0,  x < 0
where x is a random variable, λ > 0 is the scale parameter, and k > 0 is the shape parameter [31]. When k = 1, the Weibull distribution simplifies to an exponential distribution, and when k = 2, it becomes a Rayleigh distribution. In this experiment, the shape parameter is set to k = 2, as this configuration aligns well with traffic flow patterns during specific time periods, as shown in Figure 8. During peak traffic hours, the number of vehicles follows a rise-then-fall trend: the number of vehicles steadily increases during the initial phase, reaches a peak, and then gradually decreases as the peak period ends.
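As one possible way to realize this, the sketch below draws vehicle departure steps from a Weibull distribution with k = 2 using NumPy and rescales them onto the simulation horizon; the random seed and the rescaling by the sample maximum are choices made for this sketch, not details of the original generation procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

N_VEHICLES, SIM_STEPS = 2000, 5400   # vehicle count and simulation length from the text
k = 2                                # Weibull shape parameter (Rayleigh-like profile)

# numpy's weibull() samples with unit scale; rescale onto the simulation horizon.
raw = rng.weibull(k, size=N_VEHICLES)
depart_steps = np.sort(raw / raw.max() * (SIM_STEPS - 1)).astype(int)

# The sorted departure steps rise in density to a peak and then tail off,
# mimicking the rise-then-fall pattern of a peak traffic period.
print(depart_steps[:5], depart_steps[-5:])
```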
In the experiment, we used the TraCI interface in SUMO to connect SUMO with Python 3.8 and defined a vehicle generation class. This class initializes the total number of vehicles and the total number of simulation steps, and defines a vehicle generation function. The function increases the vehicle generation probability at the complex intersection, making it a high-traffic area to test the model’s ability to optimize traffic signal control. To further enhance realism, we implemented differentiated vehicle generation probabilities across the road network: the complex intersection was configured as a high-traffic zone with a generation probability of 0.6, while other intersections were set to lower traffic levels with a generation probability of 0.4. The total simulation duration was set to 5400 s (90 min) to cover a complete traffic flow cycle of “rise-peak-decline.” During this period, the system generated a total of 2000 vehicles. This scale ensured the emergence of statistically significant traffic congestion and interactions, thereby guaranteeing effective evaluation of the algorithm’s control performance.
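A sketch of such a vehicle-generation helper is shown below. The SUMO configuration path, the route IDs, and the interpretation of the 0.6/0.4 values as routing probabilities toward the complex intersection versus the rest of the network are assumptions of this sketch, while the TraCI calls themselves (traci.start, traci.vehicle.add, traci.simulationStep) are standard.

```python
import traci

def run_generation(depart_steps, rng, sim_steps=5400):
    """Insert vehicles at pre-sampled departure steps while stepping SUMO.

    depart_steps is the sorted array produced by the Weibull sampling above;
    "network.sumocfg", "route_complex", and "route_regular" are placeholders
    that must match the loaded SUMO scenario.
    """
    traci.start(["sumo", "-c", "network.sumocfg"])
    next_vehicle = 0
    for step in range(sim_steps):
        while next_vehicle < len(depart_steps) and depart_steps[next_vehicle] <= step:
            # Route roughly 60% of vehicles through the high-traffic complex
            # intersection and 40% through the rest of the network.
            route = "route_complex" if rng.random() < 0.6 else "route_regular"
            traci.vehicle.add(f"veh_{next_vehicle}", routeID=route)
            next_vehicle += 1
        traci.simulationStep()
    traci.close()
```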
The selection of hyperparameters was determined through systematic preliminary experiments, with specific configurations presented in Table 2. The learning rate α = 0.001 was identified as the optimal value within the range [0.0005, 0.002]. The results demonstrate that the algorithm exhibits low sensitivity to learning rate variations within this range while maintaining overall stable performance. Specifically, when the learning rate falls below 0.0008, the network convergence speed decreases; when it exceeds 0.0015, oscillations emerge in the cumulative reward curve. After comprehensive consideration of both convergence speed and stability, we ultimately set the learning rate to 0.001, which achieves the optimal balance across all evaluated algorithms. The discount factor γ = 0.75 was chosen to balance short-term returns and long-term benefits in the traffic signal control problem, as this value effectively addresses both immediate traffic condition improvements and long-term system optimization objectives. The number of training episodes, set to 200, was determined through convergence analysis, with experiments demonstrating that all comparative algorithms achieved stable convergence within this training scale. The batch size of 20 was selected to maintain training stability while considering computational efficiency. The duration settings of traffic lights were configured with reference to common practices in real-world traffic signal systems to ensure the authenticity of the simulation environment.

4.3. Experimental Results and Analysis

We trained and tested the SE-A3C algorithm, DQN algorithm, A3C algorithm, Ape-X algorithm, and a fixed signal timing scheme in the same environment. The DQN, A3C, and Ape-X algorithms used their original neural network structures, but the state representation and reward functions were improved based on the modifications proposed in this paper. The performance of each algorithm was compared and analyzed using three metrics: cumulative reward, average queue length, and cumulative waiting time.
Cumulative reward reflects the convergence speed of the neural network and is used to indirectly indicate the quality of traffic conditions. Since the fixed signal timing scheme does not have a cumulative reward metric, Figure 9 only shows the cumulative reward comparison for the four deep reinforcement learning algorithms. The SE-A3C algorithm proposed in this paper introduces the SE module, enabling the neural network to better capture details and contextual information from the data, thereby improving model performance. As the number of training episodes increases, the SE-A3C algorithm significantly outperforms the other four algorithms, with a faster convergence speed and the best performance at the point of final convergence.
The average queue length reflects the state of vehicle queuing at intersections—the shorter the queue length, the smoother the traffic flow. A comparison of the average queue length for the five methods is shown in Figure 10. In the early stages of training, the fixed signal timing method results in fewer queued vehicles compared to the other four methods, as the deep reinforcement learning neural networks are still in the learning phase. However, after several rounds of training, the number of queued vehicles in the deep reinforcement learning algorithms begins to steadily decrease. Experimental results show that the SE-A3C algorithm performs the best in terms of average queue length after training, demonstrating higher efficiency in traffic signal control. The SE-A3C algorithm allows the neural network to better capture and utilize detailed information from the data, thereby optimizing traffic flow and reducing vehicle queuing times.
A comparison of the cumulative queue time for the five methods is shown in Figure 11. In the early stages of training, the four deep reinforcement learning algorithms had not yet converged and performed poorly in optimizing the environment, with cumulative queue times even higher than the fixed signal timing scheme. However, after several rounds of training, the performance of the four deep reinforcement learning algorithms significantly surpassed that of the fixed timing method. The DQN algorithm exhibited longer cumulative queue times in the early stages, with slower convergence, ultimately underperforming compared to the other algorithms. Compared to the A3C algorithm, the SE-A3C algorithm achieved shorter cumulative waiting times and noticeably faster convergence. Initially, the SE-A3C algorithm performed similarly to the Ape-X algorithm, but as training progressed, SE-A3C showed progressively better optimization and resulted in less cumulative queue time when the neural network converged. The experimental results indicate that the SE-A3C algorithm performed the best in terms of cumulative queue time, significantly outperforming the other algorithms.
The trained deep reinforcement learning models and the fixed signal timing scheme were tested in a simulated environment for 90 min, and the number of queued vehicles in the traffic was compared. The experimental results are shown in Figure 12. At the beginning of the test, due to the impact of peak traffic hours, the number of queued vehicles increased across all five methods. Among them, the SE-A3C algorithm had the fewest queued vehicles. In contrast, the fixed signal timing scheme performed poorly in handling high traffic volumes, with a consistently high number of queued vehicles, demonstrating the effectiveness and feasibility of deep reinforcement learning algorithms.
As the testing progressed, the stability of the SE-A3C algorithm became evident. In the SE-A3C environment, the increase in the number of queued vehicles was gradual, and the peak number of queued vehicles was lower compared to the Ape-X algorithm. Although the number of queued vehicles in both algorithm environments approached zero upon final convergence, the SE-A3C algorithm converged significantly faster than the Ape-X algorithm. Therefore, the SE-A3C algorithm demonstrated the best overall performance. Experimental results show that the SE-A3C algorithm performs optimally in handling complex traffic flows, effectively reducing the number of queued vehicles while improving the efficiency and stability of traffic signal control.
In summary, as shown in Table 3, the fixed signal timing method performed poorly in high-traffic environments due to its inability to adaptively learn from traffic information. Based on the comparison of cumulative rewards in the table, agents in the SE-A3C algorithm environment achieved the highest cumulative reward, maximizing long-term benefits. This demonstrates the superiority of the SE-A3C algorithm.
Based on the comparison of vehicle queue length and cumulative waiting time in the table, both the SE-A3C and Ape-X algorithms demonstrated strong performance, with both optimization metrics maintained at low levels. As a result, during the final testing phase, both SE-A3C and Ape-X were able to effectively manage the 2000 vehicles in the environment, ensuring no vehicle congestion during the testing period. The key difference lies in the SE module integrated into the SE-A3C algorithm’s neural network, which enhances the model’s ability to analyze relationships between different channels, providing more efficient feature representation. Consequently, the SE-A3C algorithm achieved faster convergence and better overall performance, proving its superiority in terms of accuracy and robustness.
The experimental results show that when traffic flow enters peak periods, the model trained by the SE-A3C algorithm can adjust traffic light signals to more optimal phase stages. This improves traffic efficiency, reduces energy consumption, enhances traffic safety, improves the driving experience, and promotes urban planning and sustainable development.

5. Conclusions

With the rapid growth in the number of vehicles in urban areas, the number of high-traffic intersections has significantly increased, placing higher demands on traffic signal control. To ensure the reliability of the experimental results, this study constructs a multi-agent traffic signal control optimization model in a more complex simulation environment and applies it to the newly proposed SE-A3C algorithm. By introducing an adaptive feature extraction module (SE module), the algorithm effectively extracts the features with the greatest impact on traffic flow, thereby enhancing model performance. At the same time, with the incorporation of Nash equilibrium concepts, agents can find a balance point in their rewards, accelerating the convergence of the neural network. Comparative experiments with traditional A3C, DQN, Ape-X algorithms, and the fixed signal timing method demonstrate that the SE-A3C algorithm performs superiorly in key metrics such as cumulative reward, average queue length, and cumulative waiting time, validating its effectiveness and superiority.
Future research will focus on optimizing traffic signal control in multi-intersection areas and redesigning traffic signal systems for more complex environments, including irregular intersection layouts and intersections with more than four directions. Additionally, we will explore more advanced deep reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), Twin Delayed Deep Deterministic Policy Gradient (TD3), and Reinforcement Learning with Expert Demonstrations and Minimum Loss (REM). The performance of these algorithms could be further enhanced by incorporating additional convolutional layers into the neural network to improve feature extraction accuracy and by adding residual connections to strengthen overall network capabilities. Through these improvements, we expect to further advance the intelligence of traffic signal control systems, contributing to the sustainable development and optimization of urban traffic.

Author Contributions

Conceptualization, K.C., S.Y. and H.J.; methodology, K.C., S.Y. and C.Y.; software, K.C.; validation, K.C., S.Y. and C.Y.; formal analysis, K.C.; investigation, K.C.; resources, M.Y. and J.G.; data curation, K.C.; writing—original draft preparation, K.C.; writing—review and editing, S.Y., C.Y., M.Y., J.G. and H.J.; visualization, K.C.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2026-RS-2022-00156334, contribution rate: 70%) and Liaoning Provincial Department of Science and Technology Plan Project-General Project (Project No.: 2025-MS-141).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yue, W.; Li, C.; Chen, Y.; Duan, P.; Mao, G. What is the root cause of congestion in urban traffic networks: Road infrastructure or signal control? IEEE Trans. Intell. Transp. Syst. 2022, 23, 8662–8679. [Google Scholar] [CrossRef]
  2. Wei, H.; Zheng, G.; Gayah, V.; Li, Z. Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation. ACM SIGKDD Explor. Newsl. 2021, 22, 12–18. [Google Scholar] [CrossRef]
  3. Wang, X.; Sanner, S.; Abdulhai, B. A critical review of traffic signal control and a novel unified view of reinforcement learning and model predictive control approaches for adaptive traffic signal control. arXiv 2022, arXiv:2211.14426. [Google Scholar] [CrossRef]
  4. Wei, H.; Zheng, G.; Yao, H.; Li, Z. IntelliLight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2496–2505. [Google Scholar]
  5. Zheng, G.; Xiong, Y.; Zang, X.; Feng, J.; Wei, H.; Zhang, H.; Li, Y.; Xu, K.; Li, Z. Learning phase competition for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1963–1972. [Google Scholar]
  6. Zheng, G.; Zang, X.; Xu, N.; Wei, H.; Yu, Z.; Gayah, V.; Xu, K.; Li, Z. Diagnosing reinforcement learning for traffic signal control. arXiv 2019, arXiv:1905.04716. [Google Scholar] [CrossRef]
  7. Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
  8. Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; Vicente, R. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE 2017, 12, e0172395. [Google Scholar] [CrossRef]
  9. Cao, D.; Zeng, K.; Wang, J.; Sharma, P.K.; Ma, X.; Liu, Y.; Zhou, S. BERT-based deep spatial-temporal network for taxi demand prediction. IEEE Trans. Intell. Transp. Syst. 2021, 23, 9442–9454. [Google Scholar] [CrossRef]
  10. Kanis, S.; Samson, L.; Bloembergen, D.; Bakker, T. Back to Basics: Deep Reinforcement Learning in Traffic Signal Control. arXiv 2021. [Google Scholar] [CrossRef]
  11. Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv 2018. [Google Scholar] [CrossRef]
  12. Cao, K.; Wang, L.; Zhang, S.; Duan, L.; Jiang, G.; Sfarra, S.; Zhang, H.; Jung, H. Optimization Control of Adaptive Traffic Signal with Deep Reinforcement Learning. Electronics 2024, 13, 198. [Google Scholar] [CrossRef]
  13. Bokade, R.; Jin, X.; Amato, C. Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control. IEEE Access 2023, 11, 47646–47658. [Google Scholar] [CrossRef]
  14. Naderi, M.; Typaldos, P.; Papageorgiou, M. Lane-free signal-free intersection crossing via model predictive control. Control Eng. Pract. 2025, 154, 106115. [Google Scholar] [CrossRef]
  15. Malekzadeh, M.; Papamichail, I.; Papageorgiou, M.; Bogenberger, K. Optimal internal boundary control of lane-free automated vehicle traffic. Transp. Res. Part C Emerg. Technol. 2021, 126, 103060. [Google Scholar] [CrossRef]
  16. Mei, X.; Fukushima, N.; Yang, B.; Wang, Z.; Takata, T.; Nagasawa, H.; Nakano, K. Reinforcement Learning-Based Intelligent Traffic Signal Control Considering Sensing Information of Railway. IEEE Sens. J. 2023, 23, 31125–31136. [Google Scholar] [CrossRef]
  17. Zheng, P.; Chen, Y.; Kumar, B.V.D. Regional Intelligent Traffic Signal Control System Based on Multi-agent Deep Reinforcement Learning. In Proceedings of the 2023 8th International Conference on Computer and Communication Systems (ICCCS), Guangzhou, China, 21–23 April 2023; pp. 362–367. [Google Scholar] [CrossRef]
  18. Zhang, W.; Yan, C.; Li, X.; Fang, L.; Wu, Y.-J.; Li, J. Distributed Signal Control of Arterial Corridors Using Multi-Agent Deep Reinforcement Learning. IEEE Trans. Intell. Transp. 2023, 24, 178–190. [Google Scholar] [CrossRef]
  19. Hassan, M.A.; Elhadef, M.; Khan, M.U.G. Collaborative Traffic Signal Automation Using Deep Q-Learning. IEEE Access 2023, 11, 136015–136032. [Google Scholar] [CrossRef]
  20. Zhao, X.; Flocco, D.; Azarm, S.; Balachandran, B. Deep Reinforcement Learning for the Co-Optimization of Vehicular Flow Direction Design and Signal Control Policy for a Road Network. IEEE Access 2023, 11, 7247–7261. [Google Scholar] [CrossRef]
  21. Zhu, C.; Ye, D.; Zhu, T.; Zhou, W. Location-Based Real-Time Updated Advising Method for Traffic Signal Control. IEEE IoT J. 2024, 11, 14551–14562. [Google Scholar] [CrossRef]
  22. Shen, H.; Zhao, H.; Zhang, Z.; Yang, X.; Song, Y.; Liu, X. Network-Wide Traffic Signal Control Based on MARL with Hierarchical Nash-Stackelberg Game Model. IEEE Access 2023, 11, 145085–145100. [Google Scholar] [CrossRef]
  23. Wang, P.; Ni, W. An Enhanced Dueling Double Deep Q-Network with Convolutional Block Attention Module for Traffic Signal Optimization in Deep Reinforcement Learning. IEEE Access 2024, 12, 44224–44232. [Google Scholar] [CrossRef]
  24. Zhao, T.; Wang, P.; Li, S. Traffic Signal Control with Deep Reinforcement Learning. In Proceedings of the 2019 International Conference on Intelligent Computing, Automation and Systems (ICICAS), Chongqing, China, 6–8 December 2019. [Google Scholar]
  25. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  26. Sewak, M. Actor-Critic Models and the A3C: The Asynchronous Advantage Actor-Critic Model. In Deep Reinforcement Learning; Springer: Singapore, 2019; pp. 141–152. [Google Scholar]
  27. Goel, H.; Zhang, Y.; Damani, M.; Sartoretti, G. SocialLight: Distributed Cooperation Learning towards Network-Wide Traffic Signal Control. In Proceedings of the 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Auckland, New Zealand, 9–13 May 2022; pp. 1–9. [Google Scholar]
  28. Wu, Q.; Wu, J.; Shen, J.; Yong, B.; Zhou, Q. An Edge Based Multi-Agent Auto Communication Method for Traffic Light Control. Sensors 2020, 20, 4291. [Google Scholar] [CrossRef] [PubMed]
  29. Zai, W.; Yang, D. Improved Deep Reinforcement Learning for Intelligent Traffic Signal Control Using ECA_LSTM Network. Sustainability 2023, 15, 13668. [Google Scholar] [CrossRef]
  30. Michailidis, P.; Michailidis, I.; Lazaridis, C.R.; Kosmatopoulos, E. Traffic Signal Control via Reinforcement Learning: A Review on Applications and Innovations. Infrastructures 2025, 10, 114. [Google Scholar] [CrossRef]
  31. Rinne, H. The Weibull Distribution a Handbook; Chapman and Hall/CRC: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
Figure 1. Structure of SE-A3C.
Figure 2. Cell matrix.
Figure 3. SE module structure.
Figure 4. Actor network structure.
Figure 5. Critic network structure.
Figure 6. Multi-intersection road network structure.
Figure 7. Complex intersection interface.
Figure 8. Weibull distribution with shape parameter of 2.
Figure 9. Comparison of cumulative rewards for four algorithms.
Figure 10. Comparison of average queue length among five methods.
Figure 11. Comparison of cumulative queuing time among five methods.
Figure 12. Comparison of the number of queued vehicles during testing for five methods.
Table 1. Architectural comparison of SE-A3C and recent algorithms.
Feature | A3C (Classical Baseline) | SE-A3C (Proposed) | A3C with Modified Critic Structure | A3C with Explicit Communication | A3C with Spatio-Temporal Modeling
Learning Paradigm | Asynchronous Actor-Critic | Asynchronous Actor-Critic | Asynchronous Actor-Critic | Asynchronous Actor-Critic | Asynchronous Actor-Critic
Critic Design | Fully decentralized | Decentralized with coordination constraint | Locally centralized/neighborhood-conditioned [27] | Decentralized | Decentralized or partially centralized
State Representation | Local intersection state | Grouped structured local state | Local + neighboring state | Local + communicated information [28] | Temporal sequences with spatial relations [29]
Feature Modeling Strategy | Uniform processing | Channel-wise recalibration (SE) | Shared feature encoding | Message-based aggregation | Recurrent encoding + graph aggregation
Temporal Modeling | Optional | Lightweight sequential grouping | Optional | Optional | LSTM/GRU-based modeling [29]
Multi-Agent Coordination | Implicit | Constraint-based consistency | Critic-level coordination | Explicit communication [28] | Graph-based spatial coordination
Training Stability | Standard A3C updates | Structural constraints | Improved value estimation | Stabilized via communication | Enhanced via structured modeling
Table 2. Hyperparameters.
Hyperparameter | Value
Number of training episodes | 200
Number of training steps per episode | 5400
Batch size | 20
Weibull distribution shape parameter k | 2
Learning rate α | 0.001
Discount factor γ | 0.75
Green light duration (s) | 12
Yellow light duration (s) | 4
k1 | 0.8
k2 | 0.2
Table 3. Comparison of experimental results of the five methods.
Method | Accumulated Rewards | Average Number of Queued Vehicles | Accumulated Queuing Time | Number of Vehicles After Testing | Steps for Neural Network Convergence
A3C | −34,500.32 | 67.1 | 5650 | 65 | 3300
Ape-X | −20,630.52 | 33.6 | 3399 | 0 | 2400
SE-A3C | −14,990.82 | 21.5 | 2079 | 0 | 2100
DQN | −42,631.22 | 65.0 | 7086 | 118 | -
Fixed | - | 293.5 | 22,673 | 659 | -
The SE-A3C row corresponds to the proposed method.
