Article

Adaptive Deep Q-Network Algorithm with Exponential Reward Mechanism for Traffic Control in Urban Intersection Networks

by Muhammad Riza Tanwirul Fuad 1, Eric Okto Fernandez 1, Faqihza Mukhlish 2, Adiyana Putri 3, Herman Yoseph Sutarto 4, Yosi Agustina Hidayat 5 and Endra Joelianto 6,7,*

1 Department of Engineering Physics, Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung 40132, Indonesia
2 Engineering Physics Research Group, Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung 40132, Indonesia
3 Graduate Program of Engineering Physics, Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung 40132, Indonesia
4 Department of Intelligent System, PT. Pusat Riset Energi, Bandung 40226, Indonesia
5 Industrial System and Techno-Economy Research Group, Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung 40132, Indonesia
6 Instrumentation and Control Research Group, Faculty of Industrial Technology, Institut Teknologi Bandung, Bandung 40132, Indonesia
7 University Center of Excellence Artificial Intelligence on Vision, NLP and Big Data Analytics (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung 40132, Indonesia
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(21), 14590; https://doi.org/10.3390/su142114590
Submission received: 26 September 2022 / Revised: 28 October 2022 / Accepted: 2 November 2022 / Published: 6 November 2022
(This article belongs to the Section Sustainable Transportation)

Abstract

The demand for transportation has increased significantly in recent decades, in line with growing passenger and freight mobility, especially in urban areas. One of its most negative impacts is the increasing level of traffic congestion. A possible short-term solution to this problem is to utilize a traffic control system. However, most traffic control systems still use classical control algorithms in which the green phase sequence is determined by a fixed strategy. Studies have shown that this approach does not provide the expected congestion relief. In this paper, an adaptive traffic controller was developed using a reinforcement learning algorithm called the deep Q-network (DQN). Since DQN performance is determined by reward selection, an exponential reward function based on the macroscopic fundamental diagram (MFD) of the distribution of vehicle density at intersections was considered. The action taken by the DQN is the selection of traffic phases, based on various rewards ranging from pressure to adaptive weighting of pressure and queue length. The reinforcement learning algorithm was then applied in the SUMO traffic simulation software to assess the effectiveness of the proposed strategy. The DQN-based control algorithm with the adaptive reward mechanism achieved the best performance, with a vehicle throughput of 56,384 vehicles, followed by the classical and conventional control methods: Webster (50,366 vehicles), max-pressure (50,541 vehicles), and uniform (46,241 vehicles) traffic control. The significant increase in vehicle throughput achieved by the adaptive DQN-based control algorithm with an exponential reward mechanism means that the proposed traffic control could increase area productivity, implying that the intersections could accommodate more vehicles and thus reduce the possibility of congestion. The algorithm performed remarkably well in preventing congestion in a traffic network model of Central Jakarta, one of the world's most congested cities. This result indicates that traffic control design using the MFD as a performance measure can be a successful future direction in the development of reinforcement learning for traffic control systems.

1. Introduction

Congestion in transportation networks is growing because of the increasing demand for passenger and freight mobility. In urban areas with limited public transportation and low road capacity, preventing congestion depends heavily on how effectively the traffic control signals are handled [1]. In general, traffic signal operation is differentiated into two types of methods: fixed-time and actuated/adaptive control methods. The first employs a repeating pattern with a fixed interval for each lane at an intersection during the cycle time (one period of the signaling system). This method does not consider the dynamics of the traffic [2]. For the latter, adaptive control methods have been developed by incorporating the queue length of each lane as pressure on the signaling time, known as max-pressure control [3,4], where the pressure is the difference between the upstream and downstream queue lengths at the intersection, indicating an unbalanced vehicle distribution. Another notable adaptive method was proposed by Webster, with the concept of minimizing the delay time per vehicle to optimize the signaling time, based on the queue density and road width [5]. However, for highly complex urban transportation networks, the current approaches cannot be easily implemented without prior expert knowledge, because the traffic states must be represented adequately when using a classical method [6]. Adaptive and auto-generative solutions are required to solve congestion problems in highly connected transportation networks. In this paper, a deep Q-network reinforcement learning approach is investigated as a novel approach to learning optimal signaling control, based on changing traffic dynamics.
The max-pressure approach compares the pressure variation between the arms at an intersection [3]. The setting of the decision process in the max-pressure only considers adjacent queues at each intersection, so it is a decentralized approach. Moreover, beyond basic serviceability, the max-pressure can operate without any prior knowledge of the traffic demand. This offers a significant advantage over most current traffic control systems, which necessitate a costly and time-consuming retiming procedure in the case of changes in demand patterns [4]. Most studies on the max-pressure used artificial data sets and a grid network model. Subsequently, studies on the max-pressure utilizing real data from an arterial road were conducted [4,7]. Recent studies by Salomons and Hegyi [8] and Joelianto et al. [9] investigated the max-pressure with the assessment of its effectiveness using the macroscopic fundamental diagram (MFD) measure. However, a collaborative process is not present in the max-pressure, in spite of it being a decentralized approach. With highly connected transportation networks in urban areas, it is important for all intersections to collaborate while fulfilling the objective of optimizing the inflow and outflow of traffic. Hence, in this paper, a reinforcement learning approach is proposed around the idea of combining the max-pressure with the MFD assessment.
The MFD is a macroscopic congestion metric that makes it easier to model the traffic flow and explains the network outflow and vehicle accumulation in an urban network at an aggregate level [10]. The MFD links the number of vehicles in a region to the rate at which trips are completed in that area when the network is uniformly congested, with demand changing gradually over time [11]. Godfrey [12] first postulated the idea of the MFD, but empirical data have only lately confirmed its reality [13]. Gayah et al. [14] investigated the effect of locally adaptive traffic signals on a network’s stability and the MFD. Yan et al. [10] developed an iterative learning controller, based on the MFD and utilized it to control the signaling split in urban networks. All in all, using the MFD to control traffic signals is a current interest in optimizing traffic flow in urban areas.
Several reinforcement learning (RL) methods have been developed for solving traffic congestion at the city level. Recently, traffic system reinforcement learning methods have drawn interest because they can directly learn from complex conditions through training and rewarding without specific requirements on traffic model assumptions [6]. In general, the learning process takes traffic conditions as states and signal configurations as actions at each intersection [15]. The study by Chen et al. has shown that decentralized reinforcement learning satisfies scalability, coordination, and data feasibility [16]. The formulation of decentralized RL agents in [3] utilizes pressure from the max-pressure as the reward system. The objective of the decentralized RL agents is to minimize the pressure to balance the vehicle distribution and maximize the throughput in the traffic network.
Using intersections as self-learning agents in RL provides a powerful approach to formulating the traffic signal control problem because it does not depend on supervised learning labels. This characteristic of reinforcement learning allows agents to update the control signal in response to new observations of the dynamic traffic conditions in the environment. Thus, the presence of human experts in control parameter tuning can be replaced by online learning or trial-and-error interaction. Recently, deep learning methods have been combined with RL, known as deep Q-learning. By utilizing the benefits of deep learning, the RL agents can learn to capture the relationship between states and actions from the data structure. Q-learning is a model-free, off-policy, temporal difference reinforcement learning method that looks for the best course of action given the current state of the agent. The learning method can sort out problems with stochastic transitions and rewards without requiring adaptation and has been applied in several investigations [17,18,19,20,21,22]. Deep reinforcement learning (DRL) has been used to control intersections via the deep Q-network (DQN) by utilizing a neural network to approximate the Q-value function [23,24,25].
Both the max-pressure and RL can generally be seen as an optimization problem [26]. In the RL methods, reward design is necessary to achieve an efficient learning process. The reward is the objective of the optimization, and the solution is based on a trial-and-error search to tell the agent if an action is good or not. Wei et al. [26] defined the reward as pressure, which indicates the degree of disequilibrium between the vehicle density in the incoming and outgoing lanes. The max-pressure uses a greedy algorithm to minimize the pressure. If the reward function is set identical to the objective of the max-pressure, the same result can be achieved as with the max-pressure to maximize the network throughput [26]. The reward design in [27] considers the capacity of connecting lanes and upcoming vehicles from adjacent intersections. This reward is meant to avoid traffic congestion in the case of letting vehicles move into fully occupied lanes and RL is applied for the cooperative traffic signal system.
Reinforcement learning-based control can correct weaknesses resulting from previous actions. Thus, reinforcement learning can overcome the weaknesses of classical traffic control. Recently, Wu et al. [28] conducted a large-scale traffic signal control study using distributed agent deep reinforcement learning by developing two new multi-agent reinforcement learning (MARL) algorithms, Nash-A2C and Nash-3AC, naturally incorporating the Nash equilibrium theory into the actor-critic architecture of deep reinforcement learning. Furthermore, a distributed IoT computing architecture was designed for the urban traffic that is much better suited to the distributed approaches. The reinforcement learning reward was defined by means of long-term traffic conditions (waiting time). In contrast, in the present paper, the reward is based on traffic stream properties (flow and density). This consideration is meant to improve traffic control performance when traffic conditions exceed the saturation flow condition. In this regard, the focus of this study was on constructing a reinforcement learning reward that can handle congestion in any traffic scenario instead of solely preventing congestion. This paper proposes an adaptive traffic control algorithm, based on a multi-agent reinforcement learning control using an exponential reward approach for urban intersection networks.
The designed reinforcement learning control algorithm selects the traffic phase to be activated, based on the pressure and queue length with the expectation that each selection is the optimal decision in each scenario that occurs. The MFD approach was employed to examine the performance of the reinforcement learning control macroscopically (overall traffic network model) rather than analyzing its effectiveness at each intersection. A reward function that prioritizes pressure will produce good results before reaching saturation flow, whereas a reward function that emphasizes queue length will achieve better outcomes after reaching saturation flow. Utilizing an exponential function approach, the adaptive reward is determined using a combination of the total pressure and the total queue length of the phase at an intersection, providing a reward that is adaptable to the vehicle density situation at any given moment. The proposed RL algorithm will be demonstrated using a traffic network model of Central Jakarta, as a case study.
This paper proposes a deep Q-network algorithm for traffic control, with the following contributions. (1) Designing an algorithm, based on reinforcement learning with the performance measured in terms of MFD to effectively control the traffic in various traffic conditions. (2) Multi-agent DQN is used to approximate the Q-value to determine the action taken by the agent at each intersection. (3) The adaptive reward function uses an exponential approach that is based on the combination of the total pressure and the total queue length of the phase at an intersection. (4) The developed controller can prevent the vehicle flow from becoming completely jammed (gridlock) and increase the number of vehicles that successfully complete their trips in the simulation (maximum throughput).
The rest of this paper is organized as follows. Section 2 provides the problem formulation for the traffic network as well as explanations of the max-pressure traffic control method, the Webster traffic control method, and the macroscopic fundamental diagram. Section 3 describes the design of the proposed reinforcement learning traffic control using the deep Q-network, as well as the agent design. Section 4 shows an implementation of the controller in a traffic network using the proposed reward function. Section 5 presents the result and its discussion, and Section 6 presents the conclusion and discusses future work.

2. Problem Formulation

2.1. Traffic Network

Traffic signal control has become more important as traffic congestion has become unavoidable [29]. In order to develop traffic control algorithms, an excellent network model is required to speed up research and reduce its costs [30]. On top of that, applying newly developed traffic control algorithms directly to real traffic networks carries a high risk of poor performance and accidents. Traffic network models are made up of intersections and edge nodes [31]. Vehicles enter the traffic system network through edge nodes and can proceed straight, turn left, or turn right at an intersection, based on the available road directions.
Figure 1 shows the traffic network models that are classified as single intersection, multiple intersection, and grid traffic network. Terms such as ‘road’, ‘lane’, and ‘junction’ are employed to refer to the road structure. The road is an element of the traffic network, described as the path that vehicles can take to move from one location to another within the traffic network. Lanes are the components that make up a road. Depending on the width of the road, one road may have one or more lanes; a road with three lanes, for instance, allows three vehicles to pass simultaneously. A junction is a node in the traffic network that connects roadways. The traffic signal is a concept used in traffic network modeling and can be described in terms of traffic movement direction, traffic phase, and traffic cycle. The traffic movement direction is defined as the direction in which vehicles turn when approaching an intersection. Several traffic movement directions can be active at the same time. A traffic phase is a combination of several directional movements that are active at the same time. Only one traffic phase can be active at a time. A traffic cycle is a collection of consecutively triggered traffic phases. Non-cyclical systems refer to traffic phases that are not activated sequentially.
In large-scale traffic control systems, controlled traffic intersections interact with each other, which causes the traffic control problem to become more complex [32]. The optimization of only one traffic intersection can lead to locally optimal solutions [26]. Locally optimal solutions are indicated by an increasing level of service (LOS) on an intersection while the LOS of other intersections decreases. This paper focuses on the development of a traffic control strategy for large-scale urban networks by implementing reinforcement learning with intersection pressure as the reward to achieve coordination between directly connected intersections. To measure the control performance in a large-scale urban network, the MFD is an essential performance metric. Further, the concept of pressure as an agent reward and the MFD are discussed in Section 3.3 and Section 2.4, respectively.

2.2. Max-Pressure Control

The max-pressure control algorithm manages intersections by activating the traffic phase with the highest pressure [3]. The max-pressure method has two different ways to activate the green light: the first uses a cycle-based scheme, while the second uses a slot-based scheme [7]. The slot-based max-pressure control does not use a specific sequence pattern. The traffic phase pressure ($P_p$) is defined as the difference between the number of vehicles ($n$) in the phase’s upstream and downstream lanes [33],

$$P_p = \sum_{l \in L_{p,\mathrm{ups}}} n_l - \sum_{k \in L_{p,\mathrm{dws}}} n_k \qquad (1)$$

where $n_l$ is the number of vehicles in an upstream lane, with $L_{p,\mathrm{ups}}$ the set of upstream lanes of intersection phase $p$, and $n_k$ is the number of vehicles in a downstream lane, with $L_{p,\mathrm{dws}}$ the set of downstream lanes of intersection phase $p$.
Figure 2 shows a case where the pressure from south to north is two, while the pressure from west to east is negative one. A negative pressure indicates that the number of vehicles downstream is higher than the number of vehicles upstream; hence, it is less favorable to activate a phase with a more negative pressure. The max-pressure control algorithm is able to increase throughput by minimizing the intersection pressure, activating the traffic phase with the highest pressure [26]. When two arms have the same pressure, the activation is chosen at random. Cycle-based methods are frequently employed in traffic situations where each intersection’s lane activation follows a certain pattern, such as clockwise or counterclockwise. In cycle-based methods, the green time allotted to each arm is proportional to its pressure weight. The operator must establish an appropriate cycle time for each junction. The cycle time is divided based on the weight of each arm’s pressure [7],
$$E_p(t) = \frac{f\!\left(\sum_{l,k} \mu_{lk,p} P_p\right)}{\sum_{i} f\!\left(\sum_{l,k} \mu_{lk,i} P_i\right)}$$

where the sum in the denominator runs over all phases $i$ of the intersection and the function $f(\cdot)$ is taken as the exponential function, $f\!\left(\sum_{l,k} \mu_{lk,p} P_p\right) = e^{\eta \sum_{l,k} \mu_{lk,p} P_p}$. In some lane weight calculations a negative value may be encountered; in these cases, the exponential function is employed to handle the negative value. To prevent the exponential values from approaching infinity, $\eta$ is utilized as a scaling factor.
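To make the cycle-based split concrete, the following minimal Python sketch (with hypothetical function and parameter names, not the authors' code) computes the share of a fixed cycle time assigned to each phase from its pressure using the exponential weighting described above.

```python
import math

# Illustrative sketch of the cycle-based max-pressure green-time split.
# Phase pressures may be negative, so an exponential weighting f(x) = exp(eta * x)
# is used; eta is a small scaling factor that keeps the exponentials bounded.
def green_time_split(phase_pressures, cycle_time, eta=0.1):
    """Return the green time allotted to each phase within one cycle."""
    weights = [math.exp(eta * p) for p in phase_pressures]
    total = sum(weights)
    return [cycle_time * w / total for w in weights]

# Example: three phases with pressures 2, -1 and 0 sharing a 90 s cycle.
print(green_time_split([2.0, -1.0, 0.0], cycle_time=90))
```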

2.3. Webster Traffic Control

The Webster control method is used to determine the ideal cycle length for each phase at a single junction. The Webster technique can be utilized in real-time applications by first collecting a small amount of data and assuming that there will be no unexpected changes in the conditions; in other words, it calculates the signal timing assuming steady-state traffic [5]. The method is applicable to both fixed-time and vehicle-actuated operation.
The Webster control algorithm manages intersections using a cycle-based control scheme [5]. The Webster algorithm calculates the optimal cycle length ($C$) from the critical flow data ($Y_i$) and the lost time in a cycle ($L_t$), with $N$ as the number of phases at the intersection [33].

$$C = \frac{1.5 L_t + 5}{1 - \sum_{i=1}^{N} Y_i}$$
Critical flow data are measured using the vehicle volume ($V_i$), the saturated flow per lane ($s_f$), and the number of incoming lanes in the phase ($n_j$) [33].

$$Y_i = \frac{V_i}{s_f \times n_j} \qquad (4)$$
In the Webster control algorithm, the optimal cycle length is updated every interval $W$; the algorithm therefore assumes that the traffic demand will be roughly the same for the subsequent interval. The selection of $W$ involves trade-offs: smaller values enable more frequent adjustments to changing traffic demand at the expense of stability, while larger values of $W$ adapt less frequently but allow traffic stability to be increased [33].

2.4. Macroscopic Fundamental Diagram

Analysis and evaluation of the traffic control strategy is carried out simultaneously over the entire modeled network (macroscopic evaluation) to obtain a traffic control strategy that is able to provide a globally optimal solution. As a macroscopic analysis tool, the MFD was selected to evaluate the traffic control strategy in the entire modeled network. The MFD describes the relationship between the average traffic flow in each lane of the modeled network ($F_t$) and the average traffic density in the entire modeled network ($K_t$) [34].

$$F_t = \frac{\sum_{i=1}^{z} f_{i,t}\, l_i}{\sum_{i=1}^{z} l_i}$$

$$K_t = \frac{\sum_{i=1}^{z} k_{i,t}\, l_i}{\sum_{i=1}^{z} l_i}$$

The variables used in the calculations are $f_{i,t}$, the vehicle flow in lane $i$ at time $t$; $k_{i,t}$, the vehicle density in lane $i$ at time $t$; $l_i$, the lane length; and $z$, the number of lanes in the network. The MFD categorizes traffic conditions into four regions: L1, L2, L3, and L4 [10].
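As an illustration of how these length-weighted averages might be computed from per-lane measurements (the data structures and names below are assumptions for the sketch, not the authors' implementation):

```python
def mfd_point(lane_flows, lane_densities, lane_lengths):
    """Length-weighted average flow F_t and density K_t over all lanes at one time step.

    lane_flows:     list of f_{i,t} values (veh/h)
    lane_densities: list of k_{i,t} values (veh/km)
    lane_lengths:   list of l_i values (km)
    """
    total_length = sum(lane_lengths)
    F_t = sum(f * l for f, l in zip(lane_flows, lane_lengths)) / total_length
    K_t = sum(k * l for k, l in zip(lane_densities, lane_lengths)) / total_length
    return F_t, K_t

# Example with three lanes of different lengths
print(mfd_point([180, 150, 60], [20, 25, 40], [0.3, 0.5, 0.4]))
```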
In Figure 3, the L1 region reflects undersaturated traffic conditions, which occur when the traffic network is beginning to be loaded with vehicles. In the L2 region, traffic has reached saturation, which indicates that several lanes in the traffic network are beginning to fill up while there is a limit to the number of vehicles that the network can manage. As the traffic demand (the number of vehicles entering the network) increases, several lanes in the traffic network will become congested and unable to handle incoming vehicles. This condition continues with vehicles starting to pile up and the vehicle speed decreasing, resulting in a reduced number of vehicles leaving the traffic network. This is the condition that arises in the L3 region.
Figure 3 illustrates that as traffic demand continues to rise, the operating point in the MFD graph increasingly shifts towards the L3 and L4 regions. This is because vehicles accumulate when the number of exiting vehicles decreases while the number of arriving vehicles does not. This traffic condition is represented by the L4 region, specifically when the traffic network is completely gridlocked. In this case, several intersections are gridlocked, and traffic controllers are powerless to intervene. Optimal MFD conditions are characterized by higher and wider flow peaks that do not approach the L4 region. The traffic network performance can be improved by expanding the L2 region in the MFD graph so that the peak value of the vehicle outflow from the traffic network increases. The L2 region can be expanded by raising the gradient in the L1 region to increase the average vehicle velocity and by applying a traffic controller that can defer the transition of the traffic conditions to the L3 region.

3. Reinforcement Learning in Traffic Control

Reinforcement learning can be conducted through trial and error without making unfounded assumptions about the traffic model [35]. Reinforcement learning capabilities are needed because traffic circumstances are always changing and becoming more complicated so that the assumptions that were used in developing the traffic model become irrelevant to actual conditions. Utilizing reinforcement learning, the control algorithm can generate strategies for controlling a traffic system, based on the feedback obtained. In traffic control scenarios, the reinforcement learning agent receives various forms of information, such as queue length, waiting time, traffic flow, and so forth, at any given moment. The information on traffic conditions provided to the agents describes the state of the environment to be used as the basis for the agents to select an action.

3.1. Reinforcement Learning

The concept of reinforcement learning is that an agent interacts with an environment, taking actions and learning through trial and error to maximize its cumulative reward over time. The way a learning agent interacts with its environment, as manifested in states, actions, and rewards, is defined by reinforcement learning with the formal framework of the Markov decision process (MDP) [36]. The Markov decision process is represented by a tuple $(S, A, P, R, \gamma)$, which stands for states, actions, transition probabilities, rewards, and discount factor, respectively. The probability distribution of the next state is defined by $P(s', r \mid s, a)$: when any action $a \in A(s)$ is taken in any state $s \in S$, it gives a new state $s'$ and an expected reward $r \in R$. A policy $\pi(a \mid s)$ for the MDP is a mapping of any state $s \in S$ to a probability distribution over $A(s)$, while the state-value function $V_\pi(s)$ defines the expected return when starting in a particular state $s$ and adhering to policy $\pi(\cdot \mid s)$ [36], defined as,
$$V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_t = s\right]$$
Correspondingly, the expected return when starting in a particular state $s$, taking the action $a$, and adhering to policy $\pi(\cdot \mid s)$ is defined by the action-value function $Q_\pi(s, a)$ [36], that is,

$$Q_\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \,\middle|\, S_t = s,\ A_t = a\right]$$
The purpose of reinforcement learning is to determine the policy that maximizes the cumulative reward. A policy $\pi^*$ is optimal if its expected return is higher than or equal to that of any policy $\pi$ for all states. As in [36], the optimal state-value function $V^*$ and action-value function $Q^*$ are defined as follows to describe optimality,
$$V^*(s) = \max_\pi V_\pi(s), \qquad \forall s \in S,$$

$$Q^*(s, a) = \max_\pi Q_\pi(s, a), \qquad \forall (s, a) \in S \times A,$$
These functions provide the expected return for taking action $a$ in state $s$ and then following an optimal policy, for the state-action pair $(s, a)$. Based on [36], $Q^*$ can be expressed in terms of $V^*$ as follows:

$$Q^*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma V^*(S_{t+1}) \,\middle|\, S_t = s,\ A_t = a\right]$$
Under an optimal policy, the value of a state must be equal to the expected return of the optimal action in that state. To express this, the Bellman optimality equations [36] are used, which can be written in terms of $V^*$ and $Q^*$ as follows,

$$\begin{aligned} V^*(s) &= \max_a Q^*(s, a) \\ &= \max_a \mathbb{E}\left[R_{t+1} + \gamma V^*(S_{t+1}) \,\middle|\, S_t = s,\ A_t = a\right] \\ &= \max_a \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma V^*(s')\right] \end{aligned}$$

$$\begin{aligned} Q^*(s, a) &= \mathbb{E}\left[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\middle|\, S_t = s,\ A_t = a\right] \\ &= \sum_{s', r} P(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^*(s', a')\right] \end{aligned}$$
Several approaches have been proposed to provide optimal solutions, including value iteration [37,38], policy iteration [39,40], SARSA [41,42], and Q-learning [43,44].
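For instance, tabular Q-learning approximates the Bellman optimality equation from sampled transitions; the following minimal sketch (a hypothetical toy setup, not the traffic agent developed later) shows a single update step.

```python
from collections import defaultdict

# Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Toy usage: two actions, one sampled transition (s=0, a=1, r=1.0, s'=1).
actions = [0, 1]
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # 0.1
```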

3.2. Deep Q-Network

The deep Q-network (DQN) is a value-based reinforcement learning algorithm that uses a deep neural network $Q_\theta$ to approximate the optimal Q-function (action-value function) $Q^*$ [45]. The neural network receives a state $s$ as input and produces an estimate of the Q-value of every potential action $a$ as output. This is how the neural network maps states to specific actions, based on each action value. Instead of utilizing only the last obtained experience to update the Q-function, the DQN stores each transition as a tuple $(S_t, A_t, R_t, S_{t+1})$ in a replay memory $D$, and then samples experiences uniformly at random in mini-batches to train the parameters of the neural network using stochastic gradient descent. The DQN uses the replay memory both to limit network learning from correlated experiences and to enable re-learning from prior experiences.
The DQN utilizes two distinct neural networks: the target neural network $Q_{\bar\theta}$ and the online neural network $Q_\theta$. For each experience, the DQN calculates the target $r_t + \gamma \max_{a_{t+1}} Q_{\bar\theta}(s_{t+1}, a_{t+1})$ using the target network and the estimate $Q_\theta(s_t, a_t)$ using the online network. Thus, the loss function used to update the parameters $\theta$ of the online network can be expressed as follows [46],

$$L(\theta) = \frac{1}{n} \sum_{t=1}^{n} \left( r_t + \gamma \max_{a_{t+1}} Q_{\bar\theta}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right)^2$$
The target network parameters θ ¯ are only updated periodically to the online network parameters. Training can progress more steadily in this approach. In essence, the goal of the DQN is to match the estimated Q-value to a more precise target.
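A condensed sketch of this update in Keras/TensorFlow syntax is given below; the network sizes, batch handling, and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

def build_q_network(state_dim, n_actions):
    # Small fully connected network mapping a state to one Q-value per action.
    return keras.Sequential([
        keras.layers.Input(shape=(state_dim,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])

state_dim, n_actions, gamma = 8, 3, 0.95
online_net = build_q_network(state_dim, n_actions)
target_net = build_q_network(state_dim, n_actions)
target_net.set_weights(online_net.get_weights())      # periodic copy theta -> theta_bar
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

def train_on_batch(states, actions, rewards, next_states):
    # Target: r + gamma * max_a' Q_target(s', a'); estimate: Q_online(s, a).
    targets = (rewards + gamma * np.max(target_net.predict(next_states, verbose=0), axis=1)
               ).astype(np.float32)
    with tf.GradientTape() as tape:
        q_values = online_net(states)
        q_taken = tf.reduce_sum(q_values * tf.one_hot(actions, n_actions), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, online_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, online_net.trainable_variables))
    return float(loss)
```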
The ϵ-greedy exploration approach is often used to define the new policy during an iteration. For every state $s$, an action is chosen at random with probability ϵ; with probability 1 − ϵ, the action that maximizes $Q_t$ is selected. Other approaches for determining new policies exist, such as Boltzmann exploration [36]. Boltzmann exploration determines the action based on the Boltzmann (softmax) distribution over the acquired Q-values, influenced by a temperature parameter τ. The parameter τ controls how strongly the Q-values shape the action probabilities: when the temperature is high ($\tau \to \infty$), practically all of the actions have the same probability, whereas when the temperature is low ($\tau \to 0$), the probability of the action with the maximum Q-value is very close to one. The Boltzmann function [36] is expressed as follows,

$$S(a) = \frac{\exp\left(Q_t(a)/\tau\right)}{\sum_{a' \in A} \exp\left(Q_t(a')/\tau\right)}$$
The selected action will be determined by the maximum value derived from the softmax function. The benefit of the Boltzmann exploration over ϵ-greedy is that the information regarding the potential value of alternative actions can also be considered.
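A minimal sketch of the Boltzmann (softmax) action distribution follows; here the action is drawn from the resulting probabilities, which is the standard form of Boltzmann exploration, and the function names are illustrative.

```python
import numpy as np

def boltzmann_probabilities(q_values, tau):
    """Softmax over Q-values with temperature tau (numerically stabilized)."""
    z = (np.asarray(q_values) - np.max(q_values)) / tau
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def boltzmann_action(q_values, tau, rng=None):
    # Draw an action according to the Boltzmann distribution over Q-values.
    if rng is None:
        rng = np.random.default_rng()
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choice(len(probs), p=probs)

# High temperature -> nearly uniform; low temperature -> concentrates on the best action.
print(boltzmann_probabilities([1.0, 2.0, 0.5], tau=10.0))
print(boltzmann_probabilities([1.0, 2.0, 0.5], tau=0.1))
```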

3.3. Agent Design

(1) State. The state s consists of four traffic features representing the traffic state at each intersection, namely an array indicating the currently active green phase, a binary variable showing whether the minimum green time in the current phase has already passed, the density of incoming vehicles at each phase, and the queue length of incoming vehicles at each phase. The density and queue length are defined for each phase to assist the agent in making an action decision.
(2) Action. The action a is in the form of the phase selection at each intersection. In the traffic network model, each intersection has a different number of phases, which may be two or three phases. The agent will evaluate the performance of the action every five seconds and can then choose another action. The selected action has a maximum duration of 25 s.
(3) Reward. In this paper, the reward is defined based on the max-pressure, quantifying the pressure of the upstream and downstream traffic at an intersection. The reward, $r_i$, is defined as,

$$r_i = -P_i = -\sum_{b=1}^{B} \left| \sum_{l \in L_{p,\mathrm{ups}}} \frac{n_l}{x_l} - \sum_{k \in L_{p,\mathrm{dws}}} \frac{n_k}{x_k} \right|$$

where $P_i$ is the pressure at intersection $i$, $B$ is the number of phases, $n_l$ and $n_k$ are the numbers of upstream and downstream vehicles, respectively, and $x_l$ and $x_k$ are the lengths of the upstream and downstream lanes. In this study, the following reward variation was used,

$$r_i = -\left(w_a \times P_t + w_b \times Ql_t\right) \qquad (21)$$

Equation (21) defines the combination of the total intersection pressure and the total queue length of vehicles at the intersection with varying pressure and queue-length weights. The variables $P_t$ and $Ql_t$ represent the phase pressure at time $t$ and the queue length in the upstream phase at time $t$, respectively. The values of the weight coefficients $w_a$ and $w_b$ indicate the preferred reward.
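As an illustration of how such a reward could be computed from lane measurements under the definitions above (the data layout and function name are assumptions for this sketch):

```python
def intersection_reward(phases, w_a, w_b):
    """Negative weighted sum of total phase pressure and total upstream queue length.

    phases: list of dicts with per-phase lane data, e.g.
        {"ups": [(n_l, x_l, q_l), ...], "dws": [(n_k, x_k), ...]}
    where n is the vehicle count, x the lane length and q the queue length.
    """
    total_pressure, total_queue = 0.0, 0.0
    for phase in phases:
        ups = sum(n / x for n, x, _ in phase["ups"])
        dws = sum(n / x for n, x in phase["dws"])
        total_pressure += abs(ups - dws)
        total_queue += sum(q for _, _, q in phase["ups"])
    return -(w_a * total_pressure + w_b * total_queue)
```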

4. Implementation

4.1. Traffic Control Using the Webster Algorithm

The equations used in the design of the Webster algorithm-based traffic controllers are given in Section 2.3. The value of the saturated vehicle flow in Equation (4) corresponds to the value utilized in Genders and Razavi’s research [33], which was 0.44 veh/s, or 1584 vehicles in one hour. The calculation of the optimal cycle length is limited to a minimum of 40 s and a maximum of 180 s to avoid short cycles, which could trigger many yellow and red lights, and long cycles, which could lead to a phase being activated for an overly long period. The optimal cycle length computation is updated every 1800 s ($W$), ensuring the cycle is recalculated at least ten times. A lower $W$ value increases the controller’s adaptability but could produce instability, whereas a large $W$ value increases the stability but reduces the adaptability.
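A compact sketch of the Webster cycle-length computation with the parameter values quoted above (0.44 veh/s saturation flow, 40–180 s bounds) is shown below; the code is illustrative and the fallback for over-saturated demand is an assumption.

```python
SATURATION_FLOW = 0.44   # veh/s per lane, value used in [33]
C_MIN, C_MAX = 40, 180   # cycle-length bounds in seconds

def webster_cycle_length(phase_volumes, phase_lanes, lost_time):
    """Optimal cycle length C = (1.5*L_t + 5) / (1 - sum(Y_i)), clipped to [C_MIN, C_MAX].

    phase_volumes: critical vehicle volume V_i of each phase (veh/s)
    phase_lanes:   number of incoming lanes n_j of each phase
    lost_time:     total lost time L_t in the cycle (s)
    """
    Y = [v / (SATURATION_FLOW * n) for v, n in zip(phase_volumes, phase_lanes)]
    denom = 1.0 - sum(Y)
    if denom <= 0:           # demand at or above capacity: fall back to the longest cycle
        return C_MAX
    cycle = (1.5 * lost_time + 5.0) / denom
    return min(max(cycle, C_MIN), C_MAX)

# Example: two phases, 12 s lost time per cycle.
print(webster_cycle_length([0.20, 0.15], [2, 2], lost_time=12))
```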

4.2. Traffic Control Using the Max-Pressure Algorithm

The max-pressure is considered one of the state-of-the-art methods in traffic control [7]. A slot-based max-pressure was used, which operates by activating the phase with the highest pressure value. The max-pressure Equation (1), proposed by Genders and Razavi [33], is appropriate for intersections whose roads have relatively the same length in each phase (similar road capacity). However, when the length of the road in each phase varies, based on the actual traffic network conditions, as was the case in this study, this equation becomes less representative of the actual pressure conditions. One possible approach to solve this is to include the road capacity information in the pressure calculation, as performed by Wei et al. [26],
$$P_p = \sum_{l \in L_{p,\mathrm{ups}}} \frac{n_l}{n_{\mathrm{max},l}} - \sum_{k \in L_{p,\mathrm{dws}}} \frac{n_k}{n_{\mathrm{max},k}}$$

where $P_p$ represents the pressure value of each phase $p$, $n_l$ is the number of vehicles in the upstream lane, $n_k$ is the number of vehicles in the downstream lane, and $n_{\mathrm{max}}$ indicates the lane’s maximum capacity. However, this equation is difficult to implement because, in this study, the vehicles have a different length for each type; hence, the vehicle capacity that a lane can accommodate differs for distinct combinations of vehicle types. In order to address this issue, the lane capacity value was substituted by the length of the relevant lane, assuming that lane capacity and lane length are directly proportional. Hence, the pressure equation for each phase is defined as,

$$P_p = \sum_{l \in L_{p,\mathrm{ups}}} \frac{n_l}{x_l} - \sum_{k \in L_{p,\mathrm{dws}}} \frac{n_k}{x_k}$$
where x l is the upstream lane length and x k is the downstream lane length. Following the calculation of the pressure value for each intersection, this value is sorted from largest to smallest, and the phase with the greatest pressure is chosen to be activated. If two or more phases have the same pressure value, the phase with the most movement directions will be chosen. If there are zero vehicles at an intersection at the start of the simulation, the phase will be randomly chosen.
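A sketch of the slot-based selection rule described above, with the phase of highest length-normalized pressure chosen and ties broken by the number of movement directions (the data layout is an assumption for illustration):

```python
import random

def phase_pressure(ups_lanes, dws_lanes):
    """Pressure with lane counts normalized by lane length: sum(n/x) upstream minus downstream."""
    return sum(n / x for n, x in ups_lanes) - sum(n / x for n, x in dws_lanes)

def select_phase(phases):
    """phases: list of dicts {"ups": [(n, x), ...], "dws": [(n, x), ...], "movements": int}."""
    pressures = [phase_pressure(p["ups"], p["dws"]) for p in phases]
    if all(p == 0 for p in pressures):           # empty intersection: pick a phase at random
        return random.randrange(len(phases))
    best = max(pressures)
    candidates = [i for i, p in enumerate(pressures) if p == best]
    # Tie-break: phase with the most movement directions
    return max(candidates, key=lambda i: phases[i]["movements"])
```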

4.3. Traffic Control Using Reinforcement Learning

In this study, the reinforcement learning algorithm was utilized with the Q-value approximation approach in the form of the deep Q-network to control traffic. The multi-agent DQN was chosen because of the high possibility of variation in the number of states, consisting of several actions and intersections, and each intersection is controlled by a single agent. Figure 4 depicts the agent design of the DQN algorithm. Details of the agent design are explained in Section 3.3.
According to PressLight [26], tuning the weights of a reward derived from the pressure and queue length may result in drastically varying travel times. However, based on the outcomes of the experiments, traffic control that prioritizes pressure yields good results before the saturation flow is reached, while traffic control that emphasizes queue length is more successful after the saturation flow is reached. As a result, we cannot apply fixed weights to the reward based on pressure and queue length. Therefore, this paper proposes a reward using an exponential equation to seamlessly alter the weight across the region accentuating either pressure or queue length. This region is determined by the interchange of the control performance that prioritizes pressure with the control performance that emphasizes queue length at a particular density point, based on the density distribution. In order to account for the density distribution, the weight coefficients $w_a$ and $w_b$ are modified as follows,
$$w_a = e^{-\beta k_t}$$

$$w_b = 1 - e^{-\beta k_t}$$

where $k_t$ is the vehicle density at time $t$ and $\beta$ sets the threshold between the regions dominated by pressure and by queue length. Thus, the reward of Equation (21) given to the DQN agent becomes,

$$r_i = -\left(e^{-\beta k_t} \times P_t + \left(1 - e^{-\beta k_t}\right) \times Ql_t\right) \qquad (26)$$
Equation (26) employs an exponential approach, allowing the weight coefficient to adjust adaptively, in response to the vehicle density conditions. An explanation of determining the value of β will be given in Section 5.
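A minimal sketch of the adaptive exponential weighting is given below; the β value of 0.0277 is the one derived in Section 5, and the function name and inputs are illustrative.

```python
import math

BETA = 0.0277  # chosen so that w_a = w_b = 0.5 at a density of 25 veh/km (see Section 5)

def adaptive_reward(total_pressure, total_queue, density):
    """Exponential reward: pressure dominates at low density, queue length at high density."""
    w_a = math.exp(-BETA * density)
    w_b = 1.0 - w_a
    return -(w_a * total_pressure + w_b * total_queue)

# At 10 veh/km the pressure term dominates; at 40 veh/km the queue term dominates.
print(adaptive_reward(3.0, 12.0, density=10))
print(adaptive_reward(3.0, 12.0, density=40))
```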

5. Result and Discussion

This section presents the experiments to evaluate the control performance of the deep Q-network. This paper deals with traffic control on a large-scale system, consisting of 31 controlled traffic intersections in Central Jakarta, spread over the Menteng and Senen districts, as shown in Figure 5. The traffic network modeling and simulation was performed in Eclipse SUMO 1.11.0. The incoming vehicle data, turning ratio, and traffic phase were taken from TomTom and SCATS on 17 January 2020. The incoming vehicle data was constructed only at the entry points on the edges of the traffic network model, while the turning ratio was constructed at every turning point in the modeled traffic network. The ratio of the simulated vehicles consisted of 23.7% passenger cars, 2.5% buses, 4.6% trucks, and 69.2% motorcycles. The traffic phase was assigned at every controlled intersection, with each intersection having two or three traffic phases. The DQN was implemented using the Keras framework and communicated with SUMO through TraCI. The DQN was trained for fifty episodes from 08.00 to 09.00 and tested from 06.00 to 23.00. The DQN hyperparameters are shown in Table 1.
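To illustrate how DQN agents can interact with the SUMO simulation through TraCI at the five-second decision interval described in Section 3.3, a highly condensed sketch of one episode is shown below; the configuration file name, traffic-light and lane identifiers, and the agent interface are placeholders, not the authors' actual setup.

```python
import traci

# Placeholder identifiers; the real network has 31 controlled intersections.
SUMO_CMD = ["sumo", "-c", "jakarta.sumocfg"]     # hypothetical config file
TLS_IDS = ["tls_0", "tls_1"]                     # hypothetical traffic-light IDs

def lane_state(lane_ids):
    """Per-lane vehicle counts and queue (halting) counts read through TraCI."""
    counts = [traci.lane.getLastStepVehicleNumber(l) for l in lane_ids]
    queues = [traci.lane.getLastStepHaltingNumber(l) for l in lane_ids]
    return counts, queues

def run_episode(agents, incoming_lanes, steps=3600, decision_interval=5):
    traci.start(SUMO_CMD)
    for step in range(steps):
        if step % decision_interval == 0:
            for tls_id, agent in zip(TLS_IDS, agents):
                counts, queues = lane_state(incoming_lanes[tls_id])
                phase = agent.act(counts + queues)          # DQN phase selection
                traci.trafficlight.setPhase(tls_id, phase)  # apply the chosen phase
        traci.simulationStep()
    traci.close()
```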
In this research, we analyzed the traffic control performance using the MFD to represent the relationship between the vehicle flow and the vehicle density in the traffic network. The DQN algorithm was first combined with an action selection method, known as ϵ-greedy exploration, with the reward defined as the total pressure at the intersection. This traffic control algorithm was applied to a traffic model of Central Jakarta, yielding the MFD graph results shown below.
Figure 6 shows that the developed DQN algorithm achieved a higher peak flow (202 veh/h) than the classical and conventional control algorithms employed for comparison, namely the Webster (168 veh/h), the max-pressure (175 veh/h), and the uniform (180 veh/h) controls. However, the developed DQN control had a lower throughput value (45,262 vehicles) than the classical controls. These results indicate that while the DQN algorithm’s peak flow was higher than the classical controls’ peak flow, this high flow could not be sustained over time. This condition caused vehicles to accumulate in the network, increasing the network density as the simulation progressed. Furthermore, in terms of the vehicle density distribution, the developed reinforcement learning method was unable to minimize the congestion. As seen in Figure 6, the vehicle density pattern continued to expand to the right as the vehicle flow hit its lowest point, indicating that the DQN control reached the L4 region of the MFD. In the MFD, the L4 region indicates that the simulated traffic has reached a gridlock condition. In order to improve the performance of the traffic control algorithm, the action selection method was replaced with the Boltzmann exploration. Figure 7 shows the MFD graph using the Boltzmann exploration technique.
Once the action selection method was replaced with the Boltzmann exploration, the maximum density in the traffic network was reduced to 35 veh/km, accompanied by an insignificant decrease of the peak flow from 202 veh/h to 200 veh/h. The resultant throughput value grew from 45,268 vehicles to 49,923 vehicles but was still lower than with the max-pressure and the Webster. The next step in improving the DQN control’s performance was to vary the reward $r_i$ by including the queue length in addition to the pressure reward, according to Equation (21). The value variations of the coefficients $w_a$ and $w_b$ investigated in this study are shown in Table 2.
Figure 8 depicts the MFD graph with the variation of the reward weight coefficients. Each adjustment of the weight coefficient values yielded an MFD with a different control performance. The coefficient values prioritizing the pressure reward (DQN-0.9-0.1 and DQN-0.7-0.3) tended to produce a greater peak flow than the coefficient values prioritizing the queue length reward (DQN-0.3-0.7 and DQN-0.1-0.9) but were accompanied by a considerably higher vehicle density. The DQN-0.5-0.5 agent generated better performance outcomes compared to the other coefficient value variations. Although the DQN-0.5-0.5 produced a maximum vehicle density of 33 veh/km (greater than the DQN-0.1-0.9 at 30 veh/km), this was compensated by a peak flow (213 veh/h) higher than that of the other variations. The DQN-0.5-0.5 provided the highest throughput, indicating that it could maintain a high vehicle flow for a long period of time. The DQN-0.1-0.9 had a low throughput value: it generated the lowest maximum vehicle density by reducing the number of vehicles that could enter the traffic network model rather than by increasing the number of vehicles exiting the network model.
The DQN-0.5-0.5 agent achieved better results, in terms of vehicle flow and throughput, when compared to the classical and conventional control algorithms shown in Figure 9. In terms of the vehicle density, the Webster control was able to produce a lower maximum vehicle density. The DQN-0.5-0.5 failed to prevent the vehicle density in the MFD from reaching the L4 region, resulting in a greater maximum vehicle density than with the Webster control. According to Figure 9, the vehicle density continued to shift to the right after attaining a lower vehicle flow (15 veh/h at a density of 31 veh/km). In order to improve this density condition, other reward weight coefficient values would have to be re-tested.
However, due to the lengthy processing time required for testing, such a trial-and-error strategy is ineffectual. In addition to the lengthy computation time, the use of a fixed reward weight coefficient from start to finish is considered inappropriate. The MFD can be regarded as consistent from start to finish if it exhibits the greatest flow peak followed by the lowest maximum density when compared to the other control algorithm variations. As can be seen in Figure 8, the MFD revealed that no coefficient variation was consistent from the start to the end of the simulation. In order to simplify the analysis of this condition, the MFD graph was evaluated when the rewards supplied to the DQN agents were purely pressure and purely queue length. As shown in Figure 10, the DQN-pressure agent had a greater peak flow than the DQN-queue agent, although this value was accompanied by a higher maximum vehicle density. Furthermore, the DQN-pressure failed to maintain the decreasing density trend in the 23 veh/km to 27 veh/km density region.
The downward density trend shown in Figure 10 indicates that there was a performance exchange between the DQN-pressure and the DQN-queue in the density range of 23 veh/km to 27 veh/km. Therefore, a DQN agent is needed that can swap between the pressure reward and the queue reward in this region to acquire the optimum performance for each condition. To meet this demand, a weight reward coefficient that adapts to the vehicle density circumstances is proposed, namely an adaptive weight reward. The reward equation with the adaptive weights is given in Equation (26). In order to simplify the testing of this adaptive reward mechanism, the midpoint of the performance exchange area in the MFD, 25 veh/km, was used. At the midpoint (exchange point), both $w_a$ and $w_b$ should be equal to 0.5. The $\beta$ value that places this crossover at a density of 25 veh/km is 0.0277. The MFD derived from the implementation of this adaptive reward mechanism is shown below.
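For reference, the β value above follows directly from requiring the pressure weight of Equation (26) to equal one half at the crossover density:

$$e^{-\beta k_t} = \tfrac{1}{2} \ \text{at} \ k_t = 25\ \text{veh/km} \;\Longrightarrow\; \beta = \frac{\ln 2}{25} \approx 0.0277$$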
Figure 11 illustrates that the DQN agent with adaptive reward mechanism (DQN-exponent) produced the highest peak flow and could maintain the maximum density (32 veh/km) at the lowest value, compared to the DQN-pressure and DQN-queue agents. The MFD graph also demonstrates the adaptability of the DQN-exponent, since when the vehicle density was low (less than 25 veh/km), the MFD pattern provided by the DQN-exponent replicated that of the DQN-pressure, while when the vehicle density was high, the MFD pattern resembled that of the DQN-queue. In terms of the throughput, the DQN-exponent provided the highest value, implying that the low maximum density in the MFD resulted from the DQN agent’s capacity to facilitate vehicles to exit the network model. Furthermore, the DQN-exponent agent’s performance was compared to the classical controls that were utilized in the initial comparison. The MFD graphs of the control algorithms were compared as follows.
Figure 12 illustrates that the DQN agent with an adaptive reward mechanism (DQN-exponent) achieved the highest peak flow (202 veh/h) and kept the maximum vehicle density at the lowest value (32 veh/km), when compared to the classical control algorithms utilized as a comparison. The DQN-exponent also succeeded in providing a higher throughput value (56,384 veh) than the classical control algorithms.
Additionally, advanced traffic sensing methods can be employed to obtain more accurate traffic information. Recently, Pu et al. [47] developed multimodal traffic speed monitoring based on passive Wi-Fi and Bluetooth sensing technology. Because vehicle detector loops have a limited range from the signalized intersection, their approach could provide accurate vehicle speed information when congestion occurs in the middle of road segments. As a result, this strategy may further improve the traffic control performance.

6. Conclusions

This paper proposed the adaptive deep Q-network algorithm with an exponential reward mechanism for the traffic control in an urban network, consisting of 31 intersections in Central Jakarta, spread over the Menteng and Senen districts. The adaptive reward mechanism, based on an exponential equation technique was developed to adjust the weight reward coefficient to the vehicle density condition. The Jakarta traffic network model was constructed utilizing real vehicle input data through SUMO and SCATS to reproduce real-world circumstances.
The adaptive deep Q-network algorithm with the exponential reward mechanism achieved the highest throughput (56,384 vehicles), compared to the classical control algorithms, the max-pressure (43,294 vehicles), the Webster (50,366 vehicles), and the uniform (46,241 vehicles). The significant increment of the vehicle throughput indicates that the traffic control could improve the region productivity, signifying that an intersection could handle more vehicles, reducing the congestion possibility. The proposed DQN prevented the vehicle density in the network from reaching the L4 MFD region with a maximum density of 32 veh/km, in contrast to the classical control algorithms, which failed to prevent this condition.
The pressure calculation used has a limitation. The lane’s maximum capacity in the pressure calculation was substituted by the length of the relevant lane, considering that different types of vehicles have different lengths. This approach assumes that lane capacity and length are directly proportional. In future work, it is necessary to evaluate the adaptive reward utilizing different equational approximations, besides the exponential approximation. Reinforcement learning action in the form of cycle length needs to be considered to meet real-world circumstances.

Author Contributions

Conceptualization, E.J., M.R.T.F., E.O.F., F.M. and H.Y.S.; methodology, E.J., M.R.T.F., E.O.F., F.M. and H.Y.S.; software, M.R.T.F. and E.O.F.; validation, E.J., H.Y.S. and F.M.; formal analysis, E.J., M.R.T.F., E.O.F., F.M. and H.Y.S.; investigation, E.J., H.Y.S. and F.M.; resources, E.J., H.Y.S., M.R.T.F. and E.O.F.; data curation, M.R.T.F. and E.O.F.; writing—original draft preparation, M.R.T.F., E.O.F. and A.P.; writing—review and editing, E.J., M.R.T.F., A.P., F.M., H.Y.S. and Y.A.H.; visualization, M.R.T.F., E.O.F. and A.P.; supervision, E.J., F.M., H.Y.S. and Y.A.H.; project administration, E.J. and Y.A.H.; funding acquisition, E.J. and Y.A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Competitive Basic Research under the program of the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia 2022, No. 033/E5/PG.02.00/2022.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are very grateful for the 2022 National Competitive Basic Research grant from the Ministry of Education, Culture, Research, and Technology of the Republic of Indonesia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qu, Z.; Pan, Z.; Chen, Y.; Wang, X.; Li, H. A distributed control method for urban networks using multi-agent reinforcement learning based on regional mixed strategy Nash-equilibrium. IEEE Access 2020, 8, 19750–19766. [Google Scholar] [CrossRef]
  2. Noaeen, M.; Naik, A.; Goodman, L.; Crebo, J.; Abrar, T.; Abad, Z.S.H.; Bazzan, A.L.C.; Far, B. Reinforcement learning in urban network traffic signal control: A systematic literature review. Expert Syst. Appl. 2022, 199, 116830. [Google Scholar] [CrossRef]
  3. Varaiya, P. The Max-Pressure Controller for Arbitrary Networks of Signalized Intersections. In Advances in Dynamic Network Modeling in Complex Transportation Systems; Springer: New York, NY, USA, 2013; pp. 27–66. [Google Scholar]
  4. Kouvelas, A.; Lioris, J.; Fayazi, S.A.; Varaiya, P. Maximum Pressure Controller for Stabilizing Queues in Signalized Arterial Networks. Transp. Res. Rec. 2014, 2421, 133–141. [Google Scholar] [CrossRef] [Green Version]
  5. Webster, F.V. Traffic Signal Settings; Road Research Technique Paper; Department of Scientific and Industrial Research: Delhi, India, 1957. [Google Scholar]
  6. Zhang, L.; Wu, Q.; Shen, J.; Lü, L.; Du, B.; Wu, J. Expression might be enough: Representing pressure and demand for reinforcement learning based traffic signal control. Int. Conf. Mach. Learn. 2022, 162, 26645–26654. [Google Scholar]
  7. Ramadhan, S.A.; Sutarto, H.Y.; Kuswana, G.S.; Joelianto, E. Application of area traffic control using the max-pressure algorithm. Transp. Plan. Technol. 2020, 43, 783–802. [Google Scholar] [CrossRef]
  8. Salomons, A.M.; Hegyi, A. Intersection Control and MFD Shape: Vehicle-Actuated Versus Back-Pressure Control. IFAC-PapersOnLine 2016, 49, 153–158. [Google Scholar] [CrossRef]
  9. Joelianto, E.; Utami, F.P.; Sutarto, H.Y.; Gautama, S.; Semanjski, I.; Fathurrahman, M.F. Performance Analysis of Max-Pressure Control System for Traffic Network using Macroscopic Fundamental Diagram. Int. J. Artif. Intell. 2022, 20, 1–23. [Google Scholar]
  10. Yan, F.; Tian, F.-L.; Shi, Z.-K. Iterative Learning Control Approach for Signaling Split in Urban Traffic Networks with Macroscopic Fundamental Diagrams. Math. Probl. Eng. 2015, 2015, 975328. [Google Scholar] [CrossRef] [Green Version]
  11. Wang, P.F.; Wada, K.; Akamatsu, T.; Hara, Y. An Empirical Analysis of Macroscopic Fundamental Diagrams for Sendai Road Networks. JSTE J. Interdiscip. Inf. Sci. 2015, 21, 49–61. [Google Scholar] [CrossRef] [Green Version]
12. Godfrey, J.W. The mechanism of a road network. Traffic Eng. Control 1969, 11, 323–327.
13. Geroliminis, N.; Daganzo, C.F. Existence of urban-scale macroscopic fundamental diagrams: Some experimental findings. Transp. Res. Part B Methodol. 2008, 42, 759–770.
14. Gayah, V.V.; Gao, X.; Nagle, A.S. On the impacts of locally adaptive signal control on urban network stability and the Macroscopic Fundamental Diagram. Transp. Res. Part B Methodol. 2014, 70, 255–268.
15. Genders, W.; Razavi, S. Using a deep reinforcement learning agent for traffic signal control. arXiv 2016, arXiv:1611.01142.
16. Chen, C.; Wei, H.; Xu, N.; Cheng, G.; Yang, M.; Xiong, Y.; Xu, K.; Li, Z. Toward a thousand lights: Decentralized deep reinforcement learning for large-scale traffic signal control. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 3414–3421.
17. Rizvi, S.A.A.; Lin, Z. Output feedback Q-learning control for the discrete-time linear quadratic regulator problem. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1523–1536.
18. Radac, M.B.; Precup, R.E. Data-driven model-free slip control of anti-lock braking systems using reinforcement Q-learning. Neurocomputing 2018, 275, 317–329.
19. Clarke, R.J.; Fletcher, L.; Greatwood, C.; Waldock, A.; Richardson, T.S. Closed-loop Q-learning control of a small unmanned aircraft. In Proceedings of the AIAA Scitech 2020 Forum, Orlando, FL, USA, 6–10 January 2020.
20. Iskandar, R.F.; Leksono, E.; Joelianto, E. Q-Learning Hybrid Type-2 Fuzzy Logic Control Approach for Photovoltaic Maximum Power Point Tracking Under Varying Solar Irradiation Exposure. Int. J. Intell. Eng. Syst. 2021, 15, 199–208.
21. Gheisarnejad, M.; Sharifzadeh, M.; Khooban, M.; Al-Haddad, K. Adaptive fuzzy Q-learning control design and application to grid-tied nine-level packed e-cell (PEC9) inverter. IEEE Trans. Ind. Electron. 2022, 70, 1071–1076.
22. Zamfirache, I.A.; Precup, R.E.; Roman, R.C.; Petriu, E.M. Reinforcement Learning-based control using Q-learning and gravitational search algorithm with experimental validation on a nonlinear servo system. Inf. Sci. 2022, 583, 99–120.
23. Lin, Y.; Dai, X.; Li, L.; Wang, F.-Y. An efficient deep reinforcement learning model for urban traffic control. arXiv 2018, arXiv:1808.01876.
24. Alemzadeh, S.; Moslemi, R.; Sharma, R.; Mesbahi, M. Adaptive Traffic Control with Deep Reinforcement Learning: Towards State-of-the-art and Beyond. arXiv 2020, arXiv:2007.10960.
25. Anirudh, R.; Krishnan, M.; Kekuda, A. Intelligent Traffic Control System using Deep Reinforcement Learning. In Proceedings of the International Conference on Innovative Trends in Information Technology (ICITIIT), Kottayam, India, 13–14 February 2020.
26. Wei, H.; Chen, C.; Zheng, G.; Wu, K.; Gayah, V.; Xu, K.; Li, Z. PressLight: Learning max pressure control to coordinate traffic signals in arterial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019.
27. Boukerche, A.; Zhong, D.; Sun, P. A novel reinforcement learning-based cooperative traffic signal system through Max-Pressure control. IEEE Trans. Veh. Technol. 2021, 71, 1187–1198.
28. Wu, Q.; Wu, J.; Shen, J.; Du, B.; Telikani, A.; Fahmideh, M.; Liang, C. Distributed agent-based deep reinforcement learning for large scale traffic signal control. Knowl.-Based Syst. 2022, 241, 108304.
29. Eom, M.; Kim, B.-I. The traffic signal control problem for intersections: A review. Eur. Transp. Res. Rev. 2020, 12, 50.
30. Bellemans, T.; Schutter, B.D.; Moor, B.D. Models for traffic control. J. A 2002, 43, 13–22.
31. Rasheed, F.; Yau, K.L.A.; Noor, R.M.; Wu, C.; Low, Y.C. Deep Reinforcement Learning for Traffic Signal Control: A Review. IEEE Access 2020, 8, 208016–208044.
32. Castillo, R.G.; Clempner, J.B.; Poznyak, A.S. Solving the multi-traffic signal-control problem for a class of continuous-time Markov games. In Proceedings of the 12th International Conference on Electrical Engineering, Computing Science and Automatic Control, Mexico City, Mexico, 26–30 October 2015.
33. Genders, W.; Razavi, S. An Open-Source Framework for Adaptive Traffic Signal Control. arXiv 2019, arXiv:1909.00395.
34. Wahaballa, A.M.; Hemdan, S.; Kurauchi, F. Relationship Between Macroscopic Fundamental Diagram Hysteresis and Network-Wide Traffic Conditions. Transp. Res. Procedia 2018, 34, 235–242.
35. Wei, H.; Zheng, G.; Gayah, V.; Li, Z. Recent Advances in Reinforcement Learning for Traffic Signal Control: A Survey of Models and Evaluation. ACM SIGKDD Explor. Newsl. 2021, 22, 12–18.
36. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: London, UK, 2018.
37. Bellman, R. The Theory of Dynamic Programming; Rand Corporation: Santa Monica, CA, USA, 1954.
38. Dai, P.; Mausam; Weld, D.S.; Goldsmith, J. Topological Value Iteration Algorithms. J. Artif. Intell. Res. 2011, 42, 181–209.
39. Howard, R.A. Dynamic Programming and Markov Processes; MIT Press: Cambridge, MA, USA, 1960.
40. Bertsekas, D.P. Approximate Policy Iteration: A Survey and Some New Methods. J. Control Theory Appl. 2011, 9, 310–335.
41. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report; Cambridge University: Cambridge, UK, 1994.
42. Zou, S.; Xu, T.; Liang, Y. Finite-sample analysis for SARSA with linear function approximation. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
43. Watkins, C.J.C.H.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. 1992, 8, 279–292.
44. Jang, B.; Kim, M.; Harerimana, G.; Kim, J.W. Q-Learning Algorithms: A Comprehensive Classification and Applications. IEEE Access 2019, 7, 133653–133667.
45. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
46. Fan, J.; Wang, Z.; Xie, Y.; Yang, Z. A Theoretical Analysis of Deep Q-Learning. arXiv 2020, arXiv:1901.00137.
47. Pu, Z.; Cui, Z.; Tang, J.; Wang, S.; Wang, Y. Multimodal Traffic Speed Monitoring: A Real-Time System Based on Passive Wi-Fi and Bluetooth Sensing Technology. IEEE Internet Things J. 2022, 9, 12413–12424.
Figure 1. Illustration of the traffic network models: (a) single intersection, (b) multiple intersections, (c) grid traffic network [29].
Figure 2. Illustration of the definition of pressure at an intersection [26].
Figure 3. Macroscopic fundamental diagram [10].
Figure 4. Illustration of the DQN implementation in traffic control.
Figure 5. Jakarta traffic network model in the Menteng and Senen districts.
Figure 6. MFD of the DQN-epsilon control compared with the classical control methods.
Figure 7. MFD of the DQN-epsilon and DQN–Boltzmann controls compared with the classical control methods.
Figure 8. MFD of the DQN–Boltzmann control with varying reward weight coefficients.
Figure 9. MFD of the DQN–Boltzmann-0.5-0.5 control compared with the classical control methods.
Figure 10. MFD of the DQN–Boltzmann-pressure and DQN–Boltzmann-queue controls.
Figure 11. MFD of the DQN-exponent control compared with the DQN-pressure and DQN-queue controls.
Figure 12. MFD of the DQN-exponent control compared with the classical control methods.
Table 1. DQN hyperparameters (hyperparameter: value).
Learning rate α: 0.001
Loss function L: MSE
Optimization: Adam
Discount factor γ: 0.95
Replay memory size D: 10,000
Minibatch size: 256
Target network update: every 1800 time steps
Initial epsilon ϵ: 1
Epsilon decay: 0.95
Minimum epsilon: 0.01
Temperature parameter τ: 0.5
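For readers who want to reproduce the configuration in Table 1, the following minimal Python sketch (not the authors' implementation; all identifiers are hypothetical) collects the listed values in a dictionary and shows how the two exploration schemes referenced in the figures, epsilon-greedy with multiplicative decay and Boltzmann (softmax) sampling with temperature τ = 0.5, are commonly realized from such settings.

import math
import random

# DQN hyperparameters as listed in Table 1 (dictionary keys are illustrative).
CONFIG = {
    "learning_rate": 0.001,        # alpha
    "loss": "MSE",
    "optimizer": "Adam",
    "discount_factor": 0.95,       # gamma
    "replay_memory_size": 10_000,  # D
    "minibatch_size": 256,
    "target_update_steps": 1800,   # copy online network to target network every 1800 steps
    "epsilon_initial": 1.0,
    "epsilon_decay": 0.95,
    "epsilon_min": 0.01,
    "temperature": 0.5,            # tau, used by Boltzmann exploration
}

def decay_epsilon(epsilon: float) -> float:
    # Multiplicative decay, clipped at the minimum epsilon from Table 1.
    return max(CONFIG["epsilon_min"], epsilon * CONFIG["epsilon_decay"])

def epsilon_greedy(q_values, epsilon: float) -> int:
    # With probability epsilon pick a random phase, otherwise the greedy phase.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values) -> int:
    # Softmax over Q-values with temperature tau; a lower tau is greedier.
    tau = CONFIG["temperature"]
    exps = [math.exp(q / tau) for q in q_values]
    r, acc = random.random() * sum(exps), 0.0
    for action, e in enumerate(exps):
        acc += e
        if acc >= r:
            return action
    return len(q_values) - 1

# Example: choose a traffic-signal phase from four hypothetical Q-values.
q = [0.2, 1.3, 0.7, -0.4]
print(epsilon_greedy(q, epsilon=0.05), boltzmann(q))

The epsilon schedule and the temperature value here simply mirror Table 1; the actual network architecture, optimizer setup and replay buffer are described in the main text and are not reproduced in this sketch.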
Table 2. Variation of the reward weight coefficient values (w_a, w_b).
(0.9, 0.1)
(0.7, 0.3)
(0.5, 0.5)
(0.3, 0.7)
(0.1, 0.9)
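The coefficient pairs in Table 2 weight the pressure and queue-length terms of the reward. The sketch below is only an assumed illustration of such a weighted reward, together with an MFD-style exponential amplification of the penalty near a critical density; it does not reproduce the paper's exact reward function, and every function name, default value and the specific exponential form are assumptions made for illustration.

import math

def weighted_reward(pressure: float, queue: float, w_a: float, w_b: float) -> float:
    # Negative weighted sum: larger pressure or queue gives a more negative reward.
    return -(w_a * pressure + w_b * queue)

def exponential_reward(pressure: float, queue: float, density: float,
                       critical_density: float, w_a: float = 0.5, w_b: float = 0.5) -> float:
    # Illustrative only: amplify the penalty exponentially as the measured
    # network density approaches or exceeds an assumed MFD critical density.
    amplification = math.exp(max(0.0, density - critical_density) / critical_density)
    return weighted_reward(pressure, queue, w_a, w_b) * amplification

# Example: sweep the (w_a, w_b) pairs of Table 2 for one hypothetical observation.
for w_a, w_b in [(0.9, 0.1), (0.7, 0.3), (0.5, 0.5), (0.3, 0.7), (0.1, 0.9)]:
    print(w_a, w_b, weighted_reward(pressure=12.0, queue=30.0, w_a=w_a, w_b=w_b))

Sweeping the pairs in this way corresponds to the reward-weight comparison reported in Figure 8, where the (0.5, 0.5) setting is the configuration labeled DQN–Boltzmann-0.5-0.5.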