Article

A Q-Learning-Based Approximate Solving Algorithm for Vehicular Route Game

1 School of Transport and Logistics, Guangzhou Railway Polytechnic, Guangzhou 510430, China
2 School of Management, Guangzhou Huali Science and Technology Vocational College, Guangzhou 511325, China
3 School of Physics and Optoelectronics, South China University of Technology, Guangzhou 510630, China
4 School of Transport and Communications, Shanghai Maritime University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(19), 12033; https://doi.org/10.3390/su141912033
Submission received: 31 August 2022 / Revised: 19 September 2022 / Accepted: 20 September 2022 / Published: 23 September 2022
(This article belongs to the Special Issue Advance in Transportation, Smart City, and Sustainability)

Abstract

The route game is recognized as an effective method to alleviate Braess' paradox, in which new traffic congestion arises because numerous vehicles obey the same guidance from a selfish route guidance system (such as Google Maps). Conventional route games are symmetric, since vehicles' payoffs depend only on the distribution of selected routes rather than on who chose them; this symmetry allows the precise Nash equilibrium to be solved by constructing a special potential function. However, with the arrival of smart cities, engineers are more concerned with the real-time performance of route schemes than with absolute optimality in real traffic, and re-constructing new potential functions for route games under dynamic traffic conditions is not an easy task. In this paper, in contrast to the hard-to-solve potential function-based precise method, a matched Q-learning algorithm is designed to generate an approximate Nash equilibrium of the classic route game for real-time traffic. An experimental study shows that the Nash equilibrium coefficients generated by the Q-learning-based approximate solving algorithm all converge to 1.00 and retain the required convergence under different traffic parameters.

1. Introduction

Traffic congestion has become a major issue for city managers, since it causes serious direct and indirect economic losses in modern cities [1,2,3]. Vehicle route guidance (VRG) can adjust the trajectories of travelers in the spatial dimension to mitigate traffic congestion [4]. According to the number of agents involved, route guidance can be divided into single-agent-based selfish VRG and multi-agent-based coordinated VRG.
A selfish VRG system provides a single vehicle's optimal path without considering the route schemes of other vehicles [5]. Drivers divert to the recommended free-flowing roads after receiving the congestion information. For example, based on the requirements of vehicles, the studies in [6,7,8] use traffic data obtained from selfish route guidance systems to choose paths by predicting traffic trends. In the past, due to the limitations of old on-board computers, vehicles struggled to interact with other agents while moving at speed [9]. Selfish route guidance systems (such as Google Maps) have therefore been adopted as the mainstream method for vehicle navigation [10].
With the proliferation of selfish VRG systems, however, researchers have found that road efficiency is reduced by Braess' paradox (traffic congestion switches to an alternative road because the same optimal route is recommended to numerous vehicles) [11]. Braess' paradox can reduce the running efficiency of discrete systems such as VRG, which has been studied in journals such as Operations Research [12,13,14,15].
Recently, with the emergence of the Internet of Vehicles (IoV), multi-agent-based coordinated VRG (modeled as a route game) has become a feasible way to alleviate Braess' paradox [16]. A route game model can consider users' routes systematically [17]. For example, the studies in [18,19] design two route games under the pure- and mixed-strategy assumptions. The research team in [20] investigates the effect of IoV information perturbations on the route game proposed in [18]. Moreover, based on the route game in [19], the work in [21] builds an information perturbation-based route game model under the mixed-strategy assumption. In these route games, the vehicles and their requirements are set as the players and payoffs, respectively, and the Nash equilibrium represents the final route schemes of the coordinated VRG.
For solving route games with multiple players, some milestone theorems have been proved. The study in [22] proposes the famous potential function to solve congestion games, which are commonly used to model independent players competing for limited resources (such as the route game in the VRG problem). Subsequently, the authors in [23,24,25] further study potential functions in congestion games with various structures (such as weighted congestion games). The potential function-based solving method is a precise algorithm that generates the pure-strategy Nash equilibrium of the existing route games.
In today's smart cities, engineers are increasingly concerned with the real-time performance of traffic management systems [26]. However, building a potential function is a manual task, and it is not easy to re-construct new potential functions for route games under dynamic traffic conditions [27]. Hence, compared with the hard-to-solve absolute optimum, approximate optimal routes that can be computed efficiently are more significant in real coordinated VRG.
In the existing literature on approximate Nash equilibria, intelligent algorithms (such as heuristics, machine learning and reinforcement learning) are often adopted to compute approximate solutions of games [28]. For example, the study in [29] presents a concurrent learning method to obtain an approximate Nash equilibrium for an N-player nonzero-sum game. The work in [30] uses a Stackelberg game to model the interactions between content providers and edge caching devices, and designs a reinforcement learning algorithm to solve the approximate Nash equilibrium of the proposed game model. In [31], an approximate Nash equilibrium is used to solve a multi-objective optimization problem by combining particle swarm optimization with a self-organizing map neural network. Among these intelligent algorithms, reinforcement learning approximates optimal solutions through self-learning during training [32] and can describe multistage discrete decisions such as those arising in pure strategy-based games.
To meet the real-time requirements of smart traffic, and in contrast to the existing hard-to-solve potential function-based precise algorithms, a matched Q-learning-based solving method (a type of reinforcement learning) is designed in this paper to generate the approximate Nash equilibrium of the classic route game. Specifically, based on the features of the route game, the virtual strategy set of all vehicles is taken as the current state, and only one vehicle updates its virtual strategy according to the optimization rules in each iteration. For the reward function, the concept of the Nash equilibrium coefficient (which quantifies the degree of Nash equilibrium) is introduced to evaluate the action in each iteration. Using the established Q-learning algorithm, the approximate Nash equilibrium of the route game converges to a unique solution in the basic traffic scenarios.
Finally, a microscopic traffic scenario and two experimental methods (the potential function-based precise algorithm and the designed Q-learning-based approximate method) are used to test the validity and robustness of the contributions. The comparisons show that the Nash equilibrium coefficients generated by the Q-learning-based approximate solving algorithm all converge to 1.00, which means that the obtained approximate routes coincide with the precise Nash equilibrium generated by the potential function. Moreover, the robustness analysis shows that the Q-learning-based solving algorithm retains the required convergence under different traffic parameters. The main contributions of this paper are summarized as follows:
  • To adapt to the real-time requirements of smart traffic, the Nash equilibrium coefficient of route games is proposed in this paper; it is defined as the proportion of vehicles whose current route strategies are optimal;
  • A Q-learning-based approximate solving algorithm is designed to generate the coordinated route schemes of the classic route game. This approximate Nash equilibrium-based method is more suitable for dynamic traffic than the hard-to-solve precise solving algorithm;
  • A decentralized route coordination framework (which can be applied to large road networks under the IoV assumption) is built to alleviate Braess' paradox in VRG.
The rest of this paper is organized as follows. Section 2 presents the assumptions and formalization of the route game. Section 3 describes the proposed model and solving algorithms. Section 4 presents the experiments and comparative results. Conclusions and potential future work are given in Section 5.

2. Basic Assumptions and Formalizations

In this section, the preliminary notions and the classic route game are formalized based on the available assumptions.

2.1. Concepts of the Art

There are some concepts that were not explained at length in the introduction:
  • Braess' paradox: with the proliferation of selfish VRG systems, road efficiency is reduced because numerous vehicles obey the same guidance from the selfish VRG systems [15];
  • Route games: in a route game, the vehicles and their requirements are set as the players and payoffs, respectively, and the Nash equilibrium represents the final coordinated routes. A route game can generate route strategies systematically for a group of vehicles based on the agents' interaction relationships [17];
  • Pure/mixed strategy: a pure strategy is a definite action chosen by a player in a game, whereas a mixed strategy is a probability distribution over actions [33];
  • Symmetric games: all agents are non-personalized, which leads to the symmetry of the payoff matrix. This symmetry gives symmetric games some special mathematical properties [20].

2.2. Formalization of the Route Game

2.2.1. Description of the Traffic Scenario

The route game is recognized as an effective way to alleviate Braess' paradox, in which new traffic congestion arises on an alternative road because numerous vehicles obey the same guidance from the selfish VRG systems (such as Google Maps). For the sake of a concise model, referring to the study in [34] (which investigates the price of anarchy of selfish routing), a basic traffic scenario with two contrasting paths is adopted to describe the classic route game. The details of the micro road network are shown in Figure 1.
In Figure 1, let $I = \{i \mid i = 1, 2, \ldots, n\}$ be the set of vehicles and $t_i$ be the travel time of vehicle i. To reduce the calculations without weakening the description of the contributions, the pure strategy is adopted in this paper's route game, and all vehicles are assumed to travel between the same origin-destination pair, from point O to point D. All vehicles are equipped with IoV devices and no overtaking occurs. With the IoV technique, driving data can be exchanged between fast-moving vehicles [35].
Each vehicle has to choose a path to its destination D. Let $E = \{\varepsilon \mid \varepsilon = 1, 2, \ldots, m\}$ be the set of alternative paths. In Figure 1, there are two optional paths for each vehicle i ($\varepsilon = 1$: O-3-1-2-D; $\varepsilon = 2$: O-3-4-2-D). The traffic lights are assumed to be all green to eliminate the influence of non-variable factors, and all traffic parameters use standardized units.
When driving, all vehicles originally choose the shortest path as their route strategy. This can generate traffic congestion once the road load exceeds the flow limit. When congestion occurs, the widely used selfish VRG systems (such as Google Maps) usually recommend the optimal alternative path to all vehicles. Because of this identical route diversion, selfish VRG may lead to Braess' paradox, where the traffic congestion simply switches from one road to an alternative one. Route games can alleviate the negative impact of Braess' paradox by considering users' routes systematically.

2.2.2. Establishment of the Classic Route Game

A complete route game consists of four elements: players, strategies, payoff functions, and game rules [36]. Similar to the existing route games [18,19,20,21], let vehicle i ($i \in I$) be a player, let the alternative path set E be the pure strategy set, and let static play be the rule. Let $\theta_i$ ($\theta_i = \varepsilon_i \in E$ under the pure strategy) and $\pi_i$ be the strategy and payoff of vehicle i, respectively. The value of $\pi_i$ depends on the selection of $\theta_i$ and $\theta_{-i}$, where
$\theta_{-i} = (\theta_1, \theta_2, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n).$
The travel time $t_i$ is usually adopted as the payoff $\pi_i$ of vehicle i, and engineers usually estimate it with the BPR formulation
$\pi_i(\theta_i, \theta_{-i}) = t_i = t_{\varepsilon_i} \left( 1 + \alpha \left( \frac{N_{\varepsilon_i}}{C_{\varepsilon_i}} \right)^{\beta} \right),$
where $t_{\varepsilon_i}$ denotes the free-flow travel time of path $\varepsilon_i$, $C_{\varepsilon_i}$ denotes the capacity of path $\varepsilon_i$, $N_{\varepsilon_i}$ denotes the current vehicle volume on path $\varepsilon_i$, and $\alpha$ and $\beta$ are constants fitted from large amounts of traffic data. $N_{\varepsilon_i}$ depends only on the vehicles' route strategies:
$N_{\varepsilon_i} = \sum_{i' \in I} T\{\theta_{i'} = \theta_i\},$
where $T\{\cdot\}$ is the indicator function.
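To make the payoff computation concrete, the following minimal Python sketch evaluates the BPR-based payoff of every vehicle for a given pure-strategy profile. The function and variable names are ours, and the numerical values simply mirror the scenario 1 parameters of Table 1; it is an illustration, not code from the paper.

from collections import Counter

def bpr_payoffs(strategies, t_free, capacity, alpha=1.0, beta=1.0):
    """Travel time (payoff) of each vehicle under a pure-strategy profile.

    strategies : list with the chosen path of every vehicle.
    t_free     : dict path -> free-flow travel time t_eps.
    capacity   : dict path -> capacity C_eps.
    """
    volume = Counter(strategies)  # N_eps: how many vehicles picked each path
    return [t_free[p] * (1.0 + alpha * (volume[p] / capacity[p]) ** beta)
            for p in strategies]

# Example: 4 vehicles on path 1 and 6 vehicles on path 2 (scenario-1-like numbers).
profile = [1] * 4 + [2] * 6
print(bpr_payoffs(profile, t_free={1: 30, 2: 26}, capacity={1: 10, 2: 10}))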
In the built pure strategy-based route game, each vehicle tries to improve its payoff $\pi$ by selecting an optimal strategy $\theta$. Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_n)$ be the strategy profile of all vehicles. A Nash equilibrium means that the final coordinated routes form a steady state for all vehicles.
Definition 1.
A Nash equilibrium of the pure strategy-based game is a special strategy profile $\Theta^* = (\theta_1^*, \theta_2^*, \theta_3^*, \ldots, \theta_n^*)$ satisfying
$\pi_i(\Theta^*) \leq \pi_i(\theta_1^*, \ldots, \theta_{i-1}^*, \theta_i, \theta_{i+1}^*, \ldots, \theta_n^*), \quad \forall i \in I, \ \forall \theta_i \in E.$
It means that no player can obtain a better payoff by changing its strategy alone [37]. The route game proceeds as follows:
  • Step 1. The game information (such as the strategies and payoff functions) is obtained by the vehicles through the IoV technology;
  • Step 2. Based on the built route game $G = \{\Theta; \pi_1, \pi_2, \ldots, \pi_n\}$, the on-board computer of each vehicle predicts the route strategies of the others;
  • Step 3. Based on the predicted route strategies of the other vehicles, all vehicles simultaneously calculate their own optimal route strategies $\theta$ and receive the payoffs $\pi$.
The payoff function Equation (1) shows that the route game is symmetric: the payoff of each vehicle depends only on the strategy distribution, not on who chooses (all players are non-personalized). A complex traffic network is composed of a large number of such basic road units, so the classic route game built in this paper can be extended to large road networks owing to its decentralized nature.

3. Contributions to Solving Algorithms

In this section, based on the symmetry of the classic route game, the precise solving algorithm is first constructed via the potential function as the control group. Subsequently, considering the limitations of the precise algorithm on smart roads, a matched Q-learning-based approximate solving algorithm is designed to calculate the coordinated routes of the classic route game.

3.1. Potential Function-Based Precise Algorithm

The symmetry of the payoff matrix is a significant feature of the classic route game. In [38], based on this mathematical property, the existence of a pure-strategy Nash equilibrium is proved for symmetric games. Subsequently, the study in [22] finds that the solutions of the potential functions are Nash equilibria of potential games. Considering the features of the route game built in Section 2.2, the precise Nash equilibrium is derived as follows:
When path $\varepsilon_i$ is chosen by vehicle i, define
$x_{\varepsilon_i} = \begin{cases} 1, & \text{path } \varepsilon_i \text{ is selected by vehicle } i \\ 0, & \text{otherwise} \end{cases} \qquad \varepsilon_i = 1, 2, \ldots, m,$
where $\sum_{\varepsilon_i = 1}^{m} x_{\varepsilon_i} = 1$; that is, each vehicle chooses exactly one of the alternative paths $\varepsilon$ as its route strategy. Let $N_\varepsilon$ be the number of vehicles that select path $\varepsilon$ ($\varepsilon \in E$):
$N_\varepsilon = \sum_{i=1}^{n} T\{\varepsilon = \varepsilon_i\}.$
Let $c_\varepsilon(k_\varepsilon)$ be the expected payoff of the vehicles that select path $\varepsilon$ when it is chosen by $k_\varepsilon$ vehicles; its value is calculated by Equation (1). The potential function of the established route game can then be constructed as
$\min \sum_{\varepsilon \in E} \sum_{k=0}^{N_\varepsilon} c_\varepsilon(k)$
subject to:
$x_{\varepsilon_i} = \begin{cases} 1, & \text{path } \varepsilon_i \text{ is selected by vehicle } i \\ 0, & \text{otherwise} \end{cases}$
$\sum_{\varepsilon_i = 1}^{m} x_{\varepsilon_i} = 1$
$N_\varepsilon = \sum_{i=1}^{n} T\{\varepsilon = \varepsilon_i\}$
$c_\varepsilon(k_\varepsilon) = t^{\mathrm{freeflow}}_{\varepsilon} \left( 1 + \alpha \left( \frac{k_\varepsilon}{C_\varepsilon} \right)^{\beta} \right).$
Let $E^* = (\varepsilon_1^*, \varepsilon_2^*, \ldots, \varepsilon_n^*)$ be a coordinated route set for all vehicles that solves the potential function Equation (5). If the route set $E^*$ is not a Nash equilibrium, there must exist a path $\varepsilon_i$ that gives some vehicle i a shorter travel time, that is,
$c_{\varepsilon_i}(k_{\varepsilon_i} + 1) < c_{\varepsilon_i^*}(k_{\varepsilon_i^*}).$
When the selected path of vehicle i changes from $\varepsilon_i^*$ to $\varepsilon_i$, let $N'_\varepsilon$ denote the number of vehicles choosing path $\varepsilon$ after the change. Following the arguments in [22], we have
$\sum_{\varepsilon \in E} \sum_{k=0}^{N'_\varepsilon} c_\varepsilon(k) = \sum_{\substack{\varepsilon \in E \\ \varepsilon \neq \varepsilon_i,\ \varepsilon \neq \varepsilon_i^*}} \sum_{k=0}^{N_\varepsilon} c_\varepsilon(k) + \left( \sum_{k=0}^{N_{\varepsilon_i}} c_{\varepsilon_i}(k) + c_{\varepsilon_i}(k_{\varepsilon_i}+1) \right) + \left( \sum_{k=0}^{N_{\varepsilon_i^*}} c_{\varepsilon_i^*}(k) - c_{\varepsilon_i^*}(k_{\varepsilon_i^*}) \right) = \sum_{\varepsilon \in E} \sum_{k=0}^{N_\varepsilon} c_\varepsilon(k) + c_{\varepsilon_i}(k_{\varepsilon_i}+1) - c_{\varepsilon_i^*}(k_{\varepsilon_i^*}) < \sum_{\varepsilon \in E} \sum_{k=0}^{N_\varepsilon} c_\varepsilon(k).$
If Equation (7) held, there would exist a path $\varepsilon_i$ that makes the value of Equation (5) smaller, which contradicts the fact that $E^*$ is a solution of Equation (5). Hence, Equation (7) cannot hold; that is, the vehicle route set satisfying the potential function Equation (5) is a pure-strategy Nash equilibrium of the route game, and the existence of the precise Nash equilibrium in the built route game is deduced mathematically.
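For the two-path scenario of Figure 1, the potential function can be minimized by enumerating the n + 1 possible flow splits. The Python sketch below illustrates this idea under illustrative scenario-1-like parameters; the enumeration strategy and all names are ours, not the authors' implementation, and it only covers the two-path case.

def potential(split, n, t_free, capacity, alpha=1.0, beta=1.0):
    """Rosenthal-type potential: sum over paths of the cumulative costs c_eps(0..N_eps)."""
    loads = {1: split, 2: n - split}  # N_eps for the two alternative paths
    cost = lambda p, k: t_free[p] * (1.0 + alpha * (k / capacity[p]) ** beta)
    return sum(cost(p, k) for p, load in loads.items() for k in range(load + 1))

def precise_equilibrium(n, t_free, capacity):
    """Enumerate every split (k vehicles on path 1, n - k on path 2) and pick the minimizer."""
    return min(range(n + 1), key=lambda k: potential(k, n, t_free, capacity))

k_star = precise_equilibrium(10, t_free={1: 30, 2: 26}, capacity={1: 10, 2: 10})
print(k_star, 10 - k_star)  # Nash flow split between the two paths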

3.2. Q-Learning-Based Approximate Solving Algorithm

3.2.1. Definition of Approximate Nash Equilibrium

In today's smart cities, engineers are primarily concerned with the real-time performance of traffic management systems. However, building the potential function is a manual task, and it is not easy to re-construct new potential functions under dynamic traffic conditions. Compared with the hard-to-solve potential function-based precise algorithm, an approximate route scheme that can be computed efficiently is more significant in real traffic [39].
In this section, to fill the above gap, the concept of the Nash equilibrium coefficient $\eta$,
$\eta = \frac{1}{n} \sum_{i=1}^{n} T\{\pi_i(\theta_i, \theta_{-i}) \leq \pi_i(\theta_i', \theta_{-i}), \ \forall \theta_i' \in E\},$
is introduced into the route game, where $T\{\cdot\}$ denotes the indicator function and $\theta_i'$ denotes any selectable path for vehicle i.
According to this definition, if the route strategies $\theta_{-i}$ of the other vehicles remain the same, the indicator $T\{\pi_i(\theta_i, \theta_{-i}) \leq \pi_i(\theta_i', \theta_{-i}), \ \forall \theta_i' \in E\}$ equals 1 exactly when the current route strategy $\theta_i$ is optimal for vehicle i. The Nash equilibrium coefficient $\eta$ is therefore the proportion of vehicles whose current route strategy is optimal.
In the route game proposed in this paper, the Nash equilibrium coefficient is used in place of the precise Nash equilibrium. When the coefficient reaches 1.00 ($\eta = 1.00$), the corresponding route strategy set is a pure-strategy Nash equilibrium of the route game.
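The coefficient can be computed directly from this definition by checking, for each vehicle, whether any unilateral path change would reduce its travel time. A minimal Python sketch follows; the helper names and the illustrative parameters are ours.

from collections import Counter

def travel_time(path, volume, t_free, capacity, alpha=1.0, beta=1.0):
    return t_free[path] * (1.0 + alpha * (volume / capacity[path]) ** beta)

def nash_coefficient(strategies, paths, t_free, capacity):
    """Share of vehicles whose current path is a best response to the others' choices."""
    counts = Counter(strategies)
    optimal = 0
    for current in strategies:
        t_now = travel_time(current, counts[current], t_free, capacity)
        # Travel time on the best alternative path if this vehicle deviated alone.
        t_alt = min(travel_time(p, counts[p] + 1, t_free, capacity)
                    for p in paths if p != current)
        optimal += int(t_now <= t_alt)
    return optimal / len(strategies)

profile = [1] * 4 + [2] * 6  # the split found by the precise algorithm above
print(nash_coefficient(profile, paths=[1, 2], t_free={1: 30, 2: 26}, capacity={1: 10, 2: 10}))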

3.2.2. Q-Learning Matched with the Route Game

In the existing literature on approximate Nash equilibria, intelligent algorithms (such as heuristics, machine learning and reinforcement learning) are often adopted to compute approximate solutions of games [28]. Among them, Q-learning, which approximates optimal solutions by self-learning during training, can describe the multistage discrete decisions of the pure strategy-based game [21].
In this subsection, combining the features of the route game, a matched Q-learning-based solving method (a type of reinforcement learning) is designed to generate the approximate Nash equilibrium of the classic route game. A complete Q-learning algorithm consists of four elements: the reward function, the optimization rule, the state set and the action set [18,19,20].
For the k-th iteration of the Q-learning algorithm, the virtual strategy set of all vehicles is taken as the state $s^k$:
$s^k = (\theta_1^k, \theta_2^k, \ldots, \theta_i^k, \ldots, \theta_n^k), \quad \theta_i^k \in E,$
where $\theta_i^k$ denotes the virtual strategy of vehicle i in the k-th iteration. That is, the state of each iteration is a feasible route scheme for the route game.
For the action $a^k$ selected in state $s^k$, only one vehicle updates its virtual strategy according to the optimization rules:
$a^k = (i, \theta_i^{k+1}), \quad i \in I, \ \theta_i^{k+1} \in E,$
where $\theta_i^{k+1}$ denotes the virtual strategy of vehicle i in state $s^{k+1}$. The value of $\theta_i^{k+1}$ is generated by the well-known ε-greedy algorithm (note that here ε is not the notation of a selectable path). The ε-greedy algorithm is conventional, and its specific steps are not described in detail due to space limitations.
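For reference, a minimal sketch of one common form of the ε-greedy selection step is given below (Python; names are ours). It assumes, as a convention choice, that the greedy coefficient denotes the probability of exploiting the best-known action, which matches the value ε = 0.8 used in Section 4.1; with the complementary probability a random action is explored.

import random

def epsilon_greedy(q_row, actions, eps=0.8):
    """Pick an action for the current state.

    q_row   : dict action -> current Q-value for this state.
    actions : list of feasible actions (vehicle index, new virtual path).
    eps     : assumed probability of exploiting the best-known action.
    """
    if random.random() < eps:
        return max(actions, key=lambda a: q_row.get(a, 0.0))  # exploit
    return random.choice(actions)                             # explore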
The state update is
$s^{k+1} = \Phi(s^k, a^k),$
where $\Phi$ denotes the deterministic mapping from $s^k$ to $s^{k+1}$; that is, the next state $s^{k+1}$ depends only on the current state $s^k$ and the action $a^k$.
For the reward function R, the Nash equilibrium coefficient $\eta$ is used to evaluate the action $a^k$ in each iteration:
$R^k(s^k, a^k) = \frac{1}{n} \sum_{i=1}^{n} T\{\pi_i(s^{k+1}) \leq \pi_i(\theta_i', \theta_{-i}^{k+1}), \ \forall \theta_i' \in E\}.$
R is the proportion of vehicles whose virtual route strategies are optimal; the values of $s^{k+1}$ and $\pi_i$ are generated by Equations (8) and (1), respectively.
The value-action matrix Q is updated by
$Q(s^k, a^k) \leftarrow (1 - \sigma) Q(s^k, a^k) + \sigma \left( R^k + \gamma \max_{a \in A} Q(s^{k+1}, a) \right),$
where $\sigma$ denotes the learning rate and $\gamma$ denotes the discount factor. The running details of the designed Q-learning algorithm are shown in Figure 2.
As shown in Figure 2, more than one Nash equilibrium may exist in the route game, so the final route solution $\Theta^e$ depends on the initial profile $\Theta^0$. The dimension of the Q-value matrix is $|S| \times |A|$, where S denotes the state set and A denotes the action set. The training task is to make the value-action matrix Q converge. The time complexity of the designed Q-learning algorithm is $O(n^3)$ (note that here n is not the notation of the vehicle quantity).
Due to the limited space, the calculation details of the Q-learning-based approximate Nash equilibrium are not described in this section. Its convergence is analyzed in Section 4 with numerical experiments.
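To make the training loop concrete, the sketch below outlines one plausible Python implementation of the matched Q-learning under the parameters of Section 4.1 (σ = 0.2, γ = 0.9, ε = 0.8). The dictionary-based Q-table, the inline BPR reward and all names are our assumptions rather than the authors' MATLAB implementation; the numbers again mirror scenario 1.

import random
from collections import Counter

T_FREE, CAP, PATHS = {1: 30, 2: 26}, {1: 10, 2: 10}, [1, 2]  # illustrative scenario
N_VEH, SIGMA, GAMMA, EPS = 10, 0.2, 0.9, 0.8

def time_on(path, volume):
    return T_FREE[path] * (1.0 + volume / CAP[path])  # BPR with alpha = beta = 1

def reward(state):
    """Nash equilibrium coefficient of the virtual route scheme `state`."""
    counts = Counter(state)
    best = 0
    for cur in state:
        t_now = time_on(cur, counts[cur])
        t_alt = min(time_on(p, counts[p] + 1) for p in PATHS if p != cur)
        best += int(t_now <= t_alt)
    return best / len(state)

def q_learning(episodes=500_000):
    q = {}                                                     # Q[(state, action)]
    state = tuple(random.choice(PATHS) for _ in range(N_VEH))  # initial virtual scheme
    actions = [(i, p) for i in range(N_VEH) for p in PATHS]    # one vehicle changes its path
    for _ in range(episodes):
        if random.random() < EPS:                              # epsilon-greedy selection
            act = max(actions, key=lambda a: q.get((state, a), 0.0))
        else:
            act = random.choice(actions)
        i, new_path = act
        nxt = state[:i] + (new_path,) + state[i + 1:]          # s^{k+1} = Phi(s^k, a^k)
        r = reward(nxt)
        q_next = max(q.get((nxt, a), 0.0) for a in actions)
        q[(state, act)] = (1 - SIGMA) * q.get((state, act), 0.0) + SIGMA * (r + GAMMA * q_next)
        state = nxt
    return state, reward(state)

final_state, eta = q_learning(50_000)
print(Counter(final_state), eta)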

3.3. Discussion

In this section, firstly, the potential function-based precise solving algorithm is constructed as the control group. Subsequently, considering the limitations of the precise algorithm on real-time roads, a matched Q-learning-based algorithm is designed to generate the approximate Nash equilibrium of the route game. In the Q-learning-based approximate solving method, the concept of the approximate Nash equilibrium is introduced into the route game; this method is more suitable for dynamic traffic than the hard-to-solve precise solving algorithm.

4. Numerical Experiments

In this section, a representative traffic scenario is built to verify the proposed contributions at the micro level. Firstly, as a control group, the potential function-based precise solving algorithm is run on four traffic scenarios with differentiated traffic parameters to prove the feasibility of the route game established in this paper. Then, the convergence of the Q-learning-based approximate solving algorithm is studied in the same traffic scenarios, and the converged values are compared with the precise solutions. Finally, the robustness of the Q-learning-based approximate solving algorithm is tested.

4.1. Preparation

For the numerical experiments, the running speed depends on the computer performance and the programming language. In this paper, MATLAB R2016a is used with its default parameters, and the processor of the computer is an AMD Ryzen 7 4800U with Radeon Graphics at 1.80 GHz.
The basic traffic scenario in Figure 1 is adopted to verify the contributions in this paper. To reduce the error caused by randomness, four different parameters (shown in Table 1) are used in the micro road network.
As shown in Table 1, in all four traffic scenarios the vehicle number is n = 10 and the number of alternative paths is m = 2. Traffic scenarios 1, 2 and 3 are general scenarios whose alternative roads differ from each other. Traffic scenario 4 is a special road network in which the traffic conditions of the alternative roads are identical.
In the payoff function Equation (1), the constants α and β are both set to 1.00. For the Q-learning parameters, following common settings, we set the learning rate σ = 0.2, the discount factor γ = 0.9, the greedy coefficient ε = 0.8, and the maximum number of iterations M = 100. Varying these parameters affects the speed of convergence but not the accuracy [40].

4.2. Effectiveness of the Precise Algorithm (Control Group)

The potential function-based precise solving algorithm for the route game is described in Section 3.1. It is used to prove that the route game can alleviate Braess' paradox, and it serves as the control group for testing the accuracy of the Q-learning-based approximate solving algorithm.
Let Method 1 be the random route selection from the alternative path set E , Method 2 be the selfish VRG without considering the interaction between vehicles’ decisions, and Method 3 be the route game solved by the potential function-based precise algorithm.
Let the number of samples be 20. Running the built potential function-based precise algorithm under the traffic parameters in Section 4.1, the results are shown in Table 2. Since solving Equation (5) in the control group is not the main task of this paper, the specific steps of this potential function-based precise method are not described in detail.
As shown in Table 2, the selfish VRG (Method 2) leads to a longer travel time than random path selection (Method 1) in all traffic scenarios, which means that road efficiency is reduced by Braess' paradox (congestion switches to an alternative road because the same optimal route is recommended to numerous vehicles). The route game-based coordinated VRG (Method 3) produces the shortest travel time among Methods 1, 2 and 3 by considering users' routes systematically, so Braess' paradox is mitigated; in other words, the feasibility of the route game is verified. It should be noted that the travel time of the vehicles depends only on the flow ratio because of the symmetry of the payoff matrix.

4.3. Availability of the Q-Learning-Based Approximate Solving Algorithm

In today's smart cities, the real-time performance of traffic management systems is a primary concern for engineers. However, building the potential function for the precise algorithm is a manual task that cannot adapt to dynamic traffic conditions. Hence, in contrast to the hard-to-solve absolute optimum, a Nash equilibrium coefficient-based approximate solving method that can be computed efficiently is designed in Section 3.2.
Let the number of samples be 20 (the initial route schemes of the 20 samples are different), and let the training times be $1 \times 10^4$, $1 \times 10^5$ and $5 \times 10^5$, respectively. Running the Q-learning-based approximate solving algorithm (designed in Section 3.2) on traffic scenario 1 gives the results shown in Figure 3, where the 20 samples are plotted as curves in different colors.
As shown in Figure 3a, when the training times are set to $1 \times 10^4$, the Nash equilibrium coefficients of the samples are spread between 0.30 and 1.00 due to insufficient training, and the corresponding travel times do not converge to a unique value; the generated coordination routes are therefore not a Nash equilibrium. When the training times are increased to $1 \times 10^5$, the results (shown in Figure 3b) are similar to those in Figure 3a, and the calculated coordination routes still do not reach a Nash equilibrium.
As shown in Figure 3c, when the training times are increased to $5 \times 10^5$, the Nash equilibrium coefficients of all samples converge to 1.00 and the corresponding travel times converge to a unique value of 41.76, so the Nash equilibrium routes are obtained. Compared with the results in Table 2, the Q-learning-based approximate solving algorithm produces the same coordination routes as the potential function-based precise solving algorithm, which demonstrates the validity of the designed approximate algorithm. It is worth noting that the computation for Figure 3c takes about 335 s in the MATLAB environment described in Section 4.1.
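As a back-of-the-envelope check of this converged value, take the scenario 1 parameters of Table 1 together with the 2:3 flow ratio reported in Table 2 (our reading: 4 vehicles on the path with free-flow time 30 and 6 on the path with free-flow time 26, with α = β = 1 and C = 10 on both paths):
$t_{\varepsilon=1} = 30\left(1 + \tfrac{4}{10}\right) = 42, \qquad t_{\varepsilon=2} = 26\left(1 + \tfrac{6}{10}\right) = 41.6, \qquad \bar{t} = \frac{4 \cdot 42 + 6 \cdot 41.6}{10} = 41.76,$
and no vehicle gains by deviating alone, since $26(1 + \tfrac{7}{10}) = 44.2 > 42$ and $30(1 + \tfrac{5}{10}) = 45 > 41.6$.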

4.4. Robustness of the Q-Learning-Based Approximate Solving Algorithm

In the verification of the Q-learning-based approximate solving algorithm, only traffic scenario 1 is considered in Figure 3. To verify the universality of the Q-learning-based solving algorithm in various traffic scenarios, the convergence of the designed Q-learning needs to be further tested under different traffic parameters. The robustness of the designed approximate solving algorithm is shown in Figure 4.
As shown in Figure 4a, the Q-learning-based solving algorithm still exhibits the required convergence in traffic scenarios 2, 3 and 4 (the Nash equilibrium coefficients converge to 1.00). As shown in Figure 4b, the corresponding travel times converge to the unique values 31.86, 35.44 and 35.00, respectively. Compared with the results in Table 2, the Q-learning-based approximate solving algorithm obtains the same route schemes as the potential function-based precise algorithm in traffic scenarios 2, 3 and 4, which verifies its robustness.

4.5. Discussion

In this section, firstly, the potential function-based precise solving algorithm is used as the control group, and the results show that Braess' paradox is mitigated by the built route game. Then, the convergence of the Q-learning-based approximate solving algorithm is studied in the same traffic scenarios; the converged approximate route schemes show that it reaches the same Nash equilibrium as the potential function-based precise solving algorithm. Finally, the robustness analysis shows that the Q-learning-based approximate solving algorithm retains the required convergence under different traffic parameters.

5. Conclusions and Future Works

The route game is recognized as an effective way to alleviate Braess' paradox. In this paper, in contrast to the existing hard-to-solve precise solving algorithms, a matched Q-learning-based approximate solving algorithm is designed to generate the coordinated routes of the route game. In this algorithm, the concept of the approximate Nash equilibrium is introduced into the route game, which better suits the real-time nature of smart traffic. An experimental study shows that the Nash equilibrium coefficients generated by the Q-learning-based approximate solving algorithm all converge to 1.00, and the obtained route schemes coincide with the Nash equilibrium generated by the potential function-based precise solving algorithm. The robustness analysis shows that the algorithm retains the required convergence under different traffic parameters.
The built approximate Nash equilibrium-based route game is a decentralized approach that can be applied to large road networks based on the IoV. Meanwhile, some limitations remain to be studied further. For example, (1) conventional route games are symmetric, since vehicles' payoffs depend only on the selected route distribution; for a VRG model with personalized vehicles, however, the payoff matrix is not necessarily symmetric, and the convergence of the proposed Q-learning-based approximate solving algorithm in that setting deserves further study. (2) The main purpose of this paper is to propose a basic Q-learning framework that proves the feasibility of the approximate Nash equilibrium in route games; a faster multi-agent Q-learning algorithm for larger-scale cities, compared against previous approximate algorithms such as heuristics, remains to be designed.

Author Contributions

L.Z. designed and performed the experiments and contributed to the paper writing. L.L. and S.Z. provided supervision and funding acquisition. L.D. and L.X. participated in the experiment design. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Shanghai Soft Science Key Project under Grant 2269219350 and in part by the Guangdong Provincial College Youth Innovation Talent Project under Grant 2022WQNCX279.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Saberi, M.; Hamedmoghadam, H.; Ashfaq, M.; Hosseini, S.A.; Gu, Z.; Shafiei, S.; Nair, D.J.; Dixit, V.; Gardner, L.; Waller, S.T.; et al. A simple contagion process describes spreading of traffic jams in urban networks. Nat. Commun. 2020, 11, 1–9.
  2. Guo, Y.; Tang, Z.; Guo, J. Could a smart city ameliorate urban traffic congestion? A quasi-natural experiment based on a smart city pilot program in China. Sustainability 2020, 12, 2291.
  3. Afrin, T.; Yodo, N. A survey of road traffic congestion measures towards a sustainable and resilient transportation system. Sustainability 2020, 12, 4660.
  4. Tang, C.; Hu, W.; Hu, S.; Stettler, M.E.J. Urban Traffic Route Guidance Method with High Adaptive Learning Ability under Diverse Traffic Scenarios. IEEE Trans. Intell. Transp. Syst. 2020, 22, 2956–2968.
  5. Zhang, L.; Khalgui, M.; Li, Z. Predictive intelligent transportation: Alleviating traffic congestion in the internet of vehicles. Sensors 2021, 21, 7330.
  6. Chen, M.; Yu, X.; Liu, Y. PCNN: Deep convolutional networks for short-term traffic congestion prediction. IEEE Trans. Intell. Transp. Syst. 2018, 19, 3550–3559.
  7. Sun, J.; Kim, J. Joint prediction of next location and travel time from urban vehicle trajectories using long short-term memory neural networks. Transp. Res. C-Emerg. Technol. 2021, 128, 103114.
  8. Li, J.; Ma, Y.; Gao, R.; Cao, Z.; Lim, A.; Song, W.; Zhang, J. Deep Reinforcement Learning for Solving the Heterogeneous Capacitated Vehicle Routing Problem. IEEE Trans. Cybern. 2021; in press.
  9. Zhang, L.; Khalgui, M.; Li, Z.; Zhang, Y. Fairness concern-based coordinated vehicle route guidance using an asymmetrical congestion game. IET Intell. Transp. Syst. 2022; in press.
  10. Yang, S.B.; Guo, C.; Yang, B. Context-aware path ranking in road networks. IEEE Trans. Knowl. Data Eng. 2022, 34, 3153–3168.
  11. Braess, D.; Nagurney, A.; Wakolbinger, T. On a paradox of traffic planning. Transp. Sci. 2005, 39, 446–450.
  12. Scarsini, M.; Schröder, M.; Tomala, T. Dynamic atomic congestion games with seasonal flows. Oper. Res. 2018, 66, 327–339.
  13. Cao, Z.; Chen, B.; Chen, X.; Wang, C. Atomic dynamic flow games: Adaptive vs. nonadaptive agents. Oper. Res. 2021, 69, 1680–1695.
  14. Lee, J. Multilateral bargaining in networks: On the prevalence of inefficiencies. Oper. Res. 2018, 66, 1204–1217.
  15. Acemoglu, D.; Makhdoumi, A.; Malekian, A.; Ozdaglar, A. Informational Braess’ paradox: The effect of information on traffic congestion. Oper. Res. 2018, 66, 893–917.
  16. Lin, K.; Li, C.; Fortino, G.; Rodrigues, J.J. Vehicle route selection based on game evolution in social internet of vehicles. IEEE Internet Things J. 2018, 5, 2423–2430.
  17. Mostafizi, A.; Koll, C.; Wang, H. A Decentralized and Coordinated Routing Algorithm for Connected and Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11505–11517.
  18. Du, L.; Chen, S.; Han, L. Coordinated online in-vehicle navigation guidance based on routing game theory. Transp. Res. Rec. 2015, 2497, 106–116.
  19. Du, L.; Han, L.; Li, X.Y. Distributed coordinated in-vehicle online routing using mixed-strategy congestion game. Transp. Res. B-Meth. 2014, 67, 1–17.
  20. Du, L.; Han, L.; Chen, S. Coordinated online in-vehicle routing balancing user optimality and system optimality through information perturbation. Transp. Res. B-Meth. 2015, 79, 121–133.
  21. Spana, S.; Du, L.; Yin, Y. Strategic Information Perturbation for an Online In-Vehicle Coordinated Routing Mechanism for Connected Vehicles Under Mixed-Strategy Congestion Game. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4541–4555.
  22. Monderer, D.; Shapley, L.S. Potential games. Games Econom. Behav. 1996, 14, 124–143.
  23. Milchtaich, I. Congestion games with player-specific payoff functions. Games Econom. Behav. 1996, 13, 111–124.
  24. Harks, T.; Klimm, M.; Möhring, R.H. Characterizing the existence of potential functions in weighted congestion games. Theory Comput. Syst. 2011, 49, 46–70.
  25. Harks, T.; Klimm, M. On the existence of pure Nash equilibria in weighted congestion games. Math. Oper. Res. 2012, 37, 419–436.
  26. Lin, H.H.; Hsu, I.C.; Lin, T.Y.; Tung, L.M.; Ling, Y. After the Epidemic, Is the Smart Traffic Management System a Key Factor in Creating a Green Leisure and Tourism Environment in the Move towards Sustainable Urban Development? Sustainability 2022, 14, 3762.
  27. Ali, M.S.; Coucheney, P.; Coupechoux, M. Distributed Learning in Noisy-Potential Games for Resource Allocation in D2D Networks. IEEE Trans. Mob. Comput. 2019, 19, 2761–2773.
  28. Ganzfried, S. Algorithm for Computing Approximate Nash Equilibrium in Continuous Games with Application to Continuous Blotto. Games 2021, 12, 47.
  29. Kamalapurkar, R.; Klotz, J.R.; Dixon, W.E. Concurrent learning-based approximate feedback-Nash equilibrium solution of N-player nonzero-sum differential games. IEEE/CAA J. Autom. Sin. 2014, 1, 239–247.
  30. Xu, Q.; Su, Z.; Lu, R. Game Theory and Reinforcement Learning Based Secure Edge Caching in Mobile Social Networks. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3415–3429.
  31. Zhao, C.; Guo, D. Particle Swarm Optimization Algorithm With Self-Organizing Mapping for Nash Equilibrium Strategy in Application of Multiobjective Optimization. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 5179–5193.
  32. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  33. Wu, S.; Luo, M.; Zhang, J.; Zhang, D.; Zhang, L. Pharmaceutical Supply Chain in China: Pricing and Production Decisions with Price-Sensitive and Uncertain Demand. Sustainability 2022, 14, 7551.
  34. Lazar, D.; Coogan, S.; Pedarsani, R. Routing for traffic networks with mixed autonomy. IEEE Trans. Automat. Control 2020, 66, 2664–2676.
  35. Ullah, I.; Khan, M.A.; Alsharif, M.H.; Nordin, R. An anonymous certificateless signcryption scheme for secure and efficient deployment of Internet of vehicles. Sustainability 2021, 13, 10891.
  36. Zhou, B.; Song, Q.; Zhao, Z.; Liu, T. A reinforcement learning scheme for the equilibrium of the in-vehicle route choice problem based on congestion game. Appl. Math. Comput. 2020, 371, 124895.
  37. Nash, J.F., Jr. Equilibrium points in n-person games. Proc. Natl. Acad. Sci. USA 1950, 36, 48–49.
  38. Rosenthal, R.W. A class of games possessing pure-strategy Nash equilibria. Internat. J. Game Theory 1973, 2, 65–67.
  39. Umair, M.; Cheema, M.A.; Cheema, O.; Li, H.; Lu, H. Impact of COVID-19 on IoT adoption in healthcare, smart homes, smart buildings, smart cities, transportation and industrial IoT. Sensors 2021, 21, 3838.
  40. Tan, T.; Bao, F.; Deng, Y.; Jin, A.; Dai, Q.; Wang, J. Cooperative Deep Reinforcement Learning for Large-Scale Traffic Grid Signal Control. IEEE Trans. Cybern. 2020, 50, 2687–2700.
Figure 1. Basic traffic scenario of the route game.
Figure 2. Process of the matched Q-learning for solving the route game.
Figure 3. The convergence of the Q-learning-based approximate solving algorithm in traffic scenario 1.
Figure 4. The convergence of the Q-learning-based approximate solving algorithm in traffic scenarios 2, 3 and 4.
Table 1. Traffic parameters of numerical experiments.
Traffic Scenario | n | m | C (ε = 1) | C (ε = 2) | t_freeflow (ε = 1) | t_freeflow (ε = 2)
Scenario 1 | 10 | 2 | 10 | 10 | 30 | 26
Scenario 2 | 10 | 2 | 15 | 10 | 25 | 20
Scenario 3 | 10 | 2 | 10 | 15 | 30 | 24
Scenario 4 | 10 | 2 | 30 | 30 | 30 | 30
Note: All traffic parameters adopt the standardized units.
Table 2. The results of the precise solving algorithm.
Scenario | Method 1 (τ / t) | Method 2 (τ / t) | Method 3 (η / τ / t)
Scenario 1 | 0.99:1.00 / 43.47 | 0:1 / 52.00 | 1.00 / 2:3 / 41.76
Scenario 2 | 0.98:1.00 / 32.56 | 0:1 / 40.00 | 1.00 / 2:3 / 31.86
Scenario 3 | 1.01:1.00 / 36.97 | 0:1 / 40.00 | 1.00 / 3:7 / 35.44
Scenario 4 | 1.00:1.00 / 35.00 | 1:0 / 40.00 | 1.00 / 1:1 / 35.00
Note: η denotes the averaged Nash equilibrium coefficient of 20 samples; τ denotes the averaged flow ratio of ε = 1 to ε = 2; t denotes the averaged travel time of 20 samples.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
