Article

Vehicle-Following Control Based on Deep Reinforcement Learning

1 School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
2 College of Computer Science, National University of Defense Technology, Changsha 400015, China
3 China Coal Technology Engineering Group Chongqing Research Institute, Chongqing 400039, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(20), 10648; https://doi.org/10.3390/app122010648
Submission received: 7 August 2022 / Revised: 16 September 2022 / Accepted: 22 September 2022 / Published: 21 October 2022
(This article belongs to the Special Issue Recent Advances in Machine Learning and Computational Intelligence)

Abstract:

Intelligent vehicle-following control presents a great challenge in autonomous driving. On vehicle-intensive roads in city environments, the frequent starting and stopping of vehicles is one of the main causes of front-end collision accidents. Therefore, this paper proposes a subsection proximal policy optimization method (Subsection-PPO), which divides the vehicle-following process into a start–stop stage and a steady stage and controls the two stages with two different actor networks, improving safety in vehicle-following control based on the proximal policy optimization algorithm. To improve training efficiency and reduce the variance of the advantage function, the weighted importance sampling method is employed instead of ordinary importance sampling to estimate the data distribution. Finally, based on the TORCS simulation engine, the advantages and robustness of the method in vehicle-following control are verified. The results show that the Subsection-PPO algorithm achieves better training efficiency and higher safety than the PPO and DDPG algorithms in vehicle-following control.

1. Introduction

Nowadays, with the rapid development of autonomous driving technology, an increasing number of enterprises and universities are investing in its research and development, and future modes of travel will undergo great changes. However, current autonomous driving technology is not yet mature, and many aspects still need further development. Especially on vehicle-intensive roads in city environments, traffic congestion is frequently encountered, and the frequent starting and stopping of vehicles and instabilities in vehicle speed lead to a large number of front-end collision accidents. Therefore, a safe vehicle-following control method is of great significance for driving safety and for alleviating traffic congestion.
Vehicle following is the most basic microscopic driving behavior. It mainly concerns the interaction between the front and rear vehicles when vehicles platoon in a single lane [1], and it includes longitudinal control and lateral control. There are various methods of vehicle-following control, such as model predictive control (MPC) [2], proportional-integral-derivative (PID) control [3], fuzzy control [4], and methods based on deep neural networks [5,6]. Control methods based on deep networks have been studied for dealing with complex road scenes. They can be roughly divided into two categories: supervised learning methods that use expert data to train deep networks [7], and deep reinforcement learning methods that continuously explore and find high-reward strategies while interacting with the environment [8]. The former trains the controller by collecting a large amount of expert driving data, while the latter trains the control policy by continuous exploration and trial-and-error in the environment.
Training with expert data requires considerable human effort to prepare the data, and it is difficult to manually assess and screen out unsafe driving data, which can introduce serious safety risks. Deep reinforcement learning instead obtains optimized driving policies in a self-learning way: it learns through continuous exploration of the environment, and its exploratory ability determines how well it can learn from the environment.
Most accidents that occur during vehicle-following on densely packed city roads happen in the start–stop phase, while the accident rate is lower in the steady, slow-moving phase. Therefore, this paper proposes to divide the entire vehicle-following process into two stages, start–stop and steady driving, and to modify the single policy network of the PPO algorithm into two policy networks trained for the respective vehicle-following stages. Furthermore, the PPO algorithm uses importance sampling to estimate the distribution of the advantage function. We found that this estimate can have a large variance, which is detrimental to the optimization efficiency of the policy. Hence, this paper proposes to use weighted importance sampling instead of ordinary importance sampling, which effectively reduces the variance between the resampled data distribution and the real distribution and improves the training efficiency of the policy network. Our contributions can be summarized as follows:
  • According to the characteristics of vehicle-following on dense urban roads, the vehicle-following process is divided into two stages: start–stop and steady driving. To improve the PPO algorithm, a subsection proximal policy optimization algorithm (Subsection-PPO) is proposed, which uses two different policy networks for training in the different following stages.
  • The weighted importance sampling method is used instead of the importance sampling method when estimating the objective function.
  • In order to evaluate the effectiveness of the Subsection-PPO method in vehicle-following control, this paper uses the TORCS (The Open Racing Car Simulator) simulation environment to simulate urban traffic flow for verification. The experimental results show that the method performs well in vehicle-following scenarios.

2. Related Work

Vehicle-following research has long been an important research direction in traffic flow analysis and autonomous driving. Related research on vehicle-following models can be traced back to the middle of the last century, and there have been many advances since then. In terms of vehicle-following model research, the first vehicle-following model was proposed by Pipes [9] and was widely used to describe vehicle flow. Later, different types of vehicle-following models were proposed in different directions and fields. Gazis et al. [10] proposed the GM model, which is based on the driver's stimulus response while taking the safe distance into account; most subsequent stimulus-response vehicle-following models are based on it. The Gipps model [11] is a classic safe-distance model that builds on the earliest safe-distance formulation of Kometani and Sasaki. Based on the Gipps model, Ayres et al. [12] proposed a safety distance model based on the time headway. Jamson et al. [13] proposed a driver-state vehicle-distance model based on kinematic analysis and a safe-distance model based on differences in driver response delay in emergency states. Treiber, Helbing et al. proposed the classic intelligent driver model (IDM) [14], which can describe the change of vehicle-following behavior from free flow to the congested state with a unified structure. With the development of neural networks, neural network vehicle-following models built with fuzzy logic have been realized. The neural network vehicle-following model proposed by Mathew et al. [15] achieves higher prediction accuracy than the Gipps model. The sequence-to-sequence vehicle-following model proposed by Sharma et al. [16] has memory and response-delay capabilities, which further extends the spatial anticipation and improves the accuracy of platoon simulation and the stability of traffic flow. Li et al. [17] proposed a novel platoon formation and optimization model combining graph theory and safety potential field (G-SPF) theory, which can form a collision-free platoon in a short time. Zhu and Zhang [18] proposed an improved forward-looking vehicle-following model that uses a mean expected velocity field to describe the flow of autonomous vehicles. The new model has three key parameters, adjustable sensitivity, intensity factor, and the size of the mean expected velocity field, which have a large impact on the stability and congestion state of autonomous vehicle flow.
The purpose of vehicle-following decision-making is to form decisions that ensure the safety and rationality of vehicle-following according to the state information of the following vehicle or the following queue and a certain decision-making method. In terms of vehicle-following decision research, Li et al. [19] proposed an adaptive hierarchical control structure, in which the upper control layer obtains a sliding mode control law for the required acceleration from the inter-vehicle state information. In the lower control layer, switching logic with hysteresis boundaries ensures ride comfort, and the desired torque is calculated in real time from an inverse dynamics model to track the desired acceleration planned by the upper layer. Zhang et al. [20] proposed a behavior estimation method based on contextual traffic information to identify and predict lane-change intentions and to optimize the acceleration sequence by combining the lane-change intentions of other vehicles. The above methods are based on traditional control and are not robust enough to adapt to most scenarios, so many teams have turned their attention to deep reinforcement learning. The intelligent vehicle-following process can be abstracted into a state transition process that satisfies the Markov property [21], so deep reinforcement learning can be used to realize vehicle-following control. Guerrieri et al. [22] proposed a new automatic traffic data acquisition method, MOM-DL, based on deep learning and the YOLOv3 algorithm; it can automatically detect vehicles in the traffic flow and estimate traffic variables such as flow, space-mean speed, and vehicle density on an expressway under stationary and uniform traffic conditions. Masmoudi et al. [23] used the vision algorithm YOLO to identify the current state and the reinforcement learning algorithms Q-learning and DQN to control the following vehicle; their simulation experiments showed that the following vehicle can make reasonable decisions. However, the Q-learning algorithm suffers from dimension explosion in continuous problems such as vehicle-following, so Zhu et al. [24] chose the DDPG [25] algorithm, which can output continuous actions, and improved and verified it in vehicle-following scenarios, where it showed good generalization ability. Reinforcement learning shows great potential in sequential decision optimization problems, but designing reward functions is difficult. Gao et al. [26] used an inverse reinforcement learning algorithm to establish a reward function from each driver's data and analyzed driving characteristics and following policies; subsequent simulations in a highway environment demonstrated the effectiveness of the method.

3. Problem Formulation

Vehicle-following on city roads differs from cruise control because the vehicle starts, stops, and changes speed frequently. Therefore, the accelerator or brake must be adjusted according to the state of the vehicle in front, and the ability to maintain a safe driving distance is the most important indicator of vehicle-following control. Vehicle-following is a stochastic, interactive process, so it can be modeled as a Markov decision process (MDP). During the MDP, the following vehicle continuously observes the current state and makes decisions. The MDP can be represented by a tuple $\{S, A, P, R\}$, where $S$ is the set of states; we divide the state into two parts used as inputs to the policy networks, namely the start–stop state $S_{start\text{-}stop}$ and the steady state $S_{steady}$. $A$ is the action set, and we likewise divide $A$ into the start–stop-phase action $A_{start\text{-}stop}$ and the steady-phase action $A_{steady}$. $P$ is the state transition probability, and $R$ is the immediate reward obtained after performing an action from $A$. Figure 1 shows a schematic diagram of vehicle-following control. Below, we introduce the state space and action space of the vehicle-following process.
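As a minimal illustration of this two-phase MDP split, the sketch below shows one way the collected transitions could be tagged with their phase. The names Phase and Transition and the field layout are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of how two-phase MDP data could be stored.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Phase(Enum):
    START_STOP = auto()   # handled by the start-stop actor (S_start-stop, A_start-stop)
    STEADY = auto()       # handled by the steady actor (S_steady, A_steady)


@dataclass
class Transition:
    state: List[float]        # sensor readings from Table 1
    action: List[float]       # [acceleration, brake] from Table 2
    reward: float             # immediate reward R
    next_state: List[float]
    phase: Phase              # which actor network this transition belongs to
```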

3.1. State Space

The following vehicle constantly explores the environment to learn, so it needs to continuously obtain the current state as input. The simulation environment in this paper is TORCS, and Table 1 shows the state space.
Because the state data obtained by the sensors have different units and scales, we normalize them to [0, 1] in the experimental stage to eliminate the adverse influence of singular sample data.
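A minimal sketch of this [0, 1] normalization is given below; the per-sensor bounds used here are illustrative assumptions, not values from the paper.

```python
# Min-max normalization of the sensor vector to [0, 1].
import numpy as np


def normalize(x: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Scale each sensor reading to [0, 1] given per-sensor bounds."""
    return np.clip((x - low) / (high - low), 0.0, 1.0)


# Example observation: [speedX_r, speedX_p, acc_r, dist, rpm_r] (assumed bounds).
obs = np.array([18.0, 20.0, 0.6, 12.0, 2100.0])
low = np.array([0.0, 0.0, -5.0, 0.0, 0.0])
high = np.array([60.0, 60.0, 5.0, 50.0, 9000.0])
print(normalize(obs, low, high))
```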

3.2. Action Space

Intelligent vehicle-following, as a longitudinal control task, requires reasonable control of the accelerator opening and braking force to maintain a safe and stable vehicle spacing. The accelerator opening and braking force constitute the action space vector, as shown in Table 2.

3.3. Reward Function

The reward function $R: S \times A \times S \to \mathbb{R}$ in reinforcement learning is the incentive mechanism that enables the agent to learn a behavioral policy that meets the final goal. Two policy networks are used in this paper, but they must maintain a consistent estimate of the advantage function, so a single reward function is used uniformly (a code sketch of this reward follows the list below): $R = \gamma_1 \alpha - \gamma_2 \beta - \eta \cdot acc_r$
  • where $\alpha = \begin{cases} dist, & \text{if } speedX_r \times T_{mth} \le dist \le speedX_r \times T_{mth} + 2 \\ 0, & \text{otherwise} \end{cases}$ and $\gamma_1 = 1$. A positive reward is given when the vehicle maintains a safe spacing within $[speedX_r \times T_{mth},\ speedX_r \times T_{mth} + 2]$, where $T_{mth} = 2\,\mathrm{s}$ is the minimum time headway.
  • $\beta = |speedX_r - speedX_p|$ and $\gamma_2 = 0.2$. On vehicle-intensive roads, the vehicle speed is generally less than 30 km/h. In this paper, the maximum speed of the preceding vehicle is set to $speedX_p = 25\ \mathrm{km/h} \approx 7\ \mathrm{m/s}$. Since the following vehicle uses sensors to obtain the current state information, its accuracy and response speed are much higher than those of a human driver, and keeping a similar speed stabilizes the following distance and ensures safety. Therefore, the speeds of the two vehicles are kept as consistent as possible in the experiments.
  • $acc_r$ is the acceleration of the following vehicle, and $\eta = 0.05$.
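The sketch below computes the reward as reconstructed above. The absolute values on the speed-difference and acceleration terms are assumptions made so that these terms act as penalties, as the text describes.

```python
# Reward for the following vehicle; speeds in m/s, spacing in m.
T_MTH = 2.0                      # minimum time headway (s)
GAMMA_1, GAMMA_2, ETA = 1.0, 0.2, 0.05


def reward(dist: float, speed_r: float, speed_p: float, acc_r: float) -> float:
    lo = speed_r * T_MTH         # lower bound of the safe spacing
    hi = lo + 2.0                # upper bound of the safe spacing
    alpha = dist if lo <= dist <= hi else 0.0
    beta = abs(speed_r - speed_p)
    return GAMMA_1 * alpha - GAMMA_2 * beta - ETA * abs(acc_r)
```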

4. Methodology

Intelligent vehicle-following control can be described as a Markov decision process. This paper proposes the Subsection-PPO algorithm, which divides a set of trajectories into a start–stop part and a steady part and uses the weighted importance sampling method to calculate the objective function. In order to provide the training vehicle with initial power and exploration capability during the training phase, this paper uses Ornstein–Uhlenbeck (OU) noise [27], a stochastic process.

4.1. Noise

Since temporally correlated noise can increase exploration efficiency, this paper adopts temporally correlated Ornstein–Uhlenbeck (OU) noise. The OU process is a stochastic process whose differential equation is as follows:
$N_t = \theta (\mu - x_t)\, dt + \sigma\, dW_t$
where $x_t$ is usually one dimension of the agent's action, $\mu \in \mathbb{R}$ represents the mean value of the action, $\theta > 0$ and $\sigma > 0$, and $dW_t = W_t - W_s \sim \mathcal{N}(0,\ t - s)$ is the increment of a Wiener process. The magnitude of $\theta$ determines how strongly $x_t$ is pulled toward $\mu$, and $\sigma$ is the magnification of the perturbation from the Wiener process.
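A minimal sketch of OU exploration noise is shown below; the values of theta, sigma, dt, and the seed are illustrative defaults, not the paper's settings.

```python
# Ornstein-Uhlenbeck exploration noise added to one action dimension.
import numpy as np


class OUNoise:
    def __init__(self, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, dt: float = 1.0, seed: int = 0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.x = mu

    def sample(self) -> float:
        # dx = theta * (mu - x) * dt + sigma * dW, with dW ~ N(0, dt)
        dw = self.rng.normal(0.0, np.sqrt(self.dt))
        self.x += self.theta * (self.mu - self.x) * self.dt + self.sigma * dw
        return self.x


noise = OUNoise()
throttle = float(np.clip(0.3 + noise.sample(), 0.0, 1.0))  # perturbed throttle action
```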

4.2. Proximal Policy Optimization Algorithm

The proximal policy optimization algorithm is a policy-gradient reinforcement learning algorithm that evolved from trust region policy optimization (TRPO) [28]. A higher reward in the environment indicates a stronger ability to complete the task, and the ultimate goal of all policy gradient methods is to maximize the cumulative reward, that is, to maximize $\eta(\pi) = \mathbb{E}_{s_0, a_0, s_1, a_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$, where $\gamma$ is the discount factor, indicating that the farther a state is from the current one, the smaller its impact on the current state, and $\eta(\pi)$ is the cumulative reward obtained when acting according to policy $\pi$. The sequence $s_0, s_1, s_2, \ldots$ represents the state transitions of the agent in the environment, with states treated as drawn from a given distribution $s \sim \rho(s)$. The TRPO algorithm uses the advantage function $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$ to evaluate the quality of executing an action, where $Q_\pi(s, a)$ is the action-value function and $V_\pi(s)$ is the state value. It has been proved that the cumulative reward of a new policy $\pi_{new}$ can be expressed as:
$\eta(\pi_{new}) = \eta(\pi_{old}) + \mathbb{E}_{s_0, a_0, \ldots \sim \pi_{new}}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$
That is, the cumulative reward of the new policy equals the cumulative reward of the old policy plus the expected cumulative advantage under the new policy.
Therefore, if $\mathbb{E}_{s_0, a_0, \ldots \sim \pi_{new}}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$ can be guaranteed to be greater than or equal to zero, a monotonic increase of the cumulative reward is guaranteed, that is, the policy keeps improving. Since the cumulative advantage function cannot be calculated directly, TRPO uses importance sampling to estimate it and uses the KL divergence to limit the update range of the policy, thereby ensuring monotonic improvement of the policy. The core problem of the TRPO algorithm is defined as:
$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \pi_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, A_{\theta_{old}}(s, a)\right] \quad \text{subject to} \quad \mathbb{E}_{s \sim \pi_{\theta_{old}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \le \delta$
The TRPO algorithm uses the KL divergence to define a trust region and restricts policy updates to this region so that the cumulative advantage remains greater than or equal to zero. However, updating the policy within a trust region in this way is complex and inefficient, so the PPO algorithm instead clips the importance weight to limit the update range of the policy. The surrogate objective that PPO maximizes is:
$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$
where the superscript $CPI$ stands for conservative policy iteration [29], and $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the importance weight obtained by importance sampling. The objective function of the PPO algorithm is finally:
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right)\right]$
The value of $r_t(\theta)$ is limited to $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a hyperparameter.
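The sketch below computes the clipped surrogate objective $L^{CLIP}$ above, written with PyTorch; the batch layout (one log-probability and one advantage per timestep) is an assumption for illustration.

```python
# Negative clipped surrogate objective, averaged over a batch of timesteps.
import torch


def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize by minimizing the negative
```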

4.3. Subsection-PPO

On congested urban roads, the high density of vehicles and low speeds lead to frequent starts and stops, which is where most accidents occur. Therefore, this paper proposes to divide vehicle-following into two stages: start–stop and steady driving. Accordingly, the collected trajectory data $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_n, a_n, r_n\}$ need to be divided into two categories and processed separately. Since the original data are split into two categories, the algorithm also needs to change correspondingly. Hence, this paper proposes the Subsection-PPO algorithm on the basis of the proximal policy optimization algorithm (PPO), using two actor networks to learn the policies of the different stages, as shown in Figure 2. It is worth noting that, in order to keep the state-value estimates consistent, we use only one critic network to estimate the state values of both stages. The simulation experiments below show that this design is effective. The division of the stages is as follows (a classification sketch in code follows the list):
  • Start–stop stage:
    $\begin{cases} \text{start}, & \text{when } acc_p > 0 \text{ and } dist \notin [speedX_r \cdot T_{mth},\ speedX_r \cdot T_{mth} + 2] \\ \text{stop}, & \text{when } acc_p \le 0 \text{ and } dist \notin [speedX_r \cdot T_{mth},\ speedX_r \cdot T_{mth} + 2] \end{cases}$
    The trajectories generated in this stage are start–stop data.
  • Steady stage: when $dist \in [speedX_r \cdot T_{mth},\ speedX_r \cdot T_{mth} + 2]$.
    The trajectories generated in this stage are steady driving data.
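The sketch below applies this stage-division rule to a single transition. The strict "spacing outside the safe range" reading of the start–stop condition is an assumption based on the description above, and the function names are illustrative.

```python
# Decide which actor network handles the current transition.
T_MTH = 2.0  # minimum time headway (s)


def in_safe_range(dist: float, speed_r: float) -> bool:
    lo = speed_r * T_MTH
    return lo <= dist <= lo + 2.0


def stage(dist: float, speed_r: float, acc_p: float) -> str:
    if in_safe_range(dist, speed_r):
        return "steady"
    return "start" if acc_p > 0 else "stop"  # both are routed to the start-stop actor
```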
In the PPO algorithm, importance sampling turns the algorithm from on-policy into off-policy, which improves data utilization, and the update region of the policy network is controlled by clipping the importance weight so that the updates remain monotonically non-decreasing. Although ordinary importance sampling gives an unbiased estimate, estimating expectations with samples drawn from a different distribution can produce a large variance, which hurts training efficiency. To address this problem, this paper proposes to use the weighted importance sampling method instead of ordinary importance sampling to estimate the objective function. The final objective function is then transformed from Equation (5) to:
$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\sum_{n=1}^{N} \frac{\omega_n}{\sum_{m=1}^{N} \omega_m}\, \hat{A}_t\right]$
where $\omega_n = \frac{\pi_{\theta_{old}}(a_t \mid s_t)}{\pi^{n}_{\theta}(a_t \mid s_t)}$. The weighted importance sampling estimator is consistent and has lower variance than ordinary importance sampling; hence, as the sample size increases, the estimate becomes increasingly close to the true value of the objective function. Figure 3 shows the overall framework of the Subsection-PPO algorithm.
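The sketch below contrasts the ordinary and the weighted (self-normalized) importance sampling estimates of the advantage term; it is purely illustrative and not the authors' code.

```python
import numpy as np


def ordinary_is(weights: np.ndarray, advantages: np.ndarray) -> float:
    """Unbiased estimate, but its variance grows when weights drift far from 1."""
    return float(np.mean(weights * advantages))


def weighted_is(weights: np.ndarray, advantages: np.ndarray) -> float:
    """Self-normalized estimate with lower variance; converges as N grows."""
    return float(np.sum(weights * advantages) / np.sum(weights))
```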

5. Experimental Simulation

This experiment is based on the TORCS (The Open Racing Car Simulator) simulation platform, which provides rich road data and comprehensive vehicle radars. We implement the reinforcement learning code in Python and use the UDP protocol to exchange data with the simulation platform and control the simulated vehicle. To verify the effectiveness of the proposed algorithm, we simulate real urban traffic flow in the TORCS platform and consider only the longitudinal control of the vehicle. Our experiments include:
  • The experiment compares the cumulative reward of the Subsection-PPO algorithm proposed in this paper with PPO and DDPG.
  • The total distance of vehicle-following by different algorithms while maintaining a safe spacing is compared.
  • The effect of the proposed algorithm in vehicle-following control is described using the relationship between speed and distance.

5.1. Hardware Configuration

The experimental platform runs the Ubuntu 20.04 operating system with 16 GB DDR4 memory; the processor is a sixteen-core Intel Core i5-10200H CPU @ 2.40 GHz, and the graphics card is an NVIDIA Quadro RTX 5000. The learning rates of actor_network (start–stop) and actor_network (steady) are both $lr_a = 3 \times 10^{-4}$, the learning rate of critic_network is $lr_c = 1 \times 10^{-3}$, and training runs for 10,000 timesteps.
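For reference, the training settings above can be collected into a configuration dictionary as sketched below; the dictionary and its key names are illustrative, not the authors' code.

```python
config = {
    "lr_actor_start_stop": 3e-4,   # actor_network (start-stop)
    "lr_actor_steady": 3e-4,       # actor_network (steady)
    "lr_critic": 1e-3,             # critic_network
    "train_timesteps": 10_000,
}
```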

5.2. Experiment and Comparison

We compare against two reinforcement learning algorithms, PPO and DDPG. Both algorithms are based on the basic Actor–Critic architecture: in simple terms, the actor network outputs actions, and the critic network evaluates state or action values.
Firstly, the comparison of cumulative rewards is essential. The change in cumulative reward reflects the exploration and learning ability of a reinforcement learning algorithm. Unlike a loss value, a higher cumulative reward is better: the higher the cumulative reward after convergence, the better the agent has learned and the richer its perception of the environment.
In this paper, the preceding vehicle is controlled longitudinally with random acceleration and braking to simulate its behavior, and the reinforcement learning method controls the following vehicle. The reward function is introduced in Section 3.3. The variation of the cumulative reward during the training phase is shown in Figure 4.
The experiment simulates vehicle-following on a crowded road, so the minimum time headway (MTH) of the strong vehicle-following state is used to calculate the safe following spacing. The following state is considered safe when the spacing is within the range $[speedX_r \times T_{mth},\ speedX_r \times T_{mth} + 2]$. Figure 5 shows the driving distance of the different control methods while maintaining safe spacing; the whole journey is 1600 m. Figure 6 shows the relationship between vehicle velocity and spacing.
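The sketch below shows one way the safe-spacing distance share reported in Figures 5 and 6 could be computed from logged trajectories; the per-step log format and the fixed time step are assumptions.

```python
# Fraction of the driven distance during which spacing stayed in the safe range.
import numpy as np

T_MTH = 2.0  # minimum time headway (s)


def safe_distance_ratio(speed_r: np.ndarray, dist: np.ndarray, dt: float) -> float:
    """speed_r in m/s, dist in m, dt in s; returns the safe-spacing share of distance."""
    lo = speed_r * T_MTH
    safe = (dist >= lo) & (dist <= lo + 2.0)
    step_dist = speed_r * dt              # distance covered in each time step
    return float(step_dist[safe].sum() / step_dist.sum())
```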

6. Conclusions

Based on the PPO algorithm and the characteristics of the different stages of vehicle-following control, we divide the trajectories into two parts, start–stop and steady driving, and we use the weighted importance sampling method instead of ordinary importance sampling. Combining these, we propose the Subsection-PPO algorithm for vehicle-following control. Subsection-PPO uses two actor networks, but in order to avoid non-convergence caused by inconsistent value estimates, we employ a single critic network for value estimation. The action vectors of the different vehicle-following stages are computed by the corresponding actor networks, which makes our method well suited to vehicle-following problems. Furthermore, the weighted importance sampling method improves training efficiency. We simulated vehicle-following on urban roads in the TORCS environment and compared the proposed method with the baselines. The results demonstrate the feasibility and safety advantages of the proposed method. However, there are still shortcomings in our work. Autonomous driving technology is constantly developing, and under the premise of ensuring safety it is also necessary to consider the acceleration profile of the vehicle, which affects energy consumption and ride comfort. This will be the direction of our future work.

Author Contributions

Y.H. wrote the manuscript and designed research methods; X.Z. (Xinglong Zhang), X.X. and X.Z. (Xiaochuan Zhang) edited and revised the manuscript; Y.L. (Yao Liu) and Y.L. (Yong Li) analyzed the data. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the key program project "Research on Key Technologies of Internet of Things Platform for Smart City" (Grant No. 2020ZDXM12) and the project "Research on the Basic Support System of Urban Management Comprehensive Law Enforcement" (Grant No. 2021ZDXM17) of the China Coal Technology Engineering Group Chongqing Research Institute.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Paschalidis, E.; Choudhury, C.F.; Hess, S. Combining driving simulator and physiological sensor data in a latent variable model to incorporate the effect of stress in car-following behaviour. Anal. Methods Accid. Res. 2019, 22, 100089. [Google Scholar] [CrossRef]
  2. Liu, Z.; Yuan, Q.; Nie, G.; Tian, Y. A multi-objective model predictive control for vehicle adaptive cruise control system based on a new safe distance model. Int. J. Automot. Technol. 2021, 22, 475–487. [Google Scholar] [CrossRef]
  3. Farag, W. Complex Trajectory Tracking Using PID Control for Autonomous Driving. Int. J. Intell. Transp. Syst. Res. 2019, 18, 356–366. [Google Scholar] [CrossRef]
  4. Choomuang, R.; Afzulpurkar, N. Hybrid Kalman filter/fuzzy logic based position control of autonomous mobile robot. Int. J. Adv. Robot. Syst. 2005, 2, 20. [Google Scholar] [CrossRef] [Green Version]
  5. Fayjie, A.R.; Hossain, S.; Oualid, D.; Lee, D.J. Driverless car: Autonomous driving using deep reinforcement learning in urban environment. In Proceedings of the IEEE 2018 15th International Conference on Ubiquitous Robots (UR), Honolulu, HI, USA, 26–30 June 2018; pp. 896–901. [Google Scholar] [CrossRef]
  6. Colombaroni, C.; Fusco, G.; Isaenko, N. Modeling car following with feed-forward and long-short term memory neural networks. Transp. Res. Procedia 2021, 52, 195–202. [Google Scholar] [CrossRef]
  7. Bhattacharyya, R.; Wulfe, B.; Phillips, D.; Kuefler, A.; Morton, J.; Senanayake, R.; Kochenderfer, M. Modeling human driving behavior through generative adversarial imitation learning. arXiv 2020, arXiv:2006.06412. [Google Scholar] [CrossRef]
  8. Lin, Y.; McPhee, J.; Azad, N.L. Longitudinal dynamic versus kinematic models for car-following control using deep reinforcement learning. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 1504–1510. [Google Scholar] [CrossRef] [Green Version]
  9. Pipes, L.A. An operational analysis of traffic dynamics. J. Appl. Phys. 1953, 24, 274–281. [Google Scholar] [CrossRef]
  10. Gazis, D.C.; Herman, R.; Potts, R.B. Car-following theory of steady-state traffic flow. Oper. Res. 1959, 7, 499–505. [Google Scholar] [CrossRef]
  11. Cattin, J.; Leclercq, L.; Pereyron, F.; El Faouzi, N.E. Calibration of Gipps’ car-following model for trucks and the impacts on fuel consumption estimation. IET Intell. Transp. Syst. 2019, 13, 367–375. [Google Scholar] [CrossRef]
  12. Ayres, T.; Li, L.; Schleuning, D.; Young, D. Preferred time-headway of highway drivers. In Proceedings of the ITSC 2001, Oakland, CA, USA, 25–29 August 2001; 2001 IEEE Intelligent Transportation Systems. Proceedings (Cat. No. 01TH8585). pp. 826–829. [Google Scholar] [CrossRef]
  13. Jamson, A.H.; Merat, N. Surrogate in-vehicle information systems and driver behaviour: Effects of visual and cognitive load in simulated rural driving. Transp. Res. Part F Traffic Psychol. Behav. 2005, 8, 79–96. [Google Scholar] [CrossRef]
  14. Treiber, M.; Kesting, A. Traffic flow dynamics: data, models and simulation. Phys. Today 2014, 67, 54. [Google Scholar]
  15. Mathew, T.V.; Ravishankar, K. Neural Network Based Vehicle-Following Model for Mixed Traffic Conditions. Eur. Transp.-Trasp. Eur. 2012, 52, 1–15. [Google Scholar]
  16. Sharma, O.; Sahoo, N.; Puhan, N. Highway Discretionary Lane Changing Behavior Recognition Using Continuous and Discrete Hidden Markov Model. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 1476–1481. [Google Scholar] [CrossRef]
  17. Li, L.; Gan, J.; Qu, X.; Mao, P.; Yi, Z.; Ran, B. A novel graph and safety potential field theory-based vehicle platoon formation and optimization method. Appl. Sci. 2021, 11, 958. [Google Scholar] [CrossRef]
  18. Zhu, W.X.; Zhang, L.D. A new car-following model for autonomous vehicles flow with mean expected velocity field. Phys. A: Stat. Mech. Its Appl. 2018, 492, 2154–2165. [Google Scholar] [CrossRef]
  19. Li, W.; Chen, T.; Guo, J.; Wang, J. Adaptive car-following control of intelligent electric vehicles. In Proceedings of the 2018 IEEE 4th International Conference on Control Science and Systems Engineering (ICCSSE), Wuhan, China, 21–23 August 2018; pp. 86–89. [Google Scholar] [CrossRef]
  20. Zhang, Y.; Lin, Q.; Wang, J.; Verwer, S.; Dolan, J.M. Lane-change intention estimation for car-following control in autonomous driving. IEEE Trans. Intell. Veh. 2018, 3, 276–286. [Google Scholar] [CrossRef]
  21. Kamrani, M.; Srinivasan, A.R.; Chakraborty, S.; Khattak, A.J. Applying Markov decision process to understand driving decisions using basic safety messages data. Transp. Res. Part Emerg. Technol. 2020, 115, 102642. [Google Scholar] [CrossRef]
  22. Guerrieri, M.; Parla, G. Deep learning and yolov3 systems for automatic traffic data measurement by moving car observer technique. Infrastructures 2021, 6, 134. [Google Scholar] [CrossRef]
  23. Masmoudi, M.; Friji, H.; Ghazzai, H.; Massoud, Y. A Reinforcement Learning Framework for Video Frame-based Autonomous Car-following. IEEE Open J. Intell. Transp. Syst. 2021, 2, 111–127. [Google Scholar] [CrossRef]
  24. Zhu, M.; Wang, X.; Wang, Y. Human-like autonomous car-following model with deep reinforcement learning. Transp. Res. Part C Emerg. Technol. 2018, 97, 348–368. [Google Scholar] [CrossRef] [Green Version]
  25. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (PMLR), Beijing, China, 21–26 June 2014; pp. 387–395. [Google Scholar]
  26. Gao, H.; Shi, G.; Xie, G.; Cheng, B. Car-following method based on inverse reinforcement learning for autonomous vehicle decision-making. Int. J. Adv. Robot. Syst. 2018, 15, 1729881418817162. [Google Scholar] [CrossRef]
  27. Ngoduy, D.; Lee, S.; Treiber, M.; Keyvan-Ekbatani, M.; Vu, H. Langevin method for a continuous stochastic car-following model and its stability conditions. Transp. Res. Part C Emerg. Technol. 2019, 105, 599–610. [Google Scholar] [CrossRef] [Green Version]
  28. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar] [CrossRef]
  29. Kakade, S.; Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, 8–12 July 2002. [Google Scholar]
Figure 1. Vehicle-following control.
Figure 2. Actor-network at different stages.
Figure 3. Subsection-PPO.
Figure 4. Cumulative rewards.
Figure 5. Driving distance with safe spacing.
Figure 6. The dotted line indicates the spacing range in the ideal state. Initial vehicle spacing is 10 m. The distance traveled using the Subsection-PPO algorithm while maintaining a safe spacing accounted for 93.8% of the total mileage.
Table 1. State space.

Sensor | Range | Description
$speedX_r$ | $(-\infty, +\infty)$ (km/h) | Speed along the longitudinal axis of the vehicle (driving direction of the rear vehicle)
$speedX_p$ | $(-\infty, +\infty)$ (km/h) | Speed along the longitudinal axis of the vehicle (direction of travel of the vehicle in front)
$acc_r$ | $(-\infty, +\infty)$ (m/s²) | Acceleration of the rear vehicle
$dist$ | $(-\infty, +\infty)$ (m) | Spacing between the vehicles
$rpm_r$ | $(0, +\infty)$ (rpm) | Speed along the Z-axis of the vehicle

Table 2. Action space.

Action | Range | Description
$acceleration$ | [0, 1] | Throttle opening: 0 means the throttle is not pressed, 1 means fully depressed.
$brake$ | [0, 1] | Braking force: 0 means no braking, 1 means fully depressed.

