# Towards Robust Decision-Making for Autonomous Highway Driving Based on Safe Reinforcement Learning


## Abstract


## 1. Introduction

## 2. Related Works

#### 2.1. Rule-Based and Optimization-Based Methods for Autonomous Driving Decision Planning

#### 2.2. Machine-Learning-Based Methods for Autonomous Driving Decision Planning

#### 2.3. Research on Safe Reinforcement Learning

## 3. Problem Statement and Method Framework

#### 3.1. System Model and Problem Statement

#### 3.2. Framework for RECPO-Based Decision-Making in Autonomous Highway Driving

## 4. Transforming Highway Vehicle Control into Safe Reinforcement Learning

#### 4.1. Constrained Markov Decision Process

$\varphi$ denotes the set of all feasible policies.
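In a constrained Markov decision process, each transition emits a cost alongside the reward, and a policy is feasible when its expected discounted cumulative cost stays within a budget $d$. As a minimal illustration (function and variable names are ours, not the paper's), feasibility can be estimated from sampled cost trajectories:

```python
import numpy as np

def discounted_return(values, gamma):
    """Discounted sum: sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))

def is_feasible(cost_trajectories, gamma, cost_limit):
    """Estimate whether a policy belongs to the feasible set:
    its expected discounted cost J_c must not exceed cost_limit."""
    j_c = np.mean([discounted_return(c, gamma) for c in cost_trajectories])
    return j_c <= cost_limit
```

In practice the expectation is taken over many rollouts of the current policy; here two short cost trajectories stand in for that Monte Carlo estimate.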

#### 4.2. Converting the Autonomous Highway Driving Decision-Making Problem into a Constrained Markov Decision Process

^{2}), a penalty will be given. Otherwise, a reward will be given:

## 5. Autonomous Highway Driving Based on Safe Reinforcement Learning

#### 5.1. Importance Sampling

#### 5.2. Evaluation and Optimization of Policy Networks

#### 5.3. Value Network Parameter Update

#### 5.4. Highway Automatic Driving Algorithm Based on RECPO

**Algorithm 1** Replay Buffer Constrained Policy Optimization

```
Input: Initialize π, V_r, V_c, φ, ψ; set cost_d, γ, kl_max, N_e
for epoch n = 0, 1, 2, 3, … → N_e do
    for t = 0, 1, 2, … until done is True do
        Collect s_t while interacting with the environment in IDAPS
        Collect a_t: s_t → V_r(s_t) & V_c(s_t) → π(s_t)
        s_t ⇒ s_{t+1}; obtain the feedback reward r_t, cost c_t, done, and the
        probability distribution P_t(a_t, s_t) of the policy network π_t for a_t
        Obtain trajectory τ_t = [s_0, a_0, r_0, c_0, P_0(a_0, s_0), s_1, …] → Replay Buffer
    end for
    for k = 0, 1, 2, … do
        Sample the trajectory D from the Replay Buffer
        Calculate the importance weight w
        Calculate the advantage functions of reward and cost: Â^π(s, a), Â_c^π(s, a)
        Calculate ĝ, b̂, ĉ, B̂:
            ĝ = ∇J(π),  b̂ = ∇J_c(π)
            ĉ = Σ_{t=0}^{T} γ^t C_t − d_i / (1 − γ)
            B̂ = δ − ĉ² / (b̂ᵀ H⁻¹ b̂)
        if (‖b̂‖ < 1×10⁻⁸ and ĉ < 0) or (ĉ < 0 and B̂ < 0) then
            Update the policy network as:
            θ_{k+1} = θ_k + √(2δ / (ĝᵀ H⁻¹ ĝ)) H⁻¹ ĝ        // see Equation (28)
        else if B̂ > 0 then
            Solve the convex dual problem to get ν*, λ*
            Solve for α by backtracking line search, and update the policy network as:
            θ_{k+1} = θ_k + (α / λ*) H⁻¹ (ĝ − ν* b̂)          // see Equation (29)
        else if ĉ > 0 and B̂ < 0 then
            Update the policy network as:
            θ_{k+1} = θ_k − √(2δ / (b̂ᵀ H⁻¹ b̂)) H⁻¹ b̂       // see Equation (32)
        end if
        Update φ, ψ as:
            φ = argmin_φ E[(V_r(s_t) − R̂_t)²]
            ψ = argmin_ψ E[(V_c(s_t) − Ĉ_t)²]
    end for
end for
```
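The three-way branch in Algorithm 1 (an unconstrained TRPO-style step, a constrained dual step, and a pure cost-reduction recovery step) can be sketched in a few lines. This is an illustrative reimplementation under our own naming, not the authors' code; `solve_dual` is a hypothetical placeholder for the convex dual solve and backtracking line search described in Section 5:

```python
import numpy as np

def solve_dual(g, b, c, H_inv, delta):
    """Placeholder for the convex dual solve and backtracking line
    search of Section 5; would return (nu_star, lambda_star, alpha)."""
    raise NotImplementedError

def recpo_step(theta, g, b, c, H_inv, delta):
    """One RECPO/CPO-style policy update (sketch).
    g: reward gradient, b: cost gradient, c: constraint slack,
    H_inv: inverse Fisher matrix, delta: KL trust-region radius."""
    q = b @ H_inv @ b                           # b^T H^-1 b
    B = delta - c ** 2 / q if q > 1e-12 else delta
    if (np.linalg.norm(b) < 1e-8 and c < 0) or (c < 0 and B < 0):
        # Constraint inactive over the whole trust region:
        # plain TRPO step along the reward gradient (Equation (28)).
        return theta + np.sqrt(2 * delta / (g @ H_inv @ g)) * (H_inv @ g)
    if B > 0:
        # Feasible case: trade off reward and cost gradients via the
        # dual variables nu*, lambda* (Equation (29)).
        nu, lam, alpha = solve_dual(g, b, c, H_inv, delta)
        return theta + (alpha / lam) * (H_inv @ (g - nu * b))
    # Infeasible case: recovery step that purely decreases cost (Equation (32)).
    return theta - np.sqrt(2 * delta / q) * (H_inv @ b)
```

Note that the recovery branch ignores the reward gradient entirely; that is what pulls an infeasible policy back inside the cost constraint before reward optimization resumes.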

## 6. Experiment

#### 6.1. Experimental Setup

#### 6.2. Experimental Training Process

#### 6.3. Performance Comparison of RECPO, CPO, DDPG and IDM + MOBIL after Deployment

RECPO maintains an average acceleration of 0.084 m/s^{2} throughout the drive, with a standard deviation of only 0.88 and a mean jerk of −0.098 m/s^{3}. This indicates a very stable vehicle speed during high-speed travel: passengers hardly feel any acceleration from throttle or brake applications. Although IDM + MOBIL also keeps its average acceleration under 0.1 m/s^{2}, its higher standard deviation of 3.25 and mean jerk of −0.171 m/s^{3} indicate less stable acceleration, producing a more pronounced feeling of bumpiness than RECPO. Although the CPO algorithm exhibits a higher average acceleration than both RECPO and IDM + MOBIL, its standard deviation is low at 0.85 and its mean jerk is only −0.058 m/s^{3}, indicating highly stable acceleration. Even though there is a stronger sensation of thrust and pitching during acceleration and braking phases, the vibration is less than that of IDM + MOBIL. The average acceleration is 1.21 m/s^{2}, which is acceptable for highway driving. Detailed data can be found in Table 3.
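The comfort statistics in Table 3 (average acceleration, its standard deviation, and average jerk, i.e., the time derivative of acceleration) can be recomputed from a logged speed trace by finite differences. A minimal sketch, assuming the 0.05 s simulation time step from the experimental setup:

```python
import numpy as np

def comfort_metrics(speed, dt=0.05):
    """Mean/std of acceleration (m/s^2) and mean jerk (m/s^3)
    from a uniformly sampled speed trace (m/s)."""
    accel = np.diff(speed) / dt      # finite-difference acceleration
    jerk = np.diff(accel) / dt       # finite-difference jerk
    return accel.mean(), accel.std(), jerk.mean()
```

A constant-acceleration trace, for instance, yields zero acceleration variance and zero mean jerk, matching the intuition that low jerk corresponds to a smooth ride.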

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 1.** Schematic illustration of the proposed framework for robust decision-making in autonomous highway driving based on safe reinforcement learning.

**Figure 2.** Four basic situations potentially encountered while driving on the highway. (**a**) An idle highway, where the vehicle may change lanes at will to accelerate or decelerate; (**b**) an emergency lane change to avoid the preceding vehicle; (**c**) overtaking from the only available lane; (**d**) a high-speed road with dense traffic flow, where the vehicle can only slow down and follow.

**Figure 5.** Training curves. The shaded area around each colored line represents its standard deviation. The actor loss is plotted in the reward value loss panel, and the critic loss is plotted in the cost value loss panel. The cost of DDPG is quantified according to the standards of RECPO and CPO.

| Parameters | Value |
|---|---|
| **CARLA Simulator** | – |
| Time step $t_{step}$ | 0.05 s |
| Maximum number of training epochs and time steps $N_e$ | 5000, 100 |
| Total length of road and simulated road length | 10 km, 1 km |
| Speed limit | [17 m/s, 30 m/s] |
| Acceleration limit | <3 m/s^{2} |
| Lane width | 3.5 m |
| Number of lanes and vehicles | 3, 24 |
| **RECPO & CPO** | – |
| Discount factor $\gamma$ and $\lambda_{gae}$ | 0.9, 0.97 |
| Learning rate for reward and cost | 1 × ${10}^{-3}$ → 0 |
| Number of hidden layers and neurons per hidden layer | 2, 128 |
| Safe distance $d_{safe}$, observation distance $d_o$, and target waypoint distance $d_w$ | 30 m, 50 m, 2 m |
| Weights of cost function $k_1, k_2, k_3, k_4$ | 45, 50, 5, 5 |
| Weights of reward function $k_5, k_6, k_7$ | 2, 1, 50 |
| Replay Buffer size | 20,480 |
| **DDPG** | – |
| Learning rate for actor and critic | 1 × ${10}^{-4}$ → 0 |
| Optimizer of actor and critic | Adam |
| Number of hidden layers | 2 |
| Neurons in hidden layer 1 and layer 2 | 400, 300 |
| Discount factor $\gamma$ | 0.99 |
| Replay Buffer size | 1,000,000 |

| Methods | Average Speed (m/s) | Speed Standard Deviation |
|---|---|---|
| IDM + MOBIL | 24.81 | 5.95 |
| RECPO | 27.52 | 0.88 |
| CPO | 27.28 | 1.73 |
| DDPG | 44.57 | 19.41 |

| Methods | Average Acceleration (m/s^{2}) | Acceleration Standard Deviation | Average Jerk (m/s^{3}) |
|---|---|---|---|
| IDM + MOBIL | −0.0321 | 3.25 | −0.171 |
| RECPO | 0.084 | 0.75 | −0.098 |
| CPO | 0.21 | 0.85 | −0.058 |
| DDPG | 3.14 | 0.98 | −0.7 |

| Methods | Success Rate | Safe Distance Trigger | Average Front Vehicle Distance (m) |
|---|---|---|---|
| IDM + MOBIL | 100% | 48 | 17.73 |
| RECPO | 100% | 0 | 58.85 |
| CPO | 100% | 0 | 48.53 |
| DDPG | 4% | 1 | 32.35 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhao, R.; Chen, Z.; Fan, Y.; Li, Y.; Gao, F.
Towards Robust Decision-Making for Autonomous Highway Driving Based on Safe Reinforcement Learning. *Sensors* **2024**, *24*, 4140.
https://doi.org/10.3390/s24134140
