Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging
Abstract
1. Introduction
- We propose a risk-aware safe RL method based on a constrained Markov decision process (CMDP) for highway on-ramp merging. Each user's risk preference is incorporated into the safety constraints so that the policy matches that user's safety expectations: the CMDP cost limits are computed with fuzzy logic from the user's preference and the traffic density, which adjusts the safety level of the RL policy.
- We build an Action Shielding Mechanism (ASM) that masks out unsafe RL actions. Each RL action is pre-executed with model predictive control (MPC), and the predicted states are checked for collisions with surrounding agents to decide whether the action is safe. A theoretical analysis shows that the shielding mechanism is effective in terms of both safety and sampling efficiency.
- Numerical simulations at different traffic-density levels show that our method outperforms the baselines, improving safety without sacrificing traffic efficiency. Because of the user preference-aware safety constraints and the action shielding, risky behaviors are significantly reduced during the exploration stage of RL, which enables safer policy learning in interactive ramp-merging scenarios.
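
The shielding idea in the second contribution can be illustrated with a minimal sketch. It is not the authors' implementation: the constant-acceleration rollout stands in for the MPC pre-execution, the scene is reduced to one longitudinal dimension, and the names `rollout_positions`, `safe_action_mask`, `shield_policy`, the horizon, the time step, the minimum gap, and the braking fallback are all assumptions made for illustration.

```python
import numpy as np

def rollout_positions(pos, vel, accel, horizon, dt):
    """Predict 1-D positions under constant acceleration.

    Assumption: this simple rollout stands in for the MPC pre-execution;
    it does not clip speed at zero, which is acceptable for short horizons.
    """
    t = dt * np.arange(1, horizon + 1)
    return pos + vel * t + 0.5 * accel * t**2

def safe_action_mask(ego_pos, ego_vel, others, accels,
                     horizon=10, dt=0.2, min_gap=5.0):
    """Return a boolean mask, True where an action passes the collision check.

    others: list of (position, velocity) pairs for surrounding vehicles,
    predicted here with constant velocity (hypothetical simplification).
    """
    mask = np.ones(len(accels), dtype=bool)
    for i, accel in enumerate(accels):
        ego_traj = rollout_positions(ego_pos, ego_vel, accel, horizon, dt)
        for o_pos, o_vel in others:
            other_traj = rollout_positions(o_pos, o_vel, 0.0, horizon, dt)
            # Unsafe if the predicted gap ever drops below min_gap.
            if np.any(np.abs(ego_traj - other_traj) < min_gap):
                mask[i] = False
                break
    return mask

def shield_policy(probs, mask, fallback_index=0):
    """Zero out unsafe actions and renormalize the discrete policy.

    If no action is safe, all probability goes to fallback_index
    (assumed here to be the strongest braking action).
    """
    probs = np.asarray(probs, dtype=float)
    if not mask.any():
        out = np.zeros_like(probs)
        out[fallback_index] = 1.0
        return out
    masked = np.where(mask, probs, 0.0)
    if masked.sum() == 0.0:
        # The policy put no mass on safe actions: fall back to uniform over them.
        masked = mask.astype(float)
    return masked / masked.sum()
```

For example, with candidate accelerations `[-3.0, 0.0, 2.0]`, an ego vehicle at 20 m/s, and a slower vehicle 20 m ahead, only the braking action survives the check, so the shielded policy samples that action with probability one.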
2. Related Works
2.1. RL-Based Approach
2.2. Human Risk Perception in Decision-Making
2.3. Combination of RL and MPC
3. Problem Statement
3.1. Constrained Markov Decision Process
3.1.1. State Space
3.1.2. Action Space
3.1.3. Reward
3.1.4. Cost
3.1.5. Problem Formulation
3.2. Human-Aligned Safety Cost Limits
4. Model Predictive Control
4.1. Discrete Linear Model
Algorithm 1: MPC and States Prediction
4.2. States Computation
5. Safe Reinforcement Learning
5.1. Lagrangian-Based Discrete SAC
5.1.1. Critic Network and Policy Network
Algorithm 2: Human-aligned safe RL
Algorithm 3: Action Shielding Mechanism
5.1.2. Cost Network
5.1.3. n-Step TD Learning
5.1.4. Lagrange Multiplier
5.2. Action Shielding Mechanism
5.2.1. Situation 1
5.2.2. Situation 2
5.2.3. Situation 3
6. Theoretical Analysis
6.1. Safety Performance
6.2. Convergence Analysis
7. Experimental Setup
7.1. Scenario Settings
7.2. Implementation Details
8. Results and Discussion
8.1. Convergence Performance
8.2. Performance Evaluation
8.2.1. Traffic Success
8.2.2. Traffic Safety
8.2.3. Traffic Efficiency
8.3. Ablation Study
8.3.1. Safety Constraints
8.3.2. ASM
8.3.3. Personal Preference
8.3.4. Visual Result
8.4. Sensitivity Analysis
8.4.1. Prediction Coefficient
8.4.2. Risk Tolerance
9. Discussion
10. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wang, H.; Wang, W.; Yuan, S.; Li, X. Uncovering Interpretable Internal States of Merging Tasks at Highway on-Ramps for Autonomous Driving Decision-Making. IEEE Trans. Autom. Sci. Eng. 2022, 19, 2825–2836.
- Liang, J.; Tan, C.; Yan, L.; Zhou, J.; Yin, G.; Yang, K. Interaction-Aware Trajectory Prediction for Safe Motion Planning in Autonomous Driving: A Transformer-Transfer Learning Approach. IEEE Trans. Intell. Transp. Syst. 2025, 26, 17080–17095.
- Wang, H.; Gao, H.; Yuan, S.; Zhao, H.; Wang, K.; Wang, X.; Li, K.; Li, D. Interpretable Decision-Making for Autonomous Vehicles at Highway On-Ramps With Latent Space Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 70, 8707–8719.
- Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419.
- Lyu, Y.; Luo, W.; Dolan, J.M. Probabilistic safety-assured adaptive merging control for autonomous vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 10764–10770.
- Lubars, J.; Gupta, H.; Chinchali, S.; Li, L.; Raja, A.; Srikant, R.; Wu, X. Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 942–947.
- Wang, K.; Mu, C.; Ni, Z.; Liu, D. Safe Reinforcement Learning and Adaptive Optimal Control With Applications to Obstacle Avoidance Problem. IEEE Trans. Autom. Sci. Eng. 2024, 21, 4599–4612.
- Yan, Z.; Kreidieh, A.R.; Vinitsky, E.; Bayen, A.M.; Wu, C. Unified automatic control of vehicular systems with reinforcement learning. IEEE Trans. Autom. Sci. Eng. 2022, 20, 789–804.
- Chen, X.; Xu, B.; Hu, M.; Bian, Y.; Li, Y.; Xu, X. Safe Efficient Policy Optimization Algorithm for Unsignalized Intersection Navigation. IEEE CAA J. Autom. Sin. 2024, 11, 2011–2026.
- Gao, Z.; Hao, H.; Gao, F.; Zhao, R. Constrained Reinforcement Learning-Enabled Policies With Augmented Lagrangian for Cooperative Intersection Management. IEEE Internet Things J. 2024, 12, 5396–5411.
- Wang, Y.; Zhan, S.S.; Jiao, R.; Wang, Z.; Jin, W.; Yang, Z.; Wang, Z.; Huang, C.; Zhu, Q. Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments. In Proceedings of the 40th International Conference on Machine Learning; Journal of Machine Learning Research Inc.: New York, NY, USA, 2023; Volume 202, pp. 36593–36604.
- Carr, S.; Jansen, N.; Junges, S.; Topcu, U. Safe reinforcement learning via shielding under partial observability. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2023; Volume 37, pp. 14748–14756.
- Chen, D.; Hajidavalloo, M.R.; Li, Z.; Chen, K.; Wang, Y.; Jiang, L.; Wang, Y. Deep Multi-Agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11623–11638.
- Isele, D.; Nakhaei, A.; Fujimura, K. Safe Reinforcement Learning on Autonomous Vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–6.
- Peng, J.; Yu, S.; Ge, Y.; Li, S.; Fan, Y.; Zhou, J.; He, H. Personalized Decision-Making Framework for Collaborative Lane Change and Speed Control Based on Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2025, 26, 13629–13644.
- Teng, J.; Li, Y.; Yang, Z.; Yang, Z.; Shao, X.; Qin, H. User Preference-Aware and Efficient Trajectory Planning for Autonomous Parking with Hybrid A* and Nonlinear Optimization. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Edmonton, AB, Canada, 24–27 September 2024; pp. 1090–1097.
- Chen, C.; Lan, Z.; Zhan, G.; Lyu, Y.; Nie, B.; Li, S.E. Quantifying the Individual Differences of Drivers’ Risk Perception via Potential Damage Risk Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 8093–8104.
- Nyberg, T.; Pek, C.; Dal Col, L.; Norén, C.; Tumova, J. Risk-aware Motion Planning for Autonomous Vehicles with Safety Specifications. In Proceedings of the 32nd IEEE Intelligent Vehicles Symposium, Nagoya, Japan, 11–17 July 2021; pp. 1016–1023.
- Geisslinger, M.; Trauth, R.; Kaljavesi, G.; Lienkamp, M. Maximum Acceptable Risk as Criterion for Decision-Making in Autonomous Vehicle Trajectory Planning. IEEE Open J. Intell. Transp. Syst. 2023, 4, 570–579.
- Yang, K.; Li, B.; Shao, W.; Tang, X.; Liu, X.; Wang, H. Prediction Failure Risk-Aware Decision-Making for Autonomous Vehicles on Signalized Intersections. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12806–12820.
- Morinelly, J.E.; Ydstie, B.E. Dual MPC with reinforcement learning. IFAC-PapersOnLine 2016, 49, 266–271.
- Zanon, M.; Gros, S.; Bemporad, A. Practical reinforcement learning of stabilizing economic MPC. In Proceedings of the 18th European Control Conference (ECC), Naples, Italy, 25–28 June 2019; pp. 2258–2263.
- Bellegarda, G.; Byl, K. An Online Training Method for Augmenting MPC with Deep Reinforcement Learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5453–5459.
- Karnchanachari, N.; Valls, M.I.; Hoeller, D.; Hutter, M. Practical reinforcement learning for MPC: Learning from sparse objectives in under an hour on a real robot. In Proceedings of the Learning for Dynamics and Control, UC Berkeley, CA, USA, 10–11 June 2020; pp. 211–224.
- Williams, G.; Wagener, N.; Goldfain, B.; Drews, P.; Rehg, J.M.; Boots, B.; Theodorou, E.A. Information theoretic MPC for model-based reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1714–1721.
- Gros, S.; Zanon, M. Reinforcement learning based on MPC and the stochastic policy gradient method. In Proceedings of the American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021; pp. 1947–1952.
- Li, Y.; Li, J.; Huang, W.; Yang, Q.; Qin, H.; Jiang, X.; Bian, Y.; Hu, M.; Hu, Y. Risk-Constrained On-Ramp Merging via Safety-Augmented Reinforcement Learning and Model Predictive Control. IEEE Internet Things J. 2026; early access.
- Chen, J.; Shen, J.; Chen, W.; Li, J.; Zhang, S. Application of Robust Fuzzy Cooperative Strategy in Global Consensus of Stochastic Multi-Agent Systems. IEEE Trans. Autom. Sci. Eng. 2025, 22, 12058–12070.
- Mamdani, E.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man Mach. Stud. 1975, 7, 1–13.
- Marina Martinez, C.; Heucke, M.; Wang, F.Y.; Gao, B.; Cao, D. Driving Style Recognition for Intelligent Vehicle Control and Advanced Driver Assistance: A Survey. IEEE Trans. Intell. Transp. Syst. 2018, 19, 666–676.
- Peng, J.; Zhang, S.; Zhou, Y.; Li, Z. An Integrated Model for Autonomous Speed and Lane Change Decision-Making Based on Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21848–21860.
- Werling, M.; Ziegler, J.; Kammel, S.; Thrun, S. Optimal trajectory generation for dynamic street scenarios in a frenet frame. In Proceedings of the IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 987–993.
- Christodoulou, P. Soft Actor-Critic for Discrete Action Settings. arXiv 2019, arXiv:1910.07207.
- Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 1995, 38, 58–68.
- Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717.
- Bellman, R. Dynamic programming and stochastic control processes. Inf. Control 1958, 1, 228–239.
- Bertsekas, D.; Nedic, A.; Ozdaglar, A. Convex Analysis and Optimization; Athena Scientific: Nashua, NH, USA, 2003; Volume 1.
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
- Luo, Z.Q.T.; Tseng, P. On the Convergence Rate of Dual Ascent Methods for Linearly Constrained Convex Minimization. Math. Oper. Res. 1993, 18, 846–867.
- Leurent, E. An Environment for Autonomous Driving Decision-Making, 2018. Available online: https://github.com/eleurent/highway-env (accessed on 20 May 2026).
- Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805–1824.
- Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.


| Risk Preference | Cost Limit at High Traffic Density | Cost Limit at Medium Traffic Density | Cost Limit at Low Traffic Density |
|---|---|---|---|
| Conservative | Small | Small | Medium |
| Neutral | Small | Medium | Large |
| Aggressive | Medium | Large | Large |

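
The rule base in the table above can be turned into a cost limit with a small fuzzy-inference sketch. Only the nine rules come from the table. Everything else is assumed for illustration: inputs normalized to [0, 1], triangular membership functions, the numeric levels in `COST_LEVELS`, and a weighted-average defuzzification that is swapped in for simplicity and may differ from the inference scheme the paper uses.

```python
# Rule base copied from the table: (risk preference, traffic density) -> cost limit.
RULES = {
    ("conservative", "high"): "small",
    ("conservative", "medium"): "small",
    ("conservative", "low"): "medium",
    ("neutral", "high"): "small",
    ("neutral", "medium"): "medium",
    ("neutral", "low"): "large",
    ("aggressive", "high"): "medium",
    ("aggressive", "medium"): "large",
    ("aggressive", "low"): "large",
}

# Hypothetical membership centers on a normalized [0, 1] scale.
PREF_CENTERS = {"conservative": 0.0, "neutral": 0.5, "aggressive": 1.0}
DENSITY_CENTERS = {"low": 0.0, "medium": 0.5, "high": 1.0}

# Hypothetical numeric cost limits for each linguistic level.
COST_LEVELS = {"small": 1.0, "medium": 2.0, "large": 3.0}

def tri(x, center, half_width=0.5):
    """Triangular membership that peaks at `center` and reaches zero half_width away."""
    return max(0.0, 1.0 - abs(x - center) / half_width)

def cost_limit(pref, density):
    """Map a risk-preference score and a traffic density (both in [0, 1]) to a cost limit."""
    pref = min(1.0, max(0.0, pref))
    density = min(1.0, max(0.0, density))
    num = den = 0.0
    for (p, d), level in RULES.items():
        # Firing strength of the rule: fuzzy AND implemented as min.
        w = min(tri(pref, PREF_CENTERS[p]), tri(density, DENSITY_CENTERS[d]))
        num += w * COST_LEVELS[level]
        den += w
    # With these memberships at least one rule always fires on [0, 1], so den > 0.
    return num / den
```

With these assumed levels, a fully conservative user in dense traffic (`cost_limit(0.0, 1.0)`) gets the smallest limit, and a fully aggressive user in light traffic (`cost_limit(1.0, 0.0)`) gets the largest, with intermediate inputs interpolating between neighboring rules.
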

| Symbol | Description | Value |
|---|---|---|
| | Discount factor | 0.99 |
| | Policy network learning rate | |
| | Critic network learning rate | |
| | Cost network learning rate | |
| | Temperature parameter learning rate | |
| | Initial Lagrangian multiplier | 1.0 |
| | Lagrangian multiplier learning rate | |
| | Replay buffer size | |
| | Batch size | 256 |

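
The two multiplier entries in the table above suggest a gradient-style update of the Lagrange multiplier. The sketch below shows a common projected dual-ascent form for Lagrangian-based safe RL; it is an assumption about the general technique, not the paper's exact rule. The function name `update_multiplier` and the learning rate used in the example are hypothetical, since the table does not preserve that value. Only the initial multiplier of 1.0 is taken from the table.

```python
def update_multiplier(lmbda, avg_cost, cost_limit, lr):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows when the average episode cost exceeds the cost limit
    and shrinks otherwise; the max(0, .) projection keeps it non-negative.
    """
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))

# Initial multiplier from the hyperparameter table; lr is a placeholder value.
lmbda = 1.0
lr = 0.1
for avg_cost in [0.5, 0.4, 0.1]:  # illustrative cost estimates, not measured data
    lmbda = update_multiplier(lmbda, avg_cost, cost_limit=0.2, lr=lr)
```

In a constrained actor-critic loop, the resulting multiplier would typically weight the cost critic against the reward critic in the policy objective, so a larger multiplier pushes the policy toward lower-cost actions.
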

| Method | Success Rate (%) ↑ | | | Collision Ratio ↓ | | | Average Cost ↓ | | | Average Time (s) ↓ | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | High | Medium | Low | High | Medium | Low | High | Medium | Low | High | Medium | Low |
| Dueling DQN [42] | 87.0 | 94.3 | 99.0 | 0.013 | 0.005 | 0.005 | 0.50 | 0.28 | 0.08 | 11.78 | 11.47 | 10.85 |
| SACD [33] | 94.5 | 97.5 | 99.2 | 0.010 | 0.008 | 0.005 | 0.44 | 0.25 | 0.10 | 11.59 | 11.33 | 10.82 |
| PPO [43] | 99.5 | 97.7 | 99.2 | 0.003 | 0.018 | 0.008 | 0.01 | 0.03 | 0.01 | 12.36 | 11.62 | 10.95 |
| RAPRL (ours) | 99.0 | 99.5 | 99.3 | 0.003 | 0.005 | 0.005 | 0.02 | 0.02 | 0.02 | 11.87 | 11.46 | 10.96 |


| PP | ASM | SC | Success Rate (%) ↑ | | | Collision Ratio ↓ | | | Average Cost ↓ | | | Average Time (s) ↓ | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | High | Medium | Low | High | Medium | Low | High | Medium | Low | High | Medium | Low |
| - | - | - | 94.5 | 97.5 | 99.2 | 0.010 | 0.008 | 0.005 | 0.44 | 0.25 | 0.10 | 11.59 | 11.33 | 10.82 |
| - | - | ✓ | 97.2 | 97.3 | 99.3 | 0.003 | 0.005 | 0.005 | 0.23 | 0.13 | 0.04 | 12.08 | 11.54 | 10.99 |
| - | ✓ | ✓ | 98.3 | 98.8 | 98.9 | 0.010 | 0.008 | 0.003 | 0.02 | 0.02 | 0.02 | 12.33 | 11.58 | 11.02 |
| ✓ | ✓ | ✓ | 99.0 | 99.5 | 99.3 | 0.003 | 0.005 | 0.005 | 0.02 | 0.02 | 0.02 | 11.87 | 11.64 | 10.96 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Teng, J.; Huang, W.; Yuan, S.; Hu, M.; Qin, H.; Li, Y.; Bian, Y.; Li, B. Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging. Machines 2026, 14, 605. https://doi.org/10.3390/machines14060605

