Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging

Teng, Jingjia; Huang, Wenjie; Yuan, Shijie; Hu, Manjiang; Qin, Hongmao; Li, Yang; Bian, Yougang; Li, Bai

doi:10.3390/machines14060605

by Jingjia Teng ¹, Wenjie Huang ¹, Shijie Yuan ¹, Manjiang Hu ¹, Hongmao Qin ¹, Yang Li ¹,*, Yougang Bian ¹ and Bai Li ²

¹ College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China

² School of Information & Electronic Engineering, East China Normal University, Shanghai 200241, China

* Author to whom correspondence should be addressed.

Machines 2026, 14(6), 605; https://doi.org/10.3390/machines14060605

Submission received: 21 April 2026 / Revised: 22 May 2026 / Accepted: 26 May 2026 / Published: 28 May 2026

(This article belongs to the Special Issue Optimization-Based Motion Planning & Control for Autonomous Driving in Dynamic Environments)

Abstract

Reinforcement learning (RL) has been widely utilized for decision-making in highway on-ramp merging scenarios. However, most existing methods incorporate safety through reward functions, which may allow autonomous vehicles to trade safety for higher cumulative rewards. Moreover, personalized human risk preferences are rarely considered, making the learned policies difficult to adapt to heterogeneous user-specific risk requirements and potentially resulting in overly conservative or insufficiently cautious behaviors. To address these issues, this paper proposes a Risk-Aware Personal Preference-Based Safe Reinforcement Learning framework (RAPRL) for autonomous decision-making in on-ramp merging scenarios. Specifically, the high-level decision-making problem is formulated as a constrained Markov decision process (CMDP), in which safety requirements are explicitly represented as constraints rather than reward terms. To enable personalized safety regulation, a fuzzy logic mechanism is developed to adaptively determine the constraint cost limit according to the driver’s risk preference and the surrounding traffic density. The resulting safe RL problem is solved using a Lagrangian-based soft actor-critic (SAC) algorithm. Furthermore, an Action Shielding Mechanism is designed to assess the potential risk of candidate actions before execution and replace unsafe or infeasible actions, thereby improving safety during both policy learning and execution. Theoretical analysis shows that the proposed shielding mechanism can reduce unsafe exploration and improve sample efficiency. Extensive simulations in on-ramp merging scenarios demonstrate that RAPRL effectively reduces safety violations while maintaining driving efficiency. Compared with the SAC Discrete method, the proposed method improves the success rate by 4.76% and reduces the collision ratio by 70%, indicating a better safety–efficiency trade-off.

Keywords:

reinforcement learning; driver preferences; constrained MDP; soft actor-critic; action shielding

1. Introduction

Merging into congested highway traffic remains a challenging task for autonomous vehicles (AVs), as the decision-making process must balance two conflicting objectives: safety and efficiency [1]. This trade-off becomes more complex in interactive traffic, where the ego connected and automated vehicle (CAV) must make merging decisions under uncertain responses from human-driven vehicles (HDVs) and vehicle-to-vehicle interactions [2]. Meanwhile, the desired behavior of the ego CAV is not uniform across users. A conservative, safety-conscious user may prefer larger safety margins and more cautious merging maneuvers, whereas an aggressive user may accept a higher level of risk to reduce merge delay and improve traffic efficiency. Therefore, diverse driving preferences should be explicitly considered in autonomous decision-making, especially when the vehicle must regulate its safety–efficiency trade-off in interactive traffic. To satisfy users’ risk expectations, the decision-making systems of AVs should incorporate user preferences and provide multiple behavioral options.

Reinforcement learning (RL) has been widely used for AV decision-making tasks [3,4]. However, most methods neglect user-specific risk preferences, so the learned policies may not align with an individual’s risk perception, which reduces user satisfaction and trust in the system [5]. Moreover, ensuring safety within RL remains challenging. Most existing approaches incorporate safety requirements into the reward function instead of enforcing them as explicit constraints, which may cause the agent to prioritize long-term rewards at the expense of safety [6]. Therefore, it is essential to develop a risk-aware reinforcement learning framework that explicitly incorporates personal risk preferences into the decision-making process of autonomous driving.

The key problem addressed in this paper is to learn an on-ramp merging policy that can explicitly regulate safety constraints, adapt the allowable risk level to user-specific preferences and traffic conditions, and prevent unsafe high-level actions from being executed. This problem involves three coupled aspects: constraint-based safety regulation during policy learning, preference-aware adjustment of the safety cost limit, and execution-level filtering of risky actions before they are applied to the vehicle. To address this problem, we propose a Risk-Aware Personal Preference-Based Safe Reinforcement Learning framework (RAPRL) for autonomous on-ramp merging, as shown in Figure 1. The high-level decision-making problem is first formulated as a constrained Markov decision process (CMDP), where safety requirements are represented as explicit cost constraints and the cost limit is adaptively determined according to human risk preference and traffic density, as shown in Figure 2. Based on this formulation, the resulting safe RL problem is solved using a Lagrangian-based soft actor-critic (SAC) algorithm. To further reduce unsafe exploration and prevent risky decisions from being executed, an MPC-based Action Shielding Mechanism (ASM) is introduced to pre-execute candidate RL actions, evaluate their predicted collision risk and feasibility, and replace unsafe or invalid actions before execution. After this shielding process, the selected action is sent to the low-level MPC module for trajectory planning and vehicle control. Finally, theoretical analysis and numerical simulations are conducted to evaluate the safety performance and learning efficiency of the proposed method. The main contributions are summarized as follows:

A risk-aware safe RL method based on CMDP is proposed for the highway on-ramp merging task, in which the individual’s risk preference is incorporated into the safety constraints to accommodate the safety expectations of users. The safety level of the RL policy can be adjusted by computing the cost limits of the CMDP constraints with fuzzy logic, based on user preferences and traffic density.
An Action Shielding Mechanism is built to mask out unsafe RL actions. We pre-execute the RL action with MPC and conduct collision checks with surrounding agents to determine whether the action is safe. A theoretical analysis shows the effectiveness of the shielding mechanism in terms of safety and sampling efficiency.
Numerical simulations at different traffic densities show that our method outperforms the baselines and improves safety without sacrificing traffic efficiency. Owing to the user preference-aware safety constraints and action shielding, risky behaviors are significantly reduced during the exploration stage of RL, enabling safer policy learning in interactive ramp-merging scenarios.

2. Related Works

2.1. RL-Based Approach

Reinforcement Learning (RL) has demonstrated significant potential in the complex decision-making tasks of autonomous driving, particularly in handling interaction and mixed traffic scenarios [7,8]. While these approaches effectively optimize driving policies for high-level objectives, standard RL algorithms primarily focus on maximizing cumulative rewards and often lack explicit mechanisms to guarantee safety constraints during the learning process. To address this limitation, the constrained Markov decision process (CMDP) framework was introduced to explicitly incorporate safety constraints. For instance, ref. [9] formulated unsignalized intersection navigation as a CMDP, solving it via a Safe-PPO algorithm, while [10] combined the Augmented Lagrangian method with PPO to enforce risk boundaries. However, CMDP-based methods typically treat safety as a statistical constraint or a soft penalty. As highlighted by [11], it remains difficult for these methods to strictly enforce hard safety constraints, leaving the agent prone to falling into unsafe regions during exploration. To mitigate the risk of unsafe actions, the Action Shielding Mechanism (ASM) has been developed to monitor and correct policy outputs [12]. Recent works have incorporated motion prediction into shielding modules to anticipate and mask actions that lead to collisions [13,14,15]. However, most existing shielding approaches rely on simplified kinematic predictions or immediate action masking, often overlooking the discrepancy between high-level decisions and the actual trajectories executed by low-level controllers. Furthermore, a theoretical justification is often absent.

Unlike existing studies, we pre-execute the RL action with MPC to acquire the predictions of the ego vehicle and then conduct collision checks with surrounding agents based on their motion predictions to determine whether the RL action is safe or not. The unsafe actions are replaced with safe ones, which can help the RL agent quickly learn to act safely. Moreover, we provide a theoretical justification that the Action Shielding Mechanism can enhance the safety performance and learning efficiency of RL.

2.2. Human Risk Perception in Decision-Making

Risk perception in driving has gained wide scientific interest over the years, with differences among individuals’ risk perception found to have a significant impact on the safety design of intelligent vehicles [16]. Many studies have used statistical methods to analyze driving risk, among which the most common method is to assess risk by analyzing driving behaviors. To understand the differences in risk perception between drivers, driving style has been studied. Ref. [17] proposes a universal driving risk model, which provides a new way of modeling the differences in driver’s risk perception at a physical level. Risk perception is also the main consideration in decision-making for driving safety. Ref. [18] uses the risk of each sample trajectory as the cost item and obtains a trajectory with minimal cost to guarantee the safety requirement. Ref. [19] proposes a model to describe the acceptable risk and shows how an accepted risk contributes to the decision-making of AVs at the maneuver level. Ref. [20] proposes a risk-aware decision-making framework to handle the epistemic uncertainty arising from training the prediction model on insufficient data.

Unlike previous works, we explicitly incorporate human risk perception into the CMDP framework to align high-level decisions with user expectations. Specifically, we employ a fuzzy logic controller that dynamically adjusts the CMDP cost limits based on individual risk preferences and real-time traffic density, thereby enabling adaptable safety levels.

2.3. Combination of RL and MPC

RL and Model Predictive Control (MPC) represent two distinct paradigms in autonomous decision-making. While RL excels in handling complex system dynamics and learning from interaction, it often struggles with generalization and safety guarantees in unseen environments. Conversely, MPC provides robust constraint handling but is limited by model accuracy and computational burden. Consequently, recent research has increasingly focused on hybrid architectures that leverage the strengths of both. For instance, ref. [21] formulated a dual MPC where an RL agent acts as an adaptive solver, and [22] utilized RL to estimate the value function within a parametric MPC framework. Others employ MPC to guide RL training; specifically, MPC has been used to reduce sample complexity [23] or approximate value functions for high-level objectives [24,25]. Furthermore, ref. [26] demonstrated the feasibility of embedding RL techniques directly into MPC-based policies. These studies collectively highlight the potential of RL–MPC integration in complex control tasks. However, most existing hybrid approaches focus primarily on performance optimization or value approximation, often overlooking the critical aspect of explicit safety constraints during the exploration phase.

Unlike the previous studies, we formulate the on-ramp merging problem within a CMDP framework, where safety violations are explicitly constrained via cost functions rather than implicitly handled through reward shaping. Moreover, we introduce an Action Shielding Mechanism that leverages MPC as a predictive safety filter to prevent unsafe actions during the exploration phase of RL. This design enables safety-aware learning under explicit constraints while maintaining compatibility with downstream MPC-based trajectory planning.

3. Problem Statement

This section formulates the decision-making process of the on-ramp merging problem as a hierarchical framework that consists of a CMDP for high-level maneuver decisions and MPC for low-level motion control.

3.1. Constrained Markov Decision Process

The high-level maneuver decision-making problem is formulated as a CMDP in this study, which specifies safety requirements as constraint terms. A CMDP is described by the tuple $\langle S, A, P, r, c, \eta, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability $p(s_{t+1} \mid s_{t}, a_{t})$ from state $s_{t}$ to the next state $s_{t+1}$ under the action $a_{t} \in A$, $r : S \times A \to \mathbb{R}$ is the reward function, $c : S \times A \to \mathbb{R}$ is the cost function, $\eta$ is the cost limit, and $\gamma \in [0, 1)$ is a discount factor.

3.1.1. State Space

Let $s^{i} = (o^{i}, x^{i}, y^{i}, v_{x}^{i}, v_{y}^{i})$ be the state of vehicle $i$, where $o^{i}$ is a binary variable indicating whether vehicle $i$ is observable, $x^{i}$ and $y^{i}$ are the longitudinal and lateral distances between vehicle $i$ and the ego vehicle, and $v_{x}^{i}$ and $v_{y}^{i}$ are the relative speeds between vehicle $i$ and the ego vehicle in the longitudinal and lateral directions. The complete state of the environment consists of all vehicles that are observed by the ego vehicle, and the state space $S$ is written as $S = (s^{1}, s^{2}, \dots, s^{n})$, where $n$ is the maximum number of observed vehicles on the main lane.
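The state layout above can be sketched as a flat feature vector. This is only an illustration: the helper name `build_state`, the zero padding for unobserved slots, the "other minus ego" sign convention, and the ordering of vehicles are assumptions that the text does not specify.

```python
from typing import List, Sequence, Tuple

# (x, y, vx, vy) in a common road frame; hypothetical input format.
Kinematics = Tuple[float, float, float, float]


def build_state(ego: Kinematics, others: Sequence[Kinematics], n: int) -> List[float]:
    """Concatenate s^i = (o, x, y, vx, vy) for up to n observed vehicles."""
    features: List[float] = []
    for k in range(n):
        if k < len(others):
            x, y, vx, vy = others[k]
            # Observable vehicle: o = 1 and quantities relative to the ego vehicle.
            features += [1.0, x - ego[0], y - ego[1], vx - ego[2], vy - ego[3]]
        else:
            # Empty slot: o = 0, remaining entries padded with zeros (assumption).
            features += [0.0, 0.0, 0.0, 0.0, 0.0]
    return features


state = build_state((100.0, 0.0, 20.0, 0.0), [(110.0, 4.0, 18.0, 0.0)], n=3)
```

A fixed-length vector of size 5n keeps the input dimension of the policy network constant regardless of how many vehicles are actually observed.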

3.1.2. Action Space

The high-level decision module outputs a discrete action: turning left or right, idling, accelerating, or decelerating. The action space $A$ is defined as

$$A = [A^{\triangleleft}, A^{\triangleright}, A^{\uparrow}, A^{\oslash}, A^{\downarrow}], \tag{1}$$

where $A^{\triangleleft}$, $A^{\triangleright}$, $A^{\uparrow}$, $A^{\oslash}$, and $A^{\downarrow}$ are turning left, turning right, acceleration, idling, and deceleration, respectively.

3.1.3. Reward

The reward function is defined considering safety, success ratio, and traffic efficiency, and is summarized as

$$r = r_{s} + r_{g} + r_{v}, \tag{2}$$

where $r$ is the total reward, consisting of the safety reward $r_{s}$, the reward $r_{g}$ for successfully merging and reaching the goal position, and the traffic-efficiency reward $r_{v}$. We set

$$r_{s} = \begin{cases} 0.05, & \text{if } \mathrm{TTC} \geq 2.5\ \mathrm{s}, \\ -1, & \text{otherwise}, \end{cases} \tag{3}$$

where TTC denotes the time to collision with surrounding vehicles. The term $r_{g}$ is defined as $r_{g} = r_{g1} + r_{g2}$, with $r_{g1} = 5$ awarded when the ego vehicle successfully merges into the target lane, and $r_{g2} = 10$ awarded when the ego vehicle reaches the designated goal position. The traffic-efficiency reward $r_{v}$ is defined from the ego vehicle speed and the traffic speed, and encourages the ego vehicle to drive at the average speed of the traffic flow:

$$r_{v} = \begin{cases} 0.1, & \text{if } |v_{ego} - v_{ave}| \leq \kappa v_{ave}, \\ -0.5, & \text{otherwise}, \end{cases} \tag{4}$$

where $v_{ego}$ is the speed of the ego vehicle, $v_{ave}$ is the average speed of traffic, and $\kappa$ is the coefficient of the average speed that sets the tolerated speed deviation.
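Equations (2) to (4) can be written as a small per-step function. The sketch below is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the merge and goal bonuses are treated as events on the current step, and the value `kappa=0.2` in the usage line is an arbitrary placeholder because the text does not report κ.

```python
def safety_reward(ttc_s: float) -> float:
    """Eq. (3): small bonus when time to collision is at least 2.5 s, else -1."""
    return 0.05 if ttc_s >= 2.5 else -1.0


def goal_reward(merged: bool, reached_goal: bool) -> float:
    """r_g = r_g1 + r_g2 with r_g1 = 5 (merge) and r_g2 = 10 (goal reached)."""
    return (5.0 if merged else 0.0) + (10.0 if reached_goal else 0.0)


def efficiency_reward(v_ego: float, v_ave: float, kappa: float) -> float:
    """Eq. (4): reward staying within kappa * v_ave of the average traffic speed."""
    return 0.1 if abs(v_ego - v_ave) <= kappa * v_ave else -0.5


def total_reward(ttc_s: float, merged: bool, reached_goal: bool,
                 v_ego: float, v_ave: float, kappa: float) -> float:
    """Eq. (2): r = r_s + r_g + r_v."""
    return (safety_reward(ttc_s)
            + goal_reward(merged, reached_goal)
            + efficiency_reward(v_ego, v_ave, kappa))


# Hypothetical step: safe gap, merge completed on this step, speed close to traffic.
r = total_reward(ttc_s=3.0, merged=True, reached_goal=False,
                 v_ego=20.0, v_ave=21.0, kappa=0.2)
```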

3.1.4. Cost

The cost $c$ covers three situations, illustrated in Figure 3, which are evaluated from the predicted states of the ego vehicle and surrounding objects: (a) the ego vehicle fails to merge before reaching the end of the road, (b) it collides with another vehicle, and (c) merging is difficult because the target lane is occupied. The cost function is written as

$$c = c_{1} + c_{2} + c_{3}, \tag{5}$$

where $c$ is the total cost, $c_{1}$ is the cost for failing to merge before reaching the end of the road, $c_{2}$ is the cost for colliding with other vehicles, and $c_{3}$ is the cost incurred when the target lane is occupied and merging is difficult.

Similar event-based cost designs have been used in reinforcement learning-based autonomous driving, where hazardous driving events are represented by penalty terms and assigned different weights according to their safety severity [27]. Inspired by this design principle, we define the CMDP cost in this study based on the following three safety-related events in the on-ramp merging task.

$$\begin{cases} c_{1} = 0.2, & \text{if the ego vehicle fails to merge}, \\ c_{2} = 2, & \text{if a collision occurs}, \\ c_{3} = 0.3, & \text{if } l_{occupied} = 1. \end{cases} \tag{6}$$

The cost values in Equation (6) are dimensionless penalty weights rather than physical risk quantities. They are selected according to the relative severity of the corresponding events. Collision is regarded as the most severe safety violation and is therefore assigned the largest cost. Failing to merge before the end of the ramp is treated as a task-level failure, while target-lane occupancy represents a local merging difficulty that may lead to unsafe interaction if the ego vehicle continues to merge. Accordingly, the cost values are designed to satisfy the severity ordering $c_{2} > c_{3} > c_{1}$, where $c_{2}$ strongly penalizes collision events, and $c_{1}$ and $c_{3}$ provide additional constraint signals for non-collision risks. The target-lane occupancy cost $c_{3}$ is set slightly larger than $c_{1}$ because it acts as an early warning signal for potential merging conflicts and may be triggered repeatedly during the merging process. These values are empirically selected in simulation to maintain a clear distinction between collision and non-collision events while keeping the cumulative cost within a suitable range for Lagrangian-based policy optimization.

In particular, we use the longitudinal position and relative speed to determine whether the target lane is occupied:

$$l_{oc}^{i} = \begin{cases} 1, & \text{if } |\Delta v_{x}^{i}| \leq 1.5 \text{ and } |\Delta x^{i}| \leq 5, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$

$$l_{occupied} = \max_{i} \, [l_{oc}^{i}], \tag{8}$$

where $l_{occupied}$ denotes the target-lane occupancy status, and $\Delta v_{x}^{i}$ and $\Delta x^{i}$ represent the longitudinal relative speed and distance of vehicle $i$, respectively. Specifically, $l_{occupied}$ is set to 1 only if a vehicle falls within the defined spatial and speed thresholds, identifying it as an immediate obstacle for merging.
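The event-based cost of Equations (5) to (8) can be sketched as follows. The function names and the `(dx, dv)` tuple format are hypothetical, and the event flags are assumed to be supplied by the simulator or by the state prediction; only the numeric weights and thresholds come from the equations above.

```python
from typing import Iterable, Tuple


def lane_occupied(rel_states: Iterable[Tuple[float, float]],
                  dv_max: float = 1.5, dx_max: float = 5.0) -> int:
    """Eqs. (7)-(8): 1 if any target-lane vehicle is within both thresholds.

    rel_states holds (dx, dv) pairs: longitudinal relative distance and speed.
    """
    flags = (1 if abs(dv) <= dv_max and abs(dx) <= dx_max else 0
             for dx, dv in rel_states)
    return max(flags, default=0)


def step_cost(failed_to_merge: bool, collided: bool,
              rel_states: Iterable[Tuple[float, float]]) -> float:
    """Eqs. (5)-(6): c = c1 + c2 + c3 with weights 0.2, 2 and 0.3."""
    c1 = 0.2 if failed_to_merge else 0.0
    c2 = 2.0 if collided else 0.0
    c3 = 0.3 if lane_occupied(rel_states) == 1 else 0.0
    return c1 + c2 + c3


# One nearby vehicle at similar speed blocks the gap; a distant one does not.
cost = step_cost(False, False, [(4.0, 1.0), (30.0, 0.2)])
```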

3.1.5. Problem Formulation

We formulate the decision-making problem as a CMDP to maximize cumulative rewards subject to safety constraints:

$$\begin{aligned} \max_{\pi} \quad & J_{\pi}^{R} = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^{t} r(s_{t}, a_{t}) \Big], \\ \text{s.t.} \quad & J_{\pi}^{C} = \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{\infty} \gamma^{t} c(s_{t}, a_{t}) \Big] \leq \eta, \end{aligned} \tag{9}$$

where $J_{\pi}^{R}$ and $J_{\pi}^{C}$ denote the expected return and cost, respectively, and $\eta$ is the cost limit reflecting risk preference. By introducing a Lagrange multiplier $\lambda \geq 0$, we recast this as an unconstrained min–max problem:

$$\max_{\pi} \min_{\lambda \geq 0} \big( J_{\pi}^{R} - \lambda (J_{\pi}^{C} - \eta) \big). \tag{10}$$

This Lagrangian formulation penalizes policies that violate the cost constraint ($J_{\pi}^{C} > \eta$), while maximizing the reward when the constraint is satisfied.
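To make the role of the multiplier concrete, the sketch below evaluates the Lagrangian of Equation (10) and applies one projected gradient step on λ. The update rule is a common choice in Lagrangian-based safe RL and is an assumption here, as are the function names and the step size; the text does not state how λ is updated.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Sample estimate of sum_t gamma^t * value_t for one trajectory."""
    return sum((gamma ** t) * v for t, v in enumerate(values))


def lagrangian(reward_return: float, cost_return: float,
               lam: float, eta: float) -> float:
    """Objective of Eq. (10): J^R - lambda * (J^C - eta)."""
    return reward_return - lam * (cost_return - eta)


def update_multiplier(lam: float, cost_return: float,
                      eta: float, lr: float) -> float:
    """Projected step for the inner minimization (assumed rule).

    lambda grows while J^C exceeds eta and shrinks otherwise,
    and is clipped at zero so that lambda >= 0 always holds.
    """
    return max(0.0, lam + lr * (cost_return - eta))


# Hypothetical numbers: the cost return 0.2 violates the limit eta = 0.05.
lam_next = update_multiplier(lam=0.5, cost_return=0.2, eta=0.05, lr=0.1)
```

When the constraint is satisfied for long enough, λ decays to zero and the objective reduces to the plain reward return.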

3.2. Human-Aligned Safety Cost Limits

As discussed previously, the cumulative cost $J_{\pi}^{C}$ is constrained by the cost limit $\eta$ in Equation (9), which determines the risk preference of the learned policy. Setting an appropriate value for $\eta$ is essential to match users’ safety preferences. In this study, we use fuzzy logic [28] to determine the value of the cost limit $\eta$ from the specified risk preference and the traffic density. We employ Mamdani inference [29] to accommodate human-semantic descriptors and encode them as interpretable rules, thereby deriving the safety cost limit from risk preference and traffic density. Unlike quantitative risk assessments that typically assume a single acceptable risk level, fuzzy logic allows the cost limit to adapt jointly to both preference and density.

In this study, we design the membership functions based on expert knowledge to map the fuzzy inputs (risk preference and traffic density) to the fuzzy output (cost limit), as illustrated in Figure 4. To make the membership functions explicit, let $p \in [0, 100]$ denote the normalized risk-preference variable, $\rho \in [0.5, 1]$ denote the normalized traffic-density variable, and $\eta \in [0, 0.1]$ denote the cost limit. The membership functions of risk preference are defined as

$$\begin{aligned} \mu_{con}(p) &= \begin{cases} 1, & 0 \leq p \leq 30, \\ \frac{50 - p}{20}, & 30 < p < 50, \\ 0, & 50 \leq p \leq 100, \end{cases} \\ \mu_{neu}(p) &= \begin{cases} 0, & 0 \leq p \leq 30, \\ \frac{p - 30}{20}, & 30 < p < 50, \\ \frac{70 - p}{20}, & 50 \leq p < 70, \\ 0, & 70 \leq p \leq 100, \end{cases} \\ \mu_{agg}(p) &= \begin{cases} 0, & 0 \leq p \leq 50, \\ \frac{p - 50}{20}, & 50 < p < 70, \\ 1, & 70 \leq p \leq 100. \end{cases} \end{aligned} \tag{11}$$

Similarly, the membership functions of traffic density are defined as

$$\begin{aligned} \mu_{low}(\rho) &= \begin{cases} 1, & \rho = 0.5, \\ \frac{0.7 - \rho}{0.2}, & 0.5 < \rho < 0.7, \\ 0, & 0.7 \leq \rho \leq 1, \end{cases} \\ \mu_{med}(\rho) &= \begin{cases} 0, & \rho = 0.5, \\ \frac{\rho - 0.5}{0.2}, & 0.5 < \rho < 0.7, \\ 1, & 0.7 \leq \rho \leq 0.8, \\ \frac{1 - \rho}{0.2}, & 0.8 < \rho < 1, \\ 0, & \rho = 1, \end{cases} \\ \mu_{high}(\rho) &= \begin{cases} 0, & 0.5 \leq \rho \leq 0.8, \\ \frac{\rho - 0.8}{0.2}, & 0.8 < \rho < 1, \\ 1, & \rho = 1. \end{cases} \end{aligned} \tag{12}$$

The membership functions of the cost limit are defined as

$$\begin{aligned} \mu_{small}(\eta) &= \begin{cases} 1, & 0 \leq \eta \leq 0.01, \\ \frac{0.05 - \eta}{0.04}, & 0.01 < \eta < 0.05, \\ 0, & 0.05 \leq \eta \leq 0.1, \end{cases} \\ \mu_{medium}(\eta) &= \begin{cases} 0, & 0 \leq \eta \leq 0.01, \\ \frac{\eta - 0.01}{0.04}, & 0.01 < \eta < 0.05, \\ \frac{0.09 - \eta}{0.04}, & 0.05 \leq \eta < 0.09, \\ 0, & 0.09 \leq \eta \leq 0.1, \end{cases} \\ \mu_{large}(\eta) &= \begin{cases} 0, & 0 \leq \eta \leq 0.05, \\ \frac{\eta - 0.05}{0.03}, & 0.05 < \eta < 0.08, \\ 1, & 0.08 \leq \eta \leq 0.1. \end{cases} \end{aligned} \tag{13}$$

These equations correspond to the membership curves shown in Figure 4a–c, and provide explicit mathematical definitions for the fuzzy inputs and output used in the Mamdani inference process. The above ranges and membership shapes are selected to support smooth safety–efficiency regulation in on-ramp merging. The normalized risk-preference range allows different user-specified risk tolerances to be represented on a common scale, while the normalized traffic-density range reflects the transition from sparse to dense merging conditions considered in the simulation. The cost-limit range is consistent with the sensitivity analysis of the CMDP constraint limit. For the membership shapes, shoulder functions are used for boundary categories, such as conservative/aggressive preference, low/high density, and small/large cost limit, whereas triangular or trapezoidal functions are used for intermediate categories. This overlapping design avoids abrupt changes in the cost limit when the input variables vary slightly, which is important for maintaining stable safety–efficiency tradeoffs during merging decisions.
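The input membership functions of Equations (11) and (12) translate directly into code. The sketch below is illustrative only; the function and dictionary names are hypothetical, and inputs outside the stated ranges are simply treated like the nearest boundary, which is an assumption.

```python
def mu_con(p: float) -> float:
    """Conservative preference, Eq. (11)."""
    if p <= 30:
        return 1.0
    return (50 - p) / 20 if p < 50 else 0.0


def mu_neu(p: float) -> float:
    """Neutral preference, Eq. (11): triangle peaking at p = 50."""
    if p <= 30 or p >= 70:
        return 0.0
    return (p - 30) / 20 if p < 50 else (70 - p) / 20


def mu_agg(p: float) -> float:
    """Aggressive preference, Eq. (11)."""
    if p <= 50:
        return 0.0
    return (p - 50) / 20 if p < 70 else 1.0


def mu_low(rho: float) -> float:
    """Low density, Eq. (12)."""
    if rho <= 0.5:
        return 1.0
    return (0.7 - rho) / 0.2 if rho < 0.7 else 0.0


def mu_med(rho: float) -> float:
    """Medium density, Eq. (12): trapezoid with plateau on [0.7, 0.8]."""
    if rho <= 0.5 or rho >= 1.0:
        return 0.0
    if rho < 0.7:
        return (rho - 0.5) / 0.2
    return 1.0 if rho <= 0.8 else (1.0 - rho) / 0.2


def mu_high(rho: float) -> float:
    """High density, Eq. (12)."""
    if rho <= 0.8:
        return 0.0
    return (rho - 0.8) / 0.2 if rho < 1.0 else 1.0


def fuzzify(p: float, rho: float) -> dict:
    """Membership degrees of both inputs in all six fuzzy sets."""
    return {
        "conservative": mu_con(p), "neutral": mu_neu(p), "aggressive": mu_agg(p),
        "low": mu_low(rho), "medium": mu_med(rho), "high": mu_high(rho),
    }


degrees = fuzzify(p=45, rho=0.57)
```

For instance, a preference of 45% is partly conservative (0.25) and mostly neutral (0.75), while a density of 0.57 is mostly low (about 0.65) and partly medium (about 0.35), so several rules fire at once and the output changes smoothly with the inputs.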

Consistent with established driving-style taxonomies in driving behavior research [30], the user’s risk preference is represented by three commonly used linguistic levels, i.e., conservative, neutral, and aggressive. As shown in Figure 4a, these levels are modeled as predefined fuzzy membership functions over the normalized risk-preference variable ranging from 0 to 100%. They are not inferred online from driver data in the current study; instead, the membership functions are specified based on prior studies and expert assumptions to systematically evaluate how different risk-tolerance levels affect the adaptive safety cost limit and the learned merging policy. In practical applications, the normalized risk-preference input can be obtained through several interfaces, such as a pre-driving questionnaire, a user-selected driving-style setting, or calibration from historical driving behavior. The corresponding membership degrees with respect to the conservative, neutral, and aggressive sets are then used as inputs to the subsequent fuzzy inference process. Therefore, the focus of this study is not to infer the user’s risk preference, but to translate a given preference into an adaptive CMDP cost limit for safe policy learning. Traffic density includes {‘low’, ‘medium’, ‘high’}, as shown in Figure 4b, and varies between 0.5 and 1.0, where larger values indicate denser traffic conditions with smaller inter-vehicle headways and more surrounding vehicles within the ego vehicle’s interaction range. The medium-density traffic is modeled with a trapezoidal membership function, reaching its maximum membership value at moderate density levels. The membership function of cost is shown in Figure 4c, including {‘small’, ‘medium’, ‘large’}, and the cost limits vary between 0 and 0.1.

After fuzzification, we can define fuzzy rules that map the fuzzy inputs (risk preference and traffic density) to the fuzzy output (cost limit). Fuzzy rules are a crucial aspect of designing a fuzzy logic system, and the rules are usually derived from expert knowledge or empirical data. In this study, the fuzzy rules are defined based on human knowledge, and Table 1 shows the fuzzy relations between cost limit, risk preference, and traffic density.

The rule table is designed to reflect the safety–efficiency trade-off in on-ramp merging. For a fixed traffic density, a higher risk preference is assigned a larger cost limit, because the ego vehicle is allowed to accept a slightly higher level of risk to reduce merge delay and improve traffic efficiency. Conversely, for a fixed risk preference, a higher traffic density is assigned a smaller cost limit, because dense traffic provides fewer acceptable merging gaps and requires stricter safety regulation. Therefore, the rule table follows a monotonic and interpretable design: the cost limit increases with user risk tolerance and decreases with traffic density. For example, an aggressive preference under low-density traffic leads to a large cost limit to encourage efficient merging, whereas a conservative preference under high-density traffic leads to a small cost limit to enforce cautious behavior. From Table 1, we can infer the fuzzy output of the cost limit given the fuzzy inputs of risk preference and traffic density. We use a matrix $\tilde{R}$ to represent the fuzzy rules, and the process of rule evaluation can be written as

$$\tilde{C} = \bigcup_{k=1}^{9} (\tilde{A}_{i} \times \tilde{B}_{j}) \circ \tilde{R}_{k}, \tag{14}$$

where $\tilde{C}$ is the fuzzy output, $\tilde{A}_{i}$ is the traffic density, $\tilde{B}_{j}$ is the risk preference, $i, j \in \{1, 2, 3\}$ are the indices of the fuzzy sets of traffic density and risk preference, respectively, $k$ indexes the nine rules formed by the combinations of $i$ and $j$, and the operator $\circ$ denotes the composition of fuzzy relations.

Based on the fuzzy rules, the fuzzy output of each rule can be computed. These fuzzy outputs are aggregated into a single fuzzy set and converted to a crisp output value using the centroid method. Figure 5 illustrates an example of aggregation and defuzzification in the Mamdani process. Suppose the traffic density and risk preference are 0.57 and 45%, so that the active input fuzzy sets are $\tilde{A} = \{\text{Low}, \text{Medium}\}$ and $\tilde{B} = \{\text{Conservative}, \text{Neutral}\}$. The fuzzy outputs for the small, medium, and large cost-limit sets, shown in Figure 5, are then 0.25, 0.35, and 0.65, respectively. With these values, we compute the grey area for each fuzzy set and aggregate them into a single fuzzy set, which is the union of the three grey areas. Finally, we take the centroid of this union to obtain a crisp value for the cost limit, which is 0.0595. Therefore, the cost limit is 0.0595 when the traffic density is 0.57 and the risk preference is 45%.
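The aggregation and centroid step of this example can be reproduced numerically. The sketch below takes the three firing strengths quoted above as given, clips the output membership functions of Equation (13) at those levels, takes their pointwise maximum, and computes the centroid on a uniform grid. The function names and the grid resolution are assumptions; under them the result lands close to the reported 0.0595, with the small gap plausibly due to discretization or implementation details.

```python
def mu_small(eta: float) -> float:
    """Small cost limit, Eq. (13)."""
    if eta <= 0.01:
        return 1.0
    return (0.05 - eta) / 0.04 if eta < 0.05 else 0.0


def mu_medium(eta: float) -> float:
    """Medium cost limit, Eq. (13): triangle peaking at eta = 0.05."""
    if eta <= 0.01 or eta >= 0.09:
        return 0.0
    return (eta - 0.01) / 0.04 if eta < 0.05 else (0.09 - eta) / 0.04


def mu_large(eta: float) -> float:
    """Large cost limit, Eq. (13)."""
    if eta <= 0.05:
        return 0.0
    return (eta - 0.05) / 0.03 if eta < 0.08 else 1.0


def centroid_cost_limit(w_small: float, w_medium: float, w_large: float,
                        steps: int = 10000) -> float:
    """Mamdani aggregation (clip, then max) followed by centroid defuzzification."""
    num = den = 0.0
    for k in range(steps + 1):
        eta = 0.1 * k / steps  # grid over the cost-limit range [0, 0.1]
        mu = max(min(w_small, mu_small(eta)),
                 min(w_medium, mu_medium(eta)),
                 min(w_large, mu_large(eta)))
        num += eta * mu
        den += mu
    return num / den if den > 0 else 0.0


# Firing strengths taken from the worked example (density 0.57, preference 45%).
eta_star = centroid_cost_limit(0.25, 0.35, 0.65)
```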

4. Model Predictive Control

This section introduces the MPC formulation used for vehicle control, including the linearized discrete-time kinematic bicycle model, the prediction of future states over a finite horizon, and the quadratic optimization of control inputs under constraints.

4.1. Discrete Linear Model

In this study, we use the kinematic bicycle model in MPC [31]. As illustrated in Figure 6, the vehicle state is written as $X = [x, y, v, \varphi]^{\top}$, where $x$ and $y$ denote the longitudinal and lateral positions, and $v$ and $\varphi$ are the vehicle speed and the yaw angle. The side-slip angle, the wheelbase, and the distances from the front and rear axles to the center of gravity are denoted as $\beta$, $l$, $l_{f}$, and $l_{r}$, respectively. The control variable is denoted as $U = [u_{1}, u_{2}]^{\top}$, where $u_{1}$ and $u_{2}$ are the acceleration $a$ and the steering angle $\delta$, respectively. The reference state and the reference control are defined as $[x_{r}, y_{r}, v_{r}, \varphi_{r}]^{\top}$ and $[a_{r}, \delta_{r}]^{\top}$. Linearizing the nonlinear kinematics around the reference gives the linear error model

$$\dot{X} = f(X, U) = \begin{bmatrix} v \cos(\varphi + \beta) \\ v \sin(\varphi + \beta) \\ a \\ \frac{v}{l} \sin\beta \end{bmatrix}, \tag{15}$$

$$\beta = \arctan\left( \frac{l_{r}}{l_{f} + l_{r}} \tan\delta \right), \tag{16}$$

$$\dot{e}_{X} = A_{F} e_{X} + B_{G} e_{U}, \tag{17}$$

$$A_{F} = \begin{bmatrix} 0 & 0 & \cos(\varphi_{r} + \beta_{r}) & -v_{r} \sin(\varphi_{r} + \beta_{r}) \\ 0 & 0 & \sin(\varphi_{r} + \beta_{r}) & v_{r} \cos(\varphi_{r} + \beta_{r}) \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\sin\beta_{r}}{l} & 0 \end{bmatrix}, \tag{18}$$

$$B_{G} = \begin{bmatrix} 0 & \frac{-v_{r} \sin(\varphi_{r} + \beta_{r}) \, l_{r} (l_{f} + l_{r})}{(l_{f} + l_{r})^{2} \cos^{2}\delta_{r} + l_{r}^{2} \sin^{2}\delta_{r}} \\ 0 & \frac{v_{r} \cos(\varphi_{r} + \beta_{r}) \, l_{r} (l_{f} + l_{r})}{(l_{f} + l_{r})^{2} \cos^{2}\delta_{r} + l_{r}^{2} \sin^{2}\delta_{r}} \\ 1 & 0 \\ 0 & \frac{\frac{v_{r}}{l} \cos\beta_{r} \, l_{r} (l_{f} + l_{r})}{(l_{f} + l_{r})^{2} \cos^{2}\delta_{r} + l_{r}^{2} \sin^{2}\delta_{r}} \end{bmatrix}, \tag{19}$$

where $A_{F} \in \mathbb{R}^{4 \times 4}$ and $B_{G} \in \mathbb{R}^{4 \times 2}$ are the Jacobian matrices of the function $f$ with respect to the state $X$ and the control $U$, respectively, evaluated at the reference. The state error between the predicted state and the reference state is denoted as $e_{X} \in \mathbb{R}^{4}$, and the control error between the predicted control and the reference control is denoted as $e_{U} \in \mathbb{R}^{2}$. We discretize the linear model as

$$e_{X}(k) = A_{k} e_{X}(k - 1) + B_{k} e_{U}(k - 1), \tag{20}$$

where $e_{X}(k)$ is the state error at time $k$, $e_{X}(k - 1)$ is the state error at time $k - 1$, and $e_{U}(k - 1)$ is the control error at time $k - 1$. The system matrix $A_{k}$ and the input matrix $B_{k}$ are defined as

$$A_{k} = A_{F} \cdot \Delta t + I, \tag{21}$$

$$B_{k} = B_{G} \cdot \Delta t, \tag{22}$$

where $\Delta t$ is the time interval. We define the predicted state error vector $E_{X}$ and the control error vector $E_{U}$ as

$$E_{X} = [e_{X}^{\top}(k + 1), \dots, e_{X}^{\top}(k + N)]^{\top}, \tag{23}$$

$$E_{U} = [e_{U}^{\top}(k), \dots, e_{U}^{\top}(k + N - 1)]^{\top}, \tag{24}$$

where $N$ is the prediction horizon. Then, the optimization problem can be formulated as

\begin{matrix} \begin{matrix} \min & E_{X}^{⊤} W_{X} E_{X} + E_{U}^{⊤} W_{U} E_{U}, \end{matrix} \end{matrix}

(25)

where

W_{X}

and

W_{U}

are the diagonal matrices representing the weight of the state and control errors respectively. The constraints on the control variables are described as

\begin{matrix} \begin{matrix} - {\tilde{a}}_{\max} \leq u_{1} (k + i) \leq {\tilde{a}}_{\max}, i = 0, 1, \dots, N - 1, \\ - δ_{\max} \leq u_{2} (k + i) \leq δ_{\max}, i = 0, 1, \dots, N - 1, \end{matrix} \end{matrix}

(26)

where the max acceleration

{\tilde{a}}_{\max}

and the max steering angle

δ_{\max}

are the upper bounds of the control variables, which are usually chosen to ensure driving comfort and tracking performance. This optimization problem is a quadratic programming (QP) problem that can be solved efficiently by a convex optimization solver, e.g., CVXOPT. The QP solver is denoted as ‘QP.solver’ at line 8 of Algorithm 1 and solves the optimization problem defined in Equation (25) subject to the constraints in Equation (26). The solution to this optimization problem is

{[e_{U}^{⊤} (k), \dots, e_{U}^{⊤} (k + N - 1)]}^{⊤}

, and we take the first element

e_{U}^{⊤} (k)

as the optimal control.
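The linearization and discretization steps in Equations (15)–(22) can be sketched numerically as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the wheelbase is assumed to equal l_f + l_r, and the Jacobians A_F and B_G are approximated by central finite differences rather than taken from the closed-form expressions.

```python
import numpy as np

def side_slip(delta, l_f, l_r):
    """Side-slip angle beta as a function of the steering angle, Equation (16)."""
    return np.arctan(l_r / (l_f + l_r) * np.tan(delta))

def f(X, U, l_f, l_r):
    """Kinematic model of Equation (15) with X = [x, y, v, phi], U = [a, delta]."""
    _, _, v, phi = X
    a, delta = U
    beta = side_slip(delta, l_f, l_r)
    l = l_f + l_r  # assumption: wheelbase l equals l_f + l_r
    return np.array([v * np.cos(phi + beta),
                     v * np.sin(phi + beta),
                     a,
                     v / l * np.sin(beta)])

def jacobians(X_r, U_r, l_f, l_r, eps=1e-6):
    """Central-difference approximations of A_F = df/dX and B_G = df/dU."""
    X_r = np.asarray(X_r, dtype=float)
    U_r = np.asarray(U_r, dtype=float)
    A_F = np.zeros((4, 4))
    B_G = np.zeros((4, 2))
    for j in range(4):
        d = np.zeros(4)
        d[j] = eps
        A_F[:, j] = (f(X_r + d, U_r, l_f, l_r) - f(X_r - d, U_r, l_f, l_r)) / (2 * eps)
    for j in range(2):
        d = np.zeros(2)
        d[j] = eps
        B_G[:, j] = (f(X_r, U_r + d, l_f, l_r) - f(X_r, U_r - d, l_f, l_r)) / (2 * eps)
    return A_F, B_G

def discretize(A_F, B_G, dt):
    """Forward-Euler discretization, Equations (21) and (22)."""
    return A_F * dt + np.eye(4), B_G * dt

# Hypothetical reference point and vehicle geometry, chosen only for illustration.
A_F, B_G = jacobians([0.0, 0.0, 20.0, 0.1], [0.0, 0.05], l_f=1.2, l_r=1.6)
A_k, B_k = discretize(A_F, B_G, dt=0.1)
```

The resulting A_k and B_k are the matrices that propagate the error dynamics in Equation (20); a finite-difference Jacobian is also a convenient way to cross-check closed-form entries of A_F and B_G.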

Algorithm 1 MPC and States Prediction

Input: $s_{t}$, $a_{t}$, $a_{t}^{⊙}$, QP.solver. MPC is used for state prediction (MPC.predict) and control execution (MPC.execute).

1: if MPC.predict then
2:   $v_{r} = g (a_{t})$ by (28) ▹ execute the raw action
3: else if MPC.execute then
4:   $v_{r} = g (a_{t}^{⊙})$ by (28) ▹ execute the replaced action
5: end if
6: Get $s_{r}$ by (27) ▹ reference positions
7: $(x_{r}, y_{r}) \leftarrow (s_{r}, l_{r})$ by coordinate transform
8: $U^{*} \leftarrow QP . solver (x_{r}, y_{r}, v_{r})$ ▹ solve the optimal control
9: Get $e_{X}$ by (20) ▹ state errors
10: Get $x_{e}^{σ}$ by (29) ▹ predicted states
11: if MPC.predict then
12:   return $x_{e}^{σ}$
13: else if MPC.execute then
14:   return $U^{*}$
15: end if

Output: $x_{e}^{σ}$ or $U^{*}$ ▹ predicted states or optimal control

4.2. States Computation

In this study, we compute the state error

e_{X} (k)

at each time k based on the predicted positions for the entire time horizon, as shown in Algorithm 1. We first need to compute reference positions in the Frenet frame to acquire the predicted positions and then convert them back to the Cartesian coordinates. The reference positions are the waypoints along the reference line in the Frenet frame [32]. In the first step, the reference position

s_{r}

in the Frenet frame is derived by

x_{r}

, which is the orthogonal projection point of x relative to the reference path. In the next step, the relationship between

x_{r}

and

s_{r}

is already known. For example, if the reference path is a long straight road, then the relationship can be expressed as

x_{r} = s_{r}

. It is assumed that the reference line is the center line of the target lane. The target lane is located on the left lane if the decision is to turn left, and vice versa. Therefore, the reference positions in the Frenet frame can be computed as

s_{r} (k) = s_{r} (k - 1) + v_{r} Δ t,

(27)

where

s_{r} (k)

is the reference position at time k, and

v_{r}

is the reference speed that is determined by the decision

a_{t}

, which is updated as

v_{r} = \{\begin{matrix} v_{t} + Δ v & if a_{t} = A^{↑} \\ v_{t} - Δ v & if a_{t} = A^{↓} \\ v_{t} & else \end{matrix},

(28)

where

v_{t}

is the current speed of the ego vehicle,

a_{t}

is the high-level decision determined by the RL agent, and the action space A is defined as

A = [A^{◃}, A^{▹}, A^{↑}, A^{⊘}, A^{↓}]

, which are turning left, turning right, acceleration, idling and deceleration, respectively. In particular, we execute the RL decision in MPC for motion prediction of the ego vehicle, and the predicted position of the ego vehicle

x_{e}^{σ}

is written as

x_{e}^{σ} = x_{r} (σ) + e_{X}^{x} (σ),

(29)

where

σ \in [1, N]

is the prediction coefficient,

x_{r} (σ)

is the reference longitudinal position in the Cartesian coordinates, and

e_{X}^{x} (σ)

is the longitudinal position error computed by Equation (20).
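The reference-speed rule in Equation (28) and the reference-position recursion in Equation (27) can be illustrated with a short sketch. The enum and function names below are hypothetical, as are the numerical values of Δv, Δt, and the horizon; only the update rules themselves come from the text.

```python
from enum import Enum

class Action(Enum):
    """High-level decisions of the action space A (names are illustrative)."""
    LEFT = 0
    RIGHT = 1
    ACCELERATE = 2
    IDLE = 3
    DECELERATE = 4

def reference_speed(v_t, action, dv):
    """Equation (28): speed up, slow down, or keep the current speed."""
    if action is Action.ACCELERATE:
        return v_t + dv
    if action is Action.DECELERATE:
        return v_t - dv
    return v_t

def reference_positions(s0, v_r, dt, horizon):
    """Equation (27): s_r(k) = s_r(k - 1) + v_r * dt over the prediction horizon."""
    positions = []
    s = s0
    for _ in range(horizon):
        s = s + v_r * dt
        positions.append(s)
    return positions

# Hypothetical values: 20 m/s ego speed, dv = 2 m/s, dt = 0.5 s, three steps.
v_r = reference_speed(20.0, Action.ACCELERATE, dv=2.0)
waypoints = reference_positions(0.0, 20.0, dt=0.5, horizon=3)
```

For a straight reference path, where x_r = s_r, these Frenet waypoints can be used directly as the Cartesian longitudinal references in Equation (29).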

5. Safe Reinforcement Learning

This section introduces the safe RL algorithm, for which we use a discrete version of the SAC algorithm to solve the CMDP problem. Also, the ASM is built to check the safety of the RL action and replace it when necessary based on the predicted states provided by the MPC module.

5.1. Lagrangian-Based Discrete SAC

Safe RL is employed for the high-level behavioral decision in this study, i.e., to determine the optimal driving decision from a discrete action space A. A discrete action version of the SAC algorithm, i.e., SAC-Discrete (SACD) [33], is used to solve the CMDP problem.

5.1.1. Critic Network and Policy Network

Let

{\overset{˚}{a}}^{1}, {\overset{˚}{a}}^{2}, \dots, {\overset{˚}{a}}^{| A |}

be the discrete actions in the action space, i.e.,

A = {\{{\overset{˚}{a}}^{h}\}}_{h = 1}^{| A |}

,

| A |

is the size of the action space. A critic network and policy network are used in this study, as shown in Figure 7. The soft Q-function of the discrete SAC outputs a vector of size

| A |

that consists of the Q-value of each action, i.e.,

q_{ω} : S \to R^{| A |}

:

q_{ω} (s_{t}) = {[Q_{ω} (s_{t}, {\overset{˚}{a}}^{1}), Q_{ω} (s_{t}, {\overset{˚}{a}}^{2}), \dots, Q_{ω} (s_{t}, {\overset{˚}{a}}^{| A |})]}^{T},

(30)

where

q_{ω} (s_{t})

is the soft Q-function parameterized by

ω

, and

Q_{ω} (s_{t}, {\overset{˚}{a}}^{h})

is the state-action value at state

s_{t}

with the discrete action

{\overset{˚}{a}}^{h} (h = 1, 2, \dots, | A |)

. Likewise, the policy can directly output the action distribution, which outputs a vector that consists of the probability of each action at state

s_{t}

, i.e.,

π_{θ} : S \to {[0, 1]}^{| A |}

:

π_{θ} (s_{t}) = {[π_{θ} ({\overset{˚}{a}}^{1} | s_{t}), π_{θ} ({\overset{˚}{a}}^{2} | s_{t}), \dots, π_{θ} ({\overset{˚}{a}}^{| A |} | s_{t})]}^{T},

(31)

where

θ

denotes the policy network parameters (see Figure 7),

π_{θ} ({\overset{˚}{a}}^{h} | s_{t})

is the probability of the action

{\overset{˚}{a}}^{h} (h = 1, 2, \dots, | A |)

conditioned on the state

s_{t}

. In the discrete action settings, the soft state-value function is computed as

V (s_{t}) = π_{θ} {(s_{t})}^{T} [q_{ω} (s_{t}) - ξ \log π_{θ} (s_{t})] .

(32)

where

ξ

is the temperature parameter that determines the relative importance of the entropy term versus the reward term. According to [33], the policy network parameters

θ

can be learned by minimizing the following loss function:

J_{π} (θ) = \underset{s_{t} \sim D}{E} [π_{θ} {(s_{t})}^{⊤} (ξ \log π_{θ} (s_{t}) - q_{ω} (s_{t}))] .

(33)

The loss function of the temperature parameter is defined as

G (ξ) = π_{θ} {(s_{t})}^{⊤} [- ξ (\log π_{θ} (s_{t}) + \bar{H})],

(34)

where

\bar{H}

is a hyperparameter that represents the target entropy. As shown in Figure 7, the critic network parameter

ω

of the soft Q-function can be learned by minimizing the soft Bellman residual:

\begin{matrix} J_{q} (ω) = \underset{s_{t}, a_{t} \sim D}{E} [\frac{1}{2} {(Q_{ω} (s_{t}, a_{t}) - \bar{Q} (s_{t}, a_{t}))}^{2}], \end{matrix}

(35)

\begin{matrix} \bar{Q} (s_{t}, a_{t}) = r (s_{t}, a_{t}) + γ E_{s_{t + 1}} [V_{\bar{ω}} (s_{t + 1})], \end{matrix}

(36)

where

Q_{ω} (s_{t}, a_{t})

is the estimated Q function, and

\bar{Q} (s_{t}, a_{t})

is the target soft Q function. In particular,

V_{\bar{ω}} (s_{t + 1})

is estimated using a target Q network:

V_{\bar{ω}} (s_{t + 1}) = π_{θ} {(s_{t + 1})}^{⊤} [q_{\bar{ω}} (s_{t + 1}) - ξ \log π_{θ} (s_{t + 1})],

(37)

where

\bar{ω}

is the parameter of the target Q network, which is updated in Line 18 in Algorithm 2.
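Because the action space is discrete, the expectations over actions in Equations (32)–(34) reduce to inner products between the policy vector and the Q-value vector. The following sketch evaluates these quantities for a single state using NumPy. The function names are hypothetical, and the policy vector is assumed to be strictly positive (as produced by a softmax layer) so that the logarithm is finite.

```python
import numpy as np

def soft_state_value(pi, q, xi):
    """Equation (32): V(s) = pi(s)^T [q(s) - xi * log pi(s)]."""
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(pi @ (q - xi * np.log(pi)))

def policy_loss(pi, q, xi):
    """Per-state term inside the expectation of Equation (33)."""
    pi = np.asarray(pi, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(pi @ (xi * np.log(pi) - q))

def temperature_loss(pi, xi, target_entropy):
    """Per-state temperature loss of Equation (34)."""
    pi = np.asarray(pi, dtype=float)
    return float(pi @ (-xi * (np.log(pi) + target_entropy)))

# Hypothetical example: uniform policy over |A| = 5 actions with equal Q-values.
pi = np.full(5, 0.2)
q = np.full(5, 2.0)
v = soft_state_value(pi, q, xi=0.2)
```

In this example the soft value equals the common Q-value plus ξ log|A|, i.e., the entropy bonus of a uniform policy, and the per-state policy loss is its negative.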

Algorithm 2 Human-aligned safe RL

Input: MPC, ASM, $σ$, N
Input: initialize $θ, ω_{1, 2}, ω_{c}, \bar{ω}, {\bar{ω}}_{c}, λ, ξ, Γ$ and $D \leftarrow \emptyset$
1: for each iteration do
2:   for each environment step do
3:     $a_{t} \sim π_{θ} (\cdot ∣ s_{t})$
4:     $x_{e}^{σ} \leftarrow MPC . predict (s_{t}, a_{t})$ by Algorithm 1
5:     $a_{t}^{⊙} \leftarrow ASM (a_{t}, x_{e}^{σ}, σ, N)$ by Algorithm 3
6:     $s_{t + 1} \leftarrow MPC . execute (s_{t}, a_{t}^{⊙})$ by Algorithm 1
7:     $r_{t} \leftarrow r (s_{t}, a_{t}^{⊙})$
8:     $c_{t} \leftarrow c (s_{t}, a_{t}^{⊙})$
9:     $D \leftarrow D \cup \{(s_{t}, a_{t}^{⊙}, r_{t}, c_{t}, s_{t + 1})\}$
10:   end for
11:   for each gradient step do
12:     $λ \leftarrow λ - α_{λ} \nabla_{λ} J (λ)$
13:     $ω_{i} \leftarrow ω_{i} - α_{ω} \nabla_{ω_{i}} J_{q} (ω_{i})$ for $i = 1, 2$
14:     $ω_{c} \leftarrow ω_{c} - α_{ω_{c}} \nabla_{ω_{c}} J_{q_{c}} (ω_{c})$
15:     $θ \leftarrow θ - α_{θ} \nabla_{θ} J_{π}^{λ} (θ)$
16:     $ξ \leftarrow ξ - α_{ξ} \nabla_{ξ} G (ξ)$
17:     if soft update then
18:       $\bar{ω} \leftarrow Γ ω + (1 - Γ) \bar{ω}$
19:       $\bar{ω_{c}} \leftarrow Γ ω_{c} + (1 - Γ) \bar{ω_{c}}$
20:     end if
21:   end for
22: end for
Output: Optimal policy $π^{*}$

Algorithm 3 Action Shielding Mechanism

Input: $a_{t}$, $x_{e}^{σ}$, $x_{obj}$, $σ$, N
1: if $a_{t}$ is $A^{◃}$ then
2:   for each vehicle i do
3:     $x_{i} (N) = x_{i} (0) + N \cdot v_{i} Δ t$
4:     if $x_{i} (0) \leq x_{e}^{σ} \leq x_{i} (N)$ and $| y_{e}^{σ} (A^{◃}) - y_{i} | \leq d_{y}$ then
5:       $a_{t}^{⊙} = A^{↓}$ ▹ Situation 1
6:     end if
7:   end for
8: end if
9: if $a_{t}$ is $A^{▹}$ then
10:   if ego vehicle has merged then
11:     $a_{t}^{⊙} = A^{⊘}$ ▹ Situation 2
12:   end if
13: end if
14: if $a_{t}$ is $A^{↑}$ or $A^{⊘}$ then
15:   if $| x_{e}^{σ} (a) - x_{obj} | \leq d_{x}$ then
16:     $a_{t}^{⊙} = A^{↓}$ ▹ Situation 3
17:   end if
18: end if
Output: Safe decision $a_{t}^{⊙}$

5.1.2. Cost Network

A cost network parameterized by

ω_{c}

is built to approximate the cost value function

q_{ω_{c}} (s_{t})

, as shown in Figure 7. The cost value function

q_{ω_{c}} (s_{t})

is defined as

q_{ω_{c}} (s_{t}) = {[Q_{ω_{c}} (s_{t}, {\overset{˚}{a}}^{1}), Q_{ω_{c}} (s_{t}, {\overset{˚}{a}}^{2}), \dots, Q_{ω_{c}} (s_{t}, {\overset{˚}{a}}^{| A |})]}^{T},

(38)

where

Q_{ω_{c}} (s_{t}, {\overset{˚}{a}}^{h})

is the soft cost value at state

s_{t}

with the discrete action

{\overset{˚}{a}}^{h} (h = 1, 2, \dots, | A |)

. Similarly, the loss function of the cost network is defined as

\begin{matrix} J_{q_{c}} (ω_{c}) = \underset{s_{t}, a_{t} \sim D}{E} [\frac{1}{2} {(Q_{ω_{c}} (s_{t}, a_{t}) - {\bar{Q}}_{c} (s_{t}, a_{t}))}^{2}], \end{matrix}

(39)

where

Q_{ω_{c}} (s_{t}, a_{t})

is the cost value at state

s_{t}

and action

a_{t}

and

{\bar{Q}}_{c} (s_{t}, a_{t})

is the target cost value:

\begin{matrix} {\bar{Q}}_{c} (s_{t}, a_{t}) & = c (s_{t}, a_{t}) + γ E_{s_{t + 1}} [V_{{\bar{ω}}_{c}} (s_{t + 1})], \end{matrix}

(40)

\begin{matrix} V_{{\bar{ω}}_{c}} (s_{t + 1}) & = π_{θ} {(s_{t + 1})}^{⊤} q_{{\bar{ω}}_{c}} (s_{t + 1}), \end{matrix}

(41)

where the target cost function

q_{{\bar{ω}}_{c}}

is approximated by the target cost network with parameter

{\bar{ω}}_{c}

, which is updated in Line 19 in Algorithm 2.

5.1.3. n-Step TD Learning

In traditional temporal difference (TD) learning, the agent updates its value estimate toward a target built from the immediate reward and the estimated value of the next state. n-step TD learning generalizes this idea by looking ahead n steps into the future, so that the target is based on the discounted sum of rewards over the next n steps plus a bootstrapped value estimate. In this study, we use n-step TD learning, which is expected to strike a balance between the short-term perspective of one-step TD learning and the longer-term perspective of Monte Carlo methods. By adjusting the parameter n, the trade-off between bias and variance in the learning process can be tuned. We use n-step transitions to approximate both the target Q-value and the target cost value. According to [34], the target Q-value in Equation (36) and the target cost value in Equation (40) are rewritten as

\begin{matrix} \bar{Q} (s_{t}, a_{t}) = \sum_{j = 0}^{n - 1} γ^{j} r (s_{t + j}, a_{t + j}) + γ^{n} E_{s_{t + n}} [V_{\bar{ω}} (s_{t + n})], \end{matrix}

(42)

\begin{matrix} {\bar{Q}}_{c} (s_{t}, a_{t}) = \sum_{j = 0}^{n - 1} γ^{j} c (s_{t + j}, a_{t + j}) + γ^{n} E_{s_{t + n}} [V_{\bar{ω_{c}}} (s_{t + n})], \end{matrix}

(43)

where n is the parameter that determines the number of steps that we want to look ahead before updating the Q-function.
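The n-step targets in Equations (42) and (43) share the same structure, so a single helper can serve both the reward and the cost critic. The sketch below is illustrative only: the function name is hypothetical, and the bootstrap value stands in for the target-network estimate of the state value at step t + n.

```python
def n_step_target(signals, bootstrap_value, gamma):
    """Discounted n-step target.

    signals: the n rewards (Equation (42)) or costs (Equation (43))
             collected at steps t, ..., t + n - 1.
    bootstrap_value: target-network value estimate at state s_{t+n}.
    """
    target = 0.0
    for j, x in enumerate(signals):
        target += gamma ** j * x
    return target + gamma ** len(signals) * bootstrap_value

# Hypothetical numbers: three unit rewards, bootstrap value 10, gamma = 0.5.
three_step = n_step_target([1.0, 1.0, 1.0], 10.0, 0.5)   # 1 + 0.5 + 0.25 + 0.125 * 10
one_step = n_step_target([2.0], 4.0, 0.5)                # reduces to Equation (36)
```

With n = 1 the helper recovers the one-step targets of Equations (36) and (40), which makes the role of n as a bias-variance knob explicit.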

5.1.4. Lagrange Multiplier

Since a cost network is available, the loss function of the policy network is modified by adding the cost term weighted by the Lagrange multiplier:

J_{π}^{λ} (θ) = J_{π} (θ) + \underset{s_{t} \sim D}{E} [λ π_{θ} {(s_{t})}^{T} q_{ω_{c}} (s_{t})],

(44)

where

λ

is the Lagrange multiplier. The gradient of the modified loss with respect to the policy parameters is

\nabla_{θ} J_{π}^{λ} (θ) = \frac{\partial J_{π} (θ)}{\partial θ} + λ {\frac{\partial π_{θ} (s_{t})}{\partial θ}}^{T} q_{ω_{c}} (s_{t}) .

(45)

We use the state-action cost value to approximate the expected cumulative cost

J_{π}^{C}

,

J_{π}^{C} = Q_{ω_{c}} (s_{t}, a_{t})

(46)

The loss function of the Lagrange multiplier is denoted as

J (λ) = \underset{s_{t} \sim D}{E} [- λ (Q_{ω_{c}} (s_{t}, a_{t}) - η)],

(47)

where

J (λ)

is the loss for Lagrange multiplier

λ

.

Overall, the proposed human-aligned safe RL is described in Algorithm 2. The raw action

a_{t}

sampled from the policy

π_{θ} (\cdot | s_{t})

(Line 3) is first sent to MPC for motion prediction (MPC.predict, Line 4) and then to the ASM for a safety check (Line 5). The replaced action

a_{t}^{⊙}

provided by the ASM is then executed by the low-level MPC controller (MPC.execute, Line 6), and the corresponding reward and cost are collected for training (Lines 7–8). We use two critic networks parameterized by

ω_{i} (i = 1, 2)

to avoid the overestimation of the soft Q-value [35]. In addition, the target soft Q value and cost value are updated based on the parameters

\bar{ω}

and

\bar{ω_{c}}

.
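The interplay between the Lagrangian policy loss in Equation (44) and the multiplier loss in Equation (47) can be sketched for a single batch as follows. The function names are hypothetical, and clipping the multiplier at zero is an assumption made here to keep λ non-negative; Algorithm 2 itself only states the gradient step.

```python
import numpy as np

def lagrangian_policy_loss(pi, q, q_c, xi, lam):
    """Per-state Equation (44): J_pi plus lam times the expected cost value."""
    pi = np.asarray(pi, dtype=float)
    base = pi @ (xi * np.log(pi) - np.asarray(q, dtype=float))
    return float(base + lam * (pi @ np.asarray(q_c, dtype=float)))

def update_multiplier(lam, q_c_batch, eta, lr):
    """One gradient step on J(lam) = mean(-lam * (Q_c - eta)), Equation (47).

    The gradient with respect to lam is -(mean(Q_c) - eta), so lam grows when
    the estimated cost exceeds the limit eta and shrinks otherwise.
    The max(0, .) projection is an assumption of this sketch.
    """
    grad = -(float(np.mean(q_c_batch)) - eta)
    return max(0.0, lam - lr * grad)

# Hypothetical numbers: cost estimates above the limit push lam upward.
lam_up = update_multiplier(1.0, [0.5, 0.7], eta=0.2, lr=0.1)
# Cost estimates below the limit pull lam toward zero.
lam_down = update_multiplier(0.01, [0.0], eta=0.2, lr=0.1)
```

A larger λ increases the weight of the cost term in the policy loss, which is how a tighter cost limit η translates into more conservative behavior.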

5.2. Action Shielding Mechanism

Frequent constraint violations usually make it hard for the RL agent to learn. In this study, to enhance the sample efficiency of safe RL, we design a rule-based ASM to replace unexpected or unsafe RL actions with safe ones (see Algorithm 3), which helps guide the RL agent toward appropriate actions during exploration. In the ASM, an unsafe RL action is detected by checking, based on the predicted states, whether a collision would occur between the ego vehicle and the surrounding vehicles.

We define the unsafe decision set as

Ω_{unsafe} = Ω_{1} \cup Ω_{2} \cup Ω_{3}

, which covers three typical situations, as shown in Figure 8. If the original RL decision

a_{t} \in Ω_{unsafe}

, the decision

a_{t}

will be substituted with a safe one

a_{t}^{⊙}

.

5.2.1. Situation 1

A collision is likely to occur when the action turning left

A^{◃}

is executed (see the top of Figure 8). The unsafe set

Ω_{1}

of this situation is defined as

Ω_{1} = \{a \in A | a = A^{◃}, x_{i} (0) \leq x_{e}^{σ} (A^{◃}) \leq x_{i} (N), | y_{e}^{σ} (A^{◃}) - y_{i} | \leq d_{y}\},

(48)

where

x_{e}^{σ}

and

y_{e}^{σ}

are the predicted longitudinal and lateral positions of the ego vehicle under the action

A^{◃}

,

σ

is the prediction coefficient,

x_{i}

and

y_{i}

are the predicted positions of the

i^{th}

vehicle, and

d_{y}

is the safe distance between the ego vehicle and the

i^{th}

vehicle in the lateral direction. In this situation, the action

A^{◃}

is replaced with the deceleration action

A^{↓}

to delay the merging maneuver and avoid the predicted lateral conflict. The replaced action

a_{t}^{⊙}

is written as

a_{t}^{⊙} = A^{↓} .

(49)

5.2.2. Situation 2

The action is labeled as an unexpected decision when the ego vehicle has already merged into the target lane, but the RL agent outputs an action turning right

A^{▹}

, which is considered unnecessary (see the middle of Figure 8). The unsafe set

Ω_{2}

of this situation is defined as

Ω_{2} = \{a \in A | a = A^{▹}, I_{merge} = 1\},

(50)

where

I_{merge} = 1

indicates that the ego vehicle has successfully merged. Therefore, an action idling

A^{⊘}

is used to replace the unexpected action

A^{▹}

. The replaced action

a_{t}^{⊙}

is written as

a_{t}^{⊙} = A^{⊘} .

(51)

5.2.3. Situation 3

The ego vehicle and the neighboring vehicle in the target lane are at nearly the same longitudinal position. The ego vehicle would fail to merge before reaching the end of the merge zone if acceleration

A^{↑}

or idling

A^{⊘}

is executed (see the bottom of Figure 8). The unsafe set

Ω_{3}

of this situation is defined as

Ω_{3} = \{a \in A | a = A^{↑} \text{ or } A^{⊘}, | x_{e}^{σ} (a) - x_{obj} | \leq d_{x}\},

(52)

where

x_{e}^{σ} (a)

is the predicted longitudinal position of the ego vehicle when executing the action a, i.e.,

a = A^{↑} \text{ or } A^{⊘}

.

x_{obj}

is the longitudinal position of the adjacent vehicle that occupies the target lane.

d_{x}

is the distance threshold. The condition

| x_{e}^{σ} (a) - x_{obj} | \leq d_{x}

indicates that the ego vehicle and the adjacent vehicle are close to each other in the longitudinal direction, making it hard for the ego vehicle to merge. Similarly, the action

A^{↑}

or

A^{⊘}

would be replaced by the deceleration

A^{↓}

, which slows down the ego vehicle to find another opportunity to merge, i.e.,

a_{t}^{⊙} = A^{↓}

.

Overall, the proposed ASM is designed to handle the dominant edge cases in the considered ramp-merging scenario, including predicted lateral collision during lane changing, unexpected right-turn decisions after successful merging, and failure-to-merge risk caused by target lane occupancy. These cases correspond to the main unsafe or infeasible high-level decisions observed in the current simulation environment. For the first and third cases, the deceleration action is used as a shielding fallback to delay the merging maneuver or enlarge the longitudinal gap, allowing the ego vehicle to seek another merging opportunity. It should be noted that this fallback action is implemented through the MPC execution layer and is subject to the acceleration bounds in the vehicle-control optimization. Therefore, it is not equivalent to an emergency braking maneuver in the present implementation.

Nevertheless, deceleration is not universally safe under all traffic configurations. For example, when the following vehicle is very close to the ego vehicle, has delayed response, or does not follow IDM-like longitudinal behavior, a deceleration fallback may increase rear-end collision risk. In the current simulation, this risk is partially mitigated by the bounded MPC execution and the IDM-based response of surrounding vehicles, but rear-side safety is not explicitly modeled as an independent shielding constraint. Therefore, the ASM should be interpreted as a scenario-specific safety augmentation module for reducing risky exploration in the considered ramp-merging setting, rather than a complete safety supervisor for all possible traffic configurations. A more general shielding design can further incorporate rear-side safety margins and select fallback actions from multiple candidate maneuvers.
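The three shielding rules can be collected into a single function that mirrors Algorithm 3. This is a minimal sketch under stated assumptions: the `Vehicle` container, the string action labels, and the function signature are hypothetical, and the predicted ego position is assumed to be supplied by the MPC prediction step for the raw action.

```python
from dataclasses import dataclass

LEFT, RIGHT, ACCELERATE, IDLE, DECELERATE = (
    "left", "right", "accelerate", "idle", "decelerate")

@dataclass
class Vehicle:
    x0: float  # current longitudinal position x_i(0)
    y: float   # lateral position y_i
    v: float   # speed v_i

def shield(action, x_pred, y_pred, vehicles, x_obj, merged, n, dt, d_x, d_y):
    """Return the raw action, or its replacement if it falls in the unsafe set."""
    if action == LEFT:
        # Situation 1: the predicted ego position lies inside the stretch a
        # surrounding vehicle will sweep over the horizon, with a small lateral gap.
        for veh in vehicles:
            x_n = veh.x0 + n * veh.v * dt
            if veh.x0 <= x_pred <= x_n and abs(y_pred - veh.y) <= d_y:
                return DECELERATE
    elif action == RIGHT:
        # Situation 2: turning right after the merge is complete is unnecessary.
        if merged:
            return IDLE
    elif action in (ACCELERATE, IDLE):
        # Situation 3: the ego vehicle stays alongside the vehicle in the target lane.
        if abs(x_pred - x_obj) <= d_x:
            return DECELERATE
    return action

# Hypothetical scene: one vehicle at 100 m travelling at 20 m/s, 1 s lookahead.
scene = dict(vehicles=[Vehicle(100.0, 4.0, 20.0)], x_obj=112.0,
             n=10, dt=0.1, d_x=5.0, d_y=2.0)
replaced = shield(LEFT, x_pred=110.0, y_pred=3.0, merged=False, **scene)
```

As discussed above, the function only encodes the three scenario-specific rules; it does not check rear-side safety before returning the deceleration fallback.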

6. Theoretical Analysis

This section presents a theoretical analysis of the ASM regarding the safety performance of the learned RL policy and the convergence performance of the proposed safe RL algorithm.

6.1. Safety Performance

Theorem 1.

If the current action

a_{t}

is unsafe, i.e.,

a_{t} \in Ω_{unsafe}

, and when it is replaced with a safe one

a_{t}^{⊙}

using ASM, then the state action cost function

Q_{c} (s_{t}, a_{t}^{⊙})

is lower than the original state action cost function

Q_{c} (s_{t}, a_{t})

, which indicates improved safety effectiveness contributed by the ASM.

Proof.

We consider the Bellman equation [36], which is denoted as

Q_{c} (s_{t}, a_{t}) = c (s_{t}, a_{t}) + γ V_{c} (s_{t + 1}) .

(53)

For simplicity, let

c_{t} = c (s_{t}, a_{t})

and

c_{t}^{⊙} = c (s_{t}, a_{t}^{⊙})

. Then we have

Q_{c} (s_{t}, a_{t}) - Q_{c} (s_{t}, a_{t}^{⊙}) = c_{t} - c_{t}^{⊙} + γ [V_{c} (s_{t + 1}) - V_{c} (s_{t + 1}^{⊙})] .

(54)

Consider first the immediate cost terms. Since the original decision

a_{t}

will lead to a collision in the following steps, we have

c_{t} > c_{t}^{⊙}

. As for the difference in the state cost values, we assume that subsequent actions are also replaced by safe decisions whenever the original ones are unsafe. Then, we have

V_{c} (s_{t + 1}) - V_{c} (s_{t + 1}^{⊙}) = E_{π} [\sum_{t = 0}^{T} γ^{t} (c (s_{t + 1}, a_{t + 1}) - c (s_{t + 1}^{⊙}, a_{t + 1}^{⊙}))] .

(55)

During the remaining steps of the episode, each unsafe decision is replaced by a safe one, which results in a safe state. Therefore, we have

c (s_{t + 1}, a_{t + 1}) > c (s_{t + 1}^{⊙}, a_{t + 1}^{⊙}), t \in [0, T] .

(56)

As a consequence, we have

V_{c} (s_{t + 1}) > V_{c} (s_{t + 1}^{⊙}), t \in [0, T],

(57)

Q_{c} (s_{t}, a_{t}) > Q_{c} (s_{t}, a_{t}^{⊙}), t \in [0, T] .

(58)

□

The performance of the policy with the ASM can be improved because the safe action reduces the state-action cost value at each step and guides the agent toward safe decisions. Furthermore, the update rules of the training process are not changed by the ASM, which only replaces the executed action.

6.2. Convergence Analysis

The optimization objective is to maximize the sum of the discounted reward while assuring the cost value satisfies the constraints. From Equation (10), the Lagrangian dual function [37] is formulated as

d (λ) = \min_{π} - J_{R}^{π} + λ (J_{C}^{π} - η) .

(59)

The dual ascent is to find

\max_{λ \geq 0} d (λ)

. During the learning process, the policy update is imperfect due to two factors. One factor is that the number of iterations is limited for computation efficiency and another factor is that the replaced decision can guarantee safety but not optimality. Hence, we assume the suboptimality of the solution

π^{*}

is upper bounded as

- J_{R}^{π^{*}} + λ (J_{C}^{π^{*}} - η) - d (λ) < ϵ .

(60)

We denote the residual of

λ

before and after the imperfect update in a single step of Algorithm 2 as

\hat{g} (λ, α_{λ}) = \frac{1}{α_{λ}} (\max (0, λ + α_{λ} \nabla_{λ} d (λ)) - λ) .

(61)

Lemma 1.

Following the imperfect dual ascent with step size

α_{λ} \leq μ

, we have

d (λ_{k + 1}) \geq d (λ_{k}) + \frac{α_{λ}}{2} ‖ \hat{g} (λ_{k}, α_{λ}) ‖^{2} - \sqrt{\frac{2 ϵ}{μ}} ‖ \hat{g} (λ_{k}, α_{λ}) ‖ .

(62)

Proof.

The result follows from standard projected dual ascent for

μ

-smooth concave dual functions (cf. Theorem 2.2.7 in [38]). In our setting, the policy update at each iteration is only

ϵ

-suboptimal, which introduces a deviation term of order

\sqrt{ϵ / μ}

; that is, the deviation can be bounded by a constant times

\sqrt{ϵ / μ}

. Intuitively, one step of the projected ascent guarantees an increase in

d (λ)

proportional to

∥ \hat{g} ∥^{2}

, minus a correction term of size at most on the order of

\sqrt{ϵ / μ}

, which accounts for the inexact policy improvement. Taking

λ^{'} = λ_{k}

in this bound and using the fact that

λ_{k + 1} - λ_{k} = α_{λ} \hat{g}

gives the result. □

Theorem 2.

There exists a constant

χ > 0

such that the imperfect update converges to a dual solution

\hat{λ}

that satisfies

\min_{λ^{*} \in P^{*}} ‖ λ^{*} - \hat{λ} ‖ \leq χ \sqrt{\frac{ϵ}{μ}} .

(63)

Proof.

Let

ϕ (λ) = \min_{λ^{*} \in P^{*}} ‖ λ - λ^{*} ‖

. Based on Theorem 4.1 of [39], there exists a constant

ψ

such that

ϕ (λ_{k}) + ‖ π^{*} - π ‖ \leq ψ ‖ λ_{k + 1} - λ_{k} ‖ .

(64)

From Lemma 1, when

‖ \hat{g} ‖ > \frac{2}{α_{λ}} \sqrt{\frac{2 ϵ}{μ}}

, the Lagrangian dual function

d (λ)

monotonically increases and the imperfect dual ascent would reach a

\hat{λ}

satisfying

‖ \hat{g} (\hat{λ}, α_{λ}) ‖ < \frac{2}{α_{λ}} \sqrt{\frac{2 ϵ}{μ}}

. Then, it follows that

ψ ‖ λ_{k + 1} - λ_{k} ‖ < ψ (2 + α_{λ}) \sqrt{\frac{2 ϵ}{μ}}

. Taking

ψ = \frac{χ}{\sqrt{2} (2 + α_{λ})}

, we have

ϕ (\hat{λ}) \leq χ \sqrt{\frac{ϵ}{μ}}

. □

Theorem 2 shows that the proposed method will converge to a near-optimal solution, even under imperfect policy updates.

7. Experimental Setup

This section describes the experimental setup, including the simulation environment, scenario settings, and implementation details.

7.1. Scenario Settings

We built an on-ramp merging scenario based on the simulation platform highway-env [40]. As shown in Figure 1, the merge zone has a lane width of 5 m and a length of 70 m. The ego vehicle starts from the entrance of the ramp, 80 m from the merge zone. The longitudinal acceleration of the surrounding vehicles is determined by the intelligent driver model (IDM) [41]. The longitudinal distance between the front vehicle i and the rear vehicle

i + 1

is denoted as

d_{i, i + 1}

, which is defined as

d_{i, i + 1} = d_{s} + v_{i + 1} / ρ

, where

d_{s}

is the safety distance.

v_{i + 1}

is the speed of the rear vehicle. The initial speed of all vehicles ranges from 17 m/s to 27 m/s.

ρ \in [0.5, 1]

represents the space density of traffic flow. The traffic density is at a low level when

ρ \in [0.5, 0.7)

, a medium level when

ρ \in [0.7, 0.8]

, and high level when

ρ \in (0.8, 1]

. A larger value of

ρ

yields smaller headways, which increases the number of surrounding vehicles within the ego vehicle’s interaction range during the on-ramp merge and results in tighter available gaps. The surrounding vehicles are controlled by IDM-based longitudinal behavior models, which provide a controlled and reproducible traffic environment for evaluating the proposed method. This setting is used as a benchmark scenario and does not aim to fully reproduce all heterogeneous driving behaviors observed in real traffic.
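The effect of the density parameter on the initial spacing follows directly from the headway rule d_{i,i+1} = d_s + v_{i+1}/ρ. The short sketch below evaluates it at the two ends of the density range. The function name and the safety distance of 10 m are hypothetical values chosen for illustration; the speed of 20 m/s lies within the stated 17–27 m/s range.

```python
def initial_headway(v_rear, rho, d_s):
    """Initial gap between vehicle i and the rear vehicle i + 1: d_s + v_rear / rho."""
    return d_s + v_rear / rho

# Hypothetical safety distance d_s = 10 m, rear-vehicle speed 20 m/s.
gap_sparse = initial_headway(20.0, rho=0.5, d_s=10.0)  # lowest density
gap_dense = initial_headway(20.0, rho=1.0, d_s=10.0)   # highest density
```

Under these assumed values the gap shrinks from 50 m to 30 m as ρ goes from 0.5 to 1, which illustrates why larger ρ leaves tighter merging gaps.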

7.2. Implementation Details

The parameter settings of the proposed approach, including MPC and safe RL, are summarized as follows, and the main hyperparameters with explicit symbols are further listed in Table 2. The MPC prediction horizon is set to 10 time steps. To ensure smooth merging, the reference steering angle is set to 0 rad with a maximum magnitude of

π / 8

rad, and the acceleration is constrained within [−0.5 g, 0.5 g]. The prediction coefficient

σ

is set to 5, meaning that collision checking in the ASM is performed five steps ahead. For safe RL, the environment simulation frequency is 10 Hz, while the decision-making frequency is 2 Hz. Each episode terminates when the ego vehicle reaches the goal or a collision occurs, and the agent is trained for 0.5 million interaction steps. During training, the agent is coupled with MPC online: the agent outputs actions for motion prediction, which are overridden by the ASM if potential collisions are detected, and the resulting shielded actions are executed by the low-level MPC controller to obtain reward and cost signals. All neural networks in safe RL consist of two hidden layers with 256 units each and use ReLU activations. The optimizer is Adam, the target entropy ratio is set to 0.98, the policy update frequency is 1, and the target network update frequency is 10. Other symbolic hyperparameters, including the learning rates, replay buffer size, batch size, discount factor, and Lagrangian multiplier settings, are reported in Table 2. Moreover, we train the RL agent with five random seeds. All experiments are conducted on an Intel Core i5-11300H CPU running at 3.1 GHz with 16 GB of RAM.
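For reference, the settings stated in this subsection can be gathered into a single configuration object. The sketch below is only one possible way to organize them: the class and field names are hypothetical, the value of g is assumed to be 9.81 m/s², and the hyperparameters reported only in Table 2 are deliberately left out.

```python
import math
from dataclasses import dataclass

G = 9.81  # assumption: gravitational acceleration in m/s^2

@dataclass(frozen=True)
class ExperimentConfig:
    mpc_horizon: int = 10                 # MPC prediction horizon (time steps)
    delta_ref: float = 0.0                # reference steering angle (rad)
    delta_max: float = math.pi / 8        # maximum steering magnitude (rad)
    accel_max: float = 0.5 * G            # acceleration bound, +/- 0.5 g
    sigma: int = 5                        # ASM prediction coefficient
    sim_hz: int = 10                      # environment simulation frequency
    decision_hz: int = 2                  # decision-making frequency
    total_steps: int = 500_000            # training interaction steps
    hidden_sizes: tuple = (256, 256)      # two hidden layers, ReLU activations
    target_entropy_ratio: float = 0.98
    policy_update_every: int = 1
    target_update_every: int = 10
    num_seeds: int = 5

    @property
    def sim_steps_per_decision(self):
        """Number of simulation steps covered by one high-level decision."""
        return self.sim_hz // self.decision_hz

cfg = ExperimentConfig()
```

With a 10 Hz simulation and 2 Hz decisions, each high-level action is held for five simulation steps.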

8. Results and Discussion

8.1. Convergence Performance

To demonstrate the advantages of RAPRL (ours), we compare the training curves of RAPRL with three baselines: Dueling DQN [42], SACD [33], and PPO [43]. Figure 9 presents the comparative results across three key metrics: Crash Ratio, Average Cost, and Average Reward, plotted against the total environmental interactions. The shaded regions represent the standard deviation across different random seeds.

As illustrated in the Average Reward curve (Figure 9 (right)), RAPRL outperforms all baselines in terms of the final converged value. While Dueling DQN exhibits the fastest initial learning speed, it converges to a sub-optimal policy with a lower final reward compared to RAPRL. In contrast, RAPRL maintains a steady growth rate and ultimately achieves the highest asymptotic performance. PPO shows the slowest convergence rate and the lowest final reward. In Figure 9 (left), both Dueling DQN and RAPRL demonstrate rapid reductions in collision ratios, quickly converging to a lower crash ratio. This indicates that both methods can effectively learn to avoid fatal states at an early stage. However, PPO requires significantly more interactions to reduce the crash risk to an acceptable level. Crucially, the Average Cost curve (Figure 9 (middle)) reveals a significant advantage of our method. Although Dueling DQN achieves a low crash ratio, its average cost exhibits a sharp increase after the initial drop and stabilizes at a high level. Conversely, RAPRL consistently reduces the average cost and converges to the lowest level among all methods.

These results demonstrate that RAPRL not only ensures collision avoidance but also effectively minimizes comprehensive penalties, achieving a superior balance between safety constraints and driving efficiency.

8.2. Performance Evaluation

8.2.1. Traffic Success

Traffic success is primarily evaluated by the success rate, which quantifies the probability of the ego vehicle successfully completing the merging task without terminal failures. As presented in Table 3, the proposed RAPRL performs robustly across varying traffic densities. Specifically, RAPRL achieves success rates of 99.0%, 99.5%, and 99.3% in high, medium, and low traffic densities, respectively. In the challenging high-density scenario, where the baseline Dueling DQN drops to 87.0%, RAPRL maintains a high success rate, an improvement of 12.0 percentage points. Similarly, compared to SACD, RAPRL improves the success rate by 4.5, 2.0, and 0.1 percentage points across the three density levels. Although PPO achieves a marginally higher success rate in the high-density scenario (99.5%), it suffers from performance degradation in the medium-density scenario (97.7%). RAPRL consistently maintains a success rate of at least 99% in all tested simulation scenarios, indicating its stability and adaptability within the considered IDM-based ramp-merging environments.
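The success-rate gains quoted here are differences between rates, whereas the abstract reports a relative improvement. The short calculation below relates the two for the high-density case. The SACD value of 94.5% is inferred from the stated 4.5-point gap to RAPRL's 99.0%; the function name is hypothetical.

```python
def success_rate_gains(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

# High-density scenario: RAPRL 99.0% versus SACD 94.5% (inferred from the text).
abs_gain, rel_gain = success_rate_gains(99.0, 94.5)
print(f"{abs_gain:.1f} percentage points, {rel_gain:.2f}% relative")
```

The relative gain of about 4.76% appears consistent with the figure reported in the abstract for the comparison with SAC Discrete.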

8.2.2. Traffic Safety

Traffic safety is assessed using collision ratio and average cost, which reflect the frequency of accidents and the agent’s adherence to safety constraints. As shown in Table 3, the unconstrained methods, Dueling DQN and SACD, incur significantly higher average costs, particularly in high-density scenarios (0.50 and 0.44, respectively), indicating aggressive behaviors that frequently violate safety boundaries. In contrast, RAPRL dramatically reduces the average cost to 0.02 across all densities. Regarding collision ratio, while PPO demonstrates strong performance in high and low densities, it exhibits a spike in collision ratio to 0.018 under medium density, suggesting instability in complex interaction scenarios. RAPRL, however, consistently suppresses the collision ratio to a negligible level (0.003∼0.005) across all test cases. This validates that by explicitly optimizing the constrained objective, RAPRL effectively minimizes safety risks.

8.2.3. Traffic Efficiency

Traffic efficiency is evaluated by the average time required to complete the merging maneuver. As illustrated in Table 3, Dueling DQN and SACD achieve the shortest average times (e.g., 11.78 s and 11.59 s in high density) but at the cost of high accident risks and safety violations, as discussed in the safety analysis. Conversely, PPO exhibits the longest average time (12.36 s in high density), indicating that it adopts an overly conservative policy to ensure safety, thereby sacrificing efficiency. RAPRL achieves a well-balanced trade-off between these conflicting objectives. It records an average time of 11.87 s in high density, which is faster than PPO, yet it maintains the lowest safety cost. These results demonstrate that RAPRL enables the agent to complete efficient merging maneuvers without compromising safety standards or adopting overly cautious behaviors.

8.3. Ablation Study

This section conducts an ablation study to evaluate the impact of RAPRL’s essential components: the Action Shielding Mechanism (ASM), safety constraints (SC), and personal preference (PP). Following the configurations listed in Table 4, we benchmark the full RAPRL method against three ablated variants: (a) RAPRL w/o PP, (b) RAPRL w/o PP and ASM, and (c) RAPRL w/o PP, ASM, and SC. The quantitative performance metrics are summarized in Table 4, while the training curves for average reward and average cost are illustrated in Figure 10.

It is worth clarifying that SC and ASM play different roles in RAPRL. The CMDP-based safety constraint is a learning-level mechanism that penalizes the expected cumulative cost through the Lagrangian formulation and guides the policy toward lower-risk behavior during optimization, whereas ASM is an execution-level safety filter that evaluates the raw high-level action using MPC-based prediction and replaces it when the action is identified as unsafe or infeasible. During training, the shielded action $a_{t}^{\odot}$ is executed by the low-level MPC controller, and the resulting transition, reward, and cost are stored for policy updates. Thus, ASM also influences policy learning by reducing unsafe exploration and exposing the agent to corrected lower-risk behaviors. The two mechanisms are therefore complementary: SC encourages the policy itself to satisfy the expected cost constraint, while the ASM handles residual unsafe actions at the action execution level.
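The division of labor between the two mechanisms can be sketched in a few lines of Python. This is an illustrative sketch under our own assumptions, not the authors' implementation: `update_multiplier` is a standard projected dual-ascent step for a Lagrangian multiplier, which may differ in detail from the update used in the paper, and `shield` uses a stand-in predicate `is_unsafe` where the paper uses MPC-based prediction. All function and parameter names are hypothetical.

```python
from typing import Callable


def update_multiplier(lam: float, avg_cost: float, eta: float, lr: float) -> float:
    """Learning level (SC): projected dual ascent on the Lagrangian multiplier.

    The multiplier grows when the average cost exceeds the cost limit eta
    and shrinks otherwise, never dropping below zero.
    """
    return max(0.0, lam + lr * (avg_cost - eta))


def shield(raw_action: int,
           is_unsafe: Callable[[int], bool],
           fallback_action: int) -> int:
    """Execution level (ASM): keep the raw action unless it is flagged unsafe."""
    return fallback_action if is_unsafe(raw_action) else raw_action


# Hypothetical usage: discrete actions are integers and action 2 is flagged
# unsafe by a stand-in predicate (an assumption made only for this example).
executed = shield(raw_action=2, is_unsafe=lambda a: a == 2, fallback_action=0)

# Illustrative numbers: initial multiplier and learning rate as in Table 2,
# an average cost of 0.23 and a cost limit of 0.01.
lam = update_multiplier(lam=1.0, avg_cost=0.23, eta=0.01, lr=1e-4)
```

In this sketch the multiplier only changes how strongly cost is penalized in later policy updates, while the shield changes which action is executed at the current step, mirroring the learning-level versus execution-level distinction described above.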

8.3.1. Safety Constraints

To verify the fundamental role of safety constraints in risk mitigation, we compare the baseline method (w/o PP, ASM, and SC) with the variant incorporating only SC (RAPRL w/o PP and ASM). As shown in Table 4, introducing SC reduces the collision ratio in high and medium traffic densities and leaves it unchanged at 0.005 in low density. Notably, in the high-density scenario, the collision ratio drops from 0.010 to 0.003, and the success rate improves from 94.5% to 97.2%. This indicates that SC encourages the policy to avoid high-cost behaviors by penalizing safety violations during policy optimization, rather than by directly filtering actions at execution time. However, the limitation of using SC alone is also evident: the average time increases from 11.59 s to 12.08 s, suggesting that the agent adopts a more conservative policy to satisfy the cost constraint, thereby sacrificing driving efficiency. Furthermore, as observed in the training curves (Figure 10 (middle)), although the method with SC (orange line) reduces crash occurrences, its average cost exhibits significant oscillations and remains at a relatively high level compared to the full method. This implies that while the learned constraint reduces collisions, it struggles to guide the agent smoothly away from high-risk regions, leading to frequent boundary-hovering behaviors. Overall, the improvement in the SC-only variant shows that the CMDP policy itself can learn safer behavior under the Lagrangian cost constraint.

8.3.2. ASM

The Action Shielding Mechanism (ASM) is designed to address the limitations of static constraints by actively correcting the agent’s behavior near safety boundaries. Comparing the variant “RAPRL w/o PP” (which includes SC and ASM) with the version containing only SC, the most striking impact is on the average cost. As detailed in Table 4, adding ASM reduces the average cost in high-density scenarios from 0.23 to 0.02, an order-of-magnitude improvement. This result is visually corroborated by the average cost curve in Figure 10, where the green curve, corresponding to the SC and ASM variant without PP, converges rapidly to a near-zero cost level, significantly outperforming the oscillating orange curve (w/o ASM). Additionally, the success rate in high density further increases to 98.3%. Table 4 also reports a high-density collision ratio of 0.010 for this variant, compared with 0.003 for the SC-only variant, so the benefit of ASM appears mainly in the average cost and success rate rather than in the collision ratio. These results indicate that ASM further suppresses residual unsafe actions that may still be produced by the learned policy, thereby improving constraint satisfaction and stabilizing the average cost during training.

8.3.3. Personal Preference

Finally, we evaluate the contribution of the Personal Preference (PP) module in balancing safety with efficiency and accelerating learning. Comparing the complete RAPRL method with the “RAPRL w/o PP” variant, Table 4 reveals that PP plays a crucial role in recovering the efficiency lost due to safety restrictions. In high-density traffic, the average time decreases from 12.33 s to 11.87 s, and the success rate peaks at 99.0%. This indicates that preference-aware cost-limit adaptation helps the agent recover part of the efficiency lost under strict safety regulation and guides the policy toward a better safety–efficiency trade-off. Moreover, the average reward curve in Figure 10 (right) shows that the full method (red line) exhibits the fastest convergence speed and achieves the highest asymptotic reward. This suggests that incorporating human driving preferences not only improves the trade-off between safety and efficiency but also allows the agent to learn its policy more rapidly and stably.

8.3.4. Visual Result

We visualize the behaviors of agents trained under three settings for the on-ramp merging task: RAPRL without PP and ASM, RAPRL without PP, and the full RAPRL, as shown in Figure 11. The ego vehicle trained without PP and ASM can select unsafe or suboptimal actions, e.g., turning right despite having merged into the main lane (see Figure 11a). Incorporating ASM prevents such unsafe actions in both RAPRL w/o PP and full RAPRL (see Figure 11b,c). Furthermore, the agent trained by full RAPRL reaches the farthest position from the end of the merge zone (green car in Figure 11) at episode termination, demonstrating that integrating personalized preferences enhances traffic efficiency.

Overall, the ablation results indicate that the CMDP-based safety constraint and ASM contribute to safety in different but complementary ways. The SC-only variant improves the collision-related metrics compared with the unconstrained baseline, showing that the policy can learn safer behavior under the Lagrangian cost constraint. However, SC alone still exhibits higher average cost and larger oscillations during training, suggesting that the learned policy may still approach unsafe boundaries. After adding ASM, the average cost is significantly reduced and the learning curve becomes more stable, indicating that the shielding module suppresses residual unsafe actions before execution. Therefore, the reported safety performance of RAPRL should be interpreted as the combined result of learned constraint-aware policy optimization and execution-level action shielding, rather than as the effect of either mechanism alone. Accordingly, the learned policy should be understood as a constraint-aware policy trained with execution-level shielding support, rather than a policy whose safety is solely guaranteed by post-processing.

8.4. Sensitivity Analysis

To evaluate the robustness of our framework, we conduct a sensitivity analysis on two critical design parameters that govern the system’s safety characteristics: the prediction coefficient $\sigma$ in the ASM and the cost limit $\eta$ (reflecting risk tolerance) in the CMDP.

8.4.1. Prediction Coefficient

We conduct motion prediction for constraint checking in the ASM of RAPRL and change the prediction coefficient $\sigma$ to test its effect on the average cost and average reward. The prediction coefficient $\sigma$ denotes the look-ahead step used by the ASM to select the predicted ego vehicle state from the MPC prediction horizon. Given the MPC horizon $N$, the predicted state sequence can be written as $\{x_{e}^{1}, x_{e}^{2}, \dots, x_{e}^{N}\}$, where $x_{e}^{\sigma}$ denotes the predicted ego vehicle state at the $\sigma$-th future step. Therefore, the corresponding look-ahead time is

$$t_{\sigma} = \sigma \Delta t, \quad \sigma \in \{1, \dots, N\}, \tag{65}$$

where $\Delta t$ is the simulation time interval. In this study, the MPC prediction horizon is $N = 10$, and the simulation frequency is 10 Hz, resulting in $\Delta t = 0.1$ s. Thus, $\sigma = 1$, $\sigma = 5$, and $\sigma = 10$ correspond to approximately 0.1 s, 0.5 s, and 1.0 s look-ahead collision checking, respectively.
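As a worked instance of Equation (65), the short Python sketch below evaluates the look-ahead time for the three tested coefficients and shows how the predicted state at step σ would be picked from an MPC prediction sequence. The function names and the list-based representation of the predicted states are assumptions made for illustration only.

```python
def look_ahead_time(sigma: int, dt: float = 0.1, horizon: int = 10) -> float:
    """Eq. (65): t_sigma = sigma * dt, valid for sigma in {1, ..., N}."""
    if not 1 <= sigma <= horizon:
        raise ValueError(f"sigma must be in 1..{horizon}, got {sigma}")
    return sigma * dt


def select_predicted_state(predicted_states: list, sigma: int):
    """Pick x_e^sigma from [x_e^1, ..., x_e^N] (sigma is 1-based, the list 0-based)."""
    if not 1 <= sigma <= len(predicted_states):
        raise ValueError("sigma lies outside the prediction horizon")
    return predicted_states[sigma - 1]


# Look-ahead times for the three coefficients tested in the sensitivity study,
# with N = 10 and a 10 Hz simulation (dt = 0.1 s).
times = {sigma: round(look_ahead_time(sigma), 3) for sigma in (1, 5, 10)}
print(times)  # {1: 0.1, 5: 0.5, 10: 1.0}
```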

The value of $\sigma$ affects the trade-off between responsiveness and prediction reliability. A small $\sigma$ performs near-term collision checking and is more responsive to current states, but it may be too short-sighted to identify future merging conflicts. A large $\sigma$ provides earlier warning of potential conflicts, but the predicted state may contain larger uncertainty and may lead to overly conservative or unstable shielding decisions. Therefore, $\sigma$ is treated as a tunable hyperparameter of the ASM and is selected through sensitivity analysis.

As shown in Figure 12, we conducted five runs for each $\sigma$, initialized with different random seeds. We observe that a larger $\sigma$ leads to faster learning but results in higher variability across the random seeds, indicating increased instability. On the other hand, a smaller $\sigma$ can slow down the training process. However, with an appropriate prediction coefficient of $\sigma = 5$, the model achieves the best performance regarding the average cost (the lowest value) and the average reward (the highest value). This result indicates that a mid-horizon prediction step provides a better balance between early conflict detection and reliable state prediction. Accordingly, $\sigma = 5$ is used as the default prediction coefficient in the experiments.

8.4.2. Risk Tolerance

In our CMDP framework, the cost limit $\eta$ serves as a quantifiable proxy for the agent’s risk tolerance. Lower values of $\eta$ enforce conservative driving strategies with strict adherence to safety constraints, whereas higher values permit more aggressive policies. To investigate the impact of this preference indicator, we evaluated $\eta$ at 0.1, 0.05, 0.01, and 0.001, as illustrated in Figure 13. These results demonstrate that the policy is highly sensitive to this threshold. Larger $\eta$ values relax the constraints, resulting in instability and increased safety violations, while overly small $\eta$ values restrict exploration and slow down convergence. We empirically identify an optimal balance at $\eta = 0.01$, where the agent maximizes safety compliance without compromising driving performance. However, relying on a fixed cost limit reveals a key limitation: a static $\eta$ cannot adapt to varying traffic densities or heterogeneous user preferences (e.g., aggressive driving in sparse traffic versus conservative driving in dense traffic). This limitation motivates our proposed approach, which dynamically adjusts $\eta$ using fuzzy logic to align the agent’s risk sensitivity with evolving environmental and user-specific requirements.

This sensitivity analysis also helps interpret the influence of fuzzy-rule settings on the learned policy. In the proposed framework, the fuzzy membership functions and rule table affect policy learning mainly through the generated CMDP cost limit $\eta$. Therefore, different fuzzy-rule settings would lead to different safety–efficiency behaviors by changing the resulting value of $\eta$. As shown in Figure 13, an overly large $\eta$ relaxes the safety constraint and may increase average cost, whereas an overly small $\eta$ imposes a stricter constraint and may slow down policy learning. These results indicate that the learned policy is sensitive to the cost-limit level produced by the fuzzy module. Accordingly, the adopted fuzzy rules are designed to generate moderate and adaptive cost limits under different risk-preference and traffic-density conditions, avoiding both excessively permissive and excessively conservative behaviors.
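To make the path from fuzzy rules to the cost limit concrete, the following Python sketch runs a Mamdani-style inference (min for rule strength, max for aggregation, centroid for defuzzification) over the rule table in Table 1. The triangular membership functions below are our own assumptions, chosen only to span the input and output ranges stated in Figure 4; they are not the membership functions of the paper, so the outputs will not reproduce the worked value shown in Figure 5. All names are hypothetical.

```python
import numpy as np


def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (a == b or b == c gives a shoulder)."""
    x = np.asarray(x, dtype=float)
    left = np.ones_like(x) if a == b else np.clip((x - a) / (b - a), 0.0, 1.0)
    right = np.ones_like(x) if b == c else np.clip((c - x) / (c - b), 0.0, 1.0)
    return np.minimum(left, right)


# ASSUMED membership functions (not taken from the paper's Figure 4).
PREF = {"Conservative": (0, 0, 50), "Neutral": (0, 50, 100), "Aggressive": (50, 100, 100)}
DENS = {"Low": (0.5, 0.5, 0.75), "Medium": (0.5, 0.75, 1.0), "High": (0.75, 1.0, 1.0)}
COST = {"Small": (0.0, 0.0, 0.05), "Medium": (0.0, 0.05, 0.1), "Large": (0.05, 0.1, 0.1)}

# Rule table from Table 1: (risk preference, traffic density) -> cost limit.
RULES = {
    ("Conservative", "High"): "Small", ("Conservative", "Medium"): "Small",
    ("Conservative", "Low"): "Medium",
    ("Neutral", "High"): "Small", ("Neutral", "Medium"): "Medium",
    ("Neutral", "Low"): "Large",
    ("Aggressive", "High"): "Medium", ("Aggressive", "Medium"): "Large",
    ("Aggressive", "Low"): "Large",
}


def cost_limit(preference: float, density: float, n: int = 1001) -> float:
    """Crisp cost limit for a risk preference in [0, 100] and a density in [0.5, 1]."""
    eta = np.linspace(0.0, 0.1, n)
    aggregated = np.zeros_like(eta)
    for (pref, dens), cost in RULES.items():
        # Rule firing strength: AND of the two antecedents, taken as the minimum.
        strength = min(float(tri(preference, *PREF[pref])),
                       float(tri(density, *DENS[dens])))
        # Clip the consequent set at the firing strength, then aggregate with max.
        aggregated = np.maximum(aggregated, np.minimum(strength, tri(eta, *COST[cost])))
    # Centroid defuzzification on the discretized output axis.
    return float(np.sum(eta * aggregated) / np.sum(aggregated))


eta_cautious = cost_limit(preference=10, density=0.95)  # conservative user, dense traffic
eta_relaxed = cost_limit(preference=90, density=0.55)   # aggressive user, sparse traffic
```

Under these assumed membership functions, a conservative user in dense traffic receives a cost limit below the midpoint of the output range and an aggressive user in sparse traffic receives one above it, which is the qualitative behavior the rule table is meant to encode.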

9. Discussion

Although the proposed method achieves promising results in the evaluated ramp-merging scenarios, its current validation remains subject to several practical limitations. First, the experiments were conducted in a controlled single-ramp environment with IDM-based surrounding vehicles, which provides a reproducible benchmark but cannot fully represent multi-lane interactions, heterogeneous driver styles, perception uncertainty, and non-IDM behaviors in real traffic. Second, the user risk preference is specified before policy learning and remains fixed during each experiment. This setting enables a controlled analysis of how different preference levels affect the CMDP cost limit and the learned merging behavior, but it does not capture preference variations over time or across traffic contexts. Third, the MPC prediction and execution modules are evaluated by simulation, where the finite-horizon QP problem is solved at each decision step. Therefore, the current results demonstrate algorithmic effectiveness in simulations rather than embedded real-time deployment capability.

Future work will address these limitations by evaluating the framework in more realistic traffic simulators, naturalistic driving datasets, and heterogeneous human driving behavior models. We will also investigate online preference inference and adaptation based on user feedback, historical driving behavior, and interaction histories. For deployment feasibility, hardware-in-the-loop validation and embedded-oriented MPC implementations, such as warm-started QP solvers, sparse optimization, code generation, and reduced-horizon MPC, will be further explored.

10. Conclusions

This study proposes a risk-aware safe RL approach with personal preferences for decision-making in the on-ramp merging task for autonomous driving, in which RL is used for high-level decisions, followed by a low-level MPC. We formulate the high-level decision problem as a CMDP that represents safety in cost terms. A fuzzy logic method computes the cost limit from human risk preferences and traffic density, and a Lagrangian-based SAC is used to solve the CMDP. An Action Shielding Mechanism is designed to replace unsafe or invalid RL actions, and our theoretical analysis supports its effectiveness in enhancing safety and sample efficiency. Numerical simulations show that our method improves the success rate, collision ratio, and average cost relative to the unconstrained baselines. Future work will further improve the practicality and robustness of the proposed framework under more realistic traffic conditions and deployment settings.

Author Contributions

Conceptualization: Y.L., S.Y. and Y.B.; methodology, S.Y. and J.T.; software, S.Y. and J.T.; validation, M.H., J.T. and Y.L.; investigation, H.Q. and M.H.; formal analysis, J.T. and W.H.; writing—original draft preparation, S.Y., Y.L. and H.Q.; writing—review and editing, J.T., W.H. and Y.B.; financial funding, B.L.; supervision, Y.L. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, H.; Wang, W.; Yuan, S.; Li, X. Uncovering Interpretable Internal States of Merging Tasks at Highway on-Ramps for Autonomous Driving Decision-Making. IEEE Trans. Autom. Sci. Eng. 2022, 19, 2825–2836.
2. Liang, J.; Tan, C.; Yan, L.; Zhou, J.; Yin, G.; Yang, K. Interaction-Aware Trajectory Prediction for Safe Motion Planning in Autonomous Driving: A Transformer-Transfer Learning Approach. IEEE Trans. Intell. Transp. Syst. 2025, 26, 17080–17095.
3. Wang, H.; Gao, H.; Yuan, S.; Zhao, H.; Wang, K.; Wang, X.; Li, K.; Li, D. Interpretable Decision-Making for Autonomous Vehicles at Highway On-Ramps With Latent Space Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 70, 8707–8719.
4. Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419.
5. Lyu, Y.; Luo, W.; Dolan, J.M. Probabilistic safety-assured adaptive merging control for autonomous vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 10764–10770.
6. Lubars, J.; Gupta, H.; Chinchali, S.; Li, L.; Raja, A.; Srikant, R.; Wu, X. Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 942–947.
7. Wang, K.; Mu, C.; Ni, Z.; Liu, D. Safe Reinforcement Learning and Adaptive Optimal Control With Applications to Obstacle Avoidance Problem. IEEE Trans. Autom. Sci. Eng. 2024, 21, 4599–4612.
8. Yan, Z.; Kreidieh, A.R.; Vinitsky, E.; Bayen, A.M.; Wu, C. Unified automatic control of vehicular systems with reinforcement learning. IEEE Trans. Autom. Sci. Eng. 2022, 20, 789–804.
9. Chen, X.; Xu, B.; Hu, M.; Bian, Y.; Li, Y.; Xu, X. Safe Efficient Policy Optimization Algorithm for Unsignalized Intersection Navigation. IEEE CAA J. Autom. Sin. 2024, 11, 2011–2026.
10. Gao, Z.; Hao, H.; Gao, F.; Zhao, R. Constrained Reinforcement Learning-Enabled Policies With Augmented Lagrangian for Cooperative Intersection Management. IEEE Internet Things J. 2024, 12, 5396–5411.
11. Wang, Y.; Zhan, S.S.; Jiao, R.; Wang, Z.; Jin, W.; Yang, Z.; Wang, Z.; Huang, C.; Zhu, Q. Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments. In Proceedings of the 40th International Conference on Machine Learning; Journal of Machine Learning Research Inc.: New York, NY, USA, 2023; Volume 202, pp. 36593–36604.
12. Carr, S.; Jansen, N.; Junges, S.; Topcu, U. Safe reinforcement learning via shielding under partial observability. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2023; Volume 37, pp. 14748–14756.
13. Chen, D.; Hajidavalloo, M.R.; Li, Z.; Chen, K.; Wang, Y.; Jiang, L.; Wang, Y. Deep Multi-Agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11623–11638.
14. Isele, D.; Nakhaei, A.; Fujimura, K. Safe Reinforcement Learning on Autonomous Vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–6.
15. Peng, J.; Yu, S.; Ge, Y.; Li, S.; Fan, Y.; Zhou, J.; He, H. Personalized Decision-Making Framework for Collaborative Lane Change and Speed Control Based on Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2025, 26, 13629–13644.
16. Teng, J.; Li, Y.; Yang, Z.; Yang, Z.; Shao, X.; Qin, H. User Preference-Aware and Efficient Trajectory Planning for Autonomous Parking with Hybrid A* and Nonlinear Optimization. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Edmonton, AB, Canada, 24–27 September 2024; pp. 1090–1097.
17. Chen, C.; Lan, Z.; Zhan, G.; Lyu, Y.; Nie, B.; Li, S.E. Quantifying the Individual Differences of Drivers’ Risk Perception via Potential Damage Risk Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 8093–8104.
18. Nyberg, T.; Pek, C.; Dal Col, L.; Norén, C.; Tumova, J. Risk-aware Motion Planning for Autonomous Vehicles with Safety Specifications. In Proceedings of the 32nd IEEE Intelligent Vehicles Symposium, Nagoya, Japan, 11–17 July 2021; pp. 1016–1023.
19. Geisslinger, M.; Trauth, R.; Kaljavesi, G.; Lienkamp, M. Maximum Acceptable Risk as Criterion for Decision-Making in Autonomous Vehicle Trajectory Planning. IEEE Open J. Intell. Transp. Syst. 2023, 4, 570–579.
20. Yang, K.; Li, B.; Shao, W.; Tang, X.; Liu, X.; Wang, H. Prediction Failure Risk-Aware Decision-Making for Autonomous Vehicles on Signalized Intersections. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12806–12820.
21. Morinelly, J.E.; Ydstie, B.E. Dual mpc with reinforcement learning. IFAC-PapersOnLine 2016, 49, 266–271.
22. Zanon, M.; Gros, S.; Bemporad, A. Practical reinforcement learning of stabilizing economic MPC. In Proceedings of the 18th European Control Conference (ECC), Naples, Italy, 25–28 June 2019; pp. 2258–2263.
23. Bellegarda, G.; Byl, K. An Online Training Method for Augmenting MPC with Deep Reinforcement Learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5453–5459.
24. Karnchanachari, N.; Valls, M.I.; Hoeller, D.; Hutter, M. Practical reinforcement learning for mpc: Learning from sparse objectives in under an hour on a real robot. In Proceedings of the Learning for Dynamics and Control, UC Berkeley, CA, USA, 10–11 June 2020; pp. 211–224.
25. Williams, G.; Wagener, N.; Goldfain, B.; Drews, P.; Rehg, J.M.; Boots, B.; Theodorou, E.A. Information theoretic mpc for model-based reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1714–1721.
26. Gros, S.; Zanon, M. Reinforcement learning based on mpc and the stochastic policy gradient method. In Proceedings of the American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021; pp. 1947–1952.
27. Li, Y.; Li, J.; Huang, W.; Yang, Q.; Qin, H.; Jiang, X.; Bian, Y.; Hu, M.; Hu, Y. Risk-Constrained On-Ramp Merging via Safety-Augmented Reinforcement Learning and Model Predictive Control. IEEE Internet Things J. 2026; early access.
28. Chen, J.; Shen, J.; Chen, W.; Li, J.; Zhang, S. Application of Robust Fuzzy Cooperative Strategy in Global Consensus of Stochastic Multi-Agent Systems. IEEE Trans. Autom. Sci. Eng. 2025, 22, 12058–12070.
29. Mamdani, E.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man Mach. Stud. 1975, 7, 1–13.
30. Marina Martinez, C.; Heucke, M.; Wang, F.Y.; Gao, B.; Cao, D. Driving Style Recognition for Intelligent Vehicle Control and Advanced Driver Assistance: A Survey. IEEE Trans. Intell. Transp. Syst. 2018, 19, 666–676.
31. Peng, J.; Zhang, S.; Zhou, Y.; Li, Z. An Integrated Model for Autonomous Speed and Lane Change Decision-Making Based on Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21848–21860.
32. Werling, M.; Ziegler, J.; Kammel, S.; Thrun, S. Optimal trajectory generation for dynamic street scenarios in a frenet frame. In Proceedings of the IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 987–993.
33. Christodoulou, P. Soft Actor-Critic for Discrete Action Settings. arXiv 2019, arXiv:1910.07207.
34. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 1995, 38, 58–68.
35. Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717.
36. Bellman, R. Dynamic programming and stochastic control processes. Inf. Control 1958, 1, 228–239.
37. Bertsekas, D.; Nedic, A.; Ozdaglar, A. Convex Analysis and Optimization; Athena Scientific: Nashua, NH, USA, 2003; Volume 1.
38. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
39. Luo, Z.Q.T.; Tseng, P. On the Convergence Rate of Dual Ascent Methods for Linearly Constrained Convex Minimization. Math. Oper. Res. 1993, 18, 846–867.
40. Leurent, E. An Environment for Autonomous Driving Decision-Making, 2018. Available online: https://github.com/eleurent/highway-env (accessed on 20 May 2026).
41. Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805–1824.
42. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003.
43. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.

Figure 1. Illustration of the highway on-ramp merging scenario. The ego vehicle (red) starts on a one-lane entrance ramp and needs to merge into the highway traffic safely and efficiently.

Figure 2. Overview of the proposed method. The high-level decision problem is formulated as a CMDP that incorporates individuals’ risk preferences into the constraints, followed by an MPC-based low-level control. A Lagrangian-based SAC algorithm is used to solve the CMDP for the optimal RL policy. We design an Action Shielding Mechanism that masks out risky actions by pre-executing each action with MPC and conducting collision constraint checks. Then, the safe RL action is sent to the low-level MPC, which generates vehicle control for the simulation environment.

Figure 3. Cost design based on motion predictions of the ego vehicle and surrounding objects. There are three typical situations: (a) failing to merge before reaching the end of the road, (b) colliding with other vehicles, and (c) difficulty merging when the target lane is occupied by other vehicles.

Figure 4. The membership functions of the fuzzy inputs, i.e., risk preference and traffic density, and the fuzzy output cost limit: (a) risk preference, varying between 0 and 100%; (b) traffic density, varying between 0.5 and 1; and (c) cost limit, varying between 0 and 0.1.

Figure 5. Illustration of the aggregation and defuzzification of the Mamdani inference process. The fuzzy sets of the cost limit include large, medium, and small, represented by black, red, and blue dashed lines. Given traffic density and risk preference of 0.57 and 45%, with membership values $\tilde{A} = \{\text{Low}, \text{Medium}\}$ and $\tilde{B} = \{\text{Conservative}, \text{Neutral}\}$, the fuzzy outputs for small, medium, and large cost limits are 0.25, 0.35, and 0.65, respectively. These outputs are aggregated into a single fuzzy set (the union of the three grey areas), and the centroid is taken to obtain the crisp cost limit of 0.0595.

Figure 6. Kinematic bicycle model.

Figure 7. The structure of the neural networks.

Figure 8. Action Shielding Mechanism. Three typical situations are defined: collision, unexpected decision, and failing to merge. The unsafe/unexpected action $a_{t}$ (left) is substituted with a safe one $a_{t}^{\odot}$ (right).

Figure 9. Comparative study. We compare RAPRL (ours) with Dueling DQN [42], SACD [33], and PPO [43]. The solid lines represent the mean values, while the shaded regions indicate the standard deviations. The results demonstrate that RAPRL achieves superior convergence performance, attaining the highest episodic reward and lowest average cost while maintaining robust safety guarantees compared to all baselines.

Figure 10. Ablation study. We compare the full RAPRL method against its ablated variants. The results show that SC helps the policy learn safer behavior, ASM further reduces residual unsafe actions and stabilizes the average cost, while PP improves the safety–efficiency trade-off and accelerates reward convergence. Consequently, the complete RAPRL method achieves better overall performance in terms of safety and efficiency.

Figure 11. Visualization of the trained agent, including (a) RAPRL w/o PP and ASM, (b) RAPRL w/o PP, and (c) RAPRL. The full RAPRL reaches the farthest position at $t = 10$ s, indicating improved traffic efficiency.

Figure 12. Sensitivity of RAPRL to the prediction coefficient, i.e., $\sigma = 1, 5, 10$. (a) Average cost. (b) Average reward.

Figure 13. Sensitivity of RAPRL to the hyperparameter cost limit $\eta$, where $\eta = 0.1, 0.05, 0.01, 0.001$. (a) Large $\eta$ can lead to instabilities and substantial increases in the average cost, while small $\eta$ can make training slower. (b) The average reward changes slightly when $\eta$ varies.

Table 1. Fuzzy relations between the cost limit, traffic density, and risk preference.

Risk Preference (rows) / Traffic Density (columns); entries: Cost Limit	High	Medium	Low
Conservative	Small	Small	Medium
Neutral	Small	Medium	Large
Aggressive	Medium	Large	Large

Table 2. Hyperparameters of the safe RL algorithm.

Symbol	Description	Value
$γ$	Discount factor	0.99
$α_{θ}$	Policy network learning rate	$1 \times 10^{- 4}$
$α_{ω}$	Critic network learning rate	$1 \times 10^{- 4}$
$α_{ω_{c}}$	Cost network learning rate	$1 \times 10^{- 4}$
$α_{ξ}$	Temperature parameter learning rate	$1 \times 10^{- 4}$
$λ_{0}$	Initial Lagrangian multiplier	1.0
$α_{λ}$	Lagrangian multiplier learning rate	$1 \times 10^{- 4}$
$D$	Replay buffer size	$1 \times 10^{5}$
$B$	Batch size	256

Table 3. Comparison between the constraint-free RL methods and ours. High, Medium, and Low denote the traffic density level.

Method	Success Rate (%) ↑			Collision Ratio ↓			Average Cost ↓			Average Time (s) ↓
Method	High	Medium	Low	High	Medium	Low	High	Medium	Low	High	Medium	Low
Dueling DQN [42]	87.0	94.3	99.0	0.013	0.005	0.005	0.50	0.28	0.08	11.78	11.47	10.85
SACD [33]	94.5	97.5	99.2	0.010	0.008	0.005	0.44	0.25	0.10	11.59	11.33	10.82
PPO [43]	99.5	97.7	99.2	0.003	0.018	0.008	0.01	0.03	0.01	12.36	11.62	10.95
RAPRL (ours)	99.0	99.5	99.3	0.003	0.005	0.005	0.02	0.02	0.02	11.87	11.46	10.96

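Table 4. Ablation study results. PP, ASM, and SC denote personal preference, the Action Shielding Mechanism, and safety constraints; 🗸 marks an enabled component and - a disabled one. High, Medium, and Low denote the traffic density level.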

PP	ASM	SC	Success Rate (%) ↑			Collision Ratio ↓			Average Cost ↓			Average Time (s) ↓
PP	ASM	SC	High	Medium	Low	High	Medium	Low	High	Medium	Low	High	Medium	Low
-	-	-	94.5	97.5	99.2	0.010	0.008	0.005	0.44	0.25	0.10	11.59	11.33	10.82
-	-	🗸	97.2	97.3	99.3	0.003	0.005	0.005	0.23	0.13	0.04	12.08	11.54	10.99
-	🗸	🗸	98.3	98.8	98.9	0.010	0.008	0.003	0.02	0.02	0.02	12.33	11.58	11.02
🗸	🗸	🗸	99.0	99.5	99.3	0.003	0.005	0.005	0.02	0.02	0.02	11.87	11.64	10.96


Share and Cite

MDPI and ACS Style

Teng, J.; Huang, W.; Yuan, S.; Hu, M.; Qin, H.; Li, Y.; Bian, Y.; Li, B. Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging. Machines 2026, 14, 605. https://doi.org/10.3390/machines14060605

