Article

Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging

1 College of Mechanical and Vehicle Engineering, Hunan University, Changsha 410082, China
2 School of Information & Electronic Engineering, East China Normal University, Shanghai 200241, China
* Author to whom correspondence should be addressed.
Machines 2026, 14(6), 605; https://doi.org/10.3390/machines14060605
Submission received: 21 April 2026 / Revised: 22 May 2026 / Accepted: 26 May 2026 / Published: 28 May 2026

Abstract

Reinforcement learning (RL) has been widely used for decision-making in highway on-ramp merging scenarios. However, most existing methods incorporate safety through reward functions, which may allow autonomous vehicles to trade safety for higher cumulative rewards. Moreover, personalized human risk preferences are rarely considered, so the learned policies are difficult to adapt to heterogeneous user-specific risk requirements and may behave too conservatively or not cautiously enough. To address these issues, this paper proposes a Risk-Aware Personal Preference-Based Safe Reinforcement Learning (RAPRL) framework for autonomous decision-making in on-ramp merging scenarios. Specifically, the high-level decision-making problem is formulated as a constrained Markov decision process (CMDP), in which safety requirements are explicitly represented as constraints rather than reward terms. To enable personalized safety regulation, a fuzzy logic mechanism is developed to adaptively determine the constraint cost limit according to the driver’s risk preference and the surrounding traffic density. The resulting safe RL problem is solved using a Lagrangian-based soft actor-critic (SAC) algorithm. Furthermore, an Action Shielding Mechanism is designed to assess the potential risk of candidate actions before execution and replace unsafe or infeasible actions, thereby improving safety during both policy learning and execution. Theoretical analysis shows that the proposed shielding mechanism can reduce unsafe exploration and improve sample efficiency. Extensive simulations in on-ramp merging scenarios demonstrate that RAPRL effectively reduces safety violations while maintaining driving efficiency. Compared with the SAC Discrete method, the proposed method improves the success rate by 4.76% and reduces the collision ratio by 70%, indicating a better safety–efficiency trade-off.

1. Introduction

Merging into congested highway traffic remains a challenging task for autonomous vehicles (AVs), as the decision-making process must balance two conflicting objectives: safety and efficiency [1]. This trade-off becomes more complex in interactive traffic, where the ego connected and automated vehicle (CAV) must make merging decisions under uncertain responses from human-driven vehicles (HDVs) and vehicle-to-vehicle interactions [2]. Meanwhile, the desired behavior of the ego CAV is not uniform across users. A conservative, safety-conscious user may prefer larger safety margins and more cautious merging maneuvers, whereas an aggressive user may accept a higher level of risk to reduce merge delay and improve traffic efficiency. Therefore, diverse driving preferences should be explicitly considered in autonomous decision-making, especially when the vehicle must regulate its safety–efficiency trade-off in interactive traffic. To satisfy users’ risk expectations, the decision-making systems of AVs should incorporate user preferences and provide multiple behavioral options.
Reinforcement learning (RL) has been widely used for AV decision-making tasks [3,4]. However, most methods neglect user-specific risk preferences, so the learned policies may not align with an individual’s risk perception, which reduces user satisfaction and trust in the system [5]. Moreover, ensuring safety within RL remains challenging. Most existing approaches incorporate safety requirements into the reward function instead of enforcing them as explicit constraints, which may cause the agent to prioritize long-term rewards at the expense of safety [6]. Therefore, it is essential to develop a risk-aware reinforcement learning framework that explicitly incorporates personal risk preferences into the decision-making process of autonomous driving.
The key problem addressed in this paper is to learn an on-ramp merging policy that can explicitly regulate safety constraints, adapt the allowable risk level to user-specific preferences and traffic conditions, and prevent unsafe high-level actions from being executed. This problem involves three coupled aspects: constraint-based safety regulation during policy learning, preference-aware adjustment of the safety cost limit, and execution-level filtering of risky actions before they are applied to the vehicle. To address this problem, we propose a Risk-Aware Personal Preference-Based Safe Reinforcement Learning framework (RAPRL) for autonomous on-ramp merging, as shown in Figure 1. The high-level decision-making problem is first formulated as a constrained Markov decision process (CMDP), where safety requirements are represented as explicit cost constraints and the cost limit is adaptively determined according to human risk preference and traffic density, as shown in Figure 2. Based on this formulation, the resulting safe RL problem is solved using a Lagrangian-based soft actor-critic (SAC) algorithm. To further reduce unsafe exploration and prevent risky decisions from being executed, an MPC-based Action Shielding Mechanism (ASM) is introduced to pre-execute candidate RL actions, evaluate their predicted collision risk and feasibility, and replace unsafe or invalid actions before execution. After this shielding process, the selected action is sent to the low-level MPC module for trajectory planning and vehicle control. Finally, theoretical analysis and numerical simulations are conducted to evaluate the safety performance and learning efficiency of the proposed method. The main contributions are summarized as follows:
  • A risk-aware safe RL method based on CMDP is proposed for the highway on-ramp merging task, in which the individual’s risk preference is incorporated into the safety constraints to accommodate the safety expectations of users. The safety level of the RL policy can be adjusted by computing the cost limits of CMDP constraints using fuzzy logic based on user preferences and traffic density.
  • An Action Shielding Mechanism is built to mask out unsafe RL actions. We pre-execute the RL action with MPC and conduct collision checks with surrounding agents to determine whether the action is safe. A theoretical analysis shows the effectiveness of the shielding mechanism in terms of safety and sampling efficiency.
  • Numerical simulations at different traffic densities show that our method outperforms the baselines, improving safety without sacrificing traffic efficiency. Owing to the user preference-aware safety constraints and action shielding, risky behaviors are significantly reduced during the exploration stage of RL, enabling safer policy learning in interactive ramp-merging scenarios.

2. Related Works

2.1. RL-Based Approach

Reinforcement Learning (RL) has demonstrated significant potential in the complex decision-making tasks of autonomous driving, particularly in handling interaction and mixed traffic scenarios [7,8]. While these approaches effectively optimize driving policies for high-level objectives, standard RL algorithms primarily focus on maximizing cumulative rewards and often lack explicit mechanisms to guarantee safety constraints during the learning process. To address this limitation, the constrained Markov decision process (CMDP) framework was introduced to explicitly incorporate safety constraints. For instance, ref. [9] formulated unsignalized intersection navigation as a CMDP, solving it via a Safe-PPO algorithm, while [10] combined the Augmented Lagrangian method with PPO to enforce risk boundaries. However, CMDP-based methods typically treat safety as a statistical constraint or a soft penalty. As highlighted by [11], it remains difficult for these methods to strictly enforce hard safety constraints, leaving the agent prone to falling into unsafe regions during exploration. To mitigate the risk of unsafe actions, the Action Shielding Mechanism (ASM) has been developed to monitor and correct policy outputs [12]. Recent works have incorporated motion prediction into shielding modules to anticipate and mask actions that lead to collisions [13,14,15]. However, most existing shielding approaches rely on simplified kinematic predictions or immediate action masking, often overlooking the discrepancy between high-level decisions and the actual trajectories executed by low-level controllers. Furthermore, a theoretical justification is often absent.
Unlike existing studies, we pre-execute the RL action with MPC to acquire the predictions of the ego vehicle and then conduct collision checks with surrounding agents based on their motion predictions to determine whether the RL action is safe or not. The unsafe actions are replaced with safe ones, which can help the RL agent quickly learn to act safely. Moreover, we provide a theoretical justification that the Action Shielding Mechanism can enhance the safety performance and learning efficiency of RL.

2.2. Human Risk Perception in Decision-Making

Risk perception in driving has gained wide scientific interest over the years, with differences among individuals’ risk perception found to have a significant impact on the safety design of intelligent vehicles [16]. Many studies have used statistical methods to analyze driving risk, among which the most common method is to assess risk by analyzing driving behaviors. To understand the differences in risk perception between drivers, driving style has been studied. Ref. [17] proposes a universal driving risk model, which provides a new way of modeling the differences in driver’s risk perception at a physical level. Risk perception is also the main consideration in decision-making for driving safety. Ref. [18] uses the risk of each sample trajectory as the cost item and obtains a trajectory with minimal cost to guarantee the safety requirement. Ref. [19] proposes a model to describe the acceptable risk and shows how an accepted risk contributes to the decision-making of AVs at the maneuver level. Ref. [20] proposes a risk-aware decision-making framework to handle the epistemic uncertainty arising from training the prediction model on insufficient data.
Unlike previous works, we explicitly incorporate human risk perception into the CMDP framework to align high-level decisions with user expectations. Specifically, we employ a fuzzy logic controller that dynamically adjusts the CMDP cost limits based on individual risk preferences and real-time traffic density, thereby enabling adaptable safety levels.

2.3. Combination of RL and MPC

RL and Model Predictive Control (MPC) represent two distinct paradigms in autonomous decision-making. While RL excels in handling complex system dynamics and learning from interaction, it often struggles with generalization and safety guarantees in unseen environments. Conversely, MPC provides robust constraint handling but is limited by model accuracy and computational burden. Consequently, recent research has increasingly focused on hybrid architectures that leverage the strengths of both. For instance, ref. [21] formulated a dual MPC where an RL agent acts as an adaptive solver, and [22] utilized RL to estimate the value function within a parametric MPC framework. Others employ MPC to guide RL training; specifically, MPC has been used to reduce sample complexity [23] or approximate value functions for high-level objectives [24,25]. Furthermore, ref. [26] demonstrated the feasibility of embedding RL techniques directly into MPC-based policies. These studies collectively highlight the potential of RL–MPC integration in complex control tasks. However, most existing hybrid approaches focus primarily on performance optimization or value approximation, often overlooking the critical aspect of explicit safety constraints during the exploration phase.
Unlike the previous studies, we formulate the on-ramp merging problem within a CMDP framework, where safety violations are explicitly constrained via cost functions rather than implicitly handled through reward shaping. Moreover, we introduce an Action Shielding Mechanism that leverages MPC as a predictive safety filter to prevent unsafe actions during the exploration phase of RL. This design enables safety-aware learning under explicit constraints while maintaining compatibility with downstream MPC-based trajectory planning.

3. Problem Statement

This section formulates the decision-making process of the on-ramp merging problem as a hierarchical framework that consists of a CMDP for high-level maneuver decisions and MPC for low-level motion control.

3.1. Constrained Markov Decision Process

The high-level maneuver decision-making problem is formulated as a CMDP model in this study, which specifies safety requirements as constraint terms. A CMDP is described as a tuple $\langle S, A, P, r, c, \eta, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability $p(s_{t+1} \mid s_t, a_t)$ from state $s_t$ to the next state $s_{t+1}$ under the action $a_t \in A$, $r: S \times A \to \mathbb{R}$ is the reward function, $c: S \times A \to \mathbb{R}$ is the cost function, $\eta$ is the cost limit, and $\gamma \in [0, 1)$ is a discount factor.

3.1.1. State Space

Let $s_i = (o_i, x_i, y_i, v_x^i, v_y^i)$ be the state of vehicle $i$, where $o_i$ is a binary variable indicating whether vehicle $i$ is observable, $x_i$ and $y_i$ are the longitudinal and lateral distances between vehicle $i$ and the ego vehicle, and $v_x^i$ and $v_y^i$ are the relative speeds between vehicle $i$ and the ego vehicle in the longitudinal and lateral directions. The complete state of the environment consists of all vehicles observed by the ego vehicle, and the state space $S$ is written as $S = \{s_1, s_2, \dots, s_n\}$, where $n$ is the maximum number of observed vehicles on the main lane.

3.1.2. Action Space

The high-level decision module outputs a discrete action, including turning left/right, idling, and acceleration/deceleration. The action space A is defined as
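$$A = \{A_{\text{left}}, A_{\text{right}}, A_{\text{acc}}, A_{\text{idle}}, A_{\text{dec}}\}, \tag{1}$$
where $A_{\text{left}}$, $A_{\text{right}}$, $A_{\text{acc}}$, $A_{\text{idle}}$, and $A_{\text{dec}}$ are turning left, turning right, acceleration, idling, and deceleration, respectively.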

3.1.3. Reward

The reward functions are defined considering safety, success ratio, and traffic efficiency, and we summarize the rewards as
$$r = r_s + r_g + r_v, \tag{2}$$
where $r$ is the total reward, consisting of the safety reward $r_s$, the reward $r_g$ for successfully merging and reaching the goal position, and the traffic-efficiency reward $r_v$. We set
$$r_s = \begin{cases} 0.05, & \text{if } \mathrm{TTC} \geq 2.5\ \text{s}, \\ -1, & \text{otherwise}, \end{cases} \tag{3}$$
where TTC denotes the time to collision with surrounding vehicles. The term $r_g$ is defined as $r_g = r_{g1} + r_{g2}$, with $r_{g1} = 5$ awarded when the ego vehicle successfully merges into the target lane, and $r_{g2} = 10$ awarded when the ego vehicle reaches the designated goal position. The traffic-efficiency reward $r_v$ is defined from the ego vehicle speed and the traffic speed, and encourages the ego vehicle to drive close to the average speed of the traffic flow,
$$r_v = \begin{cases} 0.1, & \text{if } |v_{\text{ego}} - v_{\text{ave}}| \leq \kappa\, v_{\text{ave}}, \\ -0.5, & \text{otherwise}, \end{cases} \tag{4}$$
where $v_{\text{ego}}$ is the speed of the ego vehicle, $v_{\text{ave}}$ is the average speed of traffic, and $\kappa$ is the coefficient of the average speed.
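The reward terms above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the "otherwise" branches of $r_s$ and $r_v$ as penalties and reads the speed condition as a relative deviation band around the average speed. The function names, the event flags `merged` and `reached_goal`, and the default `kappa=0.2` are hypothetical and do not come from the paper.

```python
def safety_reward(ttc_s, ttc_threshold_s=2.5):
    """r_s: small bonus when time to collision is comfortable, penalty otherwise."""
    return 0.05 if ttc_s >= ttc_threshold_s else -1.0


def goal_reward(merged, reached_goal):
    """r_g = r_g1 + r_g2, each awarded only when its event occurs at this step."""
    return (5.0 if merged else 0.0) + (10.0 if reached_goal else 0.0)


def efficiency_reward(v_ego, v_ave, kappa):
    """r_v: bonus when the ego speed stays within kappa * v_ave of the traffic average."""
    return 0.1 if abs(v_ego - v_ave) <= kappa * v_ave else -0.5


def total_reward(ttc_s, merged, reached_goal, v_ego, v_ave, kappa=0.2):
    # kappa=0.2 is a placeholder; the text does not state its value.
    return (safety_reward(ttc_s)
            + goal_reward(merged, reached_goal)
            + efficiency_reward(v_ego, v_ave, kappa))
```

For instance, a step with a comfortable TTC, a completed merge, and an ego speed equal to the traffic average collects the safety bonus, the merge bonus, and the efficiency bonus together.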

3.1.4. Cost

The cost $c$ covers three situations, illustrated in Figure 3 and evaluated from the predicted states of the ego vehicle and surrounding objects: (a) the ego vehicle fails to merge before reaching the end of the road, (b) it collides with another vehicle, and (c) merging is difficult because the target lane is occupied. The cost function is written as
$$c = c_1 + c_2 + c_3, \tag{5}$$
where $c$ is the total cost, $c_1$ is the cost of failing to merge before reaching the end of the road, $c_2$ is the cost of colliding with other vehicles, and $c_3$ is the cost incurred when the target lane is occupied.
Similar event-based cost designs have been used in reinforcement learning-based autonomous driving, where hazardous driving events are represented by penalty terms and assigned different weights according to their safety severity [27]. Inspired by this design principle, we define the CMDP cost in this study based on the following three safety-related events in the on-ramp merging task.
$$c_1 = 0.2 \ \ \text{if failed to merge}, \qquad c_2 = 2 \ \ \text{if a collision occurs}, \qquad c_3 = 0.3 \ \ \text{if } l_{\text{occupied}} = 1. \tag{6}$$
The cost values in Equation (6) are dimensionless penalty weights rather than physical risk quantities. They are selected according to the relative severity of the corresponding events. Collision is regarded as the most severe safety violation and is therefore assigned the largest cost. Failing to merge before the end of the ramp is treated as a task-level failure, while target-lane occupancy represents a local merging difficulty that may lead to unsafe interaction if the ego vehicle continues to merge. Accordingly, the cost values are designed to satisfy the severity ordering $c_2 > c_3 > c_1$, where $c_2$ strongly penalizes collision events, and $c_1$ and $c_3$ provide additional constraint signals for non-collision risks. The target-lane occupancy cost $c_3$ is set slightly larger than $c_1$ because it acts as an early warning signal for potential merging conflicts and may be triggered repeatedly during the merging process. These values are empirically selected in simulation to maintain a clear distinction between collision and non-collision events while keeping the cumulative cost within a suitable range for Lagrangian-based policy optimization.
In particular, we use the longitudinal distance and relative speed to determine whether the target lane is occupied,
$$l_{oc}^{i} = \begin{cases} 1, & \text{if } \Delta v_x^i \leq 1.5 \text{ and } \Delta x^i \leq 5, \\ 0, & \text{otherwise}, \end{cases} \tag{7}$$
$$l_{\text{occupied}} = \max_i \, l_{oc}^{i}, \tag{8}$$
where $l_{\text{occupied}}$ denotes the target-lane occupancy status, and $\Delta v_x^i$ and $\Delta x^i$ represent the longitudinal relative speed and distance of vehicle $i$, respectively. Specifically, $l_{\text{occupied}}$ is set to 1 only if at least one vehicle falls within both the spatial and the speed thresholds, which identifies it as an immediate obstacle for merging.
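As a minimal sketch of the event-based cost, the snippet below combines the three cost terms with the occupancy check. It assumes that each term is zero when its event does not occur and that the relative speed and distance are compared as magnitudes against the thresholds; both are assumptions, as are the function names and the `(dv_x, dx)` tuple layout.

```python
def lane_occupied(neighbors, dv_max=1.5, dx_max=5.0):
    """Return 1 if any target-lane vehicle is within both thresholds, else 0.

    neighbors: iterable of (dv_x, dx) pairs, the longitudinal relative speed
    and distance of each vehicle. Magnitudes are used here (assumption).
    """
    return int(any(abs(dv) <= dv_max and abs(dx) <= dx_max
                   for dv, dx in neighbors))


def step_cost(failed_to_merge, collided, neighbors):
    """c = c1 + c2 + c3 with the penalty weights of Equation (6)."""
    c1 = 0.2 if failed_to_merge else 0.0                 # missed the merge
    c2 = 2.0 if collided else 0.0                        # collision, most severe
    c3 = 0.3 if lane_occupied(neighbors) == 1 else 0.0   # occupied target lane
    return c1 + c2 + c3
```

Because the occupancy term can fire at every step while a vehicle sits in the gap, it accumulates during a prolonged merge attempt, which matches its described role as an early warning signal.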

3.1.5. Problem Formulation

We formulate the decision-making problem as a CMDP to maximize cumulative rewards subject to safety constraints:
$$\max_{\pi} \; J_{\pi}^{R} = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right], \quad \text{s.t.} \quad J_{\pi}^{C} = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\right] \leq \eta, \tag{9}$$
where $J_{\pi}^{R}$ and $J_{\pi}^{C}$ denote the expected return and cost, respectively, and $\eta$ is the cost limit reflecting risk preference. By introducing a Lagrange multiplier $\lambda \geq 0$, we recast this as an unconstrained min–max problem:
$$\max_{\pi} \min_{\lambda \geq 0} \; J_{\pi}^{R} - \lambda \left(J_{\pi}^{C} - \eta\right). \tag{10}$$
This Lagrangian formulation penalizes policies that violate the cost constraint ($J_{\pi}^{C} > \eta$), while maximizing the reward when the constraint is satisfied.
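One common way to handle the inner minimization over the multiplier is projected gradient ascent on $\lambda$. The sketch below shows that generic update together with a discounted-sum estimate of the cost. It is not taken from the paper: the learning rate, the function names, and the use of a single trajectory as the cost estimate are all assumptions.

```python
def discounted_sum(values, gamma):
    """Sum of gamma^t * values[t], a one-trajectory estimate of J^R or J^C."""
    total, weight = 0.0, 1.0
    for v in values:
        total += weight * v
        weight *= gamma
    return total


def lagrangian(j_r, j_c, lam, eta):
    """Objective J^R - lambda * (J^C - eta) seen by the policy."""
    return j_r - lam * (j_c - eta)


def update_multiplier(lam, j_c, eta, lr=0.05):
    """Raise lambda when the cost exceeds eta, lower it otherwise, keep it >= 0."""
    return max(0.0, lam + lr * (j_c - eta))
```

When the estimated cost stays above the limit, the multiplier grows and the cost term weighs more heavily in the policy objective; once the constraint holds, the multiplier decays toward zero and the reward dominates again.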

3.2. Human-Aligned Safety Cost Limits

As discussed previously, the cumulative cost $J_{\pi}^{C}$ is constrained by the cost limit $\eta$ in Equation (9), which determines the risk preference of the learned policy. Setting an appropriate value for $\eta$ is essential to match the users’ safety preferences. In this study, we use a fuzzy logic method [28] to determine the cost limit $\eta$ from the specified risk preference and the traffic density. We employ Mamdani inference [29] to accommodate human-semantic descriptors and encode them as interpretable rules. Unlike quantitative risk assessments that typically assume a single acceptable risk level, fuzzy logic allows the cost limit to adapt jointly to both preference and density.
In this study, we design the membership functions based on expert knowledge to map the fuzzy inputs (risk preference and traffic density) to the fuzzy output (cost limit), as illustrated in Figure 4. To make the membership functions explicit, let $p \in [0, 100]$ denote the normalized risk-preference variable, $\rho \in [0.5, 1]$ denote the normalized traffic-density variable, and $\eta \in [0, 0.1]$ denote the cost limit. The membership functions of risk preference are defined as
$$\mu_{\text{con}}(p) = \begin{cases} 1, & 0 \leq p \leq 30, \\ \frac{50 - p}{20}, & 30 < p < 50, \\ 0, & 50 \leq p \leq 100, \end{cases} \quad \mu_{\text{neu}}(p) = \begin{cases} 0, & 0 \leq p \leq 30, \\ \frac{p - 30}{20}, & 30 < p < 50, \\ \frac{70 - p}{20}, & 50 \leq p < 70, \\ 0, & 70 \leq p \leq 100, \end{cases} \quad \mu_{\text{agg}}(p) = \begin{cases} 0, & 0 \leq p \leq 50, \\ \frac{p - 50}{20}, & 50 < p < 70, \\ 1, & 70 \leq p \leq 100. \end{cases} \tag{11}$$
Similarly, the membership functions of traffic density are defined as
$$\mu_{\text{low}}(\rho) = \begin{cases} 1, & \rho = 0.5, \\ \frac{0.7 - \rho}{0.2}, & 0.5 < \rho < 0.7, \\ 0, & 0.7 \leq \rho \leq 1, \end{cases} \quad \mu_{\text{med}}(\rho) = \begin{cases} 0, & \rho = 0.5, \\ \frac{\rho - 0.5}{0.2}, & 0.5 < \rho < 0.7, \\ 1, & 0.7 \leq \rho \leq 0.8, \\ \frac{1 - \rho}{0.2}, & 0.8 < \rho < 1, \\ 0, & \rho = 1, \end{cases} \quad \mu_{\text{high}}(\rho) = \begin{cases} 0, & 0.5 \leq \rho \leq 0.8, \\ \frac{\rho - 0.8}{0.2}, & 0.8 < \rho < 1, \\ 1, & \rho = 1. \end{cases} \tag{12}$$
The membership functions of the cost limit are defined as
$$\mu_{\text{small}}(\eta) = \begin{cases} 1, & 0 \leq \eta \leq 0.01, \\ \frac{0.05 - \eta}{0.04}, & 0.01 < \eta < 0.05, \\ 0, & 0.05 \leq \eta \leq 0.1, \end{cases} \quad \mu_{\text{medium}}(\eta) = \begin{cases} 0, & 0 \leq \eta \leq 0.01, \\ \frac{\eta - 0.01}{0.04}, & 0.01 < \eta < 0.05, \\ \frac{0.09 - \eta}{0.04}, & 0.05 \leq \eta < 0.09, \\ 0, & 0.09 \leq \eta \leq 0.1, \end{cases} \quad \mu_{\text{large}}(\eta) = \begin{cases} 0, & 0 \leq \eta \leq 0.05, \\ \frac{\eta - 0.05}{0.03}, & 0.05 < \eta < 0.08, \\ 1, & 0.08 \leq \eta \leq 0.1. \end{cases} \tag{13}$$
These equations correspond to the membership curves shown in Figure 4a–c, and provide explicit mathematical definitions for the fuzzy inputs and output used in the Mamdani inference process. The above ranges and membership shapes are selected to support smooth safety–efficiency regulation in on-ramp merging. The normalized risk-preference range allows different user-specified risk tolerances to be represented on a common scale, while the normalized traffic-density range reflects the transition from sparse to dense merging conditions considered in the simulation. The cost-limit range is consistent with the sensitivity analysis of the CMDP constraint limit. For the membership shapes, shoulder functions are used for boundary categories, such as conservative/aggressive preference, low/high density, and small/large cost limit, whereas triangular or trapezoidal functions are used for intermediate categories. This overlapping design avoids abrupt changes in the cost limit when the input variables vary slightly, which is important for maintaining stable safety–efficiency tradeoffs during merging decisions.
Consistent with established driving-style taxonomies in driving behavior research [30], the user’s risk preference is represented by three commonly used linguistic levels, i.e., conservative, neutral, and aggressive. As shown in Figure 4a, these levels are modeled as predefined fuzzy membership functions over the normalized risk-preference variable ranging from 0 to 100%. They are not inferred online from driver data in the current study; instead, the membership functions are specified based on prior studies and expert assumptions to systematically evaluate how different risk-tolerance levels affect the adaptive safety cost limit and the learned merging policy. In practical applications, the normalized risk-preference input can be obtained through several interfaces, such as a pre-driving questionnaire, a user-selected driving-style setting, or calibration from historical driving behavior. The corresponding membership degrees with respect to the conservative, neutral, and aggressive sets are then used as inputs to the subsequent fuzzy inference process. Therefore, the focus of this study is not to infer the user’s risk preference, but to translate a given preference into an adaptive CMDP cost limit for safe policy learning. Traffic density includes {‘low’, ‘medium’, ‘high’}, as shown in Figure 4b, and varies between 0.5 and 1.0, where larger values indicate denser traffic conditions with smaller inter-vehicle headways and more surrounding vehicles within the ego vehicle’s interaction range. The medium-density traffic is modeled with a trapezoidal membership function, reaching its maximum membership value at moderate density levels. The membership function of cost is shown in Figure 4c, including {‘small’, ‘medium’, ‘large’}, and the cost limits vary between 0 and 0.1.
After fuzzification, we can define fuzzy rules that map the fuzzy inputs (risk preference and traffic density) to the fuzzy output (cost limit). Fuzzy rules are a crucial aspect of designing a fuzzy logic system, and the rules are usually derived from expert knowledge or empirical data. In this study, the fuzzy rules are defined based on human knowledge, and Table 1 shows the fuzzy relations between cost limit, risk preference, and traffic density.
The rule table is designed to reflect the safety–efficiency tradeoff in on-ramp merging. For a fixed traffic density, a higher risk preference is assigned a larger cost limit, because the ego vehicle is allowed to accept a slightly higher level of risk to reduce merge delay and improve traffic efficiency. Conversely, for a fixed risk preference, a higher traffic density is assigned a smaller cost limit, because dense traffic provides fewer acceptable merging gaps and requires stricter safety regulation. Therefore, the rule table follows a monotonic and interpretable design: the cost limit increases with user risk tolerance and decreases with traffic density. For example, an aggressive preference under low-density traffic leads to a large cost limit to encourage efficient merging, whereas a conservative preference under high-density traffic leads to a small cost limit to enforce cautious behavior. From Table 1, we can infer the fuzzy output of the cost limit given the fuzzy inputs of risk preference and traffic density. We use a matrix $\tilde{R}$ to represent the fuzzy rules, and the process of rule evaluation can be written as
$$\tilde{C} = \bigcup_{k=1}^{9} \left(\tilde{A}_i \times \tilde{B}_j\right) \circ \tilde{R}_k, \tag{14}$$
where $\tilde{C}$ is the fuzzy output, $\tilde{A}_i$ is the traffic density, $\tilde{B}_j$ is the risk preference, $i, j \in \{1, 2, 3\}$ are the indices of the fuzzy sets of traffic density and risk preference, respectively, and the operator $\circ$ denotes the composition of fuzzy relations.
Based on the fuzzy rules, the fuzzy output of each rule can be computed. These fuzzy outputs are aggregated into a single fuzzy set and converted to a crisp output value using the centroid method. Figure 5 illustrates an example of aggregation and defuzzification in the Mamdani process. Given a traffic density of 0.57 and a risk preference of 45%, the active input fuzzy sets are $\tilde{A} = \{\text{Low}, \text{Medium}\}$ and $\tilde{B} = \{\text{Conservative}, \text{Neutral}\}$. The resulting firing strengths for the small, medium, and large cost-limit sets are 0.25, 0.35, and 0.65, respectively, as shown in Figure 5. Each output set is truncated at its firing strength, which gives the grey areas in Figure 5, and the truncated sets are aggregated into a single fuzzy set by taking their union. Finally, the centroid of this union gives the crisp value of the cost limit, which is 0.0595. Therefore, the cost limit is 0.0595 when the traffic density is 0.57 and the risk preference is 45%.
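The worked example can be reproduced approximately with a small pure-Python Mamdani sketch using min for rule conjunction, max for aggregation, and a discretized centroid. The membership functions follow the piecewise definitions given earlier. The rule table, however, is an assumption: Table 1 is not reproduced in this section, so `RULES` is only one monotonic mapping that is consistent with the firing strengths quoted above. All function and variable names are hypothetical.

```python
def mu_con(p):
    return 1.0 if p <= 30 else (50 - p) / 20 if p < 50 else 0.0

def mu_neu(p):
    if 30 < p < 50:
        return (p - 30) / 20
    return (70 - p) / 20 if 50 <= p < 70 else 0.0

def mu_agg(p):
    return 0.0 if p <= 50 else (p - 50) / 20 if p < 70 else 1.0

def mu_low(rho):
    return 1.0 if rho <= 0.5 else (0.7 - rho) / 0.2 if rho < 0.7 else 0.0

def mu_med(rho):
    if 0.5 < rho < 0.7:
        return (rho - 0.5) / 0.2
    if 0.7 <= rho <= 0.8:
        return 1.0
    return (1.0 - rho) / 0.2 if 0.8 < rho < 1.0 else 0.0

def mu_high(rho):
    return 0.0 if rho <= 0.8 else (rho - 0.8) / 0.2 if rho < 1.0 else 1.0

def mu_small(eta):
    return 1.0 if eta <= 0.01 else (0.05 - eta) / 0.04 if eta < 0.05 else 0.0

def mu_medium(eta):
    if 0.01 < eta < 0.05:
        return (eta - 0.01) / 0.04
    return (0.09 - eta) / 0.04 if 0.05 <= eta < 0.09 else 0.0

def mu_large(eta):
    return 0.0 if eta <= 0.05 else (eta - 0.05) / 0.03 if eta < 0.08 else 1.0

PREF = {"conservative": mu_con, "neutral": mu_neu, "aggressive": mu_agg}
DENS = {"low": mu_low, "medium": mu_med, "high": mu_high}
OUT = {"small": mu_small, "medium": mu_medium, "large": mu_large}

# ASSUMED rule table (preference, density) -> cost limit, not the paper's Table 1.
RULES = {
    ("conservative", "low"): "medium", ("conservative", "medium"): "small",
    ("conservative", "high"): "small", ("neutral", "low"): "large",
    ("neutral", "medium"): "medium", ("neutral", "high"): "small",
    ("aggressive", "low"): "large", ("aggressive", "medium"): "large",
    ("aggressive", "high"): "medium",
}

def firing_strengths(p, rho):
    """Per-output firing strength: min over a rule's inputs, max across rules."""
    strengths = {name: 0.0 for name in OUT}
    for (pref, dens), out in RULES.items():
        w = min(PREF[pref](p), DENS[dens](rho))
        strengths[out] = max(strengths[out], w)
    return strengths

def cost_limit(p, rho, n=2001):
    """Centroid of the aggregated, truncated output sets on a grid over [0, 0.1]."""
    strengths = firing_strengths(p, rho)
    num = den = 0.0
    for k in range(n):
        eta = 0.1 * k / (n - 1)
        agg = max(min(strengths[name], OUT[name](eta)) for name in OUT)
        num += eta * agg
        den += agg
    return num / den if den > 0 else 0.0
```

Calling `firing_strengths(45, 0.57)` gives 0.25, 0.35, and 0.65 for the small, medium, and large sets, and `cost_limit(45, 0.57)` lands close to the 0.0595 reported in the example; small differences are expected from the discretization and from the assumed rule table.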

4. Model Predictive Control

This section introduces the MPC formulation used for vehicle control, including the linearized discrete-time kinematic bicycle model, the prediction of future states over a finite horizon, and the quadratic optimization of control inputs under constraints.

4.1. Discrete Linear Model

In this study, we use the kinematic bicycle model in MPC [31]. As illustrated in Figure 6, the vehicle state is written as $X = [x, y, v, \varphi]$, where $x$ and $y$ denote the longitudinal and lateral position, and $v$ and $\varphi$ are the vehicle speed and the yaw angle. The side-slip angle, the wheelbase, and the distances from the front and rear axles to the center of gravity are denoted as $\beta$, $l$, $l_f$, and $l_r$, respectively. The control variable is denoted as $U = [u_1, u_2]$, where $u_1$ and $u_2$ are the acceleration $a$ and the steering angle $\delta$, respectively. The referenced position and control variable are defined as $[x_r, y_r, v_r, \varphi_r]$ and $[v_r, \delta_r]$. We derive a linear model
$$\dot{X} = f(X, U) = \begin{bmatrix} v \cos(\varphi + \beta) \\ v \sin(\varphi + \beta) \\ a \\ \frac{v}{l} \sin\beta \end{bmatrix}, \tag{15}$$
$$\beta = \arctan\left(\frac{l_r}{l_f + l_r} \tan\delta\right), \tag{16}$$
$$\dot{e}_X = A_F\, e_X + B_G\, e_U, \tag{17}$$
$$A_F = \begin{bmatrix} 0 & 0 & \cos(\varphi_r + \beta_r) & -v_r \sin(\varphi_r + \beta_r) \\ 0 & 0 & \sin(\varphi_r + \beta_r) & v_r \cos(\varphi_r + \beta_r) \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \frac{\sin\beta_r}{l} & 0 \end{bmatrix}, \tag{18}$$
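$$B_G = \begin{bmatrix} 0 & -v_r \sin(\varphi_r + \beta_r)\, \left.\frac{\partial \beta}{\partial \delta}\right|_{\delta_r} \\ 0 & v_r \cos(\varphi_r + \beta_r)\, \left.\frac{\partial \beta}{\partial \delta}\right|_{\delta_r} \\ 1 & 0 \\ 0 & \frac{v_r}{l} \cos\beta_r\, \left.\frac{\partial \beta}{\partial \delta}\right|_{\delta_r} \end{bmatrix}, \qquad \left.\frac{\partial \beta}{\partial \delta}\right|_{\delta_r} = \frac{l_r (l_f + l_r)}{(l_f + l_r)^2 \cos^2\delta_r + l_r^2 \sin^2\delta_r}, \tag{19}$$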
where $A_F \in \mathbb{R}^{4 \times 4}$ and $B_G \in \mathbb{R}^{4 \times 2}$ are the Jacobian matrices of the function $f$ with respect to the state $X$ and the control $U$, respectively, evaluated at the reference point. The state error between the predicted state and the reference state is denoted as $e_X \in \mathbb{R}^4$. The control error between the predicted control and the reference control is denoted as $e_U \in \mathbb{R}^2$. We discretize the linear model as
$$e_X(k) = A_k\, e_X(k-1) + B_k\, e_U(k-1), \tag{20}$$
where $e_X(k)$ is the state error at time $k$, $e_X(k-1)$ is the state error at time $k-1$, and $e_U(k-1)$ is the control error at time $k-1$. The system matrix $A_k$ and input matrix $B_k$ are defined as
$$A_k = A_F \cdot \Delta t + I, \tag{21}$$
$$B_k = B_G \cdot \Delta t, \tag{22}$$
where $\Delta t$ is the time interval and $I$ is the identity matrix. We define the predicted state error vector $E_X$ and the control error vector $E_U$ as
$$E_X = [e_X(k+1), \dots, e_X(k+N)], \tag{23}$$
$$E_U = [e_U(k), \dots, e_U(k+N-1)], \tag{24}$$
where $N$ is the prediction horizon. Then, the optimization problem can be formulated as
$$\min_{E_U} \; E_X^{\top} W_X E_X + E_U^{\top} W_U E_U, \tag{25}$$
where $W_X$ and $W_U$ are diagonal matrices representing the weights of the state and control errors, respectively. The constraints on the control variables are described as
$$-\tilde{a}_{\max} \leq u_1(k+i) \leq \tilde{a}_{\max}, \qquad -\delta_{\max} \leq u_2(k+i) \leq \delta_{\max}, \qquad i = 0, 1, \dots, N-1, \tag{26}$$
where the maximum acceleration $\tilde{a}_{\max}$ and the maximum steering angle $\delta_{\max}$ bound the control variables and are usually chosen to ensure driving comfort and tracking performance. This optimization problem is a quadratic programming (QP) problem that can be solved quickly by an off-the-shelf convex optimization solver, e.g., CVXOPT. The quadratic program solver is denoted as ‘QP.solver’ at line 8 of Algorithm 1 and solves the optimization problem defined in Equation (25). The solution to this optimization problem is $[e_U(k), \dots, e_U(k+N-1)]$, and we take the first element $e_U(k)$ as the optimal control.
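The linearization and discretization steps can be illustrated with a short pure-Python sketch. It evaluates the two Jacobians at a reference point, applies the forward-Euler discretization $A_k = A_F \Delta t + I$, $B_k = B_G \Delta t$, rolls the error dynamics forward, and evaluates the quadratic objective for a given control-error sequence. The steering-related entries are obtained here by differentiating the side-slip relation $\beta(\delta)$ directly. Function names, argument order, and the scalar weights `wx` and `wu` are assumptions, and no QP solver is invoked.

```python
import math

def f(X, U, lf, lr, l):
    """Kinematic bicycle dynamics with state [x, y, v, phi] and control [a, delta]."""
    _, _, v, phi = X
    a, delta = U
    beta = math.atan(lr / (lf + lr) * math.tan(delta))
    return [v * math.cos(phi + beta), v * math.sin(phi + beta),
            a, v / l * math.sin(beta)]

def jacobians(v_r, phi_r, delta_r, lf, lr, l):
    """A_F = df/dX and B_G = df/dU evaluated at the reference point."""
    L = lf + lr
    beta_r = math.atan(lr / L * math.tan(delta_r))
    # d(beta)/d(delta), derived from beta = arctan(lr / L * tan(delta))
    dbeta = lr * L / (L ** 2 * math.cos(delta_r) ** 2
                      + lr ** 2 * math.sin(delta_r) ** 2)
    s, c = math.sin(phi_r + beta_r), math.cos(phi_r + beta_r)
    A_F = [[0.0, 0.0, c, -v_r * s],
           [0.0, 0.0, s, v_r * c],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, math.sin(beta_r) / l, 0.0]]
    B_G = [[0.0, -v_r * s * dbeta],
           [0.0, v_r * c * dbeta],
           [1.0, 0.0],
           [0.0, v_r / l * math.cos(beta_r) * dbeta]]
    return A_F, B_G

def discretize(A_F, B_G, dt):
    """Forward Euler: A_k = A_F * dt + I and B_k = B_G * dt."""
    A_k = [[A_F[i][j] * dt + (1.0 if i == j else 0.0) for j in range(4)]
           for i in range(4)]
    B_k = [[B_G[i][j] * dt for j in range(2)] for i in range(4)]
    return A_k, B_k

def rollout(A_k, B_k, e0, control_errors):
    """Propagate e_X(k) = A_k e_X(k-1) + B_k e_U(k-1) over the horizon."""
    errors, e = [], list(e0)
    for u in control_errors:
        e = [sum(A_k[i][j] * e[j] for j in range(4))
             + sum(B_k[i][j] * u[j] for j in range(2)) for i in range(4)]
        errors.append(e)
    return errors

def quadratic_cost(errors, control_errors, wx=1.0, wu=1.0):
    """Objective of the QP with scalar weights standing in for W_X and W_U."""
    state_term = sum(wx * x * x for e in errors for x in e)
    input_term = sum(wu * x * x for u in control_errors for x in u)
    return state_term + input_term
```

A finite-difference comparison of `jacobians` against `f` is a cheap sanity check before the matrices are handed to a QP solver, since a sign or scaling slip in either matrix silently degrades tracking.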
Algorithm 1 MPC and States Prediction
Input: $s_t$, $a_t$ (raw action), $a'_t$ (replaced action), QP.solver. MPC is used for state prediction (MPC.predict) and control execution (MPC.execute).
  1: if MPC.predict then
  2:     $v_r = g(a_t)$ by (28)                      ▹ execute the raw action
  3: else if MPC.execute then
  4:     $v_r = g(a'_t)$ by (28)                     ▹ execute the replaced action
  5: end if
  6: Get $s_r$ by (27)                               ▹ reference positions
  7: $(x_r, y_r) \leftarrow (s_r, l_r)$ by coordinate transform
  8: $U^* \leftarrow \text{QP.solver}(x_r, y_r, v_r)$          ▹ solve the optimal control
  9: Get $e_X$ by (20)                               ▹ state errors
 10: Get $x_e^{\sigma}$ by (29)                      ▹ predicted states
 11: if MPC.predict then
 12:     return $x_e^{\sigma}$
 13: else if MPC.execute then
 14:     return $U^*$
 15: end if
Output: $x_e^{\sigma}$ or $U^*$                      ▹ predicted states or optimal control

4.2. States Computation

In this study, we compute the state error $e_X(k)$ at each time $k$ based on the predicted positions over the entire time horizon, as shown in Algorithm 1. To obtain the predicted positions, we first compute the reference positions in the Frenet frame and then convert them back to Cartesian coordinates. The reference positions are the waypoints along the reference line in the Frenet frame [32]. First, the reference position $s_r$ in the Frenet frame is derived from $x_r$, the orthogonal projection of $x$ onto the reference path. The relationship between $x_r$ and $s_r$ is known from the path geometry; for example, if the reference path is a long straight road, then $x_r = s_r$. The reference line is assumed to be the center line of the target lane, which is the left lane if the decision is to turn left and the right lane if the decision is to turn right. Therefore, the reference positions in the Frenet frame can be computed as
$$s_r(k) = s_r(k-1) + v_r \Delta t,$$
where $s_r(k)$ is the reference position at time $k$, and $v_r$ is the reference speed that is determined by the decision $a_t$, which is updated as
$$v_r = \begin{cases} v_t + \Delta v & \text{if } a_t = A_{\text{acc}}, \\ v_t - \Delta v & \text{if } a_t = A_{\text{dec}}, \\ v_t & \text{else}, \end{cases}$$
where $v_t$ is the current speed of the ego vehicle, $a_t$ is the high-level decision determined by the RL agent, and the action space $A$ is defined as $A = [A_{\text{left}}, A_{\text{right}}, A_{\text{acc}}, A_{\text{idle}}, A_{\text{dec}}]$, which are turning left, turning right, acceleration, idling and deceleration, respectively. In particular, we execute the RL decision in MPC for motion prediction of the ego vehicle, and the predicted position of the ego vehicle $x_e^{\sigma}$ is written as
$$x_e^{\sigma} = x_r(\sigma) + e_X^{x}(\sigma),$$
where $\sigma \in [1, N]$ is the prediction coefficient, $x_r(\sigma)$ is the reference longitudinal position in the Cartesian coordinates, and $e_X^{x}(\sigma)$ is the longitudinal position error computed by Equation (20).
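The reference-speed rule and the Frenet-frame recursion above can be sketched in a few lines of Python. This is only an illustrative sketch: the action labels, the speed increment `delta_v`, and the numerical values are assumptions and are not taken from the paper.

```python
def reference_speed(v_t, action, delta_v):
    """Reference speed v_r selected by the high-level decision."""
    if action == "accelerate":
        return v_t + delta_v
    if action == "decelerate":
        return v_t - delta_v
    # Turning left, turning right and idling keep the current speed.
    return v_t


def reference_positions(s0, v_r, dt, horizon):
    """Roll out s_r(k) = s_r(k-1) + v_r * dt over the prediction horizon."""
    positions = []
    s = s0
    for _ in range(horizon):
        s = s + v_r * dt
        positions.append(s)
    return positions


# Hypothetical values: 20 m/s ego speed, 2 m/s increment, 0.5 s time step.
v_r = reference_speed(20.0, "accelerate", 2.0)
waypoints = reference_positions(0.0, v_r, 0.5, 4)
```

On a straight reference path, where $x_r = s_r$, these waypoints can be used directly as the Cartesian reference positions.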

5. Safe Reinforcement Learning

This section introduces the safe RL algorithm, for which we use a discrete version of the SAC algorithm to solve the CMDP problem. Also, the ASM is built to check the safety of the RL action and replace it when necessary based on the predicted states provided by the MPC module.

5.1. Lagrangian-Based Discrete SAC

Safe RL is employed for the high-level behavioral decision in this study, i.e., to determine the optimal driving decision from a discrete action space A. A discrete action version of the SAC algorithm, i.e., SAC-Discrete (SACD) [33], is used to solve the CMDP problem.

5.1.1. Critic Network and Policy Network

Let $\mathring{a}_1, \mathring{a}_2, \ldots, \mathring{a}_{|A|}$ be the discrete actions in the action space, i.e., $A = \{\mathring{a}_h\}_{h=1}^{|A|}$, where $|A|$ is the size of the action space. A critic network and a policy network are used in this study, as shown in Figure 7. The soft Q-function of the discrete SAC outputs a vector of size $|A|$ that consists of the Q-value of each action, i.e., $q_\omega : S \to \mathbb{R}^{|A|}$:
$$q_\omega(s_t) = [\, Q_\omega(s_t, \mathring{a}_1), Q_\omega(s_t, \mathring{a}_2), \ldots, Q_\omega(s_t, \mathring{a}_{|A|}) \,]^{T},$$
where $q_\omega(s_t)$ is the soft Q-function parameterized by $\omega$, and $Q_\omega(s_t, \mathring{a}_h)$ is the state-action value at state $s_t$ with the discrete action $\mathring{a}_h$ $(h = 1, 2, \ldots, |A|)$. Likewise, the policy can directly output the action distribution, which is a vector that consists of the probability of each action at state $s_t$, i.e., $\pi_\theta : S \to [0, 1]^{|A|}$:
$$\pi_\theta(s_t) = [\, \pi_\theta(\mathring{a}_1 | s_t), \pi_\theta(\mathring{a}_2 | s_t), \ldots, \pi_\theta(\mathring{a}_{|A|} | s_t) \,]^{T},$$
where $\theta$ denotes the policy network parameters (see Figure 7), and $\pi_\theta(\mathring{a}_h | s_t)$ is the probability of the action $\mathring{a}_h$ $(h = 1, 2, \ldots, |A|)$ conditioned on the state $s_t$. In the discrete action settings, the soft state-value function is computed as
$$V(s_t) = \pi_\theta(s_t)^{T} \left[ q_\omega(s_t) - \xi \log \pi_\theta(s_t) \right],$$
where $\xi$ is the temperature parameter that determines the relative importance of the entropy term versus the reward term. According to [33], the policy network parameters $\theta$ can be learned by minimizing the following loss function:
$$J_\pi(\theta) = \mathbb{E}_{s_t \sim D} \left[ \pi_\theta(s_t)^{T} \left( \xi \log \pi_\theta(s_t) - q_\omega(s_t) \right) \right].$$
The loss function of the temperature parameter is defined as
$$G(\xi) = \pi_\theta(s_t)^{T} \left[ -\xi \left( \log \pi_\theta(s_t) + \bar{H} \right) \right],$$
where $\bar{H}$ is a hyperparameter that represents the target entropy. As shown in Figure 7, the critic network parameter $\omega$ of the soft Q-function can be learned by minimizing the soft Bellman residual:
$$J_q(\omega) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} \left( Q_\omega(s_t, a_t) - \bar{Q}(s_t, a_t) \right)^2 \right],$$
$$\bar{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V_{\bar{\omega}}(s_{t+1}) \right],$$
where $Q_\omega(s_t, a_t)$ is the estimated Q function, and $\bar{Q}(s_t, a_t)$ is the target soft Q function. In particular, $V_{\bar{\omega}}(s_{t+1})$ is estimated using a target Q network:
$$V_{\bar{\omega}}(s_{t+1}) = \pi_\theta(s_{t+1})^{T} \left[ q_{\bar{\omega}}(s_{t+1}) - \xi \log \pi_\theta(s_{t+1}) \right],$$
where $\bar{\omega}$ is the parameter of the target Q network, which is updated in Line 18 in Algorithm 2.
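As a rough illustration of the discrete-action quantities in this subsection, the sketch below evaluates the soft state value, the policy loss, and the temperature loss for a single state, following the SAC-Discrete form in [33]. The function names and the example numbers are hypothetical, and batching over the replay buffer and automatic differentiation are deliberately left out.

```python
import math


def soft_state_value(probs, q_values, temperature):
    """V(s) = pi(s)^T [q(s) - xi * log pi(s)] for one state."""
    return sum(p * (q - temperature * math.log(p))
               for p, q in zip(probs, q_values) if p > 0.0)


def policy_loss(probs, q_values, temperature):
    """Per-state policy objective pi(s)^T [xi * log pi(s) - q(s)]."""
    return sum(p * (temperature * math.log(p) - q)
               for p, q in zip(probs, q_values) if p > 0.0)


def temperature_loss(probs, temperature, target_entropy):
    """Per-state temperature objective pi(s)^T [-xi * (log pi(s) + H_bar)]."""
    return sum(p * (-temperature * (math.log(p) + target_entropy))
               for p in probs if p > 0.0)


# Hypothetical two-action example with a uniform policy.
probs = [0.5, 0.5]
q_values = [1.0, 3.0]
value = soft_state_value(probs, q_values, temperature=0.2)
loss = policy_loss(probs, q_values, temperature=0.2)
```

For a fixed state the policy loss is the negative of the soft state value, so lowering the loss corresponds to raising the entropy-regularized value. The temperature loss vanishes when the policy entropy equals the target entropy.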
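Algorithm 2 Human-aligned safe RL
Input: MPC, ASM, $\sigma$, $N$
Input: initialize $\theta$, $\omega_{1,2}$, $\omega_c$, $\bar{\omega}$, $\bar{\omega}_c$, $\lambda$, $\xi$, $\Gamma$ and $D$
  1: for each iteration do
  2:     for each environment step do
  3:         $a_t \sim \pi_\theta(\cdot | s_t)$
  4:         $x_e^{\sigma} \leftarrow \text{MPC.predict}(s_t, a_t)$ by Algorithm 1
  5:         $a_t' \leftarrow \text{ASM}(a_t, x_e^{\sigma}, \sigma, N)$ by Algorithm 3
  6:         $s_{t+1} \leftarrow \text{MPC.execute}(s_t, a_t')$ by Algorithm 1
  7:         $r_t \leftarrow r(s_t, a_t')$
  8:         $c_t \leftarrow c(s_t, a_t')$
  9:         $D \leftarrow D \cup \{(s_t, a_t', r_t, c_t, s_{t+1})\}$
 10:     end for
 11:     for each gradient step do
 12:         $\lambda \leftarrow \lambda - \alpha_\lambda \nabla_\lambda J(\lambda)$
 13:         $\omega_i \leftarrow \omega_i - \alpha_\omega \nabla_{\omega_i} J_q(\omega_i)$ for $i = 1, 2$
 14:         $\omega_c \leftarrow \omega_c - \alpha_{\omega_c} \nabla_{\omega_c} J_{q_c}(\omega_c)$
 15:         $\theta \leftarrow \theta - \alpha_\theta \nabla_\theta J_\pi^{\lambda}(\theta)$
 16:         $\xi \leftarrow \xi - \alpha_\xi \nabla_\xi G(\xi)$
 17:         if soft update then
 18:             $\bar{\omega} \leftarrow \Gamma \omega + (1 - \Gamma) \bar{\omega}$
 19:             $\bar{\omega}_c \leftarrow \Gamma \omega_c + (1 - \Gamma) \bar{\omega}_c$
 20:         end if
 21:     end for
 22: end for
Output: Optimal policy $\pi^*$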
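Algorithm 3 Action Shielding Mechanism
Input: $a_t$, $x_e^{\sigma}$, $x_{\text{obj}}$, $\sigma$, $N$
  1: if $a_t$ is $A_{\text{left}}$ (turning left) then
  2:     for each vehicle $i$ do
  3:         $x_i(N) = x_i(0) + N \cdot v_i \Delta t$
  4:         if $x_i(0) \le x_e^{\sigma} \le x_i(N)$ and $|y_e^{\sigma}(A_{\text{left}}) - y_i| \le d_y$ then
  5:             $a_t' = A_{\text{dec}}$ (deceleration)    ▹ Situation 1
  6:         end if
  7:     end for
  8: end if
  9: if $a_t$ is $A_{\text{right}}$ (turning right) then
 10:     if ego vehicle has merged then
 11:         $a_t' = A_{\text{idle}}$ (idling)    ▹ Situation 2
 12:     end if
 13: end if
 14: if $a_t$ is $A_{\text{acc}}$ (acceleration) or $A_{\text{idle}}$ then
 15:     if $|x_e^{\sigma}(a) - x_{\text{obj}}| \le d_x$ then
 16:         $a_t' = A_{\text{dec}}$    ▹ Situation 3
 17:     end if
 18: end if
Output: Safe decision $a_t'$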

5.1.2. Cost Network

A cost network parameterized by $\omega_c$ is built to approximate the cost value function $q_{\omega_c}(s_t)$, as shown in Figure 7. The cost value function $q_{\omega_c}(s_t)$ is defined as
$$q_{\omega_c}(s_t) = [\, Q_{\omega_c}(s_t, \mathring{a}_1), Q_{\omega_c}(s_t, \mathring{a}_2), \ldots, Q_{\omega_c}(s_t, \mathring{a}_{|A|}) \,]^{T},$$
where $Q_{\omega_c}(s_t, \mathring{a}_h)$ is the soft cost value at state $s_t$ with the discrete action $\mathring{a}_h$ $(h = 1, 2, \ldots, |A|)$. Similarly, the loss function of the cost network is defined as
$$J_{q_c}(\omega_c) = \mathbb{E}_{(s_t, a_t) \sim D} \left[ \frac{1}{2} \left( Q_{\omega_c}(s_t, a_t) - \bar{Q}_c(s_t, a_t) \right)^2 \right],$$
where $Q_{\omega_c}(s_t, a_t)$ is the cost value at state $s_t$ and action $a_t$, and $\bar{Q}_c(s_t, a_t)$ is the target cost value:
$$\bar{Q}_c(s_t, a_t) = c(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1}} \left[ V_{\bar{\omega}_c}(s_{t+1}) \right],$$
$$V_{\bar{\omega}_c}(s_{t+1}) = \pi_\theta(s_{t+1})^{T} q_{\bar{\omega}_c}(s_{t+1}),$$
where the target cost function $q_{\bar{\omega}_c}$ is approximated by the target cost network with parameter $\bar{\omega}_c$, which is updated in Line 19 in Algorithm 2.

5.1.3. n-Step TD Learning

In traditional temporal difference (TD) learning, the agent updates the value estimates based on the current estimate and a target value. n-step TD learning generalizes this idea by looking ahead n steps into the future rather than only at the immediate next state, so that the update is based on the discounted sum of rewards over the next n steps plus a bootstrapped value. In this study, we use n-step TD learning, which is expected to strike a balance between the short-term perspective of one-step TD learning and the longer-term perspective of Monte Carlo methods. By adjusting the parameter n, the trade-off between bias and variance in the learning process can be tuned. We also use n-step transitions to approximate the target Q-value and cost value functions. According to [34], the target Q-value in Equation (36) and the target cost value in Equation (40) are rewritten as
$$\bar{Q}(s_t, a_t) = \sum_{j=0}^{n-1} \gamma^{j} r(s_{t+j}, a_{t+j}) + \gamma^{n} \, \mathbb{E}_{s_{t+n}} \left[ V_{\bar{\omega}}(s_{t+n}) \right],$$
$$\bar{Q}_c(s_t, a_t) = \sum_{j=0}^{n-1} \gamma^{j} c(s_{t+j}, a_{t+j}) + \gamma^{n} \, \mathbb{E}_{s_{t+n}} \left[ V_{\bar{\omega}_c}(s_{t+n}) \right],$$
where n is the parameter that determines the number of steps that we want to look ahead before updating the Q-function.
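The n-step targets above share one structure for rewards and costs, which the following minimal sketch makes explicit. The helper name `n_step_target` and the numbers are hypothetical; the bootstrap value stands in for the target-network estimate at step $t+n$.

```python
def n_step_target(signals, bootstrap_value, gamma):
    """Discounted sum of n per-step signals plus a discounted bootstrap value.

    `signals` holds the n rewards (or costs) observed from step t to t+n-1.
    """
    target = 0.0
    for j, x in enumerate(signals):
        target += (gamma ** j) * x
    return target + (gamma ** len(signals)) * bootstrap_value


# Hypothetical 3-step example with discount 0.5 and bootstrap value 10.
q_target = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

Passing per-step costs instead of rewards, together with the target cost value, gives the cost target in the same way. With a single signal the function reduces to the ordinary one-step TD target.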

5.1.4. Lagrange Multiplier

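With the cost network in place, the loss function of the policy network is modified by adding the cost term weighted by the Lagrange multiplier:
$$J_\pi^{\lambda}(\theta) = J_\pi(\theta) + \mathbb{E}_{s_t \sim D} \left[ \lambda \, \pi_\theta(s_t)^{T} q_{\omega_c}(s_t) \right],$$
where $\lambda$ is the Lagrange multiplier. The corresponding gradient is
$$\nabla_\theta J_\pi^{\lambda}(\theta) = \frac{\partial J_\pi(\theta)}{\partial \theta} + \lambda \left( \frac{\partial \pi_\theta(s_t)}{\partial \theta} \right)^{T} q_{\omega_c}(s_t).$$
We use the state-action cost value to approximate $J_C^{\pi}$,
$$J_C^{\pi} = Q_{\omega_c}(s_t, a_t).$$
The loss function of the Lagrange multiplier is denoted as
$$J(\lambda) = \mathbb{E}_{s_t \sim D} \left[ \lambda \left( Q_{\omega_c}(s_t, a_t) - \eta \right) \right],$$
where $J(\lambda)$ is the loss for the Lagrange multiplier $\lambda$ and $\eta$ is the cost limit.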
Overall, the proposed human-aligned safe RL is described in Algorithm 2. The raw action $a_t$ sampled from the policy $\pi_\theta(\cdot | s_t)$ (Line 3) is first sent to the MPC for motion prediction (MPC.predict, Line 4) and then to the ASM for a safety check (Line 5). The replaced action $a_t'$ provided by the ASM is then executed by the low-level MPC controller (MPC.execute, Line 6), which yields the corresponding reward and cost used for training (Lines 7–8). We use two critic networks parameterized by $\omega_i$ $(i = 1, 2)$ to avoid overestimation of the soft Q-value [35]. In addition, the target soft Q-value and cost value are updated based on the parameters $\bar{\omega}$ and $\bar{\omega}_c$.

5.2. Action Shielding Mechanism

Frequent constraint violations usually make it hard for the RL agent to learn. In this study, to enhance the sample efficiency of safe RL, we design a rule-based ASM that replaces unexpected or unsafe RL actions with safe ones (see Algorithm 3), which helps guide the RL agent toward appropriate actions during exploration. In the ASM, an unsafe RL action is detected by checking, based on the predicted states, whether a collision would occur between the ego vehicle and the surrounding vehicles.
We define the unsafe decision set as $\Omega_{\text{unsafe}} = \Omega_1 \cup \Omega_2 \cup \Omega_3$, which covers three typical situations, as shown in Figure 8. If the original RL decision satisfies $a_t \in \Omega_{\text{unsafe}}$, it is substituted with a safe one $a_t'$.

5.2.1. Situation 1

A collision is likely to occur when the action turning left $A_{\text{left}}$ is executed (see the top of Figure 8). The unsafe set $\Omega_1$ of this situation is defined as
$$\Omega_1 = \{\, a \in A \mid a = A_{\text{left}},\ x_i(0) \le x_e^{\sigma}(A_{\text{left}}) \le x_i(N),\ |y_e^{\sigma}(A_{\text{left}}) - y_i| \le d_y \,\},$$
where $x_e^{\sigma}$ and $y_e^{\sigma}$ are the predicted positions of the ego vehicle based on the action $A_{\text{left}}$, $\sigma$ is the prediction coefficient, $x_i$ and $y_i$ are the predicted positions of the $i$th vehicle, and $d_y$ is the safe distance between the ego vehicle and the $i$th vehicle in the lateral direction. In this situation, the action $A_{\text{left}}$ is replaced with the deceleration action $A_{\text{dec}}$ to delay the merging maneuver and avoid the predicted lateral conflict. The replaced action $a_t'$ is written as
$$a_t' = A_{\text{dec}}.$$

5.2.2. Situation 2

The action is labeled as an unexpected decision when the ego vehicle has already merged into the target lane, but the RL agent outputs the action turning right $A_{\text{right}}$, which is considered to be unnecessary (see the middle of Figure 8). The unsafe set $\Omega_2$ of this situation is defined as
$$\Omega_2 = \{\, a \in A \mid a = A_{\text{right}},\ I_{\text{merge}} = 1 \,\},$$
where $I_{\text{merge}} = 1$ indicates that the ego vehicle has successfully merged. Therefore, the idling action $A_{\text{idle}}$ is used to replace the unexpected action $A_{\text{right}}$. The replaced action $a_t'$ is written as
$$a_t' = A_{\text{idle}}.$$

5.2.3. Situation 3

In this situation, the ego vehicle and the neighboring vehicle in the target lane are at nearly the same longitudinal position. The ego vehicle would fail to merge before reaching the end of the merge zone if acceleration $A_{\text{acc}}$ or idling $A_{\text{idle}}$ were executed (see the bottom of Figure 8). The unsafe set $\Omega_3$ of this situation is defined as
$$\Omega_3 = \{\, a \in A \mid a = A_{\text{acc}} \text{ or } A_{\text{idle}},\ |x_e^{\sigma}(a) - x_{\text{obj}}| \le d_x \,\},$$
where $x_e^{\sigma}(a)$ is the predicted longitudinal position of the ego vehicle when executing the action $a$, i.e., $a = A_{\text{acc}}$ or $A_{\text{idle}}$, $x_{\text{obj}}$ is the longitudinal position of the adjacent vehicle that occupies the target lane, and $d_x$ is the distance threshold. The condition $|x_e^{\sigma}(a) - x_{\text{obj}}| \le d_x$ indicates that the ego vehicle and the adjacent vehicle are longitudinally too close for the ego vehicle to merge. Similarly, the action $A_{\text{acc}}$ or $A_{\text{idle}}$ is replaced by the deceleration action $A_{\text{dec}}$, which slows down the ego vehicle so that it can find another opportunity to merge, i.e., $a_t' = A_{\text{dec}}$.
Overall, the proposed ASM is designed to handle the dominant edge cases in the considered ramp-merging scenario, including predicted lateral collision during lane changing, unexpected right-turn decisions after successful merging, and failure-to-merge risk caused by target lane occupancy. These cases correspond to the main unsafe or infeasible high-level decisions observed in the current simulation environment. For the first and third cases, the deceleration action is used as a shielding fallback to delay the merging maneuver or enlarge the longitudinal gap, allowing the ego vehicle to seek another merging opportunity. It should be noted that this fallback action is implemented through the MPC execution layer and is subject to the acceleration bounds in the vehicle-control optimization. Therefore, it is not equivalent to an emergency braking maneuver in the present implementation.
Nevertheless, deceleration is not universally safe under all traffic configurations. For example, when the following vehicle is very close to the ego vehicle, has delayed response, or does not follow IDM-like longitudinal behavior, a deceleration fallback may increase rear-end collision risk. In the current simulation, this risk is partially mitigated by the bounded MPC execution and the IDM-based response of surrounding vehicles, but rear-side safety is not explicitly modeled as an independent shielding constraint. Therefore, the ASM should be interpreted as a scenario-specific safety augmentation module for reducing risky exploration in the considered ramp-merging setting, rather than a complete safety supervisor for all possible traffic configurations. A more general shielding design can further incorporate rear-side safety margins and select fallback actions from multiple candidate maneuvers.
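The three shielding rules described in this section can be summarized in a short rule-based function. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `Vehicle` container, the string action labels, the argument names, and all numerical values are hypothetical, and the predicted ego position is assumed to be supplied by the MPC prediction step for the action being checked.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    x: float  # current longitudinal position x_i(0)
    y: float  # lateral position y_i
    v: float  # longitudinal speed v_i


def shield_action(action, ego_pred_x, ego_pred_y, vehicles, x_obj,
                  merged, horizon, dt, d_x, d_y):
    """Return the (possibly replaced) action according to the three situations."""
    if action == "left":
        for veh in vehicles:
            x_end = veh.x + horizon * veh.v * dt  # x_i(N)
            # Situation 1: predicted ego position falls in the span swept by
            # vehicle i and the lateral gap is below the safe distance.
            if veh.x <= ego_pred_x <= x_end and abs(ego_pred_y - veh.y) <= d_y:
                return "decelerate"
    if action == "right" and merged:
        return "idle"  # Situation 2: unnecessary right turn after merging
    if action in ("accelerate", "idle") and abs(ego_pred_x - x_obj) <= d_x:
        return "decelerate"  # Situation 3: target lane occupied alongside
    return action


# Hypothetical scene: one vehicle in the target lane near the ego vehicle.
scene = dict(ego_pred_x=10.0, ego_pred_y=1.0,
             vehicles=[Vehicle(x=0.0, y=0.0, v=20.0)], x_obj=12.0,
             horizon=10, dt=0.1, d_x=5.0, d_y=2.0)
shielded = shield_action("left", merged=False, **scene)
```

Actions that match none of the three situations pass through unchanged, which is consistent with the ASM acting only as a filter on the raw high-level decision.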

6. Theoretical Analysis

This section presents a theoretical analysis of the ASM regarding the safety performance of the learned RL policy and the convergence performance of the proposed safe RL algorithm.

6.1. Safety Performance

Theorem 1. 
If the current action $a_t$ is unsafe, i.e., $a_t \in \Omega_{\text{unsafe}}$, and it is replaced with a safe one $a_t'$ by the ASM, then the state-action cost function $Q_c(s_t, a_t')$ is lower than the original state-action cost function $Q_c(s_t, a_t)$, which indicates the improved safety contributed by the ASM.
Proof. 
We consider the Bellman equation [36], which is denoted as
$$Q_c(s_t, a_t) = c(s_t, a_t) + \gamma V_c(s_{t+1}).$$
For simplicity, let $c_t = c(s_t, a_t)$ and $c_t' = c(s_t, a_t')$, and let $s_{t+1}'$ denote the state reached after executing $a_t'$. Then we have
$$Q_c(s_t, a_t) - Q_c(s_t, a_t') = c_t - c_t' + \gamma \left( V_c(s_{t+1}) - V_c(s_{t+1}') \right).$$
Consider the first two terms on the right-hand side. Because the original decision $a_t$ would lead to a collision in the following steps, we have $c_t > c_t'$. As for the difference in the state cost value, we assume that subsequent actions will also be replaced by safe decisions whenever the original ones are unsafe. Then, we have
$$V_c(s_{t+1}) - V_c(s_{t+1}') = \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^{t} \left( c(s_{t+1}, a_{t+1}) - c(s_{t+1}', a_{t+1}') \right) \right].$$
During the remaining steps of the episode, each unsafe original decision is replaced by a safe one, which results in a safe state. Therefore, we have
$$c(s_{t+1}, a_{t+1}) > c(s_{t+1}', a_{t+1}'), \quad \forall t \in [0, T].$$
As a consequence, we have
$$V_c(s_{t+1}) > V_c(s_{t+1}'), \quad \forall t \in [0, T],$$
$$Q_c(s_t, a_t) > Q_c(s_t, a_t'), \quad \forall t \in [0, T].$$
The performance of the policy with the ASM can therefore be improved, because the safe action reduces the state-action cost value at each step and guides the agent to choose the safe decision. Furthermore, another benefit of the proposed method is that the training process is not affected by the ASM.

6.2. Convergence Analysis

The optimization objective is to maximize the sum of the discounted reward while ensuring that the cost value satisfies the constraints. From Equation (10), the Lagrangian dual function [37] is formulated as
$$d(\lambda) = \min_{\pi} \; J_R^{\pi} + \lambda \left( J_C^{\pi} - \eta \right).$$
The dual ascent is to find $\max_{\lambda \ge 0} d(\lambda)$. During the learning process, the policy update is imperfect due to two factors. One factor is that the number of iterations is limited for computational efficiency, and the other is that the replaced decision can guarantee safety but not optimality. Hence, we assume the suboptimality of the solution $\pi^*$ is upper bounded as
$$J_R^{\pi^*} + \lambda \left( J_C^{\pi^*} - \eta \right) - d(\lambda) < \epsilon.$$
We denote the residual of $\lambda$ before and after the imperfect update in a single step of Algorithm 2 as
$$\hat{g}(\lambda, \alpha_\lambda) = \frac{1}{\alpha_\lambda} \left( \max\left( 0, \lambda + \alpha_\lambda \nabla_\lambda d(\lambda) \right) - \lambda \right).$$
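Lemma 1. 
Following the imperfect dual ascent with step size $\alpha_\lambda \le \mu$, we have
$$d(\lambda_{k+1}) \ge d(\lambda_k) + \frac{\alpha_\lambda}{2} \left\| \hat{g}(\lambda_k, \alpha_\lambda) \right\|^2 - \sqrt{\frac{2\epsilon}{\mu}} \left\| \hat{g}(\lambda_k, \alpha_\lambda) \right\|.$$
Proof. 
The result follows from standard projected dual ascent for $\mu$-smooth concave dual functions (cf. Theorem 2.2.7 in [38]). In our setting, the policy update at each iteration is only $\epsilon$-suboptimal, which introduces a deviation term of order $\sqrt{\epsilon/\mu}$; that is, the deviation can be bounded by a constant times $\sqrt{\epsilon/\mu}$. Intuitively, one step of the projected ascent guarantees an increase in $d(\lambda)$ proportional to $\|\hat{g}\|^2$, minus a correction term of size at most on the order of $\sqrt{\epsilon/\mu}$, which accounts for the inexact policy improvement. □
Taking $\lambda = \lambda_k$ and utilizing the fact that $\lambda_{k+1} - \lambda_k = \alpha_\lambda \hat{g}$ gives the result.
Theorem 2. 
There exists a constant $\chi > 0$ such that the imperfect update converges to a dual solution $\hat{\lambda}$ that satisfies
$$\min_{\lambda^* \in P^*} \left\| \lambda^* - \hat{\lambda} \right\| \le \chi \sqrt{\frac{\epsilon}{\mu}}.$$
Proof. 
Let $\phi(\lambda) = \min_{\lambda^* \in P^*} \| \lambda - \lambda^* \|$. Based on Theorem 4.1 of [39], there exists a constant $\psi$ such that
$$\phi(\lambda_k) + \left\| \pi^* - \pi \right\| \le \psi \left\| \lambda_{k+1} - \lambda_k \right\|.$$
From Lemma 1, when $\|\hat{g}\| > \frac{2}{\alpha_\lambda} \sqrt{\frac{2\epsilon}{\mu}}$, the Lagrangian dual function $d(\lambda)$ monotonically increases and the imperfect dual ascent would reach a $\hat{\lambda}$ satisfying $\|\hat{g}(\hat{\lambda}, \alpha_\lambda)\| < \frac{2}{\alpha_\lambda} \sqrt{\frac{2\epsilon}{\mu}}$. Then, it follows that $\psi \|\lambda_{k+1} - \lambda_k\| < \psi (2 + \alpha_\lambda) \sqrt{\frac{2\epsilon}{\mu}}$. Taking $\psi = \frac{\chi}{\sqrt{2}\,(2 + \alpha_\lambda)}$, we have $\phi(\hat{\lambda}) \le \chi \sqrt{\frac{\epsilon}{\mu}}$. □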
Theorem 2 shows that the proposed method will converge to a near-optimal solution, even under imperfect policy updates.
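To make the projected dual ascent analysed above more concrete, the sketch below applies the update $\lambda \leftarrow \max(0, \lambda + \alpha_\lambda \nabla_\lambda d(\lambda))$ and reports the residual $\hat{g}$. It uses the constraint violation $J_C^{\pi} - \eta$ as the ascent direction, and a made-up response `toy_expected_cost` in which the expected cost shrinks as the multiplier grows. That response, the step size, and the cost limit are purely illustrative assumptions.

```python
def dual_ascent_step(lam, grad, step):
    """One projected ascent step on the multiplier, plus its residual g_hat."""
    new_lam = max(0.0, lam + step * grad)
    residual = (new_lam - lam) / step
    return new_lam, residual


def toy_expected_cost(lam, base_cost=2.0):
    """Hypothetical policy response: a larger multiplier yields a lower cost."""
    return base_cost / (1.0 + lam)


def run_dual_ascent(cost_limit, step=0.5, iters=1000):
    lam = 0.0
    for _ in range(iters):
        # Ascent direction: constraint violation J_C - eta at the current policy.
        grad = toy_expected_cost(lam) - cost_limit
        lam, _ = dual_ascent_step(lam, grad, step)
    return lam


lam_hat = run_dual_ascent(cost_limit=0.5)
```

In this toy setting the multiplier rises while the cost exceeds the limit and settles where the expected cost equals the limit, at which point the residual vanishes. The projection keeps the multiplier at zero when the constraint is already satisfied.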

7. Experimental Setup

This section introduces the experimental setup, including the simulation environment, scenario settings, model formulation, and implementation details.

7.1. Scenario Settings

We built an on-ramp merging scenario based on the simulation platform highway-env [40]. As shown in Figure 1, the merge zone has a lane width of 5 m and a length of 70 m. The ego vehicle starts from the entrance of the ramp, 80 m from the merge zone. The longitudinal acceleration of the surrounding vehicles is determined by the intelligent driver model (IDM) [41]. The longitudinal distance between the front vehicle $i$ and the rear vehicle $i+1$ is denoted as $d_{i,i+1}$ and defined as $d_{i,i+1} = d_s + v_{i+1} / \rho$, where $d_s$ is the safety distance and $v_{i+1}$ is the speed of the rear vehicle. The initial speed of all vehicles ranges from 17 m/s to 27 m/s. The parameter $\rho \in [0.5, 1]$ represents the space density of the traffic flow. The traffic density is at the low level when $\rho \in [0.5, 0.7)$, the medium level when $\rho \in [0.7, 0.8]$, and the high level when $\rho \in (0.8, 1]$. A larger value of $\rho$ yields smaller headways, which increases the number of surrounding vehicles within the ego vehicle’s interaction range during the on-ramp merge and results in tighter available gaps. The surrounding vehicles are controlled by IDM-based longitudinal behavior models, which provide a controlled and reproducible traffic environment for evaluating the proposed method. This setting is used as a benchmark scenario and does not aim to fully reproduce all heterogeneous driving behaviors observed in real traffic.
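The effect of the density parameter on the initial spacing can be checked with a small calculation. The sketch below evaluates $d_{i,i+1} = d_s + v_{i+1}/\rho$; the safety distance of 10 m and the speed of 20 m/s are hypothetical values chosen only for illustration.

```python
def initial_gap(d_s, v_rear, rho):
    """Longitudinal gap d = d_s + v_rear / rho between two consecutive vehicles."""
    if not 0.5 <= rho <= 1.0:
        raise ValueError("rho is expected to lie in [0.5, 1]")
    return d_s + v_rear / rho


# Hypothetical safety distance (10 m) and rear-vehicle speed (20 m/s).
gap_sparse = initial_gap(d_s=10.0, v_rear=20.0, rho=0.5)
gap_dense = initial_gap(d_s=10.0, v_rear=20.0, rho=1.0)
```

With these numbers the gap shrinks from 50 m at the lowest density to 30 m at the highest, which illustrates why larger values of $\rho$ leave tighter merging gaps.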

7.2. Implementation Details

The parameter settings of the proposed approach, including MPC and safe RL, are summarized as follows, and the main hyperparameters with explicit symbols are further listed in Table 2. The MPC prediction horizon is set to 10 time steps. To ensure smooth merging, the reference steering angle is set to 0 rad with a maximum magnitude of $\pi/8$ rad, and the acceleration is constrained within [−0.5 g, 0.5 g]. The prediction coefficient $\sigma$ is set to 5, meaning that collision checking in the ASM is performed five steps ahead. For safe RL, the environment simulation frequency is 10 Hz, while the decision-making frequency is 2 Hz. Each episode terminates when the ego vehicle reaches the goal or a collision occurs, and the agent is trained for 0.5 million interaction steps. During training, the agent is coupled with MPC online: the agent outputs actions for motion prediction, which are overridden by the ASM if potential collisions are detected, and the resulting shielded actions are executed by the low-level MPC controller to obtain reward and cost signals. All neural networks in safe RL consist of two hidden layers with 256 units each and use ReLU activations. The optimizer is Adam, the target entropy ratio is set to 0.98, the policy update frequency is 1, and the target network update frequency is 10. Other symbolic hyperparameters, including the learning rates, replay buffer size, batch size, discount factor, and Lagrangian multiplier settings, are reported in Table 2. Moreover, we train the RL agent with five random seeds. All experiments are conducted on a 3.1 GHz Intel Core i5-11300H CPU with 16 GB RAM.
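For readers who wish to reproduce the setup, the settings stated above can be gathered in a single configuration object. The sketch below is one possible arrangement: the class and field names are hypothetical, the value of gravitational acceleration (9.81 m/s²) is an assumption, and the hyperparameters that are reported only in Table 2 are intentionally left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    mpc_horizon: int = 10                  # MPC prediction horizon (time steps)
    max_steering_rad: float = math.pi / 8  # steering magnitude bound
    max_accel_g: float = 0.5               # acceleration bound in units of g
    prediction_coefficient: int = 5        # sigma used by the ASM
    simulation_hz: int = 10                # environment simulation frequency
    decision_hz: int = 2                   # high-level decision frequency
    training_steps: int = 500_000          # interaction steps
    hidden_layers: tuple = (256, 256)      # two hidden layers, ReLU
    target_entropy_ratio: float = 0.98
    target_update_interval: int = 10
    num_seeds: int = 5

    def sim_steps_per_decision(self):
        """Number of simulation steps between two high-level decisions."""
        return self.simulation_hz // self.decision_hz

    def max_accel_mps2(self, g=9.81):
        """Acceleration bound in m/s^2, assuming g = 9.81 m/s^2."""
        return self.max_accel_g * g


cfg = ExperimentConfig()
```

With a 10 Hz simulation and 2 Hz decisions, each high-level action is held for five simulation steps, and the acceleration bound of 0.5 g corresponds to roughly 4.9 m/s² under the stated assumption.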

8. Results and Discussion

8.1. Convergence Performance

To demonstrate the advantages of RAPRL (ours), we compare the training curves of RAPRL with three baselines: Dueling DQN [42], SACD [33], and PPO [43]. Figure 9 presents the comparative results across three key metrics: Crash Ratio, Average Cost, and Average Reward, plotted against the total environmental interactions. The shaded regions represent the standard deviation across different random seeds.
As illustrated in the Average Reward curve (Figure 9 (right)), RAPRL outperforms all baselines in terms of the final converged value. While Dueling DQN exhibits the fastest initial learning speed, it converges to a sub-optimal policy with a lower final reward compared to RAPRL. In contrast, RAPRL maintains a steady growth rate and ultimately achieves the highest asymptotic performance. PPO shows the slowest convergence rate and the lowest final reward. In Figure 9 (left), both Dueling DQN and RAPRL demonstrate rapid reductions in collision ratios, quickly converging to a lower crash ratio. This indicates that both methods can effectively learn to avoid fatal states at an early stage. However, PPO requires significantly more interactions to reduce the crash risk to an acceptable level. Crucially, the Average Cost curve (Figure 9 (middle)) reveals a significant advantage of our method. Although Dueling DQN achieves a low crash ratio, its average cost exhibits a sharp increase after the initial drop and stabilizes at a high level. Conversely, RAPRL consistently reduces the average cost and converges to the lowest level among all methods.
These results demonstrate that RAPRL not only ensures collision avoidance but also effectively minimizes comprehensive penalties, achieving a superior balance between safety constraints and driving efficiency.

8.2. Performance Evaluation

8.2.1. Traffic Success

Traffic success is primarily evaluated by the success rate, which quantifies the probability of the ego vehicle successfully completing the merging task without terminal failures. As presented in Table 3, the proposed RAPRL demonstrates strong robustness and superior performance across varying traffic densities. Specifically, RAPRL achieves success rates of 99.0%, 99.5%, and 99.3% in high, medium, and low traffic densities, respectively. In the challenging high-density scenario, where the baseline Dueling DQN experiences a significant performance drop to 87.0%, RAPRL maintains a high success rate, an improvement of 12.0 percentage points. Similarly, compared to SACD, RAPRL improves the success rate by 4.5, 2.0, and 0.1 percentage points across the three density levels. Although PPO achieves a marginally higher success rate in the high-density scenario (99.5%), it suffers from performance degradation in the medium-density scenario (97.7%). RAPRL consistently maintains a success rate of at least 99% in all tested simulation scenarios, indicating its stability and adaptability within the considered IDM-based ramp-merging environments.

8.2.2. Traffic Safety

Traffic safety is assessed using collision ratio and average cost, which reflect the frequency of accidents and the agent’s adherence to safety constraints. As shown in Table 3, the unconstrained methods, Dueling DQN and SACD, incur significantly higher average costs, particularly in high-density scenarios (0.50 and 0.44, respectively), indicating aggressive behaviors that frequently violate safety boundaries. In contrast, RAPRL dramatically reduces the average cost to 0.02 across all densities. Regarding collision ratio, while PPO demonstrates strong performance in high and low densities, it exhibits a spike in collision ratio to 0.018 under medium density, suggesting instability in complex interaction scenarios. RAPRL, however, consistently suppresses the collision ratio to a negligible level (0.003 to 0.005) across all test cases. This validates that by explicitly optimizing the constrained objective, RAPRL effectively minimizes safety risks.

8.2.3. Traffic Efficiency

Traffic efficiency is evaluated by the average time required to complete the merging maneuver. As illustrated in Table 3, Dueling DQN and SACD achieve the shortest average times (e.g., 11.78 s and 11.59 s in high density) but at the cost of high accident risks and safety violations, as discussed in the safety analysis. Conversely, PPO exhibits the longest average time (12.36 s in high density), indicating that it adopts an overly conservative policy to ensure safety, thereby sacrificing efficiency. RAPRL achieves a well-balanced trade-off between these conflicting objectives. It records an average time of 11.87 s in high density, which is faster than PPO, yet it maintains the lowest safety cost. These results demonstrate that RAPRL enables the agent to complete efficient merging maneuvers without compromising safety standards or adopting overly cautious behaviors.

8.3. Ablation Study

This section conducts an ablation study to evaluate the impact of RAPRL’s essential components: the Action Shielding Mechanism (ASM), safety constraints (SC), and personal preference (PP). We benchmark the full RAPRL method against three ablated variants: (a) RAPRL w/o PP, (b) RAPRL w/o PP and ASM, and (c) RAPRL w/o PP, ASM, and SC. The quantitative performance metrics are summarized in Table 4, while the training curves for average reward and average cost are illustrated in Figure 10.
It is worth clarifying that SC and ASM play different roles in RAPRL. The CMDP-based safety constraint is a learning-level mechanism that penalizes the expected cumulative cost through the Lagrangian formulation and guides the policy toward lower-risk behavior during optimization, whereas ASM is an execution-level safety filter that evaluates the raw high-level action using MPC-based prediction and replaces it when the action is identified as unsafe or infeasible. During training, the shielded action a t is executed by the low-level MPC controller, and the resulting transition, reward, and cost are stored for policy updates. Thus, ASM also influences policy learning by reducing unsafe exploration and exposing the agent to corrected lower-risk behaviors. The two mechanisms are therefore complementary: SC encourages the policy itself to satisfy the expected cost constraint, while the ASM handles residual unsafe actions at the action execution level.

8.3.1. Safety Constraints

To verify the fundamental role of safety constraints in risk mitigation, we compare the baseline method (w/o PP, ASM, and SC) with the variant incorporating only SC (RAPRL w/o PP and ASM). As shown in Table 4, the introduction of SC significantly reduces the collision ratio across all traffic densities. Notably, in the high-density scenario, the collision ratio drops from 0.010 to 0.003, and the success rate improves from 94.5% to 97.2%. This indicates that SC encourages the policy to avoid high-cost behaviors by penalizing safety violations during policy optimization, rather than by directly filtering actions at execution time. However, the limitation of using SC alone is also evident: the average time increases from 11.59 s to 12.08 s, suggesting that the agent adopts a conservative policy to satisfy hard constraints, thereby sacrificing driving efficiency. Furthermore, as observed in the training curves (Figure 10 (middle)), although the method with SC (orange line) reduces crash occurrences, its average cost exhibits significant oscillations and remains at a relatively high level compared to the full method. This implies that while hard constraints prevent collisions, they struggle to guide the agent smoothly away from high-risk regions, leading to frequent boundary-hovering behaviors. Therefore, the improvement in the SC-only variant shows that the CMDP policy itself can learn safer behavior under the Lagrangian cost constraint.

8.3.2. ASM

The Action Shielding Mechanism (ASM) is designed to address the limitations of static constraints by actively correcting the agent’s behavior near safety boundaries. By comparing the variant “RAPRL w/o PP” (which includes SC and ASM) with the version containing only SC, we observe a decisive improvement in safety compliance. The most striking impact is reflected in the average cost. As detailed in Table 4, adding ASM drastically reduces the average cost in high-density scenarios from 0.23 to 0.02, an order-of-magnitude improvement. This result is visually corroborated by the average cost curve in Figure 10, where the green curve, corresponding to the SC and ASM variant without PP, converges rapidly to a near-zero cost level, significantly outperforming the oscillating orange curve (w/o ASM). Additionally, the success rate further increases to 98.3%. These results demonstrate that ASM further suppresses residual unsafe actions that may still be produced by the learned policy, thereby improving constraint satisfaction and stabilizing the average cost during training.

8.3.3. Personal Preference

Finally, we evaluate the contribution of the Personal Preference (PP) module in balancing safety with efficiency and accelerating learning. Comparing the complete RAPRL method with the “RAPRL w/o PP” variant, Table 4 reveals that PP plays a crucial role in recovering the efficiency lost due to safety restrictions. In high-density traffic, the average time decreases from 12.33 s to 11.87 s, and the success rate peaks at 99.0%. This indicates that preference-aware cost-limit adaptation helps the agent recover part of the efficiency lost under strict safety regulation and guides the policy toward a better safety–efficiency trade-off. Moreover, the average reward curve in Figure 10 (right) shows that the full method (red line) exhibits the fastest convergence speed and achieves the highest asymptotic reward. This confirms that incorporating human-like driving preferences not only optimizes the trade-off between safety and efficiency but also allows the agent to learn optimal policies more rapidly and stably.

8.3.4. Visual Result

We visualize the behaviors of agents trained under three settings for the on-ramp merging task: RAPRL without PP and ASM, RAPRL without PP, and the full RAPRL, as shown in Figure 11. The ego vehicle trained without PP and ASM can select unsafe or suboptimal actions, e.g., turning right despite having merged into the main lane (see Figure 11a). Incorporating ASM prevents such unsafe actions in both RAPRL w/o PP and full RAPRL (see Figure 11b,c). Furthermore, the agent trained by full RAPRL reaches the farthest position from the end of the merge zone (green car in Figure 11) at episode termination, demonstrating that integrating personalized preferences enhances traffic efficiency.
Overall, the ablation results indicate that the CMDP-based safety constraint and ASM contribute to safety in different but complementary ways. The SC-only variant improves the collision-related metrics compared with the unconstrained baseline, showing that the policy can lead to safer behavior under the Lagrangian cost constraint. However, SC alone still exhibits higher average cost and larger oscillations during training, suggesting that the learned policy may still approach unsafe boundaries. After adding ASM, the average cost is significantly reduced and the learning curve becomes more stable, indicating that the shielding module suppresses residual unsafe actions before execution. Therefore, the reported safety performance of RAPRL should be interpreted as the combined result of learned constraint-aware policy optimization and execution-level action shielding, rather than as the effect of either mechanism alone. Accordingly, the learned policy should be understood as a constraint-aware policy trained with execution-level shielding support, rather than a policy whose safety is solely guaranteed by post-processing.
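The execution-level shielding idea discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the replacement rule (take the most probable action that passes the safety check), the function name `shield_action`, the `is_safe` callback, and the action indices are all hypothetical. In RAPRL, the safety check is described as pre-executing the action with MPC and checking collision constraints; here it is abstracted into a callback.

```python
from typing import Callable, Sequence


def shield_action(
    action_probs: Sequence[float],
    is_safe: Callable[[int], bool],
    fallback_action: int,
) -> int:
    """Return the most probable action that passes the safety check."""
    # Rank discrete actions from most to least preferred by the policy.
    ranked = sorted(
        range(len(action_probs)), key=lambda a: action_probs[a], reverse=True
    )
    for action in ranked:
        if is_safe(action):
            return action
    # No candidate passed the check: use a designated default action.
    return fallback_action


# Hypothetical action set: 0 = keep lane, 1 = merge left, 2 = turn right.
probs = [0.2, 0.3, 0.5]
unsafe = {2}  # assume the check flags "turn right" as unsafe
chosen = shield_action(probs, lambda a: a not in unsafe, fallback_action=0)
print(chosen)  # the policy prefers action 2, but the shield returns action 1
```

Because the shield only intervenes when the preferred action fails the check, a policy that already respects the constraint is executed unchanged, which matches the view that shielding complements rather than replaces constraint-aware learning.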

8.4. Sensitivity Analysis

To evaluate the robustness of our framework, we conduct a sensitivity analysis on two critical design parameters that govern the system’s safety characteristics: the prediction coefficient σ in the ASM and the cost limit η (reflecting risk tolerance) in the CMDP.

8.4.1. Prediction Coefficient

We conduct motion prediction for constraint checking in the ASM of RAPRL and change the prediction coefficient σ to test its effect on the average cost and average reward. The prediction coefficient σ denotes the look-ahead step used by the ASM to select the predicted ego vehicle state from the MPC prediction horizon. Given the MPC horizon N, the predicted state sequence can be written as $\{x_e^1, x_e^2, \ldots, x_e^N\}$, where $x_e^{\sigma}$ denotes the predicted ego vehicle state at the σ-th future step. Therefore, the corresponding look-ahead time is
$$t_\sigma = \sigma \Delta t, \quad \sigma \in \{1, \ldots, N\},$$
where Δt is the simulation time interval. In this study, the MPC prediction horizon is N = 10, and the simulation frequency is 10 Hz, resulting in Δt = 0.1 s. Thus, σ = 1, σ = 5, and σ = 10 correspond to approximately 0.1 s, 0.5 s, and 1.0 s look-ahead collision checking, respectively.
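The mapping from the prediction coefficient to a look-ahead time and a predicted state can be sketched as below. This is an illustration only: the function names, the array layout (one row per future step), and the constant-speed prediction are assumptions, while N = 10 and Δt = 0.1 s follow the values stated above.

```python
import numpy as np


def lookahead_time(sigma: int, dt: float = 0.1, horizon: int = 10) -> float:
    """Look-ahead time t_sigma = sigma * dt for a step sigma in {1, ..., N}."""
    if not 1 <= sigma <= horizon:
        raise ValueError(f"sigma must lie in 1..{horizon}, got {sigma}")
    return sigma * dt


def select_predicted_state(pred_states: np.ndarray, sigma: int) -> np.ndarray:
    """Pick the sigma-th predicted state; row k-1 holds the state at step k."""
    return pred_states[sigma - 1]


# Hypothetical prediction: ego moves along x at 20 m/s; columns are (x, y).
N, dt = 10, 0.1
pred_states = np.column_stack([20.0 * dt * np.arange(1, N + 1), np.zeros(N)])

for sigma in (1, 5, 10):
    t = lookahead_time(sigma, dt, N)
    x, y = select_predicted_state(pred_states, sigma)
    print(f"sigma={sigma:2d}  t={t:.1f} s  predicted x={x:.1f} m")
```

The selected state would then be passed to the collision constraint check; that check is omitted here because its exact form is defined elsewhere in the paper.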
The value of σ affects the trade-off between responsiveness and prediction reliability. A small σ performs near-term collision checking and is more responsive to current states, but it may be too short-sighted to identify future merging conflicts. A large σ provides earlier warning of potential conflicts, but the predicted state may contain larger uncertainty and may lead to overly conservative or unstable shielding decisions. Therefore, σ is treated as a tunable hyperparameter of the ASM and is selected through sensitivity analysis.
As shown in Figure 12, we conducted five runs for each σ, initialized with different random seeds. We observe that a larger σ leads to faster learning but results in higher variability across the random seeds, indicating increased instability. On the other hand, a smaller σ can slow down the training process. However, with an appropriate prediction coefficient of σ = 5, the model achieves the best performance regarding the average cost (the lowest value) and the average reward (the highest value). This result indicates that a mid-horizon prediction step provides a better balance between early conflict detection and reliable state prediction. Accordingly, σ = 5 is used as the default prediction coefficient in the experiments.

8.4.2. Risk Tolerance

In our CMDP framework, the cost limit η serves as a quantifiable proxy for the agent’s risk tolerance. Lower values of η enforce conservative driving strategies with strict adherence to safety constraints, whereas higher values permit more aggressive policies. To investigate the impact of this preference indicator, we evaluated η at 0.1, 0.05, 0.01, and 0.001, as illustrated in Figure 13. These results demonstrate that the policy is highly sensitive to this threshold. Larger η values relax the constraints, resulting in instability and increased safety violations, while overly small η values restrict exploration and slow down convergence. Among the tested values, we empirically identify the best balance at η = 0.01, where the agent maximizes safety compliance without compromising driving performance. However, relying on a fixed cost limit reveals a key limitation: a static η cannot adapt to varying traffic densities or heterogeneous user preferences (e.g., aggressive driving in sparse traffic versus conservative driving in dense traffic). This limitation motivates our proposed approach, which dynamically adjusts η using fuzzy logic to align the agent’s risk sensitivity with evolving environmental and user-specific requirements.
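To make the role of η concrete, the sketch below shows a generic projected dual-ascent update of a Lagrange multiplier, as commonly used in Lagrangian-based constrained RL. It is an assumption that RAPRL uses this exact form; the function name, the step size, and the cost sequence are illustrative, while the initial multiplier of 1.0 follows Table 2.

```python
def update_multiplier(
    lam: float, avg_cost: float, cost_limit: float, lr: float
) -> float:
    """One projected dual-ascent step: lam grows when cost exceeds the limit."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))


lam = 1.0    # initial Lagrangian multiplier (Table 2)
eta = 0.01   # cost limit
lr = 0.05    # illustrative step size, chosen large so the trend is visible

# Hypothetical sequence of measured average costs during training.
for avg_cost in (0.23, 0.23, 0.02, 0.0):
    lam = update_multiplier(lam, avg_cost, eta, lr)
    print(f"avg_cost={avg_cost:.2f}  lambda={lam:.4f}")
```

Under this update, a smaller η keeps the multiplier growing for longer, which penalizes cost more heavily and is consistent with the slower, more conservative learning observed for very small η.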
This sensitivity analysis also helps interpret the influence of fuzzy-rule settings on the learned policy. In the proposed framework, the fuzzy membership functions and rule table affect policy learning mainly through the generated CMDP cost limit η. Therefore, different fuzzy-rule settings would lead to different safety–efficiency behaviors by changing the resulting value of η. As shown in Figure 13, an overly large η relaxes the safety constraint and may increase average cost, whereas an overly small η imposes a stricter constraint and may slow down policy learning. These results indicate that the learned policy is sensitive to the cost-limit level produced by the fuzzy module. Accordingly, the adopted fuzzy rules are designed to generate moderate and adaptive cost limits under different risk-preference and traffic-density conditions, avoiding both excessively permissive and excessively conservative behaviors.
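How a rule table turns a risk preference and a traffic density into a crisp cost limit can be sketched with a small Mamdani-style inference routine. The rule base below follows Table 1 and the variable ranges follow Figure 4, but the triangular membership shapes and their breakpoints are assumptions made for illustration, so the numbers produced here will not match the paper's values (e.g., those in Figure 5).

```python
import numpy as np


def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (shoulder if a == b or b == c)."""
    x = np.asarray(x, dtype=float)
    left = np.where(b > a, (x - a) / max(b - a, 1e-12), 1.0)
    right = np.where(c > b, (c - x) / max(c - b, 1e-12), 1.0)
    return np.clip(np.minimum(left, right), 0.0, 1.0)


# Assumed membership breakpoints (shapes are NOT taken from the paper).
PREF = {"Conservative": (0, 0, 50), "Neutral": (0, 50, 100), "Aggressive": (50, 100, 100)}
DENS = {"Low": (0.5, 0.5, 0.75), "Medium": (0.5, 0.75, 1.0), "High": (0.75, 1.0, 1.0)}
COST = {"Small": (0.0, 0.0, 0.05), "Medium": (0.0, 0.05, 0.1), "Large": (0.05, 0.1, 0.1)}

# Rule base from Table 1: (risk preference, traffic density) -> cost limit.
RULES = {
    ("Conservative", "High"): "Small", ("Conservative", "Medium"): "Small",
    ("Conservative", "Low"): "Medium", ("Neutral", "High"): "Small",
    ("Neutral", "Medium"): "Medium", ("Neutral", "Low"): "Large",
    ("Aggressive", "High"): "Medium", ("Aggressive", "Medium"): "Large",
    ("Aggressive", "Low"): "Large",
}


def cost_limit(preference: float, density: float) -> float:
    """Mamdani inference: min for AND, max aggregation, centroid defuzzification."""
    eta = np.linspace(0.0, 0.1, 1001)
    aggregated = np.zeros_like(eta)
    for (p, d), out in RULES.items():
        strength = min(float(tri(preference, *PREF[p])), float(tri(density, *DENS[d])))
        clipped = np.minimum(strength, tri(eta, *COST[out]))
        aggregated = np.maximum(aggregated, clipped)
    return float(np.sum(eta * aggregated) / np.sum(aggregated))


eta_cautious = cost_limit(preference=0.0, density=1.0)   # conservative, dense traffic
eta_bold = cost_limit(preference=100.0, density=0.5)     # aggressive, sparse traffic
print(f"cautious: {eta_cautious:.4f}, bold: {eta_bold:.4f}")
```

With these assumed shapes, a conservative user in dense traffic receives a smaller cost limit than an aggressive user in sparse traffic, which is the qualitative behavior the rule table encodes.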

9. Discussion

Although the proposed method achieves promising results in the evaluated ramp-merging scenarios, its current validation remains subject to several practical limitations. First, the experiments were conducted in a controlled single-ramp environment with IDM-based surrounding vehicles, which provides a reproducible benchmark but cannot fully represent multi-lane interactions, heterogeneous driver styles, perception uncertainty, and non-IDM behaviors in real traffic. Second, the user risk preference is specified before policy learning and remains fixed during each experiment. This setting enables a controlled analysis of how different preference levels affect the CMDP cost limit and the learned merging behavior, but it does not capture preference variations over time or across traffic contexts. Third, the MPC prediction and execution modules are evaluated by simulation, where the finite-horizon QP problem is solved at each decision step. Therefore, the current results demonstrate algorithmic effectiveness in simulations rather than embedded real-time deployment capability.
Future work will address these limitations by evaluating the framework in more realistic traffic simulators, naturalistic driving datasets, and heterogeneous human driving behavior models. We will also investigate online preference inference and adaptation based on user feedback, historical driving behavior, and interaction histories. For deployment feasibility, hardware-in-the-loop validation and embedded-oriented MPC implementations, such as warm-started QP solvers, sparse optimization, code generation, and reduced-horizon MPC, will be further explored.

10. Conclusions

This study proposes a risk-aware safe RL approach with personal preferences for decision-making in the on-ramp merging task for autonomous driving, in which RL is used for high-level decisions, followed by a low-level MPC. We formulate the high-level decision problem as a CMDP that represents safety in cost terms. Furthermore, we use a fuzzy logic method to compute the cost limit based on human risk preferences and traffic density, and a Lagrangian-based SAC algorithm is used to solve the CMDP. An Action Shielding Mechanism is designed to remove unsafe or invalid RL actions, and we theoretically prove its effectiveness in enhancing safety and sample efficiency. Numerical simulations and theoretical analysis demonstrate the superiority of our method regarding success rate, collision ratio, and average cost. Future work will further improve the practicality and robustness of the proposed framework under more realistic traffic conditions and deployment settings.

Author Contributions

Conceptualization, Y.L., S.Y. and Y.B.; methodology, S.Y. and J.T.; software, S.Y. and J.T.; validation, M.H., J.T. and Y.L.; investigation, H.Q. and M.H.; formal analysis, J.T. and W.H.; writing—original draft preparation, S.Y., Y.L. and H.Q.; writing—review and editing, J.T., W.H. and Y.B.; funding acquisition, B.L.; supervision, Y.L. and B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, H.; Wang, W.; Yuan, S.; Li, X. Uncovering Interpretable Internal States of Merging Tasks at Highway on-Ramps for Autonomous Driving Decision-Making. IEEE Trans. Autom. Sci. Eng. 2022, 19, 2825–2836.
  2. Liang, J.; Tan, C.; Yan, L.; Zhou, J.; Yin, G.; Yang, K. Interaction-Aware Trajectory Prediction for Safe Motion Planning in Autonomous Driving: A Transformer-Transfer Learning Approach. IEEE Trans. Intell. Transp. Syst. 2025, 26, 17080–17095.
  3. Wang, H.; Gao, H.; Yuan, S.; Zhao, H.; Wang, K.; Wang, X.; Li, K.; Li, D. Interpretable Decision-Making for Autonomous Vehicles at Highway On-Ramps With Latent Space Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 70, 8707–8719.
  4. Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419.
  5. Lyu, Y.; Luo, W.; Dolan, J.M. Probabilistic safety-assured adaptive merging control for autonomous vehicles. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 10764–10770.
  6. Lubars, J.; Gupta, H.; Chinchali, S.; Li, L.; Raja, A.; Srikant, R.; Wu, X. Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 942–947.
  7. Wang, K.; Mu, C.; Ni, Z.; Liu, D. Safe Reinforcement Learning and Adaptive Optimal Control With Applications to Obstacle Avoidance Problem. IEEE Trans. Autom. Sci. Eng. 2024, 21, 4599–4612.
  8. Yan, Z.; Kreidieh, A.R.; Vinitsky, E.; Bayen, A.M.; Wu, C. Unified automatic control of vehicular systems with reinforcement learning. IEEE Trans. Autom. Sci. Eng. 2022, 20, 789–804.
  9. Chen, X.; Xu, B.; Hu, M.; Bian, Y.; Li, Y.; Xu, X. Safe Efficient Policy Optimization Algorithm for Unsignalized Intersection Navigation. IEEE CAA J. Autom. Sin. 2024, 11, 2011–2026.
  10. Gao, Z.; Hao, H.; Gao, F.; Zhao, R. Constrained Reinforcement Learning-Enabled Policies With Augmented Lagrangian for Cooperative Intersection Management. IEEE Internet Things J. 2024, 12, 5396–5411.
  11. Wang, Y.; Zhan, S.S.; Jiao, R.; Wang, Z.; Jin, W.; Yang, Z.; Wang, Z.; Huang, C.; Zhu, Q. Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments. In Proceedings of the 40th International Conference on Machine Learning; Journal of Machine Learning Research Inc.: New York, NY, USA, 2023; Volume 202, pp. 36593–36604.
  12. Carr, S.; Jansen, N.; Junges, S.; Topcu, U. Safe reinforcement learning via shielding under partial observability. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2023; Volume 37, pp. 14748–14756.
  13. Chen, D.; Hajidavalloo, M.R.; Li, Z.; Chen, K.; Wang, Y.; Jiang, L.; Wang, Y. Deep Multi-Agent Reinforcement Learning for Highway On-Ramp Merging in Mixed Traffic. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11623–11638.
  14. Isele, D.; Nakhaei, A.; Fujimura, K. Safe Reinforcement Learning on Autonomous Vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–6.
  15. Peng, J.; Yu, S.; Ge, Y.; Li, S.; Fan, Y.; Zhou, J.; He, H. Personalized Decision-Making Framework for Collaborative Lane Change and Speed Control Based on Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2025, 26, 13629–13644.
  16. Teng, J.; Li, Y.; Yang, Z.; Yang, Z.; Shao, X.; Qin, H. User Preference-Aware and Efficient Trajectory Planning for Autonomous Parking with Hybrid A* and Nonlinear Optimization. In Proceedings of the IEEE International Intelligent Transportation Systems Conference (ITSC), Edmonton, AB, Canada, 24–27 September 2024; pp. 1090–1097.
  17. Chen, C.; Lan, Z.; Zhan, G.; Lyu, Y.; Nie, B.; Li, S.E. Quantifying the Individual Differences of Drivers’ Risk Perception via Potential Damage Risk Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 8093–8104.
  18. Nyberg, T.; Pek, C.; Dal Col, L.; Norén, C.; Tumova, J. Risk-aware Motion Planning for Autonomous Vehicles with Safety Specifications. In Proceedings of the 32nd IEEE Intelligent Vehicles Symposium, Nagoya, Japan, 11–17 July 2021; pp. 1016–1023.
  19. Geisslinger, M.; Trauth, R.; Kaljavesi, G.; Lienkamp, M. Maximum Acceptable Risk as Criterion for Decision-Making in Autonomous Vehicle Trajectory Planning. IEEE Open J. Intell. Transp. Syst. 2023, 4, 570–579.
  20. Yang, K.; Li, B.; Shao, W.; Tang, X.; Liu, X.; Wang, H. Prediction Failure Risk-Aware Decision-Making for Autonomous Vehicles on Signalized Intersections. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12806–12820.
  21. Morinelly, J.E.; Ydstie, B.E. Dual mpc with reinforcement learning. IFAC-PapersOnLine 2016, 49, 266–271.
  22. Zanon, M.; Gros, S.; Bemporad, A. Practical reinforcement learning of stabilizing economic MPC. In Proceedings of the 18th European Control Conference (ECC), Naples, Italy, 25–28 June 2019; pp. 2258–2263.
  23. Bellegarda, G.; Byl, K. An Online Training Method for Augmenting MPC with Deep Reinforcement Learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5453–5459.
  24. Karnchanachari, N.; Valls, M.I.; Hoeller, D.; Hutter, M. Practical reinforcement learning for mpc: Learning from sparse objectives in under an hour on a real robot. In Proceedings of the Learning for Dynamics and Control, UC Berkeley, CA, USA, 10–11 June 2020; pp. 211–224.
  25. Williams, G.; Wagener, N.; Goldfain, B.; Drews, P.; Rehg, J.M.; Boots, B.; Theodorou, E.A. Information theoretic mpc for model-based reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1714–1721.
  26. Gros, S.; Zanon, M. Reinforcement learning based on mpc and the stochastic policy gradient method. In Proceedings of the American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021; pp. 1947–1952.
  27. Li, Y.; Li, J.; Huang, W.; Yang, Q.; Qin, H.; Jiang, X.; Bian, Y.; Hu, M.; Hu, Y. Risk-Constrained On-Ramp Merging via Safety-Augmented Reinforcement Learning and Model Predictive Control. IEEE Internet Things J. 2026; early access.
  28. Chen, J.; Shen, J.; Chen, W.; Li, J.; Zhang, S. Application of Robust Fuzzy Cooperative Strategy in Global Consensus of Stochastic Multi-Agent Systems. IEEE Trans. Autom. Sci. Eng. 2025, 22, 12058–12070.
  29. Mamdani, E.; Assilian, S. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Man Mach. Stud. 1975, 7, 1–13.
  30. Marina Martinez, C.; Heucke, M.; Wang, F.Y.; Gao, B.; Cao, D. Driving Style Recognition for Intelligent Vehicle Control and Advanced Driver Assistance: A Survey. IEEE Trans. Intell. Transp. Syst. 2018, 19, 666–676.
  31. Peng, J.; Zhang, S.; Zhou, Y.; Li, Z. An Integrated Model for Autonomous Speed and Lane Change Decision-Making Based on Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21848–21860.
  32. Werling, M.; Ziegler, J.; Kammel, S.; Thrun, S. Optimal trajectory generation for dynamic street scenarios in a frenet frame. In Proceedings of the IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 987–993.
  33. Christodoulou, P. Soft Actor-Critic for Discrete Action Settings. arXiv 2019, arXiv:1910.07207.
  34. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 1995, 38, 58–68.
  35. Tang, X.; Huang, B.; Liu, T.; Lin, X. Highway Decision-Making and Motion Planning for Autonomous Driving via Soft Actor-Critic. IEEE Trans. Veh. Technol. 2022, 71, 4706–4717.
  36. Bellman, R. Dynamic programming and stochastic control processes. Inf. Control 1958, 1, 228–239.
  37. Bertsekas, D.; Nedic, A.; Ozdaglar, A. Convex Analysis and Optimization; Athena Scientific: Nashua, NH, USA, 2003; Volume 1.
  38. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
  39. Luo, Z.Q.T.; Tseng, P. On the Convergence Rate of Dual Ascent Methods for Linearly Constrained Convex Minimization. Math. Oper. Res. 1993, 18, 846–867.
  40. Leurent, E. An Environment for Autonomous Driving Decision-Making, 2018. Available online: https://github.com/eleurent/highway-env (accessed on 20 May 2026).
  41. Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805–1824.
  42. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1995–2003.
  43. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
Figure 1. Illustration of the highway on-ramp merging scenario. The ego vehicle (red) starts on a one-lane entrance ramp and needs to merge into the highway traffic safely and efficiently.
Figure 2. Overview of the proposed method. The high-level decision problem is formulated as a CMDP that incorporates individuals’ risk preferences into the constraints, followed by an MPC-based low-level control. A Lagrangian-based SAC algorithm is used to solve the CMDP for the optimal RL policy. We design an Action Shielding Mechanism to mask out risky actions by pre-executing each action with MPC and conducting collision constraint checks. Then, the safe RL action is sent to the low-level MPC, which generates vehicle control for the simulation environment.
Figure 3. Cost design based on motion predictions of the ego vehicle and surrounding objects. There are three typical situations: (a) failing to merge before reaching the end of the road, (b) colliding with other vehicles, and (c) having difficulty merging when the target lane is occupied by other vehicles.
Figure 4. The membership functions of the fuzzy inputs, i.e., risk preference and traffic density, and the fuzzy output cost limit: (a) risk preference, varying between 0 and 100%; (b) traffic density, varying between 0.5 and 1; and (c) cost limit, varying between 0 and 0.1.
Figure 5. Illustration of the aggregation and defuzzification of the Mamdani inference process. The fuzzy sets of the cost limit include large, medium, and small, represented by black, red, and blue dashed lines. Given a traffic density of 0.57 and a risk preference of 45%, with membership values $\tilde{A} = \{\text{Low}, \text{Medium}\}$ and $\tilde{B} = \{\text{Conservative}, \text{Neutral}\}$, the fuzzy outputs for the small, medium, and large cost limits are 0.25, 0.35, and 0.65, respectively. These outputs are aggregated into a single fuzzy set (the union of the three grey areas), and the centroid is taken to obtain the crisp cost limit of 0.0595.
Figure 6. Kinematic bicycle model.
Figure 7. The structure of the neural networks.
Figure 8. Action Shielding Mechanism. Three typical situations are defined: collision, unexpected decision, and failing to merge. The unsafe/unexpected action $a_t$ (left) is substituted with a safe one (right).
Figure 9. Comparative study. We compare RAPRL (ours) with Dueling DQN [42], SACD [33], and PPO [43]. The solid lines represent the mean values, while the shaded regions indicate the standard deviations. The results demonstrate that RAPRL achieves superior convergence performance, attaining the highest episodic reward and lowest average cost while maintaining robust safety guarantees compared to all baselines.
Figure 10. Ablation study. We compare the full RAPRL method against its ablated variants. The results show that SC helps the policy learn safer behavior, ASM further reduces residual unsafe actions and stabilizes the average cost, while PP improves the safety–efficiency trade-off and accelerates reward convergence. Consequently, the complete RAPRL method achieves better overall performance in terms of safety and efficiency.
Figure 11. Visualization of the trained agent, including (a) RAPRL w/o PP and ASM, (b) RAPRL w/o PP, and (c) RAPRL. The full RAPRL reaches the farthest position at t = 10 s, indicating improved traffic efficiency.
Figure 12. Sensitivity of RAPRL to the prediction coefficient, i.e., σ = 1, 5, 10. (a) Average cost. (b) Average reward.
Figure 13. Sensitivity of RAPRL to the hyperparameter cost limit η, where η = 0.1, 0.05, 0.01, 0.001. (a) Large η can lead to instabilities and substantial increases in the average cost, while small η can make training slower. (b) The average reward changes slightly when η varies.
Table 1. Fuzzy relations between the cost limit, traffic density, and risk preference. Each cell gives the cost limit.

| Risk Preference \ Traffic Density | High | Medium | Low |
| --- | --- | --- | --- |
| Conservative | Small | Small | Medium |
| Neutral | Small | Medium | Large |
| Aggressive | Medium | Large | Large |
Table 2. Hyperparameters of the safe RL algorithm.

| Symbol | Description | Value |
| --- | --- | --- |
| $\gamma$ | Discount factor | 0.99 |
| $\alpha_\theta$ | Policy network learning rate | $1 \times 10^{-4}$ |
| $\alpha_\omega$ | Critic network learning rate | $1 \times 10^{-4}$ |
| $\alpha_{\omega_c}$ | Cost network learning rate | $1 \times 10^{-4}$ |
| $\alpha_\xi$ | Temperature parameter learning rate | $1 \times 10^{-4}$ |
| $\lambda_0$ | Initial Lagrangian multiplier | 1.0 |
| $\alpha_\lambda$ | Lagrangian multiplier learning rate | $1 \times 10^{-4}$ |
| $D$ | Replay buffer size | $1 \times 10^{5}$ |
| $B$ | Batch size | 256 |
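Table 3. Comparison between the constraint-free RL methods and ours. High, Medium, and Low denote the traffic density level.

| Method | Success Rate (%) ↑ (High) | Success Rate (%) ↑ (Medium) | Success Rate (%) ↑ (Low) | Collision Ratio ↓ (High) | Collision Ratio ↓ (Medium) | Collision Ratio ↓ (Low) | Average Cost ↓ (High) | Average Cost ↓ (Medium) | Average Cost ↓ (Low) | Average Time (s) ↓ (High) | Average Time (s) ↓ (Medium) | Average Time (s) ↓ (Low) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dueling DQN [42] | 87.0 | 94.3 | 99.0 | 0.013 | 0.005 | 0.005 | 0.50 | 0.28 | 0.08 | 11.78 | 11.47 | 10.85 |
| SACD [33] | 94.5 | 97.5 | 99.2 | 0.010 | 0.008 | 0.005 | 0.44 | 0.25 | 0.10 | 11.59 | 11.33 | 10.82 |
| PPO [43] | 99.5 | 97.7 | 99.2 | 0.003 | 0.018 | 0.008 | 0.01 | 0.03 | 0.01 | 12.36 | 11.62 | 10.95 |
| RAPRL (ours) | 99.0 | 99.5 | 99.3 | 0.003 | 0.005 | 0.005 | 0.02 | 0.02 | 0.02 | 11.87 | 11.46 | 10.96 |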
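Table 4. Ablation study results. High, Medium, and Low denote the traffic density level; ✓ marks an enabled module.

| PP | ASM | SC | Success Rate (%) ↑ (High) | Success Rate (%) ↑ (Medium) | Success Rate (%) ↑ (Low) | Collision Ratio ↓ (High) | Collision Ratio ↓ (Medium) | Collision Ratio ↓ (Low) | Average Cost ↓ (High) | Average Cost ↓ (Medium) | Average Cost ↓ (Low) | Average Time (s) ↓ (High) | Average Time (s) ↓ (Medium) | Average Time (s) ↓ (Low) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | - | - | 94.5 | 97.5 | 99.2 | 0.010 | 0.008 | 0.005 | 0.44 | 0.25 | 0.10 | 11.59 | 11.33 | 10.82 |
| - | - | ✓ | 97.2 | 97.3 | 99.3 | 0.003 | 0.005 | 0.005 | 0.23 | 0.13 | 0.04 | 12.08 | 11.54 | 10.99 |
| - | ✓ | ✓ | 98.3 | 98.8 | 98.9 | 0.010 | 0.008 | 0.003 | 0.02 | 0.02 | 0.02 | 12.33 | 11.58 | 11.02 |
| ✓ | ✓ | ✓ | 99.0 | 99.5 | 99.3 | 0.003 | 0.005 | 0.005 | 0.02 | 0.02 | 0.02 | 11.87 | 11.64 | 10.96 |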
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Teng, J.; Huang, W.; Yuan, S.; Hu, M.; Qin, H.; Li, Y.; Bian, Y.; Li, B. Adaptive Constraint Regulation for Human Preference-Aware Safe Reinforcement Learning of On-Ramp Merging. Machines 2026, 14, 605. https://doi.org/10.3390/machines14060605
