A Reinforcement Learning Approach to Dynamic Trajectory Optimization with Consideration of Imbalanced Sub-Goals in Self-Driving Vehicles

: Goal-conditioned Reinforcement Learning (RL) holds promise for addressing intricate control challenges by enabling agents to learn and execute desired skills through separate decision modules. However, the irregular occurrence of required skills poses a significant challenge to effective learning. In this paper, we demonstrate the detrimental effects of this imbalanced skill (sub-goal) distribution and propose a novel training approach, Classified Experience Replay (CER), designed to mitigate this challenge. We demonstrate that adapting our method to conventional RL methods significantly enhances the performance of the RL agent. Considering the challenges inherent in tasks such as driving, characterized by biased occurrences of required sub-goals, our study demonstrates the improvement in trained outcomes facilitated by the proposed method. In addition, we introduce a specialized framework tailored for self-driving tasks on highways, integrating model predictive control into our RL trajectory optimization training paradigm. Our approach, utilizing CER with the suggested framework, yields remarkable advancements in trajectory optimization for RL agents operating in highway environments.

However, when applied to the domain of autonomous driving, RL faces formidable challenges owing to the intricacies of traffic environments.Autonomous driving involves navigating at high speeds while adhering to traffic regulations and responsively reacting to nearby vehicles.This complexity has led to the adoption of goal-conditioned RL in path planning approaches, aiming to separate decision making and trajectory generation [10][11][12].
A universal value function [13] serves as a standard framework for goal-conditioned RL, particularly useful in environments with multiple sub-tasks [14,15].However, the irregular occurrence and dominance of certain sub-goals hamper optimal policy learning in such settings, adversely impacting training performance.This imbalance in sub-goal distribution is not unique to RL but is also pervasive in data-centric Artificial Intelligence (AI) systems [16,17] and real-world applications.The highway environment exemplifies this imbalance, where the sub-goal of maintaining the current lane outweighs other subgoals, such as changing to an adjacent lane.This highlights the necessity of addressing this imbalance to achieve high-performance agents.
Experience replay [18], applied in the deep Q-network algorithm [19], marked a significant advancement in off-policy RL.The experience replay method stores transition samples in a buffer, enabling the training of RL networks using uniformly randomly sampled minibatches from this buffer.However, the uniform sampling method poses limitations in effectively learning optimal policies, especially in environments characterized by imbalanced sub-goal distributions across multiple goals.Addressing this challenge has been the focus of research efforts aiming to enhance training efficiency within experience replay.Prioritizing temporal difference error [20] and reinterpreting unsuccessful experiences using methods like Hindsight Experience Replay (HER) [21] have shown effectiveness in sparse reward environments, yet they struggle to fully resolve severe imbalances inherent in practical applications.
To tackle this fundamental issue, we propose Classified Experience Replay (CER), specifically designed for multi-goal tasks.CER optimizes sample selection by adjusting the probability of choosing transitions associated with each sub-goal.Unlike traditional uniform sampling, CER's mechanism strategically alters the selection probabilities, mitigating the impact of imbalanced sub-goal distributions during training.Our empirical evaluations show the detrimental effects of imbalanced sub-goal distributions on the training policy.Through experimentation, we demonstrate that CER significantly improves sample efficiency and enhances the trained agent's performance in handling the challenges posed by imbalanced sub-goal distributions in practical scenarios.
Another challenge in highway driving tasks lies in designing an effective reward function.Unlike discrete success or fail task completion in game environments, the continuous trajectory generation in driving tasks demands meticulous consideration at each step due to multifaceted constraints.Designing an appropriate reward function for RL in such constraint-rich environments proves challenging, often leading to overlapping reward components and unexpected outcomes [22][23][24][25].Achieving a comprehensive reward function encompassing all objectives and regulations necessitates extensive iterative reward shaping, a process prone to trial and error.Prior studies have attempted fine-grained rewards, aiming to obtain an optimal policy of action for continuous control.Meanwhile, Model Predictive Control (MPC) [26,27] optimizes vehicle control under constraints using objective functions.However, despite yielding optimal solutions, the computational overhead for real-time optimization in high-speed driving scenarios remains a challenge.Complex state space models and longer horizon calculations escalate computational demands significantly.
This paper proposes a framework for achieving stable driving maneuver training with reinforcement learning.To further promote efficient learning, we devise a two-stage hierarchical structure for driving tasks.This structure separates decision-making and trajectory generation modules, enabling online sub-goal generation conditioned for optimal trajectory planning.Leveraging both CER and the hierarchical structure, the RL agent exhibits enhanced training stability and adaptivity to evolving sub-goals.Moreover, we develop an MPC-synchronized reward structure that eliminates the need for extensive reward shaping.This hybrid structure allows concurrent computation of rewards alongside MPC control signals, facilitating training by utilizing optimized references with reduced reliance on manual crafting.Importantly, both the reward function and MPC are bypassed during the evaluation phases, ensuring real-time performance without their computational overhead.To evaluate our approach, agents are trained and tested in a complex gym highway environment [28], showcasing the efficacy of the proposed RL-MPC hybrid approach in navigating challenging highway scenarios efficiently.By demonstrating the adaptability and efficiency of our framework in this robust testing ground, we pave the way for the successful deployment of RL-based autonomous driving systems in dynamic highway settings.
Our contributions in this paper are as follows: 1.
We address imbalanced goal distributions in driving environments through a novel off-policy training method called CER, enabling the RL agent to dynamically adapt to changing sub-goals during driving scenarios.

2.
We propose a two-stage hierarchical structure for driving tasks, separating decision making and trajectory generation.This approach fosters efficient learning by facilitating online sub-goal generation tailored for optimal trajectory planning.

3.
We introduce an MPC-synchronized reward structure that eliminates the need for extensive reward shaping.This hybrid framework allows concurrent reward computation alongside control signals, utilizing optimized references and minimizing manual tuning.
The remainder of the paper is organized as follows: Section 2 presents an overview of related works on the proposed method.Section 3 demonstrates the CER and the framework for a highway driving task, while Section 4 presents the experimental environment and the training results.Subsequently, we demonstrate the driving performance of the proposed method.Our framework is then evaluated qualitatively and quantitatively compared with other methods.Finally, Section 5 closes the paper with the conclusions.

Related Works
Many prior studies have employed RL for driving tasks, adopting diverse approaches such as end-to-end architectures [29,30] and hierarchical methods [10][11][12].End-to-end approaches often oversimplify the complexity of driving tasks, encompassing multiple decision, planning, and control layers in dynamic environments.For instance, ref. [29] utilized Deep Deterministic Policy Gradients (DDPGs) [31] to generate control inputs like steering and throttle from image inputs on a two-lane track.However, their assumption of an environment with only the ego vehicle limited realism, as other vehicles were neglected.A similar constraint applied to the use of Dueling Deep Q Networks (DQNs) in [30], where the training and testing tracks were identical, hindering generalization.Unfortunately, typical end-to-end approaches intertwine actions with decisions, making maneuvers challenging to interpret and necessitating relearning from scratch when environmental conditions change.In contrast, hierarchical approaches, as demonstrated by [10], segregate the decision-making process from the trajectory generation process, mitigating the drawbacks of the end-to-end paradigm.Such approaches employ multiple agents to address various behaviors in driving situations.
In recent years, transformer [32] has emerged as a promising tool for training sequential actions, with several studies adopting them for self-driving tasks [33,34].However, their large network size and computational demands pose significant challenges, particularly in real-time control applications such as autonomous vehicles.
Unlike end-to-end architectures, hierarchical approaches face fewer constraints on task complexity.Here, the driving task is divided into decision-making and trajectory generation, with RL used as the decision-making module to generate high-level commands, such as lane keeping or lane changing.The motion planner is designed to capture vehicle dynamics using a two-dimensional grid map planning method [10].Alternatively, in [35], RL is used for both decision making and motion planning with separate RL agents, initialized using the inverse RL method [36].Although this approach avoids rule-based methods, designing and shaping rewards for each agent demands considerable time and effort.
In another study, ref. [12] employed a conditional variational autoencoder for highlevel decision making and an RL agent for control input generation based on the decision.While RL excels at handling unseen scenarios, it falls short in reflecting changes in traffic regulations and ensuring decision diversity.Moreover, limitations arise if speed limits are not included in the observation during training, requiring the agent to be trained anew.To address these concerns, some approaches leverage RL for decision-making and rule-based models for trajectory generation, as exemplified by [11].This method utilizes a rule-based decision making module and generates control inputs from an RL agent, specifying decisions more comprehensively in both lateral and longitudinal directions.
However, the efficiency of designing and training multiple specialized agents with this approach is questionable.
To tackle these issues, our research incorporates a straightforward rule-based decisionmaking module to adapt to changes in traffic conditions and individual preferences.The control input is generated by the RL motion planner, ensuring the adaptive management of situations by utilizing concatenated state inputs from sub-goals and observations.

Classified Experience Replay and Driving Framework
Our goal is to enhance decision-making performance in sequential multi-goal problems within an environment characterized by imbalanced goal distribution and to design a motion planning and control framework for highway driving tasks.In this paper, we focus on the off-policy RL method, which employs experience replay.In this section, we introduce the motivation and method of our approach, as well as how we organized the overall framework to apply it to multi-goal tasks.

Classified Experience Replay
In environments where the sub-goal g ∈ G is manually switched, the quantity of transitions stored in the replay buffer varies depending on the distribution of sub-goals.While a uniform distribution is typically used for sampling the minibatch, this approach may lead to an uneven utilization of each sub-goal g and its corresponding transitions, thereby adversely affecting training performance.This effect is exacerbated in cases of more imbalanced sub-goal distributions.To mitigate this issue, we propose classified experience replay.Unlike conventional methods that use a single buffer, we employ multiple buffers, each having the same dimensions as the sub-goal space G ∈ R n .Every transition is stored in the buffer corresponding to its associated g.During policy updates, transitions are equally sampled from all buffers, thereby mitigating the disproportionate sub-goal experience during training and preventing the agent from becoming overly reinforced towards specific goals.Figure 1 illustrates the process of storing and utilizing transitions in CER.To evaluate the impact of an imbalanced sub-goal distribution on the agent's training, we devised a straightforward cliff walk scenario featuring two cliffs (see Figure 2).Here, the agent aims to navigate from the start state S to the goal state G under different sub-goals: "Straight", "Up", and "Down".The optimal trajectory to the goal state varies depending on these sub-goals, which are manually designated before each episode.The agent receives rewards for reaching the goal, with specific rewards varying based on the sub-goal.Figure 2 depicts the reward settings in the environment for each sub-goal, with colors representing the one-step reward received by the agent for stepping on each cell.To demonstrate the effect of sub-goal distribution on training performance, we trained and tested the agent in 10 cases with varying probabilities for g down .The probability distribution is shown in Figure 3a.Initially, the probabilities of the sub-goals were uniform, with p(g straight ):p(g up ):p(g down ) = 1:1:1.Subsequently, we gradually increased the probabilities of g straight and g up until reaching a ratio of p(g straight ):p(g up ):p(g down ) = 10:10:1.We compared the performances of agents trained using DQN and DQN with CER.The total buffer sizes used for DQN + CER were equivalent to the single buffer size used in DQN.A test was considered successful only if the agent found an optimal path for any g, denoted by O for optimal path, SO for sub-optimal path, and F for failure to reach the goal state.Figure 3b presents the test results, with the optimality test indicating whether the agent found an optimal path to the final destination under a given sub-goal.The agent trained with DQN frequently failed to reach the goal state as the probabilities became biased.In contrast, the agent trained with DQN with CER consistently found the optimal path under any sub-goal, irrespective of the non-uniformity in probabilities.The symbol in each cell indicates whether the agent found a path from the start cell to the goal cell.O denotes an optimal result, SO denotes a sub-optimal result, and F denotes failure in finding a path.

Two-Stage Hierarchical Framework for Driving Task
In dynamic environments like highways, the task of determining where to drive and executing detailed maneuvers is intricately linked.This inherent complexity in decision making and control poses significant challenges.To address this, we adopted a twostage hierarchical framework with separate modules for decision-making and trajectory generation.The overall structure of our framework, depicted in Figure 4, comprises the decision-making and motion planning modules.For trajectory generation, we employ an RL agent conditioned on the sub-goals determined by the decision-making module.Additionally, we utilize MPC to generate reference trajectories during the training of the motion planner.In this section, we demonstrate the proposed framework in detail.

Decision Making
In driving scenarios, decision-making can vary depending on the specific objectives.In our research, we employed a Decision-Making (DM) module that determines the vehicle's actions, represented as its "turn lamp".Our approach incorporates two primary considerations: adhering to the center of the guidance lane and avoiding unnecessarily slow driving.While these considerations adequately address highway driving situations, certain nuanced decisions remain, such as determining when to accelerate or decelerate, the optimal timing for lane changes, and adjustments to the steering angle for passenger comfort.These nuanced decisions are difficult to address with a complex state machine and lack the necessary generality and robustness.Therefore, our DM module focuses solely on lane-changing decisions, leaving other control decisions to be handled by the RL motion planner.
The DM module receives input in the form of a predetermined guidance lane and the current state of the vehicle.Using odometry information, the DM module calculates the Time-To-Collision (TTC) value between the ego vehicle and its surroundings at each time step.Longitudinal TTC serves as a safety margin for lane changes, permitting lane-change commands only when the measured TTC values exceed predetermined thresholds for both the front and rear directions.These thresholds remain fixed throughout training.The ego vehicle strives to adhere closely to the guidance target lane, but if a preceding vehicle is driving slowly, the agent must decide whether to overtake or maintain its current lane.Guidance lanes are randomly generated in each episode, with a mandatory lane change occurring every 800 m by designating a different target lane.Figure 5 illustrates the operation of the DM module, depicting the ego vehicle as a yellow box and surrounding vehicles as blue boxes.Additionally, specific designations are used: yellow lines represent frontal TTC measurements (Figure 5b), orange lines represent rear time-to-collision measurements (Figure 5b), and red lines represent designated global guidance (Figure 5c,d).
The DM module determines lane-changing actions based on the current TTC values and the guidance path.In Figure 5d, for example, the ego vehicle transitions to the right lane to overtake the vehicle ahead.

Motion Planning
The decision output from the DM module is transmitted to the RL agent by concatenating it with the observation, subsequently generating the control input of the vehicle.Given the inherent uncertainty in driving scenarios, it is safer to constrain movements within an expected range through decision making rather than directly generating a control signal.The corresponding MDP formulation is presented below.
(1) State Space and Action Space: The state space comprises absolute and relative observations.Absolute observation encompasses the ego vehicle's odometric information and the odometric information of four surrounding vehicles.Odometric information, denoted as O k ∈ R 5 , includes two-dimensional position and velocity information, as well as the heading angle: Here, k represents the index of the vehicle, with 0 assigned to the ego vehicle and 1, 2, 3, and 4 representing nearby vehicles in clockwise order from the ego vehicle's center of mass.Relative information includes the time-to-collision value ttc for nearby states and the derived target lane l target from high-level decisions and the current ego vehicle's position.The state accumulates four consecutive decision time steps to ensure the Markov property.The action space a ∈ R 2 comprises the steering angle β and acceleration α, which collectively serve as the control input for the system dynamics.
(2) Reward Function: The core component of the reward function involves comparing the ego vehicle's odometric states with those generated by synchronized MPC.To facilitate MPC for vehicle control, we employed a linearized kinematic bicycle model as described in [37].In this model, we introduced the longitudinal distance d f between the ego vehicle and the vehicle in front as an additional state variable.This incorporation enables us to impose a constraint on maintaining a safe front gap.MPC shares the same action space and state space with an RL agent.The objective of the driving task is to follow a reference velocity and desired lateral position while adhering to inequality conditions that satisfy the vehicle's dynamics.
Figure 4 illustrates the synchronization process of MPC, where decisions from the DM module serve as the reference for the MPC trajectory, and the MPC trajectory, in turn, serves as the reference for the RL reward function.This approach enables the trained agent to achieve trajectory optimization performance comparable to that of an MPC controller with reduced computation time.The MPC reference trajectory is computed at every m step from the ego vehicle's current state to ensure reference stability.These m steps of the reference trajectory are calculated within a cloned environment and then transferred to the original environment for reward calculation.In addition to the odometric reward, the change of action is also factored in to prevent oscillation or abrupt changes in the control input.The running reward is calculated as follows: where H and R are diagonal matrices used to weigh each element.Another element of the reward function involves checking the termination conditions, which occurs in the event of collision or deviation from the traffic road of the ego vehicle, and is defined as follows: Building upon the framework detailed in Section 3.2, we trained our motion planning agent with our proposed off-policy training method, CER.Notably, CER is compatible with a range of off-policy algorithms.

Experimental Results
In this section, we evaluate how the proposed framework enhances the performance of trajectory optimization in dynamic highway settings.Our agent is trained in the gym highway environment [28], featuring multiple lanes and nearby vehicles, requiring the generation of appropriate control inputs based on odometry information for both its own position and that of nearby vehicles.The objective in this environment is to maintain a target speed of 30 m/s (108 km/h) while staying within the target lane and avoiding collisions.The base RL algorithms are implemented using the open-source RL framework, stable-baselines3 [38] with PyTorch, and the open-source toolbox do-mpc [39] is used for the MPC calculation.
To assess the effectiveness of our training method and driving framework, we implement CER in TD3 [40], which is considered the state-of-the-art deterministic policy gradient RL method.As the baseline method, we employ TD3 with the UVFA [13] structure for the goal-conditioned policy.
In an ablation study, we explore the impact of the behavior policy in the training process by integrating expert demonstrations with the stored transitions in the buffer.Trajectories from the MPC controller are utilized for reward calculation, with these trajectories also serving as expert demonstrations integrated into the training process, denoted as CER [M].While adhering to the network hyperparameters outlined in [40], additional buffers are employed for our configuration.To ensure training stability, the reward scale is clipped within the range of [−1, +1].Each episode consists of 1000 steps, with termination triggered if the agent deviates off track or collides with another vehicle.This section is structured as follows: In Section 4.1, we assess whether CER enhances training performance and compare the performances of TD3 in the highway environment with and without CER.Section 4.2 provides a detailed analysis of trajectories and sub-goal decisions, while Section 4.3 presents an evaluation of action diversity in the trained agent.
We demonstrate the adaptability of our hierarchical module to changes in the decisionmaking module in this section.Lastly, in Section 4.4, we conduct an ablation study to examine the effect of the behavior policy on the CER training agent.

Training Performance Evaluation
We conducted training using TD3 both with and without CER on driving tasks to evaluate whether CER enhances training performance.Figure 6a illustrates the average episode length curve per 1000 episodes over the complete training steps, with a maximum episode length of 1000 steps.Additionally, Figure 6b depicts the reward curves for the baseline method and the proposed method.As shown in Figure 6b, the use of CER contributes to enhanced training speed and performance.

Driving Performance Evaluation
In this section, we demonstrate the superior task competency of our method compared with the baseline method by showcasing several scenarios.

Case 1: Lane Keeping
In Figure 7, the maneuver of the proposed method and the baseline method is shown within the same simulation environment, under identical maneuver conditions of nearby vehicles.Despite this uniformity, noticeable discrepancies in the paths generated by our method and the baseline method are observed.To assess the lane-keeping performance of each path, Figure 8 zooms into a segment spanning from 1100 to 1500 m.The Center of the Lane (CoL) represents the position where the vehicle should be located, determined by the sub-goal output from the DM module referenced in Section 3.2.1.The positional deviation is computed using the mean square error between CoL and the vehicle's lateral position.Based on the maneuver data depicted in Figure 8a, TD3 + CER demonstrates an average deviation of 0.06 m and a maximum deviation of 0.07 m, accompanied by stable steering angles as illustrated in Figure 8c.In comparison, TD3 exhibits an average deviation of 0.49 m and a maximum deviation of 0.7 m under similar conditions.Figure 8d highlights the fluctuation of steering angles.Our method achieves minimal lane deviation through smooth steering control inputs.

Case 2: Lane Changing
In Figure 9, rapid sequential lane changes are observed.Between 1100 and 1300 m, two-lane changes take place consecutively.The detailed trajectory is depicted in Figure 10, where the proposed method adeptly follows the target lane, while the baseline method experiences delay and failure during the lane change.Additionally, between 1400 and 2200 m, the baseline method fails to maintain the lane, indicating deficiencies in steering control.Furthermore, within the desired speed range of 20 to 30 m/s, TD3 also struggles with velocity control.

Case 3: Overtaking
In Figure 11, both mandatory and discretionary lane changes (overtaking) occur.The gray areas denote the predetermined guidance lane, indicating mandatory lane changes for the ego vehicle.To assess each agent's performance in following sub-goals, Figure 12 shows the maneuvers of each agent with sub-goals at each time step.Both agents must opt to change lanes to the right at 1000 m as a mandatory lane change.While our method successfully follows the target lane, the TD3 agent deviates from the target lane after 1000 m.A closer examination of the overtaking action is provided in the section from 1100 to 1470 m in Figure 13, where the maneuver is depicted in both global Figure 13a,b coordinates.In relative coordinates, the origin is set as the obstacle vehicle.Here, the agent in our method executes an overtaking maneuver while maintaining a safe distance from the obstacle vehicle.

Action Diversity Evaluation
To assess the adaptability of our trained agent, we evaluated the performance of our best-performing policy from TD3 + CER using three different decision-making modules.Each decision-making module generated distinct sub-goals within the same scenario, allowing us to examine how effectively the motion planner responded to these sub-goals.
Figure 14 illustrates the vehicle trajectory influenced by three distinct decision-making modules, representing different driving styles: original (blue), aggressive (red), and tolerant (yellow).The variation in time-to-collision thresholds resulted in different behaviors within identical situations.Table 1 displays the thresholds for each DM module, denoted by TTC front and TTC rear .Figure 14a presents the trajectories of each vehicle with line markers and highlights the predetermined guidance lanes, represented by discrete gray areas every 800 m.The original decision-making module reflects the approach used during training.The "tolerant" module employed higher thresholds for lane changes, while the "aggressive" module exhibited a propensity for frequent lane changes.These DM modules and the associated maneuver analyses were evaluated under identical conditions on a four-lane highway with 50 nearby vehicles.Table 2 presents an analysis of the trajectories generated by each DM module using an identical motion planner, including the number of lane changes (N l.c. ) and the average velocity (v avg.).
Figure 14c depicts the velocities of each vehicle within the preferred speed range of 20 to 30 m/s.The "aggressive" approach involved frequent lane changes to maintain speed, particularly when encountering slower vehicles ahead.Conversely, the "tolerant" approach strictly adhered to the guidance, sometimes even decelerating.The dashed gray line in Figure 14c represents the desired speed limit of 30 m/s.In instances where no vehicle was in front, such as at 100 m in Figure 14c, the vehicles accelerated close to the desired 30 m/s.Our best-performing policy demonstrated adaptive behavior, generating diverse actions within the same scenarios based on the provided sub-goals.1 displays varying hyperparameters of the decision-making modules: "original", "aggressive", and "tolerant".(a) demonstrates different maneuvers made according to the varying decision-making modules.The distance and velocity plots in (b,c) reveal more evident characteristics of each decision-making module, along with the corresponding trajectories generated from each module.This result indicates that the trained MP module agent does not overfit to specific DM modules ("original") but adheres to sub-goals.

Ablation Study
In our MDP formulation, we integrated trajectories from the MPC controller as odometry references in the calculation of the running reward.These trajectories were repurposed as expert demonstrations during the training of our motion planner, with the expectation of accelerating convergence in the training process.This variant, referred to as CER[M], employed approximately 10% of each buffer for MPC demonstrations to mitigate the risk of policy failures due to extrapolation errors [41].Consequently, TD3 + CER[M] was trained with two behavior policies: the agent itself and the MPC policy.
Figure 15 illustrates the reward graph of the proposed methods (TD3 + CER and TD3 + CER[M]) alongside the baseline method (TD3).Notably, TD3 + CER[M] achieved convergence much faster compared with TD3 + CER, indicating that CER can significantly enhance the performance of the trained agent.Moreover, the judicious use of expert demonstrations can expedite convergence speed during training.
In Figure 16, the maneuver and velocity changes of each algorithm's best-performing policy are visualized.Each algorithm generates distinct trajectories under the same scenario.Notably, TD3 + CER[M], trained with MPC demonstrations, exhibits similar maneuver and velocity changes to MPC during the evaluation period.
Table 3 provides a detailed comparison of each trajectory, focusing on trajectory similarity to MPC and deviation from the target lane.The accumulated Euclidean distances between the MPC trajectory and the target line throughout the scenario are presented, reflecting the trajectory similarity criterion.Additionally, the deviation from the target lane, indicative of decision-making module responses at each time step, is analyzed.Among the RL-based approaches, TD3 + CER[M] demonstrates the closest trajectory similarity to MPC and the smallest deviation from the target lane.While exhibiting trajectory similarities with the MPC motion planner, our proposed method achieves a significant reduction in computational time during the scenario, approximately by a factor of one-tenth.Table 3. Quantitative analysis in Scenario 4. Considering the sum of Euclidean distance to the target lane as an indicator of how well the agent adheres to sub-goals, TD3 + CER outperforms the baseline method.Furthermore, TD3 + CER[M] generates trajectories most similar to MPC, as evidenced by the criteria of Euclidean distance to the MPC trajectory.Notably, the computation time of the RL agent is significantly less than that of MPC, suggesting the potential of substituting MPC with an RL agent trained using our proposed method.

MPC TD3 + CER[M] TD3 + CER TD3
Euclidean   We assessed the trained policy of each method across two-, three-, and four-lane highway environments, conducting 100 repetitions for each test.Success for each episode was defined as the agent completing 1000 time steps without encountering forced termination conditions (such as collision or deviation from the track).The success rate, representing the number of successful cases out of the total cases, is presented in Table 4. Additionally, we analyzed the average quality of maneuvers based on deviation and velocity, as outlined in Table 5.Our method consistently exhibited the smallest deviations from the lane center across all environments, while also maintaining velocity within the desired bounds.Collision with other vehicles was the sole cause of failure.In the two-lane environment, the absence of additional lanes made it challenging to overtake the slow-moving vehicles ahead, necessitating more delicate acceleration control.

Conclusions
This paper has presented an off-policy RL training method to address environments characterized by imbalanced sub-goal distributions.Our work demonstrates how incorporating CER into off-policy algorithms can significantly improve performance.The key contributions and findings of this research are summarized as follows: 1.
Imbalanced Sub-Goal Distribution and CER Technique: In Section 3, we utilized the twin cliff walk example to demonstrate how disproportional goal distributions in multi-goal environments can adversely impact training efficiency and optimality.Our adaptation of the CER technique to off-policy algorithms showed improved performance compared with methods that do not incorporate CER, demonstrating the validity of employing our method for multi-goal tasks.We adapt this approach to highway driving scenarios in the subsequent section.

2.
Hierarchical Framework for Highway Driving Task: We proposed a hierarchical framework suitable for dynamic highway environments and trained a motion planner using our suggested training method.By employing separate modules, our agent proficiently generates control inputs across various scenarios without necessitating re-training or transfer learning.This flexibility allows for the easy modification of the decision-making module, empowering the agent to accommodate changes in traffic regulations or individual driving preferences.Simplifying the reward function by synchronizing the online MPC trajectory led to two notable advantages: a more straightforward design of the reward function and maneuvering capabilities of the RL-trained motion planner comparable to an optimized MPC path planner, with computational demands reduced to less than one-tenth of the original.

3.
Discussions and Future Work: The incorporation of CER into the training of the motion planner demonstrated substantial improvements in both sample efficiency and performance.Our proposed method fosters the generation of uniform data in the training process, departing from uniform sampling to yield optimal solutions.Our method consistently outperformed other approaches without CER integration, underscoring its efficacy in addressing tasks, even in continuous control problems where traditional RL methods encounter difficulties in training.
However, there are limitations to the proposed framework.As replay buffers are defined independently for individual sub-goals, the method assumes that the subgoals are known in advance, necessitating some degree of prior knowledge of the task.This dependency on predefined sub-goals could limit the method's applicability in scenarios where such prior knowledge is not readily available or where some subgoals may be added or removed.Despite necessitating multiple buffers, the overall buffer capacity remains the same as that of a general single-experience replay buffer.Consequently, our method is adaptable to various off-policy algorithms, offering enhanced performance in multi-goal environments.
Ensuring safety in driving tasks is paramount.Incorporating additional techniques to constrain undesirable control actions, such as oscillations, exceeding speed limits, and executing safety maneuvers like emergency stops, can further enhance the robustness and reliability of the driving task.This can be achieved by adding masking layers or using constraint RL methods.
While our method demonstrates significant advancements, particularly in handling imbalanced sub-goal distributions and improving training efficiency, our future work will involve adding more safety-aware constraints for continuous control in driving tasks.These enhancements aim to improve the robustness and applicability of our framework in diverse and complex traffic scenarios.

Figure 1 .
Figure 1.Visualization of classified experience replay.Transitions are stored in the classified buffer and uniformly sampled to train the agent, ensuring a balanced selection of transition samples.

Figure 2 .
Figure 2. The twin cliff walk environment and reward configuration are depicted.The colored arrows in the left figure illustrate the optimal paths for achieving each sub-goal.The reward maps for each sub-goal are depicted as (i-iii).

Figure 3 .
Figure 3. (a) shows the probability distribution of each sub-goal from case #1 to case #10.In case #1, the distribution of three sub-goals is uniform, but as the cases progress, the distribution becomes biased towards two sub-goals other than g down .(b) demonstrates the success or failure of the trained agent.The symbol in each cell indicates whether the agent found a path from the start cell to the goal cell.O denotes an optimal result, SO denotes a sub-optimal result, and F denotes failure in finding a path.

Figure 4 .
Figure 4. Overview of the hierarchical framework, comprising the Decision-Making (DM) module and the Motion Planning (MP) module.The behavior command determined by the DM module acts as a sub-goal for the RL agent.Additionally, the MPC module regularly (represented as m steps) computes trajectories, providing expert trajectory data for reward calculation.

Figure 5 .
Figure 5. Decision-making process visualization in simulation.(a) The simulation environment showing multiple lanes and nearby vehicles.(b) Front and rear TTC values are depicted as orange and yellow dashed lines.(c) The guidance lane is indicated by red lines.(d) The green stripe represents the decision made by the ego vehicle.TTC values at (b) and the guidance lane at (c) determine the sub-goal of the current state of the ego vehicle.As depicted in (d), the DM module can decide to change lanes even when the vehicle is on the guidance lane if the TTC value is too small.

Figure 6 .
Figure 6.Reward curves of baseline method (TD3) and proposed method (TD3 + CER).The colored areas on each graph represents the boundaries of each measure, with averages calculated per 10 episodes, while lines represent averages per 1000 episodes.

Figure 8 .
Figure 8. Case study of lane-keeping performance: Section from 1100 m to 1500 m in Scenario 1. (a,b) depict the lane-keeping maneuvers of our method and the baseline method, respectively.(c,d) show the steering angles corresponding to the maneuvers in (a,b)."CoL" denotes the center of the lane, representing the optimal position for the vehicle.The red rectangle in (a) and the green rectangle in (b) denote the location of the vehicle at each labeled time step.In this lane-keeping scenario, the suggested method demonstrates superior performance, exhibiting stable lane-keeping maneuvers and steering angles, as evidenced by the stable steering angle changes.

Figure 10 .
Figure 10.Case study of lane-changing performance: Section from 1100 to 1300 m in Scenario 2. The lower label of the vehicle (rectangle box) indicates the time step, while the upper label indicates the velocity of the vehicle at each time step.The maneuver of the suggested method in (a) follows CoL while maintaining the desired velocity range.In contrast, the maneuver without CER, depicted in (b), fails to follow CoL and exceeds the desired velocity range.

Figure 12 .
Figure 12.Sub-goals and resulting paths of each agent are depicted in the figure.The yellow line represents the sub-goal decisions, labeled as "left", "right", and "idle".As the guidance lane changes at 1000 m, mandatory lane changes must occur for both agents in (a,b).While a lane change occurs in (a), it does not take place in (b).

Figure 13 .
Figure 13.Case study of overtaking performance: Section from 1100 to 1470 m in Scenario 3. The stability of the lane-changing maneuver is shown in (a).Examining the relative maneuver of the ego vehicle to the obstacle vehicle during overtaking, illustrated in (b), the ego vehicle maintains a safe gap, ensuring a safe overtaking maneuver.
Comparison of trajectories over the entire section.(b) Distance from the vehicle in front.(c) Change in velocity over the entire section.

Figure 14 .
Figure 14.Action diversity evaluation with the TD3 + CER agent.Table1displays varying hyperparameters of the decision-making modules: "original", "aggressive", and "tolerant".(a) demonstrates different maneuvers made according to the varying decision-making modules.The distance and velocity plots in (b,c) reveal more evident characteristics of each decision-making module, along with the corresponding trajectories generated from each module.This result indicates that the trained MP module agent does not overfit to specific DM modules ("original") but adheres to sub-goals.

Figure 16 .
Figure 16.Maneuver and velocity analysis with the addition of the MPC expert demonstration method.When the agent employs MPC demonstration, it trains to imitate the MPC's policy more straightforwardly.

Table 1 .
DM parameters for different driving styles.

Table 2 .
Analysis results for different driving styles.

Table 4 .
Quantitative analysis: Success rate measured as the number of successful cases out of the total cases.Out of 100 random road environment conditions, each with a corresponding number of lanes, TD3 + CER[M] exhibits the highest success rate.

Table 5 .
Quantitative analysis of driving performance, including lane-following accuracy and maintenance of desired velocity.Given that the target speed range is 30 m/s, TD3 + CER[M] shows maintaining the most approximate value to the target speed.However, given the deviations from the target lanes, TD3 + CER exhibits the smallest values.Notably, suggested methods are superior to baseline methods in every evaluation criterion.