A Multi-Objective Optimal Control Method for Navigating Connected and Automated Vehicles at Signalized Intersections Based on Reinforcement Learning

Abstract: The emergence and application of connected and automated vehicles (CAVs) have played a positive role in improving the efficiency of urban transportation and achieving sustainable development. To improve traffic efficiency at signalized intersections in a connected environment while simultaneously reducing energy consumption and ensuring a more comfortable driving experience, this study investigates a flexible, real-time control method that navigates CAVs through signalized intersections using reinforcement learning (RL). First, the control of CAVs at intersections is formulated as a Markov Decision Process (MDP) based on the vehicles' motion state and the intersection environment. Next, a comprehensive reward function is formulated considering energy consumption, efficiency, comfort, and safety. Then, based on the established environment and the twin delayed deep deterministic policy gradient (TD3) algorithm, a control algorithm for CAVs is designed. Finally, a simulation study is conducted using SUMO, with Lankershim Boulevard as the research scenario. Results indicate that the proposed method yields a 13.77% reduction in energy consumption and an 18.26% decrease in travel time. Vehicles controlled by the proposed method also exhibit smoother driving trajectories.


Introduction
Optimizing traffic at intersections, which are pivotal nodes within the urban road network and critical hubs for traffic flow regulation, is of paramount significance for the entire road network traffic system. Traditional vehicle driving faces significant constraints in terms of time and space, necessitating decisions within the confines of limited road space and transit time [1]. Conversely, the advent of Connected and Automated Vehicles (CAVs), spurred by advancements in connected vehicles and artificial intelligence, has revolutionized transportation. CAVs facilitate optimized driving behaviors via interactive learning with the surrounding environment [2,3]. Within intelligent transportation systems, real-time adjustments to and optimizations of vehicle driving strategies at signalized intersections can be implemented based on detected vehicle trajectory data [4].
Traditional research often uses optimization models for single-objective or multi-objective goals focused on energy consumption and efficiency, aiming to solve for vehicle control parameters. In the realm of fuel vehicles, prior studies have mainly concentrated on enhancing fuel savings and minimizing emissions. Eco-driving strategies are developed through the integration of optimal control and trajectory optimization to minimize fuel consumption and enhance traffic efficiency. These strategies incorporate real-time traffic prediction, vehicle connectivity, and signal control, all aimed at reducing fuel usage while ensuring smooth mobility [5,6]. Utilizing optimal control techniques like mixed-integer linear programming and Pontryagin's minimum principle, models were devised to optimize vehicle trajectories and traffic signals, particularly at signalized intersections [7,8]. Moreover, there is a concerted effort to merge offline planning with online tracking, fostering the development of energy-efficient driving strategies for CAVs. Simulation outcomes have underscored substantial gains in fuel efficiency alongside notable reductions in CO2 emissions [9].
However, as electric vehicles emerge, research endeavors are gradually shifting towards achieving reduced power consumption. Current research focuses on enhancing the efficiency of hybrid electric vehicles (HEVs) and electric vehicles (EVs), particularly in autonomous driving. Such research integrates vehicle dynamics and powertrain optimization for better fuel economy. Methods like approximate dynamic programming and optimal control models have optimized fuel consumption in autonomous HEVs [10,11]. Moreover, an analytical model determined optimal speed profiles for EVs, considering road and traffic conditions [12]. Additionally, an energy-efficient adaptive cruise control system was proposed for electric, connected, and autonomous vehicles (e-CAVs), improving energy efficiency compared to traditional strategies [13]. These efforts collectively advance the sustainability of HEVs and EVs in autonomous driving.
Additionally, in the optimization of speed trajectories, researchers consider the temporal and spatial influences of vehicles at intersections. Depending on the driving mode, the vehicle speed is controlled in accordance with road characteristics and real-time traffic conditions. Dynamic eco-driving on main roads can yield up to 15% fuel savings and a reduction in carbon dioxide emissions [14]. A multi-objective speed planning model optimized electric vehicle trajectories, resulting in substantial electricity and time savings [15]. Additionally, an eco-driving method based on departure time prediction reduced delays at signalized intersections by optimizing CAV trajectories, showcasing notable efficiency improvements [16]. Furthermore, a novel car-following model considering road geometry enhanced traffic analysis and offered insights into stability and spatial separation contours [17]. However, the existing models are static, empirical, and designed for specific scenarios, relying on idealized assumptions. Consequently, these approaches might not fully account for the unpredictable nature of real-world traffic scenarios.
The advancement of artificial intelligence has led scholars to apply relevant theories and algorithms in analyzing traffic flow characteristics and managing vehicle operations [18], thereby significantly reducing computational complexity. Methods based on intelligent algorithms are dedicated to exploring dynamic and optimal driving strategies for vehicles. Cutting-edge technologies, such as reinforcement learning (RL) and deep reinforcement learning (DRL), facilitate the development of optimized driving strategies for efficient vehicle control at intersections. RL and DRL techniques are harnessed to develop various car-following models and control strategies for CAVs. These innovations aim to optimize trajectories, reduce energy consumption, enhance traffic efficiency, and bolster driving safety.
A recent study combined energy-efficient driving with adaptive traffic signal control using RL. It achieved significant fuel savings, ranging from 31.73% to 45.90%, with varying degrees of mobility sacrifice [19]. Additionally, DRL was employed for longitudinal trajectory control in CAVs, ensuring fuel efficiency and safety at signalized intersections [20]. Eco-driving applications for semi-actuated intersections effectively reduced fuel consumption by 29.2% and noise by 21.9%, enhancing sustainability [21]. A parameterized RL approach was proposed to improve energy efficiency without disrupting other vehicles, offering promising results [22]. Moreover, RL-based control minimized energy consumption at signalized intersections while maintaining mobility, demonstrating the potential for sustainable traffic management [23]. Hybrid DRL-based eco-driving algorithms were proposed for low-level CAVs along signalized corridors, demonstrating substantial reductions in fuel consumption with minimal travel time impacts [24]. Some studies have proposed RL models for e-CAVs to mitigate traffic oscillations and improve energy efficiency. These models exhibited self-learning capabilities and showed the potential to enhance travel efficiency while reducing energy consumption [25,26].
Additionally, previous studies have explored the application of RL-based methods to enhance traffic efficiency and driving safety. A framework using convolutional neural networks for prediction of time consumption at intersections was proposed, enabling optimal passing order and continuous control for connected vehicles [27]. The impact of leading autonomous vehicles on urban networks was investigated, showcasing potential congestion mitigation benefits [28]. A DRL-based reference speed-planning strategy was introduced for hybrid electric vehicles, with the goal of optimizing fuel economy and enhancing driving safety [29]. Utilizing deep neural networks and multi-agent reinforcement learning, traffic light controllers were effectively coordinated, resulting in substantial reductions in traffic congestion [30,31]. Through efficient reward functions, controllers adapted to varying traffic demands and diverse traffic light cycles, thereby enhancing intersection safety [32]. Furthermore, an attention mechanism was incorporated to foster successful cooperation among mixed traffic streams and prevent intersection collisions [33]. However, these investigations often adopt discrete action spaces and simple reward functions to streamline the training process, considering relatively few influencing factors.
Considering the limitations of current research, this paper proposes an advanced RL-based control method for navigating CAVs at multiple intersections. The approach integrates various factors, including the environment of signalized intersections and the specific driving characteristics of CAVs. The main contributions of our study are summarized as follows:
1. The CAV is recognized as an agent that collects information on its status and surroundings, such as Signal Phase and Timing (SPaT) data and vehicle motion parameters, via roadside and onboard devices. It then interacts with the environment, utilizing RL algorithms to support decision-making and control;
2. A general Markov decision process (MDP) framework for vehicle control is established, with a carefully designed reward function that considers energy consumption, traffic efficiency, driving comfort, and safety;
3. Compared with traditional optimization model-based control approaches, our method employs a model-free reinforcement learning algorithm to generate the CAV's trajectory in real time. This significantly reduces computational complexity and enhances the ability to handle complex real-world scenarios.
The remainder of this article is structured as follows. Section 2 introduces the establishment of the Markov Decision Process (MDP) for vehicle control, integrating the signalized intersection environment with the driving dynamics of the CAV. It then details the training of the CAV using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, alongside designing a longitudinal motion control strategy within this environmental context. In Section 3, simulation experiments using Simulation of Urban MObility (SUMO) are conducted to evaluate the proposed method's feasibility and effectiveness. The simulation results are subsequently discussed and assessed using a variety of metrics. Finally, Section 4 concludes the article by summarizing the study's findings.

Research Scenario
This paper presents a research scenario where a CAV is integrated into a manually driven traffic flow to explore a single-vehicle control strategy. As depicted in Figure 1, all vehicles maintain their lanes without any lateral lane-changing. The traffic flow includes a mix of human-driven vehicles (HVs) and the CAV. The HVs adhere to a traditional driving model, while the CAV employs a specifically designed control algorithm for longitudinal following. The traffic system enables data exchange through communication technologies at the intersection. Equipped with onboard sensors, the CAV interacts with nearby vehicles to gather real-time data on the speed and position of the vehicle ahead, adjusting its speed to maintain a safe distance. Moreover, the CAV accesses vital SPaT information by connecting with roadside units. These SPaT data allow the system to calculate the remaining green-light time, aiding the CAV in deciding whether to speed up or slow down.

Description of MDP
The strategy for navigating the CAV through signalized intersections is developed as a Markov decision process (MDP) [34]. This MDP framework for CAV control considers various factors, including the vehicle's velocity and location, information from the preceding vehicle, and the current state of traffic signals. The framework operates under the following assumptions:
(1) The road is solely used by motor vehicles in prime driving condition, adhering to established regulations and free from unforeseen incidents like malfunctions or erratic intrusions;
(2) Real-time information regarding the vehicle's position, speed, and acceleration is accessible. Simultaneously, real-time communication between onboard devices and roadside equipment is assured, without any delays.

State
The state in this context should encompass the dynamics of the CAV, the conditions of surrounding vehicles, and the status of traffic lights. Consequently, a multi-element vector is constructed to depict the CAV's state:
$$S_t = \{x_t, v_t, a_t, \Delta v_t, \Delta x_t, \phi_t, g_t\} \tag{1}$$
where $S_t$ denotes the state at time $t$, $x_t$ is the travel distance of the vehicle, $v_t$ denotes the speed of the vehicle, $a_t$ denotes the acceleration of the vehicle, $\Delta v_t$ is the speed difference between the preceding and following vehicles, and $\Delta x_t$ is the spacing between the preceding and following vehicles. Additionally, the signal lamp's status is represented as $\phi_t$ in (1), taking a value of 1 for a red signal and 0 otherwise, and $g_t$ denotes the remaining duration of the green light in the current phase at the nearest signalized intersection. This definition of state remains relevant whether the vehicle is on a road segment or within an intersection. Consequently, the framework can naturally extend to accommodate a multi-intersection scenario.
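As an illustration of the state definition above, the following minimal Python sketch assembles the seven state elements from raw measurements. The names `CavState` and `build_state` are hypothetical, and two conventions are assumptions not fixed by the text: the speed difference is taken as leader minus follower, and the spacing is taken as the difference of the two travel distances.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class CavState:
    """State elements named in the text (hypothetical container)."""
    x: float    # travel distance of the CAV (m)
    v: float    # speed of the CAV (m/s)
    a: float    # acceleration of the CAV (m/s^2)
    dv: float   # speed difference, leader minus follower (assumed sign)
    dx: float   # spacing to the preceding vehicle (m)
    phi: int    # 1 if the signal is red, 0 otherwise
    g: float    # remaining green time at the nearest signal (s)


def build_state(x, v, a, lead_x, lead_v, signal_is_red, remaining_green):
    """Assemble the state vector from CAV, leader, and SPaT measurements."""
    return CavState(
        x=x,
        v=v,
        a=a,
        dv=lead_v - v,   # assumption: leader speed minus CAV speed
        dx=lead_x - x,   # assumption: vehicle length is ignored
        phi=1 if signal_is_red else 0,
        g=remaining_green,
    )
```

Calling `astuple(build_state(...))` yields a flat tuple that could be fed to a policy network.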

Action
Upon acquiring pertinent information from the state space, the vehicle is required to respond dynamically. This involves immediate adjustments in speed achieved through continuous acceleration and deceleration actions. Consequently, the action space ($A_t$) is identified by the acceleration of the vehicle, illustrated by (2).
$$A_t = a_t \tag{2}$$
For realism, the acceleration ($a_t$) within the action space should comply with a specific interval ($d_{min} \le a_t \le a_{max}$, where $d_{min}$ and $a_{max}$ represent the vehicle's maximum allowable deceleration and acceleration, respectively, based on its specifications). Additionally, considering road traffic regulations and the safe operation of vehicles, it is essential to observe the following constraints:
$$0 \le v_t \le v_{max} \tag{3}$$
$$\Delta x_t \ge \Delta x^* \tag{4}$$
where $v_{max}$ denotes the maximum speed limit of the road, and $\Delta x^*$ is the minimum safety distance between the front and rear vehicles. By utilizing the state-space parameters as input for the environment and defining reward functions as objectives, we employ RL algorithms to train the agent. The aim is to obtain an optimal sequence of actions that maximizes rewards, thereby optimizing system performance. The optimal action sequence represents the CAV's most efficient driving strategy.
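One possible way to enforce the acceleration, speed, and spacing constraints described above is to project the commanded acceleration onto the admissible range before applying it. The sketch below is only an illustration: the text does not say how the constraints are enforced, the `Limits` container and `clip_action` function are hypothetical, the numeric limits are placeholder values, and the hard-braking rule for a violated gap is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Placeholder vehicle and road limits (illustrative values only)."""
    d_min: float = -4.5   # maximum deceleration (m/s^2), negative
    a_max: float = 2.6    # maximum acceleration (m/s^2)
    v_max: float = 15.0   # road speed limit (m/s)
    dx_min: float = 2.0   # minimum safety distance (m)


def clip_action(a_cmd, v, dx, limits, dt=1.0):
    """Return an acceleration that respects the stated constraints."""
    # Lower bound: vehicle limit, and never drive the speed below zero.
    a_lo = max(limits.d_min, -v / dt)
    # Upper bound: vehicle limit, and never exceed the speed limit.
    a_hi = min(limits.a_max, (limits.v_max - v) / dt)
    a_hi = max(a_hi, a_lo)  # keep the interval non-empty
    if dx < limits.dx_min:
        # Assumption: brake as hard as admissible when the gap is too small.
        return a_lo
    return min(max(a_cmd, a_lo), a_hi)
```

For example, with the placeholder limits, a command of 2.0 m/s² at 14 m/s would be reduced to 1.0 m/s² so that the next-step speed stays at the 15 m/s limit.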

Reward
When tackling the control challenges faced by CAVs at intersections, it is imperative to consider a range of factors, notably energy consumption and traffic efficiency. To comprehensively optimize the driving process of CAVs, this study designs a multi-objective reward function that considers diverse aspects to efficiently train agents. Specifically, the reward function aims to minimize energy consumption, mitigate traffic delays, and enhance driving comfort, all while prioritizing safety.
For the component of travel efficiency, the reward function ($r_1$) utilizes the vehicle's travel distance at each step as it crosses intersections, as detailed in Equation (5).
To reduce the electric energy consumption of the CAV at intersections, the reward function ($r_2$) for the energy consumption component is presented in (6).
where $\eta$ denotes the energy recovery factor, $E_{veh}$ is a function yielding the instantaneous electric energy of the CAV, and $E_{loss}$ represents the energy loss caused by the driving resistance. $E_{veh}$ and $E_{loss}$ incorporate principles from vehicle dynamics and energy conversion, respectively, considering the energy brake-recovery mechanism. Specific details can be found in [35].
Safety is always the top priority in vehicle operation. Here, the reward function ($r_3$) is formulated around time to collision (TTC), imposing greater penalties for actions that breach the safety margin, as demonstrated in (7).
where $\alpha$ is the penalty coefficient for dangerous driving, $TTC_t$ denotes the time to collision between the preceding and following vehicles at time $t$, and $TTC^*$ is the pre-defined safety threshold of $TTC_t$, selected as 2 s according to previous literature [36].
Enhancing driving comfort entails maintaining the smooth operation of the vehicle, reducing abrupt accelerations and decelerations to achieve a relatively smooth driving trajectory. Jerk, the derivative of acceleration, is used to characterize the stability of vehicle driving. The reward function ($r_4$) for driving comfort is calculated by (8) and (9).
where $\beta$ is the penalty coefficient for aggressive driving, $J_t$ denotes the jerk of the vehicle at time $t$, and $J^*$ represents the maximum permissible rate of change in acceleration that enables the vehicle to drive smoothly. According to previous research experience [37], the value of $J^*$ can be taken as 4.
Ultimately, the overall reward function ($R_t$) is derived by integrating the reward functions across all four indices, as shown in (10).
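To make the structure of the four-part reward concrete, the sketch below computes TTC and jerk from their standard definitions and combines four terms into one scalar. The functional forms of the individual terms and the weighted sum are assumptions for illustration only, since Equations (5) to (10) are not reproduced in this text. The energy term is passed in as a precomputed number rather than derived from the model in [35], and all function names and default coefficients other than $TTC^* = 2$ s and $J^* = 4$ are hypothetical.

```python
import math


def time_to_collision(dx, dv):
    """TTC in seconds; dx is the gap (m), dv is leader minus follower speed (m/s)."""
    closing_speed = -dv
    if closing_speed <= 0.0:
        return math.inf  # the follower is not closing in on the leader
    return dx / closing_speed


def jerk(a_now, a_prev, dt):
    """Finite-difference approximation of the derivative of acceleration."""
    return (a_now - a_prev) / dt


def step_reward(distance_step, energy_step, ttc, jerk_value,
                alpha=1.0, beta=1.0, ttc_star=2.0, jerk_star=4.0,
                weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine efficiency, energy, safety, and comfort terms (assumed forms)."""
    r1 = distance_step                                  # efficiency: distance covered
    r2 = -energy_step                                   # energy: penalize consumption
    r3 = -alpha if ttc < ttc_star else 0.0              # safety: TTC below threshold
    r4 = -beta if abs(jerk_value) > jerk_star else 0.0  # comfort: excessive jerk
    w1, w2, w3, w4 = weights
    return w1 * r1 + w2 * r2 + w3 * r3 + w4 * r4
```

In practice the terms would need to be scaled to comparable magnitudes; the unit weights here are placeholders.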

TD3 Algorithm
The TD3 algorithm is the chosen training method in the environmental framework. Within reinforcement learning, several classical training algorithms merit consideration, such as the Deep Q-Network (DQN) algorithm and the Deep Deterministic Policy Gradient (DDPG) algorithm. The DQN algorithm relies on a learned value function, known as the Q-function, and integrates crucial techniques such as experience replay and target networks [38,39]. However, DQN is primarily effective for tasks with a limited range of discrete actions and faces limitations in scenarios requiring real-time decisions, such as car-following on roads. To effectively address the reinforcement learning problem with continuous actions, the DDPG algorithm is introduced [40]. DDPG merges the value function and the policy gradient algorithm, improving parameter updates and facilitating decision-making in continuous action spaces.
Nonetheless, certain algorithm-level issues persist in DDPG, such as overestimation bias and susceptibility to overfitting of narrow peaks in the value estimate. As a remedy, TD3 (Twin Delayed DDPG) is introduced to address these shortcomings in the DDPG algorithm [41]. Compared with the DDPG algorithm, the TD3 algorithm uses a double critic network to calculate the target value of Q-functions, opting for the lesser of the two values, thus suppressing the problem of network overestimation.
As illustrated in Figure 2, the network structure of the TD3 algorithm is composed of an actor (policy) network and a critic (value) network. The actor network is responsible for determining optimal actions, while the critic network evaluates the desirability of these actions by estimating values of Q-functions. To enhance training stability, the target network periodically updates its parameters by copying them from the current network. Additionally, the delayed policy update is employed to ensure that the actor network is updated after the critic network undergoes multiple updates.
TD3 employs a parameterized actor neural network, which takes the state ($S_t$) as input and generates a continuous action ($A_t$) as output. Simultaneously, a parameterized critic neural network takes both the state ($S_t$) and action ($A_t$) as inputs and estimates the Q-value function. The parameters of the algorithm are manually adjusted through extensive simulations. Both the actor and critic neural networks use a multi-layer perceptron (MLP) architecture with two hidden layers. The first hidden layer of both networks consists of 400 neurons, while the second comprises 300 neurons.
The network structure of the TD3 algorithm assigns a distinct role to each component. The actor network serves as the interface with the external intersection environment, managing the input and output of data. Concurrently, each transition $(S_t, A_t, R_t, S_{t+1})$ is added to the experience replay pool as a sample for future training iterations. Action parameters ($A_t$) are transferred from the actor network to the critic network. Double critic networks are employed, selecting the minimum of $Q_{\theta_1'}(S_{t+1}, A_{t+1})$ and $Q_{\theta_2'}(S_{t+1}, A_{t+1})$ when calculating the target value, to mitigate the issue of network overestimation in the algorithm. For a set of data from the sample pool, the dual critic-target network calculates the target value ($y$) according to (11).
$$y = R_{t+1} + \gamma \min_{i=1,2} Q_{\theta_i'}(S_{t+1}, A_{t+1}) \tag{11}$$
where $\gamma$ is the discount factor for reward $R_{t+1}$, $\theta_i'$ denotes the parameter of critic-target networks I and II, and $Q_{\theta_i'}(S_{t+1}, A_{t+1})$ represents the estimated Q-value according to the state and action. Target policy smoothing is implemented in TD3 as a regularization technique. Its purpose is to constrain the action values by clipping them according to the target policy, ensuring that the actions remain within a valid action range ($A_t \in [A_{low}, A_{high}]$). Then, the target action can be expressed as follows:
$$A_{t+1} = \mathrm{clip}\left(\mu_{\theta_i'}(S_{t+1}) + \epsilon,\ A_{low},\ A_{high}\right) \tag{12}$$
where $\mu_{\theta_i'}(S_{t+1})$ denotes the action strategy adopted in state $S_{t+1}$, and $\epsilon \sim \mathcal{N}(0, \sigma)$ is random Gaussian noise. Subsequently, the target value ($y$) is transmitted to dual critic-online networks I and II. Here, the current Q-value is recalculated, and the network parameters are updated to minimize the loss function ($L$) as follows:
$$L = \frac{1}{N} \sum \left( y - Q_{\theta_i}(S_t, \mu_{\theta_i}(S_t) + \epsilon) \right)^2 \tag{13}$$
where $N$ denotes the batch size, $\theta_i$ is the parameter of critic-online networks I and II, and $Q_{\theta_i}(S_t, \mu_{\theta_i}(S_t) + \epsilon)$ represents the Q-value estimated by the online critic. Through this optimization, parameter $\theta_i$ of the critic is adjusted to enhance the accuracy of Q-value predictions. Parameter $\theta_i'$ of the target network is updated smoothly from the main network as follows:
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\,\theta_i' \tag{14}$$
where $\tau$ is a hyperparameter to determine the weight. Subsequently, the Q-value ($Q_{\theta_1}(S_t, \mu_\omega(S_t))$) calculated by critic network I is transmitted to the actor network to update parameters. Then, the actor can be updated by the deterministic policy gradient ($\nabla_\omega J$) as follows:
$$\nabla_\omega J = \frac{1}{N} \sum \nabla_a Q_{\theta_1}(S_t, a)\big|_{a = \mu_\omega(S_t)} \, \nabla_\omega \mu_\omega(S_t) \tag{15}$$
where $\omega$ denotes the parameter of the actor network.
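The target computation, critic loss, and soft target update described above can be sketched in plain Python with scalar values, which keeps the arithmetic visible without a deep learning framework. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, the default hyperparameters are placeholders, and the noise clipping bound `noise_clip` and the terminal flag `done` come from the standard TD3 formulation and are not mentioned in this text.

```python
import random


def smoothed_target_action(mu_next, sigma, noise_clip, a_low, a_high, rng=random):
    """Target policy smoothing: add clipped Gaussian noise, then clip to the action range."""
    eps = rng.gauss(0.0, sigma)
    eps = max(-noise_clip, min(noise_clip, eps))  # assumption: standard TD3 noise clip
    return max(a_low, min(a_high, mu_next + eps))


def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target: use the smaller of the two target-critic estimates."""
    if done:  # assumption: no bootstrapping from terminal states
        return reward
    return reward + gamma * min(q1_next, q2_next)


def critic_loss(targets, q_values):
    """Mean squared error between targets y and online-critic estimates over a batch."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

Taking the minimum of the two critics biases the target downward, which is what counteracts the overestimation discussed above.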

Vehicle Control Algorithm
The control algorithm for CAVs is developed within the MDP environment framework, leveraging the network architecture of the TD3 training algorithm. This algorithm takes the speed, position, signal phase, and timing information obtained from vehicle sensors as input. Through the training process, the algorithm outputs an action strategy to the vehicle controller for optimization of acceleration, ensuring smoother, safer, and more efficient driving behaviors.
Additionally, as the vehicle approaches intersections, the algorithm undertakes a detailed assessment of traffic signals. Specifically, it evaluates whether the remaining duration of the green light is sufficient for the vehicle to navigate the intersection smoothly and safely. Following this analysis, the algorithm issues commands to the vehicle controller, directing it to accelerate or decelerate as necessary. The flow chart of this algorithm is shown in Figure 3. The environmental input data involve collecting information on the vehicle's speed ($v_t$), position ($x_t$), and signal light status, including the current status of signal lamps ($\phi_t$) and the remaining duration of the green light ($g_t$). Through multiple iterations equal to the number of training epochs multiplied by the ratio of sample volume to batch size, the action strategy with the maximum reward, namely the optimal acceleration, is output to control the vehicle. The algorithm proceeds through the following specific steps:
Step 1. Initialization: Upon initiation, the environment state ($S_t$) is reset, and essential road and traffic demand data are transmitted to the vehicle controller. The TD3 algorithm receives state information, including speed, position, and SPaT. It initializes action ($A_t$) and provides a predicted sequence of actions to the environment.
Step 2. Interaction with the environment: After receiving the action sequence, the environment calculates the reward ($R_t$) of the vehicle until it approaches the road boundary. Subsequently, the calculated state and reward are transmitted back to the controller. The TD3 algorithm stores these tuples $(S_t, A_t, R_t, S_{t+1})$ in the experience pool, accumulating valuable training samples.
Step 3. Training: Training begins once the replay buffer reaches its capacity, utilizing the stored samples to refine decision-making for vehicle actions. The algorithm continues training until the maximum exploration step is reached, signifying the conclusion of the current episode of training. Upon reaching the maximum number of iterations, the algorithm indicates the attainment of the terminal state.
Step 4. Output: The algorithm outputs a control strategy ($\pi(a_t)$) for driving actions, including uniform speed, acceleration, and deceleration. This strategy is designed to maximize cumulative rewards ($R^*$), reflecting the algorithm's learned optimal behavior in response to the dynamic road environment and traffic conditions.
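The bookkeeping behind Steps 2 and 3 can be sketched as follows: a fixed-capacity experience pool, the iteration count stated in the text, and a delayed actor update schedule. The class and function names are hypothetical, the integer division in the iteration count is an assumption, and the delay of two critic updates per actor update is the default from the standard TD3 formulation, not a value given in this text.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience pool of (S_t, A_t, R_t, S_t+1) tuples."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)  # oldest samples are dropped first

    def add(self, state, action, reward, next_state):
        self._data.append((state, action, reward, next_state))

    def is_full(self):
        return len(self._data) == self._data.maxlen

    def sample(self, batch_size, rng=random):
        """Draw a batch uniformly at random without replacement."""
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def total_iterations(epochs, n_samples, batch_size):
    """Epochs multiplied by the ratio of sample volume to batch size."""
    return epochs * (n_samples // batch_size)  # assumption: integer ratio


def update_schedule(n_updates, policy_delay=2):
    """Yield (step, update_actor); the critics are updated at every step."""
    for step in range(1, n_updates + 1):
        yield step, step % policy_delay == 0
```

A training loop would call `add` after each environment step, wait for `is_full()`, and then iterate over `update_schedule(total_iterations(...))`, updating the critics at every step and the actor only when the flag is true.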

Simulation Platform and Scenarios
To validate the proposed control method for CAVs, this study utilizes SUMO simulation software to develop an intersection traffic simulation platform. The simulation platform integrates various components, ensuring a comprehensive evaluation of the proposed control method for CAVs. This platform adopts a modular simulation approach, featuring a visual interface, result data collection, and other essential functions. The detailed architecture of this simulation platform is illustrated in Figure 4, providing a visual representation of the component interactions and their contributions to the system's overall functionality. The functional modules of the simulation platform illustrated in Figure 4 are described in detail in Table 1.

Table 1. Functional modules of simulation platform.

| Module | Function |
|---|---|
| Environmental construction | Creates and edits the road network, with specific functions such as determining the starting and ending points, adding traffic demands, dividing the lanes, and configuring signal timing. |
| Simulation operation | Runs the simulation program, with specific functions including setting the simulation duration, calculating the state space and reward function, and outputting the action strategy, as well as resetting the environment when the algorithm training termination conditions are met. |
| Data collection | Acquires and saves data, with specific functions including acquiring vehicle trajectory data through the TraCI interface, saving the simulation results in a numerical matrix, and outputting the data in *.csv file format for later organization and analysis. |
Within the platform, SUMO constructs the fundamental simulation environment and supplies the RL framework with essential simulation outcome data. Through the TraCI interface, the RL framework retrieves critical information for the evaluation, including intersection geometry information (lane and signal IDs), vehicle dynamics (speed, acceleration, and driving distance), and signal timing details (signal phase duration). Subsequently, an interactive environment integrating multiple data sources is developed to evaluate vehicle control algorithms via the RL framework.
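A sketch of how such information might be read through TraCI is shown below. The function `read_cav_observation` and its `lookahead` parameter are hypothetical, and the TraCI module is passed in as an argument so the function can also be exercised with a stand-in object when SUMO is not installed. Treating the time until the next phase switch as the remaining green time is a simplifying assumption, since consecutive phases may both be green for the same movement.

```python
def read_cav_observation(traci_api, veh_id, lookahead=200.0):
    """Read (x, v, a, dv, dx, phi, g) for one vehicle via a TraCI-like module."""
    v = traci_api.vehicle.getSpeed(veh_id)
    a = traci_api.vehicle.getAcceleration(veh_id)
    x = traci_api.vehicle.getDistance(veh_id)  # distance driven so far

    # getLeader returns (leader_id, gap) or nothing if no leader is in range.
    leader = traci_api.vehicle.getLeader(veh_id, lookahead)
    if not leader or not leader[0]:
        dv, dx = 0.0, lookahead  # assumption: free road ahead
    else:
        leader_id, gap = leader
        dv = traci_api.vehicle.getSpeed(leader_id) - v
        dx = gap

    # getNextTLS returns a list of (tls_id, link_index, distance, state).
    phi, g = 0, 0.0
    upcoming = traci_api.vehicle.getNextTLS(veh_id)
    if upcoming:
        tls_id, _link_index, _distance, link_state = upcoming[0]
        if link_state.lower() == "r":
            phi = 1
        elif link_state.lower() == "g":
            # Assumption: the green ends at the next scheduled phase switch.
            g = (traci_api.trafficlight.getNextSwitch(tls_id)
                 - traci_api.simulation.getTime())
    return (x, v, a, dv, dx, phi, g)
```

With SUMO available, the function would be called as `read_cav_observation(traci, "cav")` inside the simulation loop after a connection has been opened with `traci.start(...)`; the vehicle ID `"cav"` is a placeholder.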
For the analysis and verification of the proposed control method's effectiveness, this study establishes a simulation scenario utilizing road map information and signal timing data from Lankershim Boulevard, featured in the Next Generation Simulation (NGSIM) dataset. Several experiments are carried out in this simulation scenario to assess the performance of the CAV, providing a comprehensive examination of the control method's real-road applicability.
Figure 5 illustrates a simulation scene with four adjacent urban signalized intersections along Lankershim Boulevard. Upon entering each intersection, vehicles receive pertinent environmental information. The mixed traffic flow on the road is composed of the CAV and HVs, with HVs following SUMO's default Krauss car-following model and the CAV being algorithmically controlled. This research focuses on investigating the longitudinal car-following behavior of the CAV within mixed traffic flow, with known parameters such as signal timing and road length. In the experiment, the Krauss model and three RL algorithms, namely TD3, DDPG, and DQN, are adopted to govern the car-following movement of the CAV. Data from the simulation are gathered across various car-following modes for subsequent comparison and analysis.

Experimental Design and Parameter Settings
The simulation experiment's core modules in the RL environment consist of road network setup, agent state space and reward function calculation, simulation execution, and environmental resetting following each run. Road network setup involves determining start and end points, segmenting and linking lanes, and configuring signal timings. Information on simulated vehicles and roads is sourced from Traci for state-space computation. The overall reward in this simulation experiment is calculated by evaluating crucial factors, including energy consumption per unit, travel distance, estimated collision time, and driving comfort. Moreover, the environmental reset process entails clearing all data before each training episode to initialize the simulation environment. In the simulation operation module, the simulation terminates once the vehicle has traversed the entire road length, and state and reward values are returned to the main function after each iteration. Data from the simulations are stored in a numerical matrix format and output as text files for ease of sorting and analysis.
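The following Python sketch illustrates how a reward combining energy, efficiency, safety, and comfort terms could be assembled per simulation step. It is a hedged illustration only: the function names, the weights, the threshold, the penalty value, and the exact form of each term are assumptions, not the reward function defined by the authors. In particular, the sketch assumes that a step is unsafe when the time to collision drops below a threshold, and that all safe steps receive the same safety reward (zero).

```python
def time_to_collision(gap_m, ego_speed, lead_speed):
    """Time to collision in seconds; infinite when the gap is not closing."""
    closing_speed = ego_speed - lead_speed
    return gap_m / closing_speed if closing_speed > 0 else float("inf")


def total_reward(energy_kwh, distance_m, ttc_s, jerk,
                 weights=(1.0, 1.0, 1.0, 1.0),
                 ttc_threshold_s=2.0, unsafe_penalty=-100.0):
    """Weighted sum of four per-step terms (all parameter values are hypothetical)."""
    w_energy, w_efficiency, w_safety, w_comfort = weights
    r_energy = -energy_kwh      # less energy used is better
    r_efficiency = distance_m   # more distance per step is better
    # Assumption: every safe step gets the same safety reward (0.0),
    # while an unsafe step gets one large penalty.
    r_safety = unsafe_penalty if ttc_s < ttc_threshold_s else 0.0
    r_comfort = -abs(jerk)      # smoother driving is better
    return (w_energy * r_energy + w_efficiency * r_efficiency
            + w_safety * r_safety + w_comfort * r_comfort)


# A safe step: the leader is faster, so the gap is not closing.
safe_step = total_reward(0.002, 10.0, time_to_collision(20.0, 5.0, 10.0), 0.5)
# An unsafe step: closing at 20 m/s with a 20 m gap gives a TTC of 1 s.
unsafe_step = total_reward(0.002, 10.0, time_to_collision(20.0, 25.0, 5.0), 0.5)
```

With a penalty that dominates the other terms, any policy that maximizes the return is pushed to treat safety as a precondition rather than a trade-off.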
Four distinct control methods, namely the Krauss model, TD3, DQN, and DDPG, are employed for longitudinal following of the CAV during intersection navigation. Additionally, the lateral lane-changing behavior of vehicles is governed by the SUMO default lane-changing model (LC2013) [42]. Experiments are conducted to assess and compare the performance of the CAV under each control method to identify the most effective one. To alleviate the impact of the preceding vehicle on the training vehicle and prioritize safety, default values for speed differential and relative distance in state space are set as maximum speed limits and safety thresholds. For structural parameters, current market performance parameters for electric vehicles are used, and drag coefficients for vehicle motion are sourced from existing studies [43]. In addition, the hyperparameters of the RL algorithms are determined through numerous experiments. A hyperparameter in machine learning is a configuration setting that influences the behavior and performance of a model during training but cannot be learned directly from the data. Examples include learning rate, batch size, and discount factor. It is imperative to maintain consistency in environmental parameters across simulations, encompassing road and vehicle attributes. These essential parameters for simulation fidelity are detailed in Table 2.

Simulation Results and Discussion
Data from the simulation experiments are processed to analyze the vehicle's motion characteristics under Krauss, TD3, DDPG, and DQN car-following modes. Regarding the safety aspect of CAV driving, unsafe driving behavior is observed in the early iterations of the algorithm simulation, where the time to collision violates the predefined safety threshold. However, such instances cease to occur in subsequent iterations. This is attributed to the penalty mechanism implemented in the reward function, where a significant penalty is assigned to driving behaviors violating safety protocols, while actions complying with safe driving norms receive equal reward values. Essentially, safety rewards are structured to enforce safe driving as a non-negotiable prerequisite.
In summary, given that the CAV adheres to the fundamental requirement of safe driving across all modes, as evidenced by consistent safety reward values, our comparative analysis focuses on energy consumption, efficiency, and driving comfort among the different approaches. Notably, "efficiency" refers to the effectiveness of a vehicle's movement, quantified by factors such as travel time and mean velocity across intersections. The specific evaluation indicators include electricity consumption, travel time, and mean jerk. Findings are summarized in Table 3. The vehicle's motion performance, analyzed using various indices, is depicted in Figure 6.
Regarding efficiency and comfort, the TD3 algorithm exhibits superior performance over the other three car-following modes. For energy consumption, this study examines the electric energy consumed by the CAV to evaluate vehicle performance across various car-following modes. The application of the TD3 algorithm in CAV training reduces driving energy consumption by 4.33%, 15.79%, and 13.77% compared to DDPG, DQN, and Krauss, respectively. This underscores the proposed algorithm's effectiveness in energy conservation.
Based on trajectory data from simulations, a vehicle spacetime diagram is generated for intuitive comparison and analysis of vehicle traffic efficiency under various car-following modes. As illustrated in Figure 7, the CAV trained by the TD3 algorithm exhibits the longest driving distance per unit of time during most periods. Since the TD3 algorithm is an improvement on the DDPG algorithm, the curves corresponding to these two algorithms are very close in Figure 7. However, the action strategy obtained under DDPG algorithm training has an increased magnitude of variation at the beginning, as evidenced by the vehicle starting with greater acceleration. Although the CAV trained by DDPG [...] On the other hand, the CAV controlled by the TD3 algorithm travels at a relatively continuous speed overall and spends less time at intersections. Therefore, in terms of the final results, the TD3-controlled CAV requires the shortest travel time to traverse the entire length of the road. The TD3 mode reduces travel time by 4.57%, 12.56%, and 18.26% compared to DDPG, DQN, and Krauss, respectively. This emphasizes the effectiveness of the TD3 algorithm in enhancing travel time efficiency.
As the CAV controlled by the TD3 algorithm approaches each intersection, it receives SPaT information from roadside equipment, which is then used to determine the remaining duration of the green light in the current phase. The algorithm can then determine whether a smooth intersection passage is achievable and accelerate or decelerate accordingly. Therefore, optimal algorithmic control effectively reduces the time spent by the CAV stopping and waiting at red lights while maintaining a high average velocity.
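The decision described above can be illustrated with a simple rule-based check. The sketch below is not the learned TD3 policy, which arrives at such behavior through training; it is a hand-written kinematic approximation under stated assumptions: constant acceleration up to a speed limit, and hypothetical names and default values (`advise`, `max_accel`, `speed_limit`).

```python
def distance_within(time_s, speed, accel, speed_limit):
    """Distance covered in time_s when accelerating at accel up to speed_limit."""
    if accel <= 0 or speed >= speed_limit:
        return min(speed, speed_limit) * time_s
    t_ramp = (speed_limit - speed) / accel  # time needed to reach the limit
    if time_s <= t_ramp:
        return speed * time_s + 0.5 * accel * time_s ** 2
    ramp_distance = speed * t_ramp + 0.5 * accel * t_ramp ** 2
    return ramp_distance + speed_limit * (time_s - t_ramp)


def advise(distance_to_stop_line, speed, green_remaining_s,
           max_accel=2.0, speed_limit=15.0):
    """Return a coarse speed advice for the remaining green time (SI units)."""
    if speed * green_remaining_s >= distance_to_stop_line:
        return "keep speed"   # the current speed already clears the stop line
    reachable = distance_within(green_remaining_s, speed, max_accel, speed_limit)
    if reachable >= distance_to_stop_line:
        return "accelerate"   # passing is feasible with extra speed
    return "decelerate"       # cannot pass in time, so slow down early


print(advise(100.0, 10.0, 12.0))  # keep speed
print(advise(100.0, 10.0, 8.0))   # accelerate
print(advise(100.0, 10.0, 5.0))   # decelerate
```

Decelerating early when the green cannot be reached is what lets a vehicle roll toward the next green instead of stopping at the line.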
Figure 8 illustrates a violin diagram depicting the velocity distribution, offering a detailed analysis of the speed characteristics of the CAV. With TD3 algorithm training, the speeds of the CAV are notably concentrated in a higher range, achieving the highest average speed in comparison to other car-following modes.
Figure 9 compares the characteristics of the CAV's speed changes in various car-following modes. As depicted, the vehicle requires prompt speed adjustments near intersections, resulting in noticeable upward or downward shifts in the curve. Moreover, under the intelligent control of RL algorithms, the velocity of the CAV significantly improves compared to the traditional car-following model, with the TD3 algorithm reaching a relatively higher peak. The data reveal that the average driving speeds attained by TD3, DDPG, DQN, and Krauss modes are 13.41 km/h, 12.48 km/h, 11.01 km/h, and 10.46 km/h, respectively. In comparison with the DDPG, DQN, and Krauss modes, the TD3 mode demonstrates increases in average driving speed of 7.45%, 21.80%, and 28.20%, respectively. This underscores the effectiveness of the proposed method in enhancing the speed performance of the CAV.
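The relative increases can be recomputed directly from the reported average speeds. The short Python check below uses only the four averages given in the text; the helper name `relative_increase` is introduced here for illustration.

```python
# Average driving speeds reported in the text (km/h).
avg_speed = {"TD3": 13.41, "DDPG": 12.48, "DQN": 11.01, "Krauss": 10.46}


def relative_increase(new, base):
    """Percentage increase of new over base."""
    return (new - base) / base * 100.0


increase = {
    mode: round(relative_increase(avg_speed["TD3"], speed), 2)
    for mode, speed in avg_speed.items()
    if mode != "TD3"
}
print(increase)  # {'DDPG': 7.45, 'DQN': 21.8, 'Krauss': 28.2}
```

The smallest gain is over DDPG and the largest is over the Krauss model.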
The jerk during vehicle operation is utilized as an index for evaluating driving comfort, reflecting sudden speed changes as the CAV navigates through intersections, thereby indicating the smoothness of vehicle operation. Figure 10 visually represents the variation in acceleration during CAV driving under different car-following modes. As depicted in Figure 10, both the DQN and DDPG modes show considerable fluctuations in driving behavior, whereas the TD3 mode maintains more stable acceleration, mostly within the range of [−3, 4]. Comparative analysis of the four car-following modes reveals that in the TD3 car-following mode, the CAV acceleration curve is notably flat, and the average jerk is the smallest. This behavior is attributed to its driving pattern, which is characterized by gradual acceleration or early deceleration. Compared with the DDPG, DQN, and Krauss modes, the jerk is reduced by 8.42%, 28.97%, and 16.89%, respectively. This suggests that the CAV trained with TD3 shows notably better stability in driving performance.
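Jerk is the time derivative of acceleration, so a comfort index of this kind can be estimated from a sampled acceleration trace by finite differences. The sketch below assumes a mean absolute jerk over uniformly spaced samples; the exact definition used in the study is not restated here, and the function name and sample traces are illustrative.

```python
def mean_abs_jerk(accelerations, dt):
    """Mean absolute jerk from accelerations sampled every dt seconds."""
    if len(accelerations) < 2:
        raise ValueError("need at least two acceleration samples")
    jerks = [(a1 - a0) / dt
             for a0, a1 in zip(accelerations, accelerations[1:])]
    return sum(abs(j) for j in jerks) / len(jerks)


# Made-up acceleration traces in m/s^2, sampled once per second.
smooth_trace = [0.0, 0.5, 1.0, 1.0, 0.5]
abrupt_trace = [0.0, 2.0, -1.0, 3.0, -2.0]

print(mean_abs_jerk(smooth_trace, 1.0))  # 0.375
print(mean_abs_jerk(abrupt_trace, 1.0))  # 3.5
```

A flatter acceleration curve yields a smaller index, which is why gradual acceleration and early deceleration show up as lower average jerk.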

Conclusions
An RL-based control method is proposed for CAVs at signalized intersections, aiming to optimize vehicle performance by holistically addressing energy consumption, traffic efficiency, driving comfort, and safety. The MDP framework for CAV driving is specifically tailored for multi-intersection environments, with the driving strategy trained by the TD3 algorithm. Furthermore, simulations of urban scenarios with multiple intersections are conducted to investigate the motion characteristics of the vehicle under various car-following modes. Results indicate that the proposed methods yield a 13.77% reduction in energy consumption and a notable 18.26% decrease in travel time. The findings reveal that this method allows the CAV to dynamically adjust to traffic conditions, improving travel efficiency and driving comfort, which can also lead to reduced energy consumption.
Since the simulation experiment reported in this paper is conducted under ideal conditions, within a scene devoid of random occurrences such as spontaneous overtaking or abrupt failures, the simulation is a valid testbed for the testing of algorithmic behavior. Notably, changes in factors such as traffic density, communication network reliability, and vehicle characteristics can lead to variations in simulation results. In practical use, given the lack of perfect cooperation between CAVs and existing transportation infrastructure and the delay of communication systems, the actual optimization effect may require a percentage correction, which will be examined in a subsequent study.
The primary research focus of this paper is to utilize RL algorithms to control a single agent, specifically addressing the longitudinal car-following behavior of an individual CAV. Due to the potential for the variance of the gradient to escalate with an increasing number of agents, it becomes imperative to devise and implement multi-agent RL algorithms. Moreover, reducing the dimensionality of the state space becomes essential, particularly for traffic scenarios featuring multiple CAVs. This represents one of the key research directions we aim to advance in the future. Future research will also focus on the lateral motion of CAVs on the road, combining lane-changing models to thoroughly study the driving behavior of vehicles. Moreover, a comprehensive analysis will be conducted from

Figure 1. Schematic diagram of the traffic scene at the intersection.

Figure 2. Network structure of the twin delayed deep deterministic policy gradient (TD3) algorithm.
TD3 employs a parameterized actor neural network, which takes the state ($S_t$) as input and generates a continuous action ($A_t$) as output. Simultaneously, a parameterized critic neural network is utilized to take both the state ($S_t$) and action ($A_t$) as inputs and estimate the Q-value function. The parameters of the algorithm are manually adjusted through extensive simulations. Both the actor and critic neural networks utilize a two-hidden-layer architecture employing a multi-layer perceptron (MLP). The first layer of both the actor and critic networks consists of 400 neurons, while the second layer comprises 300 neurons. The network structure of the TD3 algorithm is intricately designed with distinct roles for each component. The actor network serves as the interface with the external intersection environment, managing the input and output of data. Concurrently, the set of transitions $(S_t, A_t, R_t, S_{t+1})$

Figure 3. Flow chart of the Connected and Automated Vehicles (CAV) control algorithm.
The environmental input data involve collecting information on the vehicle's speed (

Figure 4. Architecture of the simulation platform.

Figure 6. Comparison of results by several indices.

Figure 7. Spacetime diagram of CAV under different car-following modes.

Figure 8. Velocity distribution of CAV in different car-following modes.

Figure 9. Comparison of velocity curves of CAV in different car-following modes (0-150 s).

Figure 10. Variation of CAV acceleration with time in different car-following modes (0-150 s).

Table 1. Functional modules of simulation platform.

Table 3. Simulation results under different car-following modes.