Design and Experimental Validation of a Cooperative Adaptive Cruise Control System Based on Supervised Reinforcement Learning

This paper presents a supervised reinforcement learning (SRL)-based framework for longitudinal vehicle dynamics control of a cooperative adaptive cruise control (CACC) system. A supervisor network trained by real driving data is incorporated into the actor-critic reinforcement learning approach. In the SRL training process, the actor and critic networks are updated under the guidance of the supervisor and the gain scheduler. As a result, the training success rate is improved, and the driver characteristics can be learned by the actor to achieve a human-like CACC controller. The SRL-based control policy is compared with a linear controller in typical driving situations through simulation, and the control policies trained by drivers with different driving styles are compared using a real driving cycle. Furthermore, the proposed control strategy is demonstrated by a real vehicle-following experiment with different time headways. The simulation and experimental results not only validate the effectiveness and adaptability of the SRL-based CACC system, but also show that it can provide natural following performance like human driving.


Introduction
With the growing demand for transportation in modern society, concern about road safety and traffic congestion is increasing [1,2]. Intelligent transportation systems are regarded as a promising solution to these challenges due to their potential to significantly increase road capacity, enhance safety, and reduce fuel consumption [3,4]. As a representative application for intelligent transportation, adaptive cruise control (ACC) systems have been extensively studied by academia and industry and are commercially available in a wide range of passenger vehicles. ACC utilizes a range sensor (camera and/or radar) to measure the inter-vehicle range and the relative velocity, and controls the longitudinal motion of the host vehicle to maintain a safe distance from the preceding vehicle [5]. As a result, the driver is freed from frequent acceleration and deceleration operations, and driving comfort and safety are improved. In recent years, the development of vehicle-to-vehicle (V2V) communication technologies, such as DSRC and LTE-V [6,7], has provided an opportunity for the host vehicle to exchange information with its surrounding vehicles. The resulting functionality, named cooperative adaptive cruise control (CACC), utilizes information on the forward vehicle or vehicles via V2V, as well as the inter-vehicle distance and relative velocity. Compared with conventional ACC, the wireless communication devices for a CACC system increase the cost of on-board hardware. The control strategy also becomes more complicated due to additional system state variables and the topological variety of platoons, which are even more critical when considering time delay, packet loss, and quantization error in the communications [8,9]. In [10], it is demonstrated that the string stability domains shrink when the packet drop ratio or the sampling time of communication increases, and that above a critical limit string stability cannot be achieved. In [11], an optimal control strategy is employed to develop a synthesis strategy for both distributed controllers and the communication topology that guarantees string stability. On the other hand, CACC is more attractive than conventional autonomous ACC, because the system behavior is more responsive to changes in the preceding vehicle speed, thereby enabling shorter following gaps and enhancing traffic throughput, fuel economy, and road safety [12]. Simulation results revealed that the freeway capacity increases quadratically as the CACC market penetration increases, with a maximum value of 3080 veh/h/lane at 100% market penetration, which is roughly 63% higher than at 0% market penetration [13,14].
The earliest research on CACC dates to the PATH program [15] in the US, which was followed by the SARTRE project [16,17] in Europe and the Energy ITS project [18] in Japan. Another demonstration of CACC is the Grand Cooperative Driving Challenge (GCDC), which has been held twice in the Netherlands, in 2011 and 2016. In GCDC, several vehicles cooperated in both urban and highway driving scenarios to facilitate the deployment and research of cooperative driving systems based on a combination of communication and state-of-the-art sensor fusion and control [19,20]. Most research on CACC has focused on vehicle dynamics control in the longitudinal direction to achieve stable following and coordination of the platoon. Linear feedforward and feedback controllers are widely used due to their advantages of simple structure and convenience in hardware implementation [21–23]. In this approach, the controller uses the acceleration of the preceding vehicle and tracking errors as the feedforward and feedback signals, respectively, to determine the desired acceleration of the host vehicle. However, due to the nonlinearity of vehicle dynamics and uncertainty of the environment, sets of controller parameters need to be tuned manually, and it is difficult for controllers to be adaptive and robust to unknown disturbances. Model predictive control (MPC) has also been introduced for CACC by forecasting system dynamics and explicitly handling actuator and state constraints to generate an optimal control command [19,24–27]. By solving the optimal control problem over a finite horizon in a receding manner, multiple control objectives, such as tracking accuracy, driver comfort, and fuel economy, can be balanced within the cost function. MPC is also superior in terms of constraint satisfaction. However, the MPC design involves more parameters and thus requires more time to adjust [19,25]. In addition, a detailed system dynamics model with high fidelity is needed to predict system states accurately, and solving the nonlinear optimization problem incurs a relatively high computational cost, which impedes real-time implementation with a short control sampling time [27]. Moreover, due to CACC's characteristics of a shorter inter-vehicle gap and quicker response, some researchers have pointed out that it is necessary to consider driver psychology and driving habits in the controller design to gain better human acceptance [9,28]. In [29], a human-aware autonomous control scheme for CACC is proposed by blending a data-driven self-learning algorithm with MPC. The simulation results showed that vehicle jerk can be reduced while maintaining a safe following distance.
As an important approach for solving complex sequential decision or control problems, reinforcement learning (RL) has been widely studied in the community of artificial intelligence and machine learning [30–32]. Recently, RL has been increasingly used in vehicle dynamics control to derive near-optimal or suboptimal control policies. In [33], an RL-based real-time and robust energy management approach is proposed for reducing the fuel consumption of a hybrid vehicle. The power request is modeled as a Markov chain, and the Q-learning algorithm is applied to calculate a discrete, near-optimal policy table. In [34], RL is used to design a full-range ACC system, in which the inducing region technique is introduced to guide the learning process for fast convergence. In [35], a novel actor-critic approach that uses a PD controller to pre-train the actor at each step is proposed, and it is proved that the estimation error is uniformly ultimately bounded. Charles et al. use function approximation along with gradient-descent learning algorithms as a means of directly modifying a control policy for CACC [36]. The simulation results showed that this approach can result in an efficient policy, but the oscillatory behavior of the control policy must be further addressed by using continuous actions. In [37], a parameterized batch RL algorithm for near-optimal longitudinal velocity tracking is proposed, in which parameterized feature vectors based on kernels are learned from collected samples to approximate the value functions and policies, and the effectiveness of the controller is validated on an autonomous vehicle platform.
In this paper, we propose a supervised reinforcement learning (SRL) algorithm for the CACC problem, in which an actor-critic architecture is adopted, because it can map a continuous state space to the control command and the policy can be updated directly by standard supervised learning methods [38]. The main contributions of this study are the following: (1) A supervisor network trained by collected driving samples of the human driver is combined with the actor-critic algorithm to guide the reinforcement learning process. (2) The composite output of the supervisor and the actor is applied to the system, and the proportion between supervised learning and reinforcement learning is adjusted by the gain scheduler during the training process. Using this method, the success rate of the training process is improved, and the driver characteristics are incorporated into the obtained control policy to achieve a human-like CACC controller. (3) The CACC test platform is developed based on two electric vehicles (EVs) and a rapid control prototyping system. The performance and learning ability of the SRL-based control policy are validated by simulations and real vehicle-following experiments.
The rest of the paper is organized as follows. Section 2 describes the control framework and the test platform. Section 3 presents the SRL-based control approach for solving the CACC problem. The simulation and experimental results for real vehicle-following control are presented in Section 4. Section 5 concludes the paper.

Control Framework for CACC
The fundamental functionality of the CACC system is to regulate the acceleration of the host vehicle, such that the relative velocity between the host vehicle and its nearest preceding vehicle, and the error between the inter-vehicle distance and the desired distance, converge to zero. In this study, the control framework is designed based on a fully control-by-wire EV. As illustrated in Figure 1, the CACC system has two separate layers. The upper-level controller generates the desired acceleration $a_d$ by using input signals, including the inter-vehicle distance $d_r$ and relative velocity $v_r$ measured by radar, the acceleration of the preceding vehicle $a_p$ transmitted via V2V communication, and the velocity and acceleration of the host vehicle measured by the on-board sensor. The lower-level controller manipulates the motor output torque or the braking pressure so that the acceleration of the vehicle tracks the desired acceleration. The accelerator pedal signal $\alpha$ and braking pressure $P_b$ are determined by a motor torque map and a braking look-up table, respectively, which are obtained from driving tests. An incremental PID controller and the acceleration/deceleration switching logic are also included in the lower-level module, although they are not shown in the figure. A similar two-layer control framework for ACC can be found in [5].

Hardware Implementation for the Test Platform
As shown in Figure 2, the experimental platform is built on two electric passenger vehicles, which serve as the host vehicle and the preceding vehicle. The host vehicle, produced by the BAIC company, is front-wheel drive and equipped with a 100-kW driving motor, a single-speed gearbox, and a 41.4-kWh lithium battery pack.
A Cohda Wireless MK5 OBU (On-Board Unit), which supports the SAE J2735 DSRC standard, is used to transmit information from the preceding vehicle to the host vehicle. The data receiving and sending program, "BSM-Shell", runs in the embedded Linux system of the OBUs. Both vehicles are equipped with the NAV-982 inertial navigation system (INS) combined with a Global Positioning System (GPS) to measure acceleration and record the position of the vehicle. On the preceding vehicle, a Raspberry Pi with extended RS232 and CAN interfaces serves as the bridge between the NAV-982 and the OBU and parses the raw data from the NAV-982. A Delphi 77-GHz millimeter-wave radar is installed on the front of the host vehicle to measure the distance and relative velocity of the preceding vehicle. A rapid control prototyping system is used to test the control strategy. The control algorithm is programmed in the MATLAB/Simulink environment and run in the dSPACE Autobox controller, equipped with a DS1007 processor board and a DS2202 I/O board. The control signals for acceleration and braking are sent to the lower-level control units through the CAN bus. In addition, INS data parsing, radar signal processing, and the target tracking algorithm are executed on the dSPACE controller. System states are monitored on the host PC through Ethernet. The sampling time of the system is 50 ms.

The SRL-Based Control Strategy for CACC
The principle of RL is learning an optimal policy, i.e., a mapping from states to actions that optimizes some performance criterion. Relying on RL alone, the learner is not told which actions to take but instead must discover which actions yield the greatest reward by trial and error [30]. Some researchers have proposed that the use of supervisory information can effectively make a learning problem easier to solve [38]. For a vehicle dynamics control problem, the richness of the human driver's operation data can be employed to supervise the learning process of RL. In this section, an SRL-based control algorithm for CACC is proposed, as shown in Figure 3. The actor-critic architecture is used to establish the continuous relationships between states and actions and the relationships among states, actions, and the control performance [39,40]. The supervisor provides the actor with hints about which action may or may not be promising for a specific state, and the gain scheduler blends the actions from the supervisor and the actor into a composite action that is sent to the system. The system responds to the input action with a transition from the current state to the next state and gives a reward to the action. More details about each module are described as follows.

System Dynamics
The primary control objective is to make the host vehicle follow the preceding vehicle at a desired distance $d_d(t)$. The constant time headway spacing policy, which is the most commonly used, is adopted here [22–24]. The desired inter-vehicle distance is proportional to velocity:

$$d_d(t) = d_0 + h\,v_h(t) \tag{1}$$

in which $d_0$ is the desired safe distance at standstill, $h$ is the time headway, and $v_h(t)$ is the velocity of the host vehicle.
The longitudinal dynamics model for the host vehicle must consider the powertrain, braking system, aerodynamic drag, longitudinal tire forces, rolling resistances, etc. The following assumptions are made to obtain a suitable control-oriented model: (1) The tire longitudinal slip is negligible, and the lower-level dynamics are lumped into a first-order inertial system; (2) The vehicle body is rigid; (3) The influence of pitch and yaw motions is neglected. Then, the nonlinear longitudinal dynamics are described by model (2), in which $s_h(t)$ is the position of the host vehicle; $m$ is the vehicle mass; $\eta$ is the efficiency of the driveline; $T(t)$ and $T_{des}(t)$ are the actual and desired driving/braking torque, respectively; $r$ is the wheel radius; $C_D$ is the aerodynamic coefficient; $A$ is the frontal area; $g$ is the acceleration of gravity; $f$ is the rolling resistance coefficient; and $\tau$ is the inertial delay of vehicle longitudinal dynamics.
With the exact feedback linearization technique [41], the nonlinear model (2) is converted to a linear model, in which $u(t)$ is the control input after linearization, i.e., the desired acceleration, and we have

$$\tau\,\dot{a}_h(t) + a_h(t) = u(t)$$

in which $a_h(t) = \dot{v}_h(t)$ denotes the acceleration of the host vehicle. The third-order state space model for the vehicle's longitudinal dynamics is derived as

$$\dot{x}_h(t) = A\,x_h(t) + B\,u(t), \quad x_h(t) = \begin{bmatrix} s_h(t) \\ v_h(t) \\ a_h(t) \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & -1/\tau \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ 1/\tau \end{bmatrix} \tag{5}$$

Using the Euler discretization approach, the state equation of continuous system (5) is discretized with a fixed sampling time $\Delta t$ as follows:

$$x_h(t+1) = x_h(t) + \Delta t\,\bigl(A\,x_h(t) + B\,u(t)\bigr) \tag{6}$$

Considering the vehicle following system of CACC, we define the system state variables as $x(t) = [e_d(t), v_r(t), a_r(t)]$. $e_d(t)$, $v_r(t)$, and $a_r(t)$ are the inter-vehicle distance error, relative velocity, and relative acceleration, respectively, denoted as

$$e_d(t) = s_p(t) - s_h(t) - d_d(t), \quad v_r(t) = v_p(t) - v_h(t), \quad a_r(t) = a_p(t) - a_h(t)$$

in which $s_p(t)$, $v_p(t)$, and $a_p(t)$ are the position, velocity, and acceleration of the preceding vehicle, respectively. After taking the action $u(t)$ in state $x(t)$, the system will go to the next state $x(t+1)$ according to the state transition equations.
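To make the discretized following model concrete, the sketch below advances the host vehicle by one forward-Euler step of the lumped first-order acceleration model and evaluates the following state $[e_d, v_r, a_r]$ under the constant time headway policy. It is a minimal illustration, not the authors' implementation: the names `HostState`, `step_host`, and `following_state` are hypothetical, the default `d0` and `h` values are placeholders, and the sign convention of the distance error (actual gap minus desired gap) is an assumption. The values `tau=0.15` and `dt=0.05` are taken from the identified acceleration delay and the 50 ms sampling time reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    s: float  # position s_h [m]
    v: float  # velocity v_h [m/s]
    a: float  # acceleration a_h [m/s^2]


def desired_gap(v_h, d0=5.0, h=1.0):
    """Constant time headway policy: standstill gap plus headway times speed.

    d0 and h are placeholder values chosen for illustration only.
    """
    return d0 + h * v_h


def step_host(x, u, tau=0.15, dt=0.05):
    """One forward-Euler step of s' = v, v' = a, tau * a' + a = u."""
    return HostState(
        s=x.s + dt * x.v,
        v=x.v + dt * x.a,
        a=x.a + dt * (u - x.a) / tau,
    )


def following_state(host, s_p, v_p, a_p, d0=5.0, h=1.0):
    """Return (e_d, v_r, a_r); e_d is assumed to be actual gap minus desired gap."""
    e_d = (s_p - host.s) - desired_gap(host.v, d0, h)
    v_r = v_p - host.v
    a_r = a_p - host.a
    return e_d, v_r, a_r


if __name__ == "__main__":
    host = HostState(s=0.0, v=10.0, a=0.0)
    for _ in range(100):  # 5 s of a constant 1 m/s^2 acceleration command
        host = step_host(host, u=1.0)
    print(host)
```

With a constant command, the simulated acceleration settles at the commanded value, which is the behavior expected of a first-order lag.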

The SRL Control Algorithm
There are three neural networks in the proposed SRL control algorithm: the actor, the critic, and the supervisor. The actor network is responsible for generating the control command according to the states. The critic network is used to approximate the discounted total reward-to-go and evaluate the performance of the control signal. The supervisor is used to model a human driver's behavior and provide the predicted control signal of the driver to guide the training process of the actor and critic.

The Actor Network
The input of the actor network is the system state $x(t) = [e_d(t), v_r(t), a_r(t)]$, and the output is the optimal action. A simple three-layered feed-forward neural network is adopted for both the actor and critic according to [39]. The output of the actor network is depicted as

$$u_a(t) = f\!\left(\sum_{i=1}^{N_{ha}} \omega^{(2)}_{a_i}\, f\!\left(\sum_{j=1}^{n} \omega^{(1)}_{a_{i,j}}\, x^{(a)}_j(t)\right)\right)$$

in which $f(\cdot)$ is the hyperbolic tangent activation function; $x^{(a)}_j$ denote the inputs of the actor network; $\omega^{(1)}_{a_{i,j}}$ represent the weights connecting the input layer to the hidden layer; $\omega^{(2)}_{a_i}$ represent the weights connecting the hidden layer to the output layer; $N_{ha}$ is the number of neurons in the hidden layer of the actor network; and $n$ is the number of state variables.
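The forward pass of such a three-layer network can be sketched in a few lines. The example below assumes a hyperbolic tangent on both the hidden and output layers, no bias terms, and ten hidden neurons (the size reported later for training); the function names `init_actor` and `actor_forward` and the uniform weight initialization are hypothetical choices made for illustration only.

```python
import math
import random


def init_actor(n_inputs=3, n_hidden=10, seed=0):
    """Randomly initialize input-to-hidden (w1) and hidden-to-output (w2) weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    return w1, w2


def actor_forward(x, w1, w2):
    """Map the state x = [e_d, v_r, a_r] to a normalized action in (-1, 1)."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in w1]
    return math.tanh(sum(w * g for w, g in zip(w2, hidden)))


if __name__ == "__main__":
    w1, w2 = init_actor()
    print(actor_forward([0.5, -0.2, 0.1], w1, w2))
```

Because the output passes through a hyperbolic tangent, the action is bounded in magnitude by one, which matches a normalized command that is rescaled before being applied to the vehicle.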

The Critic Network
The inputs of the critic network are the system state $x(t)$ and the composite action $u(t)$. The output of the critic, the $J$ function, approximates the future accumulative reward-to-go value $R(t)$ at time $t$. Specifically, $R(t)$ is defined as

$$R(t) = r(t+1) + \alpha\,r(t+2) + \cdots = \sum_{k=0}^{T} \alpha^{k}\, r(t+k+1)$$

in which $\alpha$ is the discount factor for the infinite-horizon problem ($0 < \alpha < 1$). The discount factor determines the present value of future rewards: a reward received $k$ time steps in the future is worth $\alpha^{k-1}$ times what it would be worth if it were received immediately [30]. $T$ is the final time step. $r(t)$ is the reward provided by the environment, weighted by the positive factors $k_1$, $k_2$, and $k_3$. For the sake of control accuracy and driving comfort, $r(t)$ is formulated as the negative weighted sum of the quadratic forms of the distance error, the relative velocity, and the fluctuation of the host vehicle's acceleration. Thus, a higher accumulative reward-to-go value $R(t)$ indicates better performance of the action.
In the critic network, the hyperbolic tangent and linear-type activation functions are used in the hidden layer and the output layer, respectively. The output $J(t)$ has the following form:

$$J(t) = \sum_{i=1}^{N_{hc}} \omega^{(2)}_{c_i}\, f\!\left(\sum_{j=1}^{n+1} \omega^{(1)}_{c_{i,j}}\, x^{(c)}_j(t)\right)$$

in which $x^{(c)}_j$ denote the inputs of the critic network, $\omega^{(1)}_{c_{i,j}}$ represent the weights connecting the input layer to the hidden layer, $\omega^{(2)}_{c_i}$ represent the weights connecting the hidden layer to the output layer, and $N_{hc}$ is the number of neurons in the hidden layer of the critic network.
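As a small worked example of the quantity the critic is asked to approximate, the sketch below evaluates a quadratic reward and the discounted reward-to-go over a finite reward sequence. It is illustrative only: the function names are hypothetical, the weights `k1`, `k2`, `k3` default to placeholder values, and the acceleration fluctuation is represented by a generic argument `da_h` because its exact definition is not recoverable from the text. The discount factor 0.9 is the value reported for training.

```python
def reward(e_d, v_r, da_h, k1=1.0, k2=1.0, k3=1.0):
    """Negative weighted sum of squared distance error, relative velocity,
    and acceleration fluctuation (da_h is an assumed stand-in for the latter)."""
    return -(k1 * e_d ** 2 + k2 * v_r ** 2 + k3 * da_h ** 2)


def reward_to_go(rewards, alpha=0.9):
    """Discounted sum of future rewards.

    rewards[k] is the reward received k + 1 steps ahead, so the result is
    rewards[0] + alpha * rewards[1] + alpha**2 * rewards[2] + ...
    """
    total = 0.0
    for r in reversed(rewards):
        total = r + alpha * total
    return total


if __name__ == "__main__":
    print(reward(1.0, 2.0, 0.5))
    print(reward_to_go([-1.0, -1.0, -1.0]))
```

Since every reward is non-positive, the reward-to-go is also non-positive and approaches zero only when the tracking errors and acceleration fluctuation vanish.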

The Supervisor
Driver behavior can be modeled with parametric models, such as the SUMO model and the Intelligent Driver Model, and non-parametric models, such as the Gaussian Mixture Regression model and artificial neural network model [42]. In [43,44], a neural network-based approach for modelling driver behavior is investigated. In this part, the driver behavior is modeled by a feed-forward neural network with the same structure as the actor network. The driver's operation data in a real vehicle-following scenario can be collected to form a dataset $D = \{e_d(t), v_r(t), a_r(t), a_{des}(t)\}$. Note that here we use the desired acceleration $a_{des}(t)$ as the driver's command instead of the accelerator pedal signal $\alpha$ and braking pressure $P_b$. The supervisor network can be trained with dataset $D$ to predict the driver's command $a_{des}(t)$ according to a given state $[e_d(t), v_r(t), a_r(t)]$. The weights are updated by prediction error back-propagation, and the Levenberg-Marquardt method is employed to train the network until the weights converge.

The Gain Scheduler
As mentioned above, the actor network, the supervisor network, and the gain scheduler generate a composite action for the host vehicle. The gain scheduler computes a weighted sum of the actions from the supervisor and the actor as follows:

$$u(t) = k_s\,u_E(t) + (1 - k_s)\,u_s(t)$$

in which $u_s(t)$ is the output of the supervisor, i.e., the prediction of the driver's desired acceleration with regard to the current state. $u_s(t)$ is normalized within the range $[-1, 1]$ m/s$^2$. $u_E(t)$ is the exploratory action of the actor, $u_E(t) = u_a(t) + N(0, \sigma)$. $N(0, \sigma)$ denotes a random noise with zero mean and variance $\sigma$. The parameter $k_s$ weights the control proportion between the actor and the supervisor, $0 \le k_s \le 1$. This parameter is important for the supervised learning process, because it determines the autonomy level of the actor or the guidance intensity of the supervisor. Generally, the value of $k_s$ varies with the state. It can be adjusted by the actor, the supervisor, or a third party.
Considering the control requirement and the comfort of the driver and passengers, the range of $u(t)$, which is within $[-1, 1]$, is transformed into the range $[-2, 2]$ m/s$^2$ before being applied to the system.
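The blending performed by the gain scheduler can be sketched as follows. The example assumes that $k_s$ multiplies the exploratory actor action and $(1 - k_s)$ multiplies the supervisor action, which agrees with the training description in which a small initial $k_s$ lets the supervisor dominate. The clipping to $[-1, 1]$ and the linear scaling to $[-2, 2]$ m/s$^2$ are assumptions about how the transformation is carried out, and the function names are hypothetical.

```python
import math
import random


def composite_action(u_a, u_s, k_s, sigma=0.05, rng=random):
    """Blend the exploratory actor action with the supervisor action.

    sigma is treated as the noise variance, so the standard deviation
    passed to the Gaussian sampler is sqrt(sigma).
    """
    noise = rng.gauss(0.0, math.sqrt(sigma)) if sigma > 0.0 else 0.0
    u_e = u_a + noise                      # exploratory action of the actor
    u = k_s * u_e + (1.0 - k_s) * u_s      # weighted sum from the gain scheduler
    return max(-1.0, min(1.0, u))          # assumed clipping to the normalized range


def to_acceleration(u, a_max=2.0):
    """Assumed linear scaling of the normalized action to [-a_max, a_max] m/s^2."""
    return a_max * u


if __name__ == "__main__":
    u = composite_action(u_a=0.5, u_s=-0.5, k_s=0.2, sigma=0.0)
    print(u, to_acceleration(u))
```

With `k_s = 0.2` the composite action stays close to the supervisor's prediction; as `k_s` grows, the actor's own action takes over.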

SRL Learning
During the learning process, the weights of the actor and critic are updated with error back-propagation at each time step. After taking the composite action $u(t)$, the system transits to the next state $x(t+1)$ and gives a reward $r(t)$. For the critic network, the prediction error $e_c(t)$ has the same expression as the temporal difference (TD) error. The squared error objective function is calculated as

$$E_c(t) = \tfrac{1}{2}\,e_c^2(t)$$

The objective function is minimized by the gradient descent approach as

$$\omega_c(t+1) = \omega_c(t) + \Delta\omega_c(t), \quad \Delta\omega_c(t) = l_c(t)\left[-\frac{\partial E_c(t)}{\partial \omega_c(t)}\right]$$

in which $l_c(t)$ is the learning rate of the critic network at time $t$, $0 < l_c(t) < 1$.
The adaptation of the actor network is regulated by the gain scheduler. The weights of the actor are updated according to the following rule:

$$\omega_a(t+1) = \omega_a(t) + k_s\,\Delta\omega_a^{RL}(t) + (1 - k_s)\,\Delta\omega_a^{SL}(t)$$

in which $\Delta\omega_a^{RL}(t)$ and $\Delta\omega_a^{SL}(t)$ are the updates based on RL and the supervised learning, respectively. The error and objective function of RL are defined as

$$e_a^{RL}(t) = J(t) - U_c(t), \quad E_a^{RL}(t) = \tfrac{1}{2}\left[e_a^{RL}(t)\right]^2$$

in which $U_c(t)$ is the desired control objective. Here, $U_c(t)$ is set to 0, because the reward tends to zero if optimal actions are being taken. The supervisory error and its objective function are calculated as

$$e_a^{SL}(t) = u_s(t) - u_a(t), \quad E_a^{SL}(t) = \tfrac{1}{2}\left[e_a^{SL}(t)\right]^2$$

The RL and supervisory objective functions are minimized by the gradient descent approach, and thus the reinforcement-based update and the supervisory update are calculated as

$$\Delta\omega_a^{RL}(t) = l_a(t)\left[-\frac{\partial E_a^{RL}(t)}{\partial \omega_a(t)}\right], \quad \Delta\omega_a^{SL}(t) = l_a(t)\left[-\frac{\partial E_a^{SL}(t)}{\partial \omega_a(t)}\right]$$

in which $l_a(t)$ is the learning rate of the actor network at time $t$, $0 < l_a(t) < 1$.
The complete SRL algorithm is described in detail with the pseudocode as Algorithm 1.
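The two ingredients of the update, a TD-style critic error and a gain-scheduled mix of the reinforcement and supervisory gradients, can be sketched as below. This is not the paper's Algorithm 1: the TD error uses the textbook one-step form, which may differ from the exact expression used by the authors, the gradients are passed in as plain lists rather than computed by back-propagation, and the function names are hypothetical.

```python
def td_error(r, j_now, j_next, alpha=0.9):
    """Textbook one-step TD error (assumed form): r + alpha * J(next) - J(now)."""
    return r + alpha * j_next - j_now


def blended_actor_step(weights, grad_rl, grad_sl, k_s, l_a):
    """One gradient-descent step on the actor weights.

    grad_rl and grad_sl are the gradients of the reinforcement and supervisory
    objectives with respect to the weights; k_s weights the reinforcement part
    and (1 - k_s) the supervisory part.
    """
    return [
        w - l_a * (k_s * g_rl + (1.0 - k_s) * g_sl)
        for w, g_rl, g_sl in zip(weights, grad_rl, grad_sl)
    ]


if __name__ == "__main__":
    print(td_error(r=-1.0, j_now=-10.0, j_next=-10.0))
    print(blended_actor_step([1.0, 2.0], [1.0, 0.0], [0.0, 1.0], k_s=0.25, l_a=0.1))
```

Setting `k_s` to 0 reduces the step to pure supervised learning from the driver model, and setting it to 1 reduces it to a pure actor-critic update.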

Vehicle Model Validation
As described in Section 3.1, the longitudinal dynamics of the vehicle are treated as a first-order inertial system for simplicity. The inertial delay $\tau$ is identified by the step-response approach. In the driving test on a flat road, step acceleration and braking commands are given to the lower-level controller, and the actual acceleration response is measured. The model parameters for acceleration and braking are estimated using a least-squares method as follows: $\tau_{acc} = 0.15$, $\tau_{brake} = 0.52$. The validation results for the identified vehicle model, as shown in Figure 4, indicate that the simulation output of the model corresponds well with the measured actual acceleration curve. Thus, the vehicle model adequately describes the longitudinal vehicle dynamics. Notably, the shorter inertial delay for acceleration is largely attributable to the fast response of the pure electric powertrain of the test vehicle.
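One way to carry out such an identification is sketched below: a first-order lag is simulated with forward Euler, and $\tau$ is recovered by a one-parameter least-squares fit of the discrete model. The paper does not specify its least-squares formulation, so this particular regression is an assumption, the data are synthetic and noise-free, and the function names are hypothetical. The 50 ms step matches the platform's sampling time.

```python
def step_response(tau, dt=0.05, n=60, u=1.0):
    """Simulate tau * a' + a = u from rest with forward Euler; returns n samples."""
    a, samples = 0.0, []
    for _ in range(n):
        samples.append(a)
        a += dt * (u - a) / tau
    return samples


def fit_tau(samples, dt=0.05, u=1.0):
    """Least-squares estimate of tau from a step response.

    Fits the slope c in a[k+1] - a[k] = c * (u - a[k]) and returns dt / c.
    """
    pairs = list(zip(samples, samples[1:]))
    num = sum((u - a0) * (a1 - a0) for a0, a1 in pairs)
    den = sum((u - a0) ** 2 for a0, _ in pairs)
    return dt * den / num


if __name__ == "__main__":
    for tau_true in (0.15, 0.52):  # identified values reported in the paper
        print(tau_true, fit_tau(step_response(tau_true)))
```

On measured data the same regression would be applied to the recorded acceleration samples, with the fit quality depending on sensor noise and on how closely the vehicle behaves like a first-order lag.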

Training Process
First, the human drivers' manual driving data are collected with our test platform in the vehicle-following scenario. Two of our colleagues are selected to drive the host vehicle. Both have proficient driving skills but different driving styles. The first driver (Driver 1) is aggressive and tends to use large acceleration and deceleration, whereas the second driver (Driver 2) is non-aggressive, with smoother driving. The data collection for each driver lasts approximately 10 min, with a sampling time of 50 ms. Each driver operates the vehicle according to their own driving habits. Then, the supervisor networks with three layers and ten hidden neurons are trained on the collected samples with the MATLAB Neural Network Toolbox. The maximum number of iteration steps, performance goal, and learning rate are set as 1000, 0.001, and 0.01, respectively. Two supervisor networks with different driving styles are obtained from Driver 1 and Driver 2 for the next step.
In the SRL stage, Algorithm 1 is performed for the CACC control. It should be mentioned that the convergence analysis of the actor-critic reinforcement learning algorithm has been given in [45]. It is shown that the estimation error and the weights in the action and critic networks remain uniformly ultimately bounded (UUB) under the given conditions. Here, the training parameters are selected on the basis of the UUB conditions provided by [45]. Both the actor and critic networks have three layers and ten hidden neurons. The initial learning rate of the critic and actor networks is set as 0.3 to accelerate the convergence and decreases by 0.05 at each time step until it reaches 0.003. The discount factor is 0.9. The variance σ for action exploration is 0.05. For the gain scheduler, we expect that at the beginning of the training process, the actor network is updated dominantly by the supervisor. The weight of the reinforcement update then increases gradually, such that a near-optimal policy can finally be obtained. Thus, the initial interpolation parameter k_s is set as 0.2 and increases by 0.004 at each time step until it reaches 0.8. A typical driving profile is constructed as the training cycle [34,36].
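The two schedules described above can be sketched as follows. The learning-rate and k_s values are those stated in the text; the linear blending of the actor and supervisor actions is an assumption (a common form of composite action in supervised actor-critic methods) and may differ from the exact expression used in Algorithm 1. All function names are hypothetical.

```python
def learning_rate(step, initial=0.3, decrement=0.05, floor=0.003):
    """Learning rate after `step` time steps: linear decay, clipped at the floor."""
    return max(floor, initial - decrement * step)

def interpolation_gain(step, initial=0.2, increment=0.004, ceiling=0.8):
    """Interpolation parameter k_s after `step` time steps: linear growth, clipped at the ceiling."""
    return min(ceiling, initial + increment * step)

def composite_action(k_s, actor_action, supervisor_action):
    """ASSUMED blend: k_s weights the actor, (1 - k_s) weights the supervisor."""
    return k_s * actor_action + (1.0 - k_s) * supervisor_action

# Early in training the supervisor dominates; later the actor does.
early = composite_action(interpolation_gain(0), actor_action=1.0, supervisor_action=0.0)
late = composite_action(interpolation_gain(1000), actor_action=1.0, supervisor_action=0.0)
```

Under these values, k_s reaches its ceiling after 150 time steps, whereas the learning rate reaches its floor after only a few steps.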
In our case, the preceding vehicle starts driving at a constant velocity of 50 km/h for 50 s. Then, two step acceleration signals, 0.42 m/s² and 0.83 m/s², and two step deceleration signals, −0.42 m/s² and −0.83 m/s², are successively given; each signal lasts 20 s. At 140 s, two periods of sine wave acceleration signals are given with a 40-s period and a 1 m/s² amplitude; then, the preceding vehicle drives at a constant velocity until the terminal time of 200 s. The initial velocity of the host vehicle is 60 km/h, and the initial inter-vehicle distance is 20 m. The host vehicle learns to follow the preceding vehicle while keeping a specific desired following range of 1 s time headway. The training step size is set as 1 s [34].
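A possible construction of the preceding-vehicle profile is sketched below. Several details are assumptions rather than statements from the text: the four step signals are taken to start immediately after the 50 s constant-velocity segment, the acceleration is taken as zero between 130 s and 140 s, the sinusoid is simply cut off at the 200 s terminal time, and the velocity is obtained by forward-Euler integration with the 1 s training step. The function name is hypothetical.

```python
import numpy as np

def build_training_profile(t_end=200.0, dt=1.0, v0_kmh=50.0):
    """Return time (s), acceleration (m/s^2), and velocity (m/s) of the preceding vehicle."""
    n = int(round(t_end / dt)) + 1
    t = np.arange(n) * dt
    a = np.zeros(n)

    # Four successive 20 s step signals (ASSUMED to start at 50 s).
    steps = [(50.0, 0.42), (70.0, 0.83), (90.0, -0.42), (110.0, -0.83)]
    for start, accel in steps:
        i0 = int(round(start / dt))
        i1 = int(round((start + 20.0) / dt))
        a[i0:i1] = accel

    # Sinusoidal acceleration from 140 s: 40 s period, 1 m/s^2 amplitude,
    # truncated at the terminal time (ASSUMPTION).
    i0 = int(round(140.0 / dt))
    i1 = min(n, int(round((140.0 + 2 * 40.0) / dt)))
    a[i0:i1] = 1.0 * np.sin(2.0 * np.pi * (t[i0:i1] - 140.0) / 40.0)

    # Forward-Euler integration of the acceleration, starting from v0.
    v = v0_kmh / 3.6 + np.concatenate(([0.0], np.cumsum(a[:-1]) * dt))
    return t, a, v

t, a, v = build_training_profile()
```

Because the two accelerations and the two decelerations have equal magnitudes and durations, the velocity returns to its initial value once the step signals end.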
In this study, an experiment consists of a maximum of 1000 consecutive trials of the training cycle. The experiment is considered successful if the host vehicle can follow the preceding vehicle with the desired range in a steady state at the end of a trial (with a trial number less than 1000). For comparison, the actor-critic RL algorithm without the supervisor is also applied to the same training process as SRL, and 100 experiments are performed for each algorithm. The training results summarized in Table 1 show that with the proposed SRL algorithm, the training process is always successful, and fewer trials are needed than with RL.

Simulation Results
The performance of the obtained SRL-based control policy for CACC is investigated via simulation with two ideal driving cycles, namely the step acceleration cycle and the sinusoidal velocity cycle, and a collected real driving cycle. In [22], a linear feedforward and feedback control algorithm for CACC is proposed and tested on a vehicle. Here, this linear controller is adopted for comparison. With the two ideal cycles, the SRL control policy trained by samples of Driver 1 is compared with the linear control algorithm. With the real driving cycle, the SRL control policies trained by samples of Driver 1 and Driver 2 are compared to evaluate the learning ability of SRL and validate the effect of the supervisor. The time headway is set as 1 s, and the safe distance at standstill is 2 m.
Figure 5 presents the simulation results with the cycle of step accelerating and decelerating commands, and the maximum, average, and variance of the inter-vehicle distance error are listed in Table 2. The initial distance, the initial velocity of the preceding vehicle, and the initial velocity of the host vehicle are 7 m, 36 km/h, and 18 km/h, respectively. Note that for a fair comparison, the desired distance curve is calculated using the velocity of the preceding vehicle in this and the following figures in this section. With the SRL-based control policy, the host vehicle tracks the velocity of the preceding vehicle well while maintaining the desired inter-vehicle distance. The velocity error and distance error of the SRL control policy decay more quickly than those of the linear control. More specifically, at the beginning of the simulation and at the rising and falling edges of the step command, the SRL policy tends to generate a larger acceleration to achieve a faster response and more accurate inter-vehicle distance control. The maximum, average, and variance of the distance error of the SRL control policy are smaller than those of the linear controller, which indicates that the SRL control policy achieves better following accuracy.
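The desired distance and the error statistics reported in the tables can be computed along the following lines. This sketch assumes a constant time-headway spacing policy, in which the desired distance equals the standstill distance plus the headway times the preceding vehicle's velocity. It further assumes that the maximum is taken over the absolute error and that the average and variance are taken over the signed error; the paper's exact definitions may differ. Function names are hypothetical.

```python
import numpy as np

def desired_distance(v_preceding_kmh, headway_s=1.0, standstill_m=2.0):
    """Desired inter-vehicle distance (m) under an ASSUMED constant time-headway policy."""
    v_mps = np.asarray(v_preceding_kmh, dtype=float) / 3.6
    return standstill_m + headway_s * v_mps

def distance_error_stats(actual_m, desired_m):
    """Return (max |e|, mean e, variance of e) for e = actual - desired."""
    e = np.asarray(actual_m, dtype=float) - np.asarray(desired_m, dtype=float)
    return float(np.max(np.abs(e))), float(np.mean(e)), float(np.var(e))

# Initial condition of the step cycle: 7 m gap, preceding vehicle at 36 km/h.
d_des0 = float(desired_distance(36.0))
initial_error = 7.0 - d_des0
```

Under these assumptions the step cycle starts with the host vehicle closer than the desired distance, so the initial distance error is negative.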
Figure 6 shows the simulation results of the sinusoidal velocity cycle. The corresponding maximum, average, and variance of the inter-vehicle distance error are listed in Table 3. At the beginning of the simulation, the SRL policy provides a larger acceleration than the linear controller to reach the velocity of the preceding vehicle in a short period of time. After that, the SRL policy performs as well as the linear controller in velocity tracking. As shown in Figure 6c, the proposed SRL policy outperforms the linear controller in maintaining the desired distance. The distance error of the SRL policy is smaller than that of the linear controller, as shown in Table 3.
Moreover, the performance of the SRL control policies trained by different drivers is compared using a real driving cycle collected on an urban road in Beijing. The driving samples of Driver 1 and Driver 2 are used to train two supervisor networks, and these two supervisors are then adopted for the SRL training process to obtain two different SRL control policies. Recall that Driver 1 has an aggressive driving style, whereas Driver 2 is non-aggressive. As shown in Figure 7, both SRL policies can control the host vehicle to follow the preceding vehicle stably and accurately, and the distance error and relative velocity are regulated in a small range. The variance of the distance error of the policy trained by Driver 2 (SRL-Driver2) is larger than that of SRL-Driver1, as listed in Table 4. The velocity curve and the acceleration curve of SRL-Driver2 are smoother than those of SRL-Driver1. The fluctuation range of the acceleration for SRL-Driver1 is also larger than that of SRL-Driver2, especially at 20 s to 40 s and 60 s to 80 s. The acceleration distributions for the data samples and the simulation results are depicted in Figure 8a,b, respectively. From the acceleration distribution of the collected driving data for the two human drivers shown in Figure 8a, it can be seen that for Driver 2, whose driving style is non-aggressive, the acceleration distribution is more concentrated around zero. For Driver 1, the probability that the absolute value of acceleration is larger than 0.5 is greater than that of Driver 2. This means that Driver 1 tends to give larger acceleration or braking commands, whereas Driver 2 tends to give smaller and smoother acceleration or braking commands. The distribution shapes for the data samples and the simulation results of the two drivers are clearly similar. In detail, the SRL control policy trained by Driver 2, which inherits the acceleration characteristics from the data samples of Driver 2, is more likely than SRL-Driver1 to give small acceleration commands. It can be concluded that the SRL-based control policy learns the driving style from the samples of the human driver while guaranteeing control performance.
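The comparison of acceleration distributions can be reproduced with a normalized histogram and a tail-probability measure, as in the minimal sketch below. The threshold of 0.5 is the value mentioned in the text (its unit is assumed to be m/s²); the bin edges, function names, and sample arrays are hypothetical and do not represent the recorded driving data.

```python
import numpy as np

def acceleration_distribution(accel, bin_edges):
    """Empirical probability of each acceleration bin (sums to 1 if every sample falls in range)."""
    counts, _ = np.histogram(accel, bins=bin_edges)
    return counts / counts.sum()

def large_accel_fraction(accel, threshold=0.5):
    """Fraction of samples whose absolute acceleration exceeds the threshold."""
    accel = np.asarray(accel, dtype=float)
    return float(np.mean(np.abs(accel) > threshold))

# Made-up samples standing in for an aggressive and a non-aggressive driver.
aggressive = np.array([-1.0, -0.6, 0.0, 0.2, 0.8])
smooth = np.array([-0.1, 0.0, 0.1, 0.2, 0.6])

edges = np.linspace(-2.0, 2.0, 9)
p_smooth = acceleration_distribution(smooth, edges)
frac_aggressive = large_accel_fraction(aggressive)
frac_smooth = large_accel_fraction(smooth)
```

Applying the same two functions to the recorded driver data and to the simulated SRL accelerations allows the similarity of the distribution shapes to be checked numerically as well as visually.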

Test Results
Figure 9 shows the test scenario for the evaluation of the SRL-based CACC system. The host vehicle uses a millimeter-wave radar to measure the relative velocity and inter-vehicle distance. The acceleration of the preceding vehicle is sent to the host vehicle via V2V communication. Vehicle-following tests in low-speed driving situations were conducted on a flat, straight road on our campus. To validate the adaptability of the proposed control strategy, time headways of 0.8 s, 1.0 s, and 1.2 s are tested. The safe distance at standstill is set as 5 m.
The test results with different time headways for the SRL-based CACC system are shown in Figures 10-12. The corresponding maximum, average, and variance of the inter-vehicle distance error are listed in Table 5. Although the velocity of the preceding vehicle fluctuates between 15 and 30 km/h, the host vehicle can always follow the velocity of the preceding vehicle accurately. The inter-vehicle distance changes according to the desired distance. Distance error occurs when the velocity of the preceding vehicle fluctuates but remains in an acceptable range. Normally, the actual distance is larger than the desired distance when the preceding vehicle accelerates and smaller than the desired value when the preceding vehicle decelerates, because the host vehicle's velocity lags slightly behind that of the preceding vehicle. The distance error and its variance decrease when a larger time headway is used, as listed in Table 5. In addition, the SRL controller makes the acceleration of the host vehicle closely follow that of the preceding vehicle. In conclusion, the effectiveness of the SRL-based CACC system is validated by vehicle-following tests with satisfactory performance.

Conclusions
In this study, we propose an SRL-based framework for the longitudinal vehicle dynamics control of a CACC system. A supervisor trained by driving data from a human driver is introduced to guide the reinforcement learning process. During the learning process, the composite action of the supervisor network and the actor network is sent to the system, and the actor network is updated with the errors provided by the supervisor and the critic network simultaneously. The gain scheduler adjusts the weighting between supervisor and actor to improve the learning efficiency and obtain a near-optimal control policy. The simulation results show that the performance of the proposed control strategy is superior to that of the linear controller, and that the human driver's acceleration characteristics can be successfully mimicked by the SRL-based strategy, which improves driver comfort and acceptance of the CACC system. In addition, the proposed framework is demonstrated on CACC test vehicles, and its effectiveness is validated by vehicle-following tests with satisfactory performance. Notably, the SRL algorithm can be implemented in real time, such that the driver's driving data are collected for online updating of the actor network to provide a human-like control policy. In the next step, string stability for vehicle platooning will be considered in the SRL approach to reduce the impact of the trained control policy on traffic flow.


Figure 1. Control framework for the CACC system.

Figure 2. Functional architecture of the test platform.

Figure 3. Schematic representation of the SRL-based CACC algorithm.

Figure 4. Acceleration response of the host vehicle to the commanded acceleration and simulation output of the model: (a) acceleration and (b) braking.

Figure 5. Simulation results with the cycle of step acceleration and deceleration: (a) velocity, (b) acceleration, and (c) distance.

Figure 6. Simulation results with the cycle of sinusoidal velocity: (a) velocity, (b) acceleration, and (c) distance.

Figure 7. Simulation results with the real driving cycle: (a) velocity, (b) acceleration, and (c) distance.

Figure 8. Acceleration distribution for the data samples (a) and simulation results (b).

Table 1. Training results of SRL and RL.

Table 2. Maximum, average, and variance of distance error, with the cycle of step acceleration and deceleration.

Table 3. Maximum, average, and variance of distance error, with the cycle of sinusoidal velocity.

Table 4. Maximum, average, and variance of distance error, with the real driving cycle.
Control Policy Maximum (m) Average (m) Variance

Table 5. Maximum, average, and variance of distance error, with the real driving cycle.