A Learning Control Method of Automated Vehicle Platoon at Straight Path with DDPG-Based PID

Abstract: Cooperative adaptive cruise control (CACC) is of great significance for the development of the connected and automated vehicle (CAV) industry. Tuning a traditional proportional integral derivative (PID) platoon controller is not only time-consuming and laborious, but the result is also unable to adapt to different working conditions. This paper proposes a learning control method for a vehicle platooning system using a deep deterministic policy gradient (DDPG)-based PID. The main contribution of this study is automating the PID weight tuning process by formulating this objective as a deep reinforcement learning (DRL) problem. The longitudinal control of the vehicle platoon is divided into upper- and lower-level control structures. The upper-level controller, based on the DDPG algorithm, adjusts the current PID controller parameters. Through offline training and learning in a SUMO simulation environment, the PID controller can adapt to different road and vehicular platooning acceleration and deceleration conditions. The lower-level controller controls the gas/brake pedal to accurately track the desired acceleration and speed. On a hardware-in-the-loop (HIL) simulation platform, the results show that the maximum speed error of the DDPG-based PID controller is 0.02-0.08 m/s lower than that of the conventional PID controller, a maximum reduction of 5.48%. In addition, the maximum distance error of the DDPG-based PID controller is 0.77 m, which is 14.44% less than that of the conventional PID controller.


Introduction
Connected and automated vehicles (CAVs) are an important development direction for the automobile industry. They are not only an important way to solve the problems of traffic safety, resource consumption, environmental pollution, etc., but are also a core element of establishing an intelligent transportation system. Cooperative adaptive cruise control (CACC) based on on-board sensors and vehicle-to-vehicle (V2V) and/or infrastructure-to-vehicle (I2V) communication has become a hot spot in intelligent vehicle research [1,2]. Through vehicle-to-everything (V2X) communication, vehicles in this mode can receive dynamic information about the surrounding environment in real time and improve driving safety [3,4]. Simultaneously, CACC has a significant influence on improving road capacity, reducing fuel consumption, decreasing environmental pollution, and so on [5][6][7].
By sharing information among vehicles, a CACC system allows automated vehicles to form platoons and be driven at harmonized speeds with smaller constant time gaps between vehicles [8]. CACC plays a positive role in improving the performance of the vehicular platooning system and ensuring the safety of vehicles, so it has attracted wide attention from researchers. Previous methods for CACC include proportional integral derivative (PID) control [9,10], sliding mode control (SMC) [11,12], model predictive control (MPC) [13][14][15], H-infinity (H∞) control [16,17], etc. Due to their advantages of low complexity and little computation, PID controllers play an important role in the control field. However, the parameters of a PID controller need to be adjusted manually and cannot adapt to different working conditions. The control effect of the SMC, MPC, and H∞ methods is closely related to model accuracy, and these methods need a reasonably good model of the system to be controlled: the higher the model precision, the better the control effect. Nevertheless, due to the complex nonlinear dynamics of the longitudinal movement of vehicular platooning, it is difficult to establish an accurate model.
In recent years, Google's DeepMind team combined deep neural networks with the decision-making capabilities of reinforcement learning to establish the framework of deep reinforcement learning (DRL) [18]. The deep deterministic policy gradient (DDPG) algorithm was then proposed to realize control in continuous action spaces [19], and it has achieved good results in the field of autonomous driving control [20]. At present, DRL algorithms are mainly applied to the control of individual vehicles, specifically longitudinal [21,22] and lateral [23,24] motion control. Zhu et al. [21] used real-world driving data for training and proposed a human-like car-following model based on the DDPG algorithm, which has higher accuracy than traditional methods. A lane change model based on DRL was designed, which can achieve more stable, safe, and efficient results by adjusting the reward function [23]. Chen et al. [25] proposed a path tracking control architecture that combines a conventional pure pursuit method and a DRL algorithm. It was found that adding DRL in parallel improves the performance of a traditional controller under various operating conditions. Zhou et al. [26] proposed a framework for learning the car-following behavior of drivers based on maximum entropy deep inverse reinforcement learning. To address the overly simple simulation scenes used in the above studies, Makantasis et al. [27] established a traffic flow model in SUMO simulation software to train the agent. An integrated car-following and lane-changing model using DDPG was developed and trained in the VISSIM simulation environment [28]. Some studies have tried to apply the theory in practice [22], but a DRL algorithm based on a deep neural network is a "black box" model. In other words, its control principle is unknown and carries significant uncertainty, and the training results depend on the setting of random seeds, which is unstable. This is the reason why current DRL algorithms are mainly implemented on simulation platforms and are difficult to apply to real vehicles [29].
The learning controller has a strong ability for discrimination, memory, and self-adjustment. It can adjust its own parameters according to different controlled objects and environmental changes to achieve the best control performance. There are currently three main types of learning control systems: iterative learning control (ILC) [30,31], adaptive control based on neural networks (NN) [32,33], and learning control based on the Markov decision process (MDP) [34,35]. Wang et al. [30] presented a novel learning-based cruise controller for autonomous land vehicles (ALVs). The controller consists of a time-varying proportional-integral (PI) module and an actor-critic learning control module. Lu et al. [36] designed a personalized driving behavior learning system based on neural reinforcement learning (NRL), which utilized data collected by on-board sensors to learn the driver's longitudinal speed control characteristics. Combining DRL with traditional control methods has been a research hotspot in recent years. This approach takes advantage of the self-learning and self-tuning abilities of DRL, and uses the traditional controller to ensure the stability of the system. A learning-based predictive control (LPC) method using the actor-critic framework was proposed, which was shown to be asymptotically stable in the sense of Lyapunov [37]. Ure et al. [38] developed a reinforcement learning framework for automated weight tuning of MPC-based adaptive cruise control (ACC) systems. This approach significantly shortens the exhausting manual weight tuning process.
In summary, researchers in different fields have already completed numerous works on the longitudinal motion control of vehicular platooning, but some deficiencies still exist, as follows. (1) It is difficult for a vehicular platooning controller to adapt to various working conditions, and the controller parameters must be set manually by professional engineers (e.g., PID). Existing controllers such as MPC, LQR, and H∞ need a high-precision model of the controlled object; however, such a model is very difficult to obtain. (2) Neural networks and their derived controllers belong to the scope of supervised machine learning, which can only imitate the parameter adjustment strategies of expert demonstrations, and these are not necessarily optimal. Another issue is that their generalization ability also needs to be proven. (3) End-to-end learning methods perform well in autonomous driving simulation environments, but their interpretability is poor, and there is little literature analyzing the stability of the resulting control system. Vehicular platooning has complex nonlinear dynamics, so the actual control effect cannot be guaranteed.
In view of the above problems, a learning control method that uses a DDPG-based PID for the longitudinal motion control of vehicular platooning is proposed in this paper. PID controllers are the most commonly used in industrial applications due to their simplicity in structure and robustness in performance. However, traditional PID tuning is not only time-consuming and laborious, but the tuned controller is also unable to adapt to different working conditions. Therefore, we propose a novel control strategy for vehicular platooning using a DDPG-based PID to solve this problem. To the best of the authors' knowledge, this is the first reported use of a DDPG-based PID for vehicular platooning control. The PID controller parameters can be automatically adjusted in real time according to the state by using the trained DDPG algorithm. Through offline training and learning in a simulation environment, the PID controller can adapt to different road and vehicular platooning acceleration and deceleration conditions. The advantage of this scheme is that the PID controller parameters do not rely on any manual tuning and can better adapt to changes in working conditions. The DDPG-based PID controller eliminates the drawbacks of the traditional PID controller, such as insufficient adaptability and the difficulty of parameter regulation. In addition, the stability of the vehicular platooning system is proved by stability theory to ensure safety. Therefore, compared with the traditional PID controller, the DDPG-based PID has stronger robustness. This study is a further development of the learning control method, and provides a new idea for the practical application of DRL algorithms in the industrial field. However, the HIL simulation simplifies the road environment conditions. How to carry out real vehicle experiments to further verify the stability and reliability of the vehicular platoon controller is the focus of our next research.
The work in this paper is an extension of our previous publication [39]. The remainder of this paper is organized as follows. In Section 2, the architecture of the vehicular platooning control system and a string stability analysis are presented. In Section 3, we illustrate how the problem of vehicular platoon control is formulated as an MDP model. The DDPG-based PID control algorithm is trained in Section 4. In Section 5, the experimental result is presented and in Section 6, the results are analyzed and discussed. Finally, the conclusions and future work outlook are provided in the last section.

Vehicle Platoon Architecture
The information topology has an important impact on the stability of vehicular platooning. Currently, the main topologies include the predecessor following (PF) topology, bidirectional (BD) topology, and predecessor-leader following (PLF) topology [40]. In addition, the vehicle spacing control strategies consist of the constant spacing (CS) policy, constant time-gap (CT) policy, and variable time-gap (VT) policy [41]. The PLF topology and CT policy frameworks are applied in this paper to realize the vehicular longitudinal tracking control, as shown in Figure 1. Based on the hierarchical control structure [42], the upper-level controller receives the state information (such as vehicle position, speed, and acceleration) through communication technology and on-board sensors, and calculates the desired longitudinal acceleration. Then, the lower-level controller controls the gas/brake pedal to accurately track the desired acceleration and speed, with feedforward and feedback control logic. This paper focuses on the upper-level controller; meanwhile, the dynamic models of the vehicle powertrain system and braking system are handled by the lower-level controller. The vehicular platooning control framework is shown in Figure 2. In our system, we assume that the leading vehicle can be maneuvered in real time by automatic or manual driving, and that each vehicle is equipped with a global positioning system (GPS) and on-board sensors, and has V2V communication technology.

Vehicle Longitudinal Dynamics Model
For the vehicle platoon, the longitudinal model of the ith vehicle is obtained by considering the delay characteristic of the vehicle actuator as [43]:

ẋ i (t) = v i (t), v̇ i (t) = a i (t), ȧ i (t) = −a i (t)/τ + u i (t)/τ (1)

where x i (t), v i (t), a i (t), u i (t) represent the position, velocity, acceleration, and desired acceleration of the center of gravity, respectively; τ is the first-order lag of the vehicle actuator. The platoon consists of N vehicles (or nodes), i.e., a leader (indexed as 1) and N − 1 followers (indexed by i accordingly).
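As a quick sanity check, the first-order actuator-lag model above can be simulated with forward-Euler integration. The values of τ and the step size below are illustrative assumptions, not the paper's settings:

```python
def step_vehicle(x, v, a, u, tau=0.5, dt=0.01):
    """One forward-Euler step of the longitudinal model:
    x' = v, v' = a, a' = (u - a) / tau (first-order actuator lag)."""
    return x + v * dt, v + a * dt, a + (u - a) / tau * dt

# Starting from zero acceleration, the actual acceleration converges
# to the desired value u with time constant tau.
x, v, a = 0.0, 15.0, 0.0
for _ in range(1000):  # simulate 10 s
    x, v, a = step_vehicle(x, v, a, u=0.5)
print(round(a, 3))  # -> 0.5
```

After 10 s (twenty time constants), the actuator output has essentially reached the commanded acceleration, which is the behavior the lag term models.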
For the ith vehicle, we can define the distance ε i,i−1 (t) and distance error e i,i−1 (t) between the ith and (i−1)th (preceding) vehicle as:

ε i,i−1 (t) = x i−1 (t) − x i (t), e i,i−1 (t) = ε i,i−1 (t) − x d (2)

where x d represents the desired distance between neighboring vehicles, with x d = v i h + L; h is the constant time headway; and L is the safety distance, which contains the length of the vehicle body. Similarly, we can also obtain:

ε i,1 (t) = x 1 (t) − x i (t), e i,1 (t) = ε i,1 (t) − (i − 1) x d (3)

where ε i,1 (t), e i,1 (t) denote the distance and distance error between the ith vehicle and the leading vehicle, respectively. For the PLF topology, the distance error of the ith vehicle consists of two parts, i.e.,

e i (t) = λ 1 e i,i−1 (t) + λ 2 e i,1 (t) (4)

where λ 1 , λ 2 are the weight coefficients of e i,i−1 (t) and e i,1 (t), which are bounded with λ 1 + λ 2 = 1, 0 < λ 1 < 1 and 0 < λ 2 < 1.
The controllers are distributed in each vehicle, and each controller can use information from the preceding vehicle and the leading vehicle. Based on (4), we can obtain u i (t) as:

u i (t) = K p e i (t) + K i ∫ e i (t) dt + K d ė i (t) (5)

where K p , K i , and K d are the weight parameters of the PID controller. Then, substituting (4) into (5), the output of the upper-level controller of the ith (i ≥ 2) vehicle is expressed as:

u i (t) = K p [λ 1 e i,i−1 (t) + λ 2 e i,1 (t)] + K i ∫ [λ 1 e i,i−1 (t) + λ 2 e i,1 (t)] dt + K d [λ 1 ė i,i−1 (t) + λ 2 ė i,1 (t)] (6)
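The upper-level PID law on the combined PLF error can be sketched as a small discrete-time class. The gains and the sampling period here are illustrative; in the proposed method the three gains are produced online by the DDPG agent:

```python
class PlatoonPID:
    """Discrete PID on the combined PLF distance error.
    The gains kp, ki, kd are illustrative placeholders; in the paper
    they are the outputs of the DDPG agent at each step."""
    def __init__(self, kp, ki, kd, lam1=0.5, lam2=0.5, dt=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lam1, self.lam2, self.dt = lam1, lam2, dt
        self.integral = 0.0
        self.prev_error = None

    def desired_acceleration(self, e_prec, e_lead):
        # Combined error: weighted sum of preceding- and leader-referenced errors
        e = self.lam1 * e_prec + self.lam2 * e_lead
        self.integral += e * self.dt
        deriv = 0.0 if self.prev_error is None else (e - self.prev_error) / self.dt
        self.prev_error = e
        return self.kp * e + self.ki * self.integral + self.kd * deriv

pid = PlatoonPID(kp=0.8, ki=0.05, kd=0.3)
u = pid.desired_acceleration(e_prec=1.0, e_lead=0.6)  # positive gap error -> accelerate
print(round(u, 3))  # -> 0.644
```

A positive distance error (the gap is larger than desired) yields a positive desired acceleration, closing the gap.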

Lower-Level Controller
The output of the lower-level controller is the throttle opening or brake pressure, which accurately tracks the desired acceleration and speed. According to the desired acceleration calculated by (5), the desired speed at the next moment can be expressed as follows:

v* (t + 1) = v(t) + u i (t) T s (7)

where v* (t + 1), v(t) are the desired speed at time t + 1 and the actual speed at time t, respectively; u i (t) is the desired acceleration; and T s denotes the sampling period. The lower-level controller adopts a feedforward plus feedback control scheme. The feedforward value u lf (t) is obtained from the longitudinal inverse dynamics model [44], and the feedback value is calculated by the PID feedback control method on the speed error. Therefore, the output of the lower-level controller is as follows:

u low (t) = u lf (t) + K 1 v e (t) + K 2 ∫ v e (t) dt + K 3 v̇ e (t) (8)

where v e (t) is the deviation between the expected speed and the actual speed; K 1 , K 2 , and K 3 are positive parameters of the controller.
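A sketch of the lower-level logic described above, assuming a discrete PID on the speed error added to a feedforward term. The gains k1-k3 and the sampling period are illustrative, and in practice the feedforward value would come from the inverse longitudinal dynamics model:

```python
def lower_level_output(u_ff, v_des, v_act, state, k1=1.2, k2=0.1, k3=0.05, dt=0.1):
    """Feedforward-plus-feedback lower-level control.
    u_ff: feedforward term (from the inverse dynamics model in the paper);
    k1, k2, k3: illustrative positive feedback gains, not the paper's values."""
    ve = v_des - v_act                              # speed error
    state["int"] += ve * dt                         # integral of the speed error
    dve = (ve - state["prev"]) / dt if state["prev"] is not None else 0.0
    state["prev"] = ve
    return u_ff + k1 * ve + k2 * state["int"] + k3 * dve

# Desired speed at the next step from the desired acceleration:
v_star = 15.0 + 0.5 * 0.1                           # v*(t+1) = v(t) + u_i(t) * T_s
state = {"int": 0.0, "prev": None}
out = lower_level_output(u_ff=0.5, v_des=v_star, v_act=15.0, state=state)
print(round(v_star, 2), round(out, 4))  # -> 15.05 0.5605
```

The feedback terms only correct the residual speed error, so the output stays close to the feedforward value when tracking is good.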

Transfer Function of Distance Error
Based on Equation (1), the longitudinal dynamics of the ith and (i−1)th vehicles can be described by:

τ ȧ i (t) + a i (t) = u i (t), τ ȧ i−1 (t) + a i−1 (t) = u i−1 (t) (9)

The time derivative of Equation (2) can be written as:

ė i,i−1 (t) = v i−1 (t) − v i (t) − h a i (t) (10)

Based on Equations (9) and (10), the following equation can be expressed:

ë i,i−1 (t) = a i−1 (t) − a i (t) − h ȧ i (t) (11)

Then, combining Equations (9)-(11), we can obtain:

τ (d³/dt³) e i,i−1 (t) + ë i,i−1 (t) = u i−1 (t) − u i (t) − h u̇ i (t) (12)

According to the Laplace transform of Equation (12), with the PID law (6) substituted for u i (t) and u i−1 (t), the transfer function of the distance error between neighboring vehicles can be derived as:

G(s) = E i,i−1 (s) / E i−1,i−2 (s) (13)

where E i,i−1 (s) denotes the Laplace transform of e i,i−1 (t).

String Stability
According to the definition of string stability, the platoon can be said to be stable when the distance error between neighboring vehicles is not amplified as the number of vehicles increases, i.e., [45]:

|G(jω)| ≤ 1, ∀ω > 0 (14)

Proof. Substituting s = jω into Equation (14) and clearing the denominator yields the condition of Equation (15); if this condition is fulfilled, we have B > 0, where B denotes the resulting polynomial in ω. Let x = ω², so that x > 0. Considering that τ²ω⁶ > 0, the rest of B can be described by a quadratic function f(x) (Equation (16)), which can be rewritten in the form of Equation (17) with its coefficients defined accordingly. For a single-variable quadratic function, the requirement that f(x) > 0 for any given x > 0 can be divided into the following two cases. Defining γ = λ 1 + (i − 1)λ 2 , we can compute sufficient (but not necessary) conditions for the string stability of the platoon as follows.

Case A: From the first case, we can derive the condition of Equation (19).

Case B: Here, Equation (21) is obtained from Equation (20).

If the parameters K p , K i , and K d meet the requirements of Equation (19) or (21), then the string stability of the platoon is guaranteed.

MDP Model for Vehicle Platoon Control
The problem of vehicular platoon control is formulated as an MDP model in this section. In our system, we assume that the environment is fully observable. The states, actions, and the reward function of the MDP are defined as follows.
Choosing an appropriate state space is critical to the convergence of the reinforcement learning algorithm. The selected state information should be related to the motion state of the platoon. According to the PLF topology, the state space includes the relative position, relative speed, and relative acceleration between the host vehicle and the preceding vehicle, and between the host vehicle and the leading vehicle. At time step t, the state s t consists of six elements, i.e.,

s t = [∆a i,i−1 , ∆v i,i−1 , ∆x i,i−1 , ∆a i,1 , ∆v i,1 , ∆x i,1 ] (22)

where ∆a i,i−1 , ∆v i,i−1 , ∆x i,i−1 are the relative acceleration, relative speed, and relative position of the host vehicle and the preceding vehicle, and ∆a i,1 , ∆v i,1 , ∆x i,1 denote the relative acceleration, relative speed, and relative position of the host vehicle and the leading vehicle, respectively.
In the upper-level controller, the DDPG algorithm adjusts the PID controller parameters in real time, so the action space is:

a t = [K p , K i , K d ] (23)

The goal of reinforcement learning is to find the optimal policy that maximizes the cumulative reward. The design of the reward function needs to consider the following aspects. Firstly, it is necessary to ensure that there is no collision between vehicles while the vehicular platoon system is traveling. Secondly, the stability of the platoon should be guaranteed, i.e., the relative position among vehicles should be maintained at a reasonable distance. Thirdly, the host vehicle needs to respond quickly so that it can follow the motion state of the preceding vehicle. Based on the above analysis, the reward function designed in this paper combines these parts, where ω 1 , ω 2 , ω 3 , ω 4 are the positive weight coefficients of the reward function.
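A minimal sketch of the state construction and a penalty-style reward consistent with the considerations above. The exact functional form and the weights w1-w4 below are illustrative assumptions, not the paper's published reward:

```python
import numpy as np

def build_state(host, prec, lead):
    """Six-element MDP state: relative acceleration, speed, and position
    to the preceding vehicle and to the leading vehicle."""
    return np.array([
        prec["a"] - host["a"], prec["v"] - host["v"], prec["x"] - host["x"],
        lead["a"] - host["a"], lead["v"] - host["v"], lead["x"] - host["x"],
    ])

def reward(dist_err, speed_err, accel, collided,
           w1=1.0, w2=1.0, w3=0.1, w4=100.0):
    """Weighted penalty reward; the form and weights are an illustrative
    reconstruction of the design goals (safety, spacing, smooth response)."""
    r = -(w1 * abs(dist_err) + w2 * abs(speed_err) + w3 * abs(accel))
    if collided:                 # large penalty enforces the no-collision goal
        r -= w4
    return r

host = {"x": 0.0, "v": 15.0, "a": 0.0}
prec = {"x": 35.0, "v": 15.0, "a": 0.0}
lead = {"x": 70.0, "v": 15.0, "a": 0.0}
s = build_state(host, prec, lead)
print(len(s), reward(0.5, 0.2, 0.0, False))  # -> 6 -0.7
```

The maximum per-step reward of 0 is obtained exactly when all tracking errors vanish and the motion is smooth, matching the training curves discussed later.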

Structural Design of DDPG Algorithm
In this paper, there is no image input, so we use a fully connected network to construct the DDPG networks. The overall structure of the neural network is shown in Figure 3. There are 4 layers in the actor network, including 1 input layer, 1 output layer, and 2 hidden layers. There are 150 and 100 neurons in the hidden layers, which use the rectified linear unit (ReLU) activation function, because it has the advantage of accelerating convergence [46]. The input to the actor network is the state s t and the output is the action a t , which is a set of PID controller parameters. The final output layer of the actor network uses 3 sigmoid activation functions to generate continuous action values limited to (0, 1).
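The actor's forward pass with the stated layer sizes can be sketched in NumPy as follows. The random weights and seed are purely illustrative (a trained network would have learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the paper: 6 state inputs, hidden layers of 150 and
# 100 ReLU units, 3 sigmoid outputs (the PID gains, squashed to (0, 1)).
W1, b1 = rng.normal(0, 0.1, (150, 6)), np.zeros(150)
W2, b2 = rng.normal(0, 0.1, (100, 150)), np.zeros(100)
W3, b3 = rng.normal(0, 0.1, (3, 100)), np.zeros(3)

def actor(s):
    h1 = relu(W1 @ s + b1)
    h2 = relu(W2 @ h1 + b2)
    return sigmoid(W3 @ h2 + b3)   # a_t = (K_p, K_i, K_d), each in (0, 1)

a = actor(np.zeros(6))
print(a.shape, bool(np.all((a > 0) & (a < 1))))
```

The sigmoid output layer is what guarantees the bounded (0, 1) action range the paper describes; the gains can then be rescaled to their working ranges.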

The critic network takes the state s t and the action a t as input, and outputs a scalar Q-value Q(s t , a t ). The critic network consists of 2 input layers, 1 output layer, and 3 hidden layers containing 150, 200, and 100 hidden units. The ReLU activation function is used in the first and third hidden layers, and a linear activation function is used to sum the variable values. It should be noted that the values vary greatly due to the different units of the input state variables. In order to eliminate the dimensional influence between the data and improve the training effect, batch normalization is utilized, which transforms the input data toward a normal distribution. In addition, Ornstein-Uhlenbeck process noise is used to improve the efficiency of exploration in the inertial system. Other training parameters are listed in Table 1.
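The Ornstein-Uhlenbeck exploration noise mentioned above can be sketched as follows; theta, sigma, dt, and the seed are illustrative assumptions rather than the paper's training settings:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise: temporally correlated and
    mean-reverting, which suits inertial control tasks better than
    uncorrelated Gaussian noise. Parameters here are illustrative."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.01, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(dim, mu, dtype=float)
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # dx = theta * (mu - x) dt + sigma * sqrt(dt) * dW
        dx = (self.theta * (self.mu - self.state) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.state.shape))
        self.state += dx
        return self.state.copy()

noise = OUNoise(dim=3)                     # one noise channel per PID gain
samples = np.array([noise.sample() for _ in range(5000)])
print(samples.shape)
```

Consecutive samples drift smoothly around the mean instead of jumping independently, so the perturbed PID gains change gradually from step to step.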

Training Environment-SUMO
As is shown in Figure 4, a deep reinforcement learning training platform based on SUMO is designed in this paper, which is mainly composed of a SUMO simulator and an external controller Jetson TX2. The vehicle kinematics model and simulation scene are provided by SUMO. Meanwhile, the DDPG-based PID algorithm is written in Python language, stored in the Jetson TX2, and trained according to the simulation data. Considering the vehicle platoon communication structure and actual test conditions, three vehicles are set up to form a platoon driving along the straight road in the SUMO simulator. The first, second, and third vehicles are the leading vehicle, preceding vehicle, and host vehicle (red vehicle in Figure 4), respectively.

During the training process, the dynamic states of the three vehicles in SUMO are transmitted to the DDPG algorithm through traffic control interface (TraCI) communication.
After data processing, the state s t is input to both the actor and the critic network, and three PID parameters are output. The movement of the leading vehicle is planned in advance. The preceding vehicle can follow the speed change of the leading vehicle by manually adjusting the PID controller. The whole training process is iterative and cyclic.

Vehicle Platoon Control Policy Algorithm
The training process of the DDPG-based PID algorithm is divided into two cycles to learn the policy of vehicle platoon longitudinal control, as shown in Algorithm 1. Firstly, the parameters of the actor network, the critic network, and the replay buffer are initialized. Next, when the outer cycle starts to run, the SUMO simulation environment needs to be initialized to obtain the initial state s 1 . In the inner cycle, the action a t , i.e., the three parameters of the PID controller, is output according to the state s t . Then, the desired acceleration a des is calculated and applied in the simulation environment. The reward r t and new state s t+1 are observed and saved into the replay buffer. Finally, training samples are randomly selected from the replay buffer to update the parameters of the actor and critic networks.

Algorithm 1: DDPG-based PID platoon control training
1: Randomly initialize critic network and actor network
2: Initialize target networks and replay buffer
3: for episode = 1 to M do
4:   Initialize SUMO simulation environment
5:   Receive initial observation state s 1
6:   for step = 1 to T do
7:     Select action a t based on current actor network and calculate the desired acceleration of host vehicle a des
8:     Execute desired acceleration a des in SUMO simulator and observe reward r t and new state s t+1
9:     Save transition (s t , a t , r t , s t+1 ) into replay buffer
10:    Sample a random batch of N transitions from the replay buffer
11:    Update critic by minimizing the loss
12:    Update actor policy using the sampled gradient
13:    Update the target networks
14:   end for
15: end for
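The save/sample bookkeeping of the replay buffer used in Algorithm 1 can be sketched as a minimal Python class; the capacity and batch size here are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: transitions are stored in a bounded
    FIFO and sampled at random to de-correlate training batches."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # old transitions evicted first

    def save(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.save(s=t, a=(0.5, 0.1, 0.2), r=-0.1, s_next=t + 1)
batch = buf.sample(8)
print(len(buf), len(batch))  # -> 50 8
```

The bounded deque discards the oldest experience once capacity is reached, so training always draws from recent policy behavior.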

Algorithm Training Results
The scene of algorithm training should be representative, so a driving cycle including acceleration, cruising, and deceleration is designed in this paper. In this study, the simulation of the dynamic performance of the platoon at different speeds is achieved by setting the speed profile of the leading vehicle. Figure 5 shows the speed and acceleration changes of the leading vehicle. The parameters λ 1 and λ 2 in (4) are both set to 0.5, and the training results are shown in Figure 6.

From Figure 6a-d, we can see that the vehicle using the DDPG-based PID controller can track the leading and preceding vehicle well in the entire movement process. The speed and acceleration change smoothly without overshoot. At the same time, when the leading vehicle begins to accelerate, the platoon takes about 15 s to reach steady state. When the leading vehicle returns to uniform speed, the platoon can reach steady state in about 5 s. The maximum distance error of the vehicle platoon is 0.38 m and the following vehicles can track the changes of the leading vehicle in time.
The DRL agent is trained using the DDPG algorithm for 600 episodes, where each episode starts with the same initial state and lasts for 5600 steps. The total reward per episode and reward per step are shown in Figure 7. The greater the value of the total reward per episode, the better the training performance. As can be seen from Figure 7a, the cumulative reward per episode increases with training time. The algorithm converges after training for 320 episodes and the cumulative reward is about −500. According to Figure 7b and the definition of the reward function, it can be concluded that the maximum reward per step, whose value is 0, occurs when the speed and distance deviations between the controlled vehicle and the front vehicle are 0 and the speed changes smoothly. That is to say, when the leading vehicle travels at a constant speed, the whole platoon keeps stable and the reward value is the largest.
Due to the CT policy strategy, when the preceding vehicle speed changes, the host vehicle cannot accurately track the desired speed and distance at the same time, so the reward inevitably takes a negative value. In addition, the larger the acceleration, the smaller the reward value will be. However, it can be seen from Figure 7b that the minimum reward value in the training episode is −0.32, which indicates that the DDPG-based PID platoon control algorithm can reduce the tracking error between the host vehicle and the preceding vehicle as much as possible.

Design of Hardware-in-the-Loop (HIL) Platform
To validate the effectiveness of the proposed method more realistically, a vehicle dynamics model is introduced to carry out the HIL test, which makes the system closer to the real environment. The platform is mainly composed of TruckSim software, Matlab/Simulink software, the external controller Jetson TX2, the NI-PXI real-time system, and the host computer. In the simulations, the platoon consists of three trucks with the same structural parameters. The LCF Van truck model is adopted, whose main dynamic parameters are listed in Table 2. TruckSim provides the dynamic models of the platoon. The leading vehicle and the preceding vehicle are controlled by Matlab/Simulink on the host computer. The platoon controller obtains the state information through a CAN bus and outputs the control signal to control the host vehicle in NI-PXI. The overall architecture of the HIL platform is shown in Figure 8.

Experimental Setup and Parameter Settings
The proposed control method was compared with a conventional PID on the HIL platform under three scenarios. The parameters of the experimental conditions are shown in Table 3. The influence factors of the initial speed, road slope, road adhesion coefficient, time headway, and acceleration are considered. By setting the leading vehicle's movement, the dynamic performance of the platoon at different speeds is tested. Among the three scenarios, the first involves 3% and 4% uphill sections. The second scenario has −3% and −4% downhill sections. In addition, in order to test the effects of the platoon controller in rainy and snowy weather, the road adhesion coefficient is set to 0.85 and 0.3, respectively, in the third scenario.
• Scenario 1
The initial speed of the platoon is 15 m/s and the desired time headway is 2 s. After traveling at a constant speed for 30 s, the leading vehicle starts to accelerate at 0.5 m/s² for 10 s, after which its speed reaches 20 m/s. The first experimental scene of the leading vehicle movement and the road slope is shown in Figure 9.

• Scenario 2
The initial speed of the platoon is set to 25 m/s. When t = 30 s, the leading vehicle starts to decelerate to 20 m/s with a deceleration of 0.5 m/s². Then, the platoon keeps a time headway of 2 s and travels at a constant speed. Figure 10 shows the second experimental scene of the leading vehicle movement and the road slope.
• Scenario 3
At the initial moment, the platoon travels at a speed of 10 m/s for 10 s. Then, the leading vehicle accelerates with a maximum acceleration of 1 m/s², and its speed reaches 20 m/s. When t = 50 s, the leading vehicle decelerates with a maximum deceleration of −1 m/s², and its speed returns to 10 m/s. The road adhesion coefficient is 0.3 for the section from 200 m to 800 m, and 0.85 for the rest. The third experimental scene of the leading vehicle movement and the road slope is shown in Figure 11.
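As a concrete reading of the Scenario 1 speed profile (15 m/s held for 30 s, then 0.5 m/s² for 10 s up to 20 m/s), the leading vehicle's commanded speed can be transcribed as a piecewise function; the function name is an illustrative choice, not from the paper.

```python
def lead_speed_scenario1(t):
    """Leading-vehicle speed (m/s) at time t (s) in Scenario 1:
    15 m/s for the first 30 s, then accelerate at 0.5 m/s^2
    for 10 s, then hold 20 m/s."""
    if t < 30.0:
        return 15.0
    if t < 40.0:
        return 15.0 + 0.5 * (t - 30.0)
    return 20.0
```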

In the simulations, the upper-level controller parameters based on the conventional PID are adjusted manually. Through Equation (4), the controller parameters for the preceding vehicle and the host vehicle differ because their inputs differ. The lower-level controller takes different values depending on the driving and braking modes. Table 4 shows the values of the conventional PID controller parameters, which are determined from expert empirical knowledge. The initial state of the platoon is set as the desired state, i.e., the initial distance errors and initial speed errors are all equal to 0.
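The upper-level control law is a PID whose gains are either fixed (Table 4, conventional tuning) or supplied at each step by the DDPG agent. A minimal sketch of such a gain-tunable discrete PID follows; the positional form and the 0.1 s step size are assumptions for illustration, not the paper's implementation.

```python
class TunablePID:
    """Discrete PID whose gains can be updated every step, e.g. set
    once by manual tuning (conventional PID) or supplied each step
    by the DDPG actor (the proposed method). Illustrative sketch."""

    def __init__(self, kp, ki, kd, dt=0.1):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.dt = dt            # control period (s), assumed value
        self.integral = 0.0
        self.prev_error = 0.0

    def set_gains(self, kp, ki, kd):
        # In the proposed method, these come from the DDPG actor.
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, error):
        # Standard positional PID update on the tracking error.
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

Separating `set_gains` from `step` mirrors the paper's division of labor: the DRL agent only retunes the gains, while the fixed PID structure keeps the control law simple and interpretable.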

Validation Results
Scenario 1 is the experimental condition of the vehicle platoon accelerating uphill, and Scenario 2 is the experimental condition of the vehicle platoon decelerating downhill. These two test scenarios verify the influence of the road slope, initial speed, and acceleration on the controller. The purpose of Scenario 3 is to test the influence of the road adhesion coefficient and time headway on the performance of the platoon controller. The speed, speed error, distance, and distance error of the host vehicle under the three experimental conditions are shown in Figure 12.

From Figure 12b, the maximum speed errors of the DDPG-based and conventional PID are −0.95 m/s and −0.97 m/s, respectively. Compared with the uphill test scenario, the difference in distance error is more pronounced when driving on downward slopes. The maximum distance error of the DDPG-based controller is 0.85 m, while that of the conventional method is 0.93 m. In addition, due to the influence of the road slope, the error curves fluctuate noticeably during 8–14 s.
In the third experiment, there are obvious peaks and troughs in the curves of speed error and distance error where the road adhesion coefficient changes suddenly, as seen in Figure 12c. Due to the low adhesion coefficient of the road and tire skidding, the vehicle speed curve overshoots. The maximum speed error of the DDPG-based PID is 1.38 m/s, and the absolute value of its maximum distance error is 0.77 m. The maximum speed error and the absolute value of the maximum distance error of the conventional PID are 1.46 m/s and 0.90 m, respectively. Owing to its predefined PID control structure, the training process for the agent in DRL-based PID control converges significantly faster than that in pure DRL control [47]. DRL-based PID control achieves a significant improvement over traditional PID control by continuously optimizing the controller parameters [48,49]. It is strongly robust to system disturbances, outperforming a conventional PID controller [50].

Stability Analysis
This section focuses on the stability analysis of homogeneous vehicular platoon control. The variation of the PID parameters (i.e., Kp, Ki, Kd) in the proposed method is shown in Figure 13. Two kinds of stability need to be analyzed for the platoon:

Internal stability. From the above experimental results, when the leading vehicle travels at a constant speed, the distance error between the host vehicle and the preceding vehicle gradually approaches 0, i.e., lim_{t→∞} e_i(t) = 0, which means that internal stability can be guaranteed [40].
String stability. In the experiment, the values of the parameters λ1, λ2, and τ are 0.5, 0.5, and 0.3 s, respectively. Based on Equation (17), for the univariate function f(x) = ax² + bx + c, we can calculate the coefficients a and −b/2a, as shown in Figure 14.
It can be seen that throughout the simulation a > 0, which means that the parabola opens upward. In addition, −b/2a < 0 means that the axis of symmetry lies on the negative half of the coordinate axis. Given λ1 = λ2 = 0.5 and Ki > 0, the minimum value of the quadratic function, f(0) = (1 − λ1)Ki², is always positive. Therefore, |G(jω)| < 1, which satisfies the string stability condition of Equation (19) (see Section 2.3, Case A). In other words, the distance error of the vehicle platooning system is not amplified when transmitted to the following vehicles. In summary, the DDPG-based PID vehicular platoon controller can meet the requirements of both internal stability and string stability.
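The positivity condition in this argument can be checked numerically. The expression f(0) = (1 − λ1)Ki² is taken from the text; the concrete Ki sample values are assumed purely for illustration.

```python
def f_at_zero(ki, lam1=0.5):
    """Value f(0) = (1 - lam1) * Ki^2 from the string-stability
    argument; strictly positive for any Ki != 0 when lam1 < 1."""
    return (1.0 - lam1) * ki**2

# With lam1 = 0.5, f(0) > 0 holds for every positive Ki sampled,
# consistent with the condition |G(jw)| < 1.
assert all(f_at_zero(ki) > 0 for ki in (0.1, 1.0, 5.0))
```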

Control Effect Analysis
The performance of the two controllers is analyzed using two indexes: the maximum speed error and the maximum distance error. The maximum speed error is the maximum speed deviation between the host vehicle and the leading vehicle. The maximum distance error is the maximum deviation between the actual and desired distance between the host vehicle and the preceding vehicle. The comparison results are listed in Table 5. As seen from Table 5, the maximum speed error of the vehicular platoon based on the conventional PID controller is 1.46 m/s, while that of the DDPG-based PID controller is 1.38 m/s, an improvement of 5.48%. In terms of the maximum distance error, the DDPG-based PID controller is 0.13 m less than the conventional PID controller, a reduction of 14.44%. In summary, compared with the conventional PID, the DDPG-based PID not only has better tracking performance, but also guarantees string stability under different working conditions.
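The percentage figures quoted above follow directly from the reported maxima and can be reproduced as:

```python
def reduction_pct(conventional, proposed):
    """Relative reduction of the proposed controller's error with
    respect to the conventional PID, in percent."""
    return (conventional - proposed) / conventional * 100.0

# Maximum speed error: 1.46 m/s (conventional) vs 1.38 m/s (DDPG-based)
print(round(reduction_pct(1.46, 1.38), 2))   # 5.48
# Maximum distance error: 0.90 m (conventional) vs 0.77 m (DDPG-based)
print(round(reduction_pct(0.90, 0.77), 2))   # 14.44
```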

Conclusions
In this paper, we have proposed a DDPG-based PID learning control method, which uses a DDPG algorithm to automatically tune the PID weights for a vehicle platooning system. This method combines the offline learning ability of DRL with the advantages of the simple structure and easy implementation of a traditional controller PID, without relying on any manual tuning. Thus, the problem of insufficient adaptability of the traditional controller is solved. Moreover, compared with a single DRL algorithm, the proposed method has stronger interpretability and stability. The results of three experimental conditions show that the DDPG-based PID controller can meet the requirements of string stability under different road and vehicular platooning acceleration and deceleration conditions. In terms of the maximum speed error, the DDPG-based PID controller is 0.02-0.08 m/s less than the conventional PID controller, with a maximum reduction of 5.48%. In addition, the maximum distance error of the DDPG-based PID controller is 0.77 m, which is 14.44% less than that of the conventional PID controller. It can be seen from the above analysis that the DDPG-based PID controller has stronger robustness.
Future work will focus on optimizing the neural network structure to effectively improve the convergence speed and achieve better performance than the pure DRL algorithm. In addition, the HIL simulation simplifies the road environment conditions; follow-up research can carry out real-vehicle experiments to further verify the stability and reliability of the DDPG-based PID vehicular platoon controller.