Article

Learning-Based Cooperative Adaptive Cruise Control

1 Institute of System Dynamics and Control, Robotics and Mechatronics Center, German Aerospace Center (DLR), Münchener Str. 20, 82234 Wessling, Germany
2 Department of Mechanical Engineering, University of California, Merced, CA 95343, USA
* Author to whom correspondence should be addressed.
Actuators 2021, 10(11), 286; https://doi.org/10.3390/act10110286
Submission received: 17 September 2021 / Revised: 11 October 2021 / Accepted: 21 October 2021 / Published: 26 October 2021
(This article belongs to the Special Issue Vehicle Modeling and Control)

Abstract:
Traffic congestion and the occurrence of traffic accidents are problems that can be mitigated by applying cooperative adaptive cruise control (CACC). In this work, we used deep reinforcement learning for CACC and assessed its potential to outperform model-based methods. The trade-off between distance-error minimization and energy-consumption minimization while still ensuring operational safety was investigated. Alongside a string stability condition, robustness against burst errors in communication was also incorporated, and the effect of preview information was assessed. The controllers were trained using the proximal policy optimization algorithm. A validation by comparison with a model-based controller was performed. The performance of the trained controllers was verified with respect to the mean energy consumption and the root mean squared distance error. In our evaluation scenarios, the learning-based controllers reduced energy consumption in comparison to the model-based controller by 17.9% on average.

1. Introduction

Cooperative adaptive cruise control (CACC) contributes to an efficient, higher-density traffic flow, especially on highways. Furthermore, since human errors are a major reason for traffic accidents, removing the driver from the driving task by using CACC in specific cases improves traffic safety [1]. In addition, the aerodynamic drag is reduced when driving in a vehicle platoon; therefore, CACC-enabled platooning is an effective way to reduce fuel consumption [2].
CACC is an extension of adaptive cruise control (ACC). The basic cruise control (CC) enables the driver to define a certain velocity set point, which is then automatically tracked by the vehicle. ACC extends this approach by detecting obstacles in front of the vehicle and adjusting the velocity accordingly. CACC is a further extension using car-to-car (C2C) communication. With the exchange of information between the vehicles, the motion of the other vehicles can be anticipated. This allows a reduction in the distance to the preceding vehicle, while maintaining driving comfort and safety. For fully autonomous platooning, CACC must be extended by a lateral control that enables lane changing and lane keeping [3].
To allow safe operation, string stability is one of the most important control requirements for CACC [1]. If the platoon is string stable, a fast deceleration of a preceding vehicle and a diminished distance to the following vehicle are not amplified downstream [4]. If this is not the case, the amplification may cause a following vehicle further downstream to stop entirely or crash into the preceding vehicle.
In the past years, there have been great advances in the development of artificial intelligence, especially regarding deep reinforcement learning (deep RL, DRL) [5]. In RL, an agent is trained by interaction with an environment such that a given goal is achieved. In DRL, this agent is approximated with a neural network. Many of these recent advances rely on model-free approaches, in which the agent does not have access to the environment’s mathematical model.
The application of DRL to highly complex games (Atari [6], Go [7], Dota 2 [8]) has especially sped up the research, but DRL is also applied to control problems ranging from lateral and longitudinal path following control [9] to power grid control [10], to active flow control [11], and to traffic signal control [12].
Classical control methods can already tackle multiple aspects of CACC. Model predictive control (MPC) was used to create energy-efficient acceleration profiles for platoon vehicles that also consider the string stability of the platoon [13]. Communication losses were considered for CACC control by using, e.g., a Kalman filter to estimate the preceding vehicle’s acceleration during short intervehicle communication outages [14].
However, the incorporation of a string stability condition may lead to higher complexity when tuning the MPC controller [15]. In addition, larger prediction horizons lead to an increased computational effort, which lowers the usability of MPC controllers for real-world applications [16].
Compared to classical control methods, the model-free DRL’s robustness may be difficult to guarantee [17], but there are also multiple advantages, including:
  • There is no need to create a (simplified) mathematical system model, as the DRL agent can directly learn from a real system or a simulated high-fidelity model [18];
  • All relevant information can be directly fed into the agent’s state vector, and the system’s constraints are learned by the agent’s neural network [6,19], both without necessarily creating preprocessing structures, and;
  • The deployment of the trained agent’s neural network is computationally efficient [16].
For CACC, this means that DRL has the potential to outperform classical methods regarding the sim-to-real-gap, the application complexity for the control engineer, as well as the required performance of the deployment hardware.
DRL has already been assessed for some aspects of CACC. In [16], the authors investigated DRL for CC and compared it to MPC regarding predictive velocity tracking. It was shown to achieve similar performance with a computational effort up to 70-fold lower. The authors of [20] compared DRL to MPC for ACC and also found it to be comparable in performance. DRL provided lower costs (combined penalized error, control signal amplitude, and vehicle jerk) than MPC when facing model uncertainties (in the context of sim-to-real-gap) due to its generalization capabilities.
Reference [21] applied DRL on the ACC as well as the CACC problem, and studied the effect of discrete control outputs on a set point distance. However, the discrete control outputs led to intense oscillations in the velocity and acceleration trajectory. The authors of [22] augmented the approach by using continuous actions. To cope with possible issues regarding safety and robustness, they also investigated the possibility to replace a DRL agent that directly controlled the vehicle with a model-based DRL agent that only controlled the headway parameter of a model-based controller. They concluded that the model-based DRL agent was safer and more robust.
In [23], a supervisor network was used to augment an RL agent to increase its performance. The combined output of the supervisor and the agent was applied to the CACC system, which led to a reduced distance error compared to a linear controller.
In addition, a multiagent RL (MARL) can be used to consider the dynamics of a whole vehicle platoon already during training [24].
The main contribution of this work is the application of DRL to CACC with continuous action and state space, while also considering
  • Energy minimization;
  • A string stability constraint;
  • Preview information in the communication, and;
  • The effect of burst errors on the communication.
Previous research examined some of these aspects (e.g., energy minimization was considered in [19], preview information in [16], and communication errors in [24]), but not the combination of all four. To the authors' best knowledge, this is one of the first works to apply a string stability constraint in the presented way to DRL-based CACC. Additionally, a high model complexity was incorporated: a high-fidelity model was used to model energy losses in the powertrain, and the platoon leader vehicle's motion was modeled according to a real-world dataset. This work is based on the unpublished master's thesis of one of the authors [25].
The remainder of this work is organized as follows: in Section 2, the CACC problem and the platoon model are introduced. Section 3 explains the basics of DRL and the application to the CACC problem. In Section 4, the training of the DRL agents is described, and the comparison to a model-based controller is carried out. Section 5 summarizes the findings and gives an outlook.

2. Problem Formulation

The aim of CACC is to ensure that the vehicles that form a platoon:
  • Stay at a safe distance from each other to prevent collisions, but;
  • Drive close enough to the preceding vehicle such that the air drag is decreased (thereby decreasing energy consumption) and highway capacity is increased;
  • Consider passenger comfort by reducing vehicle jerk, and;
  • Drive in a string-stable way.
Figure 1 shows the follower vehicle with index i driving behind the preceding vehicle with index i − 1, separated by the bumper-to-bumper distance d i . Both feature their corresponding position x i , velocity v i , and acceleration a i . The entire platoon is composed of m vehicles.
The target distance d_i^* to the preceding vehicle may be described via a constant spacing or a constant time-headway [1], with the constant time-headway being defined as [26]:
$d_i^* = r_{c,i} + t_{h,i} \, v_i,$  (1)
where r c , i denotes the distance at standstill and t h , i v i denotes the variable distance, adjusted by the time-headway t h , i . The deviation from the target distance is described by the distance error:
$e_i = d_i - d_i^*.$  (2)
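As a minimal illustration of the spacing policy (1) and the distance error (2), the following Python sketch computes both quantities; the parameter values are illustrative placeholders, not the values of Table 1.

```python
def target_distance(v_i, r_c=2.0, t_h=0.5):
    """Constant time-headway spacing policy, Eq. (1): standstill distance
    plus a velocity-dependent term. r_c and t_h are placeholder values."""
    return r_c + t_h * v_i

def distance_error(d_i, v_i, r_c=2.0, t_h=0.5):
    """Distance error, Eq. (2): deviation of the actual bumper-to-bumper
    distance from the target distance."""
    return d_i - target_distance(v_i, r_c, t_h)

# Example: at 20 m/s with a 13 m gap, the actual gap is 1 m larger than the target.
print(distance_error(d_i=13.0, v_i=20.0))  # 1.0
```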
Analyzing string stability is more difficult in DRL than in a model-based approach, because the DRL problem is typically not designed in the frequency domain, where string stability is usually easier to analyze [26,27]. In [28], it was shown that string stability can be guaranteed as follows:
$\sup_{\omega} \left| \frac{a_i(j\omega)}{a_{i-1}(j\omega)} \right| \leq 1, \quad 2 \leq i \leq m,$  (3)
where a ( j ω ) is the acceleration of the respective vehicle in frequency domain.

2.1. Communication Model

Regarding the platooning C2C information flow, a predecessor-following topology (cf. [3]) was adopted: in such a topology, information is sent one way from the preceding vehicle to the following vehicle (cf. Figure 1).
For modeling the communication, an appropriate distance for transmitting data and a sufficient signal bandwidth were assumed. For a specific prediction horizon N_k, the preceding vehicle i−1 transmits its current acceleration and its predicted accelerations over an N_k-window, a_transmit,i−1,k^T = [a_{i−1,k}, a_{i−1,k+1}, …, a_{i−1,k+(N_k−1)}]. For N_k = 1, only the current acceleration is transmitted. The follower vehicle receives the delayed accelerations ã_transmit,i,k^T = [ã_{i,k}, ã_{i,k+1}, …, ã_{i,k+(N_k−1)}] with an assumed delay of Δ_i for vehicle i (cf. Table 1).
In this communication, burst errors due to noise may occur, in which case a sequence of successive transmissions ã_transmit,i,k is not received. The burst errors are modeled with a Markov chain according to the Gilbert–Elliot model [29] (cf. [30], Figure 2). The Markov chain consists of a state r, in which transmissions can be received, and a state l, in which transmissions are unavailable. According to the Markov property, the transition to the next state only depends on the present state. This means that the chain remains in state r with probability p = p_r, 0 < p_r < 1, or leaves it with p = 1 − p_r. The same holds true for state l with p = p_l and 0 < p_l < p_r. The Markov chain was implemented as an automaton, starting in the receiving state. The probabilities used for the automaton in case of the occurrence of burst errors were taken from [30]. They parametrize a low communication quality, and can be found in Table 2. A parametrization for perfect communication quality is also given.
Reference [30] suggested the use of a buffer when a sequence of prediction values is transmitted. The buffer is referred to as ã_buffer,i,k. For each time step k, the received transmitted acceleration ã_transmit,i,k is saved in the buffer: ã_buffer,i,k = ã_transmit,i,k. If a burst error occurs, no ã_transmit,i,k is received. In this case, the first element of the buffer no longer contains valid information, and the resulting gap in the buffer is filled with a placeholder a_invalid (cf. Table 1). a_invalid is an invalid acceleration value that was chosen to be large enough not to be attainable in our scenario, such that it could not occur in regular communication and was interpretable as a communication error by the DRL agent.
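The following Python sketch illustrates the two-state Gilbert–Elliot automaton and the prediction buffer described above. The transition probabilities, the buffer length, and the invalid-value marker are illustrative stand-ins for the entries of Table 1 and Table 2, not the values used in this work.

```python
import random

A_INVALID = 1e3  # placeholder for the unattainable acceleration marker a_invalid

class GilbertElliotChannel:
    """Two-state Markov chain: state 'r' receives transmissions, state 'l' loses them."""
    def __init__(self, p_r=0.95, p_l=0.7, seed=None):
        self.p_r, self.p_l = p_r, p_l      # probabilities of staying in 'r' / 'l'
        self.state = "r"                   # automaton starts in the receiving state
        self.rng = random.Random(seed)

    def step(self):
        stay = self.p_r if self.state == "r" else self.p_l
        if self.rng.random() >= stay:      # leave the current state with probability 1 - p
            self.state = "l" if self.state == "r" else "r"
        return self.state == "r"           # True if the transmission is received

def update_buffer(buffer, received, a_transmit):
    """Keep a sliding buffer of N_k predicted accelerations. On a burst error the
    expired entry is dropped and the resulting gap is marked with the placeholder."""
    if received:
        return list(a_transmit)            # fresh prediction window overwrites the buffer
    return buffer[1:] + [A_INVALID]        # shift out the stale element, mark the gap

# Example: simulate a few steps with a constant predicted acceleration window (N_k = 5).
channel = GilbertElliotChannel(seed=0)
buffer = [0.0] * 5
for k in range(10):
    buffer = update_buffer(buffer, channel.step(), a_transmit=[0.3] * 5)
    print(k, buffer)
```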

2.2. Vehicle Model

The single vehicle was created as a two-track model in Modelica [31] using Dymola [32] and the planar mechanics library [33]. Modelica is an object-oriented open source modeling language for multiphysical (e.g., mechanical and electrical) systems. The vehicle model represented the properties of the ROboMObil (ROMO) [34], which is the robotic full x-by-wire research vehicle of the German Aerospace Center (cf. Figure 1 for two ROMOs forming a platoon). The center of gravity (used to calculate the vehicle’s position x i ; cf. Figure 1) was assumed to be at the geometric center of the rectangle spanned by the track width and the wheelbase.
As wheel models, four dry-friction slip-based wheels of the planar mechanics library [33] were used and connected to fixed translation elements. The slip-based wheels limited the amount of propulsion force dependent on the normal force and the friction coefficient.
The energy consumption of the single vehicle mainly depended on the variable air drag and the electric powertrain.
ROMO's frontal area A_x, air drag coefficient c_w, and the air density ρ_air were chosen according to [35] (cf. Table 1). Reference [36] stated that the air drag coefficient is dependent on the distance d = d_i to the preceding vehicle: when driving close to the preceding vehicle, the air drag coefficient is reduced. This reduced air drag leads to less power being necessary for driving. To model the variable air drag, a nonlinear approximation was used, as suggested by [37]:
$c_d(d) = c_w \left( 1 - \frac{c_{d,1}}{c_{d,2} + d} \right),$  (4)
where c d , 1 and c d , 2 are variable air drag function coefficients. The parameters were obtained by performing a linear regression on the experimental data given in [38]. The variable air drag force can then be calculated as:
$F_{\mathrm{aer}}(d, v) = \frac{1}{2} \, c_d(d) \, \rho_{\mathrm{air}} \, A_x \, v^2.$  (5)
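A short Python sketch of the distance-dependent drag model in (4) and (5); the coefficient values below are placeholders rather than the regression results obtained from [38] or the ROMO parameters of Table 1.

```python
RHO_AIR = 1.2            # air density in kg/m^3 (placeholder)
A_X = 2.0                # frontal area in m^2 (placeholder)
C_W = 0.35               # nominal air drag coefficient (placeholder)
C_D1, C_D2 = 4.0, 10.0   # variable air drag function coefficients (placeholders)

def drag_coefficient(d):
    """Distance-dependent air drag coefficient, Eq. (4): the closer the follower
    drives to the preceding vehicle, the smaller the coefficient."""
    return C_W * (1.0 - C_D1 / (C_D2 + d))

def drag_force(d, v):
    """Aerodynamic drag force, Eq. (5)."""
    return 0.5 * drag_coefficient(d) * RHO_AIR * A_X * v ** 2

# Driving 5 m behind the leader instead of 40 m reduces the drag force at 25 m/s:
print(drag_force(40.0, 25.0), drag_force(5.0, 25.0))
```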
The battery of one single vehicle [39] consists of n_cells,s cells connected in series (cf. Table 1). The total pack voltage U_batt and total pack resistance R_batt were assumed to be constant during experiments (at 100% state-of-charge). To account for battery losses, we considered Ohmic heat losses in each cell (each cell with respective voltage U_cell and resistance R_cell).
Each of the four wheels was powered by an in-wheel motor, which was modeled as an open-loop controlled permanent magnet synchronous machine [35,39,40]. This allowed us to calculate the vehicle’s power consumption P = P i as:
$P = P_{\mathrm{mech}} + P_{\mathrm{losses}},$  (6)
where P_mech is the summed mechanical power of all electric motors. For each j-th motor (j ∈ {1, 2, 3, 4}), P_mech,j = M_j ω_m,j, with motor torque M_j and mechanical rotor angular velocity ω_m,j. P also considered the powertrain losses P_losses, which consisted of the Ohmic heat losses in the battery and the losses in the electric motors (i.e., inverter losses, copper losses, iron losses, and mechanical friction losses).
A proportional controller was used as the low-level acceleration control (with a high gain k M ), which controlled the torque set point M * for the electric motors. The torque was limited to ± M nom (nominal torque, cf. Table 1), so that the physical constraints of the system were not violated. Additionally, a first-order low pass filter was applied to the acceleration control’s output with a time constant   T M to achieve an actuator lag.
The block diagram in Figure 2 gives an overview of the used vehicle model with respect to the RL framework discussed in Section 3 (cf. Section 3.2 for more information on action u , denormalized neural network output u ˜ , and the respective range, rate and string stability limits; cf. Section 3.3 for more information on feedback state s ).

2.3. Leader Vehicle Motion Model

For modeling the platoon's leader vehicle motion, trajectory data from the Next Generation Simulation (NGSIM) dataset was used [41]. In this dataset, positions of vehicles were measured with multiple synchronized cameras at different locations in the USA (Emeryville's Interstate 80; Los Angeles' Route 101 and Lankershim Boulevard; and Atlanta's Peachtree Street) at two to three specific times of the day. In doing so, the dataset aimed to portray different amounts of traffic and thereby different vehicle trajectories during the buildup of congestion, during the transition between congested and uncongested traffic, as well as during full congestion.
The NGSIM data must be postprocessed due to measurement errors [42] to create consistency between velocity and acceleration signals. In this work, this was done in a multistep approach by:
  • Removing acceleration outliers ( |a| > 30 m/s² ) by applying a natural cubic spline interpolation on the velocity profile;
  • Reducing the velocity profile noise by applying a first-order low-pass Butterworth filter (cutoff frequency of 0.5 Hz); and
  • Removing implausible accelerations ( a > 5 m/s² or a < −8 m/s² ) also by applying a natural cubic spline interpolation on the velocity profile.
Finally, if implausible accelerations were still present or if velocities higher than ROMO's nominal velocity limit v_nom = 27.4 m/s occurred, the trajectory was removed from the set of valid trajectories. After doing so, the resulting set consisted of 312 trajectories. It was permuted and then split into training (70%) and validation (30%) sets.
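A sketch of the three-step postprocessing with SciPy, assuming a velocity profile sampled at a fixed rate; the thresholds follow the text, while the sampling rate, the outlier-replacement details, and the zero-phase filtering (filtfilt) used here are simplifying assumptions, not the exact implementation of this work.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import butter, filtfilt

def preprocess_velocity(t, v, fs=10.0):
    """Postprocess an NGSIM velocity profile: remove acceleration outliers via a
    natural cubic spline, low-pass filter the velocity, and recheck plausibility."""
    a = np.gradient(v, t)

    # 1) Replace samples with |a| > 30 m/s^2 by a natural cubic spline over the rest.
    ok = np.abs(a) <= 30.0
    v = CubicSpline(t[ok], v[ok], bc_type="natural")(t)

    # 2) First-order low-pass Butterworth filter with a 0.5 Hz cutoff frequency.
    b, a_coef = butter(N=1, Wn=0.5, btype="low", fs=fs)
    v = filtfilt(b, a_coef, v)

    # 3) Remove remaining implausible accelerations (a > 5 or a < -8 m/s^2) the same way.
    a = np.gradient(v, t)
    ok = (a <= 5.0) & (a >= -8.0)
    v = CubicSpline(t[ok], v[ok], bc_type="natural")(t)
    return v, np.gradient(v, t)

# Example with synthetic data sampled at 10 Hz:
t = np.arange(0.0, 60.0, 0.1)
v_raw = 15.0 + 2.0 * np.sin(0.2 * t) + 0.3 * np.random.randn(t.size)
v_clean, a_clean = preprocess_velocity(t, v_raw, fs=10.0)
```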

3. Reinforcement-Learning-Based Cooperative Adaptive Cruise Control

In this work, we used reinforcement learning to solve the control problem. A Markov decision process (MDP) described the probabilistic basis of RL, and was used to derive the algorithms to find the optimal action in each time step. An MDP consists of an agent and the environment, as shown in Figure 3. In each discrete time step k, the agent perceives the current state s_k ∈ S (S being the set of all valid states) of the environment and performs the corresponding action u_k ∈ A (A being the set of all valid actions) [5]. If a reward R_{k+1} is assigned according to the action u_k and the resulting state s_{k+1}, the MDP can be called a Markov reward process (MRP) [43]. The MRP is defined to begin with state s_0 and action u_0, but without an initial reward. The first reward R_1 is assigned after the transition to the state s_1. From a control theory perspective, the agent can be seen as the controller, the environment as the plant, and the action as the control signal.
A sequence beginning with s_0, u_0, R_1, s_1, u_1, R_2, s_2, u_2, and ending with a terminal state s_T is called an episode τ [5]. After reaching s_T, no further action can be taken and a reset to a starting state occurs. In this work, a fixed interval between action and new state Δt = Δ_i (cf. Table 1) was assumed, yielding the time t = k Δt with k ∈ ℕ.
For the MDP, the Markov property also held true (cf. Section 2.1).

3.1. Policy Optimization

The policy π describes which action an agent takes given a certain state. This can be described with a stochastic policy π: A × S → [0, 1] as in [43]:
$u_k \sim \pi(\,\cdot \mid s_k\,).$  (7)
The agent tries to maximize the discounted return G_k, which is the sum of rewards that are received until the terminal state is reached, each multiplied by a discount factor γ ∈ [0, 1] [5]:
$G_k \doteq R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \dots + \gamma^{T-k-1} R_T.$  (8)
In the case of γ = 1 , the discounted return was just referred to as return in this work. The state-value function is defined as the expected discounted return for a state s when a given policy π is applied:
$V_\pi(s) \doteq \mathbb{E}_\pi[\, G_k \mid s_k = s \,], \quad \text{for all } s \in S.$  (9)
The goal of RL is to find the optimal policy π * that optimizes the expected discounted return for a specific episode τ (with the actions of τ being taken according to π ):
$\pi^* = \arg\max_\pi \, \mathbb{E}_{\tau \sim \pi}[G(\tau)].$  (10)
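For clarity, a small Python sketch of the discounted return (8), which also serves as a Monte Carlo target for the state-value function in (9); the reward values are arbitrary illustrations.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G_k from Eq. (8), computed backwards over one episode.
    rewards[j] corresponds to R_{k+1+j}; gamma = 1 gives the undiscounted return."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))  # returns[j] is G_{k+j}

# Example: three transitions with rewards -1, -2, -3 and gamma = 0.9
print(discounted_return([-1.0, -2.0, -3.0], gamma=0.9))  # approximately [-5.23, -4.7, -3.0]
```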
Regarding DRL, this work used the proximal policy optimization (PPO) algorithm with clipping [44,45] to calculate neural-network-based approximations of π* and V_π. The weights of the neural network [46] π are referred to as θ, with s_k ∈ S ⊆ ℝⁿ and u_k ∈ A ⊆ ℝᵐ.
PPO uses gradient ascent to optimize θ, such that the obtained discounted returns following a state-action tuple (s, u) are more probable. These obtained discounted returns are described with the estimated advantage Â_{k,θ} = G_k − V_π(s_k) [5,44,47]. PPO does not directly optimize the policy, but it uses an objective function L(s, u, θ_i, θ, Â_{θ_i}) [44]:
$\theta_{i+1} = \arg\max_\theta \, \mathbb{E}_{s,u \sim \pi}\left[\, L(s, u, \theta_i, \theta, \hat{A}_{\theta_i}) \,\right],$  (11)
with:
$L(s, u, \theta_i, \theta, \hat{A}_{\theta_i}) = \min\!\left( r_o(\theta) \, \hat{A}_{\theta_i}, \; g(r_o(\theta), \epsilon) \, \hat{A}_{\theta_i} \right),$  (12)
where r_o(θ) is the probability ratio r_o(θ) = π(u|s, θ) / π(u|s, θ_i) and g is the clipping function. The clipping function limits the changes of r_o(θ) to [1 − ϵ, 1 + ϵ]. Thereby, the objective function L creates an incentive for only small changes of θ, which increases training stability compared to a basic DRL algorithm (in this work, training stability meant a monotonic increase of the obtained average returns during training).
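A minimal NumPy sketch of the clipped surrogate objective in (12). In the actual training, this term is evaluated over minibatches of transitions and maximized with the Adam optimizer via automatic differentiation, which is omitted here; the numbers in the example are arbitrary.

```python
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate from Eq. (12) for a batch of transitions.
    r_o(theta) is the probability ratio pi(u|s, theta) / pi(u|s, theta_i)."""
    ratio = np.exp(log_prob_new - log_prob_old)          # r_o(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)       # g(r_o(theta), eps)
    return np.mean(np.minimum(ratio * advantage, clipped * advantage))

# Example: two transitions, one with positive and one with negative advantage.
print(ppo_clip_objective(
    log_prob_new=np.array([-0.5, -1.0]),
    log_prob_old=np.array([-0.7, -0.8]),
    advantage=np.array([1.5, -2.0]),
))
```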
The basic principle of operation of PPO is shown in Algorithm 1. As for the policy, the update of the value function V π (step 11) can be performed via gradient descent, such that the objective function L is optimized. Policy and value function updates are repeated in multiple passes until the trained policy’s behavior converges, which means until the obtained returns can approximately no longer be increased.
Algorithm 1. Basic PPO principle of operation [44,45,47].
1:  Input: initial policy parameters θ, initial state-value parameters
2:  repeat:
3:    for each process 𝒫i with i = 1, …, n𝒫:
4:      Generate T𝒫 transitions {s_k, u_k, R_{k+1}} with π(·|·)
5:      for each transition k with k = 1, …, T𝒫:
6:        Compute G_k
7:        Â_{k,θ} = G_k − V_π(s_k) ¹
8:      end for
9:    end for
10:   Update the policy by maximizing the objective function L in n_e training epochs, sampling a minibatch (with n𝒫·T𝒫/n_mini transitions) in each epoch
11:   Update the value function V_π
12: until convergence
13: Output: trained policy π, trained state-value function V_π
¹ Different advantage term calculations are possible.
The computational time of the training is decreased by using multiprocessing with n𝒫 processes. For this, each process 𝒫i is used to generate T𝒫 transitions { s k , u k , R k + 1 } with a shared policy π ( | ) .
The policy is updated in n e passes (called training epochs) for each pass of the PPO algorithm. For each update of θ in an epoch, a minibatch consisting of multiple transitions is sampled. The minibatch is a subset of the batch of n𝒫·T𝒫 transitions that is available in each pass of the PPO algorithm. The size of the minibatch is defined by dividing the size of the batch by the minibatch scaling factor n mini . The number of passes until the algorithm converges is estimated by defining a fixed training length, which means the total number of generated transitions after which the training is ended.

3.2. Actions

The agent controls a follower vehicle (i.e., it interacts with the environment) by setting a target acceleration u_{i,k}. To improve passenger comfort, a jerk limit is introduced to enforce action smoothness:
$|\Delta u_{i,k}| \leq \Delta u_{\mathrm{lim}},$  (13)
with Δu_lim = 0.5 m/s² and Δu_{i,k} = ũ_{i,k} − u_{i,k−1}. The action ũ_{i,k} is the denormalized output of the policy π_i. Furthermore, only accelerations that are within the physical limits of the car are allowed to be set:
$-8 \ \mathrm{m/s^2} \leq u_{i,k} \leq 5 \ \mathrm{m/s^2}.$  (14)
In this work, the following two assumptions were made:
  • a_{i,k} ≈ u_{i,k},
  • a_{i−1,k} ≈ ã_{i,k}.
To satisfy assumption 1, the low-level acceleration controller was assumed to provide an accurate tracking performance. To satisfy assumption 2, the acceleration of the preceding vehicle a_{i−1,k} was assumed to be approximately equal to the delayed acceleration ã_{i,k} that was received by the follower vehicle (cf. Section 2.1). When we transferred the string stability concept of (3) to time domain with:
$|a_{i,k}| < \gamma \max_{k_{\mathrm{ss}} \in [k - N_{\mathrm{ss}}, \, k]} \left\{ |a_{i-1, k_{\mathrm{ss}}}| \right\},$  (15)
this allowed us to represent the string stability condition as:
$|u_{i,k}| < \gamma \max\left\{ u_{\mathrm{lim,min}}, \; \max_{k_{\mathrm{ss}} \in [k - N_{\mathrm{ss}}, \, k]} \left\{ |a_{i-1, k_{\mathrm{ss}}}| \right\} \right\} = \gamma \, u_{\mathrm{ss},i,k}.$  (16)
In (15), the acceleration of the follower vehicle a_i must be smaller than γ ∈ (0, 1) times the maximum acceleration of the preceding vehicle within a string stability time window N_ss for every time step k. Thus, the follower vehicle is assumed to mimic the driving behavior of the preceding vehicle. If it does so and the accelerations are mostly attenuated downstream, string stability is guaranteed to a given extent. The transfer of the string stability concept (3) to time domain (15) was proposed in a similar way in [28].
Equation (16) makes the implementation of a string-stability condition easier, as it is not necessary to abort the episode due to a state constraint violation: only an additional control limit must be added. The resulting maximum from (16) is referred to as the string-stability limit u_{ss,i,k}. Furthermore, only valid values of ã_transmit,i,k are taken into consideration, and u_{ss,i,k} cannot be below u_{lim,min} = 0.1 m/s². u_{lim,min} gives the agent the possibility to counteract slight perturbations.
In this work, N ss = 20 and γ = 0.999 were chosen. With the fixed time interval Δ t (cf. Table 1) between two steps k and k 1 , this meant that the current acceleration and the received accelerations from the past two seconds were taken into consideration.
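A Python sketch of the action post-processing described in (13)–(16): the denormalized policy output is first rate-limited, then clipped to the physical acceleration range, and finally bounded by the string-stability limit computed from the valid received accelerations. The invalid-value marker is a placeholder, and, for brevity, the valid entries of a single buffer stand in for the N_ss-step history of received accelerations used in this work.

```python
A_INVALID = 1e3        # placeholder marker for invalid buffer entries (cf. Section 2.1)

def string_stability_limit(a_window, u_lim_min=0.1):
    """u_ss from Eq. (16): the largest valid absolute acceleration received from the
    preceding vehicle within the string-stability window, at least u_lim_min."""
    valid = [abs(a) for a in a_window if abs(a) < A_INVALID]
    return max([u_lim_min] + valid)

def limit_action(u_tilde, u_prev, a_window,
                 du_lim=0.5, u_min=-8.0, u_max=5.0, gamma_ss=0.999):
    """Apply the jerk limit (13), the range limit (14), and the
    string-stability limit (16) to the denormalized policy output."""
    # Jerk limit: restrict the change with respect to the previous control value.
    u = u_prev + max(-du_lim, min(du_lim, u_tilde - u_prev))
    # Physical acceleration range of the vehicle.
    u = max(u_min, min(u_max, u))
    # String-stability limit: stay below gamma times the preceding vehicle's
    # maximum valid absolute acceleration within the window.
    u_ss = gamma_ss * string_stability_limit(a_window)
    return max(-u_ss, min(u_ss, u))

# Example: an aggressive policy output is attenuated by the received leader accelerations.
print(limit_action(u_tilde=3.0, u_prev=0.2, a_window=[0.4, -0.6, A_INVALID]))
```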

3.3. Feedback State

The design of the CACC DRL environment considers a two-vehicle platoon. The motion of the first vehicle—the leader—followed a reference trajectory obtained from the NGSIM dataset (cf. Section 2.3). The motion of the second vehicle—the follower—was controlled by the DRL agent. The follower vehicle was referred to with index i = 2 , the leader vehicle with index i = 1 . The agent state vector s 2 , k of the follower vehicle for time step k was chosen as
$s_{2,k} = \left[\, v_{2,k}, \; a_{2,k}, \; d_{2,k}, \; \Delta v_{2,k}, \; e_{2,k}, \; P_{2,k}, \; u_{2,k-1}, \; u_{2,k-2}, \; \tilde{a}_{\mathrm{buffer},2,k}, \; u_{\mathrm{ss},2,k} \,\right]^T.$  (17)
The states were partially chosen with respect to (1), with v 2 , k being the velocity of the vehicle and a 2 , k being the acceleration of the vehicle. d 2 , k is the distance to the preceding vehicle:
$d_{2,k} = x_{1,k} - x_{2,k} - \frac{l_1}{2} - \frac{l_2}{2},$  (18)
with x_{i,k} being the position of the respective vehicle and l_i its respective vehicle length. Δv_{2,k} = v_{1,k} − v_{2,k} is the discrete approximation of the distance dynamics ḋ_2 at time t = k Δt. e_{2,k} is the distance error (cf. (2)) and P_{2,k} the power consumption at time step k (cf. (6)). Even though we did not investigate if this state selection fulfilled the Markov property, assembling the feedback state this way provided the agent with rich information about the environment.
The state s i , k was augmented with the history of past control values (as suggested by [48]) for learning a smooth control action. For this purpose, the actions from the prior two time steps u 2 , k 1 and u 2 , k 2 were provided.
The state s 2 , k   was normalized before it was fed into the policy π 2 to ensure the agent considered each state component equally.
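A sketch of how the feedback state (17) could be assembled and normalized before being passed to the policy. The dictionary keys and the per-component scales below are illustrative assumptions, not identifiers or constants from the actual implementation.

```python
import numpy as np

def build_state(obs, a_buffer, u_ss):
    """Assemble and normalize the follower's feedback state s_2,k from Eq. (17)."""
    s = np.array([
        obs["v"],        # velocity v_2,k
        obs["a"],        # acceleration a_2,k
        obs["d"],        # distance d_2,k
        obs["dv"],       # relative velocity (discrete distance dynamics)
        obs["e"],        # distance error e_2,k
        obs["P"],        # power consumption P_2,k
        obs["u_prev1"],  # action u_2,k-1
        obs["u_prev2"],  # action u_2,k-2
        *a_buffer,       # buffered (possibly delayed) leader accelerations
        u_ss,            # string-stability limit u_ss,2,k
    ])
    # Placeholder per-component scales so each entry has a comparable magnitude.
    scale = np.array([30.0, 8.0, 50.0, 5.0, 50.0, 5e4, 8.0, 8.0]
                     + [8.0] * len(a_buffer) + [8.0])
    return s / scale

# Example with a prediction horizon of N_k = 2:
obs = dict(v=20.0, a=0.3, d=12.5, dv=-0.4, e=0.5, P=8e3, u_prev1=0.2, u_prev2=0.1)
print(build_state(obs, a_buffer=[0.3, 0.4], u_ss=0.6))
```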

3.4. Reward Function

The control goal of string stability is targeted by the action limit (16), while the jerk limit (13) ensures at least a certain degree of passenger comfort. The control objective of maintaining a safe distance while reducing air drag is incorporated into the CACC environment through the reward function.
In addition to the already constrained change in control value Δu_{i,k} (cf. (13)), Δu_{i,k} was also penalized in the reward function to further enhance passenger comfort.
The reward function was defined in such a way that it was always negative:
$r(s_{i,k}, \tilde{u}_{i,k}) = \begin{cases} R_{\mathrm{abort}}, & \text{if episode is aborted} \\ -\left( w_e k_e |e_{i,k}| + w_P k_P |P_{i,k}| + w_{\Delta u} k_{\Delta u} |\Delta u_{i,k}| \right), & \text{else.} \end{cases}$  (19)
If the episode was not aborted, the weighted ( w ) and normalized ( k ) error e i , k , power consumption P i , k and action change Δ u i , k created the achieved reward. The normalized state components and actions were used.
Several conditions led to a premature end of an episode and were related to safety and error constraints. To not crash into the preceding vehicle, the distance must always be larger than 0 m. Additionally, we chose an upper distance limit of 50 m:
$0 \ \mathrm{m} < d_{i,k} < 50 \ \mathrm{m}.$  (20)
At the same time, this limited the maximum platoon length, thereby increasing the road capacity. To increase safety, the relative velocity also was limited:
$-5 \ \mathrm{m/s} < v_{i-1,k} - v_{i,k} < 5 \ \mathrm{m/s}.$  (21)
If the episode was aborted, a negative reward R_abort was assigned. The trained state-value function V_π contained the information on whether specific states most likely were going to lead to a premature end of an episode (leading to a low return). Via the estimated advantage Â_{k,θ}, this was learned by the agent during training (cf. (11)). This meant, for example, that the agent already attempted to avoid distances d_{i,k} that were still positive but close to 0 m.
Maximizing the reward function yielded the desired behavior: error minimization, minimization of the power consumption, and an enhanced ride comfort by providing a smooth target acceleration signal. Two different sets of reward weights were heuristically determined and are listed in Table 3. The error-minimizing weight set (EM-RL) focused on minimizing the error e_{i,k}, while the power-minimizing weight set (PM-RL) also explicitly considered the minimization of the absolute power consumption |P_{i,k}|. The minimization of |P_{i,k}| aimed at reducing the consumed energy E_i at the end of the episode τ_i: $E_i = \int_0^T P_i(t) \, dt$. Even though recuperation was possible, every acceleration and deceleration led to additional losses. Therefore, the less acceleration and deceleration was performed, the less powertrain losses occurred (smaller |P_{i,k}|), which yielded a smaller energy consumption E_i.
In case of PM-RL, the weight w e for the error penalization was reduced to enable the agent to focus less on the error e i , k , and thereby to deliberately deviate from the set point distance d i , k * to reduce E i . In the case of EM-RL, the weight w P was set to zero, which was why the power was not explicitly considered.
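A Python sketch of the reward (19) together with the abort conditions (20) and (21); the weights and normalization factors below are placeholders and do not correspond to the EM-RL or PM-RL values of Table 3. The abort penalty of −1000 is the value mentioned in Section 4.1.

```python
R_ABORT = -1000.0   # penalty for a premature episode end (cf. Section 4.1)

def episode_aborted(d, v_rel):
    """Safety and error constraints (20), (21): distance and relative-velocity limits."""
    return not (0.0 < d < 50.0) or not (-5.0 < v_rel < 5.0)

def reward(e, P, du, d, v_rel,
           w_e=1.0, w_P=0.0, w_du=0.1,                     # placeholder weights (cf. Table 3)
           k_e=1.0 / 50.0, k_P=1.0 / 5e4, k_du=1.0 / 8.0):  # placeholder normalizations
    """Reward function from Eq. (19): always non-positive, penalizing the distance
    error, the power consumption, and the change in control value."""
    if episode_aborted(d, v_rel):
        return R_ABORT
    return -(w_e * k_e * abs(e) + w_P * k_P * abs(P) + w_du * k_du * abs(du))

# Example: a small distance error with the power weight set to zero (EM-RL-like placeholder).
print(reward(e=0.5, P=8e3, du=0.2, d=12.0, v_rel=-0.3))
```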

4. Simulative Assessment

PPO was deployed with a policy network of two hidden layers with 64 neurons each, with the tanh(·) activation function, and using the Adam optimizer [49]. The default hyperparameters [45] were used, and n𝒫 = 4 parallel processes 𝒫 were utilized to achieve a lower computational time of the training. The minibatch scaling factor was increased from the default value n_mini = 4 to n_mini = 16 to create smaller minibatches and thereby further reduce the computational time (cf. Algorithm 1 and its description in Section 3.1). Since the default hyperparameters provided good training results, no further effort to optimize the hyperparameters was spent.
Training with the EM-RL reward function was performed with a batch size of 4⋅128 steps (T𝒫 = 128 steps per process were used, as defined as default in [45]) and a training length of 2 × 10⁶ time steps. Training with the PM-RL reward function was more complicated due to the added control goal of power minimization. This is why the batch size was increased to 4⋅256 steps (T𝒫 = 256) to stabilize the training. On the one hand, this led to a lowered sample efficiency, as more samples needed to be assessed before the policy was updated. On the other hand, the information from more samples was averaged before an update, leading to the policy being less prone to converge to local optima. As this also slowed the training process, the training length was increased to 4 × 10⁶ time steps. See Appendix C for information on average run times.
The training outcome changed based on the initial random seeds (i.e., numbers used to initialize pseudorandom number generators), which affected the randomly assigned initial neural network weights, as well as the environment initialization. A reproducible training setup is a known issue in DRL [50]; to this end, several seeds should be used in each training. The authors of [50] suggested the use of at least five seeds; in this work, nine differently seeded runs were performed for each reward parametrization and communication parametrization. For analyzing the training, the mean return and the standard error were calculated in bins of 50,000 time steps for all training runs.
The DRL environment was created in the Python-based DRL framework OpenAI Gym [51], which provided standardized interfaces to connect DRL environments with DRL algorithms. The Modelica model of the vehicle was exported as a functional-mock-up unit (FMU) [52]. The software package PySimulator [53,54] was used to connect to the FMU and to create a Python-based interface, which could then be directly addressed by the DRL environment. This toolchain was first applied by our group in [9].

4.1. Training

Each CACC environment was initialized with a processed NGSIM velocity and acceleration trajectory. The trajectories were sampled from the NGSIM training set. A training episode had a length of t T (cf. Table 4), if it was not aborted earlier.
The initial velocity of the follower vehicle was selected as
$v_{2,0} = v_{1,0} + \Delta v_{\mathrm{rand}},$  (22)
where v 2 , 0 was limited to be not below 0 m/s.
With v 2 , 0 , d 2 , 0 * can be calculated according to (1). The initial distance d 2 , 0 is defined to be equal to d 2 , 0 * plus a value Δ d rand :
$d_{2,0} = d_{2,0}^* + \Delta d_{\mathrm{rand}}.$  (23)
d 2 , 0 was limited to be not below r c , 2 . The initial values, as well as the sampling intervals (sampling performed with continuous uniform distribution) of Δ v rand and Δ d rand can be found in Table 4.
The random initialization of v 2 , 0 and d 2 , 0 as well as the random sampling of a leader vehicle trajectory, aimed at forcing the agent to cope with a wide variety of initial states and the avoidance of overfitting. Randomly initializing the environment helped the agent to explore the state space, which was inherently considered by the algorithm by using a stochastic policy during training.
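A small sketch of the randomized episode initialization in (22) and (23), including the lower bounds described in the text; the standstill distance, time headway, and sampling intervals are placeholders for the values in Table 1 and Table 4.

```python
import random

R_C = 2.0      # placeholder standstill distance r_c,2
T_H = 0.5      # placeholder time headway t_h,2

def initialize_follower(v_leader_0, dv_range=(-2.0, 2.0), dd_range=(-5.0, 5.0)):
    """Randomized initial follower velocity (22) and initial distance (23)."""
    v_2_0 = max(0.0, v_leader_0 + random.uniform(*dv_range))   # not below 0 m/s
    d_target_0 = R_C + T_H * v_2_0                             # target distance, Eq. (1)
    d_2_0 = max(R_C, d_target_0 + random.uniform(*dd_range))   # not below r_c,2
    return v_2_0, d_2_0

print(initialize_follower(v_leader_0=15.0))
```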
The training results for different prediction horizons N k , training communication qualities, and reward functions are shown in Figure 4.
The training evolution with the EM-RL reward function is depicted in Figure 4a. One can see a stable training behavior, with a low standard error for all trainings. The average return converged at approximately 0.5 × 10⁶ time steps. Initially, the average return was below R_abort = −1000, because the agents explored different policies, and many led to an episode being aborted. Comparing the curves for perfect communication quality with prediction horizons N_k = 1, 2 and 20, one can see that there was a slight increase in sample efficiency (i.e., convergence speed w.r.t. time steps) for higher prediction horizons, and all prediction horizons showed good training stability. This suggested that the agent could make use of the additional information provided by an increased preview.
The training with low communication quality showed a drop in average return, and the training performance was worse regarding the achieved final average return. Since the agent only had access to less reliable information, the achievable performance was reduced. This could also be observed for PM-RL.
The PM-RL weight set led to higher absolute rewards due to the nonzero weight w P compared to the EM-RL weight set, as can be seen in Figure 4b. Additionally, it showed a less stable training behavior. Therefore, R abort was increased (cf. Table 3) to ensure that the agent focused on the safety and error constraint (cf. (20) and (21)).
The average return rose during the first 0.2 × 10⁶ to 0.5 × 10⁶ time steps. After reaching a peak, the return dropped and rose again shortly after until convergence. This can be explained by the opposing reward terms regarding the error e_{i,k} and the power consumption P_{i,k}: by following the leader vehicle but accelerating with a lower absolute amplitude than it, the agent could easily decrease P_{i,k}. However, at the same time, this increased e_{i,k}. Only by exploration and considering the possible air drag reduction could a policy be found that reduced both P_{i,k} and e_{i,k}.
The general occurrence of training instability of PM-RL compared to EM-RL might have been created by including both the power consumption P i , k and the change in control value Δ u i , k in the reward function: The penalized power consumption | P i , k | could also reward control value smoothness, even though this was not its actual aim. Therefore, this created an ambiguous goal for the optimization.
When comparing the PM-RL curves for perfect training communication quality with prediction horizons N_k = 1, N_k = 2, and N_k = 20, one can see that there was an increase in sample efficiency after 0.7 × 10⁶ time steps for N_k = 2 and N_k = 20. This showed that the additional information acquired through the prediction horizon increased the training performance.

4.2. Test Set Evaluation

For validating the performance of the EM-RL and the PM-RL agents, they were compared to a model-based proportional-derivative (PD) controller, extended with a feedforward (FF) element to incorporate transmitted acceleration information [26] (cf. Appendix A for more information). This controller was referred to as a model-based PDFF (mb-PDFF) controller, and was applied to the test set with both reward function parametrizations. It was also implemented in Modelica/Dymola, like the vehicle model.
Because the design of the mb-PDFF from [26] did not consider any preview (i.e., only the present acceleration was transmitted), the mb-PDFF was exclusively compared with DRL results with a prediction horizon of N_k = 1.
For evaluating the performance of the trained policies, they were deterministically applied on the test set. For this, the random initialization was turned off; i.e., Δ v rand = 0 and Δ d rand = 0 (cf. (22) and (23)).
It was necessary to select a specific agent out of the set of trained agents to make a direct comparison, as only a single agent could be used as the controller in the follower vehicle. The policies with the least amount of crashes in the test set were selected. For training with perfect communication quality, only the test set results with perfect communication quality were considered. For training with low communication quality, the test set results with both perfect and low communication were considered, because the trained agent was expected to perform well in both situations. In case of the same number of crashes, the agent with a higher mean return in the test set Ḡ_test was selected. Ḡ_test is defined as $\bar{G}_{\mathrm{test}} = \frac{1}{N_{\tau,\mathrm{valid}}} \sum_{k=1}^{N_{\tau,\mathrm{valid}}} G_k$, with N_{τ,valid} being the number of episodes that were not aborted in the evaluation of the test set.
Table 5 shows the policies corresponding to the best performing runs for each of the reward function parametrizations and communication qualities. The shown DRL evaluation results were obtained with perfect communication quality. In addition to the number of test set aborts, the mean energy consumption:
$\bar{E}_{\mathrm{test}} = \frac{1}{N_{\tau,\mathrm{valid}}} \sum_{k=1}^{N_{\tau,\mathrm{valid}}} E_k$  (24)
and the root mean squared error:
$RMSE_{\mathrm{test}} = \left( \frac{1}{N_{s,\mathrm{valid}}} \sum_{k=1}^{N_{s,\mathrm{valid}}} e_{2,k}^2 \right)^{0.5}$  (25)
were evaluated. The R M S E test was calculated with respect to the episode steps N s , valid that corresponded to N τ , valid .
First, mb-PDFF was compared with EM-RL and PM-RL, each trained with perfect communication quality (P-EM-RL, P-PM-RL).
The P-EM-RL agent showed the best performance with respect to the minimization of e . The P-EM-RL showed a much smaller R M S E test than mb-PDFF (−78.7%). This can be explained by model uncertainties being considered in the design of the mb-PDFF, which made the mb-PDFF less aggressive in order to be capable of adjusting to parameter deviations. Additionally, the P-EM-RL had a lower E ¯ test (−9.9%) than mb-PDFF, because the P-EM-RL was trained to minimize the change in control value.
P-PM-RL was able to reduce E ¯ test   by 17.9% compared to mb-PDFF and 8.9% compared to P-EM-RL. One can see that the deliberate deviation from the set point distance d 2 , k * was used to decrease the power consumption, and thereby decrease the energy consumption. This also led to a significant increase in the distance error: P-PM-RL’s R M S E test was 3.6-fold larger than mb-PDFF’s, and 17.1-fold larger than P-EM-RL’s.
Subsequently, EM-RL and PM-RL trained with low communication quality (L-EM-RL, L-PM-RL) were considered. Both L-EM-RL and L-PM-RL performed similarly to their counterparts P-EM-RL and P-PM-RL, but slightly worse (L-PM-RL had a lower R M S E test than P-PM-RL, but a higher E ¯ test ). This can be explained by the occurrences of burst errors during training, which made the agents rely less on the transmitted acceleration information.
This led to L-EM-RL being less capable of reaching the set point distance d 2 , k * than P-EM-RL, while simultaneously having an increased E ¯ test . Similarly, this led to L-PM-RL being more careful with deliberately deviating from d 2 , k * than P-PM-RL, and therefore also having an increased E ¯ test .
For all DRL agents, one test set abort occurred out of the 93 tested trajectories. This was related to the strict string-stability implementation of the DRL agents (as opposed to only considering string stability during the tuning of the mb-PDFF's parameters, cf. Appendix A), even in emergency braking situations.
The evaluation results for higher prediction horizons ( N k = 2 and N k = 20 ) are summarized in Appendix B (cf. Table A2; the DRL results in Table 5 are an extract from Table A2).
Larger prediction horizons only partially yielded a better performance of the agents. Only N_k = 2 yielded a lower RMSE_test for P-EM-RL, while neither N_k = 2 nor N_k = 20 improved L-EM-RL's performance. This can be explained by the set point distance d_{2,k}* tracking problem already being sufficiently solvable with N_k = 2. At the same time, an increased prediction horizon meant that the agent had to identify the superfluous information, which increased the difficulty. This was especially true for low communication quality, for which correct acceleration information was partially not available.
For P-PM-RL, a prediction horizon increase from N k = 1 to N k = 2 led to a lower R M S E test , while E ¯ test stayed approximately the same. A further horizon increase from N k = 1 to N k = 20 led to a lower E ¯ test and R M S E test (w.r.t. perfect communication quality during evaluation). For L-PM-RL, an increased prediction horizon led to a lower E ¯ test , but a higher R M S E test . This showed that even for low communication quality, the energy minimization objective benefited from a larger prediction horizon.
Overall, training with low communication quality resulted in a policy that was more robust regarding E ¯ test and R M S E test during low communication quality. However, the agents trained with low communication quality also showed multiple test set aborts during evaluation with low communication quality.

4.3. Controller Analysis

In the following section, the mb-PDFF, P-EM-RL, and P-PM-RL (the latter two with a prediction horizon of N k = 1) are analyzed in time domain with one trajectory chosen for the leader vehicle.
Figure 5a shows the velocity of the leader vehicle, as well as the velocity of the follower vehicle, as it was created by the mb-PDFF, the P-EM-RL agent, and the P-PM-RL agent. The good performance regarding the distance error minimization of P-EM-RL is clearly visible in Figure 5b.
It can be seen in Figure 5a that the follower vehicle mimicked the driving behavior of the leader vehicle. P-PM-RL showed undershoots after longer periods of deceleration (e.g., at 105 s), which resulted from deliberately deviating from the target distance d 2 , k * (cf. Figure 5b). Before the undershoot, P-PM-RL drove faster than the leader vehicle. At the end of the deceleration phase, the follower vehicle drove closer to the preceding vehicle to benefit from the reduced air drag. Afterwards, the reduced distance was compensated by driving slower than the leader vehicle, which meant by performing an undershoot. It should be noted that these undershoots resulted in negative velocity in the case of the preceding vehicle’s velocity approaching 0 m/s (e.g., Figure 5a at 50 s). This behavior was possible during training and driving cycle set evaluation, but was forbidden here in the time domain analysis.
Figure 6 shows the control value u = u 2 , k for mb-PDFF, P-EM-RL, and P-PM-RL for the time interval of 60 s to 80 s of the episode (the time interval was well suited to show the characteristic controllers’ behavior and the effect of the string-stability limit). It can be seen that the amplitude of P-EM-RL was lower than that of the mb-PDFF, which most likely was caused by the penalization of the change in control value Δ u i , k in its reward function. This showed that the P-EM-RL was less aggressive in general than the mb-PDFF (leading to a lower energy consumption), while having a better error-minimizing performance (cf. Table 5 and Figure 5b).
Regarding the constraints, the figure shows that the control value of the mb-PDFF and both agents remained within a modest range (cf. (14)), that the DRL agents respected the string stability limit, and that, usually, the mb-PDFF also was compatible with the DRL agents’ string-stability limit.
The P-PM-RL agent still showed a nonsmooth signal due to the ambiguity created by penalizing both the power consumption and the change in control value. During periods of acceleration and braking, it showed lower maximum absolute accelerations, but higher absolute accelerations at the end of the periods than mb-PDFF and P-EM-RL. This allowed it to deviate from the target distance d_{2,k}*.
For analyzing the trained agents’ string-stability property, a platoon with three P-EM-RL-controlled follower vehicles was simulated (Figure 7a,b) and three P-PM-RL-controlled follower vehicles (Figure 7c,d) (P-EM-RL and P-PM-RL both with prediction horizon of N k = 1 ).
The P-EM-RL-controlled follower vehicles behaved as string stable: each follower vehicle followed the preceding vehicle without an acceleration amplification downstream. This also led to a reduction of the maximum absolute distance error values, as can be seen in Figure 7b at 50 s, 70 s, and 105 s.
The P-PM-RL-controlled vehicles mostly behaved as string stable, but also showed instabilities. The velocity over- and undershoots with respect to the preceding vehicle were amplified downstream at 90 s and 105 s, such that the absolute distance error also was amplified. For negative distance errors, this may lead to a crash. This was not fully prevented by the string stability limit (cf. (16)), because it only referred to the maximum absolute acceleration value of the preceding vehicle within the string stability time window N ss , but not the duration of this maximum acceleration. Thus, in future work, the string-stability limit also needs to restrict the acceleration duration of the follower vehicle. Additionally, this reduction of possible acceleration amplitude would increase the passenger comfort.
The string stability behavior of the P-EM-RL-controlled vehicles also suggested the assessment of P-EM-RL-controllers in platoons with more than three agent-controlled vehicles, whereas the P-PM-RL-controllers first required a stricter string-stability limit.

5. Discussion and Outlook

This work assessed the use of deep reinforcement learning for CACC.
A buffer was used in the C2C communication to store the predicted acceleration information. This enabled the use of a prediction horizon of multiple future time steps. The occurrence of burst errors was modeled with a Markov chain, and burst errors were marked as invalid information in the buffer. Furthermore, a string-stability condition was incorporated by limiting the acceleration set point of the follower vehicle with respect to future accelerations of the preceding vehicle. Thereby, accelerations could not be amplified downstream in the platoon without limit.
If the training and application of deep reinforcement learning is not properly designed, robustness and overfitting problems might occur; i.e., the learning-based algorithm is highly optimized for the data synthetically generated during simulation, failing to cope with data generated by the real system. This is particularly worrisome when transferring the algorithm from a simulation environment to the real world, leading to potential loss of performance (best scenario) and safety (worst case). To mitigate these issues in our work, model randomization techniques were incorporated into the learning process. The key idea was to inject parameter variability in the simulation model, exposing the agent to model inaccuracies during the learning phase, yielding robust control policies. In our work, particular attention was dedicated to exposing the control algorithm to different levels of communication delays and errors, one of the main sources of uncertainty in the platooning applications.
Multiple agents were trained and evaluated on an extra test set that was not used for training. The best performing agents were validated by comparison with a model-based PD controller with feedforward. Here, the distance error minimizing CACC agent was able to outperform the model-based controller by reducing the R M S E test by up to 78.7% on the test set.
In comparison with the model-based controller, the energy minimizing CACC agent reduced the mean energy consumption by 17.9% on the test set. For the error-minimizing agents, it was shown that only small increases of the prediction horizon length yielded a better error minimization, whereas the energy consumption could also be improved by large increases of the prediction horizon length. Moreover, it was shown that considering burst errors during training created an agent that was robust against them and performed well regarding distance errors and energy minimization.
All in all, it was possible to train agents that could: (1) minimize their energy consumption; (2) provide string stability to a given extent; (3) exploit preview information available in the communication; and (4) minimize the effect of burst errors.
However, to guarantee string stability more generally, the string stability limit should be able to change depending on the driving situation. Most of the time, it should be stricter, and it should also consider the acceleration duration of the preceding vehicle instead of only the maximum acceleration. During the test set evaluation of the DRL agents, some test set aborts occurred due to the string-stability limit in emergency braking situations. Thus, the string-stability limit should also be less restrictive when the distance to the preceding vehicle is very low.
The experienced predisposition of DRL towards unsmooth control signals is a disadvantage, and must be considered when transferring the controller to real-world applications. Unsmooth acceleration control signals lead to high jerks and reduce driving comfort. In addition, they increase energy consumption.
In the future, we plan to validate the obtained results with experiments on real-world vehicles.

Author Contributions

Conceptualization, R.d.C., J.M., J.U. and J.B.; methodology, J.M., R.d.C. and J.U.; software, J.M., J.U., R.d.C. and J.B.; RL and FMI framework, J.U. and J.M.; validation, J.M. and R.d.C.; writing—original draft preparation, J.M.; writing—review and editing, J.M., R.d.C., J.U. and J.B.; visualization, J.M., J.U. and R.d.C.; supervision, R.d.C. and J.B. All authors have read and agreed to the published version of the manuscript.

Funding

The work of J.M., J.B. and J.U. was funded by the DLR internal projects NGC-KoFiF and Intelligent Mobility. These projects also funded R.d.C. during his employment at DLR.

Data Availability Statement

Not applicable.

Acknowledgments

The authors’ thanks go to Christina Schreppel for her valuable support.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Model-Based PD Controller with Feedforward

The mb-PDFF was fast due to the proportional element, and featured a derivative component to increase the controller's stability. Because of the fast-changing set points in trajectory following, there was no need for an integral element. The control signal u_i for platoon vehicle i was composed of a feedback signal u_{PD,i} and a feedforward signal u_{ff,i}:
$u_i = u_{PD,i} + u_{ff,i}.$  (A1)
u P D , i is calculated as:
$u_{PD,i}(j\omega) = k_{p,i} \, e_i(j\omega) + j\omega \, k_{d,i} \, e_i(j\omega),$  (A2)
where e i again is the distance error (cf. (2)). The feedforward signal u f f , i consists of a filter block and a block that models the communication delay Δ i (w.r.t. the transmitted present acceleration of the preceding vehicle). The filter was selected such that the disturbance effect of the preceding vehicle’s position x i 1 ( t ) was reduced, using a specific value for the time headway t h , i (cf. [26] and Table 4).
According to [26], the controller’s parameters t h , i , k p , i , and k d , i were determined with respect to specifications of pole placement, string stability, and nonparametric variations. The parameters were calculated with the parameter space method [55], and can be found in Table A1.
Table A1. Model-based PDFF controller parametrization.

Parameter   Value
k_{p,i}     0.49
k_{d,i}     0.70
Both the DRL-based CACCs and the mb-PDFF used the same distance at standstill r c , 2 , the same time headway t h , 2 , and the same delay Δ 2 (cf. Table 1 and Table 4).
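A discrete-time sketch of the mb-PDFF control law (A1) and (A2): a PD feedback on the distance error plus a feedforward of the delayed, transmitted leader acceleration. The backward-difference derivative, the omitted feedforward shaping filter, and the simple delay queue are simplifying assumptions; only the gains k_{p,i} and k_{d,i} are taken from Table A1.

```python
class MbPdff:
    """Simplified discrete-time PD + feedforward controller (cf. (A1), (A2))."""

    def __init__(self, k_p=0.49, k_d=0.70, dt=0.1, delay_steps=1):
        self.k_p, self.k_d, self.dt = k_p, k_d, dt
        self.e_prev = 0.0
        self.ff_queue = [0.0] * delay_steps   # models the communication delay Delta_i

    def step(self, e, a_leader):
        # PD feedback on the distance error (backward-difference derivative).
        u_pd = self.k_p * e + self.k_d * (e - self.e_prev) / self.dt
        self.e_prev = e
        # Feedforward of the (delayed) transmitted leader acceleration;
        # the shaping filter of [26] is omitted in this sketch.
        self.ff_queue.append(a_leader)
        u_ff = self.ff_queue.pop(0)
        return u_pd + u_ff

ctrl = MbPdff()
print(ctrl.step(e=0.4, a_leader=0.2))
```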

Appendix B. Full Test Set Evaluation Results for Agents

Table A2 summarizes the driving cycle test set evaluation results for the EM-RL and PM-RL agents for prediction horizons N k = 1 ,   2 and 20 , as well as for perfect and low communication quality during training.
Table A2. Results of trained agents for the driving cycle test set during both perfect and low communication quality. The test set consisted of 93 trajectories.

                                           Perfect Communication Quality during Training     Low Communication Quality during Training
                                           P-EM-RL                  P-PM-RL                  L-EM-RL                  L-PM-RL
                                           N_k=1   N_k=2   N_k=20   N_k=1   N_k=2   N_k=20   N_k=1   N_k=2   N_k=20   N_k=1   N_k=2   N_k=20
Perfect Comm. Quality during Evaluation
  RMSE_test (m)                            0.0438  0.0403  0.158    0.750   0.370   0.624    0.0516  0.127   0.100    0.542   0.733   0.803
  Ē_test (Wh)                              30.7    30.0    28.9     28.0    28.2    24.6     31.1    29.9    28.4     29.5    28.4    23.9
  Aborts                                   1       1       0        1       0       2        1       0       1        1       1       0
Low Comm. Quality during Evaluation
  RMSE_test (m)                            0.731   0.487   0.593    3.10    1.49    4.91     0.269   0.317   0.392    0.535   0.680   0.874
  Ē_test (Wh)                              37.5    36.5    34.0     35.9    36.2    35.7     31.4    30.5    30.7     30.0    29.3    25.3
  Aborts                                   89566309789114

Appendix C. Required Computational Time and Hardware Setup

In this work, an Intel Xeon W-2135 processor was used (3.7 GHz, 6 cores, 12 threads) together with 32 GB RAM and Tensorflow 1.12.0. Table A3 gives an overview of the required average run times for the different prediction horizons N_k of the EM-RL training with 2 × 10⁶ time steps and the PM-RL training with 4 × 10⁶ time steps, each with perfect communication quality. For low communication quality, the average run times were similar.
Table A3. Average run times of EM-RL and PM-RL training runs with perfect communication quality.

| Reward Function Parametrization | Prediction Horizon $N_k$ | Average Run Time (s) |
|---|---|---|
| EM-RL | 1 | 7390 |
| EM-RL | 2 | 7960 |
| EM-RL | 20 | 11,600 |
| PM-RL | 1 | 12,800 |
| PM-RL | 2 | 14,100 |
| PM-RL | 20 | 25,000 |
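The run times in Table A3 were obtained by timing complete training runs. The snippet below illustrates how such a measurement can be taken around a Stable Baselines PPO2 [45] training call; the FMU-based CACC environment itself is not reproduced here, so a standard Gym task acts as a stand-in.

```python
import time

import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# A standard Gym task acts as a stand-in for the FMU-based CACC environment,
# which is not reproduced here.
env = gym.make("Pendulum-v0")
model = PPO2(MlpPolicy, env, verbose=0)

start = time.perf_counter()
model.learn(total_timesteps=2_000_000)  # 2e6 steps, as used for the EM-RL agents
print(f"Training took {time.perf_counter() - start:.0f} s")
```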

References

  1. Jia, D.; Lu, K.; Wang, J.; Zhang, X.; Shen, X. A Survey on Platoon-Based Vehicular Cyber-Physical Systems. IEEE Commun. Surv. Tutor. 2016, 18, 263–284. [Google Scholar] [CrossRef] [Green Version]
  2. Turri, V.; Besselink, B.; Martensson, J.; Johansson, K. Fuel-Efficient Heavy-Duty Vehicle Platooning by Look-Ahead Control. In Proceedings of the 53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014. [Google Scholar]
  3. Guanetti, J.; Kim, Y.; Borrelli, F. Control of Connected and Automated Vehicles: State of the Art and Future Challenges. Annu. Rev. Control 2018, 45, 18–40. [Google Scholar] [CrossRef] [Green Version]
  4. Seiler, P.; Pant, A.; Hedrick, K. Disturbance Propagation in Vehicle Strings. IEEE Trans. Autom. Control 2004, 49, 1835–1842. [Google Scholar] [CrossRef]
  5. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA; London, UK, 2018. [Google Scholar]
  6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. Available online: https://arxiv.org/abs/1312.5602 (accessed on 19 December 2013).
  7. Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  8. OpenAI; Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019, arXiv:1912.06680. Available online: https://arxiv.org/abs/1912.06680 (accessed on 13 December 2019).
  9. Ultsch, J.; Brembeck, J.; de Castro, R. Learning-Based Path Following Control for an Over-Actuated Robotic Vehicle. In Proceedings of the AUTOREG 2019, Mannheim, Germany, 2–3 July 2019. [Google Scholar]
  10. Duan, J.; Shi, D.; Diao, R.; Li, H.; Wang, Z.; Zhang, B.; Bian, D.; Yi, Z. Deep-Reinforcement-Learning-Based Autonomous Voltage Control for Power Grid Operations. IEEE Trans. Power Syst. 2019, 35, 814–817. [Google Scholar] [CrossRef]
  11. Rabault, J.; Kuchta, M.; Jensen, A.; Réglade, U.; Cerardi, N. Artificial Neural Networks Trained through Deep Reinforcement Learning Discover Control Strategies for Active Flow Control. J. Fluid Mech. 2019, 865, 281–302. [Google Scholar] [CrossRef] [Green Version]
  12. Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1086–1095. [Google Scholar] [CrossRef] [Green Version]
  13. de Castro, R.; Schaub, A.; Satzger, C.; Brembeck, J. A Vehicle Following Controller for Highly-Actuated Vehicles. In Proceedings of the 13th International Symposium on Advanced Vehicle Control, Munich, Germany, 13–16 September 2016. [Google Scholar]
  14. Wu, C.; Lin, Y.; Eskandarian, A. Cooperative Adaptive Cruise Control With Adaptive Kalman Filter Subject to Temporary Communication Loss. IEEE Access 2019, 7, 93558–93568. [Google Scholar] [CrossRef]
  15. Dunbar, W.; Caveney, D. Distributed Receding Horizon Control of Vehicle Platoons: Stability and String Stability. IEEE Trans. Autom. Control 2012, 57, 620–633. [Google Scholar] [CrossRef]
  16. Buechel, M.; Knoll, A. Deep Reinforcement Learning for Predictive Longitudinal Control of Automated Vehicles. In Proceedings of the 21st International Conference on Intelligent Transportation Systems, Maui, HI, USA, 4–7 November 2018. [Google Scholar]
  17. Everett, M.; Lütjens, B.; How, J. Certifiable Robustness to Adversarial State Uncertainty in Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. (Early Access) 2021, 1–15. [Google Scholar] [CrossRef]
  18. Ultsch, J.; Mirwald, J.; Brembeck, J.; de Castro, R. Reinforcement Learning-based Path Following Control for a Vehicle with Variable Delay in the Drivetrain. In Proceedings of the IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 19 October–13 November 2020. [Google Scholar]
  19. Wei, Z.; Jiang, Y.; Liao, X.; Qi, X.; Wang, Z.; Wu, G.; Hao, P.; Barth, M. End-to-End Vision-Based Adaptive Cruise Control (ACC) Using Deep Reinforcement Learning. arXiv 2020, arXiv:2001.09181. Available online: https://arxiv.org/abs/2001.09181 (accessed on 24 January 2020).
  20. Lin, Y.; McPhee, J.; Azad, N. Comparison of Deep Reinforcement Learning and Model Predictive Control for Adaptive Cruise Control. IEEE Trans. Intell. Veh. 2021, 6, 221–231. [Google Scholar] [CrossRef]
  21. Desjardins, C.; Chaib-draa, B. Cooperative Adaptive Cruise Control: A Reinforcement Learning Approach. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1248–1260. [Google Scholar] [CrossRef]
  22. Chu, T.; Kalabić, U. Model-based Deep Reinforcement Learning for CACC in Mixed-Autonomy Vehicle Platoon. In Proceedings of the IEEE 58th Conference on Decision and Control, Nice, France, 11–13 December 2019. [Google Scholar]
  23. Wei, S.; Zou, Y.; Zhang, T.; Zhang, X.; Wang, W. Design and Experimental Validation of a Cooperative Adaptive Cruise Control System Based on Supervised Reinforcement Learning. Appl. Sci. 2018, 8, 1014. [Google Scholar] [CrossRef] [Green Version]
  24. Peake, A.; McCalmon, J.; Raiford, B.; Liu, T.; Alqahtani, S. Multi-Agent Reinforcement Learning for Cooperative Adaptive Cruise Control. In Proceedings of the IEEE 32nd International Conference on Tools with Artificial Intelligence, Baltimore, MD, USA, 9–11 November 2020. [Google Scholar]
  25. Mirwald, J. Platooning Control with Deep Reinforcement Learning. Master’s Thesis, Technical University of Munich, Munich, Germany, 2019. [Google Scholar]
  26. de Castro, R.; Brembeck, J. A Command Governor Approach for Platooning Applications. In Proceedings of the IEEE Intelligent Vehicles Symposium, Los Angeles, CA, USA, 11–14 June 2017. [Google Scholar]
  27. Ploeg, J. Analysis and Design of Controllers for Cooperative and Automated Driving. Ph.D. Thesis, Eindhoven University of Technology, Eindhoven, The Netherlands, 2014. [Google Scholar]
  28. Kianfar, R.; Augusto, B.; Ebadighajari, A.; Hakeem, U.; Nilsson, J.; Raza, A.; Tabar, R.; Irukulapati, N.; Englund, C.; Falcone, P.; et al. Design and Experimental Validation of a Cooperative Driving System in the Grand Cooperative Driving Challenge. IEEE Trans. Intell. Transp. Syst. 2012, 13, 994–1007. [Google Scholar] [CrossRef] [Green Version]
  29. Elliott, E. Estimates of Error Rates for Codes on Burst-Noise Channels. Bell Syst. Tech. J. 1963, 42, 1977–1997. [Google Scholar] [CrossRef]
  30. Patel, R.; Haerri, J.; Bonnet, C. Centralized Model Predictive CACC Control Robust to Burst Communication Errors. In Proceedings of the IEEE 88th Vehicular Technology Conference, Chicago, IL, USA, 27–30 August 2018. [Google Scholar]
  31. Modelica Association 61 Ideell Förening: Modelica. Available online: https://modelica.org/ (accessed on 15 September 2021).
  32. Dassault Systèmes, S.E. Dymola. Available online: https://www.3ds.com/products-services/catia/products/dymola (accessed on 15 September 2021).
  33. Zimmer, D.; van der Linden, F.; Qu, Z. PlanarMechanics 1.4.0: A Free Modelica Library for Planar Mechanical Multi-Body Systems. Available online: https://github.com/dzimmer/PlanarMechanics/releases/tag/v1.4.0 (accessed on 12 January 2017).
  34. Brembeck, J.; Ho, L.; Schaub, A.; Satzger, C.; Tobolar, J.; Bals, J.; Hirzinger, G. ROMO—The Robotic Electric Vehicle. In Proceedings of the IAVSD International Symposium on Dynamics of Vehicle on Roads and Tracks, Manchester, UK, 11–14 August 2011. [Google Scholar]
  35. Brembeck, J. Model Based Energy Management State Estimation for the Robotic Electric Vehicle ROboMObil. Ph.D. Thesis, Technical University of Munich, Munich, Germany, 2018. [Google Scholar]
  36. Bertoni, L.; Guanetti, J.; Basso, M.; Masoero, M.; Cetinkunt, S.; Borrelli, F. An Adaptive Cruise Control for Connected Energy-Saving Electric Vehicles. IFAC-PapersOnLine 2017, 50, 2359–2364. [Google Scholar] [CrossRef]
  37. Turri, V.; Besselink, B.; Johansson, K. Cooperative Look-Ahead Control for Fuel-Efficient and Safe Heavy-Duty Vehicle Platooning. IEEE Trans. Control Syst. Technol. 2017, 25, 12–28. [Google Scholar] [CrossRef] [Green Version]
  38. Ahmed, S.; Bayer, B.; Deußen, N.; Emmelmann, H.-J.; Flegl, H.; Gilhaus, A.; Götz, H.; Großmann, H.; Hoffmann, R.; Hucho, W.-H.; et al. Aerodynamik des Automobils, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
  39. Tobolar, J.; Otter, M.; Bünte, T. Modelling of Vehicle Powertrains with the Modelica PowerTrain Library. In Proceedings of the Systemanalyse in der Kfz-Antriebstechnik IV, Augsburg, Germany, 6–7 March 2007. [Google Scholar]
  40. Schröder, D. Elektrische Antriebe—Regelung von Antriebssystemen, 3rd ed.; Springer: Berlin, Germany, 2009. [Google Scholar]
  41. United States Department of Transportation. Next Generation Simulation (NGSIM). Available online: https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm (accessed on 31 December 2006).
  42. Montanino, M.; Punzo, V. Making NGSIM Data Usable for Studies on Traffic Flow Theory: Multistep Method for Vehicle Trajectory Reconstruction. Transp. Res. Rec. J. Transp. Res. Board 2013, 2390, 99–111. [Google Scholar] [CrossRef]
  43. Szepesvári, C. Algorithms for Reinforcement Learning. In Synthesis Lectures on Artificial Intelligence and Machine Learning; Morgan & Claypool: San Rafael, CA, USA, 2010. [Google Scholar]
  44. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347v2. Available online: https://arxiv.org/abs/1707.06347v2 (accessed on 28 August 2017).
  45. Hill, A.; Raffin, A.; Ernestus, M.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; et al. Stable Baselines—PPO2. Available online: https://github.com/hill-a/stable-baselines; https://stable-baselines.readthedocs.io/en/v2.4.0/modules/ppo2.html (accessed on 17 January 2019).
  46. Bishop, C. Pattern Recognition and Machine Learning, 8th ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  47. Heess, N.; TB, D.; Sriram, S.; Lemmon, J.; Merel, J.; Wayne, G.; Tassa, Y.; Erez, T.; Wang, Z.; Eslami, S.; et al. Emergence of Locomotion Behaviours in Rich Environments. arXiv 2017, arXiv:1707.02286. Available online: https://arxiv.org/abs/1707.02286 (accessed on 7 July 2017).
  48. Raffin, A.; Sokolkov, R. Learning to Drive Smoothly in Minutes. Available online: https://github.com/araffin/learning-to-drive-in-5-minutes/ (accessed on 31 January 2019).
  49. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. Available online: https://arxiv.org/abs/1412.6980 (accessed on 22 December 2014).
  50. Islam, R.; Henderson, P.; Gomrokchi, M.; Precup, D. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control. arXiv 2017, arXiv:1708.04133. [Google Scholar]
  51. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. Available online: https://arxiv.org/abs/1606.01540 (accessed on 5 June 2016).
  52. Modelica Association 61 Ideell Förening: FMI Version 2.0. Available online: https://fmi-standard.org/downloads/ (accessed on 15 September 2021).
  53. Ganeson, A.; Fritzson, P.; Rogovchenko, O.; Asghar, A.; Sjölund, M.; Pfeiffer, A. An OpenModelica Python Interface and its use in PySimulator. In Proceedings of the 9th International Modelica Conference, Munich, Germany, 3–5 September 2012. [Google Scholar]
  54. Pfeiffer, A.; Hellerer, M.; Hartweg, S.; Otter, M.; Reiner, M. PySimulator—A Simulation and Analysis Environment in Python with Plugin Infrastructure. In Proceedings of the 9th International Modelica Conference, Munich, Germany, 3–5 September 2012. [Google Scholar]
  55. Ackermann, J. Robust Control: The Parameter Space Approach, 2nd ed.; Springer: London, UK, 2002. [Google Scholar]
Figure 1. Platooning linear model with information flow [26].
Figure 2. Block diagram of the environment.
Figure 3. Agent-environment interaction in a Markov decision process [5].
Figure 4. Average return and standard error for multiple runs of the EM-RL (a) and PM-RL (b) reward function parametrization.
Figure 5. Velocity v of the leader vehicle as well as of the follower vehicle (a). The velocity of the latter was created by mb-PDFF, P-EM-RL, and P-PM-RL; the respective distance errors $e = e_{2,k}$ are shown in (b).
Figure 6. Control value $u = u_{2,k}$ with the respective string-stability limit, created by mb-PDFF, P-EM-RL, and P-PM-RL. The time interval from 60 s to 80 s of the simulation with a total length of 120 s is shown.
Figure 7. Platoon consisting of three agent-controlled vehicles and a leader vehicle. (a,b) show the velocities v and the respective distance errors e in case the follower vehicles were P-EM-RL-controlled; (c,d) show v and e in case the follower vehicles were P-PM-RL-controlled.
Table 1. Vehicle model parameters and general CACC problem parameters.

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| $\Delta t$ | 0.1 s | $M_\text{nom}$ | 4 × 160 Nm |
| $\Delta_i$ | 0.1 s | $T_M$ | 0.1 s |
| $a_\text{invalid}$ | −10 m/s² | $k_M$ | 8000 |
| $U_\text{batt}$ | 322.4 V | $A_x$ | 1.232 m² |
| $R_\text{batt}$ | 0.54 Ω | $c_w$ | 0.3 |
| $n_\text{cells,s}$ | 90 | $c_{d,1}$ | 17.58 |
| $U_\text{cell}$ | 3.582 V | $c_{d,2}$ | 34.03 |
| $R_\text{cell}$ | 0.006 Ω | $\rho_\text{air}$ | 1.25 Ns²/m⁴ |
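The battery-pack quantities in Table 1 follow directly from the series connection of the cells, and the aerodynamic parameters enter the driving resistance via the usual quadratic drag law. The short sketch below reproduces these relations; how the coefficients $c_{d,1}$ and $c_{d,2}$ enter the longitudinal model is defined in the main text and is not assumed here.

```python
# Battery-pack values from the series connection of the cells (Table 1) and the
# standard quadratic aerodynamic drag law.
N_CELLS_S = 90        # number of cells in series, n_cells,s
U_CELL = 3.582        # cell voltage (V)
R_CELL = 0.006        # cell resistance (Ohm)
RHO_AIR = 1.25        # air density (kg/m^3, i.e., Ns^2/m^4)
C_W = 0.3             # drag coefficient c_w
A_X = 1.232           # frontal area A_x (m^2)

u_batt = N_CELLS_S * U_CELL   # ~322.4 V, matches U_batt in Table 1
r_batt = N_CELLS_S * R_CELL   # 0.54 Ohm, matches R_batt in Table 1


def aero_drag_force(v):
    """Aerodynamic drag force (N) at vehicle speed v (m/s)."""
    return 0.5 * RHO_AIR * C_W * A_X * v ** 2


print(round(u_batt, 1), round(r_batt, 2), round(aero_drag_force(25.0), 1))
```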
Table 2. Communication parametrization regarding the communication quality and the probability of burst errors.

| Communication Quality | $p_r$ | $p_l$ |
|---|---|---|
| Perfect | 1 | 0 |
| Low | 0.8 | 0.75 |
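The parameters $p_r$ and $p_l$ parametrize a two-state burst-error model in the spirit of the Gilbert-Elliott channel [29]. The sketch below shows one possible reading of these parameters (successful reception while the link is up, and persistence of a loss burst); the exact channel model is specified in the main text.

```python
import random


class BurstErrorChannel:
    """Two-state burst-error model in the spirit of the Gilbert-Elliott channel [29].
    Assumed reading of Table 2: p_r is the probability that a packet is received
    while the link is up, p_l the probability that a loss burst continues."""

    def __init__(self, p_r, p_l):
        self.p_r, self.p_l = p_r, p_l
        self.in_loss = False

    def packet_received(self):
        if self.in_loss:
            self.in_loss = random.random() < self.p_l   # burst continues with p_l
        else:
            self.in_loss = random.random() >= self.p_r  # new loss with 1 - p_r
        return not self.in_loss


perfect = BurstErrorChannel(p_r=1.0, p_l=0.0)   # every packet arrives
low = BurstErrorChannel(p_r=0.8, p_l=0.75)      # losses tend to occur in bursts
```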
Table 3. Reward function parametrizations: the error-minimizing reward parametrization (EM-RL) and the power-minimizing reward parametrization (PM-RL).

| | $w_e$ | $w_P$ | $w_{\Delta u}$ | $R_\text{abort}$ |
|---|---|---|---|---|
| EM-RL | 1.0 | 0 | 0.1 | −1000 |
| PM-RL | 0.5 | 6.0 | 0.1 | −100,000 |
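Conceptually, the two parametrizations weight penalties on the distance error, the consumed electrical power, and changes of the control signal, with a large negative reward on abort. The snippet below is an illustrative weighted-penalty reading of Table 3; the exact reward formulation, including any normalization of the individual terms, is defined in the main text.

```python
def reward_sketch(e, P, du, aborted, w_e, w_P, w_du, R_abort):
    """Illustrative weighted-penalty reward; the exact formulation, including any
    normalization of the individual terms, is defined in the main text."""
    if aborted:
        return R_abort
    return -(w_e * e ** 2 + w_P * P + w_du * du ** 2)


# Parametrizations from Table 3.
EM_RL = dict(w_e=1.0, w_P=0.0, w_du=0.1, R_abort=-1000)
PM_RL = dict(w_e=0.5, w_P=6.0, w_du=0.1, R_abort=-100_000)
```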
Table 4. Training setup and initialization.

| Description | Parameter | Value or Sampling Interval |
|---|---|---|
| Episode length | $t_T$ | 120 s |
| Distance at standstill | $r_{c,2}$ | 2.0 m |
| Time headway | $t_{h,2}$ | 0.74 s |
| Initial acceleration | $a_{2,0}$ | 0 m/s² |
| Initial velocity deviation | $\Delta v_\text{rand}$ | [−2.5 m/s, 2.5 m/s) |
| Initial leader vehicle position | $x_{2,0}$ | 0 m |
| Initial distance deviation to leader vehicle | $\Delta d_\text{rand}$ | [−10 m, 10 m) |
| Actions from prior two time steps | $u_{2,1}$, $u_{2,2}$ | 0 m/s² |
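A minimal sketch of how the randomized initialization of Table 4 can be sampled is given below; how the sampled deviations are applied to the leader trajectory is described in the main text.

```python
import numpy as np

rng = np.random.default_rng()


def sample_initial_conditions():
    """Draws the randomized training initialization of Table 4. Deviations are
    sampled uniformly from the listed half-open intervals; how they are applied
    to the leader trajectory is described in the main text."""
    dv0 = rng.uniform(-2.5, 2.5)     # initial velocity deviation (m/s)
    dd0 = rng.uniform(-10.0, 10.0)   # initial distance deviation to the leader (m)
    a0 = 0.0                         # initial acceleration (m/s^2)
    u_prev = [0.0, 0.0]              # actions of the prior two time steps (m/s^2)
    return dv0, dd0, a0, u_prev
```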
Table 5. mb-PDFF and RL-agent-based CACC performance on the test set. Perfect communication quality is available during evaluation and the RL agents use a prediction horizon of $N_k$ = 1. The test set consists of 93 trajectories. P-EM-RL and P-PM-RL were trained with perfect communication quality; L-EM-RL and L-PM-RL were trained with low communication quality.

| Metric | mb-PDFF | P-EM-RL | P-PM-RL | L-EM-RL | L-PM-RL |
|---|---|---|---|---|---|
| $RMSE_\text{test}$ (m) | 0.2055 | 0.04383 | 0.7497 | 0.05160 | 0.5424 |
| $\bar{E}_\text{test}$ (Wh) | 34.05 | 30.69 | 27.95 | 31.12 | 29.51 |
| Aborts | 0 | 1 | 1 | 1 | 1 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
