Article

Research on Reinforcement-Learning-Based Truck Platooning Control Strategies in Highway On-Ramp Regions

Jiajia Chen, Zheng Zhou, Yue Duan and Biao Yu
1 School of Automotive and Transportation Engineering, Hefei University of Technology, Hefei 230009, China
2 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2023, 14(10), 273; https://doi.org/10.3390/wevj14100273
Submission received: 28 August 2023 / Revised: 14 September 2023 / Accepted: 18 September 2023 / Published: 1 October 2023
(This article belongs to the Special Issue Recent Advance in Intelligent Vehicle)

Abstract

With the development of autonomous driving technology, truck platooning control has become a reality. Truck platooning can improve road capacity by maintaining a small headway, and platooning systems can significantly reduce fuel consumption and emissions, especially for trucks. In this study, we designed a Platoon-MAPPO algorithm based on multi-agent reinforcement learning to implement truck platooning control in highway on-ramp scenarios. A centralized-training, decentralized-execution algorithm is used: each truck computes only its own actions, avoiding the data-computation delay caused by centralized computation, and each truck considers the status of the trucks in front of and behind it, maximizing the overall gain of the platoon and improving global operational efficiency. For performance evaluation, we used a traditional rule-based platoon-following model as a benchmark; to ensure fairness, it was run on the same road network and traffic scenario as our proposed model. The simulation results show that the proposed algorithm performs well and improves the overall efficiency of the platoon while guaranteeing traffic safety: average energy consumption decreased by 14.8%, and road occupancy decreased by 43.3%.

1. Introduction

Truck platooning refers to a driving state in which two or more trucks drive on the highway while the distance and driving state between them are controlled to form a tight vehicle cluster. It uses sensors or communication technologies to obtain information about adjacent trucks and can autonomously maintain speed or distance. As early as 1996, the European Commission (EC) Information Society Technologies (IST) programme launched the CHAUFFEUR project to assess truck platooning [1], studying configurations in which either all trucks drove automatically or only the lead truck was driven by a human. Similar projects have also been carried out in other countries, such as California PATH, Energy-ITS, KONVOI, and SARTRE [2,3,4,5,6]. The focus of these projects is mostly on truck control methods, platoon stability, and the external communication of trucks; all require automated driving strategies to control all or some of the trucks in the platoon.
Platooning offers several benefits, one of which is enhancing traffic efficiency. By keeping smaller gaps between vehicles at the same speed, platooning can increase road capacity compared with normal traffic flow. Another benefit is improving driving safety. Trucks in the platoon communicate and coordinate with each other using autonomous driving control, resulting in smaller speed variations and faster reaction times than human drivers. A third benefit is reducing the environmental impact. The transportation sector is a major contributor to energy consumption and greenhouse gas emissions, with road transport accounting for 72.8% and 73.4%, respectively [7]. Platooning can significantly lower fuel use and emissions, especially for large vehicles such as trucks.
Early research projects used sensors to measure the distance from the preceding vehicle and a lower-level controller to maintain fixed spacing. For instance, Refs. [2,4,5] achieved spacings of 6 m, 10 m, and 4.7 m, respectively. However, this method does not minimize the total energy consumption when the speed varies, as the platoon might need more energy to keep the preset spacing. Subsequent research formulated cost functions based on time and energy consumption and applied optimization and optimal control theory to find optimal solutions [8,9,10,11]. However, many methods were not scalable to large platoons, as the calculation cost increased with the number of vehicles.
In the past decade, there has been a lot of research based on deep learning (DL) to automate driving tasks as much as possible [12,13]. However, collecting decision-making and planning datasets for autonomous driving is costly and challenging, as it is hard to cover complex real-world traffic scenarios.
Reinforcement learning (RL) algorithms have emerged as a promising alternative for vehicle decision-making, planning, and control problems [14,15,16]. These algorithms do not require manual data collection and can adapt to various tasks. Some research has used RL to achieve autonomous driving for single vehicles [17], but few have applied it to multi-vehicle cooperative control [18]. This is because treating other vehicles as part of the environment violates the Markovian property and makes the learning unstable and difficult.
Multi-agent reinforcement learning (MARL) has been explored to control multi-vehicle behavior in recent studies. For example, ref. [19] only considered optimal strategies for stop-and-go under flat roads, while [20] applied DQN to vehicle grouping strategies in road networks. Ref. [21] used Q-learning to find optimal insert points for vehicles entering fleets.
This paper proposes a Platoon-MAPPO algorithm based on MARL for truck platoon control in highway on-ramp regions, which are among the most complex traffic situations on highways. Existing platoon control methods, such as traditional control methods and DL-based methods, are not effective in this scenario. The Platoon-MAPPO algorithm has the following features:
(1)
It uses a MAPPO-based algorithm with centralized training and decentralized execution to control the platoon in the on-ramp area. Each truck only computes its own action, avoiding the data computation delay caused by centralized calculation.
(2)
It considers the driving status of the trucks in front and behind each truck, maximizing the overall platoon gain and improving the global operational efficiency.
(3)
It does not require inter-vehicle communication and is scalable to any number of vehicles, independent of the availability of communication devices.
Compared with traditional control methods and existing methods based on DL and RL, Platoon-MAPPO can effectively reduce road occupancy and energy consumption. Moreover, Platoon-MAPPO is more flexible than existing RL-based methods because it does not require intra-platoon communication. Additionally, the use of distributed computing means that the computational cost of each vehicle is independent of the number of platooning members, making it more practical and avoiding the waste of computing resources. Traditional algorithms have strong interpretability and can be used as a safety guarantee to assist Platoon-MAPPO operation.

2. Related Work

Most traditional methods in platooning rely on optimal control theory, which aims to find the optimal control inputs for each vehicle to achieve certain objectives, such as fuel efficiency, safety, and comfort. For instance, [8] proposed a serial distributed model predictive control (MPC) approach that ensured local stability and multi-criteria string stability for connected automated vehicles. Local stability means that each vehicle can track its desired speed and position, while multi-criteria string stability means that the errors in speed and position do not propagate along the platoon. In Ref. [10], a multi-anticipative controller was devised that enabled an equipped vehicle to use information from its direct predecessor to predict the behavior of its pre-predecessor. In this way, the vehicle can anticipate future actions of the leader vehicle and adjust its own control accordingly. Subsequently, some methods based on DL and RL emerged, which can learn from data and experience without relying on explicit models or rules. Ref. [18] was one of the first papers to apply RL to platooning. The authors in [22] improved the reward function to make the training more reasonable, but they neglected the behavior of other vehicles and failed to optimize the overall efficiency of the platoon. In Ref. [23], a platoon control algorithm based on centralized RL was designed; however, its single-agent training strategy might still result in unstable training. In Ref. [19], a method based on multi-agent RL was developed, although the scenario was relatively simple and required communication between platoons.
For highway on-ramps, Ref. [24] suggested an optimal trajectory optimization strategy for connected and automated vehicles to cooperatively perform mainline platooning and on-ramp merging, in which each platoon could obtain its optimal control input in terms of fuel consumption. Ref. [25] transformed the complex 2D multi-platoon cooperation problem into a 1D platoon-following control problem and derived an analytical solution to the optimal control. Ref. [26] proposed a platoon-based hierarchical merging control algorithm for on-ramp vehicles to achieve automated merging control under a connected traffic environment. Some studies attempted to use machine learning to enhance the overall traffic efficiency of highway on-ramps; Ref. [27] presented an RL framework that adaptively adjusted the intra-platoon gap of an individual platoon member. However, both traditional and machine learning methods mostly assume that vehicles are intelligent connected vehicles that can exchange information freely, and it is unrealistic to expect all vehicles on the road to maintain good wireless communication at present. Communication problems could pose serious risks, such as collisions, delays, or breakdowns. Many studies have discussed this issue; for example, [28] summarizes the key technologies and challenges in platooning, and Ref. [29] discussed different communication technologies in the platoon and further examined security issues related to communication, such as jamming, spoofing, or hacking.

3. Preliminaries and Methods

3.1. Preliminary Knowledge

RL is learning what to do—how to map situations to actions—so as to maximize a numerical reward signal [30]. Unlike supervised learning, RL does not require labeled datasets to train neural networks. Instead, it allows agents to explore their environments and choose actions based on feedback from the environments to maximize long-term gains. At each time step $t$, the environment provides the current state $s_t$ from the state space $S$, and the agent selects an action $a_t \in A$ based on its observation $o_t$ of the state and its own policy $\pi(a_t|o_t)$, where $A$ is the action space. The environment then generates the reward $r(s_t, a_t)$ and produces $s_{t+1}$ for the next time step; $p(s_{t+1}|s_t, a_t)$ is called the state transition probability. When the environment satisfies the Markov property, the state transition probability is stable, and this process can be described as a Markov decision process (MDP). After a period of time, the agent will have collected a trajectory $(s_t, a_t, r_t, s_{t+1}, a_{t+1}, \dots)$. The return $g_t = r_t + \gamma g_{t+1}$ is defined as the accumulated discounted reward, where $\gamma$ is the discount factor and $g_t$ is the return value at time step $t$. The ultimate goal of RL is to optimize the policy $\pi$ so that $E(g_t)$ is maximized, which means that agents evolve toward the goals expected by their designers.
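To make the return definition concrete, the short sketch below (plain Python, with an illustrative three-step trajectory) accumulates $g_t = r_t + \gamma g_{t+1}$ by iterating backwards over a finished trajectory:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute g_t = r_t + gamma * g_{t+1} for every step of a finished trajectory."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: rewards 1, 0, 2 with gamma = 0.9 give returns [2.62, 1.8, 2.0].
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
```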
RL methods can be divided into value-based methods and policy-based methods. Q-learning is a classic value-based method. When the environment is a known Markov decision process, the values can be solved exactly with dynamic programming; otherwise, Monte Carlo methods can be used to approximate them from sampled trajectories. In practical applications, most states are continuous variables; therefore, a parameterized value function is used to approximate the values. DQN, which approximates the action-value function with a deep neural network, is a well-known method of this kind.
Policy-based methods were proposed to deal with continuous action spaces. They directly optimize the policy $\pi$ to maximize the expected state value, $\arg\max_\pi E_S[V^\pi(S)]$. A3C, TRPO, PPO, and other methods follow this basic principle while incorporating more advanced optimization techniques [31,32,33]; PPO in particular has become widely used.
MARL has attracted more attention than single-agent reinforcement learning (SARL) in recent years. When the agents cooperate with each other, the optimization goal is to maximize the total return. Some early studies relied on optimizing independent agents to maximize the global reward function: each agent used a TRPO network structure and treated other agents as part of the environment. However, this type of algorithm suffers from non-stationarity, because each agent is also part of the environment; the environmental state transition probabilities change continuously as the agent policies change, which violates the RL assumptions.
To address this problem, later studies mostly adopted centralized training, decentralized execution (CTDE) algorithms, which significantly improved MARL performance. Agents learn their own policy networks under the guidance of a centralized feedback network and then generate actions independently. Recent studies have applied CTDE algorithms to continuous action spaces, such as MAPPO [34] and MADDPG [35]. The two works share the idea that the critic can access global information: during training, a critic that observes the global situation guides actor training, while during inference, each actor uses only its own local observation to compute actions. In MADDPG, the authors argue that in cooperative and competitive scenarios, CTDE enables agents to discover and utilize information in the environment to produce better strategies than other algorithms. MAPPO, in turn, achieves excellent performance on multiple benchmarks without making significant changes to PPO.
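The split between centralized training and decentralized execution can be sketched in a few lines. The toy below (NumPy, with illustrative linear actors and critic; none of the names come from the paper) shows that each actor consumes only its own observation, while the critic consumes the concatenation of all observations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, act_dim = 3, 8, 1

# Decentralized execution: one small actor per agent, fed only that agent's observation.
actor_weights = [rng.standard_normal((obs_dim, act_dim)) * 0.1 for _ in range(n_agents)]

def act(agent_id, local_obs):
    return np.tanh(local_obs @ actor_weights[agent_id])

# Centralized training: the critic sees the concatenated observations of all agents.
critic_weights = rng.standard_normal((n_agents * obs_dim, 1)) * 0.1

def centralized_value(all_obs):
    return (np.concatenate(all_obs) @ critic_weights).item()

observations = [rng.standard_normal(obs_dim) for _ in range(n_agents)]
actions = [act(i, o) for i, o in enumerate(observations)]  # computed on board each truck
value = centralized_value(observations)                    # used only during training
```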

3.2. Methods

3.2.1. Truck Platoon Communication Topology

Figure 1 shows the overall framework of Platoon-MAPPO. The vehicles that are not controlled by the truck platoon are called interactive vehicles (IVs); they may affect the driving status of the truck platoon by traveling in front of it. The trucks in the platoon are called platoon members (PM), with PM_0 being the first truck and PM_N being the last one. The other trucks are numbered sequentially from PM_1 to PM_{N-1} and are referred to collectively as PM_n.
The platoon communication structure can be classified into three types: centralized, decentralized, and hybrid communication topologies. In a centralized topology, each vehicle communicates only with the lead vehicle, which is easy to control but may cause high latency. In a decentralized topology, each vehicle receives information only from the vehicle in front of it, which gives low latency. This study adopts a hybrid communication topology, as shown in the topology diagram in Figure 1: each truck receives information from both the front and rear vehicles, but the followers do not receive the information about the environmental vehicle detected by the lead vehicle. Compared with algorithms that only consider the front vehicle's driving status, this design is more effective in optimizing the overall performance of the truck platoon.
This communication strategy enables the trucks in the platoon to switch positions, adjust the number of vehicles, and dynamically select the lead vehicle, enhancing the applicability and flexibility of the algorithm.
A truck platoon consists of three types of trucks: PM_0, PM_n, and PM_N. This design makes the platoon length scalable and the algorithm compatible with any number of trucks; the training process can simulate the performance of the whole platoon for any platoon size. The scalability of the platoon size also enhances sampling efficiency, as each interaction with the simulation environment yields more information about the action values of the agents. The computational cost grows linearly with the platoon size in both the training and application phases; therefore, with distributed computing, the computational load of each truck remains constant regardless of the platoon size, which significantly reduces the overall computational burden on the platoon and avoids computational waste due to changes in the platoon size.

3.2.2. Crucial Elements

Platoon-MAPPO is a stochastic policy gradient algorithm that is trained centrally and executed in a distributed manner; each agent observes and acts independently. The REINFORCE approach is applied to estimate the unknown action-value function, using the actual return from a collected trajectory as a substitute. PPO introduces an advantage function to stabilize training, usually defined as the difference between the action-value function and the state-value function. The state-value function represents the expected return that an agent will receive in a certain state, so the advantage function intuitively shows how much better the selected action is than the average of all actions available in that state. The actor, π_μ(a|o), and the critic, V_θ(s), are both neural network functions. Under the multi-agent architecture, the critic observes the global information, i.e., the observations of all agents, and predicts the expected return values for all agents based on these observations.
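As a small numerical illustration of this estimate (all values are made up), the advantage is simply the Monte Carlo return minus the critic's value prediction:

```python
import numpy as np

def advantage(returns, values):
    """A_t = g_t - V_theta(s_t): how much better the sampled actions did than the
    critic's expectation, with Monte Carlo returns standing in for the unknown
    action-value function."""
    return np.asarray(returns) - np.asarray(values)

# Toy numbers: returns from a collected trajectory vs. the critic's predictions.
print(advantage(returns=[2.62, 1.8, 2.0], values=[2.5, 1.5, 1.9]))  # approx. [0.12 0.3 0.1]
```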
(1)
Observation and state: the observation of each truck in the platoon is the part of the platoon's overall state that the truck can observe. The observation of each truck comprises several elements: (a) the type of the truck, which can be PM_0, PM_n, or PM_N, indicating the position of the truck in the platoon; (b) the relative speed of the truck with respect to the vehicle in front of it; (c) the relative speed of the truck with respect to the vehicle behind it; (d) the relative distance between the truck and the vehicle in front of it; (e) the relative distance between the truck and the vehicle behind it; (f) the speed of the truck itself; and (g) if the truck is PM_0, the relative distance and speed between itself and any IV detected within its detection range. RL algorithms are not well suited to applying batch normalization to the input values; therefore, the inputs are scaled directly to [−1, 1] according to their value ranges (a construction sketch is given after this list). The position identifiers are 0, −1, and 1, representing PM_0, PM_n, and PM_N, respectively. When the platoon consists of more than three trucks, the position identifiers of all the middle trucks, PM_n, are set to −1. The training process can accommodate any number of trucks in the platoon.
The state refers to the various factors that influence driving in the traffic environment. As a centralized training algorithm, PPO needs to predict the state value of the entire environment; therefore, the state comprises the observations of all the trucks in the platoon.
(2)
Action: the main application scenario of truck platoon control technology is on highways, where the traffic environment is relatively simple. To minimize the interference of truck platoons with other human drivers, truck platoons will travel on the rightmost lane. Therefore, this article focuses on the longitudinal control of truck platoons. After obtaining the observation values, the decision-making algorithms will determine the driving force for each truck at each time step. The driving force indicates the acceleration or deceleration of the truck. When the driving force is positive, it means that the truck accelerates forward and is bounded by the maximum driving force value. When the driving force is negative, it means that the truck decelerates.
(3)
Reward: autonomous driving is typically a multi-objective optimization problem. These objectives may encompass various aspects, such as speed, travel time, collision, regulations, energy consumption, vehicle wear and tear, and passenger experience. For freight trucks, the main considerations are speed, energy consumption, and safety. The reward function should reflect the goals that autonomous driving vehicles aim to pursue, rather than the methods and techniques used to attain these goals. For instance, adding extra penalties for low speeds to motivate trucks to move forward is known as reward shaping. Reward shaping may facilitate learning in the early stages, but may also constrain the performance potential of the algorithm [36]. Therefore, our reward function only consists of three components: speed, energy consumption, and safety.
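Returning to item (1), the sketch below shows one way such an observation vector could be assembled and scaled to [−1, 1]. The scaling bounds, role names, and padding choice are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative bounds (not from the paper): relative speeds in [-10, 10] m/s,
# gaps in [0, 100] m, own speed in [0, 25] m/s (the range in Table 1).
REL_V_MAX, GAP_MAX, V_MAX = 10.0, 100.0, 25.0
POSITION_ID = {"PM0": 0.0, "PMn": -1.0, "PMN": 1.0}

def scale(x, lo, hi):
    """Linearly map x from [lo, hi] to [-1, 1], clipping out-of-range values."""
    return float(np.clip(2.0 * (x - lo) / (hi - lo) - 1.0, -1.0, 1.0))

def build_observation(role, dv_front, dv_rear, gap_front, gap_rear, v_own,
                      dv_iv=0.0, gap_iv=GAP_MAX):
    obs = [
        POSITION_ID[role],
        scale(dv_front, -REL_V_MAX, REL_V_MAX),
        scale(dv_rear, -REL_V_MAX, REL_V_MAX),
        scale(gap_front, 0.0, GAP_MAX),
        scale(gap_rear, 0.0, GAP_MAX),
        scale(v_own, 0.0, V_MAX),
        # Only the lead truck observes the IV; the slots are padded with neutral
        # values for the other trucks so all actors share one input size.
        scale(dv_iv, -REL_V_MAX, REL_V_MAX) if role == "PM0" else 0.0,
        scale(gap_iv, 0.0, GAP_MAX) if role == "PM0" else 1.0,
    ]
    return np.asarray(obs, dtype=np.float32)

obs = build_observation("PM0", dv_front=-2.0, dv_rear=1.0,
                        gap_front=35.0, gap_rear=30.0, v_own=20.0,
                        dv_iv=-3.0, gap_iv=40.0)
```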
The reward function is as follows:
$$ r = r_1 \min\!\left(\frac{v_{PM}}{v_{IV} + r_\varepsilon},\; 1\right) - r_2 D - r_3 \max(F_t,\, 0) $$
where $r$ is the reward at a given moment, $v_{PM}$ is the instantaneous speed of a platoon member, and $v_{IV}$ is the instantaneous speed of the IV. $r_\varepsilon$ is added to prevent the denominator from being zero and is set to $1 \times 10^{-5}$. $D$ is the collision penalty term, $F_t$ is the driving force, and $r_1$, $r_2$, and $r_3$ are coefficients used to adjust the weight of each term.
The reward function defines the objective of optimizing the truck platoon driving strategy, namely efficiency and safety. Efficiency means enhancing travel speed while reducing energy consumption as much as possible. We aim to keep the speed of the truck platoon close to the traffic flow speed: if the traffic flow is fast, the platoon should be incentivized to increase its speed; if the traffic flow is slow, the platoon should not receive extra benefits from high speeds, because high speeds are irrelevant at that point and may instead pose a risk. Therefore, we add a speed reward to the reward function. When $v_{PM} < v_{IV}$, the reward is proportional to the speed of the platoon; when $v_{PM} > v_{IV}$, no additional reward is given.
To ensure that policy optimization is carried out under safe driving conditions, a collision penalty term, $D$, is added to the reward function. If a collision occurs, this value is 1; otherwise, it is 0.
For electric trucks, the driving force is almost proportional to the energy consumption. Let the instantaneous energy consumption be $c = c_\varepsilon F_t$, where $c_\varepsilon$ is the linear coefficient mapping driving force to energy consumption. When the driving force is zero, the instantaneous energy consumption is also zero, so no bias term is needed in the reward function. An energy consumption penalty term was therefore added to the reward function, with $c_\varepsilon$ absorbed into the weight $r_3$.
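A minimal sketch of this reward, assuming the saturating speed term described above and placeholder weights (the paper does not state the values of $r_1$, $r_2$, $r_3$), could look like this:

```python
def platoon_reward(v_pm, v_iv, collided, f_t,
                   r1=1.0, r2=10.0, r3=1e-4, r_eps=1e-5):
    """Per-truck reward: speed term saturating at the IV speed, a collision
    penalty, and an energy penalty proportional to positive driving force.
    The weights r1, r2, r3 are placeholders, not the paper's values."""
    speed_term = r1 * min(v_pm / (v_iv + r_eps), 1.0)
    collision_term = r2 * (1.0 if collided else 0.0)
    energy_term = r3 * max(f_t, 0.0)   # traction force is ~proportional to energy use
    return speed_term - collision_term - energy_term

r = platoon_reward(v_pm=18.0, v_iv=20.0, collided=False, f_t=5000.0)
```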
The actor parameters $\mu$ are updated by maximizing the PPO clipped surrogate objective over the collected samples:
$$ J(\mu) = \sum_{(s,a)} \min\!\left(\frac{\pi_\mu(a|s)}{\pi_{\mu_{\mathrm{old}}}(a|s)}\, A(s,a),\; \operatorname{clip}\!\left(\frac{\pi_\mu(a|s)}{\pi_{\mu_{\mathrm{old}}}(a|s)},\, 1-\varepsilon,\, 1+\varepsilon\right) A(s,a)\right) $$
where $\pi_{\mu_{\mathrm{old}}}$ denotes the policy before the update.
Algorithm 1 shows the training process of the algorithm. We used the Monte Carlo method to estimate the action-value function and calculated the advantage function $A$ from it; $\varepsilon$ is a hyperparameter that controls the size of the clipping interval.
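The clipped objective above translates almost directly into code. The PyTorch sketch below (tensor values are made up; in practice the log-probabilities come from the Gaussian actor) computes $J(\mu)$ for a batch of samples:

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped surrogate J(mu): the probability ratio of the new policy to the old
    one, clipped to [1 - eps, 1 + eps], multiplied by the advantage."""
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()  # maximized (or its negative minimized)

lp_new = torch.tensor([-1.0, -0.5, -2.0])
lp_old = torch.tensor([-1.1, -0.7, -1.5])
adv = torch.tensor([0.3, -0.2, 1.0])
J = ppo_clip_objective(lp_new, lp_old, adv, eps=0.2)
```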
Algorithm 1: Platoon-MAPPO
Initialize the actor network, π_μ(a|o), and the critic network, V_θ(s), with weights μ and θ
Initialize batch size B, step sizes α and β, and done = 0
for episode = 0, 1, 2, … until convergence do
    while not done do
        Initialize actor and critic states
        for all agents i do
            a^i_{t,mean}, a^i_{t,std} = π(o^i_t; μ)
            Sample a^i_t from the normal distribution N(a^i_{t,mean}, (a^i_{t,std})²)
            v^i_t = V(s_t; θ)
        end for
        Send a_t to the simulation environment and obtain [r_t, s_{t+1}, o_{t+1}]
        Save [s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}]
    end while
    Compute the advantage function, A_t, and the return, R_t
    for k = 0, …, PPO epoch do
        Compute J(μ)
        Update the network parameters using the gradient method:
            μ_{k+1} ← μ_k + α ∇_μ J_{μ_k}(μ)
            θ_{k+1} ← θ_k − β ∇_θ (R_t − V_θ(s_t))²
    end for
end for
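The last lines of Algorithm 1 correspond to one actor and one critic gradient step per PPO epoch. A compact PyTorch sketch of such an update, with toy networks and illustrative dimensions standing in for π_μ and V_θ, is shown below:

```python
import torch

# Toy stand-ins for pi_mu and V_theta; all dimensions are illustrative.
actor = torch.nn.Linear(8, 2)      # outputs (a_mean, a_log_std) for one agent
critic = torch.nn.Linear(24, 1)    # centralized critic over the global state
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)    # step size alpha
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # step size beta

def ppo_epoch(obs, global_state, actions, old_log_prob, advantages, returns, eps=0.2):
    # Actor: gradient ascent on J(mu), i.e. mu_{k+1} <- mu_k + alpha * grad J.
    mean, log_std = actor(obs).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.exp())
    ratio = torch.exp(dist.log_prob(actions).sum(-1) - old_log_prob)
    J = torch.min(ratio * advantages,
                  torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    actor_opt.zero_grad(); (-J).backward(); actor_opt.step()

    # Critic: gradient descent on (R_t - V_theta(s_t))^2.
    value_loss = ((returns - critic(global_state).squeeze(-1)) ** 2).mean()
    critic_opt.zero_grad(); value_loss.backward(); critic_opt.step()
```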

4. Simulation

4.1. Simulation Platform

Traffic simulation is the application of computer technology to mimic real traffic systems, which can virtually reproduce realistic traffic scenarios at low cost. RL algorithms require vehicles to learn driving strategies through trial-and-error interaction with the traffic environment, which makes it infeasible to train vehicle strategies directly in real traffic. Therefore, we established a road simulation environment in SUMO. SUMO is an open-source microscopic traffic simulation platform widely used in traffic research. It enables users to construct various types of road networks and to define basic parameters and behaviors of vehicles, and it provides interfaces for third-party programs, allowing algorithms to interact directly with the simulation environment.
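Such interaction typically goes through SUMO's TraCI Python client. The fragment below is a minimal sketch of this loop; the configuration file name and the pass-through speed command are placeholders, not the coupling used in the paper:

```python
import traci  # Python client distributed with SUMO

# "platoon.sumocfg" is a placeholder configuration file.
traci.start(["sumo", "-c", "platoon.sumocfg", "--step-length", "0.1"])
try:
    for _ in range(1000):
        traci.simulationStep()                        # advance the simulation one step
        for veh_id in traci.vehicle.getIDList():
            speed = traci.vehicle.getSpeed(veh_id)    # part of the RL observation
            # A controller would compute a new command here; setSpeed overrides
            # SUMO's built-in car-following model for that vehicle.
            traci.vehicle.setSpeed(veh_id, speed)
finally:
    traci.close()
```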
To make the platoon control strategy better match the demands of realistic situations, a dynamic and variable traffic environment was constructed that reflects the conditions an actual platoon control algorithm would face.
Slopes affect the acceleration and deceleration capabilities of trucks and have a significant effect on the fuel consumption and safety performance of platoons. Some studies have evaluated the energy consumption of truck platooning under different slope conditions, and the results show that road slope has a considerable impact on it [37]. Therefore, when devising energy-saving driving strategies for truck platooning, the impact of road slope must be taken into account. A normal highway slope is within 5%; thus, the road slope was set to α ∈ [−0.05, 0.05], where α denotes the tangent of the slope angle.
The interference that platooning encounters varies with the traffic flow state. The most influential vehicle for the platoon is the interactive vehicle (IV) in front of it. The speed of an IV may fluctuate when it cruises or enters and exits ramps in real-world traffic, and surrounding vehicles may alter their speed more frequently in unstable traffic flow. These situations require the platoon to adjust accordingly; otherwise, energy consumption and safety risks may increase.
We considered two scenarios of traffic flow in ramp areas. In the first case, vehicles entered from the ramp in front of the IV, and the IV adopted the IDM following strategy. In the second case, the IV entered from the ramp in front of the platoon, so the platoon faced disturbance from a vehicle that suddenly appeared. To simulate these scenarios, we assumed that the platoon could only observe ramp vehicles when they were 50 m away from the ramp entrance. Ramp vehicles entered the main road at random speeds before the platoon reached the ramp and then accelerated to the speed limit after entering the main road. We set up a 3 km road; at the beginning of the road, the truck platoon started with a spacing of 30 m and a speed of 20 m/s. The maximum speeds of the PMs and the IV were 25 m/s and 27.6 m/s, respectively. IVs started at random positions within one hundred meters in front of the truck platoon, with the same initial speed. To simulate different traffic conditions for the platoon, IVs first cruised at a stable speed, then randomly decelerated to a lower value to simulate interference from ramps in real traffic, and finally accelerated to the maximum speed limit to simulate free-flowing traffic.

4.2. Longitudinal Dynamics of Trucks

Trucks have many degrees of freedom, which makes establishing an accurate dynamic model extremely complex. However, this article focuses on decision-making and planning algorithms, so dynamic modeling is not its main concern; moreover, overly complex models would hinder the subsequent simulations. This article studies straight or approximately straight regions of highways and therefore establishes a simplified longitudinal truck dynamics model. The model takes into account the main factors affecting vehicle operation so that it does not deviate from reality, and compared with sophisticated models, this simplified model is adequate to verify the effectiveness of the algorithm.
The forces acting on an electric truck while driving are:
$$
\begin{aligned}
F_t &= F_f + F_w + F_i + F_j \\
F_f &= m g \cos\omega\,(f_a + f_b v_c) \\
F_w &= 0.5\, a_D C_D U \rho v_c^2 \\
F_i &= m g \sin\omega \\
F_j &= m a_c \\
F_{b\max} &= \varphi_b\, m g \cos\omega + m g \sin\omega
\end{aligned}
$$
where $F_t$ is the traction force required by the truck, which consists of four parts: the rolling resistance $F_f$, the air drag $F_w$, the grade (gravity) resistance $F_i$, and the acceleration force $F_j$; $f_a$ and $f_b$ are rolling resistance coefficients; $a_c$ is the acceleration of the truck; $a_D$ is the platoon drag correction factor introduced below; and $F_{b\max}$ is the maximum braking force, with $\varphi_b$ the adhesion coefficient.
When the truck platoon is moving, the air resistance experienced by the trucks in the platoon is significantly reduced. Therefore, it was necessary to correct the effective air resistance coefficient, $C_D$, of each truck. Studies [38] have shown that the overall air resistance coefficient decreases as the distance between vehicles decreases, and a correction factor for the air resistance, $a_D$, has been proposed:
$$
a_D = \begin{cases}
0.7231\,(x_{PM_0} - x_{PM_1} - l)^{0.0919} & \text{for } PM_0 \\[2pt]
0.2241\,(x_{PM_{n-1}} - x_{PM_n} - l)^{0.1369} + 0.5016 & \text{for } PM_n
\end{cases}
$$
When calculating $a_D$ for $PM_0$, $(x_{PM_0} - x_{PM_1} - l)$ is the gap between the lead truck $PM_0$ and its follower $PM_1$; when calculating $a_D$ for $PM_n$, $(x_{PM_{n-1}} - x_{PM_n} - l)$ is the gap between $PM_n$ and its predecessor $PM_{n-1}$. The values of the parameters in Formulas (3) and (4) are shown in Table 1.
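Putting the longitudinal model and the drag correction together, a minimal sketch (assuming g = 9.8 m/s² to recover the mass from the gravity value mg in Table 1) is:

```python
import math

# Parameters from Table 1; the slope argument is the tangent of the road angle.
MG, CD, U, RHO = 200_000.0, 0.564, 5.8, 1.2258
FA, FB, G = 0.0041, 0.000025, 9.8

def drag_correction(gap, leader=False):
    """Platoon drag correction a_D from the fitted formulas above (gap in metres)."""
    if leader:
        return 0.7231 * gap ** 0.0919
    return 0.2241 * gap ** 0.1369 + 0.5016

def traction_force(v, accel, slope_tan, gap, leader=False):
    """Simplified longitudinal model: F_t = F_f + F_w + F_i + F_j."""
    omega = math.atan(slope_tan)
    mass = MG / G                                           # assumption: g = 9.8 m/s^2
    f_f = MG * math.cos(omega) * (FA + FB * v)              # rolling resistance
    f_w = 0.5 * drag_correction(gap, leader) * CD * U * RHO * v ** 2  # air drag
    f_i = MG * math.sin(omega)                              # grade resistance
    f_j = mass * accel                                      # acceleration force
    return f_f + f_w + f_i + f_j

print(traction_force(v=20.0, accel=0.2, slope_tan=0.0, gap=20.0, leader=True))
```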
The neural networks comprise fully connected layers and activation function layers. The actor network outputs the mean, $a_{t,mean}$, and standard deviation, $a_{t,std}$, of a normal distribution. We observed that using two independent networks to output them separately improves performance; hence, we set up two identical actor networks whose hidden fully connected layers have 64, 128, 128, 64, and 64 neurons. The dimensions of the first and last layers match those of the observation and the action, respectively.
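A PyTorch sketch of such an actor head is given below; the activation choice, the softplus transform keeping the standard deviation positive, and the input/output dimensions are assumptions for illustration:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 128, 128, 64, 64)):
    """Fully connected stack with the hidden sizes described above (Tanh assumed)."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.Tanh()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

obs_dim, act_dim = 8, 1            # illustrative dimensions
mean_net = mlp(obs_dim, act_dim)   # one network for a_{t,mean}
std_net = mlp(obs_dim, act_dim)    # a second, identical network for a_{t,std}

obs = torch.randn(1, obs_dim)
mean = mean_net(obs)
std = nn.functional.softplus(std_net(obs)) + 1e-3   # keep the std strictly positive
action = torch.distributions.Normal(mean, std).sample()
```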
Furthermore, we employed the IDM [39] model and the CACC [40] control method as the baselines to control the platoon behavior, and compared them with the platoon-MAPPO algorithm. IDM is a well-established traffic flow simulation model; it can emulate the driving behavior of human drivers in traffic flow. CACC is a widely used control method for autonomous vehicle following, which enables cooperative driving by communicating with the preceding vehicle.

5. Results

We performed five simulation experiments with different initial random seeds. Figure 2 shows the training results.
We set up a simple test scenario to intuitively demonstrate the effectiveness of the Platoon-MAPPO algorithm. The IV initially accelerates to 15 m/s; at 20 s, it decelerates to 5 m/s to simulate the interference from a ramp, and after 15 s, it accelerates again to 25 m/s. All vehicles have an initial speed of 10 m/s and a maximum acceleration of 1 m/s². PM_0 starts 40 m away from the IV. The platoon consists of 10 trucks, and the initial headway within the platoon is 30 m. The test lasts for 1000 s.
We plotted a spatiotemporal map for each truck, as depicted in Figure 3, which shows the positional changes of the trucks over time, and a heatmap of the speed of each truck at each moment in Figure 4, which shows the speed changes of all trucks in the platoon more intuitively. In the test, PM_0 is the first to observe the speed change of the IV: when the IV changes its speed, PM_0 reacts first, and the reaction then propagates to each subsequent truck. To observe the reaction speed of each truck, we measured the reaction time difference according to the moment when the speed of each vehicle crosses 10 m/s. The reaction time differences during deceleration for IDM, CACC, and Platoon-MAPPO are 14 s, 9.2 s, and 4.1 s, respectively; the platoon reaction time of Platoon-MAPPO is significantly shorter than that of IDM and CACC.
Under IDM and CACC, the speed changes cause traffic oscillations that propagate through the entire platoon and have a lasting impact on it; for IDM in particular, the inter-vehicle distance keeps increasing after an interference. In contrast, Platoon-MAPPO can effectively filter traffic oscillations. From the heatmap, we can see that the rear trucks have smaller speed changes than the front trucks, which indicates that they can reduce energy consumption through smoother acceleration; smoother speed changes reduce the energy consumed in acceleration, and a smaller headway reduces the overall air resistance of the platoon. In this test, taking the energy consumption of IDM and CACC as benchmarks, Platoon-MAPPO reduces energy consumption by 14.8% and 32.7%, respectively.
More specifically, under the same simulation duration, the last truck of Platoon-MAPPO travels the longest distance, followed by CACC and then ordinary IDM. This indicates that our proposed model has the highest overall travel speed and does not suffer from significant speed reduction due to energy saving needs.
Table 2 shows average traction forces and their components for each control strategy in this test scenario. Traction force is an approximate linear function of energy consumption for electric trucks; therefore, we can compare energy consumption accordingly.
For CACC, the control strategy may aim for a higher instantaneous speed, but increasing the instantaneous speed often does not reduce travel time. Figure 4 shows that CACC did not eliminate traffic oscillations in this test scenario: the speed oscillations at the front of the platoon propagate to the rear, causing unnecessary acceleration and deceleration of the rear trucks, which may explain its higher energy consumption. Platoon-MAPPO's F_w is also higher than IDM's because Platoon-MAPPO has a higher average speed; however, the platoon effectively filters traffic oscillations and has the least unnecessary acceleration and deceleration, resulting in the smallest F_j.
As mentioned above, the Platoon-MAPPO algorithm can be applied to truck platoons with any number of trucks. Another significant feature of platoons is that they reduce road occupancy and improve road capacity. To verify the effect of the number of vehicles on the platoon, we repeated the previous test with the platoon size as a variable, setting the minimum number of trucks to 3 and the maximum to 20, and measured the length of road occupied by the platoon. Figure 5 shows the distance from the head of PM_0 to the tail of PM_N at each time step.
Green, orange, and purple represent IDM, CACC, and Platoon-MAPPO control strategies, respectively. The shaded area represents the fleet length when there are 3 to 20 trucks. The dark line in the middle represents the average value. The two red lines represent the ratios of road space saved by Platoon-MAPPO compared with IDM and CACC at each time step.
The initial platoon lengths of IDM, CACC, and Platoon-MAPPO were identical. During the acceleration phase, the fleet lengths of IDM and CACC increased gradually due to the lag effect of truck acceleration. The trucks that accelerated earlier had higher speeds and covered longer distances in a given time, resulting in an increased gap between the trucks until the leading truck encountered an obstacle or a speed limit and stopped accelerating. In contrast, the platoon length of Platoon-MAPPO decreased during this phase because Platoon-MAPPO enabled each truck to have a very low reaction delay and a small safe distance under cooperative control, allowing the rear trucks to accelerate slightly more than the front trucks at low speeds until they reached an optimal gap, as shown in Figure 5. During the deceleration phase, the platoon lengths of IDM and CACC changed dramatically compared with Platoon-MAPPO’s due to the lag effect. During the re-acceleration phase, the fleet lengths of IDM and CACC increased significantly as the speed increased, and this change was proportional to the number of vehicles in their platoons; more trucks led to a faster increase in platoon length (see Figure 5 for the upper and lower bounds of the shaded area). The platoon length of Platoon-MAPPO was relatively stable and increased slowly with the speed increase, but it was not significantly affected by the number of vehicles in their fleets. The higher the number of vehicles, the higher the instantaneous speed; thus, Platoon-MAPPO saved more road space. In this test scenario, Platoon-MAPPO reduced the occupied road space by averages of 43.3% and 34.2% compared with IDM and CACC, respectively, with maximum reductions of 58.2% and 35.7%, respectively.
We tested the Platoon-MAPPO algorithm in various dynamic traffic scenarios to further evaluate its adaptability. Based on the previous test scenario, we set the target speed for IV deceleration as a variable with the same range of values as in the training process. For quantitative analysis, we selected five values within the range, used each value as a test scenario, and measured the energy consumption. Figure 6 shows some cases extracted from the test scenarios, with target speeds of 5 km/h and 10 km/h for IV deceleration and road slopes of 0 and 0.03. For each control strategy, four curves can therefore be extracted, drawn as gray curves with different line styles. To facilitate observation and analysis, we also plotted the mean energy consumption ratios of IDM, CACC, and Platoon-MAPPO as green, yellow, and blue curves, respectively. The horizontal axis in the figure is the number of trucks in the platoon, and the vertical axis is the energy consumption ratio. The energy consumption ratio was calculated by taking the maximum total energy consumption in the test scenarios as a benchmark and dividing the other energy consumption values by this maximum, which eliminates the linear coefficient between traction force and energy consumption.
Platoon-MAPPO consistently exhibited the lowest energy consumption in various scenarios. The energy consumption of all three control strategies decreased with the increase in the number of vehicles in their platoon. This was especially noticeable in IDM, which might be because IDM’s control strategy could effectively reduce traffic oscillations, while CACC’s excessive pursuit of instantaneous speed made its acceleration change too sensitive, leading to high energy consumption. In the dynamic traffic scenarios tested, Platoon-MAPPO’s energy consumption was reduced by an average of 3.3% compared with IDM, with a maximum reduction of 4.0%.
Acceleration and climbing are extremely energy-consuming for trucks. For instance, when the slope was 0.02, the truck speed was 10 m/s, and the acceleration was 0.5 m/s², the rolling resistance force, the air drag force, the gravity force, and the force providing vehicle acceleration accounted for 5.8%, 1.2%, 34.9%, and 58.1% of the total force, respectively; the air resistance therefore has a very small effect during acceleration or climbing. When the slope and the acceleration were both zero and the speed was 20 m/s, the rolling resistance force and the air drag force accounted for 59.5% and 40.5% of the total force, respectively. In high-speed uniform motion on flat roads, the air resistance has a very high share, but its absolute value is not large. Therefore, in dynamic traffic scenarios, Platoon-MAPPO's advantage over IDM may not seem obvious, but this is because large force components that are hard to reduce through control strategies mask the effort to reduce the additional consumption.

6. Conclusions

Platoon-MAPPO is a truck platooning control algorithm based on MARL that can achieve efficient and safe platooning in highway ramp regions. Platoon-MAPPO adopts a centralized training, decentralized execution framework, allowing each truck to act independently without relying on a central controller and making the allocation of computing resources simpler. Each truck refers only to the driving status of the trucks directly in front of and behind it, without requiring communication within the platoon or with the road network, which makes it easier for each truck's driving strategy to maximize the overall interests of the platoon. This architecture can greatly improve the flexibility of platoon formation compared with architectures that rely on communication and centralized computing.
According to the experimental results, Platoon-MAPPO reduced average energy consumption by 14.8% and 32.7% compared with the rule-based IDM model and the CACC control method, respectively. The results show that the method significantly filters traffic oscillations, greatly reduces meaningless acceleration and deceleration, and maintains driving speed while reducing energy consumption; compared with IDM and CACC, Platoon-MAPPO achieves a higher overall speed at lower energy consumption. At the same time, Platoon-MAPPO maintains a smaller headway, which not only reduces energy consumption but also significantly reduces road occupancy: compared with IDM and CACC, road occupancy is reduced by 43.3% and 34.2% on average, with maximum reductions of 58.2% and 35.7%, respectively. In conclusion, Platoon-MAPPO can significantly improve the energy efficiency of truck platoons and improve road capacity.
Platoon-MAPPO benefits from a good basic model and reward function design, achieving good performance in experiments. However, similarly to general DL algorithms, RL cannot guarantee reasonable output results for all inputs, which is one of the reasons why learning-based algorithms have not yet been widely used in vehicle control fields. To address this issue, future research can start from two aspects: one is to use traditional methods to assist in the real-time monitoring of unsafe behavior; the other is to improve the interpretability of RL models and train them in more comprehensive and diverse traffic scenarios to improve robustness.

Author Contributions

Conceptualization, J.C.; methodology, J.C. and Z.Z.; software, Z.Z. and Y.D.; validation, Y.D.; formal analysis, Z.Z.; investigation, J.C.; resources, J.C.; data curation, Y.D.; writing—original draft preparation, Y.D.; writing—review and editing, Z.Z.; visualization, Z.Z.; supervision, B.Y.; project administration, B.Y.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant No. 51805133, and the Innovation Project of New Energy Vehicle and Intelligent Connected Vehicle of Anhui Province.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Harker, B.J. PROMOTE-CHAUFFEUR II & 5.8 GHz vehicle to vehicle communications system. In Proceedings of the 2001 ADAS International Conference on Advanced Driver Assistance Systems (IEE Conf. Publ. No. 483), Birmingham, UK, 17–18 September 2001. Available online: https://digital-library.theiet.org/content/conferences/10.1049/cp_20010504 (accessed on 15 August 2023).
2. Shladover, S.E. AHS research at the California PATH program and future AHS research needs. In Proceedings of the 2008 IEEE International Conference on Vehicular Electronics and Safety, Columbus, OH, USA, 22–24 September 2008; pp. 4–5.
3. Shladover, S.E. PATH at 20—History and major milestones. IEEE Trans. Intell. Transp. Syst. 2007, 8, 584–592.
4. Kunze, R.; Tummel, C.; Henning, K. Determination of the order of electronically coupled trucks on German motorways. In Proceedings of the 2009 2nd International Conference on Power Electronics and Intelligent Transportation System (PEITS), Shenzhen, China, 19–20 December 2009; pp. 41–46.
5. Tsugawa, S.; Kato, S.; Aoki, K. An automated truck platoon for energy saving. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 4109–4114.
6. Robinson, T.; Chan, E.; Coelingh, E. Operating platoons on public motorways: An introduction to the SARTRE platooning programme. In Proceedings of the 17th World Congress on Intelligent Transport Systems, Busan, Republic of Korea, 25–29 October 2010; p. 12.
7. Zhao, W.; Ngoduy, D.; Shepherd, S.; Liu, R.; Papageorgiou, M. A platoon based cooperative eco-driving model for mixed automated and human-driven vehicles at a signalised intersection. Transp. Res. Part C Emerg. Technol. 2018, 95, 802–821.
8. Zhou, Y.; Wang, M.; Ahn, S. Distributed model predictive control approach for cooperative car-following with guaranteed local and string stability. Transp. Res. Part B Methodol. 2019, 128, 69–86.
9. Zhou, Y.; Ahn, S.; Wang, M.; Hoogendoorn, S. Stabilizing mixed vehicular platoons with connected automated vehicles: An H-infinity approach. Transp. Res. Part B Methodol. 2020, 132, 152–170.
10. Wang, M.; Daamen, W.; Hoogendoorn, S.P.; van Arem, B. Rolling horizon control framework for driver assistance systems. Part II: Cooperative sensing and cooperative control. Transp. Res. Part C Emerg. Technol. 2014, 40, 290–311.
11. He, X.; Liu, H.X.; Liu, X. Optimal vehicle speed trajectory on a signalized arterial with consideration of queue. Transp. Res. Part C Emerg. Technol. 2015, 61, 106–120.
12. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
13. Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 1 September 2022.
14. Ye, F.; Zhang, S.; Wang, P.; Chan, C.Y. A Survey of Deep Reinforcement Learning Algorithms for Motion Planning and Control of Autonomous Vehicles. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium (IV), Nagoya, Japan, 11–17 July 2021; pp. 1073–1080.
15. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 4909–4926.
16. Talpaert, V.; Sobh, I.; Kiran, B.R.; Mannion, P.; Yogamani, S.; El-Sallab, A.; Perez, P. Exploring applications of deep reinforcement learning for real-world autonomous driving systems. arXiv 2019, arXiv:1901.01536.
17. Hoel, C.-J.; Wolff, K.; Laine, L. Automated speed and lane change decision making using deep reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2148–2155.
18. Desjardins, C.; Chaib-draa, B. Cooperative Adaptive Cruise Control: A Reinforcement Learning Approach. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1248–1260.
19. Li, M.; Cao, Z.; Li, Z. A Reinforcement Learning-Based Vehicle Platoon Control Strategy for Reducing Energy Consumption in Traffic Oscillations. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 5309–5322.
20. Chen, C.; Jiang, J.; Lv, N.; Li, S. An Intelligent Path Planning Scheme of Autonomous Vehicles Platoon Using Deep Reinforcement Learning on Network Edge. IEEE Access 2020, 8, 99059–99069.
21. Chen, C.; Zhang, Y.; Khosravi, M.R.; Pei, Q.; Wan, S. An Intelligent Platooning Algorithm for Sustainable Transportation Systems in Smart Cities. IEEE Sens. J. 2021, 21, 15437–15447.
22. Zhu, M.; Wang, Y.; Pu, Z.; Hu, J.; Wang, X.; Ke, R. Safe, efficient, and comfortable velocity control based on reinforcement learning for autonomous driving. Transp. Res. Part C Emerg. Technol. 2020, 117, 102662.
23. Chu, T.; Kalabić, U. Model-based deep reinforcement learning for CACC in mixed-autonomy vehicle platoon. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; pp. 4079–4084.
24. Gao, Z.; Wu, Z.; Hao, W.; Long, K.; Byon, Y.J.; Long, K. Optimal Trajectory Planning of Connected and Automated Vehicles at On-Ramp Merging Area. IEEE Trans. Intell. Transp. Syst. 2022, 23, 12675–12687.
25. Kumaravel, S.D.; Malikopoulos, A.A.; Ayyagari, R. Decentralized Cooperative Merging of Platoons of Connected and Automated Vehicles at Highway On-Ramps. In Proceedings of the 2021 American Control Conference (ACC), New Orleans, LA, USA, 25–28 May 2021; pp. 2055–2060.
26. Xue, Y.; Ding, C.; Yu, B.; Wang, W. A Platoon-Based Hierarchical Merging Control for On-Ramp Vehicles Under Connected Environment. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21821–21832.
27. Yadavalli, S.R.; Das, L.C.; Won, M. RLPG: Reinforcement Learning Approach for Dynamic Intra-Platoon Gap Adaptation for Highway On-Ramp Merging. arXiv 2022, arXiv:2212.03497.
28. Jia, D.; Lu, K.; Wang, J.; Zhang, X.; Shen, X. A Survey on Platoon-Based Vehicular Cyber-Physical Systems. IEEE Commun. Surv. Tutor. 2016, 18, 263–284.
29. Willke, T.L.; Tientrakool, P.; Maxemchuk, N.F. A survey of inter-vehicle communication protocols and their applications. IEEE Commun. Surv. Tutor. 2009, 11, 3–20.
30. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; Adaptive Computation and Machine Learning; The MIT Press: Cambridge, MA, USA, 2018.
31. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, Online, 16 June 2016; pp. 1928–1937.
32. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Online, 6–11 July 2015; pp. 1889–1897.
33. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
34. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
35. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, O.P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
36. Knox, W.B.; Allievi, A.; Banzhaf, H.; Schmitt, F.; Stone, P. Reward (mis)design for autonomous driving. Artif. Intell. 2023, 316, 103829.
37. Alam, A.; Besselink, B.; Turri, V.; Mårtensson, J.; Johansson, K.H. Heavy-duty vehicle platooning for sustainable freight transportation: A cooperative method to enhance safety and efficiency. IEEE Control Syst. Mag. 2015, 35, 34–56.
38. Hussein, A.A.; Rakha, H.A. Vehicle Platooning Impact on Drag Coefficients and Energy/Fuel Saving Implications. IEEE Trans. Veh. Technol. 2022, 71, 1199–1208.
39. Treiber, M.; Hennecke, A.; Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 2000, 62, 1805–1824.
40. Milanés, V.; Shladover, S.E. Modeling cooperative and autonomous adaptive cruise control dynamic responses using experimental data. Transp. Res. Part C Emerg. Technol. 2014, 48, 285–300.
Figure 1. Framework of Platoon-MAPPO.
Figure 2. Return curve.
Figure 3. Spatiotemporal map for each truck.
Figure 4. Heatmap of platoon speed.
Figure 5. Platoon length.
Figure 6. Energy consumption of the platoon.
Table 1. Truck longitudinal dynamics parameter settings.

Parameter | Value | Description
mg | 200,000 N | Gravity of truck
l | 17 m | Length of truck
f_a | 0.0041 | Rolling resistance coefficient
f_b | 0.000025 | Rolling resistance coefficient
φ_b | (0.3, 0.6) | Adhesion factor of truck
C_D | 0.564 | Air resistance coefficient of truck
U | 5.8 m² | Frontal projection area of truck
ρ | 1.2258 kg/m³ | Density of air
ω | [−0.05, 0.05] | Road slope
v_c | [0 m/s, 25 m/s] | Velocity of truck
Table 2. Traction force composition.

Control Strategy | F_t (N) | F_w (N) | F_j (N) | F_f (N)
IDM | 7368 | 562 | 5690 | 1115
CACC | 8334 | 712 | 6464 | 1157
Platoon-MAPPO | 6278 | 643 | 4480 | 1155
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

