Research on Reinforcement-Learning-Based Truck Platooning Control Strategies in Highway On-Ramp Regions

Abstract: With the development of autonomous driving technology, truck platooning control has become a reality. Truck platooning can improve road capacity by maintaining small headways, and platooning systems can significantly reduce fuel consumption and emissions, especially for trucks. In this study, we designed a Platoon-MAPPO algorithm based on multi-agent reinforcement learning to implement truck platooning control for platoons facing highway on-ramp scenarios. A centralized-training, decentralized-execution algorithm was used in this paper. Each truck computes only its own actions, avoiding the data computation delay caused by centralized computation, and each truck considers the status of the trucks in front of and behind itself, maximizing the overall gain of the platoon and improving global operational efficiency. For performance evaluation, we used a traditional rule-based platoon-following model as a benchmark; to ensure fairness, it used the same network structure and traffic scenario as our proposed model. The simulation results show that the proposed algorithm performs well and improves the overall efficiency of the platoon while guaranteeing traffic safety. The average energy consumption decreased by 14.8%, and the road occupancy rate decreased by 43.3%.


Introduction
Truck platooning refers to a driving state in which two or more vehicles drive on the highway while controlling the inter-vehicle distance and driving state to form a tight vehicle cluster. It uses sensors or communication technologies to obtain information from adjacent trucks and can autonomously maintain speed or distance. As early as 1996, the European Commission (EC) Information Society Technologies (IST) programme launched the CHAUFFEUR project to assess truck platooning [1], studying configurations in which either all trucks were automated or only the lead truck was driven by a human. Similar projects have been carried out in other countries, such as California PATH, Energy-ITS, KONVOI, and SARTRE [2][3][4][5][6]. These projects focus mostly on truck control methods, platoon stability, and external truck communication; all require automated driving strategies to control all or some of the trucks in the platoon.
Platooning offers several benefits, one of which is enhanced traffic efficiency. By keeping smaller gaps between vehicles traveling at the same speed, platooning can increase road capacity compared with normal traffic flow. Another benefit is improved driving safety. Trucks in the platoon communicate and coordinate with each other under autonomous driving control, resulting in smaller speed variations and faster reaction times than human drivers. A third benefit is a reduced environmental impact. The transportation sector is a major contributor to energy consumption and greenhouse gas emissions, with road transport accounting for 72.8% and 73.4%, respectively [7]. Platooning can significantly lower fuel use and emissions, especially for large vehicles such as trucks.
Early research projects used sensors to measure the distance to the preceding vehicle and a lower-level controller to maintain fixed spacing. For instance, refs. [2,4,5] achieved spacings of 6 m, 10 m, and 4.7 m, respectively. However, this method does not minimize total energy consumption when the speed varies, as the platoon might need more energy to keep the preset spacing. Subsequent research formulated cost functions based on time and energy consumption and applied optimization and optimal control theory to find optimal solutions [8][9][10][11]. However, many of these methods do not scale to large platoons, as the computational cost increases with the number of vehicles.
In the past decade, much research has applied deep learning (DL) to automate driving tasks as much as possible [12,13]. However, collecting decision-making and planning datasets for autonomous driving is costly and challenging, as it is hard to cover complex real-world traffic scenarios.
Reinforcement learning (RL) algorithms have emerged as a promising alternative for vehicle decision-making, planning, and control problems [14][15][16]. These algorithms do not require manual data collection and can adapt to various tasks. Some research has used RL to achieve autonomous driving for single vehicles [17], but few studies have applied it to multi-vehicle cooperative control [18]. This is because treating other vehicles as part of the environment violates the Markov property and makes learning unstable and difficult.
Multi-agent reinforcement learning (MARL) has been explored for controlling multi-vehicle behavior in recent studies. For example, ref. [19] only considered optimal stop-and-go strategies on flat roads, while ref. [20] applied DQN to vehicle grouping strategies in road networks. Ref. [21] used Q-learning to find optimal insertion points for vehicles entering fleets.
This paper proposes a Platoon-MAPPO algorithm based on MARL for truck platoon control in highway on-ramp regions, the most complex traffic situation on highways. Existing platoon control methods, such as traditional control methods and DL-based methods, are not effective in this scenario. The Platoon-MAPPO algorithm has the following features. Compared with traditional control methods and existing methods based on DL and RL, Platoon-MAPPO can effectively reduce road occupancy and energy consumption. Moreover, Platoon-MAPPO is more flexible than existing RL-based methods because it does not require intra-platoon communication. Additionally, with distributed computing, the computational cost of each vehicle is independent of the number of platoon members, making the algorithm more practical and avoiding the waste of computing resources. Traditional algorithms have strong interpretability and can serve as a safety guarantee to assist Platoon-MAPPO operation.

Related Work
Most traditional platooning methods rely on optimal control theory, which seeks the optimal control inputs for each vehicle to achieve objectives such as fuel efficiency, safety, and comfort. For instance, ref. [8] proposed a serial distributed model predictive control (MPC) approach that ensured local stability and multi-criteria string stability for connected automated vehicles. Local stability means that each vehicle can track its desired speed and position, while multi-criteria string stability means that errors in speed and position do not propagate along the platoon. In ref. [10], a multi-anticipative controller was devised that enabled an equipped vehicle to use information from its direct predecessor to predict the behavior of its pre-predecessor. In this way, the vehicle can anticipate future actions of the leader vehicle and adjust its own control accordingly. Subsequently, methods based on DL and RL emerged, which can learn from data and experience without relying on explicit models or rules. Ref. [18] was one of the first papers to apply RL to platooning. The authors of [22] improved the reward function to make training more reasonable, but they neglected the behavior of other vehicles and failed to optimize the overall efficiency of the platoon. In ref. [23], a platoon control algorithm based on centralized RL was designed; however, its single-agent training strategy might still result in unstable training. In ref. [19], a method based on multi-agent RL was developed, although the scenario was relatively simple and required communication between platoons.
For highway on-ramps, ref. [24] proposed an optimal trajectory optimization strategy for connected and automated vehicles to cooperatively perform mainline platooning and on-ramp merging, in which each platoon could obtain its fuel-optimal control input. Ref. [25] transformed the complex 2D multi-platoon cooperation problem into a 1D platoon-following control problem and derived an analytical solution to the optimal control. Ref. [26] proposed a platoon-based hierarchical merging control algorithm for on-ramp vehicles to achieve automated merging under a connected traffic environment. Some studies attempted to use machine learning to enhance the overall traffic efficiency of highway on-ramps. Ref. [27] presented an RL framework that adaptively adjusted the intra-platoon gap of an individual platoon member. However, both traditional and machine learning methods mostly assumed that vehicles were intelligent connected vehicles that could exchange information freely. It is unrealistic, at present, to expect all vehicles on the road to maintain good wireless communication, and communication problems could pose serious risks, such as collisions, delays, or breakdowns. Many studies have mentioned this issue; for example, ref. [28] summarizes the key technologies and challenges in platooning, and ref. [29] discussed communication technologies within platoons and further examined related security issues such as jamming, spoofing, and hacking.

Preliminary Knowledge
RL is learning what to do, i.e., how to map situations to actions, so as to maximize a numerical reward signal [30]. Unlike supervised learning, RL does not require labeled datasets to train neural networks. Instead, it allows agents to explore their environments and choose actions based on environmental feedback to maximize long-term gains. At each time step t, the environment provides the current state s_t from the state space S, and the agent selects an action a_t ∈ A based on its observation o_t of the state and its own policy π(a_t|o_t), where A is the action space. The environment then generates the reward r(s_t, a_t) and produces s_{t+1} for the next time step. p(s_{t+1}|s_t, a_t) is called the state transition probability. When the environment satisfies the Markov property, the state transition probability is stationary, and the process can be described as a Markov decision process (MDP). Over time, the agent collects a trajectory (s_t, a_t, r_t, s_{t+1}, a_{t+1}, ...). The return, g_t = r_t + γg_{t+1}, is defined as the accumulated discounted reward, where γ is the discount factor and g_t is the return at time step t. The ultimate goal of RL is to optimize the policy π so that E(g_t) is maximized, meaning that agents evolve toward the goals the designers expect.
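The recursive return definition above can be evaluated over a collected trajectory with a single backward pass; a minimal sketch (the function name is ours):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute g_t = r_t + gamma * g_{t+1} by iterating backward over a trajectory."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g   # recursion from the last step toward the first
        returns.append(g)
    return returns[::-1]    # restore chronological order

# Example: three unit rewards with gamma = 0.5 give returns [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```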
RL methods can be divided into value-based and policy-based methods. Q-learning is a classic value-based method. If the environment is a Markov decision process with a known model, dynamic programming can be used to solve for the state values; otherwise, Monte Carlo methods can approximate the values from sampled returns. In practice, most states are continuous variables; therefore, a parameterized state-value function, V(s_t), can be used to fit the predicted values more accurately. DQN, which approximates the action-value function with a neural network, is a well-known method of this kind.
Policy-based methods were proposed to handle continuous action spaces. They directly optimize the policy π to maximize the expected state value at each state, argmax_π E_S[V^π(S)]. A3C, TRPO, PPO, and other methods follow this basic principle while incorporating advanced optimization techniques, with PPO being especially widely used [31][32][33].
MARL has attracted more attention than single-agent reinforcement learning (SARL) in recent years. When agents cooperate, the optimization goal is to maximize the total return. Some early studies relied on optimizing independent agents to maximize the global reward function, where each agent uses a TRPO network structure and treats the other agents as part of the environment. However, this type of algorithm suffers from non-stationarity: because each agent is also part of the environment, the environmental state transition probabilities change continuously as the agents' policies change, which violates the assumptions of RL.
To address this problem, later studies mostly adopted centralized training, decentralized execution (CTDE) algorithms, which significantly improved MARL performance. Agents learn their own policy networks under the guidance of a centralized feedback network and then generate actions independently. Recent studies have applied CTDE algorithms to continuous action spaces, such as MAPPO [34] and MADDPG [35]. The two share the idea that each critic can access all agents' action information: during training, a critic that observes the global situation guides actor training, while during inference, each actor uses only its own local observation to compute actions. In MADDPG, the authors argue that in cooperative and competitive scenarios, CTDE enables agents to discover and exploit information in the environment to produce better strategies than other algorithms. Additionally, MAPPO achieves excellent performance on multiple benchmarks without significant changes to PPO.
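For reference, the clipped surrogate objective that PPO maximizes, and that MAPPO applies per agent with the centralized critic supplying the advantage estimate, is, in standard notation:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(\rho_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_t\big)\right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid o_t)}
```

Here ε is the clipping hyperparameter and Â_t the estimated advantage; clipping the probability ratio ρ_t keeps each policy update close to the old policy.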

Truck Platoon Communication Topology
Figure 1 shows the overall framework of Platoon-MAPPO. Vehicles not controlled by the truck platoon are called interactive vehicles (IVs); they may affect the driving status of the platoon by traveling in front of it. The trucks in the platoon are called platoon members (PMs), with PM_0 being the first truck and PM_N the last one. The other trucks are numbered sequentially from PM_1 to PM_{N−1} and are referred to collectively as PM_n. Platoon communication structures can be classified into three types: centralized, decentralized, and hybrid. In a centralized communication topology, each vehicle communicates only with the lead vehicle, which is easy to control but may cause high latency. In a decentralized communication topology, each vehicle receives information only from the vehicle in front, which yields low latency. This study uses a hybrid communication topology, as shown in the topology diagram: each truck receives information from both the front and rear vehicles, but not from the environmental vehicles detected by the lead vehicle. Compared with algorithms that only consider the front vehicle's driving status, this approach is more effective in optimizing the overall performance of the platoon.
This communication strategy enables the trucks in the platoon to switch positions, adjust the number of vehicles, and dynamically select the lead vehicle, enhancing the applicability and flexibility of the algorithm.
A truck platoon consists of three types of trucks: PM_0, PM_n, and PM_N. This design makes the platoon length scalable and the algorithm compatible with any number of trucks; the training process can simulate the performance of a whole platoon of any size. The scalability of the platoon size also enhances sampling efficiency, as each interaction with the simulation environment yields more information about the agents' action values. The algorithmic computation grows linearly with platoon size in both the training and application phases; therefore, with distributed computing, the computational load of each truck remains constant regardless of platoon size, which significantly reduces the overall computational burden on the platoon and avoids wasting computation when the platoon size changes.
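As an illustration of this hybrid topology, each PM's local observation can be assembled from its own state plus the states of its immediate front and rear neighbors; the specific field layout below (gap and relative speed per neighbor) is our assumption, not the paper's exact observation design:

```python
def build_observation(ego, front=None, rear=None):
    """Assemble a fixed-size observation for one platoon member.
    ego/front/rear are (position, speed) tuples; absent neighbors
    (PM_0 has no platoon member ahead, PM_N none behind) are zero-padded
    so the same network handles every position in the platoon."""
    obs = [ego[1]]  # ego speed
    for neighbor in (front, rear):
        if neighbor is None:
            obs += [0.0, 0.0]                 # padding keeps the vector length fixed
        else:
            obs += [neighbor[0] - ego[0],     # signed gap to the neighbor
                    neighbor[1] - ego[1]]     # relative speed
    return obs
```

Because the layout is position-independent, trucks can switch positions or change platoon size without changing the network input dimension, matching the flexibility claimed above.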

Crucial Elements
Platoon-MAPPO is a stochastic policy gradient algorithm that is trained centrally and executed in a distributed manner; each agent observes and acts independently. The REINFORCE method is applied to estimate the unknown action-value function, using the actual return from a collected trajectory as a substitute. PPO introduces an advantage function to stabilize training, usually defined as the difference between the action-value function and the state-value function. The state-value function represents the expected return an agent will receive in a given state, so the advantage function intuitively shows how much better the selected action is than the average of the actions available in that state. The actor, π_μ(a|o), is one neural network and the critic, V_θ(s), is another. Under the multi-agent architecture, the critic observes the global information, i.e., the observations of all agents, and predicts the expected return for all agents from these observations. (2) Action: the main application scenario of truck platoon control is the highway, where the traffic environment is relatively simple. To minimize the interference of truck platoons with human drivers, platoons travel in the rightmost lane; this article therefore focuses on the longitudinal control of truck platoons. After obtaining the observations, the decision-making algorithm determines the driving force for each truck at each time step. The driving force indicates the acceleration or deceleration of the truck: a positive driving force means the truck accelerates forward, bounded by the maximum driving force, while a negative driving force means the truck decelerates. (3) Reward: autonomous driving is typically a multi-objective optimization problem.
These objectives may encompass various aspects, such as speed, travel time, collision, regulations, energy consumption, vehicle wear and tear, and passenger experience.
For freight trucks, the main considerations are speed, energy consumption, and safety. The reward function should reflect the goals that autonomous vehicles aim to pursue, rather than the methods and techniques used to attain them. For instance, adding extra penalties for low speeds to motivate trucks to move forward is known as reward shaping. Reward shaping may facilitate learning in the early stages, but it may also constrain the performance potential of the algorithm [36]. Therefore, our reward function consists of only three components: speed, energy consumption, and safety.
The reward function is as follows: r denotes the reward at a given moment, v_PM the instantaneous speed of a platoon member, and v_IV the instantaneous speed of the IV. r_ε is added to prevent the denominator from being zero and is set to 1 × 10^−5. D is the collision penalty term, F_t is the driving force, and r_1, r_2, and r_3 are coefficients used to adjust the weight of each term. The reward function defines the objective of the truck platoon driving strategy: efficiency and safety. Efficiency means enhancing travel speed while reducing energy consumption as much as possible. We aim to keep the platoon speed close to the traffic flow speed: if the traffic flow is fast, the platoon should be incentivized to speed up; if it is slow, the platoon should not receive extra benefits from high speed, because high speed is then irrelevant and may even pose a risk. Therefore, we add a speed term to the reward function: when v_PM < v_IV, the reward is proportional to the platoon speed; when v_PM > v_IV, no additional reward is given.
To ensure that policy optimization is carried out under safe driving conditions, a collision penalty term, D, is added to the reward function. If a collision occurs, this value is 1; otherwise, it is 0.
For electric trucks, the magnitude of the driving force is almost proportional to energy consumption. Let the instantaneous energy consumption be c = c_ε F_t, where c_ε is the linear coefficient mapping driving force to energy consumption. When the driving force is zero, the instantaneous energy consumption is also zero, so the reward function contains no bias term. An energy consumption penalty term was added to the reward function, with c_ε absorbed into r_3. Algorithm 1 shows the training process of the algorithm: we used the Monte Carlo method to estimate the action-value function and calculated the advantage function A from it. ε is a hyperparameter that controls the size of the clipping term.
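Since the displayed reward equation is not reproduced here, the three components just described can be sketched as follows; the exact functional form and the coefficient values are our assumptions, chosen to be consistent with the text, not the paper's equation:

```python
def step_reward(v_pm, v_iv, collided, f_t,
                r1=1.0, r2=1.0, r3=0.01, r_eps=1e-5):
    """Hedged sketch of the three-term reward described in the text:
    a speed term capped at the traffic-flow speed, a collision penalty D,
    and an energy penalty proportional to the driving force F_t
    (c = c_eps * F_t, with c_eps absorbed into r3)."""
    speed_term = r1 * min(v_pm, v_iv) / (v_iv + r_eps)  # no extra reward when v_pm > v_iv
    collision_term = r2 * (1.0 if collided else 0.0)    # D = 1 on collision, else 0
    energy_term = r3 * abs(f_t)                         # penalize driving-force magnitude
    return speed_term - collision_term - energy_term
```

The `min(v_pm, v_iv)` cap implements the rule that the platoon gains nothing from exceeding the traffic flow speed, while r_eps keeps the denominator nonzero when the IV is stationary.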

Simulation Platform
Traffic simulation applies computer technology to mimic real traffic systems, virtually reproducing realistic traffic scenarios at low cost. RL algorithms require vehicles to learn optimal driving strategies through extensive trial and error, which makes training in real traffic environments infeasible. Therefore, we established a road simulation environment in SUMO, an open-source microscopic continuous traffic simulation platform widely used in traffic research. SUMO enables users to construct various types of road networks and define basic vehicle parameters and behaviors. It also provides interfaces for third-party programs, allowing algorithms to interact directly with the simulation environment.
To make the platoon control strategy better match the demands of realistic situations, we constructed a dynamic and variable traffic environment that reflects the conditions an actual platoon control algorithm would face.
Slopes affect the acceleration and deceleration capabilities of trucks and have a significant effect on the fuel consumption and safety performance of platoons. Studies evaluating the energy consumption of truck platooning under different slope conditions show that road slope has a considerable impact [37]. Therefore, the impact of road slope must be taken into account when devising energy-saving driving strategies for truck platooning. A normal highway slope is within 5%; thus, the road slope was set to α ∈ [−0.05, 0.05], where α denotes the tangent of the slope angle.
The interference that a platoon encounters varies with the traffic flow state. The most influential vehicle for the platoon is the interactive vehicle (IV) in front of it. The speed of an IV may fluctuate as it cruises or accesses ramps in real-world traffic, and surrounding vehicles may change speed more frequently in unstable traffic flow. These situations require the platoon to adjust accordingly; otherwise, energy consumption and safety risks may increase.
We considered two traffic flow scenarios in ramp areas. In the first, vehicles accessed ramps in front of the IV, and the IV adopted the IDM following strategy. In the second, the IV accessed a ramp in front of the platoon, and the platoon faced disturbance from vehicles that suddenly appeared. To simulate these scenarios, we assumed that the platoon could only observe ramp vehicles once they were within 50 m of the ramp entrance. Ramp vehicles entered the main road at random speeds before the platoon reached the ramp and then accelerated to the speed limit after entering the main road. We set up a 3 km road with a truck queue starting at the beginning of the road with 30 m spacing and a speed of 20 m/s. The maximum speeds of the PMs and the IV were 25 m/s and 27.6 m/s, respectively. IVs started at random positions within one hundred meters in front of the truck fleet with the same initial speed. To simulate different traffic conditions for the platoon, IVs first cruised at a stable speed, then randomly decelerated to a lower value to simulate interference from ramps, and finally accelerated to the maximum speed limit to simulate free-flowing traffic.

Longitudinal Dynamics of Trucks
Trucks have many degrees of freedom, which makes establishing an accurate dynamic model extremely complex. However, this article focuses on decision-making and planning algorithms; dynamic modeling is not its main concern, and an overly complex model may hinder the subsequent simulations. Since this article studies straight or approximately straight highway regions, we establish a simplified longitudinal truck dynamics model. The model accounts for the main factors affecting vehicle operation so that it does not deviate from reality; compared with sophisticated models, it is adequate for verifying the effectiveness of the algorithm.
An electric truck experiences the following force balance while driving: F_t = F_f + F_w + F_i + F_j, where F_t is the traction force required by the truck, composed of the rolling resistance force F_f, the air drag force F_w, the gravity (grade) force F_i, and the inertial force F_j associated with the acceleration provided by the vehicle; f_a and f_b are rolling resistance coefficients; and a_c is the acceleration of the truck.
When the truck platoon is moving, the air resistance experienced by trucks in the platoon is significantly reduced. Therefore, the actual wind resistance coefficient, C_D, of each truck must be modified. Studies [38] have shown that the overall air resistance coefficient decreases as the inter-vehicle distance decreases, and a method for modifying the air resistance coefficient, a_D, has been proposed.
When calculating a_D for PM_0, (x_{PM_0} − x_{PM_1} − l) is the gap between PM_0 and the following PM_1; when calculating a_D for PM_n, (x_{PM_{n−1}} − x_{PM_n} − l) is the gap between PM_n and its preceding truck PM_{n−1}. The parameter values in formulas (3) and (4) are shown in Table 1.
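The force balance and gap-dependent drag modification can be sketched numerically. The parameter values below are illustrative placeholders, not the values of Table 1, and the simple linear a_D rule is a stand-in for the modification method of [38]:

```python
import math

def traction_force(v, a_c, slope, gap=None, m=30000.0, g=9.81,
                   f_a=0.008, f_b=0.0001, rho=1.225, c_d=0.6, area=8.0):
    """Simplified longitudinal force balance F_t = F_f + F_w + F_i + F_j.
    v: speed [m/s]; a_c: acceleration [m/s^2]; slope: grade as a tangent value;
    gap: distance to the preceding truck [m] (None for an isolated vehicle).
    All parameter values are illustrative placeholders."""
    alpha = math.atan(slope)                                 # slope angle from tangent
    f_f = m * g * math.cos(alpha) * (f_a + f_b * v)          # rolling resistance
    a_d = 1.0 if gap is None else min(1.0, 0.5 + gap / 100.0)  # placeholder drag reduction
    f_w = 0.5 * rho * c_d * a_d * area * v ** 2              # air drag, reduced at small gaps
    f_i = m * g * math.sin(alpha)                            # grade resistance
    f_j = m * a_c                                            # inertial force
    return f_f + f_w + f_i + f_j
```

The placeholder a_D shrinks the effective drag coefficient as the gap closes, qualitatively reproducing the platooning drag benefit described above.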
The neural network comprises fully connected layers and activation layers. The actor network outputs the mean a_{t,mean} and standard deviation a_{t,std} of a normal distribution over actions. We observed that using two independent networks to output them separately improves performance; hence, we set up two identical actor networks with hidden fully connected layers of 64, 128, 128, 64, and 64 neurons. The dimensions of the first and last layers match those of the observation and the action, respectively. Furthermore, we employed the IDM [39] model and the CACC [40] control method as baselines to control the platoon behavior and compared them with the Platoon-MAPPO algorithm. IDM is a well-established traffic flow simulation model that emulates the driving behavior of human drivers; CACC is a widely used control method for autonomous vehicle following that enables cooperative driving by communicating with the preceding vehicle.
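The two-headed actor described above can be sketched as follows; the weight initialization and tanh activations are our assumptions, since the paper does not specify them:

```python
import numpy as np

def make_actor(obs_dim, act_dim, hidden=(64, 128, 128, 64, 64), seed=0):
    """Build one fully connected actor as a list of (W, b) layers with the
    hidden sizes described in the text; input/output dims match obs/action."""
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, *hidden, act_dim]
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, obs):
    """Forward pass: tanh on hidden layers, linear output."""
    x = np.asarray(obs, dtype=float)
    for w, b in layers[:-1]:
        x = np.tanh(x @ w + b)
    w, b = layers[-1]
    return x @ w + b

# Two identical, independently parameterized networks, one per output head
mean_net = make_actor(obs_dim=5, act_dim=1)           # outputs a_t,mean
std_net = make_actor(obs_dim=5, act_dim=1, seed=1)    # outputs a_t,std (pre-activation)
```

In practice the standard-deviation head would be passed through an exponential or softplus to keep it positive; that detail is also an assumption on our part.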

Results
We performed five simulation experiments with different initial random seeds. Figure 2 shows the training results. We set up a simple test scenario to intuitively demonstrate the effectiveness of the Platoon-MAPPO algorithm. The IV initially accelerates to 15 m/s. At 20 s, it decelerates to 5 m/s to simulate interference from a ramp. After 15 s, it accelerates again to 25 m/s. All vehicles have an initial speed of 10 m/s and a maximum acceleration of 1 m/s². PM_0 starts 40 m behind the IV. The platoon consists of 10 trucks with an initial intra-platoon headway of 30 m, and the test lasts 1000 s.
We plotted a spatiotemporal map for each truck, as depicted in Figure 3, which shows the positional changes of the trucks over time. We also plotted a heatmap of each truck's speed at each moment in Figure 4 to show the speed changes of all trucks in the fleet more intuitively. In the test, PM_0 observes the speed change of the IV first: when the IV changes its speed, PM_0 reacts first, and the reaction then propagates to each subsequent truck. To quantify the reaction speed of each truck, we measured the reaction time difference based on the moment when each vehicle's speed crosses 10 m/s. The reaction time differences during deceleration for IDM, CACC, and Platoon-MAPPO are 14 s, 9.2 s, and 4.1 s, respectively; the platoon reaction time of Platoon-MAPPO is significantly shorter than that of IDM and CACC.
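The reaction-time measurement used here, i.e., the first moment each vehicle's speed crosses 10 m/s, can be computed directly from logged speed traces; a minimal sketch with illustrative data:

```python
def first_crossing_time(times, speeds, threshold=10.0):
    """Return the first time at which the speed trace crosses the threshold
    (in either direction), or None if it never does."""
    for t, (s0, s1) in zip(times[1:], zip(speeds, speeds[1:])):
        # A sign change of (speed - threshold) between samples marks a crossing
        if (s0 - threshold) * (s1 - threshold) <= 0 and s0 != s1:
            return t
    return None

# Illustrative traces: a lead truck and a follower decelerating through 10 m/s
lead = first_crossing_time([0, 1, 2, 3], [12, 11, 9, 8])       # crosses at t = 2
follower = first_crossing_time([0, 1, 2, 3], [12, 12, 11, 9])  # crosses at t = 3
reaction_time_difference = follower - lead                     # 1 s
```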
Under IDM and CACC, speed changes cause traffic oscillations that propagate through the entire fleet and have a lasting impact on it. For IDM in particular, the inter-vehicle distance keeps increasing after a disturbance. Platoon-MAPPO, however, can effectively filter traffic oscillations. The heatmap shows that the rear trucks in the fleet undergo smaller speed changes than the front trucks, indicating that they can reduce energy consumption through smoother acceleration: smoother speed changes reduce the energy consumed in acceleration, and smaller headways reduce the overall air resistance of the fleet. In this test, taking the energy consumption of IDM and CACC as benchmarks, Platoon-MAPPO reduces energy consumption by 14.8% and 32.7%, respectively.
More specifically, under the same simulation duration, the last truck of Platoon-MAPPO travels the longest distance, followed by CACC and then ordinary IDM. This indicates that our proposed model has the highest overall travel speed and does not suffer significant speed reduction for the sake of energy saving.
Table 2 shows the average traction forces and their components for each control strategy in this test scenario. Traction force is an approximately linear function of energy consumption for electric trucks; therefore, we can compare energy consumption accordingly. CACC's control strategy may aim for a higher instantaneous speed, but increasing instantaneous speed often does not reduce travel time. Figure 4 shows that CACC did not eliminate traffic oscillations in this test scenario: speed oscillations at the front of the platoon propagate to the rear, causing unnecessary acceleration and deceleration of the rear trucks, which may explain its higher energy consumption. Meanwhile, Platoon-MAPPO's F_w is higher than IDM's because Platoon-MAPPO has a higher average speed; however, the platoon effectively filters traffic oscillations and exhibits the least unnecessary acceleration and deceleration, resulting in the smallest F_j.
As mentioned above, the Platoon-MAPPO algorithm can be applied to truck platoons with any number of trucks. Another significant feature of platoons is that they reduce road occupancy and improve road capacity. To verify the effect of platoon size, we repeated the previous test with the fleet length as a variable, setting the minimum number of trucks to 3 and the maximum to 20, and recorded the length of road occupied by the fleet. Figure 5 shows the distance from the head of PM_0 to the tail of PM_N at each time step.
More specifically, over the same simulation duration, the last truck of Platoon-MAPPO travels the longest distance, followed by CACC and then ordinary IDM. This indicates that our proposed model has the highest overall travel speed and does not suffer a significant speed reduction due to energy-saving needs.
The initial platoon lengths of IDM, CACC, and Platoon-MAPPO were identical. During the acceleration phase, the fleet lengths of IDM and CACC increased gradually due to the lag effect of truck acceleration. The trucks that accelerated earlier had higher speeds and covered longer distances in a given time, resulting in an increased gap between the trucks until the leading truck encountered an obstacle or a speed limit and stopped accelerating. In contrast, the platoon length of Platoon-MAPPO decreased during this phase because Platoon-MAPPO enabled each truck to have a very low reaction delay and a small safe distance under cooperative control, allowing the rear trucks to accelerate slightly more than the front trucks at low speeds until they reached an optimal gap, as shown in Figure 5. During the deceleration phase, the platoon lengths of IDM and CACC changed dramatically compared with Platoon-MAPPO's due to the lag effect. During the re-acceleration phase, the fleet lengths of IDM and CACC increased significantly as the speed increased, and this change was proportional to the number of vehicles in their platoons; more trucks led to a faster increase in platoon length (see Figure 5 for the upper and lower bounds of the shaded area). The platoon length of Platoon-MAPPO was relatively stable and increased slowly with the speed increase, and it was not significantly affected by the number of vehicles in the fleet. The higher the number of vehicles, the higher the instantaneous speed; thus, Platoon-MAPPO saved more road space. In this test scenario, Platoon-MAPPO reduced the occupied road space by averages of 43.3% and 34.2% compared with IDM and CACC, respectively, with maximum reductions of 58.2% and 35.7%, respectively.
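The road-space savings reported above can be computed at each time step as the relative difference in platoon length, measured from the head of PM_0 to the tail of PM_N. A minimal sketch follows; the truck length and gap values are hypothetical, chosen only to exercise the formula.

```python
def platoon_length(head_positions, truck_length=12.0):
    """Distance from the head of PM_0 (first entry) to the tail of PM_N
    (last entry), given front-bumper positions in descending order."""
    return head_positions[0] - (head_positions[-1] - truck_length)

def space_saved_ratio(baseline_positions, mappo_positions, truck_length=12.0):
    """Fraction of occupied road space saved relative to a baseline."""
    lb = platoon_length(baseline_positions, truck_length)
    lm = platoon_length(mappo_positions, truck_length)
    return (lb - lm) / lb

# 3 trucks of 12 m: baseline gaps of 30 m vs. tighter gaps of 9 m (hypothetical)
base = [84.0, 42.0, 0.0]    # platoon length: 84 - (0 - 12) = 96 m
mappo = [42.0, 21.0, 0.0]   # platoon length: 42 - (0 - 12) = 54 m
r = space_saved_ratio(base, mappo)
```

Averaging this ratio over all time steps and platoon sizes yields summary figures of the kind quoted above.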
We tested the Platoon-MAPPO algorithm in various dynamic traffic scenarios to further evaluate its adaptability. Based on the previous test scenario, we set the target speed for IV deceleration as a variable with the same range of values as in the training process. For quantitative analysis, we selected five values within the range, used each value as a test scenario, and measured the energy consumption. Figure 6 shows cases extracted from the test scenarios with target speeds of 5 km/h and 10 km/h for IV deceleration and road slopes of 0 and 0.03. Four curves can thus be extracted for each control strategy, drawn as gray curves with different line styles. To facilitate observation and analysis, we also plotted the mean energy consumption ratios of IDM, CACC, and Platoon-MAPPO as green, yellow, and blue curves, respectively. The horizontal axis in the figure is the number of trucks in the platoon, and the vertical axis is the energy consumption ratio. The energy consumption ratio was calculated by taking the maximum total energy consumption in the test scenario as a benchmark and dividing the other energy consumption values by this maximum. This eliminated the linear coefficient relating traction force to energy consumption.
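The normalization step described above might be sketched as follows; the consumption totals are invented purely for illustration.

```python
def energy_ratios(consumptions):
    """Normalize each strategy's total energy consumption by the maximum
    value in the scenario, cancelling the linear traction-energy factor."""
    m = max(consumptions.values())
    return {name: e / m for name, e in consumptions.items()}

# Hypothetical per-scenario totals (arbitrary units):
ratios = energy_ratios({"IDM": 9.0, "CACC": 12.0, "Platoon-MAPPO": 8.7})
# The worst strategy maps to 1.0; the others to fractions of it.
```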
Platoon-MAPPO consistently exhibited the lowest energy consumption across the scenarios. The energy consumption of all three control strategies decreased as the number of vehicles in the platoon increased. This was especially noticeable for IDM, which might be because IDM's control strategy could effectively reduce traffic oscillations in larger platoons, whereas CACC's excessive pursuit of instantaneous speed made its acceleration changes too sensitive, leading to high energy consumption. In the dynamic traffic scenarios tested, Platoon-MAPPO's energy consumption was reduced by an average of 3.3% compared with IDM, with a maximum reduction of 4.0%.
Acceleration and climbing are extremely energy-consuming for trucks. For instance, when the slope was 0.02, the truck speed was 10 m/s, and the acceleration was 0.5 m/s², the rolling resistance force, the air drag force, the gravity force, and the force providing vehicle acceleration accounted for 5.8%, 1.2%, 34.9%, and 58.1% of the total force, respectively. Therefore, air resistance has a very small effect during acceleration or climbing. When the slope and the acceleration were both zero and the speed was 20 m/s, the rolling resistance force and the air drag force accounted for 59.5% and 40.5% of the total force, respectively. In high-speed uniform motion on flat roads, air resistance has a very high share, but its absolute value is not large. Therefore, in dynamic traffic scenarios, Platoon-MAPPO's advantage over IDM may not seem obvious, but this is because the large force terms, which are hard to reduce through control strategies, mask the savings in avoidable consumption.
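The force shares quoted above come from a standard longitudinal force balance. The sketch below reproduces its form, not its exact percentages, since the paper's vehicle parameters (Table 1) are not restated here; all parameter values in the code are hypothetical.

```python
import math

def force_shares(mass, speed, accel, slope, c_r=0.006, c_d=0.6,
                 area=10.0, rho=1.225, g=9.81):
    """Shares of the longitudinal force components for a truck
    (all parameters hypothetical): rolling resistance F_f, air drag F_w,
    gravity along the slope F_i, and acceleration force F_j."""
    theta = math.atan(slope)
    f_f = mass * g * c_r * math.cos(theta)     # rolling resistance
    f_w = 0.5 * rho * c_d * area * speed ** 2  # aerodynamic drag
    f_i = mass * g * math.sin(theta)           # grade (gravity) force
    f_j = mass * accel                         # inertial force
    total = f_f + f_w + f_i + f_j
    return {name: f / total for name, f in
            {"F_f": f_f, "F_w": f_w, "F_i": f_i, "F_j": f_j}.items()}

# Climbing-and-accelerating case analogous to the one discussed above:
shares = force_shares(mass=20000.0, speed=10.0, accel=0.5, slope=0.02)
```

With hypothetical parameters the individual percentages will differ from the paper's, but the qualitative pattern holds: during acceleration on a grade, F_i and F_j dominate and F_w is marginal.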

Conclusions
Platoon-MAPPO is a truck platooning control algorithm based on MARL that can achieve efficient and safe platooning in highway ramp regions. Platoon-MAPPO adopts a centralized training and decentralized execution framework, allowing each truck to act independently without relying on a central controller and simplifying the allocation of computing resources. For its driving strategy, each truck in the platoon refers only to the driving status of the trucks immediately in front of and behind it, without requiring communication within the platoon or with the road network. This makes it easier for each truck's driving strategy to maximize the overall interests of the platoon, and this architecture greatly improves the flexibility of platoon formation compared with architectures that rely on communication and centralized computing.
According to the experimental results, Platoon-MAPPO reduced average energy consumption by 14.8% and 32.7% compared with the uncontrolled IDM model and the CACC control method, respectively. The experimental results show that this method significantly filters traffic oscillations, greatly reduces meaningless acceleration and deceleration, and maintains driving speed while reducing energy consumption. Compared with IDM and CACC, Platoon-MAPPO achieves a faster overall speed at lower energy consumption. At the same time, Platoon-MAPPO can maintain a smaller headway, which not only reduces energy consumption but also significantly reduces road occupancy: compared with IDM and CACC, Platoon-MAPPO's road occupancy is reduced by averages of 43.3% and 34.2%, and by maximums of 58.2% and 35.7%, respectively. In conclusion, Platoon-MAPPO can significantly improve the energy efficiency of truck platoons and improve road capacity.
Platoon-MAPPO benefits from a sound base model and reward function design, achieving good performance in experiments. However, like deep learning algorithms in general, RL cannot guarantee reasonable outputs for all inputs, which is one reason why learning-based algorithms have not yet been widely used in vehicle control. To address this issue, future research can proceed along two lines: using traditional methods to assist in the real-time monitoring of unsafe behavior, and improving the interpretability of RL models while training them in more comprehensive and diverse traffic scenarios to improve robustness.

(1) It uses a MAPPO-based algorithm with centralized training and decentralized execution to control the platoon in the on-ramp area. Each truck computes only its own action, avoiding the data computation delay caused by centralized calculation. (2) It considers the driving status of the trucks in front of and behind each truck, maximizing the overall platoon gain and improving the global operational efficiency. (3) It requires no communication devices and is scalable to any number of vehicles.

Figure 1 shows the overall framework of Platoon-MAPPO. The vehicles that are not controlled by the truck platoon are called interactive vehicles (IVs), which may affect the driving status of the truck platoon by traveling in front of it. The trucks in the truck platoon are called platoon members (PM).

Figure 1. Framework of Platoon-MAPPO. The platoon communication structure can be classified into three types: centralized, decentralized, and hybrid communication topologies. In a centralized communication topology, each vehicle communicates only with the lead vehicle, which is easy to control but may cause high-latency problems. In a decentralized communication topology, each vehicle receives information only from the vehicle in front, which gives low latency. This study uses a hybrid communication topology, as shown in the topology diagram. Each truck receives information from both the front and rear vehicles, but not from the environmental vehicle detected by the lead vehicle. Compared with algorithms that consider only the front vehicle's driving status, this algorithm is more effective at optimizing the overall performance of the truck platoon. This communication strategy enables the trucks in the platoon to switch positions, adjust the number of vehicles, and dynamically select the lead vehicle, enhancing the applicability and flexibility of the algorithm. A truck platoon consists of three types of trucks: PM_0, PM_n, and PM_N.

(1) Observation and state: the observation of each truck in the platoon is the part of the platoon's overall state that the truck can observe. It comprises several elements: (a) the type of the truck, which can be PM_0, PM_n, or PM_N, indicating the truck's position in the platoon; (b) the relative speed of the truck with respect to the vehicle in front of it; (c) the relative speed with respect to the vehicle behind it; (d) the relative distance to the vehicle in front; (e) the relative distance to the vehicle behind; (f) the truck's own speed; and (g) if the truck is PM_0, the relative distance and speed between itself and the IV detected within its detection range. RL algorithms are not well suited to applying batch normalization to the input values; therefore, the inputs are scaled directly to [−1, 1] according to their value ranges. The position identifiers are 0, −1, and 1, representing PM_0, PM_n, and PM_N, respectively. When the platoon consists of more than three trucks, the position identifiers of all the middle trucks, PM_n, are set to −1. The training process can accommodate any number of trucks in the platoon. State refers to the various factors that influence driving in the traffic environment. As a centrally trained algorithm, PPO needs to predict the state value function of the entire environment; therefore, the state comprises the observations of all the trucks in the platoon.
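A minimal sketch of this observation construction follows. The paper specifies the scaling to [−1, 1] and the position identifiers, but not the exact value ranges, so the ranges below are hypothetical.

```python
def scale(value, lo, hi):
    """Linearly map value from [lo, hi] to [-1, 1], clipped at the ends."""
    x = 2.0 * (value - lo) / (hi - lo) - 1.0
    return max(-1.0, min(1.0, x))

# Position identifiers as stated in the text: PM_0 -> 0, PM_n -> -1, PM_N -> 1.
POSITION_ID = {"PM_0": 0.0, "PM_n": -1.0, "PM_N": 1.0}

def build_observation(role, dv_front, dv_rear, gap_front, gap_rear, speed):
    """Per-truck observation vector; the ranges below are assumptions."""
    return [
        POSITION_ID[role],
        scale(dv_front, -20.0, 20.0),  # relative speed to the front vehicle
        scale(dv_rear, -20.0, 20.0),   # relative speed to the rear vehicle
        scale(gap_front, 0.0, 100.0),  # distance to the front vehicle
        scale(gap_rear, 0.0, 100.0),   # distance to the rear vehicle
        scale(speed, 0.0, 30.0),       # ego speed
    ]

# Middle truck, 30 m gaps, closing on the rear vehicle at 5 m/s:
obs = build_observation("PM_n", 0.0, -5.0, 30.0, 30.0, 15.0)
```

For PM_0, the IV's relative distance and speed would be appended with the same kind of range scaling; the state for the critic concatenates all trucks' observations.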

Figure 2. Return curve. We set up a simple test scenario to intuitively demonstrate the effectiveness of the Platoon-MAPPO algorithm. The IV initially accelerates to 15 m/s. At 20 s, it decelerates to 5 m/s to simulate interference from a ramp. After 15 s, it accelerates again to 25 m/s. All vehicles have an initial speed of 10 m/s and a maximum acceleration of 1 m/s². PM_0 is 40 m away from the IV. The platoon consists of 10 trucks, and the initial headway within the platoon is 30 m. The test lasts for 1000 s. We plotted a spatiotemporal map for each truck, as depicted in Figure 3, which shows the positional changes of the trucks over time. We also plotted a heatmap of each truck's speed at each moment in Figure 4, to show the speed changes of all trucks in the platoon.
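The interactive vehicle's target-speed schedule in this test can be written as a simple piecewise function (the phase boundaries follow the description above):

```python
def iv_target_speed(t):
    """Target speed (m/s) of the interactive vehicle over the test:
    reach 15 m/s initially, drop to 5 m/s at t = 20 s to emulate ramp
    interference, then re-accelerate to 25 m/s 15 s later."""
    if t < 20.0:
        return 15.0
    elif t < 35.0:
        return 5.0
    else:
        return 25.0
```

Actual speeds then track these targets subject to the 1 m/s² acceleration limit.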

Figure 3. Spatiotemporal map for each truck.


Figure 5. Platoon length. Green, orange, and purple represent the IDM, CACC, and Platoon-MAPPO control strategies, respectively. The shaded area represents the fleet length when there are 3 to 20 trucks. The dark line in the middle represents the average value. The two red lines represent the ratios of road space saved by Platoon-MAPPO compared with IDM and CACC at each time step.

Figure 6. Energy consumption of the platoon.

Send a_t to the simulation environment and obtain [r_t, s_{t+1}, o_{t+1}]. Save [s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}]. Compute the advantage function A_t and the return R_t.
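The excerpt above does not spell out how A_t and R_t are computed; a common choice in PPO-style training is generalized advantage estimation (GAE), sketched below with hypothetical hyperparameters.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: a common way to compute the
    advantage A_t and return R_t for PPO (hyperparameters hypothetical)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    # Returns serve as regression targets for the centralized critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# With gamma = lam = 1, returns reduce to plain reward-to-go sums:
adv, ret = compute_gae([1.0, 1.0], [0.5, 0.5], 0.0, gamma=1.0, lam=1.0)
# ret == [2.0, 1.0]
```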

Table 1. Truck longitudinal dynamics parameter settings.
Table 2 column headers: F_t (N), F_w (N), F_j (N), F_f (N).