Article

Co-Optimization of Cooperative Adaptive Cruise Control and Energy Management for Plug-in Hybrid Electric Truck Platoons

1 Guangxi Research Institute of Mechanical Industry Co., Ltd., Nanning 530009, China
2 School of Light Industry and Food Engineering, Guangxi University, Nanning 530004, China
3 School of Mechanical Engineering, Guangxi University, Nanning 530004, China
* Authors to whom correspondence should be addressed.
Energies 2026, 19(4), 935; https://doi.org/10.3390/en19040935
Submission received: 22 December 2025 / Revised: 10 January 2026 / Accepted: 15 January 2026 / Published: 11 February 2026

Abstract

To optimize fuel economy for platooning plug-in hybrid electric trucks, this paper proposes a co-optimization framework that integrates cooperative adaptive cruise control and energy management to enhance driving safety and fuel efficiency in complex traffic environments. The control strategy is divided into two layers: in the upper layer, a cooperative adaptive cruise control model based on distributed model predictive control (DMPC) achieves stable vehicle following and appropriate inter-vehicle spacing, improving overall platoon efficiency. In the lower layer, a distributional soft actor-critic (DSAC) algorithm performs fine-grained power distribution for the plug-in hybrid electric trucks, enabling efficient energy utilization. The results demonstrate that this strategy significantly enhances the fuel economy and vehicle-following performance of plug-in hybrid truck platoons. Compared with the classical deep deterministic policy gradient (DDPG) algorithm, the DSAC-based energy management strategy also offers higher computational efficiency.

1. Introduction

With continued global economic growth, concerns over environmental pollution and energy use have intensified [1]. In response, hybrid electric vehicles (HEVs) have been widely recognized as an effective approach to curb vehicle fuel consumption and emissions [2]. By combining an internal combustion engine with electric propulsion, HEVs can lower greenhouse gas emissions and regulated pollutants relative to conventional vehicles [3]. Additionally, in contrast to pure electric vehicles, HEVs offer longer driving ranges, effectively addressing the issue of limited battery capacity [4]. However, since HEVs are equipped with multiple power sources, such as engines and electric motors, designing an efficient energy management strategy is crucial for maximizing their performance [5].
Current studies on energy management for plug-in hybrid electric trucks generally focus on two classes of approaches, namely rule-based strategies and optimization-based strategies [6]. Rule-based energy management strategies rely on predefined rules and logic derived from practical experience or heuristics to dynamically control the vehicle’s power sources [7]. Examples include state machine logic controllers and fuzzy logic controllers [8]. Although rule-based strategies offer good real-time performance and stability, they often lack flexibility and have limitations in optimizing performance [9]. On the other hand, optimization-based energy management strategies involve setting control objectives and constraints to search for the optimal control strategy [10]. Examples of such strategies include dynamic programming (DP) [11], model predictive control (MPC) [12], equivalent consumption minimization strategy (ECMS) [13], genetic algorithms (GA) [14], and game theory (GT) [15]. DP provides a globally optimal solution when the entire driving cycle is known a priori, and it is commonly adopted as a reference benchmark for evaluating energy management performance [16]. MPC and ECMS are typical real-time optimization strategies that are widely applied in online energy management [17]. ECMS defines the conversion relationship between electrical energy and fuel consumption through the use of equivalence factors, making the selection of these factors crucial. MPC exhibits robust performance, facilitating enhanced control capabilities for hybrid electric vehicles (HEVs) in complex environments.
Zeng et al. [18] developed a stochastic model predictive control (SMPC)-based energy management method that integrates road grade information into the optimization formulation. By constructing stochastic models for fuel consumption and battery state of charge (SOC), their simulations showed improved fuel economy. Zhang et al. [19] proposed an MPC-based scheme for intelligent plug-in HEVs, where a micro-traffic flow analysis model was used to enhance adaptability across diverse driving conditions. More recently, driven by advances in artificial intelligence, HEV energy management has increasingly leveraged data-driven techniques, including deep learning [20], reinforcement learning [21], and deep reinforcement learning [22]. Reinforcement learning is a learning mechanism that maximizes cumulative rewards by interacting with the environment, without the need for pre-labeled data, and optimizes itself through exploration and feedback from the environment [23]. Deep reinforcement learning (DRL) combines deep neural networks with reinforcement learning to learn policies/value functions in high-dimensional environments [24]. He et al. [25] proposed a hierarchical DRL-based energy management method: DDPG in the upper layer plans the SOC reference, while a DNN-based MPC in the lower layer performs power allocation, achieving 98.61% of the global-optimal fuel economy. Qi et al. [26] proposed an uncertainty-aware reinforcement learning–based energy management strategy, where actions are selected using a distributed uncertainty function. This design improves adaptability under complex driving conditions and leads to better fuel economy. Owing to their strong learning capability and adaptability, deep reinforcement learning methods have therefore attracted increasing attention in energy management research.
Energy management for hybrid electric vehicles is transitioning from traditional rule-based designs to more intelligent control frameworks [27]. However, existing strategies generally do not account for the driving environment [28]. Energy management performance is highly sensitive to the complexity and uncertainty of real-world driving conditions [29]. To better align powertrain operation with the surrounding traffic environment, integrating eco-driving with energy management is crucial for unlocking additional energy-saving potential under complex driving conditions [30]. In particular, jointly optimizing adaptive cruise control (ACC) and energy management enables coordinated decision-making beyond independently designed controllers [31]. Zhang et al. [32] proposed a framework for adaptive cruise control and energy management systems, utilizing deep neural networks to address upper-level following-distance planning and lower-level power distribution separately, achieving significant reductions in vehicle energy consumption. Xie et al. [33] proposed an integrated model predictive control (IMPC) framework that unifies power management and adaptive speed control, jointly optimizing SOC and vehicle speed to improve both fuel economy and driving safety. By contrast, in conventional designs, adaptive cruise control primarily targets longitudinal following performance, whereas energy management is executed separately to allocate power-source torque according to the commanded acceleration [34]. This separated approach fails to fully exploit the potential of joint control between adaptive cruise control and energy management strategies [35]. Khayyam et al. [36] proposed an adaptive cruise control model based on an adaptive neuro-fuzzy inference system, which accounts for factors such as aerodynamic drag, road gradient, and rolling resistance, achieving a 3% reduction in fuel consumption. Zhang et al. [37] developed a multi-objective integrated MPC-based adaptive cruise control scheme that enhances both tracking performance and collision avoidance. Peng et al. [38] proposed a heterogeneous multi-agent deep reinforcement learning method to enable coordinated optimization between adaptive cruise control and energy management; their approach further improved computational performance by 10% through prioritized experience replay. By accounting for the coupling between ACC and energy management, integrated control strategies can be developed that more precisely match the operating states and efficiencies of the power sources, avoiding unnecessary stop-start cycles and sudden accelerations, which are highly energy-consuming driving behaviors [39]. The significance of co-optimizing adaptive cruise control and energy management thus lies not only in improving each subsystem individually but also in optimizing overall vehicle energy efficiency and driving experience through their coordinated operation.
Recent years have also witnessed increasing interest in integrating model predictive control (MPC) with deep reinforcement learning (DRL) to exploit their complementary strengths. In such hybrid designs, MPC is often used to explicitly handle constraints and provide receding-horizon optimization, while DRL is leveraged to enhance adaptability, learn complex mappings, or improve decision-making under uncertainty. This integration logic has been explored in other energy-related control problems [40], and related MPC–DRL paradigms have also been reported in vehicle and energy management contexts [41,42,43]. While these studies demonstrate the potential of MPC–DRL integration, the coupling architectures and problem formulations vary considerably, and hierarchical or co-optimization ACC–EM frameworks have already been reported in the literature as well. However, for plug-in hybrid electric truck (PHET) platoons, directly transferring existing designs can be nontrivial because platooning introduces stringent requirements on safety distance keeping, ride comfort, and stability under inter-vehicle coupling, and may also operate under limited information exchange and heterogeneous vehicle/powertrain characteristics. In addition, the traction demand of a truck platoon is strongly time-varying, which increases the difficulty of simultaneously achieving high-quality following performance and robust fuel-economy benefits.
Motivated by these platoon-specific challenges, the proposed method adopts a distributed hierarchical decomposition tailored to PHET platoons: the upper-layer DMPC focuses on safety- and comfort-oriented cooperative speed planning to shape the traction power demand, and the lower-layer DSAC performs continuous power-split optimization under the planned demand and powertrain constraints. This design is particularly suitable for the formulated problem because (i) platoon motion planning naturally requires constraint handling and coordination, which aligns with MPC, and (ii) energy management faces strongly time-varying demand and constraint-induced penalties, where distributional RL can improve learning robustness and convergence behavior.
To fully exploit the energy-saving potential of coordinated longitudinal control and power management in plug-in hybrid electric truck platoons, this paper proposes a co-optimization framework that jointly designs cooperative adaptive cruise control and energy management to improve both driving safety and fuel economy. The main contributions of this paper are as follows:
(1)
A distributed hierarchical co-optimization architecture for PHET platoons. We develop a DMPC–DSAC cooperative framework that decouples safety/comfort-constrained platoon motion planning from power-split optimization, enabling coordinated cruise control and energy management without requiring centralized data aggregation.
(2)
A distributed MPC layer tailored to platoon motion quality and coordination. The upper-layer DMPC explicitly targets stability and ride comfort while planning the platoon speed trajectory, providing a coordination-aware motion demand profile that supports smooth following and efficient operation under varying driving conditions.
(3)
A distributional RL-based energy management layer with improved learning robustness. The lower-layer DSAC performs continuous power allocation under the planned motion demand and powertrain constraints. By adopting distributional value learning within an entropy-regularized framework, the proposed strategy emphasizes convergence efficiency and training stability, and its effectiveness is validated through comparisons with representative benchmarks across different driving cycles.
The structure of this paper is organized as follows: Section 2 introduces the platoon model and powertrain model of plug-in hybrid electric trucks. Section 3 presents the co-optimization strategy for cooperative adaptive cruise control and energy management, including the cooperative adaptive cruise control model based on DMPC and the energy management strategy based on DSAC. Section 4 provides the simulation results and related discussions. Finally, Section 5 concludes the paper.

2. Modeling of Plug-in Hybrid Electric Truck Systems

As shown in Figure 1, the plug-in hybrid electric trucks form a platoon and exchange information through vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. The entire platoon consists of n + 1 trucks, with the first truck serving as the lead vehicle and the subsequent n trucks as following vehicles. To ensure safe and efficient operation of the platoon, this paper presents a cooperative adaptive cruise control model based on a distributed model predictive control (DMPC) algorithm. The following vehicles track the lead vehicle while maintaining an appropriate distance, thereby enhancing the overall throughput of the platoon.

2.1. Truck Platoon Model

To better capture platoon dynamics, we formulate a nonlinear vehicle model for each truck. Assuming there are N trucks in the platoon, the nonlinear longitudinal dynamics of the $i$-th truck at time $t+1$ are as follows:
$$\begin{aligned} s_i(t+1) &= s_i(t) + v_i(t)\,\Delta t, \\ v_i(t+1) &= v_i(t) + \frac{1}{m_i}\left[\frac{\eta_i}{R_i}T_i(t) - \frac{C_D\,\rho\, A_i}{2}v_i^2(t) - m_i g f_i\right]\Delta t, \quad i \in \mathcal{N}, \\ T_i(t+1) &= T_i(t) - \frac{1}{\mu_i}T_i(t)\,\Delta t + \frac{1}{\mu_i}\hat{T}_i(t)\,\Delta t, \end{aligned}$$
where $s_i(t)$ is the displacement of the $i$-th truck at time $t$, $v_i(t)$ is its speed, $T_i(t)$ is its actual torque, $\Delta t$ is the sampling time, $\eta_i$ is the mechanical transmission efficiency, $R_i$ is the wheel radius, $C_D$ is the air drag coefficient, $\rho$ is the air density, $A_i$ is the frontal area, $m_i$ is the vehicle mass, $g$ is the gravitational acceleration, $f_i$ is the rolling resistance coefficient, $\mu_i$ is the longitudinal dynamic time-delay constant, and $\hat{T}_i(t)$ is the desired torque.
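As a concrete illustration, one step of this discrete update can be sketched numerically; all parameter values below (mass, drag area, time constants, etc.) are illustrative assumptions, not the paper's calibrated truck parameters:

```python
def truck_step(s, v, T, T_des, dt=0.1,
               eta=0.95, R=0.5, m=10000.0, CD=0.6, rho=1.206,
               A=7.5, g=9.81, f=0.01, mu=0.4):
    """One step of the nonlinear longitudinal dynamics.

    s, v, T : position [m], speed [m/s], actual torque [Nm]
    T_des   : desired (commanded) torque [Nm]
    mu      : longitudinal time-delay constant [s]
    All parameter values are illustrative placeholders.
    """
    s_next = s + v * dt
    drive = eta / R * T                  # traction force from torque
    drag = 0.5 * CD * rho * A * v ** 2   # aerodynamic drag
    roll = m * g * f                     # rolling resistance
    v_next = v + (drive - drag - roll) * dt / m
    T_next = T + (T_des - T) * dt / mu   # first-order torque lag
    return s_next, v_next, T_next
```

The first-order torque lag means the actual torque approaches the commanded torque with time constant mu, matching the third equation above.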
This model uses the vehicle's position, speed, and torque as state variables, $x_i(t) = [s_i(t), v_i(t), T_i(t)]^T$, and the desired torque as the control variable, $u_i(t) = \hat{T}_i(t)$, subject to the constraint $T_{\min} \le u_i(t) \le T_{\max}$. Therefore, Equation (1) can be simplified to:
$$x_i(t+1) = A_i(x_i(t)) + B_i u_i(t), \qquad y_i(t) = C x_i(t)$$
where
$$A_i(x_i(t)) = \begin{bmatrix} s_i(t) + v_i(t)\,\Delta t \\ v_i(t) + \dfrac{\Delta t}{m_i}\left[\dfrac{\eta_i}{R_i}T_i(t) - \dfrac{C_D\,\rho\, A_i}{2}v_i^2(t) - m_i g f_i\right] \\ T_i(t) - \dfrac{1}{\mu_i}T_i(t)\,\Delta t \end{bmatrix}, \quad B_i = \begin{bmatrix} 0 \\ 0 \\ \dfrac{\Delta t}{\mu_i} \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$
Let the state vector of the vehicle be denoted as X ( t ) , the output vector as Y ( t ) , and the input vector as U ( t ) :
$$X(t) = \left[x_1^T, x_2^T, \ldots, x_N^T\right]^T, \qquad Y(t) = \left[y_1^T, y_2^T, \ldots, y_N^T\right]^T, \qquad U(t) = \left[u_1^T, u_2^T, \ldots, u_N^T\right]^T$$
The discrete nonlinear dynamic model of the entire vehicle platoon is as follows:
$$X(t+1) = A(X(t)) + B\, U(t), \qquad Y(t) = C\, X(t)$$
To enforce the desired inter-vehicle spacing, the desired position of the vehicle is defined as:
$$s_{i,des}(t) = s_0(t) - i\, d_0$$
where $d_0$ represents the desired distance between adjacent vehicles.
The speed tracking error of the vehicles within the platoon is:
$$e_{v,i}(t) = v_i(t) - v_0(t)$$
The inter-vehicle distance tracking error is:
$$e_{s,i}(t) = s_i(t) - s_{i,des}(t)$$
For safe and stable platooning, the vehicle speed and inter-vehicle spacing are required to satisfy:
$$\lim_{t \to \infty} e_{v,i}(t) = 0, \qquad \lim_{t \to \infty} e_{s,i}(t) = 0$$

2.2. Plug-in Hybrid Electric Truck Powertrain Model

This study considers a single-axle parallel plug-in hybrid electric truck, whose powertrain architecture is illustrated in Figure 2. The system comprises an engine, traction motor, battery pack, transmission, and supervisory controller. Key vehicle parameters and powertrain specifications are summarized in Table 1.
Based on the longitudinal dynamics of the vehicle, the driving resistance of the vehicle is given by:
$$F_r = F_f + F_w + F_g + F_a, \qquad \begin{cases} F_f = m g f \cos\alpha \\[2pt] F_w = \dfrac{C_D A u^2}{21.15} \\[2pt] F_g = m g \sin\alpha \\[2pt] F_a = \delta m \dfrac{du}{dt} \end{cases}$$
where $F_f$ is the rolling resistance, $F_w$ is the air resistance, $F_g$ is the gradient resistance, $F_a$ is the inertial resistance, $f$ is the rolling resistance coefficient, $\alpha$ is the road gradient, $C_D$ is the air drag coefficient, $A$ is the frontal area, $\delta$ is the rotational mass conversion factor, $m$ is the vehicle mass, $g$ is the gravitational acceleration, and $u$ is the vehicle speed.
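A minimal sketch of this resistance computation, assuming the speed u is given in km/h (implied by the 21.15 factor) and using illustrative parameter values:

```python
import math

def driving_resistance(m, u_kmh, du_dt, CD=0.6, A=7.5,
                       f=0.01, alpha=0.0, delta=1.1, g=9.81):
    """Total driving resistance F_r = F_f + F_w + F_g + F_a.

    u_kmh is in km/h, as implied by the 21.15 factor in the drag term.
    Parameter values here are illustrative, not the paper's.
    """
    F_f = m * g * f * math.cos(alpha)   # rolling resistance [N]
    F_w = CD * A * u_kmh ** 2 / 21.15   # aerodynamic drag [N]
    F_g = m * g * math.sin(alpha)       # gradient resistance [N]
    F_a = delta * m * du_dt             # inertial resistance [N]
    return F_f + F_w + F_g + F_a
```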
The required torque of the vehicle is given by:
$$T_{req} = \begin{cases} \dfrac{F_r R}{i\,\eta}, & F_r > 0 \\[4pt] \dfrac{F_r R\,\eta}{i}, & F_r \le 0 \end{cases}$$
where $R$ is the wheel radius, $i$ is the transmission gear ratio, and $\eta$ is the mechanical transmission efficiency.
The required rotational speed of the vehicle is given by:
$$n_{req} = \frac{u\, i}{R}$$
The required power of the vehicle is given by:
$$P_{req} = T_{req}\, n_{req} = P_e + P_m$$
where $P_e$ is the engine power and $P_m$ is the motor power.
The fuel consumption map of the engine is shown in Figure 3. The map (fuel consumption rate, g/kWh) is derived from steady-state engine dynamometer bench test data and implemented as a 2-D lookup table with interpolation in simulation. Based on the engine’s characteristic curve, the fuel consumption rate of the engine can be expressed as:
$$b_e = f(n_e, T_e)$$
where $n_e$ is the engine speed and $T_e$ is the engine torque.
The instantaneous fuel consumption of the engine can be expressed as:
$$\dot{m}_{fuel} = \frac{P_e\, b_e}{3600}, \qquad P_e = \frac{T_e\, n_e}{9.55}$$
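For illustration, the instantaneous fuel rate can be computed from a map value b_e as below; note that this sketch assumes engine power in kW (dividing by 9550), so that b_e in g/kWh directly yields grams per second — a unit-consistency assumption, since the 9.55 factor in the text yields watts:

```python
def engine_fuel_rate(T_e, n_e, b_e):
    """Instantaneous engine fuel rate.

    T_e : engine torque [Nm]
    n_e : engine speed [rpm]
    b_e : brake-specific fuel consumption from the map [g/kWh]

    Assumption: P_e = T_e * n_e / 9550 gives power in kW, so that
    P_e [kW] * b_e [g/kWh] / 3600 yields grams per second.
    """
    P_e_kW = T_e * n_e / 9550.0
    return P_e_kW * b_e / 3600.0
```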
The efficiency map of the motor is shown in Figure 4. The efficiency of the motor can be expressed as:
$$\eta_m = f(n_m, T_m)$$
where $n_m$ is the motor speed and $T_m$ is the motor torque.
The relationship between the motor power and the battery power can be expressed as:
$$P_m = \frac{T_m\, n_m}{9.55} = \begin{cases} P_{bat}\, \eta_m, & \text{motor mode} \\ P_{bat} / \eta_m, & \text{generator mode} \end{cases}$$
where $P_{bat}$ represents the battery power.
An equivalent internal-resistance model is adopted for the battery, which can be written as:
$$I_{bat} = \frac{V_{oc} - \sqrt{V_{oc}^2 - 4 R_{bat} P_{bat}}}{2 R_{bat}}$$
$$SOC(k) = SOC_0 - \frac{\int_0^k I_{bat}(t)\, dt}{Q_{bat}}$$
where $I_{bat}$ represents the battery current, $V_{oc}$ is the open-circuit voltage of the battery, $R_{bat}$ is the internal resistance of the battery, $SOC(k)$ is the state of charge at the current time, $SOC_0$ is the initial state of charge, and $Q_{bat}$ is the battery capacity.
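A numerical sketch of the internal-resistance model and the coulomb-counting SOC update; the values of V_oc, R_bat, and Q_bat are placeholders, and a constant open-circuit voltage is assumed:

```python
import math

def battery_step(P_bat, soc, V_oc=600.0, R_bat=0.1,
                 Q_bat_Ah=100.0, dt=1.0):
    """Equivalent internal-resistance battery model.

    P_bat > 0 means discharging, P_bat < 0 means charging.
    V_oc [V], R_bat [ohm], Q_bat_Ah [Ah] are illustrative values.
    Returns (I_bat [A], updated SOC).
    """
    disc = V_oc ** 2 - 4.0 * R_bat * P_bat
    if disc < 0:
        raise ValueError("Requested power exceeds battery capability")
    I_bat = (V_oc - math.sqrt(disc)) / (2.0 * R_bat)
    # Coulomb counting; capacity converted from Ah to As (coulombs)
    soc_next = soc - I_bat * dt / (Q_bat_Ah * 3600.0)
    return I_bat, soc_next
```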

3. Cooperative Optimization Strategy for Adaptive Cruise Control and Energy Management

3.1. Energy Management Framework

Figure 5 presents the proposed cooperative co-optimization framework for the PHET platoon. The upper layer employs a DMPC-based cooperative adaptive cruise control scheme to generate platoon speed profiles, with objectives emphasizing stability and ride comfort to support safe and efficient operation. The lower layer adopts a DSAC–based energy management policy, which performs power-split optimization using the upper-layer planned vehicle states, thereby improving the platoon’s fuel economy. In this hierarchical scheme, the interaction between the two layers follows a one-way coupling. The DMPC layer outputs a planned speed trajectory over the prediction horizon, which determines the corresponding traction power demand profile. The DSAC layer then performs power-split decisions under this demand profile and the powertrain constraints to optimize fuel economy. In the current implementation, the DSAC outputs are not fed back to the DMPC layer for replanning; thus, the coupling effect is mainly reflected through the DMPC-shaped driving demand that DSAC responds to. Moreover, to verify the effectiveness of the proposed DSAC-based energy management module, DP and DDPG are adopted as benchmark methods for comparison. The proposed coordination is distributed rather than centralized. The DMPC layer relies on local onboard measurements and limited information exchange with neighboring vehicles, avoiding global data aggregation by a central coordinator. The DSAC-based energy management is executed locally on each vehicle based on the planned motion information and local powertrain states.

3.2. Cooperative Adaptive Cruise Control Model Based on Distributed Model Predictive Control

In this paper, DMPC [46] is used for platoon speed planning by decomposing the global optimization problem into a set of coupled local subproblems, one for each vehicle. A predecessor-following communication topology is used to facilitate information exchange between adjacent vehicles, enabling the optimization problem for each vehicle in the platoon to be solved.
The communication topology of the platoon is represented by a directed graph $G = \{V, E\}$, where $V = \{0, 1, 2, \ldots, N\}$ and $E \subseteq V \times V$. $G$ is characterized by three matrices: the adjacency matrix $A$, the Laplacian matrix $L$, and the pinning matrix $P$.
A is used to represent the communication direction between vehicles within the platoon, such that:
$$A = [a_{i,j}] \in \mathbb{R}^{N \times N}$$
where $a_{i,j} = \begin{cases} 1, & (j,i) \in E \\ 0, & (j,i) \notin E \end{cases}$.
The Laplacian matrix L can be expressed as:
$$L = D - A$$
where the in-degree matrix $D$ is given by $D = \mathrm{diag}\{\deg_1, \deg_2, \ldots, \deg_N\}$.
The pinning matrix P can be represented as:
$$P = \mathrm{diag}\{p_1, p_2, \ldots, p_N\}$$
where $p_i = \begin{cases} 1, & (0,i) \in E \\ 0, & (0,i) \notin E \end{cases}$.
The set of information received by vehicle i from the leading vehicle is represented as:
$$\mathbb{P}_i = \begin{cases} \{0\}, & \text{if } p_i = 1 \\ \varnothing, & \text{if } p_i = 0 \end{cases}$$
The set of information received by vehicle i from the neighboring vehicle j is represented as:
$$\mathbb{N}_i = \left\{\, j \mid a_{i,j} = 1,\; j \in \{1, \ldots, N\} \,\right\}$$
The set of information received by vehicle i from all vehicles in the platoon is represented as:
$$\mathbb{I}_i = \mathbb{N}_i \cup \mathbb{P}_i$$
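For a predecessor-following topology in which each follower listens only to the vehicle directly ahead and only the first follower is pinned to the leader, the matrices can be constructed as follows (this specific listening/pinning pattern is an assumption for illustration):

```python
import numpy as np

def predecessor_following_topology(N):
    """Build A, D, L, P for N followers in a predecessor-following
    topology: follower i receives information only from vehicle i-1,
    and vehicle 0 (the leader) is represented through the pinning
    matrix rather than the adjacency matrix.
    Followers 1..N are mapped to array indices 0..N-1.
    """
    A = np.zeros((N, N))
    for i in range(1, N):       # follower i+1 listens to follower i
        A[i, i - 1] = 1.0
    D = np.diag(A.sum(axis=1))  # in-degree matrix
    L = D - A                   # Laplacian matrix
    P = np.zeros((N, N))
    P[0, 0] = 1.0               # only follower 1 is pinned to the leader
    return A, D, L, P
```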
The distributed model predictive control (DMPC) method aims to achieve a globally optimal objective through local optimization of each subsystem. Each subsystem minimizes its local tracking error over the prediction horizon $[t, t+N_p]$, while also accounting for driving comfort and local performance. Therefore, a multi-objective cost function is designed in this paper as follows:
$$\begin{aligned} J_{1,i}(k \mid t) &= \left\| y_i^p(k \mid t) - y_{i,des}(k \mid t) \right\|_{Q_i}^2 \\ J_{2,i}(k \mid t) &= \sum_{j \in \mathbb{N}_i} \left\| y_i^p(k \mid t) - y_j^a(k \mid t) - d_{j,i} \right\|_{G_i}^2 \\ J_{3,i}(k \mid t) &= \left\| u_i^p(k \mid t) - h_i(v_i^p(k \mid t)) \right\|_{R_i}^2 \\ J_{4,i}(k \mid t) &= \left\| y_i^p(k \mid t) - y_i^a(k \mid t) \right\|_{F_i}^2 \end{aligned}$$
where $J_{1,i}$ and $J_{2,i}$ are the tracking-error cost functions, $J_{3,i}$ is the driving-comfort cost function, $J_{4,i}$ is the inter-vehicle communication-stability cost function, $Q_i$ is the error weighting matrix between the following vehicle and the leader, $G_i$ is the error weighting matrix between the following vehicle and its neighbors, $R_i$ is the driving-comfort weighting matrix, $F_i$ is the communication-stability weighting matrix, $d_{j,i}$ is the desired spacing deviation between adjacent vehicles, $y_i^p(k \mid t)$ is the predicted output sequence of the vehicle, $y_{i,des}(k \mid t)$ is the desired state sequence derived from the leader, and $y_i^a(k \mid t)$ is the assumed (previously broadcast) output sequence of the vehicle.
The optimization problem of the i -th vehicle in the platoon at time t can be expressed as:
$$\begin{aligned} \min_{U_i}\; & J_i = \sum_{k=0}^{N_p-1} \big( J_{1,i}(k \mid t) + J_{2,i}(k \mid t) + J_{3,i}(k \mid t) + J_{4,i}(k \mid t) \big) \\ \text{s.t.}\; & x_i^p(k+1 \mid t) = A_i(x_i^p(k \mid t)) + B_i u_i^p(k \mid t), \\ & y_i^p(k \mid t) = C x_i^p(k \mid t), \\ & x_i^p(0 \mid t) = x_i(t), \\ & T_{\min} \le u_i^p(k \mid t) \le T_{\max}, \\ & y_i^p(N_p \mid t) = \frac{1}{|\mathbb{I}_i|} \sum_{j \in \mathbb{I}_i} \big( y_j^a(N_p \mid t) + d_{j,i} \big), \\ & T_i^p(N_p \mid t) = h_i(v_i^p(N_p \mid t)) \end{aligned}$$
where $U_i = \left[u_i^p(0 \mid t), u_i^p(1 \mid t), \ldots, u_i^p(N_p-1 \mid t)\right]^T$ is the control sequence to be optimized, and $d_{j,i} = \left[(j-i)d_0,\; 0\right]^T$ is the desired spacing deviation between adjacent vehicles.
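The stage cost J_1 through J_4 can be sketched as below, with scalar weights standing in for the weighting matrices Q_i, G_i, R_i, F_i (the names and shapes here are illustrative simplifications of the formulation above):

```python
import numpy as np

def stage_cost(y_pred, y_des, y_assumed_self, neighbors, u_pred, h_v,
               Q=1.0, G=1.0, R=0.1, F=1.0):
    """One stage of the DMPC objective for vehicle i.

    y_pred         : predicted output of vehicle i at step k
    y_des          : desired output derived from the leader
    y_assumed_self : vehicle i's previously broadcast (assumed) output
    neighbors      : list of (y_j_assumed, d_ji) pairs
    u_pred, h_v    : predicted control input and its comfort reference
    Scalar weights Q, G, R, F stand in for the weighting matrices.
    """
    J1 = Q * np.sum((y_pred - y_des) ** 2)          # leader tracking
    J2 = sum(G * np.sum((y_pred - y_j - d_ji) ** 2) # neighbor tracking
             for y_j, d_ji in neighbors)
    J3 = R * (u_pred - h_v) ** 2                    # driving comfort
    J4 = F * np.sum((y_pred - y_assumed_self) ** 2) # broadcast consistency
    return J1 + J2 + J3 + J4
```

When the vehicle is exactly on its desired trajectory, consistent with its broadcast plan and neighbor spacing, all four terms vanish.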

3.3. Energy Management Strategy Based on Distributional Soft Actor-Critic

3.3.1. DSAC Algorithm

DSAC [47] combines distributional reinforcement learning with the soft actor–critic framework in an off-policy manner, leading to more stable learning and enhanced exploration. Instead of estimating the Q-function as a single expected value, DSAC applies a distributional Bellman update to learn the full distribution of Q-values, which allows it to better handle noise and uncertainty in the environment. By combining entropy regularization with distributional Q-function estimation, DSAC improves its exploration ability, enabling superior performance in complex environments.
In this work, DSAC is selected mainly considering the characteristics of the PHET power-split problem, which involves a continuous action space, strongly time-varying power demand, and potentially noisy/uncertain dynamics induced by different driving cycles and operating constraints. Compared with widely used alternatives, DSAC offers several practical advantages in this context. First, PPO is an on-policy method, which typically requires substantially more interaction samples to reach comparable performance, making it less suitable for training energy management policies that rely on extensive environment rollouts. Second, TD3 and conventional SAC are expectation-based value methods, whereas DSAC explicitly learns the return (Q-value) distribution, which can provide more robust value estimation under fluctuating rewards and constraint-induced penalties that are common in energy management tasks. Consequently, DSAC tends to exhibit improved training stability and faster convergence. Finally, although the fuel economy improvement over a strong off-policy baseline such as DDPG may appear modest in some cases, DSAC remains attractive due to its higher convergence efficiency and learning stability, which are important for reliable training and evaluation across different driving conditions.
Figure 6 shows the DSAC architecture, which consists of a distributional critic network and an actor (policy) network. The critic predicts the return distribution for each state–action pair rather than a single expected value, while the actor outputs actions conditioned on the current state. During training, DSAC samples transitions from the replay buffer, updates the critic to better approximate the target return distribution, and then optimizes the actor based on the critic's updated value estimates, improving action selection. DSAC is therefore well suited to reinforcement learning tasks with continuous action spaces, where traditional Q-learning methods struggle to provide effective solutions.
(1)
Distributional reinforcement learning framework
Within the maximum-entropy framework, DSAC maximizes the expected cumulative reward with an entropy regularizer to promote exploration. The resulting objective is:
$$J_\pi = \mathop{\mathbb{E}}_{\substack{(s_i, a_i) \sim \rho_\pi \\ r_i \sim R(\cdot \mid s_i, a_i)}} \left[ \sum_{i=t}^{\infty} \gamma^{i-t} \big( r_i + \alpha\, \mathcal{H}(\pi(\cdot \mid s_i)) \big) \right]$$
where $\rho_\pi$ denotes the distribution of state–action pairs $(s,a)$ under the policy $\pi$, $\gamma$ is the discount factor, $\alpha$ is the entropy coefficient, $\mathcal{H}$ is the entropy of the policy $\pi$, and $r_i(s_i, a_i)$ is the per-step reward.
The soft Q-value function under policy π is defined as:
$$Q^\pi(s_t, a_t) = \mathbb{E}_{r \sim R(s_t, a_t)}[r] + \gamma\, \mathbb{E}_\pi\big[ G_{t+1} \big]$$
where $G_t = \sum_{i=t}^{\infty} \gamma^{i-t} \big( r_i - \alpha \log \pi(a_i \mid s_i) \big)$.
The optimal policy is obtained via soft policy iteration, which alternates between soft policy evaluation and soft policy improvement. In this procedure, the value function is updated using the soft Bellman operator T π :
$$\mathcal{T}^\pi Q^\pi(s,a) = \mathbb{E}_{r \sim R(s,a)}[r] + \gamma\, \mathbb{E}_{s' \sim p,\, a' \sim \pi}\big[ Q^\pi(s',a') - \alpha \log \pi(a' \mid s') \big]$$
To improve the performance of the new policy π n e w , we update the policy by maximizing the entropy-augmented objective:
$$\pi_{new} = \arg\max_\pi J_\pi = \arg\max_\pi\; \mathbb{E}_{s \sim \rho_\pi,\, a \sim \pi}\big[ Q^{\pi_{old}}(s,a) - \alpha \log \pi(a \mid s) \big]$$
The distributional formulation models the soft return as a random variable:
$$Z^\pi(s_t, a_t) = r_t + \gamma\, G_{t+1}, \qquad (s_{i>t}, a_{i>t}) \sim \rho_\pi, \quad r_i \sim R(\cdot \mid s_i, a_i)$$
The expected state-action return function is given by:
$$Q^\pi(s,a) = \mathbb{E}\big[ Z^\pi(s,a) \big]$$
The soft state–action return distribution is defined as $\mathcal{Z}^\pi\big(Z^\pi(s,a) \mid s,a\big): \mathcal{S} \times \mathcal{A} \to \mathcal{P}\big(Z^\pi(s,a)\big)$. A distributional soft policy evaluation scheme is adopted to derive the optimal policy. Under the maximum-entropy formulation, the corresponding distributional Bellman operator is:
$$\mathcal{T}_D^\pi Z^\pi(s,a) \overset{D}{=} r + \gamma\, \big( Z^\pi(s',a') - \alpha \log \pi(a' \mid s') \big)$$
In the above equation, both sides are random variables, and $A \overset{D}{=} B$ denotes that $A$ and $B$ follow the same probability distribution. The policy iteration that combines maximum entropy with distributional returns is referred to as distributional soft policy iteration (DSPI), which converges to the optimal policy.
(2)
Principle of reducing overestimation
A quantitative analysis of the overestimation error in return-distribution learning is performed. For convenience, assume the entropy coefficient $\alpha = 0$. The greedy target is defined as $y = \mathbb{E}[r] + \gamma\, \mathbb{E}_{s'}\big[\max_{a'} Q_\theta(s', a')\big]$, and the Q-estimate $Q_\theta(s,a)$ is updated by minimizing the loss $\big(y - Q_\theta(s,a)\big)^2 / 2$, with the parameters $\theta$ updated as:
$$\theta_{new} = \theta + \beta\, \big( y - Q_\theta(s,a) \big)\, \nabla_\theta Q_\theta(s,a)$$
where $\beta$ is the learning rate.
Since there is some error between the updated Q-estimate $Q_\theta$ and the true Q-value $\tilde{Q}(s,a)$, we assume:
$$Q_\theta(s,a) = \tilde{Q}(s,a) + \varepsilon_Q$$
where $\varepsilon_Q$ is a random error.
The parameters updated according to the true objective $\tilde{y}$ are:
$$\theta_{true} = \theta + \beta\, \big( \tilde{y} - Q_\theta(s,a) \big)\, \nabla_\theta Q_\theta(s,a)$$
where $\tilde{y} = \mathbb{E}[r] + \gamma\, \mathbb{E}_{s'}\big[ \max_{a'} \tilde{Q}(s', a') \big]$.
The error in the updated Q-estimate $Q_{\theta_{new}}(s,a)$ is given by:
$$\Delta(s,a) = \mathbb{E}_{\varepsilon_Q}\big[ Q_{\theta_{new}}(s,a) - Q_{\theta_{true}}(s,a) \big] = \beta \gamma \delta\, \big\| \nabla_\theta Q_\theta(s,a) \big\|_2^2$$
where $\delta = \mathbb{E}_{\varepsilon_Q}\big[\mathbb{E}_{s'} \max_{a'} Q_\theta(s',a')\big] - \mathbb{E}_{s'} \max_{a'} \tilde{Q}(s',a') \ge 0$, which implies $\Delta(s,a) \ge 0$, i.e., $\Delta(s,a)$ is an upward bias. Although the single-step bias is small, it accumulates through temporal-difference (TD) learning, leading to overestimation error.
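The upward bias δ ≥ 0 is essentially Jensen's inequality applied to the max operator: taking the max over noisy estimates inflates the expectation even when the noise is zero-mean. A toy Monte Carlo illustration (values are arbitrary, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

Q_true = np.array([1.0, 1.0, 1.0])               # all actions equally good
noise = rng.normal(0.0, 0.5, size=(100_000, 3))  # zero-mean estimation error

# E[max_a (Q + eps)] vs max_a Q: the former is systematically larger
biased = np.max(Q_true + noise, axis=1).mean()
true_max = Q_true.max()
print(f"E[max(Q+eps)] = {biased:.3f}, max Q = {true_max:.3f}")
```

Even though the noise has zero mean, the estimated maximum exceeds the true maximum, mirroring the single-step bias delta derived above.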
To analyze the overestimation, we introduce the distributional return. Assume the return follows a Gaussian distribution $Z(\cdot \mid s,a)$ whose mean and standard deviation are given by two independent functions $Q_\theta(s,a)$ and $\sigma_\psi(s,a)$, with $\theta$ and $\psi$ being their parameters, i.e., $Z_{\theta,\psi}(\cdot \mid s,a) = \mathcal{N}\big(Q_\theta(s,a), \sigma_\psi(s,a)^2\big)$. Similarly, assume the target distribution is $Z_{\mathrm{target}}(\cdot \mid s,a) = \mathcal{N}(y, \sigma_{\mathrm{target}}^2)$. The overestimation error of $Q_{\theta_{\mathrm{new}}}(s,a)$ then satisfies:
$\Delta_D(s,a) \propto \dfrac{\Delta(s,a)}{\sigma_\psi(s,a)^2}$
From the above relation, the overestimation error $\Delta_D(s,a)$ is inversely proportional to the square of $\sigma_\psi(s,a)$, and $\sigma_\psi(s,a)$ grows with the system's uncertainty: the greater the randomness of the return and its distribution, the larger $\sigma_\psi(s,a)$ becomes. By learning the return distribution, the influence of task randomness on the value estimate is therefore attenuated, which lowers the overestimation.
(3)
Design of the DSAC algorithm
In the DSAC algorithm, both the distributional value function $Z_\theta(\cdot \mid s,a)$ and the policy function $\pi_\phi(\cdot \mid s)$ follow Gaussian distributions, with parameters $\theta$ and $\phi$, respectively; their means and variances are computed by neural networks. During the policy-evaluation phase, we choose the Kullback–Leibler (KL) divergence as the metric of distributional distance and optimize the return distribution corresponding to the current policy:
$J_Z(\theta) = \mathbb{E}_{(s,a,r,s') \sim B}\big[\log P\big(\mathcal{T}_D^{\pi_\phi} Z(s,a) \mid Z_\theta(\cdot \mid s,a)\big)\big]$
where $B$ is the replay buffer and $\mathcal{T}_D^{\pi_\phi}$ denotes the distributional Bellman operator under policy $\pi_\phi$.
The gradient of parameter θ is given by:
$\nabla_\theta J_Z(\theta) = \mathbb{E}_{(s,a,r,s') \sim B}\big[\nabla_\theta \log P\big(\mathcal{T}_D^{\pi_\phi} Z(s,a) \mid Z_\theta(\cdot \mid s,a)\big)\big]$
where $Z_\theta$ is a Gaussian model, $Z_\theta(\cdot \mid s,a) = \mathcal{N}\big(Q_\theta(s,a), \sigma_\theta(s,a)^2\big)$.
From the above gradient formula, when the standard deviation $\sigma_\theta(s,a) \to 0$ the gradient $\nabla_\theta J_Z(\theta)$ explodes, and when $\sigma_\theta(s,a) \to \infty$ it vanishes. Clipping is therefore employed: the standard deviation is limited to a reasonable range, and the target distribution is kept near the expectation of the current return distribution so that large shifts in the target cannot destabilize learning:
$\sigma_\theta(s,a) \leftarrow \mathrm{clip}\big(\sigma_\theta(s,a),\, \sigma_{\min},\, \sigma_{\max}\big)$
$\overline{\mathcal{T}_D^{\pi_\phi} Z(s,a)} = \mathrm{clip}\big(\mathcal{T}_D^{\pi_\phi} Z(s,a),\; Q_\theta(s,a) - b,\; Q_\theta(s,a) + b\big)$
where $b$ is the clipping boundary.
The target networks are updated with slow-moving (Polyak) averaging:
$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad \phi' \leftarrow \tau\phi + (1-\tau)\phi'$
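The two stabilizers described above (clipping the predicted standard deviation and the distributional TD target) and the slow target update can be sketched in a few lines of Python. This is our own illustration; the bounds $\sigma_{\min}=0.1$, $\sigma_{\max}=10$, $b=10$, and $\tau=0.005$ are assumed values, not taken from the paper.

```python
import numpy as np

def clip_std(sigma, sigma_min=0.1, sigma_max=10.0):
    """Keep sigma_theta(s, a) in [sigma_min, sigma_max] to avoid
    exploding (sigma -> 0) or vanishing (sigma -> inf) gradients."""
    return np.clip(sigma, sigma_min, sigma_max)

def clip_td_target(td_target, q_current, b=10.0):
    """Constrain the sampled distributional TD target to stay within
    b of the current return expectation Q_theta(s, a)."""
    return np.clip(td_target, q_current - b, q_current + b)

def polyak_update(target_params, params, tau=0.005):
    """Slow-moving target update: theta' <- tau*theta + (1 - tau)*theta'."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

# Example: a TD target far above Q is pulled back to the boundary Q + b
print(clip_td_target(np.array([25.0]), q_current=np.array([3.0]), b=10.0))  # [13.]
```

With a small $\tau$, the target networks drift only slowly toward the online networks, which keeps the target distribution nearly stationary between updates.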
In the policy improvement phase, we can learn the policy by maximizing the soft Q-value:
$J_\pi(\phi) = \mathbb{E}_{s \sim B,\, a \sim \pi_\phi}\big[Q_\theta(s,a) - \alpha \log \pi_\phi(a \mid s)\big]$
To reduce the variance of the gradient estimate, the reparameterization trick is used to compute the policy gradient. Since $Q_\theta(s,a)$ is explicitly parameterized by $\theta$, the action $a$ is expressed as a deterministic function of an auxiliary noise variable:
$a = f_\phi(\xi_a; s) = a_{\mathrm{mean}} + \xi_a \odot a_{\mathrm{std}}$
where $\xi_a$ is the auxiliary noise variable, $a_{\mathrm{mean}}$ and $a_{\mathrm{std}}$ are the mean and standard deviation of $\pi_\phi(\cdot \mid s)$, and $\odot$ denotes the Hadamard (element-wise) product.
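A minimal sketch of this reparameterization for a diagonal-Gaussian policy follows (our own illustration; the array shapes and values are assumed, not from the paper). The action is a deterministic function of $(s, \xi_a)$, so gradients can flow through $a_{\mathrm{mean}}$ and $a_{\mathrm{std}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

a_mean = np.array([0.2, -0.5])   # mean of pi_phi(.|s), from the policy network
a_std = np.array([0.3, 0.1])     # standard deviation, from the policy network

xi = rng.standard_normal(2)      # auxiliary noise, xi ~ N(0, I)
a = a_mean + xi * a_std          # Hadamard product: a = f_phi(xi; s)

# For a fixed xi, the action is a deterministic function of the policy
# outputs, which is what enables the low-variance pathwise gradient.
print(a)
```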
The gradient of the policy update is given by:
$\nabla_\phi J_\pi(\phi) = \mathbb{E}_{s \sim B,\, \xi_a}\big[-\alpha \nabla_\phi \log \pi_\phi(a \mid s) + \big(\nabla_a Q_\theta(s,a) - \alpha \nabla_a \log \pi_\phi(a \mid s)\big)\nabla_\phi f_\phi(\xi_a; s)\big]$
The entropy coefficient $\alpha$ is adjusted by minimizing:
$J(\alpha) = \mathbb{E}_{(s,a) \sim B}\big[-\alpha\big(\log \pi_\phi(a \mid s) + \bar{H}\big)\big]$
where $\bar{H}$ is the target entropy.
The pseudocode of the DSAC algorithm is shown in Algorithm 1.
Algorithm 1. DSAC algorithm
Initialize parameters $\theta$, $\phi$, $\alpha$
Initialize target parameters $\theta' \leftarrow \theta$, $\phi' \leftarrow \phi$
Initialize learning rates $\beta_Z$, $\beta_\pi$, $\beta_\alpha$ and target-update rate $\tau$
Initialize iteration index $k = 0$
For $k = 0$ to $k_{\max}$ do
 Select action $a \sim \pi_\phi(a \mid s)$
 Observe reward $r$ and next state $s'$
 Store transition tuple $(s, a, r, s')$ in buffer $B$
 Sample $N$ transitions $(s, a, r, s')$ from $B$
 Update the soft return distribution $\theta \leftarrow \theta + \beta_Z \nabla_\theta J_Z(\theta)$
  if $k \bmod m = 0$ then
   Update policy $\phi \leftarrow \phi + \beta_\pi \nabla_\phi J_\pi(\phi)$
   Adjust temperature $\alpha \leftarrow \alpha - \beta_\alpha \nabla_\alpha J(\alpha)$
   Update target networks $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$
  end if
end for

3.3.2. Energy Management Strategy Based on DSAC

This section presents a DSAC-based energy management strategy for platooned hybrid electric trucks to optimize power allocation. At each control step, the agent observes the current vehicle state, and the DSAC policy outputs a control action that is applied to the powertrain. In this study, the state vector comprises the battery SOC, vehicle speed, and longitudinal acceleration:
$s = [v,\ a_c,\ \mathrm{SOC}]$
In this study, torque compensation control is implemented for the motor torque using the vehicle’s required torque and the engine’s output torque [48]. Accordingly, the engine is taken as the control input, and the DSAC action is defined as the engine power:
$a = P_e$
To enhance platoon energy efficiency, the DSAC agent is trained using a reward that penalizes both fuel and electricity consumption. The reward is defined as:
$r = -\big(\alpha\, m_{\mathrm{fuel}} + \beta\, P_{\mathrm{bat}}/3600\big)$
where $m_{\mathrm{fuel}}$ is the fuel consumed in the current step, $\alpha$ represents the diesel fuel price [49] ($\alpha = 3.8$ CNY/L), $\beta$ represents the electricity price ($\beta = 0.8$ CNY/kWh), and $P_{\mathrm{bat}}$ is the battery power; the negative sign makes the reward penalize the operating cost.
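Assuming the reward is the negative per-step operating cost, as the penalty formulation above suggests, the computation can be sketched as follows (our own illustration; the step length and the fuel/power inputs in the example are assumed values).

```python
FUEL_PRICE = 3.8       # alpha, diesel price [CNY/L]
ELEC_PRICE = 0.8       # beta, electricity price [CNY/kWh]

def step_reward(m_fuel_l, p_bat_kw, dt_s=1.0):
    """Negative operating cost of one control step.

    m_fuel_l  -- fuel consumed during the step [L]
    p_bat_kw  -- battery electrical power during the step [kW]
    dt_s      -- step length [s]; p_bat_kw * dt_s / 3600 converts kW*s to kWh
    """
    fuel_cost = FUEL_PRICE * m_fuel_l
    elec_cost = ELEC_PRICE * p_bat_kw * dt_s / 3600.0
    return -(fuel_cost + elec_cost)

# Example: 0.005 L of diesel and 20 kW battery draw over a 1 s step
print(round(step_reward(0.005, 20.0), 5))  # -0.02344
```

Weighting both energy carriers by their prices lets the agent trade fuel against battery energy in a single scalar objective.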

4. Results and Discussion

4.1. Parameter Settings

The proposed hierarchical framework is implemented and simulated in Python 3.14. For the upper layer, a DMPC-based CACC platoon model is built with six PHETs: one leader and five followers. The parameters of the truck platoon are listed in Table 2. For the lower layer, the DSAC-based energy management strategy is employed; the hyperparameters of the DSAC algorithm are listed in Table 3. All simulations were run on a laptop equipped with an Intel Core i5-1240P CPU, 16 GB of RAM, and Intel Iris Xe graphics.
In all simulations, regenerative braking is enabled. When the traction power demand becomes negative during deceleration, the motor operates in generator mode to recover braking energy and charge the battery within the motor/generator capability and battery charging power constraints. A plug-in charging period is not considered in the constructed drive profile, which represents the driving stage only. The initial battery SOC is set to 0.8, the SOH is assumed to be 100% (kept constant during the simulation), and the ambient temperature is fixed at 25 °C (room temperature).

4.2. Driving Condition Data Collection

To align the driving-cycle data with local real-world conditions and thereby improve the accuracy of the control strategy, this study uses an onboard CAN-bus data recorder to collect driving data on a road section in Liuzhou; a total of 18,812 data points were collected. The data-collection equipment is shown in Figure 7. The recorder logged 18 signals, including time, vehicle speed, motor voltage, and motor current. The collected CAN signals were replayed and exported using TSMaster (v2023.8.30.958) software. Figure 8 shows part of the driving data collected in Liuzhou during the experiment.

4.3. Analysis of Vehicle Following Performance

The vehicle-following capability of the DMPC platoon model is evaluated using two representative driving cycles, namely CHTC and the Liuzhou city cycle. The characteristics of the driving conditions are shown in Table 4. The CHTC is a national standard cycle used to evaluate the energy consumption of commercial vehicles, with an average speed of 12.90 m/s and a maximum speed of 24.44 m/s. The Liuzhou cycle represents the actual driving conditions in Liuzhou city, with an average speed of 10.41 m/s and a maximum speed of 15.54 m/s.
Figure 9 presents the platoon simulation results under the CHTC driving cycle. Figure 9a compares the speed profiles of the leader and follower trucks. As shown, the following vehicles closely track the speed curve of the lead truck. To describe the following performance more precisely, we define the average speed deviation of a following vehicle as $\delta_{\mathrm{mean}} = \frac{1}{N}\sum_{i=1}^{N} \big|v_l(i) - v_f(i)\big|$, where $v_l(i)$ and $v_f(i)$ are the speeds of the lead vehicle and the following vehicle at the $i$-th sampling instant, and $N$ is the number of samples. Table 5 presents the average speed deviations of the following vehicles. From Table 5, the first follower exhibits the smallest mean speed deviation (0.167 m/s), whereas the fifth follower shows the largest (0.213 m/s). Overall, the mean speed deviation increases with the follower index: each vehicle tracks its predecessor, so speed deviations accumulate along the string and grow toward the tail of the platoon.
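The speed-deviation metric can be computed as follows (a minimal sketch; the toy speed traces are illustrative, not measured data):

```python
import numpy as np

def mean_speed_deviation(v_leader, v_follower):
    """delta_mean = (1/N) * sum_i |v_l(i) - v_f(i)| over N samples [m/s]."""
    v_leader = np.asarray(v_leader, dtype=float)
    v_follower = np.asarray(v_follower, dtype=float)
    return np.abs(v_leader - v_follower).mean()

# Toy speed traces (m/s), purely illustrative
v_l = [10.0, 12.0, 11.0, 9.0]
v_f = [10.1, 11.8, 11.2, 9.1]
print(round(mean_speed_deviation(v_l, v_f), 3))  # 0.15
```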
Figure 9b shows the position curves of all vehicles. The curves are arranged in sequence without intersecting, indicating that no collisions occurred between the following vehicles during driving and demonstrating good safety. Figure 9c illustrates the position-error curves of the following vehicles: the position errors remain within ±2 m, so a reasonable inter-vehicle distance is maintained. Table 6 presents the average position errors of the following vehicles. The first follower has the smallest average position error (0.032 m), while the fifth follower has the largest (0.339 m); the average position error grows with the follower index because position errors accumulate along the string of preceding vehicles. Figure 9d shows the acceleration curves of all vehicles. The accelerations stay within ±1 m/s², and the maximum acceleration of the followers is smaller than that of the leader, further improving ride comfort.
To assess the robustness of the DMPC-based CACC model, simulations were performed using the Liuzhou driving cycle, with results shown in Figure 10. Figure 10a plots the vehicle speed profiles, indicating that all followers track the leader effectively. Table 5 summarizes the mean speed deviations. All followers exhibit deviations below 0.15 m/s, indicating good tracking performance. Figure 10b plots the vehicle position trajectories, showing that the followers remain uniformly spaced behind the leader while maintaining safe inter-vehicle distances.
Figure 10c shows the position error trajectories. The followers' position errors remain within ±2 m, comparable to those under the CHTC, indicating the high stability of the DMPC-based CACC model. Figure 10d plots the vehicle acceleration profiles. The accelerations remain within ±1 m/s², consistent with the results obtained under the CHTC. From the control performance under both the CHTC and Liuzhou driving cycles, we conclude that the DMPC-based cooperative adaptive cruise control model is robust, remains solvable under different driving conditions, and yields good simulation results with regard to vehicle following, inter-vehicle distance keeping, and ride comfort.

4.4. Fuel Economy Analysis

For benchmarking, we additionally implemented DP and DDPG energy management strategies for comparison with the proposed DSAC-based method, and we analyze the simulation results for the first following vehicle. Figure 11 shows its SOC curves: Figure 11a,b display the SOC trajectories for the CHTC and Liuzhou driving cycles, respectively, with the red, green, and blue curves representing DP, DDPG, and DSAC. The variation in SOC reflects the vehicle's electricity consumption during operation. In Figure 11, the initial SOC is 0.8 and the final value is around 0.35 for all methods. The SOC trajectories of DDPG and DSAC are quite similar, while that of DP differs noticeably. This is because DP computes a globally optimal solution with full knowledge of the driving cycle, whereas DDPG and DSAC make locally optimal decisions step by step, so their SOC trajectories deviate from DP's.
Table 7 shows the fuel consumption simulation results under the CHTC for the different energy management strategies. DP has the lowest fuel consumption at 12.356 L/100 km, followed by DSAC with 13.427 L/100 km and DDPG with 13.686 L/100 km. Taking DP as the reference, DDPG achieves 90.28% of DP's fuel economy and DSAC achieves 92.02%, indicating that DSAC is more fuel-efficient than DDPG. The margin is only 1.74 percentage points, however, since both strategies are deep-reinforcement-learning methods built on actor-critic architectures, with DSAC being an improved variant of DDPG-style learning. Table 8 shows the fuel consumption under the Liuzhou driving cycle, where DDPG reaches 90.87% of DP's fuel economy and DSAC reaches 93.03%. The results demonstrate that the DSAC-based energy management strategy offers better fuel economy than DDPG and exhibits stronger robustness, adapting well to different driving conditions.
To provide a clearer comparison of fuel use, we examine the engine operating points. Figure 12 illustrates their distributions under the CHTC and Liuzhou driving cycles, shown in Figure 12a and Figure 12b, respectively.
The DP solution places most operating points in the high-efficiency band. By comparison, the DDPG- and DSAC-based controllers exhibit a wider spread that includes low-efficiency regions. Compared with DDPG, DSAC places a larger share of engine operating points in the high-efficiency region, which translates into better fuel economy.
In deep reinforcement learning algorithms, the average reward is an important metric for evaluating the completion of training. Figure 13 plots the average reward curves for different algorithms; Figure 13a corresponds to the CHTC. As shown, DDPG converges around episode 72, whereas DSAC converges earlier at approximately episode 55, indicating faster convergence for DSAC. Figure 13b shows the results under the Liuzhou driving cycle. DDPG converges at about episode 74, whereas DSAC converges earlier at approximately episode 53. Combined with the CHTC results, these simulations confirm that DSAC achieves faster convergence than DDPG. This is because the DSAC algorithm incorporates the return distribution function into maximum entropy reinforcement learning, and by directly learning the continuous return distribution, it addresses the issues of gradient explosion and vanishing gradients, thus improving the convergence speed.
In this study, the training data for the reinforcement learning agents are generated through online interaction with the simulation environment; that is, transition samples are collected while the agent operates the powertrain under the given driving cycle. The average reward curves in Figure 13 are used to assess training progress and convergence. For transparency and reproducibility, it should be noted that the training process is conducted for a fixed number of 500 episodes for both algorithms and both driving cycles. Convergence is identified when the average reward curve becomes stable and shows no further noticeable improvement, indicating that the learned policy has reached a steady performance level. This setup makes the learning process and stopping criterion explicit and repeatable.
Given the platoon-level focus of this study, the fuel economy of each vehicle is further examined. Table 9 reports the fuel consumption of all platoon members under the DSAC-based EM strategy. Under the CHTC, fuel consumption varies only slightly, from 13.398 to 13.503 L/100 km, and the five followers exhibit nearly identical values. This consistency is mainly attributed to their similar speed profiles, which lead to comparable energy demand and thus similar fuel use. Figure 14 presents the SOC trajectories of the five followers. The curves almost overlap, indicating similar battery energy usage across the platoon. Overall, these results demonstrate that the proposed DSAC-based EM strategy is well-suited for platooning and can effectively enhance the fuel economy of platoon driving.

4.5. Limitations of Simulation-Only Verification and Practical Considerations

Although the proposed energy management strategy demonstrates consistent improvements across the presented simulation cases, it is important to clarify that the validation in this study is conducted in a model-in-the-loop (MIL) simulation environment. The driving cycles are constructed from real-world CAN-bus recordings, which helps ensure that the longitudinal speed demand and operating profiles are representative of practical truck usage. However, using CAN-bus data to build driving cycles does not constitute an experimental validation of the proposed control strategy itself. Therefore, the practical credibility of the reported quantitative gains should be interpreted within the scope and assumptions of the adopted models and simulation settings.
(1)
Model fidelity and unmodeled dynamics.
The powertrain and battery models employed in this paper, while widely adopted for energy management research, inevitably simplify certain high-fidelity behaviors. For example, thermal dynamics, transient actuator behaviors, drivetrain backlash, and nonlinearities associated with real hardware may influence fuel consumption, battery current trajectories, and aging-related indicators in real operation. Moreover, battery life estimation models typically rely on a subset of aging mechanisms and may not capture all degradation pathways under varying temperatures, current ripple, and long-term calendar aging effects. These simplifications may lead to discrepancies between simulated and real-world performance.
(2)
Real-time implementation constraints.
The proposed strategy involves optimization and/or learning-based decision-making, whose practical deployment requires consideration of real-time computational limits (e.g., ECU/industrial PC constraints), solver convergence behavior under strict timing, and robustness to numerical issues. In real vehicles, additional constraints such as torque rate limits, actuator saturation, and controller scheduling may affect the achievable performance compared with offline simulation.
(3)
Communication, sensing, and uncertainty factors.
In practical settings, sensor noise, estimation errors (e.g., SOC estimation), and disturbances such as vehicle mass variations, road grade uncertainty, and aerodynamic changes can impact control outcomes. If the strategy relies on information exchange (e.g., for platooning/cooperative control), communication latency, packet loss, and time synchronization errors may further degrade performance. These factors are not fully represented in the current MIL simulations.
(4)
Implications and future validation.
Given the above limitations, the results presented in this paper should be viewed as demonstrating the potential effectiveness of the proposed framework under the assumed modeling conditions. As future work, we will (i) conduct robustness tests incorporating uncertainty, noise, and delay effects, (ii) perform cross-validation using higher-fidelity co-simulation platforms/models, and (iii) implement hardware-in-the-loop (HIL) and/or vehicle-level experiments to validate real-time feasibility and practical performance under realistic constraints.

4.6. Practical Implementation Issues: Communication Imperfections, Real-Time Computation, and Scalability

While the proposed DMPC–DSAC framework shows promising performance in the presented model-in-the-loop simulations, several practical implementation factors are not explicitly modeled and may affect real-world deployment. This subsection discusses key issues, including communication delays and packet loss in V2V/V2I networks, real-time computational burden, and scalability to larger platoons, and outlines corresponding research directions.

4.6.1. Communication Delays and Packet Loss in V2V/V2I Networks

In real platooning scenarios, the information exchanged among vehicles (e.g., states, predicted trajectories, or control intentions) may suffer from time-varying communication delays and intermittent packet loss. Such imperfections can lead to outdated or missing information, which may degrade tracking performance and potentially compromise platoon stability if not properly addressed. The current study does not explicitly incorporate these network effects; thus, the reported performance should be interpreted under an idealized communication assumption.
Future work will investigate delay-/loss-aware cooperative control by: (i) incorporating time-stamped information and delay compensation (prediction-based alignment) within DMPC; (ii) formulating robust or stochastic DMPC to account for bounded delays and probabilistic packet drops; and (iii) adopting event-triggered or asynchronous coordination mechanisms to reduce reliance on high-rate, lossless communications while maintaining safety constraints.

4.6.2. Real-Time Computational Burden of DMPC and DSAC

Practical deployment requires the controller to meet strict timing constraints on embedded hardware. DMPC may involve iterative optimization and coordination among multiple vehicles, and its computational cost can increase with prediction horizon length, constraint complexity, and coupling strength among agents. In addition, DSAC introduces learning-based decision components whose training is typically performed offline; however, online execution still requires real-time inference and integration with DMPC. The current manuscript focuses on algorithmic effectiveness in simulation and does not provide a detailed real-time feasibility study (e.g., worst-case execution time under a fixed sampling period).
As future research, we will report runtime statistics under representative sampling rates and hardware configurations, and explore computationally efficient implementations, such as warm-starting, reduced-order prediction models, limited-iteration DMPC, parallel/distributed solvers, and policy distillation, where a trained policy approximates the optimization-based controller for fast online execution while preserving safety constraints.

4.6.3. Scalability to Larger Platoons

As platoon size increases, the communication load, coordination complexity, and computational requirements of multi-agent optimization can grow significantly. Moreover, ensuring robust platoon behavior (e.g., string stability and constraint satisfaction) becomes more challenging under heterogeneous vehicle dynamics and uncertain road conditions. The current study evaluates the proposed framework in a limited-size platoon setting; therefore, scalability to larger platoons is not fully demonstrated.
To enhance scalability, future work will investigate hierarchical or clustered coordination architectures, where local sub-platoons are controlled with neighborhood interactions and higher-level coordination ensures global objectives. Additionally, we will conduct systematic scalability studies by increasing platoon size and reporting performance–complexity trade-offs (e.g., energy savings, battery-life metrics, tracking errors, and computation/communication overhead) to characterize deployment limits and guide practical parameter selection.

5. Conclusions

This paper presents a collaborative optimization framework for cooperative adaptive cruise control and energy management of plug-in hybrid electric truck platoons by integrating distributed model predictive control (DMPC) with a distributed soft actor-critic (DSAC) approach. The main findings are summarized as follows:
(1)
A hierarchical cooperative control structure is developed, where the upper-layer DMPC generates the platoon speed trajectory to improve car-following smoothness and ride comfort, and the lower-layer DSAC allocates power based on the planned motion information to enhance fuel economy.
(2)
Under the China heavy-duty commercial vehicle test cycle (CHTC) and the Liuzhou city driving cycle, the DMPC-based cruise control achieves reliable following and spacing performance, maintaining the position error within ±2 m and acceleration within ±1 m/s2, indicating good robustness and driving comfort.
(3)
The DSAC-based energy management strategy demonstrates favorable fuel economy performance, achieving 92.02% fuel efficiency on the CHTC and 93.03% on the Liuzhou cycle, and outperforming the DDPG-based method in the reported comparisons.
(4)
The training curves show that DSAC converges faster than DDPG under both driving cycles, indicating improved learning efficiency for the studied energy management task.
The limitations of this work are summarized as follows:
(1)
The framework is evaluated through simulation-based studies, and further verification under more realistic implementation conditions is still needed.
(2)
Some deployment-related factors (e.g., communication imperfections, real-time computation constraints, and larger platoon scalability) are not fully covered in the current evaluation.
(3)
The current hierarchical interaction follows a feed-forward structure, and the motion planning layer is not adjusted online based on the energy management outcomes.
Future work will focus on:
(1)
Conducting more realistic validation and implementation-oriented testing to further assess real-time feasibility and practical performance.
(2)
Enhancing robustness under practical uncertainties and extending evaluations to larger platoons and more complex traffic scenarios.
(3)
Exploring tighter coordination mechanisms between motion planning and energy management to further improve overall performance.

Author Contributions

Conceptualization, G.Z. and J.M.; methodology, G.Z., D.M. and X.L.; software, D.M. and X.L.; validation, J.M.; formal analysis, X.W.; investigation, X.W.; data curation, X.W.; writing—original draft, X.L.; writing—review and editing, Y.M.; supervision, J.M.; funding acquisition, Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 52365001) and the Guangxi Innovation Drive Development Special Funds Project (No. AA23062040 and No. AA23023011).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

Authors Xin Liu, Dong Mai, Jun Mao, Gang Zhang and Xiangning Wu were employed by the Guangxi Research Institute of Mechanical Industry Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

ACC: adaptive cruise control
CHTC: China heavy-duty commercial vehicle test cycle
DDPG: deep deterministic policy gradient
DMPC: distributed model predictive control
DP: dynamic programming
DRL: deep reinforcement learning
DSAC: distributed soft actor-critic
ECMS: equivalent consumption minimization strategy
GA: genetic algorithms
GT: game theory
HEVs: hybrid electric vehicles
KL: Kullback–Leibler
MPC: model predictive control
PHET: plug-in hybrid electric truck
SOC: state of charge
V2I: vehicle-to-infrastructure
V2V: vehicle-to-vehicle

References

  1. Pan, M.Z.; Cao, S.; Zhang, Z.Q.; Ye, N.Y.; Qin, H.F.; Li, L.L.; Guan, W. Recent progress on energy management strategies for hybrid electric vehicles. J. Energy Storage 2025, 116, 115936. [Google Scholar] [CrossRef]
  2. Maghfiroh, H.; Wahyunggoro, O.; Cahyadi, A.I. Real-time Energy Management Strategy of Hybrid Electric Vehicle: A Review. Int. J. Eng. 2025, 38, 2887–2901. [Google Scholar] [CrossRef]
  3. Chen, D.X.; Chen, T.; Li, Z.J.; Liu, Z.X.; Sun, C.Y.; Zhao, H. Energy management strategy for plug-in hybrid electric vehicles based on vehicle speed prediction and limited traffic information. Energy 2025, 326, 136292. [Google Scholar] [CrossRef]
  4. Wang, J.H.; Du, C.Q.; Yan, F.W.; Zhou, Q.; Xu, H.M. Hierarchical Rewarding Deep Deterministic Policy Gradient Strategy for Energy Management of Hybrid Electric Vehicles. IEEE Trans. Transp. Electrif. 2024, 10, 1802–1815. [Google Scholar] [CrossRef]
  5. Xu, H.Y.; He, H.W.; Yan, M.; Wu, J.D.; Li, M.L. Hierarchical energy management for fuel cell buses: A graph-agent DRL framework bridging macroscopic traffic flow and microscopic powertrain dynamics. Energy 2025, 332, 137237. [Google Scholar] [CrossRef]
  6. Li, F.Y.; Gao, L.F.; Zhang, Y.B.; Liu, Y.H. Integrated energy management for hybrid electric vehicles: A Bellman neural network approach. Eng. Appl. Artif. Intell. 2025, 145, 110166. [Google Scholar] [CrossRef]
  7. Li, S.G.; Sharkh, S.M.; Walsh, F.C.; Zhang, C.N. Energy and Battery Management of a Plug-In Series Hybrid Electric Vehicle Using Fuzzy Logic. IEEE Trans. Veh. Technol. 2011, 60, 3571–3585. [Google Scholar] [CrossRef]
  8. Zhang, B.J.; Deng, Y.W.; Yu, D.J. An investigation on energy management system of CJY6470 parallel hybrid electric off-road vehicle with fuzzy logic. In VPPC; IEEE: New York, NY, USA, 2008; pp. 1–9. [Google Scholar]
  9. Liu, Y.G.; Huang, B.; Yang, Y.; Lei, Z.Z.; Zhang, Y.J.; Chen, Z. Hierarchical speed planning and energy management for autonomous plug-in hybrid electric vehicle in vehicle-following environment. Energy 2022, 260, 125212. [Google Scholar] [CrossRef]
  10. Wang, Y.; Wu, Y.K.; Tang, Y.J.; Li, Q.; He, H.W. Cooperative energy management and eco-driving of plug-in hybrid electric vehicle via multi-agent reinforcement learning. Appl. Energy 2023, 332, 120563. [Google Scholar] [CrossRef]
  11. Larsson, V.; Johannesson, L.; Egardt, B. Analytic Solutions to the Dynamic Programming Subproblem in Hybrid Vehicle Energy Management. IEEE Trans. Veh. Technol. 2015, 64, 1458–1467. [Google Scholar] [CrossRef]
  12. Yang, C.; Wang, M.Y.; Wang, W.D.; Pu, Z.S.; Ma, M.Y. An efficient vehicle-following predictive energy management strategy for PHEV based on improved sequential quadratic programming algorithm. Energy 2021, 219, 119595. [Google Scholar] [CrossRef]
  13. Zhang, Y.J.; Chu, L.; Fu, Z.C.; Guo, C.; Zhao, D.; Li, Y.K.; Lei, X. An improved adaptive equivalent consumption minimization strategy for parallel plug-in hybrid electric vehicle. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2019, 233, 1649–1663. [Google Scholar] [CrossRef]
  14. Ahmadi, S.; Bathaee, S.M.T.; Hosseinpour, A.H. Improving fuel economy and performance of a fuel-cell hybrid electric vehicle (fuel-cell, battery, and ultra-capacitor) using optimized energy management strategy. Energy Convers. Manag. 2018, 160, 74–84. [Google Scholar] [CrossRef]
  15. Dextreit, C.; Kolmanovsky, I.V. Game Theory Controller for Hybrid Electric Vehicles. IEEE Trans. Control Syst. Technol. 2014, 22, 652–663. [Google Scholar] [CrossRef]
  16. Wang, C.; Liu, F.C.; Tang, A.H.; Liu, R. A dynamic programming-optimized two-layer adaptive energy management strategy for electric vehicles considering driving pattern recognition. J. Energy Storage 2023, 70, 107924. [Google Scholar] [CrossRef]
  17. Fu, Y.; Fan, Z.K.; Lei, Y.L.; Wang, X.L.; Sun, X.H. Integrated Optimization of Component Parameters and Energy Management Strategies for A Series-Parallel Hybrid Electric Vehicle. Automot. Innov. 2024, 7, 492–506. [Google Scholar] [CrossRef]
  18. Zeng, X.R.; Wang, J.M. A Parallel Hybrid Electric Vehicle Energy Management Strategy Using Stochastic Model Predictive Control With Road Grade Preview. IEEE Trans. Control Syst. Technol. 2015, 23, 2416–2423. [Google Scholar] [CrossRef]
  19. Zhang, Y.J.; Chen, Z.; Li, G.; Liu, Y.G.; Huang, Y.J. A Novel Model Predictive Control Based Co-Optimization Strategy for Velocity Planning and Energy Management of Intelligent PHEVs. IEEE Trans. Veh. Technol. 2022, 71, 12667–12681. [Google Scholar] [CrossRef]
  20. Han, R.Y.; He, H.W.; Wang, Y.X.; Wang, Y. Reinforcement Learning Based Energy Management Strategy for Fuel Cell Hybrid Electric Vehicles. Chin. J. Mech. Eng. 2025, 38, 66. [Google Scholar] [CrossRef]
  21. Zhang, Z.; Zhang, T.Z.; Hong, J.C.; Zhang, H.X.; Yang, J.; Jia, Q.X. Double deep Q-network guided energy management strategy of a novel electric-hydraulic hybrid electric vehicle. Energy 2023, 269, 126858. [Google Scholar] [CrossRef]
  22. Qin, J.H.; Huang, H.Z.; Lu, H.L.; Li, Z.J. Energy management strategy for hybrid electric vehicles based on deep reinforcement learning with consideration of electric drive system thermal characteristics. Energy Convers. Manag. 2025, 332, 119697. [Google Scholar] [CrossRef]
  23. Wang, H.C.; Fu, T.F.; Du, Y.Q.; Gao, W.H.; Huang, K.X.; Liu, Z.M.; Chandak, P.; Liu, S.; Van Katwyk, P.; Deac, A.; et al. Scientific discovery in the age of artificial intelligence. Nature 2023, 620, 47–60, Erratum in Nature 2023, 621, E33. [Google Scholar] [CrossRef]
  24. Li, J.; Wu, X.D.; Xu, M.; Liu, Y.G. Deep reinforcement learning and reward shaping based eco-driving control for automated HEVs among signalized intersections. Energy 2022, 251, 123924. [Google Scholar] [CrossRef]
  25. He, H.W.; Huang, R.C.; Meng, X.F.; Zhao, X.Y.; Wang, Y.; Li, M.L. A novel hierarchical predictive energy management strategy for plug-in hybrid electric bus combined with deep deterministic policy gradient. J. Energy Storage 2022, 52, 104787. [Google Scholar] [CrossRef]
  26. Qi, C.Y.; Song, C.X.; Wang, D.; Xiao, F.; Jin, L.Q.; Song, S.X. Action Advising and Energy Management Strategy Optimization of Hybrid Electric Vehicle Agent Based on Uncertainty Analysis. IEEE Trans. Transp. Electrif. 2024, 10, 6940–6949. [Google Scholar] [CrossRef]
  27. Qin, H.Y.; Meng, L.W.; Lu, M.Z.; Xu, E.Y.; Lin, C.B.; Meng, Y.M. Hierarchical energy management strategy of hybrid electric vehicles under multiple uncertainties. Energy Sources Part A Recovery Util. Environ. Eff. 2025, 47, 2551099. [Google Scholar] [CrossRef]
  28. Tong, H.; Chu, L.; Wang, Z.X.; Zhao, D. Adaptive Pulse-and-Glide for synergistic optimization of driving behavior and energy management in hybrid powertrain. Energy 2025, 330, 136622. [Google Scholar] [CrossRef]
  29. Fan, Y.; Peng, J.K.; Wu, J.D.; Zhou, J.X.; Yu, S.C.; Ma, C.Y. Eco-Driving Strategy for Series Hybrid Electric Vehicle Based on Multi-Objective Deep Reinforcement Learning. IEEE Trans. Transp. Electrif. 2025, 11, 12381–12392. [Google Scholar] [CrossRef]
  30. Han, J.; Cui, H.H.; Khalatbarisoltani, A.; Yang, J.; Liu, C.Z.; Hu, X.S. Traffic-aware hierarchical eco-driving approach for connected hybrid electric vehicles at signalized intersections. Energy 2025, 334, 137596. [Google Scholar] [CrossRef]
  31. Zhang, F.Q.; Qi, Z.C.; Xiao, L.H.; Coskun, S.; Xie, S.B.; Liu, Y.T.; Li, J.C.; Song, Z.Y. Co-optimization on ecological adaptive cruise control and energy management of automated hybrid electric vehicles. Energy 2025, 314, 133542. [Google Scholar] [CrossRef]
  32. Zhang, H.L.; Peng, J.K.; Dong, H.X.; Tan, H.C.; Ding, F. Hierarchical reinforcement learning based energy management strategy of plug-in hybrid electric vehicle for ecological car-following process. Appl. Energy 2023, 333, 120599. [Google Scholar] [CrossRef]
  33. Xie, S.B.; Hu, X.S.; Liu, T.; Qi, S.W.; Lang, K.; Li, H.L. Predictive vehicle-following power management for plug-in hybrid electric vehicles. Energy 2019, 166, 701–714. [Google Scholar] [CrossRef]
  34. Yang, C.; Zha, M.J.; Wang, W.D.; Liu, K.J.; Xiang, C.L. Efficient energy management strategy for hybrid electric vehicles/plug-in hybrid electric vehicles: Review and recent advances under intelligent transportation system. IET Intell. Transp. Syst. 2020, 14, 702–711. [Google Scholar] [CrossRef]
  35. Chen, B.; Wang, M.B.; Hu, L.; Zhang, R.; Li, H.; Wen, X.J.; Gao, K. A hierarchical cooperative eco-driving and energy management strategy of hybrid electric vehicle based on improved TD3 with multi-experience. Energy Convers. Manag. 2025, 326, 119508. [Google Scholar] [CrossRef]
  36. Khayyam, H.; Nahavandi, S.; Davis, S. Adaptive cruise control look-ahead system for energy management of vehicles. Expert. Syst. Appl. 2012, 39, 3874–3885. [Google Scholar] [CrossRef]
  37. Zhang, Y.; Xu, M.F.; Qin, Y.C.; Dong, M.M.; Gao, L.; Hashemi, E. MILE: Multiobjective integrated model predictive adaptive cruise control for intelligent vehicle. IEEE Trans. Ind. Inform. 2023, 19, 8539–8548. [Google Scholar] [CrossRef]
  38. Peng, J.K.; Chen, W.Q.Q.; Fan, Y.; He, H.W.; Wei, Z.B.; Ma, C.Y. Ecological driving framework of hybrid electric vehicle based on heterogeneous multi-agent deep reinforcement learning. IEEE Trans. Transp. Electrif. 2024, 10, 392–406. [Google Scholar] [CrossRef]
  39. Fu, Z.M.; Jiang, Z.H.; Gao, S.; Tao, F.Z.; Si, P.J.; Wang, Z.K. Eco-driving strategy of fuel cell hybrid electric vehicle based on improved soft actor-critic algorithm in car-following scenario. J. Power Sources 2025, 652, 237516. [Google Scholar] [CrossRef]
  40. Fan, P.X.; Yang, J.; Ke, S.; Wen, Y.X.; Li, Y.H.; Xie, L.L. Load frequency control strategy for islanded multimicrogrids with V2G dependent on learning-based model predictive control. IET Gener. Transm. Distrib. 2023, 17, 4763–4780. [Google Scholar] [CrossRef]
  41. Liberati, F.; Atanasious, M.M.H.; De Santis, E.; Di Giorgio, A. A hybrid model predictive control-deep reinforcement learning algorithm with application to plug-in electric vehicles smart charging. Sustain. Energy Grids Netw. 2025, 44, 101963. [Google Scholar] [CrossRef]
  42. Mamani, K.M.S.; Romo, A.J.P. Integrating Model Predictive Control with Deep Reinforcement Learning for Robust Control of Thermal Processes with Long Time Delays. Processes 2025, 13, 1627. [Google Scholar] [CrossRef]
  43. Pan, C.; Wang, A.Q.; Peng, Z.H.; Han, B.; Lyu, G.; Zhang, W.D. Pursuit-evasion game of under-actuated ASVs based on deep reinforcement learning and model predictive path integral control. Neurocomputing 2025, 638, 130045. [Google Scholar] [CrossRef]
  44. Liu, X.; Shi, G.; Yang, C.; Xu, E.; Meng, Y. Co-Optimization of Speed Planning and Energy Management for Plug-In Hybrid Electric Trucks Passing Through Traffic Light Intersections. Energies 2024, 17, 6022. [Google Scholar] [CrossRef]
  45. Liu, X.; Yang, C.; Meng, Y.; Zhu, J.; Duan, Y.; Chen, Y. Hierarchical energy management of plug-in hybrid electric trucks based on state-of-charge optimization. J. Energy Storage 2023, 72, 107999. [Google Scholar] [CrossRef]
  46. Zheng, Y.; Li, S.E.; Li, K.Q.; Borrelli, F.; Hedrick, J.K. Distributed model predictive control for heterogeneous vehicle platoons under unidirectional topologies. IEEE Trans. Control Syst. Technol. 2017, 25, 899–910. [Google Scholar] [CrossRef]
  47. Duan, J.L.; Guan, Y.; Li, S.E.; Ren, Y.A.; Sun, Q.; Cheng, B. Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6584–6598. [Google Scholar] [CrossRef] [PubMed]
  48. Li, L.; Yang, C.; Zhang, Y.H.; Zhang, L.P.; Song, J. Correctional DP-based energy management strategy of plug-in hybrid electric bus for city-bus route. IEEE Trans. Veh. Technol. 2015, 64, 2792–2803. [Google Scholar] [CrossRef]
  49. Xie, S.B.; Hu, X.S.; Qi, S.W.; Tang, X.L.; Lang, K.; Xin, Z.K.; Brighton, J. Model predictive energy management for plug-in hybrid electric vehicles considering optimal battery depth of discharge. Energy 2019, 173, 667–678. [Google Scholar] [CrossRef]
Figure 1. Schematic of a plug-in hybrid electric truck platoon [44].
Figure 2. Powertrain configuration of the plug-in hybrid electric truck [44].
Figure 3. Engine efficiency map.
Figure 4. Motor efficiency map.
Figure 5. Control strategy framework.
Figure 6. Structure of the DSAC algorithm.
Figure 7. The data collection equipment.
Figure 8. Liuzhou driving conditions.
Figure 9. Simulation results under CHTC: (a) Velocity curve, (b) Position curve, (c) Position error curve, (d) Acceleration curve.
Figure 10. Simulation results under Liuzhou driving cycle: (a) Velocity curve, (b) Position curve, (c) Position error curve, (d) Acceleration curve.
Figure 11. SOC curves of different algorithms: (a) CHTC; (b) Liuzhou driving cycle.
Figure 12. Engine operating point distribution of different algorithms: (a) CHTC; (b) Liuzhou driving cycle.
Figure 13. Average reward curves of different algorithms: (a) CHTC; (b) Liuzhou driving cycle.
Figure 14. SOC curves based on the DSAC algorithm: (a) CHTC; (b) Liuzhou driving cycle.
Table 1. Powertrain parameters of the plug-in hybrid electric truck [45].

| Category | Parameter | Symbol | Value | Unit |
|---|---|---|---|---|
| Vehicle body | Gross vehicle mass | m | 18,000 | kg |
| Vehicle body | Frontal area | A | 5.1 | m² |
| Vehicle body | Aerodynamic drag coefficient | C_D | 0.527 | – |
| Traction motor | Peak power | P_m,max | 158.3 | kW |
| Traction motor | Peak torque | T_m,max | 293 | N·m |
| Traction motor | Maximum speed | n_m,max | 12,000 | rpm |
| Engine | Peak power | P_e,max | 169.1 | kW |
| Engine | Peak torque | T_e,max | 734 | N·m |
| Engine | Nominal speed | n_e,nom | 2200 | rpm |
| Battery pack | Nominal voltage | U_bat | 560.28 | V |
| Battery pack | Capacity | Q_bat | 5 | Ah |
| Battery pack | Nominal power | P_bat,nom | 78.4 | kW |
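As a rough plausibility check on the Table 1 values, the steady-state road-load power can be sketched from standard longitudinal dynamics. The rolling resistance coefficient (0.007), air density (1.225 kg/m³), and the flat-road assumption below are illustrative values, not taken from the paper:

```python
# Road-load power sketch for the truck in Table 1 (flat road, steady speed).
# Assumed, NOT from the paper: rolling resistance F_ROLL and air density RHO.
RHO = 1.225      # air density, kg/m^3 (assumed)
G = 9.81         # gravitational acceleration, m/s^2
F_ROLL = 0.007   # rolling resistance coefficient (assumed)

M = 18_000       # gross vehicle mass, kg (Table 1)
A = 5.1          # frontal area, m^2 (Table 1)
CD = 0.527       # aerodynamic drag coefficient (Table 1)

def road_load_power(v):
    """Steady-state traction power (W) at speed v (m/s) on a flat road."""
    f_aero = 0.5 * RHO * CD * A * v**2   # aerodynamic drag force, N
    f_roll = M * G * F_ROLL              # rolling resistance force, N
    return (f_aero + f_roll) * v

if __name__ == "__main__":
    for v in (10, 15, 20):
        print(f"v = {v:2d} m/s -> P = {road_load_power(v) / 1e3:6.1f} kW")
```

At 20 m/s this gives roughly 38 kW, comfortably below the 158.3 kW motor peak, which is consistent with the motor alone being able to sustain highway cruising.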
Table 2. Simulation parameters for the plug-in hybrid electric truck platoon.

| Parameter | Value |
|---|---|
| Number of follower vehicles | 5 |
| Desired inter-vehicle distance | 20 m |
| Prediction horizon | 7 |
| Cost function weights Q_i/G_i/R_i/F_i | 10/5/10/8 |
Table 3. Hyperparameters of the DSAC algorithm.

| Hyperparameter | Value |
|---|---|
| Optimizer | Adam |
| Number of hidden layers | 5 |
| Number of hidden units per layer | 256 |
| Batch size | 256 |
| Value learning rate | 0.0001 |
| Policy learning rate | 0.0001 |
Table 4. Characteristics of driving conditions.

| Driving Cycle | Time (s) | Distance (km) | Maximum Speed (m/s) | Maximum Acceleration (m/s²) | Average Speed (m/s) |
|---|---|---|---|---|---|
| CHTC | 1800 | 23.22 | 24.44 | 0.81 | 12.90 |
| Liuzhou | 1923 | 20.03 | 15.54 | 0.83 | 10.41 |
Table 5. Average speed deviation of the following vehicles.

| Driving Cycle | Following Vehicle 1 | Following Vehicle 2 | Following Vehicle 3 | Following Vehicle 4 | Following Vehicle 5 |
|---|---|---|---|---|---|
| CHTC | 0.167 m/s | 0.186 m/s | 0.207 m/s | 0.212 m/s | 0.213 m/s |
| Liuzhou | 0.114 m/s | 0.127 m/s | 0.141 m/s | 0.145 m/s | 0.147 m/s |
Table 6. Average position error of the following vehicles.

| Driving Cycle | Following Vehicle 1 | Following Vehicle 2 | Following Vehicle 3 | Following Vehicle 4 | Following Vehicle 5 |
|---|---|---|---|---|---|
| CHTC | 0.032 m | 0.089 m | 0.214 m | 0.257 m | 0.339 m |
| Liuzhou | 0.033 m | 0.093 m | 0.218 m | 0.256 m | 0.337 m |
Table 7. Fuel consumption of different control strategies under the CHTC.

| Method | Final SOC | Fuel Consumption (L/100 km) | Fuel Economy (%) |
|---|---|---|---|
| DP | 0.350 | 12.356 | 100 |
| DDPG | 0.359 | 13.686 | 90.28 |
| DSAC | 0.346 | 13.427 | 92.02 |
Table 8. Fuel consumption of different control strategies under the Liuzhou driving cycle.

| Method | Final SOC | Fuel Consumption (L/100 km) | Fuel Economy (%) |
|---|---|---|---|
| DP | 0.350 | 8.647 | 100 |
| DDPG | 0.348 | 9.516 | 90.87 |
| DSAC | 0.353 | 9.295 | 93.03 |
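The "Fuel Economy (%)" column in Tables 7 and 8 appears to express each strategy's consumption relative to the dynamic programming (DP) benchmark, i.e., DP consumption divided by the strategy's consumption; the sketch below reproduces the tabulated percentages under that assumption:

```python
# Reproducing the "Fuel Economy (%)" column of Tables 7 and 8,
# assuming it is defined as (DP consumption / strategy consumption) * 100.
consumption = {  # L/100 km, from Tables 7 and 8
    "CHTC":    {"DP": 12.356, "DDPG": 13.686, "DSAC": 13.427},
    "Liuzhou": {"DP": 8.647,  "DDPG": 9.516,  "DSAC": 9.295},
}
for cycle, methods in consumption.items():
    dp = methods["DP"]
    for name, fuel in methods.items():
        print(f"{cycle} {name}: {dp / fuel * 100:.2f} %")
```

Under this definition the computed values match the tables (e.g., 12.356/13.686 ≈ 90.28% for DDPG under CHTC), with DSAC closing roughly a third of the remaining gap to the DP benchmark relative to DDPG.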
Table 9. Fuel consumption of vehicles in the platoon.

| Driving Cycle | Performance Index | FV1 | FV2 | FV3 | FV4 | FV5 |
|---|---|---|---|---|---|---|
| CHTC | Fuel consumption (L/100 km) | 13.427 | 13.503 | 13.419 | 13.437 | 13.398 |
| Liuzhou | Fuel consumption (L/100 km) | 9.295 | 9.224 | 9.343 | 9.251 | 9.304 |