1. Introduction
The power system of a hybrid electric vehicle comprises multiple power sources, and an energy management strategy distributes power among them to improve the vehicle’s fuel economy and driving range. As one of the crucial technologies of hybrid electric vehicles, the vehicle control strategy primarily addresses the energy management and torque distribution of plug-in hybrid electric vehicles. Control strategies can be divided into rule-based, optimization-based, and learning-based strategies.
The rule-based control strategy, also called the logic-threshold control strategy, has the core idea of keeping the engine operating in its high-efficiency zone. When the engine load is small, the engine is shut off and the motor drives the vehicle alone; when the engine load is moderate, the engine operates in the high-efficiency area, driving the vehicle and charging the battery as needed; when the engine load is large, the motor provides assistance so that the engine remains in the high-efficiency area. This control strategy lets the engine provide steady-state power and the motor provide transient power to improve the vehicle’s fuel economy. This type of control strategy is simple, reliable, and practical. Ping Li et al. [
1] used particle swarm optimization (PSO) to optimize the threshold parameters of the rule-based energy management strategy. To improve the adaptability of the control strategy, multiple historical driving cycles are used to optimize the parameters, resulting in a rule-based energy management control strategy that adapts to unknown driving cycles. Abdoulaye Pam [
2] et al. used DP to determine the ideal energy efficiency of the studied vehicle in a given driving cycle. A rule-based EMS algorithm can be derived by analyzing the DP-EMS results. Charbel J Mansour [
3] proposed a strategy optimization method, a rule-based energy management method that takes dynamic programming as the global optimization program, to realize real-time implementation of the energy management strategy of the Prius plug-in hybrid electric vehicle. The optimization process considers the ideal route the driver selects on the vehicle-mounted global positioning system and is associated with the traffic management system. Rule-based control strategies are widely used in engineering because they require little computation and run quickly. However, they rely heavily on formulation experience, have poor portability, and struggle to achieve optimal control of vehicle power in practice.
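As a minimal illustration of the logic-threshold idea described above, the sketch below assigns engine and motor torque from simple load thresholds; the threshold values, signal names, and efficiency window are hypothetical and only illustrate the structure of such a rule set.

```python
def rule_based_split(t_req, soc,
                     t_eng_low=200.0, t_eng_opt=500.0, t_eng_high=900.0,
                     soc_min=0.3, soc_max=0.8):
    """Illustrative logic-threshold torque split (N*m); all thresholds are made up.

    Returns (engine_torque, motor_torque); negative motor torque means charging.
    """
    if t_req <= t_eng_low and soc > soc_min:
        # Low load: engine off, motor drives the vehicle alone.
        return 0.0, t_req
    if t_req <= t_eng_high:
        # Moderate load: hold the engine near its efficient operating point;
        # surplus torque charges the battery through the motor when SOC allows.
        t_eng = max(t_req, t_eng_opt) if soc < soc_max else t_req
        return t_eng, t_req - t_eng
    # High load: cap the engine at its efficient upper limit, motor assists.
    return t_eng_high, t_req - t_eng_high
```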
The operation of the rule-based control strategy is independent of the driving conditions; although it can improve the vehicle’s fuel economy to a certain extent, the result is not optimal. According to the optimization objective, optimization-based control strategies can be divided into instantaneous optimization and global optimization strategies. Mansour C [
4] et al. proposed a simple adaptive rule strategy based on short-term driving pattern recognition and dynamic programming global optimization program. Mingming Gao et al. [
5] proposed an extended-range electric bus (REEB) energy management strategy based on a convex optimization algorithm, which can be better applied to the REEB energy management system to meet the requirements of the power system. Kegang Zhao [
6] proposed the Radau pseudospectral knotting method (RPKM) to solve the energy management problem of series-parallel plug-in hybrid electric vehicles based on global optimization and to improve computational efficiency. Jian Wu et al. [
7] used PSO combined with various driving conditions to optimize the logic threshold parameters of a rule-based energy management strategy with the vehicle dynamic performance index as the constraint condition and the equivalent fuel consumption rate as the optimization objective.
With the development of big data and computer technology, machine learning is widely used in vehicle energy management strategies. The learning-based control strategy does not depend on ‘expert experience’ or on a mathematical model of the controlled object; instead, it uses data mining methods and historical or real-time empirical data to obtain prediction results or control strategies. With the help of intelligent algorithms, continuity of the state space and the state-action space of the energy management problem is preserved, avoiding the discretization problems encountered in DP-based optimization [8,9,10,11,12]. Tawfiq M. Aljohani et al. [
13] proposed a real-time, metadata-driven electric vehicle path optimization method to reduce road energy demand. The strategy uses the state-action-reward-state-action (SARSA) algorithm to learn an optimal travel policy with electric vehicles as agents. Weihan Li et al. [
14] proposed a multi-objective energy management strategy based on a cloud-based hybrid architecture. The strategy uses a deep deterministic policy gradient and can improve the system’s electrical and thermal safety while minimizing energy loss and aging cost. Weihan Li et al. [
15] designed a new reward to explore the optimal working range of high-power battery packs without imposing strict state-of-charge constraints. In training the deep Q-learning models, different load curves are randomly combined to avoid over-fitting.
Yue S et al. [
16] solved the energy management problem of a composite power supply using the temporal difference (TD) method. Li Wei et al. [
17], from the University of Chinese Academy of Sciences, added a battery life factor to the reward function of a deep reinforcement learning (DRL) algorithm to extend battery life and verified the strategy’s adaptability to different working conditions in simulation. Tang Xiaolin et al. [
18], from Chongqing University, used a deep value network algorithm to perform upper-level tracking control and lower-level energy management, improving the fuel economy of both vehicles. Zhao Chunling et al. [
19], from Chongqing Jiaotong University, applied a DRL algorithm to the energy allocation problem of PHEVs, which not only reduces the pollutant emissions of the diesel engine but also greatly improves the fuel economy of the whole vehicle. Zhang Song et al. [
20] took hybrid electric buses as the research object, applied the DDQN and TD3 algorithms to vehicle energy management, and adopted prioritized experience replay to optimize the strategy, demonstrating its effectiveness.
In actual road driving, the energy management strategy under different working conditions is easily affected by random factors, making a real-time optimal energy management strategy difficult to achieve. To address this problem, this paper first predicts the driving conditions, determines a correction factor for the demand torque distribution based on the prediction results, and corrects the actual demand torque of the vehicle. Finally, an energy management strategy based on TD3 is designed, completing the development of an energy management strategy based on condition prediction.
2. Vehicle Power System Construction
As shown in
Figure 1, the research object of this paper is a P2-configuration parallel hybrid commercial vehicle produced by a company. The main difference between the P2 configuration and other configurations is that the front and rear sides of the motor are controlled by clutches, so the motor and the engine can each drive the vehicle independently.
Figure 2 shows the structure of the vehicle power system; its main components are, in order, the diesel engine, motor, power battery pack, clutch, five-speed transmission, vehicle control unit (VCU), etc. The main parameters are shown in
Table 1.
The experimental modeling method is adopted to model the engine. This paper focuses on the energy management of hybrid commercial vehicles, so the engine model is simplified and the transient response characteristics of the system are not considered.
The corresponding data, such as torque and speed, were obtained through bench experiments, and the fuel consumption model was obtained using the interpolation in Formula (1). The fuel consumption MAP is shown in
Figure 3.
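As a minimal sketch of such an experimental (map-based) model, the code below interpolates bench-test fuel consumption data over the speed-torque plane; the grid values and function names are hypothetical, and the motor efficiency MAP of Figure 4 can be built in the same way.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical bench-test grid: engine speed (rpm), torque (N*m),
# and measured fuel rate (g/s) at each grid point.
speed_grid = np.array([800, 1200, 1600, 2000, 2400])
torque_grid = np.array([100, 300, 500, 700, 900])
fuel_rate_map = np.array([            # rows: speed, columns: torque
    [0.4, 0.9, 1.5, 2.2, 3.0],
    [0.5, 1.1, 1.8, 2.6, 3.5],
    [0.7, 1.4, 2.2, 3.1, 4.1],
    [0.9, 1.7, 2.7, 3.7, 4.9],
    [1.2, 2.1, 3.2, 4.4, 5.8],
])

# Bilinear interpolation over the measured map, clipped to the tested range.
fuel_model = RegularGridInterpolator((speed_grid, torque_grid), fuel_rate_map)

def engine_fuel_rate(speed_rpm, torque_nm):
    """Interpolated instantaneous fuel consumption (g/s)."""
    speed = np.clip(speed_rpm, speed_grid[0], speed_grid[-1])
    torque = np.clip(torque_nm, torque_grid[0], torque_grid[-1])
    return float(fuel_model([[speed, torque]])[0])

print(engine_fuel_rate(1400, 420))  # example query between grid points
```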
This paper again adopts the experimental modeling method for the motor: the internal operating mechanism is not considered, and only the input-output relationship is used in building the motor model. By testing the driving motor at different speed and torque points on the test bench, parameters such as speed, torque, and current at the shaft end of the driving motor were recorded, and the motor efficiency MAP (the ratio of motor output power to input power) was established, as shown in
Figure 4.
3. Introduction to Deep Reinforcement Learning
Deep learning originated from research on artificial neural networks (ANNs). The mathematical model of an ANN is made up of layers of neurons. ANNs are distributed, parallel information processing algorithms used to simulate the behavior of biological neurons. Depending on the system’s complexity, an ANN processes information by adjusting the interconnections among a large number of internal nodes. Deep learning uses neural networks composed of multiple layers of neurons to approximate functions for machine learning; its structure is a multilayer perceptron with multiple hidden layers. By combining low-level features to form more abstract high-level representations of attribute categories or features, deep learning can discover distributed representations of data [
21].
The goal of reinforcement learning is to find the optimal strategy through trial-and-error learning between the agent and the environment to maximize the expectation of cumulative returns.
A reinforcement learning problem involves a decision-maker, the agent, operating in an environment modeled by states s_t ∈ S. The agent can take an action a_t ∈ A as a function of the current state. After choosing an action at time t, the agent receives a scalar reward r_t ∈ R and finds itself in a new state s_{t+1} that depends on the current state and the chosen action [
22].
At each time step, the agent follows a strategy, called the policy π_t, which is a mapping from states to the probability of selecting each possible action: π_t(s, a) denotes the probability that a_t = a if s_t = s.
The objective of reinforcement learning is to use the interactions of the agent with its environment to derive (or approximate) an optimal policy to maximize the total amount of reward received by the agent over the long run [
23].
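Concretely, the long-run reward is usually formalized as the expected discounted return, which the optimal policy maximizes; the discount factor γ below belongs to this standard formulation and is not a symbol defined elsewhere in this paper:

G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1, \qquad \pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[ G_t \right]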
DRL combines deep learning and reinforcement learning, using the perceptual advantages of deep learning and the decision-making advantages of reinforcement learning to solve complex control problems formulated as Markov decision processes (MDPs). Deep learning provides the learning mechanism, and reinforcement learning provides the learning objective, making deep reinforcement learning capable of solving complex control problems [
24].
4. Twin Delayed Deep Deterministic Policy Gradient Algorithm
4.1. Twin Delayed Deep Deterministic Policy Gradient Algorithm
DRL algorithms fall into three main categories: value-function-based, policy-gradient-based, and search-and-supervision-based. Value-function-based algorithms use a value table or value function to estimate the optimal value function and select the action with the largest value; this approach is usually applied to discrete environments, and for large or continuous action spaces it suffers from the curse of dimensionality and trains poorly. Representative algorithms are Q-learning and DQN. Policy-gradient-based algorithms learn to maximize the reward value of the objective function of the resulting policy to obtain the optimal policy; this approach optimizes better than value-function-based algorithms but is prone to local extrema, with DDPG as a representative algorithm. Search-and-supervision-based algorithms add artificial supervision during the policy search to accelerate learning and achieve better results. In this paper, the improved TD3 algorithm is applied to energy management strategy development; it is built on DDPG and combines the advantages of the DDQN and DDPG algorithms [
23].
The twin delayed deep deterministic policy gradient (TD3) algorithm is an improved off-policy deep reinforcement learning algorithm for solving continuous control problems. In essence, the TD3 algorithm integrates the idea of double Q-learning into the DDPG algorithm, combines the advantages of both, and uses delayed policy updates and target policy smoothing regularization. In complex continuous action spaces, it can output actions efficiently and effectively mitigate overestimation of the Q value.
The TD3 algorithm adopts two critic networks to evaluate the action-value function and then takes the minimum of the two to update the target Q value, as shown in Equation (2):
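In the standard TD3 formulation, this target value takes the form below, where \theta'_1 and \theta'_2 are the target critic parameters and \tilde{a} is the smoothed target action (notation assumed):

y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s', \tilde{a}\big)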
For network updating, the TD3 algorithm also adopts a soft update of the target network parameters, as shown in Formula (3):
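In the standard TD3 soft update, the target network parameters track the online parameters at a small rate \tau (symbols assumed):

\theta'_i \leftarrow \tau \theta_i + (1-\tau)\,\theta'_i, \qquad \phi' \leftarrow \tau \phi + (1-\tau)\,\phi'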
When training the algorithm, random noise ε ~ clip(N(0, σ), −c, c), c > 0, is added to the target action to smooth the target policy and improve the robustness of the algorithm, as shown in Equation (4):
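In the usual TD3 notation, the resulting smoothed target action is obtained by clipping the noisy target-policy output to the admissible action range (a_low and a_high denote the action bounds; this standard form is assumed):

\tilde{a} = \mathrm{clip}\big(\pi_{\phi'}(s') + \epsilon,\ a_{\mathrm{low}},\ a_{\mathrm{high}}\big), \qquad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0,\sigma), -c, c\big)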
The loss function of the TD3 algorithm is defined as the squared error between the target value and the critic estimate, as shown in Equation (5):
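A standard form of this critic loss, consistent with the target value above, is (notation assumed):

L(\theta_i) = \mathbb{E}\big[\big(y - Q_{\theta_i}(s, a)\big)^{2}\big], \qquad i = 1, 2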
In the algorithm design, to avoid correlation between samples, experience replay is adopted: experience data are stored in an experience pool, and when samples are selected for network training they are drawn at random, which breaks the correlation between samples and ensures the efficiency of network updating.
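A minimal sketch of such an experience pool with uniform random sampling is given below; the transition fields mirror the state, action, and reward quantities defined later in this section, and the capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience pool with uniform random sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Random sampling breaks the temporal correlation between samples.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```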
The key technique in DRL is using a deep neural network to fit the Q value function; the network consists of an input layer, hidden layers, and an output layer. The input layer is composed of the states and actions. The number of hidden layers and the number of neurons per layer were obtained by trial and error: after many tests, three hidden layers were chosen, with 30, 120, and 120 neurons, respectively. The ReLU function is used for activation between hidden layers, as shown in (6); the activation from the last hidden layer to the output layer uses the Tanh function, as shown in (7); and the output layer gives the value function.
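A minimal sketch of a critic network with this layer structure is given below using PyTorch; the framework choice, input dimensions, and class name are assumptions, with the input dimensions following the four state variables and single action defined in Section 4.2, and ReLU and Tanh corresponding to the activations referred to in (6) and (7).

```python
import torch
import torch.nn as nn

class CriticQNetwork(nn.Module):
    """Q(s, a) approximator: three hidden layers with 30, 120, and 120 neurons."""

    def __init__(self, state_dim=4, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 30),  # input layer: states and action
            nn.ReLU(),                              # activation (6) between hidden layers
            nn.Linear(30, 120),
            nn.ReLU(),
            nn.Linear(120, 120),
            nn.Tanh(),                              # activation (7) to the output layer
            nn.Linear(120, 1),                      # output: the value function
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```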
4.2. Key Parameter Selection
In the construction of an energy management strategy for hybrid commercial vehicles based on a deep reinforcement learning algorithm, the vehicle controller is regarded as an agent, the power system and driving conditions as the environment, and the ultimate purpose of the controller is to find the optimal control strategy.
The key parameters such as system state, action space, and reward signal are set as follows.
State variables: For the energy management of a PHEV, the system state reflects the vehicle’s characteristics while on the road. In this paper, the normalized acceleration a, the battery state of charge SOC, the demand torque T_req, and the vehicle speed V are taken as the state variables of the algorithm. The state space can be expressed as:
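Consistent with the variables listed above, the state space can be written, under the assumed notation, as:

S = \{\, V,\ a,\ \mathrm{SOC},\ T_{\mathrm{req}} \,\}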
Action variable: The engine output torque T_ice is taken as the action variable of the algorithm:
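Correspondingly, the action space contains only the engine output torque (notation assumed):

A = \{\, T_{\mathrm{ice}} \,\}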
Reward function: The reward function of the strategy affects the algorithm’s convergence. In this paper, the SOC value is taken as a constraint condition, and vehicle pollutant emissions, fuel consumption, and power consumption are considered in the feedback reward function of the algorithm. See the following formula for details:
where R(t) is the reward obtained when the system, in state x at time t, takes the action and transfers to the next state. R1 represents the reward term for the instantaneous fuel consumption of the engine and the pollutant emissions, in which one term is the instantaneous fuel consumption of the engine and CO, HC, and NOx represent the emissions of automobile pollutants; because these quantities differ in dimension, they are normalized before summation. The penalty term is equal to the sum of the maximum emissions and the maximum instantaneous fuel consumption of the engine; a penalty factor weights the SOC deviation from the reference SOC at a given time; and a fuel consumption coefficient weights the fuel term. When the SOC is above the reference SOC, the fuel consumption coefficient is small and the PHEV is powered mainly by the motor; when the SOC falls below the reference SOC, the penalty factor is set to a larger value to increase the torque distributed to the engine. According to the defined reward function, the reward obtained decreases as emissions and fuel consumption increase.
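One plausible explicit form consistent with this description, with R_1 the normalized fuel-and-emissions term, \alpha the fuel consumption coefficient, \beta the SOC penalty factor, and \mathrm{SOC}_{\mathrm{ref}} the reference SOC (all symbols assumed), is:

R(t) = -\Big[\, R_1 + \beta\,\big(\mathrm{SOC}_{\mathrm{ref}} - \mathrm{SOC}\big)^{2} \Big], \qquad R_1 = \alpha\,\dot m_{\mathrm{fuel}} + \mathrm{CO} + \mathrm{HC} + \mathrm{NO}_x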
Based on the above introduction of the principles of the deep reinforcement learning architecture and the setting of the parameters, the optimal state-action value function is defined as follows:
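In its standard form, the optimal state-action value function satisfies the Bellman optimality relation (\gamma is the discount factor; notation assumed):

Q^{*}(s,a) = \mathbb{E}\big[\, R(t) + \gamma \max_{a'} Q^{*}(s', a') \,\big|\, s, a \,\big]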
As the driving condition prediction results affect the vehicle demand torque, an energy management strategy based on driving condition prediction is proposed by combining the driving condition prediction algorithm with the energy management strategy. According to the acceleration probability distribution in Figure 5, the correction factor is determined to correct the required torque of the vehicle. The figure shows that the two curves agree closely and that the vehicle acceleration is mostly distributed between −1.5 m/s² and 1.5 m/s². The correlation coefficient, average error, and standard deviation of the predicted demand torque and the actual demand torque are then calculated to set the correction coefficient: when the acceleration a < −1.5 m/s², the correction coefficient is 0.8; when −1.5 m/s² < a < 1.5 m/s², the correction coefficient is 1; and when a > 1.5 m/s², the correction coefficient is 1.2.
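A minimal sketch of this piecewise correction of the demand torque is shown below; the function and variable names are illustrative, while the thresholds and coefficients follow the values given above.

```python
def torque_correction_factor(accel_mps2):
    """Correction coefficient chosen from the predicted acceleration (m/s^2)."""
    if accel_mps2 < -1.5:
        return 0.8
    if accel_mps2 <= 1.5:
        return 1.0
    return 1.2

def corrected_demand_torque(predicted_torque_nm, accel_mps2):
    # The predicted demand torque is scaled by the acceleration-dependent factor.
    return torque_correction_factor(accel_mps2) * predicted_torque_nm
```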
Figure 6 shows the framework of the deep reinforcement learning energy management strategy based on condition prediction. When the vehicle is running, the BP neural network algorithm is first used to predict the driving conditions, and the vehicle demand torque is corrected according to the predicted value. The corrected vehicle demand torque, speed, battery SOC, and acceleration are the state inputs. After training the target network and the policy network, the engine output torque is output as the action value. The state is updated according to the output action, and the state, action, and reward values are stored in the experience pool.
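The sketch below outlines one interaction step of this framework under the assumption that a condition-prediction model, a TD3 agent, and a vehicle environment object are available; all class and method names are hypothetical and only illustrate the data flow of Figure 6, reusing the correction factor sketched earlier.

```python
def energy_management_step(predictor, agent, env, buffer, state, batch_size=64):
    """One step of the condition-prediction-based TD3 energy management loop."""
    # 1. Predict the short-term driving condition (acceleration) with a BP network.
    predicted_accel = predictor.predict(state)

    # 2. Correct the vehicle demand torque with the acceleration-dependent factor.
    state.demand_torque *= torque_correction_factor(predicted_accel)

    # 3. The TD3 policy network outputs the engine torque as the action.
    engine_torque = agent.select_action(state.as_vector())

    # 4. Apply the action to the powertrain model; observe reward and next state.
    next_state, reward, done = env.step(engine_torque)

    # 5. Store the transition in the experience pool and train on a random batch.
    buffer.store(state.as_vector(), engine_torque, reward, next_state.as_vector(), done)
    if len(buffer) >= batch_size:
        agent.update(buffer.sample(batch_size))

    return next_state, done
```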