Research on Energy Management Strategy of a Hybrid Commercial Vehicle Based on Deep Reinforcement Learning

Introduction
The power system of a hybrid electric vehicle comprises multiple power sources, and an energy management strategy distributes power among them to improve the vehicle's fuel economy and driving range. As one of the crucial technologies of hybrid electric vehicles, the vehicle control strategy primarily addresses energy management and torque distribution in plug-in hybrid electric vehicles (PHEVs). Control strategies can be divided into rule-based, optimization-based, and learning-based strategies.
The rule-based control strategy, also called the logic threshold-based control strategy, aims to keep the engine working in its high-efficiency zone. When the engine load is small, the engine stops and the motor drives the vehicle alone; when the engine load is moderate, the engine works in its high-efficiency zone and either drives the vehicle or charges the battery; when the engine load is large, the motor assists so that the engine still works only in its high-efficiency zone. This strategy lets the engine provide steady-state power and the motor provide transient power, which improves the vehicle's fuel economy. It is simple, reliable, and practical. Ping Li et al. [1] used particle swarm optimization (PSO) to optimize the threshold parameters of a rule-based energy management strategy. To improve adaptability, multiple historical driving cycles were used to optimize the parameters, yielding a rule-based strategy that adapts to unknown driving cycles. Abdoulaye Pam et al. [2] used dynamic programming (DP) to determine the ideal energy efficiency of the studied vehicle over a given driving cycle; a rule-based energy management strategy (EMS) can then be derived by analyzing the DP-EMS results. Charbel J. Mansour [3] proposed a rule-based energy management method that uses dynamic programming as the global optimization routine to enable real-time energy management of the Prius plug-in hybrid electric vehicle. The optimization considers the route the driver selects on the on-board global positioning system and is linked to the traffic management system. Rule-based control strategies are widely used in engineering because they require little computation and run fast. However, they rely heavily on the designer's experience and have poor portability, so optimal control of vehicle power is difficult to achieve in practice.
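The logic-threshold idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds `t_low` and `t_high` (bounds of the engine's efficient torque band), the `soc_min` limit, and the `TorqueSplit` container are hypothetical and do not come from any of the cited works.

```python
from dataclasses import dataclass


@dataclass
class TorqueSplit:
    engine: float  # engine torque in N*m
    motor: float   # motor torque in N*m (negative means the motor charges the battery)


def rule_based_split(t_req, soc, t_low=200.0, t_high=600.0, soc_min=0.3):
    """Logic-threshold torque split; all threshold values are illustrative assumptions."""
    if t_req < t_low:
        if soc > soc_min:
            # Light load, enough charge: engine off, motor drives alone.
            return TorqueSplit(engine=0.0, motor=t_req)
        # Light load, low charge: engine runs at the lower efficient limit,
        # and the surplus torque charges the battery through the motor.
        return TorqueSplit(engine=t_low, motor=t_req - t_low)
    if t_req <= t_high:
        # Moderate load: the engine alone covers the demand inside its efficient band.
        return TorqueSplit(engine=t_req, motor=0.0)
    # Heavy load: engine capped at the upper efficient limit, motor assists.
    return TorqueSplit(engine=t_high, motor=t_req - t_high)
```

The fixed thresholds are exactly what makes such a strategy cheap to run and, at the same time, dependent on calibration experience.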
The operation of the rule-based control strategy is independent of the working conditions. Although it can improve the vehicle's fuel economy to a certain extent, the results are not optimal. According to the optimization objective, optimization-based control strategies can be divided into instantaneous and global optimization strategies. Mansour et al. [4] proposed a simple adaptive rule strategy based on short-term driving pattern recognition and a dynamic programming global optimization routine. Mingming Gao et al. [5] proposed an energy management strategy for an extended-range electric bus (REEB) based on a convex optimization algorithm, which is well suited to the REEB energy management system and meets the requirements of the power system. Kegang Zhao [6] proposed the Radau pseudospectral knot method (RPKM) to solve the globally optimal energy management problem of series-parallel plug-in hybrid electric vehicles with improved computational efficiency. Jian Wu et al. [7] used PSO combined with various driving conditions to optimize the logic threshold parameters of a rule-based energy management strategy, with the vehicle dynamic performance index as the constraint and the equivalent fuel consumption rate as the optimization objective.
With the development of big data and computer technology, machine learning is widely used in vehicle energy management strategies. The learning-based control strategy does not depend on expert experience or on a numerical model of the controlled object. Instead, it uses advanced data mining methods and historical or real-time empirical data to obtain prediction results or control strategies. With the help of intelligent algorithms, the state space and the state-action space of the energy management problem can be treated as continuous, which avoids the discretization problems that arise in DP-based optimization [8][9][10][11][12]. Tawfiq M. Aljohani et al. [13] proposed a real-time, metadata-driven electric vehicle route optimization method to reduce on-road energy demand. The strategy uses the state-action-reward-state-action (SARSA) algorithm to learn the maximum travel policy of the electric vehicle, which acts as the agent. Weihan Li et al. [14] proposed a multi-objective energy management strategy based on a cloud-based hybrid architecture. This strategy uses the deep deterministic policy gradient (DDPG) algorithm, which can improve the system's electrical and thermal safety and minimize the system's energy loss and aging cost. Weihan Li et al. [15] designed a new reward to explore the optimal working range of high-power battery packs without imposing strict state-of-charge constraints. In training the deep Q-learning models, different load curves are randomly combined to avoid over-fitting.
Yue S et al. [16] solved the vehicle energy management problem of a compound power supply using the temporal difference method. Li Wei et al. [17], from the University of Chinese Academy of Sciences, added a battery life factor to the reward function of the deep reinforcement learning (DRL) algorithm to extend battery life and verified the strategy's adaptability to working conditions in simulation. Tang Xiaolin et al. [18], from Chongqing University, used the deep value network algorithm to complete the upper-level tracking control and the lower-level energy management, thus improving the fuel economy of the two vehicles. Zhao Chunling et al. [19], from Chongqing Jiaotong University, applied the DRL algorithm to the energy allocation problem of PHEVs, which not only reduces the pollutant emissions of diesel engines but also greatly improves the fuel economy of the whole vehicle. Zhang Song et al. [20] took hybrid electric buses as the research object, applied the DDQN and TD3 algorithms to vehicle energy management, and adopted prioritized experience replay to optimize the strategy, demonstrating its effectiveness.
In actual road driving, the energy management strategy under different working conditions is easily affected by random factors, and real-time optimal energy management is difficult to achieve. To solve this problem, this paper first predicts the driving conditions, determines the correction factor of the demand torque distribution based on the prediction results, and corrects the actual demand torque of the vehicle. Finally, an energy management strategy based on TD3 is designed, completing the development of an energy management strategy based on condition prediction.

Vehicle Power System Construction
As shown in Figure 1, the research object of this paper is a P2-configuration parallel hybrid commercial vehicle produced by a company. The main difference between the P2 configuration and other configurations is that a clutch controls the front and rear sides of the motor, so both the motor and the engine can drive the vehicle independently.
World Electr. Veh. J. 2023, 14, x FOR PEER REVIEW
Figure 2 shows the structure of the vehicle power system. Its main components are the diesel engine, motor, power battery pack, clutch, five-speed transmission, and vehicle control unit (VCU). The main parameters are listed in Table 1.
The engine is modeled experimentally. Because this paper focuses on the energy management of hybrid commercial vehicles, the engine model is simplified and the transient response characteristics of the system are not considered.
The corresponding data, such as torque and speed, were obtained through bench experiments, and the experimental fuel consumption model was obtained by interpolation, as given in Formula (1). The resulting fuel consumption map is shown in Figure 3.
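As a minimal sketch of how such a map-based model can be queried, the following Python code performs bilinear interpolation over a speed-torque grid. The grid values in `SPEED`, `TORQUE`, and `FUEL` are hypothetical placeholders, not the bench data of this paper, and bilinear interpolation is an assumed choice since the exact form of Formula (1) is not reproduced here.

```python
from bisect import bisect_right


def interp2(xs, ys, table, x, y):
    """Bilinear interpolation of table[i][j] given at (xs[i], ys[j]).

    Query points outside the grid are clamped to the grid boundary.
    """
    x = min(max(x, xs[0]), xs[-1])
    y = min(max(y, ys[0]), ys[-1])
    # Index of the grid cell that contains the query point.
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)
    j = min(bisect_right(ys, y) - 1, len(ys) - 2)
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    low = table[i][j] * (1 - ty) + table[i][j + 1] * ty
    high = table[i + 1][j] * (1 - ty) + table[i + 1][j + 1] * ty
    return low * (1 - tx) + high * tx


# Hypothetical bench grid: engine speed (rpm) x torque (N*m) -> fuel rate (g/s).
SPEED = [1000.0, 2000.0]
TORQUE = [100.0, 300.0]
FUEL = [[1.0, 2.0],
        [2.0, 4.0]]

fuel_rate = interp2(SPEED, TORQUE, FUEL, 1500.0, 200.0)
```

The same lookup routine could serve a motor efficiency map, since both models are built from bench measurements on a speed-torque grid.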
The motor is also modeled experimentally: its internal operating mechanism is not considered, and only the input-output relationship is modeled. The driving motor was tested on the experimental bench at different speed and torque points, and parameters such as speed, torque, and current at the motor shaft end were recorded. From these data, the motor efficiency map (efficiency being the ratio of motor output power to input power) was established, as shown in Figure 4.

Introduction to Deep Reinforcement Learning
Deep learning originated from research on artificial neural networks (ANNs). The mathematical model of an ANN is made of layers of neurons. ANNs are distributed, parallel information processing algorithms that simulate the behavior of animal neurons. Depending on the system's complexity, ANNs process information by adjusting the interconnections among a large number of internal nodes. Deep learning uses a neural network composed of multiple layers of neurons to approximate the target function of a machine learning task. The structure used in deep learning is a multilayer perceptron with multiple hidden layers. By combining low-level features to form more abstract high-level representations of attribute categories or features, deep learning can discover distributed representations of data [21].
The goal of reinforcement learning is to find the optimal strategy through trial-and-error interaction between the agent and the environment so as to maximize the expected cumulative return.
A reinforcement learning problem involves a decision-maker, the agent, operating in an environment modeled by states $s_t \in S$. The agent can take actions $a_t \in A$ as a function of the current state. After choosing an action at time t, the agent receives a scalar reward $r_{t+1} \in \mathbb{R}$ and finds itself in a new state $s_{t+1}$ that depends on the current state and the chosen action [22].
At each time step, the agent follows a strategy, called the policy $\pi_t$, which maps states to the probability of selecting each possible action: $\pi_t(s, a)$ denotes the probability that $a_t = a$ if $s_t = s$.
The objective of reinforcement learning is to use the interactions of the agent with its environment to derive (or approximate) an optimal policy to maximize the total amount of reward received by the agent over the long run [23].
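To make the notions of a stochastic policy and cumulative reward concrete, here is a small Python sketch. The two-state policy table `POLICY`, its state and action names, and the discount factor `gamma` are illustrative assumptions and are not taken from this paper.

```python
import random

# Hypothetical policy table: POLICY[s][a] is pi(s, a), the probability of action a in state s.
POLICY = {
    "low_soc": {"engine": 0.9, "motor": 0.1},
    "high_soc": {"engine": 0.2, "motor": 0.8},
}


def sample_action(policy, state, rng):
    """Draw one action according to the probabilities pi(state, .)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def cumulative_reward(rewards, gamma=0.99):
    """Discounted sum r_1 + gamma * r_2 + gamma^2 * r_3 + ... (gamma is an assumed discount)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rng = random.Random(0)
action = sample_action(POLICY, "high_soc", rng)
```

An optimal policy is one whose action probabilities maximize the expected value of `cumulative_reward` over the trajectories it generates.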
DRL combines deep learning and reinforcement learning, using the perceptual strengths of deep learning and the decision-making strengths of reinforcement learning to solve complex control problems formulated as Markov decision processes (MDPs). The deep Q-learning network is one such combination. Deep learning provides the learning mechanism, and reinforcement learning provides the learning objective, which makes deep reinforcement learning capable of solving complex control problems [24].

Twin Delayed Deep Deterministic Policy Gradient Algorithm
DRL algorithms fall into three main categories: value function-based, policy gradient-based, and search and supervision-based. Value function-based algorithms use a value table or value function to estimate the optimal value function and select the action with the largest value. This approach is usually applied to discrete environments; for large or continuous action spaces it is prone to the curse of dimensionality, and training results are poor. Representative algorithms are Q-learning and DQN. Policy gradient-based algorithms learn the optimal policy by maximizing the expected reward of the resulting policy. They optimize better than value function-based algorithms but are prone to local optima; a representative algorithm is DDPG. Search and supervision-based algorithms add human supervision to the policy search to accelerate learning and achieve better results. In this paper, the TD3 algorithm, which improves on DDPG and combines the advantages of the DDQN and DDPG algorithms, is applied to energy management strategy development [23].
The twin delayed deep deterministic policy gradient (TD3) algorithm is an improved off-policy deep reinforcement learning algorithm for continuous control problems. In essence, TD3 integrates the idea of double Q-learning into DDPG, combines the advantages of both, and adds delayed policy updates and target policy smoothing regularization. In a complex continuous action space, it outputs actions efficiently and effectively mitigates overestimation of the Q value.
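The TD3 algorithm uses two critic networks to estimate the action-value function and takes the smaller of the two estimates to compute the target Q value, as shown in Equation (2):
$$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}) \tag{2}$$
Here $r$ is the reward, $\gamma$ is the discount factor, $s'$ is the next state, $\tilde{a}$ is the smoothed target action, and $\theta_i'$ are the parameters of the two target critic networks.
For the network update, TD3 uses a soft update of the target network parameters, as shown in Formula (3), where $\theta_i$ and $\phi$ are the parameters of the critic and actor networks, $\phi'$ are the parameters of the target actor network, and $\tau$ is the soft update rate:
$$\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i', \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\phi' \tag{3}$$
During training, clipped random noise is added to the target action to improve the robustness of the algorithm, as shown in Equation (4):
$$\tilde{a} = \pi_{\phi'}(s') + \epsilon, \qquad \epsilon \sim \mathrm{clip}(N(0, \sigma), -c, c), \quad c > 0 \tag{4}$$
The loss function of the TD3 algorithm is defined as the squared error between the target value and the critic estimate, as shown in Equation (5):
$$L(\theta_i) = \mathbb{E}\left[\left(y - Q_{\theta_i}(s, a)\right)^2\right] \tag{5}$$
In the algorithm design, experience replay is adopted to avoid correlation between samples: experience data are stored in an experience pool, and samples for network training are drawn from it at random, which breaks the correlation between samples and keeps network updates efficient.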
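The update quantities described above (minimum of two target critics, clipped target noise, soft update, squared-error loss) can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in this paper: the critics are passed in as ordinary callables, and the default values of `gamma`, `sigma`, `c`, and `tau` are assumed, commonly used settings.

```python
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def td3_target(reward, next_action, q1_target, q2_target, next_state,
               gamma=0.99, sigma=0.2, c=0.5, rng=random):
    """Target value: smoothed target action, then the smaller of the two target critics."""
    noise = clip(rng.gauss(0.0, sigma), -c, c)   # clipped Gaussian noise
    a_tilde = next_action + noise                # smoothed target action
    q_min = min(q1_target(next_state, a_tilde), q2_target(next_state, a_tilde))
    return reward + gamma * q_min


def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def critic_loss(targets, q_values):
    """Mean squared error between target values and critic estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)
```

Taking the minimum of the two critics biases the target downward, which is what counteracts the overestimation of the Q value.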
The key technology in DRL is the deep neural network that fits the Q value function; its structure consists of an input layer, hidden layers, and an output layer. The input layer is composed of states and actions. The number of hidden layers and the number of neurons per layer were chosen by trial and error. After many tests, three hidden layers with 30, 120, and 120 neurons were selected. The ReLU function, shown in Equation (6), is used as the activation between hidden layers, and the Tanh function, shown in Equation (7), is used as the activation from the last hidden layer to the output layer; the output layer gives the value function.
$$f(x) = \max(0, x) \tag{6}$$
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{7}$$
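A forward pass through a network with this layout can be sketched in plain Python as shown below. The input size of 5 (four state variables plus one action), the uniform weight initialization, and the zero biases are assumptions made for illustration; only the hidden-layer sizes and activation functions follow the description above.

```python
import math
import random

# 5 inputs (4 state variables + 1 action, assumed), hidden sizes from the text, 1 output.
LAYER_SIZES = [5, 30, 120, 120, 1]


def init_params(sizes, rng):
    """Create (weights, biases) per layer with a simple uniform initialization (assumed)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """ReLU on the hidden layers, tanh on the output layer."""
    for k, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        is_output = k == len(params) - 1
        x = [math.tanh(v) if is_output else max(0.0, v) for v in z]
    return x


params = init_params(LAYER_SIZES, random.Random(0))
# Hypothetical normalized input: acceleration, SOC, demand torque, speed, engine torque.
value = forward(params, [0.1, 0.6, 0.3, 0.5, 0.2])
```

Because of the tanh output, the value estimate is bounded in [-1, 1], so rewards would need to be scaled accordingly in such a setup.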

Key Parameter Selection
In the construction of an energy management strategy for hybrid commercial vehicles based on a deep reinforcement learning algorithm, the vehicle controller is regarded as an agent, the power system and driving conditions as the environment, and the ultimate purpose of the controller is to find the optimal control strategy.
The key parameters such as system state, action space, and reward signal are set as follows.
State variables: For energy management of PHEVs, the system state reflects the vehicle's characteristics while on the road. In this paper, the normalized acceleration ($a$), state of charge ($SOC$), demand torque ($T_{req}$), and vehicle speed ($V$) are taken as the state variables of the algorithm. The state space can be expressed as:
$$S = \{a, SOC, T_{req}, V\}$$
Action variable: The engine output torque $T_{ice}$ is taken as the action variable of the algorithm:
$$A = \{T_{ice}\}$$
Reward function: The reward function affects the algorithm's convergence. In this paper, the SOC value is taken as the constraint condition, and vehicle pollutant emissions, fuel consumption, and power consumption are considered in the reward function. In the reward function, $R(t)$ is the reward obtained when the system, in state $x$ at time $t$, takes the action and transitions to the next state. $R_1$ is the reward term for the instantaneous fuel consumption and pollutant emissions of the engine; CO, HC, and NO$_x$ denote the pollutant emissions. Because these quantities have different dimensions, they are normalized before summation. $\lambda$ is the penalty term, equal to the sum of the maximum emission and the maximum instantaneous fuel consumption of the engine; $\beta$ is the penalty factor for SOC change; $SOC_{req}(t)$ is the reference SOC at a given time; and $\alpha$ is the fuel consumption coefficient. When $SOC > SOC_{req}$, the fuel consumption coefficient is small and the PHEV is powered mainly by the motor; when $SOC < SOC_{req}$, the penalty factor is set to a larger value to increase the torque distributed to the engine. According to the defined reward function, the reward decreases as emissions and fuel consumption increase.
Based on the above principles of the deep reinforcement learning architecture and the parameter settings, the optimal state-action value function is defined as follows:
$$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$
As the driving condition prediction results affect the vehicle demand torque, an energy management strategy based on driving condition prediction is proposed by combining the driving condition prediction algorithm with the energy management strategy. According to the acceleration probability distribution in Figure 5, a correction factor is determined to correct the required torque of the vehicle. The figure shows that the two curves agree closely and that vehicle acceleration lies mostly between −1.5 m/s² and 1.5 m/s². We then calculate the correlation coefficient, average error, and standard deviation between the predicted and actual demand torque to set the correction coefficient: when the acceleration a < −1.5 m/s², the correction coefficient is 0.8; when −1.5 m/s² < a < 1.5 m/s², the correction coefficient is 1; and when a > 1.5 m/s², the correction coefficient is 1.2.
Figure 6 shows the framework of deep reinforcement learning energy management based on condition prediction. When the vehicle is running, the BP neural network algorithm is first used to predict the driving conditions, and the vehicle demand torque is corrected according to the predicted value. The corrected vehicle demand torque, speed, battery SOC, and acceleration are the state inputs. After training of the target and policy networks, the engine output torque is output as the action. The state is updated according to the output action, and the state, action, and reward are stored in the experience pool.
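The torque correction rule stated above is simple enough to express directly. The sketch below uses the thresholds and coefficients given in the text; the function names are hypothetical, and assigning a coefficient of 1 exactly at a = ±1.5 m/s² is an assumption, since the text leaves the boundary values unspecified.

```python
def correction_coefficient(accel):
    """Correction coefficient for a predicted acceleration in m/s^2."""
    if accel < -1.5:
        return 0.8
    if accel > 1.5:
        return 1.2
    return 1.0  # includes the boundaries a = -1.5 and a = 1.5 (assumed)


def corrected_torque(t_req, predicted_accel):
    """Scale the demand torque (N*m) by the coefficient for the predicted acceleration."""
    return correction_coefficient(predicted_accel) * t_req
```

For example, a demand torque of 100 N·m with a predicted acceleration of 2 m/s² is raised to about 120 N·m before it enters the state vector.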

Validity Verification
To verify the effectiveness and adaptability of the proposed strategy, this section uses the constructed vehicle model and simulation environment to train the strategy. In the simulation model, the initial SOC is 0.8 and the final SOC is 0.3 (Table 2). To verify effectiveness, a driving condition is selected as the simulation condition, and a rule-based energy management strategy is selected as the benchmark for evaluating the DRL-based strategies. The rule-based strategy is determined according to the optimal operating interval of the engine and the upper and lower limits of SOC. The three strategies (rule-based control, TD3, and TD3 considering condition prediction) are simulated under the same working conditions, with TD3 serving as the deep reinforcement learning algorithm. The simulation results of the three strategies are shown in Figure 7.
As shown in Figure 7a, the TD3 energy management strategy considering condition prediction is continuous in the state space and can realize continuous control of throttle opening. In driving condition 1, the DRL-based strategy lets the vehicle run smoothly. When the vehicle starts to run, the engine and the motor work together, and the motor's output torque is greater than the engine's. As the speed increases, the torque demand increases gradually. The distribution of engine operating points shows that the operating points of the improved strategy lie in a reasonable range and in the high-efficiency zone.
As shown in Figure 7b, the engine operating points of the TD3-based energy management strategy are, to some extent, similar to those of the TD3 strategy considering working condition prediction. However, compared with the unimproved TD3 strategy, the engine operating points of the improved TD3 strategy lie more in the high-efficiency zone with low fuel consumption.
Figure 7c shows the corresponding result for the rule-based control strategy. As can be seen from Figure 8, the TD3 strategy considering working condition prediction makes fuller use of the motor. When the vehicle starts, the driving motor works first and exploits its high torque at low speed, which keeps the engine out of its inefficient zone. During braking, the braking energy is recovered. The final SOC values of the three strategies fluctuate around 0.3. The simulation results show that the fuel consumption of the TD3 energy management strategy considering working condition prediction is 3.18% and 7.63% lower than that of the TD3 energy management strategy and the rule control strategy, respectively. The simulation results are shown in Table 3.

Adaptability Verification
In this paper, we utilize MATLAB, Python, PyCharm, and other software to conduct joint simulations. The adaptability of the deep reinforcement learning energy management strategy considering driving condition prediction is verified under two other driving conditions. The simulation model still adopts the above training model. The simulation results are shown in Figures 9 and 10, and the comparative data of SOC and fuel consumption are shown in Tables 4 and 5.
As can be seen from Figures 9 and 10, the proposed strategy can adapt to three different driving conditions and shows good fuel economy. The final SOC value of the battery remains stable at around 0.3. With roughly the same final SOC value, the TD3 strategy considering driving condition prediction reduces fuel consumption under the two driving conditions by 8.32% and 7.74%, respectively, compared with the rule-based control strategy. Compared with the TD3 strategy, the introduction of driving condition prediction reduces fuel consumption by 3.08% and 2.84%, respectively.
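The percentage savings quoted above are relative reductions against a baseline strategy. As a minimal sketch of that arithmetic (the function name and the sample fuel figures are illustrative assumptions, not values taken from Tables 4 and 5), the calculation can be written as:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction of `improved` relative to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline fuel consumption must be positive")
    return (baseline - improved) / baseline * 100.0


# Hypothetical fuel consumption values (L/100 km), for illustration only.
rule_based = 25.0
td3_with_prediction = 23.0
saving = relative_reduction(rule_based, td3_with_prediction)
print(f"fuel saving: {saving:.2f}%")
```

Such a comparison is only meaningful when the final SOC values of the compared strategies are roughly equal, as the text notes; otherwise part of the apparent saving would come from a different amount of battery energy used.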

Conclusions
Based on condition prediction results and a deep reinforcement learning algorithm, this paper proposes an energy management strategy: deep reinforcement learning considering condition prediction. A BP neural network is used to predict the speed information for the next 5 s, and the resulting correction factor corrects the required torque of the vehicle. The simulation results show that, under different training conditions, the proposed strategy makes full use of the characteristics of the drive motor, keeps the engine working in its optimal range, and adapts well to different conditions. Introducing condition prediction effectively reduces the fuel consumption of the deep reinforcement learning energy management strategy and gives it a stronger self-learning ability.

Figure 1. The research object of this study.

Figure 6 shows the framework of energy management of deep reinforcement learning based on condition prediction. When the vehicle is running, the BP neural network is first used to predict the driving conditions, and the vehicle demand torque is corrected according to the predicted value. The corrected vehicle demand torque [...] acceleration are state inputs. After the training of the target network, the engine output torque with action value is output. We update the state [...] output action and store the status, action, and reward value in the [...]
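The data flow described for Figure 6 (correct the demand torque, build the state, output an engine torque action, store the result) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the multiplicative form of the torque correction, the two-element state, the uniform-sampling buffer, and all names (`Transition`, `ReplayBuffer`, `correct_torque`) are hypothetical.

```python
import random
from collections import deque
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    state: Tuple[float, ...]       # assumed: (corrected demand torque, acceleration)
    action: float                  # engine output torque chosen by the policy
    reward: float
    next_state: Tuple[float, ...]


class ReplayBuffer:
    """Fixed-capacity store; the oldest transitions are dropped first."""

    def __init__(self, capacity: int) -> None:
        self._data = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Uniform sampling without replacement (an assumption of this sketch).
        return rng.sample(list(self._data), batch_size)

    def __len__(self) -> int:
        return len(self._data)


def correct_torque(demand_torque: float, correction_factor: float) -> float:
    # Assumption: the prediction-based correction is a simple scaling factor.
    return demand_torque * correction_factor


# Usage with made-up numbers: one control step stored in the buffer.
buffer = ReplayBuffer(capacity=10_000)
state = (correct_torque(200.0, 1.05), 0.4)
next_state = (correct_torque(190.0, 1.02), 0.3)
buffer.push(Transition(state, action=160.0, reward=-0.8, next_state=next_state))
```

Storing the next state alongside the status, action, and reward is the usual choice for off-policy methods such as TD3, since the target networks need it to compute the bootstrapped value.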

Figure 6. Framework diagram of deep reinforcement learning energy management strategy considering working condition prediction.

Figure 9. Simulation results of TD3 strategy considering driving mode prediction under mode 2.

Figure 10. Simulation results of TD3 strategy considering driving mode prediction under driving mode 3.

Table 1. Basic parameters of the vehicle and key components.

Table 3. Comparison of simulation results.

Table 4. Comparison of simulation results.

Table 5. Comparison of simulation results.
