Deep Reinforcement Learning-Based Real-Time Joint Optimal Power Split for Battery–Ultracapacitor–Fuel Cell Hybrid Electric Vehicles

Hybrid energy storage systems (HESSs) for hybrid electric vehicles (HEVs), consisting of multiple complementary energy sources, are becoming increasingly popular, as they reduce the risk of running out of electricity and increase the overall lifetime of the battery. However, designing an efficient power split optimization algorithm for HEVs is a challenging task due to their complex structure. Thus, in this paper, we propose a model that jointly learns the optimal power split for a battery/ultracapacitor/fuel cell HEV. Concerning the mechanical system of the HEV, two propulsion machines with complementary operating characteristics are employed to achieve higher efficiency. Additionally, to train and evaluate the model, standard driving cycles and real driving cycles are employed as input to the mechanical system. Given these inputs, a temporal attention long short-term memory model predicts the velocity at the next time step; from that velocity, the predicted load power is computed, and the corresponding optimal power split is determined by a soft actor–critic deep reinforcement learning model whose training phase is aided by shaped reward functions. In contrast to global optimization techniques, the local velocity and load power prediction without future knowledge of the driving cycle is a step toward real-time optimal energy management. The experimental results show that the proposed method is robust to different initial state of charge values and allocates power among the energy sources more effectively, thus better managing the states of charge of the battery and the ultracapacitor. Additionally, the use of two motors significantly increases the efficiency of the system, and the prediction step is shown to be a reliable way to plan the HESS power split in advance.


Introduction
The basic operating principle of internal combustion engine (ICE) vehicles involves transforming energy from fossil fuels into thermal energy. During this combustion, the gases generated are released into the atmosphere with a negative impact on the environment and human health. In particular, the transportation sector accounted for 29% of total United States greenhouse gas (GHG) emissions in 2019 [1]. Furthermore, there is a growing concern regarding the scarcity of fossil fuels and the need to implement a sustainable economy based on renewable energy sources. The Sustainable Development Goals (SDGs), established in 2015 by the United Nations General Assembly, have sustainable energy at their core; the seventh goal specifically emphasizes the need for renewable energy sources [2]. In light of society's growing concern with environmental issues, the development of battery electric vehicles (BEVs) is a step toward the fulfillment of the seventh SDG.
The engine is the primary distinction between a traditional ICE car and an electric vehicle (EV). The latter is powered by an electric motor that works by transforming chemical energy stored in rechargeable batteries into electrical energy. Another technique that has been emerging in the area of HEV EMSs is deep reinforcement learning (DRL) [24][25][26][27][28]. These works explore different DRL techniques, such as deep Q-learning (DQN), soft actor–critic (SAC) and deep deterministic policy gradient (DDPG), for different HEV topologies. In particular, by applying reinforcement learning to real driving data, energy savings of about 16% compared with an existing binary control strategy were confirmed [28]. There, the energy efficiency was improved by analyzing the vehicle's recorded driving cycle and dividing the power output level of the ICE into 24 levels through reinforcement learning, instead of simply following the rules set by the EMS.
However, we found only one work that addresses an EMS control strategy for a dual-motor battery/UC/FC HEV [20], likely owing to the complexity of this structure, and it relies on convex optimization and machine learning techniques rather than reinforcement learning. Additionally, even though the authors in [27] develop a reinforcement learning-based model for an FC/battery/UC HEV, they use a deep Q-network model and explore neither the use of multiple motors nor the forecasting of future load power. Thus, in this paper, a real-time deep reinforcement learning (SAC)-based EMS control strategy is proposed for the HESS of a vehicle that has three energy sources (battery, FC and UC) and two complementary motors. First, a method for real-time velocity and load power prediction using SDCs and RDCs is proposed. This prediction step allows the EMS to better plan the use of the HEV's resources and prevent it from running out of electricity. Given the predicted load power, the SAC model is then used to distribute energy efficiently within the HESS. The experimental results confirm that, compared with traditional rule-based methods, the proposed method allocates power among the energy sources more effectively, achieving good FC efficiencies and better managing the SOCs of the battery and the ultracapacitor. The model is also shown to be robust, as it can handle different initial SOC values while satisfying all the constraints of the system. Moreover, adding reward shaping to the training phase of the SAC agent accelerated its convergence. Furthermore, the use of two complementary motors leads to a large improvement in vehicle efficiency compared with a single-motor architecture. Finally, the results of the prediction step show that the method reliably forecasts future speeds and load powers and therefore aids the model in allocating the resources of the HEV.

Preliminaries
Overall Procedure for a Reinforcement Learning-Based Energy Management Strategy (EMS)

Figure 1 illustrates the proposed method. The inputs of the neural network are the past five velocity and acceleration values of either standard driving cycles (SDCs) or real driving cycles (RDCs). Given those values, the neural network predicts the velocity and acceleration one time step ahead, which are used to compute the load powers of the two electric motors. These load powers are sent to the EMS, which distributes the load power among the battery, the fuel cell and the ultracapacitor using deep reinforcement learning (DRL).
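To make the data flow concrete, the sketch below traces one step of this pipeline. The component names (`predictor`, `mechanical_model`, `agent`) are placeholders for the temporal-attention LSTM, the mechanical model of Equations (1)-(14) and the SAC agent described in the following sections; this is an illustration of the procedure, not the authors' code.

```python
def ems_step(past_velocities, dt, predictor, mechanical_model, agent,
             soc_bat, soc_uc):
    """One illustrative control step of the proposed pipeline."""
    v_next = predictor(past_velocities)            # next-step velocity forecast
    a_next = (v_next - past_velocities[-1]) / dt   # implied acceleration
    p_load = mechanical_model(v_next, a_next)      # load power of the motors
    state = (p_load, soc_bat, soc_uc)              # DRL state (defined later)
    return agent(state)                            # (P_bat, P_UC, P_FC) split
```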

Input Data
For the input, both SDCs and RDCs were used. In particular, the worldwide harmonized light vehicles test cycles (WLTC) and the urban dynamometer driving schedule (UDDS) were used. Despite their ease of use, SDCs do not represent reality well enough, so they were mainly used for model validation and performance evaluation purposes. RDCs, on the other hand, overcome these limitations because of the higher degree of complexity associated with them. There are personal elements, such as the driver's behavior, which is hard to model because of its time-varying nature. There are also factors that do not depend on the driver, such as weather conditions (snowstorms, heavy rain, flooding), traffic signals and traffic conditions (construction and accidents may slow down traffic). The RDC data, collected by an On-Board Diagnostics (OBD)-II dongle connected to one of the authors' vehicles, are the same as those used in Hong et al. [29]. The four RDCs employed in this paper are shown in Figure 2.

Neural Network for Velocity Forecast
The problem the neural network is trying to solve can be modeled as a univariate time series; that is, there is only one time-varying variable, the velocity of the vehicle. Despite the challenges involved, time series forecasting is an important field of study because knowing the future allows us to better plan short-term and long-term goals. In our work in particular, we predict the velocity in order to optimally allocate the available resources, so that the energy storage components may achieve maximum efficiency, their lifetimes may be maximized and the vehicle may not run out of electricity during the trip.
Based on the previous work of Hong et al. [29], which used the same RDC dataset as the present paper and evaluated different time series and deep learning models for velocity prediction (seasonal autoregressive integrated moving average (SARIMA), recurrent neural network (RNN), gated recurrent unit (GRU) and long short-term memory (LSTM)), we decided to use the attention-based LSTM model with step size T = 5 and hidden state size m = 32 proposed by those authors, as it outperformed the other models tested.
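As a rough illustration of this architecture, the Keras sketch below builds an LSTM whose T = 5 hidden states are combined through a learned softmax weighting (a generic form of temporal attention) before the next-step velocity is predicted. The exact attention formulation of [29] may differ; treat this as a minimal stand-in.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

T, m = 5, 32  # step size and hidden state size from the paper

inputs = layers.Input(shape=(T, 1))                 # past T velocities
h = layers.LSTM(m, return_sequences=True)(inputs)   # one hidden state per step
scores = layers.Dense(1)(h)                         # attention logit per step
weights = layers.Softmax(axis=1)(scores)            # temporal attention weights
context = layers.Lambda(
    lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([h, weights])  # weighted sum
v_next = layers.Dense(1)(context)                   # next-step velocity
model = Model(inputs, v_next)
model.compile(optimizer="adam", loss="mae")
```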

Vehicle System Overview
The vehicle is mainly composed of the HESS and the mechanical components shown in Figure 3. The HESS is composed of three energy sources (battery/ultracapacitor/fuel cell), and the HEV has two complementary propulsion machines.

Power Split Optimization for the Propulsion Machines
The mechanical model of the vehicle may be computed by Equations (1)-(14), and its parameters are given in Table 1. The acceleration values needed for the equations are obtained from Section 2.3 by differentiating the forecast velocity values. These acceleration values are then used to compute the total force $F_t$ acting on the vehicle, given by Equation (1), which consists of the rolling resistance $F_{rr}$, the aerodynamic drag $F_d$, the grading resistance $F_{gr}$ and the linear acceleration force $F_{la}$. Given the total force, the load power and the power loss for each motor may be computed according to the motor efficiencies. Then, combining Equations (1)-(14) with the power split optimization procedure proposed by [19], the optimal power split and losses for the two motors may be computed.
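For orientation, the snippet below evaluates a standard longitudinal vehicle model of this form. It assumes the usual textbook expressions for the four force terms; the actual Equations (1)-(14) and the parameter values of Table 1 are not reproduced here, so all numbers are placeholders.

```python
import math

M, G = 1500.0, 9.81      # vehicle mass (kg) and gravity; placeholder values
MU_RR, RHO = 0.01, 1.2   # rolling coefficient and air density; placeholders
C_D, A_F = 0.30, 2.2     # drag coefficient and frontal area (m^2); placeholders

def total_force(v, a, theta=0.0):
    """Total tractive force F_t = F_rr + F_d + F_gr + F_la (N)."""
    f_rr = MU_RR * M * G * math.cos(theta)   # rolling resistance
    f_d = 0.5 * RHO * C_D * A_F * v ** 2     # aerodynamic drag
    f_gr = M * G * math.sin(theta)           # grading resistance
    f_la = M * a                             # linear acceleration force
    return f_rr + f_d + f_gr + f_la

def load_power(v, a, eta=0.9, theta=0.0):
    """Electrical load power (W) for a motor of efficiency eta (placeholder)."""
    p_wheel = total_force(v, a, theta) * v
    return p_wheel / eta if p_wheel >= 0 else p_wheel * eta  # regen braking
```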
In this paper, the motors chosen are the Toyota Camry MG1 [30] and the UQM PowerPhase 125 [31]. Their parameters are shown in Table 2 and their efficiency curves may be seen in Figure 4. Finally, Figure 5 shows the output of the method (load power and losses) for the two motors for the four RDCs shown in Figure 2.

Electrical Model of the Hybrid Energy Storage System (HESS)
The battery and ultracapacitor packs adopted are based on those proposed by [19]. They are composed of, respectively, multiple K2 High Capacity lithium iron phosphate (LiFePO4) 22650P battery cells [32] and multiple Maxwell BCAP1500 ultracapacitors [33]. Their basic models are given in Figures 6 and 7, respectively; the parameters are given in Tables 3 and 4, and the equations used to compute the parameters are given by Equations (15)-(22) and (23)-(29).
Finally, the fuel cell (FC) and the DC/DC converter were modelled based on the system proposed in [20]. The DC/DC converter has a maximum output of 35 kW; the parameters of the FC are given in Table 5, and the FC stack's efficiency and power loss may be computed according to Equations (30) and (31).
Two inverters for the two propulsion machines were modelled based on the datasheet [34]. They are insulated gate bipolar transistor (IGBT)-based pulse width modulated (PWM) inverters, and their efficiency was assumed to be 97%.

Problem Overview
Reinforcement learning (RL) at its core involves an agent learning decision making and control by interacting with an environment. This interaction happens through actions taken by the agent, which can modify the environment and cause a state transition. Since every action has an impact on the environment, each one yields a reward that encourages good actions and discourages bad ones. The choice of action for each state depends on the policy of the agent, which maps states to the probabilities of choosing each action in that state.
Therefore, the problem of power split optimization for the presented HESS can be modeled as a deep reinforcement learning problem. As shown in the next sections, the agent of our proposed model must learn the best possible policy to maximize the rewards of the system as a whole. Moreover, as it is not possible to map all the possible state-action pairs of the environment to their respective rewards, we use deep neural networks (DNNs) to approximate them. Finally, all the code related to the deep reinforcement learning-based model was implemented using the Python programming language and TensorFlow's TF-Agents framework.

Agent, Environment, Action and State
In our case, the HESS is the agent and the HEV is the environment. The HEV is supposed to move for a specified amount of time according to a driving cycle with varying velocity, while optimizing the fuel cell's efficiency and minimizing the battery power magnitude and variation. Initially, however, the HEV does not know how to distribute the load power computed in Section 2.5 among the three energy storage systems (battery, ultracapacitor and fuel cell). The task of power splitting is performed by the HESS, which must decide among infinitely many possible combinations of battery, ultracapacitor and fuel cell powers. Each chosen combination is an action performed by the agent, which leads to a state transition. Thus, in this paper, the state and action spaces are given, respectively, by $S = \{P_{load}, SOC_{bat}, SOC_{uc}\}$ and $A = \{P_{bat,\%}, P_{UC,\%}, P_{FC,\%}\}$. The actions are not absolute power values but continuous values ranging from zero to one, which are transformed through the softmax function into percentages summing to one. For instance, if $P_{load} = 10$ kW and $\{P_{bat,\%}, P_{UC,\%}, P_{FC,\%}\} = \{0.8, 0.5, 0.2\}$, the softmax function outputs approximately $\{0.44, 0.32, 0.24\}$, which corresponds to the absolute power values $\{P_{bat}, P_{UC}, P_{FC}\} = \{4.4\ \text{kW}, 3.2\ \text{kW}, 2.4\ \text{kW}\}$.
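This action-to-power mapping can be expressed in a few lines; the sketch below reproduces the paper's numerical example.

```python
import numpy as np

def action_to_powers(action, p_load):
    """Map raw agent outputs in [0, 1] to absolute source powers via softmax."""
    e = np.exp(action - np.max(action))   # numerically stable softmax
    ratios = e / e.sum()                  # percentages summing to one
    return ratios * p_load                # [P_bat, P_UC, P_FC]

print(action_to_powers(np.array([0.8, 0.5, 0.2]), 10e3))
# -> approximately [4368. 3236. 2397.] W, i.e., {4.4, 3.2, 2.4} kW
```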

Rewards and Penalties
The learning process of the agent is driven by rewards and penalties depending on how good the agent's actions are. In our case, the agent should ideally satisfy all the following criteria (35)-(41), which are based on [20]. The parameters related to the constraints may be found in Table 6, and the objective function to be optimized is shown in Equation (34), where the coefficients (a) to (f) are positive penalty coefficients.
$$P_{\min,bat\_diff} \le P_{bat}(t) - P_{bat}(t-1) \le P_{\max,bat\_diff}$$
$$P_{\min,DC} \le P_{DC}(t) \le P_{\max,DC}$$
$$P_{\min,UC} \le P_{UC}(t) \le P_{\max,UC}$$

Considering the constraints (35)-(41), terminal states were designed. A terminal state or action corresponds to a $SOC_{bat}$, $SOC_{uc}$, $P_{bat}$, $P_{uc}$ or $P_{FC}$ value smaller or greater than the minimum or maximum allowed according to Table 6. Thus, whenever the vehicle reached an illegal state or tried to perform an illegal action, the environment was reset and the agent received a penalty on the order of the length of the driving cycle; as the driving cycles used were all shorter than 2000 steps, a penalty of 1000 was adopted. On the other hand, if the states and the chosen actions are valid, the vehicle receives a reward according to $r = SOC_{bat}^2 + SOC_{uc}^2 + \eta_{fc}^2 + g(x)$, where $g(x)$ is the output of the shaped reward function, as explained below.
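A sketch of this reward logic is given below. The constraint limits of Table 6 are not reproduced; `limits` is a hypothetical mapping supplied by the caller, and the environment reset is assumed to be handled outside these functions.

```python
def within_limits(values, limits):
    """values and limits map names such as 'soc_bat' or 'p_uc' to a scalar
    and a (min, max) pair taken from Table 6 (not reproduced here)."""
    return all(lo <= values[k] <= hi for k, (lo, hi) in limits.items())

def step_reward(soc_bat, soc_uc, eta_fc, g_x, valid):
    """r = SOC_bat^2 + SOC_uc^2 + eta_fc^2 + g(x) for valid states; a fixed
    penalty of 1000 for terminal (constraint-violating) states."""
    if not valid:
        return -1000.0
    return soc_bat ** 2 + soc_uc ** 2 + eta_fc ** 2 + g_x
```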
Reward shaping is a technique that involves changing the structure of a sparse reward function to offer more regular feedback to the agent [35] and thus accelerate the learning process. Figure 8 shows an example of a sparse and a shaped reward function. All four shaped reward functions were activated depending on the SOC of the ultracapacitor to encourage higher or lower power draw from the three energy sources. Their thresholds and descriptions are shown in Table 7.
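The shaped term $g(x)$ could take the following form. The thresholds and magnitudes below are hypothetical; the four actual functions and their $SOC_{UC}$ thresholds are those listed in Table 7.

```python
def g_shaped(soc_uc, p_uc, p_fc):
    """Hypothetical dense feedback keyed to the ultracapacitor SOC."""
    if soc_uc < 0.55:
        # UC near its lower bound: reward FC support, discourage UC discharge.
        return (0.5 if p_fc > 0 else 0.0) - (0.5 if p_uc > 0 else 0.0)
    if soc_uc > 0.85:
        # UC near full: reward drawing power from the UC.
        return 0.5 if p_uc > 0 else 0.0
    return 0.0  # comfortable band: no shaping needed
```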

Soft Actor-Critic (SAC)
The essence of solving a reinforcement learning problem lies in optimizing the tradeoff between exploration and exploitation. In contrast to supervised learning, in RL there are no labels, and the agent must learn to satisfy all the rules of the environment purely through exploration. During the training stage, the model initially performs mostly random actions and gradually finds an optimal balance between exploration and exploitation, and thus the optimal policy. During the testing stage, on the other hand, the model performs only exploitation; that is, it acts according to the optimal policy learned during the training stage.
The soft actor-critic (SAC) model [36] is an off-policy method that uses DRL to find the optimal balance between exploration and exploitation by maximizing both the reward and the entropy of the system. By maximizing the entropy, the model is encouraged to keep exploring and thus assign similar probabilities to actions with similar action values and not assign excessively large probabilities to a specific set of actions. On the other hand, by maximizing the rewards, the model will strive toward finding the optimal policy. Therefore, given the large action and state spaces of our model (see Figure 5 and Table 6), we believe that the SAC model would be appropriate to learn the ideal power split algorithm for the HESS.
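For reference, the maximum-entropy objective that SAC optimizes can be written, following [36], as

$$ J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right], $$

where $\rho_\pi$ is the state-action distribution induced by the policy $\pi$ and the temperature $\alpha$ weights the entropy term $\mathcal{H}$ against the reward.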
The architecture chosen for our SAC agent is a two-network design without shared features between them: one for the actor and one for the critic. Their parameters after hyperparameter tuning are shown in Table 8.
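A minimal TF-Agents sketch of this two-network setup is shown below. The observation and action shapes mirror the state and action spaces defined earlier, but the layer sizes and learning rates are placeholders; the tuned values are those in Table 8.

```python
import tensorflow as tf
from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent, tanh_normal_projection_network
from tf_agents.networks import actor_distribution_network
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

observation_spec = tensor_spec.TensorSpec((3,), tf.float32)  # P_load, SOC_bat, SOC_uc
action_spec = tensor_spec.BoundedTensorSpec(
    (3,), tf.float32, minimum=0.0, maximum=1.0)              # raw split actions
time_step_spec = ts.time_step_spec(observation_spec)

critic_net = critic_network.CriticNetwork(                   # critic network
    (observation_spec, action_spec), joint_fc_layer_params=(256, 256))
actor_net = actor_distribution_network.ActorDistributionNetwork(
    observation_spec, action_spec, fc_layer_params=(256, 256),
    continuous_projection_net=(                              # tanh-squashed Gaussian
        tanh_normal_projection_network.TanhNormalProjectionNetwork))

agent = sac_agent.SacAgent(
    time_step_spec, action_spec,
    actor_network=actor_net, critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4),
    target_update_tau=0.005, gamma=0.99)
agent.initialize()
```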

SAC Agent Training Phase
The training phase of the SAC agent was performed using the WLTC class 3 data and the parameters shown in Table 8. Figure 9 shows the rewards over 1,400,000 iterations, while Figure 10 shows the rewards averaged over windows of 10,000 steps. It may be seen that the average reward converges to about 800.

SAC Agent Evaluation with SDCs
To evaluate the proposed method, a rule-based power split technique was also applied to the HESS for the WLTC class 1 standard driving cycle. The power split rule changed based on the sign of the load power: for positive load powers, ratios of $\{P_{bat}, P_{UC}, P_{FC}\} = \{40\%, 40\%, 20\%\}$ of $P_{load}$ were used, and in the case of regenerative braking, ratios of $\{50\%, 50\%, 0\%\}$ of $P_{load}$ were used. The results are shown in Figures 11-13.
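For comparison purposes, this baseline reduces to a few lines; the sketch below follows the ratios stated above.

```python
def rule_based_split(p_load):
    """Fixed-ratio baseline: 40/40/20 (battery/UC/FC) for positive load
    power, 50/50/0 during regenerative braking (negative load power)."""
    if p_load >= 0:
        return 0.4 * p_load, 0.4 * p_load, 0.2 * p_load
    return 0.5 * p_load, 0.5 * p_load, 0.0
```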
Both techniques yield similar results in terms of power split, but the rule-based technique struggles to manage the SOCs and currents of the ultracapacitor and the battery. The initial SOCs are 0.9, but after the 1023 s of simulation, the final SOCs of the ultracapacitor are about 62.8% and 46.9% for the DRL and rule-based techniques, respectively, as shown in Figure 13. Given the approximately linear discharge characteristic of the ultracapacitor, its theoretical minimum allowed SOC is 50%, as shown in Table 6, so the latter value is unacceptable.
Additionally, the SAC model is far more robust than the rule-based one. Figure 14 shows the SOCs of the battery and the UC when their initial values are 0.7. The SAC model is able to find the optimal power split while satisfying the constraints of Table 6, in contrast to the rule-based technique. Finally, the DRL model also has the advantage of performing well on new data without retraining: Sections 4.3 and 4.4 show that the model obtains good results even on data on which it was never trained.

SAC Agent Evaluation with RDCs
As explained in Section 2.2, four RDCs were employed to evaluate the model. Table 9 shows the results obtained by the proposed method when prediction of the speed and load power is not considered. First, it is interesting to analyze the $SOC_{UC}$. Two initial values were considered: 55% and 90%. In general, when the initial $SOC_{UC} = 55\%$, the model could perform the optimal power split while respecting the constraints from Table 6, as the minimum $SOC_{UC}$ did not go below 50%. Additionally, it may be seen that the model focused on recharging the UC through regenerative braking, as the final $SOC_{UC}$ was greater than the initial one. On the other hand, when the initial $SOC_{UC} = 90\%$, the model allocated more power to the UC instead of recharging it, as the final $SOC_{UC}$ of approximately 82% is smaller than the initial value.
Second, considering that the minimum and maximum FC efficiencies $\eta_{FC}$ are, respectively, 40% and 62.1%, the obtained results in the range of 55.8% to 57.3% were satisfactory. We also plotted the FC powers for RDC1 with $SOC_{UC} = 55\%$ and $SOC_{UC} = 90\%$ in Figure 15 to analyze the FC results more closely. Notably, there was no significant difference between the plots. However, one would expect the model to allocate more power to the FC for low values of $SOC_{UC}$, in order to reduce the use of the UC and prevent it from going below the minimum of 50%. This suggests that the power split method could be further optimized.
Third, compared with the Toyota Camry MG1-only structure, there is a significant efficiency improvement when two motors are used. The large improvement of 17.6% is achieved because the operating characteristics of the two motors are complementary.

Evaluation of the Prediction Method with RDCs
In this section, we evaluate the velocity and load power prediction method using the models described in Sections 2.3 and 2.5, along with the SAC agent. Figures 16-18 show the results obtained. By analyzing the figures, it can be seen that the predicted powers and SOCs are not far from the actual ones. Additionally, Table 10 provides greater insight into the prediction method. It shows the mean absolute error (MAE) between the predicted and actual values for 10 different parameters: speed; load power; battery and ultracapacitor SOCs; fuel cell efficiency; battery and ultracapacitor currents; and battery, ultracapacitor and fuel cell powers. In general, the MAE is good for the speed, SOC and current predictions. In the case of the load and energy source powers, the mean deviation was relatively high (around 5000 W) because the prediction model is highly sensitive to differences in acceleration values. Analogously, the deviation in the fuel cell efficiency is also relatively high (around 11% to 14%) because it is highly sensitive to small changes in the fuel cell power values. For instance, there is a pair of points $(P_{FC,predicted}, P_{FC,label}) = (945.7\ \text{W}, 0\ \text{W}) \Rightarrow (\eta_{FC,predicted}, \eta_{FC,label}) = (61.5\%, 0\%)$, where the large difference between the predicted and actual efficiencies can be clearly seen.

Conclusions
In this paper, a DRL-based method for real-time joint power split optimization for a battery/UC/FC HEV was proposed. First, the TA-LSTM was responsible for predicting the future velocity, which was converted to the required load power through the mechanical model. The load power was then optimally split between the two motors and among the battery, UC and FC by the proposed SAC agent, which makes use of shaped reward functions to accelerate the training process. Compared with traditional rule-based techniques, the proposed method is robust to different initial SOC values and is able to satisfy the system constraints. Moreover, the results show that the use of two complementary motors greatly increases the efficiency of the system. Finally, the average MAEs of the prediction step are reliable; therefore, the method may be used to plan the HESS power split in advance.
In the future, a better model of the HEV, including auxiliary systems (air conditioning, lights, sound, power-assisted seats, windows), could be designed to compute a more precise load power. Additionally, the velocity forecast method can currently predict the velocity only for the next time step; a better forecast method would predict the velocity over multiple time steps, allowing the model to better allocate its resources. Finally, more emphasis could be given to the optimization of the fuel cell, to increase its efficiency and its usage whenever the SOC of the ultracapacitor falls below the minimum operating value.

Conflicts of Interest:
The authors declare no conflict of interest.