Energy Management for a Power-Split Plug-In Hybrid Electric Vehicle Based on Reinforcement Learning

Zheng Chen 1 , Hengjie Hu 1, Yitao Wu 1, Renxin Xiao 1, Jiangwei Shen 1,* and Yonggang Liu 2,3,* 1 Faculty of Transportation Engineering, Kunming University of Science and Technology, Kunming 650500, China; chen@kmust.edu.cn (Z.C.); huhengjie1995@163.com (H.H.); yitaowumail@gmail.com (Y.W.); xrx1127@foxmail.com (R.X.) 2 State Key Laboratory of Mechanical Transmissions, Chongqing University, Chongqing 400044, China 3 School of Automotive Engineering, Chongqing University, Chongqing 400044, China * Correspondence: shenjiangwei6@163.com (J.S.); andyliuyg@cqu.edu.cn (Y.L.)


Introduction
In recent years, as the greenhouse effect and air pollution have become increasingly severe, green energy has attracted growing attention in all walks of life. In the automotive industry, exhaust emissions from conventional fuel vehicles are an important cause of environmental pollution. Developing new energy vehicles (NEVs) is therefore significant for reducing emissions and the air pollution they induce. Currently, NEVs can be mainly classified into three types, i.e., fuel cell vehicles, battery electric vehicles (BEVs) and hybrid electric vehicles (HEVs), and they are usually equipped with an energy storage system, such as a battery pack or a super-capacitor [1,2]. BEVs are powered purely by the battery pack or the super-capacitor. Plug-in hybrid electric vehicles (PHEVs) are considered to combine the advantages of both BEVs and HEVs [3]. Compared with HEVs, the prominent advantage of PHEVs is that the battery pack can be recharged through an external charging plug, thereby providing a certain all-electric range (AER). Compared with BEVs, the controller of a PHEV can start the engine to sustain the battery when a certain battery state of charge (SOC) threshold is reached, which also extends the driving range. Consequently, it is critical to manage the power distribution between the battery and the engine properly in PHEVs.
The energy management strategy (EMS) of a PHEV is responsible for distributing power and energy among the different energy sources, such as the gasoline engine and the electric motors. Different energy management targets and their tradeoffs are discussed in the related literature [4,5], including fuel economy improvement [6] and tailpipe emission reduction [7]. Rule based and optimization based methods are the most widely considered, as discussed by the authors of [8]. Rule based methods are relatively easy to implement and are widely employed in practice [9,10]. In [9], a classified rule based EMS is designed that emphasizes the different operating modes of PHEVs, and the simulation results show a satisfactory emission reduction. However, rule based strategies depend heavily on the design process and engineering experience, which leads to longer design time [11]. In contrast, modern real-time and global optimization based algorithms can be applied with provable optimality guarantees. In particular, dynamic programming (DP), adopted by many researchers, is generally treated as a representative algorithm among the optimal methods [12][13][14][15]. In [12], the investigators proposed an intelligent EMS based on DP, and the numerical simulation results show a dramatic improvement in fuel economy. Quadratic programming (QP) is also a mature algorithm that searches for the optimal result at a lower computational cost than DP [16]. Pontryagin's minimum principle (PMP) [17] and the equivalent consumption minimization strategy (ECMS) [18] are also widely adopted in the EMS of PHEVs. In addition, model predictive control (MPC) [19] is extensively investigated as a real-time optimization approach for the EMS of PHEVs. Furthermore, intelligent algorithms such as simulated annealing (SA) optimization [17], neural networks (NN) [20] and genetic algorithms (GA) [21] have also been employed for the EMS of PHEVs in recent years.
Nowadays, with the development of artificial intelligence (AI) technology, reinforcement learning (RL) is becoming increasingly popular in various fields, including robotic control, intelligent systems, and energy management of power grids. In [22], a parallel control architecture based on RL is applied to robotic manipulation, enabling robots to adapt easily to environment variation. RL has also been introduced to the energy management of PHEVs in [23][24][25][26][27][28][29][30]. In [23], the investigators find that the RL based EMS can not only guarantee the vehicle dynamic performance but also improve the fuel economy, and as a result can outperform stochastic dynamic programming (SDP) in terms of adaptability and learning ability. In [24], the Kullback-Leibler (KL) divergence technique is applied to calculate the power transition probability matrices of the RL algorithm in order to find the optimal power distribution ratio between the battery and the super-capacitor. Simulation results show that this kind of control policy can not only effectively decrease the battery charging frequency and limit the maximum discharging current, but also maximize the energy efficiency and cut down the overall cost under diverse conditions. In [25], a novel RL based method is combined with remaining travel distance estimation, so that the controller can continuously search for the optimal strategy and learn from the previous process. In [26], an RL method called TD(λ)-learning is employed for an HEV, and simulation results show that the RL based policy can improve the fuel economy by 42%. In [27], a blended real-time control strategy is proposed based on the Q-learning (QL) method to balance overall performance and optimality. A bi-level control strategy is proposed in [28], in which a fuzzy encoding predictor and the KL divergence rate are employed to predict the driver's power demand at the higher level, while the lower level employs the RL algorithm to find the optimal solution.
Based on the above discussion, it is worthwhile to further apply the RL technique to the energy management of power-split PHEVs. The main motivation of the proposed energy management strategy is to refine the battery power based on RL by selecting proper state and action variables. As a result, the objectives of optimal fuel economy and battery power restriction can be met at the same time, thereby potentially prolonging the battery life. To achieve this target, the powertrain of a power-split PHEV is modeled and analyzed first. Subsequently, considering that the proposed method should be applicable in most driving conditions, a Markov chain is adopted to estimate the transition probability matrix of the demanded power under different driving cycles. Finally, the QL algorithm is applied to develop the EMS towards the optimal target. Furthermore, the proposed EMS is compared by simulation with the charge depletion/charge sustaining (CD/CS) strategy to validate its optimality under different driving cycles. The rest of this article is structured as follows: Section 2 describes the simplified vehicle structure and the fuel consumption model. In Section 3, the RL based framework is proposed to realize the optimal EMS. In Section 4, simulations show that the proposed method outperforms the CD/CS algorithm. Section 5 concludes the article.

PHEV Powertrain Model
In this paper, the model under study is a power-split PHEV derived from Autonomie. A typical power-split PHEV is the Toyota Prius PHEV. The powertrain structure of the vehicle is shown in Figure 1. It consists of a 39 ampere-hour (Ah) traction battery pack, a gasoline engine, a final drive, a planetary transmission and two electric motors, i.e., Motor 1 and Motor 2. The engine, Motor 1 and Motor 2 are connected to the planet carrier, the ring gear and the sun gear, respectively. As can be seen in Figure 1, Motor 2 provides a significant portion of the electric power, and Motor 1 is mainly used as a generator. The main parameters are listed in Table 1.

Energy Management Problem
This paper focuses on minimizing the total fuel consumption. Hence, the fuel index $\beta$ can be established as

$$\beta = F_{total} = \int_{0}^{T} F_{rate}\,dt \quad (1)$$

where $F_{total}$ is the total fuel consumption, $F_{rate}$ denotes the fuel rate, and $T$ is the total driving time. To calculate the fuel rate with appropriate simplification, $F_{rate}$ can be determined as

$$F_{rate} = f(\omega_{eng}, T_{eng}) \quad (2)$$

where $\omega_{eng}$ and $T_{eng}$ denote the speed and the torque of the engine, respectively. To minimize the fuel consumption, the relationship between the vehicle power request and the fuel consumption needs to be analyzed in detail.

Power Request Model
Given a certain driving cycle, the power required to drive the vehicle can be calculated as

$$P_{req} = (F_f + F_w + F_i)\,v \quad (3)$$

where $P_{req}$ is the vehicle request power; $F_f$, $F_w$ and $F_i$ represent the road resistance, the aerodynamic drag and the inertial resistance, respectively; and $v$ denotes the driving velocity. The resistances, which depend only on vehicle and environment parameters, are given by Equation (4), where $m$ is the total mass, $f$ denotes the road resistance coefficient, $g$ is the gravitational acceleration, $A$ is the frontal area of the vehicle, $C_d$ is the aerodynamic drag coefficient, and $\delta$ is the rotational mass coefficient. Following Figure 1, the power flow in the powertrain is described by Equation (5), where $P_{final}$ is the driveline power; $P_{mot1}$, $P_{mot2}$ and $P_{eng}$ are the output power of Motor 1, Motor 2 and the engine, respectively; $P_{acc}$ denotes the power of the electric accessories and is assumed to be constant at 220 W; and $\eta_{gear}$, $\eta_{final}$ and $\eta_c$ are the efficiencies of the gear, the final drive and the electric converter, respectively. As seen in Figure 1, the planetary gear set works as the coupling device that connects the engine and the motors, and the corresponding dynamic equations are given in Equation (6), where $i_{gear}$ is the transmission ratio of the planetary gear; $\omega_{mot1}$, $\omega_{mot2}$ and $\omega_{ring}$ are the speeds of Motor 1, Motor 2 and the ring gear, respectively; $T_{mot1}$ and $T_{mot2}$ are the torques of the two motors; $r_{whl}$ denotes the wheel radius; and $r_{final}$ is the final drive ratio. In this article, the inertias of the planet gear, the sun gear and the ring gear are ignored to simplify the management of the energy distribution.
Based on the above descriptions, the instantaneous fuel consumption $F_{rate}$ can be redefined as in Equation (7). It follows that $F_{rate}$ is directly determined by the battery power $P_{bat}$, so it is necessary to model the battery and analyze its power relationship.
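As a minimal sketch of the request-power calculation in this section, the following Python snippet evaluates $P_{req} = (F_f + F_w + F_i)\,v$ along a speed trace. The resistance formulas are the standard longitudinal road-load expressions and are an assumption here, since the resistance equations themselves are not reproduced in this text. The air density `RHO_AIR`, the `Vehicle` class, the function names and the numerical values in the usage lines are likewise illustrative and are not taken from Table 1.

```python
from dataclasses import dataclass

G = 9.81        # gravitational acceleration, m/s^2
RHO_AIR = 1.2   # ASSUMED air density, kg/m^3 (not given in the text)


@dataclass
class Vehicle:
    """Hypothetical parameter container; field names are illustrative."""
    mass: float          # m, total mass in kg
    f_roll: float        # f, road resistance coefficient
    frontal_area: float  # A, frontal area in m^2
    c_d: float           # C_d, aerodynamic drag coefficient
    delta: float         # rotational mass coefficient


def request_power(veh, v, a):
    """Return P_req in W for speed v (m/s) and acceleration a (m/s^2)."""
    f_f = veh.mass * G * veh.f_roll                               # road resistance
    f_w = 0.5 * RHO_AIR * veh.c_d * veh.frontal_area * v ** 2     # ASSUMED drag form
    f_i = veh.delta * veh.mass * a                                # inertial resistance
    return (f_f + f_w + f_i) * v


def power_trace(veh, speeds, dt=1.0):
    """Request power along a speed trace, using forward differences for a."""
    powers = []
    for k in range(len(speeds) - 1):
        accel = (speeds[k + 1] - speeds[k]) / dt
        powers.append(request_power(veh, speeds[k], accel))
    return powers


# Usage with made-up parameters (not the values of Table 1)
veh = Vehicle(mass=1500.0, f_roll=0.01, frontal_area=2.0, c_d=0.3, delta=1.05)
print(power_trace(veh, [0.0, 2.0, 4.0, 4.0]))
```

Negative values of the returned power correspond to braking phases, in which energy can in principle be recovered.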

Battery Model
To analyze the power relationship of the battery, a simplified battery model consisting of an internal resistor and an open circuit voltage source is adopted. The model is described by Equation (8), where OCV denotes the battery open circuit voltage, $i_{bat}$ is the battery current, $R_{int}$ is the battery internal resistance, $C_{bat}$ is the battery capacity, SOC is the battery state of charge and $SOC_{init}$ is its initial value. The battery parameters, which vary with SOC, are shown in Figure 2. $R_{int}$ decreases from 0.1403 ohm to 0.09 ohm, and OCV ranges from 165 V to 219.7 V.
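The battery relations can be illustrated with the common internal-resistance equivalent circuit. The sketch below assumes the standard relation $P_{bat} = OCV \cdot i_{bat} - R_{int}\, i_{bat}^2$ together with Coulomb counting for the SOC, with discharge taken as positive. These forms, the function names, and the constant OCV and $R_{int}$ used in the usage lines are assumptions chosen within the ranges quoted above; they are not necessarily the exact equations of the model used in this paper.

```python
import math


def battery_current(p_bat, ocv, r_int):
    """Current in A for terminal power p_bat in W (positive = discharge).

    Solves p_bat = ocv * i - r_int * i**2 and returns the smaller root,
    which is the physically meaningful operating point.
    """
    disc = ocv ** 2 - 4.0 * r_int * p_bat
    if disc < 0:
        raise ValueError("requested power exceeds what the battery can deliver")
    return (ocv - math.sqrt(disc)) / (2.0 * r_int)


def soc_step(soc, p_bat, ocv, r_int, capacity_ah, dt=1.0):
    """Advance the SOC by one step of dt seconds via Coulomb counting."""
    i_bat = battery_current(p_bat, ocv, r_int)
    return soc - i_bat * dt / (capacity_ah * 3600.0)  # Ah -> As


# Usage: 10 kW discharge for 60 s on a 39 Ah pack, illustrative OCV and R_int
soc = 0.9
for _ in range(60):
    soc = soc_step(soc, 10_000.0, ocv=200.0, r_int=0.1, capacity_ah=39.0)
print(soc)
```

In a full model, OCV and $R_{int}$ would be looked up from the SOC-dependent curves of Figure 2 at every step instead of being held constant.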
From the above analysis, once the battery power is determined, the energy distribution inside the vehicle follows. In this manner, the control strategy can be specified through the battery power. To ensure the safety of all components, respect their power limitations and preserve their performance, the following constraints are imposed:

$$\begin{cases}
P_{bat\_min} \le P_{bat} \le P_{bat\_max} \\
P_{mot1\_min} \le P_{mot1} \le P_{mot1\_max} \\
P_{mot2\_min} \le P_{mot2} \le P_{mot2\_max} \\
P_{eng\_min} \le P_{eng} \le P_{eng\_max} \\
P_{req\_min} \le P_{req} \le P_{req\_max} \\
SOC_{min} \le SOC \le SOC_{max}
\end{cases} \quad (9)$$

where the subscripts min and max denote the corresponding minimum and maximum values, respectively. In the next step, the RL based strategy is introduced to achieve the energy management of the PHEV.

Reinforcement Learning for Energy Management
To apply RL to the energy management of PHEVs, the transition probability model of the vehicle power must be built first.

Transition Probability Model
A Markov chain is a stochastic process in discrete time and with discrete states that satisfies the Markov property; its state is a sequence of finitely many random variables. In this process, the next state depends only on the current state and the current action, and not on the earlier history. In addition, the state transitions are independent of time and are governed by probabilities. According to the finite-state Markov chain driver model introduced in [31], the actual driving cycle can be considered as a stochastic Markov chain. The request power is treated as a stochastic variable and can be modeled by the Markov chain. To obtain the transition probability matrix of the demanded power, several standard driving cycles, shown in Figure 3, are recorded and analyzed. The selected driving cycles not only include urban, suburban and highway driving conditions, but also involve some intense speed profiles, so that their velocity range and their acceleration and deceleration frequency cover most driving conditions. Based on the speed profiles of the selected driving cycles, some of which are depicted in Figure 3, the transition probability of the demanded power can be calculated by maximum likelihood estimation as

$$p_{s,s'} = \frac{n_{s,s'}}{n_s}, \qquad n_s = \sum_{s'} n_{s,s'} \quad (10)$$

where $n_{s,s'}$ is the number of observed transitions from $s$ to $s'$, $n_s$ is the total number of transitions out of $s$, and $p_{s,s'}$ is the transition probability of the driver's power demand from the current moment to the next moment at each velocity state.
According to the Markov chain calculation, the transition probability matrices at vehicle speeds of 30 km/h and 80 km/h are shown in Figure 4. The request power ranges from −40 kW to 40 kW at 30 km/h and from −80 kW to 80 kW at 80 km/h. The transition probabilities lie between 0.1 and 0.7, and most of the distribution is concentrated along the diagonal. In addition, Figure 4 clearly shows that the transition probability of the power request from the current state to the next state differs markedly between speed values.
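The maximum likelihood estimate of the transition probabilities amounts to counting transitions between discretized power levels and normalizing each row. The following sketch shows this for a single velocity state; the function name `transition_matrix`, the bin edges and the short power sequence are hypothetical and serve only as an illustration. In practice the same counting would be repeated for each speed bin using the samples recorded at that speed.

```python
import numpy as np


def transition_matrix(power, edges):
    """Estimate p[i, j] = n_ij / n_i from a sequence of power samples.

    power: 1-D sequence of request-power samples (one velocity state)
    edges: increasing bin edges that discretize the power axis
    """
    n_states = len(edges) - 1
    # Map every sample to a bin index in [0, n_states - 1]
    idx = np.clip(np.digitize(power, edges) - 1, 0, n_states - 1)
    counts = np.zeros((n_states, n_states))
    for i, j in zip(idx[:-1], idx[1:]):
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # States that were never visited keep an all-zero row
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)


# Usage with a made-up power sequence in kW and two bins
P = transition_matrix([-5.0, 5.0, 5.0, -5.0, 5.0], edges=[-10.0, 0.0, 10.0])
print(P)
```

Rows of visited states sum to one, which is a quick sanity check on the estimated matrix.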

Reinforcement Learning Algorithm
RL, as an important machine learning method, learns through repeated exploration, in which the agent takes a series of actions in its environment to maximize its designated benefit. The agent-environment interaction of RL is illustrated in Figure 5. This interaction can be regarded as a Markov decision process (MDP), and RL mainly focuses on solving the MDP through a series of iterations. In this paper, the state variable $s \in S$ includes the power request, the SOC and the vehicle speed, and the action variable $a \in A$ is the battery power. The reward function $r$, which evaluates the current action, is defined as the immediate fuel consumption of the engine.
The objective function can be written as the total reward for the finite future at each state, as given in Equation (11), where $\gamma \in [0, 1]$ is the discount factor that guarantees convergence of the agent during the learning process. Since each state is unique, the objective function can be reformulated as in Equation (12), where $p_{sa,s'}$ indicates the probability that the state changes from $s$ to $s'$ under action $a$, and $r(s, a)$ indicates the reward of applying action $a$ to transfer from $s$ to $s'$. The optimal control strategy is then determined by Bellman's principle, Equation (13). As a popular RL algorithm, the QL algorithm is simple and easy to implement [32], and has been widely employed to solve for the optimal value function of an MDP. The QL algorithm obtains a strategy that maximizes the sum of expected discounted rewards by directly optimizing an iterated value function Q. When updating the Q value, the agent needs to examine every action in each iteration to ensure that the learning process converges. Given these merits, we employ the QL algorithm as the kernel algorithm to train, learn and finally achieve the energy management of PHEVs. In the QL algorithm, the Q value, i.e., the state-action value, is given by Equation (14), and the update rule of the Q value is given by Equation (15), where $\eta \in [0, 1]$ is a decaying factor. According to the above discussion, the proposed method consists of a simplified vehicle model, a transition probability matrix, a reward matrix and the QL control strategy, where the reward matrix is computed via the simplified vehicle model and the control strategy is calculated from the power transition matrix, the reward matrix and the QL algorithm feedback. Table 2 lists the pseudocode of the QL algorithm and illustrates its iterative process. The optimal control strategy is derived through the iterative process shown in Table 2.
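The iterative process can be sketched with a generic tabular QL loop that draws transitions from a known transition matrix and a reward matrix. This is an illustration under stated assumptions and not the pseudocode of Table 2: because the reward is defined as fuel consumption, the sketch treats it as a cost and takes the minimum over actions; the epsilon-greedy exploration, the function name `q_learning`, the default parameter values and the two-state toy problem are all hypothetical. States are abstract indices here, whereas in the EMS each index would stand for one discretized combination of power request, SOC and vehicle speed.

```python
import numpy as np


def q_learning(P, R, gamma=0.9, eta=0.1, eps=0.1, n_steps=50_000, seed=0):
    """Tabular Q-learning that minimizes the discounted cost.

    P[a, s, s2]: probability of moving from s to s2 under action a
    R[s, a]:     immediate cost (e.g. fuel use) of action a in state s
    Returns the Q table and the greedy (cost-minimizing) policy.
    """
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        # Epsilon-greedy: usually take the cheapest known action
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmin(Q[s]))
        s2 = int(rng.choice(n_states, p=P[a, s]))
        # Temporal-difference update with learning factor eta
        target = R[s, a] + gamma * Q[s2].min()
        Q[s, a] += eta * (target - Q[s, a])
        s = s2
    return Q, Q.argmin(axis=1)


# Toy problem: action 1 is free, action 0 costs one unit in every state
P = np.full((2, 2, 2), 0.5)
R = np.array([[1.0, 0.0], [1.0, 0.0]])
Q, policy = q_learning(P, R, n_steps=5_000)
print(policy)
```

If the reward were instead defined as a benefit to be maximized, the two minimum operations would simply become maximum operations.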
Figure 6 summarizes the detailed procedure of QL in Matlab [19]. First, the QL algorithm is combined with the MDP, and the related parameters are discretized. Then, the power transition matrix is calculated based on the driver model. Based on the discrete variables and the simplified PHEV model, the reward matrix R is calculated. After iteration, the QL algorithm yields the optimal energy management solution. The optimal control strategy based on the RL algorithm is shown in Figure 7. The battery power ranges from −12 kW to 12 kW, the required power is limited to between −45 kW and 45 kW, and the SOC ranges from 0.3 to 0.9. The optimal battery power is determined by the state variables, i.e., the required power, the SOC and the vehicle speed. Figure 8 shows the convergence process of the QL algorithm, where the mean discrepancy is used to measure the difference of the Q values. As the number of iterations increases, the mean discrepancy gradually decreases to 0, which indicates that the QL algorithm converges.

Simulation and Result
In this article, simulations are conducted based on Autonomie and Matlab/Simulink. The New European Driving Cycle (NEDC), the Highway Fuel Economy Test (HWFET) and the Urban Dynamometer Driving Schedule (UDDS), shown in Figure 9, are employed to verify the proposed strategy. The selected driving cycles represent most driving patterns under different driving conditions. To evaluate the performance of the proposed method, the charge depletion/charge sustaining (CD/CS) algorithm, which is widely employed in actual applications, is introduced as a benchmark. In addition, the ECMS is also employed for comparison. With the CD/CS algorithm, the power distribution of the vehicle can be easily achieved by setting a series of control parameters without any prior knowledge of the driving conditions. During the CD stage, except in some specific situations, the engine generally remains shut down, and the tractive power is mainly provided by the battery until the SOC drops to a specified lower threshold (e.g., 30%). Then, in the CS stage, the vehicle is powered by both the engine and the battery to keep the SOC near the specified value. The detailed CD/CS control scheme can be described [12] as:

$$\begin{cases}
 & SOC > 36\% \\
\min(27804.9,\ P_{req}) & 33\% \le SOC \le 36\% \\
\min(27804.9 \cdot (SOC - 0.3)/0.03,\ P_{req}) & 30\% \le SOC \le 33\% \\
\max(-28157.5 \cdot (SOC - 0.3)/0.03,\ P_{req}) & P_{req} < 0,\ 27\% \le SOC \le 30\% \\
\max(-28157.5 \cdot (SOC - 0.3)/0.03,\ P_{req} - P_{eng\_max}) & P_{req} > 0,\ 27\% \le SOC \le 30\% \\
\max(-28157.5,\ P_{req}) & P_{req} < 0,\ SOC < 27\% \\
\max(-28157.5,\ P_{req} - P_{eng\_max}) & P_{req} > 0,\ SOC < 27\%
\end{cases} \quad (16)$$

where $P_{eng\_max}$ represents the maximum power of the engine. The ECMS algorithm, a classical real-time optimization algorithm, converts the electric consumption of the battery into equivalent fuel consumption and then minimizes the total fuel consumption. At each time step, the vehicle power request is distributed between the battery and the engine according to the minimum principle. In this way, the overall fuel consumption is reduced and the fuel economy is improved. A typical solution of the ECMS can be formulated based on the Hamiltonian function, Equation (17), where $\lambda$ is an equivalent factor that can be adjusted dynamically or fixed as a constant, and $x(t)$ and $u(t)$ are the state variables and the control variable, respectively. In this paper, $x(t)$ includes the battery SOC, the vehicle power demand and the vehicle speed. As before, $u(t)$ is the battery power. By solving (17), the optimal solution can be found and the final fuel consumption can be obtained.

In the simulation validation, the three standard cycles are spliced to form diverse and verifiable test conditions. Cycle 1 consists of two NEDC cycles, one UDDS cycle and two HWFET cycles; Cycle 2 comprises two UDDS cycles, two NEDC cycles and two HWFET cycles; Cycles 3 and 4 include five and six HWFET cycles, respectively; and Cycles 5 and 6 consist of six and seven UDDS cycles, respectively. The fuel consumption results with SOC correction [33] are listed in Table 3. Compared with the CD/CS scheme, the RL based control strategy reduces the fuel consumption by 10.1%, 9.31%, 4.84%, 4.49%, 5.95% and 5.13% under the different driving cycles. Compared with the ECMS, the RL algorithm achieves similar fuel savings, which supports the validity of the RL based algorithm. More intuitively, Figure 10 compares the battery power of the proposed algorithm, the ECMS and the CD/CS scheme. The battery power under the RL algorithm ranges from −12 kW to 12 kW, while under the CD/CS algorithm it ranges from −30 kW to 5 kW. The EMS based on the RL algorithm thus keeps the battery power within a narrower range than the CD/CS method, and it restricts the maximum battery discharge power. We can conclude that the EMS based on the RL control strategy can protect the battery and extend its life to some extent.
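The ECMS benchmark used in this section can be illustrated by an instantaneous minimization over candidate battery powers. The sketch below assumes the common equivalent-consumption form $H = F_{rate} + \lambda P_{bat} / Q_{lhv}$, with the engine supplying $P_{req} - P_{bat}$ and all driveline losses ignored. The lower heating value `lhv`, the linear fuel model `linear_fuel_rate` with a fixed 30% efficiency, and the candidate grid are hypothetical choices for illustration; they are not the Hamiltonian or the engine map used in this paper.

```python
def linear_fuel_rate(p_eng, efficiency=0.3, lhv=42.5e6):
    """HYPOTHETICAL fuel rate in kg/s for engine power p_eng in W."""
    return p_eng / (efficiency * lhv)


def ecms_battery_power(p_req, candidates, fuel_rate, lam, lhv=42.5e6):
    """Pick the candidate battery power (W) with the lowest equivalent fuel use.

    H(u) = fuel_rate(p_req - u) + lam * u / lhv, where u is the battery power
    (positive = discharge) and lam is the equivalent factor.
    """
    best_u, best_h = None, float("inf")
    for u in candidates:
        p_eng = p_req - u
        if p_eng < 0:
            continue  # the engine cannot absorb power
        h = fuel_rate(p_eng) + lam * u / lhv
        if h < best_h:
            best_u, best_h = u, h
    return best_u


# Usage: 20 kW request, battery power limited to the range -12 kW to 12 kW
grid = [-12_000.0, -6_000.0, 0.0, 6_000.0, 12_000.0]
print(ecms_battery_power(20_000.0, grid, linear_fuel_rate, lam=2.0))
```

A small equivalent factor makes electricity cheap relative to fuel and pushes the choice towards battery discharge, while a large one favors running the engine and charging the battery.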
Figure 11 shows the SOC curves under different driving cycles. The initial SOC is set to 90%, and the minimum SOC threshold is 30%. Compared with the CD/CS scheme, the SOC under the RL method declines more smoothly. Figure 12 illustrates the fuel consumption under four driving cycles. According to Figures 11 and 12, the optimized control strategy does not take effect over the entire cycle; it works only before the battery SOC drops to a certain value. Even so, the proposed algorithm can still effectively reduce the fuel consumption.
To further examine the improvements of the RL-based strategy, the engine operating points for both the RL-based method and the CD/CS method under four driving cycles are depicted in Figure 13. With the RL-based algorithm, the engine working efficiency is higher than 30% in most cases. Compared with the CD/CS strategy, the proposed method concentrates the engine working points more densely in the high-efficiency area. Moreover, with the RL-based method, the majority of engine working points gather near the optimal operating line, unlike with the CD/CS algorithm. This explains why the fuel consumption of the proposed method is lower than that of the CD/CS method.
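One way to quantify the statement that the engine efficiency exceeds 30% in most cases is to compute the brake thermal efficiency of each operating point and the share of points above the threshold. The sketch below assumes each operating point is available as an (engine power, fuel mass flow) pair and uses an assumed gasoline lower heating value `Q_LHV`; the function names and sample data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# ASSUMPTION: typical lower heating value of gasoline in J/kg (not from the paper).
Q_LHV = 42.5e6


def brake_efficiency(p_eng_w: float, fuel_rate_kg_s: float) -> float:
    """Brake thermal efficiency: mechanical power divided by fuel chemical power."""
    return p_eng_w / (fuel_rate_kg_s * Q_LHV)


def share_above(points: Sequence[Tuple[float, float]], threshold: float = 0.30) -> float:
    """Fraction of engine-on points whose efficiency exceeds the threshold.

    points -- (engine power in W, fuel mass flow in kg/s) pairs;
              points with zero fuel flow (engine off) are skipped.
    """
    effs = [brake_efficiency(p, m) for p, m in points if m > 0.0]
    if not effs:
        return 0.0
    return sum(e > threshold for e in effs) / len(effs)


if __name__ == "__main__":
    # Hypothetical operating points, for illustration only.
    sample = [(30000.0, 0.0020), (20000.0, 0.0016), (5000.0, 0.0006)]
    print(share_above(sample))
```

Applying the same function to the operating points of the two strategies would give a single number per strategy that can be compared directly.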

Conclusions
In this paper, the Q-learning RL algorithm has been employed for the energy management of a power-split PHEV. The mathematical vehicle model is built after a detailed powertrain analysis. By combining the Q-learning method with the MDP, the RL model of the PHEV is constructed and the optimal RL-based result is obtained, in which the battery power is optimized. Three standard driving cycles are chosen for simulation verification. Simulation results show that the proposed RL algorithm achieves better fuel economy and is more effective than the CD/CS algorithm. In addition, the proposed algorithm can restrict the battery current within a narrower range, thus extending the battery life cycle to some extent.
Our next step will focus on exploring a more stable Markov chain model and more advanced optimization algorithms. In addition, the proposed algorithm will be further investigated to update the transition probability matrix of the Markov driver chain in real time, and hardware-in-the-loop and actual vehicle validation will be conducted to verify the real control performance of the proposed method.
Author Contributions: Z.C. and H.H. drafted this paper and discussed combining reinforcement learning with the Markov decision process. Y.W. and R.X. provided suggestions on the energy management strategy. J.S. oversaw the research. Y.L. revised the paper and provided technical help.
OCV denotes the battery open-circuit voltage, $i_{bat}$ is the battery current, $R_{int}$ is the battery internal resistance, $C_{bat}$ is the battery capacity, SOC is the battery state of charge, and $SOC_{init}$ is its initial value. Detailed battery parameters varying with SOC are shown in Figure 2. It can be found that $R_{int}$ decreases from 0.1403 ohm to 0.09 ohm and OCV ranges from 165 V to 219.7 V.
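The symbols above match the common internal-resistance (Rint) equivalent-circuit battery model. As a minimal sketch under that assumption, the battery current can be obtained from the terminal power balance $P_{bat} = OCV \cdot i_{bat} - i_{bat}^2 R_{int}$, and the SOC can be updated by coulomb counting. The function names, the ampere-hour unit for $C_{bat}$, the sign convention, and the sample numbers are assumptions made for illustration and may differ from the paper's exact formulation.

```python
import math


def battery_current(p_bat: float, ocv: float, r_int: float) -> float:
    """Battery current (A) for a terminal power p_bat (W), positive = discharge.

    Solves r_int * i**2 - ocv * i + p_bat = 0 and returns the smaller root,
    which is the physically meaningful low-current solution.
    """
    disc = ocv * ocv - 4.0 * r_int * p_bat
    if disc < 0.0:
        raise ValueError("requested power exceeds what the battery can deliver")
    return (ocv - math.sqrt(disc)) / (2.0 * r_int)


def soc_step(soc: float, i_bat: float, c_bat_ah: float, dt: float) -> float:
    """Coulomb-counting SOC update over dt seconds.

    ASSUMPTION: capacity is given in ampere-hours, hence the factor 3600.
    A positive (discharge) current lowers the SOC.
    """
    return soc - i_bat * dt / (c_bat_ah * 3600.0)


if __name__ == "__main__":
    # Hypothetical operating point, for illustration only.
    i = battery_current(p_bat=10000.0, ocv=200.0, r_int=0.1)
    print(i, soc_step(0.9, i, c_bat_ah=10.0, dt=1.0))
```

In a full vehicle model, `ocv` and `r_int` would be looked up from the SOC-dependent curves of Figure 2 at every step rather than held constant.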

Figure 2. OCV and $R_{int}$ variation with state of charge (SOC).

Figure 4. The transition probability map. (a) The transition probability map at V = 30 km/h; (b) The transition probability map at V = 80 km/h.

Figure 6. Procedures of the QL calculation.

Figure 7. Optimal control strategy based on RL algorithm with different speeds. (a) The optimal control action variable at V = 20 km/h; (b) The optimal control action variable at V = 40 km/h; (c) The optimal control action variable at V = 60 km/h; and (d) The optimal control action variable at V = 80 km/h.

Figure 8. Mean discrepancy of the Q-values.

Figure 9. Profile of simulation driving cycles.

Table 1. Main parameters of power-split PHEV.