Practical Application-Oriented Energy Management for a Plug-In Hybrid Electric Bus Using a Dynamic SOC Design Zone Plan Method

Abstract: The main problem in current energy management is practical applicability. To address this problem, this paper proposes a reinforcement learning (RL)-based energy management combining tabular Q-learning and Pontryagin's Minimum Principle (PMP) algorithms for a plug-in hybrid electric bus (PHEB). The main innovation distinguishing it from existing energy management strategies is the proposed dynamic SOC design zone plan method, characterized by two aspects: (1) a series of fixed locations is defined along the city bus route, and a linear SOC reference trajectory is re-planned at each fixed location; (2) a triangle zone is re-planned based on the linear SOC reference trajectory. Additionally, a one-dimensional state space is designed to ensure real-time control. Off-line trainings demonstrate that the agent of the RL-based energy management can be well trained and has good generalization performance. Hardware-in-the-loop (HIL) simulation results demonstrate that the trained energy management has good real-time performance, and that its fuel consumption is decreased by 12.92% compared to a rule-based control strategy.


Introduction
The environmental pollution caused by the rapid development of the transportation industry cannot be ignored, and the electric vehicle is expected to resolve this dilemma [1]. Plug-in hybrid electric vehicles (PHEVs), characterized by combining an electric motor and an internal combustion engine, can reduce exhaust emissions [2]. In practice, at least two power sources are deployed in a PHEV; that is, at least two degrees of freedom are introduced into the energy management [3]. Therefore, energy management is the most important issue for PHEVs [4,5].
Many energy management strategies, such as rule-based, optimization-based, prediction-based, and reinforcement learning (RL)-based methods, have been proposed. Nevertheless, whatever method is proposed, realizing real-time and economic control in the real world is its objective. Rule-based energy management can easily realize real-time control in the real world; however, only when its key parameters are elaborately designed can the control performance improve. For example, Ding N. et al. proposed a hybrid energy management system based on a rule-based control strategy and a genetic algorithm to improve the fuel economy and overcome the battery limitations [6]. Li P. et al. proposed an intelligent logic rule-based energy management method by optimizing the working area of the engine [7]. Optimization-based energy management can realize better economic control in the real world, although the real-time control performance may be sacrificed. For example, Hassanzadeh M. et al. proposed an energy management strategy based on PMP to improve the fuel economy and battery life under uncertain driving conditions. In Ref. [24], a SOC reference trajectory is designed based on a series of optimal SOC trajectories calculated from a series of historical driving conditions. Inspired by this method, this paper proposes an RL-based energy management together with a similar state variable design method. In particular, the main difference between Ref. [24] and our work is that a dynamic linear SOC reference trajectory is planned at fixed locations based on the feedback SOC, and a triangle zone is re-planned based on the reference SOC trajectory. The main advantage of this method is that the dynamic triangle zone provides a margin for fuel economy improvement and guides the feedback SOC to reach the objective SOC.
The remainder of this paper is structured as follows. The modeling of the PHEB is introduced in Section 2. The RL-based energy management is detailed in Section 3. The results and discussion are presented in Section 4, and the conclusions are drawn in Section 5.

The Description of the PHEB
Figure 1 shows the layout of the PHEB. It is constituted by an engine, a clutch, an electric motor (EM), and a 6-speed automated mechanical transmission (AMT). The gears used in this paper range from 2 to 6, without considering climbing driving conditions. Many working modes, such as engine driving, hybrid driving, motor driving, and regenerative braking, can be realized based on the driving demand and the energy management.

An engine model that can satisfy the energy management requirement is indispensable. Based on the fuel consumption rate map of the engine (Figure 2), the instantaneous fuel consumption of the engine is formulated by

m_e = b_e(T_e, n_e) · T_e · n_e / (9550 × 3600)

where m_e denotes the instantaneous fuel consumption of the engine; T_e denotes the torque of the engine; n_e denotes the speed of the engine; and b_e(T_e, n_e) denotes the fuel consumption rate of the engine.
Similarly, the motor is formulated by

P_m = T_m · n_m / (9550 · η_m^k), with k = 1 for driving and k = −1 for generating

where P_m denotes the power of the motor; η_m denotes the efficiency of the motor, which can be interpolated from a look-up table formulated by the efficiency map of the motor (Figure 3); and n_m and T_m denote the speed and the torque of the motor, respectively.
As shown in Figure 4, the battery is formulated as a Rint model; it is described as

dSOC/dt = −(V_b − √(V_b² − 4·R_b·P_b)) / (2·R_b·Q_b)

where V_b denotes the battery voltage; P_b denotes the battery power; R_b denotes the internal resistance; and Q_b denotes the battery capacity.
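To make the component models concrete, the following minimal Python sketch evaluates the engine, motor, and battery relations above. The map grid, the efficiency handling, and the battery parameters are illustrative placeholders (the calibration data of Figures 2–4 are not reproduced here), and the scipy interpolator stands in for the look-up tables.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Placeholder BSFC map b_e(T_e, n_e) in g/kWh; not the paper's data.
T_e_grid = np.linspace(0.0, 800.0, 9)      # engine torque, Nm
n_e_grid = np.linspace(800.0, 2400.0, 9)   # engine speed, rpm
b_e_map = 210.0 + 0.03 * np.add.outer(np.abs(T_e_grid - 500.0),
                                      np.abs(n_e_grid - 1400.0))
b_e = RegularGridInterpolator((T_e_grid, n_e_grid), b_e_map)

def engine_fuel_rate(T_e, n_e):
    """Instantaneous fuel rate m_e in g/s from the engine model above."""
    P_e = T_e * n_e / 9550.0                     # engine power, kW
    return float(b_e([[T_e, n_e]])[0]) * P_e / 3600.0

def motor_power(T_m, n_m, eta_m):
    """Electrical power of the motor in kW from the motor model above."""
    P_mech = T_m * n_m / 9550.0                  # mechanical power, kW
    return P_mech / eta_m if T_m >= 0 else P_mech * eta_m

def soc_derivative(P_b_kw, V_b=550.0, R_b=0.2, Q_b_ah=60.0):
    """Rint-model SOC dynamics; parameter values are placeholders."""
    P_b = P_b_kw * 1000.0                        # battery power, W
    I_b = (V_b - np.sqrt(V_b**2 - 4.0 * R_b * P_b)) / (2.0 * R_b)
    return -I_b / (Q_b_ah * 3600.0)              # per-second SOC change
```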

The Formulation of PMP
In this paper, an economic gear shift strategy is executed during the shift process. In consideration of minimizing the fuel consumption, the objective function is described as

J = ∫ m_e(u(t)) dt (4)

where J denotes the performance index function and u(t) denotes the control vector, which is the throttle of the engine (u(t) = [th(t)]). Here, the throttle of the engine is denoted by th(t), which ranges from 0 to 1. Inspired by Ref. [25], the energy management can be converted into an instantaneous optimization problem by minimizing the Hamiltonian function, which can be described as

u*(t) = argmin_u H(x(t), u(t), λ(t), t), with H(x(t), u(t), λ(t), t) = m_e(u(t)) + λ(t)·ΔSOC(t)

where u*(t) denotes the optimum control solution; H(x(t), u(t), λ(t), t) denotes the Hamiltonian function, whereby the first term is the instant fuel consumption and the second term is the delta SOC multiplied by the co-state; and λ(t) denotes the co-state. Theoretically, the co-state is the only key parameter that influences the optimization performance, and it is a time-varying value. It can also be approximately recognized as constant over the whole trip, based on Refs. [22,23]. Moreover, it can also be adapted in real time, based on the driving conditions [26].
Furthermore, some constraints with respect to the physical components of the PHEB are also indispensable, described as

ω_e_min ≤ ω_e(t) ≤ ω_e_max, ω_m_min ≤ ω_m(t) ≤ ω_m_max, P_e_min ≤ P_e(t) ≤ P_e_max, P_m_min ≤ P_m(t) ≤ P_m_max

where ω_e(t) and ω_m(t) denote the rotational speeds of the engine and the motor, respectively; ω_e_min, ω_e_max and ω_m_min, ω_m_max denote the corresponding speed boundaries; P_e(t) and P_m(t) denote the powers of the engine and the motor, respectively; and P_e_min, P_e_max and P_m_min, P_m_max denote the corresponding power boundaries.
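The instantaneous minimization of the Hamiltonian, subject to the component constraints, can be sketched as follows. The callbacks fuel_rate_of, soc_dot_of, and feasible_of are hypothetical stand-ins for the component models and bounds above; a grid search over the throttle is one simple way to realize the argmin.

```python
import numpy as np

def pmp_step(costate, fuel_rate_of, soc_dot_of, feasible_of, n_grid=101):
    """Return the throttle u* in [0, 1] minimizing H = m_e(u) + costate * dSOC(u).

    feasible_of(u) should enforce the speed/power bounds of the components;
    returns None if no throttle on the grid is feasible.
    """
    u_grid = np.linspace(0.0, 1.0, n_grid)
    best_u, best_H = None, np.inf
    for u in u_grid:
        if not feasible_of(u):          # skip throttle values violating bounds
            continue
        H = fuel_rate_of(u) + costate * soc_dot_of(u)
        if H < best_H:
            best_u, best_H = u, H
    return best_u
```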

The Design of the Dynamic SOC Design Zone
In this paper, only one state, the difference between the feedback SOC and the reference SOC trajectory, is designed for the RL-based energy management. Therefore, the SOC reference trajectory becomes the key issue. As shown in Figure 5, a novel dynamic SOC design zone plan method is proposed. The basic principle is that a linear SOC reference trajectory is first planned at the fixed location, and simultaneously a dynamic reference SOC zone is defined based on the linear SOC reference trajectory. There are two advantages to this method, as listed below.

(1). Only the difference between the dynamic reference SOC and the feedback SOC is defined as the state. In this case, only a Q-matrix with 100 rows and 27 columns is designed, which ensures the real-time control of the strategy.
(2). In practice, the real driving conditions cannot be completely predicted, so the optimal SOC trajectory usually cannot be completely predicted either. In this case, taking the optimal SOC trajectory as the reference trajectory is infeasible. However, if a feasible zone (with a floating value of 0.02) is defined based on the linear SOC reference trajectory, the feedback SOC can be controlled within the feasible zone, and a great margin for fuel economy improvement can be provided. This is also the most important innovation in this paper.
The main differences between the proposed method and the existing methods are listed as follows.
(1). In Ref. [24], only a SOC reference trajectory is designed, based on a series of optimal SOC trajectories. In our method, the SOC reference trajectory is based only on the fixed locations. Moreover, a triangle zone is also defined to improve the fuel economy of the PHEB.
(2). In Ref. [25], only an efficient zone is defined, based on a series of optimized SOC trajectories, and no dynamic reference SOC trajectory is planned. Moreover, three states must be designed, and the efficient zone is designed off-line.
In addition, the dynamic SOC design zone plan method can be described as

SOC_ref = SOC_T + (SOC_fnl − SOC_T) · (D_ref − D_T) / (D_fnl − D_T) (7)

where SOC_ref and D_ref denote the reference SOC and the travelled distance at the current time step, respectively; SOC_T and D_T denote the dynamic values of the target SOC and the travelled distance, respectively, which are updated after each fixed distance step; and SOC_fnl and D_fnl denote the target SOC and the travelled distance at the destination, respectively.
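A minimal sketch of Equation (7) and the surrounding zone follows. The 0.02 floating value is taken from the text above; the assumption that the zone narrows linearly toward the destination (the "triangle" shape) is an interpretation of Figure 5, not an equation stated in the text.

```python
def soc_reference(D_ref, D_T, SOC_T, D_fnl, SOC_fnl):
    """Linear SOC reference at travelled distance D_ref, per Eq. (7)."""
    return SOC_T + (SOC_fnl - SOC_T) * (D_ref - D_T) / (D_fnl - D_T)

def soc_zone(D_ref, D_T, SOC_T, D_fnl, SOC_fnl, width=0.02):
    """Lower/upper feasible-zone boundaries around the reference.

    The zone is assumed to shrink linearly from +/-width at the re-plan
    location to zero at the destination (the triangle zone).
    """
    ref = soc_reference(D_ref, D_T, SOC_T, D_fnl, SOC_fnl)
    remain = max(0.0, (D_fnl - D_ref) / (D_fnl - D_T))
    half = width * remain
    return ref - half, ref + half

# Example: re-planned at 20 km with feedback SOC 0.55, destination 50 km / 0.3.
lo, hi = soc_zone(D_ref=30_000.0, D_T=20_000.0, SOC_T=0.55,
                  D_fnl=50_000.0, SOC_fnl=0.3)
```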

The Formulation of the RL-Based Energy Management
QL is one of the most important RL methods, based on the temporal-difference (TD) method. The update process of the Q value can be described as

Q(s_t, a_t) ← Q(s_t, a_t) + α · [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α denotes the learning rate, which is defined as 0.95 in this paper; γ denotes the discount factor, which is defined as 0.8 in this paper; r_{t+1} denotes the immediate reward at time t; and max_a Q(s_{t+1}, a) denotes the maximum Q value in the next state.
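As a minimal illustration, the tabular update can be written as follows, with the 100 × 27 table dimensions, α = 0.95, and γ = 0.8 taken from this paper.

```python
import numpy as np

N_STATES, N_ACTIONS = 100, 27      # 100 SOC-difference bins, 27 co-state actions
Q = np.zeros((N_STATES, N_ACTIONS))
ALPHA, GAMMA = 0.95, 0.8

def q_update(s, a, r, s_next):
    """One temporal-difference update of the Q-table."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
```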
As shown in Figure 6, at every time step, the agent obtains a state s_t from the environment, and an action a_t is evaluated by the agent. Then, the state of the environment s_{t+1} is adapted, and a corresponding reward r_{t+1} is transmitted to the agent.
As an instantaneous optimization algorithm, PMP-based energy management has good real-time control performance. The only challenge is to recognize the co-state for the unrepeatable, stochastic driving conditions. In addition, QL is widely regarded as an intelligent algorithm that can adapt well to uncertain circumstances. Motivated by this, an RL-based energy management combining PMP and RL is proposed. As shown in Figure 7, at every time step, the agent evaluates an action based on the states, the states of the PHEB are adapted based on the action, and then a reward is generated and transmitted to the agent.
(1) The state. As stated above, the difference between the feedback SOC and the reference SOC is defined as the sole state, which is described as

s_t = SOC_f − SOC_ref

where SOC_f denotes the feedback SOC. The state s_t ranges from −0.04 to 0.04 and is sampled by 100 points. That is to say, the number of rows is only 100, which provides a basis for reducing the dimensionality of the Q and R tables.
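A small sketch of the corresponding state discretization, assuming uniform binning of the clipped SOC difference over [−0.04, 0.04]:

```python
import numpy as np

def state_index(soc_f, soc_ref, s_min=-0.04, s_max=0.04, n_bins=100):
    """Map the SOC difference s_t = SOC_f - SOC_ref to a Q-table row index."""
    s = np.clip(soc_f - soc_ref, s_min, s_max)
    return int((s - s_min) / (s_max - s_min) * (n_bins - 1))
```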
(2) The action. The co-state is defined as the only action, ranging from −2800 to −4000. Specifically, the action space is {−2800, −2900, −3000, −3100, −3200, −3250, −3275, −3300, −3325, −3350, −3375, −3390, −3400, −3410, −3425, −3450, −3475, −3500, −3525, −3550, −3575, −3600, −3650, −3700, −3800, −3900, −4000}.
(3) The reward. As stated above, the reward is defined as Equation (11). Specifically, if the feedback SOC at time step t + 1 is larger than the upper boundary, a punishment is applied to the Q-value function, and the further the deviation, the greater the punishment; if the feedback SOC at time step t + 1 is lower than the lower boundary, a punishment is likewise applied to the Q-value function. Moreover, if the feedback SOC at time step t + 1 is located in the feasible zone (between the lower and upper boundaries), a reward is applied to the Q-value function, and the closer the feedback SOC is to the reference, the greater the reward.
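Since the exact functional form of Equation (11) is not reproduced in the text, the following sketch only mirrors its described sign structure; the linear penalty/reward shape and the gains k_pen and k_rew are assumed values.

```python
def reward(soc_f, soc_ref, lower, upper, k_pen=100.0, k_rew=1.0):
    """Punish deviation outside [lower, upper]; reward closeness inside."""
    if soc_f > upper:
        return -k_pen * (soc_f - upper)   # the further above, the worse
    if soc_f < lower:
        return -k_pen * (lower - soc_f)   # the further below, the worse
    half = max((upper - lower) / 2.0, 1e-9)
    return k_rew * (1.0 - abs(soc_f - soc_ref) / half)
```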
(4) The ε-greedy algorithm. To realize self-learning control, the ε-greedy algorithm is deployed, which is formulated by

a_t = a random action in A, if r_n < ε; a_t = argmax_a Q(s_t, a), otherwise

where r_n is a random number ranging from 0 to 1. The training procedure is summarized as follows:

1: initialize the Q-table
2: for each episode do
3:   for t = 1, T do
4:     observe the current state s_t (s_t = SOC_f − SOC_ref)
5:     select the action a_t with the ε-greedy algorithm
6:     execute the action a_t and observe the next state
7:     calculate the immediate reward based on Eq. (11)
8:     update the Q-table by the Q-value update rule
9:   end
10:  if the feedback SOC is bigger than 0.85 or lower than 0.25, or abs(s_t) is bigger than 0.04
11:    continue to the next episode
12:  end
13: end
14: end

As shown in Figure 8, the design process is divided into three steps: off-line training, off-line verification, and hardware-in-the-loop (HIL) testing with the controller.
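Putting the pieces together, a self-contained sketch of one training episode is given below. The env object is a hypothetical stand-in for the PHEB simulation loop (with the PMP inner optimization inside env.step), and state_index and reward_fn mirror the sketches above.

```python
import numpy as np

# The 27 co-state actions listed above.
ACTIONS = np.array([-2800, -2900, -3000, -3100, -3200, -3250, -3275, -3300,
                    -3325, -3350, -3375, -3390, -3400, -3410, -3425, -3450,
                    -3475, -3500, -3525, -3550, -3575, -3600, -3650, -3700,
                    -3800, -3900, -4000], dtype=float)
Q = np.zeros((100, len(ACTIONS)))
ALPHA, GAMMA = 0.95, 0.8

def run_episode(env, eps, state_index, reward_fn):
    """One episode of the listed training procedure (hypothetical env API)."""
    soc_f, soc_ref, lower, upper = env.reset()
    s = state_index(soc_f, soc_ref)
    done = False
    while not done:
        # epsilon-greedy action selection
        a = (np.random.randint(len(ACTIONS)) if np.random.rand() < eps
             else int(np.argmax(Q[s])))
        soc_f, soc_ref, lower, upper, done = env.step(ACTIONS[a])
        r = reward_fn(soc_f, soc_ref, lower, upper)
        s_next = state_index(soc_f, soc_ref)
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        # early-termination guard from the listing above
        if soc_f > 0.85 or soc_f < 0.25 or abs(soc_f - soc_ref) > 0.04:
            break
```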

Results and Discussion
A series of historical driving cycles of the PHEB is deployed for training and testing. The total length of the route is about 50 km, with 39 bus stops; the number of passengers at each stop is assumed to be random. Moreover, as shown in Figure 9, a series of combined driving cycles, each including a driving cycle and a passenger mass profile, is designed in this paper.


The Training Process
To ensure that the RL-based energy management has good control performance, a well-trained Q-table is indispensable. Therefore, six combined driving cycles are first designed to train the Q-table. The Q-table is trained continually, one combined driving cycle after another, and is trained for 100 episodes on each combined driving cycle with different ε values. Specifically, the training is divided into three stages: in the first stage, ε is set to 0.5 before episode 45, which implies that the action is selected randomly with a probability of 50%; in the second stage, ε is set to 0.15 between episodes 46 and 75, which implies that the action is selected randomly with a probability of 15%; in the third stage, ε is set to 0 between episodes 76 and 100, which implies that the action is always selected by the trained Q-table.
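The staged exploration schedule can be expressed as a simple function of the episode index; a minimal sketch:

```python
def epsilon_for(episode):
    """Three-stage epsilon per 100-episode run on one combined driving cycle."""
    if episode <= 45:
        return 0.5    # stage 1: explore half of the time
    if episode <= 75:
        return 0.15   # stage 2: mostly exploit the trained Q-table
    return 0.0        # stage 3: pure exploitation
```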
As shown in Figure 10, the combined driving cycle 1 is first selected to train the Q-table (named Q-table 1). In the first stage, the agent strives to probe the possible actions. In this case, the final SOCs are higher than 0.6, which implies that Q-table 1 is not yet well trained. In the second stage, the RL partly selects the action based on the trained Q-table 1, whilst still probing possible actions through the ε-greedy algorithm. In this case, the SOCs fluctuate around 0.3, which implies that Q-table 1 has been better trained. In the third stage, the final SOC easily reaches the objective value, and the feedback SOC trajectory easily follows the reference SOC trajectory. This implies that Q-table 1 has been well trained for the combined driving cycle 1.

As shown in Figure 11, the combined driving cycle 2 is deployed to continually train the Q-table (Q-table 2), based on Q-table 1. The RL is still terminated in advance in the first stage. Moreover, the final SOCs are higher than 0.5, which is lower than the final SOCs in Figure 10. In the second stage, the final SOCs satisfy the control objective, with good control performance. In the third stage, the Q-table has been well trained: the final SOC satisfies the control objective, and the SOC trajectory follows the reference SOC trajectory well. This implies that Q-table 2 has been better trained compared to the Q-table in Figure 10.

As shown in Figure 12, the combined driving cycle 3 is deployed to continually train the Q-table (Q-table 3), based on the well-trained Q-table 2. In the first stage, the control performance is greatly improved compared to the first stages for Q-tables 1 and 2; however, the final SOCs still do not satisfy the control objective. In contrast, the control performance deteriorates in stage 2 compared to Q-table 2. This implies that the driving conditions may differ from combined driving cycle 2 and that the generalization performance of the Q-table should be further improved. Nevertheless, the control performance satisfies the control objective during the third stage, so Q-table 3 is recognized as well trained.
As shown in Figure 13, the combined driving cycle 4 is deployed to continually train the Q-table (Q-table 4), based on the well-trained Q-table 3. Similar to the first stage in Figure 12, Q-table 4 is not yet well trained, because the final SOCs do not reach 0.3. However, the control performance is greatly improved in the second stage and satisfies the control objective in stage 3. This implies that Q-table 4 has been well trained.
It can be seen from Figures 14 and 15 that Q-table 4 has been well trained, because the control performance is satisfied in all three stages regardless of the ε value. This implies that the generalization performance of Q-table 4 has been greatly improved. Its generalization performance is further improved by the trainings on combined driving cycles 5 and 6; accordingly, Q-table 6, obtained after this continual training, is taken as the well-trained Q-table.

The Off-Line Verification
To verify the generalization and reliability of Q-table 6, the combined driving cycle 7 (denoted No. 7) and combined driving cycle 8 (denoted No. 8) are deployed. From Figure 16, it can be seen that the co-state is well adjusted based on the driving cycles and Q-table 6. Moreover, the final SOCs satisfy the control objective, and the SOC trajectories remain within the designed boundaries and easily follow the SOC reference trajectories. This implies that the RL-based strategy has great potential for practical application. In addition, a rule-based energy management is also deployed to evaluate the fuel economy of the RL-based energy management method. From Table 1, it can be seen that the fuel consumption for conditions 7 and 8 is reduced by 10.95% and 11.78%, respectively.


The Hardware-in-the-Loop Simulation Verification
As shown in Figure 17, a HIL test system, mainly including the HCU, a switch, an upper computer, a CAN communication interface, a DC 12 V power supply, and a Kvaser interface, is built to verify the real-time performance and reliability of the RL-based energy management. Here, the upper computer transmits CAN signals to the HCU through the Kvaser interface.

As shown in Figure 18, a HIL simulation model is built based on the D2P rapid prototyping control system and the well-trained strategy. It mainly includes three modules: Input, HCU, and Output, where the HCU module embeds the RL-based energy management.

Additionally, the combined driving cycle 9 (denoted No. 9) is also deployed. From Figure 19, it can be seen that the co-state is adjusted in real time based on the well-trained strategy, and that the control performance is sufficiently satisfied. Moreover, the fuel consumption is decreased by 12.92% compared to the rule-based strategy, as shown in Table 2.
Figure 19. The HIL test results of combined driving cycle 9: (a) the co-state; (b) the SOC trajectory.

Conclusions
This paper proposes an RL-based energy management method based on a novel dynamic SOC design zone plan. The main conclusions are summarized as follows.
Firstly, the proposed dynamic SOC design zone plan method is feasible and applicable. The fuel consumption is greatly decreased compared to the rule-based energy management.
Secondly, the agent of the RL-based energy management can be well trained and has good generalization performance. Moreover, the trained strategy can be easily embedded into the controller, and the real-time control performance is well satisfied. It has great potential to be used in practice.
Future work will focus on further verification of the RL-based energy management in real vehicles.
