Reinforcement Learning–Based Energy Management Strategy for a Hybrid Electric Tracked Vehicle



Introduction
In recent years, hybrid electric vehicles (HEVs) have been widely used to reduce fuel consumption and emissions. In these vehicles, an energy management strategy controls the power distribution among multiple energy storage systems [1,2]. This strategy realizes several control objectives, such as meeting the driver's power demand, shifting gears optimally, and regulating the battery state of charge (SOC). Many optimal control methods have been proposed for designing energy management strategies in HEVs. For instance, when a vehicle follows a known driving cycle, the deterministic dynamic programming (DDP) approach can be used to obtain globally optimal results [3-5]. Previous studies have also applied the stochastic dynamic programming (SDP) approach to exploit the probabilistic statistics of the power request [6,7]. Pontryagin's minimum principle was introduced in [8,9], and an equivalent consumption minimization strategy was suggested in [10-12] to obtain optimal control solutions. Furthermore, model predictive control was introduced in [13] and convex optimization was presented in [14]. Recently, game theory [15] and reinforcement learning (RL) [16] have attracted research attention for HEV energy management. RL is a heuristic learning method applied in numerous areas, such as robotic control, traffic management, and energy management. For example, previous studies have applied RL approaches for robotic control, enabling robots to learn and adapt to new situations online [17,18]. Furthermore, [19] proposed an RL approach that enables a set of unmanned aerial vehicles to automatically determine patrolling patterns in a dynamic environment.
The aforementioned RL studies have not evaluated energy management strategies for HEVs. A power management strategy for an electric hybrid bicycle was presented in [20]; however, its powertrain is simpler than that of an HEV, and the power is not distributed among multiple power sources. In the current study, RL was applied to solve the energy management problem of a hybrid electric tracked vehicle (HETV). The statistical characteristics of an experimental driving schedule were extracted as a transition probability matrix of the power request. The energy management problem was formulated as a stochastic nonlinear optimal control problem with two state variables, the battery SOC and the rotational speed of the generator, and one control variable, the engine throttle signal. Subsequently, the Q-learning and Dyna algorithms were applied to determine an energy management strategy that improves fuel economy and sustains the battery charge. Furthermore, the RL-based energy management strategy was compared with a dynamic programming (DP)-based strategy. The simulation results indicated that the Q-learning algorithm entailed a lower computational cost (3 h) than the Dyna algorithm (7 h), although its fuel consumption was 1.7% higher. The Dyna algorithm registered almost the same fuel consumption as the DP-based global optimal solution and is computationally more efficient than SDP. However, because of their computational burdens, the Q-learning and Dyna algorithms cannot yet be used in online operation, and further research on real-time application is required.
The remainder of this paper is organized as follows: Section 2 models the hybrid powertrain and formulates the optimal control problem. Section 3 develops a statistical information model based on the experimental driving schedule and presents the Q-learning and Dyna algorithms. Section 4 compares the RL-based energy management strategy with the DP- and SDP-based strategies. Section 5 concludes this paper.

Hybrid Powertrain Modeling
Figure 1 shows a heavy-duty HETV with a dual-motor drive structure. The powertrain comprises two main power sources: an engine-generator set (EGS) and a battery pack. The dashed arrow lines in the figure indicate the directions of the power flows. To ensure a fast yet adequately accurate simulation, a quasi-static modeling methodology [21] was used to model the power request of the hybrid powertrain. Table 1 lists the vehicle parameters used in the model.

Power Request Model
Assume that only longitudinal motions are considered [4]. The torques of the two motors then follow from the longitudinal and yaw dynamics of the vehicle:

T_1 = \frac{r}{i_0 \eta}\left(\frac{m_v \dot{v} + F_1 + F_2}{2} - \frac{I_z \dot{\omega}_z + M}{B}\right)  (1)

T_2 = \frac{r}{i_0 \eta}\left(\frac{m_v \dot{v} + F_1 + F_2}{2} + \frac{I_z \dot{\omega}_z + M}{B}\right)  (2)

where T1 and T2 are the torques of the inside and outside motors, respectively; ω1 and ω2 are the rotational speeds of the inside and outside motors, respectively; v = (v1 + v2)/2 is the vehicle speed and ωz = (v2 − v1)/B is the yaw rate; r is the radius of the sprocket; Iz is the yaw moment of inertia; η is the efficiency from the motor shafts to the tracks; i0 is the fixed gear ratio between the motors and sprockets; B is the vehicle tread; R is the turning radius of the vehicle; mv is the curb weight; and F1 and F2 are the rolling resistance forces of the two tracks. The yaw resistance moment from the ground, M, is evaluated as

M = \frac{u_t m_v g L}{4}  (3)

where g is the acceleration of gravity and L is the track contact length. The lateral resistance coefficient ut is computed empirically [22]:

u_t = \frac{u_{max}}{0.925 + 0.15\,|R|/B}  (4)

where umax is the maximum value of the lateral resistance coefficient. The turning radius R is expressed as

R = \frac{B (v_1 + v_2)}{2 (v_2 - v_1)}  (5)

The rotational speeds of the two motors, ω1 and ω2, are calculated as

\omega_1 = \frac{i_0 v_1}{r}, \qquad \omega_2 = \frac{i_0 v_2}{r}  (6)

where v1 and v2 are the speeds of the two tracks. The rolling resistance forces acting on the two tracks are obtained using

F_1 = F_2 = \frac{f m_v g}{2}  (7)

where f is the rolling resistance coefficient. The power request Preq must be balanced by the two motors at every instant:

P_{req} = (T_1 \omega_1 + T_2 \omega_2)\, \eta_{em}^{-\operatorname{sgn}(T_1 \omega_1 + T_2 \omega_2)}  (8)

where ηem is the efficiency of the motors. When the power request is positive, electric power is delivered to propel the vehicle and the efficiency enters as a divisor; conversely, when the powertrain absorbs electric power (e.g., during regenerative braking [23]), the efficiency enters as a multiplier.
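To make the power-request computation concrete, the relations above can be collected into a short routine. This is a minimal sketch with invented placeholder parameters (the mass, gear ratio, efficiencies, and geometry below are illustrative, not the vehicle data of Table 1), assuming steady-state yaw during steering:

```python
def power_request(v1, v2, a, mv=25000.0, r=0.3, i0=10.0, B=2.7, L=4.2,
                  Iz=1.0e5, f=0.04, umax=0.8, eta=0.95, eta_em=0.9, g=9.81):
    """Illustrative power request from track speeds v1, v2 (m/s) and
    longitudinal acceleration a (m/s^2); all parameters are placeholders."""
    # Rolling resistance, split evenly between the two tracks
    F1 = F2 = 0.5 * f * mv * g
    if abs(v2 - v1) > 1e-6:                        # steering motion
        R = B * (v1 + v2) / (2.0 * (v2 - v1))      # turning radius
        ut = umax / (0.925 + 0.15 * abs(R) / B)    # empirical lateral coefficient
        M = ut * mv * g * L / 4.0                  # yaw resistance moment
    else:                                          # straight-line motion
        M = 0.0
    wz_dot = 0.0                                   # steady-state yaw assumed
    # Track propelling forces from the longitudinal and yaw balance
    Fp1 = 0.5 * (mv * a + F1 + F2) - (Iz * wz_dot + M) / B
    Fp2 = 0.5 * (mv * a + F1 + F2) + (Iz * wz_dot + M) / B
    # Motor torques and motor speeds
    T1, T2 = r * Fp1 / (i0 * eta), r * Fp2 / (i0 * eta)
    w1, w2 = i0 * v1 / r, i0 * v2 / r
    p_mech = T1 * w1 + T2 * w2
    # Sign-dependent motor efficiency: divisor when propelling, multiplier when braking
    return p_mech / eta_em if p_mech >= 0 else p_mech * eta_em
```

During braking (negative mechanical power) the efficiency multiplies rather than divides, so less electrical power is recovered than the tracks absorb, and steering always raises the request above the straight-line value because of the yaw resistance moment.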

EGS Model
Figure 2 illustrates the equivalent electric circuit of the engine, permanent magnet generator, and rectifier, where ωg is the rotational speed of the generator, Tg is the electromagnetic torque, Ke is the coefficient of the electromotive force (so that Ke ωg is the electromotive force), and Kx ωg Ig is the equivalent voltage drop of the rectifier; Kx is calculated as

K_x = \frac{3 K L_g}{\pi}  (9)

where K is the number of poles and Lg is the synchronous inductance of the armature. The output voltage of the generator Ug and the electromagnetic torque Tg are computed as follows [4]:

U_g = K_e \omega_g - K_x \omega_g I_g, \qquad T_g = K_e I_g  (10)

where Ig is the output current of the generator. The rotational dynamics of the EGS are

\left(J_e i_{eg}^2 + J_g\right) \dot{\omega}_g = i_{eg} T_{eng} - T_g, \qquad \omega_{eng} = i_{eg}\, \omega_g  (11)

where neng and Teng are the rotational speed and torque of the engine, respectively; Je and Jg are the moments of inertia of the engine and generator; and ieg is the fixed gear ratio connecting the engine and generator. The power request is balanced at every instant by the EGS and the battery:

P_{req}(t) = U_g I_g + U_b I_b  (12)

where Ub and Ib are the voltage and current of the battery, respectively. The engine must be restricted to a specific working area to ensure safety and reliability:

T_{eng,min}(n_{eng}) \le T_{eng} \le T_{eng,max}(n_{eng}), \qquad n_{eng,min} \le n_{eng} \le n_{eng,max}  (13)

The fuel mass flow rate ṁf (g/s) was determined from the engine torque Teng and speed neng by using a brake-specific fuel consumption map, which is typically obtained through a bench test. The control variable, the engine throttle signal u_th(t), was normalized to the range [0, 1], and the engine torque was regulated to control the power split between the EGS and the battery so as to minimize fuel consumption.
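As a concrete illustration of the equivalent circuit, the generator operating point can be found by solving Ug = Ke ωg − Kx ωg Ig together with the power balance Ug Ig = Pg. The sketch below uses invented coefficient values for Ke and Kx; they are not identified machine parameters:

```python
import math

def egs_output(omega_g, p_gen, Ke=1.2, Kx=0.002):
    """Solve the generator current and voltage from the equivalent circuit
    Ug = Ke*omega_g - Kx*omega_g*Ig with Ug*Ig = p_gen (illustrative values)."""
    # (Ke - Kx*Ig) * omega_g * Ig = p_gen  ->  quadratic in Ig
    a, b, c = Kx * omega_g, -Ke * omega_g, p_gen
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("power request exceeds generator capability")
    Ig = (-b - math.sqrt(disc)) / (2.0 * a)   # smaller root: stable operating branch
    Ug = Ke * omega_g - Kx * omega_g * Ig
    return Ug, Ig
```

Taking the smaller root keeps the terminal voltage on the high-voltage, low-current branch of the quadratic, which is the physically stable operating point.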

Battery Model
The battery SOC is the second state variable and is calculated as follows:

\dot{SOC}(t) = -\frac{I_b(t)}{Q_b}  (14)

I_b(t) = \frac{V_{oc} - \sqrt{V_{oc}^2 - 4 R_i P_b(t)}}{2 R_i}  (15)

where Qb is the battery capacity, Ib is the battery current, Voc is the open-circuit voltage, Ri is the internal resistance, and Pb is the output power of the battery. To ensure reliability and safety, the current and SOC are constrained as

I_{b,min} \le I_b(t) \le I_{b,max}, \qquad SOC_{min} \le SOC(t) \le SOC_{max}  (16)

Figure 4 shows the Voc and Ri parameters [4]. The cost function trades off fuel consumption against charge sustenance in the battery:

J = \int_{t_0}^{t_f} \left[\dot{m}_f(t) + \beta \left(SOC(t) - SOC(t_0)\right)^2\right] dt  (17)

where β is a positive weighting factor, normally identified through multiple simulation iterations, and [t0, tf] is the entire time span.
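A one-step discretization of the battery model above can be sketched as follows; the capacity, open-circuit voltage, and internal resistance are invented illustrative constants (in practice Voc and Ri depend on the SOC, as Figure 4 shows):

```python
import math

def battery_step(soc, p_b, dt=1.0, Qb=90000.0, Voc=600.0, Ri=0.08):
    """One Euler step of the internal-resistance battery model.
    Qb is the capacity in coulombs (A*s); p_b > 0 discharges the battery."""
    disc = Voc * Voc - 4.0 * Ri * p_b
    if disc < 0:
        raise ValueError("battery cannot supply this power")
    Ib = (Voc - math.sqrt(disc)) / (2.0 * Ri)   # current from Pb = (Voc - Ri*Ib)*Ib
    soc_next = soc - Ib * dt / Qb               # SOC decreases on discharge
    return soc_next, Ib
```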

RL-Based Energy Management Strategy
RL is a machine learning approach in which an agent senses the state of its environment and acts on the environment through its actions under a control policy. In the proposed model, the control policy is improved iteratively by two RL algorithms, Q-learning and Dyna. The environment provides numerical feedback, called a reward, and a transition probability matrix to the agent. A transition probability matrix is extracted from the sampled velocity and power-request information of the driving schedule statistical model. The RL algorithm then uses this transition probability matrix to optimize fuel consumption on another driving schedule.

Statistic Information of the Driving Schedule
A long natural driving schedule, including significant accelerations, braking, and steering (Figure 5), was obtained through a field experiment. The power request corresponding to the driving schedule was calculated according to Equations (1)-(8) (Figure 6). Maximum likelihood estimation and the nearest-neighbor method were employed to compute the transition probability of the power request [24]:

p_{ik,j} = \Pr\left(P_{req}^{j} \mid P_{req}^{i}, \bar{v}_k\right) = \frac{N_{ik,j}}{N_{ik}}  (18)

where N_{ik,j} is the number of transitions from P^i_req to P^j_req that occurred at an average vehicle velocity of v̄k, and N_{ik} = Σj N_{ik,j} is the total number of occurrences of P^i_req at the average velocity v̄k.
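The maximum likelihood estimate is simply a normalized transition count. The sketch below shows the single-velocity case for brevity, with assumed quantization bins; the full model additionally conditions the counts on the average velocity v̄k:

```python
import numpy as np

def transition_matrix(p_seq, bins):
    """Maximum-likelihood transition matrix of a quantized power-request
    sequence (single-velocity case; bins are assumed quantization edges)."""
    idx = np.digitize(p_seq, bins)                # quantize the power request
    n = len(bins) + 1
    counts = np.zeros((n, n))
    for i, j in zip(idx[:-1], idx[1:]):
        counts[i, j] += 1.0                       # N_{i,j}
    totals = counts.sum(axis=1, keepdims=True)    # N_i
    with np.errstate(invalid="ignore"):
        return np.where(totals > 0, counts / totals, 0.0)
```

Rows with no observations are left as zeros; in practice a smoothing technique such as that of [25] would redistribute probability mass to unseen transitions.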
A smoothing technique was applied to the estimated parameters [25]. Figure 7 illustrates the transition probability map at a velocity of 25 km/h. Following the Markov decision processes (MDPs) introduced in [26], the driving schedule was modeled as a finite MDP. The MDP comprises the set of states S = {(SOC(t), neng(t)) | 0.2 ≤ SOC(t) ≤ 0.8, neng,min ≤ neng(t) ≤ neng,max}, the set of actions a = {u_th(t)}, the reward function r = ṁf(s, a), and the transition function p_{sa,s'}, where p_{sa,s'} is the probability of transitioning from state s to state s' under action a.

Q-Learning and Dyna Algorithms
When π is a complete decision policy, the optimal value of a state s is defined as the expected discounted sum of rewards [27]:

V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]  (19)

where γ ∈ [0,1] is the discount factor. Because the reward here is the fuel mass flow rate, i.e., a cost, the optimal value function minimizes this sum; it is unique and satisfies

V^{*}(s) = \min_{a}\left(r(s,a) + \gamma \sum_{s'} p_{sa,s'}\, V^{*}(s')\right)  (20)

Given the optimal value function, the optimal policy is specified as

\pi^{*}(s) = \arg\min_{a}\left(r(s,a) + \gamma \sum_{s'} p_{sa,s'}\, V^{*}(s')\right)  (21)

The optimal Q value of a state-action pair (s, a) is defined recursively as

Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} p_{sa,s'} \min_{a'} Q^{*}(s',a')  (22)

The variable V*(s) is the value of s assuming the optimal action is taken initially; therefore, V*(s) = min_a Q*(s,a) and π*(s) = arg min_a Q*(s,a). The Q-learning update rule is

Q(s,a) := Q(s,a) + \alpha\left(r + \gamma \min_{a'} Q(s',a') - Q(s,a)\right)  (23)

where α ∈ [0,1] is a step-size factor that decays over the iterations. Unlike the Q-learning algorithm, the Dyna algorithm maintains and updates a model of the environment while interacting with it. For a tracked vehicle, the Dyna algorithm records sample information as the vehicle operates on a new driving schedule; the incremental statistics are then used to update the reward and transition functions. The Dyna update rule is

Q(s,a) := Q(s,a) + \alpha\left(\hat{r}(s,a) + \gamma \min_{a'} Q(s',a') - Q(s,a)\right)  (24)

where the update is applied to both real and model-simulated transitions, and the estimated reward r̂ and transition probabilities p̂_{sa,s'} are time variant, changing as the driving schedule statistics are updated. The Dyna algorithm clearly entails a heavier computational burden per iteration than the Q-learning algorithm does. Section 4 compares the optimality of the two algorithms, and Figure 8 depicts their computational flowchart.
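The Q-learning update can be sketched as a short tabular routine. This is a minimal illustration on a generic sampled MDP, not the paper's implementation: the state discretization, exploration schedule, and driving-schedule model are omitted, and the behaviour policy is uniform random (Q-learning is off-policy, so the minimum-cost values are learned regardless):

```python
import numpy as np

def q_learning(P, R, n_iter=5000, gamma=0.95, alpha0=0.5, seed=0):
    """Tabular Q-learning for a cost-minimizing MDP.
    P has shape (n_a, n_s, n_s): P[a][s] is the transition row under action a.
    R has shape (n_s, n_a): immediate cost (fuel use per step in this setting)."""
    rng = np.random.default_rng(seed)
    n_s, n_a = R.shape
    Q = np.zeros((n_s, n_a))
    s = 0
    for k in range(1, n_iter + 1):
        a = int(rng.integers(n_a))               # uniform random behaviour policy
        s2 = int(rng.choice(n_s, p=P[a][s]))     # sample next state from the model
        alpha = alpha0 / k ** 0.5                # decaying step size
        # Move Q(s, a) toward the minimum-cost Bellman target
        Q[s, a] += alpha * (R[s, a] + gamma * Q[s2].min() - Q[s, a])
        s = s2
    return Q
```

A Dyna variant would, in addition, re-estimate R and P from the newly observed transitions and replay updates through the learned model, which is what makes it costlier per iteration.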

Comparison between the Q-Learning and Dyna Algorithm
Figure 9 shows the experimental driving schedule used in the simulation. Figure 10 illustrates the mean discrepancy of the two algorithms at v = 25 km/h, where the mean discrepancy is the deviation between two Q values per 100 iterations. The mean discrepancy declined over the iterative computations, indicating that both the Q-learning and Dyna algorithms converge. Figure 10 also shows that the Dyna algorithm converges faster than the Q-learning algorithm; a possible explanation is that the time-variant reward function and transition function in the Dyna algorithm accelerate the convergence [28]. Figure 11 depicts the simulation results of the two algorithms. Because of the charge-sustenance term in the cost function, the final SOC values were close to the initial SOC value. Figure 11b shows the fuel consumption and working points of the engine. An SOC-correction method [29] was applied to compensate for the fuel consumption caused by the differing final SOC values. Figure 12 illustrates the performance of the two algorithms. Table 2 lists the fuel consumption; the fuel consumption of the Dyna algorithm is lower than that of the Q-learning algorithm, which is attributable to the Dyna algorithm's time-variant reward and transition functions. Table 3 shows the computation times of the two algorithms; the Dyna algorithm takes longer than the Q-learning algorithm because its update rule refreshes the reward function and the transition probability at a certain step size [28]. Thus, the updated transition probability and reward function of the Dyna algorithm result in lower fuel consumption but a longer computation time.

Comparative Analysis of the Results of Dyna Algorithm, SDP, and DP
To validate the optimality of the RL technique, the Dyna algorithm, SDP [24], and DP [30] were applied to the experimental driving schedule shown in Figure 9; Figure 13 presents the simulation results. The SOC terminal values were close to the initial values because of the final constraint in the cost function. Figure 13b illustrates the engine work area, indicating that the engine frequently operates in a low fuel consumption region to ensure optimal fuel economy. Table 4 lists the fuel consumption after SOC correction: the Dyna-based fuel consumption was lower than the SDP-based fuel consumption and extremely close to the DP-based fuel consumption. Table 5 shows the computation times of the three algorithms. Because of the policy iteration process in SDP, the SDP-based computation time was considerably longer than the Dyna- and DP-based computation times. Because the Dyna-based control policy is extremely close to the DP-based optimal control policy, the Dyna algorithm has the potential to support a real-time control strategy in the future. When the present power request is treated as part of a continuous process, the next power request of the vehicle can be predicted accurately using the methods introduced in [31,32]. When the predicted power request is combined with the Dyna algorithm, the reward function and transition probability matrix can be updated; furthermore, the computation time can be reduced when the transition probability matrix is updated as in [31]. Finally, the power split at the next time step can be determined, and real-time control can be implemented.

Conclusions
In this study, the RL method was employed to derive an optimal energy management policy for an HETV. The update rules of the Q-learning and Dyna algorithms were elucidated, and the two algorithms were applied to the same experimental driving schedule to compare their optimality and computation times. The simulation results indicated that the Dyna algorithm achieves better fuel economy than the Q-learning algorithm does; however, its computation time is considerably longer. The optimality of the Dyna algorithm was validated by comparison with the DP and SDP methods. The results showed that the Dyna-based control policy is more effective than the SDP-based control policy and close to the DP-based optimal control policy. In future studies, the Dyna algorithm will be used to realize real-time control by predicting the next power request with a stationary Markov chain-based transition probability model.

Figure 3 depicts the results of the EGS test and the simulation run performed to validate the effectiveness of the equivalent electric circuit model; Ug and neng are predicted with acceptable accuracy during the pulse transient current load.

Figure 2. Equivalent circuit of the engine-generator set.

Figure 3. Test and simulation results of the equivalent circuit.

Figure 5. Long driving schedule of the tracked vehicle.

Figure 6. Power request of the long driving schedule.

Figure 7. Power request transition probability map at 25 km/h.

Figure 8. Computational flowchart of the Q-learning and Dyna algorithms. * The MDP toolbox is introduced in [26].

Figure 9. Experimental driving schedule used in the simulation.

Figure 10. Mean discrepancy of the value function in the Q-learning and Dyna algorithms.


Figure 11. SOC trajectories (a) and engine operation area (b) in the Q-learning and Dyna algorithms.

Figure 12. Battery and engine power in the Q-learning and Dyna algorithms.

Figure 13. SOC trajectories (a) and engine operation area (b) in the Dyna algorithm, SDP, and DP.

Table 2. Fuel consumption in the Q-learning and Dyna algorithms.

Table 3. Computation times of the Q-learning and Dyna algorithms.

Table 4. Fuel consumption in the Dyna algorithm, SDP, and DP.

Table 5. Computation times of the Dyna algorithm, SDP, and DP.
a A 2.4 GHz microprocessor with 12 GB RAM was used.