An Online Learning Control Strategy for Hybrid Electric Vehicle Based on Fuzzy Q-learning

To realize online learning of a hybrid electric vehicle (HEV) control strategy, a fuzzy Q-learning (FQL) method is proposed in this paper. The FQL control strategy consists of two parts: the optimal action-value function Q*(x,u) estimator network (QEN) and the fuzzy parameters tuning (FPT). A back propagation (BP) neural network is applied as the QEN to estimate Q*(x,u). For the fuzzy controller, we choose a Sugeno-type fuzzy inference system (FIS), whose parameters are tuned online based on Q*(x,u). An action exploration modifier (AEM) is introduced to guarantee that all actions are tried. The main advantage of the FQL control strategy is that it does not rely on prior information about future driving conditions and can self-tune the parameters of the fuzzy controller online. The FQL control strategy has been applied to an HEV and tested in simulation. Simulation results indicate that the parameters of the fuzzy controller are tuned online and that the FQL control strategy achieves good fuel economy.


Introduction
Hybrid electric vehicles (HEVs), which combine the advantages of fuel vehicles and pure electric vehicles, are the future of road vehicles. The control strategy is one of the key technologies for an HEV and plays a decisive role in the performance of the vehicle. However, designing a highly efficient, real-time control strategy is a challenging task due to the complex structure of an HEV and the uncertainty of the driving cycle.
Many existing control strategies are rule-based [1][2][3][4], such as the thermostatic strategy, the load-following strategy, and the electric assist strategy. These control strategies have been developed from the results of extensive experimental trials and human expertise. Some other control strategies employ heuristic control techniques, with the resultant strategies formalized as fuzzy rules. Although these rule-based strategies are effective and easy to implement, their optimality and flexibility are critically limited by working conditions. Therefore, a control strategy that performs well under certain conditions may not provide satisfactory results under other conditions.
According to the literature [5][6][7][8][9][10], model-based global optimization methods such as dynamic programming (DP), sequential quadratic programming (SQP), and genetic algorithms (GA) have been employed in control strategy design to optimize the operation of the HEV drivetrain. Usually, these algorithms can determine the optimal power split between the engine and the motor for a particular driving cycle. However, the power-split solutions obtained are only optimal with respect to that specific driving cycle and, in general, are neither optimal nor charge-sustaining for other cycles. Unless future driving conditions can be predicted during real-time operation, there is no way to apply these control laws directly. Moreover, these methods suffer from the "curse of dimensionality" problem, which prevents their wide adoption in real-time applications. In conclusion, control strategy designs built upon global optimization techniques can serve to evaluate the potential fuel economy of a given drivetrain configuration, as well as the optimality of realizable control strategies.
Several studies that used neural networks to optimize the parameters of fuzzy controllers report good fuel economy and system efficiency [11][12][13]. In these studies, fuzzy controllers can be designed easily and directly by optimizing parameters such as the shape of the membership functions. The resulting strategy shows about 2%-4% better fuel economy than the "fuzzy controller only" optimization result. However, this strategy uses fixed parameters for optimization, which makes it an offline optimization strategy; thus, the parameters of the fuzzy controller cannot vary from environment to environment.
To adapt to different driving cycles, researchers have proposed model predictive control (MPC) for HEVs, which is a closed-loop optimal control strategy [14][15][16][17][18][19]. To obtain the current control action, an optimal control problem over a finite horizon is solved at each sampling instant. Prediction of the future with a dynamic model, control action based on online rolling optimization, and feedback correction of the model error are the core features of the algorithm. This control strategy has the advantages of good control performance and strong robustness. During this process, constraints, uncertainty, nonlinearity, and the controlled and manipulated variables are dealt with effectively. However, the MPC algorithm needs to solve an optimal control problem at each decision step, so when the prediction or control horizon is very long, the large amount of calculation makes the algorithm hard to execute in real time.
To make a control strategy adaptive to different driving cycles and convenient for practical application, we propose an approach to tune fuzzy controllers based on fuzzy Q-learning (FQL). The FQL algorithm consists of two parts: a Q-function estimation network (QEN) and fuzzy parameters tuning (FPT). A back propagation (BP) neural network is adopted to estimate and generalize the optimal action-value function Q(x,u); Q(x,u) and an evaluation signal are then used to guide the tuning of the fuzzy controller parameters so that the fuzzy controller achieves better performance. Unlike traditional Q-learning algorithms, the optimal action is not obtained directly from approximated values of Q(x,u) and candidate discrete actions; rather, a fuzzy inference system (FIS) is applied to provide a continuous control output. Compared with the Q-learning algorithm, the FIS is introduced for three purposes: to enhance generalization over the state space and generate continuous actions; to avoid the "curse of dimensionality" in continuous systems; and to tune the parameters and structure of the FIS online so that it adapts better to external changes caused by the environment. The reduced computational load makes the FQL algorithm more convenient for practical applications.

Problem Formulation
The prototype vehicle is a single-axis parallel HEV, and the drivetrain structure of the HEV is shown in Figure 1. The drivetrain is composed of an engine, an electric traction motor/generator, Ni-MH batteries, an automatic clutch, and an automatic/manual transmission system. The motor is directly linked between the automatic clutch output and the transmission input. This architecture provides regenerative braking during deceleration and allows efficient motor-assist operation. To provide pure electric propulsion, the engine can be disconnected from the drivetrain by the automatic clutch. Important parameters of this vehicle are given in Table 1. The state vector of the HEV system includes three state variables, i.e., X(k) = (Tdem(k), v(k), SOC(k))^T, where Tdem(k) stands for the required torque at time k, v(k) is the vehicle speed, and SOC(k) represents the remaining charge of the battery at time k. The control vector is U(k) = Te(k), where Te(k) represents the output torque of the engine. The motor output torque Tm(k) is obtained by subtracting Te(k) from Tdem(k). A torque split control strategy, which defines the best torque split between the engine and the motor, is adopted.
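The state, control, and torque-split relation described above can be sketched in a few lines. The class name `HevState`, the field names, and the numeric values are illustrative assumptions; only the relation Tm(k) = Tdem(k) - Te(k) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class HevState:
    """State X(k) = (Tdem, v, SOC); names and units here are illustrative."""
    t_dem: float  # required torque at time k
    v: float      # vehicle speed at time k
    soc: float    # remaining battery charge at time k, as a fraction


def motor_torque(state: HevState, t_e: float) -> float:
    """Tm(k) = Tdem(k) - Te(k): the motor covers what the engine does not."""
    return state.t_dem - t_e


x = HevState(t_dem=80.0, v=12.5, soc=0.55)
assist = motor_torque(x, t_e=60.0)    # positive: the motor adds torque
absorb = motor_torque(x, t_e=100.0)   # negative: the engine exceeds the demand
print(assist, absorb)
```

A negative motor torque in this sketch corresponds to the engine delivering more than the required torque, with the surplus taken up by the motor/generator.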
The goal of the HEV control strategy is to find the optimal mapping from the observed states X(k) to the control action U(k) so as to minimize vehicle fuel consumption and emissions along a traveling route [20]. At the same time, the requirements on vehicle drivability and battery health should be satisfied. Mathematically, the control strategy of the HEV can be formulated as an infinite-horizon dynamic optimization problem that minimizes the discounted sum of costs, where R(k) is the immediate cost incurred by U(k) at time k and γ ∈ (0,1) is a discount factor that assures the convergence of the infinite sum of the cost function. One of the key benefits of an infinite-horizon problem is that the generated control strategy is time-invariant and, thus, can be easily implemented.
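Under the description above, the objective can be sketched as follows. The symbol J and the start index k = 0 are assumptions made for illustration; R(k) and γ are as defined in the text.

```latex
J = \sum_{k=0}^{\infty} \gamma^{k} R(k), \qquad \gamma \in (0,1),
\qquad\text{and if } |R(k)| \le R_{\max} \text{ for all } k:\quad
|J| \le \sum_{k=0}^{\infty} \gamma^{k} R_{\max} = \frac{R_{\max}}{1-\gamma}.
```

The bound on the right is the standard geometric-series argument for why a discount factor strictly below 1 keeps the infinite sum finite whenever the immediate cost is bounded.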
The cost function R(k) consists of the sum of the weighted fuel economy, emissions, and SOC, as shown in Equation (2).

Fuzzy Q-Learning (FQL) Mechanism
The schematic diagram of the FQL control strategy is shown in Figure 2. The FQL control strategy consists of two parts: the Q*(x,u) estimator network (QEN) and the FIS parameters tuning (FPT). A BP neural network is used as the QEN to estimate Q*(x,u). For the fuzzy controller, we choose a Sugeno-type fuzzy controller.

Back Propagation (BP) Neural Network for Estimating Q*(x,u) (QEN)
The application of reinforcement learning to control problems focuses on two main types of algorithms: actor-critic learning and Q-learning. An actor-critic learning system is a two-step process: it estimates the state value function J(x) and chooses the optimal action for each state. In Q-learning, the system estimates an action value function Q(x,u) for all state-action pairs and selects the optimal control action based on Q(x,u) [21].
The action value function Q(x,u) is the expected discounted sum of rewards obtained when starting from the initial state x with the initial action u, where u is the action that acts on the system and E(·) denotes the expected value. The optimal action-value function Q*(x,u) is the largest such value attainable over all control strategies. The QEN approximates, or predicts, the optimal action-value function Q*(x,u) associated with different input states and control outputs. A BP neural network is adopted to estimate Q*(x,u) because of its good approximation property. The architecture of the QEN is shown in Figure 3. The QEN has a three-layer topology with 4-10-1 nodes. The inputs of the QEN are the state variables of the HEV, namely the vehicle speed v(k), the battery SOC, and the required torque Tdem(k), together with the control action U(k). The output of the QEN is Q(x,u). In the QEN, Q(x,u) is computed layer by layer from the following quantities: V is the summed input of the output node; $\omega^{(2)}_{i}$ is the weight between the ith hidden node and the output node; y(i) is the output of the ith hidden node; a(i) is the summed input of the ith hidden node; $\omega^{(1)}_{j,i}$ is the weight between the jth input node and the ith hidden node; x(j) is the jth input of the QEN; and f is the activation function of the node.
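The forward pass through a 4-10-1 network of this kind might look like the sketch below. Several details are assumptions rather than statements from the text: the logistic form of the sigmoid, the use of the activation on the output node as well as the hidden nodes, the absence of bias terms, the pre-scaled input values, and all function and variable names.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic activation, assumed here as the sigmoid f."""
    return 1.0 / (1.0 + math.exp(-z))


def qen_forward(inputs, w1, w2):
    """Forward pass of a bias-free 4-10-1 network.

    inputs  : [v, SOC, Tdem, U], assumed pre-scaled to comparable ranges
    w1[j][i]: weight from input node j to hidden node i
    w2[i]   : weight from hidden node i to the output node
    """
    hidden = range(len(w2))
    # a(i): summed input of each hidden node
    a = [sum(w1[j][i] * inputs[j] for j in range(len(inputs))) for i in hidden]
    y = [sigmoid(a_i) for a_i in a]            # y(i): hidden-node outputs
    v = sum(w2[i] * y[i] for i in hidden)      # V: summed input of the output node
    return sigmoid(v), y                       # estimate of Q(x,u), hidden outputs


rng = random.Random(0)
w1 = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(4)]
w2 = [rng.uniform(-1.0, 1.0) for _ in range(10)]
q, y = qen_forward([0.4, 0.6, 0.3, 0.5], w1, w2)
print(round(q, 3), len(y))
```

With a logistic output node the estimate is confined to (0, 1), so rewards would need to be scaled accordingly; a linear output node would be an equally plausible reading.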
Here, a sigmoid function is adopted as the activation function f of the node. The parameters of the QEN are tuned based on generalized policy iteration (GPI). We can approximate the optimal action-value function with the neural network by continuously reducing the temporal-difference (TD) error δt. The objective of the neural network is to minimize a function of this error, and the weights are updated by a gradient-descent rule. Combining the two, the weight update requires the derivatives of Q(x,u) with respect to the weights, which are obtained by the chain rule for $\omega^{(2)}_{i}$ and $\omega^{(1)}_{j,i}$ (for i = 1,…,10; j = 1,…,4). In the same way we obtain ∂Q(x,u)/∂u, where the control output u of the FIS is the fourth input of the neural network.
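A minimal sketch of such an update is given below, under stated assumptions: the TD error is taken as δ = r + γ·Q(x_next, u_next) - Q(x, u), the objective as E = ½δ² with the target held constant, the sigmoid as logistic, and the network as bias-free. The learning rate `lr`, the function names, and the numeric values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Return Q and the hidden outputs of a bias-free one-hidden-layer net."""
    n_h = len(w2)
    y = [sigmoid(sum(w1[j][i] * x[j] for j in range(len(x)))) for i in range(n_h)]
    q = sigmoid(sum(w2[i] * y[i] for i in range(n_h)))
    return q, y


def td_update(x, r, q_next, w1, w2, gamma=0.9, lr=0.1):
    """One gradient-descent step on E = 0.5 * delta**2 (target held constant)."""
    q, y = forward(x, w1, w2)
    delta = r + gamma * q_next - q
    dq_dv = q * (1.0 - q)                               # derivative of output sigmoid
    for i in range(len(w2)):
        dq_da = dq_dv * w2[i] * y[i] * (1.0 - y[i])     # uses w2[i] before its update
        for j in range(len(x)):
            w1[j][i] += lr * delta * dq_da * x[j]       # dE/dw = -delta * dQ/dw
        w2[i] += lr * delta * dq_dv * y[i]
    return delta


def dq_du(x, w1, w2, u_index=3):
    """dQ/du by the chain rule, with u taken as the fourth network input."""
    q, y = forward(x, w1, w2)
    return q * (1.0 - q) * sum(
        w2[i] * y[i] * (1.0 - y[i]) * w1[u_index][i] for i in range(len(w2))
    )


w1 = [[0.1] * 10 for _ in range(4)]
w2 = [0.1] * 10
x = [0.3, 0.5, 0.6, 0.4]
first = td_update(x, r=0.4, q_next=0.5, w1=w1, w2=w2)
for _ in range(200):
    last = td_update(x, r=0.4, q_next=0.5, w1=w1, w2=w2)
print(abs(last) < abs(first))
```

Repeating the step on a fixed sample only shows that the error shrinks for that sample; in the online setting each step uses a fresh state, reward, and next-state estimate.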

Fuzzy Inference System (FIS) Parameters Online Tuning Based on Q*(x,u) (FPT)
This section focuses on how to tune the parameters of the fuzzy controller based on the approximated Q(x,u) obtained in the previous section. To optimize the output of the FIS, its parameters are updated to maximize the action value function Q(x,u) with respect to the control output u for the current state. The parameters of the FIS are tuned using a gradient rule in which ∂Q(x,u)/∂ξ = (∂Q(x,u)/∂u)(∂u/∂ξ), where ξ is the parameter of the FIS to be tuned, such as $K^{l}_{j}$, $c^{l}_{i}$, and $\sigma^{l}_{i}$. ∂Q(x,u)/∂u has already been obtained through Equation (16), so only ∂u/∂ξ remains to be deduced. The Sugeno-type fuzzy inference system is chosen in our FQL control strategy. If the state vector is represented by $x = (x_1, x_2, \ldots, x_n)^{T} \in R^{n}$ and the control output by $u \in R$, the lth IF-THEN rule of the fuzzy controller states that if $x_1$ is $F^{l}_{1}$ and … and $x_n$ is $F^{l}_{n}$, then the rule output is $K^{l}_{0} + K^{l}_{1} x_1 + \cdots + K^{l}_{n} x_n$, where $F^{l}_{i}$ is the label of the fuzzy set in $x_i$, for l = 1, 2, …, M, and $K^{l}_{0}, K^{l}_{1}, K^{l}_{2}, \ldots, K^{l}_{n}$ are the constant coefficients of the consequent part of the fuzzy rule. We use product inference for the fuzzy implication, a singleton fuzzifier, and a center-average defuzzifier. The final output value is therefore the average of the rule outputs weighted by the product of the membership degrees in each rule, where $\mu_{F^{l}_{i}}(x_i)$ is the membership degree of $x_i$ in the fuzzy set $F^{l}_{i}$. A Gaussian function with center $c^{l}_{i}$ and width $\sigma^{l}_{i}$ is used as the membership function of the fuzzy system, for i = 1, 2, …, n (n is the number of input variables) and l = 1, 2, …, M (M is the number of fuzzy rules). The parameters that need to be tuned in our proposed Sugeno-type FIS are thus c and σ.
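The inference chain just described (Gaussian membership, product inference, linear consequents, center-average defuzzification) can be sketched as below. The exact scaling inside the Gaussian exponent, the dictionary layout of a rule, and all names and numbers are assumptions for illustration.

```python
import math


def gaussian(x, c, sigma):
    """Gaussian membership degree; the exponent scaling is an assumed form."""
    return math.exp(-((x - c) / sigma) ** 2)


def sugeno_output(x, rules):
    """First-order Sugeno inference with product inference and center average.

    Each rule is a dict with 'c' and 'sigma' (one entry per input) and
    'K' (n + 1 consequent coefficients K0..Kn).
    """
    a = 0.0  # weighted summation of the rule outputs
    b = 0.0  # summation of the rule weights
    for rule in rules:
        w = 1.0
        for x_i, c_i, s_i in zip(x, rule["c"], rule["sigma"]):
            w *= gaussian(x_i, c_i, s_i)        # product of membership degrees
        y = rule["K"][0] + sum(k * x_i for k, x_i in zip(rule["K"][1:], x))
        a += w * y
        b += w
    return a / b  # assumes at least one rule fires (b > 0)


rules = [
    {"c": [0.0, 0.0], "sigma": [1.0, 1.0], "K": [0.0, 0.0, 0.0]},
    {"c": [1.0, 1.0], "sigma": [1.0, 1.0], "K": [1.0, 0.0, 0.0]},
]
u_mid = sugeno_output([0.5, 0.5], rules)   # equidistant from both rule centers
u_low = sugeno_output([0.0, 0.0], rules)   # sits on the first rule's center
print(u_mid, u_low)
```

Because the output is a ratio of smooth functions of c and σ, it is differentiable in those parameters, which is what makes gradient-based tuning possible.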
Let the weight of a fuzzy rule be the product of the membership degrees of its different inputs, as given by Equation (22). Then a, b, and u denote the weighted summation of the rule outputs, the summation of the weights of the M rules, and the total output, respectively, so that u = a/b.
Thus, ∂u/∂ξ can be calculated by applying the chain rule to a, b, and u.
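As a worked sketch of these derivatives, assume the Gaussian membership function has the form exp(-((x_i - c)/σ)²) (the exponent scaling is an assumption) and write w^l for the product of membership degrees in rule l and y^l for the output of rule l (both symbols are introduced here for illustration only). Then:

```latex
\begin{aligned}
w^{l} &= \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i - c_i^{l})^{2}}{(\sigma_i^{l})^{2}}\right), \qquad
a = \sum_{l=1}^{M} w^{l} y^{l}, \qquad
b = \sum_{l=1}^{M} w^{l}, \qquad
u = \frac{a}{b},\\[4pt]
\frac{\partial u}{\partial w^{l}} &= \frac{y^{l}\, b - a}{b^{2}} = \frac{y^{l} - u}{b},\\[4pt]
\frac{\partial u}{\partial c_i^{l}} &= \frac{y^{l} - u}{b}\; w^{l}\; \frac{2\,(x_i - c_i^{l})}{(\sigma_i^{l})^{2}},\\[4pt]
\frac{\partial u}{\partial \sigma_i^{l}} &= \frac{y^{l} - u}{b}\; w^{l}\; \frac{2\,(x_i - c_i^{l})^{2}}{(\sigma_i^{l})^{3}}.
\end{aligned}
```

Each tuned parameter ξ would then be moved in the direction of (∂Q/∂u)(∂u/∂ξ), scaled by a step size, so that the FIS output climbs the estimated action-value surface.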

Exploration Policy and Action Modifier
Watkins has shown that Q(x,u) converges to Q*(x,u) with probability 1 if all actions continue to be tried from all states [22]. To guarantee that all actions are tried, we implemented an exploration policy for the control output u recommended by the FIS. The action exploration modifier (AEM) is introduced to generate the control command uc. The command uc is the sum of u and an additive disturbance action ud, which is normally distributed with zero mean and standard deviation σQ(t). The AEM addresses the "exploration" dilemma in reinforcement learning and is placed after the FIS and before the system input, i.e., uc = u + ud, with ud ~ N(0, σQ(t)).
The standard deviation σQ(t) is scaled by a coefficient k, which can expand or shrink the disturbance action.
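The additive exploration step can be sketched as follows. The function name and the numeric values are hypothetical, and the standard deviation is simply passed in as an argument because its formula is not reproduced here.

```python
import random


def action_modifier(u, sigma, rng):
    """Return uc = u + ud with ud drawn from a zero-mean normal distribution.

    sigma plays the role of the standard deviation of the disturbance;
    how it is computed is left to the caller in this sketch.
    """
    u_d = rng.gauss(0.0, sigma)
    return u + u_d


rng = random.Random(0)
u = 0.6
samples = [action_modifier(u, sigma=0.1, rng=rng) for _ in range(10000)]
mean_uc = sum(samples) / len(samples)
no_noise = action_modifier(u, sigma=0.0, rng=rng)  # sigma = 0 disables exploration
print(round(mean_uc, 2), no_noise)
```

Shrinking the standard deviation keeps the command close to the FIS recommendation, while enlarging it spreads the tried actions over a wider range.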

Overall Implementation Procedure
The detailed implementation procedure is presented as follows.
3) Before it is fed to the actual system, u is processed by the action modifier according to uc = u + ud.
4) The action modifier provides uc, which acts as the control value of the system.
5) Based on our requirements for the system, we evaluate the performance of the controller as r and obtain the states of the system.
6) Obtain the approximated Q(xt+1, ut+1) from the QEN based on the current control action, the current states, and some previous states.
Here, we assume Q(xt+1, ut+1) ≈ maxu' Q(xt+1, u') because ut+1 is obtained from the FIS, which continuously maximizes Q(xt, ut) with respect to the control output u.
8) Based on δt obtained from Step 7, update the parameters of the QEN according to Equations (14) and (15).
9) Tune the parameters of the FIS based on Equations (17)-(27).
10) Substitute ( , ).
11) If the parameters of the QEN and the FIS no longer change, or after a predefined number of iterations, the learning procedure is terminated; otherwise, return to Step 2 after a fixed sampling time.
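One pass through the middle of this procedure might be arranged as in the sketch below. It is only an outline under assumptions: the controller, the Q estimator, and the plant are passed in as hypothetical callables, the TD error is assumed to take the usual form r + γ·Q(x_next, u_next) - Q(x, uc), and the parameter updates of Steps 8 and 9 are left out.

```python
import random


def fql_iteration(x, fis, q_est, plant, sigma, gamma, rng):
    """One control step: explore, act, observe, and form the TD error.

    fis(x)      -> control output u recommended by the fuzzy controller
    q_est(x, u) -> current estimate of Q(x, u)
    plant(x, u) -> (r, x_next): performance evaluation and next state
    All three callables are placeholders for illustration.
    """
    u = fis(x)
    u_c = u + rng.gauss(0.0, sigma)             # Steps 3-4: action modifier
    r, x_next = plant(x, u_c)                   # Step 5: evaluate and observe
    q_next = q_est(x_next, fis(x_next))         # Step 6: Q at the next state
    delta = r + gamma * q_next - q_est(x, u_c)  # assumed TD error
    return x_next, u_c, delta


# Stub components, chosen only so that the sketch runs end to end.
def fis_stub(x):
    return 0.5


def q_stub(x, u):
    return 1.0


def plant_stub(x, u):
    return 0.1, x


x_next, u_c, delta = fql_iteration(
    [0.3, 0.5, 0.6], fis_stub, q_stub, plant_stub,
    sigma=0.0, gamma=0.9, rng=random.Random(0),
)
print(u_c, abs(delta) < 1e-12)
```

The returned TD error is what the QEN update would consume, and the updated QEN in turn supplies ∂Q/∂u for tuning the FIS.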

Simulation Results and Discussion
To assess the effectiveness of the FQL algorithm, simulation experiments were carried out in ADVISOR. Using simulation to test the algorithm on a variety of driving cycles avoids the large cost and time needed for physical experiments. The simulation model of the HEV described in Section 2 was built in ADVISOR and is shown in Figure 4. For this particular HEV system, the parameters of the algorithm used in the simulations are summarized in Table 2, with the notation defined there. The next step is to define the cost function R(k). We select a1 = 0 simply because emission maps are not provided for the engine; thus, the resultant control strategy is a fuel-economy-only strategy. To account for the influence of electric power consumption on fuel economy, we let a2 equal 1. Rfuel(k) and RSOC(k) are then defined in terms of x and y, where x is the instantaneous fuel consumption value and y is the SOC change rate.
The FQL algorithm was written in MATLAB. The fuzzy rules were predesigned according to engineering experience, and the complete rules are given in Table 3, where the parameters of Te satisfy the relationship VS < S < M < B < VB < SC < MC < BC, and the parameters of Tdem and SOC satisfy the relationship VS < S < M < B < VB. Initially, the membership functions of the fuzzy controller were randomly initialized. To illustrate the control strategy more clearly, it is represented in an intuitive manner: a torque-split ratio (TSR), τ = Te/Tdem, is defined to quantify the positive power flows in the powertrain [23]. Four positive power operation modes are defined: motor only (τ = 0), engine only (τ = 1), power-assist (0 < τ < 1), and charging (τ > 1). Figures 5 and 6 show the initial membership functions and the TSR map of the initial fuzzy controller.
The simulation test was run under the standard UDDS driving cycle. Figure 7 depicts how the TSR map of the fuzzy controller changes during the driving cycle. From 0 s to 1369 s, the surface of the TSR map becomes smoother; this smoothing results from the online learning of the FQL control strategy. Figure 8 shows the final membership functions of the fuzzy controller. The simulation results for the UDDS driving cycle are shown in Figure 9. The FQL control strategy eventually tends to maintain the battery SOC near 50%. This leaves enough capacity to handle an extended period of battery discharge and enough capacity to absorb a long period of charging.
To evaluate the performance and effectiveness of the FQL control strategy, the results are compared with a heuristic rule-based control strategy known as the "Parallel Electric Assist Control Strategy" and with the fuzzy logic control strategy. The comparison results are listed in Table 4. Power consumption is converted to fuel consumption; the equivalent fuel consumption is obtained by adding the converted power consumption to the fuel consumption. As shown in Table 4, the equivalent fuel consumption of the fuzzy control strategy is 3.10% lower than that of the rule-based control strategy, and the equivalent fuel consumption of the FQL control strategy is a further 2.67% lower than that of the fuzzy control strategy. The FQL control strategy thus achieves good performance.
The torque split trajectory obtained with the FQL control strategy is shown in Figure 12. To illustrate the torque split trajectory, we choose the 160-320 s period of the driving cycle. The engine provides most of the torque demand, while the motor helps when more torque is needed. The figure also shows that the engine torque profile is relatively smooth compared with the demand torque and the motor torque. The smoother engine torque under the FQL control strategy indicates that it helps improve the operating conditions of the engine.
To check the effectiveness of the proposed method, it is also tested on a different driving condition (the NEDC driving cycle), starting from the same initial conditions (the same parameters for both the neural network and the fuzzy controller). Figure 13 shows that the final membership functions of the FQL control strategy under the NEDC cycle differ from those under the UDDS cycle; as a result, we obtain two different controllers. That is, the control strategy really learns from the environment. Simulation results for the NEDC driving cycle are shown in Figure 14. The SOC is again eventually maintained near 50%. The rules of the fuzzy controller strongly influence the final value of the SOC and ensure that it is maintained near 50%, whereas the proposed method only changes the parameters of the membership functions of the fuzzy controller during the cycle. Table 5 compares the fuel consumption and equivalent fuel consumption of the different control strategies, and we can see that the FQL control strategy also achieves good performance.
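The torque-split ratio and the four positive power operation modes defined in this section can be expressed as a small classifier. The function names, the tolerance used to compare against 0 and 1, and the sample torque values are illustrative assumptions.

```python
def torque_split_ratio(t_e: float, t_dem: float) -> float:
    """TSR = Te / Tdem, defined here only for positive torque demand."""
    if t_dem <= 0.0:
        raise ValueError("TSR is only used for positive torque demand")
    return t_e / t_dem


def operation_mode(tau: float, tol: float = 1e-9) -> str:
    """Map a TSR value to one of the four positive power operation modes."""
    if abs(tau) <= tol:
        return "motor only"       # tau = 0
    if abs(tau - 1.0) <= tol:
        return "engine only"      # tau = 1
    if 0.0 < tau < 1.0:
        return "power-assist"     # 0 < tau < 1
    if tau > 1.0:
        return "charging"         # tau > 1
    raise ValueError("negative TSR is outside the four positive modes")


modes = [operation_mode(torque_split_ratio(t_e, 80.0))
         for t_e in (0.0, 60.0, 80.0, 100.0)]
print(modes)
```

A TSR map then amounts to evaluating this ratio for the controller output over a grid of Tdem and SOC values.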

Conclusions
An online learning control strategy based on FQL has been proposed to improve the fuel economy of a hybrid electric vehicle. The FQL control strategy contains two parts: the QEN and the FPT. We used a BP neural network as the QEN to estimate Q*(x,u). For the fuzzy controller, we chose a Sugeno-type fuzzy controller, whose parameters were tuned online based on Q*(x,u). The action exploration modifier (AEM) was introduced to guarantee that all actions are tried. Simulation results indicate that the parameters of the FIS are tuned online and that the FQL control strategy achieves good fuel economy.

Figure 1. Schematic diagram of the parallel hybrid electric vehicle drivetrain.
represents the output of one fuzzy rule: b

Figure 5. (a) Initial membership functions of Tdem; and (b) initial membership functions of SOC.

Figure 7. Changing trend of fuzzy controller TSR map under Urban Dynamometer Driving Schedule (UDDS): (a) TSR map for fuzzy controller in 500 s; (b) TSR map for fuzzy controller in 1000 s; and (c) TSR map for fuzzy controller in 1369 s.

Figures 10 and 11 depict the distribution of engine and motor operating points under the rule-based control strategy and the FQL control strategy. The FQL control strategy is a fuel strategy that limits the instantaneous fuel consumption. It is not based on the efficiency of the engine; rather, it primarily limits the fuel use to a particular value. As shown in Figure 10, most of the engine operating points under the FQL control strategy are below the 0.3 g/s fuel use line, while a large number of engine operating points under the rule-based control strategy are below the 0.55 g/s fuel use line. This means that the instantaneous fuel consumption of the FQL control strategy is lower than that of the rule-based control strategy most of the time. The motor operating point distribution of the FQL control strategy is shown in Figure 11. The motor operates more efficiently under the FQL control strategy than under the rule-based control strategy.

Figure 12. Torque split trajectory by using the FQL control strategy under UDDS.

Figure 13. (a) Final membership functions of Tdem under New European Driving Cycle (NEDC); (b) Initial membership functions of SOC under NEDC.

Table 1. Summary of the hybrid electric vehicle (HEV) parameters.

Table 2. Summary of FQL algorithm parameters.

Table 3. Summary of fuzzy rules.