4. RL-Based EMS
The EMS in a traditional engine–motor-based hybrid powertrain is designed only to achieve optimum energy consumption by regulating the power flow between the different power sources. The EMS in an FC EV, however, is not only responsible for reducing energy consumption but is also expected to prolong the service life of the battery and the fuel cell. Furthermore, the performance of any solution to this multi-objective optimization problem is subject to real-world driving conditions, as it is sensitive to the unknown power demand.
Traditional EMSs can be classified as rule-based and optimization-theory-based. Although their advantages differ, neither achieves a good balance between performance and practicality. Deep deterministic policy gradient (DDPG) is a deep reinforcement learning method based on an actor–critic network architecture. DDPG is well suited to continuous state/action optimization problems, updating the network parameters through back-propagation of the gradient. A memory pool that stores previous transitions can accelerate computation and avoid overfitting by weakening the correlation of the training data. The reward function plays a key role in the reinforcement learning model and comprises the current and future rewards; a discount factor is applied to the future rewards to reflect the influence of the future on the present [35].
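As a rough illustration of the memory-pool mechanism described above, the following Python sketch (a hypothetical illustration, not the authors' implementation) stores transitions and draws random mini-batches to weaken the correlation of the training data:

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory pool of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # Uniform random sampling weakens the temporal correlation
        # of consecutive driving-cycle steps.
        return random.sample(self.buffer, batch_size)
```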
As the name DDPG suggests, a deterministic policy is adopted to select the action of the network (agent). For the EMS in this study, the battery SoC and the required power are selected as the state variables, while the ratio of FC output power to the total required power is selected as the action variable; the EMS is optimized by training the agent under various driving conditions. The reward and the state of the next step are returned after the action is applied to the environment (vehicle). To prompt the agent to explore better solutions in the environment, normally distributed noise is added to the action. The action of the agent can then be expressed as

a_t = μ(s_t | θ) + N_t,  N_t ~ N(0, σ²)

where μ is the deterministic policy of the actor network with parameters θ and N_t is the exploration noise.
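A minimal sketch of this action-selection step follows; the function and parameter names (select_action, sigma) are hypothetical, and clipping the power ratio to [0, 1] is an assumption:

```python
import numpy as np

def select_action(policy, state, sigma=0.1):
    """Deterministic policy output mu(s_t) plus Gaussian exploration noise N_t.

    state  : (battery SoC, required power)
    return : FC power ratio a_t, assumed clipped to [0, 1]
    """
    a = policy(state)                     # deterministic action mu(s_t | theta)
    a += np.random.normal(0.0, sigma)     # normally distributed exploration noise
    return float(np.clip(a, 0.0, 1.0))

# Toy linear policy, purely for illustration:
toy_policy = lambda s: 0.5 * s[0]
print(select_action(toy_policy, np.array([0.6, 0.3])))
```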
The reward function consists of three parts, i.e., the equivalent hydrogen consumption, the battery capacity loss, and the FC performance degradation in terms of efficiency, and is expressed in Equation (15):
R_t = Σ_{k=0}^{∞} γ^k · r_{t+k}  (15)

where R_t is the total reward, γ is the discount factor that introduces the influence of the future on the present, and r_t represents the current reward.
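As a plain illustration of this discounted sum, a finite-horizon variant can be accumulated backward over an episode (the function name is ours):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma**k * r_{t+k}, accumulated backward over an episode."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```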
Since the FC tends to achieve a longer service life when its generated power is kept around the rated value, as shown in Equation (3), the performance degradation coefficient at constant rated power is selected as the target optimum value of FC degradation. The target optimum value of battery capacity degradation is determined by Equation (9), where the Ah-throughput in a test cycle is based on a DoH of 0.5. The target equivalent hydrogen consumption is 1.2 kg/100 km. The instant reward consists of the equivalent hydrogen consumption, the battery capacity loss, and the FC performance degradation coefficient and is expressed by Equation (16):
r_t = −(a_1 · C_H2/C_H2,tar + a_2 · Q_loss/Q_loss,tar + a_3 · FC_loss/FC_loss,tar)  (16)

where C_H2 is the instant equivalent hydrogen consumption per 100 km; Q_loss is the capacity loss of the battery at time t; FC_loss is the FC performance degradation coefficient at time t; a_1, a_2, and a_3 are the weighting factors; and the subscript "tar" denotes the target optimum values defined above.
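A sketch of one plausible form of this instant reward, assuming each term is normalized by its target value and negated so that lower consumption/degradation yields a higher reward; the target placeholder values and the exact normalization are assumptions, not taken from the paper:

```python
def instant_reward(c_h2, q_loss, fc_loss,
                   a1=0.34, a2=0.33, a3=0.33,
                   c_h2_tar=1.2, q_loss_tar=1.0, fc_loss_tar=1.0):
    """Negative weighted sum of target-normalized objectives (assumed form).

    c_h2    : instant equivalent hydrogen consumption per 100 km [kg]
    q_loss  : battery capacity loss at time t
    fc_loss : FC performance degradation coefficient at time t
    The *_tar targets are placeholders; the paper derives them from
    Equation (3), Equation (9) at 0.5 DoH, and the 1.2 kg/100 km target.
    """
    return -(a1 * c_h2 / c_h2_tar
             + a2 * q_loss / q_loss_tar
             + a3 * fc_loss / fc_loss_tar)
```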
The parameters are updated via the loss functions of the actor network and the critic network, as shown in Equations (17) and (18):
L(ϕ) = (1/N) · Σ_i [y_i − Q_ϕ(s_i, a_i)]²,  y_i = r_i + γ · Q_ϕ′(s_{i+1}, μ_θ′(s_{i+1}))  (17)

L(θ) = −(1/N) · Σ_i Q_ϕ(s_i, μ_θ(s_i))  (18)

where s_t and a_t are the state and action at time t, respectively; Q_ϕ is the output function of the critic network at time t, while Q_ϕ′ is the output function of the critic–target network at time t + 1.
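A condensed PyTorch sketch of these two losses follows; it uses the standard DDPG formulation, with network sizes and variable names that are illustrative assumptions (the paper's own implementation is in Matlab®):

```python
import copy
import torch
import torch.nn as nn

# Online and target networks; state = (SoC, P_req), action = FC power ratio.
actor = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)

def ddpg_losses(s, a, r, s2, gamma=0.99):
    """Critic loss (Eq. (17)-style TD error) and actor loss (Eq. (18)-style -Q)."""
    with torch.no_grad():
        # Target value at time t+1 uses the target actor and target critic.
        y = r + gamma * critic_tgt(torch.cat([s2, actor_tgt(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    # The actor ascends Q, i.e., minimizes the negative critic output.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    return critic_loss, actor_loss
```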
Only part of the parameters of the critic–target network and the actor–target network are updated at each step, a scheme called "soft updating" that is used to avoid overestimation and to stabilize the learning process of the agent toward an optimal EMS. The parameters of the target networks are updated by Equation (19):
ϕ′ ← τ · ϕ + (1 − τ) · ϕ′,  θ′ ← τ · θ + (1 − τ) · θ′  (19)

where ϕ and ϕ′ are the parameters of the critic network and the critic–target network, θ and θ′ are the parameters of the actor network and the actor–target network, and τ is the smoothing factor. The DDPG-based EMS optimization procedure is shown in Figure 7.
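A minimal sketch of the soft update in Equation (19), blending each target parameter toward its online counterpart with smoothing factor τ (PyTorch, continuing the names from the previous sketch):

```python
import torch

def soft_update(online_net, target_net, tau=0.005):
    """Equation (19): p' <- tau * p + (1 - tau) * p' for every parameter pair."""
    with torch.no_grad():
        for p, p_tgt in zip(online_net.parameters(), target_net.parameters()):
            p_tgt.mul_(1.0 - tau).add_(tau * p)

# Usage with the networks from the previous sketch:
# soft_update(critic, critic_tgt)
# soft_update(actor, actor_tgt)
```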
Since DoH = 0.375 is the optimal value for a balanced trade-off among hydrogen consumption, cost, and battery life, as shown in Table 6, it is adopted to evaluate the performance of the RL-based EMS.
Figure 8 illustrates the training process over 300 episodes with weighting factors a_1 = 0.34, a_2 = 0.33, and a_3 = 0.33 for hydrogen consumption, battery degradation, and FC degradation, respectively, in the Worldwide Harmonized Light-Duty Test Cycle (WLTC). The total reward of each training episode converges to the highest reward in the end, which is usually taken as indicating a successful training process for an optimal agent. The DDPG-based optimization of the EMS is likewise performed in Matlab®. The optimal EMS agent is derived from the 300 training episodes with converged reward; specifically, the agent of the 200th episode is selected, as shown in Figure 8.
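The episode loop described above can be sketched as follows; this is schematic only, the env and agent interfaces are hypothetical stand-ins, and the paper's actual implementation is in Matlab®:

```python
def train(env, agent, episodes=300):
    """Schematic DDPG training loop; records the total reward per episode.

    Assumed duck-typed interfaces (hypothetical, not from the paper):
      env.reset() -> state; env.step(action) -> (next_state, reward, done)
      agent.act / agent.remember / agent.learn
    """
    history = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(state)                  # policy + exploration noise
            next_state, reward, done = env.step(action)
            agent.remember(state, action, reward, next_state)
            agent.learn()                              # losses (17)-(18), soft update (19)
            state, total = next_state, total + reward
        history.append(total)                          # convergence of this curve
    return history                                     # indicates a trained agent
```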
In this selected episode, the power distribution between the FC and the battery is shown in Figure 9. To give a better view of the power allocation process, a partially enlarged plot is shown in Figure 9b, which clearly demonstrates that the power burden on the FC is reduced by the battery in a cooperative mode, that the battery recaptures energy during vehicle braking, and that the battery is charged by the FC when extra power is available. It is worth noting that the FC reaches its peak power, i.e., 45 kW, several times under the balanced weighting factors.
The weighting factors of hydrogen consumption, battery capacity degradation, and FC performance degradation are then changed to a_1 = 0.2, a_2 = 0.2, and a_3 = 0.6, reflecting a preference for FC performance maintenance. The training process shown in Figure 10 likewise exhibits convergent total rewards, which is taken as a successful training of the EMS agent.
Compared to the EMS agent with balanced weighting factors (a_1 = 0.34, a_2 = 0.33, a_3 = 0.33), the power distribution between the battery and the FC shown in Figure 11 exhibits a lower threshold for battery participation; in other words, the battery takes on the responsibility of mitigating FC degradation by providing power more frequently alongside the FC. In addition, compared to the balanced group, the FC-maintenance-preferred weighting factors significantly reduce the participation of the FC and limit the FC power to below 40 kW throughout the cycle.
A detailed comparison of the equivalent hydrogen consumption, battery capacity degradation, and FC performance degradation of the EMS agents in the WLTC and UDDS is summarized in Table 6. Given the FC life extension preference weighting factor, i.e., a_3 = 0.6, a 39.4% improvement is recorded by reducing the FC performance degradation per cycle from 0.0071 to 0.0043, while the hydrogen consumption increases by 12.3%, from 0.3652 kg to 0.41 kg, and a significant increase in battery capacity degradation ensues.
The derived EMS agent is then implemented in a new driving cycle, the United States Urban Dynamometer Driving Schedule (UDDS), to validate its effectiveness under unknown driving conditions. Figure 12 and Figure 13 show the power distribution with the weighting factor groups a_1 = 0.34, a_2 = 0.33, a_3 = 0.33 and a_1 = 0.2, a_2 = 0.2, a_3 = 0.6, respectively. Comparing these two figures, it is clear that the EMS with the FC performance maintenance weighting factors successfully limits the participation of the FC in high-power events, while the battery takes the principal role in driving the vehicle. In addition, the power with which the FC charges the battery is well restricted, with a 50% drop.
Table 7 presents the performance of the EMS agent in the UDDS with the balanced weighting factors and with the FC life extension preference weighting factors. Similar to the training cycle (WLTC), the FC performance degradation improves by 27%, while the hydrogen consumption increases by 10%, with a substantial deterioration of battery capacity.
It is worth noting that the comparison of EMS agent performance between the WLTC and the UDDS demonstrates the effectiveness of the EMS agent under unknown driving conditions, as hydrogen consumption, battery degradation, and FC degradation vary with a similar trend in both cycles. Specifically, by altering the weighting factors, the EMS agent effectively reduces FC performance degradation, although at a significant cost in battery capacity degradation, which can be avoided with a different preference.