Design and Improvement of SD3-Based Energy Management Strategy for a Hybrid Electric Urban Bus

Abstract: With the rapid development of machine learning, deep reinforcement learning (DRL) algorithms have recently been widely used for energy management in hybrid electric urban buses (HEUBs). However, current DRL-based strategies suffer from insufficient constraint capability, slow learning speed, and unstable convergence. In this study, a state-of-the-art continuous control DRL algorithm, softmax deep double deterministic policy gradients (SD3), is used to develop the energy management system of a power-split HEUB. In particular, an action masking (AM) technique that does not alter SD3's underlying principles is proposed to prevent the SD3-based strategy from outputting invalid actions that violate the system's physical constraints. Additionally, the transfer learning (TL) method of the SD3-based strategy is explored to avoid repetitive training of neural networks in different driving cycles. The results demonstrate that the learning performance and learning stability of SD3 are unaffected by AM and that SD3 with AM achieves control performance that is highly comparable to dynamic programming for both the CHTC-B and WVUCITY driving cycles. Aside from that, TL contributes to the rapid development of SD3. TL can speed up SD3's convergence by at least 67.61% without significantly affecting fuel economy.


Introduction
Hybrid electric vehicle (HEV) technology is regarded as a dependable vehicle technology that balances long-endurance mileage and low energy usage. HEVs share the burden of the internal combustion engine (ICE) with batteries and electric motors, allowing the ICE to run more effectively and enhancing fuel economy. However, the hybrid system must balance multiple control objectives, such as fuel economy and charge sustainability, thus making the energy management strategy (EMS) key in determining all aspects of an HEV's performance.
Research on HEVs has spawned massive EMSs, which can be classified into three groups: rule-based EMSs, optimization-based EMSs, and learning-based EMSs [1]. Rule-based strategies control the operation of the power components in real time using well-developed heuristic or fuzzy rules [2,3]. These methods are distinguished by low computing costs and excellent reliability, resulting in widespread use in engineering practice [4]. However, because the formulation of control rules relies too heavily on engineers' knowledge and necessitates time-consuming experiments for calibration, rule-based EMSs have a lengthy development cycle and low fuel-saving performance and self-adaptability [5]. Consequently, research on strategies that use optimization algorithms to achieve optimal control of HEVs has been substantially encouraged. Global optimization and real-time optimization are two categories of optimization-based strategies, depending on whether the optimization time domain is global or local [6]. Typical global optimization methods include genetic algorithms [7], particle swarm optimization algorithms [8], and dynamic programming (DP) [9], which can find the global optimum of the objective function by offline computation. The DP approach, in particular, has been acknowledged as a benchmark for investigating optimal energy management for HEVs. However, global optimization techniques rely on prior knowledge of future driving conditions, making them impossible to use in real-world scenarios [10]. Therefore, real-time optimization research has received much attention due to its practical viability. For instance, the equivalent consumption minimization strategy (ECMS), which accomplishes torque distribution by minimizing the sum of the equivalent fuel consumption and the actual fuel consumption at each time step, is a greedy method with proven engineering applications [11]. Unfortunately, the ECMS has a weak
immunity to perturbations of the equivalence factor, which leads to poor adaptability and stability [12]. Despite the emergence of various adaptive ECMSs [13], the sensitivity issue with the equivalence factor has yet to be fully resolved. Another popular real-time optimal energy management approach is model predictive control (MPC), which combines optimization algorithms with predictive approaches to achieve rolling optimization control of HEVs [14]. The accuracy of the predictive model is crucial to the MPC's control performance, since MPC determines the system's torque distribution by solving a predictive model [15]. However, the development of MPC-based EMSs is hampered by the difficulty of balancing accuracy and stability in present prediction methods.
To identify new control approaches that can address the disadvantages mentioned above, learning-based EMSs, particularly those based on deep reinforcement learning (DRL), have recently received much attention [16]. Since the energy management problem for HEVs can be described as a Markov decision process (MDP), the DRL technique can be successfully used for the energy-efficient control of HEVs [17]. The DRL strategy uses neural networks to learn torque distribution, which can improve energy efficiency by interacting with the environment. Therefore, the DRL strategy is a neural network black box with end-to-end state inputs and action outputs, which implies the strategy only needs a small amount of computation to obtain the optimal output [18]. DRL EMSs fall into two categories: discrete control (DC) DRL strategies and continuous control (CC) DRL strategies, depending on the control form [19]. Deep Q-network (DQN)-based, double DQN-based, and dueling DQN-based methods are the most emblematic DC DRL approaches [20-22]. The core of these methods is to iterate a Q-value function Q(s, a) using neural networks that can evaluate state-action pairs (s, a) according to a temporal-difference (TD) approach, so that the action at each step is selected by argmax_a Q(s, a) [23]. Given that DC DRL methods can only handle actions that are independent of one another due to the way they learn, continuous physical quantities like engine power must be discretized. However, higher discretization accuracy makes it challenging for the DC DRL strategy to converge, while lower discretization accuracy causes a significant discretization error [24]. As a result, it is more challenging to employ DC DRL methods to optimize the energy efficiency of HEVs, which encourages the development of CC DRL strategies that can avoid the action discretization issue. The deep deterministic policy gradient (DDPG)-based strategy, the twin delayed DDPG (TD3)-based strategy, and the soft actor-critic
(SAC)-based strategy are all representative CC DRL EMSs that learn torque distribution using the actor-critic (AC) framework [25-27]. Specifically, these strategies use actor neural networks to select output actions within the continuous action space and critic neural networks to learn value functions that evaluate the actor. Under the guidance of the critics, the actor updates in the direction that obtains a greater value evaluation through the gradient descent method. With the AC structure, CC DRL strategies are more accurate than DC DRL strategies at determining the optimal action, since they can perform any action belonging to a continuous action interval [28]. To demonstrate the effectiveness of CC DRL EMSs, numerous studies have been carried out. Li et al. [29] discovered, for instance, that the DDPG-based strategy performed better than the MPC-based strategy in terms of fuel-saving performance and computational speed without knowledge of future travels. According to Zhou et al. [30], compared to the DQN-based strategy, the double DQN-based strategy, and the dueling DQN-based strategy, the TD3-based technique converges more quickly, is more adaptable, and uses less fuel. Wu et al. [31] indicated that the SAC-based strategy has an excellent learning ability: it can reduce the training time by 87.5% and improve fuel economy by up to 23.3% compared to the DQN-based strategy.
Nevertheless, the CC DRL strategies from the current studies still have unsolved issues. The first is that the critic networks of common CC DRL strategies suffer from a large Q-value estimation bias, which can cause poor learning stability and extreme sensitivity to the hyperparameters of the strategies [32]. As a result, it usually takes a great deal of time to assess the effect of various combinations of hyperparameters. Another issue is the lack of safety in CC DRL strategies. This is due to the fact that CC DRL strategies execute aimless exploration over the whole action space to avoid getting trapped in a local optimum, and thus the strategies may output actions that violate the physical constraints of the powertrain system. This means that there may be an unreasonable torque distribution between the engine and the motor/generator, which can cause irreversible damage to the power components. Besides, the initialization of CC DRL strategies is usually random, so the strategies require long exploration periods to accumulate experience, which leads to a high training cost. Transfer reinforcement learning techniques are expected to solve this problem; that is, some knowledge from already learned strategies can be transferred to new strategies through transfer learning (TL), which speeds up learning by improving the initial performance of the new strategies. However, there is a lack of research on transfer reinforcement learning in the EMS field.
To cope with the above problems, this study adopts a cutting-edge CC DRL algorithm, namely softmax deep double deterministic policy gradients (SD3), to formulate an EMS for a power-split hybrid electric urban bus (HEUB). Based on the traditional AC method, SD3 can not only avoid the overestimation and underestimation of the Q-value through the double Q-learning technique, the Boltzmann softmax technique, and the dual-actor technique, but can also improve the sampling efficiency and learning stability. Therefore, the SD3 algorithm has great potential in the EMS field. The main contributions of this research are as follows:

• The current EMS field lacks research related to the SD3 algorithm, and, to the authors' knowledge, this is a pioneering research work related to the SD3-based strategy;

• In order to prevent the SD3-based strategy from outputting torque assignments that do not adhere to the physical limits of the powertrain system during stochastic exploration, an action masking method that does not go against the algorithmic concept is proposed. This work can drive DRL-based strategies toward engineering applications and be an inspiration for the improvement of other DRL algorithms;

• The possibility of utilizing TL methods to accelerate SD3 learning is investigated. That is, part of the prior knowledge of an already converged strategy is migrated to the new driving cycle for initializing the new strategy, which avoids a cold start of the new strategy in the new environment.
The paper is organized as follows: In Section 2, the modeling of the studied HEUB and its associated energy management problem are introduced. In Section 3, the system design of the SD3-based strategy, the invalid action masking method, and the transfer learning technique are described. In Section 4, the SD3-based strategy experiments and the strategy's performance are discussed numerically under several metrics. Conclusions and some suggestions for future work are given in Section 5.

Powertrain Modeling and Energy Management Problem
As shown in Figure 1, the powertrain system of the studied HEUB is mainly composed of the ICE, low-power motor MG1, high-power motor MG2, battery pack, and power coupling mechanism with a front planetary row (PG1) and a rear planetary row (PG2). Detailed specifications of the HEUB are provided in Table 1. PG1 acts as a power-split mechanism, with a sun gear (S1) and a planet carrier (C1) connected to MG1 and the ICE, respectively. The rear planetary row PG2 acts as the reduction mechanism of MG2, and its sun gear (S2) and gear ring (R2) are connected to MG2 and the chassis frame, respectively. Moreover, the gear ring (R1) of PG1 is connected to the planet carrier (C2) of PG2, and together they serve as the power output of the transmission. This work aims to establish a logical EMS to coordinate the efficient operation of the above components. This section outlines the mathematical formulation of the energy management optimization problem and the modeling of the HEUB's components.

Vehicle Dynamics Model
The studied HEUB in this work is modeled using a backward method, and its longitudinal dynamics can be expressed as follows using the theory of automotive dynamics:

F_t = mgf cos ψ + mg sin ψ + (1/2) C_D ρ A_f v^2 + ζ m v̇  (1)

where F_t denotes the driving force demand, m denotes the vehicle mass, g denotes the gravitational acceleration, f denotes the rolling resistance coefficient, ψ denotes the angle of road slope, C_D denotes the aerodynamic drag coefficient, ρ denotes the air density, A_f denotes the frontal area, v denotes the vehicle velocity, ζ denotes the rotary mass conversion coefficient, and v̇ denotes the acceleration.
The most crucial mechanical transmission part of the investigated HEUB is the power coupling mechanism shown in Figure 1. It divides the ICE output power into two parts, one of which is converted into electrical energy by the generator and the other of which is transmitted through the mechanical path for driving the vehicle. This advantage enables a complete decoupling of the ICE from the wheels, enabling more effective control of the ICE operating point [33]. According to the dynamics of the planetary gear system, the constraint relations to be satisfied by the power distribution between the ICE, MG1, and MG2 can be obtained as follows (reconstructed here from the standard kinematics of this power-split configuration):

n_MG1 = (1 + k_1) n_ICE − k_1 n_OUT,  n_MG2 = (1 + k_2) n_OUT
T_MG1 = T_ICE / (1 + k_1),  T_OUT = k_1 T_ICE / (1 + k_1) + (1 + k_2) T_MG2  (2)

where n_MG1, n_MG2, and n_ICE denote the speed of MG1, MG2, and the ICE, respectively; T_MG1, T_MG2, and T_ICE denote the torque of MG1, MG2, and the ICE, respectively; k_1 and k_2 denote the characteristic parameters of PG1 and PG2, respectively; and n_OUT and T_OUT denote the output speed and output torque of the power coupling mechanism, respectively.
Then, the output torque of the power coupling mechanism and the vehicle driving force should satisfy the following relationship:

T_OUT = F_t R_WH / i_f,  n_OUT = v i_f / R_WH  (3)

where R_WH denotes the rolling radius, and i_f denotes the final gear ratio.
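As a concrete illustration, the backward model above can be sketched in Python. The numerical parameters below are illustrative placeholders for a city bus, not the values from Table 1:

```python
import math

# Illustrative HEUB parameters (placeholders, not the paper's Table 1 values)
m = 14500.0      # vehicle mass [kg]
g = 9.81         # gravitational acceleration [m/s^2]
f = 0.0095       # rolling resistance coefficient
C_D = 0.65       # aerodynamic drag coefficient
rho = 1.226      # air density [kg/m^3]
A_f = 7.5        # frontal area [m^2]
zeta = 1.05      # rotary mass conversion coefficient
R_WH = 0.478     # rolling radius [m]
i_f = 6.2        # final gear ratio


def demand_at_wheel(v, dv, psi=0.0):
    """Backward model: driving force demand, then the coupler's output torque/speed."""
    F_t = (m * g * f * math.cos(psi)        # rolling resistance
           + m * g * math.sin(psi)          # grade resistance
           + 0.5 * C_D * rho * A_f * v**2   # aerodynamic drag
           + zeta * m * dv)                 # acceleration resistance
    T_out = F_t * R_WH / i_f                # output torque demand of the coupler
    n_out = v * i_f / R_WH                  # output speed [rad/s]
    return F_t, T_out, n_out
```

Given a target driving cycle sampled at 1 Hz, calling `demand_at_wheel` at each step yields the torque and speed demand that the EMS must split between the ICE, MG1, and MG2.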

Power Units Model
The power unit model includes the ICE model, MG1 model, MG2 model, and battery model. The ICE and motors are built using a quasi-static modeling approach in this research. Namely, their transient responses are ignored in the model, and only experimental data are used to simulate their static characteristics [34]. Therefore, the fuel consumption rate of the ICE and the efficiency of the motors are both functions of speed and torque. According to the efficiency maps [35], once the speed and torque are determined, the fuel consumption rate or efficiency can be calculated using the look-up table method, which has been adopted by most studies.
Furthermore, the ICE may be controlled to work in any condition because the investigated HEUB can doubly decouple the ICE speed and torque by employing a planetary gear set. In order to narrow the scope of the EMS's optimization-seeking, the optimal brake specific fuel consumption (BSFC) curve of the ICE is extracted as a priori information in this work. Lian et al. have demonstrated that this method can significantly shorten the DRL strategy's learning period without sacrificing fuel efficiency [24]. The optimal operating curve of the object ICE is shown in Figure 2, which consists of the set of operating points with the lowest BSFC at each ICE power, P_ICE. Additionally, since the electric motor's efficiency is far higher than the ICE's, the ICE can be the primary focus of energy management improvement.
The power battery pack is simulated by a first-order equivalent circuit model, which is practical and effective for energy management problems [36]. The voltage characteristic curve of the battery is shown in Figure 3, which shows that the open-circuit voltage depends on the battery's state of charge (SOC). In addition, since the open-circuit voltage changes more gently when the SOC is around 0.6, the initial SOC of the HEUB simulation is set to 0.6, and the SOC constraint range is set to [0.5, 0.7] in this study. The theoretical battery model is formulated as:

P_BAT = P_MG1 + P_MG2 + P_LOSS
I_BAT = [ V_OC − sqrt(V_OC^2 − 4 R_0 P_BAT) ] / (2 R_0)
SOC = SOC_0 − (1/Q_BAT) ∫ I_BAT dt  (4)

where P_BAT, P_MG1, and P_MG2 denote the power of the battery, MG1, and MG2, respectively; P_LOSS denotes the lost power; and I_BAT, V_OC, R_0, SOC_0, and Q_BAT denote the current, open-circuit voltage, internal resistance, initial SOC, and capacity of the battery, respectively.
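A minimal sketch of the first-order equivalent circuit update follows. It assumes a constant open-circuit voltage and internal resistance passed in by the caller, whereas the paper's V_OC actually depends on the SOC via the curve in Figure 3:

```python
import math

def battery_step(P_bat, soc, V_oc, R0, Q_bat, dt=1.0):
    """First-order equivalent circuit model: solve P_bat = V_oc*I - R0*I^2 for the
    smaller current root, then integrate the SOC. P_bat > 0 means discharging;
    Q_bat is the capacity in Ah, dt the step length in seconds."""
    I = (V_oc - math.sqrt(V_oc**2 - 4.0 * R0 * P_bat)) / (2.0 * R0)
    soc_next = soc - I * dt / (3600.0 * Q_bat)  # convert Ah to As
    return I, soc_next
```

Taking the smaller root of the quadratic corresponds to the physically meaningful operating point with lower ohmic loss.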

Energy Management Problem
The energy management problem for HEVs is a non-linear optimal control problem with constraints, and its control objectives include minimizing fuel consumption and balancing the electric quantity. These two objectives are critical for improving vehicle economy and endurance mileage, as well as maintaining battery life and reliability. Therefore, this study uses a trade-off cost function between fuel economy and electric quantity sustainability to measure the performance of energy management strategies, as established below:

J = ∫ [ ṁ_ICE(t) + κ (SOC(t) − SOC_ref)^2 ] dt  (5)

where ṁ_ICE denotes the instantaneous fuel consumption, SOC_ref denotes the penalty term preset value, and κ denotes a positive penalty factor.
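The integrand of the cost function can be sketched per time step as follows; the value of `kappa` here is an illustrative placeholder, not the paper's tuned penalty factor:

```python
def step_cost(m_dot_ice, soc, soc_ref=0.6, kappa=350.0):
    """Per-step trade-off between fuel use and charge sustainability:
    instantaneous fuel rate plus a quadratic penalty on SOC deviation."""
    return m_dot_ice + kappa * (soc - soc_ref) ** 2
```

Summing `step_cost` over a driving cycle at 1 Hz approximates the integral J, so strategies can be compared by their accumulated cost.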
The energy management problem is subject to the following physical constraints:

SOC_min ≤ SOC ≤ SOC_max
P_BAT^min ≤ P_BAT ≤ P_BAT^max,  I_BAT^min ≤ I_BAT ≤ I_BAT^max  (6)

where P_BAT^min (P_BAT^max) and I_BAT^min (I_BAT^max) are the minimum (maximum) battery power and current, respectively. In addition, the speed and torque of the ICE, MG1, and MG2 are bounded by their respective physical operating ranges.

Methods and Design of SD3-Based EMS
The DRL problem can be described by an MDP defined as a 5-tuple (S, A, T, r, γ) [37], where S denotes the state space of the environment, A denotes the action space of the agent, T denotes the transition probability between states, r denotes the reward function, and γ denotes the discount factor. In the MDP, the DRL algorithm continuously updates the policy function to optimize the action output based on the reward feedback obtained from the interaction with the environment. The policy function π(s) is a mapping from states to actions, and its performance can be evaluated by a state-value function V(s) or a Q-value function Q(s, a). Therefore, DRL algorithms usually find the optimal policy function π*(s) that maximizes the cumulative reward by iterating over the optimal value function.
Since the energy management problem of HEVs is a sequential control problem, it can be mathematically described as an MDP. In the energy management MDP, the driving cycle and the powertrain system are part of the environment. The DRL algorithm searches for an optimal strategy that satisfies the energy optimization objective by trying different torque distribution strategies. This section proposes a novel DRL-based strategy, namely the SD3-based strategy, and its algorithm principles and formulation details are presented in detail.
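The energy-management MDP described above can be sketched as a minimal environment class. The powertrain and battery updates inside `step` are crude placeholders for the models of Section 2, and all numeric coefficients are illustrative:

```python
class EnergyMDPEnv:
    """Sketch of the energy-management MDP: state = (v, dv, SOC),
    action = ICE power in kW, reward = -(fuel + SOC penalty)."""

    def __init__(self, cycle_v):           # cycle_v: speed trace [m/s], 1 Hz
        self.cycle_v = cycle_v
        self.reset()

    def reset(self):
        self.t, self.soc = 0, 0.6
        return self._obs()

    def _obs(self):
        v = self.cycle_v[self.t]
        dv = v - self.cycle_v[self.t - 1] if self.t > 0 else 0.0
        return (v, dv, self.soc)

    def step(self, p_ice_kw):
        # Placeholder dynamics: fuel rate grows with ICE power; the battery
        # covers (or absorbs) the rest of an assumed 50 kW demand.
        fuel = 2.0e-4 * p_ice_kw
        self.soc -= 1.0e-5 * (50.0 - p_ice_kw)
        reward = -(fuel + 350.0 * (self.soc - 0.6) ** 2)
        self.t += 1
        done = self.t >= len(self.cycle_v) - 1
        return self._obs(), reward, done
```

An agent interacts with this interface exactly as with any gym-style environment: `reset`, then repeated `step` calls until `done`.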

Preliminary Formulation of SD3-Based EMS
SD3 is an off-policy CC DRL algorithm that iterates the optimal Q-value function and policy function by the AC method [32]. Specifically, SD3 uses neural networks called the actor and the critic to approximate the policy function π(s) and the Q-value function Q(s, a), respectively. The actor network is the interactive end of SD3 for action selection. At the same time, the critic network is used to estimate the Q-value and thus guide the actor network to learn the strategy that maximizes the Q-value by the gradient method. With this design, SD3 can handle problems with a continuous action space and avoid discretization errors. On this basis, to improve the critic's estimation performance and the actor's sampling efficiency, SD3 draws on the experience of DQN and double Q-learning. It integrates the dual-actor technique, the dual-critic technique, and the target network technique in the AC method [23,38]. That means the learning framework of SD3 integrates eight neural networks: two actor networks, two critic networks, two target actor networks, and two target critic networks. Among them, only the actor and critic networks need to be trained, while the target networks are updated by softly copying the weights of the actor and critic networks. It is worth mentioning that the actor and critic networks are trained using an experience replay method. That means that SD3 deposits transition samples ⟨s, a, r, s′, d⟩, consisting of the state s, action a, immediate reward r, next state s′, and done flag d generated from each interaction with the environment, into the experience buffer R, and then randomly samples mini-batches from the experience buffer to train the networks. This practice can improve the utilization of samples and attenuate the correlation between training samples.
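The experience replay mechanism just described can be sketched as a generic buffer (not the authors' implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store <s, a, r, s', d> transitions and sample
    uniform random mini-batches to de-correlate training data."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest samples evicted when full

    def push(self, s, a, r, s_next, d):
        self.buf.append((s, a, r, s_next, d))

    def sample(self, batch_size):
        batch = random.sample(list(self.buf), batch_size)
        s, a, r, s_next, d = zip(*batch)
        return s, a, r, s_next, d

    def __len__(self):
        return len(self.buf)
```

The bounded `deque` gives the "larger buffer capacity" behavior discussed later: old experience is discarded automatically once the capacity is reached.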
As for the critic networks, SD3 uses the clipped double Q-learning method integrated with the Boltzmann softmax operator to estimate the Q-value and construct the TD error. Namely, the loss function of the critic networks is as follows:

L(θ_i) = E_(s,a,r,s′,d)∼R [ (Q_θi(s, a) − y)^2 ],  y = r + γ (1 − d) softmax_β( Q̂(s′, ·) )  (7)

softmax_β( Q̂(s′, ·) ) = E_a′∼p [ e^(β Q̂(s′,a′)) Q̂(s′, a′) / p(a′) ] / E_a′∼p [ e^(β Q̂(s′,a′)) / p(a′) ]  (8)

Q̂(s′, a′) = min( Q_θ1−(s′, a′), Q_θ2−(s′, a′) )  (9)

where a′ denotes the next action; θ1− and θ2− denote the weights of the two target critic networks, respectively; p(a′) denotes the probability density function of the normal distribution from which a′ is sampled; β denotes the parameter of the softmax operator; and Q_θi=1,2(s, a) denotes the critic networks Q_θ1(s, a) and Q_θ2(s, a) with weights θ1 and θ2. Note that Equation (9) uses the minimum of the two target critic networks to initially estimate the Q-value. This technique, called clipped double Q-learning, can mitigate the overestimation of values. Equation (8) yields an unbiased estimate of the softmax operator by importance sampling, and the estimated Q-value is further processed. This technique can smooth the optimization landscape and limit the estimation error to a certain range.
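The importance-sampling estimate of the softmax operator in Equation (8) can be sketched as follows. It assumes actions are drawn from a normal distribution centered on the target actor's output; `q_fn`, `mu`, `sigma`, and `n_samples` are illustrative stand-ins for the clipped target critic, the target actor's action, the sampling noise scale, and the sample count:

```python
import numpy as np

def softmax_value_estimate(q_fn, s_next, mu, sigma, beta, n_samples=50, rng=None):
    """Importance-sampling estimate of softmax_beta Q(s', .): sample actions
    from N(mu, sigma^2), weight Q-values by exp(beta*Q)/p(a), then normalize."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.normal(mu, sigma, size=n_samples)
    q = q_fn(s_next, a)                                   # vectorized Q-values
    p = np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    w = np.exp(beta * (q - q.max())) / p                  # subtract max(q) for stability
    return float(np.sum(w * q) / np.sum(w))
```

Subtracting `q.max()` before exponentiating leaves the ratio unchanged but prevents overflow for large β; as β grows, the estimate moves from the mean toward the maximum Q-value.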
As for the two actor networks, their learning objective is to maximize the Q-value, so they are updated according to the output value of the critic networks using the deterministic policy gradient method, i.e.,

∇_φi J(φ_i) = E_s∼R [ ∇_a Q_θi(s, a) |_(a = π_φi(s)) ∇_φi π_φi(s) ]  (10)

where π_φi=1,2(s) denotes the actor networks π_φ1(s) and π_φ2(s) with weights φ1 and φ2, respectively. The target networks do not need to be trained, and their parameters are optimized by soft updates, i.e.,

θi− ← τ θi + (1 − τ) θi−,  φi− ← τ φi + (1 − τ) φi−  (11)

where τ is the soft update factor. Usually, τ ≪ 1 is set to ensure that the target networks change smoothly so that stable learning targets can be provided for the critic networks. Based on the above description, the pseudo-code of the full SD3 algorithm is provided in Algorithm 1.
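The soft (Polyak) update of the target networks can be sketched as follows, treating each network as a flat list of weights; real DRL frameworks apply the same rule parameter-by-parameter:

```python
def soft_update(target_weights, online_weights, tau=0.005):
    """Polyak averaging: theta_target <- tau*theta + (1 - tau)*theta_target.
    A small tau keeps the target networks changing slowly and stably."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_weights, online_weights)]
```

With τ = 0.005, each call moves the target only 0.5% of the way toward the online network, which is what provides the stable learning targets mentioned above.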

Reward, Observation, Action, and Parameters Setting
The closed-loop control framework of the SD3-based strategy is shown in Figure 4, and its deployment consists of three main parts: reward design, environment construction (including state design and action design), and parameter setting of the SD3 algorithm.
Reward: In formulating the SD3-based strategy, the SD3 algorithm searches for the energy management strategy that minimizes Equation (5) by controlling the HEUB's interaction with the environment. Therefore, the multi-objective reward function of the SD3-based strategy should be designed as

r = −[ ς ṁ_ICE + κ (SOC − SOC_ref)^2 ]  (12)

The main challenge of the multi-objective reward function is the parameter tuning between ς and κ. The purpose of parameter tuning is to minimize fuel use while maintaining the SOC; different weights between them will result in a variety of learning outcomes.
Observation: The SD3-based strategy's input, the state observation, is used to present fundamental information about the environment. The observation quantities should be comprehensive and independent of one another. Therefore, vehicle speed, acceleration, and SOC are set as the state observations in this study, i.e., State = {v, v̇, SOC}. Actions: After observing the state, the SD3-based strategy uses the actor networks to choose a suitable action to decide the system's torque distribution. According to the optimal BSFC curve discussed in the previous section, the ICE power determines the ICE speed and torque with the best efficiency, so the demand speed and demand torque of the motors can be further determined based on the mechanical characteristics of the system. Thus, the SD3-based strategy can optimize energy control by controlling the power output of the ICE, and the action space is defined as Action = {P_ICE | P_ICE ∈ [0 kW, 160 kW]}.
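Because the actor's output layer is activated by tanh (see the parameter settings below), its raw output in (−1, 1) must be linearly mapped to the ICE power range; a minimal sketch:

```python
def scale_action(a_tanh, low=0.0, high=160.0):
    """Linearly map the actor's tanh output in (-1, 1) onto the
    ICE power action space [low, high] kW."""
    return low + 0.5 * (a_tanh + 1.0) * (high - low)
```

The map is affine, so it preserves the shape of the actor's output distribution while placing it on the physical action interval.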
Parameter: In this study, all networks of SD3 consist of fully connected layers, including three hidden layers with 512, 256, and 256 nodes, respectively. Among them, the hidden layers of the critic networks are activated by the ReLU function. The ReLU function also activates the input and intermediate layers of the actor networks, but the tanh function activates the output layers. Since the tanh function maps the output of the actor networks to (−1, 1), a linear transformation to the action space is required. The essential hyperparameters, which were chosen only after several experiments, are listed in Table 2. Among them, the discount factor γ, taken as 0.99, means that the long-term payoff can be taken into account, thereby increasing the likelihood that SD3 will learn the globally optimal policy [39]. The lower learning rate of the actor networks relative to the critic networks makes the iteration of the Q-function relatively faster, stabilizing the policy update. The larger buffer capacity can accumulate more experience to prevent policy overfitting, and an appropriate mini-batch size is beneficial for the learning stability and efficiency of the algorithm.

Invalid Action Masking Method
In the studied powertrain, MG2's role is to share the ICE load or charge the battery through braking energy recovery, so it must have a wide physical operating range to ensure a perfect match with the ICE. Instead, MG1's role is to regulate the ICE speed or absorb some of the ICE power to produce electricity; hence, its physical operating range can be relatively narrow. Given this, the studied HEUB is equipped with a high-power motor MG2 and a low-power motor MG1 for the ICE. Nevertheless, Equation (2) states that when the ICE's demand speed or torque is too high, the demand speed or torque of MG1 is also likely to be too high, which would cause MG1 to violate its physical operating constraints. Namely, the ICE should only operate within a specific range R_ICE(t) at each time step t to guarantee the reasonable operation of MG1. This
indicates that the output action of SD3 should fall on the part of the BSFC curve that intersects with R_ICE(t). Meanwhile, because the speed and torque along the BSFC curve are positively associated with power, the higher the power at each point, the higher the speed and torque. Therefore, R_ICE(t) is continuous, as illustrated in Figure 5, with an upper limit of the maximum feasible power P_ICE^max(t) and a lower limit of the minimum feasible power P_ICE^min(t). In conclusion, the output action of SD3 should be limited to [P_ICE^min(t), P_ICE^max(t)].

Unfortunately, like other DRL-based EMSs, SD3 explores the entire action space aimlessly during learning to reduce the risk of falling into a local optimum. That means that SD3 will output actions outside of R_ICE(t). Therefore, it is necessary to design action masking (AM) to filter invalid actions and prevent SD3 from unnecessary exploration in learning. Given that the essence of SD3 is to develop strategies with long-term planning through the state distribution, action distribution, and state transitions of the learning environment, AM needs to follow two criteria to avoid violating the principles of the SD3 algorithm and thus affecting SD3's learning performance. The first is that AM must not change the original action space distribution, so that the environment's underlying state transition probability function is not destroyed. The second is that samples containing invalid actions must not be collected into the experience replay buffer, so that erroneous samples do not participate in the training. Specifically, SD3 implements AM by repeating the following steps at each time step t.

•
At each time step t, the R_ICE(t) is first calculated by the following three steps:

It is essential to point out that the first step is widely used by many mathematical-model-based approaches, such as DP, ECMS, and MPC, which all exclude invalid actions by traversing the action space. Also, the AM technique must be deployed for both the actor networks and the target actor networks; otherwise, the learning ability of the algorithm will be seriously affected. Besides, masking invalid actions with the clip operation is only applicable to algorithms such as DDPG, TD3, and SD3, which are based on the AC framework and a deterministic policy. For Q-value-based algorithms such as DQN, invalid actions can be kept out of the argmax operation by setting their Q-values to negative infinity. For algorithms such as proximal policy optimization and SAC, which are based on the AC framework and a stochastic policy, action masking can be achieved by setting the sampling probability of invalid actions to zero.
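The clip-based masking described above can be sketched as follows. This is a minimal illustration under stated assumptions: `feasible_power_range` is a hypothetical stand-in for the three-step computation of R_ICE(t) (its numeric bounds are invented for the example and are not taken from the reference powertrain), and the same clipping would be applied to both the actor and the target actor.

```python
import numpy as np

def feasible_power_range(demand_speed, demand_torque):
    """Hypothetical stand-in for the three-step computation of R_ICE(t):
    return (P_min, P_max) in kW for the current time step. Illustrative
    bounds only; the real limits come from the optimal BSFC curve and the
    motor/generator constraints of the powertrain."""
    p_max = min(160.0, 0.05 * demand_speed * demand_torque)
    return 0.0, max(0.0, p_max)

def masked_action(actor_output, p_min, p_max):
    """Clip the actor's raw ICE-power command into R_ICE(t). The clipped
    action (never the raw one) is executed and stored in the replay
    buffer, so no invalid sample enters training."""
    return float(np.clip(actor_output, p_min, p_max))

# Example: a raw command of 180 kW is masked down to the feasible ceiling.
p_min, p_max = feasible_power_range(demand_speed=200.0, demand_torque=30.0)
print(masked_action(180.0, p_min, p_max))  # -> 160.0
```

Because the clip maps every raw action to a valid one without re-weighting the action distribution inside R_ICE(t), it satisfies the two criteria stated above.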

Transfer Learning Technology
The DRL algorithm suffers from sample inefficiency: it must interact with the environment to collect a large number of samples before a strategy can be learned. However, once the environment changes, the previous learning results become invalid, and the neural network must be retrained at great cost [40]. Transfer learning is an effective means of solving this problem; its core idea is to use the experience gained by a model on an old task to improve learning performance on a related but different task [41].
For HEUBs deployed with the same SD3-based strategy, the rewards, states, actions, powertrain characteristics, and algorithm settings are identical across different urban driving conditions. This implies a correlation between the optimal EMSs for different driving conditions. Therefore, after learning on the source driving cycle C_S, some parameters of the actor and critic networks can be transferred to the target driving cycle C_T to accelerate learning. Specifically, the knowledge transfer of the SD3-based strategy from C_S to C_T is divided into three steps:

•
The first step is to extract the parameters of the actor networks and critic networks of the SD3-based strategy that have sufficiently converged in C_S;
•
Then, the extracted parameters are used to initialize the corresponding networks of the SD3-based strategy in C_T, and the input and intermediate layers of the networks are frozen;
•
Finally, the output layers of the networks are randomly initialized, and the networks are fine-tuned with a small amount of training.
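The three transfer steps above can be sketched with a toy network. This is a minimal sketch under assumptions: the network is represented as a bare list of NumPy weight matrices rather than the paper's actual actor/critic architecture, and `sgd_step` stands in for the real fine-tuning optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_net(sizes):
    """A toy fully connected network represented as a list of weight matrices."""
    return [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]

def transfer(source_net):
    """The three steps from the text (illustrative, not the paper's code):
    1) reuse the converged C_S parameters, 2) freeze input/hidden layers,
    3) randomly re-initialize the output layer for fine-tuning in C_T."""
    target_net = [w.copy() for w in source_net]          # step 1
    frozen = [True] * (len(target_net) - 1) + [False]    # step 2
    target_net[-1] = rng.standard_normal(target_net[-1].shape) * 0.1  # step 3
    return target_net, frozen

def sgd_step(net, frozen, grads, lr=1e-2):
    """Fine-tuning: only unfrozen layers (the output head) are updated."""
    for i, (w, g) in enumerate(zip(net, grads)):
        if not frozen[i]:
            w -= lr * g
    return net

source = make_net([6, 64, 64, 1])        # e.g. an actor trained on C_S
target, frozen = transfer(source)
grads = [np.ones_like(w) for w in target]
before = [w.copy() for w in target]
sgd_step(target, frozen, grads)
# Frozen layers are untouched; only the output layer moved.
print(np.allclose(target[0], before[0]), np.allclose(target[-1], before[-1]))
```

Freezing all but the output layer preserves the features learned on C_S while keeping the amount of fine-tuning on C_T small.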

Results and Discussion
The capability of the proposed EMS in fuel economy optimization and battery charge sustainability maintenance is validated in this section. The driving cycles used in the simulation include the China heavy-duty commercial vehicle cycle-bus (CHTC-B) [see Figure 6a] and the West Virginia University City cycle (WVUCITY) [see Figure 6b]. Moreover, DP, an established benchmarking technique, is included as the reference strategy [42]. In the DP formulation, the SOC is chosen as the state variable and discretized into 200 grids between 0.5 and 0.7, while the ICE power corresponding to the optimal BSFC curve is designated as the action variable and discretized into 160 grids ranging from the minimum to the maximum power.
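As a rough illustration of the structure of this DP benchmark (not the paper's vehicle model), a backward recursion over the discretized SOC grid might look like the following; the battery-transition and fuel-cost functions are placeholders chosen only so the sketch runs.

```python
import numpy as np

# Discretization matching the formulation in the text: 200 SOC grid points
# between 0.5 and 0.7, and 160 ICE-power levels along the optimal BSFC curve.
SOC_GRID = np.linspace(0.5, 0.7, 200)
P_GRID = np.linspace(0.0, 160.0, 160)      # kW; illustrative power range

def next_soc(soc, p_ice, p_demand):
    """Placeholder battery model: the battery covers the power the ICE
    does not supply, and the SOC moves proportionally (a toy stand-in)."""
    return soc - 1e-4 * (p_demand - p_ice)

def fuel_rate(p_ice):
    """Placeholder BSFC-like cost: a bowl with a sweet spot near 110 kW."""
    return 1e-3 * (p_ice - 110.0) ** 2 + 0.2 * p_ice

def dp_backward(p_demand_profile):
    """Backward recursion: cost-to-go J over (time, SOC grid)."""
    T = len(p_demand_profile)
    J = np.zeros((T + 1, len(SOC_GRID)))   # terminal cost-to-go = 0
    for t in range(T - 1, -1, -1):
        for i, soc in enumerate(SOC_GRID):
            best = np.inf
            for p in P_GRID:
                s2 = next_soc(soc, p, p_demand_profile[t])
                if not 0.5 <= s2 <= 0.7:   # hard SOC constraint
                    continue
                j = min(np.searchsorted(SOC_GRID, s2), len(SOC_GRID) - 1)
                best = min(best, fuel_rate(p) + J[t + 1, j])
            J[t, i] = best
    return J

J = dp_backward([80.0, 120.0, 60.0])       # a 3-step demand profile
print(J.shape)
```

The same traversal of the action grid that makes DP globally optimal is what excludes infeasible ICE powers at each step, which is the property the AM technique reproduces for SD3.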



Performance of SD3-Based Strategy
The performance of the SD3-based strategy in terms of convergence, stability, and control is examined in this section. Here, the initialization of the parameters of the SD3's networks is random, meaning that SD3 is not infused with any prior knowledge that contributes to energy optimization before learning. Also, SD3 incorporates the AM technology to prevent unreasonable torque distribution in the powertrain system.

Convergence Performance
High-quality and stable convergence performance is a reflection of the powerful learning performance of the DRL strategy.Generally, the reward curve is the only metric to assess whether a DRL strategy converges.It shows how the total reward changes in each episode during the training process.In the early stages of training, reward values tend to be low and fluctuate widely.This is because the DRL strategy needs to explore both good and bad experiences.As experience accumulates, the reward value will gradually increase and eventually remain at a high point.Figure 7 illustrates the reward curves of SD3 under the two cycles.According to Figure 7, SD3 completed its early exploration for the CHTC-B and WVUCITY cycles after 69 and 176 episodes, respectively, and the reward value started to fluctuate modestly at the peak.It is important to note that the reward curve finally converges to a straight line, indicating that SD3 has very high convergence stability.
SOC maintenance is a vital control objective for energy management, so the convergence of the terminal SOC should also be a criterion for evaluating the convergence capability of the policy. Figure 8 depicts the variation of the terminal SOC of SD3 during the learning process. According to Figure 8, the terminal SOC reaches convergence at 568 and 572 episodes under the CHTC-B and WVUCITY cycles, respectively. Afterwards, the terminal SOC fluctuates steadily around 0.6, with no value falling below 0.595 and none exceeding 0.605 (the criteria for judging convergence). Table 3 also displays the mean and variance of the terminal SOC for episodes 600 to 1000. It is evident that the terminal SOC variations after convergence are extremely small, which serves as proof of the high-quality convergence of SD3. Furthermore, the terminal SOC does not converge when the reward curve does, highlighting the significance of employing the terminal SOC as a gauge for the convergence of the strategy.
The analysis of terminal SOC convergence discussed above shows that SD3 has a significant capacity for SOC maintenance. To further demonstrate the SOC planning capability of SD3, Figure 9 displays the SOC trajectories of SD3 and DP under the two cycles. It is evident that the SD3's battery does not exhibit overcharging or over-discharging, and the SOC is tightly controlled to fluctuate only within a narrow range between 0.58 and 0.61. The battery can therefore operate in shallow cycles, which benefits charge-discharge efficiency and reliability. Additionally, the SOC change trends of SD3 and DP are strikingly comparable in the boxed area of Figure 9, indicating that their charging and discharging rules are identical.
The power allocation of the power system can reflect the basic control logic of the strategy, including the operating-mode switching logic and the engine control logic. Figure 10 shows the power split between the ICE and the battery for SD3 and DP under the two cycles, with the black dotted lines in the figures representing the ICE's maximum feasible power at each time step. As shown in Figure 10, the ICE power of SD3 is essentially neither too low nor too high, indicating that SD3 optimizes the operating point of the ICE by planning the energy allocation. This is because SD3 follows Bellman's optimality theory and can plan globally for the entire driving cycle. Also, SD3 tends to utilize more battery power when the vehicle is starting or moving at low velocity. This control behavior is in line with engineering experience and contributes to the vehicle's fuel economy. It is also noteworthy that the output power of the ICE never exceeds the maximum limit, because SD3 filters the output action with the AM technique, which demonstrates the excellent reliability of the proposed AM method.
Since the ICE is the primary target of this study's energy optimization, the distribution of ICE operating points can impartially reflect SD3's performance. Corresponding to the optimal BSFC operating curve of the subject ICE, Figure 11 shows the ICE power distributions for SD3 and DP, with the red curve displaying the BSFC for each power. As depicted in Figure 11, the ICE power of SD3 is primarily spread between 100 and 120 kW under the CHTC-B cycle, whereas it is spread between 50 and 110 kW during the WVUCITY cycle. These intervals precisely match the low-BSFC region of the subject ICE, indicating that SD3 learns to improve the efficiency of the ICE as much as possible. Statistics show that under the CHTC-B and WVUCITY cycles, the average BSFCs of the ICEs of SD3 and DP are 205.64 g/kWh and 201.12 g/kWh, and 204.81 g/kWh and 201.54 g/kWh, respectively. It is evident that SD3 has very high fuel efficiency. Consequently, SD3 has a powerful ICE control performance, which is ultimately reflected in the vehicle's fuel consumption.
Table 4 lists the fuel consumption of SD3 and DP under the two driving cycles. The final SOC of each strategy is close to the initial value, so the effect of the SOC deviation on fuel consumption is negligible. It can be seen that the fuel economy of SD3 is comparable to that of DP. Under the CHTC-B and WVUCITY cycles, the fuel consumption of SD3 is only 1.06% and 0.65% higher than that of DP, respectively. This indicates that the fuel economy of the proposed strategy is close to the global optimum.
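The terminal-SOC convergence criterion described above (all recent terminal SOCs within 0.595-0.605) can be expressed as a simple check over recent episodes; the 50-episode window below is an illustrative assumption, not a value taken from the paper.

```python
def terminal_soc_converged(terminal_socs, target=0.6, tol=0.005, window=50):
    """Judge convergence as in the text: the terminal SOC of the last
    `window` episodes must all stay within target +/- tol
    (0.595..0.605 for the reference bus). The window length is an
    illustrative choice."""
    if len(terminal_socs) < window:
        return False
    return all(abs(s - target) <= tol for s in terminal_socs[-window:])

# A run that settles near 0.6 after early exploration:
history = [0.55, 0.63, 0.58] + [0.600 + 0.002 * ((-1) ** k) for k in range(60)]
print(terminal_soc_converged(history))  # -> True
```

Monitoring this criterion alongside the reward curve matters because, as noted above, the two do not converge at the same episode.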


Impact of Action Masking
The purpose of this section, which extends the preceding one, is to compare the performance of SD3 with AM against SD3 without AM (SD3-AM for short) to examine the effect AM has on learning performance. The same random seed is used for SD3 and SD3-AM to ensure a fair comparison, meaning that the initial weights of their neural networks are identical.
Figures 12 and 13 display the reward and terminal SOC curves of SD3 and SD3-AM under the two cycles, respectively. As illustrated in Figures 12 and 13, the terminal SOC and reward values of SD3 and SD3-AM in the first episode are nearly identical, a result of the identical random seeds used in both. In addition, Figure 12 shows that the reward curves for SD3 and SD3-AM overlap after convergence, which suggests that the maximum rewards they can earn differ little; it also implies that AM does not impair the learning performance of SD3. Additionally, Figure 13 demonstrates that, under the CHTC-B and WVUCITY cycles, the terminal SOC of SD3 converges 16.59% and 13.07% earlier than that of SD3-AM, which suggests that adding AM considerably accelerates SD3's learning rate. In conclusion, AM does not reduce SD3's learning stability or speed. This is because AM does not distort the distributions of the environment and thus does not violate the mathematical principles of SD3.
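The seed-matching used for this comparison can be illustrated as follows; the layer sizes and NumPy-based initialization are assumptions for the sketch, not the paper's actual networks.

```python
import numpy as np

def init_actor_weights(seed, sizes=(6, 64, 64, 1)):
    """Initialize actor weights from a fixed seed so that two training
    runs (e.g. SD3 and SD3-AM) start from identical parameters.
    The layer sizes are illustrative, not the paper's architecture."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((m, n)) * 0.1
            for m, n in zip(sizes[:-1], sizes[1:])]

w_sd3 = init_actor_weights(seed=42)
w_sd3_am = init_actor_weights(seed=42)
# Same seed -> bitwise-identical initial weights, so any later divergence
# between the two runs is attributable to action masking alone.
print(all(np.array_equal(a, b) for a, b in zip(w_sd3, w_sd3_am)))  # -> True
```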
Figure 14 shows the number of invalid actions output by SD3-AM in each round during the learning process. From Figure 14, SD3-AM has difficulty avoiding invalid actions. The first reason is that SD3-AM avoids falling into a local optimum by randomly exploring the entire action space, which puts the algorithm at risk of selecting invalid actions. The second reason is that SD3-AM gradually masters the highly efficient control method for the ICE during learning, i.e., controlling the ICE to work in the low-BSFC region. However, low BSFC corresponds to relatively high power, and the MG1 is easily overloaded when the powertrain's demand speed or demand torque is too high.
Figures 15-17 depict the distribution of ICE power, the powertrain's power allocation, and the SOC fluctuation of SD3-AM under the two cycles, respectively. Figure 15 shows that the ICE primarily operates in the region with the lowest BSFC, suggesting that SD3-AM has exceptionally high fuel efficiency. However, Figure 16 reveals that the output power of the ICE exceeds the maximum feasible power several times, which must be categorically forbidden in engineering practice. As a result, shielding invalid actions is essential even while aiming for the best fuel efficiency. Figure 17 illustrates that the SOC curves of SD3-AM and SD3 differ, mainly in the smaller valley SOC of SD3-AM.
To evaluate SD3-AM more objectively, simulation tests were also conducted for DP without powertrain constraints (DP-AM for short). Table 5 lists the fuel consumption of SD3-AM and DP-AM. It is evident that SD3-AM has excellent fuel efficiency and that its fuel consumption differs very little from DP-AM's. Comparing Tables 4 and 5, it is also evident that SD3-AM achieves a better economy than SD3 and DP. Since SD3-AM does not consider the physical constraints of the powertrain, it can make the ICE work more in the low-BSFC (high-power) region, achieving lower fuel consumption. Even though AM thus slightly impacts fuel economy, it ensures that the system's power distribution is feasible, which is essential for engineering applications. Taken together, AM is a promising technology in DRL energy management.

Impact of Transfer Learning on SD3-Based Strategy
The primary purpose of this section is to explore the application of TL techniques in the SD3-based strategy (SD3 with TL is denoted SD3+TL). Therefore, in this section, the network parameters of SD3+TL are initialized by the TL technique; that is, the CHTC-B and WVUCITY cycles are crossed as C_S and C_T. When the WVUCITY cycle is used as C_T, the network initialization reuses the parameters learned on CHTC-B, and vice versa. Moreover, this section also applies the AM technique to SD3 to prevent unreasonable torque distribution in the system.
Figure 18 shows the reward and terminal SOC curves of SD3+TL under the two cycles. By contrasting Figures 7, 8 and 18, it can be seen that SD3+TL and SD3 start to converge under the CHTC-B cycle after 184 and 568 episodes, respectively, while under the WVUCITY cycle they start to converge after 99 and 572 episodes, respectively. Obviously, TL can significantly improve the learning efficiency of the strategy, reducing the required learning time by at least 67.61%.
Figures 19 and 20 show the SOC trajectory and ICE power distribution of SD3+TL under the two cycles, respectively. According to the results displayed in Figure 19, despite SD3+TL being initialized with prior knowledge and having those layers frozen before training, the overall SOC trends of SD3+TL and SD3 are similar, with the main differences in the first 400 s of the driving cycle. During this period, SD3+TL displays more charge depletion in the CHTC-B cycle and more charge sustenance in the WVUCITY cycle. Moreover, by comparing Figures 11 and 20, it can be found that the ICE power distributions of SD3 and SD3+TL are similar, with both ICEs mainly operating in the low-BSFC region, demonstrating that TL does not alter the strategy's overall learning direction.
Table 6 compares the fuel economy of SD3, DP, and SD3+TL. It can be discovered that SD3+TL is more economical than SD3 under the CHTC-B cycle, and its fuel consumption is only 0.05% greater than that of DP. Under the WVUCITY cycle, the economy of SD3+TL is just marginally worse than that of SD3, and its fuel usage is only 1.43% greater than that of DP. As a result, TL has almost no impact on fuel economy while improving learning efficiency.


Conclusions
This study develops a novel energy management strategy for HEUBs using the state-of-the-art continuous control DRL algorithm, SD3. Additionally, the SD3 algorithm is improved by the proposed action masking method, which prevents the DRL control loop from outputting torque allocations that do not satisfy the physical constraints of the powertrain system. Meanwhile, the application of the transfer learning technique in developing DRL-based energy management strategies for HEUBs is explored, enabling the rapid development of SD3-based strategies. Overall, the proposed strategy is proven effective in improving the fuel optimality of the reference HEUB. Furthermore, the following conclusions are obtained:

• The proposed AM technique can effectively filter invalid actions without affecting the learning performance or stability of SD3. Under the CHTC-B and WVUCITY cycles, SD3 with AM converges faster than SD3 without AM, and its fuel economy reaches at least 98.94% of that of DP.

• The TL technique can considerably accelerate the learning of SD3. Under the CHTC-B and WVUCITY cycles, the learning time of SD3 with TL is at least 67.61% less than that of SD3 without TL. Moreover, TL has almost no impact on the final control performance and fuel economy of SD3.

Figure 2. Optimal BSFC curve of the ICE.

The power battery pack is simulated by a first-order equivalent circuit model, which is practical and effective for energy management problems [36]. The voltage characteristic curves of the battery are shown in Figure 3, which indicate that the open-circuit voltage depends on the battery's state of charge (SOC). The battery model is parameterized by the open-circuit voltage, internal resistance, initial SOC, and capacity of the battery.
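As an illustrative sketch (not code from the original study), the first-order (Rint) equivalent circuit model can be integrated as follows. The voltage, resistance, and capacity values are hypothetical placeholders; in practice, the open-circuit voltage and internal resistance would be looked up from the SOC-dependent characteristic curves in Figure 3:

```python
import math

def battery_step(soc, p_batt_kw, v_oc=600.0, r_int=0.08,
                 capacity_ah=60.0, dt=1.0):
    """One integration step of a first-order (Rint) equivalent circuit model.

    soc         : current state of charge (0..1)
    p_batt_kw   : battery terminal power in kW (positive = discharge)
    v_oc        : open-circuit voltage in V (hypothetical constant here)
    r_int       : internal resistance in ohm (hypothetical)
    capacity_ah : nominal capacity in Ah (hypothetical)
    dt          : step length in s
    """
    p_batt = p_batt_kw * 1e3  # convert to W
    # Terminal power P = V_oc*I - I^2*R; solve the quadratic for current I
    disc = v_oc ** 2 - 4.0 * r_int * p_batt
    if disc < 0.0:
        raise ValueError("power demand exceeds battery capability")
    i_batt = (v_oc - math.sqrt(disc)) / (2.0 * r_int)  # A
    # Coulomb counting: SOC falls when discharging, rises when charging
    soc_next = soc - i_batt * dt / (capacity_ah * 3600.0)
    return soc_next, i_batt
```

Solving the quadratic for the current (rather than dividing power by terminal voltage) is what makes the Rint model consistent under both discharging and regenerative charging.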

Figure 3. Characteristic curves of the battery.

Figure 4. The framework of the SD3-based EMS.

Figure 5. The reasonable working range of the ICE.

Figure 6. Driving cycles used for algorithm training and validation. (a) CHTC-B cycle; (b) WVUCITY cycle.

Figures 15-17 depict the distribution of ICE power, the powertrain's power allocation, and the SOC fluctuation of SD3-AM under the two cycles, respectively. Figure 15 shows that the ICE primarily operates in the region with the lowest BSFC, suggesting that SD3-AM has exceptionally high fuel efficiency. However, Figure 16 reveals that the output power of the ICE exceeds the maximum feasible power several times, which should be categorically forbidden in engineering practice. As a result, shielding the invalid action is essential while aiming for the best fuel efficiency. Figure 17 illustrates that the SOC curves of SD3-AM and SD3 differ, mainly in the smaller valley SOC of SD3-AM.

Figures 19 and 20 show the SOC trajectory and ICE power distribution of SD3+TL under the two cycles, respectively. According to the results displayed in Figure 19, despite SD3+TL being injected with prior knowledge and having part of its network frozen before training, the overall trends of SOC for SD3+TL and SD3 are similar, with the main differences in the first 400 s of the driving cycle. During this period, SD3+TL displays more charge depletion in the CHTC-B cycle and more charge sustenance in the WVUCITY cycle. Moreover, by comparing Figures 11 and 19, it can be found that the ICE power distribution is similar for SD3 and SD3+TL, with the ICE in both cases mainly operating in the low-BSFC region.
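The warm-start-and-freeze scheme described above can be sketched in a framework-agnostic way. The parameter names and frozen prefixes below are hypothetical illustrations; in a deep learning framework, freezing would instead be done by disabling gradient updates on the copied layers:

```python
def transfer_weights(source_params, target_params,
                     frozen_prefixes=("actor.l1", "actor.l2")):
    """Warm-start a new agent from a trained one (transfer learning sketch).

    source_params / target_params: dicts mapping parameter names to values.
    Parameters whose names start with a frozen prefix carry the source
    (prior-knowledge) values and are marked non-trainable; all other
    parameters are copied when available but remain trainable.
    Returns (params, trainable_mask).
    """
    params, trainable = {}, {}
    for name in target_params:
        # Fall back to the fresh initialization if the source lacks this name
        params[name] = source_params.get(name, target_params[name])
        trainable[name] = not any(name.startswith(p) for p in frozen_prefixes)
    return params, trainable
```

Freezing the early layers preserves the driving-cycle-independent features learned on the source cycle, while the trainable remainder adapts to the new cycle, which is one plausible reading of the speed-up reported here.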
π_φ1⁻(s′) and π_φ2⁻(s′) denote the target actor networks with weights φ1⁻ and φ2⁻, respectively; ε denotes the noise obeying a normal distribution, which is clipped to the interval [−c, c].
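For illustration, the softmax value estimate that distinguishes SD3 from TD3 can be sketched as follows. This is a simplified Monte Carlo version for scalar actions under assumed names (q_func, target_actor), not the paper's implementation:

```python
import math
import random

def sd3_softmax_value(q_func, next_state, target_actor, beta=5.0,
                      noise_std=0.2, noise_clip=0.5, n_samples=50):
    """Softmax state-value estimate used in SD3's Bellman target (sketch).

    Actions are sampled around the target policy with clipped Gaussian
    noise (as in TD3's target policy smoothing); the value is the
    softmax-weighted average of their Q-values:
        V(s') ≈ sum_a [exp(beta * Q(s', a)) / Z] * Q(s', a)
    """
    mu = target_actor(next_state)
    q_vals = []
    for _ in range(n_samples):
        eps = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
        q_vals.append(q_func(next_state, mu + eps))
    m = max(q_vals)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_vals]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, q_vals)) / z
```

As beta grows, the estimate approaches the hard maximum (TD3-like); smaller beta smooths the target, which is the mechanism SD3 uses to trade off over- and underestimation bias.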
The AM procedure consists of three steps: (1) the action space Action = {P_ICE | P_ICE ∈ [0 kW, 160 kW]} is discretized to obtain P_Action; (2) P_ICE^max(t) and P_ICE^min(t) are calculated by traversing P_Action according to the dynamics of the powertrain system (similar to the ECMS); and (3) the feasible range R_ICE(t) = [P_ICE^min(t), P_ICE^max(t)] is obtained. Then, the ICE power P_ICE(t) output from the actor network in the SD3-based strategy is restricted to R_ICE(t) by the clip operation, i.e., P_ICE(t) = clip[P_ICE(t), P_ICE^min, P_ICE^max]. Since the clip operation does not change the original action space, it does not have any effect on the policy.
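The three steps and the final clip can be sketched as follows; powertrain_feasible is a hypothetical predicate standing in for the powertrain dynamics check:

```python
def feasible_ice_power_range(t, powertrain_feasible, p_max_kw=160.0, n_grid=161):
    """Steps (1)-(3): discretize the ICE power action space and find the
    feasible bounds [P_ICE^min(t), P_ICE^max(t)] by traversing the grid."""
    grid = [p_max_kw * k / (n_grid - 1) for k in range(n_grid)]   # step (1)
    feasible = [p for p in grid if powertrain_feasible(t, p)]     # step (2)
    if not feasible:
        raise RuntimeError("no feasible ICE power at this time step")
    return min(feasible), max(feasible)                           # step (3)

def mask_action(p_ice, p_min, p_max):
    """Clip the actor's raw ICE power output into the feasible range."""
    return max(p_min, min(p_max, p_ice))
```

Because clipping only projects the actor's output onto the feasible interval, the learned policy and its action space are left untouched, which matches the claim that AM does not alter SD3's underlying principles.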

Table 3. Statistics of the terminal SOC.

Table 4. Fuel consumption in simulation of initial SOC = 0.6.

Table 5. Fuel consumption in simulation of initial SOC = 0.6.

Since SD3-AM does not consider the physical constraints of the powertrain, it can control the ICE to work more in the low-BSFC (high-power) region, achieving lower fuel consumption. Even though AM impacts fuel economy, it ensures that the system's power distribution is reliable, which is essential for engineering applications. Taken together, AM is a promising technique for DRL-based energy management.

Table 6. Fuel consumption in simulation of initial SOC = 0.6.
