Article

Hierarchical Reinforcement Learning-Based Energy Management for Hybrid Electric Vehicles with Gear-Shifting Strategy

Cong Lan, Hailong Zhang, Yongjuan Zhao, Huipeng Du, Jinglei Ren and Jiangyu Luo
1 School of Mechanical and Electrical Engineering, North University of China, Taiyuan 030051, China
2 Institute of Intelligent Weapons, North University of China, Taiyuan 030051, China
* Author to whom correspondence should be addressed.
Machines 2025, 13(9), 754; https://doi.org/10.3390/machines13090754
Submission received: 8 July 2025 / Revised: 20 August 2025 / Accepted: 22 August 2025 / Published: 23 August 2025
(This article belongs to the Section Vehicle Engineering)

Abstract

The energy management strategy (EMS) is a core technology for improving the fuel economy of hybrid electric vehicles (HEVs). However, the coexistence of both discrete and continuous control variables, along with complex physical constraints in HEV powertrains, presents significant challenges for the design of efficient EMSs based on deep reinforcement learning (DRL). To further enhance fuel efficiency and coordinated powertrain control under complex driving conditions, this study proposes a hierarchical DRL-based EMS. The proposed strategy adopts a layered control architecture: the upper layer utilizes the soft actor–critic (SAC) algorithm for continuous control of engine torque, while the lower layer employs a deep Q-network (DQN) for discrete gear selection optimization. Through offline training and online simulation, experimental results demonstrate that the proposed strategy achieves fuel economy performance comparable to dynamic programming (DP), with only a 3.06% difference in fuel consumption. Moreover, it significantly improves computational efficiency, thereby enhancing the feasibility of real-time deployment. This study validates the optimization potential and real-time applicability of hierarchical reinforcement learning for hybrid control in HEV energy management. Furthermore, its adaptability is demonstrated through sustained and stable performance under long-duration, complex urban bus driving conditions.

1. Introduction

To alleviate the energy crisis and mitigate environmental pollution, electric vehicles (EVs) and hybrid electric vehicles (HEVs) are widely regarded as key technologies for achieving sustainable transportation [1]. Under the current development trend, HEVs demonstrate superior fuel economy and lower emissions compared to conventional internal combustion engine vehicles. Meanwhile, considering that EVs still face limitations in driving range due to the lack of significant breakthroughs in battery technology [2], HEVs are considered a more reliable transitional solution thanks to their longer driving range and reduced dependence on batteries [3]. Because an HEV typically combines two or more power sources, its efficiency and performance depend heavily on the rational allocation of energy among these sources [4]. Coordinating the power flow among the different energy sources is therefore a critical technical challenge in HEVs. At the heart of this challenge lies the energy management strategy (EMS), whose design directly influences the vehicle’s overall fuel economy and dynamic performance [5].
The EMS of HEVs primarily includes rule-based strategies, the equivalent consumption minimization strategy (ECMS), dynamic programming (DP), and model predictive control (MPC) [6]. Rule-based strategies offer good real-time performance and low computational cost; however, their control logic typically relies on expert knowledge, resulting in limited adaptability and flexibility under complex driving conditions [7]. Both ECMS and MPC exhibit high computational efficiency and real-time capability. ECMS simplifies the global optimization problem into a series of single-step optimizations to reduce computational complexity, thereby achieving only local optimality [8]. MPC employs a receding-horizon optimization approach to generate control strategies online [9]. Chen et al. [10] employed a hierarchical MPC based on the Koopman model for electric vehicles in car-following scenarios, effectively reducing energy consumption while further improving computational efficiency. Nevertheless, the performance of both MPC and ECMS heavily depends on the accuracy of the system model and the pre-calibration of equivalent factors [11]. DP, on the other hand, is capable of yielding theoretically global optimal solutions, but due to its substantial computational burden and the requirement for complete future driving information, it is generally used as a performance benchmark [12].
With the rapid advancement of artificial intelligence, particularly reinforcement learning (RL) techniques, new avenues have been opened to address these challenges [13]. RL, with its model-free nature, leverages observational data from dynamic environments and learns autonomously through interactive trial-and-error, thereby overcoming the performance bottlenecks of traditional model-based methods caused by parameter inaccuracies and structural simplifications [14]. Liu et al. [15] developed an RL-enabled control strategy for HEVs by combining speed prediction indicators with Q-learning, demonstrating its effectiveness compared to stochastic dynamic programming methods. The strategy’s real-time performance and optimization capability were validated through hardware-in-the-loop experiments. However, the Q-learning algorithm relies on a 2-D lookup table to select actions, which requires the action space to be discrete and limits the dimensionality of variable spaces. Otherwise, it may suffer from “discretization error” and the “curse of dimensionality” [16].
To overcome these limitations, researchers have proposed integrating deep learning with reinforcement learning to address dimensionality and discretization issues. Compared with Q-learning, deep Q-network (DQN) uses neural networks to approximate the state-action value function, enabling it to handle more complex state and action spaces [17]. Li et al. [18] validated the adaptability of DQN across various driving cycles. However, DQN is inherently limited to discrete action spaces. To address continuous action spaces, deep deterministic policy gradient (DDPG) combines the structural advantages of DQN while enhancing stability and convergence under the actor–critic framework [19]. Lian et al. [20] further improved fuel economy by embedding expert knowledge into the DDPG framework.
Although deep reinforcement learning (DRL) has made significant progress in handling either fully discrete or continuous action spaces, real-world applications often require managing hybrid action spaces that involve both discrete and continuous variables simultaneously [21]. Gear selection in automatic transmissions introduces an additional degree of freedom in parallel HEVs, significantly influencing the operating points of both the engine and the motor. Therefore, shift commands should be optimally coordinated with power distribution to achieve efficient overall control. Traditional DRL algorithms face notable challenges in dealing with such hybrid action spaces. Existing approaches frequently attempt to homogenize heterogeneous action spaces to make them compatible with standard algorithms [22]. For instance, Zhao et al. [23] discretized continuous actions and combined them with DQN to jointly control battery current and gear-shifting operations in hybrid electric buses. On the other hand, Wang et al. [24] transformed discrete actions into continuous ones and applied DDPG to simultaneously learn driving modes and power distribution strategies. However, such transformations can restrict policy expressiveness, reduce optimization efficiency, and lead to unstable training processes, thereby degrading control performance [25]. To address these challenges, Xiong et al. [26] proposed a parameterized DQN approach that seamlessly integrates the network structures of DDPG and DQN, enabling direct learning in hybrid action spaces without exhaustive search over continuous parameters. Building on this, Tang et al. [27] implemented a Double-Deep Reinforcement Learning (DDRL) framework that combines DDPG and DQN to jointly control torque distribution and gear-shifting operations in HEVs, offering a hybrid solution capable of distinguishing between discrete and continuous actions.
Based on the above research, to further improve the decision-making efficiency and policy stability in the hybrid action space, this paper proposes a joint soft actor–critic (SAC) + DQN architecture. The proposed hierarchical EMS, which integrates SAC and DQN, is capable of simultaneously managing both continuous and discrete action variables, ensuring energy efficiency, battery state-of-charge (SOC) stability, and gear shifting coordination. The main contributions of this paper are summarized as follows:
  • The gear-shifting strategy is incorporated into the EMS, enabling more efficient coordination between the engine and motor across different gear ratios, thereby enhancing overall fuel economy and vehicle performance.
  • A hierarchical reinforcement learning (HRL) framework with a hybrid action space is introduced, combining the strengths of SAC and DQN to achieve simultaneous optimization of continuous (engine torque) and discrete (gear selection) control actions.
  • Through offline training and online testing, the proposed HRL approach is validated for its practicality and superiority in managing hybrid action spaces. Its adaptability is further demonstrated under long-duration and complex urban bus driving conditions.
The remainder of this paper is organized as follows: Section 2 introduces the HEV model; Section 3 presents the proposed HRL-based EMS; Section 4 discusses simulation results and analysis; and Section 5 concludes the paper.

2. Powertrain Modeling and Problem Formulation

This study adopts a typical parallel hybrid powertrain configuration, as illustrated in Figure 1. The system comprises a diesel engine, a clutch, an integrated motor-generator unit, an 8-speed automated manual transmission (8AMT), and a high-capacity battery pack. In this architecture, the clutch is positioned between the internal combustion engine and the electric motor. This layout allows the motor to operate independently when the clutch is disengaged, enabling pure electric driving or low-speed operation. When higher power output is required, the clutch engages, allowing the engine and motor to work in tandem to transmit power to the wheels via the transmission. Additionally, under specific operating conditions, the engine can charge the battery through the motor functioning as a generator, with electrical energy transferred via an inverter. This dual-mode drive strategy enables both independent and cooperative operation of the engine and motor, thereby enhancing overall energy efficiency and driving performance. The specific vehicle parameters are shown in Table 1.

2.1. Power Demand Model

According to vehicle longitudinal dynamics, the torque demand during driving must overcome rolling resistance, acceleration resistance, aerodynamic drag, and grade resistance. These torque requirements are defined as follows [28]:
T_w = \left[ m g f \cos\theta + \frac{1}{2} A C_d \rho_a v^2 + m g \sin\theta + \delta m \frac{dv}{dt} \right] r
where m is the vehicle mass, g is the gravitational acceleration, f is the rolling resistance coefficient, θ is the road gradient, A is the frontal area, C_d is the aerodynamic drag coefficient, ρ_a is the air density, v is the vehicle speed, δ is the mass factor accounting for the rotational inertia of drivetrain components, and r is the effective wheel radius.
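For concreteness, the torque-demand calculation can be sketched as a short function. The vehicle mass, frontal area, drag coefficient, and rolling resistance coefficient below are taken from Table 1; the air density, mass factor, and wheel radius are illustrative assumptions, since the paper does not list them.

```python
import numpy as np

def wheel_torque_demand(v, dv_dt, theta=0.0,
                        m=18000.0, g=9.81, f=0.0085,
                        A=7.28, C_d=0.52, rho_a=1.225,
                        delta=1.05, r=0.51):
    """Wheel torque demand from the longitudinal dynamics balance above."""
    F_roll = m * g * f * np.cos(theta)        # rolling resistance
    F_aero = 0.5 * A * C_d * rho_a * v ** 2   # aerodynamic drag
    F_grade = m * g * np.sin(theta)           # grade resistance
    F_accel = delta * m * dv_dt               # acceleration resistance
    return (F_roll + F_aero + F_grade + F_accel) * r

# Example: an 18 t bus at 40 km/h, accelerating at 0.5 m/s^2 on a level road
print(wheel_torque_demand(v=40 / 3.6, dv_dt=0.5))
```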
Braking is inevitable during vehicle operation. In this phase, the electric motor switches to generator mode, recovering mechanical energy and converting it into electrical energy that is stored in the battery. The driving and braking torques of the vehicle are jointly provided by the engine, the electric motor, and the mechanical braking system, and the torque balance at the wheels can be expressed as follows:
T_w = \eta_T^{\mathrm{sign}(T_w)} i_g \left( T_e + T_{mot} \right) + T_b
where T_e represents the engine torque, T_mot the motor torque, η_T the transmission efficiency, i_g the gear ratio, and T_b the mechanical brake torque. Here, sign(T_w) denotes the sign function of the wheel torque T_w, which is defined as:
\mathrm{sign}(T_w) = \begin{cases} 1, & T_w > 0 \\ -1, & \text{otherwise} \end{cases}
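A minimal sketch of how this balance can be inverted in simulation is shown below, assuming the sign-dependent efficiency exponent above and neglecting mechanical braking by default; the efficiency value is an assumption.

```python
def transmission_input_torque(T_w, i_g, T_b=0.0, eta_T=0.95):
    """Total engine-plus-motor torque T_e + T_mot implied by the wheel torque balance;
    the sign(T_w) exponent switches losses between driving (losses increase demand)
    and regeneration (losses reduce recoverable torque)."""
    sign = 1.0 if T_w > 0 else -1.0
    return (T_w - T_b) / (eta_T ** sign * i_g)
```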

2.2. Engine Model

The engine is modeled using a quasi-static approach, where fuel consumption and efficiency are obtained through interpolation based on instantaneous engine speed and torque. The total fuel consumption is calculated as follows:
Fuel = \int_0^{T} m_f(T_e, \omega_e)\, dt
In this equation, m_f(T_e, ω_e) represents the instantaneous fuel consumption of the engine, T_e denotes the engine torque, and ω_e denotes the engine speed. The engine’s instantaneous fuel consumption can be obtained from the corresponding torque and speed at any given moment, as illustrated in Figure 2; the color shading in the figure indicates the fuel consumption level.
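The quasi-static fuel lookup can be sketched with a 2-D interpolator. The grid axes and the Willans-style map values below are placeholders, since the actual map of Figure 2 is not tabulated in the paper.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Placeholder fuel map: axes and values are illustrative, not the map of Figure 2.
speed_grid = np.linspace(800, 2200, 15)                  # engine speed [rpm]
torque_grid = np.linspace(0, 900, 19)                    # engine torque [Nm]
W, T = np.meshgrid(speed_grid, torque_grid, indexing="ij")
power_kw = T * W * np.pi / 30.0 / 1000.0                 # mechanical power [kW]
mf_map = power_kw / (0.38 * 42.8) + 0.2                  # fuel rate [g/s]: ~38% efficiency, 42.8 MJ/kg, idle offset

mf_interp = RegularGridInterpolator((speed_grid, torque_grid), mf_map,
                                    bounds_error=False, fill_value=None)

def fuel_rate(omega_e, T_e):
    """Instantaneous fuel consumption m_f(T_e, omega_e) by bilinear interpolation."""
    return float(mf_interp([[omega_e, T_e]])[0])

def total_fuel(omega_trace, torque_trace, dt=1.0):
    """Integrate m_f over a drive cycle (the integral above), returning grams."""
    return sum(fuel_rate(w, Tq) * dt for w, Tq in zip(omega_trace, torque_trace))
```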

2.3. Battery Model

To simplify the battery modeling process, this study adopts an internal resistance model, with its equivalent circuit shown in Figure 3a. Furthermore, the overall battery characteristics as a function of SOC are obtained experimentally and fitted to the curve shown in Figure 3b. The battery output power, current, and SOC at each moment can be calculated using the following equations:
P_{batt}(t) = V_{oc}(t) I(t) - R_0 I^2(t)
I(t) = \frac{V_{oc}(t) - \sqrt{V_{oc}^2(t) - 4 R_0 P_{batt}(t)}}{2 R_0}
SOC(t) = \frac{Q_0 - \int_0^{t} I(\tau)\, d\tau}{Q}
where P_batt(t) represents the battery power, V_oc(t) denotes the open-circuit voltage, R_0 is the internal resistance of the battery, Q_0 is the initial battery capacity, and Q denotes the nominal battery capacity.
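A single-step implementation of the internal-resistance model is sketched below. The constant open-circuit voltage and resistance are stand-ins for the SOC-dependent curves of Figure 3b; the 48 Ah capacity is taken from Table 1.

```python
import math

def battery_step(P_batt, soc, dt=1.0, V_oc=600.0, R_0=0.08, Q_Ah=48.0):
    """One-step SOC update of the Rint model above.

    P_batt > 0 discharges the battery, P_batt < 0 charges it (regeneration).
    V_oc and R_0 are illustrative constants, not the fitted curves of Figure 3b.
    """
    disc = V_oc ** 2 - 4.0 * R_0 * P_batt
    if disc < 0.0:
        raise ValueError("Requested power exceeds the battery capability")
    I = (V_oc - math.sqrt(disc)) / (2.0 * R_0)   # smaller root of the power balance
    soc_next = soc - I * dt / (Q_Ah * 3600.0)    # coulomb counting, capacity in ampere-seconds
    return soc_next, I
```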

3. Energy Management Strategy with Gear-Shifting

DRL combines the powerful feature extraction capabilities of deep learning (DL) with the decision-making and optimization advantages of RL, and has been widely applied to high-dimensional and complex control problems. At each time step t (t = 1, 2, 3, …), the agent estimates the policy function through a deep neural network and generates an action A_t based on the current environmental state S_t. Upon executing this action, the environment returns a reward, which is used to construct the objective function for evaluating the policy’s effectiveness. The system then minimizes the loss function L(θ) using stochastic gradient descent to continuously update the network weights. Through repeated interactions with the environment and policy updates, the agent gradually learns and converges toward an optimized control strategy. The proposed HRL framework combines DQN and SAC; the details of the framework are introduced in the following sections.

3.1. Deep Q-Network

As the first DRL algorithm, DQN integrates key techniques such as Q-learning, deep neural networks, experience replay, and target networks [29]. In Q-learning, the state–action value function, together with the corresponding actions and values, is stored in a Q-table. However, as the state space grows exponentially, maintaining the Q-table becomes computationally intractable. To address this dimensionality issue, DQN employs a neural network to approximate the state–action value function:
Q(s, a; \theta) \approx Q(s, a)
where s represents the current state of the environment, a represents the action the agent can take in state s, and θ represents the parameters of the neural network. In Q-learning, the value update is written as:
Q_{\text{new}}(s, a) \leftarrow Q_{\text{old}}(s, a) + \alpha \left[ r + \gamma \max_{a'} Q_{\text{old}}(s', a') - Q_{\text{old}}(s, a) \right]
where α is the learning rate that controls the step size of parameter updates, Q_old(s, a) denotes the current Q-value estimate, γ denotes the discount factor, s′ and a′ denote the next state and an action available in it, and Q_new(s, a) denotes the updated Q-value. In DQN, the network parameters are optimized by minimizing the following loss function:
L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2 \right]
where θ^{-} denotes the parameters of the target network. The online network parameters θ are updated through gradient descent as:
\nabla_\theta L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right) \nabla_\theta Q(s, a; \theta) \right]
Moreover, DQN stores transition samples in a replay buffer and adopts mini-batch stochastic gradient descent to update the parameters efficiently, where each sample takes the form:
\text{sample} = (s, a, r, s')
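A minimal PyTorch sketch of this update is given below. The replay-buffer size (10,000), mini-batch size (64), and discount factor (0.95) follow Table 3; the learning rate, layer sizes, absence of a terminal-state mask, and the assumption that the buffer stores plain Python tuples are simplifications, not the authors' implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network approximating Q(s, a; theta); layer sizes are assumptions."""
    def __init__(self, n_state=5, n_action=3):   # 5-D state and 3 shift actions used in Section 3.3
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_state, 128), nn.ReLU(),
                                 nn.Linear(128, 64), nn.ReLU(),
                                 nn.Linear(64, n_action))

    def forward(self, s):
        return self.net(s)

q_eval, q_target = QNet(), QNet()
q_target.load_state_dict(q_eval.state_dict())
optimizer = torch.optim.Adam(q_eval.parameters(), lr=5e-3)   # assumed learning rate
replay = deque(maxlen=10_000)      # stores (state, action_index, reward, next_state) tuples
gamma = 0.95

def dqn_update(batch_size=64):
    """One mini-batch step minimizing (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = zip(*random.sample(replay, batch_size))
    s = torch.tensor(s, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * q_target(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # A periodic or soft copy of q_eval into q_target (not shown) completes the scheme.
```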

3.2. Soft Actor–Critic

SAC incorporates entropy regularization into the optimization objective, which not only maximizes the expected return but also encourages the policy to maintain a certain degree of stochasticity [30]. By constraining entropy, SAC prevents the policy from prematurely converging to a suboptimal solution and enables it to better approximate the global optimum.
The objective function of SAC is thus extended to include an entropy regularization term:
\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \left( r(s_t, a_t, s_{t+1}) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right) \,\middle|\, s_0 = s \right]
where π denotes the policy, s_t and a_t represent the state and action at timestep t, γ is the discount factor, r is the reward function, and α is the temperature coefficient used to balance the trade-off between reward maximization and entropy.
The entropy weight determines the extent of policy exploration [31]; through adaptive temperature adjustment, SAC achieves a dynamic balance between exploration and exploitation. The policy entropy is defined as:
\mathcal{H}\left(\pi(a_t \mid s_t)\right) = -\sum_{a_t} \pi(a_t \mid s_t) \log \pi(a_t \mid s_t)
To adaptively control the influence of entropy in the optimization process, the temperature coefficient α is optimized using the following objective:
J(\alpha) = \mathbb{E}_{(s_t, a_t) \sim \pi}\left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \mathcal{H}_0 \right]
This loss function encourages the policy’s entropy to approach the predefined target entropy H_0, allowing the stochasticity of the policy to be adjusted dynamically and its behavior to be controlled flexibly. In SAC, the Q-function is formulated as:
Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^{t} \left( r(s_t, a_t, s_{t+1}) + \alpha \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right) \right]
Accordingly, the corresponding state value function V π ( s ) is reformulated to incorporate entropy, as follows:
V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) - \alpha \log \pi(a \mid s) \right]
The loss function of the Q-network is defined as:
J_Q(\theta) = \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ \frac{1}{2} \left( Q_\theta(s, a) - \left( r(s, a) + \gamma\, \mathbb{E}_{s'}\left[ \hat{V}_{\bar{\theta}}(s') \right] \right) \right)^{2} \right]
Here, D denotes a batch of samples drawn from the replay buffer, and \hat{V}_{\bar{\theta}}(s') is the value estimated by the target value network at the next state. The gradient of this critic loss is computed as:
\nabla_\theta J_Q(\theta) = \nabla_\theta Q_\theta(s_t, a_t) \left( Q_\theta(s_t, a_t) - r_t - \gamma \left( Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \right) \right)
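The two SAC-specific ingredients above, the entropy-regularized critic target and the adaptive temperature, can be sketched as follows. The policy and target critic are assumed interfaces (a policy that returns a sampled action with its log-probability), and the learning rate and target entropy are assumptions rather than values reported in the paper.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # alpha = exp(log_alpha) stays positive
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-3)
target_entropy = -1.0                            # common choice: -dim(action) for a 1-D torque command
gamma = 0.95

def soft_td_target(r, s_next, policy, q_target_net):
    """Entropy-regularized target: r + gamma * (Q_target(s', a') - alpha * log pi(a'|s'))."""
    with torch.no_grad():
        a_next, logp_next = policy(s_next)       # assumed interface: sample action and its log-prob
        alpha = log_alpha.exp()
        return r + gamma * (q_target_net(s_next, a_next) - alpha * logp_next)

def temperature_update(logp):
    """Drive the policy entropy toward the target entropy H_0 (temperature objective above)."""
    alpha_loss = -(log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
```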

3.3. Energy Management Strategy Based on Hierarchical Reinforcement Learning

For the energy management problem of parallel hybrid electric vehicles, the primary control targets are the engine and the transmission. In this study, an HRL-based EMS framework is proposed, as illustrated in Figure 4. The design concept involves applying different reinforcement learning algorithms to appropriate subsystems, thereby enabling hierarchical control at the system level. Specifically, the upper-level policy adopts the SAC algorithm to regulate the engine’s output torque, while the lower-level policy employs the DQN algorithm to make gear-shifting decisions for the transmission.
The pseudocode of the HRL algorithm is summarized in Table 2. In addition, key variables and parameters involved in the system will be defined in detail in the following sections.
Table 3 lists the hyperparameter configurations used in this study. These hyperparameters play a critical role in the training effectiveness and convergence performance of the DRL algorithms.
The state space is used to represent the sensory information perceived by the agent from the environment:
S = \{ v, a, n_g, SOC, m_f \}
Here, vehicle speed v and acceleration a reflect driving conditions and are used to infer the total torque demand. Gear position n_g and battery state of charge SOC describe the drivetrain status and battery condition, respectively. The fuel consumption rate m_f represents the engine’s fuel efficiency.
By jointly optimizing discrete gear shifting and continuous torque control, the fuel economy of HEVs can be further improved. However, standard DRL algorithms struggle to train effectively in a mixed action space that includes both discrete and continuous actions. To address this, the proposed method handles the hybrid action space through hierarchical control within the HRL framework. The action space, representing the agent’s executable control behavior, is defined as:
A = \begin{cases} T_e & (\text{SAC}) \\ n_g & (\text{DQN}) \end{cases}
Here, the engine torque command T_e produced by SAC is normalized to the range [0, 1]. The gear-shift command n_g takes three discrete values, −1, 0, and 1, corresponding to downshift, hold, and upshift, respectively. To limit the gear-shift frequency, each accepted shift command is held for 3 s.
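A small sketch of how these commands can be applied is given below, assuming a 1 s control step (so the 3 s hold corresponds to three steps) and an assumed maximum engine torque for de-normalization; neither value is stated in the paper.

```python
T_E_MAX = 900.0     # assumed maximum engine torque [Nm]
HOLD_STEPS = 3      # 3 s hold interval at an assumed 1 s control step
N_GEARS = 8

class GearShifter:
    """Applies the DQN shift command {-1, 0, +1} with the 3 s holding interval."""
    def __init__(self, initial_gear=1):
        self.gear = initial_gear
        self.hold = 0

    def apply(self, shift_cmd):
        if self.hold > 0:
            self.hold -= 1                                   # ignore commands while holding
        elif shift_cmd != 0:
            self.gear = min(max(self.gear + shift_cmd, 1), N_GEARS)
            self.hold = HOLD_STEPS
        return self.gear

def engine_torque(a_sac):
    """Map the normalized SAC action in [0, 1] to an engine torque command [Nm]."""
    return min(max(float(a_sac), 0.0), 1.0) * T_E_MAX
```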
The definition of the reward function directly influences the optimization of DRL-based energy management systems and is crucial for determining the optimization objectives and weight parameters. The cumulative reward is a joint reward shared by both the DQN and SAC strategies, originating from a unified feedback signal provided by the environment. During training, the two policy modules utilize a shared replay buffer, while independently computing their respective loss functions and updating their parameters separately. The reward function is defined as follows:
R = \alpha m_f + \beta \left( SOC - SOC_0 \right)^{2} + \gamma \phi_\omega
Here, α, β, and γ represent the weighting coefficients for each component, m_f denotes the instantaneous fuel consumption, SOC_0 is the target final value of the battery SOC, and φ_ω is used to constrain the engine operation within its optimal operating region during the learning of the gear-shifting strategy. It is defined as follows:
\phi_\omega = \begin{cases} 1, & n_g = 1 \ \text{and}\ \omega_e \le 2000 \\ 1, & n_g \in [2, 7] \ \text{and}\ 800 \le \omega_e \le 2000 \\ 1, & n_g = 8 \ \text{and}\ \omega_e \ge 800 \\ -1, & \text{otherwise} \end{cases}
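A sketch of this shared reward is given below. The numerical weights are assumptions (the paper does not report them), and the negative signs on the fuel and SOC terms reflect the usual convention that higher fuel use and larger SOC deviation should be penalized.

```python
ALPHA, BETA, GAMMA_W = -1.0, -50.0, 0.05   # assumed weighting coefficients, not the paper's values
SOC_REF = 0.5                              # target terminal SOC used in the experiments

def engine_region_term(n_g, omega_e):
    """phi_omega: +1 when the engine speed lies in the preferred band for the gear, -1 otherwise."""
    if n_g == 1 and omega_e <= 2000:
        return 1.0
    if 2 <= n_g <= 7 and 800 <= omega_e <= 2000:
        return 1.0
    if n_g == 8 and omega_e >= 800:
        return 1.0
    return -1.0

def shared_reward(m_f, soc, n_g, omega_e):
    """Shared reward combining fuel use, SOC regulation, and the gear/engine-speed term."""
    return ALPHA * m_f + BETA * (soc - SOC_REF) ** 2 + GAMMA_W * engine_region_term(n_g, omega_e)
```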
In the proposed HRL-based EMS, a shared reward is designed to coordinate the optimization objectives of both the upper- and lower-level controllers. The shared reward is obtained according to Equation (20). As illustrated in Figure 5, this common reward signal is simultaneously fed back to both the SAC-based torque control policy and the DQN-based gear-shifting policy, ensuring consistency in optimization at different hierarchical levels. The SAC algorithm is updated according to Equation (17), optimizing the stochastic policy by minimizing the entropy-regularized objective. The DQN, in contrast, follows the classical temporal-difference learning formulation given in Equation (7). This hierarchical structure ensures that both policies are independently updated under the guidance of a shared reward. The environment simultaneously outputs the system states and the unified reward, while the final actions (torque and gear position) are fed back into the environment. In this way, global coordination across subsystems is achieved, while preserving the algorithmic specialization of each control layer.
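A compact sketch of this shared-reward loop is shown below. The environment and the two agents are assumed interfaces rather than the authors' implementation; each transition stores both the gear and torque actions in one shared buffer, from which each agent updates its own networks.

```python
def train_episode(env, sac_agent, dqn_agent, shifter, replay):
    """One training episode of the hierarchical SAC/DQN loop with a shared reward."""
    s = env.reset()
    done = False
    while not done:
        shift_cmd = dqn_agent.act(s)                           # discrete: -1 / 0 / +1
        torque_cmd = sac_agent.act(s)                          # continuous, normalized to [0, 1]
        gear = shifter.apply(shift_cmd)                        # enforce the 3 s holding interval
        s_next, r, done = env.step(torque_cmd, gear)           # unified reward from the environment
        replay.append((s, shift_cmd, torque_cmd, r, s_next))   # shared replay buffer
        sac_agent.update(replay)                               # entropy-regularized actor-critic step
        dqn_agent.update(replay)                               # temporal-difference step
        s = s_next
```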

4. Simulation and Evaluation

4.1. Offline Training Results and Analysis

In this study, the China-World Transient Vehicle Cycle (C-WTVC), as shown in Figure 6, is selected as the target speed profile in the training environment. The corresponding total driving distance is 20.5 km, with a total duration of 1802 s. This driving cycle provides the algorithm with representative power demand characteristics, facilitating comprehensive policy learning across a variety of driving conditions.
To evaluate the effectiveness of the proposed EMS, it is compared against two other typical control strategies. The first is the DP/DP strategy, which employs the DP algorithm to simultaneously handle both continuous and discrete actions. The second is the DDPG/RB strategy, in which the continuous control variable is optimized with the DDPG algorithm while the discrete gear control is implemented via a rule-based approach; its reward therefore does not account for gear-shifting operations. This study adopts a charge-sustaining strategy to reduce peak energy demand, with the engine remaining the primary energy source. To ensure a fair comparison under identical conditions, both the initial and reference SOC are set to 0.5. The control objective is to maintain the SOC at approximately 0.5 at the end of each control cycle, thereby achieving energy balance and promoting efficient battery utilization.
The training process of DRL is essentially trial-and-error policy exploration: the agent continuously interacts with the environment to maximize the cumulative reward. Monitoring the trend of the cumulative reward per episode during training is crucial for evaluating the effectiveness of policy learning; when the cumulative reward gradually converges and stabilizes, the agent is generally considered to have completed training. To visually compare the learning capabilities of the different EMSs, the evolution of the cumulative reward during training is illustrated in Figure 7. The agents of the two DRL-based EMSs successfully converge, at Episode 119 and Episode 147, respectively, demonstrating the ability to learn effective policies within a hybrid action space. These results indicate the proposed method’s effectiveness in terms of learning efficiency.

4.2. Online Test Results

This section conducts a comprehensive performance analysis of the HRL-based EMS through a series of comparative experiments. The Orange County Transit Bus Cycle (US-OCTA) from California is adopted as the test driving cycle, as shown in Figure 8. This cycle exhibits typical characteristics of urban road operation, featuring frequent acceleration and deceleration events. It accurately reflects the dynamic response behavior of transit buses in urban traffic environments. Due to its pronounced speed fluctuations, the US-OCTA cycle offers strong representativeness and reference value for the verification and optimization of gear-shifting strategies.
Figure 9 illustrates the AMT gear shift sequences under three distinct energy management strategies. The total number of shifts for the DP-based, rule-based, and DQN-based approaches is 274, 131, and 192, respectively. Compared to the rule-based strategy depicted in Figure 9b, the DP-based method shown in Figure 9a results in significantly higher shift frequency and pronounced oscillations, particularly in the lower gears. This frequent shifting behavior reflects the global optimization objective of the DP method, but it may also lead to reduced driving comfort. In contrast, the reduced shifting activity observed in the rule-based strategy is primarily attributed to its reliance on predefined thresholds for engine speed and torque. While this approach enhances system stability and ride comfort, it lacks the flexibility to adapt to rapidly changing driving conditions. The DQN-based strategy, as illustrated in Figure 9c, strikes a balance between shift frequency and dynamic responsiveness. Notably, during the period between 1500 s and 1700 s, which features frequent acceleration and deceleration, it exhibits smooth and adaptive gear transitions that closely follow the vehicle speed variations shown in Figure 8. This behavior highlights the DQN agent’s ability to generalize to transient states and make context-aware decisions in real time, thereby maintaining vehicle performance while improving ride comfort.
The engine operating points are shown in Figure 10. The DP/DP-based EMS exhibits a wide distribution of operating points, with a concentration in regions of low fuel consumption and high efficiency. In contrast, the DDPG/RB-based EMS exhibits a broader dispersion of operating points, with most points located in areas of higher fuel consumption. This is attributed to the rule-based shifting strategy, which relies on predefined knowledge and lacks adaptability to complex driving conditions. For the SAC/DQN-based EMS, the use of step-by-step control results in a more concentrated distribution of operating points, reflecting a compromise among various optimization objectives. Moreover, with a more proactive shifting strategy, a higher proportion of efficient operating points is achieved, leading to improved average engine efficiency. As illustrated in Figure 11, the SAC/DQN-based EMS demonstrates energy consumption performance that lies between that of the DP/DP-based and DDPG/RB-based strategies. As shown in Table 4, the proposed method exhibits only a 3.06% performance gap compared to the DP/DP-based EMS, while significantly improving computational efficiency by 66.49%.
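The reported percentages follow directly from the Table 4 values, as the short check below shows.

```python
fuel = {"DP/DP": 3.2800, "DDPG/RB": 3.5265, "SAC/DQN": 3.3804}   # kg, from Table 4
time = {"DP/DP": 4724, "SAC/DQN": 1583}                          # s, from Table 4

gap_sac = (fuel["SAC/DQN"] - fuel["DP/DP"]) / fuel["DP/DP"] * 100
gap_ddpg = (fuel["DDPG/RB"] - fuel["DP/DP"]) / fuel["DP/DP"] * 100
speedup = (time["DP/DP"] - time["SAC/DQN"]) / time["DP/DP"] * 100

print(f"SAC/DQN fuel gap vs. DP: +{gap_sac:.2f}%")         # +3.06%
print(f"DDPG/RB fuel gap vs. DP: +{gap_ddpg:.2f}%")        # +7.51%
print(f"Computing-time reduction vs. DP: {speedup:.2f}%")  # 66.49%
```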
Figure 12 shows the SOC trajectories of the three EMS strategies under the test driving cycle. As illustrated, all three strategies converge to a final SOC value close to the predefined reference value of 0.5. The decisions made by the proposed EMS closely align with those of the DP benchmark. Compared with the DDPG/RB-based EMS, the proposed method exhibits a more stable SOC trajectory, with smaller fluctuations throughout the entire cycle.

4.3. Analysis of Adaptability

The effectiveness of a strategy is reflected not only in its performance under ideal training conditions but, more importantly, in its generalization capability across different driving scenarios. Conducting an adaptability analysis of the proposed algorithm enables a systematic evaluation of its robustness and stability. Therefore, this study adopts real-world bus driving cycles from the U.S. National Renewable Energy Laboratory’s Drive Cycle Analysis Tool (https://www.nrel.gov/transportation/drive-cycle-tool, accessed on 21 August 2025), as shown in Figure 13.
The battery SOC trajectory is shown in Figure 14. The SOC is maintained within the range of [0.3, 0.55], reflecting a reasonable fluctuation band, with the final SOC value remaining close to 0.5. This avoids deep charging and discharging of the battery, thereby ensuring the stability of the powertrain operation and extending battery life. Even under prolonged and complex driving conditions, the SOC can still be dynamically adjusted to stay within the ideal range, indicating that the proposed strategy effectively guides global energy planning.
Figure 15 presents the trajectories of engine output power, motor power, and fuel consumption. The engine exhibits clear intermittent operation, with frequent start–stop transitions that avoid prolonged operation under low-efficiency conditions. SAC effectively addresses complex, continuous decision-making problems by leveraging the entropy term to balance exploration and exploitation. It regulates high-frequency action selection, ensuring system responsiveness and improving energy recovery efficiency. Figure 16 illustrates the AMT gear shift sequence under DQN control. The transmission control strategy demonstrates clear logic and responsiveness. Figure 17 illustrates the usage distribution of each gear under the test driving conditions. Due to the frequent stop-and-go nature of urban bus operations, the vehicle spends a significant amount of time at low speeds, resulting in a high usage rate of 1st gear, accounting for 66.30%. The usage proportions of 2nd, 3rd, and 4th gears are relatively balanced—9.97%, 11.72%, and 9.68%, respectively—indicating that the system operates with a certain frequency in the medium-speed range and is capable of achieving smooth gear transitions and effective power output matching. The usage rate of higher gears is comparatively low, which aligns with the overall low-speed characteristics of the test cycle and the infrequent occurrence of high-speed cruising. This further validates that the proposed gear-shifting strategy can intelligently adapt gear selection to the actual driving conditions. These results highlight the excellent adaptability of the proposed strategy across diverse driving conditions.

5. Conclusions

This paper proposes an HRL-based EMS that integrates a torque distribution strategy based on SAC and a gear-shifting strategy based on DQN. The proposed architecture enables coordinated control of discrete and continuous action spaces through intelligent algorithms. After conducting both offline training and online testing of two EMS strategies, simulation results demonstrate that the SAC/DQN-based EMS outperforms the DDPG/RB-based EMS in terms of fuel consumption. Compared with the DP/DP-based EMS, it achieves a fuel consumption gap of only 3.06%. Moreover, leveraging the inherent advantages of neural networks, the SAC/DQN-based EMS improves computational efficiency by 66.49% relative to the DP/DP-based approach, highlighting its potential for real-time applications.
The results suggest that a hierarchical DRL framework, when tailored to task-specific characteristics and hybrid action modeling, can significantly enhance control performance and mitigate the tendency of DRL algorithms to converge to locally suboptimal solutions in complex tasks.
Future research will focus on extending the HRL-based energy management strategy to more complex and realistic scenarios. While this study validates the proposed strategy’s effectiveness and real-time potential in a simulation environment, further validation on HIL platforms and real vehicles is required to assess robustness and control stability under dynamic, non-ideal conditions. Additionally, to improve adaptability to real-world driving conditions, future work may incorporate richer high-dimensional inputs—such as road gradients, traffic information, and driver behavior features—to enhance generalization and environmental awareness. Furthermore, integrating online learning mechanisms and transfer learning capabilities could support adaptive control and continuous optimization of hybrid powertrains across different vehicle types and operational contexts, broadening the applicability of this approach in the domain of intelligent vehicle energy management.

Author Contributions

Conceptualization, validation, and methodology, C.L.; methodology and writing—original draft preparation, H.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, C.L., H.D., J.R. and J.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is financed by the National Natural Science Foundation of China (52402522) and the Fundamental Research Program of Shanxi Province (Grant No. 202403011211003). This research was also funded by the Technology Innovation Leading Talent Team for Special Unmanned Systems and Intelligent Equipment (Grant No. 202204051002001).

Data Availability Statement

The data used in this analysis are publicly available and access is provided in the text.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, X.; Han, J.; Tang, X.; Lin, X. Powertrain Design and Control in Electrified Vehicles: A Critical Review. IEEE Trans. Transp. Electrif. 2021, 7, 1990–2009.
  2. Cao, Y.; Yao, M.; Sun, X. An Overview of Modelling and Energy Management Strategies for Hybrid Electric Vehicles. Appl. Sci. 2023, 13, 5947.
  3. Ganesh, A.H.; Xu, B. A Review of Reinforcement Learning Based Energy Management Systems for Electrified Powertrains: Progress, Challenge, and Potential Solution. Renew. Sustain. Energy Rev. 2022, 154, 111833.
  4. Zhang, K.; Ruan, J.; Li, T.; Cui, H.; Wu, C. The Effects Investigation of Data-Driven Fitting Cycle and Deep Deterministic Policy Gradient Algorithm on Energy Management Strategy of Dual-Motor Electric Bus. Energy 2023, 269, 126760.
  5. Sun, X.; Fu, J.; Yang, H.; Xie, M.; Liu, J. An Energy Management Strategy for Plug-in Hybrid Electric Vehicles Based on Deep Learning and Improved Model Predictive Control. Energy 2023, 269, 126772.
  6. Benhammou, A.; Hartani, M.A.; Tedjini, H.; Guettaf, Y.; Soumeur, M.A. Breaking New Ground in HEV Energy Management: Kinetic Energy Utilization and Systematic EMS Approaches Based on Robust Drive Control. ISA Trans. 2024, 147, 288–303.
  7. Bagwe, R.M.; Byerly, A.; Dos Santos, E.C.; Ben-Miled, Z. Adaptive Rule-Based Energy Management Strategy for a Parallel HEV. Energies 2019, 12, 4472.
  8. Piras, M.; De Bellis, V.; Malfi, E.; Novella, R.; Lopez-Juarez, M. Adaptive ECMS Based on Speed Forecasting for the Control of a Heavy-Duty Fuel Cell Vehicle for Real-World Driving. Energy Convers. Manag. 2023, 289, 117178.
  9. Nugroho, S.A.; Chellapandi, V.P.; Borhan, H. Vehicle Speed Profile Optimization for Fuel Efficient Eco-Driving via Koopman Linear Predictor and Model Predictive Control. In Proceedings of the 2024 American Control Conference (ACC), Toronto, ON, Canada, 10 July 2024; pp. 4254–4261.
  10. Chen, B.; Wang, M.; Hu, L.; He, G.; Yan, H.; Wen, X.; Du, R. Data-Driven Koopman Model Predictive Control for Hybrid Energy Storage System of Electric Vehicles under Vehicle-Following Scenarios. Appl. Energy 2024, 365, 123218.
  11. Wenhao, F.; Bolan, L.; Jingxian, T.; Dawei, Z. Study on Energy Management Strategy for a P2 Diesel HEV Considering Low Temperature Environment. Energy 2025, 318, 134771.
  12. Wang, Y.; Jiao, X. Dual Heuristic Dynamic Programming Based Energy Management Control for Hybrid Electric Vehicles. Energies 2022, 15, 3235.
  13. Wang, Z.; Dridi, M.; El Moudni, A. Co-Optimization of Eco-Driving and Energy Management for Connected HEV/PHEVs near Signalized Intersections: A Review. Appl. Sci. 2023, 13, 5035.
  14. Lü, X.; Qu, Y.; Wang, Y.; Qin, C.; Liu, G. A Comprehensive Review on Hybrid Power System for PEMFC-HEV: Issues and Strategies. Energy Convers. Manag. 2018, 171, 1273–1291.
  15. Liu, T.; Hu, X.; Li, S.E.; Cao, D. Reinforcement Learning Optimized Look-Ahead Energy Management of a Parallel Hybrid Electric Vehicle. IEEE/ASME Trans. Mechatron. 2017, 22, 1497–1507.
  16. Saiteja, P.; Ashok, B. Critical Review on Structural Architecture, Energy Control Strategies and Development Process towards Optimal Energy Management in Hybrid Vehicles. Renew. Sustain. Energy Rev. 2022, 157, 112038.
  17. Wang, H.; Ye, Y.; Zhang, J.; Xu, B. A Comparative Study of 13 Deep Reinforcement Learning Based Energy Management Methods for a Hybrid Electric Vehicle. Energy 2023, 266, 126497.
  18. Li, Y.; He, H.; Peng, J.; Wang, H. Deep Reinforcement Learning-Based Energy Management for a Series Hybrid Electric Vehicle Enabled by History Cumulative Trip Information. IEEE Trans. Veh. Technol. 2019, 68, 7416–7430.
  19. Hu, Y.; Li, W.; Xu, K.; Zahid, T.; Qin, F.; Li, C. Energy Management Strategy for a Hybrid Electric Vehicle Based on Deep Reinforcement Learning. Appl. Sci. 2018, 8, 187.
  20. Lian, R.; Peng, J.; Wu, Y.; Tan, H.; Zhang, H. Rule-Interposing Deep Reinforcement Learning Based Energy Management Strategy for Power-Split Hybrid Electric Vehicle. Energy 2020, 197, 117297.
  21. Zhang, J.; Tao, J.; Hu, Y.; Ma, L. An Energy Management Strategy Based on DDPG with Improved Exploration for Battery/Supercapacitor Hybrid Electric Vehicle. IEEE Trans. Intell. Transport. Syst. 2024, 25, 3999–4008.
  22. Han, R.; Lian, R.; He, H.; Han, X. Continuous Reinforcement Learning-Based Energy Management Strategy for Hybrid Electric-Tracked Vehicles. IEEE J. Emerg. Sel. Top. Power Electron. 2023, 11, 19–31.
  23. Zhao, P.; Wang, Y.; Chang, N.; Zhu, Q.; Lin, X. A Deep Reinforcement Learning Framework for Optimizing Fuel Economy of Hybrid Electric Vehicles. In Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, Republic of Korea, 22–25 January 2018; pp. 196–202.
  24. Wang, Z.; Xie, J.; Kang, M.; Zhang, Y. Energy Management for a Series-Parallel Plug-In Hybrid Electric Truck Based on Reinforcement Learning. In Proceedings of the 2022 13th Asian Control Conference (ASCC), Jeju, Republic of Korea, 4 May 2022; pp. 590–596.
  25. Qi, C.; Zhu, Y.; Song, C.; Yan, G.; Xiao, F.; Wang, D.; Zhang, X.; Cao, J.; Song, S. Hierarchical Reinforcement Learning Based Energy Management Strategy for Hybrid Electric Vehicle. Energy 2022, 238, 121703.
  26. Xiong, J.; Wang, Q.; Yang, Z.; Sun, P.; Han, L.; Zheng, Y.; Fu, H.; Zhang, T.; Liu, J.; Liu, H. Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space. arXiv 2018, arXiv:1810.06394.
  27. Tang, X.; Chen, J.; Pu, H.; Liu, T.; Khajepour, A. Double Deep Reinforcement Learning-Based Energy Management for a Parallel Hybrid Electric Vehicle with Engine Start–Stop Strategy. IEEE Trans. Transp. Electrif. 2022, 8, 1376–1388.
  28. Zhang, D. An Improved Soft Actor-Critic-Based Energy Management Strategy of Heavy-Duty Hybrid Electric Vehicles with Dual-Engine System. Energy 2024, 308, 132938.
  29. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
  30. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870.
  31. Wu, C.; Peng, J.; Chen, J.; He, H.; Pi, D.; Wang, Z.; Ma, C. Battery Health-Considered Energy Management Strategy for a Dual-Motor Two-Speed Battery Electric Vehicle Based on a Hybrid Soft Actor-Critic Algorithm with Memory Function. Appl. Energy 2024, 376, 124306.
Figure 1. Vehicle powertrain configuration.
Figure 2. Engine fuel consumption map.
Figure 3. (a) Battery equivalent circuit. (b) Battery characteristic curve.
Figure 4. Overall architecture of the proposed HRL-EMS.
Figure 5. HRL update mechanism.
Figure 6. The China-World Transient Vehicle Cycle.
Figure 7. Total cumulative reward.
Figure 8. The Orange County Transit Bus Cycle.
Figure 9. AMT shift sequence. (a) DP/DP-based EMS. (b) DDPG/RB-based EMS. (c) SAC/DQN-based EMS.
Figure 10. Engine working points of three different EMSs. (a) DP/DP-based EMS. (b) DDPG/RB-based EMS. (c) SAC/DQN-based EMS.
Figure 11. Fuel consumption trajectory.
Figure 12. SOC trajectories in three EMSs.
Figure 13. Test conditions for adaptability validation.
Figure 14. SOC trajectory under test conditions.
Figure 15. Engine, motor power, and fuel consumption.
Figure 16. AMT shift sequence.
Figure 17. AMT gear distribution ratio.
Table 1. Vehicle parameters.

Parameter | Value
Vehicle mass | 18,000 kg
Air resistance coefficient | 0.52
Cross section | 7.28 m²
Rolling resistance coefficient | 0.0085
Battery capacity | 48 Ah
Driving motor maximum power | 182 kW
Engine maximum power | 150 kW
Transmission type | 8AMT
Total gear ratio | 9.164/5.271/5.305/3.409/3.015/2.168/1.436/1.000
Table 2. Pseudocode of the hierarchical SAC-DQN algorithm.

Algorithm parameters: learning rates α_1, α_2 ∈ [0, 1], entropy coefficient α, discount factor γ ∈ [0, 1], target update rates τ_1, τ_2
Initialize the SAC actor and critic networks with parameters φ, θ_1, θ_2, the DQN evaluation and target networks with parameters θ_eval, θ_target, and the replay buffer D
Repeat for each episode:
 For t = 0, 1, …, T do
  With an ε-greedy strategy:
   Select the gear command k_t (= n_g) ~ π_DQN(s_t; θ_eval)
   Select the engine torque a_t (= T_e) ~ π_SAC(s_t, k_t; φ)
  Execute the action (k_t, a_t); observe the reward r_t and the next state s_{t+1}
  Store the transition (s_t, k_t, a_t, r_t, s_{t+1}) in D
  Sample a mini-batch from D
  Compute the soft Q targets and update θ_1, θ_2 via the Bellman loss
  Update the actor network φ using the policy gradient with entropy regularization
  Update the target Q-networks with rate τ_1
  If t mod C = 0 then
   Sample a mini-batch from D
   Compute the TD target: y = r_t + γ · max_k Q(s_{t+1}, k; θ_target)
   Update θ_eval via Equation (7)
   Update the target network: θ_target ← τ_2 θ_eval + (1 − τ_2) θ_target
  End If
 End For
End Repeat
Table 3. Selected hyperparameters of the HRL algorithms.

Hyperparameter | Value
Learning rate of the actor network | 0.003
Learning rate of the critic network | 0.005
Neuron distribution of the actor and critic (target) networks | 256/128/64
Experience pool size | 10,000
Mini-batch size | 64
Discount factor | 0.95
Number of training episodes | 200
Activation function | ReLU
Initial entropy temperature | 0.2
Optimizer | Adam
Table 4. Results of different experimental processes.

Method | Initial SOC | Final SOC | Fuel Consumption (kg) | Gap | Computing Time (s)
DP/DP-based EMS | 0.5 | 0.4995 | 3.2800 | - | 4724
DDPG/RB-based EMS | 0.5 | 0.5063 | 3.5265 | +7.51% | 1026
SAC/DQN-based EMS | 0.5 | 0.4983 | 3.3804 | +3.06% | 1583
