5.4.2. Energy Management Strategy of Hybrid Power Systems Based on Deep Reinforcement Learning
In 2013, Mnih et al. first applied convolutional neural networks (CNNs) from deep learning to reinforcement learning, introducing the deep Q-network (DQN) algorithm and laying the foundation for DRL [
86]. As a subfield of deep learning, DRL integrates the feature-extraction capability of deep learning with the decision-making capability of RL [
87], enabling optimal decisions to be made directly from high-dimensional input data. This approach constitutes an end-to-end decision and control system that reduces reliance on manually designed rules.
Chen et al. proposed an intelligent energy management strategy based on deep reinforcement learning for an unmanned-ship hybrid power system; simulations verified the fuel economy and environmental performance of hybrid ships using this strategy under different working conditions [
88].
To manage the energy of multi-mode vehicles, Hua et al. proposed a multi-agent DRL strategy in which two DDPG agents work cooperatively. By introducing a correlation degree and analyzing the factors affecting the learning performance of the two agents, a unified configuration of the two agents is obtained, and the optimal working mode under the strategy is then determined through a parametric study of the correlation degree [
89].
The basic idea of value-based DRL is to fit the value function of each state–action pair with a deep neural network and to select the action with the greatest value as the output of the policy. Q-learning is a value-function-based RL algorithm whose core is to learn an optimal action-value function that guides the agent to take the action with the highest expected return in a given state; replacing this tabular value function with a deep neural network is what turns reinforcement learning into deep reinforcement learning.
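To make the value-based idea concrete, the following minimal sketch (not taken from any of the cited studies) shows a tabular Q-learning update with an ε-greedy policy; the state discretization, action set, and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical discretization: 10 SOC bins x 10 demanded-power bins, 5 engine-power levels.
N_STATES, N_ACTIONS = 10 * 10, 5
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1     # learning rate, discount factor, exploration rate

Q = np.zeros((N_STATES, N_ACTIONS))        # tabular action-value function
rng = np.random.default_rng(0)

def select_action(state):
    """Epsilon-greedy selection over the current Q-table (random action with probability EPSILON)."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    """One Q-learning (temporal-difference) update toward the greedy bootstrap target."""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```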
The DQN algorithm mainly consists of an experience replay pool, an evaluation network, and a target network. DQN is not well suited to continuous-action problems and suffers from shortcomings such as limited stability and value overestimation. Its policy is not updated at every moment; instead, the Q-value targets are computed with a deferred-update target network, which greatly improves the convergence and stability of neural-network training. Based on the Q-learning global optimization algorithm, You Jie distributed the energy of the whole vehicle to obtain the optimal engine and motor torques and, while maintaining the battery SOC balance, simulated the fuel consumption of the hybrid vehicle against that of a conventional compact car, showing a marked improvement in fuel economy [
90].
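The replay-pool and target-network mechanics described above might be sketched as follows in PyTorch; the network size, state and action dimensions, and hyperparameters are assumptions, not parameters from the cited work.

```python
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor (assumed)

class QNet(nn.Module):
    """Small MLP mapping a state (e.g., [SOC, demanded power]) to a Q-value per discrete action."""
    def __init__(self, state_dim=2, n_actions=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

policy_net, target_net = QNet(), QNet()          # evaluation network and target network
target_net.load_state_dict(policy_net.state_dict())
replay = deque(maxlen=10_000)                    # experience replay pool of
                                                 # (state_tensor, action_int, reward, next_state_tensor)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def train_step(batch_size=32):
    """Sample past transitions and regress Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = zip(*random.sample(replay, batch_size))
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    q_sa = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def sync_target():
    """Deferred target update: periodically copy the evaluation network into the target network."""
    target_net.load_state_dict(policy_net.state_dict())
```

Calling `sync_target()` only every fixed number of training steps, rather than at every step, is what the deferred-update target network refers to.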
For plug-in hybrid electric vehicles, Tian Zejie proposed an energy management strategy based on the Q-learning algorithm, with the battery SOC and the required power as state variables and the output power of the dynamic assistance unit as the control variable; the temporal-difference algorithm is used to update the action-state values in real time. Compared with the global optimal algorithm based on PMP over a typical urban bus scenario of more than 100 km, the total cost of the Q-learning strategy is only CNY 1.57 higher, indicating the effectiveness of the Q-learning strategy from an application perspective [
91]. It is believed that, from an applied standpoint, this strategy can enhance the overall economic level of the vehicle. The results of the Q-learning EMS are shown in
Table 9.
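As an illustration of how such a state–action formulation might look in code, the sketch below discretizes the (SOC, required power) state and defines a cost-based reward; the grids, power levels, reference SOC, and weights are assumed values for illustration, not those used in [91].

```python
import numpy as np

# Illustrative discretization of the EMS state (SOC, required power) and the control
# variable (assistance-unit output power); all grids and weights are assumptions.
SOC_BINS = np.linspace(0.3, 0.8, 11)          # battery SOC grid
P_REQ_BINS = np.linspace(0.0, 60.0, 13)       # required power grid [kW]
P_OUT_LEVELS = np.linspace(0.0, 40.0, 9)      # candidate assistance-unit output powers [kW]

def state_index(soc, p_req):
    """Map continuous (SOC, required power) to a single discrete state index."""
    i = int(np.clip(np.digitize(soc, SOC_BINS), 0, len(SOC_BINS) - 1))
    j = int(np.clip(np.digitize(p_req, P_REQ_BINS), 0, len(P_REQ_BINS) - 1))
    return i * len(P_REQ_BINS) + j

def step_reward(fuel_g, soc, soc_ref=0.6, fuel_weight=1.0, soc_weight=50.0):
    """Negative operating cost: instantaneous fuel use plus a quadratic SOC-deviation penalty."""
    return -(fuel_weight * fuel_g + soc_weight * (soc - soc_ref) ** 2)
```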
Policy-based reinforcement learning directly maps the current state to the optimal action and excels at solving high-dimensional and continuous problems. Such algorithms can be subdivided into deterministic-policy algorithms and stochastic-policy algorithms. The most widely used, DDPG, combines the Actor–Critic framework with the deterministic policy gradient algorithm while retaining the experience replay mechanism and target neural networks. The determinism of DDPG lies in the fact that the agent obtains one definite action from the policy function rather than a probability density over all actions in a given state. This simplifies the sampling integration over the action space and yields a specific output action, which improves the operating efficiency of the system and offers better stability, robustness, and environmental exploration. Traditional reinforcement learning cannot handle high-dimensional state and action spaces and therefore relies on DRL, which leverages neural networks to address this high dimensionality [
92]. DQN and the double deep Q-network (DDQN) use an ε-greedy policy to retain randomness in the agent's actions for exploration. DQN and DDQN have roughly the same structure; however, DQN is prone to overestimation, which DDQN alleviates by decoupling action selection from action evaluation in the learning target [
93].
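A minimal PyTorch sketch of the DDPG components described above (a deterministic actor, a critic, target networks, and soft target updates) is given below; the state/action dimensions and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

GAMMA, TAU = 0.99, 0.005   # discount factor and soft target-update rate (assumed values)

class Actor(nn.Module):
    """Deterministic policy: maps the state directly to one continuous action (e.g., power split)."""
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, action_dim), nn.Tanh())  # action scaled to [-1, 1]
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a) for the continuous action chosen by the actor."""
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
actor_t, critic_t = Actor(), Critic()            # target networks
actor_t.load_state_dict(actor.state_dict()); critic_t.load_state_dict(critic.state_dict())

def critic_target(r, s2, done):
    """Bootstrapped DDPG target: r + gamma * Q_target(s', mu_target(s'))."""
    with torch.no_grad():
        return r + GAMMA * (1 - done) * critic_t(s2, actor_t(s2)).squeeze(-1)

def soft_update(net, target):
    """Polyak averaging of target-network parameters (the 'soft' deferred update)."""
    for p, p_t in zip(net.parameters(), target.parameters()):
        p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```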
Maroto et al. developed a double deep Q-network algorithm to manage the energy of the system in a commercial simulation environment and used a convolutional neural network to characterize the pollutant emissions of the hybrid system, thereby reducing fuel consumption and improving fuel economy [
94].
Reddy et al. developed an EMS based on Q-learning algorithms, which interacted with the simulation model of on-board hybrid systems to achieve autonomous learning. The simulation results showed that the control strategy can regulate the SOC to about 0.7, reduce SOC variation, extend battery life, improve control stability, and improve the efficiency of the hydrogen fuel system, thereby contributing to improved fuel economy [
95].
Chen Zeyu proposed an EMS for plug-in HEVs with online learning capability, based on the deep Q-network (DQN) algorithm in deep reinforcement learning. The EMS establishes an adaptive optimal power-distribution law, using the change in engine power as the action space, and constructs an “offline interactive learning + online updating” algorithm framework [
98].
Simulations were conducted both in software and under real driving conditions to validate its performance against PSO-optimized control strategies. The results indicated that the proposed deep reinforcement learning strategy reduces the online running cost of the vehicle’s hybrid system by 6.9% compared with PSO optimization. When the driving conditions change significantly, the newly developed DRL-based EMS demonstrates good adaptability to road conditions and, compared with the EMS without online learning capability, further reduces fuel consumption by 3.64%. The cost comparisons of the different energy management strategies are shown in
Table 10. The energy consumption comparison before and after controller updates is shown in
Table 11. The DRL EMS can effectively improve energy efficiency during online operation and has the ability to adapt to sudden changes in driving characteristics.
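A purely schematic sketch of such an “offline interactive learning + online updating” loop is shown below; `sim_env`, `vehicle`, and the agent methods (`act`, `store`, `learn`, `observe`, `apply`, `running`) are hypothetical interfaces standing in for a vehicle-model simulator and a DQN-style agent, not APIs from the cited study.

```python
def offline_phase(agent, sim_env, episodes=500):
    """Interact with the offline simulation model to pre-train the policy."""
    for _ in range(episodes):
        state, done = sim_env.reset(), False
        while not done:
            action = agent.act(state, explore=True)         # epsilon-greedy exploration
            next_state, reward, done = sim_env.step(action)
            agent.store(state, action, reward, next_state)  # fill the replay pool
            agent.learn()                                   # gradient step on replayed batches
            state = next_state

def online_phase(agent, vehicle, update_every=100):
    """Deploy the pre-trained policy and keep refining it from measured driving data."""
    step, state = 0, vehicle.observe()
    while vehicle.running():
        action = agent.act(state, explore=False)            # act greedily on-board
        next_state, reward = vehicle.apply(action)
        agent.store(state, action, reward, next_state)
        if step % update_every == 0:
            agent.learn()                                   # low-rate online updates
        state, step = next_state, step + 1
```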
To achieve real-time optimization and improve certain parameters of hybrid vehicles, Su Mingliang et al. proposed an energy management strategy based on deep reinforcement learning, introduced neural networks to predict the working conditions during the simulation process, and carried out simulations. The results showed that the different DRL algorithms can all regulate and improve the battery power; however, according to the optimization results obtained under the different algorithms, the DDPG-based strategy demonstrates better optimization and power-regulation performance for the battery than the DQN- and DDQN-based strategies. In terms of convergence, the DDPG-based EMS also converges faster, indicating a more accurate and superior algorithm model [
99].
Table 12 shows the results of the comparison of energy management strategies. In terms of suppressing battery power fluctuations, the DDPG-based EMS exhibits smaller fluctuations compared to the DQN- and DDQN-based strategies, leading to a more stable system.
Xie Si Wuo pointed out that hybrid electric vehicles require balancing fuel economy, battery life, and driving performance. Traditional energy management strategies such as rule-based control and dynamic programming have limitations in real-time performance or show dependence on global state information. A novel optimization strategy based on DRL, combined with model-free learning and environment adaptability, was proposed to achieve real-time optimization. Through various algorithm variants and improvement methods, the validity of this strategy was verified [
100].
Dynamic programming was adopted as the global optimal benchmark for subsequent strategy comparison. With the SOC as a state variable and the engine torque as the control input, simulations were conducted under the NEDC (New European Driving Cycle), achieving a fuel economy of 4.352 L/100 km. Its limitation, however, is the need for prior knowledge of the complete driving cycle, which makes it unsuitable for real-time application. Algorithmic improvements were then made, including the following (a minimal code sketch of the double and dueling value heads is given after the list):
Double DQN: reducing overestimation of Q-values to enhance stability.
Dueling DQN: separating state value functions from action advantage functions to improve generalization.
D3QN: combining double and dueling structures to further optimize performance.
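The double and dueling mechanisms listed above might be sketched as follows in PyTorch; using the dueling network together with the double-DQN target corresponds to D3QN. The network sizes, action count, and discount factor are assumed values.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: a shared trunk feeds a state-value stream V(s) and an advantage stream A(s,a)."""
    def __init__(self, state_dim=4, n_actions=7):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)
        self.advantage = nn.Linear(64, n_actions)

    def forward(self, s):
        h = self.trunk(s)
        a = self.advantage(h)
        # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a), the standard dueling aggregation
        return self.value(h) + a - a.mean(dim=-1, keepdim=True)

def double_dqn_target(reward, next_state, online_net, target_net, gamma=0.99):
    """Double-DQN target: the online net picks argmax a', the target net evaluates it."""
    with torch.no_grad():
        best_a = online_net(next_state).argmax(dim=-1, keepdim=True)
        return reward + gamma * target_net(next_state).gather(-1, best_a).squeeze(-1)
```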
The experimental results showed that D3QN achieved the best fuel economy at 4.697 L/100 km, reaching 90.2% of the DP benchmark, albeit with larger training fluctuations, while Dueling DQN demonstrated more stable convergence, making it better suited to practical application. Building on the policy-gradient branch of DRL, further improvements integrated heuristic rules such as engine/motor torque constraints and SOC limits to avoid unreasonable actions and accelerate convergence.
By combining the historical best experience with real-time experience, adaptability to environmental disturbances is improved and fuel economy rises by 3.2%. The ability to handle continuous action spaces is better than that of DQN, with fuel economy reaching 94.2% of the DP benchmark. A comparison with model predictive control shows a fuel economy of 5.0025 L/100 km at a prediction horizon of 50 s, i.e., 86.9% of the DP benchmark, but the computational burden is large and the real-time performance is poor.
Table 13 shows the performance of the strategy.
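The heuristic-rule integration mentioned above (torque constraints and SOC limits applied to the policy output) might look roughly like the following post-processing step; all limits and the override logic are placeholder assumptions, not values from the cited study.

```python
import numpy as np

T_ENG_MAX, T_MOT_MAX = 150.0, 200.0     # assumed engine / motor torque limits [N*m]
SOC_MIN, SOC_MAX = 0.3, 0.8             # assumed allowed battery SOC window

def apply_heuristic_rules(raw_action, soc, demanded_torque):
    """Clip the policy's proposed engine torque to feasible bounds and override it near SOC limits."""
    t_eng = float(np.clip(raw_action, 0.0, T_ENG_MAX))
    if soc <= SOC_MIN:      # battery nearly depleted: bias toward charge-sustaining operation
        t_eng = min(demanded_torque + 10.0, T_ENG_MAX)
    elif soc >= SOC_MAX:    # battery full: favor pure electric drive
        t_eng = 0.0
    t_mot = float(np.clip(demanded_torque - t_eng, -T_MOT_MAX, T_MOT_MAX))
    return t_eng, t_mot
```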
This approach demonstrated improved real-time optimization and adaptability, though its computational demands necessitated further algorithmic refinement for broader practical application.
Robustness tests across multiple working conditions show a standard deviation of only 1.24%, and the overall results indicate that DRL, and especially the improved DDPG, is superior, providing an important foundation for follow-up research and practical application. The drawbacks are that training relies on the simulation environment and does not account for real traffic uncertainties such as random traffic flow, and that neural-network parameters such as the number of layers and the activation functions were not systematically optimized. Future research could combine traffic prediction models to improve adaptability across working conditions, carry out hardware-in-the-loop verification of real-time performance, and extend the work to more practical scenarios and multi-objective optimization.
Li Jiaxi et al. noted that energy management strategies in HEVs are critical to fuel economy and battery life, and that traditional methods such as rule-based control or PID-controlled ECMS suffer from complex parameter tuning and poor adaptability. They proposed a parallel DRL method based on DDPG to optimize the equivalence factor, improving fuel economy while keeping the battery SOC under stable control. The DDPG algorithm is combined with the Actor–Critic framework, and the equivalence factor is optimized through offline training and online adjustment, overcoming the limitation of traditional ECMS, which relies on a fixed equivalence factor. An edge computing architecture is introduced through a parallel framework in which multiple edge devices jointly train the global network, significantly improving convergence speed; experiments show a 334% acceleration when eight edge devices are trained in parallel. The state design includes the battery SOC, the remaining driving range, and the previous action, supplemented by connected-vehicle information such as real-time positioning, to enhance the adaptability of the strategy.
Simulation experiments were carried out under the FTP72 driving cycle and compared with a traditional PID-controlled A-ECMS and a globally optimized Nonadaptive-ECMS. In terms of fuel economy, the DDPG strategy reduces fuel consumption by 7.2% compared with PID-A-ECMS, with consumption per 100 km falling from 8.3 L to 7.7 L. In terms of SOC retention, both DDPG and Nonadaptive-ECMS maintain the SOC effectively, but DDPG allows larger SOC fluctuations over long distances to optimize engine efficiency. The reward function takes a Gaussian form, balancing an SOC-deviation weight of 0.7 against an instantaneous fuel-consumption weight of 0.3. The Actor and Critic networks are three-layer neural networks with 120 neurons per layer and learning rates of 0.0001 and 0.0002, respectively. Gradients are synchronized in parallel between the cloud global network and the edge devices, breaking data correlation and improving training efficiency.
The theoretical contribution of the work is to combine DDPG with edge computing to provide a real-time, adaptive solution for HEV energy management. Its engineering value lies in significantly reducing fuel consumption while meeting SOC-maintenance requirements, making it suitable for dynamic, changeable real-world driving scenarios, and it offers a forward-looking design idea for energy management strategies in intelligent connected vehicles. The study also has limitations: the experiments are based on a simulation environment and require further real-vehicle verification, and the parallel framework is sensitive to communication latency. Multi-agent reinforcement learning, more complex driving conditions such as urban congestion, and hardware-in-the-loop (HIL) testing could be explored. Overall, the study achieves real-time optimization of the HEV energy management strategy through DDPG and an edge computing framework, outperforming traditional methods in fuel economy and algorithmic efficiency and providing an innovative solution for the energy management of intelligent vehicles [
101].
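One possible reading of the reported Gaussian-form reward, with a 0.7 weight on SOC deviation and a 0.3 weight on instantaneous fuel consumption, is sketched below; the Gaussian widths, the fuel-rate normalization, and the SOC reference are assumptions rather than values from the cited work.

```python
import math

W_SOC, W_FUEL = 0.7, 0.3                # reported reward weights (SOC deviation, fuel use)
SIGMA_SOC, SIGMA_FUEL = 0.05, 0.5       # assumed Gaussian widths
FUEL_SCALE = 2.0                        # assumed normalization for the fuel rate [g/s]

def gaussian_reward(soc, fuel_rate, soc_ref=0.6):
    """Reward is highest when the SOC stays near its reference and instantaneous fuel use is low."""
    r_soc = math.exp(-((soc - soc_ref) ** 2) / (2 * SIGMA_SOC ** 2))
    r_fuel = math.exp(-((fuel_rate / FUEL_SCALE) ** 2) / (2 * SIGMA_FUEL ** 2))
    return W_SOC * r_soc + W_FUEL * r_fuel
```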
In addition, Pang S Z et al. noted that improved DRL-based algorithms, such as hierarchical reinforcement learning, safe reinforcement learning, multi-agent reinforcement learning, and meta-reinforcement learning, have also appeared in energy management for aeronautical and vehicle engineering, but these were not discussed in depth [
102].
Reinforcement learning shows strong adaptive optimization capability in hybrid energy management; its core value lies in responding to dynamic environments in real time without requiring accurate condition prediction. End-to-end decision-making reduces the burden of manual rule design, and multi-objective balancing weighs economy, battery life, and emissions against one another. In the future, as edge computing and transfer learning mature, DRL is expected to become the core decision-making brain of intelligent hybrid systems, driving the evolution of hybrid systems toward full-scenario intelligence.
For a comparative analysis of rule-based control strategies, DRL-based intelligent control strategies, and multi-objective optimization control strategies in HEV EMSs, Zhou Jinping et al. proposed a heuristic control strategy for parallel hybrid electric vehicles, aiming to optimize fuel economy and battery state-of-charge stability [
103]. Zhang Song et al., targeting dual-planetary-row hybrid electric buses, applied DRL methods such as DDQN and the twin delayed deep deterministic policy gradient (TD3) to optimize fuel economy and SOC balance under complex operating conditions [
104]. Miao Dongxiao et al. proposed a heuristic control strategy for ship series-hybrid power systems based on non-dominated sorting genetic algorithm II (NSGA-II) optimization of logic thresholds, aiming to reduce fuel consumption and carbon emissions by dynamically adjusting the power thresholds according to the SOC and engine speed. The approach builds on rule-based heuristic control, combining a threshold-changing mechanism with a load-following method: the power thresholds are adjusted dynamically according to the SOC and engine speed to achieve coordinated operation of the engine and motor. Rule-based control strategies are simple and computationally light but rely heavily on expert experience. The DRL control strategies use the discrete DDQN and the continuous TD3 algorithms to address the limitations of the conventional approaches: DDQN suits discrete action spaces and reduces value overestimation through its double Q-networks, while TD3 handles continuous action spaces and improves stability through delayed policy updates and clipped double-Q learning. These algorithms learn the optimal control logic autonomously, with enhanced adaptability [
105]. The fuel consumption comparison results under the three strategies are shown in
Table 14.
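The TD3 mechanisms mentioned above, delayed policy updates combined with clipped double-Q learning and target-policy smoothing, are largely captured by how the learning target is computed. The sketch below assumes actor/critic modules like those shown earlier for DDPG; the noise parameters are assumed values.

```python
import torch

GAMMA = 0.99
POLICY_NOISE, NOISE_CLIP = 0.2, 0.5   # target-policy smoothing parameters (assumed values)

def td3_target(reward, next_state, done, actor_t, critic1_t, critic2_t, max_action=1.0):
    """TD3 learning target: perturb the target action with clipped noise, then take the
    minimum of two target critics (clipped double-Q) to curb value overestimation."""
    with torch.no_grad():
        a2 = actor_t(next_state)
        noise = (torch.randn_like(a2) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        a2 = (a2 + noise).clamp(-max_action, max_action)
        q_min = torch.min(critic1_t(next_state, a2), critic2_t(next_state, a2)).squeeze(-1)
        return reward + GAMMA * (1.0 - done) * q_min
```

The “delay” in TD3 refers to updating the actor and the target networks less frequently than the critics, which further stabilizes training.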
In the multi-objective optimization control strategy, the NSGA-II algorithm is used to optimize the logic-threshold rules. The optimized threshold values include the diesel-engine power limits and the SOC boundaries, with the aim of jointly optimizing fuel consumption and carbon emissions. Offline optimization is combined with online application, balancing real-time performance against optimization quality.
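To illustrate the kind of multi-objective screening NSGA-II performs over logic thresholds, the toy sketch below evaluates candidate (diesel power limit, SOC lower bound, SOC upper bound) triples against two objectives and keeps the non-dominated set; the surrogate objective model is an invented stand-in for the ship simulation, and a full NSGA-II run would additionally apply crowding-distance sorting, crossover, and mutation.

```python
import random

def evaluate(thresholds):
    """Return (fuel, co2) for one candidate; a toy surrogate standing in for the ship model."""
    p_limit, soc_lo, soc_hi = thresholds
    window_penalty = 5.0 * abs((soc_hi - soc_lo) - 0.3)
    fuel = 22.0 - 0.01 * p_limit + window_penalty          # higher diesel limit: less generator cycling
    co2 = 55.0 + 0.03 * p_limit + 2.0 * window_penalty     # but more diesel running: more CO2
    return fuel, co2

def dominates(f_a, f_b):
    """True if candidate a is no worse in both objectives and differs from b (Pareto dominance)."""
    return all(x <= y for x, y in zip(f_a, f_b)) and f_a != f_b

def non_dominated(candidates):
    """Keep only Pareto-optimal candidates (the front an NSGA-II run converges toward)."""
    scores = [evaluate(c) for c in candidates]
    return [c for c, f in zip(candidates, scores)
            if not any(dominates(g, f) for g in scores)]

population = [(random.uniform(200, 400), random.uniform(0.3, 0.5), random.uniform(0.6, 0.8))
              for _ in range(20)]
pareto_front = non_dominated(population)
```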
In terms of system performance, the load-following threshold-changing strategy (LTS) improves fuel economy by 3.1~10.4% compared with the EACS and by 2.5~5.7% compared with the ECMS. The DRL strategy achieves a fuel economy equivalent to 93% of that of the DP strategy (around 19.5 L/100 km). Compared with a traditional internal-combustion system, the NSGA-II-optimized logic-threshold rules reduce fuel consumption by 11.09%, and they provide a further reduction in emissions of approximately 1.18% relative to the unoptimized hybrid powertrain. In terms of SOC stability, the LTS strategy keeps the SOC above 60%, the DRL strategy uses a charge-sustaining mode to achieve SOC balance, and the NSGA-II-optimized logic-threshold rules raise the terminal SOC by more than 1.81% compared with the unoptimized values.
In terms of carbon emissions, the NSGA-II-optimized logic-threshold strategy reduces carbon emissions by 4.32% compared with a traditional power system. Each strategy therefore has its own characteristics. The LTS strategy offers simple rules, a small computational load, and easy practical application, but it relies on expert experience and has limited global optimization capability. The DRL strategy offers self-learning, strong adaptability, and the ability to handle continuous control problems, but training is complex and demands large amounts of data and computing resources. The NSGA-II optimization strategy handles multiple objectives, accounting for both fuel consumption and carbon emissions, and can be applied online after offline optimization, but the choice of optimization variables depends on experience and its real-time performance is slightly inferior to that of pure rule-based strategies.
Comparative analysis suggests that the LTS strategy suits scenarios with high real-time requirements and limited computing resources, such as conventional hybrid vehicles; the DRL strategy suits systems such as hybrid buses operating on fixed routes with stable driving patterns; and the NSGA-II strategy balances multi-objective optimization against real-time performance, making it suitable for large-scale hybrid systems such as ships. The analysis also points to improvement directions: the LTS strategy can be combined with optimization algorithms such as genetic algorithms to adjust the thresholds dynamically and improve adaptability across working conditions; the DRL strategy can incorporate transfer learning or online learning to improve generalization and real-time performance; and the NSGA-II strategy can explore online optimization methods to further improve real-time capability. In short, rule-based methods are simple and efficient and suit traditional applications, learning-based methods are highly adaptable and suit complex scenarios, and optimization-based methods handle multiple objectives and suit large-scale systems. Future research can explore the fusion of these methods, such as using DRL to optimize the thresholds of rule-based policies or using NSGA-II to optimize the reward function of DRL, so as to balance efficiency and performance.
Based on the above analysis, the main conclusions of the energy management strategy for hybrid power systems based on intelligent control are as follows:
- (1)
Algorithms such as fuzzy logic and neural networks are used to handle the nonlinear characteristics of the system. Deep reinforcement learning enhances the ability of traditional reinforcement learning to handle complex tasks by introducing neural networks, and ADP may provide a more efficient optimization route for this type of deep reinforcement learning.
- (2)
Energy management strategies for hybrid power systems based on reinforcement learning are optimized in a data-driven manner; they are highly adaptable but require large amounts of training data, and the model’s generalization ability depends on the algorithm design.
- (3)
Adaptive dynamic programming can provide a theoretical framework or acceleration algorithms for reinforcement learning, especially for solving complex control problems. A-ECMS introduces SOC feedback regulation on top of ECMS to dynamically optimize the equivalence factor and balance energy consumption against battery life; its disadvantage is that it still relies on accurate subsystem models and its parameter calibration is complex.
- (4)
Typical DRL algorithms such as DQN, PPO, and DDPG are widely used in hybrid power systems. The DQN algorithm is trained on historical operating data to achieve a real-time mapping from the required power and SOC to the control action, while DDPG suits continuous actions such as the power-split ratio, optimizing engine start–stop frequency and motor torque distribution.
- (5)
With the development of artificial intelligence and networking, the control strategy of hybrid power systems is becoming more “intelligent” and closer to human learning.
- (6)
The control strategy continues to evolve towards deep learning, allowing the system to learn optimal control rules from massive historical data and even predict future operating conditions, achieving more accurate and adaptive energy management.
The intelligent control strategy drives the hybrid power system to shift from “parameter competition” to an “energy efficiency revolution”, with real-time capability, predictiveness, and reliability as its core values. The main evolutionary logic can be summarized as: rules → static optimization → dynamic learning → autonomy. In the future, with improvements in AI chip computing power and the popularization of V2X technology, global optimization based on intelligent control will become mainstream, further narrowing the economic gap with traditional fuel vehicles and providing key technical support for carbon-neutral transportation. Learning-based energy management is shifting from optimizing a single vehicle to promoting transportation–energy collaboration, and from algorithm innovation to system reconstruction. Its development will profoundly reshape the application forms and value ecology of hybrid power technology. With breakthroughs in basic theory and the maturation of enabling technologies, this field is expected to see rapid innovation in the coming years, providing core support for low-carbon transportation.