1. Introduction
As the global energy crisis and environmental challenges intensify, problems such as global warming, air pollution, and energy depletion have become increasingly severe, compelling society to pursue sustainable and clean energy alternatives [1]. In this context, Fuel Cell Electric Vehicles (FCEVs) have emerged as a promising pathway toward low-carbon transportation due to their zero-emission characteristics, high energy conversion efficiency, and renewable potential [2]. However, their practical implementation still faces several engineering challenges, including the relatively slow dynamic response of fuel cells [3], degradation under frequent load fluctuations, and limitations in on-board hydrogen storage efficiency [4]. Consequently, integrating fuel cells with auxiliary energy sources such as lithium-ion batteries or supercapacitors to form hybrid architectures has become a mainstream strategy to enhance power performance and system reliability in FCEVs [5].
The hybrid architecture of FCEVs, however, only lays the foundation for addressing these issues. The essential challenge lies in dynamically coordinating the power distribution among multiple energy sources—meeting the vehicle’s power demands while simultaneously optimizing hydrogen consumption and extending the lifespan of key components. This has become a critical barrier to the large-scale commercialization of FCEVs [6]. To overcome this challenge, numerous institutions and research teams have conducted extensive studies on Energy Management Strategies (EMSs), which can be broadly classified into three categories: rule-based, optimization-based, and learning-based approaches [7].
Conventional energy management strategies are generally classified into rule-based and optimization-based methods [8]. Rule-based approaches, including fuzzy logic control [9] and deterministic logic schemes such as power-following strategies [10], rely on predefined “if–then” rules derived from prior knowledge and engineering experience. They offer strong real-time performance owing to their simplicity but depend heavily on expert-defined logic, which limits their adaptability under complex driving conditions and can induce fuel cell load fluctuations that accelerate system degradation [11,12]. Optimization-based methods encompass global optimization techniques such as Dynamic Programming (DP) [13] and Pontryagin’s Minimum Principle (PMP) [14], as well as instantaneous optimization strategies such as the Equivalent Consumption Minimization Strategy (ECMS) [15] and Model Predictive Control (MPC) [16]. These methods formulate energy management as a constrained optimization problem by constructing accurate system models and defining appropriate cost functions. While global optimization can yield theoretically optimal solutions, its reliance on complete prior driving-cycle information and its high computational burden restrict its feasibility for real-time application [17].
Reinforcement learning (RL) provides a model-free framework well suited to the complex, nonlinear, uncertain, and multi-objective characteristics of fuel cell hybrid systems and is commonly classified into value-based, policy-based, and actor–critic methods [18]. Value-based approaches, such as Q-learning and Deep Q-Networks (DQN), effectively address discrete decision-making and high-dimensional state spaces. Wang et al. [19] proposed an improved DQN-based energy management strategy that incorporates a data-driven battery lifetime map to characterize nonlinear aging and employs a parameterized DQN to handle hybrid discrete–continuous actions, achieving 99.5% of dynamic-programming optimality and reducing operating costs by 3.1% under unknown conditions; Wang et al. [20] further enhanced stability and convergence speed through a Dueling DQN architecture. Policy-based methods, including Proximal Policy Optimization (PPO) [21], directly optimize policy parameters using clipped surrogate objectives to improve training stability. Li et al. [22] integrated PPO with dynamic-programming prior knowledge and parallel computation, improving convergence speed, hydrogen consumption, and fuel cell degradation relative to DQN- and DDPG-based EMSs, and Lv et al. [23] developed an improved PPO-based hierarchical energy management strategy with adaptive driving modes and offline-optimized battery SOC references, achieving a 3.11% improvement in economic performance. Actor–critic methods combine value estimation and policy learning to balance exploration and exploitation: Deep Deterministic Policy Gradient (DDPG) improves stability in continuous control via target networks and delayed updates [24], while Soft Actor–Critic (SAC) further enhances exploration efficiency through maximum-entropy learning [25]. Lu et al. [26] proposed an improved DDPG-based strategy incorporating adaptive fuzzy filtering for frequency-decoupled power allocation and battery degradation modeling, achieving a 2.02% increase in average fuel cell efficiency and a 14.4% reduction in battery performance degradation compared with conventional DDPG.
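For context, the maximum-entropy objective that distinguishes SAC from standard actor–critic methods can be written as follows; this is the general formulation from the SAC literature, not an equation reproduced from the cited works:

```latex
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t)\sim \rho_\pi}
\Big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```

where $\alpha$ is the temperature coefficient weighting the policy entropy $\mathcal{H}$ against the reward; a larger $\alpha$ encourages broader exploration of the power-split action space.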
Despite these advances, significant gaps remain in comparative analyses of RL-based EMSs. Few studies systematically compare RL algorithms across different architectures or investigate the impact of hyperparameters and reward weights on system performance. Wang et al. [27] designed and compared four types of EMS while considering parameter effects on optimization performance, but their work focused on gasoline-electric hybrid systems. Xu et al. [28] designed and compared three types of RL-based EMS algorithms but did not discuss the effects of algorithm parameters and reward functions on optimization performance. Comprehensive research on reward design and hyperparameter tuning is therefore essential: it deepens the understanding of RL’s potential in the EMS domain and provides practical guidance for reward shaping, algorithm selection, and parameter tuning. Such work holds both theoretical and practical significance, particularly for optimizing RL-based EMSs under complex real-world driving conditions, and will advance the efficiency, reliability, and service life of fuel cell hybrid vehicles.
To address the aforementioned challenges, this study first establishes a dynamic simulation model of a fuel cell hybrid electric vehicle (FCHEV). Building on this model, a degradation-aware energy management strategy based on the Soft Actor–Critic (SAC) algorithm is developed. By exploiting SAC’s entropy-regularized learning mechanism, the proposed strategy explicitly balances hydrogen economy, power smoothness, and fuel cell degradation during long-term operation. The influences of key algorithmic hyperparameters and reward-weight configurations on training stability and control performance are systematically investigated. Furthermore, the proposed method is benchmarked against power-following, DQN-based, and PPO-based strategies in terms of tuning complexity, training efficiency, performance under training conditions, and generalization capability under unseen driving cycles. Through these analyses, this work provides a comprehensive assessment of the suitability of different reinforcement learning paradigms for energy management in fuel cell hybrid vehicles. In summary, the main contributions of this work are as follows:
A degradation-aware SAC-based energy management strategy is proposed, in which fuel cell durability and hydrogen economy are jointly incorporated into the reward function. Leveraging SAC’s maximum-entropy formulation, the strategy promotes operation within high-efficiency regions while mitigating degradation-inducing power fluctuations.
A systematic sensitivity analysis of SAC hyperparameters and reward design is conducted, revealing their respective roles in convergence behavior, training stability, and energy management performance. The results provide practical guidelines for hyperparameter tuning and reward shaping in RL-based EMS design.
A comprehensive comparative evaluation of power-following and RL-based EMSs (SAC, DQN, and PPO) is performed with respect to durability, economic performance, adaptability to unseen driving cycles, tuning complexity, and training efficiency. The comparative results clarify the strengths, limitations, and applicable scenarios of different RL algorithm categories for fuel cell hybrid vehicle energy management.
The remainder of this paper is organized as follows: Section 2 presents the powertrain architecture of the fuel cell hybrid vehicle and the modeling of its key components. Section 3 introduces the proposed SAC-based energy management strategy that considers fuel cell degradation and investigates the effects of hyperparameters and reward-function weightings on convergence behavior and learning performance. Section 4 compares the training outcomes and optimization performance of the three RL algorithms. Finally, Section 5 summarizes and discusses the entire study and presents concluding remarks.
4. Results and Discussion
4.1. Training Performance Comparison
As shown in Figure 12 and Table 8, all three algorithms exhibit similar learning trends, with rewards initially fluctuating at low values before stabilizing. Although each algorithm ultimately converges, SAC achieves the most stable post-convergence behavior, followed by PPO, while DQN shows larger fluctuations. In terms of convergence speed, PPO converges the fastest, DQN reaches stability more slowly, and SAC converges the slowest due to its dual-network architecture and required warm-up phase. These structural differences also produce notable disparities in training efficiency: DQN’s value-based updates suffer from low data utilization, PPO’s on-policy updates enable efficient and lightweight computation, and SAC’s hybrid actor–critic formulation balances exploration and value estimation but incurs higher computational overhead and increased parameter-tuning complexity.
4.2. Optimization Performance Comparison
As a baseline for comparison, a rule-based power-following energy management strategy is adopted. In this strategy, the fuel cell output power primarily follows the instantaneous power demand of the vehicle, while the battery compensates for the power difference to maintain system balance. The control logic does not involve optimization objectives related to efficiency or degradation mitigation and relies on predefined rules only. Under the same driving cycle and initial conditions, this baseline strategy is used to evaluate the reference degradation behavior of the fuel cell system, providing a fair benchmark for assessing the effectiveness of the proposed energy management strategy.
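The power-following logic described above can be sketched in a few lines. This is a minimal illustrative sketch: all parameter names, limits, and the SOC-bias heuristic are assumptions for exposition, not values or rules taken from this study.

```python
def power_following_split(p_demand_kw, soc,
                          fc_min_kw=5.0, fc_max_kw=60.0,
                          fc_rate_kw_per_s=5.0, p_fc_prev_kw=0.0,
                          soc_low=0.4, soc_high=0.8, charge_bias_kw=8.0):
    """Return (fuel cell power, battery power) in kW for one time step.

    The fuel cell tracks the instantaneous power demand, optionally biased
    to recharge (or relieve) the battery when SOC drifts out of its window;
    the battery covers whatever difference remains (positive = discharge).
    All limits here are illustrative, not the paper's actual parameters.
    """
    # Fuel cell target follows demand, with a simple SOC-maintenance bias.
    target = p_demand_kw
    if soc < soc_low:
        target += charge_bias_kw      # produce extra power to recharge
    elif soc > soc_high:
        target -= charge_bias_kw      # back off and let the battery discharge

    # Clamp to the stack's admissible power range.
    target = min(max(target, fc_min_kw), fc_max_kw)

    # Enforce a slew-rate limit relative to the previous step's output.
    target = min(max(target, p_fc_prev_kw - fc_rate_kw_per_s),
                 p_fc_prev_kw + fc_rate_kw_per_s)

    # Battery compensates the remaining power difference.
    p_batt = p_demand_kw - target
    return target, p_batt
```

Note that no efficiency- or degradation-related objective appears anywhere in this rule set, which is precisely why it serves as the reference for degradation behavior.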
As shown in Figure 13 and Figure 14 and summarized in Table 9, the SAC-based strategy exhibits distinct power allocation patterns compared to the DQN, PPO, and power-following strategies. Specifically, SAC allocates a larger proportion of operating points to the medium-to-high efficiency range of the fuel cell, directly enhancing hydrogen utilization efficiency. In contrast, the power trajectories generated by the DQN and PPO strategies are smoother with longer steady-state operation times, indicating more conservative power tracking behavior. Quantitative analysis reveals that during the CLTC cycle, the SAC strategy reduces hydrogen consumption by 11.15 g/100 km and 5.54 g/100 km compared to DQN and PPO, respectively. Furthermore, all three reinforcement learning-based strategies achieve significantly lower hydrogen consumption per 100 km than the power-following strategy.
Despite introducing more frequent load changes, the SAC strategy did not induce accelerated degradation. On the contrary, it reduced degradation rates by 73.86% and 62.35% compared to DQN and PPO, respectively, and by 42.84% compared to the power-following strategy. This indicates that degradation depends not only on load smoothness but is also significantly influenced by the fuel cell’s operating efficiency zone. By prioritizing operation in high-efficiency zones, the SAC-based strategy effectively suppresses degradation mechanisms even under enhanced power dynamics, demonstrating its capability to balance efficiency and durability.
Compared with DQN and PPO, SAC maintains a higher policy entropy during training due to its maximum-entropy objective, enabling more efficient exploration of the state–action space. This enhanced exploration helps the agent avoid premature convergence to degradation-intensive control patterns, such as frequent load fluctuations or prolonged low-efficiency operation. As training proceeds, the adaptive entropy mechanism gradually reduces stochasticity, allowing the policy to converge to a stable power allocation strategy. This balanced exploration–exploitation process explains why SAC achieves superior degradation suppression while maintaining stable convergence, whereas DQN and PPO tend to converge faster but are more prone to locally optimal solutions with weaker durability awareness.
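The adaptive entropy mechanism mentioned above corresponds, in the standard SAC formulation (a general sketch, not a detail reproduced from this study’s implementation), to minimizing a temperature loss alongside the actor and critic losses:

```latex
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}
\big[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha\, \bar{\mathcal{H}} \big]
```

where $\bar{\mathcal{H}}$ is a preset target entropy; once the policy’s entropy approaches $\bar{\mathcal{H}}$, $\alpha$ shrinks and the stochasticity of the power-allocation policy is gradually reduced, producing the exploration-then-exploitation pattern described above.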
4.3. Transferability and Robustness Performance
The agent trained on the CLTC cycle was evaluated on the unseen WLTC cycle with an initial SOC of 0.6 to assess generalization capability. As shown in Figure 15 and Figure 16 and summarized in Table 10, all three reinforcement learning-based strategies exhibit power allocation patterns broadly consistent with those observed during training, indicating stable policy execution under distribution shift. In particular, the SAC-based strategy continues to allocate a higher proportion of operating points within the fuel cell’s high-efficiency region, whereas DQN and PPO maintain smoother power trajectories that reflect conservative load-following behavior.
From a performance perspective, SAC preserves its advantage in both economy and durability under the unseen WLTC cycle, achieving reductions in hydrogen consumption of 28.39 g, 6.39 g, and 193.18 g per 100 km and degradation reductions of 80.96%, 67.72%, and 59.43% relative to DQN, PPO, and the power-following strategy, respectively. These results suggest that the degradation-aware reward formulation enables SAC to generalize efficiency-oriented control principles beyond the training cycle. However, the similarity between the power distribution patterns in the training and testing cycles also indicates that the learned policy largely reproduces previously acquired behaviors, which may constrain its adaptability to cycle-specific dynamics. This observation highlights a trade-off between policy stability and adaptive responsiveness, suggesting that broader training scenarios or online adaptation mechanisms may be required to further enhance robustness under more diverse driving conditions.