1. Introduction
In recent years, environmental pollution and global warming have become increasingly severe. The transportation sector is widely recognized as one of the major sources of energy consumption and greenhouse gas emissions [1]. To address the energy crisis and improve environmental quality, the development of green and low-carbon transportation has become an important strategic direction in many countries. In the process of transport decarbonization, battery electric vehicles (BEVs) and hydrogen fuel cell vehicles (HFCVs) are regarded as two key technological pathways [2]. In parallel with the rapid development of BEVs and HFCVs, hydrogen-related electrochemical technologies have also made steady progress in recent years, which further supports the development of integrated electricity–hydrogen energy infrastructures [3,4]. Compared with conventional fossil-fuel vehicles (FFVs), BEVs and HFCVs exhibit significant advantages in terms of higher energy efficiency, lower carbon emissions, and reduced dependence on fossil fuels [5,6]. With the rapid development of new energy vehicles and the sustained attention from both consumers and manufacturers [7], the demand for supporting energy supply infrastructure has become increasingly urgent. Therefore, electric–hydrogen integrated stations (EHISs), which integrate photovoltaic power generation, hydrogen production, hydrogen storage, and battery charging/hydrogen refueling functions, have gradually emerged as a promising form of transportation energy infrastructure [8]. Achieving efficient, economical, and low-carbon operation of EHISs under uncertain conditions has thus become a key issue that needs to be addressed.
Extensive studies have been conducted on the operational optimization of charging stations and hydrogen refueling stations (HRSs). For charging stations, some studies have adopted methods such as chance-constrained optimization to reduce operating costs while considering the uncertainties of photovoltaic (PV) generation and energy storage systems (ESSs) [9]. For HRSs, existing research has optimized on-site electrolytic hydrogen production and hydrogen storage configurations to reduce hydrogen costs and improve economic performance [10]. In addition, some studies have incorporated HRSs into electricity and ancillary service market frameworks to improve revenues and reduce hydrogen costs [11], while others have proposed multi-layer coordinated strategies to address the coupled dispatch problem of electricity–heat–hydrogen integrated energy systems [12]. However, these methods still fall short of more complex engineering requirements. Most existing studies focus on a single economic objective, making it difficult to simultaneously characterize and balance station profit, user demand satisfaction, and carbon emission constraints. Moreover, independent optimization of individual subsystems cannot fully exploit the synergistic benefits brought by electricity–hydrogen coupling. Therefore, coordinated optimal scheduling of electric–hydrogen integrated stations has become an urgent issue to be addressed.
Although the optimal scheduling of power systems and hydrogen systems is relatively mature within each domain, the integrated management of electricity–hydrogen coupled systems is still at a developing stage, and its optimal scheduling must address complex issues such as electricity–hydrogen coupling mechanisms [13]. Existing studies have proposed EHIS scheduling strategies capable of simultaneously satisfying the demands of battery electric vehicles and fuel cell vehicles, while handling uncertainty through photovoltaic (PV) power supply and stochastic modeling methods [14]. In addition, some studies have developed operation schemes from the perspective of economic performance to reduce overall system costs [15]. In recent years, deep reinforcement learning (DRL) has also shown promising potential in energy management and sequential control problems under uncertainty, because it can learn adaptive decision policies directly from environment interaction without relying on explicit future scenario enumeration. Recent studies have extended DRL to hydrogen-coupled or multi-energy scheduling problems, including safe DRL-assisted energy management for active distribution networks with hydrogen fueling stations, data-driven scheduling of integrated electricity–heat–gas–hydrogen systems, and hybrid DRL-based scheduling frameworks with safety guarantees [16,17,18]. However, for practical station-level operation, there is still a need for a deployable methodological framework that can jointly balance profit, user demand satisfaction, and carbon emission constraints under refined engineering constraints and sequential decision-making under uncertainty.
To achieve the above refined scheduling objective, improving only the upper-level optimization algorithm is not sufficient. The operational optimization of EHISs also relies heavily on the modeling accuracy of the core hydrogen production equipment, namely the proton exchange membrane (PEM) electrolyzer, and modeling errors may directly affect the credibility of scheduling and operational decisions [19]. Existing studies have proposed various mechanistic modeling methods for PEM electrolyzers, such as representing the electrolyzer voltage as the sum of the open-circuit voltage and different overpotentials [20], or developing thermo-electrochemical coupled models to characterize stack behavior [21]. However, mechanistic models usually depend on a large number of parameter assumptions and often exhibit semi-empirical characteristics, which makes it difficult for them to accurately capture the complex nonlinearities and multivariable coupling relationships under fluctuating operating conditions, thereby limiting their adaptability in practical prediction and control [22,23]. In addition, some studies have further investigated the influence of current ripple on PEM electrolyzer performance through detailed modeling and experimental analysis, which highlights the importance of accurate model characterization under complex operating conditions [24]. With the improvement of data acquisition and computing capabilities, data-driven methods have shown considerable potential in electrolyzer modeling [25,26]. Among these methods, XGBoost has attracted increasing attention because of its strong nonlinear fitting capability, built-in regularization, and good generalization performance on structured tabular data. In recent hydrogen-related studies, XGBoost has also been applied to PEM water electrolysis and hydrogen production prediction problems, showing competitive predictive accuracy under complex operating conditions [27,28]. Such machine-learning-based models can be trained on historical operating data to reduce dependence on mechanistic parameters [29]. Nevertheless, data-driven models may still suffer from data scarcity during the initial training stage, and their robustness and generalization capability need further improvement [30]. Meanwhile, some studies have pointed out that, compared with fuel cell research, data-driven modeling studies for electrolyzers are still relatively limited [31].
Considering the above factors, this study proposes a deep reinforcement learning-based scheduling method for an electric–hydrogen integrated station using a data-driven electrolyzer model, with the aim of improving scheduling accuracy as well as the economic and low-carbon operational performance of the station under uncertain operating conditions. To address the difficulty of traditional physics-based models in accurately characterizing the complex nonlinear hydrogen production behavior of proton exchange membrane (PEM) electrolyzers, as well as the limitations of existing scheduling methods in refined modeling and multi-objective coordination, a coordinated scheduling framework integrating data-driven modeling and deep reinforcement learning is developed. The main contributions of this paper are summarized as follows:
A learning-enhanced data-driven model for PEM electrolyzers is developed. Compared with traditional physics-based models, the proposed model provides greater flexibility and higher prediction accuracy under complex nonlinear operating conditions, enabling a more accurate characterization of hydrogen production behavior. Moreover, it can extract underlying patterns from limited data, thereby alleviating data scarcity issues to some extent and improving the practicality and generalization capability of the model.
A more refined scheduling model for the electric–hydrogen integrated station is established. The proposed model not only considers key factors such as equipment start-up and shut-down costs, hydrogen storage safety constraints, user demand satisfaction, and carbon emission limits, but also captures the sequential decision-making characteristics of the electricity–hydrogen coupled operation process at the station level. As a result, it can more realistically and comprehensively reflect the scheduling requirements of practical operating scenarios.
An improved DQN-based solution method integrating Lagrangian relaxation and the template policy-based reinforcement learning method is proposed. By transforming complex constraints into penalty terms, the proposed method enables effective handling of operational constraints. In addition, by reusing historical policy parameters from similar operating scenarios, it improves training efficiency, convergence performance, and generalization capability in similar scenarios.
The remaining sections of this paper are structured as follows. Section 2 introduces the overall structure and main component models of the EHIS, and briefly reviews the basic principles of traditional reinforcement learning. Section 3 presents the EHIS optimization framework based on the improved DQN algorithm. Section 4 reports the experimental results and evaluates the effectiveness of the proposed method through comparative analysis under different scenarios. Finally, Section 5 concludes the paper and discusses future research directions.
3. EHIS Optimization Framework Based on the Improved DQN Algorithm
This section develops the EHIS optimization framework based on the improved DQN algorithm. To enhance the accuracy of hydrogen production modeling, a data-driven PEM electrolyzer model is first introduced. Subsequently, the EHIS scheduling problem is formulated by jointly considering station profit, user demand satisfaction, and carbon emission constraints. Based on this formulation, an improved DQN-based solution method is then proposed to achieve efficient and reliable scheduling optimization.
3.1. Data-Driven PEM Electrolyzer Model
The core function of the electrolyzer is to convert electrical energy into hydrogen energy, and there exists a complex nonlinear mapping relationship between the hydrogen production rate and the input operating conditions [34]. Owing to the limited accuracy of conventional mechanistic models, it is difficult to accurately characterize the strong nonlinear behavior of the electrolyzer under complex fluctuating operating conditions, which may lead to prediction bias. To address this issue, this study proposes a data-driven modeling method for the PEM electrolyzer. The developed model serves as the interaction environment for subsequent reinforcement learning training and enables fast and accurate prediction of the hydrogen production rate through machine learning. As illustrated in Figure 2, the proposed deep XGBoost model adopts a serial hybrid architecture, in which the XGBoost module first extracts tree-based intermediate representations from the original 11-dimensional operating variables, and these representations are then fed into a deep neural network for further nonlinear refinement and final hydrogen production prediction.
The input data are given in Equation (18):
From left to right, the variables in Equation (18) represent the electrolyzer input power, membrane thickness, contact area of the membrane electrode assembly, operating temperature, the partial pressures of hydrogen, oxygen, and water vapor, the anodic and cathodic charge transfer coefficients, and the anodic and cathodic exchange current densities.
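For concreteness, the 11 operating variables of Equation (18) can be arranged into a fixed-order feature vector as in the minimal sketch below; the identifier names are illustrative stand-ins for the paper's symbols, not its actual notation.

```python
# Illustrative ordering of the 11-dimensional electrolyzer input of Equation (18).
# All names are hypothetical placeholders for the paper's mathematical symbols.
ELECTROLYZER_FEATURES = [
    "input_power",            # electrolyzer input power
    "membrane_thickness",     # PEM membrane thickness
    "mea_contact_area",       # contact area of the membrane electrode assembly
    "operating_temperature",  # stack operating temperature
    "p_h2",                   # hydrogen partial pressure
    "p_o2",                   # oxygen partial pressure
    "p_h2o",                  # water vapor partial pressure
    "alpha_anode",            # anodic charge transfer coefficient
    "alpha_cathode",          # cathodic charge transfer coefficient
    "j0_anode",               # anodic exchange current density
    "j0_cathode",             # cathodic exchange current density
]

def make_feature_vector(sample: dict) -> list:
    """Arrange a raw measurement dict into the fixed 11-dimensional order."""
    return [sample[name] for name in ELECTROLYZER_FEATURES]
```

A fixed feature ordering of this kind is what allows the same samples to feed both the XGBoost module and the comparison models described later.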
To establish the nonlinear mapping between the input features and the hydrogen production rate, this study constructs a deep XGBoost-based prediction model. The model takes the above 11-dimensional feature vector as the input and the hydrogen production rate as the output, as expressed in Equation (19).
The XGBoost regression model can be expressed as an additive ensemble of regression trees, as shown in Equation (20). In this formulation, the final prediction result is obtained by summing the outputs of multiple regression trees. During training, XGBoost iteratively minimizes the objective function through the gradient boosting strategy, such that each newly added tree is used to fit the residual error generated in the previous iteration, thereby continuously improving the overall prediction performance. The corresponding objective function is given in Equation (21), which consists of a data fitting term and a regularization term. The fitting term is used to measure the prediction error, while the regularization term is introduced to control the model complexity, suppress overfitting, and improve generalization capability.
To further improve the model performance, this study introduces a deep neural network on top of the XGB regression tree model. Specifically, the trained XGBoost module is first used to transform the original input vector into a tree-based intermediate representation. This representation is then transferred to the input layer of the DNN, which performs multi-layer nonlinear transformation to further capture higher-order feature interactions and outputs the final hydrogen production rate. In this way, the proposed model forms a serial hybrid architecture, where XGBoost is responsible for feature extraction and the DNN performs nonlinear refinement, as shown in Figure 2. The structure of the model is expressed in Equation (22), in which the deep neural network applies a nonlinear transformation, parameterized by its learnable weights, to the tree-based input features. Considering the physical trend of the hydrogen production rate, monotonicity constraints are introduced during the training of the deep XGB model to enhance the physical consistency of the prediction results and reduce fluctuation risks. In addition, an independently and randomly split test set is adopted to evaluate the generalization performance of the model and avoid overfitting. The evaluation metrics are RMSE and MAE, defined over the test-set samples in Equations (23) and (24), respectively. After the training process is completed, the deep XGB model is serialized and stored so that it can be efficiently called in the subsequent system-level simulation and optimization process, thereby enabling an efficient replacement of the mechanistic model.
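To make the serial hybrid idea concrete without external dependencies, the sketch below stands in for XGBoost with hand-rolled boosted depth-1 stumps, and for the DNN refinement stage with a simple per-leaf-pattern lookup; the RMSE and MAE metrics of Equations (23) and (24) are included as written. This is a toy illustration under these stated simplifications, not the paper's implementation.

```python
import math

def boosted_stumps(xs, ys, thresholds, lr=0.5):
    """Gradient boosting on squared loss with depth-1 trees (XGBoost stand-in):
    each new stump fits the residual left by the current ensemble."""
    pred = [sum(ys) / len(ys)] * len(xs)   # initial constant prediction
    stumps = []
    for t in thresholds:
        resid = [y - p for y, p in zip(ys, pred)]
        left = [r for x, r in zip(xs, resid) if x <= t]
        right = [r for x, r in zip(xs, resid) if x > t]
        lv = lr * (sum(left) / len(left)) if left else 0.0
        rv = lr * (sum(right) / len(right)) if right else 0.0
        stumps.append((t, lv, rv))
        pred = [p + (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return stumps

def leaf_pattern(stumps, x):
    """Tree-based intermediate representation: which leaf x falls into per tree."""
    return tuple(int(x > t) for t, _, _ in stumps)

def fit_refinement(stumps, xs, ys):
    """DNN stand-in: map each leaf pattern to the mean target of its samples."""
    table = {}
    for x, y in zip(xs, ys):
        table.setdefault(leaf_pattern(stumps, x), []).append(y)
    return {k: sum(v) / len(v) for k, v in table.items()}

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data: hydrogen production rate grows with input power (monotone trend).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.9, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]
stumps = boosted_stumps(xs, ys, thresholds=[2.5, 4.5, 6.5])
table = fit_refinement(stumps, xs, ys)
preds = [table[leaf_pattern(stumps, x)] for x in xs]
```

The two-stage handoff (boosted trees produce a discrete representation, a second learner refines it) mirrors Figure 2's architecture; in the actual model the refinement stage is a trained neural network rather than a lookup table.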
3.2. EHIS Optimization Model
To improve the operational efficiency of the electric–hydrogen integrated station (EHIS) under multi-energy coordination and multi-load response conditions, it is necessary to construct a well-designed optimization model to support effective scheduling and control of system operation. The composition of the EHIS has been introduced in the previous section. By selling hydrogen to hydrogen fuel cell vehicles (HFCVs) and electricity to battery electric vehicles (BEVs), the station aims to satisfy diversified energy demands while simultaneously considering user demand satisfaction and carbon emission control, with the overall objective of maximizing operational profit.
3.2.1. Profit Maximization Objective
The profit maximization objective function under scenario s is defined in Equation (25) and consists of four parts. The first part represents the profit from selling electricity to BEVs, and the second the profit from selling hydrogen to HFCVs. The third part represents the start-up costs of the electrolyzer and the fuel cell, each weighted by its respective start-up cost coefficient. The fourth part is a penalty term for violating the safety constraint of the hydrogen storage tank, weighted by a penalty coefficient that depends on the safety factor of the hydrogen storage capacity. The power constraints of the electrolyzer and BEVs are given in Equation (26), the BEV energy demand constraint in Equation (27), and the HFCV hydrogen demand constraint in Equation (28).
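The four-part structure of Equation (25) can be sketched as a one-step profit function; all argument names below are illustrative placeholders, not the paper's symbols, and the linear storage penalty is an assumed form.

```python
def step_profit(e_sold, p_elec, h_sold, p_h2,
                elz_started, fc_started, c_elz_start, c_fc_start,
                soc, soc_safe, k_safety):
    """One-step profit in the spirit of Equation (25): electricity revenue plus
    hydrogen revenue, minus start-up costs, minus a storage-safety penalty.
    Names and the exact penalty form are illustrative assumptions."""
    revenue = e_sold * p_elec + h_sold * p_h2
    startup = int(elz_started) * c_elz_start + int(fc_started) * c_fc_start
    # Penalty applies only when storage exceeds its safe share of capacity.
    violation = max(0.0, soc - soc_safe)
    return revenue - startup - k_safety * violation
```

For example, selling 100 kWh at 1 CNY/kWh and 2 kg of hydrogen at 50 CNY/kg with one electrolyzer start-up (78 CNY) and a 0.1 storage-safety violation weighted at 1000 yields a net profit of about 22 CNY.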
3.2.2. User Demand Response Model
The user demand penalty function under scenario s is given in Equation (33). It comprises a penalty term for unmet BEV demand and a penalty term for unmet HFCV demand, each weighted by its own penalty coefficient, and is evaluated against the total amounts of electricity and hydrogen supplied by the entire system to users.
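The structure of Equation (33) can be sketched as follows; the names are illustrative placeholders and the linear shortfall penalty is an assumed form.

```python
def demand_penalty(e_demand, e_supplied, k_bev, h_demand, h_supplied, k_hfcv):
    """Unmet-demand penalty in the spirit of Equation (33): weighted shortfalls
    of electricity (BEV) and hydrogen (HFCV) supply. Names are illustrative."""
    unmet_e = max(0.0, e_demand - e_supplied)
    unmet_h = max(0.0, h_demand - h_supplied)
    return k_bev * unmet_e + k_hfcv * unmet_h
```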
3.2.3. Carbon Emission Calculation Model
The carbon emission calculation model under scenario s is given in Equation (34) and accounts for the carbon emissions associated with purchasing grey electricity from the grid and grey hydrogen from external sources. The emissions are computed from the grid carbon intensity and the unit carbon emission coefficient of externally purchased grey hydrogen, applied to the amounts of electricity purchased from the grid and hydrogen purchased from external sources, respectively.
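As a minimal sketch of Equation (34) (names illustrative), the accounting is a weighted sum of the two grey-energy purchases:

```python
def carbon_emissions(e_grid, grid_intensity, h_purchased, h2_coeff):
    """Carbon accounting in the spirit of Equation (34): grid electricity times
    its carbon intensity plus purchased grey hydrogen times its unit
    emission coefficient. Names are illustrative placeholders."""
    return e_grid * grid_intensity + h_purchased * h2_coeff
```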
3.2.4. CMDP Formulation
To address the coordinated scheduling problem under an uncertain environment, this study formulates the problem as a constrained Markov decision process (CMDP). At each time step t, the operating condition of the EHIS is represented by a state vector containing the key exogenous and endogenous variables that affect the current scheduling decision: the photovoltaic output, the time-of-use electricity price, the hydrogen price, the grid carbon intensity, the hydrogen storage level, and the real-time demands of BEVs and HFCVs, together with the binary operating states of the electrolyzer and the fuel cell and the index of the current decision stage within the scheduling horizon. The continuous state variables are normalized before being fed into the DQN, whereas the binary device-state variables and the time index are retained in their original forms. In this way, the state representation preserves the key operating information of the station without introducing additional state discretization.
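The state assembly described above can be sketched as follows; the field names and the min–max normalization ranges are illustrative assumptions, but the split between normalized continuous variables and pass-through binary/time variables follows the text.

```python
def build_state(pv, price_e, price_h2, grid_ci, soc, d_bev, d_hfcv,
                elz_on, fc_on, t, bounds):
    """Assemble the CMDP state: continuous variables are min-max normalized,
    binary device states and the time index are passed through unchanged.
    `bounds` maps each continuous field to its (min, max) range; all names
    are illustrative stand-ins for the paper's symbols."""
    def norm(name, value):
        lo, hi = bounds[name]
        return (value - lo) / (hi - lo) if hi > lo else 0.0
    cont = [norm("pv", pv), norm("price_e", price_e), norm("price_h2", price_h2),
            norm("grid_ci", grid_ci), norm("soc", soc),
            norm("d_bev", d_bev), norm("d_hfcv", d_hfcv)]
    return cont + [int(elz_on), int(fc_on), t]
```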
The action at time step t is defined as the joint demand–satisfaction decision for the two transportation loads, namely the satisfaction ratios of the BEV charging demand and the HFCV hydrogen demand. Since the DQN framework requires a finite action set, both action variables are discretized with a uniform step size of 0.1, yielding two eleven-element action subsets whose Cartesian product constitutes the joint action space. Therefore, the total number of feasible joint actions is 11 × 11 = 121. For a selected joint action, the actually supplied electricity and hydrogen are obtained by scaling the respective demands with the chosen satisfaction ratios. By dynamically adjusting these two satisfaction ratios, the agent can indirectly realize the coordinated control of fuel cell output, electrolyzer start-up and shut-down, hydrogen storage charging and discharging, and external energy purchase under the physical constraints of the system.
This CMDP design maintains a continuous description of the system operating condition in the state space while keeping the action space compatible with the DQN-based solution method. In terms of scalability, the proposed formulation is favorable with respect to the state dimension, since the continuous state variables are directly handled by the neural network without manual state discretization. By contrast, the size of the action space grows with the discretization granularity of the demand–satisfaction ratios. A finer action interval can improve control resolution, but it also enlarges the number of joint actions and increases the learning complexity. In this study, the step size of 0.1 is adopted as a compromise between decision resolution and computational tractability.
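The discretized action space described above can be enumerated directly; with a step size of 0.1 each ratio takes 11 values, so the Cartesian product has 121 joint actions. The helper name below is illustrative.

```python
from itertools import product

# Discretized satisfaction ratios with step 0.1, as described in the text.
RATIOS = [round(0.1 * i, 1) for i in range(11)]      # 0.0, 0.1, ..., 1.0
JOINT_ACTIONS = list(product(RATIOS, RATIOS))        # (BEV ratio, HFCV ratio)

def supplied(action, d_bev, d_hfcv):
    """Map a joint action to the actually supplied electricity and hydrogen."""
    r_bev, r_hfcv = action
    return r_bev * d_bev, r_hfcv * d_hfcv
```

Refining the step to 0.05 would double each axis to 21 values and quadruple the joint space to 441 actions, which illustrates the resolution–tractability trade-off noted above.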
On the basis of satisfying the physical constraints of the system and the data-driven model constraints described above, the optimization objective of the CMDP consists of two parts: the reward function and the constraint conditions. The reward function aims to maximize the profit defined in Equation (25). Meanwhile, in order to ensure service quality and environmental compliance, boundary constraints are imposed, requiring that the unmet user demand defined in Equation (33) and the carbon emissions calculated in Equation (34) be strictly limited within the allowable threshold ranges. By incorporating the above multiple constraints into the model, the policy can ensure the coordinated achievement of user satisfaction and low-carbon objectives while pursuing the maximization of cumulative profit.
3.3. Improved DQN-Based Solution Algorithm for the CMDP
The constrained Markov decision process constructed for the EHIS in the previous section involves a high-dimensional state space and multiple coupled constraints, making it difficult for traditional optimization methods to cope with its dynamic uncertainty. To address this issue, this study proposes an improved DQN-based solution algorithm for the CMDP. Taking DQN as the core framework, the proposed method integrates a dynamic learning rate and prioritized experience replay to improve convergence efficiency in complex environments. In addition, Lagrangian relaxation is employed to embed constraint handling into the training loop, and a policy template reuse mechanism is introduced to alleviate the cold-start problem under varying operating conditions. The overall solution framework for the EHIS is shown in Figure 3.
3.3.1. Improved DQN Algorithm
Traditional reinforcement learning often suffers from low efficiency and poor convergence when dealing with high-dimensional spaces. To address this issue, DQN combines deep learning with Q-learning and uses a neural network to approximate the state–action value function (Q-function), thereby significantly improving the modeling capability for complex environments and the quality of policy approximation [35]. DQN adopts two neural networks: an evaluation network that predicts the Q-values of all actions under the current state, providing the basis for decision making, and a target network that provides stable target Q-values for calculating the loss function. The parameters of the evaluation network are continuously updated during training, while the parameters of the target network are periodically copied from the evaluation network to reduce instability during training. The Q-value update rule is given in Equation (39).
In the initial stage of training, to balance exploration and exploitation, the action selection strategy adopts an ε-greedy policy over the joint action space defined above. At each decision step, the policy selects a joint action at random with probability ε and selects the action with the maximum Q-value under the current state with probability 1 − ε. To improve training efficiency, an exponential decay strategy is introduced, such that ε gradually decreases during training from an initial value of 1.0 to a minimum value of 0.01. In this way, the model explores a wide range of strategies in the early stage of training and focuses on exploiting the learned policy in the later stage.
In addition, this study introduces a dynamic learning rate adjustment mechanism into the conventional DQN framework, so that the learning rate can be adaptively adjusted during training. Specifically, as training proceeds, the learning rate gradually decreases to balance early-stage exploration and late-stage convergence. The learning rate adjustment formula is given in Equation (41), with the initial learning rate set to 0.01 and the minimum learning rate set to 0.001. Under this mechanism, the loss function can be directly expressed as Equation (42), in which the target network parameters are updated under the influence of the dynamic learning rate. This mechanism adjusts the parameter update rate by controlling the learning rate, thereby indirectly affecting the synchronization rhythm of the target network. As a result, a dynamic coupling among the learning rate, target value, and policy is established during training, which improves the stability and accuracy of policy convergence.
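Both decaying quantities (the exploration rate, 1.0 → 0.01, and the learning rate, 0.01 → 0.001) can be generated by the same clipped exponential schedule, sketched below; the decay rate of 0.995 is an assumed placeholder, since Equation (41) defines the exact form.

```python
def exp_decay(step, start, minimum, rate=0.995):
    """Exponential decay clipped at a floor, usable for both the exploration
    rate and the learning rate. The `rate` value is an assumed placeholder."""
    return max(minimum, start * (rate ** step))

# Exploration rate: 1.0 down to 0.01; learning rate: 0.01 down to 0.001.
epsilons = [exp_decay(k, 1.0, 0.01) for k in range(2000)]
lrs = [exp_decay(k, 0.01, 0.001) for k in range(2000)]
```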
To improve the sampling efficiency in complex tasks, this study proposes a dynamic prioritized experience replay (PER) mechanism. Building on conventional PER, this mechanism introduces a temporal decay factor and an access-frequency correction term, so that the sampling probability achieves a dynamic balance between focusing on high-value samples and avoiding over-concentration. The sampling probability is defined in Equation (43): the priority of each sample is derived from its TD error, an exponent controls the influence of priority on the sampling probability, a small smoothing term avoids zero sampling probability, and a correction term models the dynamic adjustment of samples with respect to time and access frequency. In this way, key samples with high TD errors are sampled more frequently in the early stage, while their priority is adaptively reduced as learning gradually converges, thereby avoiding overtraining. To correct the bias caused by non-uniform sampling, the importance weight is defined in Equation (44).
The importance weight in Equation (44) depends on the number of samples in the replay buffer and on a bias-correction exponent that gradually increases as training proceeds, thereby achieving a natural transition from exploration to stable convergence.
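The dynamic PER sampling probabilities and importance weights can be sketched as follows; the specific damping form and all constants are assumed placeholders, since Equations (43) and (44) fix the actual expressions.

```python
def sampling_probs(td_errors, visits, ages, alpha=0.6, eps=1e-3,
                   decay=0.99, freq_penalty=0.1):
    """Dynamic PER in the spirit of Equation (43): priority from |TD error|,
    damped by a temporal decay factor and an access-frequency correction.
    The damping form and constants are illustrative assumptions."""
    prios = [((abs(d) + eps) ** alpha) * (decay ** age) / (1.0 + freq_penalty * v)
             for d, v, age in zip(td_errors, visits, ages)]
    total = sum(prios)
    return [p / total for p in prios]

def importance_weights(probs, beta):
    """Bias-correcting weights for non-uniform sampling (Equation (44) spirit),
    normalized so the largest weight equals 1."""
    n = len(probs)
    w = [(n * p) ** (-beta) for p in probs]
    m = max(w)
    return [wi / m for wi in w]
```

As expected, samples with larger TD errors receive higher sampling probability, and frequently sampled transitions receive smaller importance weights so that the gradient estimate stays unbiased.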
3.3.2. Constraint Handling Based on Lagrangian Relaxation
To solve the CMDP model constructed above and handle complex constraints while maximizing profit, this study adopts the Lagrangian relaxation technique. Specifically, by introducing Lagrange multipliers, the user demand constraint and the carbon emission constraint described above are transformed into penalty terms and embedded into the reward function of reinforcement learning, yielding a modified one-step reward. The multipliers adaptively balance the trade-off between profit maximization and constraint satisfaction with respect to the service constraint threshold and the carbon emission threshold. When user demand is not satisfied or carbon emissions exceed the threshold, the reward is reduced accordingly, driving the policy to gradually learn feasible decisions that satisfy the constraints during training.
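The modified one-step reward can be sketched as a profit minus multiplier-weighted constraint violations; names are illustrative, only violations are penalized here, and the multiplier update rules are omitted.

```python
def penalized_reward(profit, unmet_demand, demand_limit,
                     emissions, emission_limit, lam_d, lam_c):
    """Lagrangian-relaxed one-step reward: profit minus penalties on the
    service constraint and the carbon constraint. Names are illustrative
    placeholders for the paper's symbols."""
    r = profit
    r -= lam_d * max(0.0, unmet_demand - demand_limit)
    r -= lam_c * max(0.0, emissions - emission_limit)
    return r
```

In a full primal–dual scheme the multipliers themselves would also be updated in proportion to the observed constraint violations, which is the "adaptive balancing" described above.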
3.3.3. The Template Policy-Based Reinforcement Learning Method
This study proposes the template policy-based reinforcement learning method, which enables cross-scenario policy parameter sharing and fine-tuning. During training, each scenario category is initialized with its own policy model, and reinforcement learning is performed on the corresponding typical operating trajectory. The trained policy parameters are then stored in a policy template library M, where K denotes the total number of scenario categories. For a new scenario s′ with a relatively small sample size, the Euclidean distance between its feature vector and the existing scenario feature vectors in the template library is calculated, and the most similar policy parameters are selected for initialization and further fine-tuning.
This mechanism not only promotes the reuse of knowledge, but also improves the adaptability to heterogeneous scenarios during the training process.
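The template-selection step can be sketched as a nearest-neighbor lookup over scenario feature vectors; the library entries and feature values below are hypothetical, and the fine-tuning stage is omitted.

```python
import math

def nearest_template(library, feature):
    """Pick the stored policy parameters whose scenario feature vector is
    closest (Euclidean distance) to the new scenario's features; a sketch
    of the template-reuse step with fine-tuning omitted."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(library, key=lambda entry: dist(entry["feature"], feature))
    return best["params"]

# Hypothetical two-template library (features could encode, e.g., normalized
# daily PV level and demand level of each clustered scenario).
library = [
    {"feature": [0.9, 0.2], "params": "sunny_office_policy"},
    {"feature": [0.1, 0.8], "params": "cloudy_residential_policy"},
]
```

The selected parameters then serve as the warm-start initialization for the new scenario, which is what alleviates the cold-start problem.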
4. Simulation Analysis
4.1. Experimental Settings
Because weather conditions and user demand at the electric–hydrogen integrated station differ from day to day, the operating parameters of the EHIS are not identical across days. To more realistically reflect the diversity of operating environments, this study collects photovoltaic and user load data from different regions of Shanghai. After data completion, a clustering analysis method is adopted to classify typical operating scenarios, and representative scenarios are finally obtained. Some of the device parameters are listed in Table 1. In this study, it is assumed that the electricity price and the carbon intensity of grey electricity vary dynamically within one day but are known one day in advance; their variation curves are shown in Figure 4. The hydrogen price and the associated penalty coefficient are both set to 50 CNY/kg, the two user demand penalty coefficients are both set to 1, the hydrogen storage safety factor is set to 80%, and the start-up costs of the electrolyzer and the fuel cell are both set to 78 CNY/time [36].
To evaluate the accuracy and generalization capability of the model in hydrogen production rate prediction, a PEM electrolyzer dataset containing 20,000 samples was constructed. The dataset consists of two parts: one part was obtained from PEM electrolyzer-related data collected and compiled by the authors, while the other part was generated from the original samples through missing-value completion and data enhancement. The samples cover multiple operating conditions and parameter combinations rather than different categories of electrolyzer devices. Specifically, the dataset spans different input power levels, operating temperatures, gas pressure conditions, membrane thicknesses, membrane electrode assembly contact areas, charge transfer coefficients, and exchange current densities, thereby reflecting the operating differences of PEM electrolyzers under diverse conditions. In addition, part of the original samples was compiled from published PEM electrolyzer experimental or processed data in the literature [28,37,38], mainly to improve the coverage of operating conditions and parameter combinations and thus enhance the representativeness of the dataset.
After preprocessing, the dataset was divided into a training set and a test set with a ratio of 8:2. Missing-value completion and data enhancement were applied only to the training set, whereas the test set retained the original unenhanced samples. This treatment was adopted for two reasons. First, augmenting only the training set helps increase sample diversity, alleviate the limited coverage of the original data, and improve the model’s ability to learn complex nonlinear relationships under multiple operating conditions. Second, keeping the test set in its original unenhanced form allows the predictive performance to be evaluated on data that are closer to realistic operating conditions, thereby avoiding performance bias caused by test-set augmentation. Therefore, the enhanced data were used only for model learning, whereas the final evaluation was always conducted on the original data distribution. In addition, all input features were normalized before training to eliminate dimensional differences and improve convergence stability. To ensure a fair comparison, all models used the same data partition, and the test set was kept strictly independent throughout the training process. All models took the 11-dimensional parameters defined in Equation (18) as inputs and the hydrogen production rate as the output.
For reproducibility, the main hyperparameters of the prediction models were fixed as follows. For the XGB module, the number of boosting rounds was set to 200, the maximum tree depth was set to 6, the learning rate was set to 0.05, the subsample ratio was set to 0.8, the column sampling ratio was set to 0.8, the minimum child weight was set to 1, and the L2 regularization coefficient was set to 1. For the deep neural network module stacked on the XGB output, two hidden layers with 64 and 32 neurons were adopted, the activation function was ReLU, the optimizer was Adam, the learning rate was set to 0.001, the batch size was set to 128, and the number of training epochs was set to 200. For the MLP baseline, two hidden layers with 128 and 64 neurons were used, the activation function was ReLU, the optimizer was Adam, the learning rate was set to 0.001, the batch size was set to 128, and the number of training epochs was set to 200. By recording the MAE and RMSE curves during the training process, the convergence behavior and accuracy differences of the models were analyzed.
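Collected in one place, the fixed hyperparameters stated above might be recorded as a configuration dictionary like the following; the key names are our own (chosen to mirror common XGBoost/Keras-style parameter names), while the values are exactly those stated in the text.

```python
# Hyperparameters as stated in the text; key names are illustrative.
HYPERPARAMS = {
    "xgb": {
        "n_estimators": 200, "max_depth": 6, "learning_rate": 0.05,
        "subsample": 0.8, "colsample_bytree": 0.8,
        "min_child_weight": 1, "reg_lambda": 1,
    },
    "dnn": {
        "hidden_layers": [64, 32], "activation": "relu", "optimizer": "adam",
        "learning_rate": 0.001, "batch_size": 128, "epochs": 200,
    },
    "mlp_baseline": {
        "hidden_layers": [128, 64], "activation": "relu", "optimizer": "adam",
        "learning_rate": 0.001, "batch_size": 128, "epochs": 200,
    },
}
```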
4.2. Prediction Performance of the Data-Driven Model
As shown in Figure 5, the black and red curves correspond to the training and testing errors of XGB, respectively, while the blue curve corresponds to the testing error of MLP. In terms of the overall trend, the error of XGB decreases rapidly in the early stage of training and converges quickly, and the training and testing curves match closely, indicating that the model has high learning efficiency and strong generalization capability without overfitting. In contrast, although the MAE and RMSE of MLP also decrease during training, their overall values remain consistently higher than those of XGB. As can be seen from the enlarged view in the upper right corner, the error fluctuation of XGB in the later stage of training is significantly smaller than that of MLP, demonstrating better stability. In summary, XGB outperforms the comparison algorithm in terms of convergence speed, prediction accuracy, and generalization performance.
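The MAE and RMSE metrics tracked in Figure 5 follow their standard definitions; a minimal implementation:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired samples."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error over paired samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
                     / len(y_true))
```

RMSE penalizes large deviations more heavily than MAE, which is why the two curves can diverge when a model makes occasional large errors.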
4.3. Optimization Performance of the EHIS
The representative scenarios obtained from the clustering analysis are shown in Figure 6. Scenario 1 is shown in Figure 6a–c, which present the PV output profile and the demand profiles of BEVs and HFCVs, respectively. This scenario represents an EHIS located near an office building on a sunny day. As shown in the figure, the PV output remains at a relatively high level throughout the day and reaches its peak at around 11:00. Since most users stay at work during daytime and return home at night, the demands of BEVs and HFCVs are concentrated around midday, while the demand levels in the morning and evening are relatively low.
Figure 7 shows the simulation results of Scenario 1. As can be seen from Figure 7a, in terms of PV allocation, the PV output fully covers the BEV demand from 6:00 to 18:00, while during the remaining periods the demand is jointly supplied by the fuel cell and the power grid. In particular, from 1:00 to 4:00, grid electricity purchases increase due to the low electricity price and insufficient hydrogen storage. The HFCV demand is preferentially satisfied by on-site supply because of the high hydrogen price. The evolution of the hydrogen storage level is shown in Figure 7b: it decreases slightly during the night, gradually accumulates from 6:00 with the increase in hydrogen production driven by PV power, reaches its peak at around 18:00, and then falls back to the initial level to ensure a sufficient reserve for the next day’s operation.
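The qualitative dispatch priority observed in Figure 7 (PV first for BEV charging, then the fuel cell when sufficient hydrogen is stored, with the grid covering the residual) can be illustrated with a toy greedy rule. This is a simplified sketch with hypothetical names, not the paper’s actual optimization model, which also accounts for prices and carbon intensity.

```python
def dispatch_hour(pv_kw, bev_kw, fc_cap_kw, h2_level_kg, h2_min_kg):
    """Toy greedy split of one hour's BEV electrical demand among
    PV, fuel cell, and grid, in that order of priority."""
    pv_used = min(pv_kw, bev_kw)          # PV serves BEV demand first
    residual = bev_kw - pv_used
    # Fuel cell helps only if the hydrogen reserve is above its floor.
    fc_used = min(fc_cap_kw, residual) if h2_level_kg > h2_min_kg else 0.0
    grid_used = residual - fc_used        # grid purchases cover the rest
    return pv_used, fc_used, grid_used
```

For example, with ample PV the fuel cell and grid stay idle, whereas at night the same demand is split between the fuel cell (hydrogen permitting) and the grid.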
Scenario 2 is shown in Figure 6d–f, which represents an EHIS located near a residential area under rainy-day conditions. As shown in the figure, the daily PV output remains generally low and reaches its peak at around 14:00. Since residents usually return home in the evening, the demands of BEVs and HFCVs are relatively low around midday but become higher in the early morning and evening. In this experiment, the hydrogen price is set to 40 CNY/kg, while all other parameters remain unchanged.
Figure 8 shows the scheduling results of Scenario 2. Due to the limited PV output, PV power is mainly used for BEV charging from 6:00 to 18:00, while the remaining demand is supplemented by the power grid. The fuel cell provides additional power only during the high-electricity-price period from 19:00 to 21:00. The HFCV demand is mainly satisfied by on-site supply, with pipeline hydrogen serving as a supplementary source. As shown in Figure 8b, the hydrogen storage level decreases during the night and increases during the daytime, reaches its peak at around 18:00, and then declines afterward. The results indicate that PV output and electricity price fluctuations significantly affect the coordinated scheduling strategy.
The carbon emission intensity of grey electricity has been given previously in this paper, while the carbon emission intensity of grey hydrogen is assumed to remain constant within one day. It is assumed that the externally purchased grey hydrogen comes from a stable fossil-fuel-based hydrogen production pathway, with a value of 24 kgCO2/kgH2.
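Under these assumptions, the daily carbon accounting reduces to a weighted sum: hourly grid purchases multiplied by the time-varying grid carbon intensity, plus externally purchased grey hydrogen at the constant 24 kgCO2/kgH2. A sketch with hypothetical function and variable names:

```python
GREY_H2_INTENSITY = 24.0  # kgCO2 per kg of grey hydrogen (value above)

def daily_emissions(grid_kwh, grid_intensity, grey_h2_kg):
    """Total daily carbon emissions.

    grid_kwh       -- hourly electricity purchased from the grid (kWh)
    grid_intensity -- hourly grid carbon intensity (kgCO2/kWh)
    grey_h2_kg     -- total grey hydrogen purchased that day (kg)
    """
    grid_part = sum(e * c for e, c in zip(grid_kwh, grid_intensity))
    return grid_part + GREY_H2_INTENSITY * grey_h2_kg
```

Because the grid term is a product of purchases and a time-varying intensity, shifting purchases toward low-intensity hours lowers total emissions even when the total energy bought is unchanged, which is exactly the mechanism discussed next.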
Figure 9 shows the temporal distribution of carbon emissions in Scenario 2. It can be observed that the optimization strategy effectively reduces the total emissions associated with grey electricity through a mechanism of purchasing more during low-carbon periods and less during high-carbon periods. This indicates that the scheduling decisions make full use of the time-varying characteristics of carbon intensity while balancing profit and user demand. By incorporating carbon intensity into the coordinated scheduling process, the proposed method further reduces the overall carbon emissions of the system while maintaining service quality and operational economic performance.
Table 2 summarizes the comparison results of scheduling optimization based on the physics-based model and the data-driven model under the three scenarios. Limited by the simplified assumptions of electrochemical process dynamics, the traditional physics-based model cannot fully characterize the nonlinear coupled effects of multiple variables on hydrogen production power. In contrast, the data-driven model, benefiting from its higher prediction accuracy, effectively reduces the decision bias caused by model mismatch and thus achieves higher operating profits in all three scenarios. These results strongly verify the significant advantage of high-accuracy modeling in improving the economic performance of the system.
To further evaluate the robustness of the proposed method under extreme weather-induced renewable fluctuations, an additional extreme scenario is introduced, as shown in Figure 10. This scenario is constructed based on Scenario 1 and represents an EHIS located near an office building under highly fluctuating weather conditions. In this scenario, the demand profiles of BEVs and HFCVs remain the same as those in Scenario 1, while the photovoltaic output exhibits significant short-term fluctuations during the daytime. Such fluctuations can reasonably simulate sudden variations in solar irradiance caused by intermittent cloud cover. Although the overall daily PV output remains at a relatively high level, noticeable drops and recoveries occur around the daytime peak period, indicating a much stronger volatility of renewable generation than that in Scenario 1. By keeping the transportation demand unchanged and perturbing only the PV profile, this extreme scenario is specifically designed to test the adaptability and robustness of the proposed scheduling strategy under highly uncertain renewable supply conditions.
From the perspective of control robustness, these results indicate that the proposed DRL-based scheduler does not rely on a single deterministic renewable pattern, but can adapt its dispatch policy online according to the changing operating state of the station. Even when the PV generation becomes highly intermittent, the learned policy still maintains stable operation by coordinating multiple energy supply pathways and the hydrogen storage buffer. This demonstrates that the proposed DRL approach has good robustness against renewable-side uncertainty and can preserve acceptable scheduling performance under disturbed operating conditions.
Figure 11 shows the simulation results of the extreme scenario. As can be seen from Figure 11a, under the highly fluctuating PV condition, PV power still remains an important energy source during daytime, but the station no longer relies on it as smoothly as in Scenario 1. Instead, the system responds through more flexible coordination of the fuel cell, the power grid, and hydrogen storage, so as to maintain the continuity of station operation. During the night and late evening, the demand is still mainly covered by the coordinated support of the fuel cell and the power grid. Meanwhile, due to the relatively high hydrogen price, the HFCV demand continues to be preferentially satisfied by on-site hydrogen supply, while the use of pipeline hydrogen remains limited.
The evolution of the hydrogen storage level is shown in Figure 11b. Similar to Scenario 1, the hydrogen quantity in the HST decreases slightly during the night, then increases during the daytime as surplus PV power is converted into hydrogen. However, owing to the intensified PV fluctuations, the variation of the hydrogen quantity in the HST becomes less smooth and exhibits more pronounced short-term fluctuations than in Scenario 1. During periods of reduced PV output, the stored hydrogen is released to support system operation, which helps buffer the impact of renewable intermittency. After reaching its daytime peak, the hydrogen quantity gradually decreases in the late afternoon and evening and finally returns to a level close to its initial value. These results indicate that, under the extreme fluctuating PV condition, the proposed scheduling strategy can still effectively coordinate PV generation, fuel cell output, grid purchase, and hydrogen storage operation, thereby maintaining stable energy supply and demonstrating good robustness against severe renewable uncertainty.
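The HST behavior described above follows a simple per-hour mass balance: electrolyzer production fills the tank, while fuel-cell operation and HFCV refueling draw it down. The function below is an illustrative sketch (the names and the clipping to tank capacity are assumptions, since the paper’s constraint formulation is not restated here):

```python
def hst_step(level_kg, prod_kg, fc_use_kg, hfcv_kg, cap_kg):
    """One-hour hydrogen storage tank (HST) balance.

    level_kg  -- stored hydrogen at the start of the hour (kg)
    prod_kg   -- electrolyzer output during the hour (kg)
    fc_use_kg -- hydrogen consumed by the fuel cell (kg)
    hfcv_kg   -- hydrogen dispensed to HFCVs (kg)
    cap_kg    -- tank capacity; the level is clipped to [0, cap_kg]
    """
    level = level_kg + prod_kg - fc_use_kg - hfcv_kg
    return min(max(level, 0.0), cap_kg)
```

Iterating this balance over 24 hours reproduces the qualitative shape in Figure 11b: a slight nighttime decline, a daytime rise driven by surplus PV, and a return toward the initial level by the end of the day.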
4.4. Improved DQN Algorithm and the TPRL Method
To quantitatively evaluate the independent contribution of each module in the proposed improved DQN algorithm, this study retrains the algorithms on scenarios similar to Scenario 3 under the same environment, the same random seed, and exactly the same training configuration, and compares the following five methods: the first is the baseline DQN with uniform experience replay; the second introduces only the dynamic learning rate (DQN + DLR); the third introduces only prioritized experience replay (DQN + PER); the fourth is the improved DQN proposed in this paper (DLR + PER); and the last is the DQN initialized by the policy template (i.e., TPRL).
As shown in Figure 12, all five methods exhibit the typical convergence characteristic of “rapid increase in the early stage and gradual stabilization in the later stage”, but they differ significantly in terms of convergence speed, profit level, and stability. The baseline DQN converges the slowest and shows the largest fluctuations, resulting in the lowest final profit. After introducing DLR, the convergence becomes faster, the profit level increases, and the oscillation is reduced. PER significantly improves the sampling efficiency and profit level, although its smoothness is slightly inferior to that of DLR. The improved DQN proposed in this paper achieves the best overall performance, combining the advantages of fast convergence, high profit, and high stability. In addition, since TPRL reuses the policy template of Scenario 3 for retraining, it achieves the fastest policy optimization speed.
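The two ablated mechanisms can be sketched in isolation. The exponential decay schedule for the dynamic learning rate and the proportional sampling exponent used for PER are illustrative choices, since the paper’s exact formulas are not restated in this section:

```python
import math
import random

def dynamic_lr(lr0, episode, decay=0.001, lr_min=1e-4):
    """Dynamic learning rate: large steps early for fast exploration,
    exponentially decayed toward a floor for stable late-stage
    convergence. The schedule itself is an assumed example."""
    return max(lr0 * math.exp(-decay * episode), lr_min)

def per_sample(priorities, k, alpha=0.6, rng=random):
    """Proportional prioritized experience replay: transition i is
    drawn with probability p_i**alpha / sum_j p_j**alpha, so
    high-TD-error transitions are replayed more often."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=k)
```

In a full implementation, PER additionally corrects the sampling bias with importance-sampling weights and refreshes each transition’s priority after replay; those details are omitted here for brevity.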
4.5. Performance Comparison of Different Algorithms
Table 3 compares the overall performance of GA, DDQN, the improved DQN, and TPRL under the three scenarios. The algorithms are evaluated from three perspectives, namely profit, user satisfaction, and carbon emissions. The results show that the two DRL-based methods, DDQN and the improved DQN, both outperform GA in all scenarios, demonstrating the superiority of reinforcement-learning-based scheduling over conventional optimization in EHIS operation. Furthermore, compared with DDQN, the improved DQN consistently achieves higher profit and satisfaction in all three scenarios, which verifies the effectiveness of the proposed dynamic learning-rate adjustment and prioritized experience replay mechanisms. Although DDQN yields slightly lower carbon emissions, the improved DQN provides a better overall balance between economic performance and service quality while maintaining emissions at an acceptable level. In addition, TPRL further improves the profit and satisfaction in Scenario 3 compared with direct training of the improved DQN, indicating that template-based policy reuse can provide better initialization and further enhance scheduling performance in similar scenarios. Overall, the improved DQN exhibits the best overall multi-objective trade-off among the directly trained algorithms, while TPRL further strengthens decision quality through template transfer.
It should also be emphasized that the superiority of the proposed method is not limited to a single operating condition. Across the three representative scenarios and the additional extreme PV fluctuation scenario, the improved DQN-based framework consistently maintains favorable performance in terms of profit, user satisfaction, and operational continuity. This suggests that the learned scheduling policy has a certain degree of robustness and adaptability to varying supply–demand patterns and renewable uncertainties, rather than being overfitted to one specific scenario.
Figure 13 shows the training performance of GA, DDQN, the improved DQN, and TPRL in Scenario 3. As can be seen from the figure, all four methods exhibit an overall upward trend in profit during the training process, but their convergence speed, final profit level, and stability are significantly different. Among them, TPRL achieves the fastest profit improvement in the early stage and converges to the highest final profit, indicating that the reuse of policy templates from similar scenarios can provide a better initialization and effectively alleviate the cold-start problem. The improved DQN also shows strong performance, with a higher final profit and better convergence stability than DDQN and GA. Compared with DDQN, the improved DQN converges to a clearly higher profit level, which further verifies the effectiveness of the proposed dynamic learning-rate adjustment and dynamic prioritized experience replay mechanisms. In contrast, although DDQN performs better than GA in the early training stage and reaches a higher final profit, its convergence speed and final performance are still inferior to those of the improved DQN. GA exhibits the slowest convergence and the lowest final profit among the compared methods, reflecting the limitation of conventional optimization algorithms in handling the sequential and highly coupled decision-making process of EHIS scheduling. Overall, the results demonstrate that the proposed improved DQN provides a more effective training process and better scheduling performance than the benchmark methods, while TPRL can further enhance convergence speed and final decision quality in similar scenarios.
Despite the favorable results, the proposed DRL-based method still has some limitations. The current validation is conducted on a limited number of representative scenarios and, although the additional extreme PV fluctuation case strengthens the robustness evaluation, it does not cover all possible uncertainties. Moreover, the policy generalization under significantly shifted scenario distributions and the scalability of the discretized DQN framework in larger EHIS applications still require further investigation.