1. Introduction
In recent years, environmental pollution and global warming have become increasingly severe. The transportation sector is widely recognized as one of the major sources of energy consumption and greenhouse gas emissions [1]. To address the energy crisis and improve environmental quality, the development of green and low-carbon transportation has become an important strategic direction in many countries. In the process of transport decarbonization, battery electric vehicles (BEVs) and hydrogen fuel cell vehicles (HFCVs) are regarded as two key technological pathways [2]. In parallel with the rapid development of BEVs and HFCVs, hydrogen-related electrochemical technologies have also made steady progress in recent years, which further supports the development of integrated electricity–hydrogen energy infrastructures [3,4]. Compared with conventional fossil-fuel vehicles (FFVs), BEVs and HFCVs exhibit significant advantages in terms of higher energy efficiency, lower carbon emissions, and reduced dependence on fossil fuels [5,6]. With the rapid development of new energy vehicles and the sustained attention from both consumers and manufacturers [7], the demand for supporting energy supply infrastructure has become increasingly urgent. Therefore, electric–hydrogen integrated stations (EHISs), which integrate photovoltaic power generation, hydrogen production, hydrogen storage, and battery charging/hydrogen refueling functions, have gradually emerged as a promising form of transportation energy infrastructure [8]. Achieving efficient, economical, and low-carbon operation of EHISs under uncertain conditions has thus become a key issue that needs to be addressed.
Extensive studies have been conducted on the operational optimization of charging stations and hydrogen refueling stations (HRSs). For charging stations, some studies have adopted methods such as chance-constrained optimization to reduce operating costs while considering the uncertainties of photovoltaic (PV) generation and energy storage systems (ESSs) [9]. For HRSs, existing research has optimized on-site electrolytic hydrogen production and hydrogen storage configurations to reduce hydrogen costs and improve economic performance [10]. In addition, some studies have incorporated HRSs into electricity and ancillary service market frameworks to improve revenues and reduce hydrogen costs [11], while others have proposed multi-layer coordinated strategies to address the coupled dispatch problem of electricity–heat–hydrogen integrated energy systems [12]. However, these methods still fall short of more complex engineering requirements. Most existing studies focus on a single economic objective, making it difficult to simultaneously characterize and balance station profit, user demand satisfaction, and carbon emission constraints. Moreover, independent optimization of individual subsystems cannot fully exploit the synergistic benefits brought by electricity–hydrogen coupling. Therefore, coordinated optimal scheduling of electric–hydrogen integrated stations has become an urgent issue to be addressed.
Although the optimal scheduling of power systems and hydrogen systems is relatively mature within each domain, the integrated management of electricity–hydrogen coupled systems is still at a developing stage, and its optimal scheduling must address complex issues such as electricity–hydrogen coupling mechanisms [13]. Existing studies have proposed EHIS scheduling strategies capable of simultaneously satisfying the demands of battery electric vehicles and fuel cell vehicles, while handling uncertainty through photovoltaic (PV) power supply and stochastic modeling methods [14]. In addition, some studies have developed operation schemes from the perspective of economic performance to reduce overall system costs [15]. In recent years, deep reinforcement learning (DRL) has also shown promising potential in energy management and sequential control problems under uncertainty, because it can learn adaptive decision policies directly from environment interaction without relying on explicit future scenario enumeration. Recent studies have extended DRL to hydrogen-coupled or multi-energy scheduling problems, including safe DRL-assisted energy management for active distribution networks with hydrogen fueling stations, data-driven scheduling of integrated electricity–heat–gas–hydrogen systems, and hybrid DRL-based scheduling frameworks with safety guarantees [16,17,18]. However, for practical station-level operation, there is still a need for a deployable methodological framework that can jointly balance profit, user demand satisfaction, and carbon emission constraints under refined engineering constraints and sequential decision-making under uncertainty.
To achieve the above refined scheduling objective, improving only the upper-level optimization algorithm is not sufficient. The operational optimization of EHISs also relies heavily on the modeling accuracy of the core hydrogen production equipment, namely the proton exchange membrane (PEM) electrolyzer, and modeling errors may directly affect the credibility of scheduling and operational decisions [19]. Existing studies have proposed various mechanistic modeling methods for PEM electrolyzers, such as representing the electrolyzer voltage as the sum of the open-circuit voltage and different overpotentials [20], or developing thermo-electrochemical coupled models to characterize stack behavior [21]. However, mechanistic models usually depend on a large number of parameter assumptions and often exhibit semi-empirical characteristics, which makes it difficult for them to accurately capture the complex nonlinearities and multivariable coupling relationships under fluctuating operating conditions, thereby limiting their adaptability in practical prediction and control [22,23]. In addition, some studies have further investigated the influence of current ripple on PEM electrolyzer performance through detailed modeling and experimental analysis, which highlights the importance of accurate model characterization under complex operating conditions [24]. With the improvement of data acquisition and computing capabilities, data-driven methods have shown considerable potential in electrolyzer modeling [25,26]. Among these methods, XGBoost has attracted increasing attention because of its strong nonlinear fitting capability, built-in regularization, and good generalization performance on structured tabular data. In recent hydrogen-related studies, XGBoost has also been applied to PEM water electrolysis and hydrogen production prediction problems, showing competitive predictive accuracy under complex operating conditions [27,28]. Such machine-learning-based models can be trained on historical operating data to reduce dependence on mechanistic parameters [29]. Nevertheless, data-driven models may still suffer from data scarcity during the initial training stage, and their robustness and generalization capability need further improvement [30]. Meanwhile, some studies have pointed out that, compared with fuel cell research, data-driven modeling studies for electrolyzers are still relatively limited [31].
Considering the above factors, this study proposes a deep reinforcement learning-based scheduling method for an electric–hydrogen integrated station using a data-driven electrolyzer model, with the aim of improving scheduling accuracy as well as the economic and low-carbon operational performance of the station under uncertain operating conditions. To address the difficulty of traditional physics-based models in accurately characterizing the complex nonlinear hydrogen production behavior of proton exchange membrane (PEM) electrolyzers, as well as the limitations of existing scheduling methods in refined modeling and multi-objective coordination, a coordinated scheduling framework integrating data-driven modeling and deep reinforcement learning is developed. The main contributions of this paper are summarized as follows:
A learning-enhanced data-driven model for PEM electrolyzers is developed. Compared with traditional physics-based models, the proposed model provides greater flexibility and higher prediction accuracy under complex nonlinear operating conditions, enabling a more accurate characterization of hydrogen production behavior. Moreover, it can extract underlying patterns from limited data, thereby alleviating data scarcity issues to some extent and improving the practicality and generalization capability of the model.
A more refined scheduling model for the electric–hydrogen integrated station is established. The proposed model not only considers key factors such as equipment start-up and shut-down costs, hydrogen storage safety constraints, user demand satisfaction, and carbon emission limits, but also captures the sequential decision-making characteristics of the electricity–hydrogen coupled operation process at the station level. As a result, it can more realistically and comprehensively reflect the scheduling requirements of practical operating scenarios.
An improved DQN-based solution method integrating Lagrangian relaxation and the template policy-based reinforcement learning method is proposed. By transforming complex constraints into penalty terms, the proposed method enables effective handling of operational constraints. In addition, by reusing historical policy parameters from similar operating scenarios, it improves training efficiency, convergence performance, and generalization capability in similar scenarios.
The remaining sections of this paper are structured as follows. Section 2 introduces the overall structure and main component models of the EHIS, and briefly reviews the basic principles of traditional reinforcement learning. Section 3 presents the EHIS optimization framework based on the improved DQN algorithm. Section 4 reports the experimental results and evaluates the effectiveness of the proposed method through comparative analysis under different scenarios. Finally, Section 5 concludes the paper and discusses future research directions.
3. EHIS Optimization Framework Based on the Improved DQN Algorithm
This section develops the EHIS optimization framework based on the improved DQN algorithm. To enhance the accuracy of hydrogen production modeling, a data-driven PEM electrolyzer model is first introduced. Subsequently, the EHIS scheduling problem is formulated by jointly considering station profit, user demand satisfaction, and carbon emission constraints. Based on this formulation, an improved DQN-based solution method is then proposed to achieve efficient and reliable scheduling optimization.
3.1. Data-Driven PEM Electrolyzer Model
The core function of the electrolyzer is to convert electrical energy into hydrogen energy, and there exists a complex nonlinear mapping relationship between the hydrogen production rate and the input operating conditions [34]. Owing to the limited accuracy of conventional mechanistic models, it is difficult to accurately characterize the strong nonlinear behavior of the electrolyzer under complex fluctuating operating conditions, which may lead to prediction bias. To address this issue, this study proposes a data-driven modeling method for the PEM electrolyzer. The developed model serves as the interaction environment for subsequent reinforcement learning training and enables fast and accurate prediction of the hydrogen production rate through machine learning. As illustrated in Figure 2, the proposed deep XGBoost model adopts a serial hybrid architecture, in which the XGBoost module first extracts tree-based intermediate representations from the original 11-dimensional operating variables, and these representations are then fed into a deep neural network for further nonlinear refinement and final hydrogen production prediction.
The input data are given in Equation (18):
From left to right, the variables in Equation (18) represent the electrolyzer input power, membrane thickness, contact area of the membrane electrode assembly, operating temperature, the partial pressures of hydrogen, oxygen, and water vapor, the anodic and cathodic charge transfer coefficients, and the anodic and cathodic exchange current densities.
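For concreteness, the 11 operating variables of Equation (18) can be arranged into a fixed-order feature vector as in the minimal sketch below; the identifier names are illustrative stand-ins for the paper's symbols, not its actual notation.

```python
# Illustrative ordering of the 11-dimensional electrolyzer input of Equation (18).
# All names are hypothetical placeholders for the paper's mathematical symbols.
ELECTROLYZER_FEATURES = [
    "input_power",            # electrolyzer input power
    "membrane_thickness",     # PEM membrane thickness
    "mea_contact_area",       # contact area of the membrane electrode assembly
    "operating_temperature",  # stack operating temperature
    "p_h2",                   # hydrogen partial pressure
    "p_o2",                   # oxygen partial pressure
    "p_h2o",                  # water vapor partial pressure
    "alpha_anode",            # anodic charge transfer coefficient
    "alpha_cathode",          # cathodic charge transfer coefficient
    "j0_anode",               # anodic exchange current density
    "j0_cathode",             # cathodic exchange current density
]

def make_feature_vector(sample: dict) -> list:
    """Arrange a raw measurement dict into the fixed 11-dimensional order."""
    return [sample[name] for name in ELECTROLYZER_FEATURES]
```

A fixed feature ordering of this kind is what allows the same samples to feed both the XGBoost module and the comparison models described later.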
To establish the nonlinear mapping between the input features and the hydrogen production rate, this study constructs a deep XGBoost-based prediction model. The model takes the above 11-dimensional feature vector as the input and the hydrogen production rate as the output, as expressed in Equation (19).
The XGBoost regression model can be expressed as an additive ensemble of regression trees, as shown in Equation (20). In this formulation, the final prediction result is obtained by summing the outputs of multiple regression trees. During training, XGBoost iteratively minimizes the objective function through the gradient boosting strategy, such that each newly added tree is used to fit the residual error generated in the previous iteration, thereby continuously improving the overall prediction performance. The corresponding objective function is given in Equation (21), which consists of a data fitting term and a regularization term. The fitting term is used to measure the prediction error, while the regularization term is introduced to control the model complexity, suppress overfitting, and improve generalization capability.
To further improve the model performance, this study introduces a deep neural network on top of the XGB regression tree model. Specifically, the trained XGBoost module is first used to transform the original input vector into a tree-based intermediate representation. This representation is then transferred to the input layer of the DNN, which performs multi-layer nonlinear transformation to further capture higher-order feature interactions and outputs the final hydrogen production rate. In this way, the proposed model forms a serial hybrid architecture, where XGBoost is responsible for feature extraction and the DNN performs nonlinear refinement, as shown in Figure 2. The structure of the model is expressed in Equation (22), in which the deep neural network applies a nonlinear transformation, parameterized by its learnable weights, to the tree-based input features. Considering the physical trend of the hydrogen production rate, monotonicity constraints are introduced during the training of the deep XGB model to enhance the physical consistency of the prediction results and reduce fluctuation risks. In addition, an independently and randomly split test set is adopted to evaluate the generalization performance of the model and avoid overfitting. The evaluation metrics are RMSE and MAE, defined over the test-set samples in Equations (23) and (24), respectively. After the training process is completed, the deep XGB model is serialized and stored so that it can be efficiently called in the subsequent system-level simulation and optimization process, thereby enabling an efficient replacement of the mechanistic model.
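To make the serial hybrid idea concrete without external dependencies, the sketch below stands in for XGBoost with hand-rolled boosted depth-1 stumps, and for the DNN refinement stage with a simple per-leaf-pattern lookup; the RMSE and MAE metrics of Equations (23) and (24) are included as written. This is a toy illustration under these stated simplifications, not the paper's implementation.

```python
import math

def boosted_stumps(xs, ys, thresholds, lr=0.5):
    """Gradient boosting on squared loss with depth-1 trees (XGBoost stand-in):
    each new stump fits the residual left by the current ensemble."""
    pred = [sum(ys) / len(ys)] * len(xs)   # initial constant prediction
    stumps = []
    for t in thresholds:
        resid = [y - p for y, p in zip(ys, pred)]
        left = [r for x, r in zip(xs, resid) if x <= t]
        right = [r for x, r in zip(xs, resid) if x > t]
        lv = lr * (sum(left) / len(left)) if left else 0.0
        rv = lr * (sum(right) / len(right)) if right else 0.0
        stumps.append((t, lv, rv))
        pred = [p + (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return stumps

def leaf_pattern(stumps, x):
    """Tree-based intermediate representation: which leaf x falls into per tree."""
    return tuple(int(x > t) for t, _, _ in stumps)

def fit_refinement(stumps, xs, ys):
    """DNN stand-in: map each leaf pattern to the mean target of its samples."""
    table = {}
    for x, y in zip(xs, ys):
        table.setdefault(leaf_pattern(stumps, x), []).append(y)
    return {k: sum(v) / len(v) for k, v in table.items()}

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Toy data: hydrogen production rate grows with input power (monotone trend).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.9, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]
stumps = boosted_stumps(xs, ys, thresholds=[2.5, 4.5, 6.5])
table = fit_refinement(stumps, xs, ys)
preds = [table[leaf_pattern(stumps, x)] for x in xs]
```

The two-stage handoff (boosted trees produce a discrete representation, a second learner refines it) mirrors Figure 2's architecture; in the actual model the refinement stage is a trained neural network rather than a lookup table.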
3.2. EHIS Optimization Model
To improve the operational efficiency of the electric–hydrogen integrated station (EHIS) under multi-energy coordination and multi-load response conditions, it is necessary to construct a well-designed optimization model to support effective scheduling and control of system operation. The composition of the EHIS has been introduced in the previous section. By selling hydrogen to hydrogen fuel cell vehicles (HFCVs) and electricity to battery electric vehicles (BEVs), the station aims to satisfy diversified energy demands while simultaneously considering user demand satisfaction and carbon emission control, with the overall objective of maximizing operational profit.
3.2.1. Profit Maximization Objective
The profit maximization objective function under scenario s is defined in Equation (25) and consists of four parts. The first part represents the profit from selling electricity to BEVs, and the second the profit from selling hydrogen to HFCVs. The third part represents the start-up costs of the electrolyzer and the fuel cell, each weighted by its respective start-up cost coefficient. The fourth part is a penalty term for violating the safety constraint of the hydrogen storage tank, weighted by a penalty coefficient that depends on the safety factor of the hydrogen storage capacity. The power constraints of the electrolyzer and BEVs are given in Equation (26), the BEV energy demand constraint in Equation (27), and the HFCV hydrogen demand constraint in Equation (28).
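The four-part structure of Equation (25) can be sketched as a one-step profit function; all argument names below are illustrative placeholders, not the paper's symbols, and the linear storage penalty is an assumed form.

```python
def step_profit(e_sold, p_elec, h_sold, p_h2,
                elz_started, fc_started, c_elz_start, c_fc_start,
                soc, soc_safe, k_safety):
    """One-step profit in the spirit of Equation (25): electricity revenue plus
    hydrogen revenue, minus start-up costs, minus a storage-safety penalty.
    Names and the exact penalty form are illustrative assumptions."""
    revenue = e_sold * p_elec + h_sold * p_h2
    startup = int(elz_started) * c_elz_start + int(fc_started) * c_fc_start
    # Penalty applies only when storage exceeds its safe share of capacity.
    violation = max(0.0, soc - soc_safe)
    return revenue - startup - k_safety * violation
```

For example, selling 100 kWh at 1 CNY/kWh and 2 kg of hydrogen at 50 CNY/kg with one electrolyzer start-up (78 CNY) and a 0.1 storage-safety violation weighted at 1000 yields a net profit of about 22 CNY.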
3.2.2. User Demand Response Model
The user demand penalty function under scenario s is given in Equation (33). It comprises a penalty term for unmet BEV demand and a penalty term for unmet HFCV demand, each weighted by its own penalty coefficient, and is evaluated against the total amounts of electricity and hydrogen supplied by the entire system to users.
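The structure of Equation (33) can be sketched as follows; the names are illustrative placeholders and the linear shortfall penalty is an assumed form.

```python
def demand_penalty(e_demand, e_supplied, k_bev, h_demand, h_supplied, k_hfcv):
    """Unmet-demand penalty in the spirit of Equation (33): weighted shortfalls
    of electricity (BEV) and hydrogen (HFCV) supply. Names are illustrative."""
    unmet_e = max(0.0, e_demand - e_supplied)
    unmet_h = max(0.0, h_demand - h_supplied)
    return k_bev * unmet_e + k_hfcv * unmet_h
```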
3.2.3. Carbon Emission Calculation Model
The carbon emission calculation model under scenario s is given in Equation (34) and accounts for the carbon emissions associated with purchasing grey electricity from the grid and grey hydrogen from external sources. The emissions are computed from the grid carbon intensity and the unit carbon emission coefficient of externally purchased grey hydrogen, applied to the amounts of electricity purchased from the grid and hydrogen purchased from external sources, respectively.
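As a minimal sketch of Equation (34) (names illustrative), the accounting is a weighted sum of the two grey-energy purchases:

```python
def carbon_emissions(e_grid, grid_intensity, h_purchased, h2_coeff):
    """Carbon accounting in the spirit of Equation (34): grid electricity times
    its carbon intensity plus purchased grey hydrogen times its unit
    emission coefficient. Names are illustrative placeholders."""
    return e_grid * grid_intensity + h_purchased * h2_coeff
```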
3.2.4. CMDP Formulation
To address the coordinated scheduling problem under an uncertain environment, this study formulates the problem as a constrained Markov decision process (CMDP). At each time step t, the operating condition of the EHIS is represented by a state vector containing the key exogenous and endogenous variables that affect the current scheduling decision: the photovoltaic output, the time-of-use electricity price, the hydrogen price, the grid carbon intensity, the hydrogen storage level, and the real-time demands of BEVs and HFCVs, together with the binary operating states of the electrolyzer and the fuel cell and the index of the current decision stage within the scheduling horizon. The continuous state variables are normalized before being fed into the DQN, whereas the binary device-state variables and the time index are retained in their original forms. In this way, the state representation preserves the key operating information of the station without introducing additional state discretization.
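The state assembly described above can be sketched as follows; the field names and the min–max normalization ranges are illustrative assumptions, but the split between normalized continuous variables and pass-through binary/time variables follows the text.

```python
def build_state(pv, price_e, price_h2, grid_ci, soc, d_bev, d_hfcv,
                elz_on, fc_on, t, bounds):
    """Assemble the CMDP state: continuous variables are min-max normalized,
    binary device states and the time index are passed through unchanged.
    `bounds` maps each continuous field to its (min, max) range; all names
    are illustrative stand-ins for the paper's symbols."""
    def norm(name, value):
        lo, hi = bounds[name]
        return (value - lo) / (hi - lo) if hi > lo else 0.0
    cont = [norm("pv", pv), norm("price_e", price_e), norm("price_h2", price_h2),
            norm("grid_ci", grid_ci), norm("soc", soc),
            norm("d_bev", d_bev), norm("d_hfcv", d_hfcv)]
    return cont + [int(elz_on), int(fc_on), t]
```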
The action at time step t is defined as the joint demand–satisfaction decision for the two transportation loads, namely the satisfaction ratios of the BEV charging demand and the HFCV hydrogen demand. Since the DQN framework requires a finite action set, both action variables are discretized with a uniform step size of 0.1, yielding two eleven-element action subsets whose Cartesian product constitutes the joint action space. Therefore, the total number of feasible joint actions is 11 × 11 = 121. For a selected joint action, the actually supplied electricity and hydrogen are obtained by scaling the respective demands with the chosen satisfaction ratios. By dynamically adjusting these two satisfaction ratios, the agent can indirectly realize the coordinated control of fuel cell output, electrolyzer start-up and shut-down, hydrogen storage charging and discharging, and external energy purchase under the physical constraints of the system.
This CMDP design maintains a continuous description of the system operating condition in the state space while keeping the action space compatible with the DQN-based solution method. In terms of scalability, the proposed formulation is favorable with respect to the state dimension, since the continuous state variables are directly handled by the neural network without manual state discretization. By contrast, the size of the action space grows with the discretization granularity of the demand–satisfaction ratios. A finer action interval can improve control resolution, but it also enlarges the number of joint actions and increases the learning complexity. In this study, the step size of 0.1 is adopted as a compromise between decision resolution and computational tractability.
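The discretized action space described above can be enumerated directly; with a step size of 0.1 each ratio takes 11 values, so the Cartesian product has 121 joint actions. The helper name below is illustrative.

```python
from itertools import product

# Discretized satisfaction ratios with step 0.1, as described in the text.
RATIOS = [round(0.1 * i, 1) for i in range(11)]      # 0.0, 0.1, ..., 1.0
JOINT_ACTIONS = list(product(RATIOS, RATIOS))        # (BEV ratio, HFCV ratio)

def supplied(action, d_bev, d_hfcv):
    """Map a joint action to the actually supplied electricity and hydrogen."""
    r_bev, r_hfcv = action
    return r_bev * d_bev, r_hfcv * d_hfcv
```

Refining the step to 0.05 would double each axis to 21 values and quadruple the joint space to 441 actions, which illustrates the resolution–tractability trade-off noted above.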
On the basis of satisfying the physical constraints of the system and the data-driven model constraints described above, the optimization objective of the CMDP consists of two parts: the reward function and the constraint conditions. The reward function aims to maximize the profit defined in Equation (25). Meanwhile, in order to ensure service quality and environmental compliance, boundary constraints are imposed, requiring that the unmet user demand defined in Equation (33) and the carbon emissions calculated in Equation (34) be strictly limited within the allowable threshold ranges. By incorporating the above multiple constraints into the model, the policy can ensure the coordinated achievement of user satisfaction and low-carbon objectives while pursuing the maximization of cumulative profit.
3.3. Improved DQN-Based Solution Algorithm for the CMDP
The constrained Markov decision process constructed for the EHIS in the previous section involves a high-dimensional state space and multiple coupled constraints, making it difficult for traditional optimization methods to cope with its dynamic uncertainty. To address this issue, this study proposes an improved DQN-based solution algorithm for the CMDP. Taking DQN as the core framework, the proposed method integrates a dynamic learning rate and prioritized experience replay to improve convergence efficiency in complex environments. In addition, Lagrangian relaxation is employed to embed constraint handling into the training loop, and a policy template reuse mechanism is introduced to alleviate the cold-start problem under varying operating conditions. The overall solution framework for the EHIS is shown in Figure 3.
3.3.1. Improved DQN Algorithm
Traditional reinforcement learning often suffers from low efficiency and poor convergence when dealing with high-dimensional spaces. To address this issue, DQN combines deep learning with Q-learning and uses a neural network to approximate the state–action value function (Q-function), thereby significantly improving the modeling capability for complex environments and the quality of policy approximation [35]. DQN adopts two neural networks: an evaluation network that predicts the Q-values of all actions under the current state, providing the basis for decision making, and a target network that provides stable target Q-values for calculating the loss function. The parameters of the evaluation network are continuously updated during training, while the parameters of the target network are periodically copied from the evaluation network to reduce instability during training. The Q-value update rule is given in Equation (39).
In the initial stage of training, to balance exploration and exploitation, the action selection strategy adopts an ε-greedy policy over the joint action space defined above. At each decision step, the policy selects a joint action at random with probability ε and selects the action with the maximum Q-value under the current state with probability 1 − ε. To improve training efficiency, an exponential decay strategy is introduced, such that ε gradually decreases during training from an initial value of 1.0 to a minimum value of 0.01. In this way, the model explores a wide range of strategies in the early stage of training and focuses on exploiting the learned policy in the later stage.
In addition, this study introduces a dynamic learning rate adjustment mechanism into the conventional DQN framework, so that the learning rate can be adaptively adjusted during training. Specifically, as training proceeds, the learning rate gradually decreases to balance early-stage exploration and late-stage convergence. The learning rate adjustment formula is given in Equation (41), with the initial learning rate set to 0.01 and the minimum learning rate set to 0.001. Under this mechanism, the loss function can be directly expressed as Equation (42), in which the target network parameters are updated under the influence of the dynamic learning rate. This mechanism adjusts the parameter update rate by controlling the learning rate, thereby indirectly affecting the synchronization rhythm of the target network. As a result, a dynamic coupling among the learning rate, target value, and policy is established during training, which improves the stability and accuracy of policy convergence.
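Both decaying quantities (the exploration rate, 1.0 → 0.01, and the learning rate, 0.01 → 0.001) can be generated by the same clipped exponential schedule, sketched below; the decay rate of 0.995 is an assumed placeholder, since Equation (41) defines the exact form.

```python
def exp_decay(step, start, minimum, rate=0.995):
    """Exponential decay clipped at a floor, usable for both the exploration
    rate and the learning rate. The `rate` value is an assumed placeholder."""
    return max(minimum, start * (rate ** step))

# Exploration rate: 1.0 down to 0.01; learning rate: 0.01 down to 0.001.
epsilons = [exp_decay(k, 1.0, 0.01) for k in range(2000)]
lrs = [exp_decay(k, 0.01, 0.001) for k in range(2000)]
```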
To improve the sampling efficiency in complex tasks, this study proposes a dynamic prioritized experience replay (PER) mechanism. Building on conventional PER, this mechanism introduces a temporal decay factor and an access-frequency correction term, so that the sampling probability achieves a dynamic balance between focusing on high-value samples and avoiding over-concentration. The sampling probability is defined in Equation (43): the priority of each sample is derived from its TD error, an exponent controls the influence of priority on the sampling probability, a small smoothing term avoids zero sampling probability, and a correction term models the dynamic adjustment of samples with respect to time and access frequency. In this way, key samples with high TD errors are sampled more frequently in the early stage, while their priority is adaptively reduced as learning gradually converges, thereby avoiding overtraining. To correct the bias caused by non-uniform sampling, the importance weight is defined in Equation (44).
The importance weight in Equation (44) depends on the number of samples in the replay buffer and on a bias-correction exponent that gradually increases as training proceeds, thereby achieving a natural transition from exploration to stable convergence.
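The dynamic PER sampling probabilities and importance weights can be sketched as follows; the specific damping form and all constants are assumed placeholders, since Equations (43) and (44) fix the actual expressions.

```python
def sampling_probs(td_errors, visits, ages, alpha=0.6, eps=1e-3,
                   decay=0.99, freq_penalty=0.1):
    """Dynamic PER in the spirit of Equation (43): priority from |TD error|,
    damped by a temporal decay factor and an access-frequency correction.
    The damping form and constants are illustrative assumptions."""
    prios = [((abs(d) + eps) ** alpha) * (decay ** age) / (1.0 + freq_penalty * v)
             for d, v, age in zip(td_errors, visits, ages)]
    total = sum(prios)
    return [p / total for p in prios]

def importance_weights(probs, beta):
    """Bias-correcting weights for non-uniform sampling (Equation (44) spirit),
    normalized so the largest weight equals 1."""
    n = len(probs)
    w = [(n * p) ** (-beta) for p in probs]
    m = max(w)
    return [wi / m for wi in w]
```

As expected, samples with larger TD errors receive higher sampling probability, and frequently sampled transitions receive smaller importance weights so that the gradient estimate stays unbiased.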
3.3.2. Constraint Handling Based on Lagrangian Relaxation
To solve the CMDP model constructed above and handle complex constraints while maximizing profit, this study adopts the Lagrangian relaxation technique. Specifically, by introducing Lagrange multipliers, the user demand constraint and the carbon emission constraint described above are transformed into penalty terms and embedded into the reward function of reinforcement learning, yielding a modified one-step reward. The multipliers adaptively balance the trade-off between profit maximization and constraint satisfaction with respect to the service constraint threshold and the carbon emission threshold. When user demand is not satisfied or carbon emissions exceed the threshold, the reward is reduced accordingly, driving the policy to gradually learn feasible decisions that satisfy the constraints during training.
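The modified one-step reward can be sketched as a profit minus multiplier-weighted constraint violations; names are illustrative, only violations are penalized here, and the multiplier update rules are omitted.

```python
def penalized_reward(profit, unmet_demand, demand_limit,
                     emissions, emission_limit, lam_d, lam_c):
    """Lagrangian-relaxed one-step reward: profit minus penalties on the
    service constraint and the carbon constraint. Names are illustrative
    placeholders for the paper's symbols."""
    r = profit
    r -= lam_d * max(0.0, unmet_demand - demand_limit)
    r -= lam_c * max(0.0, emissions - emission_limit)
    return r
```

In a full primal–dual scheme the multipliers themselves would also be updated in proportion to the observed constraint violations, which is the "adaptive balancing" described above.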
3.3.3. The Template Policy-Based Reinforcement Learning Method
This study proposes the template policy-based reinforcement learning method, which enables cross-scenario policy parameter sharing and fine-tuning. During training, each scenario category is initialized with its own policy model, and reinforcement learning is performed on the corresponding typical operating trajectory. The trained policy parameters are then stored in a policy template library M, where K denotes the total number of scenario categories. For a new scenario s′ with a relatively small sample size, the Euclidean distance between its feature vector and the existing scenario feature vectors in the template library is calculated, and the most similar policy parameters are selected for initialization and further fine-tuning.
This mechanism not only promotes the reuse of knowledge, but also improves the adaptability to heterogeneous scenarios during the training process.
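The template-selection step can be sketched as a nearest-neighbor lookup over scenario feature vectors; the library entries and feature values below are hypothetical, and the fine-tuning stage is omitted.

```python
import math

def nearest_template(library, feature):
    """Pick the stored policy parameters whose scenario feature vector is
    closest (Euclidean distance) to the new scenario's features; a sketch
    of the template-reuse step with fine-tuning omitted."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    best = min(library, key=lambda entry: dist(entry["feature"], feature))
    return best["params"]

# Hypothetical two-template library (features could encode, e.g., normalized
# daily PV level and demand level of each clustered scenario).
library = [
    {"feature": [0.9, 0.2], "params": "sunny_office_policy"},
    {"feature": [0.1, 0.8], "params": "cloudy_residential_policy"},
]
```

The selected parameters then serve as the warm-start initialization for the new scenario, which is what alleviates the cold-start problem.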
4. Simulation Analysis
4.1. Experimental Settings
Because weather conditions and user demand at the electric–hydrogen integrated station differ from day to day, the operating parameters of the EHIS are not identical across days. To more realistically reflect the diversity of operating environments, this study collects photovoltaic and user load data from different regions of Shanghai. After data completion, a clustering analysis method is adopted to classify typical operating scenarios, and representative scenarios are finally obtained. Some of the device parameters are listed in Table 1. In this study, it is assumed that the electricity price and the carbon intensity of grey electricity vary dynamically within one day but are known one day in advance; their variation curves are shown in Figure 4. The hydrogen price and the associated penalty coefficient are both set to 50 CNY/kg, the two user demand penalty coefficients are both set to 1, the hydrogen storage safety factor is set to 80%, and the start-up costs of the electrolyzer and the fuel cell are both set to 78 CNY/time [36].
To evaluate the accuracy and generalization capability of the model in hydrogen production rate prediction, a PEM electrolyzer dataset containing 20,000 samples was constructed. The dataset consists of two parts: one part was obtained from PEM electrolyzer-related data collected and compiled by the authors, while the other part was generated from the original samples through missing-value completion and data enhancement. The samples cover multiple operating conditions and parameter combinations rather than different categories of electrolyzer devices. Specifically, the dataset spans different input power levels, operating temperatures, gas pressure conditions, membrane thicknesses, membrane electrode assembly contact areas, charge transfer coefficients, and exchange current densities, thereby reflecting the operating differences of PEM electrolyzers under diverse conditions. In addition, part of the original samples was compiled from published PEM electrolyzer experimental or processed data in the literature [28,37,38], mainly to improve the coverage of operating conditions and parameter combinations and thus enhance the representativeness of the dataset.
After preprocessing, the dataset was divided into a training set and a test set with a ratio of 8:2. Missing-value completion and data enhancement were applied only to the training set, whereas the test set retained the original unenhanced samples. This treatment was adopted for two reasons. First, augmenting only the training set helps increase sample diversity, alleviate the limited coverage of the original data, and improve the model’s ability to learn complex nonlinear relationships under multiple operating conditions. Second, keeping the test set in its original unenhanced form allows the predictive performance to be evaluated on data that are closer to realistic operating conditions, thereby avoiding performance bias caused by test-set augmentation. Therefore, the enhanced data were used only for model learning, whereas the final evaluation was always conducted on the original data distribution. In addition, all input features were normalized before training to eliminate dimensional differences and improve convergence stability. To ensure a fair comparison, all models used the same data partition, and the test set was kept strictly independent throughout the training process. All models took the 11-dimensional parameters defined in Equation (18) as inputs and the hydrogen production rate as the output.
For reproducibility, the main hyperparameters of the prediction models were fixed as follows. For the XGB module, the number of boosting rounds was set to 200, the maximum tree depth was set to 6, the learning rate was set to 0.05, the subsample ratio was set to 0.8, the column sampling ratio was set to 0.8, the minimum child weight was set to 1, and the L2 regularization coefficient was set to 1. For the deep neural network module stacked on the XGB output, two hidden layers with 64 and 32 neurons were adopted, the activation function was ReLU, the optimizer was Adam, the learning rate was set to 0.001, the batch size was set to 128, and the number of training epochs was set to 200. For the MLP baseline, two hidden layers with 128 and 64 neurons were used, the activation function was ReLU, the optimizer was Adam, the learning rate was set to 0.001, the batch size was set to 128, and the number of training epochs was set to 200. By recording the MAE and RMSE curves during the training process, the convergence behavior and accuracy differences of the models were analyzed.
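Collected in one place, the fixed hyperparameters stated above might be recorded as a configuration dictionary like the following; the key names are our own (chosen to mirror common XGBoost/Keras-style parameter names), while the values are exactly those stated in the text.

```python
# Hyperparameters as stated in the text; key names are illustrative.
HYPERPARAMS = {
    "xgb": {
        "n_estimators": 200, "max_depth": 6, "learning_rate": 0.05,
        "subsample": 0.8, "colsample_bytree": 0.8,
        "min_child_weight": 1, "reg_lambda": 1,
    },
    "dnn": {
        "hidden_layers": [64, 32], "activation": "relu", "optimizer": "adam",
        "learning_rate": 0.001, "batch_size": 128, "epochs": 200,
    },
    "mlp_baseline": {
        "hidden_layers": [128, 64], "activation": "relu", "optimizer": "adam",
        "learning_rate": 0.001, "batch_size": 128, "epochs": 200,
    },
}
```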
4.2. Prediction Performance of the Data-Driven Model
As shown in Figure 5, the black and red curves correspond to the training and testing errors of XGB, respectively, while the blue curve corresponds to the testing error of MLP. In terms of the overall trend, the error of XGB decreases rapidly in the early stage of training and converges quickly, and the training and testing curves match closely, indicating that the model has high learning efficiency and strong generalization capability without overfitting. In contrast, although the MAE and RMSE of MLP also decrease during training, their overall values remain consistently higher than those of XGB. As can be seen from the enlarged view in the upper right corner, the error fluctuation of XGB in the later stage of training is significantly smaller than that of MLP, demonstrating better stability. In summary, XGB outperforms the comparison algorithm in terms of convergence speed, prediction accuracy, and generalization performance.
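The MAE and RMSE metrics tracked in Figure 5 follow their standard definitions; a minimal implementation:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired samples."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root-mean-square error over paired samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
                     / len(y_true))
```

RMSE penalizes large deviations more heavily than MAE, which is why the two curves can diverge when a model makes occasional large errors.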
4.3. Optimization Performance of the EHIS
The representative scenarios obtained from the clustering analysis are shown in Figure 6. Scenario 1 is shown in Figure 6a–c, which present the PV output profile and the demand profiles of BEVs and HFCVs, respectively. This scenario represents an EHIS located near an office building on a sunny day. As shown in the figure, the PV output remains at a relatively high level throughout the day and reaches its peak at around 11:00. Since most users stay at work during daytime and return home at night, the demands of BEVs and HFCVs are concentrated around midday, while the demand levels in the morning and evening are relatively low.
Figure 7 shows the simulation results of Scenario 1. As can be seen from Figure 7a, in terms of PV allocation, the PV output fully covers the BEV demand from 6:00 to 18:00, while during the remaining periods the demand is jointly supplied by the fuel cell and the power grid. In particular, from 1:00 to 4:00, grid electricity purchases increase due to the low electricity price and insufficient hydrogen storage. The HFCV demand is preferentially satisfied by on-site supply because of the high hydrogen price. The evolution of the hydrogen storage level is shown in Figure 7b: it decreases slightly during the night, gradually accumulates from 6:00 with the increase in hydrogen production driven by PV power, reaches its peak at around 18:00, and then falls back to the initial level to ensure a sufficient reserve for the next day’s operation.
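The qualitative dispatch priority observed in Figure 7 (PV first for BEV charging, then the fuel cell when sufficient hydrogen is stored, with the grid covering the residual) can be illustrated with a toy greedy rule. This is a simplified sketch with hypothetical names, not the paper’s actual optimization model, which also accounts for prices and carbon intensity.

```python
def dispatch_hour(pv_kw, bev_kw, fc_cap_kw, h2_level_kg, h2_min_kg):
    """Toy greedy split of one hour's BEV electrical demand among
    PV, fuel cell, and grid, in that order of priority."""
    pv_used = min(pv_kw, bev_kw)          # PV serves BEV demand first
    residual = bev_kw - pv_used
    # Fuel cell helps only if the hydrogen reserve is above its floor.
    fc_used = min(fc_cap_kw, residual) if h2_level_kg > h2_min_kg else 0.0
    grid_used = residual - fc_used        # grid purchases cover the rest
    return pv_used, fc_used, grid_used
```

For example, with ample PV the fuel cell and grid stay idle, whereas at night the same demand is split between the fuel cell (hydrogen permitting) and the grid.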
Scenario 2 is shown in Figure 6d–f, which represents an EHIS located near a residential area under rainy-day conditions. As shown in the figure, the daily PV output remains generally low and reaches its peak at around 14:00. Since residents usually return home in the evening, the demands of BEVs and HFCVs are relatively low around midday but become higher in the early morning and evening. In this experiment, the hydrogen price is set to 40 CNY/kg, while all other parameters remain unchanged.
Figure 8 shows the scheduling results of Scenario 2. Due to the limited PV output, PV power is mainly used for BEV charging from 6:00 to 18:00, while the remaining demand is supplemented by the power grid. The fuel cell provides additional power only during the high-electricity-price period from 19:00 to 21:00. The HFCV demand is mainly satisfied by on-site supply, with pipeline hydrogen serving as a supplementary source. As shown in Figure 8b, the hydrogen storage level decreases during the night and increases during the daytime, reaches its peak at around 18:00, and then declines afterward. The results indicate that PV output and electricity price fluctuations significantly affect the coordinated scheduling strategy.
The carbon emission intensity of grey electricity has been given previously in this paper, while the carbon emission intensity of grey hydrogen is assumed to remain constant within one day. It is assumed that the externally purchased grey hydrogen comes from a stable fossil-fuel-based hydrogen production pathway, with a value of 24 kgCO2/kgH2.
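Under these assumptions, the daily carbon accounting reduces to a weighted sum: hourly grid purchases multiplied by the time-varying grid carbon intensity, plus externally purchased grey hydrogen at the constant 24 kgCO2/kgH2. A sketch with hypothetical function and variable names:

```python
GREY_H2_INTENSITY = 24.0  # kgCO2 per kg of grey hydrogen (value above)

def daily_emissions(grid_kwh, grid_intensity, grey_h2_kg):
    """Total daily carbon emissions.

    grid_kwh       -- hourly electricity purchased from the grid (kWh)
    grid_intensity -- hourly grid carbon intensity (kgCO2/kWh)
    grey_h2_kg     -- total grey hydrogen purchased that day (kg)
    """
    grid_part = sum(e * c for e, c in zip(grid_kwh, grid_intensity))
    return grid_part + GREY_H2_INTENSITY * grey_h2_kg
```

Because the grid term is a product of purchases and a time-varying intensity, shifting purchases toward low-intensity hours lowers total emissions even when the total energy bought is unchanged, which is exactly the mechanism discussed next.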
Figure 9 shows the temporal distribution of carbon emissions in Scenario 2. It can be observed that the optimization strategy effectively reduces the total emissions associated with grey electricity through a mechanism of purchasing more during low-carbon periods and less during high-carbon periods. This indicates that the scheduling decisions make full use of the time-varying characteristics of carbon intensity while balancing profit and user demand. By incorporating carbon intensity into the coordinated scheduling process, the proposed method further reduces the overall carbon emissions of the system while maintaining service quality and operational economic performance.
Table 2 summarizes the comparison results of scheduling optimization based on the physics-based model and the data-driven model under the three scenarios. Limited by the simplified assumptions of electrochemical process dynamics, the traditional physics-based model cannot fully characterize the nonlinear coupled effects of multiple variables on hydrogen production power. In contrast, the data-driven model, benefiting from its higher prediction accuracy, effectively reduces the decision bias caused by model mismatch and thus achieves higher operating profits in all three scenarios. These results strongly verify the significant advantage of high-accuracy modeling in improving the economic performance of the system.
To further evaluate the robustness of the proposed method under extreme weather-induced renewable fluctuations, an additional extreme scenario is introduced, as shown in Figure 10. This scenario is constructed based on Scenario 1 and represents an EHIS located near an office building under highly fluctuating weather conditions. In this scenario, the demand profiles of BEVs and HFCVs remain the same as those in Scenario 1, while the photovoltaic output exhibits significant short-term fluctuations during the daytime. Such fluctuations can reasonably simulate sudden variations in solar irradiance caused by intermittent cloud cover. Although the overall daily PV output remains at a relatively high level, noticeable drops and recoveries occur around the daytime peak period, indicating a much stronger volatility of renewable generation than that in Scenario 1. By keeping the transportation demand unchanged and perturbing only the PV profile, this extreme scenario is specifically designed to test the adaptability and robustness of the proposed scheduling strategy under highly uncertain renewable supply conditions.
From the perspective of control robustness, these results indicate that the proposed DRL-based scheduler does not rely on a single deterministic renewable pattern, but can adapt its dispatch policy online according to the changing operating state of the station. Even when the PV generation becomes highly intermittent, the learned policy still maintains stable operation by coordinating multiple energy supply pathways and the hydrogen storage buffer. This demonstrates that the proposed DRL approach has good robustness against renewable-side uncertainty and can preserve acceptable scheduling performance under disturbed operating conditions.
Figure 11 shows the simulation results of the extreme scenario. As can be seen from Figure 11a, under the highly fluctuating PV condition, PV power still remains an important energy source during daytime, but the station no longer relies on it as smoothly as in Scenario 1. Instead, the system responds through more flexible coordination of the fuel cell, the power grid, and hydrogen storage, so as to maintain the continuity of station operation. During the night and late evening, the demand is still mainly covered by the coordinated support of the fuel cell and the power grid. Meanwhile, due to the relatively high hydrogen price, the HFCV demand continues to be preferentially satisfied by on-site hydrogen supply, while the use of pipeline hydrogen remains limited.
The evolution of the hydrogen storage level is shown in Figure 11b. Similar to Scenario 1, the hydrogen quantity in the HST decreases slightly during the night, then increases during the daytime as surplus PV power is converted into hydrogen. However, owing to the intensified PV fluctuations, the variation of the hydrogen quantity in the HST becomes less smooth and exhibits more pronounced short-term fluctuations than in Scenario 1. During periods of reduced PV output, the stored hydrogen is released to support system operation, which helps buffer the impact of renewable intermittency. After reaching its daytime peak, the hydrogen quantity gradually decreases in the late afternoon and evening and finally returns to a level close to its initial value. These results indicate that, under the extreme fluctuating PV condition, the proposed scheduling strategy can still effectively coordinate PV generation, fuel cell output, grid purchase, and hydrogen storage operation, thereby maintaining stable energy supply and demonstrating good robustness against severe renewable uncertainty.
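The HST behavior described above follows a simple per-hour mass balance: electrolyzer production fills the tank, while fuel-cell operation and HFCV refueling draw it down. The function below is an illustrative sketch (the names and the clipping to tank capacity are assumptions, since the paper’s constraint formulation is not restated here):

```python
def hst_step(level_kg, prod_kg, fc_use_kg, hfcv_kg, cap_kg):
    """One-hour hydrogen storage tank (HST) balance.

    level_kg  -- stored hydrogen at the start of the hour (kg)
    prod_kg   -- electrolyzer output during the hour (kg)
    fc_use_kg -- hydrogen consumed by the fuel cell (kg)
    hfcv_kg   -- hydrogen dispensed to HFCVs (kg)
    cap_kg    -- tank capacity; the level is clipped to [0, cap_kg]
    """
    level = level_kg + prod_kg - fc_use_kg - hfcv_kg
    return min(max(level, 0.0), cap_kg)
```

Iterating this balance over 24 hours reproduces the qualitative shape in Figure 11b: a slight nighttime decline, a daytime rise driven by surplus PV, and a return toward the initial level by the end of the day.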
4.4. Improved DQN Algorithm and the TPRL Method
To quantitatively evaluate the independent contribution of each module in the proposed improved DQN algorithm, this study retrains the algorithms on scenarios similar to Scenario 3 under the same environment, the same random seed, and exactly the same training configuration, and compares the following five methods: the first is the baseline DQN with uniform experience replay; the second introduces only the dynamic learning rate (DQN + DLR); the third introduces only prioritized experience replay (DQN + PER); the fourth is the improved DQN proposed in this paper (DLR + PER); and the last is the DQN initialized by the policy template (i.e., TPRL).
As shown in Figure 12, all five methods exhibit the typical convergence characteristic of “rapid increase in the early stage and gradual stabilization in the later stage”, but they differ significantly in terms of convergence speed, profit level, and stability. The baseline DQN converges the slowest and shows the largest fluctuations, resulting in the lowest final profit. After introducing DLR, the convergence becomes faster, the profit level increases, and the oscillation is reduced. PER significantly improves the sampling efficiency and profit level, although its smoothness is slightly inferior to that of DLR. The improved DQN proposed in this paper achieves the best overall performance, combining the advantages of fast convergence, high profit, and high stability. In addition, since TPRL reuses the policy template of Scenario 3 for retraining, it achieves the fastest policy optimization speed.
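The two ablated mechanisms can be sketched in isolation. The exponential decay schedule for the dynamic learning rate and the proportional sampling exponent used for PER are illustrative choices, since the paper’s exact formulas are not restated in this section:

```python
import math
import random

def dynamic_lr(lr0, episode, decay=0.001, lr_min=1e-4):
    """Dynamic learning rate: large steps early for fast exploration,
    exponentially decayed toward a floor for stable late-stage
    convergence. The schedule itself is an assumed example."""
    return max(lr0 * math.exp(-decay * episode), lr_min)

def per_sample(priorities, k, alpha=0.6, rng=random):
    """Proportional prioritized experience replay: transition i is
    drawn with probability p_i**alpha / sum_j p_j**alpha, so
    high-TD-error transitions are replayed more often."""
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=k)
```

In a full implementation, PER additionally corrects the sampling bias with importance-sampling weights and refreshes each transition’s priority after replay; those details are omitted here for brevity.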
4.5. Performance Comparison of Different Algorithms
Table 3 compares the overall performance of GA, DDQN, the improved DQN, and TPRL under the three scenarios. The algorithms are evaluated from three perspectives, namely profit, user satisfaction, and carbon emissions. The results show that the two DRL-based methods, DDQN and the improved DQN, both outperform GA in all scenarios, demonstrating the superiority of reinforcement-learning-based scheduling over conventional optimization in EHIS operation. Furthermore, compared with DDQN, the improved DQN consistently achieves higher profit and satisfaction in all three scenarios, which verifies the effectiveness of the proposed dynamic learning-rate adjustment and prioritized experience replay mechanisms. Although DDQN yields slightly lower carbon emissions, the improved DQN provides a better overall balance between economic performance and service quality while maintaining emissions at an acceptable level. In addition, TPRL further improves the profit and satisfaction in Scenario 3 compared with direct training of the improved DQN, indicating that template-based policy reuse can provide better initialization and further enhance scheduling performance in similar scenarios. Overall, the improved DQN exhibits the best overall multi-objective trade-off among the directly trained algorithms, while TPRL further strengthens decision quality through template transfer.
It should also be emphasized that the superiority of the proposed method is not limited to a single operating condition. Across the three representative scenarios and the additional extreme PV fluctuation scenario, the improved DQN-based framework consistently maintains favorable performance in terms of profit, user satisfaction, and operational continuity. This suggests that the learned scheduling policy has a certain degree of robustness and adaptability to varying supply–demand patterns and renewable uncertainties, rather than being overfitted to one specific scenario.
Figure 13 shows the training performance of GA, DDQN, the improved DQN, and TPRL in Scenario 3. As can be seen from the figure, all four methods exhibit an overall upward trend in profit during the training process, but their convergence speed, final profit level, and stability are significantly different. Among them, TPRL achieves the fastest profit improvement in the early stage and converges to the highest final profit, indicating that the reuse of policy templates from similar scenarios can provide a better initialization and effectively alleviate the cold-start problem. The improved DQN also shows strong performance, with a higher final profit and better convergence stability than DDQN and GA. Compared with DDQN, the improved DQN converges to a clearly higher profit level, which further verifies the effectiveness of the proposed dynamic learning-rate adjustment and dynamic prioritized experience replay mechanisms. In contrast, although DDQN performs better than GA in the early training stage and reaches a higher final profit, its convergence speed and final performance are still inferior to those of the improved DQN. GA exhibits the slowest convergence and the lowest final profit among the compared methods, reflecting the limitation of conventional optimization algorithms in handling the sequential and highly coupled decision-making process of EHIS scheduling. Overall, the results demonstrate that the proposed improved DQN provides a more effective training process and better scheduling performance than the benchmark methods, while TPRL can further enhance convergence speed and final decision quality in similar scenarios.
Despite the favorable results, the proposed DRL-based method still has some limitations. The current validation is conducted on a limited number of representative scenarios and, although the additional extreme PV fluctuation case strengthens the robustness evaluation, it does not cover all possible uncertainties. Moreover, the policy generalization under significantly shifted scenario distributions and the scalability of the discretized DQN framework in larger EHIS applications still require further investigation.