Article

Energy Management of Electric–Hydrogen Coupled Integrated Energy System Based on Improved Proximal Policy Optimization Algorithm

1
State Grid Jiangsu Electric Power Co., Ltd., Research Institute, Nanjing 210023, China
2
State Grid Jiangsu Electric Power Co., Ltd., Nanjing 210023, China
*
Author to whom correspondence should be addressed.
Energies 2025, 18(15), 3925; https://doi.org/10.3390/en18153925
Submission received: 12 June 2025 / Revised: 7 July 2025 / Accepted: 18 July 2025 / Published: 23 July 2025
(This article belongs to the Special Issue Advances in Hydrogen Energy and Power System)

Abstract

The electric–hydrogen coupled integrated energy system (EHCS) is a critical pathway for the low-carbon transition of energy systems. However, the inherent uncertainties of renewable energy sources present significant challenges to optimal energy management in the EHCS. To address these challenges, this paper proposes an energy management method for the EHCS based on an improved proximal policy optimization (IPPO) algorithm. This method aims to overcome the limitations of traditional heuristic algorithms, such as low solution accuracy, and the inefficiencies of mathematical programming methods. First, a mathematical model for the EHCS is established. Then, by introducing the Markov decision process (MDP), this mathematical model is transformed into a deep reinforcement learning framework. On this basis, the state space and action space of the system are defined, and a reward function is designed to guide the agent toward the optimal strategy while accounting for the system's constraints. Finally, the efficacy and economic viability of the proposed method are validated through numerical simulation.

1. Introduction

Under the background of sustained and rapid economic development and accelerated industrialization, total energy consumption has shown significant growth [1,2]. Non-renewable coal, oil, and other traditional fossil energy sources are facing a double dilemma due to their excessive use: on the one hand, their non-renewable nature has led to the depletion of reserves, and on the other hand, large-scale fossil fuel consumption results in massive greenhouse gas emissions, triggering frequent extreme weather and environmental crises like air pollution and acid rain [3].
In this context, accelerating the transformation of the energy structure, optimizing the energy consumption pattern by increasing the proportion of clean energy, and constructing a new energy system dominated by renewable energy have become important strategic issues for breaking resource and environmental constraints and promoting the sustainable development of human society. The intermittent and fluctuating nature of renewable energy increases the complexity of the integrated energy system, which requires an energy management system to realize efficient consumption and stable regulation [4]. Depending on the model type, existing research methods can be broadly categorized into two main types: model-based optimization methods and learning-based methods [5]. Model-based optimization methods require knowledge of the state transition mechanism of the system; the main approaches include meta-heuristic optimization, mathematical programming, and control-theoretic methods. Meta-heuristic optimization methods show unique advantages in solving high-dimensional, nonlinear, and multimodal optimization problems by simulating natural phenomena or the collective intelligence of biological groups to construct an optimization framework with global search capability [6]. Ref. [7] developed a society-based Grey Wolf optimizer for the economic dispatch problem, improving solution efficiency and accuracy; however, such methods must trade off search breadth against search efficiency. Mathematical programming methods have therefore been widely applied to this problem. Ref. [8] introduced a stochastic optimization scheduling model for wind–solar–hydro hybrid systems, accounting for source and load uncertainties; the model uses Vine-Copula and Monte Carlo simulations to generate correlated wind and solar power output scenarios. Ref. [9] developed a multi-objective optimization framework for scheduling an integrated electric–thermal energy system, which constrains the scale of electric energy trading through economic costs so as to enhance environmental benefits while guaranteeing system economy. Control theory has also been applied to this problem in multiple research efforts [10,11]. These methods require uncertainty quantification within the system to accurately formulate the optimization problem, and solvers are needed to derive the optimal scheduling policy. It is crucial to acknowledge that the choice of models and parameters significantly influences the validity of the optimization outcomes [12].
In recent years, learning-based methods have seen widespread adoption in the power systems field [13]. Such data-driven methods can acquire models or control strategies directly from historical data by virtue of their model-free nature [14,15,16]. To handle real-time electricity price uncertainty in microgrid energy management, Ref. [17] used a deep Q-network to break through the dimensional bottleneck of Q-learning, but its reliance on a high-dimensional discretized action space limits efficiency. To address this, Ref. [18] developed a continuous-action-space deep reinforcement learning method for the economic dispatch of microgrids containing energy storage. Ref. [19] introduced the deep deterministic policy gradient to handle high-dimensional continuous control, but it suffers from hyperparameter sensitivity [20]. Ref. [21] used a PPO algorithm to significantly reduce the hyperparameter tuning cost while satisfying the system N-1 security constraints, which is very important in the energy management of integrated energy systems.
To enhance the agent's learning ability when interacting with the environment, this paper proposes an energy management optimization method for the EHCS based on an improved PPO (IPPO) algorithm, in which deterministic training is introduced to improve the convergence performance at the late stage of training. Deterministic training here means that, during the late stages of agent training, the optimal action is selected deterministically with a fixed probability, following a greedy policy, thereby enhancing the convergence efficiency of the algorithm. Furthermore, the algorithm's economic efficiency and computational efficiency are validated through a comparative analysis with alternative methods.

2. Mathematical Description of EHCS

The energy management problem within an EHCS is formulated as a mathematical optimization model. This model seeks to optimize the dispatch of various system devices over time, subject to load demand constraints, with the objective of maximizing economic efficiency. Furthermore, with the evolution of carbon emissions trading, renewable energy quota mechanisms, and green certificate trading, the electricity–carbon–green certificate joint market presents a novel direction for electricity market development. Consequently, this paper proposes an energy management model for an EHCS, considering the integration of the electricity–carbon–green certificate market.
Consider the EHCS shown in Figure 1 as an example; it contains CHP units, renewable energy generating units, a hydrogen storage system, and electric boilers.

2.1. Mathematical Modeling of Hydrogen Energy Storage System (HESS)

The HESS mainly consists of an electrolyzer, a hydrogen storage tank, and a fuel cell.
(1)
Mathematical Modeling of Electrolyzer
$$P_{\mathrm{H_2,el}}^{t} = \eta_{\mathrm{el}} P_{\mathrm{el}}^{t},$$
where $P_{\mathrm{H_2,el}}^{t}$ is the electrolyzer's hydrogen-producing power, $\eta_{\mathrm{el}}$ is the hydrogen-producing efficiency of the electrolyzer, and $P_{\mathrm{el}}^{t}$ is the power consumed by the electrolyzer.
(2)
Mathematical Modeling of Fuel Cell
$$P_{\mathrm{fc}}^{t} = \eta_{\mathrm{fc},p} P_{\mathrm{H_2,fc}}^{t},$$
$$Q_{\mathrm{fc}}^{t} = \eta_{\mathrm{fc},q} \left( 1 - \eta_{\mathrm{fc},p} \right) P_{\mathrm{H_2,fc}}^{t},$$
where $P_{\mathrm{H_2,fc}}^{t}$ is the hydrogen consumption power of the fuel cell, $\eta_{\mathrm{fc},p}$ is the electrical efficiency of the fuel cell, $\eta_{\mathrm{fc},q}$ is the residual heat utilization efficiency of the fuel cell, $P_{\mathrm{fc}}^{t}$ is the electricity generation power of the fuel cell, and $Q_{\mathrm{fc}}^{t}$ is the heat production power of the fuel cell.
(3)
Mathematical Modeling of Hydrogen Storage Tank
$$E_{\mathrm{hst}}^{t} = E_{\mathrm{hst}}^{t-1} + \left( \eta_{\mathrm{ch,hst}} P_{\mathrm{H_2,el}}^{t} - P_{\mathrm{H_2,fc}}^{t} / \eta_{\mathrm{dis,hst}} \right) \Delta h,$$
$$Soh^{t} = E_{\mathrm{hst}}^{t} / M_{\mathrm{hst}},$$
where $E_{\mathrm{hst}}^{t}$ is the energy stored within the hydrogen storage tank, $\eta_{\mathrm{ch,hst}}$ and $\eta_{\mathrm{dis,hst}}$ are the hydrogen charging and discharging efficiencies of the hydrogen storage tank, respectively, $M_{\mathrm{hst}}$ is the rated capacity of the hydrogen storage tank, and $Soh^{t}$ is the hydrogen storage state of the hydrogen storage tank.
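The storage-tank update above can be sketched as a single simulation step. This is a minimal illustration; the efficiency and capacity values used as defaults are placeholder assumptions, not the parameters from Table 2.

```python
def hess_step(E_prev, P_el, P_H2_fc, dt=1.0,
              eta_el=0.7, eta_ch=0.95, eta_dis=0.95, M_hst=4000.0):
    """One dispatch-interval update of the hydrogen storage tank.

    eta_el: electrolyzer hydrogen-production efficiency (assumed value)
    eta_ch / eta_dis: tank charging/discharging efficiencies (assumed)
    M_hst: rated tank capacity in kWh (assumed)
    """
    P_H2_el = eta_el * P_el                                   # electrolyzer hydrogen output
    E = E_prev + (eta_ch * P_H2_el - P_H2_fc / eta_dis) * dt  # tank energy balance
    soh = E / M_hst                                           # hydrogen storage state
    return E, soh
```

For example, starting from 1000 kWh with the electrolyzer drawing 100 kW for one hour and the fuel cell idle, the tank energy rises to about 1066.5 kWh under these assumed efficiencies.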

2.2. Objective Function

The primary objective of energy management within an EHCS is to minimize operational costs. The mathematical expression is as follows:
$$\min \sum_{t=1}^{T} C^{t} = \min \sum_{t=1}^{T} \left( C_{\mathrm{grid}}^{t} + C_{\mathrm{chp}}^{t} + C_{\mathrm{eb}}^{t} + C_{\mathrm{hess}}^{t} + C_{\mathrm{cet}}^{t} + C_{\mathrm{gct}}^{t} \right),$$
where $C_{\mathrm{grid}}^{t}$ is the cost of the system's participation in electricity market transactions, $C_{\mathrm{chp}}^{t}$ is the operating cost of the CHP units, $C_{\mathrm{eb}}^{t}$ is the operating cost of the electric boiler, $C_{\mathrm{hess}}^{t}$ is the operating cost of the HESS, $C_{\mathrm{cet}}^{t}$ is the cost of the system's participation in carbon emission rights trading, and $C_{\mathrm{gct}}^{t}$ is the cost of the system's participation in green certificate market transactions.
(1)
Equipment Operating Costs
$$C_{\mathrm{eb}}^{t} = \kappa_{\mathrm{eb}} P_{\mathrm{eb}}^{t} \Delta h,$$
$$C_{\mathrm{hess}}^{t} = \left( \kappa_{\mathrm{el}} P_{\mathrm{el}}^{t} + \kappa_{\mathrm{fc}} P_{\mathrm{fc}}^{t} \right) \Delta h,$$
$$C_{\mathrm{chp}}^{t} = \lambda_{\mathrm{gas}}^{t} \left( P_{\mathrm{chp}}^{t} \eta_{1} + Q_{\mathrm{chp}}^{t} \eta_{2} \right) \Delta h,$$
where $P_{\mathrm{eb}}^{t}$ is the power consumed by the electric boiler and $\kappa_{\mathrm{eb}}$ is its cost factor; $P_{\mathrm{el}}^{t}$ and $P_{\mathrm{fc}}^{t}$ are the powers of the electrolyzer and fuel cell, and $\kappa_{\mathrm{el}}$ and $\kappa_{\mathrm{fc}}$ are the unit operating costs of the electrolyzer and fuel cell, respectively; $P_{\mathrm{chp}}^{t}$ and $Q_{\mathrm{chp}}^{t}$ are the outputs of the CHP unit; $\lambda_{\mathrm{gas}}^{t}$ is the natural gas price; and $\eta_{1}$ and $\eta_{2}$ are the unit cost coefficients of the CHP unit.
(2)
Costs of Participating in Electricity Market Transactions
$$C_{\mathrm{grid}}^{t} = \begin{cases} \lambda_{e}^{t} P_{\mathrm{grid}}^{t} \Delta h, & P_{\mathrm{grid}}^{t} > 0 \\ \lambda_{e}^{t} P_{\mathrm{grid}}^{t} \lambda_{\mathrm{disc}} \Delta h, & P_{\mathrm{grid}}^{t} < 0 \end{cases},$$
where $\lambda_{e}^{t}$ is the electricity price, $\lambda_{\mathrm{disc}}$ is the discount factor for selling electricity, and $P_{\mathrm{grid}}^{t}$ is the power exchanged with the electricity market.
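The buy/sell asymmetry in this cost term can be captured in a few lines; the sell-side discount of 0.5 matches the case-study setting in Section 4.1, while the prices used in the example below are arbitrary.

```python
def grid_cost(P_grid, price, dt=1.0, disc=0.5):
    """Electricity-market transaction cost for one interval.

    P_grid > 0: power purchased at the full price.
    P_grid < 0: power sold, remunerated at price * disc, so the
    returned cost is negative (i.e., revenue).
    """
    if P_grid > 0:
        return price * P_grid * dt
    return price * P_grid * disc * dt
```

For instance, buying 100 kW for one hour at CNY 0.6/kWh costs CNY 60, while selling the same amount returns only CNY 30 because of the discount.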
(3)
Costs of Participating in Carbon Credit Market Transactions
The carbon emission rights trading pricing model mainly includes two mechanisms: unified pricing and ladder pricing. Under the unified pricing mechanism, the cost of the system's participation in carbon emission rights trading is defined as follows:
$$C_{\mathrm{cet}}^{t} = \lambda_{\mathrm{cet}} \left( m_{\mathrm{act}}^{t} - m_{0}^{t} \right),$$
$$m_{\mathrm{act}}^{t} = u_{\mathrm{act}} \max\left( 0, P_{\mathrm{grid}}^{t} \right) + u_{\mathrm{chp}} P_{\mathrm{chp}}^{t},$$
$$m_{0}^{t} = u_{0} \max\left( 0, P_{\mathrm{grid}}^{t} \right),$$
where $\lambda_{\mathrm{cet}}$ is the carbon trading unit price, with a value of CNY 0.058 per kilogram; $m_{\mathrm{act}}^{t}$ and $m_{0}^{t}$ are the actual carbon emissions and the carbon emission quota of the system; $u_{0}$ and $u_{\mathrm{act}}$ are the carbon emission quota and the actual carbon emissions per unit of electricity purchased from the electricity market; and $u_{\mathrm{chp}}$ is the carbon emissions per unit power of the CHP unit. Under the ladder pricing mechanism, the system's carbon trading cost is defined as follows:
$$C_{\mathrm{cet}}^{t} = \begin{cases} \lambda_{\mathrm{cet}} \left( 1 + \delta \right) \left( m_{\mathrm{act}}^{t} - m_{0}^{t} \right), & -2M < m_{\mathrm{act}}^{t} - m_{0}^{t} \le -M \\ \lambda_{\mathrm{cet}} \left( m_{\mathrm{act}}^{t} - m_{0}^{t} \right), & -M < m_{\mathrm{act}}^{t} - m_{0}^{t} \le M \\ \lambda_{\mathrm{cet}} \left( 1 + \delta \right) \left( m_{\mathrm{act}}^{t} - m_{0}^{t} \right), & M < m_{\mathrm{act}}^{t} - m_{0}^{t} < 2M \end{cases},$$
where $\delta$ is the growth coefficient of the carbon trading unit price, with a value of 0.25, and $M$ is the length of the carbon emission interval, with a value of 1000.
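One plausible reading of the ladder pricing rule is that the unit price is stepped up by the factor $(1+\delta)$ once the emission deviation leaves the base interval $(-M, M)$; the sketch below implements that reading with the parameter values stated in the text, and the tier structure should be treated as an interpretation rather than a definitive transcription.

```python
def carbon_cost_ladder(m_act, m_0, lam=0.058, delta=0.25, M=1000.0):
    """Ladder-priced carbon trading cost for one interval.

    lam:   carbon trading unit price (CNY 0.058/kg, from the text)
    delta: price growth coefficient (0.25, from the text)
    M:     length of the carbon emission interval (1000, from the text)
    """
    dm = m_act - m_0                   # deviation from the emission quota
    if -M < dm < M:                    # base tier: flat unit price
        return lam * dm
    return lam * (1 + delta) * dm      # outer tiers: stepped-up price
```

A deviation of 500 kg is charged at the base price, while a deviation of 1500 kg falls in the outer tier and is charged at 1.25 times the base price.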
(4)
Costs of Participating in Green Certificate Market Transactions
$$C_{\mathrm{gct}}^{t} = -\lambda_{\mathrm{gct}} \theta_{\mathrm{gct}} P_{\mathrm{rew}}^{t},$$
where $\lambda_{\mathrm{gct}}$ is the price of green certificates, with a value of CNY 50 per certificate, and $\theta_{\mathrm{gct}}$ is the green certificate quota parameter, with a value of 1 certificate per MWh; the negative sign reflects that converting renewable consumption into green certificates generates revenue for the system.

2.3. Constraints

(1)
Power Balance Constraints
$$P_{\mathrm{rew}}^{t} + P_{\mathrm{chp}}^{t} + P_{\mathrm{grid}}^{t} - P_{\mathrm{eb}}^{t} + P_{\mathrm{fc}}^{t} - P_{\mathrm{el}}^{t} = P_{\mathrm{load}}^{t},$$
$$Q_{\mathrm{chp}}^{t} + Q_{\mathrm{eb}}^{t} + Q_{\mathrm{fc}}^{t} = Q_{\mathrm{load}}^{t},$$
where $P_{\mathrm{rew}}^{t}$ is the consumed renewable power, including wind and photovoltaic power; $P_{\mathrm{load}}^{t}$ and $Q_{\mathrm{load}}^{t}$ are the electrical and thermal loads; and $Q_{\mathrm{eb}}^{t}$ is the thermal output of the electric boiler, where $Q_{\mathrm{eb}}^{t} = \eta_{\mathrm{eb}} P_{\mathrm{eb}}^{t}$ and $\eta_{\mathrm{eb}}$ is its efficiency parameter.
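The two balance equations can serve as a quick feasibility check for a candidate dispatch. A minimal sketch, assuming all powers are in consistent units and using a placeholder electric-boiler efficiency:

```python
def balanced(P, tol=1e-6, eta_eb=0.9):
    """Check the electric and thermal balance for one interval.

    P: dict with keys rew, chp, grid, eb, fc, el, load (electric side)
       and q_chp, q_fc, q_load (thermal side).
    eta_eb: assumed electric-boiler efficiency (illustrative value).
    """
    elec = (P["rew"] + P["chp"] + P["grid"]
            - P["eb"] + P["fc"] - P["el"] - P["load"])
    heat = P["q_chp"] + eta_eb * P["eb"] + P["q_fc"] - P["q_load"]
    return abs(elec) < tol and abs(heat) < tol
```

Any candidate action that fails this check would incur the constraint penalty described in Section 3.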
(2)
CHP Unit Operational Constraints
$$\left| P_{\mathrm{chp}}^{t} - P_{\mathrm{chp}}^{t-1} \right| \le P_{\mathrm{climb,max}},$$
$$\left| Q_{\mathrm{chp}}^{t} - Q_{\mathrm{chp}}^{t-1} \right| \le Q_{\mathrm{climb,max}},$$
where $P_{\mathrm{climb,max}}$ and $Q_{\mathrm{climb,max}}$ represent the maximum ramp of the power output and thermal output per unit time.
In addition, as shown in Figure 2, the CHP unit needs to satisfy the polygonal constraints.
(3)
Electric Boiler Operational Constraints
$$P_{\mathrm{eb,min}} \le P_{\mathrm{eb}}^{t} \le P_{\mathrm{eb,max}},$$
where $P_{\mathrm{eb,min}}$ and $P_{\mathrm{eb,max}}$ are the minimum and maximum power consumption of the electric boiler.
(4)
Energy Exchange Constraints
$$\left| P_{\mathrm{grid}}^{t} \right| \le P_{\mathrm{grid,max}},$$
where $P_{\mathrm{grid,max}}$ is the maximum power exchanged with the electricity market.
(5)
Renewable Energy Output Constraints
$$P_{\mathrm{rew}}^{t} \le P_{\mathrm{rew,act}}^{t},$$
where $P_{\mathrm{rew,act}}^{t}$ is the actual output of renewable energy in the system.
(6)
HESS Operational Constraints
$$P_{\mathrm{el,min}} \le P_{\mathrm{el}}^{t} \le P_{\mathrm{el,max}},$$
$$P_{\mathrm{fc,min}} \le P_{\mathrm{fc}}^{t} \le P_{\mathrm{fc,max}},$$
$$Soh_{\min} \le Soh^{t} \le Soh_{\max},$$
where $P_{\mathrm{el,min}}$ and $P_{\mathrm{el,max}}$ are the minimum and maximum power consumption of the electrolyzer, $P_{\mathrm{fc,min}}$ and $P_{\mathrm{fc,max}}$ are the minimum and maximum power production of the fuel cell, and $Soh_{\min}$ and $Soh_{\max}$ are the minimum and maximum hydrogen storage states of the hydrogen storage tank.

3. Energy Management Strategy of EHCS Based on IPPO Algorithm

The EHCS energy management system is modeled as an agent. The states observed from the environment include the renewable energy unit output, the electric and thermal loads, the energy prices, the CHP unit output at the previous time step, and the hydrogen storage state at the previous time step. Thus, the system state can be formulated as follows:
$$s^{t} = \left[ P_{\mathrm{rew}}^{t}, P_{\mathrm{load}}^{t}, Q_{\mathrm{load}}^{t}, \lambda_{e}^{t}, \lambda_{\mathrm{gas}}^{t}, Soh^{t-1}, P_{\mathrm{chp}}^{t-1}, Q_{\mathrm{chp}}^{t-1} \right].$$
In the EHCS, the operational decision variable is the output of each energy conversion unit. To address the temporal coupling of unit outputs, this paper employs the output increment of each unit as the decision variable. Given that P eb t and P grid t are derivable from balance constraints once the CHP unit output and hydrogen storage system output are determined, the system action can be formulated as follows:
$$a^{t} = \left[ \Delta P_{\mathrm{chp}}^{t}, \Delta Q_{\mathrm{chp}}^{t}, P_{\mathrm{el}}^{t}, P_{\mathrm{fc}}^{t} \right].$$
The reward function must encompass the system’s operational cost function to guide the agent toward the optimal policy. Given the critical need to avoid constraint violations in real-world operations, a penalty function is also integrated into the reward function, formulated as follows:
$$r^{t} = -\left( a_{1} C^{t} + a_{2} C_{\mathrm{ex}}^{t} \right),$$
$$C_{\mathrm{ex}}^{t} = p_{\mathrm{en}}\!\left( P_{\mathrm{grid}}^{t} \right) + p_{\mathrm{en}}\!\left( P_{\mathrm{eb}}^{t} \right),$$
$$p_{\mathrm{en}}(x) = \frac{\left| x - v_{\min} \right| + \left| x - v_{\max} \right|}{v_{\max} - v_{\min}},$$
where $C_{\mathrm{ex}}^{t}$ is the penalty term constraining $P_{\mathrm{grid}}^{t}$ and $P_{\mathrm{eb}}^{t}$, $\left[ v_{\min}, v_{\max} \right]$ is the feasible range of the penalized variable, $a_{1}$ represents the cost scaling factor, and $a_{2}$ denotes the penalty scaling factor; the negative sign converts cost minimization into reward maximization.
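Read literally, the penalty $p_{\mathrm{en}}$ normalizes the distance of a variable to both bounds of its feasible band. A short sketch, under the assumption that this literal form is intended (note it evaluates to 1 inside $[v_{\min}, v_{\max}]$ and grows linearly with the violation outside):

```python
def penalty(x, v_min, v_max):
    """Normalized range penalty p_en(x).

    Equals 1 for any x inside [v_min, v_max] and increases linearly
    as x moves outside the band, scaled by the band width.
    """
    return (abs(x - v_min) + abs(x - v_max)) / (v_max - v_min)
```

For example, with a band of [0, 10], any in-range value yields 1.0, while x = 15 yields 2.0.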
In this paper, an IPPO algorithm is proposed to find the optimal strategy for the energy management problem within the MDP framework. The PPO algorithm, an on-policy method, integrates a dynamic step size and importance sampling, stabilizing policy updates by clipping the surrogate objective function [22]. However, the agent's reliance on environment interactions for training data introduces significant uncertainty. This is particularly evident in later training stages, where substantial state changes during interactions can destabilize the policy and value functions, leading to reward curve degradation and potential convergence to local optima.
Based on the PPO algorithm, deterministic training is introduced to improve the convergence performance at the late stage of training. Deterministic training refers to selecting the highest-probability action with a fixed probability, following the greedy strategy, during the late stage of agent training; it is defined as follows:
$$p\!\left( a = \arg\max_{a} \pi\left( a \mid s, \vartheta \right) \right) = 1 - \epsilon,$$
where $\epsilon$ is the greedy parameter, with a value of 0.2, and $\vartheta$ denotes the policy network parameters.
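For a discrete action set, this late-stage selection rule can be sketched as follows; the discrete distribution is an illustration only, since the policy described in the paper may act over continuous outputs.

```python
import random

def select_action(probs, eps=0.2, rng=random):
    """Late-stage action selection: with probability 1 - eps pick the
    highest-probability action (greedy); otherwise sample from the
    policy distribution to retain some exploration.

    probs: list of action probabilities from the policy network.
    """
    if rng.random() < 1 - eps:
        # deterministic (greedy) choice
        return max(range(len(probs)), key=probs.__getitem__)
    # exploratory draw proportional to the policy probabilities
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With eps = 0 the rule is fully greedy; with eps = 0.2 (the value used in the text), one action in five is still drawn stochastically.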
The training framework for the EHCS based on the proposed IPPO algorithm is shown in Figure 3. It is worth noting that in the later stages of training, actions are selected based on the deterministic training strategy mentioned in the preceding paragraph.

4. Case Studies

4.1. Parameter Settings

To assess the efficacy of the proposed IPPO algorithm in optimizing energy management within the EHCS, we employ the system depicted in Figure 1 for simulation and analysis. Historical data, split into training and test sets, provide the renewable energy output and system’s load profiles [23]. The system operates on a 24 h dispatch period with hourly intervals. As detailed in Table 1, the system utilizes a time-of-use pricing mechanism, with a 0.5 discount factor applied to electricity sales. Natural gas is priced at a fixed rate of CNY 0.4/kWh [15]. The hydrogen storage system commences with an initial energy level of 1000 kWh. Additional operational parameters for the system’s equipment are presented in Table 2.
The policy network and the value network both employ fully connected layers with three hidden layers (128, 64, and 32 neurons, respectively) to form the neural network architecture, using the ReLU activation function in their hidden layers. The Adam optimizer is utilized to update the network’s weights, with specific hyperparameter configurations detailed in Table 3.
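The layer sizes can be checked with a shape-level sketch of the actor network. The random weights below merely stand in for trained parameters, and the 9-dimensional input / 4-dimensional output follow the state and action definitions in Section 3 (assumed here for illustration).

```python
import numpy as np

def mlp_forward(x, sizes=(9, 128, 64, 32, 4), seed=0):
    """Forward pass through a fully connected 128/64/32 ReLU network.

    Weights are randomly initialized here for illustration; in the
    actual method they are trained with the Adam optimizer.
    """
    rng = np.random.default_rng(seed)
    h = np.asarray(x, dtype=float)
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)  # He-style scale
        h = h @ W
        if i < len(sizes) - 2:            # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h
```

Feeding a 9-dimensional state vector yields a 4-dimensional output, matching the dimensionality of the action vector.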
The simulation experiments were implemented in Python and run on a computer with an AMD Ryzen 5 5600G CPU and 16 GB of RAM; all subsequent experiments used the same environment.

4.2. Analysis of the Training Process

The training reward curves of the IPPO and PPO algorithms are shown in Figure 4. Initially, the agent's reward is low due to environment exploration; it then gradually increases and stabilizes as experience accumulates. The original PPO exhibits fluctuating rewards in the late stage because of its stochastic policy, and the improved algorithm suppresses this phenomenon by introducing the deterministic training strategy. The analysis demonstrates that the IPPO algorithm learns the optimal strategy faster and more stably.

4.3. Analysis of Online Operation Results

Following the training of the proposed IPPO and PPO algorithms with historical data, the saved algorithm model is deployed for energy management within the EHCS. Test data, randomly selected from the historical dataset, are utilized in the test set. The system’s load curve, along with the wind and photovoltaic output curves, are illustrated in Figure 5 and Figure 6.
The energy management optimization outcomes derived from the method proposed in this study are illustrated in Figure 7 and Figure 8. As can be seen from Figure 7, under the time-of-use pricing mechanism, the EHCS will choose to purchase electricity from the electricity market to meet the main demand of the system’s power load during the low electricity price hours (e.g., 0:00~7:00) and at the same time start the electrolyzer to store energy in the HESS (e.g., 1:00 and 5:00~6:00). During high electricity price hours, it will choose to start the fuel cell to meet the system’s power load demand instead of purchasing electricity from the electricity market, thus avoiding the purchase of electricity during peak electricity price hours, which can achieve the effect of peak shaving and valley filling. It can be noticed that at 22:00, the combined renewable energy output and CHP unit output of the system exceeds the electric load, and the system conducts a power sale. This decision of the system is mainly considered to meet the system’s renewable energy consumption, and at the same time, improve the economic efficiency of the system.
As can be seen from Figure 8, under the time-of-use pricing mechanism, the EHCS starts the fuel cell to meet part of the system's thermal load during high electricity price hours (e.g., 8:00, 10:00, and 19:00–21:00). During low electricity price hours, the system instead relies mainly on the electric boiler to cover the portion of the thermal load not met by the CHP unit, rather than starting the fuel cell for heat supply. This decision reflects the fact that the electric boiler can operate in complementary synergy with the CHP unit: electricity market price signals adjust the electric-to-thermal conversion power in real time, dynamically matching the user-side thermal load curve while efficiently regulating the electric-to-thermal conversion. This synergistic mechanism helps the system achieve higher economic efficiency.
The optimization outcomes demonstrate that the proposed energy management strategy for the EHCS based on the IPPO algorithm can dynamically adjust the output of each device within the system in response to changes in the external environment.

4.4. Analysis of System Operation Results

The following five scenarios are set up to compare the impacts of the carbon emissions trading mechanism, the green certificate trading mechanism, and different carbon emissions pricing mechanisms on the operation of the system:
Scenario 1: Participation in the electricity trading market.
Scenario 2: Participation in the electricity trading market and the green certificate trading market.
Scenario 3: Participation in the power trading market and carbon trading market and using the unified pricing mechanism.
Scenario 4: Participation in the power trading market and carbon trading market and using the ladder pricing mechanism.
Scenario 5: Participation in the power trading market, green certificate trading market, and carbon trading market, and using the ladder pricing mechanism.
To evaluate the efficacy of our proposed system energy management optimization approach, this paper has benchmarked its performance against the IPSO algorithm, the PPO algorithm, the DDPG algorithm, and the stochastic mixed integer linear programming algorithm (SMILP) [24] within a comparative experimental framework.
In the SMILP model solution, the wind power curves in the test set are grouped by a clustering algorithm into 10 classes representing wind power uncertainty scenarios [25,26]. This classification forms the basis for defining the system operation cost error $e$:
$$e = \frac{\left| c_{i} - c_{\mathrm{SMILP}} \right|}{c_{\mathrm{SMILP}}} \times 100\%,$$
where $c_{\mathrm{SMILP}}$ is the system cost obtained by the SMILP method and $c_{i}$ is the cost obtained by each compared method.
In the IPSO algorithm, the particle swarm size is set to 500, the maximum iteration count to 1500, and the individual and population learning factors to 1.5 and 2, respectively.
Different algorithms are employed to address each specific scenario. Table 4 shows the average operating costs of the system on a test set of 30 randomly selected days.
(1)
Comparison of Different Algorithms
(1) Analysis of system’s operation costs
As can be seen from Table 4, in scenario 1 the system's average operating cost obtained by the SMILP method is CNY 17,310.20. The proposed IPPO method yields CNY 17,787.57, an error of 2.76%; the PPO method yields CNY 17,904.24, an error of 3.43%; the DDPG method yields CNY 17,943.56, an error of 3.66%; and the IPSO method yields CNY 17,867.35, an error of 3.22%. As shown in Figure 9, from the operation of the HESS, this error mainly arises from the timing and magnitude of the electrolyzer and fuel cell power: during low-tariff periods, the schedule obtained by the IPPO method is more consistent with that of the SMILP method than the schedules of the other methods. The main power deviation occurs at 11:00, when the SMILP method performs a hydrogen storage action while the IPPO method does not. In terms of operating cost, the proposed IPPO method demonstrates clear benefits compared to the IPSO and PPO methods, and it facilitates better optimization decisions for the energy management of the EHCS.
(2) Analysis of system online execution time
Using scenario 1 as an example, Table 5 displays the online execution time of the various methods. As shown, the proposed IPPO algorithm achieves an online execution time of 0.41 s, matching that of the PPO and DDPG algorithms and significantly outperforming the SMILP and IPSO methods. This efficiency stems from the offline training phase of the reinforcement learning model: once training is complete, the trained agent can be deployed directly for real-time energy management and scheduling in the EHCS without additional retraining.
(2)
Comparison of Different Scenarios
Scenario 2: Based on scenario 1, the green certificate market is analyzed. The system’s operational cost in this scenario is reduced by CNY 562.85 relative to scenario 1, primarily due to the economic advantages derived from converting wind and photovoltaic power consumption into green certificates.
Scenario 3: Based on scenario 1, a carbon trading market with a unified pricing mechanism is considered. Upon incorporating carbon emission costs, the total system operational cost increases from CNY 17,787.57 in scenario 1 to CNY 18,079.62. This escalation is attributed to the imposition of carbon emission costs. The carbon emission data for each scenario are presented in Table 6.
As can be seen from Table 6, under the carbon trading market with a unified pricing mechanism, the system's carbon emissions in scenario 3 decrease by 6.76% compared with scenario 1. This reduces emissions to a certain extent, but it also increases the system's operating cost.
Scenario 4: Based on scenario 3, a carbon trading market with a ladder pricing mechanism is proposed to expedite the low-carbon transition of the power system. The carbon emissions of the system at each time step, derived from the IPPO algorithm, are presented in Figure 10, within the framework of scenario 1 and scenario 4.
As illustrated in Figure 10, the disparity in system carbon emissions between the two scenarios primarily manifests during periods of low electricity prices. In scenario 1, the system procures electricity from the power market to meet load demand during these periods, concurrently curtailing the CHP unit output to optimize system economics, resulting in system carbon emissions of 10,980.75 kg. Conversely, in scenario 4, the incorporation of carbon emission costs and a tiered pricing mechanism renders the economic benefits of purchasing power from the market during low-price periods insufficient to offset carbon costs. Consequently, the system increases the CHP unit output to satisfy the load demand. This strategy reduces system carbon emissions to 6064.30 kg, a 44.77% reduction compared to scenario 1. However, the system’s operational cost escalates to CNY 18,355.14, primarily due to the differential impacts of the carbon pricing mechanisms on operational expenses.
Scenario 5: Considering both the green certificate market and the carbon trading market with a ladder pricing mechanism, as shown in Table 6, the system's carbon emissions in this scenario remain equivalent to those in scenario 4; however, the operational cost is reduced by CNY 562.85 relative to scenario 4. This reduction is primarily attributable to the integration of the green certificate market. Consequently, during the low-carbon transition of the power system, incorporating the green certificate market can mitigate, to some extent, the escalation of operational costs driven by environmental regulatory pressures.

4.5. Analysis of System Robustness

Figure 11 and Figure 12 show the impact of parameters in the proposed reward function on the agent’s accumulation of environmental experience.
For the cost scaling factor $a_{1}$, different selections influence the convergence performance of the reward function during training: a smaller $a_{1}$ leads to a smaller convergence value of the reward function, so this paper sets the cost scaling factor to 0.01. Regarding the penalty scaling factor $a_{2}$, when $a_{2} = 150$, the relatively large penalty intensity increases the volatility of the return function and reduces system stability; conversely, when $a_{2} = 50$, the smaller penalty factor weakens the convergence performance of the return function. Thus, 100 is chosen as the value of the penalty scaling factor.
It can be seen that the agent usually converges to relatively satisfactory results under different parameter configurations.

5. Conclusions

This paper introduces an advanced energy management optimization method for an EHCS utilizing an IPPO algorithm. Unlike conventional optimization methods, this method obviates the need for the precise prediction of system uncertainties by leveraging real-time observational data for decision-making. Compared to the PPO method, the proposed method incorporates a deterministic training strategy in the later stages of the learning process, facilitating more rapid and stable convergence toward optimal control policies. Multiple operational scenarios are simulated to evaluate the performance of the proposed method against existing methods, including IPSO, DDPG, and PPO. Results demonstrate a reduction in average system operational costs by 0.45%, 0.87%, and 0.65%, respectively. Additionally, the reinforcement learning-based methods exhibit superior online computational efficiency relative to model-based optimization methods. The findings substantiate the feasibility and efficacy of the IPPO algorithm in optimizing energy management within EHCS frameworks.
In addition, in the process of the low-carbon transition of the power system, the carbon trading market considering the ladder pricing mechanism can reduce the system carbon emissions by 44.77%. Further considerations of the green certificate market can effectively alleviate the problem of the operating cost increase caused by environmental pressure.
Future research can focus on further improving the performance of learning-based algorithms, such as considering the use of multi-agent algorithms to solve this problem, and considering the security constraints of the system, as mentioned in Ref. [27].

Author Contributions

Conceptualization, J.Z. and Z.G.; methodology, J.Z. and Z.G.; software, Z.G. and Z.C.; validation, J.Z. and Z.G.; formal analysis, J.Z. and Z.C.; investigation, J.Z. and Z.C.; resources, J.Z.; writing—original draft preparation, J.Z. and Z.G.; writing—review and editing, J.Z. and Z.C.; visualization, J.Z. and Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the State Grid Jiangsu Electric Power Co., Ltd. Technology Project under Grant J2024005 (Research on Planning and Operation Technology of Electro-Hydrogen Coupling System Driven by the Electric-Carbon-Green Certificate Market).

Data Availability Statement

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

Jingbo Zhao, Zhengping Gao and Zhe Chen were employed by the State Grid Jiangsu Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Alabi, T.M.; Aghimien, E.I.; Agbajor, F.D.; Yang, Z.; Lu, L.; Adeoye, A.R.; Gopaluni, B. A review on the integrated optimization techniques and machine learning approaches for modeling, prediction, and decision making on integrated energy systems. Renew. Energy 2022, 194, 822–849. [Google Scholar] [CrossRef]
  2. Zhao, J.; Song, Y.; Fan, H. Optimization Scheduling of Hydrogen-Integrated Energy Systems Considering Multi-Timescale Carbon Trading Mechanisms. Energies 2025, 18, 1612. [Google Scholar] [CrossRef]
  3. Yang, M.; Liu, Y. Research on multi-energy collaborative operation optimization of integrated energy system considering carbon trading and demand response. Energy 2023, 283, 129117. [Google Scholar] [CrossRef]
  4. Teng, F.; Zhang, Q.; Zou, T.; Zhu, J.; Tu, Y.; Feng, Q. Energy management strategy for seaport integrated energy system under polymorphic network. Sustainability 2022, 15, 53. [Google Scholar] [CrossRef]
  5. Liu, D.; Zang, C.; Zeng, P.; Li, W.; Wang, X.; Liu, Y.; Xu, S. Deep reinforcement learning for real-time economic energy management of microgrid system considering uncertainties. Front. Energy Res. 2023, 11, 1163053. [Google Scholar] [CrossRef]
  6. Wang, Y.; Hao, Y.; Wang, L.; Dang, X.; Jiang, L.; Zhang, Y. Multi-objective optimal dispatching for multi-energy microgrid based on improved particle swarm optimization algorithm. Electr. Meas. Instrum. 2023, 60, 29–36+59. [Google Scholar]
  7. Hosseini, S.; Beigvand, S.D.; Abdi, H.; Rastgou, A. Society-based Grey Wolf Optimizer for large scale Combined Heat and Power Economic Dispatch problem considering power losses. Appl. Soft Comput. 2022, 117, 108351. [Google Scholar] [CrossRef]
  8. Fan, Y.; Liu, W.; Zhu, F.; Wang, S.; Yue, H.; Zeng, Y.; Xu, B.; Zhong, P.-A. Short-term stochastic multi-objective optimization scheduling of wind-solar-hydro hybrid system considering source-load uncertainties. Appl. Energy 2024, 372, 123781. [Google Scholar] [CrossRef]
  9. Li, X.; Chen, Y.Z.; Li, H.W.; Liu, L.; Huang, J.Q.; Guo, P.F. Two-stage Robust Optimization of Low-Carbon Economic Dispatch for Electricity-Thermal Integrated Energy System considering Carbon Trade. Electr. Power Constr. 2024, 45, 58–69. [Google Scholar]
  10. Zhang, Y.; Fu, L.; Zhu, W.; Bao, X.; Liu, C. Robust model predictive control for optimal energy management of island microgrids with uncertainties. Energy 2018, 164, 1229–1241. [Google Scholar] [CrossRef]
  11. Zhao, Z.; Guo, J.; Luo, X.; Lai, C.S.; Yang, P.; Lai, L.L.; Li, P.; Guerrero, J.M.; Shahidehpour, M. Distributed Robust Model Predictive Control-Based Energy Management Strategy for Islanded Multi-Microgrids Considering Uncertainty. IEEE Trans. Smart Grid 2022, 13, 2107–2120. [Google Scholar] [CrossRef]
  12. Zhou, S.; Hu, Z.; Gu, W.; Jiang, M.; Chen, M.; Hong, Q.; Booth, C. Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach. Int. J. Electr. Power Energy Syst. 2020, 120, 106016. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Lin, Y.; Huang, G.; Yang, X.D.; Weng, G.Q.; Zhou, Z.Y. Review on applications of deep reinforcement learning in regulation of microgrid systems. Power Syst. Technol. 2023, 47, 2774–2788. [Google Scholar]
  14. Baldi, S.; Michailidis, I.; Ravanis, C.; Kosmatopoulos, E.B. Model-based and model-free “plug-and-play” building energy efficient control. Appl. Energy 2015, 154, 829–841. [Google Scholar] [CrossRef]
  15. Wang, X.; Zhao, Q.; Zhao, L. Energy management approach for integrated electricity-heat energy system based on deep Q-learning network. Electr. Power Constr. 2021, 42, 10–28. [Google Scholar]
  16. Feng, C.; Zhang, Y.; Wen, F.; Ye, C.; Zhang, Y.B. Energy management strategy for microgrid based on deep expected Q network algorithm. Autom. Electr. Power Syst. 2022, 46, 14–22. [Google Scholar]
  17. Ji, Y.; Wang, J.; Xu, J.; Fang, X.; Zhang, H. Real-Time Energy Management of a Microgrid Using Deep Reinforcement Learning. Energies 2019, 12, 2291. [Google Scholar] [CrossRef]
  18. Hua, H.C.; Qin, Y.C.; Hao, C.T.; Cao, J. Optimal energy management strategies for energy Internet via deep reinforcement learning approach. Appl. Energy 2019, 239, 598–609. [Google Scholar] [CrossRef]
  19. Chen, P.; Liu, M.; Chen, C.; Shang, X. A battery management strategy in microgrid for personalized customer requirements. Energy 2019, 189, 116245. [Google Scholar] [CrossRef]
  20. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2017, arXiv:1509.02971. [Google Scholar]
  21. Yang, Z.; Ren, Z.; Sun, Z.; Liu, M.; Jiang, J.; Yin, Y. Security-constrained economic dispatch of renewable energy integrated power systems based on proximal policy optimization algorithm. Power Syst. Technol. 2023, 47, 988–998. [Google Scholar]
  22. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  23. Belgium’s Electricity Transmission System Operator. Available online: https://www.elia.be/en/grid-data (accessed on 14 July 2024).
  24. Bischi, A.; Taccari, L.; Martelli, E.; Amaldi, E.; Manzolini, G.; Silva, P.; Campanari, S.; Macchi, E. A detailed MILP optimization model for combined cooling, heat and power system operation planning. Energy 2014, 76, 168–174. [Google Scholar] [CrossRef]
  25. Yin, Y.; Liu, T.; He, C. Day-ahead stochastic coordinated scheduling for thermal-hydro-wind-photovoltaic systems. Energy 2019, 187, 1552–1565. [Google Scholar] [CrossRef]
  26. Rahman, M.T.; Hasan, K.N.; Sokolowski, P. Evaluation of wind farm aggregation using probabilistic clustering algorithms for power system stability assessment. Sustain. Energy Grids Netw. 2022, 30, 100678. [Google Scholar] [CrossRef]
  27. Wei, F.X.; Yu, L.; Zhang, X.D. Fully distributed event-triggered security control for DC micro-grids subject to DoS attacks. IEEE Trans. Smart Grid 2025, 162, 929–941. [Google Scholar]
Figure 1. Framework of an EHCS.
Figure 2. Feasible operating region of the CHP unit.
Figure 3. Training framework for EHCS.
Figure 4. Training reward curve.
Figure 5. Schematic of load curves on the test day.
Figure 6. Schematic of renewable energy output curves on the test day.
Figure 7. Schematic of power load optimization results for the test day.
Figure 8. Schematic of thermal load optimization results for the test day.
Figure 9. Schematic of HESS capacity change.
Figure 10. Optimization results of system carbon emissions under different scenarios.
Figure 11. Average reward values for agent with different cost discount factors.
Figure 12. Average reward values for agent with different penalty discount factors.
Table 1. Time-of-use pricing mechanism.
Time Period      Hours                       Electricity Price (CNY/kWh)
Valley Period    23:00~6:00                  0.48
Flat Period      7:00, 11:00~17:00           0.90
Peak Period      8:00~10:00, 18:00~22:00     1.35
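The time-of-use tariff in Table 1 maps each hour of the day to one of three prices. A minimal lookup sketch (hours in 0–23):

```python
def tou_price(hour):
    """Time-of-use electricity price (CNY/kWh) per Table 1."""
    if 23 <= hour or hour <= 6:
        return 0.48   # valley: 23:00~6:00
    if hour == 7 or 11 <= hour <= 17:
        return 0.90   # flat: 7:00 and 11:00~17:00
    return 1.35       # peak: 8:00~10:00 and 18:00~22:00
```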
Table 2. System equipment operating parameters.
Parameter                      Value           Parameter                  Value
P_grid,max                     1000 kW         η_eb                       0.95
P_eb,min / P_eb,max            0/1200 kW       κ_eb                       0.05
P_climb,max / Q_climb,max      300/360 kW      η_1 / η_2                  0.8/0.75
P_el,min / P_el,max            0/200 kW        M_hst                      2000 kWh
Soh_min / Soh_max              0.2/0.9         P_fc,min / P_fc,max        0/200 kW
κ_el / κ_fc                    0.15/0.17       η_ch,hst / η_dis,hst       0.95/0.95
η_el                           0.6             η_fc,p / η_fc,q            0.6/0.88
u_0 / u_act / u_chp            0.7/0.96/0.26   C                          (960, 400)
A                              (0, 0)          D                          (200, 0)
B                              (960, 800)      E                          (0, 0)
Table 3. Deep neural network training parameters.
Parameter                        Value
Policy network learning rate     3 × 10⁻⁴
Value network learning rate      3 × 10⁻⁴
Reward discount factor           0.99
Greedy parameter                 0.2
Clipping parameter               0.2
Replay buffer size               10,000
Training episodes                1500
Minibatch size                   128
Table 4. Average operating cost of the system.
Methods \ Scenarios   Scenario 1   Scenario 2   Scenario 3   Scenario 4   Scenario 5
IPPO                  17,787.57    17,224.72    18,079.62    18,355.14    17,792.29
PPO                   17,904.24    17,341.39    18,269.35    18,511.39    17,948.54
DDPG                  17,943.56    17,380.71    18,334.42    18,586.63    18,023.78
IPSO                  17,867.35    17,304.50    18,187.41    18,445.62    17,882.77
SMILP                 17,310.20    16,747.35    17,589.21    17,831.82    17,268.97
Table 5. Average online execution time of the system.
Methods   Online Execution Time (s)
IPPO      0.41
PPO       0.41
DDPG      0.40
IPSO      14.12
SMILP     152.40
Table 6. Total carbon emissions of the system.
Scenarios                     Scenario 1   Scenario 2   Scenario 3   Scenario 4   Scenario 5
Total carbon emissions (kg)   10,980.75    10,980.75    10,238.21    6064.30      6064.30