4.1. Case Description
To verify the effectiveness of the proposed energy management method for integrated energy systems based on deep reinforcement learning, this section takes the integrated energy system model shown in
Figure 1 as an example to conduct simulation analysis on it. The case study evaluates the proposed DRL-based energy management strategy, which encompasses both supply-side dispatch (e.g., CHP, boilers, storage) and demand-side management via the TOU mechanism described in
Section 2.1. A key feature of the DRL-based approach is that the agents make decisions based only on the current state (as defined in
Section 3.1) without utilizing any prognostic data. This eliminates the need for, and the inherent errors associated with, weather, load, or generation forecasts, and allows the policy to react adaptively to the realized conditions. The electrical load, thermal load, and photovoltaic output data used for training and testing the DRL agents were synthetically generated using the open-source CREST model [
31]. This model simulates realistic stochastic energy demand and generation profiles based on probabilistic human behavior and weather patterns, and it has been validated and widely applied in energy systems research; the generated data therefore reflect the source-load uncertainties present in real integrated energy systems. Using this synthetic data generation method, we focus on evaluating algorithmic performance under uncertainty while ensuring reproducibility. Although the specific processes of individual enterprises are not modeled, the generated load profiles represent the aggregated demand typical of a mixed-use industrial park comprising light manufacturing, commercial building spaces (e.g., offices, research labs), and supporting facilities. The stochasticity in the load arises from the combined variations in production schedules, occupancy patterns, and equipment usage across these diverse entities. The energy management horizon is set to one week, with an interval of 1 h between consecutive time periods.
The operating parameters of the components of the integrated energy system are shown in
Table 1. The power exchanged between the system and the main power grid is limited to the range [0, 8] MW; that is, the system does not sell electricity back to the main grid. The capacity of the electrical energy storage and its other parameters are shown in
Table 2. The electricity price in the integrated energy system adopts time-of-use pricing, where the peak period is
, the off-peak period is
, and the valley period is
. The purchase and sale electricity prices are shown in
Table 3. The purchase price of natural gas is fixed at 400 yuan/(MW·h). The photovoltaic output over one week, together with the electrical and thermal load demands, is shown in
Figure 4.
The deep reinforcement learning method proposed in this paper is implemented on the PyTorch platform. The algorithm hyperparameters, taking DDPG as an example, are selected according to common practice in the deep learning community and tuned by trial and error on the training data. In the DDPG algorithm, both the policy network and the value network have two hidden layers with 128 neurons each, using the rectified linear unit (ReLU) activation function. The discount factor is 0.99, the mini-batch size is 256, the capacity of the experience replay buffer is 1,000,000, the learning rate of the policy network is 1 × 10−4, the learning rate of the value network is 3 × 10−4, the soft-update coefficient of the target networks is 0.01, and the Adam optimizer is used to update the network weights. Exploration in the DDPG algorithm is generated by an ε-greedy behavior policy: the initial exploration rate is 1.0, the final exploration rate is 0.01, and the exploration rate decays by a factor of 0.998 after each update.
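For concreteness, the following is a minimal PyTorch sketch of the network architecture and hyperparameters listed above. The state and action dimensions, the action scaling, and the interpretation of the 0.01 value as the target-network soft-update coefficient are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 4  # placeholders; the real dimensions follow Section 3.1

class Actor(nn.Module):
    """Policy network: two hidden layers of 128 ReLU units, tanh-bounded output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh(),  # rescaled to device limits elsewhere
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: takes (state, action), two hidden layers of 128 ReLU units."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # policy-network learning rate
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)  # value-network learning rate

GAMMA, TAU = 0.99, 0.01              # discount factor, soft-update coefficient (assumed)
BATCH_SIZE, BUFFER_SIZE = 256, 1_000_000

# epsilon-greedy exploration schedule described above
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.998

def decay_epsilon(eps: float) -> float:
    """Multiplicative decay applied after each update, floored at the final rate."""
    return max(EPS_END, eps * EPS_DECAY)
```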
4.2. Comparative Analysis of the Optimization Management Costs of the DQN and DDPG Algorithms
To verify the general effectiveness of deep reinforcement learning for the energy optimization management of integrated energy systems, this paper applies two different deep reinforcement learning algorithms to the same problem. Specifically, the energy optimization management of the integrated energy system is carried out with both the DDPG algorithm and the DQN algorithm, and the two are compared to identify the better-performing approach.
For the DQN algorithm, the inputs and outputs are the same as those of DDPG. However, DQN requires a discretized action space; that is, the electric power of the combined heat and power unit, the charging and discharging power of the energy storage, the electric power of the electric boiler, and the thermal power of the gas boiler must be converted into discrete values. In this paper, each of these actions is discretized into three values at intervals of 0.75, 0.5, 0.75, and 0.8 MW, respectively. The deep Q-network has two hidden layers with 128 neurons each, using the rectified linear unit (ReLU) activation function.
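As an illustration of this discretization, the sketch below enumerates a joint discrete action set from three candidate levels per device at the stated intervals. The specific level values (e.g., the CHP and boiler ranges) are illustrative assumptions; only the intervals and the number of levels come from the text.

```python
from itertools import product

import numpy as np

# Three illustrative levels per control variable, spaced by the intervals quoted
# above: CHP electric power (0.75 MW), storage charge/discharge power (0.5 MW),
# electric-boiler power (0.75 MW), gas-boiler thermal power (0.8 MW).
levels = {
    "chp":     np.array([0.00, 0.75, 1.50]),
    "storage": np.array([-0.50, 0.00, 0.50]),   # within the [-0.5, 0.5] MW limit
    "eboiler": np.array([0.00, 0.75, 1.50]),
    "gboiler": np.array([0.00, 0.80, 1.60]),
}

# The joint action set for DQN is the Cartesian product of the per-device levels:
# 3^4 = 81 discrete actions, in contrast to the continuous action space of DDPG.
action_table = [np.array(a) for a in product(*levels.values())]
print(len(action_table))  # 81

def action_from_index(idx: int) -> np.ndarray:
    """Map a DQN output index to the vector of device set-points (MW)."""
    return action_table[idx]
```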
In this section, the agents of both the DDPG algorithm and the DQN algorithm are trained over a number of episodes until the reward values converge. The cumulative reward values obtained by the two algorithms are shown in
Figure 5 and
Figure 6. It can be observed from the figures that the DDPG-based energy optimization management of the integrated energy system was trained for only 4000 episodes, while the DQN-based one was trained for 10,000 episodes. This is because the cumulative reward of the DDPG algorithm reaches a convergent state after about 4000 episodes of training, whereas the DQN algorithm requires about 10,000 episodes to converge. Meanwhile, comparing the two figures shows that the energy management method based on the DDPG algorithm fluctuates less and is more stable than that based on the DQN algorithm. In addition, the TOU pricing mechanism effectively shifts load from peak to off-peak periods, reducing the average electricity purchase cost by approximately 12% compared to flat pricing.
The statistics of the weekly operating costs of the two algorithms are summarized in
Table 4. The average weekly operating cost of the DQN-based energy optimization management of the integrated energy system is 578,382 yuan, which is 8.6% higher than that of the DDPG algorithm. Moreover, in terms of the minimum, maximum, and standard deviation of the weekly cost, the DDPG algorithm outperforms the DQN algorithm and can effectively reduce the operating cost of the integrated energy system. The higher operating cost of the DQN-based approach is likely because DQN must discretize continuous actions such as the electric power of the cogeneration unit, the charging/discharging power of the electrical energy storage, the electric power of the electric boiler, and the thermal power of the gas boiler. Discretizing the action values significantly reduces the number of feasible action options available to the algorithm, which leads to suboptimal action choices and hence a suboptimal strategy, increasing the operating cost. The comparative analysis shows that the DDPG-based energy optimization management method is more effective and better suited to the dynamic energy scheduling problem of this system.
4.3. Analysis of Optimization Management Results Based on the DDPG Algorithm
To examine whether the deep reinforcement learning algorithm achieves energy optimization management in the integrated energy system, this section takes the DDPG-based energy optimization management as an example and analyzes in detail the regulation results for the electric power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage during the scheduling process.
During the scheduling process, the program plots the grid-exchange power and the storage charging/discharging power within one cycle every 100 iterations. Therefore, in this section, once the DDPG algorithm has converged, the grid-exchange power curve and the storage charging/discharging power curve within the same scheduling period are randomly extracted from this large set of plots, as shown in
Figure 7 and Figure 8.
As shown in
Figure 7, the vertical axis of the graph represents the power exchanged with the main grid, and the horizontal axis represents time. The electrical power exchanged between the system and the main grid remains within 0–8 MW, satisfying the constraint specified above. It can be seen from
Figure 7 that in the early stage of training, that is, from 0 to 40 h, the grid-exchange power fluctuates significantly, system stability is poor, and the electric power exhibits high peaks and low valleys. This is because, in the initial stage, the agent is still exploring the action space and has limited knowledge of the environment; since the initial value of epsilon is relatively large, actions are chosen at random with high probability. By the middle of training, that is, from 40 to 100 h, the fluctuation range of the grid power gradually decreases, although some sharp fluctuations remain. Through continued exploration and the use of accumulated experience, the agent gradually finds strategies that keep the system relatively stable, and stability improves: the peak grid power decreases from 5.5 MW to around 5.0 MW, and the valley rises from 3 MW toward about 3.5 MW, indicating that the system is gradually stabilizing. In the later stage of training, that is, after 100 h, the grid power fluctuations decrease further, and both the peak and valley values stabilize, the peak between 4 and 4.5 MW and the valley between 3.5 and 4 MW; system stability reaches a relatively high level. This indicates that the agent has learned a relatively mature strategy and can precisely control the operation of the energy equipment. In addition, it can be observed from the figure that during the peak period (12:00–19:00), the grid power is relatively large and fluctuates sharply, with multiple peaks and relatively high peak values. During the off-peak periods (5:00–12:00 and 19:00–24:00), the grid power and its fluctuations are smaller than during the peak period, although some fluctuation remains. During the valley period (0:00–5:00), the grid power fluctuation is small, and both the peak and overall values are low.
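A small helper like the following can tag each hour with the TOU period used in this discussion; the period boundaries are taken from the hours quoted in this paragraph, while the corresponding prices (not reproduced here) are those of Table 3.

```python
def tou_period(hour: int) -> str:
    """Classify an hour of the day into the TOU periods discussed above."""
    h = hour % 24
    if 12 <= h < 19:
        return "peak"        # 12:00-19:00
    if h < 5:
        return "valley"      # 0:00-5:00
    return "off-peak"        # 5:00-12:00 and 19:00-24:00

# Hourly period labels over the one-week scheduling horizon (168 steps of 1 h).
week_periods = [tou_period(t) for t in range(168)]
```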
As shown in
Figure 8, the vertical axis of the graph represents the charging and discharging power of the electrical energy storage, and the horizontal axis represents time. The charging and discharging power remains within −0.5 MW to 0.5 MW, satisfying the constraint specified above. During the initial stage of training (around 0 to 40 h), the charging and discharging power of the electrical energy storage fluctuates sharply, with multiple rapid rises and falls. This may be because, at the beginning of training, the system is still exploring different operation strategies, resulting in unstable charging and discharging power. During the middle stage of training (around 40 to 100 h), the fluctuation range decreases somewhat, but significant fluctuations remain, indicating that the system is gradually adjusting its strategy and attempting to find a stable charging and discharging pattern, although it has not yet reached the ideal state. In the later stage of training (after 100 h), the charging and discharging fluctuations decrease further, and the system approaches convergence.
To sum up, it can be seen from the results of the power exchange between the system and the main grid and the scheduling of the charging and discharging power of the electrical energy storage that the DDPG algorithm has gradually played a role in the energy optimization management of the integrated energy system and has achieved remarkable results.
From this, it can be inferred that deep reinforcement learning has advantages for the scheduling of various devices and energy in an integrated energy system, and it can optimize energy management in the integrated energy system.
4.4. Changes in DDPG-Based Optimization Management Under Different Photovoltaic Penetration Rates
This section conducts a further comparison experiment to demonstrate the role of deep reinforcement learning in the energy optimization management of integrated energy systems. To this end, the penetration rate of photovoltaic resources in the integrated energy system [
32], that is, the proportion of photovoltaic power generation in the total power generation, is varied, and DDPG is used to schedule the energy system under the different penetration rates. The selected photovoltaic penetration rates are 20%, 50%, and 80%. For each case, the DDPG algorithm is trained for 4000 episodes until convergence. The resulting cumulative reward values are shown in
Table 5 below.
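One simple way to construct these scenarios is to rescale the base photovoltaic profile so that its energy share of the total generation over the week matches the target penetration rate; the sketch below assumes hourly profile arrays as placeholders and is not necessarily the procedure used in [32].

```python
import numpy as np

def scale_pv_to_penetration(pv_profile: np.ndarray,
                            total_generation: np.ndarray,
                            target_penetration: float) -> np.ndarray:
    """Rescale a PV profile so its energy share of total generation equals the target."""
    current_share = pv_profile.sum() / total_generation.sum()
    return pv_profile * (target_penetration / current_share)

# Example: derive the 20%, 50% and 80% scenarios from a base week (placeholder arrays).
# pv_base, gen_total = ...  # hourly PV output and total generation over 168 h
# scenarios = {p: scale_pv_to_penetration(pv_base, gen_total, p) for p in (0.2, 0.5, 0.8)}
```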
It can be observed from
Table 5 that when the photovoltaic penetration rate is 20%, the absolute value of the cumulative reward is about 0.58, smaller than the original 0.65. When the photovoltaic penetration rate is 50%, the absolute value of the cumulative reward is about 0.53, which is 0.05 lower than at a 20% penetration rate. When the photovoltaic penetration rate reaches 80%, the absolute value of the cumulative reward drops to around 0.46. It can therefore be inferred that the higher the photovoltaic penetration rate, the smaller the absolute value of the cumulative reward obtained by the DDPG algorithm. Meanwhile, according to Equation (23), the smaller the absolute value of the cumulative reward, the lower the operating cost of the system.
The above data show that the higher the photovoltaic penetration rate, the smaller the share of generation required from other equipment and the lower the resulting cost. The reason is that, as the share of photovoltaic output in the integrated energy system increases, the share of equipment that the DDPG algorithm can dispatch (such as the combined heat and power unit, the electric boiler, and the gas boiler) decreases. As a result, the dynamic scheduling effect of the DDPG algorithm on the integrated energy system weakens, while the required equipment and energy costs also decrease. The reduced dispatchability, however, has adverse consequences for the integrated energy system.
To verify the credibility of these results, in addition to the cumulative reward values, this section also analyzes the power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage after the cumulative reward has converged.
- (1)
The power exchanged between the system and the main grid
Figure 9,
Figure 10 and
Figure 11 show the distribution of the power exchanged between the system and the main grid over one week at photovoltaic penetration rates of 20%, 50%, and 80%, respectively.
In
Figure 9, when the photovoltaic penetration rate is 20%, the electrical power fluctuates greatly in the early stage of training, with poor stability and occasional sudden changes; the peak is generally around 4.5–5.0 MW and the valley around 2.5–3.0 MW. In the later stage of training, the amplitude of the fluctuations becomes visibly smaller and stability is good, with the peak generally around 4.0–4.5 MW. Moreover, the valleys of the exchanged power usually occur during the off-peak electricity-price period, while the peaks usually occur during the peak price period. At this penetration rate, the power exchanged between the system and the main grid remains essentially normal, changing only slightly compared with the base case.
It can be seen from
Figure 10 that when the photovoltaic penetration rate is 50%, the fluctuation range of the electric power becomes very large. In the early stage of training, the peak is generally around 5.0–5.5 MW and the valley around 0.5–1.5 MW; the stability of the electric power is very poor and sudden changes occur frequently. In the later stage of training, the electrical power stabilizes slightly but still fluctuates greatly, with the peak around 5.5 MW and the valley around 1.5–2.0 MW. Moreover, it can be seen from
Figure 10 that the exchanged power is relatively low during the day and relatively high at night. It can be inferred that the power exchanged between the system and the main grid varies inversely with the output of the photovoltaic equipment: when the photovoltaic output is large, the grid power is small, and vice versa. Therefore, when the photovoltaic penetration rate is 50%, the DDPG algorithm can still optimize the energy management of the integrated energy system, but its effect begins to diminish.
In
Figure 11, when the photovoltaic penetration rate is 80%, the fluctuation range of the grid power is very large, and there is no obvious difference between the early and late stages of training. The peak grid power is generally around 4.5–5.0 MW, and the valley is generally around 0.5–1.0 MW. The distribution of the grid power now fully follows the pattern opposite to the photovoltaic output. The power required by the integrated energy system is essentially provided by the main grid and the photovoltaic equipment, and the photovoltaic output almost completely dominates the energy dispatch of the system. The optimization management role of the DDPG algorithm is therefore significantly reduced, although the algorithm can still optimize and manage the system.
In the figures, taking the fifth day, that is, between 96 h and 120 h, as an example, it can be clearly seen that when the photovoltaic penetration rate is 20%, the electrical power is relatively stable without significant sudden changes. When the penetration rate is 50%, the electric power undergoes a significant and sharp change, dropping from 5.5 MW to 2.0 MW almost instantaneously, from which it can be inferred that the integrated energy system is no longer stable. When the penetration rate is 80%, the sharp change in electrical power becomes even more pronounced, dropping from 5 MW to 0.5 MW, and the electrical power becomes unstable.
- (2)
Charging and discharging power of electrical energy storage
Based on the above analysis of the exchanged power between the system and the main grid, the charging and discharging power of the electrical energy storage will be analyzed below.
As shown in
Figure 12, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 20%, the storage equipment charges and discharges normally, with a relatively high number of charging and discharging cycles. In the early stage of training, the fluctuations in the charging and discharging power are quite intense, with multiple rapid rises and falls; in the later stage, the fluctuation range decreases. This indicates that, in this situation, the DDPG algorithm still plays a role in scheduling the charging and discharging power of the storage device, and the storage is well optimized and managed.
As shown in
Figure 13, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 50%, the storage equipment still charges and discharges normally. In the early stage of training, the charging and discharging power fluctuates greatly, while in the later stage it tends to stabilize. Therefore, when the photovoltaic penetration rate is 50%, the DDPG algorithm still plays a role in the energy optimization management of the integrated energy system.
As shown in
Figure 14, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 80%, the charging frequency of the storage equipment decreases further. Sharp changes in the charging and discharging power appear in both the early and later stages of training. Overall, however, the storage is still being managed in an optimized manner. Therefore, at this penetration rate, the role of the DDPG algorithm in the energy optimization management of the integrated energy system is significantly reduced, but it still has an effect.
Taking the fourth day as a further example, when the photovoltaic penetration rate is 20%, the number of charge and discharge cycles of the electrical energy storage is normal and shows no significant change. When the penetration rate is 50%, the number of charge and discharge cycles decreases, but the power tends to stabilize and remains under the control of the optimization management of the algorithm. When the penetration rate is 80%, the number of charge and discharge cycles decreases more significantly, and the charging and discharging power becomes more unstable.
A pattern can be observed across these three figures: as the photovoltaic penetration rate increases, the number of charge and discharge cycles of the electrical energy storage decreases significantly, while the fluctuations become markedly larger. It can be inferred that, as the photovoltaic penetration rate grows, the influence of the DDPG algorithm on the dispatch of the storage equipment steadily diminishes, so that the storage is no longer managed in its normal optimized manner.
Through the analysis of the electrical power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage under different photovoltaic penetration rates, it can be concluded that the deep reinforcement learning algorithm plays a significant role in the energy optimization management and dynamic scheduling of the integrated energy system. When this role is continuously weakened, the power in the system becomes highly unstable and sudden power changes occur frequently. Photovoltaic penetration rates beyond 80% were not tested because of the increased system instability and reduced dispatchability; in practice, penetration rates above 80% typically require additional grid support mechanisms (e.g., storage, curtailment, or backup generation) or hybrid control strategies, which are beyond the scope of this study.
In conclusion, through the optimization of deep reinforcement learning algorithms, both the energy efficiency and stability in the integrated energy system have been improved. Therefore, deep reinforcement learning algorithms can achieve the goal of improving energy efficiency and promoting sustainable development in integrated energy systems, and they play a crucial role in the energy optimization management of integrated energy systems.
4.5. Changes in DQN-Based Optimization Management Considering Source-Load Side Fluctuations
To verify that deep reinforcement learning can handle fluctuations on the source-load side in an integrated energy system, this section takes the DQN algorithm as an example, adds fluctuations on both the source and load sides, and then uses the algorithm for scheduling to observe the changes in cost. In this study, we focus on DQN and DDPG as representative algorithms for discrete and continuous action spaces, respectively, which are most relevant to an energy scheduling problem containing both discrete and continuous control variables. While other algorithms such as PPO, TD3, and A3C are also promising, their inclusion would exceed the scope of this paper; they will be considered in future comparative studies.
This section assumes that both the photovoltaic and load forecasting deviations follow a normal distribution. The probability density function of the load forecasting error is
$$
f(\Delta P_{\mathrm{L}}) = \frac{1}{\sqrt{2\pi}\,\sigma_{\mathrm{L}} P_{\mathrm{L}}} \exp\left( -\frac{\Delta P_{\mathrm{L}}^{2}}{2\sigma_{\mathrm{L}}^{2} P_{\mathrm{L}}^{2}} \right)
$$
In the formula, $P_{\mathrm{L}}$ and $\Delta P_{\mathrm{L}}$ represent the load forecast value and the forecast deviation value, respectively, and $\sigma_{\mathrm{L}}$ is the standard deviation of the load forecasting deviation, expressed as a fraction of the forecast value.
The source-load side fluctuations selected in this paper are 10%, 20% and 30%, respectively; that is, the standard deviations are 0.1, 0.2 and 0.3. The photovoltaic power generation and load-side demand are shown in
Figure 15,
Figure 16 and
Figure 17. It can be seen from these figures that as the standard deviation increases, the fluctuations of the photovoltaic power generation and the load-side demand also gradually increase. In addition, the standard deviation of the grid-exchange power decreased from 1.8 MW in early training to 0.6 MW after convergence, indicating improved stability.
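A minimal sketch of how such fluctuations can be superimposed on the forecast profiles is given below, assuming zero-mean normal deviations with a standard deviation proportional to the forecast value; the array names are placeholders, and negative values are clipped only for the PV profile.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_fluctuation(forecast: np.ndarray, sigma_rel: float,
                    clip_nonnegative: bool = False) -> np.ndarray:
    """Superimpose a zero-mean normal deviation with std = sigma_rel * forecast."""
    deviation = rng.normal(loc=0.0, scale=sigma_rel * np.abs(forecast))
    perturbed = forecast + deviation
    return np.clip(perturbed, 0.0, None) if clip_nonnegative else perturbed

# Example: build the 10%, 20% and 30% fluctuation cases for one-week hourly profiles.
# load_forecast, pv_forecast = ...  # placeholder arrays of length 168
# cases = {s: (add_fluctuation(load_forecast, s),
#              add_fluctuation(pv_forecast, s, clip_nonnegative=True))
#          for s in (0.1, 0.2, 0.3)}
```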
The minimized costs obtained by the DQN algorithm when the source-load side fluctuations are 10%, 20%, and 30% are shown in
Table 6.
It can be seen from
Table 6 that, under source-load side fluctuations, the operating cost of the integrated energy system changes only slightly. This indicates that, even in the presence of such fluctuations, the DQN algorithm can still optimize the energy management of the integrated energy system, demonstrating that deep reinforcement learning can handle source-load side uncertainty in the integrated energy system.
The computational time for DDPG was approximately 8 h for 4000 episodes, while DQN required 12 h for 10,000 episodes on the same hardware (NVIDIA RTX 3080). Although DDPG has higher per-step complexity, its faster convergence makes it more suitable for real-time implementation.
The influence of key hyperparameters (e.g., learning rate, batch size) was tested through sensitivity analysis. Results show that learning rates between 1 × 10−4 and 3 × 10−4 yield stable convergence, while larger values lead to training divergence.