4.1. Case Description
To verify the effectiveness of the proposed energy management method for integrated energy systems based on deep reinforcement learning, this section takes the integrated energy system model shown in
Figure 1 as an example to conduct simulation analysis on it. The case study evaluates the proposed DRL-based energy management strategy, which encompasses both supply-side dispatch (e.g., CHP, boilers, storage) and demand-side management via the TOU mechanism described in
Section 2.1. A key feature of the DRL-based approach is that the agents make decisions based only on the current state (as defined in
Section 3.1) without utilizing any prognostic data. This eliminates the need for, and the inherent errors associated with, weather, load, or generation forecasts, and allows the policy to react adaptively to the realized conditions. The electrical load, thermal load, and photovoltaic output data used for training and testing the DRL agents were synthetically generated using the open-source CREST model [
31]. This model simulates realistic stochastic energy demand and generation profiles based on probabilistic human behavior and weather patterns, and it has been validated and widely applied in energy systems research; the generated data therefore reflect the source-load uncertainties present in real integrated energy systems. Using this synthetic data generation method, we focus on evaluating algorithmic performance under uncertainty while ensuring reproducibility. Although the specific processes of individual enterprises are not modeled, the generated load profiles represent the aggregated demand typical of a mixed-use industrial park comprising light manufacturing, commercial building spaces (e.g., offices, research labs), and supporting facilities. The stochasticity in the load arises from the combined variations in production schedules, occupancy patterns, and equipment usage across these diverse entities. The energy management horizon is set to one week, with an interval of 1 h between consecutive time periods.
The operating parameters of the components of the integrated energy system are shown in
Table 1. The power exchanged between the system and the main power grid is limited to the range [0, 8] MW; that is, the system does not sell electricity back to the main grid. The capacity of the electrical energy storage and its other parameters are shown in
Table 2. The electricity price in the integrated energy system adopts time-of-use pricing, where the peak period is
, the off-peak period is
, and the valley period is
. The purchase and sale electricity prices are shown in
Table 3. The purchase price of natural gas is fixed at 400 yuan/(MW·h). The photovoltaic output over one week, together with the electrical and thermal load demands, is shown in
Figure 4.
The deep reinforcement learning method proposed in this paper is implemented on the PyTorch platform. The algorithm hyperparameters, taking DDPG as an example, are selected according to common practice in the deep learning community and tuned by trial and error on the training data. In the DDPG algorithm, both the policy network and the value network have two hidden layers with 128 neurons each, using the rectified linear unit (ReLU) activation function. The discount factor is 0.99, the mini-batch size is 256, the capacity of the experience replay buffer is 1,000,000, the learning rate of the policy network is 1 × 10−4, the learning rate of the value network is 3 × 10−4, the soft-update coefficient of the target networks is 0.01, and the Adam optimizer is used to update the network weights. Exploration in the DDPG algorithm is generated by an ε-greedy behavior policy: the initial exploration rate is 1.0, the final exploration rate is 0.01, and the exploration rate decays by a factor of 0.998 after each update.
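For concreteness, the following is a minimal PyTorch sketch of the network architecture and hyperparameters listed above. The state and action dimensions, the action scaling, and the interpretation of the 0.01 value as the target-network soft-update coefficient are illustrative assumptions, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 4  # placeholders; the real dimensions follow Section 3.1

class Actor(nn.Module):
    """Policy network: two hidden layers of 128 ReLU units, tanh-bounded output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh(),  # rescaled to device limits elsewhere
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: takes (state, action), two hidden layers of 128 ReLU units."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)    # policy-network learning rate
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)  # value-network learning rate

GAMMA, TAU = 0.99, 0.01              # discount factor, soft-update coefficient (assumed)
BATCH_SIZE, BUFFER_SIZE = 256, 1_000_000

# epsilon-greedy exploration schedule described above
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.998

def decay_epsilon(eps: float) -> float:
    """Multiplicative decay applied after each update, floored at the final rate."""
    return max(EPS_END, eps * EPS_DECAY)
```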
4.2. Comparative Analysis of the Optimization Management Costs of the DQN and DDPG Algorithms
To verify the general effectiveness of deep reinforcement learning for the energy optimization management of integrated energy systems, this paper applies two different deep reinforcement learning algorithms to the same problem. Specifically, the energy optimization management of the integrated energy system is carried out with both the DDPG algorithm and the DQN algorithm, and the two are compared to identify the better-performing approach.
For the DQN algorithm, the inputs and outputs are the same as those of DDPG. However, DQN requires a discretized action space; that is, the electric power of the combined heat and power unit, the charging and discharging power of the energy storage, the electric power of the electric boiler, and the thermal power of the gas boiler must be converted into discrete values. In this paper, each of these actions is discretized into three values at intervals of 0.75, 0.5, 0.75, and 0.8 MW, respectively. The deep Q-network has two hidden layers with 128 neurons each, using the rectified linear unit (ReLU) activation function.
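As an illustration of this discretization, the sketch below enumerates a joint discrete action set from three candidate levels per device at the stated intervals. The specific level values (e.g., the CHP and boiler ranges) are illustrative assumptions; only the intervals and the number of levels come from the text.

```python
from itertools import product

import numpy as np

# Three illustrative levels per control variable, spaced by the intervals quoted
# above: CHP electric power (0.75 MW), storage charge/discharge power (0.5 MW),
# electric-boiler power (0.75 MW), gas-boiler thermal power (0.8 MW).
levels = {
    "chp":     np.array([0.00, 0.75, 1.50]),
    "storage": np.array([-0.50, 0.00, 0.50]),   # within the [-0.5, 0.5] MW limit
    "eboiler": np.array([0.00, 0.75, 1.50]),
    "gboiler": np.array([0.00, 0.80, 1.60]),
}

# The joint action set for DQN is the Cartesian product of the per-device levels:
# 3^4 = 81 discrete actions, in contrast to the continuous action space of DDPG.
action_table = [np.array(a) for a in product(*levels.values())]
print(len(action_table))  # 81

def action_from_index(idx: int) -> np.ndarray:
    """Map a DQN output index to the vector of device set-points (MW)."""
    return action_table[idx]
```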
In this section, the agents of both the DDPG algorithm and the DQN algorithm are trained over a number of episodes until the reward values converge. The cumulative reward values obtained by the two algorithms are shown in
Figure 5 and
Figure 6. It can be observed from the figures that the DDPG-based energy optimization management of the integrated energy system was trained for only 4000 episodes, while the DQN-based one was trained for 10,000 episodes. This is because the cumulative reward of the DDPG algorithm reaches a convergent state after about 4000 episodes of training, whereas the DQN algorithm requires about 10,000 episodes to converge. Meanwhile, comparing the two figures shows that the energy management method based on the DDPG algorithm fluctuates less and is more stable than that based on the DQN algorithm. In addition, the TOU pricing mechanism effectively shifts load from peak to off-peak periods, reducing the average electricity purchase cost by approximately 12% compared to flat pricing.
The statistics of the weekly operating costs of the two algorithms are summarized in
Table 4. The average weekly operating cost of the DQN-based energy optimization management of the integrated energy system is 578,382 yuan, which is 8.6% higher than that of the DDPG algorithm. Moreover, in terms of the minimum, maximum, and standard deviation of the weekly cost, the DDPG algorithm outperforms the DQN algorithm and can effectively reduce the operating cost of the integrated energy system. The higher operating cost of the DQN-based approach is likely because DQN must discretize continuous actions such as the electric power of the cogeneration unit, the charging/discharging power of the electrical energy storage, the electric power of the electric boiler, and the thermal power of the gas boiler. Discretizing the action values significantly reduces the number of feasible action options available to the algorithm, which leads to suboptimal action choices and hence a suboptimal strategy, increasing the operating cost. The comparative analysis shows that the DDPG-based energy optimization management method is more effective and better suited to the dynamic energy scheduling problem of this system.
4.3. Analysis of Optimization Management Results Based on the DDPG Algorithm
To examine whether the deep reinforcement learning algorithm achieves energy optimization management in the integrated energy system, this section takes the DDPG-based energy optimization management as an example and analyzes in detail the regulation results for the electric power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage during the scheduling process.
During the scheduling process, the program plots the grid-exchange power and the storage charging/discharging power within one cycle every 100 iterations. Therefore, in this section, once the DDPG algorithm has converged, the grid-exchange power curve and the storage charging/discharging power curve within the same scheduling period are randomly extracted from this large set of plots, as shown in
Figure 7 and Figure 8.
As shown in
Figure 7, the vertical axis of the graph represents the power exchanged with the main grid, and the horizontal axis represents time. The electrical power exchanged between the system and the main grid remains within 0–8 MW, satisfying the constraint specified above. It can be seen from
Figure 7 that in the early stage of training, that is, from 0 to 40 h, the grid-exchange power fluctuates significantly, system stability is poor, and the electric power exhibits high peaks and low valleys. This is because, in the initial stage, the agent is still exploring the action space and has limited knowledge of the environment; since the initial value of epsilon is relatively large, actions are chosen at random with high probability. By the middle of training, that is, from 40 to 100 h, the fluctuation range of the grid power gradually decreases, although some sharp fluctuations remain. Through continued exploration and the use of accumulated experience, the agent gradually finds strategies that keep the system relatively stable, and stability improves: the peak grid power decreases from 5.5 MW to around 5.0 MW, and the valley rises from 3 MW toward about 3.5 MW, indicating that the system is gradually stabilizing. In the later stage of training, that is, after 100 h, the grid power fluctuations decrease further, and both the peak and valley values stabilize, the peak between 4 and 4.5 MW and the valley between 3.5 and 4 MW; system stability reaches a relatively high level. This indicates that the agent has learned a relatively mature strategy and can precisely control the operation of the energy equipment. In addition, it can be observed from the figure that during the peak period (12:00–19:00), the grid power is relatively large and fluctuates sharply, with multiple peaks and relatively high peak values. During the off-peak periods (5:00–12:00 and 19:00–24:00), the grid power and its fluctuations are smaller than during the peak period, although some fluctuation remains. During the valley period (0:00–5:00), the grid power fluctuation is small, and both the peak and overall values are low.
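A small helper like the following can tag each hour with the TOU period used in this discussion; the period boundaries are taken from the hours quoted in this paragraph, while the corresponding prices (not reproduced here) are those of Table 3.

```python
def tou_period(hour: int) -> str:
    """Classify an hour of the day into the TOU periods discussed above."""
    h = hour % 24
    if 12 <= h < 19:
        return "peak"        # 12:00-19:00
    if h < 5:
        return "valley"      # 0:00-5:00
    return "off-peak"        # 5:00-12:00 and 19:00-24:00

# Hourly period labels over the one-week scheduling horizon (168 steps of 1 h).
week_periods = [tou_period(t) for t in range(168)]
```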
As shown in
Figure 8, the vertical axis of the graph represents the charging and discharging power of the electrical energy storage, and the horizontal axis represents time. The charging and discharging power remains within −0.5 MW to 0.5 MW, satisfying the constraint specified above. During the initial stage of training (around 0 to 40 h), the charging and discharging power of the electrical energy storage fluctuates sharply, with multiple rapid rises and falls. This may be because, at the beginning of training, the system is still exploring different operation strategies, resulting in unstable charging and discharging power. During the middle stage of training (around 40 to 100 h), the fluctuation range decreases somewhat, but significant fluctuations remain, indicating that the system is gradually adjusting its strategy and attempting to find a stable charging and discharging pattern, although it has not yet reached the ideal state. In the later stage of training (after 100 h), the charging and discharging fluctuations decrease further, and the system approaches convergence.
To sum up, it can be seen from the results of the power exchange between the system and the main grid and the scheduling of the charging and discharging power of the electrical energy storage that the DDPG algorithm has gradually played a role in the energy optimization management of the integrated energy system and has achieved remarkable results.
From this, it can be inferred that deep reinforcement learning has advantages for the scheduling of various devices and energy in an integrated energy system, and it can optimize energy management in the integrated energy system.
4.4. Changes in DDPG-Based Optimization Management Under Different Photovoltaic Penetration Rates
This section conducts a further comparison experiment to demonstrate the role of deep reinforcement learning in the energy optimization management of integrated energy systems. To this end, the penetration rate of photovoltaic resources in the integrated energy system [
32], that is, the proportion of photovoltaic power generation in the total power generation, is varied, and DDPG is used to schedule the energy system under the different penetration rates. The selected photovoltaic penetration rates are 20%, 50%, and 80%. For each case, the DDPG algorithm is trained for 4000 episodes until convergence. The resulting cumulative reward values are shown in
Table 5 below.
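One simple way to construct these scenarios is to rescale the base photovoltaic profile so that its energy share of the total generation over the week matches the target penetration rate; the sketch below assumes hourly profile arrays as placeholders and is not necessarily the procedure used in [32].

```python
import numpy as np

def scale_pv_to_penetration(pv_profile: np.ndarray,
                            total_generation: np.ndarray,
                            target_penetration: float) -> np.ndarray:
    """Rescale a PV profile so its energy share of total generation equals the target."""
    current_share = pv_profile.sum() / total_generation.sum()
    return pv_profile * (target_penetration / current_share)

# Example: derive the 20%, 50% and 80% scenarios from a base week (placeholder arrays).
# pv_base, gen_total = ...  # hourly PV output and total generation over 168 h
# scenarios = {p: scale_pv_to_penetration(pv_base, gen_total, p) for p in (0.2, 0.5, 0.8)}
```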
It can be observed from
Table 5 that when the photovoltaic penetration rate is 20%, the absolute value of the cumulative reward is about 0.58, smaller than the original 0.65. When the photovoltaic penetration rate is 50%, the absolute value of the cumulative reward is about 0.53, which is 0.05 lower than at a 20% penetration rate. When the photovoltaic penetration rate reaches 80%, the absolute value of the cumulative reward drops to around 0.46. It can therefore be inferred that the higher the photovoltaic penetration rate, the smaller the absolute value of the cumulative reward obtained by the DDPG algorithm. Meanwhile, according to Equation (23), the smaller the absolute value of the cumulative reward, the lower the operating cost of the system.
The above data show that the higher the photovoltaic penetration rate, the smaller the share of generation required from other equipment and the lower the resulting cost. The reason is that, as the share of photovoltaic output in the integrated energy system increases, the share of equipment that the DDPG algorithm can dispatch (such as the combined heat and power unit, the electric boiler, and the gas boiler) decreases. As a result, the dynamic scheduling effect of the DDPG algorithm on the integrated energy system weakens, while the required equipment and energy costs also decrease. The reduced dispatchability, however, has adverse consequences for the integrated energy system.
To verify the credibility of these results, in addition to the cumulative reward values, this section also analyzes the power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage after the cumulative reward has converged.
- (1)
The power exchanged between the system and the main grid
Figure 9,
Figure 10 and
Figure 11 show the distribution of the power exchanged between the system and the main grid over one week at photovoltaic penetration rates of 20%, 50%, and 80%, respectively.
In
Figure 9, when the photovoltaic penetration rate is 20%, the electrical power fluctuates greatly in the early stage of training, with poor stability and occasional sudden changes; the peak is generally around 4.5–5.0 MW and the valley around 2.5–3.0 MW. In the later stage of training, the amplitude of the fluctuations becomes visibly smaller and stability is good, with the peak generally around 4.0–4.5 MW. Moreover, the valleys of the exchanged power usually occur during the off-peak electricity-price period, while the peaks usually occur during the peak price period. At this penetration rate, the power exchanged between the system and the main grid remains essentially normal, changing only slightly compared with the base case.
It can be seen from
Figure 10 that when the photovoltaic penetration rate is 50%, the fluctuation range of the electric power becomes very large. In the early stage of training, the peak is generally around 5.0–5.5 MW and the valley around 0.5–1.5 MW; the stability of the electric power is very poor and sudden changes occur frequently. In the later stage of training, the electrical power stabilizes slightly but still fluctuates greatly, with the peak around 5.5 MW and the valley around 1.5–2.0 MW. Moreover, it can be seen from
Figure 10 that the exchanged power is relatively low during the day and relatively high at night. It can be inferred that the power exchanged between the system and the main grid varies inversely with the output of the photovoltaic equipment: when the photovoltaic output is large, the grid power is small, and vice versa. Therefore, when the photovoltaic penetration rate is 50%, the DDPG algorithm can still optimize the energy management of the integrated energy system, but its effect begins to diminish.
In
Figure 11, when the photovoltaic penetration rate is 80%, the fluctuation range of the grid power is very large, and there is no obvious difference between the early and late stages of training. The peak grid power is generally around 4.5–5.0 MW, and the valley is generally around 0.5–1.0 MW. The distribution of the grid power now fully follows the pattern opposite to the photovoltaic output. The power required by the integrated energy system is essentially provided by the main grid and the photovoltaic equipment, and the photovoltaic output almost completely dominates the energy dispatch of the system. The optimization management role of the DDPG algorithm is therefore significantly reduced, although the algorithm can still optimize and manage the system.
In the figures, taking the fifth day, that is, between 96 h and 120 h, as an example, it can be clearly seen that when the photovoltaic penetration rate is 20%, the electrical power is relatively stable without significant sudden changes. When the penetration rate is 50%, the electric power undergoes a significant and sharp change, dropping from 5.5 MW to 2.0 MW almost instantaneously, from which it can be inferred that the integrated energy system is no longer stable. When the penetration rate is 80%, the sharp change in electrical power becomes even more pronounced, dropping from 5 MW to 0.5 MW, and the electrical power becomes unstable.
- (2)
Charging and discharging power of electrical energy storage
Based on the above analysis of the exchanged power between the system and the main grid, the charging and discharging power of the electrical energy storage will be analyzed below.
As shown in
Figure 12, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 20%, the storage equipment charges and discharges normally, with a relatively high number of charging and discharging cycles. In the early stage of training, the fluctuations in the charging and discharging power are quite intense, with multiple rapid rises and falls; in the later stage, the fluctuation range decreases. This indicates that, in this situation, the DDPG algorithm still plays a role in scheduling the charging and discharging power of the storage device, and the storage is well optimized and managed.
As shown in
Figure 13, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 50%, the storage equipment still charges and discharges normally. In the early stage of training, the charging and discharging power fluctuates greatly, while in the later stage it tends to stabilize. Therefore, when the photovoltaic penetration rate is 50%, the DDPG algorithm still plays a role in the energy optimization management of the integrated energy system.
As shown in
Figure 14, which presents the charging and discharging power of the electrical energy storage at a photovoltaic penetration rate of 80%, the charging frequency of the storage equipment decreases further. Sharp changes in the charging and discharging power appear in both the early and later stages of training. Overall, however, the storage is still being managed in an optimized manner. Therefore, at this penetration rate, the role of the DDPG algorithm in the energy optimization management of the integrated energy system is significantly reduced, but it still has an effect.
Taking the fourth day as a further example, when the photovoltaic penetration rate is 20%, the number of charge and discharge cycles of the electrical energy storage is normal and shows no significant change. When the penetration rate is 50%, the number of charge and discharge cycles decreases, but the power tends to stabilize and remains under the control of the optimization management of the algorithm. When the penetration rate is 80%, the number of charge and discharge cycles decreases more significantly, and the charging and discharging power becomes more unstable.
A pattern can be observed across these three figures: as the photovoltaic penetration rate increases, the number of charge and discharge cycles of the electrical energy storage decreases significantly, while the fluctuations become markedly larger. It can be inferred that, as the photovoltaic penetration rate grows, the influence of the DDPG algorithm on the dispatch of the storage equipment steadily diminishes, so that the storage is no longer managed in its normal optimized manner.
Through the analysis of the electrical power exchanged between the system and the main grid and the charging and discharging power of the electrical energy storage under different photovoltaic penetration rates, it can be concluded that the deep reinforcement learning algorithm plays a significant role in the energy optimization management and dynamic scheduling of the integrated energy system. When this role is continuously weakened, the power in the system becomes highly unstable and sudden power changes occur frequently. Photovoltaic penetration rates beyond 80% were not tested because of the increased system instability and reduced dispatchability; in practice, penetration rates above 80% typically require additional grid support mechanisms (e.g., storage, curtailment, or backup generation) or hybrid control strategies, which are beyond the scope of this study.
In conclusion, through the optimization of deep reinforcement learning algorithms, both the energy efficiency and stability in the integrated energy system have been improved. Therefore, deep reinforcement learning algorithms can achieve the goal of improving energy efficiency and promoting sustainable development in integrated energy systems, and they play a crucial role in the energy optimization management of integrated energy systems.
4.5. Changes in DQN-Based Optimization Management Considering Source-Load Side Fluctuations
To verify that deep reinforcement learning can handle fluctuations on the source-load side in an integrated energy system, this section takes the DQN algorithm as an example, adds fluctuations on both the source and load sides, and then uses the algorithm for scheduling to observe the changes in cost. In this study, we focus on DQN and DDPG as representative algorithms for discrete and continuous action spaces, respectively, which are most relevant to an energy scheduling problem containing both discrete and continuous control variables. While other algorithms such as PPO, TD3, and A3C are also promising, their inclusion would exceed the scope of this paper; they will be considered in future comparative studies.
This section assumes that both the photovoltaic and load forecasting deviations follow a normal distribution. The probability density function of the load forecasting error is
$$
f(\Delta P_{\mathrm{L}}) = \frac{1}{\sqrt{2\pi}\,\sigma_{\mathrm{L}} P_{\mathrm{L}}} \exp\left( -\frac{\Delta P_{\mathrm{L}}^{2}}{2\sigma_{\mathrm{L}}^{2} P_{\mathrm{L}}^{2}} \right)
$$
In the formula, $P_{\mathrm{L}}$ and $\Delta P_{\mathrm{L}}$ represent the load forecast value and the forecast deviation value, respectively, and $\sigma_{\mathrm{L}}$ is the standard deviation of the load forecasting deviation, expressed as a fraction of the forecast value.
The source-load side fluctuations selected in this paper are 10%, 20% and 30%, respectively; that is, the standard deviations are 0.1, 0.2 and 0.3. The photovoltaic power generation and load-side demand are shown in
Figure 15,
Figure 16 and
Figure 17. It can be seen from these figures that as the standard deviation increases, the fluctuations of the photovoltaic power generation and the load-side demand also gradually increase. In addition, the standard deviation of the grid-exchange power decreased from 1.8 MW in early training to 0.6 MW after convergence, indicating improved stability.
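A minimal sketch of how such fluctuations can be superimposed on the forecast profiles is given below, assuming zero-mean normal deviations with a standard deviation proportional to the forecast value; the array names are placeholders, and negative values are clipped only for the PV profile.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_fluctuation(forecast: np.ndarray, sigma_rel: float,
                    clip_nonnegative: bool = False) -> np.ndarray:
    """Superimpose a zero-mean normal deviation with std = sigma_rel * forecast."""
    deviation = rng.normal(loc=0.0, scale=sigma_rel * np.abs(forecast))
    perturbed = forecast + deviation
    return np.clip(perturbed, 0.0, None) if clip_nonnegative else perturbed

# Example: build the 10%, 20% and 30% fluctuation cases for one-week hourly profiles.
# load_forecast, pv_forecast = ...  # placeholder arrays of length 168
# cases = {s: (add_fluctuation(load_forecast, s),
#              add_fluctuation(pv_forecast, s, clip_nonnegative=True))
#          for s in (0.1, 0.2, 0.3)}
```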
The minimized costs obtained by the DQN algorithm when the source-load side fluctuations are 10%, 20%, and 30% are shown in
Table 6.
It can be seen from
Table 6 that, under source-load side fluctuations, the operating cost of the integrated energy system changes only slightly. This indicates that, even in the presence of such fluctuations, the DQN algorithm can still optimize the energy management of the integrated energy system, demonstrating that deep reinforcement learning can handle source-load side uncertainty in the integrated energy system.
The computational time for DDPG was approximately 8 h for 4000 episodes, while DQN required 12 h for 10,000 episodes on the same hardware (NVIDIA RTX 3080). Although DDPG has higher per-step complexity, its faster convergence makes it more suitable for real-time implementation.
The influence of key hyperparameters (e.g., learning rate, batch size) was tested through sensitivity analysis. Results show that learning rates between 1 × 10−4 and 3 × 10−4 yield stable convergence, while larger values lead to training divergence.