A Multi-Agent Deep-Reinforcement-Learning-Based Strategy for Safe Distributed Energy Resource Scheduling in Energy Hubs

Abstract: An energy hub (EH) provides an effective solution for managing local integrated energy systems (IES), supporting the optimal dispatch and mutual conversion of distributed energy resources (DER) across multiple energy forms. However, the intrinsic stochasticity of renewable generation intensifies fluctuations in system energy production and widens peak-to-valley differences under large-scale grid integration, significantly reducing the stability of the power grid. This paper presents a distributed privacy-preserving energy scheduling method based on multi-agent deep reinforcement learning for EH clusters with renewable energy generation. First, each EH is treated as an agent, and the energy scheduling problem is transformed into a Markov decision process. Second, the objective function is defined as minimizing the total economic cost while accounting for carbon trading costs, guiding the agents toward low-carbon decisions. Finally, differential privacy protection is applied to sensitive data within the EH: noise is introduced via the energy storage systems so that gas and electricity purchases remain unchanged while the original data are obscured. Simulation results demonstrate that the agents can train on and learn from environmental information, generating real-time optimized strategies that effectively handle the uncertainty of renewable energy. Moreover, after noise injection, the original data are rendered unusable to adversaries while sensitive information remains protected.


Introduction
Energy hubs (EHs) are characterized by the capability of integrating distributed renewable energy sources (RES), thereby facilitating a reduction in fossil fuel consumption and the mitigation of carbon emissions [1][2][3]. However, due to the intrinsic stochasticity and variability of renewable energy, large-scale integration of wind and solar generation widens the system's peak-to-valley difference. Additionally, extensive displacement of traditional fossil-fuel-based generators reduces system flexibility, driving substantial curtailment of RES [4][5][6]. Apart from the impacts of intermittent and uncertain RES, the stochastic nature of user loads, the diversity among energy sources, and the interdependencies among different energy forms also pose significant challenges for optimizing and managing energy systems [7,8]. In this context, model-based optimization approaches, e.g., mixed-integer linear programming (MILP) [9], dynamic programming [10,11], and model predictive control (MPC), have been widely used to address such complex energy system scheduling problems. For instance, the authors in [12] employed MILP to optimize equipment selection and capacity configuration to minimize the annual economic cost. In [13], a two-stage MILP approach was employed to model the EH system, and Benders decomposition was utilized to solve the MILP problem. The authors in [14] applied dynamic programming to assess the performance of integrated electricity and heat networks by implementing decomposed electrical-hydraulic-thermal power flow calculations. In [15], a weighted model predictive control energy scheduling regime was presented to enhance the resilience of EH clusters against contingencies. These optimization algorithms often rely on precise mathematical models and full parameter information. However, due to the high complexity of EHs, establishing accurate models is challenging; meanwhile, the computational complexity of these algorithms grows exponentially with the number of decision variables, limiting their applicability to large-scale EH optimization and scheduling problems.
In this context, machine-learning (ML)-based algorithms have gained popularity in recent years due to their low dependency on model accuracy and decent computational performance, and they have been applied in various fields, such as communication [16,17], bio-science [18,19], and energy [20,21]. As a commonly used ML approach, reinforcement learning (RL) has been widely applied to energy system dispatch problems [22,23]. Through interaction between agents and the environment, RL learns the optimal action policy by trial and error to maximize cumulative rewards [24,25]. In the optimization and scheduling of EHs, RL can be employed to learn the dynamic characteristics of the system and the complex relationships between energy demand and supply, enabling autonomous decision making and optimized scheduling. In [26], the proximal policy optimization algorithm was used to address the dynamic scheduling problem of EHs in uncertain environments; however, this algorithm faces convergence difficulties when dealing with non-stationary problems. In [27], the deep deterministic policy gradient (DDPG) method was applied to solve the EH energy management problem based on a Stackelberg game model; it can handle high-dimensional state and action spaces but cannot address large-scale EH cluster cooperation. In [28], a multi-agent deep-deterministic-policy-gradient (MADDPG)-based RL algorithm for optimizing EH clusters was employed to address the uncertainty of renewable energy generation. Compared to single-agent algorithms, the training process of this model exhibits improved stability and convergence when tackling interactions among multiple agents.
Based on the above analysis, existing energy scheduling methods, such as MILP, DDPG, and MADDPG, exhibit different characteristics. MILP-based centralized optimization has been widely utilized in energy system scheduling; it is based on physical models and can provide the optimal solution. However, it requires full information from users, which is unrealistic when numerous end-user-side DERs are involved, as in this paper. Additionally, it depends heavily on the accuracy of the mathematical model and is computationally expensive for problems with many users, making it unsuitable for real-time scheduling [29,30]. RL-based algorithms such as DDPG are suitable for continuous state and action spaces, showing good adaptability to high-dimensional problems without requiring an accurate physical model, thus enabling offline training of energy system scheduling models and online application. Although DDPG performs well on continuous problems, it is not suitable for multi-agent collaborative scenarios, which is the case in this paper [31,32]. Multi-agent deep reinforcement learning (MADRL) applies deep reinforcement learning in a multi-agent environment. As one of the most commonly used MADRL methods, MADDPG employs centralized learning and decentralized execution, designed to address learning and decision making in multi-agent settings [33,34]. MADDPG is well suited to multi-agent collaboration and adversarial scenarios, exhibiting strong adaptability with lax requirements for model precision in collaborative environments [16,35]. Despite the effectiveness of MADDPG in dealing with energy hub (EH) cluster scheduling problems in complex environments, it requires access to sensitive information from subsystems, which poses potential risks of privacy leakage [36,37]. In this context, ensuring the accuracy and real-time performance of distributed subsystem scheduling while protecting data privacy warrants further exploration.
With the increasing number and variety of devices connected to the EH, a considerable amount of electricity consumption data is generated during the optimization scheduling process [38]. These data may contain sensitive or private information from devices and users, posing significant security risks. Therefore, addressing privacy and security concerns in the optimization scheduling process is of utmost importance. In the data analysis and processing stage, commonly used privacy protection methods include homomorphic encryption (HE) [39,40] and differential privacy (DP) techniques. HE enables data analysis and processing while preserving the confidentiality of plaintext data [41]. In [42], HE algorithms have been applied to address privacy protection challenges in distributed energy management frameworks. However, HE algorithms suffer from high computational complexity, leading to higher resource utilization, degraded system performance, and higher costs. On the other hand, DP methods involve simple operations, such as data perturbation and noise injection [43]. These techniques are more suitable for privacy protection tasks in large-scale EH clusters, particularly under limited computational capabilities and resource constraints. However, excessive introduction of noise, while greatly enhancing the privacy protection of sensitive data, can degrade the performance of the EH network and destabilize its control [44]. Therefore, the trade-off between privacy protection and the performance of energy system dispatch is significant.
Based on the aforementioned discussions, algorithms combining DP and MADDPG (denoted DP-MADDPG) are worth investigating for solving EH cluster scheduling problems with data privacy protection. Unlike approaches combining MADDPG and HE, which require more computational resources and thus squeeze the communication and computation resources allocated to other tasks within the system, DP-MADDPG exhibits low computational complexity, requiring fewer computational resources while achieving decent privacy protection performance, especially in scenarios involving multiple agents. The comparison of the different algorithms is given in Table 1. In [45], DP-MADDPG was applied to the optimal power scheduling of microgrids with data privacy protection; however, that work does not consider the complex interactions between different energy carriers in multi-energy systems. In this paper, we extend this approach to a heat-electricity-gas EH system, aiming to effectively solve the optimization problem in EH clusters. The main contributions of this paper are summarized as follows:
(1) The DP-MADDPG algorithm is adopted for distributed management of the EH cluster system. Each agent independently controls the operation of its local system and adjusts its local policy based on real-time observations and reward signals, enhancing the robustness and reliability of the scheduling decisions. Furthermore, through collaboration among multiple agents, the method addresses complex scheduling issues and improves the energy utilization efficiency of the system.
(2) Data privacy concerns are effectively addressed by the presented method. The method dynamically introduces noise interference and utilizes an energy storage system (ESS) to attenuate the noise, preserving the external transaction data while perturbing the internal network data. Additionally, an effective evaluation mechanism for EH privacy protection is established to mitigate the impact of data correlation on evaluation results, enabling the agents to generate noise data that satisfy the constraint conditions within a reasonable range.
The rest of this paper is organized as follows. Section 2 presents the model structure and equipment types for EH clusters. The optimal scheduling approach for EHs is detailed in Section 3. In Section 4, simulation results are presented to show the performance of the proposed approach, and the conclusion is drawn in Section 5.

Integrated Energy System Structure and Equipment Model

Integrated Energy System Structure
Figure 1 illustrates that the EH comprises power grids, district heating networks, and gas networks. Through the utilization of diverse energy conversion and storage devices, the EH facilitates the mutual supplementation and efficient utilization of energy resources among these networks, meeting diverse load requirements. Additionally, the EH mitigates the challenge of inadequate electricity generation resulting from the unpredictable fluctuations of renewable energy sources by procuring energy from the distribution grid and natural gas networks.

Model of Devices
(1) Combined heat and power (CHP): CHP is an efficient energy utilization system that achieves integrated use of energy by simultaneously generating electricity and heat through the combustion of natural gas. The model can be described as follows:
(2) Electric boiler (EB): The EB is a device that converts electrical energy into thermal energy, used to meet the heat network's load requirements when the CHP is not operational.
(3) Power-to-gas (P2G): P2G is an energy conversion technology. By converting surplus electricity into natural gas, P2G systems can store energy in the natural gas grid to meet peak energy demands or periods when renewable generation falls short. This helps alleviate the challenges posed by the intermittency and fluctuation of renewable energy sources. The model is as follows:
(4) Energy storage model: The energy storage system is utilized for balancing load and supply in a network. The output model can be described as follows: where x denotes the type of energy storage, and e, h, and g denote the power grid, heat network, and gas network, respectively.
It should be emphasized that network losses also have significant impacts on the decisions of economic energy dispatch.Since the networks for different energy carriers are not explicitly modeled in this paper, the impacts of network losses are considered in the energy prices.
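The device equations themselves are elided in this excerpt; as a minimal sketch, assuming a standard efficiency-based formulation (all efficiency values and function names below are illustrative placeholders, not the paper's parameters), the conversion and storage relations might look like:

```python
# Hedged sketch of the energy-conversion device models described above.
# Efficiencies are illustrative placeholders, not the paper's values.

def chp_output(gas_in_kw, eta_e=0.35, eta_h=0.45):
    """CHP converts natural gas into electricity and heat simultaneously."""
    return eta_e * gas_in_kw, eta_h * gas_in_kw

def eb_output(elec_in_kw, eta_eb=0.95):
    """Electric boiler converts electrical energy into thermal energy."""
    return eta_eb * elec_in_kw

def p2g_output(elec_in_kw, eta_p2g=0.60):
    """Power-to-gas converts surplus electricity into natural gas."""
    return eta_p2g * elec_in_kw

def storage_step(soc_kwh, p_ch_kw, p_dis_kw, dt_h=1.0,
                 eta_ch=0.95, eta_dis=0.95):
    """One-step state-of-charge update for any of the e/h/g storages."""
    return soc_kwh + eta_ch * p_ch_kw * dt_h - p_dis_kw * dt_h / eta_dis
```

As noted above, network losses would be folded into the energy prices rather than into these conversion relations.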

Constraints
(1) Energy balance constraint: The power balance constraint of the entire integrated energy system is expressed as
(2) Equipment operating constraints: For the P2G, CHP, and EB devices, power constraints and ramp constraints must be adhered to during operation as follows:
(3) Energy storage device constraints: The capacity constraints and ramp constraints to be satisfied by the energy storage devices on the different networks are expressed as follows, where the binary variables I_{i,t}^{X,ch} and I_{i,t}^{X,dis} are introduced to ensure that charging and discharging do not occur simultaneously.
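The constraint equations are elided in this excerpt; the following hedged sketch illustrates only the structure of a storage feasibility check, including the binary charge/discharge indicators. All limits and names are illustrative placeholders.

```python
def check_storage_constraints(soc, p_ch, p_dis, i_ch, i_dis,
                              soc_min=0.1, soc_max=0.9, p_max=50.0):
    """Illustrative feasibility check for one storage unit at one time slot.
    soc is the state of charge as a fraction of capacity; i_ch / i_dis are
    the binary charge/discharge indicators. Simultaneous charging and
    discharging is forbidden by i_ch + i_dis <= 1."""
    ok = True
    ok &= i_ch + i_dis <= 1                 # mutual exclusion
    ok &= 0.0 <= p_ch <= i_ch * p_max       # charge power bound
    ok &= 0.0 <= p_dis <= i_dis * p_max     # discharge power bound
    ok &= soc_min <= soc <= soc_max         # capacity bound
    return ok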
In this paper, we focus on the scenario of energy hubs, covering several energy conversion infrastructures, e.g., combined heat and power (CHP), electric boilers, electrolyzers, and various forms of storage, which enable the interchange of energy between heat, electricity, hydrogen, and natural gas. However, some of these energy infrastructures are geographically specific and highly dependent on regulatory environments; for example, CHP is commonly used in high-latitude regions but rarely seen in southern areas. Under the carbon neutrality target, policies are emerging to facilitate the displacement of gas boilers by electric heaters (heat pumps or electric boilers), but there is still a long way to go before large-scale heating system electrification. Regarding P2G devices, affordability and security issues still severely restrict their large-scale deployment, which depends heavily on technological advancement and policy incentives. Moreover, the decarbonization pathways proposed by different countries may vary tremendously, driving a preference for one type of energy infrastructure while suppressing the development of another. To this end, it is important to stress that the scenarios in this paper are not universally applicable under all conditions; nevertheless, the proposed method provides meaningful insight into solving economic energy dispatch across multiple agents, and the types of resources and application scenarios can be changed accordingly.

Carbon Trading Cost Model
Ignoring the carbon emissions from renewable power generators and energy storage, the devices participating in carbon trading are the CHP and P2G units. For each carbon emission source, if the actual carbon emissions exceed the carbon quota allocated for free, the excess portion must be purchased in the carbon trading market, while any remaining quota can be sold. Therefore, the carbon trading cost model can be established as follows:
(1) CHP carbon trading cost: CHP units are among the main carbon emission sources in the energy system. Assuming that the total carbon emission intensity and quota are proportional to the actual output, the carbon-related cost can be calculated as follows:
(2) P2G carbon trading cost: The P2G unit can capture CO2 from power plants or biogas. As shown in Equation (21), the conversion process of P2G can be divided into two steps, electrolytic hydrogen production and methanation, where the volume of CO2 consumed in this process is equal to the volume of CH4 produced.
Therefore, the output of the P2G unit can be converted into an equivalent volume of CH4, allowing us to further determine the reduction in carbon emission intensity achieved by the P2G unit.
In this context, ρ_CO2 = 1.977 kg/m³ represents the gas density of CO2, and α_CH4 denotes the calorific value of natural gas (CH4), which takes the value of 9.87 kWh/m³. Since the P2G unit is not a carbon emission source, its carbon quota is set to zero. Thus, the carbon trading cost can be calculated as follows:
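Using the two constants given in the text, a hedged sketch of the P2G carbon trading term might read as follows. The carbon price is an illustrative placeholder, and the sign convention assumes that captured CO2 earns revenue because the P2G carbon quota is zero.

```python
RHO_CO2 = 1.977    # kg/m^3, density of CO2 (from the text)
ALPHA_CH4 = 9.87   # kWh/m^3, calorific value of natural gas (from the text)

def p2g_carbon_credit(gas_out_kwh, carbon_price=0.05):
    """CO2 captured by a P2G unit: the CH4 volume produced equals the CO2
    volume consumed, so captured mass = RHO_CO2 * (energy / ALPHA_CH4).
    carbon_price (currency per kg) is a hypothetical placeholder."""
    v_ch4_m3 = gas_out_kwh / ALPHA_CH4       # CH4 volume produced
    co2_kg = RHO_CO2 * v_ch4_m3              # CO2 mass captured
    return -carbon_price * co2_kg            # negative cost = market revenue
```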

Objective Function
Minimizing the total operating cost of the integrated energy system is chosen as the objective function, which includes costs associated with external energy procurement, equipment operation, and maintenance, as well as carbon trading.The specific calculation method is as follows: Based on the above discussion, the optimization scheduling problem described in this paper can be formulated as follows:
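Since the cost equations are elided in this excerpt, a minimal sketch of the stated cost composition (external energy procurement, operation and maintenance, and carbon trading; all argument names are hypothetical) is:

```python
def total_cost(e_buy_kwh, e_price, g_buy_kwh, g_price, om_cost, carbon_cost):
    """Total operating cost for one time slot: external electricity and gas
    procurement plus O&M plus carbon trading cost. A scheduling policy would
    minimize the sum of this quantity over all time slots."""
    return e_buy_kwh * e_price + g_buy_kwh * g_price + om_cost + carbon_cost
```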

A Real-Time Optimal Energy Scheduling Method for EH Based on Distributed Deep Reinforcement Learning
In this section, each complete and independent EH is regarded as an agent responsible for controlling the energy dispatch operations within its system. The dispatch optimization problem is formulated as a Markov decision process (MDP), and the globally optimal decision is obtained through experience sharing and collaborative training. Due to the fluctuation and uncertainty of renewable energy output and load demand in the EH environment, as well as the involvement of multiple variables such as different energy sources, loads, devices, and markets, the joint state and action spaces grow explosively. To address these challenges, the MADDPG algorithm is adopted, which excels at handling complex tasks with high-dimensional state and action spaces while employing an adaptive strategy to cope with environmental uncertainty.

MADDPG Algorithm
Traditional algorithms like deep Q-learning and DDPG often suffer from unstable training and convergence difficulties when dealing with non-stationarity in multi-agent environments. To address these challenges, the MADDPG algorithm has been developed as a deep RL approach based on the deterministic policy gradient (DPG) and the actor-critic framework, specifically designed for multi-agent settings. In MADDPG, each agent maintains its own actor and critic networks, responsible for learning policies and evaluating policy value functions, respectively. During training, agents interact with the environment, selecting actions based on their actor networks and receiving rewards and subsequent states [46][47][48]. The agents' experiences are stored in a shared experience replay buffer. When updating network parameters, agents sample data from the buffer, calculate gradients, and update their respective network parameters accordingly. Furthermore, MADDPG incorporates techniques such as target networks and random sampling from the experience replay buffer to enhance training stability. These mechanisms contribute to the effectiveness of MADDPG in addressing the energy scheduling optimization problem. Detailed descriptions of the algorithm's design and practical application are provided in the subsequent section.
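As one concrete piece of the stabilization machinery mentioned above, MADDPG target networks are typically updated by Polyak averaging; a minimal sketch (the tau value is illustrative) is:

```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging used for target networks in DDPG/MADDPG:
    theta_target <- tau * theta_online + (1 - tau) * theta_target.
    Parameters are represented here as plain lists of floats."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]
```

A small tau makes the target networks track the online networks slowly, which damps the feedback loop between the critic's bootstrapped targets and its own updates.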

Parameter Space
In traditional reinforcement learning, the MDP describes the interaction between a single agent and the environment, where the agent selects actions based on the current state and evaluates the quality of its behavior through reward signals. MADDPG can be regarded as an extension of the MDP to multi-agent scenarios. Thus, a reinforcement learning model for integrated energy systems can be represented by three essential components: the state space S_i, the action space A_i, and the reward space R_i of agent i.
(1) State space: At time slot t, the state space of an EH cluster primarily encompasses the renewable energy generation (including wind power and photovoltaic generation) within each agent's region, the loads of the three energy networks, the gas consumption of the CHP unit, the electricity consumption of the EB and P2G devices, the electricity price, the gas price, and the charging and discharging actions of the energy storage systems. It can be defined as follows:
(2) Action space: The action of agent i at time slot t satisfies a_{i,t} ∈ A_i.
(3) Reward function: The reward of agent i given state s_{i,t} and action a_{i,t} can be described as follows, where λ is an integer indicating the number of constraints among (6)-(18) that are not satisfied at time slot t.
The essence of this approach is to search for the optimal solution within the feasible region defined by the optimization problem. Instead of using the traditional MILP method, which can be computationally prohibitive for real-time application, we turn to the MADDPG method for offline training. To achieve this, we first established the integrated heat-electricity-gas system model, in which the operational characteristics of the various energy conversion components and their interactions are specifically taken into account. Then, we formulated the EH economic dispatch optimization problem, which serves as the environment of the MADDPG model. The reward of the model comprises two parts: (1) the revenue gain (the objective of the optimization problem) and (2) the constraint violation penalty. Note that the coefficient of the constraint violation penalty is set very large so that, once an action violates a constraint, the reward becomes particularly bad. In this way, the model learns offline to search for the solutions with the highest reward, which can be interpreted as taking the actions that lead to the highest revenue without violating the physical constraints of the energy system. Given real-time EH operational conditions (environment parameters), the trained model tends to provide a decent decision for economic energy dispatch.
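The reward structure described above can be sketched as follows; the penalty coefficient is an illustrative placeholder, chosen only to be much larger than typical operating costs so that any constraint violation dominates the reward:

```python
def reward(operating_cost, n_violations, penalty=1e4):
    """Reward = negative economic cost minus a large penalty per violated
    constraint. n_violations plays the role of lambda in the text: the
    number of constraints among (6)-(18) violated at this time slot."""
    return -operating_cost - penalty * n_violations
```

With this shaping, a feasible low-cost action always outranks any infeasible one, which is what steers the offline search toward the feasible region.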

EH Privacy Protection Based on Differential Privacy
When reinforcement learning is used to train optimization scheduling models, the interaction between agents and the environment gives rise to security risks associated with data privacy breaches. In particular, when agents engage in data transactions with the external power and gas grids, the internal parameters of the agents become more susceptible to leakage. To safeguard data privacy in the EH, we adopt an efficient and computationally simple approach known as local differential privacy [49][50][51][52][53][54]. This approach not only allows the strength of the privacy protection to be quantified but also enables the noise addition process to be applied at each EH node. By individually adding noise to the local private information of each agent, the probability of privacy leakage is greatly reduced.

Algorithm Chart
The optimal energy scheduling process for EH cluster based on MADDPG is shown in Algorithm 1.
We employ the Laplace mechanism to add noise to the data as a privacy-preserving measure, and each agent controls its own noise addition process. Specifically, a local privacy dataset D_{i,t} of agent i is first introduced and expressed as follows. Secondly, the dataset is mapped to x_{i,t} = f(D_{i,t}) ∈ R^d and used to generate the Laplace noise Lap^n(Δf/ε) that constructs the DP vector y_{i,t}, where Δf and ε are the sensitivity and privacy budget of function f, respectively.
The privacy protection efficiency of agent i is assessed by computing the discrepancy between the original private information x_{i,t} and the perturbed information y_{i,t}. The formula is defined as follows: where S_{i,t} denotes the covariance matrix. Simultaneously, by incorporating constraints (6)-(18), the agent is guided to select noise addition actions that both satisfy the constraints and achieve the desired level of privacy protection. Since introducing external noise negatively affects the stability, security, and reliability of the energy networks, utilizing the internal ESS within the EH to provide the additional energy required for noise addition can mitigate this effect. By making the noise source internal to the network, the impact of introducing noise can be effectively managed, enabling flexible adjustment and control of noise injection according to the specific requirements and operational states of the network. Additionally, the ESS plays a role in energy balancing within the network, ensuring that sensitive data within the EH are perturbed without affecting energy transactions between the EH and external sources. Consequently, the added noise can be defined as follows, where the energy obtained through the ESS and used as noise, E_{ik,t}^{X,noise}, follows the Laplace distribution with probability density function parameters μ (usually taken as 0) and λ = Δf/ε.
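The Laplace mechanism described above can be sketched as follows, assuming element-wise noise with scale Δf/ε; the function name and signature are illustrative, not the paper's implementation:

```python
import numpy as np

def laplace_perturb(x, sensitivity, epsilon, rng=None):
    """Local differential privacy via the Laplace mechanism:
    y = x + Lap(0, sensitivity / epsilon), applied element-wise.
    Larger epsilon (privacy budget) means less noise and weaker privacy."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon          # the lambda = Delta_f / epsilon above
    return x + rng.laplace(loc=0.0, scale=scale, size=np.shape(x))
```

In the paper's setting, the drawn noise would additionally be clipped against constraints (6)-(18) and supplied physically by the ESS, so that external gas and electricity transactions remain unchanged.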

Case Studies
The presented optimization scheduling model is applied to a cluster comprising four distributed EHs.Each EH can achieve a supply-demand balance through electricity and gas procurement operations.Time-of-use pricing is adopted for electricity and gas procurement/sales from the grid, with differentiated prices during different time intervals, as depicted in Figure 2. The operational parameters of the EH's devices are provided in Table 1.

Analysis of Optimized Schedule Results
To address the aforementioned cluster, this paper establishes an EH cluster optimization scheduling model based on the DP-MADDPG algorithm and conducts simulation experiments in a Python 3.7 environment to validate the effectiveness of the proposed methodology. The specific parameters of the DP-MADDPG algorithm are outlined in Tables 2 and 3. The model consists of four agents, and Agent-4 is chosen as a case study for the experimental analysis. All models are trained for 2500 episodes. The total reward for each iteration is calculated by summing the rewards obtained by each agent across the 24 time steps in every round, and the data are averaged every 50 episodes. The convergence of the reward value for Agent-4 and the total reward value are illustrated in Figure 3. It can be observed that, during the initial stages of training, while the action networks are in the exploration phase, the reward values are low and fluctuate significantly. However, as the agents begin to learn from historical data sampled from the experience replay buffer, the reward values show a clear upward trend. Around 500 episodes, the reward curve stabilizes at a higher level, and the total reward eventually converges to around −4000. As can be observed, DP-MADDPG and MADDPG show good convergence, while DDPG fails to converge within 2500 episodes. Regarding the converged reward, MADDPG is higher than DP-MADDPG, since the introduction of data privacy protection incurs increased operational costs (compromising solution accuracy). Therefore, the trade-off between the privacy protection level and the economy of energy system dispatch is critical.

Optimization Results Analysis
After offline training of the MADDPG networks using historical data, the trained networks are saved for dynamic economic scheduling of the system. Considering that different agents have distinct reward evaluation criteria during training, this section presents three different energy network models. After training is completed, the power output and exchange power variations of each device within a single period are depicted by the corresponding curves in Figure 5.
From 0-7 h, the CHP unit is inactive, and the EB device is utilized to provide heat to the heating network, meeting the heat load requirements. The P2G device consumes electricity to supply gas to the gas network, satisfying the gas load and selling excess natural gas for economic benefit. Due to the fluctuation and uncertainty of renewable energy generation, photovoltaic generation is zero during the night, and wind power cannot meet the demands of the grid and the other electrical devices. The agent compensates for this power deficit by purchasing electricity from the main grid. Additionally, owing to the low electricity prices, the energy storage system on the grid adopts a strategy of storing electricity to cope with peak power demands, achieving efficient utilization and optimization of energy resources.
From 8-23 h, the CHP unit starts operating. Due to the high electricity prices, the P2G strategy of converting electricity to gas for the gas network is discontinued, and direct purchase of natural gas is adopted instead. As the electricity and heat demands of the grid and heating network differ, the CHP unit considers the actual conditions of both networks when generating electricity and heat. Therefore, an energy storage system is implemented in the heating network to balance the surplus or deficit of heat. With the involvement of the CHP unit and PV generation, the agent significantly reduces its electricity purchases during periods of high electricity prices compared to the 0-7 h period. Furthermore, the electricity supply curve aligns well with the load demand curve, allowing the energy storage system to operate near its optimal level.

Privacy Protection Results Analysis
To safeguard sensitive information in the power grid from leakage and identification, the privacy data of each round (including the gas supply quantities of the CHP devices, the power-to-gas conversion quantities of P2G, the power consumption of the EB, the load data of the power grid, heating network, and gas network, the wind and solar generation quantities, and the rate of change of the ESS in the power grid and heating network) are protected using differential privacy techniques. The privacy data are perturbed using the Laplace mechanism, and constraints are applied to ensure that the added noise remains within reasonable bounds and does not exceed the power limits of the respective units. The specific transformations are illustrated in Figure 6, which presents the data for 1097 rounds. It can be seen from the figure that, after noise addition, the original data are blurred and distorted, making it impossible to infer and reconstruct the specific information.
While maintaining constant gas and electricity consumption, the privacy data are perturbed, and the introduced noise is appropriately transferred to the relevant networks based on their coupling relationships. Moreover, an ESS is separately deployed in the power grid, heating network, and gas network, serving as the noise provider. The variations in energy storage provided by these units are depicted in Figure 7.

Sensitivities on Renewable Energy Penetration Levels

Considering the stochastic nature of renewable energy generation, it is crucial to assess how well the proposed method handles the inherent uncertainties in renewable energy production. This section provides a comprehensive analysis of the method's performance under various levels of renewable energy integration. Specifically, four scenarios are selected: 50% RES, 100% RES, 150% RES, and 200% RES integration. Note that the RES data of the previous case studies are used as the benchmark and denoted by 100% RES.
As illustrated in Figure 8, the proposed method shows similar convergence trends for all scenarios, indicating its robustness to the integration of intermittent RES. The converged reward increases as RES penetration rises, which is intuitive: RES has zero marginal cost and thus reduces the operating costs incurred by fossil fuel consumption.
It should be emphasized that increasing (decreasing) the amount of renewable generation is equivalent to decreasing (increasing) the load, since the net electricity load equals the actual load minus the renewable generation. The results therefore also indicate that the proposed method is robust to different load levels.
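This equivalence can be checked numerically; the load and RES profiles below are made-up numbers:

```python
# Hourly load and renewable generation (kW), hypothetical
load = [100.0, 120.0, 90.0]
res = [30.0, 50.0, 40.0]

# 150% RES scenario: net load = load - 1.5 * RES
net_150 = [l - 1.5 * r for l, r in zip(load, res)]

# Equivalent view: keep RES at 100% but reduce the load by the extra 50% of RES
load_reduced = [l - 0.5 * r for l, r in zip(load, res)]
net_equiv = [l - r for l, r in zip(load_reduced, res)]
```

Both views produce the same net load profile, so robustness to RES scaling and robustness to load scaling are two readings of the same result.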

Sensitivities on the Number of Agents
In this section, we test four scenarios with 1, 2, 4, and 8 EHs to observe how many episodes convergence takes and the computational time required. The simulation is performed in Python on a computer with a 2-core 3.50 GHz processor and 32 GB of RAM.
As illustrated in Figure 9, the rewards for all tested scenarios converge well, with fewer episodes needed to achieve convergence as the number of agents increases. However, the computational time per episode increases when more EHs are considered, owing to the involvement of more variables. Table 4 lists the computational parameters of all scenarios. Additionally, the converged reward decreases as the number of EHs grows, because more users consume more energy. Based on these results, it can be concluded that the proposed method effectively handles the coordination of multiple EHs. It should be emphasized that the MADDPG algorithm has an inherent advantage in solving multi-agent problems; theoretically, it can therefore handle problems with many more agents given appropriate model parameter settings.

Sensitivities on Privacy Protection Levels
In this section, we investigate how different privacy protection levels affect the computational time and solution accuracy.
In this paper, the parameter epsilon controls the amount of noise added to protect data privacy. Adding more noise to the original data enhances privacy protection but can cause notable data distortion and higher computational overhead. The proposed method aims to secure the privacy of transmitted data while simultaneously optimizing distributed energy resource scheduling in energy hubs. As shown in Table 5, three metrics, namely the discrepancy rate σ, the computational time, and the solution accuracy, are used to assess the trade-off between privacy protection efficacy and solution performance. High accuracy validates the optimality of the energy dispatch decisions made by the algorithm, while a high discrepancy rate indicates effective privacy protection. As observed in the table, a larger amount of noise leads to a greater discrepancy rate, implying that more operational cost is incurred as the price of privacy protection; an increase in computational time is also observed. Selecting an appropriate epsilon value therefore strikes a balance between privacy preservation and decision accuracy, and the suitable amount of noise depends on the specific dataset and privacy requirements.
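The trade-off can be illustrated with a toy computation of the discrepancy rate, here read as the mean relative deviation between the original and noised data (one plausible interpretation of σ, not necessarily the paper's exact definition; the flat profile is a stand-in dataset):

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def discrepancy_rate(values, epsilon, sensitivity=1.0, seed=0):
    """Mean relative deviation sigma between original and Laplace-noised data."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    noisy = [v + laplace_noise(scale, rng) for v in values]
    return sum(abs(n - v) / abs(v) for n, v in zip(noisy, values)) / len(values)

profile = [100.0] * 1000   # flat 100 kW stand-in profile
sigmas = {eps: discrepancy_rate(profile, eps) for eps in (0.1, 0.5, 1.0, 2.0)}
```

Since the Laplace scale is sensitivity/epsilon, smaller epsilon means larger noise and a strictly larger discrepancy rate, mirroring the monotone trend reported in Table 5.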
It should be stressed that, although noise injection reduces the solution accuracy, the physical constraints are never violated: a constraint violation penalty with a particularly large coefficient is introduced, ruling out any solution outside the feasible region of the optimization problem. Additionally, although noise injection increases the computational complexity, the convergence performance is not essentially affected.
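A minimal sketch of such penalty-based reward shaping, using the small/large positive weights γ and λ listed in the nomenclature; the weight values and the violation measure here are illustrative, not the paper's exact Equation (29):

```python
def reward(cost, violations, gamma=1e-3, lam=1e4):
    """Reward shaping sketch: negative economic cost scaled by the small weight
    gamma, minus a very large penalty lam on any constraint violation, so every
    infeasible action is dominated by every feasible one (illustrative weights)."""
    penalty = sum(max(0.0, v) for v in violations)
    return -gamma * cost - lam * penalty

feasible = reward(1000.0, [0.0, 0.0])    # inside the feasible region
infeasible = reward(900.0, [0.5, 0.0])   # cheaper, but violates one constraint
```

Because λ dwarfs any achievable cost saving, the learned policy is steered away from infeasible actions even after noise is injected.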

Discussion
In the long run, with the high-penetration integration of renewable energy driven by carbon neutrality targets, the traditional top-down provision of flexibility from centralized generation units will be insufficient to support the efficient accommodation of intermittent and fluctuating wind/solar energy, particularly in low-inertia power systems caused by the large-scale displacement of synchronous generators [55,56]. It is therefore imperative to exploit the flexibility of DERs as a bottom-up complementary resource [57].
A promising application area for the proposed algorithm is virtual power plants (VPPs), which unlock the untapped flexibility of DERs by coordinating their operation/response at the end-user side. VPPs typically aggregate several hundred kW- to MW-magnitude resources, which our algorithm handles well, as illustrated in the case study. Regarding the number of participants (agents), the trained model effectively coordinates one to eight energy hubs, as demonstrated, which matches the number of participants of an average-size VPP in China [58]. Since training is performed offline, the requirements on hardware and communication delay are not strict. However, due to the involvement of diverse DERs, controlling a VPP may require heterogeneous communication systems and protocols. A further bottleneck of large-scale VPP application is how to effectively incentivize end users to participate in energy resource aggregation. Moreover, the hierarchical control framework of VPPs relies heavily on the extensive deployment of sensors for end-user data collection, an efficient algorithm to support the cloud-edge control mode, and massive computing power for coordination across numerous end devices. More importantly, incentivizing energy policies are crucial for motivating end users to provide flexibility to the power system through VPPs [59,60].
At present, the construction of VPPs in China is still largely at the pilot-trial stage [61]. Although some commercial VPPs in Western countries run with good profitability, their scales and the amount/diversity of aggregated DERs are limited; hence their dependence on a smart and computationally efficient control algorithm to coordinate numerous end users is not yet urgent. However, with the extensive decommissioning of centralized synchronous gas/coal-fired generation units in the coming decades, the relevant policies are likely to mature and fully support the exploitation of end-user flexibility. VPPs will then play a major role in providing balancing and ancillary services for the power system, a large number of heterogeneous DERs will be involved in their control frameworks, and the proposed control algorithm will truly show its superiority in fast scheduling across numerous, diversified resources [62]. It should be emphasized, however, that the proposed algorithm is a prototype whose performance is tested on an integrated heat-electricity-gas system; the physical model and mathematical formulations can be adapted to new policies and emerging technologies without essentially jeopardizing the algorithm's performance.

Conclusions
This paper proposes an EH cluster optimization and scheduling method based on the MADDPG algorithm, targeting EH clusters comprising multiple IESs. The optimal scheduling problem of the EH cluster is transformed into a deep reinforcement learning model. Each integrated energy system in the EH is treated as an agent, utilizing the capability of MADDPG to handle complex tasks with high-dimensional state and action spaces. Through collaborative training among the agents within the EH cluster, the method learns cooperative strategies that maximize the performance and utilization efficiency of the overall energy system. Additionally, a differential privacy mechanism is introduced to protect sensitive data during the optimization and scheduling process. In each of the three energy networks, a storage system is introduced as the provider of noise, ensuring that the gas and electricity purchase quantities of the integrated energy system remain unaffected by the injected noise. Finally, the proposed optimization and scheduling model is applied to a cluster scheduling optimization problem consisting of four EHs. The experimental simulations demonstrate that the proposed method offers reasonable optimization strategies for the scheduling problem and exhibits good generalization capability in the face of uncertain fluctuations in renewable energy output.
In the future, we will explore multi-level optimization for the hierarchical control structure of VPPs, where multiple lower-level agents interact with the higher-level controller.

Nomenclature

η^ch_e, η^ch_h, η^ch_g: The charging efficiency of the electricity, heat, and gas networks.
η^dis_e, η^dis_h, η^dis_g: The discharging efficiency of the electricity, heat, and gas networks.
The price of carbon trading.

C_CHP: The O&M cost of the CHP.
C_EB: The O&M cost of the EB.
C_ES: The O&M cost of the ES.
C_P2G: The O&M cost of the P2G.
The price of electricity and gas at time t.
γ, λ: The small/large positive value used as a reward weight.
e_CHP: The carbon emission quota associated with the unit energy generated.
E_CHP: The carbon emission intensity associated with the unit energy generated.
The values of the electrical, thermal, and gas load at time t.
P^CHP_{n,i,t}: The power output of the nth CHP in node i at time t.
P^EB_{i,t}: The electric power input of the EB in node i.
ΔP^EB_{i,t}: The variation of P^EB_{i,t} from slot t to t + Δt.
P^P2G_{i,t}: The electric power input of the P2G in node i.
ΔP^P2G_{i,t}: The variation of P^P2G_{i,t} from slot t to t + Δt.
P^PV_{i,t}: The power generation of the photovoltaic unit in node i at time t.
P^WT_{i,t}: The power generation of the wind power unit in node i at time t.
P^{x,ch}_{i,t}, P^{x,dis}_{i,t}: The charging/discharging power in node i at time t.
P^net_{i,t}: The power exchanged with the main grid in node i at time t.

Figure 1. The framework of EH.

Figure 3. The curve of reward value.

Figure 4 demonstrates the comparison among DP-MADDPG, MADDPG, and DDPG. As can be observed, DP-MADDPG and MADDPG show good convergence performance, while DDPG fails to converge within 2500 episodes. Regarding the converged reward, MADDPG achieves a higher value than DP-MADDPG, since the introduction of data privacy protection incurs increased operational costs (compromising the solution accuracy). Therefore, the trade-off between the privacy protection level and the economy of energy system dispatch is critical.

Figure 5. The power changes in each network of the EH.

Figure 6. Comparison of original data and synthetic data for a single episode.

Figure 7. Power changes in noise-canceling energy storage devices.

Figure 8. Different levels of renewable energy sources.

Figure 9. The training process of the proposed algorithm.

Algorithm 1: Distributed Energy Management by MADDPG.
4:  for step t = 1 to T do: each agent obtains the environmental parameters.
5:      The actor network outputs the action ΔG^CHP_{n,i,t}, ΔP^P2G_{i,t}, ΔP^EB_{i,t}, P^{e,dis}_{i,t}/P^{e,ch}_{i,t}.
6:      Execute the action according to (6)-(18) and calculate the reward r_{i,t} by Equation (29).
7:      Update the state s_{i,t+Δt}.
8:      if episode ≥ H then
9:          The agent learns by extracting historical data from the experience pool.
10:     end
11:     Store [s_{i,t}, a_{i,t}, s_{i,t+Δt}, r_{i,t}] into the experience pool.
12: end
13: end
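The inner loop of Algorithm 1 can be sketched in Python, with the actor network and the EH environment replaced by caller-supplied stand-ins and the actual MADDPG network update omitted:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool storing [s_t, a_t, s_{t+dt}, r_t] tuples."""
    def __init__(self, capacity=10_000):
        self.pool = deque(maxlen=capacity)

    def store(self, transition):
        self.pool.append(transition)

    def sample(self, batch_size):
        return random.sample(self.pool, min(batch_size, len(self.pool)))

def run_episode(actor, env_step, state0, buffer, T=24, learn=False):
    """One pass of the inner loop of Algorithm 1 (illustrative stand-ins)."""
    state, total_reward = state0, 0.0
    for _ in range(T):
        action = actor(state)                         # actor outputs the action
        next_state, r = env_step(state, action)       # env applies constraints, returns reward
        if learn and buffer.pool:                     # after warm-up: learn from history
            _batch = buffer.sample(32)                # (network update omitted in this sketch)
        buffer.store((state, action, next_state, r))  # store into the experience pool
        state, total_reward = next_state, total_reward + r
    return total_reward

# Trivial stand-ins: zero action, unit state step, constant -1 reward per hour
buf = ReplayBuffer()
total = run_episode(actor=lambda s: 0.0,
                    env_step=lambda s, a: (s + 1, -1.0),
                    state0=0, buffer=buf)
```

In the full algorithm, each of the multiple agents runs this loop over a 24 h horizon, and the sampled batches drive the centralized-critic/decentralized-actor updates of MADDPG.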

Table 2. Energy conversion devices and energy storage unit parameters.

Table 4. Comparative analysis of different numbers of EHs.

Table 5. Comparative analysis of different amounts of noise introduced.
S_i: Set of the states for agent i from time 1 to t, i.e., {s_{i,1}, s_{i,2}, ..., s_{i,t}}.
A_i: Set of the actions for agent i from time 1 to t, i.e., {a_{i,1}, a_{i,2}, ..., a_{i,t}}.
α^ES_e, α^ES_h, α^ES_g: The self-discharge efficiency of the electricity, heat, and gas networks.
The heat output of the nth CHP in node i at time t.
G^CHP_{n,i,t}: The gas consumption of the nth CHP in node i at time t.
The gas output of the P2G in node i.
ΔG^CHP_{n,i,t}: The variation of G^CHP_{n,i,t} from slot t to t + Δt.
The power exchanged with the external gas network in node i at time t.