Agent-Based Energy Sharing Mechanism Using Deep Deterministic Policy Gradient Algorithm

Balancing energy generation and consumption is essential for the smooth operation of power grids. A mismatch between energy supply and demand not only increases costs on both sides but also severely affects system stability. This paper proposes a novel energy sharing mechanism (ESM) to facilitate the consumption of local energy. With the help of the ESM, multiple prosumers have the opportunity to share surplus energy with neighboring prosumers. The problem is formulated as a leader–follower framework based on Stackelberg game theory. To address the aforementioned problems, the deep deterministic policy gradient (DDPG) algorithm is applied to find the Nash equilibrium (NE). The numerical results demonstrate that the proposed method is more stable than conventional reinforcement learning (RL) algorithms. Moreover, the proposed method converges to the NE and finds a relatively good energy sharing (ES) pricing strategy without knowing specific system information. In short, the proposed ESM can be seen as a win–win strategy for both prosumers and the power system.


Introduction
Distributed energy resources (DER) have increased substantially in recent years. End users are gradually installing self-consumed renewable energy source (RES) generation on the consumer side. As a result, a new type of entity has emerged in the grid, namely prosumers, who act either as power consumers or as power producers in a given time period. Across different time intervals, prosumers can be sellers or buyers depending on the electricity price and their net power profiles. Therefore, it is feasible to improve the local consumption of DER and the stability of the power system through energy trading among neighboring prosumers. Peer-to-peer (P2P) trading emerged as an energy management mechanism that enables each prosumer to participate in energy trading with other local prosumers [1]. In [2], it is demonstrated that P2P energy trading can facilitate the local balance of DER. In order to reduce the investment cost of upstream generation and increase network efficiency and energy security, Morstyn T. et al. proposed a P2P energy trading mechanism based on bilateral contract networks [3]. Moreover, prosumers are clustered into virtual microgrids to reduce the total energy cost [4]. In [5], a three-tiered framework including micro-grid balancing, aggregator scheduling and trading optimization is designed to provide a dynamic price signal that assists trading-strategy-making, thereby motivating the efficient utilization of distributed energy resources.
However, the volatility of distributed renewable generation and the randomness of end-user behavior pose challenges to the reliable operation of the power grid, such as power flow fluctuation. As an alternative, DDPG has received increasing attention. In [37], a semi-rule-based DDPG decision-making strategy for adaptive cruising of heavy intelligent vehicles is presented, taking stability into consideration. Moreover, the DDPG algorithm has been applied to the joint bidding and pricing problem of a load service entity [38] and to model the strategic bidding of market participants [39].
To date, although traditional RL algorithms are capable of solving non-cooperative games of incomplete information, they are limited to low-dimensional discrete state/action spaces and are hard to converge to a relatively good solution. Based on the aforementioned research gap, this paper aims to address the limitations of previous methods and introduce a novel application. We propose DDPG-based agents that share energy locally under a VPP operation framework based on Stackelberg game theory, with the goal of helping the VPP operator center (VPPOC) facilitate the local consumption of DER. The ES pricing problem is formulated as an MDP. Different from traditional model-based approaches, the proposed method requires no model information of the prosumers. Finally, detailed case studies are provided to demonstrate the effectiveness of the proposed approach.
The main contributions of this paper are summarized as follows:
• A novel ESM under the VPP operation framework is proposed; the interaction between prosumers and the VPPOC is formulated as a non-cooperative game problem based on Stackelberg theory;
• A DRL-based model-free approach is proposed to find the NE of the Stackelberg game without requiring any lower-level model information;
• The effectiveness and stability of the ESM are significantly improved based on the DDPG algorithm, and the employment of DNNs enhances the performance of the proposed model in problems with high-dimensional continuous spaces.
The remainder of the paper is organized as follows: the problem formulation is introduced in Section 2; the formulation of the ES model is presented in Section 3; Section 4 describes the process of solving the NE problem based on the DDPG algorithm; experimental scenarios are implemented in Section 5 to demonstrate the effectiveness of the proposed approach; finally, conclusions are drawn in Section 6.

System Architecture
In this work, we consider an ESM based on a VPP. The framework is shown in Figure 1. The interconnected prosumers can exchange electrical energy through the interconnection infrastructure and a communication network. Each prosumer owns distributed RES generation, such as wind generation (WG) and photovoltaic generation (PV), as well as electrical load. Within the VPP operation territory, each prosumer can act as both a buyer and a seller of electrical energy. Moreover, the load of each prosumer is assumed to be elastic, i.e., it responds to the electricity price. After obtaining the RES generation and planned load information, the VPP operating center determines the energy sharing price and announces it to the prosumers. Then, each prosumer optimizes its consumption behavior, aiming to maximize revenue or minimize load cost according to the price signal. The prosumers can be classified into two categories: supply prosumers, who share their surplus energy, and demand prosumers, who need extra energy to meet their load demand. In practice, there is a deviation between the surplus energy and the load demand. Thus, it is essential to set a reasonable ES price to promote the matching of supply and demand.

Upper-Level Model
In the hierarchical energy trading framework, the VPP is obligated to decide the energy sharing price so as to reduce the imbalance between surplus energy and load demand. In this paper, the objective of the VPP is to minimize the imbalance energy, which can be denoted as in (1).
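Since Equation (1) itself is not legible in this extract, a plausible form of the upper-level objective is sketched below; the notation is partly assumed: $D_{n,h}$ is the load consumption of prosumer $n$ at hour $h$ (as in the lower-level model), $G_{n,h}$ is assumed to denote its RES generation, $\mathcal{N}$ the set of prosumers, and $\lambda_h$ the ES price:

\min_{\lambda_h}\; \Big|\, \sum_{n\in\mathcal{N}} \big( G_{n,h} - D_{n,h}(\lambda_h) \big) \Big| \quad \text{s.t.}\quad \lambda_h^{\min} \le \lambda_h \le \lambda_h^{\max},

i.e., the VPPOC chooses the ES price, within the bounds later referred to as (2), that minimizes the absolute gap between total RES generation and the price-responsive total load.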

Low-Level Model
In the lower-level model, when informed of the ES price by the VPPOC, prosumers subscribed to the ES program aim to maximize their income or minimize their consumption cost. Prosumers with surplus RES output attempt to determine their optimal load consumption by taking both the load utility and the ES profit into consideration, whereas prosumers demanding extra electric energy try to optimize their load consumption while considering the incurred electricity consumption cost. The objective of the lower-level problem is denoted as a function of the load consumption $D_{n,h}$. Specifically, a utility function is used to describe the utility of the load tasks. Since utility functions must be non-decreasing, quadratic and logarithmic functions are widely used [40]. Without loss of generality, this paper adopts the logarithmic utility function, which can be formulated as

$\varphi_{n,h}(D_{n,h}) = \log_{\beta_{n,h}}\!\left(1 + \alpha_{n,h} D_{n,h}\right)$  (7)

where $\alpha_{n,h}$ and $\beta_{n,h}$ are parameters varying with the prosumer and the time. $\beta_{n,h}$ is an empirical parameter, while $\alpha_{n,h}$ is the key factor capturing the elasticity of the prosumer's load. In this paper, appropriate parameters are set to demonstrate the impact of load variation on consumption utility. According to (6), the feasible consumption strategy space is bounded by the minimum and maximum load consumption.
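As a worked illustration only (the paper's exact lower-level objective is not reproduced here), suppose a prosumer's hourly net benefit is its consumption utility (7) minus its ES payment, i.e., $\max_{D_{n,h}} \log_{\beta_{n,h}}(1+\alpha_{n,h}D_{n,h}) - \lambda_h D_{n,h}$. The first-order condition then yields a closed-form best response:

\frac{\alpha_{n,h}}{\big(1+\alpha_{n,h}D_{n,h}\big)\ln\beta_{n,h}} - \lambda_h = 0 \;\;\Longrightarrow\;\; D_{n,h}^{*} = \frac{1}{\lambda_h \ln\beta_{n,h}} - \frac{1}{\alpha_{n,h}},

projected afterwards onto the feasible interval implied by (6). A higher ES price $\lambda_h$ lowers the unconstrained optimum, which is exactly the price-responsive behavior the ESM exploits.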

Stackelberg Game Process
In this section, we reformulate the ES problem as a Stackelberg game. In the aforementioned ESM, the VPPOC is the leader, which sets the ES price by considering the distributed supply-demand energy balance among prosumers, and the prosumers are followers, who make optimal load consumption decisions according to the price signal and feed them back to the VPPOC. The dynamic game process is described in Figure 2. According to Figure 2, the game is played in the following sequence (a minimal iterative sketch is given after this paragraph): (1) the leader first announces a strategy from its strategy space $\Omega_{VPPOC}$ to the followers, i.e., an ES incentive price $\lambda_h$; (2) upon being informed of the pricing strategy $\lambda_h$, prosumer $n$ chooses a best-response strategy from its strategy space $\Omega_n$ as a reaction to the leader VPPOC; (3) based on the identified best-response strategy set comprising all the followers, the leader selects an optimal strategy, which is obtained by solving the upper-level problem; (4) at each time interval $h$, the leader chooses its optimal strategy as in (3) and announces it to the followers again. These steps are repeated between the VPPOC and the prosumers until the desired NE is obtained.
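To make the sequence above concrete, the following minimal Python sketch iterates the leader-follower interaction for a single time interval. All names, parameter values and the closed-form follower response are illustrative assumptions taken from the worked example in Section 3; this is not the paper's implementation:

import numpy as np

# Illustrative prosumer parameters (six prosumers).
alpha = np.array([0.8, 1.0, 1.2, 0.9, 1.1, 1.0])   # elasticity coefficients
beta = np.array([5.0, 4.0, 6.0, 5.5, 4.5, 5.0])    # utility parameters
d_min, d_max = 0.2, 5.0                             # load bounds (kWh)
total_res = 8.0                                     # total RES generation (kWh)

def prosumer_best_response(price):
    """Follower step: closed-form response of the log-utility illustration,
    projected onto the feasible consumption interval."""
    d_star = 1.0 / (price * np.log(beta)) - 1.0 / alpha
    return np.clip(d_star, d_min, d_max)

def imbalance(price):
    """Upper-level objective: absolute gap between RES generation and the
    total load induced by the announced ES price."""
    return abs(total_res - prosumer_best_response(price).sum())

# Leader step: search the admissible price range for the ES price that
# minimizes the imbalance (a simple grid search stands in for the analytic
# or learning-based solution).
candidate_prices = np.linspace(0.2, 2.0, 500)
best_price = min(candidate_prices, key=imbalance)
print(f"ES price: {best_price:.3f}, residual imbalance: {imbalance(best_price):.3f} kWh")

In the paper, the leader's search is instead carried out by the DDPG agent described in Section 4.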

Formulation of the Stackelberg Game
The strategic form of the game is formally defined as follows. In the dynamic game $\Gamma$, the set of follower players choose their load consumption strategies from their strategy spaces to maximize the objective $l_{n,h}$, considering the ES incentive strategy $\lambda_h$ determined by the VPP leader $V$. The leader aims to maximize its objective by optimizing the incentive strategy. Obviously, the game problem is a bi-level optimization problem. We denote the optimal response strategy set as $D_{n,h}^{*}$ and, correspondingly, the optimal ES incentive strategy as $\lambda_h^{*}$.
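For completeness (the corresponding equations are not legible in this extract), the resulting Stackelberg equilibrium can be stated in the usual way, writing $F$ for the leader's objective (e.g., the negative of the imbalance in (1)):

l_{n,h}\big(D_{n,h}^{*}, \lambda_h^{*}\big) \;\ge\; l_{n,h}\big(D_{n,h}, \lambda_h^{*}\big) \quad \forall\, D_{n,h}\in\Omega_n,\; \forall n, \qquad F\big(\lambda_h^{*}, \mathbf{D}_h^{*}(\lambda_h^{*})\big) \;\ge\; F\big(\lambda_h, \mathbf{D}_h^{*}(\lambda_h)\big) \quad \forall\, \lambda_h\in\Omega_{VPPOC},

where $\mathbf{D}_h^{*}(\lambda_h)$ collects the followers' best responses to the announced price: no follower can improve its payoff by unilaterally deviating from $D_{n,h}^{*}$, and the leader cannot improve its objective given that the followers respond optimally.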

Agent Model
In this section, we first formulate the Stackelberg game problem as a finite MDP with a discrete time step. Then, we adopt a DDPG-based model-free approach that does not require full knowledge of the system model information to obtain the equilibrium point of the game problem.

A. Formulation of MDP
The ES rate is constrained by the price bounds in (2), which are derived from a mutual agreement or regulatory requirement between the VPPOC and the prosumers, maintaining fair incentive rates and protecting the profit of each party.
(3) State Transition: The state transition from time slot h to time slot h + 1 is shown as follows, where $r_h$ represents the gap between the RES generation and the load demand. In an ideal situation, the VPPOC can compensate for this gap as soon as possible by determining a reasonable ES rate (i.e., the total load consumption of the prosumers equals the RES generation).
where $Q^{\mu}(s,a)$ represents the action-value function; $\mu$ is the ES rate policy, which maps the environment state to the ES pricing strategy; and $\gamma$ is the discount factor that balances the importance of immediate rewards against future rewards. As the value of $\gamma$ approaches 1, the policy becomes more foresighted; conversely, as $\gamma$ approaches 0, only the immediate rewards are taken into account and the policy becomes shortsighted. The purpose of the learning process is to find, over all feasible policies, an optimal policy $\mu^{*}$ that maximizes the action value, where $Q^{*}(s,a)$ represents the optimal action-value function.
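Since the corresponding equations are garbled in this extract, a standard formulation consistent with the definitions above is sketched here; $R_h$ denotes the per-step reward (the symbol is chosen to avoid clashing with the supply-demand gap $r_h$ defined earlier):

Q^{\mu}(s_h, a_h) \;=\; \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} R_{h+k} \,\Big|\, s_h, a_h, \mu\Big], \qquad Q^{*}(s, a) \;=\; \max_{\mu} Q^{\mu}(s, a), \qquad \mu^{*}(s) \;=\; \arg\max_{a} Q^{*}(s, a).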
The overall diagram of the proposed ESM with the DDPG methodology is illustrated in Figure 3. At each time interval, the VPPOC, as the leader, announces the ES price, and the prosumers within its territory determine their load consumption according to the price signal. The interaction between the prosumers and the VPPOC is fed into the networks. Firstly, the $\mu$ network selects an action according to the policy and feeds it to the Q network. The Q network then evaluates the Q value of the action and updates the parameters of the Q function. Finally, the action with the maximal Q value is selected as the ES price and announced to all prosumers.

B. DDPG Algorithm
When the action space dimension is high, and especially when the action space is continuous, it is difficult for DQN to learn the optimal policy. Thus, DDPG was proposed as an actor-critic algorithm based on a deterministic policy to operate on problems with continuous state and action spaces. DDPG follows the actor-critic (AC) framework: it uses a deep neural network as the actor to approximate the policy function, and another DNN to approximate the action-value function, which acts as the critic, evaluates the performance of the actor and guides the update of the policy network. The actor and critic network architecture is shown in Figure 4.
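The following PyTorch sketch shows one possible realization of the actor and critic networks in Figure 4. The hidden-layer sizes, activations and the tanh rescaling of the price are illustrative assumptions; the paper reports using PyTorch but does not prescribe this exact architecture:

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu(s): maps the state to an ES price within the bounds of (2)."""
    def __init__(self, state_dim, action_dim, price_lo, price_hi, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # output in (-1, 1)
        )
        self.lo, self.hi = price_lo, price_hi

    def forward(self, state):
        # Rescale the tanh output to the admissible price interval.
        return self.lo + (self.net(state) + 1.0) * 0.5 * (self.hi - self.lo)

class Critic(nn.Module):
    """Action-value function Q(s, a): evaluates the actor's pricing action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))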
The loss of the Q network is shown as follows. The performance of the policy $\mu$ is measured by the performance objective $J_{\beta}(\mu)$, where $\rho^{\beta}$ is the distribution of the state $s_h$ under the behavior policy $\beta$. The goal of the training process is to maximize $J_{\beta}(\mu)$ while minimizing the loss. The actor-critic network training process is described in detail in Algorithm 1.
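A minimal sketch of one DDPG training step, consistent with the description above, is given below. It reuses the Actor/Critic classes from the previous sketch; the state is assumed to be the one-dimensional supply-demand gap and the action the ES price, and all hyper-parameters (discount factor, soft-update rate, learning rates) are illustrative, not necessarily those of Algorithm 1:

import copy
import torch
import torch.nn.functional as F

gamma, tau = 0.99, 0.005                      # discount factor and soft-update rate

actor, critic = Actor(1, 1, 0.2, 2.0), Critic(1, 1)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2):
    """One update on a mini-batch (s, a, r, s2); r must have shape [batch, 1]."""
    # Critic: minimize the mean-squared Bellman error against the target networks.
    with torch.no_grad():
        y = r + gamma * critic_target(s2, actor_target(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of both target networks.
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, pt in zip(net.parameters(), target.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)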

Results
In this section, multiple case studies are presented to demonstrate the effectiveness of the proposed approach. Firstly, the performance of DDPG in solving the game theoretic problem for a single time interval is demonstrated. Then, we numerically compare the ability of different methods to solve the multi-agent game with incomplete information.

Experimental Setup
For ease of illustration, simulations were conducted for six different categories of prosumers. The performance of the proposed approach is evaluated based on data from real-world scenarios. The hourly RES generation and load demand profiles were obtained from PJM for 16 June 2018. The training process is carried out on a computer with an i7-8700 CPU. The agent-based game theoretic program is implemented in Python with the PyTorch deep learning framework.

The Evaluation of Training Process
In this part, the DDPG-based approach is trained to solve the multi-agent game problem and find the optimal ES pricing strategy. In order to evaluate the effectiveness of the training process, we first train the agent to learn an optimal ES pricing strategy within a single time interval. The evolution of the cumulative rewards is illustrated in Figure 5. It can be seen that the actions are randomly selected from the action space in the first 2000 iterations. Then, each agent is trained using mini-batches randomly sampled from the experience replay buffer. Finally, the cumulative rewards converge around zero with small oscillations. This result indicates that the optimal action (i.e., the ES pricing strategy) promotes the local consumption of prosumers through ES, and the DDPG-based approach succeeds in learning a deterministic policy, as shown in Figure 6. The strategic action $\lambda_h$ converges to $\lambda_h^{*}$. It should be noted that each agent is not given any information about the other agents, which reflects that the DDPG algorithm can solve the incomplete information game stably.
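The schedule just described (2000 random warm-up iterations followed by mini-batch updates from the replay buffer) can be sketched as follows. The toy environment, price bounds and noise scale are illustrative assumptions, and `actor` and `ddpg_update` refer to the earlier sketches; this is not the paper's training code:

import random
from collections import deque

import numpy as np
import torch

PRICE_LO, PRICE_HI = 0.2, 2.0
WARMUP, BATCH, ITERATIONS = 2000, 64, 20000

class ESEnv:
    """Toy single-interval stand-in for the VPPOC-prosumer interaction."""
    alpha = np.array([0.8, 1.0, 1.2, 0.9, 1.1, 1.0])
    beta = np.array([5.0, 4.0, 6.0, 5.5, 4.5, 5.0])
    total_res = 8.0
    def reset(self):
        return np.array([self.total_res], dtype=np.float32)
    def step(self, price):
        load = np.clip(1.0 / (price * np.log(self.beta)) - 1.0 / self.alpha, 0.2, 5.0).sum()
        reward = -abs(self.total_res - load)             # negative imbalance as reward
        return self.reset(), np.float32(reward), True    # single-step episodes

env, buffer = ESEnv(), deque(maxlen=50000)
state = env.reset()
for it in range(ITERATIONS):
    if it < WARMUP:                                       # random exploration phase
        action = np.random.uniform(PRICE_LO, PRICE_HI, size=(1,)).astype(np.float32)
    else:                                                 # deterministic policy + exploration noise
        with torch.no_grad():
            action = actor(torch.as_tensor(state)).numpy()
        action = np.clip(action + np.random.normal(0.0, 0.05, 1), PRICE_LO, PRICE_HI).astype(np.float32)

    next_state, reward, done = env.step(action)
    buffer.append((state, action, reward, next_state))
    state = env.reset() if done else next_state

    if it >= WARMUP:
        batch = random.sample(buffer, BATCH)
        s, a, r, s2 = (torch.as_tensor(np.array(x)) for x in zip(*batch))
        ddpg_update(s, a, r.unsqueeze(-1), s2)            # update step from Section 4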

The Performance of ESM
Actually, the load elasticity is critical for the stable operation of ES. In order to have a practical analysis, six prosumers with different elasticity levels are presented. The elasticity coefficient $\alpha$ is used to describe the sensitivity of prosumers to the incentive price. Note that a higher value of $\alpha$ indicates that a prosumer is more sensitive to the ES price, i.e., under the same price volatility, the consumption behavior of a prosumer with higher $\alpha$ changes more. It is essential for the stability of the ESM to study the relationship between prosumers' consumption behavior and the ES price. The ES prices for prosumers with different elasticities are illustrated in Figure 7. It is notable that, as the elasticity increases, the ES price required to facilitate the local energy-demand balance decreases. Hence, a good understanding of the categories of prosumers within the registered area can help the VPPOC determine the ES pricing strategy reasonably. In fact, for each prosumer, the RES generation and the load demand cannot match exactly: some prosumers generate more and have surplus energy, whereas others consume more and need extra supply. The detailed ES behavior of the six kinds of prosumers before and after enrolling in the ESM is shown in Figure 8. Positive energy means that a prosumer shares surplus energy with other prosumers; conversely, negative energy represents energy that a prosumer receives from others. Before enrolling in the ESM, only prosumer 3 has surplus energy while the others demand extra supply, and the surplus energy cannot satisfy the whole demand. After enrolling in the ESM, prosumers reduce their load demand according to the ES price signal; in particular, the shared energy of prosumer 2 becomes positive instead of negative. In summary, this demonstrates that, under the guidance of the ES price signal of the ESM, the total surplus energy can meet the demand as far as possible.

Performance of DDPG in Agent-Based Problem
In this section, in order to show the performance of the DDPG-based algorithm and verify the ability of an agent to learn a reasonable ES pricing strategy, we compare our approach with DQN and an analytic iterative algorithm. The action space is continuous in the DDPG algorithm, but is required to be discrete in DQN; thus, we use different settings for each approach. For DDPG and the analytic method, the action variable $\lambda_h$ is constrained by (2) and can take any value within the price bounds. For DQN, however, the action space needs to be reduced in dimension and discretized, so we use a set of discrete nodal prices to represent the admissible ES prices (a brief sketch of the two action-space treatments is given at the end of this subsection). To make the comparison more reliable, the discount factors of DDPG and DQN take the same value. For ease of illustration, the simulation was carried out for 12 typical time intervals representing a 24-h day. We assume that six categories of prosumers are involved in the ESM. The generation resource type of each prosumer is shown in Table 1, and the original RES and load profiles of the six categories of prosumers are shown in Figures 9 and 10. Each profile has its own typical features, which makes the evaluation more practical and enhances the reliability of our conclusions.

The convergence processes of the analytic method, DQN and DDPG are shown in Figure 11. It can be seen that the analytic method obtains the optimal solution successfully, and DDPG also converges to the optimal solution; both approaches work well in solving the game theoretic problem, while DQN does not. Although its learning process gradually converges, DQN does not find a relatively good solution and exhibits high volatility throughout its iterations. To further compare the three methods, the detailed solving information is listed in Table 2. The analytic method takes the least solving time and the fewest iterations to find the optimal solution, which is about −11.5 kWh. The total number of DQN iterations is only one fifth that of DDPG; however, even though the solving time of DQN is more than twice that of DDPG, it still cannot find a relatively good solution. Although DDPG runs for 20,000 iterations, it starts to converge after just 3800 iterations and achieves a far smaller error range than DQN. Moreover, as mentioned in Section 2, what makes DDPG and DQN superior to the analytic method is that they can solve the problem without knowing the detailed information of the system and the environment, and DDPG demonstrates better effectiveness than DQN and remains valid for problems with a continuous action space.

For further illustration, the ES pricing strategies of the three methods are shown in Figure 12. For each time interval, there is a corresponding ES price. The prices under the analytic method and DDPG are quite similar for each time interval except the last time slot. Specifically, the solving error of DDPG is within an acceptable range, while that of DQN is not. Note that the ES price is relatively low in those time slots at which the total RES generation exceeds the total load demand of all prosumers; the aim is to incentivize prosumers to increase their load consumption. On the contrary, when the RES generation cannot satisfy the load demand, the VPPOC sets a relatively higher price to make prosumers reduce unnecessary load, taking economic efficiency into consideration.
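For reference, the two action-space treatments can be illustrated as below; the price bounds and the number of discrete levels are assumptions for illustration, not the values used in the paper:

import numpy as np

PRICE_LO, PRICE_HI, N_LEVELS = 0.2, 2.0, 11

# DQN acts on a finite set of nodal prices indexed by an integer action.
dqn_action_space = np.linspace(PRICE_LO, PRICE_HI, N_LEVELS)

def dqn_price(action_index):
    return dqn_action_space[action_index]

# DDPG outputs a continuous value that only needs clipping to the price bounds of (2).
def ddpg_price(raw_action):
    return float(np.clip(raw_action, PRICE_LO, PRICE_HI))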

Results of Enrolling in ESM
To verify the effectiveness of our proposed approach, a further case study is presented in this section. The prosumers are flexible, and their load consumption responds to the price signal. In reality, there is a limit to the elasticity of the load, shown as the dark area in Figure 13, and the load variation is constrained within this elastic interval. The load profiles of the prosumers before and after participating in the ESM as a VPP are compared in Figure 13. The simulation results show that the proposed ESM can shift the peak load to other time slots. Upon receipt of the ES price from the VPPOC, the prosumers adjust their load demand and feed the result back in real time. Comparing Figures 12 and 13, during the peak load interval (i.e., time slots 8-11), the RES generation cannot satisfy the prosumers' load demand; due to the higher ES price, prosumers tend to reduce their load demand. Conversely, prosumers increase their power consumption appropriately during periods of energy surplus, when the ES price is correspondingly low. Therefore, executing the proposed ES mechanism brings greater economic benefit and improves energy efficiency.
The net loads of the different prosumers are shown in Figure 14. Positive net loads indicate that prosumers have increased their energy consumption and require extra energy through the ESM; conversely, negative net loads indicate that prosumers have surplus energy to share. Note that, despite the diverse load profiles, the ESM can be implemented successfully. The exporting prosumers can share their surplus energy with neighboring prosumers, and they can also receive shared energy when they need it, which greatly benefits energy efficiency. The gaps between supply and demand before and after participating in the ESM are compared in Figure 15. Apparently, the guidance of the price signal facilitates the prosumers' interactions with each other, and supply and demand can be matched locally. Thus, the gap narrows considerably compared to the case without the ES mechanism, which further proves the effectiveness of the proposed method. Finally, for each prosumer, the load fluctuation range is reduced, as shown in Figure 16. This demonstrates that the load profile is smoother after enrolling in the ESM, which is important for improving system reliability and safety and can be seen as a win-win mechanism for both the VPP and the prosumers.

Conclusions
In this paper, we introduce a novel ES simulation model to facilitate the local consumption of DER. VPPOC and prosumers enroll in the ESM under the leader-follower game framework based on Stackelberg theory, wherein the NE of the game can be solved based on the DDPG algorithm. Our model allows VPPOC to derive the optimal ES price without knowing the model information of prosumers; instead, the strategic policy is learned successfully by the agent through dynamic interaction with prosumers. The experimental results illustrate that the proposed ES framework can promote the local consumption of DER, reduce the energy cost for prosumers, balance energy supply and demand within VPP, and improve the stability of the power system.
Since the proposed agent-based DDPG method shows good convergence in general and has a superior ability to solve problems with a continuous action space, it can be widely used for decision-making problems. Traditionally, the NE needed to be solved based on complete game model information. In this work, the proposed agent-based method is applied to solve for the NE under circumstances of incomplete information, which is a supplement to the game theoretic field.
Future work should focus on two directions. The first is to apply the proposed ESM to integrated energy systems with multiple energy carriers. The second is to apply the DDPG-based agent to demand response management in a multi-agent game setting.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.